Predicting Stocks in an Unpredictable World

Mikemoschitto
6 min read · Dec 20, 2020


How is sentiment-based volatility affecting the US stock market?

As student programmers trading during COVID-19, our team — Michael Moschitto, Dylan Mooers, Kyle Thompson, and Justin Feinfeld — set out to answer the question: how is sentiment-based volatility affecting the US stock market? We hypothesized that COVID-19-related news would have a destabilizing effect on the stock market, so we combined market sentiment with quantitative price analysis to predict company prices.

Once a prediction was established for both areas (quantitative and sentiment), we converted these predictions into a buy or sell signal for a given stock. The results were measured using F1 scores, charts of sentiment over time, and the amount of money made or lost.

The quantitative branch began with a wrapper for Twelve Data, an online purveyor of stock metrics, to gather quantitative data. An integral part of this data collection was the Moving Average Convergence Divergence (MACD) calculation — in short, the difference between the 12-day and 26-day exponential moving averages. An exponential moving average is a form of moving average that places increased weight on the more recent data points.
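The MACD line described above can be computed in a few lines with pandas. This is a minimal sketch, not the project's actual code; the function name and toy price series are ours:

```python
import pandas as pd

def macd(close: pd.Series, fast: int = 12, slow: int = 26) -> pd.Series:
    # MACD line: fast (12-day) EMA minus slow (26-day) EMA of closing prices
    fast_ema = close.ewm(span=fast, adjust=False).mean()
    slow_ema = close.ewm(span=slow, adjust=False).mean()
    return fast_ema - slow_ema

prices = pd.Series([100.0 + 0.5 * i for i in range(60)])  # toy steady uptrend
line = macd(prices)
print(line.iloc[-1] > 0)  # True: MACD stays positive in a sustained uptrend
```

Because the fast EMA reacts to recent prices more quickly than the slow one, a positive MACD suggests upward momentum and a negative MACD suggests downward momentum.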

The MACD, along with a time series of percent change and volume, was fed to a custom neural network with two input branches whose outputs were concatenated and passed through a final pooling (dense) layer. The first input branch was a Long Short-Term Memory network (LSTM), and the second was a standard neural network with one hidden layer.

from tensorflow.keras.layers import Input, LSTM, Dense, Dropout, Activation, concatenate
from tensorflow.keras.models import Model

# input placeholders (shape values here are illustrative)
lstmInput = Input(shape=(histPoints, numFeatures), name='lstm_input')
denseInput = Input(shape=(numIndicators,), name='tech_input')

# LSTM branch over the price/volume time series
x = LSTM(histPoints, name='lstm_0')(lstmInput)
x = Dropout(0.2, name='lstm_dropout_0')(x)
lstmBranch = Model(inputs=lstmInput, outputs=x)

# dense branch over the technical indicators
y = Dense(20, name='tech_dense_0')(denseInput)
y = Activation('tanh', name='tech_relu_0')(y)
y = Dropout(0.2, name='tech_dropout_0')(y)
techIndicatorsBranch = Model(inputs=denseInput, outputs=y)

combined = concatenate(
    [lstmBranch.output, techIndicatorsBranch.output],
    name='concatenate'
)
z = Dense(64, activation='sigmoid', name='dense_pooling')(combined)
z = Dense(1, activation='linear', name='dense_out')(z)

LSTMs are a popular choice for predicting stock information because they can analyze a sequence of data segments and use the output from each segment to influence the next — a crucial property for our project, as past stock prices certainly affect future ones. We were able to predict future prices with a best F1 score of 0.71, much better than the coin flip exhibited by a traditional sequential network.

Unsurprisingly, the sentiment prediction started in much the same vein as the quantitative one. First, the Google News API was leveraged to collect a given number of recent articles.

# Basic example of using the Google News API to produce a dataframe
# containing a date, description, image, media, and title for each article
from GoogleNews import GoogleNews
import pandas as pd
googlenews = GoogleNews(start='01/01/2020', end='01/08/2020')
googlenews.search('APPLE')
result = googlenews.result()
df = pd.DataFrame(result)
df
Dataframe produced from one call to the Google News API.

We were only interested in the title and description, and thus passed only those two columns to the sentiment model.

These articles were first filtered to include only those relevant to the specific company. They were then cleaned of punctuation and other characters before being passed to the sentiment model for a prediction between 0 and 1, where 0 was the most negative and 1 the most positive.
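The cleaning step can be sketched roughly as follows; the function name and example headline are ours, not from the project code:

```python
import re
import string

def clean_headline(text: str) -> str:
    # lowercase, replace punctuation with spaces, collapse whitespace
    text = text.lower()
    text = re.sub(f"[{re.escape(string.punctuation)}]", " ", text)
    return " ".join(text.split())

print(clean_headline("Apple's Q4 earnings beat estimates, stock up 3%!"))
# apple s q4 earnings beat estimates stock up 3
```

Stripping punctuation this way keeps the vocabulary small and consistent, which matters for the embedding layer described next.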

The sentiment prediction model was a convolutional neural network: an initial embedding layer, followed by a convolutional layer with 128 filters and a kernel size of 5. The output of the convolutional layer is then pooled with a size of 2 and flattened for two dense layers, the first using a rectified linear activation and the second a sigmoid activation to scale values between 0 and 1.

The code used to create the sentiment model is shown below.

self.model = Sequential()
self.model.add(Embedding(vocab_size, 100, input_length=self.max_length))
self.model.add(Conv1D(filters=128, kernel_size=5, activation='relu'))
self.model.add(MaxPooling1D(pool_size=2))
self.model.add(Flatten())
self.model.add(Dense(16, activation='relu'))
self.model.add(Dense(1, activation='sigmoid'))

The model was trained on a data set of 1,967 financial headlines labeled by sentiment, split 85%/15% for training and testing respectively. The result was a model that achieved an accuracy of 81% and an F1 score of 0.875 when evaluated on the testing set.
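The 85/15 split and the F1 metric we report can be sketched in plain Python. The headlines and labels below are stand-ins for illustration, not the actual labeled data set:

```python
import random

# stand-ins for the 1,967 labeled headlines (1 = positive, 0 = negative)
pairs = [(f"headline {i}", i % 2) for i in range(1967)]
random.Random(42).shuffle(pairs)

cut = int(len(pairs) * 0.85)          # 85% train / 15% test
train, test = pairs[:cut], pairs[cut:]
print(len(train), len(test))          # 1671 296

def f1(tp: int, fp: int, fn: int) -> float:
    # F1 is the harmonic mean of precision and recall
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)
```

F1 is a better yardstick than raw accuracy here because a model that always predicted the majority sentiment class could still score a deceptively high accuracy.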

The final remaining task was to bring both parts together and start trading! This was accomplished using a trading bot built on the Alpaca library. A buy or sell was initiated from the predicted change in price while the number of shares was a product of the sentiment prediction. Our best result so far has been buying before a 16% jump in Nikola stock and then selling the next morning to avoid a 20% plunge!
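The signal logic above can be sketched roughly as follows. The function and parameter names are hypothetical, and the real bot places orders through Alpaca rather than returning tuples:

```python
def trade_decision(pred_change: float, sentiment: float, max_shares: int = 100):
    # direction comes from the predicted price change; position size is
    # scaled by the 0-to-1 sentiment score (illustrative sizing rule only)
    side = "buy" if pred_change > 0 else "sell"
    qty = max(1, round(max_shares * sentiment))
    return side, qty

print(trade_decision(0.04, 0.8))   # ('buy', 80)
print(trade_decision(-0.02, 0.3))  # ('sell', 30)
```

Tying the share count to sentiment means the bot commits more capital when the news flow agrees strongly with the quantitative signal, and hedges its bets when sentiment is lukewarm.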

As with any data science project, we incorporated visuals to aid our understanding of how our respective models were doing. The most effective way to display the quantitative data was to graph our predicted price against the actual price.

Predicted vs Actual NFLX

To understand our sentiment predictions, sentiment was plotted over the last 30 days and compared with the list of headlines.

Our predictions of how readers felt about NFLX, MSFT, and AAPL from 10/19 to 11/12.

We also tracked our trading history by plotting our buy/sell points on top of the stock’s closing price. (At the time of writing we had not collected enough trading data to warrant including this chart; however, an updated copy can be found on our GitHub.)

In addition to using visuals to evaluate the success of our models, we used them to optimize hyperparameters. Specifically, we plotted the F1 score of our quantitative model against the number of epochs used to train it. The peak of these plots represented the optimal number of training epochs.

F1 Score vs Epochs for MSFT

Finally, we evaluated how our quantitative model performed across different securities with a bar chart representing the F1 score for each company.

Quantitative model performance across each company tracked.

At the time of writing, our trading bot has been mildly successful, with our biggest victory being trading Nikola. However, we plan to keep the trading bot running, adjusting our models and methods to hopefully make money!

Closing Thoughts: Throughout this project, we familiarized ourselves with the steps of answering a data science problem! We gathered both quantitative and qualitative data through web scraping and APIs, cleaned that data, created multiple machine learning models, and visualized our results, all things with which we had no prior experience.

We were also moderately successful from a financial standpoint, as our project would have been profitable had we spent actual money! However, it is important to remember that we were trading during a rebound from one of the sharpest drops in market history and that the Dow was on its way to a record 30,199 points as of December 14th. Had we traded during a more standard period for the market, our results may have differed.

Our group would like to give special thanks to Cal Poly Professor Stanchev for his generous mentorship and willingness to continue advising this work as a senior project.

See our GitHub for the full source code: https://github.com/d-mooers/SentimentalTrader
