Forecasting ETFs with Machine Learning Algorithms, 2017


Liew, Jim Kyung-Soo, and Boris Mayster. “Forecasting ETFs with Machine Learning Algorithms.” (2017).



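Title: Forecasting ETFs with Machine Learning Algorithms


Authors: Jim Kyung-Soo Liew and Boris Mayster

Translator: 笪洁琼

Proofreader: 吴谣

Editor: 小满

Summary: Against the backdrop of the efficient market hypothesis, this article uses three machine learning algorithms, deep neural networks (DNN), random forests (RF), and support vector machines (SVM), to predict the direction of future price movements for ETFs such as SPY (S&P 500), TIP (U.S. inflation-protected bonds), and FXE (long euro).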


Machine learning and artificial intelligence (AI) algorithms have come home to roost. These algorithms, whether we like it or not, will continue to permeate our daily lives. Nowhere is this more evident than in their current uses in self-driving cars, spam filters, movie recommendation systems, credit fraud detection, geo-fencing marketing campaigns, and so forth. The usage of these algorithms will only expand and deepen going forward. Recently, Stephen Hawking issued a forewarning: “The automation of factories has already decimated jobs in traditional manufacturing, and the rise of AI is likely to extend this job destruction deep into the middle classes” (Price [2016]). Whether we agree or disagree with the virtues of automation, the only way to better utilize its potential and evade its dangers is to gain a deeper knowledge and appreciation of these algorithms. Moreover, quite nearly upon us is the next big wave called the Internet of Things (IoT), whereby increasingly more devices and common household items will be interconnected and stream terabytes of data. As our society is deluged with data, the critical question that emerges is whether machine learning algorithms contribute a net benefit to or extract a net cost from society.


While the future looms large for machine learning and AI, one pocket of their development appears to have been deliberately left behind, namely, in finance and more so in hedge funds that attempt to predict asset prices to generate alpha for their clients. The reason is clear: One trader’s gain in applying a well-traded learning algorithm is another’s loss. This edge becomes a closely guarded secret, in many cases, defining the hedge fund’s secret sauce. In this work, we investigate the benefits of applying machine learning algorithms to this corner of the financial industry, which academic researchers have left unexamined.


We are essentially interested in understanding whether machine learning algorithms can be applied to predicting financial assets. More specifically, the goal is to program an AI to throw off profits by learning and besting other traders and, possibly, other machines. This goal is rumored to have already been achieved by a hedge fund, Renaissance Technologies’ Medallion fund. The Medallion fund, cloaked in mystery to all but a few insiders, has generated amazing performance over an extended time period. Renaissance Technologies clearly has access to enough brain power to be on the cutting edge of any machine learning implementation, and as much as others have tried to replicate its success, Medallion’s secret sauce recipe has yet to be cracked. In this article, we explore several potential research paths in an attempt to shed light on how machine learning algorithms could be employed to trade financial market assets. We employ machine learning algorithms to build and test models that predict the direction of asset prices for several well-known and highly liquid exchange-traded funds (ETFs).


Markets are efficient or, at a minimum, semi-strong form efficient. All publicly available information is reflected in stock prices, and the pricing mechanism is extremely quick and efficient in processing new information sets. Attempting to gain an edge is nearly impossible, especially when one tries to process widely accessible public information. Investors are therefore better off holding a well-diversified portfolio of stocks. Clearly, Fama would side with those who have little faith in the ability of these machine learning algorithms to process known publicly available information and, with such information, gain an edge by trading.


Many researchers have documented evidence that asset prices are predictable. Jegadeesh and Titman [1993] and Rouwenhorst [1998] showed that past prices help predict returns. Fama and French [1992, 1993, 1995] showed that the fundamental factors of book-to-market and size affect returns. Liew and Vassalou [2000] documented the predictability of fundamental factors and linked these factors to future real gross domestic product growth. Keim [1983] documented seasonality in returns, with more pronounced performance in January. For more recent evidence on international predictability, see Asness, Moskowitz, and Pedersen [2013]. Whether the predictability stems from suboptimal behavior along the reasoning of Lakonishok, Shleifer, and Vishny [1994]; limits to arbitrage by Shleifer and Vishny [1997]; or some unidentified risk-based explanation by Fama and French [1992, 1993], it nonetheless appears that predictability exists in markets.


We employ the most advanced machine learning algorithms, namely, deep neural networks (DNNs), random forests (RFs), and support vector machines (SVMs). Our results are generally similar across the algorithms employed, with a slight advantage for RFs and SVMs over DNNs. We report results for the three distinct algorithms and are interested in predicting price changes in 10 ETFs. These ETFs were chosen for their popularity as well as their liquidity, and their historical data were sourced from Yahoo Finance. Because we are interested in predicting the change in prices over varying future periods, we employ daily data. The horizons that we attempt to predict range from trading days to weeks and months.



We test several information sets to determine which sets are most important in predicting across differing horizons. Our information sets are based on (A) prior prices, (B) prior volume, (C) dummies for days of the week and months of the year, and (ABC) all our information sets combined. We find that (B) volume is very important for predicting across the 20- to 60-day horizon. Additionally, we document that each feature has very low predictability, so we recommend that model builders use a wide range of features guided by financial experience and intuition. Our methodology was constructed to be robust and allow for easy switching and testing of different information set specifications and securities.




The next section describes the procedures we employed and the assumptions made in this work, with a focus on applying best practices when applicable. Afterward, we discuss the machine learning algorithms employed. We then move into the details of our methodology and present our main results. Finally, we present our thoughts on implementation, weaknesses in our approach, implications, and conclusions.




Machine learning algorithms are extremely powerful, and most can easily overfit any dataset. In machine learning parlance, overfitting is known as high variance. An overly complex model fitted to the training set does not perform well out of sample, that is, on the test set. Prior to the introduction of cross-validation, researchers would rely on their best judgment as to whether a model was overfit. Currently, the best practices for building a machine learning predictive model are based on the holdout cross-validation procedure.



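  • Note [1]: With the holdout method, we split the initial dataset into two parts, a training dataset and a test dataset. The training set is used to train the model, and the test set is used to evaluate its performance.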

We therefore employ the holdout cross-validation procedure in our article. We split the data into three components: the training set, the validation set, and the test set. Best practices state that we should construct the model on the training set and validation set only and use the test set once. Repeatedly using the training set and validation set on that part of the data that does not contain the test set is known as holdout cross-validation. Although cross-validation is readily employed in many other fields, some criticize its use in financial time-series data. Nonetheless, we believe our results are still interesting because we are investigating the foundational question of whether using machine learning algorithms works in modeling changes in prices.


The Irksome Stability Assumption

Arguably the most famous cross-sectional relationship is the capital asset pricing model (CAPM), which states that the expected return on any security is equal to the risk-free rate of return plus a risk premium. The CAPM’s risk premium is defined as the security’s beta multiplied by the excess return on the market. The observability of the market portfolio has been challenged by Roll’s [1976] critique; however, generally speaking, the theoretical CAPM has become a mainstay in academia as well as in practice. To estimate beta, students are taught to run a time-series linear regression of the excess return of a given security on the excess return on the market. The covariance of a security and the market provides the fundamental building block on which the CAPM has been constructed.
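In symbols, the CAPM relationship and the beta regression described above can be written as follows. The notation ($R_i$, $R_f$, $R_m$, $\alpha_i$, $\beta_i$, $\varepsilon_{i,t}$) is assumed here for illustration and is not taken from the article.

```latex
\begin{align}
E[R_i] &= R_f + \beta_i \left( E[R_m] - R_f \right), \\
R_{i,t} - R_{f,t} &= \alpha_i + \beta_i \left( R_{m,t} - R_{f,t} \right) + \varepsilon_{i,t}, \\
\beta_i &= \frac{\operatorname{Cov}(R_i, R_m)}{\operatorname{Var}(R_m)}.
\end{align}
```

The first line is the pricing relation, the second is the time-series regression used to estimate beta, and the third is the resulting slope when the risk-free rate is treated as constant.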



In this work, however, breaking from some finance tradition, we view predictability disregarding the time-series structure. We make the irksome stability assumption that there is a stable relationship between predicting price changes and the many features employed across our information sets. That is, like modeling credit card fraud and email spam, we assume that the relationship between the response and features is independent of time. For example, we allow our algorithms to capture the relationship that maps the feature input matrix (X) to the output responses (y). With that said, the features are always known prior to the change in prices.



To summarize, we incorporate the current best practices of balancing the overfitting (or high-variance) problem with the underfitting (or high-bias) problem. This is accomplished by separating the training sample and performing k-fold cross-validation on the training sample and employing the test sample only once. In this work, we adhere to this best practice when applicable.


We attempt to use machine learning algorithms to answer the following questions:

  1. What is the optimal prediction horizon for ETFs?
  2. What are the best information sets for such prediction horizons?



Because we have a dependent variable (y) as future price movements, either up or down, in this work we are dealing with a supervised learning problem. The true value of the dependent variable is known a priori. We can also test the accuracy of our forecasts. Accuracy is measured by the percentage of times the model predicts correctly over the total number of predictions. Although we could choose from a vast number of algorithms, we restrict our analysis to the following three powerful and popular algorithms: DNNs, RFs, and SVMs.
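The accuracy measure described above can be sketched in a few lines of plain Python. The function name `accuracy` and the toy labels are assumptions made for illustration only.

```python
def accuracy(y_true, y_pred):
    """Share of predictions that match the true up/down label."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    hits = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return hits / len(y_true)


# Toy labels (not data from the article): 1 = price up, 0 = price down.
y_true = [1, 0, 1, 1, 0]
y_pred = [1, 0, 0, 1, 1]
score = accuracy(y_true, y_pred)  # 3 correct predictions out of 5
```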



DNNs are defined by neural networks with more than one hidden layer. Neural networks are composed of perceptrons, first introduced by Rosenblatt [1957], who built on the prior work of McCulloch and Pitts [1943]. McCulloch and Pitts introduced the first concept of a simplified brain cell: the McCulloch–Pitts neuron. Widrow and Hoff [1960] improved upon Rosenblatt’s [1957] work by introducing a linear activation function, thus allowing the solutions to be cast in the minimization of the cost function. The cost function is defined as the sum of squared errors, with errors in the context of a supervised machine learning algorithm defined as the predicted or hypothesized value minus the true value. The advantage of this setting is that changing only the activation function yields different techniques. Setting the activation function to either the logistic or hyperbolic tangent allows us to arrive at the multilayered neural network. If the networks have more than one hidden layer, we arrive at our deep artificial neural network (see Raschka [2015]).



The parameters or weights in our DNN setting are determined by gradient descent. The process consists of initializing the weights across the neural network to small random numbers and then forward propagating the inputs throughout the network. At each node, the weights and input data are multiplied and aggregated, then sent through the prespecified activation function. The outputs of each layer are employed as input into the next layer, and the process is repeated. Once the inputs have been forward propagated throughout the network and the output errors computed, backward propagation is employed to adjust the weights. Weights are adjusted until some maximum number of iterations has been met or some minimum limit of error has been achieved. It should be noted that the reemergence of neural networks can be attributed to the contribution of backward propagation, which allowed for a much quicker convergence to the optimal parameters. Without backward propagation, this technique would have taken too long for convergence and would remain much less popular.
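The forward pass described above (weighted sum at each node, then an activation, layer by layer) can be illustrated with a minimal sketch. The network size, the weight range, and the helper names `init_layer` and `forward` are assumptions for illustration; backward propagation is deliberately left out.

```python
import math
import random


def logistic(z):
    """Logistic sigmoid activation."""
    return 1.0 / (1.0 + math.exp(-z))


def init_layer(n_in, n_out, rng):
    """Small random weights; each row is one node, last entry is its bias."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(n_in + 1)] for _ in range(n_out)]


def forward(x, layers):
    """Feed x through every layer: weighted sum at each node, then activation."""
    activation = list(x)
    for layer in layers:
        activation = [
            logistic(sum(w * a for w, a in zip(node[:-1], activation)) + node[-1])
            for node in layer
        ]
    return activation


rng = random.Random(0)
# Hypothetical tiny network: 3 inputs -> 4 -> 4 -> 1 output (two hidden layers).
layers = [init_layer(3, 4, rng), init_layer(4, 4, rng), init_layer(4, 1, rng)]
prob_up = forward([0.2, -0.1, 0.05], layers)[0]
```

The single output can be read as a score between 0 and 1 for the "up" class; training would adjust `layers` to reduce the error of that score.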


In our analysis, we employ a DNN algorithm (i.e., a neural network with more than one hidden layer). Between the input and output layers are the hidden layers. We employ two- and three-hidden-layer neural networks and thus a DNN in this work. Recently, a deep learning neural network beat the best human champion in the game of Go, showing that this algorithm can be employed in surprising ways.




RFs, introduced by Breiman [2001], have become extremely popular as a machine learning algorithm. Much of this popularity stems from their quick speed and ease of use. Unlike the DNN and SVM, the RF classifier does not require any standardization or normalization of input features in the preprocessing stage. By taking the raw data of features and responses and specifying the number of trees in the forest, RF will return a model quickly and often outperforms even the most sophisticated algorithms.



Decision trees can easily overfit the data, a problem that many have tried to overcome by making the decision tree more robust. Limiting the depth of the tree and number of leaves in the terminal nodes are some methods that have been employed in an attempt to reduce the high variance problem. RFs take a very different approach to gaining robustness in resultant predictions. Given that decision trees can easily overfit the data, RFs attempt to reduce such overfitting along two dimensions. The first is bootstrapping with replacement of the row samples used in a given decision tree. The second is a subset of features that are randomly sampled without replacement at each node split, with the objective of maximizing the information gain for this subsample of features. Parent and child nodes are examined and features are chosen that provide for the lowest impurity of the child node. The more homogeneous the elements within the child split, the better the branch is at separating the data.
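The impurity and information-gain ideas used at each node split can be sketched as follows. This is an illustrative stand-alone computation with assumed function names, not the article's implementation.

```python
import math
from collections import Counter


def gini(labels):
    """Gini impurity: 0 for a pure node, 0.5 for a balanced binary node."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def entropy(labels):
    """Shannon entropy in bits."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def information_gain(parent, left, right, impurity=gini):
    """Parent impurity minus the size-weighted impurity of the two children."""
    n = len(parent)
    weighted = (len(left) / n) * impurity(left) + (len(right) / n) * impurity(right)
    return impurity(parent) - weighted


parent = [1, 1, 1, 0, 0, 0]
pure_gain = information_gain(parent, [1, 1, 1], [0, 0, 0])   # perfectly separating split
mixed_gain = information_gain(parent, [1, 1, 0], [1, 0, 0])  # weakly separating split
```

A split that yields more homogeneous children, as in the first call, produces the larger gain and would be preferred.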


Many statisticians were irked by RFs when they were initially introduced because they consider only a limited subset of the features at a time. At that time, the model-building intuition was to employ as much data as possible and to avoid limiting the feature space. By limiting the feature space, each tree has slightly different variations, and thus the average across the many trees, also known as bagging the trees within the forest, provides for a very robust prediction that easily incorporates the complexity in the data. RFs will continue to gain even more appeal with the added benefit of allowing researchers to see the features that are most important for a given RF prediction model. We will present a list of the most important features per ETF later in this work, and our results show the complexity of predicting across our ETF asset classes.





SVMs, introduced by Vapnik [1995], attempt to separate the data by finding support vectors that provide for the largest separation between groups, that is, that maximize the margin. Margin is defined as the distance between the supporting hyperplanes.

One of the main advantages of this approach is that SVMs generate separations that are less influenced by outliers and potentially more robust vis-a-vis alternative classifiers. Additionally, SVMs allow for the option to apply the radial basis function, which allows for nonlinear separation by leveraging the kernel trick. The kernel trick casts the data into a higher dimension. In this higher dimension, the data can be separated linearly; projected back down into the original dimensional space, that separation becomes nonlinear.
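As a small illustration of the radial basis function mentioned above, the kernel value between two points can be computed directly. The parameter name `gamma` and its default are assumptions chosen for this sketch.

```python
import math


def rbf_kernel(x, z, gamma=1.0):
    """Radial basis function kernel: exp(-gamma * squared Euclidean distance)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-gamma * sq_dist)


same = rbf_kernel([1.0, 2.0], [1.0, 2.0])   # identical points
near = rbf_kernel([1.0, 2.0], [1.1, 2.0])   # close points
far = rbf_kernel([1.0, 2.0], [4.0, 6.0])    # distant points
```

The kernel behaves like a similarity score: it equals 1 for identical points and decays toward 0 as the points move apart, which is what lets the classifier draw a nonlinear boundary without building the higher-dimensional features explicitly.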





As mentioned earlier, we have chosen widely used and liquid ETFs from various asset classes. The cross-section of ETFs allows us to include cross-asset correlation to boost predictive power. Presumably, investors make their decisions depending on their risk preferences as well as the ability to hold a well-diversified portfolio of assets. Although our list of ETFs is not exhaustive, it does represent well-known ETFs with which most practitioners and registered investment advisors should be well acquainted. The list of ETFs is as follows.





ETF Opportunity Set

  • SPY: SPDR S&P 500; U.S. equities large cap
  • IWM: iShares Russell 2000; U.S. equities small cap
  • EEM: iShares MSCI Emerging Markets; Global emerging markets equities
  • TLT: iShares 20+ Years; U.S. Treasury bonds
  • LQD: iShares iBoxx $ Invst Grade Crp Bond; U.S. liquid investment-grade corporate bonds
  • TIP: iShares TIPS Bond; U.S. Treasury inflation-protected securities
  • IYR: iShares U.S. Real Estate; Real estate
  • GLD: SPDR Gold Shares; Gold
  • OIH: VanEck Vectors Oil Services ETF; Oil
  • FXE: CurrencyShares Euro ETF; Euro



We test the predictability of ETF returns on a set of varying horizons. Although it is common knowledge that stock prices and thus ETF prices follow a random walk on the shorter horizons, thus making shorter-term predictability very difficult, longer horizons may be driven by asset class linkages and attention. Asset classes ebb and flow in and out of investors’ favor. With this intuition, we attempt to predict the direction of price moves, not the magnitude. Thus, we cast our research into a supervised classification problem. The returns are calculated by employing adjusted closing prices (adjusted for stock splits and dividends) for the given time periods as measured in trading days (1, 2, 3, 5, 10, 20, 40, 60, 120, and 250 days), using the following formula:
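One conventional way to write an n-day simple return on adjusted closing prices is shown below. It is offered as an assumption consistent with the description, not as the article's own notation; $P_t$ denotes the adjusted closing price on trading day $t$ and $n$ the horizon.

```latex
r_{t,n} = \frac{P_t - P_{t-n}}{P_{t-n}}, \qquad n \in \{1, 2, 3, 5, 10, 20, 40, 60, 120, 250\}
```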



For each horizon of n days and each ETF, we examine four dataset combinations as explanatory information sets. We employ the term information set as the set of features based on the following explicit definitions. Note that, for any given asset’s change in price, we allow for its own information as well as the other ETFs’ information to influence the sign of the price change over the given horizon. We define our four information sets A, B, C, and ABC as follows:


  • Information set A: previous n days return and j lagged n days return, where j is equivalent to the previous horizon (i.e., for a 20-day horizon, the number of lagged returns will be 10) for all ETFs:


  • Information set B: average volume for n days and j lagged average volume for n days, where j is equivalent to the previous horizon for all ETFs:


  • Information set C: day of the week and month dummy variables.

  • Information set ABC: A, B, and C combined.

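The information sets above can be sketched for a single ETF as follows. Several details are assumptions made for illustration: lag k is taken to mean the same n-day quantity measured k trading days earlier, the shortest horizon is given zero lags, calendar dummies cover five weekdays and twelve months, and all function names are hypothetical.

```python
from datetime import date

HORIZONS = [1, 2, 3, 5, 10, 20, 40, 60, 120, 250]


def lag_count(n):
    """Number of lags j for horizon n: the previous horizon in the list."""
    i = HORIZONS.index(n)
    return HORIZONS[i - 1] if i > 0 else 0  # assumption: no lags for n = 1


def n_day_return(prices, t, n):
    """Simple return over the n trading days ending at index t."""
    return prices[t] / prices[t - n] - 1.0


def info_set_a(prices, t, n):
    """Previous n-day return ending at t, plus j lagged n-day returns."""
    j = lag_count(n)
    if t - j - n < 0:
        raise ValueError("not enough price history")
    return [n_day_return(prices, t - k, n) for k in range(j + 1)]


def info_set_b(volumes, t, n):
    """Average volume over the n days ending at t, plus j lagged averages."""
    j = lag_count(n)
    if t - j - n + 1 < 0:
        raise ValueError("not enough volume history")
    return [sum(volumes[t - k - n + 1 : t - k + 1]) / n for k in range(j + 1)]


def info_set_c(d):
    """Day-of-week (Mon-Fri) and month dummy variables for calendar date d."""
    dow = [1 if d.weekday() == k else 0 for k in range(5)]
    month = [1 if d.month == m else 0 for m in range(1, 13)]
    return dow + month


# Toy series (not market data), used only to show the shapes involved.
prices = [100.0 + i for i in range(40)]
volumes = [1000.0] * 40
a = info_set_a(prices, 20, 5)
b = info_set_b(volumes, 20, 5)
c = info_set_c(date(2017, 1, 2))
abc = a + b + c  # information set ABC for this one ETF
```

In the article's setup the features of all ETFs are used together, so a full feature row would concatenate such vectors across the ten tickers.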

We concentrate our presentation of results on information set ABC, but the other information sets provide insight on the drivers of the predictions across our three algorithms. A priori, we believe that past returns will be the most beneficial in terms of future return predictions, and volume will be useful to boost the results. Many have shown volume to capture the notion of investors’ attention. Higher volumes are typically associated with more trading activity. If trading releases information, then those ETFs with a higher volume of trading should be adjusting more quickly to their true values. Note that we implicitly assume the dollar volume of trading is approximately equal across ETFs and concern our study with share volume. Clearly, the ETF prices are not equal at any given time. However, the intuition is clear: More relative trading volume is an important feature for predictability across ETFs. We would suspect that volume and prior returns should work in tandem. However, we find that volume in isolation works very well; that is, B works well even without A—a rather surprising result.


Dummy variables are assumed to boost the performance of algorithms on shorter time periods and to be insignificant on longer horizons. It is important to note that A and ABC datasets are equivalent in number of observations, whereas B and C are not equivalent to A and have one less observation. This occurs because we need n+1 days of adjusted closing prices to compute returns. We only need n days, however, to compute average volume. We are employing the daily volume for that day.

The dependent variable is defined as 1 if n days’ return is equal to or greater than 0 and 0 otherwise:
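The labeling rule stated above can be sketched in code. The helper names and the toy price series are assumptions, and the return is taken to be a simple forward return on adjusted closes.

```python
def forward_return(prices, t, n):
    """Simple return from index t to index t + n."""
    return prices[t + n] / prices[t] - 1.0


def label(prices, t, n):
    """1 if the n-day forward return is greater than or equal to 0, else 0."""
    return 1 if forward_return(prices, t, n) >= 0 else 0


# Toy adjusted closes (not market data).
adj_close = [100.0, 101.0, 99.0, 99.0, 102.0, 98.0]
labels_1d = [label(adj_close, t, 1) for t in range(len(adj_close) - 1)]
```

Note that a flat price (a return of exactly zero) is assigned to the "up" class under this rule.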




Next, we employ cross-validation. We divide datasets and corresponding dependent variables into training and test sets. Division is done randomly in the following proportion: 70% training set and 30% test set. The training set will be used to train the model and the test set to estimate the predictive power of the algorithms. Following best-practice procedures, we use our training set and perform holdout cross-validation. A validation set is obtained from the training set. Once the k-fold cross-validation has been performed and the optimal hyperparameters have been selected, we employ this model only once on our test set. Many textbooks recommend 10-fold cross-validation in the training set; however, we used only threefold cross-validation, given that larger cross-validation tests would take an even longer time to generate results. We exhaustively searched for the best hyperparameters for each of our algorithms. The possible values for hyperparameters for each algorithm are as follows.
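The 70/30 random split and the threefold cross-validation on the training portion can be sketched with index lists. The function names, the seed, and the way folds are cut are assumptions for illustration.

```python
import random


def train_test_split_idx(n_obs, test_share=0.30, seed=0):
    """Randomly assign observation indices to a training and a test set."""
    idx = list(range(n_obs))
    random.Random(seed).shuffle(idx)
    n_test = int(round(n_obs * test_share))
    return idx[n_test:], idx[:n_test]


def k_fold(train_idx, k=3):
    """Yield (fit_idx, validation_idx) pairs; each index validates exactly once."""
    folds = [train_idx[i::k] for i in range(k)]
    for i in range(k):
        fit = [j for m, fold in enumerate(folds) if m != i for j in fold]
        yield fit, folds[i]


train_idx, test_idx = train_test_split_idx(1000)
splits = list(k_fold(train_idx, k=3))
```

Hyperparameters would be chosen using only `splits`; `test_idx` is touched once, at the very end, to score the selected model.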




Hyperparameter Search Space

DNN

  • alpha (L2 regularization term): {0.0001, 0.001, 0.01, 0.1, 1.0, 10.0, 100.0, 1000.0}
  • activation function: {rectified linear unit function, logistic sigmoid function, hyperbolic tangent function}
  • solver for weight optimization: stochastic gradient descent
  • hidden layers: {(100, 100), (100, 100, 100)}

RF

  • number of decision trees: {100, 200, 300}
  • function to measure quality of a split: {Gini impurity, information gain}

SVM

  • C (penalty parameter of the error term): {0.0001, 0.001, 0.01, 0.1, 1.0, 10.0, 100.0, 1000.0}
  • kernel: {linear, radial basis function}

The best estimator is determined by the accuracy of all possible combinations of the hyperparameter values listed. After the best estimator is found, we use test set data to see how the algorithm performs. Scoring for test performance is based on accuracy, where accuracy is defined by how well the predicted outcome of the model compares to the actual outcome. To estimate the performance of each algorithm, we introduce our gain criteria. These criteria show whether the explanatory dataset explains the dependent variable better than randomly generated noise.
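The exhaustive search over the listed values can be sketched by enumerating every combination. The dictionary keys below are illustrative labels rather than the parameter names of any particular library, and `best_candidate` takes a hypothetical scoring function that would return cross-validated accuracy.

```python
from itertools import product

C_VALUES = [0.0001, 0.001, 0.01, 0.1, 1.0, 10.0, 100.0, 1000.0]

SEARCH_SPACE = {
    "DNN": {
        "alpha": list(C_VALUES),
        "activation": ["relu", "logistic", "tanh"],
        "solver": ["sgd"],
        "hidden_layers": [(100, 100), (100, 100, 100)],
    },
    "RF": {
        "n_trees": [100, 200, 300],
        "split_quality": ["gini", "information_gain"],
    },
    "SVM": {
        "C": list(C_VALUES),
        "kernel": ["linear", "rbf"],
    },
}


def candidates(grid):
    """Yield one dict per combination of hyperparameter values."""
    keys = list(grid)
    for values in product(*(grid[k] for k in keys)):
        yield dict(zip(keys, values))


def best_candidate(grid, score_fn):
    """Return the combination with the highest score (e.g., mean CV accuracy)."""
    return max(candidates(grid), key=score_fn)


n_candidates = {name: sum(1 for _ in candidates(g)) for name, g in SEARCH_SPACE.items()}
```

With threefold cross-validation, each candidate is fitted three times, so the DNN grid alone implies 48 × 3 model fits per ETF, horizon, and information set.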


Introduce Gain Criteria

The gain criteria are computed as the difference between the accuracy of the model given the input information set and the accuracy of the model given noise data. We define noise as random data drawn from a uniform distribution bounded by 0 and 1, which replace the original input data in the modeling process. We substituted this noise directly into the input feature data space, which preserves the shape of the actual data. We compute the gains by rerunning the same code on the noise data to obtain accuracy scores and subtracting those scores from the scores obtained by testing the best estimators with the actual data. Formally, Gain = Accuracy(model given the information set) - Accuracy(model given noise).





Using this score, it is easy to compare the performance of each algorithm and choose the one with the highest explanatory power.
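A minimal sketch of the gain computation follows. The helper names are assumptions, and the feature rows and accuracy values are made-up placeholders rather than results from the article.

```python
import random


def noise_like(features, seed=0):
    """Uniform [0, 1) noise with the same shape as the feature matrix."""
    rng = random.Random(seed)
    return [[rng.random() for _ in row] for row in features]


def gain(accuracy_actual, accuracy_noise):
    """Accuracy on the real information set minus accuracy on noise."""
    return accuracy_actual - accuracy_noise


# Placeholder feature rows (return, average volume, calendar dummy).
X = [[0.01, 1.2e6, 1], [-0.02, 0.9e6, 0]]
X_noise = noise_like(X)

# Placeholder accuracies standing in for two full model runs.
example_gain = gain(0.80, 0.55)
```

Because a model trained on noise can still score well by always predicting the majority class, subtracting its accuracy removes the part of the score that comes from drift alone.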



First, we compare the test scoring accuracy (Exhibit 1) and prediction gain criteria accuracy (Exhibit 2) of each algorithm on different horizons for the ABC dataset. In Exhibit 1, test score accuracy increases as the prediction horizon increases. We would expect such a pattern given that some of our ETFs have had a natural positive drift over the sample period examined. The gain criteria would adjust for such positive drifts. Notice that the gain criterion ranges from 0% to 35%, compared to the scoring accuracy, which ranges from 0% to 100%.



RF and SVM show close results for all horizons in terms of test set accuracy and gain criterion, whereas DNN converges to the other algorithms at 40 days or more. This may be due to the insufficient number of hidden layers or default values of other hyperparameters of the DNN classifier. Overall, we see that, even though test set accuracy on average increases with horizon length, the gain criterion peaks at 40 days and steadily falls thereafter. This finding is because for some ETFs, such as SPY, IWM, IYR, and GLD, we see a strong trend toward increases or decreases in price changes at longer horizons (see the chart for deep neural nets TLT in Exhibit A6 in the Appendix), which increases the test set accuracy score but lowers the gain criterion score.




Not surprisingly, the predictive power of algorithms increases with the forecast horizon. The result for the gain criterion was anticipated because, from pre-processing the data, examination suggested that noise level will increase with horizon, thus decreasing the gain. However, gain in predictive power for 120 and 250 days is still high, which raises the question of why such long-term predictions using only technical analysis still have good results. To understand this phenomenon, more analysis is required with the introduction of fundamental data, which is outside the scope of this article.

The next step in our analysis is to see how the algorithms performed in more detail. Using the gain criterion, we compare the performance of algorithms for each ETF from our list (see the Appendix). Generally, we see two patterns in gain behavior. The first is a peak at the 10- to 40-day level and a fall at subsequent horizons, and the second is an increase for up to 20 to 60 days and a plateau with fluctuations or a slight rising trend to 250 days. We believe the reasons for such behaviors are the same as previously described.





Examples of these two patterns, SPY and EEM respectively, are shown in Exhibits 3 and 4.

Because overall DNN performance was lower than that of the other two algorithms on 3- to 20-day horizons, it is not surprising that it shows the same pattern at the single-ETF level. Nonetheless, for some ETFs (i.e., IWM and TLT), the DNN algorithm was able to catch up to the rest at 20 days. SVM and RF show close results for all of the ETFs, but the latter seems to produce less volatile results with respect to horizon. We find horizons of 10 to 60 days to be most interesting across all ETFs. Although for some instances 3-, 5-, 120-, and 250-day periods also draw attention and deserve more rigorous analysis, our goal is to compare the algorithms’ performances and to estimate the possibility and feasibility of meaningful predictions rather than to investigate the specifics of prediction of individual ETFs.

Next, we develop a deeper understanding of the explanatory variables and their significance in terms of predictive power. For that purpose, we examine the average test scores and gain criterion for the A, B, and C datasets for each algorithm and compare them to the ABC results.







Let’s start with RF, displayed in Exhibits 5 and 6. Volume (dataset B) effectively explains all the results obtained in previous sections, which is a surprising and unexpected result. Returns (dataset A) have decent predictive power; at horizons of 40 days and longer, returns show the same performance as the volume and combined datasets. Calendar dummies (dataset C), however, seem to explain a small portion of daily returns and monthly (20 days) returns. We assume this is because dummies are for day of the week and month. Nonetheless, the predictive power of this set is negligible, and a clear contribution can only be seen at a one-day horizon.

SVM results (Exhibits 7 and 8) have the same pattern as RF. However, the overall performance and gain for the returns dataset is closer to those of the combined dataset in comparison with RF. What is more interesting, the combined dataset seems to outperform individual datasets on one- to three-day horizons. Furthermore, calendar dummy variables seem to yield better results but are still not large enough to be significant, and they do not add any predictive power to the combined dataset.




As mentioned earlier, DNNs (Exhibits 9 and 10) struggle to show competitive results on horizons less than 20 to 40 days. One- to five-day horizon predictions have effectively no predictive power. Unlike the other algorithms, DNN benefits from a combination of datasets. However, the ranking of datasets with respect to gain is the same as for the previous algorithms, as are the patterns of change in gains and test scores with varying horizons.

The results of DNN on 10- to 60-day horizons suggest that there is a possibility of improvement in algorithm predictions with combinations of datasets, which is not the case for other algorithms. Overall, we see poor ability to predict short-term returns for all algorithms. The solution to boost results might be an ensemble of algorithms, but such an analysis is beyond the scope of this article.


As mentioned earlier, we find horizons from 10 through 60 days to be most interesting in terms of predictive power. Thus, we will examine the performance of the algorithms in these time periods in more detail using the receiver operating characteristic (ROC), which will allow us to compare algorithms from a different angle. Based on ROC, we can compute another measure for algorithm comparison, the ROC area under the curve (AUC). We generated ROC curves for horizons of 10 to 60 days (see the Appendix). The results follow the same pattern as in all previous sections. For example, see the ROC graphs for EEM (Exhibits 11, 12, and 13). We also calculated the AUC for each of the selected horizons, ETFs, and algorithms (Exhibits A2 through A4 in the Appendix).
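The AUC can be computed without drawing the curve, using its interpretation as the probability that a randomly chosen "up" observation receives a higher score than a randomly chosen "down" observation. The sketch below uses that pairwise definition; the function name and toy scores are assumptions.

```python
def auc(y_true, scores):
    """ROC AUC as the share of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy scores (not model output from the article).
perfect = auc([0, 0, 1, 1], [0.1, 0.2, 0.8, 0.9])
partial = auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
```

An AUC of 0.5 corresponds to random ranking, and a value close to 1 corresponds to the near-ideal curves reported for the longer horizons.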

Longer-horizon ROC curves have an almost ideal form and AUCs close to 1, which suggests high predictive ability with high accuracy. Altogether, we can conclude that predictions for these ETFs are possible.






Feature Importance with RF

We also try to shed some light on which data drive the performance of algorithms by assessing feature importance with RFs. As previously discussed, volume is a good predictor by itself but is more powerful in combination with returns. However, it is unclear which features are actually driving the performance of the algorithms. In the case of SVM, it is only possible to interpret weights of each feature if the kernel is linear. For DNN, it is hard to explain and grasp what relationships exist within the hidden layers. For RF, however, we can compute and interpret the importance of each feature in an easy way.

We decided to examine the importance of 20-day horizon RF features for ETFs (Exhibits 14 and 15). The results show that no single feature explains most of the returns. Note that on the graph, features are sorted in descending order for each ETF. As one can see, the pattern is the same for all ETFs, meaning that almost all information in the dataset contributes and is useful for prediction. With the features’ importance structured this way, no single factor contributes more than 1.6%. However, the dataset also contains lagged variables. The questions that immediately arise are whether a group or groups of features (e.g., volume of SPY and returns on EEM) are more beneficial, and whether we can drop them.
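A minimal sketch of how per-feature importances can be read from a fitted random forest is given below. It assumes scikit-learn's `RandomForestClassifier` and its `feature_importances_` attribute; the feature names, the synthetic data, and the hyperparameters are hypothetical and do not reproduce the paper's dataset or settings.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
n_samples = 500

# Hypothetical feature names standing in for lagged returns and volumes.
feature_names = ["SPY_ret_lag1", "SPY_vol_lag1", "EEM_ret_lag1", "EEM_vol_lag1"]
X = rng.normal(size=(n_samples, len(feature_names)))

# Synthetic up/down label that depends on two of the features plus noise.
noise = rng.normal(scale=0.5, size=n_samples)
y = (X[:, 1] + 0.5 * X[:, 2] + noise > 0).astype(int)

model = RandomForestClassifier(n_estimators=50, random_state=0)
model.fit(X, y)

# Impurity-based importances; scikit-learn normalizes them to sum to one.
importances = model.feature_importances_
ranked = sorted(zip(feature_names, importances), key=lambda kv: kv[1], reverse=True)
for name, score in ranked:
    print(f"{name:>14s}  {score:.3f}")
```

Sorting the importances in descending order mirrors the way the exhibits present them, with one ranked list per ETF.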

For that purpose, we grouped returns and volumes for each ETF and summed the importance within each group (Exhibit 16). Volume is more important than returns, which confirms the difference in results for the A and B datasets. One reason for this behavior might be a relationship between volume and returns; it may reflect the relationship documented by Chen, Hong, and Stein [2001] and Chordia and Swaminathan [2000], in which past volume is a good predictor of the skewness and patterns of future returns. Calendar dummies show little to no influence on predictions, as expected from the dataset results.
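The grouping step described above can be sketched as a simple aggregation over feature names. The naming scheme (`<ETF>_<ret|vol>_lag<k>`), the helper `group_key`, and the importance values are all hypothetical placeholders chosen for illustration; they are not the paper's results.

```python
from collections import defaultdict

# Hypothetical per-feature importances (illustrative values only).
importances = {
    "SPY_ret_lag1": 0.10, "SPY_ret_lag2": 0.08,
    "SPY_vol_lag1": 0.20, "SPY_vol_lag2": 0.17,
    "EEM_ret_lag1": 0.09, "EEM_vol_lag1": 0.31,
    "month_jan": 0.05,
}


def group_key(name):
    """Map a feature name to an (ETF, kind) group; anything else is a calendar dummy."""
    parts = name.split("_")
    if len(parts) == 3 and parts[1] in ("ret", "vol"):
        return (parts[0], parts[1])
    return ("calendar", "dummy")


# Sum the importance of every lag that belongs to the same group.
grouped = defaultdict(float)
for name, score in importances.items():
    grouped[group_key(name)] += score

for key, total in sorted(grouped.items(), key=lambda kv: kv[1], reverse=True):
    print(key, round(total, 3))
```

Because the per-feature importances sum to one, the grouped totals do as well, so each group's total can be read as its share of the overall importance.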









In this work, we examined the ability of three popular machine learning algorithms to predict ETF returns. Although we restricted our initial analysis to only the direction of future price movements, we still obtained valuable results. First, machine learning algorithms do a good job of predicting price changes at the 10- to 60-day horizon. Not surprisingly, these algorithms fail to predict returns on short-term horizons of five days or less. We introduce our gain measure to help assess efficacy across algorithms and horizons. We also segmented our input feature variables into different information sets so as to cast our research in the framework of the efficient markets hypothesis. We find that the volume information set (B) works extremely well across our three algorithms. Moreover, we find that the most important predictive features vary depending on the ETF being predicted. Financial intuition helps us to understand these prediction variables: the complex relationships embedded in predicting the S&P 500, as proxied by SPY, require a more diverse set of features than the top feature set needed to explain GLD or OIH.

In practice, the information set could be vastly extended to include other important features, such as social media, along the lines of Liew and Budavari [2017], who identified the social media factor. Additionally, the forecasting time horizons could have been extended even further beyond one trading year or shortened to examine intraday performance. However, we leave this more ambitious research agenda to future work.

One interesting application is to use several different horizon models launched at staggered times within a day, thereby gaining slight diversification benefits for the resultant portfolio of strategies.

In sum, we hope that our application of machine learning algorithms motivates others to move this body of knowledge forward. These algorithms possess great potential in their applications to the many problems in finance.












1 For the DNN and SVM approaches, we also standardize the data using the training set prior to training and testing estimators. Standardization subtracts a feature's mean and divides the result by that feature's standard deviation, with both statistics computed on the training set. Thus, in the training set, each feature has a mean of zero and a standard deviation of one; in the test set, however, the features’ means and standard deviations may differ from zero and one.

2 All other parameters have default values. For more information, see

3 f(x)=max(0, x), f(x)=1/(1+exp(-x)) and f(x)=tanh(x), respectively.
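Footnotes 1 and 3 can be illustrated with a short sketch. It standardizes a feature using statistics estimated on the training set only and evaluates the three functions listed in footnote 3. The toy numbers and function names are hypothetical, and the use of the population standard deviation is an assumption rather than something stated in the paper.

```python
import math
from statistics import mean, pstdev


def fit_standardizer(train_column):
    """Estimate the mean and (population) standard deviation on the training set only."""
    return mean(train_column), pstdev(train_column)


def standardize(column, mu, sigma):
    """Demean a feature and divide by the training-set standard deviation."""
    return [(x - mu) / sigma for x in column]


# Hypothetical toy feature split into training and test observations.
train = [1.0, 2.0, 3.0, 4.0, 5.0]
test = [4.0, 5.0, 6.0]

mu, sigma = fit_standardizer(train)
train_z = standardize(train, mu, sigma)  # mean 0 and standard deviation 1 by construction
test_z = standardize(test, mu, sigma)    # generally not mean 0 and standard deviation 1


# The three functions from footnote 3.
def relu(x):
    return max(0.0, x)


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def tanh(x):
    return math.tanh(x)


print(mean(train_z), pstdev(train_z))
print(mean(test_z), pstdev(test_z))
print(relu(-2.0), sigmoid(0.0), tanh(0.0))
```

Reusing the training-set mean and standard deviation on the test set keeps information from the test period out of the preprocessing step, which is why the standardized test features need not have a mean of zero or a standard deviation of one.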

