{"title": "Multi-Task Learning for Stock Selection", "book": "Advances in Neural Information Processing Systems", "page_first": 946, "page_last": 952, "abstract": null, "full_text": "Multi-Task Learning for Stock Selection \n\nJoumana Ghosn \nDept. Informatique et Recherche Operationnelle \nUniversite de Montreal \nMontreal, Qc H3C-3J7 \nghosn@iro.umontreal.ca \n\nYoshua Bengio * \nDept. Informatique et Recherche Operationnelle \nUniversite de Montreal \nMontreal, Qc H3C-3J7 \nbengioy@iro.umontreal.ca \n\nAbstract \n\nArtificial Neural Networks can be used to predict the future returns of stocks in order to make financial decisions. Should one build a separate network for each stock or share the same network for all the stocks? In this paper we also explore intermediate alternatives, in which some layers are shared and others are not. When the predictions of future returns for different stocks are viewed as different tasks, sharing some parameters across stocks is a form of multi-task learning. In a series of experiments with Canadian stocks, we obtain yearly returns that are more than 14% above various benchmarks. \n\n1 Introduction \n\nPrevious applications of ANNs to financial time series suggest that several of these prediction and decision-making tasks present sufficient non-linearities to justify the use of ANNs (Refenes, 1994; Moody, Levin and Rehfuss, 1993). These models can incorporate various types of explanatory variables: so-called technical variables (depending on the past price sequence), micro-economic stock-specific variables (such as measures of company profitability), and macro-economic variables (which give information about the business cycle). \n\nOne question addressed in this paper is whether the way to treat these different variables should differ across stocks, i.e., should one use the same network for all the stocks or a different network for each stock?
To explore this question, we performed a series of experiments in which different subsets of parameters are shared across the different stock models. When the predictions of future returns for different stocks are viewed as different tasks (which may nonetheless have something in common), sharing some parameters across stocks is a form of multi-task learning. \n\n* Also with AT&T Labs, Holmdel, NJ 07733. \n\nThese experiments were performed on 9 years of data concerning 35 large capitalization companies of the Toronto Stock Exchange (TSE). Following the results of previous experiments (Bengio, 1996), the networks were not trained to predict the future return of stocks, but instead to directly optimize a financial criterion. This has been found to yield returns that are significantly superior to those obtained by training the ANNs to minimize the mean squared prediction error. \n\nIn section 2, we review previous work on multi-task learning. In section 3, we describe the financial task that we have considered, and the experimental setup. In section 4, we present the results of these experiments. In section 5, we propose an extension of this work in which the models are re-parameterized so as to automatically learn what must be shared and what need not be shared. \n\n2 Parameter Sharing and Multi-Task Learning \n\nMost research on ANNs has been concerned with tabula rasa learning. The learner is given a set of examples (x_1, y_1), (x_2, y_2), ..., (x_N, y_N) chosen according to some unknown probability distribution. Each pair (x, y) represents an input x and a desired value y. One defines a training criterion C to be minimized as a function of the desired outputs and of the outputs of the learner f(x).
The function f is parameterized by the parameters of the network and belongs to a set of hypotheses H, that is, the set of all functions that can be realized for different values of the parameters. The part of generalization error due to variance (due to the specific choice of training examples) can be controlled by making strong assumptions on the model, i.e., by choosing a small hypothesis space H. But using an incorrect model also worsens performance. \n\nOver the last few years, methods for automatically choosing H based on similar tasks have been studied. They consider that a learner is embedded in a world where it faces many related tasks and that the knowledge acquired when learning one task can be used to learn a new task better and/or faster. Some methods consider that the related tasks are not always all available at the same time (Pratt, 1993; Silver and Mercer, 1995): knowledge acquired when learning a previous task is transferred to a new task. Alternatively, all tasks may be learned in parallel (Baxter, 1995; Caruana, 1995), and this is the approach followed here. Our objective is not to use multi-task learning to improve the speed of learning the training data (Pratt, 1993; Silver and Mercer, 1995), but instead to improve generalization performance. For example, in (Baxter, 1995), several neural networks (one for each task) are trained simultaneously. The networks share their first hidden layers, while all the remaining layers are specific to each network. The shared layers use the knowledge provided by the training examples of all the tasks to build an internal representation suitable for all these tasks. The remaining layers of each network use the internal representation to learn a specific task. \n\nIn the multi-task learning method used by Caruana (Caruana, 1995), all the hidden layers are shared. They serve as mutual sources of inductive bias.
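This kind of hard parameter sharing can be sketched in a few lines. The sketch below is illustrative only (not the authors' code): it follows the paper's 5-3-1 architecture with 35 stocks, sharing the first hidden layer across tasks while giving each stock its own output layer; all names and initialization choices are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Dimensions follow the paper's 5-3-1 networks: 5 inputs, 3 hidden units,
# 1 output per stock, with 35 stocks (tasks).
n_inputs, n_hidden, n_tasks = 5, 3, 35

# Shared first hidden layer: a single weight matrix used by every stock model.
W_shared = rng.normal(scale=0.1, size=(n_hidden, n_inputs))
b_shared = np.zeros(n_hidden)

# Task-specific output layers: one small head per stock.
W_head = rng.normal(scale=0.1, size=(n_tasks, n_hidden))
b_head = np.zeros(n_tasks)

def predict(x, task):
    """Forward pass for one stock: shared representation, task-specific output."""
    h = np.tanh(W_shared @ x + b_shared)           # shared across all tasks
    return float(W_head[task] @ h + b_head[task])  # specific to this stock

x = rng.normal(size=n_inputs)  # one month of explanatory variables
outputs = [predict(x, t) for t in range(n_tasks)]
```

Training such a model updates `W_shared` with gradients from every task, while each `W_head[task]` only receives gradients from its own stock's criterion.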
It was also suggested that besides the relevant tasks that are used for learning, it may be possible to use other related tasks that we do not want to learn but that may help to further bias the learner (Caruana, Baluja and Mitchell, 1996; Intrator and Edelman, 1996). \n\nIn the family discovery method (Omohundro, 1996), a parameterized family of models is built. Several learners are trained separately on different but related tasks and their parameters are used to construct a manifold of parameters. When a new task has to be learned, the parameters are chosen so as to maximize the data likelihood on the one hand, and to maximize a \"family prior\" on the other hand, which restricts the chosen parameters to lie on the manifold. \n\nIn all these methods, the values of some or all of the parameters are constrained. Such models restrict the size of the hypothesis space sufficiently to ensure good generalization performance from a small number of examples. \n\n3 Application to Stock Selection \n\nWe apply the ideas of multi-task learning to a problem of stock selection and portfolio management. We consider a universe of 36 assets, including 35 risky assets and one risk-free asset. The risky assets are 35 Canadian large-capitalization stocks from the Toronto Stock Exchange. The risk-free asset is represented by 90-day Canadian treasury bills. The data is monthly and spans 8 years, from February 1986 to January 1994 (96 months). Each month, one can buy or sell some of these assets in such a way as to distribute the current worth between these assets. We do not allow borrowing or short selling, so the weights of the resulting portfolio are all non-negative (and they sum to 1).
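One common way to enforce this long-only constraint is to pass per-asset scores through a softmax, which yields weights that are non-negative and sum to 1. This is only a hypothetical sketch: the paper describes its trading module at a high level and does not say that a softmax is used, and the function name and example scores below are assumptions.

```python
import numpy as np

def portfolio_weights(scores):
    """Map arbitrary per-asset scores (e.g. network outputs) to long-only
    portfolio weights: all non-negative and summing to 1, so no borrowing
    or short selling can occur."""
    z = scores - scores.max()   # shift for numerical stability
    w = np.exp(z)
    return w / w.sum()

scores = np.array([0.2, -1.0, 0.5] + [0.0] * 33)  # 36 assets, as in the paper
w = portfolio_weights(scores)
# Every entry of w is >= 0 and the entries sum to 1 (up to floating point).
```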
\n\nWe have selected 5 explanatory variables, 2 of which are macro-economic variables known to influence the business cycle, and 3 of which are micro-economic variables representing the profitability of the company and previous price changes of the stock. The macro-economic variables were derived from yields of long-term bonds and from the Consumer Price Index. The micro-economic variables were derived from the series of dividend yields and from the series of ratios of stock price to book value of the company. Spline extrapolation (not interpolation) was used to obtain monthly data from the quarterly or annual company statements and macro-economic series. For these variables, we used the dates at which their values were made public, not the dates to which they theoretically refer. \n\nTo take into account the non-stationarity of the financial and economic time series, and to estimate performance over a variety of economic situations, multiple training experiments were performed on different training windows, each time testing on the following 12 months. For each architecture, 5 such trainings took place, with training sets of size 3, 4, 5, 6, and 7 years respectively. Furthermore, multiple such experiments with different initial weights were performed to verify that we did not obtain \"lucky\" results due to particular initial weights. The 5 concatenated test periods make an overall 5-year test period from February 1989 to January 1994. \n\nThe training algorithm is described in (Bengio, 1996) and is based on the optimization of the neural network parameters with respect to a financial criterion (here, maximizing the overall profit). The outputs of the neural network feed a trading module.
The trading module has as input, at each time step, the output of the network, as well as the weights giving the current distribution of worth between the assets. These weights depend on the previous portfolio weights and on the relative change in value of each asset (due to different price changes). The outputs of the trading module are the current portfolio weights for each of the assets. Based on the difference between these desired weights and the current distribution of worth, transactions are performed. Transaction costs of 1% (of the absolute value of each buy or sell transaction) are taken into account. Because of transaction costs, the actions of the trading module at time t influence the profitability of its future actions. The financial criterion depends in a non-additive way on the performance of the network over the whole sequence. To obtain gradients of this criterion with respect to the network output, we have to backpropagate gradients through time, through the trading module, which computes a differentiable function of its inputs. Therefore, a gradient step is performed only after presenting the whole training sequence (in order, of course). In (Bengio, 1996), we found this procedure to yield significantly larger profits (around 4% better annual return), at comparable risk, than training the neural network to predict expected future returns with the mean squared error criterion. In the experiments, the ANN was trained for 120 epochs. \n\n4 Experimental Results \n\nFour sets of experiments with different types of parameter sharing were performed, with two different architectures for the neural network: 5-3-1 (5 inputs, a hidden layer of 3 units, and 1 output) and 5-3-2-1 (5 inputs, 3 units in the first hidden layer, 2 units in the second hidden layer, and 1 output).
The output represents the belief that the value of the stock is going to increase (or the expected future return over three months when training with the MSE criterion). \n\nFour types of parameter sharing between the different models for each stock are compared in these experiments: sharing everything (the same parameters for all the stocks), sharing only the parameters (weights and biases) of the first hidden layers, sharing only the output layer parameters, and not sharing anything (independent models for each stock). \n\nThe main results for the test period, using the 5-3-1 architecture, are summarized in Table 1, and graphically depicted in Figure 1 with the worth curves for the four types of sharing. The results for the test period, using the 5-3-2-1 architecture, are summarized in Table 2. The ANNs were compared to two benchmarks: a buy-and-hold benchmark (with uniform initial weights over all 35 stocks), and the TSE300 Index. Since the buy-and-hold benchmark performed better (8.3% yearly return) than the TSE300 Index (4.4% yearly return) during the 02/89-01/94 test period, Tables 1 and 2 give comparisons with the buy-and-hold benchmark. Variations of average yearly return on the test period due to different initial weights were computed by performing each of the experiments 18 times with different random seeds. The resulting standard deviations are less than 3.7 when no parameters or all the parameters are shared, less than 2.7 when the parameters of the first hidden layers are shared, and less than 4.2 when the output layer is shared. \n\nThe values of beta and alpha are computed by fitting the monthly return of the portfolio r_p to the return of the benchmark r_M, both adjusted for the risk-free return \n\nTable 1: Comparative results for the 5-3-1 architecture: four types of sharing are compared with the buy-and-hold benchmark (see text).
\n
                               buy & hold   no sharing   share all   share hidden   share output \n
Average yearly return                8.3%        22.8%         13%          23.4%          24.8% \n
Standard deviation (monthly)         3.5%         5.2%        4.3%           5.3%           5.3% \n
Beta                                    1         1.26        1.07           1.30           1.26 \n
Alpha (yearly)                          0        19.9%          9%          20.6%          21.8% \n
t-statistic for alpha = 0              NA           14          11           14.9             15 \n
Reward to variability                0.9%        22.3%        9.6%          22.9%          24.7% \n
Excess return above benchmark           0        14.5%        4.7%          15.1%          16.4% \n
Maximum drawdown                    15.7%        13.3%       13.3%          13.4%          13.3% \n
\nTable 2: Comparative results for the 5-3-2-1 architecture: four types of sharing are compared with the buy-and-hold benchmark (see text). \n
                               buy & hold   no sharing   share all   share first hidden   share all hidden \n
Average yearly return                8.3%         9.1%       12.5%                22.7%                23% \n
Standard deviation (monthly)         3.5%         3.1%        4.0%                 5.2%               5.2% \n
Beta                                    1         0.87        1.02                 1.25               1.28 \n
Alpha (yearly)                          0         4.0%        8.2%                19.7%              20.1% \n
t-statistic for alpha = 0              NA         21.2        12.1                 14.1               14.8 \n
Reward to variability                0.9%         2.5%        9.3%                22.2%              22.5% \n
Excess return above benchmark           0         0.8%        4.2%                14.4%              14.7% \n
Maximum drawdown                    15.7%          10%         13%                12.6%              13.4% \n
\nr_i (interest rates), according to the linear regression E(r_p - r_i) = alpha + beta (r_M - r_i). Beta is simply the ratio of the covariance between the portfolio return and the market return to the variance of the market return. According to the Capital Asset Pricing Model (Sharpe, 1964), beta gives a measure of \"systematic\" risk, i.e., risk as it relates to the market, whereas the variance of the return gives a measure of total risk. The value of alpha in the tables is annualized (as a compound return): it represents a measure of excess return (over the market benchmark) adjusted for market risk (beta).
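The alpha and beta above can be computed directly from the monthly return series. The following sketch (hypothetical names and synthetic data; not the authors' code) estimates beta as the ratio of the covariance between the portfolio and market excess returns to the variance of the market excess return, and annualizes the monthly alpha as a compound return.

```python
import numpy as np

def alpha_beta(r_p, r_M, r_i):
    """Fit E(r_p - r_i) = alpha + beta * (r_M - r_i) on monthly returns.
    Returns the annualized (compound) alpha and the beta."""
    x = np.asarray(r_M) - np.asarray(r_i)  # market excess return
    y = np.asarray(r_p) - np.asarray(r_i)  # portfolio excess return
    beta = np.cov(x, y, ddof=1)[0, 1] / np.var(x, ddof=1)
    alpha_monthly = y.mean() - beta * x.mean()
    alpha_yearly = (1.0 + alpha_monthly) ** 12 - 1.0  # compound annualization
    return alpha_yearly, beta

# Synthetic check: a portfolio constructed to have beta = 1.3 and
# a monthly alpha of exactly 1%.
rng = np.random.default_rng(1)
r_i = np.full(60, 0.005)                 # flat monthly risk-free rate
r_M = r_i + rng.normal(0.004, 0.03, 60)  # market returns over 60 months
r_p = r_i + 0.01 + 1.3 * (r_M - r_i)     # exact linear relation
a, b = alpha_beta(r_p, r_M, r_i)         # b = 1.3, a = 1.01**12 - 1
```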
The hypothesis that alpha = 0 is clearly rejected in all cases (with t-statistics above 9, and corresponding p-values very close to 0). The reward to variability (or \"Sharpe ratio\"), as defined in (Sharpe, 1966), is another risk-adjusted measure of performance: (mean of r_p - r_i) / sigma_p, where sigma_p is the standard deviation of the portfolio return (monthly returns were used here). The excess return above benchmark is the simple difference (not risk-adjusted) between the return of the portfolio and that of the benchmark. The maximum drawdown is another measure of risk, and it can be defined in terms of the worth curve: worth[t] is the ratio between the value of the portfolio at time t and its value at time 0. The maximum drawdown is then defined as max_t ((max_{s <= t} worth[s] - worth[t]) / max_{s <= t} worth[s]).