{"title": "Reinforcement Learning for Trading", "book": "Advances in Neural Information Processing Systems", "page_first": 917, "page_last": 923, "abstract": null, "full_text": "Reinforcement Learning for Trading \n\nJohn Moody and Matthew Saffell* \n\nOregon Graduate Institute, CSE Dept. \n\nP.O . Box 91000 , Portland, OR 97291-1000 \n\n{moody, saffell }@cse.ogi.edu \n\nAbstract \n\nWe propose to train trading systems by optimizing financial objec(cid:173)\ntive functions via reinforcement learning. The performance func(cid:173)\ntions that we consider are profit or wealth, the Sharpe ratio and \nour recently proposed differential Sharpe ratio for online learn(cid:173)\ning. In Moody & Wu (1997), we presented empirical results that \ndemonstrate the advantages of reinforcement learning relative to \nsupervised learning. Here we extend our previous work to com(cid:173)\npare Q-Learning to our Recurrent Reinforcement Learning (RRL) \nalgorithm. We provide new simulation results that demonstrate \nthe presence of predictability in the monthly S&P 500 Stock Index \nfor the 25 year period 1970 through 1994, as well as a sensitivity \nanalysis that provides economic insight into the trader's structure. \n\nIntroduction: Reinforcement Learning for Thading \n\n1 \nThe investor's or trader's ultimate goal is to optimize some relevant measure of \ntrading system performance , such as profit, economic utility or risk-adjusted re(cid:173)\nturn. In this paper , we propose to use recurrent reinforcement learning to directly \noptimize such trading system performance functions , and we compare two differ(cid:173)\nent reinforcement learning methods. The first, Recurrent Reinforcement Learning, \nuses immediate rewards to train the trading systems, while the second (Q-Learning \n(Watkins 1989)) approximates discounted future rewards. These methodologies can \nbe applied to optimizing systems designed to trade a single security or to trade port(cid:173)\nfolios . 
In addition, we propose a novel value function for risk-adjusted return that enables learning to be done online: the differential Sharpe ratio.\n\nTrading system profits depend upon sequences of interdependent decisions, and are thus path-dependent. Optimal trading decisions, when the effects of transaction costs, market impact and taxes are included, require knowledge of the current system state. In Moody, Wu, Liao & Saffell (1998), we demonstrate that reinforcement learning provides a more elegant and effective means for training trading systems when transaction costs are included than do more standard supervised approaches.\n\n*The authors are also with Nonlinear Prediction Systems.\n\n\f918\n\nJ. Moody and M. Saffell\n\nThough much theoretical progress has been made in recent years in the area of reinforcement learning, there have been relatively few successful, practical applications of the techniques. Notable examples include Neurogammon (Tesauro 1989), the asset trader of Neuneier (1996), an elevator scheduler (Crites & Barto 1996) and a space-shuttle payload scheduler (Zhang & Dietterich 1996).\n\nIn this paper we present results for reinforcement learning trading systems that outperform the S&P 500 Stock Index over a 25-year test period, thus demonstrating the presence of predictable structure in US stock prices. The reinforcement learning algorithms compared here include our new recurrent reinforcement learning (RRL) method (Moody & Wu 1997, Moody et al. 1998) and Q-Learning (Watkins 1989).\n\n2 Trading Systems and Financial Performance Functions\n\n2.1 Structure, Profit and Wealth for Trading Systems\n\nWe consider performance functions for systems that trade a single security¹ with price series z_t. The trader is assumed to take only long, neutral or short positions F_t ∈ {−1, 0, 1} of constant magnitude. The constant magnitude assumption can be easily relaxed to enable better risk control.
The position F_t is established or maintained at the end of each time interval t, and is re-assessed at the end of period t+1. A trade is thus possible at the end of each time period, although nonzero trading costs will discourage excessive trading. A trading system return R_t is realized at the end of the time interval (t−1, t] and includes the profit or loss resulting from the position F_{t−1} held during that interval and any transaction cost incurred at time t due to a difference in the positions F_{t−1} and F_t.\n\nIn order to properly incorporate the effects of transaction costs, market impact and taxes in a trader's decision making, the trader must have internal state information and must therefore be recurrent. An example of a single asset trading system that takes into account transaction costs and market impact has the following decision function: F_t = F(θ_t; F_{t−1}, I_t) with I_t = {z_t, z_{t−1}, z_{t−2}, ...; y_t, y_{t−1}, y_{t−2}, ...}, where θ_t denotes the (learned) system parameters at time t and I_t denotes the information set at time t, which includes present and past values of the price series z_t and an arbitrary number of other external variables denoted y_t.\n\nTrading systems can be optimized by maximizing performance functions U(·) such as profit, wealth, utility functions of wealth or performance ratios like the Sharpe ratio. The simplest and most natural performance function for a risk-insensitive trader is profit. The transaction cost rate is denoted δ.\n\nAdditive profits are appropriate to consider if each trade is for a fixed number of shares or contracts of security z_t. This is often the case, for example, when trading small futures accounts or when trading standard US$ FX contracts in dollar-denominated foreign currencies.
With the definitions r_t = z_t − z_{t−1} and r_t^f = z_t^f − z_{t−1}^f for the price returns of a risky (traded) asset and a risk-free asset (like T-Bills) respectively, the additive profit accumulated over T time periods with trading position size μ > 0 is then defined as:\n\nP_T = Σ_{t=1}^T R_t = μ Σ_{t=1}^T {r_t^f + F_{t−1}(r_t − r_t^f) − δ|F_t − F_{t−1}|}   (1)\n\nwith P_0 = 0 and typically F_T = F_0 = 0. Equation (1) holds for continuous quantities also. The wealth is defined as W_T = W_0 + P_T.\n\n¹See Moody et al. (1998) for a detailed discussion of multiple asset portfolios.\n\nMultiplicative profits are appropriate when a fixed fraction of accumulated wealth ν > 0 is invested in each long or short trade. Here, r_t = (z_t / z_{t−1} − 1) and r_t^f = (z_t^f / z_{t−1}^f − 1). If no short sales are allowed and the leverage factor is set fixed at ν = 1, the wealth at time T is:\n\nW_T = W_0 Π_{t=1}^T {1 + R_t} = W_0 Π_{t=1}^T {1 + (1 − F_{t−1}) r_t^f + F_{t−1} r_t}{1 − δ|F_t − F_{t−1}|}.   (2)\n\n2.2 The Differential Sharpe Ratio for On-line Learning\n\nRather than maximizing profits, most modern fund managers attempt to maximize risk-adjusted return, as advocated by Modern Portfolio Theory. The Sharpe ratio is the most widely-used measure of risk-adjusted return (Sharpe 1966). Denoting as before the trading system returns for period t (including transaction costs) as R_t, the Sharpe ratio is defined to be\n\nS_T = Average(R_t) / Standard Deviation(R_t)   (3)\n\nwhere the average and standard deviation are estimated for periods t = {1, ..., T}.\n\nProper on-line learning requires that we compute the influence on the Sharpe ratio of the return at time t. To accomplish this, we have derived a new objective function called the differential Sharpe ratio for on-line optimization of trading system performance (Moody et al. 1998).
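As a concrete illustration, the profit accumulation in equations (1) and (2) can be sketched in a few lines of code. This is a minimal sketch: the function names, the position-indexing convention (F[0] is the initial position) and the default parameter values are ours, not from the paper.\n\n```python\nimport numpy as np\n\ndef additive_profit(z, zf, F, mu=1.0, delta=0.0):\n    # Additive profit, eq. (1): fixed position size mu.\n    # z: risky price series; zf: risk-free price series;\n    # F: positions F_0..F_T in {-1, 0, 1} (F[0] is the initial position).\n    r = np.diff(z)            # r_t = z_t - z_{t-1}\n    rf = np.diff(zf)          # risk-free price changes\n    dF = np.abs(np.diff(F))   # trade sizes |F_t - F_{t-1}|\n    return mu * np.sum(rf + F[:-1] * (r - rf) - delta * dF)\n\ndef multiplicative_wealth(z, zf, F, W0=1.0, delta=0.0):\n    # Multiplicative wealth, eq. (2): fraction-of-wealth trading, nu = 1.\n    r = z[1:] / z[:-1] - 1.0\n    rf = zf[1:] / zf[:-1] - 1.0\n    dF = np.abs(np.diff(F))\n    growth = (1.0 + (1.0 - F[:-1]) * rf + F[:-1] * r) * (1.0 - delta * dF)\n    return W0 * np.prod(growth)\n```\n\nThe differential Sharpe ratio discussed next is built from these same per-period returns R_t.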
It is obtained by considering exponential moving averages of the returns and standard deviation of returns in (3), and expanding to first order in the decay rate η: S_t ≈ S_{t−1} + η dS_t/dη|_{η=0} + O(η²). Noting that only the first order term in this expansion depends upon the return R_t at time t, we define the differential Sharpe ratio as:\n\nD_t ≡ dS_t/dη = (B_{t−1} ΔA_t − (1/2) A_{t−1} ΔB_t) / (B_{t−1} − A_{t−1}²)^{3/2}   (4)\n\nwhere the quantities A_t and B_t are exponential moving estimates of the first and second moments of R_t:\n\nA_t = A_{t−1} + η ΔA_t = A_{t−1} + η(R_t − A_{t−1})\nB_t = B_{t−1} + η ΔB_t = B_{t−1} + η(R_t² − B_{t−1})   (5)\n\nTreating A_{t−1} and B_{t−1} as numerical constants, note that η in the update equations controls the magnitude of the influence of the return R_t on the Sharpe ratio S_t. Hence, the differential Sharpe ratio represents the influence of the trading return R_t realized at time t on S_t.\n\n3 Reinforcement Learning for Trading Systems\n\nThe goal in using reinforcement learning to adjust the parameters of a system is to maximize the expected payoff or reward that is generated due to the actions of the system. This is accomplished through trial and error exploration of the environment. The system receives a reinforcement signal from its environment (a reward) that provides information on whether its actions are good or bad. The performance function at time T can be expressed as a function of the sequence of trading returns: U_T = U(R_1, R_2, ..., R_T).\n\nGiven a trading system model F_t(θ), the goal is to adjust the parameters θ in order to maximize U_T. This maximization for a complete sequence of T trades can be done off-line using dynamic programming or batch versions of recurrent reinforcement learning algorithms. Here we do the optimization on-line using a reinforcement learning technique. This reinforcement learning algorithm is based on stochastic gradient ascent.
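The moving-average updates of Section 2.2, equations (4) and (5), translate directly into a small online routine. This is a minimal sketch, assuming A_0 = B_0 = 0 and guarding the first steps where the variance estimate B − A² is still degenerate; the class name and default decay rate are illustrative choices, not from the paper.\n\n```python\nclass DifferentialSharpe:\n    # Online differential Sharpe ratio following eqs. (4)-(5).\n    # A and B are exponential moving estimates of the first and\n    # second moments of the returns R_t; eta is the decay rate.\n    def __init__(self, eta=0.01, A0=0.0, B0=0.0):\n        self.eta, self.A, self.B = eta, A0, B0\n\n    def update(self, R):\n        # increments Delta A_t and Delta B_t from eq. (5)\n        dA = R - self.A\n        dB = R * R - self.B\n        # D_t from eq. (4); zero while the variance estimate is degenerate\n        var = self.B - self.A ** 2\n        D = 0.0 if var <= 0 else (self.B * dA - 0.5 * self.A * dB) / var ** 1.5\n        # exponential moving updates of the moment estimates\n        self.A += self.eta * dA\n        self.B += self.eta * dB\n        return D\n```\n\nEach call returns D_t, the marginal influence of the latest return on the Sharpe ratio, which serves as the per-period reward signal in the online training that follows.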
The gradient of U_T with respect to the parameters θ of the system after a sequence of T trades is\n\ndU_T(θ)/dθ = Σ_{t=1}^T (dU_T/dR_t){(dR_t/dF_t)(dF_t/dθ) + (dR_t/dF_{t−1})(dF_{t−1}/dθ)}   (6)\n\nA simple on-line stochastic optimization can be obtained by considering only the term in (6) that depends on the most recently realized return R_t during a forward pass through the data:\n\ndU_t(θ)/dθ = (dU_t/dR_t){(dR_t/dF_t)(dF_t/dθ) + (dR_t/dF_{t−1})(dF_{t−1}/dθ)}.   (7)\n\nThe parameters are then updated on-line using Δθ_t = ρ dU_t(θ_t)/dθ_t. Because of the recurrent structure of the problem (necessary when transaction costs are included), we use a reinforcement learning algorithm based on real-time recurrent learning (Williams & Zipser 1989). This approach, which we call recurrent reinforcement learning (RRL), is described in (Moody & Wu 1997, Moody et al. 1998) along with extensive simulation results.\n\n4 Empirical Results: S&P 500 / TBill Asset Allocation\n\nA long/short trading system is trained on monthly S&P 500 stock index and 3-month TBill data to maximize the differential Sharpe ratio. The S&P 500 target series is the total return index computed by reinvesting dividends. The 84 input series used in the trading systems include both financial and macroeconomic data. All data are obtained from Citibase, and the macroeconomic series are lagged by one month to reflect reporting delays.\n\nA total of 45 years of monthly data are used, from January 1950 through December 1994. The first 20 years of data are used only for the initial training of the system. The test period is the 25 year period from January 1970 through December 1994. The experimental results for the 25 year test period are true ex ante simulated trading results.\n\nFor each year during 1970 through 1994, the system is trained on a moving window of the previous 20 years of data.
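To make the recursion of Section 3 concrete, here is a minimal sketch of one online step of equation (7) for a single-layer trader F_t = tanh(θ · [x_t, F_{t−1}, 1]) with profit as the utility (so dU_t/dR_t = 1) and R_t = F_{t−1} r_t − δ|F_t − F_{t−1}|. The function name, the tanh parameterization and the numeric defaults are our illustrative assumptions; the actual trained systems described below differ in detail.\n\n```python\nimport numpy as np\n\ndef rrl_update(theta, dF_dtheta, F_prev, x, r, delta=0.001, rho=0.01):\n    # One step of eq. (7) with real-time recurrent learning.\n    # theta: weights on [x_t, F_{t-1}, 1]; dF_dtheta: dF_{t-1}/dtheta\n    # carried from the previous step; r: price return r_t.\n    u = theta[len(x)]                      # weight on the recurrent input F_{t-1}\n    inp = np.concatenate([x, [F_prev, 1.0]])\n    F = np.tanh(theta @ inp)\n    # total derivative dF_t/dtheta = (1 - F^2)(inp + u * dF_{t-1}/dtheta)\n    dF = (1.0 - F ** 2) * (inp + u * dF_dtheta)\n    # return derivatives for R_t = F_{t-1} r_t - delta |F_t - F_{t-1}|\n    sgn = np.sign(F - F_prev)\n    dR_dF, dR_dFprev = -delta * sgn, r + delta * sgn\n    grad = dR_dF * dF + dR_dFprev * dF_dtheta   # eq. (7) with dU_t/dR_t = 1\n    return theta + rho * grad, dF, F\n```\n\nIterating this update over the price series, carrying (theta, dF, F) forward from step to step, is the forward-pass stochastic gradient ascent described above.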
For 1970, the system is initialized with random parameters. For the 24 subsequent years, the previously learned parameters are used to initialize the training. In this way, the system is able to adapt to changing market and economic conditions. Within the moving training window, the \"RRL\" systems use the first 10 years for stochastic optimization of system parameters, and the subsequent 10 years for validating early stopping of training. The networks are linear, and are regularized using quadratic weight decay during training with a regularization parameter of 0.01. The \"Qtrader\" systems use a bootstrap sample of the 20 year training window for training, and the final 10 years of the training window are used for validating early stopping of training. The networks are two-layer feedforward networks with 30 tanh units in the hidden layer.\n\n4.1 Experimental Results\n\nThe left panel in Figure 1 shows box plots summarizing the test performance for the full 25 year test period of the trading systems with various realizations of the initial system parameters over 30 trials for the \"RRL\" system, and 10 trials for the \"Qtrader\" system². The transaction cost is set at 0.5%. Profits are reinvested during trading, and multiplicative profits are used when calculating the wealth. The notches in the box plots indicate robust estimates of the 95% confidence intervals on the hypothesis that the median is equal to the performance of the buy and hold strategy. The horizontal lines show the performance of the \"RRL\" voting, \"Qtrader\" voting and buy and hold strategies for the same test period. The annualized monthly Sharpe ratios of the buy and hold strategy, the \"Qtrader\" voting strategy and the \"RRL\" voting strategy are 0.34, 0.63 and 0.83 respectively. The Sharpe ratios calculated here are for the returns in excess of the 3-month treasury bill rate.
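Annualized monthly Sharpe ratios like those quoted above are conventionally computed from monthly series with a √12 scaling. The paper does not spell out its annualization, so the sqrt(12) factor and the population standard deviation in this sketch are standard assumptions rather than the authors' stated procedure.\n\n```python\nimport numpy as np\n\ndef annualized_monthly_sharpe(monthly_returns, monthly_tbill):\n    # Annualized Sharpe ratio of monthly returns in excess of the\n    # T-bill rate, using the conventional sqrt(12) scaling.\n    excess = np.asarray(monthly_returns) - np.asarray(monthly_tbill)\n    return np.sqrt(12.0) * excess.mean() / excess.std()\n```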
\nThe right panel of Figure 1 shows results for following the strategy of taking positions based on a majority vote of the ensembles of trading systems compared with the buy and hold strategy. We can see that the trading systems go short the S&P 500 during critical periods, such as the oil price shock of 1974, the tight money periods of the early 1980's, the market correction of 1984 and the 1987 crash. This ability to take advantage of high treasury bill rates or to avoid periods of substantial stock market loss is the major factor in the long term success of these trading models. One exception is that the \"RRL\" trading system remains long during the 1991 stock market correction associated with the Persian Gulf war, though the \"Qtrader\" system does identify the correction. On the whole though, the \"Qtrader\" system trades much more frequently than the \"RRL\" system, and in the end does not perform as well on this data set.\n\nFrom these results we find that both trading systems outperform the buy and hold strategy, as measured by both accumulated wealth and Sharpe ratio. These differences are statistically significant and support the proposition that there is predictability in the U.S. stock and treasury bill markets during the 25 year period 1970 through 1994. A more detailed presentation of the \"RRL\" results appears in (Moody et al. 1998).\n\n4.2 Gaining Economic Insight Through Sensitivity Analysis\n\nA sensitivity analysis of the \"RRL\" systems was performed in an attempt to determine on which economic factors the traders are basing their decisions. Figure 2 shows the absolute normalized sensitivities for 3 of the more salient input series as a function of time, averaged over the 30 members of the \"RRL\" committee.
The sensitivity of input i is defined as:\n\nS_i = |dF/dx_i| / max_j |dF/dx_j|   (8)\n\nwhere F is the unthresholded trading output and x_i denotes input i.\n\n²Ten trials were done for the \"Qtrader\" system due to the amount of computation required in training the systems.\n\nFigure 1: Test results for ensembles of simulations using the S&P 500 stock index and 3-month Treasury Bill data over the 1970-1994 time period. The solid curves correspond to the \"RRL\" voting system performance, dashed curves to the \"Qtrader\" voting system and the dashed and dotted curves indicate the buy and hold performance. The boxplots in (a) show the performance for the ensembles of \"RRL\" and \"Qtrader\" trading systems. The horizontal lines indicate the performance of the voting systems and the buy and hold strategy. Both systems significantly outperform the buy and hold strategy. (b) shows the equity curves associated with the voting systems and the buy and hold strategy, as well as the voting trading signals produced by the systems. In both cases, the traders avoid the dramatic losses that the buy and hold strategy incurred during 1974 and 1987.\n\nThe time-varying sensitivities in Figure 2 emphasize the nonstationarity of economic relationships.
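The normalized sensitivity of equation (8) is straightforward to estimate numerically when only the trained model is available. This sketch uses central finite differences; the function name and step size are illustrative choices, not from the paper.\n\n```python\nimport numpy as np\n\ndef sensitivities(F, x, eps=1e-4):\n    # Absolute normalized sensitivities (eq. 8) of a scalar trading\n    # output F(x) with respect to each input, via central differences.\n    x = np.asarray(x, dtype=float)\n    grad = np.empty_like(x)\n    for i in range(x.size):\n        e = np.zeros_like(x)\n        e[i] = eps\n        grad[i] = (F(x + e) - F(x - e)) / (2.0 * eps)\n    s = np.abs(grad)\n    return s / s.max()\n```\n\nAveraging these values over the committee members and plotting them through time yields traces like those in Figure 2.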
For example, the yield curve slope (which measures inflation expectations) is found to be a very important factor in the 1970's, while trends in long term interest rates (measured by the 6 month difference in the AAA bond yield) become more important in the 1980's, and trends in short term interest rates (measured by the 6 month difference in the treasury bill yield) dominate in the early 1990's.\n\n5 Conclusions and Extensions\n\nIn this paper, we have trained trading systems via reinforcement learning to optimize financial objective functions including our differential Sharpe ratio for online learning. We have also provided results that demonstrate the presence of predictability in the monthly S&P 500 Stock Index for the 25 year period 1970 through 1994.\n\nWe have previously shown with extensive simulation results (Moody & Wu 1997, Moody et al. 1998) that the \"RRL\" trading system significantly outperforms systems trained using supervised methods for traders of both single securities and portfolios. The superiority of reinforcement learning over supervised learning is most striking when state-dependent transaction costs are taken into account. Here, we present results for asset allocation systems trained using two different reinforcement learning algorithms on a real, economic dataset. We find that the \"Qtrader\" system does not perform as well as the \"RRL\" system on the S&P 500 / TBill asset allocation problem, possibly due to its more frequent trading. This effect deserves further exploration. In general, we find that Q-learning can suffer from the curse of dimensionality and is more difficult to use than our RRL approach.\n\nFinally, we apply sensitivity analysis to the trading systems, and find that certain interest rate variables have an influential role in making asset allocation decisions.
[Figure 2 plots traces labeled Yield Curve Slope, 6 Month Diff. in AAA Bond Yield, and 6 Month Diff. in TBill Yield over 1970-1995.]\n\nFigure 2: Sensitivity traces for three of the inputs to the \"RRL\" trading system averaged over the ensemble of traders. The nonstationary relationships typical among economic variables are evident from the time-varying sensitivities.\n\nWe also find that these influences exhibit nonstationarity over time.\n\nAcknowledgements\n\nWe gratefully acknowledge support for this work from Nonlinear Prediction Systems and from DARPA under contract DAAH01-96-C-R026 and AASERT grant DAAH04-95-1-0485.\n\nReferences\n\nCrites, R. H. & Barto, A. G. (1996), Improving elevator performance using reinforcement learning, in D. S. Touretzky, M. C. Mozer & M. E. Hasselmo, eds, 'Advances in NIPS', Vol. 8, pp. 1017-1023.\n\nMoody, J. & Wu, L. (1997), Optimization of trading systems and portfolios, in Y. Abu-Mostafa, A. N. Refenes & A. S. Weigend, eds, 'Decision Technologies for Financial Engineering', World Scientific, London, pp. 23-35. This is a slightly revised version of the original paper that appeared in the NNCM*96 Conference Record, published by Caltech, Pasadena, 1996.\n\nMoody, J., Wu, L., Liao, Y. & Saffell, M. (1998), 'Performance functions and reinforcement learning for trading systems and portfolios', Journal of Forecasting 17, 441-470.\n\nNeuneier, R. (1996), Optimal asset allocation using adaptive dynamic programming, in D. S. Touretzky, M. C. Mozer & M. E. Hasselmo, eds, 'Advances in NIPS', Vol. 8, pp. 952-958.\n\nSharpe, W. F. (1966), 'Mutual fund performance', Journal of Business, pp. 119-138.
\nTesauro, G. (1989), 'Neurogammon wins the computer olympiad', Neural Computation 1, 321-323.\n\nWatkins, C. J. C. H. (1989), Learning from Delayed Rewards, PhD thesis, Cambridge University, Psychology Department.\n\nWilliams, R. J. & Zipser, D. (1989), 'A learning algorithm for continually running fully recurrent neural networks', Neural Computation 1, 270-280.\n\nZhang, W. & Dietterich, T. G. (1996), High-performance job-shop scheduling with a time-delay TD(λ) network, in D. S. Touretzky, M. C. Mozer & M. E. Hasselmo, eds, 'Advances in NIPS', Vol. 8, pp. 1024-1030.\n", "award": [], "sourceid": 1551, "authors": [{"given_name": "John", "family_name": "Moody", "institution": null}, {"given_name": "Matthew", "family_name": "Saffell", "institution": null}]}