{"title": "Enhancing Q-Learning for Optimal Asset Allocation", "book": "Advances in Neural Information Processing Systems", "page_first": 936, "page_last": 942, "abstract": "", "full_text": "Enhancing Q-Learning for \nOptimal Asset Allocation \n\nRalph Neuneier \n\nSiemens AG, Corporate Technology \n\nD-81730 MUnchen, Germany \n\nRalph.Neuneier@mchp.siemens.de \n\nAbstract \n\nThis paper enhances the Q-Iearning algorithm for optimal asset alloca(cid:173)\ntion proposed in (Neuneier, 1996 [6]). The new formulation simplifies \nthe approach by using only one value-function for many assets and al(cid:173)\nlows model-free policy-iteration. After testing the new algorithm on \nreal data, the possibility of risk management within the framework of \nMarkov decision problems is analyzed. The proposed methods allows \nthe construction of a multi-period portfolio management system which \ntakes into account transaction costs, the risk preferences of the investor, \nand several constraints on the allocation. \n\n1 Introduction \nAsset allocation and portfolio management deal with the distribution of capital to various \ninvestment opportunities like stocks, bonds, foreign exchanges and others. The aim is to \nconstruct a portfolio with a maximal expected return for a given risk level and time horizon \nwhile simultaneously obeying institutional or legally required constraints. To find such an \noptimal portfolio the investor has to solve a difficult optimization problem consisting of \ntwo phases [4]. First, the expected yields together with a certainty measure has to be pre(cid:173)\ndicted. Second, based on these estimates, mean-variance techniques are typically applied \nto find an appropriate fund allocation. The problem is further complicated if the investor \nwants to revise herlhis decision at every time step and if transaction costs for changing the \nallocations must be considered. 
\n\n[Figure: closed loop between the financial market and the investor — the market, driven by disturbances, emits return rates and prices; the investor observes them and feeds investments back. Markov Decision Problem legend: state x_t = ($_t, K_t), market $_t and portfolio K_t; policy mu, actions a_t = mu(x_t); transition probabilities p(x_{t+1}|x_t); return function r(x_t, a_t, $_{t+1}).] \n\nWithin the framework of Markov Decision Problems, MDPs, the modeling phase and the search for an optimal portfolio can be combined (fig. above). Furthermore, transaction costs, constraints, and decision revision are naturally integrated. The theory of MDPs formalizes control problems within stochastic environments [1]. If the discrete state space is small and if an accurate model of the system is available, MDPs can be solved by conventional Dynamic Programming, DP. On the other extreme, reinforcement learning methods using function approximators and stochastic approximation for computing the relevant expectation values can be applied to problems with large (continuous) state spaces and without an appropriate model available [2, 10]. \n\nIn [6], asset allocation is formalized as an MDP under the following assumptions which clarify the relationship between MDP and portfolio optimization: \n\n1. The investor may trade at each time step for an infinite time horizon. \n2. The investor is not able to influence the market by her/his trading. \n3. There are only two possible assets for investing the capital. \n4. The investor has no risk aversion and always invests the total amount. \n\nThe reinforcement learning algorithm Q-Learning, QL, has been tested on the task to invest liquid capital in the German stock market DAX, using neural networks as value function approximators for the Q-values Q(x, a). The resulting allocation strategy generated more profit than a heuristic benchmark policy [6]. 
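The state, action, and transition structure above can be sketched in code (a hypothetical minimal encoding with one scalar market variable and two assets, not the paper's implementation):

```python
from dataclasses import dataclass

@dataclass
class State:
    market: float    # $_t: the market part, simplified here to one index level
    capital: float   # value of the investor's capital K_t
    in_stock: bool   # allocation: True = invested in the index, False = cash

def step(x, invest, next_market, cost_rate=0.002):
    # One transition x_t -> x_{t+1}: pay transaction costs if the allocation
    # changes, then let the (uncontrollable) market move and update the capital.
    capital = x.capital
    if invest != x.in_stock:
        capital *= 1.0 - cost_rate
    growth = next_market / x.market if invest else 1.0
    new_capital = capital * growth
    reward = new_capital - x.capital          # immediate return r_t
    return State(next_market, new_capital, invest), reward

# Usage: the market moves from 100 to 102 while we switch from cash to stock.
x1, r = step(State(market=100.0, capital=1.0, in_stock=False),
             invest=True, next_market=102.0)
```

Note how the reward couples the action (via transaction costs) to the market move, which is what makes this a control problem rather than a pure prediction problem.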
\n\nHere, a new formulation of the QL algorithm is proposed which makes it possible to relax the third assumption. Furthermore, in section 3 the possibility of risk control within the MDP framework is analyzed, which relaxes assumption four. \n\n2 Q-Learning with uncontrollable state elements \nThis section explains how the QL algorithm can be simplified by the introduction of an artificial deterministic transition step. Using real data, the successful application of the new algorithm is demonstrated. \n\n2.1 Q-Learning for asset allocation \nThe situation of an investor is formalized at time step t by the state vector x_t = ($_t, K_t), which consists of elements $_t describing the financial market (e. g. interest rates, stock indices), and of elements K_t describing the investor's current allocation of the capital (e. g. how much capital is invested in which asset). The investor's decision a_t for a new allocation and the dynamics on the financial market let the state switch to x_{t+1} = ($_{t+1}, K_{t+1}) according to the transition probability p(x_{t+1}|x_t, a_t). Each transition results in an immediate return r_t = r(x_t, x_{t+1}, a_t) which incorporates possible transaction costs depending on the decision a_t and the change of the value of K_t due to the new values of the assets at time t + 1. The aim is to maximize the expected discounted sum of the returns, V*(x) = E(sum_{t=0}^infinity gamma^t r_t | x_0 = x), by following an optimal stationary policy mu*(x_t) = a_t. For a discrete finite state space the solution can be stated as the recursive Bellman equation: \n\nV*(x_t) = max_a [ sum_{x_{t+1}} p(x_{t+1}|x_t, a) r_t + gamma sum_{x_{t+1}} p(x_{t+1}|x_t, a) V*(x_{t+1}) ].   (1) \n\nA more useful formulation defines a Q-function Q*(x, a) of state-action pairs (x_t, a_t), to allow the application of an iterative stochastic approximation scheme, called Q-Learning [11]. 
The Q-value Q*(x_t, a_t) quantifies the expected discounted sum of returns if one executes action a_t in state x_t and follows an optimal policy thereafter, i. e. V*(x_t) = max_a Q*(x_t, a). Observing the tuple (x_t, x_{t+1}, a_t, r_t), the tabulated Q-values are updated in the k + 1 iteration step with learning rate eta_k according to: \n\nQ^(k+1)(x_t, a_t) = Q^(k)(x_t, a_t) + eta_k [ r_t + gamma max_a Q^(k)(x_{t+1}, a) - Q^(k)(x_t, a_t) ]. \n\nIt can be shown that the sequence of Q^(k) converges under certain assumptions to Q*. If the Q-values Q*(x, a) are approximated by separate neural networks with weight vector w^a for different actions a, Q*(x, a) ≈ Q(x; w^a), the adaptations (called NN-QL) are based on the temporal differences d_t: \n\nd_t := r(x_t, a_t, x_{t+1}) + gamma max_{a in A} Q(x_{t+1}; w^a) - Q(x_t; w^{a_t}). \n\nNote that although the market dependent part $_t of the state vector is independent of the investor's decisions, the future wealth K_{t+1} and the returns r_t are not. Therefore, asset allocation is a multi-stage decision problem and may not be reduced to pure prediction if transaction costs must be considered. On the other hand, the attractive feature that the decisions do not influence the market makes it possible to approximate the Q-values using historical data of the financial market. We need not invest real money during the training phase. \n\n2.2 Introduction of an artificial deterministic transition \nNow, the Q-values are reformulated in order to make them independent of the actions chosen at the time step t. Due to assumption 2, which states that the investor cannot influence the market by the trading decisions, the stochastic process of the dynamics of $_t is an uncontrollable Markov chain. This allows the introduction of a deterministic intermediate step between the transition from x_t to x_{t+1} (see fig. below). 
After the investor has chosen an action a_t, the capital K_t changes to K'_t because he/she may have paid transaction costs c_t = c(K_t, a_t), and K'_t reflects the new allocation whereas the state of the market, $_t, remains the same. Because the costs c_t are known in advance, this transition is deterministic and controllable. Then, the market switches stochastically to $_{t+1} and generates the immediate return r'_t = r'($_t, K'_t, $_{t+1}), i. e. r_t = c_t + r'_t. The capital changes to K_{t+1} = r'_t + K'_t. This transition is uncontrollable by the investor. V*($, K) = V*(x) is now computed using the costs c_t and returns r'_t (compare also eq. 1). \n\n[Figure: timeline of one transition — at time t the investor in state ($_t, K_t) takes action a_t and pays costs c_t, reaching the deterministic intermediate state ($_t, K'_t) with value Q($_t, K'_t); the market then moves stochastically, yielding the return r'_t and the next state ($_{t+1}, K_{t+1}).] \n\nDefining Q*($_t, K'_t) as the Q-values of the intermediate time step, \n\nQ*($_t, K'_t) = E[ r'($_t, K'_t, $_{t+1}) + gamma V*($_{t+1}, K_{t+1}) ], \n\ngives rise to the optimal value function and policy (time indices are suppressed), \n\nV*($, K) = max_a [ c(K, a) + Q*($, K') ], \nmu*($, K) = argmax_a [ c(K, a) + Q*($, K') ]. \n\nDefining the temporal differences d_t for the approximation Q^(k) as \n\nd_t := r'($_t, K'_t, $_{t+1}) + gamma max_a [ c(K_{t+1}, a) + Q^(k)($_{t+1}, K'_{t+1}) ] - Q^(k)($_t, K'_t) \n\nleads to the update equations for the Q-values represented by tables or networks: \n\nQLU:     Q^(k+1)($_t, K'_t) = Q^(k)($_t, K'_t) + eta_k d_t, \nNN-QLU:  w^(k+1) = w^(k) + eta_k d_t grad Q($_t, K'_t; w^(k)). \n\nThe simplification is now obvious, because (NN-)QLU only needs one table or neural network no matter how many assets are concerned. This may lead to a faster convergence and better results. The training algorithm boils down to the iteration of the following steps: \n\nQLU for optimal investment decisions \n\n1. draw randomly patterns $_t, $_{t+1} from the data set, \n   draw randomly an asset allocation K'_t \n\n2. 
for all possible actions a: \n\ncompute rf, c(Kt+b a), Q(k)($t+b K:+I) \n\n3. compute temporal difference dt \n\n4. compute new value Q(k+1)($t, Kn resp. Q($t, K:; w(k+1\u00bb) \n\n5. stop, ifQ-values have converged, otherwise go to 1 \n\nSince QLU is equivalent to Q-Leaming, QLU converges to the optimal Q-values under the \nsame conditions as QL (e. g [2]). The main advantage of (NN-)QLU is that this algorithm \nonly needs one value function no matter how many assets are concerned and how fine the \ngrid of actions are: \n\nQ*(($,K),a) = c(K,a) + Q*($,K'). \n\nInterestingly, the convergence to an optimal policy of QLU does not rely on an explicit \n\nexploration strategy because the randomly chosen capital K: in step 1 simulates a random \n\naction which was responsible for the transition from K t . In combination with the randomly \nchosen market state $t, a sufficient exploration of the action and state space is guaranteed. \n\n2.3 M\\ldel-free policy-iteration \nThe refonnulation also allows the design of a policy iteration algorithm by alternating a \npolicy evaluation phase (PE) and a policy improvement (PI) step. Defining the temporal \ndifferences dt for the approximation Q~I of the policy JlI in the k step ofPE \ndt := r' ($t, K;, $t+d + ,[c(Kt+I, JlI ($t+l, K t+1 )) + Q(k) (K:+ 1 , $t+d] - Q(k)(K;, $t} \n\nleads to the following update equation for tabulated Q-values \n\nQ(k+l)($ K') Q(k)($ K\") \n\nt = IJ.I \n\nt, \n\nJJI \n\nt, \n\nt + 1/k t\u00b7 \n\nd \n\n\f940 \n\nR. Neuneier \n\nAfter convergence, one can improve the policy J-li to J-lI+l by \n\nJ-l1+I($t, Kt} = arg max[c(Kt , a) + QJ.'I ($t, KD] . \n\na \n\nBy alternating the two steps PE and PI, the sequence of policies [J-l1 (x )]1=0, ... converges \nunder the typical assumptions to the optimal policy J-l* (x) [2] . \n\nNote, that policy iteration is normally not possible using classical QL, if one has not an \nappropriate model at hand. 
The introduction of the deterministic intermediate step makes it possible to start with an initial strategy (e. g. given by a broker), which can be subsequently optimized by model-free policy iteration trained with historical data of the financial market. Generalization to parameterized value functions is straightforward. \n\n2.4 Experiments on the German Stock Index DAX \nThe NN-QLU algorithm is now tested on a real world task: assume that an investor wishes to invest her/his capital into a portfolio of stocks which behaves like the German stock index DAX. Her/his alternative is to keep the capital in the certain asset cash, referred to as DM. We compare the resulting strategy with three benchmarks, namely Neuro-Fuzzy, Buy&Hold and the naive prediction. The Buy&Hold strategy invests at the first time step in the DAX and only sells at the end. The naive prediction invests if the past return of the DAX has been positive, and vice versa. The third is based on a Neuro-Fuzzy model which was optimized to predict the daily changes of the DAX [8]. The heuristic benchmark strategy is then constructed by taking the sign of the prediction as a trading signal, such that a positive prediction leads to an investment in stocks. The input vector of the Neuro-Fuzzy model, which consists of the DAX itself and 11 other influencing market variables, was carefully optimized for optimal prediction. These inputs also constitute the $_t part of the state vector which describes the market within the NN-QLU algorithm. The data is split into a training set (from 2. Jan. 1986 to 31. Dec. 1994) and a test set (from 2. Jan. 1993 to 1. Aug. 1996). The transaction costs (c_t) are 0.2% of the invested capital if K_t is changed from DM to DAX, which are realistic for financial institutions. Referring to an epoch as one loop over all training patterns, the training proceeds as outlined in the previous section for 10000 epochs with eta_k = eta_0 * 0.999^k and start value eta_0 = 0.05. \n\nTable 1: Comparison of the profitability of the strategies, the number of position changes and investments in DAX for the test (training) data. \n\nstrategy         | profit      | investments in DAX | position changes \nNN-QLU           | 1.60 (3.74) | 70 (73)%           | 30 (29)% \nNeuro-Fuzzy      | 1.35 (1.98) | 53 (53)%           | 50 (52)% \nNaive Prediction | 0.80 (1.06) | 51 (51)%           | 51 (48)% \nBuy&Hold         | 1.21 (1.46) | 100 (100)%         | 0 (0)% \n\nThe strategy constructed with the NN-QLU algorithm, using a neural network with 8 hidden neurons and a linear output, clearly beats the benchmarks. The capital at the end of the test set (training set) exceeds the second best strategy Neuro-Fuzzy by about 18.5% (89%) (fig. 1). One reason for this success is that QLU changes the position less often and thus avoids expensive transaction costs. The Neuro-Fuzzy policy changes almost every second day whereas NN-QLU changes only every third day (see tab. 1). \n\nIt is interesting to analyze the learning behavior during training by evaluating the strategies of NN-QLU after each epoch. At the beginning, the policies suggest either to change almost never or to invest in the DAX each time. After some thousand epochs, these bang-bang strategies start to differentiate. Simultaneously, the more complex the strategies become, the more profit they generate (fig. 2). \n\nFigure 1: Comparison of the development of the capital for the test set (left) and the training set (right). The NN-QLU strategy clearly beats all the benchmarks. 
\n\nFigure 2: Training course: percentage of DAX investments (left), profitability measured as the average return over 60 days on the training set (right). \n\n3 Controlling the Variance of the Investment Strategies \n3.1 Risk-adjusted MDPs \nPeople are not only interested in maximizing the return, but also in controlling the risk of their investments. This has been formalized in the Markowitz portfolio-selection, which aims for an allocation with the maximal expected return for a given risk level [4]. Given a stationary policy mu(x) with finite state space, the associated value function V^mu(x) and its variance sigma^2(V^mu(x)) can be defined as \n\nV^mu(x) = E[ sum_{t=0}^infinity gamma^t r(x_t, mu(x_t), x_{t+1}) | x_0 = x ], \n\nsigma^2(V^mu(x)) = E[ ( sum_{t=0}^infinity gamma^t r(x_t, mu(x_t), x_{t+1}) - V^mu(x) )^2 | x_0 = x ]. \n\nThen, an optimal strategy mu*(x; lambda) for a risk-adjusted MDP (see [9], p. 410, for variance-penalized MDPs) is \n\nmu*(x; lambda) = argmax_mu [ V^mu(x) - lambda sigma^2(V^mu(x)) ]   for lambda > 0. \n\nBy variation of lambda, one can construct so-called efficient portfolios which have minimal risk for each achievable level of expected return. But in comparison to classical portfolio theory, this approach manages multi-period portfolio management systems including transaction costs. Furthermore, typical min-max requirements on the trading volume and other allocation constraints can be easily implemented by constraining the action space. \n\n3.2 Non-linear Utility Functions \nIn general, it is not possible to compute sigma^2(V^mu(x)) with (approximate) dynamic programming or reinforcement techniques, because sigma^2(V^mu(x)) cannot be written in a recursive Bellman equation. One solution to this problem is the use of a return function r_t which penalizes high variance. 
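The effect of such a variance-penalizing return function can be illustrated numerically (a minimal sketch; the two return series below are made-up examples with equal arithmetic mean):

```python
import math
from statistics import mean, pstdev

# Two made-up single-period return series with the same arithmetic mean of 1%:
smooth   = [0.01, 0.01, 0.01, 0.01]
volatile = [0.10, -0.08, 0.10, -0.08]

def sharpe(returns):
    # mean of the single returns over their standard deviation
    s = pstdev(returns)
    return mean(returns) / s if s > 0 else math.inf

def avg_log_utility(returns):
    # r = log(new value / old value): concave, so losses hurt more than equally
    # sized gains help, and volatile series score lower than smooth ones
    return mean(math.log(1.0 + r) for r in returns)
```

Both series have the same average return, yet the concave log-return (like the Sharpe ratio) ranks the smooth series higher, which is the variance-penalizing behavior wanted here.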
In financial analysis, the Sharpe ratio, which relates the mean of the single returns to their standard deviation, i. e. r/sigma(r), is often employed to describe the smoothness of an equity curve. For example, Moody has developed a Sharpe-ratio based error function and combines it with a recursive training procedure [5] (see also [3]). The limitation of the Sharpe ratio is that it also penalizes upside volatility. For this reason, the use of a utility function with a negative second derivative, typical for risk averse investors, seems to be more promising. For such return functions an additional unit increase is less valuable than the last unit increase [4]. An example is r = log(new portfolio value / old portfolio value), which also penalizes losses much more strongly than gains. The Q-function Q(x, a) may lead to intermediate values of a* as shown in the figure below. \n\n[Figure: Q-values plotted against the percentage of investment in the uncertain asset (left) and against the relative change of the portfolio value in % (right); with a concave utility the maximum lies at an intermediate allocation a*.] \n\n4 Conclusion and Future Work \nTwo improvements of Q-learning have been proposed to bridge the gap between classical portfolio management and asset allocation with adaptive dynamic programming. It is planned to apply these techniques within the framework of a European Community sponsored research project in order to design a decision support system for strategic asset allocation [7]. Future work includes approximations and variational methods to compute explicitly the risk sigma^2(V^mu(x)) of a policy. \n\nReferences \n[1] D. P. Bertsekas. Dynamic Programming and Optimal Control, vol. 1. Athena Scientific, 1995. \n[2] D. P. Bertsekas and J. N. Tsitsiklis. Neuro-Dynamic Programming. Athena Scientific, 1996. 
\n[3] M. Choey and A. S. Weigend. Nonlinear trading models through Sharpe Ratio maximization. In Proc. of NNCM'96, 1997. World Scientific. \n[4] E. J. Elton and M. J. Gruber. Modern Portfolio Theory and Investment Analysis. 1995. \n[5] J. Moody, L. Wu, Y. Liao, and M. Saffell. Performance Functions and Reinforcement Learning for Trading Systems and Portfolios. Journal of Forecasting, 1998. Forthcoming. \n[6] R. Neuneier. Optimal asset allocation using adaptive dynamic programming. In Proc. of Advances in Neural Information Processing Systems, vol. 8, 1996. \n[7] R. Neuneier, H. G. Zimmermann, P. Hierve, and P. Nairn. Advanced Adaptive Asset Allocation. EU Neuro-Demonstrator, 1997. \n[8] R. Neuneier, H. G. Zimmermann, and S. Siekmann. Advanced Neuro-Fuzzy in Finance: Predicting the German Stock Index DAX, 1996. Invited presentation at ICONIP'96, Hong Kong; available by email from Ralph.Neuneier@mchp.siemens.de. \n[9] M. L. Puterman. Markov Decision Processes. John Wiley & Sons, 1994. \n[10] S. P. Singh. Learning to Solve Markovian Decision Processes. CMPSCI TR 93-77, University of Massachusetts, November 1993. \n[11] C. J. C. H. Watkins and P. Dayan. Technical Note: Q-Learning. Machine Learning, 8(3/4):279-292, May 1992. \n", "award": [], "sourceid": 1427, "authors": [{"given_name": "Ralph", "family_name": "Neuneier", "institution": null}]}