{"title": "Optimal Asset Allocation using Adaptive Dynamic Programming", "book": "Advances in Neural Information Processing Systems", "page_first": 952, "page_last": 958, "abstract": null, "full_text": "Optimal Asset Allocation using Adaptive Dynamic Programming \n\nRalph Neuneier* \n\nSiemens AG, Corporate Research and Development \n\nOtto-Hahn-Ring 6, D-81730 M\u00fcnchen, Germany \n\nAbstract \n\nIn recent years, the interest of investors has shifted to computerized \nasset allocation (portfolio management) to exploit the growing \ndynamics of the capital markets. In this paper, asset allocation is \nformalized as a Markovian Decision Problem which can be optimized \nby applying dynamic programming or reinforcement learning \nbased algorithms. Using an artificial exchange rate, the asset allocation \nstrategy optimized with reinforcement learning (Q-Learning) \nis shown to be equivalent to a policy computed by dynamic programming. \nThe approach is then tested on the task of investing liquid \ncapital in the German stock market. Here, neural networks are \nused as value function approximators. The resulting asset allocation \nstrategy is superior to a heuristic benchmark policy. This is \na further example which demonstrates the applicability of neural \nnetwork based reinforcement learning to a problem setting with a \nhigh-dimensional state space. \n\n1 Introduction \n\nBillions of dollars are pushed through the international capital markets every day while \nbrokers shift their investments to more promising assets. Therefore, there is great \ninterest in achieving a deeper understanding of the capital markets and in developing \nefficient tools for exploiting the dynamics of the markets. 
\n\n* Ralph.Neuneier@zfe.siemens.de, http://www.siemens.de/zfe.Jlll/homepage.html \n\nAsset allocation (portfolio management) is the investment of liquid capital in various \ntrading opportunities such as stocks, futures, foreign exchange and others. A portfolio \nis constructed with the aim of achieving a maximal expected return for a given \nrisk level and time horizon. To compose an optimal portfolio, the investor has \nto solve a difficult optimization problem consisting of two phases (Brealy, 1991). \nFirst, the expected yields are estimated simultaneously with a certainty measure. \nSecond, based on these estimates, a portfolio is constructed obeying the risk level \nthe investor is willing to accept (mean-variance techniques). The problem is further \ncomplicated if transaction costs must be considered and if the investor wants to \nrevise the decision at every time step. In recent years, neural networks (NN) have \nbeen successfully used for the first task. Typically, an NN delivers the expected \nfuture values of a time series based on data of the past. Furthermore, a confidence \nmeasure which expresses the certainty of the prediction is provided. \n\nIn the following, the modeling phase and the search for an optimal portfolio are \ncombined and embedded in the framework of Markovian Decision Problems (MDP). \nThat theory formalizes control problems within stochastic environments (Bertsekas, \n1987; Elton, 1971). If the discrete state space is small and if an accurate model of \nthe system is available, MDP can be solved by conventional Dynamic Programming \n(DP). At the other extreme, reinforcement learning methods, e.g. Q-Learning (QL), \ncan be applied to problems with large state spaces and with no appropriate model \navailable (Singh, 1994). 
\n\n2 Portfolio Management is a Markovian Decision Problem \n\nThe following simplifications do not restrict the generalization of the proposed methods \nwith respect to real applications but will help to clarify the relationship between \nMDP and portfolio optimization. \n\n\u2022 There is only one possible asset for a Deutsch-Mark based investor, say a \nforeign currency called Dollar, US-$. \n\n\u2022 The investor is small and does not influence the market by her/his trading. \n\n\u2022 The investor has no risk aversion and always invests the total amount. \n\n\u2022 The investor may trade at each time step for an infinite time horizon. \n\nMDP provide a model for multi-stage decision making problems in stochastic environments. \nMDP can be described by a finite state set \\(S = \\{1, \\ldots, n\\}\\), a finite set \n\\(U(i)\\) of admissible control actions for every state \\(i \\in S\\), a set of transition probabilities \n\\(p_{ij}\\) which describe the dynamics of the system, and a return function\\(^1\\) \n\\(r(i, j, u(i))\\), with \\(i, j \\in S\\), \\(u(i) \\in U(i)\\). Furthermore, there is a stationary policy \n\\(\\pi(i)\\), which delivers for every state an admissible action \\(u(i)\\). One can compute the \nvalue function \\(V_i^{\\pi}\\) of a given state and policy, \n\n\\[ V_i^{\\pi} = E\\left[ \\sum_{t=0}^{\\infty} \\gamma^{t} R(i_t, \\pi(i_t)) \\right], \\qquad (1) \\] \n\nwhere \\(E\\) indicates the expected value, \\(\\gamma\\) is the discount factor with \\(0 \\le \\gamma < 1\\), and \nwhere \\(R\\) are the expected returns, \\(R = E_j[r(i, j, u(i))]\\). The aim is now to find a \npolicy \\(\\pi^*\\) with the optimal value function \\(V_i^* = \\max_{\\pi} V_i^{\\pi}\\) for all states. \n\n\\(^1\\) In the MDP literature, the return often depends only on the current state \\(i\\), but the \ntheory extends to the case of \\(r = r(i, j, u(i))\\) (see Singh, 1994). \n\nIn the context discussed here, a state vector consists of elements which describe the \nfinancial time series, and of elements which quantify the current value of the investment. 
For the simple example above, the state vector is the triple of the exchange \nrate \\(x_t\\), the wealth of the portfolio \\(c_t\\), expressed in the basis currency (here DM), \nand a binary variable \\(b\\) indicating whether the investment is currently in \nDM or US-$. \n\nNote that, of the variables which form the state vector, the exchange rate is \nactually independent of the portfolio decisions, but the wealth and the returns are \nnot. Therefore, asset allocation is a control problem and cannot be reduced to pure \nprediction.\\(^2\\) This problem has the attractive feature that, because the investments \ndo not influence the exchange rate, we do not need to invest real money during the \ntraining phase of QL until we are convinced that our strategy works. \n\n3 Dynamic Programming: Off-line and Adaptive \n\nThe optimal value function \\(V^*\\) is the unique solution of the well-known Bellman \nequation (Bertsekas, 1987). According to that equation one has to maximize the \nexpected return for the next step and follow an optimal policy thereafter in order \nto achieve globally optimal behavior (Bertsekas, 1987). An optimal policy can be \neasily derived from \\(V^*\\) by choosing a \\(\\pi(i)\\) which satisfies the Bellman equation. For \nnonlinear systems and non-quadratic cost functions, \\(V^*\\) is typically found by using an \niterative algorithm, value iteration, which converges asymptotically to \\(V^*\\). Value \niteration repeatedly applies the operator \\(T\\) for all states \\(i\\), \n\n\\[ T(V_i) = \\max_{u(i)} \\left[ R(i, u(i)) + \\gamma \\sum_{j} p_{ij} V_j \\right]. \\qquad (2) \\] \n\nValue iteration assumes that the expected return function \\(R(i, u(i))\\) and the transition \nprobabilities \\(p_{ij}\\) (i.e. the model) are known. Q-Learning (QL) is a \nreinforcement-learning method that does not require a model of the system but \noptimizes the policy by sampling state-action pairs and returns while interacting \nwith the system (Barto, 1989). Let us assume that the investor executes action \\(u(i)\\) \nat state \\(i\\), and that the system moves to a new state \\(j\\). 
Let \\(r(i, j, u(i))\\) denote the \nactual return. QL then uses the update equation \n\n\\[ \\begin{aligned} Q(i, u(i)) &= (1 - \\eta)\\, Q(i, u(i)) + \\eta \\left( r(i, j, u(i)) + \\gamma \\max_{u(j)} Q(j, u(j)) \\right), \\\\ Q(k, v) &= Q(k, v) \\quad \\text{for all } k \\neq i \\text{ and } v \\neq u(i), \\end{aligned} \\qquad (3) \\] \n\nwhere \\(\\eta\\) is the learning rate and \\(Q(i, u(i))\\) are the tabulated Q-values. One can \nprove that this relaxation algorithm converges (under some conditions) to the optimal \nQ-values (Singh, 1994). \n\n\\(^2\\) To be more precise, the problem only becomes a multi-stage decision problem if the \ntransaction costs are included in the problem. \n\nThe selection of the action \\(u(i)\\) should be guided by the trade-off between exploration \nand exploitation. In the beginning, the actions are typically chosen randomly \n(exploration) and in the course of training, actions with larger Q-values are chosen \nwith increasingly higher probability (exploitation). The implementation in the \nfollowing experiments is based on the Boltzmann distribution using the current Q-values \nand a slowly decreasing temperature parameter (see Barto, 1989). \n\n4 Experiment I: Artificial Exchange Rate \n\nIn this section we use an exchange-rate model to demonstrate how DP and Q-Learning \ncan be used to optimize asset allocation. \n\nThe artificial exchange rate \\(x_t\\) is in the range between 1 and 2 representing the \nvalue of 1 US-$ in DM. The transition probabilities \\(p_{ij}\\) of the exchange rate are \nchosen to simulate a situation where \\(x_t\\) follows an increasing trend, but with \nhigher values of \\(x_t\\), a drop to very low values becomes more and more probable. \nA realization of the time series is plotted in the upper part of fig. 2. The random \nstate variable \\(c_t\\) depends on the investor's decisions \\(u_t\\), and is further influenced by \n\\(x_t\\), \\(x_{t+1}\\) and \\(c_{t-1}\\). 
A complete state vector consists of the current exchange rate \\(x_t\\) \nand the capital \\(c_t\\), which is always calculated in the basis currency (DM). Its sign \nrepresents the current currency, i.e., \\(c_t = -1.2\\) stands for an investment in US-$ \nworth 1.2 DM, and \\(c_t = 1.2\\) for a capital of 1.2 DM. \\(c_t\\) and \\(x_t\\) are discretized in 10 \nbins each. The transaction costs \\(\\xi = 0.1 + |c/100|\\) are a combination of fixed (0.1) \nand variable costs (\\(|c/100|\\)). Transaction costs only apply if the currency is changed \nfrom DM to US-$. The immediate return \\(r_t(x_t, c_t, x_{t+1}, u_t)\\) is computed as in table \n1. If the decision has been made to change the portfolio into DM or to keep the \ncurrent portfolio in DM, \\(u_t\\) = DM, then the return is always zero. If the decision \nhas been made to change the portfolio into US-$ or to keep the current portfolio in \nUS-$, \\(u_t\\) = US-$, then the return is equal to the relative change of the exchange \nrate weighted with \\(c_t\\). That return is reduced by the transaction costs \\(\\xi\\) if the \ninvestor has to change into US-$. \n\nTable 1: The immediate return function. \n\n                     \\(u_t\\) = US-$                                                             \\(u_t\\) = DM \n\\(c_t \\in\\) DM      relative change of \\(x_t\\) weighted with \\(c_t\\), reduced by \\(\\xi\\)      0 \n\\(c_t \\in\\) US-$    relative change of \\(x_t\\) weighted with \\(c_t\\)                           0 \n\nThe success of the strategies was tested on a realization (2000 data points) of the \nexchange rate. The initial investment is 1 DM; at each time step the algorithm has \nto decide either to change the currency or to remain in the present currency. \n\nAs a reinforcement learning method, QL has to interact with the environment to \nlearn optimal behavior. Thus, a second set of 2000 data points was used to learn the Q-values. \nThe training phase is divided into epochs. Each epoch consists of as many \ntrials as there are data points in the training set. At every trial the algorithm looks at \\(x_t\\), \nrandomly chooses a portfolio value \\(c_t\\) and selects a decision. Then the immediate \nreturn and the new state are evaluated to apply eq. 3. The Q-values were initialized \nwith zero, and the learning rate \\(\\eta\\) was 0.1. Convergence was achieved after 4 epochs. \n\nFigure 1: The optimal decisions (left) and value function (right). \n\nFigure 2: The exchange rate (top), the capital and the decisions (bottom). \n\nTo evaluate the solution QL has found, the DP algorithm from eq. 2 was implemented \nusing the given transition probabilities. The convergence of DP was very \nfast. Only 5 iterations were needed until the average difference between successive \nvalue functions was lower than 0.01. That means 500 updates in comparison to \n8000 updates with QL. \n\nThe solutions were identical with respect to the resulting policy, which is plotted in \nfig. 1, left. It can clearly be seen that there is a difference between the policy of \na DM-based and a US-$-based portfolio. If one has already changed the capital to \nUS-$, then it is advisable to keep the portfolio in US-$ until the risk gets too high, \ni.e. \\(x_t \\in \\{1.8, 1.9\\}\\). On the other hand, if \\(c_t\\) is still in DM, the risk barrier moves \nto lower values depending on the volume of the portfolio. The reason is that the \npotential gain by an increasing exchange rate has to cover the fixed and variable \ntransaction costs. For very low values of \\(c_t\\), it is inadvisable to change even at low \\(x_t\\) \nbecause the fixed transaction costs will be higher than any gain. Figure 2 plots the \nexchange rate \\(x_t\\), the accumulated capital \\(c_t\\) for 100 days, and the decisions \\(u_t\\). \n\nLet us look at a few interesting decisions. 
At the beginning, \\(t = 0\\), the portfolio was \nchanged immediately to US-$ and kept there for 13 steps until a drop to low rates \n\\(x_t\\) became very probable. During the time steps 35-45, the exchange rate oscillated \nat high values. The policy insisted on the DM portfolio because the \nrisk was too high. In contrast, looking at the time steps 24 to 28, the policy first \nswitched back to DM, then there was a small decrease of \\(x_t\\) which was sufficient to \nlet the investor change again. The following increase justified that decision. The \nsuccess of the resulting strategy can be easily recognized by the continuous increase \nof the portfolio. Note that the ups and downs of the portfolio curve grow \nin magnitude towards the end because the investor has no risk aversion and always trades \nthe whole capital. \n\n5 Experiment II: German Stock Index DAX \n\nIn this section the approach is tested on a real-world task: assume that an investor \nwishes to invest her/his capital into a block of stocks which behaves like the German \nstock index DAX. We based the benchmark strategy (short: MLP) on an NN model \nwhich was built to predict the daily changes of the DAX (for details, see Dichtl, \n1995). If the predicted next-day DAX difference is positive, then the capital \nis invested in the DAX, otherwise in DM. The input vector of the NN model was \ncarefully optimized for optimal prediction. We used these inputs (the DAX itself \nand 11 other influencing market variables) as the market description part of the \nstate vector for QL. In order to store the value functions, two NNs, one for each \naction, each with 8 nonlinear hidden neurons and one linear output, are used. \n\nThe data is split into a training set (from 2. Jan. 1986 to 31. Dec. 1992) and a test set \n(from 2. Jan. 1993 to 18. Oct. 1995). 
The return function is defined in the same \nway as in section 4 using 0.4% as proportional costs and 0.001 units as fixed costs, \nwhich are realistic for financial institutions. The training proceeds as outlined in \nthe previous section with \\(\\eta = 0.001\\) for 1000 epochs. \n\nIn fig. 3 the development of a reinvested capital is plotted for the optimized (upper \nline) and the MLP strategy (middle line). The DAX itself is also plotted but with \na scaling factor to fit it into the figure (lower line). The resulting policy by QL \nclearly beats the benchmark strategy: the extra return amounts to 80% at \nthe end of the training period and to 25% at the end of the test phase. A closer \nlook at some statistics can explain the success. The QL policy proposes about as \noften as the MLP policy to invest in DAX, but the number of changes from DM \nto DAX and vice versa is much lower (see table 2). Furthermore, it seems that the QL \nstrategy keeps the capital out of the market if there is no significant trend to follow \nand the market shows too much volatility (see fig. 3 with straight horizontal lines \nof the capital development curve indicating no investments). An extensive analysis \nof the resulting strategy will be the topic of future research. \n\nIn a further experiment the NNs which store the Q-values are initialized to imitate \nthe MLP strategy. In some runs the number of necessary epochs was reduced by \na factor of 10. But often the QL algorithm took longer to converge because the \ninitialization ignores the input elements which describe the investor's capital and \ntherefore led to a bad starting point in the weight space. \n\nFigure 3: The development of a reinvested capital on the training (left) and test set \n(right). The lines from top to bottom: QL-strategy, MLP-strategy, scaled DAX. 
\n\nTable 2: Some statistics of the policies. \n\n                        DAX investments            position changes \n               Data    MLP policy   QL policy    MLP policy   QL policy \nTraining set   1825    1005         1020         904          284 \nTest set       729     395          434          344          115 \n\n6 Conclusions and Future Work \n\nIn this paper, the task of asset allocation/portfolio management was approached \nby reinforcement learning algorithms. QL was successfully utilized in combination \nwith NNs as value function approximators in a high-dimensional state space. \n\nFuture work has to address the possibility of several alternative investment opportunities \nand to clarify the connection to the classical mean-variance approach of \nprofessional brokers. The benchmark strategy in the real-world experiment is in \nfact a neuro-fuzzy model which allows the extraction of useful rules after learning. \nIt will be interesting to use that network architecture to approximate the value \nfunction in order to achieve a deeper insight into the resulting optimized strategy. \n\nReferences \n\nBarto A. G., Sutton R. S. and Watkins C. J. C. H. (1989), Learning and Sequential Decision \nMaking, COINS TR 89-95. \nBertsekas D. P. (1987), Dynamic Programming, NY: Wiley. \nSingh P. S. (1993), Learning to Solve Markovian Decision Processes, CMPSCI TR 93-77. \nNeuneier R. (1995), Optimal Strategies with Density-Estimating Neural Networks, ICANN \n95, Paris. \nBrealy R. A., Myers S. C. (1991), Principles of Corporate Finance, McGraw-Hill. \nWatkins C. J., Dayan P. (1992), Technical Note: Q-Learning, Machine Learning 8, 3/4. \nElton E. J., Gruber M. J. (1971), Dynamic Programming Applications in Finance, The \nJournal of Finance, 26/2. \nDichtl H. (1995), Die Prognose des DAX mit Neuro-Fuzzy, Master's thesis, engl. abstract \nin preparation. \n", "award": [], "sourceid": 1121, "authors": [{"given_name": "Ralph", "family_name": "Neuneier", "institution": null}]}