{"title": "Risk Sensitive Reinforcement Learning", "book": "Advances in Neural Information Processing Systems", "page_first": 1031, "page_last": 1037, "abstract": null, "full_text": "Learning a Continuous Hidden Variable \n\nModel for Binary Data \n\nDaniel D. Lee \nBell Laboratories \n\nLucent Technologies \nMurray Hill, NJ 07974 \nddlee~bell-labs.com \n\nHaim Sompolinsky \n\nRacah Institute of Physics and \nCenter for Neural Computation \n\nHebrew University \n\nJerusalem, 91904, Israel \nhaim~fiz.huji . ac.il \n\nAbstract \n\nA directed generative model for binary data using a small number \nof hidden continuous units is investigated. A clipping nonlinear(cid:173)\nity distinguishes the model from conventional principal components \nanalysis. The relationships between the correlations of the underly(cid:173)\ning continuous Gaussian variables and the binary output variables \nare utilized to learn the appropriate weights of the network. The \nadvantages of this approach are illustrated on a translationally in(cid:173)\nvariant binary distribution and on handwritten digit images. \n\nIntroduction \n\nPrincipal Components Analysis (PCA) is a widely used statistical technique for rep(cid:173)\nresenting data with a large number of variables [1]. It is based upon the assumption \nthat although the data is embedded in a high dimensional vector space, most of \nthe variability in the data is captured by a much lower climensional manifold. In \nparticular for PCA, this manifold is described by a linear hyperplane whose char(cid:173)\nacteristic directions are given by the eigenvectors of the correlation matrix with \nthe largest eigenvalues. The success of PCA and closely related techniques such as \nFactor Analysis (FA) and PCA mixtures clearly indicate that much real world data \nexhibit the low dimensional manifold structure assumed by these models [2, 3]. \nHowever, the linear manifold structure of PCA is not appropriate for data with \nbinary valued variables. 
Binary values commonly occur in data such as computer bit streams, black-and-white images, on-off outputs of feature detectors, and electrophysiological spike train data [4]. The Boltzmann machine is a neural network model that incorporates hidden binary spin variables, and in principle it should be able to model binary data with arbitrary spin correlations [5]. Unfortunately, the computational time needed for training a Boltzmann machine renders it impractical for most applications.

516
D. D. Lee and H. Sompolinsky

Figure 1: Generative model for N-dimensional binary data using a small number P of continuous hidden variables.

In these proceedings, we present a model that uses a small number of continuous hidden variables rather than hidden binary variables to capture the variability of binary valued visible data. The generative model differs from conventional PCA because it incorporates a clipping nonlinearity. The resulting spin configurations have an entropy related to the number of hidden variables used, and the resulting states are connected by small numbers of spin flips. The learning algorithm is particularly simple, and is related to PCA by a scalar transformation of the correlation matrix.

Generative Model

Figure 1 shows a schematic diagram of the generative process. As in PCA, the model assumes that the data is generated by a small number P of continuous hidden variables y_i. Each of the hidden variables is assumed to be drawn independently from a normal distribution with unit variance:

P(y_i) = exp(-y_i^2 / 2) / sqrt(2 pi).   (1)

The continuous hidden variables are combined using the feedforward weights W_ij,

x_i = sum_{j=1}^{P} W_ij y_j,   (2)

and the N binary output units are then calculated using the sign of the feedforward activations:

s_i = sgn(x_i).   (3)

Since binary data is commonly obtained by thresholding, it seems reasonable that a proper generative model should incorporate such a clipping nonlinearity. The generative process is similar to that of a sigmoidal belief network with continuous hidden units at zero temperature. The nonlinearity will alter the relationship between the correlations of the binary variables and the weight matrix W as described below.

The real-valued Gaussian variables x_i are exactly analogous to the visible variables of conventional PCA. They lie on a linear hyperplane determined by the span of the matrix W, and their correlation matrix is given by:

C^XX = <x x^T> = W W^T.   (4)

Learning a Continuous Hidden Variable Model for Binary Data
517

Figure 2: Binary spin configurations s_i in the vector space of continuous hidden variables y_j with P = 2 and N = 3.

By construction, the correlation matrix C^XX has rank P, which is much smaller than the number of components N. Now consider the binary output variables s_i = sgn(x_i). Their correlations can be calculated from the probability distribution of the Gaussian variables x_i:

(C^SS)_ij = <s_i s_j> = integral prod_k dy_k P(y_k) sgn(x_i) sgn(x_j),   (5)

which reduces to an integral over the bivariate Gaussian distribution of the pair (x_i, x_j) with covariance given by the corresponding entries of C^XX.   (6)

The integrals in Equation 5 can be done analytically, and yield the surprisingly simple result:

(C^SS)_ij = (2/pi) arcsin( C^XX_ij / sqrt(C^XX_ii C^XX_jj) ).   (7)

Thus, the correlations of the clipped binary variables C^SS are related to the correlations of the corresponding Gaussian variables C^XX through the nonlinear arcsine function. The normalization in the denominator of the arcsine argument reflects the fact that the sign function is unchanged by a scale change in the Gaussian variables.
Although the correlation matrix C^SS and the generating correlation matrix C^XX are easily related through Equation 7, they have qualitatively very different properties. In general, the correlation matrix C^SS will no longer have the low rank structure of C^XX. As illustrated by the translationally invariant example in the next section, the spectrum of C^SS may contain a whole continuum of eigenvalues even though C^XX has only a few nonzero eigenvalues.

PCA is typically used for dimensionality reduction of real variables; can this model be used for compressing the binary outputs s_i? Although the output correlations C^SS no longer display the low rank structure of the generating C^XX, a more appropriate measure of data compression is the entropy of the binary output states. Consider how many of the 2^N possible binary states will be generated by the clipping process. The equation x_i = sum_j W_ij y_j = 0 defines a P-1 dimensional hyperplane in the P-dimensional state space of hidden variables y_j; these hyperplanes are shown as dashed lines in Figure 2. These hyperplanes partition the half-space where s_i = +1 from the region where s_i = -1. Each of the N spin variables will have such a dividing hyperplane in this P-dimensional state space, and all of these hyperplanes will generically be unique.

Figure 3: Translationally invariant binary spin distribution with N = 256 units. Representative samples from the distribution are illustrated on the left, while the eigenvalue spectra of C^SS and C^XX are plotted on the right.
Thus, the total number of spin configurations s_i is determined by the number of cells bounded by N dividing hyperplanes in P dimensions. The number of such cells is approximately N^P for N >> P, a well-known result from perceptrons [6]. To leading order for large N, the entropy of the binary states generated by this process is then given by S = P log N. Thus, the entropy of the spin configurations generated by this model is directly proportional to the number of hidden variables P.

How is the topology of the binary spin configurations s_i related to the PCA manifold structure of the continuous variables x_i? Each of the generated spin states is represented by a polytope cell in the P dimensional vector space of hidden variables. Each polytope has at least P+1 neighboring polytopes which are related to it by a single or small number of spin flips. Therefore, although the state space of binary spin configurations is discrete, the continuous manifold structure of the underlying Gaussian variables in this model is manifested as binary output configurations with low entropy that are connected with small Hamming distances.

Translationally Invariant Example

In principle, the weights W could be learned by applying maximum likelihood to this generative model; however, the resulting learning algorithm involves analytically intractable multi-dimensional integrals. Alternatively, approximations based upon mean field theory or importance sampling could be used to learn the appropriate parameters [7]. However, Equation 7 suggests a simple learning rule that is also approximate, but is much more computationally efficient [8]. First, the binary correlation matrix C^SS is computed from the data. Then the empirical C^SS is mapped into the appropriate Gaussian correlation matrix using the nonlinear transformation: C^XX = sin(pi C^SS / 2).
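The cell-counting step above can be made exact: for N hyperplanes in general position through the origin of R^P, Cover's result [6] gives 2 * sum_{k=0}^{P-1} binom(N-1, k) cells, a polynomial in N whose logarithm scales like the entropy estimate above, growing with P and only logarithmically in N. A sketch that cross-checks the formula for P = 2, where the cells are the angular sectors cut by N lines through the origin (N and the seed are arbitrary choices):

```python
import numpy as np
from math import comb

rng = np.random.default_rng(1)
N, P = 12, 2                       # spins and hidden dimensions (arbitrary)
W = rng.standard_normal((N, P))

# Cover's count of cells cut by N generic hyperplanes through the origin [6]
cells = 2 * sum(comb(N - 1, k) for k in range(P))   # = 2N for P = 2

# For P = 2 each line w_i . y = 0 contributes two boundary angles
# (perpendicular to w_i); enumerate one midpoint direction per sector.
angles = np.arctan2(W[:, 1], W[:, 0]) + np.pi / 2
boundaries = np.sort(np.concatenate([angles, angles + np.pi]) % (2 * np.pi))
ext = np.append(boundaries, boundaries[0] + 2 * np.pi)
mids = (ext[:-1] + ext[1:]) / 2
y = np.stack([np.cos(mids), np.sin(mids)])           # one y per sector
patterns = {tuple(np.sign(W @ y[:, i]).astype(int)) for i in range(len(mids))}

print(cells, len(patterns))        # both equal 2N
```

Each sector carries a distinct sign configuration, so the brute-force count of distinct spin patterns matches the closed-form cell count.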
This results in a Gaussian correlation matrix where the variances of the individual x_i are fixed at unity. The weights W are then calculated using the conventional PCA algorithm. The correlation matrix C^XX is diagonalized, and the eigenvectors with the largest eigenvalues are used to form the columns of W to yield the best low rank approximation C^XX ~ W W^T. Scaling the variables x_i will result in a correlation matrix C^XX with slightly different eigenvalues but with the same rank.

The utility of this transformation is illustrated by the following simple example. Consider the distribution of N = 256 binary spins shown in Figure 3. Half of the spins are chosen to be positive, and the location of the positive bump is arbitrary under the periodic boundary conditions. Since the distribution is translationally invariant, the correlations C^SS_ij depend only on the relative distance between spins |i - j|. The eigenvectors are the Fourier modes, and their eigenvalues correspond to their overlap with a triangle wave. The eigenvalue spectrum of C^SS is plotted in Figure 3, sorted by rank. In this particular case, the correlation matrix C^SS has N/2 positive eigenvalues with a corresponding range of values.

Now consider the matrix C^XX = sin(pi C^SS / 2). The eigenvalues of C^XX are also shown in Figure 3. In contrast to the many different eigenvalues of C^SS, the spectrum of the Gaussian correlation matrix C^XX has only two positive eigenvalues, with all the rest exactly equal to zero. The corresponding eigenvectors are a cosine and a sine function. The generative process can thus be understood as a linear combination of the two eigenmodes to yield a sine function with arbitrary phase. This function is then clipped to yield the positive bump seen in the original binary distribution.
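This example can be reproduced in a few lines: averaging over all N bump positions gives the exact binary correlations (a triangle wave in the spin separation), and the transformation sin(pi C^SS / 2) turns them into a cosine, whose circulant spectrum has exactly two positive eigenvalues of roughly N/2 each. A minimal sketch:

```python
import numpy as np

N = 256
# All N translations of a bump of N/2 positive spins (rows = samples)
base = np.where(np.arange(N) < N // 2, 1.0, -1.0)
S = np.stack([np.roll(base, t) for t in range(N)])

Css = S.T @ S / N                    # exact correlations: triangle wave in |i-j|
Cxx = np.sin(np.pi * Css / 2)        # learned Gaussian correlations: a cosine

ev = np.linalg.eigvalsh(Cxx)[::-1]   # eigenvalues, descending
print(ev[:3])                        # two eigenvalues near N/2, the rest near 0
```

Diagonalizing Cxx then immediately reveals the P = 2 structure that is invisible in the broad spectrum of Css.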
In comparison with the eigenvalues of C^SS, the eigenvalue spectrum of C^XX makes obvious the low rank structure of the generative process. In this case, the original binary distribution can be constructed using only P = 2 hidden variables, whereas it is not clear from the eigenvalues of C^SS what the appropriate number of modes is. This illustrates the utility of determining the principal components from the calculated Gaussian correlation matrix C^XX rather than working directly with the observable binary correlation matrix C^SS.

Handwritten Digits Example

This model was also applied to a more complex data set. A large set of 16 x 16 black and white images of handwritten twos was taken from the US Post Office digit database [9]. The pixel means and pixel correlations were directly computed from the images. The generative model needs to be slightly modified to account for the non-zero means in the binary outputs. This is accomplished by adding fixed biases Delta_i to the Gaussian variables x_i before clipping:

s_i = sgn(Delta_i + x_i).   (8)

The biases Delta_i can be related to the means of the binary outputs through the expression:

Delta_i = sqrt(2 C^XX_ii) erf^{-1}(<s_i>).   (9)

This allows the biases to be directly computed from the observed means of the binary variables. Unfortunately, with non-zero biases, the relationship between the Gaussian correlations C^XX and binary correlations C^SS is no longer the simple expression found in Equation 7. Instead, the correlations are related by an integral equation over the bivariate Gaussian distribution of each pair (x_i, x_j):

(C^SS)_ij = integral dx_i dx_j P(x_i, x_j) sgn(Delta_i + x_i) sgn(Delta_j + x_j).   (10)

Given the empirical pixel correlations C^SS for the handwritten digits, the integral in Equation 10 is numerically solved for each pair of indices to yield the appropriate Gaussian correlation matrix C^XX. The correlation matrices are diagonalized and the resulting eigenvalue spectra are shown in Figure 4.

Figure 4: Eigenvalue spectra of C^SS and C^XX for handwritten images of twos. The inset shows the P = 16 most significant eigenvectors for C^XX arranged by rows. The right side of the figure shows a nonlinear morph between two different instances of a handwritten two using these eigenvectors.

The eigenvalues for C^XX again exhibit a characteristic drop that is steeper than the falloff in the spectrum of the binary correlations C^SS. The corresponding eigenvectors of C^XX with the 16 largest positive eigenvalues are depicted in the inset of Figure 4. These eigenmodes represent common image distortions such as rotations and stretching, and appear qualitatively similar to those found by the standard PCA algorithm.

A generative model with weights W corresponding to the P = 16 eigenvectors shown in Figure 4 is used to fit the handwritten twos, and the utility of this nonlinear generative model is illustrated in the right side of Figure 4. The top and bottom images in the figure are two different examples of a handwritten two from the data set, and the generative model is used to morph between the two examples. The hidden values y_i for the original images are first determined for the different examples, and the intermediate images in the morph are constructed by linearly interpolating in the vector space of the hidden units. Because of the clipping nonlinearity, this induces a nonlinear mapping in the outputs, with binary units being flipped in a particular order as determined by the generative model.
In contrast, morphing using conventional PCA would result in a simple linear interpolation between the two images, and the intermediate images would not look anything like the original binary distribution [10].

The correlation matrix C^XX also happens to contain some small negative eigenvalues. Even though the binary correlation matrix C^SS is positive definite, the transformation in Equation 10 does not guarantee that the resulting matrix C^XX will also be positive definite. The presence of these negative eigenvalues indicates a shortcoming of the generative process for modelling this data. In particular, the clipped Gaussian model is unable to capture correlations induced by global constraints in the data. As a simple illustration of this shortcoming in the generative model, consider the binary distribution defined by the probability density: P({s}) proportional to lim_{beta -> infinity} exp(-beta sum_{ij} s_i s_j). The states in this distribution are defined by the constraint that the sum of the binary variables is exactly zero: sum_i s_i = 0. Now, for N >= 4, it can be shown that it is impossible to find a Gaussian distribution whose visible binary variables match the negative correlations induced by this sum constraint.

These examples illustrate the value of using the clipped generative model to learn the correlation matrix of the underlying Gaussian variables rather than using the correlations of the outputs directly. The clipping nonlinearity is convenient because the relationship between the hidden variables and the output variables is particularly easy to understand. The learning algorithm differs from other nonlinear PCA models and autoencoders because the inverse mapping function need not be explicitly learned [11, 12].
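The impossibility claim can be checked directly for N = 4: symmetry under the sum constraint forces <s_i s_j> = -1/(N-1) = -1/3 off the diagonal, and the candidate Gaussian correlation matrix sin(pi C^SS / 2) then has a negative eigenvalue, so no Gaussian distribution can realize it:

```python
import numpy as np

N = 4
# Binary correlations of the constrained distribution sum_i s_i = 0:
# 0 = <(sum_i s_i)^2> = N + N(N-1) <s_i s_j>, so <s_i s_j> = -1/(N-1) for i != j
Css = np.full((N, N), -1.0 / (N - 1))
np.fill_diagonal(Css, 1.0)

# Candidate Gaussian correlations via the inverse of Equation 7
Cxx = np.sin(np.pi * Css / 2)
print(np.linalg.eigvalsh(Cxx))     # smallest eigenvalue is negative
```

The off-diagonal entries become sin(-pi/6) = -1/2, giving an eigenvalue of -1/2 along the all-ones direction; the candidate matrix is not a valid covariance.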
Instead, the correlation matrix is directly transformed from the observable variables to the underlying Gaussian variables. The correlation matrix is then diagonalized to determine the appropriate feedforward weights. This results in an extremely efficient training procedure that is directly analogous to PCA for continuous variables.

We acknowledge the support of Bell Laboratories, Lucent Technologies, and the US-Israel Binational Science Foundation. We also thank H. S. Seung for helpful discussions.

References
[1] Jolliffe, IT (1986). Principal Component Analysis. New York: Springer-Verlag.
[2] Bartholomew, DJ (1987). Latent variable models and factor analysis. London: Charles Griffin & Co. Ltd.
[3] Hinton, GE, Dayan, P, & Revow, M (1996). Modeling the manifolds of images of handwritten digits. IEEE Transactions on Neural Networks 8, 65-74.
[4] Van Vreeswijk, C, Sompolinsky, H, & Abeles, M (1999). Nonlinear statistics of spike trains. In preparation.
[5] Ackley, DH, Hinton, GE, & Sejnowski, TJ (1985). A learning algorithm for Boltzmann machines. Cognitive Science 9, 147-169.
[6] Cover, TM (1965). Geometrical and statistical properties of systems of linear inequalities with applications in pattern recognition. IEEE Trans. Electronic Comput. 14, 326-334.
[7] Tipping, ME (1999). Probabilistic visualisation of high-dimensional binary data. Advances in Neural Information Processing Systems 11.
[8] Christoffersson, A (1975). Factor analysis of dichotomized variables. Psychometrika 40, 5-32.
[9] LeCun, Y, et al. (1989). Backpropagation applied to handwritten zip code recognition. Neural Computation 1, 541-551.
[10] Bregler, C, & Omohundro, SM (1995). Nonlinear image interpolation using manifold learning. Advances in Neural Information Processing Systems 7, 973-980.
[11] Hastie, T, & Stuetzle, W (1989). Principal curves. Journal of the American Statistical Association 84, 502-516.
[12] Demers, D, & Cottrell, G (1993). Nonlinear dimensionality reduction. Advances in Neural Information Processing Systems 5, 580-587.

Risk Sensitive Reinforcement Learning

Ralph Neuneier
Siemens AG, Corporate Technology
D-81730 München, Germany
Ralph.Neuneier@mchp.siemens.de

Oliver Mihatsch
Siemens AG, Corporate Technology
D-81730 München, Germany
Oliver.Mihatsch@mchp.siemens.de

Abstract

As is already known, the expected return of a policy in Markov Decision Problems is not always the most suitable optimality criterion. For many applications, control strategies have to meet various constraints like avoiding very bad states (risk-avoiding) or generating high profit within a short time (risk-seeking), although this might probably cause significant costs. We propose a modified Q-learning algorithm which uses a single continuous parameter kappa in [-1, 1] to determine in which sense the resulting policy is optimal. For kappa = 0, the policy is optimal with respect to the usual expected return criterion, while kappa -> 1 generates a solution which is optimal in the worst case. Analogously, the closer kappa is to -1, the more risk seeking the policy becomes. In contrast to other related approaches in the field of MDPs, we do not have to transform the cost model or to increase the state space in order to take risk into account. Our new approach is evaluated by computing optimal investment strategies for an artificial stock market.

1 WHY IT SOMETIMES PAYS TO ACT CAUTIOUSLY

Reinforcement learning (RL) deals with the computation of favorable control policies in sequential decision tasks. Its theoretical framework of Markov Decision Problems (MDPs) evaluates and compares policies by their expected (sometimes discounted or averaged) sum of the immediate returns or costs per time step (Bertsekas & Tsitsiklis, 1996).
But there are numerous applications which require a more sophisticated control scheme: e.g. a policy should take into account that bad outcomes or states may be possible, and even if they are very rare, they are so disastrous that they should certainly be avoided.

An obvious example is the field of finance, where the main question is how to invest resources among various opportunities (e.g. assets like stocks, bonds, etc.) to achieve remarkable returns while simultaneously controlling the risk exposure of the investments due to changing markets or economic conditions. Many traders try to achieve this by a Markowitz-like portfolio management which distributes capital according to return and risk

1032
R. Neuneier and O. Mihatsch

estimates of the assets. A new approach using reinforcement learning techniques which additionally integrates trading costs and other market imperfections has been proposed in Neuneier, 1998. Here, these algorithms are naturally extended such that an explicit risk control is now possible. The investor can decide how much risk she/he is willing to accept and then compute an optimal risk-averse investment strategy. Similar trade-off scenarios can be formulated in robotics, traffic control and further application areas.

The fact that the popular expected value criterion is not always suitable has already been known in the field of AI (Koenig & Simmons, 1994), control theory and reinforcement learning (Heger, 1994 and Szepesvari, 1997). Several techniques have been proposed to handle this problem. The most obvious way is to transform the sum of returns sum_t r_t using an appropriate utility function U which reflects the desired properties of the solution. Unfortunately, interesting nonlinear utility functions incorporating the variance of the return, such as U(sum_t r_t) = sum_t r_t - lambda (sum_t r_t - E(sum_t r_t))^2, lead to non-Markovian decision problems.
The popular class of exponential utility functions U(sum_t r_t) = exp(lambda sum_t r_t) preserves the Markov property but requires time dependent policies even for discounted infinite horizon MDPs. Furthermore, it is not possible to formulate a corresponding model-free learning algorithm. A further alternative changes the state space model by including past returns as an additional state element, at the cost of a higher dimensionality of the MDP. Furthermore, it is not always clear in which way the states should be augmented. One may also transform the cost model, i.e. by punishing large losses more strongly than minor costs. While requiring a significant amount of prior knowledge, this also increases the complexity of the MDP.

In contrast to these approaches, we modify the popular Q-learning algorithm by introducing a control parameter which determines in which sense the resulting policy is optimal. Intuitively and loosely speaking, our algorithm simulates the learning behavior of an optimistic (pessimistic) person by overweighting (underweighting) experiences which are more positive (negative) than expected. This main idea will be made more precise in section 2 and mathematically thoroughly analyzed in section 3. Using artificial data, we demonstrate some properties of the new algorithm by constructing an optimal risk-avoiding investment strategy (section 4).

2 RISK SENSITIVE Q-LEARNING

For brevity we restrict ourselves to the subclass of infinite horizon discounted Markov decision problems (MDPs). Furthermore, we assume the immediate rewards to be deterministic functions of the current state and control action. Let S = {1, ..., n} be the finite state space and U be the finite action space. Transition probabilities and immediate rewards are denoted by p_ij(u) and g_i(u), respectively. gamma denotes the discount factor.
Let Pi be the set of all deterministic policies mapping states to control actions.

A commonly used objective is to learn a policy pi that maximizes

Q^pi(i, u) := g_i(u) + E{ sum_{t=1}^{infinity} gamma^t g_{i_t}(pi(i_t)) },   (1)

quantifying the expected reward if one executes control action u in state i and follows the policy pi thereafter. It is a well-known result that the optimal Q-values Q*(i, u) := max_{pi in Pi} Q^pi(i, u) satisfy the following optimality equation:

Q*(i, u) = g_i(u) + gamma sum_{j in S} p_ij(u) max_{u' in U} Q*(j, u')   for all i in S, u in U.   (2)

Any policy pi with pi(i) = argmax_{u in U} Q*(i, u) is optimal with respect to the expected reward criterion.

1033

The Q-function Q^pi averages over the outcome of all possible trajectories (series of states) of the Markov process generated by following the policy pi. However, the outcome of a specific realization of the Markov process may deviate significantly from this mean value. The expected reward criterion does not consider any risk, although the case where the discounted reward falls considerably below the mean value is of vital interest for many applications. Therefore, depending on the application at hand, the expected reward approach is not always appropriate. Alternatively, Heger (1994) and Littman & Szepesvari (1996) present a performance criterion that exclusively focuses on risk avoiding policies: maximize

Q̲^pi(i, u) := g_i(u) + inf_{(i_1, i_2, ...): p(i_1, i_2, ...) > 0} { sum_{t=1}^{infinity} gamma^t g_{i_t}(pi(i_t)) }.   (3)

The Q-function Q̲^pi(i, u) denotes the worst possible outcome if one executes control action u in state i and follows the policy pi thereafter. The corresponding optimality equation for Q̲*(i, u) := max_{pi in Pi} Q̲^pi(i, u) is given by

Q̲*(i, u) = g_i(u) + gamma min_{j in S: p_ij(u) > 0} max_{u' in U} Q̲*(j, u').   (4)

Any policy pi̲ satisfying pi̲(i) = argmax_{u in U} Q̲*(i, u) is optimal with respect to this minimal reward criterion. In most real world applications this approach is too restrictive because it takes very rare events (that in practice never happen) fully into account. This usually leads to policies with a lower average performance than the application requires. An investment manager, for instance, who acts with respect to this very pessimistic objective function will not invest at all.

To handle the trade-off between a sufficient average performance and a risk avoiding (risk seeking) behavior, we propose a family of new optimality equations parameterized by a meta-parameter kappa (-1 < kappa < 1):

0 = sum_{j in S} p_ij(u) X_kappa( g_i(u) + gamma max_{u' in U} Q_kappa(j, u') - Q_kappa(i, u) )   for all i in S, u in U,   (5)

where X_kappa(x) := (1 - kappa sign(x)) x. (In the next section we will show that a unique solution Q_kappa of the above equation (5) exists.) Obviously, for kappa = 0 we recover equation (2), the optimality equation for the expected reward criterion. If we choose kappa to be positive (0 < kappa < 1) then we overweight negative temporal differences

g_i(u) + gamma max_{u' in U} Q_kappa(j, u') - Q_kappa(i, u) < 0   (6)

with respect to positive ones. Loosely speaking, we overweight transitions to states where the future return is lower than the average one. On the other hand, we underweight transitions to states that promise a higher return than the average. Thus, an agent that behaves according to the policy pi_kappa(i) := argmax_{u in U} Q_kappa(i, u) is risk avoiding if kappa > 0. In the limit kappa -> 1 the policy pi_kappa approaches the optimal worst-case policy pi̲, as we will show in the following section. (To get an intuition about this, the reader may easily check that the optimal worst-case Q-value Q̲* fulfills the modified optimality equation (5) for kappa = 1.) Similarly, the policy pi_kappa becomes risk seeking if we choose kappa to be negative.

It is straightforward to formulate a risk sensitive Q-learning algorithm based on the modified optimality equation (5). Let Q_kappa(i, u; w) be a parametric approximation of the Q-function Q_kappa(i, u). The states and actions encountered at time step k during simulation are denoted by i_k and u_k respectively. At each time step apply the following update rule:

d^(k) = g_{i_k}(u_k) + gamma max_{u' in U} Q_kappa(i_{k+1}, u'; w^(k)) - Q_kappa(i_k, u_k; w^(k)),
w^(k+1) = w^(k) + alpha^(k) X_kappa(d^(k)) grad_w Q_kappa(i_k, u_k; w^(k)),   (7)

where alpha^(k) denotes a stepsize sequence. The following section analyzes the properties of the new optimality equations and the corresponding Q-learning algorithm.

3 PROPERTIES OF THE RISK SENSITIVE Q-FUNCTION

Due to space limitations we are not able to give detailed proofs of our results. Instead, we focus on interpreting their practical consequences. The proofs will be published elsewhere.

Before formulating the mathematical results, we introduce some notation to make the exposition more concise. Using an arbitrary stepsize 0 < alpha < 1, we define the value iteration operator corresponding to our modified optimality equation (5) as

T_{alpha,kappa}[Q](i, u) := Q(i, u) + alpha sum_{j in S} p_ij(u) X_kappa( g_i(u) + gamma max_{u' in U} Q(j, u') - Q(i, u) ).   (8)

The operator T_{alpha,kappa} acts on the space of Q-functions. For every Q-function Q and every state-action pair (i, u) we define N_kappa[Q](i, u) to be the set of all successor states j for which max_{u' in U} Q(j, u') attains its minimum:

N_kappa[Q](i, u) := { j in S | p_ij(u) > 0 and max_{u' in U} Q(j, u') = min_{j' in S: p_ij'(u) > 0} max_{u' in U} Q(j', u') }.   (9)

Let p_kappa[Q](i, u) := sum_{j in N_kappa[Q](i, u)} p_ij(u) be the probability of transitions to such successor states.

We have the following lemma ensuring the contraction property of T_{alpha,kappa}.

Lemma 1 (Contraction Property) Let ||Q|| := max_{i in S, u in U} |Q(i, u)| and 0 < alpha < 1, 0 < gamma < 1.
Then

||T_{alpha,kappa}[Q_1] - T_{alpha,kappa}[Q_2]|| <= (1 - alpha (1 - |kappa|)(1 - gamma)) ||Q_1 - Q_2||   for all Q_1, Q_2.   (10)

The operator T_{alpha,kappa} is contracting, because 0 < 1 - alpha (1 - |kappa|)(1 - gamma) < 1.

The lemma has several important consequences.

1. The risk sensitive optimality equation (5), i.e. T_{alpha,kappa}[Q] = Q, has a unique solution Q_kappa for all -1 < kappa < 1.

2. The value iteration procedure Q_new := T_{alpha,kappa}[Q] converges towards Q_kappa.

3. The existing convergence results for traditional Q-learning (Bertsekas & Tsitsiklis 1997, Tsitsiklis & Van Roy 1997) remain valid in the risk sensitive case kappa != 0. Particularly, risk sensitive Q-learning (7) converges with probability one in the case of lookup table representations as well as in the case of optimal stopping problems combined with linear representations.

4. The speed of convergence for both risk sensitive value iteration and Q-learning becomes worse as |kappa| -> 1. We can remedy this to some extent by increasing the stepsize alpha appropriately.

Let pi_kappa be a greedy policy with respect to the unique solution Q_kappa of our modified optimality equation; that is, pi_kappa(i) = argmax_{u in U} Q_kappa(i, u). The following theorem examines the performance of pi_kappa for the risk avoiding case kappa >= 0. It gives us a feeling about the expected outcome Q^{pi_kappa} and the worst possible outcome Q̲^{pi_kappa} of the policy pi_kappa for different values of kappa. The theorem clarifies the limiting behavior of pi_kappa as kappa -> 1.

1035

Theorem 2 Let 0 <= kappa < 1. The following inequalities hold componentwise, i.e. for each pair (i, u) in S x U:

0 <= Q* - Q^{pi_kappa} <= (2 kappa / (1 - gamma)) (Q* - Q̲*),   (11)

0 <= p_kappa[Q_kappa] (Q̲* - Q̲^{pi_kappa}) <= (1 - kappa) (2 kappa / (1 - gamma)) (Q* - Q̲*).   (12)

Moreover, lim_{kappa -> 0} Q^{pi_kappa} = Q* and lim_{kappa -> 1-} Q̲^{pi_kappa} = Q̲*.

The difference Q* - Q̲* between the optimal expected reward and the optimal worst case reward is crucial in the above inequalities.
This difference measures the amount of risk inherent in the MDP at hand. Besides the value of $\kappa$, this quantity essentially determines the gap between the performance of the policy $\pi_\kappa$ and the optimal performance with respect to both the expected reward and the worst case criterion. The second inequality (12) states that the performance of policy $\pi_\kappa$ in the worst case sense tends to the optimal worst case performance as $\kappa \to 1$. The "speed of convergence" is influenced by the quantity $p_\kappa[Q^*_\kappa]$, i.e. the probability that a worst case transition actually occurs. (Note that $p_\kappa[Q^*_\kappa]$ is bounded from below.) A higher probability $p_\kappa[Q^*_\kappa]$ of worst case transitions implies a stronger risk avoiding attitude of the policy $\pi_\kappa$.

4 EXPERIMENTS: RISK-AVERSE INVESTMENT DECISIONS

Our algorithm is now tested on the task of constructing an optimal investment policy for an artificial stock price, analogous to the empirical analysis in Neuneier, 1998. The task, illustrated as an MDP in fig. 1, is to decide at each time step (e.g. each day or after each major event on the market) whether to buy the stock, thereby speculating on increasing stock prices, or to keep the capital in cash, which avoids potential losses due to decreasing stock prices.

Figure 1. The Markov Decision Problem: disturbances act on the financial market; the investor observes rates and prices and chooses investments, receiving a return. State: market $\$_t$ and portfolio $K_t$, i.e. $x_t = (\$_t, K_t)$; policy $\mu$ with actions $a_t = \mu(x_t)$; transition probabilities $p(x_{t+1} \mid x_t)$; return function $r(x_t, a_t, \$_{t+1})$.

Figure 2. A realization of the artificial stock price for 300 time steps. It is obvious that the price follows an increasing trend, but with higher values a sudden drop to low values becomes more and more probable.
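The qualitative price dynamics of fig. 2 can be mimicked in a few lines of Python. The drift and crash parameters below are hypothetical stand-ins chosen only to reproduce the described behavior (an upward trend in $[1, 2]$, with a crash to low values becoming more likely as the price rises):

```python
import numpy as np

def simulate_price(T=300, seed=0):
    """Hypothetical dynamics matching fig. 2's description: the price in
    [1, 2] drifts upward, but the probability of a sudden drop back
    towards 1 grows with the price level."""
    rng = np.random.default_rng(seed)
    p = 1.0
    path = [p]
    for _ in range(T - 1):
        if rng.random() < (p - 1.0):           # drop more likely near 2
            p = 1.0 + 0.1 * rng.random()       # sudden crash to low values
        else:
            p = min(2.0, p + 0.02 + 0.02 * rng.random())  # upward drift
        path.append(p)
    return np.array(path)

prices = simulate_price()
```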
\n\nIt is assumed, that the investor is not able to influence the market by the investment de(cid:173)\ncisions. This leads to a MDP with some of the state elements being uncontrollable and \nresults in two computationally import implications: first, one can simulate the investments \nby historical data without investing (and potentially losing) real money. Second, one can \nformulate a very efficient (memory saving) and more robust Q-Ieaming algorithms. Due to \nspace restriction we skip a detailed description of these algorithms and refer the interested \nreader to Neuneier, 1998. \n\n\f1036 \n\nR. Neuneier and O. Mihatsch \n\nThe artificial stock price is in the range of [1, 2]. The transition probabilities are chosen \nsuch that the stock market simulates a situation where the price follows an increasing trend \nbut with higher values a drop to very low values becomes more and more probable (fig. 2). \n\nThe state vector consists of the current stock price and the current investment, i. e. the \namount of money invested in stocks or cash. Changing the investment from cash to stocks \nresults in some transaction costs consisting of variable and fixed terms. These costs are \nessential to define the investment problem as a MDP because they couple the actions made \nat different time steps. Otherwise we could solve the problem by a pure prediction of the \nnext stock price. The function which quantifies the immediate return for each time step is \ndefined as follows: if the capital is invested in cash, then there is nothing to earn even if \nthe stock price increases, if the investor has bought stocks the return is equal the relative \nchange of the stock price weighted by the invested amount of capital minus the transaction \ncosts which apply if one changed from cash to stocks. \n\no \n\n< .. to! on ,took> 1 \n\n1.5 \n\n\u2022 tode prtce S \n\n.. \", 0.5 \n\no \n\ncoptto! \n\n,\"stocks 1 \n\nIt .. o.s \n\n...... 
\n\ncaon \n,ncaoh \n\n~ , \n\ncapital \n\nR'1 stclCln 1 \n\n1.5 \n\n_todc price \u2022 \n\nFigure 3. Left: Risk neu(cid:173)\ntral policy, K, = O. Right: \nA small bias of K, = 0.3 \nagainst risk changes the po(cid:173)\nlicy if one is not invested \n(transaction costs apply in \nthis case) . \n\nFigure 4. Left: K, = 0.5 \nyields a stronger risk averse \nattitude. Right: With K, = \n0.8 the policy becomes also \nmore cautious if already in(cid:173)\nvested in stocks. \nFigure 5. Left: K, = 0.9 \nleads to a policy which in(cid:173)\nvests in stocks in only 5 \ncases. Right: The worst \ncase solution never invests \nbecause there is always a \npositive probability for de(cid:173)\ncreasing stock prices. \n\nAs a reinforcement learning method, Q-Iearning has to interact with the environment (here \nthe stock market) to learn optimal investment behavior. Thus, a training set of 2000 data \npoints is generated. The training phase is divided into epochs which consists of as many \ntrials as data in the training set exist. At every trial the algorithm selects randomly a stock \nprice from the data set, chooses a random investment state and updates the tabulated Q(cid:173)\nvalues according to the procedure given in Neuneier, 1998. The only difference of our new \nrisk averse Q-Iearning is that negative experiences, i. e. smaller returns than in the mean, \nare overweighted in comparison to positive experiences using the /\\,-factor of eq. (7). Using \ndifferent /\\, values from 0 (recovering the original Q-Iearning procedure) to 1 (leading to \nworst case Q-Iearning) we plot the resulting policies as mappings from the state space to \ncontrol actions in figures 3 to 5. Obviously, with increasing /\\, the investor acts more and \nmore cautiously because there are less states associated with an investment decision for \nstocks. In the extreme case of /\\, = 1, there is no stock investment at all in order to avoid \nany loss. The policy is not useful in practice. 
The impracticality of the worst case solution supports our introductory comment that worst case Q-learning is not appropriate for many tasks.

Risk Sensitive Reinforcement Learning   1037

Figure 6. The quantiles of the distributions of the discounted sum of returns for $\kappa = 0.2$ (o) and $\kappa = 0.4$ (+) are plotted against the quantiles for the classical risk neutral approach $\kappa = 0$. The distributions only differ significantly for negative accumulated returns (left tail of the distributions).

For further analysis, we specify a risky start state $i_0$ for which a sudden drop of the stock price in the near future is very probable. Starting at $i_0$ we compute the cumulated discounted rewards of 10000 different trajectories following the policies $\pi_0$, $\pi_{0.2}$ and $\pi_{0.4}$, which have been generated using $\kappa = 0$ (risk neutral), $\kappa = 0.2$ and $\kappa = 0.4$. The resulting three data sets are compared using a quantile-quantile plot, whose purpose is to determine whether the samples come from the same type of distribution; if they do, the plot will be linear. Fig. 6 clearly shows that for higher $\kappa$ values the left tail of the distribution (negative returns) bends up, indicating a smaller number of losses. On the other hand, there is no significant difference for the positive quantiles. In contrast to naive utility functions which penalize high variance in general, our risk sensitive Q-learning asymmetrically reduces the probability of losses, which may be more suitable for many applications.
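The quantile-quantile comparison of fig. 6 is easy to reproduce in spirit: compute matched percentiles of two samples of discounted returns and plot one against the other. The two samples below are hypothetical stand-ins (a risk averse policy modeled with a clipped left tail), not the paper's actual trajectories:

```python
import numpy as np

def qq_points(sample_a, sample_b, n=99):
    """Matched quantiles of two return samples; plotting b-quantiles
    against a-quantiles gives a quantile-quantile plot as in fig. 6."""
    qs = np.linspace(1, 99, n)
    return np.percentile(sample_a, qs), np.percentile(sample_b, qs)

# hypothetical stand-ins for the 10000 discounted-return trajectories:
# the risk averse policy trades a slightly lower upside for a thinner
# left tail (large losses are avoided)
rng = np.random.default_rng(1)
neutral = rng.normal(0.3, 0.3, 10_000)
averse = np.maximum(rng.normal(0.25, 0.3, 10_000), -0.1)

qa, qb = qq_points(neutral, averse)
```

If the left tail of `qb` lies above `qa` while the upper quantiles nearly coincide, the plot bends up exactly as described for fig. 6.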
\n\n5 CONCLUSION \n\nWe have formulated a new Q-Iearning algorithm which can be continuously tuned towards \nrisk seeking or risk avoiding policies. Thus, it is possible to construct control strategies \nwhich are more suitable for the problem at hand by only small modifications of Q-Iearning \nalgorithm. The advantage of our approch in comparison to already known solutions is, that \nwe have neither to change the cost nor the state model. We can prove that our algorithm \nconverges under the usual assumptions. Future work will focus on the connections between \nour approach and the utility theoretic point of view. \n\nReferences \n\nD. P. Bertsekas, J. N. Tsitsiklis (1996) Neuro-Dynamic Programming. Athena Scientific. \nM. Heger (1994) Consideration of Risk and Reinforcement Learning, in Machine Learning, proceed(cid:173)\nings of the 11 th International Conference, Morgan Kaufmann Publishers. \nS. Koenig, R. G. Simmons (1994) Risk-Sensitive Planning with Probabilistic Decision Graphs. \nProc. of the Fourth Int. Conf. on Principles of Knowledge Representation and Reasoning (KR). \nM.L. Littman, Cs. Szepesvari (1996), A generalized reinforcement-learning model: Convergence \nand applications. In International Conference of Machine Learning '96. Bari. \nR. Neuneier (1998) Enhancing Q-learning for Optimal Asset Allocation, in Advances in Neural In(cid:173)\nformation Processing Systems /0, Cambridge, MA: MIT Press. \nM. L. Puterman (1994), Markov Decision Processes, John Wiley & Sons. \nCs. Szepesvari (1997) Non-Markovian Policies in Sequential Decision Problems, Acta Cybernetica. \nJ. N. Tsitsiklis, B. Van Roy (1997) Approximate Solutions to Optimal Stopping Problems, in Ad(cid:173)\nvances in Neural Information Processing Systems 9, Cambridge, MA: MIT Press. \n\n\f", "award": [], "sourceid": 1583, "authors": [{"given_name": "Ralph", "family_name": "Neuneier", "institution": null}, {"given_name": "Oliver", "family_name": "Mihatsch", "institution": null}]}