{"title": "Monte Carlo Matrix Inversion and Reinforcement Learning", "book": "Advances in Neural Information Processing Systems", "page_first": 687, "page_last": 694, "abstract": null, "full_text": "Monte Carlo Matrix Inversion and \n\nReinforcement Learning \n\nAndrew Barto and Michael Duff \n\nComputer Science Department \n\nUniversity of Massachusetts \n\nAmherst, MA 01003 \n\nAbstract \n\nWe describe the relationship between certain reinforcement learn(cid:173)\ning (RL) methods based on dynamic programming (DP) and a class \nof unorthodox Monte Carlo methods for solving systems of linear \nequations proposed in the 1950's. These methods recast the solu(cid:173)\ntion of the linear system as the expected value of a statistic suitably \ndefined over sample paths of a Markov chain. The significance of \nour observations lies in arguments (Curtiss, 1954) that these Monte \nCarlo methods scale better with respect to state-space size than do \nstandard, iterative techniques for solving systems of linear equa(cid:173)\ntions. This analysis also establishes convergence rate estimates. \nBecause methods used in RL systems for approximating the evalu(cid:173)\nation function of a fixed control policy also approximate solutions \nto systems of linear equations, the connection to these Monte Carlo \nmethods establishes that algorithms very similar to TD algorithms \n(Sutton, 1988) are asymptotically more efficient in a precise sense \nthan other methods for evaluating policies. Further, all DP-based \nRL methods have some of the properties of these Monte Carlo al(cid:173)\ngorithms, which suggests that although RL is often perceived to \nbe slow, for sufficiently large problems, it may in fact be more ef(cid:173)\nficient than other known classes of methods capable of producing \nthe same results. 
\n\n1 Introduction \n\nConsider a system whose dynamics are described by a finite state Markov chain with transition matrix P, and suppose that at each time step, in addition to making a transition from state x_t = i to x_{t+1} = j with probability p_ij, the system produces a randomly determined reward, r_{t+1}, whose expected value is R_i. The evaluation function, V, maps states to their expected, infinite-horizon discounted returns: \n\nV(i) = E[ Σ_{t=0}^∞ γ^t r_{t+1} | x_0 = i ]. \n\nIt is well known that V uniquely satisfies a linear system of equations describing local consistency: \n\nV = R + γPV, or (I − γP)V = R. (1) \n\nThe problem of computing or estimating V is interesting and important in its own right, but perhaps more significantly, it arises as a (rather computationally burdensome) step in certain techniques for solving Markov Decision Problems. In each iteration of Policy-Iteration (Howard, 1960), for example, one must determine the evaluation function associated with some fixed control policy, a policy that improves with each iteration. \n\nMethods for solving (1) include standard iterative techniques and their variants: successive approximation (Jacobi or Gauss-Seidel versions), successive over-relaxation, etc. They also include some of the algorithms used in reinforcement learning (RL) systems, such as the family of TD algorithms (Sutton, 1988). Here we describe the relationship between the latter methods and a class of unorthodox Monte Carlo methods for solving systems of linear equations proposed in the 1950's. These methods recast the solution of the linear system as the expected value of a statistic suitably defined over sample paths of a Markov chain. 
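The successive-approximation scheme just mentioned can be sketched concretely. The 3-state chain below is a hypothetical example (illustrative numbers, not from the paper); it iterates V^(k+1) = R + γPV^(k) until the local-consistency equation (1) holds to numerical precision:

```python
# Successive approximation (Jacobi-style) for V = R + gamma * P V on a
# hypothetical 3-state chain; P, R, and gamma are illustrative values.
P = [[0.5, 0.5, 0.0],
     [0.1, 0.6, 0.3],
     [0.0, 0.2, 0.8]]
R = [1.0, 0.0, 2.0]
gamma = 0.9
n = len(R)

V = [0.0] * n
for _ in range(500):                # V^(k+1) = R + gamma * P V^(k)
    V = [R[i] + gamma * sum(P[i][j] * V[j] for j in range(n))
         for i in range(n)]

# V now satisfies local consistency, V = R + gamma * P V, to high accuracy.
residual = max(abs(V[i] - (R[i] + gamma * sum(P[i][j] * V[j] for j in range(n))))
               for i in range(n))
print(residual < 1e-6)  # True
```

Each sweep costs on the order of n² multiplications, which is the n² factor in the iterative complexity discussed in Section 3.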
\n\nThe significance of our observations lies in arguments (Curtiss, 1954) that these Monte Carlo methods scale better with respect to state-space size than do standard, iterative techniques for solving systems of linear equations. This analysis also establishes convergence rate estimates. Applying this analysis to particular members of the family of TD algorithms (Sutton, 1988) provides insight into the scaling properties of the TD family as a whole and the reasons that TD methods can be effective for problems with very large state sets, such as in the backgammon player of Tesauro (Tesauro, 1992). \n\nFurther, all DP-based RL methods have some of the properties of these Monte Carlo algorithms, which suggests that although RL is often slow, for large problems (Markov Decision Problems with large numbers of states) it is in fact far more practical than other known methods capable of producing the same results. First, like many RL methods, the Monte Carlo algorithms do not require explicit knowledge of the transition matrix, P. Second, unlike standard methods for solving systems of linear equations, the Monte Carlo algorithms can approximate the solution for some variables without expending the computational effort required to approximate the solution for all of the variables. In this respect, they are similar to DP-based RL algorithms that approximate solutions to Markovian decision processes through repeated trials of simulated or actual control, thus tending to focus computation onto regions of the state space that are likely to be relevant in actual control (Barto et al., 1991). \n\nThis paper begins with a condensed summary of Monte Carlo algorithms for solving systems of linear equations. We show that for the problem of determining an evaluation function, they reduce to simple, practical implementations. 
Next, we recall arguments (Curtiss, 1954) regarding the scaling properties of Monte Carlo methods compared to iterative methods. Finally, we conclude with a discussion of the implications of the Monte Carlo technique for certain algorithms useful in RL systems. \n\n2 Monte Carlo Methods for Solving Systems of Linear Equations \n\nThe Monte Carlo approach may be motivated by considering the statistical evaluation of a simple sum, Σ_k a_k. If {p_k} denotes a set of values for a probability mass function that is arbitrary (save for the requirement that a_k ≠ 0 imply p_k ≠ 0), then Σ_k a_k = Σ_k (a_k / p_k) p_k, which may be interpreted as the expected value of a random variable Z defined by Pr{ Z = a_k / p_k } = p_k. \n\nFrom equation (1) and the Neumann series representation of the inverse it is clear that \n\nV = (I − γP)^{-1} R = R + γPR + γ²P²R + ..., \n\nwhose ith component is \n\nV_i = R_i + γ Σ_{i_1} p_{i i_1} R_{i_1} + γ² Σ_{i_1 i_2} p_{i i_1} p_{i_1 i_2} R_{i_2} + ... + γ^k Σ_{i_1 ... i_k} p_{i i_1} ... p_{i_{k-1} i_k} R_{i_k} + ..., (2) \n\nand it is this series that we wish to evaluate by statistical means. \n\nA technique originated by Ulam and von Neumann (Forsythe & Leibler, 1950) utilizes an arbitrarily defined Markov chain with transition matrix P̂ and state set {1, 2, ..., n} (V is assumed to have n components). The chain begins in state i and is allowed to make k transitions, where k is drawn from a geometric distribution with parameter P_step; i.e., Pr{k state transitions} = P_step^k (1 − P_step). The Markov chain, governed by P̂ and the geometrically-distributed stopping criterion, defines a mass function assigning probability to every trajectory of every length starting in state i, x_0 = i_0 = i → x_1 = i_1 → ... → x_k = i_k, and to each such trajectory there corresponds a unique term in the sum (2). 
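The geometrically-stopped sampling process can be sketched as follows. Since Pr{k transitions} = P_step^k (1 − P_step), stopping reduces to a continue/stop coin flip at every step (a sketch; the function names are ours, and P̂ is passed as a row-stochastic list of lists):

```python
import random

def draw(row, rng):
    # Inverse-CDF draw: return index j with probability row[j].
    r, acc = rng.random(), 0.0
    for j, p in enumerate(row):
        acc += p
        if r < acc:
            return j
    return len(row) - 1  # guard against floating-point round-off

def sample_trajectory(P_hat, i, P_step, rng):
    # One trajectory of the sampling chain P_hat starting in state i;
    # the walk continues with probability P_step at each step, so the
    # number of transitions is geometric: Pr{k} = P_step**k * (1 - P_step).
    path = [i]
    while rng.random() < P_step:
        path.append(draw(P_hat[path[-1]], rng))
    return path
```

With P_step = γ, the expected number of transitions is γ/(1 − γ), which is where the path-length factor in the Monte Carlo work estimate comes from.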
\n\nFor the case of value estimation, Z is defined by \n\nZ = γ^k p_{i i_1} p_{i_1 i_2} ... p_{i_{k-1} i_k} R_{i_k} / ( P_step^k (1 − P_step) p̂_{i i_1} p̂_{i_1 i_2} ... p̂_{i_{k-1} i_k} ), \n\nwhich for P̂ = P and P_step = γ becomes \n\nPr{ Z = R_{i_k} / (1 − γ) } = γ^k (1 − γ) Π_{j=1}^k p_{i_{j-1} i_j}. \n\nThe sample average of sampled values of Z is guaranteed to converge (as the number of samples grows large) to state i's expected, infinite-horizon discounted return. \n\nIn Wasow's method (Wasow, 1952), the truncated Neumann series \n\nV_i = R_i + γ Σ_{i_1} p_{i i_1} R_{i_1} + γ² Σ_{i_1 i_2} p_{i i_1} p_{i_1 i_2} R_{i_2} + ... + γ^N Σ_{i_1 ... i_N} p_{i i_1} ... p_{i_{N-1} i_N} R_{i_N} \n\nis expressed as R_i plus the expected value of the sum of N random variables Z_1, Z_2, ..., Z_N, the intention being that \n\nE(Z_k) = γ^k Σ_{i_1 ... i_k} p_{i i_1} p_{i_1 i_2} ... p_{i_{k-1} i_k} R_{i_k}. \n\nLet trajectories of length N be generated by the Markov chain governed by P̂. A given term γ^k p_{i i_1} p_{i_1 i_2} ... p_{i_{k-1} i_k} R_{i_k} is associated with all trajectories i → i_1 → i_2 → ... → i_k → i_{k+1} → ... → i_N whose first k + 1 states are i, i_1, ..., i_k. The measure of this set of trajectories is just p̂_{i i_1} p̂_{i_1 i_2} ... p̂_{i_{k-1} i_k}. Thus, the random variables Z_k, k = 1, ..., N, are defined by \n\nZ_k = γ^k ( p_{i i_1} p_{i_1 i_2} ... p_{i_{k-1} i_k} / p̂_{i i_1} p̂_{i_1 i_2} ... p̂_{i_{k-1} i_k} ) R_{i_k}. \n\nIf P̂ = P, then the estimate becomes an average of sample truncated, discounted returns: V̂_i = R_i + γ R_{i_1} + γ² R_{i_2} + ... + γ^N R_{i_N}. \n\nThe Ulam/von Neumann approach may be reconciled with that of Wasow by processing a given trajectory a posteriori, converting it into a set of terminated paths consistent with any choice of stopping-state transition probabilities. For example, for a stopping-state transition probability of 1 − γ, a path of length k has probability γ^k (1 − γ). Each \"prefix\" of the observed path x(0) → x(1) → x(2) → ... can be weighted by the probability of a path of corresponding length, resulting in an estimate, V̂, that is the sampled, discounted return: \n\nV̂ = Σ_{k=0}^∞ γ^k R_{x(k)}. \n\n3 Complexity \n\nIn (Curtiss, 1954) Curtiss establishes a theoretical comparison of the complexity (number of multiplications) required by the Ulam/von Neumann method and a stationary linear iterative process for computing a single component of the solution to a system of linear equations. Curtiss develops an analytic formula for bounds on the conditional mean and variance of the Monte Carlo sample estimate, V̂, and the mean and variance of a sample path's time to absorption, then appeals to the Central Limit Theorem to establish a 95%-confidence interval for the complexity of his method to reduce the initial error by a given factor, ε.¹ \n\nFigure 1: Break-even size of state space (n) versus accuracy (1/ε), with curves for γ = .5, .7, and .9. \n\nFor the case of value estimation, Curtiss' formula for the Monte Carlo complexity may be written as \n\nWORK_Monte Carlo = (γ / (1 − γ)) (1 + 2²/ε²). (3) \n\nThis is compared to the complexity of the iterative method, which for the value-estimation problem takes the form of the classical dynamic programming recursion, V^(n+1) = R + γPV^(n): \n\nWORK_iterative = (1 + log ε / log γ) n² + n. (4) \n\nThe iterative method's complexity has the form an² + n, with a > 1, while the Monte Carlo complexity is independent of n; it is most sensitive to the amount of error reduction desired, signified by ε. Thus, given a fixed amount of computation, for large enough n, the Monte Carlo method is likely (with 95% confidence level) to produce better estimates. The theoretical \"break-even\" points are plotted in Figure 1, and Figure 2 plots work versus state-space size for example values of γ and ε. 
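For the value-estimation case with P̂ = P and P_step = γ, the Ulam/von Neumann estimator described in Section 2 reduces to averaging Z = R_{i_k}/(1 − γ) over geometrically stopped paths. A minimal sketch on a hypothetical two-state chain (illustrative numbers; the exact V is computed here only to check the estimate):

```python
import random

# Hypothetical 2-state chain and rewards (illustrative values).
P = [[0.9, 0.1],
     [0.2, 0.8]]
R = [1.0, 0.0]
gamma = 0.5

def estimate_V(i, n_samples, rng):
    # Ulam/von Neumann estimate of V_i with P_hat = P and P_step = gamma:
    # run a geometrically stopped walk and score Z = R[last]/(1 - gamma).
    total = 0.0
    for _ in range(n_samples):
        state = i
        while rng.random() < gamma:          # continue with probability gamma
            r, acc = rng.random(), 0.0
            for j, p in enumerate(P[state]):
                acc += p
                if r < acc:
                    state = j
                    break
        total += R[state] / (1.0 - gamma)    # Z for this path
    return total / n_samples

rng = random.Random(0)
est = estimate_V(0, 100000, rng)
# For this toy chain, solving (I - gamma*P)V = R exactly gives V_0 = 0.6/0.325.
print(abs(est - 0.6 / 0.325) < 0.02)  # True
```

Note that nothing in the inner loop touches states other than those visited, which is the sense in which a single component of V can be estimated without solving for all components.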
\n\n¹That is, for the iterative method, ε is defined via ||V(∞) − V(n)|| < ε ||V(∞) − V(0)||, while for the Monte Carlo method, ε is defined via |V(∞)(i) − V̄_M| < ε ||V(∞) − V(0)||, where V̄_M is the average over M sample V̂'s. \n\nFigure 2: Work versus number of states (n) for γ = .5 and ε = .01, with curves for the Iterative, Monte Carlo, and Gauss methods. \n\n4 Discussion \n\nIt was noted that the analytic complexity Curtiss develops is for the work required to compute one component of a solution vector. In the worst case, all components could be estimated by constructing n separate, independent estimators. This would multiply the Monte Carlo complexity by a factor of n, and its scaling supremacy would be only marginally preserved. A more efficient approach would utilize data obtained in the course of estimating one component to estimate other components as well; Rubinstein (Rubinstein, 1981) describes one way of doing this, using the notion of \"covering paths.\" Also, it should be mentioned that substituting more sophisticated iterative methods, such as Gauss-Seidel, in place of the simple successive approximation scheme considered here serves only to improve the condition number of the underlying iterative operator; the amount of computation required by iterative methods remains an² + n, for some a > 1. \n\nAn attractive feature of the analysis provided by Curtiss is that, in effect, it yields information regarding the convergence rate of the method; that is, Equation 4 can be re-arranged in terms of ε. Figure 3 plots ε versus work for example values of γ and n. 
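The comparison can be made concrete by solving for the break-even state-space size numerically. The sketch below assumes the complexity expressions take the forms given above, WORK_iterative = (1 + log ε / log γ) n² + n and WORK_Monte Carlo = (γ/(1 − γ))(1 + 2²/ε²); the latter is a reconstruction of Curtiss' bound, so the constants should be treated as indicative rather than exact:

```python
import math

def work_iter(n, gamma, eps):
    # Iterative (successive approximation) work: has the form a*n**2 + n.
    return (1 + math.log(eps) / math.log(gamma)) * n**2 + n

def work_mc(gamma, eps):
    # Monte Carlo work: independent of n, driven by the error reduction eps.
    return gamma / (1 - gamma) * (1 + 4 / eps**2)

gamma, eps = 0.5, 0.01                     # the Figure 2 setting
break_even = next(n for n in range(1, 10**6)
                  if work_iter(n, gamma, eps) > work_mc(gamma, eps))
print(break_even)  # well below n = 100, consistent with Figure 2
```

Past this n, the fixed Monte Carlo budget undercuts the n²-growth of the iterative sweep.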
\n\nThe simple Monte Carlo scheme considered here is practically identical to the limiting case of TD(λ) with λ equal to one (TD(1) differs in that its averaging of sampled, discounted returns is weighted with recency). Ongoing work (Duff) explores the connection between TD(λ) (Sutton, 1988), for general values of λ, and Monte Carlo methods augmented by certain variance reduction techniques. Also, Barnard (Barnard) has noted that TD(0) may be viewed as a stochastic approximation method for solving (1). \n\nFigure 3: Error reduction (ε) versus work for γ = .9 and n = 100, with curves for the Iterative and Monte Carlo methods. \n\nOn-line RL methods for solving Markov Decision Problems, such as Real-Time Dynamic Programming (RTDP) (Barto et al., 1991), share key features with the Monte Carlo method. As with many algorithms, RTDP does not require explicit knowledge of the transition matrix, P, and neither, of course, do the Monte Carlo algorithms. RTDP approximates solutions to Markov Decision Problems through repeated trials of simulated or actual control, focusing computation upon regions of the state space likely to be relevant in actual control. This computational \"focusing\" is also a feature of the Monte Carlo algorithms. They focus computation in an obvious way by computing approximate solutions for single components of solution vectors without the labor required to compute all solution components, but a more subtle form of computational focusing also occurs: some of the terms in the Neumann series (2) may be very unimportant and need not be represented in the statistical estimator at all. 
The Monte Carlo method's stochastic estimation process achieves this automatically by, in effect, making the appearance of a representative of a non-essential term a very rare event. \n\nThese correspondences (between TD(0) and stochastic approximation, between TD(λ) and Monte Carlo methods with variance reduction, and between DP-based RL algorithms for solving Markov Decision Problems and Monte Carlo algorithms), together with the comparatively favorable scaling and convergence properties enjoyed by the simple Monte Carlo method discussed in this paper, suggest that DP-based RL methods like TD/stochastic-approximation or RTDP, though perceived to be slow, may actually be advantageous for problems having a sufficiently large number of states. \n\nAcknowledgement \n\nThis material is based upon work supported by the National Science Foundation under Grant ECS-9214866. \n\nReferences \n\nE. Barnard. Temporal-Difference Methods and Markov Models. Submitted for publication. \n\nA. Barto, S. Bradtke, & S. Singh. (1991) Real-Time Learning and Control Using Asynchronous Dynamic Programming. Computer Science Department, University of Massachusetts, Tech. Rept. 91-57. \n\nJ. Curtiss. (1954) A Theoretical Comparison of the Efficiencies of Two Classical Methods and a Monte Carlo Method for Computing One Component of the Solution of a Set of Linear Algebraic Equations. In H. A. Meyer (ed.), Symposium on Monte Carlo Methods, 191-233. New York, NY: Wiley. \n\nM. Duff. A Control Variate Perspective for the Optimal Weighting of Truncated, Corrected Returns. In Preparation. \n\nG. Forsythe & R. Leibler. (1950) Matrix Inversion by a Monte Carlo Method. Math. Tables Other Aids Comput., 4:127-129. \n\nR. Howard. (1960) Dynamic Programming and Markov Processes. Cambridge, MA: MIT Press. \n\nR. Rubinstein. (1981) Simulation and the Monte Carlo Method. New York, NY: Wiley. \n\nR. Sutton. 
(1988) Learning to Predict by the Method of Temporal Differences. Machine Learning 3:9-44. \n\nG. Tesauro. (1992) Practical Issues in Temporal Difference Learning. Machine Learning 8:257-277. \n\nW. Wasow. (1952) A Note on the Inversion of Matrices by Random Walks. Math. Tables Other Aids Comput., 6:78-81. \n", "award": [], "sourceid": 865, "authors": [{"given_name": "Andrew", "family_name": "Barto", "institution": null}, {"given_name": "Michael", "family_name": "Duff", "institution": null}]}