Model-Free Least-Squares Policy Iteration
Advances in Neural Information Processing Systems, pp. 1547-1554

Michail G. Lagoudakis
Department of Computer Science
Duke University
Durham, NC 27708
mgl@cs.duke.edu

Ronald Parr
Department of Computer Science
Duke University
Durham, NC 27708
parr@cs.duke.edu

Abstract

We propose a new approach to reinforcement learning which combines least-squares function approximation with policy iteration. Our method is model-free and completely off-policy. We are motivated by the least-squares temporal difference learning algorithm (LSTD), which is known for its efficient use of sample experiences compared to pure temporal difference algorithms. LSTD is ideal for prediction problems; however, it has heretofore not had a straightforward application to control problems. Moreover, approximations learned by LSTD are strongly influenced by the visitation distribution over states. Our new algorithm, Least-Squares Policy Iteration (LSPI), addresses these issues. The result is an off-policy method which can use (or reuse) data collected from any source. We have tested LSPI on several problems, including a bicycle simulator in which it learns to guide the bicycle to a goal efficiently by merely observing a relatively small number of completely random trials.

1 Introduction

Linear least-squares function approximators offer many advantages in the context of reinforcement learning. While their ability to generalize is less powerful than that of black-box methods such as neural networks, they have their virtues: they are easy to implement and use, and their behavior is fairly transparent, both from an analysis standpoint and from a debugging and feature-engineering standpoint.
When linear methods fail, it is usually relatively easy to get some insight into why the failure has occurred.

Our enthusiasm for this approach is inspired by the least-squares temporal difference learning algorithm (LSTD) [4]. LSTD makes efficient use of data and converges faster than other conventional temporal difference learning methods. Although it is initially appealing to attempt to use LSTD in the evaluation step of a policy iteration algorithm, this combination can be problematic. Koller and Parr [5] present an example where the combination of LSTD-style function approximation and policy iteration oscillates between two bad policies in an MDP with just 4 states. This behavior is explained by the fact that linear approximation methods such as LSTD compute an approximation that is weighted by the state visitation frequencies of the policy under evaluation. Further, even if this problem is overcome, a more serious difficulty is that the state value function that LSTD learns is of no use for policy improvement when a model of the process is not available.

This paper introduces the Least-Squares Policy Iteration (LSPI) algorithm, which extends the benefits of LSTD to control problems. First, we introduce LSQ, an algorithm that learns least-squares approximations of the state-action (Q) value function, thus permitting action selection and policy improvement without a model. Next, we introduce LSPI, which uses the results of LSQ to form an approximate policy iteration algorithm. This algorithm combines the policy search efficiency of policy iteration with the data efficiency of LSTD. It is completely off-policy and can, in principle, use data collected from any reasonable sampling distribution.
We have evaluated this method on several problems, including a simulated bicycle control problem in which LSPI learns to guide the bicycle to the goal by observing a relatively small number of completely random trials.

2 Markov Decision Processes

We assume that the underlying control problem is a Markov Decision Process (MDP). An MDP is defined as a 4-tuple $(S, A, P, R)$, where: $S$ is a finite set of states; $A$ is a finite set of actions; $P$ is a Markovian transition model, where $P(s, a, s')$ represents the probability of going from state $s$ to state $s'$ with action $a$; and $R$ is a reward function $R : S \times A \times S \to \mathbb{R}$, such that $R(s, a, s')$ represents the reward obtained when taking action $a$ in state $s$ and ending up in state $s'$.

We will be assuming that the MDP has an infinite horizon and that future rewards are discounted exponentially with a discount factor $\gamma \in [0, 1)$. (If we assume that all policies are proper, our results generalize to the undiscounted case.) A stationary policy $\pi$ for an MDP is a mapping $\pi : S \to A$, where $\pi(s)$ is the action the agent takes at state $s$. The state-action value function $Q^\pi(s, a)$ is defined over all possible combinations of states and actions and indicates the expected, discounted total reward when taking action $a$ in state $s$ and following policy $\pi$ thereafter.
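To make this definition concrete, a small tabular MDP can be represented directly by its transition and reward maps. The two-state, two-action chain below is our own illustrative example, not one from the paper; `expected_reward` computes the expected one-step reward of a state-action pair.

```python
# A minimal tabular MDP (S, A, P, R) -- an illustrative toy example,
# not taken from the paper. Action 1 moves to state 1 (reward 1);
# action 0 moves to state 0 (reward 0).

S = [0, 1]          # finite set of states
A = [0, 1]          # finite set of actions
GAMMA = 0.9         # discount factor

# P[s][a] maps next state s2 -> P(s, a, s2)
P = {
    0: {0: {0: 1.0}, 1: {1: 1.0}},
    1: {0: {0: 1.0}, 1: {1: 1.0}},
}
# R[(s, a, s2)] = reward; unlisted triples have reward 0
R = {(0, 1, 1): 1.0, (1, 1, 1): 1.0}

def expected_reward(s, a):
    """Expected one-step reward: sum over s2 of P(s, a, s2) * R(s, a, s2)."""
    return sum(p * R.get((s, a, s2), 0.0) for s2, p in P[s][a].items())
```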
The exact $Q^\pi$-values for all state-action pairs can be found by solving the linear system of the Bellman equations:

$$Q^\pi(s, a) = R(s, a) + \gamma \sum_{s'} P(s, a, s')\, Q^\pi(s', \pi(s'))\ ,$$

where $R(s, a) = \sum_{s'} P(s, a, s')\, R(s, a, s')$. In matrix format, the system becomes $Q^\pi = R + \gamma P^\pi Q^\pi$, where $Q^\pi$ and $R$ are vectors of size $|S||A|$ and $P^\pi$ is a stochastic matrix of size $(|S||A| \times |S||A|)$ that describes the transitions from pairs $(s, a)$ to pairs $(s', \pi(s'))$.

For every MDP, there exists an optimal policy, $\pi^*$, which maximizes the expected, discounted return of every state. Policy iteration is a method of discovering this policy by iterating through a sequence of monotonically improving policies. Each iteration consists of two phases. Value determination computes the state-action values for a policy $\pi^{(t)}$ by solving the above system. Policy improvement defines the next policy as

$$\pi^{(t+1)}(s) = \arg\max_a Q^{\pi^{(t)}}(s, a)\ .$$

These steps are repeated until convergence to an optimal policy, often in a surprisingly small number of steps.

3 Least-Squares Approximation of Q Functions

Policy iteration relies upon the solution of a system of linear equations to find the Q values for the current policy. This is impractical for large state and action spaces. In such cases, we may wish to approximate $Q^\pi$ with a parametric function approximator and do some form of approximate policy iteration. We now address the problem of finding a set of parameters that maximizes the accuracy of our approximator. A common class of approximators is the so-called linear architectures, where the value function is approximated as a linear weighted combination of basis functions (features):

$$\widehat{Q}^\pi(s, a) = \sum_{j=1}^{k} \phi_j(s, a)\, w_j = \phi(s, a)^\top w\ ,$$

where $w$ is a set of weights (parameters). In general, $k \ll |S||A|$, and so the linear system above now becomes an overconstrained system over the $k$ parameters $w$:

$$\Phi w \approx R + \gamma P^\pi \Phi w\ ,$$

where $\Phi$ is a $(|S||A| \times k)$ matrix whose rows are the feature vectors $\phi(s, a)^\top$. We are interested in a set of weights $w$ that yields a fixed point in value function space, that is, a value function $\widehat{Q}^\pi = \Phi w$ that is invariant under one step of value determination followed by orthogonal projection to the space spanned by the basis functions.
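As a concrete instance of such a linear architecture (our own illustration, with hypothetical names), an indicator basis over the state-action pairs of a two-state, two-action problem makes the representation exact:

```python
# Linear architecture: Q_hat(s, a) = sum_j phi_j(s, a) * w_j.
# Here phi is an indicator basis over the 4 (s, a) pairs of a toy
# problem with 2 states and 2 actions, so k = |S||A| = 4 (exact case).

K = 4  # number of basis functions

def phi(s, a):
    """Feature vector phi(s, a) for s, a in {0, 1}."""
    v = [0.0] * K
    v[2 * s + a] = 1.0
    return v

def q_hat(w, s, a):
    """Approximate Q value: the weighted combination of basis functions."""
    return sum(fj * wj for fj, wj in zip(phi(s, a), w))
```

With richer basis functions (e.g., radial basis functions over a continuous state), $k$ stays small while $|S||A|$ may be huge or infinite.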
Assuming that the columns of $\Phi$ are linearly independent, this fixed point is

$$w^\pi = \left( \Phi^\top (\Phi - \gamma P^\pi \Phi) \right)^{-1} \Phi^\top R\ .$$

For any $P^\pi$, the solution is guaranteed to exist for all but finitely many values of $\gamma$ [5]. We note that this is the standard fixed-point approximation method for linear value functions, with the exception that the problem is formulated in terms of Q values instead of state values.

4 LSQ: Learning the State-Action Value Function

In the previous section, we assumed that a model $(P, R)$ of the underlying MDP is available. In many practical applications, such a model is not available, and the value function or, more precisely, its parameters have to be learned from sampled data. These sampled data are tuples of the form $(s, a, r, s')$, meaning that in state $s$, action $a$ was taken, a reward $r$ was received, and the resulting state was $s'$. These data can be collected from actual (sequential) episodes or from random queries to a generative model of the MDP. In the extreme case, they can be experiences of other agents on the same MDP. We know that the desired set of weights can be found as the solution of the system $A w^\pi = b$, where $A = \Phi^\top (\Phi - \gamma P^\pi \Phi)$ and $b = \Phi^\top R$. The matrix $A$ and the vector $b$ are unknown and cannot be determined a priori. However, $A$ and $b$ can be approximated using samples. Recall that $\Phi$, $P^\pi \Phi$, and $R$ are of the form (one row for each pair $(s, a)$):

$$\Phi = \begin{pmatrix} \vdots \\ \phi(s, a)^\top \\ \vdots \end{pmatrix}, \qquad
P^\pi \Phi = \begin{pmatrix} \vdots \\ \sum_{s'} P(s, a, s')\, \phi(s', \pi(s'))^\top \\ \vdots \end{pmatrix}, \qquad
R = \begin{pmatrix} \vdots \\ \sum_{s'} P(s, a, s')\, R(s, a, s') \\ \vdots \end{pmatrix}.$$

Given a set of samples $D = \{ (s_i, a_i, r_i, s'_i) \mid i = 1, 2, \ldots, L \}$, where the $(s_i, a_i)$ are sampled according to a distribution $\mu$ and the $s'_i$ are sampled according to $P(s_i, a_i, \cdot)$, we can construct approximate versions of $\Phi$, $P^\pi \Phi$, and $R$ as follows:

$$\widehat{\Phi} = \begin{pmatrix} \phi(s_1, a_1)^\top \\ \vdots \\ \phi(s_L, a_L)^\top \end{pmatrix}, \qquad
\widehat{P^\pi \Phi} = \begin{pmatrix} \phi(s'_1, \pi(s'_1))^\top \\ \vdots \\ \phi(s'_L, \pi(s'_L))^\top \end{pmatrix}, \qquad
\widehat{R} = \begin{pmatrix} r_1 \\ \vdots \\ r_L \end{pmatrix}.$$

These approximations can be thought of as first sampling rows from $\Phi$ according to $\mu$ and then, conditioned on these samples, sampling terms from the summations in the corresponding rows of $P^\pi \Phi$ and $R$. The sampling distribution over the summations is governed by the underlying dynamics $P(s, a, s')$ of the process, as the samples in $D$ are taken directly from the MDP. Given $\widehat{\Phi}$, $\widehat{P^\pi \Phi}$, and $\widehat{R}$, the matrix $A$ and the vector $b$ can be approximated as

$$\widetilde{A} = \widehat{\Phi}^\top \bigl( \widehat{\Phi} - \gamma\, \widehat{P^\pi \Phi} \bigr) \qquad \text{and} \qquad \widetilde{b} = \widehat{\Phi}^\top \widehat{R}\ .$$

With $L$ uniformly distributed samples over pairs of states and actions $(s, a)$, the approximations $\widetilde{A}$ and $\widetilde{b}$ are consistent approximations of the true $A$ and $b$:

$$\widetilde{A} = \frac{L}{|S||A|}\, A \qquad \text{and} \qquad \widetilde{b} = \frac{L}{|S||A|}\, b\ ,$$

and the Markov property ensures that the solution $\widetilde{w}^\pi = \widetilde{A}^{-1} \widetilde{b}$ converges to the true solution $w^\pi$ for sufficiently large $L$, whenever $w^\pi$ exists. In the more general case, where $\mu$ is not uniform, we will compute a weighted projection, which minimizes the $\mu$-weighted distance in the projection step.
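In code, the sampled matrices and the products $\widetilde{A} = \widehat{\Phi}^\top(\widehat{\Phi} - \gamma\, \widehat{P^\pi \Phi})$ and $\widetilde{b} = \widehat{\Phi}^\top \widehat{R}$ can be formed directly. The sketch below is our own illustration (a hypothetical two-state, two-action problem with an indicator basis), not the authors' code:

```python
# Batch construction of A_tilde and b_tilde from a sample set.
# phi is an indicator basis over the 4 (s, a) pairs of a toy 2x2 MDP.

GAMMA = 0.9
K = 4  # number of basis functions

def phi(s, a):
    v = [0.0] * K
    v[2 * s + a] = 1.0
    return v

def build_system(samples, pi):
    """Given samples (s, a, r, s2) and a policy pi (dict: state -> action),
    build one row per sample and form
      A_tilde = Phi_hat^T (Phi_hat - GAMMA * PPhi_hat),
      b_tilde = Phi_hat^T R_hat."""
    n = len(samples)
    Phi_hat  = [phi(s, a)       for (s, a, r, s2) in samples]
    PPhi_hat = [phi(s2, pi[s2]) for (s, a, r, s2) in samples]
    R_hat    = [r               for (s, a, r, s2) in samples]
    A = [[sum(Phi_hat[i][p] * (Phi_hat[i][q] - GAMMA * PPhi_hat[i][q])
              for i in range(n))
          for q in range(K)]
         for p in range(K)]
    b = [sum(Phi_hat[i][p] * R_hat[i] for i in range(n)) for p in range(K)]
    return A, b
```

Because each sample contributes additively, systems built from disjoint sample sets can simply be summed.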
Thus, state\u000f\nassigned weightg\u0012\u0001!\u000fK\r and the projection minimizes the weighted sum of squared errors\nwith respect tog . In LSTD, for example,g\nis the stationary distribution ofP\nAs with LSTD, it is easy to see that approximations (\u0001\nsets of samples (Z\nThis observation leads to an incremental update rule for\u0001\n( and\u0001\n\r@\r\u0003\n\u0010\n\n\u001e\u0002\u00016\u0004\n \u0003\u0001\n . Assume that initially\u0001\nand\u0001\n( . For a \ufb01xed policy, a new sample \u0001!\u000f6\u0004\n\u0011\u0012\u0004\u0003\u001c\u0016\u0004\n\u000f\u0016\u0013\u0015\r contributes to the approximation\n\u001e\u0018;\n\u0001\u0003\u000f\u0010\u0004\u0007\u00117\r\u0013\u001c\n \u0006\u0004\n\n3 will converge to the true solution\u0002\n\u0002DTUT\n\n) derived from different\n) can be combined additively to yield a better approximation that\n\nis not uniform, we will compute a weighted projection,\nis implicitly\n\nweight to frequently visited states, and low weight to infrequently visited states.\n\nWe call this new algorithm LSQ due to its similarity to LSTD. However, unlike LSTD, it\ncomputes Q functions and does not expect the data to come from any particular Markov\nchain. It is a feature of this algorithm that it can use the same set of samples to compute Q\n\ncorresponds to the combined set of samples:\n\naccording to the following update equation :\n\n\u0001\u0003\u000f\u0010\u0004\n\u00115\r\u001d\n\nis added to \u0001\n\n\u0001!\u000f\u0016\u0013\u0003\u0004\u0007,2\u0001\u0003\u000fK\u0013!\u0004@,2\u0001!\u000fH\u0013\u0015\r@\r\u0007\n\nfor each sample. 
Thus,\nLSQ can use every single sample available to it no matter what policy is under evaluation.\nWe note that if a particular set of projection weights are desired, it is straightforward to\n\npolicy merely determines which\b\nvalues for any policy representation that offers an action choice for each\u000f\nreweight the samples as they are added to \u0001\n\u0007J\u00017\nof the size of the state and the action space. For each sample inZ\n\u0007J\u00017\t\b\"\r\n\u0007J\u0001\n\n, LSQ incurs a cost of\nto solve the system and\n\ufb01nd the weights. Singular value decomposition (SVD) can be used for robust inversion of\n\n and a one time cost of\n\nto update the matrices \u0001\n\n space independently\n\nNotice that apart from storing the samples, LSQ requires only\n\n\u0013 in the set. The\n\nas it is not always a full rank matrix.\n\nand\u0001\n\n.\n\n\u0001\u0003\u000f\u0010\u0004\n\u00115\r+\u0001\n\n\u0001!\u000f\n\n\u0004\u0007,2\u0001\u0003\u000f\n\nand\n\n\u0004\u001dZ\n\nand\n\n\u001e\u0005\u0004\n\n .;\n\nLSQ includes LSTD as a special case where there is only one action available. 
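The update rule translates directly into code. Below is our own minimal sketch (hypothetical toy problem, indicator basis), accumulating $\widetilde{A}$ and $\widetilde{b}$ one sample at a time and then solving $\widetilde{A} w = \widetilde{b}$; a small Gaussian-elimination helper stands in for the SVD-based inversion mentioned above:

```python
# Incremental LSQ:
#   A_tilde += phi(s,a) (phi(s,a) - gamma * phi(s2, pi(s2)))^T
#   b_tilde += phi(s,a) * r
# then solve A_tilde w = b_tilde.

GAMMA = 0.9
K = 4  # indicator basis over the 4 (s, a) pairs of a toy 2x2 MDP

def phi(s, a):
    v = [0.0] * K
    v[2 * s + a] = 1.0
    return v

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting
    (a stand-in for the SVD-based solve suggested in the text)."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for j in range(c, n + 1):
                M[r][j] -= f * M[c][j]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][j] * x[j] for j in range(r + 1, n))) / M[r][r]
    return x

def lsq(samples, pi):
    """Evaluate policy pi (dict: state -> action) from samples (s, a, r, s2)."""
    A = [[0.0] * K for _ in range(K)]
    b = [0.0] * K
    for (s, a, r, s2) in samples:
        f, f2 = phi(s, a), phi(s2, pi[s2])
        for i in range(K):
            for j in range(K):
                A[i][j] += f[i] * (f[j] - GAMMA * f2[j])
            b[i] += f[i] * r
    return solve(A, b)
```

On a deterministic two-state chain whose samples cover each $(s, a)$ pair once, evaluating the "always take action 1" policy recovers $w = (9, 10, 9, 10)$, i.e., $Q(1, 1) = 1/(1 - \gamma) = 10$.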
It is also possible to extend LSQ to LSQ($\lambda$) in a way that closely resembles LSTD($\lambda$) [3], but in that case the sample set must consist of complete episodes generated using the policy under evaluation, which again raises the question of bias due to the sampling distribution and prevents the reusability of samples. LSQ is also applicable in the case of infinite and continuous state and/or action spaces with no modification. States and actions are reflected only through the basis functions of the linear approximation, and the resulting value function can cover the entire state-action space with an appropriate set of continuous basis functions.

5 LSPI: Least-Squares Policy Iteration

The LSQ algorithm provides a means of learning an approximate state-action value function, $\widehat{Q}^\pi(s, a)$, for any fixed policy $\pi$. We now integrate LSQ into an approximate policy iteration algorithm. Clearly, LSQ is a candidate for the value determination step. The key insight is that we can achieve the policy improvement step without ever explicitly representing our policy and without any sort of model.
Recall that in policy improvement, $\pi^{(t+1)}$ will pick the action $a$ that maximizes $Q^{\pi^{(t)}}(s, a)$. Since LSQ computes Q functions directly, we do not need a model to determine our improved policy; all the information we need is contained implicitly in the weights parameterizing our Q functions¹:

$$\pi^{(t+1)}(s) = \arg\max_a \widehat{Q}^{\pi^{(t)}}(s, a) = \arg\max_a \phi(s, a)^\top w^{\pi^{(t)}}\ .$$

We close the loop simply by requiring that LSQ performs this maximization for each $s'$ when constructing the $\widetilde{A}$ matrix for a policy. For very large or continuous action spaces, explicit maximization over $a$ may be impractical. In such cases, some sort of global nonlinear optimization may be required to determine the optimal action.

Since LSPI uses LSQ to compute approximate Q functions, it can use any data source for samples. A single set of samples may be used for the entire optimization, or additional samples may be acquired, either through trajectories or some other scheme, for each iteration of policy iteration. We summarize the LSPI algorithm in Figure 1. As with any approximate policy iteration algorithm, the convergence of LSPI is not guaranteed. Approximate policy iteration variants are typically analyzed in terms of a value function approximation error and an action selection error [2].
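Combining LSQ-style value determination with this implicit greedy improvement yields the full loop summarized in Figure 1. The following is our own self-contained toy implementation (deterministic two-state MDP, indicator basis, hypothetical names), not the authors' code; it iterates until the weight vector stabilizes:

```python
# LSPI sketch: alternate value determination (LSQ) with implicit greedy
# policy improvement, pi(s) = argmax_a phi(s, a) . w, until w stabilizes.

GAMMA = 0.9
K = 4  # indicator basis over the 4 (s, a) pairs of a toy 2x2 MDP

def phi(s, a):
    v = [0.0] * K
    v[2 * s + a] = 1.0
    return v

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for j in range(c, n + 1):
                M[r][j] -= f * M[c][j]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][j] * x[j] for j in range(r + 1, n))) / M[r][r]
    return x

def greedy(w, s, actions=(0, 1)):
    """Implicit policy improvement: argmax_a phi(s, a) . w (no model needed)."""
    return max(actions,
               key=lambda a: sum(f * wi for f, wi in zip(phi(s, a), w)))

def lsq(samples, w):
    """Value determination for the greedy policy induced by the weights w."""
    A = [[0.0] * K for _ in range(K)]
    b = [0.0] * K
    for (s, a, r, s2) in samples:
        f, f2 = phi(s, a), phi(s2, greedy(w, s2))
        for i in range(K):
            for j in range(K):
                A[i][j] += f[i] * (f[j] - GAMMA * f2[j])
            b[i] += f[i] * r
    return solve(A, b)

def lspi(samples, w0=None, eps=1e-9, max_iter=20):
    """Repeat w' = LSQ(D, w) until ||w' - w|| < eps (cf. Figure 1)."""
    w = w0 if w0 is not None else [0.0] * K
    for _ in range(max_iter):
        w_new = lsq(samples, w)
        if max(abs(x - y) for x, y in zip(w_new, w)) < eps:
            return w_new
        w = w_new
    return w

# One sample per (s, a) pair of a deterministic 2-state chain:
# action 1 moves to state 1 (reward 1), action 0 moves to state 0 (reward 0).
samples = [(0, 0, 0.0, 0), (0, 1, 1.0, 1), (1, 0, 0.0, 0), (1, 1, 1.0, 1)]
```

Here `lspi(samples)` converges in a few iterations to the optimal policy (always take action 1) with $Q(1, 1) = 1/(1 - \gamma) = 10$; note that the same fixed sample set is reused to evaluate every intermediate policy, which is exactly the data reuse emphasized in the text.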
LSPI does not require an approximate policy representation, e.g., a policy function or "actor" architecture, removing one source of error. Moreover, the direct computation of linear Q functions from any data source, including stored data, allows the use of all available data to evaluate every policy, making the problem of minimizing value function approximation error more manageable.

6 Results

We initially tested LSPI on variants of the problematic MDP from Koller and Parr [5], essentially simple chains of varying length. LSPI easily found the optimal policy within a few iterations using actual trajectories. We also tested LSPI on the inverted pendulum problem, where the task is to balance a pendulum in the upright position by moving the cart to which it is attached. Using a simple set of basis functions and samples collected from random episodes (starting in the upright position and following a purely random policy), LSPI was able to find excellent policies using a few hundred such episodes [7].

Finally, we tried a bicycle balancing problem [12] in which the goal is to learn to balance and ride a bicycle to a target position located 1 km away from the starting location. Initially, the bicycle's orientation is at an angle of 90° to the goal.

¹This is the same principle that allows action selection without a model in Q-learning.
To our knowledge, this is the first application of this principle in an approximate policy iteration algorithm.

LSPI (k, φ, γ, ε, π₀, D₀)
    // k  : number of basis functions
    // φ  : basis functions
    // γ  : discount factor
    // ε  : stopping criterion
    // π₀ : initial policy, given as w₀ (default: w₀ = 0)
    // D₀ : initial set of samples, possibly empty
    π' = π₀                       // in essence, w' = w₀
    repeat
        Update D (optional)       // add/remove samples, or leave unchanged
        π = π'                    // w = w'
        π' = LSQ(D, k, φ, γ, π)   // w' = LSQ(D, k, φ, γ, w)
    until (π ≈ π')                // that is, ||w' − w|| < ε
    return π                      // return w

Figure 1: The LSPI algorithm.

The state description is a six-dimensional vector $(\theta, \dot{\theta}, \omega, \dot{\omega}, \ddot{\omega}, \psi)$, where $\theta$ is the angle of the handlebar, $\omega$ is the vertical angle of the bicycle, and $\psi$ is the angle of the bicycle to the goal. The state-action value function is approximated by a linear weighted combination of 20 basis functions. The actions are the torque $\tau$ applied to the handlebar (discretized to