{"title": "Linear Complementarity for Regularized Policy Evaluation and Improvement", "book": "Advances in Neural Information Processing Systems", "page_first": 1009, "page_last": 1017, "abstract": "Recent work in reinforcement learning has emphasized the power of L1 regularization to perform feature selection and prevent overfitting. We propose formulating the L1 regularized linear fixed point problem as a linear complementarity problem (LCP). This formulation offers several advantages over the LARS-inspired formulation, LARS-TD. The LCP formulation allows the use of efficient off-the-shelf solvers, leads to a new uniqueness result, and can be initialized with starting points from similar problems (warm starts). We demonstrate that warm starts, as well as the efficiency of LCP solvers, can speed up policy iteration. Moreover, warm starts permit a form of modified policy iteration that can be used to approximate a greedy\" homotopy path, a generalization of the LARS-TD homotopy path that combines policy evaluation and optimization.\"", "full_text": "Linear Complementarity for Regularized Policy\n\nEvaluation and Improvement\n\nJeff Johns\n\nRonald Parr\n\nChristopher Painter-Wake\ufb01eld\nDepartment of Computer Science\n\nDuke University\nDurham, NC 27708\n\n{johns, paint007, parr}@cs.duke.edu\n\nAbstract\n\nRecent work in reinforcement learning has emphasized the power of L1 regular-\nization to perform feature selection and prevent over\ufb01tting. We propose formulat-\ning the L1 regularized linear \ufb01xed point problem as a linear complementarity prob-\nlem (LCP). This formulation offers several advantages over the LARS-inspired\nformulation, LARS-TD. The LCP formulation allows the use of ef\ufb01cient off-the-\nshelf solvers, leads to a new uniqueness result, and can be initialized with starting\npoints from similar problems (warm starts). We demonstrate that warm starts, as\nwell as the ef\ufb01ciency of LCP solvers, can speed up policy iteration. 
Moreover,\nwarm starts permit a form of modi\ufb01ed policy iteration that can be used to approxi-\nmate a \u201cgreedy\u201d homotopy path, a generalization of the LARS-TD homotopy path\nthat combines policy evaluation and optimization.\n\nIntroduction\n\n1\nL1 regularization has become an important tool over the last decade with a wide variety of ma-\nchine learning applications. In the context of linear regression, its use helps prevent over\ufb01tting and\nenforces sparsity in the problem\u2019s solution. Recent work has demonstrated how L1 regularization\ncan be applied to the value function approximation problem in Markov decision processes (MDPs).\nKolter and Ng [1] included L1 regularization within the least-squares temporal difference learning\n[2] algorithm as LARS-TD, while Petrik et al. [3] adapted an approximate linear programming algo-\nrithm. In both cases, L1 regularization automates the important task of selecting relevant features,\nthereby easing the design choices made by a practitioner.\nLARS-TD provides a homotopy method for \ufb01nding the L1 regularized linear \ufb01xed point formulated\nby Kolter and Ng. We reformulate the L1 regularized linear \ufb01xed point as a linear complementarity\nproblem (LCP). This formulation offers several advantages. It allows us to draw upon the rich theory\nof LCPs and optimized solvers to provide strong theoretical guarantees and fast performance. In\naddition, we can take advantage of the \u201cwarm start\u201d capability of LCP solvers to produce algorithms\nthat are better suited to the sequential nature of policy improvement than LARS-TD, which must\nstart from scratch for each new policy.\n\n2 Background\nFirst, we introduce MDPs and linear value function approximation. We then review L1 regulariza-\ntion and feature selection for regression problems. Finally, we introduce LCPs. 
We defer discussion of L1 regularization and feature selection for reinforcement learning (RL) until Section 3.

2.1 MDP and Value Function Approximation Framework
We aim to discover optimal, or near-optimal, policies for Markov decision processes (MDPs) defined by the quintuple M = (S, A, P, R, γ). Given a state s ∈ S, the probability of a transition to a state s′ ∈ S when action a ∈ A is taken is given by P(s′|s, a). The reward function is a mapping from states to real numbers R : S → R. A policy π for M is a mapping from states to actions π : s → a and the transition matrix induced by π is denoted P^π. Future rewards are discounted by γ ∈ [0, 1). The value function at state s for policy π is the expected total γ-discounted reward for following π from s. In matrix-vector form, this is written:

V^π = T^π V^π = R + γ P^π V^π,

where T^π is the Bellman operator for policy π and V^π is the fixed point of this operator. An optimal policy, π*, maximizes state values, has value function V*, and is the fixed point of the T* operator:

T* V(s) = R(s) + γ max_{a∈A} Σ_{s′∈S} P(s′|s, a) V(s′).

Of the many algorithms that exist for finding π*, policy iteration is most relevant to the presentation herein. For any policy π_j, policy iteration computes V^{π_j}, then determines π_{j+1} as the "greedy" policy with respect to V^{π_j}:

π_{j+1}(s) = argmax_{a∈A} [R(s) + γ Σ_{s′∈S} P(s′|s, a) V^{π_j}(s′)].

This is repeated until some convergence condition is met.
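The evaluation/improvement loop just described can be sketched directly for a tabular MDP; a minimal NumPy sketch (the array layout for P and R is an assumption, not from the paper):

```python
import numpy as np

def policy_iteration(P, R, gamma, max_iters=100):
    """Exact policy iteration for a tabular MDP.

    P: array of shape (|A|, |S|, |S|), with P[a, s, s'] = P(s'|s, a)
    R: array of shape (|S|,), state-based rewards
    Returns a greedy-stable policy and its value function.
    """
    n_actions, n_states, _ = P.shape
    pi = np.zeros(n_states, dtype=int)          # arbitrary initial policy
    for _ in range(max_iters):
        # Policy evaluation: solve V = R + gamma * P_pi V exactly
        P_pi = P[pi, np.arange(n_states), :]    # transition matrix induced by pi
        V = np.linalg.solve(np.eye(n_states) - gamma * P_pi, R)
        # Greedy improvement: pi'(s) = argmax_a [R(s) + gamma * sum_s' P(s'|s,a) V(s')]
        Q = R[None, :] + gamma * P @ V          # shape (|A|, |S|)
        pi_new = Q.argmax(axis=0)
        if np.array_equal(pi_new, pi):          # convergence condition
            break
        pi = pi_new
    return pi, V
```

The exact linear solve in the evaluation step is what becomes infeasible for large state spaces, motivating the approximation architecture discussed next.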
For an exact representation of each V^{π_j}, the algorithm will converge to an optimal policy and the unique, optimal value function V*.
The value function, transition model, and reward function are often too large to permit an exact representation. In such cases, an approximation architecture is used for the value function. A common choice is V̂ = Φw, where w is a vector of k scalar weights and Φ stores a set of k features in an n × k matrix with one row per state. Since n is often intractably large, Φ can be thought of as populated by k linearly independent basis functions, φ1 . . . φk, implicitly defining the columns of Φ.
For the purposes of estimating w, it is common to replace Φ with Φ̂, which samples rows of Φ, though for conciseness of presentation we will use Φ for both, since algorithms for estimating w are essentially identical if Φ̂ is substituted for Φ. Typical linear function approximation algorithms [2] solve for the w which is a fixed point:

Φw = Π(R + γΦ′_π w) = Π T^π Φw,

where Π is the L2 projection into the span of Φ and Φ′_π is P^π Φ in the explicit case and composed of sampled next features in the sampled case. Likewise, we overload T^π for the sampled case.

2.2 L1 Regularization and Feature Selection in Regression
In regression, the L1 regularized least squares problem is defined as:

w = argmin_{x∈R^k} (1/2)‖Φx − y‖²₂ + β‖x‖₁,   (1)

where y ∈ R^n is the target function and β ∈ R≥0 is a regularization parameter. This penalized regression problem is equivalent to the Lasso [4], which minimizes the squared residual subject to a constraint on ‖x‖₁.
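Problem (1) can be solved by simple proximal (soft-thresholding) iterations; a minimal ISTA sketch for small dense data (this generic solver is an illustration, not the homotopy method the paper builds on):

```python
import numpy as np

def soft_threshold(v, t):
    """Proximal operator of t*||.||_1 (elementwise soft thresholding)."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def lasso_ista(Phi, y, beta, n_iters=5000):
    """Approximately solve argmin_x 0.5*||Phi x - y||^2_2 + beta*||x||_1."""
    # Step size 1/L, with L the Lipschitz constant of the squared-loss gradient
    L = np.linalg.norm(Phi, 2) ** 2
    x = np.zeros(Phi.shape[1])
    for _ in range(n_iters):
        grad = Phi.T @ (Phi @ x - y)            # gradient of the smooth part
        x = soft_threshold(x - grad / L, beta / L)
    return x
```

With β = 0 this reduces to plain gradient descent on the least squares objective; for large β all coefficients are driven to zero, which is the feature-selection behavior described above.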
The use of the L1 norm in the objective function prevents overfitting, but also serves a secondary purpose of promoting sparse solutions (i.e., coefficient vectors x containing many 0s). Therefore, we can think of L1 regularization as performing feature selection. The Lasso's objective function is convex, ensuring the existence of a global (though not necessarily unique) minimum.
Even though the optimal solution to the Lasso can be computed in a fairly straightforward manner using convex programming, this approach is not very efficient for large problems. This is a motivating factor for the least angle regression (LARS) algorithm [5], which can be thought of as a homotopy method for solving the Lasso for all nonnegative values of β. We do not repeat the details of the algorithm here, but point out that this is easier than it might sound at first because the homotopy path in β-space is piecewise linear (with finitely many segments). Furthermore, there exists a closed form solution for moving from one piecewise linear segment to the next segment. An important benefit of LARS is that it provides solutions for all values of β in a single run of the algorithm. Cross-validation can then be performed to select an appropriate value.

2.3 LCP and BLCP
Given a square matrix M and a vector q, a linear complementarity problem (LCP) seeks vectors w ≥ 0 and z ≥ 0 with wᵀz = 0 and

w = q + Mz.

The problem is thus parameterized by LCP(q, M). Even though LCPs may appear to be simple feasibility problems, the framework is rich enough to express any convex quadratic program.
The bounded linear complementarity problem (BLCP) [6] includes box constraints on z, with each variable zi satisfying one of the conditions:

zi = ui ⟹ wi ≤ 0,   (2a)
zi = li ⟹ wi ≥ 0,   (2b)
li < zi < ui ⟹ wi = 0,   (2c)
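For symmetric positive definite M, the box conditions (2a)-(2c) are exactly the optimality conditions of a box-constrained quadratic program, which a projected Gauss-Seidel sweep can solve; a minimal illustrative sketch (this simple iteration is an assumed stand-in, not one of the pivoting solvers cited in the text):

```python
import numpy as np

def blcp_projected_gauss_seidel(q, M, l, u, n_sweeps=500):
    """Solve the BLCP w = q + M z under box conditions (2a)-(2c).

    Assumes M is symmetric positive definite, so the BLCP is the KKT
    system of: minimize 0.5*z'Mz + q'z subject to l <= z <= u.
    """
    n = len(q)
    z = np.clip(np.zeros(n), l, u)
    for _ in range(n_sweeps):
        for i in range(n):
            # Unconstrained minimizer in coordinate i, projected onto [l_i, u_i]
            r = q[i] + M[i] @ z - M[i, i] * z[i]
            z[i] = np.clip(-r / M[i, i], l[i], u[i])
    w = q + M @ z
    return w, z
```

At a solution, each coordinate either sits strictly inside its box with wi = 0, or on a bound with wi of the matching sign, exactly as in (2a)-(2c).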
The BLCP computes w and z where w = q + Mz and each variable zi meets one of the conditions (2a)-(2c), with bounds −∞ ≤ li < ui ≤ ∞. The parameterization is written BLCP(q, M, l, u). Notice that an LCP is a special case of a BLCP with li = 0 and ui = ∞, ∀i. Like the LCP, the BLCP has a unique solution when M is a P-matrix¹ and there exist algorithms which are guaranteed to find this solution [6, 7]. When the lower and upper bounds on the BLCP are finite, the BLCP can in fact be formulated as an equivalent LCP of twice the dimensionality of the original problem. A full derivation of this equivalence is shown in the appendix (supplementary materials).
There are many algorithms for solving (B)LCPs. Since our approach is not tied to a particular algorithm, we review some general properties of (B)LCP solvers. Optimized solvers can take advantage of sparsity in z. A zero entry in z effectively cancels out a column in M. If M is large, efficient solvers can avoid using M directly, instead using a smaller M′ that is induced by the nonzero entries of z. The columns of M′ can be thought of as the "active" columns, and the procedure of swapping columns in and out of M′ can be thought of as a pivoting operation, analogous to pivots in the simplex algorithm. Another important property of some (B)LCP algorithms is their ability to start from an initial guess at the solution (i.e., a "warm start"). If the initial guess is close to a solution, this can significantly reduce the solver's runtime.
Recently, Kim and Park [8] derived a connection between the BLCP and the Karush-Kuhn-Tucker (KKT) conditions for LARS.
In particular, they noted the solution to the minimization problem in equation (1) has the form:

x = (ΦᵀΦ)⁻¹Φᵀy + (ΦᵀΦ)⁻¹(−c),

which matches w = q + Mz with w ≡ x, q ≡ (ΦᵀΦ)⁻¹Φᵀy, M ≡ (ΦᵀΦ)⁻¹, and z ≡ −c, where the vector −c follows the constraints in equation (2) with li = −β and ui = β. Although we describe the equivalence between the BLCP and LARS optimality conditions using M ≡ (ΦᵀΦ)⁻¹, the inverse can take place inside the BLCP algorithm and this operation is feasible and efficient as it is only done for the active columns of Φ. Kim and Park [8] used a block pivoting algorithm, originally introduced by Júdice and Pires [6], for solving the Lasso. Their experiments show the block pivoting algorithm is significantly faster than both LARS and Feature Sign Search [9].

3 Previous Work
Recent work has emphasized feature selection as an important problem in reinforcement learning [10, 11]. Farahmand et al. [12] consider L2 regularized RL. An L1 regularized Bellman residual minimization algorithm was proposed by Loth et al. [13]². Johns and Mahadevan [14] investigate the combination of least squares temporal difference learning (LSTD) [2] with different variants of the matching pursuit algorithm [15, 16]. Petrik et al. [3] consider L1 regularization in the context of approximate linear programming. Their approach offers some strong guarantees, but is not well-suited to noisy, sampled data.

¹A P-matrix is a matrix for which all principal minors are positive.
²Loth et al. claim to adapt LSTD to L1 regularization, but in fact describe a Bellman residual minimization algorithm and not a fixed point calculation.

The work most directly related to our own is that of Kolter and Ng [1]. They propose augmenting the LSTD algorithm with an L1 regularization penalty.
This results in the following L1 regularized linear fixed point (L1TD) problem:

w = argmin_{x∈R^k} (1/2)‖Φx − (R + γΦ′_π w)‖²₂ + β‖x‖₁.   (3)

Kolter and Ng derive a set of necessary and sufficient conditions characterizing the above fixed point³ in terms of β, w, and a vector c of correlations between the features and the Bellman residual T^π V̂ − V̂. More specifically, the correlation ci associated with feature φi is given by:

ci = φiᵀ(T^π V̂ − V̂) = φiᵀ(R + γΦ′_π w − Φw).   (4)

Introducing the notation I to denote the set of indices of active features in the model (i.e., I = {i : wi ≠ 0}), the fixed point optimality conditions can be summarized as follows:

C1. All features in the active set share the same absolute correlation, β: ∀i ∈ I, |ci| = β.
C2. Inactive features have less absolute correlation than active features: ∀i ∉ I, |ci| < β.
C3. Active features have correlations and weights agreeing in sign: ∀i ∈ I, sgn(ci) = sgn(wi).

Kolter and Ng show that it is possible to find the fixed point using an iterative procedure adapted from LARS. Their algorithm, LARS-TD, computes a sequence of fixed points, each of which satisfies the optimality conditions above for some intermediate L1 parameter β̄ ≥ β. Successive solutions decrease β̄ and are computed in closed form by determining the point at which a feature must be added or removed in order to further decrease β̄ without violating one of the fixed point requirements. The algorithm (as applied to action-value function approximation) is a special case of the algorithm presented in the appendix (see Fig. 2).
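Conditions C1-C3 are easy to verify numerically for a candidate solution; a minimal sketch (the helper name and tolerance handling are assumptions for illustration):

```python
import numpy as np

def is_l1td_fixed_point(Phi, Phi_next, R, gamma, w, beta, tol=1e-8):
    """Check conditions C1-C3 for a candidate L1 regularized fixed point w.

    Phi: n x k feature matrix; Phi_next: n x k next-state features (Phi'_pi);
    R: n-vector of rewards; c is the feature/Bellman-residual correlation (4).
    """
    c = Phi.T @ (R + gamma * (Phi_next @ w) - Phi @ w)
    active = np.abs(w) > tol
    # C1: active features share absolute correlation beta
    if not np.all(np.abs(np.abs(c[active]) - beta) <= tol):
        return False
    # C2: inactive features have absolute correlation at most beta
    # (the strict inequality in C2 is relaxed by tol for numerical checking)
    if not np.all(np.abs(c[~active]) <= beta + tol):
        return False
    # C3: signs of correlations and weights agree on the active set
    return bool(np.all(np.sign(c[active]) == np.sign(w[active])))
```

For example, w = 0 satisfies the conditions exactly when β is at least the largest absolute feature/residual correlation, which is why the homotopy path starts from the all-zero solution at large β.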
Kolter and Ng prove that if Φᵀ(Φ − γΦ′_π) is a P-matrix, then for any β ≥ 0, LARS-TD will find a solution to equation (3).
LARS-TD inherits many of the benefits and limitations of LARS. The fact that it traces an entire homotopy path can be quite helpful because it does not require committing to a particular value of β. On the other hand, the incremental nature of LARS may not be the most efficient solution for any single value of the regularization parameter, as shown by Lee et al. [9] and Kim and Park [8].
It is natural to employ LARS-TD in an iterative manner within the least squares policy iteration (LSPI) algorithm [17], as Kolter and Ng did. In this usage, however, many of the benefits of LARS are lost. When a new policy is selected in the policy iteration loop, LARS-TD must discard its solution from the previous policy and start an entirely new homotopy path, making the value of the homotopy path in this context not entirely clear. One might cross-validate a choice of regularization parameter by measuring the performance of the final policy, but this requires guessing a value of β for all policies and then running LARS-TD up to this value for each policy. If a new value of β is tried, all of the work done for the previous value must be discarded.

4 The L1 Regularized Fixed Point as an LCP
We show that the optimality conditions for the L1TD fixed point correspond to the solution of a (B)LCP. This reformulation allows for (1) new algorithms to compute the fixed point using (B)LCP solvers, and (2) a new guarantee on the uniqueness of a fixed point.
The L1 regularized linear fixed point is described by a vector of correlations c as defined in equation (4).
We introduce the following variables:

A = Φᵀ(Φ − γΦ′_π),   b = ΦᵀR,

that allow equation (4) to be simplified as c = b − Aw. Assuming A is a P-matrix, A is invertible⁴ [18] and we can write:

w = A⁻¹b + A⁻¹(−c),

which has the form w = q + Mz with q ≡ A⁻¹b, M ≡ A⁻¹, and z ≡ −c. Consider a solution (w and z) to the equation above where z is bounded as in equation (2) with l = −β and u = β to specify a BLCP. It is easy to verify that coefficients w satisfying this BLCP achieve the L1TD optimality conditions as detailed in Section 3. Thus, any appropriate solver for the BLCP(A⁻¹b, A⁻¹, −β, β) can be thought of as a linear complementarity approach to solving for the L1TD fixed point. We refer to this class of solvers as LC-TD algorithms and parameterize them as LC-TD(Φ, Φ′_π, R, γ, β).

Proposition 1 If A is a P-matrix, then for any R, the L1 regularized linear fixed point exists, is unique, and will be found by a basic-set BLCP algorithm solving BLCP(A⁻¹b, A⁻¹, −β, β).

This proposition follows immediately from some basic BLCP results. We note that if A is a P-matrix, so is A⁻¹ [18], that BLCPs for P-matrices have a unique solution for any q ([7], Chp. 3), and that the basic-set algorithm of Júdice and Pires [19] is guaranteed to find a solution to any BLCP with a P-matrix.

³For fixed w, the RHS of equation (3) is a convex optimization problem; a sufficient condition for optimality of some vector x* is that the zero vector is in the subdifferential of the RHS at x*. The fixed point conditions follow from the equality between the LHS and RHS.
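The P-matrix condition in Proposition 1 can be checked directly for small matrices by enumerating principal minors; a brute-force sketch (exponential in the dimension, so an illustration only, not a practical test for large A):

```python
import numpy as np
from itertools import combinations

def is_p_matrix(A, tol=1e-12):
    """Return True if every principal minor of A is positive.

    The P-matrix property is what guarantees existence and uniqueness
    of the (B)LCP solution; this checks all index subsets by brute force.
    """
    A = np.asarray(A, dtype=float)
    n = A.shape[0]
    for size in range(1, n + 1):
        for idx in combinations(range(n), size):
            sub = A[np.ix_(idx, idx)]           # principal submatrix
            if np.linalg.det(sub) <= tol:
                return False
    return True
```

Note that the definition requires strictly positive minors: a symmetric positive definite matrix is always a P-matrix, but a P-matrix need not be symmetric.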
This strengthens the theorem by Kolter and Ng [1], which guaranteed only that the LARS-TD algorithm would converge to a solution when A is a P-matrix.
This connection to the LCP literature has practical benefits as well as theoretical ones. Decoupling the problem from the solver allows a variety of algorithms to be exploited. For example, the ability of many solvers to use a warm start during initialization offers a significant computational advantage over LARS-TD (which always begins with a null solution). In the experimental section of this paper, we demonstrate that the ability to use warm starts during policy iteration can significantly improve computational efficiency. We also find that (B)LCP solvers can be more robust than LARS-TD, an issue we address further in the appendix.

5 Modified Policy Iteration using LARS-TD and LC-TD
As mentioned in Section 3, the advantages of LARS-TD as a homotopy method are less clear when it is used in a policy iteration loop since the homotopy path is traced only for specific policies. It is possible to incorporate greedy policy improvements into the LARS-TD loop, leading to a homotopy path for greedy policies. The greedy L1 regularized fixed point equation is:

w = argmin_{x∈R^k} (1/2)‖Φx − max_π(R + γΦ′_π w)‖²₂ + β‖x‖₁.   (5)

We propose a modification to LARS-TD called LARQ which, along with conditions C1-C3 in Section 3, maintains an additional invariant:

C4. The current policy π is greedy with respect to the current solution.

It turns out that we can change policies and avoid violating the LARS-TD invariants if we make policy changes at points where applying the Bellman operator yields the same value for both the old policy (π) and the new policy (π′): T^π V̂ = T^{π′} V̂.
The LARS-TD invariants all depend on the correlation of features with the residual T^π V̂ − V̂ of the current solution. When the above equation is satisfied, the residual is equal for both policies. Thus, we can change policies at such points without violating any of the LARS-TD invariants. Due to space limitations, we defer a full presentation of the LARQ algorithm to the appendix.
When run to completion, LARQ provides a set of action-values that are the greedy fixed point for all settings of β. In principle, this is more flexible than LARS-TD with policy iteration because it produces these results in a single run of the algorithm. In practice, LARQ suffers two limitations. The first is that it can be slow. LARS-TD enumerates every point at which the active set of features might change, a calculation that must be redone every time the active set changes. LARQ must do this as well, but it must also enumerate all points at which the greedy policy can change. For k features and n samples, LARS-TD must check O(k) points, but LARQ must check O(k + n) points. Even though LARS-TD will run multiple times within a policy iteration loop, the number of such iterations will typically be far fewer than the number of training data points. In practice, we have observed that LARQ runs several times slower than LARS-TD with policy iteration.

⁴Even when A is not invertible, we can still use a BLCP solver as long as the principal submatrix of A associated with the active features is invertible. As with LARS-TD, the inverse only occurs for this principal submatrix. In fact, we discuss in the appendix how one need never explicitly compute A. Alternatively, we can convert the BLCP to an LCP (appendix A.1), thereby avoiding A⁻¹ in the parameterization of the problem.

A second limitation of LARQ is that it can get "stuck." This occurs when the greedy policy for a particular β is not well defined.
In such cases, the algorithm attempts to switch to a new policy\nimmediately following a policy change. This problem is not unique to LARQ. Looping is possible\nwith most approximate policy iteration algorithms. What makes it particularly troublesome for\nLARQ is that there are few satisfying ways of addressing this issue without sacri\ufb01cing the invariants.\nTo address these limitations, we present a compromise between LARQ and LARS-TD with policy\niteration. The algorithm, LC-MPI, is presented as Algorithm 1. It avoids the cost of continually\nchecking for policy changes by updating the policy only at a \ufb01xed set of values, \u03b2(1) . . .\u03b2 (m). Note\nthat the \u03b2 values are in decreasing order with \u03b2(1) set to the maximum value (i.e., the point such\nthat w(1) is the zero vector). At each \u03b2(j), the algorithm uses a policy iteration loop to (1) determine\nthe current policy (greedy with respect to parameters \u02c6w(j)), and (2) compute an approximate value\nfunction \u03a6w(j) using LC-TD. The policy iteration loop terminates when w(j) \u2248 \u02c6w(j) or some\nprede\ufb01ned number of iterations is exceeded. This use of LC-TD within a policy iteration loop will\ntypically be quite fast because we can use the current feature set as a warm start. The warm start is\nindicated in Algorithm 1 by supp( \u02c6w(j)), where the function supp determines the support, or active\nelements, in \u02c6w(j); many (B)LCP solvers can use this information for initialization.\nOnce the policy iteration loop terminates for point \u03b2(j), LC-MPI simply begins at the next point\n\u03b2(j+1) by initializing the weights with the previous solution, \u02c6w(j+1) \u2190 w(j). This was found\nto be a very effective technique. As an alternative, we tested initializing \u02c6w(j+1) with the result of\nrunning LARS-TD with the greedy policy implicit in w(j) from the point (\u03b2(j), w(j)) to \u03b2(j+1). 
This initialization method performed worse experimentally than the simple approach described above.
We can view LC-MPI as approximating LARQ's homotopy path since the two algorithms agree for any β(j) reachable by LARQ. However, LC-MPI is more efficient and avoids the problem of getting stuck. By compromising between the greedy updates of LARQ and the pure policy evaluation methods of LARS-TD and LC-TD, LC-MPI can be thought of as a form of modified policy iteration [20]. The following table summarizes the properties of the algorithms described in this paper.

                                  LARS-TD Policy Iteration | LC-TD Policy Iteration | LARQ | LC-MPI
Warm start for each new β                    N             |            N           |  Y   |   Y
Warm start for each new policy               N             |            Y           |  Y   |   Y
Greedy policy homotopy path                  N             |            N           |  Y   |   Y
Robust to policy cycles                      Y             |            Y           |  N   | Approximate

6 Experiments
We performed two types of experiments to highlight the potential benefits of (B)LCP algorithms. First, we used both LARS-TD and LC-TD within policy iteration. These experiments, which were run using a single value of the L1 regularization parameter, show the benefit of warm starts for LC-TD. The second set of experiments demonstrates the benefit of using the LC-MPI algorithm. A single run of LC-MPI results in greedy policies for multiple values of β, allowing the use of cross-validation to pick the best policy. We show this is significantly more efficient than running policy iteration with either LARS-TD or LC-TD multiple times for different values of β. We discuss the details of the specific LCP solver we used in the appendix.
Both types of experiments were conducted on the 20-state chain [17] and mountain car [21] domains, the same problems tested by Kolter and Ng [1]. The chain MDP consists of two stochastic actions, left and right, a reward of one at each end of the chain, and γ = 0.9.
Algorithm 1 LC-MPI
Inputs:
  {si, ai, ri, s′i}, i = 1..n, state transition and reward samples
  φ : S × A → R^k, state-action features
  γ ∈ [0, 1), discount factor
  {β(j)}, j = 1..m, where β(1) = max_l |Σi φl(si, ai) ri|, β(j) < β(j−1) for j ∈ {2, . . . , m}, and β(m) ≥ 0
  ε ∈ R+ and T ∈ N, termination conditions for policy iteration
Initialization:
  Φ ← [φ(s1, a1) . . . φ(sn, an)]ᵀ, R ← [r1 . . . rn]ᵀ, w(1) ← 0
for j = 2 to m do
  // Initialize with the previous solution
  ŵ(j) ← w(j−1)
  // Policy iteration loop
  Loop:
    // Select greedy actions and form Φ′
    ∀i : a′i ← argmax_a φ(s′i, a)ᵀ ŵ(j)
    Φ′ ← [φ(s′1, a′1) . . . φ(s′n, a′n)]ᵀ
    // Solve the LC-TD problem using a (B)LCP solver with a warm start
    w(j) ← LC-TD(Φ, Φ′, R, γ, β(j)) with warm start supp(ŵ(j))
    // Check for termination
    if (‖w(j) − ŵ(j)‖₂ ≤ ε) or (# iterations ≥ T) then break loop
    else ŵ(j) ← w(j)
Return {w(j)}, j = 1..m

One thousand samples were generated using 100 episodes, each consisting of 10 random steps. For features, we used 1000 Gaussian random noise features along with five equally spaced radial basis functions (RBFs) and a constant function. The goal in the mountain car MDP is to drive an underpowered car up a hill by building up momentum. The domain is continuous, two dimensional, and has three actions. We used γ = 0.99 and 155 radial basis functions (apportioned as a two dimensional grid of 1, 2, 3, 4, 5, 6, and 8 RBFs) and one constant function for features.
Samples were generated using 75 episodes\nwhere each episode started in a random start state, took random actions, and lasted at most 20 steps.\n\n6.1 Policy Iteration\nTo compare LARS-TD and LC-TD when employed within policy iteration, we recorded the number\nof steps used during each round of policy iteration, where a step corresponds to a change in the active\nfeature set. The computational complexity per step of each algorithm is similar; therefore, we used\nthe average number of steps per policy as a metric for comparing the algorithms. Policy iteration\nwas run either until the solution converged or 15 rounds were exceeded. This process was repeated\n10 times for 11 different values of \u03b2. We present the results from these experiments in the \ufb01rst two\ncolumns of Table 1. The two algorithms performed similarly for the chain MDP, but LC-TD used\nsigni\ufb01cantly fewer steps for the mountain car MDP. Figure 1 shows plots for the number of steps\nused for each round of policy iteration for a single (typical) trial. Notice the declining trend for\nLC-TD; this is due to the warm starts requiring fewer steps to \ufb01nd a solution. The plot for the chain\nMDP shows that LC-TD uses many more steps in the \ufb01rst round of policy iteration than does LARS-\nTD. Lastly, in the trials shown in Figure 1, policy iteration using LC-TD converged in six iterations\nwhereas it did not converge at all when using LARS-TD. This was due to LARS-TD producing\nsolutions that violate the L1TD optimality conditions. We discuss this in detail in appendix A.5.\n\n6.2 LC-MPI\nWhen LARS-TD and LC-TD are used as subroutines within policy iteration, the process ends at a\nsingle value of the L1 regularization parameter \u03b2. The policy iteration loop must be rerun to consider\ndifferent values of \u03b2. 
In this section, we show how much computation can be saved by running LC-MPI once (to produce m greedy policies, each at a different value of β) versus running policy iteration m separate times. The third column in Table 1 shows the average number of algorithm steps per policy for LC-MPI. As expected, there is a significant reduction in complexity by using LC-MPI for both domains. In the appendix, we give a more detailed example of how cross-validation can be used to select a good value of the regularization parameter. We also offer some additional comments on the robustness of the LARS-TD algorithm.

Figure 1: Number of steps used by algorithms LARS-TD and LC-TD during each round of policy iteration for a typical trial: (a) chain, (b) mountain car. For LC-TD, note the decrease in steps due to warm starts. [Figure omitted: each panel plots number of steps against round of policy iteration for LARS-TD and LC-TD.]

Table 1: Average number of algorithm steps per policy.
Domain       | LARS-TD, PI | LC-TD, PI | LC-MPI
Chain        |   73 ± 13   |  77 ± 11  | 24 ± 11
Mountain car |  214 ± 33   | 116 ± 22  | 21 ± 5

7 Conclusions
In this paper, we proposed formulating the L1 regularized linear fixed point problem as a linear complementarity problem. We showed the LCP formulation leads to a stronger theoretical guarantee in terms of the solution's uniqueness than was previously shown. Furthermore, we demonstrated that the "warm start" ability of LCP solvers can accelerate the computation of the L1TD fixed point when initialized with the support set of a related problem.
This was found to be particularly effective for\npolicy iteration problems when the set of active features does not change signi\ufb01cantly from one\npolicy to the next.\nWe proposed the LARQ algorithm as an alternative to LARS-TD. The difference between these\nalgorithms is that LARQ incorporates greedy policy improvements inside the homotopy path. The\nadvantage of this \u201cgreedy\u201d homotopy path is that it provides a set of action-values that are a greedy\n\ufb01xed point for all settings of the L1 regularization parameter. However, this additional \ufb02exibility\ncomes with increased computational complexity. As a compromise between LARS-TD and LARQ,\nwe proposed the LC-MPI algorithm which only maintains the LARQ invariants at a \ufb01xed set of\nvalues. The key to making LC-MPI ef\ufb01cient is the use of warm starts by using an LCP algorithm.\nThere are several directions for future work. An interesting question is whether there is a natural\nway to incorporate policy improvement directly within the LCP formulation. Another concern for\nL1TD algorithms is a better characterization of the conditions under which solutions exist and can\nbe found ef\ufb01ciently. In previous work, Kolter and Ng [1] indicated the P-matrix property can always\nhold provided enough L2 regularization is added to the problem. While this is possible, it also\ndecreases the sparsity of the solution; therefore, it would be useful to \ufb01nd other techniques for\nguaranteeing convergence while maintaining sparsity.\nAcknowledgments\nThis work was supported by the National Science Foundation (NSF) under Grant #0937060 to the\nComputing Research Association for the CIFellows Project, NSF Grant IIS-0713435, and DARPA\nCSSG HR0011-06-1-0027. 
Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the National Science Foundation or the Computing Research Association.

References
[1] J. Kolter and A. Ng. Regularization and feature selection in least-squares temporal difference learning. In Proc. ICML, pages 521–528, 2009.
[2] S. Bradtke and A. Barto. Linear least-squares algorithms for temporal difference learning. Machine Learning, 22(1-3):33–57, 1996.
[3] M. Petrik, G. Taylor, R. Parr, and S. Zilberstein. Feature selection using regularization in approximate linear programs for Markov decision processes. In Proc. ICML, 2010.
[4] R. Tibshirani. Regression shrinkage and selection via the Lasso. Journal of the Royal Statistical Society. Series B (Methodological), 58(1):267–288, 1996.
[5] B. Efron, T. Hastie, I. Johnstone, and R. Tibshirani. Least angle regression. The Annals of Statistics, 32(2):407–451, 2004.
[6] J. Júdice and F. Pires. A block principal pivoting algorithm for large-scale strictly monotone linear complementarity problems. Computers and Operations Research, 21(5):587–596, 1994.
[7] K. Murty. Linear Complementarity, Linear and Nonlinear Programming. Heldermann Verlag, 1988.
[8] J. Kim and H. Park. Fast active-set-type algorithms for L1-regularized linear regression. In Proc. AISTATS, pages 397–404, 2010.
[9] H. Lee, A. Battle, R. Raina, and A. Ng. Efficient sparse coding algorithms. In Advances in Neural Information Processing Systems 19, pages 801–808, 2007.
[10] S. Mahadevan and M. Maggioni. Proto-value functions: A Laplacian framework for learning representation and control in Markov decision processes. JMLR, 8:2169–2231, 2007.
[11] R. Parr, L. Li, G. Taylor, C. Painter-Wakefield, and M. Littman.
An analysis of linear models, linear value-function approximation, and feature selection for reinforcement learning. In Proc. ICML, 2008.
[12] A. Farahmand, M. Ghavamzadeh, C. Szepesvári, and S. Mannor. Regularized fitted Q-iteration for planning in continuous-space Markovian decision problems. In Proc. ACC. IEEE Press, 2009.
[13] M. Loth, M. Davy, and P. Preux. Sparse temporal difference learning using LASSO. In IEEE International Symposium on Approximate Dynamic Programming and Reinforcement Learning, 2007.
[14] J. Johns and S. Mahadevan. Sparse approximate policy evaluation using graph-based basis functions. Technical Report UM-CS-2009-041, University of Massachusetts Amherst, Department of Computer Science, 2009.
[15] S. Mallat and Z. Zhang. Matching pursuits with time-frequency dictionaries. IEEE Transactions on Signal Processing, 41(12):3397–3415, 1993.
[16] Y. Pati, R. Rezaiifar, and P. Krishnaprasad. Orthogonal matching pursuit: Recursive function approximation with applications to wavelet decomposition. In Proceedings of the 27th Annual Asilomar Conference on Signals, Systems, and Computers, volume 1, pages 40–44, 1993.
[17] M. Lagoudakis and R. Parr. Least-squares policy iteration. Journal of Machine Learning Research, 4:1107–1149, 2003.
[18] S. Lee and H. Seol. A survey on the matrix completion problem. Trends in Mathematics, 4(1):38–43, 2001.
[19] J. Júdice and F. Pires. Basic-set algorithm for a generalized linear complementarity problem. Journal of Optimization Theory and Applications, 74(3):391–411, 1992.
[20] M. Puterman and M. Shin. Modified policy iteration algorithms for discounted Markov decision problems. Management Science, 24(11), 1978.
[21] R. Sutton and A. Barto. Reinforcement Learning: An Introduction.
MIT Press, 1998.