{"title": "Robust Value Function Approximation Using Bilinear Programming", "book": "Advances in Neural Information Processing Systems", "page_first": 1446, "page_last": 1454, "abstract": "Existing value function approximation methods have been successfully used in many applications, but they often lack useful a priori error bounds. We propose approximate bilinear programming, a new formulation of value function approximation that provides strong a priori guarantees. In particular, it provably finds an approximate value function that minimizes the Bellman residual. Solving a bilinear program optimally is NP hard, but this is unavoidable because the Bellman-residual minimization itself is NP hard. We, therefore, employ and analyze a common approximate algorithm for bilinear programs. The analysis shows that this algorithm offers a convergent generalization of approximate policy iteration. Finally, we demonstrate that the proposed approach can consistently minimize the Bellman residual on a simple benchmark problem.", "full_text": "Robust Value Function Approximation Using\n\nBilinear Programming\n\nMarek Petrik\n\nDepartment of Computer Science\n\nUniversity of Massachusetts\n\nAmherst, MA 01003\n\npetrik@cs.umass.edu\n\nShlomo Zilberstein\n\nDepartment of Computer Science\n\nUniversity of Massachusetts\n\nAmherst, MA 01003\n\nshlomo@cs.umass.edu\n\nAbstract\n\nExisting value function approximation methods have been successfully used in\nmany applications, but they often lack useful a priori error bounds. We propose\napproximate bilinear programming, a new formulation of value function approxi-\nmation that provides strong a priori guarantees. In particular, this approach prov-\nably \ufb01nds an approximate value function that minimizes the Bellman residual.\nSolving a bilinear program optimally is NP-hard, but this is unavoidable because\nthe Bellman-residual minimization itself is NP-hard. 
We therefore employ and\nanalyze a common approximate algorithm for bilinear programs. The analysis\nshows that this algorithm offers a convergent generalization of approximate pol-\nicy iteration. Finally, we demonstrate that the proposed approach can consistently\nminimize the Bellman residual on a simple benchmark problem.\n\n1 Motivation\n\nSolving large Markov Decision Problems (MDPs) is a very useful, but computationally challenging\nproblem addressed widely in the AI literature, particularly in the area of reinforcement learning.\nIt is widely accepted that large MDPs can only be solved approximately. The commonly used ap-\nproximation methods can be divided into three broad categories: 1) policy search, which explores\na restricted space of all policies, 2) approximate dynamic programming, which searches a restricted\nspace of value functions, and 3) approximate linear programming, which approximates the solu-\ntion using a linear program. While all of these methods have achieved impressive results in many\ndomains, they have signi\ufb01cant limitations.\nPolicy search methods rely on local search in a restricted policy space. The policy may be repre-\nsented, for example, as a \ufb01nite-state controller [22] or as a greedy policy with respect to an approx-\nimate value function [24]. Policy search methods have achieved impressive results in such domains\nas Tetris [24] and helicopter control [1]. However, they are notoriously hard to analyze. We are not\naware of any theoretical guarantees regarding the quality of the solution.\nApproximate dynamic programming (ADP) methods iteratively approximate the value func-\ntion [4, 20, 23]. They have been extensively analyzed and are the most commonly used methods.\nHowever, ADP methods typically do not converge and they only provide weak guarantees of approx-\nimation quality. 
The approximation error bounds are usually expressed in terms of the worst-case\napproximation of the value function over all policies [4]. In addition, most available bounds are with\nrespect to the L\u221e norm, while the algorithms often minimize the L2 norm. While there exist some\nL2-based bounds [14], they require values that are dif\ufb01cult to obtain.\nApproximate linear programming (ALP) uses a linear program to compute the approximate value\nfunction in a particular vector space [7]. ALP has been previously used in a wide variety of set-\ntings [2, 9, 10]. Although ALP often does not perform as well as ADP, there have been some recent\n\n1\n\n\fefforts to close the gap [18]. ALP has better theoretical properties than ADP and policy search. It is\nguaranteed to converge and return the closest L1-norm approximation \u02dcv of the optimal value func-\ntion v\u2217 up to a multiplicative factor. However, the L1 norm must be properly weighted to guarantee\na small policy loss, and there is no reliable method for selecting appropriate weights [7].\nTo summarize, the existing reinforcement learning techniques often provide good solutions, but typ-\nically require signi\ufb01cant domain knowledge [20]. The domain knowledge is needed partly because\nuseful a priori error bounds are not available, as mentioned above. Our goal is to develop a more\nrobust method that is guaranteed to minimize an actual bound on the policy loss.\nWe present a new formulation of value function approximation that provably minimizes a bound\non the policy loss. Unlike in some other algorithms, the bound in this case does not rely on values\nthat are hard to obtain. The new method uni\ufb01es policy search and value-function search methods\nto minimize the L\u221e norm of the Bellman residual, which bounds the policy loss. We start with a\ndescription of the framework and notation in Section 2. 
Then, in Section 3, we describe the pro-\nposed Approximate Bilinear Programming (ABP) formulation. A drawback of this formulation is\nits computational complexity, which may be exponential. We show in Section 4 that this is unavoid-\nable, because minimizing the approximation error bound is in fact NP-hard. Although our focus is\non the formulation and its properties, we also discuss some simple algorithms for solving bilinear\nprograms. Section 5 shows that ABP can be seen as an improvement of ALP and Approximate Pol-\nicy Iteration (API). Section 6 demonstrates the applicability of ABP using a common reinforcement\nlearning benchmark problem. A complete discussion of sampling strategies\u2013an essential compo-\nnent for achieving robustness\u2013is beyond the scope of this paper, but the issue is brie\ufb02y discussed in\nSection 6. Complete proofs of the theorems can be found in [19].\n\n2 Solving MDPs using ALP\n\nIn this section, we formally de\ufb01ne MDPs, their ALP formulation, and the approximation errors\ninvolved. These notions serve as a basis for developing the ABP formulation.\nA Markov Decision Process is a tuple (S,A, P, r, \u03b1), where S is the \ufb01nite set of states, A is the\n\ufb01nite set of actions. P : S \u00d7 S \u00d7 A (cid:55)\u2192 [0, 1] is the transition function, where P (s(cid:48), s, a) represents\nthe probability of transiting to state s(cid:48) from state s, given action a. The function r : S \u00d7 A (cid:55)\u2192 R is\nthe reward function, and \u03b1 : S (cid:55)\u2192 [0, 1] is the initial state distribution. The objective is to maximize\nthe in\ufb01nite-horizon discounted cumulative reward. To shorten the notation, we assume an arbitrary\nordering of the states: s1, s2, . . . , sn. 
Then, Pa and ra are used to denote the probabilistic transition\nmatrix and reward for action a.\nThe solution of an MDP is a policy \u03c0 : S \u00d7 A \u2192 [0, 1] from a set of possible policies \u03a0, such that\na\u2208A \u03c0(s, a) = 1. We assume that the policies may be stochastic, but stationary [21].\nA policy is deterministic when \u03c0(s, a) \u2208 {0, 1} for all s \u2208 S and a \u2208 A. The transition and reward\nfunctions for a given policy are denoted by P\u03c0 and r\u03c0. The value function update for a policy \u03c0 is\ndenoted by L\u03c0, and the Bellman operator is denoted by L. That is:\n\nfor all s \u2208 S,(cid:80)\n\nL\u03c0v = P\u03c0v + r\u03c0\n\nLv = max\n\u03c0\u2208\u03a0\n\nL\u03c0v.\n\nThe optimal value function, denoted v\u2217, satis\ufb01es v\u2217 = Lv\u2217. We focus on linear value function\napproximation for discounted in\ufb01nite-horizon problems. In linear value function approximation, the\nvalue function is represented as a linear combination of nonlinear basis functions (vectors). For\neach state s, we de\ufb01ne a row-vector \u03c6(s) of features. The rows of the basis matrix M correspond\nto \u03c6(s), and the approximation space is generated by the columns of the matrix. That is, the basis\nmatrix M, and the value function v are represented as:\n\n\uf8eb\uf8ec\uf8ed\u2212 \u03c6(s1) \u2212\n\n\u2212 \u03c6(s2) \u2212\n\n\uf8f6\uf8f7\uf8f8 v = M x.\n\nM =\n\n...\n\nDe\ufb01nition 1. A value function, v, is representable if v \u2208 M \u2286 R|S|, where M = colspan (M ),\nand is transitive-feasible when v \u2265 Lv. We denote the set of transitive-feasible value functions as:\nK = {v \u2208 R|S| v \u2265 Lv}.\n\n2\n\n\fNotice that the optimal value function v\u2217 is transitive-feasible, and M is a linear space. Also, all the\ninequalities are element-wise.\nBecause the new formulation is related to ALP, we introduce it \ufb01rst. 
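The objects just defined are easy to make concrete. The following is a minimal sketch, using nothing beyond the definitions above: a toy two-state, two-action MDP (all transition probabilities and rewards are invented for illustration), the Bellman operator L, and a check of transitive feasibility v ≥ Lv from Definition 1:

```python
# Toy MDP: 2 states, 2 actions; P[a][s][s2] = P(s2 | s, a), r[a][s] = r(s, a).
# All numbers are invented for illustration.
gamma = 0.9
P = {0: [[0.8, 0.2], [0.1, 0.9]],
     1: [[0.5, 0.5], [0.6, 0.4]]}
r = {0: [1.0, 0.0], 1: [0.5, 2.0]}

def bellman(v):
    """Bellman operator L: (Lv)(s) = max_a [ r(s,a) + gamma * sum_s2 P(s2|s,a) v(s2) ]."""
    return [max(r[a][s] + gamma * sum(P[a][s][s2] * v[s2] for s2 in range(2))
                for a in (0, 1))
            for s in range(2)]

def is_transitive_feasible(v):
    """v is transitive-feasible (v in K) when v >= Lv element-wise."""
    return all(vs >= lvs for vs, lvs in zip(v, bellman(v)))

# Value iteration converges to v* with v* = Lv*.
v = [0.0, 0.0]
for _ in range(2000):
    v = bellman(v)
print(is_transitive_feasible([x + 1e-6 for x in v]))  # → True
```

Adding a small positive constant to a (numerically) optimal value function yields a transitive-feasible one, since L(v + c1) = Lv + γc1 and γ < 1.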
It is well known that an in\ufb01-\nnite horizon discounted MDP problem may be formulated in terms of solving the following linear\nprogram:\n\n(cid:88)\n\nc(s)v(s)\n\n(cid:88)\n\ns\u2208S\nv(s) \u2212 \u03b3\n\nminimize\n\nv\n\ns.t.\n\nP (s(cid:48), s, a)v(s(cid:48)) \u2265 r(s, a) \u2200(s, a) \u2208 (S,A)\n\nc represents a distribution over the states, usually a uniform one. That is,(cid:80)\n\nWe use A as a shorthand notation for the constraint matrix and b for the right-hand side. The value\ns\u2208S c(s) = 1. The\nlinear program in Eq. (1) is often too large to be solved precisely, so it is approximated to get an\napproximate linear program by assuming that v \u2208 M [8], as follows:\n\ns(cid:48)\u2208S\n\n(1)\n\n(2)\n\nminimize\n\nx\ns.t.\n\ncTv\nAv \u2265 b\nv \u2208 M\n\nThe constraint v \u2208 M denotes the approximation. To actually solve this linear program, the\nvalue function is represented as v = M x. In the remainder of the paper, we assume that 1 \u2208 M to\nguarantee the feasibility of the ALP, where 1 is a vector of all ones. The optimal solution of the ALP,\n\u02dcv, satis\ufb01es that \u02dcv \u2265 v\u2217. Then, the objective of Eq. (2) represents the minimization of (cid:107)\u02dcv \u2212 v\u2217(cid:107)1,c,\nwhere (cid:107) \u00b7 (cid:107)1,c is a c-weighted L1 norm [7].\nThe ultimate goal of the optimization is not to obtain a good value function \u02dcv, but a good policy.\nThe quality of the policy, typically chosen to be greedy with respect to \u02dcv, depends non-trivially on\nthe approximate value function. The ABP formulation will minimize policy loss by minimizing\n(cid:107)L\u02dcv \u2212 \u02dcv(cid:107)\u221e, which bounds the policy loss as follows.\nTheorem 2 (e.g. [25]). Let \u02dcv be an arbitrary value function, and let \u02c6v be the value of the greedy\npolicy with respect to \u02dcv. 
Then:\n\n(cid:107)v\u2217 \u2212 \u02c6v(cid:107)\u221e \u2264 2\n1 \u2212 \u03b3\n\n(cid:107)L\u02dcv \u2212 \u02dcv(cid:107)\u221e,\n\nIn addition, if \u02dcv \u2265 L\u02dcv, the policy loss is smallest for the greedy policy.\nPolicies, like value functions, can be represented as vectors. Assume an arbitrary ordering of the\nstate-action pairs, such that o(s, a) (cid:55)\u2192 N maps a state and an action to its position. The policies are\nrepresented as \u03b8 \u2208 R|S|\u00d7|A|, and we use the shorthand notation \u03b8(s, a) = \u03b8(o(s, a)).\nRemark 3. The corresponding \u03c0 and \u03b8 are denoted as \u03c0\u03b8 and \u03b8\u03c0 and satisfy:\n\n\u03c0\u03b8(s, a) = \u03b8\u03c0(s, a).\n\nWe will also consider approximations of the policies in the policy-space, generated by columns of a\nmatrix N. A policy is representable when \u03c0 \u2208 N , where N = colspan (N ).\n\n3 Approximate Bilinear Programs\nThis section shows how to formulate minv\u2208M (cid:107)Lv \u2212 v(cid:107)\u221e as a separable bilinear program. Bilinear\nprograms are a generalization of linear programs with an additional bilinear term in the objective\nfunction. A separable bilinear program consists of two linear programs with independent constraints\nand are fairly easy to solve and analyze.\nDe\ufb01nition 4 (Separable Bilinear Program). 
A separable bilinear program in the normal form is\nde\ufb01ned as follows:\n\nminimize\n\nw,x y,z\n\ns.t.\n\nf (w, x, y, z) = sT\n\n1 w + rT\n\n1 x + xTCy + rT\n\n2 y + sT\n2 z\n\nA1x + B1w = b1 A2y + B2z = b2\nw, x \u2265 0\n\ny, z \u2265 0\n\n(3)\n\n3\n\n\fWe separate the variables using a vertical line and the constraints using different columns to em-\nphasize the separable nature of the bilinear program.\n\nIn this paper, we only use separable bilinear programs and refer to them simply as bilinear programs.\nAn approximate bilinear program can now be formulated as follows.\n\nminimize\n\u03b8 \u03bb,\u03bb(cid:48),v\n\ns.t.\n\n\u03b8T\u03bb + \u03bb(cid:48)\nB\u03b8 = 1 z = Av \u2212 b\n\u03b8 \u2265 0\n\nz \u2265 0\n\u03bb + \u03bb(cid:48)1 \u2265 z\n\u03bb \u2265 0\nv \u2208 M\n\n\u03b8 \u2208 N\n\n(4)\n\nAll variables are vectors except \u03bb(cid:48), which is a scalar.\nThe symbol z is only used to simplify\nthe notation and does not need to represent an optimization variable. The variable v is de\ufb01ned for\neach state and represents the value function. Matrix A represents constraints that are identical to the\nconstraints in Eq. (2). The variables \u03bb correspond to all state-action pairs. These variables represent\nthe Bellman residuals that are being minimized. The variables \u03b8 are de\ufb01ned for all state-action pairs\nand represent policies in Remark 3. The matrix B represents the following constraints:\n\n(cid:88)\n\na\u2208A\n\n\u03b8(s, a) = 1 \u2200s \u2208 S.\n\nAs with approximate linear programs, we initially assume that all the constraints on z are used. In\nrealistic settings, however, the constraints would be sampled or somehow reduced. We defer the\ndiscussion of this issue until Section 6. Note that the constraints in our formulation correspond to\nelements of z and \u03b8. 
Thus, when constraints are omitted, the corresponding elements of z and \u03b8 are omitted as well.\nTo simplify the notation, the value function approximation in this problem is denoted only implicitly by v \u2208 M, and the policy approximation is denoted by \u03b8 \u2208 N . In an actual implementation, the optimization variables would be x, y, using the relationships v = M x and \u03b8 = N y. We do not assume any approximation of the policy space, unless mentioned otherwise. We also use v or \u03b8 to refer to partial solutions of Eq. (4) with the other variables chosen appropriately to achieve feasibility.\nThe ABP formulation is closely related to approximate linear programs, and we discuss the connection in Section 5. We first analyze the properties of the optimal solutions of the bilinear program and then discuss the solution methods in Section 4. The following theorem states the main property of the bilinear formulation.\nTheorem 5. Let (\u02dc\u03b8, \u02dcv, \u02dc\u03bb, \u02dc\u03bb\u2032) be an optimal solution of Eq. (4) and assume that 1 \u2208 M. Then:\n\n\u02dc\u03b8T\u02dc\u03bb + \u02dc\u03bb\u2032 = \u2016L\u02dcv \u2212 \u02dcv\u2016\u221e = min_{v\u2208K\u2229M} \u2016Lv \u2212 v\u2016\u221e \u2264 2 min_{v\u2208M} \u2016Lv \u2212 v\u2016\u221e \u2264 2(1 + \u03b3) min_{v\u2208M} \u2016v \u2212 v\u2217\u2016\u221e.\n\nIn addition, \u03c0\u02dc\u03b8 minimizes the Bellman residual with regard to \u02dcv, and its value function \u02c6v satisfies:\n\n\u2016\u02c6v \u2212 v\u2217\u2016\u221e \u2264 (2/(1 \u2212 \u03b3)) min_{v\u2208M} \u2016Lv \u2212 v\u2016\u221e.\n\nThe proof of the theorem can be found in [19]. It is important to note that, as Theorem 5 states, the ABP approach is equivalent to a minimization over all representable value functions, not only the transitive-feasible ones. 
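The role of the Bellman residual in these guarantees can be exercised numerically. The sketch below (a toy two-state MDP with invented numbers, not the ABP itself) forms the greedy policy for an arbitrary approximate value function ṽ and checks the policy-loss bound of Theorem 2, ‖v* − v̂‖∞ ≤ 2/(1−γ) ‖Lṽ − ṽ‖∞:

```python
# Numerical check of the policy-loss bound of Theorem 2 on a toy 2-state,
# 2-action MDP; all numbers are invented for illustration.
gamma = 0.9
P = {0: [[0.8, 0.2], [0.1, 0.9]], 1: [[0.5, 0.5], [0.6, 0.4]]}
r = {0: [1.0, 0.0], 1: [0.5, 2.0]}
S, A = range(2), (0, 1)

def q(v, s, a):
    return r[a][s] + gamma * sum(P[a][s][s2] * v[s2] for s2 in S)

def bellman(v):
    return [max(q(v, s, a) for a in A) for s in S]

# v*: value iteration to numerical convergence.
v_star = [0.0, 0.0]
for _ in range(2000):
    v_star = bellman(v_star)

v_tilde = [5.0, 1.0]                                       # arbitrary approximation
pi = [max(A, key=lambda a: q(v_tilde, s, a)) for s in S]   # greedy policy w.r.t. v~

# Exact value v_hat of the greedy policy, by iterating v <- r_pi + gamma P_pi v.
v_hat = [0.0, 0.0]
for _ in range(2000):
    v_hat = [q(v_hat, s, pi[s]) for s in S]

residual = max(abs(bellman(v_tilde)[s] - v_tilde[s]) for s in S)  # ||Lv~ - v~||_inf
loss = max(abs(v_star[s] - v_hat[s]) for s in S)                  # ||v* - v_hat||_inf
assert loss <= 2.0 / (1.0 - gamma) * residual + 1e-9
```

Minimizing the residual on the right-hand side, which is what ABP does, directly tightens the only adjustable quantity in this bound.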
Notice also the missing coef\ufb01cient 2 (2 instead of 4) in the last equation\nof Theorem 5. This follows by subtracting a constant vector 1 from \u02dcv to balance the lower bounds\non the Bellman residual error with the upper ones. This modi\ufb01ed approximate value function will\nhave 1/2 of the original Bellman residual but an identical greedy policy. Finally, note that whenever\nv\u2217 \u2208 M, both ABP and ALP will return the optimal value function.\nThe ABP solution minimizes the L\u221e norm of the Bellman residual due to: 1) the correspondence\nbetween \u03b8 and the policies, and 2) the dual representation with respect to variables \u03bb and \u03bb(cid:48). The\ntheorem then follows using techniques similar to those used for approximate linear programs [7].\n\n4\n\n\fAlgorithm 1: Iterative algorithm for solving Eq. (3)\n(x0, w0) \u2190 random ;\n(y0, z0) \u2190 arg miny,z f (w0, x0, y, z) ;\ni \u2190 1 ;\nwhile yi\u22121 (cid:54)= yi or xi\u22121 (cid:54)= xi do\n\n(yi, zi) \u2190 arg min{y,z A2y+B2z=b2 y,z\u22650} f (wi\u22121, xi\u22121, y, z) ;\n(xi, wi) \u2190 arg min{x,w A1x+B1w=b1 x,w\u22650} f (w, x, yi, zi) ;\ni \u2190 i + 1\n\nreturn f (wi, xi, yi, zi)\n\n4 Solving Bilinear Programs\n\nIn this section we describe simple methods for solving ABPs. We \ufb01rst describe optimal methods,\nwhich have exponential complexity, and then discuss some approximation strategies.\nSolving a bilinear program is an NP-complete problem [3]. The membership in NP follows from\nthe \ufb01nite number of basic feasible solutions of the individual linear programs, each of which can be\nchecked in polynomial time. The NP-hardness is shown by a reduction from the SAT problem [3].\nThe NP-completeness of ABP compares unfavorably with the polynomial complexity of ALP. 
However, most other ADP algorithms are not guaranteed to converge to a solution in finite time.\nThe following theorem shows that the computational complexity of the ABP formulation is asymptotically the same as the complexity of the problem it solves.\nTheorem 6. Determining whether min_{v\u2208K\u2229M} \u2016Lv \u2212 v\u2016\u221e < \u03b5 is NP-complete for the full constraint representation, 0 < \u03b3 < 1, and a given \u03b5 > 0. In addition, the problem remains NP-complete when 1 \u2208 M, and therefore determining whether min_{v\u2208M} \u2016Lv \u2212 v\u2016\u221e < \u03b5 is also NP-complete.\n\nAs the theorem states, the value function approximation does not become computationally simpler even when 1 \u2208 M \u2013 a universal assumption in this paper. Notice that ALP can determine whether min_{v\u2208K\u2229M} \u2016Lv \u2212 v\u2016\u221e = 0 in polynomial time.\nThe proof of Theorem 6 is based on a reduction from SAT and can be found in [19]. The policy in the reduction determines the true literal in each clause, and the approximate value function corresponds to the truth values of the literals. The approximation basis forces literals that share the same variable to have consistent values.\nBilinear programs are non-convex and are typically solved using global optimization techniques. The common solution methods are based on concave cuts [11] or branch-and-bound [6]. In ABP settings with a small number of features, the successive approximation algorithm [17] may be applied efficiently. We are, however, not aware of commercial solvers available for solving bilinear programs. Bilinear programs can be formulated as concave quadratic minimization problems [11] or mixed-integer linear programs [11, 16], for which numerous commercial solvers are available. Because we are interested in solving very large bilinear programs, we describe simple approximate algorithms next. 
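The alternating structure of Algorithm 1 can be shown on a tiny separable bilinear instance. The sketch below minimizes x^T C y over two probability simplices, where each inner linear program reduces to picking the vertex with the smallest linear coefficient; C is invented data, and this is an illustration of the iterative scheme, not of the ABP formulation:

```python
# Alternating minimization (the scheme of Algorithm 1) on a tiny separable
# bilinear program: minimize x^T C y with x, y on probability simplices.
# Over a simplex, each inner LP is solved at a vertex: the basis vector
# of the smallest linear coefficient. C is invented for illustration.
C = [[3.0, 1.0, 2.0],
     [2.0, 4.0, 0.5],
     [1.5, 2.5, 3.5]]
n = len(C)

def vertex(i, size):
    return [1.0 if j == i else 0.0 for j in range(size)]

x = vertex(0, n)                     # arbitrary starting point
obj = None
for _ in range(50):
    # min over y of (C^T x)^T y: vertex at the smallest coefficient
    cy = [sum(C[i][j] * x[i] for i in range(n)) for j in range(n)]
    y = vertex(min(range(n), key=cy.__getitem__), n)
    # min over x of (C y)^T x
    cx = [sum(C[i][j] * y[j] for j in range(n)) for i in range(n)]
    x = vertex(min(range(n), key=cx.__getitem__), n)
    new_obj = sum(C[i][j] * x[i] * y[j] for i in range(n) for j in range(n))
    if obj is not None and new_obj >= obj:   # no improvement: a fixed point
        break
    obj = new_obj
print(obj)  # → 1.0
```

On this instance the iteration stops at objective 1.0, while the global optimum is 0.5 (the smallest entry of C), matching the discussion above: the objective decreases monotonically and converges in finitely many steps, but only to a local solution.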
Optimal scalable methods are beyond the scope of this paper.\nThe most common approximate method for solving bilinear programs is shown in Algorithm 1.\nIt is designed for the general formulation shown in Eq.\n(3), where f (w, x, y, z) represents the\nobjective function. The minimizations in the algorithm are linear programs which can be easily\nsolved.\nInterestingly, as we will show in Section 5, Algorithm 1 applied to ABP generalizes a\nversion of API.\nWhile Algorithm 1 is not guaranteed to \ufb01nd an optimal solution, its empirical performance is often\nremarkably good [13]. Its basic properties are summarized by the following proposition.\n\nProposition 7 (e.g. [3]). Algorithm 1 is guaranteed to converge, assuming that the linear program\nsolutions are in a vertex of the optimality simplex. In addition, the global optimum is a \ufb01xed point\nof the algorithm, and the objective value monotonically improves during execution.\n\n5\n\n\fThe proof is based on the \ufb01nite count of the basic feasible solutions of the individual linear programs.\nBecause the objective function does not increase in any iteration, the algorithm will eventually con-\nverge.\nIn the context of MDPs, Algorithm 1 can be further re\ufb01ned. For example, the constraint v \u2208 M in\nEq. (4) serves mostly to simplify the bilinear program and a value function that violates it may still\nbe acceptable. The following proposition motivates the construction of a new value function from\ntwo transitive-feasible value functions.\nProposition 8. Let \u02dcv1 and \u02dcv2 be feasible value functions in Eq.\n\u02dcv(s) = min{\u02dcv1(s), \u02dcv2(s)} is also feasible in Eq.\nmin{(cid:107)v\u2217 \u2212 \u02dcv1(cid:107)\u221e,(cid:107)v\u2217 \u2212 \u02dcv2(cid:107)\u221e}.\n\n(4). Then the value function\n(4). 
Therefore \u02dcv \u2265 v\u2217 and (cid:107)v\u2217 \u2212 \u02dcv(cid:107)\u221e \u2264\n\nThe proof of the proposition is based on Jensen\u2019s inequality and can be found in [19].\nProposition 8 can be used to extend Algorithm 1 when solving ABPs. One option is to take the\nstate-wise minimum of values from multiple random executions of Algorithm 1, which preserves\nthe transitive feasibility of the value function. However, the increasing number of value functions\nused to obtain \u02dcv also increases the potential sampling error.\n\n5 Relationship to ALP and API\n\nIn this section, we describe the important connections between ABP and the two closely related ADP\nmethods: ALP, and API with L\u221e minimization. Both of these methods are commonly used, for\nexample to solve factored MDPs [10]. Our analysis sheds light on some of their observed properties\nand leads to a new convergent form of API.\nABP addresses some important issues with ALP: 1) ALP provides value function bounds with re-\nspect to L1 norm, which does not guarantee small policy loss, 2) ALP\u2019s solution quality depends\nsigni\ufb01cantly on the heuristically-chosen objective function c in Eq. (2) [7], and 3) incomplete con-\nstraint samples in ALP easily lead to unbounded linear programs. The drawback of using ABP,\nhowever, is the higher computational complexity.\nBoth the \ufb01rst and the second issues in ALP can be addressed by choosing the right objective func-\ntion [7]. Because this objective function depends on the optimal ALP solution, it cannot be practi-\ncally computed. Instead, various heuristics are usually used. The heuristic objective functions may\nlead to signi\ufb01cant improvements in speci\ufb01c domains, but they do not provide any guarantees. 
ABP, on the other hand, has no such parameters that require adjustment.\nThe third issue arises when the constraints of an ALP need to be sampled in some large domains. The ALP may become unbounded with incomplete samples because its objective value is defined using the L1 norm on the states, while the constraints are defined using the L\u221e norm of the Bellman residual. In ABP, the Bellman residual is used in both the constraints and the objective function. The objective function of ABP is therefore bounded below by 0 even for an arbitrarily small number of samples.\nABP can also improve on API with L\u221e minimization (L\u221e-API for short), which is a leading method for solving factored MDPs [10]. Minimizing the L\u221e approximation error is theoretically preferable, since it is compatible with the existing bounds on policy loss [10]. In contrast, few practical bounds exist for API with L2-norm minimization [14], such as LSPI [12].\nL\u221e-API is shown in Algorithm 2, where f(\u03c0) is calculated using the following program:\n\nminimize_{\u03c6,v} \u03c6\ns.t. (I \u2212 \u03b3P\u03c0)v + 1\u03c6 \u2265 r\u03c0\n     \u2212(I \u2212 \u03b3P\u03c0)v + 1\u03c6 \u2265 \u2212r\u03c0\n     v \u2208 M\n\n(5)\n\nHere I denotes the identity matrix. We are not aware of a convergence or a divergence proof of L\u221e-API, and this analysis is beyond the scope of this paper.\n\n6\n\n\fAlgorithm 2: Approximate policy iteration, where f(\u03c0) denotes a custom value function approximation for the policy \u03c0.\n\u03c00, k \u2190 rand, 1 ;\nwhile \u03c0k \u2260 \u03c0k\u22121 do\n    \u02dcvk \u2190 f(\u03c0k\u22121) ;\n    \u03c0k(s) \u2190 arg max_{a\u2208A} r(s, a) + \u03b3 \u2211_{s\u2032\u2208S} P(s\u2032, s, a)\u02dcvk(s\u2032) \u2200s \u2208 S ;\n    k \u2190 k + 1\n\nWe propose Optimistic Approximate Policy Iteration (OAPI), a modification of API. 
OAPI is shown\nin Algorithm 2, where f (\u03c0) is calculated using the following program:\n\nminimize\n\n\u03c6,v\ns.t.\n\n(\u2261 (I \u2212 \u03b3P\u03c0)v \u2265 r\u03c0 \u2200\u03c0 \u2208 \u03a0)\n\n\u03c6\nAv \u2265 b\n\u2212(I \u2212 \u03b3P\u03c0)v + 1\u03c6 \u2265 \u2212r\u03c0\nv \u2208 M\n\n(6)\n\nIn fact, OAPI corresponds to Algorithm 1 applied to ABP because Eq. (6) corresponds to Eq. (4)\nwith \ufb01xed \u03b8. Then, using Proposition 7, we get the following corollary.\nCorollary 9. Optimistic approximate policy iteration converges in \ufb01nite time. In addition, the Bell-\nman residual of the generated value functions monotonically decreases.\n\nOAPI differs from L\u221e-API in two ways: 1) OAPI constrains the Bellman residuals by 0 from below\nand by \u03c6 from above, and then it minimizes \u03c6. L\u221e-API constrains the Bellman residuals by \u03c6 from\nboth above and below. 2) OAPI, like API, uses only the current policy for the upper bound on the\nBellman residual, but uses all the policies for the lower bound on the Bellman residual.\nL\u221e-API cannot return an approximate value function that has a lower Bellman residual than ABP,\ngiven the optimality of ABP described in Theorem 5. However, even OAPI, an approximate ABP\nalgorithm, performs comparably to L\u221e-API, as the following theorem states.\nTheorem 10. b Assume that L\u221e-API converges to a policy \u03c0 and a value function v that both\nsatisfy: \u03c6 = (cid:107)v \u2212 L\u03c0v(cid:107)\u221e = (cid:107)v \u2212 Lv(cid:107)\u221e. Then \u02dcv = v + \u03c6\n1\u2212\u03b3 1 is feasible in Eq. (4), and it is a \ufb01xed\npoint of OAPI. In addition, the greedy policies with respect to \u02dcv and v are identical.\n\nThe proof is based on two facts. First, \u02dcv is feasible with respect to the constraints in Eq. (4). The\nBellman residual changes for all the policies identically, since a constant vector is added. 
Second, because L\u03c0 is greedy with respect to \u02dcv, we have that \u02dcv \u2265 L\u03c0\u02dcv \u2265 L\u02dcv. The value function \u02dcv is therefore transitive-feasible. The full proof can be found in [19].\nTo summarize, OAPI guarantees convergence while matching the performance of L\u221e-API. The convergence of OAPI is achieved because, given a non-negative Bellman residual, the greedy policy also minimizes the Bellman residual. Because OAPI ensures that the Bellman residual is always non-negative, it can progressively reduce it. In comparison, the greedy policy in L\u221e-API does not minimize the Bellman residual, and therefore L\u221e-API does not always reduce it. Theorem 10 also explains why API provides better solutions than ALP, as observed in [10]. From the discussion above, ALP can be seen as an L1-norm approximation of a single iteration of OAPI. L\u221e-API, on the other hand, performs many such ALP-like iterations.\n\n6 Empirical Evaluation\n\nAs we showed in Theorem 10, even OAPI, the very simple approximate algorithm for ABP, can perform as well as existing state-of-the-art methods on factored MDPs. However, a deeper understanding of the formulation and potential solution methods will be necessary in order to determine the full practical impact of the proposed methods. In this section, we validate the approach by applying it to the mountain-car problem, a simple reinforcement learning benchmark.\nWe have so far assumed that all the constraints involving z are present in the ABP in Eq. (4). Because the constraints correspond to all state-action pairs, it is often impractical to even enumerate\n\n7\n\n\f(a) L\u221e error of the Bellman residual\nFeatures    100           144\nOAPI        0.21 (0.23)   0.13 (0.1)\nALP         13. (13.)     3.6 (4.3)\nLSPI        9. (14.)      3.9 (7.7)\nAPI         0.46 (0.08)   0.86 (1.18)\n\n(b) L2 error of the Bellman residual\nFeatures    100           144\nOAPI        0.2 (0.3)     0.1 (1.9)\nALP         9.5 (18.)     0.3 (0.4)\nLSPI        1.2 (1.5)     0.9 (0.1)\nAPI         0.04 (0.01)   0.08 (0.08)\n\nTable 1: Bellman residual of the final value function. The values are averages over 5 executions, with the standard deviations shown in parentheses.\n\nthem. This issue can be addressed in at least two ways. First, a small randomly selected subset of the constraints can be used in the ABP, a common approach in ALP [9, 5]. The ALP sampling bounds can be easily extended to ABP. Second, the structure of the MDP can be used to reduce the number of constraints. Such a reduction is possible, for example, in factored MDPs with L\u221e-API and ALP [10], and can be easily extended to OAPI and ABP.\nIn the mountain-car benchmark, an underpowered car needs to climb a hill [23]. To do so, it first needs to back up an opposite hill to gain sufficient momentum. The car receives a reward of 1 when it climbs the hill. In the experiments, we used a discount factor \u03b3 = 0.99.\nThe experiments are designed to determine whether OAPI reliably minimizes the Bellman residual in comparison with API and ALP. We use a uniformly spaced linear spline to approximate the value function. The constraints were based on 200 uniformly sampled states with all 3 actions per state. We evaluated the methods with 100 and 144 approximation features, which corresponds to the number of linear segments.\nThe results of ABP (in particular, OAPI), ALP, API with L2 minimization, and LSPI are shown in Table 1. The results are reported for both the L\u221e norm and the uniformly weighted L2 norm. The runtimes of all these methods are comparable, with ALP being the fastest. Since API (LSPI) is not guaranteed to converge, we ran it for at most 20 iterations, which was an upper bound on the number of iterations of OAPI. 
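The two error measures reported in Table 1 are the L∞ norm and the uniformly weighted L2 norm of the Bellman residual over sampled states. A small sketch of how they are computed (the MDP and value function here are invented; in the experiments the states come from sampling):

```python
# L_inf and uniformly weighted L2 norms of the Bellman residual Lv - v,
# evaluated on a set of sampled states. Toy MDP with invented numbers.
import math

gamma = 0.99
P = {0: [[0.9, 0.1], [0.2, 0.8]], 1: [[0.4, 0.6], [0.7, 0.3]]}
r = {0: [0.0, 1.0], 1: [1.0, 0.0]}
S, A = range(2), (0, 1)

def bellman(v):
    return [max(r[a][s] + gamma * sum(P[a][s][t] * v[t] for t in S) for a in A)
            for s in S]

def residual_norms(v, states):
    res = [bellman(v)[s] - v[s] for s in states]
    linf = max(abs(x) for x in res)
    l2 = math.sqrt(sum(x * x for x in res) / len(res))  # uniform weights
    return linf, l2

linf, l2 = residual_norms([10.0, 12.0], list(S))
assert l2 <= linf  # the weighted L2 norm never exceeds the L_inf norm
```

The final inequality explains why the two panels of Table 1 can rank the methods differently: a method may have a few large residuals (hurting L∞) while keeping the average squared residual small.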
The results demonstrate that ABP minimizes the L\u221e Bellman residual much\nmore consistently than the other methods. Note, however, that all the considered algorithms would\nperform signi\ufb01cantly better given a \ufb01ner approximation.\n\n7 Conclusion and Future Work\n\nWe proposed and analyzed approximate bilinear programming, a new value-function approximation\nmethod, which provably minimizes the L\u221e Bellman residual. ABP returns the optimal approximate\nvalue function with respect to the Bellman residual bounds, despite the formulation with regard to\ntransitive-feasible value functions. We also showed that there is no asymptotically simpler formula-\ntion, since \ufb01nding the closest value function and solving a bilinear program are both NP-complete\nproblems. Finally, the formulation leads to the development of OAPI, a new convergent form of API\nwhich monotonically improves the objective value function.\nWhile we only discussed approximate solutions of the ABP, a deeper study of bilinear solvers may\nrender optimal solution methods feasible. ABPs have a small number of essential variables (that\ndetermine the value function) and a large number of constraints, which can be leveraged by the\nsolvers [15]. The L\u221e error bound provides good theoretical guarantees, but it may be too conserva-\ntive in practice. A similar formulation based on L2 norm minimization may be more practical.\nWe believe that the proposed formulation will help to deepen the understanding of value function\napproximation and the characteristics of existing solution methods, and potentially lead to the de-\nvelopment of more robust and widely-applicable reinforcement learning algorithms.\n\nAcknowledgements\n\nThis work was supported by the Air Force Of\ufb01ce of Scienti\ufb01c Research under Grant No. FA9550-\n08-1-0171. We also thank the anonymous reviewers for their useful comments.\n\n8\n\n\fReferences\n[1] Pieter Abbeel, Varun Ganapathi, and Andrew Y. Ng. 
Learning vehicular dynamics, with application to modeling helicopters. In Advances in Neural Information Processing Systems, pages 1–8, 2006.

[2] Daniel Adelman. A price-directed approach to stochastic inventory/routing. Operations Research, 52:499–514, 2004.

[3] Kristin P. Bennett and O. L. Mangasarian. Bilinear separation of two sets in n-space. Technical report, Computer Science Department, University of Wisconsin, 1992.

[4] Dimitri P. Bertsekas and Sergey Ioffe. Temporal differences-based policy iteration and applications in neuro-dynamic programming. Technical Report LIDS-P-2349, LIDS, 1997.

[5] Giuseppe Calafiore and Marco C. Campi. Uncertain convex programs: Randomized solutions and confidence levels. Mathematical Programming, Series A, 102:25–46, 2005.

[6] Alberto Caprara and Michele Monaci. Bidimensional packing by bilinear programming. Mathematical Programming, Series A, 118:75–108, 2009.

[7] Daniela P. de Farias. The Linear Programming Approach to Approximate Dynamic Programming: Theory and Application. PhD thesis, Stanford University, 2002.

[8] Daniela P. de Farias and Benjamin Van Roy. The linear programming approach to approximate dynamic programming. Operations Research, 51:850–856, 2003.

[9] Daniela P. de Farias and Benjamin Van Roy. On constraint sampling in the linear programming approach to approximate dynamic programming. Mathematics of Operations Research, 29(3):462–478, 2004.

[10] Carlos Guestrin, Daphne Koller, Ronald Parr, and Shobha Venkataraman. Efficient solution algorithms for factored MDPs. Journal of Artificial Intelligence Research, 19:399–468, 2003.

[11] Reiner Horst and Hoang Tuy. Global Optimization: Deterministic Approaches. Springer, 1996.

[12] Michail G. Lagoudakis and Ronald Parr. Least-squares policy iteration. Journal of Machine Learning Research, 4:1107–1149, 2003.

[13] O. L. Mangasarian.
The linear complementarity problem as a separable bilinear program. Journal of Global Optimization, 12:1–7, 1995.

[14] Rémi Munos. Error bounds for approximate policy iteration. In International Conference on Machine Learning, pages 560–567, 2003.

[15] Marek Petrik and Shlomo Zilberstein. Anytime coordination using separable bilinear programs. In Conference on Artificial Intelligence, pages 750–755, 2007.

[16] Marek Petrik and Shlomo Zilberstein. Average reward decentralized Markov decision processes. In International Joint Conference on Artificial Intelligence, pages 1997–2002, 2007.

[17] Marek Petrik and Shlomo Zilberstein. A bilinear programming approach for multiagent planning. Journal of Artificial Intelligence Research, 35:235–274, 2009.

[18] Marek Petrik and Shlomo Zilberstein. Constraint relaxation in approximate linear programs. In International Conference on Machine Learning, pages 809–816, 2009.

[19] Marek Petrik and Shlomo Zilberstein. Robust value function approximation using bilinear programming. Technical Report UM-CS-2009-052, Department of Computer Science, University of Massachusetts Amherst, 2009.

[20] Warren B. Powell. Approximate Dynamic Programming. Wiley-Interscience, 2007.

[21] Martin L. Puterman. Markov Decision Processes: Discrete Stochastic Dynamic Programming. John Wiley & Sons, Inc., 2005.

[22] Kenneth O. Stanley and Risto Miikkulainen. Competitive coevolution through evolutionary complexification. Journal of Artificial Intelligence Research, 21:63–100, 2004.

[23] Richard S. Sutton and Andrew Barto. Reinforcement Learning. MIT Press, 1998.

[24] István Szita and András Lőrincz. Learning Tetris using the noisy cross-entropy method. Neural Computation, 18(12):2936–2941, 2006.

[25] Ronald J. Williams and Leemon C. Baird. Tight performance bounds on greedy policies based on imperfect value functions.
In Yale Workshop on Adaptive and Learning Systems, 1994.