{"title": "Constrained Reinforcement Learning Has Zero Duality Gap", "book": "Advances in Neural Information Processing Systems", "page_first": 7555, "page_last": 7565, "abstract": "Autonomous agents must often deal with conflicting requirements, such as completing tasks using the least amount of time/energy, learning multiple tasks, or dealing with multiple opponents. In the context of reinforcement learning~(RL), these problems are addressed by (i)~designing a reward function that simultaneously describes all requirements or (ii)~combining modular value functions that encode them individually. Though effective, these methods have critical downsides. Designing good reward functions that balance different objectives is challenging, especially as the number of objectives grows. Moreover, implicit interference between goals may lead to performance plateaus as they compete for resources, particularly when training on-policy. Similarly, selecting parameters to combine value functions is at least as hard as designing an all-encompassing reward, given that the effect of their values on the overall policy is not straightforward. The later is generally addressed by formulating the conflicting requirements as a constrained RL problem and solved using Primal-Dual methods. These algorithms are in general not guaranteed to converge to the optimal solution since the problem is not convex. This work provides theoretical support to these approaches by establishing that despite its non-convexity, this problem has zero duality gap, i.e., it can be solved exactly in the dual domain, where it becomes convex. Finally, we show this result basically holds if the policy is described by a good parametrization~(e.g., neural networks) and we connect this result with primal-dual algorithms present in the literature and we establish the convergence to the optimal solution.", "full_text": "Constrained Reinforcement Learning Has Zero\n\nDuality Gap\n\nSantiago Paternain, Luiz F. O. 
Chamon, Miguel Calvo-Fullana and Alejandro Ribeiro

{spater,luizf,cfullana,aribeiro}@seas.upenn.edu
Electrical and Systems Engineering
University of Pennsylvania

Abstract

Autonomous agents must often deal with conflicting requirements, such as completing tasks using the least amount of time/energy, learning multiple tasks, or dealing with multiple opponents. In the context of reinforcement learning (RL), these problems are addressed by (i) designing a reward function that simultaneously describes all requirements or (ii) combining modular value functions that encode them individually. Though effective, these methods have critical downsides. Designing good reward functions that balance different objectives is challenging, especially as the number of objectives grows. Moreover, implicit interference between goals may lead to performance plateaus as they compete for resources, particularly when training on-policy. Similarly, selecting parameters to combine value functions is at least as hard as designing an all-encompassing reward, given that the effect of their values on the overall policy is not straightforward. The latter is generally addressed by formulating the conflicting requirements as a constrained RL problem, which is then solved using primal-dual methods. These algorithms are in general not guaranteed to converge to the optimal solution since the problem is not convex. This work provides theoretical support for these approaches by establishing that, despite its non-convexity, this problem has zero duality gap, i.e., it can be solved exactly in the dual domain, where it becomes convex. Finally, we show that this result essentially holds if the policy is described by a good parametrization (e.g., a neural network), we connect this result with primal-dual algorithms present in the literature, and we establish their convergence to the optimal solution.

1 Introduction

Autonomous agents must often deal with conflicting requirements, such as completing a task in the least amount of time/energy, learning multiple tasks or contexts, dealing with multiple opponents, or meeting several specifications designed to guide the agent through the learning process. In the context of reinforcement learning [1], these problems are generally addressed by combining modular value functions that encode each requirement individually, multiplying each signal by its own coefficient that controls the emphasis placed on it [2-4]. Although effective, this multi-objective approach [5] has several downsides. First, each set of penalty coefficients yields a different optimal solution, also known as a Pareto-optimal point [6]. In practice, the coefficients are selected through a time-consuming and computationally intensive process of hyper-parameter tuning that is often domain dependent, as shown in [7-9]. Moreover, implicit interference between the goals may lead to training plateaus as they compete for resources in the policy [10].

An alternative is to embed all conflicting requirements in a constrained RL problem and to use a primal-dual algorithm, as in [7, 11], that chooses the coefficients automatically. The main advantage of this approach is that the constraints ensure satisfying behavior without the need to manually select the penalty coefficients.

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

In these algorithms the policy update runs on a faster time-scale than the multiplier update. Effectively, these approaches work as if the dual problem of the constrained reinforcement learning problem were being solved, which guarantees obtaining the feasible solution with the smallest suboptimality. Yet, there is no guarantee on how small this suboptimality is. In this work we answer that question. In particular, we establish that:

1. Despite its non-convexity, constrained reinforcement learning for policies belonging to a general distribution class has zero duality gap, i.e., it can be solved exactly in the dual domain, where the problem is actually convex.

2. Since working with generic distributions as policies is in general intractable, we extend this result to parametrized policies, showing that the suboptimality bound also holds when the parametrization is a universal approximator, e.g., a neural network [12].

3. We leverage these theoretical results to establish that the family of primal-dual algorithms for constrained reinforcement learning, e.g., [7, 11], in fact converges to the optimal solution under mild assumptions.

1.1 Related Work

Constrained Markov Decision Processes (CMDPs) [13] are an active field of research. CMDP applications cover a vast number of topics, such as electric grids [14], networking [15], robotics [3, 16, 17] and finance [18, 19]. The most common approaches to solving these problems fall under the following categories. Manual selection of Lagrange multipliers: constrained reinforcement learning problems can be solved by maximizing an unconstrained Lagrangian for a specific multiplier [2]. The combination of different rewards with manually selected Lagrange multipliers has been applied, for instance, to learning complex movements for humanoids [4] or to limiting the variance of the constraint that needs to be satisfied [19, 20].
Prior knowledge about the system transitions can also be integrated in order to project the action chosen by the policy onto a set that ensures the satisfaction of the constraints [21]. Primal-dual algorithms [7, 11] choose the multipliers dynamically by finding the best policy for the current set of parameters and then taking steps along the gradient of the Lagrangian with respect to the multipliers. These methods accommodate general constraints, are reward agnostic, and do not require prior knowledge.

2 Constrained Reinforcement Learning

Let $t \in \mathbb{N} \cup \{0\}$ denote the time instant and let $\mathcal{S} \subset \mathbb{R}^n$ and $\mathcal{A} \subset \mathbb{R}^d$ be compact sets describing the possible states and actions of an agent governed by a Markovian dynamical system with transition probability density $p$, i.e., $p(s_{t+1} \mid \{s_u, a_u\}_{u \le t}) = p(s_{t+1} \mid s_t, a_t)$ for $s_t \in \mathcal{S}$ and $a_t \in \mathcal{A}$ for all $t$. The agent chooses actions sequentially based on a policy $\pi \in \mathcal{P}(\mathcal{S})$, where $\mathcal{P}(\mathcal{S})$ is the space of probability measures on $(\mathcal{A}, \mathcal{B}(\mathcal{A}))$ parametrized by elements of $\mathcal{S}$, with $\mathcal{B}(\mathcal{A})$ the Borel sets of $\mathcal{A}$. The action taken by the agent at each state results in rewards defined by the functions $r_i : \mathcal{S} \times \mathcal{A} \to \mathbb{R}$, for $i = 0, \ldots, m$, that the agent accumulates over time. These rewards describe different objectives that the agent must achieve, such as completing a task, remaining within a region of the state space, or not running out of battery. The goal of constrained RL is then to find a policy $\pi^\star \in \mathcal{P}(\mathcal{S})$ that meets these objectives by solving the problem

$$P^\star \triangleq \max_{\pi \in \mathcal{P}(\mathcal{S})} \; V_0(\pi) \triangleq \mathbb{E}_{s,\pi}\!\left[\sum_{t=0}^\infty \gamma^t r_0(s_t, \pi(s_t))\right]$$
$$\text{subject to } V_i(\pi) \triangleq \mathbb{E}_{s,\pi}\!\left[\sum_{t=0}^\infty \gamma^t r_i(s_t, \pi(s_t))\right] \ge c_i, \quad i = 1, \ldots, m, \tag{PI}$$

where $\gamma \in (0,1)$ is a discount factor and $c_i \in \mathbb{R}$ represents the $i$-th reward specification. It is important to contrast the formulation in (PI) with the unconstrained, regularized problem commonly found in the literature [4, 19, 20],

$$\max_{\pi \in \mathcal{P}(\mathcal{S})} \; V_0(\pi) + \sum_{i=1}^m w_i \left(V_i(\pi) - c_i\right), \tag{$\tilde{\text{PI}}$}$$

where $w_i \ge 0$ are the regularization parameters. First, (PI) precludes the manual balancing of different requirements through the choice of $w_i$. Even with expert knowledge, tuning these parameters can be as hard as solving the RL problem itself, since there is no straightforward relation between the value of $w_i$ and the value $V_i(\pi^\star)$ attained by the final policy. What is more, note that the objective of ($\tilde{\text{PI}}$) can be written as a single value function $\bar{V}(\pi) \triangleq \mathbb{E}_{s,\pi}[\sum_{t=0}^\infty \gamma^t \bar{r}(s_t, \pi(s_t))]$ with $\bar{r}(s_t, \pi(s_t)) = r_0(s_t, \pi(s_t)) + \sum_{i=1}^m w_i r_i(s_t, \pi(s_t))$. In other words, choosing the values of $w_i$ amounts to designing a reward that simultaneously encodes different, possibly conflicting, objectives and/or requirements. Given how challenging designing a good reward function can be even for a single task, it is clear that this regularized approach is neither efficient nor effective.

Though promising, solving the constrained RL problem in (PI) is intricate. Indeed, it is both infinite dimensional and non-convex, so it is in general not tractable in the primal domain. Its dual problem, on the other hand, is convex and has dimensionality equal to the number of constraints. However, since (PI) is not a convex program, its dual problem in general only provides an upper bound on $P^\star$. How good the policy obtained by solving the dual problem is depends on the tightness of this bound.
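The earlier observation that the regularized objective of ($\tilde{\text{PI}}$) collapses into a single combined reward $\bar{r}$ can be made concrete with a small sketch. All weights and reward values below are made up for illustration:

```python
import numpy as np

# Hypothetical per-objective rewards at some state-action pair (s, a):
# r0 is the task reward; r1, r2 encode auxiliary requirements.
r0 = 1.0
r = np.array([-0.2, 0.1])   # r_1(s, a), r_2(s, a)
w = np.array([0.5, 2.0])    # regularization weights w_i

def r_bar(r0, r, w):
    """Combined reward r_bar = r0 + sum_i w_i * r_i."""
    return r0 + w @ r

# Training on r_bar is ordinary single-reward RL, so choosing the w_i
# is exactly the reward-design problem described in the text.
print(r_bar(r0, r, w))  # 1.0 + 0.5*(-0.2) + 2.0*0.1 = 1.1
```

This makes the downside tangible: every retuning of $w_i$ silently defines a different single-reward RL problem.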
What is more, formulating the problem in the dual domain is at least as hard as solving ($\tilde{\text{PI}}$), which is also infinite dimensional and non-convex. In the sequel, we address these two issues by first showing that (PI) has no duality gap (Section 3), i.e., that the upper bound on $P^\star$ from the dual problem is tight. This implies that (PI) can be solved exactly in the dual domain. Then, we show that we lose (almost) nothing by parametrizing the policies $\pi$ (Section 4), which immediately addresses the issue of dimensionality in (PI)-($\tilde{\text{PI}}$). Finally, we put forward and analyze a primal-dual algorithm for constrained RL (Section 5), showing that under mild conditions it yields a locally optimal, feasible solution of (PI).

3 Constrained Reinforcement Learning Has Zero Duality Gap

Let us start by formalizing the concept of the dual problem. Let the vector $\lambda \in \mathbb{R}^m_+$ collect the Lagrange multipliers of the constraints of (PI) and define its Lagrangian as

$$L(\pi, \lambda) \triangleq V_0(\pi) + \sum_{i=1}^m \lambda_i \left(V_i(\pi) - c_i\right). \tag{1}$$

The dual function is then the point-wise maximum of (1) with respect to the policy $\pi$, i.e.,

$$d(\lambda) \triangleq \max_{\pi \in \mathcal{P}(\mathcal{S})} L(\pi, \lambda). \tag{2}$$

The dual function (2) provides an upper bound on the value of (PI), i.e., $d(\lambda) \ge P^\star$ for all $\lambda \in \mathbb{R}^m_+$ [22, Section 5.1.3]. The tighter the bound, the closer the policy obtained from (2) is to the optimal solution of (PI). Hence, the dual problem is that of finding the tightest of these bounds:

$$D^\star \triangleq \min_{\lambda \in \mathbb{R}^m_+} d(\lambda). \tag{DI}$$

Note that the dual function (2) can be related to the unconstrained, regularized problem ($\tilde{\text{PI}}$) from Section 2 by taking $\lambda_i = w_i$ in (1). Hence, (2) takes on the optimal value of ($\tilde{\text{PI}}$) for all possible regularization parameters. Problem (DI) then finds the best regularized problem, i.e., the one whose value is closest to $P^\star$. It turns out this problem is tractable whenever $d(\lambda)$ can be evaluated, since (DI) is a convex program (the dual function is the point-wise maximum of a family of affine functions of $\lambda$ and is therefore convex) [22, Section 3.2.3].

Despite these similarities, (DI) [and consequently ($\tilde{\text{PI}}$)] need not solve the same problem as (PI). In other words, there need not be a relation between the optimal dual variables $\lambda^\star$ from (DI), or the regularization parameters $w_i$, and the specifications $c_i$ of (PI). This depends on the value of the duality gap $\Delta = D^\star - P^\star$. Indeed, if $\Delta$ is small, then so is the suboptimality of the policies obtained from (DI). In the limit case $\Delta = 0$, problems (PI), (DI), and ($\tilde{\text{PI}}$) are all essentially equivalent. Since (PI) is not a convex program, however, this result does not hold immediately. Still, we claim in Theorem 1 that (PI) has zero duality gap under Slater's condition. Before stating the theorem, we define the perturbation function associated to problem (PI), which is fundamental for the proof of the result and for future reference. For any $\xi \in \mathbb{R}^m$, the perturbation function associated to (PI) is defined as

$$P(\xi) \triangleq \max_{\pi \in \mathcal{P}(\mathcal{S})} \; V_0(\pi) \quad \text{subject to } V_i(\pi) \ge c_i + \xi_i, \; i = 1, \ldots, m. \tag{PI$'$}$$

Notice that $P(0) = P^\star$, the optimal value of (PI). We next formally state the conditions under which problem (PI) has zero duality gap.

Theorem 1. Suppose that $r_i$ is bounded for all $i = 0, \ldots, m$ and that Slater's condition holds for (PI). Then, strong duality holds for (PI), i.e., $P^\star = D^\star$.

Proof.
This proof relies on a well-known result from perturbation theory connecting strong duality to the concavity of the perturbation function defined in (PI$'$). We formalize this result next.

Proposition 1 (Fenchel-Moreau). If (i) Slater's condition holds for (PI) and (ii) its perturbation function $P(\xi)$ is concave, then strong duality holds for (PI).

Proof. See, e.g., [23, Cor. 30.2.2].

Condition (i) of Proposition 1 is satisfied by the hypotheses of Theorem 1. It suffices then to show that the perturbation function is concave [(ii)], i.e., that for every $\xi^1, \xi^2 \in \mathbb{R}^m$ and $\mu \in (0,1)$,

$$P\!\left[\mu \xi^1 + (1-\mu)\xi^2\right] \ge \mu P(\xi^1) + (1-\mu) P(\xi^2). \tag{3}$$

If the problem becomes infeasible for either perturbation $\xi^1$ or $\xi^2$, then $P(\xi^1) = -\infty$ or $P(\xi^2) = -\infty$ and (3) holds trivially. For perturbations that keep the problem feasible, suppose $P(\xi^1)$ and $P(\xi^2)$ are achieved by the policies $\pi_1 \in \mathcal{P}(\mathcal{S})$ and $\pi_2 \in \mathcal{P}(\mathcal{S})$ respectively. Then $P(\xi^1) = V_0(\pi_1)$ with $V_i(\pi_1) - c_i \ge \xi^1_i$, and $P(\xi^2) = V_0(\pi_2)$ with $V_i(\pi_2) - c_i \ge \xi^2_i$, for $i = 1, \ldots, m$. To establish (3) it suffices to show that for every $\mu \in (0,1)$ there exists a policy $\pi_\mu$ such that $V_i(\pi_\mu) - c_i \ge \mu \xi^1_i + (1-\mu)\xi^2_i$ and $V_0(\pi_\mu) = \mu V_0(\pi_1) + (1-\mu)V_0(\pi_2)$. Notice that any policy $\pi_\mu$ satisfying the previous conditions is feasible for the slack $c_i + \mu\xi^1_i + (1-\mu)\xi^2_i$. Hence, by definition of the perturbed problem (PI$'$), it follows that

$$P\!\left[\mu\xi^1 + (1-\mu)\xi^2\right] \ge V_0(\pi_\mu) = \mu V_0(\pi_1) + (1-\mu)V_0(\pi_2) = \mu P(\xi^1) + (1-\mu)P(\xi^2). \tag{4}$$

If such a policy exists, the previous equation implies (3). Thus, to complete the proof we need to establish its existence. To do so, we start by formulating a linear program equivalent to (PI$'$). Notice that for any $i = 0, \ldots, m$ we can write

$$V_i(\pi) = \int_{(\mathcal{S}\times\mathcal{A})^\infty} \left(\sum_{t=0}^\infty \gamma^t r_i(s_t, a_t)\right) p_\pi(s_0, a_0, \ldots)\, ds_0 \ldots da_0 \ldots \tag{5}$$

Since the reward functions are bounded, the Dominated Convergence Theorem applies and allows us to exchange the order of the sum and the integral. Moreover, using conditional probabilities and the Markov property of the system transitions, we can write $V_i(\pi)$ as

$$V_i(\pi) = \sum_{t=0}^\infty \gamma^t \int_{(\mathcal{S}\times\mathcal{A})^\infty} r_i(s_t, a_t) \prod_{u=1}^\infty p(s_u \mid s_{u-1}, a_{u-1})\, \pi(a_u \mid s_u)\, p(s_0)\pi(a_0 \mid s_0)\, ds_0 \ldots da_0 \ldots \tag{6}$$

Notice that for every $u > t$ the integrals with respect to $a_u$ and $s_u$ evaluate to one, since they integrate density functions. Thus, the previous expression reduces to

$$V_i(\pi) = \sum_{t=0}^\infty \gamma^t \int_{(\mathcal{S}\times\mathcal{A})^t} r_i(s_t, a_t) \prod_{u=1}^{t} p(s_u \mid s_{u-1}, a_{u-1})\, \pi(a_u \mid s_u)\, p(s_0)\pi(a_0 \mid s_0)\, ds_0 \ldots ds_t\, da_0 \ldots da_t. \tag{7}$$

Notice that the probability density of being at state $s_t$ and choosing action $a_t$ under the policy $\pi$ at time $t$ can be written as

$$p^t_\pi(s_t, a_t) = \int_{(\mathcal{S}\times\mathcal{A})^{t-1}} \prod_{u=1}^{t} p(s_u \mid s_{u-1}, a_{u-1})\, \pi(a_u \mid s_u)\, p(s_0)\pi(a_0 \mid s_0)\, ds_0 \ldots ds_{t-1}\, da_0 \ldots da_{t-1}. \tag{8}$$

Thus, using the Dominated Convergence Theorem again, one can write (7) compactly as

$$V_i(\pi) = \int_{\mathcal{S}\times\mathcal{A}} r_i(s,a) \sum_{t=0}^\infty \gamma^t p^t_\pi(s,a)\, ds\, da. \tag{9}$$

By defining the occupation measure $\rho(s,a) = (1-\gamma)\sum_{t=0}^\infty \gamma^t p^t_\pi(s,a)$, it follows that $(1-\gamma)V_i(\pi) = \int_{\mathcal{S}\times\mathcal{A}} r_i(s,a)\rho(s,a)\, ds\, da$. Denote by $\mathcal{M}(\mathcal{S},\mathcal{A})$ the measures over $\mathcal{S}\times\mathcal{A}$ and define the set $\mathcal{R}$ of all occupation measures induced by the policies $\pi \in \mathcal{P}(\mathcal{S})$ as

$$\mathcal{R} := \left\{ \rho \in \mathcal{M}(\mathcal{S},\mathcal{A}) \,\middle|\, \rho(s,a) = (1-\gamma)\sum_{t=0}^\infty \gamma^t p_\pi(s_t = s, a_t = a) \right\}. \tag{10}$$

It follows from [24, Theorem 3.1] that the set of occupation measures $\mathcal{R}$ is convex and compact. Hence, we can write the following linear program equivalent to (PI$'$):

$$P(\xi) \triangleq \max_{\rho \in \mathcal{R}} \; \frac{1}{1-\gamma}\int_{\mathcal{S}\times\mathcal{A}} r_0(s,a)\rho(s,a)\, ds\, da \quad \text{subject to } \frac{1}{1-\gamma}\int_{\mathcal{S}\times\mathcal{A}} r_i(s,a)\rho(s,a)\, ds\, da \ge c_i + \xi_i, \; i = 1,\ldots,m. \tag{PI$''$}$$

Let $\rho_1, \rho_2 \in \mathcal{R}$ be the occupation measures associated to $\pi_1$ and $\pi_2$. Since $\mathcal{R}$ is convex, there exists a policy $\pi_\mu \in \mathcal{P}(\mathcal{S})$ whose occupation measure is $\rho_\mu = \mu\rho_1 + (1-\mu)\rho_2 \in \mathcal{R}$. Notice that $\rho_\mu$ satisfies the constraints with slack $c_i + \mu\xi^1_i + (1-\mu)\xi^2_i$ for $i = 1,\ldots,m$, since the integral is linear and $\rho_1$ and $\rho_2$ satisfy the constraints with slacks $c_i + \xi^1_i$ and $c_i + \xi^2_i$ respectively. Thus, it follows that

$$P(\mu\xi^1 + (1-\mu)\xi^2) \ge \frac{1}{1-\gamma}\int_{\mathcal{S}\times\mathcal{A}} r_0(s,a)\rho_\mu(s,a)\, ds\, da = \mu V_0(\pi_1) + (1-\mu)V_0(\pi_2), \tag{11}$$

where we have again used the linearity of the integral. Since the $\pi_i$ are such that $V_0(\pi_1) = P(\xi^1)$ and $V_0(\pi_2) = P(\xi^2)$, inequality (3) follows. This completes the proof that the perturbation function is concave.

Theorem 1 establishes a fundamental equivalence between the constrained problem (PI) and the dual problem (DI) [and therefore also ($\tilde{\text{PI}}$)]. Indeed, since (PI) has no duality gap, its solution can be obtained by solving (DI). What is more, the trade-offs expressed by the $w_i$ in ($\tilde{\text{PI}}$) are the same as those expressed by the specifications $c_i$, in the sense that they trace the same Pareto front. Nevertheless, note that the relationship between $c_i$ and $w_i$ is not trivial, and that specifying the constrained problem is often considerably simpler. Theorem 1 establishes that this is indeed a valid transformation, since both problems are equivalent. Observe that, due to the non-convexity of the objective in RL problems, this result is in fact not immediate.

The theoretical importance of the previous result notwithstanding, it does not yield a procedure to solve (PI), since evaluating the dual function involves a maximization problem that is intractable for general classes of distributions. In the next section, we study the effect of using a finite parametrization for the policies and show that the price to pay in terms of duality gap depends on how "good" the parametrization is.
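The key identity behind the linear program (PI$''$) in the proof above, $(1-\gamma)V_i(\pi) = \int_{\mathcal{S}\times\mathcal{A}} r_i\,\rho\,ds\,da$, can be checked numerically on a small finite MDP. Everything below (transitions, rewards, policy) is a made-up toy instance, not from the paper:

```python
import numpy as np

gamma = 0.9
P = np.zeros((2, 2, 2))                  # P[s, a, s'], hypothetical MDP
P[0, 0] = [0.9, 0.1]; P[0, 1] = [0.2, 0.8]
P[1, 0] = [0.5, 0.5]; P[1, 1] = [0.1, 0.9]
r = np.array([[1.0, 0.0], [0.0, 2.0]])   # reward r[s, a]
pi = np.array([[0.7, 0.3], [0.4, 0.6]])  # stochastic policy pi[s, a]
p0 = np.array([1.0, 0.0])                # initial state distribution

# Occupation measure rho(s, a) = (1 - gamma) * sum_t gamma^t p_t(s, a),
# accumulated over a long truncated horizon.
rho = np.zeros((2, 2))
ps = p0.copy()
for t in range(2000):
    rho += (1 - gamma) * gamma**t * ps[:, None] * pi
    ps = np.einsum('s,sa,sap->p', ps, pi, P)  # state distribution at t+1

# Independent evaluation of V(pi) via the Bellman linear system.
r_pi = (pi * r).sum(axis=1)              # expected reward per state
P_pi = np.einsum('sa,sap->sp', pi, P)    # state transitions under pi
V = np.linalg.solve(np.eye(2) - gamma * P_pi, r_pi)

# The identity (1 - gamma) V(pi) = sum_{s,a} r(s,a) rho(s,a):
print((rho * r).sum(), (1 - gamma) * (p0 @ V))
```

Since values are linear in $\rho$, optimizing over the convex set of occupation measures is what turns (PI$'$) into the linear program (PI$''$).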
If we consider, for instance, a neural network—a universal function approximator [12, 25-28]—the loss in optimality can be made arbitrarily small.

4 There is (almost) no price to pay for parametrizing the policies

We consider next the problem where the policies are parametrized by a vector $\theta \in \mathbb{R}^p$. This vector could be, for instance, the coefficients of a neural network or the weights of a linear combination of functions. In this work we focus, however, on a widely used class of parametrizations that we term near-universal, which are able to model any function in $\mathcal{P}(\mathcal{S})$ to within a stated accuracy. We formalize this concept in the following definition.

Definition 1. A parametrization $\pi_\theta$ is an $\epsilon$-universal parametrization of functions in $\mathcal{P}(\mathcal{S})$ if, for some $\epsilon > 0$, there exists for any $\pi \in \mathcal{P}(\mathcal{S})$ a parameter $\theta \in \mathbb{R}^p$ such that

$$\max_{s \in \mathcal{S}} \int_{\mathcal{A}} |\pi(a|s) - \pi_\theta(a|s)|\, da \le \epsilon. \tag{12}$$

The previous definition includes all parametrizations that induce distributions close in total variation norm to the distributions in $\mathcal{P}(\mathcal{S})$. Notice that this is a milder requirement than approximation in uniform norm, a property that has been established for radial basis function networks [29], reproducing kernel Hilbert spaces [30] and deep neural networks [12]. Notice also that the objective function and the constraints in problem (PI) involve an infinite horizon, and thus the policy is applied an infinite number of times. Hence, the error introduced by the parametrization could a priori accumulate and induce distributions over trajectories that differ considerably from those induced by policies in $\mathcal{P}(\mathcal{S})$. We claim in the following lemma that this is not the case.

Lemma 1. Let $\rho$ and $\rho_\theta$ be the occupation measures induced by the policies $\pi \in \mathcal{P}(\mathcal{S})$ and $\pi_\theta$ respectively, where $\pi_\theta$ is an $\epsilon$-universal parametrization of $\pi$. Then, it follows that

$$\int_{\mathcal{S}\times\mathcal{A}} |\rho(s,a) - \rho_\theta(s,a)|\, ds\, da \le \frac{\epsilon}{1-\gamma}. \tag{13}$$

The previous result, although derived as a technical step needed to bound the duality gap of parametrized problems, has a natural interpretation: the larger $\gamma$—i.e., the more the agent is concerned about rewards far in the future—the larger the error in the approximation of the occupation measure. Having defined the concept of universal approximator, we shift focus to the parametric version of the constrained reinforcement learning problem, that is, to finding the parameters that solve (PI) when the policies are restricted to the functions induced by the chosen parametrization:

$$P^\star_\theta \triangleq \max_{\theta} \; V_0(\theta) \triangleq \mathbb{E}_{s,\pi_\theta}\!\left[\sum_{t=0}^\infty \gamma^t r_0(s_t, \pi_\theta(s_t))\right]$$
$$\text{subject to } V_i(\theta) \triangleq \mathbb{E}_{s,\pi_\theta}\!\left[\sum_{t=0}^\infty \gamma^t r_i(s_t, \pi_\theta(s_t))\right] \ge c_i, \quad i = 1,\ldots,m. \tag{PII}$$

Notice that problem (PII) is similar to the original problem (PI), the only difference being that the expectations are now with respect to the distributions induced by the parameter vector $\theta$.
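For a finite action set, the integral in Definition 1 becomes a sum, and the total-variation condition can be checked directly. A small sketch with a hypothetical tabular softmax parametrization of a target policy (all numbers invented for illustration):

```python
import numpy as np

def tv_gap(pi, pi_theta):
    """max_s sum_a |pi(a|s) - pi_theta(a|s)| for tabular policies,
    i.e., the left-hand side of (12) with the integral as a sum."""
    return np.abs(pi - pi_theta).sum(axis=1).max()

pi = np.array([[0.7, 0.3], [0.4, 0.6]])      # target policy pi(a|s)
# A softmax parametrization whose logits are a slightly perturbed log pi:
theta = np.log(pi) + 0.01 * np.array([[1.0, -1.0], [-1.0, 1.0]])
pi_theta = np.exp(theta) / np.exp(theta).sum(axis=1, keepdims=True)

eps = tv_gap(pi, pi_theta)  # pi_theta is an eps-universal fit of pi
print(eps)
```

By Lemma 1, a fit with gap `eps` perturbs the induced occupation measure by at most `eps / (1 - gamma)`, which is how the per-step approximation error enters the duality-gap bound.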
As done in the previous section, let $\lambda \in \mathbb{R}^m_+$ and define the dual function associated to (PII) as

$$d_\theta(\lambda) \triangleq \max_{\theta \in \mathbb{R}^p} L_\theta(\theta, \lambda) \triangleq \max_{\theta \in \mathbb{R}^p} \; V_0(\theta) + \sum_{i=1}^m \lambda_i \left(V_i(\theta) - c_i\right). \tag{14}$$

Likewise, we define the dual problem as that of finding the tightest upper bound for (PII):

$$D^\star_\theta \triangleq \min_{\lambda \in \mathbb{R}^m_+} d_\theta(\lambda). \tag{DII}$$

As previously stated, the reason for introducing the parametrization is to turn the original functional optimization problem into a tractable one in which the optimization variable is a finite-dimensional vector of parameters. Yet there is a cost for introducing the parametrization: the duality gap is no longer null, which means that the solution obtained through the dual problem is suboptimal. We claim, however, that this gap is bounded by a function that is linear in the approximation error $\epsilon$; thus, if the parametrization has good representation power, the price to pay is almost zero. This is the subject of the following theorem.

Theorem 2. Suppose that $r_i$ is bounded for all $i = 0,\ldots,m$ by constants $B_{r_i} > 0$ and define $B_r = \max_{i=1,\ldots,m} B_{r_i}$. Let $\lambda^\star_\epsilon$ be the solution to the dual problem associated to (PI$'$) for the perturbation $\xi_i = B_r \epsilon/(1-\gamma)$ for all $i = 1,\ldots,m$. Then, under the hypotheses of Theorem 1, it follows that

$$P^\star \ge D^\star_\theta \ge P^\star - \left(B_{r_0} + \|\lambda^\star_\epsilon\|_1 B_r\right)\frac{\epsilon}{1-\gamma}, \tag{15}$$

where $P^\star$ is the optimal value of (PI) and $D^\star_\theta$ is the value of the parametrized dual problem (DII).

The implication of the previous result is that there is almost no price to pay for introducing a parametrization: by solving the dual problem (DII), the suboptimality achieved is of order $\epsilon$, i.e., of the order of the error in the representation of the policies. This error can be made arbitrarily small by increasing the representation ability of the parametrization, for instance by increasing the dimension of the parameter vector $\theta$. Hence, if we can compute the dual function, it is possible to solve (PI) approximately. Moreover, working in the dual domain provides two computational advantages: on one hand, the dimension of the problem equals the number of constraints in (PI); on the other, the dual function is always convex, so gradient descent in the dual domain solves the problem of interest. In the next section we propose an algorithm to solve (PI) approximately based on this discussion.

Before doing so, notice that we have not assumed anything about the feasibility of problem (PII). If the problem is infeasible, then $D^\star_\theta = -\infty$ and the upper bound in (15) holds trivially. Infeasibility of (PII) also means that there is no policy $\pi \in \mathcal{P}(\mathcal{S})$ that satisfies the constraints of (PI) with slack $B_r\epsilon/(1-\gamma)$, since $\pi_\theta$ is an $\epsilon$-universal approximation of $\mathcal{P}(\mathcal{S})$. Hence the perturbed problem is infeasible, which yields a dual multiplier $\lambda^\star_\epsilon$ of infinite norm, so the right-hand side of (15) holds as well. In that sense, as long as the parametrization introduced keeps the problem feasible, the price to pay for parametrizing is almost zero.

5 Solving Constrained Reinforcement Learning Problems

As previously stated, the dual function is always convex, since it is the point-wise maximum of a family of functions that are affine in $\lambda$.
Thus the dual problem (DII) can be ef\ufb01ciently solved using (sub)gradient\ndescent, with the caveat that because we require the dual iterates to remain in the positive orthant,\nwe include a projection onto this space after taking the gradient step\n\n\u03bbk+1 = [\u03bbk \u2212 \u03b7\u2202d\u03b8(\u03bbk)]+ ,\n\n(16)\nwhere \u03b7 > 0 is the step-size of the algorithm, [\u00b7]+ denotes the projection onto Rm\n+ and \u2202d\u03b8(\u03bb)\ndenotes \u2014with a slight abuse of notation \u2014a vector in the subgradient of d\u03b8(\u03bb). The latter can be\ncomputed by virtue of Dankin\u2019s Theorem (see e.g. [31, Chapter 3]) by evaluating the constraints\nin the original problem (PII) at the primal maximizer of the Lagrangian. Thus, the main theoretical\ndif\ufb01culty in this computation lies on \ufb01nding said maximizer since the Lagrangian is non-convex\nwith respect to \u03b8. However, maximizing the Lagrangian with respect to \u03b8 corresponds to learning a\npolicy that uses as reward the following linear combination of rewards\n\nr\u03bb(s, a) = r0(s, a) +\n\n\u03bbiri(s, a).\n\n(17)\n\n(cid:34) \u221e(cid:88)\n\nEs,\u03c0\n\n(cid:35)\n\n(cid:34) \u221e(cid:88)\n\nt=0\n\n(cid:34) \u221e(cid:88)\n\n(cid:35)\n\nIndeed, using the linearity of the expectation, the cumulative discounted cost for the reward r\u03bb(s, a)\nyields\n\n\u03bbiEs,\u03c0\u03b8\n\n= Es,\u03c0\u03b8\n\n\u03b3tri(st, at)\n\n\u03b3tr\u03bb(st, at)\n\n\u03b3tr0(st, at)\n\n+\n\nt=0\n\nt=0\n\n= L(\u03b8, \u03bb).\n(18)\nAnd therefore reinforcement learning algorithms such as policy gradient [32] or actor-critic meth-\nods [33] can be used to \ufb01nd the parameters \u03b8 such that they maximize the Lagrangian. The good\nperformance of these algorithms is rooted in the fact that they are able to maximize the expected\ncumulative reward or at least to achieve a value that is close to the maximum. The next assumption\nformalizes this idea.\nAssumption 1. 
Let \u03c0\u03b8 be a parametrization of functions in P(S) and let L\u03b8(\u03b8, \u03bb) with \u03bb \u2208 Rm\nbe the Lagrangian associated to (PII). Denote by \u03b8(cid:63)(\u03bb), \u03b8\u2020(\u03bb) \u2208 RP the maximum of L(\u03b8, \u03bb) and\na local maximum respectively achieved by a generic reinforcement learning algorithm. Then, there\nexists \u03b4 > 0 such that for all \u03bb \u2208 Rm\nNotice that the previous assumption only means that we are able to solve the regularized uncon-\nstrained problem approximately. This means that the parameter at time k + 1 is\n\n+ it holds that L\u03b8(\u03b8(cid:63)(\u03bb), \u03bb) \u2264 L\u03b8(\u03b8\u2020(\u03bb), \u03bb) + \u03b4.\n\n+\n\nm(cid:88)\n\ni=1\n\n(cid:35)\n\nm(cid:88)\n\ni=1\n\nThen, the dual variable is updated following the gradient descent scheme suggested in (16), where\nwe replace the subgradient of the dual function by the constraint of the primal problem (PII). De\ufb01n-\n\n\u03b8k+1 \u2248 argmax\n\n\u03b8\u2208Rp L(\u03bbk, \u03b8).\n\n(19)\n\ning \u02c6\u2202dk (cid:44) V (\u03b8k+1) \u2212 s, the update yields\n\u03bbk \u2212 \u03b7 \u02c6\u2202dk\n\n\u03bbk+1 =\n\n(cid:104)\n\n(cid:105)\n\n+\n\n= [\u03bbk \u2212 \u03b7 (V (\u03b8k+1) \u2212 s)]+ .\n\n(20)\n\n7\n\n\fAlgorithm 1 dualDescent\nInput: \u03b7\n1: Initialize: \u03b80 = 0, \u03bb0 = 0\n2: for k = 0, 1 . . .\n3:\n4:\n5: end\n\nCompute an approximation of \u03b8k+1 \u2248 argmaxL\u03b8(\u03b8, \u03bbk) with a RL algorithm\nCompute the dual ascent step \u03bbk+1 = [\u03bbk \u2212 \u03b7 (V (\u03b8k+1) \u2212 s)]+.\n\nThe algorithm given by (19)\u2013(20) is summarized under Algorithm 1. The previous algorithm relies\non the fact that the \u02c6\u2202dk does not differ much from \u2202d\u03b8(\u03bbk). We claim in the following proposition\nthat this is the case. 
In particular, we establish that the constraint evaluation does not differ from the subgradient by more than δ, the error on the primal maximization defined in Assumption 1.
Proposition 2. Under Assumption 1, the constraint in (PII) evaluated at a local maximizer of the Lagrangian θ†(λ) approximates the subgradient of the dual function (14). In particular, it follows that

dθ(λ) − dθ(λ*θ) ≤ (λ − λ*θ)ᵀ (V(θ†(λ)) − s) + δ.    (21)

The previous proposition is key in establishing convergence of the proposed algorithm, since it allows us to claim that the dual update is an approximation of a dual descent step. We formalize this result next, and we establish a maximum number of dual steps required to achieve a desired accuracy.
Theorem 3. Let πθ be an ϵ-universal parametrization of P(S) according to Definition 1, let Br = max_{i=1...m} Bri with Bri > 0 bounds on the rewards ri, and let γ ∈ (0, 1) be the discount factor. Then, if Slater's conditions hold for (PII), under Assumption 1 and for any ε > 0, the sequence of updates of Algorithm 1 with step size η converges in K > 0 steps, with

K ≤ ‖λ0 − λ*θ‖² / (2ηε),    (22)

to a neighborhood of P* (the solution of (PI)) satisfying

P* − (Br0 + ‖λ*ϵ‖1 Br) ϵ / (1 − γ) ≤ dθ(λK) ≤ P* + ηB/2 + δ + ε,    (23)

where B = Σ_{i=1}^m (Bri/(1 − γ) − ci)² and λ* is the solution of (DI).
The previous result establishes a bound on the number of dual iterations required to converge to a neighborhood of the optimal solution. This bound is linear in the inverse of the desired accuracy ε.
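To get a feel for the bound in (22), one can plug in numbers directly; the multipliers, step size, and accuracy below are made up for illustration (in practice λ* is of course unknown).

```python
import numpy as np

# Iteration bound of Theorem 3: K <= ||lambda_0 - lambda*||^2 / (2 * eta * eps).
lambda_0 = np.zeros(3)                    # initial multipliers (Algorithm 1 starts at 0)
lambda_star = np.array([2.0, 0.5, 1.0])   # hypothetical optimal multipliers
eta, eps = 0.1, 0.01                      # step size and desired accuracy

K = np.linalg.norm(lambda_0 - lambda_star) ** 2 / (2 * eta * eps)
print(K)  # roughly 2625 dual steps; halving eps doubles the bound
```

This makes the linearity in 1/ε concrete: tightening the accuracy from ε to ε/2 doubles the number of dual iterations the bound allows.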
Notice that the size of the neighborhood to which the dual descent algorithm converges depends on the representation ability of the chosen parametrization and on the quality of the solution of the maximization of the Lagrangian. Since running policy gradient or actor-critic algorithms until convergence before updating the dual variable might result in an algorithm that is computationally prohibitive, an alternative that is common in the context of optimization is to update both variables in parallel [34]. This idea can be applied in the context of reinforcement learning as well, where a policy gradient update (or an actor-critic update, as in [7, 11]) is followed by an update of the multipliers along the direction of the constraint violation. In these algorithms the update of the policy is on a faster scale than the update of the multipliers, and therefore, from a theoretical point of view, they operate as (1). In particular, the proofs in [7, 11] rely on the fact that this difference in time scales allows one to consider the multiplier as constant.

6 Numerical Example

In this section, we include a numerical example in order to showcase the consequences of our theoretical results. As an illustrative example, we consider a gridworld navigation scenario. This scenario, illustrated in Figure 1, consists of an agent attempting to navigate from a starting position to a goal. To do so, the agent must cross from the left side of the world to the right side using either one of two bridges. The bridge above is deemed "unsafe". The agent uses a softmax policy with four possible actions (moving up, down, left, and right) over a table lookup of states and actions. The agent receives a reward r(s, a) = 10 for reaching the goal and a reward of r(s, a) = −1 for

Figure 1: Safe (blue) and unsafe (red) optimal path.
Parametrization coarseness is shown on the bottom left.

Figure 2: Duality gap of the policies.

Figure 3: Effect of parametrization coarseness.

each step it wanders outside of the goal. The scenario is designed such that the shortest path requires crossing the unsafe path (red bridge), while the safe path (blue bridge) requires a longer detour. Using our formulation, we constrain the agent to not cross the unsafe bridge with 99% probability. We train the agent via Algorithm 1 and plot in Fig. 2 the resulting normalized duality gap. We consider two cases: an inexact primal maximization via policy gradient, and an exact primal maximization. The latter is possible because, for a given value of the dual variables λ, the exact primal solution can be found via Dijkstra's algorithm. We show that by solving the primal step of Algorithm 1 exactly, the duality gap effectively vanishes (red curve). We also showcase a curve in which the primal step is replaced by a single policy gradient step (blue curve). Since the primal optimization is then done only approximately, the duality gap decreases at a slower rate and only converges to a neighborhood of zero (as per Theorem 3). In either case, ultimately, the agent learns to navigate from start to goal by crossing the safe bridge (blue path in Fig. 1).
Now, we turn our attention to the effect of the parametrization size. We consider parametrizations of different coarseness via state aggregation, as shown in Fig. 1. This corresponds, as per Definition 1, to parametrizations with larger values of ϵ, i.e., looser approximators. Figure 3 displays the effect of using coarser parametrizations: as the parametrization becomes coarser, the duality gap increases (as per Theorem 2).
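The vanishing duality gap showcased in Fig. 2 can be checked by hand on a stripped-down, two-action abstraction of the bridge scenario. The numbers below are made up for illustration: an "unsafe" action U with objective reward 8 and safety reward 0, a "safe" action S with objective reward 5 and safety reward 1, and the constraint E[safety] ≥ 0.99. Because randomized policies make both the objective and the constraint linear in the action probabilities (the same convexification argument that underlies the paper's zero duality gap result), the primal and dual optimal values coincide even though the underlying actions are discrete.

```python
import numpy as np

# Two-action abstraction of the bridge world (illustrative numbers only):
# action 0 = U (unsafe bridge), action 1 = S (safe bridge).
r0 = np.array([8.0, 5.0])   # objective rewards
r1 = np.array([0.0, 1.0])   # safety rewards
c = 0.99                    # constraint: E[safety] >= 0.99

# Primal: maximize p @ r0 over randomized policies p with p @ r1 >= c.
# Everything is linear in p, so the optimum puts exactly mass c on S.
p_star = np.array([1 - c, c])
P_star = p_star @ r0        # = 0.01 * 8 + 0.99 * 5 = 5.03

# Dual: d(lam) = max_p [p @ (r0 + lam * r1)] - lam * c
#             = max_a (r0[a] + lam * (r1[a] - c)),  lam >= 0.
# Evaluate on a fine grid and take the minimum (attained at lam* = 3).
lams = np.linspace(0.0, 10.0, 100001)
d = np.maximum(r0[0] + lams * (r1[0] - c), r0[1] + lams * (r1[1] - c))
D_star = d.min()

print(P_star, D_star)  # both approximately 5.03: zero duality gap
```

Restricting to deterministic policies instead would force P* = 5 (always take S) while the dual value stays 5.03, which is the finite-dimensional analogue of why coarse parametrizations that cannot represent the required randomization exhibit a nonzero gap.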
In particular, for very coarse parametrizations (such as the cyan case), the agent cannot learn a successful policy due to the poor covering properties of its parametrization, and as a result the problem has a large duality gap.

7 Discussion

Throughout this work we have developed a duality theory for constrained reinforcement learning problems. In particular, we have established that for policies belonging to a general class of distributions, the duality gap of these problems is null, and therefore solving the problem in the dual domain, which always yields a finite-dimensional convex problem, gives the same result as solving the original problem directly. Moreover, this establishes the equivalence between the constrained problem and the regularized problem (or manual selection of multipliers) in the sense that both problems track the same Pareto optimal front.
These theoretical implications, however, do not imply that it is always possible to solve the problem. To be able to solve the dual problem, one is required to evaluate the dual function, which might be intractable in several problems, for instance in cases where arbitrary policies are considered. To overcome this limitation, we have shown that for sufficiently rich parametrizations the zero duality gap result holds approximately. However, for the most part, the parametrizations considered in the literature are not necessarily universal approximators of distributions, since in general the output of the neural network reduces to the mean (and in some cases the variance) of a distribution.
Regardless of these limitations, the primal-dual algorithm considered here and those proposed in [7, 11] provide a manner to solve constrained policy optimization problems without the need to perform an exhaustive search over the weights assigned to each reward function, as is the case in [4, 19, 20].
Likewise, the need to impose constraints might arise directly from the algorithm design; this is for instance the case in Trust Region Policy Optimization [35], where a constraint on the divergence of the policy is included. Although our theorems do not guarantee that the zero duality gap result holds under these constraints, since they reduce to a projection onto a convex set, it would not be surprising if the result could be adapted.

References

[1] Richard S. Sutton and Andrew G. Barto, Reinforcement Learning: An Introduction, MIT Press, 2018.

[2] Vivek S. Borkar, "An actor-critic algorithm for constrained Markov decision processes," Systems & Control Letters, vol. 54, no. 3, pp. 207–213, 2005.

[3] Joshua Achiam, David Held, Aviv Tamar, and Pieter Abbeel, "Constrained policy optimization," in Proceedings of the 34th International Conference on Machine Learning. JMLR.org, 2017, pp. 22–31.

[4] Xue Bin Peng, Pieter Abbeel, Sergey Levine, and Michiel van de Panne, "DeepMimic: Example-guided deep reinforcement learning of physics-based character skills," ACM Transactions on Graphics (TOG), vol. 37, no. 4, pp. 143, 2018.

[5] Shie Mannor and Nahum Shimkin, "A geometric approach to multi-criterion reinforcement learning," Journal of Machine Learning Research, vol. 5, pp. 325–360, 2004.

[6] Kristof Van Moffaert and Ann Nowé, "Multi-objective reinforcement learning using sets of Pareto dominating policies," The Journal of Machine Learning Research, vol. 15, no.
1, pp. 3483–3512, 2014.

[7] Chen Tessler, Daniel J. Mankowitz, and Shie Mannor, "Reward constrained policy optimization," arXiv preprint arXiv:1805.11074, 2018.

[8] Jan Leike, Miljan Martic, Victoria Krakovna, Pedro A. Ortega, Tom Everitt, Andrew Lefrancq, Laurent Orseau, and Shane Legg, "AI safety gridworlds," arXiv preprint arXiv:1711.09883, 2017.

[9] Horia Mania, Aurelia Guy, and Benjamin Recht, "Simple random search provides a competitive approach to reinforcement learning," arXiv preprint arXiv:1803.07055, 2018.

[10] Tom Schaul, Diana Borsa, Joseph Modayil, and Razvan Pascanu, "Ray interference: a source of plateaus in deep reinforcement learning," arXiv preprint arXiv:1904.11455, 2019.

[11] Shalabh Bhatnagar and K. Lakshmanan, "An online actor–critic algorithm with function approximation for constrained Markov decision processes," Journal of Optimization Theory and Applications, vol. 153, no. 3, pp. 688–708, 2012.

[12] Kurt Hornik, Maxwell Stinchcombe, and Halbert White, "Multilayer feedforward networks are universal approximators," Neural Networks, vol. 2, no. 5, pp. 359–366, 1989.

[13] Eitan Altman, Constrained Markov Decision Processes, vol. 7, CRC Press, 1999.

[14] Iordanis Koutsopoulos and Leandros Tassiulas, "Control and optimization meet the smart power grid: Scheduling of power demands for optimal energy management," in Proceedings of the 2nd International Conference on Energy-efficient Computing and Networking. ACM, 2011, pp. 41–50.

[15] Chen Hou and Qianchuan Zhao, "Optimization of web service-based control system for balance between network traffic and delay," IEEE Transactions on Automation Science and Engineering, vol. 15, no. 3, pp.
1152–1162, 2018.

[16] Yinlam Chow, Aviv Tamar, Shie Mannor, and Marco Pavone, "Risk-sensitive and robust decision-making: a CVaR optimization approach," in Advances in Neural Information Processing Systems, 2015, pp. 1522–1530.

[17] Shixiang Gu, Ethan Holly, Timothy Lillicrap, and Sergey Levine, "Deep reinforcement learning for robotic manipulation with asynchronous off-policy updates," in 2017 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2017, pp. 3389–3396.

[18] Pavlo Krokhmal, Jonas Palmquist, and Stanislav Uryasev, "Portfolio optimization with conditional value-at-risk objective and constraints," Journal of Risk, vol. 4, pp. 43–68, 2002.

[19] Dotan Di Castro, Aviv Tamar, and Shie Mannor, "Policy gradients with variance related risk criteria," arXiv preprint arXiv:1206.6404, 2012.

[20] Aviv Tamar and Shie Mannor, "Variance adjusted actor critic algorithms," arXiv preprint arXiv:1310.3697, 2013.

[21] Gal Dalal, Krishnamurthy Dvijotham, Matej Vecerik, Todd Hester, Cosmin Paduraru, and Yuval Tassa, "Safe exploration in continuous action spaces," arXiv preprint arXiv:1801.08757, 2018.

[22] S. Boyd and L. Vandenberghe, Convex Optimization, Cambridge University Press, 2004.

[23] R. T. Rockafellar, Convex Analysis, Princeton University Press, 1970.

[24] Vivek S. Borkar, "A convex analytic approach to Markov decision processes," Probability Theory and Related Fields, vol. 78, no. 4, pp. 583–602, 1988.

[25] Ken-Ichi Funahashi, "On the approximate realization of continuous mappings by neural networks," Neural Networks, vol. 2, no. 3, pp. 183–192, 1989.

[26] George Cybenko, "Approximation by superpositions of a sigmoidal function," Mathematics of Control, Signals and Systems, vol. 2, no. 4, pp.
303–314, 1989.

[27] Zhou Lu, Hongming Pu, Feicheng Wang, Zhiqiang Hu, and Liwei Wang, "The expressive power of neural networks: A view from the width," in Advances in Neural Information Processing Systems, 2017, pp. 6231–6239.

[28] Hongzhou Lin and Stefanie Jegelka, "ResNet with one-neuron hidden layers is a universal approximator," in Advances in Neural Information Processing Systems, 2018, pp. 6169–6178.

[29] Jooyoung Park and Irwin W. Sandberg, "Universal approximation using radial-basis-function networks," Neural Computation, vol. 3, no. 2, pp. 246–257, 1991.

[30] Bharath Sriperumbudur, Kenji Fukumizu, and Gert Lanckriet, "On the relation between universality, characteristic kernels and RKHS embedding of measures," in Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, 2010, pp. 773–780.

[31] Dimitri P. Bertsekas, Convex Optimization Algorithms, Athena Scientific, Belmont, MA, 2015.

[32] Richard S. Sutton, David A. McAllester, Satinder P. Singh, and Yishay Mansour, "Policy gradient methods for reinforcement learning with function approximation," in Advances in Neural Information Processing Systems, 2000, pp. 1057–1063.

[33] Vijay R. Konda and John N. Tsitsiklis, "Actor-critic algorithms," in Advances in Neural Information Processing Systems, 2000, pp. 1008–1014.

[34] Kenneth J.
Arrow and Leonard Hurwicz, Studies in Linear and Nonlinear Programming, Stanford University Press, Stanford, CA, 1958.

[35] John Schulman, Sergey Levine, Pieter Abbeel, Michael Jordan, and Philipp Moritz, "Trust region policy optimization," in International Conference on Machine Learning, 2015, pp. 1889–1897.