{"title": "Reinforcement Learning under Model Mismatch", "book": "Advances in Neural Information Processing Systems", "page_first": 3043, "page_last": 3052, "abstract": "We study reinforcement learning under model misspecification, where we do not have access to the true environment but only to a reasonably close approximation to it. We address this problem by extending the framework of robust MDPs to the model-free Reinforcement Learning setting, where we do not have access to the model parameters, but can only sample states from it. We define robust versions of Q-learning, Sarsa, and TD-learning and prove convergence to an approximately optimal robust policy and approximate value function respectively. We scale up the robust algorithms to large MDPs via function approximation and prove convergence under two different settings. We prove convergence of robust approximate policy iteration and robust approximate value iteration for linear architectures (under mild assumptions). We also define a robust loss function, the mean squared robust projected Bellman error and give stochastic gradient descent algorithms that are guaranteed to converge to a local minimum.", "full_text": "Reinforcement Learning under Model Mismatch

Aurko Roy^1, Huan Xu^2, and Sebastian Pokutta^2
^1 Google*, Email: aurkor@google.com
^2 ISyE, Georgia Institute of Technology, Atlanta, GA, USA. Email: huan.xu@isye.gatech.edu
^2 ISyE, Georgia Institute of Technology, Atlanta, GA, USA. Email: sebastian.pokutta@isye.gatech.edu

Abstract

We study reinforcement learning under model misspecification, where we do not have access to the true environment but only to a reasonably close approximation to it. We address this problem by extending the framework of robust MDPs of [1, 15, 11] to the model-free Reinforcement Learning setting, where we do not have access to the model parameters, but can only sample states from it. 
We define robust versions of Q-learning, SARSA, and TD-learning and prove convergence to an approximately optimal robust policy and approximate value function respectively. We scale up the robust algorithms to large MDPs via function approximation and prove convergence under two different settings. We prove convergence of robust approximate policy iteration and robust approximate value iteration for linear architectures (under mild assumptions). We also define a robust loss function, the mean squared robust projected Bellman error, and give stochastic gradient descent algorithms that are guaranteed to converge to a local minimum.

1 Introduction

Reinforcement learning is concerned with learning a good policy for sequential decision making problems modeled as a Markov Decision Process (MDP), via interacting with the environment [20, 18]. In this work we address the problem of reinforcement learning from a misspecified model. As a motivating example, consider the scenario where the problem of interest is not directly accessible, but instead the agent can interact with a simulator whose dynamics is reasonably close to the true problem. Another plausible application is when the parameters of the model may evolve over time but can still be reasonably approximated by an MDP.

To address this problem we use the framework of robust MDPs, which was proposed by [1, 15, 11] to solve the planning problem under model misspecification. The robust MDP framework considers a class of models and finds the robust optimal policy, which is a policy that performs best under the worst model. 
It was shown by [1, 15, 11] that the robust optimal policy satisfies the robust Bellman equation, which naturally leads to exact dynamic programming algorithms to find an optimal policy. However, this approach is model dependent and does not immediately generalize to the model-free case where the parameters of the model are unknown.

Essentially, reinforcement learning is a model-free framework to solve the Bellman equation using samples. Therefore, to learn policies from misspecified models, we develop sample based methods to solve the robust Bellman equation. In particular, we develop robust versions of classical reinforcement learning algorithms such as Q-learning, SARSA, and TD-learning and prove convergence to an approximately optimal policy under mild assumptions on the discount factor. We also show that the nominal versions of these iterative algorithms converge to policies that may be arbitrarily worse compared to the optimal policy.

* Work done while at Georgia Tech

31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.

We also scale up these robust algorithms to large scale MDPs via function approximation, where we prove convergence under two different settings. Under a technical assumption similar to [5, 24] we show convergence of robust approximate policy iteration and value iteration algorithms for linear architectures. 
We also study function approximation with nonlinear architectures, by defining an appropriate mean squared robust projected Bellman error (MSRPBE) loss function, which is a generalization of the mean squared projected Bellman error (MSPBE) loss function of [22, 21, 6]. We propose robust versions of stochastic gradient descent algorithms as in [22, 21, 6] and prove convergence to a local minimum under some assumptions for function approximation with arbitrary smooth functions.

Contribution. In summary we make the following contributions:

1. We extend the robust MDP framework of [1, 15, 11] to the model-free reinforcement learning setting. We then define robust versions of Q-learning, SARSA, and TD-learning and prove convergence to an approximately optimal robust policy.

2. We also provide robust reinforcement learning algorithms for the function approximation case and prove convergence of robust approximate policy iteration and value iteration algorithms for linear architectures. We also define the MSRPBE loss function, which contains the robust optimal policy as a local minimum, and we derive stochastic gradient descent algorithms to minimize this loss function as well as establish convergence to a local minimum in the case of function approximation by arbitrary smooth functions.

3. Finally, we demonstrate empirically the improvement in performance of the robust algorithms compared to their nominal counterparts. For this we used various Reinforcement Learning test environments from OpenAI [9] as benchmarks to assess the improvement in performance as well as to ensure reproducibility and consistency of our results.

Related Work. Recently, several approaches have been proposed to address model performance under parameter uncertainty for Markov Decision Processes (MDPs). A Bayesian approach was proposed by [19] which requires perfect knowledge of the prior distribution on transition matrices. 
Other probabilistic and risk based settings were studied by [10, 25, 23], which propose various mechanisms to incorporate percentile risk into the model. A framework for robust MDPs was first proposed by [1, 15, 11], who consider the transition matrices to lie in some uncertainty set and proposed a dynamic programming algorithm to solve the robust MDP. Recent work by [24] extended the robust MDP framework to the function approximation setting, where under a technical assumption the authors prove convergence to an optimal policy for linear architectures. Note that these algorithms for robust MDPs do not readily generalize to the model-free reinforcement learning setting where the parameters of the environment are not explicitly known.

For reinforcement learning in the non-robust model-free setting, several iterative algorithms such as Q-learning, TD-learning, and SARSA are known to converge to an optimal policy under mild assumptions; see [4] for a survey. Robustness in reinforcement learning for MDPs was studied by [13], who introduced a robust learning framework for learning with disturbances. Similarly, [16] also studied learning in the presence of an adversary who might apply disturbances to the system. However, for the algorithms proposed in [13, 16] no theoretical guarantees are known and there is only limited empirical evidence. Another recent work on robust reinforcement learning is [12], where the authors propose an online algorithm with certain transitions being stochastic and the others being adversarial, and the devised algorithm ensures low regret.

For the case of reinforcement learning with large MDPs using function approximations, theoretical guarantees for most TD-learning based algorithms are only known for linear architectures [2]. 
Recent work by [6] extended the results of [22, 21] and proved that a stochastic gradient descent algorithm minimizing the mean squared projected Bellman error (MSPBE) loss function converges to a local minimum, even for nonlinear architectures. However, these algorithms do not apply to robust MDPs; in this work we extend these algorithms to the robust setting.

2 Preliminaries

We consider an infinite horizon Markov Decision Process (MDP) [18] with finite state space X of size n and finite action space A of size m. At every time step t the agent is in a state i ∈ X and can choose an action a ∈ A, incurring a cost $c_t(i, a)$. We will make the standard assumption that future cost is discounted, see e.g., [20], with a discount factor ϑ < 1 applied to future costs, i.e., $c_t(i, a) := \vartheta^t c(i, a)$, where c(i, a) is a fixed constant independent of the time step t for i ∈ X and a ∈ A. The states transition according to probability transition matrices $\tau := \{P_a\}_{a \in \mathcal{A}}$, which depend only on the last taken action a. A policy of the agent is a sequence $\pi = (a_0, a_1, \ldots)$, where every $a_t(i)$ corresponds to an action in A if the system is in state i at time t. For every policy π, we have a corresponding value function $v_\pi \in \mathbb{R}^n$, where $v_\pi(i)$ for a state i ∈ X measures the expected cost of that state if the agent were to follow policy π. This can be expressed by the recurrence relation

$v_\pi(i) := c(i, a_0(i)) + \vartheta\, \mathbb{E}_{j \sim P_{a_0(i)}(i, \cdot)}\left[ v_\pi(j) \right].$   (1)

The goal is to devise algorithms to learn an optimal policy π* that minimizes the expected total cost.

Definition 2.1 (Optimal policy). Given an MDP with state space X, action space A and transition matrices $P_a$, let Π be the strategy space of all possible policies. 
Then an optimal policy π* is one that minimizes the expected total cost, i.e.,

$\pi^* := \arg\min_{\pi \in \Pi} \mathbb{E}\left[ \sum_{t=0}^{\infty} \vartheta^t c(i_t, a_t(i_t)) \right].$   (2)

In the robust case we will assume, as in [15, 11], that the transition matrices $P_a$ are not fixed and may come from some uncertainty region $\mathcal{P}^a$ and may be chosen adversarially by nature in future runs of the model. In this setting, [15, 11] prove the following robust analogue of the Bellman recursion. A policy of nature is a sequence $\tau := (P_0, P_1, \ldots)$ where every $P_t(a) \in \mathcal{P}^a$ corresponds to a transition probability matrix chosen from $\mathcal{P}^a$. Let T denote the set of all such policies of nature. In other words, a policy τ ∈ T of nature is a sequence of transition matrices that may be played by it in response to the actions of the agent. For any set $P \subseteq \mathbb{R}^n$ and vector $v \in \mathbb{R}^n$, let $\sigma_P(v) := \sup\{ p^\top v \mid p \in P \}$ be the support function of the set P. For a state i ∈ X, let $\mathcal{P}^a_i$ be the projection onto the ith row of $\mathcal{P}^a$.

Theorem 2.2 ([15]). We have the following perfect duality relation

$\min_{\pi \in \Pi} \max_{\tau \in T} \mathbb{E}_\tau\left[ \sum_{t=0}^{\infty} \vartheta^t c(i_t, a_t(i_t)) \right] = \max_{\tau \in T} \min_{\pi \in \Pi} \mathbb{E}_\tau\left[ \sum_{t=0}^{\infty} \vartheta^t c(i_t, a_t(i_t)) \right].$   (3)

The optimal value function $v_{\pi^*}$ corresponding to the optimal policy π* satisfies

$v_{\pi^*}(i) = \min_{a \in \mathcal{A}} \left\{ c(i, a) + \vartheta\, \sigma_{\mathcal{P}^a_i}(v_{\pi^*}) \right\},$   (4)

and π* can then be obtained in a greedy fashion, i.e.,

$a^*(i) \in \arg\min_{a \in \mathcal{A}} \left\{ c(i, a) + \vartheta\, \sigma_{\mathcal{P}^a_i}(v_{\pi^*}) \right\}.$   (5)
On the other hand it is often easy to have a con\ufb01dence region Ua\n\nThe main shortcoming of this approach is that it does not generalize to the model free case where the\ntransition probabilities are not explicitly known but rather the agent can only sample states according\nto these probabilities. In the absence of this knowledge, we cannot compute the support functions of\nthe uncertainty sets P a\ni , e.g., a ball\nor an ellipsoid, corresponding to every state-action pair i \u2208 X , a \u2208 A that quanti\ufb01es our uncertainty\nin the simulation, with the uncertainty set P a\ni centered around the\nunknown simulator probabilities. Formally, we de\ufb01ne the uncertainty sets corresponding to every\nstate action pair in the following fashion.\nDe\ufb01nition 2.3 (Uncertainty sets). Corresponding to every state-action pair (i, a) we have a con\ufb01dence\ni of the probability transition matrix corresponding to\nregion Ua\n(i, a) is de\ufb01ned as\n\ni so that the uncertainty region P a\n\ni being the con\ufb01dence region Ua\n\n(6)\ni is the unknown state transition probability vector from the state i \u2208 X to every other state\n\nwhere pa\nin X given action a during the simulation.\n\nP a\ni\n\n:= {x + pa\n\ni | x \u2208 Ua\ni } ,\n\n3\n\n(cid:34) \u221e\n\n\u2211\nt=0\n\n(cid:35)\n(cid:16)\n\n(cid:110)\n\n(cid:35)\n\n\u2211\nt=0\n\n(cid:34) \u221e\n(cid:17)\n(cid:111)\n\n,\n\n.\n\n\f:= (cid:8)x | x(cid:62)Aa\n:=(cid:8)x + pa\n\ni x \u2264 1, \u2211i\u2208X xi = 0(cid:9) for some\n\n(cid:9) , where pa\n\nAs a simple example, we have the ellipsoid Ua\ni\n\ni with the uncertainty set P a\n\nn \u00d7 n psd matrix Aa\ni is the\nunknown simulator state transition probability vector with which the agent transitioned to a new state\nduring training. 
Note that while it may be easy to come up with good descriptions of the confidence region $U^a_i$, the approach of [15, 11] breaks down since we have no knowledge of $p^a_i$ and merely observe the new state j sampled from this distribution.

In the following sections we develop robust versions of Q-learning, SARSA, and TD-learning which are guaranteed to converge to an approximately optimal policy that is robust with respect to this confidence region. The robust versions of these iterative algorithms involve an additional linear optimization step over the set $U^a_i$, which in the case of $U^a_i = \{ \|x\|_2 \le r \}$ simply corresponds to adding fixed noise during every update. In later sections we will extend it to the function approximation case where we study linear architectures as well as nonlinear architectures; in the latter case we derive new stochastic gradient descent algorithms for computing approximately robust policies.

3 Robust exact dynamic programming algorithms

In this section we develop robust versions of exact dynamic programming algorithms such as Q-learning, SARSA, and TD-learning. These methods are suitable for small MDPs where the size n of the state space is not too large. Note that the confidence region $U^a_i$ must also be constrained so that the uncertainty set $\mathcal{P}^a_i$ lies within the probability simplex $\Delta_n$. However, since we do not have knowledge of the simulator probabilities $p^a_i$, we do not know how far away $p^a_i$ is from the boundary of $\Delta_n$, and so the algorithms will make use of a proxy confidence region $\widehat{U}^a_i$, where we drop the requirement that $\widehat{\mathcal{P}}^a_i \subseteq \Delta_n$, to compute the robust optimal policies. With a suitable choice of step lengths and discount factors we can prove convergence to an approximately optimal $U^a_i$-robust policy, where the approximation depends on the difference between the unconstrained proxy region $\widehat{U}^a_i$ and the true confidence region $U^a_i$. Below we give specific examples of possible choices for simple confidence regions.

Ellipsoid: Let $\{A^a_i\}_{i,a}$ be a sequence of n × n psd matrices. 
Then we can define the confidence region as

$U^a_i := \left\{ x \;\middle|\; x^\top A^a_i x \le 1,\ \sum_{j \in \mathcal{X}} x_j = 0,\ -p^a_{ij} \le x_j \le 1 - p^a_{ij}\ \forall j \in \mathcal{X} \right\}.$   (7)

Note that $U^a_i$ has some additional linear constraints so that the uncertainty set $\mathcal{P}^a_i := \{ p^a_i + x \mid x \in U^a_i \}$ lies inside $\Delta_n$. Since we do not know $p^a_i$, we will make use of the proxy confidence region $\widehat{U}^a_i := \{ x \mid x^\top A^a_i x \le 1,\ \sum_{j \in \mathcal{X}} x_j = 0 \}$. In particular, when $A^a_i = r^{-2} I_n$ for every i ∈ X, a ∈ A, this corresponds to a spherical confidence interval of [−r, r] in every direction; in other words, each uncertainty set $\mathcal{P}^a_i$ is an ℓ2 ball of radius r.

Parallelepiped: Let $\{B^a_i\}_{i,a}$ be a sequence of n × n invertible matrices. Then we can define the confidence region as
$U^a_i := \left\{ x \;\middle|\; \|B^a_i x\|_1 \le 1,\ \sum_{j \in \mathcal{X}} x_j = 0,\ -p^a_{ij} \le x_j \le 1 - p^a_{ij}\ \forall j \in \mathcal{X} \right\}.$   (8)

As before, we will use the unconstrained parallelepiped $\widehat{U}^a_i$, without the $-p^a_{ij} \le x_j \le 1 - p^a_{ij}$ constraints, as a proxy for $U^a_i$, since we do not have knowledge of $p^a_i$. In particular, if $B^a_i = D$ for a diagonal matrix D, then the proxy confidence region $\widehat{U}^a_i$ corresponds to a rectangle. In particular, if every diagonal entry is $r^{-1}$, then every uncertainty set $\mathcal{P}^a_i$ is an ℓ1 ball of radius r.

3.1 Robust Q-learning

Let us recall the notion of a Q-factor of a state-action pair (i, a) and a policy π, which in the non-robust setting is defined as

$Q(i, a) := c(i, a) + \vartheta\, \mathbb{E}_{j \sim p^a_i}\left[ v(j) \right],$   (9)

where v is the value function of the policy π. In other words, the Q-factor represents the expected cost if we start at state i, use the action a and follow the policy π subsequently. One may similarly define the robust Q-factors using a similar interpretation and the minimax characterization of Theorem 2.2. Let Q* denote the Q-factors of the optimal robust policy and let $v^* \in \mathbb{R}^n$ be its value function. Note that we may write the value function in terms of the Q-factors as $v^*(i) = \min_{a \in \mathcal{A}} Q^*(i, a)$. From Theorem 2.2 we have the following expression for Q*:

$Q^*(i, a) = c(i, a) + \vartheta\, \sigma_{\mathcal{P}^a_i}(v^*)$   (10)
$\qquad\quad\;\, = c(i, a) + \vartheta\, \sigma_{U^a_i}(v^*) + \vartheta \sum_{j \in \mathcal{X}} p^a_{ij} \min_{a' \in \mathcal{A}} Q^*(j, a'),$   (11)

where equation (11) follows from Definition 2.3. For an estimate $Q_t$ of Q*, let $v_t \in \mathbb{R}^n$ be its value vector, i.e., $v_t(i) := \min_{a \in \mathcal{A}} Q_t(i, a)$. The robust Q-iteration is defined as

$Q_t(i, a) := (1 - \gamma_t)\, Q_{t-1}(i, a) + \gamma_t \left( c(i, a) + \vartheta\, \sigma_{\widehat{U}^a_i}(v_{t-1}) + \vartheta \min_{a' \in \mathcal{A}} Q_{t-1}(j, a') \right),$   (12)

where a state j ∈ X is sampled with the unknown transition probability $p^a_{ij}$ using the simulator. 
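The support-function term in the iteration above has a closed form for the simple proxy regions described earlier: over $\{x : \|x\|_2 \le r, \sum_j x_j = 0\}$ the maximizer of $x^\top v$ is the centered vector $v - \bar{v}$ rescaled to norm r, and over $\{x : \|x\|_1 \le r, \sum_j x_j = 0\}$ the extreme points place mass r/2 on the largest coordinate of v and −r/2 on the smallest. The following is a minimal sketch (our illustration, not the paper's implementation; all function names and toy parameters are invented) of these closed forms and of one robust Q-update in the style of (12):

```python
import numpy as np

def support_l2_proxy(v, r):
    """Support function of {x : ||x||_2 <= r, sum_j x_j = 0} at v.
    Maximizing x.v over the sum-zero slice of the ball gives
    r * ||v - mean(v)||_2."""
    return r * float(np.linalg.norm(v - v.mean()))

def support_l1_proxy(v, r):
    """Support function of {x : ||x||_1 <= r, sum_j x_j = 0} at v.
    The extreme points are (r/2)(e_i - e_j), so the value is
    (r/2) * (max(v) - min(v))."""
    return 0.5 * r * (float(v.max()) - float(v.min()))

def robust_q_update(Q, i, a, j, cost, step, discount, r):
    """One robust Q-iteration step in the style of (12), with the
    l2-ball proxy; j is the next state sampled from the (unknown)
    simulator probabilities."""
    v = Q.min(axis=1)                      # value vector v_t
    sigma = support_l2_proxy(v, r)         # support-function term
    target = cost + discount * (sigma + Q[j].min())
    Q[i, a] = (1.0 - step) * Q[i, a] + step * target
    return Q
```

Setting r = 0 recovers the nominal Q-learning update, which makes the role of the support-function term as an adversarial cost bonus explicit.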
Note that the robust Q-iteration of equation (12) involves an additional linear optimization step to compute the support function $\sigma_{\widehat{U}^a_i}(v_t)$ of $v_t$ over the proxy confidence region $\widehat{U}^a_i$. We will prove that iterating equation (12) converges to an approximately optimal policy. The following definition introduces the notion of an ε-optimal policy, see e.g., [4]. The error factor ε is also referred to as the amplification factor. We will treat the Q-factors as a |X| × |A| matrix in the definition so that its ℓ∞ norm is defined as usual.

Definition 3.1 (ε-optimal policy). A policy π with Q-factors Q′ is ε-optimal with respect to the optimal policy π* with corresponding Q-factors Q* if $\|Q' - Q^*\|_\infty \le \varepsilon\, \|Q^*\|_\infty$.

The following simple lemma allows us to decompose the optimization of a linear function over the proxy uncertainty set $\widehat{\mathcal{P}}^a_i$ in terms of linear optimization over $\mathcal{P}^a_i$.

Lemma 3.2. Let $v \in \mathbb{R}^n$ be any vector and let $\beta^a_i := \max_{y \in \widehat{U}^a_i} \min_{x \in U^a_i} \|y - x\|_1$. Then we have $\sigma_{\widehat{\mathcal{P}}^a_i}(v) \le \sigma_{\mathcal{P}^a_i}(v) + \beta^a_i\, \|v\|_\infty$.

The following theorem proves that under a suitable choice of step lengths $\gamma_t$ and discount factor ϑ, the iteration of equation (12) converges to an ε-approximately optimal policy with respect to the confidence regions $U^a_i$.

Theorem 3.3. Let the step lengths $\gamma_t$ of the Q-iteration algorithm be chosen such that $\sum_{t=0}^{\infty} \gamma_t = \infty$ and $\sum_{t=0}^{\infty} \gamma_t^2 < \infty$, and let the discount factor ϑ < 1. Let $\beta^a_i$ be as in Lemma 3.2 and let $\beta := \max_{i \in \mathcal{X}, a \in \mathcal{A}} \beta^a_i$. If ϑ(1 + β) < 1, then with probability 1 the iteration of equation (12) converges to an ε-optimal policy, where $\varepsilon := \frac{\vartheta\beta}{1 - \vartheta(1 + \beta)}$.

Remark 3.4. If β = 0 then note that by Theorem 3.3, the robust Q-iterations converge to the exact optimal Q-factors since ε = 0. Since $\beta^a_i := \max_{y \in \widehat{U}^a_i} \min_{x \in U^a_i} \|y - x\|_1$, it follows that β = 0 iff $\widehat{U}^a_i = U^a_i$ for every i ∈ X, a ∈ A. 
This happens when the confidence region is small enough so that the simplex constraints $-p^a_{ij} \le x_j \le 1 - p^a_{ij}\ \forall j \in \mathcal{X}$ in the description of $\mathcal{P}^a_i$ become redundant for every i ∈ X, a ∈ A. Equivalently, every $p^a_i$ is “far” from the boundary of the simplex $\Delta_n$ compared to the size of the confidence region $U^a_i$.

Remark 3.5. Note that simply using the nominal Q-iteration without the $\sigma_{\widehat{U}^a_i}(v)$ term does not guarantee convergence to Q*. Indeed, the nominal Q-iterations converge to Q-factors Q′ where $\|Q' - Q^*\|_\infty$ may be arbitrarily large. This follows easily from observing that

$|Q'(i, a) - Q^*(i, a)| = \left| \sigma_{\widehat{U}^a_i}(v^*) \right|,$   (13)

where v* is the value function of Q*, and so

$\|Q' - Q^*\|_\infty = \max_{i \in \mathcal{X}, a \in \mathcal{A}} \left| \sigma_{\widehat{U}^a_i}(v^*) \right|,$   (14)

which can be as high as $\|v^*\|_\infty = \|Q^*\|_\infty$.

3.2 Robust TD-Learning

Let (i_0, i_1, . . . 
) be a trajectory of the agent, where $i_m$ denotes the state of the agent at time step m. The main idea behind the TD(λ)-learning method is to estimate the value function $v_\pi$ of a policy π using the temporal difference errors $d_m$ defined as

$d_m := c(i_m, \pi(i_m)) + \vartheta\, v_t(i_{m+1}) - v_t(i_m).$   (15)

For a parameter λ ∈ (0, 1), the TD-learning iteration is defined in terms of the temporal difference errors as

$v_{t+1}(i_k) := v_t(i_k) + \gamma_t \sum_{m=k}^{\infty} (\vartheta\lambda)^{m-k}\, d_m.$   (16)

In the robust setting, we have a confidence region $U^a_i$ with proxy $\widehat{U}^a_i$ for every temporal difference error, which leads us to define the robust temporal difference errors as

$\widetilde{d}_m := d_m + \vartheta\, \sigma_{\widehat{U}^{\pi(i_m)}_{i_m}}(v_t),$   (17)

where $d_m$ is the non-robust temporal difference error. The robust TD-update is the usual TD-update, with the robust temporal difference errors $\widetilde{d}_m$ replacing the usual temporal difference errors $d_m$. We define an ε-suboptimal value function for a fixed policy π similar to Definition 3.1.

Definition 3.6 (ε-approximate value function). Given a policy π, we say that a vector $v' \in \mathbb{R}^n$ is an ε-approximation of $v_\pi$ if $\|v' - v_\pi\|_\infty \le \varepsilon\, \|v_\pi\|_\infty$.

The following theorem guarantees convergence of the robust TD-iteration to an approximate value function for π. We refer the reader to the supplementary material for a proof.

Theorem 3.7. Let $\beta^a_i$ be as in Lemma 3.2, let $\beta := \max_{i \in \mathcal{X}, a \in \mathcal{A}} \beta^a_i$, and let $\rho := \frac{\vartheta\lambda}{1 - \vartheta\lambda}$. 
If ϑ(1 + ρβ) < 1, then the robust TD-iteration converges to an ε-approximate value function, where $\varepsilon := \frac{\vartheta\beta}{1 - \vartheta(1 + \rho\beta)}$. In particular, if $\beta^a_i = \beta = 0$, i.e., the proxy confidence region $\widehat{U}^a_i$ is the same as the true confidence region $U^a_i$, then the convergence is exact, i.e., ε = 0.

4 Robust Reinforcement Learning with function approximation

In Section 3 we derived robust versions of exact dynamic programming algorithms such as Q-learning, SARSA, and TD-learning respectively. If the state space X of the MDP is large then it is prohibitive to maintain a lookup table entry for every state. A standard approach for large scale MDPs is to use the approximate dynamic programming (ADP) framework [17]. In this setting, the problem is parametrized by a smaller dimensional vector $\theta \in \mathbb{R}^d$ where d ≪ n = |X|.

The natural generalizations of the Q-learning, SARSA, and TD-learning algorithms of Section 3 are via the projected Bellman equation, where we project back to the space spanned by all the parameters in $\theta \in \mathbb{R}^d$, since they are the value functions representable by the model. Convergence for these algorithms even in the non-robust setting is known only for linear architectures, see e.g., [2]. Recent work by [6] proposed stochastic gradient descent algorithms with convergence guarantees for smooth nonlinear function architectures, where the problem is framed in terms of minimizing a loss function. We give robust versions of both these approaches.

4.1 Robust approximations with linear architectures

In the approximate setting with linear architectures, we approximate the value function $v_\pi$ of a policy π by $\Phi\theta$, where $\theta \in \mathbb{R}^d$ and Φ is an n × d feature matrix with rows φ(j) for every state j ∈ X representing its feature vector. 
Let S be the span of the columns of Φ, i.e., $S := \{ \Phi\theta \mid \theta \in \mathbb{R}^d \}$. Define the operator $T_\pi : \mathbb{R}^n \to \mathbb{R}^n$ as $(T_\pi v)(i) := c(i, \pi(i)) + \vartheta \sum_{j \in \mathcal{X}} p^{\pi(i)}_{ij} v(j)$, so that the true value function $v_\pi$ satisfies $T_\pi v_\pi = v_\pi$. A natural approach towards estimating $v_\pi$ given a current estimate $\Phi\theta_t$ is to compute $T_\pi(\Phi\theta_t)$ and project it back to S to get the next parameter $\theta_{t+1}$. The motivation behind such an iteration is the fact that the true value function is a fixed point of this operation if it belonged to the subspace S. This gives rise to the projected Bellman equation, where the projection Π is typically taken with respect to a weighted Euclidean norm $\|\cdot\|_\xi$, i.e., $\|x\|^2_\xi = \sum_{i \in \mathcal{X}} \xi_i x_i^2$, where ξ is some probability distribution over the states X.

In the model-free case, where we do not have explicit knowledge of the transition probabilities, various methods like LSTD(λ), LSPE(λ), and TD(λ) have been proposed [3, 8, 7, 14, 22, 21]. The key idea behind proving convergence for these methods is to show that the mapping $\Pi T_\pi$ is a contraction mapping with respect to $\|\cdot\|_\xi$ for some distribution ξ over the states X. While the operator $T_\pi$ in the non-robust case is linear and is a contraction in the ℓ∞ norm as in Section 3, the projection operator with respect to such norms is not guaranteed to be a contraction. However, it is known that if ξ is the steady state distribution of the policy π under evaluation, then Π is non-expansive in $\|\cdot\|_\xi$ [4, 2].

In the robust setting, we have the same methods but with the robust Bellman operators $T_\pi$ defined as $(T_\pi v)(i) := c(i, \pi(i)) + \vartheta\, \sigma_{\mathcal{P}^{\pi(i)}_i}(v)$. 
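To make the robust operator concrete for a linear architecture, one can write a TD(0)-style sample update whose temporal difference carries the support-function term of the proxy region. The sketch below is our own illustration under stated assumptions (ℓ2-ball proxy, full feature matrix available so the support function over the value vector Φθ can be evaluated, which is only feasible for small n); it is not one of the LSTD(λ)/LSPE(λ)/TD(λ) iterations analyzed here:

```python
import numpy as np

def support_l2_proxy(v, r):
    # support function of {x : ||x||_2 <= r, sum_j x_j = 0} at v
    return r * float(np.linalg.norm(v - v.mean()))

def robust_td0_step(theta, Phi, i, i_next, cost, step, discount, r):
    """One robust TD(0)-style update for a linear architecture v = Phi @ theta.
    The robust temporal difference adds the support-function term of the
    proxy region, cf. (17); i_next is the state sampled from the simulator."""
    v = Phi @ theta
    d_robust = cost + discount * (support_l2_proxy(v, r) + v[i_next]) - v[i]
    return theta + step * d_robust * Phi[i]
```

With r = 0 this is exactly the nominal linear TD(0) update, so the robust variant differs only by the extra linear-optimization term.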
Since we do not have access to the simulator probabilities $p^a_i$, we will use a proxy set $\widehat{\mathcal{P}}^a_i$ as in Section 3, with the proxy operator denoted by $\widehat{T}_\pi$. While the iterative methods of the non-robust setting generalize via the robust operator $T_\pi$ and the robust projected Bellman equation $\Phi\theta = \Pi T_\pi(\Phi\theta)$, it is however not clear how to choose the distribution ξ under which the projected operator $\Pi T_\pi$ is a contraction in order to show convergence. Let ξ be the steady state distribution of the exploration policy $\widehat{\pi}$ of the MDP with transition probability matrix $P_{\widehat{\pi}}$. We make the following assumption on the discount factor ϑ, as in [24].

Assumption 4.1. For every state i ∈ X and action a ∈ A, there exists a constant α ∈ (0, 1) such that for any $p \in \mathcal{P}^a_i$ we have $\vartheta\, p_j \le \alpha\, P_{\widehat{\pi}}(i, j)$ for every j ∈ X.

Assumption 4.1 might appear artificially restrictive; however, it is necessary to prove that $\Pi T_\pi$ is a contraction. While [24] require this assumption for proving convergence of robust MDPs, a similar assumption is also required in proving convergence of the off-policy Reinforcement Learning methods of [5], where the states are sampled from an exploration policy $\widehat{\pi}$ which is not necessarily the same as the policy π under evaluation. Note that in the robust setting, all methods are necessarily off-policy since the transition matrices are not fixed for a given policy.

The following lemma is a ξ-weighted Euclidean norm version of Lemma 3.2.

Lemma 4.2. Let $v \in \mathbb{R}^n$ be any vector and let $\beta^a_i := \frac{1}{\xi_{\min}} \max_{y \in \widehat{U}^a_i} \min_{x \in U^a_i} \|y - x\|_\xi$, where $\xi_{\min} := \min_{i \in \mathcal{X}} \xi_i$. Then we have $\sigma_{\widehat{\mathcal{P}}^a_i}(v) \le \sigma_{\mathcal{P}^a_i}(v) + \beta^a_i\, \|v\|_\xi$.

The following theorem shows that the robust projected Bellman equation is a contraction under
some assumptions on the discount factor ϑ.

Theorem 4.3. Let $\beta^a_i$ be as in Lemma 4.2 and let $\beta := \max_{i \in \mathcal{X}} \beta^{\pi(i)}_i$. If the discount factor ϑ satisfies Assumption 4.1 and $\alpha^2 + \vartheta^2\beta^2 < \frac{1}{2}$, then the operator $\widehat{T}_\pi$ is a contraction with respect to $\|\cdot\|_\xi$. In other words, for any two $\theta, \theta' \in \mathbb{R}^d$ we have

$\left\| \widehat{T}_\pi(\Phi\theta) - \widehat{T}_\pi(\Phi\theta') \right\|^2_\xi \le 2\left( \alpha^2 + \vartheta^2\beta^2 \right) \left\| \Phi\theta - \Phi\theta' \right\|^2_\xi < \left\| \Phi\theta - \Phi\theta' \right\|^2_\xi.$   (18)

If $\beta_i = \beta = 0$, so that $\widehat{U}^{\pi(i)}_i = U^{\pi(i)}_i$, then we have a simpler contraction under the assumption that α < 1.

The following corollary shows that the solution to the proxy projected Bellman equation converges to a solution that is not too far away from the true value function $v_\pi$.

Corollary 4.4. Let Assumption 4.1 hold and let β be as in Theorem 4.3. Let $\widetilde{v}_\pi$ be the fixed point of the projected Bellman equation for the proxy operator $\widehat{T}_\pi$, i.e., $\Pi \widehat{T}_\pi \widetilde{v}_\pi = \widetilde{v}_\pi$. Let $\widehat{v}_\pi$ be the fixed point of the proxy operator $\widehat{T}_\pi$, i.e., $\widehat{T}_\pi \widehat{v}_\pi = \widehat{v}_\pi$. Let $v_\pi$ be the true value function of the policy π, i.e., $T_\pi v_\pi = v_\pi$. 
Then it follows that\n\n1 \u2212(cid:112)2 (\u03b12 + \u03d12\u03b22)\n(cid:107)(cid:101)v\u03c0 \u2212 v\u03c0(cid:107)\u03be \u2264 \u03d1\u03b2 (cid:107)v\u03c0(cid:107)\u03be + (cid:107)\u03a0v\u03c0 \u2212 v\u03c0(cid:107)\u03be\n\n.\n\n(19)\n\n7\n\n\fIn particular if \u03b2i = \u03b2 = 0 i.e., the proxy con\ufb01dence region is actually the true con\ufb01dence region,\n\nthen the proxy projected Bellman equation has a solution satisfying (cid:107)(cid:101)v\u03c0 \u2212 v\u03c0(cid:107)\u03be \u2264 (cid:107)\u03a0v\u03c0\u2212v\u03c0(cid:107)\u03be\n\n1\u2212\u03b1\n\n.\n\n(cid:110)\n\nv\u03b8 | \u03b8 \u2208 Rd(cid:111)\n\nTheorem 4.3 guarantees that the robust projected Bellman iterations of LSTD(\u03bb), LSPE(\u03bb) and\nTD(\u03bb)-methods converge, while Corollary 4.4 guarantees that the solution it coverges to is not too\nfar away from the true value function v\u03c0.\n4.2 Robust approximations with nonlinear architectures\nIn this section we consider the situation where the function approximator v\u03b8 is a smooth but not\nnecessarily linear function of \u03b8. This section generalizes the results of [6] to the robust setting with\ncon\ufb01dence regions. We de\ufb01ne robust analogues of the nonlinear GTD2 and nonlinear TDC algorithms\nrespectively.\nLet M :=\nbe the manifold spanned by all possible value functions representable\nby our model and let PM\u03b8 be the tangent plane of M at \u03b8. Let TM\u03b8 be the tangent space, i.e.,\nthe translation of PM\u03b8 to the origin. In other words, TM\u03b8 :=\n\u03b8 is an\nn \u00d7 d matrix with entries \u03a6\nv\u03b8(i). In the nonlinear case, we project on to the tangent\nspace TM\u03b8, since projections on to M is computationally hard. We denote this projection by\n\u03b8 and it is also with respect to a weighted Euclidean norm (cid:107)\u00b7(cid:107)\u03be. 
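The ξ-weighted projection onto a linear space such as TM_θ is an ordinary weighted least-squares fit, so it has a closed form. The following is a minimal sketch, not the paper's implementation; the names Phi (standing in for Φ_θ) and xi (the state weights) are illustrative:

```python
import numpy as np

def xi_projection(Phi, xi, v):
    """xi-weighted Euclidean projection of v onto span(Phi).

    Solves min_u || Phi @ u - v ||_xi (weighted least squares, with
    D = diag(xi)) and returns the projected vector Phi @ u*.
    """
    D = np.diag(xi)
    u = np.linalg.solve(Phi.T @ D @ Phi, Phi.T @ D @ v)
    return Phi @ u

rng = np.random.default_rng(0)
Phi = rng.standard_normal((6, 2))   # n = 6 states, d = 2 parameters
xi = rng.random(6) + 0.1            # strictly positive state weights

# A vector already in the tangent space is left unchanged.
v_in = Phi @ np.array([1.0, -2.0])
assert np.allclose(xi_projection(Phi, xi, v_in), v_in)

# Projecting twice equals projecting once (idempotence).
v = rng.standard_normal(6)
p = xi_projection(Phi, xi, v)
assert np.allclose(xi_projection(Phi, xi, p), p)
```

The two assertions check the defining properties of a projection; full column rank of Phi and positive weights make the normal-equations matrix invertible.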
The mean squared projected Bellman equation (MSPBE) loss function was proposed by [6] and is an extension of [22, 21]: MSPBE(θ) = ||v_θ − Π_θ T_π v_θ||²_ξ, where we now project onto the tangent space TM_θ. Since the number n of states is prohibitively large, we want stochastic gradient algorithms that run in time polynomial in d. Therefore, we assume that the confidence region of every state-action pair is the same: U^a_i = U and Û^a_i = Û. The robust version of the MSPBE loss function, the mean squared robust projected Bellman equation (MSRPBE) loss, can then be defined in terms of the robust Bellman operator with the proxy confidence region Û and proxy uncertainty set P̂^{π(i)}_i as

MSRPBE(θ) = ||v_θ − Π_θ T̂_π v_θ||²_ξ.   (20)

In order to derive stochastic gradient descent algorithms for minimizing the MSRPBE loss function, we need to take the gradient of σ_P(v_θ) for a convex set P. The gradient μ of σ is given by

μ_P(θ) := ∇ max_{y ∈ P} y⊤v_θ = Φ⊤_θ argmax_{y ∈ P} y⊤v_θ,   (21)

where Φ_θ(i) := ∇v_θ(i). Let us denote Φ_θ(i) simply by φ and Φ_θ(i′) by φ′, where i′ is the next sampled state. Let us denote by Û the proxy confidence region Û^{π(i)}_i of state i and the policy π under evaluation. Let

h(θ, u) := −E[(d̃ − φ⊤u) ∇²v_θ(i) u],   (22)

where d̃ is the robust temporal difference error.
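For a concrete instance of the maximizer in (21), take the illustrative case where the convex set P is an ℓ2 ball of radius r around a nominal distribution p0 (the simplex constraint is ignored here, and the function names are hypothetical; the paper's confidence regions are general convex sets). A linear function over a ball is maximized at the boundary point in the direction of its gradient, so σ_P(v) = p0⊤v + r ||v||_2:

```python
import numpy as np

def support_argmax_ball(v, p0, r):
    """argmax_{y in P} y^T v for P = {p0 + z : ||z||_2 <= r}.

    The maximizer is the boundary point of the ball in the
    direction of v, giving sigma_P(v) = p0 @ v + r * ||v||_2.
    """
    nv = np.linalg.norm(v)
    return p0.copy() if nv == 0.0 else p0 + (r / nv) * v

def mu_P(Phi_theta, v_theta, p0, r):
    """mu_P(theta) = Phi_theta^T argmax_{y in P} y^T v_theta, cf. (21)."""
    return Phi_theta.T @ support_argmax_ball(v_theta, p0, r)

# The value at the maximizer matches the closed-form support function.
v = np.array([1.0, -2.0, 0.5])
p0 = np.array([0.2, 0.5, 0.3])
r = 0.1
y_star = support_argmax_ball(v, p0, r)
assert np.isclose(y_star @ v, p0 @ v + r * np.linalg.norm(v))
```

For richer sets (polytopes, ellipsoids), only the inner argmax changes; the outer product with Φ⊤_θ in mu_P stays the same.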
As in [6], we may express ∇MSRPBE(θ) in terms of h(θ, w), where w = E[φφ⊤]⁻¹ E[d̃ φ]. We refer the reader to the supplementary material for the details. This leads us to the following robust analogues of nonlinear GTD2 and nonlinear TDC, where we update the estimators w_k of w as w_{k+1} := w_k + β_k (d̃_k − φ⊤_k w_k) φ_k, with the parameters θ_k being updated on a slower timescale as

θ_{k+1} := Γ( θ_k + α_k { (φ_k − ϑ μ_Û(θ)) (φ⊤_k w_k) − h_k } )   (robust-nonlinear-GTD2),   (23)

θ_{k+1} := Γ( θ_k + α_k { d̃_k φ_k − ϑ μ_Û(θ) (φ⊤_k w_k) − h_k } )   (robust-nonlinear-TDC),   (24)

where h_k := (d̃_k − φ⊤_k w_k) ∇²v_{θ_k}(i_k) w_k and Γ is a projection into an appropriately chosen compact set C with a smooth boundary, as in [6]. Under the assumption of Lipschitz continuous gradients and suitable assumptions on the step lengths α_k and β_k and the confidence region Û, the updates of equations (23) and (24) converge with probability 1 to a local optimum of MSRPBE(θ). See the supplementary material for the exact statement and proof of convergence. Note that in general computing μ_Û(θ) would take time polynomial in n, but it can be done in O(d²) time using a rank-d approximation to Û.

5 Experiments

We implemented robust versions of Q-learning and SARSA as in Section 3 and evaluated their performance against the nominal algorithms using the OpenAI Gym framework [9].
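The perturbation used in the evaluation, jumping to a uniformly random state with small probability p after every action, can be described at the level of transition matrices as a mixture of each nominal transition row with the uniform distribution. A minimal numpy sketch (the function name is hypothetical; the experiments themselves run against the Gym environments directly):

```python
import numpy as np

def perturb_transitions(P, p):
    """Perturbed dynamics: with probability p the next state is drawn
    uniformly at random, so each transition row becomes the mixture
    row' = (1 - p) * row + p * uniform.
    P has shape (states, actions, states); rows sum to 1.
    """
    n = P.shape[-1]
    return (1.0 - p) * P + p / n   # p/n broadcasts the uniform row 1/n

# Two-state, one-action nominal chain perturbed with p = 0.01.
P = np.array([[[1.0, 0.0]], [[0.0, 1.0]]])
Pp = perturb_transitions(P, 0.01)
assert np.allclose(Pp.sum(axis=-1), 1.0)                 # still stochastic
assert 0.5 * np.abs(Pp - P).sum(axis=-1).max() <= 0.01   # TV distance <= p
```

The total-variation check makes the mismatch explicit: the perturbed model stays within distance p of the nominal one, which is the regime in which the confidence region U^a_i is tuned below.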
To test the performance of the robust algorithms, we perturb the models slightly by choosing, with a small probability p, a random state after every action. The size of the confidence region U^a_i for the robust model is chosen by 10-fold cross validation via line search. After the value functions are learned for the robust and the nominal algorithms, we evaluate their performance on the true environment. To compare the two algorithms we compare both the cumulative reward as well as the tail distribution function (complementary cumulative distribution function) as in [24], which for every a plots the probability that the algorithm earned a reward of at least a.

Note that there is a tradeoff in the performance of the robust versus the nominal algorithms with the value of p, due to the presence of the β term in the convergence results. See Figure 1 for a comparison. More figures and detailed results are included in the supplementary material.

Figure 1: Line search, tail distribution, and cumulative rewards during the transient phase of robust vs. nominal Q-learning on FrozenLake-v0 with p = 0.01. Note that the instability of the reward as a function of the size of the uncertainty set (left) is due to the small sample size used in line search.

Acknowledgments

The authors would like to thank Guy Tennenholtz and the anonymous reviewers for helping improve the presentation of the paper.

References

[1] J. A. Bagnell, A. Y. Ng, and J. G. Schneider. Solving uncertain Markov decision processes. 2001.

[2] D. P. Bertsekas. Approximate policy iteration: A survey and some new methods. Journal of Control Theory and Applications, 9(3):310–335, 2011.

[3] D. P. Bertsekas and S. Ioffe. Temporal differences-based policy iteration and applications in neuro-dynamic programming. Lab. for Info. and Decision Systems Report LIDS-P-2349, MIT, Cambridge, MA, 1996.

[4] D. P. Bertsekas and J. N. Tsitsiklis.
Neuro-dynamic programming: An overview. In Proceedings of the 34th IEEE Conference on Decision and Control, volume 1, pages 560–564. IEEE, 1995.

[5] D. P. Bertsekas and H. Yu. Projected equation methods for approximate solution of large linear systems. Journal of Computational and Applied Mathematics, 227(1):27–50, 2009.

[6] S. Bhatnagar, D. Precup, D. Silver, R. S. Sutton, H. R. Maei, and C. Szepesvári. Convergent temporal-difference learning with arbitrary smooth function approximation. In Advances in Neural Information Processing Systems, pages 1204–1212, 2009.

[7] J. A. Boyan. Technical update: Least-squares temporal difference learning. Machine Learning, 49(2-3):233–246, 2002.

[8] S. J. Bradtke and A. G. Barto. Linear least-squares algorithms for temporal difference learning. Machine Learning, 22(1-3):33–57, 1996.

[9] G. Brockman, V. Cheung, L. Pettersson, J. Schneider, J. Schulman, J. Tang, and W. Zaremba. OpenAI Gym. arXiv preprint arXiv:1606.01540, 2016.

[10] E. Delage and S. Mannor. Percentile optimization for Markov decision processes with parameter uncertainty. Operations Research, 58(1):203–213, 2010.

[11] G. N. Iyengar. Robust dynamic programming. Mathematics of Operations Research, 30(2):257–280, 2005.

[12] S. H. Lim, H. Xu, and S. Mannor. Reinforcement learning in robust Markov decision processes. In Advances in Neural Information Processing Systems, pages 701–709, 2013.

[13] J. Morimoto and K. Doya. Robust reinforcement learning. Neural Computation, 17(2):335–359, 2005.

[14] A. Nedić and D. P. Bertsekas. Least squares policy evaluation algorithms with linear function approximation. Discrete Event Dynamic Systems, 13(1):79–110, 2003.

[15] A. Nilim and L. El Ghaoui. Robustness in Markov decision problems with uncertain transition matrices. In NIPS, pages 839–846, 2003.

[16] L. Pinto, J.
Davidson, R. Sukthankar, and A. Gupta. Robust adversarial reinforcement learning. arXiv preprint arXiv:1703.02702, 2017.

[17] W. B. Powell. Approximate Dynamic Programming: Solving the Curses of Dimensionality, volume 703. John Wiley & Sons, 2007.

[18] M. L. Puterman. Markov Decision Processes: Discrete Stochastic Dynamic Programming. John Wiley & Sons, 2014.

[19] A. Shapiro and A. Kleywegt. Minimax analysis of stochastic problems. Optimization Methods and Software, 17(3):523–542, 2002.

[20] R. S. Sutton and A. G. Barto. Reinforcement Learning: An Introduction, volume 1. MIT Press, Cambridge, 1998.

[21] R. S. Sutton, H. R. Maei, D. Precup, S. Bhatnagar, D. Silver, C. Szepesvári, and E. Wiewiora. Fast gradient-descent methods for temporal-difference learning with linear function approximation. In Proceedings of the 26th Annual International Conference on Machine Learning, pages 993–1000. ACM, 2009.

[22] R. S. Sutton, H. R. Maei, and C. Szepesvári. A convergent O(n) temporal-difference algorithm for off-policy learning with linear function approximation. In Advances in Neural Information Processing Systems, pages 1609–1616, 2009.

[23] A. Tamar, Y. Glassner, and S. Mannor. Optimizing the CVaR via sampling. arXiv preprint arXiv:1404.3862, 2014.

[24] A. Tamar, S. Mannor, and H. Xu. Scaling up robust MDPs using function approximation. In ICML, volume 32, 2014.

[25] W. Wiesemann, D. Kuhn, and B. Rustem. Robust Markov decision processes. Mathematics of Operations Research, 38(1):153–183, 2013.