{"title": "Off-Policy Evaluation via Off-Policy Classification", "book": "Advances in Neural Information Processing Systems", "page_first": 5437, "page_last": 5448, "abstract": "In this work, we consider the problem of model selection for deep reinforcement learning (RL) in real-world environments. Typically, the performance of deep RL algorithms is evaluated via on-policy interactions with the target environment. However, comparing models in a real-world environment for the purposes of early stopping or hyperparameter tuning is costly and often practically infeasible.\nThis leads us to examine off-policy policy evaluation (OPE) in such settings.\nWe focus on OPE of value-based methods, which are of particular interest in deep RL with applications like robotics, where off-policy algorithms based on Q-function estimation can often attain better sample complexity than direct policy optimization.\nFurthermore, existing OPE metrics either rely on a model of the environment, or the use of importance sampling (IS) to correct for the data being off-policy. \nHowever, for high-dimensional observations, such as images, models of the environment can be difficult to fit and value-based methods can make IS hard to use or even ill-conditioned, especially when dealing with continuous action spaces.\nIn this paper, we focus on the specific case of MDPs with continuous action spaces and sparse binary rewards, which is representative of many important real-world applications. We propose an alternative metric that relies on neither models nor IS, by framing OPE as a positive-unlabeled (PU) classification problem. We experimentally show that this metric outperforms baselines on a number of tasks. 
Most importantly, it can reliably predict the relative performance of different policies in a number of generalization scenarios, including the transfer to the real world of policies trained in simulation for an image-based robotic manipulation task.", "full_text": "Off-Policy Evaluation via Off-Policy Classification

Alex Irpan1, Kanishka Rao1, Konstantinos Bousmalis2, Chris Harris1, Julian Ibarz1, Sergey Levine1,3
1Google Brain, Mountain View, USA
2DeepMind, London, UK
3University of California Berkeley, Berkeley, USA
{alexirpan,kanishkarao,konstantinos,ckharris,julianibarz,slevine}@google.com

Abstract

In this work, we consider the problem of model selection for deep reinforcement learning (RL) in real-world environments. Typically, the performance of deep RL algorithms is evaluated via on-policy interactions with the target environment. However, comparing models in a real-world environment for the purposes of early stopping or hyperparameter tuning is costly and often practically infeasible. This leads us to examine off-policy policy evaluation (OPE) in such settings. We focus on OPE for value-based methods, which are of particular interest in deep RL, with applications like robotics, where off-policy algorithms based on Q-function estimation can often attain better sample complexity than direct policy optimization. Existing OPE metrics either rely on a model of the environment, or on importance sampling (IS) to correct for the data being off-policy. However, for high-dimensional observations, such as images, models of the environment can be difficult to fit, and value-based methods can make IS hard to use or even ill-conditioned, especially when dealing with continuous action spaces. 
In this paper, we focus on the specific case of MDPs with continuous action spaces and sparse binary rewards, which is representative of many important real-world applications. We propose an alternative metric that relies on neither models nor IS, by framing OPE as a positive-unlabeled (PU) classification problem with the Q-function as the decision function. We experimentally show that this metric outperforms baselines on a number of tasks. Most importantly, it can reliably predict the relative performance of different policies in a number of generalization scenarios, including the transfer to the real world of policies trained in simulation for an image-based robotic manipulation task.

1 Introduction

Supervised learning has seen significant advances in recent years, in part due to the use of large, standardized datasets [6]. When researchers can evaluate the real performance of their methods on the same data via a standardized offline metric, the progress of the field can be rapid. Unfortunately, such metrics have been lacking in reinforcement learning (RL). Model selection and performance evaluation in RL are typically done by estimating the average on-policy return of a method in the target environment. Although this is possible in most simulated environments [3, 4, 37], real-world environments, as in robotics, make this difficult and expensive [36]. Off-policy evaluation (OPE) has the potential to change that: a robust off-policy metric could be used together with realistic and complex data to evaluate the expected performance of off-policy RL methods, which would enable rapid progress on important real-world RL problems. Furthermore, it would greatly simplify transfer learning in RL, where OPE would enable model selection and algorithm design in simple domains (e.g., simulation) while evaluating the performance of these models and algorithms on complex domains (e.g., using previously collected real-world data).

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

Figure 1: (a) Visual illustration of our method: we propose using classification-based approaches to do off-policy evaluation. Solid curves represent Q(s, a) over a positive and a negative trajectory, with the dashed curve representing max_{a'} Q(s, a') along the states the positive trajectory visits (the corresponding negative curve is omitted for simplicity). Baseline approaches (blue, red) measure the Q-function fit between Q(s, a) and max_{a'} Q(s, a'). Our approach (purple) directly measures the separation of Q(s, a) between positive and negative trajectories. (b) The visual "reality gap" of our most challenging task: off-policy evaluation of the generalization of image-based robotic agents trained solely in simulation (left), using historical data from the target real-world environment (right).

Previous approaches to off-policy evaluation [7, 13, 28, 35] generally use importance sampling (IS) or learned dynamics models. However, this makes them difficult to use with many modern deep RL algorithms. First, OPE is most useful in the off-policy RL setting, where we expect to use real-world data as the "validation set", but many of the most commonly used off-policy RL methods are based on value-function estimation, produce deterministic policies [20, 38], and do not require any knowledge of the policy that generated the real-world training data. This makes them difficult to use with IS. Furthermore, many of these methods might be used with high-dimensional observations, such as images. 
Although there has been considerable progress in predicting future images [2, 19], learning sufficiently accurate models in image space for effective evaluation is still an open research problem. We therefore aim to develop an OPE method that requires neither IS nor models.

We observe that for model selection, it is sufficient to predict some statistic correlated with policy return, rather than to predict policy return directly. We address the specific case of binary-reward MDPs: tasks where the agent receives a non-zero reward only once during an episode, at the final timestep (Sect. 2). These can be interpreted as tasks where the agent can either "succeed" or "fail" in each trial, and although they form a subset of all possible MDPs, this subset is quite representative of many real-world tasks, and is actively used, e.g., in robotic manipulation [15, 31]. The novel contribution of our method (Sect. 3) is to frame OPE as a positive-unlabeled (PU) classification [16] problem, which provides a way to derive OPE metrics that are (a) fundamentally different from prior methods based on IS and model learning, and (b) performant in practice on both simulated and real-world tasks. Additionally, we identify and present (Sect. 4) a list of generalization scenarios in RL that we would want our metrics to be robust against. We experimentally show (Sect. 6) that our suggested OPE metrics outperform a variety of baseline methods across all of the evaluation scenarios, including a simulation-to-reality transfer scenario for a vision-based robotic grasping task (see Fig. 1b).

2 Preliminaries

We focus on finite-horizon Markov decision processes (MDPs). We define an MDP as (S, A, P, S_0, r, γ). S is the state space, A the action space, and both can be continuous. 
P defines transitions to next states given the current state and action, S_0 defines the initial state distribution, r is the reward function, and γ ∈ [0, 1] is the discount factor. Episodes are of finite length T: at a given time-step t the agent is at state s_t ∈ S, samples an action a_t ∈ A from a policy π, receives a reward r_t = r(s_t, a_t), and observes the next state s_{t+1} as determined by P.

The goal in RL is to learn a policy π(a_t|s_t) that maximizes the expected episode return R(π) = E_π[Σ_{t=0}^T γ^t r(s_t, a_t)]. The value of a policy at a given state s_t is defined as V^π(s_t) = E_π[Σ_{t'=t}^T γ^{t'−t} r(s_{t'}, a^π_{t'})], where a^π_{t'} is the action π takes at state s_{t'} and E_π implies an expectation over trajectories τ = (s_1, a_1, ..., s_T, a_T) sampled from π. Given a policy π, the expected value of its action a_t at a state s_t is called the Q-value and is defined as Q^π(s_t, a_t) = E_π[r(s_t, a_t) + γV^π(s_{t+1})]. We assume the MDP is a binary reward MDP, which satisfies the following properties: γ = 1, the reward is r_t = 0 at all intermediate steps, and the final reward r_T is in {0, 1}, indicating whether the final state is a failure or a success. We learn Q-functions Q(s, a) and aim to evaluate policies π(s) = arg max_a Q(s, a).

2.1 Positive-unlabeled learning

Positive-unlabeled (PU) learning is a set of techniques for learning binary classification from partially labeled data, where we have many unlabeled points and some positively labeled points [16]. We will make use of these ideas in developing our OPE metric. Positive-unlabeled data is sufficient to learn a binary classifier if the positive class prior p(y = 1) is known.

Let (X, Y) be a labeled binary classification problem, where Y = {0, 1}. Let g : X → R be some decision function, and let ℓ : R × {0, 1} → R be our loss function. 
Suppose we want to evaluate the loss ℓ(g(x), y) over negative examples (x, y = 0), but we only have unlabeled points x and positively labeled points (x, y = 1). The key insight of PU learning is that the loss over negatives can be indirectly estimated from p(y = 1). For any x ∈ X,

p(x) = p(x|y = 1)p(y = 1) + p(x|y = 0)p(y = 0).   (1)

It follows that for any f(x), E_{X,Y}[f(x)] = p(y = 1)E_{X|Y=1}[f(x)] + p(y = 0)E_{X|Y=0}[f(x)], since by definition E_X[f(x)] = ∫_x p(x)f(x)dx. Letting f(x) = ℓ(g(x), 0) and rearranging gives

p(y = 0)E_{X|Y=0}[ℓ(g(x), 0)] = E_{X,Y}[ℓ(g(x), 0)] − p(y = 1)E_{X|Y=1}[ℓ(g(x), 0)].   (2)

In Sect. 3, we reduce off-policy evaluation of a policy π to a positive-unlabeled classification problem. We explain how to estimate p(y = 1), apply PU learning to estimate classification error with Eqn. 2, then use the error to estimate a lower bound on the return R(π) with Theorem 1.

3 Off-policy evaluation via state-action pair classification

A Q-function Q(s, a) predicts the expected return of each action a given state s. The policy π(s) = arg max_a Q(s, a) can be viewed as a classifier that predicts the best action. We propose an off-policy evaluation method connecting OPE to estimating the validation error of a positive-unlabeled (PU) classification problem [16]. Our metric can be used with Q-function estimation methods without requiring importance sampling, and can be readily applied in a scalable way to image-based deep RL tasks.

We present an analysis for binary reward MDPs, defined in Sect. 2. In binary reward MDPs, each (s_t, a_t) is either potentially effective, or guaranteed to lead to failure.

Definition 1. In a binary reward MDP, (s_t, a_t) is feasible if an optimal policy π* has non-zero probability of achieving success, i.e., an episode return of 1, after taking a_t in s_t. 
A state-action pair (s_t, a_t) is catastrophic if even an optimal π* has zero probability of succeeding once a_t is taken. A state s_t is feasible if there exists a feasible (s_t, a_t), and a state s_t is catastrophic if (s_t, a_t) is catastrophic for every action a_t.

Under this definition, the return of a trajectory τ is 1 only if all (s_t, a_t) in τ are feasible (see Appendix A.1). This condition is necessary, but not sufficient, because success can also depend on the stochastic dynamics. Since Definition 1 is stated in terms of an optimal π*, we can view feasible and catastrophic as binary labels that are independent of the policy π we are evaluating. Viewing π as a classifier, we relate the classification error of π to a lower bound on the return R(π).

Theorem 1. Given a binary reward MDP and a policy π, let c(s_t, a_t) be the probability that stochastic dynamics bring a feasible (s_t, a_t) to a catastrophic s_{t+1}, with c = max_{s,a} c(s, a). Let ρ+_{t,π} denote the state distribution at time t, given that π was followed, all of its previous actions a_1, ..., a_{t−1} were feasible, and s_t is feasible. Let A(s) denote the set of catastrophic actions at state s, and let ε_t = E_{ρ+_{t,π}}[Σ_{a∈A(s_t)} π(a|s_t)] be the per-step expectation of π making its first mistake at time t, with ε = (1/T) Σ_{t=1}^T ε_t being the average error over all (s_t, a_t). Then R(π) ≥ 1 − T(ε + c).

See Appendix A.2 for the proof. For the deterministic case (c = 0), we can take inspiration from behavioral cloning bounds in imitation learning by Ross & Bagnell [32] to prove the same result; this alternative proof is in Appendix A.3.

A smaller error ε gives a higher lower bound on the return, which implies a better π. This leaves estimating ε. 
The primary challenge with this approach is that we do not have negative labels: for trials that receive a return of 0 in the validation set, we do not know which (s, a) were in fact catastrophic, and which were recoverable. We discuss how we address this problem next.

3.1 Missing negative labels

Recall that (s_t, a_t) is feasible if π* has a chance of succeeding after action a_t. Since π* is at least as good as any behavior policy π_b, whenever π_b succeeds, all tuples (s_t, a_t) in the trajectory τ must be feasible. However, the converse is not true, since failure could come from poor luck or suboptimal actions. Our key insight is that this is an instance of the positive-unlabeled (PU) learning problem from Sect. 2.1, where π_b positively labels some (s, a) and the remaining pairs are unlabeled. This lets us use ideas from PU learning to estimate the error.

In the RL setting, inputs (s, a) are drawn from X = S × A, labels {0, 1} correspond to {catastrophic, feasible}, and a natural choice for the decision function g is g(s, a) = Q(s, a), since Q(s, a) should be high for feasible (s, a) and low for catastrophic (s, a). We aim to estimate ε, the probability that π takes a catastrophic action, i.e., that (s, π(s)) is a false positive. Note that if (s, π(s)) is predicted to be catastrophic but is actually feasible, this false negative does not impact future reward: since the action is feasible, there is still some chance of success. We want just the false-positive risk, ε = p(y = 0)E_{X|Y=0}[ℓ(g(x), 0)]. This is the same as Eqn. 2, and using g(s, a) = Q(s, a) gives

ε = E_{(s,a)}[ℓ(Q(s, a), 0)] − p(y = 1)E_{(s,a),y=1}[ℓ(Q(s, a), 0)].   (3)

Eqn. 3 is the core of all our proposed metrics. 
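The false-positive risk in Eqn. 3 can be estimated directly from samples of Q-values on the validation set: one average over all transitions, one over transitions from successful episodes. A minimal sketch of such an estimator (our own illustrative helper, not the paper's implementation; the loss and the prior are left as arguments):

```python
import numpy as np

def pu_false_positive_risk(q_all, q_pos, loss, p_pos=1.0):
    """Sample-based estimate of Eqn. 3:
        eps = E_(s,a)[l(Q, 0)] - p(y=1) * E_(s,a),y=1[l(Q, 0)].

    q_all: Q(s, a) on every transition in the validation set (unlabeled).
    q_pos: Q(s, a) on transitions from successful episodes (positive labels).
    loss:  the loss l(q, y) evaluated at y = 0, as a function of q.
    p_pos: positive class prior p(y = 1); Sect. 3 argues p(y = 1) = 1 is ideal.
    """
    return np.mean(loss(q_all)) - p_pos * np.mean(loss(q_pos))
```

For instance, `loss=lambda q: (q > b).astype(float)` recovers the 0-1 loss used later for the OPC score, while `loss=lambda q: q` gives the soft variant.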
While it might at first seem that the class prior p(y = 1) should be task-dependent, recall that the error ε_t is an expectation over the state distribution ρ+_{t,π}, in which the previous actions a_1, ..., a_{t−1} were all feasible. This is equivalent to following an optimal "expert" policy π*, and although we estimate ε_t from data generated by the behavior policy π_b, we should match the positive class prior p(y = 1) that we would observe under the expert π*. The expert π* always picks feasible actions. Therefore, although the validation dataset will likely contain both successes and failures, p(y = 1) = 1 is the ideal prior, and this holds independently of the environment. We illustrate this further with a didactic example in Sect. 6.1.

Theorem 1 relies on estimating ε over the distribution ρ+_{t,π}, but our dataset D is generated by an unknown behavior policy π_b. A natural approach here would be importance sampling (IS) [7], but: (a) we assume no knowledge of π_b, and (b) IS is not well-defined for deterministic policies π(s) = arg max_a Q(s, a). Another approach is to subsample D to the transitions (s, a) where a = π(s) [21]. This ensures an on-policy evaluation, but can encounter finite-sample issues if π_b does not sample π(s) frequently enough. Therefore, we assume classification error over D is a good enough proxy that correlates well with classification error over ρ+_{t,π}. This is admittedly a strong assumption, but the empirical results in Sect. 6 show surprising robustness to distributional mismatch. The assumption is reasonable if D is broad (e.g., generated by a sufficiently random policy), but may produce pessimistic estimates when potentially feasible actions in D are unlabeled.

3.2 Off-policy classification for OPE

Based on the derivation from Sect. 
3.1, our proposed off-policy classification (OPC) score is defined by the negative loss when ℓ in Eqn. 3 is the 0-1 loss. Let b be a threshold, with ℓ(Q(s, a), Y) = 1/2 + (1/2 − Y) sign(Q(s, a) − b). This gives

OPC(Q) = p(y = 1)E_{(s,a),y=1}[1_{Q(s,a)>b}] − E_{(s,a)}[1_{Q(s,a)>b}].   (4)

To be fair to each Q(s, a), the threshold b is set separately for each Q-function to maximize OPC(Q). Given N transitions and Q(s, a) for all (s, a) ∈ D, this can be done in O(N log N) time per Q-function (see Appendix B). This avoids favoring Q-functions that systematically overestimate or underestimate the true value.

Alternatively, ℓ can be a soft loss function. We experimented with ℓ(Q(s, a), Y) = (1 − 2Y)Q(s, a), which is minimized when Q(s, a) is large for Y = 1 and small for Y = 0. The negative of this loss is called the SoftOPC:

SoftOPC(Q) = p(y = 1)E_{(s,a),y=1}[Q(s, a)] − E_{(s,a)}[Q(s, a)].   (5)

If episodes have different lengths, then to avoid over-weighting long episodes, transitions (s, a) from an episode of length T are weighted by 1/T when estimating SoftOPC. Pseudocode is in Appendix B.

Although our derivation is for binary reward MDPs, both OPC and SoftOPC are purely evaluation-time metrics, and can be applied to Q-functions trained with dense rewards or reward shaping, as long as the final evaluation uses a sparse binary reward.

3.3 Evaluating OPE metrics

The standard evaluation method for OPE is to report the MSE against the true episode return [21, 35]. However, our metrics do not estimate episode return directly. OPC(Q)'s estimate of ε will differ from the true value, since it is estimated over our dataset D instead of over the distribution ρ+_{t,π}. Meanwhile, SoftOPC(Q) does not estimate ε directly due to using a soft loss function. 
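Concretely, both scores reduce to simple statistics of Q-values over the validation set. The following is a minimal sketch (our own illustrative code, not the paper's implementation), assuming p(y = 1) = 1 and using a sort-and-sweep over candidate thresholds b, which evaluates every distinct threshold in O(N log N):

```python
import numpy as np

def soft_opc(q_all, q_pos, p_pos=1.0):
    # Eqn. 5: SoftOPC(Q) = p(y=1) * E_pos[Q(s,a)] - E[Q(s,a)].
    return p_pos * np.mean(q_pos) - np.mean(q_all)

def opc(q_all, is_pos, p_pos=1.0):
    # Eqn. 4, maximized over the threshold b. Sorting all Q-values in
    # descending order and sweeping b through them visits every distinct
    # threshold once: O(N log N) for the sort, O(N) for the sweep.
    order = np.argsort(-np.asarray(q_all))
    is_pos = np.asarray(is_pos, dtype=float)[order]
    n, n_pos = len(is_pos), is_pos.sum()
    # With b just below the k-th largest Q-value:
    #   E_pos[1{Q>b}] = (# positives among top k) / n_pos
    #   E[1{Q>b}]     = k / n
    pos_frac = np.cumsum(is_pos) / max(n_pos, 1)
    all_frac = np.arange(1, n + 1) / n
    scores = p_pos * pos_frac - all_frac
    return max(scores.max(), 0.0)  # b above all Q-values scores exactly 0
```

A Q-function that ranks every positively labeled transition above every unlabeled one attains the maximum OPC score, regardless of the absolute scale of its Q-values, which is the intended invariance to monotonic transformations.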
Despite this, OPC and SoftOPC are still useful OPE metrics if they correlate well with ε or with the episode return R(π). We propose an alternative evaluation method. Instead of reporting MSE, we train a large suite of Q-functions Q(s, a) with different learning algorithms, evaluate the true return of the equivalent argmax policy for each Q(s, a), then measure the correlation of each metric with the true return. We report two correlations: the coefficient of determination R² of the line of best fit, and the Spearman rank correlation ξ [33].¹ R² measures confidence in how well our linear best fit will predict the returns of new models, whereas ξ measures confidence that the metric ranks different policies correctly, without assuming a linear best fit.

4 Applications of OPE for transfer and generalization

Off-policy evaluation (OPE) has many applications. One is to use OPE as an early-stopping or model-selection criterion when training from off-policy data. Another is applying OPE to validation data collected in another domain, to measure generalization to new settings. Several papers [5, 27, 30, 39, 40] have examined overfitting and memorization in deep RL, proposing explicit train-test environment splits as benchmarks for RL generalization. Often, these test environments are defined in simulation, where it is easy to evaluate the policy in the test environment. This is no longer sufficient for real-world settings, where test-environment evaluation can be expensive. In real-world problems, off-policy evaluation is an inescapable part of measuring generalization performance in an efficient, tractable way. To demonstrate this, we identify a few common generalization failure scenarios faced in reinforcement learning, applying OPE to each one. When there is insufficient off-policy training data and new data is not collected online, models may memorize state-action pairs in the training data. 
Online RL algorithms avoid this by collecting new on-policy data with high frequency. If training data is generated in a systematically biased way, we have mismatched off-policy training data: the model fails to generalize because systemic biases cause it to miss parts of the target distribution. Finally, models trained in simulation usually do not generalize to the real world, due to the training and test domain gap: the differences in the input space (see Fig. 1b and Fig. 2) and in the dynamics. All of these scenarios are, in principle, identifiable by off-policy evaluation, as long as validation is done against data sampled from the final testing environment. We evaluate our proposed and baseline metrics across all these scenarios.

¹We slightly abuse notation here, and should clarify that R² is used to symbolize the coefficient of determination, and should not be confused with R(π), the average return of a policy π.

Figure 2: An example of a training and test domain gap, displayed with a robotic grasping task: (a) images used during training, from simulated grasping over procedurally generated objects; (b) images from the real world, with a varied collection of everyday physical objects.

5 Related work

Off-policy policy evaluation (OPE) predicts the return of a learned policy π from a fixed off-policy dataset D, generated by one or more behavior policies π_b. Prior works [7, 10, 13, 21, 28, 34] do so with importance sampling (IS) [11], MDP modeling, or both. Importance sampling requires querying π(a|s) and π_b(a|s) for any s ∈ D, to correct for the shift in state-action distributions. In RL, the cumulative product of IS weights along τ is used to weight its contribution to π's estimated value [28]. Several variants have been proposed, such as step-wise IS and weighted IS [23]. 
In MDP modeling, a model is fitted to D, and π is rolled out in the learned model to estimate the average return [13, 24]. The performance of these approaches degrades when the dynamics or reward are poorly estimated, which tends to occur in image-based tasks; improving such models is an active research question [2, 19]. State-of-the-art methods combine IS-based and model-based estimators using doubly robust estimation and ensembles, producing improved estimators with theoretical guarantees [7, 8, 10, 13, 35].

These IS and model-based OPE approaches assume importance sampling or model learning is feasible. This assumption often breaks down in modern deep RL approaches. When π_b is unknown, π_b(a|s) cannot be queried. When doing value-based RL with deterministic policies, π(a|s) is undefined for off-policy actions. When working with high-dimensional observations, it is often too difficult to learn a model reliable enough for evaluation.

Many recent papers [5, 27, 30, 39, 40] have defined train-test environment splits to evaluate RL generalization, but define test environments in simulation, where there is no need for OPE. We demonstrate how OPE provides tools to evaluate RL generalization for real-world environments. While to our knowledge no prior work has proposed a classification-based OPE approach, several prior works have used supervised classifiers to predict transfer performance from a few runs in the test environment [17, 18]. To our knowledge, no other OPE papers have shown results for large image-based tasks where neither importance sampling nor model learning is a viable option.

Baseline metrics. Since we assume importance sampling and model learning are infeasible, many common OPE baselines do not fit our problem setting. In their place, we use other Q-learning-based metrics that likewise need neither importance sampling nor model learning, and only require a Q(s, a) estimate. 
The temporal-difference error (TD Error) is the standard Q-learning training loss, and Farahmand & Szepesvári [9] proposed a model selection algorithm based on minimizing TD error. The discounted sum of advantages (Σ_t γ^t A^π) relates the difference in values V^{π_b}(s) − V^π(s) to the sum of advantages Σ_t γ^t A^π(s_t, a_t) over data from π_b, and was proposed by Kakade & Langford [14] and Murphy [26]. Finally, the Monte Carlo corrected error (MCC Error) is derived by rearranging the discounted sum of advantages into a squared error, and was used as a training objective by Quillen et al. [29]. The exact expression for each of these metrics is in Appendix C.

Each of these baselines represents a different way to measure how well Q(s, a) fits the true return. However, it is possible to learn a good policy π even when Q(s, a) fits the data poorly. In Q-learning, it is common to define the argmax policy π(s) = arg max_a Q(s, a). The argmax policy for Q*(s, a) is π*, and Q* has zero TD error. But applying any monotonic function to Q*(s, a) produces a Q'(s, a) whose TD error is non-zero, yet whose argmax policy is still π*. A good OPE metric should rate Q* and Q' identically. This motivates our proposed classification-based OPE metrics: since π's behavior depends only on the relative differences between Q-values, it makes sense to directly contrast Q-values against each other, rather than compare the error between Q-values and episode return. Doing so lets us compare Q-functions whose Q(s, a) estimates are inaccurate. Fig. 1a visualizes the differences between the baseline metrics and the classification metrics.

6 Experiments

In this section, we investigate the correlation of OPC and SoftOPC with the true average return, and how they may be used for model selection with off-policy data. 
We compare the correlations of these metrics with the correlations of the baselines, namely the TD Error, the sum of advantages, and the MCC Error (see Sect. 5), in a number of environments and generalization failure scenarios. For each experiment, a validation dataset D is collected with a behavior policy π_b, and state-action pairs (s, a) are labeled as feasible whenever they appear in a successful trajectory. In line with Sect. 3.3, several Q-functions Q(s, a) are trained for each task. For each Q(s, a), we evaluate each metric over D and the true return of the equivalent argmax policy. We report both the coefficient of determination R² of the line of best fit and the Spearman rank correlation coefficient ξ [33]. Our results are summarized in Table 1 and Table 2. Our OPC/SoftOPC metrics are implemented with p(y = 1) = 1, as explained in Sect. 3 and Appendix D.

6.1 Simple environments

Binary tree. As a didactic toy example, we used a binary tree MDP with depth equal to the episode length T. In this environment,² each node is a state s_t with r_t = 0, unless it is a leaf/terminal state with reward r_T ∈ {0, 1}. Actions are {'left', 'right'}, and transitions are deterministic. Exactly one leaf is a success leaf with r_T = 1, and the rest have r_T = 0. In our experiments we used a full binary tree of depth T = 6. The initial state distribution was uniform over all non-leaf nodes, which means the initial state could sometimes be one from which failure is inevitable. The validation dataset D was collected by generating 1,000 episodes from a uniformly random policy. For the policies we wanted to evaluate, we generated 1,000 random Q-functions by sampling Q(s, a) ∼ U[0, 1] for every (s, a), defining the policy as π(s) = arg max_a Q(s, a). 
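To make the setup concrete, the tree, a random Q-function, and the ground-truth return it is scored against can be sketched as follows (our own illustrative code, not the released environment):

```python
import random

def rollout(T, success_leaf, policy, start):
    """Walk a full binary tree with heap indexing (root = 1).

    Interior nodes are states 1 .. 2**T - 1; nodes >= 2**T are leaves.
    Returns 1 iff the walk ends on the designated success leaf."""
    s = start
    while s < 2 ** T:
        s = 2 * s + policy(s)  # action 0 = 'left', 1 = 'right'
    return int(s == success_leaf)

T = 6
success_leaf = 2 ** T  # leftmost leaf succeeds (an arbitrary choice)

# One random Q-function with Q(s, a) ~ U[0, 1], and its argmax policy.
Q = {(s, a): random.random() for s in range(1, 2 ** T) for a in (0, 1)}
policy = lambda s: max((0, 1), key=lambda a: Q[(s, a)])

# True return: dynamics are deterministic, so one rollout per start state,
# averaged over the uniform initial distribution, gives the exact return.
ret = sum(rollout(T, success_leaf, policy, s0)
          for s0 in range(1, 2 ** T)) / (2 ** T - 1)
```

Repeating this for 1,000 sampled Q-functions yields the (metric score, true return) pairs whose correlation is reported below.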
We compared the correlation of the actual on-policy performance of the policies with the scores given by OPC, SoftOPC, and the baseline metrics using D, as shown in Table 2. SoftOPC correlates best and OPC second best.

Pong. As we are specifically motivated by image-based tasks with binary rewards, the Atari [3] game Pong was a good choice for a simple environment with these characteristics. The visual input is of low complexity, and the game is easily converted into a binary reward task by truncating the episode after the first point is scored. We learned Q-functions using DQN [25] and DDQN [38], varying hyperparameters such as the learning rate, the discount factor γ, and the batch size, as discussed in detail in Appendix E.2. A total of 175 model checkpoints were chosen from the various models for evaluation, and true average performance was evaluated over 3,000 episodes per checkpoint. For the validation dataset we used 38 Q-functions that were partially trained with DDQN, and generated 30 episodes from each, for a total of 1,140 episodes. As with the binary tree environment, we compare the correlations of our metrics and the baselines against the true average performance over a number of on-policy episodes. As shown in Table 2, both our metrics outperform the baselines; OPC performs better than SoftOPC in terms of R², and similarly in terms of the Spearman correlation ξ.

Stochastic dynamics. To evaluate performance under stochastic dynamics, we modified the dynamics of the binary tree and Pong environments. In the binary tree, the environment executes a random action instead of the policy's action with probability ε. In Pong, the environment uses sticky actions, a standard protocol for stochastic dynamics in Atari games introduced by [22]: with small probability, the environment repeats the previous action instead of the policy's action. 
Everything else is unchanged. Results are shown in Table 1. In more stochastic environments, all metrics drop in performance, since Q(s, a) has less control over the return, but OPC and SoftOPC consistently correlate better than the baselines.

²Code for the binary tree environment is available at https://bit.ly/2Qx6TJ7.

Table 1: Results from the stochastic dynamics experiments. For each metric (leftmost column), we report the R² of the line of best fit and the Spearman rank correlation coefficient ρ for each environment (top row), over stochastic versions of the binary tree and Pong tasks from Sect. 6.1. Correlation drops as stochasticity increases, but our proposed metrics (last two rows) consistently outperform the baselines.

                 Stochastic Tree (1 success leaf)           Pong (sticky actions)
                 ε = 0.4       ε = 0.6       ε = 0.8        Sticky 10%    Sticky 25%
                 R²     ρ      R²     ρ      R²     ρ       R²     ρ      R²     ρ
TD Err           0.01  -0.07   0.00  -0.05   0.00  -0.05    0.05  -0.16   0.07  -0.15
Σ_t A^π          0.00   0.01   0.01  -0.07   0.00  -0.02    0.04  -0.29   0.01  -0.22
MCC Err          0.07  -0.27   0.01  -0.06   0.01  -0.11    0.02  -0.32   0.00  -0.18
OPC (Ours)       0.13   0.38   0.01   0.08   0.03   0.19    0.48   0.73   0.33   0.66
SoftOPC (Ours)   0.14   0.39   0.03   0.18   0.04   0.20    0.33   0.67   0.16   0.58

6.2 Vision-based Robotic Grasping

Our main experimental results were on simulated and real versions of a robotic environment and a vision-based grasping task, following the setup from Kalashnikov et al. [15], which we briefly summarize. The observation at each time step is a 472 × 472 RGB image, from a camera placed over the shoulder of a robotic arm, of the robot and a bin of objects, as shown in Fig. 1b. At the start of an episode, objects are randomly dropped into a bin in front of the robot. The goal is to grasp any of the objects in that bin.
Actions include continuous Cartesian displacements of the gripper and rotation of the gripper around the z-axis. The action space also includes three discrete commands: "open gripper", "close gripper", and "terminate episode". Rewards are sparse, with r(s_T, a_T) = 1 if any object is grasped and 0 otherwise. All models are trained with the fully off-policy QT-Opt algorithm, as described in Kalashnikov et al. [15].

In simulation, we define a training and a test environment by generating two distinct sets of 5 objects, one set for each, shown in Fig. 8. To capture the different possible generalization failure scenarios discussed in Sect. 4, we trained Q-functions in a fully off-policy fashion, with data collected by a hand-crafted policy with a 60% grasp success rate and ε-greedy exploration (ε = 0.1), on two different datasets, both from the training environment. The first consists of 100,000 episodes, with which we can show that the off-policy training data is insufficient to perform well even in the training environment. The second consists of 900,000 episodes, with which we can show that the data is sufficient to perform well in the training environment, but that, due to mismatched off-policy training data, the policies do not generalize to the test environment (see Fig. 8 for the objects and Appendix E.3 for the analysis). We saved policies at different stages of training, which resulted in 452 policies for the former case and 391 for the latter. We evaluated the true return of these policies over 700 episodes in each environment and calculated the correlation with the scores assigned by the OPE metrics, based on held-out validation sets of 50,000 episodes from the training environment and 10,000 episodes from the test one, which we show in Table 2.

Table 2: Summarized results of the experiments section.
For each metric (leftmost column), we report the R² of the line of best fit and the Spearman rank correlation coefficient ρ for each environment (top row). These are: the binary tree and Pong tasks from Sect. 6.1, simulated grasping with train or test objects, and real-world grasping from Sect. 6.2. Baseline metrics are discussed in Sect. 5, and our metrics (OPC, SoftOPC) are discussed in Sect. 3. Occasionally, some baselines correlate well, but our proposed metrics (last two rows) are consistently among the best metrics for each environment.

                 Tree (1 Succ)   Pong            Sim Train       Sim Test        Real-World
                 R²     ρ        R²     ρ        R²     ρ        R²     ρ        R²     ρ
TD Err           0.02  -0.15     0.05  -0.18     0.02  -0.37     0.10  -0.51     0.17   0.48
Σ_t A^π          0.00   0.00     0.09  -0.32     0.74   0.81     0.74   0.78     0.12   0.50
MCC Err          0.06  -0.26     0.04  -0.36     0.00   0.33     0.06  -0.44     0.01  -0.15
OPC (Ours)       0.21   0.50     0.50   0.72     0.49   0.86     0.35   0.66     0.81   0.87
SoftOPC (Ours)   0.19   0.51     0.36   0.75     0.55   0.76     0.48   0.77     0.91   0.94

The real-world version of the environment has objects that were never seen during training (see Fig. 1b and Fig. 9). We evaluated 15 different models, trained to have varying degrees of robustness to the gap between the training and test domains, based on domain randomization and randomized-to-canonical adaptation networks [12].³ Out of these, 7 were trained on-policy purely in simulation. True average return was evaluated over 714 episodes with 7 different sets of objects, and true real-world policy performance ranged from 17% to 91%. The validation dataset consisted of 4,000 real-world episodes, 40% of which were successful grasps; the objects used for it were separate from those used for the final evaluation reported in Table 2.

(a) SoftOPC and return in sim. (b) Scatterplot for real-world grasping.

Figure 3: (a): SoftOPC in simulated grasping.
Overlay of SoftOPC (red) and return (blue) in simulation for the model trained with 100k grasps. SoftOPC tracks episode return. (b): Scatterplots of OPE metrics against real-world grasp success for the real-world grasping task: Σ_t′ A^π(s_t′, a_t′) (left) and SoftOPC (right). Each point is a different grasping model, and shaded regions are a 95% confidence interval. Σ_t′ A^π(s_t′, a_t′) works in simulation but fails on real data, whereas SoftOPC works well in both.

6.3 Discussion

Table 2 shows R² and ρ for each metric in the different environments we considered. Our proposed SoftOPC and OPC consistently outperformed the baselines, with the exception of the simulated robotic test environment, in which SoftOPC performed almost as well as the discounted sum of advantages on the Spearman correlation (but worse on R²). However, we show that SoftOPC ranks policies by real-world performance more reliably than the baselines, without any real-world interaction, as can also be seen in Fig. 3b. The same figure shows that the sum-of-advantages metric, which works well in simulation, performs poorly in the real-world setting we care about. Appendix F includes additional experiments showing that the correlations are mostly unchanged on different validation datasets. Furthermore, in Fig. 3a we demonstrate that SoftOPC can track the performance of a policy acting in the simulated grasping environment as it trains, which could potentially be useful for early stopping. Finally, SoftOPC seems to perform slightly better than OPC in most of the experiments. We believe this occurs because the Q-functions compared in each experiment tend to have similar magnitudes.
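This sensitivity to magnitude can be illustrated with simplified stand-ins for the two metrics. The definitions below are schematic (mean Q over feasible pairs minus mean Q over all pairs for SoftOPC; best thresholded accuracy for OPC), not the exact forms from Sect. 3:

```python
def soft_opc_sketch(q, data):
    """Simplified SoftOPC stand-in (with p(y = 1) = 1): mean Q over feasible
    (s, a) pairs minus mean Q over all pairs. Scales with Q's magnitude."""
    pos = [q(s, a) for s, a, y in data if y == 1]
    everything = [q(s, a) for s, a, y in data]
    return sum(pos) / len(pos) - sum(everything) / len(everything)

def opc_sketch(q, data):
    """Simplified OPC stand-in: best accuracy of the classifier
    'Q(s, a) > b' over thresholds b. Invariant to monotone rescalings of Q."""
    scores = sorted(q(s, a) for s, a, y in data)
    thresholds = [scores[0] - 1.0] + scores
    return max(
        sum((q(s, a) > b) == (y == 1) for s, a, y in data) / len(data)
        for b in thresholds
    )
```

Because the OPC stand-in only compares Q-values to a threshold, multiplying every Q-value by a positive constant leaves it unchanged, whereas the SoftOPC stand-in scales with that constant; when the Q-functions being compared have similar magnitudes, this difference matters less.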
Preliminary results in Appendix H suggest that when Q-functions have different magnitudes, OPC might outperform SoftOPC.

7 Conclusion and future work

We proposed OPC and SoftOPC, classification-based off-policy evaluation metrics that can be used together with Q-learning algorithms. Our metrics apply to binary reward tasks: tasks where each episode results in either a failure (zero return) or a success (a return of one). While this class of tasks is a substantial restriction, many practical tasks actually fall into this category, including the real-world robotics tasks in our experiments. Our analysis shows that these metrics can approximate the expected return in deterministic binary reward MDPs. Empirically, we find that OPC and the SoftOPC variant correlate well with performance across several environments, and predict generalization performance across several scenarios, including the simulation-to-reality scenario, a critical setting for robotics. Effective off-policy evaluation is critical for real-world reinforcement learning, where it provides an alternative to expensive real-world evaluations during algorithm development. A promising direction for future work is developing a variant of our method that is not restricted to binary reward tasks; we include some initial work in Appendix J. However, even in the binary setting, we believe that methods such as ours can provide a substantially more practical pipeline for evaluating transfer learning and off-policy reinforcement learning algorithms.

³For full details of each of the models, please see Appendix E.4.

Acknowledgements

We would like to thank Razvan Pascanu, Dale Schuurmans, George Tucker, and Paul Wohlhart for valuable discussions.

References

[1] Arjona-Medina, J. A., Gillhofer, M., Widrich, M., Unterthiner, T., Brandstetter, J., and Hochreiter, S. RUDDER: Return decomposition for delayed rewards.
arXiv preprint arXiv:1806.07857, 2018.

[2] Babaeizadeh, M., Finn, C., Erhan, D., Campbell, R. H., and Levine, S. Stochastic variational video prediction. In International Conference on Learning Representations, 2018.

[3] Bellemare, M. G., Naddaf, Y., Veness, J., and Bowling, M. The arcade learning environment: An evaluation platform for general agents. Journal of Artificial Intelligence Research, 47:253–279, 2013.

[4] Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., and Zaremba, W. OpenAI Gym. arXiv preprint arXiv:1606.01540, 2016.

[5] Cobbe, K., Klimov, O., Hesse, C., Kim, T., and Schulman, J. Quantifying generalization in reinforcement learning. arXiv preprint arXiv:1812.02341, 2018.

[6] Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., and Fei-Fei, L. ImageNet: A large-scale hierarchical image database. In CVPR, 2009.

[7] Dudik, M., Langford, J., and Li, L. Doubly robust policy evaluation and learning. In ICML, 2011.

[8] Dudík, M., Erhan, D., Langford, J., Li, L., et al. Doubly robust policy evaluation and optimization. Statistical Science, 29(4):485–511, 2014.

[9] Farahmand, A.-M. and Szepesvári, C. Model selection in reinforcement learning. Machine Learning, 85(3):299–332, 2011.

[10] Hanna, J. P., Stone, P., and Niekum, S. Bootstrapping with models: Confidence intervals for off-policy evaluation. In Proceedings of the 16th Conference on Autonomous Agents and MultiAgent Systems, AAMAS '17, pp. 538–546, 2017.

[11] Horvitz, D. G. and Thompson, D. J. A generalization of sampling without replacement from a finite universe.
Journal of the American Statistical Association, 47(260):663–685, 1952.

[12] James, S., Wohlhart, P., Kalakrishnan, M., Kalashnikov, D., Irpan, A., Ibarz, J., Levine, S., Hadsell, R., and Bousmalis, K. Sim-to-real via sim-to-sim: Data-efficient robotic grasping via randomized-to-canonical adaptation networks. In IEEE Conference on Computer Vision and Pattern Recognition, 2019.

[13] Jiang, N. and Li, L. Doubly robust off-policy value evaluation for reinforcement learning. November 2015.

[14] Kakade, S. and Langford, J. Approximately optimal approximate reinforcement learning. In ICML, 2002.

[15] Kalashnikov, D., Irpan, A., Pastor, P., Ibarz, J., Herzog, A., Jang, E., Quillen, D., Holly, E., Kalakrishnan, M., Vanhoucke, V., et al. QT-Opt: Scalable deep reinforcement learning for vision-based robotic manipulation. arXiv preprint arXiv:1806.10293, 2018.

[16] Kiryo, R., Niu, G., du Plessis, M. C., and Sugiyama, M. Positive-unlabeled learning with non-negative risk estimator. In NeurIPS, pp. 1675–1685, 2017.

[17] Koos, S., Mouret, J.-B., and Doncieux, S. Crossing the reality gap in evolutionary robotics by promoting transferable controllers. In Proceedings of the 12th Annual Conference on Genetic and Evolutionary Computation, pp. 119–126. ACM, 2010.

[18] Koos, S., Mouret, J.-B., and Doncieux, S. The transferability approach: Crossing the reality gap in evolutionary robotics. IEEE Transactions on Evolutionary Computation, 17(1):122–145, 2012.

[19] Lee, A. X., Zhang, R., Ebert, F., Abbeel, P., Finn, C., and Levine, S. Stochastic adversarial video prediction. arXiv preprint arXiv:1804.01523, 2018.

[20] Lillicrap, T. P., Hunt, J. J., Pritzel, A., Heess, N., Erez, T., Tassa, Y., Silver, D., and Wierstra, D. Continuous control with deep reinforcement learning.
arXiv preprint arXiv:1509.02971, 2015.

[21] Liu, Y., Gottesman, O., Raghu, A., Komorowski, M., Faisal, A., Doshi-Velez, F., and Brunskill, E. Representation balancing MDPs for off-policy policy evaluation. In NeurIPS, 2018.

[22] Machado, M. C., Bellemare, M. G., Talvitie, E., Veness, J., Hausknecht, M., and Bowling, M. Revisiting the arcade learning environment: Evaluation protocols and open problems for general agents. Journal of Artificial Intelligence Research, 61:523–562, 2018.

[23] Mahmood, A. R., van Hasselt, H. P., and Sutton, R. S. Weighted importance sampling for off-policy learning with linear function approximation. In Advances in Neural Information Processing Systems, pp. 3014–3022, 2014.

[24] Mannor, S., Simester, D., Sun, P., and Tsitsiklis, J. N. Bias and variance approximation in value function estimates. Management Science, 53(2):308–322, 2007.

[25] Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A. A., Veness, J., Bellemare, M. G., Graves, A., Riedmiller, M., Fidjeland, A. K., Ostrovski, G., et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529, 2015.

[26] Murphy, S. A. A generalization error for Q-learning. Journal of Machine Learning Research, 6:1073–1097, 2005.

[27] Nichol, A., Pfau, V., Hesse, C., Klimov, O., and Schulman, J. Gotta learn fast: A new benchmark for generalization in RL. arXiv preprint arXiv:1804.03720, 2018.

[28] Precup, D., Sutton, R. S., and Singh, S. Eligibility traces for off-policy policy evaluation. In Proceedings of the Seventeenth International Conference on Machine Learning, pp. 759–766. Morgan Kaufmann, 2000.

[29] Quillen, D., Jang, E., Nachum, O., Finn, C., Ibarz, J., and Levine, S. Deep reinforcement learning for vision-based robotic grasping: A simulated comparative evaluation of off-policy methods. February 2018.

[30] Raghu, M., Irpan, A., Andreas, J., Kleinberg, R., Le, Q., and Kleinberg, J.
Can deep reinforcement learning solve Erdos-Selfridge-Spencer games? In International Conference on Machine Learning, pp. 4235–4243, 2018.

[31] Riedmiller, M., Hafner, R., Lampe, T., Neunert, M., Degrave, J., Van de Wiele, T., Mnih, V., Heess, N., and Springenberg, J. T. Learning by playing: Solving sparse reward tasks from scratch. In International Conference on Machine Learning, 2018.

[32] Ross, S. and Bagnell, D. Efficient reductions for imitation learning. In AISTATS, pp. 661–668, 2010.

[33] Spearman, C. The proof and measurement of association between two things. The American Journal of Psychology, 15:72–101, 1904. doi: 10.2307/1412159.

[34] Theocharous, G., Thomas, P. S., and Ghavamzadeh, M. Personalized ad recommendation systems for life-time value optimization with guarantees. In IJCAI, pp. 1806–1812, 2015.

[35] Thomas, P. and Brunskill, E. Data-efficient off-policy policy evaluation for reinforcement learning. In International Conference on Machine Learning, pp. 2139–2148, 2016.

[36] Thomas, P. S., Theocharous, G., and Ghavamzadeh, M. High-confidence off-policy evaluation. In AAAI, 2015.

[37] Todorov, E., Erez, T., and Tassa, Y. MuJoCo: A physics engine for model-based control. In Intelligent Robots and Systems (IROS), 2012 IEEE/RSJ International Conference on, pp. 5026–5033. IEEE, 2012.

[38] van Hasselt, H., Guez, A., and Silver, D. Deep reinforcement learning with double Q-learning. In Thirtieth AAAI Conference on Artificial Intelligence, 2016.

[39] Zhang, A., Ballas, N., and Pineau, J. A dissection of overfitting and generalization in continuous reinforcement learning. June 2018.

[40] Zhang, C., Vinyals, O., Munos, R., and Bengio, S. A study on overfitting in deep reinforcement learning.
April 2018.