{"title": "Causal Confusion in Imitation Learning", "book": "Advances in Neural Information Processing Systems", "page_first": 11698, "page_last": 11709, "abstract": "Behavioral cloning reduces policy learning to supervised learning by training a discriminative model to predict expert actions given observations. Such discriminative models are non-causal: the training procedure is unaware of the causal structure of the interaction between the expert and the environment. We point out that ignoring causality is particularly damaging because of the distributional shift in imitation learning. In particular, it leads to a counter-intuitive \"causal misidentification\" phenomenon: access to more information can yield worse performance. We investigate how this problem arises, and propose a solution to combat it through targeted interventions---either environment interaction or expert queries---to determine the correct causal model. We show that causal misidentification occurs in several benchmark control domains as well as realistic driving settings, and validate our solution against DAgger and other baselines and ablations.", "full_text": "Causal Confusion in Imitation Learning\n\nPim de Haan\u22171, Dinesh Jayaraman\u2020\u2021, Sergey Levine\u2020\n\u2217Qualcomm AI Research, University of Amsterdam,\n\u2020Berkeley AI Research, \u2021 Facebook AI Research\n\nAbstract\n\nBehavioral cloning reduces policy learning to supervised learning by training a dis-\ncriminative model to predict expert actions given observations. Such discriminative\nmodels are non-causal: the training procedure is unaware of the causal structure of\nthe interaction between the expert and the environment. We point out that ignoring\ncausality is particularly damaging because of the distributional shift in imitation\nlearning. In particular, it leads to a counter-intuitive \u201ccausal misidenti\ufb01cation\u201d\nphenomenon: access to more information can yield worse performance. 
We investigate how this problem arises, and propose a solution to combat it through targeted interventions\u2014either environment interaction or expert queries\u2014to determine the correct causal model. We show that causal misidenti\ufb01cation occurs in several benchmark control domains as well as realistic driving settings, and validate our solution against DAgger and other baselines and ablations.\n\n1 Introduction\n\nImitation learning allows for control policies to be learned directly from example demonstrations provided by human experts. It is easy to implement, and reduces or removes the need for extensive interaction with the environment during training [58, 41, 4, 1, 20].\nHowever, imitation learning suffers from a fundamental problem: distributional shift [9, 42]. Training and testing state distributions are different, induced respectively by the expert and learned policies. Therefore, imitating expert actions on expert trajectories may not align with the true task objective. While this problem is widely acknowledged [41, 9, 42, 43], with careful engineering, na\u00efve behavioral cloning approaches have yielded good results for several practical problems [58, 41, 44, 36, 37, 4, 33, 3]. This raises the question: is distributional shift really still a problem?\nIn this paper, we identify a somewhat surprising and very problematic effect of distributional shift: \u201ccausal misidenti\ufb01cation.\u201d Distinguishing correlates of expert actions in the demonstration set from true causes is usually very dif\ufb01cult, but may be ignored without adverse effects when training and testing distributions are identical (as assumed in supervised learning), since nuisance correlates continue to hold in the test set. However, this can cause catastrophic problems in imitation learning due to distributional shift. 
This is exacerbated by the causal structure of sequential action: the very\nfact that current actions cause future observations often introduces complex new nuisance correlates.\nTo illustrate, consider behavioral cloning to train a neural network to drive a car. In scenario A, the\nmodel\u2019s input is an image of the dashboard and windshield, and in scenario B, the input to the model\n(with identical architecture) is the same image but with the dashboard masked out (see Fig 1). Both\ncloned policies achieve low training loss, but when tested on the road, model B drives well, while\nmodel A does not. The reason: the dashboard has an indicator light that comes on immediately when\nthe brake is applied, and model A wrongly learns to apply the brake only when the brake light is on.\nEven though the brake light is the effect of braking, model A could achieve low training error by\nmisidentifying it as the cause instead.\n\n1Work mostly done while at Berkeley AI Research.\n\n33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.\n\n\fFigure 1: Causal misidenti\ufb01cation: more information yields worse imitation learning performance. Model A\nrelies on the braking indicator to decide whether to brake. Model B instead correctly attends to the pedestrian.\n\nThis situation presents a give-away symptom of causal misidenti\ufb01cation: access to more information\nleads to worse generalization performance in the presence of distributional shift. Causal misidenti\ufb01-\ncation occurs commonly in natural imitation learning settings, especially when the imitator\u2019s inputs\ninclude history information.\nIn this paper, we \ufb01rst point out and investigate the causal misidenti\ufb01cation problem in imitation\nlearning. Then, we propose a solution to overcome it by learning the correct causal model, even when\nusing complex deep neural network policies. 
We learn a mapping from causal graphs to policies,\nand then use targeted interventions to ef\ufb01ciently search for the correct policy, either by querying an\nexpert, or by executing selected policies in the environment.\n\n2 Related Work\nImitation learning.\nImitation learning through behavioral cloning dates back to Widrow and Smith,\n1964 [58], and has remained popular through today [41, 44, 36, 37, 4, 13, 33, 56, 3]. The distributional\nshift problem, wherein a cloned policy encounters unfamiliar states during autonomous execution,\nhas been identi\ufb01ed as an issue in imitation learning [41, 9, 42, 43, 25, 19, 3]. This is closely tied\nto the \u201cfeedback\u201d problem in general machine learning systems that have direct or indirect access\nto their own past states [47, 2]. For imitation learning, various solutions to this problem have been\nproposed [9, 42, 43] that rely on iteratively querying an expert based on states encountered by some\nintermediate cloned policy, to overcome distributional shift; DAgger [43] has come to be the most\nwidely used of these solutions.\nWe show evidence that the distributional shift problem in imitation learning is often due to causal\nmisidenti\ufb01cation, as illustrated schematically in Fig 1. We propose to address this through targeted\ninterventions on the states to learn the true causal model to overcome distributional shift. As we\nwill show, these interventions can take the form of either environmental rewards with no additional\nexpert involvement, or of expert queries in cases where the expert is available for additional inputs. In\nexpert query mode, our approach may be directly compared to DAgger [43]: indeed, we show that we\nsuccessfully resolve causal misidenti\ufb01cation using orders of magnitude fewer queries than DAgger.\nWe also compare against Bansal et al. [3]: to prevent imitators from copying past actions, they train\nwith dropout [53] on dimensions that might reveal past actions. 
While our approach seeks to \ufb01nd the true causal graph in a mixture of graph-parameterized policies, dropout corresponds to directly applying the mixture policy. In our experiments, our approach performs signi\ufb01cantly better.\nCausal inference. Causal inference is the general problem of deducing cause-effect relationships among variables [52, 38, 40, 50, 10, 51]. \u201cCausal discovery\u201d approaches allow causal inference from pre-recorded observations under constraints [54, 17, 29, 15, 30, 31, 26, 14, 34, 57]. Observational causal inference is known to be impossible in general [38, 39]. We operate in the interventional regime [55, 11, 49, 48] where a user may \u201cexperiment\u201d to discover causal structures by assigning values to some subset of the variables of interest and observing the effects on the rest of the system. We propose a new interventional causal inference approach suited to imitation learning. While ignoring causal structure is particularly problematic in imitation learning, ours is the \ufb01rst effort directly addressing this, to our knowledge.\n\n3 The Phenomenon of Causal Misidenti\ufb01cation\n\nIn imitation learning, an expert demonstrates how to perform a task (e.g., driving a car) for the bene\ufb01t of an agent. In each demo, the agent has access both to its n-dim. state observations at each time t, X^t = [X^t_1, X^t_2, . . . , X^t_n] (e.g., a video feed from a camera), and to the expert\u2019s action A^t (e.g., steering, acceleration, braking). Behavioral cloning approaches learn a mapping \u03c0 from X^t to A^t using all (X^t, A^t) tuples from the demonstrations. At test time, the agent observes X^t and executes \u03c0(X^t).\nThe underlying sequential decision process has complex causal structures, represented in Fig 2. 
States in\ufb02uence future expert actions, and are also themselves in\ufb02uenced by past actions and states. In particular, expert actions A^t are in\ufb02uenced by some information in state X^t, and unaffected by the rest. For the moment, assume that the dimensions X^t_1, X^t_2, X^t_3, . . . of X^t represent disentangled factors of variation. Then some unknown subset of these factors (\u201ccauses\u201d) affect expert actions, and the rest do not (\u201cnuisance variables\u201d). A confounder Z^t = [X^{t\u22121}, A^{t\u22121}] in\ufb02uences each state variable in X^t, so that some nuisance variables may still be correlated with A^t among (X^t, A^t) pairs from demonstrations. In Fig 1, the dashboard light is a nuisance variable.\nA na\u00efve behavioral cloned policy might rely on nuisance correlates to select actions, producing low training error, and even generalizing to held-out (X^t, A^t) pairs. However, this policy must contend with distributional shift when deployed: actions A^t are chosen by the imitator rather than the expert, affecting the distribution of Z^t and X^t. This in turn affects the policy mapping from X^t to A^t, leading to poor performance of expert-cloned policies. We de\ufb01ne \u201ccausal misidenti\ufb01cation\u201d as the phenomenon whereby cloned policies fail by misidentifying the causes of expert actions.\n\nFigure 2: Causal dynamics of imitation. Parents of a node represent its causes.\n\n3.1 Robustness and Causality in Imitation Learning\n\nIntuitively, distributional shift affects the relationship of the expert action A^t to nuisance variables, but not to the true causes. In other words, to be maximally robust to distributional shift, a policy must rely solely on the true causes of expert actions, thereby avoiding causal misidenti\ufb01cation. 
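The brake-light failure mode described above can be reproduced in a few lines. The following toy simulation is ours, not one of the paper's benchmarks (all names and dynamics are hypothetical): a policy that reads only the nuisance dimension (the previous action, playing the role of the indicator light) matches the expert closely on demonstrations, yet collapses once its own actions generate that dimension at test time.

```python
import random

random.seed(0)

def expert_action(x):
    # True cause: the expert brakes (1) iff an obstacle is close (x > 0.5).
    return 1 if x > 0.5 else 0

# Demonstrations: each state is (cause, nuisance), where the nuisance is the
# previous action, mirroring the confounder Z^t = [X^{t-1}, A^{t-1}].
demos, prev_a, x = [], 0, 0.0
for _ in range(1000):
    if random.random() < 0.1:       # obstacles persist, so actions correlate in time
        x = random.random()
    a = expert_action(x)
    demos.append(((x, prev_a), a))
    prev_a = a

# A policy that latched onto the nuisance correlate: copy the indicator.
nuisance_policy = lambda s: s[1]
train_acc = sum(nuisance_policy(s) == a for s, a in demos) / len(demos)

# Deployment: the nuisance now reflects the *imitator's* previous action.
prev_a, x, correct = 0, 0.0, 0
for _ in range(1000):
    if random.random() < 0.1:
        x = random.random()
    a = nuisance_policy((x, prev_a))
    correct += (a == expert_action(x))
    prev_a = a                       # distributional shift: feedback through Z^t
test_acc = correct / 1000
```

On demonstrations the nuisance policy agrees with the expert most of the time (actions rarely change between steps), but under its own control it freezes on its initial action and its agreement with the expert drops sharply.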
This intuition can be formalized in the language of functional causal models (FCM) and interventions [38].\nFunctional causal models: A functional causal model (FCM) over a set of variables {Y_i}^n_{i=1} is a tuple (G, \u03b8_G) containing a graph G over {Y_i}^n_{i=1}, and deterministic functions f_i(\u00b7; \u03b8_G) with parameters \u03b8_G describing how the causes of each variable Y_i determine it: Y_i = f_i(Y_{Pa(i;G)}, E_i; \u03b8_G), where E_i is a stochastic noise variable that represents all external in\ufb02uences on Y_i, and Pa(i; G) denote the indices of parent nodes of Y_i, which correspond to its causes.\nAn \u201cintervention\u201d do(Y_i) on Y_i to set its value may now be represented by a structural change in this graph to produce the \u201cmutilated graph\u201d G_{\u00afY_i}, in which incoming edges to Y_i are removed.1\nApplying this formalism to our imitation learning setting, any distributional shift in the state X^t may be modeled by intervening on X^t, so that correctly modeling the \u201cinterventional query\u201d p(A^t|do(X^t)) is suf\ufb01cient for robustness to distributional shifts. Now, we may formalize the intuition that only a policy relying solely on true causes can robustly model the mapping from states to optimal/expert actions under distributional shift.\nIn Appendix B, we prove that under mild assumptions, correctly modeling interventional queries does indeed require learning the correct causal graph G. In the car example, \u201csetting\u201d the brake light to on or off and observing the expert\u2019s actions would yield a clear signal unobstructed by confounders: the brake light does not affect the expert\u2019s braking behavior.\n\n3.2 Causal Misidenti\ufb01cation in Policy Learning Benchmarks and Realistic Settings\n\nBefore discussing our solution, we \ufb01rst present several testbeds and real-world cases where causal misidenti\ufb01cation adversely in\ufb02uences imitation learning performance.\nControl Benchmarks. 
We show that causal misidenti\ufb01cation is induced with small changes to widely studied benchmark control tasks, simply by adding more information to the state, which intuitively ought to make the tasks easier, not harder. In particular, we add information about the previous action, which tends to correlate with the current action in the expert data for many standard control problems. This is a proxy for scenarios like our car example, in which correlates of past actions are observable in the state, and is similar to what we might see from other sources of knowledge about the past, such as memory or recurrence. We study three kinds of tasks: (i) MountainCar (continuous states, discrete actions), (ii) MuJoCo Hopper (continuous states and actions), (iii) Atari games: Pong, Enduro and UpNDown (states: two stacked consecutive frames, discrete actions).\n\n1For a more thorough overview of FCMs, see [38].\n\nFigure 3: The Atari environments, (a) Pong, (b) Enduro, (c) UpNDown, with an indicator of the past action (white number in lower left).\n\nFor each task, we study imitation learning in two scenarios. In scenario A (henceforth called \"CONFOUNDED\"), the policy sees the augmented observation vector, including the previous action. In the case of low-dimensional observations, the state vector is expanded to include the previous action at an index that is unknown to the learner. In the case of image observations, we overlay a symbol corresponding to the previous action at an unknown location on the image (see Fig 3). In scenario B (\"ORIGINAL\"), the previous action variable is replaced with random noise for low-dimensional observations. For image observations, the original images are left unchanged. Demonstrations are generated synthetically as described in Appendix A. 
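For low-dimensional tasks, the two scenarios differ only in one coordinate of the observation. A minimal sketch of this augmentation (the function and variable names are ours, not the paper's code):

```python
import random

def augment_observation(state, prev_action, confounded, index, rng):
    """Expand a low-dimensional state vector for the two scenarios.

    CONFOUNDED inserts the previous action at a fixed index (unknown to the
    learner); ORIGINAL inserts random noise into the same slot instead.
    """
    filler = float(prev_action) if confounded else rng.random()
    return state[:index] + [filler] + state[index:]

rng = random.Random(0)
confounded_obs = augment_observation([0.2, -1.3], prev_action=1, confounded=True, index=1, rng=rng)
original_obs = augment_observation([0.2, -1.3], prev_action=1, confounded=False, index=1, rng=rng)
```

Both observation vectors have the same shape and marginal statistics, so an imitator cannot tell from the data alone which coordinate is the nuisance.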
In all cases, we use neural networks with identical architectures to represent the policies, and we train them on the same demonstrations.\nFig 4 shows the rewards against varying demonstration dataset sizes for MountainCar, Hopper, and Pong. Appendix E shows additional results, including for Enduro and UpNDown. All policies are trained to near-zero validation error on held-out expert state-action tuples. ORIGINAL produces rewards tending towards expert performance as the size of the imitation dataset increases. CONFOUNDED either requires many more demonstrations to reach equivalent performance, or fails completely to do so.\nOverall, the results are clear: across these tasks, access to more information leads to inferior performance. As Fig 11 in the appendix shows, this difference is not due to different training/validation losses on the expert demonstrations: for example, in Pong, CONFOUNDED produces lower validation loss than ORIGINAL on held-out demonstration samples, but produces lower rewards when actually used for control. These results not only validate the existence of causal misidenti\ufb01cation, but also provide us with testbeds for investigating a potential solution.\nReal-World Driving. Our testbeds introduce deliberate nuisance variables to the \u201coriginal\u201d observation variables for ease of evaluation, but evidence suggests that misattribution is pervasive in common real-world imitation learning settings. 
Real-world problems often have no privileged\n\u201coriginal\u201d observation space, and very natural-seeming state spaces may still include nuisance factors\u2014\nas in our dashboard light setting (Fig 1), where causal misattribution occurs when using the full image\nfrom the camera.\nIn particular, history would seem a natural part of the state space for real-world driving, yet\nrecurrent/history-based imitation has been consistently observed in prior work to hurt performance,\nthus exhibiting clear symptoms of causal misidenti\ufb01cation [36, 56, 3]. While these histories contain\nvaluable information for driving, they also naturally introduce information about nuisance factors\nsuch as previous actions. In all three cases, more information led to worse results for the behavioral\ncloning policy, but this was neither attributed speci\ufb01cally to causal misidenti\ufb01cation, nor tackled\nusing causally motivated approaches.\nWe draw the reader\u2019s attention to particularly\ntelling results from Wang et al. [56] for learning\nto drive in near-photorealistic GTA-V [24] envi-\nronments, using behavior cloning with DAgger-\ninspired expert perturbation.\nImitation learn-\ning policies are trained using overhead image\nobservations with and without \u201chistory\u201d infor-\nmation (HISTORY and NO-HISTORY) about the\nego-position trajectory of the car in the past.\nSimilar to our tests above, architectures are identical for the two methods. And once again, like in\nour tests above, HISTORY has better performance on held-out demonstration data, but much worse\nperformance when actually deployed. Tab 1 shows these results, reproduced from Wang et al. [56]\nTable II. These results constitute strong evidence for the prevalence of causal misidenti\ufb01cation in\n\nTable 1: Imitation learning results from Wang et al. 
[56]. Accessing history yields better validation performance, but worse actual driving performance. Distance, Interventions, and Collisions are driving performance metrics.\n\nMethods        Validation Perplexity    Distance    Interventions    Collisions\nHISTORY        144.92                   0.834       2.94 \u00b1 1.79      6.49 \u00b1 5.72\nNO-HISTORY     268.95                   0.989       1.30 \u00b1 0.78      3.38 \u00b1 2.55\n\nFigure 4: Diagnosing causal misidenti\ufb01cation: net reward (y-axis) vs number of training samples (x-axis) for ORIGINAL and CONFOUNDED on (a) MountainCar, (b) Hopper, and (c) Pong, compared to expert reward (mean and stdev over 5 runs). Also see Appendix E.\n\nrealistic imitation learning settings. Bansal et al. [3] also observe similar symptoms in a driving setting, and present a dropout [53] approach to tackle it, which we compare to in our experiments. Subsequent to an earlier version of this work, Codevilla et al. [8] also verify causal confusion in realistic driving settings, and propose measures to address a speci\ufb01c instance of causal confusion.\n\n4 Resolving Causal Misidenti\ufb01cation\n\nRecall from Sec 3.1 that robustness to causal misidenti\ufb01cation can be achieved by \ufb01nding the true causal model of the expert\u2019s actions. We propose a simple pipeline to do this. First, we jointly learn policies corresponding to various causal graphs (Sec 4.1). Then, we perform targeted interventions to ef\ufb01ciently search over the hypothesis set for the correct causal model (Sec 4.2).\n\n4.1 Causal Graph-Parameterized Policy Learning\n\nIn this step, we learn a policy corresponding to each candidate causal graph. Recall from Sec 3 that the expert\u2019s actions A are based on an unknown subset of the state variables {X_i}^n_{i=1}. Each X_i may either be a cause or not, so there are 2^n possible graphs. We parameterize the structure G of the causal graph as a vector of n binary variables, each indicating the presence of an arrow from X_k to A in Fig 2. 
We then train a single graph-parameterized policy \u03c0_G(X) = f_\u03c6([X \u2299 G, G]), where \u2299 is element-wise multiplication, and [\u00b7, \u00b7] denotes concatenation. \u03c6 are neural network parameters, trained through gradient descent to minimize:\n\nE_G[\u2113(f_\u03c6([X_i \u2299 G, G]), A_i)],    (1)\n\nwhere G is drawn uniformly at random over all 2^n graphs and \u2113 is a mean squared error loss for the continuous action environments and a cross-entropy loss for the discrete action environments. Fig 5 shows a schematic of the training time architecture. The policy network f_\u03c6 mapping observations X to actions A represents a mixture of policies, one corresponding to each value of the binary causal graph structure variable G, which is sampled as a Bernoulli random vector.\n\nFigure 5: Graph-parameterized policy.\n\nIn Appendix D, we propose an approach to perform variational Bayesian causal discovery over graphs G, using a latent variable model to infer a distribution over functional causal models (graphs and associated parameters)\u2014the modes of this distribution are the FCMs most consistent with the demonstration data. This resembles the scheme above, except that instead of uniform sampling, graphs are sampled preferentially from FCMs that \ufb01t the training demonstrations well. We compare both approaches in Sec 5, \ufb01nding that simple uniform sampling nearly always suf\ufb01ces in preparation for the next step: targeted intervention.\n\n4.2 Targeted Intervention\n\nHaving learned the graph-parameterized policy as in Sec 4.1, we propose targeted intervention to compute the likelihood L(G) of each causal graph structure hypothesis G. 
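The graph-conditioned objective of Eq 1 can be optimized with ordinary stochastic gradient descent, resampling a random binary graph per example. The sketch below is ours and substitutes a linear model with squared error for the paper's neural network f_\u03c6 (all names are illustrative); only the first state variable is made a true cause of the action.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 4                        # number of disentangled state variables
W = np.zeros(2 * n)          # linear stand-in for the policy network f_phi

def features(x, g):
    # pi_G(X) = f_phi([X (*) G, G]): masked state concatenated with the graph.
    return np.concatenate([x * g, g])

losses = []
for _ in range(2000):
    x = rng.uniform(0.0, 1.0, n)
    a = x[0]                              # only X_1 is a true cause of the action
    g = rng.integers(0, 2, n)             # G drawn uniformly over all 2^n graphs
    phi = features(x, g)
    err = phi @ W - a                     # squared-error loss (continuous actions)
    W -= 0.1 * err * phi                  # SGD step on the Eq 1 objective
    losses.append(err ** 2)
```

When g masks out the true cause, the action is unpredictable and the loss cannot reach zero; the single network nevertheless learns all graph-conditioned sub-policies \u03c0_G jointly, putting most weight on the true-cause feature.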
In a sense, imitation learning provides an ideal setting for studying interventional causal learning: causal misidenti\ufb01cation presents a clear challenge, while the fact that the problem is situated in a sequential decision process where the agent can interact with the world provides a natural mechanism for carrying out limited interventions.\n\nAlgorithm 1 Expert query intervention\nInput: policy network f_\u03c6 s.t. \u03c0_G(X) = f_\u03c6([X \u2299 G, G])\nInitialize w = 0, D = \u2205.\nCollect states S by executing \u03c0_mix, the mixture of policies \u03c0_G for uniform samples G.\nFor each X in S, compute disagreement score: D(X) = E_G[D_KL(\u03c0_G(X), \u03c0_mix(X))]\nSelect S\u2032 \u2282 S with maximal D(X).\nCollect state-action pairs T by querying expert on S\u2032.\nfor i = 1 . . . N do\n    Sample G \u223c p(G) \u221d exp\u27e8w, G\u27e9.\n    L \u2190 E_{s,a\u223cT}[\u2113(\u03c0_G(s), a)]\n    D \u2190 D \u222a {(G, L)}\n    Fit w on D with linear regression.\nend for\nReturn: arg max_G p(G)\n\nWe propose two intervention modes, both of which can be carried out by interaction with the environment via the actions:\nExpert query mode. This is the standard intervention approach applied to imitation learning: intervene on X^t to assign it a value, and observe the expert response A. To do this, we sample a graph G at the beginning of each intervention episode and execute the policy \u03c0_G. Once data is collected in this manner, we elicit expert labels on interesting states. 
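The disagreement score used to pick query states, D(X) = E_G[D_KL(\u03c0_G(X), \u03c0_mix(X))], is cheap to compute for discrete actions. A small self-contained sketch (the two policies here are hand-written stand-ins for graph-conditioned networks, not learned models):

```python
import math

def kl(p, q):
    # KL divergence between two discrete action distributions.
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def disagreement(policies, x):
    # D(X) = E_G[D_KL(pi_G(X), pi_mix(X))], averaged over sampled graphs G.
    dists = [pi(x) for pi in policies]
    mix = [sum(col) / len(dists) for col in zip(*dists)]
    return sum(kl(d, mix) for d in dists) / len(dists)

def select_query_states(policies, states, k):
    # Query the expert only where the graph-conditioned policies disagree most.
    return sorted(states, key=lambda x: -disagreement(policies, x))[:k]

# Two graph-conditioned policies that agree on state "a" but not on "b".
pi_1 = lambda x: [0.9, 0.1] if x == "b" else [0.5, 0.5]
pi_2 = lambda x: [0.1, 0.9] if x == "b" else [0.5, 0.5]
queries = select_query_states([pi_1, pi_2], ["a", "b"], k=1)
```

States where the candidate graphs induce conflicting actions are exactly the states whose expert labels discriminate between causal hypotheses, which is what makes the active-learning step query-efficient.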
This requires an interactive expert, as in DAgger [42], but requires substantially fewer expert queries than DAgger, because: (i) the queries serve only to disambiguate among a relatively small set of valid FCMs, and (ii) we use disagreement among the mixture of policies in f_\u03c6 to query the expert ef\ufb01ciently in an active learning approach. We summarize this approach in Algorithm 1.\nPolicy execution mode. It is not always possible to query an expert. For example, for a learner learning to drive a car by watching a human driver, it may not be possible to put the human driver into dangerous scenarios that the learner might encounter at intermediate stages of training. In cases like these where we would like to learn from pre-recorded demonstrations alone, we propose to intervene indirectly by using environmental returns (sum of rewards over time in an episode) R = \u2211_t r_t. The policies \u03c0_G(\u00b7) = f_\u03c6([\u00b7 \u2299 G, G]) corresponding to different hypotheses G are executed in the environment and the returns R_G collected. The likelihood of each graph is proportional to the exponentiated returns exp R_G. The intuition is simple: environmental returns contain information about optimal expert policies even when experts are not queryable. Note that we do not even assume access to per-timestep rewards as in standard reinforcement learning; just the sum of rewards for each completed run. As such, this intervention mode is much more \ufb02exible. See Algorithm 2.\n\nAlgorithm 2 Policy execution intervention\nInput: policy network f_\u03c6 s.t. \u03c0_G(X) = f_\u03c6([X \u2299 G, G])\nInitialize w = 0, D = \u2205.\nfor i = 1 . . . N do\n    Sample G \u223c p(G) \u221d exp\u27e8w, G\u27e9.\n    Collect episode return R_G by executing \u03c0_G.\n    D \u2190 D \u222a {(G, R_G)}\n    Fit w on D with linear regression.\nend for\nReturn: arg max_G p(G)\n\nNote that both of the above intervention approaches evaluate individual hypotheses in isolation, but the number of hypotheses grows exponentially in the number of state variables. To handle larger states, we infer a graph distribution p(G), by assuming an energy-based model with a linear energy E(G) = \u27e8w, G\u27e9 + b, so the graph distribution is p(G) = \u220f_i p(G_i) = \u220f_i Bernoulli(G_i | \u03c3(w_i/\u03c4)), where \u03c3 is the sigmoid, which factorizes in independent factors. The independence assumption is sensible as our approach collapses p(G) to its mode before returning it, and the collapsed distribution is always independent. E(G) is inferred from linear regression on the likelihoods. This process is depicted in Algorithms 1 and 2. The above method can be formalized within the reinforcement learning framework [27]. As we show in Appendix H, the energy-based model can be seen as an instance of soft Q-learning [16].\n\n4.3 Disentangling Observations\n\nIn the above, we have assumed access to disentangled observations X^t. When this is not the case, such as with image observations, X^t must be set to a disentangled representation of the observation at time t. We construct such a representation by training a \u03b2-VAE [22, 18] to reconstruct the original observations. To capture states beyond those encountered by the expert, we train with a mix of expert and random trajectory states. Once trained, X^t is set to be the mean of the latent distribution produced at the output of the encoder. The VAE training objective encourages disentangled dimensions in the latent space [5, 6]. We employ CoordConv [28] in both the encoder and the decoder architectures.\n\nFigure 6: Reward vs. number of intervention episodes (policy execution interventions) on Atari games. 
UNIF-\nINTERVENTION succeeds in getting rewards close to ORIGINAL W/ VAE, while the DROPOUT baseline only\noutperforms CONFOUNDED W/ VAE in UpNDown.\n5 Experiments\nWe now evaluate the solution described in Sec 4 on the \ufb01ve tasks (MountainCar, Hopper, and 3 Atari\ngames) described in Sec 3.2. In particular, recall that CONFOUNDED performed signi\ufb01cantly worse\nthan ORIGINAL across all tasks. In our experiments, we seek to answer the following questions: (1)\nDoes our targeted intervention-based solution to causal misidenti\ufb01cation bridge the gap between\nCONFOUNDED and ORIGINAL? (2) How quickly does performance improve with intervention? (3)\nDo both intervention modes (expert query, policy execution) described in Sec 4.2 resolve causal\nmisidenti\ufb01cation? (4) Does our approach in fact recover the true causal graph? (5) Are disentangled\nstate representations necessary?\nIn each of the two intervention modes, we compare two variants of our method: UNIF-INTERVENTION\nand DISC-INTERVENTION. They only differ in the training of the graph-parameterized mixture-of-\npolicies f\u03c6\u2014while UNIF-INTERVENTION samples causal graphs uniformly, DISC-INTERVENTION uses\nthe variational causal discovery approach mentioned in Sec 4.1, and described in detail in Appendix D.\nBaselines. We compare our method against three baselines applied to the confounded state. DROPOUT\ntrains the policy using Eq 1 and evaluates with the graph G containing all ones, which amounts\nto dropout regularization [53] during training, as proposed by Bansal et al. [3]. DAGGER [42]\naddresses distributional shift by querying the expert on states encountered by the imitator, requiring\nan interactive expert. We compare DAGGER to our expert query intervention approach. Lastly, we\ncompare to Generative Adversarial Imitation Learning (GAIL) [19]. 
GAIL is an alternative to standard\nbehavioral cloning that works by matching demonstration trajectories to those generated by the\nimitator during roll-outs in the environment. Note that the PC algorithm [26], commonly used in\ncausal discovery from passive observational data, relies on the faithfulness assumption, which causes\nit to be infeasible in our setting, as explained in Appendix C. See Appendices B & D for details.\nIntervention by policy execution. Fig 7 plots episode rewards versus number of policy execution\nintervention episodes for MountainCar and Hopper. The reward always corresponds to the current\nmode arg maxG p(G) of the posterior distribution over graphs, updated after each episode, as\ndescribed in Algorithm 2. In these cases, both UNIF-INTERVENTION and DISC-INTERVENTION eventually\nconverge to models yielding similar rewards, which we veri\ufb01ed to be the correct causal model\ni.e., true causes are selected and nuisance correlates left out. In early episodes on MountainCar,\nDISC-INTERVENTION bene\ufb01ts from the prior over graphs inferred in the variational causal discovery\nphase. However, in Hopper, the simpler UNIF-INTERVENTION performs just as well. DROPOUT does\nindeed help in both settings, as reported in Bansal et al. [3], but is signi\ufb01cantly poorer than our\n\nFigure 7: Reward vs. number of intervention episodes (policy execution interventions) on MountainCar and\nHopper. Our methods UNIF-INTERVENTION and DISC-INTERVENTION bridge most of the causal misidenti-\n\ufb01cation gap (between ORIGINAL (lower bound) and CONFOUNDED (upper bound), approaching ORIGINAL\nperformance after tens of episodes. 
GAIL [19] (on Hopper) achieves this too, but after 1.5k episodes.\n\nFigure 8: Reward vs. expert queries (expert query interventions) on MountainCar and Hopper. Our methods partially bridge the gap from CONFOUNDED (lower bound) to ORIGINAL (upper bound), also outperforming DAGGER [43] and DROPOUT [3]. GAIL [19] outperforms our methods on Hopper, but requires a large number of policy roll-outs (also see Fig 7 comparing GAIL to our policy execution-based approach).\n\napproach variants. GAIL requires about 1.5k episodes on Hopper to match the performance of our approaches, which only need tens of episodes. Appendix G further analyzes the performance of GAIL. Standard implementations of GAIL do not handle discrete action spaces, so we do not evaluate it on MountainCar.\nAs described in Sec 4.3, we use a VAE to disentangle image states in Atari games to produce 30-D representations for Pong and Enduro and 50-D representations for UpNDown. We set this dimensionality heuristically to be as small as possible, while still producing good reconstructions as assessed visually. Requiring the policy to utilize the VAE representation without end-to-end training does result in some drop in performance, as seen in Fig 6. However, causal misidenti\ufb01cation still causes a very large drop of performance even relative to the baseline VAE performance. 
DISC-INTERVENTION is hard to train as the cardinality of the state increases, and yields only minor advantages on Hopper (14-D states), so we omit it for these Atari experiments. As Fig 6 shows, UNIF-INTERVENTION indeed improves significantly over CONFOUNDED W/ VAE in all three cases, matching ORIGINAL W/ VAE on Pong and UpNDown, while the DROPOUT baseline only improves on UpNDown. In our experiments thus far, GAIL fails to converge to above-chance performance on any of the Atari environments. These results show that our method successfully alleviates causal misidentification within relatively few trials.
Intervention by expert queries. Next, we perform direct intervention by querying the expert on samples from trajectories produced by the different causal graphs. In this setting, we can also directly compare to DAGGER [43]. Fig 8 shows results on MountainCar and Hopper. Both our approaches successfully improve over CONFOUNDED within a small number of queries. Consistent with the policy execution intervention results reported above, we verify that our approach again identifies the true causal model correctly in both tasks, and performs better than DROPOUT in both settings. It also exceeds the rewards achieved by DAGGER, while using far fewer expert queries. In Appendix F, we show that DAGGER requires hundreds of queries to achieve similar rewards for MountainCar and tens of thousands for Hopper. Finally, GAIL with 1.5k episodes outperforms our expert query intervention approach. Recall however from Fig 8 that this is an order of magnitude more than the number of episodes required by our policy intervention approach.
Once again, DISC-INTERVENTION only helps in early interventions on MountainCar, and not at all on Hopper.
Thus, our method's performance is primarily attributable to the targeted intervention stage, and the exact choice of approach used to learn the mixture of policies is relatively insignificant.
Overall, of the two intervention approaches, policy execution converges to better final rewards. Indeed, for the Atari environments, we observed that expert query interventions proved ineffective. We believe this is because expert agreement is an imperfect proxy for true environmental rewards.
Interpreting the learned causal graph. Our method labels each dimension of the VAE encoding of the frame as a cause or nuisance variable. In Fig 9, we analyze these inferences in the Pong environment as follows: in the top row, a frame is encoded into the VAE latent; then, for every nuisance dimension (as inferred by our approach UNIF-INTERVENTION), that dimension is replaced with a sample from the prior, and new samples are generated. In the bottom row, the same procedure is applied with a random graph that has as many nuisance variables as the inferred graph. We observe that in the top row, the causal variables (the ball and paddles) are shared between the samples, while the nuisance variables (the digit) differ, being replaced either with random digits or unreadable digits.

Figure 9: Samples from (top row) learned causal graph and (bottom row) random causal graph. (See text)

In the bottom row, the causal variables differ strongly, indicating that important aspects of the state are judged as nuisance variables.
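The resampling procedure just described can be sketched as follows; `decode` is a hypothetical stand-in for the trained VAE decoder, and the standard-normal prior is the usual VAE assumption.

```python
import numpy as np

def resample_nuisance(z, nuisance_dims, decode, n_samples=5, seed=None):
    """Generate Fig-9-style samples: hold the causal latent dimensions of
    a single encoded frame fixed, redraw every nuisance dimension from the
    VAE's standard-normal prior, and decode. `decode` is a hypothetical
    stand-in for the trained VAE decoder."""
    rng = np.random.default_rng(seed)
    samples = []
    for _ in range(n_samples):
        z_new = np.array(z, dtype=float)
        # Replace only the inferred nuisance dimensions with prior draws.
        z_new[nuisance_dims] = rng.standard_normal(len(nuisance_dims))
        samples.append(decode(z_new))
    return samples
```

Passing the random graph's nuisance set instead of the inferred one yields the bottom-row comparison samples.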
This validates that, consistent with MountainCar and Hopper, our approach does indeed identify true causes in Pong.
Necessity of disentanglement. Our intervention method assumes a disentangled representation of state. Otherwise, each of the n individual dimensions in the state might capture both causes as well as nuisance variables, and the problem of discovering true causes is no longer reducible to searching over 2^n graphs.
To test this empirically, we create a variant of our MountainCar CONFOUNDED testbed, where the 3-D past action-augmented state vector is rotated by a fixed, random rotation. After training the graph-conditioned policies on the entangled and disentangled CONFOUNDED state, and applying 30 episodes of policy execution intervention or 20 expert queries, we get the results shown in Tab 2. The results are significantly lower in the entangled than in the disentangled (non-rotated) setting, indicating disentanglement is important for the effectiveness of our approach.

Table 2: Intervention on (dis)entangled MountainCar.

Mode              Representation  Reward
Policy execution  Disentangled    -137
Policy execution  Entangled       -145
Expert queries    Disentangled    -140
Expert queries    Entangled       -165

6 Conclusions

We have identified a naturally occurring and fundamental problem in imitation learning, "causal misidentification", and proposed a causally motivated approach for resolving it. While we observe evidence for causal misidentification arising in natural imitation learning settings, we have thus far validated our solution in somewhat simpler synthetic settings intended to mimic them. Extending our solution to work for such realistic scenarios is an exciting direction for future work. Finally, apart from imitation, general machine learning systems deployed in the real world also encounter "feedback" [47, 2], which opens the door to causal misidentification.
We hope to address these more general settings in the future.

Acknowledgments: We would like to thank Karthikeyan Shanmugam and Shane Gu for pointers to prior work early in the project, and Yang Gao, Abhishek Gupta, Marvin Zhang, Alyosha Efros, and Roberto Calandra for helpful discussions in various stages of the project. We are also grateful to Drew Bagnell and Katerina Fragkiadaki for helpful feedback on an earlier draft of this paper. This project was supported in part by Berkeley DeepDrive, NVIDIA, and Google.

References
[1] Brenna D Argall, Sonia Chernova, Manuela Veloso, and Brett Browning. A survey of robot learning from demonstration. Robotics and autonomous systems, 57(5):469–483, 2009.

[2] Drew Bagnell. Talk: Feedback in machine learning, 2016. URL https://youtu.be/XRSvz4UOpo4.

[3] Mayank Bansal, Alex Krizhevsky, and Abhijit Ogale. ChauffeurNet: Learning to drive by imitating the best and synthesizing the worst. Robotics: Science & Systems (RSS), 2019.

[4] Mariusz Bojarski, Davide Del Testa, Daniel Dworakowski, Bernhard Firner, Beat Flepp, Prasoon Goyal, Lawrence D Jackel, Mathew Monfort, Urs Muller, Jiakai Zhang, et al. End to end learning for self-driving cars. arXiv preprint arXiv:1604.07316, 2016.

[5] Christopher P Burgess, Irina Higgins, Arka Pal, Loic Matthey, Nick Watters, Guillaume Desjardins, and Alexander Lerchner. Understanding disentangling in β-vae. arXiv preprint arXiv:1804.03599, 2018.

[6] Tian Qi Chen, Xuechen Li, Roger Grosse, and David Duvenaud. Isolating sources of disentanglement in variational autoencoders. arXiv preprint arXiv:1802.04942, 2018.

[7] Xi Chen, Yan Duan, Rein Houthooft, John Schulman, Ilya Sutskever, and Pieter Abbeel. InfoGAN: Interpretable representation learning by information maximizing generative adversarial nets.
In Advances in neural information processing systems, pages 2172–2180, 2016.

[8] Felipe Codevilla, Eder Santana, Antonio M López, and Adrien Gaidon. Exploring the limitations of behavior cloning for autonomous driving. International Conference on Computer Vision (ICCV), 2019.

[9] Hal Daumé, John Langford, and Daniel Marcu. Search-based structured prediction. Machine learning, 75(3):297–325, 2009.

[10] Frederick Eberhardt. Introduction to the foundations of causal discovery. International Journal of Data Science and Analytics, 3(2):81–91, 2017.

[11] Frederick Eberhardt and Richard Scheines. Interventions and causal inference. Philosophy of Science, 74(5):981–995, 2007.

[12] Weihao Gao, Sreeram Kannan, Sewoong Oh, and Pramod Viswanath. Estimating mutual information for discrete-continuous mixtures. In Advances in neural information processing systems, pages 5986–5997, 2017.

[13] Alessandro Giusti, Jerome Guzzi, Dan Ciresan, Fang-Lin He, Juan Pablo Rodriguez, Flavio Fontana, Matthias Faessler, Christian Forster, Jurgen Schmidhuber, Gianni Di Caro, Davide Scaramuzza, and Luca Gambardella. A machine learning approach to visual perception of forest trails for mobile robots. IEEE Robotics and Automation Letters, 2016.

[14] Olivier Goudet, Diviyan Kalainathan, Philippe Caillou, David Lopez-Paz, Isabelle Guyon, Michele Sebag, Aris Tritas, and Paola Tubaro. Learning functional causal models with generative neural networks. arXiv preprint arXiv:1709.05321, 2017.

[15] Isabelle Guyon, Constantin Aliferis, Greg Cooper, André Elisseeff, Jean-Philippe Pellet, Peter Spirtes, and Alexander Statnikov. Design and analysis of the causation and prediction challenge. In Causation and Prediction Challenge, pages 1–33, 2008.

[16] Tuomas Haarnoja, Haoran Tang, Pieter Abbeel, and Sergey Levine. Reinforcement learning with deep energy-based policies.
arXiv preprint arXiv:1702.08165, 2017.

[17] David Heckerman, Christopher Meek, and Gregory Cooper. A Bayesian Approach to Causal Discovery, pages 1–28. Springer Berlin Heidelberg, Berlin, Heidelberg, 2006. ISBN 978-3-540-33486-6. doi: 10.1007/3-540-33486-6_1. URL https://doi.org/10.1007/3-540-33486-6_1.

[18] Irina Higgins, Loic Matthey, Arka Pal, Christopher Burgess, Xavier Glorot, Matthew Botvinick, Shakir Mohamed, and Alexander Lerchner. beta-vae: Learning basic visual concepts with a constrained variational framework. ICLR, 2017.

[19] Jonathan Ho and Stefano Ermon. Generative adversarial imitation learning. In Advances in Neural Information Processing Systems, pages 4565–4573, 2016.

[20] Ahmed Hussein, Mohamed Medhat Gaber, Eyad Elyan, and Chrisina Jayne. Imitation learning: A survey of learning methods. ACM Computing Surveys (CSUR), 50(2):21, 2017.

[21] Eric Jang, Shixiang Gu, and Ben Poole. Categorical reparameterization with gumbel-softmax. arXiv preprint arXiv:1611.01144, 2016.

[22] Diederik P Kingma and Max Welling. Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114, 2013.

[23] Diederik P Kingma, Tim Salimans, and Max Welling. Variational dropout and the local reparameterization trick. In C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, and R. Garnett, editors, Advances in Neural Information Processing Systems 28, pages 2575–2583. Curran Associates, Inc., 2015.

[24] Philipp Krähenbühl. Free supervision from video games. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2955–2964, 2018.

[25] Michael Laskey, Jonathan Lee, Roy Fox, Anca Dragan, and Ken Goldberg. Dart: Noise injection for robust imitation learning. arXiv preprint arXiv:1703.09327, 2017.

[26] Thuc Le, Tao Hoang, Jiuyong Li, Lin Liu, Huawen Liu, and Shu Hu. A fast PC algorithm for high-dimensional causal discovery with multi-core PCs.
IEEE/ACM transactions on computational\nbiology and bioinformatics, 2016.\n\n[27] Sergey Levine. Reinforcement learning and control as probabilistic inference: Tutorial and\n\nreview. CoRR, abs/1805.00909, 2018.\n\n[28] Rosanne Liu, Joel Lehman, Piero Molino, Felipe Petroski Such, Eric Frank, Alex Sergeev,\nand Jason Yosinski. An intriguing failing of convolutional neural networks and the coordconv\nsolution. CoRR, abs/1807.03247, 2018. URL http://arxiv.org/abs/1807.03247.\n\n[29] D. Lopez-Paz, R. Nishihara, S. Chintala, B. Sch\u00f6lkopf, and L. Bottou. Discovering causal\nsignals in images. In Proceedings IEEE Conference on Computer Vision and Pattern Recognition\n(CVPR) 2017, pages 58\u201366, Piscataway, NJ, USA, July 2017. IEEE.\n\n[30] Christos Louizos, Uri Shalit, Joris M Mooij, David Sontag, Richard Zemel, and Max Welling.\nCausal effect inference with deep latent-variable models. In Advances in Neural Information\nProcessing Systems, pages 6446\u20136456, 2017.\n\n[31] Marloes H Maathuis, Diego Colombo, Markus Kalisch, and Peter B\u00fchlmann. Predicting causal\n\neffects in large-scale systems from observational data. Nature Methods, 7(4):247, 2010.\n\n[32] Chris J Maddison, Andriy Mnih, and Yee Whye Teh. The concrete distribution: A continuous\n\nrelaxation of discrete random variables. arXiv preprint arXiv:1611.00712, 2016.\n\n[33] Jeffrey Mahler and Ken Goldberg. Learning deep policies for robot bin picking by simulating\n\nrobust grasping sequences. In Conference on Robot Learning, pages 515\u2013524, 2017.\n\n[34] Jovana Mitrovic, Dino Sejdinovic, and Yee Whye Teh. Causal inference via kernel deviance\n\nmeasures. In Advances in Neural Information Processing Systems, pages 6986\u20136994, 2018.\n\n[35] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Alex Graves, Ioannis Antonoglou, Daan\nWierstra, and Martin Riedmiller. Playing atari with deep reinforcement learning. 
arXiv preprint arXiv:1312.5602, 2013.

[36] Urs Muller, Jan Ben, Eric Cosatto, Beat Flepp, and Yann L Cun. Off-road obstacle avoidance through end-to-end learning. In Advances in neural information processing systems, pages 739–746, 2006.

[37] Katharina Mülling, Jens Kober, Oliver Kroemer, and Jan Peters. Learning to select and generalize striking movements in robot table tennis. The International Journal of Robotics Research, 32(3):263–279, 2013.

[38] Judea Pearl. Causality. Cambridge university press, 2009.

[39] Jonas Peters, Joris M Mooij, Dominik Janzing, and Bernhard Schölkopf. Causal discovery with continuous additive noise models. The Journal of Machine Learning Research, 15(1):2009–2053, 2014.

[40] Jonas Peters, Dominik Janzing, and Bernhard Schölkopf. Elements of causal inference: foundations and learning algorithms. MIT press, 2017.

[41] Dean A Pomerleau. Alvinn: An autonomous land vehicle in a neural network. In Advances in neural information processing systems, pages 305–313, 1989.

[42] Stéphane Ross and Drew Bagnell. Efficient reductions for imitation learning. In Proceedings of the thirteenth international conference on artificial intelligence and statistics, pages 661–668, 2010.

[43] Stéphane Ross, Geoffrey Gordon, and Drew Bagnell. A reduction of imitation learning and structured prediction to no-regret online learning. In Proceedings of the fourteenth international conference on artificial intelligence and statistics, pages 627–635, 2011.

[44] Stefan Schaal. Is imitation learning the route to humanoid robots? Trends in cognitive sciences, 3(6):233–242, 1999.

[45] John Schulman, Sergey Levine, Pieter Abbeel, Michael Jordan, and Philipp Moritz. Trust region policy optimization.
In International Conference on Machine Learning, pages 1889–1897, 2015.

[46] John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.

[47] David Sculley, Gary Holt, Daniel Golovin, Eugene Davydov, Todd Phillips, Dietmar Ebner, Vinay Chaudhary, Michael Young, Jean-Francois Crespo, and Dan Dennison. Hidden technical debt in machine learning systems. In Advances in neural information processing systems, pages 2503–2511, 2015.

[48] Rajat Sen, Karthikeyan Shanmugam, Alexandros G Dimakis, and Sanjay Shakkottai. Identifying best interventions through online importance sampling. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pages 3057–3066, 2017.

[49] Karthikeyan Shanmugam, Murat Kocaoglu, Alexandros G Dimakis, and Sriram Vishwanath. Learning causal graphs with small interventions. In Advances in Neural Information Processing Systems, pages 3195–3203, 2015.

[50] Peter Spirtes. Introduction to causal inference. Journal of Machine Learning Research, 11(May):1643–1662, 2010.

[51] Peter Spirtes and Kun Zhang. Causal discovery and inference: concepts and recent methodological advances. In Applied informatics, 2016.

[52] Peter Spirtes, Clark N Glymour, Richard Scheines, David Heckerman, Christopher Meek, Gregory Cooper, and Thomas Richardson. Causation, prediction, and search. MIT press, 2000.

[53] Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Dropout: a simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research, 15(1):1929–1958, 2014.

[54] Mark Steyvers, Joshua B Tenenbaum, Eric-Jan Wagenmakers, and Ben Blum. Inferring causal networks from observations and interventions. Cognitive science, 27(3):453–489, 2003.

[55] Simon Tong and Daphne Koller.
Active learning for structure in Bayesian networks. In IJCAI, 2001.

[56] Dequan Wang, Coline Devin, Qi-Zhi Cai, Philipp Krähenbühl, and Trevor Darrell. Monocular plan view networks for autonomous driving. IROS, 2019.

[57] Yixin Wang and David M Blei. The blessings of multiple causes. arXiv preprint arXiv:1805.06826, 2018.

[58] Bernard Widrow and Fred W Smith. Pattern-recognizing control systems, 1964.