{"title": "Scalable End-to-End Autonomous Vehicle Testing via Rare-event Simulation", "book": "Advances in Neural Information Processing Systems", "page_first": 9827, "page_last": 9838, "abstract": "While recent developments in autonomous vehicle (AV) technology highlight substantial progress, we lack tools for rigorous and scalable testing. Real-world testing, the de facto evaluation environment, places the public in danger, and, due to the rare nature of accidents, will require billions of miles in order to statistically validate performance claims. We implement a simulation framework that can test an entire modern autonomous driving system, including, in particular, systems that employ deep-learning perception and control algorithms. Using adaptive importance-sampling methods to accelerate rare-event probability evaluation, we estimate the probability of an accident under a base distribution governing standard traffic behavior. We demonstrate our framework on a highway scenario, accelerating system evaluation by 2-20 times over naive Monte Carlo sampling methods and 10-300P times (where P is the number of processors) over real-world testing.", "full_text": "Scalable End-to-End Autonomous Vehicle\n\nTesting via Rare-event Simulation\n\nMatthew O\u2019Kelly\u2217\n\nUniversity of Pennsylvania\nmokelly@seas.upenn.edu\n\nAman Sinha\u2217\n\nStanford University\n\namans@stanford.edu\n\nHongseok Namkoong\u2217\nStanford University\n\nhnamk@stanford.edu\n\nJohn Duchi\n\nStanford University\n\njduchi@stanford.edu\n\nMassachusetts Institute of Technology\n\nRuss Tedrake\n\nrusst@mit.edu\n\nAbstract\n\nWhile recent developments in autonomous vehicle (AV) technology highlight\nsubstantial progress, we lack tools for rigorous and scalable testing. Real-world\ntesting, the de facto evaluation environment, places the public in danger, and, due\nto the rare nature of accidents, will require billions of miles in order to statistically\nvalidate performance claims. 
We implement a simulation framework that can test an entire modern autonomous driving system, including, in particular, systems that employ deep-learning perception and control algorithms. Using adaptive importance-sampling methods to accelerate rare-event probability evaluation, we estimate the probability of an accident under a base distribution governing standard traffic behavior. We demonstrate our framework on a highway scenario, accelerating system evaluation by 2-20 times over naive Monte Carlo sampling methods and 10-300P times (where P is the number of processors) over real-world testing.

1 Introduction

Recent breakthroughs in deep learning have accelerated the development of autonomous vehicles (AVs); many research prototypes now operate on real roads alongside human drivers. While advances in computer-vision techniques have made human-level performance possible on narrow perception tasks such as object recognition, several fatal accidents involving AVs underscore the importance of testing whether the perception and control pipeline—when considered as a whole system—can safely interact with humans. Unfortunately, testing AVs in real environments, the most straightforward validation framework for system-level input-output behavior, requires prohibitive amounts of time due to the rare nature of serious accidents [49]. Concretely, a recent study [29] argues that AVs need to drive "hundreds of millions of miles and, under some scenarios, hundreds of billions of miles to create enough data to clearly demonstrate their safety." Alternatively, formally verifying an AV algorithm's "correctness" [34, 2, 47, 37] is difficult since all driving policies are subject to crashes caused by other drivers [49]. 
It is unreasonable to ask that the policy be safe under all scenarios.\nUnfortunately, ruling out scenarios where the AV should not be blamed is a task subject to logical\ninconsistency, combinatorial growth in speci\ufb01cation complexity, and subjective assignment of fault.\nMotivated by the challenges underlying real-world testing and formal veri\ufb01cation, we consider\na probabilistic paradigm\u2014which we call a risk-based framework\u2014where the goal is to evaluate\nthe probability of an accident under a base distribution representing standard traf\ufb01c behavior. By\nassigning learned probability values to environmental states and agent behaviors, our risk-based\n\n\u2217Equal contribution\n\n32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montr\u00e9al, Canada.\n\n\fFigure 1. Multi-lane highway driving on I-80: (left) real image, (right) rendered image from simulator\n\nframework considers performance of the AV\u2019s policy under a data-driven model of the world. To\nef\ufb01ciently evaluate the probability of an accident, we implement a photo-realistic and physics-based\nsimulator that provides the AV with perceptual inputs (e.g. video and range data) and traf\ufb01c conditions\n(e.g. other cars and pedestrians). The simulator allows parallelized, faster-than-real-time evaluations\nin varying environments (e.g. weather, geographic locations, and aggressiveness of other cars).\nFormally, we let P0 denote the base distribution that models standard traf\ufb01c behavior and X \u223c P0\nbe a realization of the simulation (e.g. weather conditions and driving policies of other agents). For\nan objective function f : X \u2192 R that measures \u201csafety\u201d\u2014so that low values of f (x) correspond to\ndangerous scenarios\u2014our goal is to evaluate the probability of a dangerous event\n\np\u03b3 := P0(f (X) \u2264 \u03b3)\n\n(1)\nfor some threshold \u03b3. 
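As a schematic illustration (not part of the paper's toolchain), the probability in (1) can be estimated by naive Monte Carlo; here the safety measure `f` and base distribution are simple stand-ins for the simulator's time-to-collision and traffic model:

```python
import numpy as np

rng = np.random.default_rng(0)

def f(x):
    # Stand-in safety measure; in the paper this is the minimum
    # time-to-collision over a simulation rollout.
    return np.abs(x)

def sample_p0(n):
    # Stand-in base distribution P0 over simulation conditions.
    return rng.normal(loc=3.0, scale=1.0, size=n)

def naive_mc(gamma, n):
    # p_hat = (1/n) * sum of 1{f(X_i) <= gamma} with X_i ~ P0
    x = sample_p0(n)
    return np.mean(f(x) <= gamma)

p_hat = naive_mc(gamma=0.5, n=1_000_000)
print(p_hat)  # close to P(|N(3,1)| <= 0.5), roughly 0.006
```

Because the event is rare, the vast majority of these million rollouts are wasted, which is exactly the inefficiency the paper's importance-sampling machinery addresses.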
Our risk-based framework is agnostic to the complexity of the ego-policy\nand views it as a black-box module. Such an approach allows, in particular, deep-learning based\nperception systems that make formal veri\ufb01cation methods intractable.\nAn essential component of this approach is to estimate the base distribution P0 from data; we use\npublic traf\ufb01c data collected by the US Department of Transportation [36]. While such datasets do\nnot offer insights into how AVs interact with human agents\u2014this is precisely why we design our\nsimulator\u2014they illustrate the range of standard human driving behavior that the base distribution\nP0 must model. We use imitation learning [45, 41, 42, 22, 6] to learn a generative model for the\nbehavior (policy) of environment vehicles; unlike traditional imitation learning, we train an ensemble\nof models to characterize a distribution of human-like driving policies.\nAs serious accidents are rare (p\u03b3 is small), we view this as a rare-event simulation [4] problem; naive\nMonte Carlo sampling methods require prohibitively many simulation rollouts to generate dangerous\nscenarios and estimate p\u03b3. To accelerate safety evaluation, we use adaptive importance-sampling\nmethods to learn alternative distributions P\u03b8 that generate accidents more frequently. Speci\ufb01cally,\nwe use the cross-entropy algorithm [44] to iteratively approximate the optimal importance sampling\ndistribution. In contrast to simple classical settings [44, 55] which allow analytic updates to P\u03b8,\nour high-dimensional search space requires solving convex optimization problems in each iteration\n(Section 2). To address numerical instabilities of importance sampling estimators in high dimensions,\nwe carefully design search spaces and perform computations in logarithmic scale. 
Our implementation\nproduces 2-20 times as many rare events as naive Monte Carlo methods, independent of the complexity\nof the ego-policy.\nIn addition to accelerating evaluation of p\u03b3, learning a distribution P\u03b8 that frequently generates\nrealistic dangerous scenarios Xi \u223c P\u03b8 is useful for engineering purposes. The importance-sampling\ndistribution P\u03b8 not only ef\ufb01ciently samples dangerous scenarios, but also ranks them according to\ntheir likelihoods under the base distribution P0. This capability enables a deeper understanding of\nfailure modes and prioritizes their importance to improving the ego-policy.\nAs a system, our simulator allows fully distributed rollouts, making our approach orders of magni-\ntude cheaper, faster, and safer than real-world testing. Using the asynchronous messaging library\nZeroMQ [21], our implementation is fully-distributed among available CPUs and GPUs; our rollouts\nare up to 30P times faster than real time, where P is the number of processors. Combined with the\ncross-entropy method\u2019s speedup, we achieve 10-300P speedup over real-world testing.\nIn what follows, we describe components of our open-source toolchain, a photo-realistic simulator\nequipped with our data-driven risk-based framework and cross-entropy search techniques. The\n\n2\n\n\ftoolchain can test an AV as a whole system, simulating the driving policy of the ego-vehicle by\nviewing it as a black-box model. The use of adaptive-importance sampling methods motivates a unique\nsimulator architecture (Section 3) which allows real-time updates of the policies of environment\nvehicles. In Section 4, we test our toolchain by considering an end-to-end deep-learning-based\nego-policy [9] in a multi-agent highway scenario. 
Figure 1 shows one configuration of this scenario in the real world along with rendered images from the simulator, which uses Unreal Engine 4 [17]. Our experiments show that we accelerate the assessment of rare-event probabilities with respect to naive Monte Carlo methods as well as real-world testing. We believe our open-source framework is a step towards a rigorous yet scalable platform for evaluating AV systems, with the broader goal of understanding how to reliably deploy deep-learning systems in safety-critical applications.

2 Rare-event simulation

To motivate our risk-based framework, we first argue that formally verifying correctness of an AV system is infeasible due to the challenge of defining "correctness." Consider a scenario where an AV commits a traffic violation to avoid collision with an out-of-control truck approaching from behind. If the ego-vehicle decides to avoid collision by running through a red light with no further ramifications, is it "correct" to do so? The "correctness" of the policy depends on the extent to which the traffic violation endangers nearby humans and whether any element of the "correctness" specification explicitly forbids such actions. That is, "correctness" as a binary output is a concept defined by its exceptions, many elements of which are subject to individual valuations [10].

Instead of trying to verify correctness, we begin with a continuous measure of safety f : X → R, where X is the space of traffic conditions and behaviors of other vehicles. The prototypical example in this paper is the minimum time-to-collision (TTC) (see Appendix A for its definition) to other environmental agents over a simulation rollout. 
Rather than requiring safety for all x ∈ X, we relax the deterministic verification problem into a probabilistic one where we are concerned with the probability under standard traffic conditions that f(X) goes below a safety threshold. Given a distribution P0 on X, our goal is to estimate the rare-event probability $p_\gamma := P_0(f(X) \le \gamma)$ based on simulated rollouts $f(X_1), \ldots, f(X_n)$. As accidents are rare and $p_\gamma$ is near 0, we treat this as a rare-event simulation problem; see [11, 4, Chapter VI] for an overview of this topic.

First, we briefly illustrate the well-known difficulty of naive Monte Carlo simulation when $p_\gamma$ is small. From a sample $X_i \overset{iid}{\sim} P_0$, the naive Monte Carlo estimate is $\widehat{p}_{N,\gamma} := \frac{1}{N}\sum_{i=1}^N \mathbf{1}\{f(X_i) \le \gamma\}$. As $p_\gamma$ is small, we use relative accuracy to measure our performance, and the central limit theorem implies the relative accuracy is approximately

$$\left|\frac{\widehat{p}_{N,\gamma}}{p_\gamma} - 1\right| \overset{\mathrm{dist}}{\approx} \sqrt{\frac{1-p_\gamma}{N p_\gamma}}\,|Z| + o(1/\sqrt{N}) \quad \text{for } Z \sim \mathsf{N}(0,1).$$

For small $p_\gamma$, we require a sample of size $N \gtrsim 1/(p_\gamma \epsilon^2)$ to achieve $\epsilon$-relative accuracy, and if $f(X)$ is light-tailed, the sample size must grow exponentially in $\gamma$.

Cross-entropy method  As an alternative to a naive Monte Carlo estimator, we consider (adaptive) importance sampling [4], and we use a model-based optimization procedure to find a good importance-sampling distribution. The optimal importance-sampling distribution for estimating $p_\gamma$ has the conditional density $p^\star(x) = \mathbf{1}\{f(x) \le \gamma\}\, p_0(x)/p_\gamma$, where $p_0$ is the density function of $P_0$: as $p_0(x)/p^\star(x) = p_\gamma$ for all $x$ satisfying $\mathbf{1}\{f(x) \le \gamma\}$, the estimate $\widehat{p}^{\,\star}_{N,\gamma} := \frac{1}{N}\sum_{i=1}^N \frac{p_0(X_i)}{p^\star(X_i)} \mathbf{1}\{f(X_i) \le \gamma\}$ is exact. This sampling scheme is, unfortunately, de facto impossible, because we do not know $p_\gamma$. Instead, we use a parameterized importance sampler $P_\theta$ and employ an iterative model-based search method to modify $\theta$ so that $P_\theta$ approximates $P^\star$.

The cross-entropy method [44] iteratively tries to find $\theta^\star \in \operatorname{argmin}_{\theta \in \Theta} D_{\mathrm{kl}}(P^\star \,\|\, P_\theta)$, the Kullback-Leibler projection of $P^\star$ onto the class of parameterized distributions $\mathcal{P} = \{P_\theta\}_{\theta \in \Theta}$. Over iterations $k$, we maintain a surrogate distribution $q_k(x) \propto \mathbf{1}\{f(x) \le \gamma_k\}\, p_0(x)$, where $\gamma_k \ge \gamma$ is a (potentially random) proxy for the rare-event threshold $\gamma$, and we use samples from $P_\theta$ to update $\theta$ as an approximate projection of $Q_k$ onto $\mathcal{P}$. The motivation underlying this approach is to update $\theta$ so that $P_\theta$ upweights regions of X with low (i.e. unsafe) objective value $f(x)$. We fix a quantile level $\rho \in (0,1)$—usually we choose $\rho \in [0.01, 0.2]$—and use the $\rho$-quantile of $f(X)$ where $X \sim P_{\theta_k}$ as $\gamma_k$, our proxy for the rare-event threshold $\gamma$ (see [23] for alternatives). We have the additional challenge that the $\rho$-quantile of $f(X)$ is unknown, so we approximate it using i.i.d. samples $X_i \sim P_{\theta_k}$.

Algorithm 1 Cross-Entropy Method
1: Input: Quantile $\rho \in (0,1)$, Stepsizes $\{\alpha_k\}_{k \in \mathbb{N}}$, Sample sizes $\{N_k\}_{k \in \mathbb{N}}$, Number of iterations $K$
2: Initialize: $\theta_0 \in \Theta$
3: for $k = 0, 1, 2, \ldots, K-1$ do
4:   Sample $X_{k,1}, \ldots, X_{k,N_k} \overset{iid}{\sim} P_{\theta_k}$
5:   Set $\gamma_k$ as the maximum of $\gamma$ and the $\rho$-quantile of $f(X_{k,1}), \ldots, f(X_{k,N_k})$, so that $\gamma_k \ge \gamma$
6:   $\theta_{k+1} = \operatorname{argmax}_{\theta \in \Theta} \left\{\alpha_k \theta^\top D_{k+1} + (1-\alpha_k)\theta^\top \nabla A(\theta_k) - A(\theta)\right\}$

Compared to applications of the cross-entropy method [44, 55] that focus on low-dimensional problems permitting analytic updates to $\theta$, our high-dimensional search space requires solving convex optimization problems in each iteration. To address numerical challenges in computing likelihood ratios in high dimensions, our implementation carefully constrains the search space and we compute likelihoods in logarithmic scale.

We now rigorously describe the algorithmic details. First, we use natural exponential families as our class of importance samplers $\mathcal{P}$.

Definition 1. The family of density functions $\{p_\theta\}_{\theta \in \Theta}$, defined with respect to base measure $\mu$, is a natural exponential family if there exists a sufficient statistic $\Gamma$ such that $p_\theta(x) = \exp(\theta^\top \Gamma(x) - A(\theta))$, where $A(\theta) = \log \int_X \exp(\theta^\top \Gamma(x))\, d\mu(x)$ is the log partition function and $\Theta := \{\theta \mid A(\theta) < \infty\}$.

Given this family, we consider idealized updates to the parameter vector $\theta_k$ at iteration $k$, where we compute projections of a mixture of $Q_k$ and $P_{\theta_k}$ onto $\mathcal{P}$:

$$\theta_{k+1} = \operatorname{argmin}_{\theta \in \Theta} D_{\mathrm{kl}}\left(\alpha_k Q_k + (1-\alpha_k)P_{\theta_k} \,\big\|\, P_\theta\right) = \operatorname{argmax}_{\theta \in \Theta} \left\{\alpha_k E_{Q_k}[\log p_\theta(X)] + (1-\alpha_k)E_{\theta_k}[\log p_\theta(X)]\right\} = \operatorname{argmax}_{\theta \in \Theta} \left\{\alpha_k \theta^\top E_{Q_k}[\Gamma(X)] + (1-\alpha_k)\theta^\top \nabla A(\theta_k) - A(\theta)\right\}. \quad (2)$$

The term $E_{Q_k}[\Gamma(X)]$ is unknown in practice, so we use a sampled estimate. For $X_{k,1}, \ldots, X_{k,N_k} \overset{iid}{\sim} P_{\theta_k}$, let $\gamma_k$ be the $\rho$-quantile of $f(X_{k,1}), \ldots, f(X_{k,N_k})$ and define

$$D_{k+1} := \frac{1}{N_k}\sum_{i=1}^{N_k} \frac{q_k(X_{k,i})}{p_{\theta_k}(X_{k,i})}\,\Gamma(X_{k,i}) = \frac{1}{N_k}\sum_{i=1}^{N_k} \frac{p_0(X_{k,i})}{p_{\theta_k}(X_{k,i})}\,\mathbf{1}\{f(X_{k,i}) \le \gamma_k\}\,\Gamma(X_{k,i}). \quad (3)$$

Using the estimate $D_{k+1}$ in place of $E_{Q_k}[\Gamma(X)]$ in the idealized update (2), we obtain Algorithm 1. To select the final importance-sampling distribution from Algorithm 1, we choose the $\theta_k$ with the lowest $\rho$-quantile of $f(X_{k,i})$. We observe that this choice consistently improves performance over taking the last iterate or Polyak averaging. 
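As a concrete toy instance (ours, not the paper's implementation), the cross-entropy recursion can be run in one dimension with a unit-variance Gaussian sampler, taking the step size to 1 and using self-normalized weights for stability:

```python
import numpy as np

rng = np.random.default_rng(1)

GAMMA = -3.0          # rare event: f(X) <= GAMMA under the base P0 = N(0, 1)
RHO, K, N = 0.1, 20, 2000

def f(x):             # stand-in safety measure
    return x

def log_pdf(x, mu):   # unit-variance Gaussian log-density
    return -0.5 * (x - mu) ** 2 - 0.5 * np.log(2 * np.pi)

# --- Cross-entropy iterations (toy, simplified variant of Algorithm 1) ---
mu = 0.0              # theta_0: start at the base distribution
for k in range(K):
    x = rng.normal(mu, 1.0, size=N)
    # Threshold proxy: rho-quantile of f-values, clamped at the target gamma.
    gamma_k = max(GAMMA, np.quantile(f(x), RHO))
    # Likelihood-ratio weights p0 / p_theta restricted to the surrogate event.
    w = np.exp(log_pdf(x, 0.0) - log_pdf(x, mu)) * (f(x) <= gamma_k)
    # Self-normalized mean-parameter update (exponential-family projection).
    mu = np.sum(w * x) / np.sum(w)

# --- Final importance-sampling estimate of p_gamma ---
x = rng.normal(mu, 1.0, size=100_000)
lr = np.exp(log_pdf(x, 0.0) - log_pdf(x, mu))
p_hat = np.mean(lr * (f(x) <= GAMMA))
print(mu, p_hat)  # mu near -3.3; p_hat close to Phi(-3), roughly 1.35e-3
```

The learned sampler concentrates near the unsafe region, so the same number of rollouts yields far more rare events than sampling from the base distribution; in the paper this scalar update is replaced by convex optimization over Beta and Normal parameters.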
Letting $\theta_{\mathrm{ce}}$ denote the parameters for the importance sampling distribution learned by the cross-entropy method, we sample $X_i \overset{iid}{\sim} P_{\theta_{\mathrm{ce}}}$ and use

$$\widehat{p}_{N,\gamma} := \frac{1}{N}\sum_{i=1}^N \frac{p_0(X_i)}{p_{\theta_{\mathrm{ce}}}(X_i)}\,\mathbf{1}\{f(X_i) \le \gamma\}$$

as our final importance-sampling estimator for $p_\gamma$.

In the context of our rare-event simulator, we use a combination of Beta and Normal distributions for $P_\theta$. The sufficient statistics $\Gamma$ include (i) the parameters of the generative model of behaviors that our imitation-learning schemes produce and (ii) the initial poses and velocities of other vehicles, pedestrians, and obstacles in the simulation. Given a current parameter $\theta$ and a realization from the model distribution $P_\theta$, our simulator then (i) sets the parameters of the generative model for vehicle policies and draws policies from this model, and (ii) chooses random poses and velocities for the simulation. Our simulator is one of the largest-scale applications of cross-entropy methods.

3 Simulation framework

Two key considerations in our risk-based framework influence design choices for our simulation toolchain: (1) learning the base distribution P0 of nominal traffic behavior via data-driven modeling, and (2) testing the AV as a whole system. We now describe how our toolchain achieves these goals.

3.1 Data-driven generative modeling

While our risk-based framework (cf. Section 2) is a concise, unambiguous measure of system safety, the rare-event probability $p_\gamma$ is only meaningful insofar as the base distribution P0 of road conditions and the behaviors of other (human) drivers is estimable. Thus, to implement our risk-based framework, we first learn a base distribution P0 of nominal traffic behavior. Using the highway traffic dataset NGSim [36], we train policies of human drivers via imitation learning [45, 41, 42, 22, 6]. Our data consists of videos of highway traffic [36], and our goal is to create models that imitate human driving behavior even in scenarios distinct from those in the data. 
We employ an ensemble of generative\nadversarial imitation learning (GAIL) [22] models to learn P0. Our approach is motivated by the\nobservation that reducing an imitation-learning problem to supervised learning\u2014where we simply\nuse expert data to predict actions given vehicle states\u2014suffers from poor performance in regions\nof the state space not encountered in data [41, 42]. Reinforcement-learning techniques have been\nobserved to improve generalization performance, as the imitation agent is able to explore regions of\nthe state space in simulation during training that do not necessarily occur in the expert data traces.\nGenerically, GAIL is a minimax game between two functions: a discriminator D\u03c6 and a generator\nG\u03be (with parameters \u03c6 and \u03be respectively). The discriminator takes in a state-action pair (s, u)\nand outputs the probability that the pair came from real data, P(real data). The generator takes in\na state s and outputs a conditional distribution G\u03be(s) := P(u | s) of the action u to take given\nstate s. In our context, G\u03be(\u00b7) is then the (learned) policy of a human driver given environmental\ninputs s. Training the generator weights \u03be occurs in a reinforcement-learning paradigm with reward\n\u2212 log(1 \u2212 D\u03c6(s, G\u03be(s))). We use the model-based variant of GAIL (MGAIL) [6] which renders this\nreward fully differentiable with respect to \u03be over a simulation rollout, allowing ef\ufb01cient model training.\nGAIL has been validated by Kue\ufb02er et al. [33] to realistically mimic human-like driving behavior\nfrom the NGSim dataset across multiple metrics. These include the similarity of low-level actions\n(speeds, accelerations, turn-rates, jerks, and time-to-collision), as well as higher-level behaviors (lane\nchange rate, collision rate, hard-brake rate, etc). 
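The discriminator/generator interplay can be sketched with a deliberately tiny example (synthetic data, a quadratic-feature logistic discriminator in place of the paper's neural network, and no policy-gradient loop); it shows only how the generator reward $-\log(1 - D_\phi(s, u))$ is computed from a trained discriminator:

```python
import numpy as np

rng = np.random.default_rng(2)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

def features(pairs):
    # Toy quadratic features of a (state, action) pair; the paper's
    # discriminator D_phi is a neural network, not this.
    s, u = pairs[:, 0], pairs[:, 1]
    return np.stack([s, u, s * u, u ** 2, np.ones_like(s)], axis=1)

# Synthetic data: "expert" actions track the state (u ~ 0.8 s), while the
# untrained "generator" emits diffuse random actions.
s = rng.normal(size=500)
expert = np.stack([s, 0.8 * s + 0.1 * rng.normal(size=500)], axis=1)
gen = np.stack([s, rng.normal(size=500)], axis=1)

X = features(np.vstack([expert, gen]))
y = np.concatenate([np.ones(500), np.zeros(500)])  # 1 = came from real data

phi = np.zeros(X.shape[1])
for _ in range(3000):  # gradient ascent on the logistic log-likelihood
    phi += 0.1 * X.T @ (y - sigmoid(X @ phi)) / len(y)

def generator_reward(pairs):
    # GAIL's generator reward -log(1 - D_phi(s, u)): large when the
    # discriminator believes the pair came from real data.
    return -np.log(1.0 - sigmoid(features(pairs) @ phi) + 1e-12)

print(generator_reward(expert).mean(), generator_reward(gen).mean())
```

After training, expert-like pairs receive a higher reward than the diffuse generated pairs, which is the signal that drives the generator toward human-like behavior in the full (M)GAIL training loop.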
See Appendix C for a reference to an example video\nof the learned model driving in a scenario alongside data traces from human drivers.\nOur importance sampling and cross-entropy methods use not just a single instance of model parame-\nters \u03be, but rather a distribution over them to form a generative model of human driving behavior. To\nmodel this distribution, we use a (multivariate normal) parametric bootstrap over a trained ensemble\nof generators \u03bei, i = 1, . . . , m. Our models \u03bei are high-dimensional (\u03be \u2208 Rd, d > m) as they\ncharacterize the weights of large neural networks, so we employ the graphical lasso [15] to \ufb01t the\ninverse covariance matrix for our ensemble. This approach to modeling uncertainty in neural-network\nweights is similar to the bootstrap approach of Osband et al. [38]. Other approaches include using\ndropout for inference [16] and variational methods [18, 8, 31].\nWhile several open source driving simulators have been proposed [14, 48, 39], our problem formula-\ntion requires unique features to allow sampling from a continuous distribution of driving policies for\nenvironmental agents. Conditional on each sample of model parameters \u03be, the simulator constructs\na (random) rollout of vehicle behaviors according to G\u03be. Unlike other existing simulators, ours is\ndesigned to ef\ufb01ciently execute and update these policies as new samples \u03be are drawn for each rollout.\n\n3.2 System architecture\n\nThe second key characteristic of our framework is that it enables black-box testing the AV as a whole\nsystem. Flaws in complex systems routinely occur at poorly speci\ufb01ed interfaces between components,\nas interactions between processes can induce unexpected behavior. Consequently, solely testing\nsubcomponents of an AV control pipeline separately is insuf\ufb01cient [1]. 
Moreover, it is increasingly common for manufacturers to utilize software and hardware artifacts for which they do not have any whitebox model [19, 12]. We provide a concise but extensible language-agnostic interface to our benchmark world model so that common AV sensors such as cameras and lidar can provide the necessary inputs to induce vehicle actuation commands.

Our simulator is a distributed, modular framework, which is necessary to support the inclusion of new AV systems and updates to the environment-vehicle policies. A benefit of this design is that simulation rollouts are simple to parallelize. In particular, we allow instantiation of multiple simulations simultaneously, without requiring that each include the entire set of components. For example, a desktop may support only one instance of Unreal Engine but could be capable of simulating 10 physics simulations in parallel; it would be impossible to fully utilize the compute resource with a monolithic executable wrapping all engines together. Our architecture enables instances of the components to be distributed on heterogeneous GPU compute clusters while maintaining the ability to perform meaningful analysis locally on commodity desktops. In Appendix A, we detail our scenario specification, which describes how Algorithm 1 maps onto our distributed architecture.

4 Experiments

In this section, we demonstrate our risk-based framework on a multi-agent highway scenario. As the rare-event probability of interest $p_\gamma$ gets smaller, the cross-entropy method learns to sample more rare events compared to naive Monte Carlo sampling; we empirically observe that the cross-entropy method produces 2-20 times as many rare events as its naive counterpart. 
Our \ufb01ndings hold across\ndifferent ego-vehicle policies, base distributions P0, and scenarios.\nTo highlight the modularity of our simulator, we evaluate the rare-event probability p\u03b3 on two\ndifferent ego-vehicle policies. The \ufb01rst is an instantiation of an imitation learning (non-vision) policy\nwhich uses lidar as its primary perceptual input. Secondly, we investigate a vision-based controller\n(vision policy), where the ego-vehicle drives with an end-to-end highway autopilot network [9],\ntaking as input a rendered image from the simulator (and lidar observations) and outputting actuation\ncommands. See Appendix B for a summary of network architectures used.\nWe consider a scenario consisting of six agents, \ufb01ve of which are considered part of the environment.\nThe environment vehicles\u2019 policies follow the distribution learned in Section 3.1. All vehicles are\nconstrained to start within a set of possible initial con\ufb01gurations consisting of pose and velocity,\nand each vehicle has a goal of reaching the end of the approximately 2 km stretch of road. Fig. 1\nshows one such con\ufb01guration of the scenario, along with rendered images from the simulator. We\ncreate scene geometry based on surveyors\u2019 records and photogrammetric reconstructions of satellite\nimagery of the portion of I-80 in Emeryville, California where the traf\ufb01c data was collected [36].\nSimulation parameters We detail our postulated base distribution P0. Letting m denote the number\nof vehicles, we consider the random tuple X = (S, T, W, V, \u03be) as our simulation parameter where\nthe pair (S, T ) \u2208 Rm\u00d72\nindicates the two-dimensional positioning of each vehicle in their respective\nlanes (in meters), W the orientation of each vehicle (in degrees), and V the initial velocity of each\nvehicle (in meters per second). 
We use \u03be \u2208 R404 to denote the weights of the last layer of the neural\nnetwork trained to imitate human-like driving behavior. Speci\ufb01cally, we set S \u223c 40Beta(2, 2) + 80\nwith respect to the starting point of the road, T \u223c 0.5Beta(2, 2) \u2212 0.25 with respect to the lane\u2019s\ncenter, W \u223c 7.2Beta(2, 2) \u2212 3.6 with respect to facing forward, and V \u223c 10Beta(2, 2) + 10. We\nassume \u03be \u223c N (\u00b50, \u03a30), with the mean and covariance matrices learned via the ensemble approach\noutlined in Section 3.1. The neural network whose last layer is parameterized by \u03be describes the\npolicy of environment vehicles; it takes as input the state of the vehicle and lidar observations of the\nsurrounding environment (see Appendix B for more details). Throughout this section, we de\ufb01ne our\nmeasure of safety f : X \u2192 R as the minimum time-to-collision (TTC) over the simulation rollout.\nWe calculate TTC from the center of mass of the ego vehicle; if the ego-vehicle\u2019s body crashes into\nobstacles, we end the simulation before the TTC can further decrease (see Appendix A for details).\nCross-entropy method Throughout our experiments, we impose constraints on the space of\nimportance samplers (adversarial distributions) for feasibility. Numerical stability considerations\npredominantly drive our hyperparameter choices. For model parameters \u03be, we also constrain the\nsearch space to ensure that generative models G\u03be maintain reasonably realistic human-like policies\n(recall Sec. 3.1). For S, T, W , and V , we let {Beta(\u03b1, \u03b2) : \u03b1, \u03b2 \u2208 [1.5, 7]} be the model space over\nwhich the cross-entropy method searches, scaled and centered appropriately to match the scale of the\nrespective base distributions. 
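The scaled Beta marginals of the base distribution above can be sampled directly; this sketch (ours, for illustration) draws initial conditions for m = 5 environment vehicles, with the policy weights $\xi$ additionally drawn from the fitted $\mathsf{N}(\mu_0, \Sigma_0)$ in the full pipeline:

```python
import numpy as np

rng = np.random.default_rng(3)
m = 5  # number of environment vehicles

def sample_initial_conditions():
    # Scaled and shifted Beta(2, 2) marginals of the base distribution P0.
    S = 40 * rng.beta(2, 2, size=m) + 80      # position along road (m)
    T = 0.5 * rng.beta(2, 2, size=m) - 0.25   # offset from lane center (m)
    W = 7.2 * rng.beta(2, 2, size=m) - 3.6    # orientation (degrees)
    V = 10 * rng.beta(2, 2, size=m) + 10      # initial velocity (m/s)
    return S, T, W, V

S, T, W, V = sample_initial_conditions()
print(S.round(1), V.round(1))
```

Each draw lies in the stated supports (e.g. S in [80, 120] m, V in [10, 20] m/s), and the cross-entropy search then re-tilts the Beta shape parameters within [1.5, 7] to favor dangerous configurations.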
We restrict the search space of distributions over $\xi \in \mathbb{R}^{404}$ by searching over $\{\mathsf{N}(\mu, \Sigma_0) : \|\mu - \mu_0\|_\infty \le 0.01\}$, where $(\mu_0, \Sigma_0)$ are the parameters of the base (bootstrap) distribution. For our importance sampling distribution $P_\theta$, we use products of the above marginal distributions. These restrictions on the search space mitigate numerical instabilities in computing likelihood ratios within our optimization routines, which is important for our high-dimensional problems.

We first illustrate the dependence of the cross-entropy method on its hyperparameters. We choose to use a non-vision ego-vehicle policy as a test bed for hyperparameter tuning, since this allows us to take advantage of the fastest simulation speeds for our experiments. We focus on the effects (in Algorithm 1) of varying the most influential hyperparameter, $\rho \in (0, 1]$, which is the quantile level determining the rarity of the observations used to compute the importance sampler $\theta_k$. Intuitively, as $\rho$ approaches 0, the cross-entropy method learns importance samplers $P_\theta$ that up-weight unsafe regions of X with lower $f(x)$, increasing the frequency of sampling rare events (events with $f(X) \le \gamma$). In order to avoid overfitting $\theta_k$ as $\rho \to 0$, we need to increase $N_k$ as $\rho$ decreases. Our choice of $N_k$ is borne out of computational constraints, as it is the biggest factor that determines the run-time of the cross-entropy method. Consistent with prior works [44, 24], we observe empirically that $\rho \in [0.01, 0.2]$ is a good range for the values of $N_k$ deemed feasible for our computational budget ($N_k = 1000 \sim 5000$). We fix the number of iterations at $K = 100$, the number of samples taken per iteration at $N_k = 5000$, the step size for updates at $\alpha_k = 0.8$, and $\gamma = 0.14$. As we see below, we consistently observe that the cross-entropy method learns to sample significantly more rare events, despite the high-dimensional nature ($d \approx 500$) of the problem.

To evaluate the learned parameters, we draw $n = 10^5$ samples from the importance sampling distribution to form an estimate of $p_\gamma$. In Figure 2, we vary $\rho$ and report the relative performance of the cross-entropy method compared to naive Monte Carlo sampling.

Figure 2. The ratio of (a) number of rare events and (b) variance of the estimator for $p_\gamma$ between the cross-entropy method and naive MC sampling for the non-vision ego policy. Rarity is inversely proportional to $\gamma$, and, as expected, we see the best performance for our method over naive MC at small $\gamma$.

Table 1. Estimate of rare-event probability $p_\gamma$ (non-vision ego policy) with standard errors. For the cross-entropy method, we show results for the learned importance sampling distribution with $\rho = 0.01$.

| Search Algorithm    | $\gamma_{\mathrm{test}} = 0.14$ | $\gamma_{\mathrm{test}} = 0.15$ | $\gamma_{\mathrm{test}} = 0.19$ | $\gamma_{\mathrm{test}} = 0.20$ |
| ------------------- | ---------------- | ---------------- | --------------- | ---------------- |
| Naive 1300K         | (12.4±3.1)e-6    | (80.6±7.91)e-6   | (133±3.2)e-5    | (186±3.79)e-5    |
| Cross-entropy 100K  | (19.8±8.88)e-6   | (66.1±15)e-6     | (108±9.51)e-5   | (164±14)e-5      |
| Naive 100K          | (20±14.1)e-6     | (100±31.6)e-6    | (132±11.5)e-5   | (185±13.6)e-5    |
Even though we set γ = 0.14 in Algorithm 1, we evaluate the performance of all models with respect to multiple threshold levels γtest. We note that as ρ approaches 0, the cross-entropy method learns to sample increasingly rare events more frequently; the cross-entropy method yields 3-10 times as many dangerous scenarios, and achieves 2-16 times variance reduction depending on the threshold level γtest. In Table 1, we contrast the estimates provided by naive Monte Carlo with the importance sampling estimator provided by the cross-entropy method with ρ = 0.01; to form a baseline estimate, we run naive Monte Carlo with 1.3 · 10^6 samples. For a given number of samples, the cross-entropy method with ρ = 0.01 provides more precise estimates of the rare-event probability pγ ≈ 10^−5 than naive Monte Carlo.

We now leverage the tuned hyperparameter (ρ = 0.01) for our main experiment: evaluating the probability of a dangerous event for the vision-based ego policy. We find that the hyperparameters for the cross-entropy method generalize, allowing us to produce good importance samplers for a very different policy without further tuning. Based on our computational budget (with our current implementation, vision-based simulations run about 15 times slower than simulations with only non-vision policies), we choose K = 20 and Nk = 1000 for the cross-entropy method to learn a good importance sampling distribution for the vision-based policy (although we also observe similar behavior for Nk as small as 100). In Figure 3, we illustrate again that the cross-entropy method learns to sample dangerous scenarios more frequently (Figure 3a), up to 18 times as often as naive Monte Carlo, and produces importance sampling estimators with lower variance (Figure 3b). As a result, our estimator in Table 2 is better calibrated than that computed from naive Monte Carlo.

Figure 3. The ratio of (a) the number of rare events and (b) the variance of the estimator for pγ between the cross-entropy method and naive MC sampling for the vision-based ego policy.

Search Algorithm      γtest = 0.22     γtest = 0.23     γtest = 0.24     γtest = 0.25
Cross-entropy 50K     (5.87±1.82)e-5   (13.0±2.94)e-5   (19.0±3.14)e-5   (4.52±1.35)e-4
Naive 50K             (11.3±4.60)e-5   (20.6±6.22)e-5   (43.2±9.00)e-5   (6.75±1.13)e-4

Table 2. Estimate of rare-event probability pγ (vision-based ego policy) with standard errors. For the cross-entropy method, we show results for the learned importance sampling distribution with ρ = 0.01.

Qualitative analysis. We provide a qualitative interpretation of the learned parameters of the importance sampler. For initial velocities, angles, and positioning of vehicles, the importance sampler shifts environmental vehicles to box in the ego-vehicle and increases the speeds of trailing vehicles by 20%, making accidents more frequent. We also observe that the learned distribution for initial conditions has variance 50% smaller than that of the base distribution, implying concentration around adversarial conditions. Perturbing the policy weights ξ for GAIL increases the frequency of risky high-level behaviors (lane-change rate, hard-brake rate, etc.). An interesting consequence of using our definition of TTC from the center of the ego vehicle (cf. Appendix A) as a measure of safety is that dangerous events f(X) ≤ γtest (for small γtest) include frequent sideswiping behavior, as such accidents result in smaller TTC values than front- or rear-end collisions.
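To make the safety measure concrete, the sketch below computes a constant-velocity time to closest approach between two point vehicles. It is a deliberately simplified stand-in for the TTC measure used above (the paper's definition, given in its Appendix A, is computed from the center of the ego vehicle over full rollouts; the helper names here are illustrative):

```python
import numpy as np

def time_to_closest_approach(p_ego, v_ego, p_other, v_other):
    """Simplified TTC stand-in: time at which two constant-velocity point
    vehicles are closest; inf if they are already separating."""
    dp = np.asarray(p_other, dtype=float) - np.asarray(p_ego, dtype=float)  # relative position
    dv = np.asarray(v_other, dtype=float) - np.asarray(v_ego, dtype=float)  # relative velocity
    closing = -float(dp @ dv)           # > 0 exactly when the gap is shrinking
    if closing <= 0.0:
        return float("inf")             # not on a closing course
    return closing / float(dv @ dv)     # argmin_t ||dp + t * dv||

def min_ttc(trajectory):
    """f(X): minimum TTC over a rollout of (p_ego, v_ego, p_other, v_other)
    states; a dangerous scenario is one with min_ttc(X) <= gamma."""
    return min(time_to_closest_approach(*state) for state in trajectory)
```

For example, a trailing vehicle 20 m behind the ego vehicle and closing at 10 m/s has a TTC of 2 s, while a vehicle falling behind never closes the gap and returns infinity; thresholding the rollout minimum at γ then flags the dangerous scenarios.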
See Appendix C for a reference to supplementary videos that exhibit the range of behavior across many levels γtest. The modularity of our simulation framework allows us to easily swap the safety objective for an alternative definition of TTC, or even for more sophisticated notions of safety, e.g. temporal-logic specifications or implementations of responsibility-sensitive safety (RSS) [49, 40].

5 Related work and conclusions

Given the complexity of AV software and hardware components, it is unlikely that any single method will serve as an oracle for certification. Many existing tools are complementary to our risk-based framework. In this section, we compare and contrast representative results in testing, verification, and simulation.

AV testing generally consists of three paradigms. The first, largely attributable to regulatory efforts, uses a finite set of basic competencies (e.g. the Euro NCAP Test Protocol [46]); while this methodology is successful in designing safety features such as airbags and seat-belts, the non-adaptive nature of static testing is less effective for the complex software systems found in AVs. Alternatively, real-world testing—deployment of vehicles with human oversight—exposes the vehicle to a wider variety of unpredictable test conditions.
However, as we outlined above, these methods pose a danger to the public and require a prohibitive number of driving hours due to the rare nature of accidents [29]. Simulation-based falsification (in our context, simply finding any crash) has also been successfully utilized [51]; this approach does not maintain a link to the likelihood of the occurrence of a particular event, which we believe to be key in prioritizing and correcting AV behavior.

Formal verification methods [34, 2, 47, 37] have emerged as a candidate to reduce the intractability of empirical validation. A verification procedure considers whether the system can ever violate a specification and returns either a proof that there is no such execution or a counterexample. Verification procedures require a white-box description of the system (although it may be abstract), as well as a mathematically precise specification. Due to the impossibility of certifying safety in all scenarios, these approaches [49] require further specifications that assign blame in the case of a crash. Such assignment of blame is impossible to characterize completely and relies on subjective notions of fault. Our risk-based framework circumvents this difficulty by using a measure of safety that does not assign blame (e.g. TTC) and replacing the specifications that assign blame with a probabilistic notion of how likely the accident is.
While this approach requires a learned model of the world P0—a highly nontrivial statistical task in itself—the adaptive importance sampling techniques we employ can still efficiently identify dangerous scenarios even when P0 is not completely accurate. Conceptually, we view verification and our framework as complementary; they form powerful tools that can evaluate safety before deploying a fleet for real-world testing.

Even given a consistent and complete notion of blame, verification remains highly intractable from a computational standpoint. Efficient algorithms exist only for restricted classes of systems in the domain of AVs, and they are fundamentally difficult to scale. Specifically, AVs—unlike previous successful applications of verification methods to domains such as microprocessors [5]—include both continuous and discrete dynamics. This class of dynamics falls within the purview of hybrid systems [35], for which exhaustive verification is largely undecidable [20].

Verifying individual components of the perception pipeline, even as standalone systems, is a nascent, active area of research (see [3, 13, 7] and many others). Current subsystem verification techniques for deep neural networks [28, 30, 50] do not scale to state-of-the-art models and largely investigate the robustness of the network with respect to small perturbations of a single sample. There are two key assumptions in these works: the label of the input is unchanged within the radius of allowable perturbations, and the resulting expansion of the test set covers a meaningful portion of possible inputs to the network. Unfortunately, in realistic AV settings it is likely that perturbations to the state of the world (which in turn generates the image) should change the label.
Furthermore, the combinatorial nature of scenario configurations casts serious doubt on any claims of coverage.

In our risk-based framework, we replace the complex system specifications required by formal verification methods with a model P0 that we learn via imitation-learning techniques. Generative adversarial imitation learning (GAIL) was first introduced by Ho and Ermon [22] as a way to learn policies directly from data, and has since been applied to model human driving behavior by Kuefler et al. [33]. Model-based GAIL (MGAIL) is the specific variant of GAIL that we employ; introduced by Baram et al. [6], MGAIL's generative model is fully differentiable, allowing efficient model training with standard stochastic approximation methods.

The cross-entropy method was introduced by Rubinstein [43] and has attracted interest in many rare-event simulation scenarios [44, 32]. More broadly, it can be thought of as a model-based optimization method [24–26, 53, 27, 56]. With respect to assessing the safety of AVs, the cross-entropy method has recently been applied to simple lane-changing and car-following scenarios in two dimensions [54, 55]. Our work significantly extends these works by implementing a photo-realistic simulator that can assess the deep-learning-based perception pipeline along with the control framework. We leave the development of rare-event simulation methods that scale better with dimension as future work.

To summarize, a fundamental tradeoff emerges when comparing the requirements of our risk-based framework to those of other testing paradigms, such as real-world testing or formal verification. Real-world testing endangers the public but is still in some sense a gold standard.
Verified subsystems provide evidence that the AV should drive safely even if the estimated distribution shifts, but verification techniques are limited by computational intractability as well as the need for both white-box models and the completeness of specifications that assign blame (e.g. [49]). In turn, our risk-based framework is most useful when the base distribution P0 is accurate, but even when P0 is misspecified, our adaptive importance sampling techniques can still efficiently identify dangerous scenarios, especially those that may be missed by verification methods assigning blame. Our framework offers significant speedups over real-world testing and allows efficient evaluation of black-box AV input/output behavior, providing a powerful tool to aid in the design of safe AVs.

Acknowledgments

MOK was partially supported by a National Science Foundation Graduate Research Fellowship. AS was partially supported by a Stanford Graduate Fellowship and a Fannie & John Hertz Foundation Fellowship. HN was partially supported by a Samsung Fellowship and the SAIL-Toyota Center for AI Research. JD was partially supported by National Science Foundation award NSF-CAREER-1553086.

References

[1] H. Abbas, M. O'Kelly, A. Rodionova, and R. Mangharam. Safe at any speed: A simulation-based test harness for autonomous vehicles. LNCS. Springer, 2018.

[2] M. Althoff and J. Dolan. Online verification of automated road vehicles using reachability analysis. IEEE Transactions on Robotics, 30(4):903–918, Aug. 2014. doi: 10.1109/TRO.2014.2312453.

[3] S. Arora, A. Bhaskara, R. Ge, and T. Ma. Provable bounds for learning some deep representations. In International Conference on Machine Learning, pages 584–592, 2014.

[4] S. Asmussen and P. W. Glynn. Stochastic Simulation: Algorithms and Analysis. Springer, 2007.

[5] C. Baier, J.-P. Katoen, et al.
Principles of Model Checking. MIT Press, Cambridge, 2008.

[6] N. Baram, O. Anschel, I. Caspi, and S. Mannor. End-to-end differentiable adversarial imitation learning. In International Conference on Machine Learning, pages 390–399, 2017.

[7] P. L. Bartlett, D. J. Foster, and M. J. Telgarsky. Spectrally-normalized margin bounds for neural networks. In Advances in Neural Information Processing Systems, pages 6241–6250, 2017.

[8] C. Blundell, J. Cornebise, K. Kavukcuoglu, and D. Wierstra. Weight uncertainty in neural networks. arXiv preprint arXiv:1505.05424, 2015.

[9] M. Bojarski, D. Del Testa, D. Dworakowski, B. Firner, B. Flepp, P. Goyal, L. D. Jackel, M. Monfort, U. Muller, J. Zhang, et al. End to end learning for self-driving cars. arXiv preprint arXiv:1604.07316, 2016.

[10] J.-F. Bonnefon, A. Shariff, and I. Rahwan. The social dilemma of autonomous vehicles. Science, 352(6293):1573–1576, 2016. doi: 10.1126/science.aaf2654. URL http://science.sciencemag.org/content/352/6293/1573.

[11] J. Bucklew. Introduction to Rare Event Simulation. Springer Science & Business Media, 2013.

[12] M. Cheah, S. A. Shaikh, J. Bryans, and H. N. Nguyen. Combining third party components securely in automotive systems. In IFIP International Conference on Information Security Theory and Practice, pages 262–269. Springer, 2016.

[13] N. Cohen, O. Sharir, and A. Shashua. On the expressive power of deep learning: A tensor analysis. In Conference on Learning Theory, pages 698–728, 2016.

[14] A. Dosovitskiy, G. Ros, F. Codevilla, A. Lopez, and V. Koltun. CARLA: An open urban driving simulator. In Proceedings of the 1st Annual Conference on Robot Learning, pages 1–16, 2017.

[15] J. Friedman, T. Hastie, and R. Tibshirani. Sparse inverse covariance estimation with the graphical lasso. Biostatistics, 9(3):432–441, 2008.

[16] Y. Gal and Z. Ghahramani.
Dropout as a Bayesian approximation: Representing model uncertainty in deep learning. In International Conference on Machine Learning, pages 1050–1059, 2016.

[17] Epic Games. Unreal Engine 4 documentation. URL https://docs.unrealengine.com/latest/INT/index.html, 2015.

[18] A. Graves. Practical variational inference for neural networks. In Advances in Neural Information Processing Systems, pages 2348–2356, 2011.

[19] H. Heinecke, K.-P. Schnelle, H. Fennel, J. Bortolazzi, L. Lundh, J. Leflour, J.-L. Maté, K. Nishikawa, and T. Scharnhorst. Automotive open system architecture: an industry-wide initiative to manage the complexity of emerging automotive E/E-architectures. Technical report, SAE Technical Paper, 2004.

[20] T. A. Henzinger, P. W. Kopke, A. Puri, and P. Varaiya. What's decidable about hybrid automata? Journal of Computer and System Sciences, 57(1):94–124, 1998.

[21] P. Hintjens. ZeroMQ: Messaging for Many Applications. O'Reilly Media, 2013.

[22] J. Ho and S. Ermon. Generative adversarial imitation learning. In Advances in Neural Information Processing Systems, pages 4565–4573, 2016.

[23] T. Homem-de-Mello. A study on the cross-entropy method for rare-event probability estimation. INFORMS Journal on Computing, 19(3):381–394, 2007.

[24] J. Hu and P. Hu. On the performance of the cross-entropy method. In Proceedings of the 2009 Winter Simulation Conference (WSC), pages 459–468. IEEE, 2009.

[25] J. Hu and P. Hu. Annealing adaptive search, cross-entropy, and stochastic approximation in global optimization. Naval Research Logistics (NRL), 58(5):457–477, 2011.

[26] J. Hu, P. Hu, and H. S. Chang. A stochastic approximation framework for a class of randomized optimization algorithms. IEEE Transactions on Automatic Control, 57(1):165–178, 2012.

[27] J. Hu, E. Zhou, and Q. Fan.
Model-based annealing random search with stochastic averaging. ACM Transactions on Modeling and Computer Simulation (TOMACS), 24(4):21, 2014.

[28] X. Huang, M. Kwiatkowska, S. Wang, and M. Wu. Safety verification of deep neural networks. In International Conference on Computer Aided Verification, pages 3–29. Springer, 2017.

[29] N. Kalra and S. M. Paddock. Driving to safety: How many miles of driving would it take to demonstrate autonomous vehicle reliability? Transportation Research Part A: Policy and Practice, 94:182–193, 2016.

[30] G. Katz, C. Barrett, D. Dill, K. Julian, and M. Kochenderfer. Reluplex: An efficient SMT solver for verifying deep neural networks. arXiv:1702.01135 [cs.AI], 2017.

[31] D. P. Kingma, T. Salimans, and M. Welling. Variational dropout and the local reparameterization trick. In Advances in Neural Information Processing Systems, pages 2575–2583, 2015.

[32] D. P. Kroese, R. Y. Rubinstein, and P. W. Glynn. The cross-entropy method for estimation. Handbook of Statistics: Machine Learning: Theory and Applications, 31:19–34, 2013.

[33] A. Kuefler, J. Morton, T. Wheeler, and M. Kochenderfer. Imitating driver behavior with generative adversarial networks. In 2017 IEEE Intelligent Vehicles Symposium (IV), pages 204–211. IEEE, 2017.

[34] M. Kwiatkowska, G. Norman, and D. Parker. PRISM 4.0: Verification of probabilistic real-time systems. In International Conference on Computer Aided Verification, pages 585–591. Springer, 2011.

[35] J. Lygeros. Lecture notes on hybrid systems. In Notes for an ENSIETA workshop, 2004.

[36] U.S. Department of Transportation, FHWA. NGSIM: Next Generation Simulation, 2008.

[37] M. O'Kelly, H. Abbas, S. Gao, S. Shiraishi, S. Kato, and R. Mangharam. APEX: Autonomous vehicle plan verification and execution. Volume 1, Apr. 2016.

[38] I. Osband, C. Blundell, A. Pritzel, and B. Van Roy.
Deep exploration via bootstrapped DQN. In Advances in Neural Information Processing Systems, pages 4026–4034, 2016.

[39] C. Quiter and M. Ernst. Deepdrive. https://github.com/deepdrive/deepdrive, 2018.

[40] N. Roohi, R. Kaur, J. Weimer, O. Sokolsky, and I. Lee. Self-driving vehicle verification towards a benchmark. arXiv preprint arXiv:1806.08810, 2018.

[41] S. Ross and D. Bagnell. Efficient reductions for imitation learning. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, pages 661–668, 2010.

[42] S. Ross, G. Gordon, and D. Bagnell. A reduction of imitation learning and structured prediction to no-regret online learning. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pages 627–635, 2011.

[43] R. Y. Rubinstein. Combinatorial optimization, cross-entropy, ants and rare events. In Stochastic Optimization: Algorithms and Applications, pages 303–363. Springer, 2001.

[44] R. Y. Rubinstein and D. P. Kroese. The Cross-Entropy Method: A Unified Approach to Monte Carlo Simulation, Randomized Optimization and Machine Learning. Information Science & Statistics. Springer, New York, 2004.

[45] S. Russell. Learning agents for uncertain environments. In Proceedings of the Eleventh Annual Conference on Computational Learning Theory, pages 101–103. ACM, 1998.

[46] R. Schram, A. Williams, and M. van Ratingen. Implementation of autonomous emergency braking (AEB), the next step in Euro NCAP's safety assessment. ESV, Seoul, 2013.

[47] S. A. Seshia, D. Sadigh, and S. S. Sastry. Formal methods for semi-autonomous driving. In Proceedings of the 52nd Annual Design Automation Conference, page 148. ACM, 2015.

[48] S. Shah, D. Dey, C. Lovett, and A. Kapoor. AirSim: High-fidelity visual and physical simulation for autonomous vehicles. In Field and Service Robotics, 2017.
URL https://arxiv.org/abs/1705.05065.

[49] S. Shalev-Shwartz, S. Shammah, and A. Shashua. On a formal model of safe and scalable self-driving cars. arXiv preprint arXiv:1708.06374, 2017.

[50] V. Tjeng and R. Tedrake. Verifying neural networks with mixed integer programming. arXiv:1711.07356 [cs.LG], 2017.

[51] C. E. Tuncali, T. P. Pavlic, and G. Fainekos. Utilizing S-TaLiRo as an automatic test generation framework for autonomous vehicles. In 2016 IEEE 19th International Conference on Intelligent Transportation Systems (ITSC), pages 1470–1475. IEEE, 2016.

[52] K. Vogel. A comparison of headway and time to collision as safety indicators. Accident Analysis & Prevention, 35(3):427–433, 2003.

[53] Z. B. Zabinsky. Stochastic Adaptive Search for Global Optimization, volume 72. Springer Science & Business Media, 2013.

[54] D. Zhao. Accelerated Evaluation of Automated Vehicles. Ph.D. thesis, Department of Mechanical Engineering, University of Michigan, 2016.

[55] D. Zhao, X. Huang, H. Peng, H. Lam, and D. J. LeBlanc. Accelerated evaluation of automated vehicles in car-following maneuvers. IEEE Transactions on Intelligent Transportation Systems, 19(3):733–744, 2018.

[56] E. Zhou and J. Hu. Gradient-based adaptive stochastic search for non-differentiable optimization. IEEE Transactions on Automatic Control, 59(7):1818–1832, 2014.