{"title": "Bias Correction of Learned Generative Models using Likelihood-Free Importance Weighting", "book": "Advances in Neural Information Processing Systems", "page_first": 11058, "page_last": 11070, "abstract": "A learned generative model often produces biased statistics relative to the underlying data distribution. A standard technique to correct this bias is importance sampling, where samples from the model are weighted by the likelihood ratio under model and true distributions. When the likelihood ratio is unknown, it can be estimated by training a probabilistic classifier to distinguish samples from the two distributions. We employ this likelihood-free importance weighting method to correct for the bias in generative models. We find that this technique consistently improves standard goodness-of-fit metrics for evaluating the sample quality of state-of-the-art deep generative models, suggesting reduced bias. Finally, we demonstrate its utility on representative applications in a) data augmentation for classification using generative adversarial networks, and b) model-based policy evaluation using off-policy data.", "full_text": "Bias Correction of Learned Generative Models using\n\nLikelihood-Free Importance Weighting\n\nAditya Grover1, Jiaming Song1, Alekh Agarwal2, Kenneth Tran2,\n\nAshish Kapoor2, Eric Horvitz2, Stefano Ermon1\n1Stanford University, 2Microsoft Research, Redmond\n\nAbstract\n\nA learned generative model often produces biased statistics relative to the under-\nlying data distribution. A standard technique to correct this bias is importance\nsampling, where samples from the model are weighted by the likelihood ratio\nunder model and true distributions. When the likelihood ratio is unknown, it can\nbe estimated by training a probabilistic classi\ufb01er to distinguish samples from the\ntwo distributions. 
We show that this likelihood-free importance weighting method\ninduces a new energy-based model and employ it to correct for the bias in existing\nmodels. We \ufb01nd that this technique consistently improves standard goodness-of-\ufb01t\nmetrics for evaluating the sample quality of state-of-the-art deep generative mod-\nels, suggesting reduced bias. Finally, we demonstrate its utility on representative\napplications in a) data augmentation for classi\ufb01cation using generative adversarial\nnetworks, and b) model-based policy evaluation using off-policy data.\n\n1\n\nIntroduction\n\nLearning generative models of complex environments from high-dimensional observations is a long-\nstanding challenge in machine learning. Once learned, these models are used to draw inferences and\nto plan future actions. For example, in data augmentation, samples from a learned model are used to\nenrich a dataset for supervised learning [1]. In model-based off-policy policy evaluation (henceforth\nMBOPE), a learned dynamics model is used to simulate and evaluate a target policy without real-world\ndeployment [2], which is especially valuable for risk-sensitive applications [3]. In spite of the recent\nsuccesses of deep generative models, existing theoretical results show that learning distributions in an\nunbiased manner is either impossible or has prohibitive sample complexity [4, 5]. Consequently, the\nmodels used in practice are inherently biased,1 and can lead to misleading downstream inferences.\nIn order to address this issue, we start from the observation that many typical uses of generative\nmodels involve computing expectations under the model. For instance, in MBOPE, we seek to \ufb01nd\nthe expected return of a policy under a trajectory distribution de\ufb01ned by this policy and a learned\ndynamics model. 
A classical recipe for correcting the bias in expectations, when samples from a different distribution than the ground truth are available, is to importance weight the samples according to the likelihood ratio [6]. If the importance weights were exact, the resulting estimates would be unbiased. But in practice, the likelihood ratio is unknown and needs to be estimated, since the true data distribution is unknown and even the model likelihood is intractable or ill-defined for many deep generative models, e.g., variational autoencoders [7] and generative adversarial networks [8].
Our proposed solution to estimate the importance weights is to train a calibrated, probabilistic classifier to distinguish samples from the data distribution and the generative model. As shown in prior work, the output of such classifiers can be used to extract density ratios [9]. Appealingly, this estimation procedure is likelihood-free since it only requires samples from the two distributions.

1We call a generative model biased if it produces biased statistics relative to the true data distribution.

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

Together, the generative model and the importance weighting function (specified via a binary classifier) induce a new energy function. While exact density estimation and sampling from this induced energy-based model are intractable, we can derive a particle-based approximation which permits efficient sampling via resampling-based methods. We derive conditions on the quality of the weighting function such that the induced model provably improves the fit to the data distribution.
Empirically, we evaluate our bias reduction framework on three main sets of experiments. First, we consider goodness-of-fit metrics for evaluating the sample quality of a likelihood-based and a likelihood-free state-of-the-art (SOTA) model on the CIFAR-10 dataset. 
All these metrics are defined as Monte Carlo estimates from the generated samples. By importance weighting samples, we observe a bias reduction of 23.35% and 13.48%, averaged across commonly used sample quality metrics, on PixelCNN++ [10] and SNGAN [11] models respectively.
Next, we demonstrate the utility of our approach on the task of data augmentation for multi-class classification on the Omniglot dataset [12]. We show that, while naively extending the dataset with samples from a data augmentation generative adversarial network [1] is not very effective for multi-class classification, we can improve classification accuracy from 66.03% to 68.18% by importance weighting the contributions of each augmented data point.
Finally, we demonstrate bias reduction for MBOPE [13]. A typical MBOPE approach is to first estimate a generative model of the dynamics using off-policy data and then evaluate the policy via Monte Carlo [2, 14]. Again, we observe that correcting the bias of the estimated dynamics model via importance weighting reduces RMSE for MBOPE by 50.25% on 3 MuJoCo environments [15].

2 Preliminaries

Notation. Unless explicitly stated otherwise, we assume that probability distributions admit absolutely continuous densities on a suitable reference measure. We use uppercase notation X, Y, Z to denote random variables and lowercase notation x, y, z to denote specific values in the corresponding sample spaces X, Y, Z. We use boldface for multivariate random variables and their vector values.

Background. Consider a finite dataset Dtrain of instances x drawn i.i.d. from a fixed (unknown) distribution p_data. Given Dtrain, the goal of generative modeling is to learn a distribution p_θ to approximate p_data. Here, θ denotes the model parameters, e.g., weights in a neural network for deep generative models. 
The parameters can be learned via maximum likelihood estimation (MLE), as in the case of autoregressive models [16], normalizing flows [17], and variational autoencoders [7, 18], or via adversarial training, e.g., using generative adversarial networks [8, 19] and variants.

Monte Carlo Evaluation. We are interested in use cases where the goal is to evaluate or optimize expectations of functions under some distribution p (either equal or close to the data distribution p_data). Assuming access to samples from p as well as some generative model p_θ, one extreme is to evaluate the sample average using the samples from p alone. However, this ignores the availability of p_θ, through which we have virtually unlimited access to generated samples (ignoring computational constraints) and hence could improve the accuracy of our estimates when p_θ is close to p. We begin by presenting a direct motivating use case: data augmentation using generative models for training classifiers that generalize better.

Example Use Case: Sufficient labeled training data for learning classification and regression systems is often expensive to obtain or susceptible to noise. Data augmentation seeks to overcome this shortcoming by artificially injecting new datapoints into the training set. These new datapoints are derived from an existing labeled dataset, either by manual transformations (e.g., rotations and flips for images), or alternatively, learned via a generative model [1, 20].
Consider a supervised learning task over a labeled dataset Dcl. The dataset consists of feature and label pairs (x, y), each of which is assumed to be sampled independently from a data distribution p_data(x, y) defined over X × Y. Further, let Y ⊆ R^k. In order to learn a classifier f_ψ : X → R^k with parameters ψ, we minimize the expectation of a loss ℓ : Y × R^k →
R over the dataset Dcl:

  E_{p_data(x,y)}[ℓ(y, f_ψ(x))] ≈ (1/|Dcl|) Σ_{(x,y)∈Dcl} ℓ(y, f_ψ(x)).   (1)

E.g., ℓ could be the cross-entropy loss. A generative model for the task of data augmentation learns a joint distribution p_θ(x, y). Several algorithmic variants exist for learning the model's joint distribution and we defer the specifics to the experiments section. Once the generative model is learned, it can be used to optimize the expected classification loss in Eq. (1) under a mixture of the empirical data distribution and the generative model distribution given as:

  p_mix(x, y) = m p_data(x, y) + (1 − m) p_θ(x, y)   (2)

for a suitable choice of the mixture weights m ∈ [0, 1]. Notice that, while the eventual task here is optimization, reliably evaluating the expected loss of a candidate parameter ψ is an important ingredient. We focus on this basic question first, in advance of leveraging the solution for data augmentation. Further, even if evaluating the expectation once is easy, optimization requires repeated evaluation (for different values of ψ), which is significantly more challenging. Also observe that the distribution p under which we seek expectations is the same as p_data here, and we rely on the generalization of p_θ to generate transformations of an instance in the dataset which are not explicitly present, but plausibly observed in other, similar instances [21].

3 Likelihood-Free Importance Weighting

Whenever the distribution p, under which we seek expectations, differs from p_θ, model-based estimates exhibit bias. In this section, we start out by formalizing bias for Monte Carlo expectations and subsequently propose a bias reduction strategy based on likelihood-free importance weighting (LFIW). We are interested in evaluating expectations of a class of functions of interest f ∈ F w.r.t. the distribution p. For any given f : X →
R, we have E_{x∼p}[f(x)] = ∫ p(x) f(x) dx.
Given access to samples from a generative model p_θ, if we knew the densities for both p and p_θ, then a classical scheme to evaluate expectations under p using samples from p_θ is importance sampling [6]. We reweight each sample from p_θ according to its likelihood ratio under p and p_θ and compute a weighted average of the function f over these samples:

  E_{x∼p}[f(x)] = E_{x∼p_θ}[(p(x)/p_θ(x)) f(x)] ≈ (1/T) Σ_{i=1}^T w(x_i) f(x_i)   (3)

where w(x_i) := p(x_i)/p_θ(x_i) is the importance weight for x_i ∼ p_θ. The validity of this procedure requires a proposal p_θ such that for all x ∈ X where p_θ(x) = 0, we also have f(x)p(x) = 0.²
To apply this technique to reduce the bias of a generative sampler p_θ w.r.t. p, we require knowledge of the importance weights w(x) for any x ∼ p_θ. However, we typically only have sampling access to p via finite datasets; for instance, in the data augmentation example above, p = p_data, the unknown distribution used to learn p_θ. Hence we need a scheme to learn the weights w(x) using samples from p and p_θ, which is the problem we tackle next. In order to do this, we consider a binary classification problem over X × Y where Y = {0, 1} and the joint distribution is denoted as q(x, y). Let γ = q(y=0)/q(y=1) > 0 denote any fixed odds ratio. To specify the joint q(x, y), we additionally need the conditional q(x|y), which we define as follows:

  q(x|y) = p_θ(x) if y = 0, and p(x) otherwise.   (4)

Since we only assume sample access to p and p_θ(x), our strategy is to estimate the conditional above by learning a probabilistic binary classifier. To train the classifier, we only require datasets of samples from p_θ(x) and p(x), and we estimate γ as the ratio of the sizes of the two datasets. Let c_φ : X →
[0, 1] denote the probability assigned by the classifier with parameters φ to a sample x belonging to the positive class y = 1. As shown in prior work [9, 22], if c_φ is Bayes optimal, then the importance weights can be obtained via this classifier as:

  w_φ(x) = p(x)/p_θ(x) = γ c_φ(x) / (1 − c_φ(x)).   (5)

²A stronger sufficient, but not necessary, condition that is independent of f states that the proposal p_θ is valid if its support contains that of p, i.e., for all x ∈ X, p_θ(x) = 0 implies p(x) = 0.

Figure 1: Importance Weight Estimation using Probabilistic Classifiers. (a) Setup: a univariate Gaussian (blue) is fit to samples from a mixture of two Gaussians (red). (b-d) Estimated class probabilities (with 95% confidence intervals based on 1000 bootstraps) for varying numbers of points n = 50, 100, 1000, where n is the number of points used for training the generative model and the multilayer perceptron.

In practice, we do not have access to a Bayes optimal classifier and hence, the estimated importance weights will not be exact. Consequently, we can hope to reduce the bias as opposed to eliminating it entirely. Our default LFIW estimator is given as:

  E_{x∼p}[f(x)] ≈ (1/T) Σ_{i=1}^T ŵ(x_i) f(x_i)   (6)

where ŵ(x_i) = γ c_φ(x_i) / (1 − c_φ(x_i)) is the importance weight for x_i ∼ p_θ estimated via c_φ(x).

Practical Considerations. Besides imperfections in the classifier, the quality of the generative model also dictates the efficacy of importance weighting. For example, images generated by deep generative models often possess distinct artifacts which can be exploited by the classifier to give highly-confident predictions [23, 24]. 
This could lead to very small importance weights for some generated images, and consequently greater relative variance in the importance weights across the Monte Carlo batch. Below, we present some practical variants of the LFIW estimator to offset this challenge.

1. Self-normalization: The self-normalized LFIW estimator normalizes the importance weights across a sampled batch:

  E_{x∼p}[f(x)] ≈ Σ_{i=1}^T (ŵ(x_i) / Σ_{j=1}^T ŵ(x_j)) f(x_i), where x_i ∼ p_θ.   (7)

2. Flattening: The flattened LFIW estimator interpolates between uniform importance weights and the default LFIW weights via a power-scaling parameter α ≥ 0:

  E_{x∼p}[f(x)] ≈ (1/T) Σ_{i=1}^T ŵ(x_i)^α f(x_i), where x_i ∼ p_θ.   (8)

For α = 0, there is no bias correction, and α = 1 returns the default estimator in Eq. (6). For intermediate values of α, we can trade off bias reduction against any undesirable variance introduced.

3. Clipping: The clipped LFIW estimator specifies a lower bound β ≥ 0 on the importance weights:

  E_{x∼p}[f(x)] ≈ (1/T) Σ_{i=1}^T max(ŵ(x_i), β) f(x_i), where x_i ∼ p_θ.   (9)

When β = 0, we recover the default LFIW estimator in Eq. (6). Finally, we note that these estimators are not exclusive and can be combined, e.g., flattened or clipped weights can be normalized.

Confidence intervals. Since the real and generated data come from a finite dataset and a parametric model respectively, we propose a combination of empirical and parametric bootstraps to derive confidence intervals around the estimated importance weights. See Appendix A for details.

Synthetic experiment. We visually illustrate our importance weighting approach in a toy experiment (Figure 1a). We are given a finite set of samples drawn from a mixture of two Gaussians (red). 
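Before continuing the example, the default estimator and the three variants above (Eqs. 6-9) can be sketched as small numpy helpers; a minimal sketch in which the weights and function values are synthetic stand-ins and the helper names are illustrative:

```python
import numpy as np

def lfiw_default(w, fx):
    """Default LFIW estimate: (1/T) sum_i w_i f(x_i)  (Eq. 6)."""
    return np.mean(w * fx)

def lfiw_self_normalized(w, fx):
    """Self-normalized LFIW: sum_i (w_i / sum_j w_j) f(x_i)  (Eq. 7)."""
    return np.sum(w / np.sum(w) * fx)

def lfiw_flattened(w, fx, alpha=1.0):
    """Flattened LFIW with power scaling alpha >= 0  (Eq. 8).
    alpha = 0 gives the unweighted average; alpha = 1 the default."""
    return np.mean(w ** alpha * fx)

def lfiw_clipped(w, fx, beta=0.0):
    """Clipped LFIW with lower bound beta >= 0 on the weights  (Eq. 9).
    beta = 0 recovers the default estimator."""
    return np.mean(np.maximum(w, beta) * fx)

# Stand-ins for estimated weights and f evaluated at generated samples.
rng = np.random.default_rng(0)
w = rng.lognormal(size=1000)
fx = rng.normal(size=1000)
```

As noted above, the variants compose: one can flatten or clip the weights first and then self-normalize the result.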
The model family is a unimodal Gaussian, illustrating mismatch due to a parametric model. The mean and variance of the model are estimated by the empirical mean and variance of the observed data. Using the estimated model parameters, we then draw samples from the model (blue).
In Figure 1b, we show the probability assigned by a binary classifier to a point being from the true data distribution. Here, the classifier is a single-hidden-layer multilayer perceptron. The classifier is not Bayes optimal, which can be seen by the gaps between the optimal probability curve (black) and the estimated class probability curve (green). However, as we increase the number of real and generated examples n in Figures 1c-d, the classifier approaches optimality. Furthermore, even its uncertainty shrinks with increasing data, as expected. In summary, this experiment demonstrates how a binary classifier can mitigate the bias due to a mismatched generative model.

Algorithm 1 SIR for the Importance Resampled Energy-Based Model p_{θ,φ}
  Input: generative model p_θ, importance weight estimator ŵ, budget T
  1: Sample x_1, x_2, . . . , x_T independently from p_θ
  2: Estimate importance weights ŵ(x_1), ŵ(x_2), . . . , ŵ(x_T)
  3: Compute Ẑ ← Σ_{t=1}^T ŵ(x_t)
  4: Sample j ∼ Categorical(ŵ(x_1)/Ẑ, ŵ(x_2)/Ẑ, . . . , ŵ(x_T)/Ẑ)
  5: return x_j

4 Importance Resampled Energy-Based Model

In the previous section, we described a procedure to augment any base generative model p_θ with an importance weighting estimator ŵ for debiased Monte Carlo evaluation. 
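Algorithm 1 together with the weight recovery of Eq. (5) can be sketched as follows; a minimal sketch, where `sample_model` and `classifier` are illustrative stand-ins for sampling from p_θ and evaluating c_φ, and the classifier is assumed to return values in [0, 1):

```python
import numpy as np

def importance_weights(c_probs, gamma=1.0):
    """Eq. (5): w(x) = gamma * c(x) / (1 - c(x)), where c(x) is the
    classifier's probability that x is real and gamma the odds ratio."""
    return gamma * c_probs / (1.0 - c_probs)

def sir_sample(sample_model, classifier, T, gamma=1.0, rng=None):
    """Algorithm 1: sampling-importance-resampling from the induced model.
    `sample_model(T)` draws T particles from p_theta; `classifier(x)`
    returns c(x) in [0, 1) for each particle (both assumed given)."""
    rng = np.random.default_rng(rng)
    x = sample_model(T)                                   # step 1: T particles
    w = importance_weights(classifier(x), gamma)          # step 2: weights
    probs = w / w.sum()                                   # step 3: normalize
    j = rng.choice(T, p=probs)                            # step 4: resample
    return x[j]                                           # step 5
```

Since the particles are drawn independently in step 1, the weight estimation in step 2 parallelizes trivially across the batch.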
Here, we will use this augmentation to induce an importance resampled energy-based model with density p_{θ,φ}, given as:

  p_{θ,φ}(x) ∝ p_θ(x) ŵ(x)   (10)

where the partition function is expressed as Z_{θ,φ} = ∫ p_θ(x) ŵ(x) dx = E_{p_θ}[ŵ(x)].

Density Estimation. Exact density estimation requires a handle on the density of the base model p_θ (typically intractable for models such as VAEs and GANs) and estimates of the partition function; exactly computing the partition function is intractable. If p_θ permits fast sampling and importance weights are estimated via LFIW (requiring only a forward pass through the classifier network), we can obtain unbiased estimates via a Monte Carlo average, i.e., Z_{θ,φ} ≈ (1/T) Σ_{i=1}^T ŵ(x_i) where x_i ∼ p_θ. To reduce the variance, a potentially large number of samples is required. Since samples are obtained independently, the terms in the Monte Carlo average can be evaluated in parallel.

Sampling-Importance-Resampling. While exact sampling from p_{θ,φ} is intractable, we can instead sample from a particle-based approximation to p_{θ,φ} via sampling-importance-resampling [25, 26] (SIR). We define the SIR approximation to p_{θ,φ} via the following density:

  p^SIR_{θ,φ}(x; T) := T · E_{x_2, x_3, ..., x_T ∼ p_θ}[ ŵ(x) / (ŵ(x) + Σ_{i=2}^T ŵ(x_i)) ] p_θ(x)   (11)

where T > 0 denotes the number of independent samples (or "particles"). For any finite T, sampling from p^SIR_{θ,φ} is tractable, as summarized in Algorithm 1. Moreover, any expectation w.r.t. the SIR approximation to the induced distribution can be evaluated in closed form using the self-normalized LFIW estimator (Eq. 7). In the limit of T → ∞, we recover the induced distribution p_{θ,φ}:

  lim_{T→∞} p^SIR_{θ,φ}(x; T) = p_{θ,φ}(x)  ∀x.   (12)

Next, we analyze conditions under which the resampled density p_{θ,φ} provably improves the model fit to p_data. In order to do so, we further assume that p_data is absolutely continuous w.r.t. p_θ and p_{θ,φ}. We define the change in KL via the importance resampled density as:

  Δ(p_data, p_θ, p_{θ,φ}) := D_KL(p_data, p_{θ,φ}) − D_KL(p_data, p_θ).   (13)

Substituting Eq. 10 in Eq. 13, we can simplify the above quantity as:

  Δ(p_data, p_θ, p_{θ,φ}) = E_{x∼p_data}[− log(p_θ(x) ŵ(x)) + log Z_{θ,φ} + log p_θ(x)]   (14)
                          = − E_{x∼p_data}[log ŵ(x)] + log E_{x∼p_θ}[ŵ(x)].   (15)

Table 1: Goodness-of-fit evaluation on the CIFAR-10 dataset for PixelCNN++ and SNGAN. Standard errors computed over 10 runs. Higher IS is better; lower FID and KID scores are better.

  Model        Evaluation               IS (↑)            FID (↓)           KID (↓)
  -            Reference                11.09 ± 0.1263    5.20 ± 0.0533     0.008 ± 0.0004
  PixelCNN++   Default (no debiasing)   5.16 ± 0.0117     58.70 ± 0.0506    0.196 ± 0.0001
  PixelCNN++   LFIW                     6.68 ± 0.0773     55.83 ± 0.9695    0.126 ± 0.0009
  SNGAN        Default (no debiasing)   8.33 ± 0.0280     20.40 ± 0.0747    0.094 ± 0.0002
  SNGAN        LFIW                     8.57 ± 0.0325     17.29 ± 0.0698    0.073 ± 0.0004

The above expression provides a necessary and sufficient condition for any positive real-valued function (such as the LFIW classifier in Section 3) to improve the KL divergence fit to the underlying data distribution. In practice, an unbiased estimate of the LHS can be obtained via Monte Carlo averaging of log-importance weights based on Dtrain. The empirical estimate for the RHS is however biased.³ To remedy this shortcoming, we consider the following necessary but insufficient condition.
Proposition 1. 
If Δ(p_data, p_θ, p_{θ,φ}) ≤ 0, then the following conditions hold:

  E_{x∼p_data}[ŵ(x)] ≥ E_{x∼p_θ}[ŵ(x)],   (16)
  E_{x∼p_data}[log ŵ(x)] ≥ E_{x∼p_θ}[log ŵ(x)].   (17)

The conditions in Eq. 16 and Eq. 17 follow directly via Jensen's inequality applied to the LHS and RHS of Eq. 15 respectively. Here, we note that estimates for the expectations in Eqs. 16-17 based on Monte Carlo averaging of (log-) importance weights are unbiased.

5 Application Use Cases

In all our experiments, the binary classifier for estimating the importance weights was a calibrated deep neural network trained to minimize the cross-entropy loss. The self-normalized LFIW in Eq. (7) worked best. Additional analysis of the estimators and experimental details are in Appendices B and C.

5.1 Goodness-of-fit testing

In the first set of experiments, we highlight the benefits of importance weighting for a debiased evaluation of three popularly used sample quality metrics, viz. Inception Score (IS) [27], Frechet Inception Distance (FID) [28], and Kernel Inception Distance (KID) [29]. All these scores can be formally expressed as empirical expectations with respect to the model. For all these metrics, we can simulate the population-level unbiased case as a "reference score" wherein we artificially set both the real and generated sets of samples used for evaluation to finite, disjoint sets derived from p_data.
We evaluate the three metrics for two state-of-the-art models trained on the CIFAR-10 dataset, viz. an autoregressive model PixelCNN++ [10] learned via maximum likelihood estimation and a latent variable model SNGAN [11] learned via adversarial training. For evaluating each metric, we draw 10,000 samples from the model. In Table 1, we report the metrics with and without the LFIW bias correction. 
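The two necessary conditions of Proposition 1 can be checked directly from samples. A minimal sketch, using a synthetic example where the exact density ratio between two known Gaussians stands in for ŵ (names and the toy setup are illustrative):

```python
import numpy as np

def necessary_conditions(w_data, w_model):
    """Monte Carlo checks of the necessary conditions in Proposition 1
    (Eqs. 16-17). `w_data`: estimated weights at samples from p_data;
    `w_model`: estimated weights at samples from p_theta. Both Monte
    Carlo averages are unbiased for the corresponding expectations."""
    cond_mean = w_data.mean() >= w_model.mean()                  # Eq. (16)
    cond_log = np.log(w_data).mean() >= np.log(w_model).mean()   # Eq. (17)
    return cond_mean, cond_log

# Toy setup: p_data = N(0, 1), p_theta = N(1, 1); the exact ratio
# p_data(x) / p_theta(x) simplifies to exp(0.5 - x).
rng = np.random.default_rng(0)
x_data = rng.normal(0.0, 1.0, 20000)
x_model = rng.normal(1.0, 1.0, 20000)
ratio = lambda x: np.exp(0.5 - x)
```

With the exact ratio, both conditions hold by construction; with an estimated ŵ, a failed check is a useful diagnostic that resampling through the induced model may not improve the fit.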
The consistent debiased evaluation of these metrics via self-normalized LFIW suggests that the SIR approximation to the importance resampled distribution (Eq. 11) is a better fit to p_data.

5.2 Data Augmentation for Multi-Class Classification

We consider data augmentation via Data Augmentation Generative Adversarial Networks (DAGAN) [1]. While DAGAN was motivated by and evaluated for the task of meta-learning, it can also be applied to multi-class classification scenarios, which is the setting we consider here. We trained a DAGAN on the Omniglot dataset of handwritten characters [12]. The DAGAN training procedure is described in the Appendix. The dataset is particularly relevant because it contains 1600+ classes but only 20 examples from each class and hence could potentially benefit from augmented data.

³If Ẑ is an unbiased estimator for Z, then log Ẑ is a biased estimator for log Z via Jensen's inequality.

Figure 2: Qualitative evaluation of importance weighting for data augmentation. (a-f) Top row shows held-out data samples from a specific class in Omniglot. Bottom row shows generated samples from the same class, ranked in decreasing order of importance weights.

Table 2: Classification accuracy on the Omniglot dataset. Standard errors computed over 5 runs.

  Dataset    Dcl               Dg                Dg w/ LFIW        Dcl + Dg          Dcl + Dg w/ LFIW
  Accuracy   0.6603 ± 0.0012   0.4431 ± 0.0054   0.4481 ± 0.0056   0.6600 ± 0.0040   0.6818 ± 0.0022

Once the model has been trained, it can be used for data augmentation in many ways. In particular, we consider ablation baselines that use various combinations of the real training data Dcl and generated data Dg for training a downstream classifier. 
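Concretely, importance weighting the augmented points amounts to scaling their per-example losses before averaging (cf. Eqs. (1)-(2)). A minimal sketch; the per-example losses and estimated weights are assumed given, and the self-normalization and scaling choices here are illustrative rather than the exact training recipe:

```python
import numpy as np

def augmented_loss(loss_real, loss_gen, w_gen, m=0.5):
    """Importance-weighted augmentation objective: real points get uniform
    weight; each generated point's loss is scaled by its self-normalized
    LFIW weight (normalized to keep the average scale of the batch).
    `m` is the mixture weight from Eq. (2)."""
    w = w_gen / w_gen.sum() * len(w_gen)
    return m * loss_real.mean() + (1 - m) * np.mean(w * loss_gen)
```

With uniform weights on the generated points, this reduces to the naive mixture objective; non-uniform weights downweight implausible generated examples.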
When the generated data Dg is used, we can either use the data directly, with uniform weighting for all training points, or choose to importance weight (LFIW) the contributions of the individual training points to the overall loss. The results are shown in Table 2. While generated data (Dg) alone cannot match the performance of real data (Dcl) on this task, as expected, the bias it introduces for evaluation and subsequent optimization overshadows even the naive data augmentation (Dcl + Dg). In contrast, we obtain significant improvements by importance weighting the generated points (Dcl + Dg w/ LFIW).
Qualitatively, we can observe the effect of importance weighting in Figure 2. Here, we show true and generated samples for 6 randomly chosen classes (a-f) in the Omniglot dataset. The generated samples are ranked in decreasing order of the importance weights. There is no way to formally test the validity of such rankings, and this criterion can also prefer points which have high density under p_data but are unlikely under p_θ, since we are looking at ratios. Visual inspection suggests that the classifier is able to appropriately downweight poorer samples, as shown in Figure 2 (a, b, c, d - bottom right). There are also failure modes, such as the lowest-ranked generated images in Figure 2 (e, f - bottom right), where the classifier weights reasonable generated samples poorly relative to others. This could be due to particular artifacts, such as the tiny disconnected blurry speck in Figure 2 (e - bottom right), which could be more revealing to a classifier distinguishing real and generated data.

5.3 Model-based Off-policy Policy Evaluation

So far, we have seen use cases where the generative model was trained on data from the same distribution we wish to use for Monte Carlo evaluation. 
We can extend our debiasing framework to more involved settings where the generative model is a building block for specifying the full data generation process, e.g., trajectory data generated via a dynamics model along with an agent policy.
In particular, we consider the setting of off-policy policy evaluation (OPE), where the goal is to evaluate policies using experiences collected from a different policy. Formally, let (S, A, r, P, η, T) denote an (undiscounted) Markov decision process with state space S, action space A, reward function r, transition P, initial state distribution η and horizon T. Assume π_e : S × A → [0, 1] is a known policy that we wish to evaluate. The probability of generating a certain trajectory τ = {s_0, a_0, s_1, a_1, ..., s_T, a_T} of length T with policy π_e and transition P is given as:

  p*(τ) = η(s_0) ∏_{t=0}^{T−1} π_e(a_t|s_t) P(s_{t+1}|s_t, a_t).   (18)

The return R(τ) on a trajectory is the sum of the rewards across the state-action pairs in τ: R(τ) = Σ_{t=1}^T r(s_t, a_t), where we assume a known reward function r.

Table 3: Off-policy policy evaluation on MuJoCo tasks. Standard error is over 10 Monte Carlo estimates, where each estimate contains 100 randomly sampled trajectories.

  Environment       v(π_e) (Ground truth)   ṽ(π_e)         v̂(π_e) (w/ LFIW)   v̂_80(π_e) (w/ LFIW)
  Swimmer           36.7 ± 0.1              100.4 ± 3.2    25.7 ± 3.1          47.6 ± 4.8
  HalfCheetah       241.7 ± 3.56            204.0 ± 0.8    217.8 ± 4.0         219.1 ± 1.6
  HumanoidStandup   14170 ± 53              8417 ± 28      9372 ± 375          9221 ± 381

Figure 3: Estimation error Δ(v) = v(π_e) − v̂_H(π_e) for different values of H (minimum 0, maximum 100). Shaded area denotes standard error over different random seeds.

We are interested in the value of a policy, defined as v(π_e) = E_{τ∼p*(τ)}[R(τ)]. 
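For concreteness, when the dynamics can be sampled from, v(π) can be estimated by Monte Carlo rollouts of Eq. (18). A minimal sketch with callable stand-ins for η, π, P and r (all names illustrative):

```python
import numpy as np

def rollout_value(eta, policy, dynamics, reward, T, n_traj, rng=None):
    """Monte Carlo estimate of v(pi) = E[R(tau)]: sample s0 ~ eta, then
    alternate a_t ~ pi(.|s_t) and s_{t+1} ~ P(.|s_t, a_t) for T steps,
    accumulating rewards, and average the returns over n_traj rollouts."""
    rng = np.random.default_rng(rng)
    total = 0.0
    for _ in range(n_traj):
        s = eta(rng)
        for _ in range(T):
            a = policy(s, rng)
            total += reward(s, a)
            s = dynamics(s, a, rng)
    return total / n_traj
```

Replacing `dynamics` with a learned model P_θ yields exactly the biased model-based estimate ṽ(π_e) discussed next.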
Evaluating π_e requires the (unknown) transition dynamics P. The dynamics model is a conditional generative model of the next state s_{t+1} conditioned on the previous state-action pair (s_t, a_t). If we have access to historical logged data D_τ of trajectories τ = {s_0, a_0, s_1, a_1, . . .} from some behavioral policy π_b : S × A → [0, 1], then we can use this off-policy data to train a dynamics model P_θ(s_{t+1}|s_t, a_t). The policy π_e can then be evaluated under this learned dynamics model as ṽ(π_e) = E_{τ∼p̃(τ)}[R(τ)], where p̃ uses P_θ instead of the true dynamics in Eq. (18).
However, the trajectories sampled with P_θ could significantly deviate from samples from P due to compounding errors [30]. In order to correct for this bias, we can use likelihood-free importance weighting on entire trajectories of data. The binary classifier c_φ(s_t, a_t, s_{t+1}) for estimating the importance weights in this case distinguishes between triples of true and generated transitions. For any true triple (s_t, a_t, s_{t+1}) extracted from the off-policy data, the corresponding generated triple (s_t, a_t, ŝ_{t+1}) differs only in the final transition state, i.e., ŝ_{t+1} ∼ P_θ(ŝ_{t+1}|s_t, a_t). Such a classifier allows us to obtain the importance weights ŵ(s_t, a_t, ŝ_{t+1}) for every predicted state transition (s_t, a_t, ŝ_{t+1}). 
The importance weights for the trajectory τ can be derived from the importance weights of these individual transitions as:

  p*(τ)/p̃(τ) = (∏_{t=0}^{T−1} P(s_{t+1}|s_t, a_t)) / (∏_{t=0}^{T−1} P_θ(s_{t+1}|s_t, a_t))
             = ∏_{t=0}^{T−1} P(s_{t+1}|s_t, a_t) / P_θ(s_{t+1}|s_t, a_t)
             ≈ ∏_{t=0}^{T−1} ŵ(s_t, a_t, ŝ_{t+1}).   (19)

Our final LFIW estimator is given as:

  v̂(π_e) = E_{τ∼p̃(τ)}[ ∏_{t=0}^{T−1} ŵ(s_t, a_t, ŝ_{t+1}) · R(τ) ].   (20)

We consider three continuous control tasks in the MuJoCo simulator [15] from OpenAI gym [31] (in increasing number of state dimensions): Swimmer, HalfCheetah and HumanoidStandup. The high-dimensional state spaces make it challenging to learn a reliable dynamics model in these environments. We train behavioral and evaluation policies using Proximal Policy Optimization [32] with different hyperparameters for the two policies. The dataset collected via trajectories from the behavior policy is used to train an ensemble neural network dynamics model. We then use the trained dynamics model to evaluate ṽ(π_e) and its importance weighted version v̂(π_e), and compare them with the ground truth returns v(π_e). Each estimate is averaged over a set of 100 trajectories with horizon T = 100. Specifically, for v̂(π_e), we also average the estimate over 10 classifier instances trained with different random seeds on different trajectories. We further consider performing importance weighting over only the first H steps, using uniform weights for the remainder, which we denote as v̂_H(π_e). This allows us to interpolate between ṽ(π_e) ≡ v̂_0(π_e) and v̂(π_e) ≡ v̂_T(π_e). Finally, as in the other experiments, we used the self-normalized variant (Eq. (7)) of the importance weighted estimator in Eq. (20).
We compare the policy evaluations under different environments in Table 3. 
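The trajectory-level weighting in Eqs. (19)-(20), including the v̂_H truncation, can be sketched as follows; a minimal sketch in which the per-transition weights ŵ(s_t, a_t, ŝ_{t+1}) are assumed to be already estimated:

```python
import numpy as np

def trajectory_weight(w_step, H=None):
    """Eq. (19): the trajectory weight is the product of the per-transition
    weights. Optionally weight only the first H steps (the v_H
    interpolation), implicitly using weight 1 thereafter."""
    w_step = np.asarray(w_step, dtype=float)
    if H is not None:
        w_step = w_step[:H]
    return float(np.prod(w_step))

def lfiw_return(returns, step_weights, H=None):
    """Self-normalized version of Eq. (20): weight each simulated
    trajectory's return R(tau) by its normalized trajectory weight."""
    tw = np.array([trajectory_weight(w, H) for w in step_weights])
    tw = tw / tw.sum()
    return float(np.sum(tw * np.asarray(returns)))
```

Because the trajectory weight is a product over up to T per-transition factors, a few trajectories can dominate the normalized weights, which is the variance issue discussed next.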
These results show that the rewards estimated with the trained dynamics model differ from the ground truth by a large margin. By importance weighting the trajectories, we obtain much more accurate policy evaluations. As expected, we also see that while LFIW leads to higher returns on average, the imbalance in trajectory importance weights due to the multiplicative weights of the state-action pairs can lead to higher variance in the importance-weighted returns. In Figure 3, we demonstrate that policy evaluation becomes more accurate as more timesteps are used for LFIW evaluations, until around 80–100 timesteps, which empirically validates the benefits of importance weighting using a classifier. Given that our estimates have a large variance, it would be worthwhile to compose our approach with other variance reduction techniques such as (weighted) doubly robust estimation in future work [33], as well as to incorporate these estimates within a framework such as MAGIC to further blend with model-free OPE [14]. In Appendix C.5.1, we also consider a stepwise LFIW estimator for MBOPE which applies importance weighting at the level of every decision as opposed to entire trajectories.

Overall. Across all our experiments, we observe that importance weighting the generated samples leads to uniformly better results, whether in terms of evaluating the quality of samples or their utility in downstream tasks. Since the technique is a black-box wrapper around any generative model, we expect it to benefit a diverse set of tasks in follow-up work.

However, some caution should be exercised with these techniques, as evident from the results in Table 1. Note that in this table, the confidence intervals (computed using the reported standard errors) around the model scores after importance weighting still do not contain the reference scores obtained from the true model.
This would not have been the case if our debiased estimator were completely unbiased, and this observation reiterates our earlier claim that LFIW reduces bias rather than eliminating it entirely. Indeed, when such a mismatch is observed, it is a good diagnostic to either learn more powerful classifiers to better approximate the Bayes optimum, or find additional data from $p_{\text{data}}$ in case the generative model fails the full support assumption.

6 Related Work & Discussion
Density ratios enjoy widespread use across machine learning, e.g., for handling covariate shift and class imbalance [9, 34]. In generative modeling, estimating these ratios via binary classifiers is frequently used for defining learning objectives and two-sample tests [19, 35–41]. In particular, such classifiers have been used to define learning frameworks such as generative adversarial networks [8, 42], likelihood-free Approximate Bayesian Computation (ABC) [43], and earlier work in unsupervised-as-supervised learning [44] and noise contrastive estimation [43], among others. Recently, [45] used importance weighting to reweigh datapoints based on differences in training and test data distributions, i.e., dataset bias. The key difference is that these works are explicitly interested in learning the parameters of a generative model. In contrast, we use the binary classifier for estimating importance weights to correct for the model bias of any fixed generative model.

Recent concurrent works [46–48] use MCMC and rejection sampling to explicitly transform or reject the generated samples. These methods require extra computation beyond training a classifier, in rejecting the samples or running Markov chains to convergence, unlike the proposed importance weighting strategy. For many model-based Monte Carlo evaluation use cases (e.g., data augmentation, MBOPE), this extra computation is unnecessary.
If samples or density estimates are explicitly needed from the induced resampled distribution, we presented a particle-based approximation to the induced density, where the number of particles is a tunable knob for trading off statistical accuracy against computational efficiency. Finally, we note that resampling-based techniques have been extensively studied in the context of improving variational approximations for latent variable generative models [49–52].

7 Conclusion
We identified bias with respect to a target data distribution as a fundamental challenge restricting the use of generative models as proposal distributions for Monte Carlo evaluation. We proposed a bias correction framework based on importance weighting. Here, any base generative model can be boosted with an importance weight estimator to induce an energy-based generative model. The importance weights are estimated in a likelihood-free fashion via a binary classifier. Empirically, we find the bias correction to be useful across a variety of tasks including goodness-of-fit sample quality tests, data augmentation, and off-policy policy evaluation. The ability to characterize the bias of a generative model is an important step towards using these models to guide decisions in high-stakes applications under uncertainty [53, 54], such as healthcare [55–57] and anomaly detection [58, 59].

Acknowledgments

This project was initiated when AG was an intern at Microsoft Research. We are thankful to Daniel Levy, Rui Shu, Yang Song, and members of the Reinforcement Learning, Deep Learning, and Adaptive Systems and Interaction groups at Microsoft Research for helpful discussions and comments on early drafts. This research was supported by NSF (#1651565, #1522054, #1733686), ONR, AFOSR (FA9550-19-1-0024), and FLI.

References
[1] Antreas Antoniou, Amos Storkey, and Harrison Edwards. Data augmentation generative adversarial networks.
arXiv preprint arXiv:1711.04340, 2017.\n\n[2] Shie Mannor, Duncan Simester, Peng Sun, and John N Tsitsiklis. Bias and variance approxima-\n\ntion in value function estimates. Management Science, 53(2):308\u2013322, 2007.\n\n[3] Philip S Thomas. Safe reinforcement learning. PhD thesis, University of Massachusetts\n\nLibraries, 2015.\n\n[4] Murray Rosenblatt. Remarks on some nonparametric estimates of a density function. The\n\nAnnals of Mathematical Statistics, pages 832\u2013837, 1956.\n\n[5] Sanjeev Arora, Andrej Risteski, and Yi Zhang. Do gans learn the distribution? some theory and\n\nempirics. In International Conference on Learning Representations, 2018.\n\n[6] Daniel G Horvitz and Donovan J Thompson. A generalization of sampling without replacement\n\nfrom a \ufb01nite universe. Journal of the American statistical Association, 1952.\n\n[7] Diederik P Kingma and Max Welling. Auto-encoding variational bayes. arXiv preprint\n\narXiv:1312.6114, 2013.\n\n[8] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil\nOzair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in Neural\nInformation Processing Systems, 2014.\n\n[9] Masashi Sugiyama, Taiji Suzuki, and Takafumi Kanamori. Density ratio estimation in machine\n\nlearning. Cambridge University Press, 2012.\n\n[10] Tim Salimans, Andrej Karpathy, Xi Chen, and Diederik P Kingma. Pixelcnn++: Improving the\npixelcnn with discretized logistic mixture likelihood and other modi\ufb01cations. arXiv preprint\narXiv:1701.05517, 2017.\n\n[11] Takeru Miyato, Toshiki Kataoka, Masanori Koyama, and Yuichi Yoshida. Spectral normalization\n\nfor generative adversarial networks. arXiv preprint arXiv:1802.05957, 2018.\n\n[12] Brenden M Lake, Ruslan Salakhutdinov, and Joshua B Tenenbaum. Human-level concept\n\nlearning through probabilistic program induction. Science, 350(6266):1332\u20131338, 2015.\n\n[13] Doina Precup, Richard S. Sutton, and Satinder P. Singh. 
Eligibility traces for off-policy policy evaluation. In International Conference on Machine Learning, 2000.

[14] Philip Thomas and Emma Brunskill. Data-efficient off-policy policy evaluation for reinforcement learning. In International Conference on Machine Learning, 2016.

[15] Emanuel Todorov, Tom Erez, and Yuval Tassa. Mujoco: A physics engine for model-based control. In International Conference on Intelligent Robots and Systems. IEEE, 2012.

[16] Benigno Uria, Marc-Alexandre Côté, Karol Gregor, Iain Murray, and Hugo Larochelle. Neural autoregressive distribution estimation. The Journal of Machine Learning Research, 17(1):7184–7220, 2016.

[17] Laurent Dinh, David Krueger, and Yoshua Bengio. Nice: Non-linear independent components estimation. arXiv preprint arXiv:1410.8516, 2014.

[18] Danilo Jimenez Rezende, Shakir Mohamed, and Daan Wierstra. Stochastic backpropagation and approximate inference in deep generative models. arXiv preprint arXiv:1401.4082, 2014.

[19] Shakir Mohamed and Balaji Lakshminarayanan. Learning in implicit generative models. arXiv preprint arXiv:1610.03483, 2016.

[20] Alexander J Ratner, Henry Ehrenberg, Zeshan Hussain, Jared Dunnmon, and Christopher Ré. Learning to compose domain-specific transformations for data augmentation. In Advances in Neural Information Processing Systems, 2017.

[21] Shengjia Zhao, Hongyu Ren, Arianna Yuan, Jiaming Song, Noah Goodman, and Stefano Ermon. Bias and generalization in deep generative models: An empirical study. In Advances in Neural Information Processing Systems, 2018.

[22] Aditya Grover and Stefano Ermon. Boosted generative models. In AAAI Conference on Artificial Intelligence, 2018.

[23] Augustus Odena, Vincent Dumoulin, and Chris Olah. Deconvolution and checkerboard artifacts. Distill, 2016. doi: 10.23915/distill.00003. URL http://distill.pub/2016/deconv-checkerboard.

[24] Augustus Odena.
Open questions about generative adversarial networks. Distill, 4(4):e18, 2019.\n\n[25] Jun S Liu and Rong Chen. Sequential monte carlo methods for dynamic systems. Journal of\n\nthe American statistical association, 93(443):1032\u20131044, 1998.\n\n[26] Arnaud Doucet, Simon Godsill, and Christophe Andrieu. On sequential monte carlo sampling\n\nmethods for bayesian \ufb01ltering. Statistics and computing, 10(3):197\u2013208, 2000.\n\n[27] Tim Salimans, Ian Goodfellow, Wojciech Zaremba, Vicki Cheung, Alec Radford, and Xi Chen.\nImproved techniques for training gans. In Advances in Neural Information Processing Systems,\npages 2234\u20132242, 2016.\n\n[28] Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter.\nGans trained by a two time-scale update rule converge to a local nash equilibrium. In Advances\nin Neural Information Processing Systems, pages 6626\u20136637, 2017.\n\n[29] Miko\u0142aj Bi\u00b4nkowski, Dougal J Sutherland, Michael Arbel, and Arthur Gretton. Demystifying\n\nmmd gans. arXiv preprint arXiv:1801.01401, 2018.\n\n[30] St\u00e9phane Ross and Drew Bagnell. Ef\ufb01cient reductions for imitation learning. In International\n\nConference on Arti\ufb01cial Intelligence and Statistics, 2010.\n\n[31] Greg Brockman, Vicki Cheung, Ludwig Pettersson, Jonas Schneider, John Schulman, Jie Tang,\n\nand Wojciech Zaremba. Openai gym. arXiv preprint arXiv:1606.01540, 2016.\n\n[32] John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal\n\npolicy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.\n\n[33] Mehrdad Farajtabar, Yinlam Chow, and Mohammad Ghavamzadeh. More robust doubly robust\n\noff-policy evaluation. In International Conference on Machine Learning, 2018.\n\n[34] Jonathon Byrd and Zachary C Lipton. What is the effect of importance weighting in deep\n\nlearning? 
arXiv preprint arXiv:1812.03372, 2018.

[35] Mihaela Rosca, Balaji Lakshminarayanan, David Warde-Farley, and Shakir Mohamed. Variational approaches for auto-encoding generative adversarial networks. arXiv preprint arXiv:1706.04987, 2017.

[36] Arthur Gretton, Karsten M Borgwardt, Malte Rasch, Bernhard Schölkopf, and Alex J Smola. A kernel method for the two-sample-problem. In Advances in Neural Information Processing Systems, 2007.

[37] Samuel R Bowman, Luke Vilnis, Oriol Vinyals, Andrew M Dai, Rafal Jozefowicz, and Samy Bengio. Generating sentences from a continuous space. arXiv preprint arXiv:1511.06349, 2015.

[38] David Lopez-Paz and Maxime Oquab. Revisiting classifier two-sample tests. arXiv preprint arXiv:1610.06545, 2016.

[39] Ivo Danihelka, Balaji Lakshminarayanan, Benigno Uria, Daan Wierstra, and Peter Dayan. Comparison of maximum likelihood and gan-based training of real nvps. arXiv preprint arXiv:1705.05263, 2017.

[40] Daniel Jiwoong Im, He Ma, Graham Taylor, and Kristin Branson. Quantitatively evaluating gans with divergences proposed for training. arXiv preprint arXiv:1803.01045, 2018.

[41] Ishaan Gulrajani, Colin Raffel, and Luke Metz. Towards gan benchmarks which require generalization. In International Conference on Learning Representations, 2019.

[42] Sebastian Nowozin, Botond Cseke, and Ryota Tomioka. f-gan: Training generative neural samplers using variational divergence minimization. In Advances in Neural Information Processing Systems, pages 271–279, 2016.

[43] Michael U Gutmann and Aapo Hyvärinen. Noise-contrastive estimation of unnormalized statistical models, with applications to natural image statistics. Journal of Machine Learning Research, 13(Feb):307–361, 2012.

[44] Jerome Friedman, Trevor Hastie, and Robert Tibshirani. The elements of statistical learning, volume 1.
Springer Series in Statistics, New York, NY, USA, 2001.

[45] Maurice Diesendruck, Ethan R Elenberg, Rajat Sen, Guy W Cole, Sanjay Shakkottai, and Sinead A Williamson. Importance weighted generative networks. arXiv preprint arXiv:1806.02512, 2018.

[46] Ryan Turner, Jane Hung, Yunus Saatci, and Jason Yosinski. Metropolis-hastings generative adversarial networks. arXiv preprint arXiv:1811.11357, 2018.

[47] Samaneh Azadi, Catherine Olsson, Trevor Darrell, Ian Goodfellow, and Augustus Odena. Discriminator rejection sampling. arXiv preprint arXiv:1810.06758, 2018.

[48] Chenyang Tao, Liqun Chen, Ricardo Henao, Jianfeng Feng, and Lawrence Carin. Chi-square generative adversarial network. In International Conference on Machine Learning, 2018.

[49] Yuri Burda, Roger Grosse, and Ruslan Salakhutdinov. Importance weighted autoencoders. arXiv preprint arXiv:1509.00519, 2015.

[50] Tim Salimans, Diederik Kingma, and Max Welling. Markov chain monte carlo and variational inference: Bridging the gap. In International Conference on Machine Learning, 2015.

[51] Christian A Naesseth, Scott W Linderman, Rajesh Ranganath, and David M Blei. Variational sequential monte carlo. arXiv preprint arXiv:1705.11140, 2017.

[52] Aditya Grover, Ramki Gummadi, Miguel Lazaro-Gredilla, Dale Schuurmans, and Stefano Ermon. Variational rejection sampling. In International Conference on Artificial Intelligence and Statistics, 2018.

[53] Yarin Gal and Zoubin Ghahramani. Dropout as a bayesian approximation: Representing model uncertainty in deep learning. In International Conference on Machine Learning, 2016.

[54] Balaji Lakshminarayanan, Alexander Pritzel, and Charles Blundell. Simple and scalable predictive uncertainty estimation using deep ensembles. In Advances in Neural Information Processing Systems, 2017.

[55] Matthieu Komorowski, A Gordon, LA Celi, and A Faisal.
A markov decision process to suggest optimal treatment of severe infections in intensive care. In Neural Information Processing Systems Workshop on Machine Learning for Health, 2016.

[56] Zhengyuan Zhou, Daniel Miller, Neal Master, David Scheinker, Nicholas Bambos, and Peter Glynn. Detecting inaccurate predictions of pediatric surgical durations. In International Conference on Data Science and Advanced Analytics, 2016.

[57] Aniruddh Raghu, Matthieu Komorowski, Leo Anthony Celi, Peter Szolovits, and Marzyeh Ghassemi. Continuous state-space models for optimal sepsis treatment-a deep reinforcement learning approach. arXiv preprint arXiv:1705.08422, 2017.

[58] Eric Nalisnick, Akihiro Matsukawa, Yee Whye Teh, Dilan Gorur, and Balaji Lakshminarayanan. Do deep generative models know what they don't know? arXiv preprint arXiv:1810.09136, 2018.

[59] Hyunsun Choi and Eric Jang. Generative ensembles for robust anomaly detection. arXiv preprint arXiv:1810.01392, 2018.

[60] Bradley Efron and Robert J Tibshirani. An introduction to the bootstrap. CRC press, 1994.

[61] Alexandru Niculescu-Mizil and Rich Caruana. Predicting good probabilities with supervised learning. In International Conference on Machine Learning, 2005.

[62] Chuan Guo, Geoff Pleiss, Yu Sun, and Kilian Q Weinberger. On calibration of modern neural networks. In International Conference on Machine Learning, 2017.

[63] Martín Abadi, Paul Barham, Jianmin Chen, Zhifeng Chen, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Geoffrey Irving, Michael Isard, et al. Tensorflow: a system for large-scale machine learning. In Operating Systems Design and Implementation, 2016.

[64] Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jon Shlens, and Zbigniew Wojna. Rethinking the inception architecture for computer vision.
In IEEE conference on Computer Vision and Pattern Recognition, 2016.

[65] Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. Automatic differentiation in pytorch. 2017.

[66] Oriol Vinyals, Charles Blundell, Timothy Lillicrap, Daan Wierstra, et al. Matching networks for one shot learning. In Advances in Neural Information Processing Systems, 2016.

[67] Prafulla Dhariwal, Christopher Hesse, Oleg Klimov, Alex Nichol, Matthias Plappert, Alec Radford, John Schulman, Szymon Sidor, and Yuhuai Wu. Openai baselines. GitHub, GitHub repository, 2017.

[68] Prajit Ramachandran, Barret Zoph, and Quoc V Le. Searching for activation functions. arXiv preprint arXiv:1710.05941, 2017.