{"title": "Predictive Approximate Bayesian Computation via Saddle Points", "book": "Advances in Neural Information Processing Systems", "page_first": 10260, "page_last": 10270, "abstract": "Approximate Bayesian computation (ABC) is an important methodology for Bayesian inference when the likelihood function is intractable. Sampling-based ABC algorithms such as rejection- and K2-ABC are inefficient when the parameters have high dimensions, while the regression-based algorithms such as K- and DR-ABC are hard to scale. In this paper, we introduce an optimization-based ABC framework that addresses these deficiencies. Leveraging a generative model for posterior and joint distribution matching, we show that ABC can be framed as saddle point problems, whose objectives can be accessed directly with samples. We present the predictive ABC algorithm (P-ABC), and provide a probabilistically approximately correct (PAC) bound that guarantees its learning consistency. Numerical experiment shows that P-ABC outperforms both K2- and DR-ABC significantly.", "full_text": "Predictive Approximate Bayesian Computation via\n\nSaddle Points\n\nYingxiang Yang\u2217\n\nBo Dai(cid:63)\n\n{yyang172,kiyavash,niaohe} @illinois.edu\n\nNegar Kiyavash\u2020\n\nNiao He\u2217\u2020\n\nbohr.dai@gmail.com\n\nAbstract\n\nApproximate Bayesian computation (ABC) is an important methodology for\nBayesian inference when the likelihood function is intractable. Sampling-based\nABC algorithms such as rejection- and K2-ABC are inef\ufb01cient when the parame-\nters have high dimensions, while the regression-based algorithms such as K- and\nDR-ABC are hard to scale. In this paper, we introduce an optimization-based ABC\nframework that addresses these de\ufb01ciencies. 
Leveraging a generative model for\nposterior and joint distribution matching, we show that ABC can be framed as\nsaddle point problems, whose objectives can be accessed directly with samples.\nWe present the predictive ABC algorithm (P-ABC), and provide a probabilisti-\ncally approximately correct (PAC) bound for its learning consistency. Numerical\nexperiment shows that P-ABC outperforms both K2- and DR-ABC signi\ufb01cantly.\n\nIntroduction\n\n1\nApproximate Bayesian computation (ABC) is an important methodology to perform Bayesian\ninference on complex models where likelihood functions are intractable. It is typically used in\nlarge-scale systems where the generative mechanism can be simulated with high accuracy, but a\nclosed form expression for the likelihood function is not available. Such problems arise routinely\nin modern applications including population genetics [Excof\ufb01er, 2009, Drovandi and Pettitt, 2011],\necology and evolution [Csill\u00b4ery et al., 2012, Huelsenbeck et al., 2001, Drummond and Rambaut,\n2007], state space models [Martin et al., 2014], and image analysis [Kulkarni et al., 2014].\nFormally, ABC aims to estimate the posterior distribution p(\u03b8|y) \u221d p(y|\u03b8)\u03c0(\u03b8) where \u03c0(\u03b8) is the prior\nand p(y|\u03b8) is the likelihood function that represents the underlying model. The word \u201capproximate\u201d\n(or \u201cA\u201d in the abbreviation \u201cABC\u201d) refers to the fact that the joint distribution \u03c0(\u03b8)p(y|\u03b8) is only\navailable through \ufb01tting simulated data {(\u03b8j, yj)}N\nj=1 \u223c p(y|\u03b8)\u03c0(\u03b8). Based on how the \ufb01tting is\nperformed, existing ABC methods can be summarized into two main categories: sampling- and\nregression-based algorithms.\nSampling-based algorithms. A sampling-based algorithm directly approximates the likelihood\nfunction using simulated samples that are \u201cclose\u201d to the true observations. 
This closeness between\nsimulated samples yi and the true observation y\u2217 is measured by evaluating a similarity kernel\nK\u0001(yi, y\u2217). Informative summary statistics are often used to simplify this procedure when the dimen-\nsion of y\u2217 is large, e.g., [Joyce and Marjoram, 2008, Nunes and Balding, 2010, Blum and Franc\u00b8ois,\n2010, Wegmann et al., 2009, Blum et al., 2013]. Representative algorithms in this category include\nrejection ABC, indirect score ABC [Gleim and Pigorsch], K2-ABC [Park et al., 2016], distribu-\ntion regression ABC (DR-ABC) [Mitrovic et al., 2016], expectation propagation ABC (EP-ABC)\n[Barthelm\u00b4e and Chopin, 2011], random forest ABC [Raynal et al., 2016], Wasserstein ABC [Bernton\net al., 2017], copula ABC [Li et al., 2017], and ABC aided by neural network classi\ufb01ers [Gutmann\n\u2217Department of Electrical and Computer Engineering, University of Illinois at Urbana-Champaign.\n\u2020Department of Industrial and Enterprise Systems Engineering, University of Illinois at Urbana-Champaign.\n(cid:63)Google Brain.\n\n32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montr\u00b4eal, Canada.\n\n\fet al., 2014, 2016]. The aforementioned work can be viewed under a uni\ufb01ed framework that approxi-\nmates the posterior p(\u03b8|y\u2217) using a weighted average of p(y|\u03b8)\u03c0(\u03b8) over Y with the majority of the\nmass concentrated within a small region around y\u2217:\n\n(cid:90)\n\nY\n\nK\u0001(sy, sy\u2217 )p(y|\u03b8)\u03c0(\u03b8)dy \u2248 N(cid:88)\n\ni=1\n\np\u0001(\u03b8|y\u2217) \u221d\n\n\u03b4\u03b8iK\u0001(syi, sy\u2217 ).\n\n(1)\n\nThis weighted average is then approximated using the sample average with \u03b4\u03b8i := 1[\u03b8 = \u03b8i] being\nthe indicator function, and sy being the summary statistics for y. 
In other words, sampling-based\nalgorithms reconstruct the posterior using a probability mass function where the mass distributes\nover the simulated \u03b8i \u223c \u03c0(\u03b8i), and, for each \u03b8i, is proportional to the closeness of yi \u223c p(yi|\u03b8i)\nand y\u2217. For example, when K\u0001(syi, sy\u2217 ) = 1{sy = sy\u2217} with sy = y, (1) recovers the true\nposterior asymptotically when the model parameter and the observation have discrete alphabets.\nWhen K\u0001(sy, sy\u2217 ) = 1{\u03c1(sy, sy\u2217 ) \u2264 \u0001} for some metric \u03c1, (1) reduces to rejection-ABC. When\nK\u0001(sy, sy\u2217 ) = exp(\u2212\u03c1(sy, sy\u2217 )/\u0001), (1) reduces to soft-ABC [Park et al., 2016], a variation of the\nsynthetic likelihood inference [Wood, 2010, Price et al., 2018] under the Bayesian setting. Finally,\nwhen K\u0001(sy, sy\u2217 ) is the zero/one output of a neural network classi\ufb01er, (1) reduces to ABC via\nclassi\ufb01cation [Gutmann et al., 2018].\nNote that, most of the aforementioned algorithms require summary statistics and a smoothing kernel,\nwhich introduce bias, and suffer from information loss when the summary statistics are insuf\ufb01cient.\nTo address the issue of having to select problem-speci\ufb01c summary statistics, Park et al. [2016]\nproposed K2-ABC, in which the summary statistics sy\u2217 is replaced by the kernel embedding of the\nempirical conditional distributions of p(y\u2217|\u03b8). When a characteristic kernel is selected, the kernel\nembedding of the distribution will be a suf\ufb01cient statistics, and therefore does not incur information\nloss. Apart from directly choosing kernel embedding of the conditional distribution, other approaches\nexist to help reduce bias: for example, the recalibration technique proposed by Rodrigues et al. 
[2018].\nHowever, despite their simplicity and continuous improvements, sampling-based ABC algorithms still\nsuffer from the following de\ufb01ciencies: (i) bias caused by the weighting kernel K\u0001, (ii) the potential\nneed of large sample size when the dimensions of \u03b8 and y\u2217 are large, and (iii) the need to access the\nmodel every time a new observation is given.\nRegression-based algorithms. Regression-based ABC algorithms establish regression relationships\nbetween the model parameter and the conditional distribution p(y|\u03b8) within an appropriate function\nspace F. Representative algorithms in this category include high-dimensional ABC [Nott et al., 2014],\nkernel-ABC (K-ABC) [Blum et al., 2013], and distribution-regression-ABC (DR-ABC) [Mitrovic\nIn DR-ABC, the kernel embeddings of the empirical version of the conditional\net al., 2016].\ndistribution, {\u00b5(cid:98)p(y|\u03b8i)}N\ni=1, is \ufb01rst obtained from training data, and is then used to perform distribution\n\nregression:\n\nh\u2217 = argmin\nh\u2208H\n\n1\nN\n\n(cid:12)(cid:12)h(\u00b5(cid:98)p(y|\u03b8i)) \u2212 \u03b8i\n\n(cid:12)(cid:12)2\n\nN(cid:88)\n\ni=1\n\n+ \u03bb(cid:107)h(cid:107)2H.\n\nThe algorithm then uses h\u2217 to predict the model parameter for any new set of data.\nContrary to the sampling-based algorithms, regression-based algorithms mitigate the bias introduced\nby the smoothing kernel. However, they do not provide an estimation for the posterior density.\nMeanwhile, it is often hard for such algorithms to scale. For example, the distribution regression\ninvolved in DR-ABC requires computing the inverse of an N \u00d7 N kernel matrix, which has O(N 3)\ncomputation cost as the dataset scales.\nNeither sampling- nor regression-based algorithms are satisfactory: while regression-based algorithms\nhave better performances compared to the sampling-based algorithms, they are not scalable to high\ndimensions. 
Therefore, an important question is whether one can design an algorithm that can\nperform well on large datasets? In this paper, we propose an optimization-based ABC algorithm\nthat can successfully address the de\ufb01ciencies of both sampling- and regression-based algorithms. In\nparticular, we show that ABC can be formulated under a uni\ufb01ed optimization framework: \ufb01nding the\nsaddle point of a minimax optimization problem, which allows us to leverage powerful gradient-based\noptimization algorithms to solve ABC. More speci\ufb01cally, our contributions are three-fold:\n\n\u2022 we show that the ABC problem can be formulated as a saddle point optimization through\nboth joint distribution matching and posterior matching. This approach circumvents the\n\n2\n\n\fdif\ufb01culties associated with choosing suf\ufb01cient summary statistics or computing kernel\nmatrices, as needed in K2- and DR-ABC. More critically, the saddle point objectives can be\nevaluated based purely on samples, without assuming any implicit form of the likelihood.\n\u2022 we provide an ef\ufb01cient SGD-based algorithm for \ufb01nding the saddle point, and provide a\nprobabilistically approximately correct (PAC) bound guaranteeing the consistency of the\nsolution to the problem.\n\u2022 we compare the proposed algorithm to K2- and DR-ABC. 
The experiment shows that our\nalgorithm outperforms K2- and DR-ABC signi\ufb01cantly and is close to optimal on the toy\nexample dataset.\n\n2 Approximate Bayesian Computation via Saddle Point Formulations\nWhen the likelihood function is given, the true posterior p(\u03b8|y\u2217) given observation y\u2217 can be obtained\nby optimizing the evidence lower bound (ELBO) in the space P that contains all probability density\nfunctions [Zellner, 1988],\n\nq(\u03b8)\u2208P KL(q||\u03c0) \u2212 E\u03b8\u223cq[log p(y\u2217|\u03b8)],\n\nmin\n\n(2)\n\nwhere KL denotes the Kullback-Leibler divergence: KL(q(cid:107)\u03c0) = E\u03b8\u223cq[log q(\u03b8)\n\u03c0(\u03b8) ]. When dealing with\nan intractable likelihood, this conventional optimization approach cannot work without combining it\nwith methods that \ufb01t p(y\u2217|\u03b8) with samples. In this paper, we introduce a new class of saddle point\noptimization objectives that allow the learner to directly leverage the samples from the likelihood\np(y\u2217|\u03b8), which is available under the ABC setting, for estimating the posterior. The method we\npropose does not merely \ufb01nd \u03b8\u2217 = argmax\u03b8 p(\u03b8|y\u2217) for any given observation y\u2217, but rather \ufb01nds\nthe optimal p(\u03b8|y) as a function of both \u03b8 and y, or a representation of \u03b8 generated from p(\u03b8|y)\nusing a transportation reparametrization \u03b8 = f (y, \u03be) for any data y (an idea inspired by Kingma and\nWelling [2013]). We introduce our method below.\n\n2.1 Saddle Point Objectives\nJoint distribution matching. Recall that p(y|\u03b8)\u03c0(\u03b8) = p(\u03b8|y)p(y). A natural idea for estimating\nthe posterior is to match the empirical joint distributions, given the availability of sampling from the\nproduct of the prior distribution and the model, p(y|\u03b8)\u03c0(\u03b8), and from the product of the estimated\nposterior distribution and the marginal q(\u03b8|y)p(y). 
Using an f-divergence associated with some\n\nconvex function \u03bd, de\ufb01ned by D\u03bd(p1, p2) =(cid:82) p2(x)\u03bd (p1(x)/p2(x)) dx, as our loss function, we\n\nhave the following divergence minimization problem for ABC:\n\np(\u03b8|y) = argmin\nq(\u03b8|y)\u2208P\n\nD\u03bd (p(y|\u03b8)\u03c0(\u03b8), q(\u03b8|y)p(y)) .\n\n(3)\nThis problem aims to recover the optimal posterior within the space of density functions P. Ideally,\nif P is large enough such that p(\u03b8|y) \u2208 P, then (3) recovers the true posterior distribution.\nHowever, the above optimization problem is still dif\ufb01cult to solve since D\u03bd is nonlinear with respect\nto q(\u03b8|y). This nonlinearity makes gradient computation hard as the computation of the f-divergence\nstill requires the value of p(y|\u03b8), which is not available under the ABC setting, and cannot be\ncomputed directly through samples obtained from the joint distribution.\nIn order to make the\nobjective accessible through samples, we apply Fenchel duality and the interchangeability principle\nas introduced in Dai et al. [2017], which yield an equivalent saddle point reformulation. We state the\ndetailed procedure in the following proposition.\nProposition 1. The divergence minimization (3) is equivalent to the following saddle point problem:\n(4)\n\nu(\u03b8,y)\u2208U \u03a6(f, u) := E(\u03b8,y)\u223cp(y|\u03b8)\u03c0(\u03b8) [u(\u03b8, y)] \u2212 E\u03b8\u223cq(\u03b8|y),y\u223cp(y) [\u03bd\u2217 (u(\u03b8, y))] ,\n\nq(\u03b8|y)\u2208P max\nmin\n\nwhere U is a function space containing u\u2217(\u03b8, y) = \u03bd(cid:48)( p(y|\u03b8)p(\u03b8)\n\np(\u03b8|y)p(y) ) and \u03bd\u2217 is the Fenchel dual of \u03bd.\n\nProof. 
Step 1: By the de\ufb01nition of f-divergence,\n\nD\u03bd(p(y|\u03b8)\u03c0(\u03b8)(cid:107)q(\u03b8|y)p(y)) = Eq(\u03b8|y)p(y)\n\n3\n\n(cid:20)\n\n\u03bd\n\n(cid:18) p(y|\u03b8)\u03c0(\u03b8)\n\nq(\u03b8|y)p(y)\n\n(cid:19)(cid:21)\n\n.\n\n\fStep 2: Apply Fenchel duality \u03bd(x) = supu(ux \u2212 \u03bd\u2217(x)) and obtain\n\nD\u03bd(p(y|\u03b8)\u03c0(\u03b8)(cid:107)q(\u03b8|y)p(y)) = Eq(\u03b8|y)p(y)\n\nStep 3: The interchangeability principle in Dai et al. [2017] suggests\n\nD\u03bd(p(y|\u03b8)\u03c0(\u03b8)(cid:107)q(\u03b8|y)p(y)) = sup\nu\u2208U\n\nEq(\u03b8|y)p(y)\n\n(cid:20)\n\nu\n\nsup\n\nu \u00b7 p(y|\u03b8)\u03c0(\u03b8)\nq(\u03b8|y)p(y)\n(cid:20)\nu(\u03b8, y) \u00b7 p(y|\u03b8)\u03c0(\u03b8)\nq(\u03b8|y)p(y)\n\n(cid:21)\n\n\u2212 \u03bd\u2217(u)\n\n.\n\n(cid:21)\n\n\u2212 \u03bd\u2217(u(\u03b8, y))\n\n.\n\nStep 4: By change of measure, we have (3) equivalent to (4).\n\nThe class of f-divergence covers many common divergences, including the KL divergence, Pearson\n\u03c72 divergence, Hellinger distance, and Jensen-Shannon divergence. Apart from f-divergences, we\ncan also employ other metrics to measure the distance between p(y|\u03b8)\u03c0(\u03b8) and p(\u03b8|y)p(y), e.g., the\nWasserstein distance. If the training data come with labels, we can also choose the objective function\nto be the mean square error between the label and the maximum a posterior estimate from p(\u03b8|y). 2\nFrom a density ratio estimation perspective, the optimal solution of the dual variable, u(\u03b8, y), is a\ndiscriminator that distinguishes the true and estimated joint distributions by computing their density\nratios, which is related to the ratio matching in Mohamed and Lakshminarayanan [2016].\nPosterior matching. Another way to learn the posterior representation is by directly matching the\nposterior distributions. 
Similar to the objective function de\ufb01ned in K-ABC, we have\n\n(cid:2)(E\u03b8|y[h(\u03b8)] \u2212 E\u03b8\u223cq(\u03b8|y)[h(\u03b8)])2(cid:3) .\n\nEy\n\n(5)\n\nq(\u03b8|y)\u2208P max\nmin\nh(\u03b8)\u2208H\n\nDirectly solving the optimization (5) is dif\ufb01cult due to the inner conditional expectation, but a\nsaddle point formulation can be obtained by applying the same technique we used to obtain (4) (see\nAppendix C for detailed derivations):\n\nq(\u03b8|y)\u2208P max\nmin\nh(\u03b8)\u2208H\nv(y)\u2208V\n\nE(\u03b8,y)\u223cp(y|\u03b8)\u03c0(\u03b8) [v(y)h(\u03b8)] \u2212 E(\u03b8,y)\u223cq(\u03b8|y)p(y) [v(y)h(\u03b8)] \u2212 1\n4\n\nEy\n\n(cid:2)v2(y)(cid:3)\n\n(6)\n\nwhere V is the entire space of functions on Y. The resulting saddle point objective (6) is much easier\nto solve than (5) and stochastic gradient-based methods could be applied in particular.\n2.2 Representations of u(\u03b8, y) and q(\u03b8|y)\nUnder the most general setting where P and U are closed and bounded function spaces, the saddle\npoint objective (4) is convex-concave. Practically, different representation methods can be used for\nu(\u03b8, y) and q(\u03b8|y), for which different optimization techniques can be applied to solving (4). Below,\nwe discuss several commonly used options.\nGaussian mixtures. Consider the following Gaussian mixture representation for q(\u03b8|y) and u(\u03b8, y):\n\nq(\u03b8|y) =\n\nc(q)\ni\n\n(y) \u00b7 N (\u00b5(q)\n\ni\n\n, \u03a3(q); \u03b8)\n\nand u(\u03b8, y) =\n\nc(u)\ni\n\n\u00b7 N (\u00b5(u)\n\ni\n\n, \u03a3(u); (\u03b8, y)).\n\n(7)\n\ni=1\n\ni=1\n\nThe coef\ufb01cients c(u)\nm are positive real numbers while c(q)\ncoef\ufb01cients. A simple way to guarantee that the summation of c(q)\nthat they take the form of softmax functions:\n\n1 , . . . , c(u)\n\ni\n\n1 (y), . . . 
, c(q)\n\nm (y) are y-dependent\n(y) is one for any y is to assume\n\nm(cid:88)\n\nm(cid:88)\n\nc(q)\ni\n\n(y) =\n\n(cid:80)m\nexp([1, y(cid:62)] \u00b7 c(q)\n)\nj=1 exp([1, y(cid:62)] \u00b7 c(q)\nj )\n\ni\n\n\u2200i \u2208 {1, . . . , m},\n\n,\n\n(8)\n\ni\n\nand concave for c(u)\n\n1 = 0. This makes (4) convex for c(q)\n\nwith c(q)\nReparametrization. When the dimensions of \u03b8 and y increase, the conditional distribution q(\u03b8|y)\nquickly becomes dif\ufb01cult to represent using parametric models. An effective way to implicitly\nrepresent q(\u03b8|y) is to use a sampler f (\u03be, y) \u2208 F for a function space F, in which \u03b8 is sampled\nusing \u03b8 = f (\u03be, y) using a pre-determined distribution \u03be \u223c p0(\u03be). This idea is inspired by the\nreparametrization technique used in variational autoencoders (VAEs) and neural networks. In our\ncase, both f and u can be represented using functions in reproducing kernel Hilbert spaces (RKHSs)\nor neural networks.\n\n.\n\ni\n\n2Table 2 in Appendix provides some examples of divergences and the derivation of their corresponding\n\nsaddle point objectives.\n\n4\n\n\fsup\n\n(cid:107)h(cid:107)H\u22641\n\nsup\n\n(cid:107)h(cid:107)H\u22641\n\n= E\u03b8,Y\n\nE\u03b8,y\n\n(cid:2)(cid:107)k(\u00b7, \u03b8) \u2212 C(y)(cid:107)2(cid:3) .\n\n(cid:107)h(cid:107)H\u22641\n\n2.3 Discussions\nThe saddle point framework is closely related to both regression- and GAN-based ABC algorithms.\nRelationship with regression-based ABC algorithms. Regression-based ABC algorithms, such as\nK-ABC, aim to compute the conditional expectation of the posterior by \ufb01nding its conditional kernel\nembedding C(y) : Y \u2192 H in an RKHS. 
With such parametrization, the objective (5) becomes\n\nmin\nC:Y\u2192H L(C) := sup\n(cid:107)h(cid:107)H\u22641\n\nEy[(E\u03b8|y[h(\u03b8)] \u2212 (cid:104)h, C(y)(cid:105)H)2].\n\nThis problem is further relaxed to a distribution regression problem by swapping the square operator\nwith the inner expectation, which leads to minimizing E\u03b8,y[(cid:107)K(\u00b7, \u03b8) \u2212 C(y)(cid:107)2], an upper bound of\nL(C). Speci\ufb01cally, we have\n\n(cid:104)(cid:0)E\u03b8|y [h(\u03b8)] \u2212 (cid:104)h, C(y)(cid:105)H(cid:1)2(cid:105) \u2264 sup\n(cid:2)(cid:104)h, k(\u00b7, \u03b8) \u2212 C(y)(cid:105)2(cid:3) \u2264 sup\n\nEy\n(cid:107)h(cid:107)H\u22641\n(cid:107)h(cid:107)2H E\u03b8,y\n\n(cid:104)(cid:0)E\u03b8|y [(cid:104)h, k(\u00b7, \u03b8)(cid:105)] \u2212 (cid:104)h, C(y)(cid:105)H(cid:1)2(cid:105)\n(cid:2)(cid:107)k(\u00b7, \u03b8) \u2212 C(y)(cid:107)2(cid:3)\n\nEy\n\n\u2264\n\nIn contrast, the proposed optimization framework for posterior matching does not restrict h \u2208 H.\nMoreover, the saddle point objective (6) is an exact reformulation of (5), rather than an upper bound.\nRelationship with GAN-based ABC algorithms. GAN-based algorithms leverage the represen-\ntation power of the neural networks to optimize the ELBO. One example is the use of variational\nautoencoder (VAE), where both q and p in (2) are represented by Gaussian distributions parameter-\nized by neural networks. Better performances have been observed in Mescheder et al. [2017] by\nembedding the optimal value of q(\u03b8|y) as the optimal solution of a real-valued discriminator network,\nequivalent to performing reparametrization. However, compared to the saddle point formulation,\nMescheder et al. [2017] requires computing an additional layer of optimization due to the embedding\nperformed. 
Meanwhile, when the underlying parameter is discrete, the saddle point formulation can\nbe viewed as a special case of conditional GAN (CGAN) [Mirza and Osindero, 2014].\n3 Algorithm and Theory\nIn this section, we introduce a concrete algorithm named predictive-ABC (P-ABC) that solves\nthe \ufb01nite-sample approximation (i.e., empirical risk) of the saddle point problem. For the sake of\npresentation, we consider the empirical risk of (4), where the empirical expectations are taken over\nN samples {(\u03b8i, yi)}N\n\ni=1:\n\nu(\u03b8,y)\u2208U(cid:98)\u03a6N (q, u) :=(cid:98)E(\u03b8,y)\u223cp(y|\u03b8)\u03c0(\u03b8)u(\u03b8, y) \u2212(cid:98)Ey\u223cp(y)\n\nmin\n\nq(\u03b8|y)\u2208P max\n\n(cid:8)E\u03b8\u223cq(\u03b8|y)[\u03bd\n\n(u(\u03b8, y))|y](cid:9) .\n\n\u2217\n\n(9)\n\nWe denote the optimal solution as q\u2217\nN . In the following, we \ufb01rst introduce a general form of\nP-ABC, followed by customizations to different representation methods for q(\u03b8|y) and u(\u03b8, y). We\nthen derive a probabilistically approximately correct (PAC) learning bound on the statistical error\n\nN and u\u2217\n\n\u0001N = D\u03bd(p(y|\u03b8)\u03c0(\u03b8), q\u2217\n\nN (\u03b8|y)p(y)) \u2212 D\u03bd(p(y|\u03b8)\u03c0(\u03b8), q\u2217(\u03b8|y)p(y)),\n\nwhich holds for closed and bounded function spaces P and U in general, with q\u2217 and u\u2217 denoting the\nsolution to (4). Lastly, we present the convergence results of P-ABC. For representations of q and u\nsuch that the objective function is convex-concave, e.g. Gaussian mixture representations, we present\nthe convergence of Algorithm 1. For the representation using reparametrization and neural networks,\nthe convergence behavior of P-ABC remains largely an open problem.\n3.1 The P-ABC Algorithm\nWe introduce P-ABC for solving (9), the empirical counterpart of (4), in Algorithm 1. This algo-\nrithm, in its general form, performs iterative updates to q and u using \ufb01rst-order information. 
The\ncomputation of stochastic gradients under the representations presented in Section 2 can be found in\nAppendix A.\n3.2 Theoretical Properties\nLearning bound. By invoking the tail inequality in Antos et al. [2008] and the \u0001-net argument, we\nhave the following theorem, the proof of which can be found in Appendix B.\n\n5\n\n\fAlgorithm 1 Predictive ABC (P-ABC)\n\nk=1, samples {(\u03b8i, yi)}N\n\nRandomly select (\u03b8k, yk) \u2208 {(\u03b8i, yi)}N\n\nInput: maximum number of iterations T . Prior distribution \u03c0(\u03b8), model p(y|\u03b8). Step sizes\nk}T\nk=1 and {\u03b7q\n{\u03b7u\nk}T\nInitialize: q1, u1.\nfor k = 1 to T do\nCompute stochastic gradients of (cid:98)\u03a6N (qk, uk), denoted by \u2207q(cid:98)\u03c6k(qk, uk) and \u2207u(cid:98)\u03c6k(qk, uk),\nusing (\u03b8k, yk,(cid:101)\u03b8k).\nUpdate q: qk+1 \u2190 ProjP (qk \u2212 \u03b7q\nUpdate u: uk+1 \u2190 ProjU (uk + \u03b7u\n(cid:80)T\n(cid:80)T\n\ni=1, objective function(cid:98)\u03a6N .\ni=1, sample(cid:101)\u03b8k \u223c qk(\u03b8|yk).\nk \u00b7 \u2207q(cid:98)\u03c6k(qk, uk)).\nk \u00b7 \u2207u(cid:98)\u03c6k(qk, uk)).\n(cid:80)T\n(cid:80)T\n\nend for\nOutput: \u00afqT =\n\n, and \u00afuT =\n\n.\n\nk=1 \u03b7u\nk uk\nk=1 \u03b7u\nk\n\nk=1 \u03b7q\nkqk\nk=1 \u03b7q\n\nk\n\nTheorem 1. Suppose {(\u03b8i, yi)}N\ni=1 is a \u03b2-mixing sequence 3 with \u03b2m \u2264 \u00af\u03b2 exp(\u2212bm\u03ba) for constants\n\u00af\u03b2, b and \u03ba, and suppose that function class U \u00d7 P has a \ufb01nite pseudo dimension D. 4 In addition,\nsuppose that u \u2208 [\u2212Cu, Cu] and the Fenchel dual satis\ufb01es \u03bd\u2217(u) \u2264 C\u03bd. 
Then, with probability 1\u2212 \u03b4,\n\n(cid:115)\n\n\u0001N \u2264\n\nC1(max(C1/b, 1))1/\u03ba\n\n,\n\nC2N\n2 e\u03b4\u22121 + [log(2 max(16e(D + 1)C\n\nD\n2\n\nwhere C1 = log N D\nTheorem 1 applies to all the formulations we introduced in Section 2, for which learning is consistent\nat a rate of O(N\u22121/2 log N ), with N being the number of samples, when the empirical saddle point\napproximation can be exactly solved. Below, we discuss the convergence of Algorithm 1.\nConvergence of P-ABC. From a theoretical perspective, global convergence of \ufb01rst-order methods\n\nsuch as stochastic gradient descent (SGD) can be achieved when the objective function(cid:98)\u03a6N is convex-\n\n2 , \u00af\u03b2))]+ and C2 = (512(C\u03bd + Cu)2)\u22121.\n\nconcave. For example, when u and q are Gaussian mixtures or belong to RKHSs. More often than\nnot, the objective function is not convex-concave, for which stochastic gradient descent (SGD) based\nalgorithms are only guaranteed to converge towards a stationary point in certain restricted cases\n[Sinha et al., 2017, Li and Yuan, 2017, Kodali et al., 2018]. Below, we provide the convergence\n\nresults for Algorithm 1 when(cid:98)\u03a6N is convex-concave.\n\nConsider the standard metric for evaluating the quality of any pair of estimates \u00afqT and \u00afuT :\n\nu\u2208U (cid:98)\u03a6N (\u00afqT , u) \u2212 min\n\nq\u2208P(cid:98)\u03a6N (q, \u00afuT ).\n\n\u03b5(\u00afqT , \u00afuT ) = max\n\nWe have the following result (See Appendix D for proof).\nTheorem 2 (Convergence of P-ABC). Suppose that P and U are closed and bounded function spaces\n\nwith diameters DP and DU, respectively. Let(cid:98)\u03a6N be convex-concave and LN -Lipschitz. 
Then, for\n\nthe outputs of Algorithm 1 with T iterates and whose step sizes satisfy \u03b7q\n\nk = \u03b7u\n\nk = \u03b7k, we have\n\nE [\u03b5(\u00afqT , \u00afuT )] \u2264 D2P + D2U +(cid:80)T\n2(cid:80)T\n\nk=0 \u03b7k\n\nk=0 2\u03b72\n\nkL2\nN\n\n.\n\nRKHSs, in which case(cid:98)\u03a6N is convex-concave. It suggests that if the sequence of step sizes satis\ufb01es\nTheorem 2 applies to the cases when P and U are spaces for Gaussian mixture coef\ufb01cients or\n(cid:80)\u221e\nk=1 \u03b7k = \u221e and(cid:80)\u221e\nk < \u221e, then limT\u2192\u221e \u03b5(\u00afqT , \u00afuT ) = 0. In this case, we choose \u03b7k =\n\u0398(k\u22121/2. Together with Theorem 1, we know that the overall error, contributed by the summation\nof the learning error and the optimization error, can be bounded by O(N\u22121/2 log N ) upon selecting\nT = \u0398(N ).\n\nk=1 \u03b72\n\n3A discrete time stochastic process is mixing if widely separated events are asymptotically independent.\nHere, \u03b2m provides an upper bound on the dependency of two events separated by n intervals of time. See Meir\n[2000] for a detailed de\ufb01nition.\n\n4Pseudo dimension, also known as the Pollard dimension, is a generalization of VC dimension to the function\n\nclass (see chapter 11 of Anthony and Bartlett [2009]).\n\n6\n\n\fyi = \u22120.25\n\nFigure 1: Empirical distribution of q(\u03b8|yi) induced by the histogram of f (yi, \u03be) computed from 1E4\ni.i.d. samples of \u03be \u223c p0(\u03be) for speci\ufb01c choices of yi.\n\nyi = 0\n\nyi = 0.25\n\n4 Numerical Experiment\nWe test the performance of P-ABC and compare the result with K2- and DR-ABC as representatives\nfrom sampling- and regression-based ABC algorithms.\n\n4.1 Synthetic Dataset I: Superposition of Uniform Distributions\nConsider \u03b8 \u2208 Rd and \u03c0(\u03b8) = 1{\u03b8 \u2208 [\u22120.5, 0.5]d}. Let p(y|\u03b8) be speci\ufb01ed by y = \u03b8 + u with u\nuniformly distributed over [\u22120.5, 0.5]d. 
The training samples are {(\u03b8i,{yij}M\ni=1 generated with\nindependent \u03b8i\u2019s and uij\u2019s. Denote Y = {Yi}N\nj=1, and for any y and \u03b8, denote\n(cid:21)\nthe k-th coordinate of them as y[k] and \u03b8[k], respectively. The posterior can then be written as\nj\u2208{1,...,M} yij[k] + 0.5}\n\nj\u2208{1,...,M} yij[k] \u2212 0.5} \u2264 \u03b8i[k] \u2264 min{0.5, min\n\np(\u03b8i|Yi) \u221d d(cid:89)\n\ni=1 with Yi = {yij}M\n\nmax{\u22120.5, max\n\nj=1}N\n\n(cid:20)\n\n1\n\n,\n\nk=1\n\nwhich is a uniform distribution whose boundary on the k-th dimension is characterized by the values\nof the maximum and minimum values of the k-th coordinate among all yij\u2019s. Due to the fact that\nK2- and DR-ABC evaluate their performances using the mean square error, we use predictive ABC\n(P-ABC) to \ufb01nd the optimal minimum mean square error (MMSE) estimator for \u03b8i\u2019s. We denote the\n\noptimal estimator for \u03b8i as(cid:98)\u03b8opt\n(cid:98)\u03b8opt\nfor all k \u2208 {1, . . . , d}. A sub-optimal estimator for this example is(cid:98)\u03b8ave\n\n(cid:0)maxj\u2208{1,...,M} yij[k] + minj\u2208{1,...,M} yij[k](cid:1) ,\n\n2 \u00b7 maxj\u2208{1,...,M} yij[k],\n2 \u00b7 minj\u2208{1,...,M} yij[k],\n\n\uf8f1\uf8f2\uf8f3\n\n[k] =\n\n1\n2\n\n1\n\n1\n\ni\n\ni\n\nminj\u2208{1,...,M} yij[k] \u2265 0\nmaxj\u2208{1,...,M} yij[k] \u2264 0\n\nminj yij[k] \u2264 0 \u2264 maxj yij[k]\n\ni = M\u22121(cid:80)M\n\n,\n\n, which has a closed form solution with the k-th coordinate being\n\nwhich(cid:98)\u03b8opt\n\nj=1 yij, which\nexploits the information that the expectation of the noise is a zero vector. We include these two\nclosed-form estimators in our benchmarks in addition to K2- and DR-ABC.\nThe scalar case. We \ufb01rst examine the case when d = 1 and M = 1. That is, when \u03b8i \u2208 R and when\neach \u03b8i only corresponds to one yi. 
We compare the performance of P-ABC, obtained under the posterior matching objective with the reparametrization representation of $q(\theta|y)$, where $f(y, \xi)$ and $u(\theta, y)$ are fully connected neural networks, to that of the theoretically optimal estimator, for which $\hat\theta_i^{\mathrm{opt}} = y_i/2$. We train the neural networks using $N = 1000$ samples. Each neural network contains two fully connected layers of size 8 with exponential linear unit (ELU) activations, and the final output layer for $f$ is activated by the hyperbolic tangent. We choose $\xi \in \mathbb{R}$ with $p_0(\xi) \propto \mathbf{1}\{\xi \in [-1, 1]\}$, and use a learning rate of $10^{-4}$. After $2\times10^5$ iterations, P-ABC achieves 0.0413 mean square error (MSE) on the training set and 0.0416 MSE on the test set. The theoretically optimal MSE, obtained using $\hat\theta_i^{\mathrm{opt}}$, is 0.0411 on the test set. Since $f$ does not directly expose the posterior distribution, we plot a histogram in Figure 1 by evaluating $f(y, \xi)$ under $10^4$ draws of $\xi$ for different values of $y$. The support of the empirical distribution is a very small interval near $y_i/2$, demonstrating that the output of P-ABC is nearly optimal. By comparison, since only one observation is available, K2- and DR-ABC do not output meaningful results, as the computation of the maximum mean discrepancy (MMD) statistic requires at least two observations. Lastly, for other choices of $M$ and $d$, the training and testing errors are reported in Table 1. The results show that the performance of P-ABC is closer to the theoretical optimum than that of DR- and K2-ABC.
Performance under higher dimensions. We examine the performance of P-ABC when the dimension of the model parameter is higher. For illustration purposes, we choose $d \in \{1, 16, 128, 256\}$ and assume $M = 10$, i.e., $Y_i$ contains 10 samples for each parameter value. Once again, we use neural networks to represent $f$ and $u$ in P-ABC, which we train with $N = 1000$ sets of samples. To reduce the input dimension of the neural networks, for each input set of samples $Y_i$, we set $f(Y_i, \xi) = \frac{1}{M}\sum_{j=1}^M f(y_{ij}, \xi)$ and $u(\theta, Y_i) = \frac{1}{M}\sum_{j=1}^M u(\theta, y_{ij})$. More specifically, rather than taking the entire set of samples as the input, the neural network representing $f$ takes each sample individually and uses the average as the final value of $f(Y_i, \xi)$. Under this setting, after $2\times10^5$ iterations, the obtained results are as shown in Table 1. P-ABC outperforms both K2- and DR-ABC in all four cases, and when the dimension of $\theta_i$ is small, the performance of P-ABC is close to that of $\hat\theta_i^{\mathrm{ave}}$.

MSE       P-ABC [test, train]   K2-ABC   DR-ABC   $\hat\theta^{\mathrm{opt}}$   $\hat\theta^{\mathrm{ave}}$
d = 1     [0.009, 0.010]        0.011    0.083    0.003                         0.008
d = 16    [0.182, 0.155]        1.283    1.143    0.050                         0.134
d = 128   [2.749, 1.793]        21.478   10.730   0.409                         1.064
d = 256   [4.266, 1.399]        41.830   21.324   0.818                         2.119

Table 1: MSE for estimating the model parameter with different dimensions using K2-, DR- and P-ABC. For K2- and DR-ABC, we set $\epsilon = 0.01$ when computing the MMD. For P-ABC, the hidden layer sizes are 8, 32, 128, 256 for the different values of $d$, and the dimension of $\xi$ is 1, 4, 4, 4, respectively.

4.2 Synthetic Dataset II: Gaussian Mixtures
Consider a model where $\theta \in \mathbb{R}$ and $\pi(\theta) = \mathbf{1}\{\theta \in [-0.5, 0.5]\}$. The likelihood function $p(y|\theta)$ is a Gaussian mixture: $p(y|\theta) = (0.5 + \theta)\,\mathcal{N}(y; -1, 1) + (0.5 - \theta)\,\mathcal{N}(y; 1, 1)$.
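This mixture likelihood admits a one-line simulator: draw a component with probability $0.5+\theta$ versus $0.5-\theta$, then add unit Gaussian noise around the chosen mean. A minimal sketch (the function and variable names are our own):

```python
import numpy as np

def sample_mixture(theta, M, rng):
    """Draw M observations from p(y|theta): N(-1, 1) with probability
    0.5 + theta, and N(1, 1) otherwise."""
    z = rng.random(M) < 0.5 + theta      # component indicator
    means = np.where(z, -1.0, 1.0)
    return means + rng.standard_normal(M)

# A training set of the size used in the text: N = 4000 parameter draws
# from the uniform prior on [-0.5, 0.5], M = 250 observations per draw.
rng = np.random.default_rng(0)
thetas = rng.uniform(-0.5, 0.5, size=4000)
Y = np.stack([sample_mixture(t, 250, rng) for t in thetas])
```

Note that $\mathbb{E}[y|\theta] = -(0.5+\theta) + (0.5-\theta) = -2\theta$, so the sample mean of each $Y_i$ already identifies $\theta$; the experiment tests whether each method recovers this structure without being told.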
In this example, we compare the performance of K2-, DR-, EP-, and the proposed P-ABC. For P-ABC, we adopt the same network structures for the neural networks representing $f$ and $u$ as in the previous example, and train them with $N = 4000$ sets of samples. Each set contains $M = 250$ samples corresponding to the same model parameter. The same setting is used for evaluating the benchmarks. P-ABC achieves an MSE of 0.004, while EP-ABC achieves an MSE of 0.06.^5 During the implementation, we noticed that EP-ABC requires a Cholesky factorization in each iteration, which is computationally expensive and particularly sensitive to initialization. In fact, the run time of EP-ABC (10 sets of samples per minute) is significantly longer than that of P-ABC (200 sets of samples per minute). K2- and DR-ABC, by comparison, were unable to produce results for 100 sets of samples within 1 hour. This experiment demonstrates the efficiency of the P-ABC algorithm.
Discussions. Although P-ABC demonstrates superior numerical performance over the benchmarks, it shares some deficiencies with the other existing ABC algorithms. One such deficiency is that the algorithm is prone to mismatched priors. To see this, we plot the histogram of $f(y, \xi)$ with $y$ sampled from the model with the model parameter set to 0, $\xi$ sampled from a uniform distribution over $[-1, 1]$, and P-ABC trained on a mismatched prior. We skew the prior by substituting $\{\theta_i, y_i\}_{i=1}^N$ in Algorithm 1 with $\{\bar\theta_i, \bar y_i\}_{i=1}^N$, where $\bar\theta = (\theta + a)/(2a + 1)$ and $\bar y_i$ is the output of the model under $\bar\theta_i$.
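The effect of this substitution is easy to quantify: the map $\bar\theta = (\theta + a)/(2a + 1)$ sends the prior support $[-0.5, 0.5]$ to $[(a-0.5)/(2a+1),\,(a+0.5)/(2a+1)]$, an interval that shrinks toward $0.5$ as $a$ grows. A quick numerical check (a sketch; the function name is ours):

```python
def skew(theta, a):
    """The prior-mismatch map: theta_bar = (theta + a) / (2a + 1)."""
    return (theta + a) / (2 * a + 1)

# Support of the mismatched training prior for the values of a in the text.
for a in (0, 1, 10):
    lo, hi = skew(-0.5, a), skew(0.5, a)
    print(f"a = {a:2d}: mismatched prior supported on [{lo:.3f}, {hi:.3f}]")
```

For $a = 0$ the support is the true $[-0.5, 0.5]$; for $a = 1$ it is $[0.167, 0.5]$, and for $a = 10$ it is $[0.452, 0.5]$, which is why the estimates drift away from the true parameter as $a$ increases.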
This transformation introduces a bias between the true prior and the prior used for training, and, as can be seen in Figure 2, the range of the parameter estimated by P-ABC shifts away from the true model parameter as the value of $a$ increases.

Figure 2: Impact of an improper prior on P-ABC, shown for $\theta = 0$ with $a = 0$, $a = 1$, and $a = 10$. Consider finding uniformly distributed $\theta \sim U[-0.5, 0.5]$ from $y$ with $p(y|\theta) = (0.5 + \theta)\,\mathcal{N}(y; -1, 1) + (0.5 - \theta)\,\mathcal{N}(y; 1, 1)$. Improper priors are obtained by $\tilde\theta = (\theta + a)/(2a + 1)$ with $a = 1, 10$. We see that training on an improper prior injects bias into the output of P-ABC.

4.3 Ecological Dynamic System
Time series observations are an important application scenario for ABC. In this experiment, we compare the performance of K2-, DR- and P-ABC on the example of an ecological dynamic system studied in the previous literature (see Park et al. [2016] for example). The population dynamics follow the relationship
$$y_{t+1} = P\, y_{t-\tau} \exp\left(-\frac{y_{t-\tau}}{y_0}\right) e_t + y_t \exp(-\delta \epsilon_t),$$
with noise $e_t \sim \Gamma(\sigma_p^{-2}, \sigma_p^2)$ and $\epsilon_t \sim \Gamma(\sigma_d^{-2}, \sigma_d^2)$: an evolution dynamics parametrized by the vector $\theta = (P, y_0, \sigma_d^2, \sigma_p^2, \tau, \delta) \in \mathbb{R}_+^6$. Let $Y = (y_1, \dots, y_t)$ denote the set of samples that contains the population size data up to time $t$. We sample each dimension of $\log\theta$ from a uniform distribution on $[-5, 2]$, and set $\tau = \lceil\tau\rceil$.

^5 Per the implementation of the code made available online by Barthelmé and Chopin [2011].

For P-ABC, we implement a recurrent neural network (RNN) with long short-term memory (LSTM) cells to capture the dynamics of the underlying time series.
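For reference, the population dynamics above can be simulated directly. The sketch below follows the unit-mean Gamma convention (shape $\sigma^{-2}$, scale $\sigma^2$) of Wood [2010]; the function name and the flat initial history are our assumptions:

```python
import numpy as np

def simulate_population(P, y0, sig2_p, sig2_d, tau, delta, T, rng):
    """Generate T population sizes y_1, ..., y_T from the dynamics above."""
    lag = int(np.ceil(tau))                   # tau is ceiled, as in the text
    y = np.ones(T + lag)                      # flat initial history (assumed)
    for t in range(lag, T + lag - 1):
        e = rng.gamma(1.0 / sig2_p, sig2_p)   # unit-mean recruitment noise e_t
        eps = rng.gamma(1.0 / sig2_d, sig2_d) # unit-mean survival noise eps_t
        y[t + 1] = (P * y[t - lag] * np.exp(-y[t - lag] / y0) * e
                    + y[t] * np.exp(-delta * eps))
    return y[lag:]
```

Running this with $t = 30$, as in the experiment, produces one training sequence per parameter draw.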
The output of the LSTM cell is then plugged into a fully connected layer along with $\theta$ or $\xi$. The structures of the neural networks representing $f$ and $u$ are shown in Figure 4 in Appendix E.
When training P-ABC and the benchmarks, we set $t = 30$ and use $N = 1000$ sets of samples. For P-ABC, we set $\xi \in \mathbb{R}^4$, the size of the LSTM cells to 32, and the size of the fully connected layer to 16. For K2- and DR-ABC, the samples within $Y$ are regarded as i.i.d. The obtained results are shown in Figure 3, with the vertical axis denoting the MSE of the estimated parameter. P-ABC outperforms K2-ABC and DR-ABC in all aspects: the MSE is 12.9 for P-ABC, 24.7 for K2-ABC, and 16.4 for DR-ABC. In addition, P-ABC has the lowest mean and quartiles, and performs better on the outliers.

Figure 3: Statistics of MSEs for P-, K2- and DR-ABC trained on 1000 sequences of length 30.

5 Conclusion

In this paper, we presented a unifying optimization framework for ABC, named Predictive-ABC, under which we showed that ABC can be formulated as a saddle point problem for different objective functions. We presented a high-probability error bound that decays at the rate $O(N^{-1/2}\log N)$, with $N$ being the number of samples, and a stochastic-gradient-descent-based algorithm, P-ABC, to find the solution. In practice, P-ABC significantly outperforms K2- and DR-ABC, representatives of the state-of-the-art sampling- and regression-based algorithms, respectively.

Acknowledgement
This work was supported in part by MURI grant ARMY W911NF-15-1-0479, ONR grant W911NF-15-1-0479, NSF CCF-1755829 and NSF CMMI-1761699.

References
Martin Anthony and Peter L Bartlett. Neural Network Learning: Theoretical Foundations. Cambridge University Press, 2009.

András Antos, Csaba Szepesvári, and Rémi Munos.
Learning near-optimal policies with Bellman-residual minimization based fitted policy iteration and a single sample path. Machine Learning, 71(1):89-129, 2008.

Simon Barthelmé and Nicolas Chopin. ABC-EP: Expectation propagation for likelihood-free Bayesian computation. In ICML, pages 289-296, 2011.

Peter L Bartlett, Nick Harvey, Chris Liaw, and Abbas Mehrabian. Nearly-tight VC-dimension and pseudodimension bounds for piecewise linear neural networks. arXiv preprint arXiv:1703.02930, 2017.

Espen Bernton, Pierre E Jacob, Mathieu Gerber, and Christian P Robert. Inference in generative models using the Wasserstein distance. arXiv preprint arXiv:1701.05146, 2017.

Michael GB Blum and Olivier François. Non-linear regression models for approximate Bayesian computation. Statistics and Computing, 20(1):63-73, 2010.

Michael GB Blum, Maria Antonieta Nunes, Dennis Prangle, Scott A Sisson, et al. A comparative review of dimension reduction methods in approximate Bayesian computation. Statistical Science, 28(2):189-208, 2013.

Katalin Csilléry, Olivier François, and Michael G B Blum. abc: An R package for approximate Bayesian computation (ABC). Methods in Ecology and Evolution, 3(3):475-479, 2012.

Bo Dai, Niao He, Yunpeng Pan, Byron Boots, and Le Song. Learning from conditional distributions via dual embeddings. In Artificial Intelligence and Statistics, pages 1458-1467, 2017.

C. C. Drovandi and A. N. Pettitt. Estimation of parameters for macroparasite population evolution using approximate Bayesian computation. Biometrics, 67(1):225-233, 2011.

Alexei J Drummond and Andrew Rambaut. BEAST: Bayesian evolutionary analysis by sampling trees. BMC Evolutionary Biology, 7(1):214, 2007.

Christoph Leuenberger, Daniel Wegmann, and Laurent Excoffier.
Bayesian computation and model selection in population genetics. Genetics, 2009. URL http://arxiv.org/abs/0901.2231.

Alexander Gleim and Christian Pigorsch. Approximate Bayesian computation with indirect summary statistics.

Michael U Gutmann, Ritabrata Dutta, Samuel Kaski, and Jukka Corander. Statistical inference of intractable generative models via classification. arXiv preprint arXiv:1407.4981, 2014.

Michael U Gutmann, Jukka Corander, et al. Bayesian optimization for likelihood-free inference of simulator-based statistical models. Journal of Machine Learning Research, 2016.

Michael U Gutmann, Ritabrata Dutta, Samuel Kaski, and Jukka Corander. Likelihood-free inference via classification. Statistics and Computing, 28(2):411-425, 2018.

David Haussler. Sphere packing numbers for subsets of the Boolean n-cube with bounded Vapnik-Chervonenkis dimension. Journal of Combinatorial Theory, Series A, 69(2):217-232, 1995.

John P Huelsenbeck, Fredrik Ronquist, Rasmus Nielsen, and Jonathan P Bollback. Bayesian inference of phylogeny and its impact on evolutionary biology. Science, 294(5550):2310-2314, 2001.

Paul Joyce and Paul Marjoram. Approximately sufficient statistics and Bayesian computation. Statistical Applications in Genetics and Molecular Biology, 7(1), 2008.

Diederik P Kingma and Max Welling. Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114, 2013.

Naveen Kodali, James Hays, Jacob Abernethy, and Zsolt Kira. On convergence and stability of GANs. 2018.

T Kulkarni, Ilker Yildirim, Pushmeet Kohli, W Freiwald, and Joshua B Tenenbaum. Deep generative vision as approximate Bayesian computation. In NIPS 2014 ABC Workshop, 2014.

Jingjing Li, David J Nott, Yanan Fan, and Scott A Sisson. Extending approximate Bayesian computation methods to high dimensions via a Gaussian copula model.
Computational Statistics & Data Analysis, 106:77-89, 2017.

Yuanzhi Li and Yang Yuan. Convergence analysis of two-layer neural networks with ReLU activation. In Advances in Neural Information Processing Systems, pages 597-607, 2017.

Gael M. Martin, Brendan P. M. McCabe, Worapree Maneesoonthorn, and Christian P. Robert. Approximate Bayesian computation in state space models. arXiv preprint arXiv:1409.8363, 2014.

Ron Meir. Nonparametric time series prediction through adaptive model selection. Machine Learning, 39(1):5-34, 2000.

Lars Mescheder, Sebastian Nowozin, and Andreas Geiger. Adversarial variational Bayes: Unifying variational autoencoders and generative adversarial networks. arXiv preprint arXiv:1701.04722, 2017.

Mehdi Mirza and Simon Osindero. Conditional generative adversarial nets. arXiv preprint arXiv:1411.1784, 2014.

Jovana Mitrovic, Dino Sejdinovic, and Yee Whye Teh. DR-ABC: Approximate Bayesian computation with kernel-based distribution regression. In ICML, 2016.

Shakir Mohamed and Balaji Lakshminarayanan. Learning in implicit generative models. arXiv preprint arXiv:1610.03483, 2016.

David J Nott, Y Fan, L Marshall, and SA Sisson. Approximate Bayesian computation and Bayes' linear analysis: toward high-dimensional ABC. Journal of Computational and Graphical Statistics, 23(1):65-86, 2014.

Matthew A Nunes and David J Balding. On optimal selection of summary statistics for approximate Bayesian computation. Statistical Applications in Genetics and Molecular Biology, 9(1), 2010.

Mijung Park, Wittawat Jitkrittum, and Dino Sejdinovic. K2-ABC: Approximate Bayesian computation with kernel embeddings. In AISTATS, 2016.

Leah F Price, Christopher C Drovandi, Anthony Lee, and David J Nott.
Bayesian synthetic likelihood. Journal of Computational and Graphical Statistics, 27(1):1-11, 2018.

Louis Raynal, Jean-Michel Marin, Pierre Pudlo, Mathieu Ribatet, Christian P Robert, and Arnaud Estoup. ABC random forests for Bayesian parameter inference. arXiv preprint arXiv:1605.05537, 2016.

GS Rodrigues, Dennis Prangle, and Scott A Sisson. Recalibration: A post-processing method for approximate Bayesian computation. Computational Statistics & Data Analysis, 126:53-66, 2018.

Aman Sinha, Hongseok Namkoong, and John Duchi. Certifiable distributional robustness with principled adversarial training. arXiv preprint arXiv:1710.10571, 2017.

Daniel Wegmann, Christoph Leuenberger, and Laurent Excoffier. Efficient approximate Bayesian computation coupled with Markov chain Monte Carlo without likelihood. Genetics, 182(4):1207-1218, 2009.

Simon N Wood. Statistical inference for noisy nonlinear ecological dynamic systems. Nature, 466(7310):1102, 2010.

Arnold Zellner. Optimal information processing and Bayes's theorem. The American Statistician, 42(4):278-280, 1988.