{"title": "Discretely Relaxing Continuous Variables for tractable Variational Inference", "book": "Advances in Neural Information Processing Systems", "page_first": 10466, "page_last": 10476, "abstract": "We explore a new research direction in Bayesian variational inference with discrete latent variable priors where we exploit Kronecker matrix algebra for efficient and exact computations of the evidence lower bound (ELBO). The proposed \"DIRECT\" approach has several advantages over its predecessors; (i) it can exactly compute ELBO gradients (i.e. unbiased, zero-variance gradient estimates), eliminating the need for high-variance stochastic gradient estimators and enabling the use of quasi-Newton optimization methods; (ii) its training complexity is independent of the number of training points, permitting inference on large datasets; and (iii) its posterior samples consist of sparse and low-precision quantized integers which permit fast inference on hardware limited devices. In addition, our DIRECT models can exactly compute statistical moments of the parameterized predictive posterior without relying on Monte Carlo sampling. The DIRECT approach is not practical for all likelihoods, however, we identify a popular model structure which is practical, and demonstrate accurate inference using latent variables discretized as extremely low-precision 4-bit quantized integers. While the ELBO computations considered in the numerical studies require over 10^2352 log-likelihood evaluations, we train on datasets with over two-million points in just seconds.", "full_text": "Discretely Relaxing Continuous Variables\n\nfor tractable Variational Inference\n\nTrefor W. Evans\n\nUniversity of Toronto\n\nPrasanth B. 
Nair\n\nUniversity of Toronto\n\ntrefor.evans@mail.utoronto.ca\n\npbn@utias.utoronto.ca\n\nAbstract\n\nWe explore a new research direction in Bayesian variational inference with discrete\nlatent variable priors where we exploit Kronecker matrix algebra for efficient and\nexact computations of the evidence lower bound (ELBO). The proposed \"DIRECT\"\napproach has several advantages over its predecessors; (i) it can exactly compute\nELBO gradients (i.e. unbiased, zero-variance gradient estimates), eliminating\nthe need for high-variance stochastic gradient estimators and enabling the use of\nquasi-Newton optimization methods; (ii) its training complexity is independent of\nthe number of training points, permitting inference on large datasets; and (iii) its\nposterior samples consist of sparse and low-precision quantized integers which\npermit fast inference on hardware limited devices. In addition, our DIRECT\nmodels can exactly compute statistical moments of the parameterized predictive\nposterior without relying on Monte Carlo sampling. The DIRECT approach is not\npractical for all likelihoods, however, we identify a popular model structure which\nis practical, and demonstrate accurate inference using latent variables discretized as\nextremely low-precision 4-bit quantized integers. While the ELBO computations\nconsidered in the numerical studies require over 10^2352 log-likelihood evaluations,\nwe train on datasets with over two-million points in just seconds.\n\n1 Introduction\n\nHardware restrictions posed by mobile devices make Bayesian inference particularly ill-suited for\non-board machine learning. This is unfortunate since the safety afforded by Bayesian statistics is\nextremely valuable in many prominent mobile applications. For example, the cost of erroneous\ndecisions is very high in autonomous driving or mobile robotic control. 
The robustness and\nuncertainty quantification provided by Bayesian inference are therefore extremely valuable for these\napplications provided inference can be performed on-board in real-time [1, 2].\nOutside of mobile applications, resource efficiency is still an important concern. For example,\ndeployed models making billions of predictions per day can incur substantial energy costs, making\nenergy efficiency an important consideration in modern machine learning architectures [3].\nWe approach the problem of efficient Bayesian inference by considering discrete latent variable\nmodels such that posterior samples of the variables will be quantized and sparse, leading to efficient\ninference computations with respect to energy, memory and computational requirements. Training a\nmodel with a discrete prior is typically very slow and expensive, requiring the use of high variance\nMonte Carlo gradient estimators to learn the variational distribution. The main contribution of this\nwork is the development of a method to rapidly learn the variational distribution for such a model\nwithout the use of any stochastic estimators; the objective function will be computed exactly at\neach iteration. To our knowledge, such an approach has not been taken for variational inference of\nlarge-scale probabilistic models.\n\n32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montr\u00e9al, Canada.\n\nIn this paper, we compare our work not only to competing stochastic variational inference (SVI)\nmethods for discrete latent variables, but also to the more general SVI methods for continuous latent\nvariables. We make this comparison with continuous variables by discretely relaxing continuous priors\nusing a discrete prior with a finite support set that retains much of the structure and information of its\ncontinuous analogue. 
Using this discretized prior we show that we can make use of Kronecker matrix\nalgebra for efficient and exact ELBO computations. We will call our technique DIRECT (DIscrete\nRElaxation of ConTinuous variables). We summarize our main contributions below:\n\u2022 We efficiently and exactly compute the ELBO using a discrete prior even when this computation\nrequires more likelihood evaluations than the number of atoms in the known universe. This\nachieves unbiased, zero-variance gradients which we show outperform competing Monte Carlo\nsampling alternatives that give high-variance gradient estimates while learning.\n\u2022 The complexity of our ELBO computations is independent of the quantity of training data using the\nDIRECT method, making the proposed approach amenable to big data applications.\n\u2022 At inference time, we can exactly compute the statistical moments of the parameterized predictive\nposterior distribution, unlike competing techniques which rely on Monte Carlo sampling.\n\u2022 Using a discrete prior, our models admit sparse posterior samples that can be represented as\nquantized integer values to enable efficient inference, particularly on hardware limited devices.\n\u2022 We present the DIRECT approach for generalized linear models and deep Bayesian neural networks\nfor regression, and discuss approximations that allow extensions to many other models.\n\u2022 Our empirical studies demonstrate superior performance relative to competing SVI methods on\nproblems with as many as 2 million training points.\n\nThe paper will proceed as follows; section 2 contains a background on variational inference and poses\nthe learning problem to be addressed while section 3 outlines the central ideas of the DIRECT method,\ndemonstrating the approach on several popular probabilistic models. 
Section 4 discusses limitations\nof the proposed approach and outlines some work-arounds, for instance, we discuss how to go beyond\nmean-field variational inference. We empirically demonstrate our approaches in section 5, and\nconclude in section 6. Our full code is available at https://github.com/treforevans/direct.\n\n2 Variational Inference Background\n\nWe begin with a review of variational inference, a method for approximating probability densities in\nBayesian statistics [4\u20139]. We introduce a regression problem for motivation; given $X \\in \\mathbb{R}^{n \\times d}$, $y \\in \\mathbb{R}^n$,\na $d$-dimensional dataset of size $n$, we wish to evaluate $y_*$ at an untried point $x_*$ by constructing\na statistical model that depends on the $b$ latent variables in the vector $w \\in \\mathbb{R}^b$. After specifying a\nprior over the latent variables, $\\Pr(w)$, and selecting a probabilistic model structure that admits the\nlikelihood $\\Pr(y|w)$, we may proceed with Bayesian inference to determine the posterior $\\Pr(w|y)$\nwhich generally requires analytically intractable computations.\nVariational inference turns the task of computing a posterior into an optimization problem. By\nintroducing a family of probability distributions $q_\\theta(w)$ parameterized by $\\theta$, we minimize the\nKullback-Leibler divergence to the exact posterior [9]. This equates to maximization of the evidence lower\nbound (ELBO) which we can write as follows for a continuous or discrete prior, respectively\n\n$$\\mathrm{ELBO}(\\theta) = \\int q_\\theta(w) \\big( \\log \\Pr(y|w) + \\log \\Pr(w) - \\log q_\\theta(w) \\big) \\, dw, \\tag{1}$$\n\n$$\\mathrm{ELBO}(\\theta) = q^T \\big( \\log \\ell + \\log p - \\log q \\big), \\tag{2}$$\n\nwhere $\\log \\ell = \\{\\log \\Pr(y|w_i)\\}_{i=1}^m$, $\\log p = \\{\\log \\Pr(w_i)\\}_{i=1}^m$, $q = \\{q_\\theta(w_i)\\}_{i=1}^m$, and\n$\\{w_i\\}_{i=1}^m = W \\in \\mathbb{R}^{b \\times m}$ is the entire support set of the discrete prior.\nIt is immediately evident that computing the ELBO is challenging when $b$ is large, since in the\ncontinuous case eq. (1) is a $b$-dimensional integral, and in the discrete case the size of the sum in eq. (2)\ngenerally increases exponentially with respect to $b$. Typically, the ELBO is not explicitly computed\nand instead, a Monte Carlo estimate of the gradient of the ELBO with respect to the variational\nparameters $\\theta$ is found, allowing stochastic gradient descent to be performed. We will outline some\nexisting techniques to estimate ELBO gradients with respect to the variational parameters, $\\theta$.\nFor continuous priors, the reparameterization trick [10] can be used to perform variational inference.\nThe technique uses Monte Carlo estimates of the gradient of the evidence lower bound (ELBO) which\nis maximized during the training procedure. While this approach has been employed successfully\nfor many large-scale models, we find that discretely relaxing continuous latent variable priors can\nimprove training and inference performance when using our proposed DIRECT technique which\ncomputes the ELBO (and its gradients) exactly.\nWhen the latent variable priors are discrete, reparameterization cannot be applied, however, the\nREINFORCE [11] estimator may be used to provide an unbiased estimate of the ELBO gradient during\ntraining (alternatively called the score function estimator [12], or likelihood ratio estimator [13]).\nEmpirically, the REINFORCE gradient estimator is found to have high variance when compared\nwith reparameterization, leading to a slow learning process. Unsurprisingly, we find that our proposed\nDIRECT technique trains significantly faster than a model trained using a REINFORCE estimator.\nRecent work in variational inference with discrete latent variables has largely focused on continuous\nrelaxations of discrete variables such that reparameterization can be applied to reduce gradient\nvariance compared to REINFORCE. One example is CONCRETE [14, 15] and its extensions [16, 17].\nWe consider an opposing direction by identifying how the ELBO (eq. 
(2)) can be computed exactly\nfor a class of discretely relaxed probabilistic models such that the discrete latent variable model can be\ntrained more easily than its continuous counterpart. We outline this approach in the following section.\n\n3 DIRECT: Efficient ELBO Computations with Kronecker Matrix Algebra\n\nWe outline the central ideas of the DIRECT method and illustrate its application on several\nprobabilistic models. The DIRECT method allows us to efficiently and exactly compute the ELBO which\nhas several advantages over existing SVI techniques for discrete latent variable models such as\nzero-variance gradient estimates, the ability to use a super-linearly convergent quasi-Newton optimizer\n(since our objective is deterministic), and a per-iteration complexity that is independent of training\nset size. We will also discuss advantages at inference time such as the ability to exactly compute\npredictive posterior statistical moments, and to exploit sparse and low-precision posterior samples.\nTo begin, we consider a discrete prior over our latent variables whose support set $W$ forms a Cartesian\ntensor product grid as most discrete priors do (e.g. any prior that factorizes between variables) so that\nwe can write\n\n$$W = \\begin{pmatrix} \\bar{w}_1^T \\otimes 1_{\\bar{m}}^T \\otimes \\cdots \\otimes 1_{\\bar{m}}^T \\\\ 1_{\\bar{m}}^T \\otimes \\bar{w}_2^T \\otimes \\cdots \\otimes 1_{\\bar{m}}^T \\\\ \\vdots \\\\ 1_{\\bar{m}}^T \\otimes 1_{\\bar{m}}^T \\otimes \\cdots \\otimes \\bar{w}_b^T \\end{pmatrix}, \\tag{3}$$\n\nwhere $1_{\\bar{m}} \\in \\mathbb{R}^{\\bar{m}}$ denotes a vector of ones, $\\bar{w}_i \\in \\mathbb{R}^{\\bar{m}}$ contains the $\\bar{m}$ discrete values that the $i$th\nlatent variable $w_i$ can take (footnote 1), $m = \\bar{m}^b$, and $\\otimes$ denotes the Kronecker product [18]. Since the number\nof columns of $W \\in \\mathbb{R}^{b \\times \\bar{m}^b}$ increases exponentially with respect to $b$, it is evident that computing\nthe ELBO in eq. (2) is typically intractable when $b$ is large. For instance, forming and storing the\nmatrices involved naively requires exponential time and memory. We can alleviate this concern if\n$q$, $\\log p$, $\\log \\ell$, and $\\log q$ can be written as a sum of Kronecker product vectors (i.e. $\\sum_i \\bigotimes_{j=1}^b f_j^{(i)}$,\nwhere $f_j^{(i)} \\in \\mathbb{R}^{\\bar{m}}$). If we find this structure, then we never need to explicitly compute or store a vector\nof length $m$. This is because eq. (2) would simply require multiple inner products between Kronecker\nproduct vectors which the following result demonstrates can be computed extremely efficiently.\nProposition 1. The inner product between two Kronecker product vectors $k = \\bigotimes_{i=1}^b k^{(i)}$, and\n$a = \\bigotimes_{i=1}^b a^{(i)}$ can be computed as follows [18],\n\n$$a^T k = \\prod_{i=1}^b a^{(i)T} k^{(i)}, \\tag{4}$$\n\nwhere $a^{(i)} \\in \\mathbb{R}^{\\bar{m}}$, $a \\in \\mathbb{R}^{\\bar{m}^b}$, $k^{(i)} \\in \\mathbb{R}^{\\bar{m}}$, and $k \\in \\mathbb{R}^{\\bar{m}^b}$.\nThis result enables substantial savings in the computation of the ELBO since each inner product\ncomputation is reduced from the naive exponential $O(\\bar{m}^b)$ cost to a linear $O(b\\bar{m})$ cost.\n\nFootnote 1: The discrete values that the $i$th latent variable can take, $\\bar{w}_i$, may be chosen a priori or learned during ELBO\nmaximization (may be helpful for coarse discretizations). For the sake of simplicity, we focus on the former.\n\nWe now discuss how the Kronecker product structure of the variables in eq. (2) can be achieved. Firstly,\nif the prior is chosen to factorize between latent variables, as it often is (i.e. $\\Pr(w) = \\prod_{i=1}^b \\Pr(w_i)$),\nthen $p = \\bigotimes_{i=1}^b p_i$ admits a Kronecker product structure where $p_i = \\{\\Pr(w_i = \\bar{w}_{ij})\\}_{j=1}^{\\bar{m}} \\in (0, 1)^{\\bar{m}}$.\nThe following result demonstrates how this structure for $p$ enables $\\log p$ to be written as a sum of $b$\nKronecker product vectors.\nProposition 2. 
The element-wise logarithm of the Kronecker product vector $k = \\bigotimes_{i=1}^b k^{(i)}$ can be\nwritten as a sum of $b$ Kronecker product vectors as follows,\n\n$$\\log k = \\bigoplus_{i=1}^b \\log k^{(i)}, \\tag{5}$$\n\nwhere $k^{(i)} \\in \\mathbb{R}^{\\bar{m}}$, $k \\in \\mathbb{R}^{\\bar{m}^b}$ contain positive values, and $\\oplus$ is a generalization of the Kronecker\nsum [19] for vectors which we define as follows\n\n$$\\bigoplus_{i=1}^b \\log k^{(i)} = \\sum_{i=1}^b \\Big( \\bigotimes_{j=1}^{i-1} 1_{\\bar{m}} \\Big) \\otimes \\log k^{(i)} \\otimes \\Big( \\bigotimes_{j=i+1}^b 1_{\\bar{m}} \\Big). \\tag{6}$$\n\nThe proof is trivial. We will first consider a mean-field variational distribution that factorizes over\nlatent variables such that both $q = \\bigotimes_{i=1}^b q_i$ and $\\log q = \\bigoplus_{i=1}^b \\log q_i$ can be written as a sum of\nKronecker product vectors, where $q_j = \\{\\Pr(w_j = \\bar{w}_{ji})\\}_{i=1}^{\\bar{m}} \\in (0, 1)^{\\bar{m}}$ are used as the variational\nparameters, $\\theta$, with the use of the softmax function. For the mean-field case we can rewrite eq. (2) as\n\n$$\\mathrm{ELBO}(\\theta) = q^T \\log \\ell + \\sum_{i=1}^b q_i^T \\log p_i - \\sum_{i=1}^b q_i^T \\log q_i, \\tag{7}$$\n\nwhere we use the fact that $q_i$ defines a valid probability distribution for the $i$th latent variable such\nthat $q_i^T 1_{\\bar{m}} = 1$. We extend results to unfactorized prior and variational distributions later in section 4.\nThe structure of $\\log \\ell$ depends on the probabilistic model used; in the worst case, $\\log \\ell$ can always\nbe represented as a sum of $m$ Kronecker product vectors. 
However, many models admit a far more\ncompact structure where dramatic savings can be realized as we demonstrate in the following sections.\n\n3.1 Generalized Linear Regression\n\nWe first focus on the popular class of Bayesian generalized linear models (GLMs) for regression.\nWhile the Bayesian integrals that arise in GLMs can be easily computed in the case of conjugate\npriors, for general priors inference is challenging.\nThis highly general model architecture has been applied in a vast array of application areas. Recently,\nWilson et al. [20] used a scalable Bayesian generalized linear model with Gaussian priors on the\noutput layer of a deep neural network with notable empirical success. They also considered the ability\nto train the neural network simultaneously with the approximate Gaussian process which we also\nhave the ability to do if a practitioner were to require such an architecture.\nConsider the generalized linear regression model $y = \\Phi w + \\epsilon$, where $\\epsilon \\sim \\mathcal{N}(0, \\sigma^2 I)$, and\n$\\Phi = \\{\\phi_j(x_i)\\}_{i,j} \\in \\mathbb{R}^{n \\times b}$ contains the evaluations of the basis functions on the training data. The following\nresult demonstrates how the ELBO can be exactly and efficiently computed, assuming the factorized\nprior and variational distributions over $w$ discussed earlier. Note that we also consider a prior over $\\sigma^2$.\nTheorem 1. 
The ELBO can be exactly computed for a discretely relaxed regression GLM as follows\n\n$$\\mathrm{ELBO}(\\theta) = -\\frac{n}{2} q_\\sigma^T \\log \\sigma^2 - \\frac{1}{2} \\big( q_\\sigma^T \\sigma^{-2} \\big) \\Big( y^T y - 2 s^T (\\Phi^T y) + s^T \\Phi^T \\Phi s - \\mathrm{diag}(\\Phi^T \\Phi)^T s^2 + \\sum_{j=1}^b q_j^T h_j \\Big) + \\sum_{i=1}^b \\big( q_i^T \\log p_i - q_i^T \\log q_i \\big) + q_\\sigma^T \\log p_\\sigma - q_\\sigma^T \\log q_\\sigma, \\tag{8}$$\n\nwhere $q_\\sigma, p_\\sigma \\in \\mathbb{R}^{\\bar{m}}$ are factorized variational and prior distributions over the Gaussian noise\nvariance $\\sigma^2$ for which we consider the discrete positive values $\\sigma^2 \\in \\mathbb{R}^{\\bar{m}}$, respectively. Also, we use\nthe shorthand notation $H = \\{\\bar{w}_j^2 \\sum_{i=1}^n \\phi_{ij}^2\\}_{j=1}^b \\in \\mathbb{R}^{\\bar{m} \\times b}$ (with $h_j$ its $j$th column), and $s = \\{q_j^T \\bar{w}_j\\}_{j=1}^b \\in \\mathbb{R}^b$.\n\nA proof is provided in appendix A of the supplementary material. We can pre-compute the terms $y^T y$,\n$\\Phi^T y$, $H$, and $\\Phi^T \\Phi$ before training begins (since these do not depend on the variational parameters)\nsuch that the final complexity of the proposed DIRECT method outlined in Theorem 1 is only\n$O(b\\bar{m} + b^2)$. This complexity is independent of the number of training points, making the proposed\ntechnique ideal for massive datasets. Also, each of the pre-computed terms can easily be updated as\nmore data is observed making the techniques amenable to online learning applications.\n\nPredictive Posterior Computations Typically, the predictive posterior distribution is found by\nsampling the variational distribution at a large number of points and running the model forward for\neach sample. To exactly compute the statistical moments, a model would have to be run forward at\nevery point in the hypothesis space which is typically intractable, however, we can exploit Kronecker\nmatrix algebra to efficiently compute these moments exactly. 
For example, the exact predictive\nposterior mean for our generalized linear regression model is computed as follows\n\n$$\\mathbb{E}(y_*) = \\sum_{i=1}^m q(w_i) \\int y_* \\Pr(y_*|w_i) \\, dy_* = \\Phi_* W q = \\Phi_* s, \\tag{9}$$\n\nwhere $s = \\{q_j^T \\bar{w}_j\\}_{j=1}^b \\in \\mathbb{R}^b$, and $\\Phi_* \\in \\mathbb{R}^{1 \\times b}$ contains the basis functions evaluated at $x_*$. This\ncomputation is highly efficient, requiring just $O(b)$ time per test point. It can be shown that a similar\nscheme can be derived to exactly compute higher order statistical moments, such as the predictive\nposterior variance, for generalized linear regression models and other DIRECT models.\nWe have shown how to exactly compute statistical moments, and now we show how to exploit our\ndiscrete prior to compute predictive posterior samples extremely efficiently. This sampling approach\nmay be preferable to the exact computation of statistical moments on hardware limited devices where\nwe need to perform inference with extreme memory, energy and computational efficiency. The\nlatent variable posterior samples $\\widetilde{W} \\in \\mathbb{R}^{b \\times \\text{num. samples}}$ will of course be represented as a low-precision\nquantized integer array because of the discrete support of the prior which enables extremely compact\nstorage in memory. Much work has been done elsewhere in the machine learning community to\nquantize variables for storage compression purposes since memory is a very restrictive constraint on\nmobile devices [21\u201324]. However, we can go beyond this to additionally reduce computational and\nenergy demands for the evaluation of $\\Phi_* \\widetilde{W}$. 
One approach is to constrain the elements of $\\bar{w}$ to be 0\nor a power of 2 so that multiplication operations simply become efficient bit-shift operations [25\u201327].\nAn even more efficient approach is to employ basis functions with discrete outputs so that $\\Phi_*$ can\nalso be represented as a low-precision quantized integer array. For example, a rounding operation\ncould be applied to continuous basis functions. Provided that the quantization schemes are an affine\nmapping of integers to real numbers (i.e. the quantized values are evenly spaced), then inference can\nbe conducted using extremely efficient integer arithmetic [28]. Either of these approaches enables\nextremely efficient on-device inference.\n\n3.2 Deep Neural Networks for Regression\n\nWe consider the hierarchical model structure of a Bayesian deep neural network for regression.\nConsidering a DIRECT approach for this architecture is not conceptually challenging so long as\nan appropriate neuron activation function is selected. We would like a non-linear activation that\nmaintains a compact representation of the log-likelihood evaluated at every point in the hypothesis\nspace, i.e. we would like $\\log \\ell$ to be represented as a sum of as few Kronecker product vectors as\npossible. Using a power function for the activation can maintain a compact representation; the natural\nchoice being a quadratic activation function (i.e. output $x^2$ for input $x$).\n\nIt can be shown that the ELBO can be exactly computed in $O(\\ell \\bar{m} (b/\\ell)^{4\\ell})$ for a deep Bayesian neural\nnetwork with $\\ell$ layers, where we assume a quadratic activation function and an equal distribution\nof discrete latent variables between network layers. 
This complexity evidently enables scalable\nBayesian inference for models of moderate depth, and as we found for the regression GLM model of\nsection 3.1, computational complexity is independent of the quantity of training data, making this\napproach ideal for large datasets. We outline this model and the computation of its ELBO in appendix D.\n\n4 Limitations & Extensions\n\nIn general, when the support of the prior is on a Cartesian grid, any prior, likelihood, or variational\ndistribution (or log-distribution) can be expressed using the proposed Kronecker matrix representation,\nhowever, this representation will not always be compact enough to be practical. We can see this\nby viewing these probability distributions over the hypothesis space as high-dimensional tensors.\nIn section 3, we exploited some popular models whose variational probability tensors, and whose\nprior, likelihood and variational log-probability tensors all admit a low-rank structure, however, other\nmodels may not admit this structure, in which case their representation will not be so compact. In the\ninterest of generalizing the technique, we outline a likelihood, a prior, and a variational distribution\nthat do not admit a compact representation of the ELBO and discuss several ways the DIRECT\nmethod can still be used to efficiently compute, or lower bound the ELBO. We hope that these\nextensions inspire future research directions in approximate Bayesian inference.\n\nGeneralized Linear Logistic Regression Logistic regression models do not easily admit a\ncompact representation for exact ELBO computations, however, we will demonstrate that we can\nefficiently compute a lower-bound of the ELBO by leveraging developed algebraic techniques. To\ndemonstrate, we will consider a generalized linear logistic regression model which is commonly\nemployed for classification problems. Such a model could easily be extended to a deep architecture\nfollowing Bradshaw et al. 
[2], if desired. All terms in the ELBO in eq. (7) can be computed\nexactly for this model except the term involving the log-likelihood, for which the following result\ndemonstrates an efficient computation of the lower bound.\nTheorem 2. For a generalized linear logistic regression model with classification training labels\n$y \\in \\{0, 1\\}^n$, the class-conditional probability $\\Pr(y_i = 0 | w) = (1 + \\exp(-\\Phi[i, :] w))^{-1}$, and with the\nassumption that training examples are sampled independently, the following inequality holds\n\n$$q^T \\log \\ell \\geq -s^T (\\Phi^T y) - \\sum_{i=1}^n \\begin{cases} \\prod_{j=1}^b q_j^T \\exp(-\\phi_{ij} \\bar{w}_j) & \\text{if } y_i = 0 \\\\ \\prod_{j=1}^b q_j^T \\exp(\\phi_{ij} \\bar{w}_j) - \\sum_{j=1}^b q_j^T \\phi_{ij} \\bar{w}_j & \\text{if } y_i = 1 \\end{cases} \\tag{10}$$\n\nWe prove this result in appendix B of the supplement. This computation can be performed in $O(\\bar{m} b n)$\ntime, where dependence on $n$ is evident unlike in the case of the exact computations described in\nsection 3. As a result, stochastic optimization techniques should be considered. Using this lower\nbound, the log-likelihood is accurately approximated for hypotheses that correctly classify the training\ndata, however, hypotheses that confidently misclassify training labels may be over-penalized. In\nappendix B we further discuss the accuracy of this approximation and discuss a stable implementation.\n\nUnfactorized Variational Distributions We now consider going beyond a mean-field variational\ndistribution to account for correlations between latent variables. 
Considering a finite mixture of\nfactorized categorical distributions as is used in latent structure analysis [29, 30], we can write\n$q = \\sum_{i=1}^r \\alpha_i \\bigotimes_{j=1}^b q_j^{(i)}$, where $\\alpha \\in (0, 1)^r$ is a vector of mixture probabilities for $r$ components,\nand $q_j^{(i)} = \\{\\Pr(w_j = \\bar{w}_{jk} | i)\\}_{k=1}^{\\bar{m}} \\in (0, 1)^{\\bar{m}}$.\nWhile $q$ can evidently be expressed as a compact sum of Kronecker product vectors, $\\log q$ is more\nchallenging to compute than in the mean-field case, however, the following result demonstrates how\nwe can lower-bound the term involving $\\log q$ in the ELBO (eq. (7)).\nTheorem 3. The following inequality holds when we consider a finite mixture of factorized categorical\ndistributions for $q_\\theta(w)$,\n\n$$-q^T \\log q \\geq \\max_{\\{a_i \\in (0,1)^{\\bar{m}}\\}_{i=1}^b} \\; 1 - \\sum_{j=1}^r \\alpha_j \\Big( \\sum_{i=1}^b q_i^{(j)T} \\log a_i + \\alpha_j \\prod_{i=1}^b q_i^{(j)T} \\frac{q_i^{(j)}}{a_i} + 2 \\sum_{k=j+1}^r \\alpha_k \\prod_{i=1}^b q_i^{(j)T} \\frac{q_i^{(k)}}{a_i} \\Big),$$\n\nwhere $a = \\bigotimes_{i=1}^b a_i$, $a_i \\in (0, 1)^{\\bar{m}}$ is the center of the Taylor series approximation of $\\log q$.\n\nWe prove this result in appendix C and discuss a stable implementation. Note that if the mixture\nvariational distribution $q$ degenerates to a mean-field distribution equal to $a$, then the ELBO will be\ncomputed exactly, and as $q$ moves away from $a$, the ELBO will be underestimated.\n\nUnfactorized Prior Distributions To consider an unfactorized prior, we assume a prior mixture\ndistribution given by $p = \\sum_{i=1}^r \\alpha_i \\bigotimes_{j=1}^b p_j^{(i)}$. When we use this mixture distribution for the prior, $p$\ncan evidently be expressed as a compact sum of Kronecker product vectors but $\\log p$ cannot. The\nfollowing result demonstrates how we can still lower-bound the term involving $\\log p$ in the ELBO\n(eq. (2)). 
For simplicity, we assume that the variational distribution factorizes, however, the result\ncould easily be extended to the case of a mixture variational distribution.\nTheorem 4. The following inequality holds when we consider a finite mixture of factorized categorical\ndistributions for $p_\\theta(w)$,\n\n$$q^T \\log p \\geq \\sum_{i=1}^r \\alpha_i \\sum_{j=1}^b q_j^T \\log p_j^{(i)}.$$\n\nThe proof is trivial by Jensen's inequality. Note that the equality only holds when the prior mixture\ndegenerates to a factorized distribution with all mixture components equivalent.\n\nUnbiased Stochastic Entropy and Prior Expectation Gradients We previously showed how to\nlower bound the ELBO terms $q^T \\log p$ and $-q^T \\log q$ when the variational and/or prior distributions\ndo not factor, however, optimizing this bound introduces bias and does not guarantee convergence to\na local optimum of the true ELBO. Here we reintroduce REINFORCE to deliver unbiased gradient\nestimates for these terms. The REINFORCE estimator typically has high variance, however, since\ngradient estimates for these terms are so cheap, a massive number of samples can be used per\nstochastic gradient descent (SGD) iteration to decrease variance. Since we can still compute the\nexpensive $q^T \\log \\ell$ term exactly when $q$ is an unfactorized mixture distribution, its gradient can be\ncomputed exactly. The unbiased gradient estimator of $q^T \\log q$ is expressed as follows (footnote 2)\n\n$$\\frac{\\partial}{\\partial \\theta} q^T \\log q = \\frac{1}{2} q^T \\Big( \\frac{\\partial}{\\partial \\theta} (\\log q + 1)^2 \\Big) \\approx \\frac{1}{2t} \\frac{\\partial}{\\partial \\theta} \\sum_{i=1}^t \\big( \\log q(s_i) + 1 \\big)^2, \\tag{11}$$\n\nwhere $s_i \\in \\mathbb{R}^b$ is the $i$th of $t$ samples from the variational distribution used in the Monte Carlo gradient\nestimator. 
It is evident that this surrogate loss can be easily optimized using automatic differentiation,\nand the per-sample computations are extremely cheap.\n\n5 Numerical Studies\n\n5.1 Comparison with REINFORCE\n\nAs discussed in section 2, we cannot reparameterize because of the discrete latent variable priors\nconsidered, however, we can directly compare the optimization performance of the proposed\ntechniques with the REINFORCE gradient estimator [11]. In fig. 1, we compare ELBO maximization\nperformance between the proposed DIRECT, and the REINFORCE methods. For this study we\ngenerated a dataset from a random weighting of $b = 20$ random Fourier features of a squared exponential\nkernel [31] and corrupted by independent Gaussian noise. We use a generalized linear regression\nmodel as described in section 3.1 which uses the same features with $\\bar{m} = 3$. We consider a prior over\n$\\sigma^2$, and a mean-field variational distribution giving $\\bar{m}(b + 1) = 63$ variational parameters which we\ninitialize to be the same as the prior; a uniform categorical distribution. For DIRECT, an L-BFGS\noptimizer is used [32] and stochastic gradient descent is used for REINFORCE with a varying number\nof samples used for the Monte Carlo gradient estimator. Both methods use full batch training and are\nimplemented using TensorFlow [33]. It can be seen that DIRECT greatly outperforms REINFORCE\nboth in the number of iterations and computational time. As we move to a larger $n$ or a larger $b$, the\ndifference between the proposed DIRECT technique and REINFORCE becomes more profound. The\nsuperior scaling with respect to $n$ was expected since we had shown in section 3.1 that the DIRECT\ncomputational runtime is independent of $n$. 
However, the improved scaling with respect to $b$ is an interesting result and may be attributed to the fact that as the dimension of the variational parameter space increases, there is more value in having low (or zero) variance estimates of the gradient.\n\n2 We used the identity $(\\log q) \\odot \\frac{\\partial \\log q}{\\partial \\theta} = \\frac{1}{2} \\frac{\\partial}{\\partial \\theta} (\\log q)^{2}$, where $\\odot$ denotes an elementwise product.\n\nFigure 1: Convergence rates of a GLM trained with REINFORCE versus the proposed DIRECT method. The DIRECT method greatly outperforms REINFORCE in iterations and wall-clock time.\n\n5.2 Relaxing Gaussian Priors on UCI Regression Datasets\n\nIn this section, we consider discretely relaxing a continuous Gaussian prior on the weights of a generalized linear regression model. This allows us to compare performance between a reparameterization gradient estimator for a continuous prior and our DIRECT method for a relaxed, discrete prior. Considering regression datasets from the UCI repository, we report the mean and standard deviation of the root mean squared error (RMSE) from 10-fold cross validation (see footnote 3). Also presented is the mean training time per fold on a machine with two E5-2680 v3 processors and 128 GB of RAM, and the expected sparsity (percentage of zeros) within a posterior sample. All models use $b = 2000$ basis functions. Further details of the experimental setup can be found in appendix E. Table 1 presents the results of our studies across several model types. In the left column, the \u201cREPARAM Mean-Field\u201d model uses a (continuous) Gaussian prior, an uncorrelated Gaussian variational distribution and reparameterization gradients. The right two models use a discrete relaxation of a Gaussian prior (DIRECT) with support at 15 discrete values, allowing storage of each latent variable sample as a vector of 4-bit quantized integers. 
Therefore, each ELBO evaluation requires $15^{2000} > 10^{2352}$ log-likelihood evaluations; however, these computations can be done quickly by exploiting Kronecker matrix algebra. We compute the ELBO as described in section 3.1 for the \u201cDIRECT Mean-Field\u201d model, and use the low-variance, unbiased gradient estimator described in eq. (11) for the \u201cDIRECT 5-Mixture SGD\u201d model, which uses a mixture distribution with $r = 5$ components and $t = 3000$ Monte Carlo samples for the entropy gradient estimator.\n\nThe boldface entries indicate top performance on each dataset, where it is evident that the DIRECT method not only outperformed REPARAM on most datasets but also trained much faster, particularly on the large datasets, because its computational complexity is independent of dataset size. The DIRECT mean-field model contains $\\bar{m} b = 30{,}000$ variational parameters; however, training took just seconds on all datasets, including electric with over 2 million points. The DIRECT mixture model contains $\\bar{m} b r = 150{,}000$ variational parameters, and since the gradient estimates are stochastic, average training times are on the order of hundreds of seconds across all datasets. While the time for precomputations does depend on dataset size, its contribution to the overall timings is negligible, being well under one second for the largest dataset, electric. Additionally, it is evident that posterior samples from the DIRECT models tend to be very sparse. For example, the DIRECT models on the gas dataset admit posterior samples that are over 84% sparse on average, meaning that over 1680 weights are expected to be zero in a posterior sample with $b = 2000$ elements. This would yield massive computational savings on hardware limited devices. 
Samples from the DIRECT models on the electric dataset are over 99.6% sparse.\n\nComparing the DIRECT mean-field model to the mixture model, we observe gains in the RMSE performance on many datasets, as we would expect with the increased flexibility of the variational distribution. While we only showed the posterior mean in our results, we would expect an even larger disparity in the quality of the predictive uncertainty which was not analyzed. In table 2 of the supplement, we present results for a DIRECT mixture model that uses the ELBO lower bound presented in Theorem 3. This model does not perform as well as the DIRECT mixture model trained using an unbiased SGD approach, as would be expected, however, it does train faster since its objective is evaluated deterministically, and its RMSE performance is still marginally better than the DIRECT mean-field model on many datasets.\n\n3 90% train, 10% test per fold. We use folds from https://people.orie.cornell.edu/andrew/code\n\n[Figure 1 panels: ELBO versus Iterations; ELBO versus Time Elapsed (s). Legend: DIRECT, REINFORCE 1 sample, REINFORCE 10 sample.]\n\nTable 1: Mean and standard deviation of test error, average training time, and average expected sparsity of a posterior sample from 10-fold cross validation on UCI regression datasets. Column groups: Continuous Prior (REPARAM Mean-Field); Discrete 4-bit Prior (DIRECT Mean-Field, DIRECT 5-Mixture SGD).\n\n| Dataset | n | d | REPARAM Mean-Field: Time | REPARAM Mean-Field: RMSE | REPARAM Mean-Field: Sparsity | DIRECT Mean-Field: Time | DIRECT Mean-Field: RMSE | DIRECT Mean-Field: Sparsity | DIRECT 5-Mixture SGD: RMSE | DIRECT 5-Mixture SGD: Sparsity |\n|---|---|---|---|---|---|---|---|---|---|---|\n| challenger | 23 | 4 | 8 | 0.515 \u00b1 0.284 | 0% | 1 | 0.523 \u00b1 0.248 | 17% | 0.525 \u00b1 0.246 | 17% |\n| fertility | 100 | 9 | 8 | 0.161 \u00b1 0.043 | 0% | 2 | 0.159 \u00b1 0.041 | 17% | 0.16 \u00b1 0.041 | 17% |\n| automobile | 159 | 25 | 5 | 0.425 \u00b1 0.2 | 0% | 10 | 0.129 \u00b1 0.063 | 51% | 0.122 \u00b1 0.056 | 51% |\n| servo | 167 | 4 | 5 | 0.524 \u00b1 0.184 | 0% | 10 | 0.271 \u00b1 0.08 | 35% | 0.274 \u00b1 0.077 | 35% |\n| cancer | 194 | 33 | 5 | 27.488 \u00b1 5.45 | 0% | 4 | 22.954 \u00b1 3.09 | 19% | 22.937 \u00b1 3.135 | 19% |\n| hardware | 209 | 7 | 5 | 1.796 \u00b1 1.537 | 0% | 11 | 0.401 \u00b1 0.048 | 51% | 0.401 \u00b1 0.046 | 51% |\n| yacht | 308 | 6 | 5 | 0.815 \u00b1 0.17 | 0% | 1 | 0.234 \u00b1 0.07 | 96% | 0.225 \u00b1 0.082 | 96% |\n| autompg | 392 | 7 | 5 | 4.05 \u00b1 0.739 | 0% | 10 | 2.564 \u00b1 0.363 | 31% | 2.543 \u00b1 0.362 | 31% |\n| housing | 506 | 13 | 5 | 3.014 \u00b1 0.567 | 0% | 10 | 2.752 \u00b1 0.405 | 40% | 2.699 \u00b1 0.361 | 39% |\n| forest | 517 | 12 | 5 | 1.378 \u00b1 0.148 | 0% | 2 | 1.363 \u00b1 0.15 | 17% | 1.357 \u00b1 0.155 | 17% |\n| stock | 536 | 11 | 5 | 0.751 \u00b1 0.338 | 0% | 8 | 0.011 \u00b1 0.003 | 98% | 0.008 \u00b1 0.001 | 98% |\n| pendulum | 630 | 9 | 5 | 1.465 \u00b1 0.26 | 0% | 1 | 1.329 \u00b1 0.282 | 68% | 1.312 \u00b1 0.253 | 63% |\n| energy | 768 | 8 | 5 | 78.852 \u00b1 21.73 | 0% | 1 | 3.272 \u00b1 0.332 | 99% | 2.911 \u00b1 0.309 | 99% |\n| concrete | 1030 | 8 | 5 | 10.347 \u00b1 2.847 | 0% | 10 | 5.316 \u00b1 0.716 | 82% | 5.477 \u00b1 0.632 | 82% |\n| solar | 1066 | 10 | 5 | 0.902 \u00b1 0.171 | 0% | 10 | 0.787 \u00b1 0.192 | 23% | 0.788 \u00b1 0.189 | 23% |\n| airfoil | 1503 | 5 | 5 | 2.071 \u00b1 0.271 | 0% | 11 | 2.175 \u00b1 0.349 | 48% | 2.156 \u00b1 0.316 | 45% |\n| wine | 1599 | 11 | 5 | 0.939 \u00b1 0.33 | 0% | 11 | 0.472 \u00b1 0.044 | 54% | 0.469 \u00b1 0.042 | 54% |\n| gas | 2565 | 128 | 5 | 0.27 \u00b1 0.052 | 0% | 1 | 0.211 \u00b1 0.058 | 84% | 0.184 \u00b1 0.063 | 76% |\n| skillcraft | 3338 | 19 | 46 | 0.273 \u00b1 0.029 | 0% | 7 | 0.253 \u00b1 0.016 | 97% | 0.253 \u00b1 0.016 | 97% |\n| sml | 4137 | 26 | 47 | 0.327 \u00b1 0.013 | 0% | 1 | 0.677 \u00b1 0.044 | 57% | 0.671 \u00b1 0.047 | 57% |\n| parkinsons | 5875 | 20 | 48 | 0.158 \u00b1 0.009 | 0% | 1 | 0.651 \u00b1 0.034 | 13% | 0.613 \u00b1 0.083 | 13% |\n| poletele | 15000 | 26 | 50 | 12.487 \u00b1 0.363 | 0% | 10 | 13.65 \u00b1 0.348 | 16% | 13.369 \u00b1 0.431 | 17% |\n| elevators | 16599 | 18 | 51 | 0.247 \u00b1 0.156 | 0% | 1 | 0.124 \u00b1 0.003 | 99% | 0.124 \u00b1 0.003 | 99% |\n| protein | 45730 | 9 | 58 | 0.642 \u00b1 0.006 | 0% | 11 | 0.619 \u00b1 0.007 | 76% | 0.618 \u00b1 0.007 | 60% |\n| kegg | 48827 | 20 | 58 | 0.178 \u00b1 0.012 | 0% | 1 | 0.222 \u00b1 0.009 | 96% | 0.205 \u00b1 0.004 | 95% |\n| ctslice | 53500 | 385 | 61 | 4.415 \u00b1 0.113 | 0% | 2 | 6.063 \u00b1 0.122 | 19% | 5.478 \u00b1 0.137 | 42% |\n| keggu | 63608 | 27 | 61 | 0.122 \u00b1 0.004 | 0% | 1 | 0.139 \u00b1 0.004 | 87% | 0.136 \u00b1 0.006 | 87% |\n| 3droad | 434874 | 3 | 141 | 11.057 \u00b1 0.091 | 0% | 2 | 10.493 \u00b1 0.105 | 40% | 10.354 \u00b1 0.077 | 33% |\n| song | 515345 | 90 | 158 | 0.537 \u00b1 0.002 | 0% | 2 | 0.501 \u00b1 0.002 | 32% | 0.498 \u00b1 0.002 | 28% |\n| buzz | 583250 | 77 | 169 | 0.94 \u00b1 0.006 | 0% | 1 | 1.007 \u00b1 0.007 | 82% | 0.959 \u00b1 0.004 | 80% |\n| electric | 2049280 | 11 | 500 | 9.26 \u00b1 4.47 | 0% | 1 | 0.575 \u00b1 0.032 | 99.6% | 0.557 \u00b1 0.055 | 99.6% |\n\n6 Conclusions\n\nWe have shown that by discretely relaxing continuous priors, variational inference can be performed accurately and efficiently using our DIRECT method. We have demonstrated that through the use of Kronecker matrix algebra, the ELBO of a discretely relaxed model can be efficiently and exactly computed even when this computation requires significantly more log-likelihood evaluations than the number of atoms in the known universe. Through this ability to exactly perform ELBO computations we achieve unbiased, zero-variance gradient estimates using automatic differentiation which we show significantly outperforms competing Monte Carlo alternatives that admit high-variance gradient estimates. We also demonstrate that the computational complexity of ELBO computations is independent of the quantity of training data using the DIRECT method, making the proposed approaches amenable to big data applications. At inference time, we show that we can again use Kronecker matrix algebra to exactly compute the statistical moments of the parameterized predictive posterior distribution, unlike competing techniques which rely on Monte Carlo sampling. Finally, we discuss and demonstrate how posterior samples can be sparse and can be represented as quantized integer values to enable efficient inference which is particularly powerful on hardware limited devices, or if energy efficiency is a major concern.\n\nWe illustrate the DIRECT approach on several popular models such as mean-field variational inference for generalized linear models and deep Bayesian neural networks for regression. We also discuss some models which do not admit a compact representation for exact ELBO computations. 
For these cases, we discuss and demonstrate novel extensions to the DIRECT method that allow efficient computation of a lower bound of the ELBO, and we demonstrate how an unfactorized variational distribution can be used by introducing a manageable level of stochasticity into the gradients. We hope that these new approaches for ELBO computations will inspire new model structures and research directions in approximate Bayesian inference.\n\nAcknowledgements\n\nResearch funded by an NSERC Discovery Grant and the Canada Research Chairs program.\n\nReferences\n[1] S. Thrun, W. Burgard, and D. Fox. Probabilistic robotics. MIT press, 2005.\n[2] J. Bradshaw, A. G. d. G. Matthews, and Z. Ghahramani. Adversarial Examples, Uncertainty, and Transfer Testing Robustness in Gaussian Process Hybrid Deep Networks. Tech. rep. 2017.\n[3] C. Louizos, K. Ullrich, and M. Welling. \u201cBayesian compression for deep learning\u201d. In: Advances in Neural Information Processing Systems. 2017, pp. 3288\u20133298.\n[4] M. I. Jordan, Z. Ghahramani, T. S. Jaakkola, and L. K. Saul. \u201cAn introduction to variational methods for graphical models\u201d. In: Machine learning 37.2 (1999), pp. 183\u2013233.\n[5] M. J. Wainwright and M. I. Jordan. \u201cGraphical models, exponential families, and variational inference\u201d. In: Foundations and Trends in Machine Learning 1.1\u20132 (2008), pp. 1\u2013305.\n[6] M. D. Hoffman, D. M. Blei, C. Wang, and J. Paisley. \u201cStochastic variational inference\u201d. In: The Journal of Machine Learning Research 14.1 (2013), pp. 1303\u20131347.\n[7] R. Ranganath, S. Gerrish, and D. Blei. \u201cBlack box variational inference\u201d. In: Artificial Intelligence and Statistics. 2014, pp. 814\u2013822.\n[8] A. Kucukelbir, D. Tran, R. Ranganath, A. Gelman, and D. M. Blei. \u201cAutomatic differentiation variational inference\u201d. In: The Journal of Machine Learning Research 18.1 (2017), pp. 430\u2013474.\n[9] D. M. Blei, A. Kucukelbir, and J. D. McAuliffe. \u201cVariational inference: A review for statisticians\u201d. In: Journal of the American Statistical Association 112.518 (2017), pp. 859\u2013877.\n[10] D. P. Kingma and M. Welling. \u201cAuto-encoding variational Bayes\u201d. In: arXiv preprint arXiv:1312.6114 (2013).\n[11] R. J. Williams. \u201cSimple statistical gradient-following algorithms for connectionist reinforcement learning\u201d. In: Reinforcement Learning. 1992, pp. 5\u201332.\n[12] M. C. Fu. \u201cGradient estimation\u201d. In: Handbooks in operations research and management science 13 (2006), pp. 575\u2013616.\n[13] P. W. Glynn. \u201cLikelihood ratio gradient estimation for stochastic systems\u201d. In: Communications of the ACM 33.10 (1990), pp. 75\u201384.\n[14] E. Jang, S. Gu, and B. Poole. \u201cCategorical reparameterization with Gumbel-softmax\u201d. In: arXiv preprint arXiv:1611.01144 (2016).\n[15] C. J. Maddison, A. Mnih, and Y. W. Teh. \u201cThe concrete distribution: A continuous relaxation of discrete random variables\u201d. In: arXiv preprint arXiv:1611.00712 (2016).\n[16] G. Tucker, A. Mnih, C. J. Maddison, J. Lawson, and J. Sohl-Dickstein. \u201cREBAR: Low-variance, unbiased gradient estimates for discrete latent variable models\u201d. In: Advances in Neural Information Processing Systems. 2017, pp. 2627\u20132636.\n[17] W. Grathwohl, D. Choi, Y. Wu, G. Roeder, and D. Duvenaud. \u201cBackpropagation through the void: Optimizing control variates for black-box gradient estimation\u201d. In: International Conference on Learning Representations. 2017.\n[18] C. F. Van Loan. \u201cThe ubiquitous Kronecker product\u201d. In: Journal of Computational and Applied Mathematics 123.1 (2000), pp. 85\u2013100.\n[19] R. A. Horn and C. R. Johnson. Topics in Matrix analysis. Cambridge university press, 1994, p. 208.\n[20] A. G. Wilson, Z. Hu, R. Salakhutdinov, and E. P. Xing. \u201cDeep kernel learning\u201d. In: Artificial Intelligence and Statistics. 2016, pp. 370\u2013378.\n[21] W. Chen, J. Wilson, S. Tyree, K. Weinberger, and Y. Chen. \u201cCompressing neural networks with the hashing trick\u201d. In: International Conference on Machine Learning. 2015, pp. 2285\u20132294.\n[22] Y. Gong, L. Liu, M. Yang, and L. Bourdev. \u201cCompressing deep convolutional networks using vector quantization\u201d. In: arXiv preprint arXiv:1412.6115 (2014).\n[23] S. Han, H. Mao, and W. J. Dally. \u201cDeep compression: Compressing deep neural networks with pruning, trained quantization and huffman coding\u201d. In: arXiv preprint arXiv:1510.00149 (2015).\n[24] A. Zhou, A. Yao, Y. Guo, L. Xu, and Y. Chen. \u201cIncremental network quantization: Towards lossless cnns with low-precision weights\u201d. In: arXiv preprint arXiv:1702.03044 (2017).\n[25] I. Hubara, M. Courbariaux, D. Soudry, R. El-Yaniv, and Y. Bengio. \u201cBinarized neural networks\u201d. In: Advances in neural information processing systems. 2016, pp. 4107\u20134115.\n[26] F. Li, B. Zhang, and B. Liu. \u201cTernary weight networks\u201d. In: arXiv preprint arXiv:1605.04711 (2016).\n[27] M. Rastegari, V. Ordonez, J. Redmon, and A. Farhadi. \u201cXnor-net: Imagenet classification using binary convolutional neural networks\u201d. In: European Conference on Computer Vision. Springer. 2016, pp. 525\u2013542.\n[28] B. Jacob, S. Kligys, B. Chen, M. Zhu, M. Tang, A. Howard, H. Adam, and D. Kalenichenko. \u201cQuantization and Training of Neural Networks for Efficient Integer-Arithmetic-Only Inference\u201d. In: arXiv preprint arXiv:1712.05877 (2017).\n[29] P. Lazarsfeld and N. Henry. Latent structure analysis. Houghton Mifflin Company, Boston, Massachusetts, 1968.\n[30] L. A. Goodman. \u201cExploratory latent structure analysis using both identifiable and unidentifiable models\u201d. In: Biometrika 61.2 (1974), pp. 215\u2013231.\n[31] A. Rahimi and B. Recht. \u201cRandom features for large-scale kernel machines\u201d. In: Advances in Neural Information Processing Systems. 2007, pp. 1177\u20131184.\n[32] R. H. Byrd, P. Lu, J. Nocedal, and C. Zhu. \u201cA limited memory algorithm for bound constrained optimization\u201d. In: SIAM Journal on Scientific Computing 16.5 (1995), pp. 1190\u20131208.\n[33] M. Abadi et al. \u201cTensorFlow: A System for Large-Scale Machine Learning.\u201d In: OSDI. Vol. 16. 2016, pp. 265\u2013283.\n[34] C. M. Bishop. Pattern Recognition and Machine Learning. Springer, 2006.\n[35] F. Nielsen and K. Sun. \u201cGuaranteed bounds on the Kullback-Leibler divergence of univariate mixtures using piecewise log-sum-exp inequalities\u201d. In: arXiv:1606.05850 (2016).\n[36] C. E. Rasmussen and C. K. I. Williams. Gaussian Processes for Machine Learning. MIT Press, 2006.\n[37] D. Tran, A. Kucukelbir, A. B. Dieng, M. Rudolph, D. Liang, and D. M. Blei. \u201cEdward: A library for probabilistic modeling, inference, and criticism\u201d. In: arXiv preprint arXiv:1610.09787 (2016).", "award": [], "sourceid": 6704, "authors": [{"given_name": "Trefor", "family_name": "Evans", "institution": "University of Toronto"}, {"given_name": "Prasanth", "family_name": "Nair", "institution": "University of Toronto"}]}