{"title": "Variational Bayes under Model Misspecification", "book": "Advances in Neural Information Processing Systems", "page_first": 13357, "page_last": 13367, "abstract": "Variational Bayes (VB) is a scalable alternative to Markov chain Monte Carlo (MCMC) for Bayesian posterior inference. Though popular, VB comes with few theoretical guarantees, most of which focus on well-specified models. However, models are rarely well-specified in practice. In this work, we study VB under model misspecification. We prove the VB posterior is asymptotically normal and centers at the value that minimizes the Kullback-Leibler (KL) divergence to the true data-generating distribution. Moreover, the VB posterior mean centers at the same value and is also asymptotically normal. These results generalize the variational Bernstein--von Mises theorem [29] to misspecified models. As a consequence of these results, we find that the model misspecification error dominates the variational approximation error in VB posterior predictive distributions. It explains the widely observed phenomenon that VB achieves comparable predictive accuracy with MCMC even though VB uses an approximating family. As illustrations, we study VB under three forms of model misspecification, ranging from model over-/under-dispersion to latent dimensionality misspecification. We conduct two simulation studies that demonstrate the theoretical results.", "full_text": "Variational Bayes under Model Misspeci\ufb01cation\n\nYixin Wang\n\nColumbia University\n\nDavid M. Blei\n\nColumbia University\n\nAbstract\n\nVariational Bayes (VB) is a scalable alternative to Markov chain Monte Carlo\n(MCMC) for Bayesian posterior inference. Though popular, VB comes with few\ntheoretical guarantees, most of which focus on well-speci\ufb01ed models. However,\nmodels are rarely well-speci\ufb01ed in practice. In this work, we study VB under\nmodel misspeci\ufb01cation. We prove the VB posterior is asymptotically normal and\ncenters at the value that minimizes the Kullback-Leibler (KL) divergence to the true\ndata-generating distribution. Moreover, the VB posterior mean centers at the same\nvalue and is also asymptotically normal. These results generalize the variational\nBernstein\u2013von Mises theorem [30] to misspeci\ufb01ed models. As a consequence of\nthese results, we \ufb01nd that the model misspeci\ufb01cation error dominates the variational\napproximation error in VB posterior predictive distributions. It explains the widely\nobserved phenomenon that VB achieves comparable predictive accuracy with\nMCMC even though VB uses an approximating family. As illustrations, we study\nVB under three forms of model misspeci\ufb01cation, ranging from model over-/under-\ndispersion to latent dimensionality misspeci\ufb01cation. We conduct two simulation\nstudies that demonstrate the theoretical results.\n\n1\n\nIntroduction\n\nBayesian modeling uses posterior inference to discover patterns in data. Begin by positing a proba-\nbilistic model that describes the generative process; it is a joint distribution of latent variables and the\ndata. The goal is to infer the posterior, the conditional distribution of the latent variables given the data.\nThe inferred posterior reveals hidden patterns of the data and helps form predictions about new data.\nFor many models, however, the posterior is computationally dif\ufb01cult\u2014it involves a marginal proba-\nbility that takes the form of an integral. 
Unless that integral admits a closed-form expression (or the latent variables are low-dimensional) it is intractable to compute.

To circumvent this intractability, investigators rely on approximate inference strategies such as variational Bayes (VB). VB approximates the posterior by solving an optimization problem. First, propose an approximating family of distributions that contains all factorizable densities; then find the member of this family that minimizes the KL divergence to the (computationally intractable) exact posterior. Take this minimizer as a substitute for the posterior and carry out downstream data analysis.

VB scales to large datasets and works empirically in many difficult models. However, it comes with few theoretical guarantees, most of which focus on well-specified models. For example, Wang & Blei [30] establish the consistency and asymptotic normality of the VB posterior, assuming the data is generated by the probabilistic model. Under a similar assumption of a well-specified model, Zhang & Gao [35] derive the convergence rate of the VB posterior in settings with high-dimensional latent variables.

But as George Box famously quipped, "all models are wrong." Probabilistic models are rarely well-specified in practice. Does VB still enjoy good theoretical properties under model misspecification? What about the VB posterior predictive distributions? These are the questions we study in this paper.

Figure 1: Why does the VB posterior converge to a point mass at θ∗? The intuition behind this figure is described in § 1. In the figure, q∗(θ) is the optimal VB posterior given x1:1000.

Main idea. We study VB under model misspecification. Under suitable conditions, we show that (1) the VB posterior is asymptotically normal, centering at the value that minimizes the KL divergence from the true distribution; (2) the VB posterior mean centers at the same value and is asymptotically normal; (3) in the variational posterior predictive, the error due to model misspecification dominates the error due to the variational approximation.

Concretely, consider n data points x1:n independently and identically distributed with a true density ∏_{i=1}^n p0(xi). Further consider a parametric probabilistic model with a d-dimensional latent variable θ = θ1:d; its density belongs to the family {∏_{i=1}^n p(xi | θ) : θ ∈ Rd}.¹ When the model is misspecified, it does not contain the true density, p0(x) ∉ {p(x | θ) : θ ∈ Θ}.

Placing a prior p(θ) on the latent variable θ, we infer its posterior p(θ | x1:n) using VB. Mean-field VB considers an approximating family Q that includes all factorizable densities,

    Q = { q(θ) : q(θ) = ∏_{i=1}^d qi(θi) }.

It then finds the member that minimizes the KL divergence to the exact posterior p(θ | x1:n),

    q∗(θ) = argmin_{q ∈ Q} KL(q(θ) || p(θ | x1:n)).    (1)

The global minimizer q∗(θ) is called the VB posterior. (Here we focus on mean-field VB. The results below apply to VB with more general approximating families as well.)
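As a minimal illustration of the optimization in Eq. 1, the sketch below runs mean-field coordinate ascent for one conjugate case, a Bayesian linear regression with known noise variance. The setup, prior choices, and function names are assumptions for the sketch, not code or an experiment from this paper.

```python
# A minimal mean-field VB sketch for Eq. (1), assuming a conjugate Bayesian
# linear regression with known noise variance; illustrative only.
import numpy as np

def mean_field_cavi(X, y, noise_var=1.0, prior_var=10.0, iters=100):
    """Coordinate ascent for q(theta) = prod_j q_j(theta_j) when
    y ~ N(X theta, noise_var * I) and theta ~ N(0, prior_var * I)."""
    d = X.shape[1]
    # The exact posterior is N(m, Lambda^{-1}) with this precision and linear term.
    Lambda = X.T @ X / noise_var + np.eye(d) / prior_var
    b = X.T @ y / noise_var
    mu = np.zeros(d)                 # variational means, updated coordinate-wise
    var = 1.0 / np.diag(Lambda)      # variational variances: diagonal of the precision
    for _ in range(iters):
        for j in range(d):           # q_j(theta_j) = N(mu[j], var[j])
            mu[j] = (b[j] - Lambda[j] @ mu + Lambda[j, j] * mu[j]) / Lambda[j, j]
    return mu, var

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))
# Heavy-tailed noise makes the Gaussian-noise model misspecified.
y = X @ np.array([1.0, -2.0, 0.5]) + rng.standard_t(df=3, size=500)
print(mean_field_cavi(X, y))
```

At the fixed point the variational means solve Λμ = b, so they agree with the exact posterior mean, while each factor keeps only 1/Λ_jj as its variance; this diagonal-precision behavior is the same phenomenon discussed after Theorem 1 below.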
We first study the asymptotic properties of the VB posterior and its mean. Denote θ∗ as the value of θ that minimizes the KL divergence to the true distribution,

    θ∗ = argmin_θ KL(p0(x) || p(x | θ)).    (2)

Note this KL divergence is different from the variational objective (Eq. 1); it is a property of the model class's relationship to the true density. We show that, under standard conditions, the VB posterior q∗(θ) converges in distribution to a point mass at θ∗. Moreover, the VB posterior of the rescaled and centered latent variable θ̃ = √n(θ − θ∗) is asymptotically normal. Similar asymptotics hold for the VB posterior mean θ̂VB = ∫ θ · q∗(θ) dθ: it converges almost surely to θ∗ and is asymptotically normal.

Why does the VB posterior converge to a point mass at θ∗? The reason rests on three observations. (1) The classical Bernstein–von Mises theorem under model misspecification [18] says that the exact posterior p(θ | x1:n) converges to a point mass at θ∗. (2) Because point masses are factorizable, this limiting exact posterior belongs to the approximating family Q: if θ∗ = (θ∗1, θ∗2, θ∗3), then δθ∗(θ) = δθ∗1(θ1) · δθ∗2(θ2) · δθ∗3(θ3). (3) VB seeks the member in Q that is closest to the exact posterior (which also belongs to Q, in the limit). Therefore, the VB posterior also converges to a point mass at θ∗. Figure 1 illustrates this intuition—as we see more data, the posterior gets closer to the variational family. We make this argument rigorous in § 2.

The asymptotic characterization of the VB posterior leads to an interesting result about the VB approximation of the posterior predictive. Consider two posterior predictive distributions under the misspecified model. The VB predictive density is formed with the VB posterior,

    p^pred_VB(xnew | x1:n) = ∫ p(xnew | θ) · q∗(θ) dθ.    (3)

¹ A parametric probabilistic model means the dimensionality of the latent variables does not grow with the number of data points.
We extend these results to more general probabilistic models in § 2.3.
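In practice, the VB predictive density in Eq. 3 is typically estimated by Monte Carlo over draws from q∗. The sketch below is a generic illustration under assumed helper functions (sample_q_star and model_density are hypothetical stand-ins, not an interface from this paper); the exact predictive defined next has the same form with draws from p(θ | x1:n) in place of q∗.

```python
# A Monte Carlo sketch of the VB posterior predictive in Eq. (3); the helper
# functions are hypothetical stand-ins, not part of the paper's code.
import numpy as np
from scipy.stats import norm

def vb_posterior_predictive(x_new, sample_q_star, model_density, num_draws=2000):
    """Estimate p_VB^pred(x_new | x_1:n) = E_{q*(theta)}[ p(x_new | theta) ]."""
    thetas = sample_q_star(num_draws)                      # theta_s ~ q*(theta)
    return np.mean([model_density(x_new, theta) for theta in thetas])

# Example usage with a univariate Gaussian q* and a Gaussian likelihood.
q_star_draws = lambda s: norm(1.0, 0.1).rvs(size=s, random_state=0)
lik = lambda x, theta: norm(theta, 1.0).pdf(x)
print(vb_posterior_predictive(0.5, q_star_draws, lik))
```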
The exact posterior predictive density is formed with the exact posterior,

    p^pred_exact(xnew | x1:n) = ∫ p(xnew | θ) · p(θ | x1:n) dθ.    (4)

Now define the model misspecification error to be the total variation (TV) distance between the exact posterior predictive and the true density p0(x). (When the model is well-specified, it converges to zero [26].) Further define the variational approximation error as the TV distance between the variational predictive and the exact predictive; it measures the price of the approximation when using the VB posterior to form the predictive. Below we prove that the model misspecification error dominates the variational approximation error—the variational approximation error vanishes as the number of data points increases. This result explains a widely observed phenomenon: VB achieves comparable predictive accuracy as MCMC even though VB uses an approximating family [4, 5, 7, 20].

The contributions of this work are to generalize the variational Bernstein–von Mises theorem [30] to misspecified models and to further study the VB posterior predictive distribution. § 2.1 and 2.2 detail the results around VB in parametric probabilistic models. § 2.3 generalizes the results to probabilistic models where the dimensionality of latent variables can grow with the number of data points. § 2.4 illustrates the results in three forms of model misspecification, including underdispersion and misspecification of the latent dimensionality. § 3 corroborates the theoretical findings with simulation studies on a generalized linear mixed model (GLMM) and latent Dirichlet allocation (LDA).

Related work. This work draws on two themes around VB and model misspecification.

The first theme is a body of work on the theoretical guarantees of VB. Assuming well-specified models, many researchers have studied the properties of VB posteriors on particular Bayesian models, including linear models [23, 33], exponential family models [27, 28], generalized linear mixed models [14, 15, 22], nonparametric regression [10], mixture models [29, 31], stochastic block models [3, 34], latent Gaussian models [25], and latent Dirichlet allocation [13].

In other work, Wang & Blei [30] establish the consistency and asymptotic normality of VB posteriors; Zhang & Gao [35] derive their convergence rate; and Pati et al. [24] provide risk bounds of VB point estimates. Further, Alquier & Ridgway [1], Alquier et al. [2], and Yang et al. [32] study risk bounds for variational approximations of Gibbs posteriors and fractional posteriors, Chérief-Abdellatif et al. [9] study VB for model selection in mixtures, Jaiswal et al. [17] study α-Rényi-approximate posteriors, and Fan et al. [11] and Ghorbani et al.
[13] study generalizations of VB via TAP free\nenergy. Again, most of these works focus on well-speci\ufb01ed models. In contrast, we focus on VB in\ngeneral misspeci\ufb01ed Bayesian models and characterize the asymptotic properties of the VB posterior\nand the VB posterior predictive. Note, when the model is well-speci\ufb01ed, our results recover the\nvariational Bernstein\u2013von Mises theorem of [30], but we further generalize their theory and extend it\nto analyzing the posterior predictive.\nThe second theme is about characterizing posterior distributions under model misspeci\ufb01cation.\nAllowing for model misspeci\ufb01cation, Kleijn et al. [18] establishes consistency and asymptotic\nnormality of the exact posterior in parametric Bayesian models; Kleijn et al. [19] studies exact\nposteriors in in\ufb01nite-dimensional Bayesian models. We leverage these results around exact posteriors\nto characterize VB posteriors and VB posterior predictive distributions under model misspeci\ufb01cation.\n\n2 Variational Bayes (VB) under model misspeci\ufb01cation\n\n\u00a7 2.1 and 2.2 examine the asymptotic properties of VB under model misspeci\ufb01cation and for paramet-\nric models. \u00a7 2.3 extends these results to more general models, where the dimension of the latent\nvariables grows with the data. \u00a7 2.4 illustrates the results with three types of model misspeci\ufb01cation.\n\n2.1 The VB posterior and the VB posterior mean\nWe \ufb01rst study the VB posterior q\u2217(\u03b8) and its mean \u02c6\u03b8VB. Assume iid data from a density xi \u223c p0 and\na parametric model p(x| \u03b8), i.e., a model where the dimension of the latent variables does not grow\nwith the data. We show that the optimal variational distribution q\u2217(\u03b8) (Eq. 1) is asymptotically normal\nand centers at \u03b8\u2217 (Eq. 2), which minimizes the KL between the model p\u03b8 and the true data generating\ndistribution p0. The VB posterior mean \u02c6\u03b8VB also converges to \u03b8\u2217 and is asymptotically normal.\n\n3\n\n\fBefore stating these asymptotic results, we make a few assumptions about the prior p(\u03b8) and the\nprobabilistic model {p(x| \u03b8) : \u03b8 \u2208 \u0398}. These assumptions resemble the classical assumptions in the\nBernstein\u2013von Mises theorems [18, 26].\nAssumption 1 (Prior mass). The prior density p(\u03b8) is continuous and positive in a neighborhood of\n\u03b8\u2217. There exists a constant Mp > 0 such that |(log p(\u03b8))(cid:48)(cid:48)| \u2264 Mpe|\u03b8|2.\nAssumption 1 roughly requires that the prior has some mass around the optimal \u03b8\u2217. It is a necessary\nassumption: if \u03b8\u2217 does not lie in the prior support then the posterior cannot be centered there.\nAssumption 1 also requires a tail condition on log p(\u03b8): the second derivative of log p(\u03b8) can not\ngrow faster than exp(|\u03b8|2). This is a technical condition that many common priors satisfy.\nAssumption 2 (Consistent testability). 
For every ε > 0 there exists a sequence of tests φn such that

    ∫ φn(x1:n) ∏_{i=1}^n p0(xi) dx1:n → 0,    (5)

    sup_{θ : ||θ − θ∗|| ≥ ε} ∫ (1 − φn(x1:n)) · ∏_{i=1}^n [ (p(xi | θ) / p(xi | θ∗)) p0(xi) ] dx1:n → 0.    (6)

Assumption 2 roughly requires θ∗ to be the unique optimum of the KL divergence to the truth (Eq. 2). In other words, θ∗ is identifiable from fitting the probabilistic model p(x | θ) to the data drawn from p0(x). To satisfy this condition, it suffices to have the likelihood ratio p(x | θ1)/p(x | θ2) be a continuous function of x for all θ1, θ2 ∈ Θ (Theorem 3.2 of [18]).

Assumption 1 and Assumption 2 are classical conditions required for the asymptotic normality of the exact posterior (Kleijn et al. [18]). They ensure that, for every sequence Mn → ∞,

    ∫_Θ 1(||θ − θ∗|| > δn Mn) · p(θ | x1:n) dθ →P0 0,    (7)

for some constant sequence δn → 0. In other words, the exact posterior p(θ | x) occupies vanishing mass outside of the δn Mn-sized neighborhood of θ∗. We note that the sequence δn also plays a role in the following local asymptotic normality (LAN) assumption.

Assumption 3 (Local asymptotic normality (LAN)). For every compact set K ⊂ Rd, there exist random vectors ∆n,θ∗ bounded in probability and nonsingular matrices Vθ∗ such that

    sup_{h ∈ K} | log [ p(x | θ∗ + δn h) / p(x | θ∗) ] − h⊤ Vθ∗ ∆n,θ∗ + (1/2) h⊤ Vθ∗ h | →P0 0,    (8)

where δn is a d × d diagonal matrix that describes how fast each dimension of the θ posterior converges to a point mass. We note that δn → 0 as n → ∞.

This is a key assumption that characterizes the limiting normal distribution of the VB posterior. The quantities ∆n,θ∗ and Vθ∗ determine the normal distribution that the VB posterior will converge to. The sequence δn determines the convergence rate of the VB posterior to a point mass. Many parametric models with a differentiable likelihood satisfy LAN. We provide a more technical description of how to verify Assumption 3 in Appendix A.

With these assumptions, we establish the asymptotic properties of the VB posterior and the VB posterior mean.

Theorem 1. (Variational Bernstein–von Mises theorem under model misspecification, parametric model version) Under Assumptions 1 to 3,

1. The VB posterior converges to a point mass at θ∗:

    q∗(θ) →d δθ∗.    (9)

2. Denote θ̃ = δn^{-1}(θ − θ∗) as the re-centered and re-scaled version of θ. The VB posterior of θ̃ is asymptotically normal:

    ‖ q∗(θ̃) − N(θ̃ ; ∆n,θ∗, V′θ∗^{-1}) ‖_TV →P0 0,    (10)

where V′θ∗ is diagonal and has the same diagonal terms as the exact posterior precision matrix Vθ∗.

3. The VB posterior mean converges to θ∗ almost surely:

    θ̂VB →a.s. θ∗.    (11)
4. The VB posterior mean is also asymptotically normal:

    δn^{-1}(θ̂VB − θ∗) →d ∆∞,θ∗,    (12)

where ∆∞,θ∗ is the limiting distribution of the random vectors ∆n,θ∗: ∆n,θ∗ →d ∆∞,θ∗. Its distribution is ∆∞,θ∗ ∼ N(0, Vθ∗^{-1} E_P0[(log p(x | θ∗))′ (log p(x | θ∗))′⊤] Vθ∗^{-1}).

Proof sketch. The proof structure of Theorem 1 mimics Wang & Blei [30] but extends it to allow for model misspecification. In particular, we take care of the extra technicality due to the difference between the true data-generating measure p0(x) and the probabilistic model we fit {p(x | θ) : θ ∈ Θ}. The proof proceeds in three steps:

1. Characterize the asymptotic properties of the exact posterior:

    p(θ | x) →d δθ∗,
    ‖ p(θ̃ | x) − N(∆n,θ∗, Vθ∗^{-1}) ‖_TV →P0 0.

This convergence is due to Assumptions 1 and 2, and the classical Bernstein–von Mises theorem under model misspecification [18].

2. Characterize the KL minimizer of the limiting exact posterior in the variational approximating family Q:

    argmin_{q ∈ Q} KL(q(θ) || p(θ | x)) →d δθ∗,
    ‖ argmin_{q ∈ Q} KL(q(θ̃) || p(θ̃ | x)) − N(θ̃ ; ∆n,θ∗, V′θ∗^{-1}) ‖_TV →P0 0,

where V′ is diagonal and shares the same diagonal terms as V. The intuition of this step is due to the observation that the point mass is factorizable: δθ∗ ∈ Q. We prove it via bounding the mass outside a neighborhood of θ∗ under the KL minimizer q∗(θ).

3. Show that the VB posterior approaches the KL minimizer of the limiting exact posterior as the number of data points increases:

    ‖ q∗(θ) − argmin_{q ∈ Qd} KL(q(·) || δθ∗) ‖_TV →P0 0,
    ‖ q∗(θ̃) − argmin_{q ∈ Qd} KL(q(·) || N(· ; ∆n,θ∗, Vθ∗^{-1})) ‖_TV →P0 0.

The intuition of this step is that if two distributions are close, then their KL minimizers should also be close. In addition, the VB posterior is precisely the KL minimizer to the exact posterior: q∗(θ) = argmin_{q ∈ Qd} KL(q(θ) || p(θ | x)). We leverage Γ-convergence to prove this claim.

These three steps establish the asymptotic properties of the VB posterior under model misspecification (Theorem 1.1 and Theorem 1.2): the VB posterior converges to δθ∗ and is asymptotically normal. To establish the asymptotic properties of the VB posterior mean (Theorem 1.3 and Theorem 1.4), we follow the classical argument in Theorem 2.3 of Kleijn et al. [18], which leverages that the posterior mean is the Bayes estimator under squared loss. The full proof is in Appendix D.
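As a quick numerical check of Theorem 1, the sketch below uses an assumed toy setup (not one of the paper's examples): overdispersed negative-binomial counts fit with a misspecified Poisson(θ) model under a Gamma(a0, b0) prior. For the Poisson model, KL(p0 || Poisson(θ)) is minimized at θ∗ = E_p0[x], and because θ is scalar the mean-field family is not restrictive, so the VB posterior coincides with the exact Gamma posterior; the point is to watch it concentrate at θ∗.

```python
# A toy numerical check of Theorem 1; the data-generating choices are assumed
# for illustration, not taken from the paper.
import numpy as np

rng = np.random.default_rng(1)
theta_star = 5.0                      # E_{p0}[x], the KL minimizer for the Poisson fit
a0, b0 = 1.0, 1.0                     # Gamma prior on the Poisson rate
for n_obs in [50, 500, 5000, 50000]:
    # Negative binomial with mean theta_star but variance > mean (overdispersed).
    x = rng.negative_binomial(n=2, p=2.0 / (2.0 + theta_star), size=n_obs)
    a, b = a0 + x.sum(), b0 + n_obs   # conjugate (= VB) posterior is Gamma(a, b)
    print(n_obs, "posterior mean:", a / b, "posterior sd:", np.sqrt(a) / b)
# The posterior mean approaches theta_star and the posterior spread shrinks,
# matching parts 1 and 3 of the theorem.
```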
Theorem 1 establishes the asymptotic properties of the VB posterior under model misspecification: it is asymptotically normal and converges to a point mass at θ∗, which minimizes the KL divergence to the true data-generating distribution. It also shows that the VB posterior mean shares similar convergence and asymptotic normality.

Theorem 1 states that, in the infinite data limit, the VB posterior and the exact posterior converge to the same point mass. The reason for this coincidence is (1) the limiting exact posterior is a point mass and (2) point masses are factorizable and hence belong to the variational approximating family Q. In other words, the variational approximation has a negligible effect on the limiting posterior.

Theorem 1 also shows that the VB posterior has a different covariance matrix from the exact posterior. The VB posterior has a diagonal covariance matrix but the covariance of the exact posterior is not necessarily diagonal. However, the inverses of the two covariance matrices match in their diagonal terms. This fact implies that the entropy of the limiting VB posterior is always smaller than or equal to that of the limiting exact posterior (Lemma 8 of Wang & Blei [30]), which echoes the fact that the VB posterior is under-dispersed relative to the exact posterior.

We remark that the under-dispersion of the VB posterior does not necessarily imply under-coverage of the VB credible intervals. The reason is that, under model misspecification, even the credible intervals of the exact posterior cannot guarantee coverage [18]. Depending on how the model is misspecified, the credible intervals derived from the exact posterior can be arbitrarily under-covering or over-covering. Put differently, under model misspecification, neither the VB posterior nor the exact posterior is reliable for uncertainty quantification.

Consider a well-specified model, where p0(x) = p(x | θ0) for some θ0 ∈ Θ and θ∗ = θ0. In this case, Theorem 1 recovers the variational Bernstein–von Mises theorem [30]. That said, Assumptions 2 and 3 are stronger than their counterparts for well-specified models; the reason is that P0 is usually less well-behaved than Pθ0. Assumptions 2 and 3 more closely align with those required in characterizing the exact posteriors under misspecification (Theorem 2.1 of [18]).

2.2 The VB posterior predictive distribution

We now study the posterior predictive induced by the VB posterior. As a consequence of Theorem 1, the error due to model misspecification dominates the error due to the variational approximation.
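Before the formal statement, here is a small numerical sketch of this claim in an assumed 2-D Gaussian location model (our illustration only, separate from the experiments in § 3). The model says x ~ N(θ, S) with S fixed but wrong: the data actually come from N(μ0, S0) with S0 ≠ S. All posteriors and predictives are Gaussian here, so the two TV errors can be estimated by simple Monte Carlo.

```python
# A toy sketch comparing the variational approximation error with the model
# misspecification error; all numerical choices are assumptions for illustration.
import numpy as np
from scipy.stats import multivariate_normal as mvn

def tv_mc(f, g, num=200_000, seed=0):
    """Monte Carlo estimate of TV(f, g) = 0.5 * E_f |1 - g(x)/f(x)|."""
    x = f.rvs(size=num, random_state=seed)
    return 0.5 * np.mean(np.abs(1.0 - g.pdf(x) / f.pdf(x)))

S = np.array([[1.0, 0.8], [0.8, 1.0]])     # model covariance (misspecified)
S0 = np.array([[1.5, 0.2], [0.2, 0.5]])    # true covariance
mu0 = np.array([1.0, -1.0])                # true mean, which is also theta*
tau2 = 100.0                               # prior variance: theta ~ N(0, tau2 * I)
rng = np.random.default_rng(0)

for n in [10, 100, 1000, 10000]:
    x = rng.multivariate_normal(mu0, S0, size=n)
    Lam = n * np.linalg.inv(S) + np.eye(2) / tau2               # exact posterior precision
    m = np.linalg.solve(Lam, np.linalg.inv(S) @ x.sum(axis=0))  # exact (= VB) posterior mean
    pred_exact = mvn(m, S + np.linalg.inv(Lam))                 # exact posterior predictive
    pred_vb = mvn(m, S + np.diag(1.0 / np.diag(Lam)))           # mean-field VB predictive
    var_err = tv_mc(pred_exact, pred_vb)                        # variational approximation error
    mis_err = tv_mc(mvn(mu0, S0), pred_exact)                   # model misspecification error
    print(n, round(var_err, 4), round(mis_err, 4))
# The variational error shrinks toward zero as n grows, while the
# misspecification error settles at a positive constant.
```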
Recall that p^pred_VB(xnew | x1:n) is the VB posterior predictive (Eq. 3), p^pred_exact(xnew | x1:n) is the exact posterior predictive (Eq. 4), p0(·) is the true data generating density, and the TV distance between two densities q1 and q2 is ‖q1(x) − q2(x)‖_TV ≜ (1/2) ∫ |q1(x) − q2(x)| dx.

Theorem 2. (The VB posterior predictive distribution) If the probabilistic model is misspecified, i.e. ‖p0(x) − p(x | θ∗)‖_TV > 0, then the model approximation error dominates the variational approximation error:

    ‖ p^pred_VB(xnew | x1:n) − p^pred_exact(xnew | x1:n) ‖_TV / ‖ p0(xnew) − p^pred_exact(xnew | x1:n) ‖_TV →P0 0,    (13)

under the regularity condition ∫ ∇²_θ p(x | θ∗) dx < ∞ and Assumptions 1 to 3.

Proof sketch. Theorem 2 is due to two observations: (1) in the infinite data limit, the VB posterior predictive converges to the exact posterior predictive and (2) in the infinite data limit, the exact posterior predictive does not converge to the true data-generating distribution because of model misspecification. Taken together, these two observations give Eq. 13.

The first observation comes from Theorem 1, which implies that both the VB posterior and the exact posterior converge to the same point mass δθ∗ in the infinite data limit. Thus, they lead to similar posterior predictive distributions, which gives

    ‖ p^pred_VB(xnew | x1:n) − p^pred_exact(xnew | x1:n) ‖_TV →P0 0.    (14)

Moreover, the model is assumed to be misspecified, ‖p0(x) − p(x | θ∗)‖_TV > 0, which implies

    ‖ p0(xnew) − p^pred_exact(xnew | x1:n) ‖_TV → c0 > 0.    (15)

This fact shows that the model misspecification error does not vanish in the infinite data limit. Eq. 14 and Eq. 15 imply Theorem 2. The full proof of Theorem 2 is in Appendix E.

As the number of data points increases, Theorem 2 shows that the model misspecification error dominates the variational approximation error. The reason is that both the VB posterior and the exact posterior converge to the same point mass. So, even though the VB posterior has an under-dispersed covariance matrix relative to the exact posterior, both covariance matrices shrink to zero in the infinite data limit; they converge to the same posterior predictive distributions.

Theorem 2 implies that when the model is misspecified, VB pays a negligible price in its posterior predictive distribution. In other words, if the goal is prediction, we should focus on finding the correct model rather than on correcting the variational approximation. For the predictive ability of the posterior, the problem of an incorrect model outweighs the problem of an inexact inference.

Theorem 2 also explains the phenomenon that VB predicts well despite being an approximate inference method. As models are rarely correct in practice, the error due to model misspecification often dominates the variational approximation error. Thus, on large datasets, VB can achieve comparable predictive performance, even when compared to more exact Bayesian inference algorithms (like long-run MCMC) that do not use approximating families [4, 5, 7, 20].

2.3 Variational Bayes (VB) in misspecified general probabilistic models

§ 2.1 and 2.2 characterize the VB posterior, the VB posterior mean, and the VB posterior predictive distribution in misspecified parametric models. Here we extend these results to a more general class of (misspecified) models with both global latent variables θ = θ1:d and local latent variables z = z1:n. This more general class allows the local latent variables to grow with the size of the data.
The key\nidea is to reduce this class to the simpler parametric models, via what we call the \u201cvariational model.\u201d\nConsider the following probabilistic model with both global and local latent variables for n data\npoints x = x1:n,\n\ni=1 p(zi | \u03b8)p(xi | zi, \u03b8).\nThe goal is to infer p(\u03b8 | x), the posterior of the global latent variables.2\nVB approximates the posterior of both global and local latent variables p(\u03b8, z | x) by minimizing its\nKL to the exact posterior:\n(17)\n\nq\u2217(\u03b8)q\u2217(z) = q\u2217(\u03b8, z) = arg min\n\n(16)\n\np(\u03b8, x, z) = p(\u03b8)(cid:81)n\n\nKL(q(\u03b8, z)||p(\u03b8, z | x)),\n\nq\u2208Q\n\nwhere Q = {q : q(\u03b8, z) =(cid:81)d\n\ni=1 q\u03b8i(\u03b8i)(cid:81)n\n\nj=1 qzj (zj)} is the approximating family that contains all\nfactorizable densities. (The \ufb01rst equality is because q\u2217(\u03b8, z) belongs to the factorizable family Q.)\nThe VB posterior of the global latent variables \u03b81:d is q\u2217(\u03b8).\nVB for general probabilistic models operates in the same way as for parametric models, except we\nmust additionally approximate the posterior of the local latent variables. Our strategy is to reduce the\ngeneral probabilistic model with VB to a parametric model (\u00a7 2.1). Consider the so-called variational\nlog-likelihood [30],\n\nlog pVB(x| \u03b8) = \u03b7(\u03b8) + max\nq(z)\u2208Q\n\nEq(z) [log p(x, z | \u03b8) \u2212 log q(z)] ,\n\n(18)\n\nwhere \u03b7(\u03b8) is a log normalizer. Now construct the variational model with pVB(x| \u03b8) as the likelihood\nand \u03b8 as the global latent variable. This model no longer contains local latent variables; it is a\nparametric model.\nUsing the same prior p(\u03b8), the variational model leads to a posterior on the global latent variable\n\n(cid:82) p(\u03b8)pVB(x| \u03b8) d\u03b8\n\u03c0\u2217(\u03b8 | x) (cid:44) p(\u03b8)pVB(x| \u03b8)\n\n.\n\n(19)\n\nAs shown in [30], the VB posterior, which optimizes the variational objective, is close to \u03c0\u2217(\u03b8 | x),\n(20)\n\nq\u2217(\u03b8) = arg min\n\nKL(q(\u03b8)||\u03c0\u2217(\u03b8 | x)) + oP0(1).\n\nq\u2208Q\n\n2This model has one local latent variable per data point. But the results here extend to probabilistic models\nwith z = z1:dn and non i.i.d data x = x1:n. We only require that d stays \ufb01xed as n grows but dn grows with n.\n\n7\n\n\f(cid:90)\n\n(cid:90)\n\nsup\n\n{\u03b8:||\u03b8\u2212\u03b8\u2217||\u2265\u0001}\n\n(cid:12)(cid:12)(cid:12)(cid:12)log\n\n\u03c6n(x)p0(x) dx \u2192 0,\n\n(1 \u2212 \u03c6n(x))\n\npVB(x| \u03b8)\npVB(x| \u03b8\u2217)\n\np0(x) dx \u2192 0.\n\n(cid:12)(cid:12)(cid:12)(cid:12) P0\u2192 0,\n\n1\n2\n\nh(cid:62)V\u03b8\u2217 h\n\n(22)\n\n(23)\n\n(24)\n\nNotice that Eq. 20 resembles Eq. 1. This observation leads to a reduction of VB in general probabilistic\nmodels to VB in parametric probabilistic models with an alternative likelihood pVB(x| \u03b8). This\nperspective then allows us to extend Theorems 1 and 2 in \u00a7 2.1 to general probabilistic models.\nMore speci\ufb01cally, we de\ufb01ne the optimal value \u03b8\u2217 as in parametric models:\n\n\u03b8\u2217 \u2206= arg max KL(p0(x)||pVB(\u03b8 ; x)).\n\n(21)\nThis de\ufb01nition of \u03b8\u2217 coincides with the de\ufb01nition in parametric models (Eq. 2) when the model is\nindeed parametric.\nNext we state the assumptions and results for the VB posterior and the VB posterior mean for general\nprobabilistic models.\nAssumption 4 (Consistent testability). 
For every \u0001 > 0 there exists a sequence of tests \u03c6n such that\n\nAssumption 5 (Local asymptotic normality (LAN)). For every compact set K \u2282 Rd, there exist\nrandom vectors \u2206n,\u03b8\u2217 bounded in probability and nonsingular matrices V\u03b8\u2217 such that\n\npVB(x| \u03b8\u2217 + \u03b4nh)\n\n\u2212 h(cid:62)V\u03b8\u2217 \u2206n,\u03b8\u2217 +\nwhere \u03b4n is a d \u00d7 d diagonal matrix, where \u03b4n \u2192 0 as n \u2192 \u221e.\n\npVB(x| \u03b8\u2217)\n\nsup\nh\u2208K\n\nAssumptions 4 and 5 are analogous to Assumptions 2 and 3 except that we replace the model\np(x| \u03b8) with the variational model pVB(x| \u03b8). In particular, Assumption 6 is a LAN assumption on\nprobabilistic models with local latent variables, i.e. nonparametric models. While the LAN assumption\ndoes not hold generally in nonparametric models with in\ufb01nite-dimensional parameters [12], there are a\nfew nonparametric models that have been shown to satisfy the LAN assumption, including generalized\nlinear mixed models [15], stochastic block models [3], and mixture models [31]. We illustrate how to\nverify Assumptions 4 and 5 for speci\ufb01c models in Appendix C. We refer the readers to Section 3.4 of\nWang & Blei [30] for a detailed discussion on these assumptions about the variational model.\nUnder Assumptions 1, 4 and 5, Theorems 1 and 2 can be generalized to general probabilistic models.\nThe full details of these results (Theorems 3 and 4) are in Appendix B.\n\n2.4 Applying the theory\n\nTo illustrate the theorems, we apply Theorems 1, 2, 3 and 4 to three types of model misspeci\ufb01cation:\nunderdispersion in Bayesian regression of count data, component misspeci\ufb01cation in Bayesian mixture\nmodels, and latent dimensionality misspeci\ufb01cation with Bayesian stochastic block models. For each\nmodel, we verify the assumptions of the theorems and then characterize the limiting distribution of\ntheir VB posteriors. The details of these results are in Appendix C.\n\n3 Simulations\n\nWe illustrate the implications of Theorems 1, 2, 3 and 4 with simulation studies. We studied two\nmodels, Bayesian GLMM [21] and LDA [6]. To make the models misspeci\ufb01ed, we generate datasets\nfrom an \u201cincorrect\u201d model and then perform approximate posterior inference. We evaluate how\nclose the approximate posterior is to the limiting exact posterior \u03b4\u03b8\u2217, and how well the approximate\nposterior predictive captures the test sets.\nTo approximate the posterior, we compare VB with Hamiltonian Monte Carlo (HMC), which draws\nsamples from the exact posterior. We \ufb01nd that both achieve similar closeness to \u03b4\u03b8\u2217 and comparable\npredictive log likelihood on test sets. We use two automated inference algorithms in Stan [8]:\n\n8\n\n\f(b) LDA: Mean KL to \u03b8\u2217\n\n(a) GLMM: RMSE to \u03b8\u2217\n(c) GLMM: Predictive LL (d) LDA: Predictive LL\nFigure 2: Dataset size versus closeness to the limiting exact posterior \u03b4\u03b8\u2217 and posterior predictive log\nlikelihood on test data (mean \u00b1 sd). VB posteriors and MCMC posteriors achieve similar closeness\nto \u03b4\u03b8\u2217 and comparable predictive accuracy.\n\nautomatic differentiation variational inference (ADVI) [20] for VB and No-U-Turn sampler (NUTS)\n[16] for HMC. We lay out the detailed simulation setup in Appendix I.\nBayesian GLMM. 
We simulate data from a negative binomial linear mixed model (LMM): each\nindividual belongs to one of the ten groups; each group has N individuals; and the outcome is\naffected by a random effect due to this group membership. Then we \ufb01t a Poisson LMM with the same\ngroup structure, which is misspeci\ufb01ed with respect to the simulated data. Figure 2a shows that the\nRMSE to \u03b8\u2217 for the VB and MCMC posterior converges to similar values as the number of individuals\nincreases. This simulation corroborates Theorems 1 and 3: the limiting VB posterior coincide with\nthe limiting exact posterior. Figure 2c shows that VB and MCMC achieve similar posterior predictive\nlog likelihood as the dataset size increases. It echoes Theorems 2 and 4: when performing prediction,\nthe error due to the variational approximation vanishes with in\ufb01nite data.\nLatent Dirichlet allocation (LDA). We simulate N documents from a 15-dimensional LDA and\n\ufb01t a 10-dimensional LDA; the latent dimensionality of LDA is misspeci\ufb01ed. Figure 2b shows the\ndistance between the VB/MCMC posterior topics to the limiting exact posterior topics, measured\nby KL averaged over topics. When the number of documents is at least 200, both VB and MCMC\nare similarly close to the limiting exact posterior. Figure 2d shows that, again once there are 200\ndocuments, the VB and MCMC posteriors also achieve similar predictive ability. These results are\nconsistent with Theorems 1, 2, 3 and 4.\n\n4 Discussion\n\nIn this work, we study VB under model misspeci\ufb01cation. We show that the VB posterior is asymp-\ntotically normal, centering at the value that minimizes the KL divergence from the true distribution.\nThe VB posterior mean also centers at the same value and is asymptotically normal. These results\ngeneralize the variational Bernstein\u2013von Mises theorem Wang & Blei [30] to misspeci\ufb01ed models.\nWe further study the VB posterior predictive distributions. We \ufb01nd that the model misspeci\ufb01cation er-\nror dominates the variational approximation error in the VB posterior predictive distributions. These\nresults explain the empirical phenomenon that VB predicts comparably well as MCMC even if it uses\nan approximating family. It also suggests that we should focus on \ufb01nding the correct model rather\nthan de-biasing the variational approximation if we use VB for prediction.\nAn interesting direction for future work is to characterize local optima of the evidence lower bound\n(ELBO), which is the VB posterior we obtain in practice. The results in this work all assume that the\nELBO optimization returns global optima. It provides the possibility for local optima to share these\nproperties, though further research is needed to understand the precise properties of local optima.\nCombining this work with optimization guarantees may lead to a fruitful further characterization of\nvariational Bayes.\nAcknowledgments. We thank Victor Veitch and Jackson Loper for helpful comments on this article.\nThis work is supported by ONR N00014-17-1-2131, ONR N00014-15-1-2209, NIH 1U01MH115727-\n01, NSF CCF-1740833, DARPA SD2 FA8750-18-C-0130, IBM, 2Sigma, Amazon, NVIDIA, and\nSimons Foundation.\n\n9\n\n502001000500020000N0.60.81.01.21.41.6RMSEHMCMFVB10502001000N0.81.01.21.41.61.8KLHMCMFVB502001000500020000N530520510500490480Pred LLHMCMFVB10502001000N430425420415410405400395Pred LLHMCMFVB\fReferences\n[1] Alquier, P. & Ridgway, J. (2017). 
Concentration of tempered posteriors and of their variational approxima-\n\ntions. arXiv preprint arXiv:1706.09293.\n\n[2] Alquier, P., Ridgway, J., & Chopin, N. (2016). On the properties of variational approximations of gibbs\n\nposteriors. Journal of Machine Learning Research, 17(239), 1\u201341.\n\n[3] Bickel, P., Choi, D., Chang, X., Zhang, H., et al. (2013). Asymptotic normality of maximum likelihood and\n\nits variational approximation for stochastic blockmodels. The Annals of Statistics, 41(4), 1922\u20131943.\n\n[4] Blei, D. M., Jordan, M. I., et al. (2006). Variational inference for dirichlet process mixtures. Bayesian\n\nanalysis, 1(1), 121\u2013143.\n\n[5] Blei, D. M., Kucukelbir, A., & McAuliffe, J. D. (2017). Variational inference: A review for statisticians.\n\nJournal of the American Statistical Association, 112(518), 859\u2013877.\n\n[6] Blei, D. M., Ng, A. Y., & Jordan, M. I. (2003). Latent dirichlet allocation. Journal of machine Learning\n\nresearch, 3(Jan), 993\u20131022.\n\n[7] Braun, M. & McAuliffe, J. (2010). Variational inference for large-scale models of discrete choice. Journal\n\nof the American Statistical Association, 105(489), 324\u2013335.\n\n[8] Carpenter, B., Gelman, A., et al. (2015). Stan: a probabilistic programming language. Journal of Statistical\n\nSoftware.\n\n[9] Ch\u00e9rief-Abdellatif, B.-E., Alquier, P., et al. (2018). Consistency of variational bayes inference for estimation\n\nand model selection in mixtures. Electronic Journal of Statistics, 12(2), 2995\u20133035.\n\n[10] Faes, C., Ormerod, J. T., & Wand, M. P. (2011). Variational bayesian inference for parametric and\nnonparametric regression with missing data. Journal of the American Statistical Association, 106(495), 959\u2013\n971.\n\n[11] Fan, Z., Mei, S., & Montanari, A. (2018). Tap free energy, spin glasses, and variational inference. arXiv\n\npreprint arXiv:1808.07890.\n\n[12] Freedman, D. et al. (1999). Wald lecture: On the bernstein-von mises theorem with in\ufb01nite-dimensional\n\nparameters. The Annals of Statistics, 27(4), 1119\u20131141.\n\n[13] Ghorbani, B., Javadi, H., & Montanari, A. (2018). An instability in variational inference for topic models.\n\narXiv preprint arXiv:1802.00568.\n\n[14] Hall, P., Ormerod, J. T., & Wand, M. (2011a). Theory of gaussian variational approximation for a Poisson\n\nmixed model. Statistica Sinica, (pp. 369\u2013389).\n\n[15] Hall, P., Pham, T., Wand, M. P., Wang, S. S., et al. (2011b). Asymptotic normality and valid inference for\n\nGaussian variational approximation. The Annals of Statistics, 39(5), 2502\u20132532.\n\n[16] Hoffman, M. D. & Gelman, A. (2014). The No-U-Turn sampler. JMLR, 15(1), 1593\u20131623.\n[17] Jaiswal, P., Rao, V. A., & Honnappa, H. (2019). Asymptotic consistency of \u03b1-r\u00e9nyi-approximate posteriors.\n\narXiv preprint arXiv:1902.01902.\n\n[18] Kleijn, B., Van der Vaart, A., et al. (2012). The Bernstein-von-Mises theorem under misspeci\ufb01cation.\n\nElectronic Journal of Statistics, 6, 354\u2013381.\n\n[19] Kleijn, B. J., van der Vaart, A. W., et al. (2006). Misspeci\ufb01cation in in\ufb01nite-dimensional bayesian statistics.\n\nThe Annals of Statistics, 34(2), 837\u2013877.\n\n[20] Kucukelbir, A., Tran, D., Ranganath, R., Gelman, A., & Blei, D. M. (2017). Automatic differentiation\n\nvariational inference. The Journal of Machine Learning Research, 18(1), 430\u2013474.\n\n[21] McCullagh, P. (1984). Generalized linear models. European Journal of Operational Research, 16(3),\n\n285\u2013292.\n\n[22] Ormerod, J. T. 
& Wand, M. P. (2012). Gaussian variational approximate inference for generalized linear\n\nmixed models. Journal of Computational and Graphical Statistics, 21(1), 2\u201317.\n\n[23] Ormerod, J. T., You, C., & Muller, S. (2014). A variational Bayes approach to variable selection. Technical\n\nreport, Citeseer.\n\n[24] Pati, D., Bhattacharya, A., & Yang, Y. (2017). On statistical optimality of variational bayes. arXiv preprint\n\narXiv:1712.08983.\n\n[25] Sheth, R. & Khardon, R. (2017). Excess risk bounds for the bayes risk using variational inference in latent\n\ngaussian models. In Advances in Neural Information Processing Systems (pp. 5157\u20135167).\n\n[26] Van der Vaart, A. W. (2000). Asymptotic statistics, volume 3. Cambridge university press.\n[27] Wang, B. & Titterington, D. (2004). Convergence and asymptotic normality of variational bayesian\napproximations for exponential family models with missing values. In Proceedings of the 20th conference on\nUncertainty in arti\ufb01cial intelligence (pp. 577\u2013584).: AUAI Press.\n\n10\n\n\f[28] Wang, B. & Titterington, D. (2005). Inadequacy of interval estimates corresponding to variational bayesian\n\napproximations. In AISTATS.\n\n[29] Wang, B., Titterington, D., et al. (2006). Convergence properties of a general algorithm for calculating\n\nvariational bayesian estimates for a normal mixture model. Bayesian Analysis, 1(3), 625\u2013650.\n\n[30] Wang, Y. & Blei, D. M. (2018). Frequentist consistency of variational bayes. Journal of the American\n\nStatistical Association, (just-accepted), 1\u201385.\n\n[31] Westling, T. & McCormick, T. H. (2015). Establishing consistency and improving uncertainty estimates of\n\nvariational inference through M-estimation. arXiv preprint arXiv:1510.08151.\n\n[32] Yang, Y., Pati, D., & Bhattacharya, A. (2017). \u03b1-variational inference with statistical guarantees. arXiv\n\npreprint arXiv:1710.03266.\n\n[33] You, C., Ormerod, J. T., & M\u00fcller, S. (2014). On variational Bayes estimation and variational information\n\ncriteria for linear regression models. Australian & New Zealand Journal of Statistics, 56(1), 73\u201387.\n\n[34] Zhang, A. Y. & Zhou, H. H. (2017). Theoretical and computational guarantees of mean \ufb01eld variational\n\ninference for community detection. arXiv preprint arXiv:1710.11268.\n\n[35] Zhang, F. & Gao, C. (2017). Convergence rates of variational posterior distributions. arXiv preprint\n\narXiv:1712.02519.\n\n11\n\n\f", "award": [], "sourceid": 7343, "authors": [{"given_name": "Yixin", "family_name": "Wang", "institution": "Columbia University"}, {"given_name": "David", "family_name": "Blei", "institution": "Columbia University"}]}