{"title": "Copula variational inference", "book": "Advances in Neural Information Processing Systems", "page_first": 3564, "page_last": 3572, "abstract": "We develop a general variational inference method that preserves dependency among the latent variables. Our method uses copulas to augment the families of distributions used in mean-field and structured approximations. Copulas model the dependency that is not captured by the original variational distribution, and thus the augmented variational family guarantees better approximations to the posterior. With stochastic optimization, inference on the augmented distribution is scalable. Furthermore, our strategy is generic: it can be applied to any inference procedure that currently uses the mean-field or structured approach. Copula variational inference has many advantages: it reduces bias; it is less sensitive to local optima; it is less sensitive to hyperparameters; and it helps characterize and interpret the dependency among the latent variables.", "full_text": "Copula variational inference

Dustin Tran
Harvard University

David M. Blei
Columbia University

Edoardo M. Airoldi
Harvard University

Abstract

We develop a general variational inference method that preserves dependency among the latent variables. Our method uses copulas to augment the families of distributions used in mean-field and structured approximations. Copulas model the dependency that is not captured by the original variational distribution, and thus the augmented variational family guarantees better approximations to the posterior. With stochastic optimization, inference on the augmented distribution is scalable. Furthermore, our strategy is generic: it can be applied to any inference procedure that currently uses the mean-field or structured approach.
Copula variational inference has many advantages: it reduces bias; it is less sensitive to local optima; it is less sensitive to hyperparameters; and it helps characterize and interpret the dependency among the latent variables.

1 Introduction

Variational inference is a computationally efficient approach for approximating posterior distributions. The idea is to specify a tractable family of distributions over the latent variables and then to minimize the Kullback-Leibler divergence from it to the posterior. Combined with stochastic optimization, variational inference can scale complex statistical models to massive data sets [9, 23, 24].

Both the computational complexity and accuracy of variational inference are controlled by the factorization of the variational family. To keep optimization tractable, most algorithms use the fully factorized family, also known as the mean-field family, where each latent variable is assumed independent. Less commonly, structured mean-field methods slightly relax this assumption by preserving some of the original structure among the latent variables [19]. Factorized distributions enable efficient variational inference, but they sacrifice accuracy: in the exact posterior, many latent variables are dependent, and mean-field methods, by construction, fail to capture this dependency.

To this end, we develop copula variational inference (copula vi). Copula vi augments the traditional variational distribution with a copula, which is a flexible construction for learning dependencies in factorized distributions [3]. This strategy has many advantages over traditional vi: it reduces bias; it is less sensitive to local optima; it is less sensitive to hyperparameters; and it helps characterize and interpret the dependency among the latent variables.
Variational inference has previously been restricted to either generic inference on simple models, where dependency does not make a significant difference, or model-specific variational updates written by hand. Copula vi widens its applicability, providing generic inference that finds meaningful dependencies between latent variables.

In more detail, our contributions are the following.

A generalization of the original procedure in variational inference. Copula vi generalizes variational inference for mean-field and structured factorizations: traditional vi corresponds to running only one step of our method. It uses coordinate descent, which monotonically decreases the KL divergence to the posterior by alternating between fitting the mean-field parameters and the copula parameters. Figure 1 illustrates copula vi on a toy example of fitting a bivariate Gaussian.

Improving generic inference. Copula vi can be applied to any inference procedure that currently uses the mean-field or structured approach. Further, because it does not require specific knowledge of the model, it falls into the framework of black box variational inference [15]. An investigator need only write down a function to evaluate the model log-likelihood. The rest of the algorithm's calculations, such as sampling and evaluating gradients, can be placed in a library.

Figure 1: Approximations to an elliptical Gaussian. The mean-field (red) is restricted to fitting independent one-dimensional Gaussians, which is the first step in our algorithm. The second step (blue) fits a copula which models the dependency. More iterations alternate: the third refits the mean-field (green) and the fourth refits the copula (cyan), demonstrating convergence to the true posterior.

Richer variational approximations. In experiments, we demonstrate copula vi on the standard example of Gaussian mixture models.
We found it consistently estimates the parameters, reduces sensitivity to local optima, and reduces sensitivity to hyperparameters. We also examine how well copula vi captures dependencies on the latent space model [7]. Copula vi outperforms competing methods and significantly improves upon the mean-field approximation.

2 Background

2.1 Variational inference

Let x be a set of observations, z be latent variables, and λ be the free parameters of a variational distribution q(z; λ). We aim to find the best approximation of the posterior p(z | x) using the variational distribution q(z; λ), where the quality of the approximation is measured by KL divergence. This is equivalent to maximizing the quantity

L(λ) = E_{q(z;λ)}[log p(x, z)] − E_{q(z;λ)}[log q(z; λ)].

L(λ) is the evidence lower bound (elbo), or the variational free energy [25]. For simpler computation, a standard choice of the variational family is a mean-field approximation

q(z; λ) = ∏_{i=1}^d qi(zi; λi),

where z = (z1, . . . , zd). Note this is a strong independence assumption. More sophisticated approaches, known as structured variational inference [19], attempt to restore some of the dependencies among the latent variables.

In this work, we restore dependencies using copulas. Structured vi is typically tailored to individual models and is difficult to work with mathematically. Copulas learn general posterior dependencies during inference, and they do not require the investigator to know such structure in advance. Further, copulas can augment a structured factorization in order to introduce dependencies that were not considered before; thus the approach generalizes the procedure. We next review copulas.

2.2 Copulas

We will augment the mean-field distribution with a copula. We consider the variational family

q(z) = [ ∏_{i=1}^d q(zi) ] c(Q(z1), . . . , Q(zd)).

Here Q(zi) is the marginal cumulative distribution function (CDF) of the random variable zi, and c is a joint distribution of [0, 1] random variables.¹ The distribution c is called a copula of z: it is a joint multivariate density of Q(z1), . . . , Q(zd) with uniform marginal distributions [21]. For any distribution, a factorization into a product of marginal densities and a copula always exists and integrates to one [14].

Figure 2: Example of a vine V which factorizes a copula density of four random variables c(u1, u2, u3, u4) into a product of 6 pair copulas. Edges in the tree Tj are the nodes of the lower level tree Tj+1, and each edge determines a bivariate copula which is conditioned on all random variables that its two connected nodes share.

Intuitively, the copula captures the information about the multivariate random variable after eliminating the marginal information, i.e., by applying the probability integral transform on each variable. The copula captures only and all of the dependencies among the zi's. Recall that, for any random variable, Q(zi) is uniformly distributed; thus the marginals of the copula give no information.

For example, the bivariate Gaussian copula is defined as

c(u1, u2; ρ) = Φ_ρ(Φ^{−1}(u1), Φ^{−1}(u2)).

If u1, u2 are independent and uniformly distributed, the inverse CDF Φ^{−1} of the standard normal transforms (u1, u2) to independent normals. The CDF Φ_ρ of the bivariate Gaussian distribution, with mean zero and Pearson correlation ρ, squashes the transformed values back to the unit square. Thus the Gaussian copula directly correlates u1 and u2 with the Pearson correlation parameter ρ.

2.2.1 Vine copulas

It is difficult to specify a copula. We must find a family of distributions that is easy to compute with and able to express a broad range of dependencies.
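Before surveying richer families, the bivariate Gaussian copula defined above makes a quick numerical sanity check (a sketch with our own helper names, using scipy; not the paper's code): sampling the copula and inspecting its marginals confirms that the probability integral transform leaves uniform marginals while retaining the dependence.

```python
# Minimal illustration of the bivariate Gaussian copula (our own sketch,
# not the paper's implementation). Draw correlated normals, then apply the
# probability integral transform so each marginal becomes Uniform(0, 1).
import numpy as np
from scipy.stats import norm

def sample_gaussian_copula(rho, n, rng):
    """Draw n samples (u1, u2) from the Gaussian copula with correlation rho."""
    cov = np.array([[1.0, rho], [rho, 1.0]])
    x = rng.multivariate_normal(np.zeros(2), cov, size=n)
    return norm.cdf(x)  # probability integral transform

rng = np.random.default_rng(0)
u = sample_gaussian_copula(rho=0.8, n=50_000, rng=rng)

# Marginals are (approximately) uniform: mean 1/2, variance 1/12.
print(u.mean(axis=0))
# Dependence survives: u1 and u2 remain strongly correlated.
print(np.corrcoef(u.T)[0, 1])
```

The empirical means land near 0.5 and the correlation of the transformed pair stays close to the latent Pearson ρ, which is exactly the "only and all of the dependencies" property described above.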
Much work focuses on two-dimensional copulas, such as the Student-t, Clayton, Gumbel, Frank, and Joe copulas [14]. However, their multivariate extensions do not flexibly model dependencies in higher dimensions [4]. Rather, a successful approach in the recent literature combines sets of conditional bivariate copulas; the resulting joint is called a vine [10, 13].

A vine V factorizes a copula density c(u1, . . . , ud) into a product of conditional bivariate copulas, also called pair copulas. This makes it easy to specify a high-dimensional copula. One need only express the dependence for each pair of random variables conditioned on a subset of the others.

Figure 2 is an example of a vine which factorizes a 4-dimensional copula into the product of 6 pair copulas. The first tree T1 has nodes 1, 2, 3, 4 representing the random variables u1, u2, u3, u4 respectively. An edge corresponds to a pair copula, e.g., 1,4 symbolizes c(u1, u4). Edges in T1 collapse into nodes in the next tree T2, and edges in T2 correspond to conditional bivariate copulas, e.g., 1,2|3 symbolizes c(u1, u2 | u3). This proceeds to the last nested tree T3, where 2,4|13 symbolizes c(u2, u4 | u1, u3). The vine structure specifies a complete factorization of the multivariate copula, and each pair copula can be of a different family with its own set of parameters:

c(u1, u2, u3, u4) = [c(u1, u3) c(u2, u3) c(u3, u4)] [c(u1, u2 | u3) c(u1, u4 | u3)] [c(u2, u4 | u1, u3)].

¹ We overload the notation for the marginal CDF Q to depend on the names of the argument, though we occasionally use Qi(zi) when more clarity is needed. This is analogous to the standard convention of overloading the probability density function q(·).

Formally, a vine is a nested set of trees V = {T1, . . . , T_{d−1}} with the following properties:

1. Tree Tj = {Nj, Ej} has d + 1 − j nodes and d − j edges.
2. Edges in the jth tree Ej are the nodes in the (j+1)th tree N_{j+1}.
3. Two nodes in tree T_{j+1} are joined by an edge only if the corresponding edges in tree Tj share a node.

Each edge e in the nested set of trees {T1, . . . , T_{d−1}} specifies a different pair copula, and the product over all edges comprises a factorization of the copula density. Since there are a total of d(d − 1)/2 edges, V factorizes c(u1, . . . , ud) as the product of d(d − 1)/2 pair copulas.

Each edge e(i, k) ∈ Tj has a conditioning set D(e), which is a set of variable indices 1, . . . , d. We define c_{ik|D(e)} to be the bivariate copula density for ui and uk given its conditioning set:

c_{ik|D(e)} = c( Q(ui | uj : j ∈ D(e)), Q(uk | uj : j ∈ D(e)) | uj : j ∈ D(e) ).   (1)

Both the copula and the CDFs in its arguments are conditional on D(e). A vine specifies a factorization of the copula, which is a product over all edges in the d − 1 levels:

c(u1, . . . , ud; η) = ∏_{j=1}^{d−1} ∏_{e(i,k) ∈ Ej} c_{ik|D(e)}.   (2)

We highlight that c depends on η, the set of all parameters to the pair copulas. The vine construction provides us with the flexibility to model dependencies in high dimensions using a decomposition of pair copulas which are easier to estimate. As we shall see, the construction also leads to efficient stochastic gradients by taking individual (and thus easy) gradients on each pair copula.

3 Copula variational inference

We now introduce copula variational inference (copula vi), our method for performing accurate and scalable variational inference.
For simplicity, consider the mean-field factorization augmented with a copula (we later extend to structured factorizations). The copula-augmented variational family is

q(z; λ, η) = [ ∏_{i=1}^d q(zi; λ) ] c(Q(z1; λ), . . . , Q(zd; λ); η),   (3)

where the bracketed product is the mean-field factor, c is the copula, λ denotes the mean-field parameters, and η the copula parameters. With this family, we maximize the augmented elbo,

L(λ, η) = E_{q(z;λ,η)}[log p(x, z)] − E_{q(z;λ,η)}[log q(z; λ, η)].

Copula vi alternates between two steps: 1) fix the copula parameters η and solve for the mean-field parameters λ; and 2) fix the mean-field parameters λ and solve for the copula parameters η. This generalizes the mean-field approximation, which is the special case of initializing the copula to be uniform and stopping after the first step. We apply stochastic approximations [18] for each step, with gradients derived in the next section. We set the learning rate ρt ∈ R to satisfy a Robbins-Monro schedule, i.e., ∑_{t=1}^∞ ρt = ∞ and ∑_{t=1}^∞ ρt² < ∞. A summary is outlined in Algorithm 1.

This alternating set of optimizations falls in the class of minorize-maximization methods, which includes many procedures such as the EM algorithm [1], the alternating least squares algorithm, and the iterative procedure for the generalized method of moments.
Each step of copula vi monotonically increases the objective function and therefore better approximates the posterior distribution.

Algorithm 1: Copula variational inference (copula vi)

Input: Data x, model p(x, z), variational family q.
Initialize λ randomly, and η so that c is uniform.
while change in elbo is above some threshold do
    // Fix η, maximize over λ.
    Set iteration counter t = 1.
    while not converged do
        Draw sample u ∼ Unif([0, 1]^d).
        Update λ = λ + ρt ∇λ L. (Eq. 5, Eq. 6)
        Increment t.
    end
    // Fix λ, maximize over η.
    Set iteration counter t = 1.
    while not converged do
        Draw sample u ∼ Unif([0, 1]^d).
        Update η = η + ρt ∇η L. (Eq. 7)
        Increment t.
    end
end
Output: Variational parameters (λ, η).

Copula vi has the same generic input requirements as black box variational inference [15]: the user need only specify the joint model p(x, z) in order to perform inference. Further, copula variational inference easily extends to the case when the original variational family uses a structured factorization. By the vine construction, one simply fixes the pair copulas corresponding to pre-existing dependencies in the factorization to be the independence copula. This restricts the copula to model dependence only where it does not already exist.

Throughout the optimization, we assume that the tree structure and copula families are given and fixed. We note, however, that these can be learned. In our study, we learn the tree structure using sequential tree selection [2] and learn the families, among a choice of 16 bivariate families, through Bayesian model selection [6] (see supplement).
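To make Algorithm 1's alternation concrete, here is a minimal, self-contained sketch on the toy problem of Figure 1 (all helper names are ours; each inner stochastic-gradient loop is collapsed to a closed-form moment-matching optimum, so this illustrates the two steps rather than reproducing the paper's implementation):

```python
# Schematic of the two alternating steps on a correlated bivariate Gaussian
# "posterior" (our own illustration; closed-form fits replace the paper's
# stochastic gradient updates).
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)
true_cov = np.array([[1.0, 0.9], [0.9, 2.0]])       # ground-truth posterior covariance
z = rng.multivariate_normal([0.0, 0.0], true_cov, size=20_000)

# Step 1 (fix copula, fit mean-field): independent 1-D Gaussian marginals.
lam_mean, lam_sd = z.mean(axis=0), z.std(axis=0)

# Step 2 (fix mean-field, fit copula): with marginals fixed, estimate the
# Gaussian-copula correlation from the probability integral transforms.
u = norm.cdf((z - lam_mean) / lam_sd)               # approximately uniform marginals
eta_rho = np.corrcoef(norm.ppf(u).T)[0, 1]          # copula correlation parameter

# The mean-field step alone forgets the dependence; the copula step restores it.
mf_cov = np.diag(lam_sd**2)
cvi_cov = np.diag(lam_sd) @ np.array([[1.0, eta_rho], [eta_rho, 1.0]]) @ np.diag(lam_sd)
print(np.round(mf_cov, 2))    # off-diagonals are exactly zero
print(np.round(cvi_cov, 2))   # off-diagonals near 0.9, as in true_cov
```

As in Figure 1, the mean-field step recovers the marginals and the copula step recovers the ellipse's orientation; further alternation would refine both.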
In preliminary studies, we have found that re-selecting the tree structure and copula families does not significantly change them in later iterations.

3.1 Stochastic gradients of the elbo

To perform stochastic optimization, we require stochastic gradients of the elbo with respect to both the mean-field and copula parameters. The copula vi objective leads to efficient stochastic gradients with low variance.

We first derive the gradient with respect to the mean-field parameters. In general, we can apply the score function estimator [15], which leads to the gradient

∇λL = E_{q(z;λ,η)}[∇λ log q(z; λ, η) · (log p(x, z) − log q(z; λ, η))].   (4)

We follow noisy unbiased estimates of this gradient by sampling from q(·) and evaluating the inner expression. We apply this gradient for discrete latent variables.

When the latent variables z are differentiable, we use the reparameterization trick [17] to take advantage of first-order information from the model, i.e., ∇z log p(x, z). Specifically, we rewrite the expectation in terms of a random variable u such that its distribution s(u) does not depend on the variational parameters and such that the latent variables are a deterministic function of u and the mean-field parameters, z = z(u; λ). Following this reparameterization, the gradients propagate inside the expectation,

∇λL = E_{s(u)}[(∇z log p(x, z) − ∇z log q(z; λ, η)) ∇λ z(u; λ)].   (5)

This estimator reduces the variance of the stochastic gradients [17]. Furthermore, with a copula variational family, this type of reparameterization, using a uniform random variable u and a deterministic function z = z(u; λ, η), is always possible.
(See the supplement.)

The reparameterized gradient (Eq. 5) requires calculating the terms ∇_{zi} log q(z; λ, η) and ∇_{λi} z(u; λ, η) for each i. The latter is tractable and derived in the supplement; the former decomposes as

∇_{zi} log q(z; λ, η) = ∇_{zi} log q(zi; λi) + ∇_{Q(zi;λi)} log c(Q(z1; λ1), . . . , Q(zd; λd); η) ∇_{zi} Q(zi; λi)
                     = ∇_{zi} log q(zi; λi) + q(zi; λi) ∑_{j=1}^{d−1} ∑_{e(k,ℓ) ∈ Ej : i ∈ {k,ℓ}} ∇_{Q(zi;λi)} log c_{kℓ|D(e)}.   (6)

The summation in Eq. 6 is over all pair copulas which contain Q(zi; λi) as an argument. In other words, the gradient of a latent variable zi is evaluated over both the marginal q(zi) and all pair copulas which model correlation between zi and any other latent variable zj. A similar derivation holds for calculating terms in the score function estimator.

We now turn to the gradient with respect to the copula parameters. We consider copulas which are differentiable with respect to their parameters. This enables an efficient reparameterized gradient

∇ηL = E_{s(u)}[(∇z log p(x, z) − ∇z log q(z; λ, η)) ∇η z(u; λ, η)].   (7)

The requirements are the same as for the mean-field parameters.

Finally, we note that the only requirement on the model is the gradient ∇z log p(x, z). This can be calculated using automatic differentiation tools [22]. Thus copula vi can be implemented in a library and applied without requiring any manual derivations from the user.

3.2 Computational complexity

In the vine factorization of the copula, there are d(d − 1)/2 pair copulas, where d is the number of latent variables. Thus stochastic gradients of the mean-field parameters λ and copula parameters η require O(d²) complexity.
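These pair-copula counts are easy to verify directly (a small self-contained check with a helper name of our own): a full vine on d variables has d(d − 1)/2 pair copulas, while keeping only the top K tree levels leaves on the order of Kd of them.

```python
# Sanity check of the pair-copula counts behind the complexity claims
# (our own illustration). Tree T_j of a vine on d variables has d - j edges,
# i.e. d - j pair copulas.
def n_pair_copulas(d, levels=None):
    """Number of pair copulas in a vine on d variables, optionally truncated
    to the first `levels` trees."""
    levels = (d - 1) if levels is None else min(levels, d - 1)
    return sum(d - j for j in range(1, levels + 1))

print(n_pair_copulas(4))              # 6, matching the Figure 2 example
print(n_pair_copulas(100))            # 4950 = 100*99/2: quadratic in d
print(n_pair_copulas(100, levels=2))  # 197: linear in d for fixed K
```

For fixed K the truncated count grows linearly in d, which is the O(Kd) behavior exploited below.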
More generally, one can apply a low-rank approximation to the copula by truncating the number of levels in the vine (see Figure 2). This reduces the number of pair copulas to Kd for some K > 0, and leads to a computational complexity of O(Kd).

Using sequential tree selection for learning the vine structure [2], the most correlated variables are at the highest level of the vine. Thus a truncated low-rank copula only forgets the weakest correlations. This generalizes low-rank Gaussian approximations, which also have O(Kd) complexity [20]: it is the special case when the mean-field distribution is the product of independent Gaussians and each pair copula is a Gaussian copula.

3.3 Related work

Preserving structure in variational inference was first studied by Saul and Jordan [19] in the case of probabilistic neural networks. It has been revisited recently for the case of conditionally conjugate exponential families [8]. Our work differs from this line in that we learn the dependency structure during inference, and thus we do not require explicit knowledge of the model. Further, our augmentation strategy applies more broadly, to any posterior distribution and any factorized variational family, and thus it generalizes these approaches.

A similar augmentation strategy is higher-order mean-field methods, which are a Taylor series correction based on the difference between the posterior and its mean-field approximation [11]. Recently, Giordano et al. [5] consider a covariance correction from the mean-field estimates. All these methods assume the mean-field approximation is reliable for the Taylor series expansion to make sense, which is not true in general, and thus they are not robust in a black box framework.
Our approach alternates the estimation of the mean-field and copula, which we find empirically leads to more robust estimates than estimating them simultaneously, and which is less sensitive to the quality of the mean-field approximation.

Figure 3: Covariance estimates from copula variational inference (copula vi), mean-field (mf), and linear response variational Bayes (lrvb), compared to the ground truth (Gibbs samples). Copula vi and lrvb effectively capture dependence, while mf underestimates variance and forgets covariances.

4 Experiments

We study copula vi with two models: Gaussian mixtures and the latent space model [7]. The Gaussian mixture is a classical example of a model for which it is difficult to capture posterior dependencies. The latent space model is a modern Bayesian model for which the mean-field approximation gives poor estimates of the posterior, and where modeling posterior dependencies is crucial for uncovering patterns in the data.

There are several implementation details of copula vi. At each iteration, we form a stochastic gradient by generating m samples from the variational distribution and taking the average gradient. We set m = 1024 and follow asynchronous updates [16]. We set the step size using ADAM [12].

4.1 Mixture of Gaussians

We follow the goal of Giordano et al. [5], which is to estimate the posterior covariance for a Gaussian mixture. The hidden variables are a K-vector of mixture proportions π and a set of K P-dimensional multivariate normals N(μk, Λk^{−1}), each with unknown mean μk (a P-vector) and P × P precision matrix Λk. In a mixture of Gaussians, the joint probability is

p(x, z, μ, Λ^{−1}, π) = p(π) ∏_{k=1}^K p(μk, Λk^{−1}) ∏_{n=1}^N p(xn | zn, μ_{zn}, Λ_{zn}^{−1}) p(zn | π),

with a Dirichlet prior p(π) and a normal-Wishart prior p(μk, Λk^{−1}).

We first apply the mean-field approximation (mf), which assigns independent factors to μ, π, Λ, and z. We then perform copula vi over the copula-augmented mean-field distribution, i.e., one which includes pair copulas over the latent variables. We also compare our results to linear response variational Bayes (lrvb) [5], which is a post-hoc correction technique for covariance estimation in variational inference. Higher-order mean-field methods demonstrate similar behavior to lrvb. Comparisons to structured approximations are omitted, as they require explicit factorizations and are not black box. Standard black box variational inference [15] corresponds to the mf approximation.

We simulate 10,000 samples with K = 2 components and P = 2 dimensional Gaussians. Figure 3 displays estimates for the standard deviations of Λ over 100 simulations, and plots them against the ground truth using 500 effective Gibbs samples. The second plot displays all off-diagonal covariance estimates. Estimates for μ and π exhibit the same pattern and are given in the supplement.

When initializing at the true mean-field parameters, both copula vi and lrvb achieve consistent estimates of the posterior variance. mf underestimates the variance, which is a well-known limitation [25]. Note that because the mf estimates are initialized at the truth, copula vi converges to the true posterior upon one step of fitting the copula.
It does not require alternating more steps.

Copula vi is more robust than lrvb. As a toy demonstration, we analyze the MNIST data set of handwritten digits, using 12,665 training examples and 2,115 test examples of 0's and 1's. We perform "unsupervised" classification, i.e., we classify without using training labels: we apply a mixture of Gaussians to cluster, and then classify a digit based on its membership assignment. Copula vi reports a test set error rate of 0.06, whereas lrvb ranges between 0.06 and 0.32 depending on the mean-field estimates. lrvb and similar higher-order mean-field methods correct an existing mf solution; they are thus sensitive to local optima and the general quality of that solution. On the other hand, copula vi re-adjusts both the mf and copula parameters as it fits, making it more robust to initialization.

Table 1: Predictive likelihood on the latent space model. Each copula vi step refits either the mean-field or the copula. Copula vi converges in roughly 10 steps and already significantly outperforms both mean-field and lrvb upon fitting the copula once (2 steps).

Variational inference method   Predictive likelihood   Runtime
Mean-field                     −383.2                  15 min.
lrvb                           −330.5                  38 min.
copula vi (2 steps)            −303.2                  32 min.
copula vi (5 steps)            −80.2                   1 hr. 17 min.
copula vi (converged)          −50.5                   2 hr.

4.2 Latent space model

We next study inference on the latent space model [7], a Bernoulli latent factor model for network analysis. Each node in an N-node network is associated with a P-dimensional latent variable z ∼ N(μ, Λ^{−1}).
Edges between pairs of nodes are observed with high probability if the nodes are close to each other in the latent space. Formally, an edge for each pair (i, j) is observed with probability p, where logit(p) = θ − |zi − zj| and θ is a model parameter.

We generate an N = 100,000 node network with latent node attributes from a P = 10 dimensional Gaussian. We learn the posterior of the latent attributes in order to predict the likelihood of held-out edges. mf applies independent factors on μ, Λ, θ, and z; lrvb applies a correction; and copula vi uses the fully dependent variational distribution. Table 1 displays the likelihood of held-out edges and the runtime. We also attempted Hamiltonian Monte Carlo, but it did not converge after five hours.

Copula vi dominates the other methods in accuracy upon convergence, and the copula estimation without refitting (2 steps) already dominates lrvb in both runtime and accuracy. We note, however, that lrvb requires inverting an O(NK³) × O(NK³) matrix. That method could be scaled further, achieving faster estimates than copula vi, by applying stochastic approximations for the inversion. However, copula vi always outperforms lrvb and is still fast on this 100,000 node network.

5 Conclusion

We developed copula variational inference (copula vi). Copula vi is a new variational inference algorithm that augments the mean-field variational distribution with a copula; it captures posterior dependencies among the latent variables. We derived a scalable and generic algorithm for performing inference with this expressive variational distribution.
We found that copula vi significantly reduces the bias of the mean-field approximation, better estimates the posterior variance, and is more accurate than other forms of capturing posterior dependency in variational approximations.

Acknowledgments

We thank Luke Bornn, Robin Gong, and Alp Kucukelbir for their insightful comments. This work is supported by NSF IIS-0745520, IIS-1247664, IIS-1009542, ONR N00014-11-1-0651, DARPA FA8750-14-2-0009, N66001-15-C-4032, Facebook, Adobe, Amazon, and the John Templeton Foundation.

References

[1] Dempster, A. P., Laird, N. M., and Rubin, D. B. (1977). Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B, 39(1).
[2] Dissmann, J., Brechmann, E. C., Czado, C., and Kurowicka, D. (2012). Selecting and estimating regular vine copulae and application to financial returns. arXiv preprint arXiv:1202.2002.
[3] Fréchet, M. (1960). Les tableaux dont les marges sont données. Trabajos de estadística, 11(1):3–18.
[4] Genest, C., Gerber, H. U., Goovaerts, M. J., and Laeven, R. (2009). Editorial to the special issue on modeling and measurement of multivariate risk in insurance and finance. Insurance: Mathematics and Economics, 44(2):143–145.
[5] Giordano, R., Broderick, T., and Jordan, M. I. (2015). Linear response methods for accurate covariance estimates from mean field variational Bayes. In Neural Information Processing Systems.
[6] Gruber, L. and Czado, C. (2015). Sequential Bayesian model selection of regular vine copulas. International Society for Bayesian Analysis.
[7] Hoff, P. D., Raftery, A. E., and Handcock, M. S. (2001). Latent space approaches to social network analysis. Journal of the American Statistical Association, 97:1090–1098.
[8] Hoffman, M. D. and Blei, D. M. (2015). Structured stochastic variational inference. In Artificial Intelligence and Statistics.
[9] Hoffman, M. D., Blei, D. M., Wang, C., and Paisley, J. (2013). Stochastic variational inference. Journal of Machine Learning Research, 14:1303–1347.
[10] Joe, H. (1996). Families of m-variate distributions with given margins and m(m − 1)/2 bivariate dependence parameters, pages 120–141. Institute of Mathematical Statistics.
[11] Kappen, H. J. and Wiegerinck, W. (2001). Second order approximations for probability models. In Neural Information Processing Systems.
[12] Kingma, D. P. and Ba, J. L. (2015). Adam: A method for stochastic optimization. In International Conference on Learning Representations.
[13] Kurowicka, D. and Cooke, R. M. (2006). Uncertainty Analysis with High Dimensional Dependence Modelling. Wiley, New York.
[14] Nelsen, R. B. (2006). An Introduction to Copulas (Springer Series in Statistics). Springer-Verlag New York, Inc.
[15] Ranganath, R., Gerrish, S., and Blei, D. M. (2014). Black box variational inference. In Artificial Intelligence and Statistics, pages 814–822.
[16] Recht, B., Re, C., Wright, S., and Niu, F. (2011). Hogwild: A lock-free approach to parallelizing stochastic gradient descent. In Advances in Neural Information Processing Systems, pages 693–701.
[17] Rezende, D. J., Mohamed, S., and Wierstra, D. (2014). Stochastic backpropagation and approximate inference in deep generative models. In International Conference on Machine Learning.
[18] Robbins, H. and Monro, S. (1951). A stochastic approximation method. The Annals of Mathematical Statistics, 22(3):400–407.
[19] Saul, L. and Jordan, M. I. (1995). Exploiting tractable substructures in intractable networks. In Neural Information Processing Systems, pages 486–492.
[20] Seeger, M. (2010). Gaussian covariance and scalable variational inference. In International Conference on Machine Learning.
[21] Sklar, A. (1959). Fonctions de répartition à n dimensions et leurs marges. Publications de l'Institut de Statistique de l'Université de Paris, 8:229–231.
[22] Stan Development Team (2015). Stan: A C++ library for probability and sampling, version 2.8.0.
[23] Toulis, P. and Airoldi, E. M. (2014). Implicit stochastic gradient descent. arXiv preprint arXiv:1408.2923.
[24] Tran, D., Toulis, P., and Airoldi, E. M. (2015). Stochastic gradient descent methods for estimation with large data sets. arXiv preprint arXiv:1509.06459.
[25] Wainwright, M. J. and Jordan, M. I. (2008). Graphical models, exponential families, and variational inference. Foundations and Trends in Machine Learning, 1(1-2):1–305.
", "award": [], "sourceid": 1959, "authors": [{"given_name": "Dustin", "family_name": "Tran", "institution": "Harvard University"}, {"given_name": "David", "family_name": "Blei", "institution": "Columbia University"}, {"given_name": "Edo", "family_name": "Airoldi", "institution": "Harvard University"}]}