{"title": "Flexible sampling of discrete data correlations without the marginal distributions", "book": "Advances in Neural Information Processing Systems", "page_first": 2517, "page_last": 2525, "abstract": "Learning the joint dependence of discrete variables is a fundamental problem in machine learning, with many applications including prediction, clustering and dimensionality reduction. More recently, the framework of copula modeling has gained popularity due to its modular parametrization of joint distributions. Among other properties, copulas provide a recipe for combining flexible models for univariate marginal distributions with parametric families suitable for potentially high dimensional dependence structures. More radically, the extended rank likelihood approach of Hoff (2007) bypasses learning marginal models completely when such information is ancillary to the learning task at hand as in, e.g., standard dimensionality reduction problems or copula parameter estimation. The main idea is to represent data by their observable rank statistics, ignoring any other information from the marginals. Inference is typically done in a Bayesian framework with Gaussian copulas, and it is complicated by the fact this implies sampling within a space where the number of constraints increase quadratically with the number of data points. The result is slow mixing when using off-the-shelf Gibbs sampling. 
We present an efficient algorithm based on recent advances in constrained Hamiltonian Markov chain Monte Carlo that is simple to implement and does not require paying for a quadratic cost in sample size.", "full_text": "Flexible sampling of discrete data correlations without the marginal distributions\n\nAlfredo Kalaitzis\nDepartment of Statistical Science and CSML\nUniversity College London\na.kalaitzis@ucl.ac.uk\n\nRicardo Silva\nDepartment of Statistical Science and CSML\nUniversity College London\nricardo@stats.ucl.ac.uk\n\nAbstract\n\nLearning the joint dependence of discrete variables is a fundamental problem in machine learning, with many applications including prediction, clustering and dimensionality reduction. More recently, the framework of copula modeling has gained popularity due to its modular parameterization of joint distributions. Among other properties, copulas provide a recipe for combining flexible models for univariate marginal distributions with parametric families suitable for potentially high dimensional dependence structures. More radically, the extended rank likelihood approach of Hoff (2007) bypasses learning marginal models completely when such information is ancillary to the learning task at hand as in, e.g., standard dimensionality reduction problems or copula parameter estimation. The main idea is to represent data by their observable rank statistics, ignoring any other information from the marginals. Inference is typically done in a Bayesian framework with Gaussian copulas, and it is complicated by the fact that this implies sampling within a space where the number of constraints increases quadratically with the number of data points. The result is slow mixing when using off-the-shelf Gibbs sampling. 
We present an efficient algorithm based on recent advances in constrained Hamiltonian Markov chain Monte Carlo that is simple to implement and does not require paying for a quadratic cost in sample size.\n\n1 Contribution\n\nThere are many ways of constructing multivariate discrete distributions: from full contingency tables in the small dimensional case [1], to structured models given by sparsity constraints [11] and (hierarchies of) latent variable models [6]. More recently, the idea of copula modeling [16] has been combined with such standard building blocks. Our contribution is a novel algorithm for efficient Markov chain Monte Carlo (MCMC) for the copula framework introduced by [7], extending algorithmic ideas introduced by [17].\nA copula is a continuous cumulative distribution function (CDF) with uniformly distributed univariate marginals in the unit interval [0, 1]. It complements graphical models and other formalisms that provide a modular parameterization of joint distributions. The core idea is simple and given by the following observation: suppose we are given a (say) bivariate CDF F(y1, y2) with marginals F1(y1) and F2(y2). This CDF can then be rewritten as F(F1^{-1}(F1(y1)), F2^{-1}(F2(y2))). The function C(·, ·) given by F(F1^{-1}(·), F2^{-1}(·)) is a copula. For discrete distributions, this decomposition is not unique but still well-defined [16]. Copulas have found numerous applications in statistics and machine learning since they provide a way of constructing flexible multivariate distributions by mix-and-matching different copulas with different univariate marginals. For instance, one can combine flexible univariate marginals Fi(·) with useful but more constrained high-dimensional copulas. We will not further motivate the use of copula models, which has been discussed at length in recent machine learning publications and conference workshops, and for which comprehensive textbooks exist [e.g., 9]. For a recent discussion on the applications of copulas from a machine learning perspective, [4] provides an overview. [10] is an early reference in machine learning. The core idea dates back at least to the 1950s [16].\nIn the discrete case, copulas can be difficult to apply: transforming a copula CDF into a probability mass function (PMF) is computationally intractable in general. For the continuous case, a common trick goes as follows: transform variables by defining ai ≡ F̂i(yi) for an estimate of Fi(·) and then fit a copula density c(·, . . . , ·) to the resulting ai [e.g. 9]. It is not hard to check that this breaks down in the discrete case [7]. An alternative is to represent the CDF-to-PMF transformation for each data point by a continuous integral on a bounded space. Sampling methods can then be used. This trick has allowed many applications of the Gaussian copula to discrete domains. Readers familiar with probit models will recognize the similarities to models where an underlying latent Gaussian field is discretized into observable integers, as in Gaussian process classifiers and ordinal regression [18]. Such models can be indirectly interpreted as special cases of the Gaussian copula.\nIn what follows, we describe in Section 2 the Gaussian copula and the general framework for constructing Bayesian estimators of Gaussian copulas by [7], the extended rank likelihood framework. This framework entails computational issues which are discussed. 
A recent general approach for MCMC in constrained Gaussian fields by [17] can in principle be directly applied to this problem as a black box, but at a cost that scales quadratically in sample size, and as such it is not practical in general. Our key contribution is given in Section 4. An application experiment on the Bayesian Gaussian copula factor model is performed in Section 5. Conclusions are discussed in the final section.\n\n2 Gaussian copulas and the extended rank likelihood\n\nIt is not hard to see that any multivariate Gaussian copula is fully defined by a correlation matrix C, since marginal distributions have no free parameters. In practice, the following equivalent generative model is used to define a sample U according to a Gaussian copula GC(C):\n\n1. Sample Z from a zero mean Gaussian with covariance matrix C\n2. For each Zj, set Uj = Φ(zj), where Φ(·) is the CDF of the standard Gaussian\n\nIt is clear that each Uj follows a uniform distribution in [0, 1]. To obtain a model for variables {y1, y2, . . . , yp} with marginal distributions Fj(·) and copula GC(C), one can add the deterministic step yj = Fj^{-1}(uj). Now, given n samples of observed data Y ≡ {y^(1)_1, . . . , y^(1)_p, y^(2)_1, . . . , y^(n)_p}, one is interested in inferring C via a Bayesian approach and the posterior distribution\n\np(C, θF | Y) ∝ pGC(Y | C, θF) π(C, θF)\n\nwhere π(·) is a prior distribution, θF are marginal parameters for each Fj(·), which in general might need to be marginalized since they will be unknown, and pGC(·) is the PMF of a (here discrete) distribution with a Gaussian copula and marginals given by θF.\nLet Z be the underlying latent Gaussians of the corresponding copula for dataset Y. 
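To make the two-step generative model above concrete, here is a minimal illustrative sketch (not code from the paper; the function name and the example correlation matrix are our own choices), using NumPy and SciPy:

```python
import numpy as np
from scipy.stats import norm

def sample_gaussian_copula(C, n, seed=None):
    """Draw n samples U ~ GC(C): first Z ~ N(0, C), then U_j = Phi(Z_j)."""
    rng = np.random.default_rng(seed)
    Z = rng.multivariate_normal(np.zeros(C.shape[0]), C, size=n)
    return norm.cdf(Z)  # each column is marginally uniform on [0, 1]

C = np.array([[1.0, 0.7],
              [0.7, 1.0]])
U = sample_gaussian_copula(C, n=5000, seed=0)
# a discrete model follows via y_j = F_j^{-1}(u_j), e.g. Bernoulli(0.5) marginals:
Y = (U > 0.5).astype(int)
```

The deterministic step y_j = F_j^{-1}(u_j) in the last line is what introduces the ties and jumps that make the discrete case non-invertible.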
Although Y is a deterministic function of Z, this mapping is not invertible due to the discreteness of the distribution: each marginal Fj(·) has jumps. Instead, the reverse mapping only enforces the constraints where y^(i1)_j < y^(i2)_j implies z^(i1)_j < z^(i2)_j. Based on this observation, [7] considers the event Z ∈ D(y), where D(y) is the set of values of Z in R^{n×p} obeying those constraints, that is\n\nD(y) ≡ { Z ∈ R^{n×p} : max{ z^(k)_j s.t. y^(k)_j < y^(i)_j } < z^(i)_j < min{ z^(k)_j s.t. y^(i)_j < y^(k)_j }, for all i and j }.\n\nSince {Y = y} ⇒ Z(y) ∈ D(y), we have\n\npGC(Y | C, θF) = pGC(Z ∈ D(y), Y | C, θF) = pN(Z ∈ D(y) | C) × pGC(Y | Z ∈ D(y), C, θF),   (1)\n\nthe first factor of the last line being that of a zero-mean Gaussian density function marginalized over D(y).\nThe extended rank likelihood is defined by the first factor of (1). With this likelihood, inference for C is given simply by marginalizing\n\np(C, Z | Y) ∝ I(Z ∈ D(y)) pN(Z | C) π(C),   (2)\n\nthe first factor of the right-hand side being the usual binary indicator function.\nStrictly speaking, this is not a fully Bayesian method since partial information on the marginals is ignored. Nevertheless, it is possible to show that under some mild conditions there is information in the extended rank likelihood to consistently estimate C [13]. 
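In code, the interval that D(y) imposes on a single latent entry is easy to read off: Z_{i,j} must lie above every latent value in the same column whose observed level is lower, and below every one whose level is higher. A small sketch (ours, not the paper's; NumPy assumed):

```python
import numpy as np

def truncation_bounds(z_col, y_col, i):
    """Bounds on z_col[i] implied by the event Z in D(y), for one column."""
    below = z_col[y_col < y_col[i]]   # latent values of strictly lower levels
    above = z_col[y_col > y_col[i]]   # latent values of strictly higher levels
    lower = below.max() if below.size else -np.inf
    upper = above.min() if above.size else np.inf
    return lower, upper

y = np.array([0, 1, 1, 2, 0])
z = np.array([-1.0, 0.2, 0.5, 1.3, -0.4])
lo, hi = truncation_bounds(z, y, i=1)   # -> (-0.4, 1.3)
```

This is exactly the interval used by the univariate Gibbs step; the HMC sampler discussed later treats all such constraints for a column jointly instead.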
It has two important properties: first, in many applications where marginal distributions are nuisance parameters, this sidesteps any major assumptions about the shape of {Fi(·)} – applications include learning the degree of dependence among variables (e.g., to understand relationships between social indicators as in [7] and [13]) and copula-based dimensionality reduction (a generalization of correlation-based principal component analysis, e.g., [5]); second, MCMC inference in the extended rank likelihood is conceptually simpler than with the joint likelihood, since dropping marginal models will remove complicated entanglements between C and θF. Therefore, even if θF is necessary (when, for instance, predicting missing values of Y), an estimate of C can be computed separately and will not depend on the choice of estimator for {Fi(·)}. The standard model with a full correlation matrix C can be further refined to take into account structure implied by sparse inverse correlation matrices [2] or low rank decompositions via higher-order latent variable models [13], among others. We explore the latter case in Section 5.\nAn off-the-shelf algorithm for sampling from (2) is full Gibbs sampling: first, given Z, the (full or structured) correlation matrix C can be sampled by standard methods. More to the point, sampling Z is straightforward if for each variable j and data point i we sample Z^(i)_j conditioned on all other variables. The corresponding distribution is a univariate truncated Gaussian. This is the approach used originally by Hoff. However, mixing can be severely compromised by the sampling of Z, and that is where novel sampling methods can facilitate inference.\n\n3 Exact HMC for truncated Gaussian distributions\n\nHoff's algorithm modifies the positions of all Z^(i)_j associated with a particular discrete value of Yj, conditioned on the remaining points. As the number of data points increases, the spread of the hard boundaries on Z^(i)_j, given by data points of Zj associated with other levels of Yj, increases. This reduces the space in which variables Z^(i)_j can move at a time.\nTo improve the mixing, we aim to sample from the joint Gaussian distribution of all latent variables Z^(i)_j, i = 1 . . . n, conditioned on other columns of the data, such that the constraints between them are satisfied and thus the ordering in the observation level is conserved. Standard Gibbs approaches for sampling from truncated Gaussians reduce the problem to sampling from univariate truncated Gaussians. Even though each step is computationally simple, mixing can be slow when strong correlations are induced by very tight truncation bounds.\nIn the following, we briefly describe the methodology recently introduced by [17] that deals with the problem of sampling from a density p(x) with log p(x) ∝ −(1/2) x^T M x + r^T x, where x, r ∈ R^n and M is positive definite, with linear constraints of the form f_j^T x ≤ g_j, where each f_j ∈ R^n, j = 1 . . . m, is the normal vector to some linear boundary in the sample space.\nLater in this section we shall describe how this framework can be applied to the Gaussian copula extended rank likelihood model. More importantly, the observed rank statistics impose only linear constraints of the form x_{i1} ≤ x_{i2}. We shall describe how this special structure can be exploited to reduce the runtime complexity of the constrained sampler from O(n^2) (in the number of observations) to O(n) in practice.\n\n3.1 Hamiltonian Monte Carlo for the Gaussian distribution\n\nHamiltonian Monte Carlo (HMC) [15] is an MCMC method that extends the sampling space with auxiliary variables so that (ideally) deterministic moves in the joint space bring the sampler to potentially far places in the original variable space. 
Deterministic moves cannot in general be done, but this is possible in the Gaussian case.\nThe form of the Hamiltonian for the general d-dimensional Gaussian case with mean µ and precision matrix M is:\n\nH = (1/2) x^T M x − r^T x + (1/2) s^T M^{-1} s,   (3)\n\nwhere M is also known in the present context as the mass matrix, r = Mµ and s is the velocity. Both x and s are Gaussian distributed, so this Hamiltonian can be seen (up to a constant) as the negative log of the product of two independent Gaussian random variables. The physical interpretation is that of a sum of potential and kinetic energy terms, where the total energy of the system is conserved.\nIn a system where this Hamiltonian function is constant, we can exactly compute its evolution through the pair of differential equations:\n\nẋ = ∇s H = M^{-1} s,   ṡ = −∇x H = −Mx + r.   (4)\n\nThese are solved exactly by x(t) = µ + a sin(t) + b cos(t), where a and b can be identified at initial conditions (t = 0):\n\na = ẋ(0) = M^{-1} s,   b = x(0) − µ.   (5)\n\nTherefore, the exact HMC algorithm can be summarised as follows:\n\n• Initialise the allowed travel time T and some initial position x0.\n• Repeat for HMC samples k = 1 . . . N\n1. Sample sk ∼ N(0, M)\n2. Use sk and xk to update a and b, and store the new position at the end of the trajectory, xk+1 = x(T), as an HMC sample.\n\nIt can be easily shown that the Markov chain of sampled positions has the desired equilibrium distribution N(µ, M^{-1}) [17].\n\n3.2 Sampling with linear constraints\n\nSampling from multivariate Gaussians does not require any method as sophisticated as HMC, but the plot thickens when the target distribution is truncated by linear constraints of the form Fx ≤ g. Here, F ∈ R^{m×n} is a constraint matrix whose every row is the normal vector to a linear boundary in the sample space. This is equivalent to sampling from a Gaussian that is confined in the (not necessarily bounded) convex polyhedron {x : Fx ≤ g}. In general, to remain within the boundaries of each wall, once a new velocity has been sampled one must compute all possible collision times with the walls. The smallest of all collision times signifies the wall that the particle should bounce from at that collision time. Figure 1 illustrates the concept with two simple examples in 2 and 3 dimensions.\nThe collision times can be computed analytically and their equations can be found in the supplementary material. We also point the reader to [17] for a more detailed discussion of this implementation. Once the wall to be hit has been found, the position and velocity at impact time are computed and the velocity is reflected about the boundary normal(1). The constrained HMC sampler is summarized as follows:\n\n• Initialise the allowed travel time T and some initial position x0.\n• Repeat for HMC samples k = 1 . . . N\n1. Sample sk ∼ N(0, M)\n2. Use sk and xk to update a and b.\n3. Reset remaining travel time Tleft ← T. Until no travel time is left or no walls can be reached (no solutions exist), do:\n(a) Compute impact times with all walls and pick the smallest one, th (if a solution exists).\n(b) Compute v(th) and reflect it about the hyperplane fh. This is the updated velocity after impact. The updated position is x(th).\n(c) Tleft ← Tleft − th\n4. Store the new position at the end of the trajectory, xk+1, as an HMC sample.\n\n(1) Also equivalent to transforming the velocity with a Householder reflection matrix about the bounding hyperplane.\n\nFigure 1: Left: Trajectories of the first 40 iterations of the exact HMC sampler on a 2D truncated Gaussian. A reflection of the velocity can clearly be seen when the particle meets wall #2. Here, the constraint matrix F is a 4 × 2 matrix. Center: The same example after 40000 samples. The coloring of each sample indicates its density value. Right: The anatomy of a 3D Gaussian. The walls are now planes and in this case F is a 2 × 3 matrix. Figure best seen in color.\n\nIn general, all walls are candidates for impact, so the runtime of the sampler is linear in m, the number of constraints. This means that the computational load is concentrated in step 3(a). Another consideration is that of the allocated travel time T. Depending on the shape of the bounding polyhedron and the number of walls, a very large travel time can induce many more bounces, thus requiring more computations per sample. On the other hand, a very small travel time explores the distribution more locally, so the mixing of the chain can suffer. 
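As a concrete illustration of these bounce mechanics, the following is a minimal sketch (ours, not the authors' code) of the exact HMC sampler for a one-dimensional standard Gaussian truncated to x ≥ 0: with µ = 0 and M = 1 the trajectory is x(t) = a sin(t) + b cos(t), there is a single wall, and the collision time is available in closed form.

```python
import math, random

def hit_time(a, b, wall, eps=1e-9):
    """Earliest t > eps with a*sin(t) + b*cos(t) = wall, or None if unreachable."""
    R = math.hypot(a, b)
    if R < abs(wall):
        return None
    phi = math.atan2(a, b)                       # x(t) = R * cos(t - phi)
    d = math.acos(max(-1.0, min(1.0, wall / R)))
    times = []
    for t in (phi + d, phi - d):
        t %= 2.0 * math.pi
        if t <= eps:                             # ignore the root at the wall we just left
            t += 2.0 * math.pi
        times.append(t)
    return min(times)

def truncated_hmc(n_samples, wall=0.0, T=math.pi / 2, seed=0):
    """Exact HMC samples from N(0, 1) restricted to x >= wall."""
    rng = random.Random(seed)
    x, out = wall + 1.0, []
    for _ in range(n_samples):
        a, b = rng.gauss(0.0, 1.0), x            # a = initial velocity, b = position
        t_left = T
        while True:
            th = hit_time(a, b, wall)
            if th is None or th >= t_left:
                break
            v = a * math.cos(th) - b * math.sin(th)   # velocity at impact
            a, b, t_left = -v, wall, t_left - th      # reflect, restart the clock
        x = a * math.sin(t_left) + b * math.cos(t_left)
        out.append(x)
    return out

samples = truncated_hmc(20000)
```

Note that with T = π/2 and no wall, x(T) = a = s, i.e., an exact independent draw; the wall only adds reflections, so consecutive samples here remain nearly independent.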
Whether a given travel time is "large" or "small" is relative to the dimensionality, the number of constraints and the structure of the constraints.\nDue to the nature of our problem, the number of constraints, when explicitly expressed as linear functions, is O(n^2). Clearly, this restricts any direct application of the HMC framework for Gaussian copula estimation to small-sample (n) datasets. More importantly, we show how to exploit the structure of the constraints to reduce the number of candidate walls (prior to each bounce) to O(n).\n\n4 HMC for the Gaussian copula extended rank likelihood model\n\nGiven some discrete data Y ∈ R^{n×p}, the task is to infer the correlation matrix of the underlying Gaussian copula. Hoff's sampling algorithm proceeds by alternating between sampling the continuous latent representation Z^(i)_j of each Y^(i)_j, for i = 1 . . . n, j = 1 . . . p, and sampling a covariance matrix from an inverse-Wishart distribution conditioned on the sampled matrix Z ∈ R^{n×p}, which is then renormalized as a correlation matrix.\nFrom here on, we use matrix notation for the samples, as opposed to the random variables – with Zi,j replacing Z^(i)_j, Z:,j being a column of Z, and Z:,\j being the submatrix of Z without the j-th column.\nIn a similar vein to Hoff's sampling algorithm, we replace the successive sampling of each Zi,j conditioned on Zi,\j (a conditional univariate truncated Gaussian) with the simultaneous sampling of Z:,j conditioned on Z:,\j. This is done through an HMC step from a conditional multivariate truncated Gaussian.\nThe added benefit of this HMC step over the standard Gibbs approach is that of a handle for regulating the trade-off between exploration and runtime via the allocated travel time T. 
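The location and scale of that conditional multivariate truncated Gaussian are given by the standard Gaussian conditioning formulas (they reappear in Algorithm 1 below); a minimal NumPy sketch of ours:

```python
import numpy as np

def conditional_params(V, Z, j):
    """Mean vector and scalar variance of Z[:, j] given the other columns,
    for rows of Z i.i.d. Gaussian with covariance V (standard conditioning)."""
    idx = [k for k in range(V.shape[0]) if k != j]
    w = np.linalg.solve(V[np.ix_(idx, idx)], V[idx, j])  # V_{\j,\j}^{-1} V_{\j,j}
    sigma2 = V[j, j] - V[j, idx] @ w
    mu = Z[:, idx] @ w
    return mu, sigma2

V = np.array([[1.0, 0.6],
              [0.6, 1.0]])
Z = np.array([[1.0, 0.0],
              [2.0, 0.0],
              [-1.0, 0.0]])
mu, s2 = conditional_params(V, Z, j=1)   # mu = 0.6 * Z[:, 0], s2 = 0.64
```

The truncation by the rank constraints is then applied on top of this conditional Gaussian by the constrained HMC step.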
Larger travel times potentially allow for larger moves in the sample space, but this comes at a cost, as explained in the sequel.\n\n4.1 The Hough envelope algorithm\n\nThe special structure of constraints. Recall that the number of constraints is quadratic in the dimension of the distribution. This is because every Z sample must satisfy the conditions of the event Z ∈ D(y) of the extended rank likelihood (see Section 2). In other words, for any column Z:,j, all entries are organised into a partition L(j) of |L(j)| levels, the number of unique values observed for the discrete or ordinal variable Y(j). Thereby, for any two adjacent levels lk, lk+1 ∈ L(j) and any pair i1 ∈ lk, i2 ∈ lk+1, it must be true that Z_{i1,j} < Z_{i2,j}. Equivalently, a constraint f exists where f_{i1} = 1, f_{i2} = −1 and g = 0. It is easy to see that O(n^2) such constraints are induced by the order statistics of the j-th variable. To deal with this boundary explosion, we developed the Hough envelope algorithm to search efficiently, within all pairs in {Z:,j}, in practically linear time.\nRecall from HMC (Section 3.2) that the trajectory of the particle, x(t), is decomposed as\n\nx_i(t) = a_i sin(t) + b_i cos(t) + µ_i,   (6)\n\nand there are n such functions, grouped into a partition of levels as described above. The Hough envelope(2) is found for every pair of adjacent levels. We illustrate this with an example of 10 dimensions and two levels in Figure 2, without loss of generality to any number of levels or dimensions. Assume we represent trajectories for points in level lk with blue curves, and points in lk+1 with red curves. Assuming we start with a valid state, at time t = 0 all red curves are above all blue curves. The goal is to find the smallest t where a blue curve meets a red curve. 
This will be our collision time, where a bounce will be necessary.\n\nFigure 2: The trajectories x_i(t) of each component are sinusoid functions. The right-most green dot signifies the wall and the time th of the earliest bounce, where the first inter-level pair (that is, any two components respectively from the blue and red levels) becomes equal, in this case the activated constraint being x_blue2 = x_red2.\n\n1. First we find the largest component bluemax of the blue level at t = 0. This takes O(n) time. Clearly, this will be the largest component until its sinusoid intersects that of any other component.\n2. To find the next largest component, compute the roots of x_bluemax(t) − x_i(t) = 0 for all components and pick the smallest (earliest) one (represented by a green dot). This also takes O(n) time.\n3. Repeat this procedure until a red sinusoid crosses the highest running blue sinusoid. When this happens, the time of the earliest bounce and its constraint are found.\n\nIn the worst-case scenario, n such repetitions have to be made, but in practice we can safely assume a fixed upper bound h on the number of blue crossings before an inter-level crossing occurs. In experiments, we found h << n, no more than 10 in simulations with hundreds of thousands of curves. Thus, this search strategy takes O(n) time in practice to complete, mirroring the analysis of other output-sensitive algorithms such as the gift wrapping algorithm for computing convex hulls [8]. Our HMC sampling approach is summarized in Algorithm 1.\n\n(2) The name is inspired by the fact that each x_i(t) is the sinusoid representation, in angle-distance space, of all lines that pass through the (a_i, b_i) point in a–b space. 
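Steps 1–3 of the search can be sketched as follows (our illustrative Python, not the authors' implementation; for simplicity we set every µ_i = 0, so each trajectory is a_i sin(t) + b_i cos(t), and we assume generic coefficients). The key fact is that the difference of two such sinusoids is again a sinusoid, so its roots are spaced exactly π apart.

```python
import math

def next_crossing(c1, c2, after):
    """Earliest t > after where the sinusoids a*sin(t) + b*cos(t) given by
    c1 = (a1, b1) and c2 = (a2, b2) meet."""
    A, B = c1[0] - c2[0], c1[1] - c2[1]
    if A == 0.0 and B == 0.0:
        return math.inf                      # identical curves: never a crossing event
    t0 = math.atan2(-B, A) % math.pi         # roots of A*sin(t) + B*cos(t) are t0 + k*pi
    k = max(math.floor((after - t0) / math.pi) + 1, 0)
    return t0 + k * math.pi

def hough_envelope(blue, red):
    """First time any red curve meets the running maximum of the blue curves.
    blue/red are lists of (a, b) pairs; every red starts above every blue."""
    t, top = 0.0, max(range(len(blue)), key=lambda i: blue[i][1])  # x_i(0) = b_i
    while True:
        t_red = min(next_crossing(blue[top], r, t) for r in red)
        others = [(next_crossing(blue[top], blue[i], t), i)
                  for i in range(len(blue)) if i != top]
        t_blue, nxt = min(others) if others else (math.inf, top)
        if t_red <= t_blue:
            return t_red              # first inter-level contact: the bounce time
        t, top = t_blue, nxt          # another blue overtakes; follow the new maximum

blue = [(0.5, 0.0), (-0.3, 0.8)]
red = [(0.0, 2.0), (1.0, 1.5)]
t_hit = hough_envelope(blue, red)
```

In this toy example the result can be checked against a brute-force minimum over all blue-red pairs; the point of the envelope is that, in practice, only a near-linear number of crossings needs to be examined per bounce.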
This sinusoid representation is known in image processing as the Hough transform [3].\n\nAlgorithm 1 HMC for GCERL\n# Notation: TMN(µ, C, F) is a truncated multivariate normal with location vector µ, scale matrix C and constraints encoded by F and g = 0.\n# IW(df, V0) is an inverse-Wishart prior with degrees of freedom df and scale matrix V0.\nInput: Y ∈ R^{n×p}, allocated travel time T, a starting Z, variable covariance V ∈ R^{p×p}, df = p + 2, V0 = df Ip and chain size N.\nGenerate constraints F(j) from Y:,j, for j = 1 . . . p.\nfor samples k = 1 . . . N do\n# Resample Z as follows:\nfor variables j = 1 . . . p do\nCompute parameters: σ²_j = V_{j,j} − V_{j,\j} V_{\j,\j}^{-1} V_{\j,j} and µ_j = Z_{:,\j} V_{\j,\j}^{-1} V_{\j,j}.\nGet one sample Z:,j ∼ TMN(µ_j, σ²_j I, F(j)) efficiently by using the Hough envelope algorithm, see Section 4.1.\nend for\nResample V ∼ IW(df + n, V0 + Z^T Z).\nCompute the correlation matrix C, s.t. C_{i,j} = V_{i,j} / sqrt(V_{i,i} V_{j,j}), and store the sample, C(k) ← C.\nend for\n\n5 An application on the Bayesian Gaussian copula factor model\n\nIn this section we describe an experiment that highlights the benefits of our HMC treatment, compared to a state-of-the-art parameter expansion (PX) sampling scheme. 
During this experiment we ask the important question:\n"How do the two schemes compare when we exploit the full advantage of the HMC machinery to jointly sample parameters and the augmented data Z, in a model of latent variables and structured correlations?"\nWe argue that under such circumstances the superior convergence speed and mixing of HMC undeniably compensate for its computational overhead.\nExperimental setup. In this section we provide results from an application on the Gaussian copula latent factor model of [13] (Hoff's model [7] for low-rank structured correlation matrices). We modify the parameter expansion (PX) algorithm used by [13] by replacing two of its Gibbs steps with a single HMC step. We show a much faster convergence to the true mode, with considerable support on its vicinity. We show that, unlike HMC, the PX algorithm falls short of properly exploring the posterior in any reasonable finite amount of time, even for small models and small samples. Worse, PX fails in ways one cannot easily detect.\nNamely, we sample each row of the factor loadings matrix Λ jointly with the corresponding column of the augmented data matrix Z, conditioning on the higher-order latent factors. This step is analogous to Pakman and Paninski's [17, sec. 3.1] use of HMC in the context of a binary probit model (the extension to many levels in the discrete marginal is straightforward with direct application of the constraint matrix F and the Hough envelope algorithm). The sampling of the higher-level latent factors remains identical to [13]. Our scheme involves no parameter expansion. We do, however, interweave the Gibbs step for the Z matrix similarly to Hoff. This has the added benefit of exploring the Z sample space within its current boundaries, complementing the joint (λ, z) sampling, which moves the boundaries jointly. 
The value of such "interweaving" schemes has been addressed in [19].\nResults. We perform simulations of 10000 iterations, with n = 1000 observations (rows of Y) and travel time π/2 for HMC, using the setups listed in the following table, along with the elapsed times of each sampling scheme. These experiments were run on Intel Core i7 desktops with 4 cores and 8GB of RAM. Both methods were parallelized across the observed variables (p).\n\nFigure | p (vars) | k (latent factors) | M (ordinal levels) | elapsed (mins): HMC | PX\n3(a) | 20 | 5 | 2 | 8 | 115\n3(b) | 10 | 3 | 2 | 6 | 80\n3(c) | 10 | 3 | 5 | 16 | 203\n\nMany functionals of the loadings matrix Λ can be assessed. We focus on reconstructing the true (low-rank) correlation matrix of the Gaussian copula. In particular, we summarize the algorithm's outcome with the root mean squared error (RMSE) of the differences between entries of the ground-truth correlation matrix and the implied correlation matrix at each iteration of an MCMC scheme (so the following plots look like time series of 10000 timepoints); see Figures 3(a), 3(b) and 3(c).\n\nFigure 3: Reconstruction (RMSE per iteration) of the low-rank structured correlation matrix of the Gaussian copula and its histogram (along the left side).\n(a) Simulation setup: 20 variables, 5 factors, 5 levels. HMC (blue) reaches a better mode faster (in iterations/CPU-time) than PX (red). Even more importantly, the RMSE posterior samples of PX are concentrated in a much smaller region compared to HMC, even after 10000 iterations. This illustrates that PX poorly explores the true distribution.\n(b) Simulation setup: 10 vars, 3 factors, 2 levels. We observe behaviors similar to Figure 3(a). Note that the histogram counts RMSEs after the burn-in period of PX (iteration #500).\n(c) Simulation setup: 10 vars, 3 factors, 5 levels. We observe behaviors similar to Figures 3(a) and 3(b), but with a thinner tail for HMC. 
Note that the histogram counts RMSEs after the burn-in period of PX (iteration #2000).\n\nMain message. HMC reaches a better mode faster (in iterations/CPU-time). Even more importantly, the RMSE posterior samples of PX are concentrated in a much smaller region compared to HMC, even after 10000 iterations. This illustrates that PX poorly explores the true distribution. As an analogous situation we refer to the top and bottom panels of Figure 14 of Radford Neal's slice sampler paper [14]. If there were no comparison against HMC, there would be no evidence from the PX plot alone that the algorithm is performing poorly. This mirrors Radford Neal's statement opening Section 8 of his paper: "a wrong answer is obtained without any obvious indication that something is amiss". The concentration of PX on the posterior mode in these simulations misrepresents the truth. PX might seem a bit simpler to implement, but it seems one cannot avoid using complex algorithms for complex models. We urge practitioners to revisit their past work with this model to find out by how much credible intervals of functionals of interest have been overconfident. Whether trivially or severely, our algorithm offers the first principled approach for checking this.\n\n6 Conclusion\n\nSampling large random vectors simultaneously in order to improve mixing is in general a very hard problem, and this is why clever methods such as HMC or elliptical slice sampling [12] are necessary. We expect that the method developed here is useful not only for those with data analysis problems within the large family of Gaussian copula extended rank likelihood models, but that the method itself and its behaviour might provide some new insights on MCMC sampling in constrained spaces in general. 
Another direction of future work consists of exploring methods for elliptical copulas, and related possible extensions of general HMC for non-Gaussian copula models.\n\nAcknowledgements\n\nThe quality of this work has benefited greatly from comments by our anonymous reviewers and useful discussions with Simon Byrne and Vassilios Stathopoulos. Research was supported by EPSRC grant EP/J013293/1.\n\nReferences\n[1] Y. Bishop, S. Fienberg, and P. Holland. Discrete Multivariate Analysis: Theory and Practice. MIT Press, 1975.\n[2] A. Dobra and A. Lenkoski. Copula Gaussian graphical models and their application to modeling functional disability data. Annals of Applied Statistics, 5:969–993, 2011.\n[3] R. O. Duda and P. E. Hart. Use of the Hough transformation to detect lines and curves in pictures. Communications of the ACM, 15(1):11–15, 1972.\n[4] G. Elidan. Copulas and machine learning. Proceedings of the Copulae in Mathematical and Quantitative Finance workshop, to appear, 2013.\n[5] F. Han and H. Liu. Semiparametric principal component analysis. Advances in Neural Information Processing Systems, 25:171–179, 2012.\n[6] G. Hinton and R. Salakhutdinov. Reducing the dimensionality of data with neural networks. Science, 313(5786):504–507, 2006.\n[7] P. Hoff. Extending the rank likelihood for semiparametric copula estimation. Annals of Applied Statistics, 1:265–283, 2007.\n[8] R. Jarvis. On the identification of the convex hull of a finite set of points in the plane. Information Processing Letters, 2(1):18–21, 1973.\n[9] H. Joe. Multivariate Models and Dependence Concepts. Chapman-Hall, 1997.\n[10] S. Kirshner. Learning with tree-averaged densities and distributions. Neural Information Processing Systems, 2007.\n[11] S. Lauritzen. Graphical Models. Oxford University Press, 1996.\n[12] I. Murray, R. Adams, and D. MacKay. Elliptical slice sampling. 
JMLR Workshop and Conference Proceedings: AISTATS 2010, 9:541–548, 2010.\n[13] J. Murray, D. Dunson, L. Carin, and J. Lucas. Bayesian Gaussian copula factor models for mixed data. Journal of the American Statistical Association, to appear, 2013.\n[14] R. Neal. Slice sampling. The Annals of Statistics, 31:705–767, 2003.\n[15] R. Neal. MCMC using Hamiltonian dynamics. Handbook of Markov Chain Monte Carlo, pages 113–162, 2010.\n[16] R. Nelsen. An Introduction to Copulas. Springer-Verlag, 2007.\n[17] A. Pakman and L. Paninski. Exact Hamiltonian Monte Carlo for truncated multivariate Gaussians. arXiv:1208.4118, 2012.\n[18] C. Rasmussen and C. Williams. Gaussian Processes for Machine Learning. MIT Press, 2006.\n[19] Y. Yu and X. L. Meng. To center or not to center: That is not the question — An ancillarity-sufficiency interweaving strategy (ASIS) for boosting MCMC efficiency. Journal of Computational and Graphical Statistics, 20(3):531–570, 2011.", "award": [], "sourceid": 1191, "authors": [{"given_name": "Alfredo", "family_name": "Kalaitzis", "institution": "UCL"}, {"given_name": "Ricardo", "family_name": "Silva", "institution": "UCL"}]}