{"title": "Variational bounds for mixed-data factor analysis", "book": "Advances in Neural Information Processing Systems", "page_first": 1108, "page_last": 1116, "abstract": "We propose a new variational EM algorithm for fitting factor analysis models with mixed continuous and categorical observations. The algorithm is based on a simple quadratic bound to the log-sum-exp function. In the special case of fully observed binary data, the bound we propose is significantly faster than previous variational methods. We show that EM is significantly more robust in the presence of missing data compared to treating the latent factors as parameters, which is the approach used by exponential family PCA and other related matrix-factorization methods. A further benefit of the variational approach is that it can easily be extended to the case of mixtures of factor analyzers, as we show. We present results on synthetic and real data sets demonstrating several desirable properties of our proposed method.", "full_text": "Variational Bounds for Mixed-Data Factor Analysis\n\nMohammad Emtiyaz Khan\nUniversity of British Columbia\nVancouver, BC, Canada V6T 1Z4\n\nemtiyaz@cs.ubc.ca\n\nBenjamin M. Marlin\n\nUniversity of British Columbia\nVancouver, BC, Canada V6T 1Z4\n\nbmarlin@cs.ubc.ca\n\nGuillaume Bouchard\n\nXerox Research Center Europe\n\n38240 Meylan, France\n\nguillaume.bouchard@xerox.com\n\nKevin P. Murphy\n\nUniversity of British Columbia\nVancouver, BC, Canada V6T 1Z4\n\nmurphyk@cs.ubc.ca\n\nAbstract\n\nWe propose a new variational EM algorithm for \ufb01tting factor analysis models\nwith mixed continuous and categorical observations. The algorithm is based on a\nsimple quadratic bound to the log-sum-exp function. In the special case of fully\nobserved binary data, the bound we propose is signi\ufb01cantly faster than previous\nvariational methods. 
We show that EM is signi\ufb01cantly more robust in the presence\nof missing data compared to treating the latent factors as parameters, which is the\napproach used by exponential family PCA and other related matrix-factorization\nmethods. A further bene\ufb01t of the variational approach is that it can easily be\nextended to the case of mixtures of factor analyzers, as we show. We present\nresults on synthetic and real data sets demonstrating several desirable properties\nof our proposed method.\n\n1\n\nIntroduction\n\nContinuous latent factor models, such as factor analysis (FA) and probabilistic principal components\nanalysis (PPCA), are very commonly used density models for continuous-valued data. They have\nmany applications including latent factor discovery, dimensionality reduction, and missing data im-\nputation. The factor analysis model asserts that a low-dimensional continuous latent factor zn \u2208 RL\nunderlies each high-dimensional observed data vector yn \u2208 RD. Standard factor analysis models\nassume the prior on the latent factor has the form p(zn) = N (zn|0, I), while the likelihood has the\nform p(yn|zn) = N (yn|Wzn + \u00b5, \u03a3). W is the D \u00d7 L factor loading matrix, \u00b5 is an offset term,\nand \u03a3 is a D \u00d7 D diagonal matrix specifying the marginal noise variances. If we set \u03a3 = \u03c32I and\nrequire W to be orthogonal, we recover probabilistic principal components analysis (PPCA). Such\nmodels can be easily \ufb01t using the expectation-maximization (EM) algorithm [Row97, TB99].\nThe FA model can be extended to other members of the exponential family by requiring that the\nnatural (canonical) parameters have the form Wzn + \u00b5 [WK01, CDS02, MHG08, LT10]. 
This is the unsupervised version of a generalized linear model (GLM), and is extremely useful since it allows for non-trivial dependencies between data variables with mixed types.
The principal difficulty with the general FA model is computational tractability, both at training and test time. A problem arises because the Gaussian prior on p(z_n) is not conjugate to the likelihood except when y_n also has a Gaussian distribution (the standard FA model). There are several approaches one can take to this problem. The simplest is to approximate the posterior p(z_n|y_n) using a point estimate, which is equivalent to viewing the latent variables as parameters and estimating them by maximum likelihood. This approach is known as exponential family PCA (ePCA) [CDS02]. We refer to it as the "MM" approach to fitting the general FA model since we maximize over z_n in the E-step, as well as W in the M-step. The main drawback of the MM approach is that it ignores posterior uncertainty in z_n, which can result in over-fitting unless the model is carefully regularized [WCS08]. This is a particular concern when we have missing data.

Figure 1: The generalized mixture of factor analyzers model for discrete and continuous data.

Notation:
  q_n               mixture indicator variable
  z_n               latent factor vector
  y^C_n             continuous data vector
  y^D_nd            discrete data variable
  W^C_k, W^D_dk     factor loading matrices
  μ^C_k, μ^D_dk     offset vectors
  Σ^C_k             continuous noise covariance
  π                 mixture prior parameter
  N                 # data cases
  L                 # latent dimensions
  K                 # mixture components
  D_c               # continuous variables
  D_d               # discrete variables
  M_d + 1           # classes per discrete variable

The opposite end of the model estimation spectrum is to integrate out both z_n and W using Markov chain Monte Carlo methods.
This approach has recently been studied under the name \u201cBayesian\nexponential family PCA\u201d [MHG08] using a Hamiltonian Monte Carlo (HMC) sampling approach.\nWe will refer to this as the \u201cSS\u201d approach to indicate that we are integrating out both zn and W by\nsampling. The SS approach preserves posterior uncertainty about zn (unlike the MM approach) and\nis robust to missing data, but can have a signi\ufb01cantly higher computational cost.\nIn this work, we study a variational EM model \ufb01tting approach that preserves posterior uncertainty\nabout zn, is robust to missing data, and is more computationally ef\ufb01cient than SS. We refer to this\nas the \u201cVM\u201d approach to indicate that we integrate over zn in the E-step after applying a variational\nbound, and maximize over W in the M-step. We focus on the case of continuous (Gaussian) and\ncategorical data. Our main contribution is the development of variational EM algorithms for factor\nanalysis and mixtures of factor analyzers based on a simple quadratic lower bound to the multinomial\nlikelihood (which subsumes the Bernoulli case) [Boh92]. This bound results in an EM iteration that\nis computationally more ef\ufb01cient than the bound previously proposed by Jaakkola for binary PCA\nwhen the training data is fully observed [JJ96], but is less tight. The proposed bound has advantages\nrelative to other previously introduced bounds, as we discuss in the following sections.\n\n2 The Generalized Mixture of Factor Analyzers Model\n\nIn this section, we describe a model for mixed continuous and discrete data that we call the gen-\neralized mixture of factor analyzers model. This model has two important special cases: mixture\nmodels and factor analysis, both for mixed continuous and discrete data. We use the general model\nas well as both special cases in subsequent experiments. In this work, we focus on Gaussian dis-\ntributed continuous data and multinomially distributed discrete data. 
The graphical model is given in Figure 1 while the probabilistic model is given in Equations 1 to 4. We begin with a description of the general model and then highlight the two special cases.
We let n ∈ {1 . . . N} index data cases, d ∈ {1 . . . D_d} index discrete data dimensions and k ∈ {1 . . . K} index mixture components. Superscripts C and D indicate variables associated with continuous and discrete data respectively. We let y^C_n ∈ R^{D_c} denote the continuous data vector and y^D_nd ∈ {1 . . . M + 1} denote the dth discrete data variable.1 We use a 1-of-(M + 1) encoding for the discrete variables, where a variable y^D_nd = m is represented by an (M + 1)-dimensional vector y^D_nd in which the m'th element is set to 1 and all remaining elements equal 0. We denote the complete data vector by y_n = [y^C_n, y^D_n1, . . . , y^D_nD_d].
The generative process begins by sampling a state of the mixture indicator variable q_n for each data case n from a K-state multinomial distribution with parameters π. Simultaneously, a length-L latent factor vector z_n ∈ R^L is sampled from a zero-mean Gaussian distribution with precision parameter λ_z. Both steps are given in Equation 1.
The natural parameters of the distribution over the data variables are obtained by passing the latent factor vector z_n through a linear function defined by a factor loading matrix and an offset term, both of which depend on the setting of the mixture indicator variable q_n.

p(z_n, q_n | θ) = N(z_n | 0, λ_z^{-1} I_L) M(q_n | π)    (1)
p(y_n | z_n, q_n = k, θ) = N(y^C_n | W^C_k z_n + μ^C_k, Σ^C_k) ∏_{d=1}^{D_d} M(y^D_nd | S(η_ndk))    (2)
η_ndk = W^D_dk z_n + μ^D_dk    (3)
S_m(η) = exp[η_m − lse(η)]    (4)
lse(η) = log[Σ_{m=1}^{M+1} exp(η_m)]    (5)

Assuming that q_n = k, the continuous data vector y^C_n is Gaussian distributed with mean W^C_k z_n + μ^C_k and covariance Σ^C_k, and each discrete data variable y^D_nd is multinomially distributed with natural parameters η_ndk = W^D_dk z_n + μ^D_dk, as seen in Equation 2. Here, N(·|m, V) denotes a Gaussian distribution with mean m and covariance V, while M(·|α) denotes a multinomial distribution with parameter vector α such that Σ_i α_i = 1 and α_i ≥ 0. For the discrete data variables, the natural parameter vector is converted into the standard mean parameter vector through the softmax function S(η) = [S_1(η), . . . , S_{M+1}(η)], where S_m(η) is defined in Equation 4. The softmax function is itself defined in terms of the log-sum-exp (LSE) function, which we give in Equation 5.
We note that the factor loading matrices for the kth mixture component are W^C_k ∈ R^{D_c × L} and W^D_dk ∈ R^{(M+1) × L}, while the offsets are μ^C_k ∈ R^{D_c} and μ^D_dk ∈ R^{M+1}. We define the ensemble of factor loading matrices and offsets to be W_k = [W^C_k, W^D_1k, W^D_2k, . . . , W^D_D_dk] and μ_k = [μ^C_k, μ^D_1k, μ^D_2k, . . . , μ^D_D_dk], respectively.
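As a concrete illustration, the generative process of Equations 1 to 4 can be sketched in a few lines of NumPy. This is our own illustrative code, not the paper's implementation; all sizes and parameter values below are placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes (not the paper's experimental settings)
L, Dc, Dd, M, K = 2, 3, 2, 3, 2  # latent dims, cont. dims, discrete vars, classes-1, mixtures

def lse(eta):
    """Log-sum-exp, Equation 5 (shifted by the max for numerical stability)."""
    m = eta.max()
    return m + np.log(np.exp(eta - m).sum())

def softmax(eta):
    """Softmax S(eta), Equation 4: S_m(eta) = exp(eta_m - lse(eta))."""
    return np.exp(eta - lse(eta))

# Random placeholder parameters theta
pi = np.full(K, 1.0 / K)                                  # mixture prior
lam_z = 1.0                                               # latent precision
Wc = rng.normal(size=(K, Dc, L)); mu_c = rng.normal(size=(K, Dc))
Sigma_c = np.ones(Dc)                                     # diagonal noise variances
Wd = rng.normal(size=(K, Dd, M + 1, L)); mu_d = rng.normal(size=(K, Dd, M + 1))

# Equation 1: sample the mixture indicator and the latent factor
q = rng.choice(K, p=pi)
z = rng.normal(scale=lam_z ** -0.5, size=L)

# Equation 2: Gaussian continuous part, plus one categorical draw per discrete variable
y_c = Wc[q] @ z + mu_c[q] + rng.normal(scale=np.sqrt(Sigma_c))
y_d = [rng.choice(M + 1, p=softmax(Wd[q, d] @ z + mu_d[q, d])) for d in range(Dd)]
```

Note that each discrete variable's class probabilities come from the softmax of its natural parameters η_ndk, exactly as in Equations 3 and 4.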
The complete set of parameters for this model is thus θ = {W_{1:K}, μ_{1:K}, Σ^C_{1:K}, π, λ_z}. To complete the model specification, we must specify the prior on these parameters. For each row of each factor loading matrix W_k, we use a Gaussian prior of the form N(0, λ_w^{-1} I). We use vague conjugate priors for the remaining parameters.
As mentioned at the start of this section, this general model has two important special cases: generalized factor analysis and mixture models for mixed continuous and discrete data. The factor analysis model is obtained by using one mixture component and at least one latent factor (K = 1, L > 1). The mixture model is obtained by using no latent factors and at least one mixture component (K > 1, L = 0). In the mixture model case where L = 0, the distribution is modeled through the offset parameters μ_k only. We will compare these three models in Section 5.
Before concluding this section, we point out one key difference between the current model and other latent factor models for discrete data like multinomial PCA [BJ04] and latent Dirichlet allocation (LDA) [BNJ03]. In our model, the natural parameters for discrete data are defined on a low-dimensional linear subspace and are mapped to the mean parameters via the softmax function. In multinomial PCA and LDA, the mean parameters are instead directly defined on a low-dimensional linear subspace. The latter approach can also be extended to the mixed-data case [BDdF+03]. However, model fitting is even more computationally challenging than in our approach. In fact, the bounds we propose can be used in this alternative setting, but we leave this to future work.

1 Note that we assume all the discrete data variables have the same number of states, namely M + 1, for notational simplicity only.
In the general case, the dth discrete variable has M_d + 1 states.

3 Variational Bounds for Model Fitting

In the standard expectation-maximization (EM) algorithm for mixtures of factor analyzers, the E-step consists of taking the expectation of the complete-data log likelihood with respect to the posterior over the mixture indicator variable q_n and latent factors z_n. The M-step consists of maximizing the expected complete log likelihood with respect to the parameters θ. In the case of Gaussian observations, this posterior is available in closed form because of conjugacy. The introduction of discrete observations, however, makes it intractable to compute the posterior, as the likelihood for these observations is not conjugate to the Gaussian prior on the latent factors.
To overcome these problems, we propose to use a quadratic bound on the LSE function. This allows us to obtain closed form updates for both the E and M steps. We use the quadratic bound described in [Boh92]. In the rest of the paper, we will refer to it as the "Bohning bound". For simplicity, we describe the bound only for one discrete measurement with K = 1 and μ_k = 0 in order to suppress the n, k and d subscripts. To ensure identifiability, we assume that the last element of η is zero (this can be enforced by setting the last row of W to zero).
The key idea behind the Bohning bound is to take a second order Taylor series expansion of the LSE function around a point ψ. An upper bound to the LSE function is found by replacing the Hessian matrix H(ψ), which appears in the second order term, with a fixed matrix A such that A − H(ψ) is positive definite for all ψ [Boh92]. Bohning gives one such matrix A, which we define below.
The expansion point ψ is a free variational parameter that must be optimized.

lse(η) ≤ (1/2) η^T A η − b_ψ^T η + c_ψ    (6)
A = (1/2) [I_M − 1_M 1_M^T / (M + 1)]    (7)
b_ψ = A ψ − S(ψ)    (8)
c_ψ = (1/2) ψ^T A ψ − S(ψ)^T ψ + lse(ψ)    (9)

ψ ∈ R^M is the vector of variational parameters, I_M is the identity matrix of size M × M, and 1_M is a vector of ones of length M. By substituting this bound into the log-likelihood, completing the square, and exponentiating, we obtain the Gaussian lower bound described below, in which a Gaussian-like "pseudo" observation ỹ_ψ corresponds to the discrete observation y^D.

p(y^D | z, W) ≥ h(ψ) N(ỹ_ψ | W z, A^{-1})    (10)
ỹ_ψ = A^{-1}(b_ψ + y^D)    (11)
h(ψ) = |2π A^{-1}|^{1/2} exp[(1/2) ỹ_ψ^T A ỹ_ψ − c_ψ]    (12)

We use this result to obtain a lower bound for each mixed data vector y_n. We suppress the ψ subscripts, which differ for each data point n and each discrete variable d, for clarity. Let ỹ_n = [y^C_n, ỹ_{1,n}, . . . , ỹ_{D_d,n}] be the pseudo data vector for a given n and ψ. It is straightforward to show that this observation gives the following lower bound on the joint likelihood,

p(ỹ_n | z_n) = N(ỹ_n | W̃ z_n, Σ̃),  W̃ = [W^C, W^D_1, . . . , W^D_{D_d}],  Σ̃ = diag(Σ^C, A_1^{-1}, . . . , A_{D_d}^{-1})

Given this pseudo observation, the computation of the posterior means m_n and covariances V_n is similar to the Gaussian FA model, as seen below. This result can be generalized to the mixture case in a straightforward way. The M-step is the same as in mixtures of Gaussian factor analyzers [GH96].

V_n = (W̃^T Σ̃^{-1} W̃ + λ_z I_L)^{-1},  m_n = V_n W̃^T Σ̃^{-1} ỹ_n    (13)

The only question remaining is how to obtain the value of ψ. By maximizing the lower bound, one can show that the optimal value is ψ_n = W̃ m_n. This follows from the fact that the Bohning bound is tight for lse(η) when ψ = η, and that the curvature is independent of η [Boh92]. We iterate this update until convergence. In practice, we find that the method usually converges in five or fewer iterations.
The most attractive feature of the bound described above is its computational efficiency. To see this, note that the posterior covariance V_n does not in fact depend on n if the data vector is fully observed, since A is a constant matrix. Consequently we need only invert V_n once outside the EM loop instead of N times, once for each data point. We will see in the next section that the other existing quadratic bounds do not have this property. To derive the overall computational cost of our EM algorithm, let us define the total dimension of ỹ_n to be D and assume K = 1. Computing V_n takes O(L^3 + L^2 D) time, and computing each m_n takes O(L^2 + L D) time. So the total cost of one E-step is O(L^3 + L^2 D + N I (L^2 + L D)), where I is the number of variational updates. If there is missing data, V_n will change across data cases, so the total cost will be O(N I (L^3 + L^2 D)).

3.1 Comparison with Other Bounding Methods

In the binary case, the Bohning bound reduces to the following: log(1 + e^η) ≤ (1/2) A η^2 − b_ψ η + c_ψ, where A = 1/4, b_ψ = A ψ − (1 + e^{−ψ})^{−1}, and c_ψ = (1/2) A ψ^2 − (1 + e^{−ψ})^{−1} ψ + log(1 + e^ψ). It is interesting to compare this bound to Jaakkola's bound [JJ96] used in [Tip98, YT04]. This bound can also be written in quadratic form: log(1 + e^η) ≤ (1/2) Ã_ξ η^2 − b̃_ξ η + c̃_ξ, where Ã_ξ = 2λ_ξ, b̃_ξ = −1/2, c̃_ξ = −λ_ξ ξ^2 − (1/2) ξ + log(1 + e^ξ), and λ_ξ = (1/(2ξ)) (1/(1 + e^{−ξ}) − 1/2).
Although the Jaakkola bound is tighter than the Bohning bound, it has higher computational complexity. The reason is that the Ã_ξ parameter depends on ξ and hence on n, which means we need to compute a different posterior covariance matrix for each n. Consequently, the cost of an E-step is O(N I (L^3 + L^2 D)), even if there is no missing data (note the L^3 term inside the N I loop).
To explore the speed vs accuracy trade-off, we use the synthetic binary data described in [MHG08] with N = 600, D = 16, and 10% missing data. We learn a binary FA model with L = 10, λ_z = 1, and λ_w = 0. We learn on the observed entries in the data matrix and compute the mean squared error (MSE) on the held-out missing entries as in [MHG08]. We average the results over 20 repetitions of the experiment. We see in Figure 2 (top left) that the Jaakkola bound gives a lower MSE than Bohning's bound in less time on this data. Next, we consider the case where the training data is fully observed, using a modified version of the data generating procedure described in [MHG08]. We vary D from 16 to 128 while setting L = 0.25D and N = 10D. We sample L different binary prototypes at random, assign each data case to a prototype, and add 10% random binary noise. We measure the average time per iteration over 40 iterations of each method.
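The binary-case comparison above is easy to check numerically. The sketch below is our own code, not the paper's: it evaluates both quadratic bounds at a shared expansion point and confirms that each upper-bounds log(1 + e^η) everywhere, that both are tight at the expansion point, and that the Jaakkola bound (curvature 2λ_ξ ≤ 1/4) is the tighter of the two.

```python
import numpy as np

def bohning_binary(eta, psi):
    """Bohning quadratic upper bound on log(1 + e^eta); fixed curvature A = 1/4."""
    A = 0.25
    sig = 1.0 / (1.0 + np.exp(-psi))          # sigmoid, the gradient of log(1 + e^psi)
    b = A * psi - sig
    c = 0.5 * A * psi ** 2 - sig * psi + np.log1p(np.exp(psi))
    return 0.5 * A * eta ** 2 - b * eta + c

def jaakkola_binary(eta, xi):
    """Jaakkola quadratic upper bound on log(1 + e^eta); curvature depends on xi != 0."""
    lam = (1.0 / (1.0 + np.exp(-xi)) - 0.5) / (2.0 * xi)
    A, b = 2.0 * lam, -0.5
    c = -lam * xi ** 2 - 0.5 * xi + np.log1p(np.exp(xi))
    return 0.5 * A * eta ** 2 - b * eta + c

etas = np.linspace(-5.0, 5.0, 201)
psi = 1.3                                      # shared expansion point (arbitrary choice)
truth = np.log1p(np.exp(etas))
boh = bohning_binary(etas, psi)
jaa = jaakkola_binary(etas, psi)
```

Both quadratics share the same value and slope at η = ψ; since the Bohning curvature (1/4) dominates the Jaakkola curvature (2λ_ξ), the Bohning curve lies above the Jaakkola curve everywhere, which is the tightness gap traded for a constant covariance matrix.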
Figure 2 (bottom left) shows that the Bohning bound exhibits much better scalability per iteration than the Jaakkola bound in this regime.
The speed issue becomes more serious when combining binary variables with categorical variables. Firstly, there is no direct extension of the Jaakkola bound to the general categorical case. Hence, to combine categorical variables with binary variables, we can use the Jaakkola bound for the binary variables and the Bohning bound for the rest. However, this is not computationally efficient, as we would need to compute the posterior covariance for each data point because of the Jaakkola bound. For computational simplicity, we use Bohning's bound for both binary and categorical data.
Various other bounds and approximations to the multinomial likelihood also exist; however, they are all more computationally intensive, and do not give an efficient variational algorithm. To the best of our knowledge these methods have not been applied to the FA model, but we describe them briefly for completeness. An extension of the Jaakkola bound to the multinomial case was given in [Bou07]. However, this tends to be less accurate than the Bohning bound. Another approach [BL06] is to use the concavity of the log function to write lse(η) ≤ ν(1 + Σ_{j=1}^{M} exp(η_j)) − log ν − 1, where ν is a variational parameter. This bound does not give closed form updates for the E and M steps, so a numerical optimizer needs to be used (see [BL06] for details).
Instead of using a bound, an alternative approach is to apply a quadratic approximation derived from a Taylor series expansion of the LSE function [AX07]. This provides a tighter approximation that could perform better than a bound, but one cannot make convergence guarantees when using it inside of EM. In practice we found this alternative approach to be very slow on the datasets that we consider. In view of its speed and simplicity, we will only consider the Bohning method for the remainder of the paper.

Figure 2: Top left: accuracy vs speed of variational EM with the Bohning bound (FA-VM), Jaakkola bound (FA-VJM) and HMC (FA-SS) on synthetic binary data. Bottom left: time per iteration of EM with the Bohning bound and the Jaakkola bound as we vary D. Right: MSE vs λ_w for FA-MM, FA-VM, and FA-SS on synthetic Gaussian data. We show results on the test and training sets, for 10% and 50% missing data.

4 Alternative Estimation Approaches

In this section, we discuss several alternative methods for fitting the generalized FA model in the case K = 1, which we compare to the VM method. We defer comparisons of FA to mixture models to Section 5.

4.1 Maximize-Maximize (MM) Method

The simplest approach to fit the FA model is to maximize log p(Y, Z, W | λ_w, λ_z) with respect to Z and W, the matrix of latent factor values and the factor loading matrix. It is straightforward to compute the gradient of the log posterior and apply a generic optimizer (we use a limited-memory quasi-Newton method). Alternatively, one can use coordinate descent [CDS02]. We set the hyperparameters λ_w and λ_z by cross validation. To handle missing data, we simply evaluate the gradients by only summing over the observed entries of Y. At test time, consider a data vector consisting of missing and observed components, y* = [y*_m, y*_o]. To fill in the missing entries, we compute ẑ* = arg max p(z*, y*_o | Ŵ) and use it with θ̂ to predict y*_m.
The MM approach is simple and widely applicable, but these benefits come at the expense of ignoring the posterior variance of Z [WCS08]. This has negative consequences for the method in terms of sensitivity to the parameters λ_w and λ_z.
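For the fully observed Gaussian special case, the MM coordinate updates just described reduce to alternating ridge regressions in closed form. The following is a minimal sketch of our own under that assumption; dimensions, data, and the number of sweeps are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
N, D, L = 200, 10, 5
lam_w, lam_z, sigma2 = 1.0, 1.0, 0.1 ** 2     # illustrative hyperparameter values

Y = rng.normal(size=(N, D))                    # standardized, fully observed data
Z = rng.normal(size=(N, L))                    # latent factors treated as parameters
W = rng.normal(size=(D, L))                    # factor loading matrix

def neg_log_post(Y, Z, W):
    """Penalized squared error: -log p(Y, Z, W) up to additive constants."""
    resid = Y - Z @ W.T
    return ((resid ** 2).sum() / (2 * sigma2)
            + 0.5 * lam_z * (Z ** 2).sum()
            + 0.5 * lam_w * (W ** 2).sum())

obj = [neg_log_post(Y, Z, W)]
for _ in range(25):
    # maximize over Z with W fixed (ridge regression per data case)
    Z = Y @ W @ np.linalg.inv(W.T @ W + sigma2 * lam_z * np.eye(L))
    # maximize over W with Z fixed (ridge regression per data dimension)
    W = Y.T @ Z @ np.linalg.inv(Z.T @ Z + sigma2 * lam_w * np.eye(L))
    obj.append(neg_log_post(Y, Z, W))
```

Since each block update is an exact minimizer, the penalized objective is non-increasing across sweeps; with missing data, the same updates would be restricted to the observed entries of Y.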
To illustrate this effect, we generate a continuous dataset using D = 10, L = 5, and N = 200 data cases by sampling from the FA model. We set λ_w = 1, λ_z = 1, and σ_c = 0.1. We standardize each data dimension to have unit variance and zero mean. We consider the case of 10% and 50% missing data. We evaluate the sensitivity of the methods to the setting of the prior precision parameter λ_w by varying it over the range 10^{-2} to 10^2. We fix λ_z = 1, since this is the standard assumption when fitting FA models. We run the methods on a random 50/50 train/test split. We train on the observed entries in the training set, and then compute MSE on the missing entries in the training and test sets. We average the results over 20 repetitions of the experiment.
Figure 2 (top right) shows that the test MSE of the MM method is extremely sensitive to the prior precision λ_w. We can see that this sensitivity increases as a function of the missing data rate. We hypothesize that this is a result of the MM method ignoring the posterior uncertainty in Z. This is supported by looking at the MSE on the training set, Figure 2 (bottom right). We see that the MM method overfits when λ_w is small.
Consequently, MM requires a careful discrete search\nover the values of \u03bbw, which is slow, since the quality of each such value must be estimated by\ncross-validation. By contrast, the VM method takes the posterior uncertainty about Z into account,\nresulting in almost no sensitivity to \u03bbw over this range. Henceforth we set \u03bbw = 0 for VM, meaning\nwe are performing (approximate) maximum likelihood parameter estimation.\n\n4.2 Sample-Sample (SS) Method\n\nAn alternative to the MM approach is to sample both Z and W from their posteriors using Hamil-\ntonian Monte Carlo (HMC) [MHG08]. We call this the \u201cSS\u201d method, since we sample both Z and\nW. HMC leverages the fact that we can compute the gradient of the log posterior in closed form.\nHowever, it has several important parameters that must be set including the step size, the momentum\ndistribution, the number of leapfrog steps, etc.\nTo handle missing data, we can simply evaluate the gradients by only summing over the observed\nentries of Y. We do not need to impute the missing entries on the training set. At test time, we\nhave a collection of samples of W. For each sample of W and each test case, we sample a set of z,\nand compute an averaged prediction for ym. In Figure 2 (right), we see that SS is insensitive to \u03bbw,\njust like VM, since it also models posterior uncertainty in Z (note that the absolute MSE values are\nhigher for SS than VM since for continuous data, VM corresponds to EM with an exact posterior).\nHowever, in Figure 2 (top left), we see that SS can be much slower than VM. In the remainder of\nthe paper we focus on deterministic \ufb01tting methods only.\n\n5 Experiments on Real Data\n\nIn this section, we evaluate the performance of our model on real data with mixed continuous and\ndiscrete variables. 
We consider the following three cases of our model: (1) a model with latent\nfactors but no mixtures (FA) (2) a model with mixtures but no latent factors (Mix) and (3) the\ngeneral mixture of factor analyzers model (MixFA). To learn the FA model, we consider the FA-\nMM and FA-VM approaches. For the Mix model, we use the standard EM algorithm. In the Mix\nmodel, continuous variables can be modeled with either a diagonal or a full covariance matrix. We\nrefer to these two variants as Mix-Diag and Mix-Full. For MixFA model, we use the VM approach.\nThis gives us \ufb01ve methods: FA-MM, FA-VM, MixFA, Mix-Full and Mix-Diag.\nWe consider three real datasets of different sizes (see the table in Figure 3).2 For each dataset, we use\n70% for training, 10% for validation and 20% for testing. We consider 20 splits for each dataset. We\nuse the validation set to determine the number of latent factors and the number of mixtures (ranges\nshown in the table) with imputation error (described below) as our performance objective. For the\nFA-MM method, we set the values of the regularization parameters \u03bbz and \u03bbw by cross validation.\nWe use the range {0.01, 0.1, 1, 10, 100} for both \u03bbz and \u03bbw . As VM is robust to the setting of these\nparameters, we set \u03bbz = 1 and \u03bbw = 0.\nOne way to assess the performance of a generative model is to see how well it can impute missing\ndata. We do this by randomly introducing missing values in the test data with a missing data rate\nof 0.3. For continuous variables, we compute the imputation MSE averaged over all the missing\nvalues (these variables are standardized beforehand). For discrete variables, we report the cross-\nentropy (averaged over missing values) de\ufb01ned as yT log \u02c6p, where \u02c6pm is the estimated probability\nthat y = m and y uses the one-of-(M + 1) encoding.\nThese errors are shown in Figure 3 along with the running time for ASES dataset in the bottom\nright sub\ufb01gure. 
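The two imputation error measures described above can be sketched as follows. The function names are ours, and we use the conventional negated cross-entropy so that smaller values are better.

```python
import numpy as np

def imputation_mse(y_true, y_pred, missing_mask):
    """Mean squared error over the held-out (missing) continuous entries."""
    return np.mean((y_true[missing_mask] - y_pred[missing_mask]) ** 2)

def imputation_cross_entropy(y_onehot, p_hat):
    """Average cross-entropy -y^T log p_hat over missing discrete entries.

    y_onehot: (n, M+1) array of one-of-(M+1) codes for the true classes.
    p_hat:    (n, M+1) array of predicted class probabilities (rows sum to 1).
    """
    return -np.mean(np.sum(y_onehot * np.log(p_hat), axis=1))
```

For example, a predictor that always outputs the uniform distribution over M + 1 classes incurs a cross-entropy of log(M + 1) regardless of the true labels.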
We see that FA-VM consistently performs better than FA-MM for all the datasets. Moreover, because of the need for cross-validation, FA-MM takes more time than FA-VM. We also see that the Mix model, although faster, performs worse than FA-VM. Finally, as expected, MixFA generally performs slightly better than FA, but takes longer to run.

2 Adult and Auto are available in the UCI repository, while the ASES dataset is a subset of the Asia-Europe Survey from www.icpsr.umich.edu

Dataset details:

           Auto          Adult         ASES
  N        392           45222         16815
  Dd       3             5             42
  Σ Md     21            27            156
  Dc       5             4             0
  D        26            31            156
  L        5, 13, 26     4, 15, 31     20, 40, 60, 80
  K        1, 5, 10, 20  1, 5, 10, 20  1, 10, 20, 30, 40

Figure 3: Left: the table shows the details of each dataset used. Here D = Dc + Σ Md is the total size of the data vector. L and K are the ranges of the number of latent factors and mixture components used for cross validation. Note that the maximum value of L is D, as required by the FA model. Right: the figure shows the imputation error for each dataset for continuous and discrete variables. The bottom right subfigure shows the timing comparison for the ASES dataset.

6 Discussion and Future Work

In this work we have proposed a new variational EM algorithm for fitting factor analysis models with mixed data. The algorithm is based on the Bohning bound, a simple quadratic bound to the log-sum-exp function. In the special case of fully observed binary data, the Bohning bound iteration is theoretically faster than Jaakkola's bound iteration, and we have demonstrated this advantage empirically. More importantly, the Bohning bound also easily extends to the categorical case.
This\nenables, for the \ufb01rst time, an ef\ufb01cient variational method for \ufb01tting a factor analysis model to mixed\ncontinuous, binary, and categorical observations.\nIn comparison to the maximize-maximize (MM) method, which forms the basis of ePCA and other\nmatrix factorization methods, our variational EM method accounts for posterior uncertainty in the\nlatent factors, leading to reduced sensitivity to hyper parameters. This has important practical con-\nsequences as the MM method requires extensive cross validation while our approach does not.\nWe have compared a range of models and algorithms in terms of imputation performance on real\ndata. This analysis shows that the cost of the cross validation search for MM is higher than the cost\nof \ufb01tting the FA model using our method. It also shows that standard alternatives to FA, such as\n\ufb01nite mixture models, do not perform as well as FA. Finally, we show that the MixFA model can\nyield a performance improvement over a single FA model, although at a higher computational cost.\nWe note that the quadratic bound that we study can be used in a variety of other models, such as\nlinear-Gaussian state-space models with categorical observations [SH03]. It might be an interesting\nalternative to a Laplace approximation to the posterior, which is used in [KPBSK10, RMC09]. The\nbound might also be useful in the context of the correlated topic model [BL06, AX07], where similar\nvariational EM methods have been applied.\nIn the Bayesian statistics literature, it is common to use latent factor models combined with a pro-\nbit observation model; this allows one to perform inference for the latent states using ef\ufb01cient\nauxiliary-variable MCMC techniques (see e.g., [HSC09, Dun07]). Additionally, the recently pro-\nposed Riemannian Manifold Hamiltonian Monte Carlo sampler [GCC09] may signi\ufb01cantly speed-\nup sampling-based approaches for mixed-data factor analysis models. 
We leave a comparison to these sampling-based approaches to future work.

Acknowledgments

We would like to thank the reviewers for their helpful comments. This work was completed in part at the Xerox Research Center Europe and was supported by the Pacific Institute for the Mathematical Sciences and the Killam Trusts at the University of British Columbia.

[Figure 3, right panels: imputation error on discrete and continuous variables for Auto, Adult, and ASES, and a timing comparison (seconds, log scale) on ASES, comparing FA-MM, FA-VM, MixFA, Mix-Full, and Mix-Diag.]

References

[AX07] A. Ahmed and E. Xing. On tight approximate inference of the logistic-normal topic admixture model. In AI/Statistics, 2007.

[BDdF+03] K. Barnard, P. Duygulu, N. de Freitas, D. Forsyth, D. Blei, and M. I. Jordan. Matching words and pictures. J. of Machine Learning Research, 3:1107-1135, 2003.

[BJ04] W. Buntine and A. Jakulin. Applying discrete PCA in data analysis. In UAI, 2004.

[BL06] D. Blei and J. Lafferty. Correlated topic models. In NIPS, 2006.

[BNJ03] D. Blei, A. Ng, and M. Jordan. Latent Dirichlet allocation. J. of Machine Learning Research, 3:993-1022, 2003.

[Boh92] D. Bohning. Multinomial logistic regression algorithm. Annals of the Inst. of Statistical Math., 44:197-200, 1992.

[Bou07] G. Bouchard. Efficient bounds for the softmax and applications to approximate inference in hybrid models. In NIPS 2007 Workshop on Approximate Inference in Hybrid Models, 2007.

[CDS02] M. Collins, S. Dasgupta, and R. E. Schapire. A generalization of principal components analysis to the exponential family. In NIPS-14, 2002.

[Dun07] D. Dunson. Bayesian methods for latent trait modelling of longitudinal data. Stat. Methods Med. Res., 16(5):399-415, Oct 2007.

[GCC09] M. Girolami, B. Calderhead, and S. A. Chin. Riemannian manifold Hamiltonian Monte Carlo. arXiv preprint arXiv:0907.1100, 2009.

[GH96] Z. Ghahramani and G. Hinton. The EM algorithm for mixtures of factor analyzers. Technical report, Dept. of Comp. Sci., Univ. of Toronto, 1996.

[HSC09] P. R. Hahn, J. Scott, and C. Carvalho. Sparse factor-analytic probit models. Technical report, Duke, 2009.

[JJ96] T. Jaakkola and M. Jordan. A variational approach to Bayesian logistic regression problems and their extensions. In AI/Statistics, 1996.

[KPBSK10] S. Koyama, L. Perez-Bolde, C. Shalizi, and R. Kass. Approximate methods for state-space models. Technical report, CMU, 2010.

[LT10] J. Li and D. Tao. Simple exponential family PCA. In AI/Statistics, 2010.

[MHG08] S. Mohamed, K. Heller, and Z. Ghahramani. Bayesian exponential family PCA. In NIPS, 2008.

[RMC09] H. Rue, S. Martino, and N. Chopin. Approximate Bayesian inference for latent Gaussian models using integrated nested Laplace approximations. J. of Royal Stat. Soc. Series B, 71:319-392, 2009.

[Row97] S. Roweis. EM algorithms for PCA and SPCA. In NIPS, 1997.

[SH03] V. Siivola and A. Honkela. A state-space method for language modeling. In Proc. IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU), pages 548-553, 2003.

[TB99] M. Tipping and C. Bishop. Probabilistic principal component analysis. J. of Royal Stat. Soc. Series B, 61(3):611-622, 1999.

[Tip98] M. Tipping. Probabilistic visualization of high-dimensional binary data. In NIPS, 1998.

[WCS08] M. Welling, C. Chemudugunta, and N. Sutter. Deterministic latent variable models and their pitfalls. In Intl. Conf. on Data Mining, 2008.

[WK01] M. Wedel and W. Kamakura. Factor analysis with (mixed) observed and latent variables in the exponential family. Psychometrika, 66(4):515-530, December 2001.

[YT04] K. Yu and V. Tresp. Heterogeneous data fusion via a probabilistic latent-variable model. In Organic and Pervasive Computing (ARCS 2004), 2004.