{"title": "Variational EM Algorithms for Non-Gaussian Latent Variable Models", "book": "Advances in Neural Information Processing Systems", "page_first": 1059, "page_last": 1066, "abstract": null, "full_text": "Variational EM Algorithms for Non-Gaussian Latent Variable Models\nJ. A. Palmer, D. P. Wipf, K. Kreutz-Delgado, and B. D. Rao Department of Electrical and Computer Engineering University of California San Diego, La Jolla, CA 92093 {japalmer,dwipf,kreutz,brao}@ece.ucsd.edu\n\nAbstract\nWe consider criteria for variational representations of non-Gaussian latent variables, and derive variational EM algorithms in general form. We establish a general equivalence among convex bounding methods, evidence based methods, and ensemble learning/Variational Bayes methods, which has previously been demonstrated only for particular cases.\n\n1 Introduction\nProbabilistic methods have become well established in the analysis of learning algorithms over the past decade, drawing largely on classical Gaussian statistical theory [21, 2, 28]. More recently, variational Bayes and ensemble learning methods [22, 13] have been proposed. In addition to the evidence and VB methods, variational methods based on convex bounding have been proposed for dealing with non-Gaussian latent variables [18, 14]. We concentrate here on the theory of the linear model, with direct application to ICA [14], factor analysis [2], mixture models [13], kernel regression [30, 11, 32], and linearization approaches to nonlinear models [15]. The methods can likely be applied in other contexts. In MacKay's evidence framework, \"hierarchical priors\" are employed on the latent variables, using Gamma priors on the inverse variances, which has the effect of making the marginal prior of the latent variable the non-Gaussian Student's t [30]. Based on MacKay's framework, Tipping proposed the Relevance Vector Machine (RVM) [30] for estimation of sparse solutions in the kernel regression problem. 
A relationship between the evidence framework and ensemble/VB methods has been noted in [22, 6] for the particular case of the RVM with t hyperprior. Figueiredo [11] proposed EM algorithms based on hyperprior representations of the Laplacian and Jeffreys priors. In [14], Girolami employed the convex variational framework of [16] to derive a different type of variational EM algorithm using a convex variational representation of the Laplacian prior. Wipf et al. [32] demonstrated the equivalence between the variational approach of [16, 14] and the evidence-based RVM for the case of t priors, and thus, via [6], the equivalence of the convex variational method and the ensemble/VB methods for the particular case of the t prior. In this paper we consider these methods from a unifying viewpoint, deriving algorithms in more general form and establishing a more general relationship among the methods than has previously been shown. In Section 2, we define the model and estimation problems we shall be concerned with, and in Section 3 we discuss criteria for variational representations. In Section 4 we consider the relationships among these methods.\n\n2 The Bayesian linear model\nThroughout we shall consider the following model,\ny = A x + ν ,  (1)\nwhere A ∈ R^(m×n), x ∼ p(x) = prod_i p(x_i), and ν ∼ N(0, Σ), with x and ν independent. The important thing to note for our purposes is that the x_i are non-Gaussian. We consider two types of variational representation of the non-Gaussian priors p(x_i), which we shall call convex type and integral type. In the convex type of variational representation, the density is represented as a supremum over Gaussian functions of varying scale,\np(x) = sup_{ξ>0} N(x; 0, ξ^-1) φ(ξ) .  (2)\nThe essential property of \"concavity in x²\" leading to this representation was used in [29, 17, 16, 18, 6] to represent the logistic link function. A convex type representation of the Laplacian density was applied to learning overcomplete representations in [14]. 
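As a numerical sanity check of the convex representation (2), the following sketch (our own worked example, not taken from the paper) uses the unnormalized Laplacian p(x) = exp(-|x|). Here g(y) = √y is concave on (0, ∞), and working out the concave conjugate gives g*(λ) = -1/(4λ), so that φ(ξ) = √(2π/ξ) exp(-1/(2ξ)); the supremum over ξ then recovers exp(-|x|) exactly:

```python
import numpy as np

# Check the convex variational bound (2) for p(x) = exp(-|x|).
# The closed forms for phi below are our own worked example:
# g(y) = sqrt(y), g*(lambda) = -1/(4*lambda), hence
# phi(xi) = sqrt(2*pi/xi) * exp(-1/(2*xi)).

def phi(xi):
    return np.sqrt(2 * np.pi / xi) * np.exp(-1.0 / (2 * xi))

def gaussian(x, xi):
    # N(x; 0, 1/xi): zero-mean Gaussian with inverse variance xi
    return np.sqrt(xi / (2 * np.pi)) * np.exp(-0.5 * xi * x**2)

xi_grid = np.logspace(-3, 3, 20000)   # dense grid standing in for sup over xi
for x in [0.3, 1.0, 2.5]:
    bound = np.max(gaussian(x, xi_grid) * phi(xi_grid))
    # the product reduces to exp(-xi*x^2/2 - 1/(2*xi)), maximized at xi = 1/|x|
    assert abs(bound - np.exp(-abs(x))) < 1e-4
```

The supremum is attained at ξ = 1/|x|, which is exactly the reweighting that reappears in the EM algorithms of Section 4.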
In the integral type of representation, the density p(x) is represented as an integral over the scale parameter of the density, with respect to some positive measure μ,\np(x) = ∫_0^∞ N(x; 0, ξ^-1) dμ(ξ) .  (3)\nSuch representations with a general kernel are referred to as scale mixtures [19]. Gaussian scale mixtures were discussed in the examples of Dempster, Laird, and Rubin's original EM paper [9], and treated more extensively in [10]. The integral representation has been used, sometimes implicitly, for kernel-based estimation [30, 11] and ICA [20]. The distinction between MAP estimation of components and estimation of hyperparameters has been discussed in [23] and [30] for the case of Gamma distributed inverse variance. We shall be interested in variational EM algorithms for solving two basic problems, corresponding essentially to the two methods of handling hyperparameters discussed in [23]: the MAP estimate of the latent variables,\nx̂ = arg max_x p(x|y) ,  (4)\nand the MAP estimate of the hyperparameters,\nξ̂ = arg max_ξ p(ξ|y) .  (5)\nThe following section discusses the criteria for and relationship between the two types of variational representation. In Section 4, we discuss algorithms for each problem based on the two types of variational representations, and determine when these are equivalent. We also discuss the approximation of the likelihood p(y; A) using the ensemble learning or VB method, which approximates the posterior p(x, ξ|y) by a factorial density q(x|y) q(ξ|y). We show that the ensemble method is equivalent to the hyperparameter MAP method.\n\n3 Variational representations of super-Gaussian densities\nIn this section we discuss the criteria for the convex and integral type representations.\n3.1 Convex variational bounds\nWe wish to determine when a symmetric, unimodal density p(x) can be represented in the form (2) for some function φ(ξ). Equivalently, when, for all x > 0,\n-log p(√x) = -sup_{ξ>0} log N(√x; 0, ξ^-1) φ(ξ) = inf_{ξ>0} [ (1/2) ξ x - log √(ξ/2π) φ(ξ) ] . 
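As a concrete instance of (3), the Laplacian density (1/2) exp(-|x|) is a Gaussian scale mixture whose mixing density on the variance is exponential (cf. Andrews and Mallows [1]); the sketch below checks the identity by numerical integration (the rate 1/2 is our own worked-out value, not a figure from the paper):

```python
import numpy as np
from scipy.integrate import quad

# Numerical check of the scale-mixture representation (3):
# (1/2) exp(-|x|) = integral over tau of N(x; 0, tau) * (1/2) exp(-tau/2),
# i.e. a Gaussian scale mixture with Exp(1/2) mixing on the variance tau.

def integrand(tau, x):
    gauss = np.exp(-x**2 / (2 * tau)) / np.sqrt(2 * np.pi * tau)
    mixing = 0.5 * np.exp(-tau / 2)   # exponential density, rate 1/2
    return gauss * mixing

for x in [0.5, 1.0, 2.0]:
    val, _ = quad(integrand, 0, np.inf, args=(x,))
    assert abs(val - 0.5 * np.exp(-abs(x))) < 1e-6
```

The same pattern (a non-Gaussian prior expressed as a variance mixture of Gaussians) underlies all the EM algorithms derived in Section 4.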
The last formula says that -log p(√x) is the concave conjugate of (the closure of the concave hull of) the function log √(ξ/2π) φ(ξ) [27, 12]. This is possible if and only if -log p(√x) is closed, increasing and concave on (0, ∞). Thus we have the following.\nTheorem 1. A symmetric probability density p(x) ∝ exp(-g(x²)) can be represented in the convex variational form p(x) = sup_{ξ>0} N(x; 0, ξ^-1) φ(ξ) if and only if g(x) ≡ -log p(√x) is increasing and concave on (0, ∞). In this case we can use the function\nφ(ξ) = √(2π/ξ) exp( g*(ξ/2) ) ,\nwhere g* is the concave conjugate of g.\nExamples of densities satisfying this criterion include: (i) generalized Gaussian exp(-|x|^β), 0 < β ≤ 2, (ii) logistic 1/cosh²(x/2), (iii) Student's t (1 + x²/ν)^(-(ν+1)/2), ν > 0, and (iv) symmetric α-stable densities (having characteristic function exp(-|ω|^α), 0 < α ≤ 2). The convex variational representation motivates the following definition.\nDefinition 1. A symmetric probability density p(x) is strongly super-Gaussian if p(√x) is log-convex on (0, ∞), and strongly sub-Gaussian if p(√x) is log-concave on (0, ∞).\nAn equivalent definition is given in [5, pp. 60-61], which defines p(x) = exp(-f(x)) to be sub-Gaussian (super-Gaussian) if f(x)/x is increasing (decreasing) on (0, ∞). This condition is equivalent to f(x) = g(x²) with g concave, i.e. g' decreasing. The property of being strongly sub- or super-Gaussian is independent of scale.\n3.2 Scale mixtures\nWe now wish to determine when a probability density p(x) can be represented in the form (3) for some μ(ξ) non-decreasing on (0, ∞). A fundamental result dealing with integral representations was given by Bernstein and Widder (see [31]). It uses the following definition.\nDefinition 2. A function f(x) is completely monotonic on (a, b) if\n(-1)^n f^(n)(x) ≥ 0 , n = 0, 1, . . . ,\nfor every x ∈ (a, b). That is, f(x) is completely monotonic if it is positive, decreasing, convex, and so on. Bernstein's theorem [31, Thm. 12b] states:\nTheorem 2. 
A necessary and sufficient condition that p(x) should be completely monotonic on (0, ∞) is that\np(x) = ∫_0^∞ e^(-tx) dα(t) ,\nwhere α(t) is non-decreasing on (0, ∞).\nThus for p(x) to be a Gaussian scale mixture,\np(x) = e^(-f(x)) = e^(-g(x²)) = ∫_0^∞ e^(-(1/2) t x²) dα(t) ,\na necessary and sufficient condition is that p(√x) = e^(-g(x)) be completely monotonic for 0 < x < ∞, and we have the following (see also [19, 1]).\nTheorem 3. A function p(x) can be represented as a Gaussian scale mixture if and only if p(√x) is completely monotonic on (0, ∞).\n3.3 Relationship between convex and integral type representations\nWe now consider the relationship between the convex and integral types of variational representation. Let p(x) = exp(-g(x²)). We have seen that p(x) can be represented in the form (2) if and only if g(x) is increasing and concave on (0, ∞). And we have seen that p(x) can be represented in the form (3) if and only if p(√x) = exp(-g(x)) is completely monotonic. We shall now consider whether complete monotonicity of p(√x) implies the concavity of g(x) = -log p(√x), that is, whether representability in the integral form implies representability in the convex form. Complete monotonicity of a function q(x) implies that q ≥ 0, q' ≤ 0, q'' ≥ 0, etc. For example, if p(√x) is completely monotonic, then\n(d²/dx²) p(√x) = (d²/dx²) e^(-g(x)) = e^(-g(x)) ( g'(x)² - g''(x) ) ≥ 0 .\nThus if g'' ≤ 0, then p(√x) is convex, but the converse does not necessarily hold. That is, concavity of g does not follow from convexity of p(√x), as the latter only requires that g'' ≤ g'². Concavity of g does follow, however, from the complete monotonicity of p(√x). For example, we can use the following result [8, 3.5.2].\nTheorem 4. If the functions f_t(x), t ∈ D, are convex, then log ∫_D e^(f_t(x)) dt is convex.\nThus completely monotonic functions, being scale mixtures of the log-convex function e^(-x) by Theorem 2, are also log-convex. 
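The complete monotonicity criterion of Theorem 3 can be spot-checked symbolically. The sketch below (our own example, with ν = 3 chosen for illustration) verifies for the Student's t density that the first few derivatives of p(√x) = (1 + x/ν)^(-(ν+1)/2) alternate in sign on (0, ∞):

```python
import sympy as sp

# Spot-check of Theorem 3 for the Student's t density with nu = 3:
# q(x) = p(sqrt(x)) = (1 + x/nu)^(-(nu+1)/2) should be completely
# monotonic, i.e. (-1)^n q^(n)(x) >= 0 for n = 0, 1, 2, ...
x = sp.symbols('x', positive=True)
nu = 3
q = (1 + x / nu) ** (-sp.Rational(nu + 1, 2))

for n in range(6):
    dn = sp.diff(q, x, n)
    for pt in [sp.Rational(1, 2), 1, 5]:   # a few sample points in (0, inf)
        assert (-1) ** n * dn.subs(x, pt) >= 0
```

This checks only finitely many derivatives at finitely many points, so it is evidence rather than a proof; for the Student's t the full statement follows from its known Gamma scale mixture representation.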
We thus see that any function representable in the integral variational form (3) is also representable in the convex variational form (2). In fact, a stronger result holds. The following theorem [7, Thm. 4.1.5] establishes the equivalence, in terms of complete monotonicity, between q(x) and the derivative of g(x) ≡ -log q(x).\nTheorem 5. If g(x) > 0, then e^(-u g(x)) is completely monotonic for every u > 0 if and only if g'(x) is completely monotonic.\nIn particular, q(x) ≡ p(√x) = exp(-g(x)) is completely monotonic only if g''(x) ≤ 0.\nTo summarize, let p(x) = e^(-g(x²)). If g is increasing and concave for x > 0, then p(x) admits the convex type of variational representation (2). If, in addition, the higher derivatives satisfy g^(3)(x) ≥ 0, g^(4)(x) ≤ 0, g^(5)(x) ≥ 0, etc., then p(x) also admits the Gaussian scale mixture representation (3).\n\n4 General equivalences among variational methods\n4.1 MAP estimation of components\nConsider first the MAP estimate of the latent variables (4).\n4.1.1 Component MAP: Integral case\nFollowing [10] (where the x_i in (1) are actually estimated as non-random parameters, with the noise being non-Gaussian, but the underlying theory is essentially the same), consider an EM algorithm to estimate x when the p(x_i) are independent Gaussian scale mixtures as in (3). Differentiating inside the integral gives,\n(d/dx) p(x) = (d/dx) ∫_0^∞ p(x|ξ) p(ξ) dξ = -∫_0^∞ ξ x p(x, ξ) dξ = -x p(x) ∫_0^∞ ξ p(ξ|x) dξ .\nThus, with p(x) ∝ exp(-f(x)), we see that,\nE(ξ_i|x_i) = ∫_0^∞ ξ_i p(ξ_i|x_i) dξ_i = -p'(x_i)/(x_i p(x_i)) = f'(x_i)/x_i .  (6)\nThe EM algorithm alternates setting ξ̂_i to the posterior mean, E(ξ_i|x_i) = f'(x_i)/x_i, and setting x to minimize,\n-log p(y|x) p(x|ξ̂) = (1/2) x^T A^T Σ^-1 A x - y^T Σ^-1 A x + (1/2) x^T Λ^-1 x + const.,  (7)\nwhere Λ = diag(ξ̂)^-1. At iteration k, we put ξ_i^k = f'(x_i^k)/x_i^k and Λ^k = diag(ξ^k)^-1, and\nx^(k+1) = Λ^k A^T (A Λ^k A^T + Σ)^-1 y .\n4.1.2 Component MAP: Convex case\nAgain consider the MAP estimate of x. 
For strongly super-Gaussian priors p(x_i), we have,\narg max_x p(x|y) = arg max_x p(y|x) p(x) = arg max_x max_ξ p(y|x) p(x; ξ) φ(ξ) .\nNow since,\n-log p(y|x) p(x; ξ) φ(ξ) = (1/2) x^T A^T Σ^-1 A x - y^T Σ^-1 A x + sum_{i=1}^n [ (1/2) ξ_i x_i² - log φ(ξ_i) ] ,\nthe MAP estimate can be improved iteratively by alternately maximizing over x and ξ,\nξ_i^k = 2 (g*')^-1(x_i^k ²) = 2 g'(x_i^k ²) = f'(x_i^k)/x_i^k ,  (8)\nwith x updated as in 4.1.1. We thus see that this algorithm is equivalent to the MAP algorithm derived in 4.1.1 for Gaussian scale mixtures. That is, for direct MAP estimation of the latent variable x, the EM Gaussian scale mixture method and the variational bounding method yield the same algorithm. This algorithm has also been derived in the image restoration literature [12] as the \"half-quadratic\" algorithm, and it is the basis for the FOCUSS algorithms derived in [26, 25]. The regression algorithm given in [11] for the particular cases of the Laplacian and Jeffreys priors is based on the theory in 4.1.1, and is in fact equivalent to the FOCUSS algorithm derived in [26].\n4.2 MAP estimate of variational parameters\nNow consider MAP estimation of the (random) variational hyperparameters ξ.\n4.2.1 Hyperparameter MAP: Integral case\nConsider an EM algorithm to find the MAP estimate of the hyperparameters ξ in the integral representation (Gaussian scale mixture) case, where the latent variables x are hidden. For the complete likelihood, we have,\np(ξ, x|y) ∝ p(y|x, ξ) p(x|ξ) p(ξ) = p(y|x) p(x|ξ) p(ξ) .\nThe function to be minimized over ξ is then,\n-⟨log p(x|ξ) p(ξ)⟩_x = sum_i [ (1/2) ⟨x_i²⟩ ξ_i - log √ξ_i p(ξ_i) ] + const.,  (9)\nwhere ⟨x_i²⟩ = E(x_i²|y, ξ) is computed in the E-step. If we define h(ξ) ≡ log √ξ p(ξ), and assume that this function is concave, then the optimal value of ξ_i is given by,\nξ_i = h*'( (1/2) ⟨x_i²⟩ ) ,\nwhere h* is the concave conjugate of h. This algorithm converges to a local maximum ξ̂ of p(ξ|y), which then yields an estimate of x by taking x̂ = E(x|y, ξ̂). 
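The common component-MAP iteration of 4.1.1-4.1.2 can be sketched for a Laplacian prior, where f'(x)/x = 1/|x| so the weight matrix is Λ^k = diag(|x^k|). This is a minimal sketch: the problem sizes, noise level, and damping constant are illustrative assumptions, not values from the paper.

```python
import numpy as np

# Reweighted (FOCUSS/half-quadratic style) MAP iteration for y = A x + noise
# with a Laplacian prior on x:  xi_i^k = 1/|x_i^k|,  Lambda^k = diag(|x^k|),
# x^(k+1) = Lambda^k A^T (A Lambda^k A^T + Sigma)^-1 y.

rng = np.random.default_rng(0)
m, n = 10, 30                                    # illustrative sizes
A = rng.standard_normal((m, n))
x_true = np.zeros(n)
x_true[[3, 11, 27]] = [2.0, -1.5, 1.0]           # sparse ground truth
sigma2 = 1e-4                                    # noise variance (Sigma = sigma2 * I)
y = A @ x_true + np.sqrt(sigma2) * rng.standard_normal(m)

x = np.ones(n)                                   # initial estimate
for _ in range(200):
    Lam = np.diag(np.abs(x) + 1e-12)             # damping avoids exact zeros
    x = Lam @ A.T @ np.linalg.solve(A @ Lam @ A.T + sigma2 * np.eye(m), y)

assert np.linalg.norm(A @ x - y) < 1e-2          # the iterate fits the data
```

At a fixed point this is a stationary point of (1/(2 sigma2))||A x - y||² + sum_i |x_i|, i.e. the Laplacian MAP objective; swapping in another f'(x)/x changes only the diagonal reweighting.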
Alternative algorithms result from using this method to find the MAP estimate of different functions of the scale random variable ξ.\n4.2.2 Hyperparameter MAP: Convex case\nIn the convex representation, the parameters ξ do not actually represent a probabilistic quantity, but rather arise as parameters in a variational inequality. Specifically, we write,\np(y) = ∫ p(y, x) dx = ∫ max_ξ p(y|x) p(x|ξ) φ(ξ) dx ≥ max_ξ ∫ p(y|x) p(x|ξ) φ(ξ) dx = max_ξ N(y; 0, A Λ A^T + Σ) prod_i φ(ξ_i) .\nNow we define the function,\np̃(y; ξ) ≡ N(y; 0, A Λ A^T + Σ) prod_i φ(ξ_i) ,\nand try to find ξ̂ = arg max_ξ p̃(y; ξ). We maximize p̃ by EM, marginalizing over x,\np̃(y; ξ) = ∫ p(y|x) p(x|ξ) φ(ξ) dx .\nThe algorithm is then equivalent to that in 4.1.2, except that the expectation of x_i² is taken in the E-step, and the diagonal weighting matrix becomes,\nξ_i = f'(σ_i)/σ_i ,\nwhere σ_i = E(x_i²|y; ξ)^(1/2). Although p̃ is not a true probability density function, the proof of convergence for EM does not assume unit normalization. This theory is the basis for the algorithm presented in [14] for the particular case of a Laplacian prior (where in addition A in the model (1) is updated according to the standard EM update).\n4.3 Ensemble learning\nIn the ensemble learning approach (also Variational Bayes [4, 3, 6]) the idea is to find the approximate separable posterior that minimizes the KL divergence from the true posterior, using the following decomposition of the log likelihood,\nlog p(y) = ∫ q(z|y) log [ p(z, y) / q(z|y) ] dz + D( q(z|y) || p(z|y) ) ≡ -F(q) + D(q||p) .\nThe term F(q) is commonly called the variational free energy [29, 24]. Minimizing F over q is equivalent to minimizing D over q. The posterior approximating distribution is taken to be factorial,\nq(z|y) = q(x, ξ|y) = q(x|y) q(ξ|y) .\nFor fixed q(ξ|y), the free energy F is given by,\nF = -∫∫ q(x|y) q(ξ|y) log [ p(x, ξ|y) / ( q(x|y) q(ξ|y) ) ] dξ dx = D( q(x|y) || e^⟨log p(x,ξ|y)⟩ ) + const.,  (10)\nwhere ⟨·⟩ 
denotes expectation with respect to q(ξ|y), and the constant term is -H(q(ξ|y)), the negative entropy of q(ξ|y). The minimum of the KL divergence in (10) is attained if and only if\nq(x|y) ∝ exp⟨log p(x, ξ|y)⟩ ∝ p(y|x) exp⟨log p(x|ξ)⟩\nalmost surely. An identical derivation yields the optimal\nq(ξ|y) ∝ exp⟨log p(x, ξ|y)⟩_x ∝ p(ξ) exp⟨log p(x|ξ)⟩_x\nwhen q(x|y) is fixed. The ensemble (or VB) algorithm consists of alternately updating the parameters of these approximating marginal distributions. In the linear model with Gaussian scale mixture latent variables, the complete likelihood is again,\np(y, x, ξ) = p(y|x) p(x|ξ) p(ξ) .\nThe optimal approximate posteriors are given by,\nq(x|y) = N(x; μ_x|y, Σ_x|y) ,  q(ξ_i|y) = p( ξ_i | x_i = ⟨x_i²⟩^(1/2) ) ,\nwhere, letting Λ = diag(⟨ξ⟩)^-1, the posterior moments are given by,\nμ_x|y = Λ A^T (A Λ A^T + Σ)^-1 y\nΣ_x|y = (A^T Σ^-1 A + Λ^-1)^-1 = Λ - Λ A^T (A Λ A^T + Σ)^-1 A Λ .\nThe only relevant fact about q(ξ|y) that we need is ⟨ξ_i⟩, for which we have, using (6),\n⟨ξ_i⟩ = ∫ ξ_i q(ξ_i|y) dξ_i = E( ξ_i | x_i = σ_i ) = f'(σ_i)/σ_i ,\nwhere σ_i = ⟨x_i²⟩^(1/2) = E(x_i²|y; ξ)^(1/2). We thus see that the ensemble learning algorithm is equivalent to the approximate hyperparameter MAP algorithm of 4.2.2. Note also that this shows that the VB methods can be applied to any Gaussian scale mixture density, using only the form of the latent variable prior p(x), without needing the marginal hyperprior p(ξ) in closed form. This is particularly important in the case of the generalized Gaussian and logistic densities, whose scale parameter densities are α-stable and Kolmogorov [1] respectively.\n\n5 Conclusion\nIn this paper, we have discussed criteria for variational representations of non-Gaussian latent variables, and derived general variational EM algorithms based on these representations. 
We have shown a general equivalence between the two representations in MAP estimation taking the hyperparameters as hidden, and we have shown the general equivalence between the variational convex approximate MAP estimate of hyperparameters and the ensemble learning or VB method.\n\nReferences\n[1] D. F. Andrews and C. L. Mallows. Scale mixtures of normal distributions. J. Roy. Statist. Soc. Ser. B, 36:99–102, 1974.\n[2] H. Attias. Independent factor analysis. Neural Computation, 11:803–851, 1999.\n[3] H. Attias. A variational Bayesian framework for graphical models. In Advances in Neural Information Processing Systems 12. MIT Press, 2000.\n[4] M. J. Beal and Z. Ghahramani. The variational Bayesian EM algorithm for incomplete data: with application to scoring graphical model structures. In Bayesian Statistics 7, pages 453–464. Oxford University Press, 2002.\n[5] A. Benveniste, M. Metivier, and P. Priouret. Adaptive Algorithms and Stochastic Approximations. Springer-Verlag, 1990.\n[6] C. M. Bishop and M. E. Tipping. Variational relevance vector machines. In C. Boutilier and M. Goldszmidt, editors, Proceedings of the 16th Conference on Uncertainty in Artificial Intelligence, pages 46–53. Morgan Kaufmann, 2000.\n[7] S. Bochner. Harmonic Analysis and the Theory of Probability. University of California Press, Berkeley and Los Angeles, 1960.\n[8] S. Boyd and L. Vandenberghe. Convex Optimization. Cambridge University Press, 2004.\n[9] A. P. Dempster, N. M. Laird, and D. B. Rubin. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B, 39:1–38, 1977.\n[10] A. P. Dempster, N. M. Laird, and D. B. Rubin. Iteratively reweighted least squares for linear regression when errors are Normal/Independent distributed. In P. R. Krishnaiah, editor, Multivariate Analysis V, pages 35–57. North Holland Publishing Company, 1980.\n[11] M. Figueiredo. Adaptive sparseness using Jeffreys prior. In T. G. Dietterich, S. Becker, and Z. Ghahramani, editors, Advances in Neural Information Processing Systems 14, Cambridge, MA, 2002. MIT Press.\n[12] D. Geman and G. Reynolds. Constrained restoration and the recovery of discontinuities. IEEE Trans. Pattern Analysis and Machine Intelligence, 14(3):367–383, 1992.\n[13] Z. Ghahramani and M. J. Beal. Variational inference for Bayesian mixtures of factor analysers. In Advances in Neural Information Processing Systems 12. MIT Press, 2000.\n[14] M. Girolami. A variational method for learning sparse and overcomplete representations. Neural Computation, 13:2517–2532, 2001.\n[15] A. Honkela and H. Valpola. Unsupervised variational Bayesian learning of nonlinear models. In Advances in Neural Information Processing Systems 17. MIT Press, 2005.\n[16] T. S. Jaakkola. Variational Methods for Inference and Estimation in Graphical Models. PhD thesis, Massachusetts Institute of Technology, 1997.\n[17] T. S. Jaakkola and M. I. Jordan. A variational approach to Bayesian logistic regression models and their extensions. In Proceedings of the 1997 Conference on Artificial Intelligence and Statistics, 1997.\n[18] M. I. Jordan, Z. Ghahramani, T. S. Jaakkola, and L. K. Saul. An introduction to variational methods for graphical models. In M. I. Jordan, editor, Learning in Graphical Models. Kluwer Academic Publishers, 1998.\n[19] J. Keilson and F. W. Steutel. Mixtures of distributions, moment inequalities, and measures of exponentiality and normality. The Annals of Probability, 2:112–130, 1974.\n[20] H. Lappalainen. Ensemble learning for independent component analysis. In Proceedings of the First International Workshop on Independent Component Analysis, 1999.\n[21] D. J. C. MacKay. Bayesian interpolation. Neural Computation, 4(3):415–447, 1992.\n[22] D. J. C. MacKay. Ensemble learning and evidence maximization. Unpublished manuscript, 1995.\n[23] D. J. C. MacKay. Comparison of approximate methods for handling hyperparameters. Neural Computation, 11(5):1035–1068, 1999.\n[24] R. M. Neal and G. E. Hinton. A view of the EM algorithm that justifies incremental, sparse, and other variants. In M. I. Jordan, editor, Learning in Graphical Models, pages 355–368. Kluwer, 1998.\n[25] B. D. Rao, K. Engan, S. F. Cotter, J. Palmer, and K. Kreutz-Delgado. Subset selection in noise based on diversity measure minimization. IEEE Trans. Signal Processing, 51(3), 2003.\n[26] B. D. Rao and I. F. Gorodnitsky. Sparse signal reconstruction from limited data using FOCUSS: a re-weighted minimum norm algorithm. IEEE Trans. Signal Processing, 45:600–616, 1997.\n[27] R. T. Rockafellar. Convex Analysis. Princeton University Press, 1970.\n[28] S. Roweis and Z. Ghahramani. A unifying review of linear Gaussian models. Neural Computation, 11(5):305–345, 1999.\n[29] L. K. Saul, T. S. Jaakkola, and M. I. Jordan. Mean field theory for sigmoid belief networks. Journal of Artificial Intelligence Research, 4:61–76, 1996.\n[30] M. E. Tipping. Sparse Bayesian learning and the relevance vector machine. Journal of Machine Learning Research, 1:211–244, 2001.\n[31] D. V. Widder. The Laplace Transform. Princeton University Press, 1946.\n[32] D. Wipf, J. Palmer, and B. Rao. Perspectives on sparse Bayesian learning. In S. Thrun, L. Saul, and B. Schölkopf, editors, Advances in Neural Information Processing Systems 16, Cambridge, MA, 2003. MIT Press.\n", "award": [], "sourceid": 2933, "authors": [{"given_name": "Jason", "family_name": "Palmer", "institution": null}, {"given_name": "Kenneth", "family_name": "Kreutz-Delgado", "institution": null}, {"given_name": "Bhaskar", "family_name": "Rao", "institution": null}, {"given_name": "David", "family_name": "Wipf", "institution": null}]}