{"title": "Matrix Manifold Optimization for Gaussian Mixtures", "book": "Advances in Neural Information Processing Systems", "page_first": 910, "page_last": 918, "abstract": "We take a new look at parameter estimation for Gaussian Mixture Models (GMMs). Specifically, we advance Riemannian manifold optimization (on the manifold of positive definite matrices) as a potential replacement for Expectation Maximization (EM), which has been the de facto standard for decades. An out-of-the-box invocation of Riemannian optimization, however, fails spectacularly: it obtains the same solution as EM, but vastly slower. Building on intuition from geometric convexity, we propose a simple reformulation that has remarkable consequences: it makes Riemannian optimization not only match EM (a nontrivial result on its own, given the poor record nonlinear programming has had against EM), but also outperform it in many settings. To bring our ideas to fruition, we develop a well-tuned Riemannian LBFGS method that proves superior to known competing methods (e.g., Riemannian conjugate gradient). We hope that our results encourage a wider consideration of manifold optimization in machine learning and statistics.", "full_text": "Matrix Manifold Optimization for Gaussian Mixtures\n\nReshad Hosseini\nSchool of ECE\nCollege of Engineering\nUniversity of Tehran, Tehran, Iran\nreshad.hosseini@ut.ac.ir\n\nSuvrit Sra\nLaboratory for Information and Decision Systems\nMassachusetts Institute of Technology\nCambridge, MA\nsuvrit@mit.edu\n\nAbstract\n\nWe take a new look at parameter estimation for Gaussian Mixture Models (GMMs). Specifically, we advance Riemannian manifold optimization (on the manifold of positive definite matrices) as a potential replacement for Expectation Maximization (EM), which has been the de facto standard for decades. 
An out-of-the-box invocation of Riemannian optimization, however, fails spectacularly: it obtains the same solution as EM, but vastly slower. Building on intuition from geometric convexity, we propose a simple reformulation that has remarkable consequences: it makes Riemannian optimization not only match EM (a nontrivial result on its own, given the poor record nonlinear programming has had against EM), but also outperform it in many settings. To bring our ideas to fruition, we develop a well-tuned Riemannian LBFGS method that proves superior to known competing methods (e.g., Riemannian conjugate gradient). We hope that our results encourage a wider consideration of manifold optimization in machine learning and statistics.\n\n1 Introduction\n\nGaussian Mixture Models (GMMs) are a mainstay in a variety of areas, including machine learning and signal processing [4, 10, 16, 19, 21]. A quick literature search reveals that for estimating parameters of a GMM, the Expectation Maximization (EM) algorithm [9] is still the de facto choice. Over the decades, other numerical approaches have also been considered [24], but methods such as conjugate gradient, quasi-Newton, and Newton have been noted to be usually inferior to EM [34].\n\nThe key difficulty of applying standard nonlinear programming methods to GMMs is the positive definiteness (PD) constraint on covariances. Although an open subset of Euclidean space, this constraint can be difficult to impose, especially in higher dimensions. When approaching the boundary of the constraint set, the convergence speed of iterative methods can also be adversely affected. A partial remedy is to remove the PD constraint by using Cholesky decompositions, e.g., as exploited in semidefinite programming [7]. 
It is believed [30] that in general, the nonconvexity of this decomposition adds more stationary points and possibly spurious local minima.1 Another possibility is to formulate the PD constraint via a set of smooth convex inequalities [30] and apply interior-point methods. But such sophisticated methods can be much slower than simpler EM-like iterations (on several statistical problems), especially in higher dimensions [27].\n\nSince the key difficulty arises from the PD constraint, an appealing idea is to note that PD matrices form a Riemannian manifold [3, Ch. 6] and to invoke Riemannian manifold optimization [1, 6]. Indeed, if we operate on the manifold2, we implicitly satisfy the PD constraint, and may have a better chance at focusing on likelihood maximization. While attractive, this line of thinking also fails: an out-of-the-box invocation of manifold optimization is also vastly inferior to EM. Thus, we need a new approach to challenge the hegemony of EM; we outline one such new approach below.\n\n1Remarkably, using Cholesky with the reformulation in §2.2 does not add spurious local minima to GMMs.\n2Equivalently, on the interior of the constraint set, as is done by interior-point methods (their nonconvex versions); though these too turn out to be slow, as they are second-order methods.\n\nKey idea. Intuitively, the mismatch is in the geometry. For GMMs, the M-step of EM is a Euclidean convex optimization problem, whereas the GMM log-likelihood is not manifold convex3 even for a single Gaussian. If we could reformulate the likelihood so that the single-component maximization task (which is the analog of the M-step of EM for GMMs) becomes manifold convex, it might have a substantial empirical impact. 
This intuition supplies the missing link, and finally makes Riemannian manifold optimization not only match EM but often also greatly outperform it.\n\nTo summarize, the key contributions of our paper are the following:\n– Introduction of Riemannian manifold optimization for GMM parameter estimation, for which we show how a reformulation based on geodesic convexity is crucial to empirical success.\n– Development of a Riemannian LBFGS solver; here, our main contribution is the implementation of a powerful line-search procedure, which ensures convergence and makes LBFGS outperform both EM and manifold conjugate gradients. This solver may be of independent interest.\n\nWe provide substantive experimental evidence on both synthetic and real data. We compare manifold optimization, EM, and unconstrained Euclidean optimization that reformulates the problem using a Cholesky factorization of the inverse covariance matrices. Our results show that manifold optimization performs well across a wide range of parameter values and problem sizes. It is much less sensitive to overlapping data than EM, and displays much less variability in running times.\n\nThese results are very encouraging, and we believe that manifold optimization could open new algorithmic avenues for mixture models, and perhaps for other statistical estimation problems.\n\nNote. To aid reproducibility of our results, MATLAB implementations of our methods are available as a part of the MIXEST toolbox developed by our group [12]. The manifold CG method that we use is directly based on the excellent toolkit MANOPT [6].\n\nRelated work. Summarizing published work on EM is clearly impossible. So, let us briefly mention a few lines of related work. Xu and Jordan [34] examine several aspects of EM for GMMs and counter the claims of Redner and Walker [24], who claimed EM to be inferior to generic second-order nonlinear programming techniques. 
However, it is now well known (e.g., [34]) that EM can attain good likelihood values rapidly, and can scale to much larger problems than are amenable to second-order methods. A local convergence analysis of EM is available in [34], with more refined results in [18], who show that for data with low overlap EM can converge locally superlinearly. Our paper develops a Riemannian LBFGS method, which can also achieve local superlinear convergence.\n\nFor GMMs, some innovative gradient-based methods have also been suggested [22, 26], where the PD constraint is handled via a Cholesky decomposition of the covariance matrices. However, these works report results only for low-dimensional problems and (near) spherical covariances.\n\nOur idea of using manifold optimization for GMMs is new, though manifold optimization by itself is a well-developed subject. A classic reference is [29]; a more recent work is [1]; and even a MATLAB toolbox exists [6]. In machine learning, manifold optimization has witnessed increasing interest4, e.g., for low-rank optimization [15, 31], or optimization based on geodesic convexity [27, 33].\n\n2 Background and problem setup\n\nThe key object in this paper is the Gaussian Mixture Model (GMM), whose probability density is\n\np(x) := ∑_{j=1}^K α_j p_N(x; µ_j, Σ_j),  x ∈ R^d,\n\nwhere p_N is a (multivariate) Gaussian with mean µ ∈ R^d and covariance Σ ≻ 0. That is,\n\np_N(x; µ, Σ) := det(Σ)^{-1/2} (2π)^{-d/2} exp(-(1/2)(x - µ)^T Σ^{-1} (x - µ)).\n\nGiven i.i.d. samples {x_1, . . . , x_n}, we wish to estimate {µ̂_j ∈ R^d, Σ̂_j ≻ 0}_{j=1}^K and weights α̂ ∈ Δ_K, the K-dimensional probability simplex. This leads to the GMM optimization problem\n\nmax_{α ∈ Δ_K, {µ_j, Σ_j ≻ 0}_{j=1}^K}  ∑_{i=1}^n log( ∑_{j=1}^K α_j p_N(x_i; µ_j, Σ_j) ).  (2.1)\n\n3That is, convex along geodesics on the PD manifold.\n4Manifold optimization should not be confused with "manifold learning," a separate problem altogether.\n\nSolving Problem (2.1) can in general require exponential time [20].5 However, our focus is more pragmatic: similar to EM, we also seek to efficiently compute local solutions. Our methods are set in the framework of manifold optimization [1, 29]; so let us now recall some material on manifolds.\n\n2.1 Manifolds and geodesic convexity\n\nA smooth manifold is a non-Euclidean space that locally resembles Euclidean space [17]. For optimization, it is more convenient to consider Riemannian manifolds (smooth manifolds equipped with an inner product on the tangent space at each point). These manifolds possess structure that allows one to extend the usual nonlinear optimization algorithms [1, 29] to them.\n\nAlgorithms on manifolds often rely on geodesics, i.e., curves that join points along shortest paths. Geodesics help generalize Euclidean convexity to geodesic convexity. In particular, say M is a Riemannian manifold, and x, y ∈ M; also let γ be a geodesic joining x to y, such that\n\nγ_xy : [0, 1] → M,  γ_xy(0) = x,  γ_xy(1) = y.\n\nThen, a set A ⊆ M is geodesically convex if for all x, y ∈ A there is a geodesic γ_xy contained within A. Further, a function f : A → R is geodesically convex if for all x, y ∈ A, the composition f ◦ γ_xy : [0, 1] → R is convex in the usual sense.\n\nThe manifold of interest to us is P_d, the set of d × d symmetric positive definite matrices. 
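Before turning to the geometry of P_d, a quick computational note on (2.1): in practice the objective is evaluated with the log-sum-exp trick so that tiny component densities do not underflow. A minimal numpy/scipy sketch (the function name `gmm_loglik` is ours, not from the paper's MIXEST toolbox); it returns the average log-likelihood, i.e., the objective of (2.1) divided by n:

```python
import numpy as np
from scipy.special import logsumexp
from scipy.stats import multivariate_normal

def gmm_loglik(X, alphas, mus, Sigmas):
    """Average GMM log-likelihood (1/n) sum_i log sum_j alpha_j p_N(x_i; mu_j, Sigma_j)."""
    # n x K matrix with entries log(alpha_j) + log p_N(x_i; mu_j, Sigma_j)
    logp = np.column_stack([
        np.log(a) + multivariate_normal.logpdf(X, mean=m, cov=S)
        for a, m, S in zip(alphas, mus, Sigmas)
    ])
    # log-sum-exp across components avoids underflow, then average over samples
    return logsumexp(logp, axis=1).mean()
```

This per-sample average is exactly the "average log-likelihood (ALL)" quantity reported in the experiments of Section 4.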
At any point Σ ∈ P_d, the tangent space is isomorphic to the set of symmetric matrices, and the Riemannian metric at Σ is given by tr(Σ^{-1} dΣ Σ^{-1} dΣ). This metric induces the geodesic [3, Ch. 6]\n\nγ_{Σ_1,Σ_2}(t) := Σ_1^{1/2} (Σ_1^{-1/2} Σ_2 Σ_1^{-1/2})^t Σ_1^{1/2},  0 ≤ t ≤ 1.\n\nThus, a function f : P_d → R is geodesically convex on a set A if it satisfies\n\nf(γ_{Σ_1,Σ_2}(t)) ≤ (1 - t) f(Σ_1) + t f(Σ_2),  t ∈ [0, 1], Σ_1, Σ_2 ∈ A.\n\nSuch functions can be nonconvex in the Euclidean sense, but are globally optimizable due to geodesic convexity. This property has been important in some matrix-theoretic applications [3, 28], and has gained more extensive coverage in several recent works [25, 27, 33].\n\nWe emphasize that even though the mixture cost (2.1) is not geodesically convex, for GMM optimization geodesic convexity seems to play a crucial role, and it has a huge impact on convergence speed. This behavior is partially expected and analogous to EM, where a convex M-step makes the overall method much more practical. This intuition guides us to elicit geodesic convexity below.\n\n2.2 Problem reformulation\n\nWe begin with parameter estimation for a single Gaussian: although this has a closed-form solution (which ultimately benefits EM), it requires more subtle handling when using manifold optimization. Consider the following maximum likelihood parameter estimation for a single Gaussian:\n\nmax_{µ, Σ ≻ 0}  L(µ, Σ) := ∑_{i=1}^n log p_N(x_i; µ, Σ).  (2.2)\n\nAlthough (2.2) is a Euclidean convex problem, it is not geodesically convex on its domain R^d × P_d, which makes it geometrically handicapped when applying manifold optimization. To overcome this problem, we invoke a simple reparametrization6 that has far-reaching impact. 
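As an aside on the geometry just introduced, the geodesic γ_{Σ1,Σ2} costs only two eigendecompositions; the sketch below (helper names ours) also makes geodesic convexity tangible, since det γ(t) = det(Σ1)^{1-t} det(Σ2)^t, i.e., log det is exactly linear along these geodesics:

```python
import numpy as np

def spd_power(A, t):
    """A**t for a symmetric positive definite matrix A, via eigendecomposition."""
    w, V = np.linalg.eigh(A)
    return (V * w**t) @ V.T

def spd_geodesic(S1, S2, t):
    """gamma_{S1,S2}(t) = S1^{1/2} (S1^{-1/2} S2 S1^{-1/2})^t S1^{1/2}, 0 <= t <= 1."""
    R = spd_power(S1, 0.5)       # S1^{1/2}
    Rinv = spd_power(S1, -0.5)   # S1^{-1/2}
    return R @ spd_power(Rinv @ S2 @ Rinv, t) @ R
```

The endpoints satisfy γ(0) = Σ1 and γ(1) = Σ2, and for commuting (e.g., diagonal) endpoints the curve reduces to the entrywise geometric-mean path.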
More precisely, we augment the sample vectors x_i to instead consider y_i^T = [x_i^T 1]. Therewith, (2.2) turns into\n\nmax_{S ≻ 0}  L̂(S) := ∑_{i=1}^n log q_N(y_i; S),  (2.3)\n\nwhere q_N(y_i; S) := √(2π) exp(1/2) p_N(y_i; 0, S). Proposition 1 states the key property of (2.3).\n\nProposition 1. The map φ(S) ≡ -L̂(S), where L̂(S) is as in (2.3), is geodesically convex.\n\nWe omit the proof due to space limits; see [13] for details. Alternatively, see [28] for more general results on geodesic convexity.\n\nTheorem 2.1 shows that the solution to (2.3) yields the solution to the original problem (2.2) too.\n\n5Though under very strong assumptions, it has polynomial smoothed complexity [11].\n6This reparametrization in itself is probably folklore; its role in GMM optimization is what is crucial here.\n\nFigure 1: The effect of reformulation on the convergence speed of the manifold CG and manifold LBFGS methods (d = 35); note that the x-axis (time) is on a logarithmic scale. (a) Single Gaussian. (b) Mixtures of seven Gaussians. (Curves: LBFGS and CG on the reformulated versus the original MVN parameterization.)\n\nTheorem 2.1. If µ*, Σ* maximize (2.2), and if S* maximizes (2.3), then L̂(S*) = L(µ*, Σ*) for\n\nS* = [ Σ* + µ* µ*^T , µ* ; µ*^T , 1 ].\n\nProof. We express S in terms of new variables U, t, and s by writing S = [ U + s t t^T , s t ; s t^T , s ]. The objective function L̂(S) in terms of the new parameters becomes\n\nL̂(U, t, s) = n/2 - (n/2) log det(U) - ∑_{i=1}^n (1/2)(x_i - t)^T U^{-1} (x_i - t) - (nd/2) log(2π) - (n/2) log s - n/(2s).\n\nOptimizing L̂ over s > 0 we see that s* = 1 must hold. Hence, the objective reduces to a d-dimensional Gaussian log-likelihood, for which clearly U* = Σ* and t* = µ*.\n\nTheorem 2.1 shows that reformulation (2.3) is "faithful," as it leaves the optimum unchanged. Theorem 2.2 proves a local version of this result for GMMs.\n\nTheorem 2.2. A local maximum of the reparameterized GMM log-likelihood\n\nL̂({S_j}_{j=1}^K) := ∑_{i=1}^n log( ∑_{j=1}^K α_j q_N(y_i; S_j) )\n\nis a local maximum of the original log-likelihood\n\nL({µ_j, Σ_j}_{j=1}^K) := ∑_{i=1}^n log( ∑_{j=1}^K α_j p_N(x_i | µ_j, Σ_j) ).\n\nThe proof can be found in [13].\n\nTheorem 2.2 shows that we can replace problem (2.1) by one whose local maxima agree with those of (2.1), and whose individual components are geodesically convex. Figure 1 shows the true import of our reformulation: the dramatic impact on the empirical performance of Riemannian Conjugate-Gradient (CG) and Riemannian LBFGS for GMMs is unmistakable.\n\nThe final technical piece is to replace the simplex constraint α ∈ Δ_K to make the problem unconstrained. We do this via a commonly used change of variables [14]: η_k = log(α_k / α_K) for k = 1, . . . , K - 1. Assuming η_K = 0 is a constant, and writing α_j = exp(η_j) / ∑_{k=1}^K exp(η_k), the final GMM optimization problem is:\n\nmax_{{S_j ≻ 0}_{j=1}^K, {η_j}_{j=1}^{K-1}}  L̂({S_j}_{j=1}^K, {η_j}_{j=1}^{K-1}) := ∑_{i=1}^n log( ∑_{j=1}^K [exp(η_j) / ∑_{k=1}^K exp(η_k)] q_N(y_i; S_j) ).  (2.4)\n\nWe view (2.4) as a manifold optimization problem; specifically, it is an optimization problem on the product manifold (∏_{j=1}^K P_d) × R^{K-1}. The next section presents a method for solving it.\n\n3 Manifold Optimization\n\nIn unconstrained Euclidean optimization, one typically iterates two steps: (i) find a descent direction; and (ii) perform a line-search to obtain sufficient decrease and ensure convergence. On a Riemannian manifold, the descent direction is computed on the tangent space (this space varies smoothly as one moves along the manifold). At a point X, the tangent space T_X is the approximating vector space (see Fig. 2). Given a descent direction ξ_X ∈ T_X, line-search is performed along a smooth curve on the manifold (the red curve in Fig. 2). The derivative of this curve at X equals the descent direction ξ_X. We refer the reader to [1, 29] for an in-depth introduction to manifold optimization.\n\nSuccessful large-scale Euclidean methods such as conjugate-gradient and LBFGS combine gradients at the current point with gradients and descent directions from previous points to obtain a new descent direction. 
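Returning briefly to the reformulation: its faithfulness is easy to probe numerically, because the computation in the proof of Theorem 2.1 shows that once s = 1, L̂ and L coincide for any (µ, Σ), not just at the maximizers. A small sketch (assuming numpy/scipy; all variable names are ours):

```python
import numpy as np
from scipy.stats import multivariate_normal

def log_q(y, S):
    """log q_N(y; S) = log( sqrt(2*pi) * exp(1/2) * p_N(y; 0, S) ), cf. (2.3)."""
    return 0.5 * np.log(2 * np.pi) + 0.5 + multivariate_normal.logpdf(
        y, mean=np.zeros(len(y)), cov=S)

rng = np.random.default_rng(1)
d, n = 3, 20
X = rng.standard_normal((n, d))
mu = rng.standard_normal(d)
A = rng.standard_normal((d, d))
Sigma = A @ A.T + d * np.eye(d)    # an arbitrary PD covariance, not necessarily the MLE

# S as in Theorem 2.1 (bottom-right entry s = 1), and augmented samples y_i = [x_i; 1]
S = np.block([[Sigma + np.outer(mu, mu), mu[:, None]],
              [mu[None, :],              np.ones((1, 1))]])
Y = np.hstack([X, np.ones((n, 1))])

lhs = sum(log_q(y, S) for y in Y)                               # L_hat(S) of (2.3)
rhs = multivariate_normal.logpdf(X, mean=mu, cov=Sigma).sum()   # L(mu, Sigma) of (2.2)
```

The two sums agree up to floating-point error, so maximizing over the augmented PD matrices loses nothing relative to (2.2).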
To adapt such algorithms to manifolds, in addition to defining gradients on manifolds, we also need to define how to transport vectors in the tangent space at one point to vectors in the tangent space at another point.\n\nOn Riemannian manifolds, the gradient is simply a direction in the tangent space such that the inner product of the gradient with another tangent direction gives the directional derivative of the function. Formally, if g_X defines the inner product in the tangent space T_X, then\n\nDf(X)ξ = g_X(grad f(X), ξ),  for ξ ∈ T_X.\n\nFigure 2: Visualization of line-search on a manifold: X is a point on the manifold, T_X is the tangent space at X, ξ_X is a descent direction at X; the red curve is the curve along which line-search is performed.\n\nGiven a descent direction in the tangent space, the curve along which we perform line-search can be a geodesic. A map that takes a direction and a step length and returns the corresponding point on the geodesic is called an exponential map. Riemannian manifolds are also equipped with a natural way of transporting vectors along geodesics, called parallel transport. Intuitively, parallel transport is a differential map with zero derivative along the geodesics. Using the above ideas, Algorithm 1 sketches a generic manifold optimization algorithm.\n\nAlgorithm 1: Sketch of an optimization algorithm (CG, LBFGS) to minimize f(X) on a manifold\nGiven: Riemannian manifold M with Riemannian metric g; parallel transport T on M; exponential map R; initial value X_0; a smooth function f\nfor k = 0, 1, . . . do\n  Obtain a descent direction ξ_k based on stored information and grad f(X_k), using the metric g and transport T\n  Use line-search to find α such that it satisfies appropriate (descent) conditions\n  Calculate the retraction / update X_{k+1} = R_{X_k}(α ξ_k)\n  Based on the memory and needs of the algorithm, store X_k, grad f(X_k), and α ξ_k\nend for\nreturn estimated minimum X_k\n\nNote that Cartesian products of Riemannian manifolds are again Riemannian, with the exponential map, gradient, and parallel transport defined as the Cartesian products of the individual expressions; the inner product is defined as the sum of the inner products of the components in their respective manifolds.\n\nTable 1: Summary of key Riemannian objects for the PD matrix manifold.\nDefinition | Expression for PD matrices\nTangent space | space of symmetric matrices\nMetric between two tangent vectors ξ, η at Σ | g_Σ(ξ, η) = tr(Σ^{-1} ξ Σ^{-1} η)\nGradient at Σ if the Euclidean gradient is ∇f(Σ) | grad f(Σ) = (1/2) Σ (∇f(Σ) + ∇f(Σ)^T) Σ\nExponential map at Σ in direction ξ | R_Σ(ξ) = Σ exp(Σ^{-1} ξ)\nParallel transport of tangent vector ξ from Σ_1 to Σ_2 | T_{Σ_1,Σ_2}(ξ) = E ξ E^T, with E = (Σ_2 Σ_1^{-1})^{1/2}\n\nDifferent variants of Riemannian LBFGS can be obtained depending on where one performs the vector transport. We found that the version developed in [28] gives the best performance, once we combine it with a line-search algorithm satisfying Wolfe conditions. We present the crucial details below.\n\n3.1 Line-search algorithm satisfying Wolfe conditions\n\nTo ensure that Riemannian LBFGS always produces a descent direction, it is necessary to ensure that the line-search algorithm satisfies the Wolfe conditions [25]. 
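Before detailing the line-search, note that the four objects in Table 1 translate directly into a few lines of code; a minimal numpy/scipy sketch for prototyping (function names are ours; the paper's MATLAB versions live in MANOPT/MIXEST):

```python
import numpy as np
from scipy.linalg import expm, sqrtm

def metric(Sigma, xi, eta):
    """g_Sigma(xi, eta) = tr(Sigma^{-1} xi Sigma^{-1} eta)."""
    Si = np.linalg.inv(Sigma)
    return np.trace(Si @ xi @ Si @ eta)

def rgrad(Sigma, egrad):
    """Riemannian gradient (1/2) Sigma (G + G^T) Sigma from the Euclidean gradient G."""
    return 0.5 * Sigma @ (egrad + egrad.T) @ Sigma

def expmap(Sigma, xi):
    """Exponential map R_Sigma(xi) = Sigma expm(Sigma^{-1} xi); stays PD for symmetric xi."""
    return Sigma @ expm(np.linalg.solve(Sigma, xi))

def transport(S1, S2, xi):
    """Parallel transport E xi E^T with E = (S2 S1^{-1})^{1/2}."""
    E = np.real(sqrtm(S2 @ np.linalg.inv(S1)))   # principal square root; imaginary dust dropped
    return E @ xi @ E.T
```

Two sanity checks: for f(Σ) = log det Σ the Euclidean gradient is Σ^{-1}, so `rgrad` returns Σ and g_Σ(Σ, ξ) = tr(Σ^{-1}ξ) reproduces the directional derivative; and the transport is a metric isometry, which is what keeps transported past gradients comparable across iterates.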
These conditions are given by\n\nf(R_{X_k}(α ξ_k)) ≤ f(X_k) + c_1 α Df(X_k) ξ_k,  (3.1)\n\nDf(R_{X_k}(α ξ_k)) T_{X_k, R_{X_k}(α ξ_k)}(ξ_k) ≥ c_2 Df(X_k) ξ_k,  (3.2)\n\nwhere 0 < c_1 < c_2 < 1. Note that Df(X_k) ξ_k = g_{X_k}(grad f(X_k), ξ_k), i.e., the derivative of f at X_k in the direction ξ_k is the inner product of the descent direction and the gradient of the function. Practical line-search algorithms implement a stronger (strong Wolfe) version of (3.2) that enforces\n\n|Df(R_{X_k}(α ξ_k)) T_{X_k, R_{X_k}(α ξ_k)}(ξ_k)| ≤ c_2 |Df(X_k) ξ_k|.\n\nSimilar to the Euclidean case, our line-search algorithm is divided into two phases: bracketing and zooming [23]. During bracketing, we compute an interval such that a point satisfying the Wolfe conditions can be found in this interval. In the zooming phase, we obtain such a point in the determined interval. The one-dimensional function and its gradient used by the line-search are\n\nφ(α) = f(R_{X_k}(α ξ_k)),  φ'(α) = Df(R_{X_k}(α ξ_k)) T_{X_k, R_{X_k}(α ξ_k)}(ξ_k).\n\nThe algorithm is essentially the same as line-search in Euclidean space; the reader can also see its manifold incarnation in [13]. The theory behind why this algorithm is guaranteed to find a step-length satisfying the (strong) Wolfe conditions can be found in [23].\n\nA good choice of the initial step-length α_1 can greatly speed up the line-search. 
We propose the following choice, which turns out to be quite effective in our experiments:\n\nα_1 = 2 (f(X_k) - f(X_{k-1})) / Df(X_k) ξ_k.  (3.3)\n\nEquation (3.3) is obtained by finding the α* that minimizes a quadratic approximation of the function along the geodesic through the previous point (based on f(X_{k-1}), f(X_k), and Df(X_{k-1}) ξ_{k-1}):\n\nα* = 2 (f(X_k) - f(X_{k-1})) / Df(X_{k-1}) ξ_{k-1},  (3.4)\n\nand then assuming that the first-order change will be the same as in the previous step, i.e.,\n\nα* Df(X_{k-1}) ξ_{k-1} ≈ α_1 Df(X_k) ξ_k.  (3.5)\n\nCombining (3.4) and (3.5), we obtain the estimate α_1 expressed in (3.3). Nocedal and Wright [23] suggest using either the α* of (3.4) as the initial step-length α_1, or using (3.5) with α* set to the step-length obtained by the line-search at the previous point. We observed that using (3.3) gives substantially better performance than either of these two approaches.\n\n4 Experimental Results\n\nWe have performed numerous experiments to examine the effectiveness of our method. Below we report performance comparisons on both real and simulated data. In all experiments, we initialize the mixture parameters for all methods using k-means++ [2]. All methods also use the same termination criteria: they stop either when the difference of average log-likelihood (i.e., (1/n) × log-likelihood) between consecutive iterations falls below 10^{-6}, or when the number of iterations exceeds 1500. More extensive empirical results can be found in the longer version of this paper [13].\n\nSimulated Data\n\nEM's performance is well known to depend on the degree of separation of the mixture components [18, 34]. To assess the impact of this separation on our methods, we generate data as proposed in [8, 32]. 
The distributions are chosen so that their means satisfy the following inequality:\n\n∀ i ≠ j :  ‖m_i - m_j‖ ≥ c max{tr(Σ_i), tr(Σ_j)},\n\nwhere c models the degree of separation. Since mixtures with high eccentricity (i.e., a large ratio between the largest and smallest eigenvalues of the covariance matrix) have smaller overlap, in addition to the high-eccentricity case e = 10 we also test the spherical case e = 1. We test three levels of separation, c = 0.2 (low), c = 1 (medium), and c = 5 (high), and two numbers of mixture components, K = 2 and K = 5; we consider larger values of K in our experiments on real data. For e = 10, the results for data with dimensionality d = 20 are given in Table 2. The results are obtained after running with 20 different random choices of parameters for each configuration.\n\nTable 2: Speed and average log-likelihood (ALL) comparisons for d = 20, e = 10 (each row reports values averaged over 20 runs over different datasets, so the ALL values are not comparable to each other).\n             | EM Original          | LBFGS Reformulated  | CG Reformulated     | CG Original\n             | Time (s)       ALL   | Time (s)      ALL   | Time (s)      ALL   | Time (s)        ALL\nc=0.2, K=2   | 1.1 ± 0.4     -10.7  | 5.6 ± 2.7    -10.7  | 3.7 ± 1.5    -10.8  | 23.8 ± 23.7    -10.7\nc=0.2, K=5   | 30.0 ± 45.5   -12.7  | 49.2 ± 35.0  -12.7  | 47.8 ± 40.4  -12.7  | 206.0 ± 94.2   -12.8\nc=1,   K=2   | 0.5 ± 0.2     -10.4  | 3.1 ± 0.8    -10.4  | 2.6 ± 0.6    -10.4  | 25.6 ± 13.6    -10.4\nc=1,   K=5   | 104.1 ± 113.8 -13.4  | 79.9 ± 62.8  -13.3  | 45.8 ± 30.4  -13.3  | 144.3 ± 48.1   -13.3\nc=5,   K=2   | 0.2 ± 0.2     -11.0  | 3.4 ± 1.4    -11.0  | 2.8 ± 1.2    -11.0  | 43.2 ± 38.8    -11.0\nc=5,   K=5   | 38.8 ± 65.8   -12.8  | 41.0 ± 45.7  -12.8  | 29.2 ± 36.3  -12.8  | 197.6 ± 118.2  -12.8\n\nTable 3: Speed and ALL comparisons for d = 20, e = 1.\n             | EM Original          | LBFGS Reformulated  | CG Reformulated     | CG Original\n             | Time (s)       ALL   | Time (s)      ALL   | Time (s)      ALL   | Time (s)        ALL\nc=0.2, K=2   | 65.7 ± 33.1    17.6  | 39.4 ± 19.3   17.6  | 46.4 ± 29.9   17.6  | 64.0 ± 50.4     17.6\nc=0.2, K=5   | 365.6 ± 138.8  17.5  | 160.9 ± 65.9  17.5  | 207.6 ± 46.9  17.5  | 279.8 ± 169.3   17.5\nc=1,   K=2   | 6.0 ± 7.1      17.0  | 12.9 ± 13.0   17.0  | 15.7 ± 17.5   17.0  | 42.5 ± 21.9     17.0\nc=1,   K=5   | 40.5 ± 61.1    16.2  | 51.6 ± 39.5   16.2  | 63.7 ± 45.8   16.2  | 203.1 ± 96.3    16.2\nc=5,   K=2   | 0.2 ± 0.1      17.1  | 3.0 ± 0.5     17.1  | 2.8 ± 0.7     17.1  | 19.6 ± 8.2      17.1\nc=5,   K=5   | 17.5 ± 45.6    16.1  | 20.6 ± 22.5   16.1  | 20.3 ± 24.1   16.1  | 93.9 ± 42.4     16.1\n\nTable 4: Speed and ALL for applying CG on Cholesky-factorized problems with d = 20.\n             | CG Cholesky Original                        | CG Cholesky Reformulated\n             | e=1: Time (s)   ALL  | e=10: Time (s)  ALL  | e=1: Time (s)  ALL  | e=10: Time (s)  ALL\nc=0.2, K=2   | 101.5 ± 34.1   17.6  | 113.9 ± 48.1   -10.7 | 36.7 ± 9.8     17.6 | 23.5 ± 11.9    -10.7\nc=0.2, K=5   | 627.1 ± 247.3  17.5  | 521.9 ± 186.9  -12.7 | 156.7 ± 81.1   17.5 | 106.7 ± 39.7   -12.6\nc=1,   K=2   | 135.2 ± 65.4   16.9  | 110.9 ± 51.8   -10.4 | 38.0 ± 14.5    16.9 | 49.0 ± 17.8    -10.4\nc=1,   K=5   | 1016.9 ± 299.8 16.2  | 358.0 ± 155.5  -13.3 | 266.7 ± 140.5  16.2 | 279.8 ± 111.0  -13.4\nc=5,   K=2   | 55.2 ± 27.9    17.1  | 86.7 ± 47.2    -11.0 | 60.2 ± 20.8    17.1 | 177.6 ± 147.6  -11.0\nc=5,   K=5   | 371.7 ± 281.4  16.1  | 337.7 ± 178.4  -12.8 | 270.2 ± 106.5  16.1 | 562.1 ± 242.7  -12.9\n\nIt is apparent that the performance of EM and of Riemannian optimization with our reformulation is very similar. The variance of the computation time shown by Riemannian optimization is, however, notably smaller. 
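For readers reproducing such experiments, the separation condition above can be enforced with a simple rejection sampler when drawing component means; a sketch (the function, its `scale` proposal parameter, and all defaults are ours, not the generator of [8, 32]):

```python
import numpy as np

def separated_means(K, d, c, Sigmas, rng, scale=10.0, max_tries=10000):
    """Draw K means with ||m_i - m_j|| >= c * max(tr(Sigma_i), tr(Sigma_j)) for all i != j."""
    traces = [np.trace(S) for S in Sigmas]
    for _ in range(max_tries):
        M = rng.standard_normal((K, d)) * scale      # Gaussian proposal for the means
        ok = all(np.linalg.norm(M[i] - M[j]) >= c * max(traces[i], traces[j])
                 for i in range(K) for j in range(i + 1, K))
        if ok:
            return M
    raise RuntimeError("no separated configuration found; increase scale or max_tries")
```

Larger c (or larger covariance traces) makes rejection more likely, so `scale` should grow roughly with c times the largest trace.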
Manifold optimization on the non-reformulated problem (last column) performs the worst.\n\nIn another set of simulated-data experiments, we apply the different algorithms to spherical data (e = 1); the results are shown in Table 3. The interesting instance here is the case of low separation, c = 0.2, where the condition number of the Hessian becomes large. As predicted by theory, EM converges very slowly in such a case; Table 3 confirms this claim. It is known that in this case the performance of powerful optimization approaches like CG and LBFGS also degrades [23]. But both CG and LBFGS suffer less than EM, while LBFGS performs noticeably better than CG.\n\nCholesky decomposition is a commonly suggested idea for dealing with the PD constraint. So, we also compare against unconstrained optimization (using Euclidean CG), where the inverse covariance matrices are Cholesky factorized. The results for the same data as in Tables 2 and 3 are reported in Table 4. Although the Cholesky-factorized problem proves to be much inferior to both EM and the manifold methods, our reformulation seems to help it as well in several problem instances.\n\nReal Data\n\nWe now present a performance evaluation on a natural image dataset, where mixtures of Gaussians were reported to be a good fit to the data [35]. We extracted 200,000 image patches of size 6×6 from images and subtracted the DC component, leaving us with 35-dimensional vectors. The performance of the different algorithms is reported in Table 5. 
Table 5: Speed and ALL comparisons for natural image data, d = 35.\n       | EM Algorithm     | LBFGS Reformulated | CG Reformulated   | CG Original       | CG Cholesky Reformulated\n       | Time (s)    ALL  | Time (s)     ALL   | Time (s)    ALL   | Time (s)     ALL  | Time (s)     ALL\nK = 2  | 16.61      29.28 | 14.23       29.28  | 17.52      29.28  | 947.35      29.28 | 476.77      29.28\nK = 3  | 90.54      30.95 | 38.29       30.95  | 54.37      30.95  | 3051.89     30.95 | 1046.61     30.95\nK = 4  | 165.77     31.65 | 106.53      31.65  | 153.94     31.65  | 6380.01     31.64 | 2673.21     31.65\nK = 5  | 202.36     32.07 | 117.14      32.07  | 140.21     32.07  | 5262.27     32.07 | 3865.30     32.07\nK = 6  | 228.80     32.36 | 245.74      32.35  | 281.32     32.35  | 10566.76    32.33 | 4771.36     32.35\nK = 7  | 365.28     32.63 | 192.44      32.63  | 318.95     32.63  | 10844.52    32.63 | 6819.42     32.63\nK = 8  | 596.01     32.81 | 332.85      32.81  | 536.94     32.81  | 14282.80    32.58 | 9306.33     32.81\nK = 9  | 900.88     32.94 | 657.24      32.94  | 1449.52    32.95  | 15774.88    32.77 | 9383.98     32.94\nK = 10 | 2159.47    33.05 | 658.34      33.06  | 1048.00    33.06  | 17711.87    33.03 | 7463.72     33.05\n\nFigure 3: Best ALL minus current ALL values against the number of function and gradient evaluations. Left: 'magic telescope' (K = 5, d = 10). Middle: 'year predict' (K = 6, d = 90). Right: natural images (K = 8, d = 35).\n\nSimilar to the simulated results, the performance of EM and of manifold CG on the reformulated parameter space is similar. Manifold LBFGS converges notably faster than both EM and CG (except for K = 6). Without our reformulation, the performance of the manifold methods degrades substantially. Note that for K = 8 and K = 9, CG without reformulation stops prematurely because it hits the bound of a maximum of 1500 iterations, and therefore its ALL is smaller than that of the other two methods. The table also shows results for the Cholesky-factorized (and reformulated) problem; it is more than 10 times slower than manifold optimization. 
Optimizing the Cholesky-factorized (non-reformulated) problem is the slowest (not shown): it always reaches the maximum number of iterations before finding a local minimum.
Fig. 3 depicts the typical behavior of our manifold optimization methods versus EM. The X-axis is the number of log-likelihood and gradient evaluations (or the number of E- and M-steps for EM). Fig. 3(a) and Fig. 3(b) show the results of fitting GMMs to the 'magic telescope' and 'year prediction' datasets (see footnote 7). Fig. 3(c) shows the result for the natural image data of Table 5. In the initial few iterations EM is faster, but the manifold optimization methods catch up with EM within a few iterations. This is remarkable, given that the manifold optimization methods must perform a line-search.
5 Conclusions and future work
We introduced Riemannian manifold optimization as an alternative to EM for fitting Gaussian mixture models. We demonstrated that for manifold optimization to succeed, that is, to either match or outperform EM, it is necessary to represent the parameters in a different space and to reformulate the cost function accordingly. Extensive experimentation with both simulated and real datasets yielded quite encouraging results, suggesting that manifold optimization has the potential to open new algorithmic avenues for mixture modeling.
Several strands of practical importance are immediate (and are part of our ongoing work): (i) extension to large-scale GMMs through stochastic optimization [5]; (ii) use of richer classes of priors with GMMs than the usual inverse Wishart priors (typically used because they make the M-step convenient), which are just one instance of the geodesically convex priors our methods can handle; (iii) incorporation of penalties for avoiding tiny clusters, an idea that fits easily in our framework but not so easily in the EM framework.
Finally, beyond GMMs, extension to other mixture models will be fruitful.

Acknowledgments. SS was partially supported by NSF grant IIS-1409802.

7. Available at the UCI machine learning dataset repository via https://archive.ics.uci.edu/ml/datasets

[Figure 3 panels: ALL* minus ALL (log scale) versus the number of function and gradient evaluations, with curves for EM (original MVN), LBFGS (reformulated MVN), and CG (reformulated MVN).]

References
[1] P.-A. Absil, R. Mahony, and R. Sepulchre. Optimization Algorithms on Matrix Manifolds. Princeton University Press, 2009.
[2] D. Arthur and S. Vassilvitskii. k-means++: The advantages of careful seeding. In Proceedings of the Eighteenth Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), pages 1027-1035, 2007.
[3] R. Bhatia. Positive Definite Matrices. Princeton University Press, 2007.
[4] C. M. Bishop. Pattern Recognition and Machine Learning. Springer, 2007.
[5] S. Bonnabel. Stochastic gradient descent on Riemannian manifolds. IEEE Transactions on Automatic Control, 58(9):2217-2229, 2013.
[6] N. Boumal, B. Mishra, P.-A. Absil, and R. Sepulchre. Manopt, a Matlab toolbox for optimization on manifolds. The Journal of Machine Learning Research, 15(1):1455-1459, 2014.
[7] S. Burer, R. D. Monteiro, and Y. Zhang. Solving semidefinite programs via nonlinear programming. Part I: Transformations and derivatives. Technical Report TR99-17, Rice University, Houston, TX, 1999.
[8] S. Dasgupta. Learning mixtures of Gaussians. In Foundations of Computer Science, 40th Annual Symposium on, pages 634-644. IEEE, 1999.
[9] A. P. Dempster, N. M.
Laird, and D. B. Rubin. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B, 39:1-38, 1977.
[10] R. O. Duda, P. E. Hart, and D. G. Stork. Pattern Classification. John Wiley & Sons, 2nd edition, 2000.
[11] R. Ge, Q. Huang, and S. M. Kakade. Learning mixtures of Gaussians in high dimensions. arXiv preprint arXiv:1503.00424, 2015.
[12] R. Hosseini and M. Mash'al. MixEst: An estimation toolbox for mixture models. arXiv preprint arXiv:1507.06065, 2015.
[13] R. Hosseini and S. Sra. Differential geometric optimization for Gaussian mixture models. arXiv preprint arXiv:1506.07677, 2015.
[14] M. I. Jordan and R. A. Jacobs. Hierarchical mixtures of experts and the EM algorithm. Neural Computation, 6(2):181-214, 1994.
[15] M. Journée, F. Bach, P.-A. Absil, and R. Sepulchre. Low-rank optimization on the cone of positive semidefinite matrices. SIAM Journal on Optimization, 20(5):2327-2351, 2010.
[16] R. W. Keener. Theoretical Statistics. Springer Texts in Statistics. Springer, 2010.
[17] J. M. Lee. Introduction to Smooth Manifolds. Number 218 in GTM. Springer, 2012.
[18] J. Ma, L. Xu, and M. I. Jordan. Asymptotic convergence rate of the EM algorithm for Gaussian mixtures. Neural Computation, 12(12):2881-2907, 2000.
[19] G. J. McLachlan and D. Peel. Finite Mixture Models. John Wiley and Sons, New Jersey, 2000.
[20] A. Moitra and G. Valiant. Settling the polynomial learnability of mixtures of Gaussians. In Foundations of Computer Science (FOCS), 2010 51st Annual IEEE Symposium on, pages 93-102. IEEE, 2010.
[21] K. P. Murphy. Machine Learning: A Probabilistic Perspective. MIT Press, 2012.
[22] I. Naim and D. Gildea. Convergence of the EM algorithm for Gaussian mixtures with unbalanced mixing coefficients. In ICML-12, pages 1655-1662, 2012.
[23] J. Nocedal and S. J. Wright. Numerical Optimization. Springer, 2006.
[24] R. A.
Redner and H. F. Walker. Mixture densities, maximum likelihood, and the EM algorithm. SIAM Review, 26:195-239, 1984.
[25] W. Ring and B. Wirth. Optimization methods on Riemannian manifolds and their application to shape space. SIAM Journal on Optimization, 22(2):596-627, 2012.
[26] R. Salakhutdinov, S. T. Roweis, and Z. Ghahramani. Optimization with EM and Expectation-Conjugate-Gradient. In ICML-03, pages 672-679, 2003.
[27] S. Sra and R. Hosseini. Geometric optimisation on positive definite matrices for elliptically contoured distributions. In Advances in Neural Information Processing Systems, pages 2562-2570, 2013.
[28] S. Sra and R. Hosseini. Conic geometric optimization on the manifold of positive definite matrices. SIAM Journal on Optimization, 25(1):713-739, 2015.
[29] C. Udrişte. Convex Functions and Optimization Methods on Riemannian Manifolds. Kluwer, 1994.
[30] R. J. Vanderbei and H. Y. Benson. On formulating semidefinite programming problems as smooth convex nonlinear optimization problems. Technical report, 2000.
[31] B. Vandereycken. Low-rank matrix completion by Riemannian optimization. SIAM Journal on Optimization, 23(2):1214-1236, 2013.
[32] J. J. Verbeek, N. Vlassis, and B. Kröse. Efficient greedy learning of Gaussian mixture models. Neural Computation, 15(2):469-485, 2003.
[33] A. Wiesel. Geodesic convexity and covariance estimation. IEEE Transactions on Signal Processing, 60(12):6182-6189, 2012.
[34] L. Xu and M. I. Jordan. On convergence properties of the EM algorithm for Gaussian mixtures. Neural Computation, 8:129-151, 1996.
[35] D. Zoran and Y. Weiss.
Natural images, Gaussian mixtures and dead leaves. In Advances in Neural Information Processing Systems, pages 1736-1744, 2012.