{"title": "Local Maxima in the Likelihood of Gaussian Mixture Models: Structural Results and Algorithmic Consequences", "book": "Advances in Neural Information Processing Systems", "page_first": 4116, "page_last": 4124, "abstract": "We provide two fundamental results on the population (infinite-sample) likelihood function of Gaussian mixture models with $M \\geq 3$ components. Our first main result shows that the population likelihood function has bad local maxima even in the special case of equally-weighted mixtures of well-separated and spherical Gaussians. We prove that the log-likelihood value of these bad local maxima can be arbitrarily worse than that of any global optimum, thereby resolving an open question of Srebro (2007). Our second main result shows that the EM algorithm (or a first-order variant of it) with random initialization will converge to bad critical points with probability at least $1-e^{-\\Omega(M)}$. We further establish that a first-order variant of EM will not converge to strict saddle points almost surely, indicating that the poor performance of the first-order method can be attributed to the existence of bad local maxima rather than bad saddle points. Overall, our results highlight the necessity of careful initialization when using the EM algorithm in practice, even when applied in highly favorable settings.", "full_text": "Local Maxima in the Likelihood of Gaussian Mixture\n\nModels: Structural Results and Algorithmic\n\nConsequences\n\nChi Jin\n\nUC Berkeley\n\nchijin@cs.berkeley.edu\n\nYuchen Zhang\nUC Berkeley\n\nyuczhang@berkeley.edu\n\nSivaraman Balakrishnan\nCarnegie Mellon University\n\nsiva@stat.cmu.edu\n\nMartin J. Wainwright\n\nUC Berkeley\n\nwainwrig@berkeley.edu\n\nMichael I. Jordan\n\nUC Berkeley\n\njordan@cs.berkeley.edu\n\nAbstract\n\nWe provide two fundamental results on the population (in\ufb01nite-sample) likelihood\nfunction of Gaussian mixture models with M \u2265 3 components. 
Our first main result shows that the population likelihood function has bad local maxima even in the special case of equally-weighted mixtures of well-separated and spherical Gaussians. We prove that the log-likelihood value of these bad local maxima can be arbitrarily worse than that of any global optimum, thereby resolving an open question of Srebro [2007]. Our second main result shows that the EM algorithm (or a first-order variant of it) with random initialization will converge to bad critical points with probability at least 1 − e^{−Ω(M)}. We further establish that a first-order variant of EM will not converge to strict saddle points almost surely, indicating that the poor performance of the first-order method can be attributed to the existence of bad local maxima rather than bad saddle points. Overall, our results highlight the necessity of careful initialization when using the EM algorithm in practice, even when applied in highly favorable settings.

1 Introduction

Finite mixture models are widely used in a variety of statistical settings: as models for heterogeneous populations, as flexible models for multivariate density estimation, and as models for clustering. Their ability to model data as arising from underlying subpopulations provides essential flexibility in a wide range of applications [Titterington, 1985]. This combinatorial structure also creates challenges for statistical and computational theory, and many problems associated with the estimation of finite mixtures remain open. 
These problems are often studied in the setting of Gaussian mixture models (GMMs), reflecting the wide use of GMMs in applications, particularly in the multivariate setting, and this setting will also be our focus in the current paper.
Early work [Teicher, 1963] studied the identifiability of finite mixture models, and this problem has continued to attract significant interest (see Allman et al. [2009] for a recent overview). More recent theoretical work has focused on issues related to the use of GMMs for the density estimation problem [Genovese and Wasserman, 2000, Ghosal and Van Der Vaart, 2001]. Focusing on rates of convergence for parameter estimation in GMMs, Chen [1995] established the surprising result that when the number of mixture components is unknown, the standard √n-rate for regular parametric models is not achievable. Recent investigations [Ho and Nguyen, 2015] into exact-fitted, under-fitted and over-fitted GMMs have characterized the achievable rates of convergence in these settings.

30th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain.

From an algorithmic perspective, the dominant practical method for estimating GMMs is the Expectation-Maximization (EM) algorithm [Dempster et al., 1977]. The EM algorithm is an ascent method for maximizing the likelihood, but it is only guaranteed to converge to a stationary point of the likelihood function. As such, there are no general guarantees for the quality of the estimate produced via the EM algorithm for Gaussian mixture models.¹ This has led researchers to explore various alternative algorithms which are computationally efficient, and for which rigorous statistical guarantees can be given. 
Broadly, these algorithms are based either on clustering [Arora et al., 2005,\nDasgupta and Schulman, 2007, Vempala and Wang, 2002, Chaudhuri and Rao, 2008] or on the\nmethod of moments [Belkin and Sinha, 2010, Moitra and Valiant, 2010, Hsu and Kakade, 2013].\nAlthough general guarantees have not yet emerged, there has nonetheless been substantial progress\non the theoretical analysis of EM and its variations. Dasgupta and Schulman [2007] analyzed a\ntwo-round variant of EM, which involved over-\ufb01tting the mixture and then pruning extra centers.\nThey showed that this algorithm can be used to estimate Gaussian mixture components whose means\nare separated by at least \u2126(d1/4). Balakrishnan et al. [2015] studied the local convergence of the\nEM algorithm for a mixture of two Gaussians with \u2126(1)-separation. Their results show that global\noptima have relatively large regions of attraction, but still require that the EM algorithm be provided\nwith a reasonable initialization in order to ensure convergence to a near globally optimal solution.\nTo date, computationally ef\ufb01cient algorithms for estimating a GMM provide guarantees under the\nstrong assumption that the samples come from a mixture of Gaussians\u2014i.e., that the model is well-\nspeci\ufb01ed. In practice however, we never expect the data to exactly follow the generative model, and it\nis important to understand the robustness of our algorithms to this assumption. In fact, maximum\nlikelihood has favorable properties in this regard\u2014maximum-likelihood estimates are well known to\nbe robust to perturbations in the Kullback-Leibler metric of the generative model [Donoho and Liu,\n1988]. This mathematical result motivates further study of EM and other likelihood-based methods\nfrom the computational point of view. 
It would be useful to characterize when ef\ufb01cient algorithms\ncan be used to compute a maximum likelihood estimate, or a solution that is nearly as accurate, and\nwhich retains the robustness properties of the maximum likelihood estimate.\nIn this paper, we focus our attention on uniformly weighted mixtures of M isotropic Gaussians. For\nthis favorable setting, Srebro [2007] conjectured that any local maximum of the likelihood function\nis a global maximum in the limit of in\ufb01nite samples\u2014in other words, that there are no bad local\nmaxima for the population GMM likelihood function. This conjecture, if true, would provide strong\ntheoretical justi\ufb01cation for EM, at least for large sample sizes. For suitably small sample sizes, it is\nknown [Am\u00e9ndola et al., 2015] that con\ufb01gurations of the samples can be constructed which lead to the\nlikelihood function having an unbounded number of local maxima. The conjecture of Srebro [2007]\navoids this by requiring that the samples come from the speci\ufb01ed GMM, as well as by considering the\n(in\ufb01nite-sample-size) population setting. In the context of high-dimensional regression, it has been\nobserved that in some cases despite having a non-convex objective function, every local optimum of\nthe objective is within a small, vanishing distance of a global optimum [see, e.g., Loh and Wainwright,\n2013, Wang et al., 2014]. In these settings, it is indeed the case that for suf\ufb01ciently large sample sizes\nthere are no bad local optima.\n\nA mixture of two spherical Gaussians: A Gaussian mixture model with a single component is\nsimply a Gaussian, so the conjecture of Srebro [2007] holds trivially in this case. The \ufb01rst interesting\ncase is a Gaussian mixture with two components, for which empirical evidence supports the conjecture\nthat there are no bad local optima. 
It is possible to visualize the setting when there are only two components and to develop a more detailed understanding of the population likelihood surface. Consider, for instance, a one-dimensional equally weighted unit-variance GMM with true centers μ*₁ = −4 and μ*₂ = 4, and consider the log-likelihood as a function of the vector μ := (μ₁, μ₂). Figure 1 shows both the population log-likelihood, μ ↦ L(μ), and the negative 2-norm of its gradient, μ ↦ −‖∇L(μ)‖₂. Observe that the only local maxima are the vectors (−4, 4) and (4, −4), which are both also global maxima. The only remaining critical point is (0, 0), which is a saddle point. Although points of the form (0, R) and (R, 0) have small gradient when |R| is large, the gradient is not exactly zero for any finite R. Rigorously resolving the question of the existence or non-existence of bad local maxima in the setting M = 2 remains an open problem.

¹In addition to issues of convergence to non-maximal stationary points, solutions of infinite likelihood exist for GMMs in which both the location and scale parameters are estimated. In practice, several methods exist to avoid such solutions. In this paper, we avoid this issue by focusing on GMMs in which the scale parameters are fixed.

In the remainder of our paper, we focus our attention on the setting where there are more than two mixture components and attempt to develop a broader understanding of likelihood surfaces for these models, as well as the consequences for algorithms.

Figure 1: Illustration of the likelihood and gradient maps for a two-component Gaussian mixture. (a) Plot of the population log-likelihood map μ ↦ L(μ). 
(b) Plot of the negative Euclidean norm of the gradient map μ ↦ −‖∇L(μ)‖₂.

Our first contribution is a negative answer to the open question of Srebro [2007]. We construct a GMM that is a uniform mixture of three spherical, unit-variance, well-separated Gaussians whose population log-likelihood function contains local maxima. We further show that the log-likelihood of these local maxima can be arbitrarily worse than that of the global maxima. This result immediately implies that no local search algorithm can exhibit global convergence (meaning convergence to a global optimum from all possible starting points), even on well-separated mixtures of Gaussians.
The mere existence of bad local maxima is not a practical concern unless it turns out that natural algorithms are frequently trapped in these bad local maxima. Our second main result shows that the EM algorithm, as well as a variant thereof known as the first-order EM algorithm, with random initialization, converges to a bad critical point with exponentially high probability. In more detail, we consider the following practical scheme for parameter estimation in an M-component Gaussian mixture:

(a) Draw M i.i.d. points μ₁, …, μ_M uniformly at random from the sample set.
(b) Run the EM or first-order EM algorithm to estimate the model parameters, using μ₁, …, μ_M as the initial centers.

We note that in the limit of infinite samples, the initialization scheme we consider is equivalent to selecting M initial centers i.i.d. from the underlying mixture distribution. We show that for a universal constant c > 0, with probability at least 1 − e^{−cM}, the EM and first-order EM algorithms converge to a suboptimal critical point, whose log-likelihood can be arbitrarily worse than that of the global maximum. 
Conversely, in order to find a solution with satisfactory log-likelihood via this initialization scheme, one needs to repeat the above scheme exponentially many times (in M), and then select the solution with the highest log-likelihood. This result strongly indicates that repeated random initialization followed by local search (via either EM or its first-order variant) can fail to produce useful estimates under reasonable constraints on computational complexity.
We further prove that under the same random initialization scheme, the first-order EM algorithm with a suitable stepsize does not converge to a strict saddle point with probability one. This fact strongly suggests that the failure of local search methods for the GMM model is due mainly to the existence of bad local optima, and not to the presence of (strict) saddle points.
Our proofs introduce new techniques to reason about the structure of the population log-likelihood, and in particular to show the existence of bad local optima. We expect that these general ideas will aid in developing a better understanding of the behavior of algorithms for non-convex optimization. From a practical standpoint, our results strongly suggest that careful initialization is required for local search methods, even in large-sample settings, and even for extremely well-behaved mixture models.
The remainder of this paper is organized as follows. In Section 2, we introduce GMMs, the EM algorithm and its first-order variant, and we formally set up the problem we consider. In Section 3, we state our main theoretical results and develop some of their implications. Section A is devoted to the proofs of our results, with some of the more technical aspects deferred to the appendices.

2 Background and Preliminaries

In this section, we formally define the Gaussian mixture model that we study in the paper. 
We then describe the EM algorithm and the first-order EM algorithm, as well as the form of random initialization that we analyze. Throughout the paper, we use [M] to denote the set {1, 2, ···, M}, and N(μ, Σ) to denote the d-dimensional Gaussian distribution with mean vector μ and covariance matrix Σ. We use φ(· | μ, Σ) to denote the probability density function of the Gaussian distribution with mean vector μ and covariance matrix Σ:

    φ(x | μ, Σ) := (1 / √((2π)^d det(Σ))) exp(−(1/2)(x − μ)ᵀ Σ⁻¹ (x − μ)).    (1)

2.1 Gaussian Mixture Models

A d-dimensional Gaussian mixture model (GMM) with M components can be specified by a collection μ* = {μ*₁, …, μ*_M} of d-dimensional mean vectors, a vector λ* = (λ*₁, …, λ*_M) of non-negative mixture weights that sum to one, and a collection Σ* = {Σ*₁, …, Σ*_M} of covariance matrices. Given these parameters, the density function of a Gaussian mixture model takes the form

    p(x | λ*, μ*, Σ*) = Σ_{i=1}^{M} λ*_i φ(x | μ*_i, Σ*_i),

where the Gaussian density function φ was previously defined in equation (1). In this paper, we focus on the idealized situation in which every mixture component is equally weighted, and the covariance of each mixture component is the identity. This leads to a mixture model of the form

    p(x | μ*) := (1/M) Σ_{i=1}^{M} φ(x | μ*_i, I),    (2)

which we denote by GMM(μ*). 
In this case, the only parameters to be estimated are the mean vectors μ* = {μ*_i}_{i=1}^{M} of the M components.
The difficulty of estimating a Gaussian mixture distribution depends on the amount of separation between the mean vectors. More precisely, for a given parameter ξ > 0, we say that the GMM(μ*) model is ξ-separated if

    ‖μ*_i − μ*_j‖₂ ≥ ξ,  for all distinct pairs i, j ∈ [M].    (3)

We say that the mixture is well-separated if condition (3) holds for some ξ = Ω(√d).
Suppose that we observe an i.i.d. sequence {x_ℓ}_{ℓ=1}^{n} drawn according to the distribution GMM(μ*), and our goal is to estimate the unknown collection of mean vectors μ*. The sample-based log-likelihood function L_n is given by

    L_n(μ) := (1/n) Σ_{ℓ=1}^{n} log( (1/M) Σ_{i=1}^{M} φ(x_ℓ | μ_i, I) ).    (4a)

As the sample size n tends to infinity, this sample log-likelihood converges to the population log-likelihood function L given by

    L(μ) = E_{μ*} log( (1/M) Σ_{i=1}^{M} φ(X | μ_i, I) ).    (4b)

Here E_{μ*} denotes expectation taken over the random vector X drawn according to the model GMM(μ*).
A straightforward implication of the positivity of the KL divergence is that the population likelihood function is in fact maximized at μ* (along with permutations thereof, depending on how we index the mixture components). 
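To make the objective (4a) concrete, here is a minimal numpy sketch (our own illustration, not the authors' code) that evaluates the sample log-likelihood of an equally weighted spherical GMM via a log-sum-exp:

```python
import numpy as np

def gmm_log_likelihood(X, mu):
    """Sample log-likelihood (4a) for an equally weighted spherical GMM.

    X  : (n, d) array of samples.
    mu : (M, d) array of candidate centers.
    Returns (1/n) sum_l log((1/M) sum_i phi(x_l | mu_i, I)).
    """
    n, d = X.shape
    M = mu.shape[0]
    # Squared distances ||x_l - mu_i||^2, shape (n, M).
    sq = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)
    # log phi(x | mu_i, I) = -(d/2) log(2 pi) - sq / 2.
    log_phi = -0.5 * d * np.log(2 * np.pi) - 0.5 * sq
    # Numerically stable log-sum-exp over components, with uniform weight 1/M.
    m = log_phi.max(axis=1, keepdims=True)
    log_mix = m[:, 0] + np.log(np.exp(log_phi - m).sum(axis=1)) - np.log(M)
    return log_mix.mean()
```

For a single component this reduces to the average Gaussian log-density, which gives a quick correctness check.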
On the basis of empirical evidence, Srebro [2007] conjectured that this population log-likelihood is in fact well-behaved, in the sense of having no spurious local optima. In Theorem 1, we show that this intuition is false, and provide a simple example of a mixture of M = 3 well-separated Gaussians in dimension d = 1 whose population log-likelihood function has arbitrarily bad local optima.

2.2 Expectation-Maximization Algorithm

A natural way to estimate the mean vectors μ* is by attempting to maximize the sample log-likelihood defined by the samples {x_ℓ}_{ℓ=1}^{n}. For a non-degenerate Gaussian mixture model, the log-likelihood is non-concave. Rather than attempting to maximize the log-likelihood directly, the EM algorithm proceeds by iteratively maximizing a lower bound on the log-likelihood. It does so by alternating between two steps:

1. E-step: For each i ∈ [M] and ℓ ∈ [n], compute the membership weight

    w_i(x_ℓ) = φ(x_ℓ | μ_i, I) / Σ_{j=1}^{M} φ(x_ℓ | μ_j, I).

2. 
M-step: For each i ∈ [M], update the mean vector μ_i via

    μ_i^new = ( Σ_{ℓ=1}^{n} w_i(x_ℓ) x_ℓ ) / ( Σ_{ℓ=1}^{n} w_i(x_ℓ) ).

In the population setting, the M-step becomes

    μ_i^new = E_{μ*}[w_i(X) X] / E_{μ*}[w_i(X)].    (5)

Intuitively, the M-step updates the mean vector of each Gaussian component to be a weighted centroid of the samples, for appropriately chosen weights.

First-order EM updates: For a general latent variable model with observed variables X = x, latent variables Z, and model parameters θ, by Jensen's inequality the log-likelihood function can be lower bounded as

    log P(x | θ′) ≥ E_{Z∼P(·|x;θ)} log P(x, Z | θ′) − E_{Z∼P(·|x;θ)} log P(Z | x; θ),

where the first term on the right-hand side defines Q(θ′ | θ). Each step of the EM algorithm can also be viewed as optimizing over this lower bound, which gives

    θ^new := arg max_{θ′} Q(θ′ | θ).

There are many variants of the EM algorithm which rely on partial updates at each iteration instead of finding the exact optimum of Q(θ′ | θ). One important example, analyzed in the work of Balakrishnan et al. [2015], is the first-order EM algorithm. 
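The E- and M-steps above can be sketched in a few lines of numpy (a minimal sketch of one iteration for the equally weighted, unit-covariance model; the conventions are ours, not the authors' code):

```python
import numpy as np

def em_step(X, mu):
    """One EM iteration for an equally weighted spherical GMM (unit covariance).

    X  : (n, d) samples;  mu : (M, d) current centers.
    Returns the (M, d) updated centers.
    """
    # E-step: membership weights w_i(x_l), shape (n, M).  The shared
    # (2 pi)^{-d/2} normalizing constant cancels in the ratio.
    sq = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)
    log_w = -0.5 * sq
    log_w -= log_w.max(axis=1, keepdims=True)   # numerical stability
    w = np.exp(log_w)
    w /= w.sum(axis=1, keepdims=True)
    # M-step: each center becomes the weighted centroid of the samples.
    return (w.T @ X) / w.sum(axis=0)[:, None]
```

When the centers already sit at well-separated data, the update is (up to exponentially small terms) a fixed point, matching the intuition that the M-step computes weighted centroids.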
The first-order EM algorithm takes a step along the gradient of the function Q(θ′ | θ) (with respect to its first argument) in each iteration. Concretely, given a step size s > 0, the first-order EM updates can be written as

    θ^new = θ + s ∇_{θ′} Q(θ′ | θ) |_{θ′=θ}.

In the case of the model GMM(μ*), the gradient EM updates on the population objective take the form

    μ_i^new = μ_i + s E_{μ*}[w_i(X)(X − μ_i)].    (6)

This update turns out to be equivalent to gradient ascent on the population likelihood L with step size s > 0 (see Balakrishnan et al. [2015] for details).

2.3 Random Initialization

Since the log-likelihood function is non-concave, the point to which the EM algorithm converges depends on the initial value of μ. In practice, it is standard to choose these values by some form of random initialization. For instance, one method is to initialize the mean vectors by sampling uniformly at random from the data set {x_ℓ}_{ℓ=1}^{n}. This scheme is intuitively reasonable, because it automatically adapts to the locations of the true centers. If the true centers have large mutual distances, then the initialized centers will also be scattered. Conversely, if the true centers concentrate in a small region of the space, then the initialized centers will also be close to each other. In practice, initializing μ by drawing uniformly from the data is often more reasonable than drawing μ from a fixed distribution.
In this paper, we analyze the EM algorithm and its variants at the population level. We focus on the above practical initialization scheme of selecting μ uniformly at random from the sample set. In the idealized population setting, this is equivalent to sampling the initial values of μ i.i.d. from the distribution GMM(μ*). 
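In the population idealization, this scheme amounts to drawing each initial center from the mixture itself. A minimal sketch (the centers and seed below are hypothetical values of our choosing):

```python
import numpy as np

def random_init(mu_star, M, rng):
    """Draw M initial centers i.i.d. from GMM(mu_star) with identity covariance.

    mu_star : (M, d) true centers of the equally weighted mixture.
    Returns an (M, d) array of initial centers.
    """
    d = mu_star.shape[1]
    # Pick a component uniformly at random, then add N(0, I) noise.
    comp = rng.integers(0, mu_star.shape[0], size=M)
    return mu_star[comp] + rng.standard_normal((M, d))

rng = np.random.default_rng(0)
init = random_init(np.array([[-4.0], [4.0], [40.0]]), 3, rng)
```

Note that nothing forces the component indices `comp` to be distinct: several initial centers can land near the same true center, which is exactly the failure mode exploited in Section 3.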
Throughout this paper, we refer to this particular initialization strategy as random initialization.

3 Main results

We now turn to the statements of our main results, along with a discussion of some of their consequences.

3.1 Structural properties

In our first main result (Theorem 1), for any M ≥ 3, we exhibit an M-component mixture of Gaussians in dimension d = 1 for which the population log-likelihood has a bad local maximum whose log-likelihood is arbitrarily worse than that attained by the true parameters μ*. This result provides a negative answer to the conjecture of Srebro [2007].

Theorem 1. For any M ≥ 3 and any constant C_gap > 0, there is a well-separated uniform mixture of M unit-variance spherical Gaussians GMM(μ*) and a local maximum μ′ such that

    L(μ′) ≤ L(μ*) − C_gap.

In order to illustrate the intuition underlying Theorem 1, we give a geometric description of our construction for M = 3. Suppose that the true centers μ*₁, μ*₂ and μ*₃ are such that the distance between μ*₁ and μ*₂ is much smaller than the respective distances from μ*₁ to μ*₃ and from μ*₂ to μ*₃. Now, consider the point μ := (μ₁, μ₂, μ₃) where μ₁ = (μ*₁ + μ*₂)/2, and where the points μ₂ and μ₃ are both placed at the true center μ*₃. This assignment does not maximize the population log-likelihood, because only one center is assigned to the two Gaussian components centered at μ*₁ and μ*₂, while two centers are assigned to the Gaussian component centered at μ*₃. However, when the components are well-separated we are able to show that there is a local maximum in the neighborhood of this configuration. 
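This geometry is easy to check numerically: a Monte Carlo estimate of L confirms that the miscounted configuration sits strictly below the truth. The centers, sample size and seed below are illustrative choices of ours, not values from the paper:

```python
import numpy as np

def pop_loglik_mc(mu_star, mu, n=200_000, seed=0):
    """Monte Carlo estimate of the population log-likelihood L(mu) in 1-d,
    with X drawn from the equally weighted unit-variance GMM(mu_star)."""
    rng = np.random.default_rng(seed)
    comp = rng.integers(0, len(mu_star), size=n)
    X = np.asarray(mu_star)[comp] + rng.standard_normal(n)
    # log((1/M) sum_i phi(X | mu_i, 1)) via a stable log-sum-exp.
    sq = (X[:, None] - np.asarray(mu)[None, :]) ** 2
    log_phi = -0.5 * np.log(2 * np.pi) - 0.5 * sq
    m = log_phi.max(axis=1, keepdims=True)
    return float((m[:, 0] + np.log(np.exp(log_phi - m).sum(axis=1))
                  - np.log(len(mu))).mean())

# True centers: a close pair plus one far-away component; the suboptimal
# configuration puts one center at the pair's midpoint and two at the far center.
mu_star = [-1.0, 1.0, 50.0]
mu_bad = [0.0, 50.0, 50.0]
```

By the KL-divergence argument above, `pop_loglik_mc(mu_star, mu_star)` upper-bounds the estimate at any other configuration, and the gap here is far larger than the Monte Carlo error.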
In order to establish the existence of a local maximum, we first define a neighborhood of this configuration ensuring that it does not contain any global maximum, and then prove that the log-likelihood on the boundary of this neighborhood is strictly smaller than that of the sub-optimal configuration μ. Since the log-likelihood is bounded from above, this neighborhood must contain at least one maximum of the log-likelihood. Since the global maxima are not in this neighborhood by construction, any maximum in this neighborhood must be a local maximum. See Section A for a detailed proof.

3.2 Algorithmic consequences

An important implication of Theorem 1 is that any iterative algorithm, such as EM or gradient ascent, that attempts to maximize the likelihood based on local updates cannot be globally convergent; that is, it cannot converge to (near) globally optimal solutions from an arbitrary initialization. Indeed, if such an algorithm is initialized at the local maximum, then it will remain trapped. However, one might argue that this conclusion is overly pessimistic, in that we have only shown that these algorithms fail when initialized at a certain (adversarially chosen) point. Indeed, the mere existence of bad local maxima need not be a practical concern unless it can be shown that a typical optimization algorithm will frequently converge to one of them. The following result shows that the EM algorithm, when applied to the population likelihood and initialized according to the random scheme described in Section 2.3, converges to a bad critical point with high probability.

Theorem 2. Let μᵗ be the t-th iterate of the EM algorithm initialized by the random initialization scheme described previously. 
There exists a universal constant c > 0 such that, for any M ≥ 3 and any constant C_gap > 0, there is a well-separated uniform mixture of M unit-variance spherical Gaussians GMM(μ*) with

    P[ ∀ t ≥ 0, L(μᵗ) ≤ L(μ*) − C_gap ] ≥ 1 − e^{−cM}.

Theorem 2 shows that, for the specified configuration μ*, the probability of success for the EM algorithm is exponentially small as a function of M. As a consequence, in order to guarantee recovering a global maximum with at least constant probability, the EM algorithm with random initialization must be executed at least e^{Ω(M)} times. This result strongly suggests that effective initialization schemes, such as those based on pilot estimators utilizing the method of moments [Moitra and Valiant, 2010, Hsu and Kakade, 2013], are critical to finding good maxima in general GMMs.
The key idea in the proof of Theorem 2 is the following: suppose that all the true centers are grouped into two clusters that are extremely far apart, and suppose further that we initialize all the centers in the neighborhood of these two clusters, while ensuring that at least one center lies within each cluster. In this situation, all centers will remain trapped within the cluster in which they were first initialized, irrespective of how many steps of the EM algorithm we take. Intuitively, this suggests that the only favorable initialization schemes (from which convergence to a global maximum is possible) are those in which (1) all initialized centers fall in the neighborhood of exactly one cluster of true centers, and (2) the number of centers initialized within each cluster of true centers exactly matches the number of true centers in that cluster. 
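The trapping phenomenon is easy to reproduce numerically. The sketch below builds a toy instance of our own (one "cluster" of two true centers on the left and a single true center on the right) and initializes with the counts miscounted: one center near the left cluster, two near the right. EM never moves mass across the gap:

```python
import numpy as np

def em_step(X, mu):
    """One EM iteration for an equally weighted, unit-variance 1-d GMM."""
    sq = (X[:, None] - mu[None, :]) ** 2
    w = np.exp(-0.5 * (sq - sq.min(axis=1, keepdims=True)))
    w /= w.sum(axis=1, keepdims=True)
    return (w * X[:, None]).sum(axis=0) / w.sum(axis=0)

rng = np.random.default_rng(1)
# Two well-separated groups of true centers: {-52, -48} and {+50}.
mu_star = np.array([-52.0, -48.0, 50.0])
comp = rng.integers(0, 3, size=30_000)
X = mu_star[comp] + rng.standard_normal(30_000)

# Miscounted initialization: one center on the left, two on the right.
mu = np.array([-50.0, 49.0, 51.0])
for _ in range(200):
    mu = em_step(X, mu)
# The single left center settles between -52 and -48, while both right
# centers stay trapped near +50: the true configuration is never recovered.
```

This is exactly the first obstruction described above; the recursive mini-cluster construction in the proof amplifies the same effect from a polynomially small to an exponentially small success probability.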
However, this observation alone only suffices to guarantee that the success probability is polynomially small in M.
In order to demonstrate that the success probability is exponentially small in M, we need to further refine this construction. In more detail, we construct a Gaussian mixture distribution with a recursive structure: at the top level, its true centers can be grouped into two clusters that are far apart; inside each cluster, the true centers can be further grouped into two well-separated mini-clusters, and so on. We can repeat this structure for Ω(log M) levels. For this GMM instance, even when the number of true centers exactly matches the number of initialized centers in each cluster at the top level, we still need to consider the configuration of the initial centers within the mini-clusters, which further reduces the probability of success for a random initialization. A straightforward calculation then shows that the probability of a favorable random initialization is on the order of e^{−Ω(M)}. The full proof is given in Section A.2.
We devote the remainder of this section to a treatment of the first-order EM algorithm. Our first result in this direction shows that convergence to sub-optimal fixed points remains a problem for the first-order EM algorithm, provided the step size is not chosen too aggressively.

Theorem 3. Let μᵗ be the t-th iterate of the first-order EM algorithm with stepsize s ∈ (0, 1), initialized by the random initialization scheme described previously. 
There exists a universal constant c > 0 such that, for any M ≥ 3 and any constant C_gap > 0, there is a well-separated uniform mixture of M unit-variance spherical Gaussians GMM(μ*) with

    P( ∀ t ≥ 0, L(μᵗ) ≤ L(μ*) − C_gap ) ≥ 1 − e^{−cM}.    (7)

We note that the restriction on the step size is weak, and is satisfied by the theoretically optimal choice for a mixture of two Gaussians in the setting studied by Balakrishnan et al. [2015]. Recall that the first-order EM updates are identical to gradient ascent updates on the log-likelihood function. As a consequence, we can conclude that the most natural local search heuristics for maximizing the log-likelihood (EM and gradient ascent) fail to provide statistically meaningful estimates when initialized randomly, unless the procedure is repeated exponentially many times (in M).
Our final result concerns the type of fixed points reached by the first-order EM algorithm in our setting. Pascanu et al. [2014] argue that for high-dimensional optimization problems, the principal difficulty is the proliferation of saddle points, not the existence of poor local maxima. In our setting, however, we can leverage recent results on gradient methods [Lee et al., 2016, Panageas and Piliouras, 2016] to show that the first-order EM algorithm cannot converge to strict saddle points. More precisely:

Definition 1 (Strict saddle point; Ge et al. [2015]). For a maximization problem, we say that a critical point x_ss of a function f is a strict saddle point if the Hessian ∇²f(x_ss) has at least one strictly positive eigenvalue.

With this definition, we have the following:

Theorem 4. Let μᵗ be the t-th iterate of the first-order EM algorithm with constant stepsize s ∈ (0, 1), initialized by the random initialization scheme described previously. 
Then for any M-component mixture of spherical Gaussians:

(a) The iterates μᵗ converge to a critical point of the log-likelihood.
(b) For any strict saddle point μ_ss, we have P(lim_{t→∞} μᵗ = μ_ss) = 0.

Theorems 3 and 4 provide strong support for the claim that the sub-optimal points to which the first-order EM algorithm frequently converges are bad local maxima: the algorithmic failure of the first-order EM algorithm is most likely due to the presence of bad local maxima, as opposed to (strict) saddle points.
The proof of Theorem 4 is based on recent work [Lee et al., 2016, Panageas and Piliouras, 2016] on the asymptotic performance of gradient methods. That work relies on the stable manifold theorem from dynamical systems theory and, applied directly to our setting, would require establishing that the population likelihood L is smooth. Our proof technique avoids such a smoothness argument; see Section A.4 for the details. The proof makes use of specific properties of the first-order EM algorithm that do not hold for the EM algorithm. We conjecture that a similar result is true for the EM algorithm; however, we suspect that a generalized version of the stable manifold theorem will be needed to establish such a result.

4 Conclusion and open problems

In this paper, we resolved an open problem of Srebro [2007] by demonstrating the existence of arbitrarily bad local maxima for the population log-likelihood of Gaussian mixture models, even in the idealized situation where each component is uniformly weighted, spherical with unit variance, and well-separated. We further provided evidence that even in this favorable setting, random initialization schemes for the population EM algorithm are likely to fail with high probability. 
Our results carry over in a straightforward way, via standard empirical process arguments, to settings where a large finite sample is provided.

An interesting open question is to resolve the necessity of at least three mixture components in our constructions. In particular, we believe that at least three mixture components are necessary for the log-likelihood to be poorly behaved, and that for a well-separated mixture of two Gaussians the EM algorithm with a random initialization is in fact successful with high probability.

In a related vein, understanding the empirical success of EM-style algorithms using random initialization schemes, despite their failure on seemingly benign problem instances, remains an open problem which we hope to address in future work.

Acknowledgements

This work was partially supported by Office of Naval Research MURI grant DOD-002888, Air Force Office of Scientific Research Grant AFOSR-FA9550-14-1-001, the Mathematical Data Science program of the Office of Naval Research under grant number N00014-15-1-2670, and National Science Foundation Grant CIF-31712-23800.

References

Elizabeth S Allman, Catherine Matias, and John A Rhodes. Identifiability of parameters in latent structure models with many observed variables. Annals of Statistics, 37(6A):3099–3132, 2009.

Carlos Améndola, Mathias Drton, and Bernd Sturmfels. Maximum likelihood estimates for Gaussian mixtures are transcendental. In International Conference on Mathematical Aspects of Computer and Information Sciences, pages 579–590. Springer, 2015.

Sanjeev Arora, Ravi Kannan, et al. Learning mixtures of separated nonspherical Gaussians. The Annals of Applied Probability, 15(1A):69–92, 2005.

Sivaraman Balakrishnan, Martin J Wainwright, and Bin Yu. Statistical guarantees for the EM algorithm: From population to sample-based analysis. Annals of Statistics, 2015.

Mikhail Belkin and Kaushik Sinha.
Polynomial learning of distribution families. In 51st Annual IEEE Symposium on Foundations of Computer Science, pages 103–112. IEEE, 2010.

Kamalika Chaudhuri and Satish Rao. Learning mixtures of product distributions using correlations and independence. In 21st Annual Conference on Learning Theory, volume 4, pages 9–1, 2008.

Jiahua Chen. Optimal rate of convergence for finite mixture models. Annals of Statistics, 23(1):221–233, 1995.

Sanjoy Dasgupta and Leonard Schulman. A probabilistic analysis of EM for mixtures of separated, spherical Gaussians. Journal of Machine Learning Research, 8:203–226, 2007.

Arthur P Dempster, Nan M Laird, and Donald B Rubin. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B, 39(1):1–38, 1977.

David L Donoho and Richard C Liu. The "automatic" robustness of minimum distance functionals. Annals of Statistics, 16(2):552–586, 1988.

Rong Ge, Furong Huang, Chi Jin, and Yang Yuan. Escaping from saddle points—online stochastic gradient for tensor decomposition. In 28th Annual Conference on Learning Theory, pages 797–842, 2015.

Christopher R Genovese and Larry Wasserman. Rates of convergence for the Gaussian mixture sieve. Annals of Statistics, 28(4):1105–1127, 2000.

Subhashis Ghosal and Aad W Van Der Vaart. Entropies and rates of convergence for maximum likelihood and Bayes estimation for mixtures of normal densities. Annals of Statistics, 29(5):1233–1263, 2001.

Nhat Ho and XuanLong Nguyen. Identifiability and optimal rates of convergence for parameters of multiple types in finite mixtures. arXiv preprint arXiv:1501.02497, 2015.

Daniel Hsu and Sham M Kakade. Learning mixtures of spherical Gaussians: Moment methods and spectral decompositions. In Proceedings of the 4th Conference on Innovations in Theoretical Computer Science, pages 11–20.
ACM, 2013.

Jason D Lee, Max Simchowitz, Michael I Jordan, and Benjamin Recht. Gradient descent converges to minimizers. In 29th Annual Conference on Learning Theory, pages 1246–1257, 2016.

Po-Ling Loh and Martin J Wainwright. Regularized M-estimators with nonconvexity: Statistical and algorithmic theory for local optima. In Advances in Neural Information Processing Systems, pages 476–484, 2013.

Ankur Moitra and Gregory Valiant. Settling the polynomial learnability of mixtures of Gaussians. In 51st Annual IEEE Symposium on Foundations of Computer Science, pages 93–102. IEEE, 2010.

Ioannis Panageas and Georgios Piliouras. Gradient descent converges to minimizers: The case of non-isolated critical points. arXiv preprint arXiv:1605.00405, 2016.

Razvan Pascanu, Yann N Dauphin, Surya Ganguli, and Yoshua Bengio. On the saddle point problem for non-convex optimization. arXiv preprint arXiv:1405.4604, 2014.

Nathan Srebro. Are there local maxima in the infinite-sample likelihood of Gaussian mixture estimation? In 20th Annual Conference on Learning Theory, pages 628–629, 2007.

Henry Teicher. Identifiability of finite mixtures. The Annals of Mathematical Statistics, 34(4):1265–1269, 1963.

D Michael Titterington. Statistical Analysis of Finite Mixture Distributions. Wiley, 1985.

Santosh Vempala and Grant Wang. A spectral algorithm for learning mixtures of distributions. In The 43rd Annual IEEE Symposium on Foundations of Computer Science, pages 113–122. IEEE, 2002.

Zhaoran Wang, Han Liu, and Tong Zhang. Optimal computational and statistical rates of convergence for sparse nonconvex learning problems.
Annals of Statistics, 42(6):2164, 2014.