{"title": "Benefits of over-parameterization with EM", "book": "Advances in Neural Information Processing Systems", "page_first": 10662, "page_last": 10672, "abstract": "Expectation Maximization (EM) is among the most popular algorithms for maximum likelihood estimation, but it is generally only guaranteed to find a stationary point of the log-likelihood objective. The goal of this article is to present theoretical and empirical evidence that over-parameterization can help EM avoid spurious local optima in the log-likelihood. We consider the problem of estimating the mean vectors of a Gaussian mixture model in a scenario where the mixing weights are known. Our study shows that the global behavior of EM, when one uses an over-parameterized model in which the mixing weights are treated as unknown, is better than that when one uses the (correct) model with the mixing weights fixed to the known values. For symmetric Gaussian mixtures with two components, we prove that introducing the (statistically redundant) weight parameters enables EM to find the global maximizer of the log-likelihood starting from almost any initial mean parameters, whereas EM without this over-parameterization may very often fail. For other Gaussian mixtures, we provide empirical evidence that shows similar behavior. Our results corroborate the value of over-parameterization in solving non-convex optimization problems, previously observed in other domains.", "full_text": "Benefits of over-parameterization with EM\n\nJi Xu\nColumbia University\njixu@cs.columbia.edu\n\nDaniel Hsu\nColumbia University\ndjhsu@cs.columbia.edu\n\nArian Maleki\nColumbia University\narian@stat.columbia.edu\n\nAbstract\n\nExpectation Maximization (EM) is among the most popular algorithms for maximum likelihood estimation, but it is generally only guaranteed to find a stationary point of the log-likelihood objective. 
The goal of this article is to present theoretical and empirical evidence that over-parameterization can help EM avoid spurious local optima in the log-likelihood. We consider the problem of estimating the mean vectors of a Gaussian mixture model in a scenario where the mixing weights are known. Our study shows that the global behavior of EM, when one uses an over-parameterized model in which the mixing weights are treated as unknown, is better than that when one uses the (correct) model with the mixing weights fixed to the known values. For symmetric Gaussian mixtures with two components, we prove that introducing the (statistically redundant) weight parameters enables EM to find the global maximizer of the log-likelihood starting from almost any initial mean parameters, whereas EM without this over-parameterization may very often fail. For other Gaussian mixtures, we provide empirical evidence that shows similar behavior. Our results corroborate the value of over-parameterization in solving non-convex optimization problems, previously observed in other domains.\n\n1 Introduction\n\nIn a Gaussian mixture model (GMM), the observed data Y = {y_1, y_2, ..., y_n} ⊂ R^d comprise an i.i.d. sample from a mixture of k Gaussians:\n\n    y_1, \\ldots, y_n \\overset{\\text{i.i.d.}}{\\sim} \\sum_{i=1}^{k} w_i^* \\mathcal{N}(\\theta_i^*, \\Sigma_i^*),    (1)\n\nwhere (w_i^*, θ_i^*, Σ_i^*) denote the weight, mean, and covariance matrix of the ith mixture component. Parameters of the GMM are often estimated using the Expectation Maximization (EM) algorithm, which aims to find the maximizer of the log-likelihood objective. However, the log-likelihood function is not concave, so EM is only guaranteed to find a stationary point. 
This leads to the following natural and fundamental question in the study of EM and non-convex optimization: How can EM escape spurious local maxima and saddle points to reach the maximum likelihood estimate (MLE)? In this work, we give theoretical and empirical evidence that over-parameterizing the mixture model can help EM achieve this objective.\n\nOur evidence is based on models in (1) where the mixture components share a known, common covariance, i.e., we fix Σ_i^* = Σ^* for all i. First, we assume that the mixing weights w_i are also fixed to known values. Under this model, which we call Model 1, EM finds a stationary point of the log-likelihood function in the parameter space of component means (θ_1, ..., θ_k). Next, we over-parameterize Model 1 as follows. Despite the fact that the weights are fixed in Model 1, we now pretend that they are not fixed. This gives a second model, which we call Model 2. Parameter estimation for Model 2 requires EM to estimate the mixing weights in addition to the component means. Finding the global maximizer of the log-likelihood over this enlarged parameter space is seemingly more difficult for Model 2 than it is for Model 1, and perhaps needlessly so. However, in this paper we present theoretical and empirical evidence to the contrary.\n\n32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montr\u00e9al, Canada.\n\n1. For mixtures of two symmetric Gaussians (i.e., k = 2 and θ_1^* = −θ_2^*), we prove that EM for Model 2 converges to the global maximizer of the log-likelihood objective with almost any initialization of the mean parameters, while EM for Model 1 will fail to do so for many choices of (w_1^*, w_2^*). These results are established for idealized executions of EM in an infinite sample size limit, which we complement with finite sample results.\n\n2. 
We prove that the spurious local maxima in the (population) log-likelihood objective for Model 1 are eliminated in the objective for Model 2.\n\n3. We present an empirical study to show that for more general mixtures of Gaussians, with a variety of model parameters and sample sizes, EM for Model 2 has a higher probability of finding the MLE than EM for Model 1 under random initializations.\n\nRelated work. Since Dempster's 1977 paper [Dempster et al., 1977], the EM algorithm has become one of the most popular algorithms to find the MLE for mixture models. Due to its popularity, the convergence analysis of EM has attracted researchers' attention for years. Local convergence of EM has been shown by Wu [1983], Xu and Jordan [1996], Tseng [2004], and Chr\u00e9tien and Hero [2008]. Further, for certain models and under various assumptions about the initialization, EM has been shown to converge to the MLE [Redner and Walker, 1984, Balakrishnan et al., 2017, Klusowski and Brinda, 2016, Yan et al., 2017]. Typically, the initialization is required to be sufficiently close to the true parameter values of the data-generating distribution. Much less is known about global convergence of EM, as the landscape of the log-likelihood function has not been well-studied. For GMMs, Xu et al. [2016] and Daskalakis et al. [2017] study mixtures of two Gaussians with equal weights and show that the log-likelihood objective has only two global maxima and one saddle point; and if EM is randomly initialized in a natural way, the probability that EM converges to this saddle point is zero. (Our Theorem 2 generalizes these results.) It is known that for mixtures of three or more Gaussians, global convergence is not generally possible [Jin et al., 2016].\n\nThe value of over-parameterization for local or greedy search algorithms that aim to find a global minimizer of non-convex objectives has been rigorously established in other domains. 
Matrix completion is a concrete example: the goal is to recover a rank-r (r ≪ n) matrix M ∈ R^{n×n} from observations of randomly chosen entries [Cand\u00e8s and Recht, 2009]. A direct approach to this problem is to find the matrix X ∈ R^{n×n} of minimum rank that is consistent with the observed entries of M. However, this optimization problem is NP-hard in general, despite the fact that there are only 2nr − r^2 ≪ n^2 degrees of freedom. An indirect approach to this matrix completion problem is to find a matrix X of smallest nuclear norm, subject to the same constraints; this is a convex relaxation of the rank minimization problem. By considering all n^2 degrees of freedom, Cand\u00e8s and Tao [2010] show that the matrix M is exactly recovered via nuclear norm minimization as soon as Ω(nr log^6 n) entries are observed (with high probability). Notably, this combination of over-parameterization with convex relaxation works well in many other research problems such as sparse PCA [d'Aspremont et al., 2005] and compressive sensing [Donoho, 2006]. However, many problems (like ours) do not have a straightforward convex relaxation. Therefore, it is important to understand how over-parameterization can help one solve a non-convex problem through mechanisms other than convex relaxation.\n\nAnother line of work in which the value of over-parameterization is observed is in deep learning. It is conjectured that the use of over-parameterization is the main reason for the success of local search algorithms in learning good parameters for neural nets [Livni et al., 2014, Safran and Shamir, 2017]. 
Recently, Haeffele and Vidal [2015], Nguyen and Hein [2017, 2018], Soltani and Hegde [2018], and Du and Lee [2018] confirm this observation for many neural networks such as feedforward and convolutional neural networks.\n\n2 Theoretical results\n\nIn this section, we present our main theoretical results concerning EM and two-component Gaussian mixture models.\n\n2.1 Sample-based EM and Population EM\n\nWithout loss of generality, we assume Σ* = I. We consider the following Gaussian mixture model:\n\n    y_1, \\ldots, y_n \\overset{\\text{i.i.d.}}{\\sim} w_1^* \\mathcal{N}(\\theta^*, I) + w_2^* \\mathcal{N}(-\\theta^*, I).    (2)\n\nThe mixing weights w_1^* and w_2^* are fixed (i.e., assumed to be known). Without loss of generality, we also assume that w_1^* ≥ w_2^* > 0 (and, of course, w_1^* + w_2^* = 1). The only parameter to estimate is the mean vector θ*. The EM algorithm for this model uses the following iterations:\n\n    \\hat\\theta^{(t+1)} = \\frac{1}{n} \\sum_{i=1}^{n} \\frac{w_1^* e^{\\langle y_i, \\hat\\theta^{(t)} \\rangle} - w_2^* e^{-\\langle y_i, \\hat\\theta^{(t)} \\rangle}}{w_1^* e^{\\langle y_i, \\hat\\theta^{(t)} \\rangle} + w_2^* e^{-\\langle y_i, \\hat\\theta^{(t)} \\rangle}} \\, y_i.    (3)\n\nWe refer to this algorithm as Sample-based EM1: it is the EM algorithm one would normally use when the mixing weights are known. In spite of this, we also consider an EM algorithm that pretends that the weights are not known, and estimates them alongside the mean parameters. We refer to this algorithm as Sample-based EM2, which uses the following iterations:\n\n    \\hat w_1^{(t+1)} = \\frac{1}{n} \\sum_{i=1}^{n} \\frac{\\hat w_1^{(t)} e^{\\langle y_i, \\hat\\theta^{(t)} \\rangle}}{\\hat w_1^{(t)} e^{\\langle y_i, \\hat\\theta^{(t)} \\rangle} + \\hat w_2^{(t)} e^{-\\langle y_i, \\hat\\theta^{(t)} \\rangle}} = 1 - \\hat w_2^{(t+1)},\n    \\hat\\theta^{(t+1)} = \\frac{1}{n} \\sum_{i=1}^{n} \\frac{\\hat w_1^{(t)} e^{\\langle y_i, \\hat\\theta^{(t)} \\rangle} - \\hat w_2^{(t)} e^{-\\langle y_i, \\hat\\theta^{(t)} \\rangle}}{\\hat w_1^{(t)} e^{\\langle y_i, \\hat\\theta^{(t)} \\rangle} + \\hat w_2^{(t)} e^{-\\langle y_i, \\hat\\theta^{(t)} \\rangle}} \\, y_i.    (4)\n\nThis is the EM algorithm for a different Gaussian mixture model in which the weights w_1^* and w_2^* are not fixed (i.e., unknown), and hence must be estimated. Our goal is to study the global convergence properties of the above two EM algorithms on data from the first model, where the mixing weights are, in fact, known.\n\nWe study idealized executions of the EM algorithms in the large sample limit, where the algorithms are modified to be computed over an infinitely large i.i.d. sample drawn from the mixture distribution in (2). Specifically, we replace the empirical averages in (3) and (4) with the expectations with respect to the mixture distribution. We obtain the following two modified EM algorithms, which we refer to as Population EM1 and Population EM2:\n\n• Population EM1:\n\n    \\theta^{(t+1)} = \\mathbb{E}_{y \\sim f^*}\\!\\left[ \\frac{w_1^* e^{\\langle y, \\theta^{(t)} \\rangle} - w_2^* e^{-\\langle y, \\theta^{(t)} \\rangle}}{w_1^* e^{\\langle y, \\theta^{(t)} \\rangle} + w_2^* e^{-\\langle y, \\theta^{(t)} \\rangle}} \\, y \\right] =: H(\\theta^{(t)}; \\theta^*, w_1^*),    (5)\n\n  where f* = f*(θ*, w_1^*) here denotes the true distribution of y_i given in (2).\n\n• Population EM2: Set w_1^(0) = w_2^(0) = 0.5 (footnote 1), and run\n\n    w_1^{(t+1)} = \\mathbb{E}_{y \\sim f^*}\\!\\left[ \\frac{w_1^{(t)} e^{\\langle y, \\theta^{(t)} \\rangle}}{w_1^{(t)} e^{\\langle y, \\theta^{(t)} \\rangle} + w_2^{(t)} e^{-\\langle y, \\theta^{(t)} \\rangle}} \\right] =: G_w(\\theta^{(t)}, w^{(t)}; \\theta^*, w_1^*) = 1 - w_2^{(t+1)},    (6)\n    \\theta^{(t+1)} = \\mathbb{E}_{y \\sim f^*}\\!\\left[ \\frac{w_1^{(t)} e^{\\langle y, \\theta^{(t)} \\rangle} - w_2^{(t)} e^{-\\langle y, \\theta^{(t)} \\rangle}}{w_1^{(t)} e^{\\langle y, \\theta^{(t)} \\rangle} + w_2^{(t)} e^{-\\langle y, \\theta^{(t)} \\rangle}} \\, y \\right] =: G_\\theta(\\theta^{(t)}, w^{(t)}; \\theta^*, w_1^*).    (7)\n\nAs n → ∞, we can show that the performance of Sample-based EM converges to that of the corresponding Population EM in probability. This argument has been used rigorously in many previous works on EM [Balakrishnan et al., 2017, Xu et al., 2016, Klusowski and Brinda, 2016, Daskalakis et al., 2017]. The main goal of this section, however, is to study the dynamics of Population EM1 and Population EM2, and the landscape of the log-likelihood objectives of the two models.\n\n1 Using equal initial weights is a natural way to initialize EM when the weights are unknown.\n\n2.2 Main theoretical results\n\nLet us first consider the special case w_1^* = w_2^* = 0.5. Then, it is straightforward to show that w_1^(t) = w_2^(t) = 0.5 for all t in Population EM2. Hence, Population EM2 is equivalent to Population EM1. Global convergence of θ^(t) to θ* for this case was recently established by Xu et al. [2016, Theorem 1] for almost all initial θ^(0) (see also [Daskalakis et al., 2017]).\n\nWe first show that the same global convergence may not hold for Population EM1 when w_1^* ≠ w_2^*.\n\nTheorem 1. Consider Population EM1 in dimension one (i.e., θ* ∈ R). For any θ* > 0, there exists δ > 0 such that given w_1^* ∈ (0.5, 0.5 + δ) and initialization θ^(0) ≤ −θ*, the Population EM1 estimate θ^(t) converges to a fixed point θ_wrong inside (−θ*, 0).\n\nThis theorem, which is proved in Appendix A, implies that if we use random initialization, Population EM1 may converge to the wrong fixed point with constant probability. We illustrate this in Figure 1. The iterates of Population EM1 converge to a fixed point of the function θ ↦ H(θ; θ*, w_1^*) defined in (5). We have plotted this function for several different values of w_1^* in the left panel of Figure 1. 
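The fixed-point structure of H can be reproduced numerically. The sketch below is our own illustration (not the authors' code): it approximates one Population EM1 step (5) in dimension one by a Riemann sum on a fine grid, using the algebraic identity that the posterior weight difference equals tanh(yθ + ½ log(w_1^*/w_2^*)). The choices θ* = 1 and w_1^* = 0.6 are illustrative; 0.6 sits below the ≈0.77 threshold mentioned in the text, so a spurious negative fixed point exists.

```python
import numpy as np

def population_em1_update(theta, theta_star=1.0, w1=0.6):
    """One Population EM1 step theta <- H(theta; theta_star, w1) in d = 1.

    The expectation over y ~ w1*N(theta_star, 1) + (1-w1)*N(-theta_star, 1)
    is approximated by a Riemann sum on a fine grid.  The posterior weight
    difference (w1 e^{y t} - w2 e^{-y t}) / (w1 e^{y t} + w2 e^{-y t})
    equals tanh(y t + 0.5 log(w1/w2)), which is numerically stable.
    """
    w2 = 1.0 - w1
    y, dy = np.linspace(-12.0, 12.0, 24001, retstep=True)
    density = (w1 * np.exp(-0.5 * (y - theta_star) ** 2)
               + w2 * np.exp(-0.5 * (y + theta_star) ** 2)) / np.sqrt(2.0 * np.pi)
    ratio = np.tanh(y * theta + 0.5 * np.log(w1 / w2))
    return float(np.sum(ratio * y * density) * dy)

# Iterating from an initialization with the "incorrect" sign stalls at a
# spurious fixed point theta_wrong in (-theta_star, 0), as in Theorem 1.
theta = -1.0
for _ in range(200):
    theta = population_em1_update(theta)
```

Raising w_1^* well above the threshold (e.g., to 0.9) leaves H with a single fixed point at θ*, and the same iteration then reaches θ* even from the wrong-sign initialization, matching the left panel of Figure 1.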
When w_1^* is close to 1, H(θ; θ*, w_1^*) has only one fixed point, and that is at θ = θ*. Hence, in this case, the estimates produced by Population EM1 converge to the true θ*. However, when we decrease the value of w_1^* below a certain threshold (which is numerically found to be approximately 0.77 for θ* = 1), two other fixed points of H(θ; θ*, w_1^*) emerge. These new fixed points are foils for Population EM1.\n\nFrom the failure of Population EM1, one may expect the over-parameterized Population EM2 to fail as well. Yet, surprisingly, our second theorem proves the opposite is true: Population EM2 has global convergence even when w_1^* ≠ w_2^*.\n\nTheorem 2. For any w_1^* ∈ [0.5, 1), the Population EM2 estimate (θ^(t), w^(t)) converges to either (θ*, w_1^*) or (−θ*, w_2^*) with any initialization θ^(0) except on the hyperplane ⟨θ^(0), θ*⟩ = 0. Furthermore, the convergence speed is geometric after some finite number of iterations, i.e., there exist a finite number T and a constant ρ ∈ (0, 1) such that the following hold.\n\n• If ⟨θ^(0), θ*⟩ > 0, then for all t > T,\n\n    \\|\\theta^{(t+1)} - \\theta^*\\|^2 + |w_1^{(t+1)} - w_1^*|^2 \\le \\rho^{t-T} \\left( \\|\\theta^{(T)} - \\theta^*\\|^2 + (w_1^{(T)} - w_1^*)^2 \\right).\n\n• If ⟨θ^(0), θ*⟩ < 0, then for all t > T,\n\n    \\|\\theta^{(t+1)} + \\theta^*\\|^2 + |w_1^{(t+1)} - w_2^*|^2 \\le \\rho^{t-T} \\left( \\|\\theta^{(T)} + \\theta^*\\|^2 + (w_1^{(T)} - w_2^*)^2 \\right).\n\nTheorem 2 implies that if we use random initialization for θ^(0), then with probability one the Population EM2 estimates converge to the true parameters.\n\nThe failure of Population EM1 and success of Population EM2 can be explained intuitively. 
Let C1 and C2, respectively, denote the true mixture components with parameters (w_1^*, θ*) and (w_2^*, −θ*). Due to the symmetry in Population EM1, we are assured that among the two estimated mixture components, one will have a positive mean, and the other will have a negative mean: call these \u0108+ and \u0108\u2212, respectively. Assume θ* > 0 and w_1^* > 0.5, and consider initializing the Population EM1 with θ^(0) := −θ*. This initialization incorrectly associates \u0108\u2212 with the larger weight w_1^* instead of the smaller weight w_2^*. This causes, in the E-step of EM, the component \u0108\u2212 to become \u201cresponsible\u201d for an overly large share of the overall probability mass, and in particular an overly large share of the mass from C1 (which has a positive mean). Thus, in the M-step of EM, when the mean of the estimated component \u0108\u2212 is updated, it is pulled rightward towards +∞. It is possible that this rightward pull would cause the estimated mean of \u0108\u2212 to become positive, in which case the roles of \u0108+ and \u0108\u2212 would switch, but this will not happen as long as w_1^* is sufficiently bounded away from 1 (but still > 0.5) (footnote 2). The result is a bias in the estimation of θ*, thus explaining why the Population EM1 estimate converges to some θ_wrong ∈ (−θ*, 0) when w_1^* is not too large.\n\n2 When w_1^* is indeed very close to 1, then almost all of the probability mass of the true distribution comes from C1, which has positive mean. So, in the M-step discussed above, the rightward pull of the mean of \u0108\u2212 may\n\nOur discussion confirms that one way Population EM1 may fail (in dimension one) is if it is initialized with θ^(0) having the \u201cincorrect\u201d sign (e.g., θ^(0) = −θ*). On the other hand, the performance of Population EM2 does not depend on the sign of the initial θ^(0). 
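This sign-insensitivity can be checked directly. The sketch below is our own illustration (with assumed parameter values θ* = 1, w_1^* = 0.6): it runs the Population EM2 updates (6) and (7) by quadrature from a wrong-sign initialization θ^(0) = −θ* with equal initial weights. Per Theorem 2, the iterates should approach (−θ*, w_2^*), which represents the true mixture with the component labels swapped.

```python
import numpy as np

def population_em2_update(theta, w1, theta_star=1.0, w1_star=0.6):
    """One Population EM2 step (theta, w1) <- (G_theta, G_w) in d = 1.

    Expectations over y ~ w1_star*N(theta_star,1) + w2_star*N(-theta_star,1)
    are approximated by a Riemann sum.  The posterior probability of the
    positive component is sigmoid(2*y*theta + log(w1/w2)), written via tanh
    for numerical stability.
    """
    w2_star = 1.0 - w1_star
    y, dy = np.linspace(-12.0, 12.0, 24001, retstep=True)
    density = (w1_star * np.exp(-0.5 * (y - theta_star) ** 2)
               + w2_star * np.exp(-0.5 * (y + theta_star) ** 2)) / np.sqrt(2.0 * np.pi)
    post = 0.5 * (1.0 + np.tanh(y * theta + 0.5 * np.log(w1 / (1.0 - w1))))
    new_w1 = float(np.sum(post * density) * dy)                       # update (6)
    new_theta = float(np.sum((2.0 * post - 1.0) * y * density) * dy)  # update (7)
    return new_theta, new_w1

theta, w1 = -1.0, 0.5   # wrong-sign mean initialization, uniform weights
for _ in range(2000):
    theta, w1 = population_em2_update(theta, w1)
# per Theorem 2, (theta, w1) approaches (-theta_star, w2_star) = (-1, 0.4)
```

The weight estimate drifts from 0.5 toward w_2^* rather than w_1^*, so the fitted pair still encodes the true distribution; with Population EM1 the weight would stay clamped at w_1^* and the same initialization would stall at θ_wrong.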
Recall that the estimates of Population EM2 converge to the fixed points of the mapping M : (θ, w_1) ↦ (G_θ(θ, w_1; θ*, w_1^*), G_w(θ, w_1; θ*, w_1^*)), as defined in (6) and (7). One can check that for all θ, w_1, θ*, w_1^*, we have\n\n    G_\\theta(\\theta, w_1; \\theta^*, w_1^*) + G_\\theta(-\\theta, w_2; \\theta^*, w_1^*) = 0,\n    G_w(\\theta, w_1; \\theta^*, w_1^*) + G_w(-\\theta, w_2; \\theta^*, w_1^*) = 1.    (8)\n\nHence, (θ, w_1) is a fixed point of M if and only if (−θ, w_2) is a fixed point of M as well. Therefore, Population EM2 is insensitive to the sign of the initial θ^(0). This property can be extended to mixtures of k > 2 Gaussians as well. In these cases, the performance of EM for Model 2 is insensitive to permutations of the component parameters. Hence, because of this nice property, as we will confirm in our simulations, when the mixture components are well-separated, EM for Model 2 performs well for most initializations, while EM for Model 1 fails in many cases.\n\nOne limitation of our permutation-free explanation is that the argument only holds when the weights in Population EM2 are initialized to be uniform. However, the benefits of over-parameterization are not limited to this case. Indeed, when we compare the landscapes of the log-likelihood objective for (the mixture models corresponding to) Population EM1 and Population EM2, we find that over-parameterization eliminates spurious local maxima that were obstacles for Population EM1.\n\nTheorem 3. For all w_1^* ≠ 0.5, the log-likelihood objective optimized by Population EM2 has only one saddle point (θ, w_1) = (0, 1/2) and no local maximizers besides the two global maximizers (θ, w_1) = (θ*, w_1^*) and (θ, w_1) = (−θ*, w_2^*).\n\nThe proof of this theorem is presented in Appendix C.\n\nRemark 1. 
Consider the landscape of the log-likelihood objective for Population EM2 and the point (θ_wrong, w_1^*), where θ_wrong is the local maximizer suggested by Theorem 1. Theorem 3 implies that we can still easily escape this point due to the non-zero gradient in the direction of w_1; thus (θ_wrong, w_1^*) is not even a saddle point. We emphasize that this is exactly the escape mechanism that we hoped over-parameterization would provide (see the left panel in Figure 2).\n\nRemark 2. Note that although (θ, w_1) = ((w_1^* − w_2^*)θ*, 1) and ((w_2^* − w_1^*)θ*, 0) are also fixed points for Population EM2, they are not first-order stationary points of the log-likelihood objective if w_1^* ≠ 0.5.\n\nFinally, to complete the analysis of EM for mixtures of two Gaussians, we present the following result that applies to Sample-based EM2.\n\nTheorem 4. Let (θ̂^(t), ŵ_1^(t)) be the estimates of Sample-based EM2. Suppose θ̂^(0) = θ^(0), ŵ_1^(0) = w_1^(0) = 1/2, and ⟨θ^(0), θ*⟩ ≠ 0. Then we have\n\n    \\limsup_{t \\to \\infty} \\|\\hat\\theta^{(t)} - \\theta^{(t)}\\| \\to 0 \\quad \\text{and} \\quad \\limsup_{t \\to \\infty} |\\hat w_1^{(t)} - w_1^{(t)}| \\to 0 \\quad \\text{as } n \\to \\infty,\n\nwhere convergence is in probability.\n\nThe proof of this theorem uses the same approach as Xu et al. [2016] and is presented in Appendix D.\n\n2.3 Roadmap of the proof for Theorem 2\n\nOur first lemma, proved in Appendix B.1, confirms that if ⟨θ^(0), θ*⟩ > 0, then ⟨θ^(t), θ*⟩ > 0 for every t and w_1^(t) ∈ (0.5, 1). In other words, the estimates of the Population EM2 remain in the correct hyperplane, and the weight moves in the right direction, too.\n\nbe so strong that the updated mean estimate becomes positive. 
Since the model enforces that the mean estimates of \u0108+ and \u0108\u2212 be negations of each other, the roles of \u0108+ and \u0108\u2212 switch, and now it is \u0108+ that becomes associated with the larger mixing weight w_1^*. In this case, owing to the symmetry assumption, Population EM1 may be able to successfully converge to θ*. We revisit this issue in the numerical study, where the symmetry assumption is removed.\n\nFigure 1: Left panel: we show the shape of the iterative function H(θ; θ*, w_1^*) with θ* = 1 and different values of w_1^* ∈ {0.9, 0.77, 0.7}. The green plus + indicates the origin (0, 0) and the black points indicate the correct values (θ*, θ*) and (−θ*, −θ*). We observe that as w_1^* increases, the number of fixed points goes down from 3 to 2 and finally to 1. Further, when there exists more than one fixed point, there is one stable incorrect fixed point in (−θ*, 0). Right panel: we show the shape of the iterative function G_w(θ, w_1; θ*, w_1^*) with θ* = 1, w_1^* = 0.7 and different values of θ ∈ {0.3, 1, 2}. We observe that as θ increases, G_w changes from a concave function to a concave-convex function. Further, there are at most three fixed points and there is only one stable fixed point.\n\nLemma 1. If ⟨θ^(0), θ*⟩ > 0, we have ⟨θ^(t), θ*⟩ > 0 and w_1^(t) ∈ (0.5, 1) for all t ≥ 1. Otherwise, if ⟨θ^(0), θ*⟩ < 0, we have ⟨θ^(t), θ*⟩ < 0 and w_1^(t) ∈ (0, 0.5) for all t ≥ 1.\n\nOn account of Lemma 1 and the invariance in (8), we can assume without loss of generality that ⟨θ^(t), θ*⟩ > 0 and w_1^(t) ∈ (0.5, 1) for all t ≥ 1.\n\nLet d be the dimension of θ*. We reduce the d > 1 case to the d = 1 case. 
This is achieved by proving that the angle between the two vectors θ^(t) and θ* is a decreasing function of t and converges to 0. The details appear in Appendix B.4. Hence, in the rest of this section we focus on the proof of Theorem 2 for d = 1.\n\nLet g_θ(θ, w_1) and g_w(θ, w_1) be shorthand for the two update functions G_θ and G_w defined in (6) and (7) for a fixed (θ*, w_1^*). To prove that {(θ^(t), w^(t))} converges to the fixed point (θ?, w?), we establish the following claims:\n\nC.1 There exists a set S = (a_θ, b_θ) × (a_w, b_w) ⊂ R^2, where a_θ, b_θ ∈ R ∪ {±∞} and a_w, b_w ∈ R, such that S contains the point (θ?, w?) and the point (g_θ(θ, w_1), g_w(θ, w_1)) ∈ S for all (θ, w_1) ∈ S. Further, g_θ(θ, w_1) is a non-decreasing function of θ for a given w_1 ∈ (a_w, b_w), and g_w(θ, w_1) is a non-decreasing function of w_1 for a given θ ∈ (a_θ, b_θ).\n\nC.2 There is a reference curve r : [a_w, b_w] → [a_θ, b_θ] defined on S̄ (the closure of S) such that:\n\nC.2a r is continuous, decreasing, and passes through the point (θ?, w?), i.e., r(w?) = θ?.\n\nC.2b Given θ ∈ (a_θ, b_θ), the function w ↦ g_w(θ, w) has a stable fixed point in [a_w, b_w]. Further, any stable fixed point w_s in [a_w, b_w] or fixed point w_s in (a_w, b_w) satisfies the following:\n  • If θ < θ? and θ ≥ r(b_w), then r^{−1}(θ) > w_s > w?.\n  • If θ = θ?, then r^{−1}(θ) = w_s = w?.\n  • If θ > θ? and θ ≤ r(a_w), then r^{−1}(θ) < w_s < w?.\n\nC.2c Given w ∈ [a_w, b_w], the function θ ↦ g_θ(θ, w) has a stable fixed point in [a_θ, b_θ]. Further, any stable fixed point θ_s in [a_θ, b_θ] or fixed point θ_s in (a_θ, b_θ) satisfies the following:\n  • If w < w?, then r(w) > θ_s > θ?.\n  • If w = w?, then r(w) = θ_s = θ?.\n  • If w > w?, then r(w) < θ_s < θ?.\n\nWe explain C.1 and C.2 in the right panel of Figure 2. Heuristically, we expect (θ*, w_1^*) to be the only fixed point of the mapping (θ, w) ↦ (g_θ(θ, w), g_w(θ, w)), and that (θ^(t), w^(t)) move toward this fixed point. Hence, we can prove the convergence of the iterates by showing certain geometric relationships between the curves of fixed points of the two functions. C.1 helps us bound the iterates to the region where such nice geometric relations exist, and the reference curve r and C.2 are the tools that help us mathematically characterize the geometric relations shown in the figure. Indeed, the next lemma implies that C.1 and C.2 are sufficient to show convergence to the right point (θ?, w?):\n\nLemma 2 (Proved in Appendix B.2.1). Suppose continuous functions g_θ(θ, w), g_w(θ, w) satisfy C.1 and C.2. Then there exists a continuous mapping m : S̄ → [0, ∞) such that (θ?, w?) is the only solution of m(θ, w) = 0 on S̄, the closure of S. Further, if we initialize (θ^(0), w^(0)) in S, the sequence {(θ^(t), w^(t))}_{t≥0} defined by\n\n    \\theta^{(t+1)} = g_\\theta(\\theta^{(t)}, w^{(t)}) \\quad \\text{and} \\quad w^{(t+1)} = g_w(\\theta^{(t)}, w^{(t)})\n\nsatisfies m(θ^(t), w^(t)) ↓ 0, and therefore (θ^(t), w^(t)) converges to (θ?, w?).\n\nIn our problem, we set a_w = 0.5, b_w = 1, a_θ = 0, b_θ = ∞ and (θ?, w?) = (θ*, w_1^*). Then, according to Lemma 1 and the monotonicity of g_θ and g_w, C.1 is satisfied.\n\nTo show C.2, we first define the reference curve r by\n\n    r(w_1) := \\frac{w_1^* - w_2^*}{w_1 - w_2}\\,\\theta^* = \\frac{2 w_1^* - 1}{2 w_1 - 1}\\,\\theta^*, \\qquad \\forall w_1 \\in (0.5, 1],\\ w_2 = 1 - w_1.    (9)\n\nThe claim C.2a holds by construction. 
To show C.2b, we establish an even stronger property of the weight update function g_w(θ, w): for any fixed θ > 0, the function w_1 ↦ g_w(θ, w_1) has at most one other fixed point besides w_1 = 0 and w_1 = 1, and, most importantly, it has a unique stable fixed point. This is formalized in the following lemma.\n\nLemma 3 (Proved in Appendix B.2.2). For all θ > 0, there are at most three fixed points of g_w(θ, w_1) with respect to w_1. Further, there exists a unique stable fixed point F_w(θ) ∈ (0, 1], i.e., (i) F_w(θ) = g_w(θ, F_w(θ)), and (ii) for all w_1 ∈ (0, 1), we have\n\n    g_w(\\theta, w_1) > w_1 \\iff w_1 < F_w(\\theta) \\quad \\text{and} \\quad g_w(\\theta, w_1) < w_1 \\iff w_1 > F_w(\\theta).    (10)\n\nWe explain Lemma 3 in Figure 1. Note that, in the figure, we observe that g_w is an increasing function with g_w(θ, 0) = 0 and g_w(θ, 1) = 1. Further, it is either a concave function or a piecewise concave-then-convex function (footnote 3). Hence, if ∂g_w(θ, w_1)/∂w_1 at w_1 = 1 is at most 1, the only stable fixed point is w_1 = 1; else, if the derivative is larger than 1, there exists only one fixed point in (0, 1) and it is the only stable fixed point. The complete proof for C.2b is given in Appendix B.3.\n\nThe final step in applying Lemma 2 is to prove C.2c. However, (θ, w_1) = ((2w_1^* − 1)θ*, 1) is a point on the reference curve r, and θ = (2w_1^* − 1)θ* is a stable fixed point of g_θ(θ, 1). This violates C.2c. To address this issue, since we can characterize the shape and the number of fixed points of g_w, by typical uniform continuity arguments we can find δ, ε > 0 such that the adjusted reference curve r_adj(w) := r(w) − ε · max(0, w − 1 + δ) satisfies C.2a and C.2b. 
Then we can prove that the adjusted reference curve r_adj(w) satisfies C.2c; see Appendix B.3.1.\n\n3 Numerical results\n\nIn this section, we present numerical results that show the value of over-parameterization in some mixture models not covered by our theoretical results.\n\n3.1 Setup\n\nOur goal is to analyze the effect of the sample size, mixing weights, and the number of mixture components on the success of the two EM algorithms described in Section 2.1.\n\nWe implement EM for both Model 1 (where the weights are assumed to be known) and Model 2 (where the weights are not known), and run the algorithm multiple times with random initial mean estimates. We compare the two versions of EM by their (empirical) success probabilities, which we denote by P1 and P2, respectively. Success is defined in two ways, depending on whether EM is run with a finite sample, or with an infinite-size sample (i.e., the population analogue of EM).\n\n3 There exists w̃ ∈ (0, 1) such that g_w(θ, w) is concave in [0, w̃] and convex in [w̃, 
1].\n\n[Figure 2: Left panel: the landscape of the log-likelihood objective for Population EM2, annotated with the global maxima and the point that is a local maximum only in the direction of θ. Right panel: an illustration of the geometric relations used in claims C.1 and C.2.]\n
aqx0/9T3+uWKW3XnIKvEy0kFcjT65a/eIGZphNIwQbXuem5i/Iwqw5nAaamXakwoG9Mhdi2VNELtZ/NTp+TMKgMSxsqWNGSu/p7IaKT1JApsZ0TNSC97M/E/r5ua8MrPuExSg5ItFoWpICYms7/JgCtkRkwsoUxxeythI6ooMzadkg3BW355lbQuqp5b9e4uK/XrPI4inMApnIMHNajDLTSgCQyG8Ayv8OYI58V5dz4WrQUnnzmGP3A+fwAKKo2f\n\nAAAB6nicbVBNS8NAEJ3Ur1q/qh69LBbBU0lEqMeiF48V7Qe0oWy2k3bpZhN2N0oJ/QlePCji1V/kzX/jts1BWx8MPN6bYWZekAiujet+O4W19Y3NreJ2aWd3b/+gfHjU0nGqGDZZLGLVCahGwSU2DTcCO4lCGgUC28H4Zua3H1FpHssHM0nQj+hQ8pAzaqx0/9T3+uWKW3XnIKvEy0kFcjT65a/eIGZphNIwQbXuem5i/Iwqw5nAaamXakwoG9Mhdi2VNELtZ/NTp+TMKgMSxsqWNGSu/p7IaKT1JApsZ0TNSC97M/E/r5ua8MrPuExSg5ItFoWpICYms7/JgCtkRkwsoUxxeythI6ooMzadkg3BW355lbQuqp5b9e4uK/XrPI4inMApnIMHNajDLTSgCQyG8Ayv8OYI58V5dz4WrQUnnzmGP3A+fwAKKo2f\nAAAB6nicbVBNS8NAEJ3Ur1q/qh69LBbBU0lEqMeiF48V7Qe0oWy2k3bpZhN2N0oJ/QlePCji1V/kzX/jts1BWx8MPN6bYWZekAiujet+O4W19Y3NreJ2aWd3b/+gfHjU0nGqGDZZLGLVCahGwSU2DTcCO4lCGgUC28H4Zua3H1FpHssHM0nQj+hQ8pAzaqx0/9T3+uWKW3XnIKvEy0kFcjT65a/eIGZphNIwQbXuem5i/Iwqw5nAaamXakwoG9Mhdi2VNELtZ/NTp+TMKgMSxsqWNGSu/p7IaKT1JApsZ0TNSC97M/E/r5ua8MrPuExSg5ItFoWpICYms7/JgCtkRkwsoUxxeythI6ooMzadkg3BW355lbQuqp5b9e4uK/XrPI4inMApnIMHNajDLTSgCQyG8Ayv8OYI58V5dz4WrQUnnzmGP3A+fwAKKo2f\nAAAB6nicbVBNS8NAEJ3Ur1q/qh69LBbBU0lEqMeiF48V7Qe0oWy2k3bpZhN2N0oJ/QlePCji1V/kzX/jts1BWx8MPN6bYWZekAiujet+O4W19Y3NreJ2aWd3b/+gfHjU0nGqGDZZLGLVCahGwSU2DTcCO4lCGgUC28H4Zua3H1FpHssHM0nQj+hQ8pAzaqx0/9T3+uWKW3XnIKvEy0kFcjT65a/eIGZphNIwQbXuem5i/Iwqw5nAaamXakwoG9Mhdi2VNELtZ/NTp+TMKgMSxsqWNGSu/p7IaKT1JApsZ0TNSC97M/E/r5ua8MrPuExSg5ItFoWpICYms7/JgCtkRkwsoUxxeythI6ooMzadkg3BW355lbQuqp5b9e4uK/XrPI4inMApnIMHNajDLTSgCQyG8Ayv8OYI58V5dz4WrQUnnzmGP3A+fwAKKo2f\n\nFigure 2: Left panel: The landscapes of log-likelihood objectives for Population EM1 and\nPopulation EM2 with (\u2713\u21e4, w\u21e41) = (1, 0.4) are shown in the black belt and the yellow surface re-\nspectively. The two green points indicates the two global maxima of Population EM2, one of which\nis also the global maximum of Population EM1. 
The purple point indicates the local maximum of\nPopulation EM1. Over-parameterization helps us to escape the local maximum through the direction\nof w1. Right panel: The \ufb01xed point curves for functions g\u2713 and gw are shown with red and blue lines\nrespectively. The green point at the intersections of the three curves is the correct convergence point\n(\u2713?, w?). The black dotted curve shows the reference curve r. The cross points \u21e5 are the possible\ninitializations and the plus points + are the corresponding positions after the \ufb01rst iteration. By the\ngeometric relations between the three curves, the iterations have to converge to (\u2713?, w?)\n\nWhen EM is run using a \ufb01nite sample, we do not expect recover the \u2713\u21e4i 2 Rd exactly. Hence, success\nis declared when the \u2713\u21e4i are recovered up to some expected error, according to the following measure:\n\nerror = min\n\u21e12\u21e7\n\nlim\nt!1\n\nkXi=1\n\nw\u21e4i k\u2713hti\u21e1(i) \u2713\u21e4ik2,\n\n(11)\n\nwhere \u21e7 is the set of all possible permutations on {1, . . . , k}. We declare success if the error is at\nmost C\u270f/n, where C\u270f := 4 \u00b7 Tr(W \u21e4I1(\u21e5\u21e4)). Here, W \u21e4 is the diagonal matrix whose diagonal is\n(w\u21e41, . . . , w\u21e41, . . . , w\u21e4k, . . . , w\u21e4k) 2 Rkd, where each w\u21e4i is repeated d times, and I(\u21e5\u21e4) is the Fisher\nInformation at the true value \u21e5\u21e4 := (\u2713\u21e41, . . . , \u2713\u21e4k). We adopt this criteria since it is well known that\nthe MLE asymptotically converges to N (\u2713\u21e4,I1(\u21e5\u21e4)/n). Thus, constant 4 \u21e1 1.962 indicates an\napproximately 95% coverage.\nWhen EM is run using an in\ufb01nite-size sample, we declare EM successful when the error de\ufb01ned\nin (11) is at most 107.\n\n3.2 Mixtures of two Gaussians\nWe \ufb01rst consider mixtures of two Gaussians in one dimension, i.e., \u2713\u21e41,\u2713 \u21e42 2 R. 
Unlike in our theoretical analysis, the mixture components are not constrained to be symmetric about the origin. For simplicity, we always set θ*_1 = 0, but this information is not used by EM. Further, we consider sample sizes n ∈ {1000, ∞}, separations θ*_2 = |θ*_2 − θ*_1| ∈ {1, 2, 4}, and mixing weights w*_1 ∈ {0.52, 0.7, 0.9}; this gives a total of 18 cases. For each case, we run EM with 2500 random initializations and compute the empirical probability of success. When n = 1000, the initial mean parameters are chosen uniformly at random from the sample. When n = ∞, the initial mean parameters are chosen uniformly at random from the rectangle [−2, θ*_2 + 2] × [−2, θ*_2 + 2].
A subset of the success probabilities is shown in Table 1; see Appendix F for the full set of results. Our simulations lead to the following empirical findings about the behavior of EM on data from well-separated mixtures (|θ*_1 − θ*_2| ≥ 1). First, for n = ∞, EM for Model 2 finds the MLE almost always (P2 = 1), while EM for Model 1 succeeds only about half the time (P1 ≈ 0.5). Second, for smaller n, EM for Model 2 still has a higher chance of success than EM for Model 1, except when the weights w*_1 and w*_2 are almost equal. When w*_1 ≈ w*_2 ≈ 1/2, the bias in Model 1 is not big enough to stand out from the error due to the finite sample, and hence Model 1 is preferable. Notably, unlike the special model in (2), highly unbalanced weights do not help EM for Model 1, due to the lack of symmetry of the component means (i.e., we may have θ*_1 + θ*_2 ≠ 0).
We conclude that over-parameterization helps EM if the two mixture components are well-separated and the mixing weights are not too close.

Table 1: Success probabilities for EM on Model 1 and Model 2 (denoted P1 and P2, respectively), reported as P1 / P2.

Success probabilities for mixtures of two Gaussians (Section 3.2), separation θ*_2 − θ*_1 = 2:

  Sample size | w*_1 = 0.52   | w*_1 = 0.7    | w*_1 = 0.9
  n = 1000    | 0.799 / 0.500 | 0.497 / 0.800 | 0.499 / 0.899
  n = ∞       | 0.504 / 1.000 | 0.514 / 1.000 | 0.506 / 1.000

Success probabilities for mixtures of three or four Gaussians (Section 3.3):

  Case 1        | Case 2        | Case 3        | Case 4
  0.164 / 0.900 | 0.167 / 1.000 | 0.145 / 0.956 | 0.159 / 0.861

3.3 Mixtures of three or four Gaussians
We now consider a setup with mixtures of three or four Gaussians.
Specifically, we consider the following four cases, each using a larger sample size of n = 2000:

• Case 1, mixture of three Gaussians on a line: θ*_1 = (−3, 0), θ*_2 = (0, 0), θ*_3 = (2, 0), with weights w*_1 = 0.5, w*_2 = 0.3, w*_3 = 0.2.
• Case 2, mixture of three Gaussians on a triangle: θ*_1 = (−3, 0), θ*_2 = (0, 2), θ*_3 = (2, 0), with weights w*_1 = 0.5, w*_2 = 0.3, w*_3 = 0.2.
• Case 3, mixture of four Gaussians on a line: θ*_1 = (−3, 0), θ*_2 = (0, 0), θ*_3 = (2, 0), θ*_4 = (5, 0), with weights w*_1 = 0.35, w*_2 = 0.3, w*_3 = 0.2, w*_4 = 0.15.
• Case 4, mixture of four Gaussians on a trapezoid: θ*_1 = (−3, 0), θ*_2 = (−1, 2), θ*_3 = (2, 0), θ*_4 = (2, 2), with weights w*_1 = 0.35, w*_2 = 0.3, w*_3 = 0.2, w*_4 = 0.15.

The other aspects of the simulations are the same as in the previous subsection.
The results are presented in Table 1. From the table, we confirm that EM for Model 2 (with unknown weights) has a higher success probability than EM for Model 1 (with known weights); over-parameterization therefore helps in all four cases.

3.4 Explaining the disparity
As discussed in Section 2.2, the performance of the EM algorithm with unknown weights does not depend on the ordering of the initial means. We conjecture that, in general, this property, a consequence of over-parameterization, is what produces the boost observed in the performance of EM with unknown weights.
We support this conjecture by revisiting the previous simulations with a different way of running EM for Model 1. For each set of k vectors selected to be used as initial component means, we run EM k! times, each using a different one-to-one assignment of these vectors to initial component means.
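The bookkeeping for these k! restarts can be sketched as follows; `run_em` and `error` are hypothetical stand-ins for an EM routine and the error measure (11), not functions from the paper:

```python
from itertools import permutations

def best_over_assignments(run_em, error, init_means, data):
    """Run EM once for each of the k! one-to-one assignments of the
    candidate vectors to initial component means; keep the lowest error."""
    best = float("inf")
    for perm in permutations(init_means):  # all k! orderings
        best = min(best, error(run_em(data, list(perm))))
    return best

# Toy stand-ins to illustrate the bookkeeping only: this "EM" returns its
# initialization unchanged, and the error is squared distance to (0, 1).
run_em = lambda data, mu0: mu0
error = lambda mu: (mu[0] - 0.0) ** 2 + (mu[1] - 1.0) ** 2
best = best_over_assignments(run_em, error, [1.0, 0.0], data=None)  # 0.0
```

With real EM in place of the stand-in, the minimum over orderings is what defines P3 below.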
We measure the empirical success probability P3 based on the lowest observed error among these k! runs of EM. The results are presented in Table 3 in Appendix F. In general, we observe P3 ≳ P2 for all cases we have studied, which supports our conjecture. However, this procedure is generally more time-consuming than EM for Model 2, since k! executions of EM are required.

Acknowledgements
DH and JX were partially supported by NSF awards DMREF-1534910 and CCF-1740833, and JX was also partially supported by a Cheung-Kong Graduate School of Business Fellowship. We thank Jiantao Jiao for a helpful discussion about this problem.