{"title": "Theoretical guarantees for EM under misspecified Gaussian mixture models", "book": "Advances in Neural Information Processing Systems", "page_first": 9681, "page_last": 9689, "abstract": "Recent years have witnessed substantial progress in understanding\n the behavior of EM for mixture models that are correctly specified.\n Given that model misspecification is common in practice, it is\n important to understand EM in this more general setting. We provide\n non-asymptotic guarantees for population and sample-based EM for\n parameter estimation under a few specific univariate settings of\n misspecified Gaussian mixture models. Due to misspecification, the\n EM iterates no longer converge to the true model and instead\n converge to the projection of the true model over the set of models\n being searched over. We provide two classes of theoretical\n guarantees: first, we characterize the bias introduced due to the\n misspecification; and second, we prove that population EM converges\n at a geometric rate to the model projection under a suitable\n initialization condition. This geometric convergence rate for\n population EM imply a statistical complexity of order $1/\\sqrt{n}$\n when running EM with $n$ samples. We validate our theoretical\n findings in different cases via several numerical examples.", "full_text": "Theoretical guarantees for the EM algorithm when\napplied to mis-speci\ufb01ed Gaussian mixture models\n\nRaaz Dwivedi(cid:63) Nhat Ho(cid:63) Koulik Khamaru(cid:63)\n\nUC Berkeley\n\n{raaz.rsk, minhnhat, koulik}@berkeley.edu\n\nMartin J. Wainwright\n\nUC Berkeley\nVoleon Group\n\nwainwrig@berkeley.edu\n\nMichael I. Jordan\n\nUC Berkeley\n\njordan@berkeley.edu\n\nAbstract\n\nRecent years have witnessed substantial progress in understanding the behavior\nof EM for mixture models that are correctly speci\ufb01ed. Given that model mis-\nspeci\ufb01cation is common in practice, it is important to understand EM in this more\ngeneral setting. 
We provide non-asymptotic guarantees for the population and sample-based EM algorithms when used to estimate parameters of certain mis-specified Gaussian mixture models. Due to mis-specification, the EM iterates no longer converge to the true model and instead converge to the projection of the true model onto the fitted model class. We provide two classes of theoretical guarantees: (a) a characterization of the bias introduced due to the mis-specification; and (b) guarantees of geometric convergence of the population EM to the model projection given a suitable initialization. This geometric convergence rate for population EM implies that the EM algorithm based on n samples converges to an estimate with 1/√n accuracy. We validate our theoretical findings in different cases via several numerical examples.

1 Introduction

Mixture models play a central role in statistical applications, where they are used to capture heterogeneity of data arising from several underlying subpopulations. However, estimating the parameters of mixture models is a challenging task, due to the non-convexity of the log-likelihood function. As shown by classical work, the maximum likelihood estimate (MLE) often has good properties for mixture models, but its computation can be non-trivial. One of the most popular algorithms used to compute the MLE (approximately) is the expectation-maximization (EM) algorithm. Although EM is widely used in practice, it does not always converge to the MLE, and its convergence rate can vary as a function of the problem. Classical results provide guarantees about the convergence rates of EM to local maxima [4, 16]. In the specific setting of Gaussian mixtures, population EM (idealized EM with infinite samples) was shown to exhibit a range of behavior, from super-linear convergence to slow convergence resembling a first-order method, depending on the overlap between the mixture components [9, 18].
More recently, there has been renewed interest in providing explicit and non-asymptotic guarantees on the convergence of EM. Notably, Balakrishnan et al. [1] developed a rather general framework for characterizing the convergence of EM. For well-specified problems, including the two-component Gaussian location mixture as a particular example, they provided sufficient conditions for the EM algorithm to converge to a small neighborhood of the global maximum; in addition, they provided explicit bounds on the sample complexity of EM, meaning the number of samples n required, as a function of the tolerance ε, the problem dimension, and other parameters, to achieve an ε-accurate solution. A line of follow-up work has generalized and extended results of this type (e.g., see the papers [20, 15, 7, 17, 3, 19, 5, 2]).

A shared assumption common to this body of past work is that either the true distribution of each subpopulation is known, or that the number of components is exactly known; in practice, both of these conditions are often violated. In such settings, it is well known that the MLE, instead of approximating the true parameter, approximates a Kullback-Leibler (KL) projection of the data-generating distribution onto the fitted model class. Thus, the MLE exhibits a desirable form of robustness to model mis-specification.

On the other hand, it is not obvious a priori that this robustness need be shared by the solutions returned by the EM algorithm. Since these solutions are those actually used in practice, it is important to understand under what conditions the EM algorithm, when applied with mis-specified models, converges to an (approximate) KL projection.

*Raaz Dwivedi, Nhat Ho, and Koulik Khamaru contributed equally to this work.

32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montréal, Canada.
The main contribution of this paper is to provide some precise answers to this question, and moreover to quantify the bias that arises from model mis-specification. Our analysis focuses on two classes of mis-specified mixture models.

• Under-specified number of components: Suppose that the true model is a location-shifted mixture of k ≥ 3 univariate Gaussians, but we use EM to fit a location-shifted Gaussian mixture with k − 1 components. This scenario is very common: it arises naturally when either the mixture components are very close or some of the mixture weights are very small, so that the data-generating distribution appears to have fewer components. Analysis of the EM algorithm when the fitted distribution has fewer mixture components than the data-generating distribution poses new challenges; in particular, it requires an understanding of the model bias, meaning the Kullback-Leibler discrepancy between the true model and its projection(s) onto the class of fitted models. In this paper, we provide a detailed analysis of the k = 3 case. First, we characterize the model bias induced by fitting a two-component mixture to a three-component mixture with unknown means but known variance. We then provide sufficient conditions for the population EM updates to converge at a geometric rate to the KL projection of the true model onto the fitted model class. Finally, using Rademacher-complexity-based arguments and the geometric convergence of population EM, we conclude that with high probability, the EM updates with n samples converge to a ball of radius 1/√n around the aforementioned KL projection.

• Incorrectly specified weights or variances: In our second problem class, we assume that the number of components is correctly specified, but either the mixture weights or the variances are mis-specified.
Concretely, suppose that the true model is a two-component location-shifted Gaussian mixture with weights/variances that differ from those in the fitted model class. Our analysis reveals a rather surprising phenomenon with respect to EM convergence: despite the potential non-convexity of the problem, the iterates converge at a geometric rate to a unique fixed point from an arbitrary initialization. Our results suggest that the projection of the true model onto the fitted model class is actually unique. Finally, we prove that the sample-based EM updates achieve the standard minimax convergence rate of order 1/√n.

Table 1 provides a high-level summary of our results, where we use (θ, σ, α) to denote a Gaussian mixture component with mean θ, variance σ² and weight α, i.e., αN(θ, σ²).

The remainder of our paper is organized as follows. In Section 2, we introduce the problem set-up and provide background information on the EM algorithm. In Section 3, we present our results for the first framework and provide expressions for the bias and the rate of convergence of EM for different three-component Gaussian mixtures. Section 4 contains results for the case where the mixture weights and variance are mis-specified. Numerical experiments illustrating our theoretical results are presented in Section 5. Finally, in Section 6, we conclude the paper with a discussion of our results and a few possible avenues for future work.

Notation: We use c, c′, c₁, c₂ to denote universal constants whose values may vary in different contexts. For two distributions P and Q, the Kullback-Leibler divergence between them is denoted by KL(P, Q).
We use standard big-O notation to depict the scaling with respect to a particular quantity, hiding constants and other problem parameters.

• True model: 3-component mixture (−θ*(1+ρ), σ, 1/4); (−θ*(1−ρ), σ, 1/4); (θ*, σ, 1/2).
  Best fit with two components: (−θ̄, σ, 1/2); (θ̄, σ, 1/2); σ known.
  Bias: ρ|θ*| + c (ρ|θ*|/σ)^{1/4}.
  Statistical error |θ̂_n − θ̄| of sample EM: n^{−1/2}.

• True model: 3-component mixture (−θ*, σ, (1−ω)/2); (θ*, σ, (1−ω)/2); (0, σ, ω).
  Best fit with two components: (−θ̄, σ, 1/2); (θ̄, σ, 1/2); σ known.
  Bias: c ω^{1/8} |θ*|^{1/4} / (σ^{1/4} √(1−ω)).
  Statistical error |θ̂_n − θ̄| of sample EM: n^{−1/2}.

• True model: 2-component mixture (−θ*, √(σ²−θ*²), 1/2); (θ*, √(σ²−θ*²), 1/2).
  Best fit with two components: (−θ̄, σ, π); (θ̄, σ, 1−π); σ and π ≠ 1/2 known.
  Bias: c|θ*| ((2−4π) + θ*²)^{1/2} / σ.
  Statistical error |θ̂_n − θ̄| of sample EM: n^{−1/2}.

Table 1. Summary of the main theoretical guarantees of this paper. Here θ* denotes the true parameter value (in the data-generating distribution), θ̄ denotes the parameter of the best-fit model, and θ̂_n denotes the estimate returned by running the EM algorithm. Recall that the true model is not in the class of fitted models, and we can only hope to estimate θ̄; consequently, the table lists the performance of the EM algorithm in estimating θ̄ in different settings. The first column lists the true model, while the second column shows the fitted model.
In the third column, we summarize the bias of the parameter of the best-fit model (2). When using EM with n samples, the final statistical error |θ̂_n − θ̄| has the statistical rate of order n^{−1/2} in all cases, as depicted in the fourth column (here θ̂_n denotes the final sample EM estimate).

2 Problem set-up

Throughout this paper, we assume that the data are generated according to some true distribution P*, which admits a continuous density over ℝ. We are interested in the performance of the EM algorithm when we fit the data using a two-component mixture of location-shifted Gaussians with known variance σ² and known mixture weight π ∈ (0, 1):

    P_θ = π N(θ, σ²) + (1 − π) N(−θ, σ²).    (1)

We consider two distinct settings of the mixture weights in model (1):

• Balanced mixtures: the mixture weights are assumed to be equal, i.e., π = 1 − π = 1/2.
• Unbalanced mixtures: the mixture weights are assumed to be unequal, π = ½(1 − Δ) and 1 − π = ½(1 + Δ), where |Δ| ∈ (0, 1).

In order to estimate the location parameter, we apply the EM algorithm, allowing θ to vary over some compact set Θ. Since the true distribution P* may not belong to the class of fitted models, the best possible estimate is the projection of P* onto the fitted model class (1), given by

    θ̄ ∈ arg min_{θ∈Θ} KL(P*, P_θ).    (2)

Our main goal in this paper is to establish the convergence rate of the EM updates to θ̄ for various choices of the data-generating model P* and the fitted model (1).

2.1 EM algorithm for two-component location-Gaussian mixtures

Let us now introduce some notation as well as a brief description of the EM algorithm for the two-component Gaussian location mixture (1).
The population version of EM is based on the function

    Q(θ′; θ) := −½ E[ w_θ(X) (X − θ′)² + (1 − w_θ(X)) (X + θ′)² ],    (3)

where the expectation is taken over the true distribution P*. For any fixed θ, the M-step in the EM updates for model (1) is obtained by maximizing the minorization function (3); for a detailed derivation, see the paper [1]. More precisely, we denote the population EM operator M : ℝ → ℝ by

    M(θ) := arg max_{θ′} Q(θ′; θ) = E[(2 w_θ(X) − 1) X],    (4a)

where the weighting function w_θ in the above formulation is given by

    w_θ(x) := π exp(−(θ−x)²/(2σ²)) / [ π exp(−(θ−x)²/(2σ²)) + (1 − π) exp(−(θ+x)²/(2σ²)) ].    (4b)

Note that the parameter θ̄ defined in equation (2) minimizes the KL divergence between the fitted model and the true model, thereby ensuring that the population log-likelihood is maximized at the model indexed by θ̄. Consequently, θ̄ is a fixed point of the population EM update, that is, M(θ̄) = θ̄. The sample version of the EM algorithm, the method actually used in practice, is obtained by simply replacing the expectations in equations (3) and (4a) by their sample-based counterparts. In particular, given a set of n i.i.d.
samples {X_i}_{i=1}^n from the true model, the sample EM operator M_n : ℝ → ℝ takes the form

    M_n(θ) := (1/n) Σ_{i=1}^n (2 w_θ(X_i) − 1) X_i.    (5)

With this notation in place, we are now ready to state our main results.

3 Guarantees for the EM algorithm with a mis-specified number of components

In this section, we study the convergence of the EM algorithm in the setting of under-fitted mixtures, where the number of components in the true model is larger than that in the fitted model. In sharp contrast to the traditional setting of correctly specified mixture models, where the number of components of the true model is known to the EM algorithm, we analyze the performance of the EM algorithm when the true number of components is not known. Such a scenario occurs naturally in many practical cases; examples include: (1) some components in the mixture are very close, making them hard to distinguish; and (2) some components have very small mixture weights, making them difficult to detect. Consequently, in such situations, the number of components observed from the data may be much smaller than the number of components present in the true model. In this section, we characterize the bias of the two-component fit and analyze the convergence properties of EM for such a fit.

3.1 Three-component mixtures with two close components

First, we consider the case where the true distribution P* is a three-component Gaussian location mixture given by

    P* = ¼ N(−θ*(1 + ρ), σ²) + ¼ N(−θ*(1 − ρ), σ²) + ½ N(θ*, σ²)    (6)

for some θ* in a compact subset Θ of the real line, and a small positive scalar ρ that characterizes the separation between the two cluster means −θ*(1 + ρ) and −θ*(1 − ρ).
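To make these updates concrete, the following sketch (our own illustration, not code from the paper) implements the weighting function (4b) and the sample EM operator (5) for the balanced fit, and runs the resulting iteration on data drawn from a three-component mixture of the form (6), with the illustrative choices θ* = 2, σ = 1 and ρ = 0.1:

```python
import numpy as np

def em_weight(x, theta, sigma=1.0, pi=0.5):
    """Posterior weight w_theta(x) of the mean-theta component, as in (4b).
    Written as a logistic function of 2*theta*x/sigma^2 for numerical stability."""
    log_ratio = np.log((1 - pi) / pi) - 2 * theta * x / sigma ** 2
    return 1.0 / (1.0 + np.exp(log_ratio))

def sample_em_step(x, theta, sigma=1.0, pi=0.5):
    """One sample EM update M_n(theta) = (1/n) sum_i (2 w_theta(X_i) - 1) X_i, as in (5)."""
    return np.mean((2 * em_weight(x, theta, sigma, pi) - 1) * x)

# Data from the three-component model (6): weights (1/4, 1/4, 1/2) on the
# means -theta*(1+rho), -theta*(1-rho), theta*, all with unit variance.
rng = np.random.default_rng(0)
n, theta_star, rho = 20_000, 2.0, 0.1
means = np.array([-theta_star * (1 + rho), -theta_star * (1 - rho), theta_star])
comp = rng.choice(3, size=n, p=[0.25, 0.25, 0.5])
x = means[comp] + rng.normal(size=n)

# Run sample EM from an initialization in the positive basin; the iterates
# stabilize at a fixed point of M_n, which lies near the projection parameter.
theta = 1.5
for _ in range(50):
    theta = sample_em_step(x, theta)
```

With ρ small and the SNR large, the final iterate sits close to ±θ*, in line with the bias bound of Proposition 1 below; the iteration itself is only a few lines because the M-step (4a) has a closed form.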
For \ufb01tting the model, we assume\nthat the variance \u03c32 is known, and we suspect that the true model is a two-component mixture (since\n\u03c1 is small). Consequently, we \ufb01t the data with the model\n\n4N (\u2212\u03b8\u2217(1 \u2212 \u03c1), \u03c32) +\n\n1\n\n2N (\u03b8\u2217, \u03c32)\n\n1\n\nP\u03b8 =\n\n1\n2N (\u2212\u03b8, \u03c32) +\n\n1\n2N (\u03b8, \u03c32),\n\n(7)\n\nand we use the EM algorithm to estimate the location parameter \u03b8. Clearly, the performance of\nmodel (7) and consequently the EM algorithm depends on the relationship between the separation\nfactor \u03c1 and the SNR \u03b7 := |\u03b8\u2217\n| /\u03c3 of the true model (6). Since the true model does not belong in the\nfamily of two components location-Gaussian mixtures in model class (7), the role of the projection\nparameter \u03b8 \u2208 arg min\u03b8\u2208\u0398 KL(P\u2217, P\u03b8) becomes crucial. In the next proposition, we provide an\nexplicit bound for the bias between \u03b8 and \u03b8\u2217 as a function of the problem parameters.\n\n4\n\n\fProposition 1. 
Given the true model (6) and any ρ > 0, we have

    min{ |θ* − θ̄|, |θ* + θ̄| } ≤ ρ|θ*| + c (ρ|θ*|/σ)^{1/4},    (8)

where c is a universal positive constant that depends only on the set Θ.

In order to simplify our results in the sequel, we assume that η = |θ*|/σ ≥ 1 and use a simpler bound on the bias, namely

    min{ |θ* − θ̄|, |θ* + θ̄| } ≤ (ρ + ρ^{1/4}/(η^{3/4} σ)) |θ*| ≤ (ρ + ρ^{1/4}/σ) |θ*|.    (9)

The bound above directly implies that |θ̄| belongs to the interval [(1 − C_ρ)|θ*|, (1 + C_ρ)|θ*|], assuming that C_ρ := ρ + ρ^{1/4}/σ ≤ 1. As ρ → 0, we have C_ρ → 0, implying that |θ*| and |θ̄| are almost identical. In the sequel, we utilize this precise control of |θ̄| in terms of |θ*|, provided by Proposition 1, to analyze the behavior of the EM algorithm in a neighborhood of θ̄. Defining ρ* := sup{ρ > 0 | C_ρ ≤ 1/9}, the following result characterizes the behavior of the population EM operator for the three-component Gaussian location mixture described by equation (6).

Theorem 1.
There exist universal constants c′, c″ such that the population EM operator for model (6) with ρ ≤ ρ* and η ≥ c′ satisfies

    |M(θ) − θ̄| = E|2(w_θ(X) − w_θ̄(X)) X| ≤ γ |θ − θ̄|,

for any θ ∈ B(θ̄, |θ̄|/4).

In words, Theorem 1 establishes that the population EM iterates (in the idealized, infinite-data limit) are γ-contractive with respect to θ̄ over the ball B(θ̄, |θ̄|/4), where γ ≤ e^{−c″η²}. Combining this result with the condition C_ρ ≤ 1/9, we can demonstrate that |θ̄| is unique (see Section A.1.4 in the Appendix). These results have a direct implication for the sample-based version of EM that is implemented in practice. In particular, the next result shows that the EM updates with n samples converge in a constant number of steps to a neighborhood of θ̄.

Corollary 1. Consider any scalar δ ∈ (0, 1), sample size n ≥ c₁ log(1/δ), and starting point θ⁰ ∈ B(θ̄, |θ̄|/4). Then under the assumptions of Theorem 1, the sample-based EM sequence θ^{t+1} = M_n(θ^t) for the model (6) satisfies

    |θ^t − θ̄| ≤ γ^t |θ⁰ − θ̄| + (c₂/(1 − γ)) |θ*| (θ*² + σ²) √(log(1/δ)/n),    (10)

with probability at least 1 − δ, where γ ≤ e^{−c′η²}.

Note that the bound (10) consists of two main terms: the first term captures the geometric convergence of the population EM operator from Theorem 1, while the second term characterizes the radius of convergence in terms of the sample complexity, which is O(√(1/n)).
Therefore, with probability at least 1 − δ, we have

    |θ^T − θ̄| ≤ (c/(1 − γ)) |θ*| (θ*² + σ²) √(log(1/δ)/n)    for T ≥ c′ log( n / (log(1/δ) |θ*| (θ*² + σ²)) ) / log(1/γ),

where c, c′ are universal constants.

3.2 Three-component mixtures with a small weight on one component

Next, we consider the case where the true model P* is a three-component Gaussian location mixture of the form

    P* = ((1 − ω)/2) N(−θ*, σ²) + ω N(0, σ²) + ((1 − ω)/2) N(θ*, σ²).    (11)

In other words, two components are dominant, with means −θ* and θ* respectively, and there is a small component at the origin. For such a model, it is again conceivable to fit the two-component mixture given by equation (7). The primary interest in this setting is driven by the fact that, when ω > 0 is sufficiently small, recovering the third small component centered at the origin is usually hard; consequently, clustering that component with one of the other two may be a good idea. Once again, the convergence of EM is governed by the properties of θ̄, which we characterize in the next proposition.

Proposition 2.
For the three-component location-Gaussian mixture (11), we have

    min{ |θ* − θ̄|, |θ* + θ̄| } ≤ c ω^{1/8} |θ*|^{1/4} / (σ^{1/4} √(1 − ω)),    (12)

where c is a universal positive constant that depends only on the set Θ.

In order to simplify further results, we assume that η := θ*/σ ≥ 1. Then we have min{ |θ* − θ̄|, |θ* + θ̄| } ≤ C_ω |θ*|, where C_ω := c ω^{1/8}/(σ √(1 − ω)). Such a bound on the bias leads to slightly different conditions for the convergence of the EM algorithm for model (11) compared to the EM convergence for model (6). Note that for any fixed variance σ², the function C_ω increases with ω, and C_0 = 0. Let ω* := sup{ω > 0 | C_ω ≤ 1/9}. As with model (6), we analyze the convergence rate of EM under a strong SNR condition on the true model (11). We define γ̃ := γ(η, ω) = (1 − ω) e^{−η²/64} + ω < 1. With the above notation in place, we now establish the contraction of the population EM operator M(θ) for the three-component location-Gaussian mixture (11).

Theorem 2.
For an SNR η ≥ 1 sufficiently large and ω ≤ ω*, and for any θ⁰ ∈ B(θ̄, |θ̄|/4), the population EM operator for the Gaussian mixture (11) satisfies

    |M(θ⁰) − θ̄| = E|2(w_{θ⁰}(X) − w_θ̄(X)) X| ≤ γ̃ |θ⁰ − θ̄|.    (13)

Consequently, the population EM sequence θ^{t+1} = M(θ^t) converges to θ̄ at a linear rate.

The precise expression for the contraction parameter γ̃ provides sufficient conditions for fast convergence of EM, and involves an interesting trade-off between the SNR η and the weight ω. More concretely, if the SNR is large enough, the population EM converges quickly towards the projection θ̄, which is unique in its absolute value (see Section A.1.4 in the Appendix). This fast convergence of the population EM again enables us to derive the following convergence rate for sample-based EM.

Corollary 2. Consider the model (11), and suppose that the assumptions of Theorem 2 hold. For any fixed
For any \ufb01xed\n\n\u03b4 \u2208 (0, 1), \u03b80 \u2208 B(\u03b8,(cid:12)(cid:12)\u03b8(cid:12)(cid:12) /4), if n \u2265 c1 log(1/\u03b4) then the sample EM iterates \u03b8t+1 = Mn(\u03b8t) satisfy\n\nc2\n\n1 \u2212 \u02dc\u03b3 |\u03b8\u2217\n\n|(cid:16)\u03b8\u22172 + \u03c32(cid:17)(cid:114) log(1/\u03b4)\n\nn\n\nwith probability at least 1 \u2212 \u03b4.\nSimilar to the structure of the convergence result of sample EM updates in Corollary 1, the result in\nCorollary 2 also consists of two key terms: the \ufb01rst term is the linear rate of convergence from the\npopulation EM operator in Theorem 2 while the second term characterizes the radius of convergence\n\nin terms of sample complexity, which is of O((cid:112)1/n) after T = O(log n/ log(1/\u02dc\u03b3)) iterations.\n\n4 Robustness of EM for mis-speci\ufb01ed variances and weights\n\n(cid:12)(cid:12)\u03b8t \u2212 \u03b8(cid:12)(cid:12) \u2264 \u02dc\u03b3t(cid:12)(cid:12)\u03b80 \u2212 \u03b8(cid:12)(cid:12) +\n\n1\n\n1\n\nP\u2217 =\n\nIn this section, we focus on establishing the convergence rate of EM under different mis-speci\ufb01ed\nregime of the \ufb01tted model (1). In particular, we assume that the true data distribution P\u2217 is given by:\n(14)\nwhere \u03c3 > 0 is a given positive number, and |\u03b8\u2217\n| \u2208 (0, \u03c3/2) is a true but unknown parameter. Note\nthat the assumption that |\u03b8\u2217\n| \u2208 (0, \u03c3/2) ensures that the variance \u03c32 \u2212 \u03b8\u22172 is bounded away from\nzero. 
We \ufb01t the above model by unbalanced two-component Gaussian location mixture model P\u03b8\ngiven by\n\n2N (\u2212\u03b8\u2217, \u03c32 \u2212 \u03b8\u22172),\n\n2N (\u03b8\u2217, \u03c32 \u2212 \u03b8\u22172) +\n\nP\u03b8 = \u03c0N (\u2212\u03b8, \u03c32) + (1 \u2212 \u03c0)N (\u03b8, \u03c32),\n\n(15)\nwhere \u03c0 := 1\n2 (1 \u2212 \u0001) and |\u0001| \u2208 (0, 1) are known apriori and only the parameter \u03b8 is to be estimated.\nIn the \ufb01tted model P\u03b8, we have mis-speci\ufb01ed the variance \u03c32 and the weight \u03c0, and we wish to\nunderstand the rate of convergence of EM to \u00af\u03b8, where \u00af\u03b8 is the parameter of the model P\u00af\u03b8, and P\u00af\u03b8 is\nthe projection of the true model P\u2217 onto the model class P\u03b8 := {P\u03b8 : \u03b8 \u2208 R}. We emphasize that\nthe main goal here is to see how the mis-speci\ufb01cation with variance and weight affects the statistical\ninference of EM. We choose variance of the form \u03c32 \u2212 \u03b8\u22172 because under this setting, we obtain\ninteresting behavior of EM without rendering the proof too technical. We begin with the \ufb01rst result\nestablishing the global linear convergence rate of population EM to \u03b8.\n\n6\n\n\fTheorem 3. For a two-component Gaussian location mixture model (14) and \ufb01tted model (15), the\npopulation EM operator \u03b8 (cid:55)\u2192 M (\u03b8) satis\ufb01es\n\n(cid:12)(cid:12)M (\u03b8) \u2212 \u03b8(cid:12)(cid:12) \u2264(cid:18)1 \u2212\n\n\u00012\n\n2(cid:19)(cid:12)(cid:12)\u03b8 \u2212 \u00af\u03b8(cid:12)(cid:12) .\n\nHence, the population EM sequence {\u03b8t} converges geometrically to \u03b8 from any initialization \u03b80.\nThere are two interesting features regarding the geometric convergence of population EM updates to\n\u03b8: (1) it does not require an evaluation of bias which was needed for our previous results; (2) it holds\nunder any initialization \u03b80. 
Overall, we conclude that θ̄ is unique, and thereby that the projection of P* onto the model class (15) is unique. Before proceeding to the sample-based convergence of EM, we establish the following upper bound on the bias of the parameter θ̄.

Proposition 3. For the two-component Gaussian location mixture model (14), we have

    min{ |θ̄ − θ*|, |θ̄ + θ*| } ≤ c(θ*, σ) · √( 2(1 − 2π) θ*² + θ*⁴ ),

where c(θ*, σ) is a positive constant depending only on θ*, σ, and the set Θ.

Given the above bound, we obtain the range of |θ̄| as |θ̄| ∈ [(1 − C_{θ*})|θ*|, (1 + C_{θ*})|θ*|], where C_{θ*} := c(θ*, σ) √( 2(1 − 2π) + θ*² ). Equipped with this bound on |θ̄|, we have the following result regarding the convergence of sample-based EM.

Corollary 3. Consider the model (14). Let the radius r > 0, n ≥ c₁ log(1/δ), and θ⁰ ∈ B(θ̄, r). Then the sample-based EM sequence θ^{t+1} = M_n(θ^t) satisfies

    |θ^t − θ̄| ≤ (1 − Δ²/2)^t |θ⁰ − θ̄| + (c₂ ((1 + C_{θ*})|θ*| + r) σ² / Δ²) √(log(1/δ)/n),

with probability at least 1 − δ, where Δ := 1 − 2π.

The proof of Corollary 3 is similar to those of Corollaries 1 and 2; it is therefore omitted.
The last corollary demonstrates that the sample-based EM iterates converge to a ball of radius O(√(1/n)) around θ̄ after T = O( log n / log(1/(1 − Δ²/2)) ) iterations.

5 Simulation studies

In this section, we illustrate our theoretical results with a few numerical experiments. In particular, we use the EM algorithm to fit two-component Gaussian mixtures in the three mis-specified settings considered above. For convenience, we refer to the three settings as follows:

• Case 1 refers to the true model (6) from Section 3.1, namely a three-component Gaussian mixture in which two of the components are very close to each other; the quantity ρ ∈ (0, 1) denotes the extent of weak separation.
• Case 2 refers to the true model (11) from Section 3.2, namely a three-component Gaussian mixture in which one component, at the origin, has very small weight; the quantity ω ∈ (0, 1) denotes the small mixture weight.
• Finally, Case 3 refers to the true model (14) from Section 4, namely the setting where the true model is a two-component Gaussian mixture.

For Cases 1 and 2, we fit the symmetric balanced two-component Gaussian mixture given by equation (7), while for Case 3 we fit the unbalanced two-component Gaussian mixture given by equation (15) for different values of π. Let θ̂_n denote the final sample EM estimate. Since our results establish that population EM converges to θ̄ (2), we use the final iterate from the population EM sequence to estimate the error |θ̂_n − θ̄|.
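The experimental protocol just described can be sketched in a few lines (our own illustration; here θ̄ is approximated by running sample EM on a very large "population" sample, and the error is averaged over repeated draws of size n):

```python
import numpy as np

def em_fit(x, theta0=1.0, iters=100, sigma=1.0):
    """Run the balanced-fit sample EM iteration theta <- mean((2 w(X) - 1) X)."""
    theta = theta0
    for _ in range(iters):
        w = 1.0 / (1.0 + np.exp(-2 * theta * x / sigma ** 2))  # (4b) with pi = 1/2
        theta = np.mean((2 * w - 1) * x)
    return theta

def draw_case1(n, rng, theta_star=2.0, rho=0.1):
    """Case 1 data: model (6), means -theta*(1+rho), -theta*(1-rho), theta*."""
    means = np.array([-theta_star * (1 + rho), -theta_star * (1 - rho), theta_star])
    comp = rng.choice(3, size=n, p=[0.25, 0.25, 0.5])
    return means[comp] + rng.normal(size=n)

rng = np.random.default_rng(0)
theta_bar = em_fit(draw_case1(500_000, rng))  # large-n surrogate for the projection

# Mean error |theta_hat_n - theta_bar| over 50 replications, for increasing n.
errors = {}
for n in (100, 400, 1600, 6400):
    reps = [abs(em_fit(draw_case1(n, rng)) - theta_bar) for _ in range(50)]
    errors[n] = np.mean(reps)
```

On a log-log plot of `errors` against n, the points should fall roughly on a line of slope −1/2, mirroring the parametric n^{−1/2} rate reported in Figure 1(a); the analogous loops for Cases 2 and 3 only change the sampler and the fitted weights.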
We now summarize our key findings:
(i) In Figure 1(a), we observe that in all cases the final statistical error $|\widehat{\theta}_n - \bar{\theta}|$ has the parametric rate $n^{-1/2}$, which verifies the claims of Corollaries 1, 2 and 3.
(ii) In all cases, the population EM sequence converges geometrically (we omit the illustrations for Cases 1 and 2). From Figure 1(b), we note that for Case 3 the linear convergence of the population EM sequence $\theta^{t+1} = M(\theta^t)$ is affected by the extent of unbalancedness: as $\pi \to 0.5$, the rate of decay of the error of the population EM sequence decreases, which is consistent with the contraction result stated in Theorem 3.
(iii) In panels (c) and (d) of Figure 1, we plot the biases for Cases 1 and 2 with respect to $\rho$ and $\omega$, respectively. A least-squares fit on the log-log scale suggests that the bias bounds stated in Propositions 1 and 2 are potentially sub-optimal: the numerical scaling of the bias $|\theta^* - \bar{\theta}|$ is of order $\rho^2$ and $\omega$ for Cases 1 and 2, respectively, which is significantly smaller than the corresponding scalings of order $\rho^{1/4}$ and $\omega^{1/8}$ stated in Propositions 1 and 2. In Appendix B, we illustrate the scaling of the bias with $\theta^*$ in these cases via further simulations.

Figure 1. Plots depicting the behavior of EM when fitting the two-Gaussian mixture (7) for the three mis-specified mixture cases (6), (11) and (14), referred to as Cases 1, 2 and 3, respectively. (a) For all cases, the statistical error $|\widehat{\theta}_n - \bar{\theta}|$ has the parametric rate $n^{-1/2}$. (b) For Case 3, the convergence of the population EM sequence $\theta^{t+1} = M(\theta^t)$ is affected by the mixture weight $\pi$: the convergence rate slows down as $\pi \to 0.5$.
(c) For Case 1, the bias scales quadratically with the extent of weak separation $\rho$ for different values of $\theta^*$. (d) For Case 2, the bias scales linearly with the weight $\omega$ of the third component, for different values of $\theta^*$. Refer to the text for more details.

6 Discussion

In this paper, we analyzed the behavior of the EM algorithm for certain classes of mis-specified mixture models. Analyzing the behavior of the EM algorithm under general mis-specification is challenging, and we view the results in this paper as a first step towards developing a more general framework for the problem. We studied the EM algorithm when it is used to fit Gaussian location mixture models to data generated by mixture models with larger numbers of components and/or differing mixture weights. We considered only univariate mixtures, but we believe that several of our results can be extended to multivariate mixtures. It is also interesting to investigate the behavior of the EM algorithm when it is used to fit models whose scale parameters vary (in addition to the location parameters).
Besides deriving sharper results for the settings considered in this paper, analyzing the behavior of EM for non-Gaussian and more general mixture models is an appealing avenue for future work.

References

[1] S. Balakrishnan, M. J. Wainwright, and B. Yu. Statistical guarantees for the EM algorithm: From population to sample-based analysis. Annals of Statistics, 45:77–120, 2017.

[2] T. T. Cai, J. Ma, and L. Zhang. CHIME: Clustering of high-dimensional Gaussian mixtures with EM algorithm and its optimality. Annals of Statistics, to appear.

[3] C. Daskalakis, C. Tzamos, and M. Zampetakis. Ten steps of EM suffice for mixtures of two Gaussians. In Proceedings of the 2017 Conference on Learning Theory, 2017.

[4] A. P. Dempster, N. M. Laird, and D. B. Rubin. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 39:1–38, 1977.

[5] B. Hao, W. Sun, Y. Liu, and G. Cheng. Simultaneous clustering and estimation of heterogeneous graphical models. Journal of Machine Learning Research, to appear.

[6] P. Heinrich and J. Kahn. Strong identifiability and optimal minimax rates for finite mixture estimation.
Annals of Statistics, 46:2844–2870, 2018.

[7] C. Jin, Y. Zhang, S. Balakrishnan, M. J. Wainwright, and M. I. Jordan. Local maxima in the likelihood of Gaussian mixture models: Structural results and algorithmic consequences. In Advances in Neural Information Processing Systems 29, 2016.

[8] M. Ledoux and M. Talagrand. Probability in Banach Spaces: Isoperimetry and Processes. Springer-Verlag, New York, NY, 1991.

[9] J. Ma, L. Xu, and M. I. Jordan. Asymptotic convergence rate of the EM algorithm for Gaussian mixtures. Neural Computation, 12:2881–2907, 2000.

[10] X. Nguyen. Convergence of latent mixing measures in finite and infinite mixture models. Annals of Statistics, 41(1):370–400, 2013.

[11] H. Teicher. Identifiability of finite mixtures. Annals of Mathematical Statistics, 34:1265–1269, 1963.

[12] A. W. van der Vaart and J. A. Wellner. Weak Convergence and Empirical Processes: With Applications to Statistics. Springer-Verlag, New York, NY, 2000.

[13] R. Vershynin. Introduction to the non-asymptotic analysis of random matrices. arXiv preprint arXiv:1011.3027v7.

[14] C. Villani. Optimal Transport: Old and New. Springer, 2008.

[15] Z. Wang, Q. Gu, Y. Ning, and H. Liu. High-dimensional expectation-maximization algorithm: Statistical optimization and asymptotic normality. In Advances in Neural Information Processing Systems 28, 2015.

[16] C. F. J. Wu. On the convergence properties of the EM algorithm. Annals of Statistics, 11:95–103, 1983.

[17] J. Xu, D. Hsu, and A. Maleki. Global analysis of expectation maximization for mixtures of two Gaussians. In Advances in Neural Information Processing Systems 29, 2016.

[18] L. Xu and M. I. Jordan. On convergence properties of the EM algorithm for Gaussian mixtures. Neural Computation, 8:129–151, 1996.

[19] B. Yan, M. Yin, and P. Sarkar. Convergence of gradient EM on multi-component mixture of Gaussians.
In Advances in Neural Information Processing Systems 30, 2017.

[20] X. Yi and C. Caramanis. Regularized EM algorithms: A unified framework and statistical guarantees. In Advances in Neural Information Processing Systems 28, 2015.