{"title": "Convergence of Gradient EM on Multi-component Mixture of Gaussians", "book": "Advances in Neural Information Processing Systems", "page_first": 6956, "page_last": 6966, "abstract": "In this paper, we study convergence properties of the gradient variant of the Expectation-Maximization algorithm~\cite{lange1995gradient} for Gaussian Mixture Models for an arbitrary number of clusters and arbitrary mixing coefficients. We derive the convergence rate depending on the mixing coefficients, minimum and maximum pairwise distances between the true centers, dimensionality and number of components, and obtain a near-optimal local contraction radius. While there have been some recent notable works that derive local convergence rates for EM in the two-component symmetric mixture of Gaussians, in the more general case the derivations need structurally different and non-trivial arguments. We use recent tools from learning theory and empirical processes to achieve our theoretical results.", "full_text": "Convergence of Gradient EM on Multi-component Mixture of Gaussians

Bowei Yan (University of Texas at Austin, boweiy@utexas.edu), Mingzhang Yin (University of Texas at Austin, mzyin@utexas.edu), Purnamrita Sarkar (University of Texas at Austin, purna.sarkar@austin.utexas.edu)

Abstract

In this paper, we study convergence properties of the gradient variant of the Expectation-Maximization algorithm [11] for Gaussian Mixture Models for an arbitrary number of clusters and arbitrary mixing coefficients. We derive the convergence rate depending on the mixing coefficients, minimum and maximum pairwise distances between the true centers, dimensionality and number of components, and obtain a near-optimal local contraction radius. While there have been some recent notable works that derive local convergence rates for EM in the two-component symmetric mixture of Gaussians, in the more general case the derivations need structurally different and non-trivial arguments.
We use recent tools from learning theory and empirical processes to achieve our theoretical results.

1 Introduction

Proposed by [7] in 1977, the Expectation-Maximization (EM) algorithm is a powerful tool for statistical inference in latent variable models. A famous example is the parameter estimation problem under parametric mixture models. In such models, data is generated from a mixture of a known family of parametric distributions. The mixture component from which a datapoint is generated can be thought of as a latent variable.

Typically the marginal data log-likelihood (which integrates the latent variables out) is hard to optimize, and hence EM iteratively optimizes a lower bound of it and obtains a sequence of estimators. This consists of two steps. In the expectation step (E-step), one computes the expectation of the complete data likelihood with respect to the posterior distribution of the unobserved mixture memberships, evaluated at the current parameter estimates. In the maximization step (M-step), this expectation is maximized to obtain new estimators. EM always improves the objective function. While it is established in [4] that the true parameter vector is the global maximizer of the log-likelihood function, there has been much effort to understand the behavior of the local optima obtained via EM.

When the exact M-step is burdensome, a popular variant of EM, named gradient EM, is widely used. The idea here is to take a gradient step towards the maximum of the expectation computed in the E-step. [11] introduces a gradient algorithm using one iteration of Newton's method and shows that the local properties of gradient EM are almost identical to those of EM.

Early literature [22, 24] mostly focuses on convergence to stationary points or local optima. In [22] it is proven that the sequence of estimators in EM converges to a stationary point when the lower bound function from the E-step is continuous.
In addition, some conditions are derived under which EM converges to local maxima instead of saddle points; but these are typically hard to check.

31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.

A link between EM and gradient methods is forged in [24] via a projection matrix, and the local convergence rate of EM is obtained. In particular, it is shown that for GMMs with well-separated centers, EM achieves a fast convergence rate comparable to that of a quasi-Newton algorithm. While the convergence of EM deteriorates under worse separation, it is observed in [20] that the mixture density determined by the estimator sequence of EM reflects the sample data well.

In recent years, there has been a renewed wave of interest in studying the behavior of EM, especially for GMMs. The global convergence of EM for a mixture of two equal-proportion Gaussian distributions is fully characterized in [23]. For more than two clusters, a negative result on EM and gradient EM being trapped in local minima arbitrarily far away from the global optimum is shown in [9]. For high dimensional GMMs with M components, the parameters are learned by reducing the dimensionality via a random projection in [5]. In [6] a two-round method is proposed, where one first initializes with more than M points and then prunes to get one point in every cluster. It is pointed out in that paper that in high dimensional spaces, when the clusters are well separated, the mixing weight goes to either 0 or 1 after one single update. It is shown in [25, 17] that one can cluster high dimensional sub-gaussian mixtures by semi-definite programming relaxations.

For the convergence rate of the EM algorithm, it is observed in [19] that a very small mixing proportion for one mixture component compared to the others leads to slow convergence.
[2] gives non-asymptotic convergence guarantees for the isotropic, balanced, two-component GMM; their result proves linear convergence of EM if the center is initialized in a small neighborhood of the true parameters. The local convergence result in that paper has a sub-optimal contraction region.

K-means clustering is another widely used clustering method. Lloyd's algorithm for k-means clustering has a similar flavor to EM. At each step, it recomputes the centroids of each cluster and updates the membership assignments alternately. While EM does soft clustering at each step, Lloyd's algorithm obtains a hard clustering. The clustering error of Lloyd's algorithm for an arbitrary number of clusters is studied in [13]. The authors also show local convergence results where the contraction region is less restrictive than that of [2].

We would like to point out that there are many notable algorithms [10, 1, 21] with provable guarantees for estimating mixture models. In [14, 8] the authors propose polynomial time algorithms which achieve an ε-approximation to the k-means loss. A spectral algorithm for learning mixtures of Gaussians is proposed in [21]. We want to point out that our aim is not to come up with a new algorithm for mixture models, but to understand the interplay of model parameters in the convergence of gradient EM for a mixture of Gaussians with M components. As we discuss later, our work also immediately leads to convergence guarantees for stochastic gradient EM. Another important difference is that the aim of these works is recovering the hidden mixture component memberships, whereas our goal is completely different: we are interested in understanding how well EM can estimate the mean parameters under a good initialization.

In this paper, we study the convergence rate and local contraction radius of gradient EM under a GMM with an arbitrary number of clusters and mixing weights, which are assumed to be known.
For simplicity, we assume that the components share the same covariance matrix, which is known. Thus it suffices to carry out our analysis for isotropic GMMs with identity as the shared covariance matrix. We obtain a near-optimal condition on the contraction region, in contrast to [2]'s contraction radius for the mixture of two equal weight Gaussians. We want to point out that, while the authors of [2] provide a general set of conditions to establish local convergence for a broad class of mixture models, the derivation of specific results and conditions on local convergence is tailored to the balance and symmetry of the model.

We follow the same general route: first we obtain conditions for population gradient EM, where all sample averages are replaced by their expected counterparts. Then we translate the population version to the sample one. While the first part is conceptually similar, the general setting calls for more involved analysis. The second step typically makes use of concepts from empirical processes, pairing up Ledoux-Talagrand contraction type arguments with well established symmetrization results. However, in our case, the function is not a contraction as in the symmetric two component case, since it involves the cluster estimates of all M components. Furthermore, the standard analysis of concentration inequalities by McDiarmid's inequality gets complicated because the bounded difference condition is not satisfied in our setting. We overcome these difficulties by taking advantage of recent tools in Rademacher averaging for vector valued function classes, and variants of McDiarmid type inequalities for functions which have bounded differences with high probability.

The rest of the paper is organized as follows.
In Section 2, we state the problem and the notations. In Section 3, we provide the main results on the local convergence rate and region for both population and sample-based gradient EM in GMMs. Sections 4 and 5 provide the proof sketches of the population and sample-based theoretical results, followed by the numerical results in Section 6. We conclude the paper with some discussions.

2 Problem Setup and Notations

Consider a GMM with M clusters in d dimensional space, with weights π = (π_1, ···, π_M). Let µ_i ∈ R^d be the mean of cluster i. Without loss of generality, we assume EX = Σ_i π_i µ_i = 0 and that the known covariance matrix for all components is I_d. Let µ ∈ R^{Md} be the vector stacking the µ_i's vertically. We represent the mixture as X ∼ GMM(π, µ, I_d), which has the density function

p(x|µ) = Σ_{i=1}^M π_i φ(x|µ_i, I_d),

where φ(x; µ, Σ) is the PDF of N(µ, Σ). The population log-likelihood function is L(µ) = E_X log(Σ_{i=1}^M π_i φ(X|µ_i, I_d)), and the Maximum Likelihood Estimator is defined as µ̂_ML = arg max_µ p(X|µ). The EM algorithm is based on using an auxiliary function to lower bound the log-likelihood. Define Q(µ|µ^t) = E_X[Σ_i p(Z = i|X; µ^t) log φ(X; µ_i, I_d)], where Z denotes the unobserved component membership of data point X. The standard EM update is µ^{t+1} = arg max_µ Q(µ|µ^t).
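For concreteness, the generative model and the posterior membership weights p(Z = i|X; µ) just defined can be simulated in a few lines. This is a minimal sketch of our own (the mixture parameters below are illustrative choices, not values from the paper):

```python
import numpy as np

def sample_gmm(n, pis, mus, rng):
    """Draw n points from GMM(pi, mu, I_d): pick a component, add N(0, I_d) noise."""
    z = rng.choice(len(pis), size=n, p=pis)
    return mus[z] + rng.standard_normal((n, mus.shape[1])), z

def responsibilities(X, pis, mus):
    """Posterior membership probabilities p(Z = i | x; mu), i.e. w_i(x; mu)."""
    # log pi_i - ||x - mu_i||^2 / 2, up to the common normalizing constant
    logits = np.log(pis) - 0.5 * ((X[:, None, :] - mus[None]) ** 2).sum(axis=2)
    logits -= logits.max(axis=1, keepdims=True)   # numerical stabilization
    W = np.exp(logits)
    return W / W.sum(axis=1, keepdims=True)

rng = np.random.default_rng(0)
pis = np.array([0.5, 0.3, 0.2])
mus = np.array([[4.0, 0.0], [-4.0, 3.0], [-4.0, -3.0]])
mus = mus - pis @ mus            # recenter so that E[X] = sum_i pi_i mu_i = 0
X, z = sample_gmm(2000, pis, mus, rng)
W = responsibilities(X, pis, mus)
```

Each row of W sums to 1; with well-separated centers the posterior weights are close to hard assignments, which is the regime the analysis below works in.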
Define

w_i(X; µ) = π_i φ(X|µ_i, I_d) / Σ_{j=1}^M π_j φ(X|µ_j, I_d).   (1)

The update step for gradient EM, defined via the gradient operator G(µ^t) : R^{Md} → R^{Md}, is

G(µ^t)^{(i)} := µ^{t+1}_i = µ^t_i + s[∇Q(µ^t|µ^t)]_i = µ^t_i + s E_X[π_i w_i(X; µ^t)(X − µ^t_i)],   (2)

where s > 0 is the step size and (·)^{(i)} denotes the part of the stacked vector corresponding to the ith mixture component. We will also use G_n(µ) to denote the empirical counterpart of the population gradient operator G(µ) defined in Eq. (2). We assume we are given an initialization µ^0_i and the true mixing weight π_i for each component.

2.1 Notations

Define R_max and R_min as the largest and smallest distances between cluster centers, i.e., R_max = max_{i≠j} ‖µ*_i − µ*_j‖ and R_min = min_{i≠j} ‖µ*_i − µ*_j‖. Let π_max and π_min be the maximal and minimal cluster weights, and define κ as κ = π_max/π_min. Standard complexity analysis notation o(·), O(·), Θ(·), Ω(·) will be used. f(n) = Ω̃(g(n)) is short for Ω(g(n)) ignoring logarithmic factors, equivalent to f(n) ≥ C g(n) log^k(g(n)); similarly for the others. We use ⊗ to represent the Kronecker product.

3 Main Results

Despite being a non-convex problem, EM and gradient EM algorithms have been shown to exhibit good convergence behavior in practice, especially with good initializations. However, existing local convergence theory only applies to the two-cluster equal-weight GMM. In this section, we present our main result in two parts. First we show the convergence rate and present a near-optimal radius for the contraction region for population gradient EM. Then in the second part we connect the population
Then in the second part we connect the population\nversion to \ufb01nite sample results using concepts from empirical processes and learning theory.\n\n3.1 Local contraction for population gradient EM\nIntuitively, when \u00b5t equals the ground truth \u00b5\u2217, then the Q(\u00b5|\u00b5\u2217) function will be well-behaved.\nThis function is a key ingredient in [2], where the curvature of the Q(\u00b7|\u00b5) function is shown to\nbe close to the curvature of Q(\u00b7|\u00b5\u2217) when the \u00b5 is close to \u00b5\u2217. This is a local property that only\nrequires the gradient to be stable at one point.\n\n3\n\n\fDe\ufb01nition 1 (Gradient Stability). The Gradient Stability (GS) condition, denoted by GS(\u03b3, a), is\nsatis\ufb01ed if there exists \u03b3 > 0, such that for \u00b5t\n\ni , a) with some a > 0, for \u2200i \u2208 [M ].\n\ni \u2208 B(\u00b5\u2217\n\n(cid:107)\u2207Q(\u00b5t|\u00b5\u2217) \u2212 \u2207Q(\u00b5t|\u00b5t)(cid:107) \u2264 \u03b3(cid:107)\u00b5t \u2212 \u00b5\u2217(cid:107)\n\nThe GS condition is used to prove contraction of the sequence of estimators produced by population\ngradient EM. However, for most latent variable models, it is typically challenging to verify the GS\ncondition and obtain a tight bound on the parameter \u03b3. We derive the GS condition under milder\nconditions (see Theorem 4 in Section 4), which bounds the deviation of the partial gradient evaluated\ni uniformly over all i \u2208 [M ]. This immediately implies the global GS condition de\ufb01ned in\nat \u00b5t\nDe\ufb01nition 1. Equipped with this result, we achieve a nearly optimal local convergence radius for\ngeneral GMMs in Theorem 1. The proof of this theorem can be found in Appendix B.2.\nTheorem 1 (Convergence for Population gradient EM). Let d0 := min{d, M}. 
If R_min = Ω̃(√d_0), with initialization µ^0 satisfying ‖µ^0_i − µ*_i‖ ≤ a for all i ∈ [M], where

a ≤ R_min/2 − √d_0 · O(√(log(max{M²κ/π_min, R_max, d_0}))),

then the population gradient EM converges:

‖µ^t − µ*‖ ≤ ζ^t ‖µ^0 − µ*‖,  where ζ = (π_max − π_min + 2γ)/(π_max + π_min) < 1

and γ = M²(2κ + 4)(2R_max + d_0)² exp(−(R_min/2 − a)² √d_0 / 8) < π_min.

Remark 1. The local contraction radius is largely improved compared to that in [2], which has R_min/8 in the two equal sized symmetric GMM setting. It can be seen that in Theorem 1, a/R_min goes to 1/2 as the signal-to-noise ratio goes to infinity. We will show in simulations that when initialized from some point that lies R_min/2 away from the true center, gradient EM only converges to a stationary point which is not a global optimum. More discussion can be found in Section 6.

3.2 Finite sample bound for gradient EM

In the finite sample setting, as long as the deviation of the sample gradient from the population gradient is uniformly bounded, convergence in the population setting implies convergence in the finite sample scenario. Thus the key ingredient in the proof is to get this uniform bound over all parameters in the contraction region A, i.e. to bound sup_{µ∈A} ‖G^{(i)}(µ) − G^{(i)}_n(µ)‖, where G and G_n are defined in Section 2.

To prove the result, we expand the difference and define the following function for i ∈ [M], where u is a unit vector on the d dimensional sphere S^{d−1}. This appears because we can write the Euclidean
This appears because we can write the Euclidean\nnorm of any vector B, as (cid:107)B(cid:107) = supu\u2208S d\u22121(cid:104)B, u(cid:105).\n\nw1(Xi; \u00b5)(cid:104)Xi \u2212 \u00b51, u(cid:105) \u2212 Ew1(X; \u00b5)(cid:104)X \u2212 \u00b51, u(cid:105).\n\n(3)\n\ngu\ni (X) = sup\n\u00b5\u2208A\n\n1\nn\n\nn(cid:88)\n\ni=1\n\nWe will drop the super and subscript and prove results for gu\nThe outline of the proof is to show that g(X) is close to its expectation. This expectation can be\nfurther bounded via the Rademacher complexity of the corresponding function class (de\ufb01ned below\nin Eq (4)) by the tools like the symmetrization lemma [18].\nConsider the following class of functions indexed by \u00b5 and some unit vector on d dimensional\nsphere u \u2208 S d\u22121:\n\n1 without loss of generality.\n\nF u\ni = {f i : X \u2192 R|f i(X; \u00b5, u) = wi(X; \u00b5)(cid:104)X \u2212 \u00b5i, u(cid:105)}\n\n(4)\nWe need to bound the M functions classes separately for each mixture. Given a \ufb01nite n-sample\n(X1,\u00b7\u00b7\u00b7 , Xn), for each class, we de\ufb01ne the Rademacher complexity as the expectation of empirical\n\n4\n\n\fRademacher complexity.\n\n\u02c6Rn(F u\n\ni ) = E\u0001\n\n\uf8ee\uf8f0sup\n\n\u00b5\u2208A\n\nn(cid:88)\n\nj=1\n\n1\nn\n\n\uf8f9\uf8fb ;\n\n\u0001if i(Xj; \u00b5, u)\n\nRn(F u\n\ni ) = EX \u02c6Rn(F u\ni )\n\nwhere \u0001i\u2019s are the i.i.d. Rademacher random variables.\nFor many function classes, the computation of the empirical Rademacher complexity can be hard.\nFor complicated functions which are Lipschitz w.r.t functions from a simpler function class, one\ncan use Ledoux-Talagrand type contraction results [12]. In order to use the Ledoux-Talagrand con-\ntraction, one needs a 1-Lipschitz function, which we do not have, because our function involves \u00b5i,\ni \u2208 [M ]. Also, the weight functions wi are not separable in terms of the \u00b5i\u2019s. Therefore the classical\ncontraction lemma does not apply. 
In our analysis, we need to introduce a vector-valued function, with each element involving only one µ_i, and apply a recent vector version of the contraction lemma [15]. With some careful analysis, we get the following. The details are deferred to Section 5.

Proposition 1. Let F^u_i be as defined in Eq. (4) for all i ∈ [M]. Then for some universal constant c,

R_n(F^u_i) ≤ c M^{3/2} (1 + R_max)³ √(d max{1, log(κ)}) / √n.

After getting the Rademacher complexity, one needs to use concentration results like McDiarmid's inequality [16] to achieve the finite-sample bound. Unfortunately, for the functions defined in Eq. (4), the martingale difference sequence does not have bounded differences. Hence it is difficult to apply McDiarmid's inequality in its classical form. To resolve this, we instead use an extension of McDiarmid's inequality which can accommodate sequences that have bounded differences with high probability [3].

Theorem 2 (Convergence for sample-based gradient EM). Let ζ be the contraction parameter in Theorem 1, and

ε_unif(n) = Õ(max{n^{−1/2} M³ (1 + R_max)³ √(d max{1, log(κ)}), (1 + R_max) d/√n}).   (5)

If ε_unif(n) ≤ (1 − ζ)a, then sample-based gradient EM satisfies

‖µ̂^t_i − µ*_i‖ ≤ ζ^t ‖µ^0 − µ*‖_2 + ε_unif(n)/(1 − ζ), ∀i ∈ [M],

with probability at least 1 − n^{−cd}, where c is a positive constant.

Remark 2. When data is observed in a streaming fashion, the gradient update can be modified into a stochastic gradient update, where the gradient is evaluated based on a single observation or a small batch.
By the GS condition proved in Theorem 1, combined with Theorem 6 in [2], we immediately extend the guarantees of gradient EM into guarantees for stochastic gradient EM.

3.3 Initialization

Appropriate initialization for EM is the key to getting good estimation within fewer restarts in practice. There have been a number of interesting initialization algorithms for estimating mixture models. It is pointed out in [9] that in practice, initializing the centers by uniformly drawing from the data is often more reasonable than drawing from a fixed distribution. Under this initialization strategy, we can bound the number of initializations required to find a "good" initialization that falls in the contraction region in Theorem 1. The exact theorem statement and a discussion of random initialization can be found in Appendix D. More sophisticated strategies include an approximate solution to k-means on a projected low-dimensional space, used in [1] and [10]. While it would be interesting to study different initialization schemes, that is part of future work.

4 Local Convergence of Population Gradient EM

In this section we present the proof sketch for Theorem 1. The complete proofs in this section are deferred to Appendix B. To start with, we calculate the closed-form characterization of the gradient of q(µ), as stated in the following lemma.

Lemma 1. Define q(µ) = Q(µ|µ*). The gradient of q(µ) is ∇q(µ) = (diag(π) ⊗ I_d)(µ* − µ).

If we know the parameter γ in the gradient stability condition, then the convergence rate depends only on the condition number of the Hessian of q(·) and γ.

Theorem 3 (Convergence rate for population gradient EM).
If Q satisfies the GS condition with parameter 0 < γ < π_min, denote d_t := ‖µ^t − µ*‖; then with step size s = 2/(π_min + π_max), we have

d_{t+1} ≤ ((π_max − π_min + 2γ)/(π_max + π_min))^{t+1} d_0.

The proof uses an approximation of the gradient and standard techniques from the analysis of gradient descent.

Remark 3. It can be verified that the convergence rate is equivalent to that shown in [2] when applied to GMMs. The convergence slows down as the proportion imbalance κ = π_max/π_min increases, which matches the observation in [19].

Now to verify the GS condition, we have the following theorem.

Theorem 4 (GS condition for general GMM). If R_min = Ω̃(√(min{d, M})) and µ^t_i ∈ B(µ*_i, a) for all i ∈ [M], where a ≤ R_min/2 − √(min{d, M}) · max(4√(2[log(R_min/4)]_+), 8√3), then

‖∇_{µ_i} Q(µ^t|µ^t) − ∇_{µ_i} q(µ^t)‖ ≤ (γ/M) Σ_{j=1}^M ‖µ^t_j − µ*_j‖ ≤ (γ/√M) ‖µ^t − µ*‖,

where γ = M²(2κ + 4)(2R_max + min{d, M})² exp(−(R_min/2 − a)² √(min{d, M}) / 8). Furthermore, ‖∇Q(µ^t|µ^t) − ∇q(µ^t)‖ ≤ γ ‖µ^t − µ*‖.

Proof sketch of Theorem 4. W.l.o.g.
we show the proof with the first cluster and consider the difference of the gradient corresponding to µ_1:

∇_{µ_1} Q(µ^t|µ^t) − ∇_{µ_1} q(µ^t) = E[(w_1(X; µ^t) − w_1(X; µ*))(X − µ^t_1)].   (6)

For any given X, consider the function µ → w_1(X; µ). We have

∇_µ w_1(X; µ) = [ w_1(X; µ)(1 − w_1(X; µ))(X − µ_1)^T ; −w_1(X; µ)w_2(X; µ)(X − µ_2)^T ; ··· ; −w_1(X; µ)w_M(X; µ)(X − µ_M)^T ],   (7)

with the M blocks stacked vertically. Let µ^u = µ* + u(µ^t − µ*) for u ∈ [0, 1]; obviously µ^u ∈ Π^M_{i=1} B(µ*_i, ‖µ^t_i − µ*_i‖) ⊂ Π^M_{i=1} B(µ*_i, a). By Taylor's theorem,

‖E(w_1(X; µ^t) − w_1(X; µ*))(X − µ^t_1)‖ = ‖E[∫_0^1 ∇_u w_1(X; µ^u) du (X − µ^t_1)]‖
≤ U_1 ‖µ^t_1 − µ*_1‖_2 + Σ_{i≠1} U_i ‖µ^t_i − µ*_i‖_2 ≤ max_{i∈[M]} {U_i} Σ_i ‖µ^t_i − µ*_i‖_2,   (8)

where

U_1 = sup_{u∈[0,1]} ‖E w_1(X; µ^u)(1 − w_1(X; µ^u))(X − µ^t_1)(X − µ^u_1)^T‖_op,
U_i = sup_{u∈[0,1]} ‖E w_1(X; µ^u) w_i(X; µ^u)(X − µ^t_1)(X − µ^u_i)^T‖_op.

Bounding these with a careful analysis of the Gaussian distribution yields the result. The technical details are deferred to Appendix B.

5 Sample-based Convergence

In this section we present the proof sketch for sample-based convergence of gradient EM.
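For intuition about the quantity this section controls, the deviation between the empirical operator G_n(µ) and the population operator G(µ) can be probed numerically, using a very large sample as a stand-in for the population. This is our own illustration (parameters and names are hypothetical), not part of the paper's proof:

```python
import numpy as np

rng = np.random.default_rng(2)
pis = np.array([0.4, 0.6])
mu_star = np.array([[2.5, 0.0], [-2.5, 0.0]])

def resp(X, mus):
    logits = np.log(pis) - 0.5 * ((X[:, None, :] - mus[None]) ** 2).sum(axis=2)
    logits -= logits.max(axis=1, keepdims=True)
    W = np.exp(logits)
    return W / W.sum(axis=1, keepdims=True)

def grad_em_map(X, mus, s=1.0):
    """Empirical gradient-EM operator G_n(mu): mu_i + s*pi_i*mean(w_i(x)(x - mu_i))."""
    W = resp(X, mus)
    g = np.stack([pis[i] * (W[:, i:i + 1] * (X - mus[i])).mean(axis=0)
                  for i in range(len(pis))])
    return mus + s * g

def sample(n):
    z = rng.choice(len(pis), size=n, p=pis)
    return mu_star[z] + rng.standard_normal((n, 2))

mus0 = mu_star + np.array([[0.5, -0.4], [-0.3, 0.5]])  # a point inside the contraction region
G_pop = grad_em_map(sample(400_000), mus0)             # large-sample proxy for G(mu)
dev_small = np.linalg.norm(grad_em_map(sample(200), mus0) - G_pop)
dev_large = np.linalg.norm(grad_em_map(sample(50_000), mus0) - G_pop)
```

The deviation shrinks as n grows, consistent with the ε_unif(n) rate that Theorem 5 establishes uniformly over the contraction region.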
The main ingredient in the proof of Theorem 2 is the following theorem, which develops a uniform upper bound on the difference between the sample-based gradient and the population gradient for each cluster center.

Theorem 5 (Uniform bound for sample-based gradient EM). Denote by A the contraction region Π^M_{i=1} B(µ*_i, a). Under the conditions of Theorem 1, with probability at least 1 − exp(−cd log n),

sup_{µ∈A} ‖G^{(i)}(µ) − G^{(i)}_n(µ)‖ < ε_unif(n), ∀i ∈ [M],

where ε_unif(n) is as defined in Eq. (5).

Remark 4. It is worth pointing out that the first part of the bound on ε_unif(n) in Eq. (5) comes from the Rademacher complexity, which is optimal in terms of the order of n and d. However, the extra factor of √d and log(n) comes from the altered McDiarmid's inequality, tightening which is left for future work.

Proof sketch of Theorem 5. Denote Z_i = sup_{µ∈A} ‖G^{(i)}(µ) − G^{(i)}_n(µ)‖. Recall g^u_i(X) defined in Eq. (3); then it is not hard to see that Z_i = sup_{u∈S^{d−1}} g^u_i(X).
Consider a 1/2-covering {u^{(1)}, ···, u^{(K)}} of the unit sphere S^{d−1}, where K is the covering number of the unit sphere in d dimensions. We can show that Z_i ≤ 2 max_{j=1,···,K} g^{u^{(j)}}_i(X). As we state below in Lemma 2, we have for each u, with probability at least 1 − exp(−cd log n), g^u_i(X) = Õ(max{R_n(F^u_i), (1 + R_max)d/√n}). Plugging in the Rademacher complexity from Proposition 1, standard bounds on K, and applying a union bound, we have

Z_i ≤ Õ(max{n^{−1/2} M³ (1 + R_max)³ √(d max{1, log(κ)}), (1 + R_max)d/√n})

with probability at least 1 − exp(2d − cd log n) = 1 − exp(−c′d log n).

Iteratively applying Theorem 5, we can bound the error in µ for the sample updates. The full proof is deferred to Appendix C. The key step is the following lemma, which bounds g^u_i(X) for any given u ∈ S^{d−1}. Without loss of generality we prove the result for i = 1.

Lemma 2. Let u be a unit vector, and let X_i, i = 1, ···, n, be i.i.d. samples from GMM(π, µ*, I_d). Let g^u_1(X) be as defined in Eq. (3). Then with probability 1 − exp(−cd log n) for some constant c > 0, g^u_1(X) = Õ(max{R_n(F^u_1), (1 + R_max)d/√n}).

The quantity g^u_1(X) depends on the sample; the idea for proving Lemma 2 is to show that it concentrates around its expectation when the sample size is large. Note that when the function class has bounded differences (changing one data point changes the function by a bounded amount almost surely), as is the case in many risk minimization problems in supervised learning, McDiarmid's inequality can be used to achieve concentration. However, the function class we define in Eq. (4) is not bounded almost everywhere, but only with high probability; hence the classical result does not apply. Conditioning on the event where the difference is bounded, we use an extension of McDiarmid's inequality by [3]. For convenience, we restate a weaker version of the theorem using our notation below.

Theorem 6 ([3]). Consider independent random variables X = (X_1, ···, X_n) in the product probability space X = Π^n_{i=1} X_i, where X_i is the probability space for X_i. Also consider a function g : X → R. Suppose there exist a subset Y ⊂ X and a scalar L > 0 such that

|g(x) − g(y)| ≤ L, ∀x, y ∈ Y with x_j = y_j for all j ≠ i.

Denote p = 1 − P(X ∈ Y). Then P(g(X) − E[g(X)|X ∈ Y] ≥ ε) ≤ p + exp(−2((ε − npL)_+)² / (nL²)).

It is worth pointing out that in Theorem 6, concentration is shown with reference to the conditional expectation E[g(X)|X ∈ Y], i.e. conditioned on the data points lying in the bounded difference set. So to fully achieve the type of bound given by McDiarmid's inequality, we need to further bound the difference between the conditional expectation and the full expectation. Combining the two parts, we are able to show that the empirical difference is upper bounded using the Rademacher complexity.

Now it remains to derive the Rademacher complexity of the given function class. Note that when the function class is a contraction, or Lipschitz with respect to another function class (usually of a simpler form), one can use the Ledoux-Talagrand contraction lemma [12] to reduce the Rademacher complexity of the original function class to that of the simpler class. This is essential in getting the Rademacher complexities of complicated function classes. As we mention in Section 3, our function class in Eq. (4) is unfortunately not Lipschitz, due to the fact that it involves all cluster centers even for the gradient with respect to one cluster. We get around this problem by introducing a vector valued function, and show that the functions in Eq. (4) are Lipschitz in terms of this vector-valued function. In other words, the absolute difference in the function when the parameter changes is upper bounded by the norm of the vector difference of the vector-valued function. We then build upon the recent vector-contraction result from [15] and prove the following lemma under our setting.

Lemma 3.
Let X be nontrivial, symmetric and sub-gaussian. Then there exists a constant C < ∞, depending only on the distribution of X, such that for any subset S of a separable Banach space and functions h_i : S → R, f_i : S → R^k, i ∈ [n], satisfying |h_i(s) − h_i(s′)| ≤ L‖f_i(s) − f_i(s′)‖ for all s, s′ ∈ S, if ε_{ik} is an independent doubly indexed Rademacher sequence, we have

E sup_{s∈S} Σ_i ε_i h_i(s) ≤ √2 L E sup_{s∈S} Σ_{i,k} ε_{ik} f_i(s)_k,

where f_i(s)_k is the k-th component of f_i(s).

Remark 5. In contrast to the original form in [15], we have S as a subset of a separable Banach space. The proof uses standard tools from measure theory, and can be found in Appendix C.

This equips us to prove Proposition 1.

Proof sketch of Proposition 1. For any unit vector u, the Rademacher complexity of F^u_1 is

R_n(F^u_1) = E_X E_ε sup_{µ∈A} (1/n) Σ_{i=1}^n ε_i w_1(X_i; µ)⟨X_i − µ_1, u⟩
≤ E_X E_ε sup_{µ∈A} (1/n) Σ_{i=1}^n ε_i w_1(X_i; µ)⟨X_i, u⟩ + E_X E_ε sup_{µ∈A} (1/n) Σ_{i=1}^n ε_i w_1(X_i; µ)⟨µ_1, u⟩,   (9)

where the two terms on the right hand side are denoted (D) and (E) respectively. We bound the two terms separately.
De\ufb01ne \u03b7j(\u00b5) : RM d \u2192 RM to be a vector valued function with\nthe k-th coordinate\n\n[\u03b7j(\u00b5)]k =\n\n(cid:107)\u00b51(cid:107)2\n\n2\n\n\u2212 (cid:107)\u00b5k(cid:107)2\n\n2\n\n+ (cid:104)Xj, \u00b5k \u2212 \u00b51(cid:105) + log\n\n\u221a\n\nM\n4\n\nIt can be shown that\n\n|w1(Xj; \u00b5) \u2212 w1(Xj; \u00b5(cid:48))| \u2264\n\n(cid:107)\u03b7j(\u00b5) \u2212 \u03b7j(\u00b5(cid:48))(cid:107)\n\n(10)\n\nNow let \u03c81(Xj; \u00b5) = w1(Xj; \u00b5)(cid:104)Xj, u(cid:105). With Lipschitz property (10) and Lemma C.1, we have\n\n\uf8ee\uf8f0sup\n\n\u00b5\u2208A\n\nE\n\nn(cid:88)\n\nj=1\n\n1\nn\n\n\u0001jwi(Xj; \u00b5)(cid:104)Xj, u(cid:105)\n\n\uf8f9\uf8fb \u2264 E\n\n\uf8ee\uf8f0\u221a\n\n\u221a\n2\n4n\n\nM\n\nsup\n\u00b5\u2208A\n\nn(cid:88)\n\nM(cid:88)\n\nj=1\n\nk=1\n\n\uf8f9\uf8fb\n\n\u0001jk[\u03b7j(\u00b5)]k\n\nThe right hand side can be bounded with tools regarding independent sum of sub-gaussian random\nvariables. Similar techniques apply to the (E) term. Adding things up we get the \ufb01nal bound.\n\n6 Experiments\n\nIn this section we collect some numerical results. In all experiments we set the covariance matrix\nfor each mixture component as identity matrix Id and de\ufb01ne signal-to-noise ratio (SNR) as Rmin.\nConvergence Rate We \ufb01rst evaluate the convergence rate and compare with those given in Theorem\n3 and Theorem 4. For this set of experiments, we use a mixture of 3 Gaussians in 2 dimensions. In\nboth experiments Rmax/Rmin = 1.5. In different settings of \u03c0, we apply gradient EM with varying\nSNR from 1 to 5. For each choice of SNR, we perform 10 independent trials with N = 12, 000\n\n8\n\n\fdata points. The average of log (cid:107)\u00b5t \u2212 \u02c6\u00b5(cid:107) and the standard deviation are plotted versus iterations. In\nFigure 1 (a) and (b) we plot balanced \u03c0 (\u03ba = 1) and unbalanced \u03c0 (\u03ba > 1) respectively.\nAll settings indicate the linear convergence rate as shown in Theorem 3. 
As SNR grows, the parameter \(\gamma\) in the GS condition decreases, which yields a faster convergence rate. Comparing the two left panels of Figure 1, increasing the imbalance \(\kappa\) of the cluster weights slows down the local convergence rate, as shown in Theorem 3.

Figure 1: (a, b): The influence of SNR on the optimization error in different settings. The figures show the influence of SNR when the GMMs have different cluster centers and weights: (a) \(\pi = (1/3, 1/3, 1/3)\); (b) \(\pi = (0.6, 0.3, 0.1)\). (c) plots the statistical error for different initializations arbitrarily close to the boundary of the contraction region. (d) shows the suboptimal stationary point reached when two centers are initialized at the midpoint of the respective true cluster centers.

Contraction Region To show the tightness of the contraction region, we generate a mixture with \(M = 3\), \(d = 2\), and initialize the clusters as follows. We use \(\mu^0_2 = \frac{\mu^*_2 + \mu^*_3}{2} - \epsilon\) and \(\mu^0_3 = \frac{\mu^*_2 + \mu^*_3}{2} + \epsilon\) for shrinking \(\epsilon\), i.e. increasing \(a/R_{\min}\), and plot the error on the Y axis. Figure 1-(c) shows that gradient EM converges when initialized arbitrarily close to the boundary, thus confirming our near-optimal contraction region. Figure 1-(d) shows that when \(\epsilon = 0\), i.e. \(a = \frac{R_{\min}}{2}\), gradient EM can be trapped at a suboptimal stationary point.

7 Concluding Remarks

In this paper, we obtain local convergence rates and a near-optimal contraction radius for population and sample-based gradient EM for multi-component GMMs with arbitrary mixing weights. For simplicity, we assume that the mixing weights are known, and that the covariance matrices are identical across components and known.
For our sample-based analysis, we face new challenges: the setting bears structural differences from the two-component, equal-weight case, which we address via non-traditional tools such as a vector-valued contraction argument and martingale concentration bounds in which the bounded differences hold only with high probability.

Acknowledgments

PS was partially supported by NSF grant DMS 1713082.

References

[1] Pranjal Awasthi and Or Sheffet. Improved spectral-norm bounds for clustering. In APPROX-RANDOM, pages 37-49. Springer, 2012.

[2] Sivaraman Balakrishnan, Martin J. Wainwright, and Bin Yu. Statistical guarantees for the EM algorithm: From population to sample-based analysis. Ann. Statist., 45(1):77-120, 2017.

[3] Richard Combes. An extension of McDiarmid's inequality. arXiv preprint arXiv:1511.05240, 2015.

[4] Denis Conniffe. Expected maximum log likelihood estimation. The Statistician, pages 317-329, 1987.

[5] Sanjoy Dasgupta. Learning mixtures of Gaussians. In Foundations of Computer Science, 1999. 40th Annual Symposium on, pages 634-644. IEEE, 1999.

[6] Sanjoy Dasgupta and Leonard J Schulman. A two-round variant of EM for Gaussian mixtures. In Proceedings of the Sixteenth Conference on Uncertainty in Artificial Intelligence, pages 152-159. Morgan Kaufmann Publishers Inc., 2000.

[7] Arthur P Dempster, Nan M Laird, and Donald B Rubin. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B (Methodological), pages 1-38, 1977.

[8] Zachary Friggstad, Mohsen Rezapour, and Mohammad R Salavatipour. Local search yields a PTAS for k-means in doubling metrics. In Foundations of Computer Science (FOCS), 2016 IEEE 57th Annual Symposium on, pages 365-374.
IEEE, 2016.

[9] Chi Jin, Yuchen Zhang, Sivaraman Balakrishnan, Martin J Wainwright, and Michael I Jordan. Local maxima in the likelihood of Gaussian mixture models: Structural results and algorithmic consequences. In Advances in Neural Information Processing Systems, pages 4116-4124, 2016.

[10] Amit Kumar and Ravindran Kannan. Clustering with spectral norm and the k-means algorithm. In Foundations of Computer Science (FOCS), 2010 51st Annual IEEE Symposium on, pages 299-308. IEEE, 2010.

[11] Kenneth Lange. A gradient algorithm locally equivalent to the EM algorithm. Journal of the Royal Statistical Society, Series B (Methodological), pages 425-437, 1995.

[12] Michel Ledoux and Michel Talagrand. Probability in Banach Spaces: Isoperimetry and Processes. Springer Science & Business Media, 2013.

[13] Yu Lu and Harrison H Zhou. Statistical and computational guarantees of Lloyd's algorithm and its variants. arXiv preprint arXiv:1612.02099, 2016.

[14] Jiří Matoušek. On approximate geometric k-clustering. Discrete & Computational Geometry, 24(1):61-84, 2000.

[15] Andreas Maurer. A vector-contraction inequality for Rademacher complexities. In International Conference on Algorithmic Learning Theory, pages 3-17. Springer, 2016.

[16] Colin McDiarmid. On the method of bounded differences. Surveys in Combinatorics, 141(1):148-188, 1989.

[17] Dustin G Mixon, Soledad Villar, and Rachel Ward. Clustering subgaussian mixtures by semidefinite programming. arXiv preprint arXiv:1602.06612, 2016.

[18] Mehryar Mohri, Afshin Rostamizadeh, and Ameet Talwalkar. Foundations of Machine Learning. MIT Press, 2012.

[19] Iftekhar Naim and Daniel Gildea. Convergence of the EM algorithm for Gaussian mixtures with unbalanced mixing coefficients. arXiv preprint arXiv:1206.6427, 2012.

[20] Richard A Redner and Homer F Walker.
Mixture densities, maximum likelihood and the EM algorithm. SIAM Review, 26(2):195-239, 1984.

[21] Santosh Vempala and Grant Wang. A spectral algorithm for learning mixture models. Journal of Computer and System Sciences, 68(4):841-860, 2004.

[22] C. F. Jeff Wu. On the convergence properties of the EM algorithm. The Annals of Statistics, pages 95-103, 1983.

[23] Ji Xu, Daniel J Hsu, and Arian Maleki. Global analysis of expectation maximization for mixtures of two Gaussians. In Advances in Neural Information Processing Systems 29, pages 2676-2684. Curran Associates, Inc., 2016.

[24] Lei Xu and Michael I Jordan. On convergence properties of the EM algorithm for Gaussian mixtures. Neural Computation, 8(1):129-151, 1996.

[25] Bowei Yan and Purnamrita Sarkar. On robustness of kernel clustering. In Advances in Neural Information Processing Systems, pages 3090-3098, 2016.