{"title": "Regularized EM Algorithms: A Unified Framework and Statistical Guarantees", "book": "Advances in Neural Information Processing Systems", "page_first": 1567, "page_last": 1575, "abstract": "Latent models are a fundamental modeling tool in machine learning applications, but they present significant computational and analytical challenges. The popular EM algorithm and its variants, is a much used algorithmic tool; yet our rigorous understanding of its performance is highly incomplete. Recently, work in [1] has demonstrated that for an important class of problems, EM exhibits linear local convergence. In the high-dimensional setting, however, the M-step may not be well defined. We address precisely this setting through a unified treatment using regularization. While regularization for high-dimensional problems is by now well understood, the iterative EM algorithm requires a careful balancing of making progress towards the solution while identifying the right structure (e.g., sparsity or low-rank). In particular, regularizing the M-step using the state-of-the-art high-dimensional prescriptions (e.g., \\`a la [19]) is not guaranteed to provide this balance. Our algorithm and analysis are linked in a way that reveals the balance between optimization and statistical errors. We specialize our general framework to sparse gaussian mixture models, high-dimensional mixed regression, and regression with missing variables, obtaining statistical guarantees for each of these examples.", "full_text": "Regularized EM Algorithms: A Uni\ufb01ed Framework\n\nand Statistical Guarantees\n\nDept. of Electrical and Computer Engineering\n\nDept. 
of Electrical and Computer Engineering\n\nXinyang Yi\n\nConstantine Caramanis\n\nThe University of Texas at Austin\n\nyixy@utexas.edu\n\nThe University of Texas at Austin\nconstantine@utexas.edu\n\nAbstract\n\nLatent models are a fundamental modeling tool in machine learning applications,\nbut they present signi\ufb01cant computational and analytical challenges. The popular\nEM algorithm and its variants, is a much used algorithmic tool; yet our rigorous\nunderstanding of its performance is highly incomplete. Recently, work in [1] has\ndemonstrated that for an important class of problems, EM exhibits linear local\nconvergence. In the high-dimensional setting, however, the M-step may not be\nwell de\ufb01ned. We address precisely this setting through a uni\ufb01ed treatment using\nregularization. While regularization for high-dimensional problems is by now\nwell understood, the iterative EM algorithm requires a careful balancing of making\nprogress towards the solution while identifying the right structure (e.g., sparsity or\nlow-rank). In particular, regularizing the M-step using the state-of-the-art high-\ndimensional prescriptions (e.g., `a la [19]) is not guaranteed to provide this balance.\nOur algorithm and analysis are linked in a way that reveals the balance between\noptimization and statistical errors. We specialize our general framework to sparse\ngaussian mixture models, high-dimensional mixed regression, and regression with\nmissing variables, obtaining statistical guarantees for each of these examples.\n\n1\n\nIntroduction\n\nWe give general conditions for the convergence of the EM method for high-dimensional estimation.\nWe specialize these conditions to several problems of interest, including high-dimensional sparse\nand low-rank mixed regression, sparse gaussian mixture models, and regression with missing covari-\nates. As we explain below, the key problem in the high-dimensional setting is the M-step. 
A natural idea is to modify this step via appropriate regularization, yet choosing the appropriate sequence of regularizers is a critical problem. As we know from the theory of regularized M-estimators (e.g., [19]), the regularizer should be chosen proportional to the target estimation error. For EM, however, the target estimation error changes at each step.

The main contribution of our work is technical: we show how to perform this iterative regularization. We show that the regularization sequence must be chosen so that it converges to a quantity controlled by the ultimate estimation error. In existing work, the estimation error is given by the relationship between the population and empirical M-step operators, but this too is not well defined in the high-dimensional setting. Thus a key step, related both to our algorithm and its convergence analysis, is obtaining a different characterization of statistical error for the high-dimensional setting.

Background and Related Work

EM (e.g., [8, 12]) is a general algorithmic approach for handling latent variable models (including mixtures), popular largely because it is typically computationally highly scalable and easy to implement. On the flip side, despite a fairly long history of studying EM in theory (e.g., [12, 17, 21]), very little has been understood about general statistical guarantees until recently. Very recent work in [1] establishes a general local convergence theorem (i.e., assuming the initialization lies in a local region around the true parameter) and statistical guarantees for EM, which is then specialized to obtain near-optimal rates for several specific low-dimensional problems, low-dimensional in the sense of the classical statistical setting where the samples outnumber the dimension. A central challenge in extending EM (and as a corollary, the analysis in [1]) to the high-dimensional regime is the M-step.
On the algorithmic side, the M-step will not be stable (or even well defined in some cases) in the high-dimensional setting. To make matters worse, any analysis that relies on showing that the finite-sample M-step is somehow "close" to the M-step performed with infinite data (the population-level M-step) simply cannot apply in the high-dimensional regime. Recent work in [20] treats high-dimensional EM using a truncated M-step. This works in some settings, but also requires specialized treatment for every different setting, precisely because of the difficulty with the M-step. In contrast to the work in [20], we pursue a high-dimensional extension via regularization. The central challenge, as mentioned above, is in picking the sequence of regularization coefficients, as this must control the optimization error (related to the special structure of $\beta^*$) as well as the statistical error. Finally, we note that for finite mixture regression, Stadler et al. [16] consider an $\ell_1$-regularized EM algorithm for which they develop some asymptotic analysis and an oracle inequality. However, that work does not establish the theoretical properties of local optima arising from regularized EM. Our work addresses this issue from a local convergence perspective by using a novel choice of regularization.

2 Classical EM and Challenges in High Dimensions

The EM algorithm is an iterative algorithm designed to combat the non-convexity of maximum likelihood estimation due to latent variables. For space concerns we omit the standard derivation, and only give the definitions we need in the sequel. Let $Y, Z$ be random variables taking values in $\mathcal{Y}, \mathcal{Z}$, with joint distribution $f_\beta(y, z)$ depending on the model parameter $\beta \in \Omega \subseteq \mathbb{R}^p$. We observe samples of $Y$ but not of the latent variable $Z$. EM seeks to maximize a lower bound on the maximum likelihood function for $\beta$.
Letting $\kappa_\beta(z|y)$ denote the conditional distribution of $Z$ given $Y = y$, letting $y_{\beta^*}(y)$ denote the marginal distribution of $Y$, and defining the function
$$Q_n(\beta'|\beta) := \frac{1}{n}\sum_{i=1}^n \int_{\mathcal{Z}} \kappa_\beta(z|y_i)\,\log f_{\beta'}(y_i, z)\,dz, \qquad (2.1)$$
one iteration of the EM algorithm, mapping $\beta^{(t)}$ to $\beta^{(t+1)}$, consists of the following two steps:

- E-step: Compute the function $Q_n(\beta|\beta^{(t)})$ given $\beta^{(t)}$.
- M-step: $\beta^{(t+1)} \leftarrow M_n(\beta^{(t)}) := \arg\max_{\beta' \in \Omega} Q_n(\beta'|\beta^{(t)})$.

We can define the population (infinite-sample) versions of $Q_n$ and $M_n$ in a natural manner:
$$Q(\beta'|\beta) := \int_{\mathcal{Y}} y_{\beta^*}(y) \int_{\mathcal{Z}} \kappa_\beta(z|y)\,\log f_{\beta'}(y, z)\,dz\,dy, \qquad (2.2)$$
$$M(\beta) := \arg\max_{\beta' \in \Omega} Q(\beta'|\beta). \qquad (2.3)$$

This paper is about the high-dimensional setting where the number of samples $n$ may be far less than the dimensionality $p$ of the parameter $\beta$, but where $\beta$ exhibits some special structure, e.g., it may be a sparse vector or a low-rank matrix. In such a setting, the M-step of the EM algorithm may be highly problematic. In many settings, for example sparse mixed regression, the M-step may not even be well defined. More generally, when $n \ll p$, $M_n(\beta)$ may be far from the population version $M(\beta)$, and in particular, the minimum estimation error $\|M_n(\beta^*) - M(\beta^*)\|$ can be much larger than the signal strength $\|\beta^*\|$. This quantity is used in [1], as well as in follow-up work in [20], as a measure of statistical error.
In the high-dimensional setting, something else is needed.

3 Algorithm

The basis of our algorithm is the by-now well understood concept of regularized high-dimensional estimators, where the regularization is tuned to the underlying structure of $\beta^*$, thus defining a regularized M-step via
$$M_n^r(\beta) := \arg\max_{\beta' \in \Omega}\ Q_n(\beta'|\beta) - \lambda_n \mathcal{R}(\beta'), \qquad (3.1)$$
where $\mathcal{R}(\cdot)$ denotes an appropriate regularizer chosen to match the structure of $\beta^*$. The key challenge is how to choose the sequence of regularizers $\{\lambda_n^{(t)}\}$ in the iterative process, so as to control optimization and statistical error. As detailed in Algorithm 1, our sequence of regularizers attempts to match the target estimation error at each step of the EM iteration. For an intuition of what this might look like, consider the estimation error at step $t$: $\|M_n^r(\beta^{(t)}) - \beta^*\|_2$. By the triangle inequality, we can bound this by a sum of two terms, the optimization error and the final estimation error:
$$\|M_n^r(\beta^{(t)}) - \beta^*\|_2 \le \|M_n^r(\beta^{(t)}) - M_n^r(\beta^*)\|_2 + \|M_n^r(\beta^*) - \beta^*\|_2. \qquad (3.2)$$
Since we expect (and show) linear convergence of the optimization, it is natural to update $\lambda_n^{(t)}$ via a recursion of the form $\lambda_n^{(t)} = \kappa \lambda_n^{(t-1)} + \Delta$ as in (3.3), where the first term represents the optimization error, and $\Delta$ represents the final statistical error, i.e., the last term above in (3.2). A key part of our analysis shows that this error (and hence $\Delta$) is controlled by $\|\nabla Q_n(\beta^*|\beta) - \nabla Q(\beta^*|\beta)\|_{\mathcal{R}^*}$, which in turn can be bounded uniformly for a variety of important applications of EM, including the three discussed in this paper (see Section 5).
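The behavior of this recursion can be sanity-checked numerically. A minimal sketch (constants are illustrative, not values prescribed by the theory): iterating $\lambda^{(t)} = \kappa\lambda^{(t-1)} + \Delta$ decays geometrically from the (large) initial value toward the statistical-error level $\Delta/(1-\kappa)$.

```python
import numpy as np

def lambda_schedule(lam0, kappa, delta, T):
    """Regularization schedule lam_t = kappa * lam_{t-1} + delta,
    mirroring the update in (3.3): kappa < 1 is the contraction factor,
    delta tracks the final statistical error."""
    lams = [lam0]
    for _ in range(T):
        lams.append(kappa * lams[-1] + delta)
    return np.array(lams)

# Unrolling the recursion gives the closed form
#   lam_t = kappa**t * lam0 + (1 - kappa**t) / (1 - kappa) * delta,
# so lam_t starts at the scale of the optimization error and settles at
# the statistical-error floor delta / (1 - kappa).
lams = lambda_schedule(lam0=1.0, kappa=0.7, delta=0.05, T=50)
closed_form = 0.7**50 * 1.0 + (1 - 0.7**50) / (1 - 0.7) * 0.05
assert abs(lams[-1] - closed_form) < 1e-12
```

This is exactly the "large early, statistical late" balance described above: early iterations are regularized aggressively to keep iterates structured, while the floor matches the final estimation error.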
While a technical point, it is this key insight that enables the right choice of algorithm and its analysis. In the cases we consider, we obtain minimax-optimal rates of convergence, demonstrating that no algorithm, let alone another variant of EM, can perform better.

Algorithm 1 Regularized EM Algorithm
Input: Samples $\{y_i\}_{i=1}^n$, regularizer $\mathcal{R}$, number of iterations $T$, initial parameter $\beta^{(0)}$, initial regularization parameter $\lambda_n^{(0)}$, estimated statistical error $\Delta$, contractive factor $\kappa < 1$.
1: For $t = 1, 2, \ldots, T$ do
2:   Regularization parameter update:
$$\lambda_n^{(t)} \leftarrow \kappa \lambda_n^{(t-1)} + \Delta. \qquad (3.3)$$
3:   E-step: Compute the function $Q_n(\cdot|\beta^{(t-1)})$ according to (2.1).
4:   Regularized M-step:
$$\beta^{(t)} \leftarrow M_n^r(\beta^{(t-1)}) := \arg\max_{\beta \in \Omega}\ Q_n(\beta|\beta^{(t-1)}) - \lambda_n^{(t)} \cdot \mathcal{R}(\beta).$$
5: End For
Output: $\beta^{(T)}$.

4 Statistical Guarantees

We now turn to the theoretical analysis of the regularized EM algorithm. We first set up a general analytical framework for regularized EM, where the key ingredients are a decomposable regularizer and several technical conditions on the population-based $Q(\cdot|\cdot)$ and the sample-based $Q_n(\cdot|\cdot)$. In Section 4.3, we provide our main result (Theorem 1), which characterizes both the computational and statistical performance of the proposed variant of the regularized EM algorithm.

4.1 Decomposable Regularizers

Decomposable regularizers (e.g., [3, 6, 14, 19]) have been shown to be useful both empirically and theoretically for high-dimensional structured estimation, and they also play an important role in our analytical framework.
Recall that for $\mathcal{R} : \mathbb{R}^p \to \mathbb{R}_+$ a norm, and a pair of subspaces $(\mathcal{S}, \bar{\mathcal{S}})$ in $\mathbb{R}^p$ such that $\mathcal{S} \subseteq \bar{\mathcal{S}}$, we have the following definition:

Definition 1 (Decomposability). Regularizer $\mathcal{R} : \mathbb{R}^p \to \mathbb{R}_+$ is decomposable with respect to $(\mathcal{S}, \bar{\mathcal{S}})$ if
$$\mathcal{R}(u + v) = \mathcal{R}(u) + \mathcal{R}(v), \quad \text{for any } u \in \mathcal{S},\ v \in \bar{\mathcal{S}}^\perp.$$

Typically, the structure of the model parameter $\beta^*$ can be characterized by specifying a subspace $\mathcal{S}$ such that $\beta^* \in \mathcal{S}$. The common use of a regularizer is thus to penalize the components of the solution that live outside $\mathcal{S}$. We are interested in bounding the estimation error in some norm $\|\cdot\|$. The following quantity is critical in connecting $\mathcal{R}$ to $\|\cdot\|$.

Definition 2 (Subspace Compatibility Constant). For any subspace $\mathcal{S} \subseteq \mathbb{R}^p$, a given regularizer $\mathcal{R}$ and some norm $\|\cdot\|$, the subspace compatibility constant of $\mathcal{S}$ with respect to $\mathcal{R}, \|\cdot\|$ is given by
$$\Psi(\mathcal{S}) := \sup_{u \in \mathcal{S} \setminus \{0\}} \frac{\mathcal{R}(u)}{\|u\|}.$$

As is standard, the dual norm of $\mathcal{R}$ is defined as $\mathcal{R}^*(v) := \sup_{\mathcal{R}(u) \le 1} \langle u, v \rangle$. To simplify notation, we let $\|u\|_{\mathcal{R}} := \mathcal{R}(u)$ and $\|u\|_{\mathcal{R}^*} := \mathcal{R}^*(u)$.

4.2 Conditions on $Q(\cdot|\cdot)$ and $Q_n(\cdot|\cdot)$

Next, we review three technical conditions, originally proposed by [1], on the population-level $Q(\cdot|\cdot)$ function, and then we give two important conditions that the empirical function $Q_n(\cdot|\cdot)$ must satisfy, including one that characterizes the statistical error. It is well known that the performance of the EM algorithm is sensitive to initialization.
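As a concrete instance of Definitions 1 and 2 (a standard example, not specific to this paper's proofs): take $\mathcal{R}$ to be the $\ell_1$ norm, $\mathcal{S} = \bar{\mathcal{S}}$ the set of vectors supported on a fixed index set of size $s$, and $\|\cdot\|$ the $\ell_2$ norm. Then decomposability holds, $\Psi(\mathcal{S}) = \sqrt{s}$, and the dual norm is $\ell_\infty$. A quick numerical check of the first two facts:

```python
import numpy as np

rng = np.random.default_rng(0)
p, s = 20, 5
support = np.arange(s)                             # S: vectors supported on the first s coords

u = np.zeros(p); u[support] = rng.normal(size=s)   # u in S
v = np.zeros(p); v[s:] = rng.normal(size=p - s)    # v in S-perp (disjoint support)

# Decomposability of the l1 norm: R(u + v) = R(u) + R(v).
assert np.isclose(np.abs(u + v).sum(), np.abs(u).sum() + np.abs(v).sum())

# Subspace compatibility: sup over S of ||w||_1 / ||w||_2 equals sqrt(s),
# attained by a constant sign pattern on the support.
w = np.zeros(p); w[support] = 1.0
assert np.isclose(np.abs(w).sum() / np.linalg.norm(w), np.sqrt(s))
```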
Following the low-dimensional development in [1], our results are local, and apply to an $r$-neighborhood around $\beta^*$: $\mathcal{B}(r; \beta^*) := \{u \in \Omega : \|u - \beta^*\| \le r\}$.

We first require that $Q(\cdot|\beta^*)$ is self-consistent, as stated below. This is satisfied, in particular, when $\beta^*$ maximizes the population log-likelihood function, as happens in most settings of interest [12].

Condition 1 (Self Consistency). The function $Q(\cdot|\beta^*)$ is self-consistent, namely
$$\beta^* = \arg\max_{\beta \in \Omega} Q(\beta|\beta^*). \qquad (4.1)$$

We also require that the function $Q(\cdot|\cdot)$ satisfies a certain strong concavity condition and is smooth over $\Omega$.

Condition 2 (Strong Concavity and Smoothness $(\gamma, \mu, r)$). $Q(\cdot|\beta^*)$ is $\gamma$-strongly concave over $\Omega$, i.e.,
$$Q(\beta_2|\beta^*) - Q(\beta_1|\beta^*) - \langle \nabla Q(\beta_1|\beta^*),\, \beta_2 - \beta_1 \rangle \le -\frac{\gamma}{2}\|\beta_2 - \beta_1\|^2, \quad \forall\ \beta_1, \beta_2 \in \Omega.$$
For any $\beta \in \mathcal{B}(r; \beta^*)$, $Q(\cdot|\beta)$ is $\mu$-smooth over $\Omega$, i.e.,
$$Q(\beta_2|\beta) - Q(\beta_1|\beta) - \langle \nabla Q(\beta_1|\beta),\, \beta_2 - \beta_1 \rangle \ge -\frac{\mu}{2}\|\beta_2 - \beta_1\|^2, \quad \forall\ \beta_1, \beta_2 \in \Omega. \qquad (4.2)$$

The next condition is key in guaranteeing that the curvature of $Q(\cdot|\beta)$ is similar to that of $Q(\cdot|\beta^*)$ when $\beta$ is close to $\beta^*$. It has also been called First Order Stability in [1].

Condition 3 (Gradient Stability $(\tau, r)$).
For any $\beta \in \mathcal{B}(r; \beta^*)$, we have
$$\|\nabla Q(M(\beta)|\beta) - \nabla Q(M(\beta)|\beta^*)\| \le \tau \|\beta - \beta^*\|.$$

The above condition only requires that the gradient be stable at one point, $M(\beta)$. This is sufficient for our analysis. In fact, for many concrete examples, one can verify a stronger version of Condition 3, namely
$$\|\nabla Q(\beta'|\beta) - \nabla Q(\beta'|\beta^*)\| \le \tau \|\beta - \beta^*\|, \quad \forall\ \beta' \in \mathcal{B}(r; \beta^*).$$

Next we require two conditions on the empirical function $Q_n(\cdot|\cdot)$, which is computed from a finite number of samples according to (2.1). Our first condition, parallel to Condition 2, imposes a curvature constraint on $Q_n(\cdot|\cdot)$. In order to guarantee that the estimation error $\|\beta^{(t)} - \beta^*\|$ in step $t$ of the EM algorithm is well controlled, we would like $Q_n(\cdot|\beta^{(t-1)})$ to be strongly concave at $\beta^*$. However, in the setting where $n \ll p$, there might exist directions along which $Q_n(\cdot|\beta^{(t-1)})$ is flat, e.g., as in mixed linear regression and missing covariate regression. In contrast with Condition 2, we only require $Q_n(\cdot|\cdot)$ to be strongly concave over a particular set $\mathcal{C}(\mathcal{S}, \bar{\mathcal{S}}; \mathcal{R})$ that is defined in terms of the subspace pair $(\mathcal{S}, \bar{\mathcal{S}})$ and the regularizer $\mathcal{R}$. This set is defined as follows:
$$\mathcal{C}(\mathcal{S}, \bar{\mathcal{S}}; \mathcal{R}) := \left\{ u \in \mathbb{R}^p : \|\Pi_{\bar{\mathcal{S}}^\perp}(u)\|_{\mathcal{R}} \le 2 \cdot \|\Pi_{\bar{\mathcal{S}}}(u)\|_{\mathcal{R}} + 2 \cdot \Psi(\bar{\mathcal{S}}) \cdot \|u\| \right\}, \qquad (4.3)$$
where the projection operator $\Pi_{\mathcal{S}} : \mathbb{R}^p \to \mathbb{R}^p$ is defined as $\Pi_{\mathcal{S}}(u) := \arg\min_{v \in \mathcal{S}} \|v - u\|$.
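For intuition about the set in (4.3), a small sketch specialized to the $\ell_1$ norm and a coordinate-support subspace (so that $\Psi(\bar{\mathcal{S}}) = \sqrt{s}$ with respect to $\ell_2$, and projections are coordinate restrictions); the membership test below is illustrative, not a device used in the proofs:

```python
import numpy as np

def in_cone(u, support):
    """Membership in C(S, S; l1) of (4.3), specialized to the l1 norm and a
    coordinate-support subspace S (taking S-bar = S, Psi(S) = sqrt(|S|))."""
    mask = np.zeros(u.shape[0], dtype=bool)
    mask[support] = True
    on, off = np.abs(u[mask]).sum(), np.abs(u[~mask]).sum()
    psi = np.sqrt(mask.sum())
    return off <= 2 * on + 2 * psi * np.linalg.norm(u)

p, support = 20, [0, 1, 2]
u_sparse = np.zeros(p); u_sparse[:3] = 1.0   # error aligned with S: in the cone
u_spread = np.zeros(p); u_spread[3:] = 1.0   # mass entirely off the support
assert in_cone(u_sparse, support)
assert not in_cone(u_spread, support)
```

Note how the extra slack term $2\Psi(\bar{\mathcal{S}})\|u\|$ admits some vectors a pure cone condition would exclude; as discussed next, this slack is what lets a single $\lambda_n$ control both optimization and statistical error.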
The restricted strong concavity (RSC) condition is as follows.

Condition 4 (RSC $(\gamma_n, \mathcal{S}, \bar{\mathcal{S}}, r, \delta)$). For any fixed $\beta \in \mathcal{B}(r; \beta^*)$, with probability at least $1 - \delta$, we have that for all $\beta'$ with $\beta' - \beta^* \in \Omega \cap \mathcal{C}(\mathcal{S}, \bar{\mathcal{S}}; \mathcal{R})$,
$$Q_n(\beta'|\beta) - Q_n(\beta^*|\beta) - \langle \nabla Q_n(\beta^*|\beta),\, \beta' - \beta^* \rangle \le -\frac{\gamma_n}{2}\|\beta' - \beta^*\|^2.$$

The above condition states that $Q_n(\cdot|\beta)$ is strongly concave in directions $\beta' - \beta^*$ that belong to $\mathcal{C}(\mathcal{S}, \bar{\mathcal{S}}; \mathcal{R})$. It is instructive to compare Condition 4 with a related condition proposed by [14] for analyzing high-dimensional M-estimators. They require the loss function to be strongly convex over the cone $\{u \in \mathbb{R}^p : \|\Pi_{\bar{\mathcal{S}}^\perp}(u)\|_{\mathcal{R}} \lesssim \|\Pi_{\bar{\mathcal{S}}}(u)\|_{\mathcal{R}}\}$. Our restricted set (4.3) is thus similar to this cone, but has the additional term $2\Psi(\bar{\mathcal{S}})\|u\|$. The main purpose of the term $2\Psi(\bar{\mathcal{S}})\|u\|$ is to allow the regularization parameter $\lambda_n$ to jointly control optimization and statistical error. We note that while Condition 4 is stronger than the usual RSC condition for M-estimators, in typical settings the difference is immaterial. This is because $\|\Pi_{\bar{\mathcal{S}}}(u)\|_{\mathcal{R}}$ is within a constant factor of $\Psi(\bar{\mathcal{S}}) \cdot \|u\|$, and hence checking RSC over $\mathcal{C}$ amounts to checking it over $\|\Pi_{\bar{\mathcal{S}}^\perp}(u)\|_{\mathcal{R}} \lesssim \Psi(\bar{\mathcal{S}})\|u\|$, which is indeed what is typically done in the M-estimator setting.

Finally, we establish the condition that characterizes the achievable statistical error.

Condition 5 (Statistical Error $(\Delta_n, r, \delta)$).
For any fixed $\beta \in \mathcal{B}(r; \beta^*)$, with probability at least $1 - \delta$, we have
$$\|\nabla Q_n(\beta^*|\beta) - \nabla Q(\beta^*|\beta)\|_{\mathcal{R}^*} \le \Delta_n. \qquad (4.4)$$

This quantity replaces the term $\|M_n(\beta) - M(\beta)\|$, which appears in [1] and [20] and which presents problems in the high-dimensional regime.

4.3 Main Results

In this section, we provide the theoretical guarantees for a resampled version of our regularized EM algorithm: we split the whole dataset into $T$ pieces and use a fresh piece of data in each iteration of regularized EM. As in [1], resampling makes it possible to check that Conditions 4-5 are satisfied without requiring them to hold uniformly for all $\beta \in \mathcal{B}(r; \beta^*)$ with high probability. Our empirical results indicate that resampling is not in fact required in practice, and that it is an artifact of the analysis. We refer to this resampled version as Algorithm 2. In the sequel, we let $m := n/T$ denote the sample size in each iteration. We let $\alpha := \sup_{u \in \mathbb{R}^p \setminus \{0\}} \|u\|_*/\|u\|$, where $\|\cdot\|_*$ is the dual norm of $\|\cdot\|$.

For Algorithm 2, our main result is as follows. The proof is deferred to the Supplemental Material.

Theorem 1. Assume the model parameter $\beta^* \in \mathcal{S}$ and the regularizer $\mathcal{R}$ is decomposable with respect to $(\mathcal{S}, \bar{\mathcal{S}})$, where $\mathcal{S} \subseteq \bar{\mathcal{S}} \subseteq \mathbb{R}^p$. Assume $r > 0$ is such that $\mathcal{B}(r; \beta^*) \subseteq \Omega$. Further, assume the function $Q(\cdot|\cdot)$, defined in (2.2), is self-consistent and satisfies Conditions 2-3 with parameters $(\gamma, \mu, r)$ and $(\tau, r)$. Given $n$ samples and $T$ iterations, let $m := n/T$. Assume $Q_m(\cdot|\cdot)$, computed from any $m$ i.i.d.
samples according to (2.1), satisfies Conditions 4-5 with parameters $(\gamma_m, \mathcal{S}, \bar{\mathcal{S}}, r, 0.5\delta/T)$ and $(\Delta_m, r, 0.5\delta/T)$. Let $\kappa^* := 5\alpha\mu\tau/(\gamma\gamma_m)$, and assume $0 < \tau < \gamma$ and $0 < \kappa^* \le 3/4$. Define $\bar{\Delta} := r\gamma_m/[60\Psi(\bar{\mathcal{S}})]$ and assume $\Delta_m$ is such that $\Delta_m \le \bar{\Delta}$.

Consider Algorithm 2 with initialization $\beta^{(0)} \in \mathcal{B}(r; \beta^*)$ and with regularization parameters given by
$$\lambda_m^{(t)} = \kappa^t \frac{\gamma_m}{5\Psi(\bar{\mathcal{S}})}\|\beta^{(0)} - \beta^*\| + \frac{1 - \kappa^t}{1 - \kappa}\Delta, \quad t = 1, 2, \ldots, T, \qquad (4.5)$$
for any $\Delta \in [3\Delta_m, 3\bar{\Delta}]$ and $\kappa \in [\kappa^*, 3/4]$. Then with probability at least $1 - \delta$, we have that for any $t \in [T]$,
$$\|\beta^{(t)} - \beta^*\| \le \kappa^t \|\beta^{(0)} - \beta^*\| + \frac{5}{\gamma_m}\,\frac{1 - \kappa^t}{1 - \kappa}\,\Psi(\bar{\mathcal{S}})\,\Delta. \qquad (4.6)$$

The estimation error is bounded by a term decaying linearly with the number of iterations $t$, which we can think of as the optimization error, and a second term that characterizes the ultimate estimation error of our algorithm. With $T = O(\log n)$ and a suitable choice of $\Delta$ such that $\Delta = O(\Delta_{n/T})$, we bound the ultimate estimation error as
$$\|\beta^{(T)} - \beta^*\| \lesssim \frac{1}{(1 - \kappa)\gamma_{n/T}}\,\Psi(\bar{\mathcal{S}})\,\Delta_{n/T}. \qquad (4.7)$$

We note that overestimating the initial error $\|\beta^{(0)} - \beta^*\|$ is not important, as it may slightly increase the overall number of iterations, but will not impact the ultimate estimation error. The constraint $\Delta_m \lesssim r\gamma_m/\Psi(\bar{\mathcal{S}})$ ensures that $\beta^{(t)}$ is contained in $\mathcal{B}(r; \beta^*)$ for all $t \in [T]$.
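The two-term structure of (4.6) is exactly the fixed point of a contraction: iterating $e_t = \kappa e_{t-1} + (5/\gamma_m)\Psi(\bar{\mathcal{S}})\Delta$ reproduces the bound term by term. A numerical identity check (the constants below are illustrative placeholders, not values from the theorem):

```python
import numpy as np

# Sanity-check the contraction bound (4.6): iterating
#   e_t = kappa * e_{t-1} + (5 / gamma_m) * Psi * Delta
# matches the closed form
#   kappa**t * e0 + (1 - kappa**t)/(1 - kappa) * (5/gamma_m) * Psi * Delta.
kappa, gamma_m, Psi, Delta, e0, T = 0.7, 1.0, np.sqrt(5), 0.01, 1.0, 30

e = e0
for t in range(1, T + 1):
    e = kappa * e + (5.0 / gamma_m) * Psi * Delta
    closed = kappa**t * e0 + (1 - kappa**t) / (1 - kappa) * (5.0 / gamma_m) * Psi * Delta
    assert np.isclose(e, closed)

# The error settles at the floor (5 * Psi * Delta) / (gamma_m * (1 - kappa)),
# i.e., the ultimate estimation error of (4.7) up to constants.
```

This is only an arithmetic check of the recursion's closed form, not a substitute for the probabilistic argument in the Supplemental Material.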
This constraint is quite mild in the sense that if $\Delta_m = \Omega(r\gamma_m/\Psi(\bar{\mathcal{S}}))$, then $\beta^{(0)}$ is already a decent estimator, with estimation error $O(\Psi(\bar{\mathcal{S}})\Delta_m/\gamma_m)$ that already matches our expectation.

5 Examples: Applying the Theory

We now introduce three well-known latent variable models. For each model, we first review the standard EM algorithm formulation and discuss the extension to the high-dimensional setting. We then apply Theorem 1 to obtain the statistical guarantee for regularized EM with data splitting (Algorithm 2). The key ingredient underlying these results is checking that the technical conditions in Section 4 hold for each model. We postpone these tedious details to the Supplemental Material.

5.1 Gaussian Mixture Model

We consider the balanced isotropic Gaussian mixture model (GMM) with two components, where the distribution of the random variables $(Y, Z) \in \mathbb{R}^p \times \{-1, 1\}$ is characterized as
$$\Pr(Y = y \mid Z = z) = \phi(y;\, z \cdot \beta^*,\, \sigma^2 I_p), \quad \Pr(Z = 1) = \Pr(Z = -1) = 1/2.$$
Here we use $\phi(\cdot;\, \mu, \Sigma)$ to denote the probability density function of $\mathcal{N}(\mu, \Sigma)$. In this example, $Z$ is the latent variable that indicates the cluster membership of each sample. Given $n$ i.i.d. samples $\{y_i\}_{i=1}^n$, the function $Q_n(\cdot|\cdot)$ defined in (2.1) corresponds to
$$Q_n^{\mathrm{GMM}}(\beta'|\beta) = -\frac{1}{2n}\sum_{i=1}^n \left[ w(y_i; \beta)\,\|y_i - \beta'\|_2^2 + (1 - w(y_i; \beta))\,\|y_i + \beta'\|_2^2 \right], \qquad (5.1)$$
where $w(y; \beta) := \exp\!\big(-\tfrac{\|y - \beta\|_2^2}{2\sigma^2}\big)\big[\exp\!\big(-\tfrac{\|y - \beta\|_2^2}{2\sigma^2}\big) + \exp\!\big(-\tfrac{\|y + \beta\|_2^2}{2\sigma^2}\big)\big]^{-1}$. We assume $\beta^* \in \mathbb{B}_0(s; p) := \{u \in \mathbb{R}^p : |\mathrm{supp}(u)| \le s\}$. Naturally, we choose the regularizer $\mathcal{R}(\cdot)$ to be the $\ell_1$ norm.
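For this model, one regularized EM iteration has a closed form: expanding the squared norms in $w(y;\beta)$ shows that the E-step weight is a logistic function of $\langle y, \beta\rangle/\sigma^2$, and maximizing (5.1) minus $\lambda\|\cdot\|_1$ gives coordinatewise soft-thresholding of the weighted mean $\frac{1}{n}\sum_i (2w_i - 1)y_i$. A minimal sketch of this inner loop (ours, not the authors' reference code):

```python
import numpy as np

def soft_threshold(v, lam):
    """Prox operator of lam * ||.||_1."""
    return np.sign(v) * np.maximum(np.abs(v) - lam, 0.0)

def em_step_gmm(Y, beta, sigma, lam):
    """One regularized EM iteration for the symmetric 2-component GMM of (5.1).
    E-step: w(y; beta) = sigmoid(2 <y, beta> / sigma^2), since the ||y||^2 and
    ||beta||^2 terms cancel when expanding the two exponents.
    M-step: the unregularized maximizer of (5.1) is (1/n) sum (2 w_i - 1) y_i;
    adding lam * ||.||_1 turns this into soft-thresholding."""
    z = np.clip(2.0 * (Y @ beta) / sigma**2, -60.0, 60.0)   # avoid exp overflow
    w = 1.0 / (1.0 + np.exp(-z))                            # E-step weights
    target = ((2.0 * w - 1.0)[:, None] * Y).mean(axis=0)    # unregularized M-step
    return soft_threshold(target, lam)
```

With `lam = 0` this reduces to the classical (unregularized) EM update for this model; the $\lambda$ passed in would follow the schedule (3.3).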
We define the signal-to-noise ratio $\mathrm{SNR} := \|\beta^*\|_2/\sigma$.

Corollary 1 (Sparse Recovery in GMM). There exist constants $\rho, C$ such that if $\mathrm{SNR} \ge \rho$, $n/T \ge [80C(\|\beta^*\|_\infty + \sigma)/\|\beta^*\|_2]^2\, s \log p$, and $\beta^{(0)} \in \mathcal{B}(\|\beta^*\|_2/4; \beta^*)$, then with probability at least $1 - T/p$, Algorithm 2 with parameters $\Delta = C(\|\beta^*\|_\infty + \sigma)\sqrt{T \log p/n}$, $\lambda_{n/T}^{(0)} = 0.2\|\beta^{(0)} - \beta^*\|_2/\sqrt{s}$, any $\kappa \in [1/2, 3/4]$, and $\ell_1$ regularization generates $\beta^{(t)}$ with estimation error
$$\|\beta^{(t)} - \beta^*\|_2 \le \kappa^t \|\beta^{(0)} - \beta^*\|_2 + \frac{5C(\|\beta^*\|_\infty + \sigma)}{1 - \kappa}\sqrt{\frac{s \log p}{n}}\,\sqrt{T}, \quad \text{for all } t \in [T]. \qquad (5.2)$$

Note that by setting $T \asymp \log(n/\log p)$, the order of the final estimation error turns out to be $(\|\beta^*\|_\infty + \sigma)\sqrt{(s \log p/n)\,\log(n/\log p)}$. The minimax rate for estimating an $s$-sparse vector in a single Gaussian cluster is $\sqrt{s \log p/n}$, so the rate is optimal in $(n, p, s)$ up to a logarithmic factor.

5.2 Mixed Linear Regression

Mixed linear regression (MLR), as considered in some recent work [5, 7, 22], is the problem of recovering two or more linear regression vectors from mixed linear measurements. In the case of mixed linear regression with two symmetric and balanced components, the response-covariate pair $(Y, X) \in \mathbb{R} \times \mathbb{R}^p$ is linked through
$$Y = \langle X,\, Z \cdot \beta^* \rangle + W,$$
where $W$ is the noise term and $Z$ is a latent variable with Rademacher distribution over $\{-1, 1\}$. We assume $X \sim \mathcal{N}(0, I_p)$ and $W \sim \mathcal{N}(0, \sigma^2)$. In this setting, with $n$ i.i.d.
samples $\{y_i, x_i\}_{i=1}^n$ of the pair $(Y, X)$, the function $Q_n(\cdot|\cdot)$ corresponds to
$$Q_n^{\mathrm{MLR}}(\beta'|\beta) = -\frac{1}{2n}\sum_{i=1}^n \left[ w(y_i, x_i; \beta)\,(y_i - \langle x_i, \beta' \rangle)^2 + (1 - w(y_i, x_i; \beta))\,(y_i + \langle x_i, \beta' \rangle)^2 \right], \qquad (5.3)$$
where $w(y, x; \beta) := \exp\!\big(-\tfrac{(y - \langle x, \beta \rangle)^2}{2\sigma^2}\big)\big[\exp\!\big(-\tfrac{(y - \langle x, \beta \rangle)^2}{2\sigma^2}\big) + \exp\!\big(-\tfrac{(y + \langle x, \beta \rangle)^2}{2\sigma^2}\big)\big]^{-1}$.

We consider two kinds of structure on $\beta^*$:

Sparse Recovery. Assume $\beta^* \in \mathbb{B}_0(s; p)$. We then let $\mathcal{R}$ be the $\ell_1$ norm, as in the previous section, and define $\mathrm{SNR} := \|\beta^*\|_2/\sigma$.

Corollary 2 (Sparse Recovery in MLR). There exist constants $\rho, C, C'$ such that if $\mathrm{SNR} \ge \rho$, $n/T \ge C'[(\|\beta^*\|_2 + \sigma)/\|\beta^*\|_2]^2\, s \log p$, and $\beta^{(0)} \in \mathcal{B}(\|\beta^*\|_2/240; \beta^*)$, then with probability at least $1 - T/p$, Algorithm 2 with parameters $\Delta = C(\|\beta^*\|_2 + \sigma)\sqrt{T \log p/n}$, $\lambda_{n/T}^{(0)} = \|\beta^{(0)} - \beta^*\|_2/(15\sqrt{s})$, any $\kappa \in [1/2, 3/4]$, and $\ell_1$ regularization generates $\beta^{(t)}$ with estimation error
$$\|\beta^{(t)} - \beta^*\|_2 \le \kappa^t \|\beta^{(0)} - \beta^*\|_2 + \frac{15C(\|\beta^*\|_2 + \sigma)}{1 - \kappa}\sqrt{\frac{s \log p}{n}}\,\sqrt{T}, \quad \text{for all } t \in [T].$$

Performing $T \asymp \log(n/(s \log p))$ iterations gives us an estimation rate of $(\|\beta^*\|_2 + \sigma)\sqrt{(s \log p/n)\,\log(n/(s \log p))}$, which is near-optimal in $(s, p, n)$. The dependence on $\|\beta^*\|_2$,
The dependence on (cid:107)\u03b2\u2217(cid:107)2,\n\niterations gives us\n\n((cid:107)\u03b2\u2217(cid:107)2 +\n\nestimation rate\n\nPerforming T\n\nwhich also appears in the analysis of EM in the classical (low dimensional) setting [1], arises from\nfundamental limits of EM. Removing such dependence for MLR is possible by convex relaxation\n[7]. It is interesting to study how to remove it in the high dimensional setting.\nLow Rank Recovery. Second we consider the setting where the model parameter is a matrix \u0393\u2217 \u2208\nRp1\u00d7p2 with rank(\u0393\u2217) = \u03b8 (cid:28) min(p1, p2). We further assume X \u2208 Rp1\u00d7p2 is an i.i.d. Gaussian\nmatrix, i.e., entries of X are independent random variables with distribution N (0, 1). We apply\ni=1 |si(\u0393)|, where\nsi(\u0393) is the ith singular value of \u0393. Similarly, we let SNR := (cid:107)\u0393\u2217(cid:107)F /\u03c3.\nCorollary 3 (Low rank recovery in MLR). There exist constant \u03c1, C, C(cid:48) such that if SNR \u2265 \u03c1,\nn/T \u2265 C(cid:48) [((cid:107)\u0393\u2217(cid:107)F + \u03c3)/(cid:107)\u0393\u2217(cid:107)F ]2 \u03b8(p1 + p2), \u0393(0) \u2208 B((cid:107)\u0393\u2217(cid:107)F /1600, \u0393\u2217); then with probability\n2\u03b8, any \u03ba \u2208 [1/2, 3/4] and nuclear norm regularization generates\nn/T = 0.01(cid:107)\u0393(0) \u2212 \u0393\u2217(cid:107)F /\n\u03bb(0)\n\u0393(t) that has estimation error\n\nnuclear norm regularization to serve the low rank structure, i.e, R(\u0393) = (cid:80)p1,p2\nat least 1 \u2212 T exp(\u2212p1 \u2212 p2) Algorithm 2 with parameters \u2206 = C((cid:107)\u0393\u2217(cid:107)F + \u03c3)(cid:112)T (p1 + p2)/n,\n\n\u221a\n\n(cid:107)\u0393(t) \u2212 \u0393\u2217(cid:107)F \u2264 \u03bat(cid:107)\u0393(0) \u2212 \u0393\u2217(cid:107)F +\n\n100C(cid:48)((cid:107)\u0393\u2217(cid:107)F + \u03c3)\n\n1 \u2212 \u03ba\n\n2\u03b8(p1 + p2)\n\nn\n\nT , for all t \u2208 [T ].\n\n(cid:114)\n\nThe standard low rank matrix recovery with a single 
component, including other sensing matrix designs beyond Gaussianity, has been studied extensively (e.g., [2, 4, 13, 15]). To the best of our knowledge, the theoretical study of mixed low-rank matrix recovery has not previously been considered.

5.3 Missing Covariate Regression

As our last example, we consider the missing covariate regression (MCR) problem. Paralleling standard linear regression, $\{y_i, x_i\}_{i=1}^n$ are samples of $(Y, X)$ linked through $Y = \langle X, \beta^* \rangle + W$. However, we assume each entry of $x_i$ is missing independently with probability $\epsilon \in (0, 1)$. Therefore, the observed covariate vector $\tilde{x}_i$ takes the form
$$\tilde{x}_{i,j} = \begin{cases} x_{i,j} & \text{with probability } 1 - \epsilon, \\ * & \text{otherwise.} \end{cases}$$
We assume the model is under Gaussian design: $X \sim \mathcal{N}(0, I_p)$, $W \sim \mathcal{N}(0, \sigma^2)$. We refer the reader to the Supplementary Material for the specific $Q_n(\cdot|\cdot)$ function. In the high-dimensional case, we assume $\beta^* \in \mathbb{B}_0(s; p)$. We define $\rho := \|\beta^*\|_2/\sigma$ to be the SNR and $\omega := r/\|\beta^*\|_2$ to be the relative contractivity radius; in particular, let $\zeta := (1 + \omega)\rho$.

Corollary 4 (Sparse Recovery in MCR).
There exist constants $C, C', C_0, C_1$ such that if $(1 + \omega)\rho \leq C_0 < 1$, $\epsilon < C_1$, $n/T \geq C' \max\{\sigma^2 (\omega\rho)^{-1}, 1\}\, s \log p$, and $\beta^{(0)} \in \mathcal{B}(\omega \|\beta^*\|_2, \beta^*)$, then with probability at least $1 - T/p$, Algorithm 2 with parameters $\Delta = C \sigma \sqrt{T \log p / n}$, $\lambda_n^{(0)} = \|\beta^{(0)} - \beta^*\|_2 / (45\sqrt{s})$, any $\kappa \in [1/2, 3/4]$, and $\ell_1$ regularization generates $\beta^{(t)}$ that has estimation error
\[
\|\beta^{(t)} - \beta^*\|_2 \leq \kappa^t \|\beta^{(0)} - \beta^*\|_2 + \frac{45\, C \sigma}{1 - \kappa} \sqrt{\frac{s \log p}{n}} \sqrt{T}, \quad \text{for all } t \in [T].
\]

Unlike the previous two models, here we require an upper bound on the signal-to-noise ratio. This unusual constraint is in fact unavoidable [10]. By optimizing $T$, the order of the final estimation error turns out to be $\sigma \sqrt{(s \log p / n) \log(n / (s \log p))}$.

6 Simulations

We now provide some simulation results to back up our theory. Note that while Theorem 1 requires resampling, we believe that in practice this is unnecessary. This is validated by our results, where we apply Algorithm 1 to the four latent variable models discussed in Section 5.

Convergence Rate. We first evaluate the convergence of Algorithm 1, assuming only that the initialization is a bounded distance from $\beta^*$. For a given error $\omega \|\beta^*\|_2$, the initial parameter $\beta^{(0)}$ is picked randomly from the sphere centered at $\beta^*$ with radius $\omega \|\beta^*\|_2$. We use Algorithm 1 with $T = 7$, $\kappa = 0.7$, and $\lambda_n^{(0)}$ as specified in Theorem 1. The choice of the critical parameter $\Delta$ is given in the Supplementary Material.
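This sphere initialization can be sketched in a few lines of NumPy; the helper name `init_on_sphere` and the specific dimensions below are our illustrative assumptions, not code from the paper:

```python
import numpy as np

def init_on_sphere(beta_star, omega, rng):
    """Draw beta0 uniformly from the sphere of radius omega * ||beta_star||_2
    centered at beta_star, via a normalized standard-Gaussian direction."""
    u = rng.standard_normal(beta_star.shape)
    u /= np.linalg.norm(u)
    return beta_star + omega * np.linalg.norm(beta_star) * u

# Illustrative sizes matching the simulation settings: p = 800, s = 5, omega = 0.5.
rng = np.random.default_rng(0)
beta_star = np.zeros(800)
beta_star[:5] = 1.0                      # s-sparse ground truth
beta0 = init_on_sphere(beta_star, 0.5, rng)
# By construction, ||beta0 - beta_star||_2 = 0.5 * ||beta_star||_2.
```

A normalized Gaussian vector is uniformly distributed on the unit sphere, so scaling it by $\omega \|\beta^*\|_2$ places $\beta^{(0)}$ at exactly the prescribed initialization radius.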
For every single trial, we report the estimation error $\|\beta^{(t)} - \beta^*\|_2$ and the optimization error $\|\beta^{(t)} - \beta^{(T)}\|_2$ at every iteration. We plot the log of these errors against the iteration number $t$ in Figure 1.

Figure 1: Convergence of the regularized EM algorithm. Panels: (a) GMM, (b) MLR (sparse), (c) MLR (low rank), (d) MCR. In each panel, each curve is plotted from a single independent trial. Settings: (a, b, d) $(n, p, s) = (500, 800, 5)$; (c) $(n, p, \theta) = (600, 30, 3)$; (a–c) $\mathrm{SNR} = 5$; (d) $(\mathrm{SNR}, \epsilon) = (0.5, 0.2)$; (a–d) $\omega = 0.5$.

Statistical Rate. We now evaluate the statistical rate. We set $T = 7$ and compute the estimation error of $\widehat{\beta} := \beta^{(T)}$. In Figure 2, we plot $\|\widehat{\beta} - \beta^*\|_2$ against the normalized sample complexity, i.e., $n / (s \log p)$ for an $s$-sparse parameter and $n / (\theta p)$ for a rank-$\theta$ $p$-by-$p$ parameter. We refer the reader to Figure 1 for the other settings. We observe that the same normalized sample complexity leads to almost identical estimation error in practice, which supports the corresponding statistical rates established in Section 5.

Figure 2: Statistical rates. Panels: (a) GMM, (b) MLR (sparse), (c) MLR (low rank), (d) MCR. Each point is an average of 20 independent trials, with one curve per value of $p$ ($p \in \{200, 400, 800\}$ in panels (a), (b), (d); $p \in \{25, 30, 35\}$ in panel (c)). Settings: (a, b, d) $s = 5$; (c) $\theta = 3$.

Acknowledgments

The authors would like to acknowledge NSF grants 1056028, 1302435 and 1116955. This research was also partially supported by the U.S.
Department of Transportation through the Data-Supported Transportation Operations and Planning (D-STOP) Tier 1 University Transportation Center.

References

[1] Sivaraman Balakrishnan, Martin J. Wainwright, and Bin Yu. Statistical guarantees for the EM algorithm: From population to sample-based analysis. arXiv preprint arXiv:1408.2156, 2014.

[2] T. Tony Cai and Anru Zhang. ROP: Matrix recovery via rank-one projections. The Annals of Statistics, 43(1):102–138, 2015.

[3] Emmanuel Candès and Terence Tao. The Dantzig selector: Statistical estimation when p is much larger than n. The Annals of Statistics, pages 2313–2351, 2007.

[4] Emmanuel J. Candès and Yaniv Plan. Tight oracle inequalities for low-rank matrix recovery from a minimal number of noisy random measurements. IEEE Transactions on Information Theory, 57(4):2342–2359, 2011.

[5] Arun Tejasvi Chaganty and Percy Liang. Spectral experts for estimating mixtures of linear regressions. arXiv preprint arXiv:1306.3729, 2013.

[6] Yudong Chen, Sujay Sanghavi, and Huan Xu. Improved graph clustering. IEEE Transactions on Information Theory, 60(10):6440–6455, 2014.

[7] Yudong Chen, Xinyang Yi, and Constantine Caramanis. A convex formulation for mixed regression with two components: Minimax optimal rates. In Conference on Learning Theory, 2014.

[8] Arthur P. Dempster, Nan M. Laird, and Donald B. Rubin. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B (Methodological), pages 1–38, 1977.

[9] Po-Ling Loh and Martin J. Wainwright. High-dimensional regression with noisy and missing data: Provable guarantees with non-convexity. In Advances in Neural Information Processing Systems, pages 2726–2734, 2011.

[10] Po-Ling Loh and Martin J. Wainwright. Corrupted and missing predictors: Minimax bounds for high-dimensional linear regression. In 2012 IEEE International Symposium on Information Theory Proceedings (ISIT), pages 2601–2605. IEEE, 2012.

[11] Jinwen Ma and Lei Xu. Asymptotic convergence properties of the EM algorithm with respect to the overlap in the mixture. Neurocomputing, 68:105–129, 2005.

[12] Geoffrey McLachlan and Thriyambakam Krishnan. The EM Algorithm and Extensions, volume 382. John Wiley & Sons, 2007.

[13] Sahand Negahban, Martin J. Wainwright, et al. Estimation of (near) low-rank matrices with noise and high-dimensional scaling. The Annals of Statistics, 39(2):1069–1097, 2011.

[14] Sahand Negahban, Bin Yu, Martin J. Wainwright, and Pradeep K. Ravikumar. A unified framework for high-dimensional analysis of M-estimators with decomposable regularizers. In Advances in Neural Information Processing Systems, pages 1348–1356, 2009.

[15] Benjamin Recht, Maryam Fazel, and Pablo A. Parrilo. Guaranteed minimum-rank solutions of linear matrix equations via nuclear norm minimization. SIAM Review, 52(3):471–501, 2010.

[16] Nicolas Städler, Peter Bühlmann, and Sara van de Geer. ℓ1-penalization for mixture regression models. Test, 19(2):209–256, 2010.

[17] Paul Tseng. An analysis of the EM algorithm and entropy-like proximal point methods. Mathematics of Operations Research, 29(1):27–44, 2004.

[18] Roman Vershynin. Introduction to the non-asymptotic analysis of random matrices. arXiv preprint arXiv:1011.3027, 2010.

[19] Martin J. Wainwright. Structured regularizers for high-dimensional problems: Statistical and computational issues. Annual Review of Statistics and Its Application, 1:233–253, 2014.

[20] Zhaoran Wang, Quanquan Gu, Yang Ning, and Han Liu. High dimensional expectation-maximization algorithm: Statistical optimization and asymptotic normality. arXiv preprint arXiv:1412.8729, 2014.

[21] C. F. Jeff Wu. On the convergence properties of the EM algorithm. The Annals of Statistics, pages 95–103, 1983.

[22] Xinyang Yi, Constantine Caramanis, and Sujay Sanghavi. Alternating minimization for mixed linear regression. arXiv preprint arXiv:1310.3745, 2013.