{"title": "High Dimensional EM Algorithm: Statistical Optimization and Asymptotic Normality", "book": "Advances in Neural Information Processing Systems", "page_first": 2521, "page_last": 2529, "abstract": "We provide a general theory of the expectation-maximization (EM) algorithm for inferring high dimensional latent variable models. In particular, we make two contributions: (i) For parameter estimation, we propose a novel high dimensional EM algorithm which naturally incorporates sparsity structure into parameter estimation. With an appropriate initialization, this algorithm converges at a geometric rate and attains an estimator with the (near-)optimal statistical rate of convergence. (ii) Based on the obtained estimator, we propose a new inferential procedure for testing hypotheses for low dimensional components of high dimensional parameters. For a broad family of statistical models, our framework establishes the first computationally feasible approach for optimal estimation and asymptotic inference in high dimensions.", "full_text": "High Dimensional EM Algorithm:\n\nStatistical Optimization and Asymptotic Normality\u21e4\n\nZhaoran Wang\n\nPrinceton University\n\nQuanquan Gu\n\nUniversity of Virginia\n\nYang Ning\n\nPrinceton University\n\nHan Liu\n\nPrinceton University\n\nAbstract\n\nWe provide a general theory of the expectation-maximization (EM) algorithm for\ninferring high dimensional latent variable models. In particular, we make two con-\ntributions: (i) For parameter estimation, we propose a novel high dimensional EM\nalgorithm which naturally incorporates sparsity structure into parameter estimation.\nWith an appropriate initialization, this algorithm converges at a geometric rate\nand attains an estimator with the (near-)optimal statistical rate of convergence. (ii)\nBased on the obtained estimator, we propose a new inferential procedure for testing\nhypotheses for low dimensional components of high dimensional parameters. 
For a broad family of statistical models, our framework establishes the first computationally feasible approach for optimal estimation and asymptotic inference in high dimensions.

1 Introduction

The expectation-maximization (EM) algorithm [12] is the most popular approach for calculating the maximum likelihood estimator of latent variable models. Nevertheless, due to the nonconcavity of the likelihood function of latent variable models, the EM algorithm generally only converges to a local maximum rather than the global one [30]. On the other hand, existing statistical guarantees for latent variable models are only established for global optima [3]. Therefore, there exists a gap between computation and statistics.

Significant progress has been made toward closing the gap between the local maximum attained by the EM algorithm and the maximum likelihood estimator [2, 18, 25, 30]. In particular, [30] first establish general sufficient conditions for the convergence of the EM algorithm. [25] further improve this result by viewing the EM algorithm as a proximal point method applied to the Kullback-Leibler divergence. See [18] for a detailed survey. More recently, [2] establish the first result that characterizes explicit statistical and computational rates of convergence for the EM algorithm. They prove that, given a suitable initialization, the EM algorithm converges at a geometric rate to a local maximum close to the maximum likelihood estimator. All these results are established in the low dimensional regime where the dimension d is much smaller than the sample size n.

In high dimensional regimes where the dimension d is much larger than the sample size n, there exists no theoretical guarantee for the EM algorithm. In fact, when d ≫ n, the maximum likelihood estimator is in general not well defined, unless the models are carefully regularized by sparsity-type assumptions.
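As background for the algorithmic discussion above, the following is a minimal sketch (ours, not the paper's code) of the classical EM iteration for the two-component symmetric Gaussian mixture Y = Z·β* + V that the paper revisits in §3.1. For this model the E-step posterior weights collapse the exact M-step into a closed-form tanh update; the sample sizes and the initialization below are illustrative assumptions.

```python
import numpy as np

def em_gmm_symmetric(y, beta_init, sigma=1.0, n_iters=50):
    """Classical EM for Y = Z * beta + V with Z uniform on {-1, +1} and
    V ~ N(0, sigma^2 I). The posterior weight of Z = +1 given y_i yields the
    closed-form M-step: beta <- (1/n) sum_i tanh(<beta, y_i>/sigma^2) y_i."""
    beta = beta_init.astype(float).copy()
    for _ in range(n_iters):
        w = np.tanh(y @ beta / sigma ** 2)      # E-step: 2 P(Z = +1 | y_i) - 1
        beta = (w[:, None] * y).mean(axis=0)    # M-step: maximizer of Q_n(.; beta)
    return beta

# Illustrative low dimensional run (d << n): EM recovers beta* up to sign.
rng = np.random.default_rng(0)
d, n = 5, 2000
beta_star = np.zeros(d)
beta_star[:2] = 2.0
z = rng.choice([-1.0, 1.0], size=n)
y = z[:, None] * beta_star + rng.standard_normal((n, d))
beta_hat = em_gmm_symmetric(y, beta_init=beta_star + 0.5 * rng.standard_normal(d))
err = min(np.linalg.norm(beta_hat - beta_star), np.linalg.norm(beta_hat + beta_star))
print(err)  # small: roughly the sqrt(d/n) statistical error
```

The estimand is identifiable only up to sign, which is why the error is measured against both ±β*; with d ≫ n this unregularized iteration breaks down, which motivates the truncation step introduced below.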
Furthermore, even if a regularized maximum likelihood estimator can be obtained in a computationally tractable manner, establishing the corresponding statistical properties, especially asymptotic normality, can still be challenging because of the existence of high dimensional nuisance parameters. To address such a challenge, we develop a general inferential theory of the EM algorithm for parameter estimation and uncertainty assessment of high dimensional latent variable models. In particular, we make two contributions in this paper:
• For high dimensional parameter estimation, we propose a novel high dimensional EM algorithm by attaching a truncation step to the expectation step (E-step) and maximization step (M-step). Such a truncation step effectively enforces the sparsity of the attained estimator and allows us to establish a significantly improved statistical rate of convergence.
• Based upon the estimator attained by the high dimensional EM algorithm, we propose a decorrelated score statistic for testing hypotheses related to low dimensional components of the high dimensional parameter.

*Research supported by NSF IIS1116730, NSF IIS1332109, NSF IIS1408910, NSF IIS1546482-BIGDATA, NSF DMS1454377-CAREER, NIH R01GM083084, NIH R01HG06841, NIH R01MH102339, and FDA HHSF223201000072C.

Under a unified analytic framework, we establish simultaneous statistical and computational guarantees for the proposed high dimensional EM algorithm and the respective uncertainty assessment procedure.
Let β* ∈ R^d be the true parameter, s* be its sparsity level, and {β^(t)}_{t=0}^{T} be the iterative solution sequence of the high dimensional EM algorithm with T being the total number of iterations. In particular, we prove that:
• Given an appropriate initialization β^init with relative error upper bounded by a constant κ ∈ (0, 1), i.e., ‖β^init − β*‖₂/‖β*‖₂ ≤ κ, the iterative solution sequence {β^(t)}_{t=0}^{T} satisfies

  ‖β^(t) − β*‖₂ ≤ Δ₁ · ρ^{t/2} + Δ₂ · √(s* · log d / n)    (1.1)

with high probability, where the first term is the optimization error and the second term is the statistical error, which attains the optimal rate. Here ρ ∈ (0, 1), and Δ₁, Δ₂ are quantities that possibly depend on ρ, κ and β*. As the optimization error term in (1.1) decreases to zero at a geometric rate with respect to t, the overall estimation error achieves the √(s* · log d / n) statistical rate of convergence (up to an extra factor of log n), which is (near-)minimax-optimal. See Theorem 3.4 for details.
• The proposed decorrelated score statistic is asymptotically normal. Moreover, its limiting variance is optimal in the sense that it attains the semiparametric information bound for the low dimensional components of interest in the presence of high dimensional nuisance parameters. See Theorem 4.6 for details.

Our framework allows two implementations of the M-step: exact maximization versus approximate maximization. The former calculates the maximizer exactly, while the latter conducts an approximate maximization through a gradient ascent step. Our framework is quite general. We illustrate its effectiveness by applying it to two high dimensional latent variable models, that is, Gaussian mixture model and mixture of regression model.

Comparison with Related Work: A closely related work is by [2], which considers the low dimensional regime where d is much smaller than n.
Under certain initialization conditions, they prove that the EM algorithm converges at a geometric rate to some local optimum that attains the √(d/n) statistical rate of convergence. They cover both maximization and gradient ascent implementations of the M-step, and establish the consequences for the two latent variable models considered in our paper under low dimensional settings. Our framework adopts their view of treating the EM algorithm as a perturbed version of gradient methods. However, to handle the challenge of high dimensionality, the key ingredient of our framework is the truncation step that enforces the sparsity structure along the solution path. Such a truncation operation poses significant challenges for both computational and statistical analysis. In detail, for computational analysis we need to carefully characterize the evolution of each intermediate solution's support and its effects on the evolution of the entire iterative solution sequence. For statistical analysis, we need to establish a fine-grained characterization of the entrywise statistical error, which is technically more challenging than just establishing the ℓ₂-norm error employed by [2]. In high dimensional regimes, we need to establish the √(s* · log d / n) statistical rate of convergence, which is much sharper than their √(d/n) rate when d ≫ n. In addition to point estimation, we further construct hypothesis tests for latent variable models in the high dimensional regime, which have not been established before.

High dimensionality poses significant challenges for assessing the uncertainty (e.g., testing hypotheses) of the constructed estimators. For example, [15] show that the limiting distribution of the Lasso estimator is not Gaussian even in the low dimensional regime.
A variety of approaches have been proposed to correct the Lasso estimator to attain asymptotic normality, including the debiasing method [13], the desparsification methods [26, 32] as well as instrumental variable-based methods [4]. Meanwhile, [16, 17, 24] propose post-selection procedures for exact inference. In addition, several authors propose methods based on data splitting [20, 29], stability selection [19] and ℓ₂-confidence sets [22]. However, these approaches mainly focus on generalized linear models rather than latent variable models. In addition, their results heavily rely on the fact that the estimator is a global optimum of a convex program. In comparison, our approach applies to a much broader family of statistical models with latent structures. For these latent variable models, it is computationally infeasible to obtain the global maximum of the penalized likelihood due to the nonconcavity of the likelihood function. Unlike existing approaches, our inferential theory is developed for the estimator attained by the proposed high dimensional EM algorithm, which is not necessarily a global optimum of any optimization formulation.

Another line of research for the estimation of latent variable models is the tensor method, which exploits the structures of third or higher order moments. See [1] and the references therein. However, existing tensor methods primarily focus on the low dimensional regime where d ≪ n. In addition, since the high order sample moments generally have a slow statistical rate of convergence, the estimators obtained by the tensor methods usually have a suboptimal statistical rate even for d ≪ n. For example, [9] establish the √(d⁶/n) statistical rate of convergence for mixture of regression model, which is suboptimal compared with the √(d/n) minimax lower bound.
Similarly, in high dimensional settings, the statistical rates of convergence attained by tensor methods are significantly slower than the statistical rate obtained in this paper.

The latent variable models considered in this paper have been well studied. Nevertheless, only a few works establish theoretical guarantees for the EM algorithm. In particular, for Gaussian mixture model, [10, 11] establish parameter estimation guarantees for the EM algorithm and its extensions. For mixture of regression model, [31] establish exact parameter recovery guarantees for the EM algorithm under a noiseless setting. For high dimensional mixture of regression model, [23] analyze the gradient EM algorithm for the ℓ₁-penalized log-likelihood. They establish support recovery guarantees for the attained local optimum but have no parameter estimation guarantees. In comparison with existing works, this paper establishes a general inferential framework for simultaneous parameter estimation and uncertainty assessment based on a novel high dimensional EM algorithm. Our analysis provides the first theoretical guarantee of parameter estimation and asymptotic inference in high dimensional regimes for the EM algorithm and its applications to a broad family of latent variable models.

Notation: The matrix (p, q)-norm, i.e., ‖·‖_{p,q}, is obtained by taking the ℓ_p-norm of each row and then taking the ℓ_q-norm of the obtained row norms. We use C, C′, . . . to denote generic constants. Their values may vary from line to line. We will introduce more notations in §2.2.

2 Methodology
We first introduce the high dimensional EM algorithm and then the respective inferential procedure. As examples, we consider their applications to Gaussian mixture model and mixture of regression model. For compactness, we defer the details to §A of the appendix.
More models are included in the longer version of this paper.

Algorithm 1 High Dimensional EM Algorithm
1: Parameter: Sparsity Parameter ŝ, Maximum Number of Iterations T
2: Initialization: Ŝ^init ← supp(β^init, ŝ), β^(0) ← trunc(β^init, Ŝ^init)   [supp(·,·) and trunc(·,·) are defined in (2.2) and (2.3)]
3: For t = 0 to T − 1
4:   E-step: Evaluate Q_n(β; β^(t))
5:   M-step: β^(t+0.5) ← M_n(β^(t))   [M_n(·) is implemented as in Algorithm 2 or 3]
6:   T-step: Ŝ^(t+0.5) ← supp(β^(t+0.5), ŝ), β^(t+1) ← trunc(β^(t+0.5), Ŝ^(t+0.5))
7: End For
8: Output: β̂ ← β^(T)

Algorithm 2 Maximization Implementation of the M-step
1: Input: β^(t), Q_n(β; β^(t))
2: Output: M_n(β^(t)) ← argmax_β Q_n(β; β^(t))

Algorithm 3 Gradient Ascent Implementation of the M-step
1: Input: β^(t), Q_n(β; β^(t)); Parameter: Stepsize η > 0
2: Output: M_n(β^(t)) ← β^(t) + η · ∇Q_n(β^(t); β^(t))

2.1 High Dimensional EM Algorithm
Before we introduce the proposed high dimensional EM algorithm (Algorithm 1), we briefly review the classical EM algorithm. Let h_β(y) be the probability density function of Y ∈ 𝒴, where β ∈ R^d is the model parameter. For latent variable models, we assume that h_β(y) is obtained by marginalizing over an unobserved latent variable Z ∈ 𝒵, i.e., h_β(y) = ∫_𝒵 f_β(y, z) dz. Let k_β(z | y) be the density of Z conditioning on the observed variable Y = y, i.e., k_β(z | y) = f_β(y, z)/h_β(y). We define

  Q_n(β; β′) = (1/n) · Σ_{i=1}^{n} ∫_𝒵 k_{β′}(z | y_i) · log f_β(y_i, z) dz.    (2.1)

See §B of the appendix for a detailed derivation. At the t-th iteration of the classical EM algorithm, we evaluate Q_n(β; β^(t)) at the E-step and then perform max_β Q_n(β; β^(t)) at the M-step. The proposed high dimensional EM algorithm (Algorithm 1) is built upon the E-step and M-step (lines 4 and 5) of the classical EM algorithm.
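The loop structure of Algorithm 1 can be sketched as follows (a sketch, not the paper's implementation). Here `m_step` stands in for Algorithm 2 or 3 and implicitly contains the E-step, and `s_hat` is the sparsity parameter ŝ; the toy `m_step` used at the end is hypothetical, purely to exercise the loop.

```python
import numpy as np

def trunc_top_s(beta, s_hat):
    """Lines 2 and 6 of Algorithm 1: supp keeps the indices of the s_hat
    largest |beta_j|, trunc zeroes out every other coordinate."""
    out = np.zeros_like(beta)
    keep = np.argsort(np.abs(beta))[-s_hat:]
    out[keep] = beta[keep]
    return out

def high_dim_em(beta_init, m_step, s_hat, n_iters):
    """Skeleton of Algorithm 1: truncated initialization, then alternate the
    M-step (which evaluates Q_n(.; beta) internally) with the T-step."""
    beta = trunc_top_s(beta_init, s_hat)          # line 2
    for _ in range(n_iters):                      # lines 3-7
        beta_half = m_step(beta)                  # lines 4-5: E-step + M-step
        beta = trunc_top_s(beta_half, s_hat)      # line 6: T-step
    return beta                                   # line 8: s_hat-sparse output

# Toy m_step (hypothetical, for illustration only): the iterate stays sparse.
beta0 = np.array([3.0, -2.0, 0.5, 0.1, 0.0])
out = high_dim_em(beta0, m_step=lambda b: 0.5 * b + 0.5 * beta0, s_hat=2, n_iters=10)
print(out)  # only the two largest-magnitude coordinates survive
```

Whatever M-step implementation is plugged in, every iterate after the T-step has at most ŝ nonzero entries, which is exactly how the algorithm enforces sparsity along the solution path.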
In addition to the exact maximization implementation of the M-step (Algorithm 2), we allow the gradient ascent implementation of the M-step (Algorithm 3), which performs an approximate maximization via a gradient ascent step. To handle the challenge of high dimensionality, in line 6 of Algorithm 1 we perform a truncation step (T-step) to enforce the sparsity structure. In detail, we define

  supp(β, s): the set of indices j corresponding to the top s largest |β_j|'s.    (2.2)

Also, for an index set S ⊆ {1, . . . , d}, we define the trunc(·,·) function in line 6 as

  [trunc(β, S)]_j = β_j · 1{j ∈ S}.    (2.3)

Note that β^(t+0.5) is the output of the M-step (line 5) at the t-th iteration of the high dimensional EM algorithm. To obtain β^(t+1), the T-step (line 6) preserves the entries of β^(t+0.5) with the top ŝ largest magnitudes and sets the rest to zero. Here ŝ is a tuning parameter that controls the sparsity level (line 1). By iteratively performing the E-step, M-step and T-step, the high dimensional EM algorithm attains an ŝ-sparse estimator β̂ = β^(T) (line 8). Here T is the total number of iterations.

2.2 Asymptotic Inference
Notation: Let ∇₁Q(β; β′) be the gradient with respect to β and ∇₂Q(β; β′) be the gradient with respect to β′. If there is no confusion, we simply denote ∇Q(β; β′) = ∇₁Q(β; β′) as in the previous sections. We define the higher order derivatives in the same manner, e.g., ∇²₁,₂Q(β; β′) is calculated by first taking the derivative with respect to β and then with respect to β′. For v = [v₁ᵀ, v₂ᵀ]ᵀ ∈ R^d with v₁ ∈ R^{d₁}, v₂ ∈ R^{d₂} and d₁ + d₂ = d, we use notations such as v₁ ∈ R^{d₁} and A₁,₂ ∈ R^{d₁×d₂} to denote the corresponding subvector of v ∈ R^d and the submatrix of A ∈ R^{d×d}.

We aim to conduct asymptotic inference for low dimensional components of the high dimensional parameter β*. Without loss of generality, we consider a single entry of β*.
In particular, we assume β* = [α*, (γ*)ᵀ]ᵀ, where α* ∈ R is the entry of interest, while γ* ∈ R^{d−1} is treated as the nuisance parameter. In the following, we construct a high dimensional score test named the decorrelated score test. It is worth noting that our method and theory can be easily generalized to perform statistical inference for an arbitrary low dimensional subvector of β*.

Decorrelated Score Test: For the score test, we are primarily interested in testing H₀ : α* = 0, since this null hypothesis characterizes the uncertainty in variable selection. Our method easily generalizes to H₀ : α* = α₀ with α₀ ≠ 0. For notational simplicity, we define the following key quantity

  T_n(β) = ∇²₁,₁Q_n(β; β) + ∇²₁,₂Q_n(β; β) ∈ R^{d×d}.    (2.4)

Let β = [α, γᵀ]ᵀ. We define the decorrelated score function S_n(·,·) ∈ R as

  S_n(β, λ) = [∇₁Q_n(β; β)]_α − w(β, λ)ᵀ · [∇₁Q_n(β; β)]_γ.    (2.5)

Here w(β, λ) ∈ R^{d−1} is obtained using the following Dantzig selector [8]

  w(β, λ) = argmin_{w ∈ R^{d−1}} ‖w‖₁, subject to ‖[T_n(β)]_{γ,α} − [T_n(β)]_{γ,γ} · w‖_∞ ≤ λ,    (2.6)

where λ > 0 is a tuning parameter. Let β̂ = [α̂, γ̂ᵀ]ᵀ, where β̂ is the estimator attained by the high dimensional EM algorithm (Algorithm 1). We define the decorrelated score statistic as

  √n · S_n(β̂₀, λ) · |[T_n(β̂₀)]_{α|γ}|^{−1/2},    (2.7)

where β̂₀ = [0, γ̂ᵀ]ᵀ and [T_n(β̂₀)]_{α|γ} = [1, −w(β̂₀, λ)ᵀ] · T_n(β̂₀) · [1, −w(β̂₀, λ)ᵀ]ᵀ. Here we use β̂₀ instead of β̂ since we are interested in the null hypothesis H₀ : α* = 0. We can also replace β̂₀ with β̂ and the theoretical results will remain the same. In §4 we will prove that the proposed decorrelated score statistic in (2.7) is asymptotically N(0, 1).
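The construction in (2.4)-(2.7) can be sketched numerically. The sketch below (ours, not the paper's code) takes a generic symmetric matrix in the role of T_n(β̂₀) and a gradient vector in the role of ∇₁Q_n(β̂₀; β̂₀), solves the Dantzig selector (2.6) as a linear program, and assembles the statistic (2.7); the toy inputs at the end are purely illustrative.

```python
import numpy as np
from scipy.optimize import linprog

def dantzig_w(T, lam):
    """Dantzig selector (2.6): min ||w||_1 s.t. ||T_{g,a} - T_{g,g} w||_inf <= lam,
    with alpha taken as the first coordinate. Cast as an LP via w = u - v, u, v >= 0."""
    A, b = T[1:, 1:], T[1:, 0]
    m = A.shape[0]
    c = np.ones(2 * m)                          # ||w||_1 = sum(u) + sum(v)
    A_ub = np.block([[A, -A], [-A, A]])         # encodes |b - A(u - v)|_inf <= lam
    b_ub = np.concatenate([b + lam, lam - b])
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=(0, None))
    return res.x[:m] - res.x[m:]

def decorrelated_score_stat(grad, T, lam, n):
    """Statistic (2.7): sqrt(n) * S_n * |[T_n]_{a|g}|^{-1/2}, with S_n from (2.5)."""
    w = dantzig_w(T, lam)
    S = grad[0] - w @ grad[1:]                  # decorrelated score function (2.5)
    one_w = np.concatenate([[1.0], -w])
    var = one_w @ T @ one_w                     # [T_n]_{alpha|gamma}
    return np.sqrt(n) * S / np.sqrt(abs(var))

# Sanity check on a toy input: with T = I the decorrelation weights vanish.
stat = decorrelated_score_stat(np.array([0.1, 0.0, 0.0, 0.0]), np.eye(4), lam=0.01, n=100)
print(stat)
```

With T = I and a zero off-diagonal block, w = 0 is feasible and ℓ₁-optimal, so the statistic reduces to √n times the raw score entry; in practice T_n and the gradient come from the fitted latent variable model.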
Consequently, the decorrelated score test with significance level δ ∈ (0, 1) takes the form

  ψ_S(δ) = 1{ √n · S_n(β̂₀, λ) · |[T_n(β̂₀)]_{α|γ}|^{−1/2} ∉ [−Φ⁻¹(1 − δ/2), Φ⁻¹(1 − δ/2)] },

where Φ⁻¹(·) is the inverse function of the Gaussian cumulative distribution function. If ψ_S(δ) = 1, we reject the null hypothesis H₀ : α* = 0. The intuition of this decorrelated score test is explained in §D of the appendix. The key theoretical observation is Theorem 2.1, which connects ∇₁Q_n(·;·) in (2.5) and T_n(·) in (2.7) with the score function and Fisher information in the presence of latent structures. Let ℓ_n(β) be the log-likelihood. Its score function is ∇ℓ_n(β) and the Fisher information is I(β*) = −E_{β*}[∇²ℓ_n(β*)]/n, where E_{β*}(·) is the expectation under the model with parameter β*.

Theorem 2.1. For the true parameter β* and any β ∈ R^d, it holds that

  ∇₁Q_n(β; β) = ∇ℓ_n(β)/n, and E_{β*}[T_n(β*)] = −I(β*) = E_{β*}[∇²ℓ_n(β*)]/n.    (2.8)

Proof. See §I.1 of the appendix for a detailed proof.

Based on the decorrelated score test, it is easy to establish the decorrelated Wald test, which allows us to construct confidence intervals. For compactness we defer it to the longer version of this paper.

3 Theory of Computation and Estimation
Before we present the main results, we introduce three technical conditions, which will significantly ease our presentation. They will be verified for specific latent variable models in §E of the appendix. The first two conditions, proposed by [2], characterize the properties of the population version lower bound function Q(·;·), i.e., the expectation of Q_n(·;·) defined in (2.1). We define the respective population version M-step as follows.
For the M-step in Algorithm 2, we define

  M(β) = argmax_{β′} Q(β′; β).    (3.1)

For the M-step in Algorithm 3, we define

  M(β) = β + η · ∇₁Q(β; β),    (3.2)

where η > 0 is the stepsize in Algorithm 3. We use B to denote the basin of attraction, i.e., the local region where the high dimensional EM algorithm enjoys the desired guarantees.

Condition 3.1. We define two versions of this condition.
• Lipschitz-Gradient-1(γ₁, B). For the true parameter β* and any β ∈ B, we have

  ‖∇₁Q(M(β); β*) − ∇₁Q(M(β); β)‖₂ ≤ γ₁ · ‖β − β*‖₂,    (3.3)

where M(·) is the population version M-step (maximization implementation) defined in (3.1).
• Lipschitz-Gradient-2(γ₂, B). For the true parameter β* and any β ∈ B, we have

  ‖∇₁Q(β; β*) − ∇₁Q(β; β)‖₂ ≤ γ₂ · ‖β − β*‖₂.    (3.4)

Condition 3.1 defines a variant of Lipschitz continuity for ∇₁Q(·;·). In the sequel, we will use (3.3) and (3.4) in the analysis of the two implementations of the M-step respectively.

Condition 3.2 Concavity-Smoothness(µ, ν, B). For any β₁, β₂ ∈ B, Q(·; β*) is µ-smooth, i.e.,

  Q(β₁; β*) ≥ Q(β₂; β*) + (β₁ − β₂)ᵀ · ∇₁Q(β₂; β*) − µ/2 · ‖β₂ − β₁‖₂²,    (3.5)

and ν-strongly concave, i.e.,

  Q(β₁; β*) ≤ Q(β₂; β*) + (β₁ − β₂)ᵀ · ∇₁Q(β₂; β*) − ν/2 · ‖β₂ − β₁‖₂².    (3.6)

This condition indicates that, when the second variable of Q(·;·) is fixed to be β*, the function is 'sandwiched' between two quadratic functions. The third condition characterizes the statistical error between the sample version and population version M-steps, i.e., M_n(·) defined in Algorithms 2 and 3, and M(·) in (3.1) and (3.2). Recall that ‖·‖₀ denotes the total number of nonzero entries in a vector.

Condition 3.3 Statistical-Error(ε, δ, s, n, B). For any fixed β ∈ B with ‖β‖₀ ≤ s, we have that

  ‖M(β) − M_n(β)‖_∞ ≤ ε    (3.7)

holds with probability at least 1 − δ.
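Condition 3.3 is stated in the entrywise ℓ_∞-norm. A quick numeric sketch (ours, using i.i.d. Gaussian noise as a stand-in for the M-step perturbation) of why this norm scales like √(log d/n) while the ℓ₂-norm scales like the non-vanishing √(d/n) when d ≫ n:

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 200, 5000                         # high dimensional regime: d >> n
noise = rng.standard_normal((n, d))      # surrogate for per-sample M-step noise
err = noise.mean(axis=0)                 # fluctuation of the sample M-step

linf = np.abs(err).max()                 # entrywise error, order sqrt(2 log d / n)
l2 = np.linalg.norm(err)                 # l2 error, order sqrt(d / n)
print(linf, np.sqrt(2 * np.log(d) / n))  # comparable magnitudes
print(l2, np.sqrt(d / n))                # comparable; does not vanish for d >> n
```

The gap between the two norms is exactly why the analysis tracks the entrywise error: after truncation only ŝ coordinates survive, so the ℓ₂ error of the iterate inherits the √s* · ℓ_∞ scaling rather than the √(d/n) scaling.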
Here ε > 0 possibly depends on δ, the sparsity level s, the sample size n, the dimension d, as well as the basin of attraction B. In (3.7) the statistical error ε quantifies the ℓ_∞-norm of the difference between the population version and sample version M-steps. Particularly, we constrain the input of M(·) and M_n(·) to be s-sparse. Such a condition is different from the one used by [2]. In detail, they quantify the statistical error with the ℓ₂-norm and do not constrain the input of M(·) and M_n(·) to be sparse. Consequently, our subsequent statistical analysis is different from theirs. The reason we use the ℓ_∞-norm is that it characterizes the more refined entrywise statistical error, which converges at a fast rate of √(log d/n) (possibly with extra factors depending on specific models). In comparison, the ℓ₂-norm statistical error converges at a slow rate of √(d/n), which does not decrease to zero as n increases when d ≫ n. Furthermore, the fine-grained entrywise statistical error is crucial to our key proof for quantifying the effects of the truncation step (line 6 of Algorithm 1) on the iterative solution sequence.

3.1 Main Results
To simplify the technical analysis of the high dimensional EM algorithm, we focus on its resampling version, which is illustrated in Algorithm 4 in §C of the appendix.

Theorem 3.4.
We define B = {β : ‖β − β*‖₂ ≤ R}, where R = κ · ‖β*‖₂ for some κ ∈ (0, 1). We assume Condition Concavity-Smoothness(µ, ν, B) holds and ‖β^init − β*‖₂ ≤ R/2.
• For the maximization implementation of the M-step (Algorithm 2), we suppose that Condition Lipschitz-Gradient-1(γ₁, B) holds with ρ₁ := γ₁/ν ∈ (0, 1) and

  ŝ = ⌈C · max{16/(1/√ρ₁ − 1)², 4 · (1 + κ)²/(1 − κ)²} · s*⌉,    (3.8)

  [√ŝ + C′/√(1 − κ) · √s*] · ε ≤ min{(1 − √ρ₁)/2 · R, (1 − κ)²/[2 · (1 + κ)] · ‖β*‖₂}.    (3.9)

Here C ≥ 1 and C′ > 0 are constants. Under Condition Statistical-Error(ε, δ/T, ŝ, n/T, B) we have that, for t = 1, . . . , T,

  ‖β^(t) − β*‖₂ ≤ ρ₁^{t/2} · R  [optimization error]  +  [√ŝ + C′/√(1 − κ) · √s*]/(1 − √ρ₁) · ε  [statistical error]    (3.10)

holds with probability at least 1 − δ, where C′ is the same constant as in (3.9).
• For the gradient ascent implementation of the M-step (Algorithm 3), we suppose that Condition Lipschitz-Gradient-2(γ₂, B) holds with ρ₂ := 1 − 2 · (ν − γ₂)/(ν + µ) ∈ (0, 1) and the stepsize in Algorithm 3 is set to η = 2/(ν + µ). Meanwhile, we assume (3.8) and (3.9) hold with ρ₁ replaced by ρ₂. Under Condition Statistical-Error(ε, δ/T, ŝ, n/T, B) we have that, for t = 1, . . . , T, (3.10) holds with probability at least 1 − δ, in which ρ₁ is replaced with ρ₂.

Proof. See §G.1 of the appendix for a detailed proof.

The assumption in (3.8) states that the sparsity parameter ŝ is chosen to be sufficiently large and also of the same order as the true sparsity level s*. This assumption ensures that the error incurred by the truncation step can be upper bounded. In addition, as is shown for specific latent variable models in §E of the appendix, the error term ε in Condition Statistical-Error(ε, δ/T, ŝ, n/T, B) decreases as the sample size n increases. By the assumption in (3.8), √ŝ + C′/√(1 − κ) · √s* is of the same order as √s*. Therefore, the assumption in (3.9) suggests the sample size n is sufficiently large such that √s* · ε is sufficiently small. These assumptions guarantee that the entire iterative solution sequence remains within the basin of attraction B in the presence of statistical error.

Theorem 3.4 illustrates that the upper bound of the overall estimation error can be decomposed into two terms. The first term is the upper bound of the optimization error, which decreases to zero at a geometric rate of convergence, because we have ρ₁, ρ₂ < 1. Meanwhile, the second term is the upper bound of the statistical error, which does not depend on t. Since √ŝ + C′/√(1 − κ) · √s* is of the same order as √s*, this term is proportional to √s* · ε, where ε is the entrywise statistical error between M(·) and M_n(·). In §E of the appendix we prove that, for each specific latent variable model, ε is roughly of the order √(log d/n). (There may be extra factors attached to ε depending on each specific model.) Therefore, the statistical error term is roughly of the order √(s* · log d/n). Consequently, for a sufficiently large t = T such that the optimization and statistical error terms in (3.10) are of the same order, the final estimator β̂ = β^(T) attains a (near-)optimal √(s* · log d/n) (possibly with extra factors) statistical rate. For compactness, we give the following example and defer the details to §E.

Implications for Gaussian Mixture Model: We assume y₁, . . . , y_n are the n i.i.d. realizations of Y = Z · β* + V. Here Z is a Rademacher random variable, i.e., P(Z = +1) = P(Z = −1) = 1/2, and V ∼ N(0, σ² · I_d) is independent of Z, where σ is the standard deviation. Suppose that we have ‖β*‖₂/σ ≥ r, where r > 0 is a sufficiently large constant that denotes the minimum signal-to-noise ratio. In §E of the appendix we prove that there exists some constant C > 0 such that Conditions Lipschitz-Gradient-1(γ₁, B) and Concavity-Smoothness(µ, ν, B) hold with

  γ₁ = exp(−C · r²), µ = ν = 1, B = {β : ‖β − β*‖₂ ≤ R} with R = κ · ‖β*‖₂, κ = 1/4.

For a sufficiently large n, we have that Condition Statistical-Error(ε, δ, s, n, B) holds with

  ε = C · (‖β*‖_∞ + σ) · √([log d + log(2/δ)]/n).

Then the first part of Theorem 3.4 implies ‖β̂ − β*‖₂ ≤ C · √(s* · log d · log n/n) for a sufficiently large T, which is near-optimal with respect to the minimax lower bound √(s* · log d/n).

4 Theory of Inference
To simplify the presentation of the unified framework, we lay out several technical conditions, which will be verified for each model. Let ζ_EM, ζ_G, ζ_T and ζ_L be four quantities that scale with s*, d and n. These conditions will be verified for specific latent variable models in §F of the appendix.
Condition 4.1 Parameter-Estimation(ζ_EM). We have ‖β̂ − β*‖₁ = O_P(ζ_EM).
Condition 4.2 Gradient-Statistical-Error(ζ_G). We have ‖∇₁Q_n(β*; β*) − ∇₁Q(β*; β*)‖_∞ = O_P(ζ_G).
Condition 4.3 T_n(·)-Concentration(ζ_T). We have ‖T_n(β*) − E_{β*}[T_n(β*)]‖_{∞,∞} = O_P(ζ_T).
Condition 4.4 T_n(·)-Lipschitz(ζ_L). For any β, we have ‖T_n(β) − T_n(β*)‖_{∞,∞} = O_P(ζ_L · ‖β − β*‖₁).
In the sequel, we lay out an assumption on several population quantities and the sample size n.
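Before turning to the inference assumptions, the estimation theory of §3.1 can be illustrated end to end. The sketch below (ours, not the paper's experiments) runs the truncated EM iteration of Algorithm 1 on the sparse Gaussian mixture model with σ = 1, using the closed-form exact M-step, an oracle-style initialization near β* and ŝ = 2s*; all of these choices are illustrative assumptions.

```python
import numpy as np

def trunc_top_s(beta, s_hat):
    """T-step of Algorithm 1: keep the s_hat largest-magnitude entries."""
    out = np.zeros_like(beta)
    keep = np.argsort(np.abs(beta))[-s_hat:]
    out[keep] = beta[keep]
    return out

rng = np.random.default_rng(0)
d, n, s_star = 500, 400, 5                        # high dimensional: d > n
beta_star = np.zeros(d)
beta_star[:s_star] = 2.0
z = rng.choice([-1.0, 1.0], size=n)
y = z[:, None] * beta_star + rng.standard_normal((n, d))

def m_step(beta):                                 # exact M-step for sigma = 1
    return (np.tanh(y @ beta)[:, None] * y).mean(axis=0)

beta_init = beta_star + 0.1 * rng.standard_normal(d)  # oracle-style initialization
beta_t = trunc_top_s(beta_init, 2 * s_star)
beta_f = beta_init.copy()
for _ in range(30):
    beta_t = trunc_top_s(m_step(beta_t), 2 * s_star)  # EM with the T-step
    beta_f = m_step(beta_f)                           # classical EM, no T-step
err_trunc = np.linalg.norm(beta_t - beta_star)
err_full = np.linalg.norm(beta_f - beta_star)
print(err_trunc, err_full)
```

In this regime the truncated iterate ends up with error on the order of √(s* · log d/n), while the untruncated iterate stalls at the much larger √(d/n) noise floor, matching the decomposition in (3.10).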
Recall that β* = [α*, (γ*)ᵀ]ᵀ, where α* ∈ R is the entry of interest, while γ* ∈ R^{d−1} is the nuisance parameter. By the notations in §2.2, [I(β*)]_{γ,γ} ∈ R^{(d−1)×(d−1)} and [I(β*)]_{γ,α} ∈ R^{(d−1)×1} denote the submatrices of the Fisher information matrix I(β*) ∈ R^{d×d}. We define w*, s*_w and S*_w as

  w* = [I(β*)]⁻¹_{γ,γ} · [I(β*)]_{γ,α} ∈ R^{d−1}, s*_w = ‖w*‖₀, and S*_w = supp(w*).    (4.1)

We define λ₁(I(β*)) and λ_d(I(β*)) as the largest and smallest eigenvalues of I(β*), and

  [I(β*)]_{α|γ} = [I(β*)]_{α,α} − [I(β*)]ᵀ_{γ,α} · [I(β*)]⁻¹_{γ,γ} · [I(β*)]_{γ,α} ∈ R.    (4.2)

According to (4.1) and (4.2), we can easily verify that

  [I(β*)]_{α|γ} = [1, −(w*)ᵀ] · I(β*) · [1, −(w*)ᵀ]ᵀ.    (4.3)

The following assumption ensures that λ_d(I(β*)) > 0. Hence, [I(β*)]_{γ,γ} in (4.1) is invertible. Also, according to (4.3) and the fact that λ_d(I(β*)) > 0, we have [I(β*)]_{α|γ} > 0.

Assumption 4.5. We impose the following assumptions.
• For positive constants ρ_max and ρ_min, we assume

  ρ_max ≥ λ₁(I(β*)) ≥ λ_d(I(β*)) ≥ ρ_min, [I(β*)]_{α|γ} = O(1), [I(β*)]⁻¹_{α|γ} = O(1).    (4.4)

• The tuning parameter of the Dantzig selector in (2.6) is set to

  λ = C · (ζ_T + ζ_L · ζ_EM) · (1 + ‖w*‖₁),    (4.5)

where C ≥ 1 is a sufficiently large constant. The sample size n is sufficiently large such that

  max{‖w*‖₁, 1} · s*_w · λ = o(1), ζ_EM = o(1), s*_w · λ · ζ_G = o(1/√n), λ · ζ_EM = o(1/√n), max{1, ‖w*‖₁} · ζ_L · (ζ_EM)² = o(1/√n).    (4.6)

The assumption on λ_d(I(β*)) guarantees that the Fisher information matrix is positive definite. The other assumptions in (4.4) guarantee the existence of the asymptotic variance of √n · S_n(β̂₀, λ) in the score statistic defined in (2.7). Similar assumptions are standard in existing asymptotic inference results. For example, for mixture of regression model, [14] impose variants of these assumptions. For specific models, we will show that ζ_EM, ζ_G, ζ_T and λ all decrease with n, while ζ_L increases with n at a slow rate. Therefore, the assumptions in (4.6) ensure that the sample size n is sufficiently large. We will make these assumptions more explicit after we specify ζ_EM, ζ_G, ζ_T and ζ_L for each model. Note the assumptions in (4.6) imply that s*_w = ‖w*‖₀ needs to be small. For instance, for λ specified in (4.5), max{‖w*‖₁, 1} · s*_w · λ = o(1) in (4.6) implies s*_w · ζ_T = o(1). In the following, we will prove that ζ_T is of the order √(log d/n). Hence, we require that s*_w = o(√(n/log d)) ≪ d − 1, i.e., w* ∈ R^{d−1} is sparse. Such a sparsity assumption can be understood as follows. According to the definition of w* in (4.1), we have [I(β*)]_{γ,γ} · w* = [I(β*)]_{γ,α}. Therefore, such a sparsity assumption suggests [I(β*)]_{γ,α} lies within the span of a few columns of [I(β*)]_{γ,γ}. Such a sparsity assumption on w* is necessary, because otherwise it is difficult to accurately estimate w* in high dimensional regimes.
In the context of high dimensional generalized linear models, [26, 32] impose similar sparsity assumptions.
4.1 Main Results
Decorrelated Score Test: The next theorem establishes the asymptotic normality of the decorrelated score statistic defined in (2.7).
Theorem 4.6. We consider $\beta^* = [\alpha^*, (\gamma^*)^\top]^\top$ with $\alpha^* = 0$. Under Assumption 4.5 and Conditions 4.1–4.4, we have that for $n \to \infty$,
$$\sqrt{n} \cdot S_n(\widehat{\beta}_0, \lambda) \cdot \big\{[T_n(\widehat{\beta}_0)]_{\alpha|\gamma}\big\}^{-1/2} \xrightarrow{D} N(0, 1), \qquad (4.7)$$
where $\widehat{\beta}_0$ and $[T_n(\widehat{\beta}_0)]_{\alpha|\gamma} \in \mathbb{R}$ are defined in (2.7). The limiting variance of the decorrelated score function $\sqrt{n} \cdot S_n(\widehat{\beta}_0, \lambda)$ is $[I(\beta^*)]_{\alpha|\gamma}$, which is defined in (4.2).
Proof. See §G.2 of the appendix for a detailed proof.
Optimality: [27] proves that for inferring $\alpha^*$ in the presence of the nuisance parameter $\gamma^*$, $[I(\beta^*)]_{\alpha|\gamma}$ is the semiparametric efficient information, i.e., the minimum limiting variance of the (rescaled) score function. Our proposed decorrelated score function attains this semiparametric information lower bound and is therefore optimal in this sense.
In the following, we use the Gaussian mixture model to illustrate the effectiveness of Theorem 4.6. We defer the details and the implications for mixture of regression to §F of the appendix.
Implications for Gaussian Mixture Model: Under the same model considered in §3.1, if we assume all quantities except $s_w^*$, $s^*$, $d$ and $n$ are constant, then Conditions 4.1–4.4 hold with
$$\zeta_{\mathrm{EM}} = s^* \sqrt{\log d \cdot \log n / n}, \quad \zeta_G = \sqrt{\log d / n}, \quad \zeta_T = \sqrt{\log d / n}, \quad \text{and} \quad \zeta_L = (\log d + \log n)^{3/2}.$$
Thus, under Assumption 4.5, (4.7) holds when $n \to \infty$.
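As a sanity check on the algebra behind the limiting variance (this snippet is not part of the paper), the following NumPy code verifies, for a random positive definite stand-in for $I(\beta^*)$, the Schur-complement identities (4.2) and (4.3), together with the classical fact underlying the optimality remark: $[I(\beta^*)]_{\alpha|\gamma} = 1/[I(\beta^*)^{-1}]_{\alpha,\alpha}$, the efficient information for $\alpha^*$.

```python
import numpy as np

rng = np.random.default_rng(1)
d = 6
B = rng.standard_normal((d, d))
I = B @ B.T + d * np.eye(d)      # random positive definite stand-in for I(beta*)

# Partition: alpha is the first coordinate, gamma the remaining d - 1.
I_aa = I[0, 0]
I_ga = I[1:, 0]
I_gg = I[1:, 1:]

# (4.1): w* = I_gg^{-1} I_ga;  (4.2): partial information [I]_{alpha|gamma}.
w = np.linalg.solve(I_gg, I_ga)
I_a_g = I_aa - I_ga @ w

# (4.3): the same quantity as a quadratic form in the vector [1, -(w*)^T].
v = np.concatenate([[1.0], -w])
assert np.isclose(I_a_g, v @ I @ v)

# Efficient-information identity: [I]_{alpha|gamma} = 1 / [I^{-1}]_{alpha,alpha},
# so positive definiteness of I forces [I]_{alpha|gamma} > 0, as claimed.
assert np.isclose(I_a_g, 1.0 / np.linalg.inv(I)[0, 0])
assert I_a_g > 0
print(I_a_g)
```

The last identity makes the positivity claim after (4.3) concrete: whenever $\lambda_d[I(\beta^*)] > 0$, the partial information is the reciprocal of a diagonal entry of a positive definite matrix and hence strictly positive.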
Also, we can verify that (4.6) in Assumption 4.5 holds if $\max\{s_w^*, s^*\}^2 \cdot (s^*)^2 \cdot (\log d)^5 = o\big[n/(\log n)^2\big]$.

5 Conclusion
We propose a novel high dimensional EM algorithm which naturally incorporates sparsity structure. Our theory shows that, with a suitable initialization, the proposed algorithm converges at a geometric rate and achieves an estimator with the (near-)optimal statistical rate of convergence. Beyond point estimation, we further propose the decorrelated score and Wald statistics for testing hypotheses and constructing confidence intervals for low dimensional components of high dimensional parameters. We apply the proposed algorithmic framework to a broad family of high dimensional latent variable models. For these models, our framework establishes the first computationally feasible approach for optimal parameter estimation and asymptotic inference under high dimensional settings.

References
[1] Anandkumar, A., Ge, R., Hsu, D., Kakade, S. M. and Telgarsky, M. (2014). Tensor decompositions for learning latent variable models. Journal of Machine Learning Research 15 2773–2832.
[2] Balakrishnan, S., Wainwright, M. J. and Yu, B. (2014). Statistical guarantees for the EM algorithm: From population to sample-based analysis. arXiv preprint arXiv:1408.2156.
[3] Bartholomew, D. J., Knott, M. and Moustaki, I. (2011). Latent variable models and factor analysis: A unified approach, vol. 899. Wiley.
[4] Belloni, A., Chen, D., Chernozhukov, V. and Hansen, C. (2012). Sparse models and methods for optimal instruments with an application to eminent domain. Econometrica 80 2369–2429.
[5] Bickel, P. J., Ritov, Y. and Tsybakov, A. B. (2009). Simultaneous analysis of Lasso and Dantzig selector.
Annals of Statistics 37 1705–1732.
[6] Boucheron, S., Lugosi, G. and Massart, P. (2013). Concentration inequalities: A nonasymptotic theory of independence. Oxford University Press.
[7] Cai, T., Liu, W. and Luo, X. (2011). A constrained ℓ1 minimization approach to sparse precision matrix estimation. Journal of the American Statistical Association 106 594–607.
[8] Candès, E. and Tao, T. (2007). The Dantzig selector: Statistical estimation when p is much larger than n. Annals of Statistics 35 2313–2351.
[9] Chaganty, A. T. and Liang, P. (2013). Spectral experts for estimating mixtures of linear regressions. arXiv preprint arXiv:1306.3729.
[10] Chaudhuri, K., Dasgupta, S. and Vattani, A. (2009). Learning mixtures of Gaussians using the k-means algorithm. arXiv preprint arXiv:0912.0086.
[11] Dasgupta, S. and Schulman, L. (2007). A probabilistic analysis of EM for mixtures of separated, spherical Gaussians. Journal of Machine Learning Research 8 203–226.
[12] Dempster, A. P., Laird, N. M. and Rubin, D. B. (1977). Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society. Series B (Statistical Methodology) 39 1–38.
[13] Javanmard, A. and Montanari, A. (2014). Confidence intervals and hypothesis testing for high-dimensional regression. Journal of Machine Learning Research 15 2869–2909.
[14] Khalili, A. and Chen, J. (2007). Variable selection in finite mixture of regression models. Journal of the American Statistical Association 102 1025–1038.
[15] Knight, K. and Fu, W. (2000). Asymptotics for Lasso-type estimators. Annals of Statistics 28 1356–1378.
[16] Lee, J. D., Sun, D. L., Sun, Y. and Taylor, J. E. (2013).
Exact inference after model selection via the Lasso. arXiv preprint arXiv:1311.6238.
[17] Lockhart, R., Taylor, J., Tibshirani, R. J. and Tibshirani, R. (2014). A significance test for the Lasso. Annals of Statistics 42 413–468.
[18] McLachlan, G. and Krishnan, T. (2007). The EM algorithm and extensions, vol. 382. Wiley.
[19] Meinshausen, N. and Bühlmann, P. (2010). Stability selection. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 72 417–473.
[20] Meinshausen, N., Meier, L. and Bühlmann, P. (2009). p-values for high-dimensional regression. Journal of the American Statistical Association 104 1671–1681.
[21] Nesterov, Y. (2004). Introductory lectures on convex optimization: A basic course, vol. 87. Springer.
[22] Nickl, R. and van de Geer, S. (2013). Confidence sets in sparse regression. Annals of Statistics 41 2852–2876.
[23] Städler, N., Bühlmann, P. and van de Geer, S. (2010). ℓ1-penalization for mixture regression models. TEST 19 209–256.
[24] Taylor, J., Lockhart, R., Tibshirani, R. J. and Tibshirani, R. (2014). Post-selection adaptive inference for least angle regression and the Lasso. arXiv preprint arXiv:1401.3889.
[25] Tseng, P. (2004). An analysis of the EM algorithm and entropy-like proximal point methods. Mathematics of Operations Research 29 27–44.
[26] van de Geer, S., Bühlmann, P., Ritov, Y. and Dezeure, R. (2014). On asymptotically optimal confidence regions and tests for high-dimensional models. Annals of Statistics 42 1166–1202.
[27] van der Vaart, A. W. (2000). Asymptotic statistics, vol. 3. Cambridge University Press.
[28] Vershynin, R.
(2010). Introduction to the non-asymptotic analysis of random matrices. arXiv preprint arXiv:1011.3027.
[29] Wasserman, L. and Roeder, K. (2009). High-dimensional variable selection. Annals of Statistics 37 2178–2201.
[30] Wu, C. F. J. (1983). On the convergence properties of the EM algorithm. Annals of Statistics 11 95–103.
[31] Yi, X., Caramanis, C. and Sanghavi, S. (2013). Alternating minimization for mixed linear regression. arXiv preprint arXiv:1310.3745.
[32] Zhang, C.-H. and Zhang, S. S. (2014). Confidence intervals for low dimensional parameters in high dimensional linear models. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 76 217–242.