{"title": "Estimating High-dimensional Non-Gaussian Multiple Index Models via Stein\u2019s Lemma", "book": "Advances in Neural Information Processing Systems", "page_first": 6097, "page_last": 6106, "abstract": "We consider estimating the parametric components of semiparametric multi-index models in high dimensions. To bypass the requirements of Gaussianity or elliptical symmetry of covariates in existing methods, we propose to leverage a second-order Stein\u2019s method with score function-based corrections. We prove that our estimator achieves a near-optimal statistical rate of convergence even when the score function or the response variable is heavy-tailed. To establish the key concentration results, we develop a data-driven truncation argument that may be of independent interest. We supplement our theoretical findings with simulations.", "full_text": "Learning Non-Gaussian Multi-Index Model via\n\nSecond-Order Stein\u2019s Method\n\nZhuoran Yang\u21e4 Krishna Balasubramanian\u21e4 Zhaoran Wang\u2020 Han Liu\u2020\n\nAbstract\n\nWe consider estimating the parametric components of semiparametric multi-index\nmodels in high dimensions. To bypass the requirements of Gaussianity or elliptical\nsymmetry of covariates in existing methods, we propose to leverage a second-order\nStein\u2019s method with score function-based corrections. We prove that our estimator\nachieves a near-optimal statistical rate of convergence even when the score function\nor the response variable is heavy-tailed. To establish the key concentration results,\nwe develop a data-driven truncation argument that may be of independent interest.\nWe supplement our theoretical \ufb01ndings with simulations.\n\nIntroduction\n\n1\nWe consider the semiparametric index model that relates the response Y 2 R and the covariate\nX 2 Rd as Y = f (h\u21e41 , Xi, . . . ,h\u21e4k, Xi) + \u270f, where each coef\ufb01cient \u21e4` 2 Rd (` 2 [k]) is s\u21e4-\nsparse and the noise term \u270f is zero-mean. 
Such a model is known as sparse multiple index model\n(MIM). Given n i.i.d. observations {Xi, Yi}n\ni=1 of the above model with possibly d  n, we aim\nto estimate the parametric component {\u21e4`}`2[k] when the nonparametric component f is unknown.\nMore importantly, we do not impose the assumption that X is Gaussian, which is commonly made in\nthe literature. Special cases of our model include phase retrieval, for which k = 1, and dimensionality\nreduction, for which k  1. Motivated by these applications, we make a distinction between the cases\nof k = 1, which is also known as single index model (SIM), and k > 1 in the rest of the paper.\nEstimating the parametric component {\u21e4`}`2[k] without knowing the exact form of the link function\nf naturally arises in various applications. For example, in one-bit compressed sensing [3, 39] and\nsparse generalized linear models [36], we are interested in recovering the underlying signal vector\nbased on nonlinear measurements. In suf\ufb01cient dimensionality reduction, where k is typically a \ufb01xed\nnumber greater than one but much less than d, we aim to estimate the projection onto the subspace\nspanned by {\u21e4`}`2[k] without knowing f. Furthermore, in deep neural networks, which are cascades\nof MIMs, the nonparametric component corresponds to the activation function, which is prespeci\ufb01ed,\nand the goal is to estimate the linear parametric component, which is used for prediction at the test\nstage. Hence, it is crucial to develop estimators for the parametric component with both statistical\naccuracy and computational ef\ufb01ciency for a broad class of possibly unknown link functions.\nChallenging aspects of index models: Several subtle issues arise from the optimal estimation of\nSIM and MIM. In speci\ufb01c, most existing results depend crucially on restrictive assumptions on X and\nf, and fail to hold when those assumptions are relaxed. Such issues arise even in the low-dimensional\nsetting with n  d. 
Let us consider, for example, the case of k = 1 with a known link function\nf(z) = z^2. This corresponds to phase retrieval, which is a challenging inverse problem that has\nregained interest in the last few years along with the success of compressed sensing. A straightforward\nway to estimate \u03b2* is via nonlinear least squares regression [17], which is a nonconvex optimization\nproblem. [6] propose an estimator based on convex relaxation.\n\n\u21e4Princeton University, email: {zy6, kb18}@princeton.edu\n\u2020Tencent AI Lab & Northwestern University, email: {zhaoranwang, hanliu.cmu}@gmail.com\n\n31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.\n\nFigure 1: Histogram of the score function based\non 10000 independent realizations of the Gamma\ndistribution with shape parameter 5 and scale parameter 0.2. The dark solid histogram concentrated\naround zero corresponds to the Gamma distribution, and the transparent histogram corresponds to\nthe distribution of the score function of the same\nGamma distribution.\n\nAlthough their estimator is optimal when X is sub-Gaussian, it is not agnostic to the link function, i.e., the same result does not hold\nif the link function is not quadratic. Direct optimization of the nonconvex phase retrieval problem\nis considered by [5] and [30], which propose statistically optimal estimators based on iterative\nalgorithms. However, they rely on the assumption that X is Gaussian. A careful look at their proofs\nshows that extending them to a broader class of distributions is significantly more challenging; for\nexample, they require sharp concentration inequalities for polynomials of degree four of X, which\nleads to a suboptimal statistical rate when X is sub-Gaussian. Furthermore, their results are also not\nagnostic to the link function. 
Similar observations could be made for both convex [21] and nonconvex\nestimators [4] for sparse phase retrieval in high dimensions.\nIn addition, a surprising result for SIM is established in [28]. They show that when X is Gaussian,\neven when the link function is unknown, one could estimate \u21e4 at the optimal statistical rate with\nLasso. Unfortunately, their assumptions on the link function are rather restrictive, which rule out\nseveral interesting models including phase retrieval. Furthermore, none of the above approaches are\napplicable to MIM. A line of work pioneered by Ker-Chau Li [18\u201320] focuses on the estimation of\nMIM in low dimensions. We will provide a discussion about this line of work in the related work\nsection, but it again requires restrictive assumptions on either the link function or the distribution of X.\nFor example, in most cases X is assumed to be elliptically symmetric, which limits the applicability.\nTo summarize, there are several subtleties that arise from the interplay between the assumptions on X\nand f in SIM and MIM. An interesting question is whether it is possible to estimate the parametric\ncomponent in SIM and MIM with milder assumptions on both X and f in the high-dimensional\nsetting. In this work, we provide a partial answer to this question. We construct estimators that work\nfor a broad class of link functions, including the quadratic link function in phase retrieval, and for a\nlarge family of distributions of X, which are assumed to be known a priori. We particularly focus on\nthe case where X follows a non-Gaussian distribution, which is not necessarily elliptically symmetric\nor sub-Gaussian, therefore making our method applicable to various situations that are not feasible\npreviously. Our estimators are based on a second-order variant of Stein\u2019s identity for non-Gaussian\nrandom variables, which utilizes the score function of the distribution of X. 
As we show in Figure 1,\neven when the distribution of X is light-tailed, the distribution of the score function of X could be\narbitrarily heavy-tailed. In order to develop consistent estimators within this context, we threshold\nthe score function in a data-driven fashion. This enables us to obtain tight concentration bounds\nthat lead to near-optimal statistical rates of convergence. Moreover, our results also shed light on\ntwo related problems. First, we provide an alternative interpretation of the initialization in [5] for\nphase retrieval. Second, our estimators are constructed based on a sparsity constrained semidefinite\nprogramming (SDP) formulation, which is related to a similar formulation of the sparse principal\ncomponent analysis (PCA) problem (see Section 4 for a detailed discussion). A consequence of\nour results for SIM and MIM is a near-optimal statistical rate of convergence for sparse PCA with\nheavy-tailed data in the moderate sample size regime. In summary, our contributions are as follows:\n\u2022 We construct estimators for the parametric component of high-dimensional SIM and MIM\nfor a class of unknown link functions under the assumption that the covariate distribution is\nnon-Gaussian but known a priori.\n\u2022 We establish near-optimal statistical rates for our estimators. Our results complement existing\nones in the literature and hold in several cases that were previously not feasible.\n\u2022 We provide numerical simulations that confirm our theoretical results.\n\nRelated work: There is a significant body of work on SIMs in the low-dimensional setting. We do not\nattempt to cover all of them as we concentrate on the high-dimensional setting. The success of Lasso\nand related regression estimators in high dimensions enables the exploration of high-dimensional\nSIMs, although this is still very much work in progress. 
As mentioned previously, [25, 26, 28] show\nthat Lasso and phase retrieval estimators could also work for SIM in high dimensions assuming the\ncovariate is Gaussian and the link function satis\ufb01es certain properties. Very recently, [10] relax the\nGaussian assumption and show that a modi\ufb01ed Lasso-type estimator works for elliptically symmetric\ndistributions. For the case of monotone link function, [38] analyze a nonconvex least squares estimator\nunder the assumption that the covariate is sub-Gaussian. However, the success of their estimator hinges\non the knowledge of the link function. Furthermore, [15, 23, 31, 32, 40] analyze the sliced inverse\nregression estimator in the high-dimensional setting, focusing primarily on support recovery and\nconsistency properties. The Gaussian assumption on the covariate restricts them from being applicable\nto various real-world applications involving heavy-tailed or non-symmetric covariate, for example,\nproblems in economics [9, 12]. Furthermore, several results are established on a case-by-case basis\nfor speci\ufb01c link functions. In speci\ufb01c, [1, 3, 8, 39] consider one-bit compressed sensing and matrix\ncompletion respectively, where the link function is assumed to be the sign function. Also, [4] propose\nnonconvex estimators for phase retrieval in high dimensions, where the link function is quadratic. This\nline of work, except [1], makes Gaussian assumptions on the covariate and is specialized for particular\nlink functions. The non-asymptotic result obtained in [1] is under sub-Gaussian assumptions, but the\nestimator therein lacks asymptotic consistency.\nFor MIMs, relatively less work studies the high-dimensional setting. In the low-dimensional setting, a\nline of work on the estimation of MIM is proposed by Ker-Chau Li, including inverse regression [18],\nprincipal Hessian directions [19], and regression under link violation [20]. 
The proposed estimators\nare applicable for a class of unknown link functions under the assumption that the covariate follows\nGaussian or symmetric elliptical distributions. Such an assumption is restrictive as oftentimes the\ncovariate is heavy-tailed or skewed [9, 12]. Furthermore, they concentrate only on the low-dimensional\nsetting and establish asymptotic results. The estimation of high-dimensional MIM under the subspace\nsparsity assumption is previously considered in [7, 32] but also under rather restrictive distribution\nassumptions on the covariate.\nNotation: We employ [n] to denote the set {1, . . . , n}. For a vector v \u2208 \u211d^d, we denote by \u2016v\u2016_p the \u2113_p-norm of v for any p \u2265 1. In addition, we define the support of v \u2208 \u211d^d as supp(v) = {j \u2208 [d] : v_j \u2260 0}.\nWe denote by \u03bb_min(A) the minimum eigenvalue of matrix A. Moreover, we denote the elementwise\n\u2113_1-norm, elementwise \u2113_\u221e-norm, operator norm, and Frobenius norm of a matrix A \u2208 \u211d^{d_1\u00d7d_2} by\n\u2016\u00b7\u2016_1, \u2016\u00b7\u2016_\u221e, \u2016\u00b7\u2016_op, and \u2016\u00b7\u2016_F, correspondingly. We denote by vec(A) the vectorization of matrix\nA, which is a vector in \u211d^{d_1 d_2}. For two matrices A, B \u2208 \u211d^{d_1\u00d7d_2}, we denote the trace inner product\nby \u27e8A, B\u27e9 = Trace(A^\u22a4B). Also note that it could be viewed as the vector inner product between\nvec(A) and vec(B). For a univariate function g : \u211d \u2192 \u211d, we denote by g \u2218 (v) and g \u2218 (A) the output\nof applying g to each element of vector v and matrix A, respectively. Finally, for a random variable\nX \u2208 \u211d with density p, we use p^{\u2297d} : \u211d^d \u2192 \u211d to denote the joint density of X_1, \u00b7\u00b7\u00b7 , X_d, which are d\nidentical copies of X.\n\n2 Models and Assumptions\n\nAs mentioned previously, we consider the cases of k = 1 (SIM) and k > 1 (MIM) separately. We\nfirst discuss the motivation for our estimators, which highlights the assumptions on the link function\nas well. Recall that our estimators are based on the second-order Stein\u2019s identity. 
To begin with, we\npresent the first-order Stein\u2019s identity, which motivates Lasso-type estimators for SIMs [25, 28].\nProposition 2.1 (First-Order Stein\u2019s Identity [29]). Let X \u2208 \u211d^d be a real-valued random vector with\ndensity p. We assume that p : \u211d^d \u2192 \u211d is differentiable. In addition, let g : \u211d^d \u2192 \u211d be a continuous\nfunction such that E[\u2207g(X)] exists. Then it holds that\n\nE[g(X) \u00b7 S(X)] = E[\u2207g(X)],\n\nwhere S(x) = \u2212\u2207p(x)/p(x) is the score function of p.\nOne could apply the above Stein\u2019s identity to SIMs to obtain an estimator of \u03b2*. To see this, note that\nwhen X \u223c N(0, I_d) we have S(x) = x for x \u2208 \u211d^d. In this case, since E(\u03b5 \u00b7 X) = 0, we have\n\nE(Y \u00b7 X) = E[f(\u27e8X, \u03b2*\u27e9) \u00b7 X] = E[f'(\u27e8X, \u03b2*\u27e9)] \u00b7 \u03b2*.\n\nThus, one could estimate \u03b2* by estimating E(Y \u00b7 X). This observation leads to the estimator proposed\nin [25, 28]. However, in order for the estimator to work, it is necessary to assume E[f'(\u27e8X, \u03b2*\u27e9)] \u2260 0.\nSuch a restriction prevents it from being applicable to some widely used cases of SIM, for example,\nphase retrieval in which f is the quadratic function. Such a limitation of the first-order Stein\u2019s identity\nmotivates us to examine the second-order Stein\u2019s identity, which is summarized as follows.\nProposition 2.2 (Second-Order Stein\u2019s Identity [13]). We assume the density of X is twice differentiable. We define the second-order score function T : \u211d^d \u2192 \u211d^{d\u00d7d} as T(x) = \u2207^2 p(x)/p(x). For any\ntwice differentiable function g : \u211d^d \u2192 \u211d such that E[\u2207^2 g(X)] exists, we have\n\nE[g(X) \u00b7 T(X)] = E[\u2207^2 g(X)]. (2.1)\n\nBack to the phase retrieval example, when X \u223c N(0, I_d), the second-order score function is T(x) =\nxx^\u22a4 \u2212 I_d, for x \u2208 \u211d^d. 
Setting g(x) = \u27e8x, \u03b2*\u27e9^2 in (2.1), we have\n\nE[g(X) \u00b7 T(X)] = E[g(X) \u00b7 (XX^\u22a4 \u2212 I_d)] = E[\u27e8X, \u03b2*\u27e9^2 \u00b7 (XX^\u22a4 \u2212 I_d)] = 2\u03b2*\u03b2*^\u22a4. (2.2)\n\nHence for phase retrieval, one could extract \u00b1\u03b2* based on the second-order Stein\u2019s identity even in\nthe situation where the first-order Stein\u2019s identity fails. In fact, (2.2) is implicitly used in [5] to provide\na spectral initialization for the Wirtinger flow algorithm in the case of Gaussian phase retrieval. Here,\nwe establish an alternative justification based on Stein\u2019s identity for why such an initialization works.\nMotivated by this key observation, we propose to employ the second-order Stein\u2019s identity to estimate\nthe parametric component of SIM and MIM with a broad class of unknown link functions as well as\nnon-Gaussian covariates. The precise statistical models we consider are defined as follows.\nDefinition 2.3 (SIM with Second-Order Link). The response Y \u2208 \u211d and the covariate X \u2208 \u211d^d are\nlinked via\n\nY = f(\u27e8X, \u03b2*\u27e9) + \u03b5, (2.3)\n\nwhere f : \u211d \u2192 \u211d is an unknown function, \u03b2* \u2208 \u211d^d is the parameter of interest, and \u03b5 \u2208 \u211d is the\nexogenous noise with E(\u03b5) = 0. We assume the entries of X are i.i.d. random variables with density\np_0 and that \u03b2* is s*-sparse, i.e., \u03b2* contains only s* nonzero entries. Moreover, since the norm of \u03b2*\ncould be absorbed into f, we assume that \u2016\u03b2*\u2016_2 = 1 for identifiability. Finally, we assume that f\nand X satisfy E[f''(\u27e8X, \u03b2*\u27e9)] > 0.\nNote that in Definition 2.3, we assume without any loss of generality that E[f''(\u27e8X, \u03b2*\u27e9)] is positive.\nIf E[f''(\u27e8X, \u03b2*\u27e9)] is negative, one could replace f by \u2212f by flipping the sign of Y. In other words,\nwe essentially only require that E[f''(\u27e8X, \u03b2*\u27e9)] is nonzero. 
Intuitively, such a restriction on f implies\nthat the second-order cross-moments contain the information of \u03b2*. Thus, we name this type of link\nfunctions as the second-order link. Similarly, we define MIM with second-order link.\nDefinition 2.4 (MIM with Second-Order Link). The response Y \u2208 \u211d and the covariate X \u2208 \u211d^d are\nlinked via\n\nY = f(\u27e8X, \u03b2*_1\u27e9, . . . , \u27e8X, \u03b2*_k\u27e9) + \u03b5, (2.4)\n\nwhere f : \u211d^k \u2192 \u211d is an unknown link function, {\u03b2*_\u2113}_{\u2113\u2208[k]} \u2286 \u211d^d are the parameters of interest, and\n\u03b5 \u2208 \u211d is the exogenous random noise that satisfies E(\u03b5) = 0. In addition, we assume that the entries\nof X are i.i.d. random variables with density p_0 and that {\u03b2*_\u2113}_{\u2113\u2208[k]} span a k-dimensional subspace of\n\u211d^d. Let B* = (\u03b2*_1 . . . \u03b2*_k) \u2208 \u211d^{d\u00d7k}. The model in (2.4) could be reformulated as Y = f(XB*) + \u03b5.\nBy QR-factorization, we could write B* as Q*R*, where Q* \u2208 \u211d^{d\u00d7k} is an orthonormal matrix and\nR* \u2208 \u211d^{k\u00d7k} is invertible. Since f is unknown, R* could be absorbed into the link function. Thus, we\nassume that B* is orthonormal for identifiability. We further assume that B* is s*-row sparse, that is,\nB* contains only s* nonzero rows. Note that this definition of row sparsity does not depend on the\nchoice of coordinate system. Finally, we assume that f and X satisfy \u03bb_min(E[\u2207^2 f(XB*)]) > 0.\nIn Definition 2.4, the assumption that E[\u2207^2 f(XB*)] is positive definite is a multivariate generalization of the condition that E[f''(\u27e8X, \u03b2*\u27e9)] > 0 for SIM in Definition 2.3. It essentially guarantees that\nestimating the projector of the subspace spanned by {\u03b2*_\u2113}_{\u2113\u2208[k]} is information-theoretically feasible.\n\n3 Estimation Method and Main Results\n\nWe now introduce our estimators and establish their statistical rates of convergence. 
Discussion on\nthe optimality of the established rates and the connection to sparse PCA is deferred to \u00a74. Recall that\nwe focus on the case in which X has i.i.d. entries with density p_0 : \u211d \u2192 \u211d. Hence, the joint density\nof X is p(x) = p_0^{\u2297d}(x) = \u220f_{j=1}^d p_0(x_j). For notational simplicity, let s_0(u) = \u2212p_0'(u)/p_0(u). Then\nthe first-order score function associated with p is S(x) = s_0 \u2218 (x). Equivalently, the j-th entry of the\nfirst-order score function associated with p is given by [S(x)]_j = s_0(x_j). Moreover, the second-order\nscore function is\n\nT(x) = S(x)S(x)^\u22a4 \u2212 \u2207S(x) = S(x)S(x)^\u22a4 \u2212 diag[s_0' \u2218 (x)]. (3.1)\n\nBefore we present our estimator, we introduce the assumption on Y and s_0(\u00b7).\nAssumption 3.1 (Bounded Moment). We assume there exists a constant M such that E_{p_0}[s_0(U)^6] \u2264\nM and E(Y^6) \u2264 M. We denote \u03c3_0^2 = E_{p_0}[s_0(U)^2] = Var_{p_0}[s_0(U)].\nThe assumption that E_{p_0}[s_0(U)^6] \u2264 M allows for a broad family of distributions including Gaussian\nand more heavy-tailed random variables. Furthermore, we do not require the covariate to be elliptically\nsymmetric as is commonly required in existing methods, which enables our estimator to be applicable\nfor skewed covariates. As for the assumption that E(Y^6) \u2264 M, note that in the case of SIM we have\n\nE(Y^6) \u2264 C (E(\u03b5^6) + E[f^6(\u27e8X, \u03b2*\u27e9)]).\n\nThus this assumption is satisfied as long as both \u03b5 and f(\u27e8X, \u03b2*\u27e9) have bounded sixth moments. This\nis a mild assumption that allows for heavy-tailed response. Now we are ready to present our estimator\nfor the sparse SIM in Definition 2.3. Recall that by Proposition 2.2 we have\n\nE[Y \u00b7 T(X)] = C_0 \u00b7 \u03b2*\u03b2*^\u22a4, (3.2)\n\nwhere C_0 = 2E[f''(\u27e8X, \u03b2*\u27e9)] > 0 as in Definition 2.3. Hence, one way to estimate \u03b2* is to obtain\nthe leading eigenvector of the sample version of E[Y \u00b7 T(X)]. 
Moreover, as \u03b2* is sparse, we formulate\nour estimator as a semidefinite program\n\nmaximize \u27e8W, \u03a3\u0303\u27e9 \u2212 \u03bb\u2016W\u2016_1\nsubject to 0 \u2aaf W \u2aaf I_d, Trace(W) = 1. (3.3)\n\nHere \u03a3\u0303 is an estimator of \u03a3* = E[Y \u00b7 T(X)], which is defined as follows. Note that both the score\nT(X) and the response variable Y could be heavy-tailed. In order to obtain near-optimal estimates in\nthe finite-sample setting, we apply the truncation technique to handle the heavy tails. Specifically, for\na positive threshold parameter \u03c4 \u2208 \u211d, we define the truncated random variables by\n\nY\u0303_i = sign(Y_i) \u00b7 min{|Y_i|, \u03c4} and [T\u0303(X_i)]_{jk} = sign{T_{jk}(X_i)} \u00b7 min{|T_{jk}(X_i)|, \u03c4^2}. (3.4)\n\nThen we define the robust estimator of \u03a3* as\n\n\u03a3\u0303 = (1/n) \u2211_{i=1}^n Y\u0303_i \u00b7 T\u0303(X_i). (3.5)\n\nWe denote by \u0174 the solution of the convex optimization problem in (3.3), where \u03bb is a regularization\nparameter to be specified later. The final estimator \u03b2\u0302 is defined as the leading eigenvector of \u0174. The\nfollowing theorem quantifies the statistical rates of convergence of the proposed estimator.\nTheorem 3.2. Let \u03bb = 10\u221a(M log d/n) in (3.3) and \u03c4 = (1.5Mn/log d)^{1/6} in (3.4). Then under\nAssumption 3.1, we have \u2016\u03b2\u0302 \u2212 \u03b2*\u2016_2 \u2264 (4\u221a2/C_0) \u00b7 s*\u03bb with probability at least 1 \u2212 d^{\u22122}.\nNow we introduce the estimator of B* for the sparse MIM in Definition 2.4. Proposition 2.2 implies\nthat E[Y \u00b7 T(X)] = B*D_0B*^\u22a4, where D_0 = E[\u2207^2 f(XB*)] is positive definite. Similar to (3.3), we\nrecover the column space of B* by solving\n\nmaximize \u27e8W, \u03a3\u0303\u27e9 \u2212 \u03bb\u2016W\u2016_1\nsubject to 0 \u2aaf W \u2aaf I_d, Trace(W) = k, (3.6)\n\nwhere \u03a3\u0303 is defined in (3.5), \u03bb > 0 is a regularization parameter, and k is the number of indices, which\nis assumed to be known. 
Let \u0174 be the solution of (3.6), and let the final estimator B\u0302 contain the\ntop k leading eigenvectors of \u0174 as columns. For such an estimator, we have the following theorem\nquantifying its statistical rate of convergence. Let \u03c1_0 = \u03bb_min(E[\u2207^2 f(XB*)]).\nTheorem 3.3. Let \u03bb = 10\u221a(M log d/n) in (3.6) and \u03c4 = (1.5Mn/log d)^{1/6} in (3.4). Then under\nAssumption 3.1, with probability at least 1 \u2212 d^{\u22122}, we have\n\ninf_{O \u2208 O_k} \u2016B\u0302 \u2212 B*O\u2016_F \u2264 (4\u221a2/\u03c1_0) \u00b7 s*\u03bb,\n\nwhere O_k \u2286 \u211d^{k\u00d7k} is the set of all possible rotation matrices.\nMinimax lower bounds for subspace estimation for MIM are established in [22]. For k being fixed,\nTheorem 3.3 is near-optimal from a minimax point of view. The difference between the optimal rate\nand the above theorem is roughly a factor of \u221as*. We will discuss more about this gap in Section 4.\nThe proofs of Theorem 3.2 and Theorem 3.3 are provided in the supplementary material.\nRemark 3.4. Recall that our discussion above is under the assumption that the entries of X are i.i.d.,\nwhich could be relaxed to the case of weak dependence between the covariates without any significant\nloss in the statistical rates presented above. We do not focus on this extension in this paper as we aim\nto clearly convey the main message of the paper in a simpler setting.\n\n4 Optimality and Connection to Sparse PCA\nNow we discuss the optimality of the results presented in \u00a73. Throughout the discussion we assume\nthat k is fixed and does not increase with d and n. The estimators for SIM in (3.3) and MIM in (3.6)\nare closely related to the semidefinite program-based estimator for sparse PCA [33]. Specifically, let\nX \u2208 \u211d^d be a random vector with E(X) = 0 and covariance \u03a3 = E(XX^\u22a4), which is symmetric and\npositive definite. 
The goal of sparse PCA is to estimate the projector onto the subspace spanned by\nthe top k eigenvectors, namely {v*_\u2113}_{\u2113\u2208[k]}, of \u03a3, under the subspace sparsity assumption as specified\nin Definition 2.4. An estimator based on semidefinite programming is introduced in [33, 34], which is\nbased on solving\n\nmaximize \u27e8W, \u03a3\u0302\u27e9 \u2212 \u03bb\u2016W\u2016_1\nsubject to 0 \u2aaf W \u2aaf I_d, Trace(W) = k. (4.1)\n\nHere \u03a3\u0302 = n^{\u22121} \u2211_{i=1}^n X_i X_i^\u22a4 is the sample covariance matrix given n i.i.d. observations {X_i}_{i=1}^n of\nX. Note that the main difference between the SIM estimator and the sparse PCA estimator is the use of\n\u03a3\u0303 in place of \u03a3\u0302. It is known that the sparse PCA problem exhibits an interesting statistical-computational\ntradeoff [16, 34, 35], which naturally appears in the context of SIM as well. In particular, while the\noptimal statistical rate for sparse PCA is O(\u221a(s* log d/n)), the SDP-based estimator could only attain\nO(s*\u221a(log d/n)) under the assumption that X is light-tailed. It is known that when n = \u03a9(s*^2 log d),\none could obtain the optimal statistical rate of O(\u221a(s* log d/n)) by a nonconvex method [37]. However,\ntheir results rely on the sharp concentration of \u03a3\u0302 to \u03a3 in the restricted operator norm:\n\n\u2016\u03a3\u0302 \u2212 \u03a3\u2016_{op,s} = sup{w^\u22a4(\u03a3\u0302 \u2212 \u03a3)w : \u2016w\u2016_2 = 1, \u2016w\u2016_0 \u2264 s} = O(\u221a(s log d/n)). (4.2)\n\nWhen X has heavy-tailed entries, for example, with bounded fourth moment, it is highly unlikely that\n(4.2) holds.\nHeavy-tailed sparse PCA: Recall that our estimators leverage a data-driven truncation argument to\nhandle heavy-tailed distributions. Owing to the close relationship between our SIM/MIM estimators\nand the sparse PCA estimator, it is natural to ask whether such a truncation argument could lead to a\nsparse PCA estimator for heavy-tailed X. Below we show it is indeed possible to obtain a near-optimal\nestimator for heavy-tailed sparse PCA based on the truncation technique. 
For a vector v \u2208 \u211d^d, let \u03d1(v)\nbe a truncation operator that works entrywise as [\u03d1(v)]_j = sign[v_j] \u00b7 min{|v_j|, \u03c4} for j \u2208 [d]. Then,\nour estimator is defined as follows,\n\nmaximize \u27e8W, \u03a3\u0304\u27e9 \u2212 \u03bb\u2016W\u2016_1\nsubject to 0 \u2aaf W \u2aaf I_d, Trace(W) = k, (4.3)\n\nwhere \u03a3\u0304 = n^{\u22121} \u2211_{i=1}^n X\u0304_i X\u0304_i^\u22a4 and X\u0304_i = \u03d1(X_i), for i = 1, . . . , n. For the above estimator, we have the\nfollowing theorem under the assumption that X has heavy-tailed marginals. Let V* = (v*_1 . . . v*_k) \u2208\n\u211d^{d\u00d7k} and we assume that \u03c1_0 = \u03bb_k(\u03a3) \u2212 \u03bb_{k+1}(\u03a3) > 0.\nTheorem 4.1. Let \u0174 be the solution of the optimization in (4.3) and let V\u0302 \u2208 \u211d^{d\u00d7k} contain the\ntop k leading eigenvectors of \u0174. Also, we set the regularization parameter in (4.3) to be \u03bb =\nC_1\u221a(M log d/n) and the truncation parameter to be \u03c4 = (C_2 M n/log d)^{1/4}, where C_1 and C_2 are\nsome positive constants. Moreover, we assume that V* contains only s* nonzero rows and that X\nsatisfies E|X_j|^4 \u2264 M and E|X_i \u00b7 X_j|^2 \u2264 M. Then, with probability at least 1 \u2212 d^{\u22122}, we have\n\ninf_{O \u2208 O_k} \u2016V\u0302 \u2212 V*O\u2016_F \u2264 (4\u221a2/\u03c1_0) \u00b7 s*\u03bb,\n\nwhere O_k \u2286 \u211d^{k\u00d7k} is the set of all possible rotation matrices.\nThe proof of the above theorem is identical to that of Theorem 3.3 and thus we omit it. The above\ntheorem shows that with elementwise truncation, as long as X satisfies a bounded fourth moment condition, the SDP estimator for sparse PCA achieves the near-optimal statistical rate of O(s*\u221a(log d/n)).\nWe end this section with the following questions based on the above discussions:\n
1. Could we obtain the minimax optimal statistical rate O(\u221a(s log d/n)) for sparse PCA in the\nhigh sample size regime with n = \u03a9(s*^2 log d) if X has only bounded fourth moment?\n2. Could we obtain the minimax optimal statistical rate O(\u221a(s log d/n)) given n = \u03a9(s*^2 log d)\nwhen f, X, and Y satisfy the bounded moment condition in Assumption 3.1 for MIM?\n\nThe answers to both questions lie in constructing truncation-based estimators that concentrate sharply\nin the restricted operator norm defined in (4.2) or more realistically exhibit one-sided concentration\nbounds (see, e.g., [24] and [27] for related results and discussion). Obtaining such an estimator seems\nto be challenging for heavy-tailed sparse PCA and it is not immediately clear if this is even possible.\nWe plan to report our findings for the above problem in the near future.\n\n5 Experimental Results\n\nIn this section, we evaluate the finite-sample error of the proposed estimators on simulated data. We\nconcentrate on the case of sparse phase retrieval. Recall that in this case, the link function is known\nand existing convex and non-convex estimators are applicable predominantly for the case of Gaussian\nor light-tailed data. The question of what are the necessary assumptions on the measurement vectors\nfor (sparse) phase retrieval to work is an intriguing one [11]. In the sequel, we demonstrate that using\nthe proposed score-based estimators, one could use heavy-tailed and skewed measurements as well,\nwhich significantly extends the class of measurement vectors applicable for sparse phase retrieval.\nRecall that the covariate X has i.i.d. entries with distribution p_0. Throughout this section, we set p_0\nto be the Gamma distribution with shape parameter 5 and scale parameter 1 or the Rayleigh distribution\nwith scale parameter 2. The random noise \u03b5 is set to be standard Gaussian. 
Moreover, we solve the\noptimization problems in (3.3) and (3.6) via the alternating direction method of multipliers (ADMM)\nalgorithm, which introduces a dual variable to handle the constraints and updates the primal and dual\nvariables iteratively.\nFor SIM, let the link functions be f_1(u) = u^2, f_2(u) = |u|, and f_3(u) = 4u^2 + 3 cos(u), correspondingly.\nHere f_1 corresponds to the phase retrieval model and f_2 and f_3 could be viewed as its robust extensions.\nThroughout the experiment we vary n and fix d = 500 and s* = 5. Also, the support of \u03b2* is chosen\nuniformly at random from all the possible subsets of [d] with cardinality s*. For each j \u2208 supp(\u03b2*),\nwe set \u03b2*_j to 1/\u221as* times an independent Rademacher random variable. Furthermore, we fix the\nregularization parameter \u03bb = 4\u221a(log d/n) and threshold parameter \u03c4 = 20. In addition, we adopt the\ncosine distance cos \u2220(\u03b2\u0302, \u03b2*) = 1 \u2212 |\u27e8\u03b2\u0302, \u03b2*\u27e9| to measure the estimation error. We plot the cosine\ndistance against the theoretical statistical rate of convergence s*\u221a(log d/n) in Figure 2-(a)-(c) for each\nlink function, respectively. The plot is based on 100 independent trials for each n. 
It shows that the\nestimation error is bounded by a linear function of s*\u221a(log d/n), which corroborates the theory.\n\nFigure 2: Cosine distances between the true parameter \u03b2* and the estimated parameter \u03b2\u0302 in the sparse\nSIM with the link function in one of f_1, f_2, and f_3 (panels: f_1(u) = u^2, f_2(u) = |u|, f_3(u) = 4u^2 + 3 cos(u)). Here we set d = 500, s* = 5 and vary n.\n\n6 Discussion\n\nIn this work, we study estimating the parametric component of SIM and MIM in high dimensions,\nunder fairly general assumptions on the link function f and response Y. Furthermore, our estimators\nare applicable in the non-Gaussian setting in which X is not required to satisfy restrictive Gaussian\nor elliptical symmetry assumptions. Our estimators are based on a data-driven truncation technique in\ncombination with a second-order Stein\u2019s identity.\nIn the low-dimensional setting, for two-layer neural networks, [14] propose a tensor-based method for\nestimating the parametric component. Their estimators are sub-optimal even assuming X is Gaussian.\nAn immediate application of our truncation-based estimators enables us to obtain optimal results for\na fairly general class of covariate distributions in the low-dimensional setting. Obtaining optimal or\nnear-optimal results in the high-dimensional setting is of great interest for two-layer neural networks,\nalbeit challenging. We plan to extend the results of the current paper to two-layer neural networks in\nhigh dimensions and report our findings in the near future.\n\nReferences\n[1] Albert Ai, Alex Lapanowski, Yaniv Plan, and Roman Vershynin. One-bit compressed sensing\nwith non-Gaussian measurements. Linear Algebra and its Applications, 2014.\n\n[2] St\u00e9phane Boucheron, G\u00e1bor Lugosi, and Pascal Massart. Concentration inequalities: A\nnonasymptotic theory of independence. Oxford University Press, 2013.\n\n[3] Petros T Boufounos and Richard G Baraniuk. 1-bit compressive sensing. In Annual Conference\non Information Sciences and Systems, pages 16\u201321. IEEE, 2008.\n\n[4] T Tony Cai, Xiaodong Li, and Zongming Ma. Optimal rates of convergence for noisy sparse\nphase retrieval via thresholded Wirtinger flow. 
The Annals of Statistics, 44(5):2221\u20132251, 2016.\n\n[5] Emmanuel J Candes, Xiaodong Li, and Mahdi Soltanolkotabi. Phase retrieval via Wirtinger flow: Theory and algorithms. IEEE Transactions on Information Theory, 61(4):1985\u20132007, 2015.\n\n[6] Emmanuel J Candes, Thomas Strohmer, and Vladislav Voroninski. Phaselift: Exact and stable signal recovery from magnitude measurements via convex programming. Communications on Pure and Applied Mathematics, 66(8):1241\u20131274, 2013.\n\n[7] Xin Chen, Changliang Zou, and Dennis Cook. Coordinate-independent sparse sufficient dimension reduction and variable selection. The Annals of Statistics, 38(6):3696\u20133723, 2010.\n\n[8] Mark A Davenport, Yaniv Plan, Ewout van den Berg, and Mary Wootters. 1-bit matrix completion. Information and Inference, 3(3):189\u2013223, 2014.\n\n[9] Jianqing Fan, Jinchi Lv, and Lei Qi. Sparse high-dimensional models in economics. Annual Review of Economics, 3(1):291\u2013317, 2011.\n\n[10] Larry Goldstein, Stanislav Minsker, and Xiaohan Wei. Structured signal recovery from non-linear and heavy-tailed measurements. arXiv preprint arXiv:1609.01025, 2016.\n\n[11] David Gross, Felix Krahmer, and Richard Kueng. A partial derandomization of phaselift using spherical designs. Journal of Fourier Analysis and Applications, 2015.\n\n[12] Joel L Horowitz. Semiparametric and nonparametric methods in econometrics, volume 12. Springer, 2009.\n\n[13] Majid Janzamin, Hanie Sedghi, and Anima Anandkumar. Score function features for discriminative learning: Matrix and tensor framework. arXiv preprint arXiv:1412.2863, 2014.\n\n[14] Majid Janzamin, Hanie Sedghi, and Anima Anandkumar. Beating the perils of non-convexity: Guaranteed training of neural networks using tensor methods. arXiv preprint arXiv:1506.08473, 2015.\n\n[15] Bo Jiang and Jun S Liu. 
Variable selection for general index models via sliced inverse regression. The Annals of Statistics, 42(5):1751\u20131786, 2014.\n\n[16] Robert Krauthgamer, Boaz Nadler, and Dan Vilenchik. Do semidefinite relaxations solve sparse PCA up to the information limit? The Annals of Statistics, 43(3):1300\u20131322, 2015.\n\n[17] Guillaume Lecu\u00e9 and Shahar Mendelson. Minimax rate of convergence and the performance of empirical risk minimization in phase retrieval. Electronic Journal of Probability, 20(57):1\u201329, 2015.\n\n[18] Ker-Chau Li. Sliced inverse regression for dimension reduction. Journal of the American Statistical Association, 86(414):316\u2013327, 1991.\n\n[19] Ker-Chau Li. On principal Hessian directions for data visualization and dimension reduction: Another application of Stein\u2019s lemma. Journal of the American Statistical Association, 87(420):1025\u20131039, 1992.\n\n[20] Ker-Chau Li and Naihua Duan. Regression analysis under link violation. The Annals of Statistics, 17(3):1009\u20131052, 1989.\n\n[21] Xiaodong Li and Vladislav Voroninski. Sparse signal recovery from quadratic measurements via convex programming. SIAM Journal on Mathematical Analysis, 45(5):3019\u20133033, 2013.\n\n[22] Qian Lin, Xinran Li, Dongming Huang, and Jun S Liu. On the optimality of sliced inverse regression in high dimensions. arXiv preprint arXiv:1701.06009, 2017.\n\n[23] Qian Lin, Zhigen Zhao, and Jun S Liu. On consistency and sparsity for sliced inverse regression in high dimensions. arXiv preprint arXiv:1507.03895, 2015.\n\n[24] Shahar Mendelson. Learning without concentration. In Conference on Learning Theory, pages 25\u201339, 2014.\n\n[25] Matey Neykov, Jun S Liu, and Tianxi Cai. \u21131-regularized least squares for support recovery of high dimensional single index models with Gaussian designs. Journal of Machine Learning Research, 17(87):1\u201337, 2016.\n\n[26] Matey Neykov, Zhaoran Wang, and Han Liu. 
Agnostic estimation for misspecified phase retrieval models. In Advances in Neural Information Processing Systems, pages 4089\u20134097, 2016.\n\n[27] Roberto Imbuzeiro Oliveira. The lower tail of random quadratic forms, with applications to ordinary least squares and restricted eigenvalue properties. arXiv preprint arXiv:1312.2903, 2013.\n\n[28] Yaniv Plan and Roman Vershynin. The generalized lasso with non-linear observations. IEEE Transactions on Information Theory, 62(3):1528\u20131537, 2016.\n\n[29] Charles Stein, Persi Diaconis, Susan Holmes, and Gesine Reinert. Use of exchangeable pairs in the analysis of simulations. In Stein\u2019s Method. Institute of Mathematical Statistics, 2004.\n\n[30] Ju Sun, Qing Qu, and John Wright. A geometric analysis of phase retrieval. arXiv preprint arXiv:1602.06664, 2016.\n\n[31] Kean Ming Tan, Zhaoran Wang, Han Liu, and Tong Zhang. Sparse generalized eigenvalue problem: Optimal statistical rates via truncated Rayleigh flow. arXiv preprint arXiv:1604.08697, 2016.\n\n[32] Kean Ming Tan, Zhaoran Wang, Han Liu, Tong Zhang, and Dennis Cook. A convex formulation for high-dimensional sparse sliced inverse regression. manuscript, 2016.\n\n[33] Vincent Q Vu, Juhee Cho, Jing Lei, and Karl Rohe. Fantope projection and selection: A near-optimal convex relaxation of sparse PCA. In Advances in Neural Information Processing Systems, pages 2670\u20132678, 2013.\n\n[34] Tengyao Wang, Quentin Berthet, and Richard J Samworth. Statistical and computational trade-offs in estimation of sparse principal components. The Annals of Statistics, 44(5):1896\u20131930, 2016.\n\n[35] Zhaoran Wang, Quanquan Gu, and Han Liu. Sharp computational-statistical phase transitions via oracle computational model. arXiv preprint arXiv:1512.08861, 2015.\n\n[36] Zhaoran Wang, Han Liu, and Tong Zhang. Optimal computational and statistical rates of convergence for sparse nonconvex learning problems. 
The Annals of Statistics, 42(6):2164\u20132201, 2014.\n\n[37] Zhaoran Wang, Huanran Lu, and Han Liu. Tighten after relax: Minimax-optimal sparse PCA in polynomial time. In Advances in Neural Information Processing Systems, pages 3383\u20133391, 2014.\n\n[38] Zhuoran Yang, Zhaoran Wang, Han Liu, Yonina C Eldar, and Tong Zhang. Sparse nonlinear regression: Parameter estimation and asymptotic inference. International Conference on Machine Learning, 2015.\n\n[39] Xinyang Yi, Zhaoran Wang, Constantine Caramanis, and Han Liu. Optimal linear estimation under unknown nonlinear transform. In Advances in Neural Information Processing Systems, pages 1549\u20131557, 2015.\n\n[40] Lixing Zhu, Baiqi Miao, and Heng Peng. On sliced inverse regression with high-dimensional covariates. Journal of the American Statistical Association, 101(474):630\u2013643, 2006.", "award": [], "sourceid": 3095, "authors": [{"given_name": "Zhuoran", "family_name": "Yang", "institution": "Princeton University"}, {"given_name": "Krishnakumar", "family_name": "Balasubramanian", "institution": "georgia tech"}, {"given_name": "Zhaoran", "family_name": "Wang", "institution": "Princeton, Phd student"}, {"given_name": "Han", "family_name": "Liu", "institution": "Tencent AI Lab"}]}