{"title": "Agnostic Estimation for Misspecified Phase Retrieval Models", "book": "Advances in Neural Information Processing Systems", "page_first": 4089, "page_last": 4097, "abstract": "The goal of noisy high-dimensional phase retrieval is to estimate an $s$-sparse parameter $\\boldsymbol{\\beta}^*\\in \\mathbb{R}^d$ from $n$ realizations of the model $Y = (\\boldsymbol{X}^{\\top} \\boldsymbol{\\beta}^*)^2 + \\varepsilon$. Based on this model, we propose a significant semi-parametric generalization called misspecified phase retrieval (MPR), in which $Y = f(\\boldsymbol{X}^{\\top}\\boldsymbol{\\beta}^*, \\varepsilon)$ with unknown $f$ and $\\operatorname{Cov}(Y, (\\boldsymbol{X}^{\\top}\\boldsymbol{\\beta}^*)^2) > 0$. For example, MPR encompasses $Y = h(|\\boldsymbol{X}^{\\top} \\boldsymbol{\\beta}^*|) + \\varepsilon$ with increasing $h$ as a special case. Despite the generality of the MPR model, it eludes the reach of most existing semi-parametric estimators. In this paper, we propose an estimation procedure, which consists of solving a cascade of two convex programs and provably recovers the direction of $\\boldsymbol{\\beta}^*$. Our theory is backed up by thorough numerical results.", "full_text": "Agnostic Estimation for Misspecified Phase Retrieval Models

Matey Neykov, Zhaoran Wang, Han Liu
Department of Operations Research and Financial Engineering
Princeton University, Princeton, NJ 08544
{mneykov, zhaoran, hanliu}@princeton.edu

Abstract

The goal of noisy high-dimensional phase retrieval is to estimate an s-sparse parameter β* ∈ R^d from n realizations of the model Y = (X^⊤β*)^2 + ε. Based on this model, we propose a significant semi-parametric generalization called misspecified phase retrieval (MPR), in which Y = f(X^⊤β*, ε) with unknown f and Cov(Y, (X^⊤β*)^2) > 0.
For example, MPR encompasses Y = h(|X^⊤β*|) + ε with increasing h as a special case. Despite the generality of the MPR model, it eludes the reach of most existing semi-parametric estimators. In this paper, we propose an estimation procedure, which consists of solving a cascade of two convex programs and provably recovers the direction of β*. Our theory is backed up by thorough numerical results.

1 Introduction

In scientific and engineering fields researchers often times face the problem of quantifying the relationship between a given outcome Y and corresponding predictor vector X, based on a sample {(Y_i, X_i^⊤)}_{i=1}^n of n observations. In such situations it is common to postulate a linear “working” model, and search for a d-dimensional signal vector β* satisfying the following familiar relationship:

Y = X^⊤β* + ε.    (1.1)

When the predictor X is high-dimensional in the sense that d ≫ n, it is commonly assumed that the underlying signal β* is s-sparse. In a certain line of applications, such as X-ray crystallography, microscopy, diffraction and array imaging¹, one can only measure the magnitude of X^⊤β* but not its phase (i.e., sign in the real domain). In this case, assuming model (1.1) may not be appropriate. To cope with such applications in the high-dimensional setting, [7] proposed the thresholded Wirtinger flow (TWF), a procedure which consistently estimates the signal β* in the following real sparse noisy phase retrieval model:

Y = (X^⊤β*)^2 + ε,    (1.2)

where one additionally knows that the predictors have a Gaussian random design X ∼ N(0, I_d). In the present paper, taking an agnostic point of view, we recognize that both models (1.1) and (1.2) represent an idealized view of the data generating mechanism. In reality, the nature of the data could be better reflected through the more flexible viewpoint of a single index model (SIM):

Y = f(X^⊤β*, ε),    (1.3)

where f is an unknown link function, and it is assumed that ‖β*‖_2 = 1 for identifiability. A recent line of work on high-dimensional SIMs [25, 27] showed that under Gaussian designs, one can apply ℓ1 regularized least squares to successfully estimate the direction of β* and its support. The crucial condition allowing for the above somewhat surprising application turns out to be:

Cov(Y, X^⊤β*) ≠ 0.    (1.4)

While condition (1.4) is fairly generic, encompassing cases with a binary outcome, such as logistic regression and one-bit compressive sensing [5], it fails to capture the phase retrieval model (1.2).

¹ In such applications it is typically assumed that X ∈ C^d is a complex normal random vector. In this paper for simplicity we only consider the real case X ∈ R^d.

30th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain.

More generally, it is easy to see that when the link function f is even in its first coordinate, condition (1.4) fails to hold. The goal of the present manuscript is to formalize a class of SIMs, which includes the noisy phase retrieval model as a special case in addition to various other additive and non-additive models with even link functions, and to develop a procedure that can successfully estimate the direction of β* up to a global sign. Formally, we consider models (1.3) with Gaussian design that satisfy the following moment assumption:

Cov(Y, (X^⊤β*)^2) > 0.    (1.5)

Unlike (1.4), one can immediately check that condition (1.5) is satisfied by model (1.2).
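The last claim is easy to verify numerically. The following quick Monte Carlo sketch (our illustration, not part of the paper; the simulation settings are arbitrary) checks that for the noisy phase retrieval model the linear covariance in (1.4) is essentially zero while the quadratic covariance in (1.5) is bounded away from zero:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 200_000, 5
beta = np.zeros(d)
beta[0] = 1.0                       # a unit-norm, 1-sparse signal
X = rng.standard_normal((n, d))     # Gaussian design X ~ N(0, I_d)
eps = rng.standard_normal(n)
index = X @ beta                    # the index X^T beta*
Y = index**2 + eps                  # noisy phase retrieval model (1.2)

cov_linear = np.cov(Y, index)[0, 1]         # estimates Cov(Y, X^T beta*)
cov_quadratic = np.cov(Y, index**2)[0, 1]   # estimates Cov(Y, (X^T beta*)^2)
print(cov_linear)      # ~ 0, since E[Z^3] = 0: condition (1.4) fails
print(cov_quadratic)   # ~ Var(Z^2) = 2 > 0: condition (1.5) holds
```

Here the population values are Cov(Y, Z) = E[Z³] = 0 and Cov(Y, Z²) = Var(Z²) = 2 for Z ∼ N(0, 1).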
In §2 we give multiple examples, both abstract and concrete, of SIMs obeying this constraint. Our second moment constraint (1.5) can be interpreted as a semi-parametric robust version of phase retrieval. Hence, we will refer to the class of models satisfying condition (1.5) as misspecified phase retrieval (MPR) models. From this point of view it is worth noting that condition (1.4) relates to linear regression in a way similar to how condition (1.5) relates to the phase retrieval model. Our motivation for studying SIMs under such a constraint can ultimately be traced to the vast sufficient dimension reduction (SDR) literature. In particular, we would like to point out [22] as a source of inspiration.

Contributions. Our first contribution is to formulate a novel and easily implementable two-step procedure, which consistently estimates the direction of β* in an MPR model. In the first step we solve a semidefinite program producing a unit vector v̂, such that |v̂^⊤β*| is sufficiently large. Once such a pilot estimate is available, we solve an ℓ1 regularized least squares problem on the augmented outcome Ỹ_i = (Y_i − Ȳ)X_i^⊤v̂, where Ȳ is the average of the Y_i's, to produce a second estimate b̂, which is then normalized to obtain the final refined estimator β̂ = b̂/‖b̂‖_2. In addition to being universally applicable to MPR models, our procedure has an algorithmic advantage in that it relies solely on convex optimization, and as a consequence we can obtain the corresponding global minima of the two convex programs in polynomial time.

Our second contribution is to rigorously demonstrate that the above procedure consistently estimates the direction of β*.
We prove that for a given MPR model, with high probability, one has:

min_{η∈{−1,1}} ‖β̂ − ηβ*‖_2 ≲ √(s log d/n),

provided that the sample size n satisfies n ≳ s² log d. While the same rates (with different constants) hold for the TWF algorithm of [7] in the special case of the noisy phase retrieval model, our procedure provably achieves these rates over the broader class of MPR models.

Lastly, we propose an optional refinement of our algorithm, which shows improved performance in the numerical studies.

Related Work. The phase retrieval model has received considerable attention in recent years from the statistics, applied mathematics and signal processing communities. For the non-sparse version of (1.2), efficient algorithms have been suggested based on both semidefinite programs [8, 10] and non-convex optimization methods that extend gradient descent [9]. Additionally, a non-traditional instance of the phase retrieval model (which also happens to be a special case of the MPR model) was considered by [11], where the authors suggested an estimation procedure originally proposed for the problem of mixed regression. For the noisy sparse version of model (1.2), near optimal solutions were achieved with a computationally infeasible program by [20]. Subsequently, a tractable gradient descent approach achieving minimax optimal rates was developed by [7].

Abstracting away from the phase retrieval or linear model settings, we note that inference for SIMs in the case when d is small or fixed has been studied extensively in the literature [e.g., 18, 24, 26, 34, among many others]. In another line of research on SDR, seminal insights shedding light on condition (1.4) can be found in, e.g., [12, 21, 23]. The modified condition (1.5) traces roots to [22], where the authors designed a procedure to handle precisely situations where (1.4) fails to hold.
More recently, there have been active developments for high-dimensional SIMs. [27] and later [31] demonstrated that under condition (1.4), running least squares with ℓ1 regularization can obtain a consistent estimate of the direction of β*, while [25] showed that this procedure also recovers the signed support of the direction. Excess risk bounds were derived in [14]. Very recently, [16] extended this observation to other convex loss functions under a condition corresponding to (1.4) which depends implicitly on the loss function of interest. [28] proposed a non-parametric least squares with an equality ℓ1 constraint to handle simultaneous estimation of β* as well as f. [17] considered a smoothed-out U-process type of loss function with ℓ1 regularization, and proved their approach works for a sub-class of functions satisfying condition (1.4). None of the aforementioned works on SIMs can be directly applied to tackle the MPR class (1.5). A generic procedure for estimating sparse principal eigenvectors was developed in [37]. While in principle this procedure can be applied to estimate the direction in MPR models, it requires proper initialization, and in addition, it requires knowledge of the sparsity of the vector β*. We discuss this approach in more detail in §4.

Regularized procedures have also been proposed for specific choices of f and Y. For example, [36] studied consistent estimation under the model P(Y = 1|X) = (h(X^⊤β*) + 1)/2 with binary Y, where h : R ↦ [−1, 1] is possibly unknown. Their procedure is based on taking pairs of differences in the outcome, and therefore replaces condition (1.4) with a different type of moment condition.
[35] considered the model Y = h(X^⊤β*) + ε with a known continuously differentiable and monotonic h, and developed estimation and inferential procedures based on the ℓ1 regularized quadratic loss, in a similar spirit to the TWF algorithm suggested by [7]. In conclusion, although there exists much prior related work, to the best of our knowledge, none of the available literature discusses MPR models in the generality we attempt in the present manuscript.

Notation. We now briefly outline some commonly used notation. Other notation will be defined as needed throughout the paper. For a (sparse) vector v = (v_1, . . . , v_p)^⊤, we let S_v := supp(v) = {j : v_j ≠ 0} denote its support, ‖v‖_p denote the ℓ_p norm (with the usual extension when p = ∞), and v^⊗2 := vv^⊤ is a shorthand for the outer product. With a standard abuse of notation we will denote by ‖v‖_0 = |supp(v)| the cardinality of the support of v. We often use I_d to denote a d × d identity matrix. For a real random variable X, define

‖X‖_{ψ2} = sup_{p≥1} p^{−1/2}(E|X|^p)^{1/p},  ‖X‖_{ψ1} = sup_{p≥1} p^{−1}(E|X|^p)^{1/p}.

Recall that a random variable is called sub-Gaussian if ‖X‖_{ψ2} < ∞ and sub-exponential if ‖X‖_{ψ1} < ∞ [e.g., 32]. For any integer k ∈ N we use the shorthand notation [k] = {1, . . . , k}. We also use standard asymptotic notation. Given two sequences {a_n}, {b_n} we write a_n = O(b_n) if there exists a constant C < ∞ such that a_n ≤ Cb_n, and a_n ≍ b_n if there exist positive constants c and C such that c < a_n/b_n < C.

Organization. In §2 and §3 we introduce the MPR model class and our estimation procedure, and §3.1 is dedicated to stating the theoretical guarantees of our proposed algorithm. Simulation results are given in §4.
A brief discussion is provided in §5. We defer the proofs to the appendices due to space limitations.

2 MPR Models

In this section we formally introduce MPR models. In detail, we argue that the class of such models is sufficiently rich, including numerous models of interest. Motivated by the setup in the sparse noisy phase retrieval model (1.2), we assume throughout the remainder of the paper that X ∼ N(0, I_d). We begin our discussion with a formal definition.

Definition 2.1 (MPR Models). Assume that we are given model (1.3), where X ∼ N(0, I_d), ε ⊥⊥ X and β* ∈ R^d is an s-sparse unit vector, i.e., ‖β*‖_2 = 1. We call such a model a misspecified phase retrieval (MPR) model, if the link function f and noise ε further satisfy, for Z ∼ N(0, 1) and K > 0,

c_0 := Cov(f(Z, ε), Z²) > 0,    (2.1)

‖Y‖_{ψ1} ≤ K.    (2.2)

Both assumptions (2.1) and (2.2) impose moment restrictions on the random variable Y. Assumption (2.1) states that Y is positively correlated with the random variable (X^⊤β*)², while assumption (2.2) requires Y to have somewhat light tails. Also, as mentioned in the introduction, the unit norm constraint on the vector β* is required for the identifiability of model (1.3). We remark that the class of MPR models is convex in the sense that if we have two MPR models f_1(X^⊤β*, ε) and f_2(X^⊤β*, ε), all models generated by their convex combinations λf_1(X^⊤β*, ε) + (1 − λ)f_2(X^⊤β*, ε) (λ ∈ [0, 1]) are also MPR models. It is worth noting that the > direction in (2.1) is assumed without loss of generality. If Cov(Y, (X^⊤β*)²) < 0 one can apply the same algorithm to −Y. However, the knowledge of the direction of the inequality is important.

In the following, we restate condition (2.1) in a more convenient way, enabling us to easily calculate the explicit value of the constant c_0 in several examples.

Proposition 2.2. Assume that there exists a version of the map ϕ(z) : z ↦ E[f(Z, ε)|Z = z] such that E D²ϕ(Z) > 0, where D² is the second distributional derivative of ϕ and Z ∼ N(0, 1). Then the SIM (1.3) satisfies assumption (2.1) with c_0 = E D²ϕ(Z).

We now provide three concrete MPR models as warm-up examples for the more general examples discussed in Proposition 2.3 and Remark 2.4. Consider the models:

Y = (X^⊤β*)² + ε,    (2.3)

Y = |X^⊤β*| + ε,    (2.4)

Y = |X^⊤β* + ε|,    (2.5)

where ε ⊥⊥ X is sub-exponential noise, i.e., ‖ε‖_{ψ1} ≤ K_ε for some K_ε > 0. Model (2.3) is the noisy phase retrieval model considered by [7], while models (2.4) and (2.5) were both discussed in [11], where the authors proposed a method to solve model (2.5) in the low-dimensional regime. Below we briefly explain why these models satisfy conditions (2.1) and (2.2). First, observe that in all three models we have a sum of two sub-exponential random variables, and hence by the triangle inequality it follows that the random variable Y is also sub-exponential, which implies (2.2).
To understand why (2.1) holds, by applying Proposition 2.2 we have c_0 = E[2] = 2 > 0 for model (2.3), c_0 = E[2δ_0(Z)] = 2/√(2π) > 0 for model (2.4), and c_0 = E[2δ_0(Z + ε)] = 2Eφ(ε) > 0 for model (2.5), where δ_0(·) is the Dirac delta function centered at zero, and φ is the density of the standard normal distribution.

Admittedly, calculating the second distributional derivative could be a laborious task in general. In the remainder of this section we set out to find a simple-to-check generic sufficient condition on the link function f and error term ε, under which both (2.1) and (2.2) hold. Before giving our result note that the condition E D²ϕ(Z) > 0 is implied whenever ϕ is strictly convex and twice differentiable. However, strictly convex functions ϕ may violate assumption (2.2) as they can inflate the tails of Y arbitrarily (consider, e.g., f(x, ε) = x⁴ + ε). Moreover, the functions in examples (2.4) and (2.5) fail to be twice differentiable. In the following result we handle those two problems, and in addition we provide a more generic condition than convexity, which suffices to ensure the validity of (2.1).

Proposition 2.3. The following statements hold.

(i) Let the function ϕ defined in Proposition 2.2 be such that the map z ↦ ϕ(z) + ϕ(−z) is non-decreasing on R⁺₀ and there exist z_1 > z_2 > 0 such that ϕ(z_1) + ϕ(−z_1) > ϕ(z_2) + ϕ(−z_2). Then (2.1) holds.

(ii) A sufficient condition for (i) to hold is that z ↦ ϕ(z) is convex and sub-differentiable at every point z ∈ R, and there exists a point z_0 ∈ R⁺₀ satisfying ϕ(z_0) + ϕ(−z_0) > 2ϕ(0).

(iii) Assume that there exist functions g_1, g_2 such that f(Z, ε) ≤ g_1(Z) + g_2(ε), and g_1 is essentially quadratic in the sense that there exists a closed interval I = [a, b] with 0 ∈ I, such that for all z satisfying g_1(z) ∈ I^c we have |g_1(z)| ≤ Cz² for a sufficiently large constant C > 0, and let g_2(ε) be sub-exponential. Then (2.2) holds.

Remark 2.4. Proposition 2.3 shows that the class of MPR models is sufficiently broad. By (i) and (ii) it immediately follows that the additive models

Y = h(X^⊤β*) + ε,    (2.6)

where the link function h is even and increasing on R⁺₀ or convex, satisfy the covariance condition (2.1) by (i) and (ii) of Proposition 2.3 respectively. If h is also essentially quadratic and ε is sub-exponentially distributed, using (iii) we can deduce that Y in (2.6) is a sub-exponential random variable, and hence under these restrictions model (2.6) is an MPR model. Both examples (2.3) and (2.4) take this form.

Additionally, Proposition 2.3 implies that the model

Y = h(X^⊤β* + ε)    (2.7)

satisfies (2.1) whenever the link h is a convex sub-differentiable function, such that h(z_0) + h(−z_0) > 2h(0) for some z_0 > 0, E|h(z + ε)| < ∞ for all z ∈ R and E|h(Z + ε)| < ∞. This conclusion follows because under the latter conditions the function ϕ(z) = Eh(z + ε) satisfies (ii), which is proved in Appendix C under Lemma C.1.
Moreover, if it turns out that h is essentially quadratic and h(2ε) is sub-exponential, then by Jensen's inequality we have 2h(Z + ε) ≤ h(2Z) + h(2ε) and hence (iii) implies that (2.2) is also satisfied. Model (2.5) is of the type (2.7). Unlike the additive noise models (2.6), models (2.7) allow noise corruption even within the argument of the link function. On the negative side, it should be apparent that (2.1) fails to hold in cases where ϕ is an odd function, i.e., ϕ(z) = −ϕ(−z). In many such cases (e.g., when ϕ is monotone, or non-constant and non-positive/non-negative on R⁺), one would have Cov(Y, X^⊤β*) = E[ϕ(Z)Z] ≠ 0, and hence direct application of the ℓ1 regularized least squares algorithm is possible, as we discussed in the introduction.

3 Agnostic Estimation for MPR

In this section we describe and motivate our two-step procedure, which consists of a convex relaxation and an ℓ1 regularized least squares program, for performing estimation in the MPR class of models described by Definition 2.1. We begin our motivation by noting that any MPR model satisfies the following inequality:

Cov((Y − µ)X^⊤β*, X^⊤β*) = E{(Y − µ)(X^⊤β*)²} = Cov(f(Z, ε), Z²) = c_0 > 0,    (3.1)

where we have denoted µ := EY. This simple observation plays a major role in the motivation of our procedure. Notice that in view of condition (1.4), inequality (3.1) implies that if instead of observing Y we had observed Y̆ = g(X^⊤β*, ε) = (Y − µ)X^⊤β*, the direction of β* could be estimated by ℓ1 regularized least squares. However, there is no direct way of generating the random variable Y̆, as doing so would require knowledge of β* and the mean µ. Here, we propose to first roughly estimate β* by a vector v̂, use an empirical estimate Ȳ of µ, and then obtain the ℓ1 regularized least squares estimate on the augmented variable Ỹ = (Y − Ȳ)X^⊤v̂ to sharpen the convergence rate. At first glance it might appear counter-intuitive that introducing a noisy estimate of β* can lead to consistent estimates, as the so-defined Ỹ variable depends on the projection of X on span{β*, v̂}. Decompose

v̂ = (v̂^⊤β*)β* + β̂⊥,    (3.2)

where β̂⊥ ⊥ β*. To better motivate this proposal, in the following we analyze the population least squares fit, based on the augmented variable Y̌ = (Y − µ)X^⊤v̂ for some fixed unit vector v̂ with decomposition (3.2). Writing out the population solution for least squares yields:

[E X^⊗2]^{−1} E[X Y̌] = E[X(Y − µ)X^⊤(v̂^⊤β*)β*] + E[X(Y − µ)X^⊤β̂⊥] =: I_1 + I_2.    (3.3)

We will now argue that the left hand side of (3.3) is proportional to β*. First, we observe that I_1 = c_0(v̂^⊤β*)β*, since multiplying by any vector b ⊥ β* yields b^⊤I_1 = 0 by independence. Second, and perhaps more importantly, we have that I_2 = 0. To see this, we first take a vector b ∈ span{β*, β̂⊥}^⊥. Since the three variables b^⊤X, Y − µ and X^⊤β̂⊥ are independent, we have b^⊤I_2 = 0. Multiplying by β* we have β*^⊤I_2 = 0 since β*^⊤X(Y − µ) is independent of X^⊤β̂⊥. Finally, multiplying by β̂⊥ yields I_2^⊤β̂⊥ = 0, since (X^⊤β̂⊥)² is independent of Y − µ.

Figure 1: An illustration of the estimates v̂ and β̂ produced by the first and second steps of Algorithm 1; panels: (a) Initialization, (b) Second Step. After the first step we can guarantee that the vector β* belongs to one of two spherical caps which contain all vectors w such that |v̂^⊤w| ≥ κ for some constant κ > 0, provided that the sample size n ≳ s² log d is sufficiently large. After the second step we can guarantee that the vector β* belongs to one of two spherical caps in (b), which are shrinking with (n, s, d) at a faster rate.

It is noteworthy to mention that the above derivation crucially relies on the fact that the Y variable was centered, and the vector v̂ was fixed. In what follows we formulate a pilot procedure which produces an estimate v̂ such that |v̂^⊤β*| ≥ κ > 0. A proper initialization algorithm can be achieved by using a spectral method, such as the Principal Hessian Directions (PHD) proposed by [22]. Cast into the framework of SIMs, the PHD framework implies the following simple observation:

Lemma 3.1. If we have an MPR model, then argmax_{‖v‖_2=1} v^⊤ E[Y(X^⊗2 − I)] v = ±β*.

A proof of this fact can be found in Appendix C. Lemma 3.1 encourages us to look into the following sample version of the maximization problem:

argmax_{‖v‖_2=1, ‖v‖_0=s} n^{−1} v^⊤ Σ_{i=1}^n [Y_i(X_i^⊗2 − I)] v,    (3.4)

which targets a restricted (s-sparse) principal eigenvector.
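As a quick numerical illustration of Lemma 3.1 (ours, with arbitrary simulation settings): in the unrestricted case s = d, the sample analogue of the lemma simply takes the leading eigenvector of n^{−1} Σ_i Y_i(X_i X_i^⊤ − I_d), which already lines up with ±β* when n ≫ d.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 50_000, 10
beta = np.zeros(d)
beta[:2] = [0.6, 0.8]                           # unit-norm signal
X = rng.standard_normal((n, d))
Y = np.abs(X @ beta) + rng.standard_normal(n)   # MPR model (2.4)

# Sample version of E[Y (X X^T - I_d)] from Lemma 3.1
M = (X * Y[:, None]).T @ X / n - Y.mean() * np.eye(d)
w, V = np.linalg.eigh(M)            # eigenvalues in ascending order
v_hat = V[:, np.argmax(w)]          # leading eigenvector
print(abs(v_hat @ beta))            # close to 1: v_hat is aligned with one of the two signs of beta
```

When d ≫ n this dense estimate breaks down, which is what motivates the sparsity constraint ‖v‖_0 = s in (3.4).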
Unfortunately, solving such a problem is a computationally intensive task, and requires knowledge of s. Here we take a standard route of relaxing the above problem to a convex program, and solving it efficiently via semidefinite programming (SDP). A similar in spirit SDP relaxation for solving sparse PCA problems was originally proposed by [13]. Instead of solving (3.4), define Σ̂ = n^{−1} Σ_{i=1}^n Y_i(X_i^⊗2 − I), and solve the following convex program:

Â = argmax_{tr(A)=1, A∈S^d_+} tr(Σ̂A) − λ_n Σ_{i,j=1}^d |A_ij|,    (3.5)

where S^d_+ is the convex cone of non-negative semidefinite matrices, and λ_n is a regularization parameter encouraging element-wise sparsity in the matrix A. The hope behind introducing the optimization program above is that Â will be a good first estimate of β*^⊗2. In practice it could turn out that the matrix Â is not rank one, hence we suggest taking v̂ as the principal eigenvector of Â. In theory we show that with high probability the matrix Â will indeed be of rank one. Observation (3.3), Lemma 3.1 and the SDP formulation motivate the agnostic two-step estimation procedure for misspecified phase retrieval in Algorithm 1.

Algorithm 1
input: (Y_i, X_i)_{i=1}^n: data, λ_n, ν_n: tuning parameters
1. Split the sample into two approximately equal sets S_1, S_2, with |S_1| = ⌊n/2⌋, |S_2| = ⌈n/2⌉.
2. Let Σ̂ := |S_1|^{−1} Σ_{i∈S_1} Y_i(X_i^⊗2 − I_d). Solve (3.5). Let v̂ be the first eigenvector of Â.
3. Let Ȳ = |S_2|^{−1} Σ_{i∈S_2} Y_i. Solve the following program:

b̂ = argmin_b (2|S_2|)^{−1} Σ_{i∈S_2} ((Y_i − Ȳ)X_i^⊤v̂ − X_i^⊤b)² + ν_n‖b‖_1.    (3.6)

4. Return β̂ := b̂/‖b̂‖_2.

The sample split is required to ensure that after decomposition (3.2), the vector β̂⊥ and the value v̂^⊤β* are independent of the remaining sample. In §3.1 we demonstrate that Algorithm 1 succeeds with the optimal (in the noisy regime) ℓ2 rate √(s log d/n), provided that s² log d ≲ n. The latter requirement on the sample size suffices to guarantee that the solution Â of optimization program (3.5) is rank one. Figure 1 illustrates the two steps of Algorithm 1. In addition to our main procedure, we propose an optional refinement step (Algorithm 2), in which one applies steps 3. and 4. of Algorithm 1 on the full dataset using the output vector β̂ of Algorithm 1. Doing so can potentially result in additional stability and further refinements of the rate constant.

Algorithm 2 Optional Refinement
input: (Y_i, X_i)_{i=1}^n: data, ν′_n: tuning parameter, output β̂ from Algorithm 1
5. Let Ȳ = n^{−1} Σ_{i∈[n]} Y_i. Solve the following program:

b̂ = argmin_b (2n)^{−1} Σ_{i=1}^n ((Y_i − Ȳ)X_i^⊤β̂ − X_i^⊤b)² + ν′_n‖b‖_1.    (3.7)

6. Return β̂′ := b̂/‖b̂‖_2.

3.1 Theoretical Guarantees

In this section we present our main theoretical results, which consist of theoretical justification of our procedures, as well as lower bounds for certain types of SIMs (1.3). To simplify the presentation in this section, we slightly change the notation and assume that the sample size is 2n and S_1 = [n] and S_2 = {n + 1, . . . , 2n}.
Of course this abuse of notation does not restrict our analysis to only even sample size cases.

Our first result shows that the optimization program (3.5) succeeds in producing a vector v̂ which is close to the vector β*.

Proposition 3.2. Assume that n is large enough so that s√(log d/n) < (1/6 − κ/4)c_0/(C_1 + C_2) for some small but fixed κ > 0 and constants C_1, C_2 (depending on f and ε). Then there exists a value of λ_n ≍ √(log d/n) such that the principal eigenvector v̂ of Â, the solution of (3.5), satisfies

|v̂^⊤β*| ≥ κ > 0,

with probability at least 1 − 4d^{−1} − O(n^{−1}).

Proposition 3.2 shows that the first step of Algorithm 1 narrows down the search for the direction of β* to a union of two spherical caps (i.e., the estimate v̂ satisfies |v̂^⊤β*| ≥ κ for some constant κ > 0, see also Figure 1a). Our main result below demonstrates that in combination with program (3.6) this suffices to recover the direction of β* at an optimal rate with high probability.
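To make the two-step pipeline concrete, here is a self-contained numerical sketch (our illustration, not the authors' implementation): for simplicity the SDP step (3.5) is replaced by the spectral initialization suggested by Lemma 3.1, program (3.6) is solved by plain proximal gradient descent (ISTA), and the helper name `ista_lasso` together with all tuning values are our own choices.

```python
import numpy as np

def ista_lasso(X, y, lam, n_iter=500):
    """Minimize (2m)^{-1} ||y - X b||_2^2 + lam ||b||_1 by proximal gradient (ISTA)."""
    m, d = X.shape
    L = np.linalg.norm(X, 2) ** 2 / m            # Lipschitz constant of the gradient
    b = np.zeros(d)
    for _ in range(n_iter):
        grad = X.T @ (X @ b - y) / m
        z = b - grad / L
        b = np.sign(z) * np.maximum(np.abs(z) - lam / L, 0.0)  # soft-thresholding
    return b

rng = np.random.default_rng(2)
n, d, s = 10_000, 200, 3                          # 2n samples in total, split in half
beta = np.zeros(d)
beta[:s] = 1 / np.sqrt(s)                         # s-sparse unit-norm signal
X = rng.standard_normal((2 * n, d))
Y = (X @ beta) ** 2 + rng.standard_normal(2 * n)  # noisy phase retrieval, model (2.3)

# Step 1 (first half S_1): spectral initialization in place of the SDP (3.5)
X1, Y1 = X[:n], Y[:n]
M = (X1 * Y1[:, None]).T @ X1 / n - Y1.mean() * np.eye(d)
w, V = np.linalg.eigh(M)
v_hat = V[:, np.argmax(w)]                        # pilot direction estimate

# Step 2 (second half S_2): l1-regularized LS on the augmented outcome, as in (3.6)
X2, Y2 = X[n:], Y[n:]
Y_tilde = (Y2 - Y2.mean()) * (X2 @ v_hat)
b_hat = ista_lasso(X2, Y_tilde, lam=3 * np.sqrt(np.log(d) / n))
beta_hat = b_hat / np.linalg.norm(b_hat)          # final normalized estimate

err = min(np.linalg.norm(beta_hat - beta), np.linalg.norm(beta_hat + beta))
print(err)  # small: the direction of beta* is recovered up to a global sign
```

Note that, as in Algorithm 1, the direction is only identifiable up to a global sign, which is why the error is measured as a minimum over η ∈ {−1, 1}.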
This phenomenon is similar to what has been observed by [7],\nfailure, it is less clear whether the estimate (cid:98)\u03b2 is universally consistent (i.e., whether the sup can be\n\nand it is our belief that this requirement cannot be relaxed for computationally feasible algorithms.\nWe would further like to mention that while in bound (3.8) we control the worst case probability of\n\n(cid:114)\n(cid:107)(cid:98)\u03b2 \u2212 \u03b7\u03b2\u2217(cid:107)2 > L\n\nwhere L is a constant depending solely on f and \u03b5.\n\n\u2264 O(d\u22121 \u2228 n\u22121),\n\n(cid:107)\u03b2\u2217(cid:107)2=1,(cid:107)\u03b2\u2217(cid:107)0\u2264s\n\n\u03b7\u2208{1,\u22121}\n\n(cid:19)\n\ns log d\n\nP\u03b2\u2217\n\n(3.8)\n\nmin\n\nsup\n\nn\n\nk\u2208K\n\ni\u2208S2\n\nThe TPM is ran for 2000 iterations. In the case of phase retrieval, the TWF algorithm was also ran\nat a total number of 2000 iterations, using the tuning parameters originally suggested in [7]. As\nexpected the TWF algorithm which targets the sparse phase retrieval model in particular outperforms\nour approach in the case when the sample size n is small, however our approach performs very\ncomparatively to the TWF, and in fact even slightly better once we increase the sample size. It is\npossible that the TWF algorithm can perform better if it is ran for a longer than 2000 iterations,\nthough in most cases it appeared to have converged to its \ufb01nal value. The results are visualized on\nFigure 2 above. The TPM algorithm, has performance comparable to that of Algorithm 1, is always\n\n7\n\ns\n\n(cid:125)\n\n(cid:124)\n\nd\u2212s\n\n, 0, . . . 0\n\n(cid:123)(cid:122)\n\n(cid:124) (cid:123)(cid:122) (cid:125)\n\nmoved inside the probability in (3.8)).\n4 Numerical Experiments\nIn this section we provide numerical experiments based on the three models (2.3), (2.4) and (2.5)\nwhere the random variable \u03b5 \u223c N (0, 1). All models are compared with the Truncated Power Method\n(TPM), proposed in [37]. 
For model (2.3) we also compare the results of our approach to the ones given by the TWF algorithm of [7]. Our setup is as follows. In all scenarios the vector $\boldsymbol{\beta}^*$ was held fixed at $\boldsymbol{\beta}^* = (\underbrace{-s^{-1/2}, s^{-1/2}, \ldots, s^{-1/2}}_{s}, \underbrace{0, \ldots, 0}_{d-s})$. Since our theory requires that $n \gtrsim s^2 \log d$, we set up four different sample sizes $n \approx \theta s^2 \log d$, where $\theta \in \{4, 8, 12, 16\}$. We let $s$ depend on the dimension $d$ and take $s \approx \log d$. In addition to the suggested approach in Algorithm 1, we also provide results using the refinement procedure (see Algorithm 3.7). We also provide the values of two "warm" starts of our algorithm, produced by solving program (3.5) with half and full data correspondingly. It is evident that in all scenarios the second step of Algorithms 1 and 2 outperforms the warm start from the SDP, except in Figure 2 (b), (c), where the sample size is simply too small for the warm start on half of the data to be accurate. All values we report are based on an average over 100 simulations.

The SDP parameter was kept at a constant value (0.015) throughout all simulations, and we observed that varying this parameter had little influence on the final SDP solution. To select the $\nu_n$ parameter for (3.6), a pre-specified grid of parameters $\{\nu_1, \ldots, \nu_l\}$ was chosen, and the following heuristic procedure based on $K$-fold cross-validation was used. We divide $S_2$ into $K = 5$ approximately equally sized non-intersecting sets $S_2 = \cup_{j \in [K]} \widetilde{S}^j_2$. For each $j \in [K]$ and $k \in [l]$ we run (3.6) on the set $\cup_{r \in [K], r \neq j} \widetilde{S}^r_2$ with a tuning parameter $\nu_n = \nu_k$ to obtain an estimate $\widehat{\boldsymbol{\beta}}_{k, -\widetilde{S}^j_2}$. Lemma 3.1 then justifies the following criterion for selecting the optimal index $\widehat{\nu}_n = \nu_{\widehat{l}}$, where
$$\widehat{l} = \operatorname*{argmax}_{k \in [l]} \sum_{j \in [K]} \sum_{i \in \widetilde{S}^j_2} Y_i \big(\boldsymbol{X}_i^{\top} \widehat{\boldsymbol{\beta}}_{k, -\widetilde{S}^j_2}\big)^2.$$
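This cross-validation heuristic can be sketched in a few lines. Here `fit` is an abstract stand-in for the second-stage convex program (3.6) (we do not reproduce the actual solver): any callable returning a unit-norm direction estimate from a subsample; the criterion maximized is the held-out sum $\sum_j \sum_{i \in \widetilde{S}^j_2} Y_i (\boldsymbol{X}_i^{\top}\widehat{\boldsymbol{\beta}}_{k,-\widetilde{S}^j_2})^2$.

```python
import numpy as np

def select_nu(X2, Y2, nu_grid, fit, K=5):
    """K-fold heuristic for tuning nu_n.  `fit(X, Y, nu)` is a placeholder for
    the second convex program (3.6); it should return a direction estimate
    computed on the supplied subsample with tuning parameter nu."""
    n = len(Y2)
    folds = np.array_split(np.arange(n), K)   # approximately equal, non-intersecting
    scores = np.zeros(len(nu_grid))
    for k, nu in enumerate(nu_grid):
        for j in range(K):
            held_out = folds[j]
            train = np.setdiff1d(np.arange(n), held_out)
            beta_kj = fit(X2[train], Y2[train], nu)
            # large Y_i co-occurring with large (X_i^T beta)^2 indicates good alignment
            scores[k] += np.sum(Y2[held_out] * (X2[held_out] @ beta_kj) ** 2)
    return nu_grid[int(np.argmax(scores))]
```

The criterion rewards directions $\widehat{\boldsymbol{\beta}}$ for which large responses $Y_i$ co-occur with large values of $(\boldsymbol{X}_i^{\top}\widehat{\boldsymbol{\beta}})^2$, mirroring the condition $\operatorname{Cov}(Y, (\boldsymbol{X}^{\top}\boldsymbol{\beta}^*)^2) > 0$ defining the MPR class.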
Our experience suggests this approach works well in practice provided that the values $\{\nu_1, \ldots, \nu_l\}$ are selected within an appropriate range and are of the magnitude $\sqrt{\log d/n}$.

Since the TPM algorithm requires an estimate of the sparsity $s$, we tuned it as suggested in Section 4.1.2 of [37]. In particular, for each scenario we considered the set of possible sparsities $\mathcal{K} = \{s, 2s, 4s, 8s\}$. For each $k \in \mathcal{K}$ the algorithm is run on the first part of the data $S_1$ to obtain an estimate $\widehat{\boldsymbol{\beta}}_k$, and the final estimate is taken to be $\widehat{\boldsymbol{\beta}}_{\widehat{k}}$, where $\widehat{k}$ is given by
$$\widehat{k} = \operatorname*{argmax}_{k \in \mathcal{K}} \; \widehat{\boldsymbol{\beta}}_k^{\top} \Big( |S_2|^{-1} \sum_{i \in S_2} Y_i \big(\boldsymbol{X}_i^{\otimes 2} - \mathbf{I}_d\big) \Big) \widehat{\boldsymbol{\beta}}_k.$$

[Figure 2, panels (a) Model (2.3), $d = 200$; (b) Model (2.4), $d = 200$; (c) Model (2.5), $d = 200$; (d) Model (2.3), $d = 400$; (e) Model (2.4), $d = 400$; (f) Model (2.5), $d = 400$.]

Figure 2: Simulation results for the three examples considered in §2, in two different settings for the dimension $d = 200, 400$. Here the parameter $\theta \approx \frac{n}{s^2 \log d}$ describes the relationship between sample size, dimension and sparsity of the signal. Algorithm 2 dominates in most settings, with exceptions when $\theta$ is too small, in which case none of the approaches provides meaningful results.

The TPM estimate is always worse than the estimate produced by Algorithm 2; moreover, TPM needs an initialization (for which the first step of Algorithm 1 is used) and further requires a rough knowledge of the sparsity $s$, whereas both Algorithms 1 and 2 do not require an estimate of $s$.

5 Discussion

In this paper we proposed a two-step procedure for estimation of MPR models with standard Gaussian designs.
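As a schematic recap of the two-step idea, the sketch below replaces the SDP initialization (3.5) by the plain leading eigenvector of the empirical surrogate matrix $\widehat{\mathbf{A}} = |S_1|^{-1}\sum_{i \in S_1} Y_i(\boldsymbol{X}_i\boldsymbol{X}_i^{\top} - \mathbf{I}_d)$ (the same quadratic form appearing in the selection criteria of §4), and the convex refinement (3.6) by a single hard-thresholded power step on held-out data. This is a simplified illustration of the pipeline's structure, not the paper's actual algorithm; in particular, unlike Algorithms 1 and 2, it takes a sparsity guess `s_guess` as input.

```python
import numpy as np

def two_step_direction(X, Y, s_guess):
    """Schematic two-step estimator: (i) initialize via the leading eigenvector
    of A_hat = mean of Y_i (X_i X_i^T - I) on the first half of the data;
    (ii) refine with one hard-thresholded power step on the second half.
    A simplified stand-in for programs (3.5) and (3.6), for illustration only."""
    n, d = X.shape
    half = n // 2
    X1, Y1, X2, Y2 = X[:half], Y[:half], X[half:], Y[half:]
    # Step 1: surrogate matrix and its leading eigenvector (initialization).
    A1 = (X1 * Y1[:, None]).T @ X1 / half - Y1.mean() * np.eye(d)
    _, vecs = np.linalg.eigh(A1)             # eigenvalues in ascending order
    v = vecs[:, -1]
    # Step 2: one power step on held-out data, then keep the s largest coordinates.
    A2 = (X2 * Y2[:, None]).T @ X2 / (n - half) - Y2.mean() * np.eye(d)
    b = A2 @ v
    keep = np.argsort(np.abs(b))[-s_guess:]
    out = np.zeros(d)
    out[keep] = b[keep]
    return out / np.linalg.norm(out)
```

For $Y = (\boldsymbol{X}^{\top}\boldsymbol{\beta}^*)^2 + \varepsilon$ with $\|\boldsymbol{\beta}^*\|_2 = 1$, one has $\mathbb{E}\,\widehat{\mathbf{A}} = 2\boldsymbol{\beta}^*\boldsymbol{\beta}^{*\top}$, so the leading eigenvector estimates the direction of $\boldsymbol{\beta}^*$ up to sign, which is all that is identifiable here.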
We argued that the MPR models form a rich class including numerous additive SIMs (i.e., $Y = h(\boldsymbol{X}^{\top}\boldsymbol{\beta}^*) + \varepsilon$) with an even link function $h$ which is increasing on $\mathbb{R}^+$. Our algorithm is based solely on convex optimization, and achieves optimal rates of estimation.

Our procedure does require that the sample size satisfies $n \gtrsim s^2 \log d$ to ensure successful initialization. The same condition has been exhibited previously, e.g., in [7] for the phase retrieval model, and in works on sparse principal component analysis [see, e.g., 3, 15, 33]. We anticipate that for a certain subclass of MPR models, the sample size requirement $n \gtrsim s^2 \log d$ is necessary for computationally efficient algorithms to exist. We conjecture that models (2.3)-(2.5) are such models. It is, however, certainly not true that this sample size requirement holds for all models in the MPR class. For example, the following model can be solved efficiently by applying the Lasso algorithm, without requiring the sample size scaling $n \gtrsim s^2 \log d$:
$$Y = \operatorname{sign}(\boldsymbol{X}^{\top}\boldsymbol{\beta}^* + c),$$
where $c < 0$ is fixed. This discussion leads to the important question of under what conditions on the (known) link and error distribution $(f, \varepsilon)$ one can efficiently solve the SIM $Y = f(\boldsymbol{X}^{\top}\boldsymbol{\beta}^*, \varepsilon)$ with an optimal sample complexity. We would like to investigate this issue further in our future work.

Acknowledgments: The authors would like to thank the reviewers and meta-reviewers for carefully reading the manuscript and their helpful suggestions which improved the presentation. The authors would also like to thank Professor Xiaodong Li for kindly providing the code for the TWF algorithm.

References

[1] Adamczak, R. and Wolff, P. (2015). Concentration inequalities for non-Lipschitz functions with bounded derivatives of higher order. Probability Theory and Related Fields, 162 531–586.

[2] Amini, A. A. and Wainwright, M. J. (2008). High-dimensional analysis of semidefinite relaxations for sparse principal components. In IEEE International Symposium on Information Theory.
[3] Berthet, Q. and Rigollet, P. (2013). Complexity theoretic lower bounds for sparse principal component detection. In Conference on Learning Theory.

[4] Bickel, P. J., Ritov, Y. and Tsybakov, A. B. (2009). Simultaneous analysis of Lasso and Dantzig selector. The Annals of Statistics 1705–1732.

[5] Boufounos, P. T. and Baraniuk, R. G. (2008). 1-bit compressive sensing. In Annual Conference on Information Sciences and Systems.

[6] Bühlmann, P. and van de Geer, S. (2011). Statistics for high-dimensional data: Methods, theory and applications. Springer.

[7] Cai, T. T., Li, X. and Ma, Z. (2015). Optimal rates of convergence for noisy sparse phase retrieval via thresholded Wirtinger flow. arXiv:1506.03382.

[8] Candès, E. J., Li, X. and Soltanolkotabi, M. (2015). Phase retrieval from coded diffraction patterns. Applied and Computational Harmonic Analysis, 39 277–299.

[9] Candès, E. J., Li, X. and Soltanolkotabi, M. (2015). Phase retrieval via Wirtinger flow: Theory and algorithms. IEEE Transactions on Information Theory, 61 1985–2007.

[10] Candès, E. J., Strohmer, T. and Voroninski, V. (2013). PhaseLift: Exact and stable signal recovery from magnitude measurements via convex programming. Communications on Pure and Applied Mathematics, 66 1241–1274.

[11] Chen, Y., Yi, X. and Caramanis, C. (2013). A convex formulation for mixed regression with two components: Minimax optimal rates. arXiv:1312.7006.

[12] Cook, R. D. and Ni, L. (2005). Sufficient dimension reduction via inverse regression. Journal of the American Statistical Association, 100.

[13] d'Aspremont, A., El Ghaoui, L., Jordan, M. I. and Lanckriet, G. R. (2007). A direct formulation for sparse PCA using semidefinite programming. SIAM Review, 49 434–448.

[14] Ganti, R., Rao, N., Willett, R. M. and Nowak, R. (2015). Learning single index models in high dimensions. arXiv:1506.08910.

[15] Gao, C., Ma, Z. and Zhou, H. H. (2014). Sparse CCA: Adaptive estimation and computational barriers. arXiv:1409.8565.

[16] Genzel, M. (2016). High-dimensional estimation of structured signals from non-linear observations with general convex loss functions. arXiv:1602.03436.

[17] Han, F. and Wang, H. (2015). Provable smoothing approach in high dimensional generalized regression model. arXiv:1509.07158.

[18] Horowitz, J. L. (2009). Semiparametric and nonparametric methods in econometrics. Springer.

[19] Laurent, B. and Massart, P. (2000). Adaptive estimation of a quadratic functional by model selection. Annals of Statistics 1302–1338.

[20] Lecué, G. and Mendelson, S. (2013). Minimax rate of convergence and the performance of ERM in phase recovery. arXiv:1311.5024.

[21] Li, K.-C. (1991). Sliced inverse regression for dimension reduction. Journal of the American Statistical Association, 86 316–327.

[22] Li, K.-C. (1992). On principal Hessian directions for data visualization and dimension reduction: Another application of Stein's lemma. Journal of the American Statistical Association, 87 1025–1039.

[23] Li, K.-C. and Duan, N. (1989). Regression analysis under link violation. The Annals of Statistics 1009–1052.

[24] McCullagh, P. and Nelder, J. (1989). Generalized linear models. Chapman & Hall/CRC.

[25] Neykov, M., Liu, J. S. and Cai, T. (2016). L1-regularized least squares for support recovery of high dimensional single index models with Gaussian designs. Journal of Machine Learning Research, 17 1–37.

[26] Peng, H. and Huang, T. (2011). Penalized least squares for single index models. Journal of Statistical Planning and Inference, 141 1362–1379.

[27] Plan, Y. and Vershynin, R. (2015). The generalized Lasso with non-linear observations. IEEE Transactions on Information Theory.

[28] Radchenko, P. (2015). High dimensional single index models. Journal of Multivariate Analysis, 139 266–282.

[29] Raskutti, G., Wainwright, M. J. and Yu, B. (2010). Restricted eigenvalue properties for correlated Gaussian designs. Journal of Machine Learning Research, 11 2241–2259.

[30] Stein, C. M. (1981). Estimation of the mean of a multivariate normal distribution. The Annals of Statistics 1135–1151.

[31] Thrampoulidis, C., Abbasi, E. and Hassibi, B. (2015). Lasso with non-linear measurements is equivalent to one with linear measurements. arXiv:1506.02181.

[32] Vershynin, R. (2010). Introduction to the non-asymptotic analysis of random matrices. arXiv:1011.3027.

[33] Wang, Z., Gu, Q. and Liu, H. (2015). Sharp computational-statistical phase transitions via oracle computational model. arXiv:1512.08861.

[34] Xia, Y. and Li, W. (1999). On single-index coefficient regression models. Journal of the American Statistical Association, 94 1275–1285.

[35] Yang, Z., Wang, Z., Liu, H., Eldar, Y. C. and Zhang, T. (2015). Sparse nonlinear regression: Parameter estimation and asymptotic inference. arXiv:1511.04514.

[36] Yi, X., Wang, Z., Caramanis, C. and Liu, H. (2015). Optimal linear estimation under unknown nonlinear transform. In Advances in Neural Information Processing Systems.

[37] Yuan, X.-T. and Zhang, T. (2013). Truncated power method for sparse eigenvalue problems. Journal of Machine Learning Research, 14 899–925.