{"title": "Statistical-Computational Tradeoff in Single Index Models", "book": "Advances in Neural Information Processing Systems", "page_first": 10419, "page_last": 10426, "abstract": "We study the statistical-computational tradeoffs in a high dimensional single index model $Y=f(X^\\top\\beta^*) +\\epsilon$, where $f$ is unknown, $X$ is a Gaussian vector and $\\beta^*$ is $s$-sparse with unit norm. When $\\mathrm{Cov}(Y,X^\\top\\beta^*)\\neq 0$, \\cite{plan2016generalized} shows that the direction and support of $\\beta^*$ can be recovered using a generalized version of Lasso. In this paper, we investigate the case when this critical assumption fails to hold, where the problem becomes considerably harder. Using the statistical query model to characterize the computational cost of an algorithm, we show that when $\\mathrm{Cov}(Y,X^\\top\\beta^*)=0$ and $\\mathrm{Cov}(Y,(X^\\top\\beta^*)^2)>0$, no computationally tractable algorithms can achieve the information-theoretic limit of the minimax risk. This implies that one must pay an extra computational cost for the nonlinearity involved in the model.", "full_text": "Statistical-Computational Tradeoffs in High-Dimensional Single Index Models

Lingxiao Wang*, Zhuoran Yang†, Zhaoran Wang‡

Abstract

We study the statistical-computational tradeoffs in a high-dimensional single index model $Y = f(X^\top \beta^*) + \epsilon$, where $f$ is unknown, $X$ is a Gaussian vector, and $\beta^*$ is $s$-sparse with unit norm. When $\mathrm{Cov}(Y, X^\top \beta^*) \neq 0$, [43] shows that the direction and support of $\beta^*$ can be recovered using a generalized version of the Lasso. In this paper, we investigate the case when this critical assumption fails to hold, where the problem becomes considerably harder.
Using the statistical query model to characterize the computational cost of an algorithm, we show that when $\mathrm{Cov}(Y, X^\top \beta^*) = 0$ and $\mathrm{Cov}(Y, (X^\top \beta^*)^2) > 0$, no computationally tractable algorithms can achieve the information-theoretic limit of the minimax risk. This implies that one must pay an extra computational cost for the nonlinearity involved in the model.

1 Introduction

A single index model (SIM) specifies that the response $Y$ and the covariate $X$ satisfy $Y = f(X^\top \beta^*) + \epsilon$, where $\beta^* \in \mathbb{R}^d$ is an unknown parameter, $f: \mathbb{R} \to \mathbb{R}$ is an unknown link function, and $\epsilon \in \mathbb{R}$ is a random noise. This model extends linear regression by incorporating the unknown link function, which offers additional modeling flexibility and robustness to model misspecification. SIMs are extensively studied in the literature, with wide applications such as time series analysis [17], survival analysis [35], and quantile regression [56].

Given $n$ i.i.d. observations of this model, the primary focus is to estimate the parametric component $\beta^*$ without knowing the exact form of $f$. When $\beta^*$ is estimated accurately, $f$ can be fitted via univariate nonparametric regression. Recently, there has been growing research interest in recovering $\beta^*$ in the high-dimensional setting, where the dimensionality $d$ is much larger than the sample size $n$ and $\beta^*$ is sparse. When $Y$ and $X^\top \beta^*$ have nonzero correlation, [43, 44] propose to estimate $\beta^*$ by fitting an $\ell_1$-regularized linear model, i.e., the Lasso [50], directly using $Y$ and $X$. More interestingly, they also establish theoretical guarantees similar to those for the linear model. Specifically, they show that the Lasso estimator is consistent as long as the sample size is of the order $s \log d$, where $s$ is the number of nonzero entries in $\beta^*$.
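As a concrete illustration of this approach, the sketch below fits the $\ell_1$-regularized least-squares objective directly to $(Y, X)$ drawn from a SIM with a correlated link. It is an illustrative ISTA (proximal-gradient) solver with hypothetical problem sizes, not the exact procedure or tuning of [43, 44]:

```python
import numpy as np

def generalized_lasso(X, Y, lam, n_iter=500):
    """Minimize (1/2n)||Y - X b||_2^2 + lam * ||b||_1 by ISTA
    (gradient step on the smooth part, then soft-thresholding)."""
    n, d = X.shape
    L = np.linalg.norm(X, 2) ** 2 / n       # Lipschitz constant of the gradient
    beta = np.zeros(d)
    for _ in range(n_iter):
        grad = X.T @ (X @ beta - Y) / n
        z = beta - grad / L
        beta = np.sign(z) * np.maximum(np.abs(z) - lam / L, 0.0)
    return beta

rng = np.random.default_rng(0)
n, d, s = 2000, 100, 3
beta_star = np.zeros(d)
beta_star[:s] = 1.0 / np.sqrt(s)            # unit-norm, s-sparse parameter
X = rng.standard_normal((n, d))
index = X @ beta_star
Y = index + index**2 + 0.1 * rng.standard_normal(n)   # link f(t) = t + t^2

beta_hat = generalized_lasso(X, Y, lam=0.05)
cosine = beta_hat @ beta_star / np.linalg.norm(beta_hat)
```

The Lasso here treats the nonlinear responses as if they came from a linear model; because $\mathrm{Cov}(Y, X^\top \beta^*) \neq 0$ for this link, the minimizer aligns with the direction of $\beta^*$ up to scaling.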
Moreover, this sample complexity result is known to be optimal in the sense that it attains the information-theoretic lower bound [46, 53], and the proposed estimator can be obtained efficiently using convex optimization. However, the Lasso approach fails when $Y$ and $X^\top \beta^*$ are uncorrelated, which is the case when the link function is symmetric. A prominent example is phase retrieval [10, 11], where $f$ is known to be either the absolute value or the quadratic function. For sparse phase retrieval, the $s \log d$ sample complexity is only attained by the empirical risk minimizer [33], which searches over all $\binom{d}{s}$ possible support sets of $\beta$ and is thus computationally intractable.

*Northwestern University; lingxiaowang2022@u.northwestern.edu
†Princeton University; zy6@princeton.edu
‡Northwestern University; zhaoranwang@gmail.com

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

In addition, various efficient estimators have been proposed based on convex relaxation or projected gradient descent [8, 13], whose consistency is only shown when the sample size is of the order $s^2 \log d$. Thus, there seems to be an interesting tradeoff between statistical optimality and computational efficiency, i.e., there is a gap between the optimal statistical performance achieved by the family of computationally efficient estimators and that attained by all possible estimators. In sparse phase retrieval, such a gap is conjectured to be fundamental [8], and it is also observed in SIMs where $f$ is symmetric [42, 48, 62]. This intriguing phenomenon motivates the following two questions: (i) How does the unknown link function affect the statistical and computational aspects of learning SIMs in high dimensions?
(ii) Is the gap observed for symmetric link functions intrinsic, or can it be eliminated by more sophisticated algorithm design and analysis?

For the first question, we introduce the notions of first- and second-order Stein's associations, which characterize the dependence between $Y$ and $X^\top \beta^*$ at two different orders. We differentiate two types of link functions: (i) $f$ with nonzero first-order Stein's association and (ii) $f$ with zero first-order and nonzero second-order Stein's association. These two classes capture the functions considered in [43, 44] and [42, 48, 62], respectively. More importantly, we establish the statistical-computational barrier under an oracle computational model [16, 18, 19, 54], which is an abstraction of the computations made by algorithms that interact with data. Specifically, we study the signal detection problem where the link function is defined as a continuous interpolation between two link functions of different types. We establish information-theoretic and computational lower bounds for the minimum signal strength required for successful detection, and we also propose algorithms that yield matching upper bounds. Moreover, we characterize the gap between the signal strengths required for learning SIMs under limited and unlimited computational budgets, and display the evolution of this gap as the link function transits from one type to the other.

Main Contribution. Our contribution is three-fold. First, we introduce the first- and second-order Stein's associations, which bring a general characterization of the link functions considered in the literature. Second, for the detection problem, we establish nearly tight information-theoretic and computational lower bounds under the framework of the oracle model, which exhibit the statistical price paid for achieving computational efficiency in learning SIMs. Third, we construct algorithms that yield matching upper bounds.
Our results also imply a similar computational barrier for parameter estimation, thus providing a positive answer to the open problem raised in [8].

Related Work. There is a huge body of literature on single-index models in the low-dimensional setting. See, for example, [25, 27, 29, 39] and the references therein. For high-dimensional SIMs, when $Y$ and $X^\top \beta^*$ have a nonzero correlation, [22, 23, 26, 40, 41, 43, 44, 58] study the statistical rates of Lasso-type estimators, which are shown to achieve both statistical accuracy and computational efficiency. In contrast, [42, 49, 61, 62] study SIMs that generalize sparse phase retrieval [8].

In addition, the statistical query model is proposed by [30] and further extended by [15, 18-20] for studying the computational complexity of planted clique, random satisfiability problems, stochastic convex optimization, and Gaussian mixture models. Based on a slightly modified version, [16, 34, 54, 63] establish statistical-computational tradeoffs in statistical problems including sparse PCA, high-dimensional mixture models, weakly supervised learning, and graph structure inference. Among them, our work is most closely related to [16], which validates the computational barrier in phase retrieval with the absolute value link function by drawing a connection to mixtures of regression models. In comparison, we tackle SIMs directly, which include phase retrieval as a particular case. More importantly, by interpolating between the two sub-classes of SIMs, we obtain the full spectrum of phase transitions, which sheds new light on the open problem raised in [8].

Furthermore, there is a massive body of literature on understanding the computational barriers of statistical models. Besides our oracle model approach, there are two other popular means of attacking such problems.
The first one is based on polynomial-time reductions from conjecturally hard computational problems to the statistical problems of interest. See, e.g., [3-7, 9, 12, 21, 24, 37, 57] and the references therein. The second method constructs a sequence of increasingly tight sum-of-squares convex relaxations based on semidefinite programming [1, 2, 14, 28, 31, 36, 38, 45, 55]. Although this approach is free of hardness conjectures, the resulting computational barriers only hold for the restricted family of convex relaxation algorithms.

2 Background

In this section, we first introduce the single index model and the associated signal detection problem. We then introduce the statistical query model, which quantifies the computational cost of an algorithm that interacts with data and is later used to establish the main results.

2.1 Statistical Model

We consider the single index model

$$Y = f(X^\top \beta^*) + \epsilon, \quad (2.1)$$

where $X \sim N(0, I_d)$ is the covariate, $Y$ is the response, $\beta^* \in \mathbb{R}^d$ is the unknown parameter of interest, $\epsilon \sim N(0, \sigma^2)$ is the noise, and $f: \mathbb{R} \to \mathbb{R}$ is the unknown link function. Given $n$ independent realizations $\{z_i = (y_i, x_i)\}_{i \in [n]}$ of this model, our goal is to estimate $\beta^*$ under the assumption that $\beta^*$ is $s$-sparse, $s \ll n$, and $d \gg n$.

[43] estimate $\beta^*$ by exploiting the covariance structure $\mathrm{Cov}(Y, X^\top \beta^*)$. When such a structure is unavailable, that is, when $\mathrm{Cov}(Y, X^\top \beta^*) = 0$, [42, 62] estimate $\beta^*$ by exploiting $\mathrm{Cov}[Y, (X^\top \beta^*)^2]$. However, the resulting estimators require a higher sample complexity than the estimators based on $\mathrm{Cov}(Y, X^\top \beta^*)$. To understand this gap in sample complexity, we consider more general settings under a unified framework. The key to this framework is the following pair of Stein's identities [47].
Let $X \sim N(0, I_d)$ be a standard Gaussian vector and $Y = h(X)$. If the expectation $\mathbb{E}[\nabla h(X)]$ exists, the first-order Stein's identity takes the form

$$\mathbb{E}[\nabla h(X)] = \mathbb{E}[Y X]. \quad (2.2)$$

Let $Y = h(X)$, where $h$ is twice differentiable. If the expectation $\mathbb{E}[\nabla^2 h(X)]$ exists, the second-order Stein's identity takes the form

$$\mathbb{E}[\nabla^2 h(X)] = \mathbb{E}[Y \cdot (X X^\top - I_d)]. \quad (2.3)$$

The above identities show that the covariance structures $\mathrm{Cov}(Y, X^\top \beta^*)$ and $\mathrm{Cov}[Y, (X^\top \beta^*)^2]$ are pivotal in the estimation of the model defined in (2.1) [59, 60]. Specifically, applying (2.2) with $h(X) = f(X^\top \beta^*) + \epsilon$, it holds that $\mathbb{E}[Y X] = \mathbb{E}[f'(X^\top \beta^*)] \cdot \beta^*$, where we denote by $f'$ the derivative of $f$. In other words, $\mathbb{E}[Y X]$ recovers $\beta^*$ up to a scaling under the assumption that $\mathrm{Cov}(Y, X^\top \beta^*) \neq 0$. Meanwhile, applying (2.3) with $h(X) = f(X^\top \beta^*) + \epsilon$, it holds that

$$\mathbb{E}[Y \cdot X X^\top] = \mathbb{E}[f''(X^\top \beta^*)] \cdot \beta^* \beta^{*\top} + \mathbb{E}[Y] \cdot I_d.$$

In other words, $\beta^*$ is the leading eigenvector of $\mathbb{E}[Y \cdot X X^\top]$ under the assumption that $\mathrm{Cov}[Y, (X^\top \beta^*)^2] > 0$. We define the following covariance structures, which play important roles in the estimation of $\beta^*$ in the model in (2.1) with unknown link function $f$.

Definition 2.1 (First-order and second-order Stein's associations). Let $\psi$ be a twice differentiable transformation from $\mathbb{R}$ to $\mathbb{R}$ and $Y$ be the response of $X$ under the model in (2.1). We define the first- and second-order Stein's associations between $Y$ and $X^\top \beta^*$ as

$$S_1(Y) = \mathrm{Cov}(Y, X^\top \beta^*), \quad S_2(Y, \psi) = \mathrm{Cov}[\psi(Y), (X^\top \beta^*)^2],$$

respectively, where $\psi$ is called the marginal transformation.

In the following, we introduce the classes of link functions of interest. We consider the following two classes,

$$\mathcal{C}_1 = \{f : \mathrm{Cov}(f(X^\top \beta^*), X^\top \beta^*) / \|\beta^*\|_2^2 = 1\}, \quad \mathcal{C}_2 = \{f : \mathrm{Cov}(f(X^\top \beta^*), X^\top \beta^*) = 0\}. \quad (2.4)$$

The function class $\mathcal{C}_1$ is a class of normalized link functions. From the first-order Stein's identity in (2.2), it holds that

$$\mathrm{Cov}(f(X^\top \beta^*), X^\top \beta^*) = \mathbb{E}[f'(X^\top \beta^*)] \cdot \|\beta^*\|_2^2.$$

In other words, the definition of $\mathcal{C}_1$ in (2.4) equivalently requires the link function $f \in \mathcal{C}_1$ to satisfy $\mathbb{E}[f'(X^\top \beta^*)] = 1$.

For any twice differentiable marginal transformation $\psi$, we define $\mathcal{C}(\psi)$ as the class of link functions $f$ such that

$$\mathcal{C}(\psi) = \{f : \mathrm{Cov}[\psi(Y), (X^\top \beta^*)^2] / \|\beta^*\|_2^4 \geq 1 \text{ for } Y = f(X^\top \beta^*) + \epsilon\}. \quad (2.5)$$

The definition of $\mathcal{C}(\psi)$ is a generalization of the misspecified phase retrieval model studied by [42, 62] with additive noise. By allowing marginal transformations of $Y$, such a class also covers the linear regression model as a special case.

Note that in (2.5), we require the covariance structure $\mathrm{Cov}[\psi(Y), (X^\top \beta^*)^2]$ to have a magnitude comparable to $\|\beta^*\|_2^4$.
Without any loss of generality, such a requirement specifies the scaling of the marginal transformation $\psi$ and the corresponding link function $f \in \mathcal{C}(\psi)$. To see this, note that it holds from the second-order Stein's identity in (2.3) that

$$\mathrm{Cov}[\psi(Y), (X^\top \beta^*)^2] = \mathbb{E}[D^2 \psi(f(X^\top \beta^*) + \epsilon)] \cdot \|\beta^*\|_2^4,$$

where $D$ is the differentiation operator with respect to $X^\top \beta^*$. In other words, (2.5) equivalently requires the link function $f \in \mathcal{C}(\psi)$ to satisfy $\mathbb{E}[D^2 \psi(f(X^\top \beta^*) + \epsilon)] \geq 1$.

For $\psi(y) = y$, the function class $\mathcal{C}(\psi)$ defined in (2.5) reduces to the misspecified phase retrieval models considered by [42, 62] with additive noise. For $\psi(y) = y^2$, $\mathcal{C}(\psi)$ characterizes the linear regression model, the mixed regression model, and various phase retrieval models, including $Y = (X^\top \beta^*)^2 + \epsilon$ and $Y = |X^\top \beta^*| + \epsilon$, up to normalizations. In particular, $\mathcal{C}(\psi)$ also characterizes a class of one-hidden-layer neural networks with the Rectified Linear Unit (ReLU) activation function. For a neural network with two neurons in the hidden layer, where the parameters in the first layer are $\beta^*$ and $-\beta^*$ and the parameter in the second layer is $(1, 1) \in \mathbb{R}^2$, we have

$$Y = \max\{X^\top \beta^*, 0\} + \max\{-X^\top \beta^*, 0\} + \epsilon = |X^\top \beta^*| + \epsilon,$$

which is captured by $\mathcal{C}(\psi)$ with $\psi(y) = y$ or $\psi(y) = y^2$ up to normalizations.

Throughout this paper, we focus on the marginal transformations $\psi$ such that $\mathcal{C}(\psi) \cap \mathcal{C}_1 \neq \emptyset$ and $\mathcal{C}(\psi) \cap \mathcal{C}_2 \neq \emptyset$, where the function classes $\mathcal{C}_1$, $\mathcal{C}_2$, and $\mathcal{C}(\psi)$ are defined in (2.4) and (2.5).
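Membership in these classes, and the two Stein identities behind them, are easy to check by simulation. The sketch below uses the illustrative link $f(t) = t + t^2$, for which $\mathbb{E}[f'] = 1$ (placing $f \in \mathcal{C}_1$) and $\mathbb{E}[f''] = 2$ (placing $f \in \mathcal{C}(\psi)$ for $\psi(y) = y$); the dimension, sample size, and noise level are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(0)
d, n, sigma = 5, 200_000, 0.1
beta = np.zeros(d)
beta[0] = 1.0                                    # unit-norm direction

X = rng.standard_normal((n, d))
t = X @ beta
Y = t + t**2 + sigma * rng.standard_normal(n)    # link f(t) = t + t^2

# First-order identity (2.2): E[Y X] = E[f'(X^T beta)] * beta = beta here,
# so the link is normalized as required by the class C_1.
first = (Y[:, None] * X).mean(axis=0)

# Second-order identity (2.3): E[Y (X X^T - I)] = E[f''] * beta beta^T
# = 2 beta beta^T, so Cov(psi(Y), (X^T beta)^2) = 2 >= 1 for psi(y) = y,
# placing the link in C(psi) as well.
second = (X.T * Y) @ X / n - Y.mean() * np.eye(d)
```

With these choices, `first` concentrates around $\beta^*$ and `second` around $2 \beta^* \beta^{*\top}$, matching the two population identities.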
Such a class of marginal transformations $\psi$ enables us to study the phase transition between $f_1 \in \mathcal{C}(\psi) \cap \mathcal{C}_1$ and $f_2 \in \mathcal{C}(\psi) \cap \mathcal{C}_2$. As an example, we consider $\psi(y) = y$. It holds that $f_1 \in \mathcal{C}(\psi) \cap \mathcal{C}_1$ for $f_1(X^\top \beta^*) = X^\top \beta^* + (X^\top \beta^*)^2$, and $f_2 \in \mathcal{C}(\psi) \cap \mathcal{C}_2$ for $f_2(X^\top \beta^*) = (X^\top \beta^*)^2$. In other words, it holds that $\mathcal{C}(\psi) \cap \mathcal{C}_1 \neq \emptyset$ and $\mathcal{C}(\psi) \cap \mathcal{C}_2 \neq \emptyset$ for $\psi(y) = y$. With link functions $f_1 \in \mathcal{C}(\psi) \cap \mathcal{C}_1$ and $f_2 \in \mathcal{C}(\psi) \cap \mathcal{C}_2$, we introduce the following statistical model of interest,

$$Y = \begin{cases} f_1(X^\top \beta^*) + \epsilon, & \text{with probability } \alpha, \\ f_2(X^\top \beta^*) + \epsilon, & \text{with probability } 1 - \alpha, \end{cases} \quad (2.6)$$

where $\epsilon \sim N(0, \sigma^2)$, $X \sim N(0, I_d)$, and $\beta^*$ is $s$-sparse. We assume that $f_1$ and $f_2$ are unknown, and $\psi$ is known a priori. In (2.6), the mixture probability $\alpha$ controls the magnitude of the first-order Stein's association $S_1(Y)$ defined in Definition 2.1, which characterizes a notion of linearity between the response $Y$ and the index $X^\top \beta^*$.

Given $n$ independent observations $z_i = (y_i, x_i)$ of (2.6) with $n \ll d$, we aim at detecting the existence of a nonzero parameter $\beta^*$, that is, testing the following hypotheses,

$$H_0: \beta^* = 0 \quad \text{versus} \quad H_1: \beta^* \neq 0. \quad (2.7)$$

In what follows, we assume that $s$ is a known integer and $\sigma^2$ is an unknown constant. Meanwhile, to address the identifiability issue, we assume that $\|\beta^*\|_2$ is fixed.

The difficulty of the testing problem in (2.7) is characterized by the signal-to-noise ratio (SNR), which is defined as $\kappa(\beta^*, \sigma) = \|\beta^*\|_2^2 / \sigma^2$.
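The role of $\alpha$ in (2.6) can be seen directly in simulation: for the example pair $f_1(t) = t + t^2$ and $f_2(t) = t^2$ with $\psi(y) = y$, the first-order association $S_1(Y)$ equals $\alpha$, while the second-order association stays fixed at $2$ for every $\alpha$. The sketch below (with hypothetical sizes) checks both facts empirically:

```python
import numpy as np

rng = np.random.default_rng(1)
d, n, alpha, sigma = 5, 400_000, 0.3, 0.1
beta = np.zeros(d)
beta[0] = 1.0                                         # unit-norm, 1-sparse

X = rng.standard_normal((n, d))
t = X @ beta
use_f1 = rng.random(n) < alpha                        # pick f1 with prob. alpha
Y = np.where(use_f1, t + t**2, t**2) + sigma * rng.standard_normal(n)

# S_1(Y) = Cov(Y, X^T beta) = alpha * 1 + (1 - alpha) * 0 = alpha.
s1 = np.cov(Y, t)[0, 1]
# S_2(Y, psi) with psi(y) = y: Cov(Y, (X^T beta)^2) = 2 for any alpha.
s2 = np.cov(Y, t**2)[0, 1]
```

So $\alpha$ interpolates continuously between the "linear-like" regime $S_1 > 0$ and the symmetric regime $S_1 = 0$, while the second-order signal exploited under $H_1$ is unaffected.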
Moreover, to characterize the minimum required SNR, we consider the following parameter spaces corresponding to the null and alternative hypotheses,

$$\mathcal{G}_0 = \{(\beta^*, \sigma) \in \mathbb{R}^{d+1} : \beta^* = 0\}, \quad \mathcal{G}_1(s, \gamma_n) = \{(\beta^*, \sigma) \in \mathbb{R}^{d+1} : \|\beta^*\|_0 = s, \, \kappa(\beta^*, \sigma) \geq \gamma_n\}, \quad (2.8)$$

where $\{\gamma_n\}_{n=1}^\infty$ is a nonnegative sequence. For notational simplicity, we denote by $\theta^* = (\beta^*, \sigma)$ and by $\mathbb{P}^n_{\theta^*}$ the joint distribution of $\{z_i\}_{i=1}^n$, which are generated by the model in (2.6) with the parameter of interest $\theta^*$ and nuisance parameters $f_1$, $f_2$, and $\psi$. For any function $\phi$ that maps $z = (z_1, \ldots, z_n) \in \mathbb{R}^{(d+1) \times n}$ to $\{0, 1\}$, the worst-case risk for testing $H_0: \theta^* \in \mathcal{G}_0$ versus $H_1: \theta^* \in \mathcal{G}_1(s, \gamma_n)$ is defined as the sum of the maximum type-I and type-II errors,

$$R_n(\phi; \mathcal{G}_0, \mathcal{G}_1) = \sup_{\theta^* \in \mathcal{G}_0} \mathbb{P}_{\theta^*}(\phi = 1) + \sup_{\theta^* \in \mathcal{G}_1} \mathbb{P}_{\theta^*}(\phi = 0). \quad (2.9)$$

Correspondingly, the minimax risk is defined as

$$R^*_n(\mathcal{G}_0, \mathcal{G}_1) = \inf_{\phi} \sup_{f_1, f_2, \psi} R_n(\phi; \mathcal{G}_0, \mathcal{G}_1), \quad (2.10)$$

where we take the supremum over the nuisance parameters $f_1$, $f_2$, and $\psi$ of the models in (2.6), and the infimum over the test function $\phi$. We further define the minimax separation rate in the following.

Definition 2.2 (Minimax separation rate [32, 51]).
A sequence $\{\gamma^*_n\}_{n=1}^\infty$ is called the minimax separation rate if

(i) given any sequence $\{\gamma_n\}_{n=1}^\infty$ with $\gamma_n = o(\gamma^*_n)$, it holds that $\liminf_{n \to \infty} R^*_n(\mathcal{G}_0, \mathcal{G}_1(s, \gamma_n)) = 1$;

(ii) given any sequence $\{\gamma_n\}_{n=1}^\infty$ with $\gamma_n = \Omega(\gamma^*_n)$, it holds that $\lim_{n \to \infty} R^*_n(\mathcal{G}_0, \mathcal{G}_1(s, \gamma_n)) = 0$.

The minimax separation rate characterizes the minimum SNR that guarantees the existence of an asymptotically powerful test. Therefore, it captures the difficulty of the hypothesis testing problem in (2.7).

2.2 Oracle Computational Model

In what follows, we introduce an oracle computational model that quantifies the computational cost of an algorithm. Our model follows the one considered in [16, 54], which slightly extends the statistical query model originally proposed in [18-20, 30].

Definition 2.3 (Statistical query model). A statistical oracle $r$ responds to a given query function $q$ with $Z_q$, which is a random variable in $\mathbb{R}$. We define $\mathcal{Q} \subseteq \{q: \mathbb{R}^{d+1} \to [-M, M]\}$ as the space consisting of all the query functions.

We define an algorithm $\mathscr{A}$ as an iterative process that queries a given statistical oracle with query functions in $\mathcal{Q}_{\mathscr{A}} \subseteq \mathcal{Q}$ but does not access the data directly. We denote by $\mathcal{A}(T)$ the set of algorithms that query the statistical oracle for $T$ rounds, where $T$ is called the oracle complexity.
We denote by $\mathcal{R}[\xi, n, T, \eta(\mathcal{Q}_{\mathscr{A}})]$ the set of statistical oracles $r$ such that

$$\mathbb{P}\Big(\bigcap_{q \in \mathcal{Q}_{\mathscr{A}}} \big\{|Z_q - \mathbb{E}[q(Z)]| \leq \tau_q\big\}\Big) \geq 1 - 2\xi, \quad (2.11)$$

where $Z_q$ is the response of the statistical oracle $r$, $Z = (Y, X)$ is the random variable following the underlying statistical model, $\xi \in [0, 1)$ is the tail probability, and $\tau_q$ is the tolerance parameter given by

$$\tau_q = \frac{[\eta(\mathcal{Q}_{\mathscr{A}}) + \log(1/\xi)] \cdot M}{n} \vee \sqrt{\frac{2[\eta(\mathcal{Q}_{\mathscr{A}}) + \log(1/\xi)] \cdot \big(M^2 - \{\mathbb{E}[q(Y, X)]\}^2\big)}{n}}. \quad (2.12)$$

Here the parameter $\eta(\mathcal{Q}_{\mathscr{A}})$ is a logarithmic measure of the capacity of $\mathcal{Q}_{\mathscr{A}}$. For a countable $\mathcal{Q}_{\mathscr{A}}$, we have $\eta(\mathcal{Q}_{\mathscr{A}}) = \log(|\mathcal{Q}_{\mathscr{A}}|)$. For an uncountable $\mathcal{Q}_{\mathscr{A}}$, the magnitude $\eta(\mathcal{Q}_{\mathscr{A}})$ can be the Vapnik-Chervonenkis dimension or the metric entropy.

The intuition behind Definition 2.3 is to separate the algorithm from the dataset. Under this definition, the algorithms we consider are blackbox systems that access the necessary information through a statistical oracle. The definition of the statistical oracle $r \in \mathcal{R}[\xi, n, T, \eta(\mathcal{Q}_{\mathscr{A}})]$ is a generalization of the sample average. Note that it holds that

$$M^2 - \{\mathbb{E}[q(Y, X)]\}^2 \geq \mathrm{Var}[q(Y, X)]. \quad (2.13)$$

If the response $Z_q$ of the statistical oracle is the sample mean of $n$ independent realizations of $q(Z)$, then (2.11) follows from Bernstein's inequality coupled with a uniform concentration argument over $\mathcal{Q}_{\mathscr{A}}$, where the variance term is replaced by its upper bound in (2.13) [16].

To capture the computational difficulty of the hypothesis testing problem in (2.7), we introduce the following definition of the computational minimax separation risk, which is an analog of the minimax separation risk defined in (2.10) with an additional constraint on the oracle complexity.
We consider the algorithms $\mathscr{A} \in \mathcal{A}(T)$ associated with a statistical oracle $r \in \mathcal{R}[\xi, n, T, \eta(\mathcal{Q}_{\mathscr{A}})]$, and denote by $\mathcal{H}(\mathscr{A}, r)$ the set of all test functions based on $\mathscr{A} \in \mathcal{A}(T)$, which queries $r \in \mathcal{R}[\xi, n, T, \eta(\mathcal{Q}_{\mathscr{A}})]$ for $T$ rounds. We define the risk of a test function $\phi \in \mathcal{H}(\mathscr{A}, r)$ as

$$\bar{R}_n(\phi; \mathcal{G}_0, \mathcal{G}_1) = \sup_{\theta^* \in \mathcal{G}_0} \bar{\mathbb{P}}_{\theta^*}(\phi = 1) + \sup_{\theta^* \in \mathcal{G}_1} \bar{\mathbb{P}}_{\theta^*}(\phi = 0). \quad (2.14)$$

Correspondingly, we define the computational minimax risk as

$$\bar{R}^*_n(\mathcal{G}_0, \mathcal{G}_1; \mathscr{A}, r) = \inf_{\phi \in \mathcal{H}(\mathscr{A}, r)} \sup_{f_1, f_2, \psi} \bar{R}_n(\phi; \mathcal{G}_0, \mathcal{G}_1). \quad (2.15)$$

The probability $\bar{\mathbb{P}}_{\theta^*}$ in the above formulation is taken over the distribution of the responses from the statistical oracle $r$ under the model in (2.6) with the parameter of interest $\theta^*$ and nuisance parameters $f_1$, $f_2$, and $\psi$. We introduce the following definition of the computational minimax separation rate [18, 19, 54].

Definition 2.4 (Computational minimax separation rate).
A sequence $\{\bar{\gamma}^*_n\}_{n=1}^\infty$ is called the computational minimax separation rate if

(i) given any sequence $\{\gamma_n\}_{n=1}^\infty$ with $\gamma_n = o(\bar{\gamma}^*_n)$, for any constant $\mu > 0$ and any $\mathscr{A} \in \mathcal{A}(d^\mu)$, there exists a statistical oracle $r \in \mathcal{R}[\xi, n, d^\mu, \eta(\mathcal{Q}_{\mathscr{A}})]$ such that

$$\liminf_{n \to \infty} \bar{R}^*_n(\mathcal{G}_0, \mathcal{G}_1(s, \gamma_n); \mathscr{A}, r) = 1;$$

(ii) given any sequence $\{\gamma_n\}_{n=1}^\infty$ with $\gamma_n = \Omega(\bar{\gamma}^*_n)$, there exists an algorithm $\mathscr{A} \in \mathcal{A}(d^\mu)$ with some absolute constant $\mu$ such that it holds for any statistical oracle $r \in \mathcal{R}[\xi, n, d^\mu, \eta(\mathcal{Q}_{\mathscr{A}})]$ that

$$\lim_{n \to \infty} \bar{R}^*_n(\mathcal{G}_0, \mathcal{G}_1(s, \gamma_n); \mathscr{A}, r) = 0.$$

In the following section, we give the explicit forms of $\gamma^*_n$ and $\bar{\gamma}^*_n$. In particular, when the link function $f$ deviates from the class $\mathcal{C}_1$, a gap between $\bar{\gamma}^*_n$ and $\gamma^*_n$ arises, which characterizes the computational cost to pay for the lack of the first-order Stein's association defined in Definition 2.1.

3 Main Results

In this section, we lay out the theoretical results. For the hypothesis testing problem in (2.7), we establish the information-theoretic and computational lower bounds by constructing a worst-case hypothesis testing problem. We further establish upper bounds that attain these lower bounds up to logarithmic factors, which are deferred to §A. These lower and upper bounds together characterize the statistical-computational tradeoff. Finally, we show that such a tradeoff in hypothesis testing implies similar computational barriers in parameter estimation.

3.1 Lower Bounds

In what follows, we present lower bounds on the minimax and computational minimax separation rates defined in Definitions 2.2 and 2.4, respectively.
For the hypothesis testing problem in (2.7) with parameter spaces defined in (2.8), we have the following proposition, which characterizes its information-theoretic difficulty.

Proposition 3.1. We assume that $\beta^*$ in (2.6) is sparse such that $s = o(d^{1/2 - \delta})$ for some positive absolute constant $\delta$. For

$$\gamma_n = o\Big(\sqrt{\frac{s \log d}{n}} \wedge \frac{1}{\alpha^2} \cdot \frac{s \log d}{n}\Big), \quad (3.1)$$

it holds that $\liminf_{n \to \infty} R^*_n[\mathcal{G}_0, \mathcal{G}_1(s, \gamma_n)] \geq 1$. In other words, any test for the hypothesis testing problem in (2.7) and (2.8) is asymptotically powerless.

Proof. See §B.1 for a detailed proof.

It follows from Proposition 3.1 that any sequence satisfying (ii) of Definition 2.2 is asymptotically lower bounded by any sequence that satisfies (3.1). As a result, it holds that

$$\gamma^*_n = \Omega\Big(\sqrt{\frac{s \log d}{n}} \wedge \frac{1}{\alpha^2} \cdot \frac{s \log d}{n}\Big), \quad (3.2)$$

where $\gamma^*_n$ is the minimax separation rate defined in Definition 2.2. Based on (3.2) and the upper bound in Theorem A.2, which is deferred to §A, up to logarithmic factors, the minimax separation rate defined in Definition 2.2 takes the form

$$\gamma^*_n = \sqrt{\frac{s \log d}{n}} \wedge \frac{1}{\alpha^2} \cdot \frac{s \log d}{n}. \quad (3.3)$$

The following theorem establishes a lower bound on the computational minimax separation rate defined in Definition 2.4.

Theorem 3.2. We assume that $\beta^*$ in (2.6) is sparse such that $s = o(d^{1/2 - \delta})$ for some positive absolute constant $\delta$. For any positive absolute constant $\mu$ and $\mathscr{A} \in \mathcal{A}(d^\mu)$ with

$$\gamma_n = o\Big(\Big\{\sqrt{\frac{s^2}{n}} \wedge \frac{1}{\alpha^2} \cdot \frac{s}{n}\Big\} \vee \gamma^*_n\Big), \quad (3.4)$$

there exists a statistical oracle $r \in \mathcal{R}[\xi, n, d^\mu, \eta(\mathcal{Q})]$ such that $\liminf_{n \to \infty} \bar{R}^*_n(\mathcal{G}_0, \mathcal{G}_1; \mathscr{A}, r) \geq 1$. In other words, any computationally tractable test for the hypothesis testing problem in (2.7) and (2.8) is asymptotically powerless.

Proof. See §B.2 for a detailed proof.

It follows from Theorem 3.2 that any sequence satisfying (ii) of Definition 2.4 is asymptotically lower bounded by any sequence that satisfies (3.4). As a result, it holds that

$$\bar{\gamma}^*_n = \Omega\Big(\Big\{\sqrt{\frac{s^2}{n}} \wedge \frac{1}{\alpha^2} \cdot \frac{s}{n}\Big\} \vee \gamma^*_n\Big), \quad (3.5)$$

where $\gamma^*_n$ and $\bar{\gamma}^*_n$ are the minimax and computational minimax separation rates defined in Definitions 2.2 and 2.4, respectively. Based on (3.5) and the upper bound in Theorem A.3, which is deferred to §A, up to logarithmic factors, the computational minimax separation rate defined in Definition 2.4 takes the form

$$\bar{\gamma}^*_n = \sqrt{\frac{s^2}{n}} \wedge \frac{1}{\alpha^2} \cdot \frac{s \log d}{n}. \quad (3.6)$$

3.2 Phase Transition

In what follows, we characterize the phase transition in the minimax and computational minimax separation rates as the mixture probability $\alpha$ transits from zero to one. We categorize the phase transition into the following regimes in terms of $\alpha$.

1.
For 0 < \u03b1 \u2264 ((log d)2/n)1/4, our results show that \u03b3\u2217\n\nn =(cid:112)s log d/n and \u00af\u03b3\u2217\n\npowerful test for (2.7) is computationally intractable with superpolynomial oracle complex-\n\nn =(cid:112)s2/n.\nFor \u03b3n = o((cid:112)s log d/n) , any test for the hypothesis testing problem in (2.7) is asymp-\ntotically powerless. For \u03b3n = \u2126((cid:112)s log d/n) and \u03b3n = o((cid:112)s2/n), any asymptotically\nity de\ufb01ned in De\ufb01nition 2.3. For \u03b3n = \u2126((cid:112)s2/n), there exists an asymptotically powerful\nn = (cid:112)s log d/n and\nn = 1/\u03b12 \u00b7 s log d/n. For \u03b3n = o((cid:112)s log d/n), any test is asymptotically powerless.\nFor \u03b3n = \u2126((cid:112)s log d/n) and \u03b3n = o(1/\u03b12 \u00b7 s log d/n), any asymptotically powerful test\n\ntest that is computationally tractable with polynomial oracle complexity. In this regime, the\ngap between the computational minimax separation rate \u00af\u03b3\u2217\nn and the minimax separation rate\n\u03b3\u2217\nn is invariant to \u03b1.\n\n\u00af\u03b3\u2217\nfor (2.7) is computationally intractable. For \u03b3n = \u2126(1/\u03b12 \u00b7 s log d/n), there exists an\nasymptotically powerful test that is computationally tractable. In this regime, a larger \u03b1\nimplies a smaller gap between \u00af\u03b3\u2217\n\n2. For (log2 d/n)1/4 \u2264 \u03b1 \u2264 (s log d/n)1/4, our results show that \u03b3\u2217\n\nn and \u03b3\u2217\nn.\n\n3. For (s log d/n)1/4 < \u03b1 \u2264 1, our results show that \u03b3\u2217\n\nn = 1/\u03b12 \u00b7 s log d/n. For \u03b3n =\no(1/\u03b12 \u00b7 s log d/n), any test for the hypothesis testing problem in (2.7) is asymptotically\npowerless, whereas for \u03b3n = \u2126(1/\u03b12 \u00b7 s log d/n), there exists an asymptotically powerful\ntest that is computationally tractable. 
In this regime, the gap between $\gamma^*_n$ and $\bar{\gamma}^*_n$ vanishes.

By the normalization specified following (2.7), the mixture probability $\alpha$ characterizes the first-order Stein's association of the model under the alternative hypothesis. Therefore, the phase transition implies that when the first-order Stein's association attains its maximum, which corresponds to $\alpha = 1$, the gap between the computational minimax separation rate $\bar{\gamma}^*_n$ and the minimax separation rate $\gamma^*_n$ vanishes, whereas when the first-order Stein's association vanishes, which corresponds to $\alpha = 0$, the gap between $\bar{\gamma}^*_n$ and $\gamma^*_n$ attains its maximum. In other words, the lack of first-order Stein's association leads to an extra price in computational cost.

3.3 Implication for Parameter Estimation

For the model in (2.6), our result on the computational minimax separation rate in §A implies computational barriers in the estimation of $\beta^*$, which we establish in the following theorem.

Theorem 3.3. For the estimation of $\beta^*$ in (2.6) with
$$n = o\Big(\frac{s^2}{\gamma_n^2} \,\wedge\, \frac{s\log d}{\gamma_n\cdot\alpha^2}\Big), \qquad (3.7)$$
where $\gamma_n = \|\beta^*\|_2^2/\sigma^2$, it holds that, for any positive absolute constant $\mu$ and any algorithm $A \in \mathcal{A}(T)$ that outputs $\hat{\beta}$ within oracle complexity $T = O(d^\mu)$, there exists a statistical oracle $r \in \mathcal{R}[\xi, n, T, \eta(Q)]$ such that
$$\bar{\mathbb{P}}\big(\|\hat{\beta} - \beta^*\|_2 \geq \sigma\|\beta^*\|_2^{-1}\cdot\gamma_n/4\big) \geq C, \qquad (3.8)$$
where $C$ is a positive absolute constant.

Proof.
See \u00a7B.5 for a detailed proof.\n\nFor \u03b1 = 0, the estimation of \u03b2\u2217 in (2.6) reduces to\nthe sparse phase retrieval problem. For simplicity\nof discussion, let \u03b3n = (cid:107)\u03b2\u2217(cid:107)2\n2/\u03c32 be a constant in\nthe following discussions. Theorem 3.3 implies that\nfor n = o(s2), any computationally tractable esti-\nmator is statistically inconsistent in the sense that\n\n(cid:107)(cid:98)\u03b2 \u2212 \u03b2\u2217(cid:107)2 \u2265 C holds with at least constant prob-\n\nFigure 1: Phase transition in the gap between\nminimax separation rate and computational\nminimax seperation rate: (i) for 0 < \u03b1 \u2264\n((log d)2/n)1/4, the gap is invariant to \u03b1. (ii)\nfor (log2 d/n)1/4 \u2264 \u03b1 \u2264 (s log d/n)1/4,\na larger \u03b1 implies a smaller gap.\n(iii) for\n(s log d/n)1/4 < \u03b1 \u2264 1, the gap vanishes.\n\nability. [8] construct a computational tractable esti-\nmator for sparse phase retrieval with the quadratic\nlink function Y = |X(cid:62)\u03b2\u2217|2 + \u0001. The estimator by\n[8] is statistically consistent under the assumption\nthat n \u2265 C(1 + \u03c32/(cid:107)\u03b2\u2217(cid:107)4\n2) \u00b7 s2 log d. Similar phe-\nnomenon arises in misspeci\ufb01ed sparse phase retrieval\nstudied by [42], although their work is slightly more\ngeneral, in the sense that they consider f (X(cid:62)\u03b2\u2217, \u0001)\nas the link function. The estimator by [42] requires\nn \u2265 Cs2 log d to be statistically consistent. Both\n[8] and [42] conjecture that their requirements on the\nsample size cannot be relaxed for computationally\ntractable estimators. Theorem 3.3 con\ufb01rms this con-\njecture for the sparse phase retrieval problem under\nthe statistical query model de\ufb01ned in De\ufb01nition 2.3.\nFor \u03b1 = 1, the requirement for a computationally\ntractable estimator to be statistically consistent be-\ncomes n \u2265 Cs log d. 
Such a sample size requirement agrees with the information-theoretic lower bound. [43] construct a computationally tractable estimator of $\beta^*$, which requires the sample size $n \geq Cs\log(d/s)$ to be statistically consistent. It follows from Theorem 3.3 that such a requirement is necessary.

For $0 < \alpha < 1$, we observe a phase transition in the required sample size in terms of $\alpha$, which is similar to the phase transition of the computational minimax separation rates. For $0 < \alpha \leq \sqrt{\gamma_n\log d/s}$, the requirement becomes $n \geq Cs^2$. For $\sqrt{\gamma_n\log d/s} \leq \alpha \leq 1$, the requirement becomes $n \geq Cs\log d/\alpha^2$. In this regime, a larger $\alpha$ implies a smaller sample size required for a computationally tractable estimator to be statistically consistent.
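To make the phase transition concrete, the following minimal sketch (not part of the paper; all constant factors are set to one and the parameter values $n = 10{,}000$, $d = 1{,}000$, $s = 20$ are chosen purely for illustration) evaluates the minimax rate $\gamma^*_n$, the computational minimax rate $\bar{\gamma}^*_n$ from (3.6), and the sample-size threshold suggested by Theorem 3.3, as functions of the mixture probability $\alpha$:

```python
import math

def minimax_rate(n, d, s, alpha):
    """gamma*_n up to constants: sqrt(s log d / n) for small alpha,
    (1/alpha^2) * s log d / n once alpha exceeds (s log d / n)^{1/4}."""
    base = math.sqrt(s * math.log(d) / n)
    if alpha == 0:
        return base
    return min(base, s * math.log(d) / (n * alpha ** 2))

def comp_minimax_rate(n, d, s, alpha):
    """bar gamma*_n = sqrt(s^2/n) ∧ (1/alpha^2) * s log d / n, as in (3.6)."""
    base = math.sqrt(s ** 2 / n)
    if alpha == 0:
        return base
    return min(base, s * math.log(d) / (n * alpha ** 2))

def required_sample_size(d, s, alpha, gamma_n=1.0):
    """Sample size (up to constants) below which Theorem 3.3 rules out
    consistent computationally tractable estimation:
    min(s^2 / gamma_n^2, s log d / (gamma_n alpha^2))."""
    if alpha == 0:
        return s ** 2 / gamma_n ** 2
    return min(s ** 2 / gamma_n ** 2,
               s * math.log(d) / (gamma_n * alpha ** 2))

n, d, s = 10_000, 1_000, 20
for alpha in (0.05, 0.3, 1.0):
    # The ratio bar gamma*_n / gamma*_n shrinks as alpha grows
    # and reaches 1 (no computational-statistical gap) at alpha = 1.
    gap = comp_minimax_rate(n, d, s, alpha) / minimax_rate(n, d, s, alpha)
    n_req = required_sample_size(d, s, alpha)
    print(f"alpha={alpha}: gap={gap:.3f}, n_required~{n_req:.0f}")
```

For these parameter values, the regime boundaries are $((\log d)^2/n)^{1/4} \approx 0.26$ and $(s\log d/n)^{1/4} \approx 0.34$, so the three chosen values of $\alpha$ land in the three regimes of §3.2.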