{"title": "Closed-form Estimators for High-dimensional Generalized Linear Models", "book": "Advances in Neural Information Processing Systems", "page_first": 586, "page_last": 594, "abstract": "We propose a class of closed-form estimators for GLMs under high-dimensional sampling regimes. Our class of estimators is based on deriving closed-form variants of the vanilla unregularized MLE but which are (a) well-defined even under high-dimensional settings, and (b) available in closed-form. We then perform thresholding operations on this MLE variant to obtain our class of estimators. We derive a unified statistical analysis of our class of estimators, and show that it enjoys strong statistical guarantees in both parameter error as well as variable selection, that surprisingly match those of the more complex regularized GLM MLEs, even while our closed-form estimators are computationally much simpler. We derive instantiations of our class of closed-form estimators, as well as corollaries of our general theorem, for the special cases of logistic, exponential and Poisson regression models. We corroborate the surprising statistical and computational performance of our class of estimators via extensive simulations.", "full_text": "Closed-form Estimators for High-dimensional\n\nGeneralized Linear Models\n\nEunho Yang\n\nIBM T.J. Watson Research Center\n\neunhyang@us.ibm.com\n\nAur\u00b4elie C. Lozano\n\nIBM T.J. Watson Research Center\n\naclozano@us.ibm.com\n\nPradeep Ravikumar\n\nUniversity of Texas at Austin\n\npradeepr@cs.utexas.edu\n\nAbstract\n\nWe propose a class of closed-form estimators for GLMs under high-dimensional\nsampling regimes. Our class of estimators is based on deriving closed-form vari-\nants of the vanilla unregularized MLE but which are (a) well-de\ufb01ned even under\nhigh-dimensional settings, and (b) available in closed-form. We then perform\nthresholding operations on this MLE variant to obtain our class of estimators. 
We derive a unified statistical analysis of our class of estimators, and show that it enjoys strong statistical guarantees in both parameter error as well as variable selection, that surprisingly match those of the more complex regularized GLM MLEs, even while our closed-form estimators are computationally much simpler. We derive instantiations of our class of closed-form estimators, as well as corollaries of our general theorem, for the special cases of logistic, exponential and Poisson regression models. We corroborate the surprising statistical and computational performance of our class of estimators via extensive simulations.

1 Introduction

We consider the estimation of generalized linear models (GLMs) [1], under high-dimensional settings where the number of variables $p$ may greatly exceed the number of observations $n$. GLMs are a very general class of statistical models for the conditional distribution of a response variable given a covariate vector, where the form of the conditional distribution is specified by any exponential family distribution. Popular instances of GLMs include logistic regression, which is widely used for binary classification, as well as Poisson regression, which together with logistic regression, is widely used in key tasks in genomics, such as classifying the status of patients based on genotype data [2] and identifying genes that are predictive of survival [3], among others. Recently, GLMs have also been used as a key tool in the construction of graphical models [4]. Overall, GLMs have proven very useful in many modern applications involving prediction with high-dimensional data. Accordingly, an important problem is the estimation of such GLMs under high-dimensional sampling regimes.
Under such sampling regimes, it is now well-known that consistent estimators cannot be obtained unless low-dimensional structural constraints are imposed upon the underlying regression model parameter vector. Popular structural constraints include that of sparsity, which encourages parameter vectors supported on very few non-zero entries, group-sparse constraints, and low-rank structure with matrix-structured parameters, among others. Several lines of work have focused on consistent estimators for such structurally constrained high-dimensional GLMs. A popular instance, for the case of sparsity-structured GLMs, is the $\ell_1$-regularized maximum likelihood estimator (MLE), which has been shown to have strong theoretical guarantees, ranging from risk consistency [5], consistency in the $\ell_1$- and $\ell_2$-norm [6, 7, 8], and model selection consistency [9]. Another popular instance is the $\ell_1/\ell_q$ (for $q \ge 2$) regularized MLE for group-sparse-structured logistic regression, for which prediction consistency has been established [10]. All of these estimators solve general non-linear convex programs involving non-smooth components due to regularization. While a strong line of research has developed computationally efficient optimization methods for solving these programs, these methods are iterative and their computational complexity scales polynomially with the number of variables and samples [10, 11, 12, 13], making them expensive for very large-scale problems.

A key reason for the popularity of these iterative methods is that while the number of iterations is some function of the required accuracy, each iteration itself consists of a small finite number of steps, and can thus scale to very large problems. But what if we could construct estimators that overall require only a very small finite number of steps, akin to a single iteration of popular iterative optimization methods?
The computational gains of such an approach would require that the steps themselves be suitably constrained, and moreover that the steps could be suitably profiled and optimized (e.g. efficient linear algebra routines implemented in BLAS libraries), a systematic study of which we defer to future work. We are motivated on the other hand by the simplicity of such a potential class of "closed-form" estimators.

In this paper, we thus address the following question: "Is it possible to obtain closed-form estimators for GLMs under high-dimensional settings, that nonetheless have the sharp convergence rates of the regularized convex programs and other estimators noted above?" This question was first considered for linear regression models [14], and was answered in the affirmative. Our goal is to see whether a positive response can be provided for the more complex statistical model class of GLMs as well. In this paper we focus specifically on the class of sparse-structured GLMs, though our framework should extend to more general structures as well.

One inkling of why closed-form estimation for high-dimensional GLMs is much trickier than for high-dimensional linear models is that, under small-sample settings, linear regression models do have a statistically efficient closed-form estimator: the ordinary least-squares (OLS) estimator, which also serves as the MLE under Gaussian noise. For GLMs on the other hand, even under small-sample settings, we do not yet have statistically efficient closed-form estimators. A classical algorithm to solve for the MLE of logistic regression models, for instance, is the iteratively reweighted least squares (IRLS) algorithm, which, as its name suggests, is iterative and not available in closed form.
Indeed, as we show in the sequel, developing our class of estimators for GLMs requires far more advanced mathematical machinery (moment polytopes, and projections onto an interior subset of these polytopes, for instance) than the linear regression case.

Our starting point in devising a closed-form estimator for GLMs is nonetheless to revisit this classical unregularized MLE estimator for GLMs from a statistical viewpoint, and investigate the reasons why the estimator fails or is even ill-defined in the high-dimensional setting. These insights enable us to propose variants of the MLE that are not only well-defined but can also be easily computed in analytic form. We provide a unified statistical analysis for our class of closed-form GLM estimators, and instantiate our theoretical results for the specific cases of logistic, exponential, and Poisson regressions. Surprisingly, our results indicate that our estimators have comparable statistical guarantees to the regularized MLEs, in terms of both variable selection and parameter estimation error, which we also corroborate via extensive simulations (which surprisingly even show a slight statistical performance edge for our closed-form estimators). Moreover, our closed-form estimators are much simpler and competitive computationally, as is corroborated by our extensive simulations. With respect to the conditions we impose on the GLM models, we require that the population covariance matrix of our covariates be weakly sparse, which is a different condition than those typically imposed for regularized MLE estimators; we discuss this further in Section 3.2.
Overall, we hope our simple class of statistically as well as computationally efficient closed-form estimators for GLMs would open up the use of GLMs in large-scale machine learning applications even to lay users on the one hand, and on the other hand, encourage the development of new classes of "simple" estimators with strong statistical guarantees extending the initial proposals in this paper.

2 Setup

We consider the class of generalized linear models (GLMs), where a response variable $y \in \mathcal{Y}$, conditioned on a covariate vector $x \in \mathbb{R}^p$, follows an exponential family distribution:

$\mathbb{P}(y|x; \theta^*) = \exp\big\{ \big( h(y) + y \langle \theta^*, x \rangle - A(\langle \theta^*, x \rangle) \big) / c(\sigma) \big\}$   (1)

where $\sigma \in \mathbb{R}$, $\sigma > 0$, is a fixed and known scale parameter, $\theta^* \in \mathbb{R}^p$ is the GLM parameter of interest, and $A(\langle \theta^*, x \rangle)$ is the log-partition function or the log-normalization constant of the distribution. Our goal is to estimate the GLM parameter $\theta^*$ given $n$ i.i.d. samples $\{(x^{(i)}, y^{(i)})\}_{i=1}^n$. By properties of exponential families, the conditional moment of the response given the covariates can be written as $\mu(\langle \theta^*, x \rangle) \equiv \mathbb{E}(y|x; \theta^*) = A'(\langle \theta^*, x \rangle)$.

Examples. Popular instances of (1) include the standard linear regression model, the logistic regression model, and the Poisson regression model, among others. In the case of the linear regression model, we have a response variable $y \in \mathbb{R}$, with the conditional distribution $\mathbb{P}(y|x, \theta^*) \propto \exp\{ -y^2/2 + y \langle \theta^*, x \rangle - \langle \theta^*, x \rangle^2 / 2 \}$, where the log-partition function (or log-normalization constant) $A(\cdot)$ of (1) in this specific case is given by $A(a) = a^2/2$.
Another popular GLM instance is the logistic regression model $\mathbb{P}(y|x, \theta^*)$, for a categorical output variable $y \in \mathcal{Y} \equiv \{-1, 1\}$: $\exp\{ y \langle \theta^*, x \rangle - \log[ \exp(\langle \theta^*, x \rangle) + \exp(-\langle \theta^*, x \rangle) ] \}$, where the log-partition function is $A(a) = \log(\exp(a) + \exp(-a))$. The exponential regression model $\mathbb{P}(y|x, \theta^*)$ in turn is given by $\exp\{ y \langle \theta^*, x \rangle + \log(-\langle \theta^*, x \rangle) \}$. Here, the domain of the response variable, $\mathcal{Y} = \mathbb{R}_+$, is the set of non-negative real numbers (it is typically used to model time intervals between events, for instance), and the log-partition function is $A(a) = -\log(-a)$. Our final example is the Poisson regression model, $\mathbb{P}(y|x, \theta^*)$: $\exp\{ -\log(y!) + y \langle \theta^*, x \rangle - \exp(\langle \theta^*, x \rangle) \}$, where the response variable is count-valued with domain $\mathcal{Y} \equiv \{0, 1, 2, \ldots\}$, and with log-partition function $A(a) = \exp(a)$.

Any exponential family distribution can be used to derive a canonical GLM regression model (1) of a response $y$ conditioned on covariates $x$, by setting the canonical parameter of the exponential family distribution to $\langle \theta^*, x \rangle$. For the parameterization to be valid, the conditional density should be normalizable, so that $A(\langle \theta^*, x \rangle) < +\infty$.

High-dimensional Estimation. Suppose that we are given $n$ covariate vectors $x^{(i)} \in \mathbb{R}^p$, drawn i.i.d. from some distribution, and corresponding response variables $y^{(i)} \in \mathcal{Y}$, drawn from the distribution $\mathbb{P}(y|x^{(i)}, \theta^*)$ in (1). A key goal in statistical estimation is to estimate the parameters $\theta^* \in \mathbb{R}^p$, given just the samples $\{(x^{(i)}, y^{(i)})\}_{i=1}^n$. Such estimation becomes particularly challenging in a high-dimensional regime, where the dimension of the covariate vector $p$ is potentially even larger than the number of samples $n$. In such high-dimensional regimes, it is well understood that structural constraints on $\theta^*$ are necessary in order to find consistent estimators.
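As a concrete illustration of the example models above, the log-partition functions $A$ and the mean maps $A' = \mu$ can be written down directly; this is a minimal sketch of our own (the dictionary layout and names are ours, not from the paper):

```python
import numpy as np

# Log-partition functions A(a) for the canonical parameter a = <theta*, x>.
A = {
    "linear":      lambda a: a**2 / 2.0,            # Gaussian response
    "logistic":    lambda a: np.logaddexp(a, -a),   # log(e^a + e^-a), labels in {-1,+1}
    "exponential": lambda a: -np.log(-a),           # requires a < 0 to normalize
    "poisson":     lambda a: np.exp(a),
}

# Mean functions mu(a) = A'(a) = E[y | x], by the exponential-family moment identity.
A_prime = {
    "linear":      lambda a: a,
    "logistic":    lambda a: np.tanh(a),            # (e^a - e^-a) / (e^a + e^-a)
    "exponential": lambda a: -1.0 / a,              # mean of an exponential with rate -a
    "poisson":     lambda a: np.exp(a),
}
```

Note, for instance, that the logistic mean $\tanh(a)$ lies strictly inside $(-1, 1)$ for any finite $a$, while the observed labels sit on the boundary; this gap is what later motivates the projection step.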
In this paper, we focus on the structural constraint of element-wise sparsity, so that the number of non-zero elements in $\theta^*$ is less than or equal to some value $k$ much smaller than $p$: $\|\theta^*\|_0 \le k$.

Estimators: Regularized Convex Programs. The $\ell_1$ norm is known to encourage the estimation of such sparse-structured parameters $\theta^*$. Accordingly, a popular class of M-estimators for sparse-structured GLM parameters is the $\ell_1$-regularized maximum log-likelihood estimator for (1). Given $n$ samples $\{(x^{(i)}, y^{(i)})\}_{i=1}^n$ from $\mathbb{P}(y|x, \theta^*)$, the $\ell_1$-regularized MLE can be written as: $\mathrm{minimize}_\theta\; -\langle \theta, \frac{1}{n} \sum_{i=1}^n y^{(i)} x^{(i)} \rangle + \frac{1}{n} \sum_{i=1}^n A(\langle \theta, x^{(i)} \rangle) + \lambda_n \|\theta\|_1$. For notational simplicity, we collate the $n$ observations in vector and matrix forms, where we overload the notation $y \in \mathbb{R}^n$ to denote the vector of $n$ responses, so that the $i$-th element $y_i$ of $y$ is $y^{(i)}$, and $X \in \mathbb{R}^{n \times p}$ to denote the design matrix whose $i$-th row is $[x^{(i)}]^\top$. With this notation we can rewrite the optimization problem characterizing the $\ell_1$-regularized MLE simply as $\mathrm{minimize}_\theta\; -\frac{1}{n} \theta^\top X^\top y + \frac{1}{n} \mathbf{1}^\top A(X\theta) + \lambda_n \|\theta\|_1$, where we overload the notation $A(\cdot)$ for an input vector $\eta \in \mathbb{R}^n$ to denote $A(\eta) \equiv (A(\eta_1), A(\eta_2), \ldots, A(\eta_n))^\top$, and $\mathbf{1} \equiv (1, \ldots, 1)^\top \in \mathbb{R}^n$.

3 Closed-form Estimators for High-dimensional GLMs

The goal of this paper is to derive a general class of closed-form estimators for high-dimensional GLMs, in contrast to solving huge, non-differentiable $\ell_1$-regularized optimization problems. Before introducing our class of such closed-form estimators, we first introduce some notation.

For any $u \in \mathbb{R}^p$, we use $[S_\lambda(u)]_i = \mathrm{sign}(u_i) \max(|u_i| - \lambda, 0)$ to denote the element-wise soft-thresholding operator, with thresholding parameter $\lambda$. For any given matrix $M \in \mathbb{R}^{p \times p}$, we denote by $T_\nu(M) : \mathbb{R}^{p \times p} \mapsto \mathbb{R}^{p \times p}$ a family of matrix thresholding operators that are defined point-wise, so that they can be written as $[T_\nu(M)]_{ij} := \rho_\nu(M_{ij})$, for any scalar thresholding operator $\rho_\nu(\cdot)$ that satisfies the following conditions: for any input $a \in \mathbb{R}$, (a) $|\rho_\nu(a)| \le |a|$, (b) $\rho_\nu(a) = 0$ for $|a| \le \nu$, and (c) $|\rho_\nu(a) - a| \le \nu$. The standard soft-thresholding and hard-thresholding operators are both pointwise operators that satisfy these properties. See [15] for further discussion of such pointwise matrix thresholding operators.

For any $\eta \in \mathbb{R}^n$, we let $\nabla A(\eta)$ denote the element-wise gradients: $\nabla A(\eta) \equiv (A'(\eta_1), A'(\eta_2), \ldots, A'(\eta_n))^\top$. We assume that the exponential family underlying the GLM is minimal, so that this map is invertible, and so that for any $\mu \in \mathbb{R}^n$ in the range of $\nabla A(\cdot)$, we can denote by $[\nabla A]^{-1}(\mu)$ the element-wise inverse map of $\nabla A(\cdot)$: $((A')^{-1}(\mu_1), (A')^{-1}(\mu_2), \ldots, (A')^{-1}(\mu_n))^\top$.

Consider the response moment polytope $\mathcal{M} := \{\mu : \mu = \mathbb{E}_p[y]$, for some distribution $p$ over $y \in \mathcal{Y}\}$, and let $\mathcal{M}^o$ denote the interior of $\mathcal{M}$. Our closed-form estimator will use a carefully selected subset

$\bar{\mathcal{M}} \subseteq \mathcal{M}^o.$   (2)

Denote the projection of a response variable $y \in \mathcal{Y}$ onto this subset as $\Pi_{\bar{\mathcal{M}}}(y) = \arg\min_{\mu \in \bar{\mathcal{M}}} |y - \mu|$, where the subset $\bar{\mathcal{M}}$ is selected so that the projection step is always well-defined, and the minimum exists. Given a vector $y \in \mathcal{Y}^n$, we denote the vector of element-wise projections of entries in $y$ as $\Pi_{\bar{\mathcal{M}}}(y)$, so that:

$[\Pi_{\bar{\mathcal{M}}}(y)]_i := \Pi_{\bar{\mathcal{M}}}(y_i).$   (3)

As the conditions underlying our theorem will make clear, we will need the operator $[\nabla A]^{-1}(\cdot)$ defined above to be both well-defined and Lipschitz in the subset $\bar{\mathcal{M}}$ of the interior of the response moment polytope.
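In code, the element-wise soft-thresholding operator $S_\lambda$ and one admissible matrix thresholding operator $T_\nu$ can be sketched as follows (our own helper names; hard-thresholding is used for $T_\nu$ here, but any pointwise rule obeying conditions (a) through (c) would do):

```python
import numpy as np

def soft_threshold(u, lam):
    """Element-wise soft-thresholding: [S_lam(u)]_i = sign(u_i) * max(|u_i| - lam, 0)."""
    return np.sign(u) * np.maximum(np.abs(u) - lam, 0.0)

def matrix_threshold(M, nu):
    """Pointwise hard-thresholding T_nu: zero out entries with |M_ij| <= nu.
    Satisfies (a) |rho(a)| <= |a|, (b) rho(a) = 0 for |a| <= nu, (c) |rho(a) - a| <= nu."""
    return np.where(np.abs(M) > nu, M, 0.0)
```

Soft-thresholding of a matrix entry-wise, $\mathrm{sign}(M_{ij})\max(|M_{ij}| - \nu, 0)$, would satisfy the same three conditions.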
In later sections, we will show how to carefully construct such a subset $\bar{\mathcal{M}}$ for different GLM models. We now have the machinery to describe our class of closed-form estimators:

$\hat{\theta}_{\mathrm{Elem}} = S_{\lambda_n}\Big( \big[ T_\nu\big( \tfrac{X^\top X}{n} \big) \big]^{-1} \tfrac{X^\top [\nabla A]^{-1}(\Pi_{\bar{\mathcal{M}}}(y))}{n} \Big),$   (4)

where the various mathematical terms were defined above. It can be immediately seen that the estimator is available in closed form. In a later section, we will see instantiations of this class of estimators for various specific GLM models, where we will see that these estimators take very simple forms. Before doing so, we first describe some insights that led to our particular construction of the high-dimensional GLM estimator above.

3.1 Insights Behind Construction of Our Closed-Form Estimator

We first revisit the classical unregularized MLE for GLMs: $\hat{\theta} \in \arg\min_\theta\; -\frac{1}{n} \theta^\top X^\top y + \frac{1}{n} \mathbf{1}^\top A(X\theta)$. Note that this optimization problem does not have a unique minimum in general, especially under high-dimensional sample settings where $p > n$. Nonetheless, it is instructive to study why this unregularized MLE is either ill-suited or even ill-defined under high-dimensional settings. The stationary condition of the unregularized MLE optimization problem can be written as:

$X^\top y = X^\top \nabla A(X\hat{\theta}).$   (5)

There are two main caveats to solving for a unique $\hat{\theta}$ satisfying this stationary condition, which we clarify below.

(Mapping to mean parameters) In a high-dimensional sampling regime where $p \ge n$, (5) can be seen to reduce to $y = \nabla A(X\hat{\theta})$ (so long as $X^\top$ has rank $n$). This then suggests solving for $X\hat{\theta} = [\nabla A]^{-1}(y)$, where we recall the definition of the operator $\nabla A(\cdot)$ in terms of element-wise operations involving $A'(\cdot)$. The caveat however is that $A'(\cdot)$ is only onto the interior $\mathcal{M}^o$ of the response moment polytope [16], so that $[A']^{-1}(\cdot)$ is well-defined only when given $\mu \in \mathcal{M}^o$. When entries of the sample response vector $y$ however lie outside of $\mathcal{M}^o$, as will typically be the case and which we will illustrate for multiple instances of GLM models in later sections, the inverse mapping would not be well-defined. We thus first project the sample response vector $y$ onto $\bar{\mathcal{M}} \subseteq \mathcal{M}^o$ to obtain $\Pi_{\bar{\mathcal{M}}}(y)$ as defined in (3). Armed with this approximation, we then consider the more amenable $\Pi_{\bar{\mathcal{M}}}(y) \approx \nabla A(X\hat{\theta})$, instead of the original stationary condition in (5).

(Sample covariance) We thus now have the approximate characterization of the MLE as $X\hat{\theta} \approx [\nabla A]^{-1}(\Pi_{\bar{\mathcal{M}}}(y))$. This then suggests solving for an approximate MLE $\hat{\theta}$ via least squares as $\hat{\theta} = [X^\top X]^{-1} X^\top [\nabla A]^{-1}(\Pi_{\bar{\mathcal{M}}}(y))$. The high-dimensional regime with $p > n$ poses a caveat here, since the sample covariance matrix $(X^\top X)/n$ would then be rank-deficient, and hence not invertible. Our approach is to then use the thresholded sample covariance matrix $T_\nu(\frac{X^\top X}{n})$ defined in the previous subsection instead, which can be shown to be invertible and consistent to the population covariance matrix $\Sigma$ with high probability [15, 17]. In particular, recent work [15] has shown that the thresholded sample covariance $T_\nu(\frac{X^\top X}{n})$ is consistent with respect to the spectral norm, with convergence rate $\| T_\nu(\frac{X^\top X}{n}) - \Sigma \|_{\mathrm{op}} \le O\big( c_0 \sqrt{\frac{\log p}{n}} \big)$, under some mild conditions detailed in our main theorem.
Plugging in this thresholded sample covariance matrix, to get an approximate least squares solution for the GLM parameters $\theta$, and then performing soft-thresholding, precisely yields our closed-form estimator in (4).

Our class of closed-form estimators in (4) can thus be viewed as surgical approximations to the MLE so that it is well-defined in high-dimensional settings, as well as being available in closed form. But would such an approximation actually yield rigorous consistency guarantees? Surprisingly, as we show in the next section, not only is our class of estimators consistent, but in our corollaries we show that its statistical guarantees are comparable to those of state-of-the-art iterative estimators such as the regularized MLEs.

We note that our class of closed-form estimators in (4) can also be written in an equivalent form that is more amenable to analysis:

$\mathrm{minimize}_\theta\; \|\theta\|_1 \quad \mathrm{s.t.} \quad \big\| \theta - \big[ T_\nu\big( \tfrac{X^\top X}{n} \big) \big]^{-1} \tfrac{X^\top [\nabla A]^{-1}(\Pi_{\bar{\mathcal{M}}}(y))}{n} \big\|_\infty \le \lambda_n.$   (6)

The equivalence between (4) and (6) easily follows from the fact that the optimization problem (6) is decomposable into independent element-wise sub-problems, and each sub-problem corresponds to soft-thresholding. It can be seen that this form is also amenable to extending the framework in this paper to structures beyond sparsity, by substituting in alternative regularizers. Due to space constraints, the computational complexity is discussed in detail in the Appendix.

3.2 Statistical Guarantees

In this subsection, we provide a unified statistical analysis for the class of estimators (4) under the following standard conditions, namely sparse $\theta^*$ and sub-Gaussian design $X$:

(C1) The parameter $\theta^*$ in (1) is exactly sparse with $k$ non-zero elements indexed by the support set $S$, so that $\theta^*_{S^c} = 0$.

(C2) Each row of the design matrix $X \in \mathbb{R}^{n \times p}$ is i.i.d.
sampled from a zero-mean distribution with covariance matrix $\Sigma$ such that for any $v \in \mathbb{R}^p$, the variable $\langle v, X_i \rangle$ is sub-Gaussian with parameter at most $\kappa_u \|v\|_2$, for every row $X_i$ of $X$.

Our next assumption is on the covariance matrix of the covariate random vector:

(C3) The covariance matrix $\Sigma$ of $X$ satisfies, for all $w \in \mathbb{R}^p$, $\|\Sigma w\|_\infty \ge \kappa_\ell \|w\|_\infty$ with a fixed constant $\kappa_\ell > 0$. Moreover, $\Sigma$ is approximately sparse, along the lines of [17]: for some positive constant $D$, $\Sigma_{ii} \le D$ for all diagonal entries, and moreover, for some $0 \le q < 1$ and $c_0$, $\max_i \sum_{j=1}^p |\Sigma_{ij}|^q \le c_0$. If $q = 0$, then this condition is equivalent to $\Sigma$ being sparse.

We also introduce some notation used in the following theorem. Under condition (C2), we have that with high probability, $|\langle \theta^*, x^{(i)} \rangle| \le 2 \kappa_u \|\theta^*\|_2 \sqrt{\log n}$ for all samples $i = 1, \ldots, n$. Let $\tau^* := 2 \kappa_u \|\theta^*\|_2 \sqrt{\log n}$. We then let $\mathcal{M}_0$ be the subset of $\mathcal{M}$ such that

$\mathcal{M}_0 := \{ \mu : \mu = A'(\alpha), \text{ where } \alpha \in [-\tau^*, \tau^*] \}.$   (7)

We also define $\kappa_{u,A}$ and $\kappa_{\ell,A}$ as upper bounds on $A''(\cdot)$ and $((A')^{-1})'(\cdot)$, respectively:

$\max_{\alpha \in [-\tau^*, \tau^*]} |A''(\alpha)| \le \kappa_{u,A}, \qquad \max_{a \in \mathcal{M}_0 \cup \bar{\mathcal{M}}} |((A')^{-1})'(a)| \le \kappa_{\ell,A}.$   (8)

Armed with these conditions and notations, we derive our main theorem:

Theorem 1. Consider any generalized linear model in (1) where all the conditions (C1), (C2) and (C3) hold.
Now, suppose that we solve the estimation problem (4) setting the thresholding parameter $\nu = C_1 \sqrt{\frac{\log p'}{n}}$, where $C_1 := 16 (\max_j \Sigma_{jj}) \sqrt{10 \tau}$ for any constant $\tau > 2$, and $p' := \max\{n, p\}$. Furthermore, suppose also that we set the constraint bound $\lambda_n$ as $C_2 \sqrt{\frac{\log p'}{n}} + E$, where $C_2 := \frac{1}{\kappa_\ell} \big( \kappa_u \kappa_{\ell,A} \sqrt{2 \kappa_{u,A}} + C_1 \|\theta^*\|_1 \big)$ and where $E$ depends on the approximation error induced by the projection (3), and is defined as $E := \max_{i=1,\ldots,n} \big( y^{(i)} - [\Pi_{\bar{\mathcal{M}}}(y)]_i \big) \, \frac{4 \kappa_u \kappa_{\ell,A}}{\kappa_\ell} \sqrt{\frac{\log p'}{n}}$.

(A) Then, as long as $n > c_1 \big( \frac{2 c_0}{\kappa_\ell} \big)^{\frac{2}{1-q}} \log p'$, where $c_1$ is a constant depending only on $\tau$ and $\max_i \Sigma_{ii}$, any optimal solution $\hat{\theta}$ of (4) is guaranteed to be consistent:

$\|\hat{\theta} - \theta^*\|_\infty \le 2 \big( C_2 \sqrt{\tfrac{\log p'}{n}} + E \big), \qquad \|\hat{\theta} - \theta^*\|_2 \le 4 \sqrt{k} \big( C_2 \sqrt{\tfrac{\log p'}{n}} + E \big), \qquad \|\hat{\theta} - \theta^*\|_1 \le 8 k \big( C_2 \sqrt{\tfrac{\log p'}{n}} + E \big).$

(B) Moreover, the support set of the estimate $\hat{\theta}$ correctly excludes all true zero values of $\theta^*$. Moreover, when $\min_{s \in S} |\theta^*_s| \ge 3 \lambda_n$, it correctly includes all non-zero true supports of $\theta^*$.

Both statements hold with probability at least $1 - c p'^{-c'}$ for some universal constants $c, c' > 0$ depending on $\tau$ and $\kappa_u$.

Remark 1. While our class of closed-form estimators and analyses consider sparse-structured parameters, these can be seamlessly extended to more general structures (such as group sparsity and low rank), using appropriate thresholding functions.

Remark 2. The condition (C3) required in Theorem 1 is different from (and possibly stronger than) the restricted strong convexity [8] required for the $\ell_2$ error bound of the $\ell_1$-regularized MLE. A key facet of our analysis with our Condition (C3) however is that it provides much simpler and clearer identifying constants in our non-asymptotic error bounds.
Deriving constant factors in the analysis of the $\ell_1$-regularized MLE on the other hand, with its restricted strong convexity condition, involves many probabilistic statements, and is non-trivial, as shown in [8].

Another key facet of our analysis in Theorem 1 is that it also provides an $\ell_\infty$ error bound, and guarantees the sparsistency of our closed-form estimator. For $\ell_1$-regularized MLEs, this requires a separate sparsistency analysis. In the case of the simplest standard linear regression models, [18] showed that the incoherence condition $|||\Sigma_{S^c S} \Sigma_{SS}^{-1}|||_\infty < 1$ is required for sparsistency, where $||| \cdot |||_\infty$ is the maximum absolute row sum. As discussed in [18], instances of such incoherent covariance matrices $\Sigma$ include the identity, and Toeplitz matrices: these matrices can be seen to also satisfy our condition (C3). On the other hand, not all matrices that satisfy our condition (C3) need satisfy the stringent incoherence condition in turn. For example, consider $\Sigma$ where $\Sigma_{SS} = 0.95 I_3 + 0.05 \cdot \mathbf{1}_{3 \times 3}$ for a matrix $\mathbf{1}$ of ones, $\Sigma_{S S^c}$ is all zeros except that its last column is $0.4 \cdot \mathbf{1}_{3 \times 1}$, and $\Sigma_{S^c S^c} = I_{(p-3) \times (p-3)}$. Then, this positive definite $\Sigma$ can be seen to satisfy our Condition (C3), since each row has only 4 non-zeros. However, $|||\Sigma_{S^c S} \Sigma_{SS}^{-1}|||_\infty$ is equal to 1.0909 and larger than 1, and consequently, the incoherence condition required for the Lasso will not be satisfied. We defer relaxing our condition (C3) further, as well as a deeper investigation of all the above conditions, to future work.

Remark 3. The constant $C_2$ in the statement depends on $\|\theta^*\|_1$, which in the worst case where only $\|\theta^*\|_2$ is bounded, may scale with $\sqrt{k}$.
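The counterexample in Remark 2 is easy to check numerically; the following sketch (with ambient dimension p = 10, a value of our choosing) builds this $\Sigma$ and evaluates the incoherence quantity:

```python
import numpy as np

p, s = 10, 3                                    # ambient dimension (our choice), support size
Sigma = np.eye(p)
Sigma[:s, :s] = 0.95 * np.eye(s) + 0.05 * np.ones((s, s))  # Sigma_SS
Sigma[:s, p - 1] = 0.4                          # last column of Sigma_{S S^c}
Sigma[p - 1, :s] = 0.4                          # symmetric counterpart Sigma_{S^c S}

assert np.all(np.linalg.eigvalsh(Sigma) > 0)    # Sigma is positive definite

# Incoherence quantity: max absolute row sum of Sigma_{S^c S} Sigma_{SS}^{-1}
incoherence = np.max(np.sum(np.abs(Sigma[s:, :s] @ np.linalg.inv(Sigma[:s, :s])), axis=1))
print(round(incoherence, 4))                    # prints 1.0909, i.e. > 1: Lasso incoherence fails
```

At the same time every row of this $\Sigma$ has at most 4 non-zero entries, so the weak-sparsity part of (C3) holds, which is exactly the separation the remark claims.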
On the other hand, our theorem does not require an explicit sample complexity condition that $n$ be larger than some function of $k$, while the analyses of $\ell_1$-regularized MLEs do additionally require that $n \ge c\, k \log p$ for some constant $c$. In our experiments, we verify that our closed-form estimators outperform the $\ell_1$-regularized MLEs even when $k$ is fairly large (for instance, when $(n, p, k) = (5000, 10^4, 1000)$).

In order to apply Theorem 1 to a specific instance of GLMs, we need to specify the quantities in (8), as well as carefully construct a subset $\bar{\mathcal{M}}$ of the interior of the response moment polytope. In the case of the simplest linear models described in Section 2, we have the identity mapping $\mu = A'(\eta) = \eta$. The inequalities in (8) can thus be seen to be satisfied with $\kappa_{\ell,A} = \kappa_{u,A} = 1$. Moreover, we can set $\bar{\mathcal{M}} := \mathcal{M}^o = \mathbb{R}$ so that $\Pi_{\bar{\mathcal{M}}}(y) = y$, and trivially recover the previous results in [14] as a special case. In the following sections, we will derive the consequences of our framework for the more complex instances of logistic and Poisson regression models, which are also important members of the GLM family.

4 Key Corollaries

In order to derive corollaries of our main Theorem 1, we need to specify the response polytope subsets $\bar{\mathcal{M}}, \mathcal{M}_0$ in (2) and (7) respectively, as well as bound the two quantities $\kappa_{\ell,A}$ and $\kappa_{u,A}$ in (8).

Logistic regression models. The exponential family log-partition function of the logistic regression model described in Section 2 can be seen to be $A(\eta) = \log[\exp(\eta) + \exp(-\eta)]$. Consequently, its second derivative $A''(\eta) = \frac{4 \exp(2\eta)}{(\exp(2\eta) + 1)^2} \le 1$ for any $\eta$, so that (8) holds with $\kappa_{u,A} = 1$. The response moment polytope for the binary response variable $y \in \mathcal{Y} \equiv \{-1, 1\}$ is the interval $\mathcal{M} = [-1, 1]$, so that its interior is given by $\mathcal{M}^o = (-1, 1)$. For the subset of the interior, we define $\bar{\mathcal{M}} = [-1 + \epsilon, 1 - \epsilon]$, for some $0 < \epsilon < 1$.
At the same time, the forward mapping is given by $A'(\eta) = (\exp(2\eta) - 1)/(\exp(2\eta) + 1)$, and hence $\mathcal{M}_0$ becomes $[-\frac{a-1}{a+1}, \frac{a-1}{a+1}]$ where $a := n^{4 \kappa_u \|\theta^*\|_2 / \sqrt{\log n}}$. The inverse mapping of logistic models is given by $(A')^{-1}(\mu) = \frac{1}{2} \log \frac{1 + \mu}{1 - \mu}$, and given $\bar{\mathcal{M}}$ and $\mathcal{M}_0$, it can be seen that $(A')^{-1}(\mu)$ is Lipschitz on $\bar{\mathcal{M}} \cup \mathcal{M}_0$ with constant less than $\kappa_{\ell,A} := \max\{ \frac{1}{2} + \frac{1}{2} n^{4 \kappa_u \|\theta^*\|_2 / \sqrt{\log n}}, 1/\epsilon \}$ in (8). Note that with this setting of the subset $\bar{\mathcal{M}}$, we have that $\max_{i=1,\ldots,n} (y^{(i)} - [\Pi_{\bar{\mathcal{M}}}(y)]_i) = \epsilon$, and moreover, $\Pi_{\bar{\mathcal{M}}}(y_i) = y_i (1 - \epsilon)$, which we will use in the corollary below.

Poisson regression models. Another important instance of GLMs is the Poisson regression model, which is becoming increasingly relevant in modern big-data settings with varied multivariate count data. For the Poisson regression model, the second derivative of $A(\cdot)$ is not uniformly upper bounded: $A''(u) = \exp(u)$. Denoting $\tau^* := 2 \kappa_u \|\theta^*\|_2 \sqrt{\log n}$, we then have that for any $\alpha \in [-\tau^*, \tau^*]$, $A''(\alpha) \le \exp(2 \kappa_u \|\theta^*\|_2 \sqrt{\log n}) = n^{2 \kappa_u \|\theta^*\|_2 / \sqrt{\log n}}$, so that (8) is satisfied with $\kappa_{u,A} = n^{2 \kappa_u \|\theta^*\|_2 / \sqrt{\log n}}$. The response moment polytope for the count-valued response variable $y \in \mathcal{Y} \equiv \{0, 1, \ldots\}$ is given by $\mathcal{M} = [0, \infty)$, so that its interior is given by $\mathcal{M}^o = (0, \infty)$. For the subset of the interior, we define $\bar{\mathcal{M}} = [\epsilon, \infty)$ for some $\epsilon$ s.t. $0 < \epsilon < 1$. The forward mapping in this case is simply given by $A'(\eta) = \exp(\eta)$, and $\mathcal{M}_0$ in (7) becomes $[a^{-1}, a]$ where $a$ is $n^{2 \kappa_u \|\theta^*\|_2 / \sqrt{\log n}}$. The inverse mapping for the Poisson regression model is then given by $(A')^{-1}(\mu) = \log(\mu)$, which can be seen to be Lipschitz on $\bar{\mathcal{M}}$ with constant $\kappa_{\ell,A} = \max\{ n^{2 \kappa_u \|\theta^*\|_2 / \sqrt{\log n}}, 1/\epsilon \}$ in (8).
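For concreteness, the projections $\Pi_{\bar{\mathcal{M}}}$ and inverse mean maps $(A')^{-1}$ for the logistic and Poisson instantiations can be sketched as follows (our own helper names, written from the formulas above):

```python
import numpy as np

def project_logistic(y, eps):
    """Project labels y in {-1,+1} to [-1+eps, 1-eps]: Pi(y_i) = y_i * (1 - eps)."""
    return y * (1.0 - eps)

def invlink_logistic(mu):
    """(A')^{-1}(mu) = 0.5 * log((1+mu)/(1-mu)), the inverse of tanh."""
    return 0.5 * np.log((1.0 + mu) / (1.0 - mu))

def project_poisson(y, eps):
    """Project counts to [eps, inf): Pi(y_i) = eps if y_i == 0, else y_i."""
    return np.where(y == 0, eps, y).astype(float)

def invlink_poisson(mu):
    """(A')^{-1}(mu) = log(mu)."""
    return np.log(mu)
```

Without the projection, `invlink_logistic(+/-1)` and `invlink_poisson(0)` would be infinite, which is exactly why the raw responses must first be pulled into the interior of the moment polytope.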
With this setting\nbe seen to be Lipschitz for M with constant \uf8ff`,A = max{n\nof M, it can be seen that the projection operator is given by \u21e7 \u00afM(yi) = I(yi = 0)\u270f + I(yi 6= 0)yi.\nNow, we are ready to recover the error bounds, as a corollary of Theorem 1, for logistic regression\nand Poisson models when condition (C2) holds:\nCorollary 1. Consider any logistic regression model or a Poisson regression model where all con-\nditions in Theorem 1 hold. Suppose that we solve our closed-form estimation problem (4), setting\nn(1/2c0/plog n) +\n\nn , and the constraint bound n = 2\n\ncplog p0\n\nn where c and c0 are some constants depending only on \uf8ffu, k\u2713\u21e4k2 and \u270f. Then the\n\nthe thresholding parameter \u232b = C1q log p0\nC1k\u2713\u21e4k1q log p0\n\n\uf8ff`\n\n7\n\n\fTable 1: Comparisons on simulated datasets when parameters are tuned to minimize `2 error on\nindependent validation sets.\n\n(n, p, k)\n\nMETHOD\n\nTP\n\nFP\n\n`2 ERROR\n\nTIME\n\n(n, p, k)\n\nMETHOD\n\nTP\n\nFP\n\n`2 ERROR\n\nTIME\n\n(n = 2000, `1 MLE1\n`1 MLE2\np = 5000,\n`1 MLE3\nk = 10)\nELEM\n(n = 4000, `1 MLE1\n`1 MLE2\np = 5000,\n`1 MLE3\nk = 10)\nELEM\n(n = 5000, `1 MLE1\n`1 MLE2\np = 104,\n`1 MLE3\nk = 100)\nELEM\n\n1\n1\n1\n\n0.1094\n0.0873\n0.1000\n0.9900 0.0184\n0.1626\n0.1327\n0.1112\n0.0069\n0.1301\n0.1695\n0.2001\n0.9975 0.3622\n\n1\n1\n1\n1\n1\n1\n1\n\n4.5450\n4.0721\n3.4846\n2.7375\n4.2132\n3.6569\n2.9681\n2.6213\n18.9079\n18.5567\n18.2351\n16.4148\n\n63.9\n133.1\n348.3\n26.5\n155.5\n296.8\n829.3\n40.2\n500.1\n983.8\n2353.3\n151.8\n\n(n = 8000,\np = 104,\nk = 100)\n\n`1 MLE1\n(n = 5000,\np = 104,\n`1 MLE2\nk = 1000) `1 MLE3\nELEM\n`1 MLE1\n`1 MLE2\n`1 MLE3\nELEM\n`1 MLE1\n(n = 8000,\np = 104,\n`1 MLE2\nk = 1000) `1 MLE3\nELEM\n\n0.7990\n0.7935\n0.7965\n0.8295\n\n1\n1\n1\n1\n\n1\n1\n1\n\n0.1904\n0.2181\n0.2364\n0.9450 0.0359\n0.7965\n0.7900\n0.7865\n0.7015 
0.5103\n\n1\n1\n1\n\n65.1895\n65.1165\n65.1024\n63.2359\n18.6186\n18.1806\n17.6762\n11.9881\n65.0714\n64.9650\n64.8857\n61.0532\n\n520.7\n1005.8\n2560.1\n152.1\n810.6\n1586.2\n3568.9\n221.1\n809.5\n1652.8\n4196.6\n219.4\n\n4\n\ncplog p0\n\ncplog p0\n\nn(1/2c0/plog n)\n\noptimal solutionb\u2713 of (4) is guaranteed to be consistent:\n+ C1k\u2713\u21e4k1r log p0\n\uf8ff`\u2713\nn \u25c6 ,\ncplog p0\nb\u2713 \u2713\u21e41 \uf8ff\n+ C1k\u2713\u21e4k1r log p0\n+ C1k\u2713\u21e4k1r log p0\n\u2713\nn \u25c6 ,\nn \u25c6 ,\nb\u2713 \u2713\u21e41 \uf8ff\nn(1/2c0/plog n)\nwith probability at least 1 c1p0c01 for some universal constants c1, c01 > 0 and p0 := max{n, p}.\nn(1/2c0/plog n) + C1k\u2713\u21e4k1q log p0\n\uf8ff`\nMoreover, when mins2S |\u2713\u21e4s| 6\nRemarkably, the rates in Corollary 1 are asymptotically comparable to those for the `1-regularized\nMLE (see for instance Theorem 4.2 and Corollary 4.4 in [7]). In Appendix A, we place slightly\nmore stringent condition than (C2) and guarantee error bounds with faster convergence rates.\n\nn ,b\u2713 is sparsistent.\n\nb\u2713 \u2713\u21e42 \uf8ff\n\uf8ff` \u2713\n\nn(1/2c0/plog n)\n\n16k\n\ncplog p0\n\n8pk\n\uf8ff`\n\n5 Experiments\n\nWe corroborate the performance of our elementary estimators on simulated data over varied regimes\nof sample size n, number of covariates p, and sparsity size k. We consider two popular instances\nof GLMs, logistic and Poisson regression models. We compare against standard `1 regularized\nMLE estimators with iteration bounds of 50, 100, and 500, denoted by `1 MLE1, `1 MLE2 and `1\nMLE3 respectively. We construct the n \u21e5 p design matrices X by sampling the rows independently\nfrom N (0, \u2303) where \u2303i,j = 0.5|ij|. For each simulation, the entries of the true model coef\ufb01cient\nvector \u2713\u21e4 are set to be 0 everywhere, except for a randomly chosen subset of k coef\ufb01cients, which\nare chosen independently and uniformly in the interval (1, 3). 
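To make the preceding constructions concrete, the following is a minimal Python sketch of the simulation design just described, together with the per-model projection operators Π_M̄, the inverse link maps (A′)⁻¹, and an elementwise soft-thresholding step. The ridge-style least-squares initializer and the tuning constants `delta` and `nu` are hypothetical stand-ins for illustration only; they are not the paper's actual closed-form MLE variant (4), which is defined earlier in the text.

```python
import numpy as np

EPS = 1e-4  # epsilon used in the projection onto the restricted moment set

def project_logistic(y, eps=EPS):
    # For y in {-1, +1}: Pi_Mbar(y_i) = y_i * (1 - eps)
    return y * (1.0 - eps)

def project_poisson(y, eps=EPS):
    # For counts: Pi_Mbar(y_i) = eps if y_i == 0, else y_i
    return np.where(y == 0, eps, y).astype(float)

def inv_link_logistic(mu):
    # (A')^{-1}(mu) = (1/2) * log((1 + mu) / (1 - mu))
    return 0.5 * np.log((1.0 + mu) / (1.0 - mu))

def inv_link_poisson(mu):
    # (A')^{-1}(mu) = log(mu)
    return np.log(mu)

def soft_threshold(v, nu):
    # Elementwise soft-thresholding at level nu
    return np.sign(v) * np.maximum(np.abs(v) - nu, 0.0)

# --- simulation design of Section 5 (small scale) ---
rng = np.random.default_rng(0)
n, p, k = 400, 50, 5
idx = np.arange(p)
Sigma = 0.5 ** np.abs(idx[:, None] - idx[None, :])   # Sigma_{i,j} = 0.5^{|i-j|}
X = rng.multivariate_normal(np.zeros(p), Sigma, size=n)
theta_star = np.zeros(p)
support = rng.choice(p, size=k, replace=False)
theta_star[support] = rng.uniform(1.0, 3.0, size=k)

# Logistic responses y in {-1, +1}: P(y = 1 | x) = (1 + tanh(x^T theta*)) / 2
prob_plus = 0.5 * (1.0 + np.tanh(X @ theta_star))
y = np.where(rng.random(n) < prob_plus, 1.0, -1.0)

# --- closed-form pipeline (sketch) ---
# Transform responses through the projection and inverse link, fit a
# ridge-style least-squares initializer (a hypothetical stand-in for the
# closed-form MLE variant), then soft-threshold.
z = inv_link_logistic(project_logistic(y))
delta, nu = 0.1, 0.05   # hypothetical tuning constants
theta_init = np.linalg.solve(X.T @ X / n + delta * np.eye(p), X.T @ z / n)
theta_hat = soft_threshold(theta_init, nu)
```

For the Poisson case, one would instead draw count responses and swap in `project_poisson` and `inv_link_poisson`; everything downstream of the transformed responses `z` is unchanged.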
We report results averaged over 100 independent trials. Noting that our theoretical results were not sensitive to the setting of ε in Π_M̄(y), we simply report the results with ε = 10⁻⁴ across all experiments.

While our theorem specified an optimal setting of the regularization parameters λ_n and ν, this optimal setting depends on unknown model parameters. Thus, as is standard with high-dimensional regularized estimators, we set the tuning parameters λ_n = c√(log p/n) and ν = c′√(log p/n) in a holdout-validated fashion, finding the parameters that minimize the ℓ2 error on an independent validation set. The detailed experimental setup is described in the appendix.

Table 1 summarizes the performance of the ℓ1 MLE under three different stopping criteria and of Elem-GLM. Besides the ℓ2 errors, the target tuning metric, we also provide the true and false positive rates for the support-set recovery task on a new test set where the best tuning parameters are used. The computation times, in seconds, indicate the overall training time summed over the whole parameter-tuning process. As we can see from our experiments, with respect to both statistical and computational performance, our closed-form estimators are quite competitive with the classical ℓ1-regularized MLE estimators, and in certain cases outperform them. Note that ℓ1 MLE1 stops prematurely after only 50 iterations, so its training time is sometimes comparable to that of the closed-form estimator; however, its statistical performance measured by ℓ2 error is much inferior to that of the other ℓ1 MLEs with more iterations, as well as to the Elem-GLM estimator. Due to space limitations, ROC curves, results for other settings of p, and further experiments on real datasets are presented in the appendix.

References

[1] P. McCullagh and J. A. Nelder. Generalized Linear Models. Monographs on Statistics and Applied Probability 37. Chapman and Hall/CRC, 1989.
[2] G. E. Hoffman, B. A. Logsdon, and J. G. Mezey. PUMA: A unified framework for penalized multiple regression analysis of GWAS data. PLoS Computational Biology, 2013.
[3] D. Witten and R. Tibshirani. Survival analysis with high-dimensional covariates. Stat. Methods Med. Res., 19:29-51, 2010.
[4] E. Yang, P. Ravikumar, G. I. Allen, and Z. Liu. Graphical models via generalized linear models. In Neur. Info. Proc. Sys. (NIPS), 25, 2012.
[5] S. van de Geer. High-dimensional generalized linear models and the lasso. Annals of Statistics, 36(2):614-645, 2008.
[6] F. Bach. Self-concordant analysis for logistic regression. Electron. J. Stat., 4:384-414, 2010.
[7] S. M. Kakade, O. Shamir, K. Sridharan, and A. Tewari. Learning exponential families in high-dimensions: Strong convexity and sparsity. In Inter. Conf. on AI and Statistics (AISTATS), 2010.
[8] S. Negahban, P. Ravikumar, M. J. Wainwright, and B. Yu. A unified framework for high-dimensional analysis of M-estimators with decomposable regularizers. Arxiv preprint arXiv:1010.2731v1, 2010.
[9] F. Bunea. Honest variable selection in linear and logistic regression models via ℓ1 and ℓ1 + ℓ2 penalization. Electron. J. Stat., 2:1153-1194, 2008.
[10] L. Meier, S. van de Geer, and P. Bühlmann. The group lasso for logistic regression. Journal of the Royal Statistical Society, Series B, 70:53-71, 2008.
[11] Y. Kim, J. Kim, and Y. Kim. Blockwise sparse regression. Statistica Sinica, 16:375-390, 2006.
[12] J. Friedman, T. Hastie, and R. Tibshirani. Regularization paths for generalized linear models via coordinate descent. Journal of Statistical Software, 33(1):1-22, 2010.
[13] K. Koh, S. J. Kim, and S. Boyd. An interior-point method for large-scale ℓ1-regularized logistic regression. Jour. Mach. Learning Res., 3:1519-1555, 2007.
[14] E. Yang, A. C. Lozano, and P. Ravikumar. Elementary estimators for high-dimensional linear regression. In International Conference on Machine Learning (ICML), 31, 2014.
[15] A. J. Rothman, E. Levina, and J. Zhu. Generalized thresholding of large covariance matrices. Journal of the American Statistical Association (Theory and Methods), 104:177-186, 2009.
[16] M. J. Wainwright and M. I. Jordan. Graphical models, exponential families and variational inference. Foundations and Trends in Machine Learning, 1(1-2):1-305, December 2008.
[17] P. J. Bickel and E. Levina. Covariance regularization by thresholding. Annals of Statistics, 36(6):2577-2604, 2008.
[18] M. J. Wainwright. Sharp thresholds for high-dimensional and noisy sparsity recovery using ℓ1-constrained quadratic programming (Lasso). IEEE Trans. Information Theory, 55:2183-2202, May 2009.
[19] D. A. Spielman and S.-H. Teng. Solving sparse, symmetric, diagonally-dominant linear systems in time O(m^{1.31}). In 44th Symposium on Foundations of Computer Science (FOCS 2003), Cambridge, MA, USA, pages 416-427, 2003.
[20] M. B. Cohen, R. Kyng, G. L. Miller, J. W. Pachocki, R. Peng, A. B. Rao, and S. C. Xu. Solving SDD linear systems in nearly m log^{1/2} n time. In Proceedings of the 46th Annual ACM Symposium on Theory of Computing, STOC '14, pages 343-352. ACM, 2014.
[21] D. A. Spielman and S.-H. Teng. Nearly linear time algorithms for preconditioning and solving symmetric, diagonally dominant linear systems. SIAM J. Matrix Analysis Applications, 35(3):835-885, 2014.
[22] P. Ravikumar, M. J. Wainwright, G. Raskutti, and B. Yu. High-dimensional covariance estimation by minimizing ℓ1-penalized log-determinant divergence. Electronic Journal of Statistics, 5:935-980, 2011.
[23] E. Yang, A. C. Lozano, and P. Ravikumar. Elementary estimators for sparse covariance matrices and other structured moments. In International Conference on Machine Learning (ICML), 31, 2014.
[24] E. Yang, A. C. Lozano, and P. Ravikumar. Elementary estimators for graphical models. In Neur. Info. Proc. Sys. (NIPS), 27, 2014.