{"title": "Scaled Least Squares Estimator for GLMs in Large-Scale Problems", "book": "Advances in Neural Information Processing Systems", "page_first": 3324, "page_last": 3332, "abstract": "We study the problem of efficiently estimating the coefficients of generalized linear models (GLMs) in the large-scale setting where the number of observations $n$ is much larger than the number of predictors $p$, i.e. $n\\gg p \\gg 1$. We show that in GLMs with random (not necessarily Gaussian) design, the GLM coefficients are approximately proportional to the corresponding ordinary least squares (OLS) coefficients. Using this relation, we design an algorithm that achieves the same accuracy as the maximum likelihood estimator (MLE) through iterations that attain up to a cubic convergence rate, and that are cheaper than any batch optimization algorithm by at least a factor of $\\mathcal{O}(p)$. We provide theoretical guarantees for our algorithm, and analyze the convergence behavior in terms of data dimensions. % Finally, we demonstrate the performance of our algorithm through extensive numerical studies on large-scale real and synthetic datasets, and show that it achieves the highest performance compared to several other widely used optimization algorithms.", "full_text": "Scaled Least Squares Estimator for GLMs\n\nin Large-Scale Problems\n\nMurat A. Erdogdu\n\nDepartment of Statistics\n\nStanford University\n\nerdogdu@stanford.edu\n\nMohsen Bayati\n\nGraduate School of Business\n\nStanford University\n\nbayati@stanford.edu\n\nLee H. Dicker\n\nDepartment of Statistics and Biostatistics\n\nRutgers University and Amazon \u21e4\nldicker@stat.rutgers.edu\n\nAbstract\n\nWe study the problem of ef\ufb01ciently estimating the coef\ufb01cients of generalized linear\nmodels (GLMs) in the large-scale setting where the number of observations n is\nmuch larger than the number of predictors p, i.e. n p 1. 
We show that in GLMs with random (not necessarily Gaussian) design, the GLM coefficients are approximately proportional to the corresponding ordinary least squares (OLS) coefficients. Using this relation, we design an algorithm that achieves the same accuracy as the maximum likelihood estimator (MLE) through iterations that attain up to a cubic convergence rate, and that are cheaper than any batch optimization algorithm by at least a factor of O(p). We provide theoretical guarantees for our algorithm, and analyze the convergence behavior in terms of data dimensions. Finally, we demonstrate the performance of our algorithm through extensive numerical studies on large-scale real and synthetic datasets, and show that it achieves the highest performance compared to several other widely used optimization algorithms.

1 Introduction

We consider the problem of efficiently estimating the coefficients of generalized linear models (GLMs) when the number of observations n is much larger than the dimension of the coefficient vector p (n ≫ p ≫ 1). GLMs play a crucial role in numerous machine learning and statistics problems, and provide a versatile framework for many regression and classification tasks. Celebrated examples include ordinary least squares, logistic regression, multinomial regression and many applications involving graphical models [MN89, WJ08, KF09].
The standard approach to estimating the regression coefficients in a GLM is the maximum likelihood method. Under standard assumptions on the link function, the maximum likelihood estimator (MLE) can be written as the solution to a convex minimization problem [MN89]. Due to the non-linear structure of the MLE problem, the resulting optimization task requires iterative methods. The most commonly used optimization technique for computing the MLE is the Newton-Raphson method, which may be viewed as a reweighted least squares algorithm [MN89].
This method uses a second-order approximation to benefit from the curvature of the log-likelihood and achieves locally quadratic convergence. A drawback of this approach is its excessive per-iteration cost of O(np²). To remedy this, Hessian-free Krylov subspace based methods such as conjugate gradient and minimal residual are used, but the resulting direction is imprecise [HS52, PS75, Mar10]. On the other hand, a first-order approximation yields the gradient descent algorithm, which attains a linear convergence rate with O(np) per-iteration cost. Although its convergence rate is slow compared to that of second-order methods, its modest per-iteration cost makes it practical for large-scale problems. In the regime n ≫ p, another popular optimization technique is the class of quasi-Newton methods [Bis95, Nes04], which can attain a per-iteration cost of O(np) and a locally super-linear convergence rate; a well-known member of this class of methods is the BFGS algorithm [Nes04]. There are recent studies that exploit the special structure of GLMs [Erd15], and achieve near-quadratic convergence with a per-iteration cost of O(np), plus an additional cost of covariance estimation.

* Work conducted while at Rutgers University.
30th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain.

In this paper, we take an alternative approach to fitting GLMs, based on an identity that is well-known in some areas of statistics, but appears to have received relatively little attention for its computational implications in large-scale problems. Let β^glm denote the GLM regression coefficients, and let β^ols denote the corresponding ordinary least squares (OLS) coefficients (this notation will be defined more precisely in Section 2).
Then, under certain random predictor (design) models,

β^glm ∝ β^ols.   (1)

For logistic regression with Gaussian design (which is equivalent to Fisher's discriminant analysis), (1) was noted by Fisher in the 1930s [Fis36]; a more general formulation for models with Gaussian design is given in [Bri82]. The relationship (1) suggests that if the constant of proportionality is known, then β^glm can be estimated by computing the OLS estimator, which may be substantially simpler than finding the MLE for the original GLM. Our work in this paper builds on this idea.
Our contributions can be summarized as follows.

1. We show that β^glm is approximately proportional to β^ols in random design GLMs, regardless of the predictor distribution. That is, we prove

   ‖β^glm − c × β^ols‖_∞ ≲ 1/p, for some c ∈ R.

2. We design a computationally efficient estimator for β^glm by first estimating the OLS coefficients, and then estimating the proportionality constant c. We refer to the resulting estimator as the Scaled Least Squares (SLS) estimator and denote it by β̂^sls. After estimating the OLS coefficients, the second step of our algorithm involves finding a root of a real-valued function; this can be accomplished using iterative methods with up to a cubic convergence rate and only O(n) per-iteration cost. This is cheaper than the classical batch methods mentioned above by at least a factor of O(p).

3. For random design GLMs with sub-Gaussian predictors, we show that

   ‖β̂^sls − β^glm‖_∞ ≲ 1/p + √(p max{log(n), p}/n).

   This bound characterizes the performance of the proposed estimator in terms of the data dimensions, and justifies the use of the algorithm in the regime n ≫ p ≫ 1.

4. We study the statistical and computational performance of β̂^sls, and compare it to that of the MLE (using several well-known implementations), on a variety of large-scale datasets.

The rest of the paper is organized as follows: Section 1.1 surveys the related work and Section 2 introduces the required background and the notation. In Section 3, we provide the intuition behind the relationship (1), which is based on exact calculations for GLMs with Gaussian design. In Section 4, we propose our algorithm and discuss its computational properties. Section 5 provides a thorough comparison between the proposed algorithm and other existing methods. Theoretical results may be found in Section 6. Finally, we conclude with a brief discussion in Section 7.

1.1 Related work

As mentioned in Section 1, the relationship (1) is well-known in several forms in statistics. Brillinger [Bri82] derived (1) for models with Gaussian predictors. Li & Duan [LD89] studied model misspecification problems in statistics and derived (1) when the predictor distribution has linear conditional means (this is a slight generalization of Gaussian predictors). More recently, Stein's lemma [BEM13] and the relationship (1) have been revisited in the context of compressed sensing [PV15, TAH15], where it has been shown that the standard lasso estimator may be very effective when used in models where the relationship between the expected response and the signal is nonlinear, and the predictors (i.e. the design or sensing matrix) are Gaussian. A common theme for all of this previous work is that it focuses solely on settings where (1) holds exactly and the predictors are Gaussian (or, in [LD89], nearly Gaussian). Two key novelties of the present paper are (i) our focus on the computational benefits following from (1) for large-scale problems with n ≫ p ≫ 1; and (ii) our rigorous analysis of models with non-Gaussian predictors, where (1) is shown to be approximately valid.

2 Preliminaries and notation

We assume a random design setting, where the observed data consist of n iid pairs (y1, x1), (y2, x2), . .
., (yn, xn); y_i ∈ R is the response variable and x_i = (x_i1, . . . , x_ip)^T ∈ R^p is the vector of predictors or covariates. We focus on problems where fitting a GLM is desirable, but we do not need to assume that (y_i, x_i) are actually drawn from the corresponding statistical model (i.e. we allow for model misspecification).
The MLE for GLMs with canonical link is defined by

β̂^mle = argmax_{β ∈ R^p} (1/n) Σ_{i=1}^n [ y_i ⟨x_i, β⟩ − ψ(⟨x_i, β⟩) ],   (2)

where ⟨·,·⟩ denotes the Euclidean inner product on R^p, and ψ is a sufficiently smooth convex function. The GLM coefficients β^glm are defined by taking the population average in (2):

β^glm = argmax_{β ∈ R^p} E[ y_i ⟨x_i, β⟩ − ψ(⟨x_i, β⟩) ].   (3)

While we make no assumptions on ψ beyond smoothness, note that if ψ is the cumulant generating function for y_i | x_i, then we recover the standard GLM with canonical link and regression parameters β^glm [MN89]. Examples of GLMs in this form include logistic regression, with ψ(w) = log{1 + e^w}; Poisson regression, with ψ(w) = e^w; and linear regression (least squares), with ψ(w) = w²/2.
Our objective is to find a computationally efficient estimator for β^glm. The alternative estimator for β^glm proposed in this paper is related to the OLS coefficient vector, which is defined by β^ols := E[x_i x_i^T]^{-1} E[x_i y_i]; the corresponding OLS estimator is β̂^ols := (X^T X)^{-1} X^T y, where X = (x_1, . . . , x_n)^T is the n × p design matrix and y = (y_1, . . . , y_n)^T ∈ R^n.
Additionally, throughout the text we let [m] = {1, 2, ..., m} for positive integers m, and we denote the size of a set S by |S|. The m-th derivative of a function g : R → R is denoted by g^(m). For a vector u ∈ R^p and an n × p matrix U, we let ‖u‖_q and ‖U‖_q denote the ℓ_q-vector and -operator norms, respectively. If S ⊆ [n], let U_S denote the |S| × p matrix obtained from U by extracting the rows that are indexed by S.
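For concreteness, the empirical version of the objective in (2) and the three choices of ψ listed above can be written in a few lines of NumPy. This is a hedged sketch for illustration only; the function and dictionary names are ours, not part of the paper.

```python
import numpy as np

# Cumulant-type functions psi for the three canonical GLMs in Section 2:
# logistic: psi(w) = log(1 + e^w); Poisson: psi(w) = e^w; linear: psi(w) = w^2 / 2.
PSI = {
    "logistic": lambda w: np.logaddexp(0.0, w),  # numerically stable log(1 + e^w)
    "poisson": np.exp,
    "linear": lambda w: 0.5 * w ** 2,
}

def glm_objective(beta, X, y, family="logistic"):
    """Empirical GLM objective from eq. (2): (1/n) * sum_i [y_i <x_i, beta> - psi(<x_i, beta>)]."""
    w = X @ beta
    return np.mean(y * w - PSI[family](w))
```

Maximizing `glm_objective` over `beta` gives the MLE in (2); the population maximizer is β^glm in (3).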
For a symmetric matrix M ∈ R^{p×p}, λ_max(M) and λ_min(M) denote the maximum and minimum eigenvalues, respectively. ρ_k(M) denotes the condition number of M with respect to the k-norm. We denote by N_q the q-variate normal distribution.

3 OLS is equivalent to GLM up to a scalar factor

To motivate our methodology, we assume in this section that the covariates are multivariate normal, as in [Bri82]. These distributional assumptions will be relaxed in Section 6.

Proposition 1. Assume that the covariates are multivariate normal with mean 0 and covariance matrix Σ = E[x_i x_i^T], i.e. x_i ∼ N_p(0, Σ). Then β^glm can be written as

β^glm = c × β^ols,

where c ∈ R satisfies the equation 1 = c E[ ψ^(2)(⟨x, β^ols⟩ c) ].

Algorithm 1 SLS: Scaled Least Squares Estimator
  Input: Data (y_i, x_i), i = 1, ..., n.
  Step 1. Compute the least squares estimator β̂^ols and ŷ = X β̂^ols.
          For a sub-sampling based OLS estimator, let S ⊂ [n] be a random subset and take
          β̂^ols = (|S|/n) (X_S^T X_S)^{-1} X^T y.
  Step 2. Solve the following equation for c ∈ R: 1 = (c/n) Σ_{i=1}^n ψ^(2)(c ŷ_i).
          Use Newton's root-finding method:
            Initialize c = 2/Var(y_i);
            Repeat until convergence:
              c ← c − [ (c/n) Σ_{i=1}^n ψ^(2)(c ŷ_i) − 1 ] / [ (1/n) Σ_{i=1}^n { ψ^(2)(c ŷ_i) + c ŷ_i ψ^(3)(c ŷ_i) } ].
  Output: β̂^sls = c × β̂^ols.

Proof of Proposition 1. The optimal point in the optimization problem (3) has to satisfy the following normal equations:

E[y_i x_i] = E[ x_i ψ^(1)(⟨x_i, β⟩) ].   (4)

Now, denote by φ(x | Σ) the multivariate normal density with mean 0 and covariance matrix Σ. We recall the well-known property of the Gaussian density, dφ(x | Σ)/dx = −Σ^{-1} x φ(x | Σ). Using this and integration by parts on the right-hand side of the above equation, we obtain

E[ x_i ψ^(1)(⟨x_i, β⟩) ] = ∫ x ψ^(1)(⟨x, β⟩) φ(x | Σ) dx = Σ E[ ψ^(2)(⟨x_i, β⟩) ] β   (5)

(this is basically Stein's lemma).
Combining this with the identity (4), we conclude the proof.

Proposition 1 and its proof provide the main intuition behind our proposed method. Observe that in our derivation, we only worked with the right-hand side of the normal equations (4), which does not depend on the response variable y_i. The equivalence holds regardless of the joint distribution of (y_i, x_i), whereas in [Bri82], y_i is assumed to follow a single index model. In Section 6, where we extend the method to non-Gaussian predictors, (5) is generalized via the zero-bias transformations.

3.1 Regularization

A version of Proposition 1 incorporating regularization (an important tool for datasets where p is large relative to n or the predictors are highly collinear) is also possible, as outlined briefly in this section. We focus on ℓ₂-regularization (ridge regression) here; some connections with the lasso (ℓ₁-regularization) are discussed in Section 6 and Corollary 1.
For λ ≥ 0, define the ℓ₂-regularized GLM coefficients

β^glm_λ = argmax_{β ∈ R^p} E[ y_i ⟨x_i, β⟩ − ψ(⟨x_i, β⟩) ] − (λ/2) ‖β‖₂²,   (6)

and the corresponding ℓ₂-regularized OLS coefficients β^ols_γ := ( E[x_i x_i^T] + γI )^{-1} E[x_i y_i] (so β^glm = β^glm_0 and β^ols = β^ols_0). The same argument as above implies that

β^glm_λ = c_λ × β^ols_γ, where γ = c_λ λ.   (7)

This suggests that ordinary ridge regression for the linear model can be used to estimate the ℓ₂-regularized GLM coefficients β^glm_λ. Further pursuing these ideas for problems where regularization is a critical issue may be an interesting area for future research.

4 SLS: Scaled Least Squares estimator for GLMs

Motivated by the results in the previous section, we design a computationally efficient algorithm for any GLM task that is as simple as solving the least squares problem; it is described in Algorithm 1. The algorithm has two basic steps.
First, we estimate the OLS coefficients, and then in the second step we estimate the proportionality constant via a simple root-finding algorithm.
There are numerous fast optimization methods to solve the least squares problem, and even a superficial review of these could go beyond the page limits of this paper. We emphasize that this step (finding the OLS estimator) does not have to be iterative, and it is the main computational cost of the proposed algorithm. We suggest using a sub-sampling based estimator for β^ols, where we only use a subset of the observations to estimate the covariance matrix. Let S ⊂ [n] be a random sub-sample and denote by X_S the sub-matrix formed by the rows of X in S. Then the sub-sampled OLS estimator is given as β̂^ols = ( (1/|S|) X_S^T X_S )^{-1} (1/n) X^T y. Properties of this estimator have been well-studied [Ver10, DLFU13, EM15]. For sub-Gaussian covariates, it suffices to use a sub-sample size of O(p log(p)) [Ver10]. Hence, this step requires a one-time computational cost of O(|S|p² + p³ + np) ≈ O(p max{p² log(p), n}). For other approaches, we refer the reader to [RT08, DLFU13] and the references therein.

[Figure 1 here: two panels, computation time (sec) and estimation error ‖β̂ − β‖₂ versus log10(n), for the SLS and the MLE.]

Figure 1: Logistic regression with general Gaussian design. The left plot shows the computational cost (time) for finding the MLE and the SLS as n grows and p = 200. The right plot depicts the accuracy of the estimators. In the regime where the MLE is expensive to compute, the SLS is found much more rapidly and has the same accuracy. R's built-in functions are used to find the MLE.

The second step of Algorithm 1 involves solving a simple root-finding problem.
As with the first step of the algorithm, there are numerous methods available for completing this task. Newton's root-finding method with quadratic convergence or Halley's method with cubic convergence may be appropriate choices. We highlight that this step costs only O(n) per iteration and that we can attain up to a cubic rate of convergence. The resulting per-iteration cost is cheaper than that of other commonly used batch algorithms by at least a factor of O(p); indeed, the cost of computing the gradient alone is O(np). For simplicity, we use Newton's root-finding method initialized at c = 2/Var(y_i). Assuming that the GLM is a good approximation to the true conditional distribution, by the law of total variance and basic properties of GLMs, we have

Var(y_i) = E[Var(y_i | x_i)] + Var(E[y_i | x_i]) ≈ c^{-1} + Var(ψ^(1)(⟨x_i, β⟩)).   (8)

It follows that this initialization is reasonable as long as c^{-1} ≈ E[Var(y_i | x_i)] is not much smaller than Var(ψ^(1)(⟨x_i, β⟩)). Our experiments show that the SLS is very robust to initialization.
In Figure 1, we compare the performance of our SLS estimator to that of the MLE, when both are used to analyze synthetic data generated from a logistic regression model under general Gaussian design with a randomly generated covariance matrix. The left plot shows the computational cost of obtaining both estimators as n increases for fixed p. The right plot shows the accuracy of the estimators. In the regime n ≫ p ≫ 1, where the MLE is hard to compute, the MLE and the SLS achieve the same accuracy, yet the SLS has significantly smaller computation time.
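The two steps above can be sketched in a few lines of NumPy. This is our illustrative reading of Algorithm 1 specialized to logistic regression (function names and defaults are ours, not the authors' reference implementation); Step 1 is a (optionally sub-sampled) OLS solve, and Step 2 is Newton's root-finding iteration for the proportionality constant c.

```python
import numpy as np

def sls_logistic(X, y, subsample=None, tol=1e-8, max_iter=50, seed=None):
    """Scaled Least Squares (Algorithm 1) sketched for logistic regression.

    Step 1: (sub-sampled) OLS estimate beta_ols and yhat = X @ beta_ols.
    Step 2: Newton root-finding for c solving 1 = (c/n) * sum_i psi''(c * yhat_i).
    """
    n, p = X.shape
    rng = np.random.default_rng(seed)
    S = np.arange(n) if subsample is None else rng.choice(n, size=subsample, replace=False)
    XS = X[S]
    # beta_ols = ((1/|S|) X_S^T X_S)^{-1} (1/n) X^T y, as in Section 4
    beta_ols = np.linalg.solve(XS.T @ XS / len(S), X.T @ y / n)
    yhat = X @ beta_ols

    def psi2(w):  # psi''(w) for logistic regression: sigmoid(w) * (1 - sigmoid(w))
        s = 1.0 / (1.0 + np.exp(-w))
        return s * (1.0 - s)

    def psi3(w):  # psi'''(w) = psi''(w) * (1 - 2 * sigmoid(w))
        s = 1.0 / (1.0 + np.exp(-w))
        return s * (1.0 - s) * (1.0 - 2.0 * s)

    c = 2.0 / np.var(y)  # initialization suggested by eq. (8)
    for _ in range(max_iter):
        f = c * np.mean(psi2(c * yhat)) - 1.0
        fprime = np.mean(psi2(c * yhat) + c * yhat * psi3(c * yhat))
        step = f / fprime
        c -= step
        if abs(step) < tol:
            break
    return c * beta_ols
```

On synthetic Gaussian-design logistic data, the scalar Newton iteration typically converges in a handful of steps from the c = 2/Var(y) initialization, and the output is close to the GLM coefficients, in line with Proposition 1.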
We refer the reader to Section 6 for theoretical results characterizing the finite-sample behavior of the SLS.

5 Experiments

This section contains the results of a variety of numerical studies, which show that the Scaled Least Squares estimator reaches the minimum achievable test error substantially faster than commonly used batch algorithms for finding the MLE. Both logistic and Poisson regression models (two types of GLMs) are utilized in our analyses, which are based on several synthetic and real datasets.
Below, we briefly describe the optimization algorithms for the MLE that were used in the experiments.
1. Newton-Raphson (NR) achieves locally quadratic convergence by scaling the gradient by the inverse of the Hessian evaluated at the current iterate. Computing the Hessian has a per-iteration cost of O(np²), which makes it impractical for large-scale datasets.
2. Newton-Stein (NS) is a recently proposed second-order batch algorithm specifically designed for GLMs [Erd16].
The algorithm uses Stein's lemma and sub-sampling to efficiently estimate the Hessian with O(np) per-iteration cost, achieving near-quadratic rates.

[Figure 2 here: eight panels (a)-(h) of test error versus time (sec); columns: logistic regression with covariates ~ Σ × {Exp(1)−1}, logistic regression on the Higgs dataset, Poisson regression with covariates ~ Σ × Ber(±1), Poisson regression on the Covertype dataset; rows: random start and OLS start; methods: SLS, NR, NS, BFGS, LBFGS, GD, AGD.]

Figure 2: Performance of SLS
compared to that of the MLE obtained with various optimization algorithms on several datasets. The SLS is shown as a straight red line. The details are provided in Table 1.

3. Broyden-Fletcher-Goldfarb-Shanno (BFGS) is the most popular and stable quasi-Newton method [Nes04]. At each iteration, the gradient is scaled by a matrix that is formed by accumulating information from previous iterations and gradient computations. The convergence is locally super-linear with a per-iteration cost of O(np).
4. Limited memory BFGS (LBFGS) is a variant of BFGS which uses only the recent iterates and gradients to approximate the Hessian, providing a significant improvement in terms of memory usage. LBFGS has many variants; we use the formulation given in [Bis95].
5. Gradient descent (GD) takes a step in the opposite direction of the gradient, evaluated at the current iterate. Its performance strongly depends on the condition number of the design matrix. Under certain assumptions, the convergence is linear with O(np) per-iteration cost.
6. Accelerated gradient descent (AGD) is a modified version of gradient descent with an additional "momentum" term [Nes83]. Its per-iteration cost is O(np) and its performance strongly depends on the smoothness of the objective function.
For all the algorithms, the step size at each iteration is chosen via backtracking line search [BV04].
Recall that the proposed Algorithm 1 is composed of two steps; the first finds an estimate of the OLS coefficients. This up-front computation is not needed for any of the MLE algorithms described above. On the other hand, each of the MLE algorithms requires some initial value for β, but no such initialization is needed to find the OLS estimator in Algorithm 1. This raises the question of how the MLE algorithms should be initialized, in order to compare them fairly with the proposed method.
We consider two scenarios in our experiments: first, we use the OLS estimator computed for Algorithm 1 to initialize the MLE algorithms; second, we use a random initial value.
On each dataset, the main criterion for assessing the performance of the estimators is how rapidly the minimum test error is achieved. The test error is measured as the mean squared error of the estimated mean using the current parameters at each iteration on a test dataset, which is a randomly selected (and set-aside) 10% portion of the entire dataset. As noted previously, the MLE is more accurate for small n (see Figure 1). However, in the regime considered here (n ≫ p ≫ 1), the MLE and the SLS perform very similarly in terms of their error rates; for instance, on the Higgs dataset, the SLS and MLE have test error rates of 22.40% and 22.38%, respectively. For each dataset, the minimum achievable test error is set to be the maximum of the final test errors, where the maximum is taken over all of the estimation methods. Let Σ(1) and Σ(2) be two randomly generated covariance matrices. The datasets we analyzed were: (i) a synthetic dataset generated from a logistic regression model with iid {exponential(1) − 1} predictors scaled by Σ(1); (ii) the Higgs dataset (logistic regression) [BSW14]; (iii) a synthetic dataset generated from a Poisson regression model with iid binary(±1) predictors scaled by Σ(2); (iv) the Covertype dataset (Poisson regression) [BD99].
In all cases, the SLS outperformed the alternative algorithms for finding the MLE by a large margin, in terms of computation. Detailed results may be found in Figure 2 and Table 1.
We provide additional experiments with different datasets in the Supplementary Material.

Table 1: Details of the experiments shown in Figure 2. Each cell gives time in seconds / number of iterations to reach the minimum test error; (A)-(H) refer to the plots in Figure 2, and RND/OLS indicates random or OLS initialization.

                LOGISTIC REGRESSION                                  POISSON REGRESSION
DATASET  Σ×{EXP(1)−1}            HIGGS [BSW14]                Σ×BER(±1)               COVERTYPE [BD99]
SIZE     n = 6.0×10^5, p = 300   n = 1.1×10^7, p = 29         n = 6.0×10^5, p = 300   n = 5.8×10^5, p = 53
INIT     RND (A)    OLS (B)      RND (C)       OLS (D)        RND (E)     OLS (F)     RND (G)    OLS (H)
SLS      8.34/4     2.94/3       13.18/3       9.57/3         5.42/5      3.96/5      2.71/6     1.66/20
NR       301.06/6   82.57/3      37.77/3       36.37/3        170.28/5    130.1/4     16.7/8     32.48/18
NS       51.69/8    7.8/3        27.11/4       26.69/4        32.71/5     36.82/4     21.17/10   282.1/216
BFGS     148.43/31  24.79/8      660.92/68     701.9/68       67.24/29    72.42/26    5.12/7     22.74/59
LBFGS    125.33/39  24.61/8      6368.1/651    6946.1/670     224.6/106   357.1/88    10.01/14   10.05/17
GD       669/138    134.91/25    100871/10101  141736/13808   1711/513    1364/374    14.35/25   33.58/87
AGD      218.1/61   35.97/12     2879.69/277   2405.5/251     103.3/51    102.74/40   11.28/15   11.95/25

6 Theoretical results

In this section, we use the zero-bias transformations [GR97] to generalize the equivalence between OLS and GLMs to settings where the covariates are non-Gaussian.

Definition 1. Let z be a random variable with mean 0 and variance σ². Then, there exists a random variable z* that satisfies E[z f(z)] = σ² E[f^(1)(z*)] for all differentiable functions f. The distribution of z* is said to be the z-zero-bias distribution.

The existence of z* in Definition 1 is a consequence of the Riesz representation theorem [GR97]. The normal distribution is the unique distribution whose zero-bias transformation is itself (i.e.
the normal distribution is a fixed point of the operation mapping the distribution of z to that of z*).
To provide some intuition behind the usefulness of the zero-bias transformation, we refer back to the proof of Proposition 1. For simplicity, assume that the covariate vector x_i has iid entries with mean 0 and variance 1. Then the zero-bias transformation applied to the j-th normal equation in (4) yields

E[y_i x_ij] = E[ x_ij ψ^(1)( x_ij β_j + Σ_{k≠j} x_ik β_k ) ]        (j-th normal equation)
            = β_j E[ ψ^(2)( x*_ij β_j + Σ_{k≠j} x_ik β_k ) ].       (zero-bias transformation)   (9)

The distribution of x*_ij is the x_ij-zero-bias distribution and is entirely determined by the distribution of x_ij; general properties of x*_ij can be found, for example, in [CGS10]. If β is well spread, it turns out that, taken together with j = 1, ..., p, the far right-hand side in (9) behaves similarly to the right side of (5) with Σ = I; that is, the behavior is similar to the Gaussian case, where the proportionality relationship given in Proposition 1 holds. This argument leads to an approximate proportionality relationship for non-Gaussian predictors, which, when carried out rigorously, yields the following.

Theorem 1. Suppose that the covariate vector x_i has mean 0 and covariance matrix Σ and, furthermore, that the random vector Σ^{-1/2} x_i has independent entries and its sub-Gaussian norm is bounded by κ. Assume that the function ψ^(2) is Lipschitz continuous with constant k. Let ‖β^glm‖₂ = τ and assume β^glm is r-well-spread in the sense that τ/‖β^glm‖_∞ = r√p for some r ∈ (0, 1]. Then, for c = 1/E[ψ^(2)(⟨x_i, β^glm⟩)], and ρ = ρ_∞(Σ^{1/2}) denoting the condition number of Σ^{1/2}, we have

‖(1/c) β^glm − β^ols‖_∞ ≤ η/p, where η = 8 k κ³ ρ ‖Σ^{1/2}‖_∞ (τ/r)².   (10)

Theorem 1 is proved in the Supplementary Material.
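Definition 1 is easy to check by simulation in a case where the zero-bias distribution is known in closed form: for a Rademacher variable z (P(z = ±1) = 1/2, so σ² = 1), the z-zero-bias distribution is Uniform(−1, 1), a standard example from the zero-bias literature [GR97]. A short Monte Carlo sketch (the test function f is our choice):

```python
import numpy as np

# Monte Carlo check of Definition 1 for Rademacher z (sigma^2 = 1), whose
# zero-bias distribution z* is Uniform(-1, 1): E[z f(z)] should equal E[f'(z*)].
rng = np.random.default_rng(1)
f = lambda z: z ** 3            # test function; f'(z) = 3 z^2
z = rng.choice([-1.0, 1.0], size=200_000)
z_star = rng.uniform(-1.0, 1.0, size=200_000)
lhs = np.mean(z * f(z))         # E[z^4] = 1 exactly for Rademacher z
rhs = np.mean(3.0 * z_star ** 2)  # 3 * E[(z*)^2] = 3 * (1/3) = 1 in expectation
assert abs(lhs - rhs) < 0.02
```

The same identity, applied coordinate-wise with f = ψ^(1)(· β_j + rest), is exactly what produces the second line of (9).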
It implies that the population parameters β^ols and β^glm are approximately equivalent up to a scaling factor, with an error bound of O(1/p). The assumption that β^glm is well-spread can be relaxed with minor modifications. For example, if we have a sparse coefficient vector, where supp(β^glm) = {j; β^glm_j ≠ 0} is the support set of β^glm, then Theorem 1 holds with p replaced by the size of the support set.
An interesting consequence of Theorem 1 and the remarks following the theorem is that whenever an entry of β^glm is zero, the corresponding entry of β^ols has to be small, and conversely. For λ ≥ 0, define the lasso coefficients

β^lasso_λ = argmin_{β ∈ R^p} (1/2) E[ (y_i − ⟨x_i, β⟩)² ] + λ ‖β‖₁.   (11)

Corollary 1. If E[x_i] = 0 and E[x_i x_i^T] = I, then for any λ ≥ η/|supp(β^glm)| we have supp(β^lasso_λ) ⊂ supp(β^glm). Further, if λ and β^glm also satisfy that for all j ∈ supp(β^glm), |β^glm_j|/c > λ + η/|supp(β^glm)|, then we have supp(β^lasso_λ) = supp(β^glm).

So far in this section, we have only discussed properties of the population parameters, such as β^glm. In the remainder of this section, we turn our attention to results for the estimators that are the main focus of this paper; these results ultimately build on our earlier results, i.e. Theorem 1.
In order to precisely describe the performance of β̂^sls, we first need bounds on the OLS estimator. The OLS estimator has been studied extensively in the literature; however, for our purposes, we find it convenient to derive a new bound on its accuracy. While we have not seen this exact bound elsewhere, it is very similar to Theorem 5 of [DLFU13].

Proposition 2. Assume that E[x_i] = 0 and E[x_i x_i^T] = Σ, and that Σ^{-1/2} x_i and y_i are sub-Gaussian with norms κ and b, respectively.
For λ_min denoting the smallest eigenvalue of Σ, and |S| > ηp, we have

‖β̂^ols − β^ols‖₂ ≤ η λ_min^{-1/2} √(p/|S|),   (12)

with probability at least 1 − 3e^{-p}, where η depends only on b and κ.

Proposition 2 is proved in the Supplementary Material. Our main result on the performance of β̂^sls is given next.

Theorem 2. Let the assumptions of Theorem 1 and Proposition 2 hold with E[‖Σ^{-1/2}x‖₂] = µ̃√p. Further assume that the function f(z) = z E[ψ^(2)(⟨x, β^ols⟩ z)] satisfies f(c̄) > 1 + ε̄√p for some c̄ and ε̄ such that the derivative of f in the interval [0, c̄] does not change sign, i.e., its absolute value is lower bounded by ν > 0. Then, for n and |S| sufficiently large, we have

‖β̂^sls − β^glm‖_∞ ≤ η₁ (1/p) + η₂ √( p / min{n/log(n), |S|/p} ),   (13)

with probability at least 1 − 5e^{-p}, where the constants η₁ and η₂ are defined by

η₁ = η k c̄ κ³ ρ ‖Σ^{1/2}‖_∞ (τ/r)²,   (14)
η₂ = η c̄ λ_min^{-1/2} ( 1 + ν^{-1} λ_min^{-1/2} ‖β^ols‖_∞ max{ b + k/µ̃, k c̄ } ),   (15)

and η > 0 is a constant depending on κ and b.
Note that the convergence rate of the upper bound in (13) depends on the sum of two terms, both of which are functions of the data dimensions n and p. The first term on the right in (13) comes from Theorem 1, which bounds the discrepancy between c × β^ols and β^glm. This term is small when p is large, and it does not depend on the number of observations n.
The second term in the upper bound (13) comes from estimating β^ols and c. This term is increasing in p, which reflects the fact that estimating β^glm is more challenging when p is large. As expected, this term is decreasing in n and |S|, i.e. larger sample size yields better estimates.
When the full OLS solution is used ($|S| = n$), the second term becomes $O(\sqrt{p\max\{\log(n), p\}/n}) = O(p/\sqrt{n})$ for $p$ sufficiently large. This suggests that $n$ should be at least of order $p^2$ for good performance.

7 Discussion

In this paper, we showed that the coefficients of GLMs and OLS are approximately proportional in the general random design setting. Using this relation, we proposed a computationally efficient algorithm for large-scale problems that achieves the same accuracy as the MLE by first estimating the OLS coefficients and then estimating the proportionality constant through iterations that can attain a quadratic or cubic convergence rate, with only $O(n)$ per-iteration cost.

We briefly mentioned in Section 3.1 that the proportionality between the coefficients holds even when there is regularization. Further pursuing this idea may be interesting for large-scale problems where regularization is crucial. Another interesting line of research is to find similar proportionality relations between the parameters in other large-scale optimization problems such as support vector machines. Such relations may reduce the problem complexity significantly.

References

[BD99] J. A. Blackard and D. J. Dean, Comparative accuracies of artificial neural networks and discriminant analysis in predicting forest cover types from cartographic variables, Comput. Electron. Agr. 24 (1999), 131-151.
[BEM13] M. Bayati, M. A. Erdogdu, and A. Montanari, Estimating lasso risk and noise level, NIPS 26, 2013, pp. 944-952.
[Bis95] C. M. Bishop, Neural Networks for Pattern Recognition, Oxford University Press, 1995.
[Bri82] D. R. Brillinger, A generalized linear model with "Gaussian" regressor variables, A Festschrift for Erich L. Lehmann, CRC Press, 1982, pp. 97-114.
[BSW14] P. Baldi, P. Sadowski, and D. Whiteson, Searching for exotic particles in high-energy physics with deep learning, Nat. Commun. 5 (2014), 4308.
[BV04] S. Boyd and L. Vandenberghe, Convex Optimization, Cambridge University Press, 2004.
[CGS10] L. H. Y. Chen, L. Goldstein, and Q.-M. Shao, Normal approximation by Stein's method, Springer, 2010.
[DLFU13] P. Dhillon, Y. Lu, D. P. Foster, and L. Ungar, New subsampling algorithms for fast least squares regression, NIPS 26 (2013), 360-368.
[EM15] M. A. Erdogdu and A. Montanari, Convergence rates of sub-sampled Newton methods, NIPS 28, 2015, pp. 3034-3042.
[Erd15] M. A. Erdogdu, Newton-Stein method: A second order method for GLMs via Stein's lemma, NIPS 28 (2015), 1216-1224.
[Erd16] M. A. Erdogdu, Newton-Stein method: An optimization method for GLMs via Stein's lemma, Journal of Machine Learning Research (to appear) (2016).
[Fis36] R. A. Fisher, The use of multiple measurements in taxonomic problems, Ann. Eugenic 7 (1936), 179-188.
[Gol07] L. Goldstein, L1 bounds in normal approximation, Ann. Probab. 35 (2007), 1888-1930.
[GR97] L. Goldstein and G. Reinert, Stein's method and the zero bias transformation with application to simple random sampling, Ann. Appl. Probab. 7 (1997), 935-952.
[HS52] M. R. Hestenes and E. Stiefel, Methods of conjugate gradients for solving linear systems, J. Res. Nat. Bur. Stand. 49 (1952), 409-436.
[KF09] D. Koller and N. Friedman, Probabilistic Graphical Models: Principles and Techniques, MIT Press, 2009.
[LD89] K.-C. Li and N. Duan, Regression analysis under link violation, Ann. Stat. 17 (1989), 1009-1052.
[Mar10] J. Martens, Deep learning via Hessian-free optimization, ICML 27 (2010), 735-742.
[MN89] P. McCullagh and J. A. Nelder, Generalized Linear Models, 2nd ed., Chapman and Hall, 1989.
[Nes83] Y. Nesterov, A method of solving a convex programming problem with convergence rate O(1/k^2), Soviet Math. Dokl. 27 (1983), 372-376.
[Nes04] Y. Nesterov, Introductory Lectures on Convex Optimization: A Basic Course, Springer, 2004.
[PS75] C. C. Paige and M. A. Saunders, Solution of sparse indefinite systems of linear equations, SIAM J. Numer. Anal. 12 (1975), 617-629.
[PV15] Y. Plan and R. Vershynin, The generalized lasso with non-linear observations, 2015, arXiv preprint arXiv:1502.04071.
[RT08] V. Rokhlin and M. Tygert, A fast randomized algorithm for overdetermined linear least-squares regression, P. Natl. Acad. Sci. 105 (2008), 13212-13217.
[TAH15] C. Thrampoulidis, E. Abbasi, and B. Hassibi, Lasso with non-linear measurements is equivalent to one with linear measurements, NIPS 28 (2015), 3402-3410.
[Ver10] R. Vershynin, Introduction to the non-asymptotic analysis of random matrices, 2010, arXiv preprint arXiv:1011.3027.
[WJ08] M. J. Wainwright and M. I. Jordan, Graphical models, exponential families, and variational inference, Foundations and Trends in Machine Learning 1 (2008), 1-305.