{"title": "Newton-Stein Method: A Second Order Method for GLMs via Stein's Lemma", "book": "Advances in Neural Information Processing Systems", "page_first": 1216, "page_last": 1224, "abstract": "We consider the problem of efficiently computing the maximum likelihood estimator in Generalized Linear Models (GLMs)when the number of observations is much larger than the number of coefficients (n > > p > > 1). In this regime, optimization algorithms can immensely benefit fromapproximate second order information.We propose an alternative way of constructing the curvature information by formulatingit as an estimation problem and applying a Stein-type lemma, which allows further improvements through sub-sampling andeigenvalue thresholding.Our algorithm enjoys fast convergence rates, resembling that of second order methods, with modest per-iteration cost. We provide its convergence analysis for the case where the rows of the design matrix are i.i.d. samples with bounded support.We show that the convergence has two phases, aquadratic phase followed by a linear phase. Finally,we empirically demonstrate that our algorithm achieves the highest performancecompared to various algorithms on several datasets.", "full_text": "Newton-Stein Method:\n\nA Second Order Method for GLMs via Stein\u2019s Lemma\n\nMurat A. Erdogdu\n\nDepartment of Statistics\n\nStanford University\n\nerdogdu@stanford.edu\n\nAbstract\n\nWe consider the problem of ef\ufb01ciently computing the maximum likelihood esti-\nmator in Generalized Linear Models (GLMs) when the number of observations\nis much larger than the number of coef\ufb01cients (n p 1). In this regime, op-\ntimization algorithms can immensely bene\ufb01t from approximate second order in-\nformation. We propose an alternative way of constructing the curvature informa-\ntion by formulating it as an estimation problem and applying a Stein-type lemma,\nwhich allows further improvements through sub-sampling and eigenvalue thresh-\nolding. 
Our algorithm enjoys fast convergence rates, resembling that of second order methods, with modest per-iteration cost. We provide its convergence analysis for the case where the rows of the design matrix are i.i.d. samples with bounded support. We show that the convergence has two phases, a quadratic phase followed by a linear phase. Finally, we empirically demonstrate that our algorithm achieves the highest performance compared to various algorithms on several datasets.

1 Introduction
Generalized Linear Models (GLMs) play a crucial role in numerous statistical and machine learning problems. GLMs formulate the natural parameter in exponential families as a linear model and provide a versatile framework for statistical methodology and supervised learning tasks. Celebrated examples include linear, logistic, and multinomial regressions, and applications to graphical models [MN89, KF09].
In this paper, we focus on how to solve the maximum likelihood problem efficiently in the GLM setting when the number of observations n is much larger than the dimension of the coefficient vector p, i.e., n ≫ p. The GLM optimization task is typically expressed as a minimization problem where the objective function is the negative log-likelihood, denoted by ℓ(β), where β ∈ Rᵖ is the coefficient vector. Many optimization algorithms are available for such minimization problems [Bis95, BV04, Nes04]. However, only a few use the special structure of GLMs. In this paper, we consider updates that are specifically designed for GLMs, which are of the form

β ← β − γ Q ∇ℓ(β),  (1.1)

where γ is the step size and Q is a scaling matrix which provides curvature information.
For updates of the form Eq. (1.1), the performance of the algorithm is mainly determined by the scaling matrix Q.
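To make the role of the scaling matrix concrete, here is a minimal numpy sketch (ours, for illustration only; a quadratic least-squares loss stands in for the GLM likelihood) of the update family in Eq. (1.1), where Q = I recovers gradient descent and Q equal to the inverse Hessian recovers Newton's method:

```python
# Illustrative sketch (not the paper's algorithm): the update family
# beta <- beta - gamma * Q @ grad(beta) of Eq. (1.1). Q = I gives
# gradient descent; Q = inverse Hessian gives Newton's method.
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((50, 5))
b = rng.standard_normal(50)

def grad(beta):
    # Gradient of the least-squares loss 0.5 * ||A beta - b||^2 / n.
    return A.T @ (A @ beta - b) / len(b)

H = A.T @ A / len(b)  # Hessian (constant for a quadratic loss)

def iterate(Q, gamma, steps):
    beta = np.zeros(5)
    for _ in range(steps):
        beta = beta - gamma * Q @ grad(beta)
    return beta

beta_gd = iterate(np.eye(5), gamma=0.5, steps=500)       # Q = I
beta_nm = iterate(np.linalg.inv(H), gamma=1.0, steps=5)  # Q = H^{-1}
# Both converge to the least-squares solution; Newton's method
# gets there in far fewer (here, one) iterations.
```

The trade-off the text describes is visible here: the Newton choice of Q needs very few iterations but requires forming and inverting H, while the identity choice is cheap per step but needs many steps.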
Classical Newton's Method (NM) and Natural Gradient Descent (NG) are recovered by simply taking Q to be the inverse Hessian and the inverse Fisher information at the current iterate, respectively [Ama98, Nes04]. Second order methods may achieve a quadratic convergence rate, yet they suffer from the excessive cost of computing the scaling matrix at every iteration. On the other hand, if we take Q to be the identity matrix, we recover the simple Gradient Descent (GD) method, which has a linear convergence rate. Although GD's convergence rate is slow compared to that of second order methods, its modest per-iteration cost makes it practical for large-scale problems. The trade-off between the convergence rate and the per-iteration cost has been extensively studied [BV04, Nes04]. In the n ≫ p regime, the main objective is to construct a scaling matrix Q that is computationally feasible and provides sufficient curvature information. For this purpose, several Quasi-Newton methods have been proposed [Bis95, Nes04]. Updates given by Quasi-Newton methods satisfy an equation which is often referred to as the Quasi-Newton relation. A well-known member of this class of algorithms is the Broyden-Fletcher-Goldfarb-Shanno (BFGS) algorithm [Nes04].
In this paper, we propose an algorithm that utilizes the structure of GLMs by relying on a Stein-type lemma [Ste81]. It attains a fast convergence rate with low per-iteration cost. We call our algorithm the Newton-Stein method, which we abbreviate as NewSt. Our contributions are summarized as follows:
• We recast the problem of constructing a scaling matrix as an estimation problem and apply a Stein-type lemma along with sub-sampling to form a computationally feasible Q.
• Newton's method's O(np² + p³) per-iteration cost is replaced by an O(np + p²) per-iteration cost and a one-time O(|S|p²) cost, where |S| is the sub-sample size.
• Assuming that the rows of the design matrix are i.i.d.
and have bounded support, and denoting the iterates of the Newton-Stein method by {β̂_t}_{t≥0}, we prove a bound of the form

‖β̂_{t+1} − β*‖₂ ≤ τ₁ ‖β̂_t − β*‖₂ + τ₂ ‖β̂_t − β*‖₂²,  (1.2)

where β* is the minimizer and τ₁, τ₂ are the convergence coefficients. The above bound implies that the convergence starts with a quadratic phase and transitions into a linear one later.
• We demonstrate its performance on four datasets by comparing it to several algorithms.

The rest of the paper is organized as follows: Section 1.1 surveys the related work and Section 1.2 introduces the notation used throughout the paper. Section 2 briefly discusses the GLM framework and its relevant properties. In Section 3, we introduce the Newton-Stein method, develop its intuition, and discuss the computational aspects. Section 4 covers the theoretical results, and in Section 4.3 we discuss how to choose the algorithm parameters. Finally, in Section 5, we provide the empirical results, comparing the proposed algorithm with several other methods on four datasets.
1.1 Related work
There are numerous optimization techniques that can be used to find the maximum likelihood estimator in GLMs. For moderate values of n and p, classical second order methods such as NM and NG are commonly used. In large-scale problems, data dimensionality is the main factor when choosing the right optimization method. Large-scale optimization tasks have been extensively studied through online and batch methods. Online methods use a gradient (or sub-gradient) of a single, randomly selected observation to update the current iterate [Bot10]. Their per-iteration cost is independent of n, but the convergence rate might be extremely slow.
There are several extensions of the classical stochastic gradient descent algorithm (SGD), providing significant improvement and/or stability [Bot10, DHS11, SRB13].
On the other hand, batch algorithms enjoy faster convergence rates, though their per-iteration cost may be prohibitive. In particular, second order methods attain a quadratic rate, but constructing the Hessian matrix requires excessive computation. Many algorithms aim at forming an approximate, cost-efficient scaling matrix. This idea lies at the core of Quasi-Newton methods [Bis95].
Another approach to constructing an approximate Hessian makes use of sub-sampling techniques [Mar10, BCNN11, VP12, EM15]. Many contemporary learning methods rely on sub-sampling as it is simple and it provides a significant boost over the first order methods. Further improvements through conjugate gradient methods and Krylov sub-spaces are available.
Many hybrid variants of the aforementioned methods have been proposed. Examples include combinations of sub-sampling and Quasi-Newton methods [BHNS14], SGD and GD [FS12], NG and NM [LRF10], and NG and low-rank approximation [LRMB08]. Lastly, algorithms that specialize in certain types of GLMs include coordinate descent methods for penalized GLMs [FHT10] and trust region Newton methods [LWK08].
1.2 Notation
Let [n] = {1, 2, ..., n}, and denote the size of a set S by |S|. The gradient and the Hessian of f with respect to β are denoted by ∇_β f and ∇²_β f, respectively. The j-th derivative of a function g is denoted by g⁽ʲ⁾. For a vector x ∈ Rᵖ and a matrix X ∈ R^{p×p}, ‖x‖₂ and ‖X‖₂ denote the ℓ₂ and spectral norms, respectively. P_C is the Euclidean projection onto the set C, and B_p(R) ⊂ Rᵖ is the ball of radius R.
For random variables x, y, d(x, y) and D(x, y) denote probability metrics (to be explicitly defined later), measuring the distance between the distributions of x and y.

2 Generalized Linear Models
The distribution of a random variable y ∈ R belongs to an exponential family with natural parameter η ∈ R if its density can be written in the form f(y|η) = exp(ηy − φ(η)) h(y), where φ is the cumulant generating function and h is the carrier density. Let y₁, y₂, ..., yₙ be independent observations such that ∀i ∈ [n], yᵢ ∼ f(yᵢ|ηᵢ). For η = (η₁, ..., ηₙ), the joint likelihood is

f(y₁, y₂, ..., yₙ|η) = exp( Σ_{i=1}^n [yᵢηᵢ − φ(ηᵢ)] ) Π_{i=1}^n h(yᵢ).

We consider the problem of learning the maximum likelihood estimator in the above exponential family framework, where the vector η ∈ Rⁿ is modeled through the linear relation

η = Xβ,

for some design matrix X ∈ R^{n×p} with rows xᵢ ∈ Rᵖ, and a coefficient vector β ∈ Rᵖ. This formulation is known as Generalized Linear Models (GLMs) in canonical form. The cumulant generating function determines the class of GLMs; i.e., for ordinary least squares (OLS), φ(z) = z², and for logistic regression (LR), φ(z) = log(1 + eᶻ).
Maximum likelihood estimation in the above formulation is equivalent to minimizing the negative log-likelihood function ℓ(β),

ℓ(β) = (1/n) Σ_{i=1}^n [φ(⟨xᵢ, β⟩) − yᵢ⟨xᵢ, β⟩],  (2.1)

where ⟨x, β⟩ is the inner product between the vectors x and β. The relation to OLS and LR can be seen more easily by plugging the corresponding φ(z) into Eq. (2.1). The gradient and the Hessian of ℓ(β) can be written as:

∇_β ℓ(β) = (1/n) Σ_{i=1}^n [φ⁽¹⁾(⟨xᵢ, β⟩)xᵢ − yᵢxᵢ],  ∇²_β ℓ(β) = (1/n) Σ_{i=1}^n φ⁽²⁾(⟨xᵢ, β⟩) xᵢxᵢᵀ.  (2.2)

For a sequence of scaling matrices {Q_t}_{t>0} ⊂ R^{p×p}, we consider iterations of the form

β̂_{t+1} = β̂_t − γ_t Q_t ∇ℓ(β̂_t),

where γ_t is the step size.
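As a concrete instance of Eq. (2.1)–(2.2), here is a short numpy sketch (ours, for illustration) of the negative log-likelihood, gradient, and Hessian for logistic regression, where φ(z) = log(1 + eᶻ), φ⁽¹⁾(z) = σ(z) is the sigmoid, and φ⁽²⁾(z) = σ(z)(1 − σ(z)):

```python
# Sketch of Eq. (2.1)-(2.2) for logistic regression, phi(z) = log(1 + e^z).
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def loss(beta, X, y):
    """Negative log-likelihood (2.1): (1/n) sum phi(<x_i,b>) - y_i <x_i,b>."""
    z = X @ beta
    return np.mean(np.logaddexp(0.0, z) - y * z)  # logaddexp is a stable log(1+e^z)

def gradient(beta, X, y):
    """Gradient (2.2): (1/n) sum [phi'(<x_i,b>) - y_i] x_i."""
    z = X @ beta
    return X.T @ (sigmoid(z) - y) / len(y)

def hessian(beta, X, y):
    """Hessian (2.2): (1/n) sum phi''(<x_i,b>) x_i x_i^T."""
    w = sigmoid(X @ beta)
    return (X * (w * (1.0 - w))[:, None]).T @ X / len(y)
```

A quick finite-difference check of `gradient` against `loss` is a useful sanity test; note that the Hessian is symmetric positive semi-definite since φ⁽²⁾ ≥ 0 for the logistic model.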
The above iteration is our main focus, but with a new approach to computing the sequence of matrices {Q_t}_{t>0}. We formulate the problem of finding a scalable Q_t as an estimation problem and use a Stein-type lemma that provides a computationally efficient update.

3 Newton-Stein Method
The classical Newton-Raphson update is generally used for training GLMs. However, its per-iteration cost makes it impractical for large-scale optimization. The main bottleneck is the computation of the Hessian matrix, which requires O(np²) flops and is prohibitive when n ≫ p ≫ 1. Numerous methods have been proposed to achieve NM's fast convergence rate while keeping the per-iteration cost manageable.
The task of constructing an approximate Hessian can be viewed as an estimation problem. Assuming that the rows of X are i.i.d. random vectors, the Hessian of GLMs with cumulant generating function φ has the following form:

[Q_t]⁻¹ = (1/n) Σ_{i=1}^n xᵢxᵢᵀ φ⁽²⁾(⟨xᵢ, β⟩) ≈ E[xxᵀ φ⁽²⁾(⟨x, β⟩)].

We observe that [Q_t]⁻¹ is just a sum of i.i.d. matrices. Hence, the true Hessian is nothing but a sample mean estimator of its expectation. Another natural estimator would be the sub-sampled Hessian method suggested by [Mar10, BCNN11, EM15]. Similarly, our goal is to propose an appropriate estimator that is also computationally efficient.
We use the following Stein-type lemma to derive an efficient estimator of the expectation of the Hessian.
Lemma 3.1 (Stein-type lemma). Assume that x ∼ N_p(0, Σ) and β ∈ Rᵖ is a constant vector. Then for any function f : R → R that is twice "weakly" differentiable, we have

E[xxᵀ f(⟨x, β⟩)] = E[f(⟨x, β⟩)] Σ + E[f⁽²⁾(⟨x, β⟩)] Σββᵀ Σ.  (3.1)

Algorithm 1 Newton-Stein method
Input: β̂_0, r, ε, γ.
1. Set t = 0 and sub-sample a set of indices S ⊂ [n] uniformly at random.
2. Compute: σ̂² = λ_{r+1}(Σ̂_S), and ζ_r(Σ̂_S) = σ̂²I + argmin_{rank(M)=r} ‖Σ̂_S − σ̂²I − M‖_F.
3. while ‖β̂_{t+1} − β̂_t‖₂ > ε do
   μ̂₂(β̂_t) = (1/n) Σ_{i=1}^n φ⁽²⁾(⟨xᵢ, β̂_t⟩),  μ̂₄(β̂_t) = (1/n) Σ_{i=1}^n φ⁽⁴⁾(⟨xᵢ, β̂_t⟩),
   Q_t = (1/μ̂₂(β̂_t)) [ ζ_r(Σ̂_S)⁻¹ − β̂_t[β̂_t]ᵀ / ( μ̂₂(β̂_t)/μ̂₄(β̂_t) + ⟨ζ_r(Σ̂_S)β̂_t, β̂_t⟩ ) ],
   β̂_{t+1} = P_{B_p(R)}( β̂_t − γ Q_t ∇ℓ(β̂_t) ),
   t ← t + 1.
4. end while
Output: β̂_t.

The proof of Lemma 3.1 is given in the Appendix. The right hand side of Eq. (3.1) is a rank-1 update of the first term. Hence, its inverse can be computed with O(p²) cost. The quantities that change at each iteration are the ones that depend on β, i.e.,

μ₂(β) = E[φ⁽²⁾(⟨x, β⟩)] and μ₄(β) = E[φ⁽⁴⁾(⟨x, β⟩)].

μ₂(β) and μ₄(β) are scalar quantities and can be estimated by their corresponding sample means μ̂₂(β) and μ̂₄(β) (explicitly defined at Step 3 of Algorithm 1), with only O(np) computation.
To complete the estimation task suggested by Eq. (3.1), we need an estimator for the covariance matrix Σ. A natural estimator is the sample mean where we only use a sub-sample S ⊂ [n], so that the cost is reduced to O(|S|p²) from O(np²). The sub-sampling based sample mean estimator is denoted by Σ̂_S = Σ_{i∈S} xᵢxᵢᵀ/|S|, which is widely used in large-scale problems [Ver10]. We highlight the fact that Lemma 3.1 replaces NM's O(np²) per-iteration cost with a one-time cost of O(np²). We further use sub-sampling to reduce this one-time cost to O(|S|p²).
In general, important curvature information is contained in the largest few spectral features. Following [EM15], we take the largest r eigenvalues of the sub-sampled covariance estimator, setting the rest to the (r+1)-th eigenvalue. This operation helps denoising and requires only O(rp²) computation.
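The curvature estimate suggested by Lemma 3.1, the rank-r eigenvalue thresholding, and the rank-1 inversion can be sketched as follows (our illustrative numpy code, not the author's implementation; the logistic-regression derivatives φ⁽²⁾, φ⁽⁴⁾ are an assumption chosen for the demo):

```python
# Sketch of the Newton-Stein curvature estimate: by Lemma 3.1,
#   E[x x^T phi''(<x,beta>)] = mu_2 * Sigma + mu_4 * (Sigma beta)(Sigma beta)^T,
# a rank-1 update of (a multiple of) Sigma, so its inverse follows from the
# Sherman-Morrison formula with O(p^2) per-iteration work.
import numpy as np

def phi2(z):                      # phi'' for logistic regression
    s = 1.0 / (1.0 + np.exp(-z))
    return s * (1.0 - s)

def phi4(z):                      # phi'''' for logistic regression
    s = 1.0 / (1.0 + np.exp(-z))
    return s * (1.0 - s) * (1.0 - 6.0 * s * (1.0 - s))

def thresholded_cov(X_sub, r):
    """Step 2 of Algorithm 1: keep the top-r eigenvalues of the sub-sampled
    covariance, set the remaining ones to the (r+1)-th eigenvalue."""
    S = X_sub.T @ X_sub / X_sub.shape[0]
    vals, vecs = np.linalg.eigh(S)            # eigenvalues in ascending order
    vals[: len(vals) - r] = vals[len(vals) - r - 1]
    return (vecs * vals) @ vecs.T

def newton_stein_Q(X, beta, Sigma_r):
    """Q_t from Step 3 of Algorithm 1, via Sherman-Morrison."""
    z = X @ beta
    mu2, mu4 = phi2(z).mean(), phi4(z).mean()
    Sinv = np.linalg.inv(Sigma_r)             # computed once, reused across iterations
    denom = mu2 / mu4 + beta @ (Sigma_r @ beta)
    return (Sinv - np.outer(beta, beta) / denom) / mu2
```

One can verify algebraically (and numerically) that this Q is the exact inverse of μ̂₂ ζ_r + μ̂₄ (ζ_r β)(ζ_r β)ᵀ, which is what makes the O(p²) per-iteration cost possible.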
Step 2 of Algorithm 1 performs this procedure.
Inverting the constructed Hessian estimator can make use of the low-rank structure several times. First, notice that the update in Eq. (3.1) is a rank-1 matrix addition. Hence, we can simply use a matrix inversion formula to derive an explicit expression (see Q_t in Step 3 of Algorithm 1). This formulation imposes another inverse operation, on the covariance estimator. Since the covariance estimator is based on a rank-r approximation, one can utilize the low-rank inversion formula again. We emphasize that this operation is performed only once. Therefore, instead of NM's per-iteration cost of O(p³) due to inversion, the Newton-Stein method (NewSt) requires O(p²) per iteration and a one-time cost of O(rp²). Assuming that NewSt and NM converge in T₁ and T₂ iterations respectively, the overall complexity of NewSt is O(npT₁ + p²T₁ + (|S| + r)p²) ≈ O(npT₁ + p²T₁ + |S|p²), whereas that of NM is O(np²T₂ + p³T₂).
Even though Lemma 3.1 assumes that the covariates are multivariate Gaussian random vectors, in Section 4 the only assumption we make on the covariates is that they have bounded support, which covers a wide class of random variables. The left plot of Figure 1 shows that the estimation is accurate for various distributions. This is a consequence of the fact that the proposed estimator in Eq. (3.1) relies on the distribution of x only through inner products of the form ⟨x, v⟩, which in turn results in an approximately normal distribution, due to the central limit theorem, when p is sufficiently large. We will discuss this phenomenon in detail in Section 4.
The convergence rate of the Newton-Stein method has two phases. Convergence starts quadratically and transitions into a linear rate when it gets close to the true minimizer. The phase transition behavior can be observed through the right plot in Figure 1. This is a consequence of the bound provided in Eq.
(1.2), which is the main result of our theorems stated in Section 4.

[Figure 1: Left panel, "Difference between estimated and true Hessian": log₁₀(estimation error) vs. dimension p for Bernoulli, Gaussian, Poisson, and Uniform covariates. Right panel, "Convergence Rate": log₁₀(error) vs. iterations for NewSt with sub-sample sizes |S| = 1000 and |S| = 10000.]

Figure 1: The left plot demonstrates the accuracy of the proposed Hessian estimation over different distributions. The number of observations is set to n = O(p log(p)). The right plot shows the phase transition in the convergence rate of the Newton-Stein method (NewSt): convergence starts with a quadratic rate and transitions into linear. Plots are obtained using the Covertype dataset.

4 Theoretical results
We start this section by introducing the terms that will appear in the theorems. Then, we provide our technical results on uniformly bounded covariates. The proofs are provided in the Appendix.
4.1 Preliminaries
The Hessian estimation described in the previous section relies on a Gaussian approximation. For theoretical purposes, we use the following probability metric to quantify the gap between the distribution of the xᵢ's and that of a normal vector.
Definition 1. Given a family of functions H, random vectors x, y ∈ Rᵖ, and any h ∈ H, define

d_H(x, y) = sup_{h∈H} d_h(x, y), where d_h(x, y) = |E[h(x)] − E[h(y)]|,

and, for later use,

H₃ = { h(x) = ⟨v, x⟩² φ⁽²⁾(⟨x, β⟩) : β ∈ B_p(R), ‖v‖₂ = 1 }.

Many probability metrics can be expressed as above by choosing a suitable function class H. Examples include the Total Variation (TV), Kolmogorov and Wasserstein metrics [GS02, CGS10].
Based on the second and fourth derivatives of the cumulant generating function, we define the following classes:

H₁ = { h(x) = φ⁽²⁾(⟨x, β⟩) : β ∈ B_p(R) },  H₂ = { h(x) = φ⁽⁴⁾(⟨x, β⟩) : β ∈ B_p(R) },

where B_p(R) ⊂ Rᵖ is the ball of radius R. Exact calculation of such probability metrics is often difficult. The general approach is to upper bound the distance by a more intuitive metric. In our case, we observe that d_{H_j}(x, y) for j = 1, 2, 3 can be easily upper bounded by d_TV(x, y), up to a scaling constant, when the covariates have bounded support.
We further assume that the covariance matrix follows the r-spiked model, i.e.,

Σ = σ²I + Σ_{i=1}^r θᵢuᵢuᵢᵀ,

which is commonly encountered in practice [BS06]. This simply means that the first r eigenvalues of the covariance matrix are large and the rest are small and equal to each other. The large eigenvalues of Σ correspond to the signal part, and the small ones (denoted by σ²) can be considered the noise component.

4.2 Composite convergence rate
We have the following per-step bound for the iterates generated by the Newton-Stein method when the covariates are supported on a p-dimensional ball.
Theorem 4.1. Assume that the covariates x₁, x₂, ..., xₙ are i.i.d. random vectors supported on a ball of radius √K with

E[xᵢ] = 0 and E[xᵢxᵢᵀ] = Σ,

where Σ follows the r-spiked model. Further assume that the cumulant generating function φ has bounded 2nd-5th derivatives and that R is the radius of the projection P_{B_p(R)}. For {β̂_t}_{t>0} given by the Newton-Stein method with γ = 1, define the event (4.1) below, for some positive constant ξ, where β* is the optimal value.
E = { μ₂(β̂_t) + μ₄(β̂_t)⟨Σβ̂_t, β̂_t⟩ > ξ,  β* ∈ B_p(R) }.  (4.1)

If n, |S| and p are sufficiently large, then there exist constants c, c₁, c₂ and κ, depending on the radii K, R, on P(E), and on the bounds on |φ⁽²⁾| and |φ⁽⁴⁾|, such that, conditioned on the event E, with probability at least 1 − c/p², we have

‖β̂_{t+1} − β*‖₂ ≤ τ₁ ‖β̂_t − β*‖₂ + τ₂ ‖β̂_t − β*‖₂²,  (4.2)

where the coefficients τ₁ and τ₂ are deterministic constants defined as

τ₁ = κ D(x, z) + c₁κ √( p / min{ p|S|/log(p), n/log(n) } ),  τ₂ = c₂κ,

and D(x, z) is defined as

D(x, z) = ‖Σ‖₂ d_{H₁}(x, z) + ‖Σ‖₂² R² d_{H₂}(x, z) + d_{H₃}(x, z),  (4.3)

for a multivariate Gaussian random variable z with the same mean and covariance as the xᵢ's.
The bound in Eq. (4.2) holds with high probability, and the coefficients τ₁ and τ₂ are deterministic constants which describe the convergence behavior of the Newton-Stein method. Observe that the coefficient τ₁ is a sum of two terms: D(x, z) measures how accurate the Hessian estimation is, and the second term depends on the sub-sample size and the data dimensions.
Theorem 4.1 shows that the convergence of the Newton-Stein method can be upper bounded by a compositely converging sequence; that is, the squared term dominates at first, giving a quadratic rate, and the convergence transitions into a linear phase as the iterate gets close to the optimal value. The coefficients τ₁ and τ₂ govern the linear and quadratic terms, respectively. The effect of sub-sampling appears in the coefficient of the linear term. In theory, there is a threshold for the sub-sampling size |S|, namely O(n/log(n)), beyond which further sub-sampling has no effect.
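The two-phase behavior of a bound like Eq. (4.2) is easy to see numerically. The following sketch (ours, with arbitrary illustrative values τ₁ = 0.1, τ₂ = 10, e₀ = 0.05, chosen so that e₀ < (1 − τ₁)/τ₂) iterates the composite recursion on the error directly:

```python
# Illustrative simulation (hypothetical constants): the composite recursion
#   e_{t+1} = tau1 * e_t + tau2 * e_t^2
# from Eq. (4.2) shows a quadratic phase followed by a linear phase.

def composite_errors(tau1, tau2, e0, steps):
    errs = [e0]
    for _ in range(steps):
        e = errs[-1]
        errs.append(tau1 * e + tau2 * e * e)
    return errs

errs = composite_errors(tau1=0.1, tau2=10.0, e0=0.05, steps=8)
# Early on, the quadratic term tau2 * e^2 dominates; once e is small, the
# ratio e_{t+1}/e_t approaches tau1, i.e., a linear rate.
ratios = [b / a for a, b in zip(errs, errs[1:])]
print(ratios)
```

The printed ratios start well above τ₁ (quadratic phase) and settle to roughly τ₁ = 0.1 (linear phase), mirroring the transition visible in the right plot of Figure 1.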
The transition point between the quadratic and the linear phases is determined by the sub-sampling size and the properties of the data. The phase transition can be observed through the right plot in Figure 1. Using the above theorem, we state the following corollary.

Corollary 4.2. Assume that the assumptions of Theorem 4.1 hold. For a constant C ≤ P(E), a tolerance ε satisfying

ε ≥ 20Rc/p²,

and an iterate satisfying E[‖β̂_t − β*‖₂] > ε, the iterates of the Newton-Stein method satisfy

E[‖β̂_{t+1} − β*‖₂] ≤ τ̃₁ E[‖β̂_t − β*‖₂] + τ₂ E[‖β̂_t − β*‖₂²],

where τ̃₁ = τ₁ + 0.1, and τ₁, τ₂ are as in Theorem 4.1.

The bound stated in the above corollary is an analogue of the composite convergence (given in Eq. (4.2)) in expectation. Note that our results make strong assumptions on the derivatives of the cumulant generating function φ. We emphasize that these assumptions are valid for linear and logistic regressions. An example that does not fit in our scheme is Poisson regression, with φ(z) = eᶻ. However, we observed empirically that the algorithm still provides significant improvement. The following theorem states a sufficient condition for the convergence of a composite sequence.
Theorem 4.3. Let {β̂_t}_{t≥0} be a compositely converging sequence with convergence coefficients τ₁ and τ₂ as in Eq. (4.2) to the minimizer β*. Let the starting point satisfy ‖β̂_0 − β*‖₂ = ϑ < (1 − τ₁)/τ₂, and define Θ = (τ₁ϑ/(1 − τ₂ϑ), ϑ). Then the sequence of ℓ₂-distances converges to 0.
Further, the number of iterations to reach a tolerance of ε can be upper bounded by inf_{ξ∈Θ} J(ξ), where

J(ξ) = log₂( log(ξ(τ₁/ξ + τ₂)) / log((τ₁/ξ + τ₂)ϑ) ) + log(ε/ξ)/log(τ₁ + τ₂ξ).  (4.4)

The above theorem gives an upper bound on the number of iterations until a tolerance of ε is reached. The first and second terms on the right hand side of Eq. (4.4) stem from the quadratic and linear phases, respectively.

4.3 Algorithm parameters
NewSt takes three input parameters, and for those we suggest near-optimal choices based on our theoretical results.

• Sub-sample size: NewSt uses a subset of indices to approximate the covariance matrix Σ. Corollary 5.50 of [Ver10] proves that a sample size of O(p) is sufficient for sub-gaussian covariates, and that of O(p log(p)) is sufficient for arbitrary distributions supported in some ball, to estimate a covariance matrix by its sample mean estimator. In the regime we consider, n ≫ p, we suggest a sample size of |S| = O(p log(p)).
• Rank: Many methods have been suggested to improve the estimation of a covariance matrix, and almost all of them rely on the concept of shrinkage [CCS10, DGJ13]. Eigenvalue thresholding can be considered a shrinkage operation which retains only the important second order information [EM15]. Choosing the rank threshold r can simply be done on the sample mean estimator of Σ. After obtaining the sub-sampled estimate of the mean, one can either plot the spectrum and choose manually, or use a technique from [DG13].
• Step size: Step size choices of NewSt are quite similar to those of Newton's method (see [BV04]). The main difference comes from the eigenvalue thresholding.
If the data follows the r-spiked model, the optimal step size will be close to 1 if there is no sub-sampling. However, due to fluctuations resulting from sub-sampling, we suggest the following step size choice for NewSt:

γ = 2 / ( 1 + (σ̂² − O(√(p/|S|)))/σ̂² ).  (4.5)

In general, this formula yields a step size greater than 1, which is due to rank thresholding, providing faster convergence. See [EM15] for a detailed discussion.

5 Experiments
In this section, we validate the performance of NewSt through extensive numerical studies. We experimented on two commonly used GLM optimization problems, namely Logistic Regression (LR) and Linear Regression (OLS). LR minimizes Eq. (2.1) for the logistic function φ(z) = log(1 + eᶻ), whereas OLS minimizes the same equation for φ(z) = z². In the following, we briefly describe the algorithms that are used in the experiments:

• Newton's Method (NM) uses the inverse Hessian evaluated at the current iterate, and may achieve quadratic convergence. NM steps require O(np² + p³) computation, which makes it impractical for large-scale datasets.
• Broyden-Fletcher-Goldfarb-Shanno (BFGS) forms a curvature matrix by cultivating the information from the iterates and the gradients at each iteration. Under certain assumptions, the convergence rate is locally super-linear and the per-iteration cost is comparable to that of first order methods.
• Limited Memory BFGS (L-BFGS) is similar to BFGS, but uses only the recent few iterates to construct the curvature matrix, gaining significant performance in terms of memory.
• Gradient Descent (GD) update is proportional to the negative of the full gradient evaluated at the current iterate.
Under smoothness assumptions, GD achieves a linear convergence rate, with O(np) per-iteration cost.
• Accelerated Gradient Descent (AGD), proposed by Nesterov [Nes83], improves over gradient descent by using a momentum term. The performance of AGD strongly depends on the smoothness of the function.

For all the algorithms, we use a constant step size that provides the fastest convergence. The sub-sample size, rank, and constant step size for NewSt are selected by following the guidelines in Section 4.3. We experimented over two real and two synthetic datasets, which are summarized in Table 1. Synthetic data are generated through a multivariate Gaussian distribution, and the data dimensions are chosen so that Newton's method still does well. The experimental results are summarized in Figure 2. We observe that NewSt provides a significant improvement over the classical techniques. The methods that come closest to NewSt are Newton's method for moderate n and p, and BFGS when n is large. Observe that the convergence rate of NewSt has a clear phase transition point.
As argued earlier, this point depends on various factors including the sub-sampling size |S| and the data dimensions n, p, the

[Figure 2: eight panels of log(Error) vs. time (sec), one row of Logistic Regression and one row of Linear Regression for each dataset: S3 (rank=3), S20 (rank=20), CT Slices (rank=40), and Covertype (rank=2); each panel compares NewSt, BFGS, LBFGS, Newton, GD, and AGD.]

Figure 2: Performance of various optimization methods on different datasets. The red line represents the proposed method NewSt.
Algorithm parameters, including the rank threshold, are selected by the guidelines described in Section 4.3.

the rank threshold r and the structure of the covariance matrix. The prediction of the phase transition point is an interesting line of research, which would allow further tuning of the algorithm parameters.
The optimal step size for NewSt will typically be larger than 1, mainly due to the eigenvalue thresholding operation. This feature is desirable if one is able to obtain a large step size that still provides convergence; in such cases, the convergence is likely to be faster, yet more unstable, compared to smaller step size choices. We observed that, similar to other second order algorithms, NewSt is susceptible to the step size selection. If the data is not well-conditioned and the sub-sample size is not sufficiently large, the algorithm might perform poorly. This is mainly because the sub-sampling operation is performed only once, at the beginning. Therefore, it might be good in practice to sub-sample once every few iterations.

Table 1: Datasets used in the experiments. Real datasets are from the UCI repository [Lic13].

Dataset     n        p    Reference
CT slices   53500    386  [GKS+11]
Covertype   581012   54   [BD99]
S3          500000   300  3-spiked model, [DGJ13]
S20         500000   300  20-spiked model, [DGJ13]

6 Discussion
In this paper, we proposed an efficient algorithm for training GLMs. We call our algorithm the Newton-Stein method (NewSt), as it takes a Newton update at each iteration relying on a Stein-type lemma. The algorithm requires a one-time O(|S|p²) cost to estimate the covariance structure and an O(np) per-iteration cost to form the update equations. We observe that the convergence of NewSt has a phase transition from a quadratic to a linear rate. This observation is justified theoretically, along with several other guarantees for covariates with bounded support, such as per-step bounds, conditions for convergence, etc.
Parameter selection guidelines for NewSt are based on our theoretical results. Our experiments show that NewSt provides high performance in GLM optimization.
Relaxing some of the theoretical constraints is an interesting line of research. In particular, the bounded support assumption as well as the strong constraints on the cumulant generating functions might be loosened. Another interesting direction is to determine when the phase transition point occurs, which would provide a better understanding of the effects of sub-sampling and rank thresholding.

Acknowledgements
The author is grateful to Mohsen Bayati and Andrea Montanari for stimulating conversations on the topic of this work. The author would like to thank Bhaswar B. Bhattacharya and Qingyuan Zhao for carefully reading this article and providing valuable feedback.

References
[Ama98] Shun-Ichi Amari, Natural gradient works efficiently in learning, Neural Computation 10 (1998).
[BCNN11] Richard H Byrd, Gillian M Chin, Will Neveitt, and Jorge Nocedal, On the use of stochastic hessian information in optimization methods for machine learning, SIAM Journal on Optimization (2011).
[BD99] Jock A Blackard and Denis J Dean, Comparative accuracies of artificial neural networks and discriminant analysis in predicting forest cover types from cartographic variables, Computers and Electronics in Agriculture (1999), 131–151.
[BHNS14] Richard H Byrd, SL Hansen, Jorge Nocedal, and Yoram Singer, A stochastic quasi-Newton method for large-scale optimization, arXiv preprint arXiv:1401.7020 (2014).
[Bis95] Christopher M. Bishop, Neural networks for pattern recognition, Oxford University Press, 1995.
[Bot10] Léon Bottou, Large-scale machine learning with stochastic gradient descent, COMPSTAT, 2010.
[BS06] Jinho Baik and Jack W Silverstein, Eigenvalues of large sample covariance matrices of spiked population models, Journal of Multivariate Analysis 97 (2006), no. 6, 1382–1408.
[BV04] Stephen Boyd and Lieven Vandenberghe, Convex optimization, Cambridge University Press, 2004.
[CCS10] Jian-Feng Cai, Emmanuel J Candès, and Zuowei Shen, A singular value thresholding algorithm for matrix completion, SIAM Journal on Optimization 20 (2010), no. 4, 1956–1982.
[CGS10] Louis HY Chen, Larry Goldstein, and Qi-Man Shao, Normal approximation by Stein's method, Springer Science, 2010.
[DE15] Lee H Dicker and Murat A Erdogdu, Flexible results for quadratic forms with applications to variance components estimation, arXiv preprint arXiv:1509.04388 (2015).
[DG13] David L Donoho and Matan Gavish, The optimal hard threshold for singular values is 4/√3, arXiv preprint arXiv:1305.5870 (2013).
[DGJ13] David L Donoho, Matan Gavish, and Iain M Johnstone, Optimal shrinkage of eigenvalues in the spiked covariance model, arXiv preprint arXiv:1311.0851 (2013).
[DHS11] John Duchi, Elad Hazan, and Yoram Singer, Adaptive subgradient methods for online learning and stochastic optimization, J. Mach. Learn. Res. 12 (2011), 2121–2159.
[EM15] Murat A Erdogdu and Andrea Montanari, Convergence rates of sub-sampled Newton methods, arXiv preprint arXiv:1508.02810 (2015).
[FHT10] Jerome Friedman, Trevor Hastie, and Rob Tibshirani, Regularization paths for generalized linear models via coordinate descent, Journal of Statistical Software 33 (2010), no. 1, 1.
[FS12] Michael P Friedlander and Mark Schmidt, Hybrid deterministic-stochastic methods for data fitting, SIAM Journal on Scientific Computing 34 (2012), no. 3, A1380–A1405.
[GKS+11] Franz Graf, Hans-Peter Kriegel, Matthias Schubert, Sebastian Pölsterl, and Alexander Cavallaro, 2D image registration in CT images using radial image descriptors, MICCAI 2011, Springer, 2011.
[GS02] Alison L Gibbs and Francis E Su, On choosing and bounding probability metrics, ISR 70 (2002).
[KF09] Daphne Koller and Nir Friedman, Probabilistic graphical models, MIT Press, 2009.
[Lic13] M. Lichman, UCI machine learning repository, 2013.
[LRF10] Nicolas Le Roux and Andrew W Fitzgibbon, A fast natural Newton method, ICML, 2010.
[LRMB08] Nicolas Le Roux, Pierre-A Manzagol, and Yoshua Bengio, Topmoumoute online natural gradient algorithm, NIPS, 2008.
[LWK08] Chih-J Lin, Ruby C Weng, and Sathiya Keerthi, Trust region Newton method for logistic regression, JMLR (2008).
[Mar10] James Martens, Deep learning via hessian-free optimization, ICML, 2010, pp. 735–742.
[MN89] Peter McCullagh and John A Nelder, Generalized linear models, vol. 2, Chapman and Hall, 1989.
[Nes83] Yurii Nesterov, A method for unconstrained convex minimization problem with the rate of convergence O(1/k²), Doklady AN SSSR, vol. 269, 1983, pp. 543–547.
[Nes04] Yurii Nesterov, Introductory lectures on convex optimization: A basic course, vol. 87, Springer, 2004.
[SRB13] Mark Schmidt, Nicolas Le Roux, and Francis Bach, Minimizing finite sums with the stochastic average gradient, arXiv preprint arXiv:1309.2388 (2013).
[Ste81] Charles M Stein, Estimation of the mean of a multivariate normal distribution, Annals of Statistics (1981), 1135–1151.
[Ver10] Roman Vershynin, Introduction to the non-asymptotic analysis of random matrices, arXiv preprint arXiv:1011.3027 (2010).
[VP12] Oriol Vinyals and Daniel Povey, Krylov Subspace Descent for Deep Learning, AISTATS, 2012.