{"title": "Scalable Adaptive Stochastic Optimization Using Random Projections", "book": "Advances in Neural Information Processing Systems", "page_first": 1750, "page_last": 1758, "abstract": "Adaptive stochastic gradient methods such as AdaGrad have gained popularity in particular for training deep neural networks. The most commonly used and studied variant maintains a diagonal matrix approximation to second order information by accumulating past gradients which are used to tune the step size adaptively. In certain situations the full-matrix variant of AdaGrad is expected to attain better performance, however in high dimensions it is computationally impractical. We present Ada-LR and RadaGrad two computationally efficient approximations to full-matrix AdaGrad based on randomized dimensionality reduction. They are able to capture dependencies between features and achieve similar performance to full-matrix AdaGrad but at a much smaller computational cost. We show that the regret of Ada-LR is close to the regret of full-matrix AdaGrad which can have an up-to exponentially smaller dependence on the dimension than the diagonal variant. Empirically, we show that Ada-LR and RadaGrad perform similarly to full-matrix AdaGrad. On the task of training convolutional neural networks as well as recurrent neural networks, RadaGrad achieves faster convergence than diagonal AdaGrad.", "full_text": "Scalable Adaptive Stochastic Optimization Using\n\nRandom Projections\n\nGabriel Krummenacher\u2666\u2217\n\ngabriel.krummenacher@inf.ethz.ch\n\nBrian McWilliams\u2665\u2217\n\nbrian@disneyresearch.com\n\nYannic Kilcher\u2666\n\nyannic.kilcher@inf.ethz.ch\n\nJoachim M. 
Buhmann\u2666\n\njbuhmann@inf.ethz.ch\n\nNicolai Meinshausen\u2663\n\nmeinshausen@stat.math.ethz.ch\n\n\u2666Institute for Machine Learning, Department of Computer Science, ETH Z\u00fcrich, Switzerland\n\n\u2663Seminar for Statistics, Department of Mathematics, ETH Z\u00fcrich, Switzerland\n\n\u2665Disney Research, Z\u00fcrich, Switzerland\n\nAbstract\n\nAdaptive stochastic gradient methods such as ADAGRAD have gained popularity in\nparticular for training deep neural networks. The most commonly used and studied\nvariant maintains a diagonal matrix approximation to second order information\nby accumulating past gradients which are used to tune the step size adaptively. In\ncertain situations the full-matrix variant of ADAGRAD is expected to attain better\nperformance, however in high dimensions it is computationally impractical. We\npresent ADA-LR and RADAGRAD two computationally ef\ufb01cient approximations\nto full-matrix ADAGRAD based on randomized dimensionality reduction. They are\nable to capture dependencies between features and achieve similar performance to\nfull-matrix ADAGRAD but at a much smaller computational cost. We show that the\nregret of ADA-LR is close to the regret of full-matrix ADAGRAD which can have\nan up-to exponentially smaller dependence on the dimension than the diagonal\nvariant. Empirically, we show that ADA-LR and RADAGRAD perform similarly to\nfull-matrix ADAGRAD. On the task of training convolutional neural networks as\nwell as recurrent neural networks, RADAGRAD achieves faster convergence than\ndiagonal ADAGRAD.\n\n1\n\nIntroduction\n\nRecently, so-called adaptive stochastic optimization algorithms have gained popularity for large-scale\nconvex and non-convex optimization problems. Among these, ADAGRAD [9] and its variants [21]\nhave received particular attention and have proven among the most successful algorithms for training\ndeep networks. 
Although these problems are inherently highly non-convex, recent work has begun to explain the success of such algorithms [3].

*Authors contributed equally.

30th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain.

ADAGRAD adaptively sets the learning rate for each dimension by means of a time-varying proximal regularizer. The most commonly studied and utilised version considers only a diagonal matrix proximal term. As such it incurs almost no additional computational cost over standard stochastic gradient descent (SGD). However, when the data has low effective rank the regret of ADAGRAD may have a much worse dependence on the dimensionality of the problem than its full-matrix variant (which we refer to as ADA-FULL). Such settings are common in high-dimensional data where there are many correlations between features, and can also be observed in the convolutional layers of neural networks. The computational cost of ADA-FULL is substantially higher than that of ADAGRAD: it requires computing the inverse square root of the matrix of gradient outer products to evaluate the proximal term, which grows with the cube of the dimension. As such it is rarely used in practice.
In this work we propose two methods that approximate the proximal term used in ADA-FULL, drastically reducing computational and storage complexity with little adverse effect on optimization performance. First, in Section 3.1 we develop ADA-LR, a simple approximation using random projections. This procedure reduces the computational complexity of ADA-FULL by a factor of p but retains similar theoretical guarantees. In Section 3.2 we systematically profile the most computationally expensive parts of ADA-LR and introduce further randomized approximations resulting in a truly scalable algorithm, RADAGRAD.
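The cost contrast between the diagonal and full-matrix updates can be made concrete before the approximations are introduced. The following is a minimal sketch of our own (not the paper's implementation, and the step sizes are placeholders); note that the full-matrix step extracts the inverse matrix square root via an O(p^3) eigendecomposition, which is exactly the cost that makes ADA-FULL impractical in high dimensions:

```python
import numpy as np

def adagrad_diag_step(beta, g, s, eta=0.5, delta=1e-3):
    # Diagonal AdaGrad: accumulate squared gradients coordinate-wise, O(p).
    s = s + g * g
    beta = beta - eta * g / (delta + np.sqrt(s))
    return beta, s

def ada_full_step(beta, g, G, eta=0.5, delta=1e-3):
    # Full-matrix AdaGrad (ADA-FULL): accumulate gradient outer products and
    # precondition with the inverse matrix square root -- O(p^3) per step.
    G = G + np.outer(g, g)
    lam, V = np.linalg.eigh(G)
    H_inv = (V / (delta + np.sqrt(np.clip(lam, 0.0, None)))) @ V.T
    beta = beta - eta * H_inv @ g
    return beta, G
```

On a p-dimensional problem the diagonal step costs O(p) per iteration while the full-matrix step costs O(p^3), which is the gap the randomized approximations of Sections 3.1 and 3.2 aim to close.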
In Section 3.3 we outline a simple modi\ufb01cation\nto RADAGRAD\u2013 reducing the variance of the stochastic gradients \u2013 which greatly improves practical\nperformance. Finally we perform an extensive comparison between the performance of RADAGRAD\nwith several widely used optimization algorithms on a variety of deep learning tasks. For image\nrecognition with convolutional networks and language modeling with recurrent neural networks we\n\ufb01nd that RADAGRAD and in particular its variance-reduced variant achieves faster convergence.\n\n1.1 Related work\n\nMotivated by the problem of training deep neural networks, very recently many new adaptive\noptimization methods have been proposed. Most computationally ef\ufb01cient among these are \ufb01rst order\nmethods similar in spirit to ADAGRAD, which suggest alternative normalization factors [21, 28, 6].\nSeveral authors propose ef\ufb01cient stochastic variants of classical second order methods such as L-\nBFGS [5, 20]. Ef\ufb01cient algorithms exist to update the inverse of the Hessian approximation by\napplying the matrix-inversion lemma or directly updating the Hessian-vector product using the\n\u201cdouble-loop\u201d algorithm but these are not applicable to ADAGRAD style algorithms. In the convex\nsetting these methods can show great theoretical and practical bene\ufb01t over \ufb01rst order methods but\nhave yet to be extensively applied to training deep networks.\nOn a different note, the growing zoo of variance reduced SGD algorithms [19, 7, 18] has shown\nvastly superior performance to ADAGRAD-style methods for standard empirical risk minimization\nand convex optimization. Recent work has aimed to move these methods into the non-convex setting\n[1]. Notably, [22] combine variance reduction with second order methods.\nMost similar to RADAGRAD are those which propose factorized approximations of second order\ninformation. 
Several methods focus on the natural gradient method [2] which leverages second order information through the Fisher information matrix. [14] approximate the inverse Fisher matrix using a sparse graphical model. [8] use low-rank approximations whereas [26] propose an efficient Kronecker product based factorization. Concurrently with this work, [12] propose a randomized preconditioner for SGD. However, their approach requires access to all of the data at once in order to compute the preconditioning matrix, which is impractical for training deep networks. [23] propose a theoretically motivated algorithm similar to ADA-LR and a faster alternative based on Oja's rule to update the SVD.
Fast random projections. Random projections are low-dimensional embeddings Π : R^p → R^τ which preserve, up to a small distortion, the geometry of a subspace of vectors. We concentrate on the class of structured random projections, among which the Subsampled Randomized Fourier Transform (SRFT) has particularly attractive properties [15]. The SRFT consists of a preconditioning step after which τ columns of the new matrix are subsampled uniformly at random as Π = √(p/τ) SΘD, with the definitions: (i) S ∈ R^{τ×p} is a subsampling matrix. (ii) D ∈ R^{p×p} is a diagonal matrix whose entries are drawn independently from {−1, 1}. (iii) Θ ∈ R^{p×p} is a unitary discrete Fourier transform (DFT) matrix. This formulation allows very fast implementations using the fast Fourier transform (FFT), for example using the popular FFTW package2. Applying the FFT to a p-dimensional vector can be achieved in O(p log τ) time.
Similar structured random projections have gained popularity as a way to speed up [24] and robustify [27] large-scale linear regression and for distributed estimation [17, 16].

2http://www.fftw.org/

1.2 Problem setting

The problem considered by [9] is online stochastic optimization where the goal is, at each step, to predict a point β_t ∈ R^p which achieves low regret with respect to a fixed optimal predictor, β_opt, for a sequence of (convex) functions F_t(β). After T rounds, the regret can be defined as R(T) = Σ_{t=1}^T F_t(β_t) − Σ_{t=1}^T F_t(β_opt).
Initially, we will consider functions F_t of the form F_t(β) := f_t(β) + φ(β) where f_t and φ are convex loss and regularization functions respectively. Throughout, the vector g_t ∈ ∇f_t(β_t) refers to a particular subgradient of the loss function. Standard first order methods update β_t at each step by moving in the opposite direction of g_t according to a step-size parameter, η. The ADAGRAD family of algorithms [9] instead uses an adaptive learning rate which can be different for each feature. This is controlled using a time-varying proximal term which we briefly review. Defining G_t = Σ_{i=1}^t g_i g_i^T and H_t = δI_p + (G_{t−1} + g_t g_t^T)^{1/2}, the ADA-FULL proximal term is given by ψ_t(β) = (1/2)⟨β, H_t β⟩. Clearly when p is large, constructing G and finding its root and inverse at each iteration is impractical. In practice, rather than the full outer product matrix, ADAGRAD uses a proximal function consisting of the diagonal of G_t, ψ_t(β) = (1/2)⟨β, (δI_p + diag(G_t)^{1/2})β⟩. Although the diagonal proximal term is computationally cheaper, it is unable to capture dependencies between coordinates in the gradient terms. Despite this, ADAGRAD has been found to perform very well empirically. One reason for this is that modern high-dimensional datasets are typically also very sparse. Under these conditions, coordinates in the gradient are approximately independent.

2 Stochastic optimization in high dimensions

ADAGRAD has attractive theoretical and empirical properties and adds essentially no overhead above a standard first order method such as SGD. This begs the question of what we might hope to gain by introducing additional computational complexity. In order to motivate our contribution, we first present an analogue of the discussion in [10], focusing on the case when data is high-dimensional and dense. We argue that if the data has low-rank (rather than sparse) structure, ADA-FULL can effectively adapt to the intrinsic dimensionality. We also show in Section 3.1 that ADA-LR has the same property.
First, we review the theoretical properties of ADAGRAD algorithms, borrowing the g_{1:T,j} notation of [9].
Proposition 1. ADAGRAD and ADA-FULL achieve the following regret (Corollaries 6 & 11 from [9]) respectively:

R_D(T) ≤ 2‖β_opt‖_∞ Σ_{j=1}^p ‖g_{1:T,j}‖ + δ‖β_opt‖_1,    R_F(T) ≤ 2‖β_opt‖ · tr(G_T^{1/2}) + δ‖β_opt‖.    (1)

The major difference between R_D(T) and R_F(T) is the inclusion of the final diagonal and full-matrix proximal term, respectively. Under a sparse data generating distribution ADAGRAD achieves an up-to exponential improvement over SGD which is optimal in a minimax sense [10]. While data sparsity is often observed in practice in high-dimensional datasets (particularly web/text data), many other problems are dense. Furthermore, in practice applying ADAGRAD to dense data results in a learning rate which tends to decay too rapidly.
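The gap between the two proximal terms in Proposition 1 is easy to check numerically. The sketch below is our own illustration; the low-rank gradient model (gradients confined to a 5-dimensional subspace of R^50) is an assumption chosen to mimic strongly correlated features:

```python
import numpy as np

rng = np.random.default_rng(0)
T, p, k = 500, 50, 5

# Gradients confined to a k-dimensional subspace: g_t = A z_t.
A = rng.standard_normal((p, k))
grads = rng.standard_normal((T, k)) @ A.T              # rows are g_t

G = grads.T @ grads                                    # G_T = sum_t g_t g_t^T
eigs = np.clip(np.linalg.eigvalsh(G), 0.0, None)
full_term = np.sum(np.sqrt(eigs))                      # tr(G_T^{1/2}): ADA-FULL term
diag_term = np.sum(np.sqrt(np.sum(grads**2, axis=0)))  # sum_j ||g_{1:T,j}||: diagonal term

print(full_term, diag_term)  # the full-matrix term is much smaller here
```

Since the eigenvalues of G majorize its diagonal and the square root is concave, tr(G^{1/2}) never exceeds the diagonal sum; when the gradients are correlated the gap is large, which is the regime where ADA-FULL pays off.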
It is therefore natural to ask how dense data affects the performance of ADA-FULL.
For illustration, consider when the data points x_i are sampled i.i.d. from a Gaussian distribution P_X = N(0, Σ). The resulting variable will clearly be dense. A common feature of high dimensional data is low effective rank, defined for a matrix Σ as r(Σ) = tr(Σ)/‖Σ‖ ≤ rank(Σ) ≤ p. Low effective rank implies that r ≪ p and therefore the eigenvalues of the covariance matrix decay quickly. We will consider distributions parametrised by covariance matrices Σ with eigenvalues λ_j(Σ) = λ_0 j^{−α} for j = 1, ..., p.
Functions of the form F_t(β) = F_t(β^T x_t) have gradients ‖g_t‖ ≤ M‖x_t‖. For example, the least squares loss F_t(β^T x_t) = (1/2)(y_t − β^T x_t)^2 has gradient g_t = x_t(y_t − x_t^T β_t) = x_t ε_t, such that |ε_t| ≤ M. Let us consider the effect of distributions parametrised by Σ on the proximal terms of full and diagonal ADAGRAD. Plugging X into the proximal terms of (1) and taking expectations with respect to P_X we obtain for ADAGRAD and ADA-FULL respectively:

E Σ_{j=1}^p ‖g_{1:T,j}‖ ≤ Σ_{j=1}^p √(M^2 E Σ_{t=1}^T x_{t,j}^2) ≤ pM√T,    E tr((Σ_{t=1}^T g_t g_t^T)^{1/2}) ≤ M √(Tλ_0) Σ_{j=1}^p j^{−α/2},    (2)

where the first inequality is from Jensen and the second is from noticing the sum of T squared Gaussian random variables is a χ^2 random variable. We can consider the effect of a fast-decaying spectrum: for α ≥ 2, Σ_{j=1}^p j^{−α/2} = O(log p), and for α ∈ (1, 2), Σ_{j=1}^p j^{−α/2} = O(p^{1−α/2}).
When the data (and thus the gradients) are dense, yet have low effective rank, ADA-FULL is able to adapt to this structure. On the contrary, although ADAGRAD is computationally practical, in the worst case it may have an exponentially worse dependence on the data dimension (p compared with log p). In fact, the discrepancy between the regret of ADA-FULL and that of ADAGRAD is analogous to the discrepancy between ADAGRAD and SGD for sparse data.

Algorithm 1 ADA-LR
Input: η > 0, δ ≥ 0, τ
1: for t = 1 ... T do
2:   Receive g_t = ∇f_t(β_t).
3:   G_t = G_{t−1} + g_t g_t^T
4:   Project: G̃_t = G_t Π
5:   QR = G̃_t {QR-decomposition}
6:   B = Q^T G_t
7:   U, Σ, V = B {SVD}
8:   β_{t+1} = β_t − ηV(Σ^{1/2} + δI)^{−1} V^T g_t
9: end for
Output: β_T

Algorithm 2 RADAGRAD
Input: η > 0, δ ≥ 0, τ
1: for t = 1 ... T do
2:   Receive g_t = ∇f_t(β_t).
3:   Project: g̃_t = Π g_t
4:   G̃_t = G̃_{t−1} + g_t g̃_t^T
5:   Q_t, R_t ← qr_update(Q_{t−1}, R_{t−1}, g_t, g̃_t)
6:   B = G̃_t^T Q_t
7:   U, Σ, W = B {SVD}
8:   V = WQ^T
9:   γ_t = η(g_t − VV^T g_t)
10:  β_{t+1} = β_t − ηV(Σ^{1/2} + δI)^{−1} V^T g_t − γ_t
11: end for
Output: β_T

3 Approximating ADA-FULL using random projections

It is clear that in certain regimes, ADA-FULL provides stark optimization advantages over ADAGRAD in terms of the dependence on p. However, ADA-FULL requires maintaining a p × p matrix, G, and computing its square root and inverse.
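The two growth regimes for Σ_j j^{−α/2} quoted in (2) are easy to verify numerically. A quick check, added for illustration (α = 1.3 is the decay rate used later in the experiments of Section 4.1):

```python
import numpy as np

def proximal_sum(p, alpha):
    # sum_{j=1}^p j^{-alpha/2}: the dimension-dependent factor in (2)
    j = np.arange(1, p + 1, dtype=float)
    return np.sum(j ** (-alpha / 2.0))

for p in (10**2, 10**4, 10**6):
    r_log = proximal_sum(p, 2.0) / np.log(p)          # ~ constant: O(log p) regime
    r_poly = proximal_sum(p, 1.3) / p ** (1 - 1.3/2)  # ~ constant: O(p^{1-alpha/2}) regime
    print(p, round(r_log, 3), round(r_poly, 3))
```

Both ratios stabilise as p grows, confirming the logarithmic dependence for α ≥ 2 and the polynomial dependence for α ∈ (1, 2).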
Therefore, computationally the dependence of ADA-FULL on p scales with the cube, which is impractical in high dimensions.
A naïve approach would be to simply reduce the dimensionality of the gradient vector, g̃_t = Π g_t ∈ R^τ. ADA-FULL is now directly applicable in this low-dimensional space, returning a solution vector β̃_t ∈ R^τ at each iteration. However, for many problems the original coordinates may have some intrinsic meaning or, in the case of deep networks, may be parameters in a model, in which case it is important to return a solution in the original space. Unfortunately, in general it is not possible to recover such a solution from β̃_t [30].
Instead, we consider a different approach to maintaining and updating an approximation of the ADAGRAD matrix while retaining the original dimensionality of the parameter updates β and gradients g.

3.1 Randomized low-rank approximation

As a first approach we approximate the inverse square root of G_t using a fast randomized singular value decomposition (SVD) [15]. We proceed in two stages: First we compute an approximate basis Q for the range of G_t. Then we use Q to compute an approximate SVD of G_t by forming the smaller dimensional matrix B = Q^T G_t and then computing the low-rank SVD UΣV^T = B. This is faster than computing the SVD of G_t directly if Q has few columns.
An approximate basis Q can be computed efficiently by forming the matrix G̃_t = G_t Π by means of a structured random projection and then constructing an orthonormal basis for the range of G̃_t by QR-decomposition. The randomized SVD allows us to quickly compute the square root and pseudo-inverse of the proximal term H_t by setting H̃_t^{−1} = V(Σ^{1/2} + δI)^{−1}V^T.
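The two-stage randomized SVD just described can be sketched in a few lines. For simplicity this illustration uses a Gaussian test matrix in place of the SRFT (an assumption on our part; the paper uses the SRFT for speed, but the range-finding logic is identical):

```python
import numpy as np

def randomized_svd(G, tau, rng):
    """Approximate SVD of a symmetric p x p matrix G via a random sketch.

    Stage 1: Q is an orthonormal basis for the range of G @ Pi.
    Stage 2: the SVD of the small matrix B = Q^T G lifts back to G.
    """
    p = G.shape[0]
    Pi = rng.standard_normal((p, tau))   # Gaussian stand-in for the SRFT
    Q, _ = np.linalg.qr(G @ Pi)          # p x tau orthonormal basis
    B = Q.T @ G                          # tau x p
    U_small, s, Vt = np.linalg.svd(B, full_matrices=False)
    return Q @ U_small, s, Vt            # G ~= U diag(s) Vt

rng = np.random.default_rng(0)
p, k = 100, 5
Y = rng.standard_normal((p, k))
G = Y @ Y.T                              # exactly rank-k, symmetric PSD
U, s, Vt = randomized_svd(G, tau=15, rng=rng)
err = np.linalg.norm(G - U @ np.diag(s) @ Vt) / np.linalg.norm(G)
```

From V and Σ one can then form the preconditioner V(Σ^{1/2} + δI)^{−1}V^T without ever decomposing a p × p matrix directly.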
We call this approximation ADA-LR and describe the steps in full in Algorithm 1.
In practice, using a structured random projection such as the SRFT leads to an approximation of the original matrix G_t of the following form: ‖G_t − QQ^T G_t‖ ≤ ε with high probability [15], where ε depends on τ, the number of columns of Q; on p; and on the τth singular value of G_t. Briefly, if the singular values of G_t decay quickly and τ is chosen appropriately, ε will be small (this is stated more formally in Proposition 2). We leverage this result to derive the following regret bound for ADA-LR (see C.1 for proof).
Proposition 2. Let σ_{k+1} be the (k+1)st largest singular value of G_t. Set the projection dimension as (√k + √(8 log(kn)))^2 ≤ τ ≤ p and define ε = √(1 + 7p/τ) · σ_{k+1}. With failure probability at most O(k^{−1}), ADA-LR achieves regret

R_LR(T) ≤ 2‖β_opt‖ tr(G_T^{1/2}) + (2τ√ε + δ)‖β_opt‖.

Due to the randomized approximation we incur an additional 2τ√ε‖β_opt‖ compared with the regret of ADA-FULL (eq. 1). So, under the earlier stated assumption of fast decaying eigenvalues, we can use an identical argument as in eq. (2) to similarly obtain a dimension dependence of O(log p + τ).
Approximating the inverse square root decreases the complexity of each iteration from O(p^3) to O(τp^2). We summarize the cost of each step in Algorithm 1 and contrast it with the cost of ADA-FULL in Table A.1 in Section A. Even though ADA-LR removes one factor of p from the runtime of ADA-FULL, it still needs to store the large matrix G_t. This prevents ADA-LR from being a truly practical algorithm. In the following section we propose a second algorithm which directly stores a low dimensional approximation to G_t that can be updated cheaply. This allows for an improvement in runtime to O(τ^2 p).

3.2 RADAGRAD: A faster approximation

From Table A.1, the expensive steps in Algorithm 1 are the update of G_t (line 3), the random projection (line 4) and the projection onto the approximate range of G_t (line 6). In the following we propose RADAGRAD, an algorithm that reduces the complexity to O(τ^2 p) by only approximately solving some of the expensive steps in ADA-LR while maintaining similar performance in practice.
To compute the approximate range Q, we do not need to store the full matrix G_t. Instead we only require the low dimensional matrix G̃_t = G_t Π ∈ R^{p×τ}. This matrix can be computed iteratively by setting G̃_t = G̃_{t−1} + g_t(Π g_t)^T. This directly reduces the cost of the random projection to O(p log τ) since we only project the vector g_t instead of the matrix G_t; it also makes the update of G̃_t faster and saves storage.
We then project G̃_t on the approximate range of G_t and use the SVD to compute the inverse square root. Since G_t is symmetric, its row and column space are identical, so little information is lost by projecting G̃_t instead of G_t on the approximate range of G_t.3 The advantage is that we can now compute the SVD in O(τ^3) and the matrix-matrix product on line 6 in O(τ^2 p). See Algorithm 2 for the full procedure.
The most expensive steps are now the QR decomposition and the matrix multiplications in steps 6 and 8 (see Algorithm 2 and Table A.1). Since at each iteration we only update the matrix G̃_t with the rank-one matrix g_t g̃_t^T, we can use faster rank-1 QR-updates [11] instead of recomputing the full QR decomposition.
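A compact and deliberately simplified sketch of one RADAGRAD-style iteration is given below. For clarity it recomputes the QR factorization of the sketch from scratch instead of using the rank-1 qr_update of Algorithm 2, and the step size and δ are placeholder values (a fairly large δ keeps this toy version stable):

```python
import numpy as np

def radagrad_step(beta, g, G_sketch, Pi, eta=0.5, delta=1.0):
    """One simplified RADAGRAD-style step using the p x tau sketch G_sketch."""
    G_sketch = G_sketch + np.outer(g, Pi.T @ g)  # sketch update: G~_t = G~_{t-1} + g (Pi^T g)^T
    Q, _ = np.linalg.qr(G_sketch)                # approximate range basis (p x tau)
    B = G_sketch.T @ Q                           # small tau x tau matrix
    _, s, Wt = np.linalg.svd(B)
    V = Q @ Wt.T                                 # approximate singular vectors of G_t
    coeff = (np.sqrt(s) + delta) ** -1
    gamma = eta * (g - V @ (V.T @ g))            # corrected update for unmodeled directions
    beta = beta - eta * V @ (coeff * (V.T @ g)) - gamma
    return beta, G_sketch
```

Only p × τ and τ × τ objects are ever formed, so the per-step cost is O(τ^2 p) rather than the O(p^3) of the full-matrix variant.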
To speed up the matrix-matrix product G̃_t^T Q for very large problems (e.g. backpropagation in convolutional neural networks), a multithreaded BLAS implementation can be used.

3This idea is similar to bilinear random projections [13].

3.3 Practical algorithms

Here we outline several simple modifications to the RADAGRAD algorithm to improve practical performance.

Corrected update. The random projection step only retains at most τ eigenvalues of G_t. If the assumption of low effective rank does not hold, important information from the p − τ smallest eigenvalues might be discarded. RADAGRAD therefore makes use of the corrected update

β_{t+1} = β_t − ηV(Σ^{1/2} + δI)^{−1}V^T g_t − γ_t,  where  γ_t = η(I − VV^T)g_t.

γ_t is the projection of the current gradient onto the space orthogonal to the one captured by the random projection of G_t. This ensures that important variation in the gradient which is poorly approximated by the random projection is not completely lost. Consequently, if the data has rank less than τ, ‖γ‖ ≈ 0. This correction only requires quantities which have already been computed, but greatly improves practical performance.

Variance reduction. Variance reduction methods based on SVRG [19] obtain lower-variance gradient estimates by means of computing a "pivot point" over larger batches of data. Recent work has shown improved theoretical and empirical convergence in non-convex problems [1], in particular in combination with ADAGRAD.
We modify RADAGRAD to use the variance reduction scheme of SVRG. The full procedure is given in Algorithm 3 in Section B. The majority of the algorithm is as RADAGRAD except for the outer loop, which computes the pivot point µ every epoch; this is used to reduce the variance of the stochastic gradient (line 4).
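The effect of the SVRG pivot can be seen in isolation with a small least-squares example (a toy setup of our own; `grad_i` and the synthetic data are illustrative assumptions). Both estimators below are unbiased for the full gradient at β, but the variance-reduced one is far more concentrated when β is close to the pivot:

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 200, 10
X = rng.standard_normal((n, p))
y = X @ rng.standard_normal(p)

def grad_i(beta, i):
    # per-example least-squares gradient
    return X[i] * (X[i] @ beta - y[i])

beta_pivot = np.zeros(p)   # pivot point, recomputed once per epoch in SVRG
mu = np.mean([grad_i(beta_pivot, i) for i in range(n)], axis=0)  # full gradient at pivot
beta = beta_pivot + 0.01 * rng.standard_normal(p)                # current iterate near pivot

plain = np.array([grad_i(beta, i) for i in range(n)])
svrg = np.array([grad_i(beta, i) - grad_i(beta_pivot, i) + mu for i in range(n)])

var_plain = plain.var(axis=0).sum()
var_svrg = svrg.var(axis=0).sum()   # much smaller near the pivot
```

As the iterate moves away from the pivot the benefit shrinks, which is why the pivot must be refreshed regularly (the role of the update frequency m below).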
The important additional parameter is m, the update frequency for µ. As in [1] we set this to m = 5n. Practically, as is standard practice, we initialise RADA-VR by running ADAGRAD for several epochs.
We study the empirical behaviour of ADA-LR, RADAGRAD and its variance reduced variant in the next section.

4 Experiments

4.1 Low effective rank data

We compare the performance of our proposed algorithms against both the diagonal and full-matrix ADAGRAD variants in the idealised setting where the data is dense but has low effective rank. We generate binary classification data with n = 1000 and p = 125. The data is sampled i.i.d. from a Gaussian distribution N(µ_c, Σ) where Σ has rapidly decaying eigenvalues λ_j(Σ) = λ_0 j^{−α} with α = 1.3, λ_0 = 30. Each of the two classes has a different mean, µ_c.
For each algorithm learning rates are tuned using cross validation. The results for 5 epochs are averaged over 5 runs with different permutations of the data set and instantiations of the random projection for ADA-LR and RADAGRAD. For the random projection we use an oversampling factor, so Π ∈ R^{(10+τ)×p}, to ensure accurate recovery of the top τ singular values, and then set the values of λ_{[τ:p]} to zero [15].

Figure 1: Comparison of: (a) loss and (b) the largest eigenvalues (normalised by their sum) of the proximal term on simulated data.

Figure 1a shows the mean loss on the training set. The performance of ADA-LR and RADAGRAD matches that of ADA-FULL. On the other hand, ADAGRAD converges to the optimum much more slowly. Figure 1b shows the largest eigenvalues (normalized by their sum) of the proximal matrix for each method at the end of training. The spectrum of G_t decays rapidly, which is matched by
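The simulated design above can be reproduced in a few lines. This is a sketch under the stated parameters; the random orthogonal basis is our choice, and any basis with this spectrum gives the same effective rank:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, lam0, alpha = 1000, 125, 30.0, 1.3

# Covariance with power-law spectrum lambda_j = lam0 * j^{-alpha}.
lam = lam0 * np.arange(1, p + 1, dtype=float) ** (-alpha)
Qb, _ = np.linalg.qr(rng.standard_normal((p, p)))   # random orthogonal basis
Sigma = (Qb * lam) @ Qb.T

# Effective rank r(Sigma) = tr(Sigma)/||Sigma||: far below the ambient p = 125.
eff_rank = np.trace(Sigma) / np.linalg.norm(Sigma, 2)

X = rng.multivariate_normal(np.zeros(p), Sigma, size=n)  # dense, low effective rank
```

Despite every coordinate of X being dense, the effective rank is a small constant, which is the regime where the full-matrix proximal term pays off.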
The spectrum of Gt decays rapidly which is matched by\n\nFigure 1: Comparison of: (a) loss and (b) the largest eigenvalues\n(normalised by their sum) of the proximal term on simulated data.\n\n6\n\n05001000150020002500300035004000Iteration10\u2212210\u22121100LossADA-FULLADA-LRRADAGRADADAGRAD0102030405060Principalcomponent10\u2212310\u2212210\u22121100NormalisedeigenvaluesADA-FULLADA-LRRADAGRADADAGRAD\f(a) MNIST\n\n(b) CIFAR\n\n(c) SVHN\n\nFigure 2: Comparison of training loss (top row) and test accuracy (bottom row) on (a) MNIST, (b)\nCIFAR and (c) SVHN.\n\nthe randomized approximation. This illustrates the dependencies between the coordinates in the\ngradients and suggests Gt can be well approximated by a low-dimensional matrix which considers\nthese dependencies. On the other hand the spectrum of ADAGRAD (equivalent to the diagonal of G)\ndecays much more slowly. The learning rate, \u03b7 chosen by RADAGRAD and ADA-FULL are roughly\none order of magnitude higher than for ADAGRAD.\n\n4.2 Non-convex optimization in neural networks\n\nHere we compare RADAGRAD and RADA-VR against ADAGRAD and the combination of\nADAGRAD+SVRG on the task of optimizing several different neural network architectures.\n\nConvolutional Neural Networks. We used modi\ufb01ed variants of standard convolutional network\narchitectures for image classi\ufb01cation on the MNIST, CIFAR-10 and SVHN datasets. These consist of\nthree 5 \u00d7 5 convolutional layers generating 32 channels with ReLU non-linearities, each followed by\n2 \u00d7 2 max-pooling. The \ufb01nal layer was a dense softmax layer and the objevtive was to minimize the\ncategorical cross entropy.\nWe used a batch size of 8 and trained the networks without momentum or weight decay, in order\nto eliminate confounding factors. Instead, we used dropout regularization (p = 0.5) in the dense\nlayers during training. 
Step sizes were determined by coarsely searching a log scale of possible\nvalues and evaluating performance on a validation set. We found RADAGRAD to have a higher\nimpact with convolutional layers than with dense layers, due to the higher correlations between\nweights. Therefore, for computational reasons, RADAGRAD was only applied on the convolutional\nlayers. The last dense classi\ufb01cation layer was trained with ADAGRAD. In this setting ADA-FULL is\ncomputationally infeasible. The number of parameters in the convolutional layers is between 50-80k.\nSimply storing the full G matrix using double precision would require more memory than is available\non top-of-the-line GPUs.\nThe results of our experiments can be seen in Figure 2, where we show the objective value during\ntraining and the test accuracy. We \ufb01nd that both RADAGRAD variants consistently outperform\nboth ADAGRAD and the combination of ADAGRAD+SVRG on these tasks. In particular combining\nRADAGRAD with variance reduction results in the largest improvement for training although both\nRADAGRAD variants quickly converge to very similar values for test accuracy.\nFor all models, the learning rate selected by RADAGRAD is approximately an order of magnitude\nlarger than the one selected by ADAGRAD. 
This suggests that RADAGRAD can make more aggressive steps than ADAGRAD, which results in the relative success of RADAGRAD over ADAGRAD, especially at the beginning of the experiments.
We observed that RADAGRAD performed 5-10× slower than ADAGRAD per iteration. This can be attributed to the lack of GPU-optimized SVD and QR routines. These numbers are comparable with other similar recently proposed techniques [23]. However, due to the faster convergence we found that the overall optimization time of RADAGRAD was lower than for ADAGRAD.

Recurrent Neural Networks. We trained the strongly-typed variant of the long short-term memory network (T-LSTM, [4]) for language modelling, which consists of the following task: given a sequence of words from an original text, predict the next word. We used pre-trained GLOVE embedding vectors [29] as input to the T-LSTM layer and a softmax over the vocabulary (10k words) as output. The loss is the mean categorical cross-entropy. The memory size of the T-LSTM units was set to 256. We trained and evaluated our network on the Penn Treebank dataset [25]. We subsampled strings of length 20 from the dataset and asked the network to predict each word in the string, given the words up to that point.
Learning rates were selected by searching over a log scale of possible values and measuring performance on a validation set.
We compared RADAGRAD with ADAGRAD without variance reduction. The results of this experiment can be seen in Figure 3. During training, we found that RADAGRAD consistently outperforms ADAGRAD: RADAGRAD both reduces the training loss more quickly and reaches a smaller final value (5.62 × 10−4 vs. 1.52 × 10−3, a 2.7× reduction in loss). Again, we found that the selected learning rate is an order of magnitude higher for RADAGRAD than for ADAGRAD. RADAGRAD is able to exploit the fact that T-LSTMs perform type-preserving update steps which should preserve any low-rank structure present in the weight matrices. The relative improvement of RADAGRAD over ADAGRAD in training is also reflected in the test loss (1.15 × 10−2 vs. 3.23 × 10−2, a 2.8× reduction).

Figure 3: Comparison of training loss (left) and test loss (right) on the language modelling task with the T-LSTM.

5 Discussion

We have presented ADA-LR and RADAGRAD, which approximate the full proximal term of ADAGRAD using fast, structured random projections. ADA-LR enjoys similar regret to ADA-FULL and both methods achieve similar empirical performance at a fraction of the computational cost. Importantly, RADAGRAD can easily be modified to make use of standard improvements such as variance reduction. Using variance reduction in combination has particularly stark benefits for non-convex optimization in convolutional and recurrent neural networks.
We observe a marked improvement over widely-used techniques such as ADAGRAD and SVRG, the combination of which has recently been proven to be an excellent choice for non-convex optimization [1].

Furthermore, we tried to incorporate exponential forgetting schemes similar to RMSPROP and ADAM into the RADAGRAD framework but found that these degraded performance. A downside of such methods is that they require additional parameters to control the rate of forgetting.

Optimization for deep networks has understandably been a very active research area. Recent work has concentrated on either improving estimates of second-order information or investigating the effect of variance reduction on the gradient estimates. It is clear from our experimental results that a thorough study of the combination provides an important avenue for further investigation, particularly where parts of the underlying model might have low effective rank.

Acknowledgements. We are grateful to David Balduzzi, Christina Heinze-Deml, Martin Jaggi, Aurelien Lucchi, Nishant Mehta and Cheng Soon Ong for valuable discussions and suggestions.

References

[1] Z. Allen-Zhu and E. Hazan. Variance reduction for faster non-convex optimization. In Proceedings of the 33rd International Conference on Machine Learning, 2016.
[2] S.-I. Amari. Natural gradient works efficiently in learning. Neural Computation, 10(2):251–276, 1998.
[3] D. Balduzzi. Deep online convex optimization with gated games. arXiv preprint arXiv:1604.01952, 2016.
[4] D. Balduzzi and M. Ghifary. Strongly-typed recurrent neural networks. In Proceedings of the 33rd International Conference on Machine Learning, 2016.
[5] R. H. Byrd, S. Hansen, J. Nocedal, and Y. Singer. A stochastic quasi-Newton method for large-scale optimization.
arXiv preprint arXiv:1401.7020, 2014.
[6] Y. N. Dauphin, H. de Vries, J. Chung, and Y. Bengio. RMSProp and equilibrated adaptive learning rates for non-convex optimization. arXiv preprint arXiv:1502.04390, 2015.
[7] A. Defazio, F. Bach, and S. Lacoste-Julien. SAGA: A fast incremental gradient method with support for non-strongly convex composite objectives. In Advances in Neural Information Processing Systems, 2014.
[8] G. Desjardins, K. Simonyan, R. Pascanu, et al. Natural neural networks. In Advances in Neural Information Processing Systems, pages 2062–2070, 2015.
[9] J. Duchi, E. Hazan, and Y. Singer. Adaptive subgradient methods for online learning and stochastic optimization. The Journal of Machine Learning Research, 12:2121–2159, 2011.
[10] J. C. Duchi, M. I. Jordan, and H. B. McMahan. Estimation, optimization, and parallelism when data is sparse. In Advances in Neural Information Processing Systems, 2013.
[11] G. H. Golub and C. F. Van Loan. Matrix Computations, volume 3. JHU Press, 2012.
[12] A. Gonen and S. Shalev-Shwartz. Faster SGD using sketched conditioning. arXiv preprint arXiv:1506.02649, 2015.
[13] Y. Gong, S. Kumar, H. Rowley, and S. Lazebnik. Learning binary codes for high-dimensional data using bilinear projections. In Proceedings of CVPR, pages 484–491, 2013.
[14] R. Grosse and R. Salakhutdinov. Scaling up natural gradient by sparsely factorizing the inverse Fisher matrix. In Proceedings of the 32nd International Conference on Machine Learning, pages 2304–2313, 2015.
[15] N. Halko, P. G. Martinsson, and J. A. Tropp. Finding structure with randomness: Probabilistic algorithms for constructing approximate matrix decompositions. SIAM Review, 53(2):217–288, 2011.
[16] C. Heinze, B. McWilliams, and N. Meinshausen. DUAL-LOCO: Distributing statistical estimation using random projections. In Proceedings of AISTATS, 2016.
[17] C. Heinze, B. McWilliams, N.
Meinshausen, and G. Krummenacher. LOCO: Distributing ridge regression with random projections. arXiv preprint arXiv:1406.3469, 2014.
[18] T. Hofmann, A. Lucchi, S. Lacoste-Julien, and B. McWilliams. Variance reduced stochastic gradient descent with neighbors. In Advances in Neural Information Processing Systems, 2015.
[19] R. Johnson and T. Zhang. Accelerating stochastic gradient descent using predictive variance reduction. In Advances in Neural Information Processing Systems, pages 315–323, 2013.
[20] N. S. Keskar and A. S. Berahas. adaQN: An adaptive quasi-Newton algorithm for training RNNs, Nov. 2015.
[21] D. Kingma and J. Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
[22] A. Lucchi, B. McWilliams, and T. Hofmann. A variance reduced stochastic Newton method. arXiv preprint arXiv:1503.08316, 2015.
[23] H. Luo, A. Agarwal, N. Cesa-Bianchi, and J. Langford. Efficient second order online learning via sketching. arXiv preprint arXiv:1602.02202, 2016.
[24] M. W. Mahoney. Randomized algorithms for matrices and data. arXiv preprint arXiv:1104.5557, Apr. 2011.
[25] M. P. Marcus, M. A. Marcinkiewicz, and B. Santorini. Building a large annotated corpus of English: The Penn Treebank. Computational Linguistics, 19(2):313–330, 1993.
[26] J. Martens and R. Grosse. Optimizing neural networks with Kronecker-factored approximate curvature. In Proceedings of the 32nd International Conference on Machine Learning, 2015.
[27] B. McWilliams, G. Krummenacher, M. Lucic, and J. M. Buhmann. Fast and robust least squares estimation in corrupted linear models. In Advances in Neural Information Processing Systems, volume 27, 2014.
[28] B. Neyshabur, R. R. Salakhutdinov, and N. Srebro. Path-SGD: Path-normalized optimization in deep neural networks. In Advances in Neural Information Processing Systems, pages 2413–2421, 2015.
[29] J. Pennington, R. Socher, and C. D. Manning.
GloVe: Global vectors for word representation. In EMNLP, volume 14, pages 1532–1543, 2014.
[30] L. Zhang, M. Mahdavi, R. Jin, T. Yang, and S. Zhu. Recovering the optimal solution by dual random projection. arXiv preprint arXiv:1211.3046, 2012.