{"title": "Learning with SGD and Random Features", "book": "Advances in Neural Information Processing Systems", "page_first": 10192, "page_last": 10203, "abstract": "Sketching and stochastic gradient methods are arguably the most common techniques to derive efficient large scale learning algorithms. In this paper, we investigate their application in the context of nonparametric statistical learning. More precisely, we study the estimator defined by stochastic gradient with mini batches and random features. The latter can be seen as form of nonlinear sketching and used to define approximate kernel methods. The considered estimator is not explicitly penalized/constrained and regularization is implicit. Indeed, our study highlights how different parameters, such as number of features, iterations, step-size and mini-batch size control the learning properties of the solutions. We do this by deriving optimal finite sample bounds, under standard assumptions. The obtained results are corroborated and illustrated by numerical experiments.", "full_text": "Learning with SGD and Random Features\n\nLuigi Carratino\u21e4\nUniversity of Genoa,\n\nGenoa, Italy\n\nAlessandro Rudi\n\nINRIA \u2013 Sierra Project-team,\n\n\u00c9cole Normale Sup\u00e9rieure, Paris\n\nLorenzo Rosasco\nUniversity of Genoa,\nLCSL \u2013 IIT & MIT\n\nAbstract\n\nSketching and stochastic gradient methods are arguably the most common tech-\nniques to derive ef\ufb01cient large scale learning algorithms. In this paper, we inves-\ntigate their application in the context of nonparametric statistical learning. More\nprecisely, we study the estimator de\ufb01ned by stochastic gradient with mini batches\nand random features. The latter can be seen as form of nonlinear sketching and used\nto de\ufb01ne approximate kernel methods. The considered estimator is not explicitly\npenalized/constrained and regularization is implicit. 
Indeed, our study highlights\nhow different parameters, such as number of features, iterations, step-size and\nmini-batch size control the learning properties of the solutions. We do this by\nderiving optimal \ufb01nite sample bounds, under standard assumptions. The obtained\nresults are corroborated and illustrated by numerical experiments.\n\n1\n\nIntroduction\n\nThe interplay between statistical and computational performances is key for modern machine learning\nalgorithms [1]. On the one hand, the ultimate goal is to achieve the best possible prediction error. On\nthe other hand, budgeted computational resources need be factored in, while designing algorithms.\nIndeed, time and especially memory requirements are unavoidable constraints, especially in large-\nscale problems.\nIn this view, stochastic gradient methods [2] and sketching techniques [3] have emerged as funda-\nmental algorithmic tools. Stochastic gradient methods allow to process data points individually, or\nin small batches, keeping good convergence rates, while reducing computational complexity [4].\nSketching techniques allow to reduce data-dimensionality, hence memory requirements, by random\nprojections [3]. Combining the bene\ufb01ts of both methods is tempting and indeed it has attracted much\nattention, see [5] and references therein.\nIn this paper, we investigate these ideas for nonparametric learning. Within a least squares frame-\nwork, we consider an estimator de\ufb01ned by mini-batched stochastic gradients and random features\n[6]. The latter are typically de\ufb01ned by nonlinear sketching: random projections followed by a\ncomponent-wise nonlinearity [3]. They can be seen as shallow networks with random weights [7],\nbut also as approximate kernel methods [8]. Indeed, random features provide a standard approach\nto overcome the memory bottleneck that prevents large-scale applications of kernel methods. 
The\ntheory of reproducing kernel Hilbert spaces [9] provides a rigorous mathematical framework to study\nthe properties of stochastic gradient method with random features. The approach we consider is not\nbased on penalizations or explicit constraints; regularization is implicit and controlled by different\nparameters. In particular, our analysis shows how the number of random features, iterations, step-size\nand mini-batch size control the stability and learning properties of the solution. By deriving \ufb01nite\nsample bounds, we investigate how optimal learning rates can be achieved with different parameter\nchoices. In particular, we show that similarly to ridge regression [10], a number of random features\nproportional to the square root of the number of samples suf\ufb01ces for O(1/pn) error bounds.\n\n\u21e4Email: luigi.carratino@dibris.unige.it\n\n32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montr\u00e9al, Canada.\n\n\fThe rest of the paper is organized as follows. We introduce problem, background and the proposed\nalgorithm in section 2. We present our main results in section 3 and illustrate numerical experiments\nin section 4.\nNotation: For any T 2 N+ we denote by [T ] the set {1, . . . , T}, for any a, b 2 R we de-\nnote by a _ b the maximum between a and b and with ^ the minimum. For any linear operator\nA and 2 R we denote by A the operator (A + I) if not explicitly de\ufb01ned differently. When\nA is a bounded self-adjoint linear operator on a Hilbert space, we denote by max(A) the biggest\neigenvalue of A.\n\n2 Learning with Stochastic Gradients and Random Features\n\nf E(f ),\nmin\n\nIn this section, we present the setting and discuss the learning algorithm we consider.\nThe problem we study is supervised statistical learning with squared loss [11]. 
Given a probability space X × R with distribution ρ, the problem is to solve

min_f E(f),   E(f) = ∫ (f(x) − y)² dρ(x, y),   (1)

given only a training set of pairs (x_i, y_i)_{i=1}^n ∈ (X × R)^n, n ∈ N, sampled independently according to ρ. Here the minimum is intended over all functions for which the above integral is well defined, and ρ is assumed fixed but known only through the samples.
In practice, the search for a solution needs to be restricted to a suitable space of hypotheses to allow efficient computations and reliable estimation [12]. In this paper, we consider functions of the form

f(x) = ⟨w, φ_M(x)⟩,   ∀x ∈ X,   (2)

where w ∈ R^M and φ_M : X → R^M, M ∈ N, denotes a family of finite dimensional feature maps, see below. Further, we consider a mini-batch stochastic gradient method to estimate the coefficients from data,

ŵ_1 = 0;   ŵ_{t+1} = ŵ_t − γ_t (1/b) Σ_{i=b(t−1)+1}^{bt} (⟨ŵ_t, φ_M(x_{j_i})⟩ − y_{j_i}) φ_M(x_{j_i}),   t = 1, . . . , T.   (3)

Here T ∈ N is the number of iterations and J = {j_1, . . . , j_{bT}} denotes the strategy used to select training set points. In particular, in this work we assume the points to be drawn uniformly at random with replacement. Note that, given this sampling strategy, one pass over the data is reached on average after ⌈n/b⌉ iterations. Our analysis allows us to consider multiple as well as single passes. For b = 1 the above algorithm reduces to a simple stochastic gradient iteration. For b > 1 it is a mini-batch version, where b points are used in each iteration to compute a gradient estimate. The parameter γ_t is the step-size.
The algorithm requires specifying several parameters. In the following, we study how their choices are related and can be made to achieve optimal learning bounds. 
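To make the recursion concrete, iteration (3) can be sketched in a few lines of NumPy. This is an illustrative sketch, not the paper's implementation: the function name is ours, the feature matrix is precomputed up front for clarity, and the step-size is held constant (γ_t = γ), as in the main results below.

```python
import numpy as np

def sgd_random_features(X, y, feature_map, T, step_size, b, rng):
    """Sketch of iteration (3): w_1 = 0 and
    w_{t+1} = w_t - step_size * (1/b) * sum_i (<w_t, phi(x_i)> - y_i) * phi(x_i),
    with the b points of each mini-batch drawn uniformly with replacement."""
    Phi = feature_map(X)                      # n x M feature matrix
    n, M = Phi.shape
    w = np.zeros(M)
    for _ in range(T):
        idx = rng.integers(0, n, size=b)      # uniform sampling with replacement
        Phi_b = Phi[idx]
        grad = Phi_b.T @ (Phi_b @ w - y[idx]) / b
        w -= step_size * grad
    return w
```

For b = 1 the loop is plain SGD; for b > 1 the gradient of each step averages over a mini-batch, which is the variant analyzed below.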
Before doing this, we further discuss the class of feature maps we consider.

2.1 From Sketching to Random Features, from Shallow Nets to Kernels

In this paper, we are interested in a particular class of feature maps, namely random features [6]. A simple example is obtained by sketching the input data. Assume X ⊆ R^D and

φ_M(x) = (⟨x, s_1⟩, . . . , ⟨x, s_M⟩),

where s_1, . . . , s_M ∈ R^D is a set of independent and identically distributed random vectors [13]. More generally, we can consider features obtained by nonlinear sketching

φ_M(x) = (σ(⟨x, s_1⟩), . . . , σ(⟨x, s_M⟩)),   (4)

where σ : R → R is a nonlinear function, for example σ(a) = cos(a) [6], or σ(a) = |a|_+ = max(a, 0), a ∈ R [7]. If we write the corresponding function (2) explicitly, we get

f(x) = Σ_{j=1}^M w_j σ(⟨s_j, x⟩),   ∀x ∈ X,   (5)

that is, a shallow neural net with random weights [7] (offsets can be added easily).
For many examples of random features the inner product

⟨φ_M(x), φ_M(x′)⟩ = Σ_{j=1}^M σ(⟨x, s_j⟩) σ(⟨x′, s_j⟩)   (6)

can be shown to converge to a corresponding positive definite kernel k as M tends to infinity [6, 14]. We now show some examples of kernels determined by specific choices of random features.
Example 1 (Random features and kernels). Let σ(a) = cos(a) and consider σ(⟨x, s⟩ + b) in place of σ(⟨x, s⟩), with s drawn from a centered Gaussian distribution with covariance σ⁻²I, and b uniformly from [0, 2π]. These are the so-called Fourier random features, and they recover the Gaussian kernel k(x, x′) = e^{−‖x−x′‖²/(2σ²)} [6] as M increases. If instead σ(a) = a, and s is sampled according to a centered Gaussian with variance σ², the linear kernel k(x, x′) = σ²⟨x, x′⟩ is recovered in the limit [15].

These last observations allow us to establish a connection with kernel methods [10] and the theory of reproducing kernel Hilbert spaces [9]. 
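The Fourier random features of Example 1 can be sketched numerically. This is a minimal illustration (function names are ours, not the paper's code); we use the common √(2/M) scaling, under which the inner product (6) matches the Gaussian kernel in expectation.

```python
import numpy as np

def fourier_features(X, M, sigma, rng):
    """phi_M(x)_j = sqrt(2/M) * cos(<x, s_j> + b_j), with s_j ~ N(0, sigma^{-2} I)
    and b_j uniform on [0, 2*pi]."""
    D = X.shape[1]
    S = rng.standard_normal((D, M)) / sigma
    b = rng.uniform(0.0, 2.0 * np.pi, size=M)
    return np.sqrt(2.0 / M) * np.cos(X @ S + b)

def gaussian_kernel(X, sigma):
    """The kernel k(x, x') = exp(-||x - x'||^2 / (2 sigma^2)) that the
    features above approximate as M grows."""
    sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-sq_dists / (2.0 * sigma ** 2))
```

With Φ the matrix whose rows are the features of the data points, Φ Φᵀ approaches the Gaussian kernel matrix as M grows, which is the convergence behind (6).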
Recall that a reproducing kernel Hilbert space H is a Hilbert\nspace of functions for which there is a symmetric positive de\ufb01nite function2 k : X \u21e5 X ! R called\nreproducing kernel, such that k(x,\u00b7) 2H and hf, k(x,\u00b7)i = f (x) for all f 2H , x 2 X. It is also\nuseful to recall that k is a reproducing kernel if and only if there exists a Hilbert (feature) space F\nand a (feature) map : X !F such that\n(7)\n\nk(x, x0) = h(x), (x0)i,\n\n8x, x0 2 X,\n\nwhere F can be in\ufb01nite dimensional.\nThe connection to RKHS is interesting in at least two ways. First, it allows to use results and\ntechniques from the theory of RKHS to analyze random features. Second, it shows that random\nfeatures can be seen as an approach to derive scalable kernel methods [10]. Indeed, kernel methods\nhave complexity at least quadratic in the number of points, while random features have complexity\nwhich is typically linear in the number of points. From this point of view, the intuition behind random\nfeatures is to relax (7) considering\n\nwhere M is \ufb01nite dimensional.\n\nk(x, x0) \u21e1 hM (x), M (x0)i,\n\n8x, x0 2 X.\n\n(8)\n\n2.2 Computational complexity\nIf we assume the computation of the feature map M (x) to have a constant cost, the iteration (3)\nrequires O(M ) operations per iteration for b = 1, that is O(M n) for one pass T = n. Note that for\nb > 1 each iteration cost O(M b) but one pass corresponds to d n\nb e iterations so that the cost for one\npass is again O(M n). A main advantage of mini-batching is that gradient computations can be easily\nparallelized. In the multiple pass case, the time complexity after T iterations is O(M bT ).\nComputing the feature map M (x) requires to compute M random features. The computation of\none random feature does not depend on n, but only on the input space X. 
If, for example, we assume X ⊆ R^D and consider random features defined as in the previous section, computing φ_M(x) requires M random projections of D-dimensional vectors [6], for a total time complexity of O(MD) for evaluating the feature map at one point. For different input spaces and different types of random features the computational cost may differ; see for example Orthogonal Random Features [16] or Fastfood [17], where the cost is reduced from O(MD) to O(M log D). Note that the analysis presented in this paper holds for random features which are independent, while Orthogonal and Fastfood random features are dependent. Although it should be possible to extend our analysis to Orthogonal and Fastfood random features, further work is needed. To simplify the discussion, in the following we treat the complexity of computing φ_M(x) as O(M).
One of the advantages of random features is that each φ_M(x) can be computed online at each iteration, preserving O(MbT) as the time complexity of the algorithm (3). Computing φ_M(x) online also reduces memory requirements. Indeed, the space complexity of the algorithm (3) is O(Mb) if the mini-batches are computed in parallel, or O(M) if they are computed sequentially.

2 For all x_1, . . . , x_n the matrix with entries k(x_i, x_j), i, j = 1, . . . , n is positive semi-definite.

2.3 Related approaches

We comment on the connection to related algorithms. Random features are typically used within an empirical risk minimization framework [18]. Results considering convex Lipschitz loss functions and ℓ1 constraints are given in [19], while [20] considers ℓ2 constraints. A ridge regression framework is considered in [8], where it is shown that it is possible to achieve optimal statistical guarantees with a number of random features in the order of √n. The combination of random features and gradient methods is less explored. A stochastic coordinate descent approach is considered in [21]; see also [22, 23]. 
A related approach is based on subsampling and is often called Nystr\u00f6m method [24, 25].\nHere a shallow network is de\ufb01ned considering a nonlinearity which is a positive de\ufb01nite kernel, and\nweights chosen as a subset of training set points. This idea can be used within a penalized empirical\nrisk minimization framework [26, 27, 28] but also considering gradient [29, 30] and stochastic\ngradient [31] techniques. An empirical comparison between Nystr\u00f6m method, random features and\nfull kernel method is given in [23], where the empirical risk minimization problem is solved by block\ncoordinate descent. Note that numerous works have combined stochastic gradient and kernel methods\nwith no random projections approximation [32, 33, 34, 35, 36, 5]. The above list of references is only\npartial and focusing on papers providing theoretical analysis. In the following, after stating our main\nresults we provide a further quantitative comparison with related results.\n\n3 Main Results\n\nIn this section, we \ufb01rst discuss our main results under basic assumptions and then more re\ufb01ned results\nunder further conditions.\n\n3.1 Worst case results\n\nOur results apply to a general class of random features described by the following assumption.\nAssumption 1. Let (\u2326,\u21e1 ) be a probability space, : X \u21e5 \u2326 ! R and for all x 2 X,\n\nM (x) =\n\n1\npM\n\n( (x, !1), . . . , (x, !M )) ,\n\n(9)\n\nwhere !1, . . . ,! M 2 \u2326 are sampled independently according to \u21e1.\nThe above class of random features cover all the examples described in section 2.1, as well as many\nothers, see [8, 20] and references therein. Next we introduce the positive de\ufb01nite kernel de\ufb01ned by\nthe above random features. Let k : X \u21e5 X ! R be de\ufb01ned by\nk(x, x0) =Z (x, !) (x0,! )d\u21e1(!),\n\n8, x, x0 2 X.\n\nIt is easy to check that k is a symmetric and positive de\ufb01nite kernel. 
To control basic properties of\nthe induced kernel (continuity, boundedness), we require the following assumption, which is again\nsatis\ufb01ed by the examples described in section 2.1 (see also [8, 20] and references therein).\nAssumption 2. The function is continuous and there exists \uf8ff 1 such that | (x, !)|\uf8ff \uf8ff for any\nx 2 X, ! 2 \u2326.\nThe kernel introduced above allows to compare random feature maps of different size and to express\nthe regularity of the largest function class they induce. In particular, we require a standard assumption\nin the context of non-parametric regression (see [11]), which consists in assuming a minimum for the\nexpected risk, over the space of functions induced by the kernel.\nAssumption 3. If H is the RKHS with kernel k, there exists fH 2H such that\n\nE(fH) = inf\n\nf2HE(f ).\n\nTo conclude, we need some basic assumption on the data distribution. For all x 2 X, we denote by\n\u21e2(y|x) the conditional probability of \u21e2 and by \u21e2X the corresponding marginal probability on X. We\nneed a standard moment assumption to derive probabilistic results.\n\n4\n\n\fAssumption 4. For any x 2 X\n\nZY\n\ny2ld\u21e2(y|x) \uf8ff l!Blp,\n\n8l 2 N\n\n(10)\n\nfor costants B 2 (0,1) and p 2 (1,1), \u21e2X-almost surely.\nThe above assumption holds when y is bounded, sub-gaussian or sub-exponential.\nThe next theorem corresponds to our \ufb01rst main result. Recall that, the excess risk for a given estimator\n\nE(bf ) E (fH),\n\nand is a standard error measure in statistical machine learning [11, 18]. In the following theorem, we\ncontrol the excess risk of the estimator with respect to the number of points, the number of RF, the\n\nbf is de\ufb01ned as\nstep size, the mini-batch size and the number of iterations. We let bft+1 = hbwt+1, M (\u00b7)i, with bwt+1\n\nas in (3).\nTheorem 1. Let n, M 2 N+, 2 (0, 1) and t 2 [T ]. Under Assumptions 1 to 4, for b 2 [n], t = \n and M & T the following holds with probability at\ns.t. 
least 1 − δ, provided γ ≤ 1/(8(1 + log T)) ∧ n/(9T log(n/δ)) and n ≥ 32 log²(2/δ):

E_J[E(f̂_{t+1})] − E(f_H) ≲ γ/b + (γt/M + 1)(γt/n) log(1/δ) + (1/M) log(1/δ) + 1/(γt).   (11)

The above theorem bounds the excess risk by a sum of terms controlled by the different parameters. The following corollary shows how these parameters can be chosen to derive finite sample bounds.
Corollary 1. Under the same assumptions of Theorem 1, for one of the following conditions
(c1.1) b = 1, γ ≃ 1/n, and T = n√n iterations (√n passes over the data);
(c1.2) b = 1, γ ≃ 1/√n, and T = n iterations (1 pass over the data);
(c1.3) b = √n, γ ≃ 1, and T = √n iterations (1 pass over the data);
(c1.4) b = n, γ ≃ 1, and T = √n iterations (√n passes over the data);
a number

M = Õ(√n)   (12)

of random features is sufficient to guarantee with high probability that

E_J[E(f̂_T)] − E(f_H) ≲ 1/√n.   (13)

The above learning rate is the same as that achieved by an exact kernel ridge regression (KRR) estimator [11, 37, 38], which has been proved to be optimal in a minimax sense [11] under the same assumptions. Further, the number of random features required to achieve this bound is the same as for the kernel ridge regression estimator with random features [8]. Notice that, in the limit case where the number of random features grows to infinity, for Corollary 1 under conditions (c1.2) and (c1.3) we recover the same results as for one-pass SGD [39, 40]. In this limit, our results are also related to those in [41]. There, however, averaging of the iterates is used to achieve larger step-sizes.
Note that conditions (c1.1) and (c1.2) in the corollary above show that, when no mini-batches are used (b = 1) and 1/n ≤ γ ≤ 1/√n, the step-size determines the number of passes over the data required for optimal generalization. In particular, the number of passes varies from constant, when γ = 1/√n, to √n, when γ = 1/n. 
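For illustration, the four regimes of Corollary 1 can be tabulated with a small helper (hypothetical, not from the paper's code); time is counted as the O(M·b·T) cost discussed in section 2.2, and passes as b·T/n.

```python
import math

def corollary1_settings(n):
    """The four parameter choices of Corollary 1, each yielding the O(1/sqrt(n))
    excess-risk bound with M = O~(sqrt(n)) random features."""
    sqrt_n = int(math.isqrt(n))
    M = sqrt_n
    configs = {
        "c1.1": dict(b=1, step=1.0 / n, T=n * sqrt_n),
        "c1.2": dict(b=1, step=1.0 / sqrt_n, T=n),
        "c1.3": dict(b=sqrt_n, step=1.0, T=sqrt_n),
        "c1.4": dict(b=n, step=1.0, T=sqrt_n),
    }
    for c in configs.values():
        c["passes"] = c["b"] * c["T"] / n        # b*T/n passes over the data
        c["time"] = M * c["b"] * c["T"]          # O(M*b*T) total cost
    return configs
```

Printing the table for a given n makes the trade-off visible: (c1.2) and (c1.3) both need one pass and O(n√n) total work, while (c1.1) and (c1.4) trade √n passes for their extreme step-size or batch-size choices.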
In order to increase the step-size beyond 1/√n, the algorithm needs to be run with mini-batches. The step-size can then be increased up to a constant if b is chosen equal to √n (condition (c1.3)), requiring the same number of passes over the data as the setting (c1.2). Interestingly, condition (c1.4) shows that increasing the mini-batch size beyond √n does not allow taking larger step-sizes, while it seems to increase the number of passes over the data required to reach optimality.
We now compare the time complexity of algorithm (3) with some closely related methods which achieve the same optimal rate of 1/√n. Computing the classical KRR estimator [11] has a complexity of roughly O(n³) in time and O(n²) in memory. Lowering this computational cost is possible with random projection techniques. Both random features and the Nyström method applied to KRR [8, 26] lower the time complexity to O(n²) and the memory complexity to O(n√n) while preserving the statistical accuracy. The same time complexity is achieved by the stochastic gradient method solving the full kernel problem [33, 36], but with the higher space complexity of O(n²). The combination of the stochastic gradient iteration, random features and mini-batches allows our algorithm to achieve a complexity of O(n√n) in time and O(n) in space for certain choices of the free parameters (like (c1.2) and (c1.3)). Note that these time and memory complexities are lower than those of stochastic gradient with mini-batches and Nyström approximation, which are O(n²) and O(n) respectively [31]. A method with complexity similar to SGD with RF is FALKON [30], which has a time complexity of O(n√n log(n)) and O(n) space complexity. 
This method blends together Nyström approximation, a sketched preconditioner and conjugate gradient.

3.2 Refined analysis and fast rates

We next discuss how the above results can be refined under an additional regularity assumption. We need some preliminary definitions. Let H be the RKHS defined by k, and L : L²(X, ρ_X) → L²(X, ρ_X) the integral operator

Lf(x) = ∫ k(x, x′) f(x′) dρ_X(x′),   ∀f ∈ L²(X, ρ_X), x ∈ X,

where L²(X, ρ_X) = {f : X → R : ‖f‖²_ρ = ∫ |f|² dρ_X < ∞}. The above operator is symmetric and positive definite. Moreover, Assumption 2 ensures that the kernel is bounded, which in turn ensures L is trace class, hence compact [18].
Assumption 5. For any λ > 0, define the effective dimension as N(λ) = Tr((L + λI)⁻¹L), and assume there exist Q > 0 and α ∈ [0, 1] such that

N(λ) ≤ Q² λ^{−α}.   (14)

Moreover, assume there exist r ≥ 1/2 and g ∈ L²(X, ρ_X) such that

f_H(x) = (L^r g)(x).   (15)

Condition (14) describes the capacity/complexity of the RKHS H and the measure ρ. It is equivalent to classic entropy/covering number conditions, see e.g. [18]. The case α = 1 corresponds to making no assumptions on the kernel, and reduces to the worst case analysis in the previous section. The smaller α is, the more stringent the capacity condition. A classic example is considering X = R^D with dρ_X(x) = p(x)dx, where p is a probability density, strictly positive and bounded away from zero, and H a Sobolev space with smoothness s > D/2. Indeed, in this case α = D/(2s), and classical nonparametric statistics assumptions are recovered as a special case. Note that in particular the worst case is s = D/2. 
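As a rough numerical illustration (not part of the paper's analysis), the effective dimension of Assumption 5 can be approximated from the spectrum of an empirical kernel matrix K, since the eigenvalues of K/n approximate those of the integral operator L.

```python
import numpy as np

def effective_dimension(K, lam):
    """Empirical proxy for N(lambda) = Tr((L + lambda I)^{-1} L), computed from
    the eigenvalues of K/n, which approximate the spectrum of L."""
    eigs = np.linalg.eigvalsh(K) / K.shape[0]
    eigs = np.clip(eigs, 0.0, None)   # guard against tiny negative eigenvalues
    return float(np.sum(eigs / (eigs + lam)))
```

N(λ) grows as λ decreases, and the growth exponent α in (14) reflects how quickly the kernel's spectrum decays.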
Condition (15) is a regularity condition commonly used in approximation theory to control the bias of the estimator [42].
The following theorem is a refined version of Theorem 1, where we also consider the above capacity condition (Assumption 5).
Theorem 2. Let n, M ∈ N+, δ ∈ (0, 1) and t ∈ [T]. Under Assumptions 1 to 5, for b ∈ [n], γ_t = γ s.t.

\uf8ff
+\u2713 1
t\u25c62r
+ 1\u25c6 N\u21e3 1
8(1+log T ), n 32 log2 2
t\u2318 log 1
+ N\u21e3 1
t\u23182r1
M (t)2r1
+\u2713 t
M
log 1
EJ\u21e5E(bft+1)\u21e4 E (fH) . 
b
.
(16)
n
9T log n
 ^
1
n

moreover
 \uf8ff
1
n
8(1+log T )
9T 1\u2713 log n
 ^( \u2713^(1\u2713)
M 4 + 18T 1\u2713 log
btmin(\u2713,1\u2713) (log t _ 1)
\u2713 2]0, 1[
otherwise,
12T 1\u2713
,
(19) (20) (21) (22) (23)

The main difference is the presence of the effective dimension, which provides a sharper control of the stability of the considered estimator. As before, explicit learning bounds can be derived considering different parameter settings.
Corollary 2. Under the same assumptions of Theorem 2, for one of the following conditions
(c2.1) b = 1, γ ≃ 1/n, and T = n^{(2r+α+1)/(2r+α)} iterations (n^{1/(2r+α)} passes over the data);
(c2.2) b = 1, γ ≃ n^{−2r/(2r+α)}, and T = n^{(2r+1)/(2r+α)} iterations (n^{(1−α)/(2r+α)} passes over the data);
(c2.3) b = n^{2r/(2r+α)}, γ ≃ 1, and T = n^{1/(2r+α)} iterations (n^{(1−α)/(2r+α)} passes over the data);
(c2.4) b = n, γ ≃ 1, and T = n^{1/(2r+α)} iterations (n^{1/(2r+α)} passes over the data);
a number

M = Õ(n^{(1+α(2r−1))/(2r+α)})   (17)

of random features suffices to guarantee with high probability that

E_J[E(f̂_T)] − E(f_H) ≲ 
n^{−2r/(2r+α)}.   (18)

The corollary above shows that multi-pass SGD achieves a learning rate that is the same as that of kernel ridge regression under the regularity Assumption 5, and is again minimax optimal (see [11]). Moreover, we obtain the minimax optimal rate with the same number of random features required for ridge regression with random features [8] under the same assumptions. Finally, when the number of random features goes to infinity, we also recover the results for the infinite dimensional case of the single-pass and multiple-pass stochastic gradient method [33].
It is worth noting that, under the additional regularity Assumption 5, the number of both random features and passes over the data sufficient for optimal learning rates increases with respect to what is required in the worst case (see Corollary 1). The same effect occurs in the context of ridge regression with random features, as noted in [8]. In this latter paper, it is observed that this issue can be tackled using more refined, possibly more costly, sampling schemes [20].
Finally, we present a general result from which all our previous results follow as special cases. We consider a more general setting where we allow decreasing step-sizes.

Theorem 3. Let n, M, T ∈ N, b ∈ [n] and γ > 0. Let δ ∈ (0, 1) and ŵ_{t+1} be the estimator in Eq. (3) with γ_t = κ⁻² γ t^{−θ} and θ ∈ [0, 1[. Under Assumptions 1 to 4, when n ≥ 32 log²(2/δ), then for any t ∈ [T] the following holds with probability at least 1 − 9δ:

EJ\u21e5E(bwt+1)\u21e4 inf
 t1\u2713 _ 1\u25c6 N\u21e3 \uf8ff2
t1\u2713\u2318
1
M
t1\u2713 )2r1 log 2
w2F E(w) \uf8ff c1
+\u2713c2 + c3
+ c4 N ( \uf8ff2
M (t1\u2713\uf8ff2)2r1 log22r11t1\u2713 +\u2713 1
log
M
n
log2(t) _ 1 log2 4
t1\u2713\u25c62r!
,
with c1, c2, c3, c4 constants which do not depend on b, γ, n, t, M, δ.
We note that, as the number of random features M goes to infinity, we recover the same bound of [33] for decreasing step-sizes. Moreover, the above theorem shows that there is no apparent gain in using a decreasing step-size (i.e. θ > 0) with respect to the regimes identified in Corollaries 1 and 2.

3.3 Sketch of the Proof

In this section, we sketch the main ideas in the proof. We relate f̂_t and f_H by introducing several intermediate functions. In particular, the following iterations are useful,

v̂_1 = 0;   v̂_{t+1} = v̂_t − γ_t (1/n) Σ_{i=1}^n (⟨v̂_t, φ_M(x_i)⟩ − y_i) φ_M(x_i),   ∀t ∈ [T],   (24)

ṽ_1 = 0;   ṽ_{t+1} = ṽ_t − γ_t ∫ (⟨ṽ_t, φ_M(x)⟩ − y) φ_M(x) dρ(x, y),   ∀t ∈ [T],   (25)

v_1 = 0;   v_{t+1} = v_t − γ_t ∫ (⟨v_t, φ_M(x)⟩ − f_H(x)) φ_M(x) dρ_X(x),   ∀t ∈ [T].   (26)

Figure 1: Classification error of SUSY (left) and HIGGS (right) datasets as the number of random features varies.

Further, we let

ũ_λ = argmin_{u ∈ R^M} ∫ (⟨u, φ_M(x)⟩ − f_H(x))² dρ_X(x) + λ‖u‖²,   λ > 0,   (27)

u_λ = argmin_{u ∈ F} ∫ (⟨u, φ(x)⟩ − y)² dρ(x, y) + λ‖u‖²,   λ > 0,   (28)

where (F, φ) are the feature space and feature map associated to the kernel k. The first three vectors are defined by the random features and can be seen as empirical and population batch gradient descent iterations. 
The last two vectors can be seen as population versions of ridge regression defined by the random features and by the feature map φ, respectively.
Since the above objects (24), (25), (26), (27), (28) belong to different spaces, instead of comparing them directly we compare the functions in L²(X, ρ_X) associated to them, letting

ĝ_t = ⟨v̂_t, φ_M(·)⟩,   g̃_t = ⟨ṽ_t, φ_M(·)⟩,   g_t = ⟨v_t, φ_M(·)⟩,   g̃_λ = ⟨ũ_λ, φ_M(·)⟩,   g_λ = ⟨u_λ, φ(·)⟩.

Since it is well known [11] that

E(f) − E(f_H) = ‖f − f_H‖²_ρ,

we then can consider the following decomposition

f̂_t − f_H = f̂_t − ĝ_t   (29)
 + ĝ_t − g̃_t   (30)
 + g̃_t − g_t   (31)
 + g_t − g̃_λ   (32)
 + g̃_λ − g_λ   (33)
 + g_λ − f_H.   (34)

The first two terms control how SGD deviates from batch gradient descent and the effect of noise and sampling. They are studied in Lemmas 1, 2, 3, 4, 5, 6 in the Appendix, borrowing and adapting ideas from [33, 36, 8]. The following terms account for the approximation properties of random features and the bias of the algorithm. Here the basic idea and novel result is the study of how population gradient descent and ridge regression are related (32) (Lemma 9 in the Appendix). Then, results from the analysis of ridge regression with random features are used [8].

4 Experiments

We study the behavior of the SGD with RF algorithm on subsets of n = 2 × 10⁵ points of the SUSY³ and HIGGS⁴ datasets [43]. The measures we show in the following experiments are an average over 10 repetitions of the algorithm. 
Further, we consider random Fourier features, which are known to approximate translation invariant kernels [6]. We use random features of the form ψ(x, ω) = cos(wᵀx + q), with ω := (w, q), w sampled according to the normal distribution and q sampled uniformly at random between 0 and 2π. Note that the random features defined this way satisfy Assumption 2.

3 https://archive.ics.uci.edu/ml/datasets/SUSY
4 https://archive.ics.uci.edu/ml/datasets/HIGGS

Figure 2: Classification error of SUSY (left) and HIGGS (right) datasets as step-size and mini-batch size vary.

Our theoretical analysis suggests that a number of RF of the order of √n suffices to attain optimal learning properties. Hence we study how the number of RF affects the accuracy of the algorithm on test sets of 10⁵ points. In Figure 1 we show the classification error after 5 passes over the data of SGD with RF as the number of RF increases, with a fixed batch size of √n and a step-size of 1. We can observe that, above a certain threshold of the order of √n, increasing the number of RF does not improve the accuracy, confirming what our theoretical results suggest.
Further, theory suggests that the step-size can be increased as the mini-batch size increases to reach an optimal accuracy, and that beyond a mini-batch size of the order of √n more than 1 pass over the data is required to reach the same accuracy. We show in Figure 2 the classification error of SGD with RF after 1 pass over the data, with a fixed number of random features √n, as mini-batch size and step-size vary, on test sets of 10⁵ points. As suggested by theory, to reach the lowest error as the mini-batch size grows the step-size needs to grow as well. 
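A minimal end-to-end sketch of this protocol on synthetic data (a stand-in for SUSY/HIGGS; the helper names and the synthetic target are hypothetical), with the (c1.3)-style choices b = √n, step-size 1, M ≈ √n and one pass:

```python
import numpy as np

def rff(X, S, q):
    # psi(x, omega) = cos(w^T x + q), stacked and scaled by 1/sqrt(M) as in (9)
    return np.cos(X @ S + q) / np.sqrt(S.shape[1])

def train_sgd_rf(X, y, M, T, batch, step, rng):
    S = rng.standard_normal((X.shape[1], M))
    q = rng.uniform(0.0, 2.0 * np.pi, M)
    w = np.zeros(M)
    for _ in range(T):
        idx = rng.integers(0, X.shape[0], batch)
        Phi = rff(X[idx], S, q)
        w -= step * Phi.T @ (Phi @ w - y[idx]) / batch
    return w, S, q

# synthetic +/-1 labels standing in for a binary dataset such as SUSY or HIGGS
rng = np.random.default_rng(0)
n = 1024
X = rng.standard_normal((n, 4))
y = np.sign(X[:, 0] + 0.5 * np.sin(X[:, 1]) + 1e-12)
b = int(np.sqrt(n))                    # mini-batch size sqrt(n), step-size 1
M, T = b, n // b                       # M ~ sqrt(n) features, one pass
w, S, q = train_sgd_rf(X, y, M=M, T=T, batch=b, step=1.0, rng=rng)
squared_loss = np.mean((rff(X, S, q) @ w - y) ** 2)
train_error = np.mean(np.sign(rff(X, S, q) @ w) != y)
```

As in the experiments, classification error is obtained by thresholding a least squares fit of the ±1 labels; varying M, b and the step-size in this sketch reproduces the qualitative trade-offs shown in Figures 1 and 2.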
Further, for mini-batch sizes bigger than √n, the lowest error cannot be reached in only 1 pass, even when increasing the step-size.

5 Conclusions

In this paper we investigated the combination of sketching and stochastic gradient techniques in the context of non-parametric regression. In particular, we studied the statistical and computational properties of the estimator defined by stochastic gradient descent with multiple passes, mini-batches and random features. We proved that the estimator achieves optimal statistical properties with a number of random features in the order of √n (with n the number of examples). Moreover, we analyzed possible trade-offs between the number of passes, the step-size and the size of the mini-batches, showing that there exist different configurations which achieve the same optimal statistical guarantees, with different computational impacts.
Our work can be extended in several ways. First, (a) we can study the effect of combining random features with accelerated/averaged stochastic techniques as in [32]. Second, (b) we can extend our analysis to consider more refined assumptions, generalizing [35] to SGD with random features. Additionally, (c) we can study the statistical properties of the considered estimator in the context of classification, with the goal of showing fast decay of the classification error, as in [34]. Moreover, (d) we can apply the proposed method in the more general context of least squares frameworks for multitask learning [44, 45] or structured prediction [46, 47, 48], with the goal of obtaining faster algorithms, while retaining strong statistical guarantees. 
Finally, (e) we can integrate our analysis with more refined methods for selecting the random features, analogously to [49, 50] in the context of column sampling.

Acknowledgments.
This material is based upon work supported by the Center for Brains, Minds and Machines (CBMM), funded by NSF STC award CCF-1231216, and the Italian Institute of Technology. We gratefully acknowledge the support of NVIDIA Corporation for the donation of the Titan Xp GPUs and the Tesla K40 GPU used for this research. L. R. acknowledges the support of the AFOSR projects FA9550-17-1-0390 and BAA-AFRL-AFOSR-2016-0007 (European Office of Aerospace Research and Development), and the EU H2020-MSCA-RISE project NoMADS - DLV-777826. A. R. acknowledges the support of the European Research Council (grant SEQUOIA 724063).

References
[1] Alekh Agarwal, Sahand Negahban, and Martin J Wainwright. Stochastic optimization and sparse statistical recovery: Optimal algorithms for high dimensions. In Advances in Neural Information Processing Systems, pages 1538–1546, 2012.
[2] Herbert Robbins and Sutton Monro. A stochastic approximation method. The Annals of Mathematical Statistics, pages 400–407, 1951.
[3] Haim Avron, Vikas Sindhwani, and David Woodruff. Sketching structured matrices for faster nonlinear regression. In Advances in Neural Information Processing Systems, pages 2994–3002, 2013.
[4] Léon Bottou and Olivier Bousquet. The tradeoffs of large scale learning. In Advances in Neural Information Processing Systems, pages 161–168, 2008.
[5] Francesco Orabona. Simultaneous model selection and optimization through parameter-free stochastic learning. In Advances in Neural Information Processing Systems, pages 1116–1124, 2014.
[6] Ali Rahimi and Benjamin Recht. Random features for large-scale kernel machines. In Advances in Neural Information Processing Systems, pages 1177–1184, 2008.
[7] Youngmin Cho and Lawrence K Saul.
Kernel methods for deep learning. In Advances in Neural Information Processing Systems, pages 342–350, 2009.
[8] Alessandro Rudi and Lorenzo Rosasco. Generalization properties of learning with random features. In Advances in Neural Information Processing Systems 30, pages 3215–3225, 2017.
[9] Nachman Aronszajn. Theory of reproducing kernels. Transactions of the American Mathematical Society, 68(3):337–404, 1950.
[10] Bernhard Schölkopf and Alexander J Smola. Learning with kernels: support vector machines, regularization, optimization, and beyond. MIT Press, 2002.
[11] Andrea Caponnetto and Ernesto De Vito. Optimal rates for the regularized least-squares algorithm. Foundations of Computational Mathematics, 7(3):331–368, 2007.
[12] Luc Devroye, László Györfi, and Gábor Lugosi. A probabilistic theory of pattern recognition, volume 31. Springer Science & Business Media, 2013.
[13] David P Woodruff et al. Sketching as a tool for numerical linear algebra. Foundations and Trends in Theoretical Computer Science, 10(1–2):1–157, 2014.
[14] Bharath Sriperumbudur and Zoltán Szabó. Optimal rates for random Fourier features. In Advances in Neural Information Processing Systems, pages 1144–1152, 2015.
[15] Raffay Hamid, Ying Xiao, Alex Gittens, and Dennis DeCoste. Compact random feature maps. In International Conference on Machine Learning, pages 19–27, 2014.
[16] Felix X Yu, Ananda Theertha Suresh, Krzysztof M Choromanski, Daniel N Holtmann-Rice, and Sanjiv Kumar. Orthogonal random features. In Advances in Neural Information Processing Systems, pages 1975–1983, 2016.
[17] Quoc Le, Tamás Sarlós, and Alex Smola. Fastfood: Approximating kernel expansions in loglinear time. In Proceedings of the International Conference on Machine Learning, volume 85, 2013.
[18] Ingo Steinwart and Andreas Christmann. Support vector machines.
Springer Science & Business Media, 2008.
[19] Ali Rahimi and Benjamin Recht. Weighted sums of random kitchen sinks: Replacing minimization with randomization in learning. In Advances in Neural Information Processing Systems, pages 1313–1320, 2009.
[20] Francis Bach. On the equivalence between kernel quadrature rules and random feature expansions. Journal of Machine Learning Research, 18(21):1–38, 2017.
[21] Bo Dai, Bo Xie, Niao He, Yingyu Liang, Anant Raj, Maria-Florina F Balcan, and Le Song. Scalable kernel methods via doubly stochastic gradients. In Advances in Neural Information Processing Systems, pages 3041–3049, 2014.
[22] Junhong Lin and Lorenzo Rosasco. Generalization properties of doubly online learning algorithms. arXiv preprint arXiv:1707.00577, 2017.
[23] Stephen Tu, Rebecca Roelofs, Shivaram Venkataraman, and Benjamin Recht. Large scale kernel learning using block coordinate descent. arXiv preprint arXiv:1602.05310, 2016.
[24] Alex J Smola and Bernhard Schölkopf. Sparse greedy matrix approximation for machine learning. 2000.
[25] Christopher KI Williams and Matthias Seeger. Using the Nyström method to speed up kernel machines. In Advances in Neural Information Processing Systems, pages 682–688, 2001.
[26] Alessandro Rudi, Raffaello Camoriano, and Lorenzo Rosasco. Less is more: Nyström computational regularization. In Advances in Neural Information Processing Systems, pages 1657–1665, 2015.
[27] Yun Yang, Mert Pilanci, and Martin J Wainwright. Randomized sketches for kernels: Fast and optimal non-parametric regression. arXiv preprint arXiv:1501.06195, 2015.
[28] Ahmed Alaoui and Michael W Mahoney. Fast randomized kernel ridge regression with statistical guarantees. In Advances in Neural Information Processing Systems, pages 775–783, 2015.
[29] Raffaello Camoriano, Tomás Angles, Alessandro Rudi, and Lorenzo Rosasco.
Nytro: When subsampling meets early stopping. In Artificial Intelligence and Statistics, pages 1403–1411, 2016.
[30] Alessandro Rudi, Luigi Carratino, and Lorenzo Rosasco. FALKON: An optimal large scale kernel method. In Advances in Neural Information Processing Systems, pages 3891–3901, 2017.
[31] Junhong Lin and Lorenzo Rosasco. Optimal rates for learning with Nyström stochastic gradient methods. arXiv preprint arXiv:1710.07797, 2017.
[32] Aymeric Dieuleveut, Nicolas Flammarion, and Francis Bach. Harder, better, faster, stronger convergence rates for least-squares regression. The Journal of Machine Learning Research, 18(1):3520–3570, 2017.
[33] Junhong Lin and Lorenzo Rosasco. Optimal rates for multi-pass stochastic gradient methods. Journal of Machine Learning Research, 18(97):1–47, 2017.
[34] Loucas Pillaud-Vivien, Alessandro Rudi, and Francis Bach. Exponential convergence of testing error for stochastic gradient methods. In Proceedings of the 31st Conference On Learning Theory, volume 75, pages 250–296, 2018.
[35] Loucas Pillaud-Vivien, Alessandro Rudi, and Francis Bach. Statistical optimality of stochastic gradient descent on hard learning problems through multiple passes. In Advances in Neural Information Processing Systems 31, pages 8125–8135. Curran Associates, Inc., 2018.
[36] Lorenzo Rosasco and Silvia Villa. Learning with incremental iterative regularization. In Advances in Neural Information Processing Systems, pages 1630–1638, 2015.
[37] Ingo Steinwart, Don R Hush, Clint Scovel, et al. Optimal rates for regularized least squares regression. In COLT, 2009.
[38] Junhong Lin, Alessandro Rudi, Lorenzo Rosasco, and Volkan Cevher. Optimal rates for spectral algorithms with least-squares regression over Hilbert spaces.
Applied and Computational Harmonic Analysis, 2018.
[39] Ohad Shamir and Tong Zhang. Stochastic gradient descent for non-smooth optimization: Convergence results and optimal averaging schemes. In International Conference on Machine Learning, pages 71–79, 2013.
[40] Ofer Dekel, Ran Gilad-Bachrach, Ohad Shamir, and Lin Xiao. Optimal distributed online prediction using mini-batches. Journal of Machine Learning Research, 13(Jan):165–202, 2012.
[41] Aymeric Dieuleveut, Francis Bach, et al. Nonparametric stochastic approximation with large step-sizes. The Annals of Statistics, 44(4):1363–1399, 2016.
[42] Steve Smale and Ding-Xuan Zhou. Estimating the approximation error in learning theory. Analysis and Applications, 1(01):17–41, 2003.
[43] Pierre Baldi, Peter Sadowski, and Daniel Whiteson. Searching for exotic particles in high-energy physics with deep learning. Nature Communications, 5:4308, 2014.
[44] Andreas Argyriou, Theodoros Evgeniou, and Massimiliano Pontil. Convex multi-task feature learning. Machine Learning, 73(3):243–272, 2008.
[45] Carlo Ciliberto, Alessandro Rudi, Lorenzo Rosasco, and Massimiliano Pontil. Consistent multi-task learning with nonlinear output relations. In Advances in Neural Information Processing Systems, pages 1986–1996, 2017.
[46] Carlo Ciliberto, Lorenzo Rosasco, and Alessandro Rudi. A consistent regularization approach for structured prediction. In Advances in Neural Information Processing Systems 29, pages 4412–4420, 2016.
[47] Anton Osokin, Francis Bach, and Simon Lacoste-Julien. On structured prediction theory with calibrated convex surrogate losses. In Advances in Neural Information Processing Systems, pages 302–313, 2017.
[48] Carlo Ciliberto, Francis Bach, and Alessandro Rudi. Localized structured prediction. arXiv preprint arXiv:1806.02402, 2018.
[49] Petros Drineas, Malik Magdon-Ismail, Michael W Mahoney, and David P Woodruff.
Fast approximation of matrix coherence and statistical leverage. Journal of Machine Learning Research, 13(Dec):3475–3506, 2012.
[50] Alessandro Rudi, Daniele Calandriello, Luigi Carratino, and Lorenzo Rosasco. On fast leverage score sampling and optimal learning. In Advances in Neural Information Processing Systems 31, pages 5677–5687. Curran Associates, Inc., 2018.
[51] Felipe Cucker and Steve Smale. On the mathematical foundations of learning. Bulletin of the American Mathematical Society, 39(1):1–49, 2002.
[52] Ernesto De Vito, Lorenzo Rosasco, Andrea Caponnetto, Umberto De Giovannini, and Francesca Odone. Learning from examples as an inverse problem. Journal of Machine Learning Research, 6(May):883–904, 2005.
[53] Alessandro Rudi, Guillermo D Canas, and Lorenzo Rosasco. On the sample complexity of subspace learning. In Advances in Neural Information Processing Systems, pages 2067–2075, 2013.