{"title": "Beating SGD: Learning SVMs in Sublinear Time", "book": "Advances in Neural Information Processing Systems", "page_first": 1233, "page_last": 1241, "abstract": "We present an optimization approach for linear SVMs based on a stochastic primal-dual approach, where the primal step is akin to an importance-weighted SGD, and the dual step is a stochastic update on the importance weights. This yields an optimization method with a sublinear dependence on the training set size, and the first method for learning linear SVMs with runtime less then the size of the training set required for learning!", "full_text": "Beating SGD: Learning SVMs in Sublinear Time\n\nTomer Koren\nElad Hazan\nTechnion, Israel Institute of Technology\n\n{ehazan@ie,tomerk@cs}.technion.ac.il\n\nHaifa, Israel 32000\n\nNathan Srebro\n\nToyota Technological Institute\n\nChicago, Illinois 60637\n\nnati@ttic.edu\n\nAbstract\n\nWe present an optimization approach for linear SVMs based on a stochastic\nprimal-dual approach, where the primal step is akin to an importance-weighted\nSGD, and the dual step is a stochastic update on the importance weights. This\nyields an optimization method with a sublinear dependence on the training set\nsize, and the \ufb01rst method for learning linear SVMs with runtime less then the size\nof the training set required for learning!\n\n1\n\nIntroduction\n\nStochastic approximation (online) approaches, such as stochastic gradient descent and stochastic\ndual averaging, have become the optimization method of choice for many learning problems, includ-\ning linear SVMs. This is not surprising, since such methods yield optimal generalization guarantees\nwith only a single pass over the data. They therefore in a sense have optimal, unbeatable runtime:\nfrom a learning (generalization) point of view, in a \u201cdata laden\u201d setting [2, 13], the runtime to get to\na desired generalization goal is the same as the size of the data set required to do so. 
Their runtime is therefore equal (up to a small constant factor) to the runtime required just to read the data.

In this paper we show, for the first time, how to beat this unbeatable runtime, and present a method that, in a certain relevant regime of high dimensionality, relatively low noise, and accuracy proportional to the noise level, learns in runtime less than the minimal training set size required for generalization. The key here is that, unlike online methods that consider an entire training vector at each iteration, our method accesses single features (coordinates) of training vectors. Our computational model is thus that of random access to a desired coordinate of a desired training vector (as is standard for sublinear time algorithms), and our main computational cost is that of these feature accesses. Our method can also be understood in the framework of "budgeted learning" [5], where the cost is explicitly the cost of observing features (but unlike, e.g., [8], we do not have differential costs for different features), and gives the first non-trivial guarantee in this setting (i.e., the first theoretical guarantee on a number of feature accesses that is less than simply observing entire feature vectors). We emphasize that our method is not online in nature, and we do require repeated access to training examples, but the resulting runtime (as well as the overall number of features accessed) is less (in some regimes) than for any online algorithm that considers entire training vectors. Also, unlike recent work by Cesa-Bianchi et al. 
[3], we are not constrained to only a few features from every vector, and can ask for however many we need (with the aim of minimizing the overall runtime, and thus the overall number of feature accesses); we thus obtain an overall number of feature accesses that is better than with SGD, unlike Cesa-Bianchi et al., who aim at not being too much worse than full-information SGD.

As discussed in Section 3, our method is a primal-dual method, where both the primal and dual steps are stochastic. The primal steps can be viewed as importance-weighted stochastic gradient descent, and the dual step as a stochastic update on the importance weighting, informed by the current primal solution. This approach builds on the work of [4], which presented a sublinear time algorithm for approximating the margin of a linearly separable data set. Here, we extend that work to the more relevant noisy (non-separable) setting, and show how it can be applied to a learning problem, yielding generalization runtime better than SGD. The extension to the non-separable setting is not straightforward and requires rewriting the SVM objective and applying additional relaxation techniques borrowed from [10].

2 The SVM Optimization Problem

We consider training a linear binary SVM based on a training set of n labeled points {x_i, y_i}_{i=1...n}, x_i ∈ R^d, y_i ∈ {±1}, with the data normalized such that ‖x_i‖ ≤ 1. A predictor is specified by w ∈ R^d and a bias b ∈ R. In training, we wish to minimize the empirical error, measured in terms of the average hinge loss R̂_hinge(w, b) = (1/n) Σ_{i=1}^n [1 − y_i(⟨w, x_i⟩ + b)]_+, and the norm of w. Since we do not typically know a priori how to balance the norm with the error, this is best described as an unconstrained bi-criteria optimization problem:

    min_{w∈R^d, b∈R}  ( ‖w‖ , R̂_hinge(w, b) )        (1)

A common approach to finding Pareto optimal points of (1) is to scalarize the objective as:

    min_{w∈R^d, b∈R}  R̂_hinge(w, b) + (λ/2)‖w‖²        (2)

where the multiplier λ ≥ 0 controls the trade-off between the two objectives. However, in order to apply our framework, we need to consider a different parametrization of the Pareto optimal set (the "regularization path"): instead of minimizing a trade-off between the norm and the error, we maximize the margin (equivalent to minimizing the norm) subject to a constraint on the error. This allows us to write the objective (the margin) as a minimum over all training points, a form we will later exploit. Specifically, we introduce slack variables and consider the optimization problem:

    max_{w∈R^d, b∈R, 0≤ξ_i}  min_{i∈[n]}  y_i(⟨w, x_i⟩ + b) + ξ_i    s.t.  ‖w‖ ≤ 1  and  Σ_{i=1}^n ξ_i ≤ nν        (3)

where the parameter ν controls the trade-off between desiring a large margin (low norm) and small error (low slack), and parameterizes solutions along the regularization path. This is formalized by the following Lemma, which also gives guarantees for ε-sub-optimal solutions of (3):

Lemma 2.1. For any w ≠ 0, b ∈ R, consider problem (3) with ν = R̂_hinge(w, b)/‖w‖. Let w_ε, b_ε, ξ_ε be an ε-suboptimal solution to this problem with value γ_ε, and consider the rescaled solution w̃ = w_ε/γ_ε, b̃ = b_ε/γ_ε. 
Then:

    ‖w̃‖ ≤ ‖w‖ / (1 − ‖w‖ε)    and    R̂_hinge(w̃) ≤ R̂_hinge(w) / (1 − ‖w‖ε).

That is, solving (3) exactly (to within ε = 0) yields Pareto optimal solutions of (1), and all such solutions (i.e. the entire regularization path) can be obtained by varying ν. When (3) is only solved approximately, we obtain a Pareto sub-optimal point, as quantified by Lemma 2.1.

Before proceeding, we also note that any solution of (1) that classifies at least some positive and negative points within the desired margin must have ‖w‖ ≥ 1, and so in Lemma 2.1 we will only need to consider 0 ≤ ν ≤ 1. In terms of (3), this means that we could restrict 0 ≤ ξ_i ≤ 2 without affecting the optimal solution.

3 Overview: Primal-Dual Algorithms and Our Approach

The CHW framework

The method of [4] applies to saddle-point problems of the form

    max_{z∈K} min_{i∈[n]} c_i(z)        (4)

where the c_i(z) are concave functions of z over some set K ⊆ R^d. The method is a stochastic primal-dual method, where the dual solution can be viewed as an importance weighting over the n terms c_i(z). To better understand this view, consider the equivalent problem:

    max_{z∈K} min_{p∈Δ_n} Σ_{i=1}^n p_i c_i(z)        (5)

where Δ_n = {p ∈ R^n | p_i ≥ 0, ‖p‖₁ = 1} is the probability simplex. The method maintains and (stochastically) improves both a primal solution (in our case, a predictor w ∈ R^d) and a dual solution, which is a distribution p over [n]. Roughly speaking, the distribution p is used to focus in on the terms actually affecting the minimum. Each iteration of the method proceeds as follows:

1. Stochastic primal update:
   (a) A term i ∈ [n] is chosen according to the distribution p, in time O(n).
   (b) The primal variable z is updated according to the gradient of c_i(z), via an online low-regret update. This update is in fact a Stochastic Gradient Descent (SGD) step on the objective of (5), as explained in section 4. Since we use only a single term c_i(z), this can usually be done in time O(d).

2. Stochastic dual update:
   (a) We obtain a stochastic estimate of c_i(z), for each i ∈ [n]. We would like to use an estimator that has bounded variance and can be computed in O(1) time per term, i.e. in overall O(n) time. When the c_i's are linear functions, this can be achieved using a form of ℓ₂-sampling for estimating an inner product in R^d.
   (b) The distribution p is updated toward those terms with low estimated values of c_i(z). This is accomplished using a variant of the Multiplicative Updates (MW) framework for online optimization over the simplex (see for example [1]), adapted to our case in which the updates are based on random variables with bounded variance. This can be done in time O(n).

Evidently, the overall runtime per iteration is O(n + d). In addition, the regret bounds on the updates of z and p can be used to bound the number of iterations required to reach an ε-suboptimal solution. Hence, the CHW approach is particularly effective when this regret bound has a favorable dependence on d and n. As we note below, this is not the case in our application, and we shall need some additional machinery to proceed.

The PST framework

The Plotkin-Shmoys-Tardos framework [10] is a deterministic primal-dual method, originally proposed for approximately solving certain types of linear programs known as "fractional packing and covering" problems. 
The same idea, however, applies also to saddle-point problems of the form (5). In each iteration of this method, the primal variable z is updated by solving the "simple" optimization problem max_{z∈K} Σ_{i=1}^n p_i c_i(z) (where p is now fixed), while the dual variable p is again updated using a MW step (note that we do not use an estimate of c_i(z) here, but rather the exact value). These iterations yield convergence to the optimum of (5), and the regret bound of the MW updates is used to derive a convergence rate guarantee. Since each iteration of the framework relies on the entire set of functions c_i, it is reasonable to apply it only to relatively small-sized problems. Indeed, in our application we shall use this method for the update of the slack variables ξ and the bias term b, for which the implied cost is only O(n) time.

Our hybrid approach

The saddle-point formulation (3) of SVM from section 2 suggests that the SVM optimization problem can be efficiently approximated using primal-dual methods, and specifically using the CHW framework. Indeed, taking z = (w, b, ξ) and K = B_d × [−1, 1] × Ξ_ν, where B_d ⊆ R^d is the Euclidean unit ball and Ξ_ν = {ξ ∈ R^n | ∀i, 0 ≤ ξ_i ≤ 2, ‖ξ‖₁ ≤ νn}, we cast the problem into the form (4). However, as already pointed out, a naïve application of the CHW framework yields in this case a rather slow convergence rate. Informally speaking, this is because our set K is "too large", and thus the involved regret grows too quickly.

In this work, we propose a novel hybrid approach for tackling problems such as (3), which combines the ideas of the CHW and PST frameworks. Specifically, we suggest using an SGD-like low-regret update for the variable w, while updating the variables ξ and b via a PST-like step; the dual update of our method is similar to that of CHW. 
Consequently, our algorithm enjoys the benefits of both methods, each in its respective domain, and avoids the problem originating from the "size" of K. We defer the detailed description of the method to the following section.

4 Algorithm and Analysis

In this section we present and analyze our algorithm, which we call SIMBA (which stands for "Sublinear IMportance-sampling Bi-stochastic Algorithm"). The algorithm is a sublinear-time approximation algorithm for problem (3), which, as shown in section 2, is a reformulation of the standard soft-margin SVM problem. For simplicity of presentation, we omit the bias term for now (i.e., fix b = 0 in (3)) and later explain how adding such a bias to our framework is almost immediate and does not affect the analysis. This allows us to ignore the labels y_i, by setting x_i ← −x_i for any i with y_i = −1.

Let us begin the presentation with some additional notation. To avoid confusion, we use the notation v(i) to refer to the i'th coordinate of a vector v. We also use the shorthand v² to denote the vector for which v²(i) = (v(i))² for all i. The n-vector whose entries are all 1 is denoted 1_n. 
Finally, we stack the training instances x_i as the rows of a matrix X ∈ R^{n×d}, although we treat each x_i as a column vector.

Algorithm 1 SVM-SIMBA
1: Input: ε > 0, 0 ≤ ν ≤ 1, and X ∈ R^{n×d} with x_i ∈ B_d for i ∈ [n].
2: Let T ← 100²ε⁻² log n, η ← √(log(n)/T), and u₁ ← 0, q₁ ← 1_n
3: for t = 1 to T do
4:   Choose i_t ← i with probability p_t(i)
5:   Let u_t ← u_{t−1} + x_{i_t}/√(2T), ξ_t ← argmax_{ξ∈Ξ_ν} (p_t⊤ξ)
6:   w_t ← u_t / max{1, ‖u_t‖}
7:   Choose j_t ← j with probability w_t(j)²/‖w_t‖²
8:   for i = 1 to n do
9:     ṽ_t(i) ← x_i(j_t)‖w_t‖²/w_t(j_t) + ξ_t(i)
10:    v_t(i) ← clip(ṽ_t(i), 1/η)
11:    q_{t+1}(i) ← q_t(i)(1 − ηv_t(i) + η²v_t(i)²)
12:   end for
13:   p_t ← q_t/‖q_t‖₁
14: end for
15: return w̄ = (1/T)Σ_t w_t, ξ̄ = (1/T)Σ_t ξ_t

The pseudo-code of the SIMBA algorithm is given in Algorithm 1. In the primal part (lines 4 through 6), the vector u_t is updated by adding an instance x_i randomly chosen according to the distribution p_t. This is a version of SGD applied to the function p_t⊤(Xw + ξ_t), whose gradient with respect to w is p_t⊤X; by the sampling procedure of i_t, the vector x_{i_t} is an unbiased estimator of this gradient. The vector u_t is then projected onto the unit ball to obtain w_t. On the other hand, the primal variable ξ_t is updated by a complete optimization of p_t⊤ξ with respect to ξ ∈ Ξ_ν. This is an instance of the PST framework described in section 3. 
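The iteration above can be sketched in NumPy as follows. This is a minimal illustration under our own naming, not the authors' reference implementation: the constant 100² in T is dropped so toy runs stay short, and the instances are assumed non-zero so the ℓ₂-sampling step is well defined.

```python
import numpy as np

def clip(z, c):
    """Truncate z (elementwise) to the interval [-c, c]."""
    return np.clip(z, -c, c)

def xi_update(p, nu, n):
    """Greedy PST step: assign slack 2 to the largest entries of p
    until a total budget of nu*n is spent (last entry may be fractional)."""
    xi = np.zeros(n)
    budget = nu * n
    for i in np.argsort(-p):
        take = min(2.0, budget)
        xi[i] = take
        budget -= take
        if budget <= 0:
            break
    return xi

def simba(X, nu, eps, rng=None):
    """Sketch of Algorithm 1 (SVM-SIMBA); labels assumed folded into X."""
    rng = np.random.default_rng(rng)
    n, d = X.shape
    T = int(np.ceil(eps ** -2 * np.log(n)))  # constant 100^2 dropped for the sketch
    eta = np.sqrt(np.log(n) / T)
    u, q = np.zeros(d), np.ones(n)
    p = q / q.sum()
    w_bar, xi_bar = np.zeros(d), np.zeros(n)
    for _ in range(T):
        # primal: SGD step on p^T (X w + xi), then projection onto the unit ball
        i = rng.choice(n, p=p)
        u = u + X[i] / np.sqrt(2 * T)
        xi = xi_update(p, nu, n)
        w = u / max(1.0, np.linalg.norm(u))
        # dual: l2-sample a coordinate j (zero coords are never drawn),
        # estimate X w + xi, then apply the clipped MW update
        j = rng.choice(d, p=w ** 2 / (w ** 2).sum())
        v = clip(X[:, j] * (w ** 2).sum() / w[j] + xi, 1.0 / eta)
        q = q * (1.0 - eta * v + eta ** 2 * v ** 2)
        p = q / q.sum()
        w_bar += w / T
        xi_bar += xi / T
    return w_bar, xi_bar
```

Note that with |v_t(i)| ≤ 1/η, the multiplicative factor 1 − ηv + η²v² stays strictly positive, so the weights q remain positive throughout, which is the point of the clipping.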
Note that, by the structure of Ξ_ν, this update can be accomplished using a simple greedy algorithm that sets ξ_t(i) = 2 on the largest entries p_t(i) of p_t until a total mass of νn is reached, and sets ξ_t(i) = 0 elsewhere; this can be implemented in O(n) time using standard selection algorithms.

In the dual part (lines 7 through 13), the algorithm first updates the vector q_t using the j_t'th column of X and the value of w_t(j_t), where j_t is randomly selected according to the distribution w_t²/‖w_t‖². This is a variant of the MW framework (see definition 4.1 below) applied to the function p⊤(Xw_t + ξ_t); the vector ṽ_t serves as an estimator of Xw_t + ξ_t, the gradient with respect to p. We note, however, that the algorithm uses a clipped version v_t of the estimator ṽ_t; see line 10, where we use the notation clip(z, C) = max(min(z, C), −C) for z, C ∈ R. This, in fact, makes v_t a biased estimator of the gradient. As we show in the analysis, while the clipping operation is crucial to the stability of the algorithm, the resulting slight bias is not harmful.

Before stating the main theorem, we describe in detail the MW algorithm we use for the dual update.

Definition 4.1 (Variance MW algorithm). Consider a sequence of vectors v₁, ..., v_T ∈ R^n and a parameter η > 0. The Variance Multiplicative Weights (Variance MW) algorithm is as follows: let w₁ ← 1_n, and for t ≥ 1,

    p_t ← w_t/‖w_t‖₁,    and    w_{t+1}(i) ← w_t(i)(1 − ηv_t(i) + η²v_t(i)²).        (6)

The following lemma establishes a regret bound for the Variance MW algorithm.

Lemma 4.2 (Variance MW Lemma). 
The Variance MW algorithm satisfies

    Σ_{t∈[T]} p_t⊤v_t ≤ min_{i∈[n]} Σ_{t∈[T]} max{v_t(i), −1/η} + (log n)/η + η Σ_{t∈[T]} p_t⊤v_t².

We now state the main theorem. Due to space limitations, we only give here a sketch of the proof.

Theorem 4.3 (Main). The SIMBA algorithm above returns an ε-approximate solution to formulation (3) with probability at least 1/2. It can be implemented to run in time Õ(ε⁻²(n + d)).

Proof (sketch). The main idea of the proof is to establish lower and upper bounds on the average objective value (1/T) Σ_{t∈[T]} p_t⊤(Xw_t + ξ_t). Then, combining these bounds, we are able to relate the value of the output solution (w̄, ξ̄) to the value of the optimum of (3). In the following, we let (w*, ξ*) be the optimal solution of (3) and denote the value of this optimum by γ*.

For the lower bound, we consider the primal part of the algorithm. Noting that Σ_{t∈[T]} p_t⊤ξ_t ≥ Σ_{t∈[T]} p_t⊤ξ* (which follows from the PST step) and employing a standard regret guarantee for bounding the regret of the SGD update, we obtain the lower bound (with probability ≥ 1 − O(1/n)):

    (1/T) Σ_{t∈[T]} p_t⊤(Xw_t + ξ_t) ≥ γ* − O(√(log n / T)).

For the upper bound, we examine the dual part of the algorithm. 
Applying lemma 4.2 to bound the regret of the MW update, we get the following upper bound (with probability > 3/4 − O(1/n)):

    (1/T) Σ_{t∈[T]} p_t⊤(Xw_t + ξ_t) ≤ (1/T) min_{i∈[n]} Σ_{t∈[T]} [x_i⊤w_t + ξ_t(i)] + O(√(log n / T)).

Relating the two bounds, we conclude that min_{i∈[n]} [x_i⊤w̄ + ξ̄(i)] ≥ γ* − O(√(log(n)/T)) with probability ≥ 1/2, and using our choice for T the claim follows.

Finally, we note the runtime. The algorithm makes T = O(ε⁻² log n) iterations. In each iteration, the update of the vectors w_t and p_t takes O(d) and O(n) time respectively, while ξ_t can be computed in O(n) time as explained above. The overall runtime is therefore Õ(ε⁻²(n + d)).

Incorporating a bias term. We return to the optimization problem (3) presented in section 2, and show how the bias term b can be integrated into our algorithm. Unlike with SGD-based approaches, including the bias term in our framework is straightforward. The only modification required to our algorithm as presented in Algorithm 1 occurs in lines 5 and 9, where the vector ξ_t is referred to. For additionally maintaining a bias b_t, we change the optimization over ξ in line 5 to a joint optimization over both ξ and b:

    (ξ_t, b_t) ← argmax_{ξ∈Ξ_ν, b∈[−1,1]} p_t⊤(ξ + b·y)

and use the computed b_t for the dual update in line 9: ṽ_t(i) ← x_i(j_t)‖w_t‖²/w_t(j_t) + ξ_t(i) + y_i b_t, while returning the average bias b̄ = Σ_{t∈[T]} b_t/T in the output of the algorithm. Notice that we still assume that the labels y_i were subsumed into the instances x_i, as in section 4. The update of ξ_t is thus unchanged and can be carried out as described in section 4. 
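A minimal sketch of this joint PST step (the function name and vectorized form are ours): the slack part is the greedy selection over Ξ_ν described earlier, and since p⊤(ξ + b·y) is linear in b, the maximization over b ∈ [−1, 1] is attained at an endpoint.

```python
import numpy as np

def joint_update(p, y, nu):
    """Sketch of the joint (xi, b) step: maximize p^T (xi + b*y)
    over xi in the slack polytope (0 <= xi_i <= 2, sum xi <= nu*n)
    and b in [-1, 1]. Assumes 0 <= nu <= 1."""
    n = p.shape[0]
    xi = np.zeros(n)
    order = np.argsort(-p)            # indices sorted by decreasing dual weight
    budget = nu * n
    full = int(budget // 2)           # entries that receive the full slack of 2
    xi[order[:full]] = 2.0
    if full < n:
        xi[order[full]] = budget - 2.0 * full   # fractional remainder
    # p^T (b*y) is linear in b, so the maximum over [-1, 1] is at an endpoint
    b = float(np.sign(p @ y))
    return xi, b
```

This matches the O(n) cost claimed for the slack update; a single pass (via selection rather than a full sort) suffices in principle.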
The update of b_t, on the other hand, admits a simple, closed-form formula: b_t = sign(p_t⊤y). Evidently, the running time of each iteration remains O(n + d), as before. The adaptation of the analysis to this case, which involves only a change of constants, is technical and straightforward.

The sparse case. We conclude the section with a short discussion of the common situation in which the instances are sparse, that is, each instance contains very few non-zero entries. In this case, we can implement algorithm 1 so that each iteration takes Õ(α(n + d)), where α is the overall data sparsity ratio. Implementing the vector updates is straightforward, using a data representation similar to [12]. In order to implement the sampling operations in time O(log n) and O(log d), we maintain a tree over the points and coordinates, with internal nodes caching the combined (unnormalized) probability mass of their descendants.

5 Runtime Analysis for Learning

In Section 4 we saw how to obtain an ε-approximate solution to the optimization problem (3) in time Õ(ε⁻²(n + d)). Combining this with Lemma 2.1, we see that for any Pareto optimal point w* of (1) with ‖w*‖ = B and R̂_hinge(w*) = R̂*, the runtime required for our method to find a predictor with ‖w‖ ≤ 2B and R̂_hinge(w) ≤ R̂* + δ̂ is

    Õ( B²(n + d) ((R̂* + δ̂)/δ̂)² ).        (7)

This guarantee is rather different from the guarantees for other SVM optimization approaches. E.g., using a stochastic gradient descent (SGD) approach, we could find a predictor with ‖w‖ ≤ B and R̂_hinge(w) ≤ R̂* + δ̂ in time O(B²d/δ̂²). 
Compared with SGD, we only ensure a constant factor approximation to the norm, and our runtime does depend on the training set size n, but the dependence on δ̂ is more favorable. This makes it difficult to compare the guarantees, and suggests a different form of comparison is needed. Following [13], instead of comparing the runtime to achieve a certain optimization accuracy on the empirical optimization problem, we analyze the runtime to achieve a desired generalization performance.

Recall that our true learning objective is to find a predictor with low generalization error R_err(w) = Pr_{(x,y)}(y⟨w, x⟩ ≤ 0), where x, y are distributed according to some unknown source distribution, and the training set is drawn i.i.d. from this distribution. We assume that there exists some (unknown) predictor w* that has norm ‖w*‖ ≤ B and low expected hinge loss R* = R_hinge(w*) = E[[1 − y⟨w*, x⟩]_+], and analyze the runtime to find a predictor w with generalization error R_err(w) ≤ R* + δ.

In order to understand the runtime from this perspective, we must consider the required sample size to obtain generalization to within δ, as well as the required suboptimality for ‖w‖ and R̂_hinge(w). The standard SVM analysis calls for a sample size of n = O(B²/δ²). But since, as we will see, our analysis is sensitive to the value of R*, we will consider a more refined generalization guarantee, which gives a better rate when R* is small relative to δ. 
Following Theorem 5 of [14] (and recalling that the hinge loss is an upper bound on margin violations), we have that with high probability over a sample of size n, for all predictors w:

    R_err(w) ≤ R̂_hinge(w) + O( √(‖w‖² R̂_hinge(w)/n) + ‖w‖²/n ).        (8)

This implies that a training set of size

    n = Õ( (B²/δ) · (R* + δ)/δ )        (9)

is enough for generalization to within δ. We will be mostly concerned here with the regime where R* is small and we seek generalization to within δ = Ω(R*), a typical regime in learning. This is always the case in the realizable setting, where R* = 0, but it includes also the non-realizable setting, as long as the desired estimation error δ is not much smaller than the unavoidable error R*. In such a regime, the second factor in (9) is of order one.

In fact, an online approach¹ can find a predictor with R_err(w) ≤ R* + δ with a single pass over n = Õ(B²/δ · (δ + R*)/δ) training points. Since each step takes O(d) time (essentially the time required to read the training point), the overall runtime is:

    O( (B²/δ) · d · (R* + δ)/δ ).        (10)

¹The Perceptron rule, which amounts to SGD on R_hinge(w), ignoring correctly classified points [7, 3].

Returning to our approach, approximating the norm to within a factor of two is fine, as it only affects the required sample size, and hence the runtime, by a constant factor. 
In particular, in order to ensure R_err(w) ≤ R* + δ, it is enough to have ‖w‖ ≤ 2B, optimize the empirical hinge loss to within δ̂ = δ/2, and use a sample size as specified in (9) (where we actually consider a radius of 2B and require generalization to within δ/4, but this is subsumed in the constant factors). Plugging this into the runtime analysis (7) yields:

Corollary 5.1. For any B ≥ 1 and δ > 0, with high probability over a training set of size n = Õ(B²/δ · (δ + R*)/δ), Algorithm 1 outputs a predictor w with R_err(w) ≤ R* + δ in time

    Õ( (B²d + (B⁴/δ) · (δ + R*)/δ) · ((δ + R*)/δ)² )

where R* = inf_{‖w*‖≤B} R_hinge(w*).

Let us compare the above runtime to the online runtime (10), focusing on the regime where R* is small and δ = Ω(R*), so that (R* + δ)/δ = O(1), and ignoring the logarithmic factors hidden in the Õ(·) notation in Corollary 5.1. To do so, we will first rewrite the runtime in Corollary 5.1 as:

    Õ( d · ((R* + δ)/δ) · (B²/δ) · (R* + δ)  +  B² · (B²/δ) · ((R* + δ)/δ)³ ).        (11)

In order to compare the runtimes, we must consider the relative magnitudes of the dimensionality d and the norm B. Recall that using a norm-regularized approach, such as SVM, makes sense only when d ≫ B². Otherwise, the low dimensionality would guarantee us good generalization, and we wouldn't gain anything from regularizing the norm. And so, at least when (R* + δ)/δ = O(1), the first term in (11) is the dominant term and we should compare it with (10). More generally, we will see an improvement as long as d ≫ B²((R* + δ)/δ)².

Now, the first term in (11) is more directly comparable to the online runtime (10), and is always smaller, by a factor of (R* + δ) ≤ 1. This factor, then, is the improvement over the online approach, or more generally, over any approach that considers entire sample vectors (as opposed to individual features). We see, then, that our proposed approach can yield a significant reduction in runtime when the resulting error rate is small. Taking into account the hidden logarithmic factors, we get an improvement as long as (R* + δ) = O(1/log(B²/δ)).

Returning to the form of the runtime in Corollary 5.1, we can also understand the runtime as follows: initially, a runtime of O(B²d) is required in order for the estimates of w and p to start being reasonable. However, this runtime does not depend on the desired error (as long as δ = Ω(R*), including when R* = 0), and after this initial runtime investment, once w and p are "reasonable", we can continue decreasing the error toward R* with a runtime that depends only on the norm and is independent of the dimensionality.

6 Experiments

In this section we present preliminary experimental results that demonstrate situations in which our approach has an advantage over SGD-based methods. To this end, we compare the performance of our algorithm to that of the state-of-the-art Pegasos algorithm [12], a popular SGD variant for solving SVM. The experiments were performed with two standard, large-scale data sets:

• The news20 data set of [9], which has 1,355,191 features and 19,996 examples. We split the data set into a training set of 8,000 examples and a test set of 11,996 examples.
• The real vs. simulated data set of McCallum, with 20,958 features and 72,309 examples. We split the data set into a training set of 20,000 examples and a test set of 52,309 examples.

We implemented the SIMBA algorithm exactly as in Section 4, with a single modification: we used a time-adaptive learning rate η_t = √(log(n)/t), and a similarly adaptive SGD step-size (in line 5), instead of leaving them constant. While this version of the algorithm is more convenient to work with, we found that in practice its performance is almost equivalent to that of the original algorithm. In both experiments, we tuned the tradeoff parameter of each algorithm (i.e., ν and λ) so as to obtain the lowest possible error over the test set. Note that our algorithm assumes random access to features (as opposed to instances), so it is not meaningful to compare the test error as a function of the number of iterations of each algorithm. Instead, and in accordance with our computational model, we compare the test error as a function of the number of feature accesses of each algorithm. The results, averaged over 10 repetitions, are presented in figure 1 along with the parameters we used. 

[Figure 1 appeared here: two panels plotting test error against the number of feature accesses, comparing SIMBA (ν = 1·10⁻³ and ν = 5·10⁻⁵) with Pegasos (λ = 1.25·10⁻⁴ and λ = 5·10⁻⁵).] Figure 1: The test error, averaged over 10 repetitions, vs. the number of feature accesses, on the real vs. simulated (left) and news20 (right) data sets. The error bars depict one standard deviation of the measurements.
As can be seen from the graphs, on both data sets our algorithm attains the test error that Pegasos achieves at its optimum using about 100 times fewer feature accesses.

7 Summary

Building on ideas first introduced by [4], we present a stochastic-primal-stochastic-dual approach that solves a non-separable linear SVM optimization problem in sublinear time, and yields a learning method that, in a certain regime, beats SGD and runs in less time than the size of the training set required for learning. We also showed some encouraging preliminary experiments, and we expect further work can yield significant gains, either by improving our method, or by borrowing from the ideas and innovations introduced, including:

• Using importance weighting, and stochastically updating the importance weights in a dual stochastic step.
• Explicitly introducing the slack variables (which are not typically represented in primal SGD approaches). This allows us to differentiate between accounted-for margin mistakes and constraint violations to which we have not yet assigned enough "slack" and on which we want to focus our attention. This differs from heuristic importance-weighting approaches for stochastic learning, which tend to focus on all samples with a non-zero loss gradient.
• Employing the PST methodology when the standard low-regret tools fail to apply.

We believe that our ideas and framework can also be applied to more complex situations where much computational effort is currently being spent, including highly multiclass and structured SVMs, latent SVMs [6], and situations where features are very expensive to calculate but can be calculated on demand. The ideas can also be extended to kernels, either through linearization [11], using an implicit linearization as in [4], or through a representation approach.
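To make the first point above concrete, a multiplicative-weights step on the importance weights can be sketched as follows. This is a schematic illustration under simplifying assumptions (full, exact losses and a hypothetical step size `eta`), not the actual SIMBA dual update, which works with stochastic estimates of the losses:

```python
import numpy as np

def multiplicative_weights_update(p, losses, eta=0.1):
    """One multiplicative-weights step on an importance distribution p
    over training examples: examples with larger loss are up-weighted,
    so the primal (SGD-like) step samples them more often. Schematic
    only -- the actual dual step uses stochastic loss estimates."""
    w = p * np.exp(eta * losses)   # up-weight high-loss examples
    return w / w.sum()             # renormalize to a distribution

p = np.full(4, 0.25)                       # start uniform over 4 examples
losses = np.array([0.0, 0.0, 0.0, 1.0])    # only the last example suffers loss
p = multiplicative_weights_update(p, losses)
assert p[3] > 0.25 and abs(p.sum() - 1.0) < 1e-12
```

Iterating this update concentrates the sampling distribution on the hard examples, which is exactly the role the dual player serves in the primal-dual scheme.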
Beyond SVMs, the framework can apply more broadly, whenever we have a low-regret method for the primal problem and a sampling procedure for the dual updates. For example, we expect the approach to be successful for ℓ1-regularized problems, and are working in this direction.

Acknowledgments This work was supported in part by the IST Programme of the European Community, under the PASCAL2 Network of Excellence, IST-2007-216886. This publication only reflects the authors' views.

References
[1] S. Arora, E. Hazan, and S. Kale. The multiplicative weights update method: a meta algorithm and applications. Manuscript, 2005.
[2] L. Bottou and O. Bousquet. The tradeoffs of large scale learning. Advances in Neural Information Processing Systems, 20:161–168, 2008.
[3] N. Cesa-Bianchi, A. Conconi, and C. Gentile. On the generalization ability of on-line learning algorithms. IEEE Transactions on Information Theory, 50(9):2050–2057, 2004.
[4] K. L. Clarkson, E. Hazan, and D. P. Woodruff. Sublinear optimization for machine learning. In 2010 IEEE 51st Annual Symposium on Foundations of Computer Science, pages 449–457. IEEE, 2010.
[5] K. Deng, C. Bourke, S. Scott, J. Sunderman, and Y. Zheng. Bandit-based algorithms for budgeted learning. In Seventh IEEE International Conference on Data Mining (ICDM 2007), pages 463–468. IEEE, 2007.
[6] P. Felzenszwalb, D. McAllester, and D. Ramanan. A discriminatively trained, multiscale, deformable part model. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2008), 2008.
[7] C. Gentile. The robustness of the p-norm algorithms. Machine Learning, 53(3):265–299, 2003.
[8] A. Kapoor and R. Greiner. Learning and classifying under hard budgets. In Machine Learning: ECML 2005, pages 170–181, 2005.
[9] S. S. Keerthi and D. DeCoste. A modified finite Newton method for fast solution of large scale linear SVMs.
Journal of Machine Learning Research, 6:341–361, 2005.
[10] S. A. Plotkin, D. B. Shmoys, and É. Tardos. Fast approximation algorithms for fractional packing and covering problems. In Proceedings of the 32nd Annual Symposium on Foundations of Computer Science, pages 495–504. IEEE Computer Society, 1991.
[11] A. Rahimi and B. Recht. Random features for large-scale kernel machines. Advances in Neural Information Processing Systems, 20:1177–1184, 2008.
[12] S. Shalev-Shwartz, Y. Singer, and N. Srebro. Pegasos: Primal estimated sub-gradient solver for SVM. In Proceedings of the 24th International Conference on Machine Learning, pages 807–814. ACM, 2007.
[13] S. Shalev-Shwartz and N. Srebro. SVM optimization: inverse dependence on training set size. In Proceedings of the 25th International Conference on Machine Learning, pages 928–935, 2008.
[14] N. Srebro, K. Sridharan, and A. Tewari. Smoothness, low noise and fast rates. In Advances in Neural Information Processing Systems 23, pages 2199–2207, 2010.