{"title": "Sparse Online Learning via Truncated Gradient", "book": "Advances in Neural Information Processing Systems", "page_first": 905, "page_last": 912, "abstract": "We propose a general method called truncated gradient to induce sparsity in the weights of online-learning algorithms with convex loss. This method has several essential properties. First, the degree of sparsity is continuous---a parameter controls the rate of sparsification from no sparsification to total sparsification. Second, the approach is theoretically motivated, and an instance of it can be regarded as an online counterpart of the popular $L_1$-regularization method in the batch setting. We prove that small rates of sparsification result in only small additional regret with respect to typical online-learning guarantees. Finally, the approach works well empirically. We apply it to several datasets and find that for datasets with large numbers of features, substantial sparsity is discoverable.", "full_text": "Sparse Online Learning via Truncated Gradient\n\nJohn Langford\nYahoo! Research\n\njl@yahoo-inc.com\n\nLihong Li\n\nDepartment of Computer Science\n\nRutgers University\n\nTong Zhang\n\nDepartment of Statistics\n\nRutgers University\n\nlihong@cs.rutgers.edu\n\ntongz@rci.rutgers.edu\n\nAbstract\n\nWe propose a general method called truncated gradient to induce sparsity in the\nweights of online-learning algorithms with convex loss. This method has several\nessential properties. First, the degree of sparsity is continuous(cid:151)a parameter con-\ntrols the rate of sparsi(cid:2)cation from no sparsi(cid:2)cation to total sparsi(cid:2)cation. Second,\nthe approach is theoretically motivated, and an instance of it can be regarded as\nan online counterpart of the popular L1-regularization method in the batch set-\nting. We prove small rates of sparsi(cid:2)cation result in only small additional regret\nwith respect to typical online-learning guarantees. 
Finally, the approach works well empirically. We apply it to several datasets and find that for datasets with large numbers of features, substantial sparsity is discoverable.

1 Introduction

We are concerned with machine learning over large datasets. As an example, the largest dataset we use in this paper has over 10^7 sparse examples and 10^9 features using about 10^11 bytes. In this setting, many common approaches fail, simply because they cannot load the dataset into memory or they are not sufficiently efficient. There are roughly two approaches which can work: one is to parallelize a batch learning algorithm over many machines (e.g., [3]); the other is to stream the examples to an online-learning algorithm (e.g., [2, 6]). This paper focuses on the second approach.

Typical online-learning algorithms have at least one weight for every feature, which is too expensive in some applications for a couple of reasons. The first is space constraints: if the state of the online-learning algorithm overflows RAM, it cannot run efficiently. A similar problem occurs if the state overflows the L2 cache. The second is test-time constraints: reducing the number of features can significantly reduce the computational time needed to evaluate a new sample.

This paper addresses the problem of inducing sparsity in learned weights while using an online-learning algorithm. Natural solutions do not work for our problem. For example, simply adding L1 regularization to the gradient of an online weight update, or simply rounding small weights to zero, are both problematic. However, these two ideas are closely related to the algorithm we propose, and more detailed discussions are found in section 3. A third solution is black-box wrapper approaches which eliminate features and test the impact of the elimination.
These approaches typically run an algorithm many times, which is particularly undesirable with large datasets.

Similar problems have been considered in various settings before. The Lasso algorithm [12] is commonly used to achieve L1 regularization for linear regression. This algorithm does not work automatically in an online fashion. There are two formulations of L1 regularization. Consider a loss function L(w, z_i) which is convex in w, where z_i = (x_i, y_i) is an input-output pair. One is the convex-constraint formulation

  ŵ = arg min_w Σ_{i=1}^n L(w, z_i)  subject to ‖w‖_1 ≤ s,   (1)

where s is a tunable parameter. The other is soft regularization with a tunable parameter g:

  ŵ = arg min_w Σ_{i=1}^n L(w, z_i) + g‖w‖_1.   (2)

With appropriately chosen g, the two formulations are equivalent. The convex-constraint formulation has a simple online version using the projection idea in [14]. It requires projecting the weight w onto an L1 ball at every online step. This operation is difficult to implement efficiently for large-scale data with many features even if all features are sparse, although important progress was made recently so that the complexity is logarithmic in the number of features [5]. In contrast, the soft-regularization formulation (2) is efficient in a batch setting [8], so we pursue it here in an online setting where it has complexity independent of the number of features. In addition to the L1-regularization formulation (2), the family of online-learning algorithms we consider also includes some non-convex sparsification techniques.

The Forgetron [4] is an online-learning algorithm that manages memory use. It operates by decaying the weights on previous examples and then rounding these weights to zero when they become small. The Forgetron is stated for kernelized online algorithms, while we are concerned with the simple linear setting.
When applied to a linear kernel, the Forgetron is not computationally or space competitive with approaches operating directly on feature weights.

At a high level, our approach is weight decay to a default value. This simple method enjoys strong performance guarantees (section 3). For instance, the algorithm never performs much worse than a standard online-learning algorithm, and the additional loss due to sparsification is controlled continuously by a single real-valued parameter. The theory gives a family of algorithms with convex loss functions for inducing sparsity---one per online-learning algorithm. We instantiate this for square loss in section 4 and show how this algorithm can be implemented efficiently in large-scale problems with sparse features. For such problems, truncated gradient enjoys the following properties: (i) It is computationally efficient: the number of operations per online step is linear in the number of nonzero features, and independent of the total number of features; (ii) It is memory efficient: it maintains a list of active features, and can insert (when the corresponding weight becomes nonzero) and delete (when the corresponding weight becomes zero) features dynamically.

Theoretical results stating how much sparsity is achieved using this method generally require additional assumptions which may or may not be met in practice. Consequently, we rely on experiments in section 5 to show that truncated gradient achieves good sparsity in practice. We compare truncated gradient to a few other algorithms on small datasets, including the Lasso, online rounding of coefficients to zero, and L1-regularized subgradient descent. Details of these algorithms are given in section 3.

2 Online Learning with Stochastic Gradient Descent

We are interested in standard sequential prediction problems where, for i = 1, 2, ...:

1. An unlabeled example x_i arrives.
2. We make a prediction ŷ_i based on the current weights w_i = [w_i^1, ..., w_i^d] ∈ R^d.
3. We observe y_i, let z_i = (x_i, y_i), and incur some known loss L(w_i, z_i) convex in w_i.
4. We update the weights according to some rule: w_{i+1} ← f(w_i).

We want an update rule f that allows us to bound the sum of losses, Σ_{i=1}^t L(w_i, z_i), while also achieving sparsity. For this purpose, we start with the standard stochastic gradient descent (SGD) rule, which is of the form

  f(w_i) = w_i − η ∇_1 L(w_i, z_i),   (3)

where ∇_1 L(a, b) is a subgradient of L(a, b) with respect to the first variable a. The parameter η > 0 is often referred to as the learning rate. In the analysis, we only consider a constant learning rate, for simplicity. In theory, it might be desirable to have a decaying learning rate η_i which becomes smaller as i increases, to obtain the so-called no-regret bound without knowing T in advance. However, if T is known in advance, one can select a constant η accordingly so that the regret vanishes as T → ∞. Since the focus of the present paper is on weight sparsity rather than on choosing the learning rate, we use a constant learning rate in the analysis because it leads to simpler bounds.

The above method has been widely used in online learning (e.g., [2, 6]). Moreover, it is argued to be efficient even for solving batch problems, where we repeatedly run the online algorithm over the training data multiple times. For example, the idea has been successfully applied to solve large-scale standard SVM formulations [10, 13]. In the scenario outlined in the introduction, online-learning methods are more suitable than some traditional batch learning methods. However, the learning rule (3) itself does not achieve sparsity in the weights, which we address in this paper. Note that variants of SGD exist in the literature, such as exponentiated gradient descent (EG) [6]. Since our focus is sparsity, not SGD vs.
EG, we shall only consider modifications of (3), for simplicity.

3 Sparse Online Learning

In this section, we first examine three methods for achieving sparsity in online learning, including a novel algorithm called truncated gradient. As we shall see, all these ideas are closely related. Then, we provide theoretical justifications for this algorithm, including a general regret bound and a fundamental connection to the Lasso.

3.1 Simple Coefficient Rounding

In order to achieve sparsity, the most natural method is to round small coefficients (whose magnitudes are below a threshold θ > 0) to zero after every K online steps. That is, if i/K is not an integer, we use the standard SGD rule (3); if i/K is an integer, we modify the rule as

  f(w_i) = T_0(w_i − η ∇_1 L(w_i, z_i), θ),   (4)

where θ ≥ 0 is a threshold, T_0(v, θ) = [T_0(v_1, θ), ..., T_0(v_d, θ)] for a vector v = [v_1, ..., v_d] ∈ R^d, T_0(v_j, θ) = v_j I(|v_j| > θ), and I(·) is the indicator function. In other words, we first perform a standard stochastic gradient descent step, and then round small coefficients to zero. The effect is to remove nonzero but small weights.

In general, we should not take K = 1, especially when η is small, since in each step w_i is modified by only a small amount. If a coefficient is zero, it remains small after one online update, and the rounding operation pulls it back to zero. Consequently, rounding should be done only after every K steps (with a reasonably large K); in this case, nonzero coefficients have sufficient time to go above the threshold θ. However, if K is too large, then in the training stage we must keep many more nonzero features in the intermediate steps before they are rounded to zero.
In the extreme case, we may simply round the coefficients at the end, which does not solve the storage problem in the training phase at all. The sensitivity in choosing an appropriate K is a main drawback of this method; another drawback is the lack of a theoretical guarantee for its online performance. These issues motivate us to consider more principled solutions.

3.2 L1-Regularized Subgradient

In the experiments, we also combined rounding-at-the-end-of-training with a simple online subgradient method for L1 regularization with a regularization parameter g > 0:

  f(w_i) = w_i − η ∇_1 L(w_i, z_i) − η g sgn(w_i),   (5)

where for a vector v = [v_1, ..., v_d], sgn(v) = [sgn(v_1), ..., sgn(v_d)], with sgn(v_j) = 1 if v_j > 0, sgn(v_j) = −1 if v_j < 0, and sgn(v_j) = 0 if v_j = 0. In the experiments, the online method (5) with rounding at the end is used as a simple baseline. Notice that this method does not produce sparse weights online, simply because only in very rare cases do two floats add up to exactly 0. Therefore, it is not feasible for large-scale problems in which we cannot keep all features in memory.

3.3 Truncated Gradient

In order to obtain an online version of the simple rounding rule in (4), we observe that direct rounding to zero is too aggressive. A less aggressive version is to shrink the coefficient toward zero by a smaller amount.
We call this idea truncated gradient, where the amount of shrinkage is controlled by a gravity parameter g_i > 0:

  f(w_i) = T_1(w_i − η ∇_1 L(w_i, z_i), η g_i, θ),   (6)

where for a vector v = [v_1, ..., v_d] ∈ R^d and a scalar α ≥ 0, T_1(v, α, θ) = [T_1(v_1, α, θ), ..., T_1(v_d, α, θ)], with

  T_1(v_j, α, θ) = max(0, v_j − α) if v_j ∈ [0, θ];  min(0, v_j + α) if v_j ∈ [−θ, 0];  v_j otherwise.

Again, the truncation can be performed every K online steps. That is, if i/K is not an integer, we let g_i = 0; if i/K is an integer, we let g_i = Kg for a gravity parameter g > 0. The reason for doing so (instead of using a constant g_i = g) is that we can perform a more aggressive truncation with gravity parameter Kg after each K steps. This can potentially lead to better sparsity. We also note that when ηKg ≥ θ, truncated gradient coincides with (4). But in practice, as is also verified by the theory, one should adopt a small g; hence, the new learning rule (6) is expected to differ from (4).

In general, the larger the parameters g and θ are, the more sparsity is expected. Due to the extra truncation T_1, this method can lead to sparse solutions, which is confirmed empirically in section 5. A special case, which we use in the experiments, is to let g = θ in (6). In this case, we can use only one parameter g to control sparsity. Since ηKg ≪ θ when ηK is small, the truncation operation is less aggressive than the rounding in (4). At first sight, the procedure appears to be an ad hoc way to fix (4). However, we can establish a regret bound (in the next subsection) for this method, showing it is theoretically sound.
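Concretely, the operator T_1 and the update rule (6) can be sketched in a few lines of Python/NumPy (a minimal illustration with dense vectors; the function names are ours, not part of any released implementation):

```python
import numpy as np

def T1(v, alpha, theta):
    """Truncation operator of (6): components with |v_j| <= theta are
    shrunk toward zero by alpha (clipped at zero); larger ones are kept."""
    out = v.copy()
    small = np.abs(v) <= theta
    out[small] = np.sign(v[small]) * np.maximum(0.0, np.abs(v[small]) - alpha)
    return out

def truncated_gradient_step(w, grad, eta, g, theta, i, K):
    """One step of rule (6): an SGD step, followed on every K-th step by a
    truncation with gravity K*g."""
    w = w - eta * grad
    if i % K == 0:
        w = T1(w, eta * K * g, theta)
    return w
```

With theta = np.inf every component shrinks at every truncation (the special case discussed below), and with g = 0 the rule reduces to plain SGD (3).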
Therefore, it can be regarded as a principled variant of rounding.

Another important special case of (6) is setting θ = ∞, in which all weight components shrink in every online step. This method is a modification of the L1-regularized subgradient descent rule (5). The parameter g_i ≥ 0 controls the sparsity achieved by the algorithm, and setting g_i = 0 gives exactly the standard SGD rule (3). As we show in section 3.5, this special case of truncated gradient can be regarded as an online counterpart of L1 regularization, since it approximately solves an L1-regularization problem in the limit η → 0. We also show that the prediction performance of truncated gradient, measured by total loss, is comparable to that of standard stochastic gradient descent, while introducing sparse weight vectors.

3.4 Regret Analysis

Throughout the paper, we use ‖·‖_1 for the 1-norm and ‖·‖ for the 2-norm. For reference, we make the following assumption regarding the loss function:

Assumption 3.1 We assume L(w, z) is convex in w, and there exist non-negative constants A and B such that ‖∇_1 L(w, z)‖^2 ≤ A L(w, z) + B for all w ∈ R^d and z ∈ R^{d+1}.

For linear prediction problems, we have a general loss function of the form L(w, z) = φ(w^T x, y). The following are some common loss functions φ(·, ·) with corresponding choices of the parameters A and B (which are not unique), under the assumption sup_x ‖x‖ ≤ C. All of them can be used for binary classification, where y ∈ ±1, but the last one is more often used in regression, where y ∈ R: Logistic: φ(p, y) = ln(1 + exp(−py)), with A = 0 and B = C^2; SVM (hinge loss): φ(p, y) = max(0, 1 − py), with A = 0 and B = C^2; Least squares (square loss): φ(p, y) = (p − y)^2, with A = 4C^2 and B = 0.

The main result is Theorem 3.1, which is parameterized by A and B.
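As a quick sanity check on these constants (our own illustration, not from the paper), one can verify numerically that the square-loss pair A = 4C^2, B = 0 satisfies Assumption 3.1 whenever ‖x‖ ≤ C:

```python
import numpy as np

rng = np.random.default_rng(0)
C = 2.0                 # assumed bound on ||x||
A, B = 4 * C ** 2, 0.0  # constants for square loss

for _ in range(1000):
    x = rng.normal(size=5)
    x *= C * rng.random() / np.linalg.norm(x)  # rescale so that ||x|| <= C
    w, y = rng.normal(size=5), rng.normal()
    p = w @ x
    loss = (p - y) ** 2         # L(w, z) = (w^T x - y)^2
    grad = 2 * (p - y) * x      # gradient with respect to w
    # Assumption 3.1: ||grad||^2 <= A * L(w, z) + B
    assert grad @ grad <= A * loss + B + 1e-9
```

The inequality is tight when ‖x‖ = C, since ‖grad‖^2 = 4(p − y)^2 ‖x‖^2. For logistic and hinge loss the (sub)gradient norm is simply bounded by C, giving A = 0 and B = C^2.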
The proof will be provided in a longer paper.

Theorem 3.1 (Sparse Online Regret) Consider the sparse online update rule (6) with w_1 = [0, ..., 0] and η > 0. If Assumption 3.1 holds, then for all w̄ ∈ R^d we have

  ((1 − 0.5Aη)/T) Σ_{i=1}^T [ L(w_i, z_i) + (g_i/(1 − 0.5Aη)) ‖w_{i+1} · I(|w_{i+1}| ≤ θ)‖_1 ]
  ≤ (η/2) B + ‖w̄‖^2/(2ηT) + (1/T) Σ_{i=1}^T [ L(w̄, z_i) + g_i ‖w̄ · I(|w_{i+1}| ≤ θ)‖_1 ],

where ‖v · I(|v′| ≤ θ)‖_1 = Σ_{j=1}^d |v_j| I(|v′_j| ≤ θ) for vectors v = [v_1, ..., v_d] and v′ = [v′_1, ..., v′_d].

The theorem is stated with a constant learning rate η. As mentioned earlier, it is possible to obtain a result with a variable learning rate where η = η_i decays as i increases. Although this may lead to a no-regret bound without knowing T in advance, it introduces extra complexity to the presentation of the main idea. Since the focus is on sparsity rather than on optimizing the learning rate, we do not include such a result, for clarity. If T is known in advance, then in the above bound one can simply take η = O(1/√T), and the regret is of order O(1/√T).

In the above theorem, the right-hand side involves a term g_i ‖w̄ · I(|w_{i+1}| ≤ θ)‖_1 that depends on w_{i+1}, which is not easily estimated. To remove this dependency, a trivial upper bound of θ = ∞ can be used, leading to the L1 penalty g_i ‖w̄‖_1. In the general case of θ < ∞, we cannot remove the w_{i+1} dependency, because the effective regularization condition (as shown on the left-hand side) is the non-convex penalty g_i ‖w · I(|w| ≤ θ)‖_1. Solving such a non-convex formulation is hard both in the online and batch settings. In general, we only know how to efficiently discover a local minimum, which is difficult to characterize.
Without a good characterization of the local minimum, it is not possible for us to replace g_i ‖w̄ · I(|w_{i+1}| ≤ θ)‖_1 on the right-hand side by g_i ‖w̄ · I(|w̄| ≤ θ)‖_1, because such a formulation would imply that we could efficiently solve a non-convex problem with a simple online update rule. Still, when θ < ∞, one naturally expects the right-hand side penalty g_i ‖w̄ · I(|w_{i+1}| ≤ θ)‖_1 to be much smaller than the corresponding L1 penalty g_i ‖w̄‖_1, especially when w̄ has many components close to 0. Therefore the situation with θ < ∞ can potentially yield better performance on some data.

Theorem 3.1 also implies a tradeoff between sparsity and regret performance. We may simply consider the case where g_i = g is a constant. When g is small, we have less sparsity, but the regret term g‖w̄ · I(|w_{i+1}| ≤ θ)‖_1 ≤ g‖w̄‖_1 on the right-hand side is also small. When g is large, we are able to achieve more sparsity, but the regret term g‖w̄ · I(|w_{i+1}| ≤ θ)‖_1 on the right-hand side also becomes large. Such a tradeoff between sparsity and prediction accuracy is empirically studied in section 5, where we achieve significant sparsity with only a small g (and thus a small decrease in performance).

Now consider the case θ = ∞ and g_i = g. When T → ∞, if we let η → 0 and ηT → ∞, then

  (1/T) Σ_{i=1}^T [ L(w_i, z_i) + g‖w_i‖_1 ] ≤ inf_{w̄ ∈ R^d} [ (1/T) Σ_{i=1}^T L(w̄, z_i) + 2g‖w̄‖_1 ] + o(1)

follows from Theorem 3.1. In other words, if we let L′(w, z) = L(w, z) + g‖w‖_1 be the L1-regularized loss, then the L1-regularized regret is small when η → 0 and T → ∞. This implies truncated gradient can be regarded as the online counterpart of L1-regularization methods.
In the stochastic setting where the examples are drawn i.i.d. from some underlying distribution, the sparse online-gradient method proposed in this paper approximately solves the L1-regularization problem.

3.5 Stochastic Setting

SGD-based online-learning methods can be used to solve large-scale batch optimization problems. In this setting, we can go through the training examples one by one in an online fashion, repeating multiple passes over the training data. To simplify the analysis, instead of assuming we go through the examples one by one, we assume each additional example is drawn from the training data randomly with equal probability. This corresponds to the standard stochastic optimization setting, in which observed samples are i.i.d. from some underlying distribution. The following result is a simple consequence of Theorem 3.1. For simplicity, we only consider the case with θ = ∞ and constant gravity g_i = g. The expectation E is taken over sequences of indices i_1, ..., i_T.

Theorem 3.2 (Stochastic Setting) Consider a set of training data z_i = (x_i, y_i) for 1 ≤ i ≤ n. Let

  R(w, g) = (1/n) Σ_{i=1}^n L(w, z_i) + g‖w‖_1

be the L1-regularized loss over the training data. Let ŵ_1 = w_1 = 0, and define recursively for t ≥ 1:

  w_{t+1} = T_1(w_t − η ∇_1 L(w_t, z_{i_t}), gη, ∞),  ŵ_{t+1} = ŵ_t + (w_{t+1} − ŵ_t)/(t + 1),

where each i_t is drawn from {1, ..., n} uniformly at random. If Assumption 3.1 holds, then for all T and w̄ ∈ R^d:

  E[ (1 − 0.5Aη) R(ŵ_T, g/(1 − 0.5Aη)) ] ≤ E[ ((1 − 0.5Aη)/T) Σ_{i=1}^T R(w_i, g/(1 − 0.5Aη)) ] ≤ (η/2) B + ‖w̄‖^2/(2ηT) + R(w̄, g).

Observe that if we let η → 0 and ηT → ∞, the bound in Theorem 3.2 becomes E[R(ŵ_T, g)] ≤ E[ (1/T) Σ_{t=1}^T R(w_t, g) ] ≤ inf_{w̄} R(w̄, g) + o(1). In other words, on average ŵ_T approximately solves the batch L1-regularization problem inf_w [ (1/n) Σ_{i=1}^n L(w, z_i) + g‖w‖_1 ] when T is large. If we choose a random stopping time T, then the above inequalities say that on average w_T also solves this L1-regularization problem approximately. Thus, we use the last solution w_T instead of the aggregated solution ŵ_T in the experiments. Since L1 regularization is often used to achieve sparsity in the batch learning setting, the connection of truncated gradient to L1 regularization can be regarded as an alternative justification for the sparsification ability of this algorithm.

4 Efficient Implementation of Truncated Gradient for Square Loss

The truncated descent update rule (6) can be applied to least-squares regression using square loss, leading to f(w_i) = T_1(w_i + 2η(y_i − ŷ_i)x_i, ηg_i, θ), where the prediction is given by ŷ_i = Σ_j w_i^j x_i^j. We altered an efficient SGD implementation, Vowpal Wabbit [7], for least-squares regression according to truncated gradient. The program operates in an entirely online fashion. Features are hashed instead of being stored explicitly, and weights can be easily inserted into or deleted from the table dynamically, so the memory footprint is essentially just the number of nonzero weights, even when the total numbers of examples and features are astronomically large.

In many online-learning situations such as web applications, only a small subset of the features have nonzero values for any example x. It is thus desirable to deal with sparsity only in this small subset rather than in all features, while simultaneously inducing sparsity on all feature weights. The approach we take is to store a time-stamp τ_j for each feature j.
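This time-stamp bookkeeping can be sketched as follows (a hypothetical dictionary-based sketch for square loss with θ = ∞ and K = 1; Vowpal Wabbit's actual hashed implementation differs):

```python
def lazy_truncated_sgd_step(weights, stamps, x, y, i, eta, g):
    """One step of truncated gradient for square loss with theta = inf and
    K = 1, using lazy shrinkage. `weights` and `stamps` are dicts keyed by
    feature id; the sparse example `x` maps feature id -> value."""
    # Catch up on the shrinkage each feature missed since it was last touched:
    # between touches its gradient was zero, so only gravity acted on it.
    for j in x:
        if j in weights:
            pending = (i - stamps[j]) * eta * g
            w = weights[j]
            w = max(0.0, w - pending) if w > 0 else min(0.0, w + pending)
            if w == 0.0:
                del weights[j]       # weight truncated away: drop the feature
            else:
                weights[j] = w
        stamps[j] = i                # reset the time-stamp to the current step
    pred = sum(weights.get(j, 0.0) * v for j, v in x.items())
    for j, v in x.items():           # SGD step on the square loss
        w = weights.get(j, 0.0) + 2 * eta * (y - pred) * v
        if w != 0.0:
            weights[j] = w
        elif j in weights:
            del weights[j]
    return pred
```

Applying the accumulated shrinkage in one clipped step is equivalent to applying it step by step, because max(0, w − a) iterated k times equals max(0, w − ka) for w ≥ 0 (and symmetrically for w ≤ 0).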
The time-stamp is initialized to the index of the example where feature j becomes nonzero for the first time. During online learning, at each step i, we only go through the nonzero features j of example i, and calculate the un-performed shrinkage of w_j between τ_j and the current time i. These weights are then updated, and their time-stamps are reset to i. This lazy-update idea of delaying the shrinkage calculation until needed is the key to an efficient implementation of truncated gradient. The implementation satisfies the efficiency requirements outlined at the end of the introduction. A similar time-stamp trick can be applied to the other two algorithms given in section 3.

5 Empirical Results

We applied the algorithm, with the efficiently implemented sparsify option described in the previous section, to a selection of datasets, including eleven datasets from the UCI repository [1], the much larger dataset rcv1 [9], and a private large-scale dataset Big_Ads related to ad interest prediction. While the UCI datasets are useful for benchmark purposes, rcv1 and Big_Ads are more interesting since they embody real-world datasets with large numbers of features, many of which are less informative for making predictions than others. The UCI datasets we used do not have many features, and it seems a large fraction of these features are useful for making predictions. For comparison purposes, and to better demonstrate the behavior of our algorithm, we also added 1000 random binary features to those datasets. Each such feature has value 1 with probability 0.05, and 0 otherwise.

In the first set of experiments, we are interested in how much reduction in the number of features is possible without affecting learning performance significantly; specifically, we require that the accuracy be reduced by no more than 1% for classification tasks, and that the total square loss be increased by no more than 1% for regression tasks.
As is common practice, we allowed the algorithm to run on the training data for multiple passes with a decaying learning rate. For each dataset, we performed 10-fold cross validation over the training set to identify the best set of parameters, including the learning rate η, the gravity g, the number of passes over the training set, and the decay of the learning rate across these passes. This set of parameters was then used on the whole training set. Finally, the learned classifier/regressor was evaluated on the test set. We fixed K = 1 and θ = ∞ in this set of experiments. The effects of K and θ are included in an extended version of this paper.

Figure 1 shows the fraction of features remaining after sparsification is applied to each dataset. For UCI datasets with randomly added features, truncated gradient was able to reduce the number of features by a fraction of more than 90%, except for the ad dataset, in which only a 71% reduction was observed. This less satisfying result might be improved by a more extensive parameter search in cross validation. However, if we tolerated a 1.3% decrease in accuracy (instead of 1%) during cross validation, truncated gradient was able to achieve a 91.4% reduction, indicating that a large reduction is still possible at the tiny additional accuracy loss of 0.3%. Even for the original UCI datasets without artificially added features, some of the less useful features were removed while the same level of performance was maintained.

[Figure 1: Left: the fraction of features left after sparsification for each dataset (base data, and with 1000 extra random features) without 1% performance loss. Right: the ratio of AUC with and without sparsification.]

For classification tasks, we also studied how truncated gradient affects AUC (Area Under the ROC Curve), a standard metric for classification. We use AUC here because, unlike accuracy, it is insensitive to the classification threshold. Using the same sets of parameters from the 10-fold cross validation described above, we found this criterion was not affected significantly by sparsification, and in some cases it was actually improved, due to the removal of some irrelevant features. The ratios of the AUC with and without sparsification for all classification tasks are plotted in Figure 1. Often these ratios are above 98%.

The previous results do not exercise the full power of the approach presented here, because they are applied to datasets where the standard Lasso is computationally viable. We have also applied this approach to the large non-public dataset Big_Ads, where the goal is predicting which of two ads was clicked on given context information (the content of the ads and query information). Here, accepting a 0.9% increase in classification error allows us to reduce the number of features from about 3 × 10^9 to about 24 × 10^6: a factor-of-125 decrease in the number of features.

The next set of experiments compares truncated gradient to other algorithms regarding their abilities to trade off feature sparsification and performance. Again, we focus on the AUC metric in UCI classification tasks.
The algorithms for comparison include: (i) the truncated-gradient algorithm with K = 10 and θ = ∞; (ii) the truncated-gradient algorithm with K = 10 and θ = g; (iii) the rounding algorithm with K = 10; (iv) the L1-regularized subgradient algorithm with K = 10; and (v) the Lasso [12] for batch L1 regularization (a publicly available implementation [11] was used). We chose K = 10 since it worked better than K = 1, and this choice was especially important for the coefficient-rounding algorithm. All unspecified parameters were identified using cross validation. Note that we do not attempt to compare these algorithms on rcv1 and Big_Ads, simply because their sizes are too large for the Lasso and subgradient descent.

Figure 2 gives the results on the datasets ad and spambase. Results on other datasets were qualitatively similar. On all datasets, truncated gradient (with θ = ∞) is consistently competitive with the other online algorithms and significantly outperformed them on some problems, implying truncated gradient is generally effective. Moreover, truncated gradient with θ = g behaves similarly to rounding (and sometimes better). This was expected, as truncated gradient with θ = g can be regarded as a principled variant of rounding with valid theoretical justification. It is also interesting to observe that the qualitative behavior of truncated gradient was often similar to that of the Lasso, especially when very sparse weight vectors were allowed (the left sides of the graphs). This is consistent with Theorem 3.2, which shows the relation between these two algorithms. However, the Lasso usually performed worse when the allowed number of nonzero weights was large (the right sides of the graphs). In this case, the Lasso seemed to overfit, while truncated gradient was more robust to overfitting.
The robustness of online learning is often attributed to early stopping, which has been extensively studied (e.g., in [13]).

Finally, it is worth emphasizing that these comparison experiments shed some light on the relative strengths of these algorithms in terms of feature sparsification, without considering which ones can be efficiently implemented. For large datasets with sparse features, only truncated gradient and the ad hoc coefficient rounding algorithm are applicable.

[Figure 2: AUC versus number of features on the ad (left) and spambase (right) datasets for Trunc. Grad. (θ = ∞), Trunc. Grad. (θ = g), Rounding, Sub-gradient, and Lasso.]

Figure 2: Comparison of the five algorithms in two sample UCI datasets.

6 Conclusion

This paper covers the first efficient sparsification technique for large-scale online learning with strong theoretical guarantees. The algorithm, truncated gradient, is the natural extension of Lasso-style regression to the online-learning setting. Theorem 3.1 proves the technique is sound: it never harms performance much compared to standard stochastic gradient descent in adversarial situations. Furthermore, we show the asymptotic solution of one instance of the algorithm is essentially equivalent to Lasso regression, thus justifying the algorithm's ability to produce sparse weight vectors when the number of features is intractably large. The theorem is verified experimentally in a number of problems. In some cases, especially for problems with many irrelevant features, this approach achieves a one- to two-order-of-magnitude reduction in the number of features.

References

[1] A. Asuncion and D.J. Newman.
UCI machine learning repository, 2007. UC Irvine.

[2] N. Cesa-Bianchi, P.M. Long, and M. Warmuth. Worst-case quadratic loss bounds for prediction using linear functions and gradient descent. IEEE Transactions on Neural Networks, 7(3):604–619, 1996.

[3] C.-T. Chu, S.K. Kim, Y.-A. Lin, Y. Yu, G. Bradski, A.Y. Ng, and K. Olukotun. Map-reduce for machine learning on multicore. In Advances in Neural Information Processing Systems 20, pages 281–288, 2008.

[4] O. Dekel, S. Shalev-Shwartz, and Y. Singer. The Forgetron: A kernel-based perceptron on a fixed budget. In Advances in Neural Information Processing Systems 18, pages 259–266, 2006.

[5] J. Duchi, S. Shalev-Shwartz, Y. Singer, and T. Chandra. Efficient projections onto the ℓ1-ball for learning in high dimensions. In Proceedings of ICML-08, pages 272–279, 2008.

[6] J. Kivinen and M.K. Warmuth. Exponentiated gradient versus gradient descent for linear predictors. Information and Computation, 132(1):1–63, 1997.

[7] J. Langford, L. Li, and A.L. Strehl. Vowpal Wabbit (fast online learning), 2007. http://hunch.net/~vw/.

[8] H. Lee, A. Battle, R. Raina, and A.Y. Ng. Efficient sparse coding algorithms. In Advances in Neural Information Processing Systems 19, 2007.

[9] D.D. Lewis, Y. Yang, T.G. Rose, and F. Li. RCV1: A new benchmark collection for text categorization research. Journal of Machine Learning Research, 5:361–397, 2004.

[10] S. Shalev-Shwartz, Y. Singer, and N. Srebro. Pegasos: Primal Estimated sub-GrAdient SOlver for SVM. In Proceedings of ICML-07, pages 807–814, 2007.

[11] K. Sjöstrand. Matlab implementation of LASSO, LARS, the elastic net and SPCA, June 2005. Version 2.0, http://www2.imm.dtu.dk/pubdb/p.php?3897.

[12] R. Tibshirani. Regression shrinkage and selection via the lasso.
Journal of the Royal Statistical Society, Series B, 58(1):267–288, 1996.

[13] T. Zhang. Solving large scale linear prediction problems using stochastic gradient descent algorithms. In Proceedings of ICML-04, pages 919–926, 2004.

[14] M. Zinkevich. Online convex programming and generalized infinitesimal gradient ascent. In Proceedings of ICML-03, pages 928–936, 2003.