{"title": "Bundle Methods for Machine Learning", "book": "Advances in Neural Information Processing Systems", "page_first": 1377, "page_last": 1384, "abstract": null, "full_text": "Bundle Methods for Machine Learning\n\nAlexander J. Smola, S. V. N. Vishwanathan, Quoc V. Le\n\nNICTA and Australian National University, Canberra, Australia\n\nAlex.Smola@gmail.com, {SVN.Vishwanathan, Quoc.Le}@nicta.com.au\n\nAbstract\n\nWe present a globally convergent method for regularized risk minimization problems. Our method applies to Support Vector estimation, regression, Gaussian Processes, and any other regularized risk minimization setting which leads to a convex optimization problem. SVMPerf can be shown to be a special case of our approach. In addition to the unified framework we present tight convergence bounds, which show that our algorithm converges in O(1/ε) steps to ε precision for general convex problems and in O(log(1/ε)) steps for continuously differentiable problems. We demonstrate the performance of our approach in experiments.\n\n1 Introduction\n\nIn recent years optimization methods for convex models have seen significant progress. Starting from the active set methods described by Vapnik [17], increasingly sophisticated algorithms for solving regularized risk minimization problems have been developed. Some of the most exciting recent developments are SVMPerf [5] and the Pegasos gradient descent solver [12]. The former computes gradients of the current solution at every step and adds those to the optimization problem. Joachims [5] proves an O(1/ε²) rate of convergence. For Pegasos, Shalev-Shwartz et al. [12] prove an O(1/ε) rate of convergence, which suggests that Pegasos should be much more suitable for optimization. In this paper we extend the ideas of SVMPerf to general convex optimization problems and a much wider class of regularizers. 
In addition to this, we present a formulation which does not require the solution of a quadratic program whilst in practice enjoying the same rate of convergence as algorithms of the SVMPerf family. Our error analysis shows that the rates achieved by this algorithm are considerably better than what was previously known for SVMPerf, namely the algorithm enjoys O(1/ε) convergence and O(log(1/ε)) convergence whenever the loss is sufficiently smooth. An important feature of our algorithm is that it automatically takes advantage of smoothness in the problem.\n\nOur work builds on [15], which describes the basic extension of SVMPerf to general convex problems. The current paper provides a) significantly improved performance bounds which match better what can be observed in practice and which apply to a wide range of regularization terms, b) a variant of the algorithm which does not require quadratic programming, yet enjoys the same fast rates of convergence, and c) experimental data comparing the speed of our solver to Pegasos and SVMPerf. Due to space constraints we relegate the proofs to a technical report [13].\n\n2 Problem Setting\n\nDenote by x ∈ X and y ∈ Y patterns and labels respectively, and let l(x, y, w) be a loss function which is convex in w ∈ W, where either W = Rᵈ (linear classifier) or W is a Reproducing Kernel Hilbert Space for kernel methods. Given a set of m training patterns {xᵢ, yᵢ}ᵢ₌₁ᵐ, the regularized risk functional which many estimation methods strive to minimize can be written as\n\nJ(w) := Remp(w) + λΩ(w) where Remp(w) := (1/m) Σᵢ₌₁ᵐ l(xᵢ, yᵢ, w).   (1)\n\nTypically Ω(w) is a smooth convex regularizer such as (1/2)‖w‖², and λ > 0 is a regularization constant. Usually Ω is cheap to compute and to minimize, whereas the empirical risk term Remp(w) is computationally expensive to deal with. 
For instance, in the case of intractable graphical models it requires approximate inference methods such as sampling or semidefinite programming. To make matters worse, the number of training observations m may be huge. We assume that the empirical risk Remp(w) is nonnegative.\n\nIf J is differentiable we can use standard quasi-Newton methods like LBFGS even for large values of m [8]. Unfortunately, it is not straightforward to extend these algorithms to optimize a non-smooth objective. In such cases one has to resort to bundle methods [3], which are based on the following elementary observation: for convex functions a first order Taylor approximation is a lower bound. So is the maximum over a set of Taylor approximations. Furthermore, the Taylor approximation is exact at the point of expansion. The idea is to replace Remp(w) by these lower bounds and to optimize the latter in conjunction with Ω(w). Figure 1 gives geometric intuition. In the remainder of the paper we will show that 1) this extends a number of existing algorithms; 2) this method enjoys good rates of convergence; and 3) it works well in practice.\n\nFigure 1: A lower bound on the convex empirical risk Remp(w) obtained by computing three tangents on the entire function.\n\nNote that there is no need for Remp(w) to decompose into individual losses in an additive fashion. For instance, scores such as Precision@k [4] or SVM ranking scores do not satisfy this property. Likewise, estimation problems which allow for an unregularized common constant offset or adaptive margin settings using the ν-trick fall into this category. The only difference is that in those cases the derivative of Remp(w) with respect to w no longer decomposes trivially into a sum of gradients.\n\n3 Bundle Methods\n\n3.1 Subdifferential and Subgradient\n\nBefore we describe the bundle method, it is necessary to clarify a key technical point. 
The subgradient is a generalization of the gradient appropriate for convex functions, including those which are not necessarily smooth. Suppose w is a point where a convex function F is finite. Then a subgradient is the normal vector of any tangential supporting hyperplane of F at w. Formally, µ is called a subgradient of F at w if, and only if,\n\nF(w') ≥ F(w) + ⟨w' − w, µ⟩ for all w'.   (2)\n\nThe set of all subgradients at a point is called the subdifferential, and is denoted by ∂wF(w). If this set is not empty then F is said to be subdifferentiable at w. On the other hand, if this set is a singleton then the function is said to be differentiable at w.\n\n3.2 The Algorithm\n\nDenote by wt ∈ W the values of w which are obtained by successive steps of our method. Let at ∈ W, bt ∈ R, and set w0 = 0, a0 = 0, b0 = 0. Then the Taylor expansion coefficients of Remp(wt) can be written as\n\nat+1 := ∂wRemp(wt) and bt+1 := Remp(wt) − ⟨at+1, wt⟩.   (3)\n\nNote that we do not require Remp to be differentiable: if Remp is not differentiable at wt we simply choose any element of the subdifferential as at+1. Since each Taylor approximation is a lower bound, we may take their maximum to obtain that Remp(w) ≥ max_{t} ⟨at, w⟩ + bt. 
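As a concrete sanity check (our own illustration, not part of the paper), the defining inequality (2) can be verified numerically for a simple nonsmooth loss such as the one-dimensional hinge F(w) = max(0, 1 − w), whose subdifferential at the kink w = 1 is the interval [−1, 0]; the helper names below are ours:

```python
import numpy as np

def hinge(w):
    """F(w) = max(0, 1 - w), a convex function with a kink at w = 1."""
    return max(0.0, 1.0 - w)

def hinge_subgradient(w):
    """One valid element of the subdifferential of the hinge at w."""
    if w < 1.0:
        return -1.0
    if w > 1.0:
        return 0.0
    return -0.5          # at the kink any value in [-1, 0] is a subgradient

# check the subgradient inequality (2) at the nonsmooth point w0 = 1
w0 = 1.0
mu = hinge_subgradient(w0)
for w_prime in np.linspace(-3.0, 3.0, 61):
    assert hinge(w_prime) >= hinge(w0) + (w_prime - w0) * mu - 1e-12
```

Any choice in [−1, 0] at the kink passes the same check, which is exactly the freedom Algorithm 1 exploits when it picks an arbitrary element of the subdifferential.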
Moreover, by virtue of the fact that Remp is a nonnegative function, we can write the following lower bounds on Remp and J respectively:\n\nRt(w) := max_{t'≤t} ⟨at', w⟩ + bt' and Jt(w) := λΩ(w) + Rt(w).   (4)\n\nAlgorithm 1 Bundle Method(ε)\n  Initialize t = 0, w0 = 0, a0 = 0, b0 = 0, and J0(w) = λΩ(w)\n  repeat\n    Find minimizer wt := argmin_w Jt(w)\n    Compute gradient at+1 and offset bt+1\n    Increment t ← t + 1\n  until εt ≤ ε\n\nBy construction Rt' ≤ Rt ≤ Remp and Jt' ≤ Jt ≤ J for all t' ≤ t. Define\n\nw* := argmin_w J(w), wt := argmin_w Jt(w), γt := Jt+1(wt) − Jt(wt), and εt := min_{t'≤t} Jt'+1(wt') − Jt(wt).\n\nThe following lemma establishes some useful properties of γt and εt.\n\nLemma 1 We have Jt'(wt') ≤ Jt(wt) ≤ J(w*) ≤ J(wt) = Jt+1(wt) for all t' ≤ t. Furthermore, εt is monotonically decreasing with εt − εt+1 ≥ Jt+1(wt+1) − Jt(wt) ≥ 0. Also, εt upper bounds the distance from optimality via γt ≥ εt ≥ min_{t'≤t} J(wt') − J(w*).\n\n3.3 Dual Problem\n\nOptimization is often considerably easier in the dual space. In fact, we will show that we need not know Ω(w) at all; instead it is sufficient to work with its Fenchel-Legendre dual Ω*(µ) := sup_w ⟨w, µ⟩ − Ω(w). If Ω* is a so-called Legendre function [e.g. 10], the w at which the supremum is attained can be written as w = ∂µΩ*(µ). In the sequel we will always assume that Ω* is twice differentiable and Legendre. Examples include Ω*(µ) = (1/2)‖µ‖² or Ω*(µ) = Σᵢ exp[µ]ᵢ.\n\nTheorem 2 Let α ∈ Rᵗ, denote by A = [a1, . . . , at] the matrix whose columns are the (sub)gradients, and let b = [b1, . . . , bt]. The dual problem of\n\nminimize_w Jt(w) := max_{t'≤t} ⟨at', w⟩ + bt' + λΩ(w)   (5)\n\nis\n\nmaximize_α Jt*(α) := −λΩ*(−λ⁻¹Aα) + α⊤b subject to α ≥ 0 and ‖α‖₁ = 1.   (6)\n\nFurthermore, the optimal wt and αt are related by the dual connection wt = ∂Ω*(−λ⁻¹Aαt).   (7)\n\nRecall that for Ω(w) = (1/2)‖w‖₂² the Fenchel-Legendre dual is given by Ω*(µ) = (1/2)‖µ‖₂². This is commonly used in SVMs and Gaussian Processes. The following corollary is immediate:\n\nCorollary 3 Define Q := A⊤A, i.e. Quv := ⟨au, av⟩. For quadratic regularization, i.e. minimize_w max(0, max_{t'≤t} ⟨at', w⟩ + bt') + (λ/2)‖w‖₂², the dual becomes\n\nmaximize_α −(1/2λ) α⊤Qα + α⊤b subject to α ≥ 0 and ‖α‖₁ = 1.\n\nThis means that for quadratic regularization the dual optimization problem is a quadratic program where the number of variables equals the number of gradients computed previously. Since t is typically in the order of tens to hundreds, the resulting QP is very cheap to solve. In fact, we do not even need to know the gradients explicitly. All that is required to define the QP are the inner products between gradient vectors ⟨au, av⟩. 
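For intuition, with only t = 2 bundle planes and Ω*(µ) = (1/2)‖µ‖² the dual (6) can be solved in closed form, since restricting α = (s, 1 − s) turns Jt* into a concave quadratic in s. The following sketch (function names are ours, not the paper's) illustrates this together with the dual connection w = −λ⁻¹Aα:

```python
import numpy as np

def dual_objective(alpha, Q, b, lam):
    """J*_t(alpha) = -(1/(2 lam)) alpha^T Q alpha + alpha^T b for quadratic regularization."""
    return -0.5 / lam * alpha @ Q @ alpha + alpha @ b

def solve_dual_two_planes(a1, b1, a2, b2, lam):
    """Maximize the two-plane dual over alpha = (s, 1 - s) with s in [0, 1]."""
    d = a1 - a2
    denom = d @ d
    # stationary point of the concave quadratic in s, clipped back to the simplex
    s = 0.5 if denom == 0 else float(np.clip((lam * (b1 - b2) - a2 @ d) / denom, 0.0, 1.0))
    alpha = np.array([s, 1.0 - s])
    w = -(s * a1 + (1.0 - s) * a2) / lam   # dual connection (7)
    return alpha, w
```

For example, two symmetric planes a1 = (1, 0), a2 = (0, 1) with b1 = b2 = 0 and λ = 1 yield α = (1/2, 1/2), and the dual value at α dominates both vertices of the simplex, as Theorem 2 predicts.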
Later in this section we propose a variant which does away with the quadratic program altogether while preserving most of the appealing convergence properties of Algorithm 1.\n\n3.4 Examples\n\nStructured Estimation Many estimation problems [14, 16] can be written in terms of a piecewise linear loss function\n\nl(x, y, w) = max_{y'∈Y} ⟨φ(x, y') − φ(x, y), w⟩ + Δ(y, y')   (8)\n\nfor some suitable joint feature map φ and a loss function Δ(y, y'). It follows from Section 3.1 that a subgradient of (8) is given by\n\n∂wl(x, y, w) = φ(x, y*) − φ(x, y) where y* := argmax_{y'∈Y} ⟨φ(x, y') − φ(x, y), w⟩ + Δ(y, y').   (9)\n\nSince Remp is defined as a summation of loss terms, this allows us to apply Algorithm 1 directly for risk minimization: at every iteration t we find all maximal constraint violators for each (xi, yi) pair and compute the composite gradient vector. This vector is then added to the convex program we have so far.\n\nJoachims [5] pointed out this idea for the special case of φ(x, y) = yφ(x) and y ∈ {±1}, that is, binary classification. Effectively, by defining a joint feature map as the sum over individual feature maps and by defining a joint loss Δ as the sum over individual losses, SVMPerf performs exactly the same operations as we described above. Hence, for losses of type (8) our algorithm is a direct extension of SVMPerf to structured estimation.\n\nExponential Families One of the advantages of our setting is that it applies to any convex loss function, as long as there is an efficient way of computing the gradient. 
That is, we can use it for cases where we are interested in modeling\n\np(y|x; w) = exp(⟨φ(x, y), w⟩ − g(w|x)) where g(w|x) = log ∫_Y exp(⟨φ(x, y'), w⟩) dy'.   (10)\n\nHere g(w|x) is the conditional log-partition function. This class of losses includes settings such as Gaussian Process classification and Conditional Random Fields [1]. Such settings have been studied by Lee et al. [6] in conjunction with an ℓ1 regularizer Ω(w) = ‖w‖₁ for structure discovery in graphical models. Choosing l to be the negative log-likelihood, it follows that\n\nRemp(w) = Σᵢ₌₁ᵐ g(w|xᵢ) − ⟨φ(xᵢ, yᵢ), w⟩ and ∂wRemp(w) = Σᵢ₌₁ᵐ E_{y'∼p(y'|xᵢ;w)}[φ(xᵢ, y')] − φ(xᵢ, yᵢ).\n\nThis means that column generation methods are directly applicable to Gaussian Process estimation, a problem for which large scale solvers were somewhat more difficult to find. It also shows that adding a new model becomes a matter of defining a new loss function and its corresponding gradient, rather than having to build a full solver from scratch.\n\n4 Convergence Analysis\n\nWhile Algorithm 1 is intuitively plausible, it remains to be shown that it has good rates of convergence. In fact, past results, such as those by Tsochantaridis et al. [16], suggest an O(1/ε²) rate, which would make the application infeasible in practice.\n\nWe use a duality argument similar to those put forward in [11, 16], both of which share key techniques with [18]. The crux of our proof argument lies in showing that εt − εt+1 ≥ Jt+1(wt+1) − Jt(wt) (see Theorem 4) is sufficiently bounded away from 0. In other words, since εt bounds the distance from optimality, at every step Algorithm 1 makes sufficient progress towards the optimum. 
Towards this end, we first observe that by strong duality the values of the primal and dual problems (5) and (6) are equal at optimality. Hence, any progress in Jt+1 can be computed in the dual.\n\nNext, we observe that the solution of the dual problem (6) at iteration t, denoted by αt, forms a feasible set of parameters for the dual problem (6) at iteration t + 1 by means of the parameterization (αt, 0), i.e. by padding αt with a 0. The value of the objective function in this case equals Jt(wt).\n\nTo obtain a lower bound on the improvement due to Jt+1(wt+1) we perform a line search along ((1 − η)αt, η) in (6). The constraint η ∈ [0, 1] ensures dual feasibility. We will bound this improvement in terms of γt. Note that, in general, solving the dual problem (6) results in an increase which is larger than that obtained via the line search. The line search is employed in the analysis only for analytic tractability. We aim to lower-bound εt − εt+1 in terms of εt and solve the resultant difference equation.\n\nDepending on J(w) we will be able to prove two different convergence results:\n\n(a) For regularizers Ω(w) for which ‖∂²µΩ*(µ)‖ ≤ H* we first experience a regime of progress linear in γt and a subsequent slowdown to improvements which are quadratic in γt.\n\n(b) Under the above conditions, if furthermore ‖∂²wJ(w)‖ ≤ H, i.e. the Hessian of J is bounded, we have linear convergence throughout.\n\nWe first derive lower bounds on the improvement Jt+1(wt+1) − Jt(wt), then show that for (b) the bounds are better. 
Finally we prove the convergence rates by solving the difference equation in εt. This reasoning leads to the following theorem:\n\nTheorem 4 Assume that ‖∂wRemp(w)‖ ≤ G for all w ∈ W, where W is some domain of interest containing all wt' for t' ≤ t. Also assume that Ω* has bounded curvature, i.e. ‖∂²µΩ*(µ)‖ ≤ H* for all µ ∈ {−λ⁻¹Āᾱ where ᾱ ≥ 0 and ‖ᾱ‖₁ ≤ 1}. In this case we have\n\nεt − εt+1 ≥ (γt/2) min(1, λγt/(4G²H*)) ≥ (εt/2) min(1, λεt/(4G²H*)).   (11)\n\nFurthermore, if ‖∂²wJ(w)‖ ≤ H, then we have\n\nεt − εt+1 ≥ γt/2 if γt > 4G²H*/λ; λγt²/(8G²H*) if 4G²H*/λ ≥ γt ≥ H/2; and λγt/(4HH*) otherwise.   (12)\n\nNote that the error keeps on halving initially and settles for a somewhat slower rate of convergence after that, whenever the Hessian of the overall risk is bounded from above. The reason for the difference in the convergence bound for differentiable and non-differentiable losses is that in the former case the gradient of the risk converges to 0 as we approach optimality, whereas in the latter case no such guarantees hold (e.g. when minimizing |x| the (sub)gradient does not vanish at the optimum).\n\nTwo facts are worth noting: a) The duals of many regularizers, e.g. the squared norm, the squared ℓp norm, and the entropic regularizer, have bounded second derivative; see e.g. [11] for a discussion and details. Thus our condition ‖∂²µΩ*(µ)‖ ≤ H* is not unreasonable. b) Since the improvements decrease with the size of γt, we may replace γt by εt in both bounds and conditions without any ill effect (the bound only gets worse). Applying the previous result we obtain a convergence theorem for bundle methods.\n\nTheorem 5 Assume that J(w) ≥ 0 for all w. Under the assumptions of Theorem 4 we can give the following convergence guarantee for Algorithm 1. For any ε < 4G²H*/λ the algorithm converges to the desired precision after\n\nn ≤ log₂(λJ(0)/(G²H*)) + 8G²H*/(λε) − 4   (13)\n\nsteps. If furthermore the Hessian of J(w) is bounded, convergence to any ε ≤ H/2 takes at most the following number of steps:\n\nn ≤ log₂(λJ(0)/(4G²H*)) + (4H*/λ) max[0, H − 8G²H*/λ] + (4HH*/λ) log(H/2ε).   (14)\n\nSeveral observations are in order: firstly, note that the number of iterations only depends logarithmically on how far the initial value J(0) is away from the optimal solution. Compare this to the result of Tsochantaridis et al. [16], where the number of iterations is linear in J(0).\n\nSecondly, we have an O(1/ε) dependence of the number of iterations in the non-differentiable case. This matches the rate of Shalev-Shwartz et al. [12]. In addition, the convergence is O(log(1/ε)) for continuously differentiable problems.\n\nNote that whenever Remp(w) is the average over many piecewise linear functions, Remp(w) behaves essentially like a function with bounded Hessian as long as we are taking large enough steps not to “notice” that the term is actually nonsmooth.\n\nRemark 6 For Ω(w) = (1/2)‖w‖² the dual Hessian is exactly H* = 1. Moreover we know that H ≥ λ, since ‖∂²wJ(w)‖ = λ + ‖∂²wRemp(w)‖.\n\nEffectively the rate of convergence of the algorithm is governed by upper bounds on the primal and dual curvature of the objective function. This acts like a condition number of the problem: for Ω(w) = (1/2)w⊤Qw the dual is Ω*(z) = (1/2)z⊤Q⁻¹z, hence the largest eigenvalues of Q and Q⁻¹ have a significant influence on the convergence.\n\nIn terms of λ the number of iterations needed for convergence is O(λ⁻¹). In practice the iteration count does increase with λ⁻¹, albeit not as badly as predicted. This is likely due to the fact that the empirical risk Remp(w) is typically rather smooth and has a certain inherent curvature which acts as a natural regularizer, in addition to the regularization afforded by λΩ(w).\n\n5 A Linesearch Variant\n\nThe convergence analysis in Theorem 4 relied on a one-dimensional line search. Algorithm 1, however, uses a more complex quadratic program to solve the problem. Since even the simple updates promise good rates of convergence, it is tempting to replace the corresponding step in the bundle update. This can lead to considerable savings, in particular for smaller problems, where the time spent in the quadratic programming solver is a substantial fraction of the total runtime.\n\nTo keep matters simple, we only consider quadratic regularization Ω(w) := (1/2)‖w‖². Note that Jt+1(η) := J*t+1((1 − η)αt, η) is a quadratic function in η, regardless of the choice of Remp(w). Hence a line search only needs to determine the first and second derivative, as done in the proof of Theorem 4. It can be shown that ∂ηJt+1(0) = γt and ∂²ηJt+1(0) = −(1/λ)‖∂wJ(wt)‖² = −(1/λ)‖λwt + at+1‖². Hence the optimal value of η is given by\n\nη = min(1, λγt/‖λwt + at+1‖₂²).   (15)\n\nThis means that we may update wt+1 = (1 − η)wt − (η/λ)at+1. In other words, we need not store past gradients for the update. To obtain γt, note that we are computing Remp(wt) as part of the Taylor approximation step. Finally, Rt(wt) is given by [wt⊤A + b⊤]αt, hence it satisfies the same update relations. In particular, the fact that wt⊤Aαt = −λ‖wt‖² means that the only quantity we need to cache is b⊤αt, as an auxiliary variable rt, in order to compute γt efficiently. Experiments show that this simplified algorithm has essentially the same convergence properties.\n\n6 Experiments\n\nIn this section we show experimental results that demonstrate the merits of our algorithm and its analysis. Due to space constraints, we report results of experiments with two large datasets, namely Astro-Physics (astro-ph) and Reuters-CCAT (reuters-ccat) [5, 12]. For a fair comparison with existing solvers we use the quadratic regularizer Ω(w) := (λ/2)‖w‖² and the binary hinge loss.\n\nIn our first experiment, we address the rate of convergence and its dependence on the value of λ. In Figure 2 we plot εt as a function of iterations for various values of λ, using the QP solver at every iteration to solve the dual problem (6) to optimality. Initially, we observe super-linear convergence; this is consistent with our analysis. Surprisingly, even though theory predicts sub-linear speed of convergence for non-differentiable losses like the binary hinge loss (see (11)), our solver exhibits the linear rates of convergence predicted only for differentiable functions (see (12)). We conjecture that the average over many piecewise linear functions, Remp(w), behaves essentially like a smooth function. 
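The update equations above fully specify the line-search variant for the binary hinge loss. A minimal sketch (our own illustrative code, not the authors' implementation) computes γt via the cached auxiliary variable rt = b⊤αt as described:

```python
import numpy as np

def hinge_risk_and_grad(w, X, y):
    """Remp(w) = (1/m) sum_i max(0, 1 - y_i <x_i, w>) and one subgradient."""
    margins = 1.0 - y * (X @ w)
    active = margins > 0
    risk = np.maximum(margins, 0.0).mean()
    grad = -(X[active] * y[active, None]).sum(axis=0) / len(y)
    return risk, grad

def bundle_linesearch(X, y, lam=0.1, eps=1e-4, max_iter=200):
    """Line-search bundle variant for quadratic regularization, per eq. (15)."""
    w = np.zeros(X.shape[1])
    r = 0.0                                  # r_t = b^T alpha_t (cached offset)
    for _ in range(max_iter):
        risk, a = hinge_risk_and_grad(w, X, y)       # a_{t+1}, Remp(w_t)
        b = risk - a @ w                             # Taylor offset b_{t+1}
        gamma = lam * w @ w + risk - r               # gap gamma_t = J(w_t) - J_t(w_t)
        if gamma <= eps:
            break
        direction = lam * w + a
        eta = min(1.0, lam * gamma / max(direction @ direction, 1e-12))
        w = (1.0 - eta) * w - (eta / lam) * a        # w_{t+1} = (1-eta) w_t - (eta/lam) a_{t+1}
        r = (1.0 - eta) * r + eta * b                # same convex-combination update for r_t
    return w
```

Note that no past gradients are stored; each iteration touches only the current subgradient, the iterate, and the scalar cache, which is what makes this variant attractive when the QP would dominate the runtime.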
As predicted, the convergence speed is inversely proportional to the value of λ.\n\nFigure 2: We plot εt as a function of the number of iterations. Note the logarithmic scale in εt. Left: astro-ph; Right: reuters-ccat.\n\nFigure 3: Top: Objective function value as a function of time. Bottom: Objective function value as a function of iterations. Left: astro-ph; Right: reuters-ccat. The black line indicates the final value of the objective function + 0.001.\n\nIn our second experiment, we compare the convergence speed of two variants of the bundle method, namely with a QP solver in the inner loop (which essentially boils down to SVMPerf) and the line search variant which we described in Section 5. We contrast these solvers with Pegasos [12] in the batch setting. Following [5] we set λ = 10⁻⁴ for reuters-ccat and λ = 2·10⁻⁴ for astro-ph.\n\nFigure 3 depicts the evolution of the primal objective function value as a function of both CPU time and the number of iterations. Following Shalev-Shwartz et al. [12] we investigate the time required by various solvers to reduce the objective value to within 0.001 of the optimum. This is depicted as a black horizontal line in our plots. As can be seen, Pegasos converges to this region quickly. Nevertheless, both variants of the bundle method converge to this value even faster (line search is slightly slower than Pegasos on astro-ph, but this is not always the case for many other large datasets we tested on). Note that both line search and Pegasos converge to within 0.001 precision rather quickly, but they require a large number of iterations to converge to the optimum.\n\n7 Related Research\n\nOur work is closely related to Shalev-Shwartz and Singer [11], who prove mistake bounds for online algorithms by lower bounding the progress in the dual. 
Although not stated explicitly, essentially the same technique of lower bounding the dual improvement was used by Tsochantaridis et al. [16] to show polynomial time convergence of the SVMStruct algorithm. The main difference, however, is that Tsochantaridis et al. [16] only work with a quadratic objective function, while the framework proposed by Shalev-Shwartz and Singer [11] can handle arbitrary convex functions. In both cases, a weaker analysis led to O(1/ε²) rates of convergence for nonsmooth loss functions. On the other hand, our results establish an O(1/ε) rate for nonsmooth loss functions and O(log(1/ε)) rates for smooth loss functions under mild technical assumptions.\n\nAnother related work is SVMPerf [5], which solves the SVM estimation problem in linear time. SVMPerf finds a solution with accuracy ε in O(md/(λε²)) time, where the m training patterns satisfy xᵢ ∈ Rᵈ. This bound was improved by Shalev-Shwartz et al. [12] to Õ(1/λδε) for obtaining an accuracy of ε with confidence 1 − δ. Their algorithm, Pegasos, essentially performs stochastic (sub)gradient descent but projects the solution back onto the L2 ball of radius 1/√λ. However, as our experiments show, performing an exact line search in the dual leads to a faster decrease in the value of the primal objective. Note that Pegasos can also be used in an online setting. This, however, only applies whenever the empirical risk decomposes into individual loss terms (e.g. it is not applicable to multivariate performance scores).\n\nThe third related strand of research considers gradient descent in the primal with a line search to choose the optimal step size; see e.g. [2, Section 9.3.1]. 
Under assumptions of smoothness and strong convexity – that is, the objective function can be upper and lower bounded by quadratic functions – it can be shown that gradient descent with line search will converge to an accuracy of ε in O(log(1/ε)) steps. The problem here is the line search in the primal, since evaluating the regularized risk functional might be as expensive as computing its gradient, thus rendering a line search in the primal unattractive. On the other hand, the dual objective is relatively simple to evaluate, thus making the line search in the dual, as performed by our algorithm, computationally feasible.\n\nFinally, we would like to point out connections to subgradient methods [7]. These algorithms are designed for nonsmooth functions, and essentially choose an arbitrary element of the subgradient set to perform a gradient-descent-like update. Let ‖∂wJ(w)‖ ≤ G, and let B(w*, r) denote a ball of radius r centered around the minimizer of J(w). By applying the analysis of Nedich and Bertsekas [7] to the regularized risk minimization problem with Ω(w) := (λ/2)‖w‖², Ratliff et al. [9] showed that subgradient descent with a fixed, but sufficiently small, stepsize will converge linearly to B(w*, G/λ).\n\nReferences\n\n[1] Y. Altun, A. J. Smola, and T. Hofmann. Exponential families for conditional random fields. In UAI, pages 2–9, 2004.\n[2] S. Boyd and L. Vandenberghe. Convex Optimization. Cambridge University Press, 2004.\n[3] J. Hiriart-Urruty and C. Lemaréchal. Convex Analysis and Minimization Algorithms. Springer, 1993.\n[4] T. Joachims. A support vector method for multivariate performance measures. In ICML, pages 377–384, 2005.\n[5] T. Joachims. Training linear SVMs in linear time. In KDD, 2006.\n[6] S.-I. Lee, V. Ganapathi, and D. Koller. Efficient structure learning of Markov networks using L1-regularization. 
In NIPS, pages 817–824, 2007.\n[7] A. Nedich and D. P. Bertsekas. Convergence rate of incremental subgradient algorithms. In Stochastic Optimization: Algorithms and Applications, pages 263–304, 2000.\n[8] J. Nocedal and S. J. Wright. Numerical Optimization. Springer, 1999.\n[9] N. Ratliff, J. Bagnell, and M. Zinkevich. (Online) subgradient methods for structured prediction. In Proc. of AIStats, 2007.\n[10] R. T. Rockafellar. Convex Analysis. Princeton University Press, 1970.\n[11] S. Shalev-Shwartz and Y. Singer. Online learning meets optimization in the dual. In COLT, 2006.\n[12] S. Shalev-Shwartz, Y. Singer, and N. Srebro. Pegasos: Primal estimated sub-gradient solver for SVM. In ICML, 2007.\n[13] A. J. Smola, S. V. N. Vishwanathan, and Q. V. Le. Bundle methods for machine learning. JMLR, 2008. In preparation.\n[14] B. Taskar, C. Guestrin, and D. Koller. Max-margin Markov networks. In NIPS, pages 25–32, 2004.\n[15] C. H. Teo, Q. Le, A. Smola, and S. V. N. Vishwanathan. A scalable modular convex solver for regularized risk minimization. In KDD, 2007.\n[16] I. Tsochantaridis, T. Joachims, T. Hofmann, and Y. Altun. Large margin methods for structured and interdependent output variables. JMLR, 6:1453–1484, 2005.\n[17] V. Vapnik. The Nature of Statistical Learning Theory. Springer, 1995.\n[18] T. Zhang. Sequential greedy approximation for certain convex optimization problems. IEEE Trans. Information Theory, 49(3):682–691, 2003.", "award": [], "sourceid": 470, "authors": [{"given_name": "Quoc", "family_name": "Le", "institution": null}, {"given_name": "Alex", "family_name": "Smola", "institution": null}, {"given_name": "S.v.n.", "family_name": "Vishwanathan", "institution": null}]}