{"title": "Efficient Learning using Forward-Backward Splitting", "book": "Advances in Neural Information Processing Systems", "page_first": 495, "page_last": 503, "abstract": "We describe, analyze, and experiment with a new framework for empirical loss minimization with regularization. Our algorithmic framework alternates between two phases. On each iteration we first perform an {\\em unconstrained} gradient descent step. We then cast and solve an instantaneous optimization problem that trades off minimization of a regularization term while keeping close proximity to the result of the first phase. This yields a simple yet effective algorithm for both batch penalized risk minimization and online learning. Furthermore, the two phase approach enables sparse solutions when used in conjunction with regularization functions that promote sparsity, such as $\\ell_1$. We derive concrete and very simple algorithms for minimization of loss functions with $\\ell_1$, $\\ell_2$, $\\ell_2^2$, and $\\ell_\\infty$ regularization. We also show how to construct efficient algorithms for mixed-norm $\\ell_1/\\ell_q$ regularization. We further extend the algorithms and give efficient implementations for very high-dimensional data with sparsity. We demonstrate the potential of the proposed framework in experiments with synthetic and natural datasets.", "full_text": "Efficient Learning using Forward-Backward Splitting

John Duchi
University of California Berkeley
jduchi@cs.berkeley.edu

Yoram Singer
Google
singer@google.com

Abstract

We describe, analyze, and experiment with a new framework for empirical loss minimization with regularization. Our algorithmic framework alternates between two phases. On each iteration we first perform an unconstrained gradient descent step. We then cast and solve an instantaneous optimization problem that trades off minimization of a regularization term while keeping close proximity to the result of the first phase.
This yields a simple yet effective algorithm for both batch penalized risk minimization and online learning. Furthermore, the two phase approach enables sparse solutions when used in conjunction with regularization functions that promote sparsity, such as ℓ1. We derive concrete and very simple algorithms for minimization of loss functions with ℓ1, ℓ2, ℓ2², and ℓ∞ regularization. We also show how to construct efficient algorithms for mixed-norm ℓ1/ℓq regularization. We further extend the algorithms and give efficient implementations for very high-dimensional data with sparsity. We demonstrate the potential of the proposed framework in experiments with synthetic and natural datasets.

1 Introduction

Before we begin, we establish notation for this paper. We denote scalars by lower case letters and vectors by lower case bold letters, e.g. w. The inner product of vectors u and v is denoted ⟨u, v⟩. We use ‖x‖_p to denote the p-norm of the vector x and ‖x‖ as a shorthand for ‖x‖₂.

The focus of this paper is an algorithmic framework for regularized convex programming to minimize the following sum of two functions:

f(w) + r(w),   (1)

where both f and r are convex functions bounded below (so without loss of generality we assume they map into R+). Often, the function f is an empirical loss and takes the form Σ_{i∈S} ℓ_i(w) for a sequence of loss functions ℓ_i : R^n → R+, and r(w) is a regularization term that penalizes excessively complex vectors, for instance r(w) = λ‖w‖_p. This task is prevalent in machine learning, in which a learning problem for decision and prediction problems is cast as a convex optimization problem. To that end, we propose a general and intuitive algorithm to minimize Eq.
(1), focusing especially on derivations for and the use of non-differentiable regularization functions.

Many methods have been proposed to minimize general convex functions such as that in Eq. (1). One of the most general is the subgradient method [1], which is elegant and very simple. Let ∂f(w) denote the subgradient set of f at w, namely, ∂f(w) = {g | ∀v : f(v) ≥ f(w) + ⟨g, v − w⟩}. Subgradient procedures then minimize the function f(w) by iteratively updating the parameter vector w according to the update rule w_{t+1} = w_t − η_t g_t^f, where η_t is a constant or diminishing step size and g_t^f ∈ ∂f(w_t) is an arbitrary vector from the subgradient set of f evaluated at w_t. A slightly more general method than the above is the projected gradient method, which iterates

w_{t+1} = Π_Ω(w_t − η_t g_t^f) = argmin_{w∈Ω} { (1/2)‖w − (w_t − η_t g_t^f)‖₂² },

where Π_Ω(w) is the Euclidean projection of w onto the set Ω. Standard results [1] show that the (projected) subgradient method converges at a rate of O(1/ε²), or equivalently that the error f(w) − f(w*) = O(1/√T), given some simple assumptions on the boundedness of the subdifferential set and Ω (we have omitted constants dependent on ‖∂f‖ or dim(Ω)). Using the subgradient method to minimize Eq. (1) gives simple iterates of the form w_{t+1} = w_t − η_t g_t^f − η_t g_t^r, where g_t^r ∈ ∂r(w_t).

A common problem in subgradient methods is that if r or f is non-differentiable, the iterates of the subgradient method are very rarely at the points of non-differentiability. In the case of regularization functions such as r(w) = ‖w‖₁, however, these points (zeros in the case of the ℓ1-norm) are often the true minima of the function. Furthermore, with ℓ1 and similar penalties, zeros are desirable solutions as they tend to convey information about the structure of the problem being solved [2, 3].

There has been a significant amount of work related to minimizing Eq. (1), especially when the function r is a sparsity-promoting regularizer. We can hardly do justice to the body of prior work, and we provide a few references here to the research we believe is most directly related. The approach we pursue below is known as "forward-backward splitting" or a composite gradient method in the optimization literature and has been independently suggested by [4] in the context of sparse signal reconstruction, where f(w) = ‖y − Aw‖², though they note that the method can apply to general convex f. [5] give proofs of convergence for forward-backward splitting in Hilbert spaces, though without establishing strong rates of convergence. The motivation of their paper is signal reconstruction as well. Similar projected-gradient methods, when the regularization function r is no longer part of the objective function but rather cast as a constraint so that r(w) ≤ λ, are also well known [1]. [6] give a general and efficient projected gradient method for ℓ1-constrained problems. There is also a body of literature on regret analysis for online learning and online convex programming with convex constraints upon which we build [7, 8]. Learning sparse models generally is of great interest in the statistics literature, specifically in the context of consistency and recovery of sparsity patterns through ℓ1 or mixed-norm regularization across multiple tasks [2, 3, 9].

In this paper, we describe a general gradient-based framework, which we call FOBOS, and analyze it in batch and online learning settings. The paper is organized as follows.
In the next section, we begin by introducing and formally defining the method, giving some simple preliminary analysis. We follow the introduction by giving in Sec. 3 rates of convergence for batch (offline) optimization. We then provide bounds for online convex programming and give a convergence rate for stochastic gradient descent. To demonstrate the simplicity and usefulness of the framework, we derive in Sec. 4 algorithms for several different choices of the regularizing function r. We extend these methods to be efficient in very high dimensional settings where the input data is sparse in Sec. 5. Finally, we conclude in Sec. 6 with experiments examining various aspects of the proposed framework, in particular the runtime and sparsity selection performance of the derived algorithms.

2 Forward-Looking Subgradients and Forward-Backward Splitting

In this section we introduce our algorithm, laying the framework for its strategy for online or batch convex programming. We originally named the algorithm Folos as an abbreviation for FOrward-LOoking Subgradient. Our algorithm is a distillation of known approaches for convex programming, in particular the Forward-Backward Splitting method. In order not to confuse readers of the early draft, we attempt to stay close to the original name and use the acronym FOBOS rather than Fobas. FOBOS is motivated by the desire to have the iterates w_t attain points of non-differentiability of the function r. The method alleviates the problems of non-differentiability in cases such as ℓ1-regularization by taking analytical minimization steps interleaved with subgradient steps. Put informally, FOBOS is analogous to the projected subgradient method, but replaces or augments the projection step with an instantaneous minimization problem for which it is possible to derive a closed form solution.
FOBOS is succinct as each iteration consists of the following two steps:

w_{t+1/2} = w_t − η_t g_t^f,   (2)

w_{t+1} = argmin_w { (1/2)‖w − w_{t+1/2}‖₂² + η_{t+1/2} r(w) }.   (3)

In the above, g_t^f is a vector in ∂f(w_t) and η_t is the step size at time step t of the algorithm. The actual value of η_t depends on the specific setting and analysis. The first step thus simply amounts to an unconstrained subgradient step with respect to the function f. In the second step we find a new vector that interpolates between two goals: (i) stay close to the interim vector w_{t+1/2}, and (ii) attain a low complexity value as expressed by r. Note that the regularization function is scaled by an interim step size, denoted η_{t+1/2}. The analyses we describe in the sequel determine the specific value of η_{t+1/2}, which is either η_t or η_{t+1}. A key property of the solution of Eq. (3) is the necessary condition for optimality and gives the reason behind the name FOBOS. Namely, the zero vector must belong to the subgradient set of the objective at the optimum w_{t+1}, that is,

0 ∈ ∂{ (1/2)‖w − w_{t+1/2}‖₂² + η_{t+1/2} r(w) } |_{w = w_{t+1}}.

Since w_{t+1/2} = w_t − η_t g_t^f, the above property amounts to 0 ∈ w_{t+1} − w_t + η_t g_t^f + η_{t+1/2} ∂r(w_{t+1}). This property implies that so long as we choose w_{t+1} to be the minimizer of Eq. (3), we are guaranteed to obtain a vector g_{t+1}^r ∈ ∂r(w_{t+1}) such that 0 = w_{t+1} − w_t + η_t g_t^f + η_{t+1/2} g_{t+1}^r. We can understand this as an update scheme where the new weight vector w_{t+1} is a linear combination of the previous weight vector w_t, a vector from the subgradient set of f at w_t, and a vector from the subgradient of r evaluated at the yet to be determined w_{t+1}. To recap, we can write w_{t+1} as

w_{t+1} = w_t − η_t g_t^f − η_{t+1/2} g_{t+1}^r,   (4)

where g_t^f ∈ ∂f(w_t) and g_{t+1}^r ∈ ∂r(w_{t+1}). Solving Eq. (3) with r above has two main benefits. First, from an algorithmic standpoint, it enables sparse solutions at virtually no additional computational cost. Second, the forward-looking gradient allows us to build on existing analyses and show that the resulting framework enjoys the formal convergence properties of many existing gradient-based and online convex programming algorithms.

3 Convergence and Regret Analysis of FOBOS

In this section we build on known results while using the forward-looking property of FOBOS to provide convergence rate and regret analysis. To derive convergence rates we set η_{t+1/2} properly. As we show in the sequel, it is sufficient to set η_{t+1/2} to η_t or η_{t+1}, depending on whether we are doing online or batch optimization, in order to obtain convergence and low regret bounds. We provide proofs of all theorems in this paper, as well as a few useful technical lemmas, in the appendices, as the main foci of the paper are the simplicity of the method and derived algorithms and their experimental usefulness. The overall proof techniques all rely on the forward-looking property in Eq. (4) and moderately straightforward arguments with convexity and subgradient calculus.

Throughout the section we denote by w* the minimizer of f(w) + r(w). The first bounds we present rely only on the assumption that ‖w*‖ ≤ D, though they are not as tight as those in the sequel. In what follows, define ‖∂f(w)‖ := sup_{g∈∂f(w)} ‖g‖. We begin by deriving convergence results under the fairly general assumption [10, 11] that the subgradients are bounded as follows:

‖∂f(w)‖² ≤ A f(w) + G²,   ‖∂r(w)‖² ≤ A r(w) + G².   (5)

For example, any Lipschitz loss (such as the logistic or hinge/SVM) satisfies the above with A = 0 and G equal to the Lipschitz constant; least squares satisfies Eq. (5) with G = 0 and A = 4.

Theorem 1. Assume the following hold: (i) the norm of any subgradient from ∂f and the norm of any subgradient from ∂r are bounded as in Eq. (5), (ii) the norm of w* is less than or equal to D, (iii) r(0) = 0, and (iv) (1/2)η_t ≤ η_{t+1} ≤ η_t. Then for a constant c ≤ 4, with w₁ = 0 and η_{t+1/2} = η_{t+1},

Σ_{t=1}^T [η_t((1 − cAη_t)f(w_t) − f(w*)) + η_t((1 − cAη_t)r(w_t) − r(w*))] ≤ D² + 7G² Σ_{t=1}^T η_t².

The proof of the theorem is in Appendix A. We also provide in the appendix a few useful corollaries. We provide one corollary below as it underscores that the rate of convergence is approximately 1/√T.

Corollary 2 (Fixed step rate). Assume that the conditions of Thm. 1 hold, that we run FOBOS for a predefined T iterations with η_t = D/(G√(7T)), and that 1 − cAD/(G√(7T)) > 0. Then

min_{t∈{1,...,T}} [f(w_t) + r(w_t)] ≤ (1/T) Σ_{t=1}^T [f(w_t) + r(w_t)] ≤ (f(w*) + r(w*))/(1 − cAD/(G√(7T))) + 3DG/(√T (1 − cAD/(G√(7T)))).

Bounds of the form we present above, where the point minimizing f(w_t) + r(w_t) converges rather than the last point w_T, are standard in subgradient optimization. This occurs since there is no way to guarantee a descent direction when using arbitrary subgradients (see, e.g., [12, Theorem 3.2.2]).

We next derive regret bounds for FOBOS in online settings in which we are given a sequence of functions f_t : R^n → R.
The goal is for the sequence of predictions w_t to attain low regret when compared to a single optimal predictor w*. Formally, let f_t(w) denote the loss suffered on the t-th input loss function when using a predictor w. The regret of an online algorithm which uses w₁, ..., w_t, ... as its predictors w.r.t. a fixed predictor w* while using a regularization function r is

R_{f+r}(T) = Σ_{t=1}^T [f_t(w_t) + r(w_t) − (f_t(w*) + r(w*))].

Ideally, we would like to achieve 0 regret to a stationary w* for arbitrary length sequences. To achieve an online bound for a sequence of convex functions f_t, we modify arguments of [7]. We begin with a slightly different assignment for η_{t+1/2}: specifically, we set η_{t+1/2} = η_t. We have the following theorem, whose proof we provide in Appendix B.

Theorem 3. Assume that ‖w_t − w*‖ ≤ D for all iterations and the norms of the subgradient sets ∂f_t and ∂r are bounded above by G. Let c > 0 be an arbitrary scalar. Then the regret of FOBOS with η_t = c/√t satisfies R_{f+r}(T) ≤ GD + (D²/(2c) + 7G²c)√T.

For slightly technical reasons, the assumption on the boundedness of w_t and the subgradients is not actually restrictive (see Appendix A for details). It is possible to obtain an O(log T) regret bound for FOBOS when the sequence of loss functions f_t(·) or the function r(·) is strongly convex, similar to [8], by using the curvature of f_t or r. While we can extend these results to FOBOS, we omit the extension for lack of space (though we do perform some experiments with such functions). Using the regret analysis for online learning, we can also give convergence rates for stochastic FOBOS, which are O(√T).
Further details are given in Appendix B and the long version of this paper [13].

4 Derived Algorithms

We now give a few variants of FOBOS by considering different regularization functions. The emphasis of the section is on non-differentiable regularization functions that lead to sparse solutions. We also give simple extensions to apply FOBOS to mixed-norm regularization [9] that build on the first part of this section. For lack of space, we mostly give the resulting updates, skipping technical derivations. We would like to note that some of the following results were tacitly given in [4]. First, we make a few changes to notation. To simplify our derivations, we denote by v the vector w_{t+1/2} = w_t − η_t g_t^f and let λ̃ denote η_{t+1/2} · λ. Using this notation the problem given in Eq. (3) can be rewritten as min_w (1/2)‖w − v‖² + λ̃ r(w). Lastly, we let [z]₊ denote max{0, z}.

FOBOS with ℓ1 regularization: The update obtained by choosing r(w) = λ‖w‖₁ is simple and intuitive. The objective is decomposable into a sum of 1-dimensional convex problems of the form min_w (1/2)(w − v)² + λ̃|w|. As a result, the components of the optimal solution w* = w_{t+1} are computed from w_{t+1/2} as

w_{t+1,j} = sign(w_{t+1/2,j}) [|w_{t+1/2,j}| − λ̃]₊ = sign(w_{t,j} − η_t g_{t,j}^f) [|w_{t,j} − η_t g_{t,j}^f| − λη_{t+1/2}]₊.   (6)

Note that this update leads to sparse solutions: whenever the absolute value of a component of w_{t+1/2} is smaller than λ̃, the corresponding component in w_{t+1} is set to zero. Eq. (6) gives a simple online and offline method for minimizing a convex f with ℓ1 regularization.
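Concretely, the full FOBOS loop with the ℓ1 update of Eq. (6) is only a few lines of code. The sketch below is a minimal NumPy illustration of the method (our own, not the authors' implementation); `subgrad_f` stands for any routine returning a vector in ∂f(w), and for simplicity we use the same η for both η_t and η_{t+1/2}:

```python
import numpy as np

def soft_threshold(v, t):
    """Coordinate-wise solution of min_w 0.5*(w - v)^2 + t*|w|, as in Eq. (6)."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def fobos_l1(w0, subgrad_f, lam, step_sizes):
    """FOBOS for f(w) + lam*||w||_1: forward subgradient step (Eq. (2))
    followed by the closed-form backward step (Eqs. (3) and (6))."""
    w = np.asarray(w0, dtype=float)
    for eta in step_sizes:
        w = w - eta * subgrad_f(w)          # w_{t+1/2}
        w = soft_threshold(w, lam * eta)    # w_{t+1}
    return w

# Toy check: for f(w) = 0.5*||w - a||^2 the minimizer of f(w) + lam*||w||_1
# is the soft-thresholded vector sign(a) * [|a| - lam]_+.
a = np.array([2.0, -0.2, 0.0])
w = fobos_l1(np.zeros(3), lambda w: w - a, lam=0.5, step_sizes=[0.5] * 100)
# w approaches [1.5, 0.0, 0.0]
```

Setting `lam = 0` recovers plain subgradient descent; the only extra per-iteration cost of the backward step is the thresholding itself.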
[10] recently proposed and analyzed the same update, terming it the "truncated gradient," though the analysis presented here stems from a more general framework. This update can also be implemented very efficiently when the support of g_t^f is small [10], but we defer details to Sec. 5, where we describe a unified view that facilitates an efficient implementation for all the regularization functions discussed in this paper.

FOBOS with ℓ2² regularization: When r(w) = (λ/2)‖w‖₂², we obtain a very simple optimization problem, min_w (1/2)‖w − v‖² + (λ̃/2)‖w‖². Differentiating the objective and setting the result equal to zero, we have w* − v + λ̃w* = 0, which, using the original notation, yields the update

w_{t+1} = (w_t − η_t g_t^f) / (1 + λ̃).   (7)

Informally, the update simply shrinks w_{t+1} back toward the origin after each gradient-descent step.

FOBOS with ℓ2 regularization: A lesser used regularization function is the ℓ2 norm of the weight vector. By setting r(w) = λ‖w‖ we obtain the following problem: min_w (1/2)‖w − v‖² + λ̃‖w‖. The solution of the above problem must be in the direction of v and takes the form w* = sv where s ≥ 0. The resulting second step of the FOBOS update with ℓ2 regularization amounts to

w_{t+1} = [1 − λ̃/‖w_{t+1/2}‖]₊ (w_t − η_t g_t^f).

ℓ2-regularization results in a zero weight vector under the condition that ‖w_t − η_t g_t^f‖ ≤ λ̃. This condition is rather more stringent for sparsity than the condition for ℓ1, so it is unlikely to hold in high dimensions.
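Both closed-form backward steps above are one-liners; the following sketch (again our own illustration) applies Eq. (7) and the ℓ2 block-shrinkage update to a point v = w_{t+1/2}:

```python
import numpy as np

def prox_l2_squared(v, lam_tilde):
    """Backward step for r(w) = (lam/2)*||w||_2^2, Eq. (7): uniform shrinkage."""
    return v / (1.0 + lam_tilde)

def prox_l2(v, lam_tilde):
    """Backward step for r(w) = lam*||w||_2: shrink the whole vector toward 0,
    zeroing it entirely when ||v||_2 <= lam_tilde."""
    norm = np.linalg.norm(v)
    if norm <= lam_tilde:
        return np.zeros_like(v)
    return (1.0 - lam_tilde / norm) * v

v = np.array([3.0, 4.0])                      # ||v||_2 = 5
w_sq = prox_l2_squared(v, 1.0)                # -> [1.5, 2.0]
w_l2 = prox_l2(v, 2.5)                        # scale by 1 - 2.5/5 -> [1.5, 2.0]
w_zero = prox_l2(np.array([0.3, 0.4]), 1.0)   # ||v||_2 = 0.5 <= 1 -> all zeros
```

Applied row-by-row to a weight matrix, `prox_l2` is exactly the building block for the ℓ1/ℓ2 mixed-norm update discussed below.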
However, it does constitute a very important building block when using a mixed ℓ1/ℓ2-norm as the regularization, as we show in the sequel.

FOBOS with ℓ∞ regularization: We now turn to a less explored regularization function, the ℓ∞ norm of w. Our interest stems from the recognition that there are settings in which it is desirable to consider blocks of variables as a group (see below). We wish to obtain an efficient solution to

min_w (1/2)‖w − v‖² + λ̃‖w‖_∞.   (8)

A solution to the dual form of Eq. (8) is well established. Recalling that the conjugate of the quadratic function is a quadratic function and the conjugate of the ℓ∞ norm is the ℓ1 barrier function, we immediately obtain that the dual of the problem in Eq. (8) is max_α −(1/2)‖α − v‖₂² s.t. ‖α‖₁ ≤ λ̃. Moreover, the vector of dual variables α satisfies the relation α = v − w. [6] describes a linear time algorithm for finding the optimal α for this ℓ1-constrained projection, and the analysis there shows the optimal solution to Eq. (8) is w_{t+1,j} = sign(w_{t+1/2,j}) min{|w_{t+1/2,j}|, θ}. The optimal solution satisfies θ = 0 iff ‖w_{t+1/2}‖₁ ≤ λ̃, and otherwise θ > 0 and can be found in O(n) steps.

Mixed norms: We saw above that when using either the ℓ2 or the ℓ∞ norm as the regularizer we obtain an all zeros vector if ‖w_{t+1/2}‖₂ ≤ λ̃ or ‖w_{t+1/2}‖₁ ≤ λ̃, respectively. This phenomenon can be useful. For example, in multiclass categorization problems each class s may be associated with a different weight vector w^s. The prediction for an instance x is a vector (⟨w¹, x⟩, ..., ⟨w^k, x⟩), where k is the number of classes, and the predicted class is argmax_j ⟨w^j, x⟩.
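The ℓ∞ backward step just described can be sketched via its dual. The code below is our own illustration; note that for brevity it uses a sort-based ℓ1-ball projection, which runs in O(n log n) rather than the linear time achievable with the algorithm of [6]:

```python
import numpy as np

def project_l1_ball(v, z):
    """Euclidean projection of v onto {a : ||a||_1 <= z} (sort-based variant)."""
    if np.abs(v).sum() <= z:
        return v.copy()
    u = np.sort(np.abs(v))[::-1]
    css = np.cumsum(u)
    idx = np.arange(1, len(u) + 1)
    rho = np.nonzero(u * idx > css - z)[0][-1]
    theta = (css[rho] - z) / (rho + 1.0)
    return np.sign(v) * np.maximum(np.abs(v) - theta, 0.0)

def prox_linf(v, lam_tilde):
    """Backward step for r(w) = lam*||w||_inf, Eq. (8).
    By duality the optimal alpha is the projection of v onto the l1-ball
    of radius lam_tilde, and w = v - alpha, i.e. each coordinate of v is
    clipped at a common threshold theta."""
    return v - project_l1_ball(v, lam_tilde)

v = np.array([3.0, -1.0, 0.5])
w = prox_linf(v, 3.5)                          # clipped at theta = 1/3
w0 = prox_linf(np.array([0.5, -0.5]), 2.0)     # ||v||_1 <= lam_tilde -> w = 0
```

The second call illustrates the θ = 0 case: when ‖v‖₁ ≤ λ̃ the whole block collapses to zero.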
Since all the weight vectors operate over the same instance space, it may be beneficial to tie the weights corresponding to the same input feature: we would like to zero the row of weights w_j^1, ..., w_j^k simultaneously. Formally, let W represent an n × k matrix where the jth column of the matrix is the weight vector w^j associated with class j. Then the ith row contains the weights of the ith feature for each class. The mixed ℓr/ℓs-norm [9] of W is obtained by computing the ℓs-norm of each row of W and then applying the ℓr-norm to the resulting n dimensional vector, for instance, ‖W‖_{ℓ1/ℓ∞} = Σ_{i=1}^n max_j |W_{i,j}|. In a mixed-norm regularized optimization problem, we seek the minimizer of f(W) + λ‖W‖_{ℓr/ℓs}. Given the specific variants of norms described above, the FOBOS update for the ℓ1/ℓ∞ and the ℓ1/ℓ2 mixed-norms is readily available. Let w̄^s be the sth row of W. Analogously to standard norm-based regularization, we use the shorthand V = W_{t+1/2}. For the ℓ1/ℓp mixed-norm, we need to solve

min_W (1/2)‖W − V‖²_Fr + λ̃‖W‖_{ℓ1/ℓp} ≡ min_{w̄¹,...,w̄ⁿ} Σ_{i=1}^n ( (1/2)‖w̄^i − v̄^i‖₂² + λ̃‖w̄^i‖_p ),   (9)

where v̄^i is the ith row of V. It is immediate to see that the problem given in Eq. (9) is decomposable into n separate problems of dimension k, each of which can be solved by the procedures described in the prequel. The end result of solving these types of mixed-norm problems is a sparse matrix with numerous zero rows. We demonstrate the merits of FOBOS with mixed-norms in Sec.
6.

5 Efficient implementation in high dimensions

In many settings, especially online learning, the weight vector w_t and the gradients g_t^f reside in a very high-dimensional space, but only a relatively small number of the components of g_t^f are non-zero. Such settings are prevalent, for instance, in text-based applications: in text categorization, the full dimension corresponds to the dictionary or set of tokens that is being employed, while each gradient is typically computed from a single or a few documents, each of which contains words and bigrams constituting only a small subset of the full dictionary. The need to cope with gradient sparsity becomes further pronounced in mixed-norm problems, as a single component of the gradient may correspond to an entire row of W. Updating the entire matrix because a few entries of g_t^f are non-zero is clearly undesirable. Thus, we would like to extend our methods to cope efficiently with gradient sparsity. For concreteness, we focus in this section on the efficient implementation of ℓ1, ℓ2, and ℓ∞ regularization, since the extension to mixed-norms (as in the previous section) is straightforward. We postpone the proof of the following proposition to Appendix C.

Proposition 4. Let w_T be the end result of solving a succession of T self-similar optimization problems for t = 1, ..., T,

P.1 : w_t = argmin_w (1/2)‖w − w_{t−1}‖² + λ_t‖w‖_q.   (10)

Let w* be the optimal solution of the following optimization problem,

P.2 : w* = argmin_w (1/2)‖w − w₀‖² + (Σ_{t=1}^T λ_t)‖w‖_q.   (11)

For q ∈ {1, 2, ∞} the vectors w_T and w* are identical.

The algorithmic consequence of Proposition 4 is that it is possible to perform a lazy update on each iteration by omitting the terms of w_t (or whole rows of the matrix W_t when using mixed-norms) that are outside the support of g_t^f, the gradient of the loss at iteration t. We do need to maintain the step sizes used on each iteration and have them readily available on future rounds when we newly update coordinates of w or W. Let Λ_t denote the sum of the step sizes times regularization multipliers λη_t used from round 1 through t. Then a simple algebraic manipulation yields that instead of solving w_{t+1} = argmin_w { (1/2)‖w − w_t‖₂² + λη_t‖w‖_q } repeatedly when w_t is not changing, we can simply cache the last time t₀ that w (or a coordinate in w or a row from W) was updated and, when it is needed, solve w_{t+1} = argmin_w { (1/2)‖w − w_t‖₂² + (Λ_t − Λ_{t₀})‖w‖_q }. The advantage of the lazy evaluation is pronounced when using mixed-norm regularization as it lets us avoid updating entire rows so long as the row index corresponds to a zero entry of the gradient g_t^f. In sum, at the expense of keeping a time stamp t for each entry of w or row of W and maintaining the cumulative sums Λ₁, Λ₂, ..., we get O(k) updates of w when the gradient g_t^f has only k non-zero components.

6 Experiments

In this section we compare FOBOS to state-of-the-art optimizers to demonstrate its relative merits and weaknesses.
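Proposition 4 translates into a simple lazy-update scheme. The sketch below is our own construction for the ℓ1 case (q = 1): it keeps the cumulative penalty Λ and a per-coordinate stamp, and touches only the coordinates appearing in the (sparse) gradient:

```python
import numpy as np

def soft(x, t):
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

class LazyFobosL1:
    """FOBOS with r(w) = lam*||w||_1 and lazy shrinkage (Proposition 4, q = 1).
    Only coordinates in the support of the gradient are touched per round."""

    def __init__(self, dim, lam):
        self.w = np.zeros(dim)
        self.lam = lam
        self.cum = 0.0              # Lambda_t: running sum of lam * eta
        self.stamp = np.zeros(dim)  # Lambda value at each coordinate's last update

    def step(self, idx, grad, eta):
        """idx: indices with non-zero gradient; grad: their gradient values."""
        for j, g in zip(idx, grad):
            # catch up on the shrinkage this coordinate missed (Prop. 4)
            self.w[j] = soft(self.w[j], self.cum - self.stamp[j])
            # forward step plus the current round's shrinkage
            self.w[j] = soft(self.w[j] - eta * g, self.lam * eta)
            self.stamp[j] = self.cum + self.lam * eta
        self.cum += self.lam * eta

    def finalize(self):
        """Apply all outstanding shrinkage so self.w is the true iterate."""
        self.w = soft(self.w, self.cum - self.stamp)
        self.stamp[:] = self.cum
        return self.w

opt = LazyFobosL1(dim=3, lam=0.5)
opt.step([0], [-2.0], eta=0.1)   # only coordinate 0 is touched
opt.step([1], [-1.0], eta=0.1)   # only coordinate 1 is touched
w = opt.finalize()
# identical to running the dense update soft(w - eta*g, lam*eta) over every
# coordinate each round: w == [0.10, 0.05, 0.0]
```

The per-step cost is proportional to the gradient's support size rather than the full dimension; for mixed norms the same bookkeeping is applied per row of W with the appropriate ℓ_p shrinkage.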
We perform more substantial experiments in the full version of the paper [13].

ℓ2² and ℓ1-regularized experiments: We performed experiments using FOBOS to solve both ℓ1 and ℓ2²-regularized learning problems. For the ℓ2²-regularized experiments, we compared FOBOS to Pegasos [14], a fast projected gradient solver for SVM. Pegasos was originally implemented and evaluated on SVM-like problems by using the hinge-loss as the empirical loss function along with an ℓ2² regularization term, but it can be straightforwardly extended to the binary logistic loss function. We thus experimented with both

f(w) = Σ_{i=1}^m [1 − y_i⟨x_i, w⟩]₊ (hinge)   and   f(w) = Σ_{i=1}^m log(1 + e^{−y_i⟨x_i, w⟩}) (logistic)

as loss functions. To generate data for our experiments, we chose a vector w with entries distributed normally with 0 mean and unit variance, while randomly zeroing 50% of the entries in the vector. The examples x_i ∈ R^n were also chosen at random with entries normally distributed. To generate target values, we set y_i = sign(⟨x_i, w⟩), and flipped the sign of 10% of the examples to add label noise. In all experiments, we used 1000 training examples of dimension 400.

Figure 1: Comparison of FOBOS with Pegasos on the problems of logistic regression (left and right) and SVM (middle). The rightmost plot shows the performance of the algorithms without projection.

The graphs of Fig. 1 show (on a log-scale) the regularized empirical loss of the algorithms minus the optimal value of the objective function. These results were averaged over 20 independent runs of the algorithms. In all experiments with the regularizer (λ/2)‖w‖², we used the step size η_t = 1/(λt) to achieve logarithmic regret. The two left graphs of Fig. 1 show that FOBOS performs comparably to Pegasos on the logistic loss (left figure) and hinge (SVM) loss (middle figure). Both algorithms quickly approach the optimal value. In these experiments we let both Pegasos and FOBOS employ a projection after each gradient step into a 2-norm ball containing w* (see [14]). However, in the experiment corresponding to the rightmost plot of Fig. 1, we eliminated this additional projection step and ran the algorithms with the logistic loss. In this case, FOBOS slightly outperforms Pegasos. We hypothesize that the slightly faster rate of FOBOS is due to the explicit shrinkage that FOBOS performs in the ℓ2² update (see Eq. (7)).

In the next experiment, whose results are given in Fig. 2, we solved ℓ1-regularized logistic regression problems. We compared FOBOS to a simple subgradient method, where the subgradient of the λ‖w‖₁ term is simply λ sign(w), and a fast interior point (IP) method which was designed specifically for solving ℓ1-regularized logistic regression [15]. On the left side of Fig.
2 we show the objective function (empirical loss plus the ℓ1 regularization term) obtained by each of the algorithms minus the optimal objective value. We again used 1000 training examples of dimension 400. The learning rate was set to η_t ∝ 1/√t. The standard subgradient method is clearly much slower than the other two methods even though we chose the initial step size for which the subgradient method converged the fastest. Furthermore, the subgradient method does not achieve any sparsity along its entire run. FOBOS quickly gets close to the optimal value of the objective function, but eventually the specialized IP method's asymptotically faster convergence causes it to surpass FOBOS. In order to obtain a weight vector w_t such that f(w_t) − f(w*) ≤ 10⁻², FOBOS works very well, though the IP method enjoys a faster convergence rate when the weight vector is very close to the optimal solution. However, the IP algorithm was specifically designed to minimize empirical logistic loss with ℓ1 regularization, whereas FOBOS enjoys a broad range of applicable settings.

The middle plot in Fig. 2 shows the sparsity levels (fraction of non-zero weights) achieved by FOBOS as a function of the number of iterations of the algorithm. Each line represents a different synthetic experiment as λ is modified to give more or less sparsity to the solution vector w*. The results show that FOBOS quickly selects the sparsity pattern of w*, and the level of sparsity persists throughout its execution. We found this sparsity pattern common to non-stochastic versions of FOBOS we tested.

Mixed-norm experiments: Our experiments with mixed-norm regularization (ℓ1/ℓ2 and ℓ1/ℓ∞) focus mostly on sparsity rather than on the speed of minimizing the objective.
Our restricted focus is a consequence of the relative paucity of benchmark methods for learning problems with mixed-norm regularization. Our methods, however, as described in Sec. 4, are quite simple to implement, and we believe they could serve as benchmarks for other methods for solving mixed-norm problems.

Our experiments compared multiclass classification with ℓ1, ℓ1/ℓ2, and ℓ1/ℓ∞ regularization on the MNIST handwritten digit database and the StatLog Landsat Satellite dataset [16]. The MNIST database consists of 60,000 training examples and a 10,000 example test set with 10 classes. Each digit is a 28×28 gray-scale image represented as a 784-dimensional vector.

Figure 2: Left: performance of FOBOS, a subgradient method, and an interior point method on ℓ1-regularized logistic regression. Middle: sparsity level achieved by FOBOS along its run.

Figure 3: Left: FOBOS sparsity and test error for the LandSat dataset with ℓ1-regularization. Right: FOBOS sparsity and test error for the MNIST dataset with ℓ1/ℓ2-regularization.

Linear classifiers
Thus, rather than learning weights for the original features, we\nlearn the weights for classi\ufb01er with Gaussian kernels, where value of the jth feature for the ith\n2 kzi\u2212zjk2. For the LandSat dataset we attempt to classify 3 \u00d7 3\nexample is xij = K(zi, zj) = e\u2212 1\nneighborhoods of pixels in a satellite image as a particular type of ground, and we expanded the\ninput 36 features into 1296 features by taking the product of all features.\n\nIn the left plot of Fig. 3, we show the test set error and row sparsity in W as a function of training\ntime (number of single-example gradient calculations) for the \u21131-regularized multiclass logistic loss\nwith 720 training examples. The green lines show results for using all 720 examples to calculate\nthe gradient, black using 20% of the examples, and blue using 10% of the examples to perform\nstochastic gradient. Each used the same learning rate \u03b7t, and the reported results are averaged\nover 5 independent runs with different training data. The righthand \ufb01gure shows a similar plot\nbut for MNIST with 10000 training examples and \u21131/\u21132-regularization. The objective value in\ntraining has a similar contour to the test loss. It is interesting to note that very quickly, FOBOS\nwith stochastic gradient descent gets to its minimum test classi\ufb01cation error, and as the training\nset size increases this behavior is consistent. However, the deterministic version increases the level\nof sparsity throughout its run, while the stochastic-gradient version has highly variable sparsity\nlevels and does not give solutions as sparse as the deterministic counterpart. 
The slowness of the non-stochastic gradient method mitigates this effect for the larger sample size on MNIST in the right figure, but for longer training times we do indeed see similar behavior.

For a comparison of the different regularization approaches, we report in Table 1 the test error as a function of the row sparsity of the learned matrix W. For the LandSat data, we see that the block ℓ1/ℓ2 regularizer yields better performance for a given level of structural sparsity. However, on the MNIST data the ℓ1 and ℓ1/ℓ2 regularizers achieve comparable performance for each level of structural sparsity. Moreover, for a given level of structural sparsity, the ℓ1-regularized solution matrix W attains significantly higher overall sparsity: roughly 90% of the entries of each non-zero row are zero. The performance on the different datasets might indicate that structural sparsity is effective only when the set of parameters indeed exhibits natural grouping.

Table 1: LandSat (left) and MNIST (right) classification error versus sparsity

                      LandSat                      MNIST
% Non-zero    ℓ1     ℓ1/ℓ2    ℓ1/ℓ∞       ℓ1     ℓ1/ℓ2    ℓ1/ℓ∞
     5       .43     .29      .40         .37    .36      .47
    10       .30     .25      .30         .26    .26      .31
    20       .26     .22      .26         .15    .15      .24
    40       .22     .19      .22         .08    .08      .16

References

[1] D.P. Bertsekas. Nonlinear Programming. Athena Scientific, 1999.

[2] P. Zhao and B. Yu. On model selection consistency of Lasso. Journal of Machine Learning Research, 7:2541-2567, 2006.

[3] N. Meinshausen and P. Bühlmann. High dimensional graphs and variable selection with the Lasso. Annals of Statistics, 34:1436-1462, 2006.

[4] S. Wright, R. Nowak, and M. Figueiredo. Sparse reconstruction by separable approximation. In IEEE International Conference on Acoustics, Speech, and Signal Processing, pages 3373-3376, 2008.

[5] P.
Combettes and V. Wajs. Signal recovery by proximal forward-backward splitting. Multiscale Modeling and Simulation, 4(4):1168-1200, 2005.

[6] J. Duchi, S. Shalev-Shwartz, Y. Singer, and T. Chandra. Efficient projections onto the ℓ1-ball for learning in high dimensions. In Proceedings of the 25th International Conference on Machine Learning, 2008.

[7] M. Zinkevich. Online convex programming and generalized infinitesimal gradient ascent. In Proceedings of the Twentieth International Conference on Machine Learning, 2003.

[8] E. Hazan, A. Kalai, S. Kale, and A. Agarwal. Logarithmic regret algorithms for online convex optimization. In Proceedings of the Nineteenth Annual Conference on Computational Learning Theory, 2006.

[9] G. Obozinski, M. Wainwright, and M. Jordan. High-dimensional union support recovery in multivariate regression. In Advances in Neural Information Processing Systems 22, 2008.

[10] J. Langford, L. Li, and T. Zhang. Sparse online learning via truncated gradient. In Advances in Neural Information Processing Systems 22, 2008.

[11] S. Shalev-Shwartz and A. Tewari. Stochastic methods for ℓ1-regularized loss minimization. In Proceedings of the 26th International Conference on Machine Learning, 2009.

[12] Y. Nesterov. Introductory Lectures on Convex Optimization. Kluwer Academic Publishers, 2004.

[13] J. Duchi and Y. Singer. Efficient online and batch learning using forward-backward splitting. Journal of Machine Learning Research, 10, 2009. In press.

[14] S. Shalev-Shwartz, Y. Singer, and N. Srebro. Pegasos: Primal estimated sub-gradient solver for SVM. In Proceedings of the 24th International Conference on Machine Learning, 2007.

[15] K. Koh, S.J. Kim, and S. Boyd. An interior-point method for large-scale ℓ1-regularized logistic regression. Journal of Machine Learning Research, 8:1519-1555, 2007.

[16] D. Spiegelhalter and C. Taylor.
Machine Learning, Neural and Statistical Classification. Ellis Horwood, 1994.

[17] R.T. Rockafellar. Convex Analysis. Princeton University Press, 1970.