{"title": "Dual Averaging Method for Regularized Stochastic Learning and Online Optimization", "book": "Advances in Neural Information Processing Systems", "page_first": 2116, "page_last": 2124, "abstract": "We consider regularized stochastic learning and online optimization problems, where the objective function is the sum of two convex terms: one is the loss function of the learning task, and the other is a simple regularization term such as L1-norm for sparsity. We develop a new online algorithm, the regularized dual averaging method, that can explicitly exploit the regularization structure in an online setting. In particular, at each iteration, the learning variables are adjusted by solving a simple optimization problem that involves the running average of all past subgradients of the loss functions and the whole regularization term, not just its subgradient. This method achieves the optimal convergence rate and often enjoys a low complexity per iteration similar as the standard stochastic gradient method. Computational experiments are presented for the special case of sparse online learning using L1-regularization.", "full_text": "Dual Averaging Method for Regularized Stochastic\n\nLearning and Online Optimization\n\nMicrosoft Research, Redmond, WA 98052\n\nLin Xiao\n\nlin.xiao@microsoft.com\n\nAbstract\n\nWe consider regularized stochastic learning and online optimization problems,\nwhere the objective function is the sum of two convex terms: one is the loss func-\ntion of the learning task, and the other is a simple regularization term such as\n\u21131-norm for promoting sparsity. We develop a new online algorithm, the regular-\nized dual averaging (RDA) method, that can explicitly exploit the regularization\nstructure in an online setting. 
In particular, at each iteration, the learning variables are adjusted by solving a simple optimization problem that involves the running average of all past subgradients of the loss functions and the whole regularization term, not just its subgradient. Computational experiments show that the RDA method can be very effective for sparse online learning with ℓ1-regularization.

1 Introduction

In machine learning, online algorithms operate by repetitively drawing random examples, one at a time, and adjusting the learning variables using simple calculations that are usually based on that single example only. The low computational complexity (per iteration) of online algorithms is often associated with slow convergence and low accuracy in solving the underlying optimization problems. As argued in [1, 2], the combination of low complexity and low accuracy, together with other tradeoffs in statistical learning theory, still makes online algorithms a favorite choice for solving large-scale learning problems. Nevertheless, traditional online algorithms, such as stochastic gradient descent (SGD), have limited capability of exploiting problem structure when solving regularized learning problems. As a result, their low accuracy often makes it hard to obtain the desired regularization effects, e.g., sparsity under ℓ1-regularization. In this paper, we develop a new online algorithm, the regularized dual averaging (RDA) method, that can explicitly exploit the regularization structure in an online setting.
We first describe the two types of problems addressed by the RDA method.

1.1 Regularized stochastic learning

The regularized stochastic learning problems we consider are of the following form:

    minimize_w   φ(w) ≜ E_z f(w, z) + Ψ(w)    (1)

where w ∈ R^n is the optimization variable (called weights in many learning problems), z = (x, y) is an input-output pair drawn from an (unknown) underlying distribution, f(w, z) is the loss function of using w and x to predict y, and Ψ(w) is a regularization term. We assume f(w, z) is convex in w for each z, and Ψ(w) is a closed convex function. Examples of the loss function f(w, z) include:

∙ Least squares: x ∈ R^n, y ∈ R, and f(w, (x, y)) = (y − wᵀx)².
∙ Hinge loss: x ∈ R^n, y ∈ {+1, −1}, and f(w, (x, y)) = max{0, 1 − y(wᵀx)}.
∙ Logistic regression: x ∈ R^n, y ∈ {+1, −1}, and f(w, (x, y)) = log(1 + exp(−y(wᵀx))).

Examples of the regularization term Ψ(w) include:

∙ ℓ1-regularization: Ψ(w) = λ‖w‖₁ with λ > 0.
With ℓ1-regularization, we hope to get a relatively sparse solution, i.e., with many entries of w being zero.

∙ ℓ2²-regularization: Ψ(w) = (σ/2)‖w‖₂² for some σ > 0.
∙ Convex constraints: Ψ(w) is the indicator function of a closed convex set C, i.e., Ψ(w) = 0 if w ∈ C and +∞ otherwise.

In this paper, we focus on online algorithms that process samples sequentially as they become available. Suppose at time t, we have the most up-to-date weight vector w_t. Whenever z_t is available, we can evaluate the loss f(w_t, z_t) and a subgradient g_t ∈ ∂f(w_t, z_t) (here ∂f(w, z) denotes the subdifferential of f with respect to w). Then we compute the new weight w_{t+1} based on this information. For solving the problem (1), the standard stochastic gradient descent (SGD) method takes the form

    w_{t+1} = w_t − α_t (g_t + ξ_t),    (2)

where α_t is an appropriate stepsize and ξ_t is a subgradient of Ψ at w_t. The SGD method has been very popular in the machine learning community due to its capability of scaling with large data sets and the good generalization performance observed in practice (e.g., [3, 4]).

Nevertheless, a main drawback of the SGD method is its limited capability of exploiting problem structure, especially for regularized learning problems. As a result, its low accuracy (compared with interior-point methods for batch optimization) often makes it hard to obtain the desired regularization effect. An important example and motivation for this paper is ℓ1-regularized stochastic learning, where Ψ(w) = λ‖w‖₁.
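As a concrete illustration of the update (2), here is a minimal sketch of SGD with an ℓ1 subgradient term (our own code, not from the paper; the least-squares loss, dimensions, stepsize, and variable names are arbitrary illustrative choices):

```python
import numpy as np

def sgd_step(w, x, y, lam, alpha):
    """One SGD step (2) for l1-regularized least squares.

    g  is the gradient of the loss (y - w'x)^2 at w;
    xi is a subgradient of lam*||w||_1 (we use lam*sign(w)).
    """
    g = -2.0 * (y - w @ x) * x
    xi = lam * np.sign(w)
    return w - alpha * (g + xi)

w = np.zeros(3)
rng = np.random.default_rng(0)
for t in range(1, 101):
    x = rng.normal(size=3)
    y = x[0]                      # target depends on the first feature only
    w = sgd_step(w, x, y, lam=0.1, alpha=0.01)
# All three coordinates end up (small but) nonzero: subtracting
# alpha*(g + xi) lands a coordinate exactly on zero only by coincidence.
print(np.count_nonzero(w))
```

Because the update subtracts α_t(g_t + ξ_t) directly, a coordinate becomes exactly zero only by numerical coincidence, which is the sparsity problem discussed next.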
Even with a relatively large λ, the SGD method (2) usually does not generate sparse solutions, because only in very rare cases do two floating-point numbers add up to exactly zero. Various methods for rounding or truncating the solutions have been proposed to generate sparse solutions (e.g., [5]).

Inspired by recently developed first-order methods for optimizing composite functions [6, 7, 8], the regularized dual averaging (RDA) method we develop exploits the full regularization structure at each online iteration. In other words, at each iteration, the learning variables are adjusted by solving a simple optimization problem that involves the whole regularization term, not just its subgradients. For many practical learning problems, we are actually able to find a closed-form solution for the auxiliary optimization problem at each iteration. This means that the computational complexity per iteration is O(n), the same as the SGD method. Moreover, the RDA method converges to the optimal solution of (1) with the optimal rate O(1/√t). If the regularization function Ψ(w) is strongly convex, we obtain the better rate O(ln t / t) by setting appropriate parameters in the algorithm.

1.2 Regularized online optimization

In online optimization (e.g., [9]), we make a sequence of decisions w_t, for t = 1, 2, 3, . . .. At each time t, a previously unknown cost function f_t is revealed, and we encounter a loss f_t(w_t). We assume that the functions f_t are convex for all t ≥ 1. The goal of an online algorithm is to ensure that the total cost up to each time t, Σ_{τ=1}^t f_τ(w_τ), is not much larger than min_w Σ_{τ=1}^t f_τ(w), the smallest total cost of any fixed decision w in hindsight. The difference between these two costs is called the regret of the online algorithm.
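As a toy numerical check of this definition (our own sketch, not from the paper; we take Ψ = 0 here, and the quadratic costs f_t(w) = (w − c_t)² and the decision sequence are arbitrary), the regret of a given decision sequence can be computed directly:

```python
import numpy as np

# Costs f_t(w) = (w - c_t)^2 with targets c_t; decisions chosen online.
targets = np.array([1.0, 3.0, 2.0, 2.0])
decisions = np.array([0.0, 1.0, 2.0, 2.0])   # e.g. "play the mean of past targets"

alg_cost = np.sum((decisions - targets) ** 2)
# For squared loss, the best fixed decision in hindsight is the mean of targets.
best_w = targets.mean()
best_cost = np.sum((best_w - targets) ** 2)
regret = alg_cost - best_cost
print(regret)   # 5.0 - 2.0 = 3.0
```

A good online algorithm keeps this quantity growing sublinearly in t, so the average extra cost per round vanishes.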
Applications of online optimization include online prediction of time series and sequential investment (e.g., [10]).

In regularized online optimization, we add to each cost function a convex regularization function Ψ(w). For any fixed decision variable w, consider the regret

    R_t(w) ≜ Σ_{τ=1}^t ( f_τ(w_τ) + Ψ(w_τ) ) − Σ_{τ=1}^t ( f_τ(w) + Ψ(w) ).    (3)

The RDA method we develop can also be used to solve the above regularized online optimization problem, and it has an O(√t) regret bound. Again, if the regularization term Ψ(w) is strongly convex, the regret bound is O(ln t). However, the main advantage of the RDA method, compared with other online algorithms, is its explicit regularization effect at each iteration.

Algorithm 1 Regularized dual averaging (RDA) method

input:
∙ a strongly convex function h(w) with modulus 1 on dom Ψ, and w_0 ∈ R^n, such that

    w_0 = arg min_w h(w) ∈ Arg min_w Ψ(w).    (4)

∙ a pre-determined nonnegative and nondecreasing sequence β_t for t ≥ 1.

initialize: w_1 = w_0, ḡ_0 = 0.
for t = 1, 2, 3, . . . do
1. Given the function f_t, compute a subgradient g_t ∈ ∂f_t(w_t).
2. Update the average subgradient ḡ_t:

    ḡ_t = ((t − 1)/t) ḡ_{t−1} + (1/t) g_t.    (5)

3.
Compute the next iterate w_{t+1}:

    w_{t+1} = arg min_w { ⟨ḡ_t, w⟩ + Ψ(w) + (β_t/t) h(w) }.    (6)

end for

2 Regularized dual averaging method

In this section, we present the generic RDA method (Algorithm 1) for solving regularized stochastic learning and online optimization problems, and give some concrete examples. To unify notation, we write f(w, z_t) as f_t(w) for stochastic learning problems. The RDA method uses an auxiliary strongly convex function h(w). A function h is called strongly convex with respect to a norm ‖·‖ if there exists a constant σ > 0 such that

    h(αw + (1 − α)u) ≤ α h(w) + (1 − α) h(u) − (σ/2) α(1 − α) ‖w − u‖²    (7)

for all w, u ∈ dom h. The constant σ is called the convexity parameter, or the modulus of strong convexity. In equation (4), Arg min_w Ψ(w) denotes the convex set of minimizers of Ψ.

In Algorithm 1, step 1 computes a subgradient of f_t at w_t, which is standard for all (sub)gradient-based methods. Step 2 is the online version of computing the average gradient ḡ_t (the dual average). In step 3, we assume that the functions Ψ and h are simple, meaning that the minimization problem in (6) can be solved with little effort, especially if we are able to find a closed-form solution for w_{t+1}. This assumption seems to be restrictive.
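For concreteness, the loop of Algorithm 1 can be sketched as follows (our own illustrative code, not from the paper). It takes the simplest unregularized case Ψ(w) = 0 with h(w) = ½‖w‖₂² and β_t = γ√t, for which step (6) reduces to the closed form w_{t+1} = −(√t/γ) ḡ_t; the toy quadratic cost at the end is an arbitrary example:

```python
import numpy as np

def rda(subgradient, n, T, gamma=1.0):
    """Generic RDA loop (Algorithm 1) for Psi(w) = 0, h(w) = 0.5*||w||^2.

    subgradient(w, t) must return g_t in the subdifferential of f_t at w.
    With beta_t = gamma*sqrt(t), step (6) becomes w_{t+1} = -(sqrt(t)/gamma)*gbar_t.
    """
    w = np.zeros(n)                         # w_1 = w_0 = argmin h
    gbar = np.zeros(n)                      # gbar_0 = 0
    for t in range(1, T + 1):
        g = subgradient(w, t)               # step 1
        gbar = ((t - 1) * gbar + g) / t     # step 2, running average (5)
        w = -(np.sqrt(t) / gamma) * gbar    # step 3, closed form of (6)
    return w

# Toy example: f_t(w) = 0.5*||w - e1||^2 for all t, so g_t = w_t - e1.
target = np.array([1.0, 0.0, 0.0])
w = rda(lambda w, t: w - target, n=3, T=2000)
print(w)   # approaches the minimizer e1
```

On this toy quadratic, the iterates approach the minimizer at roughly the O(1/√t) rate analyzed later; richer choices of Ψ only change the minimization in step 3.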
But the following examples show that this is indeed the case for many important learning problems in practice.

If the regularization function Ψ(w) has convexity parameter σ = 0 (i.e., it is not strongly convex), we can choose a parameter γ > 0 and use the sequence

    β_t = γ√t,   t = 1, 2, 3, . . .    (8)

to obtain an O(1/√t) convergence rate for stochastic learning, or an O(√t) regret bound for online optimization. The formal convergence theorems are given in Section 3. Here are some examples:

∙ Nesterov's dual averaging method. Let Ψ(w) be the indicator function of a closed convex set C. This recovers the method of [11]: w_{t+1} = arg min_{w∈C} { ⟨ḡ_t, w⟩ + (γ/√t) h(w) }.

∙ ℓ1-regularization: Ψ(w) = λ‖w‖₁ for some λ > 0. In this case, let w_0 = 0 and

    h(w) = (1/2)‖w‖₂² + ρ‖w‖₁,

where ρ ≥ 0 is a sparsity-enhancing parameter. The solution to (6) can be found entrywise, for i = 1, . . . , n, as

    w^{(i)}_{t+1} = { 0                                                  if |ḡ^{(i)}_t| ≤ λ_t^RDA,
                      −(√t/γ) ( ḡ^{(i)}_t − λ_t^RDA sign(ḡ^{(i)}_t) )    otherwise,    (9)

where λ_t^RDA = λ + γρ/√t. Notice that the truncating threshold λ_t^RDA is at least as large as λ. This is the main difference of our method from related work; see Section 4.

If the regularization function Ψ(w) has convexity parameter σ > 0, we can use any nonnegative, nondecreasing sequence {β_t}_{t≥1} that is dominated by ln t, to obtain an O(ln t / t) convergence rate for stochastic learning, or an O(ln t) regret bound for online optimization (see Section 3). For simplicity, in the following examples, we use β_t = 0 for all t ≥ 1, and we do not need h(w).

∙ Mixed ℓ1/ℓ2²-regularization. Let Ψ(w) = λ‖w‖₁ + (σ/2)‖w‖₂² with λ, σ > 0. Then, for i = 1, . . . , n,

    w^{(i)}_{t+1} = { 0                                          if |ḡ^{(i)}_t| ≤ λ,
                      −(1/σ) ( ḡ^{(i)}_t − λ sign(ḡ^{(i)}_t) )    otherwise.

Of course, setting λ = 0 gives the algorithm for pure ℓ2²-regularization.
∙ Kullback-Leibler (KL) divergence regularization: Ψ(w) = λ D_KL(w‖p), where w lies in the standard simplex, p is a given probability distribution, and

    D_KL(w‖p) ≜ Σ_{i=1}^n w^{(i)} ln( w^{(i)} / p^{(i)} ).

Note that D_KL(w‖p) is strongly convex with respect to ‖w‖₁ with modulus 1 (e.g., [12]). In this case,

    w^{(i)}_{t+1} = (1/Z_{t+1}) p^{(i)} exp( −(1/λ) ḡ^{(i)}_t ),

where Z_{t+1} is a normalization parameter such that Σ_{i=1}^n w^{(i)}_{t+1} = 1.

3 Regret bounds and convergence rates

We first give bounds on the regret R_t(w) defined in (3), when the RDA method is used for solving the regularized online optimization problem. To simplify notation, we define the following sequence:

    Δ_t ≜ (β_0 − β_1) h(w_2) + β_t D² + (G²/2) Σ_{τ=0}^{t−1} 1/(στ + β_τ),   t = 1, 2, 3, . . . ,    (10)

where D and G are some given constants, σ is the convexity parameter of the regularization function Ψ(w), and {β_τ}_{τ≥1} is the input sequence to the RDA method, which is nonnegative and nondecreasing. Notice that we just introduced an extra parameter β_0. We require β_0 > 0 to avoid blowup of the first term (when τ = 0) in the summation in (10). This parameter does not appear in Algorithm 1; instead, it is solely for the convenience of convergence analysis. In fact, whenever β_1 > 0, we can set β_0 = β_1, so that the term (β_0 − β_1) h(w_2) vanishes. We also note that w_2 is determined
We also note that \u00012 is determined\nat the end of the step \u0001 = 1, so \u03941 is well de\ufb01ned. Finally, for any given constant \u0001 > 0, we de\ufb01ne\n\n\u2131\u0001 \u225c\u0007\u0001 \u2208 dom\u03a8(cid:12)(cid:12) \u210e(\u0001) \u2264 \u00012\u0007 .\n\nTheorem 1 Let the sequences {\u0001\u0001}\u0001\n\u0001 =1 be generated by Algorithm 1. Assume there\nis a constant \u0001 such that \u2225\u0001\u0001\u2225\u2217 \u2264 \u0001 for all \u0001 \u2265 1, where \u2225 \u22c5 \u2225\u2217 is the dual norm of \u2225 \u22c5 \u2225. Then for\nany \u0001 \u2265 1 and any \u0001 \u2208 \u2131\u0001, we have\n(11)\n\n\u0001 =1 and {\u0001\u0001}\u0001\n\n\u0001\u0001(\u0001) \u2264 \u0394\u0001.\n\nThe proof of this theorem is given in the longer version of this paper [13]. Here we give some direct\nconsequences based on concrete choices of algorithmic parameters.\nIf the regularization function \u03a8(\u0001) has convexity parameter \u0001 = 0, then the sequence {\u0001\u0001}\u0001\u22651\nde\ufb01ned in (8) together with \u00010 = \u00011 lead to\n\u0394\u0001 = \u0001\u221a\u0001\u00012 +\n\u0001 \u001d\u221a\u0001.\n\n2\u0001 \u001c1 +\u001c2\u221a\u0001 \u2212 2\u001d\u001d \u2264\u001c\u0001\u00012 +\n\n\u221a\u0001\u001d \u2264 \u0001\u221a\u0001\u00012 +\n\n2\u0001 \u001c1 +\n\n\u00012\n\n\u00012\n\n\u00012\n\n1\n\n\u0001\u22121\n\n\u08a3\u0001 =1\n\nThe best \u0001 that minimizes the above bound is \u0001\u2605 = \u0001/\u0001, which leads to\n\n\u0001\u0001(\u0001) \u2264 2\u0001\u0001\u221a\u0001.\n\n4\n\n(12)\n\n\fIf the regularization function \u03a8(\u0001) is strongly convex, i.e., with a convexity parameter \u0001 > 0, then\nany nonnegative, nondecreasing sequence that is dominated by ln \u0001 will give an \u0001(ln \u0001) regret bound.\nWe can simply choose \u210e(\u0001) = (1/\u0001)\u03a8(\u0001) whenever needed. Here are several possibities:\n\n\u2219 Positive constant sequences. 
For simplicity, let β_t = β for t ≥ 1 and β_0 = β_1. In this case,

    Δ_t = βD² + (G²/2) Σ_{τ=0}^{t−1} 1/(στ + β) ≤ βD² + (G²/(2σ)) (1 + ln t).

∙ The logarithmic sequence. Let β_t = σ(1 + ln t) for t ≥ 1, and β_0 = σ. In this case,

    Δ_t = σ(1 + ln t) D² + (G²/2) Σ_{τ=0}^{t−1} 1/(στ + β_τ) ≤ ( σD² + G²/(2σ) ) (1 + ln t).

∙ The zero sequence β_t = 0 for t ≥ 1, with β_0 = σ. Using h(w) = (1/σ)Ψ(w), we have

    Δ_t ≤ Ψ(w_2) + (G²/(2σ)) Σ_{τ=1}^{t−1} 1/τ ≤ (G²/(2σ)) (6 + ln t),

where we used Ψ(w_2) ≤ 2G²/σ, as proved in [13]. This bound does not depend on D.

When Algorithm 1 is used to solve regularized stochastic learning problems, we have the following:

Theorem 2 Assume there exists an optimal solution w★ to the problem (1) that satisfies h(w★) ≤ D² for some D > 0, and there is a G > 0 such that E‖g‖∗² ≤ G² for all g ∈ ∂f(w, z) and w ∈ dom Ψ. Then for any t ≥ 1, we have

    E φ(w̄_t) − φ(w★) ≤ Δ_t / t,   where   w̄_t = (1/t) Σ_{τ=1}^t w_τ.

The proof of Theorem 2 is given in [13]. Further analysis for the cases σ = 0 and σ > 0 is the same as before.
We only need to divide every regret bound by t to obtain the corresponding convergence rate.

4 Related work

There have been several recent works that address online algorithms for regularized learning problems, especially with ℓ1-regularization; see, e.g., [14, 15, 16, 5, 17]. In particular, a forward-backward splitting method (FOBOS) is studied in [17] for solving the same problems we consider. In an online setting, each iteration of the FOBOS method can be written as

    w_{t+1} = arg min_w { (1/2) ‖w − (w_t − α_t g_t)‖² + α_t Ψ(w) },    (13)

where α_t is set to be O(1/√t) if Ψ(w) has convexity parameter σ = 0, and O(1/t) if σ > 0. The RDA method and FOBOS use very different weights on the regularization term Ψ(w): RDA in (6) uses the original Ψ(w) without any scaling, while FOBOS scales Ψ(w) by a diminishing stepsize α_t.

The difference is more clear in the special case of ℓ1-regularization, i.e., when Ψ(w) = λ‖w‖₁. For this purpose, we consider the truncated gradient (TG) method proposed in [5]. The TG method truncates the solutions obtained by the standard SGD method with an integer period K ≥ 1. More specifically, each component of w_t is updated as
    w^{(i)}_{t+1} = { trnc( w^{(i)}_t − α_t g^{(i)}_t, λ_t^TG, θ )    if mod(t, K) = 0,
                      w^{(i)}_t − α_t g^{(i)}_t                       otherwise,    (14)

where λ_t^TG = α_t λ K, the function mod(t, K) gives the remainder on division of t by K, and

    trnc(ω, λ_t^TG, θ) = { 0                      if |ω| ≤ λ_t^TG,
                           ω − λ_t^TG sign(ω)     if λ_t^TG < |ω| ≤ θ,
                           ω                      if |ω| > θ.

When K = 1 and θ = +∞, the TG method is the same as the FOBOS method (13). Now comparing the truncation thresholds λ_t^TG and λ_t^RDA used in (9): with α_t = O(1/√t), we have λ_t^TG = O(1/√t) λ_t^RDA. Therefore, the RDA method can generate much sparser solutions. This is confirmed by our computational experiments in Section 5.

Figure 1: Sparsity patterns of the weight w_t and the average weight w̄_t for classifying the digits 6 and 7 when varying the regularization parameter λ from 0.01 to 10 (columns: λ = 0.01, 0.03, 0.1, 0.3, 1, 3, 10; rows: w_t for SGD, TG, and RDA, w★ for IPM, and w̄_t for SGD, TG, and RDA). The background gray represents the value zero, bright spots represent positive values, and dark spots represent negative values.

5 Computational experiments

We provide computational experiments for the ℓ1-RDA method using the MNIST dataset of handwritten digits [18].
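The two entrywise truncation rules compared above, the ℓ1-RDA update (9) and the TG truncation used in (14), can be sketched as follows (our own illustrative code, not from the paper; the numerical values and variable names are arbitrary):

```python
import numpy as np

def l1_rda_update(gbar, t, lam, gamma, rho=0.0):
    """Entrywise l1-RDA update (9): soft-threshold the dual average gbar
    at level lam_rda = lam + gamma*rho/sqrt(t), then scale by -sqrt(t)/gamma."""
    lam_rda = lam + gamma * rho / np.sqrt(t)
    shrunk = np.maximum(np.abs(gbar) - lam_rda, 0.0) * np.sign(gbar)
    return -(np.sqrt(t) / gamma) * shrunk

def trnc(omega, lam_tg, theta):
    """TG truncation from (14): zero small entries, shrink medium ones,
    leave entries larger than theta untouched."""
    out = np.where(np.abs(omega) <= lam_tg, 0.0,
                   omega - lam_tg * np.sign(omega))
    return np.where(np.abs(omega) > theta, omega, out)

gbar = np.array([0.05, -0.3, 1.2])
w_rda = l1_rda_update(gbar, t=100, lam=0.1, gamma=10.0)
# Entries of gbar with |gbar_i| <= lam are truncated to exact zeros.
print(w_rda)   # [ 0.   0.2 -1.1]
```

Note that the RDA threshold never falls below λ, whereas the TG threshold α_t λ K shrinks with the stepsize; this gap is what drives the sparsity difference observed in the experiments.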
Each image from the dataset is represented by a 28 × 28 gray-scale pixel map, for a total of 784 features. Each of the 10 digits has roughly 6,000 training examples and 1,000 testing examples. No preprocessing of the data is employed.

We use ℓ1-regularized logistic regression to do binary classification on each of the 45 pairs of digits. In the experiments, we compare the ℓ1-RDA method (9) with the SGD method (2) and the TG/FOBOS method (14) with θ = ∞. These three online algorithms have similar convergence rates and the same order of computational complexity per iteration. We also compare them with the batch optimization approach, using an efficient interior-point method (IPM) developed by [19].

Each pair of digits has about 12,000 training examples and 2,000 testing examples. We use the online algorithms to go through the (randomly permuted) data only once; therefore, the algorithms stop at t = 12,000. We vary the regularization parameter λ from 0.01 to 10. As a reference, the maximum λ for the batch optimization case [19] is mostly in the range of 30-50 (beyond which the optimal weights are all zeros). In the ℓ1-RDA method (9), we use γ = 5,000, and set ρ = 0 for basic regularization, or ρ = 0.005 (effectively γρ = 25) for an enhanced regularization effect. The tradeoffs in choosing these parameters are further investigated in [13]. For the SGD and TG methods, we use a constant stepsize α = (1/γ)√(2/T). When γ = G/D, which gives the best convergence bound (12)
When \u0001 = \u0001/\u0001, which gives the best convergence bound (12)\n\n6\n\n\fLeft: \u0001 = 1 for TG, \u0001 = 0 for RDA\n\nRight: \u0001 = 10 for TG, \u0001\u0001 = 25 for RDA\n\n \n\nSGD\nTG\nRDA\n\n2000\n\n4000\n\n6000\n\n8000\n\n10000\n\n12000\n\n600\n\n500\n\n400\n\n300\n\n200\n\n100\n\n0\n \n0\n\n600\n\n500\n\n400\n\n300\n\n200\n\n100\n\n1\n.\n0\n=\n\u0001\nn\ne\nh\nw\ns\nZ\nN\nN\n\n0\n1\n=\n\u0001\nn\ne\nh\nw\ns\nZ\nN\nN\n\n \n\nSGD\nTG\nRDA\n\n2000\n\n4000\n\n6000\n\n8000\n\n10000\n\n12000\n\n600\n\n500\n\n400\n\n300\n\n200\n\n100\n\n0\n \n0\n\n600\n\n500\n\n400\n\n300\n\n200\n\n100\n\n0\n0\n\n2000\n\n4000\n\n6000\n\n8000\nNumber of samples \u0001\n\n10000\n\n12000\n\n0\n0\n\n2000\n\n4000\n\n6000\n\n8000\nNumber of samples \u0001\n\n10000\n\n12000\n\nFigure 2: Number of non-zeros (NNZs) in \u0001(\u0001) for the three online algorithms (classifying 6 and 7).\n\nfor the RDA method, the corresponding \u0001 = (\u0001/\u0001)\u00dd2/\u0001 also gives the best convergence rate for\n\nthe SGD method (e.g., [20]). In the TG method, the truncation period is set to \u0001 = 1 for basic\nregularization, or \u0001 = 10 for enhanced regularization effect, as suggested in [5].\nFigure 1 shows the sparsity patterns of the solutions \u0001\u0001 and \u00af\u0001\u0001 for classifying the digits 6 and 7.\nBoth the TG and RDA methods were run with parameters for enhanced \u21131-regularization: \u0001 = 10\nfor TG and \u0001\u0001 = 25 for RDA. The sparsity patterns obtained by the RDA method are most close to\nthe batch optimization results solved by IPM, especially for larger \u0001.\nFigure 2 plots the number of non-zeros (NNZs) in \u0001(\u0001) for different online algorithms. Only the\nRDA method and TG with \u0001 = 1 give explicit zero weights at every step. In order to count the\nNNZs in all other cases, we set a small threshold for rounding the weights to zero. 
Considering that the magnitudes of the largest weights in Figure 1 are mostly on the order of 10⁻³, we set 10⁻⁵ as the threshold and verified that rounding elements smaller than 10⁻⁵ to zero does not affect the testing errors. Note that we do not further truncate the weights for RDA and TG with K = 1, even if some of their components are below 10⁻⁵. It can be seen that the RDA method maintains a much sparser w_t than the other two online algorithms. While the TG method generates sparser solutions than the SGD method when λ is large, the NNZs in w_t oscillate over a very wide range. In contrast, the RDA method demonstrates a much smoother variation in the NNZs.

Figure 3 illustrates the tradeoffs between sparsity and testing error rates for classifying 6 and 7. Since the performance of the online algorithms varies when the training data are given in different permutations, we run them on 100 randomly permuted sequences of the same training set, and plot the means and standard deviations shown as error bars. For the SGD and TG methods, the testing error rates of the last weight w_T vary a lot across different random sequences. In contrast, the RDA method demonstrates very robust performance (small standard deviations) for w_T, even though the theorems only give a performance bound for the averaged weight w̄_T. Note that the w̄_T obtained by SGD and TG have much smaller error rates than those of RDA and batch optimization, especially for larger λ. The explanation is that these lower error rates are obtained with many more nonzero features.

Figure 4 shows a summary of classification results for all the 45 pairs of digits. For clarity of presentation, here we only plot results of the ℓ1-RDA method and batch optimization using IPM. (The NNZs obtained by SGD and TG are mostly above the limit of the vertical axes, which is set at 200.)
We see that, overall, the solutions obtained by the ℓ1-RDA method demonstrate very similar tradeoffs between sparsity and testing error rates as rendered by the batch optimization solutions.

Figure 3: Tradeoffs between testing error rates and NNZs in the solutions, for both the last weight w_T and the average weight w̄_T, comparing SGD, TG (K = 1 or 10), RDA (ρ = 0 or γρ = 25), and IPM (for classifying 6 and
7).

Figure 4: Binary classification for all 45 pairs of digits. The images in the lower-left triangular area show sparsity patterns of w_T with λ = 1, obtained by the ℓ1-RDA method with γρ = 25. The plots in
The plots in the upper-right triangular area show tradeoffs between sparsity and testing error rates, obtained by varying λ from 0.1 to 10. The solid circles and solid squares show error rates and NNZs in wT, respectively, using IPM for batch optimization. The hollow circles and hollow squares show error rates and NNZs of wT, respectively, using the ℓ1-RDA method. The vertical bars centered at the hollow circles and squares show standard deviations over runs on 100 random permutations of the training data.

References

[1] L. Bottou and O. Bousquet. The tradeoffs of large scale learning. In J.C. Platt, D. Koller, Y. Singer, and S. Roweis, editors, Advances in Neural Information Processing Systems 20, pages 161–168. MIT Press, Cambridge, MA, 2008.

[2] S. Shalev-Shwartz and N. Srebro. SVM optimization: Inverse dependence on training set size. In Proceedings of the 25th International Conference on Machine Learning (ICML), 2008.

[3] L. Bottou and Y. LeCun. Large scale online learning. In S. Thrun, L. Saul, and B. Schölkopf, editors, Advances in Neural Information Processing Systems 16. MIT Press, Cambridge, MA, 2004.

[4] T. Zhang. Solving large scale linear prediction problems using stochastic gradient descent algorithms. In Proceedings of the 21st International Conference on Machine Learning (ICML), Banff, Alberta, Canada, 2004.

[5] J. Langford, L. Li, and T. Zhang. Sparse online learning via truncated gradient. Journal of Machine Learning Research, 10:777–801, 2009.

[6] Yu. Nesterov. Gradient methods for minimizing composite objective function. CORE Discussion Paper 2007/76, Catholic University of Louvain, Center for Operations Research and Econometrics, 2007.

[7] P. Tseng. On accelerated proximal gradient methods for convex-concave optimization. Submitted to SIAM Journal on Optimization, 2008.

[8] A. Beck and M. Teboulle.
A fast iterative shrinkage-thresholding algorithm for linear inverse problems. Technical report, Technion, 2008. To appear in SIAM Journal on Imaging Sciences.

[9] M. Zinkevich. Online convex programming and generalized infinitesimal gradient ascent. In Proceedings of the 20th International Conference on Machine Learning (ICML), pages 928–936, Washington DC, 2003.

[10] N. Cesa-Bianchi and G. Lugosi. Prediction, Learning, and Games. Cambridge University Press, 2006.

[11] Yu. Nesterov. Primal-dual subgradient methods for convex problems. Mathematical Programming, 120(1):221–259, 2009. Appeared earlier as CORE discussion paper 2005/67, Catholic University of Louvain, Center for Operations Research and Econometrics.

[12] Yu. Nesterov. Smooth minimization of nonsmooth functions. Mathematical Programming, 103:127–152, 2005.

[13] L. Xiao. Dual averaging method for regularized stochastic learning and online optimization. Technical Report MSR-TR-2009-100, Microsoft Research, 2009.

[14] J. Duchi, S. Shalev-Shwartz, Y. Singer, and T. Chandra. Efficient projections onto the ℓ1-ball for learning in high dimensions. In Proceedings of the 25th International Conference on Machine Learning (ICML), pages 272–279, 2008.

[15] P. Carbonetto, M. Schmidt, and N. De Freitas. An interior-point stochastic approximation method and an ℓ1-regularized delta rule. In D. Koller, D. Schuurmans, Y. Bengio, and L. Bottou, editors, Advances in Neural Information Processing Systems 21, pages 233–240. MIT Press, 2009.

[16] S. Balakrishnan and D. Madigan. Algorithms for sparse linear classifiers in the massive data setting. Journal of Machine Learning Research, 9:313–337, 2008.

[17] J. Duchi and Y. Singer. Efficient learning using forward-backward splitting. In Proceedings of Neural Information Processing Systems, December 2009.

[18] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998. Dataset available at http://yann.lecun.com/exdb/mnist.

[19] K. Koh, S.-J. Kim, and S. Boyd. An interior-point method for large-scale ℓ1-regularized logistic regression. Journal of Machine Learning Research, 8:1519–1555, 2007.

[20] A. Nemirovski, A. Juditsky, G. Lan, and A. Shapiro. Robust stochastic approximation approach to stochastic programming. SIAM Journal on Optimization, 19(4):1574–1609, 2009.