{"title": "Stochastic Gradient Descent with Only One Projection", "book": "Advances in Neural Information Processing Systems", "page_first": 494, "page_last": 502, "abstract": "Although many variants of stochastic gradient descent have been proposed for large-scale convex optimization, most of them require projecting the solution at {\\it each} iteration to ensure that the obtained solution stays within the feasible domain. For complex domains (e.g., positive semidefinite cone), the projection step can be computationally expensive, making stochastic gradient descent unattractive for large-scale optimization problems. We address this limitation by developing a novel stochastic gradient descent algorithm that does not need intermediate projections. Instead, only one projection at the last iteration is needed to obtain a feasible solution in the given domain. Our theoretical analysis shows that with a high probability, the proposed algorithms achieve an $O(1/\\sqrt{T})$ convergence rate for general convex optimization, and an $O(\\ln T/T)$ rate for strongly convex optimization under mild conditions about the domain and the objective function.", "full_text": "Stochastic Gradient Descent with\n\nOnly One Projection\n\nMehrdad Mahdavi\u2020, Tianbao Yang\u2021, Rong Jin\u2020, Shenghuo Zhu(cid:63), and Jinfeng Yi\u2020\n\u2020Dept. 
of Computer Science and Engineering, Michigan State University, MI, USA\n\n‡Machine Learning Lab, GE Global Research, CA, USA\n\n⋆NEC Laboratories America, CA, USA\n\n†{mahdavim,rongjin,yijinfen}@msu.edu, ‡tyang@ge.com, ⋆zsh@nec-labs.com\n\nAbstract\n\nAlthough many variants of stochastic gradient descent have been proposed for large-scale convex optimization, most of them require projecting the solution at each iteration to ensure that the obtained solution stays within the feasible domain. For complex domains (e.g., the positive semidefinite cone), the projection step can be computationally expensive, making stochastic gradient descent unattractive for large-scale optimization problems. We address this limitation by developing novel stochastic optimization algorithms that do not need intermediate projections. Instead, only one projection at the last iteration is needed to obtain a feasible solution in the given domain. Our theoretical analysis shows that with high probability, the proposed algorithms achieve an O(1/√T) convergence rate for general convex optimization, and an O(ln T/T) rate for strongly convex optimization under mild conditions on the domain and the objective function.\n\n1 Introduction\n\nWith the increasing amount of data available for training, it has become an urgent task to devise efficient algorithms for optimization/learning problems of unprecedented size. Online learning algorithms, such as the celebrated Stochastic Gradient Descent (SGD) [16, 2] and its online counterpart Online Gradient Descent (OGD) [22], despite their slow convergence rates compared with batch methods, have been shown to be very effective for large-scale and online learning problems, both theoretically [16, 13] and empirically [19]. 
Although a large number of iterations is usually needed to obtain a solution of desirable accuracy, the lightweight computation per iteration makes SGD attractive for many large-scale learning problems.\n\nTo find a solution within the domain K that optimizes the given objective function f(x), SGD computes an unbiased estimate of the gradient of f(x) and updates the solution by moving it in the opposite direction of the estimated gradient. To ensure that the solution stays within the domain K, SGD has to project the updated solution back into K at every iteration. Efficient algorithms have been developed for projecting solutions onto special domains (e.g., the simplex and the ℓ1 ball [6, 14]); however, for complex domains, such as the positive semidefinite (PSD) cone in metric learning and bounded trace norm matrices in matrix completion (more examples of complex domains can be found in [10] and [11]), the projection step requires solving an expensive convex optimization problem, leading to a high computational cost per iteration and consequently making SGD unappealing for large-scale optimization problems over such domains. For instance, projecting a matrix onto the PSD cone requires computing the full eigendecomposition of the matrix, whose complexity is cubic in the size of the matrix.\n\nThe central theme of this paper is to develop an SGD-based method that does not require a projection at each iteration. This problem was first addressed in a very recent work [10], where the authors extended the Frank-Wolfe algorithm [7] to online learning. One main shortcoming of the algorithm proposed in [10] is that it has a slower convergence rate (i.e., O(T^{-1/3})) than the standard SGD algorithm (i.e., O(T^{-1/2})). In this work, we demonstrate that a properly modified SGD algorithm can achieve the optimal convergence rate of O(T^{-1/2}) using only ONE projection for general stochastic convex optimization problems. 
We further develop an SGD-based algorithm for strongly convex optimization that achieves a convergence rate of O(ln T/T), which is only a logarithmic factor worse than the optimal rate [9]. The key idea of both algorithms is to appropriately penalize the intermediate solutions when they are outside the domain. With an appropriate design of the penalization mechanism, the average solution x̂T obtained by SGD after T iterations will be very close to the domain K, even without intermediate projections. As a result, the final feasible solution x̃T can be obtained by projecting x̂T onto the domain K, the only projection that is needed for the entire algorithm. We note that our approach is very different from previous efforts in developing projection-free convex optimization algorithms (see [8, 12, 11] and references therein), where the key idea is to design appropriate updating procedures that restore the feasibility of solutions at every iteration.\n\nWe close this section with a statement of the contributions and main results of the present work:\n\n• We propose a stochastic gradient descent algorithm for general convex optimization that introduces a Lagrangian multiplier to penalize solutions outside the domain and performs primal-dual updating. The proposed algorithm achieves the optimal convergence rate of O(1/√T) with only one projection;\n\n• We propose a stochastic gradient descent algorithm for strongly convex optimization that constructs the penalty function using a smoothing technique. This algorithm attains an O(ln T/T) convergence rate with only one projection.\n\n2 Related Works\n\nGenerally, the computational complexity of the projection step in SGD has seldom been taken into account in the literature. Here, we briefly review previous work on projection-free convex optimization, which is closely related to the theme of this study. 
For some specific domains, efficient algorithms have been developed to circumvent the high computational cost caused by the projection step at each iteration of gradient descent methods. The main idea is to select an appropriate direction to take from the current solution such that the next solution is guaranteed to stay within the domain. Clarkson [5] proposed a sparse greedy approximation algorithm for convex optimization over a simplex domain, which is a generalization of an older algorithm by Frank and Wolfe [7] (a.k.a. conditional gradient descent [3]). Zhang [21] introduced a similar sequential greedy approximation algorithm for certain convex optimization problems over a domain given by a convex hull. Hazan [8] devised an algorithm for approximately maximizing a concave function over a trace norm bounded PSD cone, which only needs to compute the maximum eigenvalue and the corresponding eigenvector of a symmetric matrix. Ying et al. [20] formulated distance metric learning problems as eigenvalue maximization and proposed an algorithm similar to [8].\n\nRecently, Jaggi [11] put these ideas into a general framework for convex optimization over a general convex domain. Instead of projecting the intermediate solution onto a complex convex domain, Jaggi's algorithm solves a linearized problem over the same domain. He showed that Clarkson's algorithm, Zhang's algorithm, and Hazan's algorithm discussed above are special cases of his general algorithm for special domains. It is important to note that all these algorithms are designed for batch optimization, not for stochastic optimization, which is the focus of this work.\n\nOur work is closely related to the online Frank-Wolfe (OFW) algorithm proposed in [10]. It is a projection-free online learning algorithm, built on the assumption that it is possible to efficiently minimize a linear function over the complex domain. 
One main shortcoming of the OFW algorithm is that its convergence rate for general stochastic optimization is O(T^{-1/3}), significantly slower than that of a standard stochastic gradient descent algorithm (i.e., O(T^{-1/2})). It achieves a convergence rate of O(T^{-1/2}) only when the objective function is smooth, which unfortunately does not hold for many machine learning problems where either a non-smooth regularizer or a non-smooth loss function is used. Another limitation of OFW is that it assumes that a linear optimization problem over the domain K can be solved efficiently. Although this assumption holds for some specific domains as discussed in [10], in many settings of practical interest it may not be true. The proposed algorithms address these two limitations explicitly. In particular, we show how two seemingly different modifications of SGD can be used to avoid performing expensive projections while enjoying convergence rates similar to the original SGD method.\n\n3 Preliminaries\n\nThroughout this paper, we consider the following convex optimization problem:\n\nmin_{x∈K} f(x),   (1)\n\nwhere K is a bounded convex domain. We assume that K can be characterized by an inequality constraint and, without loss of generality, is bounded by the unit ball, i.e.,\n\nK = {x ∈ R^d : g(x) ≤ 0} ⊆ B = {x ∈ R^d : ‖x‖2 ≤ 1},   (2)\n\nwhere g(x) is a convex constraint function. We assume that K has a non-empty interior, i.e., there exists x such that g(x) < 0, and that the optimal solution x* to (1) is in the interior of the unit ball B, i.e., ‖x*‖2 < 1. Note that when a domain is characterized by multiple convex constraint functions, say gi(x) ≤ 0, i = 1, …
, m, we can summarize them into a single constraint g(x) ≤ 0 by defining g(x) = max_{1≤i≤m} gi(x).\n\nTo solve the optimization problem in (1), we assume that the only information available to the algorithm is through a stochastic oracle that provides unbiased estimates of the gradient of f(x). More precisely, let ξ1, . . . , ξT be a sequence of independently and identically distributed (i.i.d.) random variables sampled from an unknown distribution P. At each iteration t, given the solution xt, the oracle returns ∇̃f(xt; ξt), an unbiased estimate of the true gradient ∇f(xt), i.e., E_{ξt}[∇̃f(xt; ξt)] = ∇f(xt). The goal of the learner is to find an approximately optimal solution by making T calls to this oracle.\n\nBefore proceeding, we recall a few definitions from convex analysis [17].\n\nDefinition 1. A function f(x) is a G-Lipschitz continuous function w.r.t. a norm ‖·‖ if\n\n|f(x1) − f(x2)| ≤ G‖x1 − x2‖, ∀x1, x2 ∈ B.   (3)\n\nIn particular, a convex function f(x) with a bounded (sub)gradient ‖∂f(x)‖* ≤ G is G-Lipschitz continuous, where ‖·‖* is the dual norm to ‖·‖.\n\nDefinition 2. A convex function f(x) is β-strongly convex w.r.t. a norm ‖·‖ if there exists a constant β > 0 (often called the modulus of strong convexity) such that, for any α ∈ [0, 1], it holds that\n\nf(αx1 + (1 − α)x2) ≤ αf(x1) + (1 − α)f(x2) − (1/2)α(1 − α)β‖x1 − x2‖², ∀x1, x2 ∈ B.\n\nWhen f(x) is differentiable, strong convexity is equivalent to f(x1) ≥ f(x2) + ⟨∇f(x2), x1 − x2⟩ + (β/2)‖x1 − x2‖², ∀x1, x2 ∈ B. 
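The reduction of several convex constraints to the single constraint g(x) = max_{1≤i≤m} gi(x) described above can be illustrated with a small sketch (the domain below is a toy choice of ours, not one from the paper); since a pointwise maximum of convex functions is convex, g inherits convexity from the gi:

```python
import numpy as np

# Toy domain: K = {x : x >= 0} ∩ {x : sum(x) <= 1}, written as g_i(x) <= 0.
def g1(x):
    return float(np.max(-x))        # x >= 0   <=>  max_i(-x_i) <= 0

def g2(x):
    return float(np.sum(x) - 1.0)   # sum(x) <= 1  <=>  sum(x) - 1 <= 0

def g(x):
    # x is feasible iff the largest constraint value is still <= 0,
    # so the m constraints collapse into the single convex constraint g.
    return max(g1(x), g2(x))

assert g(np.array([0.2, 0.3])) <= 0     # feasible point
assert g(np.array([-0.1, 0.5])) > 0     # violates x >= 0
assert g(np.array([0.8, 0.9])) > 0      # violates sum(x) <= 1
```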
In the sequel, we use the standard Euclidean norm to define Lipschitz and strongly convex functions. The stochastic gradient descent method is an iterative algorithm that produces a sequence of solutions xt, t = 1, . . . , T, by\n\nxt+1 = Π_K(xt − ηt∇̃f(xt, ξt)),   (4)\n\nwhere {ηt}_{t=1}^T is a sequence of step sizes, Π_K(x) is a projection operator that projects x into the domain K, and ∇̃f(x, ξt) is an unbiased stochastic gradient of f(x), for which we further assume a bounded gradient variance, i.e.,\n\nE_{ξt}[exp(‖∇̃f(x, ξt) − ∇f(x)‖²/σ²)] ≤ exp(1).   (5)\n\nFor general convex optimization, stochastic gradient descent methods obtain an O(1/√T) convergence rate, in expectation or with high probability, provided (5) holds [16]. As mentioned in the Introduction, SGD methods are computationally efficient only when the projection Π_K(x) can be carried out efficiently. The objective of this work is to develop computationally efficient stochastic optimization algorithms that yield the same performance guarantee as the standard SGD algorithm but with only ONE projection when applied to the problem in (1).\n\n4 Algorithms and Main Results\n\nWe now turn to extending the SGD method to the setting where only one projection is allowed for the entire sequence of updates. The main idea is to incorporate the constraint function g(x) into the objective function so as to penalize the intermediate solutions that are outside the domain. The result of the penalization is that, although the average solution obtained by SGD may not be feasible, it will be very close to the boundary of the domain. 
A projection is performed at the end of the iterations to restore the feasibility of the average solution.\n\nAlgorithm 1 (SGDP-PD): SGD with ONE Projection by Primal-Dual Updating\n1: Input: a sequence of step sizes {ηt}, and a parameter γ > 0\n2: Initialization: x1 = 0 and λ1 = 0\n3: for t = 1, 2, . . . , T do\n4: Compute x′_{t+1} = xt − ηt(∇̃f(xt, ξt) + λt∇g(xt))   (6)\n5: Update xt+1 = x′_{t+1}/max(‖x′_{t+1}‖2, 1)   (7)\n6: Update λt+1 = [(1 − γηt)λt + ηt g(xt)]+   (8)\n7: end for\n8: Output: x̃T = Π_K(x̂T), where x̂T = Σ_{t=1}^T xt/T.\n\nThe key ingredient of the proposed algorithms is to replace the projection step with a gradient computation of the constraint function defining the domain K, which is significantly cheaper than the projection step. As an example, when a solution is restricted to a PSD cone, i.e., X ⪰ 0 where X is a symmetric matrix, the corresponding inequality constraint is g(X) = λmax(−X) ≤ 0, where λmax(X) computes the largest eigenvalue of X and is a convex function. 
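As a concrete numerical sketch (our illustration; the paper does not prescribe a particular eigensolver): for this constraint a subgradient is ∇g(X) = uuᵀ, where u is a unit eigenvector of X for its smallest eigenvalue. Such a u can be found from matrix-vector products alone, e.g. by a shifted power iteration, whereas projecting onto the PSD cone needs the full eigendecomposition:

```python
import numpy as np

def min_eigvec(X, iters=1000, seed=0):
    # Smallest eigenvector of symmetric X via power iteration on c*I - X:
    # only matrix-vector products are needed (no full eigendecomposition).
    c = np.linalg.norm(X, ord='fro') + 1.0  # shift so c exceeds every eigenvalue
    v = np.random.default_rng(seed).normal(size=X.shape[0])
    for _ in range(iters):
        v = c * v - X @ v
        v /= np.linalg.norm(v)
    return v

rng = np.random.default_rng(1)
A = rng.normal(size=(5, 5))
X = (A + A.T) / 2                           # symmetric test matrix

u = min_eigvec(X)
grad_g = np.outer(u, u)                     # (sub)gradient of g(X) = λ_max(−X)

# Projection onto the PSD cone, by contrast, needs the full spectrum:
w, V = np.linalg.eigh(X)                    # O(d^3) eigendecomposition
X_psd = (V * np.maximum(w, 0.0)) @ V.T      # clip negative eigenvalues

assert abs(u @ X @ u - w[0]) < 1e-6         # Rayleigh quotient ≈ λ_min(X)
```

For large sparse matrices, an iterative eigensolver such as `scipy.sparse.linalg.eigsh` with `k=1` serves the same purpose.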
In this case, ∇g(X) only requires computing the minimum eigenvector of a matrix, which is cheaper than the full eigenspectrum computation required at each iteration of the standard SGD algorithm to restore feasibility.\n\nBelow, we state a few assumptions about f(x) and g(x) that are often made in stochastic optimization:\n\nA1: ‖∇f(x)‖2 ≤ G1, ‖∇g(x)‖2 ≤ G2, |g(x)| ≤ C2, ∀x ∈ B,\nA2: E_{ξt}[exp(‖∇̃f(x, ξt) − ∇f(x)‖²/σ²)] ≤ exp(1), ∀x ∈ B.\n\nWe also make the following mild assumption about the boundary of the convex domain K:\n\nA3: there exists a constant ρ > 0 such that min_{g(x)=0} ‖∇g(x)‖2 ≥ ρ.\n\nRemark 1. The purpose of introducing assumption A3 is to ensure that the optimal dual variable for the constrained optimization problem in (1) is well bounded from above, a key factor in our analysis. To see this, we write the problem in (1) as a convex-concave optimization problem:\n\nmin_{x∈B} max_{λ≥0} f(x) + λg(x).\n\nLet (x*, λ*) be the optimal solution to the above convex-concave optimization problem. Since we assume g(x) is strictly feasible, x* is also an optimal solution to (1) by the strong duality theorem [4]. Using the first-order optimality condition, we have ∇f(x*) = −λ*∇g(x*). Hence, λ* = 0 when g(x*) < 0, and λ* = ‖∇f(x*)‖2/‖∇g(x*)‖2 when g(x*) = 0. 
Under assumption A3, we have λ* ∈ [0, G1/ρ].\n\nWe note that, from a practical point of view, it is straightforward to verify that for many domains, including the PSD cone and polytopes, the gradient of the constraint function is lower bounded on the boundary, and therefore assumption A3 does not limit the applicability of the proposed algorithms for stochastic optimization. For the example of g(X) = λmax(−X), assumption A3 holds because min_{g(X)=0} ‖∇g(X)‖F = ‖uuᵀ‖F = 1, where u is an orthonormal vector, namely the eigenvector of the matrix X corresponding to its minimum (zero) eigenvalue.\n\nWe propose two different ways of incorporating the constraint function into the objective function, which result in two algorithms, one for general convex functions and the other for strongly convex functions.\n\n4.1 SGD with One Projection for General Convex Optimization\n\nTo incorporate the constraint function g(x), we introduce a regularized Lagrangian function,\n\nL(x, λ) = f(x) + λg(x) − (γ/2)λ²,   λ ≥ 0.\n\nThe sum of the first two terms in L(x, λ) corresponds to the Lagrangian function in dual analysis, with λ the corresponding Lagrangian multiplier. The regularization term −(γ/2)λ² is introduced in L(x, λ) to prevent λ from becoming too large. Instead of solving the constrained optimization problem in (1), we solve the following convex-concave optimization problem:\n\nmin_{x∈B} max_{λ≥0} L(x, λ).   (9)\n\nThe proposed algorithm for stochastically optimizing the problem in (9) is summarized in Algorithm 1. 
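To make the updating scheme concrete, here is a minimal runnable sketch of Algorithm 1 on a toy instance of our own choosing (f(x) = ½‖x − x0‖² with the ball K = {x : ‖x‖2 ≤ r} inside B; for this g we have G2 = 1, so the constant step size η = γ/2 mirrors ηt = γ/(2G2²), though γ itself is set crudely rather than by the tuned formula in Theorem 1):

```python
import numpy as np

rng = np.random.default_rng(0)
d, T, r = 5, 20000, 0.5
x0 = np.full(d, 1.0)                        # unconstrained optimum of f, outside K

def g(x):                                   # K = {x : g(x) <= 0}, g(x) = ||x||_2 - r
    return np.linalg.norm(x) - r

def grad_g(x):
    n = np.linalg.norm(x)
    return x / n if n > 0 else np.zeros(d)

gamma = 1.0 / np.sqrt(T)                    # illustrative O(1/sqrt(T)) choice
eta = gamma / 2.0                           # eta_t = gamma / (2 * G2^2), G2 = 1 here
x, lam, x_sum = np.zeros(d), 0.0, np.zeros(d)
for t in range(T):
    gx, dg = g(x), grad_g(x)
    noisy_grad = (x - x0) + 0.1 * rng.normal(size=d)      # stochastic oracle for grad f
    xp = x - eta * (noisy_grad + lam * dg)                # step 4: primal descent
    x = xp / max(np.linalg.norm(xp), 1.0)                 # step 5: stay inside unit ball B
    lam = max((1 - gamma * eta) * lam + eta * gx, 0.0)    # step 6: dual ascent
    x_sum += x

x_avg = x_sum / T                           # average solution, may be slightly infeasible
# The ONE projection of the whole run: for a ball, Pi_K is just radial scaling.
x_out = x_avg if g(x_avg) <= 0 else r * x_avg / np.linalg.norm(x_avg)
```

On this instance the constrained optimum is r·x0/‖x0‖, and x_out lands close to it even though only the final averaged point was projected.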
It differs from existing stochastic gradient descent methods in that it updates both the primal variable x (steps 4 and 5) and the dual variable λ (step 6), which share the same step sizes. We note that the parameter ρ is not employed in the implementation of Algorithm 1 and is only required for the theoretical analysis. It is noticeable that a similar primal-dual updating is explored in [15] to avoid projection in online learning. Our work differs from [15] in that their algorithm and analysis only lead to a bound on the regret and on the violation of the constraints in the long run, which does not necessarily guarantee the feasibility of the final solution. Our proof techniques also differ from [16], where the convergence rate is obtained for the saddle point; our goal, in contrast, is to bound the convergence of the primal feasible solution.\n\nRemark 2. The convex-concave optimization problem in (9) is equivalent to the following minimization problem:\n\nmin_{x∈B} f(x) + [g(x)]+^2/(2γ),   (10)\n\nwhere [z]+ outputs z if z > 0 and zero otherwise. It may thus seem attractive to directly optimize the penalized function f(x) + [g(x)]+^2/(2γ) using the standard SGD method, which unfortunately does not yield a regret of O(√T). This is because, in order to obtain a regret of O(√T), we need to set γ = Ω(√T), which unfortunately leads to a blowup of the gradients and consequently a poor regret bound. Using a primal-dual updating scheme allows us to adjust the penalization term more carefully and obtain an O(1/√T) convergence rate.\n\nTheorem 1. For any general convex function f(x), if we set ηt = γ/(2G2²), t = 1, · · · , T, and γ = G2√(2/((G1² + C2² + (1 + ln(2/δ))σ²)T)) in Algorithm 1, then under assumptions A1-A3 we have, with probability at least 1 − δ,\n\nf(x̃T) ≤ min_{x∈K} f(x) + O(1/√T),\n\nwhere O(·) suppresses polynomial factors that depend on ln(2/δ), G1, G2, C2, ρ, and σ.\n\n4.2 SGD with One Projection for Strongly Convex Optimization\n\nWe first emphasize that it is difficult to extend Algorithm 1 to achieve an O(ln T/T) convergence rate for strongly convex optimization. This is because although −L(x, λ) is strongly convex in λ, its modulus of strong convexity is γ, which is too small to obtain an O(ln T) regret bound. To achieve a faster convergence rate for strongly convex optimization, we replace assumptions A1 and A2 with\n\nA4: ‖∇̃f(x, ξt)‖2 ≤ G1, ‖∇g(x)‖2 ≤ G2, ∀x ∈ B,\n\nwhere we slightly abuse the notation G1. Note that A1 only requires that ‖∇f(x)‖2 be bounded and A2 assumes a mild condition on the stochastic gradient; in contrast, for strongly convex optimization we need to assume a bound on the stochastic gradient ‖∇̃f(x, ξt)‖2 itself. Although assumption A4 is stronger than assumptions A1 and A2, it is always possible to bound the stochastic gradient for machine learning problems where f(x) consists of a sum of loss functions over training examples and the stochastic gradient is computed by sampling the training examples. Given the bound on ‖∇̃f(x, ξt)‖2, we easily have ‖∇f(x)‖2 = ‖E∇̃f(x, ξt)‖2 ≤ E‖∇̃f(x, ξt)‖2 ≤ G1, which is used to set an input parameter λ0 > G1/ρ of the algorithm. According to the discussion in the last subsection, the optimal dual variable λ* is upper bounded by G1/ρ, and consequently by λ0.\n\nSimilar to the previous approach, we write the optimization problem (1) as an equivalent convex-concave optimization problem:\n\nmin_{g(x)≤0} f(x) = min_{x∈B} max_{0≤λ≤λ0} f(x) + λg(x) = min_{x∈B} f(x) + λ0[g(x)]+.\n\nTo avoid unnecessary complications due to the subgradient of [·]+, following [18] we introduce a smoothing term H(λ/λ0), where H(p) = −p ln p − (1 − p) ln(1 − p) is the entropy function, into the Lagrangian function, leading to the optimization problem min_{x∈B} F(x), where F(x) is defined as\n\nF(x) = f(x) + max_{0≤λ≤λ0} (λg(x) + γH(λ/λ0)) = f(x) + γ ln(1 + exp(λ0 g(x)/γ)),\n\nwhere γ > 0 is a parameter whose value will be determined later. Given the smoothed objective function F(x), we find the optimal solution by applying SGD to minimize F(x), where the gradient of F(x) is computed by\n\n∇F(x) = ∇f(x) + (exp(λ0 g(x)/γ)/(1 + exp(λ0 g(x)/γ)))λ0∇g(x).   (11)\n\nAlgorithm 2 (SGDP-ST): SGD with ONE Projection by a Smoothing Technique\n1: Input: a sequence of step sizes {ηt}, λ0, and γ\n2: Initialization: x1 = 0.\n3: for t = 1, . . . , T do\n4: Compute x′_{t+1} = xt − ηt(∇̃f(xt, ξt) + (exp(λ0g(xt)/γ)/(1 + exp(λ0g(xt)/γ)))λ0∇g(xt))\n5: Update xt+1 = x′_{t+1}/max(‖x′_{t+1}‖2, 1)\n6: end for\n7: Output: x̃T = Π_K(x̂T), where x̂T = Σ_{t=1}^T xt/T.\n\nAlgorithm 2 gives the detailed steps. Unlike Algorithm 1, only the primal variable x is updated in each iteration, using the stochastic gradient computed in (11). The following theorem shows that Algorithm 2 achieves an O(ln T/T) convergence rate if the cost functions are strongly convex.\n\nTheorem 2. For any β-strongly convex function f(x), if we set ηt = 1/(2βt), t = 1, . . . , T, γ = ln T/T, and λ0 > G1/ρ in Algorithm 2, then under assumptions A3 and A4 we have, with probability at least 1 − δ,\n\nf(x̃T) ≤ min_{x∈K} f(x) + O(ln T/T),\n\nwhere O(·) suppresses polynomial factors that depend on ln(1/δ), 1/β, G1, G2, ρ, and λ0.\n\nIt is well known that the optimal convergence rate of SGD for strongly convex optimization is O(1/T) [9], which has been proven tight in the stochastic optimization setting [1]. According to Theorem 2, Algorithm 2 achieves an almost optimal convergence rate, up to the factor ln T. It is worth mentioning that, although it is not explicitly given in Theorem 2, the detailed expression for the convergence rate of Algorithm 2 exhibits a tradeoff in setting λ0 (more can be found in the proof of Theorem 2). 
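A matching sketch of Algorithm 2 on the same kind of toy instance (again our own illustrative choices: f(x) = ½‖x − x0‖² is β = 1 strongly convex, K = {x : ‖x‖2 ≤ r} gives ρ = 1, and λ0 = 5 comfortably exceeds G1/ρ here); the sigmoid factor is exactly the weight multiplying λ0∇g(x) in (11), computed via tanh for numerical stability since λ0 g(x)/γ can be huge:

```python
import numpy as np

rng = np.random.default_rng(0)
d, T, r, beta = 5, 50000, 0.5, 1.0
x0 = np.full(d, 1.0)                        # unconstrained optimum of f, outside K
lam0 = 5.0                                  # chosen to exceed G1 / rho for this instance
gamma = np.log(T) / T                       # gamma = ln T / T, as in Theorem 2

def g(x):                                   # K = {x : g(x) <= 0}, g(x) = ||x||_2 - r
    return np.linalg.norm(x) - r

def grad_g(x):
    n = np.linalg.norm(x)
    return x / n if n > 0 else np.zeros(d)

x, x_sum = np.zeros(d), np.zeros(d)
for t in range(1, T + 1):
    eta = 1.0 / (2.0 * beta * t)                          # eta_t = 1 / (2 beta t)
    noisy_grad = (x - x0) + 0.1 * rng.normal(size=d)      # stochastic oracle for grad f
    z = lam0 * g(x) / gamma
    w = 0.5 * (1.0 + np.tanh(z / 2.0))                    # sigmoid(z) = e^z / (1 + e^z)
    xp = x - eta * (noisy_grad + w * lam0 * grad_g(x))    # SGD step on smoothed F, eq. (11)
    x = xp / max(np.linalg.norm(xp), 1.0)                 # stay inside unit ball B
    x_sum += x

x_avg = x_sum / T
x_out = x_avg if g(x_avg) <= 0 else r * x_avg / np.linalg.norm(x_avg)  # the ONE projection
```

Because γ is tiny, the sigmoid weight acts almost like a step function at the boundary g(x) = 0, which is why the smoothing is needed in place of the non-smooth hinge λ0[g(x)]+.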
Finally, under assumptions A1-A3, Algorithm 2 can achieve an O(1/√T) convergence rate for general convex functions, similar to Algorithm 1.\n\n5 Convergence Rate Analysis\n\nWe now present the proofs of the main theorems. The omitted proofs are provided in the Appendix. We use the O(·) notation in a few inequalities to absorb constants independent of T for ease of exposition.\n\n5.1 Proof of Theorem 1\n\nTo pave the path for the proof, we present a series of lemmas. The lemma below states two key inequalities, which follow from the standard analysis of gradient descent.\n\nLemma 1. Under the boundedness assumptions in (6) and (7), for any x ∈ B and λ > 0, we have\n\n(xt − x)^T ∇_x L(xt, λt) ≤ (1/(2ηt))(‖x − xt‖² − ‖x − xt+1‖²) + 2ηtG1² + ηtG2²λt² + ζt(x) + 2ηtΔt,\n\nwhere ζt(x) := (x − xt)^T (∇̃f(xt, ξt) − ∇f(xt)) and Δt := ‖∇̃f(xt, ξt) − ∇f(xt)‖², and\n\n(λ − λt)∇_λ L(xt, λt) ≤ (1/(2ηt))(|λ − λt|² − |λ − λt+1|²) + 2ηtC2².\n\nAn immediate result of Lemma 1 is the following regret-type bound.\n\nLemma 2. For any general convex function f(x), if we set ηt = γ/(2G2²), t = 1, · · · , T, we have\n\nΣ_{t=1}^T (f(xt) − f(x*)) + [Σ_{t=1}^T g(xt)]+^2/(2(γT + 2G2²/γ)) ≤ G2²/γ + (G1² + C2²)γT/G2² + (γ/G2²)Σ_{t=1}^T Δt + Σ_{t=1}^T ζt(x*),\n\nwhere x* = arg min_{x∈K} f(x).\n\nProof of Theorem 1. First, by the martingale inequality (e.g., Lemma 4 in [13]), with probability 1 − δ/2 we have Σ_{t=1}^T ζt(x*) ≤ 2σ√(3 ln(2/δ)T). By Markov's inequality, with probability 1 − δ/2 we have Σ_{t=1}^T Δt ≤ (1 + ln(2/δ))σ²T. Substituting these inequalities into Lemma 2 and plugging in the stated value of γ, we have, with probability 1 − δ,\n\nΣ_{t=1}^T (f(xt) − f(x*)) + [Σ_{t=1}^T g(xt)]+^2/(C√T) ≤ O(√T),\n\nwhere C = 2G2(1/√(G1² + C2² + (1 + ln(2/δ))σ²) + 2√(G1² + C2² + (1 + ln(2/δ))σ²)/G2²) and O(·) suppresses polynomial factors that depend on ln(2/δ), G1, G2, C2, σ. Recalling the definition of x̂T = Σ_{t=1}^T xt/T and using the convexity of f(x) and g(x), we have\n\nf(x̂T) − f(x*) + [g(x̂T)]+^2 √T/C ≤ O(1/√T).   (12)\n\nAssume g(x̂T) > 0; otherwise x̃T = x̂T and we easily have f(x̃T) ≤ min_{x∈K} f(x) + O(1/√T). Since x̃T is the projection of x̂T into K, i.e., x̃T = arg min_{g(x)≤0} ‖x − x̂T‖², by the first-order optimality condition there exists a positive constant s > 0 such that\n\ng(x̃T) = 0, and x̂T − x̃T = s∇g(x̃T),\n\nwhich indicates that x̂T − x̃T points in the same direction as ∇g(x̃T). Hence,\n\ng(x̂T) = g(x̂T) − g(x̃T) ≥ (x̂T − x̃T)^T ∇g(x̃T) = ‖x̂T − x̃T‖2 ‖∇g(x̃T)‖2 ≥ ρ‖x̂T − x̃T‖2,   (13)\n\nwhere the last inequality follows from min_{g(x)=0} ‖∇g(x)‖2 ≥ ρ. Additionally, we have\n\nf(x*) − f(x̂T) = f(x*) − f(x̃T) + f(x̃T) − f(x̂T) ≤ G1‖x̂T − x̃T‖2,   (14)\n\ndue to f(x*) ≤ f(x̃T) and the Lipschitz continuity of f(x). Combining inequalities (12), (13), and (14) yields\n\nf(x̂T) − f(x*) + ρ²√T ‖x̂T − x̃T‖²/C ≤ O(1/√T) + G1‖x̂T − x̃T‖2.\n\nBy simple algebra, we have ‖x̂T − x̃T‖2 ≤ G1C/(ρ²√T) + O(√(C/(ρ²T))). Therefore,\n\nf(x̃T) ≤ f(x̃T) − f(x̂T) + f(x̂T) ≤ G1‖x̂T − x̃T‖2 + f(x*) + O(1/√T) ≤ f(x*) + O(1/√T),\n\nwhere we use the inequality in (12) to bound f(x̂T) by f(x*) and absorb the dependence on ρ, G1, and C into the O(·) notation.\n\nRemark 3. From the proof of Theorem 1, we can see that the key inequalities are (12), (13), and (14). In particular, the regret-type bound in (12) depends on the algorithm. 
If we only update the primal variable using the penalized objective in (10), whose gradient depends on 1/γ, it causes a blowup of order (1/γ + γT + T/γ) in the regret bound, which is non-convergent.\n\n5.2 Proof of Theorem 2\n\nOur proof of Theorem 2, on the convergence rate of Algorithm 2 when applied to strongly convex functions, starts with the following lemma, the analogue of Lemma 2.\n\nLemma 3. For any β-strongly convex function f(x), if we set ηt = 1/(2βt), we have\n\nΣ_{t=1}^T (F(xt) − F(x*)) ≤ (G1² + λ0²G2²)(1 + ln T)/(2β) + Σ_{t=1}^T ζt(x*) − (β/4)Σ_{t=1}^T ‖x* − xt‖²,\n\nwhere x* = arg min_{x∈K} f(x).\n\nIn order to prove Theorem 2, we need the following improved martingale inequality.\n\nLemma 4. For any fixed x ∈ B, define DT = Σ_{t=1}^T ‖xt − x‖², ΛT = Σ_{t=1}^T ζt(x), and m = ⌈log2 T⌉. We have\n\nPr(ΛT ≤ 4G1√(DT ln(m/δ)) + 4G1 ln(m/δ)) + Pr(DT ≤ 4/T) ≥ 1 − δ.\n\nProof of Theorem 2. We substitute the bound of Lemma 4 into the inequality in Lemma 3 with x = x*. We consider two cases. In the first case, we assume DT ≤ 4/T. 
As a result, we have

$$\sum_{t=1}^T \zeta_t(x^*) = \sum_{t=1}^T \big(\nabla f(x_t) - \widetilde{\nabla} f(x_t, \xi_t)\big)^{\top}(x^* - x_t) \leq 2G_1\sqrt{T D_T} \leq 4G_1,$$

which together with the inequality in Lemma 3 leads to the bound

$$\sum_{t=1}^T \big(F(x_t) - F(x^*)\big) \leq 4G_1 + \frac{(G_1^2 + \lambda_0^2 G_2^2)(1 + \ln T)}{2\beta}.$$

In the second case, we assume

$$\sum_{t=1}^T \zeta_t(x^*) \leq 4G_1\sqrt{D_T \ln\frac{m}{\delta}} + 4G_1\ln\frac{m}{\delta} \leq \frac{\beta}{4}D_T + \frac{16G_1^2}{\beta}\ln\frac{m}{\delta} + 4G_1\ln\frac{m}{\delta},$$

where the last step uses the fact $2ab \leq a^2 + b^2$. We thus have

$$\sum_{t=1}^T \big(F(x_t) - F(x^*)\big) \leq \frac{(G_1^2 + \lambda_0^2 G_2^2)(1 + \ln T)}{2\beta} + \frac{16G_1^2}{\beta}\ln\frac{m}{\delta} + 4G_1\ln\frac{m}{\delta}.$$

Combining the results of the two cases, we have, with a probability $1 - \delta$,

$$\sum_{t=1}^T \big(F(x_t) - F(x^*)\big) \leq \underbrace{\frac{(G_1^2 + \lambda_0^2 G_2^2)(1 + \ln T)}{2\beta} + 4G_1 + \frac{16G_1^2}{\beta}\ln\frac{m}{\delta} + 4G_1\ln\frac{m}{\delta}}_{O(\ln T)}.$$

By convexity of $F(x)$, we have $F(\widehat{x}_T) \leq F(x^*) + O(\ln T/T)$.
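The remaining steps convert this bound on $F$ into a bound on $f$ using two elementary properties of the smoothed penalty term $\gamma \ln\big(1 + \exp(\lambda_0 g(x)/\gamma)\big)$: it dominates $\max(0, \lambda_0 g(x))$ everywhere, and it is at most $\gamma \ln 2$ at feasible points. A minimal numerical sanity check of these two properties (the values of $\lambda_0$, $\gamma$, and the sampled $g(x)$ are illustrative assumptions):

```python
import math
import random

# Sanity check of two bounds on the smoothed penalty used in the proof:
#   F(x) = f(x) + gamma * ln(1 + exp(lambda0 * g(x) / gamma))
#   (i)  F(x) >= f(x) + max(0, lambda0 * g(x))   for all x,
#   (ii) F(x) <= f(x) + gamma * ln(2)            whenever g(x) <= 0.
# f and g are abstracted away: only the scalar value g(x) enters the penalty.
def penalty(gval, lambda0, gamma):
    """The smoothing term gamma * ln(1 + exp(lambda0 * gval / gamma))."""
    return gamma * math.log(1.0 + math.exp(lambda0 * gval / gamma))

random.seed(0)
lambda0, gamma = 2.0, math.log(100) / 100   # gamma = ln(T)/T with T = 100
for _ in range(1000):
    gval = random.uniform(-1.0, 1.0)
    p = penalty(gval, lambda0, gamma)
    assert p >= max(0.0, lambda0 * gval) - 1e-12      # bound (i)
    if gval <= 0.0:
        assert p <= gamma * math.log(2.0) + 1e-12     # bound (ii)
print("penalty bounds verified")
```

Bound (i) follows from $\ln(1 + e^u) \geq \max(0, u)$, and bound (ii) from $e^u \leq 1$ when $u \leq 0$; the loop merely confirms both on random samples.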
Noting that $x^* \in \mathcal{K}$, i.e., $g(x^*) \leq 0$, we have $F(x^*) \leq f(x^*) + \gamma \ln 2$. On the other hand,

$$F(\widehat{x}_T) = f(\widehat{x}_T) + \gamma \ln\left(1 + \exp\left(\frac{\lambda_0 g(\widehat{x}_T)}{\gamma}\right)\right) \geq f(\widehat{x}_T) + \max\big(0, \lambda_0 g(\widehat{x}_T)\big).$$

Therefore, with the value $\gamma = \ln T/T$, we have

$$f(\widehat{x}_T) \leq f(x^*) + O\left(\frac{\ln T}{T}\right), \qquad (15)$$

$$f(\widehat{x}_T) + \lambda_0 g(\widehat{x}_T) \leq f(x^*) + O\left(\frac{\ln T}{T}\right). \qquad (16)$$

Applying the inequalities (13) and (14) to (16), and noting that $\gamma = \ln T/T$, we have

$$\lambda_0 \rho \|\widehat{x}_T - \widetilde{x}_T\|_2 \leq G_1\|\widehat{x}_T - \widetilde{x}_T\|_2 + O\left(\frac{\ln T}{T}\right).$$

For $\lambda_0 > G_1/\rho$, we have $\|\widehat{x}_T - \widetilde{x}_T\|_2 \leq \frac{1}{\lambda_0\rho - G_1}\,O(\ln T/T)$. Therefore

$$f(\widetilde{x}_T) \leq f(\widetilde{x}_T) - f(\widehat{x}_T) + f(\widehat{x}_T) \leq G_1\|\widehat{x}_T - \widetilde{x}_T\|_2 + f(x^*) + O\left(\frac{\ln T}{T}\right) \leq f(x^*) + O\left(\frac{\ln T}{T}\right),$$

where in the second inequality we use inequality (15).

6 Conclusions

In this paper, we made progress towards making the SGD method efficient by proposing a framework in which the projection steps can be excluded from the SGD algorithm. We proposed two novel algorithms that overcome the computational bottleneck of the projection step when applying SGD to optimization problems with complex domains. Using a novel theoretical analysis, we showed that the proposed algorithms achieve, with overwhelming probability, an $O(1/\sqrt{T})$ convergence rate for general convex functions and an $O(\ln T/T)$ rate for strongly convex functions; these rates are known to be optimal (up to a logarithmic factor) for stochastic optimization.

Acknowledgments

The authors would like to thank the anonymous reviewers for their helpful suggestions.
This work was supported in part by the National Science Foundation (IIS-0643494) and the Office of Naval Research (Award N000141210431 and Award N00014-09-1-0663).