{"title": "Mixed Optimization for Smooth Functions", "book": "Advances in Neural Information Processing Systems", "page_first": 674, "page_last": 682, "abstract": "It is well known that the optimal convergence rate for stochastic optimization of smooth functions is $[O(1/\\sqrt{T})]$, which is same as stochastic optimization of Lipschitz continuous convex functions. This is in contrast to optimizing smooth functions using full gradients, which yields a convergence rate of $[O(1/T^2)]$. In this work, we consider a new setup for optimizing smooth functions, termed as {\\bf Mixed Optimization}, which allows to access  both a stochastic oracle  and a full gradient oracle. Our goal is to significantly improve the convergence rate of stochastic optimization of smooth functions by having an additional small number of accesses to the full gradient oracle. We show that, with an $[O(\\ln T)]$ calls to the full gradient oracle  and an $O(T)$ calls to the stochastic oracle, the proposed mixed optimization algorithm is able to achieve an optimization error of $[O(1/T)]$.", "full_text": "Mixed Optimization for Smooth Functions\n\nDepartment of Computer Science and Engineering, Michigan State University, MI, USA\n\nRong Jin\nMehrdad Mahdavi\nfmahdavim,zhanglij,rongjing@msu.edu\n\nLijun Zhang\n\nAbstract\n\np\nIt is well known that the optimal convergence rate for stochastic optimization of\nsmooth functions is O(1=\nT ), which is same as stochastic optimization of Lips-\nchitz continuous convex functions. This is in contrast to optimizing smooth func-\ntions using full gradients, which yields a convergence rate of O(1=T 2). In this\nwork, we consider a new setup for optimizing smooth functions, termed as Mixed\nOptimization, which allows to access both a stochastic oracle and a full gradient\noracle. 
Our goal is to significantly improve the convergence rate of stochastic optimization of smooth functions by allowing an additional small number of accesses to the full gradient oracle. We show that, with O(ln T) calls to the full gradient oracle and O(T) calls to the stochastic oracle, the proposed mixed optimization algorithm achieves an optimization error of O(1/T).

1 Introduction

Many machine learning algorithms follow the framework of empirical risk minimization, which can often be cast as the generic optimization problem

    min_{w ∈ W} G(w) := (1/n) Σ_{i=1}^n g_i(w),        (1)

where n is the number of training examples, g_i(w) encodes the loss on the ith training example (x_i, y_i), and W is a bounded convex domain introduced to regularize the solution w ∈ W (i.e., the smaller the size of W, the stronger the regularization). In this study, we focus on learning problems for which the loss function g_i(w) is smooth. Examples of smooth loss functions include least squares, with g_i(w) = (y_i − ⟨w, x_i⟩)², and logistic regression, with g_i(w) = log(1 + exp(−y_i⟨w, x_i⟩)). Since regularization is enforced through the restricted domain W, we do not introduce an ℓ2 regularizer λ‖w‖²/2 into the optimization problem, and as a result we do not assume the loss function to be strongly convex. We note that a small ℓ2 regularizer does NOT improve the convergence rate of stochastic optimization. More specifically, the convergence rate for stochastically optimizing an ℓ2-regularized loss function remains O(1/√T) when λ = O(1/√T) [11, Theorem 1], a scenario often encountered in real-world applications.

A preliminary approach for solving the optimization problem in (1) is the batch gradient descent (GD) algorithm [16].
It starts with some initial point and iteratively updates the solution via w_{t+1} = Π_W(w_t − η∇G(w_t)), where Π_W(·) is the orthogonal projection onto the convex domain W. It has been shown that for smooth objective functions, the convergence rate of standard GD is O(1/T) [16], and it can be improved to O(1/T²) by an accelerated GD algorithm [15, 16, 18]. The main shortcoming of the GD method is its high cost in computing the full gradient ∇G(w_t) when the number of training examples is large. Stochastic gradient descent (SGD) [3, 13, 21] alleviates this limitation of GD by sampling one (or a small set of) examples and computing a stochastic (sub)gradient at each iteration based on the sampled examples. Since the computational cost of SGD per iteration is independent of the size of the data (i.e., n), it is usually appealing for large-scale learning and optimization.

While SGD enjoys high computational efficiency per iteration, it suffers from a slow convergence rate when optimizing smooth functions.

Table 1: The convergence rate (O), number of calls to the stochastic oracle (O_s), and number of calls to the full gradient oracle (O_f) for optimizing Lipschitz continuous and smooth convex functions, using full GD, SGD, and mixed optimization, measured in the number of iterations T.

    Setting    | Full (GD)                | Stochastic (SGD)         | Mixed Optimization
               | Convergence  O_s  O_f    | Convergence  O_s  O_f    | Convergence  O_s  O_f
    Lipschitz  | 1/√T         0    T      | 1/√T         T    0      | —            —    —
    Smooth     | 1/T²         0    T      | 1/√T         T    0      | 1/T          T    log T

It has been shown in [14] that the effect of the stochastic noise cannot be decreased at a better rate than O(1/√T), which is significantly worse than GD using the full gradients for updating the solutions, and this limitation remains valid when the target function is smooth.
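To make this contrast concrete, here is a small illustration of ours (not part of the paper): on a noisy least squares problem, full GD converges to the empirical minimizer, while constant-step SGD stalls at a noise floor set by the variance of the stochastic gradients.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 200, 5
X = rng.standard_normal((n, d))
Y = X @ rng.standard_normal(d) + 0.5 * rng.standard_normal(n)  # noisy labels
w_hat = np.linalg.lstsq(X, Y, rcond=None)[0]                   # empirical minimizer

eta = 0.05
w_gd, w_sgd = np.zeros(d), np.zeros(d)
for t in range(500):
    w_gd -= eta * 2 * X.T @ (X @ w_gd - Y) / n                 # full gradient step
    i = rng.integers(n)                                        # sample one example
    w_sgd -= eta * 2 * (X[i] @ w_sgd - Y[i]) * X[i]            # stochastic gradient step

print(np.linalg.norm(w_gd - w_hat))   # essentially zero
print(np.linalg.norm(w_sgd - w_hat))  # stuck at a noise floor
```

The stochastic iterate stops improving once the step-size-times-variance term dominates, which is exactly the obstruction the paper attributes to SGD on smooth losses.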
In addition, as we can see from Table 1, for general Lipschitz functions SGD exhibits the same convergence rate as for smooth functions, implying that smoothness is essentially not useful and cannot be exploited in stochastic optimization. The slow convergence rate for stochastically optimizing smooth loss functions is mostly due to the variance in stochastic gradients: unlike the full gradient case, where the norm of the gradient approaches zero as the solution approaches the optimal solution, in stochastic optimization the norm of a stochastic gradient remains constant even when the solution is close to the optimal solution. It is the variance in stochastic gradients that makes the O(1/√T) convergence rate unimprovable in the smooth setting [14, 1].

In this study, we are interested in designing an efficient algorithm that is in the same spirit as SGD but can effectively leverage the smoothness of the loss function to achieve a significantly faster convergence rate. To this end, we consider a new setup for optimization that allows us to interplay between stochastic and deterministic gradient descent methods. In particular, we assume that the optimization algorithm has access to two oracles:

• A stochastic oracle O_s that returns the loss function g_i(w) and its gradient based on the sampled training example (x_i, y_i)², and
• A full gradient oracle O_f that returns the gradient ∇G(w) for any given solution w ∈ W.

We refer to this new setting as mixed optimization in order to distinguish it from both the stochastic and full gradient optimization models. The key question we examine in this study is:

    Is it possible to improve the convergence rate for stochastic optimization of smooth functions by making a small number of calls to the full gradient oracle O_f?

We give an affirmative answer to this question.
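As a concrete (hypothetical) rendering of the two oracles above, the following sketch uses logistic losses; the function names are ours, not the paper's.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 100, 5
X = rng.standard_normal((n, d))
Y = np.sign(rng.standard_normal(n))

def stochastic_oracle(w):
    # O_s: sample i uniformly, return the loss g_i and its gradient at w
    i = rng.integers(n)
    margin = Y[i] * (X[i] @ w)
    loss = np.log1p(np.exp(-margin))
    grad = -Y[i] * X[i] / (1.0 + np.exp(margin))
    return loss, grad

def full_gradient_oracle(w):
    # O_f: return the exact gradient of G(w) = (1/n) sum_i g_i(w)
    margins = Y * (X @ w)
    return (-(Y / (1.0 + np.exp(margins)))[:, None] * X).mean(axis=0)

w = np.zeros(d)
g_full = full_gradient_oracle(w)
# averaging many stochastic gradients approximates the full gradient
g_avg = np.mean([stochastic_oracle(w)[1] for _ in range(5000)], axis=0)
print(np.linalg.norm(g_full - g_avg))  # close, up to sampling noise
```

The stochastic oracle is unbiased for the full gradient; the whole point of mixed optimization is that a few exact O_f calls remove the residual sampling noise far more cheaply than averaging ever could.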
We show that with an additional O(ln T) accesses to the full gradient oracle O_f, the proposed algorithm, referred to as MIXEDGRAD, improves the convergence rate for stochastic optimization of smooth functions to O(1/T), the same rate as for stochastically optimizing a strongly convex function [11, 19, 23]. MIXEDGRAD builds on multi-stage methods [11] and operates in epochs, but involves novel ingredients in order to obtain an O(1/T) rate for smooth losses. In particular, we form a sequence of strongly convex objective functions to be optimized at each epoch, and we decrease the amount of regularization and shrink the domain as the algorithm proceeds. The full gradient oracle O_f is only called at the beginning of each epoch.

Finally, we would like to distinguish mixed optimization from hybrid methods that grow the sample size as the optimization proceeds, gradually transforming the iterates into full gradient steps [9], and from batch gradient methods with varying sample sizes [6]; these approaches unfortunately make the cost of the iterations depend on the sample size n, in contrast to SGD. MIXEDGRAD, instead, alternates between deterministic and stochastic gradient steps, with a different frequency for each type of step. Our result for mixed optimization is useful in scenarios where the full gradient of the objective function can be computed relatively efficiently, although still at a significantly higher cost than a stochastic gradient. An example of such a scenario is distributed computing, where the computation of full gradients can be sped up by running it in parallel on many machines, each holding a relatively small subset of the entire training data. Of course, the latency due to communication between machines results in an additional cost for computing the full gradient in a distributed fashion.

Outline The rest of this paper is organized as follows.
We begin in Section 2 by briefly reviewing the literature on deterministic and stochastic optimization. In Section 3, we introduce the necessary definitions and discuss the assumptions that underlie our analysis. Section 4 describes the MIXEDGRAD algorithm and states the main result on its convergence rate. The proof of the main result is given in Section 5. Finally, Section 6 concludes the paper and discusses a few open questions.

¹The convergence rate can be improved to O(1/T) when the structure of the objective function is provided.
²We note that the stochastic oracle assumed in our study is slightly stronger than the stochastic gradient oracle, as it returns the sampled function instead of the stochastic gradient.

2 More Related Work

Deterministic Smooth Optimization The convergence rate of gradient-based methods usually depends on the analytical properties of the objective function to be optimized. When the objective function is strongly convex and smooth, it is well known that a simple GD method achieves a linear convergence rate [5]. For a non-smooth Lipschitz-continuous function, the optimal rate for first order methods is only O(1/√T) [16]. Although the O(1/√T) rate is not improvable in general, several recent studies improve this rate to O(1/T) by exploiting special structure of the objective function [18, 17]. In full gradient based convex optimization, smoothness is a highly desirable property. It has been shown that a simple GD achieves a convergence rate of O(1/T) when the objective function is smooth, which can be further improved to O(1/T²) using the accelerated gradient methods [15, 18, 16].

Stochastic Smooth Optimization Unlike the optimization methods based on full gradients, the smoothness assumption is not exploited by most stochastic optimization methods.
In fact, it was shown in [14] that the O(1/√T) convergence rate for stochastic optimization cannot be improved even when the objective function is smooth. This classical result is further confirmed by recent studies of composite bounds for first order optimization methods [2, 12]. The smoothness of the objective function is exploited extensively in mini-batch stochastic optimization [7, 8], where the goal is not to improve the convergence rate but to reduce the variance in stochastic gradients and, consequently, the number of updates to the solution [24]. We finally note that the smoothness assumption, coupled with strong convexity of the function, is beneficial in the stochastic setting and yields geometric convergence in expectation using the Stochastic Average Gradient (SAG) and Stochastic Dual Coordinate Ascent (SDCA) algorithms proposed in [20] and [22], respectively.

3 Preliminaries

We use bold-face letters to denote vectors. For any two vectors w, w′ ∈ W, we denote by ⟨w, w′⟩ the inner product between w and w′. Throughout this paper, we only consider the ℓ2-norm. We assume the objective function G(w) defined in (1) to be the average of n convex loss functions. The same assumption was made in [20, 22]. We assume that G(w) is minimized at some w* ∈ W. Without loss of generality, we assume that W ⊆ B_R, a ball of radius R. Besides convexity of the individual functions, we also assume that each g_i(w) is β-smooth, as formally defined below [16].

Definition 1 (Smoothness).
A differentiable loss function f(w) is said to be β-smooth with respect to a norm ‖·‖ if it holds that

    f(w) ≤ f(w′) + ⟨∇f(w′), w − w′⟩ + (β/2)‖w − w′‖²,    ∀ w, w′ ∈ W.

The smoothness assumption also implies that ⟨∇f(w) − ∇f(w′), w − w′⟩ ≤ β‖w − w′‖², which is equivalent to ∇f(w) being β-Lipschitz continuous.

In the stochastic first-order optimization setting, instead of having direct access to G(w), we only have access to a stochastic gradient oracle which, given a solution w ∈ W, returns the gradient ∇g_i(w), where i is sampled uniformly at random from {1, 2, ..., n}. The goal of stochastic optimization is to use a bounded number T of oracle calls and compute some w̄ ∈ W such that the optimization error, G(w̄) − G(w*), is as small as possible.

In the mixed optimization model considered in this study, we first relax the stochastic oracle O_s by assuming that it returns a randomly sampled loss function g_i(w), instead of the gradient ∇g_i(w) at a given solution w³. Second, we assume that the learner also has access to the full gradient oracle O_f. Our goal is to significantly improve the convergence rate of stochastic gradient descent (SGD) by making a small number of calls to the full gradient oracle O_f.
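As a quick numerical illustration of ours (not in the paper), the inequality in Definition 1 can be checked for the logistic loss, which is β-smooth with β = ‖x‖²/4:

```python
import numpy as np

rng = np.random.default_rng(1)
x, y = rng.standard_normal(4), 1.0

def f(w):       # logistic loss g(w) = log(1 + exp(-y <w, x>))
    return np.log1p(np.exp(-y * (x @ w)))

def grad(w):    # its gradient
    return -y * x / (1.0 + np.exp(y * (x @ w)))

beta = (x @ x) / 4.0   # a valid smoothness constant for the logistic loss
for _ in range(1000):
    w, wp = rng.standard_normal(4), rng.standard_normal(4)
    lhs = f(w)
    rhs = f(wp) + grad(wp) @ (w - wp) + 0.5 * beta * np.dot(w - wp, w - wp)
    assert lhs <= rhs + 1e-12   # Definition 1 holds on this pair
print("smoothness inequality verified on 1000 random pairs")
```

The constant comes from bounding the Hessian, σ(m)(1 − σ(m)) x xᵀ ⪯ (‖x‖²/4) I, so the check is exact rather than empirical luck.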
In particular, we show that by having only O(log T) accesses to the full gradient oracle and O(T) accesses to the stochastic oracle, we can tolerate the noise in stochastic gradients and attain an O(1/T) convergence rate for optimizing smooth functions.

³The reader may feel that this relaxation of the stochastic oracle provides significantly more information, and that second order methods such as Online Newton [10] could be applied to achieve O(1/T) convergence. We note that (i) the proposed algorithm is a first order method, and (ii) although the Online Newton method yields a regret bound of O(1/T), its convergence rate for optimization can be as low as O(1/√T) due to the concentration bound for martingales. In addition, the Online Newton method is only applicable to exponentially concave functions, not to arbitrary smooth loss functions.

Algorithm 1 MIXEDGRAD
Input: step size η₁, domain size Δ₁, the number of iterations T₁ for the first epoch, the number of epochs m, regularization parameter λ₁, and shrinking parameter γ > 1
1: Initialize w̄₁ = 0
2: for k = 1, ..., m do
3:   Construct the domain W_k = {w : w + w̄_k ∈ W, ‖w‖ ≤ Δ_k}
4:   Call the full gradient oracle O_f for ∇G(w̄_k)
5:   Compute g_k = λ_k w̄_k + ∇G(w̄_k) = λ_k w̄_k + (1/n) Σ_{i=1}^n ∇g_i(w̄_k)
6:   Initialize w_k^1 = 0
7:   for t = 1, ..., T_k do
8:     Call the stochastic oracle O_s to return a randomly selected loss function g_{i_t}(w)
9:     Compute the stochastic gradient as ĝ_k^t = g_k + ∇g_{i_t}(w_k^t + w̄_k) − ∇g_{i_t}(w̄_k)
10:    Update the solution by
           w_k^{t+1} = arg min_{w ∈ W_k} (1/2)‖w − w_k^t‖² + η_k⟨w − w_k^t, ĝ_k^t + λ_k w_k^t⟩
11:  end for
12:  Set w̃_{k+1} = (1/(T_k + 1)) Σ_{t=1}^{T_k+1} w_k^t and w̄_{k+1} = w̄_k + w̃_{k+1}
13:  Set Δ_{k+1} = Δ_k/γ, λ_{k+1} = λ_k/γ, η_{k+1} = η_k/γ, and T_{k+1} = γ²T_k
14: end for
Return w̄_{m+1}

The analysis of the proposed algorithm relies on the strong convexity of the intermediate loss functions introduced to facilitate the optimization, as defined below.

Definition 2 (Strong convexity). A function f(w) is said to be α-strongly convex w.r.t. a norm ‖·‖ if there exists a constant α > 0 (often called the modulus of strong convexity) such that

    f(w) ≥ f(w′) + ⟨∇f(w′), w − w′⟩ + (α/2)‖w − w′‖²,    ∀ w, w′ ∈ W.

4 Mixed Stochastic/Deterministic Gradient Descent

We now turn to describing the proposed mixed optimization algorithm and stating its convergence rate. The detailed steps of the MIXEDGRAD algorithm are shown in Algorithm 1. It follows the epoch gradient descent algorithm proposed in [11] for stochastically minimizing strongly convex functions and divides the optimization process into m epochs, but involves novel ingredients in order to obtain an O(1/T) convergence rate. The key idea is to introduce an ℓ2 regularizer into the objective function to make it strongly convex, and to gradually reduce the amount of regularization over the epochs. We also shrink the domain as the algorithm proceeds. We note that reducing the amount of regularization over time is closely related to classic proximal-point algorithms. Throughout the paper, we use the subscript for the index of the epoch, and the superscript for the index of iterations within each epoch.
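A minimal Python sketch of Algorithm 1 (our rendering, under simplifying assumptions: a plain Euclidean projection onto the ball ‖w‖ ≤ Δ_k stands in for the projection onto W_k, and least squares losses serve as the example):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 200, 5
X = rng.standard_normal((n, d))
w_true = rng.standard_normal(d)
Y = X @ w_true

grad_i = lambda w, i: 2 * (X[i] @ w - Y[i]) * X[i]   # gradient of g_i (oracle O_s)
full_grad = lambda w: 2 * X.T @ (X @ w - Y) / n      # oracle O_f

def mixedgrad(eta, Delta, T1, m, lam, gamma=2.0):
    w_bar = np.zeros(d)
    Tk = T1
    for k in range(m):
        g_k = lam * w_bar + full_grad(w_bar)          # one O_f call per epoch
        w, acc = np.zeros(d), np.zeros(d)
        for t in range(Tk):
            i = rng.integers(n)                       # O_s call
            ghat = g_k + grad_i(w + w_bar, i) - grad_i(w_bar, i)
            w = w - eta * (ghat + lam * w)
            nw = np.linalg.norm(w)
            if nw > Delta:                            # project onto ||w|| <= Delta_k
                w *= Delta / nw
            acc += w
        w_bar = w_bar + acc / (Tk + 1)                # average, then recenter
        Delta, lam, eta, Tk = Delta / gamma, lam / gamma, eta / gamma, int(gamma**2 * Tk)
    return w_bar

w_out = mixedgrad(eta=0.01, Delta=5.0, T1=100, m=5, lam=1.0)
print(np.linalg.norm(w_out - w_true))
```

This is a sketch, not the paper's reference implementation; the hyperparameters are chosen ad hoc rather than by Theorem 1, but it exhibits the intended behavior: the variance-corrected gradient ĝ shrinks with the domain, so later epochs refine the solution with ever smaller noise.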
Below, we describe the key idea behind MIXEDGRAD. Let w̄_k be the solution obtained before the kth epoch, which is initialized to 0 for the first epoch. Instead of searching for w* at the kth epoch, our goal is to find w* − w̄_k, resulting in the following optimization problem for the kth epoch:

    min_{w + w̄_k ∈ W, ‖w‖ ≤ Δ_k}  (λ_k/2)‖w + w̄_k‖² + (1/n) Σ_{i=1}^n g_i(w + w̄_k),        (2)

where Δ_k specifies the domain size of w and λ_k is the regularization parameter introduced at the kth epoch. By introducing the ℓ2 regularizer, the objective function in (2) becomes strongly convex, making it possible to exploit techniques for stochastic optimization of strongly convex functions in order to improve the convergence rate. The domain size Δ_k and the regularization parameter λ_k are initialized to Δ₁ > 0 and λ₁ > 0, respectively, and are reduced by a constant factor γ > 1 every epoch, i.e., Δ_k = Δ₁/γ^{k−1} and λ_k = λ₁/γ^{k−1}. By removing the constant term λ_k‖w̄_k‖²/2 from the objective function in (2), we obtain the following optimization problem for the kth epoch:

    min_{w ∈ W_k}  [ F_k(w) = (λ_k/2)‖w‖² + λ_k⟨w, w̄_k⟩ + (1/n) Σ_{i=1}^n g_i(w + w̄_k) ],        (3)

where W_k = {w : w + w̄_k ∈ W, ‖w‖ ≤ Δ_k}. We rewrite the objective function F_k(w) as

    F_k(w) = (λ_k/2)‖w‖² + λ_k⟨w, w̄_k⟩ + (1/n) Σ_{i=1}^n g_i(w + w̄_k)
           = (λ_k/2)‖w‖² + ⟨w, λ_k w̄_k + (1/n) Σ_{i=1}^n ∇g_i(w̄_k)⟩ + (1/n) Σ_{i=1}^n [ g_i(w + w̄_k) − ⟨w, ∇g_i(w̄_k)⟩ ]
           = (λ_k/2)‖w‖² + ⟨w, g_k⟩ + (1/n) Σ_{i=1}^n ĝ_i^k(w),        (4)

where

    g_k = λ_k w̄_k + (1/n) Σ_{i=1}^n ∇g_i(w̄_k)    and    ĝ_i^k(w) = g_i(w + w̄_k) − ⟨w, ∇g_i(w̄_k)⟩.

The main reason for using ĝ_i^k(w) instead of g_i(w) is to tolerate the variance in the stochastic gradients. To see this, from the smoothness assumption on g_i(w) we obtain the following bound on the norm of ∇ĝ_i^k(w):

    ‖∇ĝ_i^k(w)‖ = ‖∇g_i(w + w̄_k) − ∇g_i(w̄_k)‖ ≤ β‖w‖.

As a result, since ‖w‖ ≤ Δ_k and Δ_k shrinks over the epochs, ‖w‖ approaches zero over the epochs and consequently ‖∇ĝ_i^k(w)‖ approaches zero, which allows us to effectively control the variance in the stochastic gradients, the key to improving the convergence rate of stochastic optimization of smooth functions to O(1/T).

Using F_k(w) in (4), at the tth iteration of the kth epoch, we call the stochastic oracle O_s to randomly select a loss function g_{i_t}(w) and update the solution by following the standard paradigm of SGD:

    w_k^{t+1} = Π_{W_k}( w_k^t − η_k(λ_k w_k^t + g_k + ∇ĝ_{i_t}^k(w_k^t)) )
              = Π_{W_k}( w_k^t − η_k(λ_k w_k^t + g_k + ∇g_{i_t}(w_k^t + w̄_k) − ∇g_{i_t}(w̄_k)) ),        (5)

where Π_{W_k}(w) projects the solution w into the domain W_k, which shrinks over the epochs. At the end of each epoch, we compute the average solution w̃_{k+1} and update the solution from w̄_k to w̄_{k+1} = w̄_k + w̃_{k+1}. Similar to the epoch gradient descent algorithm [11], we increase the number of iterations by a constant factor γ² every epoch, i.e.,
T_k = T₁γ^{2(k−1)}.

In order to perform the stochastic gradient update in (5), we need to compute the vector g_k at the beginning of the kth epoch, which requires an access to the full gradient oracle O_f. It is easy to count that the number of accesses to the full gradient oracle O_f is m, and the number of accesses to the stochastic oracle O_s is

    T = T₁ Σ_{i=1}^m γ^{2(i−1)} = ((γ^{2m} − 1)/(γ² − 1)) T₁.

Thus, if the total number of accesses to the stochastic gradient oracle is T, the number of accesses to the full gradient oracle required by Algorithm 1 is O(ln T), consistent with our goal of making a small number of calls to the full gradient oracle.

The theorem below shows that for smooth objective functions, by having O(ln T) accesses to the full gradient oracle O_f and O(T) accesses to the stochastic oracle O_s, the MIXEDGRAD algorithm achieves an optimization error of O(1/T).

Theorem 1. Let δ ≤ e^{−9/2} be the failure probability. Set γ = 2, λ₁ = 16β,

    η₁ = 1/(2β√(3T₁)),    T₁ = 300 ln(m/δ),    and    Δ₁ = R.

Define T = T₁(2^{2m} − 1)/3. Let w̄_{m+1} be the solution returned by Algorithm 1 after m epochs, with m = O(ln T) calls to the full gradient oracle O_f and T calls to the stochastic oracle O_s. Then, with probability 1 − 2δ, we have

    G(w̄_{m+1}) − min_{w ∈ W} G(w) ≤ 80βR²/2^{2m−2} = O(β/T).

5 Convergence Analysis

We now turn to proving the main theorem. The proof will be given in a series of lemmas and theorems, with the proofs of a few deferred to the Appendix. The proof of the main theorem is based on induction.
To this end, let ŵ_k* be the optimal solution that minimizes F_k(w) defined in (3). The key to our analysis is to show that when ‖ŵ_k*‖ ≤ Δ_k, then with high probability ‖ŵ_{k+1}*‖ ≤ Δ_k/γ, where ŵ_{k+1}* is the optimal solution that minimizes F_{k+1}(w), as revealed by the following theorem.

Theorem 2. Let ŵ_k* and ŵ_{k+1}* be the optimal solutions that minimize F_k(w) and F_{k+1}(w), respectively, and let w̃_{k+1} be the average solution obtained at the end of the kth epoch of the MIXEDGRAD algorithm. Suppose ‖ŵ_k*‖ ≤ Δ_k. By setting the step size η_k = 1/(2β√(3T_k)), we have, with probability 1 − 2δ,

    ‖ŵ_{k+1}*‖ ≤ Δ_k/γ    and    F_k(w̃_{k+1}) − min_w F_k(w) ≤ λ_kΔ_k²/(2γ⁴),

provided that δ ≤ e^{−9/2} and T_k ≥ (300γ⁸β²/λ_k²) ln(1/δ).

Taking this statement as given for the moment, we proceed with the proof of Theorem 1, returning later to establish the claim stated in Theorem 2.

Proof of Theorem 1. It is easy to check that for the first epoch, using the fact W ⊆ B_R, we have

    ‖ŵ_1*‖ = ‖w*‖ ≤ R := Δ₁.

Let ŵ_m* be the optimal solution that minimizes F_m(w), and let w̃_{m+1} be the average solution obtained in the last epoch. Using Theorem 2, with probability 1 − 2mδ, we have

    ‖ŵ_m*‖ ≤ Δ₁/γ^{m−1}    and    F_m(w̃_{m+1}) − F_m(ŵ_m*) ≤ λ_mΔ_m²/(2γ⁴) = λ₁Δ₁²/(2γ^{3m+1}).

Hence, by expanding the left hand side and utilizing the smoothness of the individual loss functions, we get

    (1/n) Σ_{i=1}^n g_i(w̄_{m+1}) ≤ F_m(ŵ_m*) + λ₁Δ₁²/(2γ^{3m+1}) − (λ₁/γ^{m−1})⟨w̃_{m+1}, w̄_m⟩
                                 ≤ F_m(ŵ_m*) + λ₁Δ₁²/(2γ^{3m+1}) + (λ₁/γ^{m−1})‖w̄_m‖Δ₁/γ^{m−1}
                                 ≤ F_m(ŵ_m*) + λ₁Δ₁²/(2γ^{3m+1}) + 2λ₁Δ₁²/γ^{2m−2},

where the second step uses the fact ‖w̃_{m+1}‖ ≤ Δ_m = Δ₁γ^{1−m}, and the last step uses

    ‖w̄_m‖ ≤ Σ_{i=1}^m ‖w̃_i‖ ≤ Σ_{i=1}^m Δ_i ≤ γΔ₁/(γ − 1) ≤ 2Δ₁,

which holds under the condition γ ≥ 2.

Our final goal is to relate F_m(w) to min_w G(w). Since ŵ_m* minimizes F_m(w), for any w* ∈ arg min G(w) we have

    F_m(ŵ_m*) ≤ F_m(w* − w̄_m) = (1/n) Σ_{i=1}^n g_i(w*) + (λ₁/(2γ^{m−1}))(‖w* − w̄_m‖² + 2⟨w* − w̄_m, w̄_m⟩).        (6)

Thus, the key to bounding |F_m(ŵ_m*) − G(w*)| is to bound ‖w* − w̄_m‖. To this end, after the first m epochs, we run Algorithm 1 with full gradients. Let w̄_{m+1}, w̄_{m+2}, ... be the sequence of solutions generated by Algorithm 1 after the first m epochs. For this sequence of solutions, Theorem 2 holds deterministically, as we deploy the full gradient for updating, i.e., ‖w̃_k‖ ≤ Δ_k for any k ≥ m + 1. Since we reduce λ_k exponentially, λ_k approaches zero, and therefore the sequence {w̄_k}_{k=m+1}^∞ converges to w*, one of the optimal solutions that minimize G(w). Since w* is the limit of the sequence {w̄_k}_{k=m+1}^∞ and ‖w̃_k‖ ≤ Δ_k for any k ≥ m + 1, we have

    ‖w* − w̄_m‖ ≤ Σ_{k=m+1}^∞ ‖w̃_k‖ ≤ Σ_{k=m+1}^∞ Δ_k ≤ Δ₁/(γ^m(1 − γ^{−1})) ≤ 2Δ₁/γ^m,

where the last step follows from the condition γ ≥ 2. By combining the above inequalities, we obtain

    F_m(ŵ_m*) ≤ (1/n) Σ_{i=1}^n g_i(w*) + (λ₁/(2γ^{m−1}))(4Δ₁²/γ^{2m} + 8Δ₁²/γ^m)
             = (1/n) Σ_{i=1}^n g_i(w*) + (2 + γ^{−m}) · 2λ₁Δ₁²/γ^{2m−1}
             ≤ (1/n) Σ_{i=1}^n g_i(w*) + 5λ₁Δ₁²/γ^{2m−1}.        (7)

By combining the bounds in (6) and (7), we have, with probability 1 − 2mδ,

    (1/n) Σ_{i=1}^n g_i(w̄_{m+1}) − (1/n) Σ_{i=1}^n g_i(w*) ≤ 5λ₁Δ₁²/γ^{2m−2} = O(1/T),

where

    T = T₁ Σ_{k=0}^{m−1} γ^{2k} = T₁(γ^{2m} − 1)/(γ² − 1) ≤ (T₁/3)γ^{2m}.

We complete the proof by plugging in the stated values for γ, λ₁, and Δ₁.

5.1 Proof of Theorem 2

For the convenience of discussion, we drop the subscript k for the epoch, just to simplify our notation. Let λ = λ_k, T = T_k,
\u2206 = \u2206k, g = gk. Let (cid:22)w = (cid:22)wk be the solution obtained before the start of\n\u2032\n= (cid:22)wk+1 be the solution obtained after running through the kth epoch. We\nthe epoch k, and let (cid:22)w\ndenote by F(w) and F\u2032\n(w) the objective functions Fk(w) and Fk+1(w). They are given by\nF(w) =\n\n\u2225w\u22252 + (cid:21)\u27e8w; (cid:22)w\u27e9 +\n\ngi(w + (cid:22)w)\n\n(8)\n\n(cid:21)\n2\n\n\u2032\n\n+\n\n2(cid:17)\n\n\u2032\n\n)\n\n(cid:21)\n(cid:13)\n\nF\u2032\n\u2032\n\n\u2225w\u22252 +\n\n\u27e8w; (cid:22)w\n\n+\n\n(cid:17)\n2\n\n(9)\n\n1\nn\n\n2(cid:13)4\n\n\u27e8\n\n(w) =\n\ngi(w + (cid:22)w\n\nLemma 1.\n\n(w) over the\n\n; F( (cid:22)w\n\n(cid:3)\u2225 (cid:20) \u2206\n\u2032\n(cid:13)\n\n) (cid:0) F (bw(cid:3)) (cid:20) (cid:21)\u22062\n\n\u2225rbgit(wt) + (cid:21)wt\u22252 + \u27e8g; wt (cid:0) wt+1\u27e9\n\n(cid:21)\n2(cid:13)\nbe the optimal solutions that minimize F(w) and F\u2032\n\n(cid:3) = bwk+1(cid:3)\nLet bw(cid:3) = bwk(cid:3) and bw\ndomain Wk and Wk+1, respectively. 
Under the assumption that \u2225bw(cid:3)\u2225 (cid:20) \u2206, our goal is to show\n\u2225bw\nThe following lemma bounds F(wt) (cid:0) F (bw(cid:3)) where the proof is deferred to Appendix.\nF(wt) (cid:0) F (bw(cid:3)) (cid:20) \u2225wt (cid:0)bw(cid:3)\u22252\n\n(cid:0) \u2225wt+1 (cid:0)bw(cid:3)\u22252\n\u27e9\n\u27e8\nrbF(bw(cid:3)) (cid:0) rbgit(bw(cid:3)); wt (cid:0)bw(cid:3)\n(cid:0)rbgit(wt) + rbgit (bw(cid:3)) (cid:0) rbF(bw(cid:3)) + rbF(wt); wt (cid:0)bw(cid:3)\nT\u2211\nF(wt) (cid:0) F(bw(cid:3)) (cid:20) \u2225bw(cid:3)\u22252\n(cid:0) \u2225wT +1 (cid:0)bw(cid:3)\u22252\nT\u2211\nT\u2211\n\u27e8rbF(bw(cid:3)) (cid:0) rbgit (bw(cid:3)); wt (cid:0)bw(cid:3)\u27e9\n\u2225rbgit(wt) + (cid:21)wt\u22252\n}\n}\n|\n|\n\u27e9\n\u27e8\nT\u2211\n(cid:0)rbgit(wt) + rbgit (bw(cid:3)) (cid:0) rbF(bw(cid:3)) + rbF(wt); wt (cid:0)bw(cid:3)\n|\n}\nSince g = rF(0) and\n\nBy adding the inequality in Lemma 1 over all iterations, using the fact (cid:22)w1 = 0, we have\n\n(cid:0) \u27e8g; wT +1\u27e9\n{z\n\n{z\n\n{z\n\n:=BT\n\n:=AT\n\n2(cid:17)\n\n+\n\n+\n\n(cid:17)\n2\n\n:=CT\n\n2(cid:17)\n\n+\n\nt=1\n\n+\n\nt=1\n\n2(cid:17)\n\nt=1\n\nt=1\n\n:\n\nF(wT +1) (cid:0) F (0) (cid:20) \u27e8rF(0); wT +1\u27e9 +\n\n\u2225wT +1\u22252 = \u27e8g; wT +1\u27e9 +\n\n\u2225wT +1\u22252\n\n\u27e9\n\n(cid:12)\n2\n\n(cid:12)\n2\n\n7\n\n\fusing the fact F(0) (cid:20) F(w(cid:3)) + (cid:12)\n\nand therefore\n\n(cid:0)\u27e8g; wT +1\u27e9 (cid:20) F(0) (cid:0) F (wT +1) +\n(\nF(wt) (cid:0) F (bw(cid:3)) (cid:20) \u22062\n\nT +1\u2211\n\n1\n2(cid:17)\n\nt=1\n\n2\n\n\u22062 (cid:20) (cid:12)\u22062 (cid:0) (F(wT +1) (cid:0) F(bw(cid:3)))\n\u2225w(cid:3)\u22252 and max(\u2225w(cid:3)\u2225;\u2225wT +1\u2225) (cid:20) \u2206, we have\n)\n\n(cid:12)\n2\n\n+ (cid:12)\n\n+\n\nAT + BT + CT :\n\n(10)\n\n(cid:17)\n2\n\nThe following lemmas bound AT , BT and CT .\nLemma 2. For AT de\ufb01ned above we have AT (cid:20) 6(cid:12)2\u22062T .\nThe following lemma upper bounds BT and CT . 
The proof is based on Bernstein's inequality for martingales [4] and is given in the Appendix.

Lemma 3. With a probability $1 - 2\delta$, we have
$$B_T \le \beta\Delta^2\left(\ln\frac{1}{\delta} + \sqrt{2T\ln\frac{1}{\delta}}\right), \quad \text{and} \quad C_T \le 2\beta\Delta^2\left(\ln\frac{1}{\delta} + \sqrt{2T\ln\frac{1}{\delta}}\right).$$

Using Lemmas 2 and 3, by substituting the upper bounds for $A_T$, $B_T$, and $C_T$ into (10), with a probability $1 - 2\delta$, we obtain
$$\sum_{t=1}^{T+1} F(w_t) - F(\hat{w}_*) \le \Delta^2\left(\frac{1}{2\eta} + \beta + 6\beta^2\eta T + 3\beta\ln\frac{1}{\delta} + 3\beta\sqrt{2T\ln\frac{1}{\delta}}\right).$$
By choosing $\eta = 1/[2\beta\sqrt{3T}]$, we have
$$\sum_{t=1}^{T+1} F(w_t) - F(\hat{w}_*) \le \Delta^2\left(2\beta\sqrt{3T} + \beta + 3\beta\ln\frac{1}{\delta} + 3\beta\sqrt{2T\ln\frac{1}{\delta}}\right),$$
and using the fact that $\tilde{w} = \sum_{t=1}^{T+1} w_t/(T+1)$ together with the convexity and the $\lambda$-strong convexity of $F$, we have
$$F(\tilde{w}) - F(\hat{w}_*) \le \Delta^2\,\frac{5\beta\sqrt{3\ln[1/\delta]}}{\sqrt{T+1}}, \quad \text{and} \quad \hat{\Delta}^2 = \|\tilde{w} - \hat{w}_*\|^2 \le \frac{2\Delta^2}{\lambda}\,\frac{5\beta\sqrt{3\ln[1/\delta]}}{\sqrt{T+1}}. \qquad (11)$$

Thus, when $T \ge [300\gamma^8\beta^2\ln\frac{1}{\delta}]/\lambda^2$, we have, with a probability $1 - 2\delta$,
$$\hat{\Delta}^2 \le \frac{\Delta^2}{\gamma^4}, \quad \text{and} \quad |F(\tilde{w}) - F(\hat{w}_*)| \le \frac{\lambda}{2\gamma^4}\Delta^2.$$
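As a quick sanity check on the step size $\eta = 1/[2\beta\sqrt{3T}]$ used above, note that it exactly balances the two $\eta$-dependent terms of the bound: minimizing $h(\eta) = \frac{1}{2\eta} + 6\beta^2\eta T$ over $\eta > 0$ gives

$$h'(\eta) = -\frac{1}{2\eta^2} + 6\beta^2 T = 0 \;\Longrightarrow\; \eta = \frac{1}{2\beta\sqrt{3T}}, \qquad h\!\left(\frac{1}{2\beta\sqrt{3T}}\right) = \beta\sqrt{3T} + \beta\sqrt{3T} = 2\beta\sqrt{3T},$$

which recovers the leading $2\beta\sqrt{3T}$ term of the final bound.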
The next lemma relates $\|\hat{w}'_*\|$ to $\|\tilde{w} - \hat{w}_*\|$.

Lemma 4. We have $\|\hat{w}'_*\| \le \gamma\,\|\tilde{w} - \hat{w}_*\|$.

Combining the bound in (11) with Lemma 4, we have $\|\hat{w}'_*\| \le \gamma\hat{\Delta} \le \Delta/\gamma$.

6 Conclusions and Open Questions

We presented a new paradigm of optimization, termed mixed optimization, that aims to improve the convergence rate of stochastic optimization by making a small number of calls to the full gradient oracle. We proposed the MIXEDGRAD algorithm and showed that it is able to achieve an O(1/T) convergence rate by accessing the stochastic and full gradient oracles O(T) and O(log T) times, respectively. We showed that the MIXEDGRAD algorithm is able to exploit the smoothness of the function, a property that is believed to be of little use in purely stochastic optimization.

In the future, we would like to examine the optimality of our algorithm, namely, whether it is possible to achieve a better convergence rate for stochastic optimization of smooth functions using O(ln T) accesses to the full gradient oracle. Furthermore, to alleviate the computational cost caused by the O(log T) accesses to the full gradient oracle, it would be interesting to empirically evaluate the proposed algorithm in a distributed framework: by distributing the individual functions among processors, the full gradient computation at the beginning of each epoch can be parallelized, requiring O(log T) rounds of communication between the processors in total.
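To make the oracle accounting above concrete, the following is a minimal sketch of an epoch-based mixed-optimization loop. It is not the authors' MIXEDGRAD algorithm: the SVRG-style corrected update, the fixed step size, the doubling schedule, and the toy noiseless least-squares problem are all illustrative assumptions made here. The sketch only demonstrates how one full-gradient call per epoch, combined with a geometrically growing number of stochastic steps, yields O(log T) full-oracle accesses against O(T) stochastic accesses.

```python
import random

# Toy smooth finite sum: G(w) = (1/n) * sum_i (x_i * w - y_i)^2, a 1-D least
# squares problem whose minimizer is w_true (the data is noiseless).

def make_problem(n=50, seed=0):
    rng = random.Random(seed)
    xs = [rng.uniform(0.5, 1.5) for _ in range(n)]
    w_true = 2.0
    ys = [x * w_true for x in xs]
    return xs, ys, w_true

def grad_i(w, x, y):
    # Gradient of the i-th component (x * w - y)^2 with respect to w.
    return 2.0 * x * (x * w - y)

def mixed_grad_sketch(xs, ys, epochs=8, t1=8, eta=0.05, seed=1):
    """Epoch-based mixed optimization loop: one full-gradient oracle call per
    epoch, and a doubling number of stochastic oracle calls within each epoch,
    with the stochastic gradient corrected by the stale full gradient."""
    rng = random.Random(seed)
    n = len(xs)
    w_bar = 0.0                 # solution carried across epochs
    full_calls = stoch_calls = 0
    t = t1                      # inner iterations of the first epoch
    for _ in range(epochs):
        # One access to the full gradient oracle at the start of the epoch.
        g_full = sum(grad_i(w_bar, x, y) for x, y in zip(xs, ys)) / n
        full_calls += 1
        w = w_bar
        for _ in range(t):
            i = rng.randrange(n)
            # Variance-reduced stochastic step: correcting by the stale full
            # gradient keeps the update unbiased for the full objective.
            g = grad_i(w, xs[i], ys[i]) - grad_i(w_bar, xs[i], ys[i]) + g_full
            stoch_calls += 1
            w -= eta * g
        w_bar = w
        t *= 2                  # doubling schedule: O(T) stochastic calls in
                                # total, but only O(log T) full-gradient calls
    return w_bar, full_calls, stoch_calls
```

With the defaults above, the loop makes 8 full-gradient calls while performing 8 + 16 + ... + 1024 = 2040 stochastic steps, so the number of full-oracle accesses grows only logarithmically in the total number of stochastic accesses.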
Lastly, it would be very interesting to investigate whether an O(1/T²) rate could be achieved by an accelerated method in the mixed optimization scenario, and whether linear convergence rates could be achieved in the strongly convex case.

Acknowledgments. The authors would like to thank the anonymous reviewers for their helpful and insightful comments. This work was supported in part by ONR Award N000141210431 and NSF (IIS-1251031).

References

[1] A. Agarwal, P. L. Bartlett, P. D. Ravikumar, and M. J. Wainwright. Information-theoretic lower bounds on the oracle complexity of stochastic convex optimization. IEEE Transactions on Information Theory, 58(5):3235–3249, 2012.

[2] A. Beck and M. Teboulle. Mirror descent and nonlinear projected subgradient methods for convex optimization. Operations Research Letters, 31(3):167–175, 2003.

[3] L. Bottou and O. Bousquet. The tradeoffs of large scale learning. In NIPS, pages 161–168, 2008.

[4] S. Boucheron, G. Lugosi, and O. Bousquet. Concentration inequalities. In Advanced Lectures on Machine Learning, pages 208–240, 2003.

[5] S. Boyd and L. Vandenberghe. Convex Optimization. Cambridge University Press, 2004.

[6] R. H. Byrd, G. M. Chin, J. Nocedal, and Y. Wu. Sample size selection in optimization methods for machine learning. Mathematical Programming, 134(1):127–155, 2012.

[7] A. Cotter, O. Shamir, N. Srebro, and K. Sridharan. Better mini-batch algorithms via accelerated gradient methods. In NIPS, pages 1647–1655, 2011.

[8] O. Dekel, R. Gilad-Bachrach, O. Shamir, and L. Xiao. Optimal distributed online prediction using mini-batches. Journal of Machine Learning Research, 13:165–202, 2012.

[9] M. P. Friedlander and M. Schmidt. Hybrid deterministic-stochastic methods for data fitting. SIAM Journal on Scientific Computing, 34(3):A1380–A1405, 2012.

[10] E. Hazan, A. Agarwal, and S. Kale.
Logarithmic regret algorithms for online convex optimization. Machine Learning, 69(2-3):169–192, 2007.

[11] E. Hazan and S. Kale. Beyond the regret minimization barrier: an optimal algorithm for stochastic strongly-convex optimization. Journal of Machine Learning Research - Proceedings Track, 19:421–436, 2011.

[12] Q. Lin, X. Chen, and J. Pena. A smoothing stochastic gradient method for composite optimization. arXiv preprint arXiv:1008.5204, 2010.

[13] A. Nemirovski, A. Juditsky, G. Lan, and A. Shapiro. Robust stochastic approximation approach to stochastic programming. SIAM Journal on Optimization, 19:1574–1609, 2009.

[14] A. S. Nemirovsky and D. B. Yudin. Problem Complexity and Method Efficiency in Optimization. Wiley, 1983.

[15] Y. Nesterov. A method of solving a convex programming problem with convergence rate O(1/k²). In Soviet Mathematics Doklady, volume 27, pages 372–376, 1983.

[16] Y. Nesterov. Introductory Lectures on Convex Optimization: A Basic Course. Kluwer Academic Publishers, 2004.

[17] Y. Nesterov. Excessive gap technique in nonsmooth convex minimization. SIAM Journal on Optimization, 16(1):235–249, 2005.

[18] Y. Nesterov. Smooth minimization of non-smooth functions. Mathematical Programming, 103(1):127–152, 2005.

[19] A. Rakhlin, O. Shamir, and K. Sridharan. Making gradient descent optimal for strongly convex stochastic optimization. In ICML, 2012.

[20] N. L. Roux, M. W. Schmidt, and F. Bach. A stochastic gradient method with an exponential convergence rate for finite training sets. In NIPS, pages 2672–2680, 2012.

[21] S. Shalev-Shwartz, Y. Singer, and N. Srebro. Pegasos: primal estimated sub-gradient solver for SVM. In ICML, pages 807–814, 2007.

[22] S. Shalev-Shwartz and T. Zhang. Stochastic dual coordinate ascent methods for regularized loss minimization. Journal of Machine Learning Research, 14:567–599, 2013.

[23] O. Shamir and T. Zhang.
Stochastic gradient descent for non-smooth optimization: convergence results and optimal averaging schemes. In ICML, 2013.

[24] L. Zhang, T. Yang, R. Jin, and X. He. O(log T) projections for stochastic optimization of smooth and strongly convex functions. In ICML, 2013.