{"title": "Optimal Stochastic and Online Learning with Individual Iterates", "book": "Advances in Neural Information Processing Systems", "page_first": 5415, "page_last": 5425, "abstract": "Stochastic composite mirror descent (SCMD) is a simple and efficient method able to capture both geometric and composite structures of optimization problems in machine learning. Existing strategies require to take either an average or a random selection of iterates to achieve optimal convergence rates, which, however, can either destroy the sparsity of solutions or slow down the practical training speed. In this paper, we propose a theoretically sound strategy to select an individual iterate of the vanilla SCMD, which is able to achieve optimal rates for both convex and strongly convex problems in a non-smooth learning setting. This strategy of outputting an individual iterate can preserve the sparsity of solutions which is crucial for a proper interpretation in sparse learning problems. We report experimental comparisons with several baseline methods to show the effectiveness of our method in achieving a fast training speed as well as in outputting sparse solutions.", "full_text": "Optimal Stochastic and Online Learning with\n\nIndividual Iterates\n\nYunwen Lei1,2 Peng Yang1 Ke Tang1\u2217 Ding-Xuan Zhou3\n\n1University Key Laboratory of Evolving Intelligent Systems of Guangdong Province,\n\nDepartment of Computer Science and Engineering,\n\nSouthern University of Science and Technology, Shenzhen 518055, China\n\nTechnical University of Kaiserslautern, Kaiserslautern 67653, Germany\n\n2Department of Computer Science,\n\n3School of Data Science and Department of Mathematics,\nCity University of Hong Kong, Kowloon, Hong Kong, China\n\n{leiyw, yangp, tangk3}@sustech.edu.cn\n\nmazhou@cityu.edu.hk\n\nAbstract\n\nStochastic composite mirror descent (SCMD) is a simple and ef\ufb01cient method able\nto capture both geometric and composite structures of optimization problems in\nmachine 
learning. Existing strategies require taking either an average or a random selection of iterates to achieve optimal convergence rates, which, however, can either destroy the sparsity of solutions or slow down the practical training speed. In this paper, we propose a theoretically sound strategy to select an individual iterate of the vanilla SCMD, which is able to achieve optimal rates for both convex and strongly convex problems in a non-smooth learning setting. This strategy of outputting an individual iterate can preserve the sparsity of solutions, which is crucial for a proper interpretation in sparse learning problems. We report experimental comparisons with several baseline methods to show the effectiveness of our method in achieving a fast training speed as well as in outputting sparse solutions.

1 Introduction

Gradient-based methods have found wide application in solving various optimization problems. A basic and representative method of this type is gradient descent, which iteratively moves the iterate along the negative gradient direction at the current point. However, gradient descent applied to machine learning problems requires going through all training examples at each iteration, which is not efficient when the sample size is large. Stochastic gradient descent (SGD) relieves this computational burden by approximating the true gradient of the objective function with an unbiased estimate based on a randomly selected training example. With this strategy, SGD achieves a per-iteration computational cost independent of the sample size and therefore can be successfully applied to the very large datasets that are becoming ubiquitous in the big data era [2, 4, 44].

From different viewpoints, SGD has been extended in various ways. For example, the trick of variance reduction has been introduced to exploit the finite-sum structure of objective functions for reducing the inherent variance [16, 33, 40, 43].
Adaptive step sizes were proposed to dynamically incorporate knowledge of the geometry of the observed data [9]. Decreasing step sizes in a stagewise manner is a common practical trick [14, 19, 41]. The trick of momentum is widely used to accelerate SGD by choosing an appropriate search direction per iteration [18, 26, 29]. Proximal operators have been introduced to capture the structure of optimization problems [28], which separates the regularizer and data-fitting terms to achieve a desired regularization effect. The concept of a mirror map has been introduced to induce a Bregman distance reflecting the geometry of the associated optimization problem [3, 25]. Empirical studies have shown that these variants can further improve the practical performance of SGD [4]. In this paper, we consider stochastic composite mirror descent (SCMD), which combines the techniques of mirror maps and proximal operators to capture both the composite and geometric structure of the associated optimization problem.

Theoretical studies of SCMD have attracted much attention since its appearance, and many results have been derived to understand its promising behavior in practice. If the objective function is convex, then the expected suboptimality of the uniform average of iterates decays with the rate O(T^{-1/2}) after T iterations [6–8, 30, 46]. If the objective function is strongly convex, then the expected suboptimality of the uniform average of iterates decays with the rate O(T^{-1} log T) [17, 35], which is unimprovable if the objective function is not smooth [31]. Surprisingly, some averaging schemes have been proposed to achieve the optimal rate O(T^{-1}) [14, 20, 31, 37].

*Corresponding author

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.
Although taking averages of previous iterates can achieve theoretically minimax optimal rates [1], it can have some practical side effects [31, 36, 37]^2. For example, it can destroy the sparsity of the solution, which is often crucial for a proper interpretation of models in many applications. Moreover, averaging can affect the practical training speed due to the existence of some possibly poor iterates in the iterate sequence produced by SCMD [31, 36]. The side effect of destroying sparsity can be resolved by randomly drawing an individual iterate with probabilities proportional to the weights in the optimal averaging scheme. However, this strategy of selecting a random iterate would introduce additional variance, which may further slow down the training speed due to the possibility of selecting a poor iterate. Outputting the last SGD iterate is a natural strategy in practice, which affects neither sparsity nor the practical training speed. However, the suboptimality of the last SGD iterate converges with the suboptimal rates O(T^{-1/2} log T) and O(T^{-1} log T) in the convex and strongly convex cases [37], respectively. Very recently, it was shown that the last SGD iterate can attain at most a rate O(T^{-1} log T) with high probability [13]. These facts motivate us to ask a natural question: can we develop an algorithm which inherits the optimal convergence rate of the averaging strategies as well as the sparsity preservation and fast practical training speed of the last SGD iterate?

In this paper, we aim to give an affirmative answer to the above question by proposing a novel strategy to select an individual iterate able to achieve theoretically optimal rates and fast training speed in practice. We first consider the case with the number of iterations known to us, which allows us to divide the implementation into two stages.
The first stage produces an output with some averaging scheme, which is then used in the second stage to select an individual iterate according to the one-step progress evolution of SCMD. We then extend this algorithm to the online learning setting, where training examples arrive in a sequential way. We show that our algorithm can achieve the optimal rates O(T^{-1/2}) and O(T^{-1}) in the convex and strongly convex case, respectively. Rooted in a careful one-step progress analysis, our selection of iterates is based on the difference of two successive iterates, which shares some spirit with the common heuristic of terminating an algorithm once the distance between two successive iterates falls below a threshold. Our analysis also removes the bounded-subgradient assumptions made in existing discussions of optimal learning rates [14, 20, 31, 37]. We report experimental results to confirm the effectiveness of the proposed algorithm in both attaining a fast training speed and producing a sparse solution.

We present the algorithms with their motivation in Section 2. Theoretical results and discussions are given in Sections 3 and 4. Experimental results and conclusions are presented in Sections 5 and 6.

2 Algorithms with Motivations

2.1 Background

In supervised learning, we aim to infer an unknown relationship between input and output variables from a sequence of training examples {z_t = (x_t, y_t)}_{t∈N} drawn from a probability measure ρ defined on a sample space Z = X × Y with an input space X ⊂ R^d and an output space Y ⊂ R, where d ∈ N is the dimension.
A very elementary and powerful approximation of this relationship is a linear model of the form x ↦ ⟨w, x⟩ with w ∈ R^d, whose behavior at a single training example (x, y) can be quantified by a function f(w, z) = ℓ(⟨w, x⟩, y), where the loss function ℓ : R² ↦ R₊ is convex with respect to (w.r.t.) the first argument and ⟨w, x⟩ denotes the inner product between w and x.

^2 When mentioning averaging, individual iterates or the last iterate, our attention is with respect to the iterate sequence produced by (2.2) below.

The learning process can often be formulated as an optimization problem with a composite structure

min_{w∈R^d} φ(w) = E_z[f(w, z)] + r(w),    (2.1)

where E_z[·] denotes the expectation w.r.t. z and r : R^d ↦ R₊ is a regularizer possibly inducing sparsity. With specific instantiations of ℓ and r, Eq. (2.1) covers many famous learning problems in a unifying framework, including least squares, SVMs, logistic regression, the lasso and the elastic net [8, 39].

SCMD provides an efficient first-order method to exploit the composite and geometric structure of problem (2.1) [8, 24]. It extends SGD by using a strongly convex and differentiable mirror map Ψ to generate an appropriate Bregman distance [3, 25] D_Ψ(w, w̃) = Ψ(w) − Ψ(w̃) − ⟨w − w̃, ∇Ψ(w̃)⟩, where ∇Ψ(w̃) denotes the gradient of Ψ at w̃. Let w_1 ∈ R^d be an initial point and {η_t}_{t∈N} be a step size sequence. Upon the arrival of z_t at the t-th iteration, we calculate a subgradient f′(w_t, z_t) ∈ ∂_w f(w_t, z_t) as an unbiased estimate of F′(w_t) ∈ ∂F(w_t), where ∂_w f(w_t, z_t) denotes the subdifferential of f(·, z_t) at w_t and F(w) = E_z[f(w, z)].
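As a small illustration of the Bregman machinery used throughout (a minimal numpy sketch of our own, not code from the paper; `grad_psi_p` and `bregman_p` are hypothetical names), the distance induced by the p-norm divergence Ψ_p(w) = (1/2)‖w‖_p² can be computed as:

```python
import numpy as np

def grad_psi_p(w, p):
    # Gradient of the p-norm divergence Psi_p(w) = 0.5 * ||w||_p^2:
    # component i equals sign(w_i) * |w_i|^(p-1) * ||w||_p^(2-p).
    norm = np.linalg.norm(w, p)
    if norm == 0.0:
        return np.zeros_like(w)
    return np.sign(w) * np.abs(w) ** (p - 1) * norm ** (2 - p)

def bregman_p(w, w_tilde, p=1.5):
    # D_Psi(w, w~) = Psi(w) - Psi(w~) - <w - w~, grad Psi(w~)>.
    psi = lambda v: 0.5 * np.linalg.norm(v, p) ** 2
    return psi(w) - psi(w_tilde) - np.dot(w - w_tilde, grad_psi_p(w_tilde, p))
```

For p = 2 this reduces to the squared Euclidean distance (1/2)‖w − w̃‖₂², recovering the usual SGD geometry.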
SCMD updates the model by

w_{t+1} = arg min_{w∈R^d} { D_Ψ(w, w_t) + η_t (⟨w − w_t, f′(w_t, z_t)⟩ + r(w)) }.    (2.2)

Intuitively, SCMD uses f′(w_t, z_t) to form a first-order approximation of f(·, z_t) at w_t and uses the Bregman distance D_Ψ(w, w_t) to keep w_{t+1} not far away from the current iterate. The regularizer r is kept intact here to preserve a regularization effect. Typical choices of mirror maps include the p-norm divergence Ψ_p(w) = (1/2)‖w‖_p² (1 < p ≤ 2), which works favorably when the solution of (2.1) is sparse by setting p close to 1 [8, 39]. Here ‖·‖_p is the p-norm defined by ‖w‖_p = (Σ_{i=1}^d |w^(i)|^p)^{1/p} for w = (w^(1), . . . , w^(d))⊤ ∈ R^d. SCMD recovers stochastic proximal gradient descent by taking Ψ = Ψ₂ [7] and stochastic mirror descent by taking r(w) = 0 [14]. It should be mentioned that the underlying probability measure is not necessarily an empirical distribution over training examples, and therefore the setting we consider here is more general than stochastic learning to minimize an empirical risk.

Algorithm 1: SCMDI
Input: {η_t}_t, σ_φ, w_1 and T.
Output: an approximate solution of (2.1)
1  if σ_φ == 0 then
2      sw ← w_1, s ← 1
3  else
4      sw ← 6η_1w_1, s ← 6η_1
5  for t = 1, 2 to T − 1 do
6      calculate w_{t+1} by (2.2)
7      if σ_φ == 0 then
8          sw ← sw + w_{t+1}, s ← s + 1
9      else
10         sw ← sw + (t + 2)(t + 3)η_{t+1}w_{t+1}
11         s ← s + (t + 2)(t + 3)η_{t+1}
12 w̄_T ← sw/s
13 for t = T, T + 1 to 2T − 1 do
14     calculate w_{t+1} by (2.2)
15     △ ← D_Ψ(w̄_T, w_t) − D_Ψ(w̄_T, w_{t+1})
16     if △ ≤ T^{−1}D_Ψ(w̄_T, w_T) then
17         T* ← t, w_{T*} ← w_t
18 return w_{T*}

Algorithm 2: OCMDI
Input: {η_t}_t, σ_φ and w_1.
Output: an approximate solution of (2.1)
1  if σ_φ == 0 then
2      sw ← w_1, s ← 1
3  else
4      sw ← 6η_1w_1, s ← 6η_1
5  w̄ ← w_1, ŵ ← w_1, k ← 1
6  for t = 1, 2, · · · do
7      calculate w_{t+1} by (2.2)
8      △ ← D_Ψ(w̄, w_t) − D_Ψ(w̄, w_{t+1})
9      if △ ≤ 2^{1−k}D_Ψ(w̄, ŵ) then
10         w̃ ← w_t
11     if σ_φ == 0 then
12         sw ← sw + w_{t+1}, s ← s + 1
13     else
14         sw ← sw + (t + 2)(t + 3)η_{t+1}w_{t+1}
15         s ← s + (t + 2)(t + 3)η_{t+1}
16     if t == 2^k − 1 then
17         k ← k + 1, w̄ ← sw/s, ŵ ← w_t
18 return w̃

2.2 Algorithms

We now present our first algorithm (Algorithm 1), which performs SCMD in two stages. In the first stage, we simply perform updates according to (2.2). An output w̄_T is then produced by taking some averages of the previous iterates. It is clear that w̄_T is a simple average with uniform weights in the convex case, while the construction of w̄_T in the strongly convex case is motivated by theoretical analysis (see Theorem 4) [20]. In the second stage, other than updating {w_t} with (2.2), we also search for a time index (line 17 of Algorithm 1) at which the difference between two consecutive Bregman distances is less than a small number. We refer to Algorithm 1 as Stochastic Composite Mirror Descent with Individual iterates (SCMDI) since it outputs an individual iterate of the vanilla SCMD.

Remark 1. For any t, denote A_t := D_Ψ(w̄_T, w_t) − D_Ψ(w̄_T, w_{t+1}). It should be mentioned that the function t ↦ A_t is not monotonic in t.
Therefore, not all t ≥ T* can necessarily satisfy the condition in line 16 of Algorithm 1, which is a requirement to achieve optimal convergence rates in our analysis.

In Algorithm 1, we update T* once we find a new t satisfying A_t ≤ T^{−1}D_Ψ(w̄_T, w_T). Another feasible strategy is to set T* = arg min_{t∈{T,T+1,...,2T−1}} A_t. However, experiments show that this strategy does not work as well as Algorithm 1.

Algorithm 2 is an extension of Algorithm 1 obtained by removing the information on T. It differs from the vanilla SCMD in two aspects. Firstly, it computes some average w̄ of previous iterates at the (2^k − 1)-th iteration, k ∈ N. Secondly, it searches for an index t such that D_Ψ(w̄, w_t) − D_Ψ(w̄, w_{t+1}) ≤ 2^{1−k}D_Ψ(w̄, ŵ) and outputs this individual iterate. Algorithm 2 with 2^k ≤ t < 2^{k+1} recovers Algorithm 1 with T = 2^k or T = 2^{k−1}, depending on whether an index t ≥ 2^k satisfying A_t ≤ 2^{1−k}D_Ψ(w̄, ŵ) is found or not. Therefore, Algorithm 2 can achieve the same convergence rates as Algorithm 1 in both the convex and strongly convex cases. It does not need the information on T and therefore applies to the online learning setting. We refer to Algorithm 2 as Online Composite Mirror Descent with Individual iterates (OCMDI).

2.3 Motivation

Before giving theoretical results, we sketch the key idea underlying the design of Algorithm 1, which also forms a key foundation of our theoretical analysis. Let w* = arg min_{w∈R^d} φ(w). Typically we can get the following one-step error bound for {w_t}_{t∈N} produced by (2.2) (Lemma A.1 in the Appendix)

η_t E[φ(w_t) − φ(w)] ≤ E[D_Ψ(w, w_t) − D_Ψ(w, w_{t+1})] + η_t² C̃    (2.3)

for any w ∈ R^d independent of z_t. Here C̃ is a constant independent of t. If we choose w = w* and can show E[D_Ψ(w*, w_t) − D_Ψ(w*, w_{t+1})] = O(η_t²), then we would immediately obtain E[φ(w_t)] − φ(w*) = O(η_t). This would immediately imply the rate O(1/√t) in the convex case and the rate O(1/t) in the strongly convex case, since in these two cases the typical step size choices are η_t = 1/√t and η_t = 1/t (ignoring constant factors), respectively [8]. Since w* is unknown, any algorithm requiring access to w* is not implementable. A good surrogate w̄_T of w* should satisfy E[φ(w̄_T)] − φ(w*) = O(η_T) to enjoy a tight rate, for which a natural choice should be some average of {w_t}_{t≤T}. Then, we take w = w̄_T in (2.3) and need to find an index T* ∈ {T, T + 1, . . . , 2T − 1} such that E[D_Ψ(w̄_T, w_{T*}) − D_Ψ(w̄_T, w_{T*+1})] = O(η_T²). By the non-negativity of the Bregman distance, there always exists a T* ∈ {T, T + 1, . . . , 2T − 1} satisfying (see Lemma A.2 in the Appendix)

D_Ψ(w̄_T, w_{T*}) − D_Ψ(w̄_T, w_{T*+1}) ≤ T^{−1}D_Ψ(w̄_T, w_T).

This motivates us to search for the time index T* by Algorithm 1 (line 17).
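The second-stage search can be sketched as follows (a schematic numpy illustration of our own, not the authors' code; `select_individual_iterate` and `bregman` are hypothetical names, and `iterates` stands for the window w_T, . . . , w_{2T}):

```python
import numpy as np

def select_individual_iterate(iterates, w_bar, bregman):
    # Second stage of Algorithm 1 (sketch): iterates = [w_T, ..., w_{2T}],
    # w_bar is the first-stage average, bregman(u, v) computes D_Psi(u, v).
    T = len(iterates) - 1
    threshold = bregman(w_bar, iterates[0]) / T   # T^{-1} D_Psi(w_bar, w_T)
    t_star = 0
    for t in range(T):
        delta = bregman(w_bar, iterates[t]) - bregman(w_bar, iterates[t + 1])
        if delta <= threshold:
            t_star = t   # keep updating, as in line 17 of Algorithm 1
    return iterates[t_star]
```

Since the differences telescope to D_Ψ(w̄_T, w_T) − D_Ψ(w̄_T, w_{2T}) ≤ D_Ψ(w̄_T, w_T), at least one index in the window satisfies the threshold, so the search never comes up empty.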
It is clear from (2.3) that

E[φ(w_{T*}) − φ(w̄_T)] ≤ (Tη_{T*})^{−1}E[D_Ψ(w̄_T, w_T)] + η_{T*}C̃.    (2.4)

The term E[φ(w_{T*})] − φ(w*) then can be estimated by the following error decomposition

E[φ(w_{T*})] − φ(w*) = E[φ(w_{T*}) − φ(w̄_T)] + E[φ(w̄_T) − φ(w*)]
                     ≤ (Tη_{T*})^{−1}E[D_Ψ(w̄_T, w_T)] + η_{T*}C̃ + E[φ(w̄_T) − φ(w*)].    (2.5)

To derive the desired bound E[φ(w_{T*})] − φ(w*) = O(η_T), it suffices to show

(Tη_{T*})^{−1}E[D_Ψ(w̄_T, w_T)] = O(η_T)  and  E[φ(w̄_T) − φ(w*)] = O(η_T).

More specifically, we need to show E[D_Ψ(w̄_T, w_T)] = O(1), E[φ(w̄_T) − φ(w*)] = O(T^{−1/2}) in the convex case with η_t = 1/√t, and E[D_Ψ(w̄_T, w_T)] = O(T^{−1}), E[φ(w̄_T) − φ(w*)] = O(T^{−1}) in the strongly convex case with η_t = 1/t. We will show this is possible by choosing T* in Algorithm 1.

3 Optimal Convergence Rates

We present here optimal convergence rates for SCMDI for both convex and strongly convex objectives. To this aim, we need to impose some standard assumptions. We assume that the mirror map Ψ is σ_Ψ-strongly convex w.r.t. a norm ‖·‖ in the sense D_Ψ(w, w̃) ≥ 2^{−1}σ_Ψ‖w − w̃‖² for all w, w̃ ∈ R^d (σ_Ψ > 0). We always assume the existence of A, B > 0 such that (‖·‖* is the dual norm of ‖·‖)

‖f′(w, z)‖*² ≤ Af(w, z) + B  and  ‖r′(w)‖*² ≤ Ar(w) + B    (3.1)

for any w ∈ R^d, z ∈ Z and any f′(w, z) ∈ ∂f(w, z), r′(w) ∈ ∂r(w). Many popular learning methods use loss functions and regularizers satisfying (3.1) [7, 44]. For example, if |ℓ′(a, y)|² ≤ Ãℓ(a, y) + B̃ for some Ã, B̃ > 0 and all a, y, then f(w, z) = ℓ(⟨w, x⟩, y) would satisfy the first inequality of (3.1) if X is bounded. Here ℓ′(a, y) denotes a subgradient of ℓ w.r.t. the first argument. Examples of such ℓ include all smooth functions and all Lipschitz continuous functions widely used in machine learning [42, 44]. Examples of r satisfying (3.1) include r(w) = λ‖w‖_p^p with p ∈ [1, 2]. We say Ψ is L_Ψ-smooth w.r.t. ‖·‖ if D_Ψ(w, w̃) ≤ (L_Ψ/2)‖w − w̃‖² for all w, w̃ ∈ R^d. We always assume the existence of σ_F, σ_r ≥ 0 such that for all w, w̃ ∈ R^d

F(w) − F(w̃) − ⟨w − w̃, F′(w̃)⟩ ≥ σ_F D_Ψ(w, w̃),  r(w) − r(w̃) − ⟨w − w̃, r′(w̃)⟩ ≥ σ_r D_Ψ(w, w̃).    (3.2)

For σ_φ := σ_F + σ_r, the cases σ_φ = 0 and σ_φ > 0 correspond to convex and strongly convex objectives, respectively.

3.1 Convex objectives

Our first result is an optimal rate O(T^{−1/2}) for convex objectives under a boundedness assumption on the iterates.
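As a concrete check of condition (3.1) (our own computation, not taken from the paper), consider the hinge loss and the least squares loss:

```latex
% Hinge loss \ell(a,y) = \max\{0,\, 1 - ya\} with |y| \le 1: every subgradient
% obeys |\ell'(a,y)| \le 1, so the condition on \ell holds with
% \tilde A = 0,\ \tilde B = 1:
|\ell'(a,y)|^2 \;\le\; 1 \;=\; \tilde A\,\ell(a,y) + \tilde B.
% Least squares \ell(a,y) = (a-y)^2: here \ell'(a,y) = 2(a-y), so the
% condition holds with \tilde A = 4,\ \tilde B = 0:
|\ell'(a,y)|^2 \;=\; 4(a-y)^2 \;=\; 4\,\ell(a,y).
```

Together with a bounded input space X, these give the first inequality of (3.1) for the corresponding f(w, z) = ℓ(⟨w, x⟩, y).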
The boundedness assumptions E[D_Ψ(w*, w_t)] ≤ D, ∀t ∈ N, imposed also in the literature [7, 8, 37], always hold for regularizers of the form r(w) = I_{W₀}(w) + r̃(w), where W₀ is a convex and compact domain, I_{W₀} is the indicator function with I_{W₀}(w) = 0 if w ∈ W₀ and ∞ otherwise, and r̃ : R^d → R₊ is convex. Proofs of the theoretical results in this subsection are given in Appendix B.

Theorem 1. Let D > 0. Assume E[D_Ψ(w*, w_t)] ≤ D for all t ∈ N and E[D_Ψ(w̄_T, w_T)] ≤ D. Let w_{T*} be defined by Algorithm 1 with η_t = μ/√t satisfying μ ≤ σ_Ψ(2A)^{−1}. Then E[φ(w_{T*}) − φ(w*)] ≤ C̃₁/√T, where C̃₁ = (4 + 2√2)μ^{−1}D + 10μσ_Ψ^{−1}(Aφ(w*) + 2B).

Theorem 1 requires imposing a boundedness assumption on the iterates, which is removed in the following theorem on convergence rates. The assumption D_Ψ(w, w̃) ≤ L_Ψ‖w − w̃‖^α is milder than assuming smoothness of Ψ, the latter of which is satisfied if Ψ = Ψ₂.

Theorem 2. Let w_{T*} be produced by Algorithm 1 with a non-increasing step size sequence and η₁ ≤ σ_Ψ(2A)^{−1}. Assume that there exist α ∈ [0, 2] and L_Ψ > 0 such that D_Ψ(w, w̃) ≤ L_Ψ‖w − w̃‖^α for all w, w̃ ∈ R^d. Let D_T = D_Ψ(w*, w₁) + σ_Ψ^{−1}(Aφ(w*) + 2B)Σ_{t=1}^T η_t². Then,

E[φ(w_{T*})] − φ(w*) ≤ (2L_Ψ[4σ_Ψ^{−1}αD_T + 1] + 4D_T)/(Tη_{2T}) + ((2Aφ(w*) + 4B)/σ_Ψ)[η_T + (2/T)Σ_{t=1}^T η_t].    (3.3)

Corollary 3. Suppose the assumptions in Theorem 2 hold. (a) If lim_{t→∞} (Σ_{t̃=1}^t η_{t̃}²)/(tη_t) = 0, then lim_{T→∞} E[φ(w_{T*})] = φ(w*). (b) If η_t = μ/√T for t = 1, . . . , 2T, then E[φ(w_{T*})] − φ(w*) = O(T^{−1/2}).

3.2 Strongly convex objectives

We now turn to Algorithm 1 for strongly convex objectives, towards which a first step is the following theorem on the performance of the vanilla SCMD. Theorem 4 shows that both the suboptimality of some weighted averaged iterates and the Bregman distance between the last iterate and w* converge with the rate O(T^{−1}). Theorem 4 is an extension of the results in [20] on projected SGD to SCMD. The discussion in [20] requires imposing a boundedness assumption on subgradients of the form E_Z[‖f′(w_t, Z)‖₂²] ≤ D̃ for some D̃ > 0 and all t ∈ N. Indeed, an independent section was included in [20] to show that this boundedness assumption holds for an SVM-like objective. Theorem 4 shows that we can derive the same convergence rate without this boundedness assumption and is therefore more widely applicable. Theorem 5 gives a sufficient condition on step size sequences for convergence. Upper and lower bounds on convergence rates are established in Theorem 6 and Theorem 7, respectively. The proofs of the theoretical results in this subsection are given in Appendix C. If σ_r = 0, then the w̄_T defined in Theorem 4 can be simplified as w̄_T = [Σ_{t=1}^T (t + 1)]^{−1} Σ_{t=1}^T (t + 1)w_t.

Theorem 4. Assume (3.2) holds with σ_φ > 0. Let {w_t}_{t∈N} be generated by (2.2) with η_t = 2/(σ_φt + 2σ_F). Define

w̄_T = [Σ_{t=1}^T (t + 1)(t + 2)η_t]^{−1} Σ_{t=1}^T (t + 1)(t + 2)η_t w_t.

Then, there exists a constant C̃₂ > 0 independent of T and σ_φ (explicitly given in the proof) such that

E[φ(w̄_T) − φ(w*)] ≤ 4C̃₂/(σ_φ(T + 2))  and  E[D_Ψ(w*, w_{T+1})] ≤ C̃₂/(σ_φ²(T + 1)).    (3.4)

Theorem 5. Assume (3.2) holds with σ_φ > 0. Let {w_t}_{t∈N} be generated by (2.2). If lim_{t→∞} η_t = 0 and Σ_{t=1}^∞ η_t = ∞, then lim_{T→∞} E[D_Ψ(w*, w_T)] = 0.

With Theorem 4 at hand, we can provide optimal rates for the output produced by Algorithm 1.

Theorem 6. Assume (3.2) holds with σ_φ > 0. Let w_{T*} be generated by Algorithm 1 with η_t = 2/(σ_φt + 2σ_F). Assume that Ψ is L_Ψ-strongly smooth (L_Ψ > 0). If T ≥ 4A/(σ_Ψσ_φ), then there exists a constant C̃₃ > 0 independent of T and σ_φ (explicitly given in the proof) such that E[φ(w_{T*}) − φ(w*)] ≤ C̃₃/(Tσ_φ).

Theorem 7 presents lower bounds matching the above upper bounds up to a constant factor, which shows that the selected iterate has achieved the best possible rate and no stronger results are available. The matching upper and lower bounds here apply to the specific SGD and specific optimization problems, while minimax bounds are related to the existence of a hard problem for any algorithm [1]. Similar results were derived in [21] for online mirror descent. Here we give a different and simpler proof.

Theorem 7.
Let {w_t}_t be the sequence produced by (2.2) with Ψ(w) = (1/2)‖w‖₂² and r(w) = 0. Suppose f is differentiable, φ is L_φ-smooth w.r.t. ‖·‖₂ and η_t ≤ 1/(2L_φ). If for any w ∈ R^d, E[‖∇f(w, z) − ∇F(w)‖₂²] ≥ σ² for some σ > 0, then

E[‖w_{t+1} − w*‖₂²] ≥ min{‖w₁ − w*‖₂², η₁σ²/(2L_φ), . . . , η_tσ²/(2L_φ)}.

4 Discussions

We discuss related work on stochastic/online learning algorithms with different averaging schemes. A very common scheme is to output an average of all iterates with uniform weights (UNI-AVE) [44, 46], which is able to attain the optimal rate O(1/√T) and the suboptimal rate O(T^{−1} log T) in the convex and strongly convex case, respectively [8]. A counterpart result shows that the rate O(T^{−1} log T) is the best possible for this simple averaging scheme [31].

For strongly convex objectives, the first algorithm able to attain the optimal rate O(T^{−1}) is the epoch-GD algorithm proposed in [14], which performs stochastic mirror descent in each epoch with a fixed step size. The step size decreases exponentially in each consecutive epoch and the averaged iterate of the last epoch is output as the solution.
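For concreteness, epoch-GD can be sketched as follows (our own schematic rendering of the scheme in [14], with hypothetical names; `sgd_epoch` stands for an inner stochastic mirror descent loop, and halving/doubling is one standard instance of the exponential schedule):

```python
import numpy as np

def epoch_gd(w0, sgd_epoch, n_epochs, eta0, T0):
    # Epoch-GD (sketch): run SGD with a fixed step size within each epoch,
    # shrink the step size (and grow the epoch length) between epochs, and
    # return the averaged iterate of the last epoch as the solution.
    w, eta, T = w0, eta0, T0
    for _ in range(n_epochs):
        iterates = sgd_epoch(w, eta, T)   # list of T iterates started at w
        w = np.mean(iterates, axis=0)     # average of the current epoch
        eta, T = eta / 2.0, 2 * T
    return w
```

The averaging here is confined to a single epoch, which is why the scheme attains O(T^{−1}) where uniform averaging over all iterates does not.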
Since then, several other averaging schemes have been developed to attain optimal rates, including a suffix-averaging scheme (SUFFIX) returning the uniform average of the last half of SGD iterates [31], a weighted averaging scheme (WEI-AVE) with a weight of t + 1 for w_t [20], and a polynomial-decay averaging scheme [37].

Motivated by the side effects of averaging in either destroying the sparsity of the solution or slowing down the training speed [31], the behavior of the last iterate (LAST) has also received a lot of attention. For smooth objective functions, SGD with the last iterate can achieve the optimal rate [31]. The analysis of the last iterate for non-smooth objective functions is much more challenging, for which an interesting technique relating the last iterate to a suffix-average of iterates was developed [37, 44]. Based on this, the rates O(T^{−1/2} log T) and O(T^{−1} log T) were established in the convex and strongly convex cases [23, 37], respectively. These results motivate a natural question of whether the last iterate achieves the optimal rate in the non-smooth scenario. This open question was resolved in [13] by constructing a problem for which SGD with the last iterate converges no faster than Ω(T^{−1} log T). An interesting high-probability rate O(T^{−1} log T) was also developed for SGD with the last iterate [13].

All the above mentioned methods require an averaging scheme to attain the optimal convergence rates, which, however, may destroy the sparsity of solutions and slow down the practical training speed. A simple scheme to preserve the sparsity of solutions while still enjoying the optimal rate is to draw an individual iterate randomly from the iterate sequence produced by (2.2) [10, 16]. Indeed, one can randomly draw the output according to a distribution over the iterate sequence with the probability mass function determined by any optimal weighted averaging scheme [20, 31, 37].
However, this scheme introduces additional variance due to the random choice of the output iterate. Moreover, as we will verify in experiments, this scheme can slow down the practical training speed since the randomly selected iterate is not necessarily a favorable one in the iterate sequence. Furthermore, it requires either storing the whole iterate sequence or determining the number of iterations beforehand, and therefore does not apply to the online learning scenario. After the acceptance of this paper, we noticed an interesting step-size modification scheme for obtaining optimal convergence rates [15]. However, the step-size scheme there requires the information of T, and is therefore not applicable to the online learning scenario. Furthermore, the algorithm and optimal convergence rates there are developed for the particular T-th iterate and not for other iterates.

In this paper, we propose a novel stochastic/online learning algorithm able to achieve optimal rates. As compared to the averaging schemes in [14, 20, 31, 37], our scheme of outputting an individual iterate is able to preserve the sparsity structure of solutions. As compared to the scheme of randomly choosing an individual iterate, our method avoids the added variance as well as the drawback of slowing down the training speed due to the random iterate index. Our idea is not to output the last iterate but to select an individual iterate based on a careful analysis of the one-step progress bound sketched in Section 2.3. Indeed, by (2.3), the quality of w_t depends on A_t = D_Ψ(w̄_T, w_t) − D_Ψ(w̄_T, w_{t+1}). The smaller A_t is, the better the quality. This motivates us to select a T* with small A_{T*}. Intuitively, A_t is related to the distance between w_t and w_{t+1}. Therefore, our selection of the iterate shares some spirit with the widely used heuristic of terminating the algorithm when the successive iterates are close.
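The random-selection baseline just described can be sketched as follows (a minimal numpy sketch under our own naming; `rand_baseline_index` is a hypothetical helper, and the weights follow the (t + 1)(t + 2)η_t weighted averaging scheme of Theorem 4):

```python
import numpy as np

def rand_baseline_index(T, eta, rng):
    # Draw one iterate index in {1, ..., T} with probability proportional to
    # the weights (t+1)(t+2)*eta_t of an optimal weighted averaging scheme;
    # eta is a callable mapping t to the step size eta_t.
    t = np.arange(1, T + 1)
    weights = (t + 1) * (t + 2) * eta(t)
    probs = weights / weights.sum()
    return rng.choice(t, p=probs)
```

Note that this helper needs the horizon T (or the stored iterate sequence) up front, which is exactly why the scheme does not carry over to the online setting.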
Our analysis also refines existing studies of optimal rates of SGD by removing the boundedness assumption on subgradients [20, 31, 37]. It should be mentioned that the boundedness assumption was relaxed to E_z[‖f′(w_t, z)‖_*²] ≤ Ã + B̃‖F′(w_t)‖_*² for some Ã, B̃ > 0 in [4] and removed in [27]; the latter, however, requires the objective function to be smooth.

5 Experimental Results

In this section, we justify the effectiveness of our algorithm by presenting experimental comparisons with the following averaging strategies: WEI-AVE [20], UNI-AVE, LAST, SUFFIX [31] and RAND (outputting a random iterate chosen from the uniform distribution over the last half of iterates). We consider two applications: binary classification and tomography reconstruction in image processing.

5.1 Binary classification

We first consider SVM models with linear kernels. We use 16 real-world datasets whose information is summarized in Table C.1 in Appendix D.1³. For each dataset, we use 80 percent of the data for training and reserve the remaining 20 percent for testing. The objective function we consider for a training dataset {(x₁, y₁), . . . , (xₙ, yₙ)} is

    φ(w) = (λ/2)‖w‖₂² + (1/n) Σ_{i=1}^n max{0, 1 − y_i⟨w, x_i⟩},

which is λ-strongly convex w.r.t. ‖·‖₂. Analogously to [20], we set λ to the reciprocal of the training sample size. For datasets with multiple class labels, we group the first half of the labels into the positive class and the second half into the negative class. We consider step sizes of the form η_t = μ/(λt) and tune the parameter μ in the set {2^−12, 2^−11, . . . , 2^4} by 10-fold cross-validation. We repeat the experiment 40 times and report the average of the results.
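As a sanity check on the formula above, φ can be evaluated in vectorized form. This small sketch (our own, with made-up data) also illustrates the λ = 1/n convention:

```python
import numpy as np

def svm_objective(w, X, y, lam):
    """phi(w) = (lam/2)*||w||_2^2 + (1/n) * sum_i max(0, 1 - y_i*<w, x_i>)."""
    hinge = np.maximum(0.0, 1.0 - y * (X @ w))
    return 0.5 * lam * np.dot(w, w) + hinge.mean()

# toy data; lam is set to the reciprocal of the training sample size
X = np.array([[1.0, 2.0], [-1.0, 0.5], [0.0, -1.0]])
y = np.array([1.0, -1.0, 1.0])
lam = 1.0 / len(y)
print(svm_objective(np.zeros(2), X, y, lam))  # prints 1.0: at w = 0 every hinge term is 1
```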
We consider two approaches to optimize the objective function φ, according to whether the regularizer and the data-fitting term are separated. In the first approach, we apply (2.2) with Ψ(w) = ½‖w‖₂², f(w, z) = (λ/2)‖w‖₂² + max{0, 1 − y⟨w, x⟩} and r(w) = 0, which is just SGD. In the second approach, we apply (2.2) with Ψ(w) = ½‖w‖₂², f(w, z) = max{0, 1 − y⟨w, x⟩} and r(w) = (λ/2)‖w‖₂², which is a stochastic proximal gradient descent (SPGD).
In Figure 1 and Figure 2, we plot the objective function values on the testing dataset against iteration numbers for SGD and SPGD, respectively. It is clear that UNI-AVE is always the worst strategy in our experiments. WEI-AVE assigns more weight to recent iterates and improves the performance; it is in turn outperformed by SUFFIX. RAND fluctuates a bit since the randomly selected index at the t-th iteration is not necessarily an increasing function of t. LAST is overlapped by OCMDI in the plots due to their similar behavior; these two behave best in our experiments, especially at the beginning of the optimization process. The similarity between OCMDI and LAST can be explained as follows: if A_t (defined in Section 4) is small, then A_{t+1} is likely to be small as well. Therefore, our algorithm is prone to select an iterate from the last part of the sequence.

³We display experimental results for 4 datasets here due to the space limit. Complete results are in the Appendix.
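To make the distinction between the two approaches concrete, one update of each might look as follows (our own sketch with a generic step size; the proximal map of r(w) = (λ/2)‖w‖₂² with respect to ½‖·‖₂² is w ↦ w/(1 + ηλ)):

```python
import numpy as np

def hinge_subgrad(w, x, y):
    """A subgradient of w -> max{0, 1 - y*<w, x>}."""
    return -y * x if 1.0 - y * np.dot(w, x) > 0 else np.zeros_like(w)

def sgd_step(w, x, y, lam, eta):
    """First approach: regularizer kept inside f, plain subgradient step."""
    return w - eta * (lam * w + hinge_subgrad(w, x, y))

def spgd_step(w, x, y, lam, eta):
    """Second approach: subgradient step on the hinge term only, followed by
    the proximal map of r(w) = (lam/2)*||w||^2, i.e. division by 1 + eta*lam."""
    return (w - eta * hinge_subgrad(w, x, y)) / (1.0 + eta * lam)

w = np.zeros(2)
x, y_i = np.array([1.0, 0.0]), 1.0   # margin violated at w = 0
w_sgd = sgd_step(w, x, y_i, lam=0.5, eta=0.1)    # moves eta along +x
w_spgd = spgd_step(w, x, y_i, lam=0.5, eta=0.1)  # same move, then shrunk
```

The proximal step shrinks the whole vector multiplicatively rather than subtracting λw, which is exactly how SPGD treats the regularizer differently from SGD.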
We also plot the objective function values on training datasets against T; these show similar behavior and are deferred to Figures D.4 and D.3 in the Appendix.

Figure 1: Objective function values on test datasets versus iteration numbers for SGD. Panels: (a) german, (b) splice, (c) w8a, (d) letter.

Figure 2: Objective function values on test datasets versus iteration numbers for SPGD. Panels: (a) german, (b) splice, (c) w8a, (d) letter.

5.2 Tomography reconstruction

We now consider tomography reconstruction in image processing. We use the AIR toolbox [12] to create a CT-measurement matrix A ∈ R^{n×d}, an output vector y† ∈ R^n and an N × N sparse image encoded by a vector w† ∈ R^d with d = N². Each row of A, together with the corresponding entry of y†, encodes a line integral from a fan-beam projection geometry. Therefore, the true image w† satisfies Aw† = y†. We consider a noisy case by adding Gaussian noise N(0, (0.05|y†_i|)²) to y†_i, obtaining the noisy output y. Our aim is to reconstruct the image w† from the matrix A and the noisy measurements y by finding an approximate solution of the equation Aw = y. Since many components of the true image w† vanish, we apply the randomized sparse Kaczmarz algorithm, a simple and efficient algorithm to generate sparse approximate solutions of linear systems [5, 34]. Let ϵ > 0, λ > 0 be two parameters and consider the mirror map Ψ_{(ϵ,λ)}(w) = λ Σ_{i=1}^d g_ϵ(w^{(i)}) + ½‖w‖₂², where g_ϵ(ξ) = ξ²/(2ϵ) for |ξ| ≤ ϵ and g_ϵ(ξ) = |ξ| − ϵ/2 for |ξ| > ϵ. At each iteration t, we draw an index i_t from the uniform distribution over {1, . . . , n} and denote x_{i_t} = A⊤_{i_t}, where A⊤_{i_t} is the transpose of the i_t-th row A_{i_t}.
Given w₁ ∈ R^d and v₁ = ∇Ψ_{(ϵ,λ)}(w₁), the randomized sparse Kaczmarz algorithm updates the model as

    v_{t+1} = v_t − η_t(⟨w_t, x_{i_t}⟩ − y_{i_t})x_{i_t},   w_{t+1} = S_{λ,ϵ}(v_{t+1}),   (5.1)

where S_{λ,ϵ} : R^d → R^d is defined component-wise by the soft-thresholding function S_{λ,ϵ} : R → R given as [5]

    S_{λ,ϵ}(ξ) := ξϵ/(λ + ϵ) if |ξ| ≤ λ + ϵ,   and   S_{λ,ϵ}(ξ) := sgn(ξ) max(|ξ| − λ, 0) otherwise.

Here sgn(a) denotes the sign of a ∈ R. Algorithm (5.1) can be equivalently formulated as [22]

    w_{t+1} = arg min_{w ∈ R^d} η_t⟨w − w_t, (⟨w_t, x_{i_t}⟩ − y_{i_t})x_{i_t}⟩ + D_{Ψ_{(ϵ,λ)}}(w, w_t).

It is clear from this equivalent formulation that (5.1) is an instantiation of (2.2) with f(w, z) = ½(⟨w, x⟩ − y)², r(w) = 0 and Ψ(w) = Ψ_{(ϵ,λ)}(w) to minimize F(w) = (1/(2n))‖Aw − y‖₂². It was shown that F(w) satisfies (3.2) with σ_F = σ_min(A⊤A/n), w = w* and w̃ = w_t, t ∈ N [22], where σ_min(Ã) denotes the minimal positive eigenvalue of a matrix Ã. Therefore, our theoretical analysis in Section 3.2 applies⁴. We randomly choose w₁ from the uniform distribution on [0, 1]^d and set λ = 1, ϵ = 10⁻⁸ as suggested in [5]. We repeat the experiment 40 times and report the average of the results.

Figure 3: Tomography reconstruction with N = 64, n = 23040 and 5% relative noise. Panels (a) and (b) show the true image and the image reconstructed by OCMDI, respectively.
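A compact implementation of the update (5.1) is straightforward. The sketch below (our own; it uses the classical Kaczmarz step size η_t = 1/‖x_{i_t}‖₂² for illustration rather than the step sizes analyzed in this paper) recovers a sparse solution of a small consistent toy system:

```python
import numpy as np

def soft_threshold(v, lam, eps):
    """Component-wise S_{lam,eps} from (5.1): v*eps/(lam+eps) when
    |v| <= lam+eps, and sgn(v)*max(|v| - lam, 0) otherwise."""
    small = np.abs(v) <= lam + eps
    return np.where(small, v * eps / (lam + eps),
                    np.sign(v) * np.maximum(np.abs(v) - lam, 0.0))

def grad_mirror_map(w, lam, eps):
    """Gradient of Psi_{(eps,lam)}: lam * g_eps'(w) + w, where
    g_eps'(xi) = xi/eps for |xi| <= eps and sgn(xi) otherwise."""
    g = np.where(np.abs(w) <= eps, w / eps, np.sign(w))
    return lam * g + w

def sparse_kaczmarz(A, y, T, lam=1.0, eps=1e-8, seed=0):
    """Randomized sparse Kaczmarz (5.1) with eta_t = 1/||x_{i_t}||^2."""
    rng = np.random.default_rng(seed)
    n, d = A.shape
    w = rng.uniform(size=d)              # w_1 uniform in [0, 1]^d
    v = grad_mirror_map(w, lam, eps)     # v_1 = grad Psi(w_1)
    for _ in range(T):
        i = rng.integers(n)              # i_t uniform over {1, ..., n}
        x = A[i]
        v = v - (x @ w - y[i]) / (x @ x) * x
        w = soft_threshold(v, lam, eps)
    return w

# small consistent system with a sparse solution
rng = np.random.default_rng(42)
w_true = np.zeros(5)
w_true[[0, 3]] = [3.0, -2.0]
A = rng.normal(size=(30, 5))
y = A @ w_true
w_hat = sparse_kaczmarz(A, y, T=20000)
```

Because the thresholding is applied to the dual variable v rather than to w directly, the iterates stay exactly sparse whenever |v_j| falls below the threshold, which is the mechanism behind the sparse outputs reported below.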
Panels (c) and (d) of Figure 3 plot the errors and NNCs versus iteration numbers.
In Figure 3, we present results with N = 64, d = 4096 and n = 23040. Panels (a) and (b) display the true image and the image output by OCMDI, respectively. In Panels (c) and (d), we plot the errors and the numbers of non-zero components (NNCs) for the outputs of the algorithms with different averaging strategies. According to Figure 3, LAST is overlapped by OCMDI since they behave analogously; both achieve the fastest training speed among all considered methods. RAND is outperformed by SUFFIX in terms of training speed. Moreover, OCMDI, RAND and LAST produce much sparser solutions than the other methods. Indeed, the true image w† has 1686 non-zero components, while UNI-AVE, WEI-AVE, SUFFIX, RAND and OCMDI produce outputs with 4096, 3727, 1943, 1823 and 1828 non-zero components, respectively. The advantage of OCMDI over LAST is that OCMDI provably achieves the optimal convergence rate. A similar phenomenon appears for the case with N = 32 and n = 11520, which we plot in Figure D.5 in the Appendix.

6 Conclusion

We propose a novel variant of SCMD with optimal convergence rates. An advantage of this algorithm over existing optimal learning algorithms is that it outputs an individual iterate without a random selection of the index, which preserves the sparsity structure without slowing down the training speed. Experimental results in both binary classification and tomography reconstruction demonstrate the ability of our algorithm to achieve a fast training speed as well as to produce sparse solutions. Interesting future work includes a theoretical justification of the sparsity advantage of our method over other methods [11, 38] and an extension of our analysis to non-convex problems [10, 32, 45].

Acknowledgments

The work of Y. Lei, P. Yang and K.
Tang is supported partially by the National Key Research and Development Program of China (Grant No. 2017YFB1003102), the National Natural Science Foundation of China (Grant Nos. 61806091, 61806090 and 61672478), the Program for University Key Laboratory of Guangdong Province (Grant No. 2017KSYS008) and the Shenzhen Peacock Plan (Grant No. KQTD2016112514355531). The work of D.-X. Zhou is supported partially by the Research Grants Council of Hong Kong [Project No. CityU 11338616] and by the National Natural Science Foundation of China [Grant No. 11671307]. Y. Lei also acknowledges a Humboldt Research Fellowship from the Alexander von Humboldt Foundation.

⁴In our analysis, we only use (3.2) for w = w* and w̃ = w_t (a restricted strong convexity in the literature).

References

[1] A. Agarwal, M. J. Wainwright, P. L. Bartlett, and P. K. Ravikumar. Information-theoretic lower bounds on the oracle complexity of convex optimization. In Advances in Neural Information Processing Systems, pages 1-9, 2009.
[2] F. Bach and E. Moulines. Non-strongly-convex smooth stochastic approximation with convergence rate O(1/n). In Advances in Neural Information Processing Systems, pages 773-781, 2013.
[3] A. Beck and M. Teboulle. Mirror descent and nonlinear projected subgradient methods for convex optimization. Operations Research Letters, 31(3):167-175, 2003.
[4] L. Bottou, F. E. Curtis, and J. Nocedal. Optimization methods for large-scale machine learning. SIAM Review, 60(2):223-311, 2018.
[5] J.-F. Cai, S. Osher, and Z. Shen. Linearized Bregman iterations for compressed sensing. Mathematics of Computation, 78(267):1515-1536, 2009.
[6] N. Cesa-Bianchi, A. Conconi, and C. Gentile. On the generalization ability of on-line learning algorithms. IEEE Transactions on Information Theory, 50(9):2050-2057, 2004.
[7] J. Duchi and Y. Singer. Efficient online and batch learning using forward backward splitting.
In Advances in Neural Information Processing Systems, pages 495-503, 2009.
[8] J. Duchi, S. Shalev-Shwartz, Y. Singer, and A. Tewari. Composite objective mirror descent. In Conference on Learning Theory, pages 14-26, 2010.
[9] J. Duchi, E. Hazan, and Y. Singer. Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 12:2121-2159, 2011.
[10] S. Ghadimi and G. Lan. Stochastic first- and zeroth-order methods for nonconvex stochastic programming. SIAM Journal on Optimization, 23(4):2341-2368, 2013.
[11] Z.-C. Guo, D.-H. Xiang, X. Guo, and D.-X. Zhou. Thresholded spectral algorithms for sparse approximations. Analysis and Applications, 15(03):433-455, 2017.
[12] P. C. Hansen and M. Saxild-Hansen. AIR Tools: a MATLAB package of algebraic iterative reconstruction methods. Journal of Computational and Applied Mathematics, 236(8):2167-2178, 2012.
[13] N. J. A. Harvey, C. Liaw, Y. Plan, and S. Randhawa. Tight analyses for non-smooth stochastic gradient descent. In Conference on Learning Theory, pages 1579-1613, 2019.
[14] E. Hazan and S. Kale. Beyond the regret minimization barrier: optimal algorithms for stochastic strongly-convex optimization. Journal of Machine Learning Research, 15(1):2489-2512, 2014.
[15] P. Jain, D. Nagaraj, and P. Netrapalli. Making the last iterate of SGD information theoretically optimal. In Conference on Learning Theory, pages 1752-1755, 2019.
[16] R. Johnson and T. Zhang. Accelerating stochastic gradient descent using predictive variance reduction. In Advances in Neural Information Processing Systems, pages 315-323, 2013.
[17] S. M. Kakade and A. Tewari. On the generalization ability of online strongly convex programming algorithms.
In Advances in Neural Information Processing Systems, pages 801-808, 2009.
[18] D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. In International Conference on Learning Representations, 2015.
[19] A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pages 1097-1105, 2012.
[20] S. Lacoste-Julien, M. Schmidt, and F. Bach. A simpler approach to obtaining an O(1/t) convergence rate for the projected stochastic subgradient method. arXiv preprint arXiv:1212.2002, 2012.
[21] Y. Lei and D.-X. Zhou. Convergence of online mirror descent. Applied and Computational Harmonic Analysis, 2018. doi: https://doi.org/10.1016/j.acha.2018.05.005.
[22] Y. Lei and D.-X. Zhou. Learning theory of randomized sparse Kaczmarz method. SIAM Journal on Imaging Sciences, 11(1):547-574, 2018.
[23] J. Lin, L. Rosasco, and D.-X. Zhou. Iterative regularization for learning with convex loss functions. Journal of Machine Learning Research, 17(77):1-38, 2016.
[24] P.-L. Lions and B. Mercier. Splitting algorithms for the sum of two nonlinear operators. SIAM Journal on Numerical Analysis, 16(6):964-979, 1979.
[25] A.-S. Nemirovsky and D.-B. Yudin. Problem Complexity and Method Efficiency in Optimization. John Wiley & Sons, 1983.
[26] Y. Nesterov. Introductory Lectures on Convex Optimization: A Basic Course, volume 87. Springer Science & Business Media, 2013.
[27] L. M. Nguyen, P. H. Nguyen, M. van Dijk, P. Richtárik, K. Scheinberg, and M. Takáč. SGD and Hogwild! Convergence without the bounded gradients assumption. In International Conference on Machine Learning, pages 3747-3755, 2018.
[28] N. Parikh and S. P. Boyd. Proximal algorithms. Foundations and Trends in Optimization, 1(3):127-239, 2014.
[29] B. T. Polyak.
Some methods of speeding up the convergence of iteration methods. USSR Computational Mathematics and Mathematical Physics, 4(5):1-17, 1964.
[30] B. T. Polyak and A. B. Juditsky. Acceleration of stochastic approximation by averaging. SIAM Journal on Control and Optimization, 30(4):838-855, 1992.
[31] A. Rakhlin, O. Shamir, and K. Sridharan. Making gradient descent optimal for strongly convex stochastic optimization. In International Conference on Machine Learning, pages 449-456, 2012.
[32] S. Reddi, A. Hefny, S. Sra, B. Poczos, and A. Smola. Stochastic variance reduction for nonconvex optimization. In International Conference on Machine Learning, pages 314-323, 2016.
[33] M. Schmidt, N. Le Roux, and F. Bach. Minimizing finite sums with the stochastic average gradient. Mathematical Programming, 162(1-2):83-112, 2017.
[34] F. Schöpfer and D. A. Lorenz. Linear convergence of the randomized sparse Kaczmarz method. Mathematical Programming, pages 1-28, 2018.
[35] S. Shalev-Shwartz, Y. Singer, N. Srebro, and A. Cotter. Pegasos: Primal estimated sub-gradient solver for SVM. Mathematical Programming, 127(1):3-30, 2011.
[36] O. Shamir. Open problem: Is averaging needed for strongly convex stochastic gradient descent? In Conference on Learning Theory, 2012.
[37] O. Shamir and T. Zhang. Stochastic gradient descent for non-smooth optimization: Convergence results and optimal averaging schemes. In International Conference on Machine Learning, pages 71-79, 2013.
[38] J. Steinhardt, S. Wager, and P. Liang. The statistics of streaming sparse regression. arXiv preprint arXiv:1412.4182, 2014.
[39] L. Xiao. Dual averaging methods for regularized stochastic learning and online optimization. Journal of Machine Learning Research, 11:2543-2596, 2010.
[40] L. Xiao and T. Zhang. A proximal stochastic gradient method with progressive variance reduction.
SIAM Journal on Optimization, 24(4):2057-2075, 2014.
[41] Y. Xu, Q. Lin, and T. Yang. Stochastic convex optimization: Faster local growth implies faster global convergence. In International Conference on Machine Learning, pages 3821-3830, 2017.
[42] Y. Ying and D.-X. Zhou. Unregularized online learning algorithms with general loss functions. Applied and Computational Harmonic Analysis, 42(2):224-244, 2017.
[43] L. Zhang, M. Mahdavi, and R. Jin. Linear convergence with condition number independent access of full gradients. In Advances in Neural Information Processing Systems, pages 980-988, 2013.
[44] T. Zhang. Solving large scale linear prediction problems using stochastic gradient descent algorithms. In International Conference on Machine Learning, pages 919-926, 2004.
[45] Z. Zhou, P. Mertikopoulos, N. Bambos, S. Boyd, and P. W. Glynn. Stochastic mirror descent in variationally coherent optimization problems. In Advances in Neural Information Processing Systems, pages 7043-7052, 2017.
[46] M. Zinkevich. Online convex programming and generalized infinitesimal gradient ascent. In International Conference on Machine Learning, pages 928-936, 2003.