{"title": "Fast Rates of ERM and Stochastic Approximation: Adaptive to Error Bound Conditions", "book": "Advances in Neural Information Processing Systems", "page_first": 4678, "page_last": 4689, "abstract": "Error bound conditions (EBC) are properties that characterize the growth of an objective function when a point is moved away from the optimal set. They have recently received increasing attention in the field of optimization for developing optimization algorithms with fast convergence. However, the studies of EBC in statistical learning are hitherto still limited. The main contributions of this paper are two-fold. First, we develop fast and intermediate rates of empirical risk minimization (ERM) under EBC for risk minimization with Lipschitz continuous, and smooth convex random functions. Second, we establish fast and intermediate rates of an efficient stochastic approximation (SA) algorithm for risk minimization with Lipschitz continuous random functions, which requires only one pass of $n$ samples and adapts to EBC. For both approaches, the convergence rates span a full spectrum between $\\widetilde O(1/\\sqrt{n})$ and $\\widetilde O(1/n)$ depending on the power constant in EBC, and could be even faster than $O(1/n)$ in special cases for ERM. Moreover, these convergence rates are automatically adaptive without using any knowledge of EBC. 
Overall, this work not only strengthens the understanding of ERM for statistical learning but also brings new fast stochastic algorithms for solving a broad range of statistical learning problems.", "full_text": "Fast Rates of ERM and Stochastic Approximation:\n\nAdaptive to Error Bound Conditions\n\nMingrui Liu\u2020, Xiaoxuan Zhang\u2020, Lijun Zhang\u2021, Rong Jin(cid:92), Tianbao Yang\u2020\n\n\u2020Department of Computer Science, The University of Iowa, Iowa City, IA 52242, USA\n\u2021National Key Laboratory for Novel Software Technology, Nanjing University, China\n\n(cid:92) Machine Intelligence Technology, Alibaba Group, Bellevue, WA 98004, USA\n\nmingrui-liu@uiowa.edu, zljzju@gmail.com, tianbao-yang@uiowa.edu\n\nAbstract\n\nError bound conditions (EBC) are properties that characterize the growth of an\nobjective function when a point is moved away from the optimal set. They have\nrecently received increasing attention for developing optimization algorithms with\nfast convergence. However, the studies of EBC in statistical learning are hitherto\nstill limited. The main contributions of this paper are two-fold. First, we develop\nfast and intermediate rates of empirical risk minimization (ERM) under EBC\nfor risk minimization with Lipschitz continuous, and smooth convex random\nfunctions. Second, we establish fast and intermediate rates of an ef\ufb01cient stochastic\napproximation (SA) algorithm for risk minimization with Lipschitz continuous\nrandom functions, which requires only one pass of n samples and adapts to EBC.\nn)\n\nFor both approaches, the convergence rates span a full spectrum between (cid:101)O(1/\nand (cid:101)O(1/n) depending on the power constant in EBC, and could be even faster\n\nthan O(1/n) in special cases for ERM. 
Moreover, these convergence rates are automatically adaptive without using any knowledge of EBC.\n\n1 Introduction\n\nIn this paper, we focus on the following stochastic convex optimization problems arising in statistical learning and many other fields:\n\n$\\min_{w\\in\\mathcal{W}} P(w) \\triangleq \\mathrm{E}_{z\\sim\\mathbb{P}}[f(w, z)]$, (1)\n\nand more generally\n\n$\\min_{w\\in\\mathcal{W}} P(w) \\triangleq \\mathrm{E}_{z\\sim\\mathbb{P}}[f(w, z)] + r(w)$, (2)\n\nwhere $f(\\cdot, z): \\mathcal{W}\\rightarrow\\mathbb{R}$ is a random function depending on a random variable $z\\in\\mathcal{Z}$ that follows a distribution $\\mathbb{P}$, and $r(w)$ is a lower semi-continuous convex function. In statistical learning [48], the problem above is also referred to as risk minimization, where $z$ is interpreted as data, $w$ as a model (or hypothesis), $f(\\cdot,\\cdot)$ as a loss function, and $r(\\cdot)$ as a regularization. For example, in supervised learning one can take $z = (x, y)$, a pair of a feature vector $x\\in\\mathcal{X}\\subseteq\\mathbb{R}^d$ and a label $y\\in\\mathcal{Y}$, and $f(w, z) = \\ell(w(x), y)$, a loss function measuring the error of the prediction $w(x): \\mathcal{X}\\rightarrow\\mathcal{Y}$ made by the model $w$. Nonetheless, we emphasize that the risk minimization problem (1) is more general than supervised learning and could be more challenging (c.f. [35]). In this paper, we assume that $\\mathcal{W}\\subseteq\\mathbb{R}^d$ is a compact and convex set. Let $\\mathcal{W}_* = \\arg\\min_{w\\in\\mathcal{W}} P(w)$ denote the optimal set and $P_* = \\min_{w\\in\\mathcal{W}} P(w)$ denote the optimal risk.\nThere are two popular approaches for solving the risk minimization problem. The first is empirical risk minimization, which minimizes the empirical risk defined over a set of $n$ i.i.d. samples drawn from the same distribution $\\mathbb{P}$ (sometimes with a regularization term on the model).\n32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montr\u00e9al, Canada.\nThe second approach is called stochastic approximation, which iteratively learns the model from random samples $z_t\\sim\\mathbb{P}$, $t = 1, \\ldots, n$. 
Both approaches have been studied broadly, and extensive results are available on their theoretical guarantees in the machine learning and optimization communities. A central theme in these studies is to bound the excess risk (or optimization error) of a learned model $\\widehat{w}$ measured by $P(\\widehat{w}) - P_*$, i.e., given a set of $n$ samples $(z_1, \\ldots, z_n)$, how fast the learned model converges to the optimal model in terms of the excess risk.\nA classical result about the excess risk bound for the considered risk minimization problem is in the order of $\\widetilde O(\\sqrt{d/n})$ 1 and $O(\\sqrt{1/n})$ for ERM and SA, respectively, under appropriate conditions on the loss functions (e.g., Lipschitz continuity, convexity) [29, 35]. Various studies have attempted to establish faster rates by imposing additional conditions on the loss functions (e.g., strong convexity, smoothness, exponential concavity) [13, 42, 21], or on both the loss functions and the distribution (e.g., Tsybakov condition, Bernstein condition, central condition) [45, 3, 46]. In this paper, we study a different family of conditions called error bound conditions (EBC) (see Definition 1), which have a long history in the optimization and variational analysis community [31] and have recently been revived for developing fast optimization algorithms without strong convexity [4, 6, 17, 28, 54]. However, EBC remains under-explored in statistical learning for risk minimization, and its connection to other conditions is not fully understood.\nDefinition 1. For any $w\\in\\mathcal{W}$, let $w_* = \\arg\\min_{u\\in\\mathcal{W}_*} \\|u - w\\|_2$ denote an optimal solution closest to $w$, where $\\mathcal{W}_*$ is the set containing all optimal solutions. Let $\\theta\\in(0, 1]$ and $0 < \\alpha < \\infty$. 
The problem (1) satisfies an EBC($\\theta, \\alpha$) if for any $w\\in\\mathcal{W}$ the following inequality holds:\n\n$\\|w - w_*\\|_2^2 \\le \\alpha (P(w) - P(w_*))^{\\theta}$. (3)\n\nThis condition has been well studied in optimization and variational analysis, and many results are available for understanding it for different problems. For example, it has been shown that when $P(w)$ is semi-algebraic and continuous, the inequality (3) holds on any compact set with certain $\\theta\\in(0, 1]$ and $\\alpha > 0$ [4] 2. We will study both ERM and SA under the above error bound condition. In particular, we show that the benefits of exploiting EBC in statistical learning are noticeable and profound by establishing the following results.\n\u2022 Result I. First, we show that for Lipschitz continuous loss, EBC implies a relaxed Bernstein condition, and therefore leads to intermediate rates of $\\widetilde O((d/n)^{\\frac{1}{2-\\theta}})$. Although this result does not improve over existing rates based on the Bernstein condition, we emphasize that it provides an alternative route for establishing fast rates and brings richer results to statistical learning than the existing literature, in light of the examples provided in this paper.\n\u2022 Result II. Second, we develop fast and optimistic rates of ERM for non-negative, Lipschitz continuous and smooth convex loss functions in the order of $\\widetilde O(d/n + (dP_*/n)^{\\frac{1}{2-\\theta}})$, and in the order of $\\widetilde O((d/n)^{\\frac{2}{2-\\theta}} + (dP_*/n)^{\\frac{1}{2-\\theta}})$ when the sample size $n$ is sufficiently large, which imply that when the optimal risk $P_*$ is small one can achieve a fast rate of $\\widetilde O(d/n)$ even with $\\theta < 1$, and a faster rate of $\\widetilde O((d/n)^{\\frac{2}{2-\\theta}})$ when $n$ is sufficiently large.\n\u2022 Result III. Third, we develop an efficient SA algorithm with almost the same per-iteration cost as stochastic subgradient methods for Lipschitz continuous loss, which achieves the same order of rate $\\widetilde O((1/n)^{\\frac{1}{2-\\theta}})$ as ERM without an explicit dependence on $d$. More importantly, it is \u201cparameter\u201d-free, with no need of prior knowledge of $\\theta$ and $\\alpha$ in EBC.\nOverall, these results not only strengthen the understanding of ERM for statistical learning but also bring new fast stochastic algorithms for solving a broad range of statistical learning problems. Before ending this section, we would like to point out that all the results are automatically adaptive to the largest possible value of $\\theta\\in(0, 1]$ in hindsight of the problem, and the dependence on $d$ for ERM is generally unavoidable according to the lower bounds studied in [9].\n\n2 Related Work\n\nThe results for statistical learning under EBC are limited. A similar one to our Result I for ERM was established in [39]. However, their result requires convexity of the random loss functions, making it weaker than our result. Ramdas and Singh [33] and Xu et al. [50] considered SA under the EBC condition and established similar adaptive rates. Nonetheless, their stochastic algorithms require knowing the value of $\\theta$ and possibly the constant $\\alpha$ in the EBC. In contrast, the SA algorithm in this paper is \u201cparameter\u201d-free, without the need of knowing $\\theta$ and $\\alpha$, while still achieving the adaptive rates of $\\widetilde O(n^{-\\frac{1}{2-\\theta}})$. Fast rates under strong convexity (a special case of EBC) are well-known for ERM, online optimization and SA [35, 43, 13, 16, 36, 14].\n1 $\\widetilde O$ hides a poly-logarithmic factor of $n$.\n2 One may consider $\\theta\\in(1, 2]$, which will yield the same order of excess risk bound as $\\theta = 1$ in our settings. 
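As a concrete toy illustration of the EBC inequality (3) above (our own example, not from the paper): for the piecewise linear objective $P(w) = |w|$ on $\mathcal{W} = [-1, 1]$, the optimal set is $\{0\}$ and EBC holds with $\theta = 1$, $\alpha = 1$, which can be checked numerically on a grid:

```python
import numpy as np

# Toy check of EBC(theta, alpha): ||w - w*||^2 <= alpha * (P(w) - P*)^theta
# for P(w) = |w| on W = [-1, 1], where W* = {0} and P* = 0.
theta, alpha = 1.0, 1.0
grid = np.linspace(-1.0, 1.0, 10001)

P = np.abs(grid)        # objective values on the grid
dist_sq = grid ** 2     # squared distance to the optimal set {0}
assert np.all(dist_sq <= alpha * P ** theta + 1e-12)

# The quadratic P(w) = w^2 also satisfies EBC with theta = 1, alpha = 1:
assert np.all(grid ** 2 <= alpha * (grid ** 2) ** theta + 1e-12)
print("EBC(theta=1, alpha=1) verified on the grid")
```

Note that the same $\theta = 1$ covers both a non-smooth and a smooth objective here; the power $\theta$ measures growth of the objective away from the optimal set, not smoothness.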
In the presence of strong convexity of $P(w)$, our results of ERM and SA recover known rates (see below for more discussions).\nFast (intermediate) rates of ERM have been studied under various conditions, including the Tsybakov margin condition [44, 25], the Bernstein condition [3, 2, 19], the exp-concavity condition [21, 11, 26, 51], the mixability condition [27], the central condition [46], etc. The Bernstein condition (see Definition 2) is a generalization of the Tsybakov margin condition for classification. The connection between the exp-concavity condition, the Bernstein condition and the v-central condition was studied in [46]. In particular, exp-concavity implies a v-central condition under an appropriate condition on the decision set $\\mathcal{W}$ (e.g., well-specifiedness or convexity). With the bounded loss condition, the Bernstein condition implies the v-central condition, and the v-central condition also implies a Bernstein condition.\nIn this work, we also study the connection between the EBC and the Bernstein condition and the v-central condition. In particular, we will develop weaker forms of the Bernstein condition and the v-central condition from the EBC for Lipschitz continuous loss functions. Building on this connection, we establish our Result I, which is on a par with existing results for bounded loss functions relying on the Bernstein condition or the central condition. Nevertheless, we emphasize that employing the EBC for developing fast rates has noticeable benefits: (i) it is complementary to the Bernstein condition and the central condition and is enjoyed by several interesting problems whose fast rates have not been exhibited yet; (ii) it can be leveraged for developing fast and intermediate optimistic rates for non-negative and smooth loss functions; (iii) it can be leveraged to develop efficient SA algorithms with intermediate and fast convergence rates.\nSrebro et al. [42] established an optimistic rate of $O(1/n + \\sqrt{P_*/n})$ for both ERM and SA for supervised learning with generalized linear loss functions. However, their SA algorithm requires knowing the value of $P_*$. Recently, Zhang et al. [55] considered the general stochastic optimization problem (1) with non-negative and smooth loss functions and achieved a series of optimistic results. It is worth mentioning that their excess risk bounds for convex problems and strongly convex problems are special cases of our Result II when $\\theta = 0$ and $\\theta = 1$, respectively. However, the intermediate optimistic rates for $\\theta\\in(0, 1)$ are first shown in this paper. Importantly, our Result II under the EBC with $\\theta = 1$ is more general than the result in [55] under the strong convexity assumption.\nFinally, we discuss stochastic approximation algorithms with fast and intermediate rates to understand the significance of our Result III. Different variants of stochastic gradient methods have been analyzed for stochastic strongly convex optimization [14, 32, 38] with a fast rate of $O(1/n)$. But these stochastic algorithms require knowing the strong convexity modulus. A recent work established adaptive regret bounds $O(n^{\\frac{1-\\theta}{2-\\theta}})$ for online learning with a total of $n$ rounds under the Bernstein condition [20]. However, their methods are based on second-order methods and therefore are not as efficient as our stochastic approximation algorithm. For example, for online convex optimization they employed the MetaGrad algorithm [47], which needs to maintain $\\log(n)$ copies of the online Newton step (ONS) [13] with different learning rates. 
Notice that the per-iteration cost of ONS is\nusually O(d4) even for very simple domain W [21], while that of our SA algorithm is dominated by\nthe Euclidean projection onto W that is as fast as O(d) for a simple domain.\n\n3 Empirical Risk Minimization (ERM)\n\nWe \ufb01rst formally state the minimal assumptions that are made throughout the paper. Additional\nassumptions will be made in the sequel for developing fast rates for different families of the random\nfunctions f (w, z).\nAssumption 1. For the stochastic optimization problems (1) and (2), we assume: (i) P (w) is a\nconvex function, W is a closed and bounded convex set, i.e., there exists R > 0 such that (cid:107)w(cid:107)2 \u2264 R\nfor any w \u2208 W, and r(w) is a Lipschitz continuous convex function. (ii) the problem (1) and (2)\nsatisfy an EBC(\u03b8, \u03b1), i.e., there exist \u03b8 \u2208 (0, 1] and 0 < \u03b1 < \u221e such that the inequality (3) hold.\n\n3\n\n\f(cid:98)w \u2208 arg min\n\nw\u2208W Pn(w) (cid:44) 1\n\nn\n\nn(cid:88)\n\nIn this section, we focus on the development of theory of ERM for risk minimization. In particular,\n\nwe learn a model (cid:98)w by solving the following ERM problem corresponding to (1):\n\n(4)\nwhere z1, . . . , zn are i.i.d samples following the distribution P. A similar ERM problem can be\nformulated for (2). This section is divided into two subsections. First, we establish intermediate\nrates of ERM under EBC when the random function is Lipschitz continuous. Second, we develop\nintermediate rates of ERM under EBC when the random function is smooth. In the sequel and the\nsupplement, we use \u2228 to denote the max operation and use \u2227 to denote the min operation.\n\nf (w, zi)\n\ni=1\n\n3.1 ERM for Lipschitz continuous random functions\n\nIn this subsection, w.l.o.g we restrict our attention to (1) since we make the following assumption\nbesides Assumption 1. If r(w) is present, it can be absorbed into f (w, z).\nAssumption 2. 
For the stochastic optimization problem (1), we assume that $f(w, z)$ is a $G$-Lipschitz continuous function w.r.t. $w$ for any $z\\in\\mathcal{Z}$.\n\nIt is notable that we do not assume $f(w, z)$ is convex in terms of $w$ for any $z$. First, we compare EBC with two very important conditions considered in the literature for developing fast rates of ERM, namely the Bernstein condition and the central condition. We first give the definitions of these two conditions.\nDefinition 2. (Bernstein Condition) Let $\\beta\\in(0, 1]$ and $B \\ge 1$. Then $(f, \\mathbb{P}, \\mathcal{W})$ satisfies the $(\\beta, B)$-Bernstein condition if there exists a $w_*\\in\\mathcal{W}$ such that for any $w\\in\\mathcal{W}$\n\n$\\mathrm{E}_z[(f(w, z) - f(w_*, z))^2] \\le B(\\mathrm{E}_z[f(w, z) - f(w_*, z)])^{\\beta}$. (5)\n\nIt is clear that if such a $w_*$ exists, it has to be the minimizer of the risk.\nDefinition 3. (v-Central Condition) Let $v: [0, \\infty)\\rightarrow[0, \\infty)$ be a bounded, non-decreasing function satisfying $v(x) > 0$ for all $x > 0$. We say that $(f, \\mathbb{P}, \\mathcal{W})$ satisfies the $v$-central condition if for all $\\epsilon \\ge 0$, there exists $w_*\\in\\mathcal{W}$ such that for any $w\\in\\mathcal{W}$ the following holds with $\\eta = v(\\epsilon)$:\n\n$\\mathrm{E}_{z\\sim\\mathbb{P}}\\big[e^{\\eta(f(w_*, z) - f(w, z))}\\big] \\le e^{\\eta\\epsilon}$. (6)\n\nIf $v(\\epsilon)$ is a constant for all $\\epsilon \\ge 0$, the $v$-central condition reduces to the strong $\\eta$-central condition, which implies the $O(1/n)$ fast rate [46]. The connection between the Bernstein condition and the $v$-central condition has been studied in [46]. For example, if the random functions $f(w, z)$ take values in $[0, a]$, then the $(\\beta, B)$-Bernstein condition implies the $v$-central condition with $v(x) \\propto x^{1-\\beta}$.\nThe following lemma shows that for Lipschitz continuous functions, the EBC implies a relaxed Bernstein condition and a relaxed $v$-central condition.\nLemma 1. 
(Relaxed Bernstein condition and v-central condition) Suppose Assumptions 1, 2 hold.\nFor any w \u2208 W, there exists w\u2217 \u2208 W\u2217 (which is actually the one closest to w), such that\n\nwhere B = G2\u03b1, and Ez\u223cP(cid:2)e\u03b7(f (w\u2217,z)\u2212f (w,z))(cid:3) \u2264 e\u03b7\u03b5, where \u03b7 = v(\u03b5) := c\u03b51\u2212\u03b8 \u2227 b. Additionally,\nfor any \u03b5 > 0 if P (w) \u2212 P (w\u2217) \u2265 \u03b5, we have Ez\u223cP(cid:2)ev(\u03b5)(f (w\u2217,z)\u2212f (w,z))(cid:3) \u2264 1, where b > 0 is any\n\nEz[(f (w, z) \u2212 f (w\u2217, z))2] \u2264 B(Ez[f (w, z) \u2212 f (w\u2217, z)])\u03b8,\n\nconstant and c = 1/(\u03b1G2\u03ba(4GRb)), where \u03ba(x) = (ex \u2212 x \u2212 1)/x2.\n\nRemark: There is a subtle difference between the above relaxed Bernstein condition and v-central\ncondition and their original de\ufb01nitions in De\ufb01nitions 2 and 3. The difference is that in De\ufb01nitions 2\nand 3, it requires there exists a universal w\u2217 for all w \u2208 W such that (5) and (6) hold. In Lemma 1\nit only requires for every w \u2208 W there exists one w\u2217 that could be different for different w such\nthat (5) and (6) hold. This relaxation enables us to establish richer results by exploring EBC than the\nBernstein condition and v-central condition, which are postponed to Section 5.\nNext, we present the main result of this subsection.\n\n4\n\n\fTheorem 1 (Result I). Suppose Assumptions 1, 2 hold. For any n \u2265 aC, with probability at least\n1 \u2212 \u03b4 we have\n\nP ((cid:98)w) \u2212 P\u2217 \u2264 O\n\n(cid:18) d log n + log(1/\u03b4)\n\n(cid:19) 1\n\n2\u2212\u03b8\n\nn\n\n,\n\n(7)\n\nwhere a = 3(d log(32GRn1/(2\u2212\u03b8)) + log(1/\u03b4))/c + 1 and C > 0 is some constant.\n\nRemark: The proof utilizes Lemma 1 and follows similarly as the proofs in previous studies [46, 26]\nbased on v-central condition. 
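As a numeric sanity check on the Bernstein-type inequality (5) implied by Lemma 1, consider the toy loss $f(w, z) = (w - z)^2/2$ with $z$ uniform on $\{-1, +1\}$ and $\mathcal{W} = [-1, 1]$, so $w_* = 0$. Both expectations can be computed exactly over the two atoms; the constant $B = 2.5$ below is hand-picked for this example (our own illustration, not a constant from the paper):

```python
import numpy as np

# f(w, z) = (w - z)^2 / 2, z uniform on {-1, +1}; risk P(w) = (w^2 + 1)/2,
# minimized at w* = 0. Check the Bernstein-type inequality
#   E[(f(w,z) - f(w*,z))^2] <= B * (E[f(w,z) - f(w*,z)])^theta
# with theta = 1; B = 2.5 suffices on [-1, 1] (tight at |w| = 1).
zs = np.array([-1.0, 1.0])
theta, B = 1.0, 2.5

for w in np.linspace(-1.0, 1.0, 2001):
    diff = 0.5 * (w - zs) ** 2 - 0.5 * (0.0 - zs) ** 2  # f(w,z) - f(w*,z)
    second_moment = np.mean(diff ** 2)                   # exact over two atoms
    mean_gap = np.mean(diff)                             # equals w^2 / 2 >= 0
    assert second_moment <= B * mean_gap ** theta + 1e-12
print("Bernstein-type inequality holds with theta=1, B=2.5")
```

Working it out by hand: `mean_gap` is $w^2/2$ and `second_moment` is $(w^4 + 4w^2)/4$, so the inequality reduces to $(w^2 + 4)/2 \le B$, which holds on $[-1, 1]$ with $B = 2.5$.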
Our analysis essentially shows that relaxed Bernstein condition and\nrelaxed v-central condition with non-universal w\u2217 suf\ufb01ce to establish the intermediate rates. Although\nthe rate in Theorem 1 does not improve that in previous works [46], the relaxation brought by EBC\nallows us to establish fast rates for interesting problems that were unknown before. More details are\npostponed into Section 5. For example, under the condition that the input data x, y are bounded, ERM\nfor hinge loss minimization with (cid:96)1, (cid:96)\u221e norm constraints, and for minimizing a quadratic function\n\nand an (cid:96)1 norm regularization enjoys an (cid:101)O(1/n) fast rate. To the best of our knowledge, such a fast\n\nrate of ERM for these problems has not been shown in literature using other conditions or theories.\n\n3.2 ERM for non-negative, Lipschitz continuous and smooth convex random functions\n\nBelow we will present improved optimistic rates of ERM for non-negative smooth loss functions\nexpanding the results in [55]. To be general, we consider (2) and the following ERM problem:\n\n(cid:98)w \u2208 arg min\n\nw\u2208W Pn(w) (cid:44) 1\n\nn\n\nn(cid:88)\n\ni=1\n\nf (w, zi) + r(w)\n\n(8)\n\nBesides Assumptions 1, 2, we further make the following assumption for developing faster rates.\nAssumption 3. For the stochastic optimization problem (1), we assume f (w, z) is a non-negative\nand L-smooth convex function w.r.t w for any z \u2208 Z.\n\nIt is notable that we do not assume that r(w) is smooth. Our main result in this subsection is presented\nin the following theorem.\nTheorem 2 (Result II). 
Under Assumptions 1, 2, and 3, with probability at least $1 - \\delta$ we have\n\n$P(\\widehat{w}) - P_* \\le O\\Big(\\frac{d\\log n + \\log(1/\\delta)}{n} + \\Big[\\frac{(d\\log n + \\log(1/\\delta))P_*}{n}\\Big]^{\\frac{1}{2-\\theta}}\\Big)$.\n\nWhen $n \\ge \\Omega\\big((\\alpha^{1/\\theta} d\\log n)^{2-\\theta}\\big)$, with probability at least $1 - \\delta$,\n\n$P(\\widehat{w}) - P_* \\le O\\Big(\\Big[\\frac{d\\log n + \\log(1/\\delta)}{n}\\Big]^{\\frac{2}{2-\\theta}} + \\Big[\\frac{(d\\log n + \\log(1/\\delta))P_*}{n}\\Big]^{\\frac{1}{2-\\theta}}\\Big)$.\n\nRemark: The constants in the big $O$ and $\\Omega$ can be seen from the proof, which is tedious and included in the supplement. Here we focus on understanding the results. First, the above results are optimistic rates that are no worse than those in Theorem 1. Second, the first result implies that when the optimal risk $P_*$ is less than $O((d\\log n/n)^{1-\\theta})$, the excess risk bound is in the order of $O(d\\log n/n)$. Third, when the number of samples $n$ is sufficiently large and the optimal risk is sufficiently small, the second result can imply a faster rate than $O(d\\log n/n)$. Considering the smooth functions presented in Section 5 with $\\theta = 1$, when $n \\ge \\Omega(\\alpha d\\log n)$ and $P_* \\le O(d\\log n/n)$ (large-sample and small optimal risk), the excess risk can be bounded by $O((d\\log n/n)^2)$. In other words, the sample complexity for achieving an $\\epsilon$-excess risk bound is given by $\\widetilde O(d/\\sqrt{\\epsilon})$. To the best of our knowledge, the sample complexity of ERM in the order of $1/\\sqrt{\\epsilon}$ for these examples is the first result appearing in the literature.\n\n4 Efficient SA for Lipschitz continuous random functions\n\nIn this section, we will present intermediate rates of an efficient stochastic approximation algorithm for solving (1) adaptive to the EBC under Assumptions 1 and 2. 
Note that (2) can be considered as a special case by absorbing $r(w)$ into $f(w, z)$.\n\nAlgorithm 1 SSG($w_1, \\gamma, T, \\mathcal{W}$)\nInput: $w_1\\in\\mathcal{W}$, $\\gamma > 0$ and $T$\n1: for $t = 1, \\ldots, T$ do\n2:   $w_{t+1} = \\Pi_{\\mathcal{W}}(w_t - \\gamma g_t)$\n3: end for\n4: $\\widehat{w}_T = \\frac{1}{T+1}\\sum_{t=1}^{T+1} w_t$\n5: return $\\widehat{w}_T$\n\nAlgorithm 2 ASA($w_1, n, R$)\n1: Set $R_0 = 2R$, $\\widehat{w}_0 = w_1$, $m = \\lfloor\\frac{1}{2}\\log_2\\frac{2n}{\\log_2 n}\\rfloor - 1$, $n_0 = \\lfloor\\frac{n}{m}\\rfloor$\n2: for $k = 1, \\ldots, m$ do\n3:   Set $\\gamma_k = \\frac{R_{k-1}}{G\\sqrt{n_0+1}}$ and $R_k = R_{k-1}/2$\n4:   $\\widehat{w}_k = \\mathrm{SSG}(\\widehat{w}_{k-1}, \\gamma_k, n_0, \\mathcal{W}\\cap B(\\widehat{w}_{k-1}, R_{k-1}))$\n5: end for\n6: return $\\widehat{w}_m$\n\nDenote by $z_1, \\ldots, z_k, \\ldots$ i.i.d. samples drawn sequentially from the distribution $\\mathbb{P}$, by $g_k\\in\\partial f(w, z_k)|_{w=w_k}$ a stochastic subgradient evaluated at $w_k$ with sample $z_k$, and by $B(w, R)$ a bounded ball centered at $w$ with a radius $R$. By the Lipschitz continuity of $f$, we have $\\|\\partial f(w, z)\\|_2 \\le G$ for all $w\\in\\mathcal{W}$, $z\\in\\mathcal{Z}$.\nThe proposed adaptive stochastic approximation algorithm is presented in Algorithm 2, which is referred to as ASA. The updates are divided into $m$ stages, where at each stage a stochastic subgradient method (Algorithm 1) is employed for running $n_0 = \\lfloor n/m\\rfloor$ iterations with a constant step size $\\gamma_k$. The step size $\\gamma_k$ is decreased by half after each stage, and the next stage is warm-started using the solution returned from the last stage as the initial solution. The projection onto the intersection of $\\mathcal{W}$ and a shrinking bounded ball at each stage is a commonly used trick for the high probability analysis [14, 15, 49]. 
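The two algorithms above can be sketched in a few lines of Python. This is our own minimal illustration under simplifying assumptions (a box domain with a cheap projection, projection onto $\mathcal{W}$ only rather than onto $\mathcal{W}\cap B(\cdot,\cdot)$, and a 1-dimensional toy objective $\mathrm{E}|w - z|$); it is not the authors' implementation:

```python
import numpy as np

def ssg(w, gamma, T, proj, subgrad):
    """Algorithm 1 (SSG) sketch: projected subgradient steps with a constant
    step size, returning the uniform average of the iterates."""
    avg = w.copy()
    for _ in range(T):
        w = proj(w - gamma * subgrad(w))
        avg = avg + w
    return avg / (T + 1)

def asa(w1, n, R, G, proj, subgrad):
    """Algorithm 2 (ASA) sketch: m warm-started stages; the step size and the
    (conceptual) search radius are halved after every stage."""
    m = max(1, int(0.5 * np.log2(2 * n / np.log2(n))) - 1)
    n0 = n // m
    w, R_k = np.asarray(w1, dtype=float), 2.0 * R
    for _ in range(m):
        gamma_k = R_k / (G * np.sqrt(n0 + 1))
        w = ssg(w, gamma_k, n0, proj, subgrad)
        R_k /= 2.0
    return w

# Toy problem: min_{|w| <= 1} E|w - z| with z ~ N(0, 0.1^2); the optimum is ~0.
rng = np.random.default_rng(0)
proj = lambda w: np.clip(w, -1.0, 1.0)
subgrad = lambda w: np.sign(w - rng.normal(0.0, 0.1))  # one-sample subgradient
w_hat = asa(np.array([1.0]), n=4000, R=1.0, G=1.0, proj=proj, subgrad=subgrad)
print(abs(w_hat[0]))  # close to 0 (well inside [-1, 1])
```

Note how no parameter related to EBC enters the sketch: only $n$, $R$ and $G$ are used, which is the "parameter"-free property claimed for ASA.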
We emphasize that the subroutine in ASA can be replaced by\nother SA algorithms, e.g., the proximal variant of stochastic subgradient for handling a non-smooth\ndeterministic component such as (cid:96)1 norm regularization [7], stochastic mirror descent with with a\np-norm divergence function [8], and etc. Please see an example in the supplement.\nIt is worth mentioning that the dividing schema of ASA is due to [15], which however restricts its\nanalysis to uniformly convex functions where uniform convexity is a stronger condition than the\nEBC. ASA is also similar to a recently proposed accelerated stochastic subgradient (ASSG) method\nunder the EBC [49]. However, the key differences are that (i) ASA is developed for a \ufb01xed number of\niterations while ASSG is developed for a \ufb01xed accuracy level \u0001; (ii) the adaptive iteration complexity\nof ASSG requires knowing the value of \u03b8 \u2208 (0, 2] while ASA does not require the value of \u03b8. As a\ntrade-off, we restrict our attention to \u03b8 \u2208 (0, 1].\nTheorem 3 (Result III). Suppose Assumptions 1 and 2 hold, and (cid:107)w1 \u2212 w\u2217(cid:107)2 \u2264 R0, where w\u2217\nis the closest optimal solution to w1. De\ufb01ne \u00af\u03b1 = max(\u03b1G2, (R0G)2\u2212\u03b8). For n \u2265 100 and any\n\u03b4 \u2208 (0, 1), with probability at least 1 \u2212 \u03b4, we have\n\nP ((cid:98)wm) \u2212 P\u2217 \u2264 O\n\n(cid:18) \u00af\u03b1(log(n) log(log(n)/\u03b4))\n\n(cid:19) 1\n\n2\u2212\u03b8\n\n.\n\nn\n\nRemark: The signi\ufb01cance of the result is that although Algorithm 2 does not utilize any knowledge\nabout EBC, it is automatically adaptive to the EBC. As a \ufb01nal note, the projection onto the intersection\nof W and a bounded ball can be ef\ufb01ciently computed by employing the projection onto W and a\nbinary search for the Lagrangian multiplier of the ball constraint. 
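The binary-search trick mentioned above can be sketched as follows (our own illustrative code, not the authors'; it assumes the center $c$ lies in $\mathcal{W}$ so that the bracketing loop terminates, and that $\Pi_{\mathcal{W}}$ is cheap):

```python
import numpy as np

def project_intersection(v, c, r, proj_W, iters=100):
    """Euclidean projection of v onto W \\cap B(c, r), assuming c is in W and
    proj_W is the (cheap) projection onto W. Bisects on the Lagrange
    multiplier lam of the ball constraint, using
    w(lam) = proj_W((v + lam * c) / (1 + lam))."""
    w = proj_W(v)
    if np.linalg.norm(w - c) <= r:
        return w  # ball constraint inactive, lam = 0
    w_of = lambda lam: proj_W((v + lam * c) / (1.0 + lam))
    lam_hi = 1.0
    while np.linalg.norm(w_of(lam_hi) - c) > r:  # bracket: w(lam) -> c as lam -> inf
        lam_hi *= 2.0
    lam_lo = 0.0
    for _ in range(iters):
        mid = 0.5 * (lam_lo + lam_hi)
        if np.linalg.norm(w_of(mid) - c) > r:
            lam_lo = mid
        else:
            lam_hi = mid
    return w_of(lam_hi)

# Demo with hypothetical numbers: W = [0, 1]^2, center c = 0, radius 0.5.
proj_box = lambda u: np.clip(u, 0.0, 1.0)
w = project_intersection(np.array([2.0, 2.0]), np.zeros(2), 0.5, proj_box)
print(w)  # ~ [0.3536, 0.3536], i.e. the diagonal point of B(0, 0.5)
```

The formula for $w(\lambda)$ comes from minimizing $\tfrac{1}{2}\|w - v\|^2 + \tfrac{\lambda}{2}(\|w - c\|^2 - r^2)$ over $w\in\mathcal{W}$, which is a single projection onto $\mathcal{W}$ of a convex combination of $v$ and $c$.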
Moreover, we can replace the\nsubroutine with a slightly different variant of SSG to get around of the projection onto the intersection\nof W and a bounded ball, which is presented in the supplement.\n\n5 Applications\n\nFrom the last two sections, we can see that \u03b8 = 1 is a favorable case, which yields the fastest rate\nin our results. It is obvious that if f (w, z) is strongly convex or P (w) is strongly convex, then\nEBC(\u03b8 = 1, \u03b1) holds. Below we show some examples of problem (1) and (2) with \u03b8 = 1 without\n\nstrong convexity, which not only recover some known results of fast rate (cid:101)O(d/n), but also induce\nnew results of fast rates that are even faster than (cid:101)O(d/n).\n\nQuadratic Problems (QP):\n(9)\nwhere c is a constant. The random function can be taken as f (w, z, z(cid:48)) = w(cid:62)A(z)w + w(cid:62)b(z(cid:48)) + c.\nWe have the following corollary.\n\nw\u2208W P (w) (cid:44) w(cid:62)Ez[A(z)]w + w(cid:62)Ez(cid:48)[b(z(cid:48))] + c\n\nmin\n\n6\n\n\fCorollary 1. If Ez[A(z)] is a positive semi-de\ufb01nite matrix (not necessarily positive de\ufb01nite)\nand W is a bounded polyhedron, then the problem (9) satis\ufb01es EBC(\u03b8 = 1, \u03b1). Assume that\n\nmax((cid:107)A(z)(cid:107)2,(cid:107)b(z(cid:48))(cid:107)2) \u2264 \u03c3 < \u221e, then ERM has a fast rate at least (cid:101)O(d/n). If f (w, z, z(cid:48)) is\nfurther non-negative, convex and smooth, then ERM has a fast rate of (cid:101)O((d/n)2 + dP\u2217/n) when\nn \u2265 \u2126(d log n). ASA has a convergence rate of (cid:101)O(1/n).\n\nmin\n\nNext, we present some instances of the quadratic problem (9).\nInstance 1 of QP: minimizing the expected square loss. 
Consider the following problem:\n\n$\\min_{w\\in\\mathcal{W}} P(w) \\triangleq \\mathrm{E}_{x,y}[(w^{\\top}x - y)^2]$ (10)\n\nwhere $x\\in\\mathcal{X}$, $y\\in\\mathcal{Y}$ and $\\mathcal{W}$ is a bounded polyhedron (e.g., an $\\ell_1$-ball or $\\ell_\\infty$-ball). It is not difficult to show that it is an instance of (9) and has the property that $f(w, z, z')$ is non-negative, smooth, convex, and Lipschitz continuous over $\\mathcal{W}$. The convergence results in Corollary 1 for this instance not only recover some known results of the $\\widetilde O(d/n)$ rate [22, 26], but also imply a faster rate than $\\widetilde O(d/n)$ in a large-sample regime and an optimistic case when $n \\ge \\Omega(d\\log n)$ and $P_* \\le O(d\\log n/n)$, where the latter result is the first such result of its own.\nInstance 2 of QP. Let us consider the following problem:\n\n$\\min_{w\\in\\mathcal{W}} P(w) \\triangleq \\mathrm{E}_z[w^{\\top}(S - zz^{\\top})w] - w^{\\top}b$ (11)\n\nwhere $S - \\mathrm{E}_z[zz^{\\top}] \\succeq 0$. It is notable that $f(w, z) = w^{\\top}(S - zz^{\\top})w - w^{\\top}b$ might be non-convex. A similar problem as (11) could arise in computing the leading eigenvector of $\\mathrm{E}[zz^{\\top}]$ by performing the shifted-and-inverted power method over random samples $z\\sim\\mathbb{P}$ [10].\n\nPiecewise Linear Problems (PLP):\n\n$\\min_{w\\in\\mathcal{W}} P(w) \\triangleq \\mathrm{E}[f(w, z)]$ (12)\n\nwhere $\\mathrm{E}[f(w, z)]$ is a piecewise linear convex function and $\\mathcal{W}$ is a bounded polyhedron. We have the following corollary.\nCorollary 2. If $\\mathrm{E}[f(w, z)]$ is piecewise linear and convex and $\\mathcal{W}$ is a bounded polyhedron, then the problem (12) satisfies EBC($\\theta = 1, \\alpha$). If $f(w, z)$ is Lipschitz continuous, then ERM has a fast rate of at least $\\widetilde O(d/n)$, and ASA has a convergence rate of $\\widetilde O(1/n)$. If $f(w, z)$ is further non-negative and linear, then ERM has a fast rate of $\\widetilde O((d/n)^2 + dP_*/n)$ when $n \\ge \\Omega(d\\log n)$.\nInstance 1 of PLP: minimizing the expected hinge loss for bounded data. 
Consider the following\nproblem:\n\nP (w) (cid:44) Ex,y[(1 \u2212 yw(cid:62)x)+]\n\nmin\n\n(cid:107)w(cid:107)p\u2264B\n\n(13)\nwhere p = 1,\u221e and y \u2208 {1,\u22121}. Suppose that x \u2208 X is bounded and scaled such that |w(cid:62)x| \u2264 1.\nKoolen et al. [20] has considered this instance with p = 2 and proved that the Bernstein condition\n(De\ufb01nition 2) holds with \u03b2 = 1 for the problem (13) when E[yx] (cid:54)= 0 and |w(cid:62)x| \u2264 1. In contrast,\nwe can show that the problem (13) with any p = 1, 2,\u221e norm constraint 3, the EBC(\u03b8 = 1, \u03b1) holds\nsince the objective P (w) = 1 \u2212 w(cid:62)E[yx] is essentially a linear function of w. Then all results\nin Corollary 2 hold. To the best of our knowledge, the fast rates of ERM and SA for this instance\nwith (cid:96)1 and (cid:96)\u221e norm constraint are the new results. In comparison, Koolen et al.\u2019s [20] fast rate of\n\n(cid:101)O(1/n) only applies to SA and (cid:96)2 norm constraint, and their SA algorithm is not as ef\ufb01cient as our\n\nSA algorithm.\nInstance 2 of PLP: multi-dimensional newsvendor problem. Consider a \ufb01rm that manufactures p\nproducts from q resources. Suppose that a manager must decide on a resource vector x \u2208 Rq\n+ before\nthe product demand vector z \u2208 Rp is observed. After the demand becomes known, the manager\nchooses a production vector y \u2208 Rp so as to maximize the operating pro\ufb01t. Assuming that the\ndemand z is a random vector with discrete probability distribution, the problem is equivalent to\n\nc(cid:62)x \u2212 E[\u03a0(x; z)]\n\nx\u2208Rq\n\nmin\n+,x\u2264b\n\nwhere both \u03a0(x; z) and E[\u03a0(x; z)] are piecewise linear concave functions [18]. 
Then the problem fits the setting in Corollary 2.

³The case of $p=2$ is shown later.

[Figure 1: Testing error vs. iteration of ASA and other baselines for SA, on (a) rcv1_binary, (b) real-sim, (c) E2006-tfidf, (d) E2006-log1p.]

Risk Minimization Problems over an $\ell_2$ ball. Consider the following problem:

$\min_{\|w\|_2\leq B} P(w) \triangleq \mathbb{E}_z[f(w,z)]$  (14)

Assuming that $P(w)$ is convex and $\min_{w\in\mathbb{R}^d} P(w) < \min_{\|w\|_2\leq B} P(w)$, we can show that EBC($\theta=1,\alpha$) holds (see the supplement). Using this result, we can easily show that the considered problem (13) with $p=2$ satisfies EBC($\theta=1,\alpha$).

Risk Minimization with $\ell_1$ Regularization Problems. For $\ell_1$-regularized risk minimization:

$\min_{\|w\|_1\leq B} P(w) \triangleq \mathbb{E}[f(w;z)] + \lambda\|w\|_1,$  (15)

we have the following corollary.

Corollary 3. If the first component is quadratic as in (9) or is piecewise linear and convex, then problem (15) satisfies EBC($\theta=1,\alpha$). If the random function is Lipschitz continuous, then ERM has a fast rate of at least $\widetilde{O}(d/n)$, and ASA has a convergence rate of $\widetilde{O}(1/n)$. If $f(w,z)$ is further non-negative, convex and smooth, then ERM has a fast rate of $\widetilde{O}((d/n)^2 + dP_*/n)$ when $n\geq\Omega(d\log n)$.

To the best of our knowledge, the above general result is the first of its kind. Next, we show some instances satisfying EBC($\theta,\alpha$) with $\theta<1$. Consider the problem $\min_{w\in\mathcal{W}} F(w) \triangleq P(w) + \lambda\|w\|_p^p$, where $P(w)$ is quadratic as in (9) and $\mathcal{W}$ is a bounded polyhedron. In the supplement, we prove that EBC($\theta=2/p,\alpha$) holds.

A Case Study for ASA. Finally, we provide some empirical evidence to support the effectiveness of the proposed ASA algorithm.
In particular, we will consider solving an $\ell_1$-regularized expected square loss minimization problem (15) for learning a predictive model. We compare with two baselines whose convergence rates are known to be $O(1/\sqrt{n})$, namely the proximal stochastic gradient (PSG) method [7] and the stochastic mirror descent (SMD) method using a $p$-norm divergence function ($p = 2\log d$) other than the Euclidean function. For SMD, we implement the algorithm proposed in [37], which was proposed for solving (15) and could be effective for very high-dimensional data. For ASA, we implement two versions that use PSG and SMD as the subroutine and report the one that gives the best performance. The two versions differ in using the Euclidean norm or the $p$-norm for measuring distance. Since the comparison is focused on the testing error, we also include another strong baseline, i.e., averaged stochastic gradient (ASGD) with a constant step size, which enjoys an $O(d/n)$ rate for minimizing the expected square loss without any constraints or regularizations [1]. We use four benchmark datasets from the libsvm website⁴, namely real-sim, rcv1_binary, E2006-tfidf, and E2006-log1p, whose dimensionalities are 20958, 47236, 150360, and 4272227, respectively. We divide each dataset into three sets: training, validation, and testing. For the E2006-tfidf and E2006-log1p datasets, we randomly split the given testing set into half validation and half testing. For real-sim, which does not explicitly provide a testing set, we randomly split the entire data into 4:1:1 for training, validation, and testing. For rcv1_binary, although a testing set is given, the training set is relatively small; thus we first combine the training and testing sets and then follow the above procedure to split them.
The involved parameters of each algorithm are tuned based on the validation data.
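For concreteness, here is a minimal numpy sketch of what a FOBOS-style PSG baseline for (15) might look like: a stochastic gradient step on the square loss followed by the $\ell_1$ proximal (soft-thresholding) step. The step-size schedule, synthetic data, and all names below are our illustrative assumptions, not the implementation used in the experiments:

```python
import numpy as np

def soft_threshold(w, tau):
    """Proximal operator of tau * ||w||_1 (coordinate-wise shrinkage)."""
    return np.sign(w) * np.maximum(np.abs(w) - tau, 0.0)

def psg_one_pass(X, y, lam, eta0=0.5):
    """One pass of proximal stochastic gradient on the l1-regularized
    square loss: gradient step on 0.5*(w^T x - y)^2, then the l1 prox,
    with step size eta_t = eta0 / sqrt(t)."""
    n, d = X.shape
    w = np.zeros(d)
    for t in range(1, n + 1):
        x_t, y_t = X[t - 1], y[t - 1]
        eta = eta0 / np.sqrt(t)
        grad = (w @ x_t - y_t) * x_t          # gradient of 0.5*(w^T x - y)^2
        w = soft_threshold(w - eta * grad, eta * lam)
    return w

# synthetic sparse regression: only the first 5 of 100 features matter
rng = np.random.default_rng(0)
n, d = 5000, 100
w_star = np.zeros(d)
w_star[:5] = 1.0
X = rng.normal(size=(n, d)) / np.sqrt(d)      # rows scaled to roughly unit norm
y = X @ w_star + 0.01 * rng.normal(size=n)

w = psg_one_pass(X, y, lam=1e-4)
mse = np.mean((X @ w - y) ** 2)
assert mse < np.mean(y ** 2)                  # one pass beats the trivial w = 0 model
```

ASA would wrap such a subroutine in stages with restarted step sizes; the sketch above only shows the single-pass $O(1/\sqrt{n})$ baseline being compared against.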
With the selected parameters, we run each algorithm by passing through the training examples once and evaluate the intermediate models on the testing data to compute the testing error measured by the square loss. The results on the different datasets, averaged over 5 random runs over shuffled training examples, are shown in Figure 1. From the testing curves, we can see that the proposed ASA has a convergence rate similar to ASGD on the two relatively low-dimensional datasets. This is not surprising, since both algorithms enjoy an $\widetilde{O}(1/n)$ convergence rate as indicated by their theories. On E2006-tfidf and E2006-log1p, we observe that ASA converges much faster than ASGD, which is due to the presence of the $\ell_1$ regularization. In addition, ASA converges much faster than PSG and SMD, with one exception on E2006-log1p, on which ASA performs only slightly better than SMD.

⁴http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/

6 Conclusion

We have comprehensively studied statistical learning under the error bound condition for both ERM and SA. We established the connection between the error bound condition and previous conditions for developing fast rates of empirical risk minimization for Lipschitz continuous loss functions. We also developed improved rates for non-negative and smooth convex loss functions, which induce faster rates that were not achieved before. Finally, we analyzed an efficient "parameter-free" SA algorithm under the error bound condition and showed that it is automatically adaptive to the error bound condition.
Applications in machine learning and other fields are considered, and empirical studies corroborate the fast rates of the developed algorithms. An open question is how to develop efficient SA algorithms under the error bound condition with optimistic rates for non-negative smooth loss functions, similar to the results obtained for empirical risk minimization in this paper.

Acknowledgement

The authors thank the anonymous reviewers for their helpful comments. M. Liu and T. Yang are partially supported by National Science Foundation (IIS-1545995). L. Zhang is partially supported by YESS (2017QNRC001). We thank Nishant A. Mehta for pointing out the work [12] for the proof of Theorem 1.

References

[1] Francis R. Bach and Eric Moulines. Non-strongly-convex smooth stochastic approximation with convergence rate O(1/n). In Advances in Neural Information Processing Systems (NIPS), pages 773-781, 2013.

[2] Peter L. Bartlett, Olivier Bousquet, and Shahar Mendelson. Local Rademacher complexities. The Annals of Statistics, 2005.

[3] Peter L. Bartlett and Shahar Mendelson. Empirical minimization. Probability Theory and Related Fields, 2006.

[4] Jerome Bolte, Trong Phong Nguyen, Juan Peypouquet, and Bruce Suter. From error bounds to the complexity of first-order descent methods for convex functions. CoRR, abs/1510.08234, 2015.

[5] James V. Burke and Michael C. Ferris. Weak sharp minima in mathematical programming. SIAM Journal on Control and Optimization, 31(5):1340-1359, 1993.

[6] Dmitriy Drusvyatskiy and Adrian S. Lewis. Error bounds, quadratic growth, and linear convergence of proximal methods. arXiv:1602.06661, 2016.

[7] John Duchi and Yoram Singer. Efficient online and batch learning using forward backward splitting. Journal of Machine Learning Research, 10:2899-2934, 2009.

[8] John C. Duchi, Shai Shalev-Shwartz, Yoram Singer, and Ambuj Tewari. Composite objective mirror descent.
In COLT, pages 14-26. Omnipress, 2010.

[9] Vitaly Feldman. Generalization of ERM in stochastic convex optimization: The dimension strikes back. In NIPS, 2016.

[10] Dan Garber, Elad Hazan, Chi Jin, Sham M. Kakade, Cameron Musco, Praneeth Netrapalli, and Aaron Sidford. Faster eigenvector computation via shift-and-invert preconditioning. In ICML, 2016.

[11] Alon Gonen and Shai Shalev-Shwartz. Average stability is invariant to data preconditioning: Implications to exp-concave empirical risk minimization. J. Mach. Learn. Res., 18(1):8245-8257, January 2017.

[12] Peter D. Grünwald and Nishant A. Mehta. Fast rates with unbounded losses. CoRR, abs/1605.00252, 2016.

[13] Elad Hazan, Amit Agarwal, and Satyen Kale. Logarithmic regret algorithms for online convex optimization. Machine Learning, 2007.

[14] Elad Hazan and Satyen Kale. Beyond the regret minimization barrier: an optimal algorithm for stochastic strongly-convex optimization. In COLT, 2011.

[15] Anatoli Juditsky and Yuri Nesterov. Deterministic and stochastic primal-dual subgradient algorithms for uniformly convex minimization. Stoch. Syst., 2014.

[16] Sham M. Kakade and Ambuj Tewari. On the generalization ability of online strongly convex programming algorithms. In NIPS, 2008.

[17] Hamed Karimi, Julie Nutini, and Mark W. Schmidt. Linear convergence of gradient and proximal-gradient methods under the Polyak-Łojasiewicz condition. In ECML-PKDD, 2016.

[18] Sujin Kim, Raghu Pasupathy, and Shane G. Henderson. A Guide to Sample Average Approximation, pages 207-243. Springer New York, New York, NY, 2015.

[19] Vladimir Koltchinskii. Local Rademacher complexities and oracle inequalities in risk minimization. The Annals of Statistics, 2006.

[20] Wouter M. Koolen, Peter Grünwald, and Tim van Erven. Combining adversarial guarantees and stochastic fast rates in online learning.
In NIPS, 2016.

[21] Tomer Koren and Kfir Y. Levy. Fast rates for exp-concave empirical risk minimization. In NIPS, 2015.

[22] Wee Sun Lee, P. L. Bartlett, and R. C. Williamson. The importance of convexity in learning with squared loss. IEEE Transactions on Information Theory, 44(5):1974-1980, 1998.

[23] Guoyin Li. Global error bounds for piecewise convex polynomials. Math. Program., 2013.

[24] Guoyin Li and Ting Kei Pong. Calculus of the exponent of Kurdyka-Łojasiewicz inequality and its applications to linear convergence of first-order methods. CoRR, abs/1602.02915, 2016.

[25] Enno Mammen and Alexandre B. Tsybakov. Smooth discrimination analysis. Ann. Statist., 27(6):1808-1829, 1999.

[26] Nishant A. Mehta. Fast rates with high probability in exp-concave statistical learning. In AISTATS, 2017.

[27] Nishant A. Mehta and Robert C. Williamson. From stochastic mixability to fast rates. In NIPS, 2014.

[28] I. Necoara, Yu. Nesterov, and F. Glineur. Linear convergence of first order methods for non-strongly convex optimization. CoRR, abs/1504.06298, 2015.

[29] Arkadi Nemirovski, Anatoli Juditsky, Guanghui Lan, and Alexander Shapiro. Robust stochastic approximation approach to stochastic programming. SIAM Journal on Optimization, 2009.

[30] Yurii Nesterov. Introductory Lectures on Convex Optimization: A Basic Course. 2004.

[31] Jong-Shi Pang. Error bounds in mathematical programming. Math. Program., 1997.

[32] Alexander Rakhlin, Ohad Shamir, and Karthik Sridharan. Making gradient descent optimal for strongly convex stochastic optimization. In ICML, 2012.

[33] Aaditya Ramdas and Aarti Singh. Optimal rates for stochastic convex optimization under Tsybakov noise condition. In ICML, 2013.

[34] R. T. Rockafellar. Convex Analysis. 1970.

[35] Shai Shalev-Shwartz, Ohad Shamir, Nathan Srebro, and Karthik Sridharan. Stochastic convex optimization.
In COLT, 2009.

[36] Shai Shalev-Shwartz, Yoram Singer, and Nathan Srebro. Pegasos: Primal estimated sub-gradient solver for SVM. In ICML, 2007.

[37] Shai Shalev-Shwartz and Ambuj Tewari. Stochastic methods for l1-regularized loss minimization. Journal of Machine Learning Research, 12:1865-1892, 2011.

[38] Ohad Shamir and Tong Zhang. Stochastic gradient descent for non-smooth optimization: Convergence results and optimal averaging schemes. In ICML, 2013.

[39] Alexander Shapiro, Darinka Dentcheva, and Andrzej Ruszczynski. Lectures on Stochastic Programming: Modeling and Theory, Second Edition. Society for Industrial and Applied Mathematics, Philadelphia, PA, USA, 2014.

[40] Steve Smale and Ding-Xuan Zhou. Learning theory estimates via integral operators and their approximations. Constructive Approximation, 2007.

[41] Nathan Srebro, Karthik Sridharan, and Ambuj Tewari. Optimistic rates for learning with a smooth loss. arXiv:1009.3896, 2010.

[42] Nathan Srebro, Karthik Sridharan, and Ambuj Tewari. Smoothness, low noise and fast rates. In NIPS, 2010.

[43] Karthik Sridharan, Shai Shalev-Shwartz, and Nathan Srebro. Fast rates for regularized objectives. In NIPS, 2008.

[44] Alexander B. Tsybakov. Optimal aggregation of classifiers in statistical learning. Ann. Statist., 32(1):135-166, 2004.

[45] Alexander B. Tsybakov et al. Optimal aggregation of classifiers in statistical learning. The Annals of Statistics, 32(1):135-166, 2004.

[46] Tim van Erven, Peter D. Grünwald, Nishant A. Mehta, Mark D. Reid, and Robert C. Williamson. Fast rates in statistical and online learning. JMLR, 2015.

[47] Tim van Erven and Wouter M. Koolen. MetaGrad: Multiple learning rates in online learning. In NIPS, 2016.

[48] Vladimir N. Vapnik. Statistical Learning Theory. Wiley-Interscience, 1998.

[49] Yi Xu, Qihang Lin, and Tianbao Yang.
Accelerate stochastic subgradient method by leveraging local error bound. CoRR, abs/1607.01027, 2016.

[50] Yi Xu, Qihang Lin, and Tianbao Yang. Stochastic convex optimization: Faster local growth implies faster global convergence. In ICML, pages 3821-3830, 2017.

[51] Tianbao Yang, Zhe Li, and Lijun Zhang. A simple analysis for exp-concave empirical minimization with arbitrary convex regularizer. In AISTATS, pages 445-453, 2018.

[52] Tianbao Yang and Qihang Lin. RSG: Beating subgradient method without smoothness and strong convexity. CoRR, abs/1512.03107, 2016.

[53] W. H. Yang. Error bounds for convex polynomials. SIAM Journal on Optimization, 2009.

[54] Hui Zhang. New analysis of linear convergence of gradient-type methods via unifying error bound conditions. CoRR, abs/1606.00269, 2016.

[55] Lijun Zhang, Tianbao Yang, and Rong Jin. Empirical risk minimization for stochastic convex optimization: O(1/n)- and O(1/n²)-type of risk bounds. CoRR, abs/1702.02030, 2017.

[56] Martin Zinkevich. Online convex programming and generalized infinitesimal gradient ascent. In ICML, pages 928-936, 2003.