{"title": "Improved Dynamic Regret for Non-degenerate Functions", "book": "Advances in Neural Information Processing Systems", "page_first": 732, "page_last": 741, "abstract": "Recently, there has been a growing research interest in the analysis of dynamic regret, which measures the performance of an online learner against a sequence of local minimizers. By exploiting the strong convexity, previous studies have shown that the dynamic regret can be upper bounded by the path-length of the comparator sequence. In this paper, we illustrate that the dynamic regret can be further improved by allowing the learner to query the gradient of the function multiple times, and meanwhile the strong convexity can be weakened to other non-degenerate conditions. Specifically, we introduce the squared path-length, which could be much smaller than the path-length, as a new regularity of the comparator sequence. When multiple gradients are accessible to the learner, we first demonstrate that the dynamic regret of strongly convex functions can be upper bounded by the minimum of the path-length and the squared path-length. We then extend our theoretical guarantee to functions that are semi-strongly convex or self-concordant. To the best of our knowledge, this is the first time that semi-strong convexity and self-concordance are utilized to tighten the dynamic regret.", "full_text": "Improved Dynamic Regret for Non-degenerate\n\nFunctions\n\nLijun Zhang\u2217, Tianbao Yang\u2020, Jinfeng Yi\u2021, Rong Jin\u00a7, Zhi-Hua Zhou\u2217\n\n\u2217National Key Laboratory for Novel Software Technology, Nanjing University, Nanjing, China\n\n\u2020Department of Computer Science, The University of Iowa, Iowa City, USA\n\n\u2021AI Foundations Lab, IBM Thomas J. 
Watson Research Center, Yorktown Heights, NY, USA\n\n\u00a7Alibaba Group, Seattle, USA\n\nzhanglj@lamda.nju.edu.cn, tianbao-yang@uiowa.edu, jinfengyi@tencent.com\n\njinrong.jr@alibaba-inc.com, zhouzh@lamda.nju.edu.cn\n\nAbstract\n\nRecently, there has been a growing research interest in the analysis of dynamic\nregret, which measures the performance of an online learner against a sequence\nof local minimizers. By exploiting the strong convexity, previous studies have\nshown that the dynamic regret can be upper bounded by the path-length of the\ncomparator sequence. In this paper, we illustrate that the dynamic regret can be\nfurther improved by allowing the learner to query the gradient of the function\nmultiple times, and meanwhile the strong convexity can be weakened to other\nnon-degenerate conditions. Speci\ufb01cally, we introduce the squared path-length,\nwhich could be much smaller than the path-length, as a new regularity of the\ncomparator sequence. When multiple gradients are accessible to the learner, we\n\ufb01rst demonstrate that the dynamic regret of strongly convex functions can be upper\nbounded by the minimum of the path-length and the squared path-length. We then\nextend our theoretical guarantee to functions that are semi-strongly convex or self-\nconcordant. To the best of our knowledge, this is the \ufb01rst time that semi-strong\nconvexity and self-concordance are utilized to tighten the dynamic regret.\n\n1\n\nIntroduction\n\nOnline convex optimization is a fundamental tool for solving a wide variety of machine learning\nproblems [Shalev-Shwartz, 2011]. It can be formulated as a repeated game between a learner and\nan adversary. On the t-th round of the game, the learner selects a point xt from a convex set X\nand the adversary chooses a convex function ft : X 7\u2192 R. Then, the function is revealed to the\nlearner, who incurs loss ft(xt). 
The standard performance measure is the regret, de\ufb01ned as the\ndifference between the learner\u2019s cumulative loss and the cumulative loss of the optimal \ufb01xed vector\nin hindsight:\n\nT\n\nXt=1\n\nft(xt) \u2212 min\n\nx\u2208X\n\nT\n\nXt=1\n\nft(x).\n\n(1)\n\nOver the past decades, various online algorithms, such as the online gradient descent [Zinkevich,\n2003], have been proposed to yield sub-linear regret under different scenarios [Hazan et al., 2007,\nShalev-Shwartz et al., 2007].\n\nThough equipped with rich theories, the notion of regret fails to illustrate the performance of on-\nline algorithms in dynamic setting, as a static comparator is used in (1). To overcome this limita-\ntion, there has been a recent surge of interest in analyzing a more stringent metric\u2014dynamic regret\n[Hall and Willett, 2013, Besbes et al., 2015, Jadbabaie et al., 2015, Mokhtari et al., 2016, Yang et al.,\n\n31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.\n\n\f2016], in which the cumulative loss of the learner is compared against a sequence of local minimiz-\ners, i.e.,\n\nR\u2217\n\nT :=R(x\u2217\n\n1, . . . , x\u2217\n\nT ) =\n\nT\n\nXt=1\n\nft(xt) \u2212\n\nT\n\nXt=1\n\nft(x\u2217\n\nt ) =\n\nT\n\nXt=1\n\nft(xt) \u2212\n\nT\n\nXt=1\n\nmin\nx\u2208X\n\nft(x)\n\n(2)\n\nwhere x\u2217\nt \u2208 argminx\u2208X ft(x). A more general de\ufb01nition of dynamic regret is to evaluate the\ndifference of the cumulative loss with respect to any sequence of comparators u1, . . . , uT \u2208 X\n[Zinkevich, 2003].\n\nIt is well-known that in the worst-case, it is impossible to achieve a sub-linear dynamic regret bound,\ndue to the arbitrary \ufb02uctuation in the functions. However, it is possible to upper bound the dynamic\nregret in terms of certain regularity of the comparator sequence or the function sequence. A natural\nregularity is the path-length of the comparator sequence, de\ufb01ned as\n\nP \u2217\nT := P(x\u2217\n\n1, . . . 
, x*_T) = sum_{t=2}^T ||x*_t - x*_{t-1}|| (3)\n\nwhich captures the cumulative Euclidean norm of the differences between successive comparators. For convex functions, the dynamic regret of online gradient descent can be upper bounded by O(sqrt(T) P*_T) [Zinkevich, 2003], and when all the functions are strongly convex and smooth, the upper bound can be improved to O(P*_T) [Mokhtari et al., 2016].\n\nIn the aforementioned results, the learner uses the gradient of each function only once, and performs one step of gradient descent to update the intermediate solution. In this paper, we examine an interesting question: is it possible to improve the dynamic regret when the learner is allowed to query the gradient multiple times? Note that the answer to this question is no if one aims to improve the static regret in (1), according to the results on the minimax regret bound [Abernethy et al., 2008a]. We show, however, that for the dynamic regret, multiple gradients can reduce the upper bound significantly. To this end, we introduce a new regularity, the squared path-length:\n\nS*_T := S(x*_1, . . . , x*_T) = sum_{t=2}^T ||x*_t - x*_{t-1}||^2 (4)\n\nwhich could be much smaller than P*_T when the local variations are small. For example, when ||x*_t - x*_{t-1}|| = Omega(1/sqrt(T)) for all t in [T], we have P*_T = Omega(sqrt(T)) but S*_T = Omega(1). We advance the analysis of dynamic regret in the following aspects.\n\n\u2022 When all the functions are strongly convex and smooth, we propose to apply gradient descent multiple times in each round, and demonstrate that the dynamic regret is reduced from O(P*_T) to O(min(P*_T, S*_T)), provided the gradients of the minimizers are small. We further present a matching lower bound which implies that our result cannot be improved in general.\n\n\u2022 When all the functions are semi-strongly convex and smooth, we show that the standard online gradient descent still achieves the O(P*_T) dynamic regret. And if we apply gradient descent multiple times in each round, the upper bound can also be improved to O(min(P*_T, S*_T)), under the same condition as for strongly convex functions.\n\n\u2022 When all the functions are self-concordant, we establish a similar guarantee if both the gradient and Hessian of the function can be queried multiple times. Specifically, we propose to apply the damped Newton method multiple times in each round, and prove an O(min(P*_T, S*_T)) bound of the dynamic regret under appropriate conditions.1\n\nApplication to Statistical Learning Most studies of dynamic regret, including this paper, do not make stochastic assumptions on the function sequence. In the following, we discuss how to interpret our results when facing the problem of statistical learning. In this case, the learner receives a sequence of losses l(x^T z_1, y_1), l(x^T z_2, y_2), . . ., where the (z_i, y_i)'s are instance-label pairs sampled from an unknown distribution, and l(., .) measures the prediction error. 
To avoid the random \ufb02uctua-\ntion caused by sampling, we can set ft as the loss averaged over a mini-batch of instance-label pairs.\nAs a result, when the underlying distribution is stationary or drifts slowly, successive functions will\nbe close to each other, and thus the path-length and the squared path-length are expected to be small.\n\n1P \u2217\n\nT and S \u2217\n\nT are modi\ufb01ed slightly when functions are semi-strongly convex or self-concordant.\n\n2\n\n\f2 Related Work\n\nThe static regret in (1) has been extensively studied in the literature [Shalev-Shwartz, 2011]. It has\n\nbeen established that the static regret can be upper bounded by O(\u221aT ), O(log T ), and O(log T )\n\nfor convex functions, strongly convex functions, and exponentially concave functions, respectively\n[Zinkevich, 2003, Hazan et al., 2007]. Furthermore, those upper bounds are proved to be minimax\noptimal [Abernethy et al., 2008a, Hazan and Kale, 2011].\n\nR(u1, . . . , uT ), is on the order of \u221aTP(u1, . . . , uT ). When a prior knowledge of P \u2217\n\nThe notion of dynamic regret is introduced by Zinkevich [2003]. If we choose the online gradient\ndescent as the learner, the dynamic regret with respect to any comparator sequence u1, . . . , uT , i.e.,\nT is available,\nT ) [Yang et al., 2016]. If all the functions\nT ) [Mokhtari et al.,\nT ) rate is also achievable when all the functions are convex and smooth, and all the\n\nthe dynamic regret R\u2217\nare strongly convex and smooth, the upper bound of R\u2217\n2016]. The O(P \u2217\nminimizers x\u2217\n\nT can be upper bounded by O(pTP \u2217\nt \u2019s lie in the interior of X [Yang et al., 2016].\n\nT can be improved to O(P \u2217\n\nAnother regularity of the comparator sequence, which is similar to the path-length, is de\ufb01ned as\n\nP \u2032(u1, . . . , uT ) =\n\nT\n\nXt=2\n\nkut \u2212 \u03a6t(ut\u22121)k\n\nwhere \u03a6t(\u00b7) is a dynamic model that predicts a reference point for the t-th round. 
The advantage\nof this measure is that when the comparator sequence follows the dynamical model closely, it can\nbe much smaller than the path-length P(u1, . . . , uT ). A novel algorithm named dynamic mirror\ndescent is proposed to take \u03a6t(ut\u22121) into account, and the dynamic regret R(u1, . . . , uT ) is on the\norder of \u221aTP \u2032(u1, . . . , uT ) [Hall and Willett, 2013]. There are also some regularities de\ufb01ned in\n\nterms of the function sequence, such as the functional variation [Besbes et al., 2015]\n\nFT := F(f1, . . . , fT ) =\n\nT\n\nXt=2\n\nx\u2208X |ft(x) \u2212 ft\u22121(x)|\nmax\n\nor the gradient variation [Chiang et al., 2012]\n\nGT := G(f1, . . . , fT ) =\n\nT\n\nXt=2\n\nx\u2208X k\u2207ft(x) \u2212 \u2207ft\u22121(x)k2.\nmax\n\n(5)\n\n(6)\n\nUnder the condition that FT \u2264 FT and Ft is given beforehand, a restarted online gradient descent\nis developed by Besbes et al. [2015], and the dynamic regret is upper bounded by O(T 2/3F 1/3\nT ) and\nO(log T\u221aT FT ) for convex functions and strongly convex functions, respectively.\n\nThe regularities mentioned above re\ufb02ect different aspects of the learning problem, and are not di-\nrectly comparable in general. Thus, it is appealing to develop an algorithm that adapts to the smaller\nregularity of the problem. Jadbabaie et al. [2015] propose an adaptive algorithm based on the opti-\nmistic mirror descent [Rakhlin and Sridharan, 2013], such that the dynamic regret is given in terms\nT , FT , and GT ). However, it relies on the assumption that the learner\n\nof all the three regularities (P \u2217\n\ncan calculate each regularity incrementally.\n\nIn the setting of prediction with expert advice, the dynamic regret is also referred to as tracking\nregret or shifting regret [Herbster and Warmuth, 1998, Cesa-bianchi et al., 2012]. 
The path-length\nof the comparator sequence is named as shift, which is just the number of times the expert changes.\nAnother related performance measure is the adaptive regret, which aims to minimize the static regret\nover any interval [Hazan and Seshadhri, 2007, Daniely et al., 2015]. Finally, we note that the study\nof dynamic regret is similar to the competitive analysis in the sense that both of them compete\nagainst an optimal of\ufb02ine policy, but with signi\ufb01cant differences in their assumptions and techniques\n[Buchbinder et al., 2012].\n\n3 Online Learning with Multiple Gradients\n\nIn this section, we discuss how to improve the dynamic regret by allowing the learner to query the\ngradient multiple times. We start with strongly convex functions, and then proceed to semi-strongly\nconvex functions, and \ufb01nally investigate self-concordant functions.\n\n3\n\n\fAlgorithm 1 Online Multiple Gradient Descent (OMGD)\nRequire: The number of inner iterations K and the step size \u03b7\n1: Let x1 be any point in X\n2: for t = 1, . . . , T do\n3:\n4:\n5:\n6:\n\nSubmit xt \u2208 X and the receive loss ft : X 7\u2192 R\nz1\nt = xt\nfor j = 1, . . . , K do\n\nt = \u03a0X (cid:16)zj\nzj+1\n\nt )(cid:17)\nt \u2212 \u03b7\u2207ft(zj\n\nend for\nxt+1 = zK+1\n\n7:\n8:\n9: end for\n\nt\n\n3.1 Strongly Convex and Smooth Functions\n\nTo be self-contained, we provide the de\ufb01nitions of strong convexity and smoothness.\nDe\ufb01nition 1. A function f : X 7\u2192 R is \u03bb-strongly convex, if\n\nf (y) \u2265 f (x) + h\u2207f (x), y \u2212 xi +\n\nDe\ufb01nition 2. A function f : X 7\u2192 R is L-smooth, if\n\nf (y) \u2264 f (x) + h\u2207f (x), y \u2212 xi +\n\n\u03bb\n2ky \u2212 xk2, \u2200x, y \u2208 X .\n\nL\n2 ky \u2212 xk2, \u2200x, y \u2208 X .\n\nExample 1. The following functions are both strongly convex and smooth.\n\n1. A quadratic form f (x) = x\u22a4Ax \u2212 2b\u22a4x + c where aI (cid:22) A (cid:22) bI, a > 0 and b < \u221e;\n2. 
The regularized logistic loss f (x) = log(1 + exp(b\u22a4x)) + \u03bb\n\n2kxk2, where \u03bb > 0.\n\nFollowing previous studies [Mokhtari et al., 2016], we make the following assumptions.\nAssumption 1. Suppose the following conditions hold for each ft : X 7\u2192 R.\n\n1. ft is \u03bb-strongly convex and L-smooth over X ;\n2. k\u2207ft(x)k \u2264 G, \u2200x \u2208 X .\n\nWhen the learner can query the gradient of each function only once, the most popular learning\nalgorithm is the online gradient descent:\n\nxt+1 = \u03a0X (xt \u2212 \u03b7\u2207ft(xt))\n\nwhere \u03a0X (\u00b7) denotes the projection onto the nearest point in X . Mokhtari et al. [2016] have estab-\nlished an O(P \u2217\nTheorem 1. Suppose Assumption 1 is true. By setting \u03b7 \u2264 1/L in online gradient descent, we have\n\nT ) bound of dynamic regret, as stated below.\n\nT\n\nft(xt) \u2212 ft(x\u2217\n\nt ) \u2264\n\nXt=1\nwhere \u03b3 = q1 \u2212 2\u03bb\n1/\u03b7+\u03bb .\n\n1\n1 \u2212 \u03b3\n\nGP \u2217\n\nT +\n\n1\n1 \u2212 \u03b3\n\nGkx1 \u2212 x\u2217\n1k\n\nWe now consider the setting that the learner can access the gradient of each function multiple times.\nThe algorithm is a natural extension of online gradient descent by performing gradient descent mul-\ntiple times in each round. Speci\ufb01cally, in the t-th round, given the current solution xt, we generate\na sequence of solutions, denoted by z1\n, where K is a constant independent from T , as\nfollows:\n\nt , . . . , zK+1\n\nt\n\nz1\nt = xt,\n\nt = \u03a0X (cid:16)zj\nzj+1\n\nt \u2212 \u03b7\u2207ft(zj\n\nt )(cid:17) , j = 1, . . . , K.\n\n. The procedure is named as Online Multiple Gradient Descent (OMGD)\n\nThen, we set xt+1 = zK+1\nand is summarized in Algorithm 1.\n\nt\n\n4\n\n\fBy applying gradient descent multiple times, we are able to extract more information from each\nfunction and therefore are more likely to obtain a tight bound for the dynamic regret. 
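The OMGD update just described is only a few lines of code. The following is a minimal sketch, not the authors' implementation; the ball-shaped feasible set in `project_ball` is our own illustrative choice for Pi_X.

```python
import numpy as np

def project_ball(x, radius=1.0):
    # Euclidean projection onto {x : ||x|| <= radius}, standing in for Pi_X
    n = np.linalg.norm(x)
    return x if n <= radius else x * (radius / n)

def omgd(grads, x1, eta, K, project=project_ball):
    """Online Multiple Gradient Descent (Algorithm 1).

    grads[t](z) plays the role of grad f_t(z); returns the submitted points x_1, ..., x_T.
    """
    xs = [np.asarray(x1, dtype=float)]
    for g in grads:
        z = xs[-1]                        # z^1_t = x_t
        for _ in range(K):                # K projected-gradient steps on the revealed f_t
            z = project(z - eta * g(z))
        xs.append(z)                      # x_{t+1} = z^{K+1}_t
    return xs[:-1]
```

For instance, with f_t(x) = ||x - c_t||^2 / 2 (so lambda = L = 1 and eta <= 1/L is easy to satisfy), a moderate constant K per round keeps x_t locked onto the slowly moving minimizers c_t.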
The following\ntheorem shows that the multiple accesses of the gradient indeed help improve the dynamic regret.\n\nTheorem 2. Suppose Assumption 1 is true. By setting \u03b7 \u2264 1/L and K = \u2308 1/\u03b7+\u03bb\nm 1, for any constant \u03b1 > 0, we have\n\n2\u03bb\n\nln 4\u2309 in Algorith-\n\nT\n\nXt=1\n\nft(xt) \u2212 ft(x\u2217\n\nt ) \u2264 min\uf8f1\uf8f2\n\uf8f3\n\nT + 2Gkx1 \u2212 x\u2217\n2GP \u2217\n1k,\nPT\nt=1 k\u2207ft(x\u2217\n+ 2(L + \u03b1)S \u2217\n\nt )k2\n\n2\u03b1\n\nT + (L + \u03b1)kx1 \u2212 x\u2217\n\n1k2.\n\nt=1 k\u2207ft(x\u2217\n\nWhen PT\nCorollary 3. Suppose PT\n\nt )k2 is small, Theorem 2 can be simpli\ufb01ed as follows.\n\nt )k2 = O(S \u2217\n\nT ), from Theorem 2, we have\n\nt=1 k\u2207ft(x\u2217\nXt=1\n\nT\n\nft(xt) \u2212 ft(x\u2217\n\nt ) = O (min(P \u2217\n\nT ,S \u2217\n\nT )) .\n\nIn particular, if x\u2217\nas \u03b1 \u2192 0, implies\n\nt belongs to the relative interior of X (i.e., \u2207ft(x\u2217\n\nt ) = 0) for all t \u2208 [T ], Theorem 2,\n\nT\n\nXt=1\n\nft(xt) \u2212 ft(x\u2217\n\nt ) \u2264 min(cid:0)2GP \u2217\n\nT + 2Gkx1 \u2212 x\u2217\n\n1k, 2LS \u2217\n\nT + Lkx1 \u2212 x\u2217\n\n1k2(cid:1) .\n\nCompared to Theorem 1, the proposed OMGD improves the dynamic regret from O(P \u2217\nT )), when the gradients of minimizers are small. Recall the de\ufb01nitions of P \u2217\nT ,S \u2217\nO (min (P \u2217\nS \u2217\nT in (3) and (4), respectively. We can see that S \u2217\nence between x\u2217\nt and x\u2217\nsigni\ufb01cantly smaller than P \u2217\nT , as indicated below.\nExample 2. Suppose kx\u2217\n\nt\u22121. 
In this way, if the local variations (kx\u2217\nt \u2212 x\u2217\n\nT ) to\nT and\nT introduces a square when measuring the differ-\nT can be\n\nt\u22121k\u2019s) are small, S \u2217\n\nt \u2212 x\u2217\n\nt\u22121k = T \u2212\u03c4 for all t \u2265 1 and \u03c4 > 0, we have\nS \u2217\nT +1 = T 1\u22122\u03c4 \u226a P \u2217\nT +1 = 1 \u226a P \u2217\n\nT +1 = T 1\u2212\u03c4 .\nT +1 = \u221aT .\n\nIn particular, when \u03c4 = 1/2, we have S \u2217\nS \u2217\nT is also closely related to the gradient variation in (6). When all the x\u2217\ninterior of X , we have \u2207ft(x\u2217\n\nt ) = 0 for all t \u2208 [T ] and therefore\n\nt \u2019s belong to the relative\n\nGT \u2265\n\nT\n\nXt=2\n\nk\u2207ft(x\u2217\n\nt\u22121) \u2212 \u2207ft\u22121(x\u2217\n\nt\u22121)k2 =\n\nT\n\nXt=2\n\nk\u2207ft(x\u2217\n\nt\u22121) \u2212 \u2207ft(x\u2217\n\nt )k2 \u2265 \u03bb2S \u2217\n\nT\n\n(7)\n\nwhere the last inequality follows from the property of strongly convex functions [Nesterov, 2004].\nThe following corollary is an immediate consequence of Theorem 2 and the inequality in (7).\nCorollary 4. Suppose Assumption 1 is true, and further assume all the x\u2217\ninterior of X . By setting \u03b7 \u2264 1/L and K = \u2308 1/\u03b7+\u03bb\n\nln 4\u2309 in Algorithm 1, we have\n\nt \u2019s belong to the relative\n\n2\u03bb\n\nT\n\nXt=1\n\nft(xt) \u2212 ft(x\u2217\n\nt ) \u2264 min(cid:18)2GP \u2217\n\nT + 2Gkx1 \u2212 x\u2217\n1k,\n\n2LGT\n\u03bb2 + Lkx1 \u2212 x\u2217\n\n1k2(cid:19) .\n\nIn Theorem 2, the number of accesses of gradients K is set to be a constant depending on the\ncondition number of the function. One may ask whether we can obtain a tighter bound by using a\nlarger K. Unfortunately, according to our analysis, even if we take K = \u221e, which means ft(\u00b7) is\nminimized exactly, the upper bound can only be improved by a constant factor and the order remains\nthe same. 
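The gap in Example 2 between the two regularities is easy to check numerically. A small sketch (the helper names are ours) with T = 10^4 and tau = 1/2:

```python
import numpy as np

def path_length(xs):
    # P*_T in (3): sum of distances between successive minimizers
    return float(sum(np.linalg.norm(a - b) for a, b in zip(xs[1:], xs[:-1])))

def squared_path_length(xs):
    # S*_T in (4): the same sum with each distance squared
    return float(sum(np.linalg.norm(a - b) ** 2 for a, b in zip(xs[1:], xs[:-1])))

T, tau = 10_000, 0.5
step = T ** (-tau)                                   # each minimizer moves by T^{-tau}
xs = [np.array([t * step]) for t in range(T + 1)]    # x*_1, ..., x*_{T+1}
P = path_length(xs)                                  # about T^{1-tau} = 100.0
S = squared_path_length(xs)                          # about T^{1-2tau} = 1.0
```

The squared path-length collapses to a constant while the ordinary path-length still grows like sqrt(T), exactly as the example states.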
A related question is whether we can reduce the value of K by adopting more advanced\noptimization techniques, such as the accelerated gradient descent [Nesterov, 2004]. This is an open\nproblem to us, and will be investigated as a future work.\n\nFinally, we prove that the O(S \u2217\n\nT ) bound is optimal for strongly convex and smooth functions.\n\n5\n\n\fTheorem 5. For any online learning algorithm A, there always exists a sequence of strongly convex\nand smooth functions f1, . . . , fT , such that\n\nT\n\nXt=1\n\nft(xt) \u2212 ft(x\u2217\n\nt ) = \u2126(S \u2217\nT )\n\nwhere x1, . . . , xT is the solutions generated by A.\nThus, the upper bound in Theorem 2 cannot be improved in general.\n\n3.2 Semi-strongly Convex and Smooth Functions\n\nDuring the analysis of Theorems 1 and 2, we realize that the proof is built upon the fact that \u201cwhen\nthe function is strongly convex and smooth, gradient descent can reduce the distance to the optimal\nsolution by a constant factor\u201d [Mokhtari et al., 2016, Proposition 2]. From the recent developments\nin convex optimization, we know that a similar behavior also happens when the function is semi-\nstrongly convex and smooth [Necoara et al., 2015, Theorem 5.2], which motivates the study in this\nsection.\n\nWe \ufb01rst introduce the de\ufb01nition of semi-strong convexity [Gong and Ye, 2014].\nDe\ufb01nition 3. A function f : X 7\u2192 R is semi-strongly convex over X , if there exists a constant \u03b2 > 0\nsuch that for any x \u2208 X\n\nf (x) \u2212 min\n\nx\u2208X\n\nf (x) \u2265\n\n\u03b2\n2 kx \u2212 \u03a0X \u2217 (x)k2\n\n(8)\n\nwhere X \u2217 = {x \u2208 X : f (x) \u2264 minx\u2208X f (x)} is the set of minimizers of f over X .\n\nThe semi-strong convexity generalizes several non-strongly convex conditions, such as the quadratic\napproximation property and the error bound property [Wang and Lin, 2014, Necoara et al., 2015]. 
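Definition 3 can be checked concretely. The following hedged sketch uses our own toy function f(x) = (x_1 + x_2)^2, of the composite form g(Ex) with g(u) = u^2 and E = [1, 1]; it is semi-strongly convex with beta = 4 even though its Hessian is singular, so it is not strongly convex.

```python
import numpy as np

rng = np.random.default_rng(0)

def f(x):                      # f(x) = g(Ex) with g(u) = u^2 and E = [1, 1]
    return (x[0] + x[1]) ** 2

def proj_solution_set(x):      # X* = {x : x_1 + x_2 = 0}; orthogonal projection onto it
    return x - (x[0] + x[1]) / 2.0

f_min, beta = 0.0, 4.0         # (8) holds with equality here, so beta = 4 is the best constant
for _ in range(1000):
    x = rng.normal(size=2)
    dist_sq = float(np.sum((x - proj_solution_set(x)) ** 2))
    assert f(x) - f_min >= beta / 2.0 * dist_sq - 1e-9
```

Here the distance in (8) is measured to the whole solution set, which is a line rather than a single point; this is exactly the freedom that semi-strong convexity allows.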
A\nclass of functions that satisfy the semi-strongly convexity is provided below [Gong and Ye, 2014].\n\nExample 3. Consider the following constrained optimization problem\n\nmin\n\nx\u2208X \u2286Rd\n\nf (x) = g(Ex) + b\u22a4x\n\nwhere g(\u00b7) is strongly convex and smooth, and X is either Rd or a polyhedral set. Then, f : X 7\u2192 R\nis semi-strongly convex over X with some constant \u03b2 > 0.\nBased on the semi-strong convexity, we assume the functions satisfy the following conditions.\nAssumption 2. Suppose the following conditions hold for each ft : X 7\u2192 R.\n\n1. ft is semi-strongly convex over X with parameter \u03b2 > 0, and L-smooth;\n2. k\u2207ft(x)k \u2264 G, \u2200x \u2208 X .\n\nWhen the function is semi-strongly convex, the optimal solution may not be unique. Thus, we need\nto rede\ufb01ne P \u2217\n\nT to account for this freedom. We de\ufb01ne\nT\n\n\u03a0X \u2217\n\nt (x) \u2212 \u03a0X \u2217\n\nt\u22121(x)(cid:13)(cid:13)(cid:13)\n\n, and S \u2217\n\nT :=\n\nXt=2\n\nmax\n\nx\u2208X (cid:13)(cid:13)(cid:13)\n\n\u03a0X \u2217\n\nt (x) \u2212 \u03a0X \u2217\n\n2\n\nt\u22121(x)(cid:13)(cid:13)(cid:13)\n\nt = {x \u2208 X : ft(x) \u2264 minx\u2208X ft(x)} is the set of minimizers of ft over X .\n\nIn this case, we will use the standard online gradient descent when the learner can query the gradient\nonly once, and apply the online multiple gradient descent (OMGD) in Algorithm 1, when the learner\ncan access the gradient multiple times. Using similar analysis as Theorems 1 and 2, we obtain the\nfollowing dynamic regret bounds for functions that are semi-strongly convex and smooth.\nTheorem 6. Suppose Assumption 2 is true. 
By setting \u03b7 \u2264 1/L in online gradient descent, we have\n\nT\n\nT and S \u2217\nx\u2208X (cid:13)(cid:13)(cid:13)\nXt=2\n\nmax\n\nP \u2217\nT :=\nwhere X \u2217\n\nT\n\nT\n\nXt=1\n\nft(xt) \u2212\n\nXt=1\n1/\u03b7+\u03b2 , and \u00afx1 = \u03a0X \u2217\n\n1 (x1).\n\nmin\nx\u2208X\n\nft(x) \u2264\n\nT\n\nGP \u2217\n1 \u2212 \u03b3\n\n+\n\nGkx1 \u2212 \u00afx1k\n\n1 \u2212 \u03b3\n\nwhere \u03b3 = q1 \u2212 \u03b2\n\n6\n\n\fThus, online gradient descent still achieves an O(P \u2217\nT ) bound of the dynamic regret.\nTheorem 7. Suppose Assumption 2 is true. By setting \u03b7 \u2264 1/L and K = \u2308 1/\u03b7+\u03b2\nm 1, for any constant \u03b1 > 0, we have\n\n\u03b2\n\nln 4\u2309 in Algorith-\n\nT\n\nXt=1\n\nmin\nx\u2208X\n\nT\n\nft(xt) \u2212\n\nXt=1\nT = max{x\n\n\u2217\n\nt \u2208X \u2217\n\nt }T\n\nt=1 PT\n\nwhere G\u2217\n\n2GP \u2217\nG\u2217\nT\n2\u03b1\n\nft(x) \u2264 min\uf8f1\uf8f2\nT + 2Gkx1 \u2212 \u00afx1k\n+ 2(L + \u03b1)S \u2217\n\uf8f3\nt=1 k\u2207ft(x\u2217\nt )k2, and \u00afx1 = \u03a0X \u2217\n\nT + (L + \u03b1)kx1 \u2212 \u00afx1k2\n1 (x1).\n\nAgain, when the gradients of minimizers are small, in other words, G\u2217\nOMGD improves the dynamic regret form O(P \u2217\nT )).\n\nT ) to O(min(P \u2217\n\nT ,S \u2217\n\nT = O(S \u2217\n\nT ), the proposed\n\n3.3 Self-concordant Functions\n\nWe extend our previous results to self-concordant functions, which could be non-strongly convex\nand even non-smooth. Self-concordant functions play an important role in interior-point methods\nfor solving convex optimization problems. We note that in the study of bandit linear optimization\n[Abernethy et al., 2008b], self-concordant functions have been used as barriers for constraints. How-\never, to the best of our knowledge, this is the \ufb01rst time that losses themselves are self-concordant.\n\nThe de\ufb01nition of self-concordant functions is given below [Nemirovski, 2004].\n\nDe\ufb01nition 4. 
Let X be a nonempty open convex set in Rd and f be a C 3 convex function de\ufb01ned on\nX . f is called self-concordant on X , if it possesses the following two properties:\n\n1. f (xi) \u2192 \u221e along every sequence {xi \u2208 X} converging, as i \u2192 \u221e, to a boundary point\n2. f satis\ufb01es the differential inequality\n\nof X ;\n\n|D3f (x)[h, h, h]| \u2264 2(cid:0)h\u22a4\u22072f (x)h(cid:1)3/2\n\nfor all x \u2208 X and all h \u2208 Rd, where\n\u2202 3\n\nD3f (x)[h1, h2, h3] =\n\n\u2202t1\u2202t2\u2202t3 |t1=t2=t3=0f (x + t1h1 + t2h2 + t3h3) .\n\nExample\nexamples\n[Boyd and Vandenberghe, 2004, Nemirovski, 2004].\n\n4. We\n\nprovide\n\nsome\n\nof\n\nself-concordant\n\nfunctions\n\nbelow\n\n1. The function f (x) = \u2212 log x is self-concordant on (0,\u221e).\n2. A convex quadratic form f (x) = x\u22a4Ax\u2212 2b\u22a4x + c where A \u2208 Rd\u00d7d, b \u2208 Rd, and c \u2208 R,\n3. If f : Rd 7\u2192 R is self-concordant, and A \u2208 Rd\u00d7k, b \u2208 Rd, then f (Ax + b) is self-\n\nis self-concordant on Rd.\n\nconcordant.\n\nUsing the concept of self-concordance, we make the following assumptions.\nAssumption 3. Suppose the following conditions hold for each ft : Xt 7\u2192 R.\n\n1. ft is self-concordant on domain Xt;\n2. ft is non-degenerate on Xt, i.e., \u22072ft(x) \u227b 0, \u2200x \u2208 Xt;\n3. ft attains its minimum on Xt, and denote x\u2217\n\nt = argminx\u2208Xt ft(x).\n\nOur approach is similar to previous cases except for the updating rule of xt. Since we do not assume\nfunctions are strongly convex, we need to take into account the second order structure when updating\nthe current solution xt. Thus, we assume the learner can query both the gradient and Hessian of each\nfunction multiple times. 
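As a quick numeric sanity check of the differential inequality in Definition 4: for f(x) = -log x the bound holds with equality, using the standard closed forms f''(x) = 1/x^2 and f'''(x) = -2/x^3 (this sketch is our own illustration).

```python
# f(x) = -log(x) on (0, inf): f''(x) = 1/x^2 and f'''(x) = -2/x^3
def d2(x):
    return 1.0 / x ** 2

def d3(x):
    return -2.0 / x ** 3

for x in (0.1, 1.0, 7.5):
    for h in (-2.0, 0.5, 3.0):
        lhs = abs(d3(x) * h ** 3)              # |D^3 f(x)[h, h, h]| in one dimension
        rhs = 2.0 * (d2(x) * h * h) ** 1.5     # 2 (h^T f''(x) h)^{3/2}
        assert abs(lhs - rhs) <= 1e-9 * rhs    # equality: the constant 2 is tight for -log x
```

Both sides equal 2|h|^3 / x^3, which is why -log x is the canonical self-concordant barrier.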
Speci\ufb01cally, we apply the damped Newton method [Nemirovski, 2004] to\nupdate xt, as follows:\n\nz1\nt = xt,\n\nzj+1\nt = zj\n\nt \u2212\n\n1\n\nt ) h\u22072ft(zj\n\nt )i\u22121\n\n1 + \u03bbt(zj\n\nwhere\n\n\u03bbt(zj\n\nt ) = r\u2207ft(zj\n\nt )\u22a4h\u22072ft(zj\n\nt )i\u22121\n\n7\n\n\u2207ft(zj\n\nt ), j = 1, . . . , K\n\n\u2207ft(zj\nt ).\n\n(9)\n\n\fAlgorithm 2 Online Multiple Newton Update (OMNU)\nRequire: The number of inner iterations K in each round\n1: Let x1 be any point in X1\n2: for t = 1, . . . , T do\n3:\n4:\n5:\n6:\n\nSubmit xt \u2208 X and the receive loss ft : X 7\u2192 R\nz1\nt = xt\nfor j = 1, . . . , K do\n\nzj+1\nt = zj\n\nt \u2212\n\nwhere \u03bbt(zj\n\nt ) is given in (9)\n\nend for\nxt+1 = zK+1\n\n7:\n8:\n9: end for\n\nt\n\n1\n\n1 + \u03bbt(zj\n\nt ) h\u22072ft(zj\n\nt )i\u22121\n\n\u2207ft(zj\nt )\n\nThen, we set xt+1 = zK+1\n. Since the damped Newton method needs to calculate the inverse of the\nHessian matrix, its complexity is higher than gradient descent. The procedure is named as Online\nMultiple Newton Update (OMNU) and is summarized in Algorithm 2.\n\nt\n\nTo analyze the dynamic regret of OMNU, we rede\ufb01ne the two regularities P \u2217\nt )(x\u2217\n\nT\n\nT\n\nt \u2212 x\u2217\n\nt\u22121)\u22a4\u22072ft(x\u2217\n\nkx\u2217\n\nt \u2212 x\u2217\n\nt\u22121kt =\n\nP \u2217\nT :=\n\nT and S \u2217\nt \u2212 x\u2217\n\nt\u22121)\n\nT as follows:\n\nXt=2q(x\u2217\nXt=2\nt \u2212 x\u2217\n\n(x\u2217\n\nT\n\nt \u2212 x\u2217\n\nt\u22121k2\n\nt =\n\nt\u22121)\u22a4\u22072ft(x\u2217\n\nt )(x\u2217\n\nt \u2212 x\u2217\n\nt\u22121)\n\nXt=2\nXt=2\n\nT\n\nS \u2217\nT :=\n\nkx\u2217\nwhere khkt = ph\u22a4\u22072ft(x\u2217\n\nt )h. Compared to the de\ufb01nitions in (3) and (4), we introduce \u22072ft(x\u2217\nt )\nwhen measuring the distance between x\u2217\nt\u22121. When functions are strongly convex and smooth,\nthese de\ufb01nitions are equivalent up to constant factors. 
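The inner loop of OMNU can be sketched directly from the update rule and (9). This is a minimal illustration under our own assumptions: the separable loss f(x) = -sum_i log x_i + c^T x and its derivatives are a toy example, not from the paper.

```python
import numpy as np

def omnu_round(z, grad, hess, K):
    """One round of OMNU (Algorithm 2): K damped Newton steps on f_t starting from x_t."""
    for _ in range(K):
        g, H = grad(z), hess(z)
        step = np.linalg.solve(H, g)        # [grad^2 f_t(z)]^{-1} grad f_t(z), via a linear solve
        lam = np.sqrt(g @ step)             # Newton decrement lambda_t(z) from (9)
        z = z - step / (1.0 + lam)          # damping keeps the iterate inside the Dikin ellipsoid
    return z

# toy self-concordant loss: f(x) = -sum(log x_i) + c^T x on x > 0, minimized at x_i = 1/c_i
c = np.array([1.0, 2.0, 4.0])
grad = lambda x: -1.0 / x + c
hess = lambda x: np.diag(1.0 / x ** 2)
x_next = omnu_round(np.ones(3), grad, hess, K=30)
```

Note the 1/(1 + lambda) damping: unlike a full Newton step, it never leaves the domain where f_t is finite, which is what the feasibility argument after Theorem 8 relies on.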
We then de\ufb01ne a quantity to compare the\nsecond order structure of consecutive functions:\n\nt and x\u2217\n\n\u00b5 = max\n\nt=2,...,T n\u03bbmax(cid:16)(cid:2)\u22072ft\u22121(x\u2217\n\nt\u22121)(cid:3)\u22121/2\n\n\u22072ft(x\u2217\n\nt )(cid:2)\u22072ft\u22121(x\u2217\n\nt\u22121)(cid:3)\u22121/2(cid:17)o\n\n(10)\n\nwhere \u03bbmax(\u00b7) computes the maximum eigenvalue of its argument. When all the functions are \u03bb-\nstrongly convex and L-smooth, \u00b5 \u2264 L/\u03bb. Then, we have the following theorem regarding the\ndynamic regret of the proposed OMNU algorithm.\n\nTheorem 8. Suppose Assumption 3 is true, and further assume\n\nkx\u2217\n\nt\u22121 \u2212 x\u2217\ntk2\n\nt \u2264\nWhen t = 1, we choose K = O(1)(f1(x1) \u2212 f1(x\u2217\nkx2 \u2212 x\u2217\n1k2\n1 \u2264\nFor t \u2265 2, we set K = \u2308log4(16\u00b5)\u2309 in OMNU, then\n3P \u2217\n\nt ) \u2264 min(cid:18) 1\n\nft(xt) \u2212 ft(x\u2217\n\nT\n\nXt=1\n\n1\n144\n\n, \u2200t \u2265 2.\n\n1\n\n144\u00b5\n\n.\n\nT , 4S \u2217\n\nT(cid:19) + f1(x1) \u2212 f1(x\u2217\n\n1) +\n\n1\n36\n\n.\n\n1) + log log \u00b5) in OMNU such that\n\n(11)\n\n(12)\n\nT ,S \u2217\nThe above theorem again implies the dynamic regret can be upper bounded by O(min(P \u2217\nT ))\nwhen the learner can access the gradient and Hessian multiple times. From the \ufb01rst property of\nself-concordant functions in De\ufb01nition 4, we know that x\u2217\nt must lie in the interior of Xt, and thus\n\u2207ft(x\u2217\nt ) = 0 for all t \u2208 [T ]. As a result, we do not need the additional assumption that the gradients\n\nof minimizers are small, which has been used before to simplify Theorems 2 and 7.\n\nCompared to Theorems 2 and 7, Theorem 8 introduces an additional condition in (11). This condi-\ntion is required to ensure that xt lies in the feasible region of ft(\u00b7), otherwise, ft(xt) can be in\ufb01nity\n\n8\n\n\fand it is impossible to bound the dynamic regret. 
The multiple applications of damped Newton\nmethod can enforce xt to be close to x\u2217\nt\u22121. Combined with (11), we conclude that xt is also close\nto x\u2217\nt . Then, based on the property of the Dikin ellipsoid of self-concordant functions [Nemirovski,\n2004], we can guarantee that xt is feasible for ft(\u00b7).\n\n4 Conclusion and Future Work\n\nIn this paper, we discuss how to reduce the dynamic regret of online learning by allowing the learner\nto query the gradient/Hessian of each function multiple times. By applying gradient descent multiple\ntimes in each round, we show that the dynamic regret can be upper bounded by the minimum of the\npath-length and the squared path-length, when functions are strongly convex and smooth. We then\nextend this theoretical guarantee to functions that are semi-strongly convex and smooth. We \ufb01nally\ndemonstrate that for self-concordant functions, applying the damped Newton method multiple times\nachieves a similar result.\n\nIn the current study, we upper bound the dynamic regret in terms of the path-length or the squared\npath-length of the comparator sequence. As we mentioned before, there also exist some regularities\nde\ufb01ned in terms of the function sequence, e.g., the functional variation [Besbes et al., 2015]. In the\nfuture, we will investigate whether multiple accesses of gradient/Hessian can improve the dynamic\nregret when measured by certain regularities of the function sequence. Another future work is to\nextend our results to the more general dynamic regret\n\nR(u1, . . . , uT ) =\n\nT\n\nXt=1\n\nft(xt) \u2212\n\nT\n\nXt=1\n\nft(ut)\n\nwhere u1, . . . , uT \u2208 X is an arbitrary sequence of comparators [Zinkevich, 2003].\nAcknowledgments\n\nThis work was partially supported by the NSFC (61603177, 61333014), JiangsuSF (BK20160658),\nYESS (2017QNRC001), NSF (IIS-1545995), and the Collaborative Innovation Center of Novel Soft-\nware Technology and Industrialization. 
Jinfeng Yi is now at Tencent AI Lab, Bellevue, WA, USA.

References

J. Abernethy, P. L. Bartlett, A. Rakhlin, and A. Tewari. Optimal strategies and minimax lower bounds for online convex games. In Proceedings of the 21st Annual Conference on Learning Theory, 2008a.

J. Abernethy, E. Hazan, and A. Rakhlin. Competing in the dark: An efficient algorithm for bandit linear optimization. In Proceedings of the 21st Annual Conference on Learning Theory, pages 263–274, 2008b.

O. Besbes, Y. Gur, and A. Zeevi. Non-stationary stochastic optimization. Operations Research, 63(5):1227–1244, 2015.

S. Boyd and L. Vandenberghe. Convex Optimization. Cambridge University Press, 2004.

N. Buchbinder, S. Chen, J. S. Naor, and O. Shamir. Unified algorithms for online learning and competitive analysis. In Proceedings of the 25th Annual Conference on Learning Theory, 2012.

N. Cesa-Bianchi, P. Gaillard, G. Lugosi, and G. Stoltz. Mirror descent meets fixed share (and feels no regret). In Advances in Neural Information Processing Systems 25, pages 980–988, 2012.

C.-K. Chiang, T. Yang, C.-J. Lee, M. Mahdavi, C.-J. Lu, R. Jin, and S. Zhu. Online optimization with gradual variations. In Proceedings of the 25th Annual Conference on Learning Theory, 2012.

A. Daniely, A. Gonen, and S. Shalev-Shwartz. Strongly adaptive online learning. In Proceedings of the 32nd International Conference on Machine Learning, 2015.

P. Gong and J. Ye. Linear convergence of variance-reduced stochastic gradient without strong convexity. ArXiv e-prints, arXiv:1406.1102, 2014.

E. C. Hall and R. M. Willett. Dynamical models and tracking regret in online convex programming. In Proceedings of the 30th International Conference on Machine Learning, pages 579–587, 2013.

E. Hazan and S. Kale. Beyond the regret minimization barrier: an optimal algorithm for stochastic strongly-convex optimization. In Proceedings of the 24th Annual Conference on Learning Theory, pages 421–436, 2011.

E. Hazan and C. Seshadhri. Adaptive algorithms for online decision problems. Electronic Colloquium on Computational Complexity, 88, 2007.

E. Hazan, A. Agarwal, and S. Kale. Logarithmic regret algorithms for online convex optimization. Machine Learning, 69(2-3):169–192, 2007.

M. Herbster and M. K. Warmuth. Tracking the best expert. Machine Learning, 32(2):151–178, 1998.

A. Jadbabaie, A. Rakhlin, S. Shahrampour, and K. Sridharan. Online optimization: Competing with dynamic comparators. In Proceedings of the 18th International Conference on Artificial Intelligence and Statistics, 2015.

A. Mokhtari, S. Shahrampour, A. Jadbabaie, and A. Ribeiro. Online optimization in dynamic environments: Improved regret rates for strongly convex problems. ArXiv e-prints, arXiv:1603.04954, 2016.

I. Necoara, Y. Nesterov, and F. Glineur. Linear convergence of first order methods for non-strongly convex optimization. ArXiv e-prints, arXiv:1504.06298, 2015.

A. Nemirovski. Interior point polynomial time methods in convex programming. Lecture notes, Technion – Israel Institute of Technology, 2004.

Y. Nesterov. Introductory Lectures on Convex Optimization: A Basic Course, volume 87 of Applied Optimization. Kluwer Academic Publishers, 2004.

S. Rakhlin and K. Sridharan. Optimization, learning, and games with predictable sequences. In Advances in Neural Information Processing Systems 26, pages 3066–3074, 2013.

S. Shalev-Shwartz. Online learning and online convex optimization. Foundations and Trends in Machine Learning, 4(2):107–194, 2011.

S. Shalev-Shwartz, Y. Singer, and N. Srebro. Pegasos: primal estimated sub-gradient solver for SVM. In Proceedings of the 24th International Conference on Machine Learning, pages 807–814, 2007.

P.-W. Wang and C.-J. Lin. Iteration complexity of feasible descent methods for convex optimization. Journal of Machine Learning Research, 15:1523–1548, 2014.

T. Yang, L. Zhang, R. Jin, and J. Yi. Tracking slowly moving clairvoyant: Optimal dynamic regret of online learning with true and noisy gradient. In Proceedings of the 33rd International Conference on Machine Learning, 2016.

M. Zinkevich. Online convex programming and generalized infinitesimal gradient ascent. In Proceedings of the 20th International Conference on Machine Learning, pages 928–936, 2003.