{"title": "Adaptive Online Learning in Dynamic Environments", "book": "Advances in Neural Information Processing Systems", "page_first": 1323, "page_last": 1333, "abstract": "In this paper, we study online convex optimization in dynamic environments, and aim to bound the dynamic regret with respect to any sequence of comparators. Existing work has shown that online gradient descent enjoys an $O(\\sqrt{T}(1+P_T))$ dynamic regret, where $T$ is the number of iterations and $P_T$ is the path-length of the comparator sequence. However, this result is unsatisfactory, as there exists a large gap from the $\\Omega(\\sqrt{T(1+P_T)})$ lower bound established in our paper. To address this limitation, we develop a novel online method, namely adaptive learning for dynamic environment (Ader), which achieves an optimal $O(\\sqrt{T(1+P_T)})$ dynamic regret. The basic idea is to maintain a set of experts, each attaining an optimal dynamic regret for a specific path-length, and combine them with an expert-tracking algorithm. Furthermore, we propose an improved Ader based on the surrogate loss, and in this way the number of gradient evaluations per round is reduced from $O(\\log T)$ to $1$. Finally, we extend Ader to the setting that a sequence of dynamical models is available to characterize the comparators.", "full_text": "Adaptive Online Learning in Dynamic Environments\n\nLijun Zhang, Shiyin Lu, Zhi-Hua Zhou\n\nNational Key Laboratory for Novel Software Technology\n\nNanjing University, Nanjing 210023, China\n\n{zhanglj, lusy, zhouzh}@lamda.nju.edu.cn\n\nAbstract\n\nIn this paper, we study online convex optimization in dynamic environments, and aim to bound the dynamic regret with respect to any sequence of comparators. Existing work has shown that online gradient descent enjoys an $O(\\sqrt{T}(1+P_T))$ dynamic regret, where $T$ is the number of iterations and $P_T$ is the path-length of the comparator sequence. 
However, this result is unsatisfactory, as there exists a large gap from the $\\Omega(\\sqrt{T(1+P_T)})$ lower bound established in our paper. To address this limitation, we develop a novel online method, namely adaptive learning for dynamic environment (Ader), which achieves an optimal $O(\\sqrt{T(1+P_T)})$ dynamic regret. The basic idea is to maintain a set of experts, each attaining an optimal dynamic regret for a specific path-length, and combine them with an expert-tracking algorithm. Furthermore, we propose an improved Ader based on the surrogate loss, and in this way the number of gradient evaluations per round is reduced from $O(\\log T)$ to $1$. Finally, we extend Ader to the setting that a sequence of dynamical models is available to characterize the comparators.\n\n1 Introduction\n\nOnline convex optimization (OCO) has become a popular learning framework for modeling various real-world problems, such as online routing, ad selection for search engines and spam filtering [Hazan, 2016]. The protocol of OCO is as follows: At iteration $t$, the online learner chooses $x_t$ from a convex set $\\mathcal{X}$. After the learner has committed to this choice, a convex cost function $f_t : \\mathcal{X} \\mapsto \\mathbb{R}$ is revealed. Then, the learner suffers an instantaneous loss $f_t(x_t)$, and the goal is to minimize the cumulative loss over $T$ iterations. The standard performance measure of OCO is regret:\n\n$$\\sum_{t=1}^T f_t(x_t) - \\min_{x \\in \\mathcal{X}} \\sum_{t=1}^T f_t(x) \\tag{1}$$\n\nwhich is the cumulative loss of the learner minus that of the best constant point chosen in hindsight. The notion of regret has been extensively studied, and there exist plenty of algorithms and theories for minimizing regret [Shalev-Shwartz et al., 2007, Hazan et al., 2007, Srebro et al., 2010, Duchi et al., 2011, Shalev-Shwartz, 2011, Zhang et al., 2013]. 
However, when the environment is changing, the traditional regret is no longer a suitable measure, since it compares the learner against a static point. To address this limitation, recent advances in online learning have introduced an enhanced measure, dynamic regret, which has received considerable research interest over the years [Hall and Willett, 2013, Jadbabaie et al., 2015, Mokhtari et al., 2016, Yang et al., 2016, Zhang et al., 2017].\n\n32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montr\u00e9al, Canada.\n\nIn the literature, there are two different forms of dynamic regret. The general one is introduced by Zinkevich [2003], who proposes to compare the cumulative loss of the learner against any sequence of comparators\n\n$$R(u_1, \\ldots, u_T) = \\sum_{t=1}^T f_t(x_t) - \\sum_{t=1}^T f_t(u_t) \\tag{2}$$\n\nwhere $u_1, \\ldots, u_T \\in \\mathcal{X}$. Instead of following the definition in (2), most existing studies on dynamic regret consider a restricted form, in which the sequence of comparators consists of local minimizers of online functions [Besbes et al., 2015], i.e.,\n\n$$R(x_1^*, \\ldots, x_T^*) = \\sum_{t=1}^T f_t(x_t) - \\sum_{t=1}^T f_t(x_t^*) = \\sum_{t=1}^T f_t(x_t) - \\sum_{t=1}^T \\min_{x \\in \\mathcal{X}} f_t(x) \\tag{3}$$\n\nwhere $x_t^* \\in \\operatorname{argmin}_{x \\in \\mathcal{X}} f_t(x)$ is a minimizer of $f_t(\\cdot)$ over domain $\\mathcal{X}$. Note that although $R(u_1, \\ldots, u_T) \\le R(x_1^*, \\ldots, x_T^*)$, it does not mean the notion of $R(x_1^*, \\ldots, x_T^*)$ is stronger, because an upper bound for $R(x_1^*, \\ldots, x_T^*)$ could be very loose for $R(u_1, \\ldots, u_T)$.\n\nThe general dynamic regret in (2) includes the static regret in (1) and the restricted dynamic regret in (3) as special cases. Thus, minimizing the general dynamic regret can automatically adapt to the nature of environments, stationary or dynamic. 
In contrast, the restricted dynamic regret is too pessimistic, and unsuitable for stationary problems. For example, it is meaningless to the problem of statistical machine learning, where $f_t$\u2019s are sampled independently from the same distribution. Due to the random perturbation caused by sampling, the local minimizers could differ significantly from the global minimizer of the expected loss. In this case, minimizing (3) will lead to overfitting.\n\nBecause of its flexibility, we focus on the general dynamic regret in this paper. Bounding the general dynamic regret is very challenging, because we need to establish a universal guarantee that holds for any sequence of comparators. By comparison, when bounding the restricted dynamic regret, we only need to focus on the local minimizers. Till now, we have very limited knowledge on the general dynamic regret. One result is given by Zinkevich [2003], who demonstrates that online gradient descent (OGD) achieves the following dynamic regret bound\n\n$$R(u_1, \\ldots, u_T) = O\\left(\\sqrt{T}(1 + P_T)\\right) \\tag{4}$$\n\nwhere $P_T$, defined in (5), is the path-length of $u_1, \\ldots, u_T$.\n\nHowever, the linear dependence on $P_T$ in (4) is too loose, and there is a large gap between the upper bound and the $\\Omega(\\sqrt{T(1+P_T)})$ lower bound established in our paper. To address this limitation, we propose a novel online method, namely adaptive learning for dynamic environment (Ader), which attains an $O(\\sqrt{T(1+P_T)})$ dynamic regret. Ader follows the framework of learning with expert advice [Cesa-Bianchi and Lugosi, 2006], and is inspired by the strategy of maintaining multiple learning rates in MetaGrad [van Erven and Koolen, 2016]. The basic idea is to run multiple OGD algorithms in parallel, each with a different step size that is optimal for a specific path-length, and combine them with an expert-tracking algorithm. 
While the basic version of Ader needs to query the gradient $O(\\log T)$ times in each round, we develop an improved version based on surrogate loss and reduce the number of gradient evaluations to $1$. Finally, we provide extensions of Ader to the case that a sequence of dynamical models is given, and obtain tighter bounds when the comparator sequence follows the dynamical models closely.\n\nThe contributions of this paper are summarized below.\n\n\u2022 We establish the first lower bound for the general dynamic regret in (2), which is $\\Omega(\\sqrt{T(1+P_T)})$.\n\u2022 We develop a series of novel methods for minimizing the general dynamic regret, and prove an optimal $O(\\sqrt{T(1+P_T)})$ upper bound.\n\u2022 Compared to existing work for the restricted dynamic regret in (3), our result is universal in the sense that the regret bound holds for any sequence of comparators.\n\u2022 Our result is also adaptive because the upper bound depends on the path-length of the comparator sequence, so it automatically becomes small when comparators change slowly.\n\n2 Related Work\n\nIn this section, we provide a brief review of related work in online convex optimization.\n\n2.1 Static Regret\n\nIn the static setting, online gradient descent (OGD) achieves an $O(\\sqrt{T})$ regret bound for general convex functions. If the online functions have additional curvature properties, then faster rates are attainable. For strongly convex functions, the regret bound of OGD becomes $O(\\log T)$ [Shalev-Shwartz et al., 2007]. The $O(\\sqrt{T})$ and $O(\\log T)$ regret bounds, for convex and strongly convex functions respectively, are known to be minimax optimal [Abernethy et al., 2008]. For exponentially concave functions, Online Newton Step (ONS) enjoys an $O(d \\log T)$ regret, where $d$ is the dimensionality [Hazan et al., 2007]. 
When the online functions are both smooth and convex, the regret bound could also be improved if the cumulative loss of the optimal prediction is small [Srebro et al., 2010].\n\n2.2 Dynamic Regret\n\nTo the best of our knowledge, there are only two studies that investigate the general dynamic regret [Zinkevich, 2003, Hall and Willett, 2013]. While it is impossible to achieve a sublinear dynamic regret in general, we can bound the dynamic regret in terms of certain regularity of the comparator sequence or the function sequence. Zinkevich [2003] introduces the path-length\n\n$$P_T(u_1, \\ldots, u_T) = \\sum_{t=2}^T \\|u_t - u_{t-1}\\|_2 \\tag{5}$$\n\nand provides an upper bound for OGD in (4). In a subsequent work, Hall and Willett [2013] propose a variant of path-length\n\n$$P_T'(u_1, \\ldots, u_T) = \\sum_{t=1}^T \\|u_{t+1} - \\Phi_t(u_t)\\|_2 \\tag{6}$$\n\nin which a sequence of dynamical models $\\Phi_t(\\cdot) : \\mathcal{X} \\mapsto \\mathcal{X}$ is incorporated. Then, they develop a new method, dynamic mirror descent, which achieves an $O(\\sqrt{T}(1 + P_T'))$ dynamic regret. When the comparator sequence follows the dynamical models closely, $P_T'$ could be much smaller than $P_T$, and thus the upper bound of Hall and Willett [2013] could be tighter than that of Zinkevich [2003].\n\nFor the restricted dynamic regret, a powerful baseline, which simply plays the minimizer of previous round, i.e., $x_{t+1} = \\operatorname{argmin}_{x \\in \\mathcal{X}} f_t(x)$, attains an $O(P_T^*)$ dynamic regret [Yang et al., 2016], where\n\n$$P_T^* := P_T(x_1^*, \\ldots, x_T^*) = \\sum_{t=2}^T \\|x_t^* - x_{t-1}^*\\|_2.$$\n\nOGD also achieves the $O(P_T^*)$ dynamic regret, when the online functions are strongly convex and smooth [Mokhtari et al., 2016], or when they are convex and smooth and all the minimizers lie in the interior of $\\mathcal{X}$ [Yang et al., 2016]. 
Another regularity of the comparator sequence is the squared path-length\n\n$$S_T^* := S_T(x_1^*, \\ldots, x_T^*) = \\sum_{t=2}^T \\|x_t^* - x_{t-1}^*\\|_2^2$$\n\nwhich could be smaller than the path-length $P_T^*$ when local minimizers move slowly. Zhang et al. [2017] propose online multiple gradient descent, and establish an $O(\\min(P_T^*, S_T^*))$ regret bound for (semi-)strongly convex and smooth functions.\n\nIn a recent work, Besbes et al. [2015] introduce the functional variation\n\n$$F_T := F(f_1, \\ldots, f_T) = \\sum_{t=2}^T \\max_{x \\in \\mathcal{X}} |f_t(x) - f_{t-1}(x)|$$\n\nto measure the complexity of the function sequence. Under the assumption that an upper bound $V_T \\ge F_T$ is known beforehand, Besbes et al. [2015] develop a restarted online gradient descent, and prove its dynamic regret is upper bounded by $O(T^{2/3}(V_T + 1)^{1/3})$ and $O(\\log T \\sqrt{T(V_T + 1)})$ for convex functions and strongly convex functions, respectively. One limitation of this work is that the bounds are not adaptive because they depend on the upper bound $V_T$. So, even when the actual functional variation $F_T$ is small, the regret bounds do not become better.\n\nOne regularity that involves the gradient of functions is\n\n$$D_T = \\sum_{t=1}^T \\|\\nabla f_t(x_t) - m_t\\|_2^2$$\n\nwhere $m_1, \\ldots, m_T$ is a predictable sequence computable by the learner [Chiang et al., 2012, Rakhlin and Sridharan, 2013]. From the above discussions, we observe that there are different types of regularities. As shown by Jadbabaie et al. [2015], these regularities reflect distinct aspects of the online problem, and are not comparable in general. To take advantage of the smaller regularity, Jadbabaie et al. [2015] develop an adaptive method whose dynamic regret is on the order of $\\sqrt{D_T + 1} + \\min\\{\\sqrt{(D_T + 1) P_T^*}, (D_T + 1)^{1/3} T^{1/3} F_T^{1/3}\\}$. However, it relies on the assumption that the learner can calculate each regularity online.\n\n2.3 Adaptive Regret\n\nAnother way to deal with changing environments is to minimize the adaptive regret, which is defined as maximum static regret over any contiguous time interval [Hazan and Seshadhri, 2007]. For convex functions and exponentially concave functions, Hazan and Seshadhri [2007] have developed efficient algorithms that achieve $O(\\sqrt{T \\log^3 T})$ and $O(d \\log^2 T)$ adaptive regrets, respectively. Later, the adaptive regret of convex functions is improved [Daniely et al., 2015, Jun et al., 2017]. The relation between adaptive regret and restricted dynamic regret is investigated by Zhang et al. [2018b].\n\n3 Our Methods\n\nWe first state assumptions about the online problem, then provide our motivations, including a lower bound of the general dynamic regret, and finally present the proposed methods as well as their theoretical guarantees. All the proofs can be found in the full paper [Zhang et al., 2018a].\n\n3.1 Assumptions\n\nSimilar to previous studies in online learning, we introduce the following common assumptions.\n\nAssumption 1 On domain $\\mathcal{X}$, the values of all functions belong to the range $[a, a + c]$, i.e.,\n\n$$a \\le f_t(x) \\le a + c, \\ \\forall x \\in \\mathcal{X}, \\text{ and } t \\in [T].$$\n\nAssumption 2 The gradients of all functions are bounded by $G$, i.e.,\n\n$$\\max_{x \\in \\mathcal{X}} \\|\\nabla f_t(x)\\|_2 \\le G, \\ \\forall t \\in [T]. \\tag{7}$$\n\nAssumption 3 The domain $\\mathcal{X}$ contains the origin $0$, and its diameter is bounded by $D$, i.e.,\n\n$$\\max_{x, x' \\in \\mathcal{X}} \\|x - x'\\|_2 \\le D. \\tag{8}$$\n\nNote that Assumptions 2 and 3 imply Assumption 1 with any $c \\ge GD$. In the following, we assume the values of $G$ and $D$ are known to the learner.\n\n3.2 Motivations\n\nAccording to Theorem 2 of Zinkevich [2003], we have the following dynamic regret bound for online gradient descent (OGD) with a constant step size.\n\nTheorem 1 Consider the online gradient descent (OGD) with $x_1 \\in \\mathcal{X}$ and\n\n$$x_{t+1} = \\Pi_{\\mathcal{X}}\\left[x_t - \\eta \\nabla f_t(x_t)\\right], \\ \\forall t \\ge 1$$\n\nwhere $\\Pi_{\\mathcal{X}}[\\cdot]$ denotes the projection onto the nearest point in $\\mathcal{X}$. Under Assumptions 2 and 3, OGD satisfies\n\n$$\\sum_{t=1}^T f_t(x_t) - \\sum_{t=1}^T f_t(u_t) \\le \\frac{7D^2}{4\\eta} + \\frac{D}{\\eta} \\sum_{t=2}^T \\|u_{t-1} - u_t\\|_2 + \\frac{\\eta T G^2}{2}$$\n\nfor any comparator sequence $u_1, \\ldots, u_T \\in \\mathcal{X}$.\n\nThus, by choosing $\\eta = O(1/\\sqrt{T})$, OGD achieves an $O(\\sqrt{T}(1 + P_T))$ dynamic regret, that is universal. However, this upper bound is far from the $\\Omega(\\sqrt{T(1+P_T)})$ lower bound indicated by the theorem below.\n\nTheorem 2 For any online algorithm and any $\\tau \\in [0, TD]$, there exists a sequence of comparators $u_1, \\ldots, u_T$ satisfying Assumption 3 and a sequence of functions $f_1, \\ldots, f_T$ satisfying Assumption 2, such that\n\n$$P_T(u_1, \\ldots, u_T) \\le \\tau \\ \\text{ and } \\ R(u_1, \\ldots, u_T) = \\Omega\\left(G \\sqrt{T(D^2 + D\\tau)}\\right).$$\n\nAlthough there exist lower bounds for the restricted dynamic regret [Besbes et al., 2015, Yang et al., 2016], to the best of our knowledge, this is the first lower bound for the general dynamic regret.\n\nLet\u2019s drop the universal property for the moment, and suppose we only want to compare against a specific sequence $\\bar{u}_1, \\ldots, \\bar{u}_T \\in \\mathcal{X}$ whose path-length $\\bar{P}_T = \\sum_{t=2}^T \\|\\bar{u}_t - \\bar{u}_{t-1}\\|_2$ is known beforehand. In this simple setting, we can tune the step size optimally as $\\eta^* = O(\\sqrt{(1 + \\bar{P}_T)/T})$ and obtain an improved $O(\\sqrt{T(1 + \\bar{P}_T)})$ dynamic regret bound, which matches the lower bound in Theorem 2. Thus, when bounding the general dynamic regret, we face the following challenge: On one hand, we want the regret bound to hold for any sequence of comparators, but on the other hand, to get a tighter bound, we need to tune the step size for a specific path-length. In the next section, we address this dilemma by running multiple OGD algorithms with different step sizes, and combining them through a meta-algorithm.\n\n3.3 The Basic Approach\n\nOur proposed method, named as adaptive learning for dynamic environment (Ader), is inspired by a recent work for online learning with multiple types of functions, MetaGrad [van Erven and Koolen, 2016]. Ader maintains a set of experts, each attaining an optimal dynamic regret for a different path-length, and chooses the best one using an expert-tracking algorithm.\n\nMeta-algorithm Tracking the best expert is a well-studied problem [Herbster and Warmuth, 1998], and our meta-algorithm, summarized in Algorithm 1, is built upon the exponentially weighted average forecaster [Cesa-Bianchi and Lugosi, 2006]. The inputs of the meta-algorithm are its own step size $\\alpha$, and a set $\\mathcal{H}$ of step sizes for experts. In Step 1, we activate a set of experts $\\{E^\\eta \\mid \\eta \\in \\mathcal{H}\\}$ by invoking the expert-algorithm for each $\\eta \\in \\mathcal{H}$. In Step 2, we set the initial weight of each expert. Let $\\eta_i$ be the $i$-th smallest step size in $\\mathcal{H}$. 
The weight of $E^{\\eta_i}$ is chosen as\n\n$$w_1^{\\eta_i} = \\frac{C}{i(i+1)}, \\ \\text{ and } \\ C = 1 + \\frac{1}{|\\mathcal{H}|}. \\tag{9}$$\n\nIn each round, the meta-algorithm receives a set of predictions $\\{x_t^\\eta \\mid \\eta \\in \\mathcal{H}\\}$ from all experts (Step 4), and outputs the weighted average (Step 5):\n\n$$x_t = \\sum_{\\eta \\in \\mathcal{H}} w_t^\\eta x_t^\\eta$$\n\nwhere $w_t^\\eta$ is the weight assigned to expert $E^\\eta$. After observing the loss function, the weights of experts are updated according to the exponential weighting scheme (Step 7):\n\n$$w_{t+1}^\\eta = \\frac{w_t^\\eta e^{-\\alpha f_t(x_t^\\eta)}}{\\sum_{\\mu \\in \\mathcal{H}} w_t^\\mu e^{-\\alpha f_t(x_t^\\mu)}}.$$\n\nIn the last step, we send the gradient $\\nabla f_t(x_t^\\eta)$ to each expert $E^\\eta$ so that they can update their own predictions.\n\nExpert-algorithm Experts are themselves algorithms, and our expert-algorithm, presented in Algorithm 2, is the standard online gradient descent (OGD). Each expert is an instance of OGD, and takes the step size $\\eta$ as its input. In Step 3 of Algorithm 2, each expert submits its prediction $x_t^\\eta$ to the meta-algorithm, and receives the gradient $\\nabla f_t(x_t^\\eta)$ in Step 4. Then, in Step 5 it performs gradient descent\n\n$$x_{t+1}^\\eta = \\Pi_{\\mathcal{X}}\\left[x_t^\\eta - \\eta \\nabla f_t(x_t^\\eta)\\right]$$\n\nto get the prediction for the next round.\n\nAlgorithm 1 Ader: Meta-algorithm\nRequire: A step size $\\alpha$, and a set $\\mathcal{H}$ containing step sizes for experts\n1: Activate a set of experts $\\{E^\\eta \\mid \\eta \\in \\mathcal{H}\\}$ by invoking Algorithm 2 for each step size $\\eta \\in \\mathcal{H}$\n2: Sort step sizes in ascending order $\\eta_1 \\le \\eta_2 \\le \\cdots \\le \\eta_N$, and set $w_1^{\\eta_i} = \\frac{C}{i(i+1)}$\n3: for $t = 1, \\ldots, T$ do\n4:   Receive $x_t^\\eta$ from each expert $E^\\eta$\n5:   Output $x_t = \\sum_{\\eta \\in \\mathcal{H}} w_t^\\eta x_t^\\eta$\n6:   Observe the loss function $f_t(\\cdot)$\n7:   Update the weight of each expert by $w_{t+1}^\\eta = \\frac{w_t^\\eta e^{-\\alpha f_t(x_t^\\eta)}}{\\sum_{\\mu \\in \\mathcal{H}} w_t^\\mu e^{-\\alpha f_t(x_t^\\mu)}}$\n8:   Send gradient $\\nabla f_t(x_t^\\eta)$ to each expert $E^\\eta$\n9: end for\n\nAlgorithm 2 Ader: Expert-algorithm\nRequire: The step size $\\eta$\n1: Let $x_1^\\eta$ be any point in $\\mathcal{X}$\n2: for $t = 1, \\ldots, T$ do\n3:   Submit $x_t^\\eta$ to the meta-algorithm\n4:   Receive gradient $\\nabla f_t(x_t^\\eta)$ from the meta-algorithm\n5:   $x_{t+1}^\\eta = \\Pi_{\\mathcal{X}}\\left[x_t^\\eta - \\eta \\nabla f_t(x_t^\\eta)\\right]$\n6: end for\n\nNext, we specify the parameter setting and our dynamic regret. The set $\\mathcal{H}$ is constructed in the way such that for any possible sequence of comparators, there exists a step size that is nearly optimal. To control the size of $\\mathcal{H}$, we use a geometric series with ratio $2$. The value of $\\alpha$ is tuned such that the upper bound is minimized. Specifically, we have the following theorem.\n\nTheorem 3 Set\n\n$$\\mathcal{H} = \\left\\{ \\eta_i = \\frac{2^{i-1} D}{G} \\sqrt{\\frac{7}{2T}} \\;\\middle|\\; i = 1, \\ldots, N \\right\\} \\tag{10}$$\n\nwhere $N = \\left\\lceil \\frac{1}{2} \\log_2(1 + 4T/7) \\right\\rceil + 1$, and $\\alpha = \\sqrt{8/(T c^2)}$ in Algorithm 1. Under Assumptions 1, 2 and 3, for any comparator sequence $u_1, \\ldots, u_T \\in \\mathcal{X}$, our proposed Ader method satisfies\n\n$$\\sum_{t=1}^T f_t(x_t) - \\sum_{t=1}^T f_t(u_t) \\le \\frac{3G}{4} \\sqrt{2T(7D^2 + 4D P_T)} + \\frac{c \\sqrt{2T}}{4} \\left[1 + 2 \\ln(k + 1)\\right] = O\\left(\\sqrt{T(1 + P_T)}\\right)$$\n\nwhere\n\n$$k = \\left\\lfloor \\frac{1}{2} \\log_2\\left(1 + \\frac{4 P_T}{7D}\\right) \\right\\rfloor + 1. \\tag{11}$$\n\nThe order of the upper bound matches the $\\Omega(\\sqrt{T(1+P_T)})$ lower bound in Theorem 2 exactly.\n\n3.4 An Improved Approach\n\nThe basic approach in Section 3.3 is simple, but it has an obvious limitation: From Steps 7 and 8 in Algorithm 1, we observe that the meta-algorithm needs to query the value and gradient of $f_t(\\cdot)$ $N$ times in each round, where $N = O(\\log T)$. In contrast, existing algorithms for minimizing static regret, such as OGD, only query the gradient once per iteration. When the function is complex, the evaluation of gradients or values could be expensive, and it is appealing to reduce the number of queries in each round.\n\nSurrogate Loss We introduce surrogate loss [van Erven and Koolen, 2016] to replace the original loss function. From the first-order condition of convexity [Boyd and Vandenberghe, 2004], we have\n\n$$f_t(x) \\ge f_t(x_t) + \\langle \\nabla f_t(x_t), x - x_t \\rangle, \\ \\forall x \\in \\mathcal{X}.$$\n\nThen, we define the surrogate loss in the $t$-th iteration as\n\n$$\\ell_t(x) = \\langle \\nabla f_t(x_t), x - x_t \\rangle \\tag{12}$$\n\nand use it to update the prediction. Because\n\n$$f_t(x_t) - f_t(u_t) \\le \\ell_t(x_t) - \\ell_t(u_t), \\tag{13}$$\n\nwe conclude that the regret w.r.t. true losses $f_t$\u2019s is no larger than that w.r.t. surrogate losses $\\ell_t$\u2019s. Thus, it is safe to replace $f_t$ with $\\ell_t$. The new method, named as improved Ader, is summarized in Algorithms 3 and 4.\n\nMeta-algorithm The new meta-algorithm in Algorithm 3 differs from the old one in Algorithm 1 from Step 6 onward. The new algorithm queries the gradient of $f_t(\\cdot)$ at $x_t$, and then constructs the surrogate loss $\\ell_t(\\cdot)$ in (12), which is used in subsequent steps. 
In Step 8, the weights of experts are updated based on $\\ell_t(\\cdot)$, i.e.,\n\n$$w_{t+1}^\\eta = \\frac{w_t^\\eta e^{-\\alpha \\ell_t(x_t^\\eta)}}{\\sum_{\\mu \\in \\mathcal{H}} w_t^\\mu e^{-\\alpha \\ell_t(x_t^\\mu)}}.$$\n\nIn Step 9, the gradient of $\\ell_t(\\cdot)$ at $x_t^\\eta$ is sent to each expert $E^\\eta$. Because the surrogate loss is linear,\n\n$$\\nabla \\ell_t(x_t^\\eta) = \\nabla f_t(x_t), \\ \\forall \\eta \\in \\mathcal{H}.$$\n\nAs a result, we only need to send the same $\\nabla f_t(x_t)$ to all experts. From the above descriptions, it is clear that the new algorithm only queries the gradient once in each iteration.\n\nExpert-algorithm The new expert-algorithm in Algorithm 4 is almost the same as the previous one in Algorithm 2. The only difference is that in Step 4, the expert receives the gradient $\\nabla f_t(x_t)$, and uses it to perform gradient descent\n\n$$x_{t+1}^\\eta = \\Pi_{\\mathcal{X}}\\left[x_t^\\eta - \\eta \\nabla f_t(x_t)\\right]$$\n\nin Step 5.\n\nWe have the following theorem to bound the dynamic regret of the improved Ader.\n\nTheorem 4 Use the construction of $\\mathcal{H}$ in (10), and set $\\alpha = \\sqrt{2/(T G^2 D^2)}$ in Algorithm 3. Under Assumptions 2 and 3, for any comparator sequence $u_1, \\ldots, u_T \\in \\mathcal{X}$, our improved Ader method satisfies\n\n$$\\sum_{t=1}^T f_t(x_t) - \\sum_{t=1}^T f_t(u_t) \\le \\frac{3G}{4} \\sqrt{2T(7D^2 + 4D P_T)} + \\frac{GD \\sqrt{2T}}{2} \\left[1 + 2 \\ln(k + 1)\\right] = O\\left(\\sqrt{T(1 + P_T)}\\right)$$\n\nwhere $k$ is defined in (11).\n\nSimilar to the basic approach, the improved Ader also achieves an $O(\\sqrt{T(1+P_T)})$ dynamic regret, that is universal and adaptive. 
The main advantage is that the improved Ader only needs to query the gradient of the online function once in each iteration.\n\nAlgorithm 3 Improved Ader: Meta-algorithm\nRequire: A step size $\\alpha$, and a set $\\mathcal{H}$ containing step sizes for experts\n1: Activate a set of experts $\\{E^\\eta \\mid \\eta \\in \\mathcal{H}\\}$ by invoking Algorithm 4 for each step size $\\eta \\in \\mathcal{H}$\n2: Sort step sizes in ascending order $\\eta_1 \\le \\eta_2 \\le \\cdots \\le \\eta_N$, and set $w_1^{\\eta_i} = \\frac{C}{i(i+1)}$\n3: for $t = 1, \\ldots, T$ do\n4:   Receive $x_t^\\eta$ from each expert $E^\\eta$\n5:   Output $x_t = \\sum_{\\eta \\in \\mathcal{H}} w_t^\\eta x_t^\\eta$\n6:   Query the gradient of $f_t(\\cdot)$ at $x_t$\n7:   Construct the surrogate loss $\\ell_t(\\cdot)$ in (12)\n8:   Update the weight of each expert by $w_{t+1}^\\eta = \\frac{w_t^\\eta e^{-\\alpha \\ell_t(x_t^\\eta)}}{\\sum_{\\mu \\in \\mathcal{H}} w_t^\\mu e^{-\\alpha \\ell_t(x_t^\\mu)}}$\n9:   Send gradient $\\nabla f_t(x_t)$ to each expert $E^\\eta$\n10: end for\n\nAlgorithm 4 Improved Ader: Expert-algorithm\nRequire: The step size $\\eta$\n1: Let $x_1^\\eta$ be any point in $\\mathcal{X}$\n2: for $t = 1, \\ldots, T$ do\n3:   Submit $x_t^\\eta$ to the meta-algorithm\n4:   Receive gradient $\\nabla f_t(x_t)$ from the meta-algorithm\n5:   $x_{t+1}^\\eta = \\Pi_{\\mathcal{X}}\\left[x_t^\\eta - \\eta \\nabla f_t(x_t)\\right]$\n6: end for\n\n3.5 Extensions\n\nFollowing Hall and Willett [2013], we consider the case that the learner is given a sequence of dynamical models $\\Phi_t(\\cdot) : \\mathcal{X} \\mapsto \\mathcal{X}$, which can be used to characterize the comparators we are interested in. 
Similar to Hall and Willett [2013], we assume each $\\Phi_t(\\cdot)$ is a contraction mapping.\n\nAssumption 4 All the dynamical models are contraction mappings, i.e.,\n\n$$\\|\\Phi_t(x) - \\Phi_t(x')\\|_2 \\le \\|x - x'\\|_2, \\tag{14}$$\n\nfor all $t \\in [T]$, and $x, x' \\in \\mathcal{X}$.\n\nThen, we choose $P_T'$ in (6) as the regularity of a comparator sequence, which measures how much it deviates from the given dynamics.\n\nAlgorithms For brevity, we only discuss how to incorporate the dynamical models into the basic Ader in Section 3.3, and the extension to the improved version can be done in the same way. In fact, we only need to modify the expert-algorithm, and the updated one is provided in Algorithm 5. To utilize the dynamical model, after performing gradient descent, i.e.,\n\n$$\\bar{x}_{t+1}^\\eta = \\Pi_{\\mathcal{X}}\\left[x_t^\\eta - \\eta \\nabla f_t(x_t^\\eta)\\right]$$\n\nin Step 5, we apply the dynamical model to the intermediate solution $\\bar{x}_{t+1}^\\eta$, i.e.,\n\n$$x_{t+1}^\\eta = \\Phi_t(\\bar{x}_{t+1}^\\eta),$$\n\nand obtain the prediction for the next round. In the meta-algorithm (Algorithm 1), we only need to replace Algorithm 2 in Step 1 with Algorithm 5, and the rest is the same. The dynamic regret of the new algorithm is given below.\n\nAlgorithm 5 Ader: Expert-algorithm with dynamical models\nRequire: The step size $\\eta$, a sequence of dynamical models $\\Phi_t(\\cdot)$\n1: Let $x_1^\\eta$ be any point in $\\mathcal{X}$\n2: for $t = 1, \\ldots, T$ do\n3:   Submit $x_t^\\eta$ to the meta-algorithm\n4:   Receive gradient $\\nabla f_t(x_t^\\eta)$ from the meta-algorithm\n5:   $\\bar{x}_{t+1}^\\eta = \\Pi_{\\mathcal{X}}\\left[x_t^\\eta - \\eta \\nabla f_t(x_t^\\eta)\\right]$\n6:   $x_{t+1}^\\eta = \\Phi_t(\\bar{x}_{t+1}^\\eta)$\n7: end for\n\nTheorem 5 Set\n\n$$\\mathcal{H} = \\left\\{ \\eta_i = \\frac{2^{i-1} D}{G} \\sqrt{\\frac{1}{T}} \\;\\middle|\\; i = 1, \\ldots, N \\right\\} \\tag{15}$$\n\nwhere $N = \\left\\lceil \\frac{1}{2} \\log_2(1 + 2T) \\right\\rceil + 1$, $\\alpha = \\sqrt{8/(T c^2)}$, and use Algorithm 5 as the expert-algorithm in Algorithm 1. Under Assumptions 1, 2, 3 and 4, for any comparator sequence $u_1, \\ldots, u_T \\in \\mathcal{X}$, our proposed Ader method satisfies\n\n$$\\sum_{t=1}^T f_t(x_t) - \\sum_{t=1}^T f_t(u_t) \\le \\frac{3G}{2} \\sqrt{T(D^2 + 2D P_T')} + \\frac{c \\sqrt{2T}}{4} \\left[1 + 2 \\ln(k + 1)\\right] = O\\left(\\sqrt{T(1 + P_T')}\\right)$$\n\nwhere\n\n$$k = \\left\\lfloor \\frac{1}{2} \\log_2\\left(1 + \\frac{2 P_T'}{D}\\right) \\right\\rfloor + 1.$$\n\nTheorem 5 indicates our method achieves an $O(\\sqrt{T(1 + P_T')})$ dynamic regret, improving the $O(\\sqrt{T}(1 + P_T'))$ dynamic regret of Hall and Willett [2013] significantly. Note that when $\\Phi_t(\\cdot)$ is the identity map, we recover the result in Theorem 3. Thus, the upper bound in Theorem 5 is also optimal.\n\n4 Conclusion and Future Work\n\nIn this paper, we study the general form of dynamic regret, which compares the cumulative loss of the online learner against an arbitrary sequence of comparators. To this end, we develop a novel method, named as adaptive learning for dynamic environment (Ader). Theoretical analysis shows that Ader achieves an optimal $O(\\sqrt{T(1+P_T)})$ dynamic regret. When a sequence of dynamical models is available, we extend Ader to incorporate this additional information, and obtain an $O(\\sqrt{T(1 + P_T')})$ dynamic regret.\n\nIn the future, we will investigate whether the curvature of functions, such as strong convexity and smoothness, can be utilized to improve the dynamic regret bound. We note that in the setting of the restricted dynamic regret, the curvature of functions indeed makes the upper bound tighter [Mokhtari et al., 2016, Zhang et al., 2017]. But whether it improves the general dynamic regret remains an open problem.\n\nAcknowledgments\n\nThis work was partially supported by the National Key R&D Program of China (2018YFB1004300), NSFC (61603177), JiangsuSF (BK20160658), YESS (2017QNRC001), and Microsoft Research Asia.\n\nReferences\n\nJ. Abernethy, P. L. Bartlett, A. Rakhlin, and A. Tewari. Optimal strategies and minimax lower bounds for online convex games. In Proceedings of the 21st Annual Conference on Learning Theory, pages 415\u2013423, 2008.\n\nO. Besbes, Y. Gur, and A. Zeevi. Non-stationary stochastic optimization. Operations Research, 63(5):1227\u20131244, 2015.\n\nS. Boyd and L. Vandenberghe. Convex Optimization. Cambridge University Press, 2004.\n\nN. Cesa-Bianchi and G. Lugosi. Prediction, Learning, and Games. Cambridge University Press, 2006.\n\nC.-K. Chiang, T. Yang, C.-J. Lee, M. Mahdavi, C.-J. Lu, R. Jin, and S. Zhu. Online optimization with gradual variations. In Proceedings of the 25th Annual Conference on Learning Theory, 2012.\n\nA. Daniely, A. Gonen, and S. Shalev-Shwartz. Strongly adaptive online learning. In Proceedings of the 32nd International Conference on Machine Learning, pages 1405\u20131411, 2015.\n\nJ. Duchi, E. Hazan, and Y. Singer. Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 12:2121\u20132159, 2011.\n\nE. C. Hall and R. M. Willett. Dynamical models and tracking regret in online convex programming. In Proceedings of the 30th International Conference on Machine Learning, pages 579\u2013587, 2013.\n\nE. Hazan. Introduction to online convex optimization. Foundations and Trends in Optimization, 2(3-4):157\u2013325, 2016.\n\nE. Hazan and C. Seshadhri. Adaptive algorithms for online decision problems. Electronic Colloquium on Computational Complexity, 88, 2007.\n\nE. Hazan, A. Agarwal, and S. Kale. 
Logarithmic regret algorithms for online convex optimization. Machine Learning, 69(2-3):169\u2013192, 2007.\n\nM. Herbster and M. K. Warmuth. Tracking the best expert. Machine Learning, 32(2):151\u2013178, 1998.\n\nA. Jadbabaie, A. Rakhlin, S. Shahrampour, and K. Sridharan. Online optimization: Competing with dynamic comparators. In Proceedings of the 18th International Conference on Artificial Intelligence and Statistics, pages 398\u2013406, 2015.\n\nK.-S. Jun, F. Orabona, S. Wright, and R. Willett. Improved strongly adaptive online learning using coin betting. In Proceedings of the 20th International Conference on Artificial Intelligence and Statistics, pages 943\u2013951, 2017.\n\nA. Mokhtari, S. Shahrampour, A. Jadbabaie, and A. Ribeiro. Online optimization in dynamic environments: Improved regret rates for strongly convex problems. In IEEE 55th Conference on Decision and Control, pages 7195\u20137201, 2016.\n\nA. Rakhlin and K. Sridharan. Online learning with predictable sequences. In Proceedings of the 26th Conference on Learning Theory, pages 993\u20131019, 2013.\n\nS. Shalev-Shwartz. Online learning and online convex optimization. Foundations and Trends in Machine Learning, 4(2):107\u2013194, 2011.\n\nS. Shalev-Shwartz, Y. Singer, and N. Srebro. Pegasos: primal estimated sub-gradient solver for SVM. In Proceedings of the 24th International Conference on Machine Learning, pages 807\u2013814, 2007.\n\nN. Srebro, K. Sridharan, and A. Tewari. Smoothness, low-noise and fast rates. In Advances in Neural Information Processing Systems 23, pages 2199\u20132207, 2010.\n\nT. van Erven and W. M. Koolen. Metagrad: Multiple learning rates in online learning. In Advances in Neural Information Processing Systems 29, pages 3666\u20133674, 2016.\n\nT. Yang, L. Zhang, R. Jin, and J. Yi. Tracking slowly moving clairvoyant: Optimal dynamic regret of online learning with true and noisy gradient. 
In Proceedings of the 33rd International Conference on Machine Learning, pages 449\u2013457, 2016.\n\nL. Zhang, J. Yi, R. Jin, M. Lin, and X. He. Online kernel learning with a near optimal sparsity bound. In Proceedings of the 30th International Conference on Machine Learning, 2013.\n\nL. Zhang, T. Yang, J. Yi, R. Jin, and Z.-H. Zhou. Improved dynamic regret for non-degenerate functions. In Advances in Neural Information Processing Systems 30, pages 732\u2013741, 2017.\n\nL. Zhang, S. Lu, and Z.-H. Zhou. Adaptive online learning in dynamic environments. ArXiv e-prints, arXiv:1810.10815, 2018a.\n\nL. Zhang, T. Yang, R. Jin, and Z.-H. Zhou. Dynamic regret of strongly adaptive methods. In Proceedings of the 35th International Conference on Machine Learning, 2018b.\n\nM. Zinkevich. Online convex programming and generalized infinitesimal gradient ascent. In Proceedings of the 20th International Conference on Machine Learning, pages 928\u2013936, 2003.\n", "award": [], "sourceid": 691, "authors": [{"given_name": "Lijun", "family_name": "Zhang", "institution": "Nanjing University (NJU)"}, {"given_name": "Shiyin", "family_name": "Lu", "institution": "Nanjing University"}, {"given_name": "Zhi-Hua", "family_name": "Zhou", "institution": "Nanjing University"}]}