Online Learning with Adversarial Delays

Kent Quanrud* and Daniel Khashabi†
Department of Computer Science
University of Illinois at Urbana-Champaign
Urbana, IL 61801
{quanrud2,khashab2}@illinois.edu

Advances in Neural Information Processing Systems, pages 1270–1278

Abstract

We study the performance of standard online learning algorithms when the feedback is delayed by an adversary. We show that online-gradient-descent [1] and follow-the-perturbed-leader [2] achieve regret $O(\sqrt{D})$ in the delayed setting, where $D$ is the sum of delays of each round's feedback. This bound collapses to an optimal $O(\sqrt{T})$ bound in the usual setting of no delays (where $D = T$). Our main contribution is to show that standard algorithms for online learning already have simple regret bounds in the most general setting of delayed feedback, making adjustments to the analysis and not to the algorithms themselves.
Our results help affirm and clarify the success of recent algorithms in optimization and machine learning that operate in a delayed feedback model.

1 Introduction

Consider the following simple game. Let $K$ be a bounded set, such as the unit $\ell_1$ ball or a collection of $n$ experts. Each round $t$, we pick a point $x_t \in K$. An adversary then gives us a cost function $f_t$, and we incur the loss $\ell_t = f_t(x_t)$. After $T$ rounds, our total loss is the sum $L_T = \sum_{t=1}^T \ell_t$, which we want to minimize.

We cannot hope to beat the adversary, so to speak, when the adversary picks the cost function after we select our point. There is margin for optimism, however, if rather than evaluate our total loss in absolute terms, we compare our strategy to the best fixed point in hindsight. The regret of a strategy $x_1, \dots, x_T \in K$ is the additive difference

  $R(T) = \sum_{t=1}^T f_t(x_t) - \min_{x \in K} \sum_{t=1}^T f_t(x)$.

Surprisingly, one can obtain positive results in terms of regret. Kalai and Vempala showed that a simple and randomized follow-the-leader type algorithm achieves $R(T) = O(\sqrt{T})$ in expectation for linear cost functions [2] (here, the big-O notation assumes that the diameter of $K$ and the $f_t$'s are bounded by constants). If $K$ is convex, then even if the costs are more generally convex cost functions (where we incur losses of the form $\ell_t = f_t(x_t)$, with $f_t$ a convex function), Zinkevich showed that gradient descent achieves regret $R(T) = O(\sqrt{T})$ [1]. There is a large body of theoretical literature about this setting, called online learning (see for example the surveys by Blum [3], Shalev-Shwartz [4], and Hazan [5]).

Online learning is general enough to be applied to a diverse family of problems.
For example, Kalai and Vempala's algorithm can be applied to online combinatorial problems such as shortest paths [6], decision trees [7], and data structures [8, 2]. In addition to basic machine learning problems with convex loss functions, Zinkevich considers applications to industrial optimization, where the value of goods is not known until after the goods are produced. Other examples of applications of online learning include universal portfolios in finance [9] and online topic-ranking for multi-labeled documents [10].

* http://illinois.edu/~quanrud2/. Supported in part by NSF grants CCF-1217462, CCF-1319376, CCF-1421231, CCF-1526799.
† http://illinois.edu/~khashab2/. Supported in part by a grant from Google.

The standard setting assumes that the cost vector $f_t$ (or more generally, the feedback) is given to and processed by the player before making the next decision in round $t + 1$. Philosophically, this is not how decisions are made in real life: we rush through many different things at the same time with no pause for careful consideration, and we may not realize our mistakes for a while. Unsurprisingly, the assumption of immediate feedback is too restrictive for many real applications. In online advertising, online learning algorithms try to predict and serve ads that optimize for clicks [11]. The algorithm learns by observing whether or not an ad is clicked, but in production systems, a massive number of ads are served between the moment an ad is displayed to a user and the moment the user has decided to either click or ignore that ad. In military applications, online learning algorithms are used by radio jammers to identify efficient jamming strategies [12]. After a jammer attempts to disrupt a packet between a transmitter and a receiver, it does not know if the jamming attempt succeeded until an acknowledgement packet is sent by the receiver.
In cloud computing, online learning helps devise efficient resource allocation strategies, such as finding the right mix of cheaper (and inconsistent) spot instances and more reliable (and expensive) on-demand instances when renting computers for batch jobs [13]. The learning algorithm does not know how well an allocation strategy worked for a batch job until the batch job has ended, by which time many more batch jobs have already been launched. In finance, online learning algorithms managing portfolios are subject to information and transaction delays from the market, and financial firms invest heavily to minimize these delays.

One strategy to handle delayed feedback is to pool independent copies of a fixed learning algorithm, each of which acts as an undelayed learner over a subsequence of the rounds. Each round is delegated to a single instance from the pool of learners, and the learner is required to wait for and process its feedback before rejoining the pool. If there are no learners available, a new copy is instantiated and added to the pool. The size of the pool is proportional to the maximum number of outstanding delays at any point of decision, and the overall regret is bounded by the sum of regrets of the individual learners. This approach is analyzed for constant delays by Weinberger and Ordentlich [14], and a more sophisticated analysis is given by Joulani et al. [15]. If $\alpha$ is the expected maximum number of outstanding feedbacks, then Joulani et al. obtain a regret bound on the order of $O(\sqrt{\alpha T})$ (in expectation) for the setting considered here. The blackbox nature of this approach begets simultaneous bounds for other settings such as partial information and stochastic rewards.
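As a schematic illustration of this pooling strategy (a sketch, not the construction of [14, 15] verbatim; the `make_learner` factory and the elided predict/update calls are hypothetical):

```python
# Schematic of the pooling strategy: each round is delegated to an idle
# copy of a base learner, and a copy rejoins the pool only once its
# feedback has been delivered. `make_learner` is a hypothetical factory
# for an undelayed learner; its predict/update calls are elided below.

def run_pooled(T, delays, make_learner):
    """Returns the number of learner copies instantiated, i.e. the pool
    size the reduction pays for (the maximum number of outstanding
    feedbacks at any point of decision)."""
    idle = []                          # learners ready to play a round
    arriving = {}                      # arrival round -> learners owed feedback
    n_copies = 0
    for t in range(1, T + 1):
        if not idle:                   # no free learner: instantiate a new copy
            idle.append(make_learner())
            n_copies += 1
        learner = idle.pop()           # delegate round t to this copy
        # ... play this learner's prediction as the point for round t ...
        arriving.setdefault(t + delays[t - 1] - 1, []).append(learner)
        for done in arriving.pop(t, []):   # feedback arriving at end of round t
            # ... done would process its feedback here ...
            idle.append(done)
    return n_copies
```

With no delays a single copy suffices, while a uniform delay of two rounds forces two copies to alternate, matching the intuition that the pool grows with the number of outstanding feedbacks.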
Although maintaining copies of learners in proportion to the delay may be prohibitively resource intensive, Joulani et al. provide a more efficient variant for the stochastic bandit problem, a setting not considered here.

Another line of research is dedicated to scaling gradient descent type algorithms to distributed settings, where asynchronous processors naturally introduce delays in the learning framework. A classic reference in this area is the book of Bertsekas and Tsitsiklis [16]. If the data is very sparse, so that input instances and their gradients are somewhat orthogonal, then intuitively we can apply gradients out of order without significant interference across rounds. This idea is explored by Recht et al. [17], who analyze and test a parallel algorithm on a restricted class of strongly convex loss functions, and by Duchi et al. [18] and McMahan and Streeter [19], who design and analyze distributed variants of adaptive gradient descent [20]. Perhaps the most closely related work in this area is by Langford et al., who study the online-gradient-descent algorithm of Zinkevich when the delays are bounded by a constant number of rounds [21]. Research in this area has largely moved on from the simplistic models considered here; see [22, 23, 24] for more recent developments.

The impact of delayed feedback in learning algorithms is also explored by Riabko [25] under the framework of "weak teachers".

For the sake of concreteness, we establish the following notation for the delayed setting. For each round $t$, let $d_t \in \mathbb{Z}_{\geq 1}$ be a positive integer delay. The feedback from round $t$ is delivered at the end of round $t + d_t - 1$, and can be used in round $t + d_t$. In the standard setting with no delays, $d_t = 1$ for all $t$. For each round $t$, let $F_t = \{u \in [T] : u + d_u - 1 = t\}$ be the set of rounds whose feedback appears at the end of round $t$.
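To make this bookkeeping concrete, here is a toy example (the delay sequence below is hypothetical, chosen only for illustration):

```python
# Compute the sets F_t (rounds whose feedback arrives at the end of
# round t) and the total delay sum_t d_t for a hypothetical delay
# sequence. Feedback from round u arrives at the end of round u + d_u - 1.

def feedback_schedule(delays):
    T = len(delays)                        # delays[u-1] = d_u for u = 1..T
    F = {t: [] for t in range(1, T + 1)}
    for u, d_u in enumerate(delays, start=1):
        arrival = u + d_u - 1
        if arrival <= T:                   # feedback after round T is never seen
            F[arrival].append(u)
    return F, sum(delays)                  # second value: the total delay

F, total_delay = feedback_schedule([1, 3, 1, 2])
# Round 1 is undelayed (d_1 = 1); round 2's feedback arrives at the end
# of round 2 + 3 - 1 = 4; round 4's feedback would arrive after round T.
```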
We let $D = \sum_{t=1}^T d_t$ denote the sum of all delays; in the standard setting with no delays, we have $D = T$.

In this paper, we investigate the implications of delayed feedback when the delays are adversarial (i.e., arbitrary), with no assumptions or restrictions made on the adversary. Rather than design new algorithms that may generate a more involved analysis, we study the performance of the classical algorithms online-gradient-descent and follow-the-perturbed-leader, essentially unmodified, when the feedback is delayed. In the delayed setting, we prove that both algorithms have a simple regret bound of $O(\sqrt{D})$. These bounds collapse to match the well-known $O(\sqrt{T})$ regret bounds if there are no delays (i.e., where $D = T$).

Paper organization  In Section 2, we analyze the online-gradient-descent algorithm in the delayed setting, giving upper bounds on the regret as a function of the sum of delays $D$. In Section 3, we analyze follow-the-perturbed-leader in the delayed setting and derive a regret bound in terms of $D$. Due to space constraints, extensions to online-mirror-descent and follow-the-lazy-leader are deferred to the appendix. We conclude and propose future directions in Section 4.

2 Delayed gradient descent

Convex optimization  In online convex optimization, the input domain $K$ is convex, and each cost function $f_t$ is convex. For this setting, Zinkevich proposed a simple online algorithm, called online-gradient-descent, designed as follows [1]. The first point, $x_1$, is picked in $K$ arbitrarily. After picking the $t$th point $x_t$, online-gradient-descent computes the gradient $\nabla f_t|_{x_t}$ of the loss function at $x_t$, and chooses $x_{t+1} = \pi_K(x_t - \eta \nabla f_t|_{x_t})$ in the subsequent round, for some parameter $\eta \in \mathbb{R}_{>0}$. Here, $\pi_K$ is the projection that maps a point $x'$ to its nearest point in $K$ (discussed further below).
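For concreteness, the projection $\pi_K$ is easy to write down for simple choices of $K$; the two sets below (a box and the unit $\ell_2$ ball) are illustrative examples only, since the paper assumes nothing about $K$ beyond convexity:

```python
import math

# Euclidean projections onto two simple convex sets, illustrating the
# map pi_K used by online-gradient-descent. These choices of K are
# examples; any closed convex K admits such a nearest-point map.

def project_box(x, lo=-1.0, hi=1.0):
    # Projection onto the box [lo, hi]^n is coordinatewise clamping.
    return [min(hi, max(lo, xi)) for xi in x]

def project_unit_ball(x):
    # Projection onto the unit L2 ball rescales points outside the ball.
    norm = math.sqrt(sum(xi * xi for xi in x))
    return list(x) if norm <= 1.0 else [xi / norm for xi in x]
```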
Zinkevich showed that, assuming the Euclidean diameter of $K$ and the Euclidean lengths of all gradients $\nabla f_t|_x$ are bounded by constants, online-gradient-descent has an optimal regret bound of $O(\sqrt{T})$.

Delayed gradient descent  In the delayed setting, the loss function $f_t$ is not necessarily given by the adversary before we pick the next point $x_{t+1}$ (or even at all). The natural generalization of online-gradient-descent to this setting is to process the convex loss functions and apply their gradients the moment they are delivered. That is, we update

  $x'_{t+1} = x_t - \eta \sum_{s \in F_t} \nabla f_s|_{x_s}$,

for some fixed parameter $\eta$, and then project $x_{t+1} = \pi_K(x'_{t+1})$ back into $K$ to choose our $(t+1)$th point. In the setting of Zinkevich, we have $F_t = \{t\}$ for each $t$, and this algorithm is exactly online-gradient-descent. Note that a gradient $\nabla f_s|_{x_s}$ does not need to be timestamped by the round $s$ from which it originates, which is required by the pooling strategies of Weinberger and Ordentlich [14] and Joulani et al. [15] in order to return the feedback to the appropriate learner.

Theorem 2.1. Let $K$ be a convex set with diameter 1, let $f_1, \dots, f_T$ be convex functions over $K$ with $\|\nabla f_t|_x\|_2 \leq L$ for all $x \in K$ and $t \in [T]$, and let $\eta > 0$ be a fixed parameter. In the presence of adversarial delays, online-gradient-descent selects points $x_1, \dots, x_T \in K$ such that for all $y \in K$,

  $\sum_{t=1}^T f_t(x_t) - \sum_{t=1}^T f_t(y) = O\!\left(\frac{1}{\eta} + \eta L^2 (T + D)\right)$,

where $D$ denotes the sum of delays over all rounds $t \in [T]$.

For $\eta = 1/(L\sqrt{T + D})$, Theorem 2.1 implies a regret bound of $O(L\sqrt{D + T}) = O(L\sqrt{D})$. This choice of $\eta$ requires prior knowledge of the final sum $D$.
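The delayed update itself can be sketched in a few lines; the sketch below takes $K = [-1, 1]$ purely for illustration, so that the projection is a clamp:

```python
# Sketch of online-gradient-descent with delayed feedback over the
# illustrative 1-D domain K = [-1, 1]. Each round's gradient is computed
# at the point played that round but only applied when it arrives, at
# the end of round t + d_t - 1, as in the update rule above.

def delayed_ogd(grad, delays, eta, T):
    """grad(t, x) returns the (sub)gradient of f_t at x; returns the
    points x_1, ..., x_T that the algorithm plays."""
    x, points, pending = 0.0, [], {}
    for t in range(1, T + 1):
        points.append(x)                              # play x_t
        g = grad(t, x)                                # feedback, delivered late
        pending.setdefault(t + delays[t - 1] - 1, []).append(g)
        for g_s in pending.pop(t, []):                # gradients in F_t
            x -= eta * g_s
        x = min(1.0, max(-1.0, x))                    # project back onto K
    return points
```

With $f_t(x) = x$ (gradient 1) and no delays, the iterates march toward $-1$ one step per round; with a uniform delay of two rounds, each step lags one round behind.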
When this sum is not known, one can calculate $D$ on the fly: if there are $\delta$ outstanding (undelivered) cost functions at a round $t$, then $D$ increases by exactly $\delta$. Observe that $\delta \leq T$ and $T \leq D$, so $D$ at most doubles in any one round. We can therefore employ the "doubling trick" of Auer et al. [26] to dynamically adjust $\eta$ as $D$ grows.

In the undelayed setting analyzed by Zinkevich, we have $D = T$, and the regret bound of Theorem 2.1 matches that obtained by Zinkevich. If each delay $d_t$ is bounded by some fixed value $\tau$, Theorem 2.1 implies a regret bound of $O(L\sqrt{\tau T})$ that matches that of Langford et al. [21]. In both of these special cases, the regret bound is known to be tight.

Before proving Theorem 2.1, we review basic definitions and facts on convexity. A function $f : K \to \mathbb{R}$ is convex if

  $f((1 - \alpha)x + \alpha y) \leq (1 - \alpha)f(x) + \alpha f(y)$  for all $x, y \in K$ and $\alpha \in [0, 1]$.

If $f$ is differentiable, then $f$ is convex iff

  $f(x) + \nabla f|_x \cdot (y - x) \leq f(y)$  for all $x, y \in K$.  (1)

For $f$ convex but not necessarily differentiable, a subgradient of $f$ at $x$ is any vector that can replace $\nabla f|_x$ in equation (1). The (possibly empty) set of subgradients of $f$ at $x$ is denoted by $\partial f(x)$.

Gradient descent may occasionally update along a gradient that takes us out of the constrained domain $K$. If $K$ is convex, then we can simply project the point back into $K$.

Lemma 2.2. Let $K$ be a closed convex set in a normed linear space $X$ and $x \in X$ a point, and let $x' \in K$ be the closest point in $K$ to $x$. Then, for any point $y \in K$,

  $\|x' - y\|_2 \leq \|x - y\|_2$.

We let $\pi_K$ denote the map taking a point $x$ to its closest point in the convex set $K$.

Proof of Theorem 2.1.
Let $y = \arg\min_{x \in K} (f_1(x) + \cdots + f_T(x))$ be the best point in hindsight at the end of all $T$ rounds. For $t \in [T]$, by convexity of $f_t$, we have

  $f_t(y) \geq f_t(x_t) + \nabla f_t|_{x_t} \cdot (y - x_t)$.

Fix $t \in [T]$, and consider the distance between $x_{t+1}$ and $y$. By Lemma 2.2, we know that

  $\|x_{t+1} - y\|_2 \leq \|x'_{t+1} - y\|_2$, where $x'_{t+1} = x_t - \eta \sum_{s \in F_t} \nabla f_s|_{x_s}$.

We split the sum of gradients applied in a single round and consider them one by one. For each $s \in F_t$, let $F_{t,s} = \{r \in F_t : r < s\}$, and let $x_{t,s} = x_t - \eta \sum_{r \in F_{t,s}} \nabla f_r|_{x_r}$. Suppose $F_t$ is nonempty, and fix $s' = \max F_t$ to be the last index in $F_t$. By Lemma 2.2, we have

  $\|x_{t+1} - y\|_2^2 \leq \|x'_{t+1} - y\|_2^2 = \|x_{t,s'} - \eta \nabla f_{s'}|_{x_{s'}} - y\|_2^2 = \|x_{t,s'} - y\|_2^2 - 2\eta \left(\nabla f_{s'}|_{x_{s'}} \cdot (x_{t,s'} - y)\right) + \eta^2 \|\nabla f_{s'}|_{x_{s'}}\|_2^2$.

Repeatedly unrolling the first term in this fashion gives

  $\|x_{t+1} - y\|_2^2 \leq \|x_t - y\|_2^2 - 2\eta \sum_{s \in F_t} \nabla f_s|_{x_s} \cdot (x_{t,s} - y) + \eta^2 \sum_{s \in F_t} \|\nabla f_s|_{x_s}\|_2^2$.

For each $s \in F_t$, by convexity of $f_s$, we have

  $-\nabla f_s|_{x_s} \cdot (x_{t,s} - y) = \nabla f_s|_{x_s} \cdot (y - x_{t,s}) = \nabla f_s|_{x_s} \cdot (y - x_s) + \nabla f_s|_{x_s} \cdot (x_s - x_{t,s}) \leq f_s(y) - f_s(x_s) + \nabla f_s|_{x_s} \cdot (x_s - x_{t,s})$.

By assumption, we also have $\|\nabla f_s|_{x_s}\|_2 \leq L$ for each $s \in F_t$.
With respect to the distance between $x_{t+1}$ and $y$, this gives

  $\|x_{t+1} - y\|_2^2 \leq \|x_t - y\|_2^2 + 2\eta \sum_{s \in F_t} \left(f_s(y) - f_s(x_s) + \nabla f_s|_{x_s} \cdot (x_s - x_{t,s})\right) + \eta^2 \cdot |F_t| \cdot L^2$.

Solving this inequality for the regret terms $\sum_{s \in F_t} f_s(x_s) - f_s(y)$ and taking the sum of inequalities over all rounds $t \in [T]$, we have

  $\sum_{t=1}^T (f_t(x_t) - f_t(y)) = \sum_{t=1}^T \sum_{s \in F_t} (f_s(x_s) - f_s(y))$
  $\leq \frac{1}{2\eta} \sum_{t=1}^T \left(\|x_t - y\|_2^2 - \|x_{t+1} - y\|_2^2\right) + \frac{\eta}{2} T L^2 + \sum_{t=1}^T \sum_{s \in F_t} \nabla f_s|_{x_s} \cdot (x_s - x_{t,s})$
  $\leq \frac{1}{2\eta} + \frac{\eta}{2} T L^2 + \sum_{t=1}^T \sum_{s \in F_t} \nabla f_s|_{x_s} \cdot (x_s - x_{t,s})$.  (2)

The first two terms are familiar from the standard analysis of online-gradient-descent. It remains to analyze the last sum, which we call the delay term.

Each summand $\nabla f_s|_{x_s} \cdot (x_s - x_{t,s})$ in the delay term contributes loss proportional to the distance between the point $x_s$ when the gradient $\nabla f_s|_{x_s}$ is generated and the point $x_{t,s}$ when the gradient is applied. This distance is created by the other gradients that are applied in between, and the number of such in-between gradients is intimately tied to the total delay, as follows. By Cauchy–Schwarz, the delay term is bounded above by

  $\sum_{t=1}^T \sum_{s \in F_t} \nabla f_s|_{x_s} \cdot (x_s - x_{t,s}) \leq \sum_{t=1}^T \sum_{s \in F_t} \|\nabla f_s|_{x_s}\|_2 \|x_s - x_{t,s}\|_2 \leq L \sum_{t=1}^T \sum_{s \in F_t} \|x_s - x_{t,s}\|_2$.  (3)

Consider a single term $\|x_s - x_{t,s}\|_2$ for fixed $t \in [T]$ and $s \in F_t$.
Intuitively, the difference $x_{t,s} - x_s$ is roughly the sum of gradients received between round $s$ and the application of the gradient from round $s$ in round $t$. More precisely, by applying the triangle inequality and Lemma 2.2, we have

  $\|x_{t,s} - x_s\|_2 \leq \|x_{t,s} - x_t\|_2 + \|x_t - x_s\|_2 \leq \|x_{t,s} - x_t\|_2 + \|x'_t - x_s\|_2$.

For the same reason, we have $\|x'_t - x_s\|_2 \leq \|x'_t - x_{t-1}\|_2 + \|x'_{t-1} - x_s\|_2$, and unrolling in this fashion, we have

  $\|x_{t,s} - x_s\|_2 \leq \|x_{t,s} - x_t\|_2 + \sum_{r=s}^{t-1} \|x'_{r+1} - x_r\|_2 \leq \eta \sum_{p \in F_{t,s}} \|\nabla f_p|_{x_p}\|_2 + \eta \sum_{r=s}^{t-1} \sum_{q \in F_r} \|\nabla f_q|_{x_q}\|_2 \leq \eta \cdot L \cdot \left(|F_{t,s}| + \sum_{r=s}^{t-1} |F_r|\right)$.  (4)

After substituting equation (4) into equation (3), it remains to bound the sum $\sum_{t=1}^T \sum_{s \in F_t} \left(|F_{t,s}| + \sum_{r=s}^{t-1} |F_r|\right)$. Consider a single term $|F_{t,s}| + \sum_{r=s}^{t-1} |F_r|$ in the sum. This quantity counts, for a gradient $\nabla f_s|_{x_s}$ from round $s$ delivered just before round $t \geq s$, the number of other gradients that are applied while $\nabla f_s|_{x_s}$ is withheld. Fix two rounds $s$ and $t$, and consider an intermediate round $r \in \{s, \dots, t\}$. If $r < t$ then fix $q \in F_r$, and if $r = t$ then fix $q \in F_{t,s}$.
The feedback from round $q$ is applied in a round $r$ between round $s$ and round $t$. We divide our analysis into two scenarios. In one case, $q \leq s$, and the gradient from round $q$ is generated at or before round $s$ but applied only after $s$. In the other case, $q > s$, and the gradient from round $q$ is generated strictly after round $s$.

For each round $u$, let $d_u$ denote the number of rounds the gradient feedback is delayed (so $u \in F_{u + d_u - 1}$). There are at most $d_s$ instances of the latter case, since $q$ must lie in $s + 1, \dots, t$ and $t = s + d_s - 1$. To bound the former case, observe that for a fixed $q$, any such $s$ satisfies $q \leq s \leq q + d_q - 1$, because the gradient from round $q$ is applied at round $q + d_q - 1 \geq s$; hence there are at most $d_q$ such indices $s$, and all instances of the first case for a fixed $q$ can be charged to $d_q$. Between the two cases, we have

  $\sum_{t=1}^T \sum_{s \in F_t} \left(|F_{t,s}| + \sum_{r=s}^{t-1} |F_r|\right) \leq 2 \sum_{t=1}^T d_t$,

and the delay term is bounded by

  $\sum_{t=1}^T \sum_{s \in F_t} \nabla f_s|_{x_s} \cdot (x_s - x_{t,s}) \leq 2\eta \cdot L^2 \sum_{t=1}^T d_t$.

With respect to the overall regret, this gives

  $\sum_{t=1}^T (f_t(x_t) - f_t(y)) \leq \frac{1}{2\eta} + \eta \cdot L^2 \left(\frac{T}{2} + 2 \sum_{t=1}^T d_t\right) = O\!\left(\frac{1}{\eta} + \eta L^2 D\right)$,

as desired. □

Remark 2.3. The delay term $\sum_{t=1}^T \sum_{s \in F_t} \nabla f_s|_{x_s} \cdot (x_s - x_{t,s})$ is a natural point of entry for a sharper analysis based on strong sparseness assumptions.
The distance $x_s - x_{t,s}$ is measured by its projection against the gradient $\nabla f_s|_{x_s}$, and the preceding proof assumes the worst case and bounds the dot product with the Cauchy–Schwarz inequality. If, for example, we assume that gradients are pairwise orthogonal and analyze online-gradient-descent in the unconstrained setting, then the dot product $\nabla f_s|_{x_s} \cdot (x_s - x_{t,s})$ is 0 and the delay term vanishes altogether.

3 Delaying the Perturbed Leader

Discrete online linear optimization  In discrete online linear optimization, the input domain $K \subset \mathbb{R}^n$ is a (possibly discrete) set with bounded diameter, and each cost function $f_t$ is of the form $f_t(x) = c_t \cdot x$ for a bounded-length cost vector $c_t$. The previous algorithm online-gradient-descent does not apply here because $K$ is not convex.

A natural algorithm for this problem is follow-the-leader. Each round $t$, let $y_t = \arg\min_{x \in K} x \cdot (c_1 + \cdots + c_t)$ be the optimum choice over the first $t$ cost vectors. The algorithm picking $y_t$ in round $t$ is called be-the-leader, and can be shown to have zero regret. Of course, be-the-leader is infeasible since the cost vector $c_t$ is revealed after picking $y_t$. follow-the-leader tries the next best thing, picking $y_{t-1}$ in round $t$. Unfortunately, this strategy can have linear regret, largely because it is a deterministic algorithm that can be manipulated by an adversary.

Kalai and Vempala [2] gave a simple and elegant correction called follow-the-perturbed-leader. Let $\epsilon > 0$ be a parameter to be fixed later, and let $Q_\epsilon = [0, 1/\epsilon]^n$ be the cube of side length $1/\epsilon$. Each round $t$, follow-the-perturbed-leader picks a vector $c_0 \in Q_\epsilon$ uniformly at random, and then selects $x_t = \arg\min_{x \in K} x \cdot (c_0 + c_1 + \cdots + c_{t-1})$ to optimize over the previous costs plus the random perturbation $c_0$.
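The rule just described can be sketched over a small explicit finite $K$; in the sketch below, the two standard basis vectors play the role of two experts, and the `rng` parameter and cost sequence are illustrative assumptions for reproducibility:

```python
import random

# Sketch of follow-the-perturbed-leader over a finite K of points in R^n.
# Each round, a fresh perturbation c0 is drawn uniformly from the cube
# [0, 1/eps]^n, and the algorithm plays the minimizer of the perturbed
# cumulative cost. K and the costs below are illustrative choices.

def ftpl(K, costs, eps, rng):
    n = len(K[0])
    total = [0.0] * n                      # c_1 + ... + c_{t-1}
    plays = []
    for c_t in costs:
        c0 = [rng.uniform(0.0, 1.0 / eps) for _ in range(n)]
        plays.append(min(K, key=lambda x: sum(
            xi * (c0i + si) for xi, c0i, si in zip(x, c0, total))))
        total = [si + ci for si, ci in zip(total, c_t)]   # c_t revealed
    return plays

# Two "experts": expert 0 always incurs cost 1, expert 1 incurs cost 0.
K = [(1.0, 0.0), (0.0, 1.0)]
plays = ftpl(K, [(1.0, 0.0)] * 10, eps=1.0, rng=random.Random(0))
```

Once expert 0's cumulative cost exceeds the perturbation range $1/\epsilon$, the algorithm commits to expert 1, illustrating how the noise only matters while the leaders are close.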
With the diameter of $K$ and the lengths $\|c_t\|$ of each cost vector held constant, Kalai and Vempala showed that follow-the-perturbed-leader has regret $O(\sqrt{T})$ in expectation.

Following the delayed and perturbed leader  More generally, follow-the-perturbed-leader optimizes over all information available to the algorithm, plus some additional noise to smoothen the worst-case analysis. If the cost vectors are delayed, we naturally interpret follow-the-perturbed-leader to optimize over all cost vectors $c_t$ delivered in time for round $t$ when picking its point $x_t$. That is, the $t$th leader becomes the best choice with respect to all cost vectors delivered in the first $t$ rounds:

  $y^d_t = \arg\min_{x \in K} \sum_{s=1}^t \sum_{r \in F_s} c_r \cdot x$

(we use the superscript $d$ to emphasize the delayed setting). The $t$th perturbed leader optimizes over all cost vectors delivered through the first $t$ rounds in addition to the random perturbation $c_0 \in Q_\epsilon$:

  $\tilde{y}^d_t = \arg\min_{x \in K} \left( c_0 \cdot x + \sum_{s=1}^t \sum_{r \in F_s} c_r \cdot x \right)$.

In the delayed setting, follow-the-perturbed-leader chooses $x_t = \tilde{y}^d_{t-1}$ in round $t$. We claim that follow-the-perturbed-leader has a direct and simple regret bound in terms of the sum of delays $D$, which collapses to Kalai and Vempala's $O(\sqrt{T})$ regret bound in the undelayed setting.

Theorem 3.1. Let $K \subseteq \mathbb{R}^n$ be a set with $L_1$-diameter $\leq 1$, let $c_1, \dots, c_T \in \mathbb{R}^n$ with $\|c_t\|_1 \leq 1$ for all $t$, and let $\epsilon > 0$. In the presence of adversarial delays, follow-the-perturbed-leader picks points
$x_1, \dots, x_T \in K$ such that for all $y \in K$,

  $\sum_{t=1}^T \mathbb{E}[c_t \cdot x_t] \leq \sum_{t=1}^T c_t \cdot y + O(\epsilon^{-1} + \epsilon D)$.

For $\epsilon = 1/\sqrt{D}$, Theorem 3.1 implies a regret bound of $O(\sqrt{D})$. When $D$ is not known a priori, the doubling trick can be used to adjust $\epsilon$ dynamically (see the discussion following Theorem 2.1).

To analyze follow-the-perturbed-leader in the presence of delays, we introduce the notion of a prophet, who is a sort of omniscient leader who sees the feedback immediately. Formally, the $t$th prophet is the best point with respect to all the cost vectors over the first $t$ rounds:

  $z_t = \arg\min_{x \in K} (c_1 + \cdots + c_t) \cdot x$.

The $t$th perturbed prophet is the best point with respect to all the cost vectors over the first $t$ rounds, in addition to a perturbation $c_0 \in Q_\epsilon$:

  $\tilde{z}_t = \arg\min_{x \in K} (c_0 + c_1 + \cdots + c_t) \cdot x$.  (5)

The prophets and perturbed prophets behave exactly as the leaders and perturbed leaders in the setting of Kalai and Vempala with no delays. In particular, we can apply the regret bound of Kalai and Vempala to the (infeasible) strategy of following the perturbed prophet.

Lemma 3.2 ([2]). Let $K \subseteq \mathbb{R}^n$ be a set with $L_1$-diameter $\leq 1$, let $c_1, \dots, c_T \in \mathbb{R}^n$ be cost vectors bounded by $\|c_t\|_1 \leq 1$ for all $t$, and let $\epsilon > 0$. If $\tilde{z}_1, \dots, \tilde{z}_{T-1} \in K$ are chosen per equation (5), then $\sum_{t=1}^T \mathbb{E}[c_t \cdot \tilde{z}_{t-1}] \leq \sum_{t=1}^T c_t \cdot y + O(\epsilon^{-1} + \epsilon T)$ for all $y \in K$.

The analysis by Kalai and Vempala observes that when there are no delays, two consecutive perturbed leaders $\tilde{y}_t$ and $\tilde{y}_{t+1}$ are distributed similarly over the random noise [2, Lemma 3.2]. Instead, we will show that the distributions of $\tilde{y}^d_t$ and $\tilde{z}_t$ differ only in proportion to the delays. We first require a technical
We \ufb01rst require a technical\nlemma that is implicit in [2].\nLemma 3.3. Let K be a set with L1-diameter \u2264 1, and let u, v \u2208 Rn be vectors. Let y, z \u2208 Rn\nbe random vectors de\ufb01ned by y = arg miny\u2208K(q + u) \u00b7 y and z = arg minz\u2208K(q + v) \u00b7 z, where\ni=1[0, r], for some \ufb01xed length r > 0. Then, for any\n\nq is chosen uniformly at random from Q =(cid:81)n\n\nvector c,\n\nE[c \u00b7 z] \u2212 E[c \u00b7 y] \u2264 (cid:107)v \u2212 u(cid:107)1(cid:107)c(cid:107)\u221e\n\nr\n\n.\n\nProof. Let Q(cid:48) = v+Q and Q(cid:48)(cid:48) = u+Q, and write y = arg miny\u2208K q(cid:48)(cid:48)\u00b7y and z = arg minz\u2208K q(cid:48)\u00b7z,\nwhere q(cid:48) \u2208 Q(cid:48) and q(cid:48)(cid:48) \u2208 Q(cid:48)(cid:48) are chosen uniformly at random. Then\n\nE[c \u00b7 z] \u2212 E[c \u00b7 y] = Eq(cid:48)(cid:48)\u2208Q(cid:48)(cid:48) [c \u00b7 z] \u2212 Eq(cid:48)\u2208Q(cid:48)[c \u00b7 y].\n\nSubtracting P[q(cid:48) \u2208 Q(cid:48) \u2229 Q(cid:48)(cid:48)]Eq(cid:48)\u2208Q(cid:48)\u2229Q(cid:48)(cid:48)[c \u00b7 z] from both terms on the right, we have\n\nEq(cid:48)(cid:48)\u2208Q(cid:48)(cid:48) [c \u00b7 z] \u2212 Eq(cid:48)\u2208Q(cid:48)[c \u00b7 y]\n\n= P[q(cid:48)(cid:48) \u2208 Q(cid:48)(cid:48) \\ Q(cid:48)] \u00b7 Eq(cid:48)(cid:48)\u2208Q(cid:48)(cid:48)\\Q(cid:48)[c \u00b7 z] \u2212 P[q(cid:48) \u2208 Q(cid:48) \\ Q(cid:48)(cid:48)] \u00b7 Eq(cid:48)\u2208Q(cid:48)\\Q(cid:48)(cid:48) [c \u00b7 y]\n\nBy symmetry, P[q(cid:48)(cid:48) \u2208 Q(cid:48)(cid:48) \\ Q(cid:48)] = P[q(cid:48) \u2208 Q(cid:48) \\ Q(cid:48)(cid:48)], and we have,\n\nE[c \u00b7 z] \u2212 E[c \u00b7 y] \u2264 (P[q(cid:48)(cid:48) \u2208 Q(cid:48)(cid:48) \\ Q(cid:48)])Eq(cid:48)(cid:48)\u2208Q(cid:48)(cid:48)\\Q(cid:48),q(cid:48)\u2208Q(cid:48)\\Q(cid:48)(cid:48) [c \u00b7 (z \u2212 y)].\n\nBy assumption, K has L1-diameter \u2264 1, so (cid:107)y \u2212 z(cid:107)1 \u2264 1, and by H\u00f6lder\u2019s inequality, we have,\n\nE[c \u00b7 z] \u2212 
  $\mathbb{E}[c \cdot z] - \mathbb{E}[c \cdot y] \leq \mathbb{P}[q' \in Q' \setminus Q''] \, \|c\|_\infty$.

It remains to bound $\mathbb{P}[q' \in Q' \setminus Q''] = \mathbb{P}[q'' \in Q'' \setminus Q']$. If $\|v - u\|_1 \leq r$, we have

  $\mathrm{vol}(Q' \cap Q'') = \prod_{i=1}^n (r - |v_i - u_i|) = \mathrm{vol}(Q') \prod_{i=1}^n \left(1 - \frac{|v_i - u_i|}{r}\right) \geq \mathrm{vol}(Q') \left(1 - \frac{\|v - u\|_1}{r}\right)$.

Otherwise, if $\|u - v\|_1 > r$, then $\mathrm{vol}(Q' \cap Q'') = 0 \geq \mathrm{vol}(Q')(1 - \|v - u\|_1 / r)$. In either case, we have

  $\mathbb{P}[q' \in Q' \setminus Q''] = \frac{\mathrm{vol}(Q' \setminus Q'')}{\mathrm{vol}(Q')} = 1 - \frac{\mathrm{vol}(Q' \cap Q'')}{\mathrm{vol}(Q')} \leq \frac{\|v - u\|_1}{r}$,

and the claim follows. □

Lemma 3.3 could also have been proven geometrically, in similar fashion to Kalai and Vempala.

Lemma 3.4. $\sum_{t=1}^T \mathbb{E}[c_t \cdot \tilde{z}_{t-1}] - \mathbb{E}\!\left[c_t \cdot \tilde{y}^d_{t-1}\right] \leq \epsilon D$, where $D$ is the sum of delays of all cost vectors.

Proof. Let $u_t = \sum_{s=1}^t c_s$ be the sum of all costs through the first $t$ rounds, and let $v_t = \sum_{s : s + d_s \leq t} c_s$ be the sum of cost vectors actually delivered through the first $t$ rounds. Then the perturbed prophet $\tilde{z}_{t-1}$ optimizes over $c_0 + u_{t-1}$, and $\tilde{y}^d_{t-1}$ optimizes over $c_0 + v_{t-1}$.
By Lemma 3.3, for each $t$, we have
\[
\mathbb{E}_{c_0 \sim Q_\epsilon}[c_t \cdot \tilde{z}_{t-1}] - \mathbb{E}_{c_0 \sim Q_\epsilon}\big[c_t \cdot \tilde{y}^d_{t-1}\big] \leq \epsilon \cdot \|u_{t-1} - v_{t-1}\|_1 \|c_t\|_\infty \leq \epsilon \cdot |\{s < t : s + d_s \geq t\}|.
\]
Summed over all $T$ rounds, we have
\[
\sum_{t=1}^T \mathbb{E}_{c_0}[c_t \cdot \tilde{z}_{t-1}] - \mathbb{E}_{c_0}\big[c_t \cdot \tilde{y}^d_{t-1}\big] \leq \epsilon \sum_{t=1}^T |\{s < t : s + d_s \geq t\}|.
\]
The sum $\sum_{t=1}^T |\{s < t : s + d_s \geq t\}|$ charges each cost vector $c_s$ once for every round it is delayed, and therefore equals $D$. Thus $\sum_{t=1}^T \mathbb{E}_{c_0}[c_t \cdot \tilde{z}_{t-1}] - \mathbb{E}_{c_0}\big[c_t \cdot \tilde{y}^d_{t-1}\big] \leq \epsilon D$, as desired. □

Now we complete the proof of Theorem 3.1.

Proof of Theorem 3.1. By Lemma 3.4 and Lemma 3.2, we have
\[
\sum_{t=1}^T \mathbb{E}\big[c_t \cdot \tilde{y}^d_{t-1}\big] \leq \sum_{t=1}^T \mathbb{E}[c_t \cdot \tilde{z}_{t-1}] + \epsilon D \leq \min_{x \in K} \sum_{t=1}^T \mathbb{E}[c_t \cdot x] + O(\epsilon^{-1} + \epsilon D),
\]
as desired. □

4 Conclusion

We prove $O(\sqrt{D})$ regret bounds for online-gradient-descent and follow-the-perturbed-leader in the delayed setting, directly extending the $O(\sqrt{T})$ regret bounds known in the undelayed setting. More importantly, by deriving a simple bound as a function of the delays, without any restriction on the delays, we establish a simple and intuitive model for measuring delayed learning.
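The delay accounting at the heart of Lemma 3.4 is easy to check concretely. The sketch below uses a made-up horizon and delay sequence, with the convention that $d_s = 1$ corresponds to the usual undelayed setting (so that $D = T$ there, matching the abstract): it verifies that summing the number of outstanding cost vectors over all rounds equals $D$, the sum of the delays.

```python
# Made-up instance: T rounds, with delay d_s >= 1 for round s's feedback,
# which is outstanding exactly during rounds s+1, ..., s+d_s.
T = 10
d = {1: 1, 2: 4, 3: 1, 4: 2, 5: 1, 6: 3, 7: 1, 8: 1, 9: 2, 10: 1}
D = sum(d.values())  # the sum of all delays (17 here)

def outstanding(t):
    """|{s < t : s + d_s >= t}|: cost vectors still undelivered at round t."""
    return sum(1 for s in d if s < t and s + d[s] >= t)

# Summing over rounds charges each c_s once per round it is delayed, so the
# total equals D. (We sum t past T so no charge is truncated at the horizon.)
total = sum(outstanding(t) for t in range(1, T + max(d.values()) + 1))
assert total == D
```

With all $d_s = 1$ the same computation charges each round exactly once, recovering $D = T$ for the undelayed setting.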
This work suggests natural relationships between the regret bounds of online learning algorithms and delays in the feedback. Beyond analyzing existing algorithms, we hope that optimizing the regret as a function of $D$ may inspire different (and hopefully simple) algorithms that readily model real-world applications and scale nicely to distributed environments.

Acknowledgements We thank Avrim Blum for introducing us to the area of online learning and helping us with several valuable discussions. We thank the reviewers for their careful and insightful reviews: finding errors, referencing relevant works, and suggesting a connection to mirror descent.

References

[1] M. Zinkevich. Online convex programming and generalized infinitesimal gradient ascent. In Proc. 20th Int. Conf. Mach. Learning (ICML), pages 928–936, 2003.

[2] A. Kalai and S. Vempala. Efficient algorithms for online decision problems. J. Comput. Sys. Sci., 71:291–307, 2005. Extended abstract in Proc. 16th Ann. Conf. Comp. Learning Theory (COLT), 2003.

[3] A. Blum. On-line algorithms in machine learning. In A. Fiat and G. Woeginger, editors, Online algorithms, volume 1442 of LNCS, chapter 14, pages 306–325. Springer Berlin Heidelberg, 1998.

[4] S. Shalev-Shwartz. Online learning and online convex optimization. Found. Trends Mach. Learn., 4(2):107–194, 2011.

[5] E. Hazan. Introduction to online convex optimization. Internet draft available at http://ocobook.cs.princeton.edu, 2015.

[6] E. Takimoto and M. Warmuth. Path kernels and multiplicative updates. J. Mach. Learn. Research, 4:773–818, 2003.

[7] D. Helmbold and R. Schapire. Predicting nearly as well as the best pruning of a decision tree. Mach. Learn. J., 27(1):61–68, 1997.

[8] A. Blum, S. Chawla, and A. Kalai. Static optimality and dynamic search optimality in lists and trees. Algorithmica, 36(3):249–260, 2003.

[9] T. M. Cover. Universal portfolios. Math. Finance, 1(1):1–29, 1991.

[10] K. Crammer and Y. Singer. A family of additive online algorithms for category ranking. J. Mach. Learn. Research, 3:1025–1058, 2003.

[11] X. He, J. Pan, O. Jin, T. Xu, B. Liu, T. Xu, Y. Shi, A. Atallah, R. Herbrich, S. Bowers, and J. Quiñonero Candela. Practical lessons from predicting clicks on ads at Facebook. In Proc. 20th ACM Conf. Knowl. Disc. and Data Mining (KDD), pages 1–9. ACM, 2014.

[12] S. Amuru and R. M. Buehrer. Optimal jamming using delayed learning. In 2014 IEEE Military Comm. Conf. (MILCOM), pages 1528–1533. IEEE, 2014.

[13] I. Menache, O. Shamir, and N. Jain. On-demand, spot, or both: Dynamic resource allocation for executing batch jobs in the cloud. In 11th Int. Conf. on Autonomic Comput. (ICAC), 2014.

[14] M. J. Weinberger and E. Ordentlich. On delayed prediction of individual sequences. IEEE Trans. Inf. Theory, 48(7):1959–1976, 2002.

[15] P. Joulani, A. György, and C. Szepesvári. Online learning under delayed feedback. In Proc. 30th Int. Conf. Mach. Learning (ICML), volume 28, 2013.

[16] D. P. Bertsekas and J. N. Tsitsiklis. Parallel and Distributed Computation: Numerical Methods. Prentice-Hall, 1989.

[17] B. Recht, C. Re, S. Wright, and F. Niu. Hogwild: a lock-free approach to parallelizing stochastic gradient descent. In Adv. Neural Info. Proc. Sys. 24 (NIPS), pages 693–701, 2011.

[18] J. Duchi, M. I. Jordan, and B. McMahan. Estimation, optimization, and parallelism when data is sparse. In Adv. Neural Info. Proc. Sys. 26 (NIPS), pages 2832–2840, 2013.

[19] H. B. McMahan and M. Streeter. Delay-tolerant algorithms for asynchronous distributed online learning. In Adv. Neural Info. Proc. Sys. 27 (NIPS), pages 2915–2923, 2014.

[20] J. Duchi, E. Hazan, and Y. Singer. Adaptive subgradient methods for online learning and stochastic optimization. J. Mach. Learn. Research, 12:2121–2159, July 2011.

[21] J. Langford, A. J. Smola, and M. Zinkevich. Slow learners are fast. In Adv. Neural Info. Proc. Sys. 22 (NIPS), pages 2331–2339, 2009.

[22] J. Liu, S. J. Wright, C. Ré, V. Bittorf, and S. Sridhar. An asynchronous parallel stochastic coordinate descent algorithm. J. Mach. Learn. Research, 16:285–322, 2015.

[23] J. C. Duchi, T. Chaturapruek, and C. Ré. Asynchronous stochastic convex optimization. CoRR, abs/1508.00882, 2015. To appear in Adv. Neural Info. Proc. Sys. 28 (NIPS), 2015.

[24] S. J. Wright. Coordinate descent algorithms. Math. Prog., 151(1):3–34, 2015.

[25] D. Riabko. On the flexibility of theoretical models for pattern recognition. PhD thesis, University of London, April 2005.

[26] N. Cesa-Bianchi, Y. Freund, D. Haussler, D. P. Helmbold, R. E. Schapire, and M. K. Warmuth. How to use expert advice. J. Assoc. Comput. Mach., 44(3):426–485, 1997.