{"title": "Horizon-Independent Minimax Linear Regression", "book": "Advances in Neural Information Processing Systems", "page_first": 5259, "page_last": 5268, "abstract": "We consider online linear regression: at each round, an adversary reveals a covariate vector, the learner predicts a real value, the adversary reveals a label, and the learner suffers the squared prediction error. The aim is to minimize the difference between the cumulative loss and that of the linear predictor that is best in hindsight. Previous work demonstrated that the minimax optimal strategy is easy to compute recursively from the end of the game; this requires the entire sequence of covariate vectors in advance. We show that, once provided with a measure of the scale of the problem, we can invert the recursion and play the minimax strategy without knowing the future covariates. Further, we show that this forward recursion remains optimal even against adaptively chosen labels and covariates, provided that the adversary adheres to a set of constraints that prevent misrepresentation of the scale of the problem. This strategy is horizon-independent in that the regret and minimax strategies depend on the size of the constraint set and not on the time-horizon, and hence it incurs no more regret than the optimal strategy that knows in advance the number of rounds of the game. We also provide an interpretation of the minimax algorithm as a follow-the-regularized-leader strategy with a data-dependent regularizer and obtain an explicit expression for the minimax regret.", "full_text": "Horizon-Independent Minimax Linear Regression\n\nAlan Malek\n\nLaboratory for Information and Decision Systems\n\nMassachusetts Institute of Technology\n\n77 Massachusetts Avenue\n\nCambridge, MA 02139-4307, USA amalek@mit.edu\n\nPeter L. 
Bartlett\n\nDepartment of EECS and Statistics\n\nUniversity of California\n\nBerkeley, CA 94720-1776, USA\nbartlett@cs.berkeley.edu\n\nAbstract\n\nWe consider online linear regression: at each round, an adversary reveals a covariate\nvector, the learner predicts a real value, the adversary reveals a label, and the learner\nsuffers the squared prediction error. The aim is to minimize the difference between\nthe cumulative loss and that of the linear predictor that is best in hindsight. Previous\nwork demonstrated that the minimax optimal strategy is easy to compute recursively\nfrom the end of the game; this requires the entire sequence of covariate vectors in\nadvance. We show that, once provided with a measure of the scale of the problem,\nwe can invert the recursion and play the minimax strategy without knowing the\nfuture covariates. Further, we show that this forward recursion remains optimal even\nagainst adaptively chosen labels and covariates, provided that the adversary adheres\nto a set of constraints that prevent misrepresentation of the scale of the problem.\nThis strategy is horizon-independent in that the regret and minimax strategies\ndepend on the size of the constraint set and not on the time-horizon, and hence it\nincurs no more regret than the optimal strategy that knows in advance the number\nof rounds of the game. We also provide an interpretation of the minimax algorithm\nas a follow-the-regularized-leader strategy with a data-dependent regularizer and\nobtain an explicit expression for the minimax regret.\n\n1\n\nIntroduction\n\nLinear regression is a fundamental prediction problem in machine learning and statistics. 
In this paper, we study a sequential version: on round t, the adversary chooses and reveals a covariate vector x_t ∈ R^d, the learner makes a real-valued prediction ŷ_t, the adversary chooses and reveals the true outcome y_t ∈ R, and finally the learner is penalized by the square loss, (ŷ_t − y_t)².
Since it is hopeless to guarantee a small loss (the adversary can always cause constant loss per round), we instead aim to guarantee that we are able to predict almost as well as the best fixed linear predictor in hindsight. Letting x_s^t and y_s^t denote x_s, . . . , x_t and y_s, . . . , y_t, respectively, define the regret of a strategy that predicts ŷ_1^T as

R_T(ŷ_1^T, x_1^T, y_1^T) := ∑_{t=1}^T (ŷ_t − y_t)² − min_{θ∈R^d} ∑_{t=1}^T (θᵀx_t − y_t)².

32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montréal, Canada.

A strategy s : ∪_{t≥1} (R^d × R)^{t−1} × R^d → R is a map from observations to predictions, and we define R_T(s, x_1^T, y_1^T) := R_T(ŷ_1^T, x_1^T, y_1^T) where ŷ_t = s(x_1, y_1, . . . , x_{t−1}, y_{t−1}, x_t). Our goal is to find a strategy that guarantees low regret for all data sequences. In particular, this paper is concerned with the minimax strategy s*, which is the strategy that minimizes the worst-case regret over all possible covariate and outcome sequences in some constraint set, i.e. s* satisfies

max_{x_1^T, y_1^T} R_T(s*, x_1^T, y_1^T) = min_s max_{x_1^T, y_1^T} R_T(s, x_1^T, y_1^T).

In general, computing minimax strategies is computationally intractable because the optimal prediction ŷ_t depends on the complete history (x_1, y_1, . . . 
, x_{t−1}, y_{t−1}, x_t), and the dependence might be a rather arbitrary function of this enormous space of histories. So it is surprising that, in the case of fixed-design linear regression (where the strategy knows the covariate sequence in advance), the minimax strategy can be efficiently computed [Bartlett et al., 2015].
This paper builds on results from Bartlett et al. [2015], which studied fixed-design online linear regression, where the game length T and covariates x_1^T := x_1, . . . , x_T are known to the learner a priori. Under constraints on the adversarial labels y_1^T, the value function and minimax strategy were calculable in closed form using backwards induction. The resulting minimax strategy

ŷ_{t+1} = x_{t+1}ᵀ P_{t+1} ∑_{s=1}^t y_s x_s   (MMS)

is a simple, linear predictor with coefficient matrices defined by

P_T = (∑_{t=1}^T x_t x_tᵀ)†  and the recursion  P_t = P_{t+1} + P_{t+1} x_{t+1} x_{t+1}ᵀ P_{t+1}.   (1)

The ŷ_t is a function of the whole sequence x_1^T, and thus an extension to online design seems difficult.

Our contributions  This paper extends the fixed-design setting to adversarial design, where neither the covariates nor the length of the game are fixed a priori. We use {x_t} and {y_t} to denote arbitrary-length sequences of covariates and labels, respectively. We allow the adversary to play any covariate sequence in some constraint set X and labels in some set Y({x_t}) (which may depend on the covariates).
In particular, we identify a family X, Y parameterized by a positive-definite matrix Σ, representing the size of future covariates, and a scalar γ_0, representing the size of the future labels, and present a strategy that is minimax optimal against all adversarial sequences in this family. The algorithm need only know Σ, and the guarantee is horizon-independent in the sense that the family does not constrain the length of the covariate sequence and includes covariate sequences of arbitrary length for any Σ, γ_0 pair.
The protocol of the general, horizon-independent setting is outlined in Figure 1.

Given: covariate constraints X and label constraints Y({x_t})
For t = 1, 2, . . . ,
• Adversary chooses x_t s.t. x_1^t ∈ X
• Learner predicts ŷ_t
• Adversary may end the game
• Adversary reveals y_t s.t. y_1^t ∈ Y(x_1^t)
• Learner incurs loss (ŷ_t − y_t)²
• The game ends if no x_{t+1} exists such that x_1^{t+1} ∈ X

Figure 1: Adversarial Covariates Protocol

We derive the minimax strategy and show that it is optimal in the following way.
Definition 1. A strategy s* is horizon-independent minimax optimal for some class X of covariate sequences and some class Y({x_t}) of label sequences, possibly depending on {x_t} ∈ X, if

sup_T ( sup_{x_1^T∈X, y_1^T∈Y(x_1^T)} R_T(s*, x_1^T, y_1^T) − min_s sup_{x_1^T∈X, y_1^T∈Y(x_1^T)} R_T(s, x_1^T, y_1^T) ) = 0.

We require s* to have regret no larger than even a strategy that knows T.
In other words, we establish a more natural measure of game length than the number of rounds. The covariate constraints on {x_t} ensure that the adversary respects the scale constraint Σ so that the learner is not led to under-regularize or over-regularize. 
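As a concrete numerical illustration of the fixed-design strategy just described, the backward recursion (1) and the predictor (MMS) can be sketched in a few lines. This is our own minimal sketch, not code from the paper; the function names are ours.

```python
import numpy as np

def backward_coefficients(X):
    """Coefficient matrices for the fixed-design game via the backward
    recursion (1): P_T = (sum_t x_t x_t^T)^+ and
    P_t = P_{t+1} + P_{t+1} x_{t+1} x_{t+1}^T P_{t+1}.
    X is a (T, d) array whose rows are the covariates x_1, ..., x_T."""
    T, _ = X.shape
    P = [None] * (T + 1)            # P[t] holds P_t; P[0] is the pre-game matrix
    P[T] = np.linalg.pinv(X.T @ X)  # P_T = Pi_T^dagger
    for t in range(T - 1, -1, -1):
        x = X[t]                    # x_{t+1} in the paper's 1-indexed notation
        P[t] = P[t + 1] + P[t + 1] @ np.outer(x, x) @ P[t + 1]
    return P

def mms_predict(P, X, y, t):
    """(MMS): yhat_{t+1} = x_{t+1}^T P_{t+1} s_t, with the summary statistic
    s_t = sum_{s<=t} y_s x_s. Here t counts rounds already played."""
    s = X[:t].T @ np.asarray(y[:t], dtype=float)
    return float(X[t] @ P[t + 1] @ s)
```

For instance, with d = 1 and two unit covariates one gets P_2 = 1/2, P_1 = 3/4, P_0 = 21/16, and the second-round prediction x_2ᵀP_2s_1 = y_1/2 is a shrunken version of the follow-the-leader prediction y_1.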
The minimax strategy is efficient and is simultaneously minimax optimal against all covariate sequences corresponding to Σ.
We motivate our constraint set by showing that every condition is necessary, and we also cast the minimax strategy as a follow-the-regularized-leader strategy with a data-dependent regularizer. Finally, we provide a general regret upper bound.

Outline  We begin with a review of how backwards induction is used to derive the fixed-design minimax algorithm (MMS) in Section 2. By inverting the recursion, we show in Section 3 how to calculate (MMS) given only P_0, and thus we have the minimax strategy for any covariate sequence that perfectly agrees with the given P_0.
Section 4 greatly expands the scope of our algorithm by deriving weaker conditions on the adversary and proves that, under these conditions, the same minimax strategy is horizon-independent minimax optimal. We argue that these conditions are necessary. We then interpret the minimax strategy as follow-the-regularized-leader with a specific, data-dependent regularizer in Section 5.

Related Work  While linear regression has a long history in statistics and optimization, its online sibling is much more recent, starting with the work of Foster [1991], which considered binary labels and ℓ1-constrained parameters θ. He proved an O(d log(dT)) regret bound for an ℓ2-regularized follow-the-leader strategy. Cesa-Bianchi et al. [1996] considered ℓ2-constrained parameters and gave O(√T) regret bounds for a gradient descent algorithm with ℓ2 regularization. Kivinen and Warmuth [1997] showed that an Exponentiated Gradient algorithm with relative entropy gives the same regret without the need for a constraint on the parameters. 
Vovk [1998] applied the Aggregating Algorithm [Vovk, 1990] to continuously many experts and arrived at a scale-free algorithm by using the inverse second moment matrix of past and current covariates. Forster [1999] and Azoury and Warmuth [2001] showed that this algorithm is last-step minimax and achieves an O(log T) scale-dependent regret bound. (See also the work of Moroshko and Crammer [2014] on last-step minimax.)
Takimoto and Warmuth [2000] obtained the minimax strategy for prediction in Euclidean space with squared loss. This was extended to more general losses in [Koolen et al., 2014] and to tracking problems in [Koolen et al., 2015]. Finally, Bartlett et al. [2015] obtained the minimax strategy for fixed-design linear regression. We present this strategy in the next section, because we build on these results. In these papers, the minimax analysis provides a natural, data-dependent regularization, in contrast to the follow-the-leader methods described above. We make this comparison explicit in Section 5, by calculating the implied regularization.

2 Fixed Design Linear Regression

We begin by summarizing the main results of Bartlett et al. [2015]. Recall that in the fixed-design setting, the game length T and covariates x_1^T are fixed and known to both players. Define the summary statistics s_t := ∑_{s=1}^t y_s x_s, σ_t² := ∑_{s=1}^t y_s², and Π_t := ∑_{s=1}^t x_s x_sᵀ. The minimax strategy can be computed by solving the offline problem min_θ ∑_{t=1}^T (x_tᵀθ − y_t)² = ∑_{t=1}^T y_t² − s_Tᵀ Π_T† s_T, where M† is the pseudo-inverse of matrix M.
The optimal actions ŷ_t and y_t are computed as a function of the state s_{t−1} and covariates x_1^T by solving the backward induction

V(s_t, σ_t², t, x_1^T) := min_{ŷ_{t+1}} max_{y_{t+1}} ( (ŷ_{t+1} − y_{t+1})² + V(s_t + y_{t+1}x_{t+1}, σ_t² + y_{t+1}², t + 1, x_1^T) )

with base case V(s_T, σ_T², T, x_1^T) := −min_{θ∈R^d} ∑_{t=1}^T (θᵀx_t − y_t)². The arguments of V include x_1^T to emphasize the fixed-design setting. Performing the backwards induction generates plays ŷ_1^T and y_1^T that witness the value of the game,

min_{ŷ_1} max_{y_1} · · · min_{ŷ_T} max_{y_T} ∑_{t=1}^T (ŷ_t − y_t)² − min_{w∈R^d} ∑_{t=1}^T (wᵀx_t − y_t)²,

which is the minimum guaranteeable regret against all data sequences. The resulting minimax strategy is precisely the linear predictor ŷ_{t+1} = x_{t+1}ᵀ P_{t+1} s_t, (MMS), with coefficient matrices defined by the recursion (1). Note that P_t is a function of every covariate in x_1^T. The minimax strategy is similar to follow-the-leader, which would predict with Π_t† in place of P_t; however, P_t is a shrunken version of Π_t† that takes future covariances into account.
The main result of Bartlett et al. [2015] is the minimax optimality of (MMS) for the following classes. For some fixed sequence of positive label budgets B_1, . . . , B_T > 0, define

1. Label constraints on y_t: L(B_1^T) := {y_1^T : |y_t| ≤ B_t ∀t = 1, . . . , T}
2. Box constraints on x_t: B(B_1^T) := {x_1^T : B_t ≥ ∑_{s=1}^{t−1} |x_tᵀP_tx_s| B_s for 2 ≤ t ≤ T}
3. 
Ellipsoidal constraints: E(x_1^T, R) := {y_1^T : ∑_{t=1}^T y_t² x_tᵀP_tx_t ≤ R}.

Theorem 1. [Bartlett et al., 2015, Theorems 2 and 10] For each x_1^T, the corresponding strategy (MMS) is minimax optimal with respect to B(B_1^T) if y_1^T ∈ L(B_1^T) and with respect to E(x_1^T, R), for any B_t > 0 sequence and any R > 0, in the following sense:

(1) If x_1^T ∈ B(B_1^T), then sup_{y_1^T∈L(B_1^T)} R_T((MMS), x_1^T, y_1^T) = min_s sup_{y_1^T∈L(B_1^T)} R_T(s, x_1^T, y_1^T) = ∑_{t=1}^T B_t² x_tᵀP_tx_t.
(2) sup_{y_1^T∈E(x_1^T,R)} R_T((MMS), x_1^T, y_1^T) = min_s sup_{y_1^T∈E(x_1^T,R)} R_T(s, x_1^T, y_1^T) = R.

3 The Forward Algorithm

The previous section described the fixed-design minimax strategy and established sufficient conditions for its optimality. Unfortunately, P_t is recursively defined as a function of the entire x_1^T sequence. In this section, we show that it is possible to remove the fixed-design and known-game-length requirement if we limit the adversary to play sequences that follow the Adversarial Covariate conditions. Letting X^∞ = ∪_{T>0} (R^d)^T denote the set of covariate sequences of finite length, define

A(Σ) := {x_1^T ∈ X^∞ : for P_0, . . . , P_T defined by (1), P_0† ⪯ Σ}, and
Ā(Σ) := {x_1^T ∈ X^∞ : for P_0, . . . , P_T defined by (1), P_0† = Σ};   (2)

that is, x_1^T ∈ A(Σ) if the P_t computed by applying (1) to the sequence x_1^T results in P_0† ⪯ Σ.
The key insight of this section is that it is possible to invert the P_t recursion: we can compute P_t from P_{t−1} and x_t. Hence, if we are given P_0, then we can compute every P_t online. 
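This inversion is a single rank-one update per round: one step maps (P_{t−1}, x_t) to P_t. The sketch below is our own illustration of the forward recursion (3) stated next; the name `forward_step` is ours.

```python
import numpy as np

def forward_step(P_prev, x):
    """One step of the forward recursion (3): given P_{t-1} and x_t, return P_t.
    Uses b2 = x_t^T P_{t-1} x_t and a = (sqrt(4 b2 + 1) - 1) / (sqrt(4 b2 + 1) + 1)."""
    b2 = float(x @ P_prev @ x)
    r = np.sqrt(4.0 * b2 + 1.0)
    a = (r - 1.0) / (r + 1.0)
    return P_prev - (a / b2) * (P_prev @ np.outer(x, x) @ P_prev)
```

Only P_{t−1} and x_t are needed, which is the source of the O(d²) per-round memory and computation noted below.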
For some initial condition Σ, define the forward recursion with base case P_0 = Σ† and induction step

P_t := P_{t−1} − (a_t / b_t²) P_{t−1} x_t x_tᵀ P_{t−1},  where  b_t² := x_tᵀ P_{t−1} x_t,  a_t := (√(4b_t² + 1) − 1) / (√(4b_t² + 1) + 1).   (3)

The prediction matrix P_t is a function of Σ and x_1^t only. For the rest of the paper, we will define (MMS) with respect to the forward recursion, i.e. ŷ_t := x_tᵀ P_t s_{t−1}, where P_t is defined by recursion (3). The calculation of ŷ_t only requires knowledge of Σ, x_1^t, and y_1^{t−1}, all of which are available to the learner when choosing ŷ_t. The algorithm needs O(d²) memory and at each round the computational complexity is O(d²). It is essential that the two recursions are equivalent, which is guaranteed by the following lemma.
Lemma 1. Let Σ ⪰ 0 be a positive semidefinite matrix. For any covariate sequence x_1^T ∈ Ā(Σ), the P_t matrices defined by the backwards recursion (1) applied to x_1^T are identical to the P_t matrices defined by the forward recursion (3) with base case P_0 = Σ† and updates given by x_1^T.
Proof. Let P′_t be defined by the forwards recursion starting from P_0 = Σ† and let P_t be defined by the backwards recursion (1). Our goal is to show that P_t = P′_t for all t. The base case P_0 = P′_0 is a simple consequence of [Bartlett et al., 2015, Lemma 11], which uses repeated applications of Sherman-Morrison to show that

P_t† = Π_t + ∑_{s=t+1}^T (x_sᵀ P_s x_s / (1 + x_sᵀ P_s x_s)) x_s x_sᵀ.   (4)

Now, assuming the induction hypothesis P′_{t−1} = P_{t−1}, we can evaluate

P′_t = P_{t−1} − (a_t / b_t²) P_{t−1} x_t x_tᵀ P_{t−1}
    = P_t + P_t x_t x_tᵀ P_t − (a_t / b_t²) (P_t + P_t x_t x_tᵀ P_t) x_t x_tᵀ (P_t + P_t x_t x_tᵀ P_t)
    = P_t + P_t x_t x_tᵀ P_t ( 1 − (a_t / b_t²) ( 1 + 2 x_tᵀ P_t x_t + (x_tᵀ P_t x_t)² ) ).   (5)

By definition, we have b_t² = x_tᵀ P_{t−1} x_t = x_tᵀ P_t x_t + (x_tᵀ P_t x_t)², which we can invert to find that x_tᵀ P_t x_t = (√(4b_t² + 1) − 1)/2. Plugging this in, the term in the parenthesis in (5) is

1 − (a_t / b_t²) (1 + 2 x_tᵀ P_t x_t + (x_tᵀ P_t x_t)²)
= 1 − (a_t / b_t²) · ((√(4b_t² + 1) + 1)/2)²
= 1 − (1 / b_t²) · (√(4b_t² + 1) − 1)(√(4b_t² + 1) + 1)/4
= 1 − (1 / b_t²) · (4b_t²/4) = 0,

implying that P′_t = P_t, as desired.
Our first result is that this algorithm is actually minimax optimal if we constrain the adversary to play in Ā(Σ). 
Another interpretation is that Σ encodes all the necessary scale information the learner needs to respond optimally. That is, (MMS) performs as well as the best strategy that sees the covariate sequence in advance. In particular, knowledge of Σ, not T, is necessary for the learner.
Theorem 2. For all positive semidefinite Σ, label bounds B_1, B_2, . . . > 0, and constants b > 0 and R > 0, the minimax strategy (MMS) using the forward recursion (3) starting from P_0 = Σ† is horizon-independent minimax optimal, i.e.

sup_T ( sup_{x_1^T∈X, y_1^T∈Y(x_1^T)} R_T(s*, x_1^T, y_1^T) − min_s sup_{x_1^T∈X, y_1^T∈Y(x_1^T)} R_T(s, x_1^T, y_1^T) ) = 0

for (X, Y(x_1^T)) equal to either (Ā(Σ), E(x_1^T, R)) or (B(B_1^T) ∩ Ā(Σ), L(B_1^T)).
Proof of Theorem 2. Since x_1^T ∈ Ā(Σ), Lemma 1 implies that the P_t matrices from the forwards and backwards recursions are equivalent, and therefore (MMS) corresponds to the minimax strategy for the fixed-design game with P_0† = Σ. We can then apply Theorem 1, part (1), which yields

sup_{y_1^T∈L(B_1^T)} R_T(s*, x_1^T, y_1^T) − min_s sup_{y_1^T∈L(B_1^T)} R_T(s, x_1^T, y_1^T) = 0.

Since this holds for all x_1^T, we actually get the stronger result

sup_T sup_{x_1^T∈B(B_1^T)∩Ā(Σ)} ( sup_{y_1^T∈L(B_1^T)} R_T(s*, x_1^T, y_1^T) − min_s sup_{y_1^T∈L(B_1^T)} R_T(s, x_1^T, y_1^T) ) = 0.

Identical reasoning extends part (2) of Theorem 1 to the adversarial covariate context.

The adversarial covariate conditions are defined for entire x_1^T sequences, but there is an online characterization, derived from the following lemma.
Lemma 2. Consider any t ≥ 0, x_1, . . . , x_t, and symmetric matrix P ⪰ 0. We have that P† ⪰ Π_t if and only if, for any T ≥ t + rank(P† − Π_t), there is a continuation of the covariate sequence, x_{t+1}, . . . , x_T, such that setting P_t = P and defining P_{t+1}, . . . , P_T by the forward recursion (3) gives P_T† = Π_T.

A stronger version with proof is presented in the Appendix as Theorem 6 and explicitly derives conditions on x_{t+1} that ensure P† ⪰ Π_t. In words, a sequence of covariates x_1^t is the prefix of some x_1^T ∈ Ā(Σ) if P_s† ⪰ Π_s for all s ≤ t, where P_s corresponds to the forward recursion (3) defined by initial condition P_0 = Σ† and covariates x_1^t. Hence, it is equivalent to constrain the adversary to play x_t satisfying this condition at every round, and we do not require the adversary to fix the covariate sequence in advance; it is equivalent to define

A(Σ) = {x_1^T ∈ X^∞ : P_0† = Σ and P_t† ⪰ Π_t ∀t ≥ 1}, and   (6)
Ā(Σ) = {x_1^T ∈ X^∞ : P_0† = Σ, P_t† ⪰ Π_t ∀t ≥ 1, and P_T† = Π_T}.   (7)

4 Expanding the Minimax Conditions

The strategy (MMS) is minimax optimal for any covariate sequence x_1^T ∈ Ā(Σ) if the adversary 
These conditions allow for adversarial design; the data may be chosen in response\nto the learner\u2019s actions.\nA natural relaxation is to remove the equality constraints; this results in a set of constraints on the\nadversary where the labels {yt} are in L({Bt}) := {yt : |yt| \u2264 Bt\u2200t \u2265 1}, and the covariates {xt}\nare in A (\u03a3) \u2229 B (\u03a3), where B (\u03a3) =\nThe B(\u03a3) condition is necessary for an ef\ufb01cient algorithm [Bartlett et al., 2015], and without the\nA(\u03a3) condition, the adversary could choose xt to be a scaled version of st\u22121 and yt = \u03b8\u2217\nt\u22121xt,\nwhere \u03b8\u2217\n. The comparator will never suffer\nmore regret, the algorithm will suffer some regret, and we can scale xt such that the B(\u03a3) conditions\nare satis\ufb01ed. To summarize, without the A constraint, the adversary can cause arbitrary regret.\nHowever, the A and B constraints are not suf\ufb01cient to guarantee a solvable game:\nLemma 3. Fix any \u03a3 and any {Bt} with Bt \u2265 b > 0 for all t. Then, for any M > 0, there exists\n1 \u2208 A(\u03a3) \u2229 B(\u03a3) and yT\nxT\nA covariate budget is not suf\ufb01cient for a minimax algorithm; it is not even clear how to de\ufb01ne minimax\nwhen the regrets are not bounded. Hence, we will introduce continuation constraints (the name will\nbecome clear soon). Let \u03b30 > 0 be some initial label budget and de\ufb01ne \u03b3t = \u03b3t\u22121 \u2212 B2\nt Ptxt,\n1) := {\u03be \u2208 Rt : |\u03bei| \u2264 Bi, i = 1, . . . , t} be\nwith Pt de\ufb01ned by the forward recursion (3). Let B\u221e(Bt\nthe hypercube with sides of length B1, . . . , Bt and Xt be the matrix with columns x1, . . . , xt. 
For a given covariate budget Σ and label budget γ_0, define the continuation condition

C(Σ, γ_0) := {x_1^T : γ_t ≥ ξᵀ X_tᵀ (Π_t† − P_t) X_t ξ  ∀ξ ∈ B_∞(B_1^t) and t = 1, . . . , T},   (8)

which is equivalent to requiring that s_tᵀ(Π_t† − P_t)s_t ≤ γ_t for all possible s_t.
The rest of this section proves the main result of this paper: if the adversary plays in ABC(Σ, γ_0) := A(Σ) ∩ B(Σ) ∩ C(Σ, γ_0), then (MMS) is minimax optimal.
Theorem 3. Consider the two player game defined in Figure 1. For any {B_t} > 0, Σ ≻ 0 and γ_0 ≥ 0, the player strategy (MMS) has minimax regret γ_0 and is horizon-independent minimax optimal for x_1^T ∈ X = ABC(Σ, γ_0) and y_1^T ∈ Y = L(B_t). That is,

sup_T ( sup_{x_1^T∈X, y_1^T∈Y} R_T((MMS), x_1^T, y_1^T) − min_s sup_{x_1^T∈X, y_1^T∈Y} R_T(s, x_1^T, y_1^T) ) = 0.

We will prove Theorem 3 by first considering adversarial strategies under A(Σ) with a fixed game length. We show that, somewhat counterintuitively, the adversary may cause more regret by not using the entire Σ budget. Then, we show that the C condition eliminates these troublesome cases and the adversary exhausts the budget; therefore, the adversary plays x_1^T ∈ Ā(Σ), which implies that (MMS) is minimax optimal by results of the previous section. Finally, we note that all the previous arguments apply uniformly across T, and since (MMS) is ignorant of T, it must be horizon-independent minimax optimal. The Σ constraint, not the game length, seems to be the correct notion of game size.

4.1 Limiting T

Consider a fixed T > 0 and define A_T(Σ) := {x_1^T ∈ (R^d)^T : P_0† = Σ and P_t† ⪰ Π_t ∀ 1 ≤ t ≤ T}, the restriction of A(Σ) to sequences of length T. The goal of this section is to show i) that it is possible for the adversary to cause more regret by not using up the covariance budget, i.e. P_T† ≻ Π_T, and ii) that the C conditions are sufficient to stop this.
We cannot calculate the minimax solution of A_T(Σ) directly. Section G in the appendix explicitly evaluates the first backwards induction step; it is quite complicated and has no closed form solution, and this suggests that efficient backwards induction is unlikely. Instead, we will study the related fixed-design early-stopping game. For some fixed x_1^T, the game protocol is: at round t, the learner predicts ŷ_t, the adversary chooses e_t ∈ {0, 1} and y_t ∈ L(B_1^T). If e_t = 1, the learner incurs loss (ŷ_t − y_t)² and the game continues, but if e_t = 0, the game ends. Intuitively, the adversary may be able to cause more regret because the learner is regularizing for a covariance budget corresponding to x_1^T, and therefore ending the game early causes the learner to over-regularize.
We will derive C as a condition where the adversary always continues to T. In turn, this implies that the adversary will use up the Σ budget in the A_T game: any x_1^T with remaining Σ budget has a continuation x_{T+1}, . . . , x_{T+k} with x_1^{T+k} ∈ Ā(Σ) by Lemma 2, and the C condition implies that the adversary will continue until T + k and use up the budget. We will make this argument formal.
We begin by defining an incremental version of regret. 
Define Δ*_t := min_{θ∈R^d} ∑_{s=1}^t (θᵀx_s − y_s)² − min_{θ′∈R^d} ∑_{s=1}^{t−1} (θ′ᵀx_s − y_s)², the additional loss suffered by the comparator from playing t rounds instead of t − 1 rounds. We have Δ*_t ≥ 0 and L*_T = ∑_{t=1}^T Δ*_t. The regret of the game with early stopping can be written as R_T = ∑_{t=1}^T (∏_{s=1}^t e_s) ((y_t − ŷ_t)² − Δ*_t). One might notice that Δ*_t = 0 for the choice y_t = θ*_{t−1}ᵀx_t, where θ*_{t−1} is the ordinary least squares solution on data through time t − 1, and the regret always increases. However, this choice of y_t may violate the label constraints, in particular, for B_t = 1 and x_t ∈ R increasing. Additionally, we want a constraint where the adversary wants to play all remaining rounds, not just the next one, and hence the constraint on y_t will depend on the future covariates.
The value-to-go definition also needs to be adapted to the incremental setting. To this end, we define the instantaneous value-to-go W(s_t, σ_t², t, x_1^T) by W(s_T, σ_T², T, x_1^T) = 0 and

W(s_{t−1}, σ_{t−1}², t − 1, x_1^T) = max_{e_t∈{0,1}} e_t ( min_{ŷ_t} max_{y_t} (ŷ_t − y_t)² − Δ*_t + W(s_t, σ_t², t, x_1^T) ),

where the statistics are updated as s_t = s_{t−1} + y_t x_t and σ_t² = σ_{t−1}² + y_t². It is easy to check that W_0 is the minimax regret for this game and that it equals the regret of the fixed design game when the adversary plays every round.

4.2 Calculating the Instantaneous Value-to-go

This section derives C as the condition where e_t = 1 for all t and evaluates W_t. 
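The increments Δ*_t can be computed directly from two least-squares fits; the following small sketch is ours (not from the paper) and simply mirrors the definition:

```python
import numpy as np

def offline_loss(X, y):
    """min over theta of sum_s (theta^T x_s - y_s)^2, via least squares."""
    if len(y) == 0:
        return 0.0
    theta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return float(np.sum((X @ theta - y) ** 2))

def delta_star(X, y, t):
    """Delta*_t: extra comparator loss from being scored on the first t rounds
    instead of the first t - 1 (t is 1-indexed)."""
    return offline_loss(X[:t], y[:t]) - offline_loss(X[:t - 1], y[:t - 1])
```

For example, with x_1 = x_2 = 1, the choice y_2 = θ*_1 x_2 = y_1 gives Δ*_2 = 0, matching the observation above, while y = (0, 2) gives Δ*_2 = 2.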
Throughout, R(M) denotes the row space of matrix M. Proofs from this section are heavy on calculation and have been collected in Appendix B. We begin by explicitly calculating Δ*_t.
Lemma 4. The marginal loss for the comparator of playing another round with covariate x = x_∥ + x_⊥, where x_∥ ∈ R(Π_{t−1}) and x_⊥ is its orthogonal complement, is

Δ*_t = y_t² (1 − x_tᵀΠ_t†x_t) − 2 y_t s_{t−1}ᵀΠ_t†x_t + (s_{t−1}ᵀΠ_t†x_t)² (x_tᵀΠ_{t−1}†x_t / x_tᵀΠ_t†x_t).

Theorem 4. Consider the fixed-design game with early stopping, with covariates x_1^T. Define the P_t by the backwards recursion (1) and define γ_t = ∑_{s=t+1}^T B_s² x_sᵀP_sx_s. Suppose that, for all t, γ_t ≥ s_tᵀ(Π_t† − P_t)s_t. Then the instantaneous value-to-go is W(s_t, σ_t², t, x_1^T) = s_tᵀ(P_t − Π_t†)s_t + γ_t, the adversary causes more regret by continuing the game, and the optimal learner strategy is (MMS).
Proof outline. The proof is by induction, where the base case is easily established with γ_T = 0 and P_T = Π_T†. 
Now, assuming that $W(s_t, \sigma_t^2, t, x_1^T) = s_t^\top(P_t - \Pi_t^\dagger)s_t + \gamma_t$, we wish to calculate the $t-1$ case by evaluating
$$W(s_{t-1}, \sigma_{t-1}^2, t-1, x_1^T) = \max_{e_t\in\{0,1\}} e_t\Big(\min_{\hat y_t}\max_{y_t}\,(\hat y_t - y_t)^2 - \Delta^*_t + W(s_t, \sigma_t^2, t, x_1^T)\Big).$$
We perform elementary calculations to evaluate the saddle-point and show that the above evaluates to
$$\max\Big\{\big(s_{t-1}^\top P_t x_t\big)^2 + B_t^2\, x_t^\top P_t x_t - \big(s_{t-1}^\top \Pi_t^\dagger x_t\big)^2\, \frac{x_t^\top \Pi_{t-1}^\dagger x_t}{x_t^\top \Pi_t^\dagger x_t} + s_{t-1}^\top\big(P_t - \Pi_t^\dagger\big)s_{t-1} + \gamma_t,\ 0\Big\},$$
which can be shown to always take the first value so long as $\gamma_{t-1} \ge s_{t-1}^\top(\Pi_{t-1}^\dagger - P_{t-1})s_{t-1}$. In this case, the induction hypothesis is verified with the $P_t$ update described in the theorem. This implies that the instantaneous value-to-go is always positive and that an optimal adversary will always continue. As a consequence, the covariate sequence satisfies $x_1^T \in A(P_0^\dagger)$, which confirms that (MMS) using the forward recursion is minimax optimal via Theorem 2.

All the ingredients are in place to prove our main result. For convenience, consider the subset of $ABC(\Sigma, \gamma_0)$ consisting of the sequences that deplete the $\Sigma$ and $\gamma_0$ budgets, that is, those $x_1^T$ with $P_T = \Pi_T^\dagger$ and $\gamma_T = 0$. Roughly, we will argue that, under $C(\Sigma, \gamma_0)$, the adversary causes the most regret by playing $x_1^T \in A(\Sigma)$, which implies that $x_1^T \in ABC(\Sigma, \gamma_0)$ and the regret is $\gamma_0$. The first step in the analysis is to check that the constraint set is non-trivial.

Lemma 5. Consider the game defined by $\Sigma \succeq 0$, $\gamma_0 \ge \|B_t\|_\infty$, and a $B_t$ sequence.
If there exists some $T$ such that $\sum_{t=1}^T \frac{B_t^2}{t + \log(T+1)} \ge \gamma_0$, then there exists a covariate sequence $x_1^T \in ABC(\Sigma, \gamma_0)$.

In particular, any $B_t$ that are bounded below satisfy this condition.

In reasoning about optimal strategies, Theorem 4 allows us to easily establish conditions under which the learner is playing suboptimally and could be made to suffer more regret. However, Theorem 4 applies to a fixed-design game that is allowed to stop early, and we wish to reason about the adversarial covariate case. The next lemma makes the crucial connection.

Lemma 6. Suppose $x_1^t \in ABC(\Sigma, \gamma_0)$ but $\gamma_t > 0$. Then there exists an extension $x_{t+1}, \ldots, x_T$ in $ABC(\Sigma, \gamma_0)$ with $x_1^T \in A(\Sigma, \gamma_0)$ and with the instantaneous value-to-go equal to $W(s_t, \sigma_t^2, t, x_1^T) = s_t^\top(P_t - \Pi_t^\dagger)s_t + \gamma_t$.

The proof is a simple consequence of checking that the extension of Lemma 2 is compatible with condition $C$. We can now prove the minimax optimality of (MMS) on the $ABC$ game.

Proof of Theorem 3. We will show something stronger: the optimal adversary strategy for the game in Figure 1 plays an $x_1^T$ sequence in $ABC$ and causes exactly $\gamma_0$ regret against (MMS).

First, assume that the game stops before round $T+1$ and $x_1, \ldots, x_T$ have been played. There are four possible scenarios depending on whether the $\Sigma$ or $\gamma_0$ budgets are exhausted.

Case: both budgets exhausted. In this case, $x_1^T \in ABC(\Sigma, \gamma_0)$ and optimality holds by the results of Section 3.

Case: neither budget exhausted. We apply Lemma 2 to conclude that there exists a covariate sequence $x_{T+1}^{T+k}$ that uses up the $\Sigma$ budget. The $C(\Sigma, \gamma_0)$ constraint guarantees that the adversary can cause more regret by playing these rounds. Hence, an adversary that exhausts neither budget is suboptimal.

Case: only $\Sigma$ budget exhausted.
If the $\Sigma$ budget is exhausted, then $x_1^T \in A$ and hence the minimax regret is $\sum_{t=1}^T B_t^2\, x_t^\top P_t x_t$ by Theorem 2. Since $\gamma_T = \gamma_0 - \sum_{t=1}^T B_t^2\, x_t^\top P_t x_t$, the adversary strategy is suboptimal if $\gamma_T > 0$, since it is then possible to cause $\gamma_0$ regret.

Case: only $\gamma_0$ budget exhausted. Since $P_t - \Pi_t^\dagger \succeq 0$, we cannot exhaust the $\gamma_0$ budget before the $\Sigma$ budget and still satisfy the $C$ constraint.

These arguments cover all four cases; we can conclude that the adversary can cause at most $\gamma_0$ regret and that any strategy that causes $\gamma_0$ regret must exhaust the $\Sigma$ and $\gamma_0$ budgets.

In all cases, the adversary can cause at most $\gamma_0$ regret and it is necessary for the adversary to play $x_1^T \in ABC(\Sigma, \gamma_0)$, which implies that (MMS) is optimal. In other words, for $x_1^T \in \mathcal{X} = ABC(\Sigma, \gamma_0)$ and $y_1^T \in \mathcal{Y} = L(B_t)$, we have
$$\sup_{x_1^T\in\mathcal{X},\, y_1^T\in\mathcal{Y}} R_T\big((\text{MMS}), x_1^T, y_1^T\big) - \min_{s}\ \sup_{x_1^T\in\mathcal{X},\, y_1^T\in\mathcal{Y}} R_T\big(s, x_1^T, y_1^T\big) = 0$$
for all $T > 0$, which implies the result.

The Necessity of a $\gamma_0$ Bound. Requiring a $\gamma_0$ bound may seem artificial at first, especially since it translates directly into a bound on the regret. However, it is a reasonable constraint to impose, for several reasons. First, recall that Lemma 3 argues that the regret of just the $A(\Sigma) \cap B(\Sigma)$ game is infinite. Second, the restriction on the adversary is mild: if $x_1^T \in ABC(\Sigma, \gamma_0)$, then $x_1^T \in ABC(\Sigma, \gamma')$ for $\gamma' \ge \gamma_0$, and so the budget can be adjusted online.
Finally, we emphasize that the learner does not need to know $\gamma_0$ to play (MMS).

5 Follow the Regularized Leader

The minimax strategy (MMS) can be interpreted as playing follow-the-regularized-leader with a certain data-dependent regularizer.

Lemma 7. The minimax strategy (MMS) is exactly follow-the-regularized-leader, predicting $\hat y_t = \theta^\top x_t$ at round $t$, where the regularization matrices $R_t$ are
$$R_0 := P_0^{-1}, \qquad R_t := R_{t-1} + \frac{1}{1 + x_t^\top P_t x_t}\, x_t x_t^\top - x_{t-1} x_{t-1}^\top, \tag{9}$$
and $\theta$ is the solution to $\min_\theta \sum_{s=1}^{t-1} (\theta^\top x_s - y_s)^2 + \theta^\top R_t \theta$.

It is also possible to derive an $R_t$ recursion without referring to $P_t$; see Lemma 11. For comparison, the last-step minimax algorithm [Azoury and Warmuth, 2001] plays $\hat y_t = x_t^\top \big(\sum_{s=1}^t x_s x_s^\top\big)^{-1} s_{t-1}$, so we can also view the minimax algorithm as last-step minimax with a regularization of $\sum_{s=t+1}^T \frac{x_s^\top P_s x_s}{1 + x_s^\top P_s x_s}\, x_s x_s^\top$.

We have shown that for the adversarial covariates protocol with $\mathcal{X} = ABC(\Sigma, \gamma_0)$, (MMS) is the minimax optimal strategy and receives $\gamma_0$ regret. Our last result helps quantify this regret by proving an $O(\log T)$ regret bound for the games analyzed in Section 3.

Theorem 5.
For any fixed $T$ and $B_1^T$, the minimax regret of the box-constrained game has the bound
$$\sup_{x_1^T \in A(\Sigma)}\ \sup_{y_1^T \in L(B_1^T)} R_T\big(s^*, x_1^T, y_1^T\big) \le \frac{d\,\|B_1^T\|_\infty^2}{2}\left(1 + 2\ln\left(1 + \frac{\|\Sigma\|_2\,\|B_1^T\|_2^2}{2\,\|B_1^T\|_\infty^2}\right)\right).$$

6 Conclusion

We have presented the minimax optimal strategy for online linear regression where the covariate and label sequences are chosen adversarially and the measure of game length is a covariance budget instead of the number of rounds. Because the strategy has access to a more informative measure of game size, $\Sigma$, it can compete with strategies that know the number of rounds. The minimax strategy is efficient and only needs to update $P_t$ and $s_t$.

One could interpret the results of our paper as finding a more natural way to measure the length of the game that admits a tractable minimax strategy. What other game protocols can be reparameterized to admit efficient minimax strategies? As a general method, one could start with minimax algorithms for constrained cases and then search for parameterizations that preserve optimality.

We have also provided an intuitive view of the algorithm as follow-the-regularized-leader with a specific data-dependent regularizer. This interpretation can be used to bound the excess regret when the budget $\Sigma$ is misspecified, perhaps allowing for adaptation to $\Sigma$.

Acknowledgements

We gratefully acknowledge the support of the NSF through grant IIS-1619362.

References

Katy S. Azoury and Manfred K. Warmuth. Relative loss bounds for on-line density estimation with the exponential family of distributions. Machine Learning, 43(3):211–246, 2001.

Peter L. Bartlett, Wouter M. Koolen, Alan Malek, Manfred K. Warmuth, and Eiji Takimoto. Minimax fixed-design linear regression.
In P. Grünwald, E. Hazan, and S. Kale, editors, Proceedings of the 28th Annual Conference on Learning Theory (COLT), pages 226–239, 2015.

Nicolò Cesa-Bianchi, Philip M. Long, and Manfred K. Warmuth. Worst-case quadratic loss bounds for prediction using linear functions and gradient descent. IEEE Transactions on Neural Networks, 7(3):604–619, 1996.

Jürgen Forster. On relative loss bounds in generalized linear regression. In Fundamentals of Computation Theory, pages 269–280. Springer, 1999.

Dean P. Foster. Prediction in the worst case. Annals of Statistics, 19(2):1084–1090, 1991.

David A. Harville. Matrix Algebra from a Statistician's Perspective, volume 1. Springer, 1997.

Jyrki Kivinen and Manfred K. Warmuth. Exponentiated gradient versus gradient descent for linear predictors. Information and Computation, 132(1):1–63, 1997.

Wouter M. Koolen, Alan Malek, and Peter L. Bartlett. Efficient minimax strategies for square loss games. In Advances in Neural Information Processing Systems, pages 3230–3238, 2014.

Wouter M. Koolen, Alan Malek, Peter L. Bartlett, and Yasin Abbasi. Minimax time series prediction. In C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, and R. Garnett, editors, Advances in Neural Information Processing Systems 28, pages 2548–2556. Curran Associates, Inc., 2015. URL http://papers.nips.cc/paper/5730-minimax-time-series-prediction.pdf.

Edward Moroshko and Koby Crammer. Weighted last-step min–max algorithm with improved sub-logarithmic regret. Theoretical Computer Science, 558:107–124, 2014.

Eiji Takimoto and Manfred K. Warmuth. The minimax strategy for Gaussian density estimation. In 13th COLT, pages 100–106, 2000.

Volodimir G. Vovk. Aggregating strategies. In Proc. Third Workshop on Computational Learning Theory, pages 371–383. Morgan Kaufmann, 1990.

Volodya Vovk. Competitive on-line linear regression.
In Advances in Neural Information Processing Systems, pages 364–370, 1998.