{"title": "Efficient online algorithms for fast-rate regret bounds under sparsity", "book": "Advances in Neural Information Processing Systems", "page_first": 7026, "page_last": 7036, "abstract": "We consider the problem of online convex optimization in two different settings: arbitrary and i.i.d. sequence of convex loss functions. In both settings, we provide efficient algorithms whose cumulative excess risks are controlled with fast-rate sparse bounds. \nFirst, the excess risks bounds depend on the sparsity of the objective rather than on the dimension of the parameters space. Second, their rates are faster than the slow-rate $1/\\sqrt{T}$ under additional convexity assumptions on the loss functions. In the adversarial setting, we develop an algorithm BOA+ whose cumulative excess risks is controlled by several bounds with different trade-offs between sparsity and rate for strongly convex loss functions. In the i.i.d. setting under the \u0141ojasiewicz's assumption, we establish new risk bounds that are sparse with a rate adaptive to the convexity of the risk (ranging from a rate $1/\\sqrt{T}$ for general convex risk to $1/T$ for strongly convex risk). These results generalize previous works on sparse online learning under weak assumptions on the risk.", "full_text": "Ef\ufb01cient online algorithms for fast-rate\n\nregret bounds under sparsity\n\nINRIA, ENS, PSL Research University\n\nPierre Gaillard\n\nParis, France\n\npierre.gaillard@inria.fr\n\nolivier.wintenberger@upmc.fr\n\nOlivier Wintenberger\n\nSorbonne Universit\u00e9, CNRS, LPSM\n\nParis, France\n\nAbstract\n\nWe consider the problem of online convex optimization in two different settings:\narbitrary and i.i.d. sequence of convex loss functions. In both settings, we provide\nef\ufb01cient algorithms whose cumulative excess risks are controlled with fast-rate\nsparse bounds. First, the excess risks bounds depend on the sparsity of the objective\nrather than on the dimension of the parameters space. 
Second, their rates are faster than the slow rate $1/\sqrt{T}$ under additional convexity assumptions on the loss functions. In the adversarial setting, we develop an algorithm BOA+ whose cumulative excess risk is controlled by several bounds with different trade-offs between sparsity and rate for strongly convex loss functions. In the i.i.d. setting under Łojasiewicz's assumption, we establish new risk bounds that are sparse with a rate adaptive to the convexity of the risk (ranging from a rate $1/\sqrt{T}$ for general convex risk to $1/T$ for strongly convex risk). These results generalize previous works on sparse online learning under weak assumptions on the risk.

1 Introduction

We consider the following setting of online convex optimization where a sequence of random convex loss functions $(\ell_t : \mathbb{R}^d \to \mathbb{R})_{t \ge 1}$ is sequentially observed. At each iteration $t \ge 1$, a learner chooses a point $\hat\theta_{t-1} \in \mathbb{R}^d$ based on past observations $\mathcal{F}_{t-1} = \{\ell_1, \dots, \ell_{t-1}\}$. The learner aims at minimizing the average excess risk defined as $\hat L_T := (1/T) \sum_{t=1}^T E_{t-1}[\ell_t(\hat\theta_{t-1})]$, where $E_{t-1} = E[\,\cdot\,|\mathcal{F}_{t-1}]$. For any parameter $\theta$ in some reference set $\Theta \subset \mathbb{R}^d$, the average excess risk can be decomposed as the sum of the approximation and estimation errors:

$$\hat L_T = \underbrace{\frac{1}{T}\sum_{t=1}^T E_{t-1}\big[\ell_t(\hat\theta_{t-1})\big] - \frac{1}{T}\sum_{t=1}^T E_{t-1}\big[\ell_t(\theta)\big]}_{\text{estimation error } =: R_T(\theta)} + \underbrace{\frac{1}{T}\sum_{t=1}^T E_{t-1}\big[\ell_t(\theta)\big]}_{\text{approximation error}} . \qquad (1)$$

Though the final goal is to minimize $\hat L_T$, a common proxy is to upper-bound the estimation term $R_T(\theta)$ (also referred to as the average excess risk¹) simultaneously for all $\theta \in \Theta$. If the loss functions are exp-concave and $\Theta$ is bounded, several sequential algorithms achieve the uniform bound² $R_T := \sup_{\theta \in \Theta} R_T(\theta) \le \tilde O(d/T)$ on the estimation term; see [13]. In this paper, we are interested in non-uniform bounds on $R_T(\theta)$ increasing with the complexity of $\theta$. Such non-uniform bounds are called oracle inequalities and state that the learner achieves the best approximation-estimation trade-off of (1).

¹The average excess risk $R_T(\theta)$ generalizes the average regret more commonly used in the online learning literature by considering the Dirac masses on $\{\ell_t\}$ as conditional distributions so that $\ell_t = E_{t-1}[\ell_t]$, $t \ge 1$.
²Throughout the paper $\lesssim$ denotes an approximate inequality which holds up to universal constants and $\tilde O$ denotes an asymptotic inequality up to logarithmic terms.

32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montréal, Canada.

Using the $\ell_0$-norm to measure the complexity of $\theta$, we are looking for fast-rate sparse bounds of the form

$$R_T(\theta) \le \tilde O\Big(\Big(\frac{\|\theta\|_0}{T}\Big)^{\frac{1}{2-\beta}}\Big), \quad \text{for any } \theta \in \Theta.$$

The parameter $\beta \in [0,1]$ depends on the convexity properties of the loss functions and will be specified later. We call a fast-rate bound any bound which provides a better rate than $1/\sqrt{T}$, and a sparse bound any bound where some dependence on $d$ has been replaced with $\|\theta\|_0$. Our analysis starts from a careful study of the finite case $\Theta = \{\theta_1, \dots, \theta_K\}$. We then consider online averaging algorithms on adaptive finite discretization grids that achieve sparse oracle bounds on $\Theta = B_1 = \{\theta \in \mathbb{R}^d : \|\theta\|_1 \le 1\}$.

First contribution: fast-rate high-probability quantile bound (finite $\Theta$, adversarial data). The case of a finite reference set $\Theta = \{\theta_1, \dots, \theta_K\}$ corresponds to the setting of prediction with expert advice (see Section 2.2 or [5]) where a learner makes sequential predictions over a series of rounds with the help of $K$ experts.
Hedge, introduced by [19] and [26], achieves the rate $R_T \le O(\sqrt{(\ln K)/T})$. The latter is optimal for general convex loss functions, but better performance can be obtained in favorable scenarios. The rate $R_T \le O((\ln K)/T)$ is for instance obtained for strongly convex loss functions in [28]. Another improvement (see [16] and references therein) is devoted to quantile bounds, i.e. bounds on $E_{k \sim \pi}[R_T(\theta_k)]$ for any probability distribution $\pi \in \Delta_K$³. The latter improve the dependence on the number of experts from $\ln K$ to the Kullback-Leibler divergence $\mathcal{K}(\pi, \hat\pi_0)$ for any prior $\hat\pi_0$. They are smaller whenever many experts perform well or when good prior knowledge is available. Squint [16] achieves a fast-rate quantile bound for adversarial data. Such a bound is obtained in high probability by [20], but it suffers an additional gap term.

In Section 2, we extend the analysis of [16] to remove the gap term of [20]. We introduce a weak version of exp-concavity; see Assumption (A2). It depends on a parameter $\beta \in [0,1]$ which goes from $\beta = 0$ for general convex loss functions to $\beta = 1$ for exp-concavity. We show in Theorem 2.1 that BOA [28] and Squint [16] achieve a fast-rate quantile bound with high probability: i.e. $E_{k \sim \pi}[R_T(\theta_k)] \le \tilde O\big((\mathcal{K}(\pi, \hat\pi_0)/T)^{1/(2-\beta)}\big)$.

Second contribution: efficient sparse oracle bound ($\Theta = B_1$, adversarial data). The extension from finite reference sets to convex sets is natural. The seminal paper [15] introduced the Exponentiated Gradient algorithm (EG), a version of Hedge using the sub-gradients of the loss functions. The latter guarantees $R_T \le O(\sqrt{(\ln d)/T})$ for $\Theta = B_1$, which is optimal for convex loss functions. Recently, fast rates $R_T \le \tilde O\big((d/T)^{1/(2-\beta)}\big)$ were obtained by [17] under a slightly different assumption than (A2). Here our purpose is to improve the dependence on $d$ under the sparsity condition that $\|\theta\|_0$ is small. The literature on learning under sparsity with i.i.d.
data is vast; we refer to [12] for a review. Yet, little work was done on sparsity bounds under adversarial data; see Table 1 for a summary. The papers [7; 18; 29] focus on providing sparse estimators $\hat\theta_t$ rather than sparse guarantees. More recent works [8; 14] consider sparse approximations of the sub-gradients. Though they also compare themselves with sparse parameters, they incur a bound larger than $O(1/\sqrt{T})$, which is optimal in their setting. Fast-rate sparse regret bounds involving $\|\theta\|_0$ were, up to our knowledge, only obtained through non-efficient (exponential-time) procedures (see [10]). In Section 3.3, we provide an efficient algorithm BOA+ which satisfies the oracle inequality

$$R_T(\theta) \le \tilde O\Big(\big(\sqrt{d\|\theta\|_0}/T\big) \wedge \big(\sqrt{\|\theta\|_0}/T^{3/4}\big)\Big), \quad \text{for any } \theta \in B_1,$$

for strongly convex loss functions ($\beta = 1$). The gain $\sqrt{\|\theta\|_0/d} \wedge \sqrt{\|\theta\|_0/T}$ compared with the usual rate $\tilde O(d/T)$ is significant for sparse parameters $\theta$.

A crucial step of our analysis is an intermediate result which is interesting in its own right. We define an efficient algorithm with input any finite grid $\Theta_0 \subset B_1$. We provide in Theorem 3.2 a bound of the form $R_T(\theta) \le \tilde O(D(\theta, \Theta_0)/\sqrt{T})$ for a pseudo-metric $D$ and any $\theta \in B_1$. We say that this bound is accelerable as the rate may decrease if $D(\theta, \Theta_0)$ decreases with $T$. In particular, it yields an oracle bound of the form $R_T(\theta) \le O(\|\theta\|_1/\sqrt{T})$.

³Here and subsequently, $\Delta_K := \{\pi \in [0,1]^K : \|\pi\|_1 = 1\}$ denotes the simplex of dimension $K - 1$.

Procedure | Rate | Polynomial | Assumption | Sparsity setting
Kale et al. [8; 14] | $\mathrm{Poly}(d)/\sqrt{T}$ | Yes | Convexity | Sparse observed gradients
[7; 18; 29] | $\sqrt{(\ln d)/T}$ or $d/T$ | Yes | (Strong) Convexity | Produce sparse estimators
SeqSEW [11] | $d_0 \ln d / T$ | No | Strong Convexity | Sparse bound
SABOA | $\sqrt{(d_0 \ln d)/T} \wedge \sqrt{d_0 d}\,(\ln d)/T$ | Yes | Strong Convexity | Sparse bound

Table 1: Comparison of sequential optimization procedures in a sparse adversarial environment.

Third contribution: sparse regret bound under the Łojasiewicz assumption ($\Theta = B_1$, i.i.d. data). In Section 3.4 we turn to a stochastic setting where the loss functions $\ell_1, \dots, \ell_T$ are i.i.d. This setting extends the regression one with random design to general loss functions. The classical Lasso procedure satisfies, in the regression setting for the quadratic risk ($\beta = 1$), $R_T(\theta) \le \tilde O(\|\theta\|_0/T)$ where $\theta$ is a sparse approximation of $\theta^* = \arg\min_{\theta \in \mathbb{R}^d} R_T(\theta)$; see [3]. Yet, few procedures satisfying sparse bounds are sequential; we can cite [1; 8; 9; 14; 23]. We compare their results and settings in Table 2.

The first line of work [1; 9; 23] provides sparse rates of order $\tilde O(\|\theta^*\|_0 \ln d / T)$. Their settings are close to the one of [3] but their methods differ: the one of [23] uses an $\ell_1$-penalized gradient descent, whereas those of [1] and [9] are based on restarting a subroutine centered around the current estimate on sessions of exponentially growing length. A common limitation of these works is that they do not provide oracle inequalities. They only compete with the global optimum over $\mathbb{R}^d$, which is assumed to be (approximately in [1]) sparse with a known $\ell_1$-bound. In other words, they assume that the global optimum also realizes the approximation-estimation trade-off in (1).
In order to avoid this restriction, our first objective is to obtain the sparse bounds $R_T(\theta^*(U)) \le \tilde O(\|\theta^*(U)\|_0/T)$ where $\theta^*(U) \in \arg\min_{\|\theta\|_1 \le U} R_T(\theta)$ for any $U > 0$. For $U$ well chosen so that $\|\theta^*(U)\|_1 = U$, $\theta^*(U)$ is sparse and the approximation-estimation trade-off in (1) is achieved. We restrict to the case $U = 1$, suppressing the dependence on $U$ in $\theta^*$ for ease of notation. We leave the adaptation in $U > 0$ for future research.

The second line of works [8; 14] considers sparse approximations of sub-gradients. Yet, they provide a sparse regret bound of order $O(\|\theta^*\|_0^2 \ln d / T)$, where $\theta^*$ is the optimum in $B_1$, when the loss functions are strongly convex. Our second objective is to relax the strong convexity assumption, which is too restrictive in the sequential regression setting. Indeed, the usual restricted eigenvalue conditions on the Gram matrix cannot hold uniformly for small $t$'s. We work under Łojasiewicz's Assumption, introduced by [32; 33]: there exist $\beta \ge 0$ and $\mu > 0$ such that for all $\theta \in B_1$, there exists a minimizer $\theta^*$ of the risk over $B_1$ satisfying

$$\mu \|\theta - \theta^*\|_2^2 \le E[\ell_t(\theta) - \ell_t(\theta^*)]^{\beta}.$$

The Łojasiewicz assumption depends on a parameter $\beta \in [0,1]$ that ranges from general convex risk functions ($\beta = 0$) to generalized strongly convex risk functions ($\beta = 1$). In Theorem 3.4 we show that our new efficient procedure SABOA achieves a fast-rate upper-bound on the average excess risk of order $\tilde O\big((\|\theta^*\|_0 \ln(d)/T)^{1/(2-\beta)}\big)$ when the optimal parameters have $\ell_1$-norm bounded by $1 - \gamma < 1$. Then we recover the optimal rate of [1; 9; 23] in a similar setting, when the global optimum is assumed to be sparse. When $\|\theta^*\|_1 = 1$, guaranteeing a good approximation-estimation trade-off in (1), the bound suffers an additional factor $\|\theta^*\|_0$. Notice that Łojasiewicz's Assumption (A3) allows multiple optima, which is important when we are dealing with degenerate co-linear design (allowing zero eigenvalues in the covariance matrix). It is an open question whether the fast rate $\tilde O(\|\theta^*\|_0^2 \ln(d)/T)$ is optimal for efficient $O(dT)$-complexity procedures such as SABOA under Łojasiewicz's Assumption.

Outline of the paper. To summarize our contributions, we provide
- the first high-probability quantile bound achieving a fast rate (Theorem 2.1);
- an accelerable bound on $R_T(\theta)$ that is small whenever $\theta$ is close to a prior grid $\Theta_0$ (Thm. 3.2);
- two efficient algorithms with sparse regret bounds in the adversarial setting with strongly convex loss functions (BOA+, Thm. 3.3) and in the i.i.d. setting (SABOA, Thm. 3.4). In the latter setting, the results are obtained under Łojasiewicz's assumption. This generalizes the usual necessary conditions for obtaining sparse bounds, which are too restrictive in our sequential setting.

Procedure | Setting | Rate | Assumptions / Setting | Optimum over
Lasso [3] | B | $d_0 \ln d / T$ | Mutual Coherence | $\mathbb{R}^d$
Kale et al. [8; 14] | S | $d_0^2 \ln d / T$ | Strong Convexity + Sparse Gradients | $B_1$
[1; 9; 23] + SABOA | S | $d_0 \ln d / T$ | Strong convexity or Łojasiewicz ($\beta = 1$) | $\mathbb{R}^d$
SABOA | S | $d_0^2 \ln d / T$ | Łojasiewicz ($\beta = 1$) | $B_1$

Table 2: Comparison of sequential (S) and batch (B) optimization procedures in an i.i.d. environment.

2 Finite reference set

In this section, we focus on a finite reference set $\Theta := \{\theta_1, \dots, \theta_K\} \subset B_1$, including the setting of prediction with expert advice presented in Section 2.2.
We consider the following assumptions on the loss functions:

(A1) Convex Lipschitz⁴: the loss functions $\ell_t$ are convex on $B_1$ and there exists $G > 0$ such that $\|\nabla\ell_t(\theta)\|_\infty \le G$ for all $t \ge 1$, $\theta \in B_1$.

(A2) Weak exp-concavity: there exist $\alpha > 0$ and $\beta \in [0,1]$ such that for all $t \ge 1$ and $\theta_1, \theta_2 \in B_1$, almost surely

$$E_{t-1}\big[\ell_t(\theta_1) - \ell_t(\theta_2)\big] \le E_{t-1}\big[\nabla\ell_t(\theta_1)^\top(\theta_1 - \theta_2)\big] - E_{t-1}\Big[\big(\alpha\,(\nabla\ell_t(\theta_1)^\top(\theta_1 - \theta_2))^2\big)^{1/\beta}\Big].$$

For convex loss functions $(\ell_t)$, Assumption (A2) is satisfied with $\beta = 0$ and $\alpha < G^{-2}$. Fast rates are obtained for $\beta > 0$. It is worth pointing out that Assumption (A2) is weak even in the strongest case $\beta = 1$. It is implied by several common assumptions such as:
– Strong convexity of the risk: under the boundedness of the gradients, Assumption (A2) with $\beta = 1$ and $\alpha = \mu/(2G^2)$ is implied by the $\mu$-strong convexity of the risks $(E_{t-1}[\ell_t])$, $t \ge 1$.
– Exp-concavity of the loss: Lemma 4.2 of Hazan [13] states that (A2) with $\beta = 1$ and $\alpha \le \frac{1}{4}\min\{\frac{1}{8G}, \kappa\}$ is implied by $\kappa$-exp-concavity of the loss functions $\ell_t$, $t \ge 1$. Our assumption is slightly weaker since it holds in conditional expectation.

2.1 Fast-rate quantile bound with high probability

For prediction with $K \ge 1$ expert advice, [28] showed that a fast rate $O((\ln K)/T)$ can be obtained by the BOA algorithm under the LIST condition (i.e., Lipschitz and strongly convex loss functions). In this section, we show that Assumption (A2) is enough, and we improve the dependence on the total number of experts with a quantile bound.

Our algorithm is described in Algorithm 1 and corresponds to a particular case of two algorithms: the Squint algorithm of [16] used with a discrete prior over a finite set of learning rates, and the BOA algorithm of [28] where each expert is replicated multiple times with different constant learning rates. The proof (with the exact constants) is deferred to Appendix C.1.

Theorem 2.1.
Let $T \ge 1$. Assume (A1) and (A2). Apply Algorithm 1 with parameter $E = 4G/3$ and initial weight vector $\hat\pi_0 \in \Delta_K$. Then, for all $\pi \in \Delta_K$, with probability at least $1 - 2e^{-x}$, $x > 0$,

$$E_{k \sim \pi}\big[R_T(\theta_k)\big] \lesssim \Big(\frac{\mathcal{K}(\pi, \hat\pi_0) + \ln\ln(GT) + x}{\alpha T}\Big)^{\frac{1}{2-\beta}},$$

where $\mathcal{K}(\pi, \hat\pi_0) := \sum_{k=1}^K \pi_k \ln(\pi_k/\hat\pi_{k,0})$ is the Kullback-Leibler divergence.

A fast rate of this type (without the quantile property) can be obtained in expectation by using Hedge for exp-concave loss functions. However, Theorem 2.1 is stronger. First, Assumption (A2) is weaker than the exp-concavity of the loss functions $\ell_t$, as it holds for absolute or quantile loss functions in a sufficiently regular regression setting. Second, the algorithm uses the so-called gradient trick; see [24]. Therefore, simultaneously with the fast rate $O(T^{-1/(2-\beta)})$ with respect to the experts $(\theta_k)$, the algorithm achieves the slow rate $O(1/\sqrt{T})$ with respect to any convex combination $E_{k \sim \pi}[\theta_k]$ (similarly to EG). Finally, high-probability regret bounds such as ours are not satisfied by Hedge (see [2]).

⁴Throughout the paper, we assume that the Lipschitz constant $G$ in (A1) is known. It can be calibrated online with standard tricks such as the doubling trick (see [6] for instance) under sub-Gaussian conditions.

Algorithm 1 Squint – BOA with multiple constant learning rates assigned to each parameter

Parameters: $\Theta_0 = \{\theta_1, \dots, \theta_K\} \subset B_1$, $E > 0$ and $\hat\pi_0 \in \Delta_K$.
Initialization: for $1 \le i \le \ln(ET^2)$, define $\eta_i := (e^i E)^{-1}$.
For each iteration $t = 1, \dots, T$:
– Choose $\hat\theta_{t-1} = \sum_{k=1}^K \hat\pi_{k,t-1}\theta_k$ and observe $\nabla\ell_t(\hat\theta_{t-1})$;
– Update component-wise, for all $1 \le k \le K$,

$$\hat\pi_{k,t} = \frac{\sum_{i=1}^{\ln(ET^2)} \eta_i\, e^{\eta_i \sum_{s=1}^t (r_{k,s} - \eta_i r_{k,s}^2)}\, \hat\pi_{k,0}}{\sum_{i'=1}^{\ln(ET^2)} E_{j \sim \hat\pi_0}\big[\eta_{i'}\, e^{\eta_{i'} \sum_{s=1}^t (r_{j,s} - \eta_{i'} r_{j,s}^2)}\big]}, \quad \text{where } r_{k,s} = \nabla\ell_s(\hat\theta_{s-1})^\top(\hat\theta_{s-1} - \theta_k).$$

If the algorithm is run with a uniform prior $\hat\pi_0 = (1/K, \dots, 1/K)$, Theorem 2.1 implies that for any subset $\Theta_0 \subset \Theta$, with high probability,

$$\max_{\theta \in \Theta_0} R_T(\theta) \lesssim \Big(\frac{\ln(K/\operatorname{Card}(\Theta_0)) + \ln\ln(GT)}{\alpha T}\Big)^{\frac{1}{2-\beta}}.$$

Thanks to the quantile bounds, we pay the proportion of good experts $\ln(K/\operatorname{Card}(\Theta_0))$ in the regret instead of the total number of experts $\ln K$. We refer to [16] for more interesting applications. Such quantile bounds on the risk were studied by Mehta [20, Section 7] in a batch i.i.d. setting (i.e., the $\ell_t$ are i.i.d.). A standard online-to-batch conversion shows that Theorem 2.1 yields, with high probability,

$$E_T\Big[\ell_{T+1}(\bar\theta_T) - E_{k \sim \pi}\big[\ell_{T+1}(\theta_k)\big]\Big] \lesssim \Big(\frac{\mathcal{K}(\pi, \hat\pi_0) + \ln\ln(GT) + x}{\alpha T}\Big)^{\frac{1}{2-\beta}}, \quad \bar\theta_T = \frac{1}{T}\sum_{t=1}^T \hat\theta_{t-1}.$$

This improves the bound obtained by [20], which suffers the additional gap

$$(e-1)\, E_T\Big[E_{k \sim \pi}\big[\ell_{T+1}(\theta_k)\big] - \min_{\pi^* \in \Delta_K} \ell_{T+1}\big(E_{j \sim \pi^*}[\theta_j]\big)\Big].$$

2.2 Prediction with expert advice

The framework of prediction with expert advice is widely considered in the literature (see [5] for an overview). We now recall this setting and how it can be included in our framework. At the beginning of each round $t$, a finite set of $K \ge 1$ experts predicts $f_t = (f_{1,t}, \dots, f_{K,t}) \in [0,1]^K$ from the history $\mathcal{F}_{t-1}$. The learner then chooses a weight vector $\hat\theta_{t-1}$ in the simplex $\Delta_K$ and produces the prediction $\hat f_t := \hat\theta_{t-1}^\top f_t \in \mathbb{R}$ as a convex combination of the experts. Its performance at time $t$ is evaluated by a loss function $g_t : \mathbb{R} \to \mathbb{R}$. The goal of the learner is to approach the performance of the best expert in the long run. This can be done by minimizing the average excess risk $R_{k,T} := \frac{1}{T}\sum_{t=1}^T E_{t-1}[g_t(\hat f_t)] - E_{t-1}[g_t(f_{k,t})]$ with respect to all experts $k \in \{1, \dots, K\}$.
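The weight update of Algorithm 1 can be sketched numerically. The following is a minimal illustration (our own code, not the authors'): it maintains, for each grid point, the cumulative instantaneous regrets and their squares, and mixes the exponential weights over the geometric set of learning rates $\eta_i = (e^i E)^{-1}$, using a log-sum-exp for numerical stability. The gradient oracle `grad` and all function names are ours.

```python
import numpy as np

def boa_squint_weights(R, V, etas, prior):
    """Mixture weights of Squint/BOA after observing cumulative
    instantaneous regrets R[k] = sum_s r_{k,s} and V[k] = sum_s r_{k,s}^2:
        w_k  proportional to  prior_k * sum_i eta_i * exp(eta_i R_k - eta_i^2 V_k).
    """
    # log of eta_i * exp(eta_i * R_k - eta_i^2 * V_k), shape (n_etas, K)
    log_terms = (np.log(etas)[:, None] + etas[:, None] * R[None, :]
                 - (etas[:, None] ** 2) * V[None, :])
    log_mix = np.logaddexp.reduce(log_terms, axis=0) + np.log(prior)
    log_mix -= log_mix.max()          # stabilize before exponentiating
    w = np.exp(log_mix)
    return w / w.sum()

def run_boa(thetas, grad, T, E):
    """Play theta_hat = sum_k w_k theta_k against a gradient oracle."""
    K, d = thetas.shape
    n_etas = max(1, int(np.log(E * T ** 2)))
    etas = 1.0 / (np.exp(np.arange(1, n_etas + 1)) * E)
    prior = np.full(K, 1.0 / K)
    R, V = np.zeros(K), np.zeros(K)
    w = prior.copy()
    for t in range(T):
        theta_hat = w @ thetas
        g = grad(theta_hat)
        # instantaneous regrets r_{k,t} = grad . (theta_hat - theta_k)
        r = (theta_hat - thetas) @ g
        R += r
        V += r ** 2
        w = boa_squint_weights(R, V, etas, prior)
    return w, theta_hat
```

With zero gradients the update leaves the prior untouched, which is a quick sanity check of the normalization.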
This setting reduces to our framework with dimension $d = K$. Indeed, it suffices to choose the $K$-dimensional loss functions $\ell_t : \theta \mapsto g_t(\theta^\top f_t)$ and the canonical basis $\Theta := \{\theta \in \mathbb{R}_+^K : \|\theta\|_1 = 1, \|\theta\|_0 = 1\}$ of $\mathbb{R}^K$ as the reference set. Denoting by $\theta_k$ the $k$-th element of the canonical basis, we see that $\theta_k^\top f_t = f_{k,t}$, so that $\ell_t(\theta_k) = g_t(f_{k,t})$. Therefore, $R_{k,T}$ matches our definition of $R_T(\theta_k)$ in Equation (1), and under the assumptions of Theorem 2.1 we get a bound of order

$$E_{k \sim \pi}\big[R_{k,T}\big] \lesssim \Big(\frac{\mathcal{K}(\pi, \hat\pi_0) + \ln\ln(GT) + x}{\alpha T}\Big)^{\frac{1}{2-\beta}}.$$

An important point to note here is that though the parameters $\theta_k$ of the reference set are constant, this method can be used to compare the player with arbitrary strategies $f_{k,t}$ that may evolve over time and depend on recent data. We do not assume in this section that there is a single fixed expert $k^* \in \{1, \dots, K\}$ which is always the best, i.e., $E_{t-1}[g_t(f_{k^*,t})] \le \min_k E_{t-1}[g_t(f_{k,t})]$. Hence, we cannot replace (A2) with the closely related Bernstein assumption (see Ass. (A2') or [17, Cond. 1]). Actually, one can reformulate Assumption (A2) on the one-dimensional loss functions $g_t$ as follows: there exist $\alpha > 0$ and $\beta \in [0,1]$ such that for all $t \ge 1$ and all $0 \le f_1, f_2 \le 1$, almost surely,

$$E_{t-1}\big[g_t(f_1) - g_t(f_2)\big] \le E_{t-1}\big[g_t'(f_1)(f_1 - f_2)\big] - E_{t-1}\Big[\big(\alpha\,(g_t'(f_1)(f_1 - f_2))^2\big)^{1/\beta}\Big].$$

It holds with $\alpha = \kappa/(2G^2)$ for a $\kappa$-strongly convex risk $E_{t-1}[g_t]$. For instance, the square loss $g_t = (\cdot - y_t)^2$ satisfies it with $\beta = 1$ and $\alpha = 1/8$.

3 Online optimization in the unit $\ell_1$-ball

The aim of this section is to extend the preceding results to the reference set $\Theta = B_1$ instead of the finite $\Theta = \{\theta_1, \dots, \theta_K\}$. A classical reduction from the expert advice setting to the $\ell_1$-ball is the so-called "gradient trick".
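The gradient trick can be sketched concretely: run an exponential-weights algorithm on the $2d$ corners $\pm e_j$ of the $\ell_1$-ball, feeding it the linearized losses $\theta \mapsto \nabla\ell_t(\hat\theta_{t-1})^\top \theta$. A minimal sketch (our illustration; the plain Hedge update stands in for BOA, and the step size is an arbitrary choice):

```python
import numpy as np

def eg_pm(grad, d, T, eta):
    """Exponentiated Gradient over the 2d corners {+e_j, -e_j} of B_1.

    Hedge on the linearized losses  corner -> grad . corner  keeps the
    prediction w[:d] - w[d:] inside the l1-ball while competing with
    any theta in B_1 (the 'gradient trick').
    """
    w = np.full(2 * d, 1.0 / (2 * d))    # weights over the 2d corners
    theta = np.zeros(d)
    for t in range(T):
        theta = w[:d] - w[d:]            # prediction in B_1
        g = grad(theta)                  # gradient of the true convex loss
        corner_loss = np.concatenate([g, -g])
        w *= np.exp(-eta * corner_loss)  # Hedge update on linearized losses
        w /= w.sum()
    return theta
```

For a quadratic loss $\tfrac12\|\theta - c\|_2^2$ with $c$ inside the ball, the iterates stay $\ell_1$-feasible and improve on the initial point, matching the $O(1/\sqrt{T})$ guarantee discussed below.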
A direct analysis of BOA applied to $\Theta_0 = \{\theta \in \mathbb{R}^d : \|\theta\|_0 = 1, \|\theta\|_1 = 1\}$, the $2d$ corners of the $\ell_1$-ball, suffers a slow rate $O(1/\sqrt{T})$ on the average excess risk with respect to any $\theta \in B_1$. The goal is to exhibit algorithms that go beyond $O(1/\sqrt{T})$. In Section 3.1 we investigate non-adaptive discretization grids of the space that yield optimal upper-bounds but suffer exponential time complexity. In Section 3.2 we introduce a pseudo-metric in order to bound the regret of grids consisting of the $2d$ corners and some arbitrary fixed points. From this crucial step, we derive the adaptive points to add to the $2d$ corners in the adversarial case (Section 3.3) and in the i.i.d. case (Section 3.4) in order to obtain two efficient procedures (BOA+ and SABOA, respectively) with sparse guarantees.

3.1 Warmup: fast rate by discretizing the space

As a warmup, we show how to use Theorem 2.1 in order to obtain a fast rate on $R_T(\theta)$ for any $\theta \in B_1$. Basically, if the parameter $\theta$ could be included into the grid $\Theta_0$, Theorem 2.1 would turn into a bound on the regret $R_T(\theta)$ with respect to $\theta$. However, this is not possible as we do not know $\theta$ in advance. A solution consists in approaching $B_1$ with $B_1(\varepsilon)$, a fixed finite $\varepsilon$-covering of $B_1$ in $\ell_1$-norm of minimal cardinality, so that $\operatorname{Card}(B_1(\varepsilon)) \lesssim (1/\varepsilon)^d$. We obtain a nearly optimal regret for this procedure.

Proposition 3.1. Let $T \ge 1$. Under the assumptions of Theorem 2.1, applying Algorithm 1 with grid $\Theta_0 = B_1(T^{-2})$ and uniform prior $\hat\pi_0$ over the $\operatorname{Card}(B_1(T^{-2}))$ points satisfies, for all $\theta \in B_1$, with probability at least $1 - e^{-x}$, $x > 0$,

$$R_T(\theta) \lesssim \Big(\frac{d\ln T + \ln\ln(GT) + x}{\alpha T}\Big)^{\frac{1}{2-\beta}} + \frac{G}{T^2}. \qquad (2)$$

Proof. Let $\varepsilon = 1/T^2$, let $\theta \in B_1$ and let $\tilde\theta$ be its $\varepsilon$-approximation in $B_1(\varepsilon)$.
The proof follows from the Lipschitzness of the loss, $R_T(\theta) \le R_T(\tilde\theta) + G\varepsilon$, and by applying Theorem 2.1 to $R_T(\tilde\theta)$.

One can improve $d$ to $\|\theta\|_0 \ln d$ by carefully choosing the prior $\hat\pi_0$ as in [21]; see Appendix A for details. The obtained rate is optimal up to log factors. However, the complexity of the discretization is prohibitive (of order $T^d$) and unrealistic for practical purposes.

3.2 Oracle bound for an arbitrary fixed discretization grid

Let $\Theta_0 \subset B_1$ be a finite set. The aim of this section is to study the regret of Algorithm 1 with respect to any $\theta \in B_1$. Similarly to Proposition 3.1, the average excess risk may be bounded as

$$R_T(\theta) \lesssim \Big(\frac{\ln\operatorname{Card}(\Theta_0) + \ln\ln T + x}{\alpha T}\Big)^{\frac{1}{2-\beta}} + G\|\theta_0 - \theta\|_1, \qquad (3)$$

for any $\theta_0 \in \Theta_0$. We say that a regret bound is accelerable if it provides a fast rate except for a term, depending on the distance to the grid (i.e., the term in $\|\theta_0 - \theta\|_1$ in (3)), that decreases with $T$. This property will be crucial in obtaining fast rates by adapting the grid $\Theta_0$ sequentially. The regret bound (3) is not accelerable due to the second term, which is constant. In order to find an accelerable regret bound, we introduce the notion of averaging accelerability, a pseudo-metric that replaces the $\ell_1$-norm in (3). We give the intuition behind this notion in the sketch of the proof of Theorem 3.2.

Definition 3.1 (Averaging accelerability). For any $\theta, \theta_0 \in B_1$, we define

$$D(\theta, \theta_0) := \min\big\{0 \le \pi \le 1 : \|\theta - (1-\pi)\theta_0\|_1 \le \pi\big\}.$$

This averaging accelerability has several nice properties. In Appendix B, we provide a few concrete upper-bounds in terms of classical distances. For instance, Lemma B.1 provides the upper-bound $D(\theta, \theta_0) \le \|\theta - \theta_0\|_1/(1 - \|\theta_0\|_1 \wedge \|\theta\|_1)$.
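Definition 3.1 can be evaluated numerically: $\pi \mapsto \|\theta - (1-\pi)\theta_0\|_1 - \pi$ is convex, nonnegative at $\pi = 0$ (unless $\theta = \theta_0$) and nonpositive at $\pi = 1$ when $\theta \in B_1$, so the feasible set is an interval containing $1$ and a bisection on the feasibility predicate returns its left endpoint. A minimal sketch (the function name is ours):

```python
import numpy as np

def averaging_accelerability(theta, theta0, tol=1e-10):
    """D(theta, theta0) = min{pi in [0,1] : ||theta - (1-pi)*theta0||_1 <= pi}.

    For theta in B_1 the feasible set of pi is an interval containing 1,
    so bisection on the predicate below is valid.
    """
    feasible = lambda pi: np.abs(theta - (1 - pi) * theta0).sum() <= pi
    lo, hi = 0.0, 1.0
    if feasible(lo):          # happens iff theta == theta0
        return 0.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if feasible(mid):
            hi = mid
        else:
            lo = mid
    return hi
```

Two sanity checks: $D(\theta, \theta) = 0$, and $D(\theta, 0) = \|\theta\|_1$, consistent with the Lemma B.1 upper-bound quoted above.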
We are now ready to state our regret bound, when Algorithm 1 is applied with an arbitrary approximation grid $\Theta_0$.

Theorem 3.2. Let $\Theta_0 \subset B_1$ be such that $\{\theta : \|\theta\|_1 = 1, \|\theta\|_0 = 1\} \subset \Theta_0$. Let Assumptions (A1) and (A2) be satisfied. Then Algorithm 1, applied with uniform prior $\hat\pi_0$ over the elements of $\Theta_0$ and $E = 8G/3$, satisfies with probability $1 - e^{-x}$, $x > 0$,

$$R_T(\theta) \lesssim \Big(\frac{a}{\alpha T}\Big)^{\frac{1}{2-\beta}} + G\, D(\theta, \Theta_0)\sqrt{\frac{a}{T}} + \frac{aG}{T}, \quad \theta \in B_1,$$

where $a = \ln\operatorname{Card}(\Theta_0) + \ln\ln(GT) + x$ and $D(\theta, \Theta_0) := \min_{\theta_0 \in \Theta_0} D(\theta, \theta_0)$.

Sketch of proof. The complete proof can be found in Appendix C.2. We give here the high-level ideas. Let $\theta_0 \in \Theta_0$ be a point of the grid minimizing $D(\theta, \theta_0)$. Then one can decompose $\theta = (1-\varepsilon)\theta_0 + \varepsilon\theta''$ for a unique point $\theta''$ with $\|\theta''\|_1 = 1$ and $\varepsilon := D(\theta, \theta_0)$; see Appendix C.2 for details. The regret bound can be decomposed into two terms:
– the first quantifies the cost of picking the correct $\theta_0 \in \Theta_0$, bounded using Theorem 2.1;
– the second is the cost of learning $\theta'' \in B_1$, rescaled by $\varepsilon$. Using a classical slow-rate bound in $B_1$, it is of order $O(1/\sqrt{T})$.
The average excess risk $R_T(\theta) \le (1-\varepsilon)R_T(\theta_0) + \varepsilon R_T(\theta'')$ is thus of order

$$\Big(\frac{\ln\operatorname{Card}(\Theta_0) + \ln\ln(GT) + x}{\alpha T}\Big)^{\frac{1}{2-\beta}} + \varepsilon\, G\sqrt{\frac{\ln\operatorname{Card}(\Theta_0)}{T}}.$$

Note that the bound of Theorem 3.2 is accelerable, as its second term vanishes, contrary to Inequality (3). Theorem 3.2 provides an upper-bound which may improve on the rate $O(1/\sqrt{T})$ if the distance $D(\theta, \Theta_0)$ is small enough. By using the properties of the averaging accelerability (see Lemma B.1 in Appendix B), Theorem 3.2 provides some interesting properties of the rate in terms of $\ell_1$ distance.
By including $0$ into the grid $\Theta_0$, we get an oracle bound of order $O(\|\theta\|_1/\sqrt{T})$ for any $\theta \in B_1$. Moreover, a bound of order $R_T(\theta) \le O\big(\|\theta - \theta_k\|_1/((1-\gamma)\sqrt{T})\big)$ is obtained for all $\theta_k \in \Theta_0$ and $\|\theta\|_1 \le 1 - \gamma < 1$. It is worth pointing out that the bound on the gradient $G$ can be substituted with the average gradient observed by the learner. The constant $G$ can be improved to the level of the noise in certain situations with vanishing gradients (see for instance Theorem 3 of [9]).

3.3 Fast-rate sparsity regret bound in the adversarial setting

In this section, we focus on the adversarial case where $\ell_t = E_{t-1}[\ell_t]$ are $\mu$-strongly convex deterministic functions. In this case, Assumption (A2) is satisfied with $\beta = 1$ and $\alpha = \mu/(2G^2)$. Our algorithm, called BOA+, is defined as follows. For each doubling session $i \ge 0$, BOA+ chooses $\hat\theta_t$ from time step $t_i = 2^i$ to $t_{i+1} - 1$ by restarting Algorithm 1 with uniform prior, parameter $E = 4G/3$ and updated discretization grid $\Theta_0$ indexed by $i$:

$$\Theta^{(i)} = \big\{[\theta_i^*]_k,\ k = 0, \dots, d\big\} \cup \big\{\theta : \|\theta\|_1 = 2, \|\theta\|_0 = 1\big\},$$

where $\theta_i^* \in \arg\min_{\theta \in B_1} \sum_{t=1}^{t_i - 1} \ell_t(\theta)$ is the empirical risk minimizer (or the leader) until time $t_i - 1$. The notation $[\,\cdot\,]_k$ denotes the hard truncation with $k$ non-zero values. Remark that $\theta_i^*$ for $i = 1, 2, \dots, \ln_2(T)$ can be efficiently computed approximately as the solution of a strongly convex optimization problem.

Theorem 3.3. Assume the loss functions are $\mu$-strongly convex on $B_2 := \{\theta \in \mathbb{R}^d : \|\theta\|_1 \le 2\}$ with gradients bounded by $G$ in $\ell_\infty$-norm on $B_2$. The average regret of BOA+ satisfies the oracle bound

$$R_T(\theta) \le \tilde O\Bigg(\min\Bigg\{G\sqrt{\frac{\ln d}{T}},\ \frac{\sqrt{\|\theta\|_0 d}\, G^2 \ln d}{\mu T},\ \sqrt{\frac{\|\theta\|_0}{\mu}}\Big(G\sqrt{\frac{\ln d}{T}}\Big)^{3/2},\ \frac{d\, G^2 \ln d}{\mu T}\Bigg\}\Bigg), \quad \theta \in B_1.$$

The proof is deferred to Appendix C.6. We emphasize that the bound can be rewritten as follows:

$$R_T(\theta) \le \tilde O\Bigg(\min\Bigg\{G\sqrt{\frac{\ln d}{T}},\ \frac{\sqrt{\|\theta\|_0 d}\, G^2 \ln d}{\mu T},\ \Big(\frac{\|\theta\|_0 G^2 \ln d}{\mu T}\, G\sqrt{\frac{\ln d}{T}}\Big)^{1/2}\Bigg\}\Bigg), \quad \theta \in B_1 \setminus \{0\}.$$

It provides an intermediate rate between the known optimal rates without sparsity, $O(\sqrt{\ln d/T})$ and $\tilde O(d/T)$, and the known optimal rates with sparsity, $O(\sqrt{\ln d/T})$ and (for non-efficient procedures only) $\tilde O(\|\theta\|_0/T)$. If all $\theta_i^*$ are approximately $d_0$-sparse, it is possible to achieve the optimal rate of order $\tilde O(d_0/T)$ for any $\|\theta\|_0 \le d_0$. We leave for future work whether it is possible to achieve it in general.

Remark 3.1. The strong convexity assumption on the loss functions can be relaxed (see Inequality (33) in the proof of Theorem 3.3) by assuming (A2) on $B_2$ and that there exist $\mu > 0$ and $\beta \in [0,1]$ such that for all $t \ge 1$ and $\theta \in B_1$,

$$\mu\|\theta - \theta_t^*\|_2^2 \le \Big(\frac{1}{t}\sum_{s=1}^t \big(\ell_s(\theta) - \ell_s(\theta_t^*)\big)\Big)^{\beta}, \quad \text{where } \theta_t^* \in \arg\min_{\theta \in B_1}\sum_{s=1}^t \ell_s(\theta). \qquad (4)$$

The rates will then depend on $\beta$, as in Theorem 2.1. A specific interesting case is when $\|\theta_t^*\|_1 = 1$. Then $\theta_t^*$ is very likely to be sparse. Denote by $S_t^*$ its support. Assumption (4) can be restricted in this case. Indeed, any $\theta \in B_1$ satisfies $\|\theta\|_1 \le \|\theta_t^*\|_1$, which from Lemma 6 of [1] yields $\|\theta - \theta_t^*\|_1 \le 2\|[\theta - \theta_t^*]_{S_t^*}\|_1$, where $[\theta]_S = (\theta_i \mathbb{1}_{i \in S})_{1 \le i \le d}$. One can restrict Assumption (4) to hold on $S_t^*$ only. Such restricted conditions for $\beta = 1$ are common in the sparse learning literature and essentially necessary for the existence of efficient and optimal sparse procedures; see [31]. For obtaining regret bounds on BOA+, the restricted condition (4) with $\beta = 1$ should hold at any time $t \ge 1$, which is unlikely in the regression setting.

3.4 Fast-rate sparse excess risk bound in the i.i.d.
setting

In this section, we assume the loss functions $\ell_t$ to be i.i.d. We provide an algorithm with a fast-rate sparsity risk bound on $B_1$ by regularly restarting Algorithm 1 with an updated discretization grid $\Theta_0$ approaching the set of minimizers $\Theta^* := \arg\min_{\theta \in B_1} \mathbb{E}[\ell_t(\theta)]$.

In the i.i.d. setting, a close inspection of the proof of Theorem 3.4 shows that we can replace Assumption (A2) with the Bernstein condition: there exist $\alpha' > 0$ and $\beta \in [0, 1]$ such that for all $\theta \in B_1$, all $\theta^* \in \Theta^*$ and all $t \ge 1$,
\[
\alpha' \, \mathbb{E}\Big[ \big( \nabla \ell_t(\theta)^\top (\theta - \theta^*) \big)^2 \Big] \;\le\; \mathbb{E}\Big[ \nabla \ell_t(\theta)^\top (\theta - \theta^*) \Big]^{\beta} \,. \tag{A2'}
\]
This fast-rate-type stochastic condition is equivalent to the central condition (see [25, Condition 5.2]) and was already considered to obtain faster rates of convergence for the regret (see [17, Condition 1]).

The Łojasiewicz assumption. In order to obtain sparse oracle inequalities, we work under Łojasiewicz's Assumption (A3), which is a relaxed version of strong convexity of the risk.

(A3) Łojasiewicz's inequality: $(\ell_t)_{t \ge 1}$ is an i.i.d. sequence and there exist $\beta \in [0, 1]$ and $0 < \mu \le 1$ such that, for all $\theta \in \mathbb{R}^d$ with $\|\theta\|_1 \le 1$, there exists $\theta^* \in \Theta^* \subset B_1$ satisfying
\[
\mu \, \|\theta - \theta^*\|_2^2 \;\le\; \mathbb{E}\big[ \ell_t(\theta) - \ell_t(\theta^*) \big]^{\beta} \,.
\]

This assumption is fairly mild. It is indeed satisfied with $\beta = 0$ and $\mu = 1$ as soon as the loss function is convex. For $\beta = 1$, this assumption is implied by the strong convexity of the risk $\mathbb{E}[\ell_t]$. Our framework is more general because:

- multiple optima are allowed, which seems to be new when combined with sparsity bounds. An exception is [21], which provides the optimal sparse rate under a low-rank Gram matrix setting for the non-efficient ES algorithm;
- contrary to [23] or [9], our framework does not compete with the minimizer $\theta^*$ over $\mathbb{R}^d$ with a known upper bound on the $\ell_1$-norm $\|\theta^*\|_1$. We consider the minimizer over the $\ell_1$-ball $B_1$ only. The latter is more likely to be sparse, and Assumption (A3) only needs to hold over $B_1$.

Assumptions (A2) (or (A2')) and (A3) are strongly related. Assumption (A3) is more restrictive because it is design dependent in the regression setting: the constant $\mu$ corresponds to the smallest non-zero eigenvalue of the covariance matrix, while $\alpha = 1/G^2$ for the square loss functions. If $\Theta^* = \{\theta^*\}$ is a singleton, then Assumption (A3) implies Assumption (A2') with $\alpha' \ge \mu/G^2$.

Algorithm and excess risk bound. Our new procedure, called SABOA, is described in Algorithm 2. Again, it starts from the accelerable bound provided in Theorem 3.2, which is small if one of the points in $\Theta_0$ is close to $\Theta^*$. As BOA+, SABOA restarts BOA while adding current estimators of $\Theta^*$ to an updated grid $\Theta_0$. The new points added to the grid differ slightly between the two algorithms: they are truncated versions of the average of past iterates $\hat\theta_{t-1}$ for SABOA, and of the leader for BOA+. Remark that restart schemes under Łojasiewicz's assumption are natural and were already used by [22]. We get the following upper bound on the average excess risk.
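For concreteness, the grid-update step shared by BOA+ and SABOA can be sketched in a few lines. This is a minimal illustration under our own naming (`hard_truncate`, `make_grid` are not from the paper), and the exact dilated soft-thresholding schedule of Equation (45), used by SABOA, is not reproduced here:

```python
import numpy as np

def hard_truncate(theta, k):
    """[theta]_k: keep the k largest-magnitude coordinates, zero out the rest."""
    out = np.zeros_like(theta)
    if k > 0:
        idx = np.argsort(np.abs(theta))[-k:]  # indices of the k largest |theta_i|
        out[idx] = theta[idx]
    return out

def make_grid(anchor, radius=1.0):
    """Discretization grid built from an anchor point (the leader for BOA+,
    the averaged iterate for SABOA): all truncations [anchor]_k for k = 0..d,
    plus the scaled l1-ball vertices {theta : ||theta||_1 = radius, ||theta||_0 = 1}.
    Note that k = 0 automatically puts the point 0 into the grid."""
    d = anchor.size
    grid = [hard_truncate(anchor, k) for k in range(d + 1)]
    for i in range(d):
        for sign in (1.0, -1.0):
            vertex = np.zeros(d)
            vertex[i] = sign * radius
            grid.append(vertex)
    return grid
```

The grid thus contains $d + 1$ truncated anchors and $2d$ vertices, matching the $O(d)$ cardinality used in the restart analyses.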
The proof that computes the exact constants is postponed to Appendix C.7.

Algorithm 2 SABOA - Sparse Acceleration of BOA
Parameters: $E > 0$
Initialization: $t_i = 2^i$ for $i \ge 0$.
For each session $i = 0, 1, \dots$ do:
- Define $\bar\theta^{(i-1)} := 0$ if $i = 0$ and $\bar\theta^{(i-1)} := 2^{-i+1} \sum_{t=t_{i-1}}^{t_i - 1} \hat\theta_{t-1}$ otherwise,
- Define $\Theta^{(i)}$, a set of hard-truncated and dilated soft-thresholded versions of $\bar\theta^{(i-1)}$ as in (45),
- Denote $K_i := \mathrm{Card}(\Theta^{(i)}) + 2d \le (i+1)(1 + \ln d) + 3d$,
- At time step $t_i$, restart Algorithm 1 on $K_i$ points with parameters $\Theta_0 := \Theta^{(i)} \cup \{\theta : \|\theta\|_1 = 1, \, \|\theta\|_0 = 1\}$ (denote by $\theta_1, \dots, \theta_{K_i}$ its elements), $E > 0$ and uniform prior $\hat\pi_0$.
In other words, for time steps $t = t_i, \dots, t_{i+1} - 1$:
- Choose $\hat\theta_{t-1} = \sum_{k=1}^{K_i} \hat\pi_{k,t-1} \theta_k$ and observe $\nabla \ell_t(\hat\theta_{t-1})$,
- Define component-wise for all $1 \le k \le K_i$, denoting $\eta_j := (e^j E)^{-1}$,
\[
\hat\pi_{k,t} \;=\; \frac{\sum_{j=1}^{\ln(ET^2)} \eta_j \, e^{\eta_j \sum_{s=t_i}^{t} (r_{k,s} - \eta_j r_{k,s}^2)} \, \hat\pi_{k,0}}{\sum_{j=1}^{\ln(ET^2)} \mathbb{E}_{k' \sim \hat\pi_0}\Big[ \eta_j \, e^{\eta_j \sum_{s=t_i}^{t} (r_{k',s} - \eta_j r_{k',s}^2)} \Big]} \,,
\]
where $r_{k,s} = \nabla \ell_s(\hat\theta_{s-1})^\top (\hat\theta_{s-1} - \theta_k)$.

Theorem 3.4. Under Assumptions (A1), (A2) and (A3), Algorithm 2 with $E = 4G/3 > 1$ satisfies, with probability at least $1 - e^{-x}$, $x > 0$, the average excess risk bound
\[
R_T(\theta^*) \;\lesssim\; \left( \Big( \frac{d_0^2 \, G^2}{\mu} \wedge \frac{d_0}{\alpha} \Big) \, \frac{\ln d + \ln \ln(GT) + x}{T} \right)^{\frac{1}{2 - \beta}},
\]
where $d_0 = \max_{\theta^* \in \Theta^*} \|\theta^*\|_0$ and $0 \le \gamma \le 1$ satisfies $\Theta^* \subset B_{1-\gamma}$.

We conclude with some important remarks about Theorem 3.4. First, we point out that SABOA adapts automatically to the unknown parameters $\beta$, $\gamma$, $\alpha$, $\mu$ and $d_0$ to fulfill the rate of Theorem 3.4.
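The inner weight update of Algorithm 2 can be sketched at the end of a session, once the cumulative linearized regrets against each grid point are available. This is a vectorized illustration under our own naming; the per-session bookkeeping (restarts, grid construction) is omitted:

```python
import numpy as np

def boa_weights(cum_r, cum_r2, pi0, E, T):
    """Second-order aggregation weights over the grid points (sketch).

    cum_r[k]  = sum_s r_{k,s},  cum_r2[k] = sum_s r_{k,s}**2, where
    r_{k,s} = grad l_s(theta_hat_{s-1}) . (theta_hat_{s-1} - theta_k)
    is the linearized instantaneous regret against grid point theta_k.
    Aggregates the grid of learning rates eta_j = (e^j E)^{-1}, j = 1..ln(E T^2).
    """
    J = max(1, int(np.log(E * T ** 2)))
    etas = 1.0 / (np.exp(np.arange(1, J + 1)) * E)  # eta_j = (e^j E)^{-1}
    # pi_k proportional to  sum_j eta_j * exp(eta_j * (cum_r_k - eta_j * cum_r2_k)) * pi0_k
    expo = etas[:, None] * (cum_r[None, :] - etas[:, None] * cum_r2[None, :])
    w = (etas[:, None] * np.exp(expo)).sum(axis=0) * pi0
    return w / w.sum()
```

With zero cumulative regrets the weights stay at the prior, and a grid point that outperformed the aggregated iterate (larger cumulative regret) receives more mass, as expected from the exponential-weights form.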
On the radius of the $\ell_1$-ball. We provide the analysis on $B_1$, the $\ell_1$-ball of radius $U = 1$, only. However, one might need to compare with points in $B_1(U)$, the $\ell_1$-ball of radius $U > 0$, in order to obtain a good approximation-estimation trade-off. This can be done by rescaling the loss functions $\theta \in B_1 \mapsto \ell_t(U\theta)$ and applying our results with $UG$, $U^2\mu$ and $\alpha$ under Assumptions (A1), (A2) and (A3) on $B_1(U)$. The main rate of convergence of Theorem 3.4 is unchanged. The optimal choice of the radius, if it is not imposed by the application, is left for future research.

Support recovery. When all $\theta^* \in \Theta^*$ lie on the border of the $\ell_1$-ball, they are likely to be sparse. One can relax Assumption (A3) to hold in sup-norm and in a restricted version, similarly to the end of Remark 3.1. In this interesting setting, we could not avoid a factor $d_0^2$. The reason is that our sequential algorithm recovers the (largest) support of $\theta^*$ (see Configuration 3 of Figure 1) in a framework where the Irrepresentability Condition [27], which is necessary for the rate $\|\theta^*\|_0$, does not hold.

Conclusion. In this paper, we show that BOA is an optimal online algorithm for aggregating experts under very weak conditions on the loss. We then aggregate sparse versions of the leader (BOA+) or of the average of BOA's iterates (SABOA), in the adversarial and in the i.i.d. setting, respectively. Aggregating both achieves sparse fast rates of convergence in any case. These rates are deteriorated compared with the ideal one, $\tilde O\big((\|\theta\|_0/T)^{1/(2-\beta)}\big)$, which requires restrictive assumptions for efficient algorithms. Our main condition (A3) is weaker and more realistic than the usual ones when seeking sequential sparse rate bounds holding at any time $t \ge 1$.

References

[1] A. Agarwal, S. Negahban, and M. J. Wainwright. Stochastic optimization and sparse statistical recovery: Optimal algorithms for high dimensions.
In Advances in Neural Information Processing Systems 25, pages 1538-1546. Curran Associates, Inc., 2012.

[2] J.-Y. Audibert. Progressive mixture rules are deviation suboptimal. In Advances in Neural Information Processing Systems, pages 41-48, 2008.

[3] F. Bunea, A. Tsybakov, and M. Wegkamp. Sparsity oracle inequalities for the lasso. Electronic Journal of Statistics, 1:169-194, 2007.

[4] O. Catoni. Universal aggregation rules with exact bias bounds. Preprint, 510, 1999.

[5] N. Cesa-Bianchi and G. Lugosi. Prediction, Learning, and Games. Cambridge University Press, 2006.

[6] N. Cesa-Bianchi, Y. Mansour, and G. Stoltz. Improved second-order bounds for prediction with expert advice. Machine Learning, 66(2-3):321-352, 2007.

[7] J. C. Duchi, S. Shalev-Shwartz, Y. Singer, and A. Tewari. Composite objective mirror descent. In COLT, pages 14-26, 2010.

[8] D. J. Foster, S. Kale, and H. Karloff. Online sparse linear regression. In Conference on Learning Theory, pages 960-970, 2016.

[9] P. Gaillard and O. Wintenberger. Sparse accelerated exponential weights. In 20th International Conference on Artificial Intelligence and Statistics (AISTATS), April 2017.

[10] S. Gerchinovitz. Prediction of individual sequences and prediction in the statistical framework: some links around sparse regression and aggregation techniques. PhD thesis, Université Paris-Sud 11, Orsay, 2011.

[11] S. Gerchinovitz. Sparsity regret bounds for individual sequences in online linear regression. The Journal of Machine Learning Research, 14(1):729-769, 2013.

[12] C. Giraud. Introduction to High-Dimensional Statistics. Chapman and Hall/CRC, 2014.

[13] E. Hazan. Introduction to online convex optimization. Foundations and Trends in Optimization, 2(3-4):157-325, 2016.

[14] S. Kale, Z. Karnin, T. Liang, and D. Pál.
Adaptive feature selection: Computationally efficient online sparse linear regression under RIP. arXiv preprint arXiv:1706.04690, 2017.

[15] J. Kivinen and M. K. Warmuth. Exponentiated gradient versus gradient descent for linear predictors. Information and Computation, 132(1):1-63, 1997.

[16] W. M. Koolen and T. van Erven. Second-order quantile methods for experts and combinatorial games. In COLT, volume 40, pages 1155-1175, 2015.

[17] W. M. Koolen, P. Grünwald, and T. van Erven. Combining adversarial guarantees and stochastic fast rates in online learning. In Advances in Neural Information Processing Systems, pages 4457-4465, 2016.

[18] J. Langford, L. Li, and T. Zhang. Sparse online learning via truncated gradient. Journal of Machine Learning Research, 10(Mar):777-801, 2009.

[19] N. Littlestone and M. K. Warmuth. The weighted majority algorithm. Information and Computation, 108(2):212-261, 1994.

[20] N. A. Mehta. Fast rates with high probability in exp-concave statistical learning. In Artificial Intelligence and Statistics, pages 1085-1093, 2017.

[21] P. Rigollet and A. Tsybakov. Exponential screening and optimal rates of sparse estimation. The Annals of Statistics, pages 731-771, 2011.

[22] V. Roulet and A. d'Aspremont. Sharpness, restart and acceleration. In Advances in Neural Information Processing Systems, pages 1119-1129, 2017.

[23] J. Steinhardt, S. Wager, and P. Liang. The statistics of streaming sparse regression. arXiv preprint arXiv:1412.4182, 2014.

[24] I. Steinwart and A. Christmann. Estimating conditional quantiles with the help of the pinball loss. Bernoulli, 17(1):211-225, 2011.

[25] T. van Erven, P. D. Grünwald, N. A. Mehta, M. D. Reid, and R. C. Williamson. Fast rates in statistical and online learning. Journal of Machine Learning Research, 16:1793-1861, 2015.

[26] V. G. Vovk.
Aggregating strategies. In Proceedings of Computational Learning Theory, 1990.

[27] M. J. Wainwright. Sharp thresholds for high-dimensional and noisy sparsity recovery using $\ell_1$-constrained quadratic programming (Lasso). IEEE Transactions on Information Theory, 55(5):2183-2202, 2009.

[28] O. Wintenberger. Optimal learning with Bernstein online aggregation. Machine Learning, 106(1):119-141, 2017.

[29] L. Xiao. Dual averaging methods for regularized stochastic learning and online optimization. Journal of Machine Learning Research, 11(Oct):2543-2596, 2010.

[30] Y. Yang. Combining forecasting procedures: some theoretical results. Econometric Theory, 20(01):176-222, 2004.

[31] Y. Zhang, M. J. Wainwright, and M. I. Jordan. Lower bounds on the performance of polynomial-time algorithms for sparse linear regression. In Conference on Learning Theory, pages 921-948, 2014.

[32] S. Łojasiewicz. Une propriété topologique des sous-ensembles analytiques réels. Les équations aux dérivées partielles, pages 87-89, 1963.

[33] S. Łojasiewicz. Sur la géométrie semi- et sous-analytique. Annales de l'institut Fourier, 43(5):1575-1595, 1993.