{"title": "Online Improper Learning with an Approximation Oracle", "book": "Advances in Neural Information Processing Systems", "page_first": 5652, "page_last": 5660, "abstract": "We study the following question: given an efficient approximation algorithm for an optimization problem, can we learn efficiently in the same setting? We give a formal affirmative answer to this question in the form of a reduction from online learning to offline approximate optimization using an efficient algorithm that guarantees near optimal regret. The algorithm is efficient in terms of the number of oracle calls to a given approximation oracle \u2013 it makes only logarithmically many such calls per iteration. This resolves an open question by Kalai and Vempala, and by Garber. Furthermore, our result applies to the more general improper learning problems.", "full_text": "Online Improper Learning with an Approximation\n\nOracle\u21e4\n\nElad Hazan\n\nPrinceton University & Google AI Princeton\n\nehazan@cs.princeton.edu\n\nWei Hu\n\nPrinceton University\n\nhuwei@cs.princeton.edu\n\nYuanzhi Li\n\nStanford University\n\nyuanzhil@stanford.edu\n\nZhiyuan Li\n\nPrinceton University\n\nzhiyuanli@cs.princeton.edu\n\nAbstract\n\nWe study the following question: given an ef\ufb01cient approximation algorithm for an\noptimization problem, can we learn ef\ufb01ciently in the same setting? We give a formal\naf\ufb01rmative answer to this question in the form of a reduction from online learning\nto of\ufb02ine approximate optimization using an ef\ufb01cient algorithm that guarantees\nnear optimal regret. The algorithm is ef\ufb01cient in terms of the number of oracle calls\nto a given approximation oracle \u2013 it makes only logarithmically many such calls\nper iteration. 
This resolves an open question by Kalai and Vempala, and by Garber. Furthermore, our result applies to the more general improper learning problems.

1 Introduction

A fundamental question in learning theory is whether one can efficiently learn a given problem using an optimization oracle. Namely, does efficient offline optimization for a certain problem imply an efficient learning algorithm for the same setting?

For online learning in games, it was shown by Kalai and Vempala (2005) that an optimization oracle giving the best decision in hindsight is sufficient for attaining optimal regret. However, in many non-convex settings, such an optimization oracle is either unavailable or NP-hard to compute. In the face of NP-hardness, algorithm designers resort to approximation algorithms that are guaranteed to return a solution within a certain multiplicative factor of the optimum. We give numerous examples in Section 1.2.

Kakade et al. (2009) considered the question of whether such an approximation algorithm is sufficient to obtain vanishing regret compared with an approximation to the best solution in hindsight. They gave an algorithm for this offline-to-online conversion. However, their reduction is inefficient in the number of per-iteration queries to the approximation oracle, which grows linearly with time. Ideally, an efficient reduction should call the oracle only a constant number of times per iteration and guarantee optimal regret at the same time; this was considered an open question in the literature. Various authors have improved upon this original offline-to-online reduction in certain cases, as we survey below. Recently, Garber (2017) made significant progress by giving a more efficient reduction, which improves the number of oracle calls in both the full information and bandit settings. 
He explicitly asked whether a near-optimal reduction with only logarithmically many calls per iteration exists.

* The full version of this paper can be found at https://arxiv.org/abs/1804.07837.

32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montréal, Canada.

1.1 Problem Setting and Our Results

In this paper we resolve this question on the positive side and in a more general setting, which we formally define now.

Formal description of problem setting. We consider the standard setting of online linear optimization, which is known to generalize statistical learning (Hazan, 2016; Shalev-Shwartz, 2012). In a repeated game, in round t a player chooses a point x_t from a decision set K ⊆ R^d while an adversary chooses a loss vector f_t ∈ R^d, which determines the loss f_t^⊤ x_t of the player in this round. The loss vector f_t is revealed to the player after her choice x_t is made. We sometimes treat f_t as a function on R^d, i.e., f_t(x) := f_t^⊤ x.

Since we consider computationally intractable problems like maximum cut or minimum-rank matrix completion, we assume that the player has access to an offline optimization oracle. This oracle may return a point which does not belong to the target set K*, but rather to a different set K. For example, in matrix completion the oracle may return a low-trace-norm matrix rather than a low-rank matrix. This notion is formally captured by an optimization oracle O_{K,K*}. 
Given an input v ∈ R^d, this oracle outputs a point O_{K,K*}(v) ∈ K which dominates all points in K* in the direction v, that is,

    v^⊤ O_{K,K*}(v) ≤ min_{x*∈K*} v^⊤ x*.

The goal for the player is to minimize her regret, which is the difference between her cumulative loss and that of the best single decision (in K*) in hindsight:

    Reg_{K,K*}(T) := Σ_{t=1}^T f_t^⊤ x_t − min_{x*∈K*} Σ_{t=1}^T f_t^⊤ x*.

We remark that the above problem setting is similar in spirit to the notion of improper learning, where one is allowed to output a hypothesis not from the target set. Therefore, we view the problem setting described above as an online version of improper learning.

In the special case of K* = αK (α > 1), O_{K,αK} becomes an α-approximation oracle on K, and the setting and the notion of regret coincide with those studied in (Kakade et al., 2009; Garber, 2017), i.e., Reg_{K,αK}(T) := Σ_{t=1}^T f_t^⊤ x_t − α min_{x∈K} Σ_{t=1}^T f_t^⊤ x. This is called the α-regret.

Our results. In this setting, we give two different algorithms, one based on the online mirror descent (OMD) method and another based on the continuous multiplicative weight update (CMWU) algorithm. Both of them give nearly optimal regret as well as oracle efficiency, while applying to general loss vectors. Our results are summarized in Table 1 below. We present these two algorithms and their guarantees in Section 3 and Appendix B.

Algorithm               Regret over T rounds    Oracle calls per round    Loss vectors
Kakade et al. (2009)    O(√T)                   O(T)                      general
Garber (2017)           O(√T)                   Õ(√T)                     non-negative
Alg. 1 (this paper)     O(√T)                   O(log T)                  PNIP property (Def. 2.4)
Alg. 6 (this paper)     Õ(√T)                   O(log T)                  general

Table 1: Summary of results in the full information setting. The Õ notation hides constant and logarithmic factors.

Algorithm               Regret over T rounds    Oracle calls in T rounds    Loss vectors
Kakade et al. (2009)    O(T^{2/3})              O(T^{4/3})                  general
Garber (2017)           O(T^{2/3})              Õ(T)                        non-negative
Alg. 3 (this paper)     O(T^{2/3})              Õ(T^{2/3})                  non-negative

Table 2: Summary of results in the bandit setting.

In addition to these two algorithms, we give an improved result in the bandit setting. In this more difficult setting, the player cannot observe f_t, but rather only the loss she has suffered, namely the scalar f_t^⊤ x_t. We show how to extend our mirror descent-based algorithm to the bandit setting and obtain the same O(T^{2/3}) regret as in (Kakade et al., 2009; Garber, 2017), but with a significantly lower computational cost. See Table 2 for a comparison. We present our bandit result in Section 4.

1.2 Applications

The setting of online learning with approximation algorithms has been well studied since (Kalai and Vempala, 2005), with numerous applications.

For example, in the online max-cut problem, a learner iteratively predicts a cut over a set of vertices V, and afterwards the adjacency information for two vertices is revealed. The loss is zero or one, depending on whether the learner correctly predicted the connectivity of the two vertices. The offline version of this problem is NP-hard, but admits SDP-based approximation algorithms such as the famous 0.878-approximation by Goemans and Williamson (1995). Our results imply an online algorithm that can predict as accurately as the best 0.878-approximation to the maximum cut in hindsight, and calls the SDP relaxation only logarithmically many times per iteration.

Numerous other examples exist for combinatorial graph optimization problems such as the traveling salesman problem, sparsest graph cut, etc. 
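To make the α-regret benchmark concrete, here is a minimal bookkeeping sketch. It is our own illustration with made-up per-round losses and a generic factor α = 2 (the names `alpha_regret` and `rounds` are ours, not the paper's): the learner's cumulative loss is compared against α times the loss of the best fixed decision in hindsight.

```python
# Sketch of alpha-regret bookkeeping (hypothetical data, not the paper's
# algorithm). Each round maps candidate decisions to their losses, plus the
# loss the learner actually suffered under the key "played".
def alpha_regret(rounds, alpha):
    played = sum(r["played"] for r in rounds)
    decisions = [k for k in rounds[0] if k != "played"]
    # loss of the best single decision in hindsight
    best = min(sum(r[d] for r in rounds) for d in decisions)
    return played - alpha * best

rounds = [
    {"played": 1.0, "cut_a": 0.0, "cut_b": 1.0},
    {"played": 0.0, "cut_a": 1.0, "cut_b": 0.0},
    {"played": 1.0, "cut_a": 0.0, "cut_b": 1.0},
]
# learner's total loss is 2.0; the best fixed decision (cut_a) has loss 1.0
print(alpha_regret(rounds, alpha=2.0))  # 2.0 - 2.0 * 1.0 = 0.0
```

With α = 1 this reduces to the usual regret; an α-approximation oracle only makes the weaker α-scaled benchmark attainable.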
Other applications include prominent machine learning problems whose offline optimization problem is NP-hard, for example, matrix completion and recommendation systems. The reader is referred to (Kakade et al., 2009; Garber, 2017) for a more detailed exposition of applications.

1.3 Related Work

The reduction from online learning to offline approximation algorithms was already considered by Kalai and Vempala (2005). Their scheme, based on the follow-the-perturbed-leader (FTPL) algorithm, requires a very strong approximation guarantee from the approximation oracle, namely a fully polynomial time approximation scheme (FPTAS), and requires an approximation that improves with time. Balcan and Blum (2006) used the same approach in the context of mechanism design. Kalai and Vempala (2005) also proposed a specialized reduction that works under certain conditions on the approximation oracle, satisfied by some known algorithms for problems such as MAX-CUT. Fujita et al. (2013) further gave more general reductions that apply to problems whose approximation algorithms are based on convex relaxations of mathematical programs. Their scheme is also based on the FTPL method.

Recent advances in black-box online-to-offline reductions were made in (Kakade et al., 2009; Dudík et al., 2016; Garber, 2017). Hazan and Koren (2016) showed that efficient reductions are in general impossible, unless special structure is present. In the settings we consider, this special structure is a linear cost function over the space.

Our algorithms fall into one of two templates. The first is the online mirror descent method, which is an adaptive version of the follow-the-regularized-leader (FTRL) algorithm. The second is the continuous multiplicative weight update method, which dates back to Cover's portfolio selection method (Cover, 1991) and Vovk's aggregating algorithm (Vovk, 1990). 
The reader is referred to the books (Cesa-Bianchi and Lugosi, 2006; Shalev-Shwartz, 2012; Hazan, 2016) for details and background on these prediction frameworks. We also make use of polynomial-time algorithms for sampling from log-concave distributions (Lovász and Vempala, 2007).

2 Preliminaries

For x ∈ R^d and r > 0, denote by B(x, r) the Euclidean ball in R^d of radius r centered at x. For S, S′ ⊆ R^d, λ ∈ R, x ∈ R^d and A ∈ R^{d′×d}, define S + S′ := {x + x′ : x ∈ S, x′ ∈ S′}, λS := {λx : x ∈ S}, x + S := {x + y : y ∈ S}, and AS := {Ax : x ∈ S}. The convex hull of S ⊆ R^d is denoted by CH(S). Denote by Vol(S) the volume (Lebesgue measure) of a set S ⊆ R^d. Denote by Δ_{k−1} the probability simplex in R^k.

A set C ⊆ R^d is called a cone if for any λ ≥ 0 we have λC ⊆ C. For any S ⊆ R^d, define the dual cone of S as S* := {y ∈ R^d : x^⊤ y ≥ 0, ∀x ∈ S}. S* is always a convex cone, even when S is neither convex nor a cone. For any closed set S ⊆ R^d, define Π_S : R^d → S to be the projection onto S, namely Π_S(x) := argmin_{x′∈S} ‖x′ − x‖_2.

Algorithm 1 Online Mirror Descent using a Projection-and-Separation Oracle
Input: learning rate η > 0, tolerance ε > 0, regularizer φ, convex cone W, time horizon T ∈ N+
1: y_1 ← argmin_{y∈Dom(φ)} φ(y)
2: for t = 1 to T do
3:   (x_t, V = (v_1, . . . , v_k), p) ← PAD(y_t, ε, W, φ)
4:   Play x̃_t = v_i with probability p_i (i ∈ [k]), and observe the loss vector f_t
5:   ∇φ(y_{t+1}) ← ∇φ(x_t) − η f_t
6: end for

Definition 2.1. A strictly convex function f : A → R (A ⊆ R^d is convex) is Legendre if ∇f is continuous in int(A) and for any sequence x_1, x_2, . . . ∈ A converging to a boundary point of A, lim_{n→∞} ‖∇f(x_n)‖ = ∞.

Definition 2.2. For a Legendre function φ : A → R, the Bregman divergence with respect to φ is defined as D_φ(x, y) := φ(x) − φ(y) − ∇φ(y)^⊤(x − y) (∀x, y ∈ A).

Lemma 2.3 (Generalized Pythagorean theorem, see e.g. Lemma 11.3 in (Cesa-Bianchi and Lugosi, 2006)). For any closed convex set S ⊆ R^d, x ∈ R^d, y ∈ S, and any Legendre function φ : R^d → R, letting z = argmin_{x′∈S} D_φ(x′, x), we must have D_φ(y, x) ≥ D_φ(y, z) + D_φ(z, x).

Definition 2.4 (Pairwise non-negative inner product). For a twice-differentiable Legendre function φ : A → R with domain A ⊆ R^d and a convex cone W ⊆ R^d, we say (φ, W) satisfies the pairwise non-negative inner product (PNIP) property if for all w, w′ ∈ W and H ∈ CH(ℋ), where ℋ = {∇²φ(x) : x ∈ A}, it holds that w^⊤ H^{−1} w′ ≥ 0.

Examples. (φ, W) satisfies the PNIP property if:

1. φ(x) = ½‖x‖², x ∈ R^d, and W ⊆ W*, such as the non-negative orthant R^d_+, the positive semidefinite matrix cone, and the Lorentz cone L^{d+1} = {(x, z) ∈ R^d × R : ‖x‖_2 ≤ z};

2. φ(x) = Σ_{i=1}^d x_i(log x_i − 1) (with domain R^d_+) and W = R^d_+;

3. φ(x) = ½ x^⊤ Q^{−1} x (with domain R^d), where Q = M M^⊤, M ∈ R^{d×d} is an invertible matrix, and W = (M^⊤)^{−1} R^d_+. This is useful in our bandit algorithm in Section 4.

Log-concave distributions. A distribution over R^d with a density function f is log-concave if log(f) is a concave function. For a convex set S equipped with a membership oracle, there exist polynomial-time algorithms for sampling from any log-concave distribution over S (Lovász and Vempala, 2007). This can be used to approximately compute the mean of any log-concave distribution. For ease of presentation, we will assume that we can compute the mean of log-concave distributions with bounded support exactly. 
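As a quick numerical sanity check of the Euclidean example of the PNIP property (our own illustration, not part of the paper): for φ(x) = ½‖x‖² every Hessian is the identity, so the condition w^⊤ H^{−1} w′ ≥ 0 reduces to w^⊤ w′ ≥ 0, which holds for any two vectors in the non-negative orthant.

```python
import numpy as np

# Toy check of the PNIP property (Definition 2.4) for phi(x) = ||x||^2 / 2:
# CH(H) = {I}, so w^T H^{-1} w' >= 0 reduces to w^T w' >= 0, which is
# automatic when w, w' lie in the non-negative orthant R^3_+.
rng = np.random.default_rng(0)
H_inv = np.eye(3)  # inverse Hessian of the Euclidean regularizer
ok = all(
    w @ H_inv @ wp >= 0
    for w, wp in (rng.random((2, 3)) for _ in range(100))  # random pairs in R^3_+
)
print(ok)
```

The same one-line check with `H_inv` replaced by the inverse Hessian of another regularizer is a cheap way to probe whether a candidate (φ, W) pair might satisfy PNIP before attempting a proof.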
Detailed explanation is provided in Appendix D.

3 Mirror Descent with an Approximation Oracle

In this section, we give an efficient online improper linear optimization algorithm (Algorithm 1) in the full information setting based on online mirror descent (OMD) equipped with a strongly convex regularizer φ, which achieves O(√T) regret when the regularizer φ and the domain of linear loss functions W satisfy the pairwise non-negative inner product (PNIP) property (Definition 2.4).

We suppose K, K* ⊆ B(0, R), and the loss vectors {f_t} come from a convex cone W ⊆ R^d with ‖f_t‖ ≤ L (R, L > 0). Omitted proofs in this section are given in Appendix A.

Theorem 3.1. Suppose (φ, W) satisfies the PNIP property (Definition 2.4). Then for any ε, η > 0, Algorithm 1 satisfies the following regret guarantee:

    ∀x* ∈ K*:  E[ Σ_{t=1}^T (f_t(x̃_t) − f_t(x*)) ] ≤ (1/η) ( φ(x*) − φ(y_1) + Σ_{t=1}^T D_φ(x_t, y_{t+1}) ) + εLT.

In particular, if φ is μ-strongly convex and A ≥ max_{x*∈K*}(φ(x*) − φ(y_1)), setting ε = R/T and η = (1/L)√(2μA/T), we have

    ∀x* ∈ K*:  E[ Σ_{t=1}^T (f_t(x̃_t) − f_t(x*)) ] ≤ L√(2AT/μ) + LR,

and in this case, Algorithm 1 makes at most ⌈5d log(6√T + (4/R)√(A/μ) + 4ηT)⌉ calls of O_{K,K*} per round.

For the problem of α-regret minimization using an α-approximation oracle, we have the following regret guarantee, which is an immediate corollary of Theorem 3.1.

Corollary 3.2. If W ⊆ R^d_+, K ⊆ B(0, R), K* = αK, and φ(x) = ½‖x‖², then setting ε = αR/T and η = αR/(L√T), Algorithm 1 has the following regret guarantee:

    ∀x* ∈ K:  E[ Σ_{t=1}^T f_t(x̃_t) − α Σ_{t=1}^T f_t(x*) ] = E[ Σ_{t=1}^T f_t(x̃_t) − Σ_{t=1}^T f_t(αx*) ] ≤ αLR(√T + 1).

Algorithm 1 is a variant of the OMD algorithm that makes use of a projection-and-decomposition (PAD) oracle, defined as follows:

Definition 3.3 (Projection-and-decomposition oracle). A projection-and-decomposition (PAD) oracle onto K*, PAD(y, ε, W, φ), is defined as a procedure that given y ∈ R^d, ε > 0, a convex cone W and a Legendre function φ produces a tuple (y′, V, p), where y′ ∈ R^d, V = (v_1, . . . , v_k) ∈ R^{d×k} and p = (p_1, . . . , p_k)^⊤ ∈ Δ_{k−1}, such that:

1. y′ is "closer" to K* than y with respect to the Bregman divergence of φ (and hence is an "infeasible projection"): ∀x* ∈ K*, D_φ(x*, y′) ≤ D_φ(x*, y);

2. v_1, . . . , v_k ∈ K, and Σ_{i=1}^k p_i v_i is a point that "almost dominates" y′ in all directions in W. In other words, there exists c ∈ W* such that ‖Σ_{i=1}^k p_i v_i + c − y′‖ ≤ ε.

The purpose of the PAD oracle is the following. Suppose the OMD algorithm tells us to play a point y. Since y might not be in the feasible set K, we can call the PAD oracle to find another point y′ as well as a distribution p over points v_1, . . . , v_k ∈ K. The first property in Definition 3.3 is sufficient to ensure that playing y′ also gives low regret, and the second property further ensures that we have a distribution of points in K that suffers less loss than y′ for every possible loss function, so we can play according to that distribution.

Assuming the availability of a PAD oracle, one can use a standard analysis of OMD to prove a regret bound for Algorithm 1 as in Theorem 3.1. 
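For intuition, note that with the Euclidean regularizer φ(x) = ½‖x‖² the update ∇φ(y_{t+1}) = ∇φ(x_t) − ηf_t is simply y_{t+1} = x_t − ηf_t. The sketch below is our own toy stand-in, not the paper's construction: it replaces the PAD oracle with an exact Euclidean projection onto a unit ball and plays x_t itself instead of a decomposition, so it illustrates only the mirror-descent skeleton, not the oracle-efficient PAD machinery.

```python
import numpy as np

# Toy OMD skeleton with phi(x) = ||x||^2 / 2 on the unit ball (a stand-in:
# Algorithm 1 instead builds an infeasible projection + decomposition from
# the approximation oracle O_{K,K*}).
def project_ball(y, r=1.0):
    n = np.linalg.norm(y)
    return y if n <= r else (r / n) * y

def omd(loss_vectors, eta):
    y = np.zeros(2)          # y_1 = argmin phi
    total = 0.0
    for f in loss_vectors:
        x = project_ball(y)  # exact projection in place of PAD
        total += float(f @ x)
        y = x - eta * f      # grad phi(y_{t+1}) = grad phi(x_t) - eta * f_t
    return total

fs = [np.array([1.0, 0.0])] * 10
print(omd(fs, eta=0.5))  # -8.5: the iterates drift toward x* = (-1, 0)
```

The PAD oracle replaces the exact projection here with a Bregman "infeasible projection" plus a convex decomposition over oracle outputs, which is what keeps the per-round oracle cost logarithmic.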
The proof is given in Appendix A.

Next we show how to construct a PAD oracle using the optimization oracle O_{K,K*}. Our construction is given in Algorithm 2. Theorem 3.4 gives its guarantee.

Theorem 3.4. Suppose (φ, W) satisfies the PNIP condition (Definition 2.4) and φ is μ-strongly convex. Then for any y ∈ R^d and ε ∈ (0, R], Algorithm 2 terminates in ⌈5d log((4R + 2√(2 min_{x*∈K*} D_φ(x*, y)/μ))/ε)⌉ iterations, and it correctly implements the projection-and-decomposition oracle PAD(y, ε, W, φ), i.e., its output (y′, V, p) satisfies the two properties in Definition 3.3.

Remark. We can use random walk methods to compute a 1/T-approximation of the center of gravity (line 3 in Algorithm 2) in poly(T) time, which is enough for the purpose of bounding regret. We can also replace the center of gravity method with the ellipsoid method, or any other optimization method with a similar "optimization interface" (i.e., any method that is based on separation queries and guarantees similar bounds on the number of iterations required to find a feasible point), as pointed out by Garber (2017). Specifically, using the ellipsoid method, we can significantly reduce the computational complexity to depend only polynomially on log T (rather than T), at the cost of a slightly higher oracle complexity, namely O(d² log T) calls to the oracle per round. We choose center of gravity over other optimization methods only because it has the best oracle complexity, which is the main focus of this paper.

Algorithm 2 Projection-and-Decomposition Oracle, PAD(y, ε, W, φ)
Input: point y ∈ R^d, tolerance ε > 0, convex cone W, regularizer φ
Output: (y′, V, p), where y′ ∈ R^d, V = (v_1, . . . , v_k) ∈ R^{d×k} for some k such that v_i ∈ K (∀i ∈ [k]), and p = (p_1, . . . , p_k)^⊤ ∈ Δ_{k−1}
1: W_1 ← W ∩ B(0, 1),  z_1 ← y,  i ← 0
2: while i < 5d log(2(R + ‖z_{i+1}‖)/ε) do
3:   i ← i + 1,  w_i ← ∫_{W_i} w dw / Vol(W_i),  v_i ← O_{K,K*}(w_i)
4:   z_{i+1} ← argmin_{z∈R^d : w_i^⊤(z − v_i) ≥ 0} D_φ(z, z_i),  W_{i+1} ← W_i ∩ {w ∈ R^d : w^⊤(v_i − z_{i+1}) ≥ 0}
5: end while
6: k ← i and solve min_{p∈Δ_{k−1}, c∈W*} ‖Σ_{i=1}^k p_i v_i + c − z_{k+1}‖ to get p
7: return y′ = z_{k+1}, V = (v_1, . . . , v_k), p

We break the proof of Theorem 3.4 into several lemmas.

Lemma 3.5. If (φ, W) satisfies the PNIP condition (Definition 2.4), then z_1, . . . , z_{k+1} computed in Algorithm 2 satisfy z_{i+1} − z_i ∈ W* for all i ∈ [k].

Proof. Since z_{i+1} = argmin_{z∈R^d : w_i^⊤(z − v_i) ≥ 0} D_φ(z, z_i), by the KKT condition we have

    0 = ∂/∂z [ D_φ(z, z_i) − λ w_i^⊤(z − v_i) ] |_{z=z_{i+1}} = ∇φ(z_{i+1}) − ∇φ(z_i) − λ w_i

for some λ ≥ 0. On the other hand, note that ∇φ(z_{i+1}) − ∇φ(z_i) = ∫_0^1 ∇²φ(τ z_{i+1} + (1 − τ) z_i) · (z_{i+1} − z_i) dτ = H(z_{i+1} − z_i) for some H ∈ CH(ℋ), where ℋ = {∇²φ(x) : x ∈ Dom(φ)}. Therefore, for all w ∈ W we have w^⊤(z_{i+1} − z_i) = w^⊤ H^{−1} H (z_{i+1} − z_i) = λ w^⊤ H^{−1} w_i ≥ 0. This means z_{i+1} − z_i ∈ W*.

Lemma 3.6. Under the setting of Theorem 3.4, Algorithm 2 terminates in at most ⌈5d log((4R + 2√(2 min_{x*∈K*} D_φ(x*, y)/μ))/ε)⌉ iterations.

Proof. According to the algorithm, for each i, z_{i+1} is the Bregman projection of z_i onto a half-space containing K*, since the oracle O_{K,K*} ensures w_i^⊤ v_i ≤ w_i^⊤ x* for all x* ∈ K*. Then by the generalized Pythagorean theorem (Lemma 2.3) we know D_φ(x*, z_{i+1}) ≤ D_φ(x*, z_i) for all x* ∈ K* and i. Therefore D_φ(x*, z_i) ≤ D_φ(x*, z_1) = D_φ(x*, y) for all x* ∈ K* and i.

Let P := min_{x*∈K*} D_φ(x*, y). Then there exists x* ∈ K* such that P = D_φ(x*, y) ≥ D_φ(x*, z_i) ≥ (μ/2)‖x* − z_i‖² for all i, where the last inequality is due to the μ-strong convexity of φ. This implies ‖z_i‖ ≤ ‖x*‖ + √(2P/μ) ≤ R + √(2P/μ) for all i. Therefore, when i ≥ 5d log((4R + 2√(2P/μ))/ε), we must have i ≥ 5d log(2(R + ‖z_{i+1}‖)/ε), which means the loop must have terminated by this time.

Lemma 3.7. Under the setting of Theorem 3.4, for every w ∈ W with ‖w‖ = 1, there exists i ∈ [k] such that w^⊤(v_i − y′) ≤ ε.

Proof. Assume for contradiction that there exists a unit vector h ∈ W such that min_{i∈[k]} h^⊤(v_i − y′) > ε. Note that ‖v_i − y′‖ ≤ ‖v_i‖ + ‖y′‖ ≤ R + ‖y′‖. Letting r := ε/(2(R + ‖y′‖)), we have ∀w ∈ h/2 + (W ∩ B(0, r)): min_{i∈[k]} w^⊤(v_i − y′) > 0. Since r ≤ 1/2 for ε ≤ R, we have h/2 + (W ∩ B(0, r)) ⊆ h/2 + (W ∩ B(0, 1/2)) ⊆ W ∩ B(0, 1) = W_1.

By the algorithm, for all w ∈ W_1 \ W_{k+1} there exists i ∈ [k] such that w^⊤(v_i − z_{i+1}) ≤ 0. Notice that from Lemma 3.5 we know z_{j+1} − z_j ∈ W* for all j ∈ [k]. Thus for all w ∈ W_1 \ W_{k+1} there exists i ∈ [k] such that w^⊤(v_i − y′) = w^⊤(v_i − z_{k+1}) ≤ w^⊤(v_i − z_{i+1}) ≤ 0. In other words, ∀w ∈ W_1 \ W_{k+1}: min_{i∈[k]} w^⊤(v_i − y′) ≤ 0.

Therefore, we must have h/2 + (W ∩ B(0, r)) ⊆ W_{k+1}. We also have Vol(W_{i+1}) ≤ (1 − 1/(2e)) Vol(W_i) for each i ∈ [k] from Lemma D.2, since W_{i+1} is the intersection of W_i with a half-space that does not contain W_i's centroid w_i in the interior. Then we have

    Vol(W_1) = Vol(W ∩ B(0, 1)) = r^{−d} Vol(W ∩ B(0, r)) ≤ r^{−d} Vol(W_{k+1}) ≤ r^{−d} (1 − 1/(2e))^k Vol(W_1) < Vol(W_1),

where the last step is due to k ≥ 5d log(1/r) = 5d log(2(R + ‖y′‖)/ε) = 5d log(2(R + ‖z_{k+1}‖)/ε), which holds according to the termination condition of the loop. Therefore we have a contradiction.

The following lemma is a more general version of Lemma 6 in (Garber, 2017).

Lemma 3.8. Given v_1, . . . , v_k ∈ R^d, ε ≥ 0 and a convex cone W ⊆ R^d, for any x ∈ R^d the following two statements are equivalent:

(A) There exist p = (p_1, . . . , p_k)^⊤ ∈ Δ_{k−1} and c ∈ W* such that ‖Σ_{i=1}^k p_i v_i + c − x‖ ≤ ε.

(B) For all w ∈ W with ‖w‖ = 1, there exists i ∈ [k] such that w^⊤(v_i − x) ≤ ε.

[Figure 1: Geometric interpretation of Lemma 3.8. (a) A convex cone W and its dual cone W*. (b) An example of CH(V) + W*, where V = {v_i}_{i=1}^4.]

Geometric interpretation of Lemma 3.8. We defer the proof of Lemma 3.8 to Appendix A, and discuss its geometric intuition here. For simplicity of illustration, we only consider ε = 0 here (Figure 1). First we look at the case where W = R^d, W* = {0}. In this case the lemma simply degenerates to the fact

    x ∈ CH({v_i}_{i=1}^k)  ⟺  there is no hyperplane that separates x and all the v_i's.

In the general case where W ⊆ R^d is an arbitrary convex cone, Lemma 3.8 becomes

    x ∈ CH({v_i}_{i=1}^k) + W*  ⟺  there is no direction w ∈ W such that w^⊤ x < w^⊤ v_i for all i.

Denote F := CH({v_i}_{i=1}^k) + W*. For the "⇒" side, if x ∈ F, it is clear that for all w ∈ W we must have w^⊤ x ≥ w^⊤ v_i for some i. For the "⇐" side, if x ∉ F, then w = Π_F(x) − x satisfies w^⊤ x < w^⊤ v_i for all i. Moreover it is easy to see Π_F(x) − x ∈ W, which completes the proof.

Theorem 3.4 can be proved now using the above lemmas.

Proof of Theorem 3.4. The upper bound on the number of iterations is proved in Lemma 3.6. In the proof of Lemma 3.6, we have shown D_φ(x*, z_{i+1}) ≤ D_φ(x*, z_i) for all x* ∈ K* and i. This implies D_φ(x*, y′) = D_φ(x*, z_{k+1}) ≤ D_φ(x*, z_k) ≤ · · · ≤ D_φ(x*, z_1) = D_φ(x*, y) for all x* ∈ K*, which verifies the first property in Definition 3.3. 
The second property is a direct consequence of combining Lemmas 3.7 and 3.8.

Algorithm 3 Online Stochastic Mirror Descent with Barycentric Regularization
Input: learning rate η > 0, tolerance ε > 0, {q_1, . . . , q_d} — a β-BS(K) for some β > 0, exploration probability γ ∈ (0, 1), time horizon T ∈ N+
1: Instantiate Algorithm 1 with parameters η, ε, φ(x) = ½ x^⊤ Q^{−1} x, W′ = (M^⊤)^{−1} R^d_+, and T
2: for t = 1 to T do
3:   Receive x̃_t (the point to play in round t) from Algorithm 1
4:   b_t ← EXPLORE with probability γ; EXPLOIT with probability 1 − γ
5:   if b_t = EXPLORE then
6:     Sample i_t ∈ [d] uniformly at random, and play q_{i_t}
7:     Receive loss l_t = q_{i_t}^⊤ f_t
8:     f̃_t ← (d/γ) l_t Q^{−1} q_{i_t}
9:   else
10:    Play x̃_t and receive loss l_t = x̃_t^⊤ f_t
11:    f̃_t ← 0
12:  end if
13:  Feed f̃_t to Algorithm 1 as the loss vector for round t (note that when f̃_t = 0, in the next round Algorithm 1 can simply play according to the distribution computed in this round without any oracle calls)
14: end for

4 α-Regret Minimization in the Bandit Setting

In this section we consider the α-regret minimization problem in the bandit setting, where W = R^d_+, K ⊆ R^d_+ ∩ B(0, R) and K* = αK. Suppose the loss vectors {f_t} come from R^d_+ and ‖f_t‖ ≤ L. Similar to (Kakade et al., 2009), we assume we know a β-barycentric spanner for K. This concept was first introduced by Awerbuch and Kleinberg (2004).

Definition 4.1 (Barycentric spanner). A set of d linearly independent vectors {q_1, . . . , q_d} ⊆ R^d is a β-barycentric spanner for a set K ⊆ R^d, denoted by β-BS(K), if {q_1, . . . , q_d} ⊆ K and for all x ∈ K, there exist β_1, . . . , β_d ∈ [−β, β] such that x = Σ_{i=1}^d β_i q_i.

Given {q_1, . . . , q_d} which is a β-BS(K), define Q := Σ_{i=1}^d q_i q_i^⊤ and M := (q_1, . . . , q_d) ∈ R^{d×d}.

    q_i^⊤ Q^{−2} q_i ≤ β.

The need for a new regularization. 
The bandit algorithm of Garber (2017) additionally requires\na certain boundedness property of barycentric spanners, namely:\n\nHowever, for certain bounded sets this quantity may be unbounded, such as the two-dimensional axis-\naligned rectangle with one axis being of size unity, and the other arbitrarily small. This unboundedness\ncreates problems with the unbiased estimator of loss vector, whose variance can be as large as certain\ngeometric properties of the decision set. To circumvent this issue, we design a new regularizer called\nbarycentric regularizer, which gives rise to an unbiased estimator coupled with an online mirror\ndescent variant that automatically ensures constant variance.\nSimilar to (Kakade et al., 2009; Garber, 2017), our bandit algorithm also simulates the full information\nalgorithm with estimated loss vectors. Namely, our algorithm implements Algorithm 1 with a speci\ufb01c\nbarycentric regularizer '(x) = 1\n2 x>Q1x. The algorithm is detailed in Algorithm 3, and its regret\nguarantee is given in Theorem 4.2. We prove Theorem 4.2 in Appendix C.\nTheorem 4.2. 
Denote by zt the point played by Algorithm 3 in round t.\nSuppose we set \u2318 = \u21b54/3\nThen we have\n\nT 1/3 in Algorithm 3 (assuming T > 2d3 so < 1).\n\nT and = 2/3d\n\nLRT 2/3 , \u270f = \u21b5R\n\nmax\ni2[d]\n\n8x\u21e4 2K : E\" TXt=1\n\n(ft(zt) \u21b5ft(x\u21e4))# \uf8ff \u21b5LR\u21e33d(T )2/3 + 1\u2318 ,\n\nand the expected total number of oracle calls to OK,\u21b5K in T rounds is at most Od2(T )2/3 log T.\n\n8\n\n\f5 Conclusion and Open Problems\n\nWe have described two different algorithmic approaches to reducing regret minimization to of\ufb02ine\napproximation algorithms and maintaining optimal regret and poly-logarithmic oracle complexity per\niteration, resolving previously stated open questions.\nAn intriguing open problem remaining is to \ufb01nd an ef\ufb01cient algorithm in the bandit setting that\nguarantees both \u02dcO(pT ) regret and poly(log T ) oracle complexity per iteration (at least on average).\n\nReferences\nAwerbuch, B. and Kleinberg, R. D. (2004). Adaptive routing with end-to-end feedback: Distributed\nlearning and geometric approaches. In Proceedings of the thirty-sixth annual ACM symposium on\nTheory of computing, pages 45\u201353. ACM.\n\nBalcan, M.-F. and Blum, A. (2006). Approximation algorithms and online mechanisms for item\npricing. In Proceedings of the 7th ACM Conference on Electronic Commerce, pages 29\u201335. ACM.\nCesa-Bianchi, N. and Lugosi, G. (2006). Prediction, learning, and games. Cambridge university\n\npress.\n\nCover, T. M. (1991). Universal portfolios. Mathematical Finance, 1(1):1\u201329.\nDud\u00edk, M., Haghtalab, N., Luo, H., Schapire, R. E., Syrgkanis, V., and Vaughan, J. W. (2016).\n\nOracle-ef\ufb01cient online learning and auction design. arXiv preprint arXiv:1611.01688.\n\nFujita, T., Hatano, K., and Takimoto, E. (2013). Combinatorial online prediction via metarounding.\n\nIn International Conference on Algorithmic Learning Theory, pages 68\u201382. Springer.\n\nGarber, D. (2017). 
Ef\ufb01cient online linear optimization with approximation algorithms. In Advances\n\nin Neural Information Processing Systems, pages 627\u2013635.\n\nGoemans, M. X. and Williamson, D. P. (1995). Improved approximation algorithms for maximum\n\ncut and satis\ufb01ability problems using semide\ufb01nite programming. J. ACM, 42(6):1115\u20131145.\n\nHazan, E. (2016).\n\nOptimization, 2(3-4):157\u2013325.\n\nIntroduction to online convex optimization. Foundations and Trends R in\n\nHazan, E. and Koren, T. (2016). The computational power of optimization in online learning. In\nProceedings of the forty-eighth annual ACM symposium on Theory of Computing, pages 128\u2013141.\nACM.\n\nKakade, S. M., Kalai, A. T., and Ligett, K. (2009). Playing games with approximation algorithms.\n\nSIAM Journal on Computing, 39(3):1088\u20131106.\n\nKalai, A. and Vempala, S. (2005). Ef\ufb01cient algorithms for online decision problems. Journal of\n\nComputer and System Sciences, 71(3):291\u2013307.\n\nLov\u00e1sz, L. and Vempala, S. (2007). The geometry of logconcave functions and sampling algorithms.\n\nRandom Structures & Algorithms, 30(3):307\u2013358.\n\nPr\u00e9kopa, A. (1973). On logarithmic concave measures and functions. Acta Scientiarum Mathemati-\n\ncarum, 34:335\u2013343.\n\nShalev-Shwartz, S. (2012). Online learning and online convex optimization. Foundations and\nTrends R in Machine Learning, 4(2):107\u2013194.\nVovk, V. G. (1990). Aggregating strategies. In Proceedings of the Third Annual Workshop on\n\nComputational Learning Theory, COLT \u201990, pages 371\u2013386.\n\n9\n\n\f", "award": [], "sourceid": 2718, "authors": [{"given_name": "Elad", "family_name": "Hazan", "institution": "Princeton University"}, {"given_name": "Wei", "family_name": "Hu", "institution": "Princeton University"}, {"given_name": "Yuanzhi", "family_name": "Li", "institution": "Princeton"}, {"given_name": "Zhiyuan", "family_name": "Li", "institution": "Princeton University"}]}