{"title": "Bandit Convex Optimization: Towards Tight Bounds", "book": "Advances in Neural Information Processing Systems", "page_first": 784, "page_last": 792, "abstract": "Bandit Convex Optimization (BCO) is a fundamental framework for decision making under uncertainty, which generalizes many problems from the realm of online and statistical learning. While the special case of linear cost functions is well understood, a gap on the attainable regret for BCO with nonlinear losses remains an important open question. In this paper we take a step towards understanding the best attainable regret bounds for BCO: we give an efficient and near-optimal regret algorithm for BCO with strongly-convex and smooth loss functions. In contrast to previous works on BCO that use time invariant exploration schemes, our method employs an exploration scheme that shrinks with time.", "full_text": "Bandit Convex Optimization: Towards Tight Bounds\n\nElad Hazan\n\nHaifa 32000, Israel\n\nK\ufb01r Y. Levy\n\nHaifa 32000, Israel\n\nTechnion\u2014Israel Institute of Technology\n\nTechnion\u2014Israel Institute of Technology\n\nehazan@ie.technion.ac.il\n\nkfiryl@tx.technion.ac.il\n\nAbstract\n\nBandit Convex Optimization (BCO) is a fundamental framework for decision\nmaking under uncertainty, which generalizes many problems from the realm of on-\nline and statistical learning. While the special case of linear cost functions is well\nunderstood, a gap on the attainable regret for BCO with nonlinear losses remains\nan important open question. In this paper we take a step towards understanding\nthe best attainable regret bounds for BCO: we give an ef\ufb01cient and near-optimal\nregret algorithm for BCO with strongly-convex and smooth loss functions. 
In contrast to previous works on BCO that use time invariant exploration schemes, our method employs an exploration scheme that shrinks with time.

1 Introduction

The power of the Online Convex Optimization (OCO) framework is in its ability to generalize many problems from the realm of online and statistical learning, and to supply universal tools for solving them. Extensive investigation throughout the last decade has yielded efficient algorithms with worst case guarantees. This has led many practitioners to embrace the OCO framework in modeling and solving real world problems.

One of the greatest challenges in OCO is finding tight bounds for the problem of Bandit Convex Optimization (BCO). In this "bandit" setting the learner observes the loss function only at the point that she has chosen. Hence, the learner has to balance exploiting the information she has gathered against exploring new data. The seminal work of [5] elegantly resolves this "exploration-exploitation" dilemma by devising a combined explore-exploit gradient descent algorithm. They obtain a bound of O(T^{3/4}) on the expected regret for the general case of an adversary playing bounded and Lipschitz-continuous convex losses.

In this paper we investigate the BCO setting assuming that the adversary is limited to inflicting strongly-convex and smooth losses, while the player may choose points from a constrained decision set. In this setting we devise an efficient algorithm that achieves a regret of Õ(√T). This rate is the best possible up to logarithmic factors, as implied by a recent work of [11], which cleverly obtains a lower bound of Ω(√T) for the same setting.

During our analysis, we develop a full-information algorithm that takes advantage of the strong-convexity of the loss functions and uses a self-concordant barrier as a regularization term.
This algorithm enables us to perform "shrinking exploration", which is a key ingredient in our BCO algorithm. In contrast, all previous works on BCO use a time invariant exploration scheme.

This paper is organized as follows. In Section 2 we introduce our setting and review necessary preliminaries regarding self-concordant barriers. In Section 3 we discuss schemes to perform single-point gradient estimation, then we define first-order online methods and analyze the performance of such methods when they receive noisy gradient estimates. Our main result is described and analyzed in Section 4; Section 5 concludes.

Setting     | Convex     | Linear | Smooth / Str.-Convex | Str.-Convex & Smooth
Full-Info.  |            | Θ(√T)  |                      | Θ(log T)
BCO         | Õ(T^{3/4}) | Õ(√T)  | Õ(T^{2/3}), Ω(√T)    | Õ(√T) [Thm. 10]

Table 1: Known regret bounds in the Full-Info./BCO setting. Our new result is highlighted, and improves upon the previous Õ(T^{2/3}) bound.

1.1 Prior work

For BCO with general convex loss functions, almost simultaneously to [5], a bound of O(T^{3/4}) was also obtained by [7] for the setting of Lipschitz-continuous convex losses. Conversely, the best known lower bound for this problem is Ω(√T), proved for the easier full-information setting.

In case the adversary is limited to using linear losses, it can be shown that the player does not "pay" for exploration; this property was used by [4] to devise the Geometric Hedge algorithm, which achieves an optimal regret rate of Õ(√T). Later [1], inspired by interior point methods, devised the first efficient algorithm that attains the same nearly-optimal regret rate for this setup of bandit linear optimization.

For some special classes of nonlinear convex losses, several works lean on ideas from [5] to achieve improved upper bounds for BCO.
In the case of convex and smooth losses, [9] attained an upper bound of Õ(T^{2/3}). The same regret rate of Õ(T^{2/3}) was achieved by [2] in the case of strongly-convex losses. For the special case of unconstrained BCO with strongly-convex and smooth losses, [2] obtained a regret of Õ(√T). A recent paper by Shamir [11] significantly advanced our understanding of BCO by devising a lower bound of Ω(√T) for the setting of strongly-convex and smooth BCO. The latter implies the tightness of our bound.

A comprehensive survey by Bubeck and Cesa-Bianchi [3] provides a review of the bandit optimization literature in both the stochastic and online settings.

2 Setting and Background

Notation: Throughout this paper we denote by ||·|| the ℓ₂ norm when referring to vectors, and use the same notation for the spectral norm when referring to matrices. We denote by B^n and S^n the n-dimensional Euclidean unit ball and unit sphere, and by v ∼ B^n and u ∼ S^n random variables chosen uniformly from these sets. The symbol I is used for the identity matrix (its dimension will be clear from the context). For a positive definite matrix A ≻ 0 we denote by A^{1/2} the matrix B such that BᵀB = A, and by A^{−1/2} the inverse of B. Finally, we denote [N] := {1, ..., N}.

2.1 Bandit Convex Optimization

We consider a repeated game of T rounds between a player and an adversary; at each round t ∈ [T]:

1. the player chooses a point x_t ∈ K.
2. the adversary independently chooses a loss function f_t ∈ F.
3. the player suffers a loss f_t(x_t) and receives a feedback F_t.

In the OCO (Online Convex Optimization) framework we assume that the decision set K is convex and that all functions in F are convex.
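The repeated game above can be sketched in a few lines of code. The following is our own minimal illustration, not an algorithm from this paper: an oblivious adversary fixes quadratic losses f_t(x) = (x − θ_t)² in advance, a naive player always plays 0, and only the scalar f_t(x_t) is revealed each round. For these losses the best fixed point in hindsight is the mean of the θ_t, and the naive player's regret works out to exactly T · mean(θ)².

```python
import numpy as np

rng = np.random.default_rng(0)
T = 1000
thetas = rng.uniform(-1.0, 1.0, size=T)   # oblivious adversary: losses fixed in advance

def loss(t, x):
    # f_t(x) = (x - theta_t)^2 is 2-strongly-convex and 2-smooth
    return (x - thetas[t]) ** 2

total = 0.0
for t in range(T):
    x_t = 0.0                 # step 1: a (naive) player commits to x_t
    feedback = loss(t, x_t)   # steps 2-3: only the scalar f_t(x_t) is revealed
    total += feedback

# Best fixed point in hindsight; it is independent of the player's choices
# because the adversary is oblivious.
w_star = thetas.mean()
regret = total - sum(loss(t, w_star) for t in range(T))

# For these quadratics, the regret of always playing 0 equals T * mean(theta)^2.
print(f"regret: {regret:.4f}, predicted: {T * w_star**2:.4f}")
```

The point of the sketch is the information structure: the player never sees f_t itself, only the single scalar f_t(x_t), which is what forces the gradient-estimation machinery developed in Section 3.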
Our paper focuses on adversaries limited to choosing functions from the set F_{σ,β}: the set of all σ-strongly-convex and β-smooth functions.

We also limit ourselves to oblivious adversaries, where the loss sequence {f_t}_{t=1}^T is predetermined and is therefore independent of the player's choices. Note that in this case the best point in hindsight is also independent of the player's choices. We also assume that the loss functions are defined over the entire space R^n and are strongly-convex and smooth there; yet the player may only choose points from a constrained set K.

Let us define the regret of A, and its regret with respect to a comparator w ∈ K:

$$\mathrm{Regret}^{\mathcal{A}}_T = \sum_{t=1}^{T} f_t(x_t) - \min_{w^* \in \mathcal{K}} \sum_{t=1}^{T} f_t(w^*), \qquad \mathrm{Regret}^{\mathcal{A}}_T(w) = \sum_{t=1}^{T} f_t(x_t) - \sum_{t=1}^{T} f_t(w).$$

A player aims at minimizing his regret, and we are interested in players that ensure an o(T) regret for any loss sequence that the adversary may choose.

The player learns through the feedback F_t received in response to his actions. In the full-information setting, he receives the loss function f_t itself as feedback, usually by means of a gradient oracle, i.e., the decision maker has access to the gradient of the loss function at any point in the decision set. Conversely, in the BCO setting the given feedback is f_t(x_t), i.e., the value of the loss function only at the point that he has chosen; and the player aims at minimizing his expected regret, E[Regret^A_T].

2.2 Strong Convexity and Smoothness

As mentioned in the last subsection, we consider an adversary limited to choosing loss functions from the set F_{σ,β}, the set of σ-strongly convex and β-smooth functions; here we define these properties.

Definition 1. 
(Strong Convexity) We say that a function f : R^n → R is σ-strongly convex over the set K if for all x, y ∈ K it holds that

$$f(y) \ge f(x) + \nabla f(x)^\top (y - x) + \frac{\sigma}{2}\|x - y\|^2. \qquad (1)$$

Definition 2. (Smoothness) We say that a convex function f : R^n → R is β-smooth over the set K if the following holds:

$$f(y) \le f(x) + \nabla f(x)^\top (y - x) + \frac{\beta}{2}\|x - y\|^2, \qquad \forall x, y \in \mathcal{K}. \qquad (2)$$

2.3 Self-Concordant Barriers

Interior point methods are polynomial time algorithms for solving constrained convex optimization programs. The main tool in these methods is a barrier function that encodes the constrained set and enables the use of fast unconstrained optimization machinery. More on this subject can be found in [8].

Let K ⊆ R^n be a convex set with a non-empty interior int(K).

Definition 3. A function R : int(K) → R is called ν-self-concordant if:

1. R is three times continuously differentiable and convex, and approaches infinity along any sequence of points approaching the boundary of K.
2. For every h ∈ R^n and x ∈ int(K) the following holds:

$$|\nabla^3 \mathcal{R}(x)[h, h, h]| \le 2\big(\nabla^2 \mathcal{R}(x)[h, h]\big)^{3/2} \quad \text{and} \quad |\nabla \mathcal{R}(x)[h]| \le \nu^{1/2}\big(\nabla^2 \mathcal{R}(x)[h, h]\big)^{1/2},$$

here $\nabla^3 \mathcal{R}(x)[h, h, h] := \frac{\partial^3}{\partial t_1 \partial t_2 \partial t_3} \mathcal{R}(x + t_1 h + t_2 h + t_3 h)\big|_{t_1 = t_2 = t_3 = 0}$.

Our algorithm requires a ν-self-concordant barrier over K, and its regret depends on √ν.
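As a concrete, self-contained example (our illustration, not from the paper): R(x) = −log(1 − ||x||²) is the standard ν = 1 self-concordant barrier for the Euclidean unit ball. The sketch below evaluates its gradient and Hessian in closed form and checks, at random interior points, the barrier inequality |∇R(x)[h]| ≤ √ν (∇²R(x)[h, h])^{1/2}, together with the fact (used heavily later) that a unit step in the local norm induced by the Hessian never leaves the ball.

```python
import numpy as np

rng = np.random.default_rng(1)
n, nu = 5, 1.0  # R below is the standard nu = 1 barrier for the unit ball

def grad_hess(x):
    """Gradient and Hessian of R(x) = -log(1 - ||x||^2)."""
    s = 1.0 - x @ x
    g = 2.0 * x / s
    H = 2.0 * np.eye(n) / s + 4.0 * np.outer(x, x) / s ** 2
    return g, H

for _ in range(100):
    x = rng.standard_normal(n)
    x *= rng.uniform(0.0, 0.999) / np.linalg.norm(x)   # random interior point
    h = rng.standard_normal(n)
    g, H = grad_hess(x)
    local = np.sqrt(h @ H @ h)                         # the local norm of h at x
    # barrier inequality: |dR(x)[h]| <= sqrt(nu) * (h^T H h)^{1/2}
    assert abs(g @ h) <= np.sqrt(nu) * local + 1e-9
    # a unit step in the local norm stays inside the unit ball
    assert np.linalg.norm(x + h / local) <= 1.0 + 1e-9
print("all checks passed")
```

For this particular barrier both checks can also be verified algebraically; the numerical loop is only meant to make Definition 3 tangible.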
It is well known that any convex set in R^n admits a barrier with ν = O(n) (ν might be much smaller), and that most interesting convex sets admit a self-concordant barrier that is efficiently represented.

The Hessian of a self-concordant barrier induces a local norm at every x ∈ int(K); we denote this norm by ||·||_x and its dual by ||·||*_x, and define for all h ∈ R^n:

$$\|h\|_x = \sqrt{h^\top \nabla^2 \mathcal{R}(x)\, h}, \qquad \|h\|^*_x = \sqrt{h^\top \big(\nabla^2 \mathcal{R}(x)\big)^{-1} h};$$

we assume that ∇²R(x) always has full rank.

The following fact is a key ingredient in the sampling schemes of BCO algorithms [1, 9]. Let R be a self-concordant barrier and x ∈ int(K); then the Dikin ellipsoid,

$$\mathcal{W}_1(x) := \{y \in \mathbb{R}^n : \|y - x\|_x \le 1\}, \qquad (3)$$

i.e. the ||·||_x-unit ball centered around x, is completely contained in K.

Our regret analysis requires a bound on R(y) − R(x); hence, we will find the following lemma useful:

Lemma 4. Let R be a ν-self-concordant function over K; then:

$$\mathcal{R}(y) - \mathcal{R}(x) \le \nu \log \frac{1}{1 - \pi_x(y)}, \qquad \forall x, y \in \mathrm{int}(\mathcal{K}),$$

where $\pi_x(y) = \inf\{t \ge 0 : x + t^{-1}(y - x) \in \mathcal{K}\}$.

Note that π_x(y) is called the Minkowski function, and it always takes values in [0, 1]. Moreover, as y approaches the boundary of K, π_x(y) → 1.

3 Single Point Gradient Estimation and Noisy First-Order Methods

3.1 Single Point Gradient Estimation

A main component of BCO algorithms is a randomized sampling scheme for constructing gradient estimates. Here, we survey the previous schemes as well as the more general scheme that we use.

Spherical estimators: Flaxman et al. [5] introduced a method that produces single point gradient estimates through spherical sampling.
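This spherical sampling admits a short numerical sketch (our illustration; Lemma 5 below states the identity it relies on). For f(x) = ||x||², the ball-smoothed version differs from f only by a constant, so averaging (n/δ) f(x + δu) u over uniform unit vectors u should recover the true gradient ∇f(x) = 2x:

```python
import numpy as np

rng = np.random.default_rng(2)
n, delta = 2, 0.1
x = np.array([0.3, -0.2])
# Target: f(z) = ||z||^2, so grad f(x) = 2x; smoothing only shifts f by a constant.

m = 500_000
U = rng.standard_normal((m, n))
U /= np.linalg.norm(U, axis=1, keepdims=True)     # uniform samples u ~ S^n
fvals = ((x + delta * U) ** 2).sum(axis=1)        # f(x + delta * u) for each sample
est = (n / delta) * (fvals[:, None] * U).mean(axis=0)

print("estimate:", est, " true gradient:", 2 * x)
```

Each sample uses only a single function value, which is exactly the information available in the bandit setting; the price is variance that grows as δ shrinks, which is the tension the shrinking-exploration scheme of this paper manages.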
These estimates are then inserted into a full-information procedure that chooses the next decision point for the player. Interestingly, these gradient estimates are unbiased predictions of the gradients of a smoothed version of the function, which we define next.

Let δ > 0 and v ∼ B^n; the smoothed version of a function f : R^n → R is defined as follows:

$$\hat{f}(x) = \mathbb{E}[f(x + \delta v)]. \qquad (4)$$

The next lemma of [5] ties the gradients of f̂ to an estimate based on samples of f:

Lemma 5. Let u ∼ S^n, and consider the smoothed version f̂ defined in Equation (4); then the following applies:

$$\nabla \hat{f}(x) = \mathbb{E}\Big[\frac{n}{\delta} f(x + \delta u)\, u\Big]. \qquad (5)$$

Therefore, (n/δ) f(x + δu) u is an unbiased estimator of the gradients of the smoothed version.

[Figure 1: Dikin ellipsoid sampling schemes. (a) Eigenpole sampling; (b) continuous sampling; (c) shrinking sampling.]

Ellipsoidal estimators: Abernethy et al. [1] introduced the idea of sampling from an ellipsoid (specifically, the Dikin ellipsoid) rather than a sphere in the context of BCO. They restricted the sampling to the eigenpoles of the ellipsoid (Fig. 1a). A more general method of sampling continuously from an ellipsoid was introduced in [9] (Fig. 1b). We shall see later that our algorithm uses a "shrinking-sampling" scheme (Fig. 1c), which is crucial in achieving the Õ(√T) regret bound.

The following lemma of [9] shows that we can sample f non-uniformly over all directions and create an unbiased gradient estimate of a respective smoothed version:

Corollary 6. Let f : R^n → R be a continuous function, let A ∈ R^{n×n} be invertible, and v ∼ B^n, u ∼ S^n.
Define the smoothed version of f with respect to A:

$$\hat{f}(x) = \mathbb{E}[f(x + Av)]. \qquad (6)$$

Then the following holds:

$$\nabla \hat{f}(x) = \mathbb{E}\big[n f(x + Au)\, A^{-1} u\big]. \qquad (7)$$

Note that if A ≻ 0 then {Au : u ∈ S^n} is the boundary of an ellipsoid.

Our next lemma shows that the smoothed version preserves the strong-convexity of f, and that we can measure the distance between f̂ and f using the spectral norm of A²:

Lemma 7. Consider a function f : R^n → R, and a positive definite matrix A ∈ R^{n×n}. Let f̂ be the smoothed version of f with respect to A as defined in Equation (6). Then the following holds:

• If f is σ-strongly convex then so is f̂.
• If f is convex and β-smooth, and λ_max is the largest eigenvalue of A, then:

$$0 \le \hat{f}(x) - f(x) \le \frac{\beta}{2}\|A^2\|_2 = \frac{\beta}{2}\lambda_{\max}^2. \qquad (8)$$

Remark: Lemma 7 also holds if we define the smoothed version of f as f̂(x) = E_{u∼S^n}[f(x + Au)], i.e. an average of the original function values over the unit sphere rather than over the unit ball as in Equation (6). The proof is similar to that of Lemma 7.

3.2 Noisy First-Order Methods

Our algorithm utilizes a full-information online algorithm, but instead of providing this method with exact gradient values we feed it noisy estimates of the gradients. In what follows we define first-order online algorithms, and present a lemma that analyzes the regret of such an algorithm when it receives noisy gradients.

Definition 8. (First-Order Online Algorithm) Let A be an OCO algorithm receiving an arbitrary sequence of differentiable convex loss functions f_1, ..., f_T, and providing points x_1 ← A and x_t ← A(f_1, ..., f_{t−1}). Suppose that A requires all loss functions to belong to some set F_0.
Then A is called a first-order online algorithm if the following holds:

• Adding a linear function to a member of F_0 keeps it in F_0; i.e., for every f ∈ F_0 and a ∈ R^n, also f + aᵀx ∈ F_0.
• The algorithm's choices depend only on the gradient values taken at A's past choices, i.e.:

$$\mathcal{A}(f_1, \ldots, f_{t-1}) = \mathcal{A}\big(\nabla f_1(x_1), \ldots, \nabla f_{t-1}(x_{t-1})\big), \qquad \forall t \in [T].$$

The following is a generalization of Lemma 3.1 from [5]:

Lemma 9. Let w be a fixed point in K. Let A be a first-order online algorithm receiving a sequence of differentiable convex loss functions f_1, ..., f_T : K → R (f_{t+1} possibly depending on z_1, ..., z_t), where z_1, ..., z_T are defined as follows: z_1 ← A, z_t ← A(g_1, ..., g_{t−1}), and the g_t's are vector valued random variables such that:

$$\mathbb{E}\big[g_t \,\big|\, z_1, f_1, \ldots, z_t, f_t\big] = \nabla f_t(z_t).$$

If A ensures a regret bound of the form Regret^A_T ≤ B_A(∇f_1(x_1), ..., ∇f_T(x_T)) in the full-information case, then in the case of noisy gradients it ensures the following bound:

$$\mathbb{E}\Big[\sum_{t=1}^{T} f_t(z_t)\Big] - \sum_{t=1}^{T} f_t(w) \le \mathbb{E}\big[B_{\mathcal{A}}(g_1, \ldots, g_T)\big].$$

4 Main Result and Analysis

The following is the main theorem of this paper:

Theorem 10. Let K be a convex set with diameter D_K, and let R be a ν-self-concordant barrier over K. Then in the BCO setting, where the adversary is limited to choosing β-smooth and σ-strongly-convex functions with |f_t(x)| ≤ L, ∀x ∈ K, the expected regret of Algorithm 1 with $\eta = \sqrt{\frac{(\nu + 2\beta/\sigma)\log T}{2 n^2 L^2 T}}$ is upper bounded as

$$\mathbb{E}[\mathrm{Regret}_T] \le 4 n L \sqrt{\Big(\nu + \frac{2\beta}{\sigma}\Big) T \log T} + 2L + \frac{\beta D_{\mathcal{K}}^2}{2} = O\bigg(\sqrt{\frac{\beta \nu}{\sigma} T \log T}\bigg),$$

whenever T / log T ≥ 2(ν + 2β/σ).

Algorithm 1 BCO Algorithm for Str.-convex & Smooth losses
Input: η > 0, σ > 0, ν-self-concordant barrier R
Define B_t = (∇²R(x_t) + ησtI)^{−1/2}
Choose x_1 = arg min_{x∈K} R(x)
for t = 1, 2, ..., T do
    Draw u ∼ S^n
    Play y_t = x_t + B_t u
    Observe f_t(x_t + B_t u) and define g_t = n f_t(x_t + B_t u) B_t^{−1} u
    Update x_{t+1} = arg min_{x∈K} Σ_{τ=1}^{t} { g_τᵀ x + (σ/2)||x − x_τ||² } + η^{−1} R(x)
end for

Algorithm 1 shrinks the exploration magnitude with time (Fig. 1c); this is enabled thanks to the strong-convexity of the losses. It also updates according to a full-information first-order algorithm, denoted FTARL-σ, which is defined below. This algorithm is a variant of the FTRL methodology as defined in [6, 10].

Algorithm 2 FTARL-σ
Input: η > 0, ν-self-concordant barrier R
Choose x_1 = arg min_{x∈K} R(x)
for t = 1, 2, ..., T do
    Receive ∇h_t(x_t)
    Output x_{t+1} = arg min_{x∈K} Σ_{τ=1}^{t} { ∇h_τ(x_τ)ᵀ x + (σ/2)||x − x_τ||² } + η^{−1} R(x)
end for

Next we give a proof sketch of Theorem 10.

Proof sketch of Theorem 10. 
Let us decompose the expected regret of Algorithm 1 with respect to w ∈ K:

$$\mathbb{E}[\mathrm{Regret}_T(w)] := \sum_{t=1}^{T} \mathbb{E}[f_t(y_t) - f_t(w)]$$
$$= \sum_{t=1}^{T} \mathbb{E}[f_t(y_t) - f_t(x_t)] \qquad (9)$$
$$+ \sum_{t=1}^{T} \mathbb{E}\big[f_t(x_t) - \hat{f}_t(x_t)\big] \qquad (10)$$
$$+ \sum_{t=1}^{T} \mathbb{E}\big[\hat{f}_t(w) - f_t(w)\big] \qquad (11)$$
$$+ \sum_{t=1}^{T} \mathbb{E}\big[\hat{f}_t(x_t) - \hat{f}_t(w)\big], \qquad (12)$$

where the expectation is taken with respect to the player's choices, and f̂_t is defined as

$$\hat{f}_t(x) = \mathbb{E}[f_t(x + B_t v)], \qquad \forall x \in \mathcal{K};$$

here v ∼ B^n and the smoothing matrix B_t is defined in Algorithm 1.

The sampling scheme used by Algorithm 1 yields an unbiased gradient estimate g_t of the smoothed version f̂_t, which is then fed to FTARL-σ (Algorithm 2). We can therefore interpret Algorithm 1 as performing a noisy first-order method (FTARL-σ) over the smoothed versions. The x_t's in Algorithm 1 are the outputs of FTARL-σ, thus the term in Equation (12) is associated with "exploitation". The other terms, in Equations (9)-(11), measure the cost of sampling away from x_t and the distance between the smoothed version and the original function; hence these terms are associated with "exploration". In what follows we analyze these terms separately and show that Algorithm 1 achieves Õ(√T) regret.

The Exploration Terms: The following hold by Lemma 7 and by the remark that follows it:

$$\mathbb{E}[f_t(y_t) - f_t(x_t)] = \mathbb{E}\big[\mathbb{E}_u[f_t(x_t + B_t u) \mid x_t] - f_t(x_t)\big] \le 0.5\beta\, \mathbb{E}\big[\|B_t^2\|_2\big] \le \beta/2\eta\sigma t \qquad (13)$$
$$\mathbb{E}\big[f_t(x_t) - \hat{f}_t(x_t)\big] = \mathbb{E}\Big[\mathbb{E}\big[f_t(x_t) - \hat{f}_t(x_t) \mid x_t\big]\Big] \le 0 \qquad (14)$$
$$\mathbb{E}\big[\hat{f}_t(w) - f_t(w)\big] = \mathbb{E}\Big[\mathbb{E}\big[\hat{f}_t(w) - f_t(w) \mid x_t\big]\Big] \le 0.5\beta\, \mathbb{E}\big[\|B_t^2\|_2\big] \le \beta/2\eta\sigma t \qquad (15)$$

where ||B_t²||₂ ≤ 1/ησt follows from the definition of B_t and the fact that ∇²R(x_t) is positive definite.

The Exploitation Term: The next lemma bounds the regret of FTARL-σ in the full-information setting:

Lemma 11. Let R be a self-concordant barrier over a convex set K, and let η > 0. Consider an online player receiving σ-strongly-convex loss functions h_1, ..., h_T and choosing points according to FTARL-σ (Algorithm 2), with η||∇h_t(x_t)||*_t ≤ 1/2, ∀t ∈ [T]. Then the player's regret is upper bounded as follows:

$$\sum_{t=1}^{T} h_t(x_t) - \sum_{t=1}^{T} h_t(w) \le 2\eta \sum_{t=1}^{T} \big(\|\nabla h_t(x_t)\|^*_t\big)^2 + \eta^{-1}\mathcal{R}(w), \qquad \forall w \in \mathcal{K},$$

where $(\|a\|^*_t)^2 = a^\top \big(\nabla^2 \mathcal{R}(x_t) + \eta\sigma t I\big)^{-1} a$.

Note that Algorithm 1 feeds the estimates g_t into FTARL-σ. Using Corollary 6 we can show that the g_t's are unbiased estimates of the gradients of the smoothed versions f̂_t. Using the regret bound of the above lemma, together with the unbiasedness of the g_t's, Lemma 9 ensures that:

$$\sum_{t=1}^{T} \mathbb{E}\big[\hat{f}_t(x_t) - \hat{f}_t(w)\big] \le 2\eta \sum_{t=1}^{T} \mathbb{E}\big[(\|g_t\|^*_t)^2\big] + \eta^{-1}\mathcal{R}(w). \qquad (16)$$

By the definitions of g_t and B_t, and recalling that |f_t(x)| ≤ L, ∀x ∈ K, we can bound:

$$\mathbb{E}\big[(\|g_t\|^*_t)^2 \mid x_t\big] = \mathbb{E}\Big[n^2 \big(f_t(x_t + B_t u)\big)^2\, u^\top B_t^{-1}\big(\nabla^2 \mathcal{R}(x_t) + \eta\sigma t I\big)^{-1} B_t^{-1} u \,\Big|\, x_t\Big] \le (nL)^2.$$

Concluding: Plugging the latter into Equation (16) and combining Equations (9)-(16), we get:

$$\mathbb{E}[\mathrm{Regret}_T(w)] \le 2\eta (nL)^2 T + \eta^{-1}\mathcal{R}(w) + 2\beta\sigma^{-1}\eta^{-1}\log T. \qquad (17)$$

Recall that x_1 = arg min_{x∈K} R(x), and assume w.l.o.g. that R(x_1) = 0 (we can always add a constant to R). Thus, for a point w ∈ K such that π_{x_1}(w) ≤ 1 − T^{−1}, Lemma 4 ensures that R(w) ≤ ν log T. Combining the latter with Equation (17) and the choice of η in Theorem 10 assures an expected regret bounded by 4nL√((ν + 2βσ^{−1}) T log T). For w ∈ K such that π_{x_1}(w) > 1 − T^{−1}, we can always find w′ ∈ K such that ||w − w′|| ≤ O(T^{−1}) and π_{x_1}(w′) ≤ 1 − T^{−1}; using the Lipschitzness of the f_t's, Theorem 10 holds.

Correctness: Note that Algorithm 1 chooses points from the set {x_t + (∇²R(x_t) + ησtI)^{−1/2} u : u ∈ S^n}, which is inside the Dikin ellipsoid and therefore belongs to K (the Dikin ellipsoid is always contained in K).

5 Summary and open questions

We have presented an efficient algorithm that attains near optimal regret for the setting of BCO with strongly-convex and smooth losses, advancing our understanding of optimal regret rates for bandit learning.

Perhaps the most important question in bandit learning remains the resolution of the attainable regret bounds for smooth but non-strongly-convex, strongly-convex but non-smooth, and generally convex cost functions (see Table 1). Ideally, this should be accompanied by an efficient algorithm, although understanding the optimal rates up to polylogarithmic factors would be a significant advancement by itself.

Acknowledgements

The research leading to these results has received funding from the European Union's Seventh Framework Programme (FP7/2007-2013) under grant agreement n° 336078 - ERC-SUBLRN.

References

[1] Jacob Abernethy, Elad Hazan, and Alexander Rakhlin. Competing in the dark: An efficient algorithm for bandit linear optimization. In COLT, pages 263-274, 2008.

[2] Alekh Agarwal, Ofer Dekel, and Lin Xiao. Optimal algorithms for online convex optimization with multi-point bandit feedback. In COLT, pages 28-40, 2010.

[3] Sébastien Bubeck and Nicolò Cesa-Bianchi. Regret analysis of stochastic and nonstochastic multi-armed bandit problems.
Foundations and Trends in Machine Learning, 5(1):1-122, 2012.

[4] Varsha Dani, Thomas P. Hayes, and Sham Kakade. The price of bandit information for online optimization. In NIPS, 2007.

[5] Abraham Flaxman, Adam Tauman Kalai, and H. Brendan McMahan. Online convex optimization in the bandit setting: gradient descent without a gradient. In SODA, pages 385-394, 2005.

[6] Elad Hazan. A survey: The convex optimization approach to regret minimization. In Suvrit Sra, Sebastian Nowozin, and Stephen J. Wright, editors, Optimization for Machine Learning, pages 287-302. MIT Press, 2011.

[7] Robert D. Kleinberg. Nearly tight bounds for the continuum-armed bandit problem. In NIPS, volume 17, pages 697-704, 2004.

[8] Arkadii Nemirovskii. Interior point polynomial time methods in convex programming. Lecture Notes, 2004.

[9] Ankan Saha and Ambuj Tewari. Improved regret guarantees for online smooth convex optimization with bandit feedback. In AISTATS, pages 636-642, 2011.

[10] Shai Shalev-Shwartz. Online learning and online convex optimization. Foundations and Trends in Machine Learning, 4(2):107-194, 2011.

[11] Ohad Shamir. On the complexity of bandit and derivative-free stochastic convex optimization. In Conference on Learning Theory, pages 3-24, 2013.
", "award": [], "sourceid": 534, "authors": [{"given_name": "Elad", "family_name": "Hazan", "institution": "Technion"}, {"given_name": "Kfir", "family_name": "Levy", "institution": "Technion"}]}