{"title": "Efficient Online Learning via Randomized Rounding", "book": "Advances in Neural Information Processing Systems", "page_first": 343, "page_last": 351, "abstract": "Most online algorithms used in machine learning today are based on variants of mirror descent or follow-the-leader. In this paper, we present an online algorithm based on a completely different approach, which combines “random playout” and randomized rounding of loss subgradients. As an application of our approach, we provide the first computationally efficient online algorithm for collaborative filtering with trace-norm constrained matrices. As a second application, we solve an open question linking batch learning and transductive online learning.", "full_text": "Efficient Online Learning via Randomized Rounding\n\n
Nicolò Cesa-Bianchi\nDSI, Università degli Studi di Milano\nItaly\nnicolo.cesa-bianchi@unimi.it\n\n
Ohad Shamir\nMicrosoft Research New England\nUSA\nohadsh@microsoft.com\n\n
Abstract\n\n
Most online algorithms used in machine learning today are based on variants of mirror descent or follow-the-leader. In this paper, we present an online algorithm based on a completely different approach, which combines “random playout” and randomized rounding of loss subgradients. As an application of our approach, we provide the first computationally efficient online algorithm for collaborative filtering with trace-norm constrained matrices. As a second application, we solve an open question linking batch learning and transductive online learning.\n\n
1 Introduction\n\n
Online learning algorithms, which have received much attention in recent years, enjoy an attractive combination of computational efficiency, lack of distributional assumptions, and strong theoretical guarantees.
However, it is probably fair to say that at their core, most of these algorithms are based on the same small set of fundamental techniques, in particular mirror descent and regularized follow-the-leader (see for instance [14]).\n\n
In this work we revisit, and significantly extend, an algorithm which uses a completely different approach. This algorithm, known as the Minimax Forecaster, was introduced in [9, 11] for the setting of prediction with static experts. It computes minimax predictions in the case of known horizon, binary outcomes, and absolute loss. Although the original version is computationally expensive, it can easily be made efficient through randomization.\n\n
We extend the analysis of [9] to the case of non-binary outcomes and arbitrary convex and Lipschitz loss functions. The new algorithm is based on a combination of “random playout” and randomized rounding, which assigns random binary labels to future unseen instances, in a way depending on the loss subgradients. Our resulting Randomized Rounding (R2) Forecaster has a parameter trading off regret performance and computational complexity, and runs in polynomial time (for T predictions, it requires computing O(T²) empirical risk minimizers in general, as opposed to O(T) for generic follow-the-leader algorithms). The regret of the R2 Forecaster is determined by the Rademacher complexity of the comparison class. The connection between online learnability and Rademacher complexity has also been explored in [2, 1]. However, these works focus on the information-theoretically achievable regret, as opposed to computationally efficient algorithms.
The idea of “random playout”, in the context of online learning, has also been used in [16, 3], but we apply this idea in a different way.\n\n
We show that the R2 Forecaster can be used to design the first efficient online learning algorithm for collaborative filtering with trace-norm constrained matrices. While this is a well-known setting, a straightforward application of standard online learning approaches, such as mirror descent, appears to give only trivial performance guarantees. Moreover, our regret bound matches the best currently known sample complexity bound in the batch distribution-free setting [21].\n\n
As a different application, we consider the relationship between batch learning and transductive online learning. This relationship was analyzed in [16], in the context of binary prediction with respect to classes of bounded VC dimension. Their main result was that efficient learning in a statistical setting implies efficient learning in the transductive online setting, but at an inferior rate of T^{3/4} (where T is the number of rounds). The main open question posed by that paper is whether a better rate can be obtained. Using the R2 Forecaster, we improve on those results, and provide an efficient algorithm with the optimal √T rate, for a wide class of losses. This shows that efficient batch learning not only implies efficient transductive online learning (the main thesis of [16]), but also that the same rates can be obtained, and for possibly non-binary prediction problems as well.\n\n
We emphasize that the R2 Forecaster requires computing many empirical risk minimizers (ERM’s) at each round, which might be prohibitive in practice. Thus, while it does run in polynomial time whenever an ERM can be efficiently computed, we make no claim that it is a “fully practical” algorithm.
Nevertheless, it seems to be a useful tool in showing that efficient online learnability is possible in various settings, often working in cases where more standard techniques appear to fail. Moreover, we hope the techniques we employ might prove useful in deriving practical online algorithms in other contexts.\n\n
2 The Minimax Forecaster\n\n
We start by introducing the sequential game of prediction with expert advice — see [10]. The game is played between a forecaster and an adversary, and is specified by an outcome space Y, a prediction space P, a nonnegative loss function ℓ : P × Y → R, which measures the discrepancy between the forecaster’s prediction and the outcome, and an expert class F. Here we focus on classes F of static experts, whose prediction at each round t does not depend on the outcome in previous rounds. Therefore, we think of each f ∈ F simply as a sequence f = (f_1, f_2, . . . ) where each f_t ∈ P. At each step t = 1, 2, . . . of the game, the forecaster outputs a prediction p_t ∈ P and simultaneously the adversary reveals an outcome y_t ∈ Y. The forecaster’s goal is to predict the outcome sequence almost as well as the best expert in the class F, irrespective of the outcome sequence y = (y_1, y_2, . . . ). The performance of a forecasting strategy A is measured by the worst-case regret\n\n
V_T(A, F) = sup_{y ∈ Y^T} ( Σ_{t=1}^T ℓ(p_t, y_t) − inf_{f ∈ F} Σ_{t=1}^T ℓ(f_t, y_t) ),    (1)\n\n
viewed as a function of the horizon T. To simplify notation, let L(f, y) = Σ_{t=1}^T ℓ(f_t, y_t).\n\n
Consider now the special case where the horizon T is fixed and known in advance, the outcome space is Y = {−1, +1}, the prediction space is P = [−1, +1], and the loss is the absolute loss ℓ(p, y) = |p − y|.
We will denote the regret in this special case as V_T^abs(A, F). The Minimax Forecaster — which is based on work presented in [9] and [11], see also [10] for an exposition — is derived by an explicit analysis of the minimax regret inf_A V_T^abs(A, F), where the infimum is over all forecasters A producing at round t a prediction p_t as a function of p_1, y_1, . . . , p_{t−1}, y_{t−1}. For general online learning problems, the analysis of this quantity is intractable. However, for the specific setting we focus on (absolute loss and binary outcomes), one can get both an explicit expression for the minimax regret, as well as an explicit algorithm, provided inf_{f ∈ F} Σ_{t=1}^T ℓ(f_t, y_t) can be efficiently computed for any sequence y_1, . . . , y_T. This procedure is akin to performing empirical risk minimization (ERM) in statistical learning. A full development of the analysis is out of scope, but is outlined in Appendix A of the supplementary material. In a nutshell, the idea is to begin by calculating the optimal prediction in the last round T, and then work backwards, calculating the optimal prediction at round T − 1, T − 2, etc. Remarkably, the value of inf_A V_T^abs(A, F) is exactly the Rademacher complexity R_T(F) of the class F, which is known to play a crucial role in understanding the sample complexity in statistical learning [5]. In this paper, we define it as¹\n\n
R_T(F) = E[ sup_{f ∈ F} Σ_{t=1}^T σ_t f_t ],    (2)\n\n
where σ_1, . . . , σ_T are i.i.d. Rademacher random variables, taking values −1, +1 with equal probability. When R_T(F) = o(T), we get a minimax regret inf_A V_T^abs(A, F) = o(T), which implies a vanishing per-round regret.\n\n
In terms of an explicit algorithm, the optimal prediction p_t at round t is given by a complicated-looking recursive expression, involving exponentially many terms.
Indeed, for general online learning problems, this is the most one seems able to hope for. However, an apparently little-known fact is that when one deals with a class F of fixed binary sequences as discussed above, then one can write the optimal prediction p_t in a much simpler way. Letting Y_1, . . . , Y_T be i.i.d. Rademacher random variables, the optimal prediction at round t can be written as\n\n
p_t = E[ inf_{f ∈ F} L(f, y_1 ··· y_{t−1} (−1) Y_{t+1} ··· Y_T) − inf_{f ∈ F} L(f, y_1 ··· y_{t−1} 1 Y_{t+1} ··· Y_T) ].    (3)\n\n
In words, the prediction is simply the expected difference between the minimal cumulative loss over F when the adversary plays −1 at round t and random values afterwards, and the minimal cumulative loss over F when the adversary plays +1 at round t and the same random values afterwards. We refer the reader to Appendix A of the supplementary material for how this is derived. We denote this optimal strategy (for absolute loss and binary outcomes) as the Minimax Forecaster (mf):\n\n
Algorithm 1 Minimax Forecaster (mf)\nfor t = 1 to T do\n    Predict p_t as defined in Eq. (3)\n    Receive outcome y_t and suffer loss |p_t − y_t|\nend for\n\n
The relevant guarantee for mf is summarized in the following theorem.\n\n
Theorem 1. For any class F ⊆ [−1, +1]^T of static experts, the regret of the Minimax Forecaster (Algorithm 1) satisfies V_T^abs(mf, F) = R_T(F).\n\n
2.1 Making the Minimax Forecaster Efficient\n\n
The Minimax Forecaster described above is not computationally efficient, as the computation of p_t requires averaging over exponentially many ERM’s.
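As an illustration of Eq. (3), for a small expert class and short horizon the expectation over the random playout Y_{t+1}, . . . , Y_T can be taken exactly by enumeration (a hedged sketch, not from the paper: the class F and horizon are made-up toy values, and the ERM step is brute force):

```python
from itertools import product

# Toy class of static experts over horizon T = 3 (illustrative values).
F = [(1, 1, 1), (-1, -1, -1), (1, -1, 1)]
T = 3

def erm_loss(y):
    # inf_{f in F} L(f, y): the cumulative absolute loss of the best
    # expert on the completed outcome sequence y (the ERM step,
    # here by brute-force enumeration over F).
    return min(sum(abs(ft - yt) for ft, yt in zip(f, y)) for f in F)

def minimax_prediction(past):
    # Eq. (3): the expectation, over the random playout Y_{t+1},...,Y_T,
    # of the ERM loss when round t's outcome is -1 minus the ERM loss
    # when it is +1.  For tiny T the expectation is computed exactly.
    t = len(past) + 1
    playouts = list(product([-1, 1], repeat=T - t))
    return sum(erm_loss(list(past) + [-1] + list(yy)) -
               erm_loss(list(past) + [+1] + list(yy))
               for yy in playouts) / len(playouts)

print(minimax_prediction([]))  # prediction for round t = 1  → 0.5
```

Replacing the exact average over playouts by a single random draw of Y_{t+1}, . . . , Y_T is exactly the randomization discussed next.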
However, by a martingale argument, it is not hard to show that it is in fact sufficient to compute only two ERM’s per round.\n\n
Algorithm 2 Minimax Forecaster with efficient implementation (mf*)\nfor t = 1 to T do\n    For i = t + 1, . . . , T, let Y_i be a Rademacher random variable\n    Let p_t := inf_{f ∈ F} L(f, y_1 . . . y_{t−1} (−1) Y_{t+1} . . . Y_T) − inf_{f ∈ F} L(f, y_1 . . . y_{t−1} 1 Y_{t+1} . . . Y_T)\n    Predict p_t, receive outcome y_t and suffer loss |p_t − y_t|\nend for\n\n
Theorem 2. For any class F ⊆ [−1, +1]^T of static experts, the regret of the randomized forecasting strategy mf* (Algorithm 2) satisfies\n\n
V_T^abs(mf*, F) ≤ R_T(F) + √(2T ln(1/δ))\n\n
with probability at least 1 − δ. Moreover, if the predictions p = (p_1, . . . , p_T) are computed reusing the random values Y_1, . . . , Y_T drawn at the first iteration of the algorithm, rather than drawing fresh values at each iteration, then it holds that\n\n
E[ L(p, y) − inf_{f ∈ F} L(f, y) ] ≤ R_T(F)   for all y ∈ {−1, +1}^T.\n\n
Proof sketch. To prove the second statement, note that |E[p_t] − y_t| = E[|p_t − y_t|] for any fixed y_t ∈ {−1, +1} and p_t bounded in [−1, +1], and use Thm. 1. To prove the first statement, note that |p_t − y_t| − |E_{p_t}[p_t] − y_t| for t = 1, . . . , T is a martingale difference sequence with respect to p_1, . . . , p_T, and apply Azuma’s inequality.\n\n
The second statement in the theorem bounds the regret only in expectation and is thus weaker than the first one. On the other hand, it might have algorithmic benefits.\n\n
¹In the statistical learning literature, it is more common to scale this quantity by 1/T, but the form we use here is more convenient for stating cumulative regret bounds.
Indeed, if we reuse the same values Y_1, . . . , Y_T, then the computation of the infima over f in mf* is with respect to an outcome sequence which changes at only one point in each round. Depending on the specific learning problem, it might be easier to re-compute the infimum after changing a single point in the outcome sequence, as opposed to computing the infimum over a different outcome sequence in each round.\n\n
3 The R2 Forecaster\n\n
The Minimax Forecaster presented above is very specific to the absolute loss ℓ(f, y) = |f − y| and to binary outcomes Y = {−1, +1}, which limits its applicability. We note that extending the forecaster to other losses or different outcome spaces is not trivial: indeed, the recursive unwinding of the minimax regret term, leading to an explicit expression and an explicit algorithm, does not work as-is for other cases. Nevertheless, we will now show how one can deal with general (convex, Lipschitz) loss functions and outcomes belonging to any real interval [−b, b].\n\n
The algorithm we propose essentially uses the Minimax Forecaster as a subroutine, by feeding it with a carefully chosen sequence of binary values z_t, and using predictions f_t which are scaled to lie in the interval [−1, +1]. The values of z_t are based on a randomized rounding of values in [−1, +1], which depend in turn on the loss subgradient. Thus, we denote the algorithm as the Randomized Rounding (R2) Forecaster.\n\n
To describe the algorithm, we introduce some notation. For any scalar f ∈ [−b, b], define f̃ = f/b to be the scaled version of f into the range [−1, +1]. For vectors f, define f̃ = (1/b)f. Also, we let ∂_{p_t} ℓ(p_t, y_t) denote any subgradient of the loss function ℓ with respect to the prediction p_t. The pseudocode of the R2 Forecaster is presented as Algorithm 3 below, and its regret guarantee is summarized in Thm.
3. The proof is presented in Appendix B of the supplementary material.\n\n
Theorem 3. Suppose ℓ is convex and ρ-Lipschitz in its first argument. For any F ⊆ [−b, b]^T, the regret of the R2 Forecaster (Algorithm 3) satisfies\n\n
V_T(R2, F) ≤ ρ R_T(F) + ρ b (√(1/η) + 2) √(2T ln(2T/δ))\n\n
with probability at least 1 − δ.\n\n
Algorithm 3 The R2 Forecaster\nInput: Upper bound b on |f_t|, |y_t| for all t = 1, . . . , T and f ∈ F; upper bound ρ on sup_{p, y ∈ [−b, b]} |∂_p ℓ(p, y)|; precision parameter η ≥ 1/T.\nfor t = 1 to T do\n    p_t := 0\n    for j = 1 to ηT do\n        For i = t, . . . , T, let Y_i be a Rademacher random variable\n        Draw Δ := inf_{f ∈ F} L(f̃, z_1 . . . z_{t−1} (−1) Y_{t+1} . . . Y_T) − inf_{f ∈ F} L(f̃, z_1 . . . z_{t−1} 1 Y_{t+1} . . . Y_T)\n        Let p_t := p_t + (b/(ηT)) Δ\n    end for\n    Predict p_t\n    Receive outcome y_t and suffer loss ℓ(p_t, y_t)\n    Let r_t := (1/2)(1 − (1/ρ) ∂_{p_t} ℓ(p_t, y_t)) ∈ [0, 1]\n    Let z_t := 1 with probability r_t, and z_t := −1 with probability 1 − r_t\nend for\n\n
The prediction p_t which the algorithm computes is an empirical approximation to\n\n
b E_{Y_{t+1}, . . . , Y_T}[ inf_{f ∈ F} L(f̃, z_1 ··· z_{t−1} (−1) Y_{t+1} ··· Y_T) − inf_{f ∈ F} L(f̃, z_1 ··· z_{t−1} 1 Y_{t+1} ··· Y_T) ],    (4)\n\n
obtained by repeatedly drawing independent values for Y_{t+1}, . . . , Y_T and averaging. The accuracy of the approximation is reflected in the precision parameter η. A larger value of η improves the regret bound, but also increases the runtime of the algorithm. Thus, η provides a trade-off between the computational complexity of the algorithm and its regret guarantee. We note that even when η is taken to be a constant fraction, the resulting algorithm still runs in polynomial time O(T²c), where c is the time to compute a single ERM. In subsequent results pertaining to this Forecaster, we will assume that η is taken to be a constant fraction.\n\n
We end this section with a remark that plays an important role in what follows.\n\n
Remark 1. The predictions of our forecasting strategies do not depend on the ordering of the predictions of the experts in F. In other words, all the results proven so far also hold in a setting where the elements of F are functions f : {1, . . . , T} → P, and the adversary has control over the permutation π_1, . . . , π_T of {1, . . . , T} that is used to define the prediction f(π_t) of expert f at time t.² Also, Thm. 1 implies that the value of V_T^abs(F) remains unchanged irrespective of the permutation chosen by the adversary.\n\n
4 Application 1: Transductive Online Learning\n\n
The first application we consider is a rather straightforward one, in the context of transductive online learning [6]. In this model, we have an arbitrary sequence of labeled examples (x_1, y_1), . . . , (x_T, y_T), where only the set {x_1, . . . , x_T} of unlabeled instances is known to the learner in advance. At each round t, the learner must provide a prediction p_t for the label y_t. The true label y_t is then revealed, and the learner incurs a loss ℓ(p_t, y_t).
The learner’s goal is to minimize the transductive online regret Σ_{t=1}^T ( ℓ(p_t, y_t) − inf_{f ∈ F} ℓ(f(x_t), y_t) ) with respect to a fixed class of predictors F of the form {x ↦ f(x)}.\n\n
The work [16] considers the binary classification case with zero-one loss. Their main result is that if a class F of binary functions has bounded VC dimension d, and there exists an efficient algorithm to perform empirical risk minimization, then one can construct an efficient randomized algorithm for transductive online learning, whose regret is at most O(T^{3/4} √(d ln(T))) in expectation. The significance of this result is that efficient batch learning (via empirical risk minimization) implies efficient learning in the transductive online setting. This is an important result, as online learning can be computationally harder than batch learning — see, e.g., [8] for an example in the context of Boolean learning.\n\n
A major open question posed by [16] was whether one can achieve the optimal rate O(√(dT)), matching the rate of a batch learning algorithm in the statistical setting. Using the R2 Forecaster, we can easily achieve the above result, as well as similar results in a strictly more general setting. This shows that efficient batch learning not only implies efficient transductive online learning (the main thesis of [16]), but also that the same rates can be obtained, and for possibly non-binary prediction problems as well.\n\n
²Formally, at each step t: (1) the adversary chooses and reveals the next element π_t of the permutation; (2) the forecaster chooses p_t ∈ P and simultaneously the adversary chooses y_t ∈ Y.\n\n
Theorem 4. Suppose we have a computationally efficient algorithm for empirical risk minimization (with respect to the zero-one loss) over a class F of {0, 1}-valued functions with VC dimension d. Then, in the transductive online model, the efficient randomized forecaster mf* achieves an expected regret of O(√(dT)) with respect to the zero-one loss. Moreover, for an arbitrary class F of [−b, b]-valued functions with Rademacher complexity R_T(F), and any convex ρ-Lipschitz loss function, if there exists a computationally efficient algorithm for empirical risk minimization, then the R2 Forecaster is computationally efficient and achieves, in the transductive online model, a regret of ρ R_T(F) + O(ρ b √(T ln(T/δ))) with probability at least 1 − δ.\n\n
Proof. Since the set {x_1, . . . , x_T} of unlabeled examples is known, we reduce the online transductive model to prediction with expert advice in the setting of Remark 1. This is done by mapping each function f ∈ F to a function f : {1, . . . , T} → P by t ↦ f(x_t), which is equivalent to an expert in the setting of Remark 1. When F maps to {0, 1}, and we care about the zero-one loss, we can use the forecaster mf* to compute randomized predictions and apply Thm. 2 to bound the expected transductive online regret by R_T(F). For a class with VC dimension d, R_T(F) ≤ c√(dT) for some constant c > 0, using Dudley’s chaining method [12], and this concludes the proof of the first part of the theorem. The second part is an immediate corollary of Thm. 3.\n\n
We close this section by contrasting our results for online transductive learning with those of [7] about standard online learning. If F contains {0, 1}-valued functions, then the optimal regret bound for online learning is of order √(d′T), where d′ is the Littlestone dimension of F.
Since the Littlestone dimension of a class is never smaller than its VC dimension, we conclude that online learning is a harder setting than online transductive learning.\n\n
5 Application 2: Online Collaborative Filtering\n\n
We now turn to discuss the application of our results in the context of collaborative filtering with trace-norm constrained matrices, presenting what is (to the best of our knowledge) the first computationally efficient online algorithm for this problem. In collaborative filtering, the learning problem is to predict entries of an unknown m × n matrix based on a subset of its observed entries. A common approach is norm regularization, where we seek a low-norm matrix which matches the observed entries as best as possible. The norm is often taken to be the trace-norm [22, 19, 4], although other norms have also been considered, such as the max-norm [18] and the weighted trace-norm [20, 13].\n\n
Previous theoretical treatments of this problem assumed a stochastic setting, where the observed entries are picked according to some underlying distribution (e.g., [23, 21]). However, even when the guarantees are distribution-free, assuming a fixed distribution fails to capture important aspects of collaborative filtering in practice, such as non-stationarity [17]. Thus, an online adversarial setting, where no distributional assumptions whatsoever are required, seems to be particularly well-suited to this problem domain.\n\n
In an online setting, at each round t the adversary reveals an index pair (i_t, j_t) and secretly chooses a value y_t for the corresponding matrix entry. After that, the learner selects a prediction p_t for that entry. Then y_t is revealed and the learner suffers a loss ℓ(p_t, y_t). Hence, the goal of a learner is to minimize the regret with respect to a fixed class W of prediction matrices, Σ_{t=1}^T ℓ(p_t, y_t) − inf_{W ∈ W} Σ_{t=1}^T ℓ(W_{i_t, j_t}, y_t). Following reality, we will assume that the adversary picks a different entry in each round. When the learner’s performance is measured by the regret after all T = mn entries have been predicted, the online collaborative filtering setting reduces to prediction with expert advice as discussed in Remark 1.\n\n
As mentioned previously, W is often taken to be a convex class of matrices with bounded trace-norm. Many convex learning problems, such as linear and kernel-based predictors, as well as matrix-based predictors, can be learned efficiently both in a stochastic and an online setting, using mirror descent or regularized follow-the-leader methods. However, for reasonable choices of W, a straightforward application of these techniques can lead to algorithms with trivial bounds. In particular, in the case of W consisting of m × n matrices with trace-norm at most r, standard online regret bounds would scale like O(r√T). Since for this norm one typically has r = O(√(mn)), we get a per-round regret guarantee of O(√(mn/T)). This is a trivial bound, since it becomes “meaningful” (smaller than a constant) only after all T = mn entries have been predicted.\n\n
On the other hand, based on general techniques developed in [15] and greatly extended in [1], it can be shown that online learnability is information-theoretically possible for such W. However, these techniques do not provide a computationally efficient algorithm. Thus, to the best of our knowledge, there is currently no efficient (polynomial-time) online algorithm which attains non-trivial regret.
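To see concretely why the O(√(mn/T)) guarantee discussed above is vacuous, here is a quick numeric sketch (not from the paper; all hidden constants are set to 1 for illustration):

```python
import math

# Per-round regret guarantee sqrt(mn / T), i.e., the cumulative bound
# r * sqrt(T) with trace-norm radius r = sqrt(mn), divided by T rounds.
# All constants in the O(.) notation are set to 1 for illustration.
m = n = 1000                      # a 1000 x 1000 matrix: mn = 10^6 entries

def per_round_bound(T, r=math.sqrt(m * n)):
    # cumulative bound r * sqrt(T), averaged over T rounds
    return r * math.sqrt(T) / T

# The bound only drops below a constant once essentially all entries
# have been revealed:
assert per_round_bound(m * n // 2) > 1.0   # halfway through: still trivial
assert per_round_bound(m * n) == 1.0       # reaches 1 only at T = mn
```

In other words, under this scaling the guarantee becomes informative only after every one of the mn entries has been predicted once, which is exactly the sense in which the bound is trivial.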
In this section, we show how to obtain such an algorithm using the R2 Forecaster.\n\n
Consider first the transductive online setting, where the set of indices to be predicted is known in advance, and the adversary may only choose the order and values of the entries. It is readily seen that the R2 Forecaster can be applied in this setting, using any convex class W of fixed matrices with bounded entries to compete against, and any convex Lipschitz loss function. To do so, we let {(i_k, j_k)}_{k=1}^T be the set of entries, and run the R2 Forecaster with respect to F = {t ↦ W_{i_t, j_t} : W ∈ W}, which corresponds to a class of experts as discussed in Remark 1.\n\n
What is perhaps more surprising is that the R2 Forecaster can also be applied in a non-transductive setting, where the indices to be predicted are not known in advance. Moreover, the Forecaster doesn’t even need to know the horizon T in advance. The key idea to achieve this is to utilize the non-asymptotic nature of the learning problem — namely, that the game is played over a finite m × n matrix, so the time horizon is necessarily bounded.\n\n
The algorithm we propose is very simple: we apply the R2 Forecaster as if we are in a setting with time horizon T = mn, which is played over all entries of the m × n matrix. By Remark 1, the R2 Forecaster does not need to know the order in which these mn entries are going to be revealed. Whenever W is convex and ℓ is a convex function, we can find an ERM in polynomial time by solving a convex problem. Hence, we can implement the R2 Forecaster efficiently.\n\n
To show that this is indeed a viable strategy, we need the following lemma, whose proof is presented in Appendix C of the supplementary material.\n\n
Lemma 1. Consider a (possibly randomized) forecaster A for a class F whose regret after T steps satisfies V_T(A, F) ≤ G with probability at least 1 − δ > 1/2. Furthermore, suppose the loss function is such that\n\n
inf_{p ∈ P} inf_{p′ ∈ P} sup_{y ∈ Y} ( ℓ(p, y) − ℓ(p′, y) ) ≥ 0.\n\n
Then\n\n
max_{t=1,...,T} V_t(A, F) ≤ G\n\n
with probability at least 1 − δ.\n\n
Note that a simple sufficient condition for the assumption on the loss function to hold is that P = Y and ℓ(p, y) ≥ ℓ(y, y) for all p, y ∈ P.\n\n
Using this lemma, the following theorem exemplifies how we can obtain a regret guarantee for our algorithm, in the case of W consisting of the convex set of matrices with bounded trace-norm and bounded entries. For the sake of clarity, we will consider n × n matrices.\n\n
Theorem 5. Let ℓ be a loss function which satisfies the conditions of Lemma 1. Also, let W consist of n × n matrices with trace-norm at most r = O(n) and entries at most b = O(1), and suppose we apply the R2 Forecaster over time horizon n² and all entries of the matrix. Then with probability at least 1 − δ, after T rounds, the algorithm achieves an average per-round regret of at most\n\n
O( (n^{3/2} + n √(ln(n/δ))) / T )\n\n
uniformly over T = 1, . . . , n².\n\n
Proof. In our setting, where the adversary chooses a different entry at each round, [21, Theorem 6] implies that for the class W′ of all matrices with trace-norm at most r = O(n), it holds that R_T(W′)/T ≤ O(n^{3/2}/T). Therefore, R_{n²}(W′) ≤ O(n^{3/2}). Since W ⊆ W′, we get by definition of the Rademacher complexity that R_{n²}(W) = O(n^{3/2}) as well. By Thm. 3, the regret after n² rounds is O(n^{3/2} + n√(ln(n/δ))) with probability at least 1 − δ. Applying Lemma 1, we get that the cumulative regret at the end of any round T = 1, . . . , n² is at most O(n^{3/2} + n√(ln(n/δ))), as required.\n\n
This bound becomes non-trivial after n^{3/2} entries are revealed, which is still a vanishing proportion of all n² entries. While this regret rate might seem unusual compared to standard regret bounds (which usually have rates of 1/√T for general losses), it is a natural outcome of the non-asymptotic nature of our setting, where T can never be larger than n². In fact, this is the same rate one would obtain in a batch setting, where the entries are drawn from an arbitrary distribution. Moreover, an assumption such as boundedness of the entries is required for currently-known guarantees even in a batch setting — see [21] for details.\n\n
Acknowledgments\n\n
The first author acknowledges partial support by the PASCAL2 NoE under EC grant FP7-216886.\n\n
References\n\n
[1] A. Rakhlin, K. Sridharan, and A. Tewari. Online learning: Random averages, combinatorial parameters, and learnability. In NIPS, 2010.\n\n
[2] J. Abernethy, P. Bartlett, A. Rakhlin, and A. Tewari. Optimal strategies and minimax lower bounds for online convex games. In COLT, 2009.\n\n
[3] J. Abernethy and M. Warmuth. Repeated games against budgeted adversaries. In NIPS, 2010.\n\n
[4] F. Bach. Consistency of trace-norm minimization. Journal of Machine Learning Research, 9:1019–1048, 2008.\n\n
[5] P. Bartlett and S. Mendelson. Rademacher and Gaussian complexities: Risk bounds and structural results. In COLT, 2001.\n\n
[6] S. Ben-David, E. Kushilevitz, and Y. Mansour. Online learning versus offline learning. Machine Learning, 29(1):45–63, 1997.\n\n
[7] S. Ben-David, D. Pál, and S. Shalev-Shwartz. Agnostic online learning. In COLT, 2009.\n\n
[8] A. Blum. Separating distribution-free and mistake-bound learning models over the Boolean domain. SIAM J. Comput., 23(5):990–1000, 1994.\n\n
[9] N. Cesa-Bianchi, Y. Freund, D. Haussler, D. Helmbold, R. Schapire, and M. Warmuth. How to use expert advice.
Journal of the ACM, 44(3):427–485, May 1997.\n\n
[10] N. Cesa-Bianchi and G. Lugosi. Prediction, Learning, and Games. Cambridge University Press, 2006.\n\n
[11] T. Chung. Approximate methods for sequential decision making using expert advice. In COLT, 1994.\n\n
[12] R. M. Dudley. A Course on Empirical Processes, École de Probabilités de St. Flour, 1982, volume 1097 of Lecture Notes in Mathematics. Springer Verlag, 1984.\n\n
[13] R. Foygel, R. Salakhutdinov, O. Shamir, and N. Srebro. Learning with the weighted trace-norm under arbitrary sampling distributions. In NIPS, 2011.\n\n
[14] E. Hazan. The convex optimization approach to regret minimization. In S. Sra, S. Nowozin, and S. Wright, editors, Optimization for Machine Learning. MIT Press, to appear.\n\n
[15] J. Abernethy, A. Agarwal, P. Bartlett, and A. Rakhlin. A stochastic view of optimal regret through minimax duality. In COLT, 2009.\n\n
[16] S. Kakade and A. Kalai. From batch to transductive online learning. In NIPS, 2005.\n\n
[17] Y. Koren. Collaborative filtering with temporal dynamics. In KDD, 2009.\n\n
[18] J. Lee, B. Recht, R. Salakhutdinov, N. Srebro, and J. Tropp. Practical large-scale optimization for max-norm regularization. In NIPS, 2010.\n\n
[19] R. Salakhutdinov and A. Mnih. Probabilistic matrix factorization. In NIPS, 2007.\n\n
[20] R. Salakhutdinov and N. Srebro. Collaborative filtering in a non-uniform world: Learning with the weighted trace norm. In NIPS, 2010.\n\n
[21] O. Shamir and S. Shalev-Shwartz. Collaborative filtering with the trace norm: Learning, bounding, and transducing. In COLT, 2011.\n\n
[22] N. Srebro, J. Rennie, and T. Jaakkola. Maximum-margin matrix factorization. In NIPS, 2004.\n\n
[23] N. Srebro and A. Shraibman. Rank, trace-norm and max-norm.
In COLT, 2005.", "award": [], "sourceid": 247, "authors": [{"given_name": "Nicol\u00f2", "family_name": "Cesa-Bianchi", "institution": null}, {"given_name": "Ohad", "family_name": "Shamir", "institution": null}]}