{"title": "From Batch to Transductive Online Learning", "book": "Advances in Neural Information Processing Systems", "page_first": 611, "page_last": 618, "abstract": "", "full_text": "From Batch to Transductive Online Learning\n\nSham Kakade\n\nToyota Technological Institute\n\nChicago, IL 60637\nsham@tti-c.org\n\nAdam Tauman Kalai\n\nToyota Technological Institute\n\nChicago, IL 60637\n\nkalai@tti-c.org\n\nAbstract\n\nIt is well-known that everything that is learnable in the dif\ufb01cult online\nsetting, where an arbitrary sequences of examples must be labeled one at\na time, is also learnable in the batch setting, where examples are drawn\nindependently from a distribution. We show a result in the opposite di-\nrection. We give an ef\ufb01cient conversion algorithm from batch to online\nthat is transductive: it uses future unlabeled data. This demonstrates the\nequivalence between what is properly and ef\ufb01ciently learnable in a batch\nmodel and a transductive online model.\n\n1 Introduction\n\nThere are many striking similarities between results in the standard batch learning setting,\nwhere labeled examples are assumed to be drawn independently from some distribution,\nand the more dif\ufb01cult online setting, where labeled examples arrive in an arbitrary se-\nquence. Moreover, there are simple procedures that convert any online learning algorithm\nto an equally good batch learning algorithm [8]. This paper gives a procedure going in the\nopposite direction.\nIt is well-known that the online setting is strictly harder than the batch setting, even for\nthe simple one-dimensioanl class of threshold functions on the interval [0, 1]. Hence, we\nconsider the online transductive model of Ben-David, Kushilevitz, and Mansour [2]. In\nthis model, an arbitrary but unknown sequence of n examples (x1, y1), . . . , (xn, yn) \u2208\nX \u00d7{\u22121, 1} is \ufb01xed in advance, for some instance space X . 
The set of unlabeled examples\nis then presented to the learner, \u03a3 = {xi|1 \u2264 i \u2264 n}. The examples are then revealed, in an\nonline manner, to the learner, for i = 1, 2, . . . , n. The learner observes example xi (along\nwith all previous labeled examples (x1, y1), . . . , (xi\u22121, yi\u22121) and the unlabeled example\nset \u03a3) and must predict yi. The true label yi is then revealed to the learner. After this\noccurs, the learner compares its number of mistakes to the minimum number of mistakes of\nany of a target class F of functions f : X \u2192 {\u22121, 1} (such as linear threshold functions).\nNote that our results are in this type of agnostic model [7], where we allow for arbitrary\nlabels, unlike the realizable setting, i.e., noiseless or PAC models, where it is assumed that\nthe labels are consistent with some f \u2208 F.\nWith this simple transductive knowledge of what unlabeled examples are to come, one\ncan use existing expert algorithms to inef\ufb01ciently learn any class of \ufb01nite VC dimension,\nsimilar to the batch setting. How does one use unlabeled examples ef\ufb01ciently to guarantee\ngood online performance?\n\n\fOur ef\ufb01cient algorithm A2 converts a proper1 batch algorithm to a proper online algorithm\n(both in the agnostic setting). At any point in time, it has observed some labeled examples.\nIt then \u201challucinates\u201d random examples by taking some number of unlabeled examples and\nlabeling them randomly. It appends these examples to those observed so far and predicts\naccording to the batch algorithm that \ufb01nds the hypothesis of minimum empirical error on\nthe combined data.\nThe idea of \u201challucinating\u201d and optimizing has been used for designing ef\ufb01cient online\nalgorithms [6, 5, 1, 10, 4] in situations where exponential weighting schemes were inef\ufb01-\ncient. The hallucination analogy was suggested by Blum and Hartline [4]. 
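This hallucinate-then-minimize prediction step can be sketched in Python. This is our own illustrative sketch, under assumptions not in the paper: the hypothesis class is a small explicit list of functions, and `erm` stands in for the black-box proper batch learner M; the names `erm`, `hallucinate`, and `predict` are ours.

```python
import random

def erm(hypotheses, data):
    # Empirical error minimizer over an explicit finite class; this
    # stands in for the paper's black-box proper batch learner M.
    return min(hypotheses, key=lambda h: sum(1 for x, y in data if h(x) != y))

def hallucinate(unlabeled, n):
    # Label each unlabeled example with a random sign, repeated a random
    # number of times r_x drawn uniformly from [-n^(1/4), n^(1/4)].
    r_max = max(1, int(round(n ** 0.25)))
    tau = []
    for x in unlabeled:
        r = random.randint(-r_max, r_max)
        tau += [(x, 1 if r > 0 else -1)] * abs(r)
    return tau

def predict(hypotheses, tau, history, x):
    # Predict on x with a hypothesis of minimum empirical error on the
    # hallucinated examples plus the labeled examples seen so far.
    return erm(hypotheses, tau + history)(x)
```

In a run of the algorithm, `tau` is drawn once up front from the unlabeled set, and `predict` is called once per round with the growing `history` of revealed labels.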
In the context of transductive learning, it seems to be a natural way to try to use the unlabeled examples in conjunction with a batch learner. Let #mistakes(f, \u03c3n) denote the number of mistakes of a function f \u2208 F on a particular sequence \u03c3n \u2208 (X \u00d7 {\u22121, 1})^n, and #mistakes(A, \u03c3n) denote the same quantity for a transductive online learning algorithm A. Our main theorem is the following.\nTheorem 1. Let F be a class of functions f : X \u2192 {\u22121, 1} of VC dimension d. There is an efficient randomized transductive online algorithm that, for any n > 1 and \u03c3n \u2208 (X \u00d7 {\u22121, 1})^n,\n\nE[#mistakes(A2, \u03c3n)] \u2264 min_{f\u2208F} #mistakes(f, \u03c3n) + 2.5 n^{3/4} \u221a(d log n).\n\nThe algorithm is computationally efficient in the sense that it runs in time poly(n), given an efficient proper batch learning algorithm.\nOne should note that the bound on the error rate is the same as that of the best f \u2208 F plus O(n^{\u22121/4} \u221a(d log n)), approaching 0 at a rate related to the standard VC bound.\nIt is well-known that, without regard to computational efficiency, the learnable classes of functions are exactly those with finite VC dimension. Consequently, the classes of functions learnable in the batch and transductive online settings are the same. The classes of functions properly learnable by computationally efficient algorithms in the proper batch and transductive online settings are identical, as well.\nIn addition to the new algorithm, this is interesting because it helps justify a long line of work suggesting that whatever can be done in a batch setting can also be done online. Our result is surprising in light of earlier work by Blum showing that a slightly different online model is harder than its batch analog for computational reasons and not information-theoretic reasons [3].\nIn Section 2, we define the transductive online model. 
In Section 3, we analyze the easier case of data that is realizable with respect to some function class, i.e., when there is some function of zero error in the class. In Section 4, we present and analyze the hallucination algorithm. In Section 5, we discuss open problems such as extending the results to improper learning and the efficient realizable case.\n\n2 Models and definitions\n\nThe transductive online model considered by Ben-David, Kushilevitz, and Mansour [2] consists of an instance space X and label set Y, which we will always take to be binary, Y = {\u22121, 1}. An arbitrary n > 0 and arbitrary sequence of labeled examples (x1, y1), . . . , (xn, yn) is fixed. One can think of these as being chosen by an adversary who knows the (possibly randomized) learning algorithm but not the realization of its random coin flips. For notational convenience, we define \u03c3i to be the subsequence of first i labeled examples,\n\n\u03c3i = (x1, y1), (x2, y2), . . . , (xi, yi),\n\nand \u03a3 to be the set of all unlabeled examples in \u03c3n,\n\n\u03a3 = {xi | i \u2208 {1, 2, . . . , n}}.\n\nA transductive online learner A is a function that takes as input n (the number of examples to be predicted), \u03a3 \u2286 X (the set of unlabeled examples, |\u03a3| \u2264 n), xi \u2208 \u03a3 (the example to be tested), and \u03c3i\u22121 \u2208 (\u03a3 \u00d7 Y)^{i\u22121} (the previous i \u2212 1 labeled examples), and outputs a prediction \u2208 Y of yi, for any 1 \u2264 i \u2264 n. The number of mistakes of A on the sequence \u03c3n = (x1, y1), . . . , (xn, yn) is\n\n#mistakes(A, \u03c3n) = |{i | A(n, \u03a3, xi, \u03c3i\u22121) \u2260 yi}|.\n\nIf A is computed by a randomized algorithm, then we similarly define E[#mistakes(A, \u03c3n)], where the expectation is taken over the random coin flips of A.\n\n1 A proper learning algorithm is one that always outputs a hypothesis h \u2208 F.\n\n
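As a concrete illustration of this mistake-counting protocol, here is a small Python harness (our own sketch; the `learner` interface mirrors the paper's A(n, \u03a3, xi, \u03c3i\u22121), and the function name is ours):

```python
def count_mistakes(learner, sequence):
    # Run the transductive online protocol: the learner sees n and the
    # unlabeled set Sigma up front, then predicts each x_i in order,
    # with the true label revealed only after the prediction.
    n = len(sequence)
    sigma = {x for x, _ in sequence}          # the unlabeled example set
    history, mistakes = [], 0
    for x, y in sequence:
        prediction = learner(n, sigma, x, list(history))
        if prediction != y:
            mistakes += 1
        history.append((x, y))                # y_i revealed after predicting
    return mistakes
```

For example, on the sequence (1, +1), (2, \u22121), (3, +1), a learner that always predicts +1 makes exactly one mistake.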
In order to speak of the learnability of a set F of functions f : X \u2192 Y, we define\n\n#mistakes(f, \u03c3n) = |{i | f(xi) \u2260 yi}|.\n\nFormally, paralleling agnostic learning [7],2 we define an efficient transductive online learner A for class F to be one for which the learning algorithm runs in time poly(n) and achieves, for any \u03b5 > 0,\n\nE[#mistakes(A, \u03c3n)] \u2264 min_{f\u2208F} #mistakes(f, \u03c3n) + \u03b5n,\n\nfor n = poly(1/\u03b5).3\n\n2.1 Proper learning\nProper batch learning requires one to output a hypothesis h \u2208 F. An efficient proper batch learning algorithm for F is a batch learning algorithm B that, given any \u03b5 > 0, with n = poly(1/\u03b5) many examples from any distribution D, outputs an h \u2208 F of expected error E[Pr_D[h(x) \u2260 y]] \u2264 min_{f\u2208F} Pr_D[f(x) \u2260 y] + \u03b5 and runs in time poly(n).\nObservation 1. Any efficient proper batch learning algorithm B can be converted into an efficient empirical error minimizer M that, for any n, given any data set \u03c3n \u2208 (X \u00d7 Y)^n, outputs an f \u2208 F of minimal empirical error on \u03c3n.\n\nProof. Run on \u03c3n alone, B is not guaranteed to output a hypothesis of minimum empirical error. Instead, we set the error tolerance of B to \u03b5 = 1/(4n), and give it examples drawn from the distribution D which is uniform over the data \u03c3n (a type of bootstrap). If B indeed returns a hypothesis h whose error exceeds that of the best f \u2208 F by less than 1/n, it must be a hypothesis of minimum empirical error on \u03c3n. By Markov's inequality, with probability at most 1/4, the excess generalization error is more than 1/n. By repeating several times and taking the best hypothesis, we get a success probability exponentially close to 1. 
The runtime is polynomial in n.\n\nTo define proper learning in an online setting, it is helpful to think of the following alternative definition of transductive online learning. In this variation, the learner must output a sequence of hypotheses h1, h2, . . . , hn : X \u2192 {\u22121, 1}. After the ith hypothesis hi is output, the example (xi, yi) is revealed, and it is clear whether the learner made an error. Formally, the (possibly randomized) algorithm A' still takes as input n, \u03a3, and \u03c3i\u22121 (but no longer xi), and outputs hi : X \u2192 {\u22121, 1}, and errs if hi(xi) \u2260 yi. To see that this model is equivalent to the previous definition, note that any algorithm A' that outputs hypotheses hi can be used to make predictions hi(xi) on example i (it errs if hi(xi) \u2260 yi). It is equally true, but less obvious, that any algorithm A in the previous model can be converted to an algorithm A' in this model. This is because A' can be viewed as outputting hi : X \u2192 {\u22121, 1}, where the function hi is defined by setting hi(x) equal to the prediction of algorithm A on the sequence \u03c3i\u22121 followed by the example x, for each x \u2208 X , i.e., hi(x) = A(n, \u03a3, x, \u03c3i\u22121). (The same coins can be used if A and A' are randomized.) A (possibly randomized) transductive online algorithm in this model is defined to be proper for a family of functions F if it always outputs hi \u2208 F.\n\n2 It is more common in online learning to bound the total number of mistakes of an online algorithm on an arbitrary sequence. We bound its error rate, as is usual for batch learning.\n3 The results in this paper could be replaced by high-probability 1 \u2212 \u03b4 bounds at a cost of log 1/\u03b4.\n\n3 Warmup: the realizable case\nIn this section, we consider the realizable special case in which there is some f \u2208 F which correctly labels all examples. 
In particular, this means that we only consider sequences \u03c3n for which there is an f \u2208 F with #mistakes(f, \u03c3n) = 0. This case will be helpful to analyze first as it is easier.\nFix arbitrary n > 0 and \u03a3 = {x1, x2, . . . , xn} \u2286 X , |\u03a3| \u2264 n. Say there are at most L different ways to label the examples in \u03a3 according to functions f \u2208 F, so 1 \u2264 L \u2264 2^{|\u03a3|}. In the transductive online model, L is determined by \u03a3 and F only. Hence, as long as prediction occurs only on examples x \u2208 \u03a3, there are effectively only L different functions in F that matter, and we can thus pick L such functions that give rise to the L different labelings. On the ith example, one could simply take a majority vote of fj(xi) over the consistent labelings fj (the so-called halving algorithm), and this would easily ensure at most log2(L) mistakes, because each mistake eliminates at least half of the consistent labelings. One can also use the following proper learning algorithm.\n\nProper transductive online learning algorithm in the realizable case:\n\u2022 Preprocessing: Given the set of unlabeled examples \u03a3, take L functions f1, f2, . . . , fL \u2208 F that give rise to the L different labelings of x \u2208 \u03a3.4\n\u2022 ith prediction: Output a uniformly random function f from the fj consistent with \u03c3i\u22121.\n\nThe above algorithm, while possibly very inefficient, is easy to analyze.\nTheorem 2. Fix a class of binary functions F of VC dimension d. The above randomized proper learning algorithm makes an expected d log(n) mistakes on any sequence of examples of length n \u2265 2, provided that there is some mistake-free f \u2208 F.\n\nProof. Let Vi be the number of labelings fj consistent with the first i examples, so that L = V0 \u2265 V1 \u2265 \u00b7\u00b7\u00b7 \u2265 Vn \u2265 1 and L \u2264 n^d, by Sauer's lemma [11] for n \u2265 2, where d is the VC dimension of F. 
Observe that the number of consistent labelings that make a mistake on the ith example is exactly Vi\u22121 \u2212 Vi. Hence, the total expected number of mistakes is\n\n\u2211_{i=1}^n (Vi\u22121 \u2212 Vi)/Vi\u22121 \u2264 \u2211_{i=1}^n (1/Vi\u22121 + 1/(Vi\u22121 \u2212 1) + . . . + 1/(Vi + 1)) \u2264 \u2211_{i=Vn+1}^{L} 1/i \u2264 log(L).\n\n4 More formally, take L functions with the following properties: for each pair 1 \u2264 j, k \u2264 L with j \u2260 k, there exists x \u2208 \u03a3 such that fj(x) \u2260 fk(x), and for every f \u2208 F, there exists a 1 \u2264 j \u2264 L with f(x) = fj(x) for all x \u2208 \u03a3.\n\nHence the above algorithm achieves an error rate of O(d log(n)/n), which quickly approaches zero for large n. Note that this closely matches what one achieves in the batch setting. As in the batch setting, no better bounds can be given up to a constant factor.\n\n4 General setting\n\nWe now consider the more difficult unrealizable setting, where we have an unconstrained sequence of examples (though we still work in a transductive setting). We begin by presenting a known (inefficient) extension of the halving algorithm of the previous section that works in the agnostic (unrealizable) setting.\n\nInefficient proper transductive online learning algorithm A1:\n\n\u2022 Preprocessing: Given the set of unlabeled examples \u03a3, take L functions f1, f2, . . . , fL that give rise to the L different labelings of x \u2208 \u03a3. Assign an initial weight w1 = w2 = . . . 
= wL = 1 to each function.\n\u2022 Output fj, where 1 \u2264 j \u2264 L is chosen with probability wj/(w1 + . . . + wL).\n\u2022 Update: for each j for which fj(xi) \u2260 yi, reduce wj,\n\nwj := wj (1 \u2212 \u221a(log(L)/n)).\n\nUsing an analysis very similar to that of Weighted Majority [9], one can show that, for any n > 1 and sequence of examples \u03c3n \u2208 (X \u00d7 {\u22121, 1})^n,\n\nE[#mistakes(A1, \u03c3n)] \u2264 min_{f\u2208F} #mistakes(f, \u03c3n) + 2 \u221a(dn log n),\n\nwhere d is the VC dimension of F. Note the similarity to the standard VC bound.\n\n4.1 Efficient algorithm\n\nWe can only hope to get an efficient proper online algorithm when there is an efficient proper batch algorithm. As mentioned in Section 2.1, this means that there is a batch algorithm M that, given any data set, efficiently finds a hypothesis h \u2208 F of minimum empirical error. (In fact, most proper learning algorithms work this way to begin with.) Using this, our efficient algorithm is as follows.\n\nEfficient transductive online learning algorithm A2:\n\u2022 Preprocessing: Given the set of unlabeled examples \u03a3, create a hallucinated data set \u03c4 as follows.\n1. For each example x \u2208 \u03a3, choose an integer rx uniformly at random such that \u2212n^{1/4} \u2264 rx \u2264 n^{1/4}.\n2. Add |rx| copies of the example x, labeled by the sign of rx, i.e. (x, sgn(rx)), to \u03c4.\n\u2022 To predict on xi: output hypothesis M(\u03c4 \u03c3i\u22121) \u2208 F, where \u03c4 \u03c3i\u22121 is the concatenation of the hallucinated examples and the observed labeled examples so far.\n\nThe current algorithm predicts f(xi) based on f = M(\u03c4 \u03c3i\u22121). We first begin by analyzing the hypothetical algorithm that uses the function chosen on the next iteration, i.e. predicts f(xi) based on f = M(\u03c4 \u03c3i). 
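This hypothetical "be the leader" predictor, which on round i predicts with an empirical minimizer of \u03c4 \u03c3i (so it already accounts for the ith label), can be checked numerically on a tiny explicit class. The sketch below is ours; the threshold class and function names are illustrative assumptions, and on small instances one can confirm that with \u03c4 empty it makes no more mistakes than the best fixed hypothesis, as Lemma 1 asserts.

```python
def class_optimum(hypotheses, data):
    # min_f #mistakes(f, data) over an explicit finite class.
    return min(sum(1 for x, y in data if h(x) != y) for h in hypotheses)

def be_the_leader_mistakes(hypotheses, sequence, tau=()):
    # On round i, predict x_i with an empirical minimizer of tau + sigma_i,
    # i.e. a minimizer that has already seen (x_i, y_i).
    seen, mistakes = list(tau), 0
    for x, y in sequence:
        seen.append((x, y))
        leader = min(hypotheses,
                     key=lambda h: sum(1 for u, v in seen if h(u) != v))
        if leader(x) != y:
            mistakes += 1
    return mistakes
```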
(Of course, this is impossible to implement because we do not know \u03c3i when predicting f(xi).)\n\nLemma 1. Fix any \u03c4 \u2208 (X \u00d7 Y)\u2217 and \u03c3n \u2208 (X \u00d7 Y)^n. Let A'2 be the algorithm that, for each i, predicts f(xi) based on f \u2208 F which is any empirical minimizer on the concatenated data \u03c4 \u03c3i, i.e., f = M(\u03c4 \u03c3i). Then the total number of mistakes of A'2 is\n\n#mistakes(A'2, \u03c3n) \u2264 min_{f\u2208F} #mistakes(f, \u03c4 \u03c3n) \u2212 min_{f\u2208F} #mistakes(f, \u03c4).\n\nIt is instructive to first consider the case where \u03c4 is empty, i.e., there are no hallucinated examples. Then, our algorithm that predicts according to M(\u03c3i\u22121) could be called \u201cfollow the leader,\u201d as in [6]. The above lemma means that if one could use the hypothetical \u201cbe the leader\u201d algorithm then one would make no more mistakes than the best f \u2208 F. The proof of this case is simple. Imagine starting with the offline algorithm that uses M(\u03c3n) on each example x1, . . . , xn. Now, on the first n \u2212 1 examples, replace the use of M(\u03c3n) by M(\u03c3n\u22121). Since M(\u03c3n\u22121) is an error-minimizer on \u03c3n\u22121, this can only reduce the number of mistakes. Next replace M(\u03c3n\u22121) by M(\u03c3n\u22122) on the first n \u2212 2 examples, and so on. Eventually, we reach the hypothetical algorithm above, and we have only decreased our number of mistakes. The proof of the above lemma follows along these lines.\n\nProof of Lemma 1. Fix empirical minimizers gi on \u03c4 \u03c3i for i = 0, 1, . . . , n, i.e., gi = M(\u03c4 \u03c3i). For i \u2265 1, let mi be 1 if gi(xi) \u2260 yi and 0 otherwise. We argue by induction on t that\n\n#mistakes(g0, \u03c4) + \u2211_{i=1}^{t} mi \u2264 #mistakes(gt, \u03c4 \u03c3t).   (1)\n\nFor t = 0, the two are trivially equal. 
Assuming it holds for t, we have\n\n#mistakes(g0, \u03c4) + \u2211_{i=1}^{t+1} mi \u2264 #mistakes(gt, \u03c4 \u03c3t) + mt+1 \u2264 #mistakes(gt+1, \u03c4 \u03c3t) + mt+1 = #mistakes(gt+1, \u03c4 \u03c3t+1).\n\nThe first inequality above holds by the induction hypothesis, and the second follows from the fact that gt is an empirical minimizer of \u03c4 \u03c3t. The equality establishes (1) for t + 1 and thus completes the induction. The total number of mistakes of the hypothetical algorithm proposed in the lemma is \u2211_{i=1}^n mi, which gives the lemma by rearranging (1) for t = n.\nLemma 2. For any \u03c3n,\n\nE\u03c4[min_{f\u2208F} #mistakes(f, \u03c4 \u03c3n)] \u2264 E\u03c4[|\u03c4|/2] + min_{f\u2208F} #mistakes(f, \u03c3n).\n\nFor any F of VC dimension d,\n\nE\u03c4[min_{f\u2208F} #mistakes(f, \u03c4)] \u2265 E\u03c4[|\u03c4|/2] \u2212 1.5 n^{3/4} \u221a(d log n).\n\nProof. For the first part of the lemma, let g = M(\u03c3n) be an empirical minimizer on \u03c3n. Then,\n\nE\u03c4[min_{f\u2208F} #mistakes(f, \u03c4 \u03c3n)] \u2264 E\u03c4[#mistakes(g, \u03c4 \u03c3n)] = E\u03c4[|\u03c4|/2] + #mistakes(g, \u03c3n).\n\nThe last equality holds because, since each example in \u03c4 is equally likely to have a \u00b1 label, the expected number of mistakes of any fixed g \u2208 F on \u03c4 is E[|\u03c4|/2].\nFix any f \u2208 F. For the second part of the lemma, observe that we can write the number of mistakes of f on \u03c4 as\n\n#mistakes(f, \u03c4) = (|\u03c4| \u2212 \u2211_{i=1}^n f(xi)ri)/2.\n\nHence it suffices to show that E[max_{f\u2208F} \u2211_{i=1}^n f(xi)ri] \u2264 3 n^{3/4} \u221a(log L).\nNow Eri[f(xi)ri] = 0 and |f(xi)ri| \u2264 n^{1/4}. 
Next, Chernoff bounds (on the scaled random variables f(xi)ri n^{\u22121/4}) imply that, for any \u03b1 \u2264 1, with probability at most e^{\u2212n\u03b1^2/2}, \u2211_{i=1}^n f(xi)ri n^{\u22121/4} \u2265 n\u03b1. Put another way, for any \u03b2 < n, with probability at most e^{\u2212n^{\u22123/2}\u03b2^2/2}, \u2211 f(xi)ri \u2265 \u03b2. As observed before, we can reduce the problem to the L different labelings. In other words, we can assume that there are only L different functions. By the union bound, the probability that \u2211 f(xi)ri \u2265 \u03b2 for any f \u2208 F is at most L e^{\u2212n^{\u22123/2}\u03b2^2/2}. Now the expectation of a non-negative random variable X is E[X] = \u222b_0^\u221e Pr[X \u2265 x] dx. Let X = max_{f\u2208F} \u2211_{i=1}^n f(xi)ri. In our case,\n\nE[X] \u2264 \u221a(2 log(L)) n^{3/4} + \u222b_{\u221a(2 log(L)) n^{3/4}}^\u221e L e^{\u2212n^{\u22123/2}x^2/2} dx.\n\nBy Mathematica, the above is at most \u221a(2 log(L)) n^{3/4} + 1.254 n^{3/4} \u2264 3 \u221a(log(L)) n^{3/4}. Finally, we use the fact that L \u2264 n^d by Sauer's lemma.\nUnfortunately, we cannot use the algorithm A'2. However, due to the randomness we have added, we can argue that algorithm A2 is quite close:\nLemma 3. For any \u03c3n, for any i, with probability at least 1 \u2212 n^{\u22121/4} over \u03c4, M(\u03c4 \u03c3i\u22121) is an empirical minimizer of \u03c4 \u03c3i.\nProof. Define F+ = {f \u2208 F | f(xi) = 1} and F\u2212 = {f \u2208 F | f(xi) = \u22121}. WLOG, we may assume that F+ and F\u2212 are both nonempty. For if not, i.e., if all f \u2208 F predict the same sign f(xi), then the sets of empirical minimizers of \u03c4 \u03c3i\u22121 and \u03c4 \u03c3i are equal and the lemma holds trivially. For any sequence \u03c0 \u2208 (X \u00d7 Y)\u2217, define\n\ns+(\u03c0) = min_{f\u2208F+} #mistakes(f, \u03c0) and s\u2212(\u03c0) = min_{f\u2208F\u2212} #mistakes(f, \u03c0).\n\nNext observe that if s+(\u03c0) < s\u2212(\u03c0) then M(\u03c0) \u2208 F+. Similarly, if s\u2212(\u03c0) < s+(\u03c0) then M(\u03c0) \u2208 F\u2212. 
If they are equal, then an empirical minimizer may lie in either. WLOG let us say that the ith example is (xi, 1), i.e., it is labeled positively. This implies that s+(\u03c4 \u03c3i\u22121) = s+(\u03c4 \u03c3i) and s\u2212(\u03c4 \u03c3i\u22121) = s\u2212(\u03c4 \u03c3i) + 1. It is now clear that if M(\u03c4 \u03c3i\u22121) is not also an empirical minimizer of \u03c4 \u03c3i then s+(\u03c4 \u03c3i\u22121) = s\u2212(\u03c4 \u03c3i\u22121).\nNow the quantity \u2206 = s+(\u03c4 \u03c3i\u22121) \u2212 s\u2212(\u03c4 \u03c3i\u22121) is directly related to rxi, the signed random number of times that example xi is hallucinated. If we fix \u03c3n and the random choices rx for each x \u2208 \u03a3 \\ {xi}, as we increase or decrease rxi by 1, \u2206 correspondingly increases or decreases by 1. Since rxi was chosen from a range of size 2\u230an^{1/4}\u230b + 1 \u2265 n^{1/4}, \u2206 = 0 with probability at most n^{\u22121/4}.\n\nWe are now ready to prove the main theorem.\n\nProof of Theorem 1. Combining Lemmas 1 and 2, if on each period i we used any minimizer of empirical error on the data \u03c4 \u03c3i, we would have a total number of mistakes of at most min_{f\u2208F} #mistakes(f, \u03c3n) + 1.5 n^{3/4} \u221a(d log n). Suppose A2 does end up using such a minimizer on all but p periods. Then, its total number of mistakes can only be p larger than this bound. By Lemma 3, the expected number p of periods i in which an empirical minimizer of \u03c4 \u03c3i is not used is \u2264 n^{3/4}. Hence, the expected total number of mistakes of A2 is at most\n\nE\u03c4[#mistakes(A2, \u03c3n)] \u2264 min_{f\u2208F} #mistakes(f, \u03c3n) + 1.5 n^{3/4} \u221a(d log n) + n^{3/4}.\n\nThe above implies the theorem.\n\nRemark 1. The above algorithm is still costly in the sense that we must re-run the batch error minimizer for each prediction we would like to make. 
Using an idea quite similar to the \u201cfollow the lazy leader\u201d algorithm in [6], we can achieve the same expected error while only needing to call M with probability n^{\u22121/4} on each example.\nRemark 2. The above analysis resembles previous analyses of hallucination algorithms. However, unlike previous analyses, there is no exponential distribution in the hallucination here, yet the bounds still depend only logarithmically on the number of labelings.\n\n5 Conclusions and open problems\n\nWe have given an algorithm for learning in the transductive online setting and established several results relating efficient proper batch and transductive online learnability. In the realizable case, however, we have not given a computationally efficient algorithm. Hence, it is an open question as to whether efficient learnability in the batch and transductive online settings are the same in the realizable case. In addition, our computationally efficient algorithm requires polynomially more examples than its inefficient counterpart. It would be nice to have the best of both worlds, namely a computationally efficient algorithm that achieves a number of mistakes that is at most O(\u221a(dn log n)). Additionally, it would be nice to remove the restriction to proper algorithms.\nAcknowledgements. We would like to thank Maria-Florina Balcan, Dean Foster, John Langford, and David McAllester for helpful discussions.\n\nReferences\n[1] B. Awerbuch and R. Kleinberg. Adaptive routing with end-to-end feedback: Distributed learning and geometric approaches. In Proc. of the 36th ACM Symposium on Theory of Computing, 2004.\n[2] S. Ben-David, E. Kushilevitz, and Y. Mansour. Online learning versus offline learning. Machine Learning, 29:45-63, 1997.\n[3] A. Blum. Separating Distribution-Free and Mistake-Bound Learning Models over the Boolean Domain. SIAM Journal on Computing, 23(5):990-1000, 1994.\n[4] A. Blum, J. Hartline. 
Near-Optimal Online Auctions. In Proceedings of the Sixteenth Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), 2005.\n[5] J. Hannan. Approximation to Bayes Risk in Repeated Plays. In M. Dresher, A. Tucker, and P. Wolfe, editors, Contributions to the Theory of Games, Volume 3, p. 97-139, Princeton University Press, 1957.\n[6] A. Kalai and S. Vempala. Efficient algorithms for the online decision problem. In Proceedings of the 16th Conference on Computational Learning Theory, 2003.\n[7] M. Kearns, R. Schapire, and L. Sellie. Toward Efficient Agnostic Learning. Machine Learning, 17(2/3):115-141, 1994.\n[8] N. Littlestone. From On-Line to Batch Learning. In Proceedings of the 2nd Workshop on Computational Learning Theory, p. 269-284, 1989.\n[9] N. Littlestone and M. Warmuth. The Weighted Majority Algorithm. Information and Computation, 108:212-261, 1994.\n[10] H. Brendan McMahan and Avrim Blum. Online Geometric Optimization in the Bandit Setting Against an Adaptive Adversary. In Proceedings of the 17th Annual Conference on Learning Theory, COLT 2004.\n[11] N. Sauer. On the Densities of Families of Sets. Journal of Combinatorial Theory, Series A, 13, p. 145-147, 1972.\n[12] V. N. Vapnik. Estimation of Dependencies Based on Empirical Data, New York: Springer Verlag, 1982.\n[13] V. N. Vapnik. Statistical Learning Theory, New York: Wiley Interscience, 1998.\n", "award": [], "sourceid": 2755, "authors": [{"given_name": "Sham", "family_name": "Kakade", "institution": null}, {"given_name": "Adam", "family_name": "Kalai", "institution": null}]}