{"title": "Trading off Mistakes and Don't-Know Predictions", "book": "Advances in Neural Information Processing Systems", "page_first": 2092, "page_last": 2100, "abstract": "We discuss an online learning framework in which the agent is allowed to say ``I don't know'' as well as making incorrect predictions on given examples. We analyze the trade off between saying ``I don't know'' and making mistakes. If the number of don't know predictions is forced to be zero, the model reduces to the well-known mistake-bound model introduced by Littlestone [Lit88]. On the other hand, if no mistakes are allowed, the model reduces to KWIK framework introduced by Li et. al. [LLW08]. We propose a general, though inefficient, algorithm for general finite concept classes that minimizes the number of don't-know predictions if a certain number of mistakes are allowed. We then present specific polynomial-time algorithms for the concept classes of monotone disjunctions and linear separators.", "full_text": "Trading off Mistakes and Don\u2019t-Know Predictions\n\nAmin Sayedi\u2217\n\nMorteza Zadimoghaddam\u2020\n\nAvrim Blum\u2021\n\nTepper School of Business\n\nCMU\n\nCSAIL\nMIT\n\nPittsburgh, PA 15213\nssayedir@cmu.edu\n\nCambridge, MA 02139\nmorteza@mit.edu\n\nDepartment of Computer Science\n\nCMU\n\nPittsburgh, PA 15213\navrim@cs.cmu.edu\n\nAbstract\n\nWe discuss an online learning framework in which the agent is allowed to say \u201cI\ndon\u2019t know\u201d as well as making incorrect predictions on given examples. We an-\nalyze the trade off between saying \u201cI don\u2019t know\u201d and making mistakes. If the\nnumber of don\u2019t-know predictions is required to be zero, the model reduces to\nthe well-known mistake-bound model introduced by Littlestone [Lit88]. On the\nother hand, if no mistakes are allowed, the model reduces to KWIK framework\nintroduced by Li et. al. [LLW08]. 
We propose a general, though inefficient, algorithm for general finite concept classes that minimizes the number of don’t-know predictions subject to a given bound on the number of allowed mistakes. We then present specific polynomial-time algorithms for the concept classes of monotone disjunctions and linear separators with a margin.\n\n1 Introduction\n\nMotivated by [KS02, KK99] among others, Li, Littman and Walsh [LLW08] introduced the KWIK framework for online learning, standing for “knows what it knows.” Roughly stated, in the KWIK model the learning algorithm is required to make only accurate predictions, although it can opt out of a prediction by saying “I don’t know” (⊥). After predicting (or answering ⊥), it is told the correct answer. The algorithm is not allowed to make any mistakes; still, it learns from the examples on which it answers ⊥. The goal of the algorithm is to minimize the number of examples on which it answers ⊥. Several aspects of the model are discussed in [LLW08], and there are many other papers, including [WSDL, DLL09, SL08], that use the framework. It is worth mentioning that the idea of forcing the algorithm to say “I don’t know” instead of making a mistake also appeared in earlier work such as [RS88], where it is referred to as reliable learning.\n\nGenerally, it is highly desirable to have an algorithm that learns a concept in the KWIK framework using a small, or at least polynomial, number of ⊥s. Unfortunately, for many concepts no such algorithm exists. In fact, it turns out that even for many basic classes that are very easy to learn in the Mistake-bound model [Lit88], e.g. the class of singletons or disjunctions, any KWIK algorithm needs to say ⊥ exponentially many times. The purpose of our paper is to relax the assumption of not making any mistakes, by allowing a few mistakes, in order to get much better bounds on the number of ⊥s. 
Or, in the other direction, our aim is to produce algorithms that can make substantially fewer mistakes than in the standard Mistake-bound model, by trading off some of those mistakes for (presumably less costly) don’t-know predictions.\n\nIn [LLW08], the authors show, through a non-polynomial-time enumeration algorithm, that a finite class H of functions can be learned in the KWIK framework with at most |H| − 1 ⊥s. We show that if only one mistake is allowed, that number can be reduced to √(2|H|). Furthermore, we show that the problem is equivalent to the famous egg-dropping puzzle, defined formally in Section 2, hence obtaining a bound of (k + 1)|H|^(1/(k+1)) when k mistakes are allowed. Our algorithm does not in general run in polynomial time in the description length of the target function, since its running time depends on |H|; however, we propose polynomial-time versions of our algorithm for two important classes: monotone disjunctions and linear separators.\n\n∗Part of this work was done when the author was an intern at Microsoft Research New England, MA.\n†Part of this work was done when the author was an intern at Microsoft Research Cambridge, UK.\n‡This work was supported in part by NSF grant CCF-0830540.\n\nAllowing the algorithm to make mistakes in the KWIK model is equivalent to allowing the algorithm to say “I don’t know” in the Mistake-bound model introduced in [Lit88]. In fact, one way of looking at the algorithms presented in Section 3 is that we want to decrease the number of mistakes in the Mistake-bound model by allowing the algorithm to say ⊥. The rest of the paper is structured as follows. First we define the model and describe the limits of the KWIK model. Then, in Section 2, we describe how the bounds on the number of ⊥s change if we allow a few mistakes in the KWIK model. 
Finally, we give two polynomial-time algorithms for the important classes of monotone disjunctions and linear separators with a margin in Section 3.\n\n1.1 Model\n\nWe want to learn a concept class H consisting of functions f : X → {+, −}. In each stage, the algorithm is given an example x ∈ X and is asked to predict h∗(x), where the target function h∗ ∈ H. The algorithm may answer +, −, or ⊥, the last representing “I don’t know.” After the prediction, even if it is ⊥, the value of h∗(x) is revealed to the algorithm. For a given integer k, we want to design an algorithm such that, for any sequence of examples, the number of times M that it makes a mistake is at most k, and the number of times I that it answers ⊥ is minimized. For example, the special case k = 0 is equivalent to the KWIK framework. Also, if k ≥ log(|H|), the majority-vote algorithm can learn the class with no ⊥ responses, i.e. I = 0.\n\nSince we want to derive worst-case bounds, we assume that the sequence of examples, as well as the target function h∗, is selected by an adversary. The adversary sends the examples one by one. For each example x ∈ X, our algorithm decides what to answer; then, the adversary reveals h∗(x).\n\n1.2 The KWIK Model\n\nAlthough the idea of the KWIK framework is quite useful, there are very few problems that can be solved effectively in this framework. The following example demonstrates how an easy problem in the Mistake-bound model can turn into a hard problem in the KWIK model.\n\nExample 1 Suppose that H is the class of singletons. In other words, for hi ∈ H, where hi : {0, 1}^n → {−, +}, we have hi(x) = + if x is the binary representation of i, and hi(x) = − otherwise. Class H can be learned in the Mistake-bound model with a mistake bound of only 1: the algorithm simply predicts − on all examples until it makes a mistake. 
As soon as the algorithm makes a mistake, it can easily figure out what the target function is. However, class H needs exponentially many ⊥’s to be learned in the KWIK framework. Since the algorithm does not know the answer until it has seen its first positive example, it must keep answering ⊥ on all examples that it has not seen yet. Therefore, in the worst case, it answers ⊥ and finds out that the answer is − on each of the first 2^n − 1 examples that it sees. The situation in Example 1 arises for many other classes of functions as well, e.g. conjunctions or disjunctions.\n\nNext, we review an algorithm (called the enumeration algorithm in [LLW08]) for solving problems in the KWIK framework. This algorithm is the main ingredient of most of the algorithms proposed in [LLW08].\n\nAlgorithm 1 Enumeration\n\nThe algorithm looks at all the functions in class H; if they all agree on the label of the current example x ∈ X, the algorithm outputs that label, otherwise the algorithm outputs ⊥. Upon receiving the true label of x, the algorithm removes from H those functions h that answered incorrectly on x and continues to the next example. Note that at least one function gets removed from H each time the algorithm answers ⊥; therefore, the algorithm finds the target function with at most |H| − 1 ⊥’s.\n\n2 The KWIK Model with Mistakes\n\nExample 1 shows how hard it can be to learn in the KWIK model. To address this, we give the following relaxation of the framework that allows concepts to be learned much more effectively while preserving the original motivation of the KWIK model: it is better to say “I don’t know” than to make a mistake.\n\nSpecifically, we allow the algorithm to make at most k mistakes. 
Even for very small values of k, this can allow us to get much better bounds on the number of times that the algorithm answers ⊥. For example, letting k = 1, i.e. allowing one mistake, decreases the number of ⊥’s from 2^n − 1 to 0 for the class of singletons. Of course, this case is not so interesting, since 1 is the mistake bound for the class. Our main interest is the case where k > 0 and yet k is much smaller than the number of mistakes needed to learn in the pure Mistake-bound model. We saw, in Algorithm 1, how to learn a concept class H with no mistakes and O(|H|) ⊥’s. In many cases, O(|H|) is tight; in fact, if for every subset S ⊆ H with |S| > 1 there exists some x ∈ X for which |{h ∈ S | h(x) = +}| ∈ {1, |S| − 1}, then the bound is tight. This condition, for example, is satisfied by the class of intervals: that is, H = {[0, a] : a ∈ {0, 1, 2, . . . , 2^n − 1}}. However, if we allow the algorithm to make one mistake, we show that the number of ⊥’s can be reduced to O(√|H|). In general, if k mistakes are allowed, there is an algorithm that can learn any class H with at most (k + 1)|H|^(1/(k+1)) ⊥’s. The algorithm is similar to the one for the classic “egg game” puzzle (see [GF]). First suppose that k = 1. We make a pool of all functions in H, initially consisting of |H| candidates. Whenever an example arrives, we see how many of the candidates label it + and how many label it −. If the population of the minority is < √|H|, we predict the label that the majority gives on the example; however, if the population of the minority is ≥ √|H|, we say ⊥. Those functions that predict incorrectly on an example are removed from the pool in each step, so the pool is just the current version space. If we make a mistake in some step, the size of the version space is reduced to < √|H|. 
Hence, using Algorithm 1, we can complete the learning with at most √|H| additional ⊥’s after our first mistake. Furthermore, note that before making any mistake, we remove at least √|H| of the functions from the pool each time we answer ⊥. Therefore, the total number of ⊥’s cannot exceed 2√|H|. This technique can be generalized to k mistakes, but first we mention a connection between this problem and the classic “egg game” puzzle.\n\nExample 2 Egg Game Puzzle\n\nYou are given 2 identical eggs, and you have access to an n-story building. The eggs can be very hard or very fragile or anywhere in between: they may break if dropped from the first floor, or may not break even if dropped from the n-th floor. You need to figure out the highest floor from which an egg can be dropped without breaking. The question is how many drops you need to make, given that you can break only two eggs in the process. The answer to this puzzle is √(2n) up to an additive constant; in fact, a thorough analysis of the puzzle when k eggs are available, instead of just two, is given in [GF]. The ⊥-minimization problem when k mistakes are allowed is clearly related to the egg game puzzle when the building has |H| floors and k + 1 eggs are available. As a result, with a slightly smarter algorithm that adjusts the threshold √|H| recursively each time an example arrives, we can decrease the number of ⊥s from 2√|H| to √(2|H|).\n\nAlgorithm 2 Learning in the KWIK Model with at most k Mistakes\n\nLet s = |H|^(k/(k+1)), and let P denote the current version space: the pool of all functions that might still be the target. Initially P = H, but during the learning process we remove functions from P. For each example that arrives, examine how many functions in P label it + and how many label it −. 
If the minority population is > s, we answer ⊥; otherwise, we answer the majority prediction. At the end of each step, we remove from P the functions that made a mistake in the last step. Whenever we make a mistake, we update s = |P|^((k−i)/(k+1−i)), where i is the number of mistakes we have made so far.\n\nProposition 1 Algorithm 2 learns a concept class H with at most k mistakes and (k + 1)|H|^(1/(k+1)) “I don’t know”s.\n\nProof: After the first mistake, the size of the pool is reduced to < |H|^(k/(k+1)). Hence, using induction, we can argue that after the first mistake the learning can be completed with k − 1 mistakes and k(|H|^(k/(k+1)))^(1/k) = k|H|^(1/(k+1)) “I don’t know”s. Moreover, since each ⊥ removes more than s = |H|^(k/(k+1)) functions from the pool, there can be at most |H|/s = |H|^(1/(k+1)) ⊥’s before the first mistake. Therefore, the total number of ⊥’s will not exceed |H|^(1/(k+1)) + k(|H|^(k/(k+1)))^(1/k) = (k + 1)|H|^(1/(k+1)). □\n\nBefore moving to the next section, we should mention that Algorithm 2 is not computationally efficient. In particular, if H contains exponentially many functions in the natural parameters of the problem, which is often the case, the running time of Algorithm 2 becomes exponential. In the next section, we give polynomial-time algorithms for two important concept classes.\n\n3 The Mistake-bound Model with “I don’t know” Predictions\n\nWe can look at the problem from another perspective: instead of adding mistakes to the KWIK framework, we can add “I don’t know” to the Mistake-bound model. In many cases, we prefer that our algorithm say “I don’t know” rather than make a mistake. Therefore, in this section, we try to improve over optimal mistake bounds by allowing the algorithm to use a modest number of ⊥’s, and more generally to consider the trade-off between the number of mistakes and the number of ⊥’s. 
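To make the preceding discussion concrete, the version-space scheme of Algorithm 2 (Section 2) can be sketched in a few lines of Python for a finite class that is small enough to enumerate explicitly. This is a minimal sketch, not the paper's implementation: the names are ours, hypotheses are represented as callables returning '+' or '-', and '?' stands for ⊥.

```python
def kwik_with_mistakes(hypotheses, k, stream, target):
    """Version-space sketch of Algorithm 2: answer the majority label of
    the pool unless the minority side is larger than the threshold s,
    in which case answer '?' ("I don't know")."""
    pool = list(hypotheses)                      # current version space P
    mistakes = dont_knows = 0
    # initial threshold s = |H|^(k/(k+1)); with no mistake budget we must
    # behave like Algorithm 1 (enumeration), i.e. s = 0
    s = len(pool) ** (k / (k + 1)) if k > 0 else 0.0
    for x in stream:
        plus = [h for h in pool if h(x) == '+']
        minus = [h for h in pool if h(x) == '-']
        if min(len(plus), len(minus)) > s:
            answer = '?'
            dont_knows += 1
        else:
            answer = '+' if len(plus) >= len(minus) else '-'
        truth = target(x)
        pool = plus if truth == '+' else minus   # keep consistent hypotheses
        if answer != '?' and answer != truth:
            mistakes += 1
            # after the i-th mistake, shrink the threshold to
            # |P|^((k-i)/(k+1-i)); once the budget is spent, fall back
            # to pure enumeration (s = 0)
            i = mistakes
            s = len(pool) ** ((k - i) / (k + 1 - i)) if i < k else 0.0
    return mistakes, dont_knows
```

On the singleton class of Example 1 with k = 1, this sketch makes one mistake and no ⊥'s, matching the discussion in Section 2.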
Note that an algorithm can always replace its ⊥’s with random +’s and −’s; therefore, we must expect that decreasing the number of mistakes by one requires increasing the number of ⊥’s by at least one.\n\n3.1 Monotone Disjunctions\n\nWe start with the concept class of monotone disjunctions. A monotone disjunction is a disjunction in which no literal appears negated, that is, a function of the form\n\nf(x1, . . . , xn) = xi1 ∨ xi2 ∨ . . . ∨ xik.\n\nEach example is a boolean vector of length n, and an example is labeled + if and only if at least one of the variables that belong to the target function is set to 1 in the example. We know that this class can be learned with at most n mistakes in the Mistake-bound model [Lit88], where n is the total number of variables. This class is particularly interesting because results derived about monotone disjunctions can be applied to other classes as well, such as general disjunctions, conjunctions, and k-DNF formulas. We are interested in decreasing the number of mistakes at the cost of having (hopefully few) ⊥’s.\n\nFirst, let’s not worry about the running time and see how well Algorithm 2 performs here. We have |H| = 2^n; if we let k = n/i, the bound that we get on the number of ⊥’s is ≃ n2^i/i; this is not bad, especially for small i, e.g. i = 2, 3. In fact, for the case i = 2, we are trading off each mistake for four “I don’t know”s. But unfortunately, Algorithm 2 cannot do this in polynomial time. Our next goal is to design an algorithm that runs in polynomial time and guarantees the same good bounds on the number of ⊥’s.\n\nAlgorithm 3 Learning Monotone Disjunctions with at most n/2 Mistakes\n\nLet P, P+ and P− be three pools of variables. Initially, P = {x1, . . . , xn} and P+ = P− = φ. During the process of learning, variables will be moved from P to P− or P+. 
The pool P+ is the set of variables that we know must occur in the target function; the pool P− is the set of variables that we know cannot occur in the target function. The learning process finishes by the time P becomes empty.\n\nIn each step, an example x arrives. Let S ⊆ {x1, . . . , xn} be the set representation of x, i.e., xi ∈ S if and only if x[i] = 1. If S ∩ P+ ≠ φ, we can say for sure that the example is +. If S ⊆ P−, we can say for sure that the example is −. Otherwise, it must be the case that S ∩ P ≠ φ, and we cannot be sure about our prediction. Here, if |S ∩ P| ≥ 2 we answer +; otherwise, i.e. if |S ∩ P| = 1, we answer ⊥. If we make a mistake, we move S ∩ P to P−. Every time we answer ⊥, we move S ∩ P to P+ or P−, depending on the correct label of the example.\n\nProposition 2 Algorithm 3 learns the class of Monotone Disjunctions with at most M ≤ n/2 mistakes and n − 2M ⊥s.\n\nProof: If we make a mistake, it must be the case that the answer was negative while we answered positive; for this to happen, we must have |S ∩ P| ≥ 2. So, after a mistake, we can move S ∩ P to P−, and the size of P therefore decreases by at least 2. Every time we say ⊥, it must be the case that |S ∩ P| = 1. Therefore, the label of the example is positive iff S ∩ P is contained in the target function, and so the algorithm correctly moves S ∩ P to P+ or P−. Additionally, the size of P decreases by at least one on each ⊥ prediction. □\n\nAlgorithm 3, although very simple, has an interesting property: if, in an online learning setting, saying ⊥ is cheaper than making a mistake, then Algorithm 3 strictly dominates the best algorithm in the Mistake-bound model. Note that the sum of its ⊥s and its mistakes is never more than n. 
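As an illustration, Algorithm 3 admits a short Python sketch. This is a hypothetical rendering with our own names; the target disjunction is used only as the oracle that reveals the true label after each prediction, and each example is given as the set of indices whose bits are 1.

```python
def learn_monotone_disjunction(n, stream, target_vars):
    """Sketch of Algorithm 3: P holds undetermined variables, P_plus the
    variables known to be in the target disjunction, P_minus those known
    not to be."""
    P = set(range(n))
    P_plus, P_minus = set(), set()
    mistakes = dont_knows = 0
    for S in stream:
        if S & P_plus:
            answer = '+'              # certainly positive
        elif S <= P_minus:
            answer = '-'              # certainly negative
        elif len(S & P) >= 2:
            answer = '+'              # guess; a mistake resolves >= 2 vars
        else:                         # |S ∩ P| == 1: answer "I don't know"
            answer = '?'
            dont_knows += 1
        truth = '+' if S & target_vars else '-'
        if answer == '+' and truth == '-':
            mistakes += 1
            P_minus |= S & P          # every undetermined var in S is out
            P -= S
        elif answer == '?':
            # the single variable in S ∩ P is resolved by the true label
            (P_plus if truth == '+' else P_minus).update(S & P)
            P -= S
    return mistakes, dont_knows
```

Every mistake moves at least two variables out of P and every ⊥ moves at least one, which is exactly the counting argument behind Proposition 2.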
More precisely, if the cost of making a mistake is 1 and the cost of saying ⊥ is < 1, the worst-case cost of this algorithm is strictly smaller than n. Next, we present an algorithm for decreasing the number of mistakes to n/3.\n\nAlgorithm 4 Learning Monotone Disjunctions with at most n/3 Mistakes\n\nLet P, P+, P− be defined as in Algorithm 3. We have another pool P′, which consists of pairs of variables such that, for each pair, we know at least one of the two variables belongs to the target function. As before, the pools form a partition of the set of all variables; in addition, a variable can belong to at most one pair in P′. Thus, any given variable is either in a single pair of P′ or else in exactly one of the sets P, P+, or P−.\n\nWhenever an example x arrives, we do the following. Let S ⊆ {x1, . . . , xn} be the set representation of x, i.e. xi ∈ S if and only if x[i] = 1. If S ∩ P+ ≠ φ, we answer +. If S ⊆ P−, we answer −. Also, if S contains both members of a pair in P′, we can say that the label is +. If none of the above cases happen, we cannot be sure about our prediction. In this case, if |S ∩ P| ≥ 3, we answer +. If |S ∩ (P ∪ P′)| ≥ 2 and |S ∩ P′| ≥ 1, we again answer +. Otherwise, we answer ⊥. A description of how the algorithm moves variables between sets upon receipt of the correct label is given in the proof below.\n\nProposition 3 Algorithm 4 learns the class of Monotone Disjunctions with at most M ≤ n/3 mistakes and 3n/2 − 3M ⊥’s.\n\nProof: If |S ∩ P| ≥ 3 and we make a mistake on S, then the size of P is reduced by at least 3, and the size of P− increases by at least 3. 
If |S ∩ (P ∪ P′)| ≥ 2 and |S ∩ P′| ≥ 1 and we make a mistake on S, then at least two variables are moved from P ∪ P′ to P−, and at least one variable is moved from P′ to P+ (since whenever a variable moves from P′ to P−, the other variable in its pair must move to P+). Therefore, the size of P− ∪ P+ increases by at least 3. Since |P− ∪ P+| ≤ n, we do not make more than n/3 mistakes.\n\nThere are three cases in which we may answer ⊥. If |S ∩ P| = 0 and |S ∩ P′| = 1, we answer ⊥; after learning the correct label, S ∩ P′ is moved to P+ or P−. Therefore, the number of “I don’t know”s of this type is bounded by n − 3M. If |S ∩ P| = 1 and |S ∩ P′| = 0, then again, after learning the correct label, S ∩ P is moved to P+ or P−, so the same bound applies. If |S ∩ P| = 2 and |S ∩ P′| = 0, the correct label might be + or −. If it is negative, then we can move S ∩ P to P− and use the same bound as before. If it is positive, the two variables in S ∩ P are moved to P′ as a pair. Note that there can be at most n/2 ⊥’s of this last kind; therefore, the total number of ⊥’s cannot exceed n/2 + n − 3M. □\n\n3.2 Learning Linear Separator Functions\n\nIn this section, we analyze how we can use ⊥ predictions to decrease the number of mistakes for efficiently learning linear separators with margin γ. 
The high-level idea is to use the basic approach of the generic algorithm of Section 2 for finite H, but rather than explicitly enumerating over functions, to efficiently estimate the measure of the functions in the version space that predict + versus those that predict −, and to make prediction decisions based on the result.\n\nSetting: We assume a sequence S of n d-dimensional examples arrives one by one, and that these examples are linearly separable: there exists a unit-length separator vector w∗ such that w∗ · x > 0 if and only if x is a positive example. Define γ to be min_{x∈S} |w∗ · x|/|x|. For convenience, we assume that all examples have unit length.\n\nBelow, we show how to formulate the problem as a linear program in order to bound the number of mistakes using some “I don’t know” answers. Assume that the n examples x1, x2, . . . , xn of the sequence S arrive one at a time, and we have to answer when each point arrives. The objective is to make a small number of mistakes and some “I don’t know” answers so as to find a separator vector w such that w · xi is positive if and only if xi is a + point. We can formulate the following linear program for this instance (this sequence of points):\n\nw · xi > 0 if xi is a + instance, and\nw · xi ≤ 0 if xi is a − instance.\n\nNote that there are d variables, namely the coordinates of the vector w, and n linear constraints, one per input point. Clearly we do not know in advance which points are the + points, so we cannot write this linear program explicitly and solve it; rather, the points arrive one by one and the constraints of the program are revealed over time. Note that if a vector w is a feasible solution of the above linear program, any positive multiple of w is also a feasible solution. 
In order to make the analysis easier and to bound the core (the set of feasible solutions of the linear program), we can assume that the coordinates of the vector w are always in the range [−1 − γ/√d, 1 + γ/√d]. We can add 2d linear constraints to make sure that the coordinates do not violate these properties; we will see later why we choose the bounds to be −(1 + γ/√d) and 1 + γ/√d.\n\nNow assume that we are at the beginning and no point has arrived, so we have none of the n constraints related to the points. The core of the linear program is initially the set of vectors w in [−1 − γ/√d, 1 + γ/√d]^d, so we start with a core (feasible set) of volume (2 + 2γ/√d)^d. For now, assume that we cannot use “I don’t know” answers; we show how to use them later. The first point arrives. There are two possibilities for this point: it is either a + point or a − point. If we add either of these two constraints to the linear program, we obtain a more restricted linear program with a core of smaller volume. So we obtain one LP for each of the two possibilities, and the sum of the volumes of the cores of these two linear programs is equal to the volume of the core of our current linear program. We will show how to compute these volumes, but for now assume that they are computed. If the volume of the core for the + case is larger than that for the − case, we answer +. If our answer is correct, we are fine, and we have passed the query with no mistake; otherwise we have made a mistake, but the volume of the core of our linear program has been at least halved. We do the same for the − case, i.e. we answer − when the larger volume is for the − case.\n\nNow there are two main issues we have to deal with. First, we have to find a way to compute the volume of the core of a linear program. 
Secondly, we have to find a way to bound the number of mistakes.\n\nIn fact, computing the volume of a linear program is #P-hard [DF88]. There exists a randomized polynomial-time algorithm that approximates the volume of the core of a linear program within a factor (1 + ε), i.e. with relative error ε [DFK91]; its running time is polynomial in n, d, and 1/ε. We could use this algorithm to estimate the volumes of the linear programs we need. But note that we do not really need to know these volumes; we just need to know whether the volume for the + case is larger, the volume for the − case is larger, or they are approximately equal. Lovasz and Vempala [LV06] present a faster polynomial-time algorithm for sampling a uniformly random point from the core of a linear program. One way to estimate the relative volumes of the two sides is to sample a uniformly random point from the core of our current linear program (without taking into account the newly arrived point), and see whether the sampled point is on the + side or the − side. If we sample a sufficiently large number of points (here 2 log(n)/ε^2 is large enough), and the majority of them are on the + (respectively −) side, we can say that the volume for the + (−) case is at least a 1/2 − ε fraction of our current core, with high probability. So we can answer based on the majority of these sampled points, and if we make a mistake, we know that the volume of the core of the linear program is multiplied by at most 1 − (1/2 − ε) = 1/2 + ε.\n\nSuppose we have already processed the first l examples and now the (l + 1)-st example arrives. We have the linear program with the first l constraints. We sample points from the core of this linear program, and based on the majority of them we answer + or − for this new example. 
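The prediction step just described can be illustrated with a toy Python sketch. Instead of the Lovasz-Vempala sampler used by the algorithm, it draws consistent separators by naive rejection sampling from the bounding cube, which is practical only in very low dimension; all names are ours, and the fixed sample size m stands in for the 2 log(n)/ε^2 of the analysis.

```python
import random

def sample_consistent(history, d, bound, tries=100000):
    """Draw one w uniformly from the cube [-bound, bound]^d, conditioned
    on being consistent with the labeled examples seen so far.
    history is a list of (example, label) pairs."""
    for _ in range(tries):
        w = [random.uniform(-bound, bound) for _ in range(d)]
        ok = all((sum(wi * xi for wi, xi in zip(w, x)) > 0) == (lab == '+')
                 for x, lab in history)
        if ok:
            return w
    raise RuntimeError("rejection sampling failed; version space too small")

def predict(history, x, d, bound, m=50):
    """Answer '+' or '-' according to the majority vote of m separators
    sampled from the current version space."""
    votes = 0
    for _ in range(m):
        w = sample_consistent(history, d, bound)
        votes += 1 if sum(wi * xi for wi, xi in zip(w, x)) > 0 else -1
    return '+' if votes >= 0 else '-'
```

When the vote goes the wrong way, at least a 1/2 − ε fraction of the version space is eliminated, which is the event Lemma 4 below makes precise.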
Using the following lemma, we can bound the number of mistakes.\n\nLemma 4 With high probability (1 − 1/n^Ω(1)), for every mistake we make in our algorithm, the volume of the core of the linear program decreases by a factor of (1/2 + ε).\n\nProof: Without loss of generality, assume that we answered + but the correct answer was −. So we sampled 2 log n/ε^2 functions uniformly at random from the core, and the majority of them predicted positive. If less than a 1/2 − ε fraction of the volume was indeed predicting positive, each sampled point would be from the positive-predicting part with probability less than 1/2 − ε, so the expected number of positive sampled points would be less than (1/2 − ε)(2 log n/ε^2) = log n/ε^2 − 2 log n/ε. Therefore, by Chernoff bounds, the chance of the sample having a majority of positive-predicting functions would be at most e^(−(2 log n/ε)^2/(2(log n/ε^2 − 2 log n/ε))) = e^(−2 log n/(1−2ε)) ≤ n^(−2/(1−ε)). Since there are n examples arriving, we can use the union bound to bound the probability of failure on any of these rounds: the probability that the volume of the core of our linear program is not multiplied by at most 1/2 + ε on some mistake is at most n × n^(−2/(1−ε)) ≤ 1/n^(1/(1−ε)). Therefore, with high probability (at least 1 − 1/n^(1/(1−ε))), for every mistake we make, the volume of the core is multiplied by at most 1/2 + ε. □\n\n
Now we show that the core of the linear program after adding all n constraints of the points has a decent volume in terms of γ.\n\nLemma 5 If there is a unit-length separator vector w∗ with min_{x∈S} |w∗ · x|/|x| = γ, then the core of the complete linear program, after adding all n constraints of the points, has volume at least (2γ/√d)^d.\n\nProof: Clearly w∗ is in the core of our linear program. Consider any vector w′ all of whose coordinates are in the range (−γ/√d, γ/√d). We claim that w∗ + w′ is a correct separator. Consider a point xi, and assume without loss of generality that it is a + point, so w∗ · xi is at least γ · |xi|. We also know that w′ · xi is at least −|w′| · |xi|. Since all d coordinates of w′ are in the range (−γ/√d, γ/√d), we can say that |w′| is less than γ. So (w∗ + w′) · xi = w∗ · xi + w′ · xi > γ|xi| − γ|xi| = 0, i.e. it is positive. We also know that the coordinates of w∗ + w′ are in the range (−1 − γ/√d, 1 + γ/√d), because w∗ has unit length (so all its coordinates are between −1 and 1) and the coordinates of w′ are in the range (−γ/√d, γ/√d). Therefore all vectors of the form w∗ + w′ are in the core, and we conclude that the volume of the core is at least (2γ/√d)^d. □\n\nLemmas 4 and 5 give us the following folklore theorem.\n\nTheorem 6 The total number of mistakes in the above algorithm is not more than log_(2/(1+ε)) [(2 + 2γ/√d)^d/(2γ/√d)^d] = log_(2/(1+ε)) [(1 + γ/√d)^d/(γ/√d)^d] = O(d(log d + log(1/γ))).\n\nProof: The proof follows easily from Lemmas 4 and 5. □\n\n
Now we make use of the "I don't know" answers to reduce the number of mistakes. Assume that we do not want to make more than k mistakes. Define Y1 to be (2 + 2γ/√d)^d, which is the volume of the core at the beginning, before adding any of the constraints of the points. Define Y2 to be (2γ/√d)^d, which is a lower bound on the volume of the core after adding all the constraints of the points. Let R be the ratio Y1/Y2. In the above algorithm, we do not make more than log_{2/(1+ǫ)} R mistakes.

We want to use "I don't know" answers to reduce this number of mistakes. Define C to be R^{1/k}. Let V, V1, and V2 be the volumes of the cores of the current linear program, the linear program with the additional constraint that the new point is a + point, and the linear program with the additional constraint that the new point is a − point, respectively. If V1/V is at most 1/C, we can say that the new point is a − point. In this case, even if we make a mistake, the volume of the core is divided by at least C, and by the definition of C this cannot happen more than log_C R = k times. Similarly, if V2/V is at most 1/C, we can say that the new point is a + point. If V1/V and V2/V are both greater than 1/C, we answer "I don't know", and we know that the volume of the core is multiplied by at most 1 − 1/C.

Since we just need to estimate the ratios V1/V and V2/V (in fact, we only want to test whether either of them is smaller than 1/C), we can simply sample points from the core of our current linear program. But we have to sample at least O(C log n) points to get reasonable estimates with high probability for these two specific tests (whether V1/V or V2/V is at least 1/C).
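The decision rule above can be sketched as follows. This is our own sketch, not the paper's code: the function name and the concrete sample size 16·C·log n and threshold 8·log n are one instantiation of the O(C log n) sampling test, and the sampled separators' predictions are assumed to be given as counts.

```python
import math

def predict(n_plus, n_minus, n, C):
    """Sketch of the rule above: given how many of the ~16*C*log(n) separators
    sampled from the core label the new point + or -, answer +, -, or
    "I don't know"."""
    threshold = 8 * math.log(n)
    if n_plus <= threshold:   # evidence that V1/V <= 1/C: safe to answer -
        return "-"
    if n_minus <= threshold:  # evidence that V2/V <= 1/C: safe to answer +
        return "+"
    return "dont_know"        # both parts of the core are non-negligible

n, C = 100, 4.0                           # arbitrary example parameters
m = math.ceil(16 * C * math.log(n))       # total number of sampled separators
print(predict(m - 5, 5, n, C))            # nearly all sampled separators say +
print(predict(m // 2, m - m // 2, n, C))  # balanced sample: abstain
```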
For example, if we sample 16C log n points and there are at most 8 log n + points among them, we can say that V1/V is at most 1/C with probability at least 1 − e^{−64 log^2 n/(32 log n)} = 1 − 1/n^2. But if there are at least 8 log n + points and 8 log n − points among the samples, we can say that both V1/V and V2/V are at least 1/(8C) with high probability, using Chernoff bounds.

If we make a mistake in this algorithm, the volume of the core is divided by at least C, so we do not make more than k mistakes. We also know that for each "I don't know" answer the volume of the core is multiplied by at most 1 − 1/(8C), so after 8C "I don't know" answers the volume of the core is multiplied by at most 1/e. Therefore there are at most O(C log R) "I don't know" answers. This completes the proof of the following theorem.

Theorem 7 For any k > 0, we can learn a linear separator of margin γ in ℝ^d using the above algorithm with k mistakes and O(R^{1/k} × log R) "I don't know" answers, where R is equal to (1 + γ/√d)^d / (γ/√d)^d.

4 Conclusion

We have discussed a learning framework that combines elements of the KWIK and mistake-bound models. From one perspective, we are allowing the algorithm to make mistakes in the KWIK model. We showed, using a version-space algorithm and through a reduction to the egg-game puzzle, that allowing a few mistakes in the KWIK model can significantly decrease the number of don't-know predictions.

From another point of view, we are letting the algorithm say "I don't know" in the mistake-bound model.
This can be particularly useful if don't-know predictions are cheaper than mistakes and we can trade off some number of mistakes for a not-too-much-larger number of "I don't know"s. We gave polynomial-time algorithms that effectively reduce the number of mistakes in the mistake-bound model using don't-know predictions for two concept classes: monotone disjunctions and linear separators with a margin.

Acknowledgement

The authors are very grateful to Adam Kalai, Sham Kakade, and Nina Balcan, as well as the anonymous reviewers, for helpful discussions and comments.

References

[DF88] Martin E. Dyer and Alan M. Frieze. On the complexity of computing the volume of a polyhedron. SIAM J. Comput., 17(5):967–974, 1988.

[DFK91] Martin E. Dyer, Alan M. Frieze, and Ravi Kannan. A random polynomial time algorithm for approximating the volume of convex bodies. J. ACM, 38(1):1–17, 1991.

[DLL09] C. Diuk, L. Li, and B.R. Leffler. The adaptive k-meteorologists problem and its application to structure learning and feature selection in reinforcement learning. In Proceedings of the 26th Annual International Conference on Machine Learning, pages 249–256. ACM, 2009.

[GF] Gasarch and Fletcher. The Egg Game. www.cs.umd.edu/~gasarch/BLOGPAPERS/egg.pdf.

[KK99] M. Kearns and D. Koller. Efficient reinforcement learning in factored MDPs. In International Joint Conference on Artificial Intelligence, volume 16, pages 740–747, 1999.

[KS02] M. Kearns and S. Singh. Near-optimal reinforcement learning in polynomial time. Machine Learning, 49(2):209–232, 2002.

[Lit88] N. Littlestone. Learning quickly when irrelevant attributes abound: A new linear-threshold algorithm. Machine Learning, 2(4):285–318, 1988.

[LLW08] L. Li, M.L. Littman, and T.J. Walsh. Knows what it knows: a framework for self-aware learning. In Proceedings of the 25th International Conference on Machine Learning, pages 568–575. ACM, 2008.

[LV06] László Lovász and Santosh Vempala. Hit-and-run from a corner. SIAM J. Comput., 35(4):985–1005, 2006.

[RS88] R.L. Rivest and R. Sloan. Learning complicated concepts reliably and usefully. In Proceedings of AAAI-88, pages 635–639, 1988.

[SL08] A.L. Strehl and M.L. Littman. Online linear regression and its application to model-based reinforcement learning. Advances in Neural Information Processing Systems, 20, 2008.

[WSDL] T.J. Walsh, I. Szita, C. Diuk, and M.L. Littman. Exploring compact reinforcement-learning representations with linear regression. In Proceedings of the Twenty-Fifth Conference on Uncertainty in Artificial Intelligence (UAI-09), 2009.