{"title": "Learning large-margin halfspaces with more malicious noise", "book": "Advances in Neural Information Processing Systems", "page_first": 91, "page_last": 99, "abstract": "We describe a simple algorithm that runs in time  poly(n,1/gamma,1/eps) and learns an unknown n-dimensional  gamma-margin halfspace to accuracy 1-eps in the presence of  malicious noise, when the noise rate is allowed to be as high as  Theta(eps gamma sqrt(log(1/gamma))). Previous efficient  algorithms could only learn to accuracy 1-eps in the presence of  malicious noise of rate at most Theta(eps gamma).    Our algorithm does not work by optimizing a convex loss function.  We  show that no algorithm for learning gamma-margin halfspaces that  minimizes a convex proxy for misclassification error can tolerate  malicious noise at a rate greater than Theta(eps gamma); this may  partially explain why previous algorithms could not achieve the higher  noise tolerance of our new algorithm.", "full_text": "Learning large-margin halfspaces with more malicious noise\n\nPhilip M. Long\n\nGoogle\n\nplong@google.com\n\nRocco A. Servedio\nColumbia University\n\nrocco@cs.columbia.edu\n\nAbstract\n\nWe describe a simple algorithm that runs in time poly(n, 1/\u03b3, 1/\u03b5) and learns an\nunknown n-dimensional \u03b3-margin halfspace to accuracy 1 \u2212 \u03b5 in the presence of\nmalicious noise, when the noise rate is allowed to be as high as \u0398(\u03b5\u03b3\u221alog(1/\u03b3)).\nPrevious efficient algorithms could only learn to accuracy 1 \u2212 \u03b5 in the presence of\nmalicious noise of rate at most \u0398(\u03b5\u03b3).\nOur algorithm does not work by optimizing a convex loss function. 
We show that\nno algorithm for learning \u03b3-margin halfspaces that minimizes a convex proxy for\nmisclassification error can tolerate malicious noise at a rate greater than \u0398(\u03b5\u03b3);\nthis may partially explain why previous algorithms could not achieve the higher\nnoise tolerance of our new algorithm.\n\n1 Introduction\n\nLearning an unknown halfspace from labeled examples that satisfy a margin constraint (meaning that\nno example may lie too close to the separating hyperplane) is one of the oldest and most intensively\nstudied problems in machine learning, with research going back at least five decades to early seminal\nwork on the Perceptron algorithm [5, 26, 27].\nIn this paper we study the problem of learning an unknown \u03b3-margin halfspace in the model of\nProbably Approximately Correct (PAC) learning with malicious noise at rate \u03b7. More precisely, in\nthis learning scenario the target function is an unknown origin-centered halfspace f(x) = sign(w \u00b7\nx) over the domain Rn (we may assume w.l.o.g. that w is a unit vector). There is an unknown\ndistribution D over the unit ball Bn = {x \u2208 Rn : \u2016x\u2016_2 \u2264 1} which is guaranteed to put zero\nprobability mass on examples x that lie within Euclidean distance less than \u03b3 from the separating\nhyperplane w \u00b7 x = 0; in other words, every point x in the support of D satisfies |w \u00b7 x| \u2265 \u03b3. The\nlearner has access to a noisy example oracle EX\u03b7(f,D) which works as follows: when invoked,\nwith probability 1 \u2212 \u03b7 the oracle draws x from D and outputs the labeled example (x, f(x)) and\nwith probability \u03b7 the oracle outputs a \u201cnoisy\u201d labeled example which may be an arbitrary element\n(x\u2032, y) of Bn \u00d7 {\u22121, 1}. 
(It may be helpful to think of the noisy examples as being constructed by an\nomniscient and malevolent adversary who has full knowledge of the state of the learning algorithm\nand previous draws from the oracle. In particular, note that noisy examples need not satisfy the\nmargin constraint and can lie arbitrarily close to, or on, the hyperplane w \u00b7 x = 0.) The goal of\nthe learner is to output a hypothesis h : Rn \u2192 {\u22121, 1} which has high accuracy with respect to\nD: more precisely, with probability at least 1/2 (over the draws from D used to run the learner and\nany internal randomness of the learner) the hypothesis h must satisfy Pr_{x\u223cD}[h(x) \u2260 f(x)] \u2264 \u03b5.\n(Because a success probability can be improved efficiently using standard repeat-and-test techniques\n[19], we follow the common practice of excluding this success probability from our analysis.) In\nparticular, we are interested in computationally efficient learning algorithms which have running\ntime poly(n, 1/\u03b3, 1/\u03b5).\nIntroduced by Valiant in 1985 [30], the malicious noise model is a challenging one, as witnessed by\nthe fact that learning algorithms can typically only withstand relatively low levels of malicious noise.\nIndeed, it is well known that for essentially all PAC learning problems it is information-theoretically\npossible to learn to accuracy 1 \u2212 \u03b5 only if the malicious noise rate \u03b7 is at most \u03b5/(1 + \u03b5) [20],\nand most computationally efficient algorithms for learning even simple classes of functions can only\ntolerate significantly lower malicious noise rates (see e.g. [1, 2, 8, 20, 24, 28]).\nInterestingly, the original Perceptron algorithm [5, 26, 27] for learning a \u03b3-margin halfspace can be\nshown to have relatively high tolerance to malicious noise. 
Several researchers [14, 17] have established\nupper bounds on the number of mistakes that the Perceptron algorithm will make when run on\na sequence of examples that are linearly separable with a margin except for some limited number of\n\u201cnoisy\u201d data points. Servedio [28] observed that combining these upper bounds with Theorem 6.2\nof Auer and Cesa-Bianchi [3] yields a straightforward \u201cPAC version\u201d of the online Perceptron\nalgorithm that can learn \u03b3-margin halfspaces to accuracy 1 \u2212 \u03b5 in the presence of malicious noise\nprovided that the malicious noise rate \u03b7 is at most some value \u0398(\u03b5\u03b3). Servedio [28] also describes\na different PAC learning algorithm which uses a \u201csmooth\u201d booster together with a simple geometric\nreal-valued weak learner and achieves essentially the same result: it also learns a \u03b3-margin halfspace\nto accuracy 1 \u2212 \u03b5 in the presence of malicious noise at rate at most \u0398(\u03b5\u03b3). Both the boosting-based\nalgorithm of [28] and the Perceptron-based approach run in time poly(n, 1/\u03b3, 1/\u03b5).\nOur results. We give a simple new algorithm for learning \u03b3-margin halfspaces in the presence of\nmalicious noise. Like the earlier approaches, our algorithm runs in time poly(n, 1/\u03b3, 1/\u03b5); however,\nit goes beyond the \u0398(\u03b5\u03b3) malicious noise tolerance of previous approaches. Our first main result is:\n\nTheorem 1 There is a poly(n, 1/\u03b3, 1/\u03b5)-time algorithm that can learn an unknown \u03b3-margin halfspace\nto accuracy 1 \u2212 \u03b5 in the presence of malicious noise at any rate \u03b7 \u2264 c\u03b5\u03b3\u221alog(1/\u03b3) whenever\n\u03b3 < 1/7, where c > 0 is a universal constant.\n\nWhile our \u0398(\u221alog(1/\u03b3)) improvement is not large, it is interesting to go beyond the\n\u201cnatural-looking\u201d \u0398(\u03b5\u03b3) bound of Perceptron and other simple approaches. 
The algorithm of Theorem 1 is\nnot based on convex optimization, and this is not a coincidence: our second main result is, roughly\nstated, the following.\n\nInformal paraphrase of Theorem 2 Let A be any learning algorithm that chooses a hypothesis\nvector v so as to minimize a convex proxy for the binary misclassification error. Then A cannot\nlearn \u03b3-margin halfspaces to accuracy 1 \u2212 \u03b5 in the presence of malicious noise at rate \u03b7 \u2265 c\u03b5\u03b3,\nwhere c > 0 is a universal constant.\n\nOur approach. The algorithm of Theorem 1 is a modification of a boosting-based approach to\nlearning halfspaces that is due to Balcan and Blum [7] (see also [6]). [7] considers a weak learner\nwhich simply generates a random origin-centered halfspace sign(v \u00b7 x) by taking v to be a uniform\nrandom unit vector. The analysis of [7], which is for a noise-free setting, shows that such a random\nhalfspace has probability \u2126(\u03b3) of having accuracy at least 1/2 + \u2126(\u03b3) with respect to D. Given\nthis, any boosting algorithm can be used to get a PAC algorithm for learning \u03b3-margin halfspaces to\naccuracy 1 \u2212 \u03b5.\nOur algorithm is based on a modified weak learner which generates a collection of k = \u2308log(1/\u03b3)\u2309\nindependent random origin-centered halfspaces h1 = sign(v1 \u00b7 x), . . . , hk = sign(vk \u00b7 x) and takes\nthe majority vote H = Maj(h1, . . . , hk). The crux of our analysis is to show that if there is no noise,\nthen with probability at least (roughly) \u03b3^2 the function H has accuracy at least 1/2 + \u2126(\u03b3\u221ak) with\nrespect to D (see Section 2, in particular Lemma 1). 
By using this weak learner in conjunction with\na \u201csmooth\u201d boosting algorithm as in [28], we get the overall malicious-noise-tolerant PAC learning\nalgorithm of Theorem 1 (see Section 3).\nFor Theorem 2 we consider any algorithm that draws some number m of samples and minimizes\na convex proxy for misclassification error. If m is too small then well-known sample complexity\nbounds imply that the algorithm cannot learn \u03b3-margin halfspaces to high accuracy, so we may\nassume that m is large; but together with the assumption that the noise rate is high, this means\nthat with overwhelmingly high probability the sample will contain many noisy examples. The heart\nof our analysis deals with this situation; we describe a simple \u03b3-margin data source and adversary\nstrategy which ensures that the convex proxy for misclassification error will achieve its minimum\non a hypothesis vector that has accuracy less than 1 \u2212 \u03b5 with respect to the underlying noiseless\ndistribution of examples. We also establish the same fact about algorithms that use a regularizer\nfrom a class that includes the most popular regularizers based on p-norms.\nRelated work. As mentioned above, Servedio [28] gave a boosting-based algorithm that learns\n\u03b3-margin halfspaces with malicious noise at rates up to \u03b7 = \u0398(\u03b5\u03b3). Khardon and Wachman [21]\nempirically studied the noise tolerance of variants of the Perceptron algorithm. Klivans et al. [22]\nshowed that an algorithm that combines PCA-like techniques with smooth boosting can tolerate\nrelatively high levels of malicious noise provided that the distribution D is sufficiently \u201cnice\u201d (uniform\nover the unit sphere or isotropic log-concave). 
We note that \u03b3-margin distributions are signi\ufb01cantly\nless restrictive and can be very far from having the \u201cnice\u201d properties required by [22].\nWe previously [23] showed that any boosting algorithm that works by stagewise minimization of a\nconvex \u201cpotential function\u201d cannot tolerate random classi\ufb01cation noise \u2013 this is a type of \u201cbenign\u201d\nrather than malicious noise, which independently \ufb02ips the label of each example with probability \u03b7.\nA natural question is whether Theorem 2 follows from [23] by having the malicious noise simply\nsimulate random classi\ufb01cation noise; the answer is no, essentially because the ordering of quanti\ufb01ers\nis reversed in the two results. The construction and analysis from [23] crucially relies on the fact\nthat in the setting of that paper, \ufb01rst the random misclassi\ufb01cation noise rate \u03b7 is chosen to take some\nparticular value in (0, 1/2), and then the margin parameter \u03b3 is selected in a way that depends on\n\u03b7. In contrast, in this paper the situation is reversed: in our setting \ufb01rst the margin parameter \u03b3 is\nselected, and then given this value we study how high a malicious noise rate \u03b7 can be tolerated.\n\n2 The basic weak learner for Theorem 1\n\nLet f(x) = sign(w \u00b7 x) be an unknown halfspace and D be an unknown distribution over the n-\ndimensional unit ball that has a \u03b3 margin with respect to f as described in Section 1. For odd k \u2265 1\nwe let Ak denote the algorithm that works as follows: Ak generates k independent uniform random\nunit vectors v1, . . . , vk in Rn and outputs the hypothesis H(x) = Maj(sign(v1 \u00b7 x), . . . , sign(vk \u00b7\nx)). 
Note that Ak does not use any examples (and thus malicious noise does not affect its execution).\nAs the main result of Section 2 we show that if k is not too large then algorithm Ak has a\nnon-negligible chance of outputting a reasonably good weak hypothesis:\n\nLemma 1 For odd k \u2264 1/(16\u03b3^2) the hypothesis H generated by Ak has probability at least \u2126(\u03b3\u221ak/2^k)\nof satisfying Pr_{x\u223cD}[H(x) \u2260 f(x)] \u2264 1/2 \u2212 \u03b3\u221ak/(100\u03c0).\n\n2.1 A useful tail bound\n\nThe following notation will be useful in analyzing algorithm Ak: Let\n\nvote(\u03b3, k) := Pr[\u2211_{i=1}^{k} Xi < k/2],\n\nwhere X1, . . . , Xk are i.i.d. Bernoulli (0/1) random variables with E[Xi] =\n1/2 + \u03b3 for all i. Clearly vote(\u03b3, k) is the lower tail of a Binomial distribution, but for our purposes\nwe need an upper bound on vote(\u03b3, k) when k is very small relative to 1/\u03b3^2 and the value\nof vote(\u03b3, k) is close to but \u2013 crucially \u2013 less than 1/2. Standard Chernoff-type bounds [10] do not\nseem to be useful here, so we give a simple self-contained proof of the bound we need (no attempt\nhas been made to optimize constant factors below).\n\nLemma 2 For 0 < \u03b3 < 1/2 and odd k \u2264 1/(16\u03b3^2) we have vote(\u03b3, k) \u2264 1/2 \u2212 \u03b3\u221ak/50.\n\nProof: The lemma is easily verified for k = 1, 3, 5, 7 so we assume k \u2265 9 below. The\nvalue vote(\u03b3, k) equals \u2211_{i<k/2} (k choose i)(1/2 \u2212 \u03b3)^{k\u2212i}(1/2 + \u03b3)^i, which is easily seen to equal\n(1/2^k) \u2211_{i<k/2} (k choose i)(1 \u2212 4\u03b3^2)^i(1 \u2212 2\u03b3)^{k\u22122i}. Since k is odd, (1/2^k) \u2211_{i<k/2} (k choose i) equals 1/2, so it\nremains to show that (1/2^k) \u2211_{i<k/2} (k choose i)[1 \u2212 (1 \u2212 4\u03b3^2)^i(1 \u2212 2\u03b3)^{k\u22122i}] \u2265 \u03b3\u221ak/50. 
Consider any integer\ni \u2208 [0, k/2 \u2212 \u221ak]. For such an i we have\n\n(1 \u2212 2\u03b3)^{k\u22122i} \u2264 (1 \u2212 2\u03b3)^{2\u221ak} \u2264 1 \u2212 (2\u03b3)(2\u221ak) + (2\u03b3)^2 (2\u221ak choose 2)   (1)\n\u2264 1 \u2212 4\u03b3\u221ak + 8\u03b3\u221ak(\u03b3\u221ak)   (2)\n\u2264 1 \u2212 4\u03b3\u221ak + 2\u03b3\u221ak = 1 \u2212 2\u03b3\u221ak,   (3)\n\nwhere (1) is obtained by truncating the alternating binomial series expansion of (1 \u2212 2\u03b3)^{2\u221ak} after\na positive term, (2) uses the upper bound (\u2113 choose 2) \u2264 \u2113^2/2, and (3) uses \u03b3\u221ak \u2264 1/4 which follows\nfrom the bound k \u2264 1/(16\u03b3^2). So we have (1 \u2212 4\u03b3^2)^i(1 \u2212 2\u03b3)^{k\u22122i} \u2264 1 \u2212 2\u03b3\u221ak and thus we have\n1 \u2212 (1 \u2212 4\u03b3^2)^i(1 \u2212 2\u03b3)^{k\u22122i} \u2265 2\u03b3\u221ak. The sum \u2211_{i \u2264 k/2\u2212\u221ak} (k choose i) is at least 0.01 \u00b7 2^k for all odd k \u2265 9\n[13], so we obtain the claimed bound:\n\n(1/2^k) \u2211_{i<k/2} (k choose i)[1 \u2212 (1 \u2212 4\u03b3^2)^i(1 \u2212 2\u03b3)^{k\u22122i}] \u2265 (1/2^k) \u2211_{i \u2264 k/2\u2212\u221ak} (k choose i) \u00b7 2\u03b3\u221ak \u2265 \u03b3\u221ak/50.\n\n2.2 Proof of Lemma 1\n\nThroughout the following discussion it will be convenient to view angles between vectors as lying\nin the range [\u2212\u03c0, \u03c0), so acute angles are in the range (\u2212\u03c0/2, \u03c0/2).\nRecall that sign(w\u00b7x) is the unknown target halfspace (we assume w is a unit vector) and v1, . . . , vk\nare the random unit vectors generated by algorithm Ak. For j \u2208 {1, . . . 
, k} let Gj denote the\n\u201cgood\u201d event that the angle between vj and w is acute, i.e. lies in the interval (\u2212\u03c0/2, \u03c0/2), and\nlet G denote the event G1 \u2227 \u00b7\u00b7\u00b7 \u2227 Gk. Since the vectors vj are selected independently we have\n\nPr[G] = \u220f_{j=1}^{k} Pr[Gj] = 2^{\u2212k}.\n\nThe following claim shows that conditioned on G, any \u03b3-margin point has a noticeably-better-than-1/2\nchance of being classified correctly by H (note that the probability below is over the random\ngeneration of H by Ak):\nClaim 3 Fix x \u2208 Bn to be any point such that |w\u00b7x| \u2265 \u03b3. Then we have Pr_H[H(x) \u2260 f(x) | G] \u2264\nvote(\u03b3/\u03c0, k) \u2264 1/2 \u2212 \u03b3\u221ak/(50\u03c0).\nProof: Without loss of generality we assume that x is a positive example (an entirely similar analysis\ngoes through for negative examples), so w \u00b7 x \u2265 \u03b3. Let \u03b1 denote the angle from w to x in the plane\nspanned by w and x; again without loss of generality we may assume that \u03b1 lies in [0, \u03c0/2] (the\ncase of negative angles is symmetric). In fact since x is a positive example with margin \u03b3, we have\nthat 0 \u2264 \u03b1 \u2264 \u03c0/2 \u2212 \u03b3.\nFix any j \u2208 {1, . . . , k} and let us consider the random unit vector vj. Let v\u2032j be the projection of\nvj onto the plane spanned by x and w. The distribution of v\u2032j/||v\u2032j|| is uniform on the unit circle\nin that plane. We have that sign(vj \u00b7 x) \u2260 f(x) if and only if the magnitude of the angle between\nv\u2032j and x is at least \u03c0/2. Conditioned on Gj, the angle from v\u2032j to w is uniformly distributed over\nthe interval (\u2212\u03c0/2, \u03c0/2). Since the angle from w to x is \u03b1, the angle from v\u2032j to x is the sum of\nthe angle from v\u2032j to w and the angle from w to x, and therefore it is uniformly distributed over the\ninterval (\u2212\u03c0/2 + \u03b1, \u03c0/2 + \u03b1). 
Recalling that \u03b1 \u2265 0, we have that sign(vj \u00b7 x) \u2260 f(x) if and only\nif the angle from v\u2032j to x lies in (\u03c0/2, \u03c0/2 + \u03b1). Since the margin condition implies \u03b1 \u2264 \u03c0/2 \u2212 \u03b3 as\nnoted above, we have Pr[sign(vj \u00b7 x) \u2260 f(x) | Gj] \u2264 (\u03c0/2 \u2212 \u03b3)/\u03c0 = 1/2 \u2212 \u03b3/\u03c0.\nNow recall that v1, ..., vk are chosen independently at random, and G = G1 \u2227 \u00b7\u00b7\u00b7 \u2227 Gk. Thus, after\nconditioning on G, we have that v1, ..., vk are still independent and the events sign(v1 \u00b7 x) \u2260\nf(x), . . . , sign(vk \u00b7 x) \u2260 f(x) are independent.\nIt follows that Pr_H[H(x) \u2260 f(x) | G] \u2264 vote(\u03b3/\u03c0, k) \u2264 1/2 \u2212 \u03b3\u221ak/(50\u03c0), where we used Lemma 2 for the final inequality.\n\nNow all the ingredients are in place for us to prove Lemma 1. Since Claim 3 may be applied to\nevery x in the support of D, we have Pr_{x\u223cD,H}[H(x) \u2260 f(x) | G] \u2264 1/2 \u2212 \u03b3\u221ak/(50\u03c0). Applying Fubini\u2019s\ntheorem we get that E_H[Pr_{x\u223cD}[H(x) \u2260 f(x)] | G] \u2264 1/2 \u2212 \u03b3\u221ak/(50\u03c0). 
Applying Markov\u2019s inequality\nto the nonnegative random variable Pr_{x\u223cD}[H(x) \u2260 f(x)], we get\n\nPr_H[Pr_{x\u223cD}[H(x) \u2260 f(x)] > (1 \u2212 \u03b3\u221ak/(50\u03c0))/2 | G] \u2264 2(1/2 \u2212 \u03b3\u221ak/(50\u03c0)) / (1 \u2212 \u03b3\u221ak/(50\u03c0)),\n\nwhich implies\n\nPr_H[Pr_{x\u223cD}[H(x) \u2260 f(x)] \u2264 (1 \u2212 \u03b3\u221ak/(50\u03c0))/2 | G] \u2265 \u2126(\u03b3\u221ak).\n\nSince Pr_H[G] = 2^{\u2212k} we get\n\nPr_H[Pr_{x\u223cD}[H(x) \u2260 f(x)] \u2264 (1 \u2212 \u03b3\u221ak/(50\u03c0))/2] \u2265 \u2126(\u03b3\u221ak/2^k),\n\nand Lemma 1 is proved.\n\n3 Proof of Theorem 1: smooth boosting the weak learner to tolerate malicious noise\n\nOur overall algorithm for learning \u03b3-margin halfspaces with malicious noise, which we call\nAlgorithm B, combines a weak learner derived from Section 2 with a \u201csmooth\u201d boosting algorithm.\nRecall that boosting algorithms [15, 25] work by repeatedly running a weak learner on a sequence\nof carefully crafted distributions over labeled examples. Given the initial distribution P\nover labeled examples (x, y), a distribution Pi over labeled examples is said to be \u03ba-smooth if\nPi[(x, y)] \u2264 (1/\u03ba) P[(x, y)] for every (x, y) in the support of P. Several boosting algorithms are known\n[9, 16, 28] that generate only 1/\u03b5-smooth distributions when boosting to final accuracy 1 \u2212 \u03b5. For\nconcreteness we will use the MadaBoost algorithm of [9], which generates a (1 \u2212 \u03b5)-accurate final\nhypothesis after O(1/(\u03b5\u03b3^2)) stages of calling the weak learner and runs in time poly(1/\u03b5, 1/\u03b3).\n\nAt a high level our analysis here is related to previous works [28, 22] that used smooth boosting to\ntolerate malicious noise. 
The basic idea is that since a smooth booster does not increase the weight\nof any example by more than a 1/\u03b5 factor, it cannot \u201camplify\u201d the malicious noise rate by more\nthan this factor. In [28] the weak learner only achieved advantage O(\u03b3) so as long as the malicious\nnoise rate was initially O(\u03b5\u03b3), the \u201camplified\u201d malicious noise rate of O(\u03b3) could not completely\n\u201covercome\u201d the advantage and boosting could proceed successfully. Here we have a weak learner\nthat achieves a higher advantage, so boosting can proceed successfully in the presence of more\nmalicious noise. The rest of this section provides details.\nThe weak learner W that B uses is a slight extension of algorithm Ak from Section 2 with k =\n\u2308log(1/\u03b3)\u2309. When invoked with distribution Pt over labeled examples, algorithm W\n\n\u2022 makes \u2113 (specified later) calls to algorithm A_{\u2308log(1/\u03b3)\u2309}, generating candidate hypotheses\nH1, ..., H\u2113; and\n\u2022 evaluates H1, ..., H\u2113 using M (specified later) independent examples drawn from Pt and\noutputs the Hj that makes the fewest errors on these examples.\n\nThe overall algorithm B\n\n\u2022 draws a multiset S of m examples (we will argue later that poly(n, 1/\u03b3, 1/\u03b5) many examples\nsuffice) from EX\u03b7(f,D);\n\u2022 sets the initial distribution P over labeled examples to be uniform over S; and\n\u2022 uses MadaBoost to boost to accuracy 1 \u2212 \u03b5/4 with respect to P, using W as a weak learner.\n\nRecall that we are assuming \u03b7 \u2264 c\u03b5\u03b3\u221alog(1/\u03b3); we will show that under this assumption, algorithm\nB outputs a final hypothesis h that satisfies Pr_{x\u223cD}[h(x) = f(x)] \u2265 1 \u2212 \u03b5 with probability at least\n1/2.\n\nFirst, let SN \u2286 S denote the noisy examples in S. 
A standard Chernoff bound [10] implies that\nwith probability at least 5/6 we have |SN|/|S| \u2264 2\u03b7; we henceforth write \u03b7\u2032 to denote |SN|/|S|.\nWe will show below that with high probability, every time MadaBoost calls the weak learner W\nwith a distribution Pt, W generates a weak hypothesis (call it ht) that has Pr_{(x,y)\u223cPt}[ht(x) = y] \u2265\n1/2 + \u0398(\u03b3\u221alog(1/\u03b3)). MadaBoost\u2019s boosting guarantee then implies that the final hypothesis (call\nit h) of Algorithm B satisfies Pr_{(x,y)\u223cP}[h(x) = y] \u2265 1 \u2212 \u03b5/4. Since h is correct on (1 \u2212 \u03b5/4) of the\npoints in the sample S and \u03b7\u2032 \u2264 2\u03b7, h must be correct on at least 1 \u2212 \u03b5/4 \u2212 2\u03b7 of the points in S \\\nSN, which is a noise-free sample of poly(n, 1/\u03b3, 1/\u03b5) labeled examples generated according to D.\nSince h belongs to a class of hypotheses with VC dimension at most poly(n, 1/\u03b3, 1/\u03b5) (because the\nanalysis of MadaBoost implies that h is a weighted vote over O(1/(\u03b5\u03b3^2)) many weak hypotheses,\nand each weak hypothesis is a vote over O(log(1/\u03b3)) n-dimensional halfspaces), by standard sample\ncomplexity bounds [4, 31, 29], with probability 5/6, the accuracy of h with respect to D is at least\n1 \u2212 \u03b5/2 \u2212 4\u03b7 > 1 \u2212 \u03b5, as desired.\nThus it remains to show that with high probability each time W is called on a distribution Pt, it\nindeed generates a weak hypothesis with advantage at least \u2126(\u03b3\u221alog(1/\u03b3)). Recall the following:\n\nDefinition 1 The total variation distance between distributions P and Q over finite domain X is\nd_TV(P, Q) := max_{E\u2286X} (P[E] \u2212 Q[E]).\nSuppose R is the uniform distribution over the noisy points SN \u2286 S, and P\u2032 is the uniform\ndistribution over the remaining points S \\ SN (we may view P\u2032 as the \u201cclean\u201d version of P). 
Then\nthe distribution P may be written as P = (1 \u2212 \u03b7\u2032)P\u2032 + \u03b7\u2032R, and for any event E we have\nP[E] \u2212 P\u2032[E] \u2264 \u03b7\u2032R[E] \u2264 \u03b7\u2032, so d_TV(P, P\u2032) \u2264 \u03b7\u2032.\nLet Pt denote the distribution generated by MadaBoost during boosting stage t. The smoothness of\nMadaBoost implies that Pt[SN] \u2264 4\u03b7\u2032/\u03b5, so the noisy examples have total probability at most 4\u03b7\u2032/\u03b5\nunder Pt. Arguing as for the original distribution, we have that the clean version P\u2032t of Pt satisfies\n\nd_TV(P\u2032t, Pt) \u2264 4\u03b7\u2032/\u03b5.   (4)\n\nBy Lemma 1, each call to algorithm A_{\u2308log(1/\u03b3)\u2309} yields a hypothesis (call it g) that satisfies\n\nPr_g[error_{P\u2032t}(g) \u2264 1/2 \u2212 \u03b3\u221alog(1/\u03b3)/(100\u03c0)] \u2265 \u2126(\u03b3^2),   (5)\n\nwhere for any distribution Q we define error_Q(g) := Pr_{(x,y)\u223cQ}[g(x) \u2260 y]. Recalling that \u03b7\u2032 \u2264 2\u03b7\nand \u03b7 < c\u03b5\u03b3\u221alog(1/\u03b3), for a suitably small absolute constant c > 0 we have that\n\n4\u03b7\u2032/\u03b5 < \u03b3\u221alog(1/\u03b3)/(400\u03c0).   (6)\n\nThen (4), (5) and (6) imply that Pr_g[error_{Pt}(g) \u2264 1/2 \u2212 3\u03b3\u221alog(1/\u03b3)/(400\u03c0)] \u2265 \u2126(\u03b3^2). This means\nthat by taking the parameters \u2113 and M of the weak learner W to be poly(1/\u03b3, log(1/\u03b5)), we can ensure\nthat with overall probability at least 2/3, at each stage t of boosting the weak hypothesis ht that\nW selects from its \u2113 calls to A in that stage will satisfy error_{Pt}(ht) \u2264 1/2 \u2212 \u03b3\u221alog(1/\u03b3)/(200\u03c0).\nThis concludes the proof of Theorem 1.\n\n4 Convex optimization algorithms have limited malicious noise tolerance\n\nGiven a sample S = {(x1, y1), . . . 
, (xm, ym)} of labeled examples, the number of examples misclassified\nby the hypothesis sign(v \u00b7 x) is a nonconvex function of v, and thus it can be difficult to\nfind a v that minimizes this error (see [12, 18] for theoretical results that support this intuition in\nvarious settings). In an effort to bring the powerful tools of convex optimization to bear on various\nhalfspace learning problems, a widely used approach is to instead minimize some convex proxy for\nmisclassification error.\nDefinition 2 will define the class of such algorithms analyzed in this section. This definition allows\nalgorithms to use regularization, but by setting the regularizer \u03c8 to be the all-0 function it also covers\nalgorithms that do not.\n\nDefinition 2 A function \u03c6 : R \u2192 R+ is a convex misclassification proxy if \u03c6 is convex, nonincreasing,\ndifferentiable, and satisfies \u03c6\u2032(0) < 0. A function \u03c8 : Rn \u2192 [0,\u221e) is a componentwise\nregularizer if \u03c8(v) = \u2211_{i=1}^{n} \u03c4(vi) for a convex, differentiable \u03c4 : R \u2192 [0,\u221e) for which \u03c4(0) = 0.\nGiven a sample of labeled examples S = {(x1, y1), . . . , (xm, ym)} \u2208 (Rn \u00d7 {\u22121, 1})^m, the (\u03c6,\u03c8)-loss\nof vector v on S is L_{\u03c6,\u03c8,S}(v) := \u03c8(v) + \u2211_{i=1}^{m} \u03c6(yi(v \u00b7 xi)). A (\u03c6,\u03c8)-minimizer is any learning\nalgorithm that minimizes L_{\u03c6,\u03c8,S}(v) whenever the minimum exists.\n\nOur main negative result shows that for any sample size, algorithms that minimize a regularized\nconvex proxy for misclassification error will succeed with exponentially small probability for a\nmalicious noise rate that is \u0398(\u03b5\u03b3), and therefore for any larger malicious noise rate.\n\nTheorem 2 Fix \u03c6 to be any convex misclassification proxy and \u03c8 to be any componentwise regularizer,\nand let algorithm A be a (\u03c6,\u03c8)-minimizer. 
Fix \u03b5 \u2208 (0, 1/8] to be any error parameter,\n\u03b3 \u2208 (0, 1/8] to be any margin parameter, and m \u2265 1 to be any sample size. Let the malicious noise\nrate \u03b7 be 16\u03b5\u03b3.\nThen there is an n, a target halfspace f(x) = sign(w \u00b7 x) over Rn, a \u03b3-margin distribution D for f\n(supported on points x \u2208 Bn that have |(w/\u2016w\u2016) \u00b7 x| \u2265 \u03b3), and a malicious adversary with the following\nproperty: If A\u03c6 is given m random examples drawn from EX\u03b7(f,D) and outputs a vector v, then\nthe probability (over the draws from EX\u03b7(f,D)) that v satisfies Pr_{x\u223cD}[sign(v \u00b7 x) \u2260 f(x)] \u2264 \u03b5 is\nat most e^{\u2212c/\u03b3}, where c > 0 is some universal constant.\n\nProof: The analysis has two cases based on whether or not the number of examples m exceeds\nm0 := 1/(32\u03b5\u03b3^2). (We emphasize that Case 2, in which n is taken to be just 2, is the case that is of\nprimary interest, since in Case 1 the algorithm does not have enough examples to reliably learn a\n\u03b3-margin halfspace even in a noiseless scenario.)\nCase 1 (m \u2264 m0): Let n = \u230a1/\u03b3^2\u230b and let e(i) \u2208 Rn denote the unit vector with a 1 in the ith\ncomponent. Then the set of examples E := {e(1), ..., e(n)} is shattered by the family F which consists\nof all 2^n halfspaces whose weight vectors are in {\u2212\u03b3, \u03b3}^n, and any distribution whose support\nis E is a \u03b3-margin distribution for any such halfspace. The proof of the well-known information-theoretic\nlower bound of [11] (in particular, see the last displayed equation in the proof of Lemma 3 of [11])\ngives that for any learning algorithm that uses m examples (such as\nA), there is a distribution D supported on E and a halfspace f \u2208 F such that the output h of A\nsatisfies Pr[Pr_{x\u223cD}[h(x) \u2260 f(x)] > \u03b5] \u2265 1 \u2212 exp(\u2212c/\u03b3^2), where the outer probability is over the\nrandom examples drawn by A. This proves the theorem in Case 1.\nCase 2 (m > m0): We note that it is well known (see e.g. 
[31]) that O(1/(\u03b5\u03b3^2)) examples suffice to\nlearn \u03b3-margin n-dimensional halfspaces for any n if there is no noise, so noisy examples will play\nan important role in the construction in this case.\n\nWe take n = 2. The target halfspace is f(x) = sign(\u221a(1 \u2212 \u03b3^2) x1 + \u03b3 x2). The distribution D is very\nsimple and is supported on only two points: it puts weight 2\u03b5 on the point (\u03b3/\u221a(1 \u2212 \u03b3^2), 0), which is a\npositive example for f, and weight 1 \u2212 2\u03b5 on the point (0, 1) which is also a positive example for\nf. When the malicious adversary is allowed to corrupt an example, with probability 1/2 it provides\nthe point (1, 0) and mislabels it as negative, and with probability 1/2 it provides the point (0, 1) and\nmislabels it as negative.\nLet S = ((x1, y1), ..., (xm, ym)) be a sample of m examples drawn from EX\u03b7(f,D). We define\n\npS,1 := |{t : xt = (\u03b3/\u221a(1 \u2212 \u03b3^2), 0)}| / |S|, pS,2 := |{t : xt = (0, 1), yt = 1}| / |S|,\n\u03b7S,1 := |{t : xt = (1, 0)}| / |S|, and \u03b7S,2 := |{t : xt = (0, 1), yt = \u22121}| / |S|.\n\nUsing standard Chernoff bounds (see e.g. 
[10]) and a union bound we get\n\nPr[pS,1 = 0 or pS,2 = 0 or pS,1 > 3\u03b5 or \u03b7S,1 < \u03b7/4 or \u03b7S,2 < \u03b7/4]\n\u2264 (1 \u2212 2\u03b5(1 \u2212 \u03b7))^m + (1 \u2212 (1 \u2212 2\u03b5)(1 \u2212 \u03b7))^m + exp(\u2212\u03b5m/12) + 2 exp(\u2212\u03b7m/24)\n\u2264 2(1 \u2212 \u03b5)^m + exp(\u2212\u03b5m/12) + 2 exp(\u2212\u03b7m/24)   (since \u03b5 \u2264 1/4 and \u03b7 \u2264 1/2)\n\u2264 2 exp(\u22121/(32\u03b3^2)) + exp(\u22121/(96\u03b3^2)) + 2 exp(\u22121/(48\u03b3)).\n\nSince the theorem allows for a e^{\u2212c/\u03b3} success probability for A, it suffices to consider the case in\nwhich pS,1 and pS,2 are both positive, pS,1 \u2264 3\u03b5, and min{\u03b7S,1, \u03b7S,2} \u2265 \u03b7/4. For v = (v1, v2) \u2208\nR2 the value L_{\u03c6,\u03c8,S}(v) is proportional to\n\nL(v1, v2) := pS,1 \u03c6(\u03b3v1/\u221a(1 \u2212 \u03b3^2)) + pS,2 \u03c6(v2) + \u03b7S,1 \u03c6(\u2212v1) + \u03b7S,2 \u03c6(\u2212v2) + \u03c8(v)/|S|.\n\nFrom the bounds stated above on pS,1, pS,2, \u03b7S,1 and \u03b7S,2 we may conclude that L_{\u03c6,\u03c8,S}(v) does\nachieve a minimum value. This is because for any z \u2208 R the set {v : L_{\u03c6,\u03c8,S}(v) \u2264 z} is bounded,\nand therefore so is its closure. Since L_{\u03c6,\u03c8,S}(v) is bounded below by zero and is continuous, this\nimplies that it has a minimum. 
To see that for any z \u2208 R the set {v : L_{\u03c6,\u03c8,S}(v) \u2264 z} is bounded, observe that if either v_1 or v_2 is fixed and the other one is allowed to take on arbitrarily large magnitude values (either positive or negative), this causes L_{\u03c6,\u03c8,S}(v) to take on arbitrarily large positive values (this is an easy consequence of the definition of L, the fact that \u03c6 is convex, non-negative and nonincreasing, \u03c6\u2032(0) < 0, and the fact that p_{S,1}, p_{S,2}, \u03b7_{S,1}, \u03b7_{S,2} are all positive).\n\nTaking the derivative with respect to v_1 yields\n\n    \u2202L/\u2202v_1 = p_{S,1} (\u03b3/\u221a(1 \u2212 \u03b3^2)) \u03c6\u2032(\u03b3 v_1/\u221a(1 \u2212 \u03b3^2)) \u2212 \u03b7_{S,1} \u03c6\u2032(\u2212v_1) + \u03c4\u2032(v_1)/|S|.    (7)\n\nWhen v_1 = 0, the derivative (7) is p_{S,1} (\u03b3/\u221a(1 \u2212 \u03b3^2)) \u03c6\u2032(0) \u2212 \u03b7_{S,1} \u03c6\u2032(0) (recall that \u03c4 is minimized at 0 and thus \u03c4\u2032(0) = 0). Recall that \u03c6\u2032(0) < 0 by assumption. If p_{S,1} \u03b3/\u221a(1 \u2212 \u03b3^2) < \u03b7_{S,1} then (7) is positive at 0, which means that L(v_1, v_2) is an increasing function of v_1 at v_1 = 0 for all v_2. Since L is convex, this means that for each v_2 \u2208 R we have that the value v_1^\u22c6 that minimizes L(v_1^\u22c6, v_2) is a negative value v_1^\u22c6 < 0. So, if p_{S,1} \u03b3/\u221a(1 \u2212 \u03b3^2) < \u03b7_{S,1}, the linear classifier v output by A_\u03c6 has v_1 \u2264 0; hence it misclassifies the point (\u03b3/\u221a(1 \u2212 \u03b3^2), 0), and thus has error rate at least 2\u03b5 with respect to D.\n\nCombining the fact that \u03b3 \u2264 1/8 with the facts that p_{S,1} \u2264 3\u03b5 and \u03b7_{S,1} \u2265 \u03b7/4, we get p_{S,1} \u03b3/\u221a(1 \u2212 \u03b3^2) < 1.01 \u00d7 p_{S,1} \u03b3 < 4\u03b5\u03b3 = \u03b7/4 \u2264 \u03b7_{S,1}, which completes the proof.\n\n5 Conclusion\n\nIt would be interesting to further improve on the malicious noise tolerance of efficient algorithms for PAC learning \u03b3-margin halfspaces, or to establish computational hardness results for this problem. Another goal for future work is to develop an algorithm that matches the noise tolerance of Theorem 1 but uses a single halfspace as its hypothesis representation.\n\nReferences\n[1] J. Aslam and S. Decatur. Specification and simulation of statistical query algorithms for efficiency and noise tolerance. Journal of Computer and System Sciences, 56:191\u2013208, 1998.\n[2] P. Auer. Learning nested differences in the presence of malicious noise. Theor. Comp. Sci., 185(1):159\u2013175, 1997.\n[3] P. Auer and N. Cesa-Bianchi. On-line learning with malicious noise and the closure algorithm. Annals of Mathematics and Artificial Intelligence, 23:83\u201399, 1998.\n[4] E. B. Baum and D. Haussler. What size net gives valid generalization? Neural Comput., 1:151\u2013160, 1989.\n[5] H. Block. The Perceptron: a model for brain functioning. Reviews of Modern Physics, 34:123\u2013135, 1962.\n[6] A. Blum. Random Projection, Margins, Kernels, and Feature-Selection. In LNCS Volume 3940, pages 52\u201368, 2006.\n[7] A. Blum and M.-F. Balcan. A discriminative model for semi-supervised learning. 
Journal of the ACM, 57(3), 2010.\n[8] S. Decatur. Statistical queries and faulty PAC oracles. In Proc. 6th COLT, pages 262\u2013268, 1993.\n[9] C. Domingo and O. Watanabe. MadaBoost: a modified version of AdaBoost. In Proc. 13th COLT, pages 180\u2013189, 2000.\n[10] D. Dubhashi and A. Panconesi. Concentration of measure for the analysis of randomized algorithms. Cambridge University Press, Cambridge, 2009.\n[11] A. Ehrenfeucht, D. Haussler, M. Kearns, and L. Valiant. A general lower bound on the number of examples needed for learning. Information and Computation, 82(3):247\u2013251, 1989.\n[12] V. Feldman, P. Gopalan, S. Khot, and A. Ponnuswami. On agnostic learning of parities, monomials, and halfspaces. SIAM J. Comput., 39(2):606\u2013645, 2009.\n[13] W. Feller. Generalization of a probability limit theorem of Cram\u00e9r. Trans. Am. Math. Soc., 54:361\u2013372, 1943.\n[14] Y. Freund and R. Schapire. Large margin classification using the Perceptron algorithm. In Proc. 11th COLT, pages 209\u2013217, 1998.\n[15] Y. Freund and R. Schapire. A short introduction to boosting. J. Japan. Soc. Artif. Intel., 14(5):771\u2013780, 1999.\n[16] D. Gavinsky. Optimally-smooth adaptive boosting and application to agnostic learning. JMLR, 4:101\u2013117, 2003.\n[17] C. Gentile and N. Littlestone. The robustness of the p-norm algorithms. In Proc. 12th COLT, pages 1\u201311, 1999.\n[18] V. Guruswami and P. Raghavendra. Hardness of learning halfspaces with noise. SIAM J. Comput., 39(2):742\u2013765, 2009.\n[19] D. Haussler, M. Kearns, N. Littlestone, and M. Warmuth. Equivalence of models for polynomial learnability. Information and Computation, 95(2):129\u2013161, 1991.\n[20] M. Kearns and M. Li. Learning in the presence of malicious errors. SIAM Journal on Computing, 22(4):807\u2013837, 1993.\n[21] R. Khardon and G. Wachman. Noise tolerant variants of the perceptron algorithm. 
JMLR, 8:227\u2013248, 2007.\n[22] A. Klivans, P. Long, and R. Servedio. Learning Halfspaces with Malicious Noise. JMLR, 10:2715\u20132740, 2009.\n[23] P. Long and R. Servedio. Random classification noise defeats all convex potential boosters. Machine Learning, 78(3):287\u2013304, 2010.\n[24] Y. Mansour and M. Parnas. Learning conjunctions with noise under product distributions. Information Processing Letters, 68(4):189\u2013196, 1998.\n[25] R. Meir and G. R\u00e4tsch. An introduction to boosting and leveraging. In LNAI Advanced Lectures on Machine Learning, pages 118\u2013183, 2003.\n[26] A. Novikoff. On convergence proofs on perceptrons. In Proceedings of the Symposium on Mathematical Theory of Automata, volume XII, pages 615\u2013622, 1962.\n[27] F. Rosenblatt. The Perceptron: a probabilistic model for information storage and organization in the brain. Psychological Review, 65:386\u2013407, 1958.\n[28] R. Servedio. Smooth boosting and learning with malicious noise. JMLR, 4:633\u2013648, 2003.\n[29] J. Shawe-Taylor, P. Bartlett, R. Williamson, and M. Anthony. Structural risk minimization over data-dependent hierarchies. IEEE Transactions on Information Theory, 44(5):1926\u20131940, 1998.\n[30] L. Valiant. Learning disjunctions of conjunctions. In Proc. 9th IJCAI, pages 560\u2013566, 1985.\n[31] V. Vapnik. Statistical Learning Theory. Wiley-Interscience, New York, 1998.\n", "award": [], "sourceid": 93, "authors": [{"given_name": "Phil", "family_name": "Long", "institution": null}, {"given_name": "Rocco", "family_name": "Servedio", "institution": null}]}