{"title": "Learning Halfspaces with the Zero-One Loss: Time-Accuracy Tradeoffs", "book": "Advances in Neural Information Processing Systems", "page_first": 926, "page_last": 934, "abstract": "Given $\\alpha,\\epsilon$, we study the time complexity required to improperly learn a halfspace with misclassification error rate of at most $(1+\\alpha)\\,L^*_\\gamma + \\epsilon$, where $L^*_\\gamma$ is the optimal $\\gamma$-margin error rate. For $\\alpha = 1/\\gamma$, polynomial time and sample complexity is achievable using the hinge-loss. For $\\alpha = 0$, \\cite{ShalevShSr11} showed that $\\poly(1/\\gamma)$ time is impossible, while learning is possible in time $\\exp(\\tilde{O}(1/\\gamma))$. An immediate question, which this paper tackles, is what is achievable if $\\alpha \\in (0,1/\\gamma)$. We derive positive results interpolating between the polynomial time for $\\alpha = 1/\\gamma$ and the exponential time for $\\alpha=0$. In particular, we show that there are cases in which $\\alpha = o(1/\\gamma)$ but the problem is still solvable in polynomial time. Our results naturally extend to the adversarial online learning model and to the PAC learning with malicious noise model.", "full_text": "Learning Halfspaces with the Zero-One Loss:\n\nTime-Accuracy Tradeoffs\n\nAharon Birnbaum and Shai Shalev-Shwartz\nSchool of Computer Science and Engineering\n\nThe Hebrew University\n\nJerusalem, Israel\n\nAbstract\n\nGiven \u03b1, \u03f5, we study the time complexity required to improperly learn a halfs-\n\u2217\n\u2217\npace with misclassi\ufb01cation error rate of at most (1 + \u03b1) L\n\u03b3 + \u03f5, where L\n\u03b3 is the\noptimal \u03b3-margin error rate. For \u03b1 = 1/\u03b3, polynomial time and sample com-\nplexity is achievable using the hinge-loss. For \u03b1 = 0, Shalev-Shwartz et al.\n[2011] showed that poly(1/\u03b3) time is impossible, while learning is possible in\ntime exp( \u02dcO(1/\u03b3)). 
An immediate question, which this paper tackles, is what is\nachievable if \u03b1 \u2208 (0, 1/\u03b3). We derive positive results interpolating between the\npolynomial time for \u03b1 = 1/\u03b3 and the exponential time for \u03b1 = 0. In particular,\nwe show that there are cases in which \u03b1 = o(1/\u03b3) but the problem is still solvable\nin polynomial time. Our results naturally extend to the adversarial online learning\nmodel and to the PAC learning with malicious noise model.\n\n1 Introduction\n\nSome of the most in\ufb02uential machine learning tools are based on the hypothesis class of halfspaces\nwith margin. Examples include the Perceptron [Rosenblatt, 1958], Support Vector Machines [Vap-\nnik, 1998], and AdaBoost [Freund and Schapire, 1997]. In this paper we study the computational\ncomplexity of learning halfspaces with margin.\nA halfspace is a mapping h(x) = sign(\u27e8w, x\u27e9), where w, x \u2208 X are taken from the unit ball of an\nRKHS (e.g. Rn), and \u27e8w, x\u27e9 is their inner-product. Relying on the kernel trick, our sole assumption\non X is that we are able to calculate ef\ufb01ciently the inner-product between any two instances (see\nfor example Sch\u00a8olkopf and Smola [2002], Cristianini and Shawe-Taylor [2004]). Given an example\n(x, y) \u2208 X \u00d7 {\u00b11} and a vector w, we say that w errs on (x, y) if y\u27e8w, x\u27e9 \u2264 0 and we say that w\nmakes a \u03b3-margin error on (x, y) if y\u27e8w, x\u27e9 \u2264 \u03b3.\nThe error rate of a predictor h : X \u2192 {\u00b11} is de\ufb01ned as L01(h) = P[h(x) \u0338= y], where the\nprobability is over some unknown distribution over X \u00d7{\u00b11}. The \u03b3-margin error rate of a predictor\nx 7\u2192 \u27e8w, x\u27e9 is de\ufb01ned as L\u03b3(w) = P[y\u27e8w, x\u27e9 \u2264 \u03b3]. A learning algorithm A receives an i.i.d.\ntraining set S = (x1, y1), . . . , (xm, ym) and its goal is to return a predictor, A(S), whose error rate\nis small. 
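As a concrete illustration, the two error rates just defined can be computed directly from a sample. The following is a minimal sketch; the data, function names, and the choice of γ are our own illustrative choices, not part of the paper:

```python
import numpy as np

def zero_one_error(w, X, y):
    # Fraction of examples on which w errs, counting y<w, x> <= 0 as an error.
    return float(np.mean(y * (X @ w) <= 0.0))

def margin_error(w, X, y, gamma):
    # Fraction of gamma-margin errors: y<w, x> <= gamma.
    return float(np.mean(y * (X @ w) <= gamma))

# Toy sanity check on linearly separable data in R^2.
X = np.array([[1.0, 0.0], [0.8, 0.1], [-1.0, 0.0], [-0.9, -0.2]])
y = np.array([1.0, 1.0, -1.0, -1.0])
w = np.array([1.0, 0.0])
assert zero_one_error(w, X, y) == 0.0        # all points correctly classified
assert margin_error(w, X, y, gamma=0.9) == 0.5  # two points violate margin 0.9
```

Note that for γ > 0 the γ-margin error rate upper bounds the zero-one error rate, which is why the benchmark L*γ in Equation (1) is the harder of the two targets.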
We study the runtime required to learn a predictor such that, with high probability over the choice of S, the error rate of the learnt predictor satisfies

L01(A(S)) ≤ (1 + α) L*γ + ε , where L*γ = min_{w : ‖w‖ = 1} Lγ(w) .    (1)

There are three parameters of interest: the margin parameter γ, the multiplicative approximation factor parameter α, and the additive error parameter ε.
From the statistical perspective (i.e., if we allow exponential runtime), Equation (1) is achievable with α = 0 by letting A be the algorithm which minimizes the number of margin errors over the training set subject to a norm constraint on w. The sample complexity of A is m = Ω̃(1/(γ²ε²)). See for example Cristianini and Shawe-Taylor [2004].
If the data is separable with margin (that is, L*γ = 0), then the aforementioned A can be implemented in time poly(1/γ, 1/ε). However, the problem is much harder in the agnostic case, namely, when L*γ > 0 and the distribution over examples can be arbitrary.
Ben-David and Simon [2000] showed that no proper learning algorithm can satisfy Equation (1) with α = 0 while running in time polynomial in both 1/γ and 1/ε. By "proper" we mean an algorithm which returns a halfspace predictor. Shalev-Shwartz et al. [2011] extended these results to improper learning, that is, when A(S) should satisfy Equation (1) but is not required to be a halfspace. They also derived an algorithm that satisfies Equation (1) and runs in time exp(C (1/γ) log(1/(γε))), where C is a constant.
Most algorithms that are being used in practice minimize a convex surrogate loss. That is, instead of minimizing the number of mistakes on the training set, the algorithms minimize L̂(w) = (1/m) Σ_{i=1}^m ℓ(y_i⟨w, x_i⟩), where ℓ : R → R is a convex function that upper bounds the 0-1 loss. For example, the Support Vector Machine (SVM) algorithm relies on the hinge loss. The advantage of convex surrogate losses is that minimizing them can be performed in time poly(1/γ, 1/ε). It is easy to verify that minimizing L̂(w) with respect to the hinge loss yields a predictor that satisfies Equation (1) with α = 1/γ. Furthermore, Long and Servedio [2011], Ben-David et al. [2012] have shown that any convex surrogate loss cannot guarantee Equation (1) if α < ½(1/γ − 1).
Despite the centrality of this problem, not much is known on the runtime required to guarantee Equation (1) with other values of α. In particular, a natural question is how the runtime changes when enlarging α from 0 to 1/γ. Does it change gradually, or is there a phase transition?
Our main contribution is an upper bound on the required runtime as a function of α. For any α between¹ 5 and 1/γ, let τ = 1/(γα). We show that the runtime required to guarantee Equation (1) is at most exp(C τ min{τ, log(1/γ)}), where C is a universal constant (we ignore additional factors which are polynomial in 1/γ, 1/ε; see a precise statement with the exact constants in Theorem 1). That is, when we enlarge α, the runtime decreases gradually from being exponential to being polynomial. Furthermore, we show that the algorithm which yields the aforementioned bound is a vanilla SVM with a specific kernel. 
We also show how one can design specific kernels that fit well certain values of α while minimizing our upper bound on the sample and time complexity.
In Section 4 we extend our results to the more challenging learning settings of adversarial online learning and PAC learning with malicious noise. For both cases, we obtain similar upper bounds on the runtime as a function of α. The technique we use in the malicious noise case may be of independent interest.
An interesting special case is when α = (1/γ)/√log(1/γ). In this case, τ = √log(1/γ) and hence the runtime is still polynomial in 1/γ. This recovers a recent result of Long and Servedio [2011]. Their technique is based on a smooth boosting algorithm applied on top of a weak learner which constructs random halfspaces and takes their majority vote. Furthermore, Long and Servedio emphasize that their algorithm is not based on convex optimization. They show that no convex surrogate can obtain α = o(1/γ). As mentioned before, our technique is rather different, as we do rely on the hinge loss as a convex surrogate loss. There is no contradiction to Long and Servedio since we apply the convex loss in the feature space induced by our kernel function. The negative result of Long and Servedio holds only if the convex surrogate is applied on the original space.

¹We did not analyze the case α < 5 because the runtime is already exponential in 1/γ even when α = 5. Note, however, that our bound for α = 5 is slightly better than the bound of Shalev-Shwartz et al. [2011] for α = 0, because our bound does not involve the parameter ε in the exponent while their bound depends on exp((1/γ) log(1/(εγ))).

1.1 Additional related work

The problem of learning kernel-based halfspaces has been extensively studied before in the framework of SVM [Vapnik, 1998, Cristianini and Shawe-Taylor, 2004, Schölkopf and Smola, 2002] and the Perceptron [Freund and Schapire, 1999]. Most algorithms replace the 0-1 error function with a convex surrogate. As mentioned previously, Ben-David et al. [2012] have shown that this approach leads to an approximation factor of at least ½(1/γ − 1).
There have been several works attempting to obtain efficient algorithms for the case α = 0 under certain distributional assumptions. For example, Kalai et al. [2005], Blais et al. [2008] have shown that if the marginal data distribution over X is a product distribution, then it is possible to satisfy Equation (1) with α = γ = 0 in time poly(n^{1/ε⁴}). Klivans et al. [2009] derived similar results for the case of malicious noise. Another distributional assumption is on the conditional probability of the label given the instance. For example, Kalai and Sastry [2009] solve the problem in polynomial time if there exists a vector w and a monotonically non-increasing function φ such that P(Y = 1|X = x) = φ(⟨w, x⟩).
Zhang [2004], Bartlett et al. [2006] also studied the relationship between convex surrogate loss functions and the 0-1 loss function. They introduce the notion of well calibrated loss functions, meaning that the excess risk of a predictor h (over the Bayes optimal) with respect to the 0-1 loss can be bounded using the excess risk of the predictor with respect to the surrogate loss. It follows that if the latter is close to zero then the former is also close to zero. However, as Ben-David et al. 
[2012]\nshow in detail, without making additional distributional assumptions the fact that a loss function is\nwell calibrated does not yield \ufb01nite-sample or \ufb01nite-time bounds.\nIn terms of techniques, our Theorem 1 can be seen as a generalization of the positive result given\nin Shalev-Shwartz et al. [2011]. While Shalev-Shwartz et al. only studied the case \u03b1 = 0, we are\ninterested in understanding the whole curve of runtime as a function of \u03b1. Similar to the analysis of\nShalev-Shwartz et al., we approximate the sigmoidal and erf transfer functions using polynomials.\nHowever, we need to break symmetry in the de\ufb01nition of the exact transfer function to approximate.\nThe main technical observation is that the Lipschitz constant of the transfer functions we approx-\nimate does not depend on \u03b1, and is roughly 1/\u03b3 no matter what \u03b1 is. Instead, the change of the\ntransfer function when \u03b1 is increasing is in higher order derivatives.\nTo the best of our knowledge, the only middle point on the curve that has been studied before is the\ncase \u03b1 =\n, which was analyzed in Long and Servedio [2011]. Our work shows an upper\nbound on the entire curve. Besides that, we also provide a recipe for constructing better kernels for\nspeci\ufb01c values of \u03b1.\n\nlog(1/\u03b3)\n\n\u221a\n\n\u03b3\n\n1\n\n2 Main Results\n\nOur main result is an upper bound on the time and sample complexity for all values of \u03b1 between\n5 and 1/\u03b3. The bounds we derive hold for a norm-constraint form of SVM with a speci\ufb01c kernel,\nwhich we recall now. Given a training set S = (x1, y1), . . . 
, (x_m, y_m), and a feature mapping ψ : X → X′, where X′ is the unit ball of some Hilbert space, consider the following learning rule:

argmin_{v : ‖v‖² ≤ B} Σ_{i=1}^m max{0, 1 − y_i⟨v, ψ(x_i)⟩} .    (2)

Using the well known kernel trick, if K(x, x′) implements the inner product ⟨ψ(x), ψ(x′)⟩, and G is an m × m matrix with G_{i,j} = K(x_i, x_j), then we can write a solution of Equation (2) as v = Σ_i a_i ψ(x_i), where the vector a ∈ R^m is a solution of

argmin_{a : aᵀGa ≤ B} Σ_{i=1}^m max{0, 1 − y_i (Ga)_i} .    (3)

The above is a convex optimization problem in m variables and can be solved in time poly(m). Given a solution a ∈ R^m, we define a classifier h_a : X → {±1} to be

h_a(x) = sign( Σ_{i=1}^m a_i K(x_i, x) ) .    (4)

The upper bounds we derive hold for the above kernel-based SVM with the kernel function

K(x, x′) = 1 / (1 − ½⟨x, x′⟩) .    (5)

We are now ready to state our main theorem.
Theorem 1 For any γ ∈ (0, 1/2) and α ≥ 5, let τ = 1/(γα) and let

B = min{ (1/γ²)(96τ² + e^{18τ log(8τα²)+5}) , 4α²(0.06 e^{4τ²} + 3) } = poly(1/γ) · e^{min{18τ log(8τα²), 4τ²}} .

Fix ε, δ ∈ (0, 1/2) and let m be a training set size that satisfies

m ≥ (16/ε²) max{2B, (1 + α)² log(2/δ)} .

Let A be the algorithm which solves Equation (3) with the kernel function given in Equation (5), and returns the predictor defined in Equation (4). Then, for any distribution, with probability of at least 1 − δ, the algorithm A satisfies Equation (1).
The proof of the theorem is given in the next section. 
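A minimal sketch of the learning rule in Equations (3)-(5): the kernel of Equation (5), a simple projected-subgradient solver for the constrained hinge minimization, and the classifier of Equation (4). The paper only requires that Equation (3) be solved by some poly(m)-time convex solver; this particular solver, its step sizes, and the toy data are our own illustrative choices:

```python
import numpy as np

def kernel(X1, X2):
    # Kernel of Equation (5): K(x, x') = 1 / (1 - 0.5 <x, x'>),
    # well defined on the unit ball since |<x, x'>| <= 1.
    return 1.0 / (1.0 - 0.5 * (X1 @ X2.T))

def solve_svm(X, y, B, steps=2000, lr=0.05):
    # Projected subgradient descent on Equation (3):
    #   min_a  sum_i max(0, 1 - y_i (G a)_i)   s.t.  a^T G a <= B.
    m = len(y)
    G = kernel(X, X)
    a = np.zeros(m)
    for _ in range(steps):
        active = (y * (G @ a)) < 1.0         # examples with positive hinge loss
        a += (lr / m) * (G @ (y * active))   # subgradient step on the hinge sum
        q = a @ G @ a
        if q > B:                            # project back onto {a : a^T G a <= B}
            a *= np.sqrt(B / q)
    return a

def classify(a, X_train, X_test):
    # Classifier of Equation (4): h_a(x) = sign(sum_i a_i K(x_i, x)).
    return np.sign(kernel(X_test, X_train) @ a)

# Toy data on the unit sphere, separable with a margin along the first coordinate.
rng = np.random.default_rng(0)
X = rng.normal(size=(40, 3))
X[:, 0] = np.sign(X[:, 0]) * (np.abs(X[:, 0]) + 0.5)
X /= np.linalg.norm(X, axis=1, keepdims=True)
y = np.sign(X[:, 0])
a = solve_svm(X, y, B=100.0)
train_err = np.mean(classify(a, X, X) != y)
```

The point of Theorem 1 is the choice of B: large enough that the RKHS ball contains a good polynomial predictor, small enough that the sample complexity stays controlled.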
As a direct corollary we obtain that there is an efficient algorithm that achieves an approximation factor of α = o(1/γ):
Corollary 2 For any ε, δ, γ ∈ (0, 1), let α = (1/γ)/√log(1/γ) and let B = 0.06/γ⁶ + 3. Then, with m, A being as defined in Theorem 1, the algorithm A satisfies Equation (1).
As another corollary of Theorem 1 we obtain that for any constant c ∈ (0, 1), it is possible to satisfy Equation (1) with α = c/γ in polynomial time. However, the dependence of the runtime on the constant c is e^{4/c²}. For example, for c = 1/2 we obtain the multiplicative factor e^16 ≈ 8,800,000.
Our next contribution is to show that a more careful design of the kernel function can yield better bounds.
Theorem 3 For any γ, α, let p be a polynomial of the form p(z) = Σ_{j=1}^d β_j z^{2j−1} (namely, p is odd) that satisfies

max_{z ∈ [−1,1]} |p(z)| ≤ α and min_{z : |z| ≥ γ} |p(z)| ≥ 1 .

Let m be a training set size that satisfies

m ≥ (16/ε²) max{‖β‖₁², 2 log(4/δ), (1 + α)² log(2/δ)} .

Let A be the algorithm which solves Equation (3) with the following kernel function

K(x, x′) = Σ_{j=1}^d |β_j| (⟨x, x′⟩)^{2j−1} ,

and returns the predictor defined in Equation (4). Then, for any distribution, with probability of at least 1 − δ, the algorithm A satisfies Equation (1).

The above theorem provides us with a recipe for constructing good kernel functions: given γ and α, find a vector β with minimal ℓ₁ norm such that the polynomial p(z) = Σ_{j=1}^d β_j z^{2j−1} satisfies the conditions given in Theorem 3. For a fixed degree d, this can be written as the following optimization problem:

min_{β ∈ R^d} ‖β‖₁ s.t. ∀z ∈ [0, 1], p(z) ≤ α ∧ ∀z ∈ [γ, 1], p(z) ≥ 1 .    (6)

Note that for any z, the expression p(z) is a linear function of β. Therefore, the above problem is a linear program with an infinite number of constraints. Nevertheless, it can be solved efficiently using the Ellipsoid algorithm. Indeed, for any β, we can find the extreme points of the polynomial it defines, and then determine whether β satisfies all the constraints or, if it doesn't, we can find a violated constraint.
To demonstrate how Theorem 3 can yield a better guarantee (in terms of the constants), we solved Equation (6) for the simple case of d = 2. For this simple case, we can provide an analytic solution to Equation (6), and based on this solution we obtain the following lemma, whose proof is provided in the appendix.
Lemma 4 Given γ < 2/3, consider the polynomial p(z) = β₁z + β₂z³, where

β₁ = 1/γ + γ/(1+γ) , β₂ = −1/(γ(1+γ)) .

Then, p satisfies the conditions of Theorem 3 with

α = 2/(3√3 γ) + 2/√3 ≤ 0.385 · (1/γ) + 1.155 .

Furthermore, ‖β‖₁ ≤ 2/γ + 1.
It is interesting to compare the guarantee given in the above lemma to the guarantee of using the vanilla hinge loss. For both cases the sample complexity is of order 1/(γ²ε²). For the vanilla hinge loss we obtain the approximation factor 1/γ, while for the kernel given in Lemma 4 we obtain an approximation factor of α ≤ 0.385 · (1/γ) + 1.155. Recall that Ben-David et al. [2012] have shown that without utilizing kernels, no convex surrogate loss can guarantee an approximation factor smaller than ½(1/γ − 1). The above discussion shows that applying the hinge loss with a kernel function can break this barrier without a significant increase in runtime² or sample complexity.

3 Proofs

Given a scalar loss function ℓ : R → R and a vector w, we denote by L(w) = E_{(x,y)∼D}[ℓ(y⟨w, x⟩)] the expected loss value of the predictions of w with respect to a distribution D over X × {±1}. Given a training set S = (x₁, y₁), . . . , (x_m, y_m), we denote by L̂(w) = (1/m) Σ_{i=1}^m ℓ(y_i⟨w, x_i⟩) the empirical loss of w. We slightly overload our notation and also use L(w) to denote E_{(x,y)∼D}[ℓ(y⟨w, ψ(x)⟩)] when w is an element of an RKHS corresponding to the mapping ψ. We define L̂(w) analogously.
We will make extensive use of the following loss functions: the zero-one loss, ℓ01(z) = 1[z ≤ 0], the γ-zero-one loss, ℓγ(z) = 1[z ≤ γ], the hinge loss, ℓh(z) = [1 − z]₊ = max{0, 1 − z}, and the ramp loss, ℓramp(z) = min{1, ℓh(z)}. We will use L01(w), Lγ(w), Lh(w), and Lramp(w) to denote the expectations with respect to the different loss functions. Similarly, L̂01(w), L̂γ(w), L̂h(w), and L̂ramp(w) are the empirical losses of w with respect to the different loss functions.
Recall that we output a vector v that solves Equation (3). This vector is in the RKHS corresponding to the kernel given in Equation (5). Let Bx = max_{x∈X} K(x, x) ≤ 2. Since the ramp loss upper bounds the zero-one loss, we have that L01(v) ≤ Lramp(v). The advantage of using the ramp loss is that it is both a Lipschitz function and bounded by 1. Hence, standard Rademacher generalization analysis (e.g. 
Bartlett and Mendelson [2002], Bousquet [2002]) yields that with probability of\nat least 1 \u2212 \u03b4/2 over the choice of S we have:\n\n|\nLramp(v) \u2264 \u02c6Lramp(v) + 2\n\nBxB\n\nm\n\n+\n\n2 ln(4/\u03b4)\n\nm\n\n}\n\n\u221a\n\n\u221a\n{z\n\n.\n\n(7)\n\n=\u03f51\n\nSince the ramp loss is upper bounded by the hinge-loss, we have shown the following inequalities,\n(8)\n\nL01(v) \u2264 Lramp(v) \u2264 \u02c6Lramp(v) + \u03f51 \u2264 \u02c6Lh(v) + \u03f51 .\n\nNext, we rely on the following claim adapted from [Shalev-Shwartz et al., 2011, Lemma 2.4]:\n\n2It should be noted that solving SVM with kernels takes more time than solving a linear SVM. Hence, if\nthe original instance space is a low dimensional Euclidean space we loose polynomially in the time complexity.\nHowever, when the original instance space is also an RKHS, and our kernel is composed on top of the original\nkernel, the increase in the time complexity is not signi\ufb01cant.\n\n5\n\n\f\u2211\u221e\n\n\u2211\u221e\n\nj=0 \u03b2jzj be any polynomial that satis\ufb01es\n\nj 2j \u2264 B, and let w be\nClaim 5 Let p(z) =\nany vector in X . Then, there exists vw in the RKHS de\ufb01ned by the kernel given in Equation (5), such\nthat \u2225vw\u22252 \u2264 B and for all x \u2208 X , \u27e8vw, \u03c8(x)\u27e9 = p(\u27e8w, x\u27e9).\nFor any polynomial p, let \u2113p(z) = \u2113h(p(z)), and let \u02c6Lp be de\ufb01ned analogously. If p is an odd\npolynomial, we have that \u2113p(y\u27e8w, x\u27e9) = [1 \u2212 yp(\u27e8w, x\u27e9)]+. 
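As a numerical sanity check of the kind of odd polynomial used in this argument, one can verify on a grid that the degree-2 polynomial of Lemma 4 satisfies max_{z∈[−1,1]} |p(z)| ≤ α and min_{z∈[γ,1]} p(z) ≥ 1, and hence ℓh(p(z)) ≤ (1 + α) 1[z ≤ γ] on [−1, 1]. This is a sketch; the grid resolution, tolerances, and the choice γ = 0.1 are arbitrary:

```python
import numpy as np

# Coefficients of the Lemma 4 polynomial p(z) = b1*z + b2*z^3 for gamma = 0.1,
# and the alpha stated in the lemma.
gamma = 0.1
b1 = 1.0 / gamma + gamma / (1.0 + gamma)
b2 = -1.0 / (gamma * (1.0 + gamma))
alpha = 2.0 / (3.0 * np.sqrt(3.0) * gamma) + 2.0 / np.sqrt(3.0)

z = np.linspace(-1.0, 1.0, 100001)
p = b1 * z + b2 * z**3

# Conditions of Theorem 3 (p is odd, so checking z >= gamma suffices).
assert np.max(np.abs(p)) <= alpha + 1e-9
assert np.min(p[z >= gamma]) >= 1.0 - 1e-9

# Consequence used in the proof: l_h(p(z)) <= (1 + alpha) * l_gamma(z).
hinge_of_p = np.maximum(0.0, 1.0 - p)
assert np.all(hinge_of_p <= (1.0 + alpha) * (z <= gamma) + 1e-9)
```

The same grid check applies to any candidate β produced by the optimization problem of Equation (6).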
By the de\ufb01nition of v as minimizing\n\u02c6Lh(v) over \u2225v\u22252 \u2264 B, it follows from the above claim that for any odd p that satis\ufb01es\nj 2j \u2264\nB and for any w\n\n\u2217 \u2208 X, we have that\n\n\u2211\u221e\n\nj=0 \u03b22\n\nj=0 \u03b22\n\n\u02c6Lh(v) \u2264 \u02c6Lh(vw\u2217 ) = \u02c6Lp(w\n\n\u2217\n\n) .\n\nNext, it is straightforward to verify that if p is an odd polynomial that satis\ufb01es:\n\nmax\nz\u2208[\u22121,1]\n\n|p(z)| \u2264 \u03b1 and min\nz\u2208[\u03b3,1]\n\np(z) \u2265 1\n\n\u2217\n\nthen, \u2113p(z) \u2264 (1 + \u03b1)\u2113\u03b3(z) for all z \u2208 [\u22121, 1]. For such polynomials, we have that \u02c6Lp(w\n(1 + \u03b1) \u02c6L\u03b3(w\nprobability of at least 1 \u2212 \u03b4/2 over the choice of S we have that\n) + \u03f52 .\n\n). Finally, by Hoeffding\u2019s inequality, for any \ufb01xed w\n\n\u2217, if m > log(2/\u03b4)\n\n) \u2264 L\u03b3(w\n\n\u02c6L\u03b3(w\n\n\u03f52\n2\n\n\u2217\n\n\u2217\n\nSo, overall, we have obtained that with probability of at least 1 \u2212 \u03b4,\n\n(9)\n) \u2264\n, then with\n\n\u2217\n\nL01(v) \u2264 (1 + \u03b1) L\u03b3(w\n\n\u2217\n\n) + (1 + \u03b1)\u03f52 + \u03f51 .\n\nChoosing m large enough so that (1 + \u03b1)\u03f52 + \u03f51 \u2264 \u03f5, we obtain:\nCorollary 6 Fix \u03b3, \u03f5, \u03b4 \u2208 (0, 1) and \u03b1 > 0. Let p be an odd polynomial such that\nand such that Equation (9) holds. 
Let m be a training set size that satis\ufb01es:\n\u00b7 max{2B, 2 log(4/\u03b4), (1 + \u03b1)2 log(2/\u03b4)} .\n\nm \u2265 16\n\u03f52\n\n\u2211\n\nj 2j \u2264 B\n\nj \u03b22\n\nThen, with probability of at least 1\u2212\u03b4, the solution of Equation (3) satis\ufb01es, L01(v) \u2264 (1+\u03b1)L\n\u2217\n\u03b3 +\u03f5.\nThe proof of Theorem 1 follows immediately from the above corollary together with the following\ntwo lemmas, whose proofs are provided in the appendix.\n\n(\n\n)\n\nLemma 7 For any \u03b3 > 0 and \u03b1 > 2, let \u03c4 = 1\n(\nexists a polynomial that satis\ufb01es the conditions of Corollary 6 with the parameters \u03b1, \u03b3, B.\nLemma 8 For any \u03b3 \u2208 (0, 1/2) and \u03b1 \u2208 [5, 1\n\u03b3 ],\n4\u03b12\nditions of Corollary 6 with the parameters \u03b1, \u03b3, B.\n\n\u03b1\u03b3 and let B =\n. Then, there exists a polynomial that satis\ufb01es the con-\n\n\u03b1\u03b3 and let B = 1\n\u03b32\n\n. Then, there\n\n96\u03c4 2 + exp\n\nlet \u03c4 =\n\n0.06 e4\u03c4 2\n\n))\n\n18\u03c4 log\n\n8\u03c4 \u03b12\n\n(\n\n)\n\n(\n\n+ 3\n\n+ 5\n\n1\n\n3.1 Proof of Theorem 3\n\nd\n\n\u2211\nThe proof is similar to the proof of Theorem 1 except that we replace Claim 5 with the following:\nj=1 \u03b2jz2j\u22121 be any polynomial, and let w be any vector in X . Then, there\nLemma 9 Let p(z) =\nexists vw in the RKHS de\ufb01ned by the kernel given in Theorem 3, such that \u2225vw\u22252 \u2264 \u2225\u03b2\u22251 and for\nall x \u2208 X , \u27e8vw, \u03c8(x)\u27e9 = p(\u27e8w, x\u27e9).\nProof We start with an explicit de\ufb01nition of the mapping \u03c8(x) corresponding to the kernel in the\ntheorem. The coordinates of \u03c8(x) are indexed by tuples (k1, . . . , kj) \u2208 [n]j for j = 1, 3, . . . , 2d\u22121.\nCoordinate (k1, . . . , kj) equals to\nplicitly the vector vw for which \u27e8vw, \u03c8(x)\u27e9 = p(\u27e8w, x\u27e9). Coordinate (k1, . . . 
kj) of vw equals to\nsign(\u03b2j)\n\u27e8vw, \u03c8(x)\u27e9 = p(\u27e8w, x\u27e9).\nSince for any x \u2208 X we also have that K(x, x) \u2264 \u2225\u03b2\u22251, the proof of Theorem 3 follows using the\nsame arguments as in the proof of Theorem 1.\n\n\u221a|\u03b2j|xk1xk2 . . . xkj . Next, for any w \u2208 X , we de\ufb01ne ex-\n\u221a|\u03b2j|wk1wk2 . . . wkj . It is easy to verify that indeed \u2225vw\u22252 \u2264 \u2225\u03b2\u22251 and for all x \u2208 X ,\n\n6\n\n\f4 Extension to other learning models\n\nIn this section we brie\ufb02y describe how our results can be extended to adversarial online learning and\nto PAC learning with malicious noise. We start with the online learning model.\n\n4.1 Online learning\n\nm\n\nt=1 \u2113h(yt\u27e8w\n\nGiven a sequence (x1, y1), . . . , (xm, ym), and a vector w\n\nOnline learning is performed in a sequence of consecutive rounds, where at round t the learner is\ngiven an instance, xt \u2208 X , and is required to predict its label. After predicting \u02c6yt, the target label,\nyt, is revealed. The goal of the learner is to make as few prediction mistakes as possible. See for\nexample Cesa-Bianchi and Lugosi [2006].\nA classic online classi\ufb01cation algorithm is the Perceptron [Rosenblatt, 1958]. The Perceptron main-\ntains a vector wt and predicts according to \u02c6yt = sign(\u27e8wt, xt\u27e9). Initially, w1 = 0, and at round t\nthe Perceptron updates the vector using the rule wt+1 = wt + 1[\u02c6yt \u0338= yt] ytxt. Freund and Schapire\n[1999] observed that the Perceptron can also be implemented ef\ufb01ciently in an RKHS using a kernel\nfunction.\n\u2217 such that for all t, yt\u27e8w\n, xt\u27e9 \u2265 1 and\nAgmon [1954] and others have shown that if there exists w\n\u2225xt\u22252 \u2264 Bx, then the Perceptron will make at most \u2225w\n\u2217\u22252Bx prediction mistakes. 
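The kernelized Perceptron keeps w_t implicitly as a sum of the examples on which it erred, so only mistake counters need to be stored. A minimal sketch (the kernel and the toy sequence are our own illustrative choices; score 0 is predicted as −1 here, since the paper leaves sign(0) unspecified):

```python
import numpy as np

def kernel(x1, x2):
    # The kernel of Equation (5); any PSD kernel works for the Perceptron.
    return 1.0 / (1.0 - 0.5 * np.dot(x1, x2))

def kernel_perceptron(X, y):
    # Kernelized Perceptron (Freund and Schapire, 1999): w_t = sum_s c_s y_s psi(x_s)
    # is represented implicitly through the per-example mistake indicators c.
    m = X.shape[0]
    c = np.zeros(m)
    mistakes = 0
    for t in range(m):
        score = sum(c[s] * y[s] * kernel(X[s], X[t]) for s in range(t))
        y_hat = 1.0 if score > 0.0 else -1.0
        if y_hat != y[t]:
            c[t] = 1.0   # update w_{t+1} = w_t + y_t psi(x_t) on a mistake
            mistakes += 1
    return c, mistakes

# Alternating sequence of two antipodal points on the unit sphere.
X = np.array([[1.0, 0.0], [-1.0, 0.0]] * 10)
y = np.array([1.0, -1.0] * 10)
c, mistakes = kernel_perceptron(X, y)
```

On this sequence a single pass makes exactly two mistakes, one per class, and then predicts correctly, consistent with the ‖w*‖²Bx mistake bound above.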
This bound holds without making any additional distributional assumptions on the sequence of examples.
This mistake bound has been generalized to the noisy case (see for example Gentile [2003]) as follows. Given a sequence (x₁, y₁), . . . , (x_m, y_m) and a vector w*, let Lh(w*) = (1/m) Σ_{t=1}^m ℓh(y_t⟨w*, x_t⟩), where ℓh is the hinge loss. Then, the average number of prediction mistakes the Perceptron will make on this sequence is at most

(1/m) Σ_{t=1}^m 1[ŷ_t ≠ y_t] ≤ Lh(w*) + √( Bx‖w*‖² Lh(w*) / m ) + Bx‖w*‖² / m .    (10)

Let Lγ(w*) = (1/m) Σ_{t=1}^m 1(y_t⟨w*, x_t⟩ ≤ γ). Trivially, Equation (10) can yield a bound whose leading term is (1 + 1/γ) Lγ(w*) (namely, it corresponds to α = 1/γ). On the other hand, Ben-David et al. [2009] have derived a mistake bound whose leading term depends on Lγ(w*) (namely, it corresponds to α = 0), but the runtime of the algorithm is at least m^{1/γ²}. The main result of this section is to derive a mistake bound for the Perceptron based on all values of α between 5 and 1/γ.
Theorem 10 For any γ ∈ (0, 1/2) and α ≥ 5, let τ = 1/(γα) and let B_{α,γ} be the value of B as defined in Theorem 1. Then, for any sequence (x₁, y₁), . . . , (x_m, y_m), if the Perceptron is run on this sequence using the kernel function given in Equation (5), the average number of prediction mistakes it will make is at most:

min_{γ ∈ (0,1/2), α ≥ 5, w* ∈ X} [ (1 + α) Lγ(w*) + √( 2B_{α,γ} (1 + α) Lγ(w*) / m ) + 2B_{α,γ} / m ] .

Proof [sketch] Equation (10) holds if we implement the Perceptron using the kernel function given in Equation (5), for which Bx = 2. 
Furthermore, similarly to the proof of Theorem 1, for any\n\u2217 in the RKHS\npolynomial p that satis\ufb01es the conditions of Corollary 6 we have that there exists v\ncorresponding to the kernel, with \u2225v\n) \u2264 (1 + \u03b1)L\u03b3(w\n\u2217\n). The theorem\nfollows.\n\n\u2217\u22252 \u2264 B and with Lh(v\n\n\u2217\n\n4.2 PAC learning with malicious noise\n\nIn this model, introduced by Valiant [1985] and speci\ufb01ed to the case of halfspaces with margin\nby Servedio [2003], Long and Servedio [2011], there is an unknown distribution over instances\nin X and there is an unknown target vector w\n, x\u27e9| \u2265 \u03b3 with probability 1.\nThe learner has an access to an example oracle. At each query to the oracle, with probability of\n1 \u2212 \u03b7 it samples a random example x \u2208 X according to the unknown distribution over X , and\n\n\u2217 \u2208 X such that |\u27e8w\n\n\u2217\n\n7\n\n\f\u2217\n\nreturns (x, sign(\u27e8w\n, x\u27e9)). However, with probability \u03b7, the oracle returns an arbitrary element of\nX \u00d7 {\u00b11}. The goal of the learner is to output a predictor that has L01(h) \u2264 \u03f5, with respect to the\n\u201cclean\u201d distribution.\nAuer and Cesa-Bianchi [1998] described a general conversion from online learning to the malicious\nnoise setting. Servedio [2003] used this conversion to derive a bound based on the Perceptron\u2019s\nmistake bound. In our case, we cannot rely on the conversion of Auer and Cesa-Bianchi [1998]\nsince it requires a proper learner, while the online learner described in the previous section is not\nproper.\nInstead, we propose the following simple algorithm. First, sample m examples. Then, solve kernel\nSVM on the resulting noisy training set.\nTheorem 11 Let \u03b3 \u2208 (0, 1/4), \u03b4 \u2208 (0, 1/2), and \u03b1 > 5. Let B be as de\ufb01ned in Theorem 1.\nLet m be a training set size that satis\ufb01es: m \u2265 64\n. 
Then, with\nprobability of at least 1\u2212 2\u03b4, the output of kernel-SVM on the noisy training set, denoted h, satis\ufb01es\nL01(h) \u2264 (2 + \u03b1)\u03b7 + \u03f5/2. It follows that if \u03b7 \u2264 \u03f5\n\n\u03f52 \u00b7 max\n2(2+\u03b1) then L01(h) \u2264 \u03f5.\n\n2B , (2 + \u03b1)2 log(1/\u03b4)\n\n}\n\n{\n\nProof Let \u00afS be a training set in which we replace the noisy examples with clean iid examples. Let\n\u00afL denotes the empirical loss over \u00afS and \u02c6L denotes the empirical loss over S. As in the proof of\nTheorem 1, we have that w.p. of at least 1 \u2212 \u03b4, for any v in the RKHS corresponding to the kernel\nthat satis\ufb01es \u2225v\u22252 \u2264 B we have that:\n\n(11)\nby our assumption on m. Let \u02c6\u03b7 be the fraction of noisy examples in S. Note that \u00afS and S differ in\nat most m\u02c6\u03b7 elements. Therefore, for any v,\n\nL01(v) \u2264 \u00afLramp(v) + 3\u03f5/8 ,\n\n\u00afLramp(v) \u2264 \u02c6Lramp(v) + \u02c6\u03b7 .\n\n(12)\n\u2217 be the target vector in the original space (i.e., the one which\nNow, let v be the minimizer of \u02c6Lh, let w\nachieves correct predictions with margin \u03b3 on clean examples), and let vw\u2217 be its corresponding\nelement in the RKHS (see Claim 5). We have\n\n\u02c6Lramp(v) \u2264 \u02c6Lh(v) \u2264 \u02c6Lh(vw\u2217 ) = \u02c6Lp(w\n\n(13)\nIn the above, the \ufb01rst inequality is since the ramp loss is upper bounded by the hinge loss, the second\ninequality is by the de\ufb01nition of v, the third equality is by Claim 5, the fourth inequality is by the\nproperties of p, and the last inequality follows from the de\ufb01nition of \u02c6\u03b7. 
Combining the above yields
$$L_{01}(v) \le (2+\alpha)\hat{\eta} + 3\epsilon/8 .$$
Finally, using Hoeffding's inequality, we know that for the definition of $m$, with probability of at least $1-\delta$ we have that $\hat{\eta} \le \eta + \frac{\epsilon}{8(2+\alpha)}$. Applying the union bound and combining the above, we conclude that with probability of at least $1-2\delta$, $L_{01}(v) \le (2+\alpha)\eta + \epsilon/2$.

5 Summary and Open Problems

We have derived upper bounds on the time and sample complexities as a function of the approximation factor. We further provided a recipe for designing kernel functions with a small time and sample complexity for any given value of the approximation factor and the margin. Our results are applicable to agnostic PAC learning, online learning, and PAC learning with malicious noise.

An immediate open question is whether our results can be improved. If not, can computational hardness results be formally established? Another open question is whether the upper bounds we have derived for an improper learner can also be derived for a proper learner.

Acknowledgements: This work is supported by the Israeli Science Foundation grant number 598-10 and by the German-Israeli Foundation grant number 2254-2010. Shai Shalev-Shwartz is incumbent of the John S. Cohen Senior Lectureship in Computer Science.

References

S. Agmon. The relaxation method for linear inequalities. Canadian Journal of Mathematics, 6(3):382–392, 1954.

P. Auer and N. Cesa-Bianchi. On-line learning with malicious noise and the closure algorithm. Annals of Mathematics and Artificial Intelligence, 23(1):83–99, 1998.

P. L. Bartlett and S. Mendelson. Rademacher and Gaussian complexities: Risk bounds and structural results. Journal of Machine Learning Research, 3:463–482, 2002.

P. L. Bartlett, M. I.
Jordan, and J. D. McAuliffe. Convexity, classification, and risk bounds. Journal of the American Statistical Association, 101:138–156, 2006.

S. Ben-David and H. Simon. Efficient learning of linear perceptrons. In NIPS, 2000.

S. Ben-David, D. Pál, and S. Shalev-Shwartz. Agnostic online learning. In COLT, 2009.

S. Ben-David, D. Loker, N. Srebro, and K. Sridharan. Minimizing the misclassification error rate using a surrogate convex loss. In ICML, 2012.

E. Blais, R. O'Donnell, and K. Wimmer. Polynomial regression under arbitrary product distributions. In COLT, 2008.

O. Bousquet. Concentration Inequalities and Empirical Processes Theory Applied to the Analysis of Learning Algorithms. PhD thesis, Ecole Polytechnique, 2002.

N. Cesa-Bianchi and G. Lugosi. Prediction, Learning, and Games. Cambridge University Press, 2006.

N. Cristianini and J. Shawe-Taylor. Kernel Methods for Pattern Analysis. Cambridge University Press, 2004.

Y. Freund and R. E. Schapire. Large margin classification using the perceptron algorithm. Machine Learning, 37(3):277–296, 1999.

Y. Freund and R. E. Schapire. A decision-theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences, 55(1):119–139, August 1997.

C. Gentile. The robustness of the p-norm algorithms. Machine Learning, 53(3):265–299, 2003.

A. Kalai, A. R. Klivans, Y. Mansour, and R. Servedio. Agnostically learning halfspaces. In Proceedings of the 46th Foundations of Computer Science (FOCS), 2005.

A. T. Kalai and R. Sastry. The Isotron algorithm: High-dimensional isotonic regression. In Proceedings of the 22nd Annual Conference on Learning Theory, 2009.

A. R. Klivans, P. M. Long, and R. A. Servedio. Learning halfspaces with malicious noise. The Journal of Machine Learning Research, 10:2715–2740, 2009.

P. M. Long and R. A. Servedio. Learning large-margin halfspaces with more malicious noise. In NIPS, 2011.

F. Rosenblatt. The perceptron: A probabilistic model for information storage and organization in the brain. Psychological Review, 65:386–407, 1958. (Reprinted in Neurocomputing, MIT Press, 1988.)

B. Schölkopf and A. J. Smola. Learning with Kernels: Support Vector Machines, Regularization, Optimization and Beyond. MIT Press, 2002.

R. A. Servedio. Smooth boosting and learning with malicious noise. Journal of Machine Learning Research, 4:633–648, 2003.

S. Shalev-Shwartz, O. Shamir, and K. Sridharan. Learning kernel-based halfspaces with the 0-1 loss. SIAM Journal on Computing, 40:1623–1646, 2011.

L. G. Valiant. Learning disjunctions of conjunctions. In Proceedings of the 9th International Joint Conference on Artificial Intelligence, pages 560–566, August 1985.

V. N. Vapnik. Statistical Learning Theory. Wiley, 1998.

T. Zhang. Statistical behavior and consistency of classification methods based on convex risk minimization. The Annals of Statistics, 32:56–85, 2004.