{"title": "Nearly Tight Bounds for Robust Proper Learning of Halfspaces with a Margin", "book": "Advances in Neural Information Processing Systems", "page_first": 10473, "page_last": 10484, "abstract": "We study the problem of {\\em properly} learning large margin halfspaces in the agnostic PAC model. \nIn more detail, we study the complexity of properly learning $d$-dimensional halfspaces\non the unit ball within misclassification error $\\alpha \\cdot \\opt_{\\gamma} + \\eps$, where $\\opt_{\\gamma}$\nis the optimal $\\gamma$-margin error rate and $\\alpha \\geq 1$ is the approximation ratio.\nWe give learning algorithms and computational hardness results\nfor this problem, for all values of the approximation ratio $\\alpha \\geq 1$, \nthat are nearly-matching for a range of parameters. \nSpecifically, for the natural setting that $\\alpha$ is any constant bigger \nthan one, we provide an essentially tight complexity characterization.\nOn the positive side, we give an $\\alpha = 1.01$-approximate proper learner \nthat uses $O(1/(\\eps^2\\gamma^2))$ samples (which is optimal) and runs in time\n$\\poly(d/\\eps) \\cdot 2^{\\tilde{O}(1/\\gamma^2)}$. On the negative side, \nwe show that {\\em any} constant factor approximate proper learner has runtime \n$\\poly(d/\\eps) \\cdot 2^{(1/\\gamma)^{2-o(1)}}$, \nassuming the Exponential Time Hypothesis.", "full_text": "Nearly Tight Bounds for Robust Proper Learning\n\nof Halfspaces with a Margin\u2217\n\nIlias Diakonikolas\n\nUniversity of Wisconsin-Madison\n\nilias@cs.wisc.edu\n\nDaniel M. 
Kane

University of California, San Diego

dakane@cs.ucsd.edu

Pasin Manurangsi†

University of California, Berkeley

pasin@berkeley.edu

Abstract

We study the problem of properly learning large margin halfspaces in the agnostic PAC model. In more detail, we study the complexity of properly learning $d$-dimensional halfspaces on the unit ball within misclassification error $\alpha \cdot \mathrm{OPT}_\gamma + \epsilon$, where $\mathrm{OPT}_\gamma$ is the optimal $\gamma$-margin error rate and $\alpha \geq 1$ is the approximation ratio. We give learning algorithms and computational hardness results for this problem, for all values of the approximation ratio $\alpha \geq 1$, that are nearly-matching for a range of parameters. Specifically, for the natural setting that $\alpha$ is any constant bigger than one, we provide an essentially tight complexity characterization. On the positive side, we give an $\alpha = 1.01$-approximate proper learner that uses $O(1/(\epsilon^2\gamma^2))$ samples (which is optimal) and runs in time $\mathrm{poly}(d/\epsilon) \cdot 2^{\tilde{O}(1/\gamma^2)}$. On the negative side, we show that any constant factor approximate proper learner has runtime $\mathrm{poly}(d/\epsilon) \cdot 2^{(1/\gamma)^{2-o(1)}}$, assuming the Exponential Time Hypothesis.

1 Introduction

1.1 Background and Problem Definition

Halfspaces are Boolean functions $h_w : \mathbb{R}^d \to \{\pm 1\}$ of the form $h_w(x) = \mathrm{sign}(\langle w, x \rangle)$, where $w \in \mathbb{R}^d$ is the associated weight vector. (The function $\mathrm{sign} : \mathbb{R} \to \{\pm 1\}$ is defined as $\mathrm{sign}(u) = 1$ if $u \geq 0$ and $\mathrm{sign}(u) = -1$ otherwise.) 
The problem of learning an unknown halfspace with a margin condition (in the sense that no example is allowed to lie too close to the separating hyperplane) is as old as the field of machine learning — starting with Rosenblatt's Perceptron algorithm [Ros58] — and has arguably been one of the most influential problems in the development of the field, with techniques such as SVMs [Vap98] and AdaBoost [FS97] coming out of its study.
We study the problem of learning $\gamma$-margin halfspaces in the agnostic PAC model [Hau92, KSS94]. Specifically, there is an unknown distribution $D$ on $B_d \times \{\pm 1\}$, where $B_d$ is the unit ball in $\mathbb{R}^d$, and the learning algorithm $A$ is given as input a training set $S = \{(x^{(i)}, y^{(i)})\}_{i=1}^m$ of i.i.d. samples drawn from $D$. The goal of $A$ is to output a hypothesis whose error rate is competitive with the $\gamma$-margin error rate of the optimal halfspace. In more detail, the error rate (misclassification error) of a hypothesis $h : \mathbb{R}^d \to \{\pm 1\}$ (with respect to $D$) is $\mathrm{err}^D_{0\text{-}1}(h) \stackrel{\mathrm{def}}{=} \Pr_{(x,y) \sim D}[h(x) \neq y]$. For $\gamma \in (0, 1)$, the $\gamma$-margin error rate of a halfspace $h_w(x)$ with $\|w\|_2 \leq 1$ is $\mathrm{err}^D_\gamma(w) \stackrel{\mathrm{def}}{=} \Pr_{(x,y) \sim D}[y\langle w, x \rangle \leq \gamma]$.

∗The full version of this paper is available at [DKM19].
†Now at Google Research, Mountain View.

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

We denote by $\mathrm{OPT}^D_\gamma \stackrel{\mathrm{def}}{=} \min_{\|w\|_2 \leq 1} \mathrm{err}^D_\gamma(w)$ the minimum $\gamma$-margin error rate achievable by any halfspace. We say that $A$ is an $\alpha$-agnostic learner, $\alpha \geq 1$, if it outputs a hypothesis $h$ that with probability at least $1 - \tau$ satisfies $\mathrm{err}^D_{0\text{-}1}(h) \leq \alpha \cdot \mathrm{OPT}^D_\gamma + \epsilon$. (For $\alpha = 1$, we obtain the standard notion of agnostic learning.) 
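To make the two error notions concrete, here is a small hedged sketch (ours, not from the paper) of how they would be estimated on a finite sample; the names `X`, `y`, `w` and the toy data are illustrative assumptions:

```python
import numpy as np

def zero_one_error(w, X, y):
    # empirical misclassification error Pr[h_w(x) != y], with sign(u) = 1 iff u >= 0
    preds = np.where(X @ w >= 0, 1, -1)
    return np.mean(preds != y)

def margin_error(w, X, y, gamma):
    # empirical gamma-margin error Pr[y <w, x> <= gamma]
    return np.mean(y * (X @ w) <= gamma)

# toy labeled sample on the unit ball in R^2 (illustrative only)
X = np.array([[0.6, 0.0], [0.0, 0.8], [-0.5, 0.1], [0.3, -0.2]])
y = np.array([1, 1, -1, 1])
w = np.array([1.0, 0.0])
```

Since every misclassified point has $y\langle w, x \rangle \leq 0 \leq \gamma$, the $\gamma$-margin error always upper bounds the zero-one error, which is the inequality the learner exploits.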
If the hypothesis $h$ is itself a halfspace, we say that the learning algorithm is proper. This work focuses on proper learning algorithms.

1.2 Related and Prior Work

In this section, we summarize the related work that is directly relevant to the results of this paper. First, we note that the sample complexity of our learning problem (ignoring computational considerations) is well-understood. In particular, the ERM that minimizes the number of $\gamma$-margin errors over the training set (subject to a norm constraint) is known to be an agnostic learner ($\alpha = 1$), assuming the sample size is $\Omega(\log(1/\tau)/(\epsilon^2\gamma^2))$. Specifically, $\Theta(\log(1/\tau)/(\epsilon^2\gamma^2))$ samples3 are known to be sufficient and necessary for this learning problem (see, e.g., [BM02, McA03]). In the realizable case ($\mathrm{OPT}^D_\gamma = 0$), i.e., if the data is linearly separable with margin $\gamma$, the ERM rule above can be implemented in $\mathrm{poly}(d, 1/\epsilon, 1/\gamma)$ time using the Perceptron algorithm. The agnostic setting ($\mathrm{OPT}^D_\gamma > 0$) is much more challenging computationally.
The agnostic version of our problem ($\alpha = 1$) was first considered in [BS00], who gave a proper learning algorithm with runtime $\mathrm{poly}(d) \cdot (1/\epsilon)^{\tilde{O}(1/\gamma^2)}$. It was also shown in [BS00] that agnostic proper learning with runtime $\mathrm{poly}(d, 1/\epsilon, 1/\gamma)$ is NP-hard. A question left open by their work was characterizing the computational complexity of proper learning as a function of $1/\gamma$.
Subsequent works focused on improper learning. The $\alpha = 1$ case was studied in [SSS09, SSS10], who gave a learning algorithm with sample complexity $\mathrm{poly}(1/\epsilon) \cdot 2^{\tilde{O}(1/\gamma)}$ – i.e., exponential in $1/\gamma$ – and computational complexity $\mathrm{poly}(d/\epsilon) \cdot 2^{\tilde{O}(1/\gamma)}$. 
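As an aside on the realizable case mentioned above: the classical Perceptron update can be sketched in a few lines (our illustration with toy data, not code from the paper); on unit-norm data separable with margin $\gamma$ it makes at most $1/\gamma^2$ updates:

```python
import numpy as np

def perceptron(X, y, max_passes=100):
    # classical Perceptron: update on any non-positive margin;
    # on gamma-margin separable unit-norm data it makes <= 1/gamma^2 updates
    w = np.zeros(X.shape[1])
    for _ in range(max_passes):
        mistakes = 0
        for x, label in zip(X, y):
            if label * (x @ w) <= 0:
                w += label * x
                mistakes += 1
        if mistakes == 0:  # a consistent halfspace has been found
            break
    return w

# toy sample, linearly separable by w* = (1, 0) with margin 0.6
X = np.array([[0.9, 0.1], [0.7, -0.2], [-0.8, 0.3], [-0.6, -0.1]])
y = np.array([1, 1, -1, -1])
w = perceptron(X, y)
```

The returned vector separates the toy sample; in the agnostic setting discussed next, no such mistake-driven update alone suffices.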
The increased sample complexity is inherent in their approach, as their algorithm works by solving a convex program over an expanded feature space. [BS12] gave an $\alpha$-agnostic learning algorithm for all $\alpha \geq 1$ with sample complexity $2^{\tilde{O}(1/(\alpha\gamma))}$ and computational complexity $\mathrm{poly}(d/\epsilon) \cdot 2^{\tilde{O}(1/(\alpha\gamma))}$. (We note that the Perceptron algorithm is known to achieve $\alpha = 1/\gamma$ [Ser01]. Prior to [BS12], [LS11] gave a $\mathrm{poly}(d, 1/\epsilon, 1/\gamma)$ time algorithm achieving $\alpha = \Theta((1/\gamma)/\sqrt{\log(1/\gamma)})$.) [BS12] posed as an open question whether their upper bounds for improper learning can also be derived for a proper learner.
A related line of work [KLS09, ABL17, DKK+16, LRV16, DKK+17, DKK+18, DKS18, KKM18, DKS19, DKK+19] has given polynomial time robust estimators for a range of learning tasks. Specifically, [KLS09, ABL17, DKS18, DKK+19] obtained efficient PAC learning algorithms for halfspaces with malicious noise [Val85, KL93], under the assumption that the uncorrupted data comes from a "tame" distribution, e.g., Gaussian or isotropic log-concave. It should be noted that the class of $\gamma$-margin distributions considered in this work is significantly broader and can be far from satisfying the structural properties required in the aforementioned works.
A growing body of theoretical work has focused on adversarially robust learning (e.g., [BLPR19, MHS19, DNV19, Nak19]). In adversarially robust learning, the learner seeks to output a hypothesis with small $\gamma$-robust misclassification error, which for a hypothesis $h$ and a norm $\|\cdot\|$ is typically defined as $\Pr_{(x,y) \sim D}[\exists x' \text{ with } \|x' - x\| \leq \gamma \text{ s.t. } h(x') \neq y]$. 
Notice that when $h$ is a halfspace and $\|\cdot\|$ is the Euclidean norm, the $\gamma$-robust misclassification error coincides with the $\gamma$-margin error in our context. (It should be noted that most of the literature on adversarially robust learning focuses on the $\ell_\infty$-norm.) However, the objectives of the two learning settings are slightly different: in adversarially robust learning, the learner would like to output a hypothesis with small $\gamma$-robust misclassification error, whereas in our context the learner only has to output a hypothesis with small zero-one misclassification error. Nonetheless, as we point out in Remark 1.3, our algorithms can be adapted to provide guarantees in line with the adversarially robust setting as well.
Finally, in the distribution-independent agnostic setting without margin assumptions, there is compelling complexity-theoretic evidence that even weak learning of halfspaces is computationally intractable [GR06, FGKP06, DOSW11, Dan16, BGS18].

3To avoid clutter in the expressions, we will henceforth assume that the failure probability $\tau = 1/10$. Recall that one can always boost the confidence probability with an $O(\log(1/\tau))$ multiplicative overhead.

1.3 Our Contributions

We study the complexity of proper $\alpha$-agnostic learning of $\gamma$-margin halfspaces on the unit ball. Our main result nearly characterizes the complexity of constant factor approximation to this problem:
Theorem 1.1. 
There is an algorithm that uses $O(1/(\epsilon^2\gamma^2))$ samples, runs in time $\mathrm{poly}(d/\epsilon) \cdot 2^{\tilde{O}(1/\gamma^2)}$, and is an $\alpha = 1.01$-agnostic proper learner for $\gamma$-margin halfspaces with confidence probability $9/10$. Moreover, assuming the randomized Exponential Time Hypothesis, any proper learning algorithm that achieves any constant factor approximation has runtime $\mathrm{poly}(d/\epsilon) \cdot \Omega(2^{(1/\gamma)^{2-o(1)}})$.

The reader is referred to Theorem 2.1 for the upper bound and Theorem 3.1 for the lower bound.
First, we note that the approximation ratio of $1.01$ in the theorem statement is not inherent. Our algorithm achieves $\alpha = 1 + \delta$, for any $\delta > 0$, with runtime $\mathrm{poly}(d/\epsilon) \cdot 2^{\tilde{O}(1/(\delta\gamma^2))}$. The runtime of our algorithm significantly improves the runtime of the best known agnostic proper learner [BS00], achieving fixed polynomial dependence on $1/\epsilon$, independent of $\gamma$. This gain in runtime comes at the expense of losing a small constant factor in the error guarantee. It is natural to ask whether there exists a $1$-agnostic proper learner matching the runtime of our Theorem 1.1. In Theorem 3.2, we establish a computational hardness result implying that such an improvement is unlikely.
The runtime dependence of our algorithm scales as $2^{\tilde{O}(1/\gamma^2)}$ (which is nearly best possible for proper learners), as opposed to $2^{\tilde{O}(1/\gamma)}$ for improper learning [SSS09, BS12]. In addition to the interpretability of proper learning, the sample complexity of our algorithm is quadratic in $1/\gamma$ (which is optimal), as opposed to exponential for known improper learners. Moreover, for moderate values of $\gamma$, our algorithm may be faster than known improper learners, as it only uses spectral methods and ERM, as opposed to convex programming. 
Finally, we note that the lower bound part of Theorem 1.1 implies a computational separation between proper and improper learning for this problem.
In addition, we explore the complexity of $\alpha$-agnostic learning for large $\alpha > 1$. The following theorem summarizes our results in this setting:
Theorem 1.2. There is an algorithm that uses $\tilde{O}(1/(\epsilon^2\gamma^2))$ samples, runs in time $\mathrm{poly}(d) \cdot (1/\epsilon)^{\tilde{O}(1/(\alpha\gamma)^2)}$, and is an $\alpha$-agnostic proper learner for $\gamma$-margin halfspaces with confidence probability $9/10$. Moreover, assuming NP $\neq$ RP and the Sliding Scale Conjecture [BGLR94], no $(1/\gamma)^c$-agnostic proper learner runs in $\mathrm{poly}(d, 1/\epsilon, 1/\gamma)$ time, for some (absolute) constant $c > 0$.

The reader is referred to Theorem 3.3 for a more precise statement of the lower bound; the upper bound is deferred to the full version [DKM19]. In summary, we give an $\alpha$-agnostic algorithm with runtime exponential in $1/(\alpha\gamma)^2$, as opposed to $1/\gamma^2$, and we show that achieving $\alpha = (1/\gamma)^{\Omega(1)}$ is computationally hard. (Assuming only NP $\neq$ RP, we can rule out polynomial time $\alpha$-agnostic proper learning for $\alpha = (1/\gamma)^{\frac{1}{\mathrm{polyloglog}(1/\gamma)}}$.)
Remark 1.3. While not stated explicitly in the subsequent analysis, our algorithms (with a slight modification to the associated constant factors) not only give a halfspace $w^*$ with zero-one loss at most $\alpha \cdot \mathrm{OPT}^D_\gamma + \epsilon$, but this guarantee holds for the $0.99\gamma$-margin error4 of $w^*$ as well. Thus, our learning algorithms also work in the adversarially robust setting (under the Euclidean norm) with a small loss in the "robustness parameter" (margin) from the one used to compute the optimum (i.e., $\gamma$) to the one used to measure the error of the output hypothesis (i.e., $0.99\gamma$).

1.4 Our Techniques

Overview of Algorithms. 
For the sake of this intuitive explanation, we provide an overview of our algorithms when the underlying distribution $D$ is explicitly known. The finite sample analysis of our algorithms follows from standard generalization bounds (see Section 2).
Our constant factor approximation algorithm relies on the following observation: Let $w^*$ be the optimal weight vector. The assumption that $|\langle w^*, x \rangle|$ is large for almost all $x$ (by the margin property) implies a relatively strong condition on $w^*$, which will allow us to find a relatively small search space for a near-optimal solution. A first idea is to consider the matrix $M = E_{(x,y) \sim D}[xx^T]$, and note that $w^{*T} M w^* = \Omega(\gamma^2)$. This in turn implies that $w^*$ has a large component on the subspace spanned by the largest $O(1/(\epsilon\gamma^2))$ eigenvalues of $M$. This idea suggests a basic algorithm that computes a net over unit-norm weight vectors on this subspace and outputs the best answer. Unfortunately, this basic algorithm has runtime $\mathrm{poly}(d) \cdot 2^{\tilde{O}(1/(\epsilon\gamma^2))}$. (Details are deferred to the full version [DKM19].)
To obtain our $\mathrm{poly}(d/\epsilon) \cdot 2^{\tilde{O}(1/\gamma^2)}$ time constant factor approximation algorithm (Theorem 1.1), we use a refinement of the above idea. Instead of trying to guess the projection of $w^*$ onto the space of large eigenvectors all at once, we will do so in stages. In particular, it is not hard to see that $w^*$ has a non-trivial projection onto the subspace spanned by the top $O(1/\gamma^2)$ eigenvalues of $M$. If we guess this projection, we will have some approximation to $w^*$, but unfortunately not a sufficiently good one. 

4Here the constant $0.99$ can be replaced by any constant less than one, with an appropriate increase to the algorithm's running time.

However, we note that the difference between $w^*$ and our current hypothesis $w$ will have a large average squared inner product with the misclassified points. This suggests an iterative algorithm that in the $i$-th iteration considers the second moment matrix $M^{(i)}$ of the points not correctly classified by the current hypothesis $\mathrm{sign}(\langle w^{(i)}, x \rangle)$, guesses a vector $u$ in the space spanned by the top few eigenvalues of $M^{(i)}$, and sets $w^{(i+1)} = u + w^{(i)}$. This procedure produces a candidate set of weights with cardinality $2^{\tilde{O}(1/\gamma^2)}$, one of which has the desired misclassification error. This algorithm and its analysis are given in Section 2.
Our general $\alpha$-factor algorithm (Theorem 1.2) relies on approximating the Chow parameters of the target halfspace $f$, i.e., the $d$ numbers $E[f(x)x_i]$, $i \in [d]$. A classical result [Cho61] shows that the exact values of the Chow parameters of a halfspace (over any distribution) uniquely define the halfspace. Although this uniqueness is not very useful in general, the margin assumption implies a relatively strong approximate identifiability result. Combining this with an algorithm of [DDFS14], we can efficiently compute an approximation to the halfspace $f$ given an approximation to its Chow parameters. In particular, if we can approximate the Chow parameters to $\ell_2$-error $\nu \cdot \gamma$, we can approximate $f_{w^*}$ within error $\mathrm{OPT}^D_\gamma + \nu$.
The natural way to approximate the Chow parameters would be by computing the empirical Chow parameters, namely $E_{(x,y) \sim D}[yx]$. In the realizable case, this quantity corresponds exactly to the vector of Chow parameters. Unfortunately, this does not work in the agnostic case, and it can introduce an error of $\omega(\mathrm{OPT}^D_\gamma)$. To overcome this obstacle, we note that in order for a small fraction of errors to introduce a large error in the empirical Chow parameters, it must be the case that there is some direction $w$ in which many of these erroneous points introduce a large error. If we can guess some error that correlates well with $w$ and also guess the correct projection of our Chow parameters onto this vector, we can correct a decent fraction of the error between the empirical and true Chow parameters. We show that by making the correct guesses $\tilde{O}(1/(\gamma\alpha)^2)$ times, we can reduce the empirical error sufficiently so that it can be used to find an accurate hypothesis. Once again, we can compute a hypothesis for each sequence of guesses and return the best one. The formal description of the algorithm and its analysis can be found in the full version of this paper [DKM19].

Overview of Computational Lower Bounds. Our hardness results are shown via two reductions. These reductions take in an instance of a "hard problem" and produce a distribution $D$ on $B_d \times \{\pm 1\}$. If the starting instance is a YES instance of the original problem, then $\mathrm{OPT}^D_\gamma$ is small for an appropriate $\gamma$. On the other hand, if the starting instance is a NO instance of the original problem, then $\mathrm{OPT}^D_{0\text{-}1}$ is large5. As a result, if there is a "too fast" ($\alpha$-)agnostic learner for $\gamma$-margin halfspaces, then we would also get a "too fast" algorithm for the starting problem as well, which would violate the corresponding complexity assumption.
To understand the margin parameter $\gamma$ we can achieve, we need to first understand the problems we start with. For our reductions, the starting problems can be viewed in the following form: select $k$ items from $v_1, \ldots, v_N$ that satisfy certain "local constraints". 
For instance, in our first reduction, the reduction is from the $k$-Clique problem, where we are given a graph $G$ and an integer $k$, and the goal is to determine whether it contains a $k$-clique as a subgraph. For this problem, $v_1, \ldots, v_N$ are the vertices of $G$ and the "local" constraints are that every pair of selected vertices induces an edge.
Roughly speaking, our reduction produces $D$ with dimension $d = N$, with the $i$-th dimension corresponding to $v_i$. The "ideal" solution in the YES case is to set $w_i = \frac{1}{\sqrt{k}}$ iff $v_i$ is selected and set $w_i = 0$ otherwise. In our reductions, the local constraints are expressed using "sparse" sample vectors (with only a constant number of non-zero coordinates, all having the same magnitude). For example, in the case of $k$-Clique, the constraints can be expressed as: for every non-edge $(i, j)$, we must have $\left(\frac{1}{\sqrt{2}}e_i + \frac{1}{\sqrt{2}}e_j\right) \cdot w \leq \frac{1}{\sqrt{2k}}$, where $e_i$ and $e_j$ denote the $i$-th and $j$-th vectors in the standard basis.
A main step in both of our proofs is to show that the reduction still works even when we "shift" the right hand side by a small multiple of $\frac{1}{\sqrt{k}}$. For instance, in the case of $k$-Clique, it is possible to show that, even if we replace $\frac{1}{\sqrt{2k}}$ with say $\frac{0.99}{\sqrt{2k}}$, the correctness of the construction remains, and we also get the added benefit that now the constraints are satisfied by a margin of $\gamma = \Theta(\frac{1}{\sqrt{k}})$ for our ideal solution in the YES case.
In the case of $k$-Clique, the above idea yields a reduction to $1$-agnostic learning of $\gamma$-margin halfspaces with margin $\gamma = \Theta(\frac{1}{\sqrt{k}})$, where the dimension $d$ is $N$ (and $\epsilon = \frac{1}{\mathrm{poly}(N)}$). As a result, if there is an $f(\frac{1}{\gamma}) \cdot \mathrm{poly}(d, \frac{1}{\epsilon})$-time algorithm for the latter for some function $f$, then there also exists a $g(k) \cdot \mathrm{poly}(N)$-time algorithm for $k$-Clique for some function $g$, which is considered unlikely as it would break a widely-believed hypothesis in the area of parameterized complexity.
Ruling out $\alpha$-agnostic learners is slightly more complicated, since we need to produce the "gap" of $\alpha$ between $\mathrm{OPT}^D_\gamma$ in the YES case and $\mathrm{OPT}^D_{0\text{-}1}$ in the NO case. To create such a gap, we appeal to the PCP Theorem [AS98, ALM+98], which can be thought of as an NP-hardness proof of the following "gap version" of 3SAT: given a 3CNF formula as input, distinguish between the case where the formula is satisfiable and the case where the formula is not even $0.9$-satisfiable6. Moreover, further strengthened versions of the PCP Theorem [Din07, MR10] actually imply that this Gap-3SAT problem cannot even be solved in time $O(2^{n^{0.999}})$, where $n$ denotes the number of variables in the formula, assuming the Exponential Time Hypothesis (ETH)7. Once again, (Gap-)3SAT can be viewed in the form of "item selection with local constraints". However, the number of selected items $k$ is now equal to $n$, the number of variables of the formula. With a similar line of reasoning as above, the margin we get is now $\gamma = \Theta(\frac{1}{\sqrt{k}}) = \Theta(\frac{1}{\sqrt{n}})$. As a result, if there is a, say, $2^{(1/\gamma)^{1.99}} \cdot \mathrm{poly}(d, \frac{1}{\epsilon})$-time $\alpha$-agnostic learner for $\gamma$-margin halfspaces (for an appropriate $\alpha$), then there is an $O(2^{n^{0.995}})$-time algorithm for Gap-3SAT, which would violate ETH.
Unfortunately, the above described idea only gives a "gap" $\alpha$ that is slightly larger than $1$, because the gap that we start with in the Gap-3SAT problem is already pretty small. To achieve larger gaps, our actual reduction starts from a generalization of 3SAT called constraint satisfaction problems (CSPs), whose gap problems are hard even for very large gaps. 

5We use $\mathrm{OPT}^D_{0\text{-}1} \stackrel{\mathrm{def}}{=} \min_{w \in \mathbb{R}^d} \mathrm{err}^D_{0\text{-}1}(w)$ to denote the minimum error rate achievable by any halfspace.

This concludes the outline of the main intuitions in our reductions.

1.5 Preliminaries

For $n \in \mathbb{Z}_+$, we denote $[n] \stackrel{\mathrm{def}}{=} \{1, \ldots, n\}$. We will use small boldface characters for vectors and capital boldface characters for matrices. For a vector $x \in \mathbb{R}^d$ and $i \in [d]$, $x_i$ denotes the $i$-th coordinate of $x$, and $\|x\|_2 \stackrel{\mathrm{def}}{=} (\sum_{i=1}^d x_i^2)^{1/2}$ denotes the $\ell_2$-norm of $x$. We will use $\langle x, y \rangle$ for the inner product between $x, y \in \mathbb{R}^d$. For a matrix $M \in \mathbb{R}^{d \times d}$, we will denote by $\|M\|_2$ its spectral norm and by $\mathrm{tr}(M)$ its trace. Let $B_d = \{x \in \mathbb{R}^d : \|x\|_2 \leq 1\}$ be the unit ball and $S^{d-1} = \{x \in \mathbb{R}^d : \|x\|_2 = 1\}$ be the unit sphere in $\mathbb{R}^d$. An origin-centered halfspace is a Boolean-valued function $h_w : \mathbb{R}^d \to \{\pm 1\}$ of the form $h_w(x) = \mathrm{sign}(\langle w, x \rangle)$, where $w \in \mathbb{R}^d$. (Note that we may assume w.l.o.g. that $\|w\|_2 = 1$.) Let $\mathcal{H}_d = \{h_w(x) = \mathrm{sign}(\langle w, x \rangle), w \in \mathbb{R}^d\}$ denote the class of all origin-centered halfspaces on $\mathbb{R}^d$. Finally, we use $e_i$ to denote the $i$-th standard basis vector, i.e., the vector whose $i$-th coordinate is one and the remaining coordinates are zeros.

2 Algorithm for Proper Agnostic Learning of Halfspaces with a Margin

In this section, we show the following theorem, which gives the upper bound part of Theorem 1.1:

6In other words, for any assignment to the variables, at least a $0.1$ fraction of the clauses are unsatisfied.
7ETH states that the exact version of 3SAT cannot be solved in $2^{o(n)}$ time.

Theorem 2.1. Fix $0 < \delta \leq 1$. 
There is an algorithm that uses $O(1/(\epsilon^2\gamma^2))$ samples, runs in time $\mathrm{poly}(d/\epsilon) \cdot 2^{\tilde{O}(1/(\delta\gamma^2))}$, and is a $(1 + \delta)$-agnostic proper learner for $\gamma$-margin halfspaces with confidence probability $9/10$.

Our algorithm in this section produces a finite set of candidate weight vectors and outputs the one with the smallest empirical $\gamma/2$-margin error. For the sake of this intuitive description, we will assume that the algorithm knows the distribution $D$ in question, supported on $B_d \times \{\pm 1\}$. By assumption, there is a unit vector $w^*$ so that $\mathrm{err}^D_\gamma(w^*) \leq \mathrm{OPT}^D_\gamma$.
We note that if a hypothesis $h_w$ defined by vector $w$ has $\gamma/2$-margin error at least $(1 + \delta)\mathrm{OPT}^D_\gamma$, then there must be a large number of points correctly classified with $\gamma$-margin by $h_{w^*}$, but not correctly classified with $\gamma/2$-margin by $h_w$. For all of these points, we must have that $|\langle w^* - w, x \rangle| \geq \gamma/2$. This implies that the $\gamma/2$-margin-misclassified points of $h_w$ have a large covariance in the $w^* - w$ direction. In particular, we have:
Claim 2.2. Let $w \in \mathbb{R}^d$ be such that $\mathrm{err}^D_{\gamma/2}(w) > (1 + \delta)\mathrm{OPT}^D_\gamma$. Let $D'$ be $D$ conditioned on $y\langle w, x \rangle \leq \gamma/2$. Let $M^{D'} = E_{(x,y) \sim D'}[xx^T]$. Then $(w^* - w)^T M^{D'} (w^* - w) \geq \delta\gamma^2/8$.
Proof. We claim that with probability at least $\delta/2$ over $(x, y) \sim D'$ we have that $y\langle w, x \rangle \leq \gamma/2$ and $y\langle w^*, x \rangle \geq \gamma$. To see this, we first note that $\Pr_{(x,y) \sim D'}[y\langle w, x \rangle > \gamma/2] = 0$ holds by definition of $D'$. Hence, we have that
$$\Pr_{(x,y) \sim D'}[y\langle w^*, x \rangle \leq \gamma] \leq \frac{\Pr_{(x,y) \sim D}[y\langle w^*, x \rangle \leq \gamma]}{\Pr_{(x,y) \sim D}[y\langle w, x \rangle \leq \gamma/2]} < \frac{\mathrm{OPT}^D_\gamma}{(1 + \delta)\mathrm{OPT}^D_\gamma} = \frac{1}{1 + \delta} \;.$$
By a union bound, we obtain $\Pr_{(x,y) \sim D'}[(y\langle w, x \rangle > \gamma/2) \cup (y\langle w^*, x \rangle \leq \gamma)] \leq \frac{1}{1 + \delta}$. Therefore, with probability at least $\delta/(1 + \delta) \geq \delta/2$ (since $\delta \leq 1$) over $(x, y) \sim D'$ we have that $y\langle w^* - w, x \rangle \geq \gamma/2$, which implies that $\langle w^* - w, x \rangle^2 \geq \gamma^2/4$. Thus, $(w^* - w)^T M^{D'} (w^* - w) = E_{(x,y) \sim D'}[\langle w^* - w, x \rangle^2] \geq (\delta/2)(\gamma^2/4) = \delta\gamma^2/8$, completing the proof.
Claim 2.2 says that $w^* - w$ has a large component on the large eigenvalues of $M^{D'}$. Building on this claim, we obtain the following result:
Lemma 2.3. Let $w^*, w, M^{D'}$ be as in Claim 2.2. There exists $k \in \mathbb{Z}_+$ so that if $V_k$ is the span of the top $k$ eigenvectors of $M^{D'}$, we have that $\|\mathrm{Proj}_{V_k}(w^* - w)\|_2^2 \geq k\delta\gamma^2/8$.
Proof. Note that the matrix $M^{D'}$ is PSD, and let $\lambda_{\max} = \lambda_1 \geq \lambda_2 \geq \ldots \geq \lambda_d \geq 0$ be its eigenvalues. We will denote by $V_{\geq t}$ the space spanned by the eigenvectors of $M^{D'}$ corresponding to eigenvalues of magnitude at least $t$. Let $d_t = \dim(V_{\geq t})$ be the dimension of $V_{\geq t}$, i.e., the number of $i \in [d]$ with $\lambda_i \geq t$. Since $x$ is supported on the unit ball, for $(x, y) \sim D'$, we have that $\mathrm{tr}(M^{D'}) = E_{(x,y) \sim D'}[\mathrm{tr}(xx^T)] \leq 1$. Since $M^{D'}$ is PSD, we have that $\mathrm{tr}(M^{D'}) = \sum_{i=1}^d \lambda_i$, and we can write
$$1 \geq \mathrm{tr}(M^{D'}) = \sum_{i=1}^d \lambda_i = \sum_{i=1}^d \int_0^{\lambda_{\max}} \mathbf{1}_{\lambda_i \geq t}\, dt = \int_0^{\lambda_{\max}} d_t\, dt \;, \quad (1)$$
where the last equality follows by changing the order of the summation and the integration. If the projection of $(w^* - w)$ onto the $i$-th eigenvector of $M^{D'}$ has $\ell_2$-norm $a_i$, we have that
$$\delta\gamma^2/8 \leq (w^* - w)^T M^{D'} (w^* - w) = \sum_{i=1}^d \lambda_i a_i^2 = \sum_{i=1}^d \int_0^{\lambda_{\max}} a_i^2 \mathbf{1}_{\lambda_i \geq t}\, dt = \int_0^{\lambda_{\max}} \|\mathrm{Proj}_{V_{\geq t}}(w^* - w)\|_2^2\, dt \;, \quad (2)$$
where the first inequality uses Claim 2.2, the first equality is the spectral decomposition of $M^{D'}$, and the last equality follows from the Pythagorean theorem after changing the order of the summation and the integration. Combining (1) and (2), we obtain $\int_0^{\lambda_{\max}} \|\mathrm{Proj}_{V_{\geq t}}(w^* - w)\|_2^2\, dt \geq (\delta\gamma^2/8) \int_0^{\lambda_{\max}} d_t\, dt$. By an averaging argument, there exists $0 \leq t \leq \lambda_{\max}$ such that $\|\mathrm{Proj}_{V_{\geq t}}(w^* - w)\|_2^2 \geq (\delta\gamma^2/8) d_t$. Letting $k = d_t$ and noting that $V_{\geq t} = V_k$ completes the proof.

Lemma 2.3 suggests a method for producing an approximation to $w^*$, or more precisely a vector that produces empirical $\gamma/2$-margin error at most $(1 + \delta)\mathrm{OPT}^D_\gamma$. We start by describing a non-deterministic procedure, which we will then turn into an actual algorithm.
The method proceeds in a sequence of stages. At stage $i$, we have a hypothesis weight vector $w^{(i)}$. (At stage $i = 0$, we start with $w^{(0)} = 0$.) At any stage $i$, if $\mathrm{err}^D_{\gamma/2}(w^{(i)}) \leq (1 + \delta)\mathrm{OPT}^D_\gamma$, then $w^{(i)}$ is a sufficient estimator. 
Otherwise, we consider the matrix $M^{(i)} = E_{(x,y) \sim D^{(i)}}[xx^T]$, where $D^{(i)}$ is $D$ conditioned on $y\langle w^{(i)}, x \rangle \leq \gamma/2$. By Lemma 2.3, we know that for some positive integer value $k^{(i)}$, the projection of $w^* - w^{(i)}$ onto $V_{k^{(i)}}$ has squared norm at least $\delta k^{(i)}\gamma^2/8$.
Let $p^{(i)}$ be this projection. We set $w^{(i+1)} = w^{(i)} + p^{(i)}$. Since the projection of $w^* - w^{(i)}$ and its complement are orthogonal, we have
$$\|w^* - w^{(i+1)}\|_2^2 = \|w^* - w^{(i)}\|_2^2 - \|p^{(i)}\|_2^2 \leq \|w^* - w^{(i)}\|_2^2 - \delta k^{(i)}\gamma^2/8 \;, \quad (3)$$
where the inequality uses the fact that $\|p^{(i)}\|_2^2 \geq k^{(i)}\delta\gamma^2/8$ (as follows from Lemma 2.3). Let $s$ be the total number of stages. We can write
$$1 \geq \|w^* - w^{(0)}\|_2^2 - \|w^* - w^{(s)}\|_2^2 = \sum_{i=0}^{s-1}\left(\|w^* - w^{(i)}\|_2^2 - \|w^* - w^{(i+1)}\|_2^2\right) \geq (\delta\gamma^2/8)\sum_{i=0}^{s-1} k^{(i)} \;,$$
where the first inequality uses that $\|w^* - w^{(0)}\|_2^2 = 1$ and $\|w^* - w^{(s)}\|_2^2 \geq 0$, the second notes the telescoping sum, and the third uses (3). Therefore, $s \leq \sum_{i=0}^{s-1} k^{(i)} \leq 8/(\delta\gamma^2)$. Hence, the above procedure terminates after at most $8/(\delta\gamma^2)$ stages at some $w^{(s)}$ with $\mathrm{err}^D_{\gamma/2}(w^{(s)}) \leq (1 + \delta)\mathrm{OPT}^D_\gamma$.
We now describe how to turn the above procedure into an actual algorithm. Our algorithm tries to simulate the above described procedure by making appropriate guesses. In particular, we start by guessing a sequence of positive integers $k^{(i)}$ whose sum is at most $8/(\delta\gamma^2)$. This can be done in $2^{O(1/(\delta\gamma^2))}$ ways. 
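A single stage of the idealized procedure can be sketched in numpy (a hedged illustration, not the paper's code: the guess of $p^{(i)}$ over a net is replaced by the exact projection of a known `w_star`, and the data is synthetic):

```python
import numpy as np

def stage_update(w, w_star, X, y, gamma, k):
    # one idealized stage: second-moment matrix of gamma/2-margin violators,
    # its top-k eigenspace V_k, and the projection of (w_star - w) onto V_k
    viol = y * (X @ w) <= gamma / 2
    Xv = X[viol]
    M = (Xv.T @ Xv) / max(len(Xv), 1)      # empirical E[x x^T] over violators
    eigvals, eigvecs = np.linalg.eigh(M)   # eigenvalues in ascending order
    V = eigvecs[:, -k:]                    # top-k eigenvectors span V_k
    p = V @ (V.T @ (w_star - w))           # projection of (w_star - w) onto V_k
    return w + p                           # the update w^{(i+1)} = w^{(i)} + p^{(i)}

rng = np.random.default_rng(0)
w_star = np.array([1.0, 0.0, 0.0])
X = rng.normal(size=(200, 3))
X /= np.linalg.norm(X, axis=1, keepdims=True)  # place the samples on the unit sphere
y = np.where(X @ w_star >= 0, 1, -1)
w1 = stage_update(np.zeros(3), w_star, X, y, gamma=0.2, k=1)
```

By the orthogonality in (3), each such step can only shrink $\|w^* - w^{(i)}\|_2$; with $k = d$ the projection recovers $w^*$ exactly, while the actual algorithm instead pays $(1/\gamma)^{O(k^{(i)})}$ guesses per stage for small $k^{(i)}$.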
Next, given this sequence, our algorithm guesses the vectors $w^{(i)}$ over all $s$ stages in order. In particular, given $w^{(i)}$, the algorithm computes the matrix $M^{(i)}$ and the subspace $V_{k^{(i)}}$, and guesses the projection $p^{(i)} \in V_{k^{(i)}}$, which then gives $w^{(i+1)}$. Of course, we cannot expect our algorithm to guess $p^{(i)}$ exactly (as there are infinitely many points in $V_{k^{(i)}}$), but we can guess it to within $\ell_2$-error $\mathrm{poly}(\gamma)$, by taking an appropriate net. This involves an additional guess of size $(1/\gamma)^{O(k^{(i)})}$ in each stage. In total, our algorithm makes $2^{\tilde{O}(1/(\delta\gamma^2))}$ many different guesses.

We note that the sample version of our algorithm is essentially identical to the idealized version described above, by replacing the distribution $D$ by its empirical version and leveraging the following statistical bound:

Fact 2.4 ([BM02, McA03]). Let $S = \{(x^{(i)}, y^{(i)})\}_{i=1}^m$ be a multiset of i.i.d. samples from $D$, where $m = \Omega(\log(1/\tau)/(\epsilon^2\gamma^2))$, and let $\hat{D}_m$ be the empirical distribution on $S$. Then with probability at least $1-\tau$ over $S$, simultaneously for all unit vectors $w$ and margins $\gamma > 0$, if $h_w(x) = \mathrm{sign}(\langle w, x\rangle)$, we have that $\mathrm{err}^D_{0-1}(h_w) \le \mathrm{err}^{\hat{D}_m}_{\gamma}(w) + \epsilon$.

The pseudo-code of our algorithm is given below in Algorithm 1.

To show the correctness of the algorithm, we begin by noting that the set $C$ of candidate weight vectors produced has size $2^{\tilde{O}(1/(\delta\gamma^2))}$. This is because there are only $2^{O(1/(\delta\gamma^2))}$ many possibilities for the sequence of $k^{(i)}$, and for each such sequence the product of the sizes of the $C^{(i)}$ is $(1/(\delta\gamma))^{O(\sum k^{(i)})} = 2^{\tilde{O}(1/(\delta\gamma^2))}$. We note that, by the aforementioned analysis, for any choice of $k^{(0)}, \ldots$
, $k^{(i-1)}$ and $w^{(i)}$, we either have that $\mathrm{err}^{\hat{D}_m}_{\gamma/2}(w^{(i)}) \le (1+\delta)\mathrm{OPT}^{\hat{D}_m}_{\gamma}$, or there is a choice of $k^{(i)}$ and $p^{(i)} \in C^{(i)}$ such that

$$\|w^* - w^{(i)} - p^{(i)}\|_2^2 \le \|w^* - w^{(i)}\|_2^2 - \delta k^{(i)} \gamma^2/8 + O(\delta^2\gamma^6) \;,$$

where we used (3) and the fact that $C^{(i)}$ is a $\delta\gamma^3$-cover of $V_{k^{(i)}}$. Following the execution path of the algorithm, we either find some $w^{(i)}$ with $\mathrm{err}^{\hat{D}_m}_{\gamma/2}(w^{(i)}) \le (1+\delta)\mathrm{OPT}^{\hat{D}_m}_{\gamma}$, or we find a $w^{(i)}$ with

$$\|w^* - w^{(i)}\|_2^2 \le 1 - \Big( \sum_{j=0}^{i-1} k^{(j)} \Big) \delta\gamma^2/8 + O(\delta\gamma^4) \;,$$

where the last term is an upper bound for $\big(\sum_{j=0}^{i-1} k^{(j)}\big) \cdot O(\delta^2\gamma^6)$. Note that this sequence terminates in at most $O(1/(\delta\gamma^2))$ stages, since it is impossible to have $\sum_j k^{(j)} > 8/(\delta\gamma^2) + 1$. Thus, the output of our algorithm must contain some weight vector $v$ with $\mathrm{err}^{\hat{D}_m}_{\gamma/2}(v) \le (1+\delta)\mathrm{OPT}^{\hat{D}_m}_{\gamma}$. The proof now follows by an application of Fact 2.4. This completes the proof of Theorem 2.1.

Algorithm 1 Near-Optimal $(1+\delta)$-Agnostic Proper Learner
1: Draw a multiset $S = \{(x^{(i)}, y^{(i)})\}_{i=1}^m$ of i.i.d. samples from $D$, where $m = \Omega(\log(1/\tau)/(\epsilon^2\gamma^2))$.
2: Let $\hat{D}_m$ be the empirical distribution on $S$.
3: for all sequences $k^{(0)}, k^{(1)}, \ldots, k^{(s-1)}$ of positive integers with sum at most $8/(\delta\gamma^2) + 2$ do
4:    Let $w^{(0)} = 0$.
5:    for $i = 0, 1, \ldots, s-1$ do
6:       Let $D^{(i)}$ be $\hat{D}_m$ conditioned on $y\langle w^{(i)}, x\rangle \le \gamma/2$.
7:       Let $M^{(i)} = \mathbf{E}_{(x,y)\sim D^{(i)}}[xx^T]$.
8:       Use SVD on $M^{(i)}$ to find a basis for $V_{k^{(i)}}$, the span of the top $k^{(i)}$ eigenvectors.
9:       Let $C^{(i)}$ be a $\delta\gamma^3$-cover, in $\ell_2$-norm, of $V_{k^{(i)}} \cap B_d$ of size $(1/(\delta\gamma))^{O(k^{(i)})}$.
10:      For each $p^{(i)} \in C^{(i)}$, repeat the next step of the for loop with $w^{(i+1)} = w^{(i)} + p^{(i)}$.
11:   end for
12: end for
13: Let $C$ denote the set of all $w^{(i)}$ generated in the above loop.
14: Let $v \in \mathrm{argmin}_{w \in C} \, \mathrm{err}^{\hat{D}_m}_{\gamma/2}(w)$.
15: return the hypothesis $h_v(x) = \mathrm{sign}(\langle v, x\rangle)$.

3 Computational Hardness Results

In this section, we provide several computational lower bounds for agnostic learning of halfspaces with a margin. To clarify the statements below, we note that we say "there is no algorithm that runs in time $T(d, 1/\gamma, 1/\epsilon)$" to mean that no $T(d, 1/\gamma, 1/\epsilon)$-time algorithm works for all combinations of parameters $d$, $\gamma$ and $\epsilon$. (Note that we discuss the lower bounds with stronger quantifiers in the full version [DKM19].)
Moreover, we also ignore the dependency on $\tau$ (the probability that the learner can be incorrect), since we only use a fixed $\tau$ (say $1/3$) in all the bounds below.

First, we show that, for any constant $\alpha > 1$, $\alpha$-agnostic learning of $\gamma$-margin halfspaces requires $2^{(1/\gamma)^{2-o(1)}} \cdot \mathrm{poly}(d, 1/\epsilon)$ time. Up to the lower-order term $(1/\gamma)^{o(1)}$ in the exponent, this matches our algorithm (in Theorem 2.1). In fact, we show an even stronger result: if the dependency of the running time on the margin is, say, $2^{(1/\gamma)^{1.99}}$, then one has to pay $2^{d^{1-o(1)}}$ in the running time. This result holds assuming the so-called (randomized) Exponential Time Hypothesis (ETH) [IP01, IPZ01], which postulates that there is no (randomized) algorithm that can solve 3SAT in time $2^{o(n)}$, where $n$ denotes the number of variables. ETH is a standard hypothesis used in proving (tight) running time lower bounds. We do not discuss ETH further here, but interested readers may refer to a survey by Lokshtanov et al. [LMS11] for an in-depth discussion and several applications of ETH.

Our first lower bound can be stated more precisely as follows:

Theorem 3.1. Assuming the (randomized) ETH, for any universal constant $\alpha \ge 1$, there is no proper $\alpha$-agnostic learner for $\gamma$-margin halfspaces that runs in time $O(2^{(1/\gamma)^{2-o(1)}} 2^{d^{1-o(1)}}) \cdot f(1/\epsilon)$ for any function $f$.

Secondly, we address the question of whether we can achieve $\alpha = 1$ (standard agnostic learning) while retaining a running time similar to our algorithm. We answer this in the negative (assuming a standard parameterized complexity assumption): there is no $f(1/\gamma) \cdot \mathrm{poly}(d, 1/\epsilon)$-time 1-agnostic learner for any function $f$ (e.g., even for $f(1/\gamma) = 2^{2^{2^{1/\gamma}}}$). This demonstrates a stark contrast between what we can achieve with and without approximation.

Theorem 3.2.
Assuming W[1] is not contained in randomized FPT, there is no proper 1-agnostic learner for $\gamma$-margin halfspaces that runs in time $f(1/\gamma) \cdot \mathrm{poly}(d, 1/\epsilon)$ for any function $f$.

Finally, we explore the other extreme of the trade-off between the running time and the approximation ratio, by asking: what is the best approximation ratio we can achieve if we only consider proper learners that run in $\mathrm{poly}(d, 1/\epsilon, 1/\gamma)$ time? On this front, it is known [Ser01] that the perceptron algorithm achieves a $1/\gamma$-approximation. We show that a significant improvement over this is unlikely, by showing that a $(1/\gamma)^{1/\mathrm{polyloglog}(1/\gamma)}$-approximation is not possible unless NP = RP. If we additionally assume the so-called Sliding Scale Conjecture [BGLR94], this ratio can be improved to $(1/\gamma)^c$ for some constant $c > 0$.

Theorem 3.3. Assuming NP $\ne$ RP, there is no proper $(1/\gamma)^{1/\mathrm{polyloglog}(1/\gamma)}$-agnostic learner for $\gamma$-margin halfspaces that runs in time $\mathrm{poly}(d, 1/\epsilon, 1/\gamma)$. Furthermore, assuming NP $\ne$ RP and the Sliding Scale Conjecture [BGLR94], there is no proper $(1/\gamma)^c$-agnostic learner for $\gamma$-margin halfspaces that runs in time $\mathrm{poly}(d, 1/\epsilon, 1/\gamma)$ for some constant $c > 0$.

Due to the technical nature of the Sliding Scale Conjecture, we do not state it in full here; please refer to the full version for a formal statement [DKM19].

We note here that the constant $c$ in Theorem 3.3 is not explicit, i.e., it depends on the constant from the Sliding Scale Conjecture (SSC). Moreover, even when assuming the most optimistic parameters of SSC, the constant $c$ we can get is still very small. For instance, it is still possible that, say, a $\sqrt{1/\gamma}$-agnostic learning algorithm that runs in polynomial time exists, and this remains an interesting open question. We remark that Daniely et al.
[DLS14] have made partial progress in this direction by showing that any $\mathrm{poly}(d, 1/\epsilon, 1/\gamma)$-time learner that belongs to a "generalized linear family" cannot achieve an approximation ratio $\alpha$ better than $\Omega\big(\frac{1/\gamma}{\mathrm{polylog}(1/\gamma)}\big)$. We note that the inapproximability ratio of [DLS14] is close to being tight for a natural, yet restricted, family of improper learners. On the other hand, our proper hardness result holds against all proper learners under a widely believed worst-case complexity assumption.

Due to space limitations, the proofs of our hardness results are deferred to the full version of this work [DKM19].

4 Conclusions and Open Problems

This work gives nearly tight upper and lower bounds for the problem of $\alpha$-agnostic proper learning of halfspaces with a margin, for $\alpha = O(1)$. Our upper and lower bounds for $\alpha = \omega(1)$ are far from tight; closing this gap is an interesting open problem. Characterizing the fine-grained complexity of the problem for improper learning algorithms remains a challenging open question.

More broadly, an interesting direction for future work would be to generalize our agnostic learning results to broader classes of geometric functions. Finally, we believe that finding further connections between the problem of agnostic learning with a margin and adversarially robust learning is an intriguing direction to be explored.

Acknowledgments

Part of this work was performed while Ilias Diakonikolas was at the Simons Institute for the Theory of Computing during the program on Foundations of Data Science. Ilias Diakonikolas is supported by NSF Award CCF-1652862 (CAREER) and a Sloan Research Fellowship. Daniel M. Kane is supported by NSF Award CCF-1553288 (CAREER) and a Sloan Research Fellowship.

References

[ABL17] P. Awasthi, M. F. Balcan, and P. M. Long.
The power of localization for ef\ufb01ciently\n\nlearning linear separators with noise. J. ACM, 63(6):50:1\u201350:27, 2017.\n\n[ALM+98] S. Arora, C. Lund, R. Motwani, M. Sudan, and M. Szegedy. Proof veri\ufb01cation and the\n\nhardness of approximation problems. J. ACM, 45(3):501\u2013555, 1998.\n\n[AS98] S. Arora and S. Safra. Probabilistic checking of proofs: A new characterization of NP.\n\nJ. ACM, 45(1):70\u2013122, 1998.\n\n[BGLR94] M. Bellare, S. Goldwasser, C. Lund, and A. Russell. Ef\ufb01cient probabilistic checkable\nproofs and applications to approximation. In Proceedings of the Twenty-Sixth Annual\nACM Symposium on Theory of Computing, page 820, 1994.\n\n[BGS18] A. Bhattacharyya, S. Ghoshal, and R. Saket. Hardness of learning noisy halfspaces\nusing polynomial thresholds. In Conference On Learning Theory, COLT 2018, pages\n876\u2013917, 2018.\n\n[BLPR19] S. Bubeck, Y. T. Lee, E. Price, and I. P. Razenshteyn. Adversarial examples from\ncomputational constraints. In Proceedings of the 36th International Conference on\nMachine Learning, ICML 2019, pages 831\u2013840, 2019.\n\n[BM02] P. L. Bartlett and S. Mendelson. Rademacher and gaussian complexities: Risk bounds\n\nand structural results. Journal of Machine Learning Research, 3:463\u2013482, 2002.\n\n[BS00] S. Ben-David and H. Ulrich Simon. Ef\ufb01cient learning of linear perceptrons. In Advances\n\nin Neural Information Processing Systems (NIPS) 2000, pages 189\u2013195, 2000.\n\n[BS12] A. Birnbaum and S. Shalev-Shwartz. Learning halfspaces with the zero-one loss: Time-\naccuracy tradeoffs. In Advances in Neural Information Processing Systems 25: NIPS\n2012, pages 935\u2013943, 2012.\n\n[Cho61] C.K. Chow. On the characterization of threshold functions. In Proceedings of the\nSymposium on Switching Circuit Theory and Logical Design (FOCS), pages 34\u201338,\n1961.\n\n[Dan16] A. Daniely. Complexity theoretic limitations on learning halfspaces. 
In Proceedings of the 48th Annual Symposium on Theory of Computing, STOC 2016, pages 105–117, 2016.

[DDFS14] A. De, I. Diakonikolas, V. Feldman, and R. A. Servedio. Nearly optimal solutions for the Chow parameters problem and low-weight approximation of halfspaces. J. ACM, 61(2):11:1–11:36, 2014.

[Din07] I. Dinur. The PCP theorem by gap amplification. J. ACM, 54(3):12, 2007.

[DKK+16] I. Diakonikolas, G. Kamath, D. M. Kane, J. Li, A. Moitra, and A. Stewart. Robust estimators in high dimensions without the computational intractability. In Proceedings of FOCS'16, pages 655–664, 2016.

[DKK+17] I. Diakonikolas, G. Kamath, D. M. Kane, J. Li, A. Moitra, and A. Stewart. Being robust (in high dimensions) can be practical. In Proceedings of the 34th International Conference on Machine Learning, ICML 2017, pages 999–1008, 2017.

[DKK+18] I. Diakonikolas, G. Kamath, D. M. Kane, J. Li, A. Moitra, and A. Stewart. Robustly learning a Gaussian: Getting optimal error, efficiently. In Proceedings of the Twenty-Ninth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA 2018, pages 2683–2702, 2018.

[DKK+19] I. Diakonikolas, G. Kamath, D. Kane, J. Li, J. Steinhardt, and A. Stewart. Sever: A robust meta-algorithm for stochastic optimization. In Proceedings of the 36th International Conference on Machine Learning, ICML 2019, pages 1596–1606, 2019.

[DKM19] I. Diakonikolas, D. M. Kane, and P. Manurangsi. Nearly tight bounds for robust proper learning of halfspaces with a margin. CoRR, abs/1908.11335, 2019.

[DKS18] I. Diakonikolas, D. M. Kane, and A. Stewart. Learning geometric concepts with nasty noise. In Proceedings of the 50th Annual ACM SIGACT Symposium on Theory of Computing, STOC 2018, pages 1061–1073, 2018.

[DKS19] I. Diakonikolas, W. Kong, and A. Stewart. Efficient algorithms and lower bounds for robust linear regression.
In Proceedings of the Thirtieth Annual ACM-SIAM Symposium\non Discrete Algorithms, SODA 2019, pages 2745\u20132754, 2019.\n\n[DLS14] A. Daniely, N. Linial, and S. Shalev-Shwartz. The complexity of learning halfspaces\nusing generalized linear methods. In Proceedings of The 27th Conference on Learning\nTheory, COLT 2014, pages 244\u2013286, 2014.\n\n[DNV19] Akshay Degwekar, Preetum Nakkiran, and Vinod Vaikuntanathan. Computational\nlimitations in robust classi\ufb01cation and win-win results. In Conference on Learning\nTheory, COLT 2019, 25-28 June 2019, Phoenix, AZ, USA, pages 994\u20131028, 2019.\n\n[DOSW11] I. Diakonikolas, R. O\u2019Donnell, R. Servedio, and Y. Wu. Hardness results for agnostically\nlearning low-degree polynomial threshold functions. In SODA, pages 1590\u20131606, 2011.\n\n[FGKP06] V. Feldman, P. Gopalan, S. Khot, and A. Ponnuswami. New results for learning noisy\n\nparities and halfspaces. In Proc. FOCS, pages 563\u2013576, 2006.\n\n[FS97] Y. Freund and R. Schapire. A decision-theoretic generalization of on-line learning and\nan application to boosting. Journal of Computer and System Sciences, 55(1):119\u2013139,\n1997.\n\n[GR06] V. Guruswami and P. Raghavendra. Hardness of learning halfspaces with noise. In Proc.\n47th IEEE Symposium on Foundations of Computer Science (FOCS), pages 543\u2013552.\nIEEE Computer Society, 2006.\n\n[Hau92] D. Haussler. Decision theoretic generalizations of the PAC model for neural net and\n\nother learning applications. Information and Computation, 100:78\u2013150, 1992.\n\n[IP01] R. Impagliazzo and R. Paturi. On the complexity of k-sat. J. Comput. Syst. Sci.,\n\n62(2):367\u2013375, 2001.\n\n[IPZ01] R. Impagliazzo, R. Paturi, and F. Zane. Which problems have strongly exponential\n\ncomplexity? J. Comput. Syst. Sci., 63(4):512\u2013530, 2001.\n\n[KKM18] A. R. Klivans, P. K. Kothari, and R. Meka. Ef\ufb01cient algorithms for outlier-robust\nregression. 
In Conference On Learning Theory, COLT 2018, pages 1420–1430, 2018.

[KL93] M. J. Kearns and M. Li. Learning in the presence of malicious errors. SIAM Journal on Computing, 22(4):807–837, 1993.

[KLS09] A. Klivans, P. Long, and R. Servedio. Learning halfspaces with malicious noise. In Proc. 36th Internat. Colloq. on Automata, Languages and Programming (ICALP), 2009.

[KSS94] M. Kearns, R. Schapire, and L. Sellie. Toward Efficient Agnostic Learning. Machine Learning, 17(2/3):115–141, 1994.

[LMS11] D. Lokshtanov, D. Marx, and S. Saurabh. Lower bounds based on the exponential time hypothesis. Bulletin of the EATCS, 105:41–72, 2011.

[LRV16] K. A. Lai, A. B. Rao, and S. Vempala. Agnostic estimation of mean and covariance. In Proceedings of FOCS'16, 2016.

[LS11] P. Long and R. Servedio. Learning large-margin halfspaces with more malicious noise. NIPS, 2011.

[McA03] D. A. McAllester. Simplified PAC-Bayesian margin bounds. In 16th Annual Conference on Computational Learning Theory, pages 203–215, 2003.

[MHS19] O. Montasser, S. Hanneke, and N. Srebro. VC classes are adversarially robustly learnable, but only improperly. In Conference on Learning Theory, COLT 2019, pages 2512–2530, 2019.

[MR10] D. Moshkovitz and R. Raz. Two-query PCP with subconstant error. J. ACM, 57(5):29:1–29:29, 2010.

[Nak19] P. Nakkiran. Adversarial robustness may be at odds with simplicity. CoRR, abs/1901.00532, 2019.

[Ros58] F. Rosenblatt. The Perceptron: a probabilistic model for information storage and organization in the brain. Psychological Review, 65:386–407, 1958.

[Ser01] R. Servedio. Smooth boosting and learning with malicious noise. In Proceedings of the Fourteenth Annual Conference on Computational Learning Theory, pages 473–489, 2001.

[SSS09] S. Shalev-Shwartz, O. Shamir, and K. Sridharan. Agnostically learning halfspaces with margin errors.
Technical report, Toyota Technological Institute, 2009.

[SSS10] S. Shalev-Shwartz, O. Shamir, and K. Sridharan. Learning kernel-based halfspaces with the zero-one loss. In The 23rd Conference on Learning Theory, COLT 2010, pages 441–450, 2010.

[Val85] L. Valiant. Learning disjunctions of conjunctions. In Proc. 9th IJCAI, pages 560–566, 1985.

[Vap98] V. Vapnik. Statistical Learning Theory. Wiley-Interscience, New York, 1998.