{"title": "Overfitting or perfect fitting? Risk bounds for classification and regression rules that interpolate", "book": "Advances in Neural Information Processing Systems", "page_first": 2300, "page_last": 2311, "abstract": "Many modern machine learning models are trained to achieve zero or near-zero training error in order to obtain near-optimal (but non-zero) test error. This phenomenon of strong generalization performance for ``overfitted'' / interpolated classifiers appears to be ubiquitous in high-dimensional data, having been observed in deep networks, kernel machines, boosting and random forests. Their performance is consistently robust even when the data contain large amounts of label noise. \n\nVery little theory is available to explain these observations. The vast majority of theoretical analyses of generalization allows for interpolation only when there is little or no label noise. This paper takes a step toward a theoretical foundation for interpolated classifiers by analyzing local interpolating schemes, including geometric simplicial interpolation algorithm and singularly weighted $k$-nearest neighbor schemes. Consistency or near-consistency is proved for these schemes in classification and regression problems. Moreover, the nearest neighbor schemes exhibit optimal rates under some standard statistical assumptions.\n\nFinally, this paper suggests a way to explain the phenomenon of adversarial examples, which are seemingly ubiquitous in modern machine learning, and also discusses some connections to kernel machines and random forests in the interpolated regime.", "full_text": "Over\ufb01tting or perfect \ufb01tting? Risk bounds for\n\nclassi\ufb01cation and regression rules that interpolate\n\nMikhail Belkin\n\nThe Ohio State University\n\nDaniel Hsu\n\nColumbia University\n\nPartha P. 
Mitra
Cold Spring Harbor Laboratory

Abstract

Many modern machine learning models are trained to achieve zero or near-zero training error in order to obtain near-optimal (but non-zero) test error. This phenomenon of strong generalization performance for "overfitted" / interpolated classifiers appears to be ubiquitous in high-dimensional data, having been observed in deep networks, kernel machines, boosting and random forests. Their performance is consistently robust even when the data contain large amounts of label noise. Very little theory is available to explain these observations. The vast majority of theoretical analyses of generalization allows for interpolation only when there is little or no label noise. This paper takes a step toward a theoretical foundation for interpolated classifiers by analyzing local interpolating schemes, including a geometric simplicial interpolation algorithm and singularly weighted k-nearest neighbor schemes. Consistency or near-consistency is proved for these schemes in classification and regression problems. Moreover, the nearest neighbor schemes exhibit optimal rates under some standard statistical assumptions. Finally, this paper suggests a way to explain the phenomenon of adversarial examples, which are seemingly ubiquitous in modern machine learning, and also discusses some connections to kernel machines and random forests in the interpolated regime.

1 Introduction

The central problem of supervised inference is to predict labels of unseen data points from a set of labeled training data. The literature on this subject is vast, ranging from classical parametric and non-parametric statistics [48, 49] to more recent machine learning methods, such as kernel machines [39], boosting [36], random forests [15], and deep neural networks [25]. 
There is a wealth of theoretical analyses for these methods based on a spectrum of techniques including non-parametric estimation [46], capacity control such as VC-dimension or Rademacher complexity [40], and regularization theory [42]. In nearly all of these results, theoretical analysis of generalization requires a "what you see is what you get" setup, where prediction performance on unseen test data is close to the performance on the training data, achieved by carefully managing the bias-variance trade-off. Furthermore, it is widely accepted in the literature that interpolation has poor statistical properties and should be dismissed out of hand. For example, in their book on non-parametric statistics, Györfi et al. [26, page 21] say that a certain procedure "may lead to a function which interpolates the data and hence is not a reasonable estimate".

Yet, this is not how many modern machine learning methods are used in practice. For instance, the best practice for training deep neural networks is to first perfectly fit the training data [35]. The resulting (zero training loss) neural networks after this first step can already have good performance on test data [53]. Similar observations about models that perfectly fit training data have been made for other machine learning methods, including boosting [37], random forests [19], and kernel machines [12]. These methods return good classifiers even when the training data have high levels of label noise [12, 51, 53].

E-mail: mbelkin@cse.ohio-state.edu, djhsu@cs.columbia.edu, mitra@cshl.edu
32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montréal, Canada.

An important effort to show that fitting the training data exactly can, under certain conditions, be theoretically justified is the margins theory for boosting [37] and other margin-based methods [6, 24, 28, 29, 34]. 
However, this theory lacks explanatory power for the performance of classifiers that perfectly fit noisy labels, when it is known that no margin is present in the data [12, 51]. Moreover, margins theory does not apply to regression or to functions (for regression or classification) that interpolate the data in the classical sense [12].

In this paper, we identify the challenge of providing a rigorous understanding of generalization in machine learning models that interpolate training data. We take first steps towards such a theory by proposing and analyzing interpolating methods for classification and regression with non-trivial risk and consistency guarantees.

Related work. Many existing forms of generalization analyses face significant analytical and conceptual barriers to explaining the success of interpolating methods.

Capacity control. Existing capacity-based bounds (e.g., VC dimension, fat-shattering dimension, Rademacher complexity) for empirical risk minimization [3, 4, 7, 28, 37] do not give useful risk bounds for functions with zero empirical risk whenever there is non-negligible label noise. This is because function classes rich enough to perfectly fit noisy training labels generally have capacity measures that grow quickly with the number of training data, at least with the existing notions of capacity [12]. Note that since the training risk is zero for the functions of interest, the generalization bound must bound their true risk, as it equals the generalization gap (the difference between the true and empirical risk). Whether such capacity-based generalization bounds exist is open for debate.

Stability. Generalization analyses based on algorithmic stability [8, 14] control the difference between the true risk and the training risk, assuming bounded sensitivity of an algorithm's output to small changes in training data. 
Like standard uses of capacity-based bounds, these approaches are not well-suited to settings where the training risk is identically zero but the true risk is non-zero.

Regularization. Many analyses are available for regularization approaches to statistical inverse problems, ranging from Tikhonov regularization to early stopping [9, 16, 42, 52]. To obtain a risk bound, these analyses require the regularization parameter λ (or some analogous quantity) to approach zero as the number of data n tends to infinity. However, to get (minimum norm) interpolation, we need λ → 0 while n is fixed, causing the bounds to diverge.

Smoothing. There is an extensive literature on local prediction rules in non-parametric statistics [46, 49]. Nearly all of these analyses require local smoothing (to explicitly balance bias and variance) and thus do not apply to interpolation. (Two exceptions are discussed below.)

Recently, Wyner et al. [51] proposed a thought-provoking explanation for the performance of AdaBoost and random forests in the interpolation regime, based on ideas related to "self-averaging" and localization. However, a theoretical basis for these ideas is not developed in their work.

There are two important exceptions to the aforementioned discussion of non-parametric methods. First, the nearest neighbor rule (also called 1-nearest neighbor, in the context of the general family of k-nearest neighbor rules) is a well-known interpolating classification method, though it is not generally consistent for classification (and is not useful for regression when there is a significant amount of label noise). Nevertheless, its asymptotic risk can be shown to be bounded above by twice the Bayes risk [18]. A second important (though perhaps less well-known) exception is the non-parametric smoothing method of Devroye et al. [21] based on a singular kernel called the Hilbert kernel (which is related to Shepard's method [41]). 
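To make the idea of a singular kernel concrete, here is a minimal pure-Python sketch of Shepard-style inverse-distance weighting in one dimension. The singular weight ‖x − xᵢ‖^(−power) forces the estimate to pass exactly through every training point while still averaging all labels elsewhere. The function name, the 1-d setting, and the exponent are our own illustrative choices, not the exact estimator of Devroye et al.

```python
def shepard_estimate(x, data, power=1.0):
    """Weighted average of labels with singular weights |x - x_i|**(-power).

    At a training point, the singularity makes the estimate return that
    point's label exactly, i.e., the estimator interpolates the data.
    """
    num = den = 0.0
    for xi, yi in data:
        d = abs(x - xi)
        if d == 0.0:            # exactly at a training point: interpolate
            return yi
        w = d ** (-power)
        num += w * yi
        den += w
    return num / den

data = [(0.0, 0.0), (0.5, 1.0), (1.0, 0.0)]
assert shepard_estimate(0.5, data) == 1.0      # interpolates the data
assert 0.0 < shepard_estimate(0.25, data) < 1.0  # blends labels elsewhere
```

Despite interpolating every (possibly noisy) label, the estimate away from the training points is a smooth blend of all labels, which is the intuition behind the consistency results discussed next.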
The resulting estimate of the regression function interpolates the training data, yet is proved to be consistent for classification and regression. (More precisely, regarding the nearest neighbor rule: its expected risk converges to E[2η(X)(1 − η(X))], where η is the regression function; this quantity can be bounded above by 2R*(1 − R*), where R* is the Bayes risk.)

The analyses of the nearest neighbor rule and the Hilbert kernel regression estimate are not based on bounding the generalization gap, the difference between the true risk and the empirical risk. Rather, the true risk is analyzed directly by exploiting locality properties of the prediction rules. In particular, the prediction at a point depends primarily or entirely on the values of the function at nearby points. This inductive bias favors functions where local information in a neighborhood can be aggregated to give an accurate representation of the underlying regression function.

What we do. Our approach to understanding the generalization properties of interpolation methods is to understand and isolate the key properties of local classification, particularly the nearest neighbor rule. First, we construct and analyze an interpolating function based on multivariate triangulation and linear interpolation on each simplex (Section 3), which results in a geometrically intuitive and theoretically tractable prediction rule. Like nearest neighbor, this method is not statistically consistent, but, unlike nearest neighbor, its asymptotic risk approaches the Bayes risk as the dimension becomes large, even when the Bayes risk is far from zero, a kind of "blessing of dimensionality". 
Moreover, under an additional margin condition, the difference between the Bayes risk and the risk of our classifier is exponentially small in the dimension. A similar finding holds for regression, as the method is nearly consistent when the dimension is high.

Next, we propose a weighted & interpolated nearest neighbor (wiNN) scheme based on singular weight functions (Section 4). The resulting function is somewhat less natural than that obtained by simplicial interpolation, but, like the Hilbert kernel regression estimate, the prediction rule is statistically consistent in any dimension. Interestingly, conditions on the weights to ensure consistency become less restrictive in higher dimension, another "blessing of dimensionality". Our analysis provides the first known non-asymptotic rates of convergence to the Bayes risk for an interpolated predictor, as well as tighter bounds under margin conditions for classification. In fact, the rate achieved by wiNN regression is statistically optimal under a standard minimax setting.

Our results also suggest an explanation for the phenomenon of adversarial examples [44], which are seemingly ubiquitous in modern machine learning. In Section 5, we argue that interpolation inevitably results in adversarial examples in the presence of any amount of label noise. When these schemes are consistent or nearly consistent, the set of adversarial examples (where the interpolating classifier disagrees with the Bayes optimal) has small measure but is asymptotically dense. 
Our analysis is consistent with the empirical observations that such examples are difficult to find by random sampling [22], but are easily discovered using targeted optimization procedures, such as Projected Gradient Descent [30].

Finally, we discuss the difference between direct and inverse interpolation schemes, and make some connections to kernel machines and random forests (Section 6). Proofs of the main results, along with an informal discussion of some connections to graph-based semi-supervised learning, are given in the full version of the paper [11] on arXiv.

2 Preliminaries

The goal of regression and classification is to construct a predictor f̂, given labeled training data (x₁, y₁), …, (xₙ, yₙ) ∈ ℝᵈ × ℝ, that performs well on unseen test data, which are typically assumed to be sampled from the same distribution as the training data. In this work, we focus on interpolating methods that construct predictors f̂ satisfying f̂(xᵢ) = yᵢ for all i = 1, …, n.

Algorithms that perfectly fit training data are not common in the statistical and machine learning literature. The prominent exception is the nearest neighbor rule, which is among the oldest and best-understood classification methods. Given a training set of labeled examples, the nearest neighbor rule predicts the label of a new point x to be the same as that of the nearest point to x within the training set. Mathematically, the predicted label of x ∈ ℝᵈ is yᵢ, where i ∈ argmin_{i′=1,…,n} ‖x − xᵢ′‖. (Here, ‖·‖ always denotes the Euclidean norm.) As discussed above, the classification risk of the nearest neighbor rule is asymptotically bounded by twice the Bayes (optimal) risk [18]. 
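The nearest neighbor rule just described can be sketched in a few lines; the helper name and the toy data are ours, but the logic is exactly the 1-NN rule above (and, by construction, it interpolates: at a training point the nearest neighbor is the point itself).

```python
# Minimal sketch of the (interpolating) 1-nearest neighbor rule:
# predict the label of the closest training point in Euclidean distance.
import math

def nearest_neighbor_predict(x, data):
    """Return y_i for the training point x_i closest to x."""
    return min(data, key=lambda p: math.dist(x, p[0]))[1]

data = [((0.0, 0.0), 0), ((1.0, 0.0), 1), ((0.0, 1.0), 1)]
assert nearest_neighbor_predict((0.1, 0.1), data) == 0
assert nearest_neighbor_predict((0.9, 0.1), data) == 1
```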
The nearest neighbor rule provides an important intuition that such classifiers can (and perhaps should) be constructed using local information in the feature space.

(Notes to the preceding discussion: The "blessing of dimensionality" above does not remove the usual curse of dimensionality, which is similar to the standard analyses of k-NN and other non-parametric methods. Regarding minimax optimality: an earlier version of this paper contained a bound with a worse rate of convergence based on a loose analysis; the subsequent work [13] found that a different Nadaraya-Watson kernel regression estimate (with a singular kernel) could achieve the optimal convergence rate, which inspired us to seek a tighter analysis of our wiNN scheme. The full version of the paper is available at https://arxiv.org/abs/1806.05161.)

In this paper, we analyze two interpolating schemes: one based on triangulating the data and constructing the simplicial interpolant, and another based on weighted nearest neighbors with a singular weight function.

2.1 Statistical model and notations

We assume (X₁, Y₁), …, (Xₙ, Yₙ), (X, Y) are iid labeled examples from ℝᵈ × [0, 1]. Here, ((Xᵢ, Yᵢ))ⁿᵢ₌₁ are the iid training data, and (X, Y) is an independent test example from the same distribution. Let µ denote the marginal distribution of X, with support denoted by supp(µ), and let η : ℝᵈ → ℝ denote the conditional mean of Y given X, i.e., the function given by η(x) := E(Y | X = x). For (binary) classification, we assume the range of Y is {0, 1} (so η(x) = P(Y = 1 | X = x)), and we let f* : ℝᵈ → {0, 1} denote the Bayes optimal classifier, which is defined by f*(x) := 1{η(x) > 1/2}. 
This classifier minimizes the risk R₀/₁(f) := E[1{f(X) ≠ Y}] = P(f(X) ≠ Y) under zero-one loss, while the conditional mean function η minimizes the risk R_sq(g) := E[(g(X) − Y)²] under squared loss.

The goal of our analyses will be to establish excess risk bounds for empirical predictors (f̂ and η̂, based on training data) in terms of their agreement with f* for classification and with η for regression. For classification, the expected risk can be bounded as E[R₀/₁(f̂)] ≤ R₀/₁(f*) + P(f̂(X) ≠ f*(X)), while for regression, the expected mean squared error is precisely E[R_sq(η̂)] = R_sq(η) + E[(η̂(X) − η(X))²]. Our analyses thus mostly focus on P(f̂(X) ≠ f*(X)) and E[(η̂(X) − η(X))²] (where the probability and expectations are with respect to both the training data and the test example).

2.2 Smoothness, margin, and regularity conditions

Below we list some standard conditions needed for further development.

(A, α)-smoothness (Hölder). For all x, x′ in the support of µ: |η(x) − η(x′)| ≤ A · ‖x − x′‖^α.

(B, β)-margin condition [31, 45]. For all t ≥ 0: µ({x ∈ ℝᵈ : |η(x) − 1/2| ≤ t}) ≤ B · t^β.

h-hard margin condition [32]. For all x in the support of µ: |η(x) − 1/2| ≥ h > 0.

(c₀, r₀)-regularity [5]. For all 0 < r ≤ r₀ and x ∈ supp(µ): λ(supp(µ) ∩ B(x, r)) ≥ c₀ · λ(B(x, r)), where λ is the Lebesgue measure on ℝᵈ, and B(c, r) := {x ∈ ℝᵈ : ‖x − c‖ ≤ r} denotes the ball of radius r around c.

The regularity condition from Audibert and Tsybakov [5] is not very restrictive. For example, if supp(µ) = B(0, 1), then c₀ ≈ 1/2 and r₀ ≥ 1.

Uniform distribution condition. In what follows, we mostly assume a uniform marginal distribution µ over a certain domain. 
This is done for the sake of simplicity and is not an essential condition. For example, in every statement the uniform measure can be substituted (with a potential change of constants) by an arbitrary measure with density bounded from below.

3 Interpolating scheme based on multivariate triangulation

In this section, we describe and analyze an interpolating scheme based on multivariate triangulation. Our main interest in this scheme is in its natural geometric properties and the risk bounds for regression and classification, which compare favorably to those of the original nearest neighbor rule (despite the fact that neither is statistically consistent in general).

3.1 Definition and basic properties

We define an interpolating function η̂ : ℝᵈ → ℝ based on training data ((xᵢ, yᵢ))ⁿᵢ₌₁ from ℝᵈ × ℝ and a (multivariate) triangulation scheme T. This function is simplicial interpolation [20, 27]. We assume without loss of generality that the (unlabeled) examples x₁, …, xₙ span ℝᵈ. The triangulation scheme T partitions the convex hull Ĉ := conv(x₁, …, xₙ) of the unlabeled examples into non-degenerate simplices with vertices at the unlabeled examples; these simplices intersect only at lower-dimensional faces. For x ∈ Ĉ, consider the set of unlabeled examples (x⁽¹⁾, …, x⁽ᵈ⁺¹⁾) that are vertices of a simplex containing x, and let L_T(x) be the corresponding set of labeled examples ((x⁽¹⁾, y⁽¹⁾), …, (x⁽ᵈ⁺¹⁾, y⁽ᵈ⁺¹⁾)). On this simplex, η̂ is the linear (affine) function determined by η̂(x⁽ⁱ⁾) = y⁽ⁱ⁾ for i = 1, …, d + 1, and the plug-in classifier is f̂(x) := 1{η̂(x) > 1/2}. (A figure illustrating simplicial interpolation on simplices with 0/1 vertex labels is omitted here.)

[...]

As in Section 3.2, we assume that µ is the uniform distribution on a full-dimensional compact and convex subset of ℝᵈ. We first observe that under the same conditions as Corollary 3.3, the asymptotic excess risk of f̂ is O(1/√d). 
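Before giving finer results, the definition is easy to visualize in one dimension, where the "triangulation" is just the partition of the line into intervals between consecutive points and simplicial interpolation reduces to piecewise-linear interpolation (the paper makes this 1-d connection explicit in Section 6). A minimal sketch, with helper names and the clamping convention outside the convex hull being our own choices:

```python
# 1-d simplicial interpolation = piecewise-linear interpolation between
# adjacent training points; the plug-in classifier thresholds at 1/2.
import bisect

def simplicial_interp_1d(x, xs, ys):
    """Linear interpolation on the intervals of sorted xs; clamped outside."""
    if x <= xs[0]:
        return ys[0]
    if x >= xs[-1]:
        return ys[-1]
    j = bisect.bisect_left(xs, x)          # xs[j-1] < x <= xs[j]
    t = (x - xs[j - 1]) / (xs[j] - xs[j - 1])
    return (1 - t) * ys[j - 1] + t * ys[j]

xs, ys = [0.0, 1.0, 2.0], [0.0, 1.0, 0.0]
assert simplicial_interp_1d(1.0, xs, ys) == 1.0   # interpolates the data
assert simplicial_interp_1d(0.5, xs, ys) == 0.5   # affine within a simplex
classify = lambda x: 1 if simplicial_interp_1d(x, xs, ys) > 0.5 else 0
```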
Moreover, when the conditional mean function satisfies a margin condition, this 1/√d can be replaced with a quantity that is exponentially small in d, as we show next.

Theorem 3.4. Suppose η satisfies the h-hard margin condition. As above, assume µ is the uniform distribution on a simple polytope in ℝᵈ, and T is constructed using Delaunay triangulation. Furthermore, assume η is Lipschitz away from the class boundary (i.e., on {x ∈ supp(µ) : |η(x) − 1/2| > 0}) and that the class boundary ∂ has finite (d−1)-dimensional volume (i.e., lim_{ε→0} µ(∂ + B(0, ε)) = 0, where "+" denotes the Minkowski sum, so ∂ + B(0, ε) is the ε-neighborhood of ∂). Then, for some absolute constants c₁, c₂ > 0 (which may depend on h), lim sup_{n→∞} E[R₀/₁(f̂)] ≤ R₀/₁(f*) · (1 + c₁e^{−c₂d}).

Remark 3.5. The asymptotic risk bounds show that the risk of f̂ can be very close to the Bayes risk in high dimensions, thus exhibiting a certain "blessing of dimensionality". This stands in contrast to the nearest neighbor rule, whose asymptotic risk does not diminish with the dimension and is only bounded by twice the Bayes risk, 2R₀/₁(f*).

4 Interpolating nearest neighbor schemes

In this section, we describe a weighted nearest neighbor scheme that, like the 1-nearest neighbor rule, interpolates the training data, but is similar to the classical (unweighted) k-nearest neighbor rule in terms of other properties, including convergence and consistency. (The classical k-nearest neighbor rule is not generally an interpolating method except when k = 1.)

4.1 Weighted & interpolated nearest neighbors

For a given x ∈ ℝᵈ, let x₍ᵢ₎ be the i-th nearest neighbor of x among the training data ((xᵢ, yᵢ))ⁿᵢ₌₁ from ℝᵈ × ℝ, and let y₍ᵢ₎ be the corresponding label. Let w(x, z) be a function ℝᵈ × ℝᵈ → ℝ. A 
A\nweighted nearest neighbor scheme is simply a function of the form\n\ni=1\n\n\u02c6\u2318(x) := Pk\nPk\n\ni=1 w(x, x(i))y(i)\n\n.\n\ni=1 w(x, x(i))\n\nIn what follows, we investigate the properties of interpolating schemes of this type.\nWe will need two key observations for the analyses of these algorithms.\n\nThat implies thatPk\n\nConditional independence. The \ufb01rst key observation is that, under the usual iid sampling assump-\ntions on the data, the \ufb01rst k nearest neighbors of x are conditionally independent given X(k+1).\ni=1 w(x, X(i))Y(i) is a sum of conditionally iid random variables8. Hence,\nunder a mild condition on w(x, X(i)), we expect them to concentrate around their expected value.\n7I.e., lim\u270f!0 \u00b5(@ + B(0,\u270f )) = 0, where \u201c+\u201d denotes the Minkowski sum, i.e., the \u270f-neighborhood of @.\n8Note that these variables are not independent in the ordering given by the distance to x, but a random\n\npermutation makes them independent.\n\n6\n\n\fAssuming some smoothness of \u2318, that value is closely related to \u2318(x) = E(Y | X = x), thus\nallowing us to establish bounds and rates.\nInterpolation and singular weight functions. The second key point is that \u02c6\u2318(x) is an interpolating\nscheme, provided that w(x, z) has a singularity when z = x. Indeed, it is easily seen that if\nlimz!x w(x, z) = 1, then limx!xi \u02c6\u2318(x) = yi. Extending \u02c6\u2318 continuously to the data points\nyields a weighted & interpolated nearest neighbor (wiNN) scheme.\n\nWe restrict attention to singular weight functions of the following radial type. Fix a positive integer k\nand a decreasing function : R+ ! R+ with a singularity at zero, (0) = +1. We take\n\nw(x, z) := kx zk\n\nkx x(k+1)k! .\n\nConcretely, we will consider that diverge near t = 0 as t 7! log(t) or t 7! t, > 0.\nRemark 4.1. 
The denominator ‖x − x₍ₖ₊₁₎‖ in the argument of φ is not strictly necessary, but it allows for convenient normalization in view of the conditional independence of the k nearest neighbors given x₍ₖ₊₁₎. Note that the weights depend on the sample and are thus data-adaptive.

Remark 4.2. Although the weights w(x, x₍ᵢ₎) are unbounded for singular weight functions, concentration only requires certain bounded moments. Geometrically, the volume of the region around the singularity needs to be small enough. For the radial weight functions that we consider, this condition is more easily satisfied in high dimension. Indeed, the volume around the singularity becomes exponentially small in high dimension.

Our wiNN schemes are related to Nadaraya-Watson kernel regression [33, 50]. The use of singular kernels in the context of interpolation was originally proposed by Shepard [41]; they do not appear to be commonly used in machine learning and statistics, perhaps due to a view that interpolating schemes are unlikely to generalize or even be consistent; the non-adaptive Hilbert kernel regression estimate [21] (essentially, k = n and δ = d) is the only exception we know of.

4.2 Mean squared error

We first state a risk bound for wiNN schemes in a regression setting. Here, (X₁, Y₁), …, (Xₙ, Yₙ), (X, Y) are iid labeled examples from ℝᵈ × ℝ.

Theorem 4.3. Let η̂ be a wiNN scheme with singular weight function φ. Assume the following:

1. µ is the uniform distribution on a compact subset of ℝᵈ and satisfies the (c₀, r₀)-regularity condition for some c₀ > 0 and r₀ > 0.
2. η satisfies the (A, α)-smoothness condition for some A > 0 and α > 0.
3. φ(t) = t^{−δ} for some 0 < δ < d/2.

Let Z₀ := λ(supp(µ))/λ(B(0, 1)), and assume n > 2Z₀k/(c₀r₀ᵈ). For any x₀ ∈ supp(µ), let r_{k+1,n}(x₀) be the distance from x₀ to its (k + 1)-st nearest neighbor among X₁, …, Xₙ. Then 
E[(η̂(X) − η(X))²] ≤ A² · E[r_{k+1,n}(X)^{2α}] + σ̄² · ( k·e^{−k/4} + d/(c₀(d − 2δ)k) ),

where σ̄² := sup_{x ∈ supp(µ)} E[(Y − η(x))² | X = x].

The bound in Theorem 4.3 is stated in terms of the expected distance to the (k + 1)-st nearest neighbor raised to the 2α power; this is typically bounded by O((k/n)^{2α/d}). Choosing k = n^{2α/(2α+d)} leads to a convergence rate of n^{−2α/(2α+d)}, which is minimax optimal.

4.3 Classification risk

We now analyze the statistical risk of the plug-in classifier f̂(x) = 1{η̂(x) > 1/2} based on η̂. As in Section 3.3, it is straightforward to obtain a risk bound for f̂ under the same conditions as Theorem 4.3. Choosing k = n^{2α/(2α+d)} leads to a convergence rate of n^{−α/(2α+d)}.

We now give a more direct analysis, largely based on that of Chaudhuri and Dasgupta [17] for the standard k-nearest neighbor rule, that leads to improved rates under favorable conditions. Our most general theorem along these lines is a bit lengthy to state, and hence we defer it to the full version of the paper. But a simple corollary is as follows.

Corollary 4.4. Let η̂ be a wiNN scheme with singular weight function φ, and let f̂ be the corresponding plug-in classifier. Assume the following:

1. µ is the uniform distribution on a compact subset of ℝᵈ and satisfies the (c₀, r₀)-regularity condition for some c₀ > 0 and r₀ > 0.
2. η satisfies the (A, α)-smoothness and (B, β)-margin conditions for some A > 0, α > 0, B > 0, β ≥ 0.
3. φ(t) = t^{−δ} for some 0 < δ < d/2.

Let Z₀ := λ(supp(µ))/λ(B(0, 1)), and assume k/n < p ≤ c₀r₀ᵈ/Z₀. Then for any 0 < γ < 1/2,

P(f̂(X) ≠ f*(X)) ≤ B·( γ + A(Z₀p/c₀)^{α/d} )^β + exp( −(np/2)·(1 − k/(np))² ) + d/( 4γ²c₀(d − 2δ)k ).

Remark 4.5. 
For consistency, we set k := n^{(2+β)α/((2+β)α+d)}, and in the bound, we plug in p := 2k/n and γ := A(Z₀p/c₀)^{α/d}. This leads to a convergence rate of n^{−αβ/(α(2+β)+d)}.

Remark 4.6. The factor 1/k in the final term in Corollary 4.4 results from an application of Chebyshev's inequality. Under additional moment conditions, which are satisfied for certain functions φ (e.g., φ(t) = −log(t)) with a better-behaved singularity at zero than t^{−δ}, it can be replaced by e^{−Ω(γ²k)}. Additionally, while the condition φ(t) = t^{−δ} is convenient for analysis, it is sufficient to assume that φ approaches infinity no faster than t^{−δ}.

5 Ubiquity of adversarial examples in interpolated learning

The recently observed phenomenon of adversarial examples [44] in modern machine learning has drawn a significant degree of interest. It turns out that by introducing a small perturbation to the features of a correctly classified example (e.g., by changing an image in a visually imperceptible way or even by modifying a single pixel [43]), it is nearly always possible to induce neural networks to mis-classify a given input in a seemingly arbitrary and often bewildering way.

We will now discuss how our analyses, showing that Bayes optimality is compatible with interpolating the data, provide a possible mechanism for these adversarial examples to arise. Indeed, such examples are seemingly unavoidable in interpolated learning and, thus, in much of modern practice. As we show below, any interpolating inferential procedure must have abundant adversarial examples in the presence of any amount of label noise. 
In particular, for consistent or nearly consistent schemes, like those considered in this paper, while the predictor agrees with the Bayes classifier on the bulk of the probability distribution, every "incorrectly labeled" training example (i.e., an example whose label differs from the output of the Bayes optimal classifier) has a small "basin of attraction", with every point in the basin misclassified by the predictor. The total probability mass of these "adversarial" basins is negligible given enough training data, so that the probability of misclassifying a randomly chosen point is low. However, assuming non-zero label noise, the union of these adversarial basins is asymptotically a dense subset of the support of the underlying probability measure, and hence there are misclassified examples in every open set. This is indeed consistent with the extensive empirical evidence for neural networks. While their output is observed to be robust to random feature noise [22], adversarial examples turn out to be quite difficult to avoid and can be easily found by targeted optimization methods such as PGD [30]. We conjecture that this may be a general property, or perhaps a weakness, of interpolating methods, as some non-interpolating local classification rules can be robust against certain forms of adversarial examples [47].

To substantiate this discussion, we now provide a formal mathematical statement. For simplicity, let us consider a binary classification setting. Let µ be a probability distribution with non-zero density defined on a compact domain Ω ⊂ ℝᵈ, and assume non-zero label noise everywhere, i.e., for all x ∈ Ω, 0 < η(x) < 1, or equivalently, P(f*(x) ≠ Y | X = x) > 0. 
Let f̂ₙ be a consistent interpolating classifier constructed from n iid sampled data points (e.g., the classifier constructed in Section 4.3). Let Aₙ = {x ∈ Ω : f̂ₙ(x) ≠ f*(x)} be the set of points at which f̂ₙ disagrees with the Bayes optimal classifier f*; in other words, Aₙ is the set of "adversarial examples" for f̂ₙ. Consistency of f̂ₙ implies that, with probability one, lim_{n→∞} µ(Aₙ) = 0 or, equivalently, lim_{n→∞} ‖f̂ₙ − f*‖_{L²_µ} = 0. On the other hand, the following result shows that the sets Aₙ are asymptotically dense in Ω, so that there is an adversarial example arbitrarily close to any x.

Theorem 5.1. For any ε > 0 and δ ∈ (0, 1), there exists N ∈ ℕ such that for all n ≥ N, with probability at least 1 − δ, every point in Ω is within distance 2ε of the set Aₙ.

Proof sketch. Let (X₁, Y₁), …, (Xₙ, Yₙ) be the training data used to construct f̂ₙ. Fix a finite ε-cover of Ω with respect to the Euclidean distance. Since f̂ₙ is interpolating and η is never zero nor one, for every i there is a non-zero probability (over the outcome of the label Yᵢ) that f̂ₙ(Xᵢ) = Yᵢ ≠ f*(Xᵢ); in this case, the training point Xᵢ is an adversarial example for f̂ₙ. By choosing n = n(µ, ε, δ) large enough, we can ensure that with probability at least 1 − δ over the random draw of the training data, every element of the cover is within distance ε of at least one adversarial example, upon which every point in Ω is within distance 2ε (by the triangle inequality) of the same.

A similar argument for regression shows that while an interpolating η̂ may converge to η in L²_µ, it is generally impossible for it to converge in L∞ unless there is no label noise. 
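The picture of small but dense adversarial basins can be illustrated with a 1-d sketch of the wiNN scheme from Section 4: singular weights φ(t) = t^{−δ} on the k nearest neighbors, normalized by the (k+1)-st neighbor distance. The parameter values (k = 3, δ = 0.4 < d/2 for d = 1) and the toy data are our own; one deliberately mislabeled point creates a small misclassified basin while the predictor agrees with the Bayes classifier away from it.

```python
# wiNN sketch in 1-d: phi(t) = t**(-delta) weights on the k nearest
# neighbors, normalized by the (k+1)-st neighbor distance, then probe the
# "adversarial basin" around a single mislabeled training point.

def winn_estimate(x, data, k=3, delta=0.4):
    pts = sorted(data, key=lambda p: abs(x - p[0]))
    r = abs(x - pts[k][0])               # (k+1)-st neighbor distance
    num = den = 0.0
    for xi, yi in pts[:k]:
        d = abs(x - xi)
        if d == 0.0:
            return yi                    # singular weight: interpolate
        w = (d / r) ** (-delta)
        num += w * yi
        den += w
    return num / den

# eta > 1/2 everywhere, so the Bayes label is 1; x = 0.5 carries label noise.
data = [(0.1, 1), (0.3, 1), (0.5, 0), (0.7, 1), (0.9, 1)]
assert winn_estimate(0.5, data) == 0     # interpolates the noisy label
assert winn_estimate(0.49, data) < 0.5   # inside the small adversarial basin
assert winn_estimate(0.35, data) > 0.5   # agrees with Bayes away from it
```

As more data is added, the basin around the noisy point shrinks (its probability mass vanishes), but Theorem 5.1 says such basins keep appearing everywhere in the support.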
An even more striking result is that, for the Hilbert scheme of Devroye et al., the regression estimator almost surely does not converge at any fixed point, even for the simple case of a constant function corrupted by label noise [21]. This means that, with increasing sample size n, at any given point x misclassification will occur an infinite number of times with probability one. We expect similar behavior to hold for the interpolation schemes presented in this paper.

6 Discussion and connections

In this paper, we considered two types of algorithms, one based on simplicial interpolation and another based on interpolation by weighted nearest neighbor schemes. It may be useful to think of nearest neighbor schemes as direct methods, not requiring optimization, while our simplicial scheme is a simple example of an inverse method, using (local) matrix inversion to fit the data. Most popular machine learning methods, such as kernel machines, neural networks, and boosting, are inverse schemes. While nearest neighbor and Nadaraya-Watson methods often show adequate performance, they are rarely the best-performing algorithms in practice. We conjecture that the simplicial interpolation scheme may provide insights into the properties of interpolating kernel machines and neural networks.

To provide some evidence for this line of thought, we show that in one dimension simplicial interpolation is indeed a special case of an interpolating kernel machine. We will briefly sketch the argument without going into the details. Consider the space H of real-valued functions f with the norm ‖f‖²_H = ∫ ((df/dx)² + κ²f²) dx. This space is a reproducing kernel Hilbert space corresponding to the Laplace kernel e^{−κ|x−z|}. It can be seen that, as κ → 0, the minimum-norm interpolant f* = argmin_{f ∈ H, f(x_i) = y_i ∀i} ‖f‖_H is simply linear interpolation between adjacent points on the line. Note that this is the same as our simplicial interpolation method.

Interestingly, a version of random forests similar to PERT [19] also produces linear interpolation in one dimension (in the limit, when infinitely many trees are sampled). For simplicity, assume that we have only two data points x1 < x2 with labels 0 and 1, respectively. A tree that correctly classifies those points is simply a function of the form 1_{x>t}, where t ∈ [x1, x2). Choosing a random t uniformly from [x1, x2), we observe that E_{t∈[x1,x2]} 1_{x>t} is simply the linear function interpolating between the two data points. The extension of this argument to more than two data points in dimension one is straightforward. It would be interesting to investigate the properties of such methods in higher dimension. We note that it is unclear whether a random forest method of this type should be considered a direct or an inverse method. While there is no explicit optimization involved, sampling is often used instead of optimization in methods like simulated annealing.

Finally, we note that while kernel machines (which can be viewed as two-layer neural networks) are much more theoretically tractable than general neural networks, none of the current theory applies in the interpolated regime in the presence of label noise [12]. We hope that simplicial interpolation can shed light on their properties and lead to better understanding of modern inferential methods.

Acknowledgements

We would like to thank Raef Bassily, Luis Rademacher, Sasha Rakhlin, and Yusu Wang for conversations and valuable comments. We acknowledge funding from NSF. DH acknowledges support from NSF grants CCF-1740833 and DMR-1534910. PPM acknowledges support from the Crick-Clay Professorship (CSHL) and the H N Mahabala Chair (IITM).
This work grew out of discussions originating at the Simons Institute for the Theory of Computing in 2017, and we thank the Institute for its hospitality. PPM and MB thank ICTS (Bangalore) for their hospitality at the 2017 workshop on Statistical Physics Methods in Machine Learning.

References

[1] Fernando Affentranger and John A Wieacker. On the convex hull of uniform random points in a simple d-polytope. Discrete & Computational Geometry, 6(3):291–305, 1991.

[2] Nina Amenta, Dominique Attali, and Olivier Devillers. Complexity of Delaunay triangulation for points on lower-dimensional polyhedra. In Proceedings of the Eighteenth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA '07, pages 1106–1113, 2007.

[3] Martin Anthony and Peter L Bartlett. Function learning from interpolation. In Computational Learning Theory: Second European Conference, EUROCOLT 95, Barcelona, Spain, March 1995, Proceedings, pages 211–221, 1995.

[4] Martin Anthony and Peter L Bartlett. Neural Network Learning: Theoretical Foundations. Cambridge University Press, 1999.

[5] Jean-Yves Audibert and Alexandre B Tsybakov. Fast learning rates for plug-in classifiers. The Annals of Statistics, 35(2):608–633, 2007.

[6] Peter Bartlett, Dylan J Foster, and Matus Telgarsky. Spectrally-normalized margin bounds for neural networks. In NIPS, 2017.

[7] Peter L Bartlett, Philip M Long, and Robert C Williamson. Fat-shattering and the learnability of real-valued functions. Journal of Computer and System Sciences, 52(3):434–452, 1996.

[8] Raef Bassily, Kobbi Nissim, Adam Smith, Thomas Steinke, Uri Stemmer, and Jonathan Ullman. Algorithmic stability for adaptive data analysis. In Proceedings of the Forty-Eighth Annual ACM Symposium on Theory of Computing, pages 1046–1059. ACM, 2016.

[9] Frank Bauer, Sergei Pereverzev, and Lorenzo Rosasco. On regularization algorithms in learning theory.
Journal of Complexity, 23(1):52–72, 2007.

[10] Mikhail Belkin, Irina Matveeva, and Partha Niyogi. Regularization and semi-supervised learning on large graphs. In International Conference on Computational Learning Theory, pages 624–638. Springer, 2004.

[11] Mikhail Belkin, Daniel Hsu, and Partha Mitra. Overfitting or perfect fitting? Risk bounds for classification and regression rules that interpolate. arXiv preprint arXiv:1806.05161, 2018. URL https://arxiv.org/abs/1806.05161.

[12] Mikhail Belkin, Siyuan Ma, and Soumik Mandal. To understand deep learning we need to understand kernel learning. In Proceedings of the 35th International Conference on Machine Learning, pages 541–549, 2018.

[13] Mikhail Belkin, Alexander Rakhlin, and Alexandre B. Tsybakov. Does data interpolation contradict statistical optimality? arXiv preprint arXiv:1806.09471, 2018.

[14] Olivier Bousquet and André Elisseeff. Stability and generalization. J. Mach. Learn. Res., 2:499–526, March 2002. ISSN 1532-4435.

[15] Leo Breiman. Random forests. Machine Learning, 45(1):5–32, 2001.

[16] Andrea Caponnetto and Ernesto De Vito. Optimal rates for the regularized least-squares algorithm. Foundations of Computational Mathematics, 7(3):331–368, 2007.

[17] Kamalika Chaudhuri and Sanjoy Dasgupta. Rates of convergence for nearest neighbor classification. In Advances in Neural Information Processing Systems, pages 3437–3445, 2014.

[18] Thomas Cover and Peter Hart. Nearest neighbor pattern classification. IEEE Transactions on Information Theory, 13(1):21–27, 1967.

[19] Adele Cutler and Guohua Zhao. PERT - perfect random tree ensembles. Computing Science and Statistics, 33:490–497, 2001.

[20] Scott Davies. Multidimensional triangulation and interpolation for reinforcement learning.
In Advances in Neural Information Processing Systems, pages 1005–1011, 1997.

[21] Luc Devroye, László Györfi, and Adam Krzyżak. The Hilbert kernel regression estimate. Journal of Multivariate Analysis, 65(2):209–227, 1998.

[22] Alhussein Fawzi, Seyed-Mohsen Moosavi-Dezfooli, and Pascal Frossard. Robustness of classifiers: from adversarial to random noise. In Advances in Neural Information Processing Systems, pages 1632–1640, 2016.

[23] Komei Fukuda. Polyhedral computation FAQ. Technical report, Swiss Federal Institute of Technology, Lausanne and Zurich, Switzerland, 2004. URL https://www.cs.mcgill.ca/~fukuda/download/paper/polyfaq.pdf.

[24] Noah Golowich, Alexander Rakhlin, and Ohad Shamir. Size-independent sample complexity of neural networks. In Thirty-First Annual Conference on Learning Theory, 2018.

[25] Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep Learning. MIT Press, 2016.

[26] László Györfi, Michael Kohler, Adam Krzyżak, and Harro Walk. A Distribution-Free Theory of Nonparametric Regression. Springer Series in Statistics. Springer, 2002.

[27] John H Halton. Simplicial multivariable linear interpolation. Technical Report TR91-002, University of North Carolina at Chapel Hill, Department of Computer Science, 1991.

[28] Vladimir Koltchinskii and Dmitry Panchenko. Empirical margin distributions and bounding the generalization error of combined classifiers. Annals of Statistics, pages 1–50, 2002.

[29] Tengyuan Liang, Tomaso Poggio, Alexander Rakhlin, and James Stokes. Fisher-Rao metric, geometry, and complexity of neural networks. arXiv preprint arXiv:1711.01530, 2017.

[30] Aleksander Madry, Aleksandar Makelov, Ludwig Schmidt, Dimitris Tsipras, and Adrian Vladu. Towards deep learning models resistant to adversarial attacks. In International Conference on Learning Representations, 2018.

[31] Enno Mammen and Alexandre B Tsybakov.
Smooth discrimination analysis. The Annals of Statistics, 27(6):1808–1829, 1999.

[32] Pascal Massart and Élodie Nédélec. Risk bounds for statistical learning. The Annals of Statistics, 34(5):2326–2366, 2006.

[33] Elizbar A Nadaraya. On estimating regression. Theory of Probability & Its Applications, 9(1):141–142, 1964.

[34] Behnam Neyshabur, Srinadh Bhojanapalli, and Nathan Srebro. A PAC-Bayesian approach to spectrally-normalized margin bounds for neural networks. In International Conference on Learning Representations, 2018.

[35] Ruslan Salakhutdinov. Deep learning tutorial at the Simons Institute, Berkeley, 2017. URL https://simons.berkeley.edu/talks/ruslan-salakhutdinov-01-26-2017-1.

[36] Robert E Schapire and Yoav Freund. Boosting: Foundations and Algorithms. MIT Press, 2012.

[37] Robert E Schapire, Yoav Freund, Peter Bartlett, and Wee Sun Lee. Boosting the margin: a new explanation for the effectiveness of voting methods. Ann. Statist., 26(5), 1998.

[38] Rolf Schneider. Discrete aspects of stochastic geometry. Handbook of Discrete and Computational Geometry, page 255, 2004.

[39] Bernhard Scholkopf and Alexander J Smola. Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond. MIT Press, 2001.

[40] Shai Shalev-Shwartz and Shai Ben-David. Understanding Machine Learning: From Theory to Algorithms. Cambridge University Press, 2014.

[41] Donald Shepard. A two-dimensional interpolation function for irregularly-spaced data. In Proceedings of the 1968 23rd ACM National Conference, 1968.

[42] Ingo Steinwart and Andreas Christmann. Support Vector Machines. Springer Science & Business Media, 2008.

[43] Jiawei Su, Danilo Vasconcellos Vargas, and Sakurai Kouichi. One pixel attack for fooling deep neural networks.
arXiv preprint arXiv:1710.08864, 2017.

[44] Christian Szegedy, Wojciech Zaremba, Ilya Sutskever, Joan Bruna, Dumitru Erhan, Ian Goodfellow, and Rob Fergus. Intriguing properties of neural networks. In International Conference on Learning Representations, 2014.

[45] Alexander B Tsybakov. Optimal aggregation of classifiers in statistical learning. The Annals of Statistics, 32(1):135–166, 2004.

[46] Alexandre B Tsybakov. Introduction to Nonparametric Estimation. Springer Series in Statistics. Springer, New York, 2009.

[47] Yizhen Wang, Somesh Jha, and Kamalika Chaudhuri. Analyzing the robustness of nearest neighbors to adversarial examples. In Proceedings of the 35th International Conference on Machine Learning, pages 5133–5142, 2018.

[48] Larry Wasserman. All of Statistics. Springer, 2004.

[49] Larry Wasserman. All of Nonparametric Statistics. Springer, 2006.

[50] Geoffrey S Watson. Smooth regression analysis. Sankhyā: The Indian Journal of Statistics, Series A, pages 359–372, 1964.

[51] Abraham J Wyner, Matthew Olson, Justin Bleich, and David Mease. Explaining the success of AdaBoost and random forests as interpolating classifiers. Journal of Machine Learning Research, 18(48):1–33, 2017.

[52] Yuan Yao, Lorenzo Rosasco, and Andrea Caponnetto. On early stopping in gradient descent learning. Constructive Approximation, 26(2):289–315, 2007.

[53] Chiyuan Zhang, Samy Bengio, Moritz Hardt, Benjamin Recht, and Oriol Vinyals. Understanding deep learning requires rethinking generalization. In International Conference on Learning Representations, 2017.

[54] Xiaojin Zhu, Zoubin Ghahramani, and John D Lafferty. Semi-supervised learning using Gaussian fields and harmonic functions.
In Proceedings of the 20th International Conference on Machine Learning (ICML-03), pages 912–919, 2003.