{"title": "An adaptive nearest neighbor rule for classification", "book": "Advances in Neural Information Processing Systems", "page_first": 7579, "page_last": 7588, "abstract": "We introduce a variant of the $k$-nearest neighbor classifier in which $k$ is chosen adaptively for each query, rather than supplied as a parameter. The choice of $k$ depends on properties of each neighborhood, and therefore may significantly vary between different points. (For example, the algorithm will use larger $k$ for predicting the labels of points in noisy regions.) \n\nWe provide theory and experiments that demonstrate that the algorithm performs comparably to, and sometimes better than, $k$-NN with an optimal choice of $k$. In particular, we derive bounds on the convergence rates of our classifier that depend on a local quantity we call the ``advantage'' which is significantly weaker than the Lipschitz conditions used in previous convergence rate proofs. These generalization bounds hinge on a variant of the seminal Uniform Convergence Theorem due to Vapnik and Chervonenkis; this variant concerns conditional probabilities and may be of independent interest.", "full_text": "An adaptive nearest neighbor rule for classi\ufb01cation\n\nAkshay Balsubramani\nabalsubr@stanford.edu\n\nSanjoy Dasgupta\n\ndasgupta@eng.ucsd.edu\n\nYoav Freund\n\nyfreund@eng.ucsd.edu\n\nShay Moran\n\nshaym@princeton.edu\n\nAbstract\n\nWe introduce a variant of the k-nearest neighbor classi\ufb01er in which k is chosen\nadaptively for each query, rather than being supplied as a parameter. The choice\nof k depends on properties of each neighborhood, and therefore may signi\ufb01cantly\nvary between different points. 
For example, the algorithm will use larger k for predicting the labels of points in noisy regions.\nWe provide theory and experiments that demonstrate that the algorithm performs comparably to, and sometimes better than, k-NN with an optimal choice of k. In particular, we bound the convergence rate of our classifier in terms of a local quantity we call the “advantage”, giving results that are both more general and more accurate than the smoothness-based bounds of earlier nearest neighbor work. Our analysis uses a variant of the uniform convergence theorem of Vapnik-Chervonenkis that applies to empirical estimates of conditional probabilities and may be of independent interest.\n\n1 Introduction\n\nWe introduce an adaptive nearest neighbor classification rule. Given a training set with labels {±1}, its prediction at a query point x is based on the training points closest to x, rather like the k-nearest neighbor rule. However, the value of k that it uses can vary from query to query. Specifically, if there are n training points, then for any query x, the smallest k is sought for which the k points closest to x have labels whose average is either greater than +Δ(n, k), in which case the prediction is +1, or less than −Δ(n, k), in which case the prediction is −1; if no such k exists, then “?” (“don’t know”) is returned. Here, Δ(n, k) ≈ √((log n) / k) corresponds to a confidence interval for the average label in the region around the query.\nWe study this rule in the standard statistical framework in which all data are i.i.d. draws from some unknown underlying distribution P on X × Y, where X is the data space and Y is the label space. We take X to be a separable metric space, with distance function d : X × X → 
R, and we take Y = {±1}.\nWe can decompose P into the marginal distribution µ on X and the conditional expectation of the label at each point x: if (X, Y) represents a random draw from P, define η(x) = E(Y | X = x). In this terminology, the Bayes-optimal classifier is the rule g* : X → {±1} given by\n\ng*(x) = sign(η(x)) if η(x) ≠ 0, and either −1 or +1 if η(x) = 0,   (1)\n\nand its error rate is the Bayes risk, R* = (1/2) E_{X∼µ}[1 − |η(X)|]. A variety of nonparametric classification schemes are known to have error rates that converge asymptotically to R*. These include k-nearest neighbor (henceforth, k-NN) rules [FH51] in which k grows with the number of training points n according to a suitable schedule (k_n), under certain technical conditions on the metric measure space (X, d, µ).\n33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.\n\nFigure 1: For values of x on the left half of the shown interval, the pointwise bias η(x) is close to 1 or −1, and thus a small value of k will yield an accurate prediction. Larger k will not do as well, because they may run into neighboring regions with different labels. For values of x on the right half of the interval, η(x) is close to 0, and thus large k is essential for accurate prediction.\n\nIn this paper, we are interested in consistency as well as rates of convergence. 
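To make the trade-off in Figure 1 concrete: if each of a query's k nearest labels is independently +1 with probability (1 + η)/2, the chance that their majority vote recovers sign(η) can be computed exactly. The following small illustration is ours, not part of the paper's analysis:

```python
from math import comb

def p_majority_correct(k, eta):
    """P(majority of k i.i.d. labels equals sign(eta)) for odd k and eta > 0,
    when each label is +1 with probability (1 + eta) / 2."""
    p = (1 + eta) / 2
    # The majority vote is +1 iff more than k/2 of the k labels are +1.
    return sum(comb(k, j) * p**j * (1 - p)**(k - j)
               for j in range(k // 2 + 1, k + 1))

print(p_majority_correct(1, 0.9))    # 0.95: one neighbor is already accurate
print(p_majority_correct(1, 0.1))    # 0.55: barely better than chance
print(p_majority_correct(101, 0.1))  # a large k rescues the noisy region
```

Where η ≈ 0.9, k = 1 already succeeds 95% of the time and enlarging k mainly risks crossing into a differently labeled region; where η ≈ 0.1, k = 1 fails 45% of the time and only a large k is reliable.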
In particular, we find that the adaptive nearest neighbor rule is also asymptotically consistent (under the same technical conditions) while converging at a rate that is about as good as, and sometimes significantly better than, that of k-NN under any schedule (k_n).\nIntuitively, one of the advantages of k-NN over nonparametric classifiers that use a fixed bandwidth or radius, such as Parzen window or kernel density estimators, is that k-NN automatically adapts to variation in the marginal distribution µ: in regions with large µ, the k nearest neighbors lie close to the query point, while in regions with small µ, the k nearest neighbors can be further afield. The adaptive NN rule that we propose goes further: it also adapts to variation in η. In certain regions of the input space, where η is close to 0, an accurate prediction would need large k. In other regions, where η is near 1 or −1, a small k would suffice, and in fact, a larger k might be detrimental because neighboring regions might be labeled differently. See Figure 1 for one such example. A k-NN classifier is forced to pick a single value of k that trades off between these two contingencies. Our adaptive NN rule, however, can pick the right k in each neighborhood separately.\nOur estimator allows us to give rates of convergence that are tighter and more transparent than those customarily obtained in nonparametric statistics. Specifically, for any point x in the instance space X, we define a notion of the advantage at x, denoted adv(x), which is rather like a local margin. We show that the prediction at x is very likely to be correct once the number of training points exceeds Õ(1/adv(x)). 
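The per-query choice of k described in the introduction is short enough to sketch in code. This is a simplified sketch, not the exact algorithm of Section 3: the absolute constant in Δ(n, k, δ) is hard-coded to 1, distance ties are ignored, and 0 stands in for the “?” output:

```python
import numpy as np

def aknn_predict(X, y, query, delta=0.05, c1=1.0):
    """Adaptive NN prediction: scan k = 1, ..., n and stop at the first
    neighborhood whose empirical bias clears the confidence width
    Delta(n, k, delta) = c1 * sqrt((log n + log(1/delta)) / k)."""
    n = len(X)
    order = np.argsort(np.linalg.norm(X - query, axis=1))
    labels = y[order]                      # labels sorted by distance to query
    for k in range(1, n + 1):
        bias = labels[:k].mean()           # empirical bias of the k-NN ball
        width = c1 * np.sqrt((np.log(n) + np.log(1 / delta)) / k)
        if abs(bias) > width:
            return int(np.sign(bias))      # significant bias: predict its sign
    return 0                               # no ball is significant: "?"

# A noiseless cluster of positives: with n = 50 and delta = 0.05 the
# confidence width first drops below 1 at k = 7, where the scan stops.
rng = np.random.default_rng(0)
X, y = rng.normal(size=(50, 2)), np.ones(50)
print(aknn_predict(X, y, np.zeros(2)))     # 1
```

On data whose labels alternate with distance from the query, no prefix of neighbors ever shows a significant bias and the sketch abstains, which is exactly the behavior the “?” output formalizes.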
Universal consistency follows by establishing that almost all points have positive advantage.\n\n1.1 Relation to other work in nonparametric estimation\n\nFor linear separators and many other parametric families of classifiers, it is possible to give rates of convergence that hold without any assumptions on the input distribution µ or the conditional expectation function η. This is not true of nonparametric estimation: although any target function can in principle be captured, the number of samples needed to achieve a specific level of accuracy will inevitably depend upon aspects of this function such as how fast it changes [DGL96, chapter 7]. As a result, nonparametric statistical theory has focused on (1) asymptotic consistency, ideally without assumptions, and (2) rates of convergence under a variety of smoothness assumptions.\nAsymptotic consistency has been studied in great detail for the k-NN classifier, when k is allowed to grow with the number of data points n. The risk of the classifier, denoted Rn, is its error rate on the underlying distribution P; this is a random variable that depends upon the set of training points seen. Cover and Hart [CH67] showed that in general metric spaces, under the assumption that every x in the support of µ is either a continuity point of η or has µ({x}) > 0, the expected risk E[Rn] converges to the Bayes-optimal risk R*, as long as k → ∞ and k/n → 0. For points in finite-dimensional Euclidean space, a series of results starting with Stone [Sto77] established consistency without any assumptions on µ or η, and showed that Rn → 
R\u21e4 almost surely [DGKL94].\nMore recent work has extended these universal consistency results\u2014that is, consistency without\nassumptions on \u2318\u2014to arbitrary metric measure spaces (X , d, \u00b5) that satisfy a certain differentiation\ncondition [CG06, CD14].\nRates of convergence have been obtained for k-nearest neighbor classi\ufb01cation under various smooth-\nness conditions including Holder conditions on \u2318 [KP95, Gy\u00f681] and \u201cTsybakov margin\u201d condi-\ntions [MT99, AT07, CD14]. Such assumptions have become customary in nonparametric statistics,\nbut they leave a lot to be desired. First, they are uncheckable: it is not possible to empirically\ndetermine the smoothness given samples. Second, they view the underlying distribution P through\nthe tiny window of two or three parameters, obscuring almost all the remaining structure of the\ndistribution that also in\ufb02uences the rate of convergence. Finally, because nonparametric estimation is\noften local, there is the intriguing possibility of getting different rates of convergence in different\nregions of the input space: a possibility that is immediately defeated by reducing the entire space to\ntwo smoothness constants.\nThe \ufb01rst two of these issues are partially addressed by the work of [CD14], who analyze the \ufb01nite\nsample risk of k-NN classi\ufb01cation without any assumptions on P . Their bounds involve terms that\nmeasure the probability mass of the input space in a carefully de\ufb01ned region around the decision\nboundary: that is, bounds that are tailored to the speci\ufb01c distribution P , rather than re\ufb02ecting worst-\ncase behavior over some large class to which P belongs. 
However, the expressions for the risk are\nsomewhat hard to parse, in large part because of the interaction between n and k.\nIn the present paper, we obtain \ufb01nite-sample rates of convergence that are \ufb01ne-tuned not just to the\nspeci\ufb01c distribution P but also to the speci\ufb01c query point. This is achieved by de\ufb01ning a margin,\nor advantage, at every point in the input space, and giving bounds (Theorem 1) entirely in terms of\nthis quantity. For parametric classi\ufb01cation, it has become common to de\ufb01ne a notion of margin that\ncontrols generalization. In the nonparametric setting, it makes sense that the margin would in fact\nbe a function X! R, and would yield different generalization error bounds in different regions of\nspace. Our adaptive nearest neighbor classi\ufb01er allows us to realize this vision in a fairly elementary\nmanner.\nThe advantages of setting k locally have been pointed out and quanti\ufb01ed in recent work on non-\nparametric regression [DGKL94, CS18], notably that of [Kpo11]. Although it is common to reduce\nclassi\ufb01cation to regression in nonparametric analysis, the right choice of k may be fundamentally\ndifferent in the two settings. This is re\ufb02ected in the difference between our setting for k and that\nof [Kpo11]; for instance, the physical value of the radius containing k points matters in that work\nwhile playing no role in ours. Moreover, the bene\ufb01t of local adaptivity may be more pronounced for\nclassi\ufb01cation than for regression. Our analysis shows, for instance, that there is a radius rx around\neach point x such that prediction based on training points in B(x, rx) will with high probability be\nperfect, provided there are enough such points. This is not true of regression, where the target y is a\nreal value and thus the radius needs to keep shrinking.\n\nOrganization. 
Most proofs are relegated to the appendices.\nIn Section 2, we introduce the formal model of learning and define some basic geometric notions, as a prelude to presenting the adaptive k-NN algorithm in Section 3. In Sections 4 and 5 and Appendix A, we state and prove consistency and generalization bounds for this classifier, and compare them with prior work in the k-NN literature. Our bounds exploit a general VC-based uniform convergence statement which is presented in Section 6 and proved in a self-contained manner in Appendix B.\n\n2 Setup\n\nTake the instance space to be a separable metric space (X, d) and the label space to be Y = {±1}. All data are assumed to be drawn i.i.d. from a fixed unknown distribution P over X × Y.\nLet µ denote the marginal distribution on X: if (X, Y) is a random draw from P, then\n\nµ(S) = Pr(X ∈ S)\n\nfor any measurable set S ⊆ X. For any x ∈ X, the conditional expectation, or bias, of Y given x, is\n\nη(x) = E(Y | X = x) ∈ [−1, 1].\n\nSimilarly, for any measurable set S with µ(S) > 0, the conditional expectation of Y given X ∈ S is\n\nη(S) = E(Y | X ∈ S) = (1/µ(S)) ∫_S η(x) dµ(x).\n\nThe risk of a classifier g : X → {−1, +1, ?} is the probability that it is incorrect on pairs (X, Y) ∼ P,\n\nR(g) = P({(x, y) : g(x) ≠ y}).   (2)\n\nThe Bayes-optimal classifier g*, as given in (1), depends only on η, but its risk R* depends on µ. For a classifier gn based on n training points from P, we will be interested in whether R(gn) converges to R*, and the rate at which this convergence occurs.\nThe algorithm and analysis in this paper depend heavily on the probability masses and biases of balls in X. 
For x ∈ X and r ≥ 0, let B(x, r) denote the closed ball of radius r centered at x,\n\nB(x, r) = {z ∈ X : d(x, z) ≤ r}.\n\nFor 0 ≤ p ≤ 1, let r_p(x) be the smallest radius r such that B(x, r) has probability mass at least p, that is,\n\nr_p(x) = inf{r ≥ 0 : µ(B(x, r)) ≥ p}.   (3)\n\nIt follows that µ(B(x, r_p(x))) ≥ p.\nThe support of the marginal distribution µ plays an important role in convergence proofs and is formally defined as\n\nsupp(µ) = {x ∈ X : µ(B(x, r)) > 0 for all r > 0}.\n\nIt is a well-known consequence of the separability of X that µ(supp(µ)) = 1 [CH67].\n\n3 The adaptive k-nearest neighbor algorithm\n\nThe algorithm is given a labeled training set (x1, y1), . . . , (xn, yn) ∈ X × Y. Based on these points, it is able to compute empirical estimates of the probabilities and biases of different balls.\nFor any set S ⊆ X, we define its empirical count and probability mass as\n\n#n(S) = |{i : xi ∈ S}|,  µn(S) = #n(S)/n.   (4)\n\nIf this is non-zero, we take the empirical bias to be\n\nηn(S) = (Σ_{i : xi ∈ S} yi) / #n(S).   (5)\n\nThe adaptive k-NN algorithm (AKNN) is shown in Figure 2. It makes a prediction at x by growing a ball around x until the ball has significant bias, and then choosing the corresponding label. In some cases, a ball of sufficient bias may never be obtained, in which event “?” is returned. In what follows, let gn : X → {−1, +1, ?} denote the AKNN classifier.\nLater, we will also discuss a variant of this algorithm in which a modified confidence interval,\n\nΔ(n, k, δ) = c1 √((d0 log n + log(1/δ)) / k),   (7)\n\nis used, where d0 is the VC dimension of the family of balls in (X, d).\nIn comparing the algorithm of Figure 2 to standard k-nearest neighbor classification, it might at first glance seem that we have merely replaced one parameter (k) with another (δ). 
This is not accurate.\nOur δ is the customary confidence parameter of statistics and learning theory: it provides an upper bound on the failure probability of the algorithm. It can be set to 0.05, for instance. The algorithm makes infinitely many parameter choices—it sets k for each query point—and asks for just a single failure probability that lets it know how aggressively to set its confidence intervals.\n\nGiven:\n\n• training set (x1, y1), . . . , (xn, yn) ∈ X × {±1}\n• confidence parameter 0 < δ < 1\n\nTo predict at x ∈ X:\n\n• For any integer k, let Bk(x) denote the smallest ball centered at x that contains exactly k training points. (a)\n• Find the smallest 0 < k ≤ n for which Bk(x) has a significant bias: that is, |ηn(Bk(x))| > Δ(n, k, δ), where\n\nΔ(n, k, δ) = c1 √((log n + log(1/δ)) / k).   (6)\n\n• If there exists such a ball, return label sign(ηn(Bk(x))).\n• If no such ball exists: return “?”\n\n(a) When several points have the same distance to x, there might be some values of k for which Bk(x) is undefined. Our algorithm skips such values of k.\n\nFigure 2: The adaptive k-NN (AKNN) classifier. The absolute constant c1 is from Lemma 7.\n\n4 Pointwise advantage and rates of convergence\n\nWe now provide finite-sample rates of convergence for the adaptive nearest neighbor rule. For simplicity, we give convergence rates that are specific to any query point x and that depend on a suitable notion of the “margin” of distribution P around x.\nPick any p, γ > 0. 
Recalling definition (3), we say a point x ∈ X is (p, γ)-salient if the following holds for either s = +1 or s = −1:\n\n• sη(x) > 0, and sη(B(x, r)) > 0 for all r ∈ [0, r_p(x)), and sη(B(x, r_p(x))) ≥ γ.\n\nIn words, this means that g*(x) = s (recall that g* is the Bayes classifier), that the biases of all balls of radius ≤ r_p(x) around x have the same sign as s, and that the bias of the ball of radius r_p(x) has absolute value at least γ. A point x can satisfy this definition for a variety of pairs (p, γ). The advantage of x is taken to be the largest value of pγ² over all such pairs:\n\nadv(x) = sup{pγ² : x is (p, γ)-salient} if η(x) ≠ 0, and adv(x) = 0 if η(x) = 0.   (8)\n\nWe will see (Lemma 3) that under a mild condition on the underlying metric measure space, almost all x with η(x) ≠ 0 have a positive advantage.\n\n4.1 Advantage-based finite-sample bounds\n\nWe now state two generalization bounds for the adaptive nearest neighbor classifier. The first holds pointwise—it bounds the probability of error at a specific point x—while the second is the type of uniform convergence bound that is more standard in learning theory.\nThe following theorem shows that for every point x, if the sample size n satisfies n ≳ 1/adv(x), then the label of x is likely to be g*(x), where g* is the Bayes optimal classifier. This provides pointwise convergence of g(x) to g*(x) at a rate that is sensitive to the local geometry of x.\nTheorem 1 (Pointwise convergence rate). There is an absolute constant C > 0 for which the following holds. Let 0 < δ < 1 denote the confidence parameter in the AKNN algorithm (Figure 2), and suppose the algorithm is used to define a classifier gn based on n training points chosen i.i.d. from P. 
Then, for every point x ∈ supp(µ), if\n\nn ≥ (C / adv(x)) · max(log(1/adv(x)), log(1/δ)),\n\nthen with probability at least 1 − δ we have that gn(x) = g*(x).\nIf we further assume that the family of all balls in the space has finite VC dimension d0 then we can strengthen the guarantee to hold with high probability simultaneously for all x ∈ supp(µ). This is achieved by a modified version of the algorithm that uses confidence interval (7) instead of (6).\nTheorem 2 (Uniform convergence rate). Suppose that the set of balls in (X, d) has finite VC dimension d0, and that the algorithm of Figure 2 uses confidence interval (7) instead of (6). Then, with probability at least 1 − δ, the resulting classifier gn satisfies the following: for every point x ∈ supp(µ), if\n\nn ≥ (C / adv(x)) · max(log(1/adv(x)), log(1/δ)),\n\nthen gn(x) = g*(x).\nA key step towards proving Theorems 1 and 2 is to identify the subset of X that is likely to be correctly classified for a given number of training points n. This follows the rough outline of [CD14], which gave rates of convergence for k-nearest neighbor, but there are two notable differences. First, we will see that the likely-correct sets obtained in that earlier work (for k-NN) are, roughly, subsets of those we obtain for the new adaptive nearest neighbor procedure. 
Second, the proof for our setting is considerably more streamlined; for instance, there is no need to devise tie-breaking strategies for deciding the identities of the k nearest neighbors.\n\n4.2 A comparison with k-nearest neighbor\n\nFor a ≥ 0, let Xa denote all points with advantage greater than a:\n\nXa = {x ∈ supp(µ) : adv(x) > a}.   (9)\n\nIn particular, X0 consists of all points with positive advantage.\nBy Theorem 1, points in Xa are likely to be correctly classified when the number of training points is Ω̃(1/a), where the Ω̃(·) notation ignores logarithmic terms. In contrast, the work of [CD14] showed that with n training points, the k-NN classifier is likely to correctly classify the following set of points:\n\nX′_{n,k} = {x ∈ supp(µ) : η(x) > 0, η(B(x, r)) ≥ k^{−1/2} for all 0 ≤ r ≤ r_{k/n}(x)} ∪ {x ∈ supp(µ) : η(x) < 0, η(B(x, r)) ≤ −k^{−1/2} for all 0 ≤ r ≤ r_{k/n}(x)}.\n\nSuch points are (k/n, k^{−1/2})-salient and thus have advantage at least 1/n. In fact,\n\n∪_{1 ≤ k ≤ n} X′_{n,k} ⊆ X_{1/n}.\n\nIn this sense, the adaptive nearest neighbor procedure is able to perform roughly as well as all choices of k simultaneously. This is not a precise statement because of logarithmic factors (the sample complexity in Theorem 1 is (1/a) log(1/a) rather than 1/a), and the resulting gap can be seen in our experiments.\n\n5 Universal consistency\n\nIn this section we study the convergence of R(gn) to the Bayes risk R* as the number of points n grows. An estimator is described as universally consistent in a metric measure space (X, d, µ) if it has this desired limiting behavior for all conditional expectation functions η.\nEarlier work [CD14] established the universal consistency of k-nearest neighbor (for k/n → 0 and k/(log n) → ∞) in any metric measure space that satisfies the Lebesgue differentiation condition: that is, for any bounded measurable f : X → 
R and for almost all (µ-a.e.) x ∈ X,\n\nlim_{r↓0} (1/µ(B(x, r))) ∫_{B(x,r)} f dµ = f(x).   (10)\n\nThis is known to hold, for instance, in any finite-dimensional normed space or any doubling metric space [Hei01, Chapter 1].\nWe will now see that this same condition implies the universal consistency of the adaptive nearest neighbor rule. To begin with, it implies that almost every point has a positive advantage.\nLemma 3. Suppose metric measure space (X, d, µ) satisfies condition (10). Then, for any conditional expectation η, the set of points\n\n{x ∈ X : η(x) ≠ 0, adv(x) = 0}\n\nhas zero µ-measure.\n\nProof. Let X′ ⊆ X consist of all points x ∈ supp(µ) for which condition (10) holds true with f = η, that is, lim_{r↓0} η(B(x, r)) = η(x). Since µ(supp(µ)) = 1, it follows that µ(X′) = 1.\nPick any x ∈ X′ with η(x) ≠ 0; without loss of generality, η(x) > 0. By (10), there exists r_o > 0 such that\n\nη(B(x, r)) ≥ η(x)/2 for all 0 ≤ r ≤ r_o.\n\nThus x is (p, γ)-salient for p = µ(B(x, r_o)) > 0 and γ = η(x)/2, and has positive advantage.\n\nUniversal consistency follows as a consequence; the proof details are deferred to Appendix A.\nTheorem 4 (Universal consistency). Suppose the metric measure space (X, d, µ) satisfies condition (10). Let (δn) be a sequence in [0, 1] with (1) Σn δn < ∞ and (2) lim_{n→∞} (log(1/δn))/n = 0. Let the classifier g_{n,δn} : X → {−1, +1, ?} be the result of applying the AKNN procedure (Figure 2) with n points chosen i.i.d. from P and with confidence parameter δn. Letting Rn = R(g_{n,δn}) denote the risk of g_{n,δn}, we have Rn → R* almost surely.\n\n6 Uniform convergence of empirical conditional measures\n\nA key piece of our analysis is a uniform convergence bound for empirical estimates of conditional probabilities. 
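For a single pair of events A and B, the empirical conditional estimate and a Chernoff-style band of width √(2 log(1/δ) / #n(B)) are easy to check by simulation. The distribution and events below are our own illustrative choices, not from the paper:

```python
import numpy as np

# X ~ Uniform[0, 1], B = {x < 0.5}, A = {x < 0.25}, so P(A|B) = 0.5 exactly.
rng = np.random.default_rng(1)
n, delta = 5000, 0.05
x = rng.uniform(size=n)
in_B = x < 0.5
in_AB = x < 0.25                      # A ∩ B equals A here, since A ⊆ B
count_B = int(in_B.sum())             # #n(B)
p_hat = in_AB.sum() / count_B         # Pn(A|B)
band = np.sqrt(2 * np.log(1 / delta) / count_B)
print(p_hat, band)                    # typically |p_hat - 0.5| is well below band
```

The point of this section is that the naive uniform version of such a band over VC families A and B is false, and Theorem 5 instead gives the slightly wider width √(k_o / #n(B)).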
We now discuss this bound in an abstract setting; further details are in Appendix B.\nLet P be a distribution over some space X, and let A, B be two collections of events. Let x1, . . . , xn be independent samples from P. We would like to use these to estimate P(A|B) simultaneously for all A ∈ A, B ∈ B. It is natural to consider the empirical estimates:\n\nPn(A|B) = (Σi 1[xi ∈ A ∩ B]) / (Σi 1[xi ∈ B]).\n\nWe study the approximation error of these estimates. Note that the case where B = {X} (i.e., in which one estimates P(A) using Pn(A) simultaneously for all A ∈ A) is handled by the classical VC theory. Let us assume that both A, B have VC dimension upper-bounded by some d0.\nTo demonstrate the kinds of statements we would like, consider the case where each of A, B contains only one event: A = {A}, and B = {B}, and set #n(B) = Σi 1[xi ∈ B]. A Chernoff bound implies that conditioned on the event that #n(B) > 0, the following holds with probability at least 1 − δ:\n\n|P(A|B) − Pn(A|B)| ≤ √(2 log(1/δ) / #n(B)).   (11)\n\nThis bound depends on #n(B) and is thus data-dependent. To derive it, use that conditioned on xi ∈ B, the event xi ∈ A has probability P(A|B), so the random variable #n(B) · Pn(A|B) has a binomial distribution with parameters #n(B) and P(A|B).\nWe would want to prove a uniform version of (11), of the form: with probability at least 1 − δ,\n\n(∀A ∈ A) (∀B ∈ B) : |P(A|B) − Pn(A|B)| ≤ O(√(d0 log(1/δ) / #n(B))).\n\nBut as we explain in the appendix, this is unfortunately false. Instead, we prove the following (slightly weaker) variant:\n\nTheorem 5 (UCECM). Let P be a probability distribution over X, and let A, B be two families of measurable subsets of X such that VC(A), VC(B) ≤ d0. Let n ∈ N, and let x1, . . . , xn be n i.i.d. samples from P. 
Then the following event occurs with probability at least 1 − δ:\n\n(∀A ∈ A) (∀B ∈ B) : |P(A|B) − Pn(A|B)| ≤ √(ko / #n(B)),\n\nwhere ko = 1000 (d0 log(8n) + log(4/δ)) and #n(B) = Σ_{i=1}^{n} 1[xi ∈ B].\n\n7 Experiments\n\nFigure 3: Effect of label noise on k-NN and AKNN. Performance on MNIST for different levels of random label noise p and for different values of k. Each line in the figure on the left (a) represents the performance of k-NN as a function of k for a given level of noise. The optimal choice of k increases with the noise level, and the performance degrades severely for too-small k. The table (b) shows that AKNN, with a fixed value of A, performs almost as well as k-NN with the optimal choice of k.\n\nWe performed a few experiments using real-world data sets from computer vision and genomics (see Section C). These were conducted with some practical alterations to the algorithm of Fig. 2.\nMulticlass extension: Suppose the set of possible labels is Y. We replace the binary rule “find the smallest k such that |ηn(Bk(x))| > Δ(n, k, δ)” with the rule: “find the smallest k such that η^y_n(Bk(x)) − 1/|Y| > Δ(n, k, δ) for some y ∈ Y, where η^y_n(S) = #n{xi ∈ S and yi = y} / #n(S).”\n\nAt left: performance of AKNN on notMNIST for different settings of the confidence parameter (A = 1, 3, 9), as a function of the neighborhood size. For each confidence level we show two graphs: an accuracy graph (solid line) and a coverage line (dashed line). For each value of k we plot the accuracy and the coverage of AKNN which is restricted to using a neighborhood size of at most k. Increasing A generally causes an increase in the accuracy and a decrease in coverage. Larger values of A cause AKNN to have coverage zero for values of k that are too small. For comparison, we plot the performance of k-NN as a function of k. The highest accuracy 
The highest accuracy\n(\u21e1 0.88) is achieved for k = 10 (dotted hori-\nzontal line), and is surpassed by AKNN with\nhigh coverage (100% for A = 1).\n\nFigure 4: Performance of AKNN on notMNIST. See also Figure 5.\n\n8\n\n\fFigure 5: A visualization of the performance of AKNN on notMNIST. (a) The correct labels, with\nprediction errors of AKNN (A = 4) highlighted. (b) The value of k chosen by the algorithm when\npredicting each datapoint.\n\nn(Bk(x))/pk is largest.\n\nParametrization: We replace Equation (6) with = Apk , where A is a con\ufb01dence parameter\ncorresponding to the theory\u2019s (given n).\nResolving multilabel predictions: Our algorithm can output answers that are not a single label. The\noutput can be \u201c?\u201d, which indicates that no label has suf\ufb01cient evidence. It can also be a subset of Y\nthat contains more than one element, indicating that more than one label has signi\ufb01cant evidence. In\nsome situations, using subsets of the labels is more informative. However, when we want to compare\nhead-to-head with k-NN, we need to output a single label. We use a heuristic to predict with a single\nlabel y 2Y on any x: the label for which maxk \u2318y\nWe brie\ufb02y discuss our main conclusions from the experiments, with more details in Appendix C.\nAKNN is comparable to the best k-NN rule. In Section 4.2 we prove that AKNN compares\nfavorably to k-NN with any \ufb01xed k. We demonstrate this in practice in different situations. With\nsimulated independent label noise on the MNIST dataset (Fig. 3), a small value of k is optimal for\nnoiseless data, but performs very poorly when the noise level is high. On the other hand, AKNN\nadapts to the local noise level automatically, as demonstrated without adding noise on the more\nchallenging notMNIST and single-cell genomics data (Fig. 4, 5, 6).\nVarying the con\ufb01dence parameter A controls abstaining. 
The parameter A controls how conservative the algorithm is in deciding to abstain, instead of incurring error by predicting. A → 0 represents the most aggressive setting, in which the algorithm never abstains, essentially predicting according to a 1-NN rule. Higher settings of A cause the algorithm to abstain on some of these predicted points, for which there is no sufficiently small neighborhood with a sufficiently significant label bias (Fig. 7).\nAdaptively chosen neighborhood sizes reflect local confidence. The number of neighbors chosen by AKNN is a local quantity that gives a practical pointwise measure of the confidence associated with label predictions. Small neighborhoods are chosen when one label is measured as significant nearly as soon as statistically possible; by definition of the AKNN stopping rule, this is not true where large neighborhoods are necessary. In our experiments, performance on points with significantly higher neighborhood sizes dropped monotonically, with the majority of the data set having performance significantly exceeding the best k-NN rule over a range of settings of A (Fig. 4, 6; Appendix C).\n\nReferences\n\n[AT07] J.-Y. Audibert and A.B. Tsybakov. Fast learning rates for plug-in classifiers. Annals of Statistics, 35(2):608–633, 2007.\n[BBL05] S. Boucheron, O. Bousquet, and G. Lugosi. Theory of classification: a survey of some recent advances. ESAIM: Probability and Statistics, 9:323–375, 2005.\n[C+18] Tabula Muris Consortium et al. Single-cell transcriptomics of 20 mouse organs creates a Tabula Muris. Nature, 562(7727):367, 2018.\n[CD10] K. Chaudhuri and S. Dasgupta. Rates of convergence for the cluster tree. In Advances in Neural Information Processing Systems, pages 343–351, 2010.\n[CD14] K. Chaudhuri and S. Dasgupta. Rates of convergence for nearest neighbor classification. In Advances in Neural Information Processing Systems, pages 3437–3445, 2014.\n[CG06] F. Cerou and A. Guyader. Nearest neighbor classification in infinite dimension. ESAIM: Probability and Statistics, 10:340–355, 2006.\n[CH67] T. Cover and P.E. Hart. Nearest neighbor pattern classification. IEEE Transactions on Information Theory, 13:21–27, 1967.\n[CS18] G.H. Chen and D. Shah. Explaining the Success of Nearest Neighbor Methods in Prediction. Foundations and Trends in Machine Learning. NOW Publishers, 2018.\n[DCL11] W. Dong, M. Charikar, and K. Li. Efficient k-nearest neighbor graph construction for generic similarity measures. In Proceedings of the 20th International Conference on World Wide Web, pages 577–586. ACM, 2011.\n[DGKL94] L. Devroye, L. Györfi, A. Krzyzak, and G. Lugosi. On the strong universal consistency of nearest neighbor regression function estimates. Annals of Statistics, 22:1371–1385, 1994.\n[DGL96] L. Devroye, L. Györfi, and G. Lugosi. A Probabilistic Theory of Pattern Recognition. Springer, 1996.\n[Dud79] R.M. Dudley. Balls in R^k do not cut all subsets of k + 2 points. Advances in Mathematics, 31(3):306–308, 1979.\n[FH51] E. Fix and J. Hodges. Discriminatory analysis, nonparametric discrimination. USAF School of Aviation Medicine, Randolph Field, Texas, Project 21-49-004, Report 4, Contract AD41(128)-31, 1951.\n[Gyö81] L. Györfi. The rate of convergence of k_n-NN regression estimates and classification rules. IEEE Transactions on Information Theory, 27(3):362–364, 1981.\n[Hei01] J. Heinonen. Lectures on Analysis on Metric Spaces. Springer, 2001.\n[KP95] S. Kulkarni and S. Posner. Rates of convergence of nearest neighbor estimation under arbitrary sampling. IEEE Transactions on Information Theory, 41(4):1028–1039, 1995.\n[Kpo11] S. Kpotufe. k-NN regression adapts to local intrinsic dimension. In Neural Information Processing Systems, 2011.\n[MNI96] MNIST dataset. http://yann.lecun.com/exdb/mnist/, 1996.\n[Mou18] Mouse cell atlas dataset. ftp://ngs.sanger.ac.uk/production/teichmann/BBKNN/MouseAtlas.zip, 2018. Accessed: 2019-05-02.\n[MT99] E. Mammen and A.B. Tsybakov. Smooth discrimination analysis. The Annals of Statistics, 27(6):1808–1829, 1999.\n[not11] notMNIST dataset. http://yaroslavb.com/upload/notMNIST/, 2011. Accessed: 2019-05-02.\n[RS98] M. Raab and A. Steger. Balls into bins: a simple and tight analysis. In Randomization and Approximation Techniques in Computer Science (RANDOM'98), Barcelona, Spain, pages 159–170, 1998.\n[Sto77] C. Stone. Consistent nonparametric regression. Annals of Statistics, 5:595–645, 1977.\n[VC71] V.N. Vapnik and A.Y. Chervonenkis. On the uniform convergence of relative frequencies of events to their probabilities. Theory of Probability & Its Applications, 16(2):264–280, 1971.", "award": [], "sourceid": 4134, "authors": [{"given_name": "Akshay", "family_name": "Balsubramani", "institution": "Stanford"}, {"given_name": "Sanjoy", "family_name": "Dasgupta", "institution": "UC San Diego"}, {"given_name": "yoav", "family_name": "Freund", "institution": "UCSD"}, {"given_name": "Shay", "family_name": "Moran", "institution": "Google AI Princeton"}]}