{"title": "Rates of Convergence for Large-scale Nearest Neighbor Classification", "book": "Advances in Neural Information Processing Systems", "page_first": 10769, "page_last": 10780, "abstract": "Nearest neighbor is a popular class of classification methods with many desirable properties. For a large data set which cannot be loaded into the memory of a single machine due to computation, communication, privacy, or ownership limitations, we consider the divide and conquer scheme: the entire data set is divided into small subsamples, on which nearest neighbor predictions are made, and then a final decision is reached by aggregating the predictions on subsamples by majority voting. We name this method the big Nearest Neighbor (bigNN) classifier, and provide its rates of convergence under minimal assumptions, in terms of both the excess risk and the classification instability, which are proven to be the same rates as the oracle nearest neighbor classifier and cannot be improved. To significantly reduce the prediction time that is required for achieving the optimal rate, we also consider the pre-training acceleration technique applied to the bigNN method, with proven convergence rate. We find that in the distributed setting, the optimal choice of the neighbor k should scale with both the total sample size and the number of partitions, and there is a theoretical upper limit for the latter. 
Numerical studies have verified the theoretical findings.", "full_text": "Rates of Convergence for Large-scale Nearest\n\nNeighbor Classi\ufb01cation\n\nXingye Qiao\n\nDepartment of Mathematical Sciences\n\nBinghamton University\n\nNew York, USA\n\nqiao@math.binghamton.edu\n\nJiexin Duan\n\nDepartment of Statistics\n\nPurdue University\n\nWest Lafayette, Indiana, USA\n\nduan32@purdue.edu\n\nGuang Cheng\n\nDepartment of Statistics\n\nPurdue University\n\nWest Lafayette, Indiana, USA\n\nchengg@purdue.edu\n\nAbstract\n\nNearest neighbor is a popular class of classi\ufb01cation methods with many desirable\nproperties. For a large data set which cannot be loaded into the memory of a single\nmachine due to computation, communication, privacy, or ownership limitations,\nwe consider the divide and conquer scheme: the entire data set is divided into\nsmall subsamples, on which nearest neighbor predictions are made, and then a\n\ufb01nal decision is reached by aggregating the predictions on subsamples by majority\nvoting. We name this method the big Nearest Neighbor (bigNN) classi\ufb01er, and\nprovide its rates of convergence under minimal assumptions, in terms of both the\nexcess risk and the classi\ufb01cation instability, which are proven to be the same rates\nas the oracle nearest neighbor classi\ufb01er and cannot be improved. To signi\ufb01cantly\nreduce the prediction time that is required for achieving the optimal rate, we also\nconsider the pre-training acceleration technique applied to the bigNN method, with\nproven convergence rate. We \ufb01nd that in the distributed setting, the optimal choice\nof the neighbor k should scale with both the total sample size and the number of\npartitions, and there is a theoretical upper limit for the latter. Numerical studies\nhave veri\ufb01ed the theoretical \ufb01ndings.\n\n1\n\nIntroduction\n\nIn this article, we study binary classi\ufb01cation for large-scale data sets. 
Nearest neighbor (NN) is a very popular class of classification methods. The kNN method searches for the k nearest neighbors of a query point x and classifies it into the majority class among those k neighbors. NN methods do not require sophisticated training that involves optimization, but are memory-based (all the data are loaded into memory when predictions are made). In the era of big data, the volume of data is growing at an unprecedented rate. Yet computing power is limited by space and time, and it may not keep pace with the growth of the data volume. For many applications, one of the main challenges of NN is that it is impossible to load the data into the memory of a single machine [36]. In addition to the memory limitation, there are other important concerns. For example, if the data are collected and stored at several distant locations, it is challenging to transmit the data to a central location. Moreover, privacy and ownership issues may prohibit sharing local data with other locations. Therefore, a distributed approach which avoids transmitting or sharing the local raw data is appealing.\n\n33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.\n\nThere are new algorithms which are designed to allow analyzing data in a distributed manner. For example, [36] proposed distributed algorithms to approximate the nearest neighbors in a scalable fashion. However, their approach requires a subtle and careful design of the algorithm, which is not generally accessible to most users, who may work on many different platforms. In light of this, a more general, simple and user-friendly approach is the divide-and-conquer strategy. Assume that the total sample size in the training data is N. We divide the data set into s subsets (or the data are already collected from s locations and stored in s machines to begin with). For simplicity, we assume that each subset has size n, so that N = sn. 
We may allow the number of subsets s to grow with the total sample size N, in the fashion of s = N^\u03b3. For each query point, kNN classification is conducted on each subset, and these local predictions are pooled together by majority voting. Moreover, since the kNN predictions are made locally, and only the predictions (instead of the raw data) are transmitted to the central location, this approach is more efficient in terms of communication cost, and it reduces (if not eliminates) much of the privacy and ownership concerns. The resulting classifier, which we coin the big Nearest Neighbor (bigNN) classifier, can be easily implemented by a variety of users on different platforms with minimal effort. We point out that this is not a new idea, as it is essentially an ensemble classifier using the majority vote combiner [10]. Moreover, a few recent works on distributed regression and principal component analysis follow this direction [46, 11, 5, 32, 18, 38, 47]. However, distributed classification results are much less developed, especially in terms of the statistical performance of the ensemble learner. Our approach is different from bagging (or the bootstrap in general) [7], another type of ensemble estimator. Bagging was historically proposed to enhance prediction accuracy by reducing variance and to conduct statistical inference even when the sample size is not large enough. These are not our concerns here. Our algorithm is motivated by the need to maintain data decentralisation/privacy and to enhance speed, when the sample size is too large.\n\nIn practice, NN methods are often implemented by algorithms capable of dealing with large-scale data sets, such as the kd-tree [6] or random partition trees [14]. However, even these methods have limitations. 
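The divide-and-conquer scheme just described can be sketched in a few lines. The following is an illustrative NumPy implementation (function names such as `knn_predict` and `big_nn_predict` are ours, not from the paper), assuming the s subsamples are given as a list of (X_j, y_j) arrays:\n\n```python
import numpy as np

def knn_predict(X_train, y_train, x, k):
    """Local kNN vote: average the labels of the k nearest neighbors of x."""
    d = np.linalg.norm(X_train - x, axis=1)
    nn = np.argsort(d)[:k]
    return 1 if y_train[nn].mean() > 0.5 else 0

def big_nn_predict(subsamples, x, k):
    """bigNN: majority vote over the s local kNN predictions."""
    votes = [knn_predict(Xj, yj, x, k) for Xj, yj in subsamples]
    return 1 if np.mean(votes) > 0.5 else 0
```\n\nOnly the s binary votes travel to the aggregation step, which is what keeps the communication cost low.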
As the de\ufb01nition of \u201clarge-scale\u201d evolves, it would be good to know whether the divide-and-conquer scheme described above may be used in conjunction with these methods. There are related methods such as those that rely on efficient kNNs or approximate nearest neighbor (ANN) methods [2, 1, 26, 39]. However, little theoretical understanding of the classification accuracy of these approximate methods has been obtained (with rare exceptions like [21]).\n\nSome quantization strategies [30, 28] have been proposed recently to scale up NNs to large datasets. They often start with an r-net, which is a collection of data points that quantize the training data. The average response value or the majority class label of those training points that fall into each cell is then assigned to that cell. This is quite similar to the denoising scheme in Xue and Kpotufe [44], in which quantization is achieved by random subsampling. However, all these quantization schemes have a heavy computational burden in terms of preprocessing: for very large training data, assigning the weights for each cell will be as difficult as predicting the class label of a query point using kNN. From this perspective, we propose a denoised bigNN algorithm to shorten the preprocessing time of quantization-based approaches without sacrificing the accuracy.\n\nThe asymptotic consistency and convergence rates of NN classification have been studied in detail. See, for example, Fix and Hodges Jr [19], Cover and Hart [12], Devroye et al. [15], Chaudhuri and Dasgupta [9]. However, there has been little theoretical understanding of bigNN classification. In particular, one may ask whether bigNN classification performs as well as the oracle NN method, that is, the NN method applied to the entire data set (which is difficult in practice due to the aforementioned constraints and limitations, hence the name \u201coracle\u201d). 
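The quantization idea mentioned above can be illustrated with a one-dimensional r-grid (a minimal sketch under our own naming; a real r-net is constructed in a general metric space rather than on a regular grid):\n\n```python
import numpy as np

def quantize_labels(X, y, r):
    """Assign to each cell of an r-grid the majority class label of the
    training points that fall into that cell."""
    cells = np.floor(np.asarray(X) / r).astype(int)
    table = {}
    for c, label in zip(cells, y):
        table.setdefault(c, []).append(label)
    # majority vote within each cell
    return {c: int(np.mean(v) > 0.5) for c, v in table.items()}
```\n\nAt prediction time a query is classified by the label of its cell, so the per-query cost no longer depends on N; the expense is the preprocessing pass over the whole training set, which is exactly the burden discussed above.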
To our knowledge, our work is the first to address the classification accuracy of the ensemble NN method from a statistical learning theory point of view, and to establish its relation to that of its oracle counterpart. Much progress has been made for the latter. Cover [13], Wagner [43], and Fritz [20] provided distribution-free convergence rates for NN classifiers. Later works [31, 23] gave rates of convergence in terms of the smoothness of the class conditional probability \u03b7, such as the H\u00f6lder condition. Recently, Chaudhuri and Dasgupta [9] studied the convergence rate under a condition on \u03b7 more general than the H\u00f6lder condition. In particular, a smoothness measure was proposed which quantifies the change in \u03b7 with respect to probability mass rather than distance. More recently, Kontorovich and Weiss [29] proposed a strongly Bayes-consistent margin-regularized 1-NN; Kontorovich et al. [28] proved that a sample-compressed 1-NN based multiclass learning algorithm is Bayes consistent. In addition, under the so-called margin condition, and assumptions on the density function of the covariates, Audibert and Tsybakov [3] showed a faster rate of convergence. See Kohler and Krzyzak [27] for results without assuming that the density exists. Some other related works on NN methods include Hall et al. [24], which gave an asymptotic regret formula in terms of the number of neighbors, and Samworth [37], which gave a similar formula in terms of the weights for a weighted nearest neighbor classifier. Sun et al. [40] took the stability measure into consideration and proposed a classification instability (CIS) measure. They gave an asymptotic formula for the CIS of the weighted NN classifier.\n\nIn this article, we give the convergence rate of the bigNN method under the smoothness condition for \u03b7 established by Chaudhuri and Dasgupta [9], and the margin condition. 
It turns out that this rate is the same as that of the oracle NN method. That is, by divide and conquer, one does not lose classification accuracy in terms of the convergence rate. We show that the rate has a minimax property: under some density assumptions, this rate cannot be improved. We find that the optimal choice of the number of neighbors k must scale with the overall sample size and the number of splits, and that there is an upper limit on how many splits one may use. To further shorten the prediction time, we study the use of the denoising technique [44], which allows a significant reduction of the prediction time at a negligible loss in accuracy under certain conditions, which are related to the upper bound on the number of splits. Lastly, we verify the results using an extensive simulation study. As a side product, we also show that the convergence rate of the CIS for the bigNN method is the same as that for the oracle NN method, which is a sharp rate previously proven. All these theoretical results hold as long as the number of divisions does not grow too fast, i.e., slower than some rate determined by the smoothness of \u03b7.\n\n2 Background and key assumptions\n\nLet (X, \u03c1) be a separable metric space. For any x \u2208 X, let Bo(x, r) and B(x, r) be the open and closed balls, respectively, of radius r centered at x. Let \u00b5 be a Borel regular probability measure on (X, \u03c1) from which X is drawn. 
We focus on binary classification in which Y \u2208 {0, 1}; given X = x, Y is distributed according to the class conditional probability function (also known as the regression function) \u03b7 : X \u21a6 [0, 1], defined as \u03b7(x) = P(Y = 1|X = x), where P is with respect to the joint distribution of (X, Y).\n\nBayes classifier, regret, and classification instability\n\nFor any classifier \u02dcg : X \u21a6 {0, 1}, the risk of \u02dcg is R = P(\u02dcg(X) \u2260 Y). The Bayes classifier, defined as g(x) = 1{\u03b7(x) > 1/2}, has the smallest risk among all measurable classifiers. The risk of the Bayes classifier is denoted as R\u2217 = P(g(X) \u2260 Y).\n\nThe excess risk of a classifier \u02dcg compared to the Bayes classifier is R \u2212 R\u2217, which is also called the regret of \u02dcg. Note that since the classifier \u02dcg is often driven by a training data set that is itself random, both the regret and the risk are random quantities. Hence, we may sometimes be interested in the expected value of the risk E_N R, where the expectation E_N is with respect to the distribution of the training data D.\n\nSometimes, we call the algorithm that maps a training data set D to the classifier function \u02dcg : X \u21a6 {0, 1} a \u201cclassifier\u201d. In this sense, classification instability (CIS) was proposed to measure how sensitive a classifier is to the sampling of the data. In particular, the CIS of a classifier is defined as\n\nCIS \u2261 E_{D1,D2}[P_X(\u03c61(X) \u2260 \u03c62(X) | D1, D2)],\n\nwhere \u03c61 and \u03c62 are the classification functions trained on D1 and D2, which are independent copies of the training data [40].\n\nBig Nearest Neighbor classifiers\n\nIn practice, we have a large training data set D = {(Xi, Yi), i = 1, . . . 
, N}, and it may be evenly divided into s = N^\u03b3 subsamples with n = N^{1\u2212\u03b3} observations in each. For any query point x \u2208 X, its k nearest neighbors in the jth subsample are found, and the average of their class labels Yi is denoted as \u02c6Y^{(j)}(x). Denote g^{(j)}_{n,k}(x) = 1{\u02c6Y^{(j)}(x) > 1/2} as the jth binary kNN classifier based on the jth subset. Finally, a majority voting scheme is carried out, so that the final bigNN classifier is g\u2217_{n,k,s}(x) = 1{(1/s) \u2211_{j=1}^{s} g^{(j)}_{n,k}(x) > 1/2}. In this article, we are interested in the risk of g\u2217_{n,k,s}, denoted as R\u2217_{n,k,s}, its corresponding regret, and its CIS.\n\nKey assumptions\n\nMany results in this article rely on the following commonly used assumptions. The (\u03b1, L)-smoothness assumption ensures the smoothness of the regression function \u03b7. In particular, \u03b7 is (\u03b1, L)-smooth if for all x, x\u2032 \u2208 X, there exist \u03b1, L > 0 such that\n\n|\u03b7(x) \u2212 \u03b7(x\u2032)| \u2264 L\u00b5(Bo(x, \u03c1(x, x\u2032)))^\u03b1.\n\nChaudhuri and Dasgupta [9] pointed out that this is more general than, and is closely related to, the H\u00f6lder condition when X = R^d (d is the dimension), which states that \u03b7 is \u03b1H-H\u00f6lder continuous if there exist \u03b1H, L > 0 such that |\u03b7(x) \u2212 \u03b7(x\u2032)| \u2264 L\u2016x \u2212 x\u2032\u2016^{\u03b1H}. Moreover, H\u00f6lder continuity implies (\u03b1, L)-smoothness, with the equality\n\n\u03b1 = \u03b1H \u00b7 d^{\u22121}.\n\n(1)\n\nThis transition formula will be useful in comparing our theoretical results with the existing ones that are based on the H\u00f6lder condition.\n\nThe second assumption is the popular margin condition [35, 41, 3]. 
The joint distribution of (X, Y) satisfies the \u03b2-margin condition if there exists C > 0 such that\n\nP(|\u03b7(X) \u2212 1/2| \u2264 t) \u2264 Ct^\u03b2, \u2200t > 0.\n\n3 Main results\n\nOur first main theorem concerns the regret of the bigNN classifier.\n\nTheorem 1. Set k = k_o n^{2\u03b1/(2\u03b1+1)} s^{\u22121/(2\u03b1+1)} \u2192 \u221e as N \u2192 \u221e, where k_o is a constant. Under the (\u03b1, L)-smoothness assumption of \u03b7 and the \u03b2-margin condition, we have\n\nE_N R_{n,k,s} \u2212 R\u2217 \u2264 C_0 N^{\u2212\u03b1(\u03b2+1)/(2\u03b1+1)}.\n\nThe rate of convergence here appears to be independent of the dimension d. However, the theorem can be stated in terms of the H\u00f6lder condition instead, which leads to the rate N^{\u2212\u03b1H(\u03b2+1)/(2\u03b1H+d)}, due to equality (1), which now depends on d. It would be insightful to compare the bound derived here for the bigNN classifier with the oracle NN classifier. Theorem 7 in [9] showed that, under almost the same assumptions, with a scaled choice of k in the oracle kNN method, the convergence rate for the oracle method is also N^{\u2212\u03b1(1+\u03b2)/(2\u03b1+1)}. This means that divide and conquer does not compromise the rate of convergence of the regret when it is used on the kNN classifier.\n\nAs a matter of fact, the best known rate among nonparametric classification methods under the margin assumption was N^{\u2212\u03b1H(1+\u03b2)/(2\u03b1H+d)} (Theorems 3.3 and 3.5 in [3]), which, according to (1), is the same rate as for bigNN here and for the oracle NN derived in [9]. In other words, the rate we have is sharp. It was proved that the optimal weighted nearest neighbor classifier (OWNN) [37], bagged nearest neighbor classifiers [25] and the stabilized nearest neighbor classifier [40] can achieve this rate. See Theorem 2 of the supplementary materials of Samworth [37] and Theorem 5 of Sun et al. 
[40].\n\nOur next theorem concerns the CIS of the bigNN classifier.\n\nTheorem 2. Set the same k as in Theorem 1 (k = k_o n^{2\u03b1/(2\u03b1+1)} s^{\u22121/(2\u03b1+1)}). Under the (\u03b1, L)-smoothness assumption and the \u03b2-margin condition, we have\n\nCIS(bigNN) \u2264 C_0 N^{\u2212\u03b1\u03b2/(2\u03b1+1)}.\n\nAgain, we remark that the best known rate of the CIS for a nonparametric classification method (oracle kNN included) is N^{\u2212\u03b1H\u03b2/(2\u03b1H+d)} (Theorem 5 in [40]), where \u03b1H is the power parameter in the H\u00f6lder condition. This is exactly the rate we derived here for the bigNN classifier, by noting (1).\n\nWe remark that the optimal number of neighbors for the oracle kNN is of the order N^{2\u03b1/(2\u03b1+1)}, while the optimal number of neighbors for each local classifier in bigNN is of the order n^{2\u03b1/(2\u03b1+1)} s^{\u22121/(2\u03b1+1)}, which is not equal to n^{2\u03b1/(2\u03b1+1)} (that is, the optimal choice of k for the oracle above with N replaced by n). In other words, the best choice of k in bigNN leads to suboptimal performance for each local classifier. However, due to the aggregation via majority voting, these suboptimal local classifiers nevertheless assemble into an optimal bigNN classifier.\n\nMoreover, k = k_o n^{2\u03b1/(2\u03b1+1)} s^{\u22121/(2\u03b1+1)} should grow as N grows. In view of the facts that s = N^\u03b3 and n = N^{1\u2212\u03b3}, this implies an upper bound on s. In particular, s should be less than N^{2\u03b1/(2\u03b1+1)}. Conceptually, there exist notions of bias due to small sample size and a bias/variance trade-off for ensembles. If s is too large and n too small, then the \u2018bias\u2019 of the base classifier on each subsample tends to increase, and it cannot be averaged away over the s subsamples.\n\nLastly, bigNN may be seen as comparable to the bagged NN method [25]. In that context, the sampling fraction is n/N = 1/s = N^{\u2212\u03b3} \u2192 0. 
[25] suggested that when the sampling fraction converges to 0 but the resample size diverges to infinity, the bagged NN converges to the Bayes rule. Our work gives a convergence rate, in addition to the fact that the regret vanishes as N grows.\n\n4 Pre-training acceleration by denoising\n\nWhile the oracle kNN method or the bigNN method can achieve significantly better performance than 1-NN when k and s are chosen properly, in practice many of the commercial tools for nearest neighbor search are optimized for 1-NN only. It is known that for statistical consistency, k should grow to infinity with N. This imposes a practical challenge for the oracle kNN, which must search for the k nearest neighbors in the training data, where k could potentially be a very large number. Even in a bigNN in which k at each subsample is set to 1, achieving statistical consistency requires s to grow with N. These facts naturally lead to the practical difficulty that the prediction time is very large for growing k or s in the presence of large-scale data sets. In Xue and Kpotufe [44], the authors proposed a denoising technique to shift the time complexity from the prediction stage to the training stage. In a nutshell, denoising pre-trains the data points in the training set by re-labelling each data point with its global kNN prediction (for a given k). After each data point is pre-trained, at the time of prediction, the nearest neighbors of the query point from among a small number (say I) of subsamples of the training data are identified, and the majority class among these 1-NNs becomes the final prediction for the query point. 
Note that under this acceleration scheme, at the prediction stage, one only needs to conduct a 1-NN search I times, each time over a subsample of size m \u226a N; hence the prediction time is significantly reduced, to almost the same as that of 1-NN. Xue and Kpotufe [44] further proved that, at some vanishing subsampling ratio, the denoised NN can achieve prediction accuracy of the same order as that of kNN. Note that denoising does not work by ignoring a lot of the data altogether, but by extracting the information ahead of time (during preprocessing) and not revisiting the entire data set later on, at the time of prediction.\n\nIn this section we consider using the same technique to accelerate the prediction time for large data sets in conjunction with the bigNN method. The pre-training step in Xue and Kpotufe [44] was based, by default, on the oracle kNN, which is not realistic for very large data sets. We consider using bigNN to conduct the pre-training, followed by the same 1-NN searches at the prediction stage. Our work and the work of [44] can be viewed as complementary to each other. As [44] shifted the computational time from the prediction stage to the training stage, we reduce the computational burden at the training stage by using bigNN instead of the oracle NN. A subtlety is that the denoising algorithm [44] performs distributed calculation during the prediction time only, while our denoised bigNN method in this section distributes the calculation during both the preprocessing stage and the prediction stage (in two different ways).\n\nDefinition 1. Denote Dsub as a subsample of the entire training data D with sample size m. Denote NN(x; Dsub) as the nearest neighbor of x among Dsub. The denoised bigNN classifier is g\u266f(x) = g\u2217_{n,k,s}(NN(x; Dsub)). Note that this is the same as the 1-NN prediction on a pre-trained subsample in which the data points are re-labeled using the bigNN classifier g\u2217_{n,k,s}.\n\nWe need additional assumptions to prove Theorem 3. We assume there exist some integer d\u2032, named the intrinsic dimension, and a constant Cd > 0, such that for all x \u2208 X and r > 0, \u00b5(B(x, r)) \u2265 Cd r^{d\u2032}. We will also use the VC dimension technique [42]. Although the proof makes use of some results in [44], the generalization is not trivial due to the majority voting aggregation in the bigNN classifier.\n\nTheorem 3. Let 0 < \u03b4 < 1. Assume VC dimension dvc, intrinsic dimension d\u2032 with constant Cd, the \u03b1H-H\u00f6lder continuity of \u03b7 and the \u03b2-margin condition. Then, with probability at least 1 \u2212 3\u03b4 over D,\n\nRegret of g\u266f \u2264 Regret of g\u2217_{n,k,s} + C (dvc log(m/\u03b4) / (m Cd))^{\u03b1H(\u03b2+1)/d\u2032}.\n\nUnder the H\u00f6lder condition, the regret of the bigNN classifier g\u2217_{n,k,s} has been established to be of the rate N^{\u2212\u03b1H(\u03b2+1)/(2\u03b1H+d)}. Theorem 3 suggests that the pre-training step introduces an additional error of the order m^{\u2212\u03b1H(\u03b2+1)/d\u2032}, ignoring logarithmic factors. Assume for the moment that the intrinsic dimension d\u2032 equals the ambient dimension d, for simplicity. In this case, the additional error is relatively small compared to the original regret of bigNN, provided that the size of each subsample m is at least of the order N^{d/(2\u03b1H+d)}. When the intrinsic dimension d\u2032 is indeed smaller than the ambient dimension d, the additional error is even smaller.\n\nIn principle, the subsamples at the training stage (with size n) and the subsamples at the prediction stage (with size m; from which we search for the 1-NN of x) do not have to be the same. 
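A minimal sketch of this pre-training scheme follows (our own illustrative code; for brevity the relabeling oracle here is a single global kNN rather than the bigNN classifier used in the paper):\n\n```python
import numpy as np

def knn_label(X, y, x, k):
    """kNN vote of query x against the labeled set (X, y)."""
    nn = np.argsort(np.linalg.norm(X - x, axis=1))[:k]
    return 1 if y[nn].mean() > 0.5 else 0

def denoise(X, y, k):
    """Pre-training: re-label every training point by its own kNN prediction."""
    return np.array([knn_label(X, y, xi, k) for xi in X])

def denoised_predict(X, y_denoised, x, rng, m, I=9):
    """Prediction: majority vote of I 1-NN lookups, each in a random
    size-m subsample of the pre-trained data."""
    votes = []
    for _ in range(I):
        idx = rng.choice(len(X), size=m, replace=False)
        votes.append(knn_label(X[idx], y_denoised[idx], x, 1))
    return 1 if np.mean(votes) > 0.5 else 0
```\n\nThe expensive kNN work happens once, in `denoise`; each query afterwards only pays for I nearest-neighbor searches over m points.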
In practice, at the prediction stage, we may use the subsamples that are already divided up by bigNN. In other words, we do not have to conduct two different data divisions, one at the pre-training stage and the other at prediction. In this case, to continue the discussion in the last paragraph, the additional error due to this pre-training acceleration is negligible as long as the total number of subsamples s is no larger than N^{2\u03b1H/(2\u03b1H+d)}. Incidentally, this matches the upper bound on s of N^{2\u03b1/(2\u03b1+1)} obtained previously.\n\n[44] suggested obtaining multiple pre-trained 1-NN estimates from I subsamples repeatedly, and conducting a majority vote among them to improve the performance. For example, they use I = 10 in their simulation study. The theoretical result does not depend on the number of subsamples I. Indeed, our proof would work even if I = 1. In our simulation studies, we have tried a few values of I to compare the empirical performance. Our method performs fairly similarly when I is greater than 9.\n\n5 Simulations\n\nAll numerical studies are conducted on HPC clusters with two 12-core Intel Xeon Gold Skylake processors and four 10-core Xeon-E5 processors, with memory between 64 and 128 GB.\n\nSimulation 1: We choose the split coefficient \u03b3 = 0.0, 0.1, . . . , 0.9 and N = 1000 \u00d7 (1, 2, 3, 4, 8, 9, 16, 27, 32). The number of neighbors k is chosen as k_o n^{2\u03b1/(2\u03b1+1)} s^{\u22121/(2\u03b1+1)}, as stated in the theorems, with k_o = 1, truncated below at 1. The two classes are generated as P0 \u223c N(0_5, I_5) and P1 \u223c N(1_5, I_5) with prior class probability \u03c01 = 1/2. The \u03b1 value is chosen to be \u03b1H/d = 1/5, since the corresponding H\u00f6lder exponent is \u03b1H = 1. In addition, the test set was independently generated with 1000 observations. We repeat the simulation 1000 times for each \u03b3 and N. 
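The choice k = k_o n^{2\u03b1/(2\u03b1+1)} s^{\u22121/(2\u03b1+1)} used here, together with the growth constraint on the number of splits, can be computed with a small helper (a sketch with our own naming; k_o is the unspecified constant from Theorem 1):\n\n```python
def big_nn_params(N, gamma, alpha, k_o=1.0):
    """Optimal local k for bigNN with s = N**gamma subsamples of size
    n = N**(1 - gamma), truncated below at 1 as in Simulation 1.

    k must grow with N, which requires gamma < 2*alpha/(2*alpha + 1)."""
    s = N ** gamma
    n = N ** (1 - gamma)
    k = k_o * n ** (2 * alpha / (2 * alpha + 1)) * s ** (-1 / (2 * alpha + 1))
    valid = gamma < 2 * alpha / (2 * alpha + 1)
    return max(1, round(k)), valid
```\n\nFor the Simulation 1 setting (alpha = 1/5), the theory above requires gamma < 2/7 for the optimal k to keep growing with N.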
Here both the empirical risk (test error) and the empirical CIS are calculated for both the bigNN and the oracle kNN methods. The empirical regret is calculated as the empirical risk minus the Bayes risk, computed using the known underlying distribution. Note that, due to numerical issues and the greater precision needed for large N and \u03b3, the empirical risk and CIS can present some instability. The R environment is used in this study.\n\nFigure 1: Regret and CIS for bigNN and oracle kNN (\u03b3 = 0). Different curves show different \u03b3.\n\nThe results are reported in Figure 1. The regret and CIS lines for different \u03b3 values are parallel to each other and decrease linearly as N grows on the log-log scale, which verifies that the convergence rates are power functions of N with negative exponents.\n\nInspired by the fact that the convergence rate for the regret is O(N^{\u2212\u03b1(1+\u03b2)/(2\u03b1+1)}) and that for the CIS is O(N^{\u2212\u03b1\u03b2/(2\u03b1+1)}), we fit the following two linear regression models:\n\nlog(Regret) \u223c factor(\u03b3) + log(N)\n\nlog(CIS) \u223c factor(\u03b3) + log(N)\n\nusing all the dots in Figure 1. If the regression coefficients for log(N) are significant, then the convergence rates of regret and CIS are power functions of N, and the coefficients themselves are the exponents. That is, they should be approximately \u2212\u03b1(1 + \u03b2)/(2\u03b1 + 1) and \u2212\u03b1\u03b2/(2\u03b1 + 1). Since the \u03b3 term is categorical, for different \u03b3 values the regression lines share a common slope but have different intercepts. 
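For a single \u03b3, the slope extraction performed by these regressions reduces to an ordinary least-squares fit on the log-log scale; a minimal sketch (our own helper; the paper's analysis is done in R):\n\n```python
import numpy as np

def fit_rate(Ns, regrets):
    """Estimate the exponent c in regret = A * N**(-c) by least squares
    of log(regret) on log(N); returns c (positive for a decaying rate)."""
    slope, _intercept = np.polyfit(np.log(Ns), np.log(regrets), 1)
    return -slope
```\n\nApplied to the dots of one curve in Figure 1, the returned exponent should be close to \u03b1(1 + \u03b2)/(2\u03b1 + 1) for the regret and \u03b1\u03b2/(2\u03b1 + 1) for the CIS.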
The predictions from these regressions are extremely accurate. In particular, the correlation between the observed and fitted log(regret) (log(CIS), resp.) is 0.9916 (0.9896, resp.). The scatter plots of the observed against the fitted values are shown in Figure 2, displaying almost perfect fits. These results verify the rates obtained in our theorems.\n\nFigure 2: Scatter plots of the fitted and observed regret and CIS values.\n\nFigure 1 in the supplementary materials shows that bigNN has significantly shorter computing time than the oracle method.\n\nSimulation 2: Now suppose we intentionally fix k to be a constant (this may not be the optimal k). After a straightforward modification of the proofs, the rates of convergence for the regret and for the CIS become O(N^{\u2212\u03b3(1+\u03b2)/2}) and O(N^{\u2212\u03b3\u03b2/2}), respectively, for \u03b3 < 2\u03b1/(2\u03b1 + 1), and both the regret and the CIS should decrease as \u03b3 increases. We fix the number of neighbors k = 5, let \u03b3 range from 0 to 0.7, and let N = 1000 \u00d7 (1, 2, 4, 8, 10, 12, 16, 20, 32). The rest of the settings are the same as in Simulation 1.\n\nFigure 3: Regret and CIS for bigNN and oracle kNN (\u03b3 = 0) for k = 5 fixed. Different curves represent different N.\n\nThe results are shown in Figure 3. Both lines decay linearly in \u03b3 (both plots are on the log scale for the y-axis). We note that the expected slopes in these two plots should be \u2212(1 + \u03b2)/2 \u00d7 log(N) and \u2212\u03b2/2 \u00d7 log(N), respectively, which is verified by the figures, where larger N means steeper lines.\n\nSimulation 3: In Simulation 3, we compare the denoised bigNN with the bigNN method. For denoised bigNN, we merge a few pre-training subsamples to form a prediction subsample, leading to a prediction subsample size of m = N^\u03b8. 
We set N = 27000, d = 8, the pre-training\nsplit coef\ufb01cient \u03b3 = 0.2, 0.3, number of prediction subsampling repeats I = 5, 9, 13, 17, 21, and\nthe prediction subsample size coef\ufb01cient \u03b8 = 0.1, 0.2, . . . , 0.7. The two classes are generated as\n\n7\n\nllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllll\u22129\u22128\u22127\u22126\u22125\u22124\u22129\u22128\u22127\u22126\u22125\u22124Observed log(regret)Fitted log(regret)llllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllll\u22125.0\u22124.5\u22124.0\u22123.5\u22123.0\u22122.5\u22125.0\u22124.5\u22124.0\u22123.5\u22123.0\u22122.5\u22122.0Observed log(CIS)Fitted log(CIS)llllllll\u221212\u221210\u22128\u22126\u221240.00.10.20.30.40.50.60.7glog(Regret.BigNN)Nl100020004000800012000160002000032000llllllll\u22125\u22124\u221230.00.10.20.30.40.50.60.7glog(CIS.BigNN)Nl100020004000800012000160002000032000\fP1 \u223c 0.5N (0d, Id) + 0.5N (3d, 2Id) and P0 \u223c 0.5N (1.5d, Id) + 0.5N (4.5d, 2Id) with the prior\nclass probability \u03c01 = 1/3. The number of neighbors K in the oracle KNN is chosen as K = N 0.7.\nThe number of local neighbors in bigNN are chosen as k = (cid:100)k\u2217\no = 1.351284, a\nsmall constant that we \ufb01nd works well in this example. In addition, the test set was independently\ngenerated with 1000 observations. We repeat the simulation for 300 times for each \u03b3, \u03b8 and I.\nThe results are reported in Figure 4. In each \ufb01gure, the black diamond shows the prediction time\nand regret for the bigNN method without acceleration. Different curves represent different number\nof subsamples queried at the prediction stage, and their performance are similar. We see that the\nperformance greatly changes due to the size of the subsamples at prediction m = N \u03b8. Small m (or\n\u03b8), corresponding to the top-left end of each curve, is fast, but introduced too much bias. 
Large m (or θ) values (bottom-right) are reasonably accurate while remaining much faster than bigNN without acceleration. The computing times are shown in seconds. For this example, θ = 0.5 or 0.6 appears to work well.

Figure 4: Regret and prediction time trade-off for denoised bigNN and bigNN (black diamonds). γ = 0.2, 0.3; θ = 0.1, 0.2, . . . , 0.7. Different curves show different I.

We tried the MATLAB/C++ based code from [30] on this simulation setting. To reach optimal accuracy, [30] relies on tuning two parameters, a knob α and a bandwidth h. We tuned h within (0.1, 0.2, . . . , 10) and α within (1/6, 2/6, . . . , 1). The best classifier from [30] gave a regret 1.5 times that of the denoised bigNN method. The preprocessing time (including tuning) is on the order of 4,000 seconds, compared to hundreds of seconds with our method. In terms of prediction time, their best speedup (8x) comes at the cost of the worst accuracy (2 times our regret), while the speedup at their best accuracy is only 2- to 3-fold (compared to 10-fold for our methods).

6 Real data examples

The OWNN method [37] attains the same optimal convergence rate of regret as oracle kNN and bigNN, and additionally enjoys an asymptotically optimal constant factor. A 'big' OWNN (in which the base classifier is OWNN instead of kNN) is technically doable, but is omitted in this paper in favor of more straightforward implementations. The goal of this section is to check how much (or how little) statistical accuracy bigNN loses even though we do not use OWNN (which is optimal in risk, both in rate and in constant) on each subset. In particular, we compare the finite-sample performance of bigNN, oracle kNN and oracle OWNN using real data. We note that bigNN has the same convergence rate as the other two, but requires much less computing time. We deliberately choose not to include other state-of-the-art algorithms (such as SVM or random forest) in the comparison.
The impact of divide and conquer for those algorithms is an interesting future research topic.

We retrieved the benchmark data sets HTRU2 [34], Gisette [22], Musk 1 [16], Musk 2 [17], Occupancy [8], Credit [45], and SUSY [4] from the UCI machine learning repository [33]. The test sample sizes are set as min(1000, total sample size/5). Parameters in kNN and OWNN are tuned using cross-validation, and the parameter k in bigNN for each subsample is the optimally chosen k for the oracle kNN divided by s. In Table 1, we compare the average empirical risk (test error), the empirical CIS, and the speedup of bigNN relative to oracle kNN, over 500 replications (OWNN typically has computing time similar to kNN, and hence the speed comparison with OWNN is omitted). From Table 1, one can see that the three methods typically yield very similar risk and CIS (no single method always wins), while bigNN has a computational advantage. Moreover, larger γ values tend to give slightly worse performance for bigNN.

Table 1: BigNN compared to the oracle kNN and OWNN on 7 real data sets, with γ = 0.1, 0.2, 0.3 except for the smaller data sets with N < 10000. Speedup is defined as the computing time for oracle kNN divided by that for bigNN. The prefix 'R.' means risk, and 'C.' means CIS.
Both are in percentage.

DATA      SIZE    DIM    γ     R.BIGNN   R.KNN     R.OWNN    C.BIGNN   C.KNN     C.OWNN    SPEEDUP
HTRU2     17898   8      0.1   2.0385    2.1105    2.1188    0.3670    0.6152    0.5528    2.72
HTRU2     17898   8      0.2   2.0929    2.1105    2.1188    0.6323    0.6152    0.5528    7.65
HTRU2     17898   8      0.3   2.1971    2.1105    2.1188    0.5003    0.6152    0.5528    21.65
GISETTE   6000    5000   0.2   3.9344    3.5020    3.4749    4.4261    4.4752    4.3317    5.13
MUSK1     476     166    0.1   14.7619   14.9767   14.9757   24.2362   23.0664   23.2707   1.79
MUSK2     6598    166    0.2   3.8250    3.4400    3.2841    4.7575    5.1925    4.1615    5.73
OCCUP     20560   6      0.1   0.6207    0.6205    0.6037    0.3790    0.4431    0.5795    2.93
OCCUP     20560   6      0.2   0.6119    0.6205    0.6037    0.3717    0.4431    0.5795    6.97
OCCUP     20560   6      0.3   0.6548    0.6205    0.6037    0.3081    0.4431    0.5795    19.19
CREDIT    30000   24     0.1   18.8300   18.8681   18.8414   2.7940    3.5292    3.4392    3.36
CREDIT    30000   24     0.2   18.8467   18.8681   18.8414   4.3917    3.5292    3.4392    7.86
CREDIT    30000   24     0.3   18.9250   18.8681   18.8414   4.2496    3.5292    3.4392    23.22
SUSY      5000K   18     0.1   19.3103   21.0381   20.7752   7.7034    7.4011    7.5921    4.59
SUSY      5000K   18     0.2   21.6149   21.0381   20.7752   7.9073    7.4011    7.5921    16.76
SUSY      5000K   18     0.3   22.3197   21.0381   20.7752   4.6716    7.4011    7.5921    88.22

In Figure 2 of the supplementary materials, we allow γ to grow to 0.9. As mentioned earlier, when s grows too fast (e.g. γ ≥ 0.4 in this example), the performance of bigNN starts to deteriorate, due to increased 'bias' of the base classifier, despite faster computing.

7 Conclusion

Due to computation, communication, privacy and ownership limitations, sometimes it is impossible to conduct NN classification at a central location. In this paper, we study the bigNN classifier, which distributes the computation to different locations.
We show that the convergence rates of regret and CIS for bigNN are the same as those for the oracle NN methods, and both rates are sharp. We also show that the prediction time of bigNN can be further improved by the denoising acceleration technique, at a negligible loss in statistical accuracy.

Convergence rates are only the first step toward understanding bigNN. The sharp rates give reassurance about worst-case behavior; however, they do not lead naturally to optimal splitting schemes or to quantifications of the relative performance of two NN classifiers attaining the same rate (such as bigNN and oracle NN). Achieving these goals is left as future work. Another direction for future work is to prove the sharp upper bound on γ.

Acknowledgments

Guang Cheng's research was partially supported by the National Science Foundation (DMS-1712907, DMS-1811812, DMS-1821183) and the Office of Naval Research (ONR N00014-18-2759). In addition, Guang Cheng is a member of the Institute for Advanced Study at Princeton University and a visiting Fellow of the Deep Learning Program at the Statistical and Applied Mathematical Sciences Institute (Fall 2019); he would like to thank both institutes for their hospitality.

References

[1] Alabduljalil, M. A., Tang, X., and Yang, T. (2013), “Optimizing parallel algorithms for all pairs similarity search,” in Proceedings of the Sixth ACM International Conference on Web Search and Data Mining, ACM, pp. 203–212.

[2] Anastasiu, D. C. and Karypis, G. (2017), “Parallel cosine nearest neighbor graph construction,” Journal of Parallel and Distributed Computing.

[3] Audibert, J.-Y. and Tsybakov, A. B. (2007), “Fast learning rates for plug-in classifiers,” The Annals of Statistics, 35, 608–633.

[4] Baldi, P., Sadowski, P., and Whiteson, D.
(2014), \u201cSearching for exotic particles in high-energy\n\nphysics with deep learning,\u201d Nature communications, 5, 4308.\n\n[5] Battey, H., Fan, J., Liu, H., Lu, J., and Zhu, Z. (2015), \u201cDistributed estimation and inference\n\nwith statistical guarantees,\u201d arXiv preprint arXiv:1509.05457.\n\n[6] Bentley, J. L. (1975), \u201cMultidimensional binary search trees used for associative searching,\u201d\n\nCommunications of the ACM, 18, 509\u2013517.\n\n[7] Breiman, L. (1996), \u201cBagging predictors,\u201d Machine learning, 24, 123\u2013140.\n\n[8] Candanedo, L. M. and Feldheim, V. (2016), \u201cAccurate occupancy detection of an of\ufb01ce room\nfrom light, temperature, humidity and CO2 measurements using statistical learning models,\u201d\nEnergy and Buildings, 112, 28\u201339.\n\n[9] Chaudhuri, K. and Dasgupta, S. (2014), \u201cRates of convergence for nearest neighbor classi\ufb01ca-\n\ntion,\u201d in Advances in Neural Information Processing Systems, pp. 3437\u20133445.\n\n[10] Chawla, N. V., Hall, L. O., Bowyer, K. W., and Kegelmeyer, W. P. (2004), \u201cLearning ensembles\nfrom bites: A scalable and accurate approach,\u201d Journal of Machine Learning Research, 5,\n421\u2013451.\n\n[11] Chen, X. and Xie, M.-g. (2014), \u201cA split-and-conquer approach for analysis of extraordinarily\n\nlarge data,\u201d Statistica Sinica, 1655\u20131684.\n\n[12] Cover, T. and Hart, P. (1967), \u201cNearest neighbor pattern classi\ufb01cation,\u201d IEEE transactions on\n\ninformation theory, 13, 21\u201327.\n\n[13] Cover, T. M. (1968), \u201cRates of convergence for nearest neighbor procedures,\u201d in Proceedings of\n\nthe Hawaii International Conference on Systems Sciences, pp. 413\u2013415.\n\n[14] Dasgupta, S. and Sinha, K. (2013), \u201cRandomized partition trees for exact nearest neighbor\n\nsearch,\u201d in Conference on Learning Theory, pp. 317\u2013337.\n\n[15] Devroye, L., Gyor\ufb01, L., Krzyzak, A., and Lugosi, G. 
(1994), \u201cOn the strong universal consis-\ntency of nearest neighbor regression function estimates,\u201d The Annals of Statistics, 1371\u20131385.\n\n[16] Dietterich, T. G., Jain, A. N., Lathrop, R. H., and Lozano-Perez, T. (1994), \u201cA comparison of\ndynamic reposing and tangent distance for drug activity prediction,\u201d in Advances in Neural\nInformation Processing Systems, pp. 216\u2013223.\n\n[17] Dietterich, T. G., Lathrop, R. H., and Lozano-P\u00e9rez, T. (1997), \u201cSolving the multiple instance\n\nproblem with axis-parallel rectangles,\u201d Arti\ufb01cial intelligence, 89, 31\u201371.\n\n[18] Fan, J., Wang, D., Wang, K., and Zhu, Z. (2017), \u201cDistributed Estimation of Principal\n\nEigenspaces,\u201d arXiv preprint arXiv:1702.06488.\n\n[19] Fix, E. and Hodges Jr, J. L. (1951), \u201cDiscriminatory analysis-nonparametric discrimination:\n\nconsistency properties,\u201d Tech. rep., California Univ Berkeley.\n\n[20] Fritz, J. (1975), \u201cDistribution-free exponential error bound for nearest neighbor pattern classi\ufb01-\n\ncation,\u201d IEEE Transactions on Information Theory, 21, 552\u2013557.\n\n[21] Gottlieb, L.-A., Kontorovich, A., and Krauthgamer, R. (2014), \u201cEf\ufb01cient classi\ufb01cation for\n\nmetric data,\u201d IEEE Transactions on Information Theory, 60, 5750\u20135759.\n\n[22] Guyon, I., Gunn, S., Ben-Hur, A., and Dror, G. (2005), \u201cResult analysis of the NIPS 2003 feature\n\nselection challenge,\u201d in Advances in neural information processing systems, pp. 545\u2013552.\n\n[23] Gyor\ufb01, L. (1981), \u201cThe rate of convergence of k_n-NN regression estimates and classi\ufb01cation\n\nrules (Corresp.),\u201d IEEE Transactions on Information Theory, 27, 362\u2013364.\n\n[24] Hall, P., Park, B. U., and Samworth, R. J. (2008), \u201cChoice of neighbor order in nearest-neighbor\n\nclassi\ufb01cation,\u201d The Annals of Statistics, 2135\u20132152.\n\n10\n\n\f[25] Hall, P. and Samworth, R. J. 
(2005), \u201cProperties of bagged nearest neighbour classi\ufb01ers,\u201d\n\nJournal of the Royal Statistical Society: Series B (Statistical Methodology), 67, 363\u2013379.\n\n[26] Indyk, P. and Motwani, R. (1998), \u201cApproximate nearest neighbors: towards removing the\ncurse of dimensionality,\u201d in Proceedings of the thirtieth annual ACM symposium on Theory of\ncomputing, ACM, pp. 604\u2013613.\n\n[27] Kohler, M. and Krzyzak, A. (2007), \u201cOn the rate of convergence of local averaging plug-in\nclassi\ufb01cation rules under a margin condition,\u201d IEEE transactions on information theory, 53,\n1735\u20131742.\n\n[28] Kontorovich, A., Sabato, S., and Weiss, R. (2017), \u201cNearest-neighbor sample compression:\nEf\ufb01ciency, consistency, in\ufb01nite dimensions,\u201d in Advances in Neural Information Processing\nSystems, pp. 1573\u20131583.\n\n[29] Kontorovich, A. and Weiss, R. (2015), \u201cA Bayes consistent 1-NN classi\ufb01er,\u201d in Arti\ufb01cial\n\nIntelligence and Statistics, pp. 480\u2013488.\n\n[30] Kpotufe, S. and Verma, N. (2017), \u201cTime-accuracy tradeoffs in kernel prediction: controlling\n\nprediction quality,\u201d The Journal of Machine Learning Research, 18, 1443\u20131471.\n\n[31] Kulkarni, S. R. and Posner, S. E. (1995), \u201cRates of convergence of nearest neighbor estimation\n\nunder arbitrary sampling,\u201d IEEE Transactions on Information Theory, 41, 1028\u20131039.\n\n[32] Lee, J. D., Liu, Q., Sun, Y., and Taylor, J. E. (2017), \u201cCommunication-ef\ufb01cient Sparse Regres-\n\nsion,\u201d Journal of Machine Learning Research, 18, 1\u201330.\n\n[33] Lichman, M. (2013), \u201cUci machine learning repository. university of california, irvine, school\n\nof information and computer sciences,\u201d .\n\n[34] Lyon, R., Stappers, B., Cooper, S., Brooke, J., and Knowles, J. 
(2016), \u201cFifty years of pulsar\ncandidate selection: from simple \ufb01lters to a new principled real-time classi\ufb01cation approach,\u201d\nMonthly Notices of the Royal Astronomical Society, 459, 1104\u20131123.\n\n[35] Mammen, E., Tsybakov, A. B., et al. (1999), \u201cSmooth discrimination analysis,\u201d The Annals of\n\nStatistics, 27, 1808\u20131829.\n\n[36] Muja, M. and Lowe, D. G. (2014), \u201cScalable nearest neighbor algorithms for high dimensional\n\ndata,\u201d IEEE Transactions on Pattern Analysis and Machine Intelligence, 36, 2227\u20132240.\n\n[37] Samworth, R. J. (2012), \u201cOptimal weighted nearest neighbour classi\ufb01ers,\u201d The Annals of\n\nStatistics, 40, 2733\u20132763.\n\n[38] Shang, Z. and Cheng, G. (2017), \u201cComputational limits of a distributed algorithm for smoothing\n\nspline,\u201d The Journal of Machine Learning Research, 18, 3809\u20133845.\n\n[39] Slaney, M. and Casey, M. (2008), \u201cLocality-sensitive hashing for \ufb01nding nearest neighbors\n\n[lecture notes],\u201d IEEE Signal processing magazine, 25, 128\u2013131.\n\n[40] Sun, W. W., Qiao, X., and Cheng, G. (2016), \u201cStabilized Nearest Neighbor Classi\ufb01er and its\n\nStatistical Properties,\u201d Journal of the American Statistical Association, 111, 1254\u20131265.\n\n[41] Tsybakov, A. B. (2004), \u201cOptimal aggregation of classi\ufb01ers in statistical learning,\u201d Annals of\n\nStatistics, 135\u2013166.\n\n[42] Vapnik, V. and Chervonenkis, A. Y. (1971), \u201cOn the Uniform Convergence of Relative Fre-\nquencies of Events to Their Probabilities,\u201d Theory of Probability and its Applications, 16,\n264.\n\n[43] Wagner, T. (1971), \u201cConvergence of the nearest neighbor rule,\u201d IEEE Transactions on Informa-\n\ntion Theory, 17, 566\u2013571.\n\n[44] Xue, L. and Kpotufe, S. 
(2018), \u201cAchieving the time of 1-NN, but the accuracy of k-NN,\u201d\nin Proceedings of the Twenty-First International Conference on Arti\ufb01cial Intelligence and\nStatistics, eds. Storkey, A. and Perez-Cruz, F., Playa Blanca, Lanzarote, Canary Islands: PMLR,\nvol. 84 of Proceedings of Machine Learning Research, pp. 1628\u20131636.\n\n11\n\n\f[45] Yeh, I.-C. and Lien, C.-h. (2009), \u201cThe comparisons of data mining techniques for the predictive\naccuracy of probability of default of credit card clients,\u201d Expert Systems with Applications, 36,\n2473\u20132480.\n\n[46] Zhang, Y., Duchi, J., and Wainwright, M. (2013), \u201cDivide and conquer kernel ridge regression,\u201d\n\nin Conference on Learning Theory, pp. 592\u2013617.\n\n[47] Zhao, T., Cheng, G., Liu, H., et al. (2016), \u201cA partially linear framework for massive heteroge-\n\nneous data,\u201d The Annals of Statistics, 44, 1400\u20131437.\n\n12\n\n\f", "award": [], "sourceid": 5750, "authors": [{"given_name": "Xingye", "family_name": "Qiao", "institution": "Binghamton University"}, {"given_name": "Jiexin", "family_name": "Duan", "institution": "Purdue University"}, {"given_name": "Guang", "family_name": "Cheng", "institution": "Purdue University"}]}