{"title": "Locally Adaptive Nearest Neighbor Algorithms", "book": "Advances in Neural Information Processing Systems", "page_first": 184, "page_last": 191, "abstract": null, "full_text": "Locally Adaptive Nearest Neighbor Algorithms \n\nDietrich Wettschereck \nThomas G. Dietterich \nDepartment of Computer Science \nOregon State University \nCorvallis, OR 97331-3202 \nwettscd@cs.orst.edu \n\nAbstract \n\nFour versions of a k-nearest neighbor algorithm with locally adaptive k are introduced and compared to the basic k-nearest neighbor algorithm (kNN). Locally adaptive kNN algorithms choose the value of k that should be used to classify a query by consulting the results of cross-validation computations in the local neighborhood of the query. Local kNN methods are shown to perform similarly to kNN in experiments with twelve commonly used data sets. Encouraging results on three constructed tasks show that local methods can significantly outperform kNN in specific applications. Local methods can be recommended for on-line learning and for applications where different regions of the input space are covered by patterns solving different sub-tasks. \n\n1 Introduction \n\nThe k-nearest neighbor algorithm (kNN; Dasarathy, 1991) is one of the most venerable algorithms in machine learning. The entire training set is stored in memory. A new example is classified with the class of the majority of its k nearest neighbors among all stored training examples. The (global) value of k is generally determined via cross-validation. \n\nFor certain applications, it might be desirable to vary the value of k locally within different parts of the input space to account for varying characteristics of the data such as noise or irrelevant features. 
However, for lack of an algorithm, researchers have assumed a global value for k in all work concerning nearest neighbor algorithms to date (see, for example, Bottou, 1992, p. 895, last two paragraphs of Section 4.1). In this paper, we propose and evaluate four new algorithms that determine different values for k in different parts of the input space and apply these varying values to classify novel examples. The four algorithms use different methods to compute the k values that are used for classification. \n\nWe identified two basic approaches to computing locally varying values for k. One can compute a single k or a set of k values for each training pattern, or training patterns can be combined into groups and k value(s) computed for each group. In either approach, a procedure must be given to determine the k to be used at classification time. Representatives of these two approaches are evaluated in this paper and compared to the global kNN algorithm. While it was possible to construct data sets where local algorithms outperformed kNN, experiments with commonly used data sets showed, in most cases, no significant differences in performance. A possible explanation for this behavior is that the data sets commonly used to evaluate machine learning algorithms may all be similar in that attributes such as the distribution of noise or irrelevant features are uniform across all patterns. In other words, patterns from data sets describing a certain task generally exhibit similar properties. \n\nLocal nearest neighbor methods are comparable in computational complexity and accuracy to the (global) k-nearest neighbor algorithm and are easy to implement. In specific applications they can significantly outperform kNN. Such applications may combine significantly different subsets of data, or may be obtained from physical measurements where the accuracy of a measurement depends on the value measured. 
Furthermore, local kNN classifiers can be constructed at classification time (on-line learning), thereby eliminating the need for a global cross-validation run to determine the proper value of k. \n\n1.1 Methods compared \n\nThe following nearest neighbor methods were chosen as representatives of the possible nearest neighbor methods discussed above and compared in the subsequent experiments: \n\n\u2022 k-nearest neighbor (kNN) \nThis algorithm stores all of the training examples. A single value for k is determined from the training data. Queries are classified according to the class of the majority of their k nearest neighbors in the training data. \n\n\u2022 localKNN ks unrestricted \nThis is the basic local kNN algorithm. The three subsequent algorithms are modifications of this method. This algorithm also stores all of the training examples. Along with each training example, it stores a list of those values of k that correctly classify that example under leave-one-out cross-validation. To classify a query q, the M nearest neighbors of the query are computed, and the k that correctly classifies the most of these M neighbors is determined. Call this value kM,q. The query q is then classified with the class of the majority of its kM,q nearest neighbors. Note that kM,q can be larger or smaller than M. The parameter M is the only parameter of the algorithm, and it can be determined by cross-validation. \n\n\u2022 localKNN ks pruned \nThe list of k values for each training example generally contains many values. A global histogram of k values is computed, and k values that appear fewer than L times are pruned from all lists (at least one k value must, however, remain in each list). The parameter L can be estimated via cross-validation. Classification of queries is identical to localKNN ks unrestricted. 
\n\u2022 localKNN one k per class \nFor each output class, the value of k that would result in the correct (leave-one-out) classification of the maximum number of training patterns from that class is determined. A query q is classified as follows: Assume there are two output classes, C1 and C2. Let k1 and k2 be the k values computed for classes C1 and C2, respectively. The query is assigned to class C1 if the percentage of the k1 nearest neighbors of q that belong to class C1 is larger than the percentage of the k2 nearest neighbors of q that belong to class C2. Otherwise, q is assigned to class C2. Generalization of this procedure to any number of output classes is straightforward. \n\n\u2022 localKNN one k per cluster \nAn unsupervised clustering algorithm (RPCL,1 Xu et al., 1993) is used to determine clusters of input data. A single k value is determined for each cluster. Each query is classified according to the k value of the cluster to which it is assigned. \n\n2 Experimental Methods and Data sets used \n\nTo measure the performance of the different nearest neighbor algorithms, we employed the training set/test set methodology. Each data set was randomly partitioned into a training set containing approximately 70% of the patterns and a test set containing the remaining patterns. After training on the training set, the percentage of correct classifications on the test set was measured. The procedure was repeated a total of 25 times to reduce statistical variation. In each experiment, the algorithms being compared were trained (and tested) on identical data sets to ensure that differences in performance were due entirely to the algorithms. Leave-one-out cross-validation (Weiss & Kulikowski, 1991) was employed in all experiments to estimate optimal settings for free parameters such as k in kNN and M in localKNN. 
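The classification rule of localKNN ks unrestricted can be sketched as follows. This is a minimal pure-Python illustration under our own naming, not the authors' implementation; distance is assumed to be squared Euclidean and ties are broken arbitrarily: \n\n

```python
from collections import Counter

def neighbors_by_distance(X, q, exclude=None):
    """Indices of stored examples, nearest first (squared Euclidean)."""
    dist = lambda a, b: sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    return [i for i in sorted(range(len(X)), key=lambda i: dist(X[i], q))
            if i != exclude]

def knn_predict(X, y, q, k, exclude=None):
    """Majority class among the k nearest stored examples."""
    idx = neighbors_by_distance(X, q, exclude)[:k]
    return Counter(y[i] for i in idx).most_common(1)[0][0]

def good_k_lists(X, y, k_max):
    """Per training example: all k that classify it correctly
    under leave-one-out cross-validation."""
    return [[k for k in range(1, k_max + 1)
             if knn_predict(X, y, X[i], k, exclude=i) == y[i]]
            for i in range(len(X))]

def local_knn_predict(X, y, k_lists, q, M):
    """localKNN ks unrestricted: the query's M nearest neighbors vote
    for the k values on their good-k lists; the winning value kM,q is
    then used for an ordinary kNN classification of the query q."""
    votes = Counter(k for i in neighbors_by_distance(X, q)[:M]
                    for k in k_lists[i])
    k_M_q = votes.most_common(1)[0][0]
    return knn_predict(X, y, q, k_M_q)
```

\nNote that the good-k lists are computed once at training time, so classifying a query costs only one extra neighbor search over plain kNN. \n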
\n1 Rival Penalized Competitive Learning is a straightforward modification of the well-known k-means clustering algorithm. RPCL's main advantage over k-means clustering is that one can simply initialize it with a sufficiently large number of clusters. Cluster centers are initialized outside of the input range covered by the training examples. The algorithm then moves into the range of input values only those cluster centers that are needed, and it therefore effectively eliminates the need for cross-validation on the number of clusters in k-means. This paper employed a simple version with the number of initial clusters always set to 25, \u03b1c set to 0.05, and \u03b1r set to 0.002. \n\nWe report the average percentage of correct classifications and its standard error. Two-tailed paired t-tests were conducted to determine at what level of significance one algorithm outperforms the other. We state that one algorithm significantly outperforms another when the p-value is smaller than 0.05. \n\n3 Results \n\n3.1 Experiments with Constructed Data Sets \n\nThree experiments with constructed data sets were conducted to determine the ability of local nearest neighbor methods to determine proper values of k. The data sets were constructed such that it was known before experimentation that varying k values should lead to superior performance. Two data sets which were presumed to require significantly different values of k were combined into a single data set for each of the first two experiments. For the third experiment, a data set was constructed to display some characteristics of data sets for which we assume local kNN methods would work best. The data set was constructed such that patterns from two classes were stretched out along two parallel lines in one part of the input space. 
The parallel lines were spaced such that the nearest neighbor of most patterns belongs to the same class as the pattern itself, while two out of the three nearest neighbors belong to the other class. In other parts of the input space, classes were well separated, but class labels were flipped such that the nearest neighbor of a query may indicate the wrong class while the majority of the k nearest neighbors (k > 3) would indicate the correct class (see also Figure 4). \n\nFigure 1 shows that in selected applications, local nearest neighbor methods can lead to significant improvements over kNN in predictive accuracy. \n\n[Figure 1 appeared here: bar graphs for Experiment 1 (Letter), Experiment 2 (Sine-21, Wave-21, Combined), and Experiment 3 (Constructed), comparing ks pruned, ks unrestricted, one k per class, and one k per cluster.] \n\nFigure 1: Percent accuracy of local kNN methods relative to kNN on separate test sets. These differences (*) were statistically significant (p < 0.05). Results are based on 25 repetitions. Shown at the bottom of each graph are sizes of training sets/sizes of test sets/number of input features. The percentage at the top of each graph indicates the average accuracy of kNN \u00b1 standard error. \n\nThe best performing local methods are localKNN ks pruned, localKNN ks unrestricted, and localKNN one k per cluster. These methods were outperformed by kNN in two of the original data sets. However, the performance of these methods was clearly superior to kNN in all domains where data were collections of significantly distinct subsets. \n\n3.2 Experiments with Commonly Used Data Sets \n\nTwelve domains of varying sizes and complexities were used to compare the performance of the various nearest neighbor algorithms. 
Data sets for these domains were obtained from the UC-Irvine repository of machine learning databases (Murphy & Aha, 1991; Aha, 1990; Detrano et al., 1989). Results displayed in Figure 2 indicate that in most data sets which are commonly used to evaluate machine learning algorithms, local nearest neighbor methods have only minor impact on the performance of kNN. The best local methods are either indistinguishable in performance from kNN (localKNN one k per cluster) or inferior in only one domain (localKNN ks pruned). \n\n[Figure 2 appeared here: bar graphs for the twelve commonly used domains, comparing ks pruned, ks unrestricted, one k per class, and one k per cluster.] \n\nFigure 2: Percent accuracy of local kNN methods relative to kNN on separate test sets. These differences (*) were statistically significant (p < 0.05). Results are based on 25 repetitions. Shown at the bottom of each graph are sizes of training sets/sizes of test sets/number of input features. The percentage at the top of each graph indicates the average accuracy of kNN \u00b1 standard error. \n\nThe number of actual k values used varies significantly for the different local methods (Table 1). Not surprisingly, localKNN ks unrestricted uses the largest number of distinct k values in all domains. Pruning of ks significantly reduced the number of values used in all domains. However, the method using the fewest distinct k values is localKNN one k per cluster, which also explains the similar performance of kNN and localKNN one k per cluster in most domains. Note that several clusters computed by localKNN one k per cluster may use the same k. \n\nTable 1: Average number of distinct values for k used by local kNN methods. \n\nTask | kNN | ks pruned | ks unrestricted | one k per class | one k per cluster \nLetter recog. 
| 1 | 7.6\u00b11.1 | 10.8\u00b11.5 | 6.4\u00b10.3 | 1.8\u00b10.2 \nLed-16 | 1 | 16.4\u00b12.5 | 43.3\u00b10.9 | 9.2\u00b10.1 | 9.2\u00b10.5 \nCombined LL | 1 | 52.0\u00b13.8 | 71.4\u00b11.2 | 14.7\u00b10.4 | 3.0\u00b10.2 \nSine-21 | 1 | 6.6\u00b11.0 | 27.5\u00b11.1 | 2.0\u00b10.0 | 1.0\u00b10.0 \nWaveform-21 | 1 | 9.1\u00b11.4 | 28.0\u00b11.5 | 2.9\u00b10.1 | 4.2\u00b10.2 \nCombined SW | 1 | 13.5\u00b11.5 | 30.8\u00b11.6 | 3.0\u00b10.0 | 4.8\u00b10.2 \nConstructed | 1 | 11.8\u00b10.9 | 15.7\u00b10.5 | 2.0\u00b10.0 | 5.4\u00b10.2 \nIris | 1 | 1.6\u00b10.2 | 2.0\u00b10.2 | 2.4\u00b10.1 | 2.3\u00b10.1 \nGlass | 1 | 7.7\u00b10.8 | 11.2\u00b10.7 | 3.3\u00b10.2 | 1.9\u00b10.2 \nWine | 1 | 2.2\u00b10.4 | 3.8\u00b10.4 | 2.0\u00b10.1 | 2.6\u00b10.1 \nHungarian | 1 | 4.1\u00b10.6 | 12.6\u00b10.6 | 2.0\u00b10.0 | 1.0\u00b10.0 \nCleveland | 1 | 8.0\u00b11.0 | 17.2\u00b11.1 | 1.8\u00b10.1 | 4.6\u00b10.2 \nVoting | 1 | 4.1\u00b10.4 | 6.4\u00b10.3 | 2.0\u00b10.0 | 1.3\u00b10.1 \nLed-7 Display | 1 | 5.6\u00b10.4 | 7.6\u00b10.4 | 6.1\u00b10.2 | 1.0\u00b10.0 \nLed-24 Display | 1 | 16.0\u00b12.9 | 37.4\u00b11.6 | 9.0\u00b10.2 | 1.6\u00b10.2 \nWaveform-21 | 1 | 9.7\u00b11.3 | 27.8\u00b11.2 | 3.0\u00b10.0 | 4.3\u00b10.1 \nWaveform-40 | 1 | 8.4\u00b12.0 | 29.9\u00b11.5 | 3.0\u00b10.0 | 4.8\u00b10.1 \nIsolet Letter | 1 | 11.5\u00b12.1 | 43.9\u00b10.6 | 16.5\u00b10.5 | 7.1\u00b10.3 \nLetter recog. | 1 | 9.4\u00b11.9 | 17.0\u00b12.3 | 6.0\u00b10.3 | 2.4\u00b10.2 \n\nFigure 3 shows, for one single run of Experiment 2 (data sets were combined as described in Figure 1), which k values were actually used by the different local methods. Three clusters of k values can be seen in this graph: one cluster at k = 1, one at k = 7, 9, 11, 12, and a third at k = 19, 20, 21. 
It is interesting to note that the second and the third clusters correspond to the k values used by kNN in the separate experiments. Furthermore, kNN did not use k = 1 in any of the separate runs. This gives insight into why kNN's performance was inferior to that of the local methods in this experiment: patterns in the combined data set belong to one of three categories, as indicated by the k values used to classify them (k = 1, k \u2248 10, k \u2248 20). Hence, the performance difference is due to the fact that kNN must estimate at training time which single category will give the best performance, while the local methods make that decision at classification time for each query depending on its local neighborhood. \n\n[Figure 3 appeared here: histograms of the k values used by ks pruned (13 k values), ks unrestricted (30 k values), one k per class (3 k values), and one k per cluster (5 k values).] \n\nFigure 3: Bars show the number of times local kNN methods used certain k values to classify test examples in Experiment 2 (Figure 1, Combined; numbers based on a single run). kNN used k = 1 in this experiment. \n\n4 Discussion \n\nFour versions of the k-nearest neighbor algorithm which use different values of k for patterns belonging to different regions of the input space were presented and evaluated in this paper. Experiments with constructed and commonly used data sets indicate that local nearest neighbor methods may achieve higher classification accuracy than kNN in specific domains. \n\nTwo methods can be recommended for domains where attributes such as noise or the relevance of attributes vary significantly within different parts of the input space. The first method, called localKNN ks pruned, computes a list of \"good\" k values for each training pattern, prunes less frequent values from these lists, and classifies a query according to the lists of k values of a pre-specified number of neighbors of the query. 
Leave-one-out cross-validation is used to estimate the proper amount of pruning and the size of the neighborhood that should be used. \n\nThe other method, localKNN one k per cluster, uses a clustering algorithm to determine clusters of input patterns. One k is then computed for each cluster and used to classify queries which fall into that cluster. LocalKNN one k per cluster performs indistinguishably from kNN in all commonly used data sets and outperforms kNN on the constructed data sets. Compared with all other local methods discussed in this paper, this method introduces a lower computational overhead at classification time and is the only method which could be modified to eliminate the need for leave-one-out cross-validation. \n\nThe only purely local method, localKNN ks unrestricted, performs well on constructed data sets and is comparable to kNN on non-constructed data sets. Sensitivity studies (results not shown) showed that a constant value of 25 for the parameter M gave results comparable to those where cross-validation was used to determine the value of M. The advantage of localKNN ks unrestricted over the other local methods and kNN is that this method does not require any global information whatsoever (if a constant value for M is used). It is therefore possible to construct a localKNN ks unrestricted classifier for each query, which makes this method an attractive alternative for on-line learning or extremely large data sets. \n\nIf the researcher has reason to believe that the data set used is a collection of subsets with significantly varying attributes such as noise or the number of irrelevant features, we recommend constructing a classifier from the training data using localKNN one k per cluster and comparing its performance to kNN. If the classifier must be constructed on-line, then localKNN ks unrestricted should be used instead of kNN. 
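The recommended one-k-per-cluster construction can be sketched as follows. This is an illustrative pure-Python sketch under our own naming, not the authors' implementation; it assumes cluster assignments are already available from some clustering routine (the paper uses RPCL), and it picks each cluster's k by a simple argmax over leave-one-out accuracy on that cluster's members: \n\n

```python
from collections import Counter

def knn_class(X, y, q, k, exclude=None):
    """Majority class among the k nearest stored examples
    (squared Euclidean distance; ties broken arbitrarily)."""
    dist = lambda a, b: sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    order = [i for i in sorted(range(len(X)), key=lambda i: dist(X[i], q))
             if i != exclude]
    return Counter(y[i] for i in order[:k]).most_common(1)[0][0]

def best_k_per_cluster(X, y, cluster_of, k_max):
    """For each cluster, the k that classifies the most of that
    cluster's training patterns correctly under leave-one-out CV."""
    best = {}
    for c in set(cluster_of):
        members = [i for i in range(len(X)) if cluster_of[i] == c]
        best[c] = max(range(1, k_max + 1),
                      key=lambda k: sum(knn_class(X, y, X[i], k, exclude=i) == y[i]
                                        for i in members))
    return best

def classify(X, y, best_k, assign_cluster, q):
    """Classify q with the k value of the cluster q falls into."""
    return knn_class(X, y, q, best_k[assign_cluster(q)])
```

\nSince only one k is stored per cluster, classification adds just a cluster lookup to plain kNN, which matches the low classification-time overhead noted above. \n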
\nWe conclude that there is considerable evidence that local nearest neighbor methods may significantly outperform the k-nearest neighbor method on specific data sets. We hypothesize that local methods will become relevant in the future when classifiers are constructed that simultaneously solve a variety of tasks. \n\nAcknowledgements \n\nThis research was supported in part by NSF Grant IRI-8657316, NASA Ames Grant NAG 2-630, and gifts from Sun Microsystems and Hewlett-Packard. Many thanks to Kathy Astrahantseff and Bill Langford for helpful comments during the revision of this manuscript. \n\nReferences \n\nAha, D.W. (1990). A Study of Instance-Based Algorithms for Supervised Learning Tasks. Technical Report, University of California, Irvine. \nBottou, L., & Vapnik, V. (1992). Local Learning Algorithms. Neural Computation, 4(6), 888-900. \nDasarathy, B.V. (1991). Nearest Neighbor (NN) Norms: NN Pattern Classification Techniques. IEEE Computer Society Press. \nDetrano, R., Janosi, A., Steinbrunn, W., Pfisterer, M., Schmid, K., Sandhu, S., Guppy, K., Lee, S., & Froelicher, V. (1989). Rapid searches for complex patterns in biological molecules. American Journal of Cardiology, 64, 304-310. \nMurphy, P.M., & Aha, D.W. (1991). UCI Repository of machine learning databases [machine-readable data repository]. Technical Report, University of California, Irvine. \nWeiss, S.M., & Kulikowski, C.A. (1991). Computer Systems That Learn. San Mateo, California: Morgan Kaufmann Publishers, Inc. \nXu, L., Krzyzak, A., & Oja, E. (1993). Rival Penalized Competitive Learning for Clustering Analysis, RBF Net, and Curve Detection. IEEE Transactions on Neural Networks, 4(4), 636-649. \n\n[Figure 4 appeared here: the two curves of the Constructed data set, with noisy regions marked at either end.] \n\nkNN correct: 69.3% / 66.9% / 51.0%; local kNN correct: 84.6% / 77.5% / 78.3%. Size of training set: 480; test set: 120. Total correct: kNN: 70.0%; local kNN: 74.8%. \n\nFigure 4: Data points for the Constructed data set were drawn from either of the two displayed curves (i.e., all data points lie on one of the two curves). Class labels were flipped with increasing probabilities to a maximum noise level of approximately 45% at the respective ends of the two lines. Listed at the bottom is the performance of kNN and localKNN ks unrestricted within different regions of the input space and for the entire input space. \n", "award": [], "sourceid": 745, "authors": [{"given_name": "Dietrich", "family_name": "Wettschereck", "institution": null}, {"given_name": "Thomas", "family_name": "Dietterich", "institution": null}]}