{"title": "An Adaptive Metric Machine for Pattern Classification", "book": "Advances in Neural Information Processing Systems", "page_first": 458, "page_last": 464, "abstract": null, "full_text": "An Adaptive Metric Machine for Pattern \n\nClassification \n\nCarlotta Domeniconi, Jing Peng+, Dimitrios Gunopulos \n\nDept. of Computer Science, University of California, Riverside, CA 92521 \n\n+ Dept. of Computer Science, Oklahoma State University, Stillwater, OK 74078 \n\n{ carlotta, dg} @cs.ucr.edu, jpeng@cs.okstate.edu \n\nAbstract \n\nNearest neighbor classification assumes locally constant class con(cid:173)\nditional probabilities. This assumption becomes invalid in high \ndimensions with finite samples due to the curse of dimensionality. \nSevere bias can be introduced under these conditions when using \nthe nearest neighbor rule. We propose a locally adaptive nearest \nneighbor classification method to try to minimize bias. We use a \nChi-squared distance analysis to compute a flexible metric for pro(cid:173)\nducing neighborhoods that are elongated along less relevant feature \ndimensions and constricted along most influential ones. As a result, \nthe class conditional probabilities tend to be smoother in the mod(cid:173)\nified neighborhoods, whereby better classification performance can \nbe achieved. The efficacy of our method is validated and compared \nagainst other techniques using a variety of real world data. \n\nIntroduction \n\n1 \nIn classification, a feature vector x = (Xl,\u00b7\u00b7\u00b7, Xqy E lRq, representing an object, \nis assumed to be in one of J classes {i}{=l' and the objective is to build classifier \nmachines that assign x to the correct class from a given set of N training samples. \n\nThe K nearest neighbor (NN) classification method [3, 5, 7, 8, 9] is a simple and \nappealing approach to this problem. 
Such a method produces continuous and overlapping, rather than fixed, neighborhoods, and uses a different neighborhood for each individual query so that, to the extent possible, all points in the neighborhood are close to the query. In addition, it has been shown [4, 6] that the one-NN rule has an asymptotic error rate that is at most twice the Bayes error rate, independent of the distance metric used.

The NN rule becomes less appealing with finite training samples, however. This is due to the curse of dimensionality [2]. Severe bias can be introduced in the NN rule in a high-dimensional input feature space with finite samples. As such, the choice of a distance measure becomes crucial in determining the outcome of nearest neighbor classification. The commonly used Euclidean distance measure, while computationally simple, implies that the input space is isotropic or homogeneous. However, the assumption of isotropy is often invalid and generally undesirable in many practical applications. In general, distance computation does not vary with equal strength or in the same proportion in all directions of the feature space emanating from the input query. Capturing such information is therefore of great importance to any classification procedure in high-dimensional settings.

In this paper we propose an adaptive metric classification method that aims to minimize bias in high dimensions. We estimate a flexible metric for computing neighborhoods based on a Chi-squared distance analysis. The resulting neighborhoods are highly adaptive to query locations. Moreover, the neighborhoods are elongated along less relevant feature dimensions and constricted along the most influential ones. As a result, the class conditional probabilities tend to be constant in the modified neighborhoods, whereby better classification performance can be obtained.
2 Local Feature Relevance Measure

Our technique is motivated as follows. Let x_0 be the test point whose class membership we are predicting. In the one-NN classification rule, a single nearest neighbor x is found according to a distance metric D(x, x_0). Let Pr(j|x) be the class conditional probability at point x. Consider the weighted Chi-squared distance [8, 11]

    D(x, x_0) = sum_{j=1}^J [Pr(j|x) - Pr(j|x_0)]^2 / Pr(j|x_0),    (1)

which measures the distance between x_0 and the point x in terms of the difference between the class posterior probabilities at the two points. A small D(x, x_0) indicates that the classification error rate will be close to the asymptotic error rate for one nearest neighbor. In general, this can be achieved when Pr(j|x) = Pr(j|x_0), which states that if Pr(j|x) can be sufficiently well approximated at x_0, the asymptotic 1-NN error rate might result in finite sample settings.

Equation (1) computes the distance between the true and estimated posteriors. Now, imagine we replace Pr(j|x_0) with a quantity that attempts to predict Pr(j|x) under the constraint that the quantity is conditioned at a location along a particular feature dimension. Then the Chi-squared distance (1) tells us the extent to which that dimension can be relied on to predict Pr(j|x). Thus, Equation (1) provides us with a foundation upon which to develop a theory of feature relevance in the context of pattern classification.

Based on the above discussion, our proposal is the following. We first notice that Pr(j|x) is a function of x. Therefore, we can compute the conditional expectation of Pr(j|x), denoted by Pr(j|x_i = z), given that x_i assumes value z, where x_i represents the ith component of x. That is, Pr(j|x_i = z) = E[Pr(j|x) | x_i = z] = integral of Pr(j|x) p(x|x_i = z) dx. Here p(x|x_i = z) is the conditional density of the other input variables.
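As a concrete illustration, the weighted Chi-squared distance (1) between two class posterior vectors can be computed directly. This is a minimal sketch that assumes the posteriors are already available as arrays; in practice they must be estimated from data, as described in Section 3:

```python
import numpy as np

def chi_squared_distance(post_x, post_x0):
    """Weighted Chi-squared distance of Eq. (1) between the posterior
    vectors Pr(j|x) and Pr(j|x_0), summed over the J classes."""
    post_x = np.asarray(post_x, dtype=float)
    post_x0 = np.asarray(post_x0, dtype=float)
    return np.sum((post_x - post_x0) ** 2 / post_x0)
```

The distance is zero exactly when the two posterior vectors agree, which is the condition under which the 1-NN rule approaches its asymptotic error rate.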
Let

    r_i(x) = sum_{j=1}^J [Pr(j|x) - Pr(j|x_i = z_i)]^2 / Pr(j|x_i = z_i).    (2)

r_i(x) represents the ability of feature i to predict the Pr(j|x)'s at x_i = z_i. The closer Pr(j|x_i = z_i) is to Pr(j|x), the more information feature i carries for predicting the class posterior probabilities locally at x.

We can now define a measure of feature relevance for x_0 as

    r̄_i(x_0) = (1/K) sum_{z in N(x_0)} r_i(z),    (3)

where N(x_0) denotes the neighborhood of x_0 containing the K nearest training points, according to a given metric. r̄_i measures how well on average the class posterior probabilities can be approximated along input feature i within a local neighborhood of x_0. A small r̄_i implies that the class posterior probabilities will be well captured along dimension i in the vicinity of x_0. Note that r̄_i(x_0) is a function of both the test point x_0 and the dimension i, thereby making r̄_i(x_0) a local relevance measure.

The relative relevance can then be given by the following exponential weighting scheme

    w_i(x_0) = exp(c R_i(x_0)) / sum_{l=1}^q exp(c R_l(x_0)),    (4)

where c is a parameter that can be chosen to maximize (minimize) the influence of r̄_i on w_i, and R_i(x) = max_j r̄_j(x) - r̄_i(x). When c = 0 we have w_i = 1/q, thereby ignoring any difference between the r̄_i's. On the other hand, when c is large a change in r̄_i will be exponentially reflected in w_i; in this case, w_i is said to follow the Boltzmann distribution. The exponential weighting is more sensitive to changes in the local feature relevance (3) and gives rise to better performance improvement. Thus, (4) can be used as weights associated with features for the weighted distance computation

    D(x, y) = sqrt(sum_{i=1}^q w_i (x_i - y_i)^2).

These weights enable the neighborhood to elongate along less important feature dimensions and, at the same time, to constrict along the most influential ones. Note that the technique is query-based because the weightings depend on the query [1].
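A minimal sketch of the weighting scheme (4) and the resulting weighted distance, assuming the average relevance values r̄_i of Eq. (3) have already been computed (function names are illustrative):

```python
import numpy as np

def exponential_weights(r_bar, c=1.0):
    """Eq. (4): map average relevance scores r_bar_i (Eq. (3)) to feature
    weights. Since R_i = max_j r_bar_j - r_bar_i, the smallest r_bar_i
    (the best-predicting feature) receives the largest weight."""
    r_bar = np.asarray(r_bar, dtype=float)
    R = r_bar.max() - r_bar
    e = np.exp(c * R)
    return e / e.sum()

def weighted_distance(x, y, w):
    """Weighted Euclidean distance D(x, y) = sqrt(sum_i w_i (x_i - y_i)^2)."""
    x, y, w = (np.asarray(a, dtype=float) for a in (x, y, w))
    return np.sqrt(np.sum(w * (x - y) ** 2))
```

With c = 0 the weights reduce to the uniform 1/q, recovering plain (normalized) Euclidean distance; increasing c sharpens the contrast between relevant and irrelevant dimensions.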
3 Estimation

Since both Pr(j|x) and Pr(j|x_i = z_i) in (3) are unknown, we must estimate them using the training data {x_n, y_n}_{n=1}^N in order for the relevance measure (3) to be useful in practice. Here y_n in {1, ..., J}. The quantity Pr(j|x) is estimated by considering a neighborhood N_1(x) centered at x:

    P̂r(j|x) = sum_{x_n in N_1(x)} 1(y_n = j) / sum_{x_n in N_1(x)} 1,    (5)

where 1(.) is an indicator function that returns 1 when its argument is true, and 0 otherwise.

To compute Pr(j|x_i = z) = E[Pr(j|x) | x_i = z], we introduce a dummy variable g_j such that g_j|x = 1 if y = j, and g_j|x = 0 otherwise, for j = 1, ..., J. We then have Pr(j|x) = E[g_j|x], from which it is not hard to show that Pr(j|x_i = z) = E[g_j | x_i = z]. However, since there may not be any data at x_i = z, the data from the neighborhood of x along dimension i are used to estimate E[g_j | x_i = z], a strategy suggested in [7]. In detail, noticing that g_j = 1(y = j), the estimate can be computed from

    P̂r(j|x_i = z_i) = sum_{x_n in N_2(x)} 1(|x_ni - x_i| <= Δ_i) 1(y_n = j) / sum_{x_n in N_2(x)} 1(|x_ni - x_i| <= Δ_i),    (6)

where N_2(x) is a neighborhood centered at x (larger than N_1(x)), and the value of Δ_i is chosen so that the interval contains a fixed number L of points: sum_{n=1}^N 1(|x_ni - x_i| <= Δ_i) 1(x_n in N_2(x)) = L. Using the estimates in (5) and (6), we obtain an empirical measure of the relevance (3) for each input variable i.
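The estimators (5) and (6) can be sketched as follows. The neighborhood sizes K1 and K2 and the interval count L are illustrative choices rather than the paper's tuned values, and a plain Euclidean metric is assumed for the initial neighborhoods:

```python
import numpy as np

def estimate_posteriors(x, X, y, n_classes, K1=20, K2=50, L=15):
    """Sketch of the estimates (5) and (6) around a point x.
    X is the (N, q) training matrix, y the integer labels in {0,...,J-1}.
    Returns (p_x, p_cond): p_x[j] estimates Pr(j|x) from N_1(x) via Eq. (5),
    and p_cond[i, j] estimates Pr(j|x_i = z_i) from N_2(x) via Eq. (6)."""
    d = np.linalg.norm(X - x, axis=1)      # Euclidean for the initial metric
    N1 = np.argsort(d)[:K1]                # K1 nearest points: N_1(x)
    N2 = np.argsort(d)[:K2]                # larger neighborhood: N_2(x)

    # Eq. (5): class fractions inside N_1(x).
    p_x = np.bincount(y[N1], minlength=n_classes) / K1

    q = X.shape[1]
    p_cond = np.empty((q, n_classes))
    for i in range(q):
        # Choose the interval half-width Delta_i so it captures L points
        # of N_2(x) along dimension i.
        gaps = np.abs(X[N2, i] - x[i])
        delta = np.sort(gaps)[min(L, len(gaps)) - 1]
        inside = N2[gaps <= delta]
        # Eq. (6): class fractions among the captured points.
        p_cond[i] = np.bincount(y[inside], minlength=n_classes) / len(inside)
    return p_x, p_cond
```

From these two estimates, r_i(x) of Eq. (2) follows by a direct Chi-squared comparison of `p_x` against row i of `p_cond`, and averaging over the K neighbors of x_0 gives the empirical relevance (3).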
4 Empirical Results

In the following we compare several classification methods using real data: (1) the adaptive metric nearest neighbor (ADAMENN) method (one iteration) described above, coupled with the exponential weighting scheme (4); (2) i-ADAMENN, ADAMENN with five iterations; (3) the simple K-NN method using the Euclidean distance measure; (4) the C4.5 decision tree method [12]; (5) Machete [7], an adaptive NN procedure in which the input variable used for splitting at each step is the one that maximizes the estimated local relevance; (6) Scythe [7], a generalization of the Machete algorithm in which the input variables influence each split in proportion to their estimated local relevance, rather than via the winner-take-all strategy of Machete; (7) DANN, discriminant adaptive nearest neighbor classification [8]; and (8) i-DANN, DANN with five iterations [8].

In all the experiments, the features are first normalized over the training data to have zero mean and unit variance, and the test data are normalized using the corresponding training mean and variance. Procedural parameters for each method were determined empirically through cross-validation.

Table 1: Average classification error rates.

Method      Iris  Sonar  Vowel  Glass  Image  Seg  Letter  Liver  Lung
ADAMENN      3.0    9.1   10.7   24.8    5.2  2.4     5.1   30.7  40.6
i-ADAMENN    5.0    9.6   10.9   24.8    5.2  2.5     5.3   30.4  40.6
K-NN         6.0   12.5   11.8   28.0    6.1  3.6     6.9   32.5  50.0
C4.5         8.0   23.1   36.7   31.8   21.6  3.7    16.4   38.3  59.4
Machete      5.0   21.2   20.2   28.0   12.3  3.2     9.1   27.5  50.0
Scythe       4.0   16.3   15.5   27.1    5.0  3.3     7.2   27.5  50.0
DANN         6.0    7.7   12.5   27.1   12.9  2.5     3.1   30.1  46.9
i-DANN       6.0    9.1   21.8   26.6   18.1  3.7     6.1   27.8  40.6

Classification Data Sets.
The data sets used were taken from the UCI Machine Learning Database Repository [10], except for the unreleased image data set. They are: 1. Iris data. This data set consists of q = 4 measurements made on each of N = 100 iris plants of J = 2 species. 2. Sonar data. This data set consists of q = 60 frequency measurements made on each of N = 208 data of J = 2 classes ("mines" and "rocks"). 3. Vowel data. This example has q = 10 measurements and J = 11 classes, with a total of N = 528 samples. 4. Glass data. This data set consists of q = 9 chemical attributes measured for each of N = 214 data of J = 6 classes. 5. Image data. This data set consists of 40 texture images that are manually classified into 15 classes. The number of images in each class varies from 16 to 80. The images in this database are represented by q = 16 dimensional feature vectors. 6. Seg data. This data set consists of images that were drawn randomly from a database of 7 outdoor images. There are J = 7 classes, each of which has 330 instances; thus, there are N = 2,310 images in the database, represented by q = 19 real-valued attributes. 7. Letter data. This data set consists of q = 16 numerical attributes and J = 26 classes. 8. Liver data. This data set consists of N = 345 instances, represented by q = 6 numerical attributes, and J = 2 classes. 9. Lung data. This example has N = 32 instances with q = 56 numerical features and J = 3 classes.

Results: Table 1 shows the (cross-validated) error rates for the eight methods under consideration on the nine real data sets. Note that the average error rates
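The feature normalization used in the experimental setup above (zero mean and unit variance over the training data, with the same training statistics applied to the test data) can be sketched as:

```python
import numpy as np

def zscore_train_test(X_train, X_test):
    """Normalize features to zero mean / unit variance over the training
    data, then apply the *same* training statistics to the test data, so
    no information from the test set leaks into the preprocessing."""
    mu = X_train.mean(axis=0)
    sigma = X_train.std(axis=0)
    sigma[sigma == 0] = 1.0            # guard against constant features
    return (X_train - mu) / sigma, (X_test - mu) / sigma
```

Applying the training statistics to the test data, rather than renormalizing the test set on its own, mirrors the deployment setting where test points arrive one at a time.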