{"title": "Classification in Non-Metric Spaces", "book": "Advances in Neural Information Processing Systems", "page_first": 838, "page_last": 846, "abstract": null, "full_text": "Classification in Non-Metric Spaces \n\nDaphna Weinshall^{1,2}, David W. Jacobs^1, Yoram Gdalyahu^2 \n\n^1 NEC Research Institute, 4 Independence Way, Princeton, NJ 08540, USA \n^2 Inst. of Computer Science, Hebrew University of Jerusalem, Jerusalem 91904, Israel \n\nAbstract \n\nA key question in vision is how to represent our knowledge of previously encountered objects to classify new ones. The answer depends on how we determine the similarity of two objects. Similarity tells us how relevant each previously seen object is in determining the category to which a new object belongs. Here a dichotomy emerges. Complex notions of similarity appear necessary for cognitive models and applications, while simple notions of similarity form a tractable basis for current computational approaches to classification. We explore the nature of this dichotomy and why it calls for new approaches to well-studied problems in learning. We begin this process by demonstrating new computational methods for supervised learning that can handle complex notions of similarity. (1) We discuss how to implement parametric methods that represent a class by its mean when using non-metric similarity functions; and (2) we review non-parametric methods that we have developed using nearest neighbor classification in non-metric spaces. Point (2), and some of the background of our work, have been described in more detail in [8]. \n\n1 Supervised Learning and Non-Metric Distances \n\nHow can one represent one's knowledge of previously encountered objects in order to classify new objects? 
We study this question within the framework of supervised learning: it is assumed that one is given a number of training objects, each labeled as belonging to a category; one wishes to use this experience to label new test instances of objects. This problem emerges both in the modeling of cognitive processes and in many practical applications. For example, one might want to identify risky applicants for credit based on past experience with clients who have proven to be good or bad credit risks. Our work is motivated by computer vision applications. \n\nMost current computational approaches to supervised learning suppose that objects can be thought of as vectors of numbers, or equivalently as points lying in an n-dimensional space. They further suppose that the similarity between objects can be determined from the Euclidean distance between these vectors, or from some other simple metric. This classic notion of similarity as Euclidean or metric distance leads to considerable mathematical and computational simplification. \n\nHowever, work in cognitive psychology has challenged such simple notions of similarity as models of human judgment, while applications frequently employ non-Euclidean distances to measure object similarity. We consider the need for similarity measures that are not only non-Euclidean, but non-metric. We focus on proposed similarities that violate one requirement of a metric distance, the triangle inequality. This states that if we denote the distance between objects A and B by d(A, B), then: ∀A, B, C: d(A, B) + d(B, C) ≥ d(A, C). Distances violating the triangle inequality must also be non-Euclidean. \n\nData from cognitive psychology has demonstrated that similarity judgments may not be well modeled by Euclidean distances. 
Tversky [12] has demonstrated instances in which similarity judgments may violate the triangle inequality. For example, close similarity between Jamaica and Cuba and between Cuba and Russia does not imply close similarity between Jamaica and Russia (see also [10]). Non-metric similarity measures are frequently employed for practical reasons, too (cf. [5]). In part, work in robust statistics [7] has shown that methods that will survive the presence of outliers, which are extraneous pieces of information or information containing extreme errors, must employ non-Euclidean distances that in fact violate the triangle inequality; related insights have spurred the widespread use of robust methods in computer vision (reviewed in [5] and [9]). \n\nWe are interested in handling a wide range of non-metric distance functions, including those that are so complex that they must be treated as a black box. However, to be concrete, we will focus here on two simple examples of such distances: \n\nmedian distance: This distance assumes that objects are representable as a set of features whose individual differences can be measured, so that the difference between two objects is representable as a vector: d̄ = (d_1, d_2, ..., d_n). The median distance between the two objects is just the median value in this vector. Similarly, one can define a k-median distance by choosing the k'th lowest element in this list. k-median distances are often used in applications (cf. [9]), because they are unaffected by the exact values of the most extreme differences between the objects. Only those features that are most similar determine the value. The k-median distance can violate the triangle inequality to an arbitrary degree (i.e., there are no constraints on the pairwise distances between three points). 
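The k-median distance described above can be sketched as follows (the function name and example points are ours, for illustration); the last lines exhibit a concrete triangle-inequality violation under the min (1-median) distance:

```python
import numpy as np

def k_median_distance(a, b, k):
    '''k-median distance: the k-th smallest absolute feature difference.
    For k around n/2 this is the median distance described above.'''
    diffs = np.sort(np.abs(np.asarray(a) - np.asarray(b)))
    return diffs[k - 1]  # k is 1-based

# Three points in the plane; under the min (1-median) distance,
# d(A,B) + d(B,C) < d(A,C), violating the triangle inequality.
A, B, C = [0.0, 0.0], [0.0, 5.0], [5.0, 5.0]
dAB = k_median_distance(A, B, 1)  # 0.0 (x coordinates agree)
dBC = k_median_distance(B, C, 1)  # 0.0 (y coordinates agree)
dAC = k_median_distance(A, C, 1)  # 5.0
```

Because only the k most similar features matter, points agreeing on any single coordinate are at zero min-distance, however far apart they are otherwise.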
\nrobust non-metric LP distances: Given a difference vector d̄, an LP distance has the form: \n\nd_p(d̄) = (Σ_i |d_i|^p)^(1/p)   (1) \n\nand is non-metric for p < 1. \n\nFigure 1 illustrates why these distances present significant new challenges in supervised learning. Suppose that given some datapoints (two in Fig. 1), we wish to classify each new point as coming from the same category as its nearest neighbor. Then we need to determine the Voronoi diagram generated by our data: a division of the plane into regions in which the points all have the same nearest neighbor. Fig. 1 shows how the Voronoi diagram changes with the function used to compute the distance between datapoints; the non-metric diagrams (rightmost three pictures in Fig. 1) are more complex and more likely to make non-intuitive predictions. In fact, very little is known about the computation of non-metric Voronoi diagrams. \n\nFigure 1: The Voronoi diagram for two points using, from left to right, p-distances with p = 2 (Euclidean), p = 1 (Manhattan, which is still metric), the non-metric distances arising from p = 0.5, p = 0.2, and the min (1-median) distance. The min distance in 2-D illustrates the behavior of the other median distances in higher dimensions. The region of the plane closer to one point is shown in black, and closer to the other in white. \n\nWe now describe new parametric methods for supervised learning with non-metric distances, and review non-parametric methods that we described in [8]. \n\n2 Parametric methods: what should replace the mean \n\nParametric methods typically represent objects as vectors in a high-dimensional space, and represent classes and the boundaries between them in this space using geometric constructions or probability distributions with a limited number of parameters. 
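A minimal sketch of the robust L^p distance of equation (1) (naming ours); for p < 1 a detour through an intermediate point can be 'shorter' than the direct route, so the triangle inequality fails:

```python
import numpy as np

def lp_distance(x, y, p):
    '''The robust L^p distance of equation (1); non-metric when p < 1.'''
    diffs = np.abs(np.asarray(x) - np.asarray(y))
    return float(np.sum(diffs ** p) ** (1.0 / p))

A, B, C = [0.0, 0.0], [1.0, 0.0], [1.0, 1.0]
p = 0.5
dAB = lp_distance(A, B, p)  # 1.0
dBC = lp_distance(B, C, p)  # 1.0
dAC = lp_distance(A, C, p)  # (1 + 1)^2 = 4.0, so dAB + dBC < dAC
```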
One can attempt to extend these techniques to specific non-metric distances, such as the median distance, or non-metric LP distances. We discuss the example of the mean of a class below. One can also redefine geometric objects such as linear separators, for specific non-metric distances. However, existing algorithms for finding such objects in Euclidean spaces will no longer be directly suitable, nor will theoretical results about such representations hold. Many problems are therefore open in determining how to best apply parametric supervised learning techniques to specific non-metric distances. \n\nWe analyze k-means clustering where each class is represented by its average member; new elements are then classified according to which of these prototypical examples is nearest. In Euclidean space, the mean is the point q̄ whose sum of squared distances to all the class members {q_i}_{i=1}^n - (Σ_{i=1}^n d(q̄, q_i)^2)^(1/2) - is minimized. Suppose now that our data come from a vector space where the correct distance is the LP distance from (1). Using the natural extension of the above definition, we should represent each class by the point q̄ whose sum of distances to all the class members - (Σ_{i=1}^n d(q̄, q_i)^p)^(1/p) - is minimal. It is now possible to show (proof is omitted) that for p < 1 (the non-metric cases), the exact value of every feature of the representative point q̄ must have already appeared in at least one element in the class. Moreover, the value of these features can be determined separately with complexity O(n^2), and total complexity of O(dn^2) given d features. q̄ is therefore determined by a mixture of up to d exemplars, where d is the dimension of the vector space. Thus there are efficient algorithms for finding the \"mean\" element of a class, even using certain non-metric distances. 
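The per-feature search implied by this result can be sketched as follows (a simplified illustration under our own naming, not the authors' implementation). Since each coordinate of the minimizer must coincide with an observed value, each feature is chosen independently by scanning the n observed candidates:

```python
import numpy as np

def lp_prototype(Q, p):
    '''Prototype for the L^p 'mean' with p < 1. Q is an (n, d) array of
    class members. Because sum_i d(q, q_i)^p separates across features,
    and each coordinate of the minimizer must equal an observed value,
    each feature is picked independently by scanning the n observed
    candidate values (O(n^2) per feature, O(d n^2) in total).'''
    Q = np.asarray(Q, dtype=float)
    n, d = Q.shape
    proto = np.empty(d)
    for j in range(d):
        col = Q[:, j]
        # cost of candidate value v for feature j: sum_i |v - q_ij|^p
        costs = [np.sum(np.abs(v - col) ** p) for v in col]
        proto[j] = col[int(np.argmin(costs))]
    return proto

proto = lp_prototype([[0., 0.], [0., 0.], [0., 10.], [5., 0.]], 0.5)
# each coordinate of proto is a value observed in some class member
```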
We will illustrate these results with a concrete example using the Corel database, a commercial database of images pre-labeled by categories (such as \"lions\"), where non-metric distance functions have proven effective in determining the similarity of images [1]. The Corel database is very large, making the use of prototypes desirable. \n\nWe represent each image using a vector of 11 numbers describing general image properties, such as color histograms, as described in [1]. We consider the Euclidean and L^0.5 distances, and their corresponding prototypes: the mean and the L^0.5-prototype computed according to the result above. Given the first 45 classes, each containing 100 images, we found their corresponding prototypes; we then computed the percentage of images in each class that are closest to their own prototype, using either the Euclidean or the L^0.5 distance and one of the two prototypes. The results are the following: \n\n[Table garbled in extraction: for each combination of prototype (Euclidean mean vs. L^0.5 prototype built from existing feature values) and distance function (Euclidean vs. L^0.5), it reports the percentage of images closest to their own prototype; only the entries 25% and 20% survive.] \n\nIn the first column, the prototype is computed using the Euclidean mean. In the second column the prototype is computed using an L^0.5 distance. In each row, a different function is used to compute the distance from each item to the cluster prototype. Best results are indeed obtained with the non-metric L^0.5 distance and the correct prototype for this particular distance. While performance in absolute terms depends on how well this data clusters using distances derived from a simple feature vector, relative performance of different methods reveals the advantage of using a prototype computed with a non-metric distance. \n\nAnother important distance function is the generalized Hamming distance: given two vectors of features, their distance is the number of features which are different in the two vectors. 
This distance was assumed in psychophysical experiments which used artificial objects (Fribbles) to investigate human categorization and object recognition [13]. In agreement with experimental results, the prototype q̄ for this distance, computed according to the definition above, is the vector of \"modal\" features - the most common feature value computed independently at each feature. \n\n3 Non-Parametric Methods: Nearest Neighbors \n\nNon-parametric classification methods typically represent a class directly by its exemplars. Specifically, nearest-neighbor techniques classify new objects using only their distance to labeled exemplars. Such methods can be applied using any non-metric distance function, treating the function as a black box. However, nearest-neighbor techniques must also be modified to apply well to non-metric distances. The insights we gain below from doing this can form the basis of more efficient and effective computer algorithms, and of cognitive models for which examples of a class are worth remembering. This section summarizes work described in [8]. \n\nCurrent efficient algorithms for finding the nearest neighbor of a class work only for metric distances [3]. The alternative of a brute-force approach, in which a new object is explicitly compared to every previously seen object, is desirable neither computationally nor as a cognitive model. A natural approach to handling this problem is to represent each class by a subset of its labeled examples. Such methods are called condensing algorithms. Below we develop condensing methods for selecting a subset of the training set which minimizes errors in the classification of new datapoints, taking into account the non-metric nature of the distance. \n\nIn designing a condensing method, one needs to answer the question: when is one object a good substitute for another? 
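A small sketch of the generalized Hamming distance and its modal-feature prototype (function names and example feature values are ours, for illustration):

```python
from collections import Counter

def hamming_distance(x, y):
    '''Generalized Hamming distance: number of differing features.'''
    return sum(a != b for a, b in zip(x, y))

def modal_prototype(members):
    '''Prototype under the generalized Hamming distance: the most
    common value of each feature, chosen independently per feature.'''
    return [Counter(col).most_common(1)[0][0] for col in zip(*members)]

members = [('red', 'round', 'small'),
           ('red', 'round', 'large'),
           ('blue', 'round', 'small')]
proto = modal_prototype(members)  # ['red', 'round', 'small']
```

Note that, as with the L^p prototype for p < 1, each feature of this prototype is a value that actually occurs in some class member.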
Earlier methods (e.g., [6, 2]) make use of the fact that the triangle inequality guarantees that when two points are similar to each other, their patterns of similarity to other points are not very different. Thus, in a metric space, there is no reason to store two similar datapoints; one can easily substitute for the other. Things are different in non-metric spaces. \n\nFigure 2: a) Two clusters of labeled points (left) and their Voronoi diagram (right) computed using the 1-median (min) distance. Cluster P consists of four points (black squares) all close together both according to the median distance and the Euclidean distance. Cluster Q consists of five points (black crosses) all having the same x coordinate, and so all are separated by zero distance using the median (but not Euclidean) distance. We wish to select a subset of points to represent each class, while changing this Voronoi diagram as little as possible. b) All points in class Q have zero distance to each other, using the min distance. So distance provides no clue as to which are interchangeable. However, the top points (q1, q2) have distances to the points in class P that are highly correlated with each other, and poorly correlated with the bottom points (q3, q4, q5). Without using correlation as a clue, we might represent Q with two points from the bottom (which are nearer the boundary with P, a factor preferred in existing approaches). This changes the Voronoi diagram drastically, as shown on the left. Using correlation as a clue, we select points from the top and bottom, changing the Voronoi diagram much less, as shown on the right. \n\nSpecifically, what we really need to know is when two objects will have similar distances to other objects, yet unseen. 
We estimate this quantity using the correlation between two vectors: the vector of distances from one datapoint to all the other training data, and the vector of distances from the second datapoint to all the remaining training data.¹ It can be shown (proof is omitted) that in a Euclidean space the similarity between two points is the best measure of how well one can substitute for the other, whereas in a non-metric space the aforementioned vector correlation is a substantially better measure. Fig. 2 illustrates this result. \n\nWe now draw on these insights to produce concrete methods for representing classes in non-metric spaces, for nearest neighbor classification. We compare three algorithms. The first two algorithms, random selection (cf. [6]) and boundary detection (e.g., [11]), represent old condensing ideas: in the first we pick a random selection of class representatives, in the second we use points close to class boundaries as representatives. The last algorithm uses new ideas: correlation selection includes in the representative set points which are least correlated with the other class members and representatives. To be fair in our comparison, all algorithms were constrained to select the same number of representative points for each class. \n\nDuring the simulation, each of 1000 test datapoints was classified based on: (1) all the data; (2) the representatives computed by each of the three algorithms. For each algorithm, the test is successful if the two methods (classification based on all the data and based on the chosen representatives) give the same results. Fig. 3a-c summarizes representative results of our simulations. See [8] for details. 
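The correlation clue can be sketched as follows (a simplified illustration under our own naming, using a precomputed pairwise distance matrix rather than the authors' exact procedure):

```python
import numpy as np

def distance_correlation(i, j, D):
    '''Correlation clue for condensing: given the pairwise distance
    matrix D over the training set, correlate the distance profiles of
    points i and j to the remaining points (excluding i and j).'''
    others = [k for k in range(len(D)) if k not in (i, j)]
    x = D[i, others]
    y = D[j, others]
    return float(np.corrcoef(x, y)[0, 1])

# Toy example with the min (1-median) distance in 2-D: points 0 and 1
# share an x coordinate, so their min-distance is zero, yet their
# distance profiles to the rest of the data disagree.
pts = np.array([[0., 0.], [0., 5.], [3., 0.], [3., 5.], [10., 2.]])
n = len(pts)
D = np.array([[np.min(np.abs(pts[a] - pts[b])) if a != b else 0.0
               for b in range(n)] for a in range(n)])
rho = distance_correlation(0, 1, D)
```

Here the two zero-distance points turn out to have negatively correlated distance profiles, so despite their apparent closeness they would be poor substitutes for one another.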
¹Given two datapoints X, Y and x, y ∈ R^n, where x is the vector of distances from X to all the other training points and y is the corresponding vector for Y, we measure the correlation between the datapoints using the statistical correlation coefficient between x and y: corr(X, Y) = corr(x, y) = (1/n) Σ_i ((x_i − μ_x)/σ_x) ((y_i − μ_y)/σ_y), where μ_x, μ_y denote the means of x, y respectively, and σ_x, σ_y denote the standard deviations of x, y respectively. \n\nFigure 3: Results: values of percent correct scores, as well as error bars giving the standard deviation calculated over 20 repetitions of each test block when appropriate. Each graph contains 3 plots, giving the percent correct score for each of the three algorithms described above: random (selection), boundary (detection), and (selection based on) correlation. (a-c) Simulation results: data is chosen from R^25; 30 clusters were randomly chosen, each with 30 datapoints. 
The distribution of points in each class was: (a) normal; (b) normal, where in half the datapoints one random coordinate was modified (thus the points cluster around a prototype, but many class members vary widely in one random dimension); (c) union of 2 concentric normal distributions, one spherical and one elongated elliptical (thus the points cluster around a prototype, but may vary significantly in a few non-defining dimensions). Each plot gives 4 values, for each of the different distance functions used here: median, L^0.2, L^0.5 and L^2. (d) Real data: the number of representatives chosen by the algorithm was limited to 5 (first column) and 7 (second column). \n\nTo test our method with real images, we used the local curve matching algorithm described in [4]. This non-metric curve matching algorithm was specifically designed to compare curves which may be quite different, and return the distance between them. The training and test data are shown in Fig. 4. Results are given in Fig. 3d. \n\nThe simulations and the real data demonstrate a significant advantage to our new method. Almost as important, in metric spaces (4th column in Fig. 3a-c) or when the classes lack any \"interesting\" structure (Fig. 3a), our method is not worse than existing methods. Thus it should be used to guarantee good performance when the nature of the data and the distance function is not known a priori. \n\nReferences \n\n[1] Cox, I., Miller, M., Omohundro, S., and Yianilos, P., 1996, \"PicHunter: Bayesian Relevance Feedback for Image Retrieval,\" Proc. of ICPR, C:361-369. 
Figure 4: Real data used to test the three algorithms, including 2 classes with 30 images each: a) 12 examples from the first class of 30 cow contours, obtained from different viewpoints of the same cow. b) 12 examples from the second class of 30 car contours, obtained from different viewpoints of 2 similar cars. c) 12 examples from the set of 30 test cow contours, obtained from different viewpoints of the same cow with possibly additional occlusion. d) 2 examples of the real images from which the contours in a) are obtained. \n\n[2] Dasarathy, B., 1994, \"Minimal Consistent Set (MCS) Identification for Optimal Nearest Neighbor Decision Systems Design,\" IEEE Trans. on Systems, Man and Cybernetics, 24(3):511-517. \n\n[3] Friedman, J., Bentley, J., Finkel, R., 1977, \"An Algorithm for Finding Best Matches in Logarithmic Expected Time,\" ACM Trans. on Math. Software, 3(3):209-226. \n\n[4] Gdalyahu, Y. and D. Weinshall, 1997, \"Local Curve Matching for Object Recognition without Prior Knowledge,\" Proc. DARPA Image Understanding Workshop, 1997. \n\n[5] Haralick, R. and L. Shapiro, 1993, Computer and Robot Vision, Vol. 2, Addison-Wesley Publishing. \n\n[6] Hart, P., 1968, \"The Condensed Nearest Neighbor Rule,\" IEEE Trans. on Information Theory, 14(3):515-516. \n\n[7] Huber, P., 1981, Robust Statistics, John Wiley and Sons. \n\n[8] Jacobs, D., Weinshall, D., and Gdalyahu, Y., 1998, \"Condensing Image Databases when Retrieval is based on Non-Metric Distances,\" Int. Conf. on Computer Vision:596-601. \n\n[9] Meer, P., D. Mintz, D. Kim and A. Rosenfeld, 1991, \"Robust Regression Methods for Computer Vision: A Review,\" Int. J. of Comp. Vis. 6(1):59-70. \n\n[10] Rosch, E., 1975, \"Cognitive Reference Points,\" Cognitive Psychology, 7:532-547. \n\n[11] Tomek, I., 1976, \"Two modifications of CNN,\" IEEE Trans. Syst. 
, Man, Cyber., SMC-6(11):769-772. \n\n[12] Tversky, A., 1977, \"Features of Similarity,\" Psychological Review, 84(4):327-352. \n\n[13] Williams, P., \"Prototypes, Exemplars, and Object Recognition,\" submitted. \n", "award": [], "sourceid": 1581, "authors": [{"given_name": "Daphna", "family_name": "Weinshall", "institution": null}, {"given_name": "David", "family_name": "Jacobs", "institution": null}, {"given_name": "Yoram", "family_name": "Gdalyahu", "institution": null}]}