{"title": "Learning a Distance Metric from Relative Comparisons", "book": "Advances in Neural Information Processing Systems", "page_first": 41, "page_last": 48, "abstract": "", "full_text": "Learning a Distance Metric from Relative Comparisons\n\nMatthew Schultz and Thorsten Joachims\n\nDepartment of Computer Science\nCornell University\nIthaca, NY 14853\n\n{schultz,tj}@cs.cornell.edu\n\nAbstract\n\nThis paper presents a method for learning a distance metric from relative comparisons such as \u201cA is closer to B than A is to C\u201d. Taking a Support Vector Machine (SVM) approach, we develop an algorithm that provides a flexible way of describing qualitative training data as a set of constraints. We show that such constraints lead to a convex quadratic programming problem that can be solved by adapting standard methods for SVM training. We empirically evaluate the performance and the modelling flexibility of the algorithm on a collection of text documents.\n\n1 Introduction\n\nDistance metrics are an essential component in many applications ranging from supervised learning and clustering to product recommendations and document browsing. Since designing such metrics by hand is difficult, we explore the problem of learning a metric from examples. In particular, we consider relative and qualitative examples of the form \u201cA is closer to B than A is to C\u201d. We believe that feedback of this type is more easily available in many application settings than quantitative examples (e.g. \u201cthe distance between A and B is 7.35\u201d) as considered in metric Multidimensional Scaling (MDS) (see [4]), or absolute qualitative feedback (e.g.\n
\u201cA and B are similar\u201d, \u201cA and C are not similar\u201d) as considered in [11].\n\nBuilding on the study in [7], search-engine query logs are one example where feedback of the form \u201cA is closer to B than A is to C\u201d is readily available for learning a (more semantic) similarity metric on documents. Given a ranked result list for a query, documents that are clicked on can be assumed to be semantically closer than those documents that the user observed but decided not to click on (i.e. \u201cAclick is closer to Bclick than Aclick is to Cnoclick\u201d). In contrast, drawing the conclusion that \u201cAclick and Cnoclick are not similar\u201d is probably less justified, since a Cnoclick high in the presented ranking is probably still closer to Aclick than most documents in the collection.\n\nIn this paper, we present an algorithm that can learn a distance metric from such relative and qualitative examples. Given a parametrized family of distance metrics, the algorithm discriminatively searches for the parameters that best fulfill the training examples. Taking a maximum-margin approach [9], we formulate the training problem as a convex quadratic program for the case of learning a weighting of the dimensions. We evaluate the performance and the modelling flexibility of the algorithm on a collection of text documents.\n\nThe notation used throughout this paper is as follows. Vectors are denoted with an arrow ~x, where xi is the ith entry in vector ~x. The vector ~0 is the vector composed of all zeros, and ~1 is the vector composed of all ones. ~x^T is the transpose of vector ~x, and the dot product is denoted by ~x^T ~y. We denote the element-wise product of two vectors ~x = (x1, ..., xn)^T and ~y = (y1, ..., yn)^T as ~x * ~y = (x1 y1, ..., xn yn)^T.\n\n2 Learning from Relative Qualitative Feedback\n\nWe consider the following learning setting.\n
Given is a set Xtrain of objects ~xi \u2208 R^N, together with a set Ptrain of relative comparisons of the form \u201c~xi is closer to ~xj than ~xi is to ~xk\u201d. We consider the parametrized family of distance metrics\n\ndA,W(~x, ~y) = sqrt((~x - ~y)^T A W A^T (~x - ~y))\n\nwhere A is a fixed matrix and W is a diagonal matrix with non-negative entries that is learned from the comparisons. Each comparison (i, j, k) \u2208 Ptrain requires dA,W(~xi, ~xj) < dA,W(~xi, ~xk), which is equivalent to\n\n\u2200(i, j, k) \u2208 Ptrain : (~xi - ~xk)^T A W A^T (~xi - ~xk) - (~xi - ~xj)^T A W A^T (~xi - ~xj) > 0 (5)\n\nIf the set of constraints is feasible and a W exists that fulfills all constraints, the solution is typically not unique. We aim to select a matrix A W A^T such that dA,W(~x, ~y) remains as close to an unweighted Euclidean metric as possible. Following [8], we minimize the norm of the eigenvalues ||\u03bb||2 of A W A^T. Since ||\u03bb||2^2 = ||A W A^T||F^2, this leads to the following optimization problem.\n\nmin (1/2) ||A W A^T||F^2\ns.t. \u2200(i, j, k) \u2208 Ptrain : (~xi - ~xk)^T A W A^T (~xi - ~xk) - (~xi - ~xj)^T A W A^T (~xi - ~xj) \u2265 1\nWii \u2265 0\n\nUnlike in [8], this formulation ensures that dA,W(~x, ~y) is a metric, avoiding the need for semi-definite programming like in [11].\n\nAs in classification SVMs, we add slack variables [3] to account for constraints that cannot be satisfied. This leads to the following optimization problem.\n\nmin (1/2) ||A W A^T||F^2 + C \u03a3ijk \u03beijk\ns.t. \u2200(i, j, k) \u2208 Ptrain : (~xi - ~xk)^T A W A^T (~xi - ~xk) - (~xi - ~xj)^T A W A^T (~xi - ~xj) \u2265 1 - \u03beijk\n\u03beijk \u2265 0\nWii \u2265 0\n\nThe sum of the slack variables \u03beijk in the objective is an upper bound on the number of violated constraints.\n\nAll distances dA,W(~x, ~y) can be written in the following linear form. If we let ~w be the diagonal elements of W, then the distance dA,W can be written as\n\ndA,W(~x, ~y) = sqrt(((~x - ~y)^T A) W (A^T (~x - ~y))) = sqrt(~w^T ((A^T ~x - A^T ~y) * (A^T ~x - A^T ~y))) (6)\n\nwhere * denotes the element-wise product. If we let ~\u0394xi,xj = (A^T ~xi - A^T ~xj) * (A^T ~xi - A^T ~xj), then the constraints in the optimization problem can be rewritten in the following linear form.\n\n\u2200(i, j, k) \u2208 Ptrain : ~w^T (~\u0394xi,xk - ~\u0394xi,xj) \u2265 1 - \u03beijk (7)\n\nFigure 1: Graphical example of using different A matrices.\n
In example 1, A is the identity matrix and in example 2 A is composed of the training examples projected into high dimensional space using an RBF kernel.\n\nFurthermore, the objective function is quadratic, so that the optimization problem can be written as\n\nmin (1/2) ~w^T L ~w + C \u03a3ijk \u03beijk\ns.t. \u2200(i, j, k) \u2208 Ptrain : ~w^T (~\u0394xi,xk - ~\u0394xi,xj) \u2265 1 - \u03beijk\n\u03beijk \u2265 0\nWii \u2265 0 (8)\n\nFor the case of A = I, ||A W A^T||F^2 = ~w^T L ~w with L = I. For the case of A = \u03a6, we define L = (A^T A) * (A^T A) so that ||A W A^T||F^2 = ~w^T L ~w. Note that L is positive semi-definite in both cases and that, therefore, the optimization problem is convex quadratic.\n\n5 Experiments\n\nIn Figure 1, we display a graphical example of our method. Example 1 is an example of a weighted Euclidean distance. The input data points are shown in 1a) and our training constraints specify that the distance between two square points should be less than the distance to a circle. Similarly, circles should be closer to each other than to squares. Figure 1 (1b) shows the points after an MDS analysis with the learned distance metric as input. This learned distance metric intuitively corresponds to stretching the x-axis and shrinking the y-axis in the original input space.\n\nExample 2 in Figure 1 is an example where we have a similar goal of grouping the squares together and separating them from the circles. In this example, though, there is no way to use a linear weighting measure to accomplish this task. We used an RBF kernel and learned a distance metric to separate the clusters. The result is shown in 2b).\n\nTo validate the method using a real world example, we ran several experiments on the WEBKB data set [5].\n
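For A = I (so L = I), problem (8) is a convex quadratic program over the weight vector ~w with non-negativity constraints. The following Python sketch is an illustration only: it solves the equivalent hinge-loss objective by projected subgradient descent rather than by the paper's adapted SVM training, and the function names, learning rate and epoch count are our own assumptions.

```python
import numpy as np

def delta(x, y):
    # Element-wise squared difference: the Delta vector of Eq. (7) for A = I.
    d = x - y
    return d * d

def learn_weights(X, triples, C=1.0, lr=1e-3, epochs=200):
    # Sketch of problem (8) with A = I (so L = I), solved by projected
    # subgradient descent rather than a QP solver. Each (i, j, k) in
    # triples means: x_i is closer to x_j than x_i is to x_k.
    n, dim = X.shape
    w = np.ones(dim)
    feats = [delta(X[i], X[k]) - delta(X[i], X[j]) for (i, j, k) in triples]
    for _ in range(epochs):
        g = w.copy()                     # gradient of the regularizer (1/2) w^T w
        for f in feats:
            if w @ f < 1.0:              # margin w^T (Delta_ik - Delta_ij) >= 1 violated
                g -= C * f               # hinge-loss subgradient
        w = np.maximum(w - lr * g, 0.0)  # gradient step, then project onto w >= 0
    return w

def dist(w, x, y):
    # Learned weighted Euclidean distance: sqrt(w^T ((x - y) * (x - y))).
    return np.sqrt(w @ delta(x, y))
```

On toy data this drives the weights of dimensions that contradict the comparisons toward zero, mirroring the stretching and shrinking of axes seen in Example 1.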
In order to illustrate the versatility of relative comparisons, we generated three different distance metrics from the same data set and ran three types of tests: an accuracy test, a learning curve to show how the method generalizes from differing amounts of training data, and an MDS test to graphically illustrate the new distance measures.\n\nThe experimental setup for each of the experiments was the same. We first split X, the set of all 4,183 documents, into separate training and test sets, Xtrain and Xtest: 70% of the examples in X were added to Xtrain and the remaining 30% to Xtest. We used a binary feature vector without stemming or stop word removal (63,949 features) to represent each document, because it is the least biased distance metric to start out with. It also performed best among several different variations of term weighting, stemming and stopword removal. The relative comparison sets, Ptrain and Ptest, were generated as follows. We present results for learning three different notions of distance.\n\n- University Distance: This distance is small when the two examples, ~x and ~y, are from the same university and larger otherwise. For this data set we used webpages from seven universities.\n\n- Topic Distance: This distance metric is small when the two examples, ~x and ~y, are from the same topic (e.g. both are student webpages) and larger when they are each from a different topic. There are four topics: Student, Faculty, Course and Project webpages.\n\n- Topic+FacultyStudent Distance: Again, when two examples, ~x and ~y, are from the same topic they have a small distance between them, and a larger distance when they come from different topics. However, we add the additional constraint that the distance between a faculty and a student page is smaller than the distance to pages from other topics.\n\nTo build the training constraints, Ptrain, we first randomly selected three documents, xi, xj, xk, from Xtrain.\n
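This triple-generation procedure can be illustrated for the Topic Distance case as follows; the rejection-sampling loop and the function name are our own sketch, not the authors' code, and documents are reduced here to a topic label:

```python
import random

def sample_triples(labels, n_triples, seed=0):
    # Sketch of the Topic Distance rule: draw three documents at random and
    # keep the triple (i, j, k) whenever documents i and j share a topic
    # label while document k has a different one.
    rng = random.Random(seed)
    n = len(labels)
    triples = set()
    while len(triples) < n_triples:
        i, j, k = rng.sample(range(n), 3)
        if labels[i] == labels[j] and labels[i] != labels[k]:
            triples.add((i, j, k))
    return sorted(triples)
```

Each accepted triple (i, j, k) then contributes one constraint d(xi, xj) < d(xi, xk) to the training problem.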
For the University Distance we added the triple (i, j, k) to Ptrain if xi and xj were from the same university and xk was from a different university. In building Ptrain for the Topic Distance we added the triple (i, j, k) to Ptrain if xi and xj were from the same topic (e.g. \u201cStudent Webpages\u201d) and xk was from a different topic (e.g. \u201cProject Webpages\u201d). For the Topic+FacultyStudent Distance, the training triple (i, j, k) was added to Ptrain either if the topic rule applied, i.e. xi and xj were from the same topic and xk was from a different topic, or if xi was a faculty webpage, xj was a student webpage and xk was either a project or course webpage. Thus the constraints specify that a faculty webpage is closer to a student webpage than a faculty webpage is to a project or course webpage.\n\n                               Learned d~w(\u00b7, \u00b7)   Binary   TFIDF\nUniversity Distance                 98.43%         67.88%   80.72%\nTopic Distance                      75.40%         61.82%   55.57%\nTopic+FacultyStudent Distance       79.67%         63.08%   55.06%\n\nTable 1: Accuracy of different distance metrics on an unseen test set Ptest.\n\nThe results of the learned distance measures on unseen test sets Ptest are reported in Table 1. In each experiment the regularization parameter C was set to 1 and we used A = I. We report the percentage of the relative comparisons in Ptest that were satisfied for each of the three experiments. As a baseline for comparison, we give the results for the static (not learned) distance metric that performs best on the test set. The best performing metric among all static Euclidean distances (Binary and TFIDF) used stemming and stopword removal, which our learned distance did not use. The learned University Distance satisfied 98.43% of the constraints. This verifies that the learning method can effectively find the relevant features, since pages usually mentioned which university they were from.\n
For the other distances, both the Topic Distance and Topic+FacultyStudent Distance satisfied more than 13% more of the constraints in Ptest than the best unweighted distance. Using a kernel instead of A = I did not yield improved results.\n\nFor the second test, we illustrate on the Topic+FacultyStudent data set how the prediction accuracy of the method scales with the number of training constraints. The learning curve is shown in Figure 2, where we plot the training set size (in number of constraints) against the percentage of test constraints satisfied. The test set Ptest was held constant and sampled in the same way as the training set (|Ptest| = 85,907). As Figure 2 illustrates, after the data set contained more than 150,000 constraints, the performance of the algorithm remained relatively constant.\n\nFigure 2: Learning curves for the Topic+FacultyStudent dataset: the x axis is the size of the training set Ptrain in thousands of constraints, and the y axis is the percent of constraints in Ptest that were satisfied (curves shown: Learned Distance, Binary L2, TFIDF L2).\n\nAs a final test of our method, we graphically display our distance metrics in Table 2. We plot three distance metrics: the standard binary distance (Figure a) for the Topic Distance, the learned metric for the Topic Distance (Figure b), and the learned metric for the Topic+FacultyStudent Distance (Figure c). To produce the plots in Table 2, all pairwise distances between the points in Xtest were computed and then projected into 2D using a classical, metric MDS algorithm [1].\n\nFigure a) in Table 2 is the result of using the pairwise distances resulting from the unweighted, binary L2 norm in MDS.\n
There is no clear distinction between any of the clusters in 2 dimensions. In Figure b) we see the results of the learned Topic Distance measure. The classes were reasonably separated from each other. Figure c) shows the result of using the learned Topic+FacultyStudent Distance metric. When compared to Figure b), the Faculty and Student webpages have now moved closer together, as desired.\n\n6 Related Work\n\nThe most relevant related work is the work of Xing et al. [11], which focused on the problem of learning a distance metric to increase the accuracy of nearest neighbor algorithms. Their work used absolute, qualitative feedback such as \u201cA is similar to B\u201d or \u201cA is dissimilar to B\u201d, which is different from the relative constraints considered here. Secondly, their method does not use regularization.\n\nAlso related are techniques for semi-supervised clustering, as considered in [11]. While [10] does not change the distance metric, [2] uses gradient descent to adapt a parameterized distance metric according to user feedback.\n\nOther related work includes dimensionality reduction techniques such as Multidimensional Scaling (MDS) [4] and Latent Semantic Indexing [6]. Metric MDS techniques take as input a matrix D of dissimilarities (or similarities) between all points in some collection and then seek to arrange the points in a d-dimensional space to minimize the stress. The stress of the arrangement is roughly the difference between the distances in the d-dimensional space and the distances input in matrix D. LSI uses an eigenvalue decomposition of the original input space to find the first d principal eigenvectors to describe the data in d dimensions. Our work differs because the input is a set of relative comparisons, not quantitative distances, and because we do not project the data into a lower dimensional space. Non-metric MDS is more similar to our technique than metric MDS.\n
Instead of preserving the exact distances input, non-metric MDS seeks to maintain the rank order of the distances. However, the goal of our method is not a low dimensional projection, but a new distance metric in the original space.\n\n7 Conclusion and Future Work\n\nIn this paper we presented a method for learning a weighted Euclidean distance from relative constraints. This was accomplished by solving a convex optimization problem, similar to SVMs, to find the maximum margin weight vector. One of the main benefits of the algorithm is that this new type of constraint enables its use in a wider range of applications than conventional methods. We evaluated the method on a collection of high dimensional text documents and showed that it can successfully learn different notions of distance.\n\nFuture work is needed both with respect to theory and application. In particular, we do not yet know generalization error bounds for this problem. Furthermore, the power of the method would be increased if it were possible to learn more complex metrics that go beyond feature weighting, for example by incorporating kernels in a more adaptive way.\n\nReferences\n\n[1] A. Buja, D. Swayne, M. Littman, and N. Dean. Xgvis: Interactive data visualization with multidimensional scaling. Journal of Computational and Graphical Statistics, to appear.\n\n[2] D. Cohn, R. Caruana, and A. McCallum. Semi-supervised clustering with user feedback. Technical Report TR2003-1892, Cornell University, 2003.\n\n[3] Corinna Cortes and Vladimir Vapnik. Support-vector networks. Machine Learning, 20(3):273\u2013297, 1995.\n\n[4] T. Cox and M. Cox. Multidimensional Scaling. Chapman & Hall, London, 1994.\n\n[5] M. Craven, D. DiPasquo, D. Freitag, A. McCallum, T. Mitchell, K. Nigam, and S. Slattery. Learning to extract symbolic knowledge from the world wide web. Proceedings of the 15th National Conference on Artificial Intelligence (AAAI-98), 1998.\n\n[6] Scott C.\n
Deerwester, Susan T. Dumais, Thomas K. Landauer, George W. Furnas, and Richard A. Harshman. Indexing by latent semantic analysis. Journal of the American Society for Information Science, 41(6):391\u2013407, 1990.\n\n[7] T. Joachims. Optimizing search engines using clickthrough data. Proceedings of the ACM Conference on Knowledge Discovery and Data Mining (KDD), 2002.\n\n[8] I.W. Tsang and J.T. Kwok. Distance metric learning with kernels. Proceedings of the International Conference on Artificial Neural Networks, 2003.\n\n[9] V. Vapnik. Statistical Learning Theory. Wiley, Chichester, GB, 1998.\n\n[10] Kiri Wagstaff, Claire Cardie, Seth Rogers, and Stefan Schroedl. Constrained K-means clustering with background knowledge. In Proc. 18th International Conf. on Machine Learning, pages 577\u2013584. Morgan Kaufmann, San Francisco, CA, 2001.\n\n[11] E.P. Xing, A.Y. Ng, M.I. Jordan, and S. Russell. Distance metric learning, with application to clustering with side information. Advances in Neural Information Processing Systems, 2002.\n\nTable 2: MDS plots of distance functions: a) is the unweighted L2 distance, b) is the Topic Distance, and c) is the Topic+FacultyStudent distance. Each panel plots the Course, Project, Student and Faculty pages.\n", "award": [], "sourceid": 2366, "authors": [{"given_name": "Matthew", "family_name": "Schultz", "institution": null}, {"given_name": "Thorsten", "family_name": "Joachims", "institution": null}]}