{"title": "Kernel functions based on triplet comparisons", "book": "Advances in Neural Information Processing Systems", "page_first": 6807, "page_last": 6817, "abstract": "Given only information in the form of similarity triplets \"Object A is more similar to object B than to object C\" about a data set, we propose two ways of defining a kernel function on the data set. While previous approaches construct a low-dimensional Euclidean embedding of the data set that reflects the given similarity triplets, we aim at defining kernel functions that correspond to high-dimensional embeddings. These kernel functions can subsequently be used to apply any kernel method to the data set.", "full_text": "Kernel functions based on triplet comparisons\n\nMatth\u00e4us Kleindessner\u21e4\n\nDepartment of Computer Science\n\nRutgers University\nPiscataway, NJ 08854\n\nmk1572@cs.rutgers.edu\n\nUlrike von Luxburg\n\nDepartment of Computer Science\n\nUniversity of T\u00fcbingen\n\nMax Planck Institute for Intelligent Systems, T\u00fcbingen\n\nluxburg@informatik.uni-tuebingen.de\n\nAbstract\n\nGiven only information in the form of similarity triplets \u201cObject A is more similar\nto object B than to object C\u201d about a data set, we propose two ways of de\ufb01ning\na kernel function on the data set. While previous approaches construct a low-\ndimensional Euclidean embedding of the data set that re\ufb02ects the given similarity\ntriplets, we aim at de\ufb01ning kernel functions that correspond to high-dimensional\nembeddings. These kernel functions can subsequently be used to apply any kernel\nmethod to the data set.\n\n1\n\nIntroduction\n\nAssessing similarity between objects is an inherent part of many machine learning problems, be\nit in an unsupervised task like clustering, in which similar objects should be grouped together, or\nin classi\ufb01cation, where many algorithms are based on the assumption that similar inputs should\nproduce similar outputs. 
In a typical machine learning setting one assumes to be given a data set D of\nobjects together with a dissimilarity function d (or, equivalently, a similarity function s) quantifying\nhow \u201cclose\u201d objects are to each other. In recent years, however, a new branch of the machine\nlearning literature has emerged that relaxes this scenario (see the next paragraph and Section 3 for\nreferences). Instead of being able to evaluate d itself, we only get to see a collection of similarity\ntriplets of the form \u201cObject A is more similar to object B than to object C\u201d, which claims that\nd(A, B) < d(A, C). The main motivation for this relaxation comes from human-based computation:\nIt is widely accepted that humans are better and more reliable at providing similarity triplets, which\nmeans assessing similarity on a relative scale, than at providing similarity estimates on an absolute\nscale (\u201cThe similarity between objects A and B is 0.8\u201d). This can be seen as a special case of the\ngeneral observation that humans are better at comparing two stimuli than at identifying a single one\n(Stewart et al., 2005). For this reason, whenever one is lacking a meaningful dissimilarity function\nthat can be evaluated automatically and has to incorporate human expertise into the machine learning\nprocess, collecting similarity triplets (e.g., via crowdsourcing) may be an appropriate means.\nGiven a data set D and similarity triplets for its objects, it is not immediately clear how to solve\nmachine learning problems on D. A general approach is to construct an ordinal embedding of D, that\nis to map objects to a Euclidean space of a small dimension such that the given triplets are preserved\nas well as possible (Agarwal et al., 2007; Tamuz et al., 2011; van der Maaten and Weinberger, 2012;\nTerada and von Luxburg, 2014; Amid and Ukkonen, 2015; Heim et al., 2015; Amid et al., 2016; Jain\net al., 2016). 
Once such an ordinal embedding has been constructed, one can solve a problem on D by solving it on the embedding. Only recently, algorithms have been proposed for solving various specific problems directly, without constructing an ordinal embedding as an intermediate step (Heikinheimo and Ukkonen, 2013; Kleindessner and von Luxburg, 2017). With this paper we provide another generic means for solving machine learning problems based on similarity triplets that is different from the ordinal embedding approach. We define two data-dependent kernel functions on D, corresponding to high-dimensional embeddings of D, that can subsequently be used by any kernel method. Our proposed kernel functions measure similarity between two objects in D by comparing to what extent the two objects give rise to resembling similarity triplets. The intuition is that this quantifies the relative difference in the locations of the two objects in D. Experiments on both artificial and real data show that this is indeed the case and that the similarity scores defined by our kernel functions are meaningful. Our approach is appealingly simple, and unlike ordinal embedding algorithms, our kernel functions are deterministic and parameter-free. We observe them to run significantly faster than well-known embedding algorithms and to be ideally suited for a landmark design.

*Work done while a PhD student at the University of Tübingen.

31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.

Setup Let X be an arbitrary set and d : X × X → R+_0 be a symmetric dissimilarity function on X: a higher value of d means that two elements of X are more dissimilar to each other. The terms dissimilarity and distance are used synonymously.
To simplify presentation, we assume that for all triples of distinct objects A, B, C ∈ X either d(A, B) < d(A, C) or d(A, B) > d(A, C) is true. Note that we do not require d to be a metric. We formally define a similarity triplet as a binary answer to a dissimilarity comparison

d(A, B) ?< d(A, C).   (1)

We refer to A as the anchor object. A similarity triplet can be incorrect, meaning that it claims a positive answer to the comparison (1) although in fact the negative answer is true. In the following, we deal with a finite data set D = {x1, . . . , xn} ⊂ X and collections of similarity triplets that are encoded as follows: an ordered triple of distinct objects (xi, xj, xk) means d(xi, xj) < d(xi, xk). A collection of similarity triplets is the only information that we are given about D. Note that such a collection does not necessarily provide an answer to every possible dissimilarity comparison (1).

2 Our kernel functions

Assume we are given a collection S of similarity triplets for the objects of D. Similarity triplets in S can be incorrect, but for the moment assume that contradicting triples (xi, xj, xk) and (xi, xk, xj) cannot be present in S at the same time. We will discuss how to deal with the general case below.

Kernel function k1 Our first kernel function is based on the following idea: We fix two objects xa and xb. In order to compute a similarity score between xa and xb we would like to rank all objects in D with respect to their distance from xa and also rank them with respect to their distance from xb, and take a similarity score between these two rankings as the similarity score between xa and xb.
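The ranking idea just described can be sketched in a toy, full-information setting (an illustrative sketch, not the paper's implementation: it assumes the dissimilarity d can still be evaluated directly, and it uses Kendall's tau rank correlation, introduced formally next, as the score between the two rankings; all function names are hypothetical):

```python
# Toy sketch of the idea behind k1, assuming full access to d (no triplets yet).

def rank_by_distance(anchor, points, d):
    """Return the points sorted by their distance from the anchor."""
    return sorted(points, key=lambda x: d(anchor, x))

def kendall_tau(rank1, rank2):
    """Kendall's tau: fraction of concordant minus fraction of discordant pairs."""
    pos1 = {item: i for i, item in enumerate(rank1)}
    pos2 = {item: i for i, item in enumerate(rank2)}
    items = list(pos1)
    n = len(items)
    concordant = discordant = 0
    for a in range(n):
        for b in range(a + 1, n):
            i1, i2 = items[a], items[b]
            s1 = pos1[i1] - pos1[i2]
            s2 = pos2[i1] - pos2[i2]
            if s1 * s2 > 0:     # same relative order in both rankings
                concordant += 1
            else:               # opposite relative order (no ties by assumption)
                discordant += 1
    return (concordant - discordant) / (n * (n - 1) / 2)

# 1-d example: similarity score between the objects at 0.0 and 1.0
points = [0.0, 0.3, 1.0, 2.1, 5.0]
d = lambda x, y: abs(x - y)
score = kendall_tau(rank_by_distance(0.0, points, d),
                    rank_by_distance(1.0, points, d))
```

In this 1-d example the two rankings agree on most pairs, so `score` is positive but smaller than 1.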
One possibility to measure similarity between rankings is given by the famous Kendall tau correlation coefficient (Kendall, 1938), which is also known as Kendall's τ: for two rankings of n items, Kendall's τ between the two rankings is the fraction of concordant pairs of items minus the fraction of discordant pairs of items. Here, a pair of two items i1 and i2 is concordant if i1 ≺ i2 or i2 ≺ i1 according to both rankings, and discordant if it satisfies i1 ≺ i2 according to one and i2 ≺ i1 according to the other ranking. Formally, a ranking is represented by a permutation π : {1, . . . , n} → {1, . . . , n} such that π(i) ≠ π(j) whenever i ≠ j, and π(i) = m means that item i is ranked at the m-th position. Given two rankings π1 and π2, the number of concordant pairs equals

fc(π1, π2) = Σ_{i<j} [ 1{π1(i) < π1(j)} · 1{π2(i) < π2(j)} + 1{π1(i) > π1(j)} · 1{π2(i) > π2(j)} ],

and the number of discordant pairs equals

fd(π1, π2) = Σ_{i<j} [ 1{π1(i) < π1(j)} · 1{π2(i) > π2(j)} + 1{π1(i) > π1(j)} · 1{π2(i) < π2(j)} ].

Figure 1: Illustrations of the ideas behind k1 (left) and k2 (right). For k1: In order to compute a similarity score between x1 (in red) and x2 (in blue) we would like to rank all objects with respect to their distance from x1 and also with respect to their distance from x2 and compute Kendall's τ between the two rankings. In this example, the objects would rank as x1 ≺ x3 ≺ x2 ≺ x4 ≺ x5 ≺ x6 ≺ x7 and x2 ≺ x3 ≺ x6 ≺ x1 ≺ x5 ≺ x4 ≺ x7, respectively. Kendall's τ between these two rankings is 1/3, and this would be the similarity score between x1 and x2. For comparison, the score between x1 and x7 (in green) would be 5/7, and between x2 and x7 it would be 3/7. For k2: In order to compute a similarity score between x1 and x2 we would like to check for every pair of objects (xi, xj) whether the distance comparisons d(xi, x1) ?< d(xi, xj) and d(xi, x2) ?< d(xi, xj) yield the same result or not.
Here, we have 32 pairs for which they yield the same result and 17 pairs for which they do not. We would assign 7^{-2} · (32 − 17) = 15/49 as similarity score between x1 and x2. The score between x1 and x7 would be 3/49, and between x2 and x7 it would be 1/49.

This defines a kernel function on D since the following holds: for any mapping h : D → Z and kernel function k : Z × Z → R, the composition k ∘ (h, h) : D × D → R is a kernel function.

In our situation, the problem is that in most cases S will contain only a small fraction of all possible similarity triplets and also that some of the triplets in S might be incorrect, so that there is no way of ranking all objects with respect to their distance from any fixed object based on the similarity triplets in S. To adapt the procedure, we consider a feature map that corresponds to the kernel function just described. By a feature map corresponding to a kernel function k : D × D → R we mean a mapping Φ : D → R^m for some m ∈ N such that k(xi, xj) = ⟨Φ(xi), Φ(xj)⟩ = Φ(xi)^T · Φ(xj). It is easy to see from the above formulas (also compare with Jiao and Vert, 2015) that a feature map corresponding to the described kernel function is given by Φ_kτ : D → R^{C(n,2)} with

Φ_kτ(xa) = C(n,2)^{-1/2} · ( 1{d(xa, xi) < d(xa, xj)} − 1{d(xa, xi) > d(xa, xj)} )_{1 ≤ i < j ≤ n}.

d(x2, xn) > d(x1, xn) > d(x2, x3) > d(x1, x3) > 1 be arbitrarily large. Although x1 and x2 are located at the maximum distance to each other, they satisfy d(x1, xi) < d(x1, xj) and d(x2, xi) < d(x2, xj) for all 3 ≤ i < j ≤ n, and hence both x1 and x2 are jointly located in all the halfspaces obtained from these distance comparisons. We end up with k1(x1, x2) → 1 as n → ∞,
assuming k1 is computed based on all possible similarity triplets, all of which are correct. The distance between x3 and xn is much smaller, but there are many points in between them and the hyperplanes obtained from the distance comparisons with these points separate x3 and xn. We end up with k1(x3, xn) → −1 as n → ∞. Depending on the task at hand, this may be desirable or not.

Let us examine the meaningfulness of our kernel functions by calculating them on five visualizable data sets. Each of the first four data sets consists of 400 points in R² and d equals the Euclidean metric. The fifth data set consists of 400 vertices of an undirected graph from a stochastic block model and d equals the shortest path distance. We computed k1 and k2 based on 10% of all possible similarity triplets (chosen uniformly at random from all triplets). The results for the first two data sets are shown in Figure 3. The results for the remaining data sets are shown in Figure 6 in Section A.1 in the supplementary material. The first plot of a row shows the data set. The second plot shows the distance matrix on the data set. Next, we can see the kernel matrices. The last plot of a row shows the similarity scores (encoded by color) based on k1 between one fixed point (shown as a black cross) and the other points in the data set. Clearly, the kernel matrices reflect the block structures of the distance matrices, and the similarity scores between a fixed point and the other points tend to decrease as the distances to the fixed point increase. A situation like in the example of Figure 2 does not occur.

2.3 Landmark design

Our kernel functions are designed to extract information from an arbitrary collection S of similarity triplets.
However, by construction, a single triplet is useless, and what matters is the concurrent presence of two triplets: k1(xa, xb) is only affected by pairs of triplets answering d(xa, xi) ?< d(xa, xj) and d(xb, xi) ?< d(xb, xj), while k2(xa, xb) is only affected by pairs of triplets answering (4). Hence, when we can choose which dissimilarity comparisons of the form (1) are evaluated for creating S (e.g., in crowdsourcing), we should aim at maximizing the number of appropriate pairs of triplets. This can easily be achieved by means of a landmark design inspired by landmark multidimensional scaling (de Silva and Tenenbaum, 2004): We choose a small subset of landmark objects L ⊂ D. Then, for k1, only comparisons of the form d(xi, xj) ?< d(xi, xk) with xi ∈ D and xj, xk ∈ L are evaluated. For k2, only comparisons of the form d(xj, xi) ?< d(xj, xk) with xi ∈ D and xj, xk ∈ L are evaluated. The landmark objects can be chosen either randomly or, if available, based on additional knowledge about D and the task at hand.

2.4 Computational complexity

General S A naive implementation of our kernel functions explicitly computes the feature vectors Φk1(xi) or Φk2(xi), i = 1, . . . , n, and subsequently calculates the kernel matrix K by means of (3) or (5). In doing so, we store the feature vectors in the feature matrix Φk1(D) = (Φk1(xi))_{i=1}^n ∈ R^{C(n,2) × n} or Φk2(D) = (Φk2(xi))_{i=1}^n ∈ R^{n² × n}. Proceeding this way is straightforward and simple, requiring to go through S only once, but comes with a computational cost of O(|S| + n⁴) operations. Note that the number of different distance comparisons of the form (1) is O(n³) and hence one might expect that |S| ∈ O(n³) and O(|S| + n⁴) = O(n⁴).
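The naive procedure just described can be sketched as follows for k1 (a minimal sketch, not the authors' code: triplets are assumed to be given as ordered index triples (a, i, j) meaning d(x_a, x_i) < d(x_a, x_j), as in Section 1; normalization constants are omitted and contradicting triplets simply cancel; all names are hypothetical):

```python
# Sketch of the naive k1 computation from a triplet collection S.
from collections import defaultdict

def feature_vectors_k1(S, n):
    """One sparse feature vector per object, indexed by unordered pairs {i, j}."""
    phi = [defaultdict(int) for _ in range(n)]
    for a, i, j in S:
        key = (min(i, j), max(i, j))
        # +1 if the triplet orients the pair as (smaller, larger), else -1;
        # a contradicting triplet contributes the opposite sign and cancels.
        phi[a][key] += 1 if i < j else -1
    return phi

def kernel_matrix_k1(S, n):
    """K[a][b] = sparse dot product of the feature vectors of x_a and x_b."""
    phi = feature_vectors_k1(S, n)
    K = [[0] * n for _ in range(n)]
    for a in range(n):
        for b in range(n):
            small, large = (phi[a], phi[b]) if len(phi[a]) <= len(phi[b]) else (phi[b], phi[a])
            K[a][b] = sum(v * large.get(k, 0) for k, v in small.items())
    return K
```

Storing each feature vector as a dictionary means the dot products only touch the O(|S|) non-zero entries, which is the sparsity that makes the naive approach workable when |S| is small.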
By performing (3) or (5) in terms of the matrix multiplication Φk1(D)^T · Φk1(D) or Φk2(D)^T · Φk2(D) and applying Strassen's algorithm (Higham, 1990), one can reduce the number of operations to O(|S| + n^3.81), but still this is infeasible for many data sets. Infeasibility for large data sets, however, is even more of an issue for ordinal embedding algorithms, which are the current state-of-the-art method for solving machine learning problems based on similarity triplets. All existing ordinal embedding algorithms iteratively solve an optimization problem. For none of these algorithms are theoretical bounds on their complexity available in the literature, but it is widely known that their running times are prohibitively high (Heim et al., 2015; Kleindessner and von Luxburg, 2017).

Landmark design If we know that S contains only dissimilarity comparisons involving landmark objects, we can adapt the feature matrices such that Φk1(D) ∈ R^{C(|L|,2) × n} or Φk2(D) ∈ R^{|L|² × n} and reduce the number of operations to O(|S| + min{|L|², n}^{log2(7/8)} |L|² n²), which is O(|S| + |L|^1.62 n²) if |L|² ≤ n. Note that in this case we might expect that |S| ∈ O(|L|² n).

In both cases, whenever the number of given similarity triplets |S| is small compared to the number of all different distance comparisons under consideration, the feature matrix Φk1(D) or Φk2(D) is sparse with only O(|S|) non-zero entries, and methods for sparse matrix multiplication decrease the computational complexity (Gustavson, 1978; Kaplan et al., 2006).

3 Related work

Similarity triplets are a special case of answers to the general dissimilarity comparisons d(A, B) ?< d(C, D), A, B, C, D ∈ X. We refer to any collection of answers to these general comparisons as ordinal data. In recent years, ordinal data has become popular in machine learning.
Among the\nwork on ordinal data in general (see Kleindessner and von Luxburg, 2014, 2017, for references),\nsimilarity triplets have been paid particular attention: Jamieson and Nowak (2011) deal with the\nquestion of how many similarity triplets are required for uniquely determining an ordinal embedding\nof Euclidean data. This work has been carried on and generalized by Jain et al. (2016). Algorithms\nfor constructing an ordinal embedding based on similarity triplets (but not on general ordinal data)\nare proposed in Tamuz et al. (2011), van der Maaten and Weinberger (2012), Amid et al. (2016), and\nJain et al. (2016). Heikinheimo and Ukkonen (2013) present a method for medoid estimation based\non statements \u201cObject A is the outlier within the triple of objects (A, B, C)\u201d, which correspond to\nthe two similarity triplets d(B, C) < d(B, A) and d(C, B) < d(C, A). Ukkonen et al. (2015) use\nthe same kind of statements for density estimation and Ukkonen (2017) uses them for clustering.\nWilber et al. (2014) examine how to minimize time and costs when collecting similarity triplets via\ncrowdsourcing. Producing a number of ordinal embeddings at the same time, each corresponding\nto a different dissimilarity function based on which a comparison (1) might have been evaluated, is\nstudied in Amid and Ukkonen (2015). In Heim et al. (2015), one of the algorithms by van der Maaten\nand Weinberger (2012) is adapted from the batch setting to an online setting, in which similarity\ntriplets are observed in a sequential way, using stochastic gradient descent. In Kleindessner and\nvon Luxburg (2017), we propose algorithms for medoid estimation, outlier detection, classi\ufb01cation,\nand clustering based on statements \u201cObject A is the most central object within (A, B, C)\u201d, which\ncomprise the two similarity triplets d(B, A) < d(B, C) and d(C, A) < d(C, B). Finally, Haghiri\net al. 
(2017) study the problem of efficient nearest neighbor search based on similarity triplets. There is also a number of papers that consider similarity triplets as side information to vector data (e.g., Schultz and Joachims, 2003; McFee and Lanckriet, 2011; Wilber et al., 2015).

Figure 4: Best viewed magnified on screen. Left: Clustering of the food data set. Part of the dendrogram obtained from complete-linkage clustering using k1. Right: Kernel PCA on the car data set based on the kernel function k2.

4 Experiments

We performed experiments that demonstrate the usefulness of our kernel functions. We first apply them to three small image data sets for which similarity triplets have been gathered via crowdsourcing. We then study them more systematically and compare them to an ordinal embedding approach in clustering tasks on subsets of USPS and MNIST digits using synthetically generated triplets.

4.1 Crowdsourced similarity triplets

In this section we present experiments on real crowdsourcing data that show that our kernel functions can capture the structure of a data set. Note that for the following data sets there is no ground truth available and hence there is no way other than visual inspection for evaluating our results.

Food data set We applied the kernelized version of complete-linkage clustering based on our kernel function k1 to the food data set introduced in Wilber et al. (2014). This data set consists of 100 images² of a wide range of foods and comes with 190376 (unique) similarity triplets, which contain 9349 pairs of contradicting triplets. Figure 4 (left) shows a part of the dendrogram that we obtained. Each of the ten clusters depicted there contains rather homogeneous images. For example, the fourth row only shows vegetables and salads, whereas the ninth row only shows fruits and the last row only shows desserts.
To give an impression of the accelerated running time of our approach compared to an ordinal embedding approach: computation of k1 or k2 on this data set took about 0.1 seconds, while computing an ordinal embedding using the GNMDS algorithm (Agarwal et al., 2007) took 18 seconds (embedding dimension equal to two; all computations performed in Matlab, see Section 4.2 for details; the embedding is shown in Figure 9 in Section A.1 in the supplementary material).

Car data set We applied kernel PCA (Schölkopf et al., 1999) based on our kernel function k2 to the car data set, which we have introduced in Kleindessner and von Luxburg (2017). It consists of 60 images of cars. For this data set we have collected statements of the kind "Object A is the most central object within (A, B, C)", meaning that d(B, A) < d(B, C) and d(C, A) < d(C, B), via crowdsourcing. We ended up with 13514 similarity triplets, of which 12502 were unique. The projection of the car data set onto the first two kernel principal components can be seen in Figure 4 (right). The result looks reasonable, with the cars arranged in groups of sports cars (top left), ordinary cars (middle right) and off-road/sport utility vehicles (bottom left). Also within these groups there is some reasonable structure. For example, the race-like sports cars are located near each other and close to the Formula One car, and the sport utility vehicles from German manufacturers are placed next to each other.

Nature data set We performed similar experiments on the nature data set introduced in Heikinheimo and Ukkonen (2013). The results are presented in Section A.2 in the supplementary material.

²According to Wilber et al., the data set contains copyrighted material under the educational fair use exemption to the U.S.
copyright law.

We would like to discuss a question raised by one of the reviewers: in our setup (see Section 1), we assume that similarity triplets are noisy evaluations of dissimilarity comparisons (1), where d is some fixed dissimilarity function. This leads to our (natural) way of dealing with contradicting similarity triplets as described in Section 2. In a different setup one could drop the dissimilarity function d and consider similarity triplets as elements of some binary relation on D × D that is not necessarily transitive or antisymmetric. In the latter setup it is not clear whether our way of dealing with contradicting triplets is the right thing to do. However, we believe that the experiments of this section show that our setup is valid in a wide range of scenarios and our approach works in practice.

4.2 Synthetically generated triplets

We studied our kernel functions with respect to the number of input similarity triplets that they require in order to produce a valuable solution in clustering tasks. We found that in the scenario of a general collection S of triplets our approach is highly superior to an ordinal embedding approach in terms of running time, but on most data sets it is inferior regarding the required number of triplets. The full benefit of our kernel functions emerges in a landmark design. There our approach can compete with an embedding approach in terms of the required number of triplets and is so much faster as to be easily applicable to large data sets to which ordinal embedding algorithms are not. In this section we want to demonstrate this claim. We studied k1 and k2 in a landmark design by applying kernel k-means clustering (Dhillon et al., 2001) to subsets of USPS and MNIST digits, respectively.
Collections S of similarity triplets were generated as follows: We chose a certain number of landmark objects uniformly at random from all objects of the data set under consideration. Choosing d as the Euclidean metric, we created answers to all possible distance comparisons with the landmark objects as explained in Section 2.3. Answers were incorrect with some probability 0 ≤ ep ≤ 1, independently of each other. From the set of all answers we chose the triplets in S uniformly at random without replacement. We compared our approach to an ordinal embedding approach with ordinary k-means clustering. We tried the GNMDS (Agarwal et al., 2007), the CKL (Tamuz et al., 2011), and the t-STE (van der Maaten and Weinberger, 2012) embedding algorithms in the Matlab implementation made available by van der Maaten and Weinberger (2012). In doing so, we set all parameters except the embedding dimension to the provided default values. The parameter µ of the CKL algorithm was set to 0.1 since we observed good results with this value. Note that in these unsupervised clustering tasks there is no immediate way of performing cross-validation for choosing parameters. We compared to the embedding algorithms in two scenarios: in one case they were provided the same triplets as input as our kernel functions; in the other case (denoted by the additional "rand" in the plots) they were provided the same number of triplets chosen uniformly at random with replacement from all possible triplets (no landmark design) and incorrect with the same probability ep. For further comparison, we considered ordinary k-means applied to the original point set and a random clustering. We always provided the correct number of clusters as input, and set the number of replicates in k-means and kernel k-means to five and the maximum number of iterations to 100.
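The generation procedure above can be sketched as follows (a sketch under the stated assumptions: points are coordinate tuples, d is the Euclidean metric, and each answer is flipped with probability ep; function names are hypothetical):

```python
# Sketch of generating noisy landmark triplets for the synthetic experiments.
import random

def euclidean(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5

def landmark_triplets(points, landmark_ids, ep, rng=random):
    """Answers to all comparisons d(x_i, x_j) ?< d(x_i, x_k) with landmarks x_j, x_k."""
    triplets = []
    for i, p in enumerate(points):
        for a in range(len(landmark_ids)):
            for b in range(a + 1, len(landmark_ids)):
                j, k = landmark_ids[a], landmark_ids[b]
                if i in (j, k):
                    continue
                if euclidean(p, points[k]) < euclidean(p, points[j]):
                    j, k = k, j     # orient so that d(x_i, x_j) < d(x_i, x_k)
                if rng.random() < ep:
                    j, k = k, j     # flip the answer with probability ep
                triplets.append((i, j, k))
    return triplets
```

Subsampling S uniformly without replacement from the returned list then corresponds to the procedure described above.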
For assessing the quality of a clustering we computed its purity (e.g., Manning et al., 2008), which measures the accordance with the known ground truth partitioning according to the digits' values. A high purity value indicates a good clustering. Note that the limitation on the scale of our experiments comes only from the running time of the embedding algorithms and not from our kernel functions. Still, in terms of the number of data points our experiments are comparable or actually even superior to all the papers on ordinal embedding cited in Section 3. In terms of the number of similarity triplets per data point, we used comparable numbers of triplets.

USPS digits We chose 1000 points uniformly at random from the subset of USPS digits 1, 2, and 3. Using 15 landmark objects, we studied the performance of our approach and the ordinal embedding approach as a function of the number of input triplets. The first and the second row of Figure 5 show the results (averages over 10 runs of an experiment) for k1. The results for k2 are shown in Figure 7 in Section A.1 in the supplementary material. The first two plots of a row show the purity values of the various clusterings for ep = 0 and ep = 0.3, respectively. The third and the fourth plot show the corresponding time (in seconds) that it took to compute our kernel function or an ordinal embedding. We set the embedding dimension to 2 (1st row) or 10 (2nd row). Based on the achieved purity values no method can be considered superior. Our kernel function k2 performs slightly worse than k1 and the ordinal embedding algorithms. The GNMDS algorithm apparently cannot deal with the landmark triplets at all and yields the same purity values as a random clustering when provided with them. Our approach is highly superior regarding running time.
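The purity measure used above can be sketched in a few lines (each cluster votes for its majority ground-truth class; purity is the fraction of points covered by these votes; the function name is hypothetical):

```python
# Sketch of the purity computation for evaluating a clustering.
from collections import Counter

def purity(cluster_labels, true_labels):
    """Fraction of points whose cluster's majority class matches their true class."""
    clusters = {}
    for c, t in zip(cluster_labels, true_labels):
        clusters.setdefault(c, []).append(t)
    correct = sum(Counter(members).most_common(1)[0][1]
                  for members in clusters.values())
    return correct / len(true_labels)
```

A perfect clustering reaches purity 1, while assigning all points to one cluster yields the relative size of the largest class.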
The running times of the ordinal embedding algorithms depend on the embedding dimension and ep, and in these experiments the dependence is monotonic. All computations were performed in Matlab R2016a on a MacBook Pro with 2.9 GHz Intel Core i7 and 8 GB 1600 MHz DDR3. In order to make a fair comparison we did not use MEX files or sparse matrix operations in the implementation of our kernel functions.

Figure 5: 1st & 2nd row (USPS digits for k1): Clustering 1000 points from USPS digits 1, 2, and 3. Purity and running time as a function of the number of input triplets. 3rd row (MNIST digits): Clustering subsets of MNIST digits. Purity and running time as a function of the number of points.

MNIST digits We studied the performance of the various methods as a function of the size n of the data set with the number of input triplets growing linearly with n. For i = 1, . . . , 10, we chose n = i · 10³ points uniformly at random from the MNIST digits. We used 30 landmark objects and provided 150n input similarity triplets. The third row of Figure 5 shows the purity values of the various methods for k1 / k2 (1st / 2nd plot) and the corresponding running times (3rd / 4th plot) when ep = 0.15. The embedding dimension was set to 5. A spot check suggested that setting it to 2 would have given worse results, while setting it to 10 would have given similar results, but would have led to a higher running time. We computed the t-STE embedding only for n ≤ 6000 due to its high running time. It seems that GNMDS with random input triplets performs best, but for large values of n our kernel function k1 can compete with it. For 10000 points, computing k1 or k2 took 100 or 180 seconds, while even the fastest embedding algorithm ran for 2000 seconds.
For further comparison, Figure 8 in Section A.1 in the supplementary material shows a kernel PCA embedding based on k1 (150n landmark triplets) and a 2-dim GNMDS embedding (150n random triplets) of n = 20000 digits. Here, computation of k1 took 900 seconds, while GNMDS ran for more than 6000 seconds.

5 Conclusion

We proposed two data-dependent kernel functions that can be evaluated when given only an arbitrary collection of similarity triplets for a data set D. Our kernel functions can be used to apply any kernel method to D. Hence they provide a generic alternative to the standard ordinal embedding approach, which is based on numerical optimization, for machine learning with similarity triplets. In a number of experiments we demonstrated the meaningfulness of our kernel functions. A big advantage of our kernel functions compared to the ordinal embedding approach is that they run significantly faster. A drawback is that, in general, they seem to require a higher number of similarity triplets for capturing the structure of a data set. However, in a landmark design our kernel functions can compete with the ordinal embedding approach in terms of the required number of triplets.

Acknowledgements

This work has been supported by the Institutional Strategy of the University of Tübingen (DFG, ZUK 63).

References

S. Agarwal, J. Wills, L. Cayton, G. Lanckriet, D. Kriegman, and S. Belongie. Generalized non-metric multidimensional scaling. In International Conference on Artificial Intelligence and Statistics (AISTATS), 2007.

E. Amid and A. Ukkonen. Multiview triplet embedding: Learning attributes in multiple maps. In International Conference on Machine Learning (ICML), 2015.

E. Amid, N. Vlassis, and M. Warmuth. t-exponential triplet embedding. arXiv:1611.09957v1 [cs.AI], 2016.

V. de Silva and J. Tenenbaum. Sparse multidimensional scaling using landmark points.
Technical report, Stanford University, 2004.

I. Dhillon, Y. Guan, and B. Kulis. Kernel k-means, spectral clustering and normalized cuts. In International Conference on Knowledge Discovery and Data Mining (KDD), 2004.

D. Greene and P. Cunningham. Practical solutions to the problem of diagonal dominance in kernel document clustering. In International Conference on Machine Learning (ICML), 2006.

F. G. Gustavson. Two fast algorithms for sparse matrices: Multiplication and permuted transposition. ACM Transactions on Mathematical Software, 4(3):250–269, 1978.

S. Haghiri, U. von Luxburg, and D. Ghoshdastidar. Comparison based nearest neighbor search. In International Conference on Artificial Intelligence and Statistics (AISTATS), 2017.

H. Heikinheimo and A. Ukkonen. The crowd-median algorithm. In Conference on Human Computation and Crowdsourcing (HCOMP), 2013. Data available on http://www.anttiukkonen.com/.

E. Heim, M. Berger, L. M. Seversky, and M. Hauskrecht. Efficient online relative comparison kernel learning. In SIAM International Conference on Data Mining (SDM), 2015.

N. Higham. Exploiting fast matrix multiplication within the level 3 BLAS. ACM Transactions on Mathematical Software, 16(4):352–368, 1990.

L. Jain, K. Jamieson, and R. Nowak. Finite sample prediction and recovery bounds for ordinal embedding. In Neural Information Processing Systems (NIPS), 2016.

K. Jamieson and R. Nowak. Low-dimensional embedding using adaptively selected ordinal data. In Allerton Conference on Communication, Control, and Computing, 2011.

Y. Jiao and J.-P. Vert. The Kendall and Mallows kernels for permutations. In International Conference on Machine Learning (ICML), 2015.

H. Kaplan, M. Sharir, and E. Verbin. Colored intersection searching via sparse rectangular matrix multiplication. In Symposium on Computational Geometry (SoCG), 2006.

M. Kendall. A new measure of rank correlation.
Biometrika, 30(1–2):81–93, 1938.

M. Kleindessner and U. von Luxburg. Uniqueness of ordinal embedding. In Conference on Learning Theory (COLT), 2014.

M. Kleindessner and U. von Luxburg. Lens depth function and k-relative neighborhood graph: Versatile tools for ordinal data analysis. JMLR, 18(58):1–52, 2017. Data available on http://www.tml.cs.uni-tuebingen.de/team/luxburg/code_and_data/.

C. D. Manning, P. Raghavan, and H. Schütze. Introduction to Information Retrieval. Cambridge University Press, 2008.

B. McFee and G. Lanckriet. Learning multi-modal similarity. JMLR, 12:491–523, 2011.

B. Schölkopf, A. Smola, and K.-R. Müller. Kernel principal component analysis. In B. Schölkopf, C. Burges, and A. Smola, editors, Advances in Kernel Methods: Support Vector Learning, pages 327–352. MIT Press, 1999.

B. Schölkopf, J. Weston, E. Eskin, C. Leslie, and W. Noble. A kernel approach for learning from almost orthogonal patterns. In European Conference on Machine Learning (ECML), 2002.

M. Schultz and T. Joachims. Learning a distance metric from relative comparisons. In Neural Information Processing Systems (NIPS), 2003.

N. Stewart, G. D. A. Brown, and N. Chater. Absolute identification by relative judgment. Psychological Review, 112(4):881–911, 2005.

O. Tamuz, C. Liu, S. Belongie, O. Shamir, and A. Kalai. Adaptively learning the crowd kernel. In International Conference on Machine Learning (ICML), 2011.

Y. Terada and U. von Luxburg. Local ordinal embedding. In International Conference on Machine Learning (ICML), 2014.

A. Ukkonen. Crowdsourced correlation clustering with relative distance comparisons. In International Conference on Data Mining (ICDM), 2017.

A. Ukkonen, B. Derakhshan, and H. Heikinheimo. Crowdsourced nonparametric density estimation using relative distances.
In Conference on Human Computation and Crowdsourcing (HCOMP), 2015.

L. van der Maaten and K. Weinberger. Stochastic triplet embedding. In IEEE International Workshop on Machine Learning for Signal Processing (MLSP), 2012. Code available on http://homepage.tudelft.nl/19j49/ste/.

M. Wilber, I. Kwak, and S. Belongie. Cost-effective hits for relative similarity comparisons. In Conference on Human Computation and Crowdsourcing (HCOMP), 2014. Data available on http://vision.cornell.edu/se3/projects/cost-effective-hits/.

M. Wilber, I. Kwak, D. Kriegman, and S. Belongie. Learning concept embeddings with combined human-machine expertise. In International Conference on Computer Vision (ICCV), 2015.