{"title": "Supervised Word Mover's Distance", "book": "Advances in Neural Information Processing Systems", "page_first": 4862, "page_last": 4870, "abstract": "Accurately measuring the similarity between text documents lies at the core of many real world applications of machine learning. These include web-search ranking, document recommendation, multi-lingual document matching, and article categorization. Recently, a new document metric, the word mover's distance (WMD), has been proposed with unprecedented results on kNN-based document classification. The WMD elevates high quality word embeddings to document metrics by formulating the distance between two documents as an optimal transport problem between the embedded words. However, the document distances are entirely unsupervised and lack a mechanism to incorporate supervision when available. In this paper we propose an efficient technique to learn a supervised metric, which we call the Supervised WMD (S-WMD) metric. Our algorithm learns document distances that measure the underlying semantic differences between documents by leveraging semantic differences between individual words discovered during supervised training. This is achieved with an linear transformation of the underlying word embedding space and tailored word-specific weights, learned to minimize the stochastic leave-one-out nearest neighbor classification error on a per-document level. We evaluate our metric on eight real-world text classification tasks on which S-WMD consistently outperforms almost all of our 26 competitive baselines.", "full_text": "Supervised Word Mover\u2019s Distance\n\nGao Huang\u2217, Chuan Guo\u2217\n\nCornell University\n\n{gh349,cg563}@cornell.edu\n\nYu Sun, Kilian Q. Weinberger\n{ys646,kqw4}@cornell.edu\n\nCornell University\n\nMatt J. 
Kusner\u2020\n\nAlan Turing Institute, University of Warwick\n\nmkusner@turing.ac.uk\n\nFei Sha\n\nUniversity of California, Los Angeles\n\nfeisha@cs.ucla.edu\n\nAbstract\n\nRecently, a new document metric called the word mover\u2019s distance (WMD) has\nbeen proposed with unprecedented results on kNN-based document classi\ufb01cation.\nThe WMD elevates high-quality word embeddings to a document metric by for-\nmulating the distance between two documents as an optimal transport problem\nbetween the embedded words. However, the document distances are entirely un-\nsupervised and lack a mechanism to incorporate supervision when available. In\nthis paper we propose an ef\ufb01cient technique to learn a supervised metric, which\nwe call the Supervised-WMD (S-WMD) metric. The supervised training mini-\nmizes the stochastic leave-one-out nearest neighbor classi\ufb01cation error on a per-\ndocument level by updating an af\ufb01ne transformation of the underlying word em-\nbedding space and a word-imporance weight vector. As the gradient of the origi-\nnal WMD distance would result in an inef\ufb01cient nested optimization problem, we\nprovide an arbitrarily close approximation that results in a practical and ef\ufb01cient\nupdate rule. We evaluate S-WMD on eight real-world text classi\ufb01cation tasks on\nwhich it consistently outperforms almost all of our 26 competitive baselines.\n\n1\n\nIntroduction\n\nDocument distances are a key component of many text retrieval tasks such as web-search ranking\n[24], book recommendation [16], and news categorization [25]. Because of the variety of poten-\ntial applications, there has been a wealth of work towards developing accurate document distances\n[2, 4, 11, 27]. In large part, prior work focused on extracting meaningful document representations,\nstarting with the classical bag of words (BOW) and term frequency-inverse document frequency\n(TF-IDF) representations [30]. 
These sparse, high-dimensional representations are frequently nearly orthogonal [17], and a pair of similar documents may therefore have nearly the same distance as a pair that are very different. It is possible to design more meaningful representations by eigendecomposing the BOW space with Latent Semantic Indexing (LSI) [11], or by learning a probabilistic clustering of BOW vectors with Latent Dirichlet Allocation (LDA) [2]. Other work generalizes LDA [27] or uses denoising autoencoders [4] to learn a suitable document representation.

Recently, Kusner et al. [19] proposed the Word Mover's Distance (WMD), a new distance for text documents that leverages word embeddings [22]. Given these high-quality embeddings, the WMD defines the distance between two documents as the optimal transport cost of moving all words from one document to another within the word embedding space. This approach was shown to lead to state-of-the-art error rates in k-nearest neighbor (kNN) document classification.

*Authors contributed equally.
†This work was done while the author was a student at Washington University in St. Louis.

30th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain.

Importantly, these prior works are entirely unsupervised and not learned explicitly for any particular task. For example, text documents could be classified by topic or by author, which would lead to very different measures of dissimilarity.
Lately, there has been a vast amount of work on metric learning [10, 15, 36, 37], most of which focuses on learning a generalized linear Euclidean metric. These methods often scale quadratically with the input dimensionality, and can only be applied to high-dimensional text documents after dimensionality reduction techniques such as PCA [36].

In this paper we propose an algorithm for learning a metric to improve the Word Mover's Distance. WMD stands out from prior work in that it computes distances between documents without ever learning a new document representation. Instead, it leverages low-dimensional word representations, for example word2vec, to compute distances. This allows us to transform the word embedding instead of the documents, and to remain in a low-dimensional space throughout. At the same time we propose to learn word-specific 'importance' weights that emphasize the usefulness of certain words for distinguishing the document class.

At first glance, incorporating supervision into the WMD appears computationally prohibitive, as each individual WMD computation scales cubically with respect to the (sparse) dimensionality of the documents. However, we devise an efficient technique that exploits a relaxed version of the underlying optimal transport problem, called the Sinkhorn distance [6]. This, combined with a probabilistic filtering of the training set, reduces the computation time significantly.

Our metric learning algorithm, Supervised Word Mover's Distance (S-WMD), directly minimizes a stochastic version of the leave-one-out classification error under the WMD metric. Different from classic metric learning, we learn a linear transformation of the word representations while also learning re-weighted word frequencies. These transformations are learned to make the WMD distances match the semantic meaning of similarity encoded in the labels.
We show the superiority of our method across 8 datasets and against 26 baseline methods.

2 Background

Here we describe the word embedding technique we use (word2vec) and the recently introduced Word Mover's Distance. We then detail the setting of linear metric learning and the solution proposed by Neighborhood Components Analysis (NCA) [15], which inspires our method.

word2vec may be the most popular technique for learning a word embedding over billions of words and was introduced by Mikolov et al. [22]. Each word in the training corpus is associated with an initial word vector, which is then optimized so that if two words w_1 and w_2 frequently occur together, they have high conditional probability p(w_2 | w_1). This probability is the hierarchical softmax of the word vectors v_{w_1} and v_{w_2} [22], an easily computed quantity which allows a simplified neural language model (the word2vec model) to be trained efficiently on desktop computers. Training an embedding over billions of words allows word2vec to capture surprisingly accurate word relationships [23]. Word embeddings can learn hundreds of millions of parameters and are typically unsupervised by design, allowing them to be trained on large unlabeled text corpora ahead of time. Throughout this paper we use word2vec, although many other word embeddings could be used [5, 21].

Word Mover's Distance. Leveraging the compelling word vector relationships of word embeddings, Kusner et al. [19] introduced the Word Mover's Distance (WMD) as a distance between text documents. At a high level, the WMD is the minimum distance required to transport the words from one document to another. We assume that we are given a word embedding matrix X ∈ R^{d×n} for a vocabulary of n words. Let x_i ∈ R^d be the representation of the i-th word, as defined by this embedding.
Additionally, let d^a and d^b be the n-dimensional normalized bag-of-words (BOW) vectors of two documents, where d^a_i is the number of times word i occurs in document a (normalized over all words in that document). The WMD introduces an auxiliary 'transport' matrix T ∈ R^{n×n}, such that T_{ij} describes how much of d^a_i should be transported to d^b_j. Formally, the WMD learns T to minimize

    D(d^a, d^b) = min_{T ≥ 0} Σ_{i,j=1}^n T_{ij} ‖x_i − x_j‖_2^p
    subject to  Σ_{j=1}^n T_{ij} = d^a_i  and  Σ_{i=1}^n T_{ij} = d^b_j  ∀ i, j,    (1)

where p is usually set to 1 or 2. In this way, documents that share many words (or even related ones) should have smaller distances than documents with very dissimilar words. It was noted in Kusner et al. [19] that the WMD is a special case of the Earth Mover's Distance (EMD) [29], also known more generally as the Wasserstein distance [20]. The authors also introduce the word centroid distance (WCD), which uses a fast approximation first described by Rubner et al. [29]: ‖Xd − Xd′‖_2. It can be shown that the WCD always lower bounds the WMD. Intuitively, the WCD represents each document by its weighted average word vector, where the weights are the normalized BOW counts. The time complexity of solving the WMD optimization problem is O(q^3 log q) [26], where q is the maximum number of unique words in either d or d′. The WCD scales asymptotically as O(dq).

Regularized Transport Problem. To alleviate the cubic time complexity of the Wasserstein distance computation, Cuturi [6] formulated a smoothed version of the underlying transport problem by adding an entropy regularizer to the transport objective. This makes the objective function strictly convex, and efficient algorithms can be adopted to solve it. In particular, given a transport matrix T, let h(T) = −Σ_{i,j=1}^n T_{ij} log(T_{ij}) be the entropy of T.
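As a concrete illustration of eq. (1), the exact WMD can be computed with an off-the-shelf linear-programming solver. The sketch below is not the authors' implementation: it assumes numpy/scipy are available, the embedding and documents are toy values, and `wmd`/`wcd` are hypothetical helper names. It also checks the WCD lower bound for p = 1.

```python
import numpy as np
from scipy.optimize import linprog

def wmd(X, da, db, p=1):
    """Exact WMD of eq. (1): optimal transport between two BOW histograms.

    X      : (d, n) word embedding matrix
    da, db : (n,) normalized BOW vectors (entries sum to one)
    """
    n = X.shape[1]
    # pairwise transport costs ||x_i - x_j||_2^p
    diff = X[:, :, None] - X[:, None, :]            # (d, n, n)
    C = np.linalg.norm(diff, axis=0) ** p           # (n, n)
    # equality constraints on the flattened (row-major) transport matrix T
    A_eq = np.zeros((2 * n, n * n))
    for i in range(n):
        A_eq[i, i * n:(i + 1) * n] = 1.0            # sum_j T_ij = da_i
        A_eq[n + i, i::n] = 1.0                     # sum_i T_ij = db_j
    b_eq = np.concatenate([da, db])
    res = linprog(C.ravel(), A_eq=A_eq, b_eq=b_eq,
                  bounds=(0, None), method="highs")
    return res.fun

def wcd(X, da, db):
    """Word centroid distance ||X da - X db||_2 (lower bounds WMD for p = 1)."""
    return np.linalg.norm(X @ da - X @ db)
```

For p = 1 the triangle inequality gives ‖X(d^a − d^b)‖_2 = ‖Σ_{ij} T_{ij}(x_i − x_j)‖_2 ≤ Σ_{ij} T_{ij} ‖x_i − x_j‖_2, which is the WCD bound checked above.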
For any λ > 0, the regularized (primal) transport problem is defined as

    min_{T ≥ 0} Σ_{i,j=1}^n T_{ij} ‖x_i − x_j‖_2^p − (1/λ) h(T)
    subject to  Σ_{j=1}^n T_{ij} = d^a_i  and  Σ_{i=1}^n T_{ij} = d^b_j  ∀ i, j.    (2)

The larger λ is, the closer this relaxation is to the original Wasserstein distance. Cuturi [6] proposes an efficient algorithm to solve for the optimal transport T*_λ using a clever matrix-scaling algorithm. Specifically, we may define the matrix K_{ij} = exp(−λ‖x_i − x_j‖_2^p) and solve for the scaling vectors u, v to a fixed point by computing u = d^a ./ (Kv) and v = d^b ./ (K^T u) in an alternating fashion. These yield the relaxed transport T*_λ = diag(u) K diag(v). This algorithm can be shown to have empirical time complexity O(q^2) [6], which is significantly faster than solving the WMD problem exactly.

Linear Metric Learning. Assume that we have access to a training set {x_1, . . . , x_n} ⊂ R^d, arranged as the columns of a matrix X ∈ R^{d×n}, and corresponding labels {y_1, . . . , y_n} ∈ Y^n, where Y contains some finite number of classes C = |Y|. Linear metric learning learns a matrix A ∈ R^{r×d}, where r ≤ d, and defines the generalized Euclidean distance between two documents x_i and x_j as d_A(x_i, x_j) = ‖A(x_i − x_j)‖_2. Popular linear metric learning algorithms are NCA [15], LMNN [36], and ITML [10], amongst others [37]. These methods learn a matrix A to minimize a loss function that is often an approximation of the leave-one-out (LOO) classification error of the kNN classifier.

Neighborhood Components Analysis (NCA) was introduced by Goldberger et al. [15] to learn a generalized Euclidean metric.
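The matrix-scaling fixed point for eq. (2) can be sketched in a few lines. This is a minimal illustration, not the paper's Matlab implementation; it assumes numpy is available and the cost matrix, histograms, and λ in the test are arbitrary toy choices.

```python
import numpy as np

def sinkhorn_transport(C, da, db, lam=10.0, n_iter=1000):
    """Relaxed optimal transport of eq. (2) via Sinkhorn matrix scaling.

    C      : (n, n) ground cost matrix, e.g. C_ij = ||x_i - x_j||_2^p
    da, db : (n,) histograms with positive entries summing to one
    Returns T*_lam = diag(u) K diag(v).
    """
    K = np.exp(-lam * C)
    u = np.ones_like(da)
    v = np.ones_like(db)
    for _ in range(n_iter):
        u = da / (K @ v)        # enforce row marginals:    sum_j T_ij = da_i
        v = db / (K.T @ u)      # enforce column marginals: sum_i T_ij = db_j
    return u[:, None] * K * v[None, :]
```

With strictly positive histograms the iteration converges linearly, and both marginal constraints are met to numerical precision for small problems.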
Here, the authors approximate the non-continuous leave-one-out kNN error by defining a stochastic neighborhood process. An input x_i is assigned input x_j as its nearest neighbor with probability

    p_{ij} = exp(−d_A^2(x_i, x_j)) / Σ_{k≠i} exp(−d_A^2(x_i, x_k)),    (3)

where we define p_{ii} = 0. Under this stochastic neighborhood assignment, an input x_i with label y_i is classified correctly if its nearest neighbor is any x_j ≠ x_i from the same class (y_j = y_i). The probability of this event can be stated as p_i = Σ_{j: y_j = y_i} p_{ij}. NCA learns A by maximizing the expected LOO accuracy Σ_i p_i, or equivalently by minimizing −Σ_i log(p_i), the KL-divergence from a perfect classification distribution (p_i = 1 for all x_i).

3 Learning a Word Embedding Metric

In this section we propose a method for learning a supervised document distance, by way of learning a generalized Euclidean metric within the word embedding space and a word importance vector. We will refer to the learned document distance as the Supervised Word Mover's Distance (S-WMD). To learn such a metric we assume we have a training dataset consisting of m documents {d^1, . . . , d^m} ⊂ Σ_n, where Σ_n is the (n−1)-dimensional simplex (thus each document is represented as a normalized histogram over the words in the vocabulary, of size n). For each document we are given a label out of C possible classes, i.e. {y_1, . . . , y_m} ∈ {1, . . . , C}^m. Additionally, we are given a word embedding matrix X ∈ R^{d×n} (e.g., the word2vec embedding) which defines a d-dimensional word vector for each word in the vocabulary.

Supervised WMD. As described in the previous section, it is possible to define a distance between any two documents d^a and d^b as the minimum cumulative word distance of moving d^a to d^b in word embedding space, as is done in the WMD.
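The NCA neighborhood probabilities of eq. (3), which also reappear later in the S-WMD loss, can be sketched as follows. This is a toy illustration assuming numpy; `nca_probabilities` is a hypothetical helper name, not part of the paper's code.

```python
import numpy as np

def nca_probabilities(X, y, A):
    """Stochastic neighbor probabilities p_ij of eq. (3) and the
    per-point probabilities p_i of a correct stochastic classification.

    X : (d, n) inputs as columns,  y : (n,) labels,  A : (r, d) metric.
    """
    Z = A @ X                                                  # project inputs
    sq = ((Z[:, :, None] - Z[:, None, :]) ** 2).sum(axis=0)    # d_A^2(x_i, x_j)
    logits = -sq
    np.fill_diagonal(logits, -np.inf)                          # enforce p_ii = 0
    P = np.exp(logits - logits.max(axis=1, keepdims=True))     # stable softmax
    P /= P.sum(axis=1, keepdims=True)
    same = (y[:, None] == y[None, :])
    p = (P * same).sum(axis=1)                                 # p_i = sum_{j:y_j=y_i} p_ij
    return P, p
```

The NCA objective is then simply `-np.log(p).sum()`, minimized over A.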
Given a labeled training set we would like to improve the distance so that documents that share the same label are close, and those with different labels are far apart. We capture this notion of similarity in two ways: First, we transform the word embedding, which captures a latent representation of words. We adapt this representation with a linear transformation x_i → A x_i, where x_i represents the embedding of the i-th word. Second, as different classification tasks and datasets may value words differently, we also introduce a histogram importance vector w that re-weights the word histogram values to reflect the importance of words for distinguishing the classes:

    ˜d^a = (w ∘ d^a) / (w^T d^a),    (4)

where "∘" denotes the element-wise Hadamard product. After applying the vector w and the linear mapping A, the WMD distance between documents d^a and d^b becomes

    D_{A,w}(d^a, d^b) := min_{T ≥ 0} Σ_{i,j=1}^n T_{ij} ‖A(x_i − x_j)‖_2^2
    subject to  Σ_{j=1}^n T_{ij} = ˜d^a_i  and  Σ_{i=1}^n T_{ij} = ˜d^b_j  ∀ i, j.    (5)

Loss Function. Our goal is to learn the matrix A and vector w so that the distance D_{A,w} reflects the semantic definition of similarity encoded in the labeled data. Similar to prior work on metric learning [10, 15, 36], we achieve this by minimizing the kNN-LOO error with the distance D_{A,w} in the document space. As the LOO error is non-differentiable, we use the stochastic neighborhood relaxation proposed by Hinton & Roweis [18], which is also used for NCA. As in prior work we use the squared Euclidean word distance in eq. (5). We use the KL-divergence loss proposed in NCA alongside the definition of the neighborhood probability in eq. (3), which yields

    ℓ(A, w) = − Σ_{a=1}^m Σ_{b: y_b = y_a} log( exp(−D_{A,w}(d^a, d^b)) / Σ_{c≠a} exp(−D_{A,w}(d^a, d^c)) ).    (6)

Gradient.
We can compute the gradient of the loss ℓ(A, w) with respect to A and w as follows:

    ∂ℓ(A, w)/∂(A, w) = Σ_{a=1}^m Σ_{b≠a} (p_{ab}/p_a)(δ_{ab} − p_a) ∂D_{A,w}(d^a, d^b)/∂(A, w),    (7)

where δ_{ab} = 1 if and only if y_a = y_b, and δ_{ab} = 0 otherwise.

3.1 Fast computation of ∂D_{A,w}(d^a, d^b)/∂(A, w)

Notice that the remaining gradient term above, ∂D_{A,w}(d^a, d^b)/∂(A, w), contains the nested linear program defined in eq. (5). In fact, computing this gradient for even a single pair of documents requires time O(q^3 log q), where q is the largest number of unique words in either document [8]. This quickly becomes prohibitively slow as the documents become large and the number of documents increases. Further, the gradient is not always guaranteed to exist [1, 7] (instead we must resort to subgradient descent). Motivated by recent work on fast Wasserstein distance computation [6, 8, 12], we propose to relax the modified linear program in eq. (5) using the entropy regularizer as in eq. (2). As described in Section 2, this allows us to approximately solve eq. (5) in O(q^2) time via T*_λ = diag(u) K diag(v). We will use this approximate solution in the following gradients.

Gradient w.r.t. A. It can be shown that

    ∂D_{A,w}(d^a, d^b)/∂A = 2A Σ_{i,j=1}^n T^{ab}_{ij} (x_i − x_j)(x_i − x_j)^T,    (8)

where T^{ab} is the optimizer of eq. (5), so long as it is unique (otherwise it is a subgradient) [1]. We replace T^{ab} by T*_λ, which is always unique as the relaxed transport problem is strongly convex [9].

Gradient w.r.t. w.
To obtain the gradient with respect to w, we need the optimal solution of the dual transport problem:

    D*_{A,w}(d^a, d^b) := max_{(α, β)} α^T ˜d^a + β^T ˜d^b
    subject to  α_i + β_j ≤ ‖A(x_i − x_j)‖_2^2  ∀ i, j.    (9)

Given that both ˜d^a and ˜d^b are functions of w, we have

    ∂D_{A,w}(d^a, d^b)/∂w = (∂D*_{A,w}/∂˜d^a)(∂˜d^a/∂w) + (∂D*_{A,w}/∂˜d^b)(∂˜d^b/∂w)
                          = (α* ∘ d^a − (α*^T ˜d^a) d^a)/(w^T d^a) + (β* ∘ d^b − (β*^T ˜d^b) d^b)/(w^T d^b).    (10)

Instead of solving the dual directly, we obtain the relaxed optimal dual variables α*_λ, β*_λ via the vectors u, v that were used to derive our relaxed transport T*_λ. Specifically, we can solve for the dual variables as α*_λ = log(u)/λ − ((log(u)^T 1)/(λp)) 1 and β*_λ = log(v)/λ − ((log(v)^T 1)/(λp)) 1, where 1 is the p-dimensional all-ones vector. In general, we can observe from eq. (2) that the above approximation becomes more accurate as λ grows. However, setting λ too large can make the algorithm converge more slowly. In our experiments we use λ = 10, which leads to a good trade-off between speed and approximation accuracy.

Algorithm 1 S-WMD
 1: Input: word embedding X,
 2:    dataset {(d^1, y_1), . . . , (d^m, y_m)}
 3: c^a = X d^a, ∀ a ∈ {1, . . . , m}
 4: A = NCA((c^1, y_1), . . . , (c^m, y_m))
 5: w = 1
 6: while loop until convergence do
 7:    Randomly select B ⊆ {1, . . . , m}
 8:    Compute gradients g_A, g_w using eq. (11)
 9:    A ← A − η_A g_A
10:    w ← w − η_w g_w
11: end while

3.2 Optimization

Alongside the fast gradient computation introduced above, we can further speed up training with a clever initialization and batch gradient descent.

Initialization. The loss function in eq. (6) is non-convex and its solution is thus highly dependent on the initial setting of A and w. A good initialization also drastically reduces the number of gradient steps required. For w, we initialize all entries to 1, i.e., all words are assigned the same weight at the beginning. For A, we propose to learn an initial projection within the word centroid distance (WCD), defined as D′(d^a, d^b) = ‖Xd^a − Xd^b‖_2 and described in Section 2. The WCD should be a reasonable approximation to the WMD; Kusner et al. [19] point out that the WCD is a lower bound on the WMD, which remains true after the transformation with A. We obtain our initialization by applying NCA in the word embedding space using the WCD distance between documents. That is, we construct the WCD dataset {c^1, . . . , c^m} ⊂ R^d, representing each text document by its word centroid, and apply NCA in the usual way as described in Section 2. We call this learned document distance the Supervised Word Centroid Distance (S-WCD).

Batch Gradient Descent. Once the initial matrix A is obtained, we minimize the loss ℓ(A, w) in eq. (6) with batch gradient descent. At each iteration, instead of optimizing over the full training set, we randomly pick a batch of documents B from the training set and compute the gradient for these documents. We can further speed up training by observing that the vast majority of the NCA probabilities p_{ab} are near zero, because most documents are far away from any given document. Thus,
Thus,\nfor a document da we can use the WCD to get a cheap neighbor ordering and only compute the\nNCA probabilities for the closest set of documents Na, based on the WCD. When we compute the\ngradient for each of the selected documents, we only use the document\u2019s M nearest neighbor doc-\numents (de\ufb01ned by WCD distance) to compute the NCA neighborhood probabilities. In particular,\nthe gradient is computed as follows,\n\n(cid:88)\n\n(cid:88)\n\na\u2208B\n\nb\u2208Na\n\ngA,w =\n\n(pab/pa)(\u03b4ab \u2212 pa)\n\n\u2202\n\n\u2202(A, w)\n\nD(A,w)(da, db),\n\n(11)\n\nwhere again Na is the set of nearest neighbors of document a. With the gradient, we update A and\nw with learning rates \u03b7A and \u03b7w, respectively. Algorithm 1 summarizes S-WMD in pseudo code.\nComplexity. The empirical time complexity of solving the dual transport problem scales quadrati-\ncally with p [26]. Therefore, the complexity of our algorithm is O(T BN [p2 + d2(p + r)]), where\nT denotes the number of batch gradient descent iterations, B = |B| the batch size, N = |Na| the\nsize of the nearest neighbor set, and p the maximum number of unique words in a document. This\n\u2217 using the alternating \ufb01xed point algorithm in Section 3.1\nis because computing T\u2217\nrequires O(p2) time, while constructing the gradients from eqs. (8) and (10) takes O(d2(p + r))\ntime. The approximated gradient eq. (11) requires this computation to be repeated BN times. In\nour experiments, we set B = 32 and N = 200, and computing the gradient at each iteration can be\ndone in seconds.\n4 Results\nWe evaluate S-WMD on 8 different document corpora and compare the kNN error with unsupervised\nWCD, WMD, and 6 document representations. 
In addition, all 6 document representation baselines are used with and without 3 leading supervised metric learning algorithms, resulting in an overall total of 26 competitive baselines. Our code is implemented in Matlab and is freely available at https://github.com/gaohuang/S-WMD.

Table 1: The document datasets (and their descriptions) used for visualization and evaluation.

name       description                            C    n      n_e    BOW dim.  avg words
BBCSPORT   BBC sports articles labeled by sport   5    517    220    13243     117
TWITTER    tweets categorized by sentiment [31]   3    2176   932    6344      9.9
RECIPE     recipe procedures labeled by origin    15   3059   1311   5708      48.5
OHSUMED    medical abstracts (class subsampled)   10   3999   5153   31789     59.2
CLASSIC    academic papers labeled by publisher   4    4965   2128   24277     38.6
REUTERS    news dataset (train/test split [3])    8    5485   2189   22425     37.1
AMAZON     reviews labeled by product             4    5600   2400   42063     45.0
20NEWS     canonical news article dataset [3]     20   11293  7528   29671     72

Figure 1: t-SNE plots of WMD and S-WMD on all datasets.

Datasets and Baselines. We evaluate all approaches on 8 document datasets in the settings of news categorization, sentiment analysis, and product identification, among others. Table 1 describes the classification tasks as well as the size and number of classes C of each dataset. We evaluate against the following document representation/distance methods: 1. bag-of-words (BOW): a count of the number of word occurrences in a document, where the length of the vector is the number of unique words in the corpus; 2. term frequency-inverse document frequency (TF-IDF): the BOW vector normalized by the document frequency of each word across the corpus; 3. Okapi BM25 [28]: a TF-IDF-like ranking function, first used in search engines; 4.
Latent Semantic Indexing (LSI) [11]: projects the BOW vectors onto an orthogonal basis via singular value decomposition; 5. Latent Dirichlet Allocation (LDA) [2]: a generative probabilistic method that models documents as mixtures of word 'topics'. We train LDA transductively (i.e., on the combined collection of training and testing words) and use the topic probabilities as the document representation; 6. Marginalized Stacked Denoising Autoencoders (mSDA) [4]: a fast method for training stacked denoising autoencoders, which have state-of-the-art error rates on sentiment analysis tasks [14]. For datasets larger than RECIPE we use either a high-dimensional variant of mSDA or take the 20% of features that occur most often, whichever has better performance; 7. Word Centroid Distance (WCD), described in Section 2; 8. Word Mover's Distance (WMD), described in Section 2. For completeness, we also show results for the Supervised Word Centroid Distance (S-WCD) and the initialization of S-WMD (S-WMD init.), described in Section 3. For methods that propose a document representation (as opposed to a distance), we use the Euclidean distance between these vector representations for visualization and kNN classification. For the supervised metric learning results we first reduce the dimensionality of each representation to 200 dimensions (if necessary) with PCA and then run either NCA, ITML, or LMNN on the projected data. We tune all free hyperparameters in all compared methods with Bayesian optimization (BO), using the implementation of Gardner et al. [13]³.

kNN classification. We show the kNN test error of all document representation and distance methods in Table 2. For the datasets that do not have a predefined train/test split (BBCSPORT, TWITTER, RECIPE, CLASSIC, and AMAZON) we average results over five 70/30 train/test splits and report standard errors.
For each dataset we highlight the best results in bold (and those whose standard error overlaps the mean of the best result).

³http://tinyurl.com/bayesopt

Table 2: The kNN test error for all datasets and distances.

METHOD            BBCSPORT      TWITTER      RECIPE       OHSUMED  CLASSIC      REUTERS  AMAZON       20NEWS  AVG. RANK
UNSUPERVISED
BOW               20.6 ± 1.2    43.6 ± 0.4   59.3 ± 1.0   61.1     36.0 ± 0.5   13.9     28.5 ± 0.5   57.8    26.1
TF-IDF            21.5 ± 2.8    33.2 ± 0.9   53.4 ± 1.0   62.7     35.0 ± 1.8   29.1     41.5 ± 1.2   54.4    25.0
OKAPI BM25 [28]   16.9 ± 1.5    42.7 ± 7.8   53.4 ± 1.9   66.2     40.6 ± 2.7   32.8     58.8 ± 2.6   55.9    26.1
LSI [11]          4.3 ± 0.6     31.7 ± 0.7   45.4 ± 0.5   44.2     6.7 ± 0.4    6.3      9.3 ± 0.4    28.9    12.0
LDA [2]           6.4 ± 0.7     33.8 ± 0.3   51.3 ± 0.6   51.0     5.0 ± 0.3    6.9      11.8 ± 0.6   31.5    16.6
MSDA [4]          8.4 ± 0.8     32.3 ± 0.7   48.0 ± 1.4   49.3     6.9 ± 0.4    8.1      17.1 ± 0.4   39.5    18.0
ITML [10]
BOW               7.4 ± 1.4     32.0 ± 0.4   63.1 ± 0.9   70.1     7.5 ± 0.5    7.3      20.5 ± 2.1   60.6    23.0
TF-IDF            1.8 ± 0.2     31.1 ± 0.3   51.0 ± 1.4   55.1     9.9 ± 1.0    6.6      11.1 ± 1.9   45.3    14.8
OKAPI BM25 [28]   3.7 ± 0.5     31.9 ± 0.3   53.8 ± 1.8   77.0     18.3 ± 4.5   20.7     11.4 ± 2.9   81.5    21.5
LSI [11]          5.0 ± 0.7     32.3 ± 0.4   55.7 ± 0.8   54.7     5.5 ± 0.7    6.9      10.6 ± 2.2   39.6    17.6
LDA [2]           6.5 ± 0.7     33.9 ± 0.9   59.3 ± 0.8   59.6     6.6 ± 0.5    9.2      15.7 ± 2.0   87.8    22.5
MSDA [4]          25.5 ± 9.4    43.7 ± 7.4   54.5 ± 1.3   61.8     14.9 ± 2.2   5.9      37.4 ± 4.0   47.7    23.9
LMNN [36]
BOW               2.4 ± 0.4     31.8 ± 0.3   48.4 ± 0.4   49.1     4.7 ± 0.3    3.9      10.7 ± 0.3   40.7    11.5
TF-IDF            4.0 ± 0.6     30.8 ± 0.3   43.7 ± 0.3   40.0     4.9 ± 0.3    5.8      6.8 ± 0.3    28.1    7.8
OKAPI BM25 [28]   1.9 ± 0.7     30.5 ± 0.4   41.7 ± 0.7   59.4     19.0 ± 9.3   9.2      6.9 ± 0.2    57.4    14.4
LSI [11]          2.4 ± 0.5     31.6 ± 0.2   44.8 ± 0.4   40.8     3.0 ± 0.1    3.2      6.6 ± 0.2    25.1    5.1
LDA [2]           4.5 ± 0.4     31.9 ± 0.6   51.4 ± 0.4   49.9     4.9 ± 0.4    5.6      12.1 ± 0.6   32.0    14.6
MSDA [4]          22.7 ± 10.0   50.3 ± 8.6   46.3 ± 1.2   41.6     11.1 ± 1.9   5.3      24.0 ± 3.6   27.1    17.3
NCA [15]
BOW               9.6 ± 0.6     31.1 ± 0.5   55.2 ± 0.6   57.4     4.0 ± 0.1    6.2      16.8 ± 0.3   46.4    17.5
TF-IDF            0.6 ± 0.3     30.6 ± 0.5   41.4 ± 0.4   35.8     5.5 ± 0.2    3.8      6.5 ± 0.2    29.3    5.4
OKAPI BM25 [28]   4.5 ± 0.5     31.8 ± 0.4   45.8 ± 0.5   56.6     20.6 ± 4.8   10.5     8.5 ± 0.4    55.9    17.9
LSI [11]          2.4 ± 0.7     31.1 ± 0.8   41.6 ± 0.5   37.5     3.1 ± 0.2    3.3      7.7 ± 0.4    30.7    6.3
LDA [2]           7.1 ± 0.9     32.7 ± 0.3   50.9 ± 0.4   50.7     5.0 ± 0.2    7.9      11.6 ± 0.8   30.9    16.5
MSDA [4]          21.8 ± 7.4    37.9 ± 2.8   48.0 ± 1.6   40.4     11.2 ± 1.8   5.2      23.6 ± 3.1   26.8    16.1
DISTANCES IN THE WORD MOVER'S FAMILY
WCD [19]          11.3 ± 1.1    30.7 ± 0.9   49.4 ± 0.3   48.9     6.6 ± 0.2    4.7      9.2 ± 0.2    36.2    13.5
WMD [19]          4.6 ± 0.7     28.7 ± 0.6   42.6 ± 0.3   44.5     2.8 ± 0.1    3.5      7.4 ± 0.3    26.8    6.1
S-WCD             4.6 ± 0.5     30.4 ± 0.5   51.3 ± 0.2   43.3     5.8 ± 0.2    3.9      7.6 ± 0.3    33.6    11.4
S-WMD INIT.       2.8 ± 0.3     28.2 ± 0.4   39.8 ± 0.4   38.0     3.3 ± 0.3    3.5      5.8 ± 0.2    28.4    4.3
S-WMD             2.1 ± 0.5     27.5 ± 0.5   39.2 ± 0.3   34.3     3.2 ± 0.2    3.2      5.8 ± 0.1    26.8    2.4
On the right of Table 2 we also show the average rank of each method across datasets, relative to unsupervised BOW (bold indicates the best method). We highlight the unsupervised WMD in blue (WMD) and our new result in red (S-WMD). Despite the very large number of competitive baselines, S-WMD achieves the lowest kNN test error on 5/8 datasets, with the exceptions of BBCSPORT, CLASSIC and AMAZON. It achieves the 4th lowest error on BBCSPORT and CLASSIC, and is tied at 2nd on 20NEWS. On average across all datasets it outperforms all other 26 methods. Another observation is that S-WMD right after initialization (S-WMD init.) already performs quite well. However, as training S-WMD is efficient (shown in Table 3), it is often well worth the training time.

Among the unsupervised baselines, on the datasets BBCSPORT and OHSUMED, where the previous state-of-the-art WMD was beaten by LSI, S-WMD reduces the error of LSI relatively by 51% and 22%, respectively. In general, supervision seems to help all methods on average. One reason why NCA with a TF-IDF document representation may perform better than S-WMD could be the long document lengths in BBCSPORT and OHSUMED. Having denser BOW vectors may improve the inverse document frequency weights, which in turn may be a good initialization for NCA to further fine-tune. On datasets with smaller documents such as TWITTER, CLASSIC, and REUTERS, S-WMD outperforms NCA with TF-IDF relatively by 10%, 42%, and 15%, respectively. On CLASSIC, WMD outperforms S-WMD, possibly because of a poor initialization and because S-WMD uses the squared Euclidean distance between word vectors, which may be suboptimal for this dataset. This, however, does not occur for any other dataset.

Visualization. Figure 1 shows a 2D embedding of the test split of each dataset by WMD and S-WMD using t-Stochastic Neighbor Embedding (t-SNE) [33]. The quality of a distance can be visualized by how clustered the points of the same class are.
Using this metric, S-WMD noticeably improves upon WMD on almost all of the 8 datasets. Figure 2 visualizes the top 100 words with the largest weights learned by S-WMD on the 20NEWS dataset; the size of each word is proportional to its learned weight. We can observe that these upweighted words are indeed among the most representative for the true classes of this dataset. More detailed results and analysis can be found in the supplementary material.

Figure 2: The top-100 words upweighted by S-WMD on 20NEWS.

Table 3: Distance computation times (full training times).

DATASET    S-WCD/S-WMD INIT.   S-WMD
BBCSPORT   1M 25S              4M 56S
TWITTER    28M 59S             7M 53S
RECIPE     23M 21S             23M 58S
OHSUMED    46M 18S             29M 12S
CLASSIC    1H 18M              36M 22S
REUTERS    2H 7M               34M 56S
AMAZON     2H 15M              20M 10S
20NEWS     14M 42S             1H 55M

Training time. Table 3 shows the training times for S-WMD. Note that the time to learn the initial metric A is not included in the time shown in the second column. Relative to the initialization, S-WMD is surprisingly fast. This is due to the fast gradient approximation and the batch gradient descent introduced in Sections 3.1 and 3.2. We note that these times are
We note that these times are\ncomparable or even faster than the time it takes\nto train a linear metric on the baseline methods\nafter PCA.\n5 Related Work\nMetric learning is a vast \ufb01eld that includes both\nsupervised and unsupervised techniques (see\nYang & Jin [37] for a large survey). Alongside NCA [15], described in Section 2, there are a num-\nber of popular methods for generalized Euclidean metric learning. Large Margin Nearest Neighbors\n(LMNN) [36] learns a metric that encourages inputs with similar labels to be close in a local region,\nwhile encouraging inputs with different labels to be farther by a large margin. Information-Theoretic\nMetric Learning (ITML) [10] learns a metric by minimizing a KL-divergence subject to generalized\nEuclidean distance constraints. Cuturi & Avis [7] was the \ufb01rst to consider learning the ground dis-\ntance in the Earth Mover\u2019s Distance (EMD). In a similar work, Wang & Guibas [34] learns a ground\ndistance that is not a metric, with good performance in certain vision tasks. Most similar to our\nwork Wang et al. [35] learn a metric within a generalized Euclidean EMD ground distance using\nthe framework of ITML for image classi\ufb01cation. They do not, however, consider re-weighting the\nhistograms, which allows our method extra \ufb02exibility. Until recently, there has been relatively little\nwork towards learning supervised word embeddings, as state-of-the-art results rely on making use\nof large unlabeled text corpora. Tang et al. [32] propose a neural language model that uses label\ninformation from emoticons to learn sentiment-speci\ufb01c word embeddings.\n6 Conclusion\nWe proposed a powerful method to learn a supervised word mover\u2019s distance, and demonstrated\nthat it may well be the best performing distance metric for documents to date. Similar to WMD,\nour S-WMD bene\ufb01ts from the large unsupervised corpus, which was used to learn the word2vec\nembedding [22, 23]. 
The word embedding gives rise to a very good document distance, which is particularly forgiving when two documents use syntactically different but conceptually similar words. Two words may be similar in one sense but dissimilar in another, depending on the articles in which they appear. It is these differences that S-WMD manages to capture through supervised training. By learning a linear metric and histogram re-weighting through the optimal transport of the word mover's distance, we are able to produce state-of-the-art classification results efficiently.

Acknowledgments

The authors are supported in part by grants III-1618134, III-1526012, and IIS-1149882 from the National Science Foundation, and by the Bill and Melinda Gates Foundation. We also thank Dor Kedem for many insightful discussions.

References

[1] Bertsimas, D. and Tsitsiklis, J. N. Introduction to Linear Optimization. Athena Scientific, 1997.
[2] Blei, D. M., Ng, A. Y., and Jordan, M. I. Latent Dirichlet allocation. JMLR, 2003.
[3] Cardoso-Cachopo, A. Improving Methods for Single-label Text Categorization. PhD thesis, Instituto Superior Tecnico, Universidade Tecnica de Lisboa, 2007.
[4] Chen, M., Xu, Z., Weinberger, K. Q., and Sha, F. Marginalized denoising autoencoders for domain adaptation. In ICML, 2012.
[5] Collobert, R. and Weston, J. A unified architecture for natural language processing: Deep neural networks with multitask learning. In ICML, pp. 160–167. ACM, 2008.
[6] Cuturi, M. Sinkhorn distances: Lightspeed computation of optimal transport. In Advances in Neural Information Processing Systems, pp. 2292–2300, 2013.
[7] Cuturi, M. and Avis, D. Ground metric learning. JMLR, 2014.
[8] Cuturi, M. and Doucet, A. Fast computation of Wasserstein barycenters. In ICML, pp. 685–693, 2014.
[9] Cuturi, M. and Peyre, G.
A smoothed dual approach for variational Wasserstein problems. SIAM Journal on Imaging Sciences, 9(1):320–343, 2016.
[10] Davis, J. V., Kulis, B., Jain, P., Sra, S., and Dhillon, I. S. Information-theoretic metric learning. In ICML, pp. 209–216, 2007.
[11] Deerwester, S. C., Dumais, S. T., Landauer, T. K., Furnas, G. W., and Harshman, R. A. Indexing by latent semantic analysis. Journal of the American Society for Information Science, 41(6):391–407, 1990.
[12] Frogner, C., Zhang, C., Mobahi, H., Araya, M., and Poggio, T. A. Learning with a Wasserstein loss. In Advances in Neural Information Processing Systems, pp. 2044–2052, 2015.
[13] Gardner, J., Kusner, M. J., Xu, E., Weinberger, K. Q., and Cunningham, J. Bayesian optimization with inequality constraints. In ICML, pp. 937–945, 2014.
[14] Glorot, X., Bordes, A., and Bengio, Y. Domain adaptation for large-scale sentiment classification: A deep learning approach. In ICML, pp. 513–520, 2011.
[15] Goldberger, J., Hinton, G. E., Roweis, S. T., and Salakhutdinov, R. Neighbourhood components analysis. In NIPS, pp. 513–520, 2005.
[16] Gopalan, P. K., Charlin, L., and Blei, D. Content-based recommendations with Poisson factorization. In NIPS, pp. 3176–3184, 2014.
[17] Greene, D. and Cunningham, P. Practical solutions to the problem of diagonal dominance in kernel document clustering. In ICML, pp. 377–384. ACM, 2006.
[18] Hinton, G. E. and Roweis, S. T. Stochastic neighbor embedding. In NIPS, pp. 833–840. MIT Press, 2002.
[19] Kusner, M. J., Sun, Y., Kolkin, N. I., and Weinberger, K. Q. From word embeddings to document distances. In ICML, 2015.
[20] Levina, E. and Bickel, P. The earth mover's distance is the Mallows distance: Some insights from statistics. In ICCV, volume 2, pp. 251–256. IEEE, 2001.
[21] Levy, O. and Goldberg, Y. Neural word embedding as implicit matrix factorization.
In NIPS, 2014.
[22] Mikolov, T., Chen, K., Corrado, G., and Dean, J. Efficient estimation of word representations in vector space. In Workshop at ICLR, 2013.
[23] Mikolov, T., Sutskever, I., Chen, K., Corrado, G. S., and Dean, J. Distributed representations of words and phrases and their compositionality. In NIPS, pp. 3111–3119, 2013.
[24] Mohan, A., Chen, Z., and Weinberger, K. Q. Web-search ranking with initialized gradient boosted regression trees. JMLR, 14:77–89, 2011.
[25] Ontrup, J. and Ritter, H. Hyperbolic self-organizing maps for semantic navigation. In NIPS, 2001.
[26] Pele, O. and Werman, M. Fast and robust earth mover's distances. In ICCV, pp. 460–467. IEEE, 2009.
[27] Perina, A., Jojic, N., Bicego, M., and Truski, A. Documents as multiple overlapping windows into grids of counts. In NIPS, pp. 10–18, 2013.
[28] Robertson, S. E., Walker, S., Jones, S., Hancock-Beaulieu, M. M., Gatford, M., et al. Okapi at TREC-3. NIST Special Publication SP, pp. 109–109, 1995.
[29] Rubner, Y., Tomasi, C., and Guibas, L. J. A metric for distributions with applications to image databases. In ICCV, pp. 59–66. IEEE, 1998.
[30] Salton, G. and Buckley, C. Term-weighting approaches in automatic text retrieval. Information Processing & Management, 24(5):513–523, 1988.
[31] Sanders, N. J. Sanders-Twitter sentiment corpus, 2011.
[32] Tang, D., Wei, F., Yang, N., Zhou, M., Liu, T., and Qin, B. Learning sentiment-specific word embedding for Twitter sentiment classification. In ACL, pp. 1555–1565, 2014.
[33] Van der Maaten, L. and Hinton, G. Visualizing data using t-SNE. JMLR, 9:2579–2605, 2008.
[34] Wang, F. and Guibas, L. J. Supervised earth mover's distance learning and its computer vision applications. In ECCV, 2012.
[35] Wang, X.-L., Liu, Y., and Zha, H.
Learning robust cross-bin similarities for the bag-of-features model. Technical report, Peking University, China, 2009.
[36] Weinberger, K. Q. and Saul, L. K. Distance metric learning for large margin nearest neighbor classification. JMLR, 10:207–244, 2009.
[37] Yang, L. and Jin, R. Distance metric learning: A comprehensive survey. 2006.