{"title": "Non-linear Metric Learning", "book": "Advances in Neural Information Processing Systems", "page_first": 2573, "page_last": 2581, "abstract": "In this paper, we introduce two novel metric learning algorithms, \u03c72-LMNN and GB-LMNN, which are explicitly designed to be non-linear and easy-to-use. The two approaches achieve this goal in fundamentally different ways: \u03c72-LMNN inherits the computational benefits of a linear mapping from linear metric learning, but uses a non-linear \u03c72-distance to explicitly capture similarities within histogram data sets; GB-LMNN applies gradient-boosting to learn non-linear mappings directly in function space and takes advantage of this approach's robustness, speed, parallelizability and insensitivity towards the single additional hyper-parameter. On various benchmark data sets, we demonstrate these methods not only match the current state-of-the-art in terms of kNN classification error, but in the case of \u03c72-LMNN, obtain best results in 19 out of 20 learning settings.", "full_text": "Non-linear Metric Learning\n\nDor Kedem, Stephen Tyree, Kilian Q. Weinberger\n\nDept. of Comp. Sci. & Engi.\n\nWashington U.\n\nSt. Louis, MO 63130\n\nkedem.dor,swtyree,kilian@wustl.edu\n\nFei Sha\n\nDept. of Comp. Sci.\n\nU. of Southern California\nLos Angeles, CA 90089\n\nfeisha@usc.edu\n\nGert Lanckriet\n\nDept. of Elec. & Comp. Engineering\n\nU. of California\n\nLa Jolla, CA 92093\n\ngert@ece.ucsd.edu\n\nAbstract\n\nIn this paper, we introduce two novel metric learning algorithms, \u03c72-LMNN and\nGB-LMNN, which are explicitly designed to be non-linear and easy-to-use. 
The two approaches achieve this goal in fundamentally different ways: χ2-LMNN inherits the computational benefits of a linear mapping from linear metric learning, but uses a non-linear χ2-distance to explicitly capture similarities within histogram data sets; GB-LMNN applies gradient-boosting to learn non-linear mappings directly in function space and takes advantage of this approach's robustness, speed, parallelizability and insensitivity towards the single additional hyper-parameter. On various benchmark data sets, we demonstrate these methods not only match the current state-of-the-art in terms of kNN classification error, but in the case of χ2-LMNN, obtain best results in 19 out of 20 learning settings.

1 Introduction
How to compare examples is a fundamental question in machine learning. If an algorithm could perfectly determine whether two examples were semantically similar or dissimilar, most subsequent machine learning tasks would become trivial (i.e., a nearest neighbor classifier would achieve perfect results). Guided by this motivation, a surge of recent research [10, 13, 15, 24, 31, 32] has focused on Mahalanobis metric learning. The resulting methods greatly improve the performance of metric-dependent algorithms, such as k-means clustering and kNN classification, and have gained popularity in many research areas and applications within and beyond machine learning.
One reason for this success is the out-of-the-box usability and robustness of several popular methods to learn these linear metrics. So far, non-linear approaches [6, 18, 26, 30] to metric learning have not managed to replicate this success. Although more expressive, the optimization problems are often expensive to solve and plagued by sensitivity to many hyper-parameters. Ideally, we would like to develop easy-to-use black-box algorithms that learn new data representations for the use of established metrics. 
Further, non-linear transformations should be applied depending on the specifics of a given data set.
In this paper, we introduce two novel extensions to the popular Large Margin Nearest Neighbors (LMNN) framework [31] which provide non-linear capabilities and are applicable out-of-the-box. The two algorithms follow different approaches to achieve this goal:
(i) Our first algorithm, χ2-LMNN, is specialized for histogram data. It generalizes the non-linear χ2-distance and learns a metric that strictly preserves the histogram properties of input data on a probability simplex. It successfully combines the simplicity and elegance of the LMNN objective and the domain-specific expressiveness of the χ2-distance.
(ii) Our second algorithm, gradient boosted LMNN (GB-LMNN), employs a non-linear mapping combined with a traditional Euclidean distance function. It is a natural extension of LMNN from linear to non-linear mappings. By training the non-linear transformation directly in function space with gradient-boosted regression trees (GBRT) [11], the resulting algorithm inherits the positive aspects of GBRT: its insensitivity to hyper-parameters, robustness against overfitting, speed and natural parallelizability [28].
Both approaches scale naturally to medium-sized data sets, can be optimized using standard techniques and introduce only a single additional hyper-parameter. We demonstrate the efficacy of both algorithms on several real-world data sets and observe two noticeable trends: i) GB-LMNN (with default settings) achieves state-of-the-art k-nearest neighbor classification errors with high consistency across all our data sets. For learning tasks where non-linearity is not required, it reduces to LMNN as a special case. On more complex data sets it reliably improves over linear metrics and matches or out-performs previous work on non-linear metric learning. 
ii) For data sampled from a simplex, χ2-LMNN is strongly superior to alternative approaches that do not explicitly incorporate the histogram aspect of the data; in fact it obtains best results in 19/20 learning settings.

2 Background and Notation
Let {(x1, y1), . . . , (xn, yn)} ⊂ Rd × C be labeled training data with discrete labels C = {1, . . . , c}. Large margin nearest neighbors (LMNN) [30, 31] is an algorithm to learn a Mahalanobis metric specifically to improve the error of the k-nearest neighbor (kNN) classifier [7]. As the kNN rule relies heavily on the underlying metric (a test input is classified by a majority vote amongst its k nearest neighbors), it is a good indicator for the quality of the metric in use. The Mahalanobis metric can be viewed as a straight-forward generalization of the Euclidean metric,

DL(xi, xj) = ‖L(xi − xj)‖2,   (1)

parameterized by a matrix L ∈ Rd×d, which in the case of LMNN is learned such that the linear transformation x → Lx better represents similarity in the target domain. In the remainder of this section we briefly review the necessary terminology and basic framework behind LMNN and refer the interested reader to [31] for more details.
Local neighborhoods. LMNN identifies two types of neighborhood relations between an input xi and other inputs in the data set: For each xi, as a first step, k dedicated target neighbors are identified prior to learning. These are the inputs that should ideally be the actual nearest neighbors after applying the transformation (we use the notation j ⇝ i to indicate that xj is a target neighbor of xi). A common heuristic for choosing target neighbors is picking the k closest inputs (according to the Euclidean distance) to a given xi within the same class. The second type of neighbors are impostors. 
These are inputs that should not be among the k-nearest neighbors, defined to be all inputs from a different class that are within the local neighborhood of xi.
LMNN optimization. The LMNN objective has two terms, one for each neighborhood objective: First, it reduces the distance between an instance and its target neighbors, thus pulling them closer and making the input's local neighborhood smaller. Second, it moves impostor neighbors (i.e., differently labeled inputs) farther away so that the distances to impostors exceed the distances to target neighbors by a large margin. Weinberger et al. [31] combine these two objectives into a single unconstrained optimization problem:

min_L  Σ_{i,j: j⇝i} DL(xi, xj)²  +  μ Σ_{i,j: j⇝i} Σ_{k: yi≠yk} [1 + DL(xi, xj)² − DL(xi, xk)²]_+ ,   (2)

where the first term pulls target neighbors xj closer and the second term pushes impostors xk away, beyond the target neighbor xj by a large margin. The parameter μ defines a trade-off between the two objectives and [x]_+ denotes the hinge-loss [x]_+ = max(0, x). The optimization (2) can be transformed into a semidefinite program (SDP) [31] for which a global solution can be found efficiently. The large margin in (2) is set to 1 as its exact value only impacts the scale of L and not the kNN classifier.
Dimensionality reduction. As an extension to the original LMNN formulation, [26, 30] show that with L ∈ Rr×d with r < d, LMNN learns a projection into a lower-dimensional space Rr that still represents domain-specific similarities. 
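For concreteness, the distance (1) and the objective (2) can be sketched in NumPy. The target-neighbor bookkeeping and the names `mahalanobis_dist` and `lmnn_loss` are illustrative, not the authors' implementation:

```python
import numpy as np

def mahalanobis_dist(L, xi, xj):
    # D_L(xi, xj) = ||L (xi - xj)||_2, eq. (1)
    return np.linalg.norm(L @ (xi - xj))

def lmnn_loss(L, X, y, targets, mu=0.5):
    # Hypothetical bookkeeping: targets[i] lists the target-neighbor
    # indices j with j ~> i, chosen before learning.
    pull, push = 0.0, 0.0
    for i, neighbors in enumerate(targets):
        for j in neighbors:
            d_ij = mahalanobis_dist(L, X[i], X[j]) ** 2
            pull += d_ij                          # pull xj closer
            for k in range(len(X)):               # impostor candidates
                if y[k] != y[i]:
                    d_ik = mahalanobis_dist(L, X[i], X[k]) ** 2
                    push += max(0.0, 1.0 + d_ij - d_ik)  # hinge, margin 1
    return pull + mu * push
```

A practical solver would restrict the impostor sum to inputs inside the local neighborhood rather than all differently labeled points; the exhaustive loop above only illustrates the structure of (2).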
While this low-rank constraint breaks the convexity of the optimization problem, significant speed-ups [30] can be obtained when the kNN classifier is applied in the r-dimensional space, especially when combined with special-purpose data structures [33].

3 χ2-LMNN: Non-linear Distance Functions on the Probability Simplex
The original LMNN algorithm learns a linear transformation L ∈ Rd×d that captures semantic similarity for kNN classification on data in some Euclidean vector space Rd. In this section we extend this formulation to settings in which data are sampled from a probability simplex Sd = {x ∈ Rd | x ≥ 0, x⊤1 = 1}, where 1 ∈ Rd denotes the vector of all-ones. Each input xi ∈ Sd can be interpreted as a histogram over d buckets. Such data are ubiquitous in computer vision, where the histograms can be distributions over visual codebooks [27] or colors [25], in text data as normalized bag-of-words or topic assignments [3], and in many other fields [9, 17, 21].
Histogram distances. The abundance of such data has sparked the development of several specialized distance metrics designed to compare histograms. Examples are the Quadratic-Form distance [16], the Earth Mover's Distance [21], the Quadratic-Chi distance family [20] and the χ2 histogram distance [16]. We focus explicitly on the latter. Transforming the inputs with a linear transformation learned with LMNN will almost certainly result in a loss of their histogram properties, and with it the ability to use such distances. In this section, we introduce our first non-linear extension of LMNN to address this issue. 
In particular, we propose two significant changes to the original LMNN formulation: i) we learn a constrained mapping that keeps the transformed data on the simplex (illustrated in Figure 1), and ii) we optimize the kNN classification performance with respect to the non-linear χ2 histogram distance directly.

Figure 1: A schematic illustration of the χ2-LMNN optimization. The mapping is constrained to preserve all inputs on the simplex S3 (grey surface). The arrows indicate the push (red and yellow) and pull (blue) forces from the χ2-LMNN objective.

χ2 histogram distance. We focus on the χ2 histogram distance, whose origin is the χ2 statistical hypothesis test [19], and which has successfully been applied in many domains [8, 27, 29]. The χ2 distance is a bin-to-bin distance measurement, which takes into account the size of the bins and their differences. Formally, the χ2 distance is a well-defined metric χ2 : Sd × Sd → [0, 1] defined as [20]

χ2(xi, xj) = (1/2) Σ_{f=1}^{d} ([xi]f − [xj]f)² / ([xi]f + [xj]f),   (3)

where [xi]f indicates the f-th feature value of the vector xi.
Generalized χ2 distance. First, analogous to the generalized Euclidean metric in (1), we generalize the χ2 distance with a linear transformation and introduce the pseudo-metric χ2_L(xi, xj), defined as

χ2_L(xi, xj) = χ2(Lxi, Lxj).   (4)

The χ2 distance is only a well-defined metric within the simplex Sd and therefore we constrain L to map any x onto Sd. We define the set of such simplex-preserving linear transformations as P = {L ∈ Rd×d : ∀x ∈ Sd, Lx ∈ Sd}.
χ2-LMNN Objective. 
To optimize the transformation L with respect to the χ2 histogram distance directly, we replace the Mahalanobis distance DL in (2) with χ2_L and obtain the following:

min_{L∈P}  Σ_{i,j: j⇝i} χ2_L(xi, xj)  +  μ Σ_{i,j: j⇝i} Σ_{k: yi≠yk} [ℓ + χ2_L(xi, xj) − χ2_L(xi, xk)]_+ .   (5)

Besides the substituted distance function, there are two important changes in the optimization problem (5) compared to (2). First, as mentioned before, we have an additional constraint L ∈ P. Second, because (4) is not linear in L⊤L, different values for the margin parameter ℓ lead to truly different solutions (which differ not just up to a scaling factor as before). We therefore can no longer arbitrarily set ℓ = 1. Instead, ℓ becomes an additional hyper-parameter of the model. We refer to this algorithm as χ2-LMNN.
Optimization. To learn (5), it can be shown that L ∈ P if and only if L is element-wise non-negative, i.e., L ≥ 0, and each column is normalized, i.e., Σ_i Lij = 1 for all j. These constraints are linear with respect to L and we can optimize (5) efficiently with a projected sub-gradient method [2]. As an even faster optimization method, we propose a simple change of variables to generate an unconstrained version of (5). Let us define f : Rd×d → P to be the column-wise soft-max operator

[f(A)]ij = exp(Aij) / Σ_k exp(Akj).   (6)

By design, all columns of f(A) are normalized and every matrix entry is non-negative. The function f(·) is continuous and differentiable. By defining L = f(A) we obtain L ∈ P for any choice of A ∈ Rd×d. 
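A minimal NumPy sketch of the change of variables (6) and the resulting distance (4). The numerically stabilized exponentials and the small `eps` guard against empty histogram bins are implementation details of this sketch, not part of the paper's formulation:

```python
import numpy as np

def column_softmax(A):
    # f(A): column-wise soft-max, eq. (6). Every column of the result
    # is non-negative and sums to one, so f(A) lies in P.
    E = np.exp(A - A.max(axis=0))       # stabilized exponentials
    return E / E.sum(axis=0)

def chi2_dist(u, v, eps=1e-12):
    # chi^2 histogram distance, eq. (3)
    return 0.5 * np.sum((u - v) ** 2 / (u + v + eps))

def chi2_L(A, xi, xj):
    # generalized chi^2 pseudo-metric, eq. (4), with L = f(A)
    L = column_softmax(A)
    return chi2_dist(L @ xi, L @ xj)
```

Because each column of f(A) sums to one, L @ x keeps any simplex point on the simplex, which is exactly the property the constraint set P demands.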
This allows us to minimize (5) with respect to A using unconstrained sub-gradient descent.¹ We initialize the optimization with A = 10 I + 0.01 11⊤ (where I denotes the identity matrix) to approximate the non-transformed χ2 histogram distance after the change of variables (f(A) ≈ I).
Dimensionality Reduction. Analogous to the original LMNN formulation (described in Section 2), we can restrict L from a square matrix to L ∈ Rr×d with r < d. In this case χ2-LMNN learns a projection into a lower-dimensional simplex, L : Sd → Sr. All other parts of the algorithm change analogously. This extension can be very valuable to enable faster nearest neighbor search [33], especially for time-sensitive applications, e.g., object recognition tasks in computer vision [27]. In Section 6 we evaluate this version of χ2-LMNN under a range of settings for r.

4 GB-LMNN: Non-linear Transformations with Gradient Boosting
Whereas Section 3 focuses on the learning scenario where a linear transformation is too general, in this section we explore the opposite case where it is too restrictive. Affine transformations preserve collinearity and ratios of distances along lines; i.e., inputs on a straight line remain on a straight line and their relative distances are preserved. This can be too restrictive for data where similarities change locally (e.g., because similar data lie on non-linear sub-manifolds). Chopra et al. [6] pioneered non-linear metric learning, using convolutional neural networks to learn embeddings for face-verification tasks. Inspired by their work, we propose to optimize the LMNN objective (2) directly in function space with gradient boosted CART trees [11]. Combining the learned transformation φ(x) : Rd → Rd with a Euclidean distance function has the capability to capture highly non-linear similarity relations. 
It can be optimized using standard techniques and naturally scales to large data sets, while introducing only a single additional hyper-parameter in comparison with LMNN.
Generalized LMNN. To generalize the LMNN objective (2) to a non-linear transformation φ(·), we denote the Euclidean distance after the transformation as

Dφ(xi, xj) = ‖φ(xi) − φ(xj)‖2,   (7)

which satisfies all properties of a well-defined pseudo-metric in the original input space. To optimize the LMNN objective directly with respect to Dφ, we follow the same steps as in Section 3 and substitute Dφ for DL in (2). The resulting unconstrained loss function becomes

L(φ) = Σ_{i,j: j⇝i} ‖φ(xi) − φ(xj)‖2²  +  μ Σ_{i,j: j⇝i} Σ_{k: yi≠yk} [1 + ‖φ(xi) − φ(xj)‖2² − ‖φ(xi) − φ(xk)‖2²]_+ .   (8)

In its most general form, with an unspecified mapping φ, (8) unifies most of the existing variations of LMNN metric learning. The original linear LMNN mapping [31] is a special case where φ(x) = Lx. Kernelized versions [5, 12, 26] are captured by φ(x) = Lψ(x), producing the kernel K(xi, xj) = φ(xi)⊤φ(xj) = ψ(xi)⊤L⊤Lψ(xj). The embedding of Globerson and Roweis [14] corresponds to the most expressive mapping function φ(xi) = zi, where each input xi is transformed independently to a new location zi to satisfy similarity constraints, without out-of-sample extensions.
GB-LMNN. The previous examples vary widely in expressiveness, scalability, and generalization, largely as a consequence of the mapping function φ. 
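The special cases above can be made concrete as callables plugged into the distance (7). The matrices and the toy feature map below are illustrative examples, not part of the paper:

```python
import numpy as np

def d_phi(phi, xi, xj):
    # D_phi(xi, xj) = ||phi(xi) - phi(xj)||_2, eq. (7)
    return np.linalg.norm(phi(xi) - phi(xj))

# Special cases of the unified objective (8), expressed as mappings:
L = np.array([[2.0, 0.0],
              [0.0, 1.0]])
linear_phi = lambda x: L @ x                 # recovers linear LMNN

psi = lambda x: np.concatenate([x, x ** 2])  # toy explicit feature map
LK = np.eye(4)
kernel_phi = lambda x: LK @ psi(x)           # kernelized variant, phi = L psi
```

With `linear_phi`, `d_phi` coincides with the Mahalanobis distance (1); swapping in any other callable changes only the geometry, not the form of the objective.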
It is important to find the right non-linear form for φ, and we believe an elegant solution lies in gradient boosted regression trees.
Our method, termed GB-LMNN, learns a global non-linear mapping. The construction of the mapping, an ensemble of multivariate regression trees selected by gradient boosting [11], minimizes the general LMNN objective (8) directly in function space. Formally, the GB-LMNN transformation is an additive function φ = φ0 + α Σ_{t=1}^{T} ht, initialized by φ0 and constructed by iteratively adding regression trees ht of limited depth p [4], each weighted by a learning rate α. Individually, the trees are weak learners and are capable of learning only simple functions, but additively they form powerful ensembles with good generalization to out-of-sample data. In iteration t, the tree ht is selected greedily to best minimize the objective upon its addition to the ensemble,

φt(·) = φt−1(·) + α ht(·),  where  ht ≈ argmin_{h∈Tp} L(φt−1 + α h).   (9)

Here, Tp denotes the set of all regression trees of depth p. The (approximately) optimal tree ht is found by a first-order Taylor approximation of L.

¹The set of all possible matrices f(A) is slightly more restricted than P, as it reaches zero entries only in the limit. However, given finite computational precision, this does not seem to be a problem in practice.

Figure 2: GB-LMNN illustrated on a toy data set sampled from two concentric circles of different classes (blue and red dots). The figure depicts the true gradient (top row) with respect to each input and its least-squares approximation (bottom row) with a multivariate regression tree (depth p = 4).
This makes the optimization akin to a steepest descent step in function space, where ht is selected to approximate the negative gradient gt of the objective L(φt−1) with respect to the transformation learned at the previous iteration, φt−1. Since we learn an approximation of gt as a function of the training data, sub-gradients are computed with respect to each training input xi and approximated by the tree ht(·) in the least-squares sense,

ht(·) = argmin_{h∈Tp} Σ_{i=1}^{n} (gt(xi) − h(xi))²,  where  gt(xi) = −∂L(φt−1)/∂φt−1(xi).   (10)

Intuitively, at each iteration, the tree ht(·) of depth p splits the input space into 2^p axis-aligned regions. All inputs that fall into one region are translated by a constant vector; consequently, the inputs in different regions are shifted in different directions. We learn the trees greedily with a modified version of the public-domain CART implementation pGBRT [28].
Optimization details. Since (8) is non-convex with respect to φ, we initialize with the linear transformation learned by LMNN, φ0(x) = Lx, making our method a non-linear refinement of LMNN. The only additional hyper-parameter of the optimization is the maximum tree depth p, to which the algorithm is not particularly sensitive (we set p = 6).²
Figure 2 depicts a simple toy example with concentric circles of inputs from two different classes. By design, the inputs are sampled such that the nearest neighbor for any given input is from the other class. A linear transformation is incapable of separating the two classes. However, GB-LMNN produces a mapping with the desired separation. The figure illustrates the actual gradient (top row) and its approximation (bottom row). The limited-depth regression trees are unable to capture the gradient for all inputs in a single iteration. 
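The update in (9) and (10) amounts to fitting a depth-limited multi-output regression tree to the per-input negative gradients and taking a small step in function space. A sketch, using scikit-learn's DecisionTreeRegressor in place of the authors' pGBRT implementation and omitting the gradient computation itself:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def boosting_step(X, phi_X, G, alpha=0.01, depth=6):
    # One boosting update in the spirit of eqs. (9)-(10):
    #   X:     original inputs x_i, shape (n, d) -- tree features
    #   phi_X: current embedding phi_{t-1}(x_i), shape (n, r)
    #   G:     negative gradients g_t(x_i) of the loss w.r.t. phi_{t-1}(x_i)
    tree = DecisionTreeRegressor(max_depth=depth)
    tree.fit(X, G)                      # least-squares fit, eq. (10)
    return phi_X + alpha * tree.predict(X), tree
```

Because the tree takes the original inputs X as features, it also defines the out-of-sample extension: a test point is shifted by the prediction of every tree in the ensemble.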
But by greedily focusing on inputs with the largest gradients or groups of inputs with the most easily encoded gradients, the gradient boosting process additively constructs the transformation function. At iteration 100, the gradients with respect to most inputs vanish, indicating that a local minimum of L(φ) is almost reached; the inputs from the two classes are separated by a large margin.

²Here, we set the step-size, a common hyper-parameter across all variations of LMNN, to α = 0.01.

Dimensionality reduction. Like linear LMNN and χ2-LMNN, it is possible to learn a non-linear transformation to a lower-dimensional space, φ(x) : Rd → Rr, r ≤ d. Initialization is made with the rectangular matrix output of the dimensionality-reduced LMNN transformation, φ0(x) = Lx with L ∈ Rr×d. Training proceeds by learning trees with r- rather than d-dimensional outputs.

5 Related Work
There have been previous attempts to generalize learning linear distances to nonlinear metrics. The nonlinear mapping φ(x) of eq. (7) can be implemented with kernels [5, 12, 18, 26]. These extensions have the advantage of maintaining computational tractability as convex optimization problems. However, their utility is inherently limited by the sizes of kernel matrices. Weinberger et al. [30] propose M²-LMNN, a locally linear extension to LMNN. They partition the space into multiple regions and jointly learn a separate metric for each region; however, these local metrics do not give rise to a global metric, and distances between inputs within different regions are not well-defined.
Neural network-based approaches offer the flexibility of learning arbitrarily complex nonlinear mappings [6]. 
However, they often demand high computational expense, not only in parameter fitting but also in model selection and hyper-parameter tuning. Of particular relevance to our GB-LMNN work is the use of boosting ensembles to learn distances between bit-vectors [1, 23]. Note that their goal is to preserve distances computed by locality sensitive hashing to enable fast search and retrieval. Ours is very different: we alter the distances discriminatively to minimize classification error.
Our work on χ2-LMNN echoes the recent interest in learning the earth-mover's distance (EMD), which is also frequently used in measuring similarities between histogram-type data [9]. Despite its name, EMD is not necessarily a metric [20]. Investigating the link between our work and those new advances is a subject for future work.

6 Experimental Results
We evaluate our non-linear metric learning algorithms against several competitive methods. The effectiveness of learned metrics is assessed by kNN classification error. Our open-source implementations are available for download at http://www.cse.wustl.edu/~kilian/code/code.html.
GB-LMNN. We compare the non-linear global metric learned by GB-LMNN to three linear metrics: the Euclidean metric and metrics learned by LMNN [31] and Information-Theoretic Metric Learning (ITML) [10]. Both optimize similar discriminative loss functions. We also compare to the metrics learned by Multi-Metric LMNN (M²-LMNN) [30]. M²-LMNN learns |C| linear metrics, one for each of the input labels.
We evaluate these methods and our GB-LMNN on several medium-sized data sets: ISOLET, USPS and Letters from the UCI repository. ISOLET and USPS have predefined test sets; otherwise results are averaged over 5 train/test splits (80%/20%). 
A hold-out set of 25% of the training set³ is used to assign hyper-parameters and to determine feature pre-processing (i.e., feature-wise normalization). We set k = 3 for kNN classification, following [31]. Table 1 reports the means and standard errors of each approach (standard error is omitted for data with pre-defined test sets), with numbers in bold font indicating the best results up to one standard error.
On all three datasets, GB-LMNN outperforms methods of learning linear metrics. This shows the benefit of learning nonlinear metrics. On Letters, GB-LMNN outperforms the second-best method M²-LMNN by significant margins. On the other two, GB-LMNN is as good as M²-LMNN.
We also apply GB-LMNN to four datasets with histogram data, setting the stage for an interesting comparison to χ2-LMNN below. The results are displayed on the right side of the table. These datasets are popularly used in computer vision for object recognition [22]. Data instances are 800-bin histograms of visual codebook entries. There are ten categories common to the four datasets and we use them for multiway classification with kNN.
Neither method evaluated so far is specifically adapted to histogram features. Especially linear models, such as LMNN and ITML, are expected to fumble over the intricate similarities that such data types may encode.

³In the case of ISOLET, which consists of audio signals of letters spoken by different individuals, the hold-out set consisted of one speaker.

Table 1: kNN classification error (in %, ± standard error where applicable), for general methods (top section) and histogram methods (bottom section). Best results up to one standard error in bold. Best results among general methods for simplex data in red italics.

As shown in the table, GB-LMNN consistently outperforms the linear methods and M²-LMNN.
χ2-LMNN. In Table 1, we compare χ2-LMNN to other methods for computing distances on histogram features: the χ2-distance without transformation (equivalent to our parameterized distance χ2_L with the transformation L being the identity matrix), and the Quadratic-Chi-Squared (QCS) and Quadratic-Chi-Normalized (QCN) distances defined in [20]. For QCS and QCN, we use histogram intersection as the ground distance. Unlike our approach, none of these is discriminatively learned from data. χ2-LMNN outperforms all other methods significantly.
It is also instructive to compare the results to the performance of non-histogram-specific methods. We observe that LMNN performs better than the standard χ2-distance on Amazon and Caltech. This seems to suggest that for those two datasets, linear metrics may be adequate and GB-LMNN's non-linear mapping might not be able to provide extra expressiveness and benefits. This is confirmed in Table 1: GB-LMNN improves performance less significantly for Amazon and Caltech than for the other two datasets, DSLR and Webcam. For the latter two, on the contrary, LMNN performs worse than the χ2-distance. In such cases, GB-LMNN's nonlinear mapping seems more beneficial. It provides a significant performance boost, and matches the performance of the χ2-distance (up to one standard error). Nonetheless, despite learning a nonlinear mapping, GB-LMNN still underperforms χ2-LMNN. In other words, it is possible that no matter how flexible a nonlinear mapping could be, it is still best to use metrics that respect the semantic features of the data.
Dimensionality reduction. GB-LMNN and χ2-LMNN are both capable of performing dimensionality reduction. 
We compare these with three dimensionality reduction methods (PCA, LMNN, and M²-LMNN) on the histogram datasets and the larger UCI datasets. Each dataset is reduced to an output dimensionality of r = 10, 20, 40, 80 features. As we can see from the results in Table 2, it is fair to say that GB-LMNN performs comparably with LMNN and M²-LMNN, whereas χ2-LMNN obtains at times phenomenally low kNN error rates on the histogram data sets (e.g., Webcam). This suggests that dimensionality reduction of histogram data can be highly effective if the data properties are carefully incorporated in the process. We do not apply dimensionality reduction to Letters as it already lies in a low-dimensional space (d = 16).
Sensitivity to parameters. One of the most compelling aspects of our methods is that each introduces only a single new hyper-parameter to the LMNN framework. During our experiments, ℓ was selected by cross-validation and p was fixed to p = 6. We found very little sensitivity in GB-LMNN to regression tree depth, while the large margin size was an important but well-behaved parameter for χ2-LMNN. Additional graphs are included in the supplementary material.

7 Conclusion and Future Work

In this paper we introduced two non-linear extensions to LMNN, χ2-LMNN and GB-LMNN. Although based on fundamentally different approaches, both algorithms lead to significant improvements over the original (linear) LMNN metrics and match or out-perform existing non-linear algorithms. The non-convexity of our proposed methods does not seem to impact their performance,

Table 2: kNN classification error (in %, ± standard error where applicable) with dimensionality reduction to output dimensionality r. 
Best results up to one standard error in bold.

indicating that convex algorithms (LMNN) as initialization for more expressive non-convex methods can be a winning combination.
The strong results obtained with χ2-LMNN show that the incorporation of data-specific constraints can be highly beneficial, indicating that there is great potential for future research in specialized metric learning algorithms for specific data types. Further, the ability of χ2-LMNN to reduce the dimensionality of data sampled from probability simplexes is highly encouraging and might lead to interesting applications in computer vision and other fields, where histogram data is ubiquitous. Here, it might be possible to reduce the running time of time-critical algorithms drastically by shrinking the data dimensionality, while strictly maintaining its histogram properties.
The high consistency with which GB-LMNN obtains state-of-the-art results across diverse data sets is highly encouraging. In fact, the use of ensembles of CART trees [4] not only inherits all positive aspects of gradient boosting (robustness, speed and insensitivity to hyper-parameters) but is also a natural match for metric learning. Each tree splits the space into different regions and, in contrast to prior work [30], this splitting is fully automated, results in new (discriminatively learned) Euclidean representations of the data and gives rise to well-defined pseudo-metrics.

8 Acknowledgements
KQW, DK and ST would like to thank NIH for their support through grant U01 1U01NS073457-01 and NSF for grants 1149882 and 1137211. FS would like to thank DARPA for its support with grant D11AP00278 and ONR for grant N00014-12-1-0066. GL was supported in part by the NSF under Grants CCF-0830535 and IIS-1054960, and by the Sloan Foundation. DK would also like to thank the McDonnell International Scholars Academy for their support.

References

[1] B. Babenko, S. 
Branson, and S. Belongie. Similarity metrics for categorization: from monolithic to category specific. In ICCV '09, pages 293–300. IEEE, 2009.

[2] A. Beck and M. Teboulle. Mirror descent and nonlinear projected subgradient methods for convex optimization. Operations Research Letters, 31(3):167–175, 2003.

[3] D.M. Blei, A.Y. Ng, and M.I. Jordan. Latent Dirichlet allocation. The Journal of Machine Learning Research, 3:993–1022, 2003.

[4] L. Breiman. Classification and regression trees. Chapman & Hall/CRC, 1984.

[5] R. Chatpatanasiri, T. Korsrilabutr, P. Tangchanachaianan, and B. Kijsirikul. A new kernelization framework for Mahalanobis distance learning algorithms. Neurocomputing, 73(10-12):1570–1579, 2010.

[6] S. Chopra, R. Hadsell, and Y. LeCun. Learning a similarity metric discriminatively, with application to face verification. In CVPR '05, pages 539–546. IEEE, 2005.

[7] T.
Cover and P. Hart. Nearest neighbor pattern classification. IEEE Transactions on Information Theory, 13(1):21–27, 1967.

[8] O.G. Cula and K.J. Dana. 3D texture recognition using bidirectional feature histograms. International Journal of Computer Vision, 59(1):33–60, 2004.

[9] M. Cuturi and D. Avis. Ground metric learning. arXiv preprint, arXiv:1110.2306, 2011.

[10] J.V. Davis, B. Kulis, P. Jain, S. Sra, and I.S. Dhillon. Information-theoretic metric learning. In ICML '07, pages 209–216. ACM, 2007.

[11] J.H. Friedman. Greedy function approximation: a gradient boosting machine. Annals of Statistics, pages 1189–1232, 2001.

[12] C. Galleguillos, B. McFee, S. Belongie, and G. Lanckriet. Multi-class object localization by combining local contextual interactions. In CVPR '10, pages 113–120, 2010.

[13] A. Globerson and S. Roweis. Metric learning by collapsing classes. In NIPS '06, pages 451–458. MIT Press, 2006.

[14] A. Globerson and S. Roweis. Visualizing pairwise similarity via semidefinite programming. In AISTATS '07, pages 139–146, 2007.

[15] J. Goldberger, S. Roweis, G. Hinton, and R. Salakhutdinov. Neighbourhood components analysis. In NIPS '05, pages 513–520. MIT Press, 2005.

[16] J. Hafner, H.S. Sawhney, W. Equitz, M. Flickner, and W. Niblack. Efficient color histogram indexing for quadratic form distance functions. IEEE Transactions on Pattern Analysis and Machine Intelligence, 17(7):729–736, 1995.

[17] M. Hoffman, D. Blei, and P. Cook. Easy as CBA: A simple probabilistic model for tagging music. In ISMIR '09, pages 369–374, 2009.

[18] P. Jain, B. Kulis, J.V. Davis, and I.S. Dhillon. Metric and kernel learning using a linear transformation. Journal of Machine Learning Research, 13:519–547, 2012.

[19] A.M. Mood, F.A. Graybill, and D.C. Boes. Introduction to the theory of statistics.
McGraw-Hill International Book Company, 1963.

[20] O. Pele and M. Werman. The quadratic-chi histogram distance family. In ECCV '10, pages 749–762, 2010.

[21] Y. Rubner, C. Tomasi, and L.J. Guibas. The earth mover's distance as a metric for image retrieval. International Journal of Computer Vision, 40(2):99–121, 2000.

[22] K. Saenko, B. Kulis, M. Fritz, and T. Darrell. Adapting visual category models to new domains. In Computer Vision–ECCV 2010, pages 213–226, 2010.

[23] G. Shakhnarovich. Learning task-specific similarity. PhD thesis, MIT, 2005.

[24] N. Shental, T. Hertz, D. Weinshall, and M. Pavel. Adjustment learning and relevant component analysis. In ECCV '02, volume 4, pages 776–792. Springer-Verlag, 2002.

[25] M. Stricker and M. Orengo. Similarity of color images. In Storage and Retrieval for Image and Video Databases, volume 2420, pages 381–392, 1995.

[26] L. Torresani and K. Lee. Large margin component analysis. In NIPS '07, pages 1385–1392, 2007.

[27] T. Tuytelaars and K. Mikolajczyk. Local invariant feature detectors: a survey. Foundations and Trends® in Computer Graphics and Vision, 3(3):177–280, 2008.

[28] S. Tyree, K.Q. Weinberger, K. Agrawal, and J. Paykin. Parallel boosted regression trees for web search ranking. In WWW '11, pages 387–396. ACM, 2011.

[29] M. Varma and A. Zisserman. A statistical approach to material classification using image patch exemplars. IEEE Transactions on Pattern Analysis and Machine Intelligence, 31(11):2032–2047, 2009.

[30] K.Q. Weinberger and L.K. Saul. Fast solvers and efficient implementations for distance metric learning. In ICML '08, pages 1160–1167. ACM, 2008.

[31] K.Q. Weinberger and L.K. Saul. Distance metric learning for large margin nearest neighbor classification. The Journal of Machine Learning Research, 10:207–244, 2009.

[32] E. P. Xing, A. Y.
Ng, M. I. Jordan, and S. Russell. Distance metric learning, with application to clustering with side-information. In NIPS '02, pages 505–512. MIT Press, 2002.

[33] P.N. Yianilos. Data structures and algorithms for nearest neighbor search in general metric spaces. In ACM-SIAM Symposium on Discrete Algorithms '93, pages 311–321, 1993.