{"title": "Discriminative Metric Learning by Neighborhood Gerrymandering", "book": "Advances in Neural Information Processing Systems", "page_first": 3392, "page_last": 3400, "abstract": "We formulate the problem of metric learning for k nearest neighbor classification as a large margin structured prediction problem, with a latent variable representing the choice of neighbors and the task loss directly corresponding to classification error. We describe an efficient algorithm for exact loss augmented inference, and a fast gradient descent algorithm for learning in this model. The objective drives the metric to establish neighborhood boundaries that benefit the true class labels for the training points. Our approach, reminiscent of gerrymandering (redrawing of political boundaries to provide advantage to certain parties), is more direct in its handling of optimizing classification accuracy than those previously proposed. In experiments on a variety of data sets our method is shown to achieve excellent results compared to current state of the art in metric learning.", "full_text": "Discriminative Metric Learning by\nNeighborhood Gerrymandering\n\nShubhendu Trivedi, David McAllester, Gregory Shakhnarovich\n\nToyota Technological Institute\n\n{shubhendu,mcallester,greg}@ttic.edu\n\nChicago, IL - 60637\n\nAbstract\n\nWe formulate the problem of metric learning for k nearest neighbor classi\ufb01cation\nas a large margin structured prediction problem, with a latent variable representing\nthe choice of neighbors and the task loss directly corresponding to classi\ufb01cation\nerror. We describe an ef\ufb01cient algorithm for exact loss augmented inference, and\na fast gradient descent algorithm for learning in this model. The objective drives\nthe metric to establish neighborhood boundaries that bene\ufb01t the true class labels\nfor the training points. 
Our approach, reminiscent of gerrymandering (redrawing of political boundaries to provide advantage to certain parties), is more direct in its handling of optimizing classification accuracy than those previously proposed. In experiments on a variety of data sets our method is shown to achieve excellent results compared to the current state of the art in metric learning.

1 Introduction

Nearest neighbor classifiers are among the oldest and the most widely used tools in machine learning. Although nearest neighbor rules are often successful, their performance tends to be limited by two factors: the computational cost of searching for nearest neighbors and the choice of the metric (distance measure) defining "nearest". The cost of searching for neighbors can be reduced with efficient indexing, e.g., [1, 4, 2], or by learning compact representations, e.g., [13, 19, 16, 9]; we do not address that issue here. Instead, we focus on the choice of the metric. The metric is often taken to be the Euclidean, Manhattan or χ2 distance. However, it is well known that in many cases these choices are suboptimal in that they do not exploit statistical regularities that can be leveraged from labeled data.

This paper focuses on supervised metric learning. In particular, we present a method of learning a metric so as to optimize the accuracy of the resulting nearest neighbor classifier. Existing works on metric learning formulate learning as an optimization task with various constraints driven by considerations of computational feasibility and reasonable, but often vaguely justified, principles [23, 8, 7, 22, 21, 14, 11, 18]. A fundamental intuition is shared by most of the work in this area: an ideal distance for prediction is distance in the label space. Of course, that cannot be measured, since prediction of a test example's label is what we want to use the similarities to begin with. 
Instead, one could learn a similarity measure with the goal that it serve as a good proxy for label similarity. Since the performance of kNN prediction is often the real motivation for similarity learning, the constraints typically involve "pulling" good neighbors (from the correct class for a given point) closer while "pushing" the bad neighbors farther away. The exact formulation of "good" and "bad" varies, but is defined as a combination of proximity and agreement between labels. We give a formulation that, as far as we are aware, attempts to optimize kNN accuracy more directly than previous work. We discuss existing methods in more detail in Section 2, where we also place our work in context.

In the kNN prediction problem, given a point and a chosen metric, there is an implicit hidden variable: the choice of k "neighbors". The inference of the predicted label from these k examples is trivial, by simple majority vote among the associated labels. Given a query point, there can exist a very large number of choices of k points that correspond to zero loss: any set of k points in which the correct class holds a majority will do. We would like a metric to "prefer" one of these "good" example sets over any set of k neighbors which would vote for a wrong class. Note that to win, it is not necessary for the right class to account for all the k neighbors – it just needs to get more votes than any other class. As the number of classes and the value of k grow, so does the space of available good (and bad) example sets.

These considerations motivate our approach to metric learning. 
It is akin to the common, albeit\nnegatively viewed, practice of gerrymandering in drawing up borders of election districts so as to\nprovide advantages to desired political parties, e.g., by concentrating voters from that party or by\nspreading voters of opposing parties. In our case, the \u201cdistricts\u201d are the cells in the Voronoi diagram\nde\ufb01ned by the Mahalanobis metric, the \u201cparties\u201d are the class labels voted for by the neighbors\nfalling in each cell, and the \u201cdesired winner\u201d is the true label of the training points associated with\nthe cell. This intuition is why we refer to our method as neighborhood gerrymandering in the title.\nTechnically, we write kNN prediction as an inference problem with a structured latent variable being\nthe choice of k neighbors. Thus learning involves minimizing a sum of a structural latent hinge loss\nand a regularizer [3]. Computing structural latent hinge loss involves loss-adjusted inference \u2014 one\nmust compute loss-adjusted values of both the output value (the label) and the latent items (the set\nof nearest neighbors). The loss augmented inference corresponds to a choice of worst k neighbors\nin the sense that while having a high average similarity they also correspond to a high loss (\u201cworst\noffending set of k neighbors\u201d). Given the inherent combinatorial considerations, the key to such a\nmodel is ef\ufb01cient inference and loss augmented inference. We give an ef\ufb01cient algorithm for exact\ninference. We also design an optimization algorithm based on stochastic gradient descent on the\nsurrogate loss. 
Our approach achieves kNN accuracy higher than the state of the art for most of the data sets we tested on, including some methods specialized for the relevant input domains. Although the experiments reported here are restricted to learning a Mahalanobis distance in an explicit feature space, the formulation allows for nonlinear similarity measures, such as those defined by nonlinear kernels, provided computing the gradients of similarities with respect to the metric parameters is feasible. Our formulation can also naturally handle a user-defined loss matrix on labels.

2 Related Work and Discussion

There is a large body of work on similarity learning done with the stated goal of improving kNN performance. In much of the recent work, the objective can be written as a combination of some sort of regularizer on the parameters of the similarity, with a loss reflecting the desired "purity" of the neighbors under the learned similarity. Optimization then balances violation of these constraints with regularization. The main contrast between this body of work and our approach is in the form of the loss.

A well known family of methods of this type is based on the Large Margin Nearest Neighbor (LMNN) algorithm [22]. In LMNN, the constraints for each training point involve a set of predefined "target neighbors" from the correct class, and "impostors" from other classes. The set of target neighbors plays a similar role to our "best correct set of k neighbors" (h∗ in Section 4). However, the set of target neighbors is chosen at the outset based on the Euclidean distance (in the absence of a priori knowledge), and is not dynamically updated as the metric is optimized. There is no reason to believe that the original choice of neighbors based on the Euclidean distance remains optimal as the metric is updated. 
In contrast, h∗ represents the closest neighbors that have zero loss, but they are not necessarily of the same class; in LMNN the target neighbors are forced to be of the same class, which does not fully leverage the power of the kNN objective. The role of impostors is somewhat similar to the role of the "worst offending set of k neighbors" in our method (ĥ in Section 4). See Figure 1 for an illustration. Extensions of LMNN [21, 11] allow for non-linear metrics, but retain the same general flavor of constraints. There is another extension of LMNN that is more aligned with our work [20], in that it lifts the constraint of having a static set of neighbors chosen based on the Euclidean distance and instead learns the neighborhood.

Figure 1: Illustration of the objectives of LMNN (left) and our structured approach (right) for k = 3. The point x of class blue is the query point. In LMNN, the target points are the nearest neighbors of the same class, which are points a, b and c (the circle centered at x has radius equal to the farthest of the target points, i.e., point b). The LMNN objective will push out all the points of the wrong class that lie inside this circle (points e, f, h, i, and j), while pulling in the target points to enforce the margin. For our structured approach (right), the circle around x has radius equal to the distance of the farthest of the three nearest neighbors irrespective of class. Our objective only needs to ensure zero loss, which is achieved by pulling in point a of the correct class (blue) while pushing out the point having the incorrect class (point f). Note that two points of the incorrect class lie inside the circle (e and f), both being of class red. However, f is pushed out and not e since it is farther from x. 
Also see Section 2.

The above family of methods may be contrasted with methods of the flavor proposed in [23]. Here "good" neighbors are defined as all similarly labeled points, and each class is mapped into a ball of a fixed radius, but no separation is enforced between the classes. The kNN objective does not require that similarly labeled points be clustered together, and consequently such methods try to optimize a much harder objective for learning the metric.

In Neighborhood Component Analysis (NCA) [8], the piecewise-constant error of the kNN rule is replaced by a soft version. This leads to a non-convex objective that is optimized via gradient descent. This is similar to our method in the sense that it also attempts to directly optimize the choice of the nearest neighbors, at the price of losing convexity. The issue of non-convexity was partly remedied in [7] by optimization of a similar stochastic rule while attempting to collapse each class to one point. While this makes the optimization convex, collapsing classes to distinct points is unrealistic in practice. Another recent extension of NCA [18] generalizes the stochastic classification idea to kNN classification with k > 1.

In Metric Learning to Rank (MLR) [14], the constraints involve all the points: the goal is to push all the correct matches in front of all the incorrect ones. This again is not the same as requiring correct classification. In addition to global optimization constraints on the rankings (such as mean average precision for the target class), the authors allow localized evaluation criteria such as Precision at k, which can be used as a surrogate for classification accuracy in binary classification, but is a poor surrogate for multi-way classification. Direct use of kNN accuracy in the optimization objective is briefly mentioned in [14], but not pursued due to the difficulty in loss-augmented inference. 
This is because the interleaving technique of [10], used to perform inference with other losses based inherently on contingency tables, fails for the multiclass case (since the number of data interleavings could be exponential). We take a very different approach to loss augmented inference, using targeted inference and the classification loss matrix, and can easily extend it to an arbitrary number of classes. A similar approach is taken in [15], where the constraints are derived from triplets of points formed by a sample, a correct and an incorrect neighbor. Again, these are assumed to be set statically as an input to the algorithm, and the optimization focuses on the distance ordering (ranking) rather than accuracy of classification.

3 Problem setup

We are given N training examples X = {x1, . . . , xN}, represented by a "native" feature map, xi ∈ Rd, and their class labels y = [y1, . . . , yN]T, with yi ∈ [R], where [R] stands for the set {1, . . . , R}. We are also given the loss matrix Λ, with Λ(r, r′) being the loss incurred by predicting r′ when the correct class is r. We assume Λ(r, r) = 0 and ∀(r, r′), Λ(r, r′) ≥ 0. We are interested in Mahalanobis metrics

    DW(x, xi) = (x − xi)T W (x − xi),    (1)

parameterized by positive semidefinite d × d matrices W. Let h ⊂ X be a set of examples in X. For a given W we define the distance score of h w.r.t. a point x as

    SW(x, h) = − Σ_{xj ∈ h} DW(x, xj).    (2)

Hence, the set of k nearest neighbors of x in X is

    hW(x) = argmax_{|h|=k} SW(x, h).    (3)

For the remainder we will assume that k is known and fixed. From any set h of k examples from X, we can predict the label of x by (simple) majority vote:

    ŷ(h) = majority{yj : xj ∈ h},

with ties resolved by a heuristic, e.g., according to the 1NN vote. In particular, the kNN classifier predicts ŷ(hW(x)). Due to this deterministic dependence between ŷ and h, we can define the classification loss incurred by a voting classifier when using the set h as

    Δ(y, h) = Λ(y, ŷ(h)).    (4)

4 Learning and inference

One might want to learn W to minimize the training loss Σ_i Δ(yi, hW(xi)). However, this fails due to the intractable nature of the classification loss Δ. We will follow the usual remedy: define a tractable surrogate loss. Here we note that in our formulation, the output of the prediction is a structured object hW, for which we eventually report the deterministically computed ŷ. Structured prediction problems usually involve a loss which is a generalization of the hinge loss; intuitively, it penalizes the gap between the score of the correct structured output and the score of the "worst offending" incorrect output (the one with the highest score and highest Δ). However, in our case there is no single correct output h, since in general many choices of h would lead to a correct ŷ and zero classification loss: any h in which the majority votes for the right class. Ideally, we want SW to prefer at least one of these correct hs over all incorrect hs. This intuition leads to the following surrogate loss definition:

    L(x, y, W) = max_h [SW(x, h) + Δ(y, h)]    (5)
                 − max_{h: Δ(y,h)=0} SW(x, h).    (6)

This is a bit different in spirit from the notion of margin sometimes encountered in ranking problems, where we want all the correct answers to be placed ahead of all the wrong ones. 
Here, we only care to put one correct answer on top; it does not matter which one, hence the max in (6).

5 Structured Formulation

Although we have motivated this choice of L by intuitive arguments, it turns out that our problem is an instance of a familiar type of problem: latent structured prediction [24], and thus our choice of loss can be shown to form an upper bound on the empirical task loss Δ. First, we note that the score SW can be written as

    SW(x, h) = ⟨ W, − Σ_{xj ∈ h} (x − xj)(x − xj)T ⟩,    (7)

where ⟨·, ·⟩ stands for the Frobenius inner product. Defining the feature map

    Ψ(x, h) ≜ − Σ_{xj ∈ h} (x − xj)(x − xj)T,    (8)

we get a more compact expression ⟨W, Ψ(x, h)⟩ for (7). Furthermore, we can encode the deterministic dependence between y and h by a "compatibility" function A(y, h) = 0 if y = ŷ(h), and A(y, h) = −∞ otherwise. This allows us to write the joint inference of y and (hidden) h performed by the kNN classifier as

    ŷW(x), ĥW(x) = argmax_{h,y} [A(y, h) + ⟨W, Ψ(x, h)⟩].    (9)

This is the familiar form of inference in a latent structured model [24, 6] with latent variable h. So, despite our model's somewhat unusual property that the latent h completely determines the inferred y, we can show the equivalence to the "normal" latent structured prediction.

5.1 Learning by gradient descent

We define the objective in learning W as

    min_W ‖W‖F^2 + C Σ_i L(xi, yi, W),    (10)

where ‖·‖F stands for the Frobenius norm of a matrix.1 The regularizer is convex, but as in other latent structured models, the loss L is non-convex due to the subtraction of the max in (6). 
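To make these definitions concrete, the score (7), the feature map (8), and the surrogate loss (5)-(6) can be computed by brute-force enumeration over candidate neighbor sets. This sketch is ours, not part of the paper: the function names are invented, the leave-one-out constraint is omitted, and enumerating k-subsets is viable only for tiny N (the efficient inference procedures of Section 5.1 replace it).

```python
import numpy as np
from itertools import combinations
from collections import Counter

def psi(x, X, h):
    # Feature map of eq. (8): Psi(x, h) = -sum_{j in h} (x - x_j)(x - x_j)^T
    return -sum(np.outer(x - X[j], x - X[j]) for j in h)

def score(W, x, X, h):
    # Eq. (7): S_W(x, h) = <W, Psi(x, h)>, a Frobenius inner product
    return float(np.sum(W * psi(x, X, h)))

def predicted_label(y, h):
    # Majority vote among the labels of the chosen neighbor set h
    return Counter(y[j] for j in h).most_common(1)[0][0]

def surrogate_loss(W, x, label, X, y, k, Lam):
    # Eqs. (5)-(6), by exhaustive enumeration of all k-subsets of the data
    best_aug = -np.inf      # max_h [S_W(x, h) + Delta(y, h)]
    best_correct = -np.inf  # max_{h: Delta(y, h) = 0} S_W(x, h)
    for h in combinations(range(len(X)), k):
        s = score(W, x, X, h)
        delta = Lam[label, predicted_label(y, h)]  # task loss of eq. (4)
        best_aug = max(best_aug, s + delta)
        if delta == 0:
            best_correct = max(best_correct, s)
    return best_aug - best_correct
```

In the actual learner, the two max terms are computed by targeted and loss-augmented inference rather than by enumeration.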
To optimize (10), one can use the convex-concave procedure (CCCP) [25], which has been proposed specifically for latent SVM learning [24]. However, CCCP tends to be slow on large problems. Furthermore, its use is complicated here due to the requirement that W be positive semidefinite (PSD). This means that the inner loop of CCCP includes solving a semidefinite program, making the algorithm slower still. Instead, we opt for a simpler choice, often faster in practice: stochastic gradient descent (SGD), described in Algorithm 1.

Algorithm 1: Stochastic gradient descent
Input: labeled data set (X, Y), regularization parameter C, learning rate η(·)
initialize W(0) = 0
for t = 0, . . ., while not converged do
    sample i ∼ [N]
    ĥi = argmax_h [SW(t)(xi, h) + Δ(yi, h)]
    h∗i = argmax_{h: Δ(yi,h)=0} SW(t)(xi, h)
    δW = [∂SW(xi, ĥi)/∂W − ∂SW(xi, h∗i)/∂W] evaluated at W(t)
    W(t+1) = (1 − η(t)) W(t) − C δW
    project W(t+1) to PSD cone

The SGD algorithm requires solving two inference problems (ĥ and h∗), and computing the gradient of SW, which we address below.2

5.1.1 Targeted inference of h∗

Here we are concerned with finding the highest-scoring h constrained to be compatible with a given target class y. We give an O(N log N) algorithm in Algorithm 2. 
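A minimal Python sketch of this targeted inference follows; the naming and array conventions are ours, not the paper's, and for brevity it ignores the leave-one-out constraint and assumes at least n∗ training points of the target class exist.

```python
import numpy as np
from math import ceil

def targeted_inference(x, W, X, y, target, k, R, tau=1):
    # Highest-scoring set h of k neighbors whose majority vote can be `target`.
    # tau = 1 forbids ties with the target class; tau = 0 allows them.
    d = np.einsum('ij,jk,ik->i', X - x, W, X - x)  # D_W(x, x_i) for all i
    n_star = ceil((k + tau * (R - 1)) / R)         # min. votes needed from target class
    order = np.argsort(d)                          # candidates, nearest first
    # Phase 1: take the n_star nearest points of the target class.
    h = [i for i in order if y[i] == target][:n_star]
    counts = {target: n_star}
    # Phase 2: greedily fill the remaining slots without letting any
    # non-target class reach (or, if tau = 1, tie) the target's vote count.
    for i in order:
        if len(h) == k:
            break
        if i in h:
            continue
        c = counts.get(y[i], 0)
        if y[i] == target or c < counts[target] - tau:
            h.append(i)
            counts[y[i]] = c + 1
    return h
```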
Proof of its correctness and complexity analysis is in the Appendix.

1 We discuss other choices of regularizer in Section 7.
2 We note that both inference problems over h are done in leave-one-out settings, i.e., we impose an additional constraint i ∉ h under the argmax, not listed in the algorithm explicitly.

Algorithm 2: Targeted inference
Input: x, W, target class y, τ ≜ ⟦ties forbidden⟧
Output: argmax_{h: ŷ(h)=y} SW(x, h)
Let n∗ = ⌈(k + τ(R − 1)) / R⌉   // min. required number of neighbors from y
h := ∅
for j = 1, . . . , n∗ do
    h := h ∪ argmin_{xi: yi=y, i∉h} DW(x, xi)
for l = n∗ + 1, . . . , k do
    define #(r) ≜ |{i : xi ∈ h, yi = r}|   // count selected neighbors from class r
    h := h ∪ argmin_{xi: yi=y, or #(yi)<#(y)−τ, i∉h} DW(x, xi)
return h

The intuition behind Algorithm 2 is as follows. For a given combination of R (number of classes) and k (number of neighbors), the minimum number of neighbors from the target class y required to allow (although not guarantee) zero loss is n∗ (see Proposition 1 in the Appendix). The algorithm first includes the n∗ highest scoring neighbors from the target class. The remaining k − n∗ neighbors are picked by a greedy procedure that selects the highest scoring neighbors (which might or might not be from the target class) while making sure that no non-target class ends up in a majority. When using Alg. 2 to find an element in H∗, we forbid ties, i.e., set τ = 1.

5.1.2 Loss augmented inference of ĥi

Calculating the max term in (5) is known as loss augmented inference. 
We note that

    max_{h′} [⟨W, Ψ(x, h′)⟩ + Δ(y, h′)] = max_{y′} [ max_{h′ ∈ H∗(y′)} ⟨W, Ψ(x, h′)⟩ + Λ(y, y′) ],    (11)

where the inner max equals ⟨W, Ψ(x, h∗(x, y′))⟩. This immediately leads to Algorithm 3, relying on Algorithm 2. The intuition: perform targeted inference for each class (as if that were the target class), and then choose the set of neighbors for the class for which the loss-augmented score is the highest. In this case, in each call to Alg. 2 we set τ = 0, i.e., we allow ties, to make sure the argmax is over all possible h's.

Algorithm 3: Loss augmented inference
Input: x, W, target class y
Output: argmax_h [SW(x, h) + Δ(y, h)]
for r ∈ {1, . . . , R} do
    h(r) := h∗(x, W, r, 0)   // using Alg. 2
    Let Value(r) := SW(x, h(r)) + Λ(y, r)
Let r∗ = argmax_r Value(r)
return h(r∗)

5.1.3 Gradient update

Finally, we need to compute the gradient of the distance score. From (7), we have

    ∂SW(x, h)/∂W = Ψ(x, h) = − Σ_{xj ∈ h} (x − xj)(x − xj)T.    (12)

Thus, the update in Alg. 1 has a simple interpretation, illustrated in Fig. 1 on the right. For every xi ∈ h∗ \ ĥ, it "pulls" xi closer to x. For every xi ∈ ĥ \ h∗, it "pushes" it farther from x; this push and pull refer to the increase/decrease of the Mahalanobis distance under the updated W. Any other xi, including any xi ∈ h∗ ∩ ĥ, has no influence on the update. This differentiates our approach from LMNN, MLR, etc., and is illustrated in Figure 1: h∗ corresponds to points a, c and e, whereas ĥ corresponds to points c, e and f. 
Thus point a is pulled while point f is pushed.

Since the update does not necessarily preserve W as a PSD matrix, we enforce it by projecting W onto the PSD cone, by zeroing negative eigenvalues. Note that since we update (or "downdate") W each time by a matrix of rank at most 2k, the eigendecomposition can be accomplished more efficiently than the naïve O(d^3) approach, e.g., as in [17].

Using first order methods, and in particular gradient methods for optimization of non-convex functions, has been common across machine learning, for instance in training deep neural networks. Despite the lack (to our knowledge) of satisfactory guarantees of convergence, these methods are often successful in practice; we will show in the next section that this is true here as well. One might wonder if this method is valid for our objective, which is not differentiable; we discuss this briefly before describing experiments. A given x imposes a Voronoi-type partition of the space of W into a finite number of cells; each cell is associated with a particular combination of ĥ(x) and h∗(x) under the values of W in that cell. The score SW is differentiable (actually linear) on the interior of each cell, but may be non-differentiable (though continuous) on the boundaries. Since the boundaries between a finite number of cells form a set of measure zero, we see that the score is differentiable almost everywhere.

6 Experiments

We compare the error of kNN classifiers using metrics learned with our approach to that with other learned metrics. For this evaluation we replicate the protocol in [11], using the seven data sets in Table 1. For all data sets, we report the error of the kNN classifier for a range of values of k; for each k, we test the metric learned for that k. Competition to our method includes the Euclidean distance, LMNN [22], NCA [8], ITML [5], MLR [14] and GB-LMNN [11]. 
The latter learns non-linear metrics rather than Mahalanobis. For each of the competing methods, we used the code provided by the authors. In each case we tuned the parameters of each method, including ours, in the same cross-validation protocol. We omit a few other methods that were consistently shown in the literature to be dominated by the ones we compare to, such as the χ2 distance, MLCC, and M-LMNN. We also could not include χ2-LMNN since code for it is not available; however, published results for k = 3 [11] indicate that our method would win against χ2-LMNN as well.

Isolet and USPS have a standard training/test partition; for the other five data sets, we report the mean and standard errors of 5-fold cross validation (results for all methods are on the same folds). We experimented with different methods for initializing our method (given the non-convex objective), including the Euclidean distance, all zeros, etc., and found the Euclidean initialization to be always worse. We initialize each fold with either the diagonal matrix learned by ReliefF [12] (which gives a scaled Euclidean distance) or all zeros, depending on whether the scaled Euclidean distance obtained using ReliefF was better than the unscaled Euclidean distance. In each experiment, x are scaled by the mean and standard deviation of the training portion.3 The value of C is tuned on a 75%/25% split of the training portion. Results using different scaling methods are attached in the appendix.

Our SGD algorithm stops when the running average of the surrogate loss over the most recent epoch no longer decreases substantially, or after a maximum number of iterations. 
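A single update of the learner, as we understand it from Algorithm 1 and the gradient (12), can be sketched as follows. The names are ours, we use a 1/(t+1) schedule to avoid division by zero at t = 0, and the PSD projection uses a full eigendecomposition rather than the faster low-rank update of [17].

```python
import numpy as np

def project_psd(W):
    # Project a symmetric matrix onto the PSD cone by zeroing negative eigenvalues.
    vals, vecs = np.linalg.eigh(W)
    return (vecs * np.clip(vals, 0.0, None)) @ vecs.T

def sgd_step(W, x, h_hat, h_star, X, t, C, eta=lambda t: 1.0 / (t + 1)):
    # One update of Algorithm 1: delta_W = Psi(x, h_hat) - Psi(x, h_star), eq. (12),
    # followed by the shrinkage/step and the PSD projection.
    def psi(h):
        return -sum(np.outer(x - X[j], x - X[j]) for j in h)
    delta = psi(h_hat) - psi(h_star)
    W_new = (1.0 - eta(t)) * W - C * delta
    return project_psd(W_new)
```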
We use learning rate η(t) = 1/t. The results show that our method dominates the other competitors, including non-linear metric learning methods, and in some cases achieves results significantly better than those of the competition.

k = 3
Dataset    Isolet       USPS         letters       DSLR          Amazon        Webcam        Caltech
d          170          256          16            800           800           800           800
N          7797         9298         20000         157           958           295           1123
C          26           10           26            10            10            10            10
Euclidean  8.66         6.18         4.79 ±0.2     75.20 ±3.0    60.13 ±1.9    56.27 ±2.5    80.5 ±4.6
LMNN       4.43         5.48         3.26 ±0.1     24.17 ±4.5    26.72 ±2.1    15.59 ±2.2    46.93 ±3.9
GB-LMNN    4.13         5.48         2.92 ±0.1     21.65 ±4.8    26.72 ±2.1    13.56 ±1.9    46.11 ±3.9
MLR        8.27         6.61         14.25 ±5.8    36.93 ±2.6    24.01 ±1.8    23.05 ±2.8    46.76 ±3.4
ITML       7.89         5.78         4.97 ±0.2     19.07 ±4.9    33.83 ±3.3    13.22 ±4.6    48.78 ±4.5
1-NCA      6.16         5.23         4.71 ±2.2     31.90 ±4.9    30.27 ±1.3    16.27 ±1.5    46.66 ±1.8
k-NCA      5.18         4.45         3.13 ±0.4     21.13 ±4.3    24.31 ±2.3    13.19 ±1.3    44.56 ±1.7
ours       4.87         5.18         2.32 ±0.1     17.18 ±4.7    21.34 ±2.5    10.85 ±3.1    43.37 ±2.4

k = 7
Euclidean  7.44         6.08         5.40 ±0.3     76.45 ±6.2    62.21 ±2.2    57.29 ±6.3    80.76 ±3.7
LMNN       3.78         4.9          3.58 ±0.2     25.44 ±4.3    29.23 ±2.0    14.58 ±2.2    46.75 ±2.9
GB-LMNN    3.54         4.9          2.66 ±0.1     25.44 ±4.3    29.12 ±2.1    12.45 ±4.6    46.17 ±2.8
MLR        8.27         5.64         19.92 ±6.4    33.73 ±5.5    23.17 ±2.1    18.98 ±2.9    46.85 ±4.1
ITML       7.57         5.68         5.37 ±0.5     22.32 ±2.5    31.42 ±1.9    10.85 ±3.1    51.74 ±2.8
1-NCA      6.09         5.83         5.28 ±2.5     36.94 ±2.6    29.22 ±2.7    22.03 ±6.5    45.50 ±3.0
k-NCA      5.1          4.13         3.15 ±0.2     22.78 ±3.1    23.11 ±1.9    13.04 ±2.7    43.92 ±3.1
ours       4.61         4.9          2.54 ±0.1     21.61 ±5.9    22.44 ±1.3    11.19 ±3.3    41.61 ±2.6

k = 11
Euclidean  8.02         6.88         5.89 ±0.4     73.87 ±2.8    64.61 ±4.2    59.66 ±5.5    81.39 ±4.2
LMNN       3.72         4.78         4.09 ±0.1     23.64 ±3.4    30.12 ±2.9    13.90 ±2.2    49.06 ±2.3
GB-LMNN    3.98         4.78         2.86 ±0.2     23.64 ±3.4    30.07 ±3.0    13.90 ±1.0    49.15 ±2.8
MLR        11.11        5.71         15.54 ±6.8    36.25 ±13.1   24.32 ±3.8    17.97 ±4.1    44.97 ±2.6
ITML       7.77         6.63         6.52 ±0.8     22.28 ±3.1    30.48 ±1.4    11.86 ±5.6    50.76 ±1.9
1-NCA      5.90         5.73         6.04 ±2.8     40.06 ±6.0    30.69 ±2.9    26.44 ±6.3    46.48 ±4.0
k-NCA      4.81         4.17         3.87 ±0.6     23.65 ±4.1    25.67 ±2.1    11.42 ±4.0    43.8 ±3.1
ours       4.11         4.98         3.05 ±0.1     22.28 ±4.9    24.11 ±3.2    11.19 ±4.4    40.76 ±1.8

Table 1: kNN errors for k = 3, 7 and 11. Features were scaled by z-scoring.

3 For Isolet we also reduce dimensionality to 172 by PCA computed on the training portion.

7 Conclusion

We propose a formulation of metric learning for the kNN classifier as a structured prediction problem, with discrete latent variables representing the selection of k neighbors. We give efficient algorithms for exact inference in this model, including loss-augmented inference, and devise a stochastic gradient algorithm for learning. This approach allows us to learn a Mahalanobis metric with an objective which is a more direct proxy for the stated goal (improvement of classification by the kNN rule) than previously proposed similarity learning methods. 
Our learning algorithm is simple yet efficient, converging on all the data sets we have experimented upon in reasonable time as compared to the competing methods.

Our choice of the Frobenius regularizer is motivated by the desire to control model complexity without biasing towards a particular form of the matrix. We have experimented with alternative regularizers, both the trace norm of W and the shrinkage towards the Euclidean distance, ‖W − I‖F^2, but found both to be inferior to ‖W‖F^2. We suspect that often the optimal W corresponds to a highly anisotropic scaling of the data dimensions, and thus a bias towards I may be unhealthy. The results in this paper are restricted to the Mahalanobis metric, which is an appealing choice for a number of reasons. In particular, learning such metrics is equivalent to learning a linear embedding of the data, allowing very efficient methods for metric search. Still, one can consider non-linear embeddings x → φ(x; w) and define the distance D in terms of the embeddings, for example, as D(x, xi) = ‖φ(x) − φ(xi)‖ or as −φ(x)T φ(xi). Learning S in the latter form can be seen as learning a kernel with the discriminative objective of improving kNN performance. Such a model would be more expressive, but also more challenging to optimize. We are investigating this direction.

Acknowledgments

This work was partly supported by NSF award IIS-1409837.

References

[1] S. Arya, D. M. Mount, N. S. Netanyahu, R. Silverman, and A. Y. Wu. An optimal algorithm for approximate nearest neighbor searching fixed dimensions. J. ACM, 45(6):891–923, 1998.
[2] A. Beygelzimer, S. Kakade, and J. Langford. Cover trees for nearest neighbor. In ICML, pages 97–104. ACM, 2006.
[3] B. E. Boser, I. M. Guyon, and V. N. Vapnik. A training algorithm for optimal margin classifiers. In COLT, pages 144–152. 
ACM Press, 1992.
[4] M. Datar, N. Immorlica, P. Indyk, and V. S. Mirrokni. Locality-sensitive hashing scheme based on p-stable distributions. In SoCG, pages 253–262. ACM, 2004.
[5] J. V. Davis, B. Kulis, P. Jain, S. Sra, and I. S. Dhillon. Information-theoretic metric learning. In ICML, pages 209–216, 2007.
[6] P. F. Felzenszwalb, R. B. Girshick, D. McAllester, and D. Ramanan. Object detection with discriminatively trained part-based models. IEEE T. PAMI, 32(9):1627–1645, 2010.
[7] A. Globerson and S. Roweis. Metric learning by collapsing classes. In Y. Weiss, B. Schölkopf, and J. Platt, editors, NIPS, pages 451–458, Cambridge, MA, 2006. MIT Press.
[8] J. Goldberger, S. Roweis, G. Hinton, and R. Salakhutdinov. Neighbourhood components analysis. In NIPS, 2004.
[9] Y. Gong and S. Lazebnik. Iterative quantization: A procrustean approach to learning binary codes. In CVPR, pages 817–824. IEEE, 2011.
[10] T. Joachims. A support vector method for multivariate performance measures. In ICML, pages 377–384. ACM Press, 2005.
[11] D. Kedem, S. Tyree, K. Weinberger, F. Sha, and G. Lanckriet. Non-linear metric learning. In NIPS, pages 2582–2590, 2012.
[12] K. Kira and L. A. Rendell. The feature selection problem: Traditional methods and a new algorithm. AAAI, 2:129–134, 1992.
[13] B. Kulis and T. Darrell. Learning to hash with binary reconstructive embeddings. NIPS, 22:1042–1050, 2009.
[14] B. McFee and G. Lanckriet. Metric learning to rank. In ICML, 2010.
[15] M. Norouzi, D. Fleet, and R. Salakhutdinov. Hamming distance metric learning. In NIPS, pages 1070–1078, 2012.
[16] M. Norouzi and D. J. Fleet. Minimal loss hashing for compact binary codes. In ICML, 2011.
[17] P. Stange. On the efficient update of the singular value decomposition. PAMM, 8(1):10827–10828, 2008.
[18] D. Tarlow, K. Swersky, I. Sutskever, and R. S. Zemel. Stochastic k-neighborhood selection for supervised and unsupervised learning. ICML, 28:199–207, 2013.
[19] J. Wang, S. Kumar, and S.-F. Chang. Sequential projection learning for hashing with compact codes. In ICML, 2010.
[20] J. Wang, A. Woznica, and A. Kalousis. Learning neighborhoods for metric learning. In ECML-PKDD, 2012.
[21] K. Q. Weinberger and L. K. Saul. Fast solvers and efficient implementations for distance metric learning. In ICML, pages 1160–1167. ACM, 2008.
[22] K. Q. Weinberger and L. K. Saul. Distance metric learning for large margin nearest neighbor classification. JMLR, 10:207–244, 2009.
[23] E. P. Xing, A. Y. Ng, M. I. Jordan, and S. Russell. Distance metric learning, with application to clustering with side-information. In NIPS, pages 505–512. MIT Press, 2002.
[24] C.-N. J. Yu and T. Joachims. Learning structural SVMs with latent variables. In ICML, pages 1169–1176. ACM, 2009.
[25] A. L. Yuille and A. Rangarajan. The concave-convex procedure (CCCP). NIPS, 2:1033–1040, 2002.