{"title": "Curvilinear Distance Metric Learning", "book": "Advances in Neural Information Processing Systems", "page_first": 4223, "page_last": 4232, "abstract": "Distance Metric Learning aims to learn an appropriate metric that faithfully measures the distance between two data points. Traditional metric learning methods usually calculate the pairwise distance with fixed distance functions (\\emph{e.g.,}\\ Euclidean distance) in the projected feature spaces. However, they fail to learn the underlying geometries of the sample space, and thus cannot exactly predict the intrinsic distances between data points. To address this issue, we first reveal that the traditional linear distance metric is equivalent to the cumulative arc length between the data pair's nearest points on the learned straight measurer lines. After that, by extending such straight lines to general curved forms, we propose a Curvilinear Distance Metric Learning (CDML) method, which adaptively learns the nonlinear geometries of the training data. By virtue of Weierstrass theorem, the proposed CDML is equivalently parameterized with a 3-order tensor, and the optimization algorithm is designed to learn the tensor parameter. Theoretical analysis is derived to guarantee the effectiveness and soundness of CDML. Extensive experiments on the synthetic and real-world datasets validate the superiority of our method over the state-of-the-art metric learning models.", "full_text": "Curvilinear Distance Metric Learning\n\nShuo Chen\u2020, Lei Luo\u2021\u2217, Jian Yang\u2020\u2217, Chen Gong\u2020, Jun Li\u00a7, Heng Huang\u2021\n\nAbstract\n\nDistance Metric Learning aims to learn an appropriate metric that faithfully mea-\nsures the distance between two data points. Traditional metric learning methods\nusually calculate the pairwise distance with \ufb01xed distance functions (e.g., Eu-\nclidean distance) in the projected feature spaces. 
However, they fail to learn the underlying geometries of the sample space, and thus cannot exactly predict the intrinsic distances between data points. To address this issue, we first reveal that the traditional linear distance metric is equivalent to the cumulative arc length between the data pair's nearest points on the learned straight measurer lines. After that, by extending such straight lines to general curved forms, we propose a Curvilinear Distance Metric Learning (CDML) method, which adaptively learns the nonlinear geometries of the training data. By virtue of the Weierstrass theorem, the proposed CDML is equivalently parameterized with a 3-order tensor, and the optimization algorithm is designed to learn the tensor parameter. Theoretical analysis is derived to guarantee the effectiveness and soundness of CDML. Extensive experiments on the synthetic and real-world datasets validate the superiority of our method over the state-of-the-art metric learning models.

1 Introduction

The goal of a Distance Metric Learning (DML) algorithm is to learn the distance function for data pairs to measure their similarities. The learned distance metric successfully reflects the relationships within data points and significantly improves the performance of many subsequent learning tasks, such as classification [22], clustering [23], retrieval [24], and verification [14]. It has recently become an active research topic in the machine learning community [30, 29].

The well-studied DML methods are usually linear, namely Mahalanobis distance metric based models [23]. Under the supervision of pairwise similarities, they intend to learn a Semi-Positive-Definite (SPD) matrix $M = PP^\top \in \mathbb{R}^{d\times d}$ to decide the squared parametric distance $\mathrm{Dist}_P^2(x, \hat{x}) = (x - \hat{x})^\top M (x - \hat{x})$ between data points $x$ and $\hat{x}$ in the $d$-dimensional space. 
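This parametric distance admits a quick numerical sanity check: with random arrays standing in for a learned projection $P$ and a data pair, the Mahalanobis form, the projected Euclidean form, and the column-wise cumulative form all coincide. A minimal sketch, assuming numpy (all values are illustrative, not learned):

```python
import numpy as np

rng = np.random.default_rng(0)
d, m = 5, 3
P = rng.standard_normal((d, m))          # stands in for a learned projection matrix
M = P @ P.T                              # SPD metric matrix M = P P^T
x, x_hat = rng.standard_normal(d), rng.standard_normal(d)
diff = x - x_hat

dist_mahalanobis = diff @ M @ diff                    # (x - x_hat)^T M (x - x_hat)
dist_projected = np.sum((P.T @ diff) ** 2)            # squared Euclidean distance after projection
dist_cumulative = sum((P[:, i] @ diff) ** 2 for i in range(m))  # sum over columns p_i

assert np.isclose(dist_mahalanobis, dist_projected)
assert np.isclose(dist_mahalanobis, dist_cumulative)
```

The third form, summing $(p_i^\top(x - \hat{x}))^2$ over the columns $p_i$ of $P$, is the per-measurer-line decomposition that the geometric interpretation below builds on.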
It is notable that such a linear Mahalanobis distance is equivalent to the Euclidean distance in the $m$-dimensional feature space projected by $P \in \mathbb{R}^{d\times m}$. To perform the learning of the parameter $M$, intensive efforts have been put into designing various loss functions and constraints in optimization models. The early works Large Margin Nearest Neighbor (LMNN) [27] and Information-Theoretic Metric Learning (ITML) [7] utilized the must-link and cannot-link information to constrain the ranges of the learned distances. Instead of fixing the distance ranges in the objective, Geometric Mean Metric Learning (GMML) [31] proposed a geometric loss to jointly minimize the intra-class distances and maximize the inter-class distances as much as possible. Considering that the above methods utilizing one single matrix $M$ are not flexible for complex data, the traditional Mahalanobis distance was extended to a combined form of multiple linear metrics [30]. Recently, the strategies of adversarial training [8] and collaborative training [20] were introduced in Adversarial Metric Learning (AML) [4] and Bilevel Distance Metric Learning (BDML) [29], respectively, which showed further improvements on the linear metric.

To enhance the flexibility of DML for fitting data pairs from nonlinear sample spaces, early works transferred the original data points to a high-dimensional kernel space by using traditional kernel methods [26]. Recently, the projection matrix $P$ of the linear DML was extended to a nonlinear feature mapping form $P\mathcal{W}(\cdot)$, in which the mapping $\mathcal{W}(\cdot)$ is implemented by typical Deep Neural Networks (DNN), such as the Convolutional Neural Network (CNN) [32] and the Multi-Layer Perceptron (MLP) [10]. To further utilize the fitting capability of DNNs and characterize the relative distances, the traditional pairwise loss was extended to multi-example forms, such as the Npair loss [22] and the Angular loss [25]. It is worth pointing out that the above kernelized metrics and DNN based metrics are still calculated with a fixed Euclidean distance in the extracted feature space, which ignores the geometric structures of the sample space. To this end, some recent works proposed to learn the projection matrix $P$ on differential manifolds (e.g., the SPD manifold [33] and the Grassmann manifold [13]) to improve the representation capability on some specific nonlinear data structures.

†S. Chen, J. Yang, and C. Gong are with the PCA Lab, Key Lab of Intelligent Perception and Systems for High-Dimensional Information of Ministry of Education, and Jiangsu Key Lab of Image and Video Understanding for Social Security, School of Computer Science and Engineering, Nanjing University of Science and Technology, Nanjing 210094, China (E-mail: {shuochen, csjyang, chen.gong}@njust.edu.cn).
‡L. Luo and H. Huang are with the Electrical and Computer Engineering, University of Pittsburgh, and also with JD Finance America Corporation, USA (E-mail: lel94@pitt.edu, henghuanghh@gmail.com).
§J. Li is with the Institute for Medical Engineering & Science, Massachusetts Institute of Technology, Cambridge, MA, USA (E-mail: junli@mit.edu).
*L. Luo and J. Yang are corresponding authors.

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

Figure 1: Conceptual illustrations of the Euclidean metric, the linear (Mahalanobis) metric, and our proposed curvilinear metric in three-dimensional space. For a pair of data points (i.e., $x$ and $\hat{x}$) from the spatial surface, metrics find out the nearest (calibration) points (i.e., $T(x)$ and $T(\hat{x})$) on each learned measurer line, and then use the arc length between the nearest points as the measured distance result. By the curved measurer lines, our method can measure the intrinsic curvilinear distance more exactly.

However, the geometries of their adopted manifolds are usually pre-specified and cannot be learned to adapt to various nonlinear data, which remarkably hinders the generality of the manifold based DML approaches.

Although the existing DML models above have achieved promising results to some extent, most of them fail to learn the spatial geometric structures of the sample space, and thus their obtained metrics cannot reflect the intrinsic curvilinear distances between data points. To address this challenging problem, in this paper, we first present a new interpretation to reveal that the traditional linear distance metric is equivalent to the cumulative arc length between a data pair's nearest points on the straight measurer lines (see Fig. 1(a) and (b)). Such straight measurer lines can successfully learn the directions of real distances, but they are not capable of adapting to the curvilinear distance geometries of many nonlinear datasets. Therefore, we propose the Curvilinear Distance Metric Learning (CDML) model, which extends the straight measurer lines to general smooth curved lines (see Fig. 1(c)). Thanks to the generalized forms of such curvilinear measurers, the geometries of the training data can be adaptively learned, so that the nonlinear pairwise distances can be reasonably measured. We theoretically analyze the effectiveness of CDML by showing its fitting capability and generalization bound. Furthermore, we prove that our proposed curvilinear distance satisfies the topological definitions of the (pseudo-)metric, which demonstrates the geometric soundness of such a distance metric. The main contributions of this paper are summarized as follows: (I). We provide a new intuitive interpretation for traditional linear metric learning by explicitly formulating the measurer lines and the measurement process; (II). 
We propose a generalized metric learning model dubbed CDML, which discovers the curvilinear distance hidden in nonlinear data, together with a corresponding optimization algorithm that solves the proposed model and is guaranteed to converge; (III). A complete theoretical guarantee is established, which analyzes the fitting capability, generalization bound, and topological property of CDML, thereby ensuring the model's effectiveness and soundness; (IV). CDML is experimentally validated to outperform state-of-the-art metric learning models on both synthetic and real-world datasets.

[Figure 1 panels: (a) Space of Euclidean Metric; (b) Space of Linear Metric; (c) Space of Curvilinear Metric (Ours); legend: real distances, measured distances, measurer lines $\theta_i(t)$, normal lines (isolines).]

Notations. Throughout this paper, we write matrices, vectors, and 3-order tensors as bold uppercase characters, bold lowercase characters, and bold calligraphic uppercase characters, respectively. For a 3-order tensor $\mathcal{M}$, the notations $\mathcal{M}_{i::}$, $\mathcal{M}_{:i:}$, and $\mathcal{M}_{::i}$ denote the horizontal, lateral, and frontal slices, and the tube, row, and column fibers are written as $\mathcal{M}_{ij:}$, $\mathcal{M}_{i:j}$, and $\mathcal{M}_{:ij}$. Let $y = (y_1, y_2, \cdots, y_N)^\top$ be the label vector of the training data pairs $\mathcal{X} = \{(x_j, \hat{x}_j)\,|\,j = 1, 2, \cdots, N\}$ with $x_j, \hat{x}_j \in \mathbb{R}^d$, where $y_j = 1$ if $x_j$ and $\hat{x}_j$ are similar, and $y_j = 0$ otherwise. Here $d$ is the data dimensionality, and $N$ is the total number of data pairs. The operators $\|\cdot\|_2$ and $\|\cdot\|_F$ denote the vector $\ell_2$-norm and the matrix (tensor) Frobenius-norm, respectively. 
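The slice and fiber notation maps directly onto multidimensional array indexing; a minimal sketch of the correspondence, assuming numpy (shapes are purely illustrative):

```python
import numpy as np

m, d, c = 2, 3, 4
M = np.arange(m * d * c, dtype=float).reshape(m, d, c)  # a 3-order tensor in R^{m x d x c}

# Slices: fixing one index leaves a matrix (following the paper's index order)
horizontal = M[0, :, :]    # M_{i::}, shape (d, c)
lateral    = M[:, 0, :]    # M_{:i:}, shape (m, c)
frontal    = M[:, :, 0]    # M_{::i}, shape (m, d)

# Fibers: fixing two indices leaves a vector
tube   = M[0, 1, :]        # M_{ij:}, shape (c,)
row    = M[0, :, 1]        # M_{i:j}, shape (d,)
column = M[:, 0, 1]        # M_{:ij}, shape (m,)

frobenius = np.sqrt(np.sum(M ** 2))  # ||M||_F, the tensor Frobenius-norm

assert horizontal.shape == (d, c) and lateral.shape == (m, c) and frontal.shape == (m, d)
assert tube.shape == (c,) and row.shape == (d,) and column.shape == (m,)
```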
The notation $\mathbb{N}_n = \{1, 2, \cdots, n\}$ is used for any $n \in \mathbb{N}$.

2 Curvilinear Distance Metric Learning

In this section, we first present a new geometric interpretation for traditional linear metric learning models. Then the Curvilinear Distance Metric Learning (CDML) is formulated based on such an interpretation. The optimization algorithm is designed to solve CDML with a convergence guarantee.

2.1 A Geometric Interpretation for Linear Metric

It is well known that the linear distance metric (i.e., the squared Mahalanobis distance) [30, 29] between two given data points $x, \hat{x} \in \mathbb{R}^d$ is defined as

$$\mathrm{Dist}_P^2(x, \hat{x}) = (x - \hat{x})^\top M (x - \hat{x}) = (x - \hat{x})^\top P P^\top (x - \hat{x}), \qquad (1)$$

where the matrix $M = PP^\top$ is assumed to be SPD in $\mathbb{R}^{d\times d}$. In previous works, the above linear distance metric is usually interpreted as the Euclidean distance in the projection space, where the projection matrix $P \in \mathbb{R}^{d\times m}$ plays the role of feature extraction [28]. Here $d$ and $m$ are the dimensionalities of the original sample space and the extracted feature space, respectively. Although such an interpretation offers a friendly way for model extensions, it does not make clear why the linear distance metric fails to characterize the curvilinear distances on nonlinear data.

Now we present a new understanding of the linear distance metric from its measurement process, which offers a clear way to handle nonlinear geometric data. We denote $p_i \in \mathbb{R}^d$ as the $i$-th column of $P$. By using the rule of inner products, Eq. (1) equals the following cumulative form¹

$$\mathrm{Dist}_P^2(x, \hat{x}) = \sum_{i=1}^{m} \|p_i\|_2^2\,\|x - \hat{x}\|_2^2 \cos^2\langle p_i, x - \hat{x}\rangle = \sum_{i=1}^{m} \|p_i\|_2^2\,\|p_i T_i(x) - p_i T_i(\hat{x})\|_2^2, \qquad (2)$$

where $T_i(x)$ and $T_i(\hat{x})$ are the projection points of $x$ and $\hat{x}$, which satisfy $(p_i T_i(x) - x)^\top p_i = 0$ and $(p_i T_i(\hat{x}) - \hat{x})^\top p_i = 0$, respectively. After that, the $\ell_2$-norm distance $\|p_i T_i(x) - p_i T_i(\hat{x})\|_2$ is equivalently converted to the arc length from $T_i(x)$ to $T_i(\hat{x})$ on the straight line $z = p_i t$ ($t \in \mathbb{R}$), and thus the squared Mahalanobis distance is rewritten as

$$\mathrm{Dist}_P^2(x, \hat{x}) = \sum_{i=1}^{m} \|p_i\|_2^2 \left( \int_{T_i(x)}^{T_i(\hat{x})} \|p_i\|_2 \,\mathrm{d}t \right)^{\!2}, \qquad (3)$$

where the integral value is the arc length of the straight measurer line $z = p_i t$ from $T_i(x)$ to $T_i(\hat{x})$. Here the weight $\|p_i\|_2^2$ is regarded as the scale of the measurer line, which equals the squared unit arc length from $0$ to $1$. By using the convexity of $g(t) = \|p_i t - x\|_2^2$, the orthogonality condition $(p_i T_i(x) - x)^\top p_i = 0$ is equivalent to finding the nearest point $T_i(x)$ on the measurer line, namely

$$T_i(x) = \arg\min_{t \in \mathbb{R}} \|p_i t - x\|_2^2. \qquad (4)$$

Based on the results of Eq. (3) and Eq. (4), we can clearly observe that the Mahalanobis distance of the data pair $\{x, \hat{x}\}$ is intrinsically computed as the cumulative arc length between $\{x, \hat{x}\}$'s nearest points on the learned measurer lines $z = p_i t$, which is shown in Fig. 1. It reveals that linear metrics merely learn the rough directions of real distances, yet cannot capture the complex data geometry.

¹For calculation details, $\mathrm{Dist}_P^2(x, \hat{x}) = (x - \hat{x})^\top P P^\top (x - \hat{x}) = \|P^\top(x - \hat{x})\|_2^2 = \|(p_1^\top(x - \hat{x}), p_2^\top(x - \hat{x}), \cdots, p_m^\top(x - \hat{x}))\|_2^2 = \sum_{i=1}^{m}(p_i^\top(x - \hat{x}))^2 = \sum_{i=1}^{m}\|p_i\|_2^2\,\|x - \hat{x}\|_2^2\cos^2\langle p_i, x - \hat{x}\rangle$.

2.2 Model Establishment

As mentioned before, traditional metrics learn the direction $p_i$ of the measurer line $z = p_i t$ in the $d$-dimensional sample space. However, such a straight line is far from adapting to complex nonlinear data in the real world. We thus use a general vector-valued function $\theta_i : \mathbb{R} \to \mathbb{R}^d$ to extend the straight line $z = p_i t$ ($t \in \mathbb{R}$) to the smooth curved line $\tilde{z} = \theta_i(t)$ ($t \in \mathbb{R}$). Specifically, it can be written as

$$\tilde{z} = (\tilde{z}_1, \tilde{z}_2, \cdots, \tilde{z}_d)^\top = (\theta_{i1}(t), \theta_{i2}(t), \cdots, \theta_{id}(t))^\top = \theta_i(t), \qquad (5)$$

where the smooth function $\theta_{ik}(t)$ is the $k$-th element of the vector-valued function $\theta_i(t)$. It should be noticed that such a curved line is also assumed to be zero-crossing, i.e., $\theta_i(0) = \mathbf{0}$, which is consistent with the linear distance metric. Then the nearest point $T_i(x)$ defined in Eq. (4) can be easily extended to the nearest point set $N_{\theta_i}(x)$, and we naturally have that

$$N_{\theta_i}(x) = \arg\min_{t \in \mathbb{R}} \|\theta_i(t) - x\|_2^2. \qquad (6)$$

Nevertheless, the point set $N_{\theta_i}(x)$ might contain more than one element, so we simply use the smallest element of $N_{\theta_i}(x)$, as described in Definition 1.

Definition 1. For a data point $x \in \mathbb{R}^d$ and a curved line $\theta_i$, we define the calibration point $T_{\theta_i}(x)$ as $T_{\theta_i}(x) = \arg\min_{t \in N_{\theta_i}(x)} t$, where $N_{\theta_i}(x)$ includes all nearest points of $x$ on the curved line $\theta_i(t)$.

According to the interpretation offered in Section 2.1, the curvilinear distance is consistently regarded as a cumulative arc length value (see Fig. 1(c)). Here we follow the common formula of arc length in calculus [21], which is given in Definition 2.

Definition 2. The arc length from $T_1$ to $T_2$ on the curved line $\theta_i$ is defined as

$$\mathrm{Length}_{\theta_i}(T_1, T_2) = \int_{\min(T_1, T_2)}^{\max(T_1, T_2)} \|\theta_i'(t)\|_2 \,\mathrm{d}t, \qquad (7)$$

where the derivative vector $\theta_i'(t) = (\theta_{i1}'(t), \theta_{i2}'(t), \cdots, \theta_{id}'(t))^\top$.

Based on the above definitions of the arc length and the calibration point, the traditional linear distance metric (i.e., Eq. (3)) is easily extended to the general curvilinear form. Specifically, the squared curvilinear distance between data points $x$ and $\hat{x}$ is calculated by

$$\mathrm{Dist}_\Theta^2(x, \hat{x}) = \sum_{i=1}^{m} s_{\theta_i} \cdot \mathrm{Length}_{\theta_i}^2(T_{\theta_i}(x), T_{\theta_i}(\hat{x})), \qquad (8)$$

where $\Theta = (\theta_1, \theta_2, \cdots, \theta_m)$ is the learning parameter of the curvilinear distance metric, and $m$ is the number of measurer lines. Here the scale value $s_{\theta_i} = \mathrm{Length}_{\theta_i}^2(0, 1)$. When we use the empirical risk loss to learn $\Theta$, the objective of Curvilinear Distance Metric Learning (CDML) is formulated as

$$\min_{\Theta \in \mathcal{F}_m} \frac{1}{N}\sum_{j=1}^{N} L(\mathrm{Dist}_\Theta^2(x_j, \hat{x}_j); y_j) + \lambda R(\Theta), \qquad (9)$$

where $\mathcal{F}_m = \{(\theta_1, \theta_2, \cdots, \theta_m)\,|\,\theta_{ik}(0) = 0$ and $\theta_{ik}(t)$ is smooth for $i \in \mathbb{N}_m$ and $k \in \mathbb{N}_d\}$, and the regularization parameter $\lambda > 0$ is tuned by users. 
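To make the curvilinear distance of Eq. (8) concrete, the sketch below evaluates it for a single hand-picked measurer line $\theta(t) = (t, t^2)^\top$ in $\mathbb{R}^2$, with a dense grid standing in for the exact calibration-point computation and a trapezoidal sum for the arc-length integral (numpy assumed; the curve and all tolerances are illustrative, not learned):

```python
import numpy as np

# One zero-crossing measurer line theta(t) = (t, t^2)^T and its derivative
theta  = lambda t: np.stack([t, t ** 2], axis=-1)
dtheta = lambda t: np.stack([np.ones_like(t), 2.0 * t], axis=-1)

ts = np.linspace(-3.0, 3.0, 20001)            # parameter grid for the search

def calibration(x):
    """Smallest minimizer of ||theta(t) - x||_2^2 on the grid (Definition 1)."""
    sq = np.sum((theta(ts) - x) ** 2, axis=1)
    return ts[np.flatnonzero(np.isclose(sq, sq.min()))[0]]

def arc_length(t1, t2, L=2000):
    """Trapezoidal approximation of the arc length from Definition 2."""
    grid = np.linspace(min(t1, t2), max(t1, t2), L + 1)
    speed = np.linalg.norm(dtheta(grid), axis=1)
    return float(np.sum(0.5 * (speed[:-1] + speed[1:]) * np.diff(grid)))

def dist2(x, x_hat):
    """Squared curvilinear distance of Eq. (8) with scale s = Length^2(0, 1)."""
    s = arc_length(0.0, 1.0) ** 2
    return s * arc_length(calibration(x), calibration(x_hat)) ** 2

x, x_hat = np.array([1.0, 1.0]), np.array([-1.0, 1.0])
assert dist2(x, x) < 1e-8                              # zero for coincident inputs
assert np.isclose(dist2(x, x_hat), dist2(x_hat, x))    # symmetry of the distance
```

The two asserted properties preview the pseudo-metric guarantees that Section 3.3 establishes formally.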
In the above objective, the loss function $L$ evaluates the inconsistency between the predicted distances and their similarity labels, and the regularizer $R$ is used to reduce over-fitting.

Implementation of $\Theta$. As the learning parameter $\Theta$ of CDML (i.e., Eq. (9)) is in an abstract form and cannot be directly solved, we have to give a concrete form for each curved line $\theta_i(t)$ for learning tasks. Here we employ the polynomial function to approximate $\theta_i(t)$, due to the guarantee of infinite approximation described in Theorem 1. It is notable that $\theta_i(t)$ can also be infinitely approximated in other ways, such as Fourier series, deep neural networks, and piecewise linear functions [16]. Without loss of generality, we employ the polynomial function in this paper.

Theorem 1 (Weierstrass Approximation [21]). Assume that the vector-valued function $\theta_i(t) \in \mathbb{R}^d$ ($i = 1, 2, \cdots, m$) is continuous and zero-crossing on $[a, b]$. Then for any $\epsilon > 0$ and $t \in [a, b]$, there exists the $c$-order polynomial vector-valued function

$$\mathcal{M}_{i::}(t) = \Big(\sum_{k=1}^{c}\mathcal{M}_{i1k}t^k,\ \sum_{k=1}^{c}\mathcal{M}_{i2k}t^k,\ \cdots,\ \sum_{k=1}^{c}\mathcal{M}_{idk}t^k\Big)^{\!\top}, \qquad (10)$$

such that $\sum_{i=1}^{m}\|\theta_i(t) - \mathcal{M}_{i::}(t)\|_2 < \epsilon$, where the 3-order tensor $\mathcal{M} = [\mathcal{M}_{ijk}] \in \mathbb{R}^{m\times d\times c}$.

We let $\theta_i(t) := \mathcal{M}_{i::}(t)$ for $i = 1, 2, \cdots, m$, and then the abstract parameter $\Theta$ in Eq. (9) can be materialized by the tensor $\mathcal{M}$ with infinite approximation in the following optimization objective

$$\min_{\mathcal{M} \in \mathbb{R}^{m\times d\times c}} \frac{1}{N}\sum_{j=1}^{N} L(\mathrm{Dist}_{\mathcal{M}}^2(x_j, \hat{x}_j); y_j) + \lambda R(\mathcal{M}). \qquad (11)$$

We thus successfully convert the abstract optimization problem in Eq. (9) to the above concrete form in Eq. (11) w.r.t. the tensor $\mathcal{M} \in \mathbb{R}^{m\times d\times c}$, which can be easily solved by existing algorithms [11].

Algorithm 1 Solving Eq. (11) via Stochastic Gradient Descent.
Input: Training data pairs $\mathcal{X} = \{(x_j, \hat{x}_j)\,|\,j \in \mathbb{N}_N\}$; labels $y \in \{0, 1\}^N$; batch size $h$; learning rate $\rho$; regularization parameter $\lambda$; tensor size parameters $c$ and $m$.
Initialize: $k = 1$; $\mathcal{M}^{(1)} = \mathbf{0}$.
Repeat:
  1). Uniformly randomly pick $h$ data pairs $\{(x_{b_j}, \hat{x}_{b_j})\,|\,j \in \mathbb{N}_h\}$ from $\mathcal{X}$.
  2). Compute the calibration points $T_{\mathcal{M}_{i::}}(x_{b_j})$ and $T_{\mathcal{M}_{i::}}(\hat{x}_{b_j})$ for $i = 1, 2, \cdots, m$ by solving the real roots of $f_i'(t) = 0$ in Eq. (13).
  3). Update the learning parameter $\mathcal{M}$ by
  $$\mathcal{M}^{(k+1)} := \mathcal{M}^{(k)} - \rho\Big(\frac{1}{h}\sum_{j=1}^{h} L_j'\,\nabla_{\mathcal{M}^{(k)}}\mathrm{Dist}_{\mathcal{M}^{(k)}}^2(x_{b_j}, \hat{x}_{b_j}) + \lambda\nabla_{\mathcal{M}^{(k)}}R(\mathcal{M}^{(k)})\Big). \qquad (12)$$
  4). Update $k := k + 1$.
Until convergence.
Output: The converged $\mathcal{M}^*$.

2.3 Optimization Algorithm

Since the pair number $N$ is usually large in Eq. (11), we use the Stochastic Gradient Descent (SGD) method to solve it. The central operations in SGD are the gradient computations of the objective function. 
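Step 2 of Algorithm 1 reduces to univariate polynomial root finding: for one slice $\mathcal{M}_{i::}$, the calibration point is the smallest global minimizer of $f_i(t) = \|\mathcal{M}_{i::}(t) - x\|_2^2$, found among the real roots of $f_i'(t) = 0$. A minimal sketch of that step, assuming numpy (the random slice stands in for a partially trained parameter, and the tolerances are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
d, c = 3, 4
Mi = rng.standard_normal((d, c))   # one slice M_{i::}: the k-th curve coordinate is
                                   # sum_l Mi[k, l] * t^l (zero-crossing by construction)

def calibration_point(Mi, x):
    """Smallest global minimizer of f_i(t) = ||M_{i::}(t) - x||_2^2,
    taken over the real roots of f_i'(t) = 0 (Algorithm 1, step 2)."""
    d_, c_ = Mi.shape
    f = np.zeros(2 * c_ + 1)                    # ascending coefficients of f_i
    for k in range(d_):
        pk = np.concatenate(([-x[k]], Mi[k]))   # theta_k(t) - x_k, ascending
        f += np.convolve(pk, pk)                # accumulate (theta_k(t) - x_k)^2
    df = f[1:] * np.arange(1, 2 * c_ + 1)       # ascending coefficients of f_i'
    roots = np.roots(df[::-1])                  # np.roots expects descending order
    real = np.real(roots[np.abs(roots.imag) < 1e-6])
    vals = np.polyval(f[::-1], real)
    return float(real[np.isclose(vals, vals.min())].min())

x = rng.standard_normal(d)
t_star = calibration_point(Mi, x)

# f_i' has odd degree here, so a real root (hence a global minimizer over R)
# always exists; check optimality against a coarse grid of alternatives
f_at = lambda t: np.sum((Mi @ (t ** np.arange(1, c + 1)) - x) ** 2)
assert all(f_at(t_star) <= f_at(t) + 1e-8 for t in np.linspace(-5.0, 5.0, 101))
```

Because $f_i \to \infty$ as $|t| \to \infty$, every local minimizer is a critical point, so scanning the real roots of $f_i'$ is sufficient to recover the global minimizer.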
Here we only need to offer the gradient of $\mathrm{Dist}_{\mathcal{M}}^2(x, \hat{x})$, which mainly depends on the calibration points $T_{\mathcal{M}_{i::}}(x)$ and $T_{\mathcal{M}_{i::}}(\hat{x})$, and the arc length $\mathrm{Length}_{\mathcal{M}_{i::}}(T_{\mathcal{M}_{i::}}(x), T_{\mathcal{M}_{i::}}(\hat{x}))$.

According to Definition 1, the calibration point $T_{\mathcal{M}_{i::}}(x)$ can be directly obtained from the nearest point set $N_{\mathcal{M}_{i::}}(x) = \{t^*\,|\,f_i(t^*) \le f_i(\hat{t}),\ \text{and}\ t^*, \hat{t} \in \Gamma_i\}$, where the polynomial function is

$$f_i(t) = \|\mathcal{M}_{i::}(t) - x\|_2^2 = \sum_{j,k=1}^{c}(\mathcal{M}_{i:j}^\top\mathcal{M}_{i:k})\,t^{j+k} - 2\sum_{k=1}^{c}\mathcal{M}_{i:k}^\top x\,t^{k} + x^\top x, \qquad (13)$$

and $\Gamma_i$ is the real root set of the polynomial equation $f_i'(t) = 0$. Here the real roots of $f_i'(t) = 0$ can be efficiently solved by simple numerical algorithms [17], whose computational complexity depends on neither the number of training data pairs $N$ nor the feature dimensionality $d$.

By using the definition of the integral, the arc length is equivalently converted to

$$\mathrm{Length}_{\mathcal{M}_{i::}}(T_1, T_2) = \int_{\min(T_1, T_2)}^{\max(T_1, T_2)} \|\mathcal{M}_{i::}'(t)\|_2\,\mathrm{d}t = \lim_{L\to\infty}\sum_{l=0}^{L}\|\mathcal{M}_{i::}'(\min(T_1, T_2) + l\Delta t)\|_2\,\Delta t, \qquad (14)$$

where $\Delta t = |T_1 - T_2|/L$. In practical use, we fix $L$ to a large number (e.g., $L = 10^3$) and thus obtain a good approximation $G_{\mathcal{M}_{i::}}(T_1, T_2) := \big(\sum_{l=0}^{L}\|\mathcal{M}_{i::}'(\min(T_1, T_2) + l\Delta t)\|_2\,\Delta t\big)^2$ for the squared arc length value $\mathrm{Length}_{\mathcal{M}_{i::}}^2(T_1, T_2)$.

According to the chain rule of derivatives, the gradient of $\mathrm{Dist}_{\mathcal{M}}^2$ w.r.t. $\mathcal{M}$ is obtained as²

$$\nabla_{\mathcal{M}_{i::}}\mathrm{Dist}_{\mathcal{M}}^2(x, \hat{x}) = \nabla_{\mathcal{M}_{i::}}G_{\mathcal{M}_{i::}}(0, 1)\cdot G_{\mathcal{M}_{i::}}(T_{\mathcal{M}_{i::}}(x), T_{\mathcal{M}_{i::}}(\hat{x})) + G_{\mathcal{M}_{i::}}(0, 1)\cdot\widetilde{\nabla}_{\mathcal{M}_{i::}}G_{\mathcal{M}_{i::}}(T_{\mathcal{M}_{i::}}(x), T_{\mathcal{M}_{i::}}(\hat{x})). \qquad (15)$$

It is noticed that the gradient of $G_{\mathcal{M}_{i::}}(T_{\mathcal{M}_{i::}}(x), T_{\mathcal{M}_{i::}}(\hat{x}))$ does not necessarily exist, because the intermediate variable $T_{\mathcal{M}_{i::}}(x)$ is not differentiable w.r.t. $\mathcal{M}_{i::}$. Therefore, here we use the smoothed gradient³ instead of the original gradient of $G_{\mathcal{M}_{i::}}(T_{\mathcal{M}_{i::}}(x), T_{\mathcal{M}_{i::}}(\hat{x}))$. The optimization algorithm for solving Eq. (11) is summarized in Algorithm 1.

Convergence. Notice that the main difference between Algorithm 1 and traditional SGD is that we utilize a smoothed gradient instead of the original gradient. Previous works have proved that the smoothed gradient still ensures that the SGD algorithm converges to a stationary point [12, 2].

²$\nabla_{\mathcal{M}_{i::}}G_{\mathcal{M}_{i::}}(0, 1) = 2\sqrt{G_{\mathcal{M}_{i::}}(0, 1)}\sum_{l=0}^{L}\frac{\mathcal{M}_{i::}'(l/L)}{\|\mathcal{M}_{i::}'(l/L)\|_2}\cdot\big((l/L)^1, (l/L)^2, \cdots, (l/L)^c\big)/L$.

³$\widetilde{\nabla}_{\mathcal{M}_{i::}}G_{\mathcal{M}_{i::}}(T_{\mathcal{M}_{i::}}(x), T_{\mathcal{M}_{i::}}(\hat{x})) = \frac{\Delta\mathcal{M}_{i::}}{\mu\|\Delta\mathcal{M}_{i::}\|_2^2}\big(G_{\mathcal{M}_{i::}+\mu\Delta\mathcal{M}_{i::}}(T_{\mathcal{M}_{i::}+\mu\Delta\mathcal{M}_{i::}}(x), T_{\mathcal{M}_{i::}+\mu\Delta\mathcal{M}_{i::}}(\hat{x})) - G_{\mathcal{M}_{i::}}(T_{\mathcal{M}_{i::}}(x), T_{\mathcal{M}_{i::}}(\hat{x}))\big)$, where $\Delta\mathcal{M}_{i::} \in \mathbb{R}^{d\times c}$ is generated from the standard normal distribution, and $\mu > 0$ is a given small number used to smooth the gradient.

3 Theoretical Analysis

In this section, we provide theoretical results for the fitting capability, generalization bound, and topological property of CDML. All proofs of the theorems are given in the supplementary materials.

3.1 Fitting Capability

We first reveal that the curvilinear distance learned by Eq. (11) is capable of distinguishing the similarities of all training data pairs. We assume that the (dis)similar pair sets $\mathcal{X}_{\mathrm{Similar}}$ and $\mathcal{X}_{\mathrm{Dissimilar}}$ are the partitions of the whole training data pair set $\mathcal{X}$, where the similarity label $y_{(x,\hat{x})} = 1$ if $(x, \hat{x}) \in \mathcal{X}_{\mathrm{Similar}}$, and $y_{(x,\hat{x})} = 0$ if $(x, \hat{x}) \in \mathcal{X}_{\mathrm{Dissimilar}}$. Then the conclusion is described as follows.

Theorem 2. 
For given $\Delta_{\mathrm{margin}} > 0$, there exist $m, c \in \mathbb{N}$ and $\widetilde{\mathcal{M}} \in \mathbb{R}^{m\times d\times c}$ such that

$$\mathrm{Dist}_{\widetilde{\mathcal{M}}}(\beta, \hat{\beta}) - \mathrm{Dist}_{\widetilde{\mathcal{M}}}(\alpha, \hat{\alpha}) > \Delta_{\mathrm{margin}}, \qquad (16)$$

where $(\alpha, \hat{\alpha}) \in \mathcal{X}_{\mathrm{Similar}}$ and $(\beta, \hat{\beta}) \in \mathcal{X}_{\mathrm{Dissimilar}}$.

From the above result, we know that the well-learned curvilinear distance correctly predicts the similarities of data pairs in the training set $\mathcal{X}$, which ensures that the inter-class distances are always greater than the intra-class distances. In most metric learning models, the loss functions are designed to reward larger inter-class distances and smaller intra-class distances. It means that the distance $\mathrm{Dist}_{\widetilde{\mathcal{M}}}(\cdot, \cdot)$ in Eq. (16) can be successfully achieved by minimizing the loss functions. Therefore, the fitting capability of the curvilinear distance can be reliably guaranteed by the parameter tensor $\mathcal{M}$.

3.2 Generalization Bound

Now we further analyze the effectiveness of CDML by offering the generalization bound of the solution to Eq. (11). Such a bound evaluates the bias between the generalization error $\varepsilon(\mathcal{M}) := \mathbb{E}_{(x,\hat{x})\sim\mathcal{D}}\big(L(\mathrm{Dist}_{\mathcal{M}}^2(x, \hat{x}); y_{(x,\hat{x})})\big)$ and the empirical error $\varepsilon_{\mathcal{X}}(\mathcal{M}) := \frac{1}{N}\sum_{j=1}^{N}L(\mathrm{Dist}_{\mathcal{M}}^2(x_j, \hat{x}_j); y_j)$, where $\mathcal{D}$ is the real data distribution and $\mathbb{E}(\cdot)$ denotes the expectation function. We simply use the squared tensor Frobenius-norm [18] for regularization and have the following conclusion.

Theorem 3. Assume that $R(\mathcal{M}) = \|\mathcal{M}\|_F^2 = \sum_{i,j,k}(\mathcal{M}_{ijk})^2$ and $\mathcal{M}^* \in \mathbb{R}^{m\times d\times c}$ is the solution to Eq. (11). Then, we have that for any $0 < \delta < 1$, with probability $1 - \delta$,

$$\varepsilon(\mathcal{M}^*) - \varepsilon_{\mathcal{X}}(\mathcal{M}^*) \le X^*\sqrt{2\ln(1/\delta)/N} + B_\lambda\mathfrak{R}_N(L), \qquad (17)$$

where $B_\lambda \to 0$ as $\lambda \to +\infty$.⁴ 
Here $\mathfrak{R}_N(L)$ is the Rademacher complexity of the loss function $L$ related to the space $\mathbb{R}^{m\times d\times c}$ for $N$ training pairs, and $X^* = \max_{k\in\mathbb{N}_N}|L(\mathrm{Dist}_{\mathcal{M}^*}^2(x_k, \hat{x}_k); y_k)|$.

In Eq. (17), the first term of the upper bound converges with the increasing number of training data pairs $N$. We can also find that the second term converges to 0 as $\lambda$ increases, which means the regularizer $R(\mathcal{M})$ effectively improves the generalization ability of CDML.

3.3 Topological Property

In general topology, the metric⁵ is defined as a distance function satisfying the non-negativity, symmetry, triangle, and coincidence properties [23, 9]. As an extended metric, the pseudo-metric merely has the first three properties, as revealed in [19]. Here we prove that the curvilinear distance defined in Eq. (8) satisfies the topological definitions, and thus its geometric soundness is guaranteed.

Theorem 4. For the curvilinear distance $\mathrm{Dist}_\Theta(x, \hat{x})$ and its corresponding parameter $\Theta$, we denote $\Theta'(\tau) = (\theta_1'(\tau_1), \theta_2'(\tau_2), \cdots, \theta_m'(\tau_m)) \in \mathbb{R}^{d\times m}$ and have that:
1). $\mathrm{Dist}_\Theta(x, \hat{x})$ is a pseudo-metric for any $\Theta \in \mathcal{F}_m$;
2). $\mathrm{Dist}_\Theta(x, \hat{x})$ is a metric, if $\Theta'(\tau)$ is full row rank for any $\tau = (\tau_1, \tau_2, \cdots, \tau_m)^\top \in \mathbb{R}^m$.

⁴Here $B_\lambda = 2\,\mathbb{E}_{\mathcal{X},\mathcal{Z}}[\sup_{\mathcal{M}\in\mathcal{F}(\lambda)}\varepsilon_{\mathcal{Z}}(\mathcal{M}) - \varepsilon_{\mathcal{X}}(\mathcal{M})]\,/\,\mathbb{E}_{\mathcal{X},\mathcal{Z}}[\sup_{\mathcal{M}\in\mathbb{R}^{m\times d\times c}}\varepsilon_{\mathcal{Z}}(\mathcal{M}) - \varepsilon_{\mathcal{X}}(\mathcal{M})]$, and $\mathcal{F}(\lambda)$ is a shrinking hypothesis space induced by the regularizer $R(\mathcal{M})$.
⁵The distance function $\mathrm{Dist}(\cdot, \cdot)$ is a metric if and only if it satisfies the following four conditions for any $\alpha_1, \alpha_2, \alpha_3 \in \mathbb{R}^d$: (I). Non-negativity: $\mathrm{Dist}(\alpha_1, \alpha_2) \ge 0$; (II). 
Symmetry: $\mathrm{Dist}(\alpha_1, \alpha_2) = \mathrm{Dist}(\alpha_2, \alpha_1)$; (III). Triangle: $\mathrm{Dist}(\alpha_1, \alpha_2) + \mathrm{Dist}(\alpha_2, \alpha_3) \ge \mathrm{Dist}(\alpha_1, \alpha_3)$; (IV). Coincidence: $\mathrm{Dist}(\alpha_1, \alpha_2) = 0 \Leftrightarrow \alpha_1 = \alpha_2$.

Figure 2: Visualizations of the measurer lines learned by CDML in two-dimensional space. The grayscale denotes the distance from the origin point to the current point of the learned measurer line.

Table 1: Classification error rates (%, mean ± std) of all methods on synthetic datasets including Intersected Lines, Concentric Circles, and Nested Moons.

Datasets           | LMNN [27]  | ITML [7]   | Npair [22] | Angular [25] | ODML [28]  | CERML [14] | CDML
Intersected Lines  | 14.33±1.21 | 17.46±2.11 | 8.52±0.99  | 8.10±3.24    | 10.52±2.17 | 6.21±1.92  | 5.12±1.13
Concentric Circles | 16.62±2.14 | 15.92±3.12 | 9.13±1.51  | 8.98±1.89    | 11.31±2.23 | 10.32±2.16 | 6.95±1.41
Nested Moons       | 17.02±2.23 | 12.04±2.14 | 9.22±1.89  | 10.12±2.09   | 15.12±1.98 | 11.12±2.41 | 8.64±2.45

Notice that the above Theorem 4 has the same conclusion as the traditional linear distance metric when $\theta_i(t)$ is specialized by $p_i t$, and thus such a result is a generalization of the property in the linear model [23]. In fact, most real-world data indeed have the non-negativity, symmetry, triangle, and coincidence properties. Hence this theorem clearly tells us that the basic geometric characteristics of real-world data can be well preserved in the curvilinear distance metric.

4 Experimental Results

In this section, we show our experimental results on both synthetic and real-world benchmark datasets to validate the effectiveness of CDML. We first visualize the learning results of CDML on synthetic datasets. 
After that, we compare the classification and verification accuracy of CDML with two classical metric learning methods (LMNN [27] and ITML [7]) and four state-of-the-art methods (Npair loss [22], Angular loss [25], ODML [28], and CERML [14]). Here LMNN, ITML, and ODML are linear, and the others are nonlinear. In our experiments, the parameters $\lambda$ and $c$ are fixed to 1.2 and 10, respectively. The SGD parameters $h$ and $\rho$ are fixed to $10^3$ and $10^{-3}$, respectively. We follow ITML and use the squared hinge loss and the squared Frobenius-norm as $L$ and $R$ in Eq. (11), respectively.

4.1 Visualizations on Synthetic Datasets

To demonstrate the model effectiveness on nonlinear data, we first visualize the learning results of CDML on nonlinear synthetic datasets including Intersected Lines, Concentric Circles, and Nested Moons [5]. Each dataset contains more than 300 data points across 2 classes in the two-dimensional space. On each dataset, 60% of all data is randomly selected for training, and the rest is used for testing. The measurer line count $m$ is fixed to 1 to clearly visualize the learned results.

As illustrated in Fig. 2, the learned measurer lines are plotted as gray lines, of which the gray-level denotes the arc length distance from the origin point to the current point. According to the definition in Eq. (6), we can clearly observe that the nearest points of data points from the two classes are distributed apart on the two sides of the measurer lines (i.e., the low gray-level and the high gray-level). Therefore, such measurement results correctly predict large values for inter-class distances and small values for intra-class distances. Furthermore, the test error rates of all compared methods are shown in Table 1, and we find that DDML, PML, and CDML obtained superior results over other methods due to their non-linearity. 
Meanwhile, CDML achieves the lowest error rates among all compared methods on the three datasets, which validates the superiority of our method.

4.2 Comparisons on Classification Datasets

To evaluate all compared methods on the classification task, we adopt the k-NN classifier (k = 5) based on the learned metrics and record the resulting classification error rates. The datasets are from the well-known UCI machine learning repository [1], including MNIST, Autompg, Sonar, Australia, Hayes-r, Glass, Segment, Balance, Isolet, and Letters.

Table 2: Classification error rates (%, mean ± std) of all methods on real-world datasets. The last row lists the Win/Tie/Loss counts of CDML against other methods with a t-test at significance level 0.05.

Datasets    LMNN [27]     ITML [7]      Npair [22]    Angular [25]   ODML [28]     CERML [14]    CDML
MNIST       17.46±5.32    14.32±7.32    11.56±1.07    12.26±5.12     12.12±5.22    13.36±2.32    8.12±3.64
Autompg     25.92±3.32    26.62±3.21    21.95±1.52    19.02±3.01     25.32±5.32    26.36±3.02    15.32±6.11
Sonar       16.04±5.31    18.02±3.52    15.31±2.56    16.86±1.21     17.95±6.78    19.21±6.33    15.40±3.64
Australia   15.51±2.53    17.52±2.13    15.12±5.11    15.54±1.23     16.23±4.12    18.26±6.22    12.22±2.54
Hayes-r     30.46±7.32    34.24±6.32    24.36±2.17    23.12±1.37     29.76±1.07    30.12±5.32    25.15±5.23
Glass       30.12±2.32    29.11±3.28    22.32±4.72    23.02±1.22     28.26±1.22    29.11±0.12    22.12±4.64
Segment     2.73±0.82     5.16±2.22     8.77±0.32     4.11±1.22      3.76±1.34     5.36±3.12     1.23±0.32
Balance     9.93±1.62     9.31±2.21     8.12±1.97     7.12±2.22      8.63±2.22     9.45±5.45     5.01±2.64
Isolet      3.23±1.32     9.23±2.32     5.43±2.12     5.49±1.12      2.68±0.72     7.26±2.32     3.12±1.64
Letters     4.21±2.05     6.24±0.32     5.14±1.04     4.67±1.82      4.88±0.82     5.32±2.22     2.09±0.64
W/T/L       8/2/0         9/1/0         5/5/0         6/4/0          8/2/0         8/2/0         —

Figure 3: ROC curves of all methods on the 3 datasets. AUC values are presented in the legends.

We compare all methods over 20 random trials. In each trial, 80% of the examples are randomly selected for training, and the rest are used for testing. The training pairs are generated by randomly picking 1000K(K−1) pairs among the training examples [31], where K is the number of classes. Here the measurer line count m is fixed to the feature dimensionality (i.e., d). The average classification error rates of all compared methods are shown in Table 2. We also perform a t-test (significance level 0.05) to validate the superiority of our method over every baseline on each dataset. From the experimental results, we observe that CDML obtains significant improvements over the linear metric learning models, which demonstrates the usefulness of our proposed curvilinear generalization. Furthermore, the statistics of the average error rates and the t-test results reliably validate the superiority of our method over the other baselines.

4.3 Comparisons on Verification Datasets

We use two face datasets and one image matching dataset to evaluate the capabilities of all compared methods on image verification.
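In these verification experiments, a pair is declared "matching" when its metric distance falls below a threshold, and sweeping the threshold traces the ROC curve. The sketch below computes ROC points and the AUC from pairwise distances (the Gaussian toy distances are our illustrative stand-in for real metric outputs):

```python
import numpy as np

def roc_points(dist, same, thresholds):
    """(FPR, TPR) of a verifier that declares 'same' when distance <= threshold."""
    dist = np.asarray(dist, dtype=float)
    same = np.asarray(same, dtype=bool)
    tpr = np.array([(dist[same] <= t).mean() for t in thresholds])
    fpr = np.array([(dist[~same] <= t).mean() for t in thresholds])
    return fpr, tpr

def auc_from_distances(dist, same):
    """AUC via the Mann-Whitney statistic: the probability that a matching
    pair receives a smaller distance than a non-matching pair (ties count half)."""
    dist = np.asarray(dist, dtype=float)
    same = np.asarray(same, dtype=bool)
    pos, neg = dist[same], dist[~same]
    smaller = (pos[:, None] < neg[None, :]).mean()
    ties = (pos[:, None] == neg[None, :]).mean()
    return smaller + 0.5 * ties

# Toy usage: matching pairs have smaller distances on average.
rng = np.random.default_rng(0)
dist = np.concatenate([rng.normal(1.0, 0.5, 500), rng.normal(2.5, 0.5, 500)])
same = np.concatenate([np.ones(500, dtype=bool), np.zeros(500, dtype=bool)])
fpr, tpr = roc_points(dist, same, np.linspace(dist.min(), dist.max(), 100))
auc = auc_from_distances(dist, same)
print(f"AUC: {auc:.3f}")
```

The Mann-Whitney form is threshold-free and equals the area under the full ROC curve, which makes it a convenient single-number summary for comparing metrics.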
The PubFig face dataset includes 2 × 10^4 image pairs belonging to 140 people [15], in which the first 80% of the data pairs are selected for training and the rest are used for testing. Similar experiments are performed on the LFW face dataset, which includes 13,233 unconstrained face images [15]. The MVS dataset [3] consists of over 3 × 10^4 image patches sampled from 3D objects, in which 10^5 pairs are selected to form the training set and 10^4 pairs are used for testing. The adopted features are extracted by DSIFT [6] for the face datasets and by Siamese-CNN [32] for the image patch dataset. We plot the Receiver Operating Characteristic (ROC) curve by varying the decision threshold of each distance metric, and then calculate the Area Under the Curve (AUC) values to quantitatively evaluate all compared methods. From the ROC curves and AUC values in Fig. 3, it is clear that CDML consistently outperforms the other methods, achieving the highest AUC on all three datasets; in particular, it surpasses the strongest baseline on each dataset (Angular on PubFig and MVS, and ODML on LFW).

5 Conclusion

In this paper, we introduced a new insight into the mechanism of metric learning models, in which the measured distance is naturally interpreted as the arc length between the nearest points on the learned straight measurer lines. We extended such straight measurer lines to general curved lines to further learn the intrinsic geometries of the training data. To the best of our knowledge, this is the first metric learning work that adaptively constructs the geometric relations between data points. We provided theoretical analysis showing that the proposed framework can be well applied to general nonlinear data.
Visualizations on toy data indicate that the learned measurer lines faithfully capture the underlying geometric rules, thus enabling the learning algorithm to acquire a more reliable and precise metric than the state-of-the-art methods.

Figure 3 (legend values): AUC on PubFig — LMNN 0.68429, ITML 0.73196, Npair 0.88111, Angular 0.90928, ODML 0.86952, CERML 0.75923, CDML 0.92908; on LFW — LMNN 0.6564, ITML 0.63667, Npair 0.87167, Angular 0.82805, ODML 0.889, CERML 0.67924, CDML 0.90741; on MVS — LMNN 0.57987, ITML 0.61335, Npair 0.73184, Angular 0.75125, ODML 0.71639, CERML 0.63211, CDML 0.78559.

Acknowledgment

S.C., J.Y., and C.G. were supported by the National Science Fund (NSF) of China under Grants U1713208, 61602246, and 61973162, the Program for Changjiang Scholars, the "111" Program (AH92005), the Fundamental Research Funds for the Central Universities (No. 30918011319), the NSF of Jiangsu Province (No. BK20171430), the "Young Elite Scientists Sponsorship Program" by Jiangsu Province, and the "Young Elite Scientists Sponsorship Program" by CAST (No. 2018QNRC001). L.L. and H.H. were supported by U.S. NSF-IIS 1836945, NSF-IIS 1836938, NSF-DBI 1836866, NSF-IIS 1845666, NSF-IIS 1852606, NSF-IIS 1838627, and NSF-IIS 1837956.

References

[1] Arthur Asuncion and David Newman. UCI machine learning repository, 2007.

[2] Krishnakumar Balasubramanian and Saeed Ghadimi. Zeroth-order (non)-convex stochastic optimization via conditional gradient and gradient updates. In NeurIPS, 2018.

[3] Matthew Brown, Gang Hua, and Simon Winder. Discriminative learning of local image descriptors. IEEE Transactions on Pattern Analysis and Machine Intelligence, 33(1), 2011.

[4] Shuo Chen, Chen Gong, Jian Yang, Xiang Li, Yang Wei, and Jun Li.
Adversarial metric learning. In IJCAI, 2018.

[5] Tian Qi Chen, Yulia Rubanova, Jesse Bettencourt, and David K Duvenaud. Neural ordinary differential equations. In NeurIPS, 2018.

[6] W Cheung and G Hamarneh. n-SIFT: n-dimensional scale invariant feature transform. IEEE Transactions on Image Processing, 18(9), 2009.

[7] Jason V Davis, Brian Kulis, Prateek Jain, Suvrit Sra, and Inderjit S Dhillon. Information-theoretic metric learning. In ICML, 2007.

[8] Yueqi Duan, Wenzhao Zheng, Xudong Lin, Jiwen Lu, and Jie Zhou. Deep adversarial metric learning. In CVPR, 2018.

[9] Søren Hauberg, Oren Freifeld, and Michael J Black. A geometric take on metric learning. In NeurIPS, 2012.

[10] Junlin Hu, Jiwen Lu, and Yap-Peng Tan. Discriminative deep metric learning for face verification in the wild. In CVPR, 2014.

[11] Feihu Huang, Songcan Chen, and Heng Huang. Faster stochastic alternating direction method of multipliers for nonconvex optimization. In ICML, 2019.

[12] Feihu Huang, Bin Gu, Zhouyuan Huo, Songcan Chen, and Heng Huang. Faster gradient-free proximal stochastic methods for nonconvex nonsmooth optimization. In AAAI, 2019.

[13] Zhiwu Huang, Ruiping Wang, Shiguang Shan, and Xilin Chen. Projection metric learning on Grassmann manifold with application to video based face recognition. In CVPR, 2015.

[14] Zhiwu Huang, Ruiping Wang, Shiguang Shan, Luc Van Gool, and Xilin Chen. Cross Euclidean-to-Riemannian metric learning with application to face recognition from video. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2018.

[15] Zhouyuan Huo, Feiping Nie, and Heng Huang. Robust and effective metric learning using capped trace norm: Metric learning via capped trace norm. In KDD, 2016.

[16] Niels Landwehr, Mark Hall, and Eibe Frank. Logistic model trees. Machine Learning, 2005.
[17] John M McNamee and Victor Pan. Numerical methods for roots of polynomials. Newnes, 2013.

[18] Carl D Meyer. Matrix analysis and applied linear algebra, volume 71. SIAM, 2000.

[19] Benjamin Paaßen, Claudio Gallicchio, Alessio Micheli, and Barbara Hammer. Tree edit distance learning via adaptive symbol embeddings. In ICML, 2018.

[20] Joel B Predd, Sanjeev R Kulkarni, and H Vincent Poor. A collaborative training algorithm for distributed learning. IEEE Transactions on Information Theory, 2009.

[21] Walter Rudin. Principles of mathematical analysis. McGraw-Hill, New York, 1964.

[22] Kihyuk Sohn. Improved deep metric learning with multi-class n-pair loss objective. In NeurIPS, 2016.

[23] Juan Luis Suárez, Salvador García, and Francisco Herrera. A tutorial on distance metric learning: Mathematical foundations, algorithms and software. arXiv:1812.05944, 2018.

[24] Fang Wang, Le Kang, and Yi Li. Sketch-based 3D shape retrieval using convolutional neural networks. In CVPR, 2015.

[25] Jian Wang, Feng Zhou, Shilei Wen, Xiao Liu, and Yuanqing Lin. Deep metric learning with angular loss. In ICCV, 2017.

[26] Jun Wang, Huyen T Do, Adam Woznica, and Alexandros Kalousis. Metric learning with multiple kernels. In NeurIPS, 2011.

[27] Kilian Q Weinberger, John Blitzer, and Lawrence K Saul. Distance metric learning for large margin nearest neighbor classification. In NeurIPS, 2006.

[28] Pengtao Xie, Wei Wu, Yichen Zhu, and Eric P Xing. Orthogonality-promoting distance metric learning: convex relaxation and theoretical analysis. In ICML, 2018.

[29] Jie Xu, Lei Luo, Cheng Deng, and Heng Huang. Bilevel distance metric learning for robust image recognition. In NeurIPS, 2018.

[30] Han-Jia Ye, De-Chuan Zhan, Xue-Min Si, Yuan Jiang, and Zhi-Hua Zhou.
What makes objects similar: a unified multi-metric learning approach. In NeurIPS, 2016.

[31] Pourya Zadeh, Reshad Hosseini, and Suvrit Sra. Geometric mean metric learning. In ICML, 2016.

[32] Sergey Zagoruyko and Nikos Komodakis. Learning to compare image patches via convolutional neural networks. In ICCV, 2015.

[33] Pengfei Zhu, Hao Cheng, Qinghua Hu, Qilong Wang, and Changqing Zhang. Towards generalized and efficient metric learning on Riemannian manifold. In IJCAI, 2018.