{"title": "Fast Low-rank Metric Learning for Large-scale and High-dimensional Data", "book": "Advances in Neural Information Processing Systems", "page_first": 819, "page_last": 829, "abstract": "Low-rank metric learning aims to learn better discrimination of data subject to low-rank constraints. It keeps the intrinsic low-rank structure of datasets and reduces the time cost and memory usage in metric learning. However, it is still a challenge for current methods to handle datasets with both high dimensions and large numbers of samples. To address this issue, we present a novel fast low-rank metric learning (FLRML) method. FLRML casts the low-rank metric learning problem into an unconstrained optimization on the Stiefel manifold, which can be efficiently solved by searching along the descent curves of the manifold. FLRML significantly reduces the complexity and memory usage in optimization, which makes the method scalable to both high dimensions and large numbers of samples. Furthermore, we introduce a mini-batch version of FLRML to make the method scalable to larger datasets which are hard to be loaded and decomposed in limited memory. The outperforming experimental results show that our method is with high accuracy and much faster than the state-of-the-art methods under several benchmarks with large numbers of high-dimensional data. 
Code has been made available at https://github.com/highan911/FLRML.", "full_text": "Fast Low-rank Metric Learning for Large-scale and High-dimensional Data\n\nHan Liu†, Zhizhong Han‡, Yu-Shen Liu† ∗, Ming Gu†\n† School of Software, Tsinghua University, Beijing, China\n‡ Department of Computer Science, University of Maryland, College Park, USA\nBNRist & KLISS, Beijing, China\nliuhan15@mails.tsinghua.edu.cn, h312h@umd.edu, liuyushen@tsinghua.edu.cn, guming@tsinghua.edu.cn\n\nAbstract\n\nLow-rank metric learning aims to learn better discrimination of data subject to low-rank constraints. It keeps the intrinsic low-rank structure of datasets and reduces the time cost and memory usage in metric learning. However, it is still a challenge for current methods to handle datasets with both high dimensions and large numbers of samples. To address this issue, we present a novel fast low-rank metric learning (FLRML) method. FLRML casts the low-rank metric learning problem into an unconstrained optimization on the Stiefel manifold, which can be efficiently solved by searching along the descent curves of the manifold. FLRML significantly reduces the complexity and memory usage in optimization, which makes the method scalable to both high dimensions and large numbers of samples. Furthermore, we introduce a mini-batch version of FLRML to make the method scalable to larger datasets which are hard to load and decompose in limited memory. Experimental results show that our method achieves high accuracy and is much faster than the state-of-the-art methods on several benchmarks with large numbers of high-dimensional data. Code has been made available at https://github.com/highan911/FLRML.\n\n1 Introduction\n\nMetric learning aims to learn a distance (or similarity) metric from supervised or semi-supervised information, which provides better discrimination between samples. 
Metric learning has been widely used in various areas, such as dimensionality reduction [1, 2, 3], robust feature extraction [4, 5] and information retrieval [6, 7]. For existing metric learning methods, the huge time cost and memory usage are major challenges when dealing with high-dimensional datasets with large numbers of samples. To resolve this issue, low-rank metric learning (LRML) methods optimize a metric matrix subject to low-rank constraints. These methods tend to keep the intrinsic low-rank structure of the dataset and, at the same time, reduce the time cost and memory usage in the learning process. Reducing the matrix size in optimization is an important idea for reducing time and memory usage. However, the size of the matrix to be optimized still increases linearly or quadratically with either the dimensions, the number of samples, or the number of pairwise/triplet constraints. As a result, the metric learning task remains a research challenge on datasets with both high dimensions and large numbers of samples [8, 9].\nTo address this issue, we present a Fast Low-Rank Metric Learning (FLRML) method. In contrast to state-of-the-art methods, FLRML introduces a novel formulation to better employ the low-rank constraints to further reduce the complexity and the size of the involved matrices, which enables FLRML to achieve high accuracy and faster speed on large numbers of high-dimensional data.\n\n∗Corresponding author: Yu-Shen Liu.\n\n33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.\n\n
Our main contributions are listed as follows.\n- Modeling the constrained metric learning problem as an unconstrained optimization that can be efficiently solved on the Stiefel manifold, which makes our method scalable to large numbers of samples and constraints.\n- Reducing the matrix size and complexity in optimization as much as possible while ensuring accuracy, which makes our method scalable to both large numbers of samples and high dimensions.\n- Furthermore, a mini-batch version of FLRML is proposed to make the method scalable to larger datasets which are hard to fully load in memory.\n\n2 Related Work\nIn metric learning tasks, the training dataset can be represented as a matrix X = [x1, ..., xn] ∈ R^{D×n}, where n is the number of training samples and each sample xi has D dimensions. Metric learning methods aim to learn a metric matrix M ∈ R^{D×D} from the training set in order to obtain better discrimination between samples. Some low-rank metric learning (LRML) methods have been proposed to obtain robust metrics of data, and to reduce the computational cost of high-dimensional metric learning tasks. Since optimization with a fixed low-rank constraint is nonconvex, naive gradient descent methods easily fall into bad local optima [2, 10]. In terms of the different strategies used to remedy this issue, the existing LRML methods can be roughly divided into the following two categories.\nOne type of method [1, 11, 12, 13, 14] introduces low-rankness-encouraging norms (such as the nuclear norm) as regularization, which relaxes the nonconvex low-rank constrained problems to convex problems. 
The two disadvantages of such methods are: (1) the norm regularization can only encourage low-rankness, but cannot limit the upper bound of the rank; (2) the matrix to be optimized is still of size D^2 or n^2.\nAnother type of method [2, 3, 15, 16, 17, 18] considers the low-rank constrained space as a Riemannian manifold. This type of method can obtain high-quality solutions of the nonconvex low-rank constrained problems. However, for these methods, the matrices to be optimized are still at least linear in either D or n, so their performance still suffers on large-scale and high-dimensional datasets.\nBesides low-rank metric learning methods, there are some other types of methods for speeding up metric learning on large and high-dimensional datasets. Online metric learning [6, 7, 19, 20] randomly takes one sample at a time. Sparse metric learning [21, 22, 23, 24, 25, 26] represents the metric matrix as a sparse combination of pre-generated rank-1 bases. Non-iterative metric learning [27, 28] avoids iterative calculation by providing explicit optimal solutions. In the experiments, some state-of-the-art methods of these types will also be included for comparison. Compared with these methods, our method also has advantages in time and memory usage on large-scale and high-dimensional datasets. A literature review of the many available metric learning methods is beyond the scope of this paper. The reader may consult Refs. [8, 9, 29] for detailed expositions.\n\n3 Fast Low-Rank Metric Learning (FLRML)\n\nThe metric matrix M is usually positive semidefinite, which guarantees non-negative distances and non-negative self-similarities. 
A semidefinite M can be represented as the transpose multiplication of two identical matrices, M = L^T L, where L ∈ R^{d×D} is a row-full-rank matrix and rank(M) = d. Using the matrix L as a linear transformation, the training set X can be mapped into Y ∈ R^{d×n}, denoted by Y = LX. Each column vector yi in Y is the corresponding d-dimensional vector of the column vector xi in X.\nIn this paper, we present a fast low-rank metric learning (FLRML) method, which typically learns a low-rank cosine similarity metric from triplet constraints T. The cosine similarity between a pair of vectors (xi, xj) is measured by their corresponding low-dimensional vectors (yi, yj) as sim(xi, xj) = yi^T yj / (||yi|| ||yj||). Each constraint {i, j, k} ∈ T refers to the comparison of a pair of similarities, sim(xi, xj) > sim(xi, xk).\nTo solve the problem of metric learning on large-scale and high-dimensional datasets, our motivation is to reduce the matrix size and complexity in optimization as much as possible while ensuring accuracy. Our idea is to embed the triplet constraints into a matrix K, so that the constrained metric learning problem can be cast into an unconstrained optimization in the “form” of tr(WK), where W is a low-rank semidefinite matrix to be optimized (Section 3.1). By reducing the sizes of W and K to R^{r×r} (r = rank(X)), the complexity and memory usage are greatly reduced. An unconstrained optimization in this form can be efficiently solved on the Stiefel manifold (Section 3.2). 
In addition, a mini-batch version of FLRML is proposed, which makes our method scalable to larger datasets that are hard to fully load and decompose in memory (Section 3.3).\n\n3.1 Forming the Objective Function\nUsing margin loss, each triplet t = {i, j, k} in T corresponds to a loss function l({i, j, k}) = max(0, m − sim(xi, xj) + sim(xi, xk)). A naive idea is to sum the loss functions, but when n and |T| are very large, evaluating the loss functions will be time-consuming. Our idea is to embed the evaluation of the loss functions into matrices to speed up their calculation.\nFor each triplet t = {i, j, k} in T, a sparse matrix C(t) of size n × n is generated, with c(t)_ji = 1 and c(t)_ki = −1. The summation of all C(t) matrices is represented as C = Σ_{t∈T} C(t), and the matrix YC is of size R^{d×n}. Let Ti be the subset of triplets with i as the first item, and let ỹi be the i-th column of YC; then ỹi can be written as ỹi = Σ_{{i,j,k}∈Ti} (yj − yk), the sum of the positive samples minus the sum of the negative samples for the set Ti. Multiplying by 1/(|Ti|+1) on both sides, ỹi/(|Ti|+1) is the mean of the positive samples minus the mean of the negative samples (the “+1” avoids a zero denominator).\nLet zi = −(1/(|Ti|+1)) yi^T ỹi; then by minimizing zi, the vector yi tends to be closer to the positive samples than to the negative samples. Let T be a diagonal matrix with Tii = 1/(|Ti|+1); then zi is the i-th diagonal element of −Y^T YCT. The loss function can be constructed by putting zi into the margin loss as L(T) = Σ_{i=1}^n max(0, zi + m). A binary function λ(x) is defined as: if x > 0, then λ(x) = 1; otherwise, λ(x) = 0. 
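As a concrete check of this construction, the following toy example (illustrative NumPy code, not the paper's released implementation; the sample values and the single triplet are made up) builds C and T for one triplet and verifies that the anchor's z value is negative when it is closer to its positive than to its negative sample:

```python
import numpy as np

# Toy check of the loss construction: 3 samples in d = 2 dimensions and one
# triplet {i, j, k} = {0, 1, 2}, meaning sample 0 should be more similar to
# sample 1 (positive) than to sample 2 (negative).
Y = np.array([[1.0, 0.9, -1.0],
              [0.0, 0.1,  0.0]])      # columns are y_0, y_1, y_2
triplets = [(0, 1, 2)]
n = Y.shape[1]

C = np.zeros((n, n))                  # constraint matrix C (sparse in practice)
counts = np.zeros(n)                  # |T_i| for each anchor i
for i, j, k in triplets:
    C[j, i] += 1.0                    # c_ji = 1
    C[k, i] -= 1.0                    # c_ki = -1
    counts[i] += 1.0

T = np.diag(1.0 / (counts + 1.0))     # T_ii = 1 / (|T_i| + 1)
z = np.diag(-Y.T @ Y @ C @ T)         # z_i: i-th diagonal of -Y^T Y C T

m = 0.1                               # margin
loss = np.sum(np.maximum(0.0, z + m))
print(z[0] < 0, loss >= 0.0)          # anchor 0 is pulled toward its positive
```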
By introducing the function λ(x), the loss function L(T) can be written as L(T) = Σ_{i=1}^n (zi + m) λ(zi + m), which can be further represented in matrices:\n\nL(T) = −tr(Y^T YCTΛ) + M(Λ), (1)\n\nwhere Λ is a diagonal matrix with Λii = λ(zi + m), and M(Λ) = m Σ_{i=1}^n Λii is the sum of the constant terms in the margin loss.\nIn Eq.(1), Y ∈ R^{d×n} is the optimization variable. For any value of Y, the corresponding value of L can be obtained by solving the system of linear equations Y = LX. A minimum-norm least-squares solution of Y = LX is L = YVΣ^{−1}U^T, where [U ∈ R^{D×r}, Σ ∈ R^{r×r}, V ∈ R^{n×r}] is the SVD of X. Based on this, the size of the optimization variable can be reduced from R^{d×n} to R^{d×r}, as shown in Theorem 1.\nTheorem 1. {Y : Y = BV^T, B ∈ R^{d×r}} is a subset of Y that covers all the possible minimum-norm least-squares solutions of L.\nProof. By substituting Y = BV^T into L = YVΣ^{−1}U^T,\n\nL = BΣ^{−1}U^T. (2)\n\nSince U and Σ are constants, B ∈ R^{d×r} covers all the possible minimum-norm least-squares solutions of L.\nBy substituting Y = BV^T into Eq.(1), the sizes of W and K can be reduced to R^{r×r}:\n\nL(T) = −tr(B^T BV^T CTΛV) + M(Λ). (3)\n\nThis function is in the form of tr(WK), where W = B^T B and K = −V^T CTΛV. The sizes of W and K are reduced to R^{r×r}, with r ≤ min(n, D). In addition, V^T CT ∈ R^{r×n} is a constant matrix that does not change during the optimization, so this model has low complexity.\nIt should be noted that the purpose of the SVD here is not approximation. If all the ranks of X are kept, i.e. r = rank(X), the solutions are exact. In practice, it is also reasonable to neglect the smallest singular values of X to speed up the calculation. 
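The reduction in Theorem 1 and Eq.(3) can be sanity-checked numerically. The sketch below (our own illustrative NumPy code, not the released implementation; the random data and the simple one-triplet-per-sample pattern are assumptions) verifies that tr(WK) with K = −V^T CTΛV reproduces the full n × n computation −tr(Y^T YCTΛ):

```python
import numpy as np

rng = np.random.default_rng(0)
D, n, d = 50, 20, 3
X = rng.standard_normal((D, n))

# SVD of X: X = U Sigma V^T, keeping r = rank(X) components.
U, s, Vt = np.linalg.svd(X, full_matrices=False)
r = int(np.sum(s > 1e-10))
V = Vt[:r, :].T                        # n x r

# One triplet per anchor i: (i, j, k), j a "positive", k a "negative".
C = np.zeros((n, n))
for i in range(n):
    C[(i + 1) % n, i] = 1.0            # c_ji = 1
    C[(i + 2) % n, i] = -1.0           # c_ki = -1
Tm = np.diag(np.full(n, 0.5))          # T_ii = 1/(|T_i|+1), |T_i| = 1

B = rng.standard_normal((d, r))        # reduced variable, Y = B V^T
Y = B @ V.T
m = 0.1
z = np.diag(-Y.T @ Y @ C @ Tm)
Lam = np.diag((z + m > 0).astype(float))

K = -V.T @ C @ Tm @ Lam @ V            # constant-sized r x r matrix
W = B.T @ B
lhs = np.trace(W @ K)                  # reduced r x r objective term
rhs = -np.trace(Y.T @ Y @ C @ Tm @ Lam)  # full n x n objective term
print(np.isclose(lhs, rhs))
```

The agreement follows from the cyclic property of the trace: −tr(VB^T BV^T CTΛ) = tr(B^T B · (−V^T CTΛV)).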
In the experiments, an upper bound is set as r = min(rank(X), 3000), since most computers can easily handle a matrix of size 3000^2, and the information in most datasets is preserved well.\n\n3.2 Optimizing on the Stiefel Manifold\nThe matrix W = B^T B is the low-rank semidefinite matrix to be optimized. Due to the non-convexity of low-rank semidefinite optimization, directly optimizing B in the linear space often falls into bad local optima [2, 10]. The mainstream strategy for low-rank semidefinite problems is to carry out the optimization on manifolds.\nThe Stiefel manifold St(d, r) is defined as the set of r×d column-orthogonal matrices, i.e., St(d, r) = {P ∈ R^{r×d} : P^T P = Id}. Any semidefinite W with rank(W) = d can be represented as W = PSP^T, where P ∈ St(d, r) and S ∈ {diag(s) : s ∈ R^d_+}. Since P is already restricted to the Stiefel manifold, we only need to add a regularization term for s. We want to guarantee the existence of a dense finite optimal solution for s, so the L2-norm of s is used as a regularization term. By adding (1/2)||s||^2 into Eq.(3), we get\n\nf0(W) = tr(PSP^T K) + (1/2)||s||^2 + M(Λ) = Σ_{i=1}^d ((1/2)si^2 + si pi^T Kpi) + M(Λ), (4)\n\nwhere pi is the i-th column of P.\nLet ki = −pi^T Kpi. Since f0(W) is a quadratic function of each si, for any value of P, the unique corresponding optimal solution of s ∈ R^d_+ is\n\nŝ = {[ŝ1, ..., ŝd]^T : ŝi = max(0, ki)}. (5)\n\nBy substituting the ŝi values into Eq.(4), s can be eliminated from f0(W), which converts it to a new function f(P) that depends only on P, as shown in the following Theorem 2.\nTheorem 2.\n\nf(P) = −(1/2) Σ_{i=1}^d (max(0, ki) ki) + M(Λ). (6)\n\nProof. 
An original form of f(P) can be obtained by substituting Eq.(5) into Eq.(4):\n\nf(P) = Σ_{i=1}^d (((1/2)max(0, ki) − ki) max(0, ki)) + M(Λ).\n\nThe f(P) in Eq.(6) is equal to this formula under both the ki ≤ 0 and ki > 0 conditions.\nIn order to make the gradient of f(P) continuous, and to keep s dense and positive, we adopt the function μ(x) = −log(σ(−x)) as a smoothing of max(0, x) [2], where σ(x) = 1/(1 + exp(−x)) is the sigmoid function. The function μ(x) satisfies μ(x) → x as x → +∞ and μ(x) → 0 as x → −∞, and its derivative is dμ(x)/dx = σ(x). Figure 1 displays a sample plot of max(0, x) and μ(x).\n\nFigure 1: A sample plot of max(0, x) and μ(x).\n\nAlgorithm 1 FLRML\n1: Input: the data matrix X ∈ R^{D×n}, the sparse supervision matrix C ∈ R^{n×n}, the low-rank constraint d\n2: [U ∈ R^{D×r}, Σ ∈ R^{r×r}, V ∈ R^{n×r}] ← SVD(X)\n3: Calculate the constant matrix V^T CT\n4: Randomly initialize P ∈ St(d, r)\n5: Initialize S to satisfy Eq.(8)\n6: repeat\n7: Update Λ and K by Eq.(3)\n8: Update G by Eq.(9)\n9: Update P and S by searching on h(τ) in Eq.(10)\n10: until convergence\n11: Output: L ← √S P^T Σ^{−1} U^T\n\nAlgorithm 2 M-FLRML\n1: Input: the data matrices XI ∈ R^{D×nI}, the supervision matrices CI ∈ R^{nI×nI}, the low-rank constraint d\n2: Randomly initialize L ∈ R^{d×D}\n3: for I = 1 : T do\n4: [UI, ΣI, VI] ← SVD(XI)\n5: Get PI and sI by Eq.(11)\n6: Update Λ and K by Eq.(3)\n7: Update PI and sI by Eq.(12) and Eq.(5)\n8: Get ΔL by Eq.(13)\n9: L ← L + (1/√I) ΔL\n10: end for\n11: Output: L\n\nUsing this smoothed function, the loss function f(P) is redefined as\n\nf(P) = −(1/2) Σ_{i=1}^d (μ(ki) ki) + M(Λ). (7)\n\nThe initialization of S ∈ {diag(s) : s ∈ R^d_+} needs to satisfy the condition\n\ns = ŝ. (8)\n\nWhen P is fixed, ŝ is a linear function of K (see Eq.(5)), K is a linear function of Λ (see Eq.(3)), and the 0-1 values in Λ depend on s (see Eq.(3)). So Eq.(8) is a nonlinear equation in a form that can be easily solved iteratively by updating s with ŝ. Since ŝ ∈ O(K^1), K ∈ O(Λ^1), and Λ ∈ O(s^0), this iterative process has a superlinear convergence rate.\nTo solve the model of this paper, for a matrix P ∈ St(d, r), we need its gradient G = ∂f(P)/∂P.\nTheorem 3.\n\nG = ∂f(P)/∂P = −(K + K^T) P diag(q), (9)\n\nwhere qi = −(1/2)(μ(ki) + ki σ(ki)).\nProof. Since qi = ∂f(P)/∂ki = −(1/2)(μ(ki) + ki σ(ki)), the gradient can be derived from ∂f(P)/∂P = Σ_{i=1}^d qi ∂ki/∂P. Each ∂ki/∂P is easily obtained since ki = −pi^T Kpi.\nFor solving optimizations on manifolds, a commonly used method is “projection and retraction”, which first projects the gradient G onto the tangent space of the manifold as Ĝ, and then retracts (P − Ĝ) back to the manifold. For the Stiefel manifold, the projection of G onto the tangent space is Ĝ = G − PG^T P [10]. The retraction of the matrix (P − Ĝ) to the Stiefel manifold can be represented as retract(P − Ĝ), which is obtained by setting all the singular values of (P − Ĝ) to 1 [30].\nFor Stiefel manifolds, we adopt a more efficient algorithm [10], which performs a non-monotonic line search with Barzilai-Borwein step length [31, 32] along a descent curve of the Stiefel manifold. 
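Both update rules can be sketched in a few lines of generic Stiefel-manifold code (an illustrative NumPy sketch under our notation, not the authors' implementation; the curve-search step anticipates the descent curve defined next):

```python
import numpy as np

def retract(A):
    # Retraction to the Stiefel manifold: set all singular values of A to 1.
    U, _, Vt = np.linalg.svd(A, full_matrices=False)
    return U @ Vt

def step_projection_retraction(P, G):
    # "Projection and retraction": project G onto the tangent space at P
    # (G_hat = G - P G^T P), then retract (P - G_hat) back to the manifold.
    G_hat = G - P @ G.T @ P
    return retract(P - G_hat)

def step_descent_curve(P, G, tau):
    # Curve search h(tau) = (I + tau/2 H)^{-1} (I - tau/2 H) P with
    # skew-symmetric H = G P^T - P G^T, so h(tau) stays on the manifold.
    r = P.shape[0]
    H = G @ P.T - P @ G.T
    return np.linalg.solve(np.eye(r) + 0.5 * tau * H,
                           (np.eye(r) - 0.5 * tau * H) @ P)

rng = np.random.default_rng(1)
r, d = 8, 3
P = retract(rng.standard_normal((r, d)))   # random point with P^T P = I_d
G = rng.standard_normal((r, d))            # some Euclidean gradient

P1 = step_projection_retraction(P, G)
P2 = step_descent_curve(P, G, tau=0.5)
print(np.allclose(P1.T @ P1, np.eye(d)), np.allclose(P2.T @ P2, np.eye(d)))
```

Because H is skew-symmetric, the Cayley-type transform in `step_descent_curve` is orthogonal, which is why the iterate never leaves the manifold.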
A descent curve with parameter τ is defined as\n\nh(τ) = (I + (τ/2)H)^{−1} (I − (τ/2)H) P, (10)\n\nwhere H = GP^T − PG^T. The optimization is performed by searching for the optimal τ along the descent curve. The Barzilai-Borwein method predicts a step length according to the step lengths in previous iterations, which makes the method converge faster than “projection and retraction”.\nThe outline of the FLRML algorithm is shown in Algorithm 1. It can be mainly divided into four stages: SVD preprocessing (line 2), constant initialization (line 3), variable initialization (lines 4 and 5), and the iterative optimization (lines 6 to 11). In one iteration, the complexity of each step is: (a) updating Y and Λ: O(nrd); (b) updating K: O(nr^2); (c) updating G: O(r^2 d); (d) optimizing P and S: O(rd^2).\n\n3.3 Mini-batch FLRML\nIn FLRML, the maximum size of the constant matrices in the iterations is only R^{r×n} (V and V^T CT), and the maximum size of the variable matrices is only R^{r×r}. A smaller matrix size theoretically means the ability to process larger datasets in the same amount of memory. However, in practice, we find that the bottleneck is not the optimization process of FLRML. On large-scale and high-dimensional datasets, the SVD preprocessing may take more time and memory than the FLRML optimization itself, and for very large datasets it is difficult to load all the data into memory. In order to break this bottleneck, and make our method scalable to larger numbers of high-dimensional data in limited memory, we further propose Mini-batch FLRML (M-FLRML).\nInspired by stochastic gradient descent methods [18, 33], M-FLRML calculates a descent direction from each mini-batch of data, and updates L at a decreasing ratio. For the I-th mini-batch, we randomly select Nt triplets from the triplet set, and use the union of the samples to form a mini-batch with nI samples. 
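The mini-batch construction just described can be sketched as follows (illustrative NumPy code with made-up sizes; the padding of extra samples when the union is smaller than d is omitted here and handled in the text):

```python
import numpy as np

rng = np.random.default_rng(3)
D, n, Nt = 100, 50, 5
X = rng.standard_normal((D, n))
# Nt randomly selected triplets (i, j, k) from the triplet set.
triplets = [tuple(rng.choice(n, size=3, replace=False)) for _ in range(Nt)]

# Union of the samples appearing in the selected triplets.
idx = sorted({s for t in triplets for s in t})
n_I = len(idx)
X_I = X[:, idx]                      # D x n_I mini-batch data matrix

# C_I: the corresponding rows/columns of C, rebuilt with local indices.
local = {g: l for l, g in enumerate(idx)}
C_I = np.zeros((n_I, n_I))
for i, j, k in triplets:
    C_I[local[j], local[i]] += 1.0   # c_ji = 1
    C_I[local[k], local[i]] -= 1.0   # c_ki = -1

print(X_I.shape[0] == D, n_I <= 3 * Nt)
```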
Considering that the Stiefel manifold St(d, r) requires r ≥ d, if the number of samples in the union of triplets is less than d, we randomly add some other samples to make nI > d. The matrix XI ∈ R^{D×nI} is composed of the extracted columns of X, and CI ∈ R^{nI×nI} is composed of the corresponding columns and rows of C.\nThe objective f0(W) in Eq.(4) consists of small matrices of sizes R^{r×r} and R^{r×d}. Our idea is to first find the descent direction for the small matrices, and then map it back to get the descent direction of the large matrix L ∈ R^{d×D}. The matrix XI can be decomposed as XI = UI ΣI VI^T, and the complexity of decomposition is significantly reduced from O(Dnr) to O(DnI^2) on this mini-batch. According to Eq.(2), a matrix BI can be represented as BI = LUI ΣI. Using SVD, the matrix BI can be decomposed as BI = QI diag(√sI) PI^T, and then the variable W for the objective f0(W) can be represented as\n\nW = BI^T BI = PI diag(sI) PI^T. (11)\n\nIn FLRML, in order to convert f0(W) into f(P), the initial value of s satisfies the condition s = ŝ. But in M-FLRML, sI is generated from BI, so generally this condition is not satisfied. So instead, we take PI and sI as two variables, and find their descent directions separately. In Mini-batch FLRML, when a different mini-batch is taken in the next iteration, the predicted Barzilai-Borwein step length tends to be improper, so we use “projection and retraction” instead. The updated matrix P̂I is obtained as\n\nP̂I = retract(PI − GI + PI GI^T PI). (12)\n\nFor sI, we use Eq.(5) to get an updated vector ŝI. Then the updated matrix for BI can be obtained as B̂I = QI diag(√ŝI) P̂I^T. 
By mapping B̂I back to the high-dimensional space, the descent direction of L can be obtained as\n\nΔL = QI diag(√ŝI) P̂I^T ΣI^{−1} UI^T − L. (13)\n\nFor the I-th mini-batch, L is updated at a decreasing ratio as L ← L + (1/√I) ΔL. For a theoretical analysis of the stochastic strategy that updates with step sizes of 1/√I, the reader can refer to reference [18]. The outline of M-FLRML is shown in Algorithm 2.\n\n4 Experiments\n\n4.1 Experiment Setup\nIn the experiments, our FLRML and M-FLRML are compared with 5 state-of-the-art low-rank metric learning methods, including LRSML [1], FRML [2], LMNN [34], SGDINC [18], and DRML [3]. For these methods, the complexities, maximum variable sizes and maximum constant sizes in one iteration are compared in Table 2. Considering that d ≪ D and nI ≪ n, the relatively small items in the table are omitted.\nIn addition, four state-of-the-art metric learning methods of other types are also compared, including one sparse method (SCML [23]), one online method (OASIS [6]), and two non-iterative methods (KISSME [27], RMML [28]).\nThe methods are evaluated on eight datasets with high dimensions or large numbers of samples: three datasets NG20, RCV1-4 and TDT2-30 derived from three text collections, respectively [35, 36]; one handwritten characters dataset MNIST [37]; and four voxel datasets of 3D models, M10-16, M10-100, M40-16, and M40-100, with resolutions of 16^3 and 100^3 voxels, respectively, generated from “ModelNet10” and “ModelNet40” [38], which are widely used in 3D shape understanding [39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50]. To measure the similarity, the data vectors are normalized to unit length.\n\nTable 1: The datasets used in the experiments.\ndataset   D          n        ntest   ncat\nNG20      62,061     15,935   3,993   20\nRCV1      29,992     4,813    4,812   4\nTDT2      36,771     4,697    4,697   30\nMNIST     780        60,000   10,000  10\nM10-16    4,096      47,892   10,896  10\nM40-16    4,096      118,116  29,616  40\nM10-100   1,000,000  47,892   10,896  10\nM40-100   1,000,000  118,116  29,616  40\n\nTable 2: The complexity, variable matrix size and constant matrix size of 7 low-rank metric learning methods (in one iteration).\nmethods    complexity         size(var)    size(const)\nLMNN       |T|r^2 + r^3       |T|r + r^2   nr\nLRSML      n^2 d              n^2          n^2\nFRML       Dn^2 + D^2 d       D^2          Dn\nDRML       D^2|T| + D^2 d     Dd           D|T|\nSGDINC     Dd^2 + DnI^2       Dd           DnI\nFLRML      nrd + nr^2         r^2 + nd     nr\nM-FLRML    DnI^2 + DnI d      Dd           DnI\n\nTable 3: The classification accuracy (left) and training time (right, in seconds) of 7 metric learning methods with SVD preprocessing (rows: tsvd, OASIS, KISSME, RMML, LMNN, LRSML, FRML, FLRML; columns: NG20, RCV1, TDT2, MNIST, M10-16, M40-16).\n\nTable 4: The classification accuracy (left) and training time (right, in seconds) of 4 metric learning methods without SVD preprocessing (rows: SCML, DRML, SGDINC, M-FLRML; columns: all eight datasets).\n\n
The dimensions D, the number of training samples n, the number of test samples ntest, and the number of categories ncat of all the datasets are listed in Table 1.\nDifferent methods have different requirements for SVD preprocessing. In our experiments, a fast SVD algorithm [51] is adopted. The time tsvd for SVD preprocessing is listed at the top of Table 3. Using the same decomposed matrices as input, seven methods are compared: three methods (LRSML, LMNN, and our FLRML) require SVD preprocessing; four methods (FRML, KISSME, RMML, OASIS) do not mention SVD preprocessing, but since they need to optimize large dense R^{D×D} matrices, SVD has to be performed to prevent them from out-of-memory errors on high-dimensional datasets. For all these methods, the rank for SVD is set as r = min(rank(X), 3000). The remaining four methods (SCML, DRML, SGDINC, and our M-FLRML) claim that no SVD preprocessing is needed; they are compared using the original data matrices as input. Specifically, since the SVD calculation for the datasets M10-100 and M40-100 exceeds the memory limit of common PCs, only these four methods are tested on these two datasets.\nMost tested methods use either pairwise or triplet constraints, except for LMNN and FRML, whose implementations require directly inputting the labels. For the other methods, 5 triplets are randomly generated for each sample, which are also used as 5 positive pairs and 5 negative pairs for the methods using pairwise constraints. The accuracy is evaluated by a 5-NN classifier using the output metric of each method. For each low-rank metric learning method, the rank constraint for M is set as d = 100. All the experiments are performed on the Matlab R2015a platform on a PC with a 3.60GHz processor and 16GB of physical memory. 
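The constraint-generation step described above can be sketched as follows (an illustrative Python helper under stated assumptions: the `make_triplets` function and its uniform sampling scheme are our own, not taken from the paper's released code):

```python
import numpy as np

def make_triplets(labels, per_sample=5, seed=0):
    """For each anchor, draw `per_sample` triplets (anchor, same-class
    positive, other-class negative) uniformly at random from the labels."""
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    triplets = []
    for i in range(len(labels)):
        pos = np.flatnonzero((labels == labels[i]) &
                             (np.arange(len(labels)) != i))
        neg = np.flatnonzero(labels != labels[i])
        if len(pos) == 0 or len(neg) == 0:
            continue                    # anchor has no valid positive/negative
        for _ in range(per_sample):
            triplets.append((i, int(rng.choice(pos)), int(rng.choice(neg))))
    return triplets

labels = [0, 0, 0, 1, 1, 1]
T = make_triplets(labels, per_sample=5)
# Every triplet pairs an anchor with a same-class positive and a
# different-class negative.
ok = all(labels[i] == labels[j] and labels[i] != labels[k] for i, j, k in T)
print(ok, len(T) == 5 * len(labels))
```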
The code is available at https://github.com/highan911/FLRML.\n\n4.2 Experimental Results\nTable 3 and Table 4 list the classification accuracy (left) and training time (right, in seconds) of all the compared metric learning methods on all the datasets. The symbol “E” indicates that the objective fails to converge to a finite non-zero solution, and “M” indicates that the computation was aborted due to an out-of-memory error. The maximum accuracy and minimum time usage for each dataset are shown in bold.\n\nFigure 2: The convergence behavior of FLRML in optimization on 6 datasets.\n\nFigure 3: The change in accuracy of FLRML with different m/l̄y values on 6 datasets.\n\nFigure 4: The change in accuracy of M-FLRML on “TDT2” with different Nt values and numbers of mini-batches.\n\nComparing the results with the analysis of complexity in Table 2, we find that for many tested methods, if the complexity or matrix size is a polynomial of D, n or |T|, the efficiency on datasets with large numbers of samples is limited. As shown in Table 3 and Table 4, FLRML and M-FLRML are faster than the state-of-the-art methods on all datasets. Our methods achieve accuracy comparable with the state-of-the-art methods on all datasets, and obtain the highest accuracy on several datasets with both high dimensions and large numbers of samples.\nBoth our M-FLRML and SGDINC use mini-batches to improve efficiency. The theoretical complexity of the two methods is close, but in the experiments M-FLRML is faster. Generally, M-FLRML is less accurate than FLRML, but it significantly reduces the time and memory usage on large datasets. In the experiments, the largest dataset “M40-100” has size 1,000,000 × 118,116; a dense matrix of this size would take up 880 GB of memory. When using M-FLRML to process this 
When using M-FLRML to process this\ndata set, the recorded maximum memory usage of Matlab is only 6.20 GB (Matlab takes up 0.95\nGB of memory on startup). The experiment shows that M-FLRML is suitable for metric learning of\nlarge-scale high-dimensional data on devices with limited memory.\nIn the experiments, we \ufb01nd the initialization of s usually converges within 3 iterations. The optimiza-\ntion on the Stiefel manifold usually converges in less than 15 iterations. Figure 2 shows the samples\nof convergence behavior of FLRML in optimization on each dataset. The plots are drawn in relative\nvalues, in which the values of \ufb01rst iteration are scaled to 1.\nIn FLRML, one parameter m is about the margin in the margin loss. An experiment is performed\nto study the effect of the margin parameter m on accuracy. Let \u00afly be the mean of y(cid:62)\ni yi values, i.e.\ni yi). We test the change in accuracy of FLRML when the ratio m/ \u00afly varies between\n\u00afly = 1\nn\n0.1 and 2. The mean values and standard deviations of 5 repeated runs are plotted in Figure 3, which\nshows that FLRML works well on most datasets when m/ \u00afly is around 1. So we use m/ \u00afly = 1 in the\nexperiments in Table 3 and Table 4.\nIn M-FLRML, another parameter is the number of triplets Nt used to generate a mini-batch. We\ntest the effect of Nt on the accuracy of M-FLRML with the increasing number of mini-batches. The\nmean values and standard deviations of 5 repeated runs are plotted in Figure 4, which shows that\na larger Nt makes the accuracy increase faster, and usually M-FLRML is able to get good results\nwithin 20 mini-batches. 
So in Table 4, all the results are obtained with Nt = 80 and T = 20.

5 Conclusion and Future Work

In this paper, FLRML and M-FLRML are proposed for efficient low-rank similarity metric learning on high-dimensional datasets with large numbers of samples. With a novel formulation, FLRML and M-FLRML better employ low-rank constraints to further reduce the complexity and matrix sizes, based on which the optimization is efficiently conducted on the Stiefel manifold. This enables FLRML and M-FLRML to achieve good accuracy and faster speed on large numbers of high-dimensional data. One limitation of our current implementation is that the algorithm still runs on a single processor. Recently, there has been a trend toward distributed metric learning for big data [52, 53]. Implementing M-FLRML on a distributed architecture to scale to even larger datasets is an interest of our future research.

Acknowledgments

This research is sponsored in part by the National Key R&D Program of China (No. 2018YFB0505400, 2016QY07X1402), the National Science and Technology Major Project of China (No. 2016ZX01038101), and the NSFC Program (No. 61527812).

References

[1] W. Liu, C. Mu, R. Ji, S. Ma, J. Smith, S. Chang, Low-rank similarity metric learning in high dimensions, in: AAAI, 2015, pp. 2792–2799.

[2] Y. Mu, Fixed-rank supervised metric learning on Riemannian manifold, in: AAAI, 2016, pp. 1941–1947.

[3] M. Harandi, M. Salzmann, R. Hartley, Joint dimensionality reduction and metric learning: a geometric take, in: ICML, 2017.

[4] Z. Ding, Y. Fu, Robust transfer metric learning for image classification, IEEE Transactions on Image Processing PP (99) (2017) 1–1.

[5] L. Luo, H.
Huang, Matrix variate Gaussian mixture distribution steered robust metric learning, in: AAAI, 2018.

[6] G. Chechik, U. Shalit, V. Sharma, S. Bengio, An online algorithm for large scale image similarity learning, in: Advances in Neural Information Processing Systems, 2009, pp. 306–314.

[7] U. Shalit, D. Weinshall, G. Chechik, Online learning in the embedded manifold of low-rank matrices, Journal of Machine Learning Research 13 (Feb) (2012) 429–458.

[8] A. Bellet, A. Habrard, M. Sebban, A survey on metric learning for feature vectors and structured data, arXiv preprint arXiv:1306.6709.

[9] F. Wang, J. Sun, Survey on distance metric learning and dimensionality reduction in data mining, Data Mining and Knowledge Discovery 29 (2) (2015) 534–564.

[10] Z. Wen, W. Yin, A feasible method for optimization with orthogonality constraints, Mathematical Programming 142 (1-2) (2013) 397–434.

[11] M. Schultz, T. Joachims, Learning a distance metric from relative comparisons, in: Advances in Neural Information Processing Systems, 2004, pp. 41–48.

[12] J. T. Kwok, I. W. Tsang, Learning with idealized kernels, in: ICML, 2003, pp. 400–407.

[13] C. Hegde, A. C. Sankaranarayanan, W. Yin, R. G. Baraniuk, NuMax: A convex approach for learning near-isometric linear embeddings, IEEE Transactions on Signal Processing 63 (22) (2015) 6109–6121.

[14] B. Mason, L. Jain, R. Nowak, Learning low-dimensional metrics, in: Advances in Neural Information Processing Systems, 2017, pp. 4139–4147.

[15] L. Cheng, Riemannian similarity learning, in: ICML, 2013, pp. 540–548.

[16] Z. Huang, R. Wang, S. Shan, X. Chen, Projection metric learning on Grassmann manifold with application to video based face recognition, in: CVPR, 2015, pp. 140–149.

[17] A. Shukla, S.
Anand, Distance metric learning by optimization on the Stiefel manifold, in: International Workshop on Differential Geometry in Computer Vision for Analysis of Shapes, Images and Trajectories, 2015.

[18] J. Zhang, L. Zhang, Efficient stochastic optimization for low-rank distance metric learning, in: AAAI, 2017.

[19] S. Gillen, C. Jung, M. Kearns, A. Roth, Online learning with an unknown fairness metric, in: Advances in Neural Information Processing Systems, 2018, pp. 2600–2609.

[20] W. Li, Y. Gao, L. Wang, L. Zhou, J. Huo, Y. Shi, OPML: A one-pass closed-form solution for online metric learning, Pattern Recognition 75 (2018) 302–314.

[21] G.-J. Qi, J. Tang, Z.-J. Zha, T.-S. Chua, H.-J. Zhang, An efficient sparse metric learning in high-dimensional space via ℓ1-penalized log-determinant regularization, in: ICML, 2009, pp. 841–848.

[22] A. Bellet, A. Habrard, M. Sebban, Similarity learning for provably accurate sparse linear classification, in: ICML, 2012, pp. 1871–1878.

[23] Y. Shi, A. Bellet, F. Sha, Sparse compositional metric learning, in: AAAI, 2014.

[24] K. Liu, A. Bellet, F. Sha, Similarity learning for high-dimensional sparse data, in: AISTATS, 2015.

[25] L. Zhang, T. Yang, R. Jin, Z.-H. Zhou, Sparse learning for large-scale and high-dimensional data: a randomized convex-concave optimization approach, in: ALT, 2016, pp. 83–97.

[26] W. Liu, J. He, S.-F. Chang, Large graph construction for scalable semi-supervised learning, in: ICML, 2010, pp. 679–686.

[27] M. Koestinger, M. Hirzer, P. Wohlhart, P. Roth, H. Bischof, Large scale metric learning from equivalence constraints, in: CVPR, 2012, pp. 2288–2295.

[28] P. Zhu, H. Cheng, Q. Hu, Q. Wang, C.
Zhang, Towards generalized and efficient metric learning on Riemannian manifold, in: IJCAI, 2018, pp. 3235–3241.

[29] B. Kulis, Metric learning: A survey, Foundations and Trends® in Machine Learning 5 (4) (2013) 287–364.

[30] P.-A. Absil, J. Malick, Projection-like retractions on matrix manifolds, SIAM Journal on Optimization 22 (1) (2012) 135–158.

[31] H. Zhang, W. W. Hager, A nonmonotone line search technique and its application to unconstrained optimization, SIAM Journal on Optimization 14 (4) (2004) 1043–1056.

[32] J. Barzilai, J. M. Borwein, Two-point step size gradient methods, IMA Journal of Numerical Analysis 8 (1) (1988) 141–148.

[33] Q. Qian, R. Jin, J. Yi, L. Zhang, S. Zhu, Efficient distance metric learning by adaptive sampling and mini-batch stochastic gradient descent (SGD), Machine Learning 99 (3) (2015) 353–372.

[34] K. Weinberger, L. Saul, Distance metric learning for large margin nearest neighbor classification, Journal of Machine Learning Research 10 (Feb) (2009) 207–244.

[35] K. Lang, Newsweeder: Learning to filter netnews, in: ICML, 1995, pp. 331–339.

[36] D. Cai, X. He, Manifold adaptive experimental design for text categorization, IEEE Transactions on Knowledge and Data Engineering 24 (4) (2012) 707–719.

[37] Y. LeCun, L. Bottou, Y. Bengio, P. Haffner, Gradient-based learning applied to document recognition, Proceedings of the IEEE 86 (11) (1998) 2278–2324.

[38] Z. Wu, S. Song, A. Khosla, F. Yu, L. Zhang, X. Tang, J. Xiao, 3D ShapeNets: A deep representation for volumetric shapes, in: CVPR, 2015, pp. 1912–1920.

[39] Z. Han, M. Shang, Y.-S. Liu, M. Zwicker, View Inter-Prediction GAN: Unsupervised representation learning for 3D shapes by learning global shape memories to support local view predictions, in: AAAI, 2019.

[40] Z. Han, X. Liu, Y.-S. Liu, M.
Zwicker, Parts4Feature: Learning 3D global features from generally semantic parts in multiple views, in: IJCAI, 2019.

[41] Z. Han, X. Wang, C.-M. Vong, Y.-S. Liu, M. Zwicker, C. P. Chen, 3DViewGraph: Learning global features for 3D shapes from a graph of unordered views with attention, in: IJCAI, 2019.

[42] X. Liu, Z. Han, Y.-S. Liu, M. Zwicker, Point2Sequence: Learning the shape representation of 3D point clouds with an attention-based sequence to sequence network, in: AAAI, 2019.

[43] Z. Han, X. Wang, Y.-S. Liu, M. Zwicker, Multi-Angle Point Cloud-VAE: Unsupervised feature learning for 3D point clouds from multiple angles by joint self-reconstruction and half-to-half prediction, in: ICCV, 2019.

[44] Z. Han, H. Lu, Z. Liu, C.-M. Vong, Y.-S. Liu, M. Zwicker, J. Han, C. P. Chen, 3D2SeqViews: Aggregating sequential views for 3D global feature learning by CNN with hierarchical attention aggregation, IEEE Transactions on Image Processing 28 (8) (2019) 3986–3999.

[45] Z. Han, M. Shang, Z. Liu, C.-M. Vong, Y.-S. Liu, M. Zwicker, J. Han, C. P. Chen, SeqViews2SeqLabels: Learning 3D global features via aggregating sequential views by RNN with attention, IEEE Transactions on Image Processing 28 (2) (2019) 658–672.

[46] X. Liu, Z. Han, W. Xin, Y.-S. Liu, M. Zwicker, L2G Auto-encoder: Understanding point clouds by local-to-global reconstruction with hierarchical self-attention, in: ACMMM, 2019.

[47] Z. Han, M. Shang, X. Wang, Y.-S. Liu, M. Zwicker, Y2Seq2Seq: Cross-modal representation learning for 3D shape and text by joint reconstruction and prediction of view and word sequences, in: AAAI, 2019.

[48] Z. Han, Z. Liu, C.-M. Vong, Y.-S. Liu, S. Bu, J. Han, C. Chen, BoSCC: Bag of spatial context correlations for spatially enhanced 3D shape representation, IEEE Transactions on Image Processing 26 (8) (2017) 3707–3720.

[49] Z. Han, Z. Liu, J. Han, C.-M. Vong, S. Bu, C.
Chen, Unsupervised learning of 3D local features from raw voxels based on a novel permutation voxelization strategy, IEEE Transactions on Cybernetics 49 (2) (2019) 481–494.

[50] Z. Han, Z. Liu, C.-M. Vong, Y.-S. Liu, S. Bu, J. Han, C. Chen, Deep Spatiality: Unsupervised learning of spatially-enhanced global and local 3D features by deep neural network with coupled softmax, IEEE Transactions on Image Processing 27 (6) (2018) 3049–3063.

[51] H. Li, G. Linderman, A. Szlam, K. Stanton, Y. Kluger, M. Tygert, Algorithm 971: An implementation of a randomized algorithm for principal component analysis, ACM Transactions on Mathematical Software 43 (3) (2017) 28:1–28:14.

[52] Y. Su, H. Yang, I. King, M. Lyu, Distributed information-theoretic metric learning in Apache Spark, in: IJCNN, 2016, pp. 3306–3313.

[53] E. P. Xing, Q. Ho, W. Dai, J. K. Kim, J. Wei, S. Lee, X. Zheng, P. Xie, A. Kumar, Y. Yu, Petuum: a new platform for distributed machine learning on big data, IEEE Transactions on Big Data 1 (2) (2015) 49–67.