{"title": "Regularized Distance Metric Learning:Theory and Algorithm", "book": "Advances in Neural Information Processing Systems", "page_first": 862, "page_last": 870, "abstract": "In this paper, we examine the generalization error of regularized distance metric learning. We show that with appropriate constraints, the generalization error of regularized distance metric learning could be independent from the dimensionality, making it suitable for handling high dimensional data. In addition, we present an efficient online learning algorithm for regularized distance metric learning. Our empirical studies with data classification and face recognition show that the proposed algorithm is (i) effective for distance metric learning when compared to the state-of-the-art methods, and (ii) efficient and robust for high dimensional data.", "full_text": "Regularized Distance Metric Learning:\n\nTheory and Algorithm\n\nRong Jin1\n\nShijun Wang2\n\nYang Zhou1\n\n1Dept. of Computer Science & Engineering, Michigan State University, East Lansing, MI 48824\n\n2Radiology and Imaging Sciences, National Institutes of Health, Bethesda, MD 20892\nrongjin@cse.msu.edu wangshi@cc.nih.gov zhouyang@msu.edu\n\nAbstract\n\nIn this paper, we examine the generalization error of regularized distance metric\nlearning. We show that with appropriate constraints, the generalization error of\nregularized distance metric learning could be independent from the dimensional-\nity, making it suitable for handling high dimensional data. In addition, we present\nan ef\ufb01cient online learning algorithm for regularized distance metric learning. 
Our empirical studies with data classification and face recognition show that the proposed algorithm is (i) effective for distance metric learning when compared to the state-of-the-art methods, and (ii) efficient and robust for high dimensional data.\n\n1 Introduction\n\nDistance metric learning is a fundamental problem in machine learning and pattern recognition. It is critical to many real-world applications, such as information retrieval, classification, and clustering [6, 7]. Numerous algorithms have been proposed and examined for distance metric learning. They are usually classified into two categories: unsupervised metric learning and supervised metric learning. Unsupervised distance metric learning, sometimes referred to as manifold learning, aims to learn an underlying low-dimensional manifold where the distance between most pairs of data points is preserved. Example algorithms in this category include ISOMAP [13] and Locally Linear Embedding (LLE) [8]. Supervised metric learning attempts to learn distance metrics from side information such as labeled instances and pairwise constraints. It searches for the optimal distance metric that (a) keeps data points of the same classes close, and (b) keeps data points from different classes far apart. Example algorithms in this category include [17, 10, 15, 5, 14, 19, 4, 12, 16]. In this work, we focus on supervised distance metric learning.\n\nAlthough a large number of studies have been devoted to supervised distance metric learning (see the survey in [18] and references therein), few studies address the generalization error of distance metric learning. 
In this paper, we examine the generalization error of regularized distance metric learning. Following the idea of stability analysis [1], we show that with appropriate constraints, the generalization error of regularized distance metric learning is independent of the dimensionality of the data, making it suitable for handling high dimensional data. In addition, we present an online learning algorithm for regularized distance metric learning, and show its regret bound. Note that although online metric learning was studied in [9], our approach is advantageous in that (a) it is computationally more efficient in handling the constraint of the SDP cone, and (b) it has a proved regret bound, while [9] only shows a mistake bound for datasets that can be separated by a Mahalanobis distance. To verify the efficacy and efficiency of the proposed algorithm for regularized distance metric learning, we conduct experiments with data classification and face recognition. Our empirical results show that the proposed online algorithm is (1) effective for metric learning compared to the state-of-the-art methods, and (2) robust and efficient for high dimensional data.\n\n2 Regularized Distance Metric Learning\n\nLet D = {z_i = (x_i, y_i), i = 1, ..., n} denote the labeled examples, where x_k = (x_k^1, ..., x_k^d) ∈ R^d is a vector of d dimensions and y_i ∈ {1, 2, ..., m} is the class label. In our study, we assume that the norm of any example is upper bounded by R, i.e., sup_x |x|_2 ≤ R. 
Let A ∈ S_+^{d×d} be the distance metric to be learned, where the distance between two data points x and x' is calculated as |x − x'|_A^2 = (x − x')^⊤ A (x − x').\n\nFollowing the idea of maximum margin classifiers, we have the following framework for regularized distance metric learning:\n\n    min_A  (1/2)|A|_F^2 + (2C/(n(n−1))) Σ_{i<j} V(A, z_i, z_j)   s.t.  A ⪰ 0,  tr(A) ≤ η(d)    (1)\n\nwhere V(A, z_i, z_j) = g(y_{i,j}[1 − |x_i − x_j|_A^2]) is the loss on the pair (z_i, z_j), y_{i,j} = +1 if y_i = y_j and y_{i,j} = −1 otherwise, g(·) is a convex loss function with Lipschitz constant L, C > 0 is a trade-off parameter, and η(d) > 0 bounds the trace of A.\n\nLet I(A) = E_{z,z'}[V(A, z, z')] denote the true loss of a metric A, and let I_D(A) = (2/(n(n−1))) Σ_{i<j} V(A, z_i, z_j) denote its empirical loss over D. Write A_D for the metric learned from D by (1), and define D_D = I(A_D) − I_D(A_D). An algorithm has uniform stability κ/n if replacing any single training example in D changes its loss on any test pair by at most κ/n [1]. Our analysis is based on the McDiarmid inequality: for independent random variables X_1, ..., X_l and a function F satisfying sup |F(x_1, ..., x_i, ..., x_l) − F(x_1, ..., x_i', ..., x_l)| ≤ c_i for every i, we have\n\n    Pr(|F − E[F]| > ε) ≤ 2 exp(−2ε^2 / Σ_{i=1}^l c_i^2)\n\nTo use the McDiarmid inequality, we first compute E(D_D).\n\nLemma 1. Given a distance metric learning algorithm A that has uniform stability κ/n, we have the following inequality for E(D_D):\n\n    E(D_D) ≤ 2κ/n    (6)\n\nwhere n is the number of training examples in D.\n\nThe result in the following lemma shows that the condition in the McDiarmid inequality holds.\n\nLemma 2. Let D be a collection of n randomly selected training examples, and let D^{i,z} be the collection of examples that replaces z_i in D with example z. We have |D_D − D_{D^{i,z}}| bounded as follows:\n\n    |D_D − D_{D^{i,z}}| ≤ (2κ + 8Lη(d) + 2g_0)/n    (7)\n\nwhere g_0 = sup_{z,z'} |V(0, z, z')| measures the largest loss when the distance metric A is 0.\n\nCombining the results in Lemmas 1 and 2, we can now derive the bound for the generalization error by using the McDiarmid inequality.\n\nTheorem 2. Let D denote a collection of n randomly selected training examples, and let A_D be the distance metric learned by the algorithm in (1), whose uniform stability is κ/n. With probability 1 − δ, we have the following bound for I(A_D):\n\n    I(A_D) − I_D(A_D) ≤ 2κ/n + (2κ + 4Lη(d) + 2g_0) \sqrt{ln(2/δ)/(2n)}\n\n3.2 Generalization Error for Regularized Distance Metric Learning\n\nFirst, we show that the supremum of tr(A_D) is O(d^{1/2}), which verifies that η(d) should grow sublinearly in d. 
This is summarized by the following proposition.\n\nProposition 1. The trace constraint in (1) will be activated only when\n\n    η(d) ≤ \sqrt{2d g_0 C}    (8)\n\nwhere g_0 = sup_{z,z'} |V(0, z, z')|.\n\nProof. It follows directly from [tr(A_D)]^2 / d ≤ |A_D|_F^2 ≤ 2C sup_{z,z'} |V(0, z, z')| = 2C g_0.    (9)\n\nTo bound the uniform stability, we need the following proposition.\n\nProposition 2. For any two distance metrics A and A', the following inequality holds for any examples z_u and z_v:\n\n    |V(A, z_u, z_v) − V(A', z_u, z_v)| ≤ 4LR^2 |A − A'|_F    (10)\n\nThe above proposition follows directly from the facts that (a) V(A, z, z') is Lipschitz continuous and (b) |x|_2 ≤ R for any example x. The following lemma bounds |A_D − A_{D'}|_F.\n\nLemma 3. Let D denote a collection of n randomly selected training examples, and let z = (x, y) be a randomly selected example. Let A_D be the distance metric learned by the algorithm in (1). We have\n\n    |A_D − A_{D^{i,z}}|_F ≤ 8CLR^2 / n    (11)\n\nThe proof of the above lemma can be found in Appendix A.\n\nCombining the results of Lemma 3 and Proposition 2, we have the following theorem for the stability of the Frobenius norm based regularizer.\n\nTheorem 3. The uniform stability of the algorithm in (1) using the Frobenius norm regularizer, denoted by β, is bounded as follows:\n\n    β = κ/n ≤ 32CL^2R^4 / n    (12)\n\nwhere κ = 32CL^2R^4.\n\nCombining Theorems 3 and 2, we have the following theorem for the generalization error of the distance metric learning algorithm in (1) using the Frobenius norm regularizer.\n\nTheorem 4. Let D be a collection of n randomly selected examples, and let A_D be the distance metric learned by the algorithm in (1) with h(A) = |A|_F^2. 
With probability 1 − δ, we have the following bound for the true loss function I(A_D), where A_D is learned from (1) using the Frobenius norm regularizer:\n\n    I(A_D) − I_D(A_D) ≤ 32CL^2R^4 / n + (32CL^2R^4 + 4Ls(d) + 2g_0) \sqrt{ln(2/δ)/(2n)}    (13)\n\nwhere s(d) = min(\sqrt{2d g_0 C}, η(d)).\n\nRemark. The most important feature of this estimation error bound is that it converges at the rate O(s(d)/\sqrt{n}). By choosing η(d) to have a low dependence on d (i.e., η(d) ~ d^p with p ≪ 1), the proposed framework for regularized distance metric learning will be robust to high dimensional data. In the extreme case, by setting η(d) to be a constant, the estimation error will be independent of the dimensionality of the data.\n\n4 Algorithm\n\nIn this section, we discuss an efficient algorithm for solving (1). We assume a hinge loss for g(z), i.e., g(z) = max(0, b − z), where b is the classification margin. To design an online learning algorithm for regularized distance metric learning, we follow the theory of gradient based online learning [2] by defining the potential function Φ(A) = |A|_F^2 / 2. Algorithm 1 shows the online learning algorithm.\n\nThe theorem below shows the regret bound for the online learning algorithm in Algorithm 1.\n\nTheorem 5. Let the online learning algorithm 1 run with learning rate λ > 0 on a sequence of example pairs (x_t, x'_t), y_t, t = 1, ..., n. Assume |x|_2 ≤ R for all the training examples. 
Then, for all distance metrics M ∈ S_+^{d×d}, we have\n\n    L̂_n ≤ (1/(1 − 8R^4 λ / b)) ( L_n(M) + (1/(2λ)) |M|_F^2 )\n\nwhere\n\n    L_n(M) = Σ_{t=1}^n max(0, b − y_t(1 − |x_t − x'_t|_M^2)),   L̂_n = Σ_{t=1}^n max(0, b − y_t(1 − |x_t − x'_t|_{A_{t−1}}^2))\n\nAlgorithm 1 Online Learning Algorithm for Regularized Distance Metric Learning\n 1: INPUT: predefined learning rate λ\n 2: Initialize A_0 = 0\n 3: for t = 1, ..., T do\n 4:   Receive a pair of training examples {(x_t^1, y_t^1), (x_t^2, y_t^2)}\n 5:   Compute the class label y_t: y_t = +1 if y_t^1 = y_t^2, and y_t = −1 otherwise.\n 6:   if the training pair (x_t^1, x_t^2), y_t is classified correctly, i.e., y_t(1 − |x_t^1 − x_t^2|_{A_{t−1}}^2) > 0, then\n 7:     A_t = A_{t−1}\n 8:   else\n 9:     A_t = π_{S_+}(A_{t−1} − λ y_t (x_t − x'_t)(x_t − x'_t)^⊤), where π_{S_+}(M) projects the matrix M onto the SDP cone.\n10:   end if\n11: end for\n\nThe proof of this theorem can be found in Appendix B. Note that the above online learning algorithm requires computing π_{S_+}(M), i.e., projecting the matrix M onto the SDP cone, which is expensive for high dimensional data. To address this challenge, first notice that M' = π_{S_+}(M) is equivalent to the optimization problem M' = arg min_{M' ⪰ 0} |M' − M|_F. We thus approximate A_t = π_{S_+}(A_{t−1} − λ y_t (x_t − x'_t)(x_t − x'_t)^⊤) with A_t = A_{t−1} − λ_t y_t (x_t − x'_t)(x_t − x'_t)^⊤, where λ_t is computed as follows:\n\n    λ_t = arg min_{λ_t} { |λ_t − λ| : λ_t ∈ [0, λ],  A_{t−1} − λ_t y_t (x_t − x'_t)(x_t − x'_t)^⊤ ⪰ 0 }    (14)\n\nThe following theorem shows the solution to the above optimization problem.\n\nTheorem 6. 
The optimal solution λ_t to the problem in (14) is expressed as\n\n    λ_t = λ  if y_t = −1;   λ_t = min( λ, [(x_t − x'_t)^⊤ A_{t−1}^{−1} (x_t − x'_t)]^{−1} )  if y_t = +1\n\nThe proof of this theorem can be found in the supplementary materials. Finally, the quantity (x_t − x'_t)^⊤ A_{t−1}^{−1} (x_t − x'_t) can be computed by solving the following optimization problem:\n\n    max_u  2u^⊤(x_t − x'_t) − u^⊤ A_{t−1} u\n\nwhose optimal value can be computed efficiently using the conjugate gradient method [11].\n\nNote that compared to the online metric learning algorithm in [9], the proposed online learning algorithm for metric learning is advantageous in that (i) it is computationally more efficient by avoiding projecting a matrix onto the SDP cone, and (ii) it has a provable regret bound, while [9] only presents a mistake bound for separable datasets.\n\n5 Experiments\n\nWe conducted an extensive study to verify both the efficiency and the efficacy of the proposed algorithms for metric learning. For the convenience of discussion, we refer to the proposed online distance metric learning algorithm as online-reg. To examine the efficacy of the learned distance metric, we employed the k Nearest Neighbor (k-NN) classifier. Our hypothesis is that the better the distance metric is, the higher the classification accuracy of k-NN will be. 
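As a concrete illustration of this evaluation protocol, the following minimal numpy sketch classifies a point by a 3-NN vote under a metric A. The function names are ours, not from the paper, and setting A to the identity recovers the Euclidean baseline:

```python
import numpy as np

def mahalanobis_sq(A, x, y):
    # Squared distance under metric A: |x - y|_A^2 = (x - y)^T A (x - y)
    d = x - y
    return float(d @ A @ d)

def knn_predict(A, X_train, y_train, x, k=3):
    # Majority vote among the k training points closest to x under A.
    dists = np.array([mahalanobis_sq(A, xi, x) for xi in X_train])
    nearest = np.argsort(dists)[:k]
    labels, counts = np.unique(y_train[nearest], return_counts=True)
    return labels[np.argmax(counts)]
```

Plugging in a learned metric changes only the distance computation, so any improvement in k-NN accuracy can be attributed to the metric itself.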
We set k = 3 for k-NN in all the experiments, based on our experience.\n\nWe compare our algorithm to the following six state-of-the-art algorithms for distance metric learning as baselines: (1) the Euclidean distance metric; (2) the Mahalanobis distance metric, which is computed as the inverse of the covariance matrix of the training samples, i.e., (Σ_{i=1}^n x_i x_i^⊤)^{−1}; (3) Xing's algorithm proposed in [17]; (4) LMNN, a distance metric learning algorithm based on the large margin nearest neighbor classifier [15]; (5) ITML, an information-theoretic metric learning approach based on [4]; and (6) Relevance Component Analysis (RCA) [10]. We set the maximum number of iterations for Xing's method to 10,000. The number of target neighbors in LMNN and the parameter γ in ITML were tuned by cross validation over the range from 10^{−4} to 10^{4}. All the algorithms were implemented and run using Matlab. All the experiments were run on an AMD 2.8GHz processor machine with 8GB RAM and a Linux operating system.\n\nTable 1: Classification error (%) of a k-NN (k = 3) classifier on the nine UCI data sets using seven different metrics. Standard deviation is included.\n\nDataset | Euclidean | Mahala | Xing | LMNN | ITML | RCA | Online-reg\n1 | 19.5 ± 2.2 | 18.8 ± 2.5 | 29.3 ± 17.2 | 13.8 ± 2.5 | 8.6 ± 1.7 | 17.4 ± 1.5 | 13.2 ± 2.2\n2 | 39.9 ± 2.3 | 6.7 ± 0.6 | 40.1 ± 2.6 | 3.6 ± 1.1 | 40.0 ± 2.3 | 3.8 ± 0.4 | 3.7 ± 1.2\n3 | 36.0 ± 2.0 | 42.1 ± 4.0 | 43.5 ± 12.5 | 33.1 ± 0.6 | 39.8 ± 3.3 | 41.6 ± 0.7 | 37.3 ± 4.1\n4 | 4.0 ± 1.7 | 10.4 ± 2.7 | 3.1 ± 2.0 | 3.9 ± 1.6 | 3.2 ± 1.6 | 2.9 ± 1.5 | 3.2 ± 1.3\n5 | 30.6 ± 1.9 | 29.1 ± 2.1 | 30.6 ± 1.9 | 29.6 ± 1.8 | 28.8 ± 2.1 | 28.6 ± 2.3 | 27.7 ± 1.3\n6 | 25.4 ± 4.2 | 18.4 ± 3.4 | 23.3 ± 3.4 | 15.2 ± 3.1 | 17.1 ± 4.1 | 13.9 ± 2.2 | 12.9 ± 2.2\n7 | 31.9 ± 2.8 | 10.0 ± 2.8 | 24.6 ± 7.5 | 4.5 ± 2.4 | 28.7 ± 3.7 | 1.8 ± 1.5 | 1.8 ± 1.1\n8 | 18.9 ± 0.5 | 37.3 ± 0.5 | 16.1 ± 0.6 | 18.4 ± 0.4 | 23.3 ± 1.3 | 30.6 ± 0.7 | 19.8 ± 0.6\n9 | 2.0 ± 0.4 | 6.1 ± 0.5 | 12.4 ± 0.8 | 1.6 ± 0.3 | 2.5 ± 0.4 | 2.8 ± 0.4 | 2.9 ± 0.4\n\nTable 2: p-values of the Wilcoxon signed-rank test of the 7 methods on the 9 datasets.\n\nMethods | Euclidean | Mahala | Xing | LMNN | ITML | RCA | Online-reg\nEuclidean | 1.000 | 0.734 | 0.641 | 0.004 | 0.496 | 0.301 | 0.129\nMahala | 0.734 | 1.000 | 0.301 | 0.008 | 0.570 | 0.004 | 0.004\nXing | 0.641 | 0.301 | 1.000 | 0.027 | 0.359 | 0.074 | 0.027\nLMNN | 0.004 | 0.008 | 0.027 | 1.000 | 0.129 | 0.496 | 0.734\nITML | 0.496 | 0.570 | 0.359 | 0.129 | 1.000 | 0.820 | 0.164\nRCA | 0.301 | 0.004 | 0.074 | 0.496 | 0.820 | 1.000 | 0.074\nOnline-reg | 0.129 | 0.004 | 0.027 | 0.734 | 0.164 | 0.074 | 1.000\n\n5.1 Experiment (I): Comparison to State-of-the-art Algorithms\n\nWe conducted experiments on data classification over the following nine datasets from the UCI repository: (1) balance-scale, with 3 classes, 4 features, and 625 instances; (2) breast-cancer, with 2 classes, 10 features, and 683 instances; (3) glass, with 6 classes, 9 features, and 214 instances; (4) iris, with 3 classes, 4 features, and 150 instances; (5) pima, with 2 classes, 8 features, and 768 instances; (6) segmentation, with 7 classes, 19 features, and 210 instances; (7) wine, with 3 classes, 13 features, and 178 instances; (8) waveform, with 3 classes, 21 features, and 5000 instances; (9) optdigits, with 10 classes, 64 features, and 3823 instances. For all the datasets, we randomly select 50% of the samples for training and use the remaining samples for testing. Table 1 shows the classification errors of all the metric learning methods over the 9 datasets averaged over 10 runs, together with the standard deviation. 
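The online-reg results above come from the update rule of Algorithm 1. As an illustration, one such update might be sketched in numpy as follows. The function names are ours; for clarity this sketch uses the direct eigendecomposition projection onto the SDP cone rather than the cheaper approximate step of Theorem 6:

```python
import numpy as np

def psd_project(M):
    # Project a symmetric matrix onto the SDP cone by zeroing
    # out its negative eigenvalues.
    w, V = np.linalg.eigh(M)
    return (V * np.maximum(w, 0.0)) @ V.T

def online_reg_step(A, x1, x2, y, lam=0.01):
    # One update of Algorithm 1 (sketch). y = +1 for a same-class
    # pair, -1 otherwise. If the pair is classified correctly,
    # i.e. y * (1 - |x1 - x2|_A^2) > 0, keep A unchanged; otherwise
    # take a gradient step and project back onto the SDP cone.
    d = x1 - x2
    if y * (1.0 - float(d @ A @ d)) > 0:
        return A
    return psd_project(A - lam * y * np.outer(d, d))
```

The full eigendecomposition costs O(d^3) per violated pair, which is exactly the expense the approximate step of Theorem 6 is designed to avoid.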
We observe that the proposed metric learning algorithm delivers performance comparable to the state-of-the-art methods. In particular, for almost all datasets, the classification accuracy of the proposed algorithm is close to that of LMNN, which yielded overall the best performance among the six baseline algorithms. This is consistent with the results of other studies, which show that LMNN is among the most effective algorithms for distance metric learning.\n\nTo further verify whether the proposed method performs statistically better than the baseline methods, we conducted a statistical test using the Wilcoxon signed-rank test [3]. The Wilcoxon signed-rank test is a non-parametric statistical hypothesis test for the comparison of two related samples. It is known to be safer than the Student's t-test because it does not assume normal distributions. From Table 2, we find that regularized distance metric learning improves the classification accuracy significantly compared to the Mahalanobis distance, Xing's method, and RCA at significance level 0.1. 
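With paired results on only nine datasets, the null distribution of the signed-rank statistic can be enumerated exactly. The sketch below (our own illustration, not the paper's code; it assumes no ties among the nonzero absolute differences) computes the exact two-sided p-value:

```python
import itertools
import numpy as np

def wilcoxon_signed_rank(a, b):
    # Exact two-sided Wilcoxon signed-rank test for paired samples.
    # Zero differences are dropped; assumes no ties in |differences|.
    d = np.asarray(a, float) - np.asarray(b, float)
    d = d[d != 0.0]
    n = len(d)
    ranks = np.argsort(np.argsort(np.abs(d))) + 1.0
    w_obs = ranks[d > 0].sum()      # sum of ranks of positive diffs
    mean = n * (n + 1) / 4.0        # mean of W under the null
    # Enumerate all 2^n sign assignments (feasible for small n).
    count = sum(
        1 for signs in itertools.product((0.0, 1.0), repeat=n)
        if abs(float(np.dot(signs, ranks)) - mean) >= abs(w_obs - mean) - 1e-12
    )
    return w_obs, count / 2 ** n
```

Feeding it the per-dataset error rates of two methods from Table 1 reproduces the style of pairwise comparison reported in Table 2.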
It performs slightly better than ITML and is comparable to LMNN.\n\nFigure 1: (a) Face recognition accuracy of kNN and (b) running time of the LMNN, ITML, RCA, and online-reg algorithms on the “att-face” dataset with varying image sizes.\n\n5.2 Experiment (II): Results for High Dimensional Data\n\nTo evaluate the dependence of the regularized metric learning algorithm on data dimensionality, we tested it on the task of face recognition. The AT&T face database¹ is used in our study. It consists of grey-scale images of faces from 40 distinct subjects, with ten pictures for each subject. For every subject, the images were taken at different times, with varying lighting conditions, different facial expressions (open/closed eyes, smiling/not smiling), and different facial details (glasses/no glasses). The original size of each image is 112 × 92 pixels, with 256 grey levels per pixel.\n\nTo examine the sensitivity to data dimensionality, we vary the data dimension (i.e., the size of the images) by compressing the original images into different sizes with the image aspect ratio preserved. The image compression is achieved by bicubic interpolation (the output pixel value is a weighted average of the pixels in the nearest 4-by-4 neighborhood). For each subject, we randomly split its face images into a training set and a test set with a ratio of 4 : 6. 
A distance metric is learned from the collection of training face images and is used by the kNN classifier (k = 3) to predict the subject ID of the test images. We conduct each experiment 10 times, and report the classification accuracy averaged over 40 subjects and 10 runs. Figure 1 (a) shows the average classification accuracy of the kNN classifier using the different distance metric learning algorithms. The running times of the different metric learning algorithms on the same dataset are shown in Figure 1 (b). Note that we exclude Xing's method from the comparison because of its extremely long computation time. We observed that with increasing image size (dimensionality), the regularized distance metric learning algorithm yields stable performance, indicating that it is resilient to high dimensional data. In contrast, for almost all the baseline methods except ITML, performance varied significantly as the size of the input images changed. Although ITML yields stable performance with respect to different image sizes, its high computational cost (Figure 1 (b)), arising from solving a Bregman optimization problem in each iteration, makes it unsuitable for high dimensional data.\n\n6 Conclusion\n\nIn this paper, we analyze the generalization error of regularized distance metric learning. We show that with appropriate constraints, regularized distance metric learning can be robust to high dimensional data. We also present efficient learning algorithms for solving the related optimization problems. Empirical studies with face recognition and data classification show that the proposed approach is (i) robust and efficient for high dimensional data, and (ii) comparable to the state-of-the-art approaches for distance metric learning. 
In the future, we plan to investigate different regularizers and their effect on distance metric learning.\n\n1http://www.cl.cam.ac.uk/research/dtg/attarchive/facedatabase.html\n\nACKNOWLEDGEMENTS\n\nThe work was supported in part by the National Science Foundation (IIS-0643494) and the U.S. Army Research Laboratory and the U.S. Army Research Office (W911NF-09-1-0421). Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of NSF and ARO.\n\nAppendix A: Proof of Lemma 3\n\nProof. We introduce the Bregman divergence for the proof of this lemma. Given a convex function of a matrix φ(X), the Bregman divergence between two matrices A and B is computed as follows:\n\n    d_φ(A, B) = φ(B) − φ(A) − tr(∇φ(A)^⊤ (B − A))\n\nWe define the convex functions N(X) and V_D(X) as follows:\n\n    N(X) = ‖X‖_F^2,   V_D(X) = (2C/(n(n−1))) Σ_{i<j}