{"title": "Learning Bregman Distance Functions and Its Application for Semi-Supervised Clustering", "book": "Advances in Neural Information Processing Systems", "page_first": 2089, "page_last": 2097, "abstract": "Learning distance functions with side information plays a key role in many machine learning and data mining applications. Conventional approaches often assume a Mahalanobis distance function. These approaches are limited in two aspects: (i) they are computationally expensive (even infeasible) for high dimensional data because the size of the metric is in the square of dimensionality; (ii) they assume a fixed metric for the entire input space and therefore are unable to handle heterogeneous data. In this paper, we propose a novel scheme that learns nonlinear Bregman distance functions from side information using a non-parametric approach that is similar to support vector machines. The proposed scheme avoids the assumption of fixed metric because its local distance metric is implicitly derived from the Hessian matrix of a convex function that is used to generate the Bregman distance function. We present an efficient learning algorithm for the proposed scheme for distance function learning. The extensive experiments with semi-supervised clustering show the proposed technique (i) outperforms the state-of-the-art approaches for distance function learning, and (ii) is computationally efficient for high dimensional data.", "full_text": "Learning Bregman Distance Functions and Its\nApplication for Semi-Supervised Clustering\n\nLei Wu\u2020(cid:93), Rong Jin\u2021, Steven C.H. Hoi\u2020, Jianke Zhu(cid:92), and Nenghai Yu(cid:93)\n\u2020School of Computer Engineering, Nanyang Technological University, Singapore\n\u2021Department of Computer Science & Engineering, Michigan State University\n\n(cid:92)Computer Vision Lab, ETH Zurich, Swiss\n\n(cid:93)Univeristy of Science and Technology of China, P.R. 
China\n\nAbstract\n\nLearning distance functions with side information plays a key role in many ma-\nchine learning and data mining applications. Conventional approaches often as-\nsume a Mahalanobis distance function. These approaches are limited in two as-\npects: (i) they are computationally expensive (even infeasible) for high dimen-\nsional data because the size of the metric is in the square of dimensionality; (ii)\nthey assume a \ufb01xed metric for the entire input space and therefore are unable\nto handle heterogeneous data.\nIn this paper, we propose a novel scheme that\nlearns nonlinear Bregman distance functions from side information using a non-\nparametric approach that is similar to support vector machines. The proposed\nscheme avoids the assumption of \ufb01xed metric by implicitly deriving a local dis-\ntance from the Hessian matrix of a convex function that is used to generate the\nBregman distance function. We also present an ef\ufb01cient learning algorithm for\nthe proposed scheme for distance function learning. The extensive experiments\nwith semi-supervised clustering show the proposed technique (i) outperforms the\nstate-of-the-art approaches for distance function learning, and (ii) is computation-\nally ef\ufb01cient for high dimensional data.\n\n1 Introduction\nAn effective distance function plays an important role in many machine learning and data mining\ntechniques. For instance, many clustering algorithms depend on distance functions for the pairwise\ndistance measurements; most information retrieval techniques rely on distance functions to identify\nthe data points that are most similar to a given query; k-nearest-neighbor classi\ufb01er depends on dis-\ntance functions to identify the nearest neighbors for data classi\ufb01cation. 
In general, learning effective distance functions is a fundamental problem in both data mining and machine learning.

Recently, learning distance functions from data has been actively studied in machine learning. Instead of using a predefined distance function (e.g., the Euclidean distance), researchers have attempted to learn distance functions from side information that is often provided in the form of pairwise constraints, i.e., must-link constraints for pairs of similar data points and cannot-link constraints for pairs of dissimilar data points. Example algorithms include [16, 2, 8, 11, 7, 15].

Most distance learning methods assume a Mahalanobis distance. Given two data points x and x′, the distance between x and x′ is calculated by d(x, x′) = (x − x′)⊤A(x − x′), where A is the distance metric that needs to be learned from the side information. [16] learns a global distance metric (GDM) by minimizing the distance between similar data points while keeping dissimilar data points far apart. It requires solving a Semi-Definite Programming (SDP) problem, which is computationally expensive when the dimensionality is high. Bar-Hillel et al. [2] proposed Relevant Components Analysis (RCA), which is computationally efficient and achieves results comparable to GDM. The main drawback of RCA is that it is unable to handle cannot-link constraints. This problem was addressed by Discriminative Component Analysis (DCA) [8], which learns a distance metric by minimizing the distance between similar data points while simultaneously maximizing the distance between dissimilar data points. The authors of [4] proposed an information-theoretic metric learning approach (ITML) that learns the Mahalanobis distance by minimizing the differential relative entropy between two multivariate Gaussians.
Neighborhood Component Analysis (NCA) [5] learns a distance metric by extending the nearest neighbor classifier. The large margin nearest neighbor (LMNN) classifier [14] extends NCA through a maximum margin framework. Yang et al. [17] propose a Local Distance Metric (LDM) that addresses multimodal data distributions. Hoi et al. [7] propose a semi-supervised distance metric learning approach that exploits unlabeled data for metric learning. In addition to learning a distance metric, several studies [12, 6] are devoted to learning a distance function, mostly non-metric, from the side information.

Despite these successes, the existing approaches for distance metric learning are limited in two aspects. First, most existing methods assume a fixed distance metric for the entire input space, which makes it difficult for them to handle heterogeneous data. This issue was already demonstrated in [17] when learning distance metrics from multi-modal data distributions. Second, the existing methods aim to learn a full matrix for the target distance metric whose size is quadratic in the dimensionality, making them computationally unattractive for high dimensional data. Although the computation can be reduced significantly by assuming certain forms of the distance metric (e.g., a diagonal matrix), these simplifications often lead to suboptimal solutions. To address these two limitations, we propose a novel scheme that learns Bregman distance functions from the given side information. Bregman distance, or Bregman divergence [3], has several salient properties as a distance measure. Bregman distance generalizes the class of Mahalanobis distances by deriving a distance function from a given convex function φ(x). Since the local distance metric can be derived from the local Hessian matrix of φ(x), the Bregman distance function avoids the assumption of a fixed distance metric.
Recent studies [1] also reveal the connection between Bregman distances and exponential families of distributions. For example, the Kullback-Leibler divergence is a special Bregman distance obtained by choosing the negative entropy function as the convex function φ(x).

The objective of this work is to design an efficient and effective algorithm that learns a Bregman distance function from pairwise constraints. Although Bregman distances have been explored in [1], those studies assume a predefined Bregman distance function. To the best of our knowledge, this is the first work that addresses the problem of learning Bregman distances from pairwise constraints. We present a non-parametric framework for Bregman distance learning, together with an efficient learning algorithm. Our empirical study with semi-supervised clustering shows that the proposed approach (i) outperforms the state-of-the-art algorithms for distance metric learning, and (ii) is computationally efficient for high dimensional data.

The rest of the paper is organized as follows. Section 2 presents the proposed framework for learning Bregman distance functions from pairwise constraints, together with an efficient learning algorithm. Section 3 presents the experimental results with semi-supervised clustering, comparing the proposed algorithm with a number of state-of-the-art algorithms for distance metric learning. Section 4 concludes this work.

2 Learning Bregman Distance Functions

2.1 Bregman Distance Function

The Bregman distance function is defined based on a given convex function. Let φ(x) : R^d ↦ R be a strictly convex function that is twice differentiable.
Given φ(x), the Bregman distance function is defined as

d(x1, x2) = φ(x1) − φ(x2) − (x1 − x2)⊤∇φ(x2)

For the convenience of discussion, we consider a symmetrized version of the Bregman distance function, defined as follows:

d(x1, x2) = (∇φ(x1) − ∇φ(x2))⊤(x1 − x2)    (1)

The following proposition shows the properties of d(x1, x2).

Proposition 1. The distance function defined in (1) satisfies the following properties if φ(x) is a strictly convex function: (a) d(x1, x2) = d(x2, x1); (b) d(x1, x2) ≥ 0; (c) d(x1, x2) = 0 ↔ x1 = x2.

Remark. To better understand the Bregman distance function, we can rewrite d(x1, x2) in (1) as

d(x1, x2) = (x1 − x2)⊤∇²φ(x̃)(x1 − x2)

where x̃ is a point on the line segment between x1 and x2. As indicated by the above expression, the Bregman distance function can be viewed as a generalized Mahalanobis distance that introduces a local distance metric A = ∇²φ(x̃). Unlike the conventional Mahalanobis distance, where the metric A is a constant matrix throughout the entire space, the local distance metric A = ∇²φ(x̃) is induced by the Hessian matrix of the convex function φ(x) and therefore depends on the locations of x1 and x2.

Although the Bregman distance function defined in (1) does not satisfy the triangle inequality, the following proposition shows that the degree of violation is bounded if the Hessian matrix of φ(x) is bounded.

Proposition 2. Let Ω be the closed domain for x. If there exist m, M ∈ R with M > m > 0 such that

mI ⪯ ∇²φ(x) ⪯ MI for all x ∈ Ω,

where I is the identity matrix, then we have the following inequality:

√d(xa, xb) ≤ √d(xa, xc) + √d(xc, xb) + (√M − √m)[d(xa, xc) d(xc, xb)]^(1/4)    (2)

The proof of this proposition can be found in Appendix A. As indicated by Proposition 2, the degree of violation of the triangle inequality is essentially controlled by √M − √m. Given a smooth convex function with an almost constant Hessian matrix, we would expect the Bregman distance to satisfy the triangle inequality to a large degree. In the extreme case when φ(x) = x⊤Ax/2 and ∇²φ(x) = A, we have a constant Hessian matrix, leading to complete satisfaction of the triangle inequality.

2.2 Problem Formulation

To learn a Bregman distance function, the key is to find an appropriate convex function φ(x) that is consistent with the given pairwise constraints. In order to learn the convex function φ(x), we take a non-parametric approach by assuming that φ(·) belongs to a Reproducing Kernel Hilbert Space Hκ.
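Before introducing the learning formulation, the symmetrized distance in Eq. (1) can be made concrete with a small numerical sketch (ours, not the paper's): for φ(x) = x⊤Ax/2 it recovers the Mahalanobis-type form from the remark above, and for the negative entropy it yields the symmetrized Kullback-Leibler divergence mentioned in the introduction. The function names and test points below are purely illustrative.

```python
import numpy as np

def bregman_sym(phi_grad, x1, x2):
    """Symmetrized Bregman distance: (grad phi(x1) - grad phi(x2))^T (x1 - x2)."""
    return (phi_grad(x1) - phi_grad(x2)) @ (x1 - x2)

# Generator 1: phi(x) = 0.5 * x^T A x with A positive definite -> Mahalanobis-type distance
A = np.array([[2.0, 0.5],
              [0.5, 1.0]])
grad_quad = lambda x: A @ x

# Generator 2: negative entropy phi(x) = sum_i x_i log x_i -> symmetrized KL divergence
grad_negent = lambda x: np.log(x) + 1.0

x1 = np.array([0.3, 0.7])
x2 = np.array([0.6, 0.4])

d_quad = bregman_sym(grad_quad, x1, x2)   # equals (x1 - x2)^T A (x1 - x2)
d_kl = bregman_sym(grad_negent, x1, x2)   # KL(x1||x2) + KL(x2||x1) for distributions

# Proposition 1: symmetry, non-negativity, identity of indiscernibles
assert abs(d_quad - bregman_sym(grad_quad, x2, x1)) < 1e-12
assert d_quad >= 0 and d_kl >= 0
assert bregman_sym(grad_quad, x1, x1) == 0
```

The quadratic generator reproduces a fixed-metric distance, while any generator with a non-constant Hessian makes the implicit metric vary with location, which is the property the paper exploits.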
Given a kernel function κ(x, x′) : R^d × R^d ↦ R, our goal is to search for a convex function φ(x) ∈ Hκ such that the induced Bregman distance function, denoted by dφ(x, x′), minimizes the overall training error with respect to the given pairwise constraints.

We denote by D = {(x^1_i, x^2_i, yi), i = 1, . . . , n} the collection of pairwise constraints for training. Each pairwise constraint consists of a pair of instances x^1_i and x^2_i, and a label yi that is +1 if x^1_i and x^2_i are similar and −1 if x^1_i and x^2_i are dissimilar. We also introduce X = (x1, . . . , xN) to include the input patterns of all training instances in D.

Following the maximum margin framework for classification, we cast the problem of learning a Bregman distance function from pairwise constraints into the following optimization problem:

min_{φ∈Ω(Hκ), b∈R+}  (1/2)|φ|²_{Hκ} + C Σ_{i=1}^n ℓ(yi[d(x^1_i, x^2_i) − b])    (3)

where Ω(H) = {f ∈ H : f is convex} refers to the subspace of the functional space H that only includes convex functions, ℓ(z) = max(0, 1 − z) is the hinge loss, and C is a penalty cost parameter.

The main challenge in solving the variational problem in (3) is that it is difficult to derive a representer theorem for φ(x), because it is ∇φ(x), not φ(x), that appears in the definition of the distance function. Note that although it may seem convenient to regularize ∇φ(x) instead, it would then be difficult to restrict φ(x) to be convex. To resolve this problem, we consider a special family of kernel functions κ(x, x′) of the form κ(x1, x2) = h(x1⊤x2), where h : R ↦ R is a strictly convex function. Examples of h(z) that guarantee κ(·, ·) to be positive semi-definite are h(z) = |z|^d (d ≥ 1), h(z) = |z + 1|^d (d ≥ 1), and h(z) = exp(z). For the convenience of discussion, we assume h(0) = 0 throughout this paper.

First, since φ(x) ∈ Hκ, we have

φ(x) = ∫ dy κ(x, y) q(y) = ∫ dy h(x⊤y) q(y)    (4)

where q(y) is a weighting function. Given the training instances x1, . . . , xN, we divide the space R^d into A and Ā, defined as

A = span{x1, . . . , xN},  Ā = Null(x1, . . . , xN)    (5)

We define H∥ and H⊥ as follows:

H∥ = span{κ(x, ·), ∀x ∈ A},  H⊥ = span{κ(x, ·), ∀x ∈ Ā}    (6)

The following proposition summarizes an important property of the reproducing kernel Hilbert space Hκ when the kernel function κ(·, ·) is restricted to the form κ(x1, x2) = h(x1⊤x2).

Proposition 3. If the kernel function κ(·, ·) is written in the form κ(x1, x2) = h(x1⊤x2) with h(0) = 0, then H∥ and H⊥ form a complete partition of Hκ, i.e., Hκ = H∥ ∪ H⊥ and H∥ ⊥ H⊥.

We therefore have the following representer theorem for the φ(x) that minimizes (3).

Theorem 1. The function φ(x) that minimizes (3) admits the following expression:

φ(x) = ∫_{y∈A} dy q(y) h(x⊤y) = ∫ du q(u) h(x⊤Xu) ∈ H∥    (7)

where u ∈ R^N and X = (x1, . . . , xN).

The proof of the above theorem can be found in Appendix B.

2.3 Algorithm

To further derive a concrete expression for φ(x), we restrict q(y) in (7) to the special form q(y) = Σ_{i=1}^N αi δ(y − xi), where αi ≥ 0, i = 1, . . . , N, are non-negative combination weights. This results in φ(x) = Σ_{i=1}^N αi h(x_i⊤x), and consequently d(xa, xb) becomes

d(xa, xb) = Σ_{i=1}^N αi (h′(xa⊤xi) − h′(xb⊤xi)) xi⊤(xa − xb)    (8)

By defining h(xa) = (h′(xa⊤x1), . . . , h′(xa⊤xN))⊤, we can express d(xa, xb) as

d(xa, xb) = (xa − xb)⊤X(α ◦ [h(xa) − h(xb)])    (9)

Notice that when h(z) = z²/2, d(xa, xb) reduces to

d(xa, xb) = (xa − xb)⊤X diag(α) X⊤(xa − xb)    (10)

This is a Mahalanobis distance with metric A = X diag(α) X⊤ = Σ_{i=1}^N αi xi xi⊤. When h(z) = exp(z), we have h(x) = (exp(x⊤x1), . . . , exp(x⊤xN))⊤, and the resulting distance function is no longer stationary due to the non-linear function exp(z).

Given the assumption that q(y) = Σ_{i=1}^N αi δ(y − xi), problem (3) simplifies to

min_{α∈R^N, b}  (1/2) α⊤Kα + C Σ_{i=1}^n εi
s.t.  yi((x^1_i − x^2_i)⊤X(α ◦ [h(x^1_i) − h(x^2_i)]) − b) ≥ 1 − εi,
      εi ≥ 0, i = 1, . . . , n,  αk ≥ 0, k = 1, . . . , N    (11)

Note that the constraint αk ≥ 0 is introduced to ensure that φ(x) = Σ_{k=1}^N αk h(x⊤xk) is a convex function. By defining

zi = [h(x^1_i) − h(x^2_i)] ◦ [X⊤(x^1_i − x^2_i)]    (12)

we simplify the problem in (11) as follows:

min_{α∈R^N_+, b}  L = (1/2) α⊤Kα + C Σ_{i=1}^n ℓ(yi[zi⊤α − b])    (13)

where ℓ(z) = max(0, 1 − z).

We solve the above problem by a simple subgradient descent approach.
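Before turning to the optimization, the distance parameterization just derived can be checked numerically. The following sketch (ours, with synthetic data and illustrative names) evaluates d(xa, xb) = (xa − xb)⊤X(α ◦ [h(xa) − h(xb)]) and verifies that for h(z) = z²/2 it coincides with the Mahalanobis distance with metric A = X diag(α) X⊤, as stated in the text.

```python
import numpy as np

rng = np.random.default_rng(0)
N, d = 5, 3
X = rng.normal(size=(d, N))     # columns x_1, ..., x_N, as in the paper's X
alpha = rng.uniform(size=N)     # non-negative Bregman coefficients

h_prime = lambda z: z           # h(z) = z^2 / 2  =>  h'(z) = z

def h_vec(x):
    # h(x) = (h'(x^T x_1), ..., h'(x^T x_N))
    return h_prime(X.T @ x)

def breg_dist(xa, xb):
    # d(xa, xb) = (xa - xb)^T X (alpha o [h(xa) - h(xb)])
    return (xa - xb) @ (X @ (alpha * (h_vec(xa) - h_vec(xb))))

xa, xb = rng.normal(size=d), rng.normal(size=d)

# For h(z) = z^2/2 this is exactly a Mahalanobis distance with A = X diag(alpha) X^T
A = X @ np.diag(alpha) @ X.T
assert np.isclose(breg_dist(xa, xb), (xa - xb) @ A @ (xa - xb))
```

Replacing `h_prime` with `np.exp` (i.e., h(z) = exp(z)) gives the non-stationary distance discussed above, with no single global metric.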
In particular, at iteration t, given the current solution αt and bt, we compute the gradients as

∇αL = Kαt + C Σ_{i=1}^n ∂ℓ(yi[zi⊤αt − bt]) yi zi,  ∇bL = −C Σ_{i=1}^n ∂ℓ(yi[zi⊤αt − bt]) yi    (14)

where ∂ℓ(z) stands for the subgradient of ℓ(z). Let S+_t ⊆ D denote the set of training instances for which (αt, bt) suffers a non-zero loss, i.e.,

S+_t = {(zi, yi) ∈ D : yi(zi⊤αt − bt) < 1}    (15)

We can then express the subgradients of L at αt and bt as follows:

∇αL = Kαt − C Σ_{(zi,yi)∈S+_t} yi zi,  ∇bL = C Σ_{(zi,yi)∈S+_t} yi    (16)

The new solution, denoted by αt+1 and bt+1, is computed as follows:

α^{t+1}_k = π_[0,+∞)(α^t_k − γt[∇αL]_k),  bt+1 = bt − γt ∇bL    (17)

where α^{t+1}_k is the k-th element of the vector αt+1, πG(x) projects x onto the domain G, and γt is the step size, set to γt = C/t following the Pegasos algorithm [10] for solving SVMs. The pseudo-code of the proposed algorithm is summarized in Algorithm 1.

Algorithm 1 Algorithm of Learning Bregman Distance Functions
INPUT:
• data matrix: X ∈ R^{N×d}
• pairwise constraints: {(x^1_i, x^2_i, yi), i = 1, . . . , n}
• kernel function: κ(x1, x2) = h(x1⊤x2)
• penalty cost parameter C
OUTPUT:
• Bregman coefficients α ∈ R^N_+, b ∈ R
PROCEDURE
1: initialize Bregman coefficients: α = α0, b = b0
2: calculate kernel matrix: K = [h(xi⊤xj)]_{N×N}
3: calculate vectors zi: zi = [h(x^1_i) − h(x^2_i)] ◦ [X⊤(x^1_i − x^2_i)]
4: set iteration step t = 1
5: repeat
6:   (1) update the learning rate: γ = C/t, t = t + 1
7:   (2) update the subset of training instances: S+_t = {(zi, yi) ∈ D : yi(zi⊤α − b) < 1}
8:   (3) compute the gradients w.r.t. α and b:
9:       ∇αL = Kα − C Σ_{zi∈S+_t} yi zi,  ∇bL = C Σ_{zi∈S+_t} yi
10:  (4) update the Bregman coefficients α = (α1, . . . , αN) and threshold b:
11:      b ← b − γ∇bL,  αk ← π_[0,+∞)(αk − γ[∇αL]_k), k = 1, . . . , N
12: until convergence

Computational complexity. One of the major computational costs of Algorithm 1 is the preparation of the kernel matrix K and the vectors {zi}_{i=1}^n, which fortunately can be pre-computed. Each step of the subgradient descent algorithm has linear complexity, i.e., O(max(N, n)), which makes it reasonable even for large data sets with high dimensionality. The number of iterations required for convergence is O(1/ε²), where ε is the target accuracy. The algorithm thus works well if we are not critical about the accuracy of the solution.

3 Experiments

We evaluate the proposed distance learning technique by semi-supervised clustering. In particular, we first learn a distance function from the given pairwise constraints and then apply the learned distance function to data clustering.
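As a concrete illustration of the training procedure in Algorithm 1, the following simplified sketch runs the subgradient updates on synthetic data (the data, constants, and variable names are ours, not the paper's; no convergence test or real dataset is included).

```python
import numpy as np

rng = np.random.default_rng(1)
N, d, n = 40, 5, 60

# Synthetic data: two shifted Gaussian blobs; columns of X are the points
labels = rng.integers(0, 2, size=N)
X = rng.normal(size=(d, N)) + 1.5 * labels
X = X / np.linalg.norm(X, axis=0)      # normalize so that x_i^T x_j stays in [-1, 1]

h = np.exp                             # h(z) = exp(z); note h'(z) = exp(z) too
K = h(X.T @ X)                         # kernel matrix K_ij = h(x_i^T x_j)

# Random pairwise constraints: y = +1 (must-link) / -1 (cannot-link)
pairs = rng.integers(0, N, size=(n, 2))
y = np.where(labels[pairs[:, 0]] == labels[pairs[:, 1]], 1.0, -1.0)

# z_i = [h(x_i^1) - h(x_i^2)] o [X^T (x_i^1 - x_i^2)]
Z = np.empty((n, N))
for i, (p, q) in enumerate(pairs):
    Z[i] = (h(X.T @ X[:, p]) - h(X.T @ X[:, q])) * (X.T @ (X[:, p] - X[:, q]))

C = 0.1
alpha, b = np.zeros(N), 0.0
for t in range(1, 201):
    gamma = C / t                               # Pegasos-style step size
    margin = y * (Z @ alpha - b)
    S = margin < 1                              # instances with non-zero hinge loss
    grad_alpha = K @ alpha - C * (y[S][:, None] * Z[S]).sum(axis=0)
    grad_b = C * y[S].sum()
    b = b - gamma * grad_b
    alpha = np.maximum(0.0, alpha - gamma * grad_alpha)   # project onto alpha >= 0
```

The projection in the last line enforces the non-negativity constraint that keeps φ(x) = Σ_k αk h(x⊤xk) convex.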
We verify the efficacy and efficiency of the proposed technique by comparing it with a number of state-of-the-art algorithms for distance metric learning.

3.1 Experimental Testbed and Settings

We adopt six well-known datasets from the UCI machine learning repository, and six popular text benchmark datasets¹ in our experiments. These datasets are chosen for clustering because they vary significantly in properties such as the number of clusters/classes, the number of features, and the number of instances. The diversity of the datasets allows us to examine the effectiveness of the proposed learning technique more comprehensively. The details of the datasets are shown in Table 1.

¹The Reuter dataset is available at: http://renatocorrea.googlepages.com/textcategorizationdatasets

Table 1: The details of our experimental testbed

dataset          #samples  #features  #classes | dataset     #samples  #features  #classes
breast-cancer    683       10         2        | w1a         2,477     300        2
diabetes         768       8          2        | w2a         3,470     300        2
ionosphere       251       34         2        | w6a         17,188    300        2
liver-disorders  345       6          2        | WebKB       4,291     19,687     6
sonar            208       60         2        | newsgroup   7,149     47,411     11
a1a              1,605     123        2        | Reuter      10,789    5,189      79

Similar to previous work [16], the pairwise constraints are created by random sampling. More specifically, we randomly sample a subset of pairs from the pool of all possible pairs (every two instances form a pair). Two instances form a must-link constraint (i.e., yi = +1) if they share the same class label, and form a cannot-link constraint (i.e., yi = −1) if they are assigned to different classes. To compute the Bregman function in this experiment, we adopt the non-linear function h(z) = exp(z), for which h(x) = (exp(x⊤x1), . . . , exp(x⊤xN))⊤.

To perform data clustering, we run the k-means algorithm using the distance function learned from 500 randomly sampled positive constraints and 500 randomly sampled negative constraints.
The number of clusters is simply set to the number of classes in the ground truth. The initial cluster centroids are randomly chosen from the dataset. To enable fair comparisons, all competing algorithms start with the same set of initial centroids. We repeat each clustering experiment 20 times, and report the final results by averaging over the 20 runs.

We compare the proposed Bregman distance learning method using the k-means algorithm for semi-supervised clustering, termed Bk-means, with the following approaches: (1) the standard k-means; (2) the constrained k-means [13] (Ck-means); (3) Ck-means with the distance learned by RCA [2]; (4) Ck-means with the distance learned by DCA [8]; (5) Ck-means with the distance learned by Xing's algorithm [16] (Xing); (6) Ck-means with information-theoretic metric learning (ITML) [4]; and (7) Ck-means with a distance function learned by a boosting algorithm (DistBoost) [12].

To evaluate the clustering performance, we use standard performance metrics, including pairwise Precision, pairwise Recall, and pairwise F1 [9], which are evaluated based on pairwise results. Specifically, pairwise precision is the ratio of the number of correct pairs placed in the same cluster over the total number of pairs placed in the same cluster; pairwise recall is the ratio of the number of correct pairs placed in the same cluster over the total number of pairs that truly belong to the same class; and pairwise F1 equals 2 × precision × recall/(precision + recall).

3.2 Performance Evaluation on Low-dimensional Datasets

The first set of experiments evaluates the clustering performance on the six UCI datasets. Table 2 shows the average precision, recall, and F1 measurements of all competing algorithms given a set of 1,000 random constraints. The top two average F1 scores on each dataset are highlighted in bold font.
From the results in Table 2, we observe that the proposed Bregman distance based k-means clustering approach (Bk-means) is either the best or the second best on almost all datasets, indicating that the proposed algorithm is in general more effective than the other algorithms for distance metric learning.

3.3 Performance Evaluation on High-dimensional Text Data

We evaluate the clustering performance on the six text datasets. Since some of the methods are infeasible for text clustering due to the high dimensionality, we only include the results for the methods that are feasible for this experiment (OOT indicates that a method takes more than 10 hours, and OOM indicates that a method needs more than 16GB of RAM). Table 3 summarizes the F1 performance of all feasible methods on the datasets w1a, w2a, w6a, WebKB, 20newsgroup, and Reuter. Since cosine similarity is commonly used in the textual domain, we use k-means and Ck-means in both the Euclidean space and the cosine similarity space as baselines. The best F1 scores are marked in bold in Table 3.
The results show that the learned Bregman distance function is applicable to high dimensional data, and it outperforms the other commonly used text clustering methods on four out of the six datasets.

Table 2: Evaluation of clustering performance (average precision, recall, and F1) on the six UCI datasets. The top two F1 scores are highlighted in bold font for each dataset.

breast
method      precision      recall         F1
baseline    72.85±3.77     72.52±2.30     72.73±3.42
Ck-means    98.10±2.20     81.01±0.10     85.31±1.48
ITML        97.05±2.77     88.96±0.30     91.94±2.15
Xing        93.61±0.14     84.19±0.83     88.11±0.22
RCA         85.40±0.14     94.16±0.29     90.18±2.94
DCA         94.53±0.34     93.23±0.29     93.88±0.22
DistBoost   94.76±0.24     93.83±0.31     94.29±0.29
Bk-means    99.04±0.10     98.33±0.24     98.37±0.19

diabetes
method      precision      recall         F1
baseline    52.47±8.93     57.17±3.68     56.41±4.53
Ck-means    60.06±1.13     55.98±0.64     57.57±0.85
ITML        73.93±1.28     70.11±0.41     71.55±0.81
Xing        58.11±0.48     58.31±0.16     58.21±0.31
RCA         59.86±2.99     62.70±2.18     61.22±2.59
DCA         61.23±2.05     64.88±0.56     63.00±0.75
DistBoost   64.45±1.02     68.33±0.98     66.33±1.00
Bk-means    99.42±0.40     64.68±0.63     77.43±0.92

ionosphere
method      precision      recall         F1
baseline    62.35±6.30     53.39±2.74     57.28±6.20
Ck-means    57.05±1.24     51.28±1.58     61.46±1.36
ITML        97.10±2.70     59.99±0.31     72.62±1.24
Xing        63.46±0.11     64.10±0.03     63.52±0.39
RCA         100.00±6.19    50.36±1.44     66.99±0.45
DCA         66.36±3.01     67.01±2.12     66.68±0.00
DistBoost   75.91±1.11     69.34±0.91     72.72±1.03
Bk-means    97.64±1.93     62.71±1.94     73.28±1.93

liver-disorders
method      precision      recall         F1
baseline    63.92±8.60     50.50±0.40     55.67±5.96
Ck-means    62.90±8.43     50.35±1.68     55.13±1.63
ITML        93.53±3.28     55.57±0.10     68.73±1.40
Xing        95.42±2.85     49.65±0.08     65.31±1.10
RCA         59.56±18.95    52.15±1.68     54.92±5.76
DCA         70.18±4.27     50.41±0.07     58.67±1.63
DistBoost   51.60±1.43     52.88±1.31     52.23±1.37
Bk-means    96.89±4.11     50.29±2.09     66.86±3.10

sonar
method      precision      recall         F1
baseline    52.98±2.05     50.84±1.69     51.87±1.47
Ck-means    60.44±4.53     51.71±1.17     55.32±1.37
ITML        98.68±2.46     56.31±2.28     70.46±2.35
Xing        96.99±4.53     69.81±0.05     79.83±2.70
RCA         100.00±13.69   69.81±1.33     79.83±5.85
DCA         100.00±0.64    59.75±0.30     73.11±0.57
DistBoost   76.64±0.57     74.48±0.69     75.54±0.62
Bk-means    99.20±1.62     74.24±1.23     82.52±1.44

a1a
method      precision      recall         F1
baseline    55.81±1.01     69.99±0.91     62.10±0.99
Ck-means    69.91±0.08     80.34±0.18     77.01±0.12
ITML        99.99±0.98     70.30±0.54     81.76±0.76
Xing        57.70±1.32     70.89±1.01     63.62±1.21
RCA         76.64±0.08     66.96±0.35     69.96±0.18
DCA         57.15±1.32     71.76±1.87     63.63±1.55
DistBoost   n/a            n/a            n/a
Bk-means    99.98±0.21     77.72±0.17     86.32±0.19

Table 3: Evaluation of clustering F1 performance on the high dimensional text data. Only applicable methods are shown. OOM indicates "out of memory", and OOT indicates "out of time".

methods         w1a           w2a           w6a           WebKB         newsgroup     Reuter
k-means(EU)     76.52±0.97    72.59±0.77    76.68±0.25    35.78±0.17    16.54±0.05    43.88±0.23
k-means(Cos)    77.16±1.27    73.47±1.35    76.87±5.61    35.18±3.41    18.87±0.14    45.42±0.73
Ck-means(EU)    76.52±1.01    97.23±1.21    87.04±1.15    70.84±2.29    19.12±0.54    56.00±0.42
Ck-means(Cos)   75.32±0.91    97.14±2.12    87.14±2.14    75.84±1.08    20.08±0.49    58.24±0.82
RCA             93.51±1.13    96.45±1.17    91.00±1.02    OOM           OOM           OOT
DCA             87.44±1.99    94.30±2.56    92.13±1.04    OOM           OOM           OOT
ITML            96.95±0.13    94.12±0.92    92.31±0.84    OOM           OOT           OOT
Bk-means        98.64±0.24    96.92±1.02    93.43±1.07    73.94±1.25    25.17±1.27    64.51±0.95

3.4 Computational Complexity

Here, we evaluate the running time of semi-supervised clustering. For a conventional clustering algorithm such as k-means, the computational complexity is determined by both the calculation of distances and the clustering scheme. For a semi-supervised clustering algorithm based on distance learning, the overall computational time includes both the time for training an appropriate distance function and the time for clustering the data points. The average running times of semi-supervised clustering over the six UCI datasets are listed in Table 4. It is clear that the Bregman distance based clustering has efficiency comparable to simple methods like RCA and DCA on low dimensional data, and runs much faster than Xing, ITML, and DistBoost.
On the high dimensional text data, it is much faster than the other applicable DML methods.

Table 4: Comparison of average running time over the six UCI datasets and over subsets of the six text datasets (10% samples of the datasets in Table 1).

Algorithm        k-means  Ck-means  ITML   Xing  RCA    DCA    DistBoost  Bk-means
UCI data (sec)   0.51     0.72      8.56   7.59  0.88   0.90   13.09      1.70
Text data (min)  0.78     4.56      71.55  n/a   68.90  69.34  n/a        3.84

4 Conclusions

In this paper, we propose to learn a Bregman distance function for clustering algorithms using a non-parametric approach. The proposed scheme explicitly addresses two shortcomings of the existing approaches for distance function/metric learning, i.e., the assumption of a fixed distance metric for the entire input space and the high computational cost for high dimensional data. We incorporate the Bregman distance function into the k-means clustering algorithm for semi-supervised data clustering. Experiments on semi-supervised clustering with six UCI datasets and six high dimensional text datasets have shown that the Bregman distance function outperforms other distance metric learning algorithms in the F1 measure. The experiments also verify that the proposed distance learning algorithm is computationally efficient and is capable of handling high dimensional data.

Acknowledgements

This work was done when Mr. Lei Wu was an RA at Nanyang Technological University, Singapore. This work was supported in part by MOE tier-1 Grant (RG67/07), NRF IDM Grant (NRF2008IDM-IDM-004-018), National Science Foundation (IIS-0643494), and US Navy Research Office (N00014-09-1-0663).

APPENDIX A: Proof of Proposition 2

Proof. First, let us denote

f = (√M − √m)[d(xa, xc) d(xc, xb)]^(1/4)

The square of the right side of Eq.
(2) is
$$\left(\sqrt{d(x_a, x_c)} + \sqrt{d(x_c, x_b)} + f\right)^2 = d(x_a, x_b) - \eta(x_a, x_b, x_c) + \delta(x_a, x_b, x_c)$$
where
$$\delta(x_a, x_b, x_c) = f^2 + 2f\sqrt{d(x_a, x_c)} + 2f\sqrt{d(x_c, x_b)} + 2\sqrt{d(x_a, x_c)\, d(x_c, x_b)}$$
$$\eta(x_a, x_b, x_c) = (\nabla\phi(x_a) - \nabla\phi(x_c))^{\top}(x_c - x_b) + (\nabla\phi(x_c) - \nabla\phi(x_b))^{\top}(x_a - x_c).$$
From the above equation, the proposition holds if and only if $\delta(x_a, x_b, x_c) - \eta(x_a, x_b, x_c) \geq 0$, since
$$\delta(x_a, x_b, x_c) = (\sqrt{M} - \sqrt{m})^2\sqrt{d(x_a, x_c)\, d(x_c, x_b)} + 2(\sqrt{M} - \sqrt{m})\left(d(x_a, x_c)^{\frac{3}{4}} d(x_c, x_b)^{\frac{1}{4}} + d(x_a, x_c)^{\frac{1}{4}} d(x_c, x_b)^{\frac{3}{4}}\right) + 2\sqrt{d(x_a, x_c)\, d(x_c, x_b)}.$$
From the fact that $\sqrt{M} > \sqrt{m}$ and the distance function $d(\cdot) \geq 0$, we get $\delta(x_a, x_b, x_c) - \eta(x_a, x_b, x_c) \geq 0$.

APPENDIX B: Proof of Theorem 1
Proof.
We write $\phi(x) = \phi_{\parallel}(x) + \phi_{\perp}(x)$, where
$$\phi_{\parallel}(x) \in \mathcal{H}_{\parallel} = \left\{ \int_{y \in A} dy\, q(y)\, h(x^{\top}y) \right\}, \qquad \phi_{\perp}(x) \in \mathcal{H}_{\perp} = \left\{ \int_{y \in \bar{A}} dy\, q(y)\, h(x^{\top}y) \right\}$$
Thus, the distance function defined in (1) is then expressed as
$$d(x_a, x_b) = (x_a - x_b)^{\top}\left(\nabla\phi_{\parallel}(x_a) - \nabla\phi_{\parallel}(x_b)\right) + (x_a - x_b)^{\top}\left(\nabla\phi_{\perp}(x_a) - \nabla\phi_{\perp}(x_b)\right)$$
$$= \int_{y \in A} q(y)\left(h'(x_a^{\top}y) - h'(x_b^{\top}y)\right) y^{\top}(x_a - x_b)\, dy + \int_{y \in \bar{A}} q(y)\left(h'(x_a^{\top}y) - h'(x_b^{\top}y)\right) y^{\top}(x_a - x_b)\, dy$$
Since $|\phi(x)|^2_{\mathcal{H}_\kappa} = |\phi_{\parallel}(x)|^2_{\mathcal{H}_\kappa} + |\phi_{\perp}(x)|^2_{\mathcal{H}_\kappa}$, the minimizer of (1) should have $|\phi_{\perp}(x)|^2_{\mathcal{H}_\kappa} = 0$. Since $|\phi_{\perp}(x)| = \langle \phi_{\perp}(\cdot), \kappa(x, \cdot) \rangle_{\mathcal{H}_\kappa} \leq |\kappa(x, \cdot)|_{\mathcal{H}_\kappa} |\phi_{\perp}|_{\mathcal{H}_\kappa} = 0$, we have $\phi_{\perp}(x) = 0$ for any $x$. We thus have $\phi(x) = \phi_{\parallel}(x)$, which leads to the result in the theorem.

References
[1] A. Banerjee, S. Merugu, I. Dhillon, and J. Ghosh. Clustering with Bregman divergences. Journal of Machine Learning Research, pages 234–245, 2004.
[2] A. Bar-Hillel, T. Hertz, N. Shental, and D. Weinshall. Learning a Mahalanobis metric from equivalence constraints. JMLR, 6:937–965, 2005.
[3] L. Bregman. The relaxation method of finding the common points of convex sets and its application to the solution of problems in convex programming. USSR Computational Mathematics and Mathematical Physics, 7:200–217, 1967.
[4] J. V.
Davis, B. Kulis, P. Jain, S. Sra, and I. S. Dhillon. Information-theoretic metric learning. In ICML'07, pages 209–216, Corvallis, Oregon, 2007.
[5] J. Goldberger, S. Roweis, G. Hinton, and R. Salakhutdinov. Neighborhood component analysis. In NIPS.
[6] T. Hertz, A. B. Hillel, and D. Weinshall. Learning a kernel function for classification with small training samples. In ICML '06: Proceedings of the 23rd International Conference on Machine Learning, pages 401–408. ACM, 2006.
[7] S. C. H. Hoi, W. Liu, and S.-F. Chang. Semi-supervised distance metric learning for collaborative image retrieval. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR2008), June 2008.
[8] S. C. H. Hoi, W. Liu, M. R. Lyu, and W.-Y. Ma. Learning distance metrics with contextual constraints for image retrieval. In Proc. CVPR2006, New York, US, June 17–22, 2006.
[9] Y. Liu, R. Jin, and A. K. Jain. BoostCluster: boosting clustering by pairwise constraints. In KDD'07, pages 450–459, San Jose, California, USA, 2007.
[10] S. Shalev-Shwartz, Y. Singer, and N. Srebro. Pegasos: primal estimated sub-gradient solver for SVM. In ICML '07: Proceedings of the 24th International Conference on Machine Learning, pages 807–814, New York, NY, USA, 2007. ACM.
[11] L. Si, R. Jin, S. C. H. Hoi, and M. R. Lyu. Collaborative image retrieval via regularized metric learning. ACM Multimedia Systems Journal, 12(1):34–44, 2006.
[12] T. Hertz, A. Bar-Hillel, and D. Weinshall. Boosting margin based distance functions for clustering. In Proceedings of the Twenty-First International Conference on Machine Learning, pages 393–400, 2004.
[13] K. Wagstaff, C. Cardie, S. Rogers, and S. Schrödl. Constrained k-means clustering with background knowledge. In ICML'01, pages 577–584, San Francisco, CA, USA, 2001. Morgan Kaufmann Publishers Inc.
[14] K.
Weinberger, J. Blitzer, and L. Saul. Distance metric learning for large margin nearest neighbor classification. In NIPS 18, pages 1473–1480, 2006.
[15] L. Wu, S. C. H. Hoi, J. Zhu, R. Jin, and N. Yu. Distance metric learning from uncertain side information with application to automated photo tagging. In Proceedings of ACM International Conference on Multimedia (MM2009), Beijing, China, Oct. 19–24, 2009.
[16] E. P. Xing, A. Y. Ng, M. I. Jordan, and S. Russell. Distance metric learning with application to clustering with side-information. In NIPS2002, 2002.
[17] L. Yang, R. Jin, R. Sukthankar, and Y. Liu. An efficient algorithm for local distance metric learning. In Proceedings of the Twenty-Second Conference on Artificial Intelligence (AAAI), 2006.
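The identity that drives the proof in Appendix A, $d(x_a, x_b) = d(x_a, x_c) + d(x_c, x_b) + \eta(x_a, x_b, x_c)$ for the symmetric distance $d(x_a, x_b) = (x_a - x_b)^{\top}(\nabla\phi(x_a) - \nabla\phi(x_b))$, holds exactly for any differentiable $\phi$ and can be checked numerically. A minimal sketch, using $\phi(x) = \sum_i \exp(x_i)$ as an arbitrary convex choice for the check (the test points are likewise arbitrary):

```python
import math

def grad_phi(x):                      # gradient of phi(x) = sum_i exp(x_i)
    return [math.exp(v) for v in x]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def sub(u, v):
    return [a - b for a, b in zip(u, v)]

def d(xa, xb):                        # symmetric Bregman distance
    return dot(sub(xa, xb), sub(grad_phi(xa), grad_phi(xb)))

def eta(xa, xb, xc):                  # the cross term from the proof
    return (dot(sub(grad_phi(xa), grad_phi(xc)), sub(xc, xb))
            + dot(sub(grad_phi(xc), grad_phi(xb)), sub(xa, xc)))

xa, xb, xc = [0.3, -1.0], [1.2, 0.5], [-0.4, 0.8]
lhs = d(xa, xb)
rhs = d(xa, xc) + d(xc, xb) + eta(xa, xb, xc)
print(abs(lhs - rhs) < 1e-9)          # the identity holds up to rounding
```

Because $\phi$ is convex, each distance value is nonnegative, and the identity reduces the approximate triangle inequality of Proposition 2 to the sign condition $\delta - \eta \geq 0$ used in the proof.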