{"title": "Finding Exemplars from Pairwise Dissimilarities via Simultaneous Sparse Recovery", "book": "Advances in Neural Information Processing Systems", "page_first": 19, "page_last": 27, "abstract": "Given pairwise dissimilarities between data points, we consider the problem of finding a subset of data points called representatives or exemplars that can efficiently describe the data collection. We formulate the problem as a row-sparsity regularized trace minimization problem which can be solved efficiently using convex programming. The solution of the proposed optimization program finds the representatives and the probability that each data point is associated to each one of the representatives. We obtain the range of the regularization parameter for which the solution of the proposed optimization program changes from selecting one representative to selecting all data points as the representatives. When data points are distributed around multiple clusters according to the dissimilarities, we show that the data in each cluster select only representatives from that cluster. Unlike metric-based methods, our algorithm does not require that the pairwise dissimilarities be metrics and can be applied to dissimilarities that are asymmetric or violate the triangle inequality. We demonstrate the effectiveness of the proposed algorithm on synthetic data as well as real-world datasets of images and text.", "full_text": "Finding Exemplars from Pairwise Dissimilarities\n\nvia Simultaneous Sparse Recovery\n\nEhsan Elhamifar\nEECS Department\n\nUniversity of California, Berkeley\n\nGuillermo Sapiro\nECE, CS Department\n\nDuke University\n\nRen\u00b4e Vidal\n\nCenter for Imaging Science\nJohns Hopkins University\n\nAbstract\n\nGiven pairwise dissimilarities between data points, we consider the problem of\n\ufb01nding a subset of data points, called representatives or exemplars, that can ef\ufb01-\nciently describe the data collection. 
We formulate the problem as a row-sparsity regularized trace minimization problem that can be solved efficiently using convex programming. The solution of the proposed optimization program finds the representatives and the probability that each data point is associated with each one of the representatives. We obtain the range of the regularization parameter for which the solution of the proposed optimization program changes from selecting one representative for all data points to selecting all data points as representatives. When data points are distributed around multiple clusters according to the dissimilarities, we show that the data points in each cluster select representatives only from that cluster. Unlike metric-based methods, our algorithm can be applied to dissimilarities that are asymmetric or violate the triangle inequality, i.e., it does not require that the pairwise dissimilarities come from a metric. We demonstrate the effectiveness of the proposed algorithm on synthetic data as well as real-world image and text data.

1 Introduction

Finding a subset of data points, called representatives or exemplars, which can efficiently describe the data collection, is an important problem in scientific data analysis with applications in machine learning, computer vision, information retrieval, etc. Representatives help to summarize and visualize datasets of images, videos, text and web documents. The computational time and memory requirements of classification algorithms improve by working on representatives, which contain much of the information of the original data collection. For example, the efficiency of the nearest-neighbor (NN) method improves [1] by comparing test samples to K representatives as opposed to all N training samples, where typically we have K \u226a N. 
Representatives provide a clustering of data points and, as the most prototypical data points, can be used for efficient synthesis/generation of new data points.

The problem of finding representative data has been well studied in the literature [2, 3, 4, 5, 6, 7, 8]. Depending on the type of information that should be preserved by the representatives, algorithms can be divided into two categories. The first group of algorithms finds representatives from data that lie in one or multiple low-dimensional subspaces and typically operates on the measurement data vectors directly [5, 6, 7, 8, 9, 10, 11]. The Rank Revealing QR (RRQR) algorithm [6, 9] assumes that the data come from a low-rank model and tries to find a subset of columns of the data matrix that corresponds to the best-conditioned submatrix. Randomized and greedy algorithms have also been proposed to find a subset of the columns of a low-rank matrix [5, 8, 10]. Assuming that the data can be expressed as linear combinations of the representatives, [7, 11] formulate the problem of finding representatives as a joint-sparse recovery problem, with [7] showing that when the data lie in a union of low-rank models, the algorithm finds representatives from each low-rank model.

The second group of algorithms finds representatives by assuming that there is a natural grouping of the data collection based on an appropriate measure of similarity between pairs of data points [2, 4, 12, 13, 14]. As a result, such algorithms typically operate on similarities/dissimilarities between data points. The Kmedoids algorithm [2] tries to find K representatives from pairwise dissimilarities between data points. As solving the original optimization program is, in general, NP-hard [12], an iterative approach is employed. The performance of Kmedoids, similar to that of Kmeans [15], depends on initialization and decreases as the number of representatives, K, increases. 
The Affinity Propagation (AP) algorithm [4, 13, 14] tries to find representatives from pairwise similarities between data points by using a message-passing algorithm. While AP comes with no optimality guarantees and finds approximate solutions, it does not require initialization and has been shown to perform well in problems such as unsupervised image categorization [16] and facility location problems [17].

In this paper, we propose an algorithm for selecting representatives of a data collection given dissimilarities between pairs of data points. We propose a row-sparsity regularized [18, 19] trace minimization program whose objective is to find a few representatives that encode well the collection of data points according to the provided dissimilarities. The solution of the proposed optimization program finds the representatives and the probability that each data point is associated with each one of the representatives. Instead of choosing the number of representatives, the regularization parameter sets a trade-off between the number of representatives and the encoding cost of the data points via the representatives based on the dissimilarities. We obtain the range of the regularization parameter where the solution of the proposed optimization program changes from selecting one representative for all data points to selecting each data point as a representative. When there is a clustering of data points, defined based on their dissimilarities, we show that, for a suitable range of the regularization parameter, the algorithm finds representatives from each cluster. Moreover, data points in each cluster select representatives only from the same cluster. Unlike metric-based methods, we do not require that the dissimilarities come from a metric. Specifically, the dissimilarities can be asymmetric or can violate the triangle inequality. 
We demonstrate the effectiveness of the proposed algorithm on synthetic data and real-world image and text data.

2 Problem Statement

We consider the problem of finding representatives from a collection of N data points. Assume we are given a set of nonnegative dissimilarities {dij}i,j=1,...,N between every pair of data points i and j. The dissimilarity dij indicates how well the data point i is suited to be a representative of the data point j. More specifically, the smaller the value of dij is, the better the data point i represents the data point j.1 Such dissimilarities can be built from measured data points, e.g., by using the Euclidean/geodesic distances or the inner products between data points. Dissimilarities can also be given directly without accessing or measuring the data points, e.g., they can be subjective measurements of the relationships between different objects. We can arrange the dissimilarities into a matrix of the form

D \u225c [d1\u22a4 ; . . . ; dN\u22a4] = [d11 d12 \u00b7\u00b7\u00b7 d1N ; . . . ; dN1 dN2 \u00b7\u00b7\u00b7 dNN] \u2208 RN\u00d7N,   (1)

where di \u2208 RN denotes the i-th row of D.

Remark 1 We do not require the dissimilarities to satisfy the triangle inequality. In addition, we do not assume symmetry of the pairwise dissimilarities. D can be asymmetric, where dij \u2260 dji for some pairs of data points. In other words, how well data point i represents data point j can be different from how well j represents i. In the experiments, we will show an example of asymmetric dissimilarities for finding representative sentences in text documents.

Given D, our goal is to select a subset of data points, called representatives or exemplars, that efficiently represent the collection of data points. We consider an optimization program that promotes selecting a few data points that can well encode all data points via the dissimilarities. To do so, we consider variables zij associated with the dissimilarities dij and denote the matrix of all variables by

Z \u225c [z1\u22a4 ; . . . ; zN\u22a4] = [z11 z12 \u00b7\u00b7\u00b7 z1N ; . . . ; zN1 zN2 \u00b7\u00b7\u00b7 zNN] \u2208 RN\u00d7N,   (2)

where zi \u2208 RN denotes the i-th row of Z. We interpret zij as the probability that data point i is a representative for data point j, hence zij \u2208 [0, 1]. A data point j can have multiple representatives, in which case zij > 0 for all the indices i of its representatives. As a result, we must have \u2211_{i=1}^{N} zij = 1, which ensures that the total probability of data point j choosing all its representatives is equal to one. Our goal is to select a few representatives that well encode the data collection according to the dissimilarities. To do so, we propose a row-sparsity regularized trace minimization program on Z that consists of two terms. First, we want the representatives to encode well all data points via the dissimilarities. If the data point i is chosen to be a representative of a data point j with probability zij, the cost of encoding j with i is dij zij \u2208 [0, dij]. Hence, the total cost of encoding j using all its representatives is \u2211_{i=1}^{N} dij zij. Second, we would like to have as few representatives as possible for all the data points. When the data point i is a representative of some of the data points, we have zi \u2260 0, i.e., the i-th row of Z is nonzero. Having a few representatives then corresponds to having a few nonzero rows in the matrix Z. Putting these two goals together, we consider the following minimization program

min \u2211_{j=1}^{N} \u2211_{i=1}^{N} dij zij + \u03bb \u2211_{i=1}^{N} I(\u2016zi\u2016q)   s.t.  zij \u2265 0, \u2200 i, j;  \u2211_{i=1}^{N} zij = 1, \u2200 j,   (3)

where I(\u00b7) denotes the indicator function, which is zero when its argument is zero and is one otherwise. The first term in the objective function corresponds to the total cost of encoding all data points using the representatives, and the second term corresponds to the cost associated with the number of the representatives. The parameter \u03bb > 0 sets the trade-off between the two terms. Since the minimization in (3), which involves counting the number of nonzero rows of Z, is, in general, NP-hard, we consider the following standard convex relaxation

min \u2211_{j=1}^{N} \u2211_{i=1}^{N} dij zij + \u03bb \u2211_{i=1}^{N} \u2016zi\u2016q   s.t.  zij \u2265 0, \u2200 i, j;  \u2211_{i=1}^{N} zij = 1, \u2200 j,   (4)

where, instead of counting the number of nonzero rows of Z, we use the sum of the \u2113q-norms of the rows of Z. Typically, we choose q \u2208 {2, \u221e}, for which the optimization program (4) is convex.2 Note that the optimization program (4) can be rewritten in the matrix form

min tr(D\u22a4Z) + \u03bb \u2016Z\u20161,q   s.t.  Z \u2265 0,  1\u22a4Z = 1\u22a4,   (5)

where tr(\u00b7) denotes the trace operator, \u2016Z\u20161,q \u225c \u2211_{i=1}^{N} \u2016zi\u2016q, and 1 denotes an N-dimensional vector whose elements are all equal to one.

1 dii can be set to have a nonzero value, as we will show in the experiments on the text data.

2 It is typically the case that q = \u221e favors having 0 and 1 elements in Z, while q = 2 allows elements that more often take other values in [0, 1]. Note that q = 1 also imposes sparsity within the nonzero rows of Z, which is not desirable since it promotes only a few data points to be associated with each representative.

Figure 1: Data points (blue dots) in two clusters and the representatives (red circles) found by the proposed optimization program in (4) for several values of \u03bb with \u03bbmax,q defined in (6). Top: q = 2, Bottom: q = \u221e.

Figure 2: For the data points shown in Fig. 1, the matrix Z obtained by the proposed optimization program in (4) is shown for several values of \u03bb, where \u03bbmax,q is defined in (6). Top: q = 2, Bottom: q = \u221e.

As we change the regularization parameter \u03bb in (4), the number of representatives found by the algorithm changes. 
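For q = \u221e, program (4) can be recast as a linear program: introducing auxiliary variables ti with the constraints zij \u2264 ti makes the penalty term \u03bb \u2211 ti linear, and at the optimum ti = \u2016zi\u2016\u221e. A minimal sketch of this reduction (the function name is illustrative, and scipy's HiGHS-based LP solver is assumed available):

```python
import numpy as np
from scipy.optimize import linprog


def find_representatives_linf(D, lam):
    """Solve program (4) with q = infinity, written as a linear program.

    Variables: vec(Z) in row-major order (N*N entries) followed by t (N entries),
    where the constraints z_ij <= t_i make t_i play the role of ||z_i||_inf.
    """
    N = D.shape[0]
    # objective: sum_ij d_ij z_ij + lam * sum_i t_i
    c = np.concatenate([np.asarray(D, dtype=float).ravel(), lam * np.ones(N)])
    # inequalities: z_ij - t_i <= 0
    A_ub = np.zeros((N * N, N * N + N))
    for i in range(N):
        for j in range(N):
            r = i * N + j
            A_ub[r, r] = 1.0           # coefficient of z_ij
            A_ub[r, N * N + i] = -1.0  # coefficient of t_i
    b_ub = np.zeros(N * N)
    # equalities: each column of Z sums to one
    A_eq = np.zeros((N, N * N + N))
    for j in range(N):
        for i in range(N):
            A_eq[j, i * N + j] = 1.0
    b_eq = np.ones(N)
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq,
                  bounds=[(0, None)] * (N * N + N), method="highs")
    return res.x[:N * N].reshape(N, N)
```

For the small examples in this paper the dense formulation above is adequate; larger N would call for sparse constraint matrices or a first-order solver.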
For small values of \u03bb, where we put more emphasis on better encoding the data points via representatives, we obtain more representatives. In the limiting case of \u03bb \u2192 0, all points are selected as representatives, each point being the representative of itself, i.e., zii = 1 for all i. On the other hand, for large values of \u03bb, where we put more emphasis on the row-sparsity of Z, we select a small number of representatives. In the limiting case of \u03bb \u2192 \u221e, we select only one representative for all data points. Figures 1 and 2 illustrate the representatives and the matrix Z, respectively, for several values of \u03bb. In Section 3, we compute the range of \u03bb for which the solution of (4) changes from a single representative to all points being representatives. Note that, similar to the relationship between sparse dictionary learning [20] and Kmeans, there is a relationship between our method and Kmedoids. A discussion of this is left to a future publication.

Once we have solved the optimization program (4), we can find the representative indices from the nonzero rows of Z. We can also obtain the clustering of data points into K clusters associated with K representatives by assigning each data point to its closest representative. More specifically, if i1, . . . , iK denote the indices of the representatives, data point j is assigned to the representative R(j) according to R(j) = argmin_{\u2113 \u2208 {i1, . . . , iK}} d\u2113j. As mentioned before, the solution Z gives the probability that each data point is associated with each one of the representatives, which also provides a soft clustering of data points to the representatives. 
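This post-processing, reading representatives off the nonzero rows of Z and then hard-assigning each point to its closest representative, can be sketched in a few lines (the tolerance for deciding that a row is nonzero is an assumption, not part of the formulation):

```python
import numpy as np


def representatives_and_clusters(Z, D, tol=1e-6):
    """Read representatives off the nonzero rows of Z, then assign each
    data point j to its closest representative, R(j) = argmin_l d_lj."""
    rep_idx = np.flatnonzero(np.abs(Z).max(axis=1) > tol)
    # hard assignment: for column j, the representative with the smallest d_lj
    assignment = rep_idx[np.argmin(D[rep_idx, :], axis=0)]
    return rep_idx, assignment
```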
In Section 3 we show that when there is a clustering of data points based on their dissimilarities (see Definition 1), each point selects representatives from its own cluster.

3 Theoretical Analysis

In this section, we consider the optimization program (4) and study the behavior of its solution as a function of the regularization parameter. First, we analyze the solution of (4) for a sufficiently large value of \u03bb. We obtain a threshold value on \u03bb after which the solution of (4) remains the same, selecting only one representative data point. More specifically, we show the following result.

Theorem 1 Consider the optimization program (4). Let \u2113 \u225c argmin_i 1\u22a4di and

\u03bbmax,2 \u225c max_{i \u2260 \u2113} \u221aN \u2016di \u2212 d\u2113\u2016\u2082\u00b2 / (2 \u00b7 1\u22a4(di \u2212 d\u2113)),   \u03bbmax,\u221e \u225c max_{i \u2260 \u2113} \u2016di \u2212 d\u2113\u2016\u2081 / 2.   (6)

For q \u2208 {2, \u221e}, when \u03bb \u2265 \u03bbmax,q, the solution of the optimization program (4) is equal to Z = e\u21131\u22a4, where e\u2113 denotes the vector whose elements are all zero except its \u2113-th element, which is equal to 1. In other words, the solution of (4) for \u03bb \u2265 \u03bbmax,q corresponds to choosing only the \u2113-th data point as the representative of all the data points.

Note that the threshold value of the regularization parameter, for which we obtain only one representative, is different for q = 2 and q = \u221e. 
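In code, the Theorem 1 thresholds can be evaluated directly from D; a sketch that takes \u03bbmax,2 = max_{i \u2260 \u2113} \u221aN \u2016di \u2212 d\u2113\u2016\u2082\u00b2 / (2 \u00b7 1\u22a4(di \u2212 d\u2113)) and \u03bbmax,\u221e = max_{i \u2260 \u2113} \u2016di \u2212 d\u2113\u2016\u2081 / 2 (the division assumes 1\u22a4(di \u2212 d\u2113) > 0 for i \u2260 \u2113, i.e., no ties for \u2113):

```python
import numpy as np


def lambda_max(D):
    """Thresholds of Theorem 1: for lam >= lambda_max_q, program (4) selects
    the single representative l = argmin_i 1^T d_i for all data points."""
    N = D.shape[0]
    l = int(np.argmin(D.sum(axis=1)))      # smallest total dissimilarity
    diff = np.delete(D - D[l], l, axis=0)  # rows d_i - d_l for i != l
    denom = diff.sum(axis=1)               # 1^T (d_i - d_l), positive unless tied
    lam2 = np.max(np.sqrt(N) * (diff ** 2).sum(axis=1) / (2.0 * denom))
    lam_inf = np.max(np.abs(diff).sum(axis=1)) / 2.0
    return l, lam2, lam_inf
```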
However, the two cases obtain the same representative, given by the data point for which 1\u22a4di is minimum, i.e., the data point with the smallest sum of dissimilarities to the other data points. Notice also that when the dissimilarities are the Euclidean distances between the data points, the single representative corresponds to the data point closest to the geometric median of all data points, as shown in the right plot of Figure 1.

Figure 3: Data points in two clusters with dissimilarities given by pairwise Euclidean distances. For \u03bb < \u2206 \u2212 max{\u03b41, \u03b42}, in the solution of the optimization program (4), points in each cluster are represented by representatives from the same cluster.

When the regularization parameter \u03bb is smaller than the threshold in (6), the optimization program in (4) can find multiple representatives for each data point. However, when there is a clustering of data points based on their dissimilarities (see Definition 1), we expect to select representatives from each cluster. 
In addition, we expect that the data points in each cluster be associated with the representatives in that cluster only.

Definition 1 Given dissimilarities {dij}i,j=1,...,N between N data points, we say that the data partitions into n clusters {Ci}_{i=1}^{n} according to the dissimilarities, if for any data point j\u2032 in any Cj, the largest dissimilarity to other data points in Cj is strictly smaller than the smallest dissimilarity to the data points in any Ci different from Cj, i.e.,

max_{i\u2032 \u2208 Cj} di\u2032j\u2032 < min_{i \u2260 j} min_{i\u2032 \u2208 Ci} di\u2032j\u2032,   \u2200 j = 1, . . . , n, \u2200 j\u2032 \u2208 Cj.   (7)

In other words, the data partitions into clusters {Ci}_{i=1}^{n} when the intraclass dissimilarity is smaller than the interclass dissimilarity.

Next, we show that for a suitable range of the regularization parameter that depends on the intraclass and interclass dissimilarities, the probability that a point chooses representatives from other clusters is zero. More precisely, we have the following result.

Theorem 2 Given dissimilarities {dij}i,j=1,...,N between N data points, assume that the data partitions into n clusters {Ci}_{i=1}^{n} according to Definition 1. Let \u03bbc be defined as

\u03bbc \u225c min_j min_{j\u2032 \u2208 Cj} ( min_{i \u2260 j} min_{i\u2032 \u2208 Ci} di\u2032j\u2032 \u2212 max_{i\u2032 \u2208 Cj} di\u2032j\u2032 ).   (8)

Then for \u03bb \u2264 \u03bbc, the optimization program (4) finds representatives in each cluster, where the data points in every Ci select representatives only from Ci. A less tight clustering threshold \u03bb\u2032c \u2264 \u03bbc on the regularization parameter is given by

\u03bb\u2032c \u225c min_{i \u2260 j} min_{i\u2032 \u2208 Ci, j\u2032 \u2208 Cj} di\u2032j\u2032 \u2212 max_i max_{i\u2032 \u2260 j\u2032 \u2208 Ci} di\u2032j\u2032.   (9)

The first term on the right-hand side of (9) is the minimum dissimilarity between data points in two different clusters. The second term on the right-hand side of (9) is the maximum, over all clusters, of the dissimilarity between distinct data points in each cluster. When \u03bbc or \u03bb\u2032c increases, e.g., when the intraclass dissimilarities decrease or the interclass dissimilarities increase, the maximum possible \u03bb for which we obtain clustering increases. As an illustrative example, consider Figure 3, where data points are distributed in two clusters according to the dissimilarities given by the pairwise Euclidean distances of the data points. Let \u03b4i denote the diameter of cluster i and \u2206 be the minimum distance among pairs of data points in different clusters. Assuming max{\u03b41, \u03b42} < \u2206, for \u03bb < \u2206 \u2212 max{\u03b41, \u03b42}, the solution of the optimization program (4) is of the form Z = \u0393 [Z1 0 ; 0 Z2], where \u0393 \u2208 RN\u00d7N is a permutation matrix corresponding to the separation of the data into the two clusters.

Remark 2 The results of Theorems 1 and 2 suggest that there is a range of the regularization parameter for which we obtain only one representative from each cluster. In other words, if

Figure 4: Number of representatives obtained by the proposed optimization program in (4) for data points in the two clusters shown in Fig. 
1 as a function of the regularization parameter \u03bb = \u03b1\u03bbmax,q with q \u2208 {2, \u221e}.

Figure 5: Representatives and the probability matrix Z obtained by our proposed algorithm in (4) for q = \u221e. 20 random data points are added to 120 data points generated by a mixture of 3 Gaussian distributions.

\u03bbmax,q(Ci) denotes the threshold on \u03bb after which we obtain only one representative from Ci, then for max_i \u03bbmax,q(Ci) \u2264 \u03bb < \u03bbc, the data points in each Ci select only one representative, which is in Ci. As we will show in the experiments, such an interval often exists and can, in fact, be large.

For a sufficiently small value of \u03bb, where we put less emphasis on the row-sparsity term in the optimization program (4), each data point becomes a representative, i.e., zii = 1 for all i. In such a case, each data point forms its own cluster. From the result in Theorem 2, we obtain a threshold \u03bbmin such that for \u03bb \u2264 \u03bbmin the solution Z is equal to the identity matrix.

Corollary 1 Let \u03bbmin,q \u225c min_j (min_{i \u2260 j} dij \u2212 djj) for q \u2208 {2, \u221e}. For \u03bb \u2264 \u03bbmin,q, the solution of the optimization program (4) for q \u2208 {2, \u221e} is equal to the identity matrix. In other words, each data point is the representative of itself.

4 Experiments

In this section, we evaluate the performance of the proposed algorithm on synthetic and real datasets. As scaling of D and \u03bb by the same value does not change the solution of (4), we always scale the dissimilarities to lie in [0, 1] by dividing the elements of D by its largest element. Unless stated otherwise, we typically set \u03bb = \u03b1\u03bbmax,q with \u03b1 \u2208 [0.01, 0.1], for which we obtain good results.

4.1 Experiments on Synthetic Data

We consider the synthetic dataset shown in Figure 1, which consists of data points distributed around two clusters. 
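Since \u03bb is chosen relative to these thresholds throughout the experiments, it is convenient to compute them directly from D. A sketch for \u03bbc in (8), the looser \u03bb\u2032c in (9), and \u03bbmin,q in Corollary 1 (the labels argument, an integer cluster assignment, is an assumed input for this illustration):

```python
import numpy as np


def clustering_thresholds(D, labels):
    """lambda_c from (8), the looser lambda'_c from (9), and lambda_min
    from Corollary 1, computed from D and integer cluster labels."""
    N = D.shape[0]
    clusters = [np.flatnonzero(labels == c) for c in np.unique(labels)]
    # eq. (8): worst case over data points j' of (closest other-cluster
    # dissimilarity minus farthest own-cluster dissimilarity)
    lam_c = min(
        min(D[Ci, jp].min() for Ci in clusters if Ci is not Cj) - D[Cj, jp].max()
        for Cj in clusters for jp in Cj)
    # eq. (9): min interclass dissimilarity minus max intraclass dissimilarity
    inter = min(D[np.ix_(Ci, Cj)].min()
                for Ci in clusters for Cj in clusters if Ci is not Cj)
    intra = max(D[np.ix_(Ci, Ci)][~np.eye(len(Ci), dtype=bool)].max()
                for Ci in clusters)
    lam_c_prime = inter - intra
    # Corollary 1: lambda_min = min_j (min_{i != j} d_ij - d_jj)
    off_diag = D + np.diag(np.full(N, np.inf))
    lam_min = (off_diag.min(axis=0) - np.diag(D)).min()
    return lam_c, lam_c_prime, lam_min
```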
We run the proposed optimization program in (4) for both q = 2 and q = \u221e for several\nvalues of \u03bb. Figures 1 and 2 show the representatives and the matrix of variables Z, respectively, for\nseveral values of the regularization parameter. Notice that, as discussed before, for small values of \u03bb,\nwe obtain more representatives and as we increase \u03bb, the number of representatives decreases. When\nthe regularization parameter reaches \u03bbmax,q, computed using our theoretical analysis, we obtain\nonly one representative for the dataset. It is important to note that, as we showed in the theoretical\nanalysis, when the regularization parameter is suf\ufb01ciently small, data points in each cluster only\nselect representatives from that cluster (see Figure 2), i.e., Z has a block-diagonal structure when\nits columns are permuted according to the clusters. Moreover, as Figure 2 shows, for a suf\ufb01ciently\nlarge range of the regularization parameter, we obtain only one representative from each cluster. To\nbetter see this, we run the optimization program with \u03bb = \u03b1\u03bbmax,q for different values of \u03b1. The\ntwo left-hand side plots in Figure 4 show the number of the representatives for q = 2 and q = \u221e,\nrespectively, from each of the two clusters.\nAs shown, when \u03bb gets larger than \u03bbmax,q, we obtain only one representative from the right cluster\nand no representative from the left cluster, i.e., as expected, we obtain one representative for all\nthe data points. 
Also, when \u03bb gets smaller than \u03bbmin,q, all data points become representatives, as expected from our theoretical result.

Figure 6: Classification error on the USPS (left) and ISOLET (right) datasets using representatives obtained by different algorithms. The horizontal axis shows the percentage of selected representatives from each class (averaged over all classes). The dashed line shows the classification error (%) using all the training samples.

It is also important to note that, for a sufficiently large range of values of \u03bb, we select only one representative from each cluster. The two right-hand side plots in Figure 4 show the number of representatives when we increase the distance between the two clusters. Notice that we obtain similar results as before, except that the range of \u03bb for which we select one representative from each cluster has increased. 
This is also expected from our theoretical analysis, since \u03bbc in (8) increases as the distance between the two clusters increases. Note that we also obtain similar results for a larger number of clusters; for better visualization, we have shown the results for only two clusters. Also, when there is not a clear partitioning of the data points into clusters according to Definition 1, e.g., when there are data points distributed between different clusters, as shown in Figure 5, we still obtain results similar to what we have discussed in our theoretical analysis. This suggests the existence of stronger theoretical guarantees for our proposed algorithm, which is the subject of our future work.

4.2 Experiments on Real Data

In this section, we evaluate the performance of our proposed algorithm on real image and text data. We report the results for q = \u221e, as it typically obtains better results than q = 2.

4.2.1 NN Classification using Representatives

First, we consider the problem of finding prototypes for classification using the nearest neighbor (NN) algorithm [15]. Finding representatives that correspond to the modes of the data distribution helps to significantly reduce the computational cost and memory requirements of classification algorithms while maintaining their performance. To investigate the effectiveness of our proposed method for finding informative prototypes for classification, we consider two datasets, USPS [21] and ISOLET [22]. We find the representatives of the training data in each class of a dataset and use the representatives as a reduced training set to perform NN classification on the test data. We obtain the representatives by taking the dissimilarities to be pairwise Euclidean distances between data points. We compare our proposed algorithm with AP [4], Kmedoids [2], and random selection of data points (Rand) as the baseline. 
Since Kmedoids depends on initialization, we run the algorithm 1000 times with different random initializations and report the results corresponding to the worst solution (highest energy) and the best solution (lowest energy) as Kmedoids-w and Kmedoids-b, respectively. To have a fair comparison, we run all algorithms so that they obtain the same number of representatives. Figure 6 shows the average classification errors using the NN method for the two datasets. The classification error using all training samples of each dataset is also shown with a black dashed line. As the results show, the classification performance using the representatives found by our proposed algorithm is close to that of using all the training samples. Specifically, on the USPS dataset, using the representatives found by our proposed method, which consist of only 16% of the training samples, we obtain a 6.2% classification error compared to the 4.7% error obtained using all the training samples. On the ISOLET dataset, with representatives corresponding to less than half of the training samples, we obtain classification performance very close to that of using all the training samples (12.4% error compared to 11.4% error). Notice that when the number of representatives decreases, as expected, the classification performance also decreases. 
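The evaluation protocol above amounts to 1-NN classification against the reduced prototype set; a minimal sketch (array and function names are illustrative):

```python
import numpy as np


def nn_classify(X_test, X_reps, y_reps):
    """1-NN classification: label each test sample with the class of its
    closest representative (squared Euclidean distance)."""
    d2 = ((X_test[:, None, :] - X_reps[None, :, :]) ** 2).sum(axis=2)
    return y_reps[np.argmin(d2, axis=1)]
```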
However, in all cases, our proposed algorithm as well as AP are less affected by the decrease in the number of representatives.

Figure 6: NN classification error (%) versus the percentage of selected training samples on the USPS and ISOLET datasets, for Rand, Kmedoids-w, Kmedoids-b, AP, and the proposed algorithm.

Figure 7: Some frames of a political debate video, which consists of multiple shots, and the automatically computed representatives (inside red rectangles) of the whole video sequence using our proposed algorithm.

4.2.2 Video Summarization using Representatives

We now evaluate our proposed algorithm for finding representative frames of video sequences. We take a political debate video [7], downsample the frames to 80 × 100 pixels, and convert each frame to a grayscale image. Each data point then corresponds to an 8000-dimensional vector obtained by vectorizing a grayscale downsampled frame. We set the dissimilarities to be the Euclidean distances between pairs of data points. Figure 7 shows some frames of the video and the representatives computed by our method. Notice that we obtain a representative for each shot of the video. It is worth mentioning that the computed representatives do not change for λ ∈ [2.68, 6.55].

4.2.3 Finding Representative Sentences in Text Documents

As we discussed earlier, our proposed algorithm can deal with dissimilarities that are not necessarily metric, i.e., that can be asymmetric or violate the triangle inequality. We now consider an example of asymmetric dissimilarities, where we find representative sentences in the text of this paper.
We compute the dissimilarities between sentences using an information-theoretic criterion, as follows [4]: we treat each sentence as a “bag of words” and compute dij (how well sentence i represents sentence j) as the sum of the costs of encoding every word in sentence j using the words in sentence i. More precisely, for the sentences in the text of the paper, we extract the words delimited by spaces, remove all punctuation, and eliminate words that have fewer than 5 characters. For each word in sentence j, if the word matches3 a word in sentence i, we set its encoding cost to the logarithm of the number of words in sentence i, which is the cost of encoding the index of the matched word. Otherwise, we set its encoding cost to the logarithm of the number of words in the text dictionary, which is the cost of encoding the index of the word in the whole text. We also compute dii using the same procedure, i.e., dii ≠ 0, which penalizes selecting very long sentences. We found that 96% of the dissimilarities are asymmetric.
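As a concrete illustration of this criterion, the sketch below (our own reading of the description above, with hypothetical names) computes the asymmetric cost dij; sentences are assumed to be already tokenized into lists of words, and `dict_size` is the number of distinct words in the whole text.

```python
import math

def encoding_cost(sent_i, sent_j, dict_size):
    """Asymmetric dissimilarity d_ij: cost of encoding the words of
    sentence j using the words of sentence i (bag-of-words model).
    Words with fewer than 5 characters are discarded, and two words
    "match" when either one is a substring of the other."""
    words_i = [w for w in sent_i if len(w) >= 5]
    words_j = [w for w in sent_j if len(w) >= 5]
    cost = 0.0
    for w in words_j:
        if any(w in u or u in w for u in words_i):
            # cost of encoding the index of the matched word in sentence i
            cost += math.log(len(words_i))
        else:
            # cost of encoding the index of the word in the full dictionary
            cost += math.log(dict_size)
    return cost
```

Note that applying the same procedure to a sentence and itself gives dii equal to the number of retained words times the logarithm of that count, so dii grows with sentence length, which is exactly the penalty on very long sentences mentioned above.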
The four representative sentences obtained by our algorithm summarize the paper as follows:
–Given pairwise dissimilarities between data points, we consider the problem of finding a subset of data points, called representatives or exemplars, that can efficiently describe the data collection.
–We obtain the range of the regularization parameter for which the solution of the proposed optimization program changes from selecting one representative for all data points to selecting all data points as representatives.
–When there is a clustering of data points, defined based on their dissimilarities, we show that, for a suitable range of the regularization parameter, the algorithm finds representatives from each cluster.
–As the results show, the classification performance using the representatives found by our proposed algorithm is close to that of using all the training samples.

Acknowledgment

E. Elhamifar and R. Vidal are supported by grants NSF CNS-0931805, NSF ECCS-0941463, NSF OIA-0941362, and ONR N00014-09-10839. G. Sapiro acknowledges partial support by ONR, DARPA, NSF, NGA, and AFOSR grants.

3We consider a word to match another word if either word is a substring of the other.

References

[1] S. Garcia, J. Derrac, J. R. Cano, and F. Herrera, “Prototype selection for nearest neighbor classification: Taxonomy and empirical study,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 34, no. 3, pp. 417–435, 2012.

[2] L. Kaufman and P. Rousseeuw, “Clustering by means of medoids,” in Y. Dodge (Ed.), Statistical Data Analysis Based on the L1 Norm, North-Holland, Amsterdam, pp. 405–416, 1987.

[3] M. Gu and S. C. Eisenstat, “Efficient algorithms for computing a strong rank-revealing QR factorization,” SIAM Journal on Scientific Computing, vol. 17, pp. 848–869, 1996.

[4] B. J. Frey and D.
Dueck, \u201cClustering by passing messages between data points,\u201d Science, vol. 315, pp.\n\n972\u2013976, 2007.\n\n[5] J. A. Tropp, \u201cColumn subset selection, matrix factorization, and eigenvalue optimization,\u201d ACM-SIAM\n\nSymp. Discrete Algorithms (SODA), pp. 978\u2013986, 2009.\n\n[6] C. Boutsidis, M. W. Mahoney, and P. Drineas, \u201cAn improved approximation algorithm for the column\n\nsubset selection problem,\u201d in Proceedings of SODA, 2009, pp. 968\u2013977.\n\n[7] E. Elhamifar, G. Sapiro, and R. Vidal, \u201cSee all by looking at a few: Sparse modeling for \ufb01nding represen-\n\ntative objects,\u201d in IEEE Conference on Computer Vision and Pattern Recognition, 2012.\n\n[8] J. Bien, Y. Xu, and M. W. Mahoney, \u201cCUR from a sparse optimization viewpoint,\u201d NIPS, 2010.\n[9] T. Chan, \u201cRank revealing QR factorizations,\u201d Lin. Alg. and its Appl., vol. 88-89, pp. 67\u201382, 1987.\n[10] L. Balzano, R. Nowak, and W. Bajwa, \u201cColumn subset selection with missing data,\u201d in NIPS Workshop\n\non Low-Rank Methods for Large-Scale Machine Learning, 2010.\n\n[11] E. Esser, M. Moller, S. Osher, G. Sapiro, and J. Xin, \u201cA convex model for non-negative matrix factoriza-\ntion and dimensionality reduction on physical space,\u201d IEEE Transactions on Image Processing, vol. 21,\nno. 7, pp. 3239\u20133252, 2012.\n\n[12] M. Charikar, S. Guha, A. Tardos, and D. B. Shmoys, \u201cA constant-factor approximation algorithm for the\n\nk-median problem,\u201d Journal of Computer System Sciences, vol. 65, no. 1, pp. 129\u2013149, 2002.\n\n[13] B. J. Frey and D. Dueck, \u201cMixture modeling by af\ufb01nity propagation,\u201d Neural Information Processing\n\nSystems, 2006.\n\n[14] I. E. Givoni, C. Chung, and B. J. Frey, \u201cHierarchical af\ufb01nity propagation,\u201d Conference on Uncertainty in\n\nArti\ufb01cial Intelligence, 2011.\n\n[15] R. Duda, P. Hart, and D. Stork, Pattern Classi\ufb01cation. 
Wiley-Interscience, October 2004.

[16] D. Dueck and B. J. Frey, “Non-metric affinity propagation for unsupervised image categorization,” in International Conference on Computer Vision, 2007.

[17] N. Lazic, B. J. Frey, and P. Aarabi, “Solving the uncapacitated facility location problem using message passing algorithms,” in International Conference on Artificial Intelligence and Statistics, 2007.

[18] R. Jenatton, J. Y. Audibert, and F. Bach, “Structured variable selection with sparsity-inducing norms,” Journal of Machine Learning Research, vol. 12, pp. 2777–2824, 2011.

[19] J. A. Tropp, “Algorithms for simultaneous sparse approximation. Part II: Convex relaxation,” Signal Processing, special issue on “Sparse approximations in signal and image processing,” vol. 86, pp. 589–602, 2006.

[20] M. Aharon, M. Elad, and A. M. Bruckstein, “K-SVD: An algorithm for designing overcomplete dictionaries for sparse representation,” IEEE Transactions on Signal Processing, vol. 54, no. 11, pp. 4311–4322, 2006.

[21] J. J. Hull, “A database for handwritten text recognition research,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 16, no. 5, pp. 550–554, 1994.

[22] M. Fanty and R. Cole, “Spoken letter recognition,” in Neural Information Processing Systems, 1991.