{"title": "On a Theory of Nonparametric Pairwise Similarity for Clustering: Connecting Clustering to Classification", "book": "Advances in Neural Information Processing Systems", "page_first": 145, "page_last": 153, "abstract": "Pairwise clustering methods partition the data space into clusters by the pairwise similarity between data points. The success of pairwise clustering largely depends on the pairwise similarity function defined over the data points, where kernel similarity is broadly used. In this paper, we present a novel pairwise clustering framework by bridging the gap between clustering and multi-class classification. This pairwise clustering framework learns an unsupervised nonparametric classifier from each data partition, and searches for the optimal partition of the data by minimizing the generalization error of the learned classifiers associated with the data partitions. We consider two nonparametric classifiers in this framework, i.e. the nearest neighbor classifier and the plug-in classifier. Modeling the underlying data distribution by nonparametric kernel density estimation, the generalization error bounds for both unsupervised nonparametric classifiers are the sum of nonparametric pairwise similarity terms between the data points for the purpose of clustering. Under uniform distribution, the nonparametric similarity terms induced by both unsupervised classifiers exhibit a well known form of kernel similarity. We also prove that the generalization error bound for the unsupervised plug-in classifier is asymptotically equal to the weighted volume of cluster boundary for Low Density Separation, a widely used criterion for semi-supervised learning and clustering. 
Based on the derived nonparametric pairwise similarity using the plug-in classifier, we propose a new nonparametric exemplar-based clustering method with enhanced discriminative capability, whose superiority is evidenced by the experimental results.", "full_text": "On a Theory of Nonparametric Pairwise Similarity for Clustering: Connecting Clustering to Classification\n\nYingzhen Yang1 Feng Liang1 Shuicheng Yan2 Zhangyang Wang1 Thomas S. Huang1\n\n1 University of Illinois at Urbana-Champaign, Urbana, IL 61801, USA\n{yyang58,liangf,zwang119,t-huang1}@illinois.edu\n\n2 National University of Singapore, Singapore, 117576\neleyans@nus.edu.sg\n\nAbstract\n\nPairwise clustering methods partition the data space into clusters by the pairwise similarity between data points. The success of pairwise clustering largely depends on the pairwise similarity function defined over the data points, where kernel similarity is broadly used. In this paper, we present a novel pairwise clustering framework by bridging the gap between clustering and multi-class classification. This pairwise clustering framework learns an unsupervised nonparametric classifier from each data partition, and searches for the optimal partition of the data by minimizing the generalization error of the learned classifiers associated with the data partitions. We consider two nonparametric classifiers in this framework, i.e. the nearest neighbor classifier and the plug-in classifier. Modeling the underlying data distribution by nonparametric kernel density estimation, the generalization error bounds for both unsupervised nonparametric classifiers are the sum of nonparametric pairwise similarity terms between the data points for the purpose of clustering. Under uniform distribution, the nonparametric similarity terms induced by both unsupervised classifiers exhibit a well known form of kernel similarity. 
We also prove that the generalization error bound for the unsupervised plug-in classifier is asymptotically equal to the weighted volume of cluster boundary [1] for Low Density Separation, a widely used criterion for semi-supervised learning and clustering. Based on the derived nonparametric pairwise similarity using the plug-in classifier, we propose a new nonparametric exemplar-based clustering method with enhanced discriminative capability, whose superiority is evidenced by the experimental results.\n\n1 Introduction\n\nPairwise clustering methods partition the data into a set of self-similar clusters based on the pairwise similarity between the data points. Representative clustering methods include K-means [2], which minimizes the within-cluster dissimilarities, spectral clustering [3], which identifies clusters of more complex shapes lying on low dimensional manifolds, and the pairwise clustering method [4], which uses a message-passing algorithm to infer the cluster labels in a pairwise undirected graphical model. Utilizing pairwise similarity, these pairwise clustering methods often avoid estimating complex hidden variables or parameters, which is difficult for high dimensional data.\n\nHowever, most pairwise clustering methods assume that the pairwise similarity is given [2, 3], or they learn a more complicated similarity measure based on several given base similarities [4]. In this paper, we present a new framework for pairwise clustering where the pairwise similarity is derived as the generalization error bound for the unsupervised nonparametric classifier. The unsupervised classifier is learned from unlabeled data and the hypothetical labeling. 
The quality of the hypothetical labeling is measured by the associated generalization error of the learned classifier, and the hypothetical labeling with minimum associated generalization error bound is preferred. We consider two nonparametric classifiers, i.e. the nearest neighbor classifier (NN) and the plug-in classifier (or the kernel density classifier). The generalization error bounds for both unsupervised classifiers are expressed as a sum of pairwise terms between the data points, which can be interpreted as a nonparametric pairwise similarity measure between the data points. Under uniform distribution, both nonparametric similarity measures exhibit a well known form of kernel similarity. We also prove that the generalization error bound for the unsupervised plug-in classifier is asymptotically equal to the weighted volume of cluster boundary [1] for Low Density Separation, a widely used criterion for semi-supervised learning and clustering.\n\nOur work is closely related to discriminative clustering methods by unsupervised classification, which search for the cluster boundaries with the help of an unsupervised classifier. For example, [5] learns a max-margin two-class classifier to group unlabeled data in an unsupervised manner, known as unsupervised SVM, whose theoretical property is further analyzed in [6]. Also, [7] learns the kernel logistic regression classifier, and uses the entropy of the posterior distribution of the class label by the classifier to measure the quality of the learned classifier. More recent work presented in [8] learns an unsupervised classifier by maximizing the mutual information between cluster labels and the data, and the Squared-Loss Mutual Information is employed to produce a convex optimization problem. 
Although such discriminative methods produce satisfactory empirical results, the optimization of complex parameters hampers their application in high-dimensional data. Following the same principle of unsupervised classification using nonparametric classifiers, we derive nonparametric pairwise similarity and eliminate the need of estimating complicated parameters of the unsupervised classifier. As an application, we develop a new nonparametric exemplar-based clustering method with the derived nonparametric pairwise similarity induced by the plug-in classifier, and our new method demonstrates better empirical clustering results than the existing exemplar-based clustering methods.\n\nIt should be emphasized that our generalization bounds are essentially different from those in the literature. As nonparametric classification methods, the nearest neighbor classifier (NN) and the plug-in classifier have generalization properties that are extensively studied. Previous research focuses on the average generalization error of the NN [9, 10], which is the average error of the NN over all the random training data sets, or the excess risk of the plug-in classifier [11, 12]. In [9], it is shown that the average generalization error of the NN is bounded by twice the Bayes error. Assuming that the class of the regression functions has a smoothness parameter $\\beta$, [11] proves that the excess risk of the plug-in classifier converges to 0 of the order $n^{-\\frac{\\beta}{2\\beta+d}}$ where $d$ is the dimension of the data. [12] further shows that the plug-in classifier attains a faster convergence rate of the excess risk, namely $n^{-\\frac{1}{2}}$, under some margin assumption on the data distribution. All these generalization error bounds depend on the unknown Bayes error. 
By virtue of kernel density estimation and generalized kernel density estimation [13], our generalization bounds are represented mostly in terms of the data, leading to the pairwise similarities for clustering.\n\n2 Formulation of Pairwise Clustering by Unsupervised Nonparametric Classification\n\nThe discriminative clustering literature [5, 7] has demonstrated the potential of multi-class classification for the clustering problem. Inspired by the natural connection between clustering and classification, we model the clustering problem as a multi-class classification problem: a classifier is learned from the training data built by a hypothetical labeling, which is a possible cluster labeling. The optimal hypothetical labeling is supposed to be the one such that its associated classifier has the minimum generalization error bound. To study the generalization bound for the classifier learned from the hypothetical labeling, we define the concept of classification model. Given unlabeled data $\\{x_l\\}_{l=1}^n$, a classification model $M_Y$ is constructed for any hypothetical labeling $Y = \\{y_l\\}_{l=1}^n$ as below:\n\nDefinition 1. The classification model corresponding to the hypothetical labeling $Y = \\{y_l\\}_{l=1}^n$ is defined as $M_Y = \\left(S, P_{XY}, \\{\\pi_i, f_i\\}_{i=1}^Q, F\\right)$. $S = \\{x_l, y_l\\}_{l=1}^n$ are the labeled data by the hypothetical labeling, and $S$ are assumed to be i.i.d. samples drawn from the joint distribution $P_{XY} = P_{X|Y} P_Y$, where $(X, Y)$ is a random couple, $X \\in \\mathbb{R}^d$ represents the data and $Y \\in \\{1, 2, \\ldots, Q\\}$ is the class label of $X$, $Q$ is the number of classes determined by the hypothetical labeling. Furthermore, $P_{XY}$ is specified by $\\{\\pi^{(i)}, f^{(i)}\\}_{i=1}^Q$ as follows: $\\pi^{(i)}$ is the class prior for class $i$, i.e. $\\Pr[Y = i] = \\pi^{(i)}$; the conditional distribution $P_{X|Y=i}$ has probabilistic density function $f^{(i)}$, $i = 1, \\ldots, Q$. 
$F$ is a classifier trained using the training data $S$. The generalization error of the classification model $M_Y$ is defined as the generalization error of the classifier $F$ in $M_Y$.\n\nIn this paper, we study two types of classification models with the nearest neighbor classifier and the plug-in classifier respectively, and derive their generalization error bounds as a sum of pairwise similarity between the data. Given a specific type of classification model, the optimal hypothetical labeling corresponds to the classification model with minimum generalization error bound. The optimal hypothetical labeling also generates a data partition where the sum of pairwise similarity between the data from different clusters is minimized, which is a common criterion for discriminative clustering.\n\nIn the following text, we derive the generalization error bounds for the two types of classification models. Before that, we introduce more notations and assumptions for the classification model. Denote by $P_X$ the induced marginal distribution of $X$, and $f$ is the probabilistic density function of $P_X$ which is a mixture of $Q$ class-conditional densities: $f = \\sum_{i=1}^Q \\pi^{(i)} f^{(i)}$. $\\eta^{(i)}(x)$ is the regression function of $Y$ on $X = x$, i.e. $\\eta^{(i)}(x) = \\Pr[Y = i \\mid X = x] = \\frac{\\pi^{(i)} f^{(i)}(x)}{f(x)}$. For the sake of the consistency of the kernel density estimators used in the sequel, there are further assumptions on the marginal density and class-conditional densities in the classification model for any hypothetical labeling:\n\n(A) $f$ is bounded from below, i.e. $f \\ge f_{\\min} > 0$\n\n(B) $\\{f^{(i)}\\}$ is bounded from above, i.e. $f^{(i)} \\le f^{(i)}_{\\max}$, and $f^{(i)} \\in \\Sigma_{\\gamma, c_i}$, $1 \\le i \\le Q$,\n\nwhere $\\Sigma_{\\gamma, c}$ is the class of Hölder-$\\gamma$ smooth functions with Hölder constant $c$:\n\n$$\\Sigma_{\\gamma, c} \\triangleq \\{f : \\mathbb{R}^d \\to \\mathbb{R} \\mid \\forall x, y, \\; |f(x) - f(y)| \\le c \\|x - y\\|^{\\gamma}\\}, \\quad \\gamma > 0$$\n\nIt follows from assumption (B) that $f \\in \\Sigma_{\\gamma, c}$ where $c = \\sum_{i} \\pi^{(i)} c_i$. Assumptions (A) and (B) are mild. The upper bound for the density functions is widely required for the consistency of kernel density estimators [14, 15]; Hölder-$\\gamma$ smoothness is required to bound the bias of such estimators, and it also appears in [12] for estimating the excess risk of the plug-in classifier. The lower bound for the marginal density is used to derive the consistency of the estimator of the regression function $\\eta^{(i)}$ (Lemma 2) and the consistency of the generalized kernel density estimator (Lemma 3). We denote by $\\mathcal{P}_X$ the collection of marginal distributions that satisfy assumption (A), and denote by $\\mathcal{P}_{X|Y}$ the collection of class-conditional distributions that satisfy assumption (B). 
We then define the collection of joint distributions $\\mathcal{P}_{XY}$ that $P_{XY}$ belongs to, which requires that the marginal density and class-conditional densities satisfy assumptions (A)-(B):\n\n$$\\mathcal{P}_{XY} \\triangleq \\{P_{XY} \\mid P_X \\in \\mathcal{P}_X, \\; \\{P_{X|Y=i}\\} \\in \\mathcal{P}_{X|Y}, \\; \\min_i \\{\\pi^{(i)}\\} > 0\\} \\tag{1}$$\n\nGiven the joint distribution $P_{XY}$, the generalization error of the classifier $F$ learned from the training data $S$ is:\n\n$$R(F_S) \\triangleq \\Pr[(X, Y) : F(X) \\ne Y] \\tag{2}$$\n\nNonparametric kernel density estimator (KDE) serves as the primary tool of estimating the underlying probabilistic density functions in our generalization analysis, and we introduce the KDE of $f$ as below:\n\n$$\\hat{f}_{n,h_n}(x) = \\frac{1}{n} \\sum_{l=1}^n K_{h_n}(x - x_l) \\tag{3}$$\n\nwhere $K_h(x) = \\frac{1}{h^d} K\\left(\\frac{x}{h}\\right)$ is the isotropic Gaussian kernel with bandwidth $h$ and $K(x) \\triangleq \\frac{1}{(2\\pi)^{d/2}} e^{-\\frac{\\|x\\|^2}{2}}$. We have the following VC property of the Gaussian kernel $K$. Define the class of functions\n\n$$\\mathcal{F} \\triangleq \\left\\{K\\left(\\frac{t - \\cdot}{h}\\right), \\; t \\in \\mathbb{R}^d, \\; h \\ne 0\\right\\} \\tag{4}$$\n\nThe VC property appears in [14, 15, 16, 17, 18], and it is proved that $\\mathcal{F}$ is a bounded VC class of measurable functions with respect to the envelope function $F$ such that $|u| \\le F$ for any $u \\in \\mathcal{F}$ (e.g. $F \\equiv (2\\pi)^{-\\frac{d}{2}}$). It follows that there exist positive numbers $A$ and $v$ such that for every probability measure $P$ on $\\mathbb{R}^d$ for which $\\int F^2 dP < \\infty$ and any $0 < \\tau < 1$,\n\n$$N\\left(\\mathcal{F}, \\|\\cdot\\|_{L_2(P)}, \\tau \\|F\\|_{L_2(P)}\\right) \\le \\left(\\frac{A}{\\tau}\\right)^v \\tag{5}$$\n\nwhere $N(T, \\hat{d}, \\epsilon)$ is defined as the minimal number of open $\\hat{d}$-balls of radius $\\epsilon$ required to cover $T$ in the metric space $(T, \\hat{d})$. $A$ and $v$ are called the VC characteristics of $\\mathcal{F}$.\n\nThe VC property of $K$ is required for the consistency of kernel density estimators shown in Lemma 2. Also, we adopt the kernel estimator of $\\eta^{(i)}$ below\n\n$$\\hat{\\eta}^{(i)}_{n,h_n}(x) = \\frac{\\sum_{l=1}^n K_{h_n}(x - x_l) \\mathbb{1}_{\\{y_l = i\\}}}{n \\hat{f}_{n,h_n}(x)} \\tag{6}$$\n\nBefore stating Lemma 2, we introduce several frequently used quantities throughout this paper. Let $L, C > 0$ be constants which only depend on the VC characteristics of the Gaussian kernel $K$. We define\n\n$$f_0 \\triangleq \\sum_{i=1}^Q \\pi^{(i)} f^{(i)}_{\\max}, \\quad \\sigma_0^2 \\triangleq \\|K\\|_2^2 f_0 \\tag{7}$$\n\nAlso, for all positive numbers $\\lambda \\ge C$ and $\\sigma > 0$, we define\n\n$$E_{\\sigma^2} \\triangleq \\frac{\\log(1 + \\lambda/4L)}{\\lambda L \\sigma^2} \\tag{8}$$\n\nBased on Corollary 2.2 in [14], Lemma 2 and Lemma 3 in the Appendix (more complete version in the supplementary) show the strong consistency (almost sure uniform convergence) of several kernel density estimators, i.e. $\\hat{f}_{n,h_n}$, $\\{\\hat{\\eta}^{(i)}_{n,h_n}\\}$ and the generalized kernel density estimator, and they form the basis for the derivation of the generalization error bounds for the two types of classification models.\n\n3 Generalization Bounds\n\nWe derive the generalization error bounds for the two types of classification models with the nearest neighbor classifier and the plug-in classifier respectively. Substituting these kernel density estimators for the corresponding true density functions, Theorem 1 and Theorem 2 present the generalization error bounds for the classification models with the plug-in classifier and the nearest neighbor classifier. The dominant terms of both bounds are expressed as a sum of pairwise similarity depending solely on the data, which facilitates the application of clustering. 
We also show the connection between the error bound for the plug-in classifier and Low Density Separation in this section. The detailed proofs are included in the supplementary.\n\n3.1 Generalization Bound for the Classification Model with Plug-In Classifier\n\nThe plug-in classifier resembles the Bayes classifier while it uses the kernel density estimator of the regression function $\\eta^{(i)}$ instead of the true $\\eta^{(i)}$. It has the form\n\n$$\\mathrm{PI}(X) = \\arg\\max_{1 \\le i \\le Q} \\hat{\\eta}^{(i)}_{n,h_n}(X) \\tag{9}$$\n\nwhere $\\hat{\\eta}^{(i)}_{n,h_n}$ is the nonparametric kernel estimator of the regression function $\\eta^{(i)}$ by (6). The generalization capability of the plug-in classifier has been studied in the literature [11, 12]. Let $F^*$ be the Bayes classifier; it is proved that the excess risk of $\\mathrm{PI}_S$, namely $\\mathbb{E}_S R(\\mathrm{PI}_S) - R(F^*)$, converges to 0 of the order $n^{-\\frac{\\beta}{2\\beta+d}}$ under some complexity assumption on the class of the regression functions with smoothness parameter $\\beta$ that $\\{\\eta^{(i)}\\}$ belongs to [11, 12]. However, this result cannot be used to derive the generalization error bound for the plug-in classifier comprising nonparametric pairwise similarities in our setting.\n\nWe show the upper bound for the generalization error of $\\mathrm{PI}_S$ in Lemma 1.\n\nLemma 1. 
For any PXY \u2208 PXY , there exists a n0 which depends on \u03c30 and VC characteristics\nof K, when n > n0, with probability greater than 1 \u2212 2QLh\n, the generalization error of the\nplug-in classi\ufb01er satis\ufb01es\n\nE(cid:27)2\n0\nn\n\n\u2217\n\nn + O\n\nR (PIS) \u2264 RPI\n\u2211\n\n\u22121\nlog h\nn\nnhd\nn\n\n+ h(cid:13)\nn\n\nRPI\n\nn =\n\ni;j=1;:::;Q;i\u0338=j\n\nIEX\n\n^(cid:17)(i)\nn;hn\n\n(X) ^(cid:17)(j)\n\nn;hn\n\n(X)\n\n)\n\n]\n\n(\u221a\n[\n\n(10)\n\n(11)\n\n(12)\n\n\u22121\nn\nnhd\nn\n\n\u2192 0, \u02c6\u03b7(i)\n\nis the kernel\nQ for\n\nwhere E(cid:27)2 is de\ufb01ned by (8), hn is chosen such that hn \u2192 0, log h\nn;hn\nestimator of the regression function. Moreover, the equality in (10) holds when \u02c6\u03b7(i)\n1 \u2264 i \u2264 Q.\nBased on Lemma 1, we can bound the error of the plug-in classi\ufb01er from above by RPI\nn . Theorem 1\nthen gives the bound for the error of the plug-in classi\ufb01er in the corresponding classi\ufb01cation model\nusing the generalized kernel density estimator in Lemma 3. 
The bound has a form of sum of pairwise\n(\nsimilarity between the data from different classes.\nTheorem 1.\nS, PXY ,{\u03c0i, fi}Q\nand the VC characteristics of K, when n > n1, with probability greater than 1 \u2212 2QLh\n\u221a\nL(\n\nthe Plug-In Classi\ufb01er) Given the classi\ufb01cation model MY =\nsuch that PXY \u2208 PXY , there exists a n1 which depends on \u03c30, \u03c31\nn \u2212\n\n, the generalization error of the plug-in classi\ufb01er satis\ufb01es\n\n(Error of\ni=1, PI\n\n0 \u2212 QLh\n\n\u2261 1\n\n)\n\nE(cid:27)2\n0\n\nn;hn\n\nE(cid:27)2\n\nE(cid:27)2\n1\nn\n\n2hn)\n\nR (PIS) \u2264 ^Rn (PIS) + O\n\n(\u221a\n\n)\n\n\u22121\nlog h\nn\nnhd\nn\n\n+ h(cid:13)\nn\n\n\u2211\n\nl;m\n\nwhere \u02c6Rn (PIS) = 1\nn2\nfunction and\n\n\u03b8lmGlm;\n\n\u221a\n\n2hn\n\n, \u03c32\n\n1 =\n\n\u2225K\u22252\nfmin\n\n2fmax\n\n, \u03b8lm = 1I{yl\u0338=ym} is a class indicator\n\n1\n2\n\nGlm;h = Gh (xl; xm) ; Gh (x; y) =\n\nKh (x \u2212 y)\n^f\nn;h (x) ^f\nn;h (y)\n\u2192 0, \u02c6fn;hn is the kernel density\nE(cid:27)2 is de\ufb01ned by (8), hn is chosen such that hn \u2192 0, log h\n)\nestimator of f de\ufb01ned by (3).\n\u02c6Rn is the dominant term determined solely by the data and the excess error O\ngoes to 0 with in\ufb01nite n. In the following subsection, we show the close connection between the\nerror bound for the plug-in classi\ufb01er and the weighted volume of cluster boundary, and the latter is\nproposed by [1] for Low Density Separation.\n\n(\u221a\n\n\u22121\nlog h\nn\nnhd\nn\n\n\u22121\nn\nnhd\nn\n\n+ h(cid:13)\nn\n\n(13)\n\n1\n2\n\n;\n\n3.1.1 Connection to Low Density Separation\n\nLow Density Separation [19], a well-known criteria for clustering, requires that the cluster boundary\nshould pass through regions of low density. It has been extensively studied in unsupervised learning\nand semi-supervised learning [20, 21, 22]. 
Suppose the data $\\{x_l\\}_{l=1}^n$ lies on a domain $\\Omega \\subseteq \\mathbb{R}^d$. Let $f$ be the probability density function on $\\Omega$, $S$ be the cluster boundary which separates $\\Omega$ into two parts $S_1$ and $S_2$. Following the Low Density Separation assumption, [1] suggests that the cluster boundary $S$ with low weighted volume $\\int_S f(s) ds$ should be preferable. [1] also proves that a particular type of cut function converges to the weighted volume of $S$. Based on their study, we obtain the following result relating the error of the plug-in classifier to the weighted volume of the cluster boundary.\n\nCorollary 1. Under the assumption of Theorem 1, for any kernel bandwidth sequence $\\{h_n\\}_{n=1}^{\\infty}$ such that $\\lim_{n \\to \\infty} h_n = 0$ and $h_n > n^{-\\alpha}$ where $0 < \\alpha < \\frac{1}{2d+2}$, with probability 1,\n\n$$\\lim_{n \\to \\infty} \\frac{\\sqrt{\\pi}}{2 h_n} \\hat{R}_n(\\mathrm{PI}_S) = \\int_S f(s) ds \\tag{14}$$\n\n3.2 Generalization Bound for the Classification Model with Nearest Neighbor Classifier\n\nTheorem 2 shows the generalization error bound for the classification model with nearest neighbor classifier (NN), which has a similar form as (12).\n\nTheorem 2. (Error of the NN) Given the classification model $M_Y = \\left(S, P_{XY}, \\{\\pi_i, f_i\\}_{i=1}^Q, \\mathrm{NN}\\right)$ such that $P_{XY} \\in \\mathcal{P}_{XY}$ and the support of $P_X$ is bounded by $[-M_0, M_0]^d$, there exists an $n_0$ which depends on $\\sigma_0$ and the VC characteristics of $K$ such that, when $n > n_0$, with probability greater than $1 - 2QL h_n^{E_{\\sigma_0^2}} - (2M_0)^d n^{d d_0} e^{-n^{1 - d d_0} f_{\\min}}$, the generalization error of the NN satisfies:\n\n$$R(\\mathrm{NN}_S) \\le \\hat{R}_n(\\mathrm{NN}_S) + c_0 \\left(\\sqrt{d}\\right)^{\\gamma} n^{-d_0 \\gamma} + O\\left(\\sqrt{\\frac{\\log h_n^{-1}}{n h_n^d}} + h_n^{\\gamma}\\right) \\tag{15}$$\n\nwhere $\\hat{R}_n(\\mathrm{NN}_S) = \\frac{1}{n} \\sum_{1 \\le l < m \\le n} H_{lm,h_n} \\theta_{lm}$,\n\n$$H_{lm,h_n} = K_{h_n}(x_l - x_m) \\left( \\frac{\\int_{V_l} \\hat{f}_{n,h_n}(x) dx}{\\hat{f}_{n,h_n}(x_l)} + \\frac{\\int_{V_m} \\hat{f}_{n,h_n}(x) dx}{\\hat{f}_{n,h_n}(x_m)} \\right), \\tag{16}$$\n\n$E_{\\sigma^2}$ is defined by (8), $d_0$ is a constant such that $d d_0 < 1$, $\\hat{f}_{n,h_n}$ is the kernel density estimator of $f$ defined by (3) with the kernel bandwidth $h_n$ satisfying $h_n \\to 0$, $\\frac{\\log h_n^{-1}}{n h_n^d} \\to 0$, $V_l$ is the Voronoi cell associated with $x_l$, $c_0$ is a constant, $\\theta_{lm} = \\mathbb{1}_{\\{y_l \\ne y_m\\}}$ is a class indicator function such that $\\theta_{lm} = 1$ if $x_l$ and $x_m$ belong to different classes, and 0 otherwise. Moreover, the equality in (15) holds when $\\eta^{(i)} \\equiv \\frac{1}{Q}$ for $1 \\le i \\le Q$.\n\n$G_{lm,\\sqrt{2}h_n}$ in (13) and $H_{lm,h_n}$ in (16) are the new pairwise similarity functions induced by the plug-in classifier and the nearest neighbor classifier respectively. According to the proof of Theorem 1 and Theorem 2, the kernel density estimator $\\hat{f}$ can be replaced by the true density $f$ in the denominators of (13) and (16), and the conclusions of Theorem 1 and 2 still hold. 
Therefore, both $G_{lm,\\sqrt{2}h_n}$ and $H_{lm,h_n}$ are equal to ordinary Gaussian kernels (up to a scale) with different kernel bandwidth under uniform distribution, which explains the broadly used kernel similarity in data clustering from an angle of supervised learning.\n\n4 Application to Exemplar-Based Clustering\n\nWe propose a nonparametric exemplar-based clustering algorithm using the derived nonparametric pairwise similarity by the plug-in classifier. In exemplar-based clustering, each $x_l$ is associated with a cluster indicator $e_l$ ($l \\in \\{1, 2, \\ldots, n\\}$, $e_l \\in \\{1, 2, \\ldots, n\\}$), indicating that $x_l$ takes $x_{e_l}$ as the cluster exemplar. Data from the same cluster share the same cluster exemplar. We define $e \\triangleq \\{e_l\\}_{l=1}^n$. Moreover, a configuration of the cluster indicators $e$ is consistent iff $e_l = l$ when $e_m = l$ for any $l, m \\in 1..n$, meaning that $x_l$ should take itself as its exemplar if any $x_m$ takes $x_l$ as its exemplar. It is required that the cluster indicators $e$ should always be consistent. Affinity Propagation (AP) [23], a representative of the exemplar-based clustering methods, solves the following optimization problem\n\n$$\\min_e \\sum_{l=1}^n S_{l,e_l} \\quad \\text{s.t. } e \\text{ is consistent} \\tag{17}$$\n\n$S_{l,e_l}$ is the dissimilarity between $x_l$ and $x_{e_l}$, and note that $S_{l,l}$ is set to be nonzero to avoid the trivial minimizer of (17).\n\nNow we aim to improve the discriminative capability of the exemplar-based clustering (17) using the nonparametric pairwise similarity derived by the unsupervised plug-in classifier. As mentioned before, the quality of the hypothetical labeling $\\hat{y}$ is evaluated by the generalization error bound for the nonparametric plug-in classifier trained by $S_{\\hat{y}}$, and the hypothetical labeling $\\hat{y}$ with minimum associated error bound is preferred, i.e. $\\arg\\min_{\\hat{y}} \\hat{R}_n(\\mathrm{PI}_S) = \\arg\\min_{\\hat{y}} \\sum_{l,m} \\theta_{lm} G_{lm,\\sqrt{2}h_n}$ where $\\theta_{lm} = \\mathbb{1}_{\\{\\hat{y}_l \\ne \\hat{y}_m\\}}$ and $G_{lm,\\sqrt{2}h_n}$ is defined in (13). By Lemma 3, minimizing $\\sum_{l,m} \\theta_{lm} G_{lm,\\sqrt{2}h_n}$ also enforces minimization of the weighted volume of cluster boundary asymptotically. To avoid the trivial clustering where all the data are grouped into a single cluster, we use the sum of within-cluster dissimilarities term $\\sum_{l=1}^n \\exp\\left(-G_{l e_l,\\sqrt{2}h_n}\\right)$ to control the size of clusters. Therefore, the objective function of our pairwise clustering method is below:\n\n$$\\Psi(e) = \\sum_{l=1}^n \\exp\\left(-G_{l e_l,\\sqrt{2}h_n}\\right) + \\lambda \\sum_{l,m} \\left( \\tilde{\\theta}_{lm} G_{lm,\\sqrt{2}h_n} + \\rho_{lm}(e_l, e_m) \\right) \\tag{18}$$\n\nwhere $\\rho_{lm}$ is a function to enforce the consistency of the cluster indicators:\n\n$$\\rho_{lm}(e_l, e_m) = \\begin{cases} \\infty & e_m = l, e_l \\ne l \\text{ or } e_l = m, e_m \\ne m \\\\\\\\ 0 & \\text{otherwise} \\end{cases},$$\n\n$\\lambda$ is a balancing parameter. Due to the form of (18), we construct a pairwise Markov Random Field (MRF) representing the unary term $u_l$ and the pairwise term $\\tilde{\\theta}_{lm} G_{lm,\\sqrt{2}h_n} + \\rho_{lm}$ as the data likelihood and prior respectively. The variables $e$ are modeled as nodes and the unary term and pairwise term in (18) are modeled as potential functions in the pairwise MRF. The minimization of the objective function is then converted to a MAP (Maximum a Posteriori) problem in the pairwise MRF. (18) is minimized by Max-Product Belief Propagation (BP).\n\nThe computational complexity of our clustering algorithm is $O(TEN)$, where $E$ is the number of edges in the pairwise MRF, $T$ is the number of iterations of message passing in the BP algorithm. We call our new algorithm Plug-In Exemplar Clustering (PIEC), and compare it to representative exemplar-based clustering methods, i.e. AP and Convex Clustering with Exemplar-Based Model (CEB) [24], for clustering on three real data sets from UCI repository, i.e. Iris, Vertebral Column (VC) and Breast Tissue (BT). 
We record the average clustering accuracy (AC) and the standard deviation of AC for all the exemplar-based clustering methods when they produce the correct number of clusters for each data set with different values of $h_n$ and $\\lambda$, and the results are shown in Table 1. Although AP produces better clustering accuracy on the VC data set, PIEC generates the correct cluster numbers many more times. The dash in Table 1 indicates that the corresponding clustering method cannot produce the correct cluster number. The default value for the kernel bandwidth $h_n$ is $h_n^*$, which is set as the variance of the pairwise distance between data points $\\{\\|x_l - x_m\\|\\}_{l<m}$. The default value for the balancing parameter $\\lambda$ is 1. We let $h_n = \\alpha h_n^*$, $\\lambda$ varies between [0.2, 1] and $\\alpha$ varies between [0.2, 1.9] with step 0.2 and 0.05 respectively, resulting in 170 different parameter settings. We also generate the same number of parameter settings for AP and CEB.\n\nTable 1: Comparison Between Exemplar-Based Clustering Methods. The number in the bracket is the number of times when the corresponding algorithm produces correct cluster numbers.\n\nData sets    Iris                      VC                        BT\nAP           0.8933 ± 0.0138 (16)      0.6677 (14)               0.4906 (1)\nCEB          0.6929 ± 0.0168 (15)      0.4748 ± 0.0014 (5)       0.3868 ± 0.08 (2)\nPIEC         0.9089 ± 0.0033 (15)      0.5263 ± 0.0173 (35)      0.6585 ± 0.0103 (5)\n\n5 Conclusion\n\nWe propose a new pairwise clustering framework where nonparametric pairwise similarity is derived by minimizing the generalization error of the unsupervised nonparametric classifier. Our framework bridges the gap between clustering and multi-class classification, and explains the widely used kernel similarity for clustering. 
In addition, we prove that the generalization error bound for the unsupervised plug-in classifier is asymptotically equal to the weighted volume of cluster boundary for Low Density Separation. Based on the derived nonparametric pairwise similarity using the plug-in classifier, we propose a new nonparametric exemplar-based clustering method with enhanced discriminative capability compared to the existing exemplar-based clustering methods.\n\nAppendix\n\nLemma 2. (Consistency of Kernel Density Estimator) Let the kernel bandwidth $h_n$ of the Gaussian kernel $K$ be chosen such that $h_n \\to 0$, $\\frac{\\log h_n^{-1}}{n h_n^d} \\to 0$. For any $P_X \\in \\mathcal{P}_X$, there exists an $n_0$ which depends on $\\sigma_0$ and the VC characteristics of $K$ such that, when $n > n_0$, with probability greater than $1 - L h_n^{E_{\\sigma_0^2}}$ over the data $\\{x_l\\}$,\n\n$$\\left\\| \\hat{f}_{n,h_n}(x) - f(x) \\right\\|_{\\infty} = O\\left(\\sqrt{\\frac{\\log h_n^{-1}}{n h_n^d}} + h_n^{\\gamma}\\right) \\tag{19}$$\n\nwhere $\\hat{f}_{n,h_n}$ is the kernel density estimator of $f$. Furthermore, for any $P_{XY} \\in \\mathcal{P}_{XY}$, when $n > n_0$, then with probability greater than $1 - 2L h_n^{E_{\\sigma_0^2}}$ over the data $\\{x_l\\}$,\n\n$$\\left\\| \\hat{\\eta}^{(i)}_{n,h_n}(x) - \\eta^{(i)}(x) \\right\\|_{\\infty} = O\\left(\\sqrt{\\frac{\\log h_n^{-1}}{n h_n^d}} + h_n^{\\gamma}\\right) \\tag{20}$$\n\nfor each $1 \\le i \\le Q$.\n\nLemma 3. (Consistency of the Generalized Kernel Density Estimator) Suppose $f$ is the probabilistic density function of $P_X \\in \\mathcal{P}_X$. Let $g$ be a bounded function defined on $X$ and $g \\in \\Sigma_{\\gamma, g_0}$, $0 < g_{\\min} \\le g \\le g_{\\max}$, and $e = \\frac{f}{g}$. Define the generalized kernel density estimator of $e$ as\n\n$$\\hat{e}_{n,h} \\triangleq \\frac{1}{n} \\sum_{l=1}^n \\frac{K_h(x - x_l)}{g(x_l)} \\tag{21}$$\n\nLet $\\sigma_g^2 = \\frac{\\|K\\|_2^2 f_{\\max}}{g_{\\min}^2}$. There exists $n_g$ which depends on $\\sigma_g$ and the VC characteristics of $K$ such that, when $n > n_g$, with probability greater than $1 - L h_n^{E_{\\sigma_g^2}}$ over the data $\\{x_l\\}$,\n\n$$\\left\\| \\hat{e}_{n,h_n}(x) - e(x) \\right\\|_{\\infty} = O\\left(\\sqrt{\\frac{\\log h_n^{-1}}{n h_n^d}} + h_n^{\\gamma}\\right) \\tag{22}$$\n\nwhere $h_n$ is chosen such that $h_n \\to 0$, $\\frac{\\log h_n^{-1}}{n h_n^d} \\to 0$.\n\nSketch of proof: For fixed $h \\ne 0$, we consider the class of functions\n\n$$\\mathcal{F}_g \\triangleq \\left\\{ \\frac{K\\left(\\frac{t - \\cdot}{h}\\right)}{g(\\cdot)}, \\; t \\in \\mathbb{R}^d, \\; h \\ne 0 \\right\\}$$\n\nIt can be verified that $\\mathcal{F}_g$ is also a bounded VC class with the envelope function $F_g = \\frac{F}{g_{\\min}}$, and\n\n$$N\\left(\\mathcal{F}_g, \\|\\cdot\\|_{L_2(P)}, \\tau \\|F_g\\|_{L_2(P)}\\right) \\le \\left(\\frac{A}{\\tau}\\right)^v \\tag{23}$$\n\nThen (22) follows from a similar argument in the proof of Lemma 2 and Corollary 2.2 in [14].\n\nThe generalized kernel density estimator (21) is also used in [13] to estimate the Laplacian PDF Distance between two probabilistic density functions, and the authors only provide the proof of pointwise weak consistency of this estimator in [13]. Under mild conditions, our Lemma 3 and Lemma 2 show the strong consistency of the generalized kernel density estimator and the traditional kernel density estimator under the same theoretical framework of the VC property of the kernel.\n\nAcknowledgements. This material is based upon work supported by the National Science Foundation under Grant No. 1318971.\n\nReferences\n\n[1] Hariharan Narayanan, Mikhail Belkin, and Partha Niyogi. On the relation between low density separation, spectral clustering and graph cuts. In NIPS, pages 1025–1032, 2006.\n[2] J. A. Hartigan and M. A. Wong. A K-means clustering algorithm. Applied Statistics, 28:100–108, 1979.\n[3] Andrew Y. Ng, Michael I. Jordan, and Yair Weiss. On spectral clustering: Analysis and an algorithm. In NIPS, pages 849–856, 2001.\n[4] Noam Shental, Assaf Zomet, Tomer Hertz, and Yair Weiss. 
Pairwise clustering and graphical models. In NIPS, 2003.\n[5] Linli Xu, James Neufeld, Bryce Larson, and Dale Schuurmans. Maximum margin clustering. In NIPS, 2004.\n[6] Zohar Karnin, Edo Liberty, Shachar Lovett, Roy Schwartz, and Omri Weinstein. Unsupervised SVMs: On the complexity of the furthest hyperplane problem. Journal of Machine Learning Research - Proceedings Track, 23:2.1\u20132.17, 2012.\n[7] Ryan Gomes, Andreas Krause, and Pietro Perona. Discriminative clustering by regularized information maximization. In NIPS, pages 775\u2013783, 2010.\n[8] Masashi Sugiyama, Makoto Yamada, Manabu Kimura, and Hirotaka Hachiya. On information-maximization clustering: Tuning parameter selection and analytic solution. In ICML, pages 65\u201372, 2011.\n[9] T. Cover and P. Hart. Nearest neighbor pattern classification. Information Theory, IEEE Transactions on, 13(1):21\u201327, January 1967.\n[10] Luc Devroye. A probabilistic theory of pattern recognition, volume 31. Springer, 1996.\n[11] Yuhong Yang. Minimax nonparametric classification - Part I: Rates of convergence. IEEE Transactions on Information Theory, 45(7):2271\u20132284, 1999.\n[12] Jean-Yves Audibert and Alexandre B. Tsybakov. Fast learning rates for plug-in classifiers. The Annals of Statistics, 35(2):608\u2013633, 2007.\n[13] Robert Jenssen, Deniz Erdogmus, Jos\u00e9 Carlos Pr\u00edncipe, and Torbj\u00f8rn Eltoft. The Laplacian PDF distance: A cost function for clustering in a kernel feature space. In NIPS, 2004.\n[14] Evarist Gin\u00e9 and Armelle Guillou. Rates of strong uniform consistency for multivariate kernel density estimators. Ann. Inst. H. Poincar\u00e9 Probab. Statist., 38(6):907\u2013921, November 2002.\n[15] Uwe Einmahl and David M. Mason. Uniform in bandwidth consistency of kernel-type function estimators. The Annals of Statistics, 33:1380\u20131403, 2005.\n[16] R. M. Dudley. Uniform Central Limit Theorems. 
Cambridge University Press, 1999.\n[17] A. W. van der Vaart and J. A. Wellner. Weak Convergence and Empirical Processes. Springer Series in Statistics. Springer, 1996.\n[18] Deborah Nolan and David Pollard. U-Processes: Rates of convergence. The Annals of Statistics, 15(2), 1987.\n[19] Olivier Chapelle and Alexander Zien. Semi-Supervised Classification by Low Density Separation. In AISTATS, 2005.\n[20] Markus Maier, Ulrike von Luxburg, and Matthias Hein. Influence of graph construction on graph-based clustering measures. In NIPS, pages 1025\u20131032, 2008.\n[21] Zenglin Xu, Rong Jin, Jianke Zhu, Irwin King, Michael R. Lyu, and Zhirong Yang. Adaptive regularization for transductive support vector machine. In NIPS, pages 2125\u20132133, 2009.\n[22] Xiaojin Zhu, John Lafferty, and Ronald Rosenfeld. Semi-supervised learning with graphs. PhD thesis, Carnegie Mellon University, Language Technologies Institute, School of Computer Science, 2005.\n[23] Brendan J. Frey and Delbert Dueck. Clustering by passing messages between data points. Science, 315:972\u2013977, 2007.\n[24] Danial Lashkari and Polina Golland. Convex clustering with exemplar-based models. In NIPS, 2007.", "award": [], "sourceid": 126, "authors": [{"given_name": "Yingzhen", "family_name": "Yang", "institution": "UIUC"}, {"given_name": "Feng", "family_name": "Liang", "institution": "Univ. of Illinois Urbana-Champaign Statistics"}, {"given_name": "Shuicheng", "family_name": "Yan", "institution": "National University of Singapore"}, {"given_name": "Zhangyang", "family_name": "Wang", "institution": "UIUC"}, {"given_name": "Thomas", "family_name": "Huang", "institution": "UIUC"}]}