{"title": "Co-regularized Multi-view Spectral Clustering", "book": "Advances in Neural Information Processing Systems", "page_first": 1413, "page_last": 1421, "abstract": "In many clustering problems, we have access to multiple views of the data each of which could be individually used for clustering. Exploiting information from multiple views, one can hope to find a clustering that is more accurate than the ones obtained using the individual views. Since the true clustering would assign a point to the same cluster irrespective of the view, we can approach this problem by looking for clusterings that are consistent across the views, i.e., corresponding data points in each view should have same cluster membership. We propose a spectral clustering framework that achieves this goal by co-regularizing the clustering hypotheses, and propose two co-regularization schemes to accomplish this. Experimental comparisons with a number of baselines on two synthetic and three real-world datasets establish the efficacy of our proposed approaches.", "full_text": "Co-regularized Multi-view Spectral Clustering\n\nAbhishek Kumar\u2217\n\nDept. of Computer Science\n\nUniversity of Maryland,\n\nCollege Park, MD\n\nPiyush Rai\u2217\n\nDept. of Computer Science\n\nUniversity of Utah,\nSalt Lake City, UT\n\nHal Daum\u00b4e III\n\nDept. of Computer Science\n\nUniversity of Maryland,\n\nCollege Park, MD\n\nabhishek@cs.umd.edu\n\npiyush@cs.utah.edu\n\nhal@umiacs.umd.edu\n\nAbstract\n\nIn many clustering problems, we have access to multiple views of the data each\nof which could be individually used for clustering. Exploiting information from\nmultiple views, one can hope to \ufb01nd a clustering that is more accurate than the\nones obtained using the individual views. 
Often these different views admit the same underlying clustering of the data, so we can approach this problem by looking for clusterings that are consistent across the views, i.e., corresponding data points in each view should have the same cluster membership. We propose a spectral clustering framework that achieves this goal by co-regularizing the clustering hypotheses, and propose two co-regularization schemes to accomplish this. Experimental comparisons with a number of baselines on two synthetic and three real-world datasets establish the efficacy of our proposed approaches.

1 Introduction
Many real-world datasets have representations in the form of multiple views [1, 2]. For example, webpages usually consist of both page-text and hyperlink information; images on the web have captions associated with them; in multi-lingual information retrieval, the same document has multiple representations in different languages; and so on. Although these individual views might be sufficient on their own for a given learning task, they can often provide complementary information to each other, which can lead to improved performance on the learning task at hand.

In the context of data clustering, we seek a partition of the data based on some similarity measure between the examples. Out of the numerous clustering algorithms, spectral clustering has gained considerable attention in the recent past due to its strong performance on arbitrarily shaped clusters and its well-defined mathematical framework [3]. Spectral clustering is accomplished by constructing a graph from the data points, with edges between them representing the similarities, and solving a relaxation of the normalized min-cut problem on this graph [4]. For the multi-view clustering problem, we work with the assumption that the true underlying clustering would assign corresponding points in each view to the same cluster.
Given this assumption, we can approach the multi-view clustering problem by limiting our search to clusterings that are compatible across the graphs defined over each of the views: corresponding nodes in each graph should have the same cluster membership.

In this paper, we propose two spectral clustering algorithms that achieve this goal by co-regularizing the clustering hypotheses across views. Co-regularization is a well-known technique in the semi-supervised learning literature; however, not much is known about using it for unsupervised learning problems. We propose novel spectral clustering objective functions that implicitly combine graphs from multiple views of the data to achieve a better clustering. Our proposed methods give us a way to combine multiple kernels (or similarity matrices) for the clustering problem. Moreover, we note that although multiple kernel learning has met with considerable success on supervised learning problems, similar investigations for unsupervised learning have been lacking so far, which is one of the motivations behind this work.

*Authors contributed equally

2 Co-regularized Spectral Clustering
We assume that we are given data having multiple representations (i.e., views). Let $X = \{x_1^{(v)}, x_2^{(v)}, \ldots, x_n^{(v)}\}$ denote the examples in view $v$ and $K^{(v)}$ denote the similarity or kernel matrix of $X$ in this view. We write the normalized graph Laplacian for this view as $L^{(v)} = D^{(v)-1/2} K^{(v)} D^{(v)-1/2}$. The single-view spectral clustering algorithm of [5] solves the following optimization problem for the normalized graph Laplacian $L^{(v)}$:

$$\max_{U^{(v)} \in \mathbb{R}^{n \times k}} \; \mathrm{tr}\left(U^{(v)T} L^{(v)} U^{(v)}\right), \quad \text{s.t.}\ U^{(v)T} U^{(v)} = I \qquad (1)$$

where $\mathrm{tr}$ denotes the matrix trace. The rows of the matrix $U^{(v)}$ are the embeddings of the data points, which can be given to the k-means algorithm to obtain cluster memberships.
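As a concrete illustration of the single-view objective in Eq. 1, the following NumPy sketch (ours, not the authors' code) computes the normalized Laplacian $L = D^{-1/2} K D^{-1/2}$, takes its top-$k$ eigenvectors, and returns their rows as the embeddings fed to k-means:

```python
import numpy as np

def spectral_embedding(K, k):
    """Rows of the returned matrix are the spectral embeddings of Eq. 1:
    the top-k eigenvectors of L = D^{-1/2} K D^{-1/2}."""
    d = K.sum(axis=1)                          # degrees (assumed positive)
    d_inv_sqrt = 1.0 / np.sqrt(d)
    L = K * np.outer(d_inv_sqrt, d_inv_sqrt)   # normalized Laplacian as in the paper
    _, vecs = np.linalg.eigh(L)                # eigenvalues in ascending order
    return vecs[:, -k:]                        # maximizers of tr(U^T L U), U^T U = I

# toy similarity matrix with two well-separated blocks
K = np.array([[1.0, 0.9, 0.0, 0.0],
              [0.9, 1.0, 0.0, 0.0],
              [0.0, 0.0, 1.0, 0.9],
              [0.0, 0.0, 0.9, 1.0]])
U = spectral_embedding(K, k=2)  # rows within a block coincide
```

Running k-means on the rows of `U` then yields the cluster memberships.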
For a detailed introduction to both theoretical and practical aspects of spectral clustering, the reader is referred to [3]. Our multi-view spectral clustering framework builds on standard single-view spectral clustering by appealing to the co-regularization framework typically used in the semi-supervised learning literature [1].

Co-regularization in semi-supervised learning essentially works by making the hypotheses learned from different views of the data agree with each other on the unlabeled data [6]. The framework employs two main assumptions for its success: (a) the true target functions in each view should agree on the labels for the unlabeled data (compatibility), and (b) the views are independent given the class label (conditional independence). The compatibility assumption allows us to shrink the space of possible target hypotheses by searching only over the compatible functions. Standard PAC-style analysis [1] shows that this also reduces the number of examples needed to learn the target function, since this number depends on the size of the hypothesis class. The independence assumption makes it unlikely for compatible classifiers to agree on wrong labels. In the case of clustering, this would mean that a data point would be assigned to the correct cluster in both views with high probability.

Here, we propose two co-regularization based approaches to make the clustering hypotheses on the different graphs (i.e., views) agree with each other. The effectiveness of spectral clustering hinges crucially on the construction of the graph Laplacian and the resulting eigenvectors that reflect the cluster structure in the data.
Therefore, we construct an objective function that consists of the graph Laplacians from all the views of the data, and we regularize the eigenvectors of the Laplacians such that the cluster structures resulting from each Laplacian look consistent across all the views.

Our first co-regularization scheme (Section 2.1) enforces that the eigenvectors $U^{(v)}$ and $U^{(w)}$ of a view pair $(v, w)$ have high pairwise similarity (using a pairwise co-regularization criterion we define in Section 2.1). Our second co-regularization scheme (Section 2.3) enforces the view-specific eigenvectors to look similar by regularizing them towards a common consensus (centroid-based co-regularization). This idea differs from previously proposed consensus clustering approaches [7], which commit to individual clusterings in a first step and then combine them into a consensus in a second step; we optimize for the individual clusterings as well as the consensus using a joint cost function.

2.1 Pairwise Co-regularization
In standard spectral clustering, the eigenvector matrix $U^{(v)}$ is the data representation for the subsequent k-means clustering step (with the $i$'th row mapping to the original $i$'th sample). In our proposed objective function, we encourage the pairwise similarities of examples under the new representation (in terms of the rows of the $U^{(\cdot)}$'s) to be similar across all the views. This amounts to enforcing the spectral clustering hypotheses (which are based on the $U^{(\cdot)}$'s) to be the same across all the views.

We work with the two-view case for ease of exposition. This will later be extended to more than two views.
We propose the following cost function as a measure of disagreement between the clusterings of two views:

$$D(U^{(v)}, U^{(w)}) = \left\| \frac{K_{U^{(v)}}}{\|K_{U^{(v)}}\|_F^2} - \frac{K_{U^{(w)}}}{\|K_{U^{(w)}}\|_F^2} \right\|_F^2 \qquad (2)$$

where $K_{U^{(v)}}$ is the similarity matrix for $U^{(v)}$, and $\|\cdot\|_F$ denotes the Frobenius norm of the matrix. The similarity matrices are normalized by their Frobenius norms to make them comparable across views. We choose the linear kernel, i.e., $k(x_i, x_j) = x_i^T x_j$, as our similarity measure in Equation 2. This implies that $K_{U^{(v)}} = U^{(v)} U^{(v)T}$. The reason for choosing the linear kernel to measure the similarity of the $U^{(\cdot)}$'s is twofold. First, the similarity measure (or kernel) used in the Laplacian for spectral clustering has already taken care of the non-linearities present in the data (if any), and the embeddings $U^{(\cdot)}$, being real-valued cluster indicators, can be considered to obey linear similarities. Second, the linear kernel for $U^{(\cdot)}$ yields a nice optimization problem. We also note that $\|K_{U^{(v)}}\|_F^2 = k$, where $k$ is the number of clusters. Substituting this in Equation 2 and ignoring the constant additive and scaling terms that depend on the number of clusters, we get

$$D(U^{(v)}, U^{(w)}) = -\mathrm{tr}\left(U^{(v)} U^{(v)T} U^{(w)} U^{(w)T}\right)$$

We want to minimize the above disagreement between the clusterings of views $v$ and $w$. Combining this with the spectral clustering objectives of the individual views, we get the following joint maximization problem for two graphs:

$$\max_{U^{(v)},\, U^{(w)} \in \mathbb{R}^{n \times k}} \; \mathrm{tr}\left(U^{(v)T} L^{(v)} U^{(v)}\right) + \mathrm{tr}\left(U^{(w)T} L^{(w)} U^{(w)}\right) + \lambda\, \mathrm{tr}\left(U^{(v)} U^{(v)T} U^{(w)} U^{(w)T}\right) \qquad (3)$$
$$\text{s.t.}\ U^{(v)T} U^{(v)} = I, \quad U^{(w)T} U^{(w)} = I$$

The hyperparameter $\lambda$ trades off the spectral clustering objectives and the spectral embedding (dis)agreement term.
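The disagreement term is straightforward to compute; a small sketch (the function name is ours) confirms that identical embeddings attain the minimum value $-k$:

```python
import numpy as np

def disagreement(Uv, Uw):
    """D(U^(v), U^(w)) = -tr(Uv Uv^T Uw Uw^T): the measure obtained after
    choosing the linear kernel and dropping the constant terms."""
    return -np.trace(Uv @ Uv.T @ Uw @ Uw.T)

rng = np.random.default_rng(0)
Uv, _ = np.linalg.qr(rng.normal(size=(6, 2)))  # orthonormal n x k embeddings
Uw, _ = np.linalg.qr(rng.normal(size=(6, 2)))
d_same = disagreement(Uv, Uv)  # equals -k, since tr(U U^T U U^T) = k when U^T U = I
d_diff = disagreement(Uv, Uw)  # always >= -k
```

Minimizing this disagreement is what the trace term in Eq. 3 maximizes.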
The joint optimization problem given by Equation 3 can be solved using alternating maximization w.r.t. $U^{(v)}$ and $U^{(w)}$. For a given $U^{(w)}$, we get the following optimization problem in $U^{(v)}$:

$$\max_{U^{(v)} \in \mathbb{R}^{n \times k}} \; \mathrm{tr}\left\{U^{(v)T} \left(L^{(v)} + \lambda U^{(w)} U^{(w)T}\right) U^{(v)}\right\}, \quad \text{s.t.}\ U^{(v)T} U^{(v)} = I. \qquad (4)$$

This is a standard spectral clustering objective on view $v$ with the modified Laplacian $L^{(v)} + \lambda U^{(w)} U^{(w)T}$, and can be seen as a way of combining kernels or Laplacians. The difference from standard kernel combination (kernel addition, for example) is that the combination is adaptive, since $U^{(w)}$ keeps getting updated at each step as guided by the clustering algorithm. The solution $U^{(v)}$ is given by the top-$k$ eigenvectors of this modified Laplacian. Since alternating maximization can get stuck in a local maximum [8], it is important to have a sensible initialization. If there is no prior information on which view is more informative about the clustering, we can start with any of the views. However, if we have some a priori knowledge, we can start with the graph Laplacian $L^{(w)}$ of the more informative view and initialize $U^{(w)}$. The alternating maximization is then carried out until convergence. Note that one possibility could be to regularize directly on the eigenvectors $U^{(v)}$'s and make them close to each other (e.g., in the sense of the Frobenius norm of the difference between $U^{(v)}$ and $U^{(w)}$). However, this type of regularization could be too restrictive: it could end up shrinking the hypothesis space of feasible clusterings too much, ruling out many valid clusterings.

For fixed $\lambda$ and $n$, the joint objective of Eq.
3 can be shown to be bounded from above by a constant. Since the objective is non-decreasing over the iterations, the algorithm is guaranteed to converge. In practice, we monitor convergence by the difference in the objective value between consecutive iterations, and stop when the difference falls below a threshold of $\epsilon = 10^{-4}$. In all our experiments, we converge in fewer than 10 iterations. Note that we can use either $U^{(v)}$ or $U^{(w)}$ in the final k-means step of the spectral clustering algorithm. In our experiments, we observe only a marginal difference in clustering performance depending on which $U^{(\cdot)}$ is used in the final k-means step.

2.2 Extension to Multiple Views
We can extend the co-regularized spectral clustering proposed in the previous section to more than two views. This can be done by employing pairwise co-regularizers in the objective function of Eq. 3. For $m$ views, we have

$$\max_{U^{(1)}, \ldots, U^{(m)} \in \mathbb{R}^{n \times k}} \; \sum_{v=1}^{m} \mathrm{tr}\left(U^{(v)T} L^{(v)} U^{(v)}\right) + \lambda \sum_{\substack{1 \le v, w \le m \\ v \ne w}} \mathrm{tr}\left(U^{(v)} U^{(v)T} U^{(w)} U^{(w)T}\right), \qquad (5)$$
$$\text{s.t.}\ U^{(v)T} U^{(v)} = I, \ \forall\, 1 \le v \le m$$

We use a common $\lambda$ for all pairwise co-regularizers for simplicity of exposition; however, different $\lambda$'s can be used for different pairs of views. As in the two-view case, we optimize by alternating maximization, cycling over the views. With all but one $U^{(v)}$ fixed, we have the following optimization problem:

$$\max_{U^{(v)}} \; \mathrm{tr}\left\{U^{(v)T} \left(L^{(v)} + \lambda \sum_{\substack{1 \le w \le m \\ w \ne v}} U^{(w)} U^{(w)T}\right) U^{(v)}\right\}, \quad \text{s.t.}\ U^{(v)T} U^{(v)} = I \qquad (6)$$

We initialize all $U^{(v)}$, $2 \le v \le m$, by solving the spectral clustering problem for the single views. We then solve the objective of Eq. 6 for $U^{(1)}$ given all other $U^{(v)}$, $2 \le v \le m$.
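The alternating maximization of Eq. 6 can be sketched as follows; this is our own NumPy illustration (with a fixed iteration count rather than the paper's $10^{-4}$ objective threshold), not the authors' implementation:

```python
import numpy as np

def normalized_laplacian(K):
    """L = D^{-1/2} K D^{-1/2}, as used throughout the paper."""
    d_inv_sqrt = 1.0 / np.sqrt(K.sum(axis=1))
    return K * np.outer(d_inv_sqrt, d_inv_sqrt)

def top_k_eigvecs(M, k):
    _, vecs = np.linalg.eigh((M + M.T) / 2.0)  # symmetrize for numerical safety
    return vecs[:, -k:]

def coregularized_multiview(kernels, k, lam=0.05, iters=10):
    """Alternating maximization for Eq. 6: each U^(v) is set to the top-k
    eigenvectors of the modified Laplacian L^(v) + lam * sum_{w!=v} U^(w) U^(w)^T."""
    Ls = [normalized_laplacian(K) for K in kernels]
    Us = [top_k_eigvecs(L, k) for L in Ls]      # single-view initialization
    for _ in range(iters):
        for v, L in enumerate(Ls):
            M = L + lam * sum(U @ U.T for w, U in enumerate(Us) if w != v)
            Us[v] = top_k_eigvecs(M, k)
    return Us

# toy example: two views that agree on a two-block cluster structure
K1 = np.array([[1.0, 0.9, 0.1, 0.1],
               [0.9, 1.0, 0.1, 0.1],
               [0.1, 0.1, 1.0, 0.9],
               [0.1, 0.1, 0.9, 1.0]])
Us = coregularized_multiview([K1, K1.copy()], k=2, lam=0.05)
```

Any of the returned embeddings (or their concatenation) can then be passed to k-means.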
The optimization is then cycled over all views while keeping the previously obtained $U^{(\cdot)}$'s fixed.

2.3 Centroid-Based Co-regularization
In this section, we present an alternative regularization scheme that regularizes each view-specific set of eigenvectors $U^{(v)}$ towards a common centroid $U^*$ (akin to a consensus set of eigenvectors). In contrast with the pairwise regularization approach, which has $\binom{m}{2}$ pairwise regularization terms, where $m$ is the number of views, the centroid-based regularization scheme has $m$ regularization terms. The objective function can be written as:

$$\max_{U^{(1)}, \ldots, U^{(m)},\, U^* \in \mathbb{R}^{n \times k}} \; \sum_{v=1}^{m} \mathrm{tr}\left(U^{(v)T} L^{(v)} U^{(v)}\right) + \sum_{v} \lambda_v\, \mathrm{tr}\left(U^{(v)} U^{(v)T} U^* U^{*T}\right), \qquad (7)$$
$$\text{s.t.}\ U^{(v)T} U^{(v)} = I, \ \forall\, 1 \le v \le m, \quad U^{*T} U^* = I$$

This objective balances a trade-off between the individual spectral clustering objectives and the agreement of each view-specific $U^{(v)}$ with the consensus eigenvectors $U^*$. Each regularization term is weighted by a view-specific parameter $\lambda_v$, which can be set to reflect the importance of view $v$.

Just like Equation 6, the objective in Equation 7 can be solved in an alternating fashion, optimizing each of the $U^{(v)}$'s one at a time keeping all other variables fixed, followed by optimizing the consensus $U^*$ keeping all the $U^{(v)}$'s fixed.

It is easy to see that with all other view-specific eigenvectors and the consensus $U^*$ fixed, optimizing $U^{(v)}$ for view $v$ amounts to solving:

$$\max_{U^{(v)} \in \mathbb{R}^{n \times k}} \; \mathrm{tr}\left(U^{(v)T} L^{(v)} U^{(v)}\right) + \lambda_v\, \mathrm{tr}\left(U^{(v)} U^{(v)T} U^* U^{*T}\right), \quad \text{s.t.}\ U^{(v)T} U^{(v)} = I \qquad (8)$$

which is equivalent to solving the standard spectral clustering objective for $U^{(v)}$ with the modified Laplacian $L^{(v)} + \lambda_v U^* U^{*T}$. Solving for the consensus $U^*$ requires solving the following objective:

$$\max_{U^* \in \mathbb{R}^{n \times k}} \; \sum_{v} \lambda_v\, \mathrm{tr}\left(U^{(v)} U^{(v)T} U^* U^{*T}\right), \quad \text{s.t.}\ U^{*T} U^* = I \qquad (9)$$

Using the circular property of the matrix trace, Equation 9 can be rewritten as:

$$\max_{U^* \in \mathbb{R}^{n \times k}} \; \mathrm{tr}\left(U^{*T} \left(\sum_{v} \lambda_v\, U^{(v)} U^{(v)T}\right) U^*\right), \quad \text{s.t.}\ U^{*T} U^* = I \qquad (10)$$

which is equivalent to solving the standard spectral clustering objective for $U^*$ with the modified Laplacian $\sum_v \lambda_v U^{(v)} U^{(v)T}$. In contrast with the pairwise co-regularization approach of Section 2.1, which computes optimal view-specific eigenvectors $U^{(v)}$'s that finally need to be combined (e.g., via column-wise concatenation) before running the k-means step, the centroid-based co-regularization approach directly finds an optimal $U^*$ to be used in the k-means step. One possible downside of the centroid-based approach is that noisy views could adversely affect the optimal $U^*$, since it depends on all the views. To deal with this, careful selection of the weighting parameters $\lambda_v$ is required: if it is known a priori that some views are noisy, it is advisable to use small values of $\lambda_v$ for those views, so as to prevent them from adversely affecting $U^*$.

3 Experiments
We compare both of our co-regularization based multi-view spectral clustering approaches with a number of baselines.
In particular, we compare with:

• Single View: Using the most informative view, i.e., the one that achieves the best spectral clustering performance using a single view of the data.
• Feature Concatenation: Concatenating the features of all views, and then running standard spectral clustering using the graph Laplacian derived from the joint view representation of the data.
• Kernel Addition: Combining the different kernels by adding them, and then running standard spectral clustering on the corresponding Laplacian. As suggested by earlier findings [9], even this seemingly simple approach often leads to near-optimal results compared to more sophisticated approaches for classification. Note that kernel addition reduces to feature concatenation in the special case of the linear kernel; in general, kernel addition is the same as concatenation of features in the Reproducing Kernel Hilbert Space.
• Kernel Product (element-wise): Multiplying the corresponding entries of the kernels and applying standard spectral clustering on the resultant Laplacian. For the special case of the Gaussian kernel, the element-wise kernel product would be the same as simple feature concatenation if both kernels used the same width parameter $\sigma$. However, in our experiments we use different width parameters for different views, so the performance of the kernel product may not be directly comparable to feature concatenation.
• CCA based Feature Extraction: Applying CCA for feature fusion from the multiple views of the data [10], and then running spectral clustering using these extracted features.
We apply both standard CCA and kernel CCA for feature extraction, and report the clustering results for whichever gives the better performance.
• Minimizing-Disagreement Spectral Clustering: Our last baseline is the minimizing-disagreement approach to spectral clustering [11], and is perhaps the most closely related to our co-regularization based approach. This algorithm is discussed further in Sec. 4.

To distinguish between the results of our two co-regularization based approaches, we use the symbol "P" in the results tables to denote the pairwise co-regularization method and the symbol "C" to denote the centroid-based co-regularization method. For datasets with more than 2 views, we also explicitly mention the number of views in parentheses.

We report experimental results on two synthetic and three real-world datasets. We give a brief description of each dataset here.

• Synthetic data 1: Our first synthetic dataset consists of two views and is generated in a manner akin to [12], which first chooses the cluster $c_i$ each sample belongs to, and then generates each of the views $x_i^{(1)}$ and $x_i^{(2)}$ from a two-component Gaussian mixture model. These views are combined to form the sample $(x_i^{(1)}, x_i^{(2)}, c_i)$. We sample 1000 points from each view. The cluster means in view 1 are $\mu_1^{(1)} = (1\ 1)$, $\mu_2^{(1)} = (2\ 2)$, and in view 2 are $\mu_1^{(2)} = (2\ 2)$, $\mu_2^{(2)} = (1\ 1)$. The covariances for the two views are given below.

$$\Sigma_1^{(1)} = \begin{pmatrix} 1 & 0.5 \\ 0.5 & 1.5 \end{pmatrix}, \ \Sigma_1^{(2)} = \begin{pmatrix} 0.3 & 0 \\ 0 & 0.6 \end{pmatrix}, \ \Sigma_2^{(1)} = \begin{pmatrix} 0.3 & 0 \\ 0 & 0.6 \end{pmatrix}, \ \Sigma_2^{(2)} = \begin{pmatrix} 1 & 0.5 \\ 0.5 & 1.5 \end{pmatrix}$$

• Synthetic data 2: Our second synthetic dataset consists of three views, and the features are correlated. Each view still has two clusters, and each view is generated by a two-component Gaussian mixture model. The cluster means in view 1 are $\mu_1^{(1)} = (1\ 1)$, $\mu_2^{(1)} = (3\ 4)$; in view 2 they are $\mu_1^{(2)} = (1\ 2)$, $\mu_2^{(2)} = (2\ 2)$; and in view 3 they are $\mu_1^{(3)} = (1\ 1)$, $\mu_2^{(3)} = (3\ 3)$. The covariances for the three views are given below; the notation $\Sigma_c^{(v)}$ denotes the parameter for the $c$'th cluster in the $v$'th view.

$$\Sigma_1^{(1)} = \begin{pmatrix} 1 & 0.5 \\ 0.5 & 1.5 \end{pmatrix}, \ \Sigma_1^{(2)} = \begin{pmatrix} 1 & -0.2 \\ -0.2 & 1 \end{pmatrix}, \ \Sigma_1^{(3)} = \begin{pmatrix} 1.2 & 0.2 \\ 0.2 & 1 \end{pmatrix}$$
$$\Sigma_2^{(1)} = \begin{pmatrix} 0.3 & 0.2 \\ 0.2 & 0.6 \end{pmatrix}, \ \Sigma_2^{(2)} = \begin{pmatrix} 0.6 & 0.1 \\ 0.1 & 0.5 \end{pmatrix}, \ \Sigma_2^{(3)} = \begin{pmatrix} 1 & 0.4 \\ 0.4 & 0.7 \end{pmatrix}$$

• Reuters Multilingual data: This test collection contains feature characteristics of documents originally written in five different languages (English, French, German, Spanish and Italian), and their translations, over a common set of 6 categories [13]. The corpus is built by sampling parts of the Reuters RCV1 and RCV2 collections [14, 15]. We use documents originally in English as the first view and their French translations as the second view. We randomly sample 1200 documents from this collection in a balanced manner, with each of the 6 clusters having 200 documents. The documents are in bag-of-words representation, which implies that the features are extremely sparse and high-dimensional. Standard similarity measures (like the Gaussian kernel) are often unreliable in very high dimensions. Since spectral clustering essentially works with similarities of the data, we first project the data to a 100-dimensional space using Latent Semantic Analysis (LSA) [16] and compute similarities in this lower dimensional space.
This is akin to computing a topic-based similarity of documents [17].
• UCI Handwritten digits data: Our second real-world dataset is the handwritten digits (0-9) data from the UCI repository. The dataset consists of 2000 examples, with view 1 being the 76 Fourier coefficients and view 2 being the 216 profile correlations of each example image.
• Caltech-101 data: Our third real-world dataset is a subset of the Caltech-101 data from the Multiple Kernel Learning repository, from which we chose 450 examples having 30 underlying clusters. We experiment with 4 kernels from this dataset. In particular, we chose the "pixel features", the "Pyramid Histogram Of Gradients", bio-inspired "Sparse Localized Features", and SIFT descriptors as our four views. We report results of our co-regularized spectral clustering for the two-, three- and four-view cases.

We use normalized mutual information (NMI) as the clustering quality evaluation measure; it gives the mutual information between the obtained clustering and the true clustering, normalized by the cluster entropies. NMI ranges between 0 and 1, with a higher value indicating a closer match to the true clustering. We use the Gaussian kernel for computing the graph similarities in all experiments, unless mentioned otherwise. The standard deviation of the kernel is taken equal to the median of the pairwise Euclidean distances between the data points. In our experiments, the co-regularization parameter $\lambda$ is varied from 0.01 to 0.05 and the best result is reported (we keep $\lambda$ the same for all views; one can however also choose different $\lambda$'s based on the importance of individual views). We experiment with $\lambda$ values more exhaustively later in this section, where we show that our approach outperforms the other baselines for a wide range of $\lambda$.
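The median-bandwidth construction of the similarity matrix can be sketched as follows. The paper fixes only $\sigma$ (the median of pairwise Euclidean distances); the $\exp(-d^2/2\sigma^2)$ form and the use of the off-diagonal median are our assumptions:

```python
import numpy as np

def gaussian_kernel_median(X):
    """Gaussian similarity matrix with the median bandwidth heuristic:
    sigma = median of pairwise Euclidean distances (off-diagonal pairs)."""
    diff = X[:, None, :] - X[None, :, :]
    dists = np.sqrt((diff ** 2).sum(-1))
    sigma = np.median(dists[np.triu_indices_from(dists, k=1)])
    return np.exp(-dists ** 2 / (2.0 * sigma ** 2))

# two tight pairs of points far from each other
X = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]])
K = gaussian_kernel_median(X)  # within-pair similarities exceed cross-pair ones
```

The resulting `K` is what the graph Laplacian $L = D^{-1/2} K D^{-1/2}$ is built from in the experiments.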
In the results table, the numbers in parentheses are the standard deviations of the performance measures obtained over 20 different runs of k-means with random initializations.

3.1 Results
The results for all datasets are shown in Table 1. For the two-view synthetic data (Synthetic Data 1), both co-regularized spectral clustering approaches outperform all the baselines by a significant margin, with the pairwise approach doing marginally better than the centroid-based approach. The closest performing approaches are kernel addition and CCA. For the synthetic data, kernel-CCA with an order-2 polynomial kernel gives the best performance among all CCA variants, while Gaussian-kernel CCA performs poorly; we do not report results for Gaussian-kernel CCA here. All the multi-view baselines outperform the single-view case on the synthetic data.

For the three-view synthetic data (Synthetic Data 2), we can see that simple feature concatenation does not help much. In fact, it reduces performance when the third view is added, so we report the performance with only two views for feature concatenation. Kernel addition with three views gives a good improvement over the single-view case. Compared to the other baselines (with two views), both our co-regularized spectral clustering approaches with two views perform better. For both approaches, adding the third view further improves performance beyond the two-view case.

For the document clustering results on the Reuters multilingual data, English and French are used as the two views. On this dataset too, both our approaches outperform all the baselines by a significant margin. The next best performance is attained by the minimum-disagreement spectral clustering [11] approach.
It should be noted that the CCA and element-wise kernel product performances are worse than that of the single view.

For the UCI Handwritten digits dataset, quite a few approaches, including kernel addition, element-wise kernel multiplication, and minimum-disagreement, are close to both of our co-regularized spectral clustering approaches. It can also be noted that feature concatenation actually performs worse than the single view on this dataset.

For the Caltech-101 data, we cannot do feature concatenation since only kernels are available. Surprisingly, on this dataset all the baselines perform worse than the single-view case. On the other hand, both of our co-regularized spectral clustering approaches with two views outperform the single-view case. As we added the further views available for the Caltech-101 dataset, we found that the performance of the pairwise approach consistently went up with the addition of the third and the fourth view. The performance of the centroid-based approach got slightly worse upon adding the third view (possibly due to the view being noisy, which affected the learned U*); however, adding the fourth view brought the performance almost back to that of the pairwise case.

Method                  | Synth data 1 | Synth data 2 | Reuters       | Handwritten   | Caltech
Best Single View        | 0.267 (0.0)  | 0.898 (0.0)  | 0.287 (0.019) | 0.641 (0.008) | 0.510 (0.008)
Feature Concat          | 0.294 (0.0)  | 0.923 (0.0)  | 0.298 (0.020) | 0.619 (0.015) | --
Kernel Addition         | 0.339 (0.0)  | 0.973 (0.0)  | 0.323 (0.021) | 0.744 (0.030) | 0.383 (0.008)
Kernel Product          | 0.277 (0.0)  | 0.959 (0.0)  | 0.123 (0.010) | 0.754 (0.026) | 0.429 (0.007)
CCA                     | 0.330 (0.0)  | 0.932 (0.0)  | 0.147 (0.003) | 0.682 (0.019) | 0.466 (0.007)
Min-Disagreement        | 0.313 (0.0)  | 0.936 (0.0)  | 0.342 (0.024) | 0.745 (0.024) | 0.389 (0.008)
Co-regularized (P) (2)  | 0.378 (0.0)  | 0.981 (0.0)  | 0.375 (0.002) | 0.759 (0.031) | 0.527 (0.007)
Co-regularized (P) (3)  | --           | 0.989 (0.0)  | --            | --            | 0.533 (0.008)
Co-regularized (P) (4)  | --           | --           | --            | --            | 0.564 (0.007)
Co-regularized (C) (2)  | 0.367 (0.0)  | 0.955 (0.0)  | 0.360 (0.025) | 0.768 (0.025) | 0.522 (0.004)
Co-regularized (C) (3)  | --           | 0.989 (0.0)  | --            | --            | 0.512 (0.007)
Co-regularized (C) (4)  | --           | --           | --            | --            | 0.561 (0.005)

Table 1: NMI results on various datasets for the different baselines and the proposed approaches. Numbers in parentheses are the standard deviations. The numbers (2), (3) and (4) indicate the number of views used in our co-regularized spectral clustering approaches; the other multi-view baselines were run with the maximum number of views available (or the maximum number of views they can handle). The letters (P) and (C) indicate pairwise and centroid-based regularization respectively.

[Figure 1: NMI scores of co-regularized spectral clustering as a function of $\lambda$ for (a) the Reuters multilingual data and (b) the Caltech-101 data. Each panel plots the co-regularization approach against the closest performing baseline.]

We also experiment with various values of the co-regularization parameter $\lambda$ and observe its effect on clustering performance. The reported results are for the pairwise co-regularization approach; similar trends were observed for the centroid-based co-regularization approach, so we do not report them here. Fig. 1(a) shows the plot for the Reuters multilingual data.
The NMI score rises sharply as $\lambda$ increases from 0 and reaches a peak at $\lambda = 0.01$. After reaching a second peak at about $\lambda = 0.025$, it starts decreasing and hovers around the second best baseline (minimizing-disagreement in this case) for a while, becoming worse than the second best baseline after $\lambda = 0.075$. The plot for the Caltech-101 data is shown in Fig. 1(b). The NMI starts increasing as $\lambda$ is increased away from 0 and reaches a peak at $\lambda = 0.01$. It decreases after that, with local ups and downs. For the range of $\lambda$ shown in the plot, the NMI of co-regularized spectral clustering is greater than that of the closest baseline for most $\lambda$ values. These results indicate that although the performance of our algorithms depends on the weighting parameter $\lambda$, it is reasonably stable across a wide range of $\lambda$.

4 Related Work
A number of clustering algorithms have been proposed in the past to learn with multiple views of the data. Some of them first extract a set of shared features from the multiple views and then apply any off-the-shelf clustering algorithm, such as k-means, on these features; the Canonical Correlation Analysis (CCA) [2, 10] based approach is an example of this. Other approaches exploit the multiple views of the data as part of the clustering algorithm itself. For example, [19] proposed a Co-EM based framework for multi-view clustering in mixture models. The Co-EM approach computes expected values of the hidden variables in one view and uses these in the M-step for the other view, and vice versa; this process is repeated until a suitable stopping criterion is met. The algorithm often does not converge.

Multi-view clustering algorithms have also been proposed in the framework of spectral clustering [11, 20, 21].
In [20], the authors obtain a graph cut which is good on average over the multiple graphs but may not be the best for any single graph. They give a random walk based formulation for the problem. [11] approaches the problem of two-view clustering by constructing a bipartite graph from the nodes of both views. Edges of the bipartite graph connect nodes from one view to those in the other view. Subsequently, they solve the standard spectral clustering problem on this bipartite graph. In [21], a co-training based framework is proposed where the similarity matrix of one view is constrained by the eigenvectors of the Laplacian in the other view. In [22], the information from multiple graphs is fused using Linked Matrix Factorization. Consensus clustering approaches can also be applied to the problem of multi-view clustering [7]. These approaches do not generally work with the original features. Instead, they take different clusterings of a dataset coming from different sources as input and reconcile them to find a final clustering.

5 Discussion
We proposed a multi-view clustering approach in the framework of spectral clustering. The approach uses the philosophy of co-regularization to make the clusterings in different views agree with each other. The co-regularization idea has been used in the past for semi-supervised learning problems. To the best of our knowledge, this is the first work to apply the idea to the problem of unsupervised learning, in particular to spectral clustering. The co-regularized spectral clustering has a joint optimization function for the spectral embeddings of all the views. An alternating maximization framework reduces the problem to the standard spectral clustering objective, which is efficiently solvable using state-of-the-art eigensolvers.

It is possible to extend the proposed framework to the case where some of the views have missing data.
For missing data points, the corresponding entries in the similarity matrices would be unavailable. We can estimate these missing similarities from the corresponding similarities in the other views. One possible approach to estimating a missing entry is to simply average the similarities from the views in which the data point is available. Proper normalization of the similarities (possibly by the Frobenius norm of the whole matrix) might be needed before averaging to make them comparable. Other methods for estimating missing kernel entries can also be used. It is also possible to assign weights to different views in the proposed objective function, as done in [20], if we have some a priori knowledge about the informativeness of the views.

Our co-regularization based framework can also be applied to other unsupervised problems such as spectral methods for dimensionality reduction. For example, the Kernel PCA algorithm [23] can be extended to work with multiple views by defining a separate Kernel PCA objective function for each view and adding a regularizer which enforces the embeddings to look similar across all views (e.g., by enforcing the similarity matrices defined on the embeddings of each view to be close to each other). Theoretical analysis of the proposed approach can also be pursued as a separate line of work. There has been very little prior work analyzing spectral clustering methods. For instance, there has been some work on the consistency analysis of single-view spectral clustering [24], which provides results about the rate of convergence as the sample size increases, using tools from the theory of linear operators and empirical processes. Similar convergence properties could be studied for multi-view spectral clustering. We can expect the convergence to be faster for the multi-view case.
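The missing-similarity estimate sketched in the discussion above (averaging Frobenius-normalized similarities from the views in which a point is observed) can be illustrated as follows. This is a hedged sketch under our own conventions, not an implementation from the paper: the function name, the per-view boolean masks, and the restriction of the norm to the observed block are all illustrative assumptions.

```python
import numpy as np

def impute_missing_similarities(kernels, masks):
    """Fill unavailable entries of each view's similarity matrix by
    averaging Frobenius-normalized similarities from the other views
    in which both data points are observed (illustrative sketch).

    kernels: list of (n, n) similarity matrices, one per view; entries
             involving a missing point may hold arbitrary values.
    masks:   list of length-n boolean arrays; masks[v][i] is True when
             point i is observed in view v.
    """
    n_views, n = len(kernels), kernels[0].shape[0]
    # Frobenius norm of each view, computed over its observed block only,
    # so that similarities from different views are on a comparable scale.
    norms = [np.linalg.norm(K[np.outer(m, m)]) for K, m in zip(kernels, masks)]
    normed = [K / nrm for K, nrm in zip(kernels, norms)]
    imputed = [np.array(K, dtype=float) for K in kernels]
    for v in range(n_views):
        for i in range(n):
            for j in range(n):
                if masks[v][i] and masks[v][j]:
                    continue  # entry is observed in this view; keep it
                # Average over the other views where both points are
                # present, then rescale back to this view's norm.
                vals = [normed[u][i, j] for u in range(n_views)
                        if u != v and masks[u][i] and masks[u][j]]
                if vals:
                    imputed[v][i, j] = norms[v] * float(np.mean(vals))
    return imputed
```

If a point is missing in every view, its entries are simply left untouched; in a full system one would fall back to one of the other kernel completion methods mentioned above.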
Co-regularization reduces the size of the hypothesis space, and hence fewer examples should be needed to converge to a solution.

References

[1] A. Blum and T. Mitchell. Combining labeled and unlabeled data with co-training. In Conference on Learning Theory, 1998.

[2] Kamalika Chaudhuri, Sham M. Kakade, Karen Livescu, and Karthik Sridharan. Multi-view clustering via Canonical Correlation Analysis. In International Conference on Machine Learning, 2009.

[3] Ulrike von Luxburg. A tutorial on spectral clustering. Statistics and Computing, 2007.

[4] J. Shi and J. Malik. Normalized cuts and image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22:888–905, 2000.

[5] A. Ng, M. Jordan, and Y. Weiss. On spectral clustering: analysis and an algorithm. In Advances in Neural Information Processing Systems, 2002.

[6] Vikas Sindhwani, Partha Niyogi, and Mikhail Belkin. A co-regularization approach to semi-supervised learning with multiple views. In Proceedings of the Workshop on Learning with Multiple Views, International Conference on Machine Learning, 2005.

[7] Alexander Strehl and Joydeep Ghosh. Cluster ensembles - a knowledge reuse framework for combining multiple partitions. Journal of Machine Learning Research, 3:583–617, 2002.

[8] Donglin Niu, Jennifer G. Dy, and Michael I. Jordan. Multiple non-redundant spectral clustering views. In International Conference on Machine Learning, 2010.

[9] Corinna Cortes, Mehryar Mohri, and Afshin Rostamizadeh. Learning non-linear combinations of kernels. In Advances in Neural Information Processing Systems, 2009.

[10] Matthew B. Blaschko and Christoph H. Lampert. Correlational spectral clustering. In Computer Vision and Pattern Recognition, 2008.

[11] Virginia R. de Sa. Spectral clustering with two views.
In Proceedings of the Workshop on Learning with Multiple Views, International Conference on Machine Learning, 2005.

[12] Xing Yi, Yunpeng Xu, and Changshui Zhang. Multi-view EM algorithm for finite mixture models. In ICAPR, Lecture Notes in Computer Science, Springer-Verlag, 2005.

[13] Massih-Reza Amini, Nicolas Usunier, and Cyril Goutte. Learning from multiple partially observed views - an application to multilingual text categorization. In Advances in Neural Information Processing Systems, 2009.

[14] D. D. Lewis, Y. Yang, T. Rose, and F. Li. RCV1: A new benchmark collection for text categorization research. Journal of Machine Learning Research, 5:361–397, 2004.

[15] Reuters Corpus, Volume 2, Multilingual Corpus, 1996-08-20 to 1997-08-19, 2005.

[16] Thomas Hofmann. Probabilistic latent semantic analysis. In Uncertainty in Artificial Intelligence, pages 289–296, 1999.

[17] David M. Blei, Andrew Y. Ng, and Michael I. Jordan. Latent Dirichlet Allocation. Journal of Machine Learning Research, 3:993–1022, 2003.

[18] The UCSD Multiple Kernel Learning Repository. http://mkl.ucsd.edu.

[19] Steffen Bickel and Tobias Scheffer. Multi-view clustering. In IEEE International Conference on Data Mining, 2004.

[20] Dengyong Zhou and Christopher J. C. Burges. Spectral clustering and transductive learning with multiple views. In International Conference on Machine Learning, 2007.

[21] Abhishek Kumar and Hal Daumé III. A co-training approach for multi-view spectral clustering. In International Conference on Machine Learning, 2011.

[22] Wei Tang, Zhengdong Lu, and Inderjit S. Dhillon. Clustering with multiple graphs. In IEEE International Conference on Data Mining, 2009.

[23] Y. Bengio, P. Vincent, and J.-F. Paiement. Spectral clustering and kernel PCA are learning eigenfunctions. Technical Report 2003s-19, CIRANO, 2003.

[24] Ulrike von Luxburg, Mikhail Belkin, and Olivier Bousquet.
Consistency of spectral clustering. Annals of Statistics, 36(2):555–586, 2008.