{"title": "Localized Data Fusion for Kernel k-Means Clustering with Application to Cancer Biology", "book": "Advances in Neural Information Processing Systems", "page_first": 1305, "page_last": 1313, "abstract": "In many modern applications from, for example, bioinformatics and computer vision, samples have multiple feature representations coming from different data sources. Multiview learning algorithms try to exploit all of this available information to obtain a better learner in such scenarios. In this paper, we propose a novel multiple kernel learning algorithm that extends kernel k-means clustering to the multiview setting, which combines kernels calculated on the views in a localized way to better capture sample-specific characteristics of the data. We demonstrate the better performance of our localized data fusion approach on a human colon and rectal cancer data set by clustering patients. Our method finds more relevant prognostic patient groups than global data fusion methods when we evaluate the results with respect to three commonly used clinical biomarkers.", "full_text": "Localized Data Fusion for Kernel k-Means Clustering with Application to Cancer Biology\n\nMehmet G\u00f6nen\n\ngonen@ohsu.edu\n\nDepartment of Biomedical Engineering\nOregon Health & Science University\n\nPortland, OR 97239, USA\n\nAdam A. Margolin\n\nmargolin@ohsu.edu\n\nDepartment of Biomedical Engineering\nOregon Health & Science University\n\nPortland, OR 97239, USA\n\nAbstract\n\nIn many modern applications from, for example, bioinformatics and computer vision,\nsamples have multiple feature representations coming from different data\nsources. Multiview learning algorithms try to exploit all of this available information\nto obtain a better learner in such scenarios. 
In this paper, we propose a novel\nmultiple kernel learning algorithm that extends kernel k-means clustering to the\nmultiview setting, which combines kernels calculated on the views in a localized\nway to better capture sample-specific characteristics of the data. We demonstrate\nthe better performance of our localized data fusion approach on a human colon\nand rectal cancer data set by clustering patients. Our method finds more relevant\nprognostic patient groups than global data fusion methods when we evaluate the\nresults with respect to three commonly used clinical biomarkers.\n\n1 Introduction\n\nClustering algorithms aim to find a meaningful grouping of the samples at hand in an unsupervised\nmanner for exploratory data analysis. k-means clustering is one of the classical algorithms (Hartigan,\n1975), which uses k prototype vectors (i.e., centers or centroids of k clusters) to characterize\nthe data and minimizes a sum-of-squares cost function to find these prototypes with a coordinate\ndescent optimization method. However, the final cluster structure heavily depends on the initialization\nbecause the optimization scheme of k-means clustering is prone to local minima. Fortunately,\nthe sum-of-squares minimization can be formulated as a trace maximization problem, which cannot\nbe solved easily due to the binary decision variables used to denote cluster memberships, but this\nhard optimization problem can be reduced to an eigenvalue decomposition problem by relaxing the\nconstraints (Zha et al., 2001; Ding and He, 2004). In such a case, the overall clustering algorithm can be\nformulated in two steps: (i) performing principal component analysis (PCA) (Pearson, 1901) on the\ncovariance matrix and (ii) recovering the cluster membership matrix using the k eigenvectors that\ncorrespond to the k largest eigenvalues. 
Similar to many other learning algorithms, k-means clustering\nhas also been extended to a nonlinear version with the help of kernel functions, which is called kernel\nk-means clustering (Girolami, 2002). The kernelized variant can also be optimized with a spectral\nrelaxation approach using kernel PCA (KPCA) (Sch\u00f6lkopf et al., 1998) instead of canonical PCA.\nIn many modern applications, samples have multiple feature representations (i.e., views) coming\nfrom different data sources. Instead of using only one of the views, it is better to use all available\ninformation and let the learning algorithm decide how to combine these data sources, which is known\nas multiview learning. There are three main categories for the combination strategy (Noble, 2004):\n(i) combination at the feature level by concatenating the views (i.e., early integration), (ii) combination\nat the decision level by concatenating the outputs of learners trained on each view separately\n(i.e., late integration), and (iii) combination at the learning level by trying to find a unified distance,\nkernel, or latent matrix using all views simultaneously (i.e., intermediate integration).\n\n1.1 Related work\nWhen we have multiple views for clustering, we can simply concatenate the views and train a standard\nclustering algorithm on the concatenated view, which is known as early integration. However,\nthis approach does not assign weights to the views, and the view with the highest number of features\nmight dominate the clustering step due to the unsupervised nature of the problem.\nLate integration algorithms obtain a clustering on each view separately and combine these clustering\nresults using an ensemble learning scheme. Such clustering algorithms are also known as cluster\nensembles (Strehl and Ghosh, 2002). 
However, they do not exploit the dependencies between the\nviews during clustering, and these dependencies might already be lost if we combine only clustering\nresults in the second step.\nIntermediate integration algorithms combine the views in a single learning scheme to collectively\n\ufb01nd a uni\ufb01ed clustering. Chaudhuri et al. (2009) propose to extract a unifying feature representation\nfrom the views by performing canonical correlation analysis (CCA) (Hotelling, 1936) and to train\na clustering algorithm on this common representation. Similarly, Blaschko and Lampert (2008) ex-\ntract a common feature representation but with a nonlinear projection step using kernel CCA (Lai\nand Fyfe, 2000) and then perform clustering. Such CCA-based algorithms assume that all views are\ninformative, and if there are some noisy views, this can degrade the clustering performance dras-\ntically. Lange and Buhmann (2006) propose to optimize the weights of a convex combination of\nview-speci\ufb01c similarity measures within a nonnegative matrix factorization framework and to as-\nsign samples to clusters using the latent matrices obtained in the factorization step. Valizadegan and\nJin (2007) extend the maximum margin clustering formulation of Xu et al. (2004) to perform ker-\nnel combination and clustering jointly by formulating a semide\ufb01nite programming (SDP) problem.\nChen et al. (2007) further improve this idea by formulating a quadratically constrained quadratic\nprogramming problem instead of an SDP problem. Tang et al. (2009) convert the views into graphs\nby placing samples into vertices and creating edges using the similarity values between samples\nin each view, and then factorize these graphs jointly with a shared factor common to all graphs,\nwhich is used for clustering at the end. Kumar et al. 
(2011) propose a co-regularization strategy\nfor multiview spectral clustering by enforcing agreement between the similarity matrices calculated\non the latent representations obtained from the spectral decomposition of each view. Huang et al.\n(2012) formulate another multiview spectral clustering method that finds a weighted combination\nof the affinity matrices calculated on the views. Yu et al. (2012) develop a multiple kernel k-means\nclustering algorithm that optimizes the weights in a conic sum of kernels calculated on the views.\nHowever, their formulation uses the same kernel weights for all of the samples.\nMultiview clustering algorithms have attracted great interest in cancer biology due to the availability\nof multiple genomic characterizations of cancer patients. Yuan et al. (2011) formulate a patient-specific\ndata fusion algorithm that uses a nonparametric Bayesian model coupled with a Markov\nchain Monte Carlo inference scheme, which can combine only two views and is computationally\nvery demanding due to the high dimensionality of genomic data. Shen et al. (2012) and Mo et al.\n(2013) find a shared latent subspace across genomic views and cluster cancer patients using their\nrepresentations in this subspace. Wang et al. (2014) construct patient networks from patient\u2013patient\nsimilarity matrices calculated on the views, combine these into a single unified network using a\nnetwork fusion approach, and then perform clustering on the final patient network.\n\n1.2 Our contributions\nIntermediate integration using kernel matrices is also known as multiple kernel learning (MKL)\n(G\u00f6nen and Alpayd\u0131n, 2011). Most of the existing MKL algorithms use the same kernel weights\nfor all samples, which may not be a good idea due to sample-specific characteristics of the data or\nmeasurement noise present in some of the views. 
In this work, we study kernel k-means clustering\nunder the multiview setting and propose a novel MKL algorithm that combines kernels with\nsample-specific weights to obtain a better clustering. We demonstrate the better performance of our\nalgorithm on the human colon and rectal cancer data set provided by the TCGA consortium (The Cancer\nGenome Atlas Network, 2012), where we use three genomic characterizations of the patients (i.e.,\nDNA copy number, mRNA gene expression, and DNA methylation) for clustering. Our localized\ndata fusion approach obtains more relevant prognostic patient groups than global fusion approaches\nwhen we evaluate the results with respect to three commonly used clinical biomarkers (i.e., micro-satellite\ninstability, hypermutation, and mutation in BRAF gene) of colon and rectal cancer.\n\n2 Kernel k-means clustering\n\nWe first review kernel k-means clustering (Girolami, 2002) before extending it to the multiview\nsetting. Given n independent and identically distributed samples $\\{x_i \\in \\mathcal{X}\\}_{i=1}^{n}$, we assume that\nthere is a function \u03a6(\u00b7) that maps the samples into a feature space, in which we try to minimize a\nsum-of-squares cost function over the cluster assignment variables $\\{z_{ic}\\}_{i=1,c=1}^{n,k}$. 
The optimization\nproblem (OPT1) defines kernel k-means clustering as a binary integer programming problem, where\n$n_c$ is the number of samples assigned to cluster c, and $\\mu_c$ is the centroid of cluster c.\n\n\\[\n\\begin{aligned}\n\\text{minimize} \\quad & \\sum_{i=1}^{n} \\sum_{c=1}^{k} z_{ic} \\|\\Phi(x_i) - \\mu_c\\|_2^2 \\\\\n\\text{with respect to} \\quad & z_{ic} \\in \\{0, 1\\} \\;\\; \\forall (i, c) \\\\\n\\text{subject to} \\quad & \\sum_{c=1}^{k} z_{ic} = 1 \\;\\; \\forall i \\\\\n\\text{where} \\quad & n_c = \\sum_{i=1}^{n} z_{ic} \\;\\; \\forall c, \\quad \\mu_c = \\frac{1}{n_c} \\sum_{i=1}^{n} z_{ic} \\Phi(x_i) \\;\\; \\forall c\n\\end{aligned}\n\\tag{OPT1}\n\\]\n\nWe can convert this optimization problem into an equivalent matrix-vector form problem as follows:\n\n\\[\n\\begin{aligned}\n\\text{minimize} \\quad & \\operatorname{tr}((\\Phi - M)^{\\top} (\\Phi - M)) \\\\\n\\text{with respect to} \\quad & Z \\in \\{0, 1\\}^{n \\times k} \\\\\n\\text{subject to} \\quad & Z 1_k = 1_n \\\\\n\\text{where} \\quad & \\Phi = [\\Phi(x_1) \\;\\; \\Phi(x_2) \\;\\; \\cdots \\;\\; \\Phi(x_n)], \\quad M = \\Phi Z L Z^{\\top}, \\quad L = \\operatorname{diag}(n_1^{-1}, n_2^{-1}, \\ldots, n_k^{-1})\n\\end{aligned}\n\\tag{OPT2}\n\\]\n\nUsing that $\\Phi^{\\top} \\Phi = K$, $\\operatorname{tr}(AB) = \\operatorname{tr}(BA)$, and $Z^{\\top} Z = L^{-1}$, the objective function of the\noptimization problem (OPT2) can be rewritten as\n\n\\[\n\\begin{aligned}\n\\operatorname{tr}((\\Phi - M)^{\\top} (\\Phi - M)) &= \\operatorname{tr}((\\Phi - \\Phi Z L Z^{\\top})^{\\top} (\\Phi - \\Phi Z L Z^{\\top})) \\\\\n&= \\operatorname{tr}(\\Phi^{\\top} \\Phi - 2 \\Phi^{\\top} \\Phi Z L Z^{\\top} + Z L Z^{\\top} \\Phi^{\\top} \\Phi Z L Z^{\\top}) \\\\\n&= \\operatorname{tr}(K - 2 K Z L Z^{\\top} + K Z L Z^{\\top} Z L Z^{\\top}) = \\operatorname{tr}(K - L^{1/2} Z^{\\top} K Z L^{1/2}),\n\\end{aligned}\n\\]\n\nwhere K is the kernel matrix that holds the similarity values between the samples, and $L^{1/2}$ is defined\nby taking the square root of the diagonal elements of L. 
The resulting optimization problem (OPT3) is a\ntrace maximization problem, but it is still very difficult to solve due to the binary decision variables.\n\n\\[\n\\begin{aligned}\n\\text{maximize} \\quad & \\operatorname{tr}(L^{1/2} Z^{\\top} K Z L^{1/2} - K) \\\\\n\\text{with respect to} \\quad & Z \\in \\{0, 1\\}^{n \\times k} \\\\\n\\text{subject to} \\quad & Z 1_k = 1_n\n\\end{aligned}\n\\tag{OPT3}\n\\]\n\nHowever, we can formulate a relaxed version of this optimization problem by renaming $Z L^{1/2}$ as H\nand letting H take arbitrary real values subject to orthogonality constraints.\n\n\\[\n\\begin{aligned}\n\\text{maximize} \\quad & \\operatorname{tr}(H^{\\top} K H - K) \\\\\n\\text{with respect to} \\quad & H \\in \\mathbb{R}^{n \\times k} \\\\\n\\text{subject to} \\quad & H^{\\top} H = I_k\n\\end{aligned}\n\\tag{OPT4}\n\\]\n\nThe final optimization problem (OPT4) can be solved by performing KPCA on the kernel matrix\nK and setting H to the k eigenvectors that correspond to the k largest eigenvalues (Sch\u00f6lkopf et al.,\n1998). We can finally extract a clustering solution by first normalizing all rows of H to be on the\nunit sphere and then performing k-means clustering on this normalized matrix. Note that, after the\nnormalization step, H contains k-dimensional representations of the samples on the unit sphere, and\nk-means is not very sensitive to initialization in such a case.\n\n3 Multiple kernel k-means clustering\n\nIn a multiview learning scenario, we have multiple feature representations, where we assume that\neach representation has its own mapping function, i.e., $\\{\\Phi_m(\\cdot)\\}_{m=1}^{p}$. Instead of an unweighted\ncombination of these views (i.e., simple concatenation), we can obtain a weighted mapping function\nby concatenating views using a convex sum (i.e., nonnegative weights that sum up to 1). 
This\ncorresponds to replacing $\\Phi(x_i)$ with\n\n\\[\n\\Phi_{\\theta}(x_i) = [\\theta_1 \\Phi_1(x_i)^{\\top} \\;\\; \\theta_2 \\Phi_2(x_i)^{\\top} \\;\\; \\cdots \\;\\; \\theta_p \\Phi_p(x_i)^{\\top}]^{\\top},\n\\]\n\nwhere $\\theta \\in \\mathbb{R}_{+}^{p}$ is the vector of kernel weights that we need to optimize during training. The kernel\nfunction defined over the weighted mapping function becomes\n\n\\[\nk_{\\theta}(x_i, x_j) = \\langle \\Phi_{\\theta}(x_i), \\Phi_{\\theta}(x_j) \\rangle = \\sum_{m=1}^{p} \\langle \\theta_m \\Phi_m(x_i), \\theta_m \\Phi_m(x_j) \\rangle = \\sum_{m=1}^{p} \\theta_m^2 k_m(x_i, x_j),\n\\]\n\nwhere we combine kernel functions using a conic sum (i.e., nonnegative weights), which guarantees\na positive semi-definite kernel function at the end. The optimization problem (OPT5) gives\nthe trace maximization problem we need to solve.\n\n\\[\n\\begin{aligned}\n\\text{maximize} \\quad & \\operatorname{tr}(H^{\\top} K_{\\theta} H - K_{\\theta}) \\\\\n\\text{with respect to} \\quad & H \\in \\mathbb{R}^{n \\times k}, \\;\\; \\theta \\in \\mathbb{R}_{+}^{p} \\\\\n\\text{subject to} \\quad & H^{\\top} H = I_k, \\;\\; \\theta^{\\top} 1_p = 1 \\\\\n\\text{where} \\quad & K_{\\theta} = \\sum_{m=1}^{p} \\theta_m^2 K_m\n\\end{aligned}\n\\tag{OPT5}\n\\]\n\nWe solve this problem using a two-step alternating optimization strategy: (i) Optimize H given \u03b8.\nIf we know the kernel weights (or initialize randomly in the first iteration), solving (OPT5) reduces\nto solving (OPT4) with the combined kernel matrix $K_{\\theta}$, which requires performing KPCA on $K_{\\theta}$.\n(ii) Optimize \u03b8 given H. If we know the eigenvectors from the first step, solving (OPT5) reduces to\nsolving (OPT6), which is a convex quadratic programming (QP) problem with p decision variables\nand one equality constraint, and is solvable with any standard QP solver up to a moderate number\nof kernels.\n\n\\[\n\\begin{aligned}\n\\text{minimize} \\quad & \\sum_{m=1}^{p} \\theta_m^2 \\operatorname{tr}(K_m - H^{\\top} K_m H) \\\\\n\\text{with respect to} \\quad & \\theta \\in \\mathbb{R}_{+}^{p} \\\\\n\\text{subject to} \\quad & \\theta^{\\top} 1_p = 1\n\\end{aligned}\n\\tag{OPT6}\n\\]\n\nNote that using a convex combination of kernels in (OPT5) is not a viable option because if we set\n$K_{\\theta}$ to $\\sum_{m=1}^{p} \\theta_m K_m$, there would be a trivial solution to the trace maximization problem with a\nsingle active kernel and others with zero weights, which is also observed by Yu et al. 
(2012).\n\n4 Localized multiple kernel k-means clustering\n\nInstead of using the same kernel weights for all samples, we propose to use a localized data fusion\napproach by assigning sample-specific weights to kernels, which enables us to capture sample-specific\ncharacteristics of the data and to get rid of sample-specific noise that may be present in\nsome of the views. In our localized combination approach, the mapping function is represented as\n\n\\[\n\\Phi_{\\Theta}(x_i) = [\\theta_{i1} \\Phi_1(x_i)^{\\top} \\;\\; \\theta_{i2} \\Phi_2(x_i)^{\\top} \\;\\; \\cdots \\;\\; \\theta_{ip} \\Phi_p(x_i)^{\\top}]^{\\top},\n\\]\n\nwhere $\\Theta \\in \\mathbb{R}_{+}^{n \\times p}$ is the matrix of\nsample-specific kernel weights, which are nonnegative and sum up to 1 for each sample (G\u00f6nen and\nAlpayd\u0131n, 2013). The locally combined kernel function can be written as\n\n\\[\nk_{\\Theta}(x_i, x_j) = \\langle \\Phi_{\\Theta}(x_i), \\Phi_{\\Theta}(x_j) \\rangle = \\sum_{m=1}^{p} \\langle \\theta_{im} \\Phi_m(x_i), \\theta_{jm} \\Phi_m(x_j) \\rangle = \\sum_{m=1}^{p} \\theta_{im} \\theta_{jm} k_m(x_i, x_j),\n\\]\n\nwhere we are guaranteed to have a positive semi-definite kernel function. The optimization problem\n(OPT7) gives the trace maximization problem with the locally combined kernel matrix, where $\\theta_m \\in \\mathbb{R}_{+}^{n}$\nis the vector of kernel weights assigned to kernel m, and $\\circ$ denotes the Hadamard product.\n\n\\[\n\\begin{aligned}\n\\text{maximize} \\quad & \\operatorname{tr}(H^{\\top} K_{\\Theta} H - K_{\\Theta}) \\\\\n\\text{with respect to} \\quad & H \\in \\mathbb{R}^{n \\times k}, \\;\\; \\Theta \\in \\mathbb{R}_{+}^{n \\times p} \\\\\n\\text{subject to} \\quad & H^{\\top} H = I_k, \\;\\; \\Theta 1_p = 1_n \\\\\n\\text{where} \\quad & K_{\\Theta} = \\sum_{m=1}^{p} (\\theta_m \\theta_m^{\\top}) \\circ K_m\n\\end{aligned}\n\\tag{OPT7}\n\\]\n\nWe solve this problem using a two-step alternating optimization strategy: (i) Optimize H given \u0398.\nIf we know the sample-specific kernel weights (or initialize randomly in the first iteration), solving\n(OPT7) reduces to solving (OPT4) with the combined kernel matrix $K_{\\Theta}$, which requires performing\nKPCA on $K_{\\Theta}$. (ii) Optimize \u0398 given H. 
If we know the eigenvectors from the first step, using that\n$\\operatorname{tr}(A^{\\top} ((c c^{\\top}) \\circ B) A) = c^{\\top} ((A A^{\\top}) \\circ B) c$, solving (OPT7) reduces to solving (OPT8), which is\na convex QP problem with n \u00d7 p decision variables and n equality constraints.\n\n\\[\n\\begin{aligned}\n\\text{minimize} \\quad & \\sum_{m=1}^{p} \\theta_m^{\\top} ((I_n - H H^{\\top}) \\circ K_m) \\theta_m \\\\\n\\text{with respect to} \\quad & \\Theta \\in \\mathbb{R}_{+}^{n \\times p} \\\\\n\\text{subject to} \\quad & \\Theta 1_p = 1_n\n\\end{aligned}\n\\tag{OPT8}\n\\]\n\nTraining the localized combination approach requires more computational effort than training the\nglobal approach due to the increased size of the QP problem in the second step. However, the block-diagonal\nstructure of the Hessian matrix in (OPT8) can be exploited to solve this problem much\nmore efficiently. Note that the objective function of (OPT8) can be written as\n\n\\[\n\\begin{bmatrix} \\theta_1 \\\\ \\theta_2 \\\\ \\vdots \\\\ \\theta_p \\end{bmatrix}^{\\top}\n\\begin{bmatrix}\n(I_n - H H^{\\top}) \\circ K_1 & 0_{n \\times n} & \\cdots & 0_{n \\times n} \\\\\n0_{n \\times n} & (I_n - H H^{\\top}) \\circ K_2 & \\cdots & 0_{n \\times n} \\\\\n\\vdots & \\vdots & \\ddots & \\vdots \\\\\n0_{n \\times n} & 0_{n \\times n} & \\cdots & (I_n - H H^{\\top}) \\circ K_p\n\\end{bmatrix}\n\\begin{bmatrix} \\theta_1 \\\\ \\theta_2 \\\\ \\vdots \\\\ \\theta_p \\end{bmatrix},\n\\]\n\nwhere we have an n \u00d7 n matrix for each kernel on the diagonal of the Hessian matrix.\n\n5 Experiments\n\nClustering patients is one of the clinically important applications in cancer biology because it helps\nto identify prognostic cancer subtypes and to develop personalized strategies to guide therapy. Making\nuse of multiple genomic characterizations in clustering is critical because different patients may\nmanifest their disease in different genomic platforms due to cancer heterogeneity and measurement\nnoise. 
We use the human colon and rectal cancer data set provided by TCGA consortium (The Can-\ncer Genome Atlas Network, 2012), which contains several genomic characterizations of the patients,\nto test our new clustering algorithm in a challenging real-world application.\nWe use DNA copy number, mRNA gene expression, and DNA methylation data of the patients\nfor clustering.\nIn order to evaluate the clustering results, we use three commonly used clinical\nbiomarkers of colon and rectal cancer (The Cancer Genome Atlas Network, 2012): (i) micro-satellite\ninstability (i.e., a hypermutable phenotype caused by the loss of DNA mismatch repair activity)\n(ii) hypermutation (de\ufb01ned as having mutations in more than or equal to 300 genes), and (iii) mu-\ntation in BRAF gene. Note that these three biomarkers are not directly identi\ufb01able from the input\ndata sources used. The preprocessed genomic characterizations of the patients can be downloaded\nfrom a public repository at https://www.synapse.org/#!Synapse:syn300013, where\nDNA copy number, mRNA gene expression, DNA methylation, and mutation data consist of 20313,\n20530, 24980, and 14581 features, respectively. The micro-satellite instability data can be down-\nloaded from https://tcga-data.nci.nih.gov/tcga/dataAccessMatrix.htm. In\nthe resulting data set, there are 204 patients with available genomic and clinical biomarker data.\nWe implement kernel k-means clustering and its multiview variants in Matlab. Our implementations\nare publicly available at https://github.com/mehmetgonen/lmkkmeans. We solve the\nQP problems of the multiview variants using the Mosek optimization software (Mosek, 2014). 
For\nall methods, we perform 10 replications of k-means with different initializations as the last step and\nuse the solution with the lowest sum-of-squares cost to decide cluster memberships.\nWe calculate four different kernels to use in our experiments: (i) KC: the Gaussian kernel on DNA\ncopy number data, (ii) KG: the Gaussian kernel on mRNA gene expression data, (iii) KM: the\nGaussian kernel on DNA methylation data, and (iv) KCGM: the Gaussian kernel on concatenated\ndata (i.e., early combination). Before calculating each kernel, the input data is normalized to have\nzero mean and unit standard deviation (i.e., z-normalization for each feature). For each kernel, we\nset the kernel width parameter to the square root of the number of features in its corresponding view.\nWe compare seven clustering algorithms on this colon and rectal cancer data set: (i) kernel k-means\nclustering with KC, (ii) kernel k-means clustering with KG, (iii) kernel k-means clustering with KM,\n(iv) kernel k-means clustering with KCGM, (v) kernel k-means clustering with (KC + KG + KM) / 3,\n(vi) multiple kernel k-means clustering with (KC, KG, KM), and (vii) localized multiple kernel k-means\nclustering with (KC, KG, KM). The first three algorithms are single-view clustering methods\nthat work on a single genomic characterization. The fourth algorithm is the early integration approach\nthat combines the views at the feature level. The fifth and sixth algorithms are intermediate\nintegration approaches that combine the kernels using unweighted and weighted sums, respectively,\nwhere the latter is very similar to the formulations of Huang et al. (2012) and Yu et al. (2012). 
The\nlast algorithm is our localized MKL approach that combines the kernels in a sample-speci\ufb01c way.\nWe assign three different binary labels to each sample as the ground truth using the three clinical\nbiomarkers mentioned and evaluate the clustering results using three different performance metrics:\n(i) normalized mutual information (NMI), (ii) purity, and (iii) the Rand index (RI). We set the number\nof clusters to 2 for all of the algorithms because each ground truth label has only two categories.\nWe \ufb01rst show the kernel weights assigned to 204\ncolon and rectal cancer patients by our localized\ndata fusion approach. As we can see from Fig-\nure 1, some of the patients are very well charac-\nterized by their DNA copy number data. Our lo-\ncalized algorithm assigns weights larger than 0.5\nto DNA copy number data for most of the patients\nin the second cluster, whereas all three views are\nused with comparable weights for the remaining\npatients. Note that the kernel weights of each pa-\ntient are strictly nonnegative and sum up to 1 (i.e.,\nde\ufb01ned on the unit simplex). Our proposed clus-\ntering algorithm can identify the most informa-\ntive genomic platforms in an unsupervised and\npatient-speci\ufb01c manner. Together with the bet-\nter clustering performance and biological inter-\npretation presented next, this particular applica-\ntion from cancer biology shows the potential for\nlocalized combination strategy.\nFigure 2 summarizes the results obtained by seven clustering algorithms on the colon and rectal can-\ncer data set. For each algorithm, the cluster assignment and the values of three clinical biomarkers\nare aligned to each other, and we report the performance values of nine biomarker\u2013metric pairs. 
We\nsee that DNA copy number (i.e., KC) is the most informative genomic characterization when we\ncompare the performance of single-view clustering algorithms, where it obtains better results than\nmRNA gene expression (i.e., KG) and DNA methylation (i.e., KM) in terms of NMI and RI on all\nbiomarkers. We also see that the early integration strategy (i.e., KCGM) does not improve the results\nbecause mRNA gene expression and DNA methylation dominate the clustering step due to the\nunsupervised nature of the problem. However, when we combine the kernels using an unweighted\ncombination strategy, i.e., (KC + KG + KM) / 3, the performance values are significantly improved\ncompared to single-view clustering methods and early integration in terms of NMI and RI on all\nbiomarkers. Instead of using an unweighted sum, we can optimize the combination weights using\nthe multiple kernel k-means clustering of Section 3. In this case, the performance values are slightly\nimproved compared to the unweighted sum in terms of NMI and RI on all biomarkers. Our localized\ndata fusion approach significantly outperforms the other algorithms in terms of NMI and RI on\n\u201cmicro-satellite instability\u201d and \u201chypermutation\u201d biomarkers, and it is the only algorithm that can\nobtain purity values higher than the ratio of the majority class samples on these two biomarkers,\nwhereas all algorithms obtain the same purity value on \u201cmutation in BRAF gene\u201d\nbiomarker. These results validate the benefit of our localized approach for the multiview setting.\n\nFigure 1: Kernel weights assigned to patients\nby our localized data fusion approach. Each dot
Each dot\ndenotes a single cancer patient, and patients in\nthe same cluster are drawn with the same color.\n\n6\n\n1.00.80.60.40.21.00.80.60.40.21.00.80.60.40.2llllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllGene expressionCopy numberMethylationClusterll12\fFigure 2: Results obtained by seven clustering algorithms on the colon and rectal cancer data set\nprovided by TCGA consortium (The Cancer Genome Atlas Network, 2012). For each algorithm, we\n\ufb01rst display the cluster assignment and report the number of patients in each cluster. We then display\nthe values of three clinical biomarkers aligned with the cluster assignment, where \u201cMSI high\u201d shows\nthe patients with high micro-satellite instability status in darker color, \u201cHypermutation\u201d shows the\npatients with mutations in more than or equal to 300 genes in darker color, and \u201cBRAF mutation\u201d\nshows the patients with a mutation in their BRAF gene in darker color. We compare the algorithms\nin terms of their clustering performance on three clinical biomarkers under three metrics: normalized\nmutual information (NMI), purity, and the Rand index (RI). For all performance metrics, a higher\nvalue means better performance, and for each biomarker\u2013metric pair, the best result is reported in\nbold face. 
We see that our localized clustering algorithm obtains the best result for eight out of nine\nbiomarker\u2013metric pairs, whereas all algorithms have the same purity value for BRAF mutation.\n\n[Figure 2 image omitted. The values recovered from the figure are listed below for each algorithm: cluster sizes, followed by NMI, purity, and RI, each in the order MSI high, Hypermutation, BRAF mutation.]\n\nKernel k-means clustering with KC: clusters of 102 and 102 patients; NMI 0.1466, 0.1418, 0.0459; Purity 0.8676, 0.8480, 0.8971; RI 0.5376, 0.5426, 0.5156\nKernel k-means clustering with KG: clusters of 117 and 87 patients; NMI 0.0504, 0.0514, 0.0174; Purity 0.8676, 0.8480, 0.8971; RI 0.5082, 0.5091, 0.5082\nKernel k-means clustering with KM: clusters of 83 and 121 patients; NMI 0.0008, 0.0049, 0.0026; Purity 0.8676, 0.8480, 0.8971; RI 0.5143, 0.5105, 0.5143\nKernel k-means clustering with KCGM: clusters of 87 and 117 patients; NMI 0.0019, 0.0127, 0.0041; Purity 0.8676, 0.8480, 0.8971; RI 0.5105, 0.5076, 0.5105\nKernel k-means clustering with (KC + KG + KM) / 3: clusters of 119 and 85 patients; NMI 0.2437, 0.2303, 0.0945; Purity 0.8676, 0.8480, 0.8971; RI 0.6009, 0.6096, 0.5568\nMultiple kernel k-means clustering with (KC, KG, KM): clusters of 122 and 82 patients; NMI 0.2557, 0.2431, 0.1013; Purity 0.8676, 0.8480, 0.8971; RI 0.6141, 0.6233, 0.5666\nLocalized multiple kernel k-means clustering with (KC, KG, KM): clusters of 158 and 46 patients; NMI 0.3954, 0.3788, 0.1481; Purity 0.8873, 0.8873, 0.8971; RI 0.8088, 0.8088, 0.7114\n\nFigure 3: Important features in genomic views determined using the solution of multiple kernel\nk-means clustering together with cluster assignment and mutations in frequently mutated genes.\nFor each genomic view, we calculate the Pearson correlation values between features and clustering\nassignment, and display the topmost 100 positively correlated and the bottommost 100 negatively correlated\nfeatures (red: high, blue: low). 
We also display the mutation status (black: mutated, white: wild-type)\nof patients for the 102 most frequently mutated genes, which are mutated in at least 16 patients.\n\nFigure 4: Important features in genomic views determined using the solution of localized multiple\nkernel k-means clustering together with cluster assignment and mutations in frequently mutated\ngenes. See Figure 3 for details.\n\nWe perform an additional biological interpretation step by looking at the features that can be used\nto differentiate the clusters found. Figures 3 and 4 show features in genomic views that are highly\n(positively or negatively) correlated with the cluster assignments of the two best performing algorithms\nin terms of clustering performance, namely, multiple kernel k-means clustering and localized\nmultiple kernel k-means clustering. We clearly see that the genomic signatures of the hyper-mutated\ncluster (especially the one for DNA copy number) obtained using our localized data fusion approach\nare much less noisy than those of global data fusion. Identifying clear genomic signatures is clinically\nimportant because they can be used for diagnostic and prognostic purposes on new patients.\n\n6 Discussion\n\nWe introduce a localized data fusion approach for kernel k-means clustering to better capture\nsample-specific characteristics of the data in the multiview setting, which cannot be captured using\nglobal data fusion strategies such as Huang et al. (2012) and Yu et al. (2012). The proposed method\nis from the family of MKL algorithms and combines the kernels defined on the views with sample-specific\nweights to determine the relative importance of the views for each sample. We illustrate the\npractical importance of the method on a human colon and rectal cancer data set by clustering patients\nusing their three different genomic characterizations. 
The results show that our localized data fusion\nstrategy can identify more relevant prognostic patient groups than global data fusion strategies.\nThe interesting topics for future research are: (i) exploiting the special structure of the Hessian\nmatrix in our formulation by developing a customized solver instead of using off-the-shelf optimization\nsoftware to improve the time complexity, and (ii) integrating prior knowledge about the\nsamples that we may have into our formulation to be able to find more relevant clusters.\nAcknowledgments. This study was financially supported by the Integrative Cancer Biology Program\n(grant no 1U54CA149237) and the Cancer Target Discovery and Development (CTDD) Network\n(grant no 1U01CA176303) of the National Cancer Institute.\n\n[Figures 3 and 4 images omitted; panel labels in each figure: Copy number, Gene expression, Methylation, Clusters, Mutation]\n\nReferences\n\nM. B. Blaschko and C. H. Lampert. Correlational spectral clustering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2008.\n\nK. Chaudhuri, S. M. Kakade, K. Livescu, and K. Sridharan. Multi-view clustering via canonical correlation analysis. In Proceedings of the 26th International Conference on Machine Learning, 2009.\n\nJ. Chen, Z. Zhao, J. Ye, and H. Liu. Nonlinear adaptive distance metric learning for clustering. In Proceedings of the 13th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2007.\n\nC. Ding and X. He. K-means clustering via principal component analysis. In Proceedings of the 21st International Conference on Machine Learning, 2004.\n\nM. Girolami. Mercer kernel-based clustering in feature space. IEEE Transactions on Neural Networks, 13(3):780\u2013784, 2002.\n\nM. G\u00f6nen and E. Alpayd\u0131n. Multiple kernel learning algorithms. Journal of Machine Learning Research, 12(Jul):2211\u20132268, 2011.\n\nM. G\u00f6nen and E. Alpayd\u0131n. 
Localized algorithms for multiple kernel learning. Pattern Recognition, 46(3):795\u2013807, 2013.\n\nJ. A. Hartigan. Clustering Algorithms. John Wiley & Sons, Inc., New York, NY, USA, 1975.\n\nH. Hotelling. Relations between two sets of variates. Biometrika, 28(3/4):321\u2013327, 1936.\n\nH.-C. Huang, Y.-Y. Chuang, and C.-S. Chen. Affinity aggregation for spectral clustering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2012.\n\nA. Kumar, P. Rai, and H. Daum\u00e9 III. Co-regularized multi-view spectral clustering. In Advances in Neural Information Processing Systems 24, 2011.\n\nP. L. Lai and C. Fyfe. Kernel and nonlinear canonical correlation analysis. International Journal of Neural Systems, 10(5):365\u2013377, 2000.\n\nT. Lange and J. M. Buhmann. Fusion of similarity data in clustering. In Advances in Neural Information Processing Systems 18, 2006.\n\nQ. Mo, S. Wang, V. E. Seshan, A. B. Olshen, N. Schultz, C. Sander, R. S. Powers, M. Ladanyi, and R. Shen. Pattern discovery and cancer gene identification in integrated cancer genomic data. Proceedings of the National Academy of Sciences of the United States of America, 110(11):4245\u20134250, 2013.\n\nMosek. The MOSEK Optimization Tools Manual Version 7.0 (Revision 134). MOSEK ApS, Denmark, 2014.\n\nW. S. Noble. Support vector machine applications in computational biology. In B. Sch\u00f6lkopf, K. Tsuda, and J.-P. Vert, editors, Kernel Methods in Computational Biology, chapter 3. The MIT Press, 2004.\n\nK. Pearson. On lines and planes of closest fit to systems of points in space. Philosophical Magazine, 2(11):559\u2013572, 1901.\n\nB. Sch\u00f6lkopf, A. Smola, and K.-R. M\u00fcller. Nonlinear component analysis as a kernel eigenvalue problem. Neural Computation, 10(5):1299\u20131319, 1998.\n\nR. Shen, Q. Mo, N. Schultz, V. E. Seshan, A. B. Olshen, J. Huse, M. Ladanyi, and C. Sander. Integrative subtype discovery in glioblastoma using iCluster. 
PLoS ONE, 7(4):e35236, 2012.\n\nA. Strehl and J. Ghosh. Cluster ensembles \u2013 A knowledge reuse framework for combining multiple partitions. Journal of Machine Learning Research, 3(Dec):583\u2013617, 2002.\n\nW. Tang, Z. Lu, and I. S. Dhillon. Clustering with multiple graphs. In Proceedings of the 9th IEEE International Conference on Data Mining, 2009.\n\nThe Cancer Genome Atlas Network. Comprehensive molecular characterization of human colon and rectal cancer. Nature, 487(7407):330\u2013337, 2012.\n\nH. Valizadegan and R. Jin. Generalized maximum margin clustering and unsupervised kernel learning. In Advances in Neural Information Processing Systems 19, 2007.\n\nB. Wang, A. M. Mezlini, F. Demir, M. Fiume, Z. Tu, M. Brudno, B. Haibe-Kains, and A. Goldenberg. Similarity network fusion for aggregating data types on a genomic scale. Nature Methods, 11(3):333\u2013337, 2014.\n\nL. Xu, J. Neufeld, B. Larson, and D. Schuurmans. Maximum margin clustering. In Advances in Neural Information Processing Systems 17, 2004.\n\nS. Yu, L.-C. Tranchevent, X. Liu, W. Gl\u00e4nzel, J. A. K. Suykens, B. De Moor, and Y. Moreau. Optimized data fusion for kernel k-means clustering. IEEE Transactions on Pattern Analysis and Machine Intelligence, 34(5):1031\u20131039, 2012.\n\nY. Yuan, R. S. Savage, and F. Markowetz. Patient-specific data fusion defines prognostic cancer subtypes. PLoS Computational Biology, 7(10):e1002227, 2011.\n\nH. Zha, X. He, C. Ding, H. Simon, and M. Gu. Spectral relaxation for K-means clustering. In Advances in Neural Information Processing Systems 14, 2001.\n", "award": [], "sourceid": 733, "authors": [{"given_name": "Mehmet", "family_name": "G\u00f6nen", "institution": "Oregon Health & Science University"}, {"given_name": "Adam", "family_name": "Margolin", "institution": "Oregon Health & Science University"}]}