{"title": "ICA-based Clustering of Genes from Microarray Expression Data", "book": "Advances in Neural Information Processing Systems", "page_first": 675, "page_last": 682, "abstract": "", "full_text": "ICA-Based Clustering of Genes from \n\nMicroarray Expression Data \n\n \n\nSu-In Lee* and Serafim Batzoglou\u00a7 \n*Department of Electrical Engineering \n\n\u00a7Department of Computer Science \n\nStanford University, Stanford, CA 94305 \n\nsilee@stanford.edu, serafim@cs.stanford.edu \n\nAbstract \n\nWe propose an unsupervised methodology using independent \ncomponent analysis (ICA) to cluster genes from DNA microarray \ndata. Based on an ICA mixture model of genomic expression \npatterns, linear and nonlinear ICA finds components that are specific \nto certain biological processes. Genes that exhibit significant \nup-regulation or down-regulation within each component are \ngrouped into clusters. We test the statistical significance of \nenrichment of gene annotations within each cluster. ICA-based \nclustering outperformed other leading methods in constructing \nfunctionally coherent clusters on various datasets. This result \nsupports our model of genomic expression data as composite effect \nof independent biological processes. Comparison of clustering \nperformance among various \nincluding a \nkernel-based nonlinear ICA algorithm shows that nonlinear ICA \nperformed \nthe best for small datasets and natural-gradient \nmaximization-likelihood worked well for all the datasets. \n\nICA algorithms \n\n1 Introduction \n\nMicroarray technology has enabled genome-wide expression profiling, promising to \nprovide insight into underlying biological mechanism involved in gene regulation. To \naid such discoveries, mathematical tools that are versatile enough to capture the \nunderlying biology and simple enough to be applied efficiently on large datasets are \nneeded. Analysis tools based on novel data mining techniques have been proposed \n[1]-[6]. 
When applying mathematical models and tools to microarray analysis, clustering genes that have similar biological properties is an important step for three reasons: reduction of data complexity, prediction of gene function, and evaluation of the analysis approach by measuring the statistical significance of the biological coherence of gene clusters. \n\nIndependent component analysis (ICA) linearly decomposes each of N vectors into M common component vectors (N\u2265M) so that each component is statistically as independent from the others as possible. One of the main applications of ICA is blind source separation (BSS), which aims to separate source signals from their mixtures. There have been a few attempts to apply ICA to microarray expression data to extract meaningful signals, each corresponding to an independent biological process [5]-[6]. In this paper, we provide the first evidence that ICA is a superior mathematical model and clustering tool for microarray analysis, compared to the most widely used methods, namely PCA and k-means clustering. We also introduce the application of nonlinear ICA to microarray analysis, and show that it outperforms linear ICA on some datasets. \n\nWe apply ICA to microarray data to decompose the input data into statistically independent components. Then, genes are clustered in an unsupervised fashion into non-mutually exclusive clusters. Each independent component is assigned a putative biological meaning based on functional annotations of genes that are predominant within the component. We systematically evaluate the clustering performance of several ICA algorithms on four expression datasets and show that ICA-based clustering is superior to other leading methods that have been applied to analyze the same datasets. We also propose a kernel-based nonlinear ICA algorithm for dealing with a more realistic mixture model. 
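The decomposition step just described can be sketched in a few lines. We use scikit-learn's FastICA purely as an illustrative stand-in (the paper's NMLE algorithm is not part of common libraries), on a synthetic expression-like matrix; all variable names and sizes here are hypothetical.

```python
# Sketch of the ICA decomposition of an expression matrix X = AS.
# FastICA is an illustrative stand-in for the NMLE algorithm used in
# the paper; the data below are synthetic placeholders.
import numpy as np
from sklearn.decomposition import FastICA

rng = np.random.RandomState(0)
N, M, K = 4, 4, 500            # experiments, components, genes
S = rng.laplace(size=(M, K))   # super-Gaussian "biological processes"
A = rng.randn(N, M)            # condition-specific activation levels
X = A @ S                      # observed N-by-K expression matrix

# Genes are treated as samples and experiments as features, so each
# recovered component carries one load per gene.
ica = FastICA(n_components=M, random_state=0, max_iter=1000)
S_est = ica.fit_transform(X.T)   # shape (K, M): per-gene loads
A_est = ica.mixing_              # shape (N, M): estimated mixing matrix

# The decomposition reconstructs X up to component order and scale.
X_rec = (S_est @ A_est.T + ica.mean_).T
print(np.allclose(X, X_rec, atol=1e-6))
```

Because the recovered components are determined only up to permutation and scaling, downstream steps work with relative loads within each component rather than absolute values.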
Among the ICA algorithms we tested, six linear and one nonlinear, the natural-gradient maximum-likelihood estimation method (NMLE) [7]-[8] performs well on all the datasets, and the kernel-based nonlinear ICA method worked better on three small datasets. \n\n2 Mathematical model of genome-wide expression \n\nSeveral distinct biological processes take place simultaneously inside a cell; each biological process has its own expression program to up-regulate or down-regulate the level of expression of specific sets of genes. We model a genome-wide expression pattern in a given condition (measured by a microarray assay) as a mixture of signals generated by statistically independent biological processes with different activation levels. We design two kinds of models for genomic expression patterns: a linear and a nonlinear mixture model. \n\nSuppose that a cell is governed by M independent biological processes S = (s1, \u2026, sM)T, each of which is a vector of K gene expression levels, and that we measure the levels of expression of all genes in N conditions, resulting in a microarray expression matrix X = (x1,\u2026,xN)T. The expression level at each different condition j can be expressed as a linear combination of the M biological processes: xj = aj1 s1 + \u2026 + ajM sM. We can express this idea concisely in matrix notation as follows: \n\nX = AS,  [x1, \u2026, xN]^T = [a11 \u2026 a1M; \u2026 ; aN1 \u2026 aNM] [s1, \u2026, sM]^T  (1) \n\nMore generally, we can express X = (x1,\u2026,xN)T as a post-nonlinear mixture of the underlying independent processes as follows, where f(.) is a nonlinear mapping from N-dimensional to N-dimensional space: 
X = f(AS),  [x1, \u2026, xN]^T = f([a11 \u2026 a1M; \u2026 ; aN1 \u2026 aNM] [s1, \u2026, sM]^T)  (2) \n\n3 Independent component analysis \n\nIn the models described above, since we assume that the underlying biological processes are independent, we suggest that the vectors S = (s1,\u2026,sM) are statistically independent, and so ICA can recover S from the observed microarray data X. For linear ICA, we apply the natural-gradient maximum-likelihood estimation (NMLE) method, which was proposed in [7] and was made more efficient by using the natural gradient in [8]. We also apply nonlinear ICA using reproducing kernel Hilbert spaces (RKHS) based on [9], as follows: \n\n1. We map the N-dimensional input data xi to \u03a6(xi) in the feature space by using the kernel trick. The feature space is defined by the relationship \u03a6(xi)^T \u03a6(xj) = k(xi, xj); that is, the inner product of mapped data is determined by a kernel function k(.,.) in the input space. We used a Gaussian radial basis function (RBF) kernel (k(x,y) = exp(-|x-y|^2)) and a polynomial kernel of degree 2 (k(x,y) = (x^T y + 1)^2). To perform the mapping, we found orthonormal bases of the feature space by randomly sampling L input data v = {v1,\u2026,vL} 1000 times and choosing the set minimizing the condition number of \u03a6v = (\u03a6(v1),\u2026,\u03a6(vL)). A set of orthonormal bases of the feature space is then determined by the selected L images of input data in v as \u039e = \u03a6v (\u03a6v^T \u03a6v)^{-1/2}. We map all input data x1,\u2026,xK, each corresponding to a gene, to \u03a8(x1),\u2026,\u03a8(xK) in the feature space with basis \u039e, as follows: \n\n\u03a8(xi) = (\u03a6v^T \u03a6v)^{-1/2} \u03a6v^T \u03a6(xi) = [k(v1,v1) \u2026 k(v1,vL); \u2026 ; k(vL,v1) \u2026 k(vL,vL)]^{-1/2} [k(v1,xi), \u2026, k(vL,xi)]^T \u2208 \u211c^L,  (1 \u2264 i \u2264 K)  (3) \n\n2. We linearly decompose the mapped data \u03a8 = [\u03a8(x1),\u2026,\u03a8(xK)] \u2208 R^{L\u00d7K} into statistically independent components using NMLE. \n\n4 Proposed approach \n\nThe microarray dataset we are given is in matrix form, where each element xij corresponds to the level of expression of the jth gene in the ith experimental condition. Missing values are imputed by KNNImpute [10], an algorithm based on k nearest neighbors that is widely used in microarray analysis. Given the expression matrix X of N experiments by K genes, we perform the following steps. \n\n1. Apply ICA to decompose X into independent components y1,\u2026,yM as in Equations (1) and (2). Prior to applying ICA, remove any rows that make the expression matrix X singular. After ICA, each component yi is a vector comprising K loads, one per gene, i.e., yi = (yi1, ..., yiK). We chose the number of components M to be as large as possible, namely equal to the number of microarray experiments N, because the maximum for N in our datasets was 250, which is smaller than the number of biological processes we hypothesize to act within a cell. \n\n2. For each component, cluster genes according to their relative loads yij/mean(yi). 
Based on our ICA model, each component is a putative genomic expression program of an independent biological process. Thus, our hypothesis is that genes showing relatively high or low expression levels within the component are the most important for the process. We create two clusters for each component: one cluster containing genes with expression levels higher than a threshold, and one cluster containing genes with expression levels lower than a threshold: \n\nCluster i,1 = {gene j | yij > mean(yi) + c\u00d7std(yi)} \nCluster i,2 = {gene j | yij < mean(yi) - c\u00d7std(yi)}  (4) \n\nHere, mean(yi) is the average and std(yi) is the standard deviation of yi, and c is an adjustable coefficient. The value of the coefficient c was varied from 1.0 to 2.0, and the results for c=1.25 are presented in this paper. The results for other values of c are similar, and are presented on the website www.stanford.edu/~silee/ICA/. \n\n3. For each cluster, measure the enrichment of the cluster with genes of known functional annotations. Using the Gene Ontology (GO) [11] and KEGG [12] gene annotation databases, we calculate the p-value of each cluster with every gene annotation, which is the probability that the cluster contains the observed number of genes with the annotation by chance, assuming the hypergeometric distribution (details in [4]). For each gene annotation, the minimum p-value smaller than 10^-7 obtained from any cluster was collected. If no p-value smaller than 10^-7 is found, we consider the gene annotation not to be detected by the approach. As a result, we can assign a biological meaning to each cluster and its corresponding independent component, and we can evaluate the clustering performance by comparing the collected minimum p-value for each gene annotation with that from other clustering approaches. 
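Steps 2 and 3 above can be sketched directly: threshold a component's loads into two clusters as in Equation (4), then score a cluster against an annotation category with a hypergeometric tail probability. This is a minimal sketch on synthetic loads; the helper names and the toy annotation set are hypothetical, and scipy's `hypergeom` stands in for the test described in [4].

```python
# Minimal sketch of steps 2-3: threshold one component's loads into two
# clusters (Equation (4)) and compute a hypergeometric enrichment
# p-value for one cluster. Data here are synthetic placeholders.
import numpy as np
from scipy.stats import hypergeom

def component_clusters(y, c=1.25):
    """Return (up-cluster, down-cluster) gene indices for loads y."""
    mu, sigma = y.mean(), y.std()
    up = set(np.flatnonzero(y > mu + c * sigma))
    down = set(np.flatnonzero(y < mu - c * sigma))
    return up, down

def enrichment_pvalue(cluster, annotated, n_genes):
    """P(overlap >= observed) under the hypergeometric null."""
    k = len(cluster & annotated)
    return hypergeom.sf(k - 1, n_genes, len(annotated), len(cluster))

K = 100
y = np.zeros(K)
y[:8] = 5.0                            # 8 genes strongly up-regulated
up, down = component_clusters(y)       # up == {0,...,7}, down is empty
annotated = set(range(8)) | {20, 21}   # hypothetical annotation, 10 genes
p = enrichment_pvalue(up, annotated, K)
print(len(up), len(down), p < 1e-7)
```

With 8 of the cluster's 8 genes carrying the annotation out of 10 annotated genes genome-wide, the p-value falls far below the 10^-7 cutoff used in the paper.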
\n\n5 Performance evaluation \n\nWe applied ICA-based clustering to the four expression datasets (D1-D4) described in Table 1. \n\nTable 1: The four datasets used in our analysis \nD1: Spotted array. Budding yeast during cell cycle and CLB2/CLN3 overactive strain [13]. K = 4579 genes, N = 22 experiments. \nD2: Oligonucleotide array. Budding yeast during cell cycle [14]. K = 6616 genes, N = 17 experiments. \nD3: Spotted array. C. elegans in various conditions [3]. K = 17817 genes, N = 553 experiments. \nD4: Oligonucleotide array. Normal human tissue including 19 kinds of tissues [15]. K = 7070 genes, N = 59 experiments. \n\nFor D1 and D4, we compared the biological coherence of ICA components with that of PCA applied to the same datasets in [1] and [2], respectively. For D2 and D3, we compared with k-means clustering and the topomap method, applied to the same datasets in [4] and [3], respectively. We applied nonlinear ICA to D1, D2 and D4; dataset D3 is very large and makes the nonlinear algorithm unstable. \n\nD1 was preprocessed to contain log-ratios xij = log2(Rij/Gij) between red and green intensities. In [1], principal components, referred to as eigenarrays, were hypothesized to be genomic expression programs of distinct biological processes. We compared the biological coherence of independent components with that of the principal components found by [1]. The comparison was done in two ways: (1) For each component, we grouped genes within the top x% of significant up-regulation and down-regulation (as measured by the load of the gene in the component) into two clusters, with x adjusted from 5% to 45%. For each value of x, statistical significance was measured for clusters from independent components and compared with that from principal components based on the minimum p-value for each gene annotation, as described in Section 4. 
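The top-x% grouping used in comparison (1) can be sketched as follows; the loads and the helper name are illustrative stand-ins, not the authors' code.

```python
# Sketch of the top-x% grouping: for one component, take the x% most
# up-regulated and x% most down-regulated genes (by load) as two
# clusters. Loads here are synthetic.
import numpy as np

def top_x_percent_clusters(y, x=15.0):
    """Split loads y into top-x% and bottom-x% gene index sets."""
    n = max(1, int(round(len(y) * x / 100.0)))
    order = np.argsort(y)              # ascending by load
    down = set(order[:n].tolist())     # most down-regulated genes
    up = set(order[-n:].tolist())      # most up-regulated genes
    return up, down

y = np.arange(100, dtype=float)        # toy loads for 100 genes
up, down = top_x_percent_clusters(y, x=15)
print(sorted(up)[:3], sorted(down)[:3])
```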
We made a scatter plot comparing the negative log of the collected best p-values for each gene annotation when x is fixed at 15%, shown in Figure 1 (a). (2) Same as before, except we did not fix the value of x; instead, we collected the minimum p-value from each method for each GO and KEGG gene annotation category and compared the collected p-values (Figure 1 (b)). In both cases, for the majority of the gene annotation categories ICA produced significantly lower p-values than PCA did, especially for gene annotations for which both ICA and PCA showed high significance. \n\nFigure 1: Comparison of linear ICA (NMLE) to PCA on dataset D1 (a) when x is fixed at 15%; (b) when x is not fixed. (c) Three independent components of dataset D4. Each gene is mapped to a point based on the value assigned to the gene in three independent components, which are enriched with liver- (red), muscle- (orange) and vulva-specific (green) genes, respectively. \n\nThe expression levels of genes in D4 were normalized across the 59 experiments, and the logarithms of the resulting values were taken. Experiments 57, 58, and 59 were removed because they made the expression matrix nearly singular. In [2], a clustering approach based on PCA and subsequent visual inspection was applied to an earlier version of this dataset, containing 50 of the 59 samples. After we performed ICA, the most significant independent components were enriched for liver-specific, muscle-specific and vulva-specific genes with p-values of 10^-133, 10^-124 and 10^-117, respectively. In the ICA liver cluster, 198 genes were liver-specific (out of a total of 244), as compared with the 23 liver-specific genes identified in [2] using PCA. The ICA muscle cluster of 235 genes contains 199 muscle-specific genes, compared to 19 muscle-specific genes identified in [2]. 
We generated a 3-dimensional scatter plot of the loads of all genes annotated in [15] on these significant ICA components, shown in Figure 1 (c). The liver-specific, muscle-specific and vulva-specific genes are strongly biased to lie on the x-, y-, and z-axes, respectively. We also applied nonlinear ICA to this dataset; the four most significant clusters from nonlinear ICA with the Gaussian RBF kernel were muscle-specific, liver-specific, vulva-specific and brain-specific, with p-values of 10^-158, 10^-127, 10^-112 and 10^-70, respectively, showing considerable improvement over the linear ICA clusters. \n\nFor D2, variance-normalization was applied to the 3000 most variant genes, as in [4]. The 17th experiment, which made the expression matrix close to singular, was removed. We measured the statistical significance of clusters as described in Section 4 and compared the smallest p-value of each gene annotation from our approach with that from k-means clustering applied to the same dataset [4]. We made a scatter plot comparing the negative log of the smallest p-value from ICA clusters (y-axis) with that from k-means clustering (x-axis). The coefficient c was varied from 1.0 to 2.0, and the superiority of ICA-based clustering over k-means clustering does not change. In many practical settings, estimation of the best c is not needed: unless our focus is to blindly find the size of clusters, we can simply adjust c to obtain a desired cluster size. Figure 2 (a) (b) (c) shows, for c=1.25, a comparison of the performance of linear ICA (NMLE), nonlinear ICA with Gaussian RBF kernel (NICA gauss), and k-means clustering (k-means). \n\nFor D3, we first removed experiments that contained more than 7000 missing values, because ICA does not perform properly when the dataset contains many missing values. 
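The two preprocessing filters used above, dropping experiments with too many missing values (as for D3) and keeping only the most variant genes (as for D2), can be sketched as follows; the toy matrix and thresholds are placeholders, not the paper's data.

```python
# Sketch of two preprocessing filters: drop experiments (rows) with too
# many missing values, and keep only the most variant genes (columns).
# The matrix and thresholds below are toy stand-ins.
import numpy as np

def filter_experiments(X, max_missing):
    """Keep rows (experiments) with at most max_missing NaN entries."""
    keep = np.isnan(X).sum(axis=1) <= max_missing
    return X[keep], keep

def most_variant_genes(X, n_top):
    """Indices of the n_top columns (genes) with highest variance."""
    var = np.nanvar(X, axis=0)
    return np.argsort(var)[::-1][:n_top]

X = np.array([[1.0, 2.0, 3.0, 4.0],
              [1.0, np.nan, np.nan, np.nan],
              [2.0, 1.0, 9.0, 4.0]])
X_kept, keep = filter_experiments(X, max_missing=1)
genes = most_variant_genes(X_kept, n_top=2)
print(keep.tolist(), sorted(genes.tolist()))
```

In the paper's pipeline any remaining missing entries would then be imputed with KNNImpute [10] before running ICA.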
The 250 remaining experiments were used, containing expression levels for 17817 genes preprocessed to be log-ratios xij = log2(Rij/Gij) between red and green intensities. We compared the biological coherence of clusters from our approach with that of the topomap-based approach applied to the same dataset in [3]. The result for c=1.25 is plotted in Figure 2 (d). We observe that the two methods perform very similarly, with most categories having roughly the same p-value in the ICA and topomap clusters. The topomap clustering approach performs slightly better in a larger fraction of the categories. Still, we consider this performance a confirmation that ICA is a widely applicable method that requires minimal training: in this case, the missing values and high diversity of the data make clustering especially challenging, while the topomap approach was specifically designed and manually trained for this dataset, as described in [3]. \n\nFinally, we compared different ICA algorithms in terms of clustering performance. We tested six linear ICA methods: Natural-Gradient Maximum-Likelihood Estimation (NMLE) [7][8], Joint Approximate Diagonalization of Eigenmatrices [16], Fast Fixed-Point ICA with three different measures of non-Gaussianity [17], and Extended Information Maximization (Infomax) [18]. We also tested two kernels for nonlinear ICA: the Gaussian RBF kernel (NICA gauss) and the polynomial kernel (NICA poly). For each dataset, we compared the biological coherence of clusters generated by each method. Among the six linear ICA algorithms, NMLE was the best in all datasets. Among both linear and nonlinear methods, the Gaussian-kernel nonlinear ICA method was the best in datasets D1, D2 and D4, the polynomial-kernel nonlinear ICA method was best in dataset D4, and NMLE was best in the large datasets (D3 and D4). In Figure 3, we compare the NMLE method with three other ICA methods on dataset D2. 
Overall, the NMLE algorithm consistently performed well in all datasets. The nonlinear ICA algorithms performed best in the small datasets, but were unstable in the two largest datasets. More comparison results are presented on the website www.stanford.edu/~silee/ICA/. \n\nFigure 2: Comparison of (a) linear ICA (NMLE) with k-means clustering, (b) nonlinear ICA with Gaussian RBF kernel with linear ICA (NMLE), and (c) nonlinear ICA with Gaussian RBF kernel with k-means clustering on dataset D2. (d) Comparison of linear ICA (NMLE) with the topomap-based approach on dataset D3. \n\nFigure 3: Comparison of linear ICA (NMLE) to (a) the Extended Infomax ICA algorithm, (b) Fast ICA with symmetric orthogonalization and tanh nonlinearity, and (c) nonlinear ICA with polynomial kernel of degree 2, on dataset D2. \n\n6 Discussion \n\nICA is a powerful statistical method for separating mixed independent signals. We proposed applying ICA to decompose microarray data into independent gene expression patterns of underlying biological processes, and to group genes into clusters that are mutually non-exclusive with statistically significant functional coherence. Our clustering method outperformed several leading methods on a variety of datasets, with the added advantage that it requires setting only one parameter, namely the coefficient c of standard deviations beyond which a gene is considered to be associated with a component's cluster. We observed that performance was not very sensitive to this parameter, suggesting that ICA is robust enough to be used for clustering with little human intervention. \n\nThe empirical performance of ICA in our tests supports the hypothesis that statistical independence is a good criterion for separating mixed biological signals in microarray data. 
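A simple diagnostic related to this criterion is excess kurtosis: super-Gaussian (peaked, heavy-tailed) signals have positive excess kurtosis and sub-Gaussian signals negative. The sketch below illustrates the distinction on synthetic signals only; it is not part of the authors' procedure.

```python
# Excess kurtosis as a quick check of super- vs sub-Gaussianity of a
# signal (e.g., a recovered component). Synthetic signals only.
import numpy as np
from scipy.stats import kurtosis  # Fisher definition: Gaussian -> 0

rng = np.random.RandomState(0)
laplace_src = rng.laplace(size=100000)         # super-Gaussian, ~ +3
uniform_src = rng.uniform(-1, 1, size=100000)  # sub-Gaussian, ~ -1.2
gauss_src = rng.normal(size=100000)            # ~ 0 by definition

print(kurtosis(laplace_src) > 0)   # True
print(kurtosis(uniform_src) < 0)   # True
```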
The Extended Infomax ICA algorithm proposed in [18] can automatically determine whether the distribution of each source signal is super-Gaussian or sub-Gaussian. Interestingly, the application of Extended Infomax ICA to all the expression datasets uncovered no source signal with a sub-Gaussian distribution. A likely explanation is that global gene expression profiles are mixtures of super-Gaussian sources rather than of sub-Gaussian sources. This finding is consistent with the following intuition: underlying biological processes are super-Gaussian, because they sharply affect the relevant genes, typically a small fraction of all genes, and leave the majority of genes relatively unaffected. \n\nAcknowledgments \n\nWe thank Te-Won Lee for helpful feedback. We thank Relly Brandman, Chuong Do, and Yueyi Liu for edits to the manuscript. \n\nReferences \n[1] Alter O, Brown PO, Botstein D. Proc. Natl. Acad. Sci. USA 97(18):10101-10106, 2000. \n[2] Misra J, Schmitt W, et al. Genome Research 12:1112-1120, 2002. \n[3] Kim SK, Lund J, et al. Science 293:2087-2092, 2001. \n[4] Tavazoie S, Hughes JD, et al. Nature Genetics 22(3):281-285, 1999. \n[5] Hori G, Inoue M, et al. Proc. 3rd Int. Workshop on Independent Component Analysis and Blind Signal Separation, Helsinki, Finland, pp. 151-155, 2000. \n[6] Liebermeister W. Bioinformatics 18(1):51-60, 2002. \n[7] Bell AJ, Sejnowski TJ. Neural Computation 7:1129-1159, 1995. \n[8] Amari S, Cichocki A, et al. In Advances in Neural Information Processing Systems 8, pp. 757-763. Cambridge, MA: MIT Press, 1996. \n[9] Harmeling S, Ziehe A, et al. In Advances in Neural Information Processing Systems. Cambridge, MA: MIT Press. \n[10] Troyanskaya O, Cantor M, et al. Bioinformatics 17:520-525, 2001. \n[11] The Gene Ontology Consortium. Genome Research 11:1425-1433, 2001. \n[12] Kanehisa M, Goto S. In Current Topics in Computational Molecular Biology, pp. 301-315. 
MIT Press, Cambridge, MA, 2002. \n[13] Spellman PT, Sherlock G, et al. Mol. Biol. Cell 9:3273-3297, 1998. \n[14] Cho RJ, Campbell MJ, et al. Molecular Cell 2:65-73, 1998. \n[15] Hsiao L, Dangond F, et al. Physiol. Genomics 7:97-104, 2001. \n[16] Cardoso JF. Neural Computation 11(1):157-192, 1999. \n[17] Hyvarinen A. IEEE Transactions on Neural Networks 10(3):626-634, 1999. \n[18] Lee TW, Girolami M, et al. Neural Computation 11:417-441, 1999.", "award": [], "sourceid": 2396, "authors": [{"given_name": "Su-in", "family_name": "Lee", "institution": null}, {"given_name": "Serafim", "family_name": "Batzoglou", "institution": null}]}