{"title": "From Coexpression to Coregulation: An Approach to Inferring Transcriptional Regulation among Gene Classes from Large-Scale Expression Data", "book": "Advances in Neural Information Processing Systems", "page_first": 928, "page_last": 934, "abstract": null, "full_text": "From Coexpression to Coregulation: An Approach to Inferring Transcriptional Regulation among Gene Classes from Large-Scale Expression Data

Eric Mjolsness
Jet Propulsion Laboratory
California Institute of Technology
Pasadena CA 91109-8099
mjolsness@jpl.nasa.gov

Tobias Mann
Jet Propulsion Laboratory
California Institute of Technology
Pasadena CA 91109-8099
mann@aigjpl.nasa.gov

Rebecca Castaño
Jet Propulsion Laboratory
California Institute of Technology
Pasadena CA 91109-8099
becky@aigjpl.nasa.gov

Barbara Wold
Division of Biology
California Institute of Technology
Pasadena CA 91125
woldb@its.caltech.edu

Abstract

We provide preliminary evidence that existing algorithms for inferring small-scale gene regulation networks from gene expression data can be adapted to large-scale gene expression data coming from hybridization microarrays. The essential steps are (1) clustering many genes by their expression time-course data into a minimal set of clusters of co-expressed genes, (2) theoretically modeling the various conditions under which the time-courses are measured using a continuous-time analog recurrent neural network for the cluster mean time-courses, (3) fitting such a regulatory model to the cluster mean time courses by simulated annealing with weight decay, and (4) analysing several such fits for commonalities in the circuit parameter sets including the connection matrices. This procedure can be used to assess the adequacy of existing and future gene expression time-course data sets for determining transcriptional regulatory relationships such as coregulation.

1 Introduction

In a cell, genes can be turned "on" or "off" to varying degrees by the protein products of other genes. When a gene is "on" it is transcribed to produce messenger RNA (mRNA) which can subsequently be translated into protein molecules. Some of these proteins are transcription factors which bind to DNA at specific sites and thereby affect which genes are transcribed and how often. This transcriptional regulation feedback circuitry provides a fundamental mechanism for information processing in the cell. It governs differentiation into diverse cell types and many other basic biological processes.

Recently, several new technologies have been developed for measuring the "expression" of genes as mRNA or protein product. Improvements in conventional fluorescently labeled antibodies against proteins have been coupled with confocal microscopy and image processing to partially automate the simultaneous measurement of small numbers of proteins in large numbers of individual nuclei in the fruit fly Drosophila melanogaster [1]. In a complementary way, the mRNA levels of thousands of genes, each averaged over many cells, have been measured by hybridization arrays for various species including the budding yeast Saccharomyces cerevisiae [2].

The high-spatial-resolution protein antibody data has been quantitatively modeled by "gene regulation network" circuit models [3] which use continuous-time, analog, recurrent neural networks (Hopfield networks without an objective function) to model transcriptional regulation [4][5]. This approach requires some machine learning technique to infer the circuit parameters from the data, and a particular variant of simulated annealing has proven effective [6][7].
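The kind of annealing-based parameter fit referred to here can be illustrated with a generic Metropolis-style loop. This is a minimal sketch, not the Lam-Delosme variant the paper uses: the geometric cooling schedule, the single-coordinate Gaussian proposal, and the toy line-fitting objective are all illustrative assumptions.

```python
import math
import random

def anneal(score, x0, n_steps=20000, t0=1.0, cooling=0.9995, step=0.1, seed=0):
    """Generic simulated annealing: minimize `score` over a parameter vector.

    A plain Metropolis/annealing loop for illustration only; the paper uses
    the more sophisticated Lam-Delosme adaptive annealing schedule.
    """
    rng = random.Random(seed)
    x, s = list(x0), score(x0)
    best_x, best_s = list(x), s
    t = t0
    for _ in range(n_steps):
        # Propose a Gaussian perturbation of one randomly chosen coordinate.
        i = rng.randrange(len(x))
        cand = list(x)
        cand[i] += rng.gauss(0.0, step)
        s_cand = score(cand)
        # Metropolis acceptance rule at the current temperature.
        if s_cand < s or rng.random() < math.exp((s - s_cand) / t):
            x, s = cand, s_cand
            if s < best_s:
                best_x, best_s = list(x), s
        t *= cooling  # geometric cooling (illustrative choice)
    return best_x, best_s

# Toy usage: recover parameters (a, b) of y = a*x + b from noiseless data.
data = [(x, 2.0 * x - 1.0) for x in range(10)]
score = lambda p: sum((p[0] * x + p[1] - y) ** 2 for x, y in data)
params, final = anneal(score, [0.0, 0.0])
```

For the gene circuit problem, the parameter vector would hold the connection strengths, time constants, and decay rates, and `score` would compare simulated trajectories against cluster mean time courses.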
Methods in current biological use for analysing mRNA hybridization data do not infer regulatory relationships, but rather simply cluster together genes with similar patterns of expression across time and experimental conditions [8][9]. In this paper, we explore the extension of the gene circuit method to the mRNA hybridization data, which has much lower spatial resolution but can currently assay a thousand times more genes than immunofluorescent image analysis.

The essential problem with using the gene circuit method, as employed for immunofluorescence data, on hybridization data is that the number of connection strength parameters grows between linearly and quadratically in the number of genes (depending on sparsity assumptions). This requires more data on each gene, and even if that much data is available, simulated annealing for circuit inference does not seem to scale well with the number of unknown parameters. Some form of dimensionality reduction is called for. Fortunately, dimensionality reduction is available in the present practice of clustering the large-scale time-course expression data by genes, into gene clusters. In this way one can derive a small number of cluster-mean time courses for "aggregated genes", and then fit a gene regulation circuit to these cluster mean time courses. We will discuss details of how this analysis can be performed and then interpreted. A similar approach using somewhat different algorithms for clustering and circuit inference has been taken by Hertz [10].

In the following, we will first summarize the data models and algorithms used, and then report on preliminary experiments in applying those algorithms to gene expression data for 2467 yeast genes [9][11]. Finally we will discuss prospects for and limitations of the approach.

2 Data Models and Algorithms

The data model is as follows.
We imagine that there is a small, hidden regulatory \nnetwork of \" aggregate genes\" which regulate one another by the analog neural \nnetwork dynamics [3] \n\nT . dv; = g(~ T.v + h) - Xv . \nI dt \n\nI \n\nI \n\nI \n\n~ 1/ \nJ \n\n/ \n\n\f930 \n\nE. Mjolsness, T Mann, R. Castano and B. Wold \n\nis the continuous-valued state variable for gene product i, ~j is the \nIn which Vi \nmatrix of positive, zero, or negative connections by which one transcription factor \ncan enhance or repress another, and gO \nis a nonlinear monotonic sigmoidal \nactivation function. When a particular matrix entry ~j \nis nonzero, there is a \nregulatory \"connection\" from gene product} to gene i. The regulation is enhancing \nif T is positive and repressing if it is negative. If ~j is zero there is no connection . \n\nThis network is run forwards from some initial condition and time-sampled to \ngenerate a wild-type time course for the aggregate genes. In addition, various other \ntime courses can be generated under alternative experimental conditions by \nmanipulating the parameters. For example an entire aggregate gene (corresponding \nto a cluster of real genes) could be removed from the circuit or otherwise modified \nto represent mutants. External input conditions could be modeled as modifications \nto h. Thus we get one or several time courses (trajectories) for the aggregate genes. \n\nFrom such aggregate time courses, actual gene data is generated by addition of \nGaussian-distributed noise to the logarithms of the concentration variables. Each \ntime point in each cluster has its own scalar standard deviation parameter (and a \nmean arising from the circuit dynamics). Optionally, each gene's expression data \nmay also be multiplied by a time-independent proportionality constant. \n\nT \n\nRegulatory aggregate genes \ncluster \n(large circles) and \nmember \ngenes \n(small \ncircles). 
Given this data generation model and suitable gene expression data, the problem is to infer gene cluster memberships and the circuit parameters for the aggregate genes' regulatory relationships. Then, we would like to use the inferred cluster memberships and regulatory circuitry to make testable biological predictions.

This data model departs from biological reality in many ways that could prove to be important, both for inference and for prediction. Except for the Gaussian noise model, each gene in a cluster is modeled as fully coregulated with every other one - they are influenced in the same ways by the same regulatory connection strengths. Also, the nonlinear circuit model must reflect not only transcriptional regulation, but all other regulatory circuitry affecting measured gene expression, such as kinase-phosphatase networks.

Under this data model, one could formulate a joint Bayesian inference problem for the clustering and circuit inference aspects of fitting the data. But given the highly provisional nature of the model, we simply apply in sequence an existing mixture-of-Gaussians clustering algorithm to preprocess the data and reduce its dimensionality, and then an existing gene circuit inference algorithm. Presumably a joint optimization algorithm could be obtained by iterating these steps.

2.1 Clustering

A widely used clustering algorithm for mixture model estimation is Expectation-Maximization (EM) [12]. We use EM with a diagonal covariance in the Gaussian, so that for each feature vector component a (a combination of experimental condition and time point in a time course) and cluster \alpha there is a standard deviation parameter \sigma_{a\alpha}. In preprocessing, each concentration data point is divided by its value at time zero and then a logarithm taken. The log ratios are clustered using EM.
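The preprocessing and diagonal-covariance EM steps can be sketched as follows. This is a from-scratch illustration rather than the code used in the paper (a library routine such as scikit-learn's GaussianMixture with covariance_type='diag' covers the same ground), and the two-cluster synthetic "expression" data at the bottom is invented for the example.

```python
import numpy as np

def preprocess(X):
    """Divide each gene's time course by its value at time zero, then take
    logarithms, as in the paper's preprocessing step."""
    return np.log(X / X[:, :1])

def em_diag(X, k, n_iter=100, seed=0):
    """EM for a mixture of Gaussians with diagonal covariances.
    Returns mixing weights, cluster means, and per-cluster diagonal variances."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    mu = X[rng.choice(n, k, replace=False)]  # initialize means from data points
    var = np.ones((k, d))
    pi = np.full(k, 1.0 / k)
    for _ in range(n_iter):
        # E-step: responsibilities under diagonal Gaussians, in the log domain.
        log_p = (-0.5 * (((X[:, None, :] - mu) ** 2) / var
                         + np.log(2 * np.pi * var)).sum(-1)
                 + np.log(pi))
        log_p -= log_p.max(axis=1, keepdims=True)
        r = np.exp(log_p)
        r /= r.sum(axis=1, keepdims=True)
        # M-step: update mixing weights, means, and diagonal variances.
        nk = r.sum(axis=0) + 1e-12
        pi = nk / n
        mu = (r.T @ X) / nk[:, None]
        var = (r.T @ (X ** 2)) / nk[:, None] - mu ** 2 + 1e-6
    return pi, mu, var

# Toy usage: two synthetic "expression" clusters, flat vs. rising time courses.
rng = np.random.default_rng(1)
A = rng.normal(0.0, 0.1, size=(50, 8)) + 1.0                        # flat
B = rng.normal(0.0, 0.1, size=(50, 8)) + np.linspace(1.0, 3.0, 8)   # rising
X = preprocess(np.vstack([A, B]))
pi, mu, var = em_diag(X, k=2)
```

The per-component variances in `var` play the role of the \sigma_{a\alpha} parameters: one standard deviation per feature vector component per cluster.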
Optionally, each gene's entire feature vector may be normalized to unit length and the cluster centers likewise normalized during the iterative EM algorithm.

In order to choose the number of clusters, k, we use the cross-validation algorithm described by Smyth [13]. This involves computing the likelihood of each optimized fit on a test set and averaging over runs and over divisions of the data into training and test sets. Then, we can examine the likelihood as a function of k in order to choose k. Normally one would pick k so as to maximize cross-validated likelihood. However, in the present application we also want to reward small values of k which lead to smaller circuits for the circuit inference phase of the algorithm. The choice of k will be discussed in the next section.

2.2 Circuit Inference

We use the Lam-Delosme variant of simulated annealing (SA) to derive connection strengths T, time constants \tau, and decay rates \lambda, as in previous work using this gene circuit method [4][5]. We set h to zero. The score function which SA optimizes is

S(T, \tau, \lambda) = A \sum_{i,t} \big( v_i(t; T, \tau, \lambda) - \bar{v}_i(t) \big)^2 + W \sum_{ij} T_{ij}^2 + \exp\Big[ B \Big( \sum_{ij} T_{ij}^2 + \sum_i \lambda_i^2 + \sum_i \tau_i^2 \Big) \Big] - 1 .

The first term represents the fit to the data \bar{v}_i. The second term is a standard weight decay term. The third term forces solutions to stay within a bounded region in weight space. We vary the weight decay coefficient W in order to encourage relatively sparse connection matrix solutions.

3 Results

3.1 Data

We used the Saccharomyces cerevisiae data set of [9]. It includes three longer time courses representing different ways to synchronize the normal cell cycle [11], and five shorter time courses representing altered conditions. We used all eight time courses for clustering, but just 8 time points of one of the longer time courses (the alpha factor synchronized cell cycle) for the circuit inference. It is likely that multiple
long time courses under altered conditions will be required before strong biological predictions can be made from inferred regulatory circuit models.

3.2 Clustering

We found that the most likely number of classes as determined by cross-validation was about 27, but that there is a broad plateau of high-likelihood cluster numbers from 15 to 35 (Figure 1). This is similar to our results with another gene expression data set for the nematode worm Caenorhabditis elegans supplied by Stuart Kim; these more extensive clustering experiments are summarized in Figure 2. Clustering experiments with synthetic data were used to understand these results. These experiments show that the cross-validated log-likelihood curve can indicate the number of clusters present in the data, justifying the use of the curve for that purpose. In more detail, synthetic data generated from 14 20-dimensional spherical Gaussian clusters were clustered using the EM/CV algorithm. The likelihoods showed a sharp peak at k=14, unlike Figures 1 or 2. In another experiment, 14 20-dimensional spherical Gaussian superclusters were used to generate second-level clusters (3 subclusters per supercluster), which in turn generated synthetic data points. This two-level hierarchical model was then clustered with the EM/CV method. The likelihood curves (not shown) were quite similar to Figures 1 and 2, with a higher-likelihood plateau from roughly 14 to 40.

Figure 1. Cross-validated log-likelihood scores, displayed and averaged over 5 runs, for EM clustering of S. cerevisiae gene expression data [9]. Horizontal axis: k, the "requested" or maximal number of cluster centers in the fit. Some cluster centers go unmatched to data.
Vertical axis: log likelihood score for the fit, scatterplotted and averaged. Likelihoods have not been integrated over any range of parameters for hypothesis testing. k ranges from 2 to 40 in increments of 1. Solid line shows the average likelihood value for each k.

Figure 2. Cross-validated log-likelihood scores, averaged over 13 runs, for EM clustering of C. elegans gene expression data from S. Kim's lab. Horizontal axis: k, the "requested" or maximal number of cluster centers in the fit. Some cluster centers go unmatched to data. Vertical axis: log likelihood score for the fit, as an average over 13 runs plus or minus one standard deviation. (Left) Fine-scale plot, k=2 to 60 in increments of 2. (Right) Coarse-scale plot, k=2 to 202 in increments of 10. Both plots show an extended plateau of relatively likely fits between roughly k=14 and k=40.

From Figures 1 and 2 and the synthetic data experiments mentioned above, we can guess at appropriate values for k which take into account both the measured likelihood of clustering and the requirement for few parameters in circuit-fitting. For example, choosing k=15 clusters would put us at the beginning of the plateau, losing very little cluster likelihood in return for reducing the aggregate gene circuit size from 27 to 15 players. The interpretation would be that there are about 15 superclusters in hierarchically clustered data, to which we will fit a 15-player regulatory circuit. Much more aggressive would be to pick k=7 or 8 clusters, for a relatively significant drop in log-likelihood in return for a further substantial decrease in circuit size.
An acceptable range of cluster numbers (and circuit sizes) would seem to be k=8 to 15.

3.3 Gene Circuit Inference

It proved possible to fit the k=15 time course using weight decay W=1 but without using hidden units. W=0 and W=3 gave less satisfactory results. Four of the 15 clusters are shown in Figure 3 for one good run (W=1). Scores for our first few (unselected) runs at the current parameter settings are shown in Table 1. Each run took between 24 and 48 hours on one processor of a Sun Ultrasparc 60 computer. Even with weight decay, it is possible that successful fits are really overfits with this particular data, since there are about twice as many parameters as data points.

Weight Decay W | Number of runs | Scores (mean +/- s.d.)
0 | 3  | 1.536 +/- 0.134, 2.803 +/- 0.437
1 | 10 | 0.787 +/- 0.394, 2.782 +/- 0.200
3 | 4  | 1.438 +/- 0.037, 2.880 +/- 0.090

Table 1. Score function parameters were A=1.0, B=0.01. Annealing run statistics are reported when the temperature dropped below 0.0001. All the best scores and visually acceptable fits occurred in W=1 runs.

The average values of the data fit, weight decay, and penalty terms in the score function for W=1 were {0.378, 0.332, 0.0667} after slightly more annealing.

There were a few significant similarities between the connection matrices computed in the two lowest-scoring runs. The most salient feature in the lowest-scoring network was a set of direct feedback loops among its strongest connections: cluster 8 both excited and was inhibited by cluster 10, and cluster 10 excited and was inhibited by cluster 15. This feature was preserved in the second-best run. A systematic search for "consensus circuitry" shows convergence towards a unique connection matrix for the 8-point time series data used here, but more complete 16-time-point data gives multiple "clusters" of connection matrices.
From parameter-counting one might expect that making robust and unique regulatory predictions will require the use of more trajectory data taken under substantially different conditions. Such data is expected to be forthcoming.

4 Discussion

We have illustrated a procedure for deriving regulatory models from large-scale gene expression data. As the data becomes more comprehensive in the number and nature of conditions under which comparable time courses are measured, this procedure can be used to determine when biological hypotheses about gene regulation can be robustly derived from the data.

Acknowledgments

This work was supported in part by the Whittier Foundation, the Office of Naval Research under contract N00014-97-1-0422, and the NASA Advanced Concepts Program. Stuart Kim (Stanford University) provided the C. elegans gene expression array data. The GRN simulation and inference code is due in part to Charles Garrett and George Marnellos. The EM clustering code is due in part to Roberto Manduchi.

[Figure 3: four of the 15 cluster mean time courses (original data).]