{"title": "Cross Species Expression Analysis using a Dirichlet Process Mixture Model with Latent Matchings", "book": "Advances in Neural Information Processing Systems", "page_first": 1270, "page_last": 1278, "abstract": "Recent studies compare gene expression data across species to identify core and species specific genes in biological systems. To perform such comparisons researchers need to match genes across species. This is a challenging task since the correct matches (orthologs) are not known for most genes. Previous work in this area used deterministic matchings or reduced multidimensional expression data to binary representation. Here we develop a new method that can utilize soft matches (given as priors) to infer both, unique and similar expression patterns across species and a matching for the genes in both species. Our method uses a Dirichlet process mixture model which includes a latent data matching variable. We present learning and inference algorithms based on variational methods for this model. Applying our method to immune response data we show that it can accurately identify common and unique response patterns by improving the matchings between human and mouse genes.", "full_text": "Cross Species Expression Analysis using a Dirichlet\n\nProcess Mixture Model with Latent Matchings\n\nHai-Son Le\n\nMachine Learning Department\n\nCarnegie Mellon University\n\nPittsburgh, PA, USA\nhple@cs.cmu.edu\n\nZiv Bar-Joseph\n\nMachine Learning Department\n\nCarnegie Mellon University\n\nPittsburgh, PA, USA\n\nzivbj@cs.cmu.edu\n\nAbstract\n\nRecent studies compare gene expression data across species to identify core and\nspecies speci\ufb01c genes in biological systems. To perform such comparisons re-\nsearchers need to match genes across species. This is a challenging task since\nthe correct matches (orthologs) are not known for most genes. 
Previous work in this area used deterministic matchings or reduced multidimensional expression data to a binary representation. Here we develop a new method that can utilize soft matches (given as priors) to infer both unique and similar expression patterns across species, and a matching for the genes in both species. Our method uses a Dirichlet process mixture model which includes a latent data matching variable. We present learning and inference algorithms based on variational methods for this model. Applying our method to immune response data we show that it can accurately identify common and unique response patterns by improving the matchings between human and mouse genes.

1 Introduction

Researchers have been increasingly relying on cross species analysis to understand how biological systems operate. Sequence based methods have been successfully applied to identify and characterize coding and functional non coding regions in multiple species [1]. However, sequence information is static and thus provides only a partial view of cellular activity. More recent studies attempt to integrate sequence and gene expression data from multiple species [2, 3, 4]. Unlike sequence, expression levels are dynamic and differ across time and conditions. By combining expression and sequence data researchers were able to identify both "core" and "divergent" genes. "Core" genes are similarly expressed across species and are useful for constructing models of conserved systems, for example the cell cycle [2]. "Divergent" genes are similar in sequence but differ in expression across species. These are useful for identifying species specific responses, for example why some pathogens are resistant to drugs while others are not [3].

While useful, cross species analysis of expression data is challenging. In addition to the regular issues with expression data (noise, missing values, etc.),
when comparing expression levels across species researchers need to match genes across species. For most genes the correct match in another species (known as the ortholog) is not known. A number of methods have been suggested to solve the matching problem. The first set of methods is based on a one to one deterministic assignment relying on top sequence matches. Such an assignment can be used to concatenate the expression vectors for matched genes across species and then cluster the resulting vectors. For example, Stuart et al. [5] constructed "metagenes" consisting of top sequence matches from four species. These were used to cluster the data from multiple species to identify conserved and divergent patterns. Bergmann et al. [6] defined one of the species (species A) as a reference and first clustered the genes in A. They then used matched genes in the second species (B) as starting points for clustering the genes in B. When the clustering algorithm converges in B, genes that remain in the cluster are considered "core" whereas genes that are removed are "divergent". Quon et al. [4] used a mixture of Gaussians model, which takes as input the expression data of orthologous genes and a phylogenetic tree connecting the species, to reconstruct the expression profiles as well as to detect divergent links in the phylogeny. The second set of methods allowed for soft matches but was limited to analyzing binary or discrete data with very few labels. For example, Lu et al. combined experiments from multiple species by using Markov Random Fields [7] and Gaussian Random Fields [8] in which edges represent sequence similarity and potential functions constrain similar genes across species to have similar expression patterns.

While both approaches led to successful applications, they suffer from drawbacks that limit their use in practice.
In many cases the top sequence match is not the correct ortholog, and a deterministic assignment may lead to wrong conclusions about the conservation of genes. Methods that have used soft assignments were limited to summarizations of the data (up or down regulated) and could not utilize more complex profiles. Here we present a new method that uses soft assignments to allow comparison and clustering of arbitrary expression data across species without requiring prior knowledge of the phylogeny. Our method takes as input expression datasets in two species and a prior on matches between homologous genes in these species (derived from sequence data). The method simultaneously clusters the expression values for both species while computing a posterior for the assignment of orthologs to genes. We use a Dirichlet process model to automatically detect the number of clusters.

We have tested our method on simulated and immune response data. In both cases the algorithm was able to find correct matches and to improve upon methods that use a deterministic assignment. While the method was developed for, and applied to, biological data, it is general and can be used to address other problems, including the matching of captions to images (see Section 5).

2 Problem definition

In this section, we first describe in detail the cross species analysis problem for gene expression data. Next, we formalize this as a general clustering and matching problem for cases in which the matches are not known in advance.

Using microarrays or new sequencing techniques researchers can monitor the expression levels of genes under certain conditions or at specific time points. For each such measurement we obtain a vector whose elements are the expression values for all genes (there are usually thousands of entries in each vector). We assume that the input consists of microarray experiments from two species and each species has a different set of genes.
While the exact matches between genes in both species are not known for most genes, we have a prior for gene pairs (one from each species) which is derived from sequence data [9]. Our goal is to simultaneously cluster the genes in both species. Such clustering can identify coherent and divergent responses between the species. In addition, we would like to infer for each gene in one species whether there exists a homolog that is similarly expressed in the other species and, if so, which one.

The problem can also be formalized more generally in the following way. Denote by x = [x_1, x_2, ..., x_{n_x}] and y = [y_1, y_2, ..., y_{n_y}] the datasets of samples from two different experiment settings, where x_i \in R^{p_x} and y_j \in R^{p_y}. In addition, let M be a sparse non-negative n_x \times n_y matrix that encodes prior information regarding the matching of samples in x and y. We define the match probability between x_i and y_j as follows:

p(x_i and y_j are matched) = M(i, j) / N_i = \pi_{i,j}
p(x_i is not matched) = 1 / N_i = \pi_{i,0}    (1)

where N_i = 1 + \sum_{j=1}^{n_y} M(i, j). \pi_{i,0} is the prior probability that x_i is not matched to any element in y. We use \pi_i to denote the vector (\pi_{i,0}, ..., \pi_{i,n_y}). Finally, let m_i \in {0, 1, ..., n_y} be the latent matching variable. If m_i = j > 0 we say that x_i is matched to y_{m_i}. If m_i = 0 we say that x_i has no match in y. Our goal is to infer both the latent variables m_i and the cluster memberships of the pairs (x_i, y_{m_i}). The following notations are used in the rest of the paper. Lowercase normal font, e.g. x, is used for a single variable and lowercase bold font, e.g. x, is used for vectors. Uppercase bold roman letters, such as M, denote matrices.
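As a concrete illustration of Eq. (1), the match prior can be computed row by row from M. Below is a minimal numpy sketch (our own helper, using a dense array for M for simplicity, although M is sparse in practice):

```python
import numpy as np

def match_prior(M):
    """Row-wise match prior of Eq. (1): row i of the result is
    (pi_{i,0}, pi_{i,1}, ..., pi_{i,n_y}) with N_i = 1 + sum_j M(i, j),
    pi_{i,0} = 1 / N_i and pi_{i,j} = M(i, j) / N_i."""
    M = np.asarray(M, dtype=float)
    N = 1.0 + M.sum(axis=1, keepdims=True)   # N_i, shape (n_x, 1)
    return np.hstack([1.0 / N, M / N])       # shape (n_x, n_y + 1)

pi = match_prior([[3.0, 1.0, 0.0],           # x_0 has two candidate matches
                  [0.0, 0.0, 0.0]])          # x_1 has none, so pi_{1,0} = 1
```

Note that a row of zeros in M automatically places all prior mass on the "no match" outcome, which is how unmatched samples are handled.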
Uppercase letters, e.g. X, are used to represent random variables and E[X] represents the expectation of a random variable X.

3 Model

Model selection is an important problem when analyzing real world data. Many clustering algorithms, including Gaussian mixture models, require the number of clusters as an input. In addition to domain knowledge, this model selection question can be addressed using cross validation. Bayesian nonparametric methods provide an alternative solution, allowing the complexity of the model to grow based on the amount of available data. Under-fitting is addressed by the fact that the model allows for unbounded complexity, while over-fitting is mitigated by the Bayesian assumption. We use this approach to develop a nonparametric model for clustering and matching cross species expression data. Our model, termed the Dirichlet Process Mixture Model with Latent Matchings (DPMMLM), extends the popular Dirichlet Process Mixture Model to cases where priors are provided for the matchings between the vectors to be clustered.

3.1 Dirichlet Process

Let G_0 be a probability measure on a measurable space. We write G ~ DP(\alpha, G_0) if G is a random probability measure drawn from a Dirichlet process (DP). The existence of the Dirichlet process was first proven by [10]. Furthermore, measures drawn from a DP are discrete with probability one. This property can be seen from the explicit stick-breaking construction due to Sethuraman [11], as follows. Let (V_i)_{i=1}^\infty and (\eta_i)_{i=1}^\infty be independent sequences of i.i.d. random variables: V_i ~ Beta(1, \alpha) and \eta_i ~ G_0. Then the random measure G defined as

\theta_i = V_i \prod_{j=1}^{i-1} (1 - V_j)        G = \sum_{i=1}^\infty \theta_i \delta_{\eta_i}    (2)

where \delta_\eta is a probability measure concentrated at \eta, is a random probability measure distributed according to DP(\alpha, G_0), as shown in [11].

3.2 Dirichlet Process Mixture Model (DPMM)

The Dirichlet process has been used as a nonparametric prior on the parameters of a mixture model. This model is referred to as the Dirichlet Process Mixture Model. Let z be the mixture membership indicator variables for the data variables x. Using the stick-breaking construction in (2), the Dirichlet process mixture model is given by

G ~ DP(\alpha, G_0)        z_i, \eta_i | G ~ G        x_i | z_i, \eta_i ~ F(\eta_i)    (3)

where F(\eta_i) denotes the distribution of the observation x_i given parameter \eta_i.

3.3 Dirichlet Process Mixture Model with Latent Matchings (DPMMLM)

In this section, we describe the new mixture model based on the DP with latent variables for the data matching between x and y. We use F_X(\eta), F_Y(\eta) to denote the marginal distributions of X and Y respectively, and F_{X|Y}(y, \eta) to denote the conditional distribution of X given Y. The parameter \eta is a random variable with prior distribution G_0(\eta | \lambda_0) and hyperparameter \lambda_0. Also, let z_i be the mixture membership of the sample pair (x_i, y_{m_i}). Our model is given by:

G ~ DP(\alpha, G_0)        z_i, \eta_i | G ~ G        m_i | \pi_i ~ Discrete(\pi_i)

y_{m_i} | m_i, z_i, \eta_i ~ F_Y(\eta_i), if m_i > 0

x_i | m_i, z_i, \eta_i, y ~ F_{X|Y}(y_{m_i}, \eta_i) if m_i > 0, and F_X(\eta_i) otherwise    (4)

The major difference between our model and a regular DPMM is the dependence of x_i on y if m_i > 0.
In other words, the assignment of x to a cluster depends both on its own expression levels and on the levels of the y component to which it is matched. If x is not matched to any y component then we resort to the marginal distribution F_X of the mixture.

3.4 Mean-field variational methods

For probabilistic models, mean-field variational methods [12, 13] provide a deterministic and bounded approximation to the intractable joint probability of observed and hidden variables. Briefly, given a model with observed variables x and hidden variables h, we would like to compute log p(x), which requires us to marginalize over all hidden variables h. Since p(x, h) is often intractable, we can find a tractable probability q(h) that gives the best lower bound on log p(x) using Jensen's inequality:

log p(x) >= \int_h q(h) log p(x, h) - q(h) log q(h) dh = E_q[log p(x, h)] - E_q[log q(h)]    (5)

Maximizing this lower bound is equivalent to finding the distribution q(h) that minimizes the KL divergence between q(h) and p(h | x). Hence, q(h) is the best approximation within the chosen parametric family.

3.5 Variational Inference for DPMMLM

Although the DP mixture model is an "infinite" mixture model, it is intractable to solve the optimization problem when allowing for infinitely many variables. We thus follow the truncation approach used in [14] and limit the number of clusters to K. When K is chosen to be large enough, the truncated distribution closely approximates a draw from the Dirichlet process [14]. To restrict the number of clusters to K, we set V_K = 1 and thus obtain \theta_{i>K} = 0 in (2).
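The truncated stick-breaking construction just described is easy to sample directly. A minimal numpy sketch (function and variable names are ours):

```python
import numpy as np

def truncated_stick_breaking(alpha, K, rng):
    """Sample the mixture weights theta of Eq. (2) under the truncation of
    Section 3.5: V_K = 1, so that theta_i = 0 for all i > K."""
    V = rng.beta(1.0, alpha, size=K)
    V[-1] = 1.0                                   # truncation: last break takes the rest
    stick_left = np.concatenate([[1.0], np.cumprod(1.0 - V[:-1])])
    return V * stick_left                         # theta_i = V_i * prod_{j<i} (1 - V_j)

theta = truncated_stick_breaking(alpha=1.0, K=20, rng=np.random.default_rng(0))
```

Setting V_K = 1 makes the K weights sum to exactly one, which is why the truncated model remains a proper mixture.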
The likelihood of the observed data is

p(x, y | \alpha, \lambda_0) = \int_{m,z,v,\eta} p(\eta | \lambda_0) p(v | \alpha) \prod_{i=1}^{n_x} p(z_i | v) \prod_{k=1}^{K} \Big[ \big( \pi_{i,0} f_X(x_i | \eta_k) \big)^{m_i^0} \prod_{j=1}^{n_y} \big( \pi_{i,j} f_{X|Y}(x_i | y_j, \eta_k) f_Y(y_j | \eta_k) \big)^{m_i^j} \Big]^{z_i^k}    (6)

where p(z_i | v) = v_{z_i} \prod_{k=1}^{z_i - 1} (1 - v_k) and v are the stick-breaking variables given in Section 3.1. The first part of (6), p(\eta | \lambda_0) p(v | \alpha), is the likelihood of the model parameters and the second part is the likelihood of the assignments to clusters and matchings.

Following the variational inference framework for conjugate-exponential graphical models [15] we choose a distribution that factorizes over {m_i, z_i}_{i=1,...,n_x}, {v_k}_{k=1,...,K-1} and {\eta_k}_{k=1,...,K} as follows:

q(m, z, v, \eta) = \prod_{i=1}^{n_x} \Big[ q_{\phi_i}(m_i) \prod_{j=0}^{n_y} q_{\theta_{i,j}}(z_i)^{m_i^j} \Big] \prod_{k=1}^{K-1} q_{\gamma_k}(v_k) \prod_{k=1}^{K} q_{\lambda_k}(\eta_k)    (7)

where q_{\phi_i}(m_i) and q_{\theta_{i,j}}(z_i) are multinomial distributions and q_{\gamma_k}(v_k) are beta distributions. These are conjugate distributions for the likelihood of the parameters in (6).
q_{\lambda_k}(\eta_k) requires special treatment due to the coupling of the marginal and conditional distributions in the likelihood. These issues are discussed in detail in Section 3.5.2.

Using this variational distribution we obtain a lower bound for the log likelihood:

log p(x, y | \alpha, \lambda_0) >= E[log p(\eta | \lambda_0)] + E[log p(V | \alpha)] + \sum_{i=1}^{n_x} \Big( E[log p(Z_i | V)] + \sum_{j=0}^{n_y} \sum_{k=1}^{K} E[M_i^j Z_i^k] (log \pi_{i,j} + \rho_{i,j,k}) \Big) - E[log q(M, Z, V, \eta)]    (8)

where all expectations are with respect to the distribution q(m, z, v, \eta) and

\rho_{i,j,k} = E[log f_{X|Y}(X_i | Y_j, \eta_k)] + E[log f_Y(Y_j | \eta_k)]  if j > 0
\rho_{i,j,k} = E[log f_X(X_i | \eta_k)]  if j = 0

To compute the terms in (8), we note that

E[M_i^j Z_i^k] = \phi_{i,j} \theta_{i,j,k} = \psi_{i,j,k}

E[log p(Z_i | V)] = \sum_{k=1}^{K} q(z_i > k) E[log(1 - V_k)] + q(z_i = k) E[log V_k]

where q(z_i > k) = \sum_{j=0}^{n_y} \sum_{t=k+1}^{K} \psi_{i,j,t} and q(z_i = k) = \sum_{j=0}^{n_y} \psi_{i,j,k}.

3.5.1 Coordinate ascent inference algorithm

The lower bound above can be optimized by a coordinate ascent algorithm. The update rules for all terms except q_\lambda(\eta) are presented below. These are direct applications of variational inference for conjugate-exponential graphical models [15].
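The quantities \psi_{i,j,k} and the cluster marginals q(z_i = k), q(z_i > k) defined above drive all of the coordinate ascent updates. A small numpy sketch of their computation (our own helper, with shapes as assumptions):

```python
import numpy as np

def cluster_marginals(phi, theta):
    """psi_{i,j,k} = phi_{i,j} * theta_{i,j,k} (= E[M_i^j Z_i^k]) together with
    q(z_i = k) and q(z_i > k), as defined in the text.

    phi:   (n_x, n_y + 1) matching posteriors, rows sum to 1.
    theta: (n_x, n_y + 1, K) cluster posteriors, theta[i, j] sums to 1 over k.
    """
    psi = phi[:, :, None] * theta
    q_eq = psi.sum(axis=1)                              # q(z_i = k), shape (n_x, K)
    suffix = q_eq[:, ::-1].cumsum(axis=1)[:, ::-1]      # sum over t >= k of q(z_i = t)
    q_gt = suffix - q_eq                                # q(z_i > k)
    return psi, q_eq, q_gt

rng = np.random.default_rng(1)
phi = rng.dirichlet(np.ones(4), size=3)                 # n_x = 3, n_y = 3
theta = rng.dirichlet(np.ones(5), size=(3, 4))          # K = 5
psi, q_eq, q_gt = cluster_marginals(phi, theta)
```

Because phi and theta are both normalized, q(z_i = k) sums to one over k for each i, as required of a cluster posterior.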
We discuss the update rule for q_\lambda(\eta) in Section 3.5.2.

• Update for q_{\gamma_k}(v_k):

\gamma_{k,1} = 1 + \sum_{i=1}^{n_x} \sum_{j=0}^{n_y} \psi_{i,j,k}        \gamma_{k,2} = \alpha + \sum_{i=1}^{n_x} \sum_{j=0}^{n_y} \sum_{t=k+1}^{K} \psi_{i,j,t}

• Update for q_{\theta_{i,j}}(z_i) and q_{\phi_i}(m_i):

\theta_{i,j,k} \propto exp\big( \rho_{i,j,k} + \sum_{t=1}^{k-1} E[log(1 - V_t)] + E[log V_k] \big)

\phi_{i,j} \propto exp\Big( log \pi_{i,j} + \sum_{k=1}^{K} \theta_{i,j,k} \big( \rho_{i,j,k} + \sum_{t=1}^{k-1} E[log(1 - V_t)] + E[log V_k] \big) \Big)

3.5.2 Application of the model to multivariate Gaussians

The previous sections described the model in general terms. In the rest of this section, and in our experiments, we focus on data that is assumed to be distributed as a multivariate Gaussian with unknown mean and covariance matrix. The prior distribution G_0 is then given by the conjugate Gaussian-Wishart distribution. In a classical DP Gaussian mixture model with a Gaussian-Wishart prior, the posterior distribution of the parameters can be computed analytically. Unfortunately, in our model, the coupling of the conditional and marginal distributions in the likelihood makes it difficult to derive analytical formulas for the posterior distribution.
Note that if (X, Y) ~ N(\mu, \Sigma) with \mu = (\mu_X, \mu_Y) and

\Sigma = [ \Sigma_X  \Sigma_{XY} ; \Sigma_{YX}  \Sigma_Y ]

then X ~ N(\mu_X, \Sigma_X), Y ~ N(\mu_Y, \Sigma_Y) and

X | Y = y ~ N( \mu_X + \Sigma_{XY} \Sigma_Y^{-1} (y - \mu_Y), \Sigma_X - \Sigma_{XY} \Sigma_Y^{-1} \Sigma_{YX} )    (9)

Therefore, we introduce an approximating distribution for the datasets which decouples the marginal and conditional distributions as follows:

f_X(x | \mu_X, \Lambda_X) = N(\mu_X, \Sigma = \Lambda_X^{-1})        f_Y(y | \mu_Y, \Lambda_Y) = N(\mu_Y, \Sigma = \Lambda_Y^{-1})

f_{X|Y}(x | y, W, b, \mu_X, \Lambda_X) = N(\mu_X + b - W y, \Sigma = \Lambda_X^{-1})

where W is a p_x x p_y projection matrix and \Lambda is the precision matrix. In this approximation, we assume that the covariance matrices of X and X|Y are the same. In other words, the covariance of X is independent of Y. The matrix W models the linear dependence of X on Y, similar to -\Sigma_{XY} \Sigma_Y^{-1} in (9). The priors for \mu_X, \Lambda_X and \mu_Y, \Lambda_Y are given by Gaussian-Wishart (GW) distributions. A flat improper prior is given to W and b: p_0(W) = 1, p_0(b) = 1 for all W, b. These assumptions lead to a decoupling of the marginal and conditional distributions. Therefore, the distribution q_{\lambda_k}(\eta_k) can now be factorized into two GW distributions and distributions over W and b.
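The conditioning identity in Eq. (9) is easy to sanity check numerically. A small sketch with a hand-checkable two dimensional joint (the function name is ours):

```python
import numpy as np

def gaussian_conditional(mu, Sigma, dx, y):
    """Parameters of X | Y = y in Eq. (9) for (X, Y) ~ N(mu, Sigma);
    the first dx coordinates are X and the rest are Y."""
    mu_x, mu_y = mu[:dx], mu[dx:]
    Sxx, Sxy = Sigma[:dx, :dx], Sigma[:dx, dx:]
    Syx, Syy = Sigma[dx:, :dx], Sigma[dx:, dx:]
    Syy_inv = np.linalg.inv(Syy)
    # Conditional mean and covariance, exactly as in Eq. (9).
    return mu_x + Sxy @ Syy_inv @ (y - mu_y), Sxx - Sxy @ Syy_inv @ Syx

# Hand-checkable example: Var(X) = 2, Var(Y) = 1, Cov(X, Y) = 1,
# so X | Y = 2 has mean 2 and variance 2 - 1 = 1.
mean, cov = gaussian_conditional(np.zeros(2),
                                 np.array([[2.0, 1.0], [1.0, 1.0]]),
                                 1, np.array([2.0]))
```

The decoupled approximation above replaces the data-dependent term \Sigma_{XY} \Sigma_Y^{-1} with a free matrix -W, which is what breaks the coupling between the marginal and conditional factors.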
To avoid cluttering the notation, we omit the subscript k of the specific cluster k:

q*_{\lambda_k}(\eta_k) = GW(\mu_X, \Lambda_X) GW(\mu_Y, \Lambda_Y) g(W) g(b)

Posterior distribution of \mu_Y, \Lambda_Y: The update rules follow the standard posterior distribution for Gaussian-Wishart conjugate priors.

Posterior distribution of \mu_X, \Lambda_X and W, b: Due to the coupling of \mu_X, \Lambda_X with W, we use a coordinate ascent procedure to find the optimal posterior distribution. The posterior distribution of W, b is a singleton discrete distribution g such that g(W*) = 1, g(b*) = 1.

• Update for the posterior distribution of \mu_X, \Lambda_X:

\kappa_X = \kappa_{X0} + n_X        m_X = (1 / \kappa_X) (\kappa_{X0} m_{X0} + n_X \bar{x})

S_X^{-1} = S_{X0}^{-1} + V_X + (\kappa_{X0} n_X / (\kappa_{X0} + n_X)) (\bar{x} - m_{X0})(\bar{x} - m_{X0})^T        \nu_X = \nu_{X0} + n_X

where

n_X = \sum_{i=1}^{n_x} \sum_{j=0}^{n_y} \psi_{i,j,k}

\bar{x} = (1 / n_X) \sum_{i=1}^{n_x} \big( \psi_{i,0,k} x_i + \sum_{j=1}^{n_y} \psi_{i,j,k} (x_i - b + W* y_j) \big)

V_X = \sum_{i=1}^{n_x} \big\{ \psi_{i,0,k} (x_i - \bar{x})(x_i - \bar{x})^T + \sum_{j=1}^{n_y} \psi_{i,j,k} (x_i - b + W* y_j - \bar{x})(x_i - b + W* y_j - \bar{x})^T \big\}

• Update for W*, b*: We find W*, b* that maximize the log likelihood. Taking the derivative with respect to W and b and solving, we get

W* = -\big( \sum_{i=1}^{n_x} \sum_{j=1}^{n_y} \psi_{i,j,k} (x_i - m_X - b) y_j^T \big) \big( \sum_{i=1}^{n_x} \sum_{j=1}^{n_y} \psi_{i,j,k} y_j y_j^T \big)^{-1}

b* = \big( \sum_{i=1}^{n_x} \sum_{j=1}^{n_y} \psi_{i,j,k} (x_i - m_X + W* y_j) \big) / \sum_{i=1}^{n_x} \sum_{j=1}^{n_y} \psi_{i,j,k}

4 Experiments and Results

4.1 Simulated data

We demonstrate the performance of the model in identifying data matchings as well as the cluster membership of datapoints using simulated data.
To generate a simulated dataset, we sample 120 datapoints from a mixture of three 5-dimensional Gaussians with separation coefficient c = 2, leading to well separated mixtures.¹ The covariance matrix was derived from the autocorrelation matrix of a first-order autoregressive process, leading to highly dependent components (\rho = 0.9). From these samples, we use the first 3 dimensions to create 120 datapoints x = [x_1, ..., x_{120}]. The last two dimensions of the first 100 datapoints are used to create y = [y_1, ..., y_{100}] (note that there are no matches for 20 points in x). Hence, the ground truth M matrix is a diagonal 120 x 100 matrix. We selected a large value for the diagonal entries (\tau = 1000) in order to place a strong prior on the correct matchings. Next, for t = 0, ..., 20, we randomly select t entries on each row of M and set them to (\tau / 2) r, where r ~ \chi^2_1. We repeat the process 20 times for each t to compute the mean and standard deviation shown in Figure 1(a) and Figure 1(b). We compare the performance of our model (DPMMLM) with a standard Dirichlet Process Mixture Model (DPMM) in which each component in x is matched based on the highest prior: {(x_i, y_{j*}) | i = 1, ..., 100 and j* = argmax_j M(i, j)}. For all models, the truncation level K is set to 20 and \alpha is 1. Figure 1(a) presents the percentage of correct matchings inferred by DPMMLM and by the highest prior matching. For DPMMLM, a datapoint x_i is matched to the datapoint y_j with the largest posterior probability \phi_{i,j}. With the added noise, DPMMLM can still achieve an accuracy of 50% when the highest prior matching leads to only 25% accuracy. Figures 1(b) and 1(c) show the Normalized Mutual Information (NMI) and Adjusted Rand index [17] for the clusters inferred by the two models compared to the true clusters.
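The data generating procedure above can be sketched as follows. The mixture means here are placeholders rather than the exact 2-separated components with AR(1) covariance used in the paper, and the details of how entries are selected for corruption are our assumption:

```python
import numpy as np

rng = np.random.default_rng(0)
tau, t = 1000.0, 5                      # prior strength; corrupted entries per row

# Placeholder 3-component mixture of 5-D Gaussians (the paper uses 2-separated
# components with a first-order autoregressive covariance, rho = 0.9).
means = np.array([[0.0] * 5, [6.0] * 5, [-6.0] * 5])
labels = rng.integers(0, 3, size=120)
data = means[labels] + rng.standard_normal((120, 5))

x = data[:, :3]                         # first 3 dimensions -> 120 points of x
y = data[:100, 3:]                      # last 2 dimensions  -> 100 points of y

# "Diagonal" 120 x 100 prior matrix with a strong weight on correct matches.
M = np.zeros((120, 100))
M[np.arange(100), np.arange(100)] = tau

# Corrupt t random entries per row with (tau / 2) * r, r ~ chi^2 with 1 dof.
for i in range(120):
    cols = rng.choice(100, size=t, replace=False)
    M[i, cols] = (tau / 2.0) * rng.chisquare(1.0, size=t)
```

The chi-squared noise occasionally produces spurious entries comparable to the true diagonal weight, which is what makes the highest prior matching degrade as t grows.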
As can be seen, while the percentage of correct matchings decreased with the added noise, DPMMLM still achieves a high NMI of 0.8 and an Adjusted Rand index of 0.92. In conclusion, by relying on inferred matchings of points, DPMMLM performs very well in terms of its ability to identify the correct clusters even at high noise levels.

¹ Following [16], a Gaussian mixture is c-separated if for each pair (i, j) of components, ||m_i - m_j||_2^2 >= c^2 D max(\lambda_i^max, \lambda_j^max), where \lambda^max denotes the maximum eigenvalue of the corresponding covariance.

[Figure 1: Evaluation of the results on simulated data. (a) The % of correct matchings. (b) Normalized Mutual Information. (c) Adjusted Rand index.]

4.2 Immune response dataset

[Figure 2: Heatmaps of the clusters inferred for the immune response dataset. (a) DPMMLM (five clusters). (b) DPMM (three clusters).]

We compared human and mouse immune response datasets to identify similar and divergent genes. We selected two experiments that studied immune response to gram negative bacteria.
The first was a time series of human response to Salmonella [18]. Cells were infected with Salmonella and were profiled at 0.5h, 1h, 2h, 3h and 4h. The second looked at mouse response to Yersinia enterocolitica with and without treatment by IFN-\gamma [19]. We used BLASTN to compute the sequence similarity (bit score) between all human and mouse genes. For each species we selected the 500 most varying genes and expanded the gene list to include all matched genes in the other species with a bit score greater than 75. This led to a set of 1476 human and 1967 mouse genes which we compared using our model. The entries of M are the bit scores between human and mouse genes, thresholded at 75. The resulting clusters are presented in Figure 2(a). In that figure, the first five dimensions are human expression values and each gene in human is matched to the mouse gene with the highest posterior. Human genes which are not matched to any mouse gene in their cluster have a blank line on the mouse side of the figure. The algorithm identified five different clusters. Clusters 1, 4 and 5 display a similar expression pattern in human and mouse, with genes either up or down regulated in response to the infection. Genes in cluster 2 differ between the two species, being mostly down regulated in humans while slightly upregulated in mouse. Human genes in cluster 3 also differ from their mouse orthologs.
While they are strongly upregulated in humans, the corresponding mouse genes do not change much.

Table 1: The GO enrichment result for cluster 1 identified by DPMMLM.

P value        Corrected P   GO term description
2.86216e-10    <0.001        regulation of apoptosis
4.97408e-10    <0.001        regulation of cell death
7.82427e-10    <0.001        protein binding
4.14320e-10    <0.001        regulation of programmed cell death
4.49332e-09    <0.001        positive regulation of cellular process
4.77653e-09    <0.001        positive regulation of biological process
8.27313e-09    <0.001        response to chemical stimulus
1.17013e-07    0.001         cytoplasm
1.28299e-07    0.001         response to stress
2.20104e-07    0.001         cell proliferation
5.06685e-07    0.001         response to stimulus
6.15795e-07    0.001         negative regulation of biological process
7.70651e-07    0.001         cellular process
7.78266e-07    0.002         regulation of localization
1.09778e-06    0.002         response to organic substance
1.42704e-06    0.002         collagen metabolic process
1.91735e-06    0.003         negative regulation of cellular process
3.23244e-06    0.005         multicellular organismal macromolecule metabolic process
3.39901e-06    0.005         interspecies interaction
3.66178e-06    0.005         negative regulation of apoptosis

We used the Gene Ontology (GO, www.geneontology.org) to calculate the enrichment of functional categories in each cluster based on the hypergeometric distribution. Genes in cluster 1 (Table 1) are associated with immune and stress responses. Interestingly, the most significant category for this cluster is "regulation of apoptosis" (corrected p-value <0.001). Indeed, both Salmonella and Yersinia are known to induce apoptosis in host cells [20]. When clustering the two datasets independently the significance of this category is greatly reduced, indicating that accurate matchings can lead to better identification of core pathways (see Appendix).
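The hypergeometric enrichment score behind Table 1 can be sketched with the standard tail formula. `go_pvalue` below is an illustrative helper rather than the authors' code, and the corrected p-values additionally involve a multiple testing correction not shown here:

```python
from math import comb

def go_pvalue(N, K, n, k):
    """Hypergeometric tail P(X >= k): the chance of seeing at least k annotated
    genes in a cluster of size n, when K of the N background genes carry the
    annotation. (Illustrative helper, not the authors' code.)"""
    return sum(comb(K, i) * comb(N - K, n - i)
               for i in range(k, min(K, n) + 1)) / comb(N, n)

# Toy example: 10 background genes, 4 annotated, cluster of 5 containing 3.
p = go_pvalue(N=10, K=4, n=5, k=3)   # = 11/42, about 0.262
```

A small p-value means the cluster contains more annotated genes than expected by chance, which is the sense in which a GO category is "enriched".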
Cluster 4 contains the most coherent set of upregulated genes across the two species. One of the top GO categories for this cluster is "response to molecule of bacterial origin" (corrected p-value <0.001), which is the most accurate description of the condition tested. See the Appendix for complete GO tables for all clusters. In contrast to the clusters in which mouse and human genes are similarly expressed, cluster 3 genes are strongly upregulated in human cells while not changing in mouse. This cluster is enriched for ribosomal proteins (corrected p-value <0.001). This may indicate different strategies utilized by the bacteria in the two experiments. There are studies showing that pathogens can upregulate the synthesis of ribosomal genes (which are required for translation) [21], whereas other studies indicate that ribosomal genes may not change much, or may even be reduced, following infection [22]. The results of our analysis indicate that while ribosomal genes are upregulated following Salmonella infection in human cells, they are not activated following Yersinia infection in mouse.

We have also analyzed the matchings obtained using sequence data alone (the prior) and by combining sequence and expression data (the posterior) using our method. The top posterior gene is the same as the top prior gene in most cases (905 of the 1476 human genes). However, there are several cases in which the prior and posterior differ. 293 human genes are not matched to any mouse gene in the cluster they are assigned to, indicating that they are expressed in a species dependent manner. Additionally, for 278 human genes the top posterior and top prior mouse genes differ. To test whether these differences inferred by the algorithm are biologically meaningful we compared our Dirichlet method to a method that uses deterministic assignments, as was done in the past. Using such assignments the algorithm identified only three clusters, as shown in Figure 2(b).
None of these clusters looked homogeneous across species.

5 Conclusions

We have developed a new model for simultaneously clustering and matching genes across species. The model uses a Dirichlet process to infer the number of clusters. We developed an efficient variational inference method that scales to large datasets with almost 2000 datapoints. We have also demonstrated the power of our method on simulated data and an immune response dataset. While the method was presented in the context of expression data, it is general and can be used for other matching tasks in which a prior can be obtained. For example, when trying to determine a caption for images extracted from webpages, a prior can be obtained by relying on the distance between the image and the text on the page. Next, clustering can be employed to utilize the abundance of extracted images and improve the matching outcome.

Acknowledgments

We thank the anonymous reviewers for constructive and insightful comments. This work is supported in part by NIH grant 1RO1 GM085022 and NSF grants DBI-0965316 and CAREER-0448453 to Z.B.J.

References

[1] M. Kellis, N. Patterson, M. Endrizzi, B. Birren, and E. S. Lander. Sequencing and comparison of yeast species to identify genes and regulatory elements. Nature, 423:241-254, May 2003.

[2] L. J. Jensen, T. S. Jensen, U. de Lichtenberg, S. Brunak, and P. Bork. Co-evolution of transcriptional and post-translational cell-cycle regulation. Nature, 443:594-597, Oct 2006.

[3] G. Lelandais et al. Genome adaptation to chemical stress: clues from comparative transcriptomics in Saccharomyces cerevisiae and Candida glabrata. Genome Biol., 9:R164, 2008.

[4] G. Quon, Y. W. Teh, E. Chan, M. Brudno, T. Hughes, and Q. D. Morris. A mixture model for the evolution of gene expression in non-homogeneous datasets. In Advances in Neural Information Processing Systems, volume 21, 2009.

[5] J. M. Stuart, E.
Segal, D. Koller, and S. K. Kim. A gene-coexpression network for global discovery of conserved genetic modules. Science, 302:249–255, Oct 2003.

[6] S. Bergmann, J. Ihmels, and N. Barkai. Similarities and differences in genome-wide expression data of six organisms. PLoS Biol., 2(1):e9, 2003.

[7] Y. Lu, R. Rosenfeld, and Z. Bar-Joseph. Identifying cycling genes by combining sequence homology and expression data. Bioinformatics, 22:e314–322, Jul 2006.

[8] Y. Lu, R. Rosenfeld, G. J. Nau, and Z. Bar-Joseph. Cross species expression analysis of innate immune response. J. Comput. Biol., 17:253–268, Mar 2010.

[9] R. Sharan et al. Conserved patterns of protein interaction in multiple species. Proc. Natl. Acad. Sci. U.S.A., 102:1974–1979, Feb 2005.

[10] T. S. Ferguson. A Bayesian analysis of some nonparametric problems. The Annals of Statistics, 1(2):209–230, 1973.

[11] J. Sethuraman. A constructive definition of Dirichlet priors. Statistica Sinica, 4:639–650, 1994.

[12] M. I. Jordan, Z. Ghahramani, T. S. Jaakkola, and L. K. Saul. An introduction to variational methods for graphical models. Machine Learning, 37(2):183–233, Nov 1999.

[13] M. J. Wainwright and M. I. Jordan. Graphical models, exponential families, and variational inference. Found. Trends Mach. Learn., 1(1-2):1–305, 2008.

[14] H. Ishwaran and L. F. James. Gibbs sampling methods for stick-breaking priors. Journal of the American Statistical Association, pages 161–173, Mar 2001.

[15] Z. Ghahramani and M. J. Beal. Propagation algorithms for variational Bayesian learning. In Advances in Neural Information Processing Systems 13, pages 507–513. MIT Press, 2001.

[16] S. Dasgupta. Learning mixtures of Gaussians. In FOCS '99: Proceedings of the 40th Annual Symposium on Foundations of Computer Science, Washington, DC, USA, 1999.

[17] M.
Meila. Comparing clusterings by the variation of information. In Learning Theory and Kernel Machines: 16th Annual Conference on Learning Theory and 7th Kernel Workshop, COLT/Kernel 2003, Washington, DC, USA, August 24–27, 2003, page 173. Springer-Verlag, 2003.

[18] C. S. Detweiler et al. Host microarray analysis reveals a role for the Salmonella response regulator phoP in human macrophage cell death. Proc. Natl. Acad. Sci. U.S.A., 98:5850–5855, May 2001.

[19] K. van Erp et al. Role of strain differences on host resistance and the transcriptional response of macrophages to infection with Yersinia enterocolitica. Physiol. Genomics, 25:75–84, 2006.

[20] D. M. Monack, B. Raupach, et al. Salmonella typhimurium invasion induces apoptosis in infected macrophages. Proc. Natl. Acad. Sci. U.S.A., 93:9833–9838, Sep 1996.

[21] O. O. Zharskaia et al. [Activation of transcription of ribosome genes following human embryo fibroblast infection with cytomegalovirus in vitro]. Tsitologiia, 45:690–701, 2003.

[22] J. W. Gow, S. Hagan, P. Herzyk, C. Cannon, P. O. Behan, and A. Chaudhuri. A gene signature for post-infectious chronic fatigue syndrome. BMC Med. Genomics, 2:38, 2009.