{"title": "Hierarchical Dirichlet Processes with Random Effects", "book": "Advances in Neural Information Processing Systems", "page_first": 697, "page_last": 704, "abstract": null, "full_text": "Hierarchical Dirichlet Processes with Random Effects

Seyoung Kim
Department of Computer Science
University of California, Irvine
Irvine, CA 92697-3435
sykim@ics.uci.edu

Padhraic Smyth
Department of Computer Science
University of California, Irvine
Irvine, CA 92697-3435
smyth@ics.uci.edu

Abstract

Data sets involving multiple groups with shared characteristics frequently arise in practice. In this paper we extend hierarchical Dirichlet processes to model such data. Each group is assumed to be generated from a template mixture model with group-level variability in both the mixing proportions and the component parameters. Variability in mixing proportions across groups is handled using hierarchical Dirichlet processes, which also allow automatic determination of the number of components. In addition, each group is allowed to have its own component parameters, drawn from a prior described by a template mixture model. This group-level variability in the component parameters is handled using a random effects model. We present a Markov chain Monte Carlo (MCMC) sampling algorithm to estimate model parameters and demonstrate the method by applying it to the problem of modeling spatial brain activation patterns across multiple images collected via functional magnetic resonance imaging (fMRI).

1 Introduction

Hierarchical Dirichlet processes (DPs) (Teh et al., 2006) provide a flexible framework for probabilistic modeling when data are observed in a grouped fashion and each group can be thought of as being generated from a mixture model. In hierarchical DPs all, or a subset, of the mixture components are shared by different groups, and the number of such components is inferred from the data using a DP prior.
Variability across groups is modeled by allowing different mixing proportions for different groups.

In this paper we focus on the problem of modeling systematic variation in the shared mixture component parameters, and not just in the mixing proportions. We will use the problem of modeling spatial fMRI activation across multiple brain images as a motivating application, where the images are obtained from one or more subjects performing the same cognitive tasks. Figure 1 illustrates the basic idea of our proposed model. We assume that there is an unknown true template for the mixture component parameters, and that the mixture components for each group are noisy realizations of the template components. For our application, groups and data points correspond to images and pixels, respectively. Given grouped data (e.g., a set of images) we are interested in learning both the overall template model and the random variation relative to the template for each group. For the fMRI application, we model the images as mixtures of activation patterns, assigning a mixture component to each spatial activation cluster in an image. As shown in Figure 1, our goal is to extract activation patterns that are common across multiple images, while allowing for variation in fMRI signal intensity and activation location in individual images. In our proposed approach, the amount of variation (the random effects) from the overall true component parameters is modeled as coming from a prior distribution on group-level component parameters (Gelman et al., 2004). By combining hierarchical DPs with a random effects model we let both the mixing proportions and the mixture component parameters adapt to the data in each group.
Although we focus on image data in this paper, the proposed approach is applicable to more general problems of modeling group-level random variation with mixture models.

[Figure 1: Illustration of group-level variations from the template model. A template mixture model generates group-level mixture models, which in turn generate fMRI brain activation.]

Model                                  Group-level mixture components
Hierarchical DPs                       θ_a × m_a, θ_b × m_b
Transformed DPs                        θ_a + Δ_a1, ..., θ_a + Δ_am_a, θ_b + Δ_b1, ..., θ_b + Δ_bm_b
Hierarchical DPs with random effects   (θ_a + Δ_a) × m_a, (θ_b + Δ_b) × m_b

Table 1: Group-level mixture component parameters for hierarchical DPs, transformed DPs, and hierarchical DPs with random effects as proposed in this paper.

Hierarchical DPs and transformed DPs (Sudderth et al., 2005) both address a similar problem of modeling groups of data using mixture models with mixture components shared across groups. Table 1 compares the basic ideas underlying these two models with the model we propose in this paper. Given a template mixture of two components with parameters θ_a and θ_b, in hierarchical DPs a mixture model for each group can have m_a and m_b exact copies (commonly known as tables in the Chinese restaurant process representation) of each of the two components in the template; thus, there is no notion of random variation in component parameters across groups. In transformed DPs, each of the copies of θ_a and θ_b receives a transformation parameter Δ_a1, ..., Δ_am_a and Δ_b1, ..., Δ_bm_b. This is not suitable for modeling the type of group variation illustrated in Figure 1 because there is no direct way to enforce Δ_a1 = ... = Δ_am_a and Δ_b1 = ... = Δ_bm_b to obtain the single per-group offsets Δ_a and Δ_b used in our proposed model.

In this general context the model we propose here can be viewed as closely related to both hierarchical DPs and transformed DPs, but applicable to quite different types of problems in practice: it is an intermediate between the highly constrained variation allowed by the hierarchical DP and the relatively unconstrained variation present in the computer vision scenes to which the transformed DP has been applied (Sudderth et al., 2005).

From an applications viewpoint the use of DPs for modeling multiple fMRI brain images is novel and shows considerable promise as a new tool for analyzing such data. The majority of existing statistical work on fMRI analysis is based on voxel-by-voxel hypothesis testing, with relatively little work on modeling the spatial aspect of the problem. One exception is the approach of Penny and Friston (2003), who proposed a probabilistic mixture model for spatial activation modeling and demonstrated its advantages over voxel-wise analysis. The application of our proposed model to fMRI data can be viewed as a generalization of Penny and Friston's work in three aspects: (a) it allows for analysis of multiple images rather than a single image, (b) it learns common activation clusters and systematic variation in activation across these images, and (c) it automatically learns the number of components in the model in a data-driven fashion.

2 Models

2.1 Dirichlet process mixture models

A Dirichlet process DP(α_0, G) with a concentration parameter α_0 > 0 and a base measure G can be used as a nonparametric prior distribution on the mixing proportion parameters of a mixture model when the number of components is unknown a priori (Rasmussen, 2000).
The generative process for a mixture of Gaussian distributions with component means µ_k and DP prior DP(α_0, G) can be written, using a stick-breaking construction (Sethuraman, 1994), as:

    π | α_0 ∼ Stick(α_0),    z_i | π ∼ π,    µ_k | G ∼ N_G(µ_0, ψ_0²),    y_i | z_i, (µ_k)_{k=1}^∞, σ² ∼ N(µ_{z_i}, σ²),

where y_i, i = 1, ..., N are the observed data and z_i is a component label for y_i.

[Figure 2: Plate diagrams for (a) DP mixtures, (b) hierarchical DPs, and (c) hierarchical DPs with random effects.]

It can be shown that the labels z_i have the following clustering property:

    z_i | z_1, ..., z_{i-1}, α_0 ∼ Σ_{k=1}^K [ n_k^{-i} / (i - 1 + α_0) ] δ_k + [ α_0 / (i - 1 + α_0) ] δ_{k_new},

where n_k^{-i} represents the number of labels z_{i′}, i′ ≠ i, assigned to component k. The probability that z_i is assigned to a new component is proportional to α_0.
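The Stick(α_0) prior used above can be illustrated with a short numerical sketch. This is a truncated simulation, not code from the paper; the function name and truncation level are ours:

```python
import random

def stick_breaking(alpha0, num_weights, rng=None):
    """Draw the first `num_weights` mixing proportions pi_k ~ Stick(alpha0).

    Each beta_k ~ Beta(1, alpha0) takes a fraction of the remaining stick:
        pi_k = beta_k * prod_{l<k} (1 - beta_l).
    """
    rng = rng or random.Random(0)
    weights = []
    remaining = 1.0  # length of the stick not yet assigned
    for _ in range(num_weights):
        beta_k = rng.betavariate(1.0, alpha0)
        weights.append(beta_k * remaining)
        remaining *= 1.0 - beta_k
    return weights

pi = stick_breaking(alpha0=2.0, num_weights=100)
# The truncated weights sum to just under 1; the leftover mass belongs
# to the infinitely many components beyond the truncation level.
print(sum(pi))
```

Smaller α_0 concentrates mass on the first few sticks, which is why α_0 also controls the expected number of occupied components.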
Note that a component with more observations already assigned to it has a higher probability of attracting the next observation.

2.2 Hierarchical Dirichlet processes

When multiple groups of data are present and each group can be modeled as a mixture, it is often useful to let different groups share mixture components. In hierarchical DPs (Teh et al., 2006) components are shared by different groups with varying mixing proportions for each group, and the number of components in the model can be inferred from the data.

Let y_ji be the ith data point (i = 1, ..., N) in group j (j = 1, ..., J), β the global mixing proportions, π_j the mixing proportions for group j, and α_0, γ, H the hyperparameters for the DP. Then the hierarchical DP can be written as follows, using a stick-breaking construction:

    β | γ ∼ Stick(γ),    π_j | α_0, β ∼ DP(α_0, β),    z_ji | π_j ∼ π_j,
    µ_k | H ∼ N_H(µ_0, ψ_0²),    y_ji | z_ji, (µ_k)_{k=1}^∞, σ² ∼ N(µ_{z_ji}, σ²).    (1)

The plate diagram in Figure 2(b) illustrates the generative process of this model. The mixture components described by the µ_k's can be shared across the J groups.

The hierarchical DP has clustering properties similar to those of DP mixtures, i.e.,

    p(h_ji | h^{-ji}, α_0) ∼ Σ_{t=1}^{T_j} [ n_jt^{-i} / (n_j - 1 + α_0) ] δ_t + [ α_0 / (n_j - 1 + α_0) ] δ_{t_new}    (2)

    p(l_jt | l^{-jt}, γ) ∼ Σ_{k=1}^K [ m_k^{-t} / (Σ_u m_u - 1 + γ) ] δ_k + [ γ / (Σ_u m_u - 1 + γ) ] δ_{k_new},    (3)

where h_ji represents the mapping of each data item y_ji to one of T_j local clusters within group j, and l_jt maps the tth local cluster in group j to one of K global clusters shared by all of the J groups. The probability that a new local cluster is generated within group j is proportional to α_0.
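The rich-get-richer behaviour of these predictive rules can be simulated directly. The sketch below runs the single-restaurant case (the rule for z_i above, and the within-group rule of Equation (2)); the two-level version would add a second process of the same form over local clusters, as in Equation (3). Names and values are ours:

```python
import random

def crp_assignments(n, alpha0, rng=None):
    """Sequentially assign n items to clusters under the Chinese
    restaurant process: item i joins existing cluster k with probability
    n_k / (i - 1 + alpha0), and opens a new cluster with probability
    alpha0 / (i - 1 + alpha0)."""
    rng = rng or random.Random(0)
    counts = []   # counts[k] = number of items currently in cluster k
    labels = []
    for i in range(1, n + 1):
        r = rng.random() * (i - 1 + alpha0)
        acc = 0.0
        for k, nk in enumerate(counts):
            acc += nk
            if r < acc:           # join cluster k w.p. n_k / (i-1+alpha0)
                counts[k] += 1
                labels.append(k)
                break
        else:                     # open a new cluster w.p. alpha0 / (i-1+alpha0)
            counts.append(1)
            labels.append(len(counts) - 1)
    return labels, counts

labels, counts = crp_assignments(n=500, alpha0=1.0)
```

Large clusters keep growing, so the number of occupied clusters grows only logarithmically in n.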
This new cluster is generated according to Equation (3). Notice that more than one local cluster in group j can be linked to the same global cluster. It is the assignment of data items to the K global clusters via local cluster labels that is typically of interest.

3 Hierarchical Dirichlet processes with random effects

We now propose an extension of the standard hierarchical DP to a version that includes random effects. We first develop the model for the case of Gaussian density components, and later in the paper apply it to the specific problem of modeling activation patterns in fMRI brain images. We take µ_k | H ∼ N_H(µ_0, ψ_0²) and y_ji | z_ji, (µ_k)_{k=1}^∞, σ² ∼ N(µ_{z_ji}, σ²) in Equation (1) and add random effects as follows:

    µ_k | H ∼ N_H(µ_0, ψ_0²),    τ_k² | R ∼ Inv-χ²_R(v_0, s_0²),
    u_jk | µ_k, τ_k² ∼ N(µ_k, τ_k²),    y_ji | z_ji, (u_jk)_{k=1}^∞ ∼ N(u_{jz_ji}, σ²).    (4)

Each group j has its own component mean u_jk for the kth component, and these group-level parameters come from a common prior distribution N(µ_k, τ_k²). Thus, µ_k can be viewed as a template, and u_jk as a noisy observation of the template for group j with variance τ_k². The random effects parameters u_jk are generated once per group and shared by all local clusters in group j that are assigned to the same global cluster k.

For inference we use an MCMC sampling scheme that is based on the clustering property given in Equations (2) and (3).
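A forward-sampling sketch of the random effects layer in Equation (4) for a single fixed component k (values and helper names are illustrative, not from the paper). It also checks numerically that integrating out u_jk yields the marginal N(y | µ_k, τ_k² + σ²), which is the analytic form exploited during inference:

```python
import random

def sample_group_means(mu_k, tau2_k, num_groups, rng):
    """Draw per-group random-effects means u_jk ~ N(mu_k, tau2_k)
    around the template component mean mu_k."""
    return [rng.gauss(mu_k, tau2_k ** 0.5) for _ in range(num_groups)]

rng = random.Random(0)
mu_k, tau2_k, sigma2 = 5.0, 0.25, 1.0
u = sample_group_means(mu_k, tau2_k, num_groups=2000, rng=rng)

# One observation per group, y ~ N(u_jk, sigma2). Marginally over u_jk,
# y ~ N(mu_k, tau2_k + sigma2): the mean stays at the template and the
# variances add.
y = [rng.gauss(u_jk, sigma2 ** 0.5) for u_jk in u]
mean_y = sum(y) / len(y)
var_y = sum((v - mean_y) ** 2 for v in y) / len(y)
```

With these settings the empirical moments come out near µ_k = 5 and τ_k² + σ² = 1.25, matching the closed-form marginal.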
In each iteration we sample the labels h = {h_ji for all j, i}, l = {l_jt for all j, t} and the component parameters µ = {µ_k for all k}, τ² = {τ_k² for all k}, u = {u_jk for all k, j} alternately.

We sample the h_ji's using the following conditional distribution:

    p(h_ji = t | h^{-ji}, u, µ, τ², y) ∝ { n_jt^{-i} p(y_ji | u_jk, σ²)           if t was used,
                                         { α_0 p(y_ji | h^{-ji}, u, µ, τ², γ)    if t = t_new,

where

    p(y_ji | h^{-ji}, u, µ, τ², γ)
      = Σ_{k∈A} [ m_k / (Σ_k m_k + γ) ] p(y_ji | u_jk)                                                         (5a)
      + Σ_{k∈B} [ m_k / (Σ_k m_k + γ) ] ∫ p(y_ji | u_jk) p(u_jk | µ_k, τ_k²) du_jk                             (5b)
      + [ γ / (Σ_k m_k + γ) ] ∫∫∫ p(y_ji | u_jk) p(u_jk | µ_k, τ_k²) N_H(µ_0, ψ_0²) Inv-χ²_R(v_0, s_0²) du_jk dµ_k dτ_k².    (5c)

In Equation (5a) the summation is over the components in A = {k | some h_ji′ for i′ ≠ i is assigned to k}, representing global clusters that already have some local clusters in group j assigned to them. In this case, since u_jk is already known, we can simply compute the likelihood p(y_ji | u_jk). In Equation (5b) the summation is over B = {k | no h_ji′ for i′ ≠ i is assigned to k}, representing global clusters that have not yet been assigned in group j. For conjugate priors we can integrate over the unknown random effects parameter u_jk to compute the likelihood using N(y_ji | µ_k, τ_k² + σ²) and sample u_jk from the posterior distribution p(u_jk | µ_k, τ_k², y_ji). Equation (5c) models the case where a new global component is generated. This integral cannot be evaluated analytically, so we approximate it by sampling new values for µ_k, τ_k², and u_jk from the prior distributions and evaluating p(y_ji | u_jk) given these new values for the parameters (Neal, 1998).

Samples for the l_jt's can be obtained from the conditional distribution

    p(l_jt = k | l^{-jt}, u, µ, τ², y) ∝
      { m_k^{-jt} Π_{i: h_ji = t} p(y_ji | u_jk, σ²)                                                            if k was used in group j,
      { m_k^{-jt} ∫ Π_{i: h_ji = t} p(y_ji | u_jk, σ²) p(u_jk | µ_k, τ_k²) du_jk                                if k is new in group j,
      { γ ∫∫∫ Π_{i: h_ji = t} p(y_ji | u_jk) p(u_jk | µ_k, τ_k²) N_H(µ_0, ψ_0²) Inv-χ²_R(v_0, s_0²) du_jk dµ_k dτ_k²    if k is a new component.    (6)

As in the sampling of the h_ji's, if k is new in group j we can evaluate the integral analytically and sample u_jk from the posterior distribution. If k is a new component we approximate the integral by sampling new values for µ_k, τ_k², and u_jk from the prior and evaluating the likelihood.

[Figure 3: Histogram for simulated data with mixture density estimates overlaid.]

Given h and l we can update the component parameters µ, τ² and u using standard Gibbs sampling for a normal hierarchical model (Gelman et al., 2004). In practice, this Markov chain can mix poorly and get stuck in local maxima where the labels for two group-level components are swapped relative to the same two components in the template.
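A label-swapping Metropolis-Hastings move can escape such modes. The sketch below uses plain Gaussian likelihoods and made-up data, and for brevity omits the prior terms on u_jk that a full acceptance ratio would include; since the swap proposal is symmetric, the ratio reduces to a likelihood ratio here:

```python
import math, random

def log_lik(data_by_comp, means, sigma2):
    """Log-likelihood of the data under the current component means.
    data_by_comp[k] holds the observations assigned to component k."""
    ll = 0.0
    for k, ys in enumerate(data_by_comp):
        for y in ys:
            ll += (-0.5 * math.log(2 * math.pi * sigma2)
                   - (y - means[k]) ** 2 / (2 * sigma2))
    return ll

def swap_move(data_by_comp, means, sigma2, k1, k2, rng):
    """Propose swapping the group-level means of components k1 and k2
    and accept with the Metropolis-Hastings rule."""
    cur = log_lik(data_by_comp, means, sigma2)
    prop_means = list(means)
    prop_means[k1], prop_means[k2] = prop_means[k2], prop_means[k1]
    prop = log_lik(data_by_comp, prop_means, sigma2)
    if math.log(rng.random()) < prop - cur:   # accept w.p. min(1, ratio)
        return prop_means, True
    return means, False

# A mislabeled state: data near 5 is assigned to the component with mean 0
# and vice versa, so the swap is almost surely accepted.
rng = random.Random(1)
data = [[4.9, 5.1, 5.0], [0.1, -0.1, 0.0]]
means, accepted = swap_move(data, [0.0, 5.0], sigma2=1.0, k1=0, k2=1, rng=rng)
```

When the labels are already in the correct correspondence, the same move is overwhelmingly rejected, so it leaves a well-mixed chain essentially undisturbed.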
To address this problem and restore the correct correspondence between template components and group-level components, we propose a move that swaps the labels for two group-level components at the end of each sampling iteration and accepts the move based on a Metropolis-Hastings acceptance rule.

To illustrate the proposed model we simulated data from a mixture of one-dimensional Gaussian densities with known parameters and tested whether the sampling algorithm can recover the parameters from the data. From a template mixture model with three mixture components we generated 10 group-level mixture models by adding random effects, in the form of mean shifts sampled from N(0, 1), to the template means. Using varying mixing proportions for each group we generated 200 samples from each of the 10 mixture models. Histograms of the samples in eight of the groups are shown in Figure 3, with the estimated models after 1000 iterations of the MCMC algorithm overlaid. We can see that the sampling algorithm was able to learn the original model successfully despite the variability in both the component means and the mixing proportions of the mixture model.

4 A model for fMRI activation surfaces

We now apply the general framework of the hierarchical DP with random effects to the problem of detecting and characterizing spatial activation patterns in fMRI brain images. Underlying our approach is the assumption that there is an unobserved true spatial activation pattern in a subject's brain given a particular stimulus, and that multiple activation images for this individual collected over different fMRI sessions are realizations of the true activation image, with variability in the activation pattern due to various sources.
Our goal is to infer the unknown true activation from multiple such activation images.

We model each activation image using a mixture of experts model, with a component expert assigned to each local activation cluster (Rasmussen and Ghahramani, 2002). By introducing a hierarchical DP into this model we allow activation clusters to be shared across images, inferring the number of such clusters from the data. In addition, the random effects component can be incorporated to allow activation centers to be slightly shifted in pixel location or in peak intensity. These types of variation are common in multi-image fMRI experiments, due to a variety of factors such as head motion and variation in the physiological and cognitive states of the subject. In what follows we focus on 2-dimensional \"slices\" rather than 3-dimensional voxel images; in principle the same type of model could be developed for the 3-dimensional case.

We briefly discuss the mixture of experts model below (Kim et al., 2006). Assuming the β values y_i, i = 1, ..., N are conditionally independent of each other given the voxel position x_i = (x_i1, x_i2) and the model parameters, we model the activation y_i at voxel x_i as a mixture of experts:

    p(y_i | x_i, θ) = Σ_{c∈C} p(y_i | c, x_i) P(c | x_i),    (7)

where C = {c_bg, c_m, m = 1, ..., M - 1} is a set of M expert component labels for the background c_bg and the M - 1 activation components c_m. The first term on the right-hand side of Equation (7) defines the expert for a given component. We model the expert for an activation component as a Gaussian-shaped surface centered at b_m with width Σ_m and height h_m as follows:

    y_i = h_m exp( -(x_i - b_m)′ (Σ_m)^{-1} (x_i - b_m) ) + ε,    (8)

where ε is an additive noise term distributed as N(0, σ_act²). The background component is modeled as y_i = µ + ε, having a constant activation level µ with additive noise distributed as N(0, σ_bg²).

[Figure 4: Results from eight runs for Subject 2 at Stanford. (a) Raw images for a cross section of the right precentral gyrus and surrounding area. Activation components estimated from the images using (b) DP mixtures, (c) hierarchical DPs, and (d) hierarchical DPs with random effects.]

The second term in Equation (7) is known as a gate function in the mixture of experts framework; it decides which expert should be used to predict the activation level at position x_i. Using Bayes' rule we write this term as P(c | x_i) = p(x_i | c) π_c / (Σ_{c∈C} p(x_i | c) π_c), where π_c is a class prior probability P(c). p(x_i | c) is defined as follows. For activation components, p(x_i | c_m) is a normal density with mean b_m and covariance Σ_m; b_m and Σ_m are shared with the Gaussian surface model for the experts in Equation (8). This implies that the probability of activating the mth expert is highest at the center of the activation and gradually decays as x_i moves away from the center. p(x_i | c_bg) for the background component is modeled as a uniform distribution of 1/N over all positions in the brain. If x_i is not close to the center of any activation, the gate function selects the background expert for the voxel.

We place a hierarchical DP prior on π_c, and let the location parameters b_m and the height parameters h_m vary in individual images according to normal prior distributions with variances Ψ_bm and ψ²_hm, using a random effects model. We define the prior distributions for Ψ_bm and ψ²_hm as half-normal distributions with 0 mean and a variance as suggested by Gelman (2006).
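A small numeric sketch of the expert surface in Equation (8), with the noise term ε omitted and all parameter values our own illustrative choices:

```python
import math

def activation_surface(x, b, Sigma_inv, h):
    """Noise-free expected activation at 2-D voxel x for an expert with
    center b, inverse width matrix Sigma_inv, and height h:
        h * exp(-(x - b)' Sigma^{-1} (x - b))."""
    d = (x[0] - b[0], x[1] - b[1])
    quad = (d[0] * (Sigma_inv[0][0] * d[0] + Sigma_inv[0][1] * d[1])
            + d[1] * (Sigma_inv[1][0] * d[0] + Sigma_inv[1][1] * d[1]))
    return h * math.exp(-quad)

b, h = (10.0, 12.0), 3.0
Sigma_inv = [[0.5, 0.0], [0.0, 0.5]]  # isotropic width, for illustration

peak = activation_surface((10.0, 12.0), b, Sigma_inv, h)  # at the center
off = activation_surface((13.0, 12.0), b, Sigma_inv, h)   # 3 voxels away
```

The surface attains its maximum h_m at b_m and decays with the Mahalanobis distance from the center, which is what lets the same b_m and Σ_m double as the gate density's mean and covariance.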
Since the surface model for the activation component is highly non-linear, without conjugate prior distributions it is not possible to evaluate the integrals in Equations (5b)-(5c) and (6) analytically in the sampling algorithm. We rely on an approximation of these integrals, sampling new values for b_m and h_m from their priors and new values for the image-specific random effects parameters from N(b_m, Ψ_bm) and N(h_m, ψ²_hm), and evaluating the likelihood of the data given these new values for the unknown parameters.

5 Experimental results on fMRI data

We demonstrate the performance of the model and inference algorithm described above using fMRI data collected from three subjects (referred to as Subjects 1, 2 and 3) performing the same sensorimotor task at two different fMRI scanners (Stanford and Duke). Each subject was scanned during eight separate fMRI experiments (\"runs\"), and for each run a β-map (a voxel image that summarizes the brain activation) was produced using standard fMRI preprocessing.

In this experiment we analyze a 2D cross-section of the right precentral gyrus, a brain region that is known to be activated by this sensorimotor task. We fit our model to each set of eight β-maps for each of the subjects at each scanner, and compare the results with those obtained from the hierarchical DP without random effects. We also fit standard DP mixtures to individual images as a baseline, using Algorithm 7 from Neal (1998) to sample from the model. The concentration parameters for the DP priors in all three models were given a Gamma(1.5, 1) prior distribution and sampled from the posterior as described in Teh et al. (2006).
For all of the models the MCMC sampling algorithm was run for 3000 iterations.

[Figure 5: Histogram of the number of components over the last 1000 iterations (Subject 2 at Stanford). (a) DP mixture, (b) hierarchical DP, and (c) hierarchical DP with random effects.]

Scanner    Subject      Hierarchical DP          Hierarchical DP with random effects
                        Avg. logP   Std. dev.    Avg. logP   Std. dev.
Stanford   Subject 1    -1142.6     21.8         -1085.3     12.6
           Subject 2    -1260.9     32.1         -1082.8     28.7
           Subject 3    -1084.1     11.3         -1040.9     13.5
Duke       Subject 1    -1154.9     12.5         -1166.9     13.1
           Subject 2     -677.9     12.2          -559.9     15.8
           Subject 3    -1175.6     13.6         -1086.8     13.2

Table 2: Predictive logP scores of test images averaged over eight cross-validation runs. The simulation errors are shown as standard deviations.

Figure 4(a) shows β-maps from eight fMRI runs of Subject 2 at Stanford. From the eight images one can see three primary activation bumps, subsets of which appear in different images with variability in location and intensity. Figures 4(b)-(d) each show a sample from the model learned on the data in Figure 4(a): Figure 4(b) for DP mixtures, Figure 4(c) for hierarchical DPs, and Figure 4(d) for hierarchical DPs with random effects. The sampled activation components are overlaid as ellipses drawn at one standard deviation of the width parameters Σ_m. The thickness of an ellipse indicates the estimated height h_m of the bump. In Figures 4(b) and (c), ellipses for activation components shared across images are drawn in the same color.

The DP mixtures shown in Figure 4(b) seem to overfit with many bumps and show relatively poor generalization capability because the model cannot borrow strength from other similar images.
The hierarchical DP in Figure 4(c) is not flexible enough to account for bumps that are shared across images but that have variability in their parameters. By using one fixed set of component parameters shared across images, the hierarchical DP is too constrained and is unable to detect the more subtle features of individual images. The random effects model finds the three main bumps plus a few more bumps with lower intensity for the background. Thus, in terms of generalization, the model with random effects provides a good trade-off between the relatively unconstrained DP mixtures and the overly constrained hierarchical DPs. Histograms of the number of components (every 10th sample over the last 1000 iterations) for the three different models are shown in Figure 5.

We also perform a leave-one-image-out cross-validation to compare the predictive performance of hierarchical DPs and our proposed model. For each subject at each scanner we fit a model from seven images and compute the predictive likelihood of the remaining image. The predictive scores and simulation errors (standard deviations) averaged over the eight cross-validation runs for both models are shown in Table 2. For all of the subjects except Subject 1 at Duke, the proposed model shows a significant improvement over hierarchical DPs. For Subject 1 at Duke, the hierarchical DP gives a slightly better result, but the difference in scores is not significant relative to the simulation error.

Figure 6 shows the difference in the way the hierarchical DP and our proposed model fit the data in one cross-validation run for Subject 1 at Duke, with the raw data shown in Figure 6(a). The hierarchical DP in Figure 6(b) models the common bump with varying intensity in the middle of each image as a mixture of two components: one for the bump in the first two images with relatively high intensity, and another for the same bump in the rest of the images with lower intensity.
Our proposed model recovers the correspondence between the bumps with different intensity across images, as shown in Figure 6(c).

[Figure 6: Results from one cross-validation run for Subject 1 at Duke. (a) Raw images for a cross section of the right precentral gyrus and surrounding area. Activation components estimated from the images are shown in (b) for hierarchical DPs and in (c) for hierarchical DPs with random effects.]

6 Conclusions

In this paper we proposed a hierarchical DP model with random effects that allows each group (or image) to have group-level mixture component parameters as well as group-level mixing proportions. Using fMRI brain activation images we demonstrated that our model can capture components shared across multiple groups with individual-level variation. In addition, we showed that our model is able to estimate the number of components more reliably due to the additional flexibility in the model compared to DP mixtures and hierarchical DPs. Possible future directions for this work include extensions to modeling differences between labeled groups of individuals, e.g., in studies of controls and patients for a particular disorder.

Acknowledgments

We would like to thank Hal Stern for useful discussions. We acknowledge the support of the following grants: the Functional Imaging Research in Schizophrenia Testbed, Biomedical Informatics Research Network (FIRST BIRN; 1 U24 RR021992, www.nbirn.net); the Transdisciplinary Imaging Genetics Center (P20RR020837-01); and the National Alliance for Medical Image Computing (NAMIC; Grant U54 EB005149), funded by the National Institutes of Health through the NIH Roadmap for Medical Research. Author PS was also supported in part by the National Science Foundation under award numbers IIS-0431085 and SCI-0225642.

References

Gelman, A., Carlin, J., Stern, H. & Rubin, D. (2004). Bayesian Data Analysis. New York: Chapman & Hall/CRC.

Gelman, A. (2006). Prior distributions for variance parameters in hierarchical models. Bayesian Analysis, 1(3):515-533.

Kim, S., Smyth, P. & Stern, H. (2006). A nonparametric Bayesian approach to detecting spatial activation patterns in fMRI data. Proceedings of the 9th International Conference on Medical Image Computing and Computer Assisted Intervention, vol. 2, pp. 217-224.

Neal, R.M. (1998). Markov chain sampling methods for Dirichlet process mixture models. Technical Report 4915, Department of Statistics, University of Toronto.

Penny, W. & Friston, K. (2003). Mixtures of general linear models for functional neuroimaging. IEEE Transactions on Medical Imaging, 22(4):504-514.

Rasmussen, C.E. (2000). The infinite Gaussian mixture model. Advances in Neural Information Processing Systems 12, pp. 554-560. MIT Press.

Rasmussen, C.E. & Ghahramani, Z. (2002). Infinite mixtures of Gaussian process experts. Advances in Neural Information Processing Systems 14, pp. 881-888. MIT Press.

Sethuraman, J. (1994). A constructive definition of Dirichlet priors. Statistica Sinica, 4:639-650.

Sudderth, E., Torralba, A., Freeman, W. & Willsky, A. (2005). Describing visual scenes using transformed Dirichlet processes. Advances in Neural Information Processing Systems 18, pp. 1297-1304. MIT Press.

Teh, Y.W., Jordan, M.I., Beal, M.J. & Blei, D.M. (2006). Hierarchical Dirichlet processes. Journal of the American Statistical Association, to appear.
", "award": [], "sourceid": 2975, "authors": [{"given_name": "Seyoung", "family_name": "Kim", "institution": null}, {"given_name": "Padhraic", "family_name": "Smyth", "institution": null}]}