{"title": "Modelling Genetic Variations using Fragmentation-Coagulation Processes", "book": "Advances in Neural Information Processing Systems", "page_first": 819, "page_last": 827, "abstract": "We propose a novel class of Bayesian nonparametric models for sequential data called fragmentation-coagulation processes (FCPs). FCPs model a set of sequences using a partition-valued Markov process which evolves by splitting and merging clusters. An FCP is exchangeable, projective, stationary and reversible, and its equilibrium distributions are given by the Chinese restaurant process. As opposed to hidden Markov models, FCPs allow for flexible modelling of the number of clusters, and they avoid label switching non-identifiability problems. We develop an efficient Gibbs sampler for FCPs which uses uniformization and the forward-backward algorithm. Our development of FCPs is motivated by applications in population genetics, and we demonstrate the utility of FCPs on problems of genotype imputation with phased and unphased SNP data.", "full_text": "Modelling Genetic Variations with Fragmentation-Coagulation Processes

Yee Whye Teh, Charles Blundell and Lloyd T. Elliott
Gatsby Computational Neuroscience Unit, UCL
17 Queen Square, London WC1N 3AR, United Kingdom
{ywteh,c.blundell,elliott}@gatsby.ucl.ac.uk

Abstract

We propose a novel class of Bayesian nonparametric models for sequential data called fragmentation-coagulation processes (FCPs). FCPs model a set of sequences using a partition-valued Markov process which evolves by splitting and merging clusters. An FCP is exchangeable, projective, stationary and reversible, and its equilibrium distributions are given by the Chinese restaurant process. As opposed to hidden Markov models, FCPs allow for flexible modelling of the number of clusters, and they avoid label switching non-identifiability problems.
We develop an efficient Gibbs sampler for FCPs which uses uniformization and the forward-backward algorithm. Our development of FCPs is motivated by applications in population genetics, and we demonstrate the utility of FCPs on problems of genotype imputation with phased and unphased SNP data.

1 Introduction

We are interested in probabilistic models for sequences arising from the study of genetic variations in a population of organisms (particularly humans). The most commonly studied class of genetic variations in humans is the single nucleotide polymorphism (SNP), with large quantities of data now available (e.g. from the HapMap [1] and 1000 Genomes [2] projects). SNPs play an important role in our understanding of genetic processes and human historical migratory patterns, and in genome-wide association studies for discovering the genetic basis of diseases, which in turn are useful in clinical settings for diagnoses and treatment recommendations.

A SNP is a specific location in the genome where a mutation has occurred to a single nucleotide at some time during the evolutionary history of a species. Because the rate of such mutations is low in human populations, the chance of two mutations occurring at the same location is small, and so most SNPs have only two variants (wild type and mutant) in the population. The SNP variants on a chromosome of an individual form a sequence, called a haplotype, with each entry being binary valued, coding for the two possible variants at that SNP. Due to the effects of gene conversion and recombination, the haplotypes of a set of individuals often have a "mosaic" structure in which contiguous subsequences recur across multiple individuals [3]. Hidden Markov models (HMMs) [4] are often used as the basis of existing models of genetic variations that exploit this mosaic structure (e.g. [3, 5]).
However, HMMs, as dynamic generalisations of finite mixture models, cannot flexibly model the number of states needed for a particular dataset, and suffer from the same label switching non-identifiability problems as finite mixture models [6] (see Section 3.2). While nonparametric generalisations of HMMs [7, 8, 9] allow for flexible modelling of the number of states, they still suffer from label switching problems.

In this paper we propose alternative Bayesian nonparametric models for genetic variations called fragmentation-coagulation processes (FCPs). An FCP defines a Markov process on the space of partitions of haplotypes, such that the random partition at each time is marginally a Chinese restaurant process (CRP). The clusters of the FCP are used in place of HMM states. FCPs do not require the number of clusters in each partition to be specified, and do not have explicit labels for clusters, thus avoiding label switching problems. The partitions of FCPs evolve via a series of events, each of which involves either two clusters merging into one, or one cluster splitting into two. We will see that FCPs are natural models for the mosaic structure of SNP data since they can flexibly accommodate varying numbers of subsequences and they do not have the label switching problems inherent in HMMs. Further, computations in FCPs scale well.

There is a rich literature on modelling genetic variations. The standard coalescent with recombination model (also known as the ancestral recombination graph) describes the genealogical history of a set of haplotypes using coalescent, recombination and mutation events [10]. Though it is an accurate model of the genetic process, inference is unfortunately highly intractable. PHASE [11, 12] and IMPUTE [13] are a class of HMM based models, where each HMM state corresponds to a haplotype in a reference panel (training set).
This alleviates the label switching problem, but incurs higher computational costs than normal HMMs or our FCP since there are now as many HMM states as reference haplotypes. BEAGLE [14] introduces computational improvements by collapsing the multiple occurrences of the same mosaic subsequence across the reference haplotypes into a single node of a graph, with the graph constructed in a very efficient but somewhat ad hoc manner.

Section 2 introduces preliminary notation and describes random partitions and the CRP. In Section 3 we introduce FCPs, discuss their more salient properties, and describe how they are used to model SNP data. Section 4 describes an auxiliary variables Gibbs sampler for our model. Section 5 presents results on simulated and real data, and Section 6 concludes.

2 Random Partitions

Let S denote a set of n SNP sequences. Label the sequences by the integers 1, . . . , n so that S can be taken to be [n] = {1, . . . , n}. A partition γ of S is a set of disjoint non-empty subsets of S (called clusters) whose union is S. Denote the set of partitions of S by ΠS. If a ⊂ S, define the projection γ|a of γ onto a to be the partition of a obtained by removing the elements of S\a, as well as any resulting empty subsets, from γ. The canonical distribution over ΠS is the Chinese restaurant process (CRP) [15, 16]. It can be described using an iterative generative process: n customers enter a Chinese restaurant one at a time. The first customer sits at some table, and each subsequent customer sits at a table with m current customers with probability proportional to m, or at a new table with probability proportional to α, where α is a parameter of the CRP. The seating arrangement of customers around tables forms a partition γ of S, with occupied tables corresponding to the clusters in γ.
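The sequential seating scheme just described is easy to simulate. The sketch below is our own illustration (the function name crp_partition is not from the paper): it draws a partition of n sequences by seating customers one at a time.

```python
import random

def crp_partition(n, alpha, rng=None):
    """Draw a random partition of {0, ..., n-1} from a CRP with parameter alpha."""
    rng = rng or random.Random(0)
    tables = []  # each table is a list of customer indices, i.e. a cluster
    for i in range(n):
        # occupied table of size m chosen w.p. proportional to m,
        # a new table w.p. proportional to alpha
        weights = [len(t) for t in tables] + [alpha]
        k = rng.choices(range(len(tables) + 1), weights=weights)[0]
        if k == len(tables):
            tables.append([i])   # start a new table
        else:
            tables[k].append(i)  # join existing table k
    return tables
```

Larger alpha tends to produce more, smaller clusters; the returned tables are always disjoint, non-empty, and cover {0, ..., n-1}.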
We write γ ∼ CRP(α, S) if γ ∈ ΠS is a CRP distributed random partition over S. Multiplying the conditional probabilities together gives the probability mass function of the CRP:

f_{α,S}(γ) = [α^{|γ|} Γ(α) / Γ(n + α)] ∏_{a∈γ} Γ(|a|)    (1)

where Γ is the gamma function. The CRP is exchangeable (invariant to permutations of S), and projective (the probability of the projection γ|a is simply f_{α,a}(γ|a)), so can be extended in a natural manner to partitions of N, and is related via de Finetti's theorem to the Dirichlet process [17].

3 Fragmentation-Coagulation Processes

A fragmentation-coagulation process (FCP) is a continuous-time Markov process π ≡ (π(t), t ∈ [0, T]) over a time interval [0, T] where each π(t) is a random partition in ΠS. Since the space of partitions for a finite S is finite, the FCP is a Markov jump process (MJP) [18]: it evolves according to a discrete series of random events (or jumps) at which it changes state, and at all other times the state remains unchanged. In particular, the jump events in an FCP are either fragmentations or coagulations. A fragmentation at time t involves a cluster c ∈ π(t−) splitting into exactly two non-empty clusters a, b ∈ π(t) (all other clusters stay unchanged; the t− notation means an infinitesimal time before t), and a coagulation at t involves two clusters a, b ∈ π(t−) merging to form a single cluster c = a ∪ b ∈ π(t) (see Figure 1). Note that fragmentations and coagulations are converses of each other; as we will see later, this will lead to some important properties of the FCP.

Figure 1: FCP cartoon. Each line is a sequence and bundled lines form clusters. C: coagulation event. F: fragmentation event.
Fractions are, for the orange sequence, from left to right: probability of joining cluster c at time 0, probability of following cluster a at a fragmentation event, rate of starting a new table (creating a fragmentation), and rate of joining an existing table (creating a coagulation).

Following the various popular culinary processes in Bayesian nonparametrics, we will start by describing the law of π in terms of the conditional distribution of the cluster membership of each sequence i given those of 1, . . . , i − 1. Since we have a Markov process with a time index, the metaphor is of a Chinese restaurant operating from time 0 to time T, where customers (sequences) may move from one table (cluster) to another, and tables may split and merge at different points in time, so that the seating arrangements (partition structures) at different times might not be the same. To be more precise, define π|[i−1] = (π|[i−1](t), t ∈ [0, T]) to be the projection of π onto the first i − 1 sequences. π|[i−1] is piecewise constant, with π|[i−1](t) ∈ Π[i−1] describing the partitioning of the sequences 1, . . . , i − 1 (the seating arrangement of customers 1, . . . , i − 1) at time t. Let ai(t) = c\{i}, where c is the unique cluster in π|[i](t) containing i. Note that either ai(t) ∈ π|[i−1](t), meaning customer i sits at an existing table in π|[i−1](t), or ai(t) = ∅, which will mean that customer i sits at a new table. Thus the function ai describes customer i's choice of table to sit at through times [0, T]. We define the conditional distribution of ai given π|[i−1] as a Markov jump process evolving from time 0 to T with two parameters µ > 0 and R > 0 (see Figure 1):

i = 1: The first customer sits at a table for the duration of the process, i.e.
a1(t) = ∅ ∀t ∈ [0, T].

t = 0: Each subsequent customer i starts at time t = 0 by sitting at a table according to CRP probabilities with parameter µ. So, ai(0) = c ∈ π|[i−1](0) with probability proportional to |c|, and ai(0) = ∅ with probability proportional to µ.

F1: At time t > 0, if customer i is sitting at table ai(t−) = c ∈ π|[i−1](t−), and the table c fragments into two tables a, b ∈ π|[i−1](t), customer i will move to table a with probability |a|/|c|, and to table b with probability |b|/|c|.

C1: If the table c merges with another table at time t, the customer simply follows the other customers to the resulting merged table.

F2: At all other times t, if customer i is sitting at some existing table ai(t−) = c ∈ π|[i−1](t), then the customer will move to a new empty table (ai(t) = ∅) with rate R/|c|.

C2: Finally, if i is sitting by himself (ai(t−) = ∅), then he will join an existing table ai(t) = c ∈ π|[i−1](t) with rate R/µ. The total rate of joining any existing table is |π|[i−1](t)|R/µ.

Note that when customer i moves to a new table in step F2, a fragmentation event is created, and all subsequent customers who end up at the same table will have to decide at step F1 whether to move to the original table or to the table newly created by i. The probabilities in steps F1 and F2 are exactly the same as those for a Dirichlet diffusion tree [19] with constant divergence function R. Similarly, step C2 creates a coagulation event in which subsequent customers seated at the two merging tables will move to the merged table in step C1, and the probabilities are exactly the same as those for Kingman's coalescent [20, 21]. Thus our FCP is a combination of the Dirichlet diffusion tree and Kingman's coalescent.
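To make these dynamics concrete, the following sketch (our own illustration, not the paper's sampler; simulate_fcp is a hypothetical name) simulates the induced partition-valued jump process for a small n in Gillespie style, using the cluster-level event rates that these steps induce and that Theorem 1 below derives formally: any pair of clusters coagulates at rate R/µ, and a cluster c fragments into a given unordered pair {a, b} at rate R·Γ(|a|)Γ(|b|)/Γ(|c|).

```python
import itertools
import math
import random

def simulate_fcp(n, mu, R, T, rng=None):
    """Gillespie-style simulation of a toy FCP on n sequences over [0, T].

    Enumerates every split of every cluster, so it is exponential in
    cluster size and only meant for small n.
    """
    rng = rng or random.Random(0)
    # initial partition: a sequential CRP(mu) draw
    part = []
    for i in range(n):
        w = [len(c) for c in part] + [mu]
        k = rng.choices(range(len(part) + 1), weights=w)[0]
        if k == len(part):
            part.append({i})
        else:
            part[k].add(i)
    part = [frozenset(c) for c in part]

    t, events = 0.0, []
    while True:
        moves = []  # (rate, resulting partition)
        # coagulation: each unordered pair of clusters merges at rate R/mu
        for a, b in itertools.combinations(part, 2):
            moves.append((R / mu, [c for c in part if c not in (a, b)] + [a | b]))
        # fragmentation: c splits into {a, b} at rate R * G(|a|)G(|b|) / G(|c|)
        for c in part:
            if len(c) < 2:
                continue
            first, *rest = sorted(c)
            for r in range(len(rest)):  # side containing min(c): each split once
                for sub in itertools.combinations(rest, r):
                    a = frozenset((first,) + sub)
                    b = c - a
                    rate = (R * math.gamma(len(a)) * math.gamma(len(b))
                            / math.gamma(len(c)))
                    moves.append((rate, [k for k in part if k != c] + [a, b]))
        total = sum(rate for rate, _ in moves)
        if total == 0.0:  # only possible when n == 1
            return events
        t += rng.expovariate(total)  # exponential holding time
        if t > T:
            return events
        idx = rng.choices(range(len(moves)), weights=[r for r, _ in moves])[0]
        kind = 'C' if len(moves[idx][1]) < len(part) else 'F'
        part = moves[idx][1]
        events.append((t, kind, [set(c) for c in part]))
```

Every event in the returned trajectory is a valid partition of the n sequences, and a run with larger R produces more frequent fragmentation and coagulation events, matching the role of R as the rate parameter.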
Theorem 3 below shows that this combination results in FCPs being stationary Markov processes with CRP equilibrium distributions. Further, FCPs are reversible, so in a sense the Dirichlet diffusion tree and Kingman's coalescent are duals of each other.

Given π|[i−1], π|[i] is uniquely determined by ai and vice versa, so that the seating of all n customers through times [0, T], a1, . . . , an, uniquely determines the sequential partition structure π. We now investigate various properties of π that follow from the iterative construction above. The first is an alternative characterisation of π as an MJP whose transitions are fragmentations or coagulations, an unsurprising observation since both the Dirichlet diffusion tree and Kingman's coalescent, as partition-valued processes, are Markov.

Theorem 1. π is an MJP with initial distribution π(0) ∼ CRP(µ, S) and stationary transit rates

q(γ, ρ) = R Γ(|a|) Γ(|b|) / Γ(|c|),    q(ρ, γ) = R/µ    (2)

where γ, ρ ∈ ΠS are such that ρ is obtained from γ by fragmenting a cluster c ∈ γ into two clusters a, b ∈ ρ (at rate q(γ, ρ)), and conversely γ is obtained from ρ by coagulating a, b into c (at rate q(ρ, γ)). The total rate of transition out of γ is:

q(γ, ·) = R Σ_{c∈γ} H_{|c|−1} + (R/µ) |γ|(|γ|−1)/2    (3)

where H_{|c|−1} is the (|c|−1)st harmonic number.

Proof. The initial distribution follows from the CRP probabilities of step t = 0. For every i, ai is Markov and ai(t) depends only on ai(t−) and π|[i−1](t), thus (ai(s), s ∈ [0, t]) depends only on (aj(s), s ∈ [0, t], j < i), and the Markovian structure of π follows by induction.
Since ΠS is finite, π is an MJP. Further, the probabilities and rates in steps F1, F2, C1 and C2 do not depend explicitly on t, so π has stationary transit rates. By construction, q(γ, ρ) is only non-zero if γ and ρ are related by a complementary pair of fragmentation and coagulation events, as in the theorem.

To derive the transition rates (2), recall that a transition rate r from state s to state s' means that if the MJP is in state s at time t then it will transit to state s' by an infinitesimal time later t + δ with probability δr. For the fragmentation rate q(γ, ρ), the probability of transiting from γ to ρ in an infinitesimal time δ is δ times the rate at which a customer starts his own table in step F2, times the probabilities of subsequent customers choosing either table in step F1 to form the two tables a and b. Dividing this product by δ gives the rate q(γ, ρ). Without loss of generality suppose that the table started by the customer eventually becomes a, and that there were j other customers at the existing table which eventually becomes b. Thus the rate of the customer starting his own table is R/j, and the product of probabilities of subsequent customer choices in step F1 is then [1·2···(|a|−1) × j···(|b|−1)] / [(j+1)···(|c|−1)]. Multiplying these together gives q(γ, ρ) in (2). Similarly, the coagulation rate q(ρ, γ) is a product of the rate R/µ at which a customer moves from his own table to an existing table in step C2 and the probability of all subsequent customers in either table moving to the merged table (which is just 1). Finally, the total transition rate q(γ, ·) is a sum over all possible fragmentations and coagulations of γ. There are |γ|(|γ|−1)/2 possible pairs of clusters to coagulate, giving the second term. The first term is obtained by summing over all c ∈ γ, and over all unordered pairs a, b resulting from fragmenting c, and using the identity Σ_{a,b} Γ(|a|)Γ(|b|)/Γ(|c|) = H_{|c|−1}.

Theorem 2. π is projective and exchangeable. Thus it can be extended naturally to a Markov process over partitions of N.

Proof. Both properties follow from the fact that both the initial distribution CRP(µ, S) and the transition rates (2) are projective and exchangeable. Here we will give more direct arguments for the theorem. Projectivity is a direct consequence of the iterative construction, showing that the law of π|[i] does not depend on the clustering trajectories aj of subsequent customers j > i. We can show exchangeability of π by deriving the joint probability density of a sample path of π (the density exists since both ΠS and T are finite, so π has a finite number of events on [0, T]), and seeing that it is invariant to permutations of S. For an MJP the probability of a sample path is the probability of the initial state (f_{µ,S}(π(0))) times, for each subsequent jump, the probability of staying in the current state γ until the jump (the holding time is exponentially distributed with rate q(γ, ·)) and the transition from γ to the next state ρ (this is the ratio q(γ, ρ)/q(γ, ·)), and finally the probability of not transiting from the last jump time to T.
Multiplying these probabilities together gives, after simplification:

p(π) = R^{|C|+|F|} µ^{|A|−2|C|−2|F|} [Γ(µ)/Γ(µ+n)] exp(−∫₀ᵀ q(π(t), ·) dt) [∏_{a∈A<>} Γ(|a|)] / [∏_{a∈A><} Γ(|a|)]    (4)

with |C| the number of coagulations, |F| the number of fragmentations, and A, A<>, A>< sets of paths in π. A path is a cluster created either at time 0 or at a coagulation or fragmentation, and it exists for a definite amount of time until it is terminated at time T or at another event (these are the horizontal bundles of lines in Figure 1). A is the set of all paths in π, A<> the set of paths created either at time 0 or by a fragmentation and terminated either at time T or by a coagulation, and A>< the set of paths created by a coagulation and terminated by a fragmentation or at time T.

Theorem 3. π is ergodic and has equilibrium distribution CRP(µ, S). Further, it is reversible, with (π(T − t), t ∈ [0, T]) having the same law as π.

Proof. Ergodicity follows from the fact that for any T > 0 and any two partitions γ, ρ ∈ ΠS, there is positive probability that if the process starts at π(0) = γ, it will end with π(T) = ρ. For example, it may undergo a sequence of fragmentations until each sequence belongs to its own cluster, then a sequence of coagulations forming the clusters in ρ. Reversibility and the equilibrium distribution can be demonstrated by detailed balance.
Suppose γ, ρ ∈ ΠS and a, b, c are related as in Theorem 1. Then:

f_{µ,S}(γ) q(γ, ρ) = [µ^{|γ|} Γ(µ)/Γ(n+µ)] ∏_{k∈γ} Γ(|k|) × R Γ(|a|)Γ(|b|)/Γ(|c|)
= [µ^{|γ|+1} Γ(µ)/Γ(n+µ)] Γ(|a|)Γ(|b|) ∏_{k∈γ, k≠c} Γ(|k|) × R/µ = f_{µ,S}(ρ) q(ρ, γ)    (5)

Finally, the terms in (4) are invariant to time reversals, i.e. p((π(T − t), t ∈ [0, T])) = p(π).

Theorem 3 shows that the µ parameter controls the marginal distributions of π(t), while (2) indicates that the R parameter controls the rate at which π evolves.

3.1 A Model of SNP Sequences

We model the n SNP sequences (haplotypes) with an FCP π over partitions of S = [n]. Let the m assayed SNP locations on a chunk of the chromosome be at positions t1 < t2 < · · · < tm. The ith haplotype consists of observations xi1, . . . , xim ∈ {0, 1}, each corresponding to a binary SNP variant. For j = 1, . . . , m, and for each cluster c ∈ π(tj) at position tj, we have a parameter θcj ∼ Bernoulli(βj) which denotes the variant at location tj of the corresponding subsequence. For each i ∈ c we model xij as equal to θcj with probability 1 − ε, where ε is a noise probability. We place a prior βj ∼ Beta(α β̃j, α(1 − β̃j)) with mean β̃j given by the empirical mean of variant 1 at SNP j among the observed haplotypes. We place uninformative uniform priors on log R, log µ and log α over a bounded but large range such that the boundaries were never encountered.

The properties of FCPs in Theorems 1-3 are natural in the modelling setting here.
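As a concrete sketch of the Section 3.1 emission model (our own illustration; the function name is not from the paper), the marginal probability of the observations in one cluster at one SNP, with θcj summed out against its Bernoulli(βj) prior, is:

```python
def cluster_snp_marginal(xs, beta_j, eps):
    """Marginal likelihood of binary observations xs in one cluster at SNP j.

    theta_cj ~ Bernoulli(beta_j) is summed out; each observation equals
    theta_cj with probability 1 - eps (eps is the noise probability).
    """
    like1 = 1.0  # likelihood of xs if theta_cj = 1
    like0 = 1.0  # likelihood of xs if theta_cj = 0
    for x in xs:
        like1 *= (1 - eps) if x == 1 else eps
        like0 *= (1 - eps) if x == 0 else eps
    return beta_j * like1 + (1 - beta_j) * like0
```

For example, a noiseless cluster of all-ones has marginal likelihood exactly βj, since only θcj = 1 is consistent with the data.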
Projectivity and exchangeability relate to the assumption that sequence labels should not have an effect on the model, while stationarity and reversibility arise from the simplifying assumption that we do not expect the genetic processes operating in different parts of the genome to be different. These are also properties of the standard coalescent with recombination model of genetic variations [10]. Incidentally, the coalescent with recombination model is not Markov, though there have been Markov approximations [22, 23], and all practical HMM based methods are Markov.

3.2 HMMs and the Label Switching Problem

HMMs can also be interpreted as sequential partitioning processes in which each state at time step t corresponds to a cluster in the partition at t. Since each sequence can be in different states at different times, this automatically induces a partition-structured Markov process, where each partition consists of at most K clusters (K being the number of states in the HMM), and where each cluster is labelled with an HMM state. This labelling of the clusters is a significant, but subtle, difference between HMMs and FCPs. Note that the clusters in FCPs are unlabelled, and defined purely in terms of the sequences they contain. The labelling of the clusters in HMMs is a significant source of non-identifiability, since the likelihoods of data items (and often even the priors over transition probabilities) are invariant to the labels themselves, so that each permutation over labels creates a mode in the posterior. This is the so-called "label switching problem" for finite mixture models [6]. Since FCP clusters are unlabelled they do not suffer from label switching problems. On the other hand, by having labelled clusters HMMs can share statistical strength among clusters across time steps (e.g.
by enforcing the same emission probabilities from each cluster across time), while FCPs do not have a natural way of sharing statistical strength across time. This means that FCPs are not suitable for sequential data where there is no natural correspondence between times across different sequences, e.g. time series data like speech and video.

3.3 Discrete Time Markov Chain Construction

FCPs can be derived as continuous time limits of discrete time Markov chains constructed from fragmentation and coagulation operators [24]. This construction is more intuitive but lacks the rigour of the development described here. Let CRP(α, d, S) be a generalisation of the CRP on S with an additional discount parameter d (see [25] for details). For any δ > 0, construct a Markov chain over π(0), π(δ), π(2δ), . . . as follows: π(0) ∼ CRP(µ, 0, S); then for every m ≥ 1, define ρ(mδ) to be the partition obtained by fragmenting each cluster c ∈ π((m−1)δ) by a partition drawn independently from CRP(0, Rδ, c), and π(mδ) is constructed by coagulating into one cluster the clusters of ρ(mδ) belonging to the same cluster in a draw from CRP(µ/Rδ, 0, ρ(mδ)). Results from [26] (see also [27]) show that marginally each ρ(mδ) ∼ CRP(µ, Rδ, S) and π(mδ) ∼ CRP(µ, 0, S). The various properties of FCPs, i.e. Markov, projectivity, exchangeability, stationarity, and reversibility, hold for this discrete time Markov chain, and the continuous time π can be derived by taking δ → 0.

4 Gibbs Sampling using Uniformization

We use a Gibbs sampler for inference in the FCP given SNP haplotype data.
Each iteration of the sampler involves treating the ith haplotype sequence as the last sequence to be added into the FCP partition structure (making use of exchangeability), so that the iterative procedure described in Section 3 gives the conditional prior of ai given π|S\{i}. Coupling with the likelihood terms of xi1, . . . , xim gives us the desired conditional distribution of ai. Since this conditional distribution of ai is Markov, we can make use of the forward filtering-backward sampling procedure to sample it. However, ai is a continuous-time MJP, so a direct application of the typical forward-backward algorithm is not possible. One possibility is to marginalise out the sample path of ai except at a finite number of locations (corresponding to the jumps in π|S\{i} and the SNP locations). This approach is computationally expensive as it requires many matrix exponentiations, and does not resolve the issue of obtaining a full sample path of ai, which may involve jumps at random locations we have marginalised out.

Instead, we make use of a recently developed MCMC inference method for MJPs [28]. This sampler introduces as auxiliary variables a set of "potential jump points" distributed according to a Poisson process with piecewise constant rates, such that conditioned on them the posterior of ai becomes a Markov chain that can only transition at either its previous jump locations or the potential jump points, and we can then apply standard forward-backward to sample ai. For each t the state space of ai(t) is Cit ≡ π|S\{i}(t) ∪ {∅}. For s, s' ∈ Cit let Qt(s, s') be the transition rate from state s to s' given in Section 3, with Qt(s, s) = −Σ_{s'≠s} Qt(s, s'). Let Ωt > max_{s∈Cit} −Qt(s, s) be an upper bound on the transition rates of ai at time t, a'i be the previous sample path of ai, J' be the jumps in a'i, and E consist of the m SNP locations and the event times in π|S\{i}. Let Mt(s) be the forward message at time t and state s ∈ Cit. The resulting forward-backward sampling algorithm is given below. In addition we update the logarithms of R, µ and α by slice sampling.

1. Sample potential jumps Jaux ∼ Poisson(Λ) with rate Λ(t) = Ωt + Qt(a'i(t), a'i(t)).
2. Compute forward messages by iterating over t ∈ {0} ∪ Jaux ∪ J' ∪ E from left to right:
2a. At t = 0, set Mt(s) ∝ |s| for s ∈ π|S\{i}(0) and Mt(∅) ∝ µ.
2b. At a fragmentation in π|S\{i}, say of c into a, b, set Mt(a) = (|a|/|c|) Mt−(c), Mt(b) = (|b|/|c|) Mt−(c), and Mt(k) = Mt−(k) for k ≠ a, b, c. Here t− denotes the time of the previous iteration.
2c. At a coagulation in π|S\{i}, say of a, b into c, set Mt(c) = Mt−(a) + Mt−(b).
2d. At an observation, say t = tj, set Mt(s) = p(xij|θsj) Mt−(s). We integrate out θ∅j and βj.
2e. At a potential jump in Jaux ∪ J', set Mt(s) = Σ_{s'∈Cit} Mt−(s') (1(s' = s) + Qt(s', s)/Ωt).
3. Get a new sample path ai by backward sampling. This is straightforward and involves reversing the message computations above. Note that ai can only jump at the times in Jaux ∪ J', and change state at times in E if it was involved in the fragmentation or coagulation event.

5 Experiments

Label switching problem. Figure 2 demonstrates the label switching problem (Section 3.2) during block Gibbs sampling of a 2-state Bayesian HMM (BHMM) compared to inference in an FCP.
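The uniformization machinery of Section 4 can be illustrated on a generic finite-state MJP with a fixed rate matrix. This is our own simplification (the paper's state space Cit and rates vary over time, and names like uniformized_fb_sample are hypothetical): conditioned on the candidate jump times, the chain moves with transition matrix B = I + Q/Ω, and self-transitions play the role of thinned (non-)jumps.

```python
import math
import random

def uniformized_fb_sample(Q, Omega, times, loglik=None, rng=None):
    """Forward-filter / backward-sample a finite-state MJP at candidate times.

    Q is a rate matrix (rows sum to 0) and Omega > max_s -Q[s][s] is the
    uniformization bound. loglik(t, s), if given, is an observation
    log-likelihood folded in at time t (the analogue of step 2d).
    """
    rng = rng or random.Random(0)
    S = len(Q)
    B = [[(1.0 if i == j else 0.0) + Q[i][j] / Omega for j in range(S)]
         for i in range(S)]
    msgs = [[1.0 / S] * S]  # uniform initial distribution, for simplicity
    for t in times:
        prev = msgs[-1]
        m = [sum(prev[i] * B[i][j] for i in range(S)) for j in range(S)]
        if loglik is not None:
            m = [m[j] * math.exp(loglik(t, j)) for j in range(S)]
        z = sum(m)
        msgs.append([x / z for x in m])
    # backward sampling: reverse the message computations
    states = [rng.choices(range(S), weights=msgs[-1])[0]]
    for k in range(len(times) - 1, -1, -1):
        w = [msgs[k][i] * B[i][states[-1]] for i in range(S)]
        states.append(rng.choices(range(S), weights=w)[0])
    states.reverse()
    return states  # states[0] is the initial state; states[k+1] follows times[k]
```

Any Omega strictly above the fastest exit rate is valid; a larger bound adds more candidate times but makes each B closer to the identity, leaving the sampled law unchanged.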
Figure 2: Label switching problem. Left: Each line is the median, over 10 runs, of the normalized log-likelihoods of a Bayesian HMM (blue) and an FCP (red) at each iteration of MCMC. Lighter polygons are the 25% and 75% percentiles. Right: Number of MCMC iterations before each model first encounters the optimum states.

The observed data comprises 16 sequences of length 16. Eight of the sequences consist of just zeros and the others consist of just ones. Each of the binary BHMM states zij ∈ {0, 1}, with i indexing the sequence and j the position within sequence i, transits to the same state with probability τ, with a prior τ ∼ Beta(10.0, 0.1) encouraging self transitions. The observations of the BHMM have distribution xij ∼ Bernoulli(ρ_{zij}) where ρ1 = 1 − ρ0 and ρ0 ∼ Beta(1.0, 1.0). The optimal clustering under both models assigns all zero observations to one state and all ones to another state. As shown in Figure 2, due to the lack of identifiability of its states, the BHMM requires more MCMC iterations through the data before inference converges upon an optimal state, whilst an FCP is able to find the correct state much more quickly. This is reflected in both the normalized log-likelihood of the models in Figure 2 (left) and in the number of iterations before reaching the optimal state, Figure 2 (right).

Imputation from phased data. To reduce costs, typically not all known SNPs are assayed for each participant in a large association study. The problem of inferring the variants of unassayed SNPs in a study using a larger dataset (e.g. HapMap or 1000 Genomes) is called genotype imputation [13]. Figure 3 compares the genotype imputation accuracy of the FCP with that of fastPHASE [5] and BEAGLE [14], two state-of-the-art methods.
We used 3000 MCMC iterations for inference with the FCP, with the first 1000 iterations discarded as burn-in. We used 320 genes from 47 individuals in the Seattle SNPs dataset [29]. Each gene consists of 94 sequences, of length between 13 and 416 SNPs. We held out 10%-50% of the SNPs uniformly among all haplotypes for testing. Our model had higher accuracy than both fastPHASE and BEAGLE.

Figure 3: Accuracy vs proportion of missing data for imputation from phased data. Lines are drawn at the means and error bars at the standard error of the means.

Imputation from unphased data. In humans, most chromosomes come in pairs. Current assaying methods are unable to determine from which of these two chromosomes each variant originates without employing expensive protocols, thus the data for each individual in large datasets actually consist of sequences of unordered pairs of variants (called genotypes). This includes the Seattle SNPs dataset (the haplotypes provided by [29] in the previous experiment were phased using PHASE [11, 12]).

In this experiment, we performed imputation using the original unphased genotypes, using an extension of the FCP able to handle this sort of data. Figure 4 shows the genotype imputation accuracies and run-times of the FCP model (with 60, 600 or 3000 MCMC iterations, of which 30, 200 or 600 respectively were discarded as burn-in) and state-of-the-art software (fastPHASE [5], IMPUTE2 [30], BEAGLE [14]).

Figure 4: Time and accuracy performance of genotype imputation on 231 Seattle SNPs genes. Left: Accuracies evaluated by removing 10%-50% of SNPs from 10%-50% of individuals, repeated five times on each gene with the same hold out proportions. Centers of crosses correspond to median accuracy and times whilst whiskers correspond to the extent of the inter-quartile range. Middle: Lines are accuracy averaged over five repetitions of each gene with 30% of shared SNPs removed from 10%-50% of individuals. Each repetition uses a different subset of SNPs and individuals. Lighter polygons are standard errors. Right: As Middle, except with 10%-50% of shared SNPs removed from 30% of individuals.

We held out 10%-50% of the shared SNPs in 10%-50% of the 47 individuals of the Seattle SNPs dataset. This paradigm mimics a popular experimental setting in which the genotypes of sparsely assayed individuals are imputed using a densely assayed reference panel [30]. We discarded 89 of the genes as they were unable to be properly pre-processed for use with IMPUTE2.

As can be seen in Figure 4, the FCP achieves state-of-the-art accuracy similar to IMPUTE2 and fastPHASE. Given enough iterations, the FCP outperforms all other methods in terms of accuracy. With 600 iterations, the FCP has almost the same accuracy and run-time as fastPHASE. With just 60 iterations, the FCP performs comparably to IMPUTE2 but is an order of magnitude faster. Note that IMPUTE2 scales quadratically in the number of genotypes, so we expect FCPs to be more scalable. Finally, BEAGLE is the fastest algorithm but has the worst accuracies.

6 Discussion

We have proposed a novel class of Bayesian nonparametric models called fragmentation-coagulation processes (FCPs), and applied them to modelling population genetic variations, showing encouraging empirical results on genotype imputation. FCPs are the simplest non-trivial examples of exchangeable fragmentation-coalescence processes (EFCPs) [31]. In general EFCPs, the fragmentation and coagulation events may involve more than two clusters. They also have an erosion operation, where a single element of S forms a single element cluster.
EFCPs were studied by probabilists for their theoretical properties; our work represents the first application of EFCPs as probabilistic models of real data, and the first inference algorithm derived for EFCPs.

There are many interesting avenues for future research. Firstly, we are currently exploring a number of other applications in population genetics, including phasing and genome-wide association studies. Secondly, it would be interesting to explore the discrete time Markov chain version of FCPs, which, although not as elegant, will have simpler and more scalable inference. Thirdly, the haplotype graph in BEAGLE is constructed via a series of cluster splits and merges, and bears a striking resemblance to the partition structures inferred by FCPs. It would be interesting to explore the use of BEAGLE as a fast initialisation of FCPs, and to use FCPs as a Bayesian interpretation of BEAGLE. Finally, beyond population genetics, FCPs can also be applied to other time series and sequential data, e.g. the time evolution of community structure in network data, or topical change in document corpora.

Acknowledgements

We thank the Gatsby Charitable Foundation for generous funding, and Vinayak Rao, Andriy Mnih, Chris Holmes and Gil McVean for fruitful discussions.

References

[1] The International HapMap Consortium. The international HapMap project. Nature, 426:789–796, 2003.
[2] The 1000 Genomes Project Consortium. A map of human genome variation from population-scale sequencing. Nature, 467:1061–1073, 2010.
[3] M. J. Daly, J. D. Rioux, S. F. Schaffner, T. J. Hudson, and E. S. Lander. High-resolution haplotype structure in the human genome. Nature Genetics, 29:229–232, 2001.
[4] L. Rabiner.
A tutorial on hidden Markov models and selected applications in speech recognition. Proceedings of the IEEE, 77:257–285, 1989.
[5] P. Scheet and M. Stephens. A fast and flexible statistical model for large-scale population genotype data: Applications to inferring missing genotypes and haplotypic phase. The American Journal of Human Genetics, 78(4):629–644, 2006.
[6] A. Jasra, C. C. Holmes, and D. A. Stephens. Markov chain Monte Carlo methods and the label switching problem in Bayesian mixture modeling. Statistical Science, 20(1):50–67, 2005.
[7] M. J. Beal, Z. Ghahramani, and C. E. Rasmussen. The infinite hidden Markov model. In Advances in Neural Information Processing Systems, volume 14, 2002.
[8] Y. W. Teh, M. I. Jordan, M. J. Beal, and D. M. Blei. Hierarchical Dirichlet processes. Journal of the American Statistical Association, 101(476):1566–1581, 2006.
[9] E. P. Xing and K. Sohn. Hidden Markov Dirichlet process: Modeling genetic recombination in open ancestral space. Bayesian Analysis, 2(2), 2007.
[10] R. R. Hudson. Properties of a neutral allele model with intragenic recombination. Theoretical Population Biology, 23(2):183–201, 1983.
[11] M. Stephens and P. Donnelly. A comparison of Bayesian methods for haplotype reconstruction from population genotype data. American Journal of Human Genetics, 73:1162–1169, 2003.
[12] N. Li and M. Stephens. Modeling linkage disequilibrium and identifying recombination hotspots using single-nucleotide polymorphism data. Genetics, 165(4):2213–2233, 2003.
[13] J. Marchini, B. Howie, S. Myers, G. McVean, and P. Donnelly. A new multipoint method for genome-wide association studies by imputation of genotypes. Nature Genetics, 39(7):906–913, 2007.
[14] B. L. Browning and S. R. Browning. A unified approach to genotype imputation and haplotype-phase inference for large data sets of trios and unrelated individuals. American Journal of Human Genetics, 84:210–223, 2009.
[15] D. Aldous. Exchangeability and related topics. In École d'Été de Probabilités de Saint-Flour XIII–1983, pages 1–198. Springer, Berlin, 1985.
[16] J. Pitman. Combinatorial Stochastic Processes. Lecture Notes in Mathematics. Springer-Verlag, 2006.
[17] D. Blackwell and J. B. MacQueen. Ferguson distributions via Pólya urn schemes. Annals of Statistics, 1:353–355, 1973.
[18] E. Çinlar. Introduction to Stochastic Processes. Prentice Hall, 1975.
[19] R. M. Neal. Slice sampling. Annals of Statistics, 31:705–767, 2003.
[20] J. F. C. Kingman. On the genealogy of large populations. Journal of Applied Probability, 19:27–43, 1982. Essays in Statistical Science.
[21] J. F. C. Kingman. The coalescent. Stochastic Processes and their Applications, 13:235–248, 1982.
[22] G. A. T. McVean and N. J. Cardin. Approximating the coalescent with recombination. Philosophical Transactions of the Royal Society of London B: Biological Sciences, 360(1459):1387–1393, 2005.
[23] P. Marjoram and J. Wall. Fast "coalescent" simulation. BMC Genetics, 7(1):16, 2006.
[24] J. Bertoin. Random Fragmentation and Coagulation Processes. Cambridge University Press, 2006.
[25] J. Pitman and M. Yor. The two-parameter Poisson-Dirichlet distribution derived from a stable subordinator. Annals of Probability, 25:855–900, 1997.
[26] J. Pitman. Coalescents with multiple collisions. Annals of Probability, 27:1870–1902, 1999.
[27] J. Gasthaus and Y. W. Teh. Improvements to the sequence memoizer. In Advances in Neural Information Processing Systems, 2010.
[28] V. Rao and Y. W. Teh. Fast MCMC sampling for Markov jump processes and continuous time Bayesian networks. In Proceedings of the International Conference on Uncertainty in Artificial Intelligence, 2011.
[29] NHLBI Program for Genomic Applications. SeattleSNPs. June 2011. http://pga.gs.washington.edu.
[30] B. N. Howie, P. Donnelly, and J. Marchini. A flexible and accurate genotype imputation method for the next generation of genome-wide association studies. PLoS Genetics, 5(6), 2009.
[31] J. Berestycki. Exchangeable fragmentation-coalescence processes and their equilibrium measures. http://arxiv.org/abs/math/0403154, 2004.