{"title": "Scalable imputation of genetic data with a discrete fragmentation-coagulation process", "book": "Advances in Neural Information Processing Systems", "page_first": 2852, "page_last": 2860, "abstract": "We present a Bayesian nonparametric model for genetic sequence data in which a set of genetic sequences is modelled using a Markov model of partitions. The partitions at consecutive locations in the genome are related by their clusters first splitting and then merging. Our model can be thought of as a discrete time analogue of continuous time fragmentation-coagulation processes [Teh et al 2011], preserving the important properties of projectivity, exchangeability and reversibility, while being more scalable. We apply this model to the problem of genotype imputation, showing improved computational efficiency while maintaining the same accuracies as in [Teh et al 2011].", "full_text": "Scalable imputation of genetic data with a discrete\n\nfragmentation-coagulation process\n\nLloyd T. Elliott\n\nGatsby Computational Neuroscience Unit\n\nUniversity College London\n\n17 Queen Square\n\nLondon WC1N 3AR, U.K.\n\nelliott@gatsby.ucl.ac.uk\n\nYee Whye Teh\n\nDepartment of Statistics\n\nUniversity of Oxford\n1 South Parks Road\n\nOxford OX1 3TG, U.K.\n\ny.w.teh@stats.ox.ac.uk\n\nAbstract\n\nWe present a Bayesian nonparametric model for genetic sequence data in which\na set of genetic sequences is modelled using a Markov model of partitions. The\npartitions at consecutive locations in the genome are related by the splitting and\nmerging of their clusters. Our model can be thought of as a discrete analogue of\nthe continuous fragmentation-coagulation process [Teh et al 2011], preserving the\nimportant properties of projectivity, exchangeability and reversibility, while being\nmore scalable. 
We apply this model to the problem of genotype imputation, showing\nimproved computational efficiency while maintaining accuracies comparable\nto other state-of-the-art genotype imputation methods.\n\n1 Introduction\n\nThe increasing availability of genetic data (for example, from the Thousand Genomes project [1])\nand the importance of genetics in scientific and medical applications require the development of\nscalable and accurate models for genetic sequences which are informed by genetic processes. Although\nstandard models such as the coalescent with recombination [2] are accurate, they suffer from\nintractable posterior computations. To address this, various hidden Markov model (HMM) based\napproaches have been proposed in the literature as more scalable alternatives (e.g. [3, 4]).\nDue to gene conversion and chromosomal crossover, genetic sequences exhibit a local \u2018mosaic\u2019-like\nstructure wherein sequences are composed of prototypical segments called haplotypes [5]. Locally,\nthese prototypical segments are shared by a cluster of sequences: each sequence in the cluster is\ndescribed well by a haplotype that is specific to the location on the chromosome of the cluster. An\nexample of such a structure is shown in Figure 1. HMMs can capture this structure by having each\nlatent state correspond to one of the haplotypes [3, 6]. Unfortunately, this leads to symmetries in\nthe posterior distribution arising from the nonidentifiability of the state labels [7, 8]. Furthermore,\ncurrent state-of-the-art HMM methods often involve costly model selection procedures in order to\nchoose the number of latent states.\nA continuous fragmentation-coagulation process (CFCP) has recently been proposed for modelling\nlocal mosaic structure in genetic sequences [9]. The CFCP is a nonparametric model defined directly\non unlabelled partitions, thereby avoiding both costly model selection and the label switching\nproblem [8]. 
Although inference algorithms derived for the CFCP scale linearly in the number and\nlength of the sequences [9], since the CFCP is a Markov jump process the computational overhead\nneeded to model the arbitrary number of latent events located between two consecutive observations\nmight preclude scalability to large datasets.\nIn this work, we present a novel fragmentation-coagulation process defined on a discrete grid (called\nthe DFCP) which provides the advantages of the CFCP while being more scalable. The DFCP\ndescribes location-dependent unlabelled partitions such that at each location on the chromosome the\nclusters will split into multiple clusters which then merge to form the clusters at the next location. As\nwith the CFCP, the DFCP avoids the label switching problem by defining a probability distribution\ndirectly on the space of unlabelled partitions.\n\nFigure 1: Haplotype structure of the CEU and YRI populations from HapMap [10] found by DFCP.\nData consists of single nucleotide polymorphisms (SNPs) from the TAP2 gene. Horizontal axis indicates\nSNP location and label. Vertical axis represents clusters from the last sample of an MCMC chain\nconverging to the DFCP posterior. Letters inside clusters indicate base identity.\n\nThe splitting and merging of clusters across the chromosome form a mosaic structure of haplotypes.\nFigure 1 gives an example of the structure discovered by the DFCP. We describe the DFCP in\nsection 2, and a forward-backward inference algorithm in section 3. 
Sections 4 and 5 report some\nexperimental results showing good performance on an imputation problem, and in section 6 we\nconclude.\n\n2 The discrete fragmentation-coagulation process\n\nIn humans, most of the bases on a chromosome are the same for all individuals in a population.\nGenetic variations arise through mutations such as single nucleotide polymorphisms (SNPs), which\nare locations in the genome where a single base was altered by a mutation at some time in the\nancestry of the chromosome. At each SNP location, a particular chromosome has one of usually two\npossible bases (referred to as the major and minor allele). Consequently, SNP data for a chromosome\ncan be modelled as a binary sequence, with each entry indicating which of the two bases is present\nat that location. In this paper we consider SNP data consisting of n binary sequences x = (x_i)_{i=1}^n,\nwhere each sequence x_i = (x_{it})_{t=1}^T is of length T and corresponds to the T SNPs on a segment of\na chromosome in an individual. The t-th entry xit of sequence i is equal to zero if individual i has\nthe major allele at location t and equal to one otherwise.\nWe will model these sequences using a discrete fragmentation-coagulation process (DFCP) so that\nthe sequence values at the SNP at location t are described by the latent partition \u03c0t of the sequences.\nEach cluster in the partition corresponds to a haplotype. The DFCP models the sequence of partitions\nusing a discrete Markov chain as follows: starting with \u03c0t, we first fragment each cluster in \u03c0t into\nsmaller clusters, forming a finer partition \u03c1t. Then we coagulate the clusters in \u03c1t to form the clusters\nof \u03c0t+1. In the remainder of this section, we will first give some background theory on partitions, and\nrandom fragmentation and coagulation operations and then we will describe the DFCP as a Markov\nchain over partitions. 
Finally, we will describe the likelihood model used to relate the sequence of\npartitions to the observed sequences.\n\n2.1 Random partitions, fragmentations and coagulations\n\nA partition of a set S is a clustering of S into non-overlapping non-empty subsets of S whose union\nis all of S. The Chinese restaurant process (CRP) forms a canonical family of distributions on\npartitions. A random partition \u03c0 of a set S is said to follow the law CRP(S, \u03b1, \u03c3) if:\n\nPr(\u03c0) = ([\u03b1 + \u03c3]^{#\u03c0\u22121}_\u03c3 / [\u03b1 + 1]^{#S\u22121}_1) \u220f_{a\u2208\u03c0} [1 \u2212 \u03c3]^{#a\u22121}_1,    (1)\n\nwhere [x]^n_d = (x)(x + d) . . . (x + (n \u2212 1)d) is Kramp\u2019s symbol and \u03b1 > \u2212\u03c3, \u03c3 \u2208 [0, 1) are the concentration\nand discount parameters respectively [11]. A CRP can also be described by the following\nanalogy: customers (elements of S) enter a Chinese restaurant and choose to sit at tables (clusters in\n\u03c0). The first customer chooses any table. 
Subsequently, the i-th customer sits at a previously chosen\ntable a with probability proportional to #a \u2212 \u03c3 where #a is the number of customers already sitting\nthere and at some unoccupied table with probability proportional to \u03b1 + \u03c3#\u03c0 where #\u03c0 is the total\nnumber of tables already sat at by previous customers.\nThe fragmentation and coagulation operators are random operations on partitions. The fragmentation\nFRAG(\u03c0, \u03b1, \u03c3) of a partition \u03c0 is formed by partitioning further each cluster a of \u03c0 according\nto CRP(a, \u03b1, \u03c3) and then taking the union of the resulting partitions, yielding a partition of S that\nis finer than \u03c0. Conversely, the coagulation COAG(\u03c0, \u03b1, \u03c3) of \u03c0 is formed by partitioning the set of\nclusters of \u03c0 (i.e., the set \u03c0 itself) according to CRP(\u03c0, \u03b1, \u03c3) and then replacing each cluster with the\nunion of its elements, yielding a partition that is coarser than \u03c0. The fragmentation and coagulation\noperators are linked through the following theorem by Pitman [12].\n\nTheorem 1. Let S be a set and let A1, B1, A2, B2 be random partitions of S such that:\n\nA1 \u223c CRP(S, \u03b1\u03c32, \u03c31\u03c32),    B1|A1 \u223c FRAG(A1, \u2212\u03c31\u03c32, \u03c32),\nB2 \u223c CRP(S, \u03b1\u03c32, \u03c32),    A2|B2 \u223c COAG(B2, \u03b1, \u03c31).\n\nThen, for all partitions A and B of the set S such that B is a refinement of A:\n\nPr(A1 = A, B1 = B) = Pr(A2 = A, B2 = B).    (2)\n\n2.2 The discrete fragmentation-coagulation process\n\nThe DFCP is parameterized by a concentration \u00b5 > 0 and rates (R_t)_{t=1}^{T\u22121} with R_t \u2208 [0, 1). Under\nthe DFCP, the marginal distribution of the partition \u03c0t is CRP(S, \u00b5, 0) and so \u00b5 controls the number\nof clusters that are found at each location. 
The rate parameter Rt controls the strength of dependence\nbetween \u03c0t and \u03c0t+1, with Rt = 0 implying that \u03c0t = \u03c0t+1 and Rt \u2192 1 implying independence.\nGiven \u00b5 and (R_t)_{t=1}^{T\u22121}, the DFCP on a set of sequences indexed by the set S = {1, . . . , n} is\ndescribed by the following Markov chain. First we draw a partition \u03c01 \u223c CRP(S, \u00b5, 0). This CRP\ndescribes the clustering of S at location t = 1. Subsequently, we draw \u03c1t|\u03c0t from FRAG(\u03c0t, 0, Rt),\nwhich fragments each of the clusters in \u03c0t into smaller clusters in \u03c1t, and then \u03c0t+1|\u03c1t from\nCOAG(\u03c1t, \u00b5/Rt, 0), which coagulates clusters in \u03c1t into larger clusters in \u03c0t+1.\nEach \u03c0t has CRP(S, \u00b5, 0) as its invariant marginal distribution and each \u03c1t is marginally distributed\nas CRP(S, \u00b5, Rt). This can be seen by applying Theorem 1 with the substitution \u03c31 = 0, \u03c32 = Rt,\n\u03b1 = \u00b5/Rt. In population genetics the CRP appears as (and was predated by) Ewens\u2019s sampling\nformula [13], a counting formula for the number of alleles appearing in a population, observed at\na given location. Over a short segment of the chromosome where recombination rates are low,\nhaplotypes behave like alleles and so a CRP prior on the number of haplotypes at a location is\nreasonable.\nFurther, since fragmentation and coagulation operators are defined in terms of CRPs which are projective\nand exchangeable, the Markov chain is projective and exchangeable in S as well. Projectivity\nand exchangeability are desirable properties for Bayesian nonparametric models because they imply\nthat the marginal distribution of a given data item does not depend on the total number of other data\nitems or on the order in which the other data items are indexed. 
In genetics, this captures the fact\nthat usually only a small subset of a population is observed.\nFinally, the theorem also shows that conditioned on \u03c0t+1, \u03c1t has distribution FRAG(\u03c0t+1, 0, Rt)\nwhile \u03c0t|\u03c1t has distribution COAG(\u03c1t, \u00b5/Rt, 0) meaning that the Markov chain defining the DFCP\nis reversible. Chromosome replication is directional and so statistics for genetic processes along the\nchromosome are not reversible. But the strength of this effect on SNP data is not currently known\nand many genetic models such as the coalescent with recombination [14] assume reversibility for\nsimplicity. The non-reversibility displayed by models such as fastPHASE is an artifact of their\nconstruction rather than an attempt to capture non-reversible aspects of genetic sequences.\n\n2.3 Likelihood model for sequence observations\n\nFigure 2: Left: Graphical model for the discrete fragmentation coagulation process. Hyperparameters\nare not shown. Right: Generative process for genetic sequences xit:\n\n\u03c01 \u223c CRP(S, \u00b5, 0),\n\u03c1t|\u03c0t \u223c FRAG(\u03c0t, 0, Rt),\n\u03c0t+1|\u03c1t \u223c COAG(\u03c1t, \u00b5/Rt, 0),\nlog \u00b5 \u223c N(m, v),\nlog Rt \u223c Uniform(log Rmin, 0),\nx_{it} | a_{it} = \u03b8_{t a_{it}},\n\u03b8ta|\u03b2t \u223c Bernoulli(\u03b2t),\n\u03b2t|\u03b3t \u223c Beta(\u03b3t/2, \u03b3t/2),\nlog \u03b3t \u223c Uniform(log \u03b3min, 0).    (3)\n\nGiven the sequence of partitions (\u03c0_t)_{t=1}^T, we model the observations in each cluster at each location\nt independently. For each cluster a \u2208 \u03c0t at location t, we adopt a discrete likelihood model in which\nthe same observation is emitted for each sequence in the cluster. For each sequence i, let ait \u2208 \u03c0t\nbe the cluster in \u03c0t containing i. Let \u03b8ta be the emission of cluster a at location t. Since SNP data\nhas binary labels, \u03b8ta \u2208 {0, 1} is a Bernoulli random variable. 
Let the mean of \u03b8ta be \u03b2t (this\nis the latent allele frequency at location t). We assume that conditioned on the partitions and the\nparameters, the observations xit are independent, and determined by the cluster parameter \u03b8ta. Thus\nthe probability Pr(\u03b8ta = 1|\u03b2t) = \u03b2t and the probability Pr(xit|ait = a, \u03b8ta) = \u03b4(xit = \u03b8ta) where\n\u03b4 is an indicator function (i.e., it is one if xit = \u03b8ta and zero otherwise).\nWe place a beta prior on \u03b2t with mean parameter 1/2 and mass parameter \u03b3t. The mass parameters\nare themselves marginally independent and we place on them an uninformative log-uniform prior\nover a range: p(\u03b3t) \u221d \u03b3t^{\u22121}, \u03b3t \u2265 \u03b3min. Since this distribution is heavy tailed, the \u03b2t variables\nwill have more mass near 0 and 1 than they would have if \u03b3t were fixed, adding sparsity to the\nlatent allele frequencies. This phenomenon is empirically observed in SNP data. We also place an\nuninformative log-uniform prior on Rt over a range: p(Rt) \u221d Rt^{\u22121}, Rt \u2265 Rmin. Note that the\nprior gives more mass to values of Rt close to Rmin which we set close to zero, since we expect\nthe partitions of consecutive locations to be relatively similar so that the mosaic haplotype structure\ncan be formed. Finally, we place a truncated log-normal prior on \u00b5 with mean m and variance v:\nlog \u00b5 \u223c N (m, v), \u00b5 > 0. The graphical model for this generative process is shown in Figure 2.\n\n2.4 Relationship with the continuous fragmentation-coagulation process\n\nThe continuous version of the fragmentation-coagulation process [9], which we refer to as the CFCP,\nis a partition valued Markov jump process (MJP). (The \u2018time\u2019 variable for this MJP is the chromosome\nlocation, viewed as a continuous variable.) The CFCP is a pure jump process and can be\ndefined in terms of its rates for various jump events. 
There are two types of events in the CFCP:\nbinary fragmentation events, in which a single cluster a is split into two clusters b and c at a rate of\nR\u0393(#b)\u0393(#c)/\u0393(#a), and binary coagulation events in which two clusters b and c merge to form\none cluster a at a rate of R/\u00b5.\nAs was shown in [9] the CFCP can be realised as a continuous limit of the DFCP. Consider a DFCP\nwith concentration \u00b5 and constant rate parameter R\u03b5. Then as \u03b5 \u2192 0 the probability that the\ncoagulation and fragmentation operations at a specific time step t induce no change in the partition\nstructure \u03c0t approaches 1. Conversely, the probability that these operations are the binary events\ngiven above scales as O(\u03b5), while all other events scale as larger powers of \u03b5. If we rescale the time\nsteps by t \u21a6 \u03b5t, then the expected number of binary events over a finite interval approaches \u03b5 times\nthe rates given above and the expected number of all other events goes to zero, yielding the CFCP.\nIn the CFCP fragmentation and coagulation events are binary: they involve either one cluster fragmenting\ninto two new clusters, or two clusters coagulating into one new cluster. However, for the\nDFCP the fragmentation and coagulation operators can describe more complicated haplotype structures\nwithout introducing more latent events. For example one cluster splitting into three clusters\n(as happens to the second haplotype from the top of Figure 1 after the 18th SNP) can be described\nby the DFCP using just one fragmentation operator. 
In the CFCP, the order of the latent events required to describe such a structure\ndoes not matter, adding unnecessary symmetry to its posterior.\n\n3 Inference with the discrete fragmentation coagulation process\n\nWe derive a Gibbs sampler for posterior simulation in the DFCP by making use of the exchangeability\nof the process. Each iteration of the sampler updates the trajectory of cluster assignments of one\nsequence i through the partition structure. To arrive at the updates, we first derive the conditional\ndistribution of the i-th trajectory given the others, which can be shown to be a Markov chain. Coupled\nwith the deterministic likelihood terms, we then use a backwards-filtering/forwards-sampling\nalgorithm to obtain a new trajectory for sequence i. In this section, we derive the conditional distribution\nof trajectory i using the definition of fragmentation and coagulation and also the posterior\ndistributions of the parameters R_t, \u00b5 which we will update using slice sampling [15].\n\n3.1 Conditional probabilities for the trajectory of sequence i\n\nWe will refer to the projection of the partitions \u03c0_t and \u03c1_t onto S \u2212 {i} by \u03c0_t^{\u2212i} and \u03c1_t^{\u2212i} respectively.\nLet a_t (respectively b_t) be the cluster assignment of sequence i at location t in \u03c0_t (respectively \u03c1_t).\nIf the sequence i is placed in a new cluster by itself in \u03c0_t (i.e., it forms a singleton cluster) we will\ndenote this by a_t = \u2205 and for \u03c1_t^{\u2212i} we will denote the respective event by b_t = \u2205. Otherwise, if\nthe sequence i is placed in an existing cluster in \u03c0_t^{\u2212i} (respectively \u03c1_t^{\u2212i}) we will denote this by\na_t \u2208 \u03c0_t^{\u2212i} (respectively b_t \u2208 \u03c1_t^{\u2212i}). Thus the state spaces of a_t and b_t are respectively \u03c0_t^{\u2212i} \u222a {\u2205} and\n\u03c1_t^{\u2212i} \u222a {\u2205}.\nStarting at t = 1, since the initial distribution is \u03c0_1 \u223c CRP(S, \u00b5, 0), the conditional cluster assignment\nof the sequence i in \u03c0_1 is given by the CRP probabilities from (1):\n\nPr(a_1 = a | \u03c0_1^{\u2212i}) =\n    #a/(n \u2212 1 + \u00b5)    if a \u2208 \u03c0_1^{\u2212i},\n    \u00b5/(n \u2212 1 + \u00b5)    if a = \u2205.    (4)\n\nTo find the conditional distribution of b_t given a_t, we use the definition of the fragmentation operation\nas independent CRP partitions of each cluster in \u03c0_t. If a_t = \u2205, then the sequence i is in a\ncluster by itself in \u03c0_t and so it will remain in a cluster by itself after fragmenting. Thus, b_t = \u2205\nwith probability 1. If a_t = a \u2208 \u03c0_t^{\u2212i} then b_t must be one of the clusters in \u03c1_t into which a fragments.\nThis can be a singleton cluster, in which case b_t = \u2205, or it can be one of the clusters in \u03c1_t^{\u2212i}. We will\nrefer to this set of clusters in \u03c1_t^{\u2212i} by F_t(a). Since a is fragmented according to CRP(a, 0, R_t), when\nthe i-th sequence is added to this CRP it is placed in a cluster b \u2208 F_t(a) with probability proportional\nto (#b \u2212 R_t) and it is placed in a singleton cluster with probability proportional to R_t #F_t(a).\nNormalizing these probabilities yields the following joint distribution:\n\nPr(b_t = b | a_t = a, \u03c0_t^{\u2212i}, \u03c1_t^{\u2212i}) =\n    (#b \u2212 R_t)/#a    if a \u2208 \u03c0_t^{\u2212i}, b \u2208 F_t(a),\n    R_t #F_t(a)/#a    if a \u2208 \u03c0_t^{\u2212i}, b = \u2205,\n    1    if a = b = \u2205,\n    0    otherwise.    (5)\n\nSimilarly, to find the conditional distribution of a_{t+1} given b_t = b we use the definition of the\ncoagulation operation. 
If b \u2260 \u2205, then the sequence i was not in a singleton cluster in \u03c1_t^{\u2212i} and so it\nmust follow the rest of the sequences in b to the unique a \u2208 \u03c0_{t+1}^{\u2212i} such that b \u2286 a (i.e., b coagulates\nwith other clusters to form a). We will refer to the set of clusters in \u03c1_t^{\u2212i} that coagulate to form a by\nC_t(a). If b = \u2205 then the sequence i is in a singleton cluster in \u03c1_t^{\u2212i} and so we can imagine it being the\nlast customer added to the coagulating CRP(\u03c1_t, \u00b5/R_t, 0) of the clusters of \u03c1_t. Hence the probability\nthat sequence i is placed in a cluster a \u2208 \u03c0_{t+1}^{\u2212i} is proportional to #C_t(a) while the probability that it\nforms a cluster by itself in \u03c0_{t+1}^{\u2212i} is proportional to \u00b5/R_t. This yields the following joint probability:\n\nPr(a_{t+1} = a | b_t = b, \u03c0_{t+1}^{\u2212i}, \u03c1_t^{\u2212i}) =\n    1    if a \u2208 \u03c0_{t+1}^{\u2212i}, b \u2208 C_t(a),\n    R_t #C_t(a)/(\u00b5 + R_t #\u03c1_t^{\u2212i})    if a \u2208 \u03c0_{t+1}^{\u2212i}, b = \u2205,\n    \u00b5/(\u00b5 + R_t #\u03c1_t^{\u2212i})    if a = b = \u2205,\n    0    otherwise.    (6)\n\n3.2 Message passing and sampling for the sequences of the DFCP\n\nOnce the conditional probabilities are defined, it is straightforward to derive messages that allow\nus to conduct backwards-filtering/forwards-sampling to resample the trajectory of sequence i in the\nDFCP. This provides an exact Gibbs update for the trajectory of that sequence conditioned on the\ntrajectories of all the other sequences and the data. The messages we will define are the conditional\ndistribution of all the data seen after a given location in the sequence conditioned on the cluster\nassignment of sequence i at that location. 
The messages are defined as follows:\n\nm^t_C(a) = Pr(x_{i,(t+1):T} | a_t = a, \u03c0_{t:T}^{\u2212i}, \u03c1_{t:(T\u22121)}^{\u2212i}).    (7)\nm^t_F(b) = Pr(x_{i,(t+1):T} | b_t = b, \u03c0_{t:T}^{\u2212i}, \u03c1_{t:(T\u22121)}^{\u2212i}).    (8)\n\nWe define the last messages to be m^T_C(a) = 1. These messages are computed as follows:\n\nm^t_F(b) = \u2211_{a \u2208 \u03c0_{t+1}^{\u2212i} \u222a {\u2205}} m^{t+1}_C(a) \u03b4(x_{i,(t+1)} = \u03b8_{(t+1),a}) Pr(a_{t+1} = a | b_t = b, \u03c0_{t+1}^{\u2212i}, \u03c1_t^{\u2212i}),    (9)\nm^t_C(a) = \u2211_{b \u2208 \u03c1_t^{\u2212i} \u222a {\u2205}} m^t_F(b) Pr(b_t = b | a_t = a, \u03c0_t^{\u2212i}, \u03c1_t^{\u2212i}).    (10)\n\nIn (9), the \u03b4 term is the likelihood and the last factor is the coagulation probability from (6). In (10), the\nlast factor is the fragmentation probability from (5).\n\nAs the fragmentation and coagulation conditional probabilities are only supported for clusters a, b\nsuch that b \u2286 a, these sums can be expanded so that only non-zero terms are summed over. For\nsimplicity we do not provide these expanded forms here. Given these computations it is easy to\ndefine backwards messages using the reversibility of the process. The backwards messages can be\nused to compute marginal probabilities of the observation as in the forward-backward algorithm.\nTo sample from the posterior distribution of the trajectory for sequence i conditioned on the other\ntrajectories and the data, we use the Markov property for the chain a_1, b_1, . . . , b_{T\u22121}, a_T and the\ndefinition of the messages. 
Starting at location 1, we have:\n\nPr(a_1 = a | x_i, \u03c0_{1:T}^{\u2212i}, \u03c1_{1:(T\u22121)}^{\u2212i})\n    \u221d Pr(a_1 = a | \u03c0_1^{\u2212i}) Pr(x_{i1} | a_1 = a) Pr(x_{i,2:T} | a_1 = a, \u03c0_{1:T}^{\u2212i}, \u03c1_{1:(T\u22121)}^{\u2212i})\n    = Pr(a_1 = a | \u03c0_1^{\u2212i}) \u03b4(x_{i1} = \u03b8_{1a}) m^1_C(a).    (11)\n\nThe first factor is given by the CRP probabilities (1) and the \u03b4 term is the likelihood.\nFor subsequent b_t and a_{t+1} for locations t = 1, . . . , T \u2212 1,\n\nPr(b_t = b | a_t = a, x_i, \u03c0_{1:T}^{\u2212i}, \u03c1_{1:(T\u22121)}^{\u2212i})\n    \u221d Pr(b_t = b | a_t = a, \u03c0_t^{\u2212i}, \u03c1_t^{\u2212i}) Pr(x_{i,(t+1):T} | b_t = b, \u03c0_{t:T}^{\u2212i}, \u03c1_{t:(T\u22121)}^{\u2212i})\n    = Pr(b_t = b | a_t = a, \u03c0_t^{\u2212i}, \u03c1_t^{\u2212i}) m^t_F(b),    (12)\n\nPr(a_t = a | b_{t\u22121} = b, x_i, \u03c0_{1:T}^{\u2212i}, \u03c1_{1:(T\u22121)}^{\u2212i})\n    \u221d Pr(a_t = a | b_{t\u22121} = b, \u03c0_t^{\u2212i}, \u03c1_{t\u22121}^{\u2212i}) Pr(x_{it} | a_t = a) Pr(x_{i,(t+1):T} | a_t = a, \u03c0_{t:T}^{\u2212i}, \u03c1_{t:(T\u22121)}^{\u2212i})\n    = Pr(a_t = a | b_{t\u22121} = b, \u03c0_t^{\u2212i}, \u03c1_{t\u22121}^{\u2212i}) \u03b4(x_{it} = \u03b8_{ta}) m^t_C(a).    (13)\n\nIn (12) the first factor is the fragmentation probability from (5). In (13) the first factor is the coagulation\nprobability from (6) and the \u03b4 term is the likelihood.\n\nThe complexity of this update is O(KT) where K is the expected number of clusters in the posterior.\nThis complexity class is the same as for the continuous fragmentation-coagulation process and other\nrelated HMM methods such as fastPHASE. But there is no exact Gibbs update for the trajectories in\nthe CFCP. 
Instead the CFCP sampler relies on uniformization [16] which has slower mixing times\nthan exact Gibbs and so the update for the DFCP is, theoretically, more efficient.\n\n3.3 Parameter updates\n\nWe use slice sampling [15] to update the \u00b5 and R_t parameters conditioned on the partition structure.\nUsing Bayes\u2019 rule, the definition (3) of the DFCP, and the identity [a]^n_b = b^n \u0393(a/b + n)/\u0393(a/b),\nthe posterior probabilities of \u00b5 and R_t given the partitions \u03c0_{1:T} and \u03c1_{1:(T\u22121)} are as follows:\n\nPr(\u00b5 | \u03c0, \u03c1) \u221d Pr(\u00b5) Pr(\u03c0_1 | \u00b5, R_1) Pr(\u03c1_1 | \u03c0_1, \u00b5, R_1) \u00b7\u00b7\u00b7 Pr(\u03c0_T | \u03c1_{T\u22121}, \u00b5, R_{T\u22121})\n    \u221d Pr(\u00b5) (\u0393(\u00b5)/\u0393(\u00b5 + n)) \u00b5^{\u2212T + \u2211_{t=1}^T #\u03c0_t} \u220f_{t=1}^{T\u22121} \u0393(\u00b5/R_t)/\u0393(\u00b5/R_t + #\u03c1_t),    (14)\n\nPr(R_t | \u03c0, \u03c1, \u00b5) \u221d Pr(R_t) Pr(\u03c1_t | \u03c0_t, \u00b5, R_t) Pr(\u03c0_{t+1} | \u03c1_t, \u00b5, R_t)\n    \u221d Pr(R_t) R_t^{#\u03c1_t \u2212 #\u03c0_t \u2212 #\u03c0_{t+1} + 1} (\u0393(\u00b5/R_t) \u0393(1 \u2212 R_t)^{\u2212#\u03c1_t} / \u0393(#\u03c1_t + \u00b5/R_t)) \u220f_{b\u2208\u03c1_t} \u0393(#b \u2212 R_t).    (15)\n\nFigure 3: Allele imputation for X chromosomes from the Thousand Genomes project. Left:\nAccuracy for prediction of held out alleles for continuous (CFCP) and discrete (DFCP) versions of\nfragmentation-coagulation process and for popular methods BEAGLE and fastPHASE. BEAGLE accuracies\nin the 90% missing data condition are truncated to emphasize other conditions. Right: Runtime\nversus accuracy for 500 MCMC iterations for DFCP and CFCP in 50% missing data condition.\nPoints are averaged over 20 datasets and 25 consecutive samples.\n\n4 Experiments\n\nTo examine the accuracy and scalability of the DFCP we conducted an allele imputation experiment\non SNP data from the Thousand Genomes project\u00b9. 
We also compared the runtime of the samplers\nfor the DFCP and CFCP on data simulated from the coalescent with recombination model [14]. In\nthis section, we describe the setup of these experiments and in section 5 we present the results.\nFor the allele imputation experiment, we considered SNPs from 524 male X chromosomes. We\nchose 20 intervals randomly, each containing 500 consecutive SNPs. In five conditions we held out\nnested sets of between 10% and 90% of the alleles uniformly over all pairs of sites and individuals,\nand used fastPHASE [3], BEAGLE [17], CFCP [9] and the DFCP to predict the held out alleles.\nWe used the most recent versions of BEAGLE and fastPHASE software available to us. We implemented\nthe DFCP with many of the same libraries and programming techniques as the CFCP and\nboth versions were optimized. In each missing data condition, the CFCP and DFCP were run with\nfive random restarts and 46 MCMC iterations per restart (26 of which were discarded for burn-in and\nthinning). The accuracies for the DFCP and CFCP were computed by thresholding the empirical\nmarginal probabilities of the held out alleles at 0.5. The priors on the hyperparameters and the\nlikelihood specification of the two models were matched and the samplers were initialized using a\nsequential Monte Carlo method based on the trajectory updates.\nThe posterior distributions of the concentration parameter \u00b5 for the two methods are different. In\norder to match the expected number of clusters in the posterior, we also conducted allele imputation\nin the 50% missing data condition with \u00b5 fixed at 10.0 for both models. We simulated 500 MCMC\niterations with no random restarts. 
We then computed the accuracy of the samples by predicting\nheld out alleles based on the cluster assignments of the sample.\nIn a second experiment we simulated datasets from the coalescent with recombination model consisting\nof between 10,000 and 50,000 sequences using the software ms [14]. We conducted posterior\nMCMC simulation in both models and compared the computation time required per iteration.\n\n\u00b9 March 2012 v3 release of the Thousand Genomes Project.\n\n5 Results\n\nThe accuracy of the DFCP in the allele imputation experiment was comparable to that of the CFCP\nand fastPHASE in all missing data conditions (Figure 3, left). For the 70% and 90% missing data\nconditions, BEAGLE performed poorly (its median accuracy for this condition was 93.90% and\nmean at-chance accuracy for all conditions was 93.44%). In Figure 3 (right) we compare the accuracy\nand runtime for the 50% missing data condition. This figure shows that the runtime required for each\niteration is lower for the DFCP and the sequential Monte Carlo initialization is better (i.e., closer\nto a posterior mode) for the DFCP. No difference in mixing time is suggested by the figure. As an\naside, we estimated the Shannon entropy in these samples and found that the DFCP had slightly\nmore entropy per sample than the CFCP (the difference was small but statistically significant). This\ncould indicate that the DFCP has better mixing.\nFor the second experiment, we plot the runtime per iteration of both models against the number\nof sequences in the simulated dataset (Figure 4). The DFCP was around 2.5 times faster than the\nCFCP for the condition with 50,000 sequences. 
In both models, most of the computation time was\nspent calculating the messages in the backwards-filtering step. The CFCP has an arbitrary number of\nlatent events between consecutive observations and it is likely that the runtime improvement shown\nby the DFCP is due to its reduced number of required message calculations.\n\n6 Discussion\n\nIn this paper we have presented a discrete\nfragmentation-coagulation process. The DFCP\nis a partition-valued Markov chain, where partitions\nchange along the chromosome by a fragmentation\noperation followed by a coagulation\noperation. The DFCP is designed to model\nthe mosaic haplotype structure observed in genetic\nsequences. We applied the DFCP to an allele\nprediction task on data from the Thousand\nGenomes Project, yielding accuracies comparable\nto state-of-the-art methods and runtimes\nthat were lower than the runtimes of the continuous\nfragmentation-coagulation process [9].\nThe DFCP and CFCP induce different joint distributions\non the partitions at adjacent locations.\nThe CFCP is a Markov jump process with an arbitrary\nnumber of latent binary events wherein\na single cluster is split into two clusters, or two clusters are merged into one. The DFCP, however,\ncan model any partition structure with one pair of fragmentation and coagulation operations. Exact\nGibbs updates for the partitions are possible in the DFCP whereas sampling in the CFCP uses\nuniformization [16] which, although fast in practice, has in theory slower mixing than exact Gibbs.\nIn future work we will explore better calling and calibration methods to improve imputation accuracies.\nAnother avenue of future research is to understand how other genetic processes can be\nincorporated into the fragmentation-coagulation framework, including population admixture and\ngene conversion. Although haplotype structure is a local property, the Markov assumption does not\nhold in real genetic data. 
This could be reflected through hierarchical FCP models or adaptation of\nother dependent nonparametric models such as the spatially normalized Gamma process [18].\n\nFigure 4: Runtimes per iteration per sequence of\nDFCP and CFCP on simulated datasets consisting\nof large numbers of sequences. Lines indicate\nmean. Shaded region indicates standard deviation.\n[Axes: number of individuals (\u00d710^4) against runtime (seconds/iteration).]\n\nAcknowledgements\n\nWe thank the Gatsby Charitable Foundation for funding. We also thank Andriy Mnih, Vinayak Rao\nand Anna Goldenberg for helpful discussion and the anonymous reviewers for their suggestions.\n\nReferences\n[1] The 1000 Genomes Project Consortium. A map of human genome variation from population-scale sequencing. Nature, 467:1061\u20131073, 2010.\n[2] R. R. Hudson. Properties of a neutral allele model with intragenic recombination. Theoretical Population Biology, 23(2):183\u2013201, 1983.\n[3] P. Scheet and M. Stephens. A fast and flexible statistical model for large-scale population genotype data: Applications to inferring missing genotypes and haplotypic phase. The American Journal of Human Genetics, 78(4):629\u2013644, 2006.\n[4] J. Marchini, B. Howie, S. Myers, G. McVean, and P. Donnelly. A new multipoint method for genome-wide association studies by imputation of genotypes. Nature Genetics, 39(7):906\u2013913, 2007.\n[5] M. J. Daly, J. D. Rioux, S. F. Schaffner, T. J. Hudson, and E. S. Lander. High-resolution haplotype structure in the human genome. Nature Genetics, 29:229\u2013232, 2001.\n[6] J. Marchini, D. Cutler, N. Patterson, M. Stephens, E. Eskin, E. Halperin, S. Lin, Z.S. Qin, H.M. Munro, G.R. Abecasis, P. Donnelly, and the International HapMap Consortium. A comparison of phasing algorithms for trios and unrelated individuals. The American Journal of Human Genetics, 78(3):437\u2013450, 2006.\n[7] M. Stephens. 
Dealing with label switching in mixture models. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 62(4):795\u2013809, 2000.\n[8] A. Jasra, C. C. Holmes, and D. A. Stephens. Markov chain Monte Carlo methods and the label switching problem in Bayesian mixture modeling. Statistical Science, 20(1):50\u201367, 2005.\n[9] Y. W. Teh, C. Blundell, and L. T. Elliott. Modelling genetic variations using fragmentation-coagulation processes. In Advances in Neural Information Processing Systems, 2011.\n[10] The International HapMap Consortium. The international HapMap project. Nature, 426:789\u2013796, 2003.\n[11] J. Pitman. Combinatorial stochastic processes. Springer-Verlag, 2006.\n[12] J. Pitman. Coalescents with multiple collisions. Annals of Probability, 27:1870\u20131902, 1999.\n[13] W. J. Ewens. The sampling theory of selectively neutral alleles. Theoretical Population Biology, 3:87\u2013112, 1972.\n[14] R. R. Hudson. Generating samples under a Wright-Fisher neutral model of genetic variation. Bioinformatics, 18:337\u2013338, 2002.\n[15] R. M. Neal. Slice sampling. Annals of Statistics, 31:705\u2013767, 2003.\n[16] V. Rao and Y. W. Teh. Fast MCMC sampling for Markov jump processes and continuous time Bayesian networks. In Proceedings of the International Conference on Uncertainty in Artificial Intelligence, 2011.\n[17] B. L. Browning and S. R. Browning. A unified approach to genotype imputation and haplotype-phase inference for large data sets of trios and unrelated individuals. American Journal of Human Genetics, 84:210\u2013223, 2009.\n[18] V. Rao and Y. W. Teh. Spatial normalized gamma processes. In Advances in Neural Information Processing Systems, volume 22, pages 1554\u20131562, 2009.\n", "award": [], "sourceid": 1294, "authors": [{"given_name": "Lloyd", "family_name": "Elliott", "institution": null}, {"given_name": "Yee", "family_name": "Teh", "institution": null}]}