{"title": "A Global Structural EM Algorithm for a Model of Cancer Progression", "book": "Advances in Neural Information Processing Systems", "page_first": 163, "page_last": 171, "abstract": "Cancer has complex patterns of progression that include converging as well as diverging progressional pathways. Vogelstein's path model of colon cancer was a pioneering contribution to cancer research. Since then, several attempts have been made at obtaining mathematical models of cancer progression, devising learning algorithms, and applying these to cross-sectional data. Beerenwinkel {\\em et al.} provided, what they coined, EM-like algorithms for Oncogenetic Trees (OTs) and mixtures of such. Given the small size of current and future data sets, it is important to minimize the number of parameters of a model. For this reason, we too focus on tree-based models and introduce Hidden-variable Oncogenetic Trees (HOTs). In contrast to OTs, HOTs allow for errors in the data and thereby provide more realistic modeling. We also design global structural EM algorithms for learning HOTs and mixtures of HOTs (HOT-mixtures). The algorithms are global in the sense that, during the M-step, they find a structure that yields a global maximum of the expected complete log-likelihood rather than merely one that improves it. 
The algorithm for single HOTs performs very well on reasonable-sized data sets, while that for HOT-mixtures requires data sets of sizes obtainable only with tomorrow's more cost-efficient technologies.", "full_text": "A Global Structural EM Algorithm\nfor a Model of Cancer Progression\n\nAli To\ufb01gh\n\nSchool of Computer Science\n\nMcGill Centre for Bioinformatics\n\nMcGill University, Canada\n\nali.tofigh@mcgill.ca\n\nErik Sj\u00a8olund\n\nStockholm Bioinformatics Center\nStockholm University, Sweden\nerik.sj\u00a8olund@sbc.su.se\n\nMattias H\u00a8oglund\n\nDepartment of Oncology\nLund University, Sweden\n\nmattias.hoglund@med.lu.se\n\nJens Lagergren\n\nScience for Life Lab\n\nSwedish e-Science Research Center\nStockholm Bioinformatics Center\n\nSchool of Computer Science and Communication\n\nKTH Royal Institute of Technology, Sweden\n\njensl@csc.kth.se\n\nAbstract\n\nCancer has complex patterns of progression that include converging as well as di-\nverging progressional pathways. Vogelstein\u2019s path model of colon cancer was a\npioneering contribution to cancer research. Since then, several attempts have been\nmade at obtaining mathematical models of cancer progression, devising learning\nalgorithms, and applying these to cross-sectional data. Beerenwinkel et al. pro-\nvided, what they coined, EM-like algorithms for Oncogenetic Trees (OTs) and\nmixtures of such. Given the small size of current and future data sets, it is im-\nportant to minimize the number of parameters of a model. For this reason, we\ntoo focus on tree-based models and introduce Hidden-variable Oncogenetic Trees\n(HOTs). In contrast to OTs, HOTs allow for errors in the data and thereby pro-\nvide more realistic modeling. We also design global structural EM algorithms\nfor learning HOTs and mixtures of HOTs (HOT-mixtures). 
The algorithms are\nglobal in the sense that, during the M-step, they \ufb01nd a structure that yields a global\nmaximum of the expected complete log-likelihood rather than merely one that im-\nproves it. The algorithm for single HOTs performs very well on reasonable-sized\ndata sets, while that for HOT-mixtures requires data sets of sizes obtainable only\nwith tomorrow\u2019s more cost-ef\ufb01cient technologies.\n\n1\n\nIntroduction\n\nIn the learning literature, there are several previous results on learning probabilistic tree models, in-\ncluding various Expectation Maximization-based inference algorithms. In [1], trees were considered\nwhere the vertices were associated with observable variables and an ef\ufb01cient algorithm for \ufb01nding a\nglobally optimal Maximum Likelihood (ML) solution was described. Subsequently, [2] presented a\nstructural Expectation Maximization (EM) algorithm for \ufb01nding the ML mixture of trees as well as\nMAP solutions with respect to several priors.\nThere are three axes along which it is natural to compare these as well as other results. The \ufb01rst\naxis is the type of dependency structure allowed. The second axis is the type of variables used\u2014\n\n1\n\n\fobservable only or hidden and observable\u2014and the type of relations they can have. The third axis\nis the type of inference algorithms that are known for the model.\nIt is interesting in relation to the present result to ask in what respect the structural EM algorithm\nof\n[3] constitutes an improvement when compared with Friedman\u2019s earlier structural EM algo-\nrithm [4]. In fact, it may seem like the former constitutes no improvement at all, since the latter\nis concerned with more general dependency structures. 
Notice, however, that it is customary to\ndistinguish between EM algorithms and generalized EM algorithms for inferring numerical param-\neters, the difference being that in the M-step of the former, parameters are found that maximize the\nexpected complete log-likelihood, whereas in the latter, parameters are found that merely improve\nit. As Friedman points out in his article on the Bayesian Structural EM algorithm [4], the same\ndistinction can be made regarding the maximization over structures. Clearly, it would be convenient\nto use the same terminology for structural EM algorithms as for ordinary EM algorithms. However,\nthe distinction is often not made for structural EM algorithms and even researchers that consider\nthemselves experts in the \ufb01eld seem to be unaware of it. For this reason, we de\ufb01ne global structural\nEM algorithms to be EM algorithms that in the M-step \ufb01nd a structure yielding a global maximum of\nthe expected complete log-likelihood (as opposed to a structure that merely improves it). Equipped\nwith this de\ufb01nition, we note that the phylogeny algorithm of [3] is a global structural EM algorithm\nin contrast to the earlier algorithm [4]. Another example of a global structural EM algorithm is the\nlearning algorithm for trees with hidden variables presented in [5].\nIn an effort to provide mathematical models of cancer progression, Desper et al.\nintroduced the\nOncogenetic Tree model where observable variables corresponding to aberrations are associated\nwith vertices of a tree [6]. 
They then proceeded to show that an algorithm based on Edmonds\u2019s\noptimum branching algorithm will, with high probability, correctly reconstruct an Oncogenetic Tree\nT from suf\ufb01ciently long series of data generated from T .\nThe Oncogenetic Tree model suffers from two problems; monotonicity\u2014an aberration associated\nwith a child cannot occur unless the aberration associated with its parent has occurred\u2014and limited-\nstructure\u2014compared to a network, the tree structure severely limits the sets of progressional paths\nthat can be modeled. In an attempt to remedy these problems, the Network Aberration Model was\nproposed [7, 8]. However, the computational problems associated with these network models are\nhard; for instance, no ef\ufb01cient EM algorithm for training is yet known. In another attempt, Beeren-\nwinkel et al. used mixtures of Oncogenetic Trees to overcome the problem of limited-structure, but\nwithout removing the monotonicity and only obtaining an algorithm with an EM-like structure that\nhas not been proved to deliver a locally optimal maximum-likelihood solution [9, 10, 11].\nBeerenwinkel and coworkers used Conjunctive Bayesian Networks (CBNs) to model cancer pro-\ngression [12, 13]. In order to overcome the limited ability of CBNs to model noisy biological data,\n[14] introduced the hidden CBN model. A hidden CBN can be obtained from a CBN by consider-\ning each variable in the CBN to be hidden and associating an observable variable with each hidden\nvariable. The hidden CBN also has a common error parameter specifying the probability that any\nindividual observable variable differs from its associated hidden variable. 
In a hidden CBN, values\nare \ufb01rst generated for the hidden variables, and then, the observable variables obtain values based\nboth on the hidden variables and the error parameter.\nWe present the Hidden-variable Oncogenetic Tree (HOT) model where a hidden and an observable\nvariable are associated with each vertex of a rooted directed tree. The value of the hidden variable\nindicates whether or not the tumor progression has reached the vertex (a value of one means that\ncancer progression has reached the vertex and zero that it has not), while the value of the observable\nvariable indicates whether a speci\ufb01c aberration has been detected (a value of one represents detection\nand zero the opposite). This interpretation provides several relations between the variables in a HOT.\nAn asymmetric relation is required between the hidden variables associated with the two endpoints\nof an arc of the directed tree. Because of this asymmetry, the global structural EM algorithm that\nwe derive for the HOT ML problem cannot, in contrast to many of the above mentioned algorithms,\nbe based on a maximum spanning tree algorithm and is instead based on the optimal branching\nalgorithm [15, 16, 17]. Having so recti\ufb01ed the monotonicity problem, we proceed to obtain a model\nallowing for a higher degree of structural variation by introducing mixtures of HOTs (HOT-mixtures)\nand, in contrast to Beerenwinkel et al., we derive a proper structural EM algorithm for training these.\n\n2\n\n\fIn the near future, multiple types of high throughput (HTP) data will be available for large collections\nof tumors, providing great opportunities as well as computational challenges for progression model\ninference. One of the main motivations for our models and inference methods is that they enable\nanalysis of future HTP-data, which most likely will require the ability to handle large numbers\nof mutational events. 
In this paper, however, we apply our methods to cytogenetic data for colon and kidney cancer, mostly due to the availability of cytogenetic data for large numbers of tumors provided by the Mitelman database [18].

2 HOTs and the novel global structural EM algorithm

2.1 Hidden-variable Oncogenetic Trees

We will denote the set of observed data points D and an individual data point X. In Section 3, we will apply our methods to CNA, i.e., a data point will be a set of observed copy number aberrations, but in general, more complex events can be used.
A rooted directed tree T consists of a set of vertices, denoted V(T), and a set of arcs, denoted A(T). An arc ⟨u, v⟩ is directed from the vertex u, called its tail, towards the vertex v, called its head. If there is an arc with tail p and head u in a directed tree T, then p is called the parent of u in T and denoted p(u) (the tree T will be clear from context).
An OT is a rooted directed tree where an aberration is associated with each vertex and a probability is associated with each arc. One can view an OT as generating a set of aberrations by first visiting the root and then continuing towards the leaves (preorder), visiting each vertex with the probability of its incoming arc if the parent has been visited, and with probability zero if the parent has not been visited. The result of the progression is the set of aberrations associated with the visited vertices.

Figure 1: (a) A rooted directed tree with the root at the top. 
All arcs are directed downwards, i.e., away from\nthe root. (b) An OT with probabilities associated with arcs and CNAs associated with vertices. (c) A HOT with\nprobabilities associated with arcs (indicating the probability that the hidden variable associated with the head\nof the arc receives the value 1 conditioned that the hidden variable associated with the tail has this value), and\nCNAs as well as probabilities associated with vertices (indicating the probability that the observable variable\nassociated with the vertex receives the value 1 conditioned that the hidden variable associated with the vertex\nhas received this value). (d) A HOT-mixture consisting of two HOTs. The mixing probability for T1 is 0.7 and\nthat for T2 is 0.3. So with probability 0.7 a synthetic tumor is generated from T1 and otherwise one is generated\nfrom T2.\n\nIn Figure 1(b), an OT for CNA is depicted (aberrations are written in the standard notation for CNAs\nin cytogenetic data, i.e., each represents a duplication (+) or deletion (-) of a speci\ufb01c chromosomal\nregion). Notice that an aberration associated with a vertex cannot occur unless the aberration asso-\nciated with its parent has occurred. For instance, the set {+Xp, +17q} cannot be generated by the\nOT in Figure 1(b). In a data-modeling context, this is highly undesirable as data is typically noisy\nand is bound to contain both false positives and negatives. Our HOT model does not suffer from this\nproblem.\nA Hidden-variable Oncogenetic Tree (HOT) is a directed tree where, just like OTs, each vertex\nrepresents a speci\ufb01c aberration. Unlike OTs however, the progression of cancer is modeled with\nhidden variables associated with vertices and conditional probabilities associated with the arcs. 
The observation of the aberrations (the data) is modeled with a different set of random variables whose values are conditioned on the hidden variables.
Formally, a Hidden-variable Oncogenetic Tree (HOT) is a pair T = (T, Θ) where:

1. T is a rooted directed tree and Θ consists of two conditional probability distributions, θX(u) and θZ(u), for each vertex u;
2. two random variables are associated with each vertex: an observable variable X(u) and a hidden variable Z(u), each assuming the values 0 or 1;
3. the hidden variable associated with the root, Z(r), is defined to have a value of one;
4. for each non-root vertex u, θZ(u) is a conditional probability distribution on Z(u) conditioned by Z(p(u)) satisfying Pr[Z(u) = 1|Z(p(u)) = 0] = εZ(u); and
5. for each non-root vertex u, θX(u) is a conditional probability distribution on X(u) conditioned by Z(u) satisfying Pr[X(u) = 1|Z(u) = 0] = εX(u).

With respect to (4), one might argue that Pr[Z(u) = 1|Z(p(u)) = 0] should be zero, since if the progression has not reached p(u) it should not be able to proceed to u. However, the derivation and implementation of the EM algorithm depend on the non-zero value of this probability for much the same reasons that people use pseudo-counts [19], namely, once a parameter receives the value 0 in an EM algorithm for training, it will subsequently not be changed. Moreover, εZ has a natural interpretation: it corresponds to a small probability of spontaneous mutations occurring independently from the overall progressional path that the disease is following. 
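To make the generative process of a HOT concrete, the following sketch samples one synthetic tumor profile from a HOT. This is illustrative Python, not the authors' code; the dictionary-based parameterization and all names are our own.

```python
import random

def sample_hot(parent, p_z, eps_z, p_x, eps_x, rng):
    # parent: vertex -> parent vertex (None for the root), listed in preorder
    # p_z[u]   = Pr[Z(u)=1 | Z(parent)=1]   eps_z[u] = Pr[Z(u)=1 | Z(parent)=0]
    # p_x[u]   = Pr[X(u)=1 | Z(u)=1]        eps_x[u] = Pr[X(u)=1 | Z(u)=0]
    z, x = {}, {}
    for u in parent:  # preorder guarantees the parent is sampled first
        if parent[u] is None:
            z[u] = 1  # the root's hidden variable is defined to be 1
        else:
            p = p_z[u] if z[parent[u]] == 1 else eps_z[u]
            z[u] = 1 if rng.random() < p else 0
        q = p_x[u] if z[u] == 1 else eps_x[u]
        x[u] = 1 if rng.random() < q else 0
    return z, x
```

With all edge probabilities at their extremes the sketch is deterministic; synthetic data sets like those in Section 3 would be obtained by repeated calls with a seeded generator.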
Similar arguments\napply to (5) where we interpret \u0001X as the small probability of falsely detecting an aberration that is\nnot actually present (corresponding to a false positive test).\nWe note here that it is possible to have CPDs where X(u) and Z(u) depend on both X(p(u)) and\nZ(p(u)), and even to let X(u) depend on all three of Z(u), X(p(u)), and Z(p(u)). We note here\nthat our arguments can easily be extended to cover these cases, although we will not consider them\nfurther in the following text. Figure 1(c) shows an example of a HOT where \u0001Z and \u0001X have been\nomitted for clarity.\n\n2.2 The novel global structural EM algorithm for HOTs\n\nWe have derived a global structural Expectation Maximization (EM) algorithm for inferring HOTs\nfrom data. According to standard EM theory [20], such an algorithm is obtained if there is a proce-\ndure that given a HOT T \ufb01nds a HOT T (cid:48) that maximizes the so-called complete log-likelihood (also\nknown as the Q-term):\n\nQ(T (cid:48);T ) =\n\nPr[Z|X,T ] log Pr[Z, X|T (cid:48)].\n\n(cid:88)\n\n(cid:88)\n\nX\u2208D\n\nZ\n\nThe likelihood of T (cid:48) is guaranteed to be at least as high as T , which immediately leads to an\niterative procedure. In standard EM, the Q-term is maximized only over the parameters of a model,\nin our case the conditional probabilities, leaving the structure, i.e., the directed tree, unchanged.\nFriedman et al. [3] extended the use of EM algorithms from the standard parameter estimation to\nalso \ufb01nding an optimal structure. In their case, the probabilistic model was reversible and the tree\nthat maximized the expected complete log-likelihood could be obtained using a maximum spanning\ntree algorithm. In our case, the pair-wise relations between hidden variables are asymmetric and a\nmaximum spanning tree algorithm cannot be used. 
However, as we show below, the Q-term can be maximized by instead using Edmonds's optimal branching algorithm.
When dealing with mixtures of HOTs in later sections, we will need to maximize the weighted version of the Q-term, which we introduce already here:

Qf(T′; T) = ∑_{X∈D} ∑_Z f(X) Pr[Z|X, T] log Pr[Z, X|T′],   (1)

where f is a weight function on the data points in D that can be computed in constant time.
By expanding and rearranging the terms in (1) (see the appendix), it can be shown that Qf(T′; T) equals

∑_{⟨u,v⟩∈A(T′)} ∑_{a,b∈{0,1}} ∑_{X∈D} f(X) Pr[Z(v) = a, Z(u) = b|X, T] log Pr[Z(v) = a|Z(u) = b, θ′Z(v)]
+ ∑_{⟨u,v⟩∈A(T′)} ∑_{σ,a∈{0,1}} ∑_{X∈D: X(v)=σ} f(X) Pr[Z(v) = a|X, T] log Pr[X(v) = σ|Z(v) = a, θ′X(v)].

As long as the directed tree T′ is fixed, the standard EM methodology (see for instance [19]) can be used to find the Θ′ that maximizes Qf(T′, Θ′; T) as follows. First, let

Au(a, b) = ∑_{X∈D} f(X) Pr[Z(u) = a, Z(p′(u)) = b|X, T]

and

Bu(σ, a) = ∑_{X∈D: X(u)=σ} f(X) Pr[Z(u) = a|X, T].

Then the Θ′ that, for a fixed T′, maximizes Qf(T′; T) (i.e. Qf(T′, Θ′; T)) is given by

Pr[Z(u) = a|Z(p′(u)) = b, θ′Z(u)] = Au(a, b) / ∑_{a∈{0,1}} Au(a, b)   (2)

and

Pr[X(u) = σ|Z(u) = a, θ′X(u)] = Bu(σ, a) / ∑_{σ∈{0,1}} Bu(σ, a).   (3)

The time required for computing the right-hand sides of (2) and (3) is O(n²), where n is the number of aberrations (the probabilities Pr[Z(u) = a, Z(v) = b|X, T] can be computed using techniques analogous to those appearing in [3]).
For each arc ⟨p, u⟩ of T′, using the CPDs defined above, we define the weight of the arc, specific to this tree, to be

∑_{a,b∈{0,1}} ∑_{X∈D} f(X) Pr[Z(u) = a, Z(p′(u)) = b|X, T] log Pr[Z(u) = a|Z(p′(u)) = b, θ′Z(u)]
+ ∑_{a∈{0,1}} ∑_{X∈D} f(X) Pr[Z(u) = a|X, T] log Pr[X(u)|Z(u) = a, θ′X(u)].

We now make two important observations from which it follows how to maximize the weighted expected complete log-likelihood over all directed trees. First, notice that if two directed trees T′ and T′′ have a common arc ⟨p, u⟩, then this arc has the same weight in these two trees (since the weight of the arc does not depend on any other arc in the tree). Let G be the directed, complete, and arc-weighted graph with the same vertex set as the tree T, and with arc weights given by the above expression.
An optimal arborescence of a directed graph is a rooted directed tree on the same set of vertices as the directed graph, i.e., a subgraph that has exactly one directed path from one specified vertex called the root to any other vertex, and has maximum arc weight sum among all such rooted directed trees. 
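Because each arc weight depends only on its own endpoints, the M-step reduces to a maximum-weight spanning arborescence problem. On very small graphs the optimum can be found by brute force, which makes the reduction concrete and gives a check for a real implementation; the sketch below (our own names) is only an illustrative stand-in for Edmonds's quadratic-time algorithm.

```python
from itertools import product

def optimal_arborescence_bruteforce(n, root, w):
    # w[u][v] is the weight of arc u -> v; returns (parent map, total weight).
    # Feasible only for small n: tries every choice of one parent per non-root
    # vertex and keeps the acyclic choice with maximum total arc weight.
    others = [v for v in range(n) if v != root]
    best_total, best_parents = float('-inf'), None
    for choice in product(range(n), repeat=len(others)):
        parents = dict(zip(others, choice))
        # a parent assignment is an arborescence iff every vertex reaches the root
        ok = True
        for v in others:
            seen, u = set(), v
            while u != root and ok:
                if u in seen:
                    ok = False  # cycle (this also rejects self-loops)
                else:
                    seen.add(u)
                    u = parents[u]
            if not ok:
                break
        if not ok:
            continue
        total = sum(w[parents[v]][v] for v in others)
        if total > best_total:
            best_total, best_parents = total, parents
    return best_parents, best_total
```

In the setting above, w[u][v] would be the tree-specific arc weight just defined, and the returned parent map is the new structure T′ of the M-step.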
For any arborescence T′ of G, the sum of the arc weights equals, by the construction of G, the maximum value of Qf(T′, Θ′; T) over all Θ′. From this it follows that a (spanning) directed tree T′ is an optimal arborescence of G if and only if T′ maximizes the Qf-term. And so, applying Edmonds's algorithm to G gives the desired directed tree. Tarjan's implementation of Edmonds's algorithm runs in quadratic time [15, 16, 17]. Hence, the total running time for the algorithm is O(|D| · n²).

2.3 HOT-mixtures

In this section we extend our model to HOT-mixtures by including an initial random choice of one of several HOTs and letting the final outcome be generated by the chosen HOT. We will also obtain an EM-based model-training algorithm for HOT-mixtures by showing how to optimize the expected complete log-likelihood for HOT-mixtures. Formally, we will use k HOTs T1, . . . , Tk and a random mixing variable I that takes on values in 1, . . . , k. The probability that I = i is denoted λi, and λ = (λ1, . . . , λk) is a vector of parameters of the model in addition to those of the HOTs (λ1, . . . , λk are constrained to sum to 1). The following notation is convenient:

γi(X) = Pr[I = i|X, M] = λi Pr[X|Ti] / ∑_{j∈[k]} λj Pr[X|Tj].

For a HOT-mixture, the expected complete log-likelihood can be expressed as follows:

∑_{X∈D} ∑_{Z,I} Pr[Z, I|X, M] log Pr[Z, I, X|M′].   (4)

Using standard EM methodology, it is possible to show that (4) can be maximized by independently maximizing

∑_{i∈[k]} ∑_{X∈D} γi(X) log(λ′i)   (5)

and, for each i = 1, . . . , k, maximizing

∑_{X∈D} ∑_Z Pr[Z|X, Ti] γi(X) log(Pr[Z, X|T′i]).   (6)

Finding a λ′ = (λ′1, . . . , λ′k) maximizing (5) is straightforward (see for instance [19]) and, for each i = 1, . . . , k, finding a T′i that maximizes the weighted Q-term in (6) can be done as described in the previous subsections (with γi(X) weighting the data points).

3 Results

In this section, we report results obtained by applying our algorithms to synthetic and cytogenetic cancer data.
In the standard version of the EM algorithm, there are four parameters per edge of a HOT. The number of parameters can be reduced by letting some parameters be global, e.g., by letting εX(u) = εX(u′) for all vertices u and u′. There are three parameters whose global estimation is desirable: εX, εZ, and Pr[X(u) = 0|Z(u) = 1]. However, for technical reasons, requiring that εZ be global makes it impossible to derive an EM algorithm. Therefore, we will distinguish between two different versions of the algorithm: one with free parameters and one with global parameters. The free-parameter version then corresponds to the standard EM algorithm, while the global-parameter version corresponds to letting εX and Pr[X(u) = 0|Z(u) = 1] be global. When evaluating the global-parameter version of the algorithm using synthetic data, we will follow the convention of letting all three error parameters be global when generating data.
Other conventions used for all the tests described here include the following. We enforce an upper limit of 0.5 on εZ and εX. Also, for each data set, we first run the algorithm on a set of randomly generated start HOTs or start HOT-mixtures for 10 iterations. The HOT or HOT-mixture that results in the best likelihood is then run until convergence. 
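The restart scheme just described is independent of the model internals. As a hypothetical sketch (our interface names, not the authors' implementation), it can be phrased in terms of a single em_step routine that returns an updated model together with its log-likelihood:

```python
def run_em(model, em_step, n_iter):
    # run a fixed number of EM iterations, returning the last model and log-likelihood
    ll = float('-inf')
    for _ in range(n_iter):
        model, ll = em_step(model)
    return model, ll

def restart_em(random_init, em_step, n_starts=100, burn_in=10, tol=1e-6, max_iter=1000):
    # short burn-in runs from random starts; the best one is run until convergence
    candidates = [run_em(random_init(), em_step, burn_in) for _ in range(n_starts)]
    model, ll = max(candidates, key=lambda c: c[1])
    for _ in range(max_iter):
        model, new_ll = em_step(model)
        if new_ll - ll < tol:  # EM iterations never decrease the likelihood
            return model, new_ll
        ll = new_ll
    return model, ll
```

For HOTs, em_step would compute the posteriors of the E-step and then apply the global M-step of Section 2.2 (parameter update plus optimal branching); for HOT-mixtures, it would additionally update λ via (5).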
Unless stated otherwise, the number of start trees and mixtures is 100.

3.1 Tests on Synthetic Data Sets

3.1.1 Single HOTs

We generated random HOTs with 10, 25, and 40 vertices with parameters on the edges chosen uniformly in the intervals

Pr[Z(u) = 1|Z(p(u)) = 1] ∈ [0.1, 1.0],   (7)
Pr[X(u) = 0|Z(u) = 1], εX, εZ ∈ [0.01, q],   (8)

where q ∈ {0.05, 0.10, 0.25, 0.50}. For each combination, we generated 100 HOTs for a total of 3 × 4 × 100 = 1200 HOTs. Figure 2 shows the result of our experiments on synthetic data. An edge of the generated HOT connecting one specific aberration to another is considered to have been correctly recovered if the HOT obtained from the algorithm connects the same two aberrations in the same direction. We also compared the performance of our algorithms with that of Mtreemix by Beerenwinkel et al. [11]. The generated data from our single HOTs were passed to Mtreemix and the same criteria as above were used to detect correctly recovered edges (no special options were set when running Mtreemix on data generated with global parameters, since no distinction between global and free parameters can be made on oncogenetic trees). Mtreemix outperforms our methods when the HOTs and the error parameters are small, while our algorithms outperform Mtreemix significantly as the size of the HOTs or error parameters becomes larger.

Figure 2: Histograms showing the mean percentage of edges that were correctly recovered by the algorithm for the free parameter case together with error bars showing one standard deviation.

Figure 3: Histograms showing the proportion of edges correctly recovered by the EM algorithm for HOT-mixtures with global parameters on two HOTs with 25 vertices each. Each bar represents 100 mixtures. Error bars show one standard deviation.

3.1.2 HOT Mixtures

We also tested the ability of the algorithm to recover a mixture of two HOTs. 
The results are shown in Figure 3. When measuring the number of correctly recovered edges, the following procedure was used. Each HOT produced from the algorithm was compared to each HOT from which the data was generated, and the number of correctly recovered edges was noted. The best way of matching the two HOTs produced from the algorithm with the two original HOTs was then determined. Two features can clearly be distinguished: the results improve as the size of the data increases, and the algorithm performs better when the HOTs have equal probability in the mixture.

Figure 4: HOTs obtained from RCC data. (a) shows an adapted version of the pathways for CC data published in [21]. (b) is a figure adapted from [22] showing the pathways obtained from statistical analysis of RCC data. (c) and (d) are the HOTs we obtained from the RCC data using only aberrations on the left and right pathways in (b), respectively. Notice the high level of agreement between the root-to-leaf paths in the recovered HOTs with those in (b).

3.2 Tests on Cancer Data

Our cytogenetic data for colon (CC) and kidney (RCC) cancer consist of 512 and 998 tumors, respectively. The data consist of measurements on 41 common aberrations (18 gains, 23 losses) for CC and 28 (13 gains, 15 losses) for RCC. The data have previously been analyzed in [21] and [22], resulting in suggested pathways of progression. These analyses were based on Principal Component Analysis (PCA) performed on correlations between aberrations and a statistical measure called time of occurrence (TO), which measures how early or late an aberration occurs during progression. The aberrations were then clustered based on the PCA and each cluster was manually formed into a pathway (based on PCA and TO). One advantage of our approach is that we are able to replace the manual curation by automated computational steps. Another advantage is that our models assign probabilities to data, so the different models can be compared objectively.
We expect εZ and εX to be small in the type of data that we are using. We obtained the n most correlated aberrations in our CC data, for n ∈ {4, . . . , 11}, and tested different upper limits on εZ and εX. By counting the number of bad edges, the best correspondence to previously published analyses of the data was found when εZ ≤ 0.25 and εX ≤ 0.01. A bad edge is one that contradicts the partial ordering given by the pathways described in [21], of which the relevant part is shown in Figure 4(a).
Having found upper limits that work well on the CC data, we applied the algorithm with these upper bounds to the RCC data. The earlier analyses in [22] strongly suggest that two HOTs are required to model the RCC data. Given that our mixture model appears, from our tests on synthetic data, to require substantially more data points to recover the underlying HOTs in a satisfactory manner, we used the results of the analysis in [22] to divide the aberrations into two (overlapping) clusters for which we created HOTs separately. These HOTs can be seen in Figures 4(c) and 4(d), and they show very good agreement with the pathways from [22] shown in Figure 4(b). For instance, each root-to-leaf path in the HOT of Figure 4(c) agrees perfectly with the pathway shown in Figure 4(b).

References

[1] C. Chow and C. Liu. Approximating discrete probability distributions with dependence trees. IEEE Trans Inform Theor, 14(3):462–467, 1968.

[2] M. Meila and M.I. Jordan. Learning with mixtures of trees. J Mach Learn Res, 1(1):1–48, 2000.

[3] N. Friedman, M. Ninio, I. Pe'er, and T. Pupko. A structural em algorithm for phylogenetic inference. J Comput Biol, 9(2):331–353, 2002.

[4] N. Friedman. The bayesian structural em algorithm. In Proceedings of the Conference on Uncertainty in Artificial Intelligence, pages 129–138. Morgan Kaufmann, 1998.

[5] P. Leray and O. François. Bayesian network structural learning and incomplete data. Proceedings of the International and Interdisciplinary Conference on Adaptive Knowledge Representation and Reasoning (AKRR 2005), pages 33–40, 2005.

[6] R. Desper, F. Jiang, O.P. Kallioniemi, H. Moch, C.H. Papadimitriou, and A.A. Schaffer. Inferring tree models for oncogenesis from comparative genome hybridization data. J Comput Biol, 6(1):37–51, 1999.

[7] M. Hjelm, M. Höglund, and J. Lagergren. New probabilistic network models and algorithms for oncogenesis. J Comput Biol, 13(4):853–865, May 2006.

[8] M.D. Radmacher, R. Simon, R. Desper, R. Taetle, A.A. Schaffer, and M.A. Nelson. 
The earlier analyses in [22] strongly suggests that two HOTs are required\nto model the RCC data. Given that our mixture model appears, from our tests on synthetic data, to\nrequire substantially more data points to recover the underlying HOTs in a satisfactory manner, we\nused the results of the analysis in [22] to divide the aberrations into two (overlapping) clusters for\nwhich we created HOTs separately. These HOTs can be seen in Figure 4(c) and 4(d) and they show\nvery good agreement to the pathways from [22] shown in Figure 4(b). For instance, each root-to-leaf\npath in the HOT of Figure 4(c) agrees perfectly with the pathway shown in Figure 4(b).\n\n8\n\n\fReferences\n\n[1] C. Chow and C. Liu. Approximating discrete probability distributions with dependence trees. IEEE Trans\n\nInform Theor, 14(3):462\u2013467, 1968.\n\n[2] M. Meila and M.I. Jordan. Learning with mixtures of trees. J Mach Learn Res, 1(1):1\u201348, 2000.\n[3] N. Friedman, M. Ninio, I. Pe\u2019er, and T. Pupko. A structural em algorithm for phylogenetic inference. J\n\nComput Biol, 9(2):331\u2013353, 2002.\n\n[4] N. Friedman. The bayesian structural em algorithm. In Proceedings of the Conference on Uncertainty in\n\nArti\ufb01cial Intelligence, pages 129\u2013138. Morgan Kaufmann, 1998.\n\n[5] P Leray and O Franc\u00b8ois. Bayesian network structural learning and incomplete data. Proceedings of the\nInternational and Interdisciplinary Conference on Adaptive Knowledge Representation and Reasoning\n(AKRR 2005), pages 33\u201340, 2005.\n\n[6] R. Desper, F. Jiang, O.P. Kallioniemi, H. Moch, C.H. Papadimitriou, and A.A. Schaffer. Inferring tree\nmodels for oncogenesis from comparative genome hybridization data. J Comput Biol, 6(1):37\u201351, 1999.\n[7] M. Hjelm, M. H\u00a8oglund, and J. Lagergren. New probabilistic network models and algorithms for oncoge-\n\nnesis. J Comput Biol, 13(4):853\u2013865, May 2006.\n\n[8] M.D. Radmacher, R. Simon, R. Desper, R. Taetle, A.A. Schaffer, and M.A. Nelson. 
Graph models of\n\noncogenesis with an application to melanoma. J Theor Biol, 212(4):535\u201348, Oct 2001.\n\n[9] N. Beerenwinkel, J. Rahnenfuhrer, M. Daumer, D. Hoffmann, R. Kaiser, J. Selbig, and T. Lengauer.\nLearning multiple evolutionary pathways from cross-sectional data. J Comput Biol, 12(6):584\u2013598, Jul\n2005.\n\n[10] J. Rahnenfuhrer, N. Beerenwinkel, W.A. Schulz, C. Hartmann, A. von Deimling, B. Wullich, and\nT. Lengauer. Estimating cancer survival and clinical outcome based on genetic tumor progression scores.\nBioinformatics, 21(10):2438\u20132446, May 2005.\n\n[11] N. Beerenwinkel, J. Rahnenfuhrer, R. Kaiser, D. Hoffmann, J. Selbig, and T. Lengauer. Mtreemix: a soft-\nware package for learning and using mixture models of mutagenetic trees. Bioinformatics, 21(9):2106\u2013\n2107, 2005.\n\n[12] N. Beerenwinkel, N. Eriksson, and B. Sturmfels. Conjunctive bayesian networks. Bernoulli, 13(4):893\u2013\n\n909, Jan 2007.\n\n[13] N. Beerenwinkel, N. Eriksson, and B. Sturmfels. Evolution on distributive lattices.\n\n242(2):409\u201320, Sep 2006.\n\nJ Theor Biol,\n\n[14] M. Gerstung, M. Baudis, H. Moch, and N. Beerenwinkel. Quantifying cancer progression with conjunc-\n\ntive bayesian networks. Bioinformatics, 25(21):2809\u201315, Nov 2009.\n\n[15] R.E. Tarjan. Finding optimum branchings. Networks, 7(1):25\u201336, 1977.\n[16] R.M. Karp. A simple derivation of edmond\u2019s algorithm for optimum branching. Networks, 1(265-272):5,\n\n1971.\n\n[17] P. Camerini, L. Fratta, and F. Maf\ufb01oli. The k best spanning arborescences of a network. Networks,\n\n10(2):91\u2013110, 1980.\n\n[18] F. Mitelman, B. Johansson, and F. Mertens (Eds.). Mitelman database of chromosome aberrations and\n\ngene fusions in cancer, 2010. http://cgap.nci.nih.gov/Chromosomes/Mitelman.\n\n[19] R. Durbin, S.R. Eddy, A. Krogh, and G. Mitchison. Biological Sequence Analysis. Cambridge University\n\nPress, Cambridge, 1998.\n\n[20] A.P. Dempster, N.M. Laird, and D.B. Rubin. 
Maximum likelihood from incomplete data via the em algorithm. J Roy Stat Soc B, 39(1):1–38, 1977.

[21] M. Höglund, D. Gisselsson, G.B. Hansen, T. Säll, F. Mitelman, and M. Nilbert. Dissecting karyotypic patterns in colorectal tumors: Two distinct but overlapping pathways in the adenoma-carcinoma transition. Canc Res, 62:5939–5946, 2002.

[22] M. Höglund, D. Gisselsson, M. Soller, G.B. Hansen, P. Elfving, and F. Mitelman. Dissecting karyotypic patterns in renal cell carcinoma: an analysis of the accumulated cytogenetic data. Canc Genet Cytogenet, 153(1):1–9, 2004.", "award": [], "sourceid": 130, "authors": [{"given_name": "Ali", "family_name": "Tofigh", "institution": null}, {"given_name": "Erik", "family_name": "Sjölund", "institution": null}, {"given_name": "Mattias", "family_name": "Höglund", "institution": null}, {"given_name": "Jens", "family_name": "Lagergren", "institution": null}]}