{"title": "Multiplicative Forests for Continuous-Time Processes", "book": "Advances in Neural Information Processing Systems", "page_first": 458, "page_last": 466, "abstract": "Learning temporal dependencies between variables over continuous time is an important and challenging task. Continuous-time Bayesian networks effectively model such processes but are limited by the number of conditional intensity matrices, which grows exponentially in the number of parents per variable. We develop a partition-based representation using regression trees and forests whose parameter spaces grow linearly in the number of node splits. Using a multiplicative assumption we show how to update the forest likelihood in closed form, producing efficient model updates. Our results show multiplicative forests can be learned from few temporal trajectories with large gains in performance and scalability.", "full_text": "Multiplicative Forests for Continuous-Time Processes\n\nJeremy C. Weiss\n\nUniversity of Wisconsin\nMadison, WI 53706, USA\njcweiss@cs.wisc.edu\n\nSriraam Natarajan\n\nWake Forest University\n\nWinston Salem, NC 27157, USA\nsnataraj@wakehealth.edu\n\nDavid Page\n\nUniversity of Wisconsin\nMadison, WI 53706, USA\npage@biostat.wisc.edu\n\nAbstract\n\nLearning temporal dependencies between variables over continuous time is an\nimportant and challenging task. Continuous-time Bayesian networks effectively\nmodel such processes but are limited by the number of conditional intensity matri-\nces, which grows exponentially in the number of parents per variable. We develop\na partition-based representation using regression trees and forests whose param-\neter spaces grow linearly in the number of node splits. Using a multiplicative\nassumption we show how to update the forest likelihood in closed form, produc-\ning ef\ufb01cient model updates. 
Our results show multiplicative forests can be learned from few temporal trajectories with large gains in performance and scalability.

1 Introduction

The modeling of temporal dependencies is an important and challenging task with applications in fields that use forecasting or retrospective analysis, such as finance, biomedicine, and anomaly detection. While analyses over time series data with fixed, discrete time intervals are well studied, as for example in [1], there are domains in which discretizing the time leads to intervals where no observations are made, producing "missing data" in those periods, or there is no natural discretization available and so the time series assumptions are restrictive. Of note, experiments in previous work provide evidence that coercing continuous-time data into time series and conducting time series analysis is less effective than learning models built with continuous-time data in mind [2].

We investigate a subset of continuous-time models: probabilistic models over finite event spaces across continuous time. The prevailing model in this field is the continuous-time Markov process (CTMP), a model that provides an initial distribution over states and a rate matrix parameterizing the rate of transitioning between states. However, this model does not scale for the case where a CTMP state is a joint state over many variable states. Because the number of joint states is exponential in the number of variables, the size of the CTMP rate matrix grows exponentially in the number of variables.
Continuous-time Bayesian networks (CTBNs), a family of CTMPs with a factored representation, encode rate matrices for each variable and the dependencies among variables [3]. Figure 1 shows a complete trajectory, i.e., a timeline where the state of each variable is known for all times t, for a CTMP with four joint states (a, b), (a, B), (A, b), and (A, B) factorized into two binary CTBN variables α and β (with states a and A, and b and B, respectively).

Previous work on CTBNs includes several approaches to performing CTBN inference [4, 5, 6, 7, 8] and learning [2, 3]. Briefly, CTBNs do not admit exact inference without transformation to the exponential-size CTMP. Approximate inference methods including expectation propagation [4], mean field [6], importance sampling-based methods [7], and MCMC [8] have been applied, and while these methods have helped mitigate the inference problem, inference in large networks remains a challenge. CTBN learning involves parameter learning using sufficient statistics (e.g., numbers of transitions M and durations T in Figure 1) and structure learning over a directed (possibly cyclic) graph over the variables to maximize a penalized likelihood score. Our work addresses learning in a generalized framework to which the inference methods mentioned above can be extended.

In this work we introduce a generalization of CTBNs: partition-based CTBNs. Partition-based CTBNs remove the restriction used in CTBNs of storing one rate matrix per parents setting for every variable. Instead, partition-based CTBNs define partitions over the joint state space and define the transition rate of each variable to be dependent on the membership of the current joint state in an element (part) of a partition. As an example, suppose we have partition P composed of parts p1 = {(a, b), (A, b)} and p2 = {(a, B), (A, B)}.
Then the transition into si from joint state (A, B) in Figure 1 would be parameterized by transition rate q_{a|p2}. Partition-based CTBNs store one transition rate per part, as opposed to one transition rate matrix per parents setting. Later we will show that, for a particular choice of partitions, a partition-based CTBN is equivalent to a CTBN. However, the more general framework offers other choices of partitions which may be more suitable for learning from data.

Partition-based CTBNs avoid one limitation of CTBNs: that the model size is necessarily exponential in the maximum number of parents per variable. For networks with sparse incoming connections, this issue is not apparent. However, in many real domains, a variable's transition rate may be a function of many variables.

Given the framework of partition-based CTBNs, we need to provide a way to determine useful partitions. Thus, we introduce partition-based CTBN learning using regression tree modifications in place of CTBN learning using graph operators of adding, reversing, and deleting edges. In the spirit of context-specific independence [9], we can view tree learning as a method for learning compact partition-based dependencies. However, tree learning induces recursive subpartitions, which limits their ability to partition the joint state space. We therefore introduce multiplicative forests for CTBNs, which allow the model to represent up to an exponential number of transition rates with parameters still linear in the number of splits.

[Figure 1 appears here: timelines for the two variables (states A/a and B/b), with arrows marking the transitions and intervals aggregated into M_{a|B}, T_{a|B}, M_b, and T_b.]

Figure 1: Example of a complete trajectory in a two-node CTBN. The arrows show the transitions and time intervals that are aggregated to compute selected sufficient statistics (M's and T's).
A and a denote two states for one variable, and B and b two states for a second variable.

Following canonical tree learning methods, we perform greedy tree and forest learning using iterative structure modifications. We show that the partition-based change in log likelihood can be calculated efficiently in closed form using a multiplicative assumption. We also show that using multiplicative forests, we can efficiently calculate the ML parameters. Thus, we can calculate the maximum change in log likelihood for a forest modification proposal, which gives us the best iterative update to the forest model.

Finally, we conduct experiments to compare CTBNs, regression tree CTBNs (treeCTBNs), and multiplicative forest CTBNs (mfCTBNs) on three data sets. Our hypothesis is twofold: first, that learning treeCTBNs and mfCTBNs will scale better to large domains because of their compact model structures, and second, that mfCTBNs will outperform both CTBNs and treeCTBNs with fewer data points because of their ability to capture multiplicative dependencies.

The rest of the paper is organized as follows: in Section 2 we provide background on CTBNs. In Section 3 we present partition-based CTBNs, show that they subsume CTBNs, and define the partitions that tree and forest structures induce. We also describe theoretical advantages of using forests for learning and how to learn these models efficiently. We present results in Section 4 showing that forest CTBNs are scalable to large state spaces and learn better than CTBNs, from fewer examples and in less time. Finally, in Sections 5 and 6 we identify connections to functional gradient boosting and related continuous-time processes and discuss how our work addresses one limitation that prevents CTBNs from finding widespread use.

2 Background

CTBNs are probabilistic graphical models that capture dependencies between variables over continuous time.
A CTBN is defined by 1) a distribution for the initial state over variables X, given by a Bayesian network B, and 2) a directed (possibly cyclic) graph over variables X with a set of Conditional Intensity Matrices (CIMs) for each variable X ∈ X that hold the rates (intensities) q_{x|u} of variable transitions given their parents U_X in the directed graph. Here a CTBN variable X ∈ X has states x_1, ..., x_k, and there is an intensity q_{x|u} for every state x ∈ X given an instantiation over its parents u ∈ U_X. The intensity corresponds to the rate of transitioning out of state x; the probability density function for staying in state x for time t, given an instantiation of parents u, is q_{x|u} e^{-q_{x|u} t}. Given a transition, X moves to some other state x' with probability Θ_{xx'|u}. Taking the product over intervals bounded by single transitions, we obtain the CTBN trajectory likelihood:

∏_{X∈X} ∏_{x∈X} ∏_{u∈U_X} q_{x|u}^{M_{x|u}} e^{-q_{x|u} T_{x|u}} ∏_{x'≠x} Θ_{xx'|u}^{M_{xx'|u}}

where M_{x|u} and M_{xx'|u} are the sufficient statistics indicating the number of transitions out of state x (total, and to x', respectively), and T_{x|u} are the sufficient statistics for the amount of time spent in x given the parents are in state u.

3 Partition-based CTBNs

Here we define partition-based CTBNs, an alternative framework for determining variable transition rates. We give the syntax and semantics of our model, providing the generative model and likelihood formulation. We then show that CTBNs are one instance in our framework. Next, we introduce regression trees and multiplicative forests and describe the partitions they induce, which are then used in the partition-based CTBN framework.
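To make the role of the sufficient statistics concrete, the CTBN likelihood above decomposes in log space into per-parameter terms computable directly from the counts M and durations T. A minimal sketch; the record layout and toy numbers are our own illustration, not the authors' code:

```python
import math

def ctbn_log_likelihood(records):
    """Trajectory log likelihood from CTBN sufficient statistics.

    `records` is a hypothetical layout: one entry per (state x, parent
    setting u) with exit intensity q = q_{x|u}, duration T = T_{x|u},
    per-destination counts M[x'] = M_{xx'|u}, and transition-selection
    probabilities Theta[x'] = Theta_{xx'|u}.
    """
    ll = 0.0
    for r in records:
        m_total = sum(r["M"].values())        # M_{x|u}: total exits from x
        ll += m_total * math.log(r["q"]) - r["q"] * r["T"]
        for xp, m in r["M"].items():          # destination-choice terms
            ll += m * math.log(r["Theta"][xp])
    return ll

# Toy binary variable: rate 0.5 out of x, two observed exits, 4 time units in x.
records = [{"q": 0.5, "T": 4.0, "M": {"x'": 2}, "Theta": {"x'": 1.0}}]
print(ctbn_log_likelihood(records))  # 2*log(0.5) - 0.5*4 = -3.3863...
```

Maximizing this objective per parameter is what drives both the CTBN learners and the partition-based learners discussed next; only the grouping of the statistics changes.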
Finally, we discuss the advantages of using trees and forests in terms of learning compact models efficiently.

Let X be a finite set of discrete variables X of size n, with each variable X having a discrete set of states {x_1, x_2, ..., x_k}, where k may differ for each variable. We define a joint state s = {x_1, x_2, ..., x_n} over X, where the subscript indicates the variable index. We also define the partition space P = X.¹ We will shortly define set partitions P over P, composed of disjoint parts p, each of which holds a set of elements s.

Next we define the dynamics of the model, which form a continuous-time process over X. Each variable X transitions among its states with rate parameter q_{x'|s} for entering state x' given the joint state s.² This rate parameter (called an intensity) parameterizes the exponential distribution for transitioning into x', given by the pdf p(x', s, t) = q_{x'|s} e^{-q_{x'|s} t} for time t ∈ [0, ∞).

A partition-based CTBN has a collection of set partitions P over P, one P_{x'} for every variable state x'. For shorthand, we will often denote p = P_{x'}(s) to indicate the part p of partition P_{x'} to which state s belongs. We define the intensity parameter as q_{x'|s} = q_{x'|p} for all s ∈ p. Note that this fixes the intensity to be the same for every s ∈ p, and also note that the set of parts p covers P. The pdf for transitioning is given by p(x', s, t) = p(x', P_{x'}(s), t) = q_{x'|p} e^{-q_{x'|p} t} for all s in p.

Now we are ready to define the partition-based CTBN model. A partition-based CTBN model M is composed of a distribution over the initial state of our variables, defined by a Bayesian network B, and a set of partitions P_{x'} for every variable state x' with corresponding sets of intensities q_{x'|p}.

The partition-based CTBN provides a generative framework for producing a trajectory z defined by a sequence of (state, time) pairs (si, ti).
Given an initial state s0, transition times are sampled for each variable state x' according to p(x', P_{x'}(s0), t). The next state is selected based on the transition to the x' with the shortest time, after which the transition times are resampled according to p(x', si, t). Due to the memoryless property of exponential distributions, no resampling of the transition time for x' is needed if p(x', si, t) = p(x', si−1, t). The trajectory terminates when all sampled transition times exceed a specified ending time.

¹Note we can generalize this to larger spaces P = R × X, where R is an external state space as in [10], but for our analysis we restrict R to be a single element r, i.e., P ≅ X.

²Of note, partition-based CTBNs are modeling the intensity of transitioning to the recipient state x', rather than from the donor state x, because we are more often interested in the causes of entering a state.

Given a trajectory z, we can also define the model likelihood. For each interval ti, the joint state remains unchanged, and then one variable transitions into x'. The likelihood given the interval is q_{x'|s_{i−1}} ∏_{X∈X} ∏_{x∈X} e^{-q_{x|s_{i−1}} ti}, i.e., the product of the probability density for x' and the probability that no other variable transitions before ti. Taking the product over all intervals in z, we get the model likelihood:

∏_{X∈X} ∏_{x'∈X} ∏_s q_{x'|s}^{M_{x'|s}} e^{-q_{x'|s} T_s}        (1)

where M_{x'|s} is the number of transitions into x' from state s, and T_s is the total duration spent in s. Combining terms based on the membership of s to p and defining M_{x'|p} = Σ_{s∈p} M_{x'|s} and T_p = Σ_{s∈p} T_s, we get:

Eq. (1) = ∏_{X∈X} ∏_{x'∈X} ∏_{p∈P_{x'}} q_{x'|p}^{M_{x'|p}} e^{-q_{x'|p} T_p}

3.1 CTBN as a partition-based CTBN

Here we show that CTBNs can be viewed as an instance of partition-based CTBNs.
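The generative semantics above (one exponential clock per candidate variable state, with the earliest transition winning) can be sketched as follows; the binary 0/1 state encoding and the toy intensity function are illustrative assumptions rather than the paper's code:

```python
import random

def sample_trajectory(q, s0, t_end, seed=0):
    """Sample a trajectory by competing exponential clocks.

    q(v, x_new, s) gives the intensity with which variable v enters state
    x_new from joint state s (a dict). By the memoryless property, we may
    resample every clock after each transition without changing the
    distribution (the text notes the clock for an unchanged intensity
    need not actually be resampled).
    """
    rng = random.Random(seed)
    s, t = dict(s0), 0.0
    traj = [(dict(s0), 0.0)]
    while True:
        # one candidate transition time per (variable, new state) pair
        cands = {(v, x): rng.expovariate(q(v, x, s))
                 for v in s for x in (0, 1) if x != s[v]}
        (v, x), dt = min(cands.items(), key=lambda kv: kv[1])
        if t + dt > t_end:          # every sampled time exceeds the end time
            return traj
        t += dt
        s[v] = x                    # take the earliest transition, into x
        traj.append((dict(s), t))

# Toy two-variable model: entering state 1 is twice as fast when the other
# variable is already in state 1 (a crude multiplicative dependency).
q = lambda v, x, s: 2.0 if x == 1 and any(s[w] == 1 for w in s if w != v) else 1.0
traj = sample_trajectory(q, {"alpha": 0, "beta": 0}, t_end=5.0)
print(traj[0])  # starts at the initial joint state at time 0.0
```

Here the intensity lookup plays the role of P_{x'}(s): any function that is constant on the parts of the partition yields a valid partition-based CTBN sampler.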
Each variable X is given a parent set U_X, and the transition intensities q_{x|u} are recorded for leaving donor states x given the current setting of the parents u ∈ U_X. The CTBN likelihood can be shown to be:

∏_{X∈X} ∏_{x∈X} ∏_{u∈U_X} e^{-q_{x|u} T_{x|u}} ∏_{x'≠x} q_{xx'|u}^{M_{xx'|u}}        (2)

as in [5], where q_{xx'|u} and M_{xx'|u} denote the intensity and number of transitions from state x to state x' given parents setting u, and Σ_{x'≠x} q_{xx'|u} = q_{x|u}. Rearranging the product from Equation 2, we achieve a likelihood in terms of recipient states x':

Eq. (2) = ∏_{X∈X} ∏_{x∈X} ∏_{u∈U_X} ∏_{x'≠x} q_{xx'|u}^{M_{xx'|u}} e^{-q_{xx'|u} T_{x|u}}
        = ∏_{X∈X} ∏_{x'∈X} ∏_{p∈P_{x'}} q_{x'|p}^{M_{x'|p}} e^{-q_{x'|p} T_p}        (3)

where we define p as {x} × {u} × (X \ (X × U_X)) in each partition P_{x'}, and likewise q_{x'|p} = q_{xx'|u}, M_{x'|p} = M_{xx'|u}, and T_p = T_{x|u}. Thus, CTBNs are one instance of partition-based CTBNs, with partitions corresponding to a specified donor state x and parents setting u.

3.2 Tree and forest partitions

Trees and forests induce partitions over a space defined by the set of possible split criteria [11]. Here we will define Conditional Intensity Trees (CITs): regression trees that determine the intensities q_{x'|p} by inducing a partition over P. Similarly, we will define Conditional Intensity Forests (CIFs), where tree intensities are named intensity factors whose product determines q_{x'|p}. An example of a CIF, composed of a collection of CITs, is shown later in the experiment results in Figure 4.

Formally, a Conditional Intensity Tree (CIT) f_{x'} is a directed tree structure on a graph G(V, E) with nodes V and edges E(Vi, Vj). Internal nodes Vi of the tree hold splits σ_{Vi} = (π_{Vi}, {E(Vi, ·)}) composed of surjective maps π_{Vi}: s ↦ E(Vi, Vj) and lists of the outgoing edges.
The maps π induce partitions over P and endow each outgoing edge E(Vi, Vj) with a part p_{Vj}. External nodes l, or leaves, hold non-negative real values q^{CIT}_{x'|p} called intensities. A path ρ from the root to a leaf induces a part p, which is the intersection of the parts on the edges of the path: p = ∩_{E(Vi,Vj)∈ρ} p_{Vj}. The parts corresponding to paths of a CIT form a partition over P, which can be shown easily using induction and the fact that the maps π_{Vi} induce disjoint parts p_{Vj} that cover P.

A Conditional Intensity Forest (CIF) F_{x'} is a set of CITs {f_{x'}}. Because the parts of each CIT form a partition, a CIF induces a joint partition over P, where a part p is the set of states s that have the same paths through all CITs. Finally, a CIF produces intensities from joint states by taking the product over the intensity factors from each CIT: q^{CIF}_{x'|p_CIF} = ∏_{f_{x'}} q^{CIT}_{x'|p_CIT}.

Using regression trees and forests can greatly reduce the number of model parameters. In CTBNs, the number of parameters grows exponentially in the number of parents per node. In tree and forest CTBNs, the number of parameters may be linear in the number of parents per node, exploiting the efficiency of using partitions. Notably, however, tree CTBNs are limited to having one intensity per parameter. In forest CTBNs, the number of intensities can be exponential in the number of parameters. Thus, the forest model has much greater potential expressivity per parameter than the other models. We quantify these differences in the Supplementary Materials at our website.

3.3 Forest CTBN learning

Here we discuss the reasoning for using the multiplicative assumption and derive the changes in likelihood given modifications to the forest structure. Previous forest learners have used an additive assumption, e.g., averaging and aggregating, thereby taking advantage of properties of ensembles [12, 13].
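Concretely, a CIF evaluates a joint state by routing it down every tree and multiplying the resulting leaf factors. A minimal sketch; the dict-based tree encoding and the split predicates are our own illustrative choices, not the paper's code:

```python
def cit_factor(tree, s):
    """Walk one Conditional Intensity Tree to its leaf for joint state s."""
    node = tree
    while "split" in node:                    # internal node: apply predicate
        node = node["yes"] if node["split"](s) else node["no"]
    return node["factor"]                     # leaf: intensity factor

def cif_intensity(forest, s):
    """CIF intensity q_{x'|s}: product of the per-tree leaf factors."""
    q = 1.0
    for tree in forest:
        q *= cit_factor(tree, s)
    return q

# Hypothetical two-tree forest for one recipient state x'.
forest = [
    {"split": lambda s: s["smoker"],
     "yes": {"factor": 2.0}, "no": {"factor": 1.0}},
    {"split": lambda s: s["hypertensive"],
     "yes": {"factor": 1.5}, "no": {"factor": 1.0}},
]
print(cif_intensity(forest, {"smoker": True, "hypertensive": True}))   # 3.0
print(cif_intensity(forest, {"smoker": False, "hypertensive": True}))  # 1.5
```

With two binary splits the forest already distinguishes four intensities (1.0, 1.5, 2.0, 3.0) from only two non-unit factors, illustrating the exponential-intensities-per-parameter property, and the product of non-negative factors can never leave [0, ∞).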
However, if we take the sum over the intensity factors from each tree, there are no direct methods for calculating the change in likelihood aside from calculating the likelihood before and after a forest modification, which would require scanning the full data once per modification proposal. Furthermore, summing intensity factors could lead to intensities outside the valid domain [0, ∞). Instead we use a multiplicative assumption, since it gives us the correct range over intensities. As we show below, using the multiplicative assumption also has the advantage that it is easy to compute the change in log likelihood with changes in forest structure.

Consider a partition-based CTBN M = (B, {F_{x'}}) where the partitions P_{x'} and intensities q_{x'|p} are given by the CIFs {F_{x'}}. We focus on change in forest structure for one state x' ∈ X and remove x' from the subscript notation for simplicity. Given a current forest structure F and its partition P, we formulate the change in likelihood by adding a new CIT f' and its partition P'. One example of f' is a new one-split stub. Another example of f' is a tree copied to have the same structure as a CIT f in F, with all intensity factors set to one except at one leaf node where a split is added. This is equivalent to adding a split to f. We denote P̂ as the joint partition of P and P', with parts p̂ ∈ P̂, p ∈ P, and p' ∈ P'.
We consider the change in log likelihood ΔLL given the new and old models:

ΔLL = (Σ_{p̂} M_{p̂} log q_{p̂} - q_{p̂} T_{p̂}) - (Σ_p M_p log q_p - q_p T_p)
    = (Σ_{p̂} M_{p̂}(log q_{p'} + log q_p) - q_{p̂} T_{p̂}) - (Σ_p M_p log q_p - q_p T_p)
    = (Σ_{p̂} M_{p̂} log q_{p'} - q_{p̂} T_{p̂}) + Σ_p q_p T_p
    = Σ_{p'} M_{p'} log q_{p'} - Σ_{p̂} q_{p̂} T_{p̂} + Σ_p q_p T_p        (4)

We make use of the multiplicative assumption that q_{p̂} = q_{p'} q_p, and of Σ_p M_p = Σ_{p'} M_{p'} = Σ_{p̂} M_{p̂}, to arrive at Equation 4. The first and third terms are easy to compute given the old intensities and new intensity factors. The second term is slightly more complicated:

Σ_{p̂} q_{p̂} T_{p̂} = Σ_{p̂} q_{p'} q_p T_{p̂} = Σ_{p'} q_{p'} Σ_{p̂∼p'} q_p T_{p̂}

We introduce the notation p̂ ∼ p' to denote the parts p̂ that correspond to the part p'. The second term is a summation over parts p̂; we have simply grouped together terms by membership in p'.

The number of parts in the joint partition set P̂ can be exponentially large, but the only remaining dependency on the joint partition space in the change in log likelihood is the term Σ_{p̂∼p'} q_p T_{p̂}. We can keep track of this value as we progress through the trajectories, so the actual time cost is linear in the number of trajectory intervals. Thinking of intensities q as rates, and given durations T, we observe that the second and third terms in Equation 4 are expected numbers of transitions: E_{p̂} = Σ_{p̂} q_{p̂} T_{p̂} and E_p = Σ_p q_p T_p. We additionally define E_{p'} = Σ_{p̂∼p'} q_p T_{p̂}.
Specifically, the expectations E_{p'} and E_p are the expected number of transitions in part p' and p using the old model intensities, respectively, whereas E_{p̂} is the expected number of transitions using the new intensities.

3.4 Maximum-likelihood parameters

The change in log likelihood is dependent on the intensity factor values {q_{p'}} we choose for the new partition. We calculate the maximum likelihood parameters by setting the derivative with respect to these factors to zero, to get q_{p'} = M_{p'} / (Σ_{p̂∼p'} q_p T_{p̂}) = M_{p'} / E_{p'}. Following the derivation in [2], we assign priors to the sufficient statistics calculations. Note, however, that the priors affect the multiplicative intensity factors, so a tree may split on the same partition set twice to get a stronger effect on the intensity, with the possible risk of undesirable overfitting.

3.5 Forest implementation

We use greedy likelihood maximization steps to learn multiplicative forests (mfCTBNs). Each iteration requires repeating three steps: (re)initialization, sufficient statistics updates, and model updates. Initially we are given a blank forest F_{x'} per state x' containing a blank tree f_{x'}, that is, a single root node acting as a leaf with an intensity factor of one. We also are given sets of possible splits {σ} and a penalty function κ(|Z|, |M|) to penalize increased model complexity. First, for every leaf l in M, we (re)initialize the sufficient statistics M_l and E_l in M, as well as sufficient statistics for potential forest modifications: M_{l,σ}, E_{l,σ}, ∀l, σ. Then, we traverse each of our trajectories z ∈ Z to update each leaf. For every (state, duration) pair (si, ti), where ti is the time spent in state si−1 before the transition to si, we update the sufficient statistics that compose Equation 4.
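Given per-part statistics (M_{p'}, E_{p'}) collected in that pass, the best achievable gain of a candidate modification has a closed form: substituting q_{p'} = M_{p'}/E_{p'} into Equation 4, and using Σ_{p̂} q_{p̂} T_{p̂} = Σ_{p'} q_{p'} E_{p'} = Σ_{p'} M_{p'} and Σ_p q_p T_p = Σ_{p'} E_{p'}, gives Σ_{p'} [M_{p'} log(M_{p'}/E_{p'}) + E_{p'} - M_{p'}]. A sketch under these assumptions (the (M, E) pair layout is hypothetical, not the authors' code):

```python
import math

def ml_gain(parts):
    """Maximum change in log likelihood for a candidate modification.

    `parts` summarizes the new parts p': a list of (M, E) pairs, where M is
    the observed and E the currently expected number of transitions in that
    part (E = sum over matching joint parts of q_p * T). Plugging the ML
    factor q_p' = M/E into Equation 4 yields the closed form below.
    """
    gain = 0.0
    for m, e in parts:
        if m > 0:                 # M log(M/E), with the M -> 0 limit being 0
            gain += m * math.log(m / e)
        gain += e - m             # expected-count terms from Equation 4
    return gain

# A split separating 8 vs 2 observed transitions where the old model expected
# 5 each scores a positive gain; a split matching expectation scores zero.
print(ml_gain([(8, 5.0), (2, 5.0)]))  # about 1.93
print(ml_gain([(5, 5.0), (5, 5.0)]))  # 0.0
```

Each candidate split is scored this way, and only scores exceeding the complexity cost κ are accepted, matching the greedy procedure described next.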
Finally, we compute the change in likelihood for possible forest modifications, and choose the modification with the greatest score. If this score is greater than the cost of the additional model complexity, κ, we accept the modification. We replace the selected leaf with a branch node split upon the selected σ. The new leaf intensity factors are the product of the old intensity (factor) q_l and the intensity factor q_{p'}.

Unlike most forest learning algorithms, mfCTBNs learn trees neither in series nor in parallel. Notably, the best split is determined solely by the change in log likelihood, regardless of the tree to which it belongs. If it belongs to the blank tree at the end of the forest, that tree produces non-trivial factors and a new blank tree is appended to the forest. In this way, as an mfCTBN learns, it automatically determines the forest size and tree depth according to the evidence in the data. We provide code and Supplementary Materials at our website.

4 Experiments

[Figure 2 (CV health network; caption below) appears here, with nodes including smoking, glucose level, HDL, blood pressure, gender, and BMI.]

We evaluate our tree learning and forest learning algorithms on samples from three models. The first model, which we call "Nodelman", is the benchmark model developed in [3, 2]. The second is a simplified cardiovascular health model we call "CV health", shown in Figure 2. The causes of pathologies in this field are known to be multifactorial [14]. For example, it has been well established that independent positive risk factors for atherosclerosis include being male, a smoker, in old age, and having high glucose, high BMI, and high blood pressure. The primary tool for prediction in this field is risk factor analysis, where transformations over the product of risk factor values determine overall risk.
The third model we call "S100" is a large-scale model with one hundred binary variables. Parents are determined by the binomial distribution B(0.05, 200) over variable states, with intensity factor ratios of 1 : 0.5. Our goal is to show that treeCTBNs and mfCTBNs can scale to much larger model types and still learn effectively. In our experiments we set the potential splits {σ} to be the set of binary splits determined by indicators for each variable state x'. We set κ to be zero and terminate model learning when the tune set likelihood begins to decrease.

Figure 2: The cardiovascular health (CV health) structure used in experiments. [Additional nodes shown in the figure: abnormal heart electrophysiology, thrombolytic therapy, troponin levels, atherosclerosis, stroke, MI, age, arrhythmia, and chest pain.]

[Figure 3 panels appear here: average test-set log likelihood (y-axis) versus number of training trajectories, 10 to 10000 (x-axis), with curves for Truth, TreeCTBN, mfCTBN, and N-CTBN.]

Figure 3: Average testing set log likelihood varying the training set size for each model: Nodelman (left), CV health (center), and S100 (right). N-CTBN averages are omitted on the S100 model as one third of the runs did not terminate.

We compare our algorithms against the learning algorithm presented in [2] using code from [15], which we will call N-CTBN.
N-CTBNs perform a greedy Bayesian structure search, adding, removing, or reversing arcs to maximize the Bayesian information criterion score, a tradeoff between the likelihood and a combination of parameter and data size. Our algorithms use a tune set by sieving off one quarter of the original training set trajectories. We use the same Laplace prior as used in [15]. We use the same training and testing set for each algorithm. The trajectories are sampled from the ground truth models for durations 10, 10, and 2 units of time, respectively. We evaluate the three models using the testing set average log likelihood. To provide an experimental comparison of model performance, we choose to analyze the p-values for a two-sided paired t-test for the average log likelihoods between mfCTBNs and N-CTBNs for each training set size. The results come from testing sets with one thousand sampled trajectories. Additional evaluation criteria assessing structural recovery were also analyzed and are provided in the Supplementary Materials.

4.1 Results

Figure 3 (left) shows that the mfCTBN substantially outperforms both the treeCTBN and the N-CTBN on the Nodelman model in terms of average log likelihood. This effect is most pronounced with relatively few trajectories, suggesting that mfCTBNs are able to learn more quickly than either of the other models.

We observe an even larger difference between the mfCTBN and the other models in the CV health model in Figure 3 (center). With relatively few trajectories, the mfCTBN is able to identify the multifactorial causes as observed in the high log likelihood and structural recall. For runs with fewer than 500 training set trajectories, many N-CTBN models have nodes including every other node as a parent, requiring the estimation of about 300,000 parameters on average, as shown in the Supplementary Materials.
Figure 3 (right) shows that mfCTBNs can effectively learn dense models an order of magnitude larger than those previously studied. The expected number of parents per node in the S100 model is approximately 20. In order to exactly reconstruct the S100 model, a traditional CTBN would then need to estimate 2^21 intensity values. For many applications, variables need more parents than this. We observe that N-CTBNs have difficulty scaling to models of this size. The N-CTBN learning time on this data set ranges from 4 hours to more than 3 days; runs were stopped if they had not terminated in that time. About one third of the runs failed to complete, and the runs that did complete suggested that N-CTBN performed poorly, similar to the differences observed in the CV health experiment. We suspect the algorithm may be similarly building nodes with many parents; the model might need to estimate 2^100 parameters, a bottleneck at minimum. By comparison, all runs using treeCTBNs and mfCTBNs completed in less than 1 hour. The averaged results of N-CTBNs on the S100 model are omitted accordingly.

We tested for significant differences in the average log likelihoods between the N-CTBN and mfCTBN learning algorithms. In the Nodelman model, the differences were significant at a level of p = 1e-10 for sizes 10 through 500, p = 0.05 for sizes 1000 and 5000, and not significant for size 10000.
In the CV health model, the differences were significant at p = 1e-9 for all training set sizes. We were unable to generate a t-test comparison for the S100 model.

[Figure 4 trees appear here: ground-truth and learned trees with True/False branches on splits such as Normal BP, Youth, Normal weight, <50% atherosclerotic, Hypertensive, Normal glucose, Female, Male, and Frequent smoker, and leaf intensity factors ranging from 0.0080 to 3.5.]

Figure 4: Ground truth (left) and mfCTBN forest learnt from 1000 trajectories (right) for the intensity/rate of developing severe atherosclerosis.

Figure 4 shows the ground truth forest and the mfCTBN forest learned for the "severe atherosclerosis" state in the CV health model. To calculate the intensity of transitioning into this state, we identify the leaf in each tree that matches the current state and take the product of their intensity factors. Figure 4 (right) shows the recovery of the correct dependencies in approximately the right ratios. Full forest models can be found in the Supplementary Materials.

5 Related Work

We discuss the relationships between mfCTBNs and related work in two areas: forest learning and continuous-time processes. Forest learning with a multiplicative assumption is equivalent to forest learning in the log space with an additive assumption and exponentiating the result.
This suggests that our method shares similarities with functional gradient boosting (FGB), a leading method for constructing regression forests, run in the log space [16]. However, our method differs in its direct use of a likelihood-based objective function and in its ability to modify any tree in the forest at any iteration. Further discussion comparing the methods is provided in the Supplementary Materials.

Several other works also model variable dependencies over continuous time. Poisson process networks and cascades model variable dependencies and event rates [17, 18]. Perhaps the most closely related work, piecewise-constant conditional intensity models (PCIMs), reframes the concept of a factored CTMP to allow learning over arbitrary basis state functions with trees, possibly piecewise over time [10]. These models focus on the "positive class", i.e., the observation or count of observations of an event. A limitation of this focus is that the data used to learn the model may be incomplete: given a timeline, we receive all observations of events but not necessarily all occurrences of the events, and we would like to include this uncertainty in our model. For Poisson processes in particular, the representation of the "negative" class is missing, yet in some cases it is the absent state of a variable that triggers a process, as in gene expression networks with negative regulation. Finally, other related work includes non-parametric continuous-time processes, which produce exchangeable distributions over transition rate sets in unfactored CTMPs [19].

6 Conclusion

We presented an alternative representation of the dynamics of CTBNs using partition-based CTBNs instantiated by trees and forests. Our models grow linearly in the number of forest node splits, while CTBNs grow exponentially in the number of parents per variable.
Motivated by the domain over intensities, we introduced multiplicative forests and showed that CTBN likelihood updates can be efficiently computed using changes in log likelihood. Finally, we showed that mfCTBNs outperform both treeCTBNs and N-CTBNs in three experiments and that mfCTBNs scale to problems with many variables. With our contributions to developing scalable CTBNs and efficient learning, along with continued improvements in inference, CTBNs can be a powerful statistical tool for modeling complex processes over continuous time.

7 Acknowledgments

We gratefully acknowledge CIBM Training Program grant 5T15LM007359, NIGMS grant R01GM097618-01, NLM grant R01LM011028-01, and ICTR NIH NCATS grant UL1TR000427.

References

[1] T. Dean and K. Kanazawa, "A model for reasoning about persistence and causation," Computational Intelligence, vol. 5, no. 2, pp. 142–150, 1989.

[2] U. Nodelman, C. R. Shelton, and D. Koller, "Learning continuous time Bayesian networks," in UAI, 2003.

[3] U. Nodelman, Continuous Time Bayesian Networks. PhD thesis, Stanford University, 2007.

[4] U. Nodelman, D. Koller, and C. R. Shelton, "Expectation propagation for continuous time Bayesian networks," in UAI, 2005.

[5] S. Saria, U. Nodelman, and D. Koller, "Reasoning at the right time granularity," in UAI, 2007.

[6] I. Cohn, T. El-Hay, N. Friedman, and R. Kupferman, "Mean field variational approximation for continuous-time Bayesian networks," in UAI, 2009.

[7] Y. Fan and C. R. Shelton, "Sampling for approximate inference in continuous time Bayesian networks," in AI and Mathematics, 2008.

[8] V. Rao and Y. Teh, "Fast MCMC sampling for Markov jump processes and continuous time Bayesian networks," in UAI, 2011.

[9] D. Heckerman, "Causal independence for knowledge acquisition and inference," in UAI, pp.
122–127, 1993.

[10] A. Gunawardana, C. Meek, and P. Xu, "A model for temporal dependencies in event streams," in NIPS, 2011.

[11] C. Strobl, J. Malley, and G. Tutz, "An introduction to recursive partitioning: rationale, application, and characteristics of classification and regression trees, bagging, and random forests," Psychological Methods, vol. 14, no. 4, p. 323, 2009.

[12] Y. Freund and R. Schapire, "A decision-theoretic generalization of on-line learning and an application to boosting," in Computational Learning Theory, 1995.

[13] L. Breiman, "Random forests," Machine Learning, vol. 45, no. 1, pp. 5–32, 2001.

[14] W. Kannel, "Blood pressure as a cardiovascular risk factor," JAMA, vol. 275, no. 20, p. 1571, 1996.

[15] C. Shelton, Y. Fan, W. Lam, J. Lee, and J. Xu, "Continuous time Bayesian network reasoning and learning engine," JMLR, vol. 11, pp. 1137–1140, 2010.

[16] J. Friedman, "Greedy function approximation: a gradient boosting machine," Annals of Statistics, 2001.

[17] S. Rajaram, T. Graepel, and R. Herbrich, "Poisson-networks: A model for structured point processes," in AI and Statistics, 2005.

[18] A. Simma, Modeling Events in Time Using Cascades of Poisson Processes. PhD thesis, EECS Department, University of California, Berkeley, Jul 2010.

[19] A. Saeedi and A. Bouchard-Côté, "Priors over recurrent continuous time processes," in NIPS, 2011.