{"title": "Probabilistic Event Cascades for Alzheimer's disease", "book": "Advances in Neural Information Processing Systems", "page_first": 3095, "page_last": 3103, "abstract": "Accurate and detailed models of the progression of neurodegenerative diseases such as  Alzheimer's (AD) are crucially important for reliable early diagnosis and the determination and deployment of effective treatments. In this paper, we introduce the ALPACA (Alzheimer's disease Probabilistic Cascades) model, a generative model linking latent Alzheimer's progression dynamics to observable biomarker data. In contrast with previous works which model disease progression as a fixed ordering of events, we explicitly model the variability over such orderings among patients which is more realistic, particularly for highly detailed disease progression models. We describe efficient learning algorithms for ALPACA and discuss promising experimental results on a real cohort of Alzheimer's patients from the  Alzheimer's Disease Neuroimaging Initiative.", "full_text": "Probabilistic Event Cascades for Alzheimer\u2019s disease\n\nJonathan Huang\nStanford University\n\njhuang11@stanford.edu\n\nDaniel Alexander\n\nUniversity College London\n\nd.alexander@cs.ucl.ac.uk\n\nAbstract\n\nAccurate and detailed models of neurodegenerative disease progression are\ncrucially important for reliable early diagnosis and the determination of effective\ntreatments. We introduce the ALPACA (Alzheimer\u2019s disease Probabilistic\nCascades) model, a generative model linking latent Alzheimer\u2019s progression\ndynamics to observable biomarker data. In contrast with previous works which\nmodel disease progression as a \ufb01xed event ordering, we explicitly model the\nvariability over such orderings among patients which is more realistic, particularly\nfor highly detailed progression models. 
We describe efficient learning algorithms for ALPACA and discuss promising experimental results on a real cohort of Alzheimer's patients from the Alzheimer's Disease Neuroimaging Initiative.\n1 Introduction\nModels of disease progression are among the core tools of modern medicine for early disease diagnosis, treatment determination and for explaining symptoms to patients. In neurological diseases, for example, symptoms and pathologies tend to be similar across different diseases; the ordering and severity of those changes, however, provide discrimination amongst them. Progression models are therefore key to early differential diagnosis and thus to drug development (for finding the right participants in trials) and to the eventual deployment of effective treatments. Despite their utility, traditional models of disease progression [3, 17] have largely been limited to coarse symptomatic staging, which divides patients into a small number of groups by thresholding a crude clinical score of how far the disease has progressed. The models are thus only as precise as these crude clinical scores: although providing insight into disease mechanisms, they provide little benefit for early diagnosis or accurate patient staging. With the growing availability of larger datasets consisting of measurements from clinical, imaging and pathological sources, however, more detailed characterizations of disease progression are now becoming feasible, and a key hope in medical science is that such models will provide earlier, more accurate diagnosis, leading to more effective development and deployment of emerging treatments. The recent availability of cross-sectional datasets such as the Alzheimer's Disease Neuroimaging Initiative data has generated intense speculation in the neurology community about the nature of the cascade of events in AD and the ordering in which biomarkers show abnormality. 
Several hypothetical models [12, 5, 1]\nbroadly agree, but differ in some ways. Despite early attempts on limited data sets [13], a data\ndriven con\ufb01rmation of those models remains a pressing need.\nBeckett [2] was the \ufb01rst, nearly two decades ago, to propose a data driven model of disease progres-\nsion using a distribution over orderings of clinical events. This earlier work of [2] considered the\nprogressive loss of physical abilities in ageing persons such as the ability to do heavy work around\nthe house, or to climb up stairs. More recently, Fonteijn et al. [8] developed event-based models of\ndisease progression by analyzing ordered series of much \ufb01ner grained clinical and atrophy events\nwith applications to the study of familial Alzheimer\u2019s disease and Huntington\u2019s disease, both of\nwhich are well-studied autosomal-dominantly inherited neurodegenerative diseases. Examples of\nevents in the model of [8] include (but are not limited to) clinical events (such as a transition from\nPresymptomatic Alzheimer\u2019s to Mild Cognitive Impairment) and the onset of atrophy (reduction of\ntissue volume). By assuming a single universal ordering of events within the disease progression,\nthe method of [8] is able to scale to much larger collections of events, thus achieving much more\ndetailed characterizations of disease progression compared to that of [2].\n\n1\n\n\fThe assumption made in [8] of a universal ordering common to all patients within a disease cohort,\nis a major oversimpli\ufb01cation of reality, however, where the event ordering can vary considerably\namong patients even if it is consistent enough to distinguish different diseases.\nIn practice, the\nassumption of a universal ordering within the model means we cannot recover the diversity of\norderings over population groups and can make \ufb01tting the model to patient data unstable. 
To address the universal ordering problem, our work revisits the original philosophy of [2] by explicitly modeling a distribution over permutations. By carefully considering computational complexity and exploiting modern machine learning techniques, however, we are able to overcome many of its original limitations. For example, where [2] did not model measurement noise, our method can handle a wide range of measurement models. Additionally, like [8], our method can achieve the scalability that is required to produce fine-grained disease progression models. The following is a summary of our main contributions.\n- We propose the Alzheimer's disease Probabilistic Cascades (ALPACA) model, a probabilistic model of disease cascades, allowing for patients to have distinct event orderings.\n- We develop efficient probabilistic inference and learning algorithms for ALPACA, including a novel patient "staging" method, which predicts a patient's full trajectory through clinical and atrophy events from sparse and noisy measurement data.\n- We provide empirical validation of our algorithms on synthetic data in a variety of settings as well as promising preliminary results for a real cohort of Alzheimer's patients.\n2 Preliminaries: Snapshots of neurodegenerative disease cascades\nWe model a neurodegenerative disease cascade as an ordering of a discrete set of N events, {e1, . . . , eN}. These events represent changes in patient state, such as a sufficiently low score on a memory test for a clinical diagnosis of AD, or the first measurement of tissue pathology, such as significant atrophy in the hippocampus (a memory-related brain area). An ordering over events is represented as a permutation σ which maps each position within the ordering to the event that occurs there. We write σ as σ(1)|σ(2)| . . 
.|\u03c3(N ), where \u03c3(j) = ei means that \u201cEvent i occurs in\nposition j with respect to \u03c3\u201d. In practice, the ordering \u03c3 for a particular patient can only be observed\nindirectly via snapshots which probe at a particular point in time whether each event has occurred or\nnot. We denote a snapshot by a vector of N measurements z = (ze1, . . . , zen ), where each zei is a\nreal valued measurement re\ufb02ecting a noisy diagnosis as to whether event i of the disease progression\nhas occurred prior to measuring z.1 Were it not for noise within the measurement process, a single\nsnapshot z would partition the event set into two disjoint subsets: events that have occurred already\n(e.g., {e\u03c3(1), . . . , e\u03c3(r)}), and events which have yet to occur (e.g., {e\u03c3(r+1), . . . , e\u03c3(N )}).\nWhere prior models [8] considered data in which a patient is only associated with a single\nsnapshot (taken at a single time point), we allow for multiple snapshots of a patient to be taken\nspaced throughout that patient\u2019s disease cascade.\nIn this more general case of k snapshots,\nthe event set is partitioned into k + 1 disjoint subsets (in the absence of noise). For example,\nif \u03c3 = e3|e1|e4|e5|e6|e2, then k = 2 snapshots might partition the event ordering into sets\nX1 = {e1, e3}, X2 = {e4, e5}, X3 = {e2, e6}, re\ufb02ecting that events in X1 occur before events in\nX2, which occur before events in X3. Such partitions can also be thought of as partial rankings over\nthe events (and indeed, we will exploit recent methods for learning with partial rankings in our own\napproach, [11]). To denote partial rankings, we again use vertical bars, separating the events that oc-\ncur between snapshots. In the above example, we would write e1, e3|e4, e5|e2, e6. 
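As an illustrative aside (this sketch is ours, not part of the paper), the mapping from a full event ordering and a snapshot set to the induced partial ranking can be written in a few lines of Python; the function name `partial_ranking` is our own:

```python
def partial_ranking(sigma, tau):
    """Split a full event ordering sigma (a list of event labels) into the
    blocks of the partial ranking induced by a snapshot set tau.

    tau holds the positions (1-indexed) in sigma just before each snapshot
    is taken; events within a block are unordered, so we sort each block
    for a canonical display."""
    cuts = [0] + sorted(tau) + [len(sigma)]
    return [sorted(sigma[a:b]) for a, b in zip(cuts, cuts[1:])]

# Running example from the text: sigma = e3|e1|e4|e5|e6|e2, tau = {2, 4}
sigma = ["e3", "e1", "e4", "e5", "e6", "e2"]
blocks = partial_ranking(sigma, {2, 4})
# blocks -> [['e1', 'e3'], ['e4', 'e5'], ['e2', 'e6']]
print("|".join(",".join(b) for b in blocks))  # prints e1,e3|e4,e5|e2,e6
```

This reproduces the running example's partial ranking e1, e3|e4, e5|e2, e6.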
This connection between snapshots and partial rankings plays a key role in our inference algorithms in Section 4.1. Instead of reasoning with continuous snapshot times, we use the fact that many distinct snapshot times can result in the same partial ranking to reason instead with discrete snapshot sets. By snapshot set, we refer to the collection of positions in the full event ordering just before each snapshot is taken. In our running example, the snapshot set is τ = {2, 4}. Given a full ordering σ, the partial ranking which arises from snapshot data (assuming no noise) is fully determined by τ. We denote this resulting partial ranking by σ|τ. Thus in our running example, σ|τ={2,4} = e1, e3|e4, e5|e2, e6.\n3 ALPACA: the Alzheimer's disease Probabilistic Cascades model\nWe now present ALPACA, a generative model of noisy snapshots in which the event ordering for each patient is a latent variable. ALPACA makes two main assumptions: (1) that the measured outcomes for each patient are independent of each other, and (2) that, conditioned on the event ordering of each patient and the time at which a snapshot is taken, the measurements for each event are independent. In contrast with [8], we do not assume that multiple snapshot vectors for the same patient are independent of each other. The simplest form of ALPACA is as follows. For each patient j = 1, . . . , M:\n1. Draw an ordering of the events σ(j) from a Mallows distribution P(σ; σ0, λ) over orderings.\n2. Draw a snapshot set τ(j) from a uniform distribution P(τ) over subsets of the event set.\n3. For each element τ_i^(j) of the snapshot set, i = 1, . . . , K(j), and for each event e = e1, . . . , eN:\n(a) If σ^{-1}(e) ≤ τ_i^(j) (i.e., if event e has occurred prior to time τ_i^(j)), draw z_{i,e}^(j) ∼ N(μ_e^occurred, c_e^occurred). Otherwise draw z_{i,e}^(j) ∼ N(μ_e^healthy, c_e^healthy).\nIn the above basic model, each entry of a snapshot vector, z_{i,e}^(j), is generated by sampling from a univariate measurement model (assumed in this case to be Gaussian). If event e has already occurred, the observation z_{i,e}^(j) is sampled from the distribution N(μ_e^occurred, c_e^occurred); otherwise z_{i,e}^(j) is sampled from a measurement distribution estimated from a control population of healthy individuals, N(μ_e^healthy, c_e^healthy). For notational simplicity, we denote the collection of snapshots for patient j by z_{·,·}^(j) = {z_{i,e}^(j)}_{i=1,...,K(j), e=1,...,N}. We remark that the success of our approach does not hinge on the assumption of normality and our algorithms can deal with a variety of measurement models. For example, certain clinical events (such as the loss of the ability to pass a memory test) are more naturally modeled as discrete observations and can trivially be incorporated into the current model.\nThe prior distribution over possible event orderings is assumed to take the form of the well known Mallows distribution, which has been used in a number of other application areas such as NLP, social choice, and psychometrics ([6, 15, 18]), and has the following probability mass function over orderings: P(σ; σ0, λ) ∝ exp(−λ dK(σ, σ0)), where dK(·,·) is the Kendall's tau distance metric on orderings. The Kendall's tau distance penalizes the number of inversions, or pairs of events for which σ and σ0 disagree over relative ordering.\n1For notational simplicity, we assume that measurements corresponding to each event are scalar valued. However, our model extends trivially to more complicated measurements.
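To make the Mallows model concrete, the following small Python sketch (our illustration, not code from the paper) computes the Kendall's tau distance by counting pairwise inversions and evaluates the normalized Mallows probability mass function by brute-force enumeration, which is feasible only for small event sets:

```python
from itertools import permutations
from math import exp

def kendall_tau(sigma, sigma0):
    """Number of event pairs on whose relative order sigma and sigma0 disagree."""
    pos = {e: i for i, e in enumerate(sigma)}
    pos0 = {e: i for i, e in enumerate(sigma0)}
    events = list(sigma)
    return sum(
        1
        for i in range(len(events))
        for j in range(i + 1, len(events))
        if (pos[events[i]] < pos[events[j]]) != (pos0[events[i]] < pos0[events[j]])
    )

def mallows_pmf(sigma0, lam, events):
    """Exact Mallows P(sigma; sigma0, lambda) by enumeration (small N only)."""
    weights = {s: exp(-lam * kendall_tau(list(s), sigma0)) for s in permutations(events)}
    Z = sum(weights.values())
    return {s: w / Z for s, w in weights.items()}

pmf = mallows_pmf([1, 2, 3, 4], lam=0.5, events=[1, 2, 3, 4])
assert abs(sum(pmf.values()) - 1.0) < 1e-12
# For lambda > 0, the central ordering sigma0 is the unique mode.
assert max(pmf, key=pmf.get) == (1, 2, 3, 4)
```

The exponential normalizer is why, for realistic N, the paper works with conditioning and sampling machinery rather than enumeration.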
Mallows models are analogous to normal\ndistributions in that \u03c30 can be interpreted as the mean or central ordering and \u03bb as a measure of the\n\u201cspread\u201d of the distribution. Both parameters are viewed as \ufb01xed quantities to be estimated via the\nempirical Bayesian approach outlined in Section 4.\nThe choices of the Mallows model for orderings and the uniform distribution for snapshot sets are\nparticularly convenient for clinical settings in which the number of subjects may be limited, since\nthe small number of parameters of the model (which scales linearly in N) suf\ufb01ciently constrains\nlearning, and eases our discussion of inference and learning in Section 4. However, as we discuss in\nSection 5, the parametric assumptions made in the most basic form of ALPACA can be considerably\nrelaxed without impacting the computational complexity of learning. Our algorithms are thus\napplicable for more general classes of distributions over orderings as well as snapshot sets.\nApplication to patient staging. With respect to the event-based characterization of disease\nprogression, a critical problem is that of patient staging, the problem of determining the extent\nto which a disease has progressed for a particular patient given corresponding measurement data.\nALPACA offers a simple and natural formulation of the patient staging problem as a probabilistic\ninference query. In particular, given the measurements corresponding to a particular patient, we\nperform patient staging by: (1) computing a posterior distribution over the event ordering \u03c3(j), then\n(2) computing a posterior distribution over the most recent element of the snapshot set \u03c4 (j).\nTo visualize the posterior distribution over the event ordering \u03c3(j), we plot a simple \u201c\ufb01rst-order\nstaging diagram\u201d, displaying the probability that event e has occurred (or will occur) in position\nq according to the posterior. 
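One simple way such a first-order staging diagram could be assembled from posterior samples (our sketch; the paper does not prescribe an implementation) is to average indicator matrices over sampled orderings:

```python
def first_order_matrix(sampled_orderings, events):
    """Estimate M[e][q] = P(event e occupies position q) from a list of
    sampled event orderings (each a list of event labels)."""
    n = len(events)
    index = {e: i for i, e in enumerate(events)}
    M = [[0.0] * n for _ in range(n)]
    for sigma in sampled_orderings:
        for q, e in enumerate(sigma):
            M[index[e]][q] += 1.0 / len(sampled_orderings)
    return M

samples = [["a", "b", "c"], ["a", "c", "b"], ["a", "b", "c"], ["b", "a", "c"]]
M = first_order_matrix(samples, ["a", "b", "c"])
# Event "a" was first in 3 of the 4 samples.
assert abs(M[0][0] - 0.75) < 1e-12
```

Each row of the resulting matrix is a probability distribution over positions for one event, which is exactly what a heat-map style staging diagram displays.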
Two major features differentiate ALPACA from traditional patient staging approaches, in which patients are binned into a small number of imprecisely defined stages. In particular, our method is more fine-grained, allowing for a detailed picture of what the patient has undergone as well as a prediction of what is to come next. Moreover, ALPACA has well-defined probabilistic semantics, allowing for a rigorous probabilistic characterization of uncertainty.\n4 Inference algorithms and parameter estimation\nIn this section we describe tractable inference and parameter estimation procedures for ALPACA.\n4.1 Inference.\nGiven a collection of K(j) snapshots for a patient j, the critical inference problem that we must solve is that of computing a posterior distribution over the latent event order and snapshot set for that patient. Despite the fact that all latent variables are discrete, however, computing this posterior distribution can be nontrivial due to the super-exponential size of the state space (which is O(N! × C(N, K(j))), where C(N, K) denotes the binomial coefficient), for which there exist no tractable exact inference algorithms.\nWe thus turn to a Gibbs sampling approximation. Directly applying the Gibbs sampler to the model is difficult, however. One reason is that it is not obvious how to tractably sample the event ordering σ conditioned on its Markov blanket, given that the Mallows model is not a conjugate prior for the corresponding likelihood function. Instead, noting that the snapshots depend on (σ, τ) only through the partial ranking γ ≡ σ|τ, our Gibbs sampler operates on an augmented model in which the partial ranking γ is first generated (deterministically) from σ and τ, and the snapshots are then generated conditioned on the partial ranking γ. See Fig. 1(a) for a Bayes net representation. 
This augmented model is equivalent to the original model, but has the advantage that it reduces the sampling step for the event ordering σ to a well understood problem (described below). Our sampler thus alternates between sampling σ and jointly sampling (γ, τ) from the following conditional distributions:\nσ(j) ∼ P(σ | γ = γ(j), τ = τ(j); σ0, λ),   (γ(j), τ(j)) ∼ P(γ, τ | σ = σ(j), z_{·,·}^(j)).   (4.1)\nObserve that since the snapshot set τ is fully determined by the partial ranking γ, it is not necessary to condition on τ in Equation 4.1 (left). Similarly in Equation 4.1 (right), since γ is fully determined given both the event ordering σ and the snapshot set τ, one can sample τ first, and deterministically reconstruct γ. Therefore the Gibbs sampling updates are:\nσ(j) ∼ P(σ | γ = γ(j); σ0, λ),   τ(j) ∼ P(τ | σ = σ(j), z_{·,·}^(j)).   (4.2)\nWhile the Gibbs sampling updates here effectively reduce the inference problem to smaller inference problems, the state spaces over σ and τ still remain intractably large (with cardinalities O(N!) and O(C(N, K(j))), respectively). In the remainder of this section, we show how to exploit even further structure within each of the conditional distributions over σ and τ for efficient inference. As a result, we are able to carry out Gibbs sampling operations efficiently and exactly.\nSampling event orderings. To sample σ(j) from the conditional distribution in Equation 4.2, we must condition a Mallows prior on the partial ranking γ = γ(j). This precise problem has in fact been discussed in a number of works [4, 14, 15, 9]. 
In our experiments, we use the method of Huang [9], which explicitly computes a representation of the posterior, from which one can efficiently (and exactly) draw independent samples.\nSampling snapshot sets. We now turn to the problem of sampling a snapshot set τ(j) of size K(j) from Equation 4.2 (right). Note first that if K(j) is small (say, less than 3), then one can exhaustively compute the posterior probability of each of the C(N, K(j)) subsets of size K(j) and draw a sample from a tabular representation of the posterior. For larger K(j), however, the exhaustive approach is intractable. In the following, we present a dynamic programming algorithm for sampling snapshot sets with running time much lower than the exhaustive setting (even for small K(j)). Our core insight is to exploit conditional independence relations within the posterior distribution over snapshot sets. That such independence relations exist may not seem surprising due to the simplicity of the uniform prior over snapshot sets; on the other hand, note that the individual times of a snapshot set drawn from the uniform distribution over K(j)-subsets are not a priori independent of each other (they could not be, as the total number of times is observed and fixed to be K(j)). As we show in the following, however, we can bijectively associate each snapshot set with a trajectory through a certain grid. With respect to this grid-based representation of snapshot sets, we then show that the posterior distribution can be viewed as that of a particular hidden Markov model (HMM).\nWe will consider the set G = {(x, y) : 0 ≤ x ≤ K(j) and 0 ≤ y ≤ N − K(j)}. G is a grid (depicted in Fig. 1(b)) which we will visualize with (K(j), N − K(j)) in the upper left corner and (0, 0) in the lower right corner. 
Let PG denote the collection of staircase walks (paths which never go up or to the left) through the grid G starting and ending at the corners (K(j), N − K(j)) and (0, 0), respectively. An example staircase walk is outlined in blue in Figure 1(b). It is not difficult to verify that every element in PG has length N (i.e., every staircase walk traverses exactly N edges in the grid).\nGiven a grid G, we can now state a one-to-one correspondence between the staircase walks in PG and the K(j)-subsets of {1, . . . , N}. To establish the correspondence, we first associate each edge of the grid with the sum of the indices of the starting node of that edge. Hence the edge from (x1, y1) to (x2, y2) is associated with the number x1 + y1. Given any staircase walk p = ((x0, y0), (x1, y1), . . . , (xN, yN)) in PG, we associate p with the subset of events in {1, . . . , N} corresponding to the subset of edges of p which point downwards. It is not difficult to show that this association is, in fact, bijective (i.e., given a snapshot set τ, there is a unique staircase walk pτ mapping to τ).\nFigure 1: (a): Bayesian network representation of our model (augmented by adding the partial ranking γ). (b): Grid structured state space G for sampling snapshot sets, with edges labeled with transition probabilities according to Equation 4.3. In this example, N = 5 and K(j) = 2. The example path (highlighted) is p = ((2, 3), (2, 2), (1, 2), (1, 1), (0, 1), (0, 0)), corresponding to the snapshot set τ = {4, 2}.\nWe now show that our encoding of K(j)-subsets as staircase walks allows for the posterior over τ in Equation 4.2 to factor with respect to a hidden Markov model. 
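The walk-to-subset bijection just described can be sketched in Python (our rendering of the construction; the function names are ours), using the edge-labeling convention above and the worked example of Figure 1(b):

```python
def walk_to_subset(path):
    """Map a staircase walk (a list of grid nodes from (K, N-K) to (0, 0))
    to a snapshot set. Each edge is labeled with the coordinate sum of its
    starting node; the snapshot set collects the labels of the edges along
    which x decreases (the 'downward' edges in Figure 1(b))."""
    return {x0 + y0 for (x0, y0), (x1, y1) in zip(path, path[1:]) if x1 == x0 - 1}

def subset_to_walk(tau, n, k):
    """Inverse map: rebuild the unique staircase walk for snapshot set tau."""
    x, y = k, n - k
    path = [(x, y)]
    for label in range(n, 0, -1):  # the next edge's label is always x + y
        if label in tau:
            x -= 1  # 'down' edge: this position belongs to the snapshot set
        else:
            y -= 1  # 'left' edge
        path.append((x, y))
    return path

# Example from Figure 1(b): N = 5, K = 2, tau = {4, 2}.
p = [(2, 3), (2, 2), (1, 2), (1, 1), (0, 1), (0, 0)]
assert walk_to_subset(p) == {2, 4}
assert subset_to_walk({2, 4}, n=5, k=2) == p
```

Because x + y decreases by exactly one per step, the edge labels run from N down to 1, which is what makes the inverse map well defined.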
Conditioned on σ = σ(j), we define an HMM over G with the following transition and observation probabilities, respectively:\nP((x_t, y_t) | (x_{t-1}, y_{t-1}) = (x, y)) ≡ { x/(x+y) if (x_t, y_t) = (x−1, y); y/(x+y) if (x_t, y_t) = (x, y−1); 0 otherwise },   (4.3)\nL(z_{·,σ(N−t)}^(j) | (x_t, y_t)) ≡ φ(x_t, σ(x_t + y_t); z_{1,σ(N−t)}^(j), . . . , z_{K(j),σ(N−t)}^(j)),   (4.4)\nwhere φ(v, e; z_1, . . . , z_K) ≡ ∏_{i=1}^{v−1} P(z_i; μ_e^healthy, c_e^healthy) · ∏_{i=v}^{K} P(z_i; μ_e^occurred, c_e^occurred).\nThe initial state is set to (x_0, y_0) = (K(j), N − K(j)) and the chain terminates when (x, y) = (0, 0). Note that sample trajectories from the above HMM are staircase walks with probability one.\nProposition 1. Conditioned on σ = σ(j), the posterior probability P(τ = τ(j) | σ = σ(j), z_{·,·}^(j)) is equal to the posterior probability of the staircase walk p_{τ(j)} under the hidden Markov model defined by Equations 4.3 and 4.4.\nTo sample a snapshot set from the conditional distribution in Equation 4.2, we therefore sample staircase walks from the above HMM and convert the resulting samples to snapshot sets.\nTime Complexity of a single Gibbs iteration. We now consider the computational complexity of our inference procedures. First observe that the complexity of sampling from the posterior distribution of a Mallows model conditioned on a partial ranking is O(N^2) [9]. We claim that the complexity of sampling a snapshot set is also O(N^2). To see why, note that the complexity of the Backwards algorithm for HMMs is quadratic in the number of states and linear in the number of time steps. In our case, the number of states is K(j)(N − K(j)) and the number of time steps is N. 
Thus in the worst case (where K(j) ≈ N/2), the complexity of naively sampling a staircase walk is O(N^5). However, we can exploit additional problem structure. First, since the HMM transition matrix is sparse (each state transitions to at most two states), the Backwards algorithm can be performed in O(N · #(states)) time. Second, since the grid coordinates corresponding to the current state at time T are constrained to sum to N − T, the size of the effective state space is reduced to O(N) rather than O(K(j)(N − K(j))). Thus in the worst case, the running time complexity can in turn be reduced to O(N^2), and even to linear time O(N) when K(j) ∼ O(1). In conclusion, a single Gibbs iteration requires at most O(N^2) operations.\nMixing considerations. Under mild assumptions, it is not difficult to establish ergodicity of our Gibbs sampler, showing that the sampling distribution must eventually converge to the desired posterior. The one exception is when the size of the snapshot set is one less than the number of events (K(j) = N − 1). In this exceptional case,2 the grid G has size N × 1, forcing the Gibbs sampler to be deterministic. As a result, the Markov chain defined by the Gibbs sampler is not irreducible and hence not ergodic. We have:\nProposition 2. The Gibbs sampler is ergodic on its state space if and only if K(j) < N − 1.\n2Note that to have so many snapshots for a single patient would be rare indeed.\nEven when K(j) < N − 1, mixing times for the chain can be longer for larger snapshot sets (where K(j) is close to N − 1). For example, when K(j) = N − 2, it is possible to show that the Tth ordering in the Gibbs chain can differ from the (T + 1)th ordering by at most an adjacent swap. Consequently, since it requires O(N^2) adjacent swaps (in the worst case) to reach the mode of the posterior distribution with nonzero probability, we can lower bound the mixing time in this case by Ω(N^2) steps. For smaller K(j), the Gibbs sampler is able to make larger jumps in state space and indeed, for these chains, we observe faster mixing times in practice.\n4.2 Parameter estimation.\nGiven a snapshot dataset {z(j)}_{j=1,...,M}, we now discuss how to estimate the ALPACA model parameters (σ0, λ) by maximizing the marginal log likelihood: ℓ(σ0, λ) = ∑_{j=1}^{M} log P(z(j) | σ0, λ). Currently we obtain point estimates of model parameters, but fuller Bayesian approaches are also possible. Our approach uses Monte Carlo expectation maximization (EM) to alternate between the following two steps, given an initial setting of model parameters (σ0^(0), λ^(0)).\nE-step. For each patient in the cohort, use the inference algorithm described in Section 4.1 to obtain a draw from the posterior distribution P(σ(j), τ(j) | z(j), σ0, λ). Note that multiple draws can also be taken to reduce the variance of the E-step.\nM-step. Given the draws obtained via the E-step, we can now apply standard Mallows model estimation algorithms (see [7, 16, 15]) to optimize for the parameters σ0 and λ given the sampled ordering for each patient. Optimizing for λ, for example, is a one-dimensional convex optimization [16]. Optimizing for σ0 (sometimes called the consensus ranking problem) is known to be NP-hard. Our implementation uses the Fligner and Verducci heuristic [7] (which is known to be an unbiased estimator of σ0) followed by local search, but more sophisticated estimators exist [16]. 
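To illustrate the flavor of the consensus ranking computation in the M-step, here is a simple Borda-style heuristic that orders events by their mean sampled position (an illustrative stand-in of our own, not the exact Fligner-Verducci procedure used in the paper):

```python
def consensus_ordering(sampled_orderings, events):
    """Borda-style consensus: order events by their mean position across the
    sampled orderings (one per patient from the E-step). Ties are broken by
    event label for determinism."""
    totals = {e: 0.0 for e in events}
    for sigma in sampled_orderings:
        for pos, e in enumerate(sigma):
            totals[e] += pos
    return sorted(events, key=lambda e: (totals[e] / len(sampled_orderings), e))

samples = [["b", "a", "c"], ["a", "b", "c"], ["a", "c", "b"]]
# Mean positions: a = 1/3, b = 1, c = 5/3, so the consensus is a|b|c.
assert consensus_ordering(samples, ["a", "b", "c"]) == ["a", "b", "c"]
```

A local search over adjacent swaps, as mentioned above, could then refine this initial estimate against the actual Kendall's tau objective.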
Note that the\nsampled snapshot sets ({\u03c4 (j)}) do not play a role in the M-step described here, but can be used to\nestimate parameters for the more complex snapshot set distributions described in Section 5.\nComplexity of EM. The running time of a single iteration of our E-step requires O(N 2TGibbsM )\ntime, where TGibbs is the number of Gibbs iterations. The running time of the M-step is O(N 2M )\n(assuming a single sample per patient), and is therefore dominated by the E-step complexity.\n5 Extensions of the basic model\nGeneralized ordering models. The classical Mallows model for orderings is often too limited\nfor real datasets in its lack of \ufb02exibility. One limitation is that the positional variances of all of\nthe events are governed by just a single parameter, \u03bb. In clinical datasets, it is more conceivable\nthat different biomarkers within a disease cascade change over different timescales, thus leading to\nhigher positional variance for certain events and lower positional variance for others.\nFortunately our approach applies to any class of distributions for which one can ef\ufb01ciently condition\non partial ranking observations. In our experiments (Section 6), we achieve more \ufb02exibility using\nthe generalized Mallows model [7, 16], which includes the classical Mallows model as a special\ncase and allows for the positional variance of each event e to be governed by its own corresponding\nparameter \u03bbe. Generalized Mallows models are in turn a special case of the recently introduced\nhierarchical rif\ufb02e independent models [10] which allow one to capture dependencies among small\nsubsets of events. Huang et al. 
([11]), in particular, proved that these hierarchical riffle independent models form a natural conjugate prior family for partial ranking likelihood functions and introduced efficient algorithms for conditioning on partial ranking observations.\nIt is finally interesting to note that it would not be trivial to use traditional Markov chains to capture the dependencies in the event sequence, due to the fact that observations come in snapshot form instead of being indexed by time as they would be in an ordinary hidden Markov model. Thus, in order to properly perform inference, one would have to infer an HMM posterior with respect to each of the permutations of the event set, which is computationally harder.\nGeneralized snapshot set models. Going beyond the uniform distribution, ALPACA can also efficiently handle a more general class of snapshot set distributions by observing that any distribution parametrizable as a Markov chain over the grid G that generates staircase walks can be substituted for the uniform distribution with exactly the same time complexity of Gibbs sampling. As a cautionary remark, we note that allowing for these more general models without additional constraints can sometimes lead to instabilities in parameter estimation. A simple constrained Markov chain that we have successfully used in experiments parameterizes transition probabilities such that a staircase walk moves down at node (x, y) in the grid G with probability proportional to αx and to the left with probability proportional to (1 − α)y. Setting α = 1/2 recovers the uniform distribution. Setting 0 ≤ α < 1/2, however, reflects a prior bias for snapshots to have been taken earlier in the disease cascade, while setting 1/2 < α ≤ 1 reflects a prior bias for snapshots to have been taken later in the disease cascade. Thus α intuitively allows us to interpolate between early and late detection.\nFigure 2: Experimental results. (a) Central ranking recovery vs. measurement noise. Synthetic data, N = 10 events, M = 250 patients, K(j) ∈ {1, 2, 3}. Worst case Kendall's tau score is 45.0. (b) Central ranking recovery vs. size of patient cohort. Synthetic data, N = 20 events, K(j) ∈ {1, . . . , 10}. Worst case Kendall's tau score is 190.0. (c) Illustration of mixing times using a Gibbs trace plot on a synthetic dataset with N = 20 and K(j) = 4, 8, 12, 16. Larger snapshots (larger K(j)) lead to longer mixing times. (d) BIC scores on the ADNI data (lower is better) comparing the ALPACA model (with varying settings of α) against the single ordering model of [8] (shown in the σ* column). (e) ADNI patient staging. (Left) First-order staging diagram; the (e, q)th entry is the probability that event e has/will occur in position q. (Right) Posterior probability distribution over the position in the event ordering at which the patient snapshot was taken.\n6 Experiments\nSynthetic data experiments. We first validate ALPACA on synthetic data. Since we are interested in the ability of the model to recover the true central ranking, we evaluate based on the Kendall's tau distance between the ground truth central ranking and the central rankings learned by our algorithms. To understand how learning is impacted by measurement noise, we simulate data from models in which the means μ^healthy and μ^occurred are fixed to be 0 and 1, respectively, and variances are selected uniformly at random from the interval (0, c_e^MAX), then learn model parameters from the simulated data. Fig. 2(a) illustrates the results on a problem with N = 10 events and 250 patients (with K(j) set to be 1, 2, or 3 randomly for each patient) as c_e^MAX varies between [0.2, 1.2]. 
As\n(with K (j) set to be 1, 2, or 3 randomly for each patient) as cM AX\nshown in the \ufb01gure, we obtain nearly perfect performance for low measurement noise with recovery\nrates degrading gracefully with higher measurement noise levels.\nWe also show results on a larger problem with N = 20 events, ce = 0.1, and K (j) drawn uniformly\nat random from {1, . . . , 10}. Varying the cohort size, this time, Fig. 2(b) shows, as expected, that\nrecovery rates for the central ordering improve as the number of patients increases. Note that with\n20 events, it would be utterly intractable to use brute force inference algorithms, but our algorithms\ncan process a patient\u2019s measurements in roughly 3 seconds on a laptop.\nIn both experiments for Figs. 2(a) and 2(b), we discard the \ufb01rst 200 burn-in iterations of Gibbs, but\nit is often suf\ufb01cient to discard much fewer iterations. To illustrate mixing behavior, Fig. 2(c) shows\nexample Gibbs trace plots with N = 20 events and varying sizes of the snapshot set, K (j). We\nobserve that mixing time increases as K (j) increases, con\ufb01rming the discussion of mixing (Sec. 4.1).\n\ne\n\ne\n\nThe ADNI dataset. We also present a preliminary analysis of a cohort with a total number of\n347 subjects (including 83 control subjects) from the Alzheimer\u2019s Disease Neuroimaging Institute\n(ADNI). We derive seven typical biomarkers associated with the onset of Alzheimers: (1) the\ntotal tau level in cerebral spinal \ufb02uid (CSF) [tau], (2) the total A\u03b242 level in CSF [abeta], (3)\nthe total ADAS cognitive assessment score [adas], (4) brain volume [brainvol], (5) hippocampal\nvolume [hippovol], (6) brain atrophy rate [brainatrophy], and (7) hippocampal atrophy rate\n\n7\n\n\f[hippoatrophy]. Due to the small number of measured events in the ADNI data, it is possible\nto apply the model of Fonteijn et al. 
[8] (which assumes that all patients follow a single ordering σ∗) by searching exhaustively over the collection of all 7! = 5040 orderings. We compare the ALPACA model against this single ordering model via BIC scores (shown in Fig. 2(d)). We fit our model five times, with the bias parameter α (described in Section 5) set to 0.1, 0.3, 0.5, 0.7, and 0.9. We use a single Gaussian for each of the healthy and occurred measurement distributions (as described in [8]), assuming that all patients in the control group are healthy.3
The results show that by allowing the event ordering σ to vary across patients, the ALPACA model significantly outperforms the single ordering model (shown in the σ∗ column) in BIC score for all of the tried settings of α. Further, we observe that setting α = 0.1 minimizes the BIC, reflecting, we conjecture, the fact that many of the patients in the ADNI cohort are in the earlier stages of Alzheimer's. The optimal central ordering inferred by the Fonteijn model is: σ∗ = adas|hippovol|hippoatrophy|brainatrophy|abeta|tau|brainvol, while ALPACA infers the central ordering: σ0 = adas|hippovol|abeta|hippoatrophy|tau|brainatrophy|brainvol. Observe that the two event orderings are largely in agreement, with the CSF Aβ42 and CSF tau events shifted earlier in the ALPACA ordering; this is more consistent with current thinking in neurology [12, 5, 1], which places the two CSF events first. Note that adas is first in both orderings because it was used to classify the patients; its position is thus somewhat artificial. It is surprising that the hippocampal volume and atrophy events are inferred by both models to occur before the CSF events [13], but we believe that this may be due to the significant proportion of misdiagnosed patients in the data.
These misdiagnosed patients still show heavy atrophy in the hippocampus, which is a common pathology among many neurological conditions (other dementias and psychiatric disorders), whereas a change in CSF Aβ is much more specific to AD. Future work will adapt the model for robustness to these misdiagnoses and other outliers.
Finally, Fig. 2(e) shows the patient staging result for an example patient from the ADNI data. The left matrix visualizes the probability that each event occurs in each position of the event ordering given the snapshot data from this patient, while the right histogram visualizes where in the event ordering the patient was situated when the snapshot was taken.
7 Conclusions
We have developed the Alzheimer's disease Probabilistic Cascades model for event ordering within the Alzheimer's disease cascade. In its most basic form, ALPACA is a simple model with generative semantics, allowing one to learn the central ordering of events that occur within a disease progression as well as to quantify the variance of this ordering across patients. Our preliminary results show that relaxing the assumption that a single ordering over events exists for all patients allows ALPACA to achieve a much better fit to snapshot data from a cohort of Alzheimer's patients.
One of our main contributions is to show how the combinatorial structure of event ordering models can be exploited for algorithmic efficiency. While exact inference remains intractable for ALPACA, we have presented a simple MCMC-based procedure which uses dynamic programming as a subroutine for highly efficient inference.
There may exist biomarkers for Alzheimer's that are more effective for patient staging than those considered in our current work. Identifying such biomarker events remains an open question crucial to the success of data-driven models of disease cascades.
Fortunately, one of the main advantages of ALPACA lies in its extensibility and modularity. We have discussed several such possible extensions, from more general measurement models to more general riffle independent ordering models. Additionally, with its ability to scale gracefully with problem size as well as to handle noise, we believe that the ALPACA model will be applicable to many other Alzheimer's datasets as well as to datasets for other neurodegenerative diseases.
Acknowledgements
J. Huang is supported by an NSF Computing Innovation Fellowship. The EPSRC supports D. Alexander's work on this topic with grant EP/J020990/01. The authors also thank Dr. Jonathan Schott, UCL Dementia Centre, and Dr. Jonathan Bartlett, London School of Hygiene and Tropical Medicine, for preparation of the data and help with interpretation of the results.

3We note that this assumption is a major oversimplification, as some of the control subjects are likely affected by some non-AD neurodegenerative disease. Due to such difficulties in obtaining ground truth data, however, estimating accurate measurement models can sometimes be a limitation.

References
[1] Paul S. Aisen, Ronald C. Petersen, Michael C. Donohue, Anthony Gamst, Rema Raman, Ronald G. Thomas, Sarah Walter, John Q. Trojanowski, Leslie M. Shaw, Laurel A. Beckett, Clifford R. Jack, William Jagust, Arthur W. Toga, Andrew J. Saykin, John C. Morris, Robert C. Green, and Michael W. Weiner. The Alzheimer's Disease Neuroimaging Initiative: progress report and future plans. Alzheimer's & Dementia: The Journal of the Alzheimer's Association, 6(3):239–246, 2010.
[2] Laurel Beckett. Maximum likelihood estimation in Mallows's model using partially ranked data, pages 92–107. Springer-Verlag, New York, 1993.
[3] H. Braak and E. Braak. Neuropathological staging of Alzheimer-related changes. Acta Neuropathol., 82:239–259, 1991.
[4] Ludwig M.
Busse, Peter Orbanz, and Joachim Buhmann. Cluster analysis of heterogeneous rank data. In The 24th Annual International Conference on Machine Learning, ICML '07, Corvallis, Oregon, June 2007.
[5] A. Caroli and G. B. Frisoni. The dynamics of Alzheimer's disease biomarkers in the Alzheimer's Disease Neuroimaging Initiative cohort. Neurobiology of Aging, 31(8):1263–1274, 2010.
[6] Harr Chen, S. R. K. Branavan, Regina Barzilay, and David R. Karger. Global models of document structure using latent permutations. In Proceedings of Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics, NAACL '09, pages 371–379, Stroudsburg, PA, USA, 2009. Association for Computational Linguistics.
[7] Michael Fligner and Joseph Verducci. Multistage ranking models. Journal of the American Statistical Association, 83(403):892–901, 1988.
[8] Hubert M. Fonteijn, Marc Modat, Matthew J. Clarkson, Josephine Barnes, Manja Lehmann, Nicola Z. Hobbs, Rachael I. Scahill, Sarah J. Tabrizi, Sebastien Ourselin, Nick C. Fox, and Daniel C. Alexander. An event-based model for disease progression and its application in familial Alzheimer's disease and Huntington's disease. NeuroImage, 60(3):1880–1889, 2012.
[9] Jonathan Huang. Probabilistic Reasoning and Learning on Permutations: Exploiting Structural Decompositions of the Symmetric Group. PhD thesis, Carnegie Mellon University, 2011.
[10] Jonathan Huang and Carlos Guestrin. Learning hierarchical riffle independent groupings from rankings. In International Conference on Machine Learning (ICML 2010), Haifa, Israel, June 2010.
[11] Jonathan Huang, Ashish Kapoor, and Carlos Guestrin. Efficient probabilistic inference with partial ranking queries.
In Conference on Uncertainty in Artificial Intelligence, Barcelona, Spain, July 2011.
[12] Clifford R. Jack, David S. Knopman, William J. Jagust, Leslie M. Shaw, Paul S. Aisen, Michael W. Weiner, Ronald C. Petersen, and John Q. Trojanowski. Hypothetical model of dynamic biomarkers of the Alzheimer's pathological cascade. The Lancet Neurology, 9(1):119–128, January 2010.
[13] Clifford R. Jack, Prashanthi Vemuri, Heather J. Wiste, Stephen D. Weigand, Paul S. Aisen, John Q. Trojanowski, Leslie M. Shaw, Matthew A. Bernstein, Ronald C. Petersen, Michael W. Weiner, and David S. Knopman. Evidence for ordering of Alzheimer disease biomarkers. Archives of Neurology, 2011.
[14] Guy Lebanon and Yi Mao. Non-parametric modeling of partially ranked data. In John C. Platt, Daphne Koller, Yoram Singer, and Sam Roweis, editors, Advances in Neural Information Processing Systems 20, NIPS '07, pages 857–864, Cambridge, MA, 2008. MIT Press.
[15] Tyler Lu and Craig Boutilier. Learning Mallows models with pairwise preferences. In The 28th Annual International Conference on Machine Learning, ICML '11, Bellevue, Washington, June 2011.
[16] Marina Meila, Kapil Phadnis, Arthur Patterson, and Jeff Bilmes. Consensus ranking under the exponential model. Technical Report 515, University of Washington, Statistics Department, April 2007.
[17] Rachael I. Scahill, Jonathan M. Schott, John M. Stevens, Martin N. Rossor, and Nick C. Fox. Mapping the evolution of regional atrophy in Alzheimer's disease: unbiased analysis of fluid-registered serial MRI. Proceedings of the National Academy of Sciences, 99(7):4703–4707, 2002.
[18] Mark Steyvers, Michael Lee, Brent Miller, and Pernille Hemmer. The wisdom of crowds in the recollection of order information. In Y. Bengio, D. Schuurmans, J. Lafferty, C. K. I. Williams, and A. Culotta, editors, Advances in Neural Information Processing Systems 22, pages 1785–1793.
2009.