{"title": "Disease Trajectory Maps", "book": "Advances in Neural Information Processing Systems", "page_first": 4709, "page_last": 4717, "abstract": "Medical researchers are coming to appreciate that many diseases are in fact complex, heterogeneous syndromes composed of subpopulations that express different variants of a related complication. Longitudinal data extracted from individual electronic health records (EHR) offer an exciting new way to study subtle differences in the way these diseases progress over time. In this paper, we focus on answering two questions that can be asked using these databases of longitudinal EHR data. First, we want to understand whether there are individuals with similar disease trajectories and whether there are a small number of degrees of freedom that account for differences in trajectories across the population. Second, we want to understand how important clinical outcomes are associated with disease trajectories. To answer these questions, we propose the Disease Trajectory Map (DTM), a novel probabilistic model that learns low-dimensional representations of sparse and irregularly sampled longitudinal data. We propose a stochastic variational inference algorithm for learning the DTM that allows the model to scale to large modern medical datasets. To demonstrate the DTM, we analyze data collected on patients with the complex autoimmune disease, scleroderma. We find that DTM learns meaningful representations of disease trajectories and that the representations are significantly associated with important clinical outcomes.", "full_text": "Disease Trajectory Maps\n\nPeter Schulam\n\nDept. of Computer Science\nJohns Hopkins University\n\nBaltimore, MD 21218\npschulam@cs.jhu.edu\n\nRaman Arora\n\nDept. 
of Computer Science\nJohns Hopkins University\n\nBaltimore, MD 21218\narora@cs.jhu.edu\n\nAbstract\n\nMedical researchers are coming to appreciate that many diseases are in fact com-\nplex, heterogeneous syndromes composed of subpopulations that express different\nvariants of a related complication. Longitudinal data extracted from individual\nelectronic health records (EHR) offer an exciting new way to study subtle differ-\nences in the way these diseases progress over time. In this paper, we focus on\nanswering two questions that can be asked using these databases of longitudinal\nEHR data. First, we want to understand whether there are individuals with similar\ndisease trajectories and whether there are a small number of degrees of freedom\nthat account for differences in trajectories across the population. Second, we want\nto understand how important clinical outcomes are associated with disease trajecto-\nries. To answer these questions, we propose the Disease Trajectory Map (DTM), a\nnovel probabilistic model that learns low-dimensional representations of sparse and\nirregularly sampled longitudinal data. We propose a stochastic variational inference\nalgorithm for learning the DTM that allows the model to scale to large modern\nmedical datasets. To demonstrate the DTM, we analyze data collected on patients\nwith the complex autoimmune disease, scleroderma. We \ufb01nd that DTM learns\nmeaningful representations of disease trajectories and that the representations are\nsigni\ufb01cantly associated with important clinical outcomes.\n\nIntroduction\n\n1\nLongitudinal data is becoming increasingly important in medical research and practice. This is due,\nin part, to the growing adoption of electronic health records (EHRs), which capture snapshots of\nan individual\u2019s state over time. These snapshots include clinical observations (apparent symptoms\nand vital sign measurements), laboratory test results, and treatment information. 
In parallel, medical researchers are beginning to recognize and appreciate that many diseases are in fact complex, highly heterogeneous syndromes [Craig, 2008] and that individuals may belong to disease subpopulations or subtypes that express similar sets of symptoms over time (see e.g. Saria and Goldenberg [2015]). Examples of such diseases include asthma [L\u00f6tvall et al., 2011], autism [Wiggins et al., 2012], and COPD [Castaldi et al., 2014]. The data captured in EHRs can help us better understand these complex diseases. EHRs contain many types of observations, and the ability to track their progression can help bring into focus the subtle differences across individual disease expression.\nIn this paper, we focus on two exploratory questions that we can begin to answer using longitudinal EHR data. First, we want to discover whether there are individuals with similar disease trajectories and whether there are a small number of degrees of freedom that account for differences across a heterogeneous population. A low-dimensional characterization of trajectories and how they differ can yield insights into the biological underpinnings of the disease. In turn, this may motivate new targeted therapies. In the clinic, physicians can analyze an individual\u2019s clinical history to estimate the low-dimensional representation of the trajectory and can use this knowledge to make more accurate prognoses and guide treatment decisions by comparing against representations of past trajectories. Second, we would like to know whether individuals with similar clinical outcomes (e.g. death, severe organ damage, or development of comorbidities) have similar disease trajectories. In complex diseases, individuals are often at risk of developing a number of severe complications and clinicians rarely have access to accurate prognostic biomarkers.\n\n30th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain.\n\n
Discovering associations between\ntarget outcomes and trajectory patterns may both generate new hypotheses regarding the causes of\nthese outcomes and help clinicians to better anticipate the event using an individual\u2019s clinical history.\nContributions. Our approach to simultaneously answering these questions is to embed individual\ndisease trajectories into a low-dimensional vector space wherein similarity in the embedded space\nimplies that two individuals have similar trajectories. Such a representation would naturally answer\nour \ufb01rst question, and could also be used to answer the second by comparing distributions over\nrepresentations across groups de\ufb01ned by different outcomes. To learn these representations, we\nintroduce a novel probabilistic model of longitudinal data, which we term the Disease Trajectory Map\n(DTM). In particular, the DTM models the trajectory over time of a single clinical marker, which is\nan observation or measurement recorded over time by clinicians that is used to track the progression\nof a disease (see e.g. Schulam et al. [2015]). Examples of clinical markers are pulmonary function\ntests or creatinine laboratory test results, which track lung and kidney function respectively. The\nDTM discovers low-dimensional (e.g. 2D or 3D) latent representations of clinical marker trajectories\nthat are easy to visualize. We describe a stochastic variational inference algorithm for estimating the\nposterior distribution over the parameters and individual-speci\ufb01c representations, which allows our\nmodel to be easily applied to large datasets. To demonstrate the DTM, we analyze clinical marker data\ncollected on individuals with the complex autoimmune disease scleroderma (see e.g. Allanore et al.\n[2015]). 
We find that the learned representations capture interesting subpopulations consistent with previous findings, and that the representations suggest associations with important clinical outcomes not captured by alternative representations.\n1.1 Background and Related Work\nClinical marker data extracted from EHRs is a by-product of an individual\u2019s interactions with the healthcare system. As a result, the time series are often irregularly sampled (the time between samples varies within and across individuals), and may be extremely sparse (it is not unusual to have a single observation for an individual). To aid the following discussion, we briefly introduce notation for this type of data. We use m to denote the number of individual disease trajectories recorded in a given dataset. For each individual, we use ni to denote the number of observations. We collect the observation times for subject i into a column vector ti \u225c [ti1, . . . , tini]\u22a4 (sorted in non-decreasing order) and the corresponding measurements into a column vector yi \u225c [yi1, . . . , yini]\u22a4. Our goal is to embed the pair (ti, yi) into a low-dimensional vector space wherein similarity between two embeddings xi and xj implies that the trajectories have similar shapes. This is commonly done using basis representations of the trajectories.\nFixed basis representations. In the statistics literature, (ti, yi) is often referred to as unbalanced longitudinal data, and is commonly analyzed using linear mixed models (LMMs) [Verbeke and Molenberghs, 2009]. In their simplest form, LMMs assume the following probabilistic model:\n\nwi | \u03a3 \u223c N (0, \u03a3), yi | Bi, wi, \u00b5, \u03c32 \u223c N (\u00b5 + Biwi, \u03c32Ini).   (1)\n\nThe matrix Bi \u2208 Rni\u00d7d is known as the design matrix, and can be used to capture non-linear relationships between the observation times ti and measurements yi. 
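As a minimal sketch of the LMM in (1): with a simple polynomial basis standing in for the splines used later in the paper, the posterior over the coefficients wi of one sparse trajectory is a standard Gaussian conjugate update. The function names and toy numbers below are illustrative assumptions, not from the paper.

```python
import numpy as np

def design_matrix(t, d=4):
    # Polynomial basis expansion b(t) = [1, t, t^2, ..., t^(d-1)];
    # splines or Fourier features could be substituted here.
    return np.vander(t, N=d, increasing=True)

def lmm_posterior(t, y, mu, Sigma, sigma2, d=4):
    """Posterior over basis coefficients w_i for one trajectory, under
    w_i ~ N(0, Sigma) and y_i ~ N(mu + B_i w_i, sigma2 * I)."""
    B = design_matrix(t, d)
    prec = np.linalg.inv(Sigma) + (B.T @ B) / sigma2   # posterior precision
    cov = np.linalg.inv(prec)                          # posterior covariance
    mean = cov @ (B.T @ (y - mu)) / sigma2             # posterior mean
    return mean, cov

# A sparse, irregularly sampled trajectory (times in years, marker values).
t_i = np.array([0.3, 1.7, 4.2])
y_i = np.array([82.0, 78.5, 71.0])
w_mean, w_cov = lmm_posterior(t_i, y_i, mu=75.0, Sigma=np.eye(4), sigma2=4.0)
print(w_mean.shape)  # (4,)
```

The posterior mean `w_mean` is the embedding of the trajectory in the (here 4-dimensional) coefficient space discussed below.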
Its rows are comprised of d-dimensional basis expansions of each observation time: Bi = [b(ti1), . . . , b(tini)]\u22a4. Common choices of b(\u00b7) include polynomials, splines, wavelets, and Fourier series. The particular basis used is often carefully crafted by the analyst depending on the nature of the trajectories and on the desired structure (e.g. invariance to translations and scaling) in the representation [Brillinger, 2001]. The design matrix can therefore make the LMM remarkably flexible despite its simple parametric probabilistic assumptions. Moreover, the prior over wi and the conjugate likelihood make it straightforward to fit \u00b5, \u03a3, and \u03c32 using EM or Bayesian posterior inference.\nAfter estimating the model parameters, we can estimate the coefficients wi of a given clinical marker trajectory using the posterior distribution, which embeds the trajectory in a Euclidean space. To flexibly capture complex trajectory shapes, however, the basis must be high-dimensional, which makes interpretability of the representations challenging. We can use low-dimensional summaries such as the projection onto a principal subspace, but these are not necessarily substantively meaningful. Indeed, much research has gone into developing principal direction post-processing techniques (e.g. Kaiser [1958]) or alternative estimators that enhance interpretability (e.g. Carvalho et al. [2012]).\nData-adaptive basis representations. A set of related, but more flexible, techniques comes from functional data analysis, where observations are functions (i.e. trajectories) assumed to be sampled from a stochastic process and the goal is to find a parsimonious representation for the data [Ramsay et al., 2002]. 
Functional principal component analysis (FPCA), one of the most popular techniques in\nfunctional data analysis, expresses functional data in the orthonormal basis given by the eigenfunctions\nof the auto-covariance operator. This representation is optimal in the sense that no other representation\ncaptures more variation [Ramsay, 2006]. The idea itself can be traced back to early independent work\nby Karhunen and Loeve and is also referred to as the Karhunen-Loeve expansion [Watanabe, 1965].\nWhile numerous variants of FPCA have been proposed, the one that is most relevant to the problem\nat hand is that of sparse FPCA [Castro et al., 1986, Rice and Wu, 2001] where we allow sparse,\nirregularly sampled data as is common in longitudinal data analysis. To deal with the sparsity, Rice\nand Wu [2001] used LMMs to model the auto-covariance operator. In very sparse settings, however,\nLMMs can suffer from numerical instability of covariance matrices in high dimensions. James et al.\n[2000] addressed this by constraining the rank of the covariance matrices\u2014we will refer to this\nmodel as the reduced-rank LMM, but note that it is a variant of sparse FPCA. Although sparse FPCA\nrepresents trajectories using a data-driven basis, the basis is restricted to lie in a linear subspace of a\n\ufb01xed basis, which may be overly restrictive. Other approaches to learning a functional basis include\nBayesian estimation of B-spline parameters (e.g. [Bigelow and Dunson, 2012]) and placing priors\nover reproducing kernel Hilbert spaces (e.g. [MacLehose and Dunson, 2009]). Although \ufb02exible,\nthese two approaches do not learn a low-dimensional representation.\nCluster-based representations. Mixture models and clustering approaches are also commonly\nused to represent and discover structure in time series data. Marlin et al. [2012] cluster time series\ndata from the intensive care unit (ICU) using a mixture model and use cluster membership to predict\noutcomes. 
Schulam and Saria [2015] describe a probabilistic model that represents trajectories using\na hierarchy of features, which includes \u201csubtype\u201d or cluster membership. LMMs have also been\nextended to have nonparametric Dirichlet process priors over the coef\ufb01cients (e.g. Kleinman and\nIbrahim [1998]), which implicitly induce clusters in the data. Although these approaches \ufb02exibly\nmodel trajectory data, the structure they recover is a partition, which does not allow us to compare all\ntrajectories in a coherent way as we can in a vector space.\nLexicon-based representations. Another line of research has investigated the discovery of motifs\nor repeated patterns in continuous time-series data for the purposes of succinctly representing the\ndata as a string of elements of the discovered lexicon. These include efforts in the speech processing\ncommunity to identify sub-word units (parts of words comparable to phonemes) in a data-driven\nmanner [Varadarajan et al., 2008, Levin et al., 2013]. In computational healthcare, Saria et al. [2011]\npropose a method for discovering deformable motifs that are repeated in continuous time series\ndata. These methods are, in spirit, similar to discretization approaches such as symbolic aggregate\napproximation (SAX) [Lin et al., 2007] and piecewise aggregate approximation (PAA) [Keogh et al.,\n2001] that are popular in data mining, and aim to \ufb01nd compact descriptions of sequential data,\nprimarily for the purposes of indexing, search, anomaly detection, and information retrieval. The\nfocus in this paper is to learn representations for entire trajectories rather than discover a lexicon.\nFurthermore, we focus on learning a representation in a vector space where similarities among\ntrajectories are captured through the standard inner product on Rd.\n2 Disease Trajectory Maps\nTo motivate Disease Trajectory Maps (DTM), we begin with the reduced-rank LMM proposed\nby James et al. [2000]. 
We show that the reduced-rank LMM defines a Gaussian process with a covariance function that linearly depends on trajectory-specific representations. To define DTMs, we then use the kernel trick to make the dependence non-linear. Let \u00b5 \u2208 R be the marginal mean of the observations, F \u2208 Rd\u00d7q be a rank-q matrix, and \u03c32 be the variance of measurement errors. As a reminder, yi \u2208 Rni denotes the vector of observed trajectory measurements, Bi \u2208 Rni\u00d7d denotes the individual\u2019s design matrix, and xi \u2208 Rq denotes the individual\u2019s representation. James et al. [2000] define the reduced-rank LMM using the following conditional distribution:\n\nyi | Bi, xi, \u00b5, F, \u03c32 \u223c N (\u00b5 + BiFxi, \u03c32Ini).   (2)\n\nThey assume an isotropic normal prior over xi and marginalize to obtain the observed-data log-likelihood, which is then optimized with respect to {\u00b5, F, \u03c32}. As in Lawrence [2004], we instead optimize xi and marginalize F. By assuming a normal prior N (0, \u03b1Iq) over the rows of F and marginalizing we obtain:\n\nyi | Bi, xi, \u00b5, \u03c32, \u03b1 \u223c N (\u00b5, \u03b1\u27e8xi, xi\u27e9BiBi\u22a4 + \u03c32Ini).   (3)\n\nNote that by marginalizing over F, we induce a joint distribution over all trajectories in the dataset. Moreover, this joint distribution is a Gaussian process with mean \u00b5 and the following covariance function defined across trajectories that depends on times {ti, tj} and representations {xi, xj}:\n\nCov(yi, yj | Bi, Bj, xi, xj, \u00b5, \u03c32, \u03b1) = \u03b1\u27e8xi, xj\u27e9BiBj\u22a4 + I[i = j] \u03c32Ini.   (4)\n\nThis reformulation of the reduced-rank LMM highlights that the covariance across trajectories i and j depends on the inner product between the two representations xi and xj, and suggests that we can non-linearize the dependency with an inner product in an expanded feature space using the \u201ckernel trick\u201d. 
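The cross-trajectory covariance in (4) and its kernelized form can be sketched directly. The polynomial basis, toy inputs, and function names below are illustrative assumptions; a linear kernel recovers the reduced-rank LMM case, while an RBF kernel gives the DTM case.

```python
import numpy as np

def basis(t, d=4):
    # Stand-in polynomial basis; the paper uses B-splines.
    return np.vander(t, N=d, increasing=True)

def dtm_cov_block(ti, tj, xi, xj, kernel, sigma2, same=False):
    """Covariance block Cov(y_i, y_j) = k(x_i, x_j) B_i B_j^T + [i == j] sigma2 I."""
    Bi, Bj = basis(ti), basis(tj)
    C = kernel(xi, xj) * (Bi @ Bj.T)
    if same:  # add measurement noise on the diagonal blocks only
        C += sigma2 * np.eye(len(ti))
    return C

# Linear kernel <-> reduced-rank LMM covariance; RBF kernel <-> DTM covariance.
linear = lambda xi, xj, alpha=1.0: alpha * xi @ xj
rbf = lambda xi, xj, ell=1.0: np.exp(-np.sum((xi - xj) ** 2) / (2 * ell ** 2))

ti, tj = np.array([0.5, 2.0]), np.array([1.0, 3.0, 6.0])
xi, xj = np.array([0.2, -1.0]), np.array([0.4, -0.7])
print(dtm_cov_block(ti, tj, xi, xj, rbf, sigma2=1.0).shape)  # (2, 3)
```

Similar representations xi and xj make k(xi, xj) large, which couples the two trajectories' measurements; dissimilar representations decouple them.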
Let k(\u00b7,\u00b7) denote a non-linear kernel defined over the representations with parameters \u03b8; then we have:\n\nCov(yi, yj | Bi, Bj, xi, xj, \u00b5, \u03c32, \u03b8) = k(xi, xj)BiBj\u22a4 + I[i = j] \u03c32Ini.   (5)\n\nLet y \u225c [y1\u22a4, . . . , ym\u22a4]\u22a4 denote the column vector obtained by concatenating the measurement vectors from each trajectory. The joint distribution over y is a multivariate normal:\n\ny | B1:m, x1:m, \u00b5, \u03c32, \u03b8 \u223c N (\u00b5, \u03a3DTM + \u03c32In),   (6)\n\nwhere \u03a3DTM is a covariance matrix that depends on the times t1:m (through design matrices B1:m) and representations x1:m. In particular, \u03a3DTM is a block-structured matrix with m row blocks and m column blocks. The block at the ith row and jth column is the covariance between yi and yj defined in (5). Finally, we place isotropic Gaussian priors over xi. We use Bayesian inference to obtain a posterior Gaussian process and to estimate the representations. We tune hyperparameters by maximizing the observed-data log likelihood. Note that our model is similar to the Bayesian GPLVM [Titsias and Lawrence, 2010], but models functional data instead of finite-dimensional vectors.\n2.1 Learning and Inference in the DTM\nAs formulated, the model scales poorly to large datasets. Inference within each iteration of an optimization algorithm, for example, requires storing and inverting \u03a3DTM, which requires O(n2) space and O(n3) time respectively, where n \u225c n1 + \u00b7\u00b7\u00b7 + nm is the number of clinical marker observations. For modern datasets, where n can be in the hundreds of thousands or millions, this is unacceptable. In this section, we approximate the log-likelihood using techniques from Hensman et al. [2013] that allow us to apply stochastic variational inference (SVI) [Hoffman et al., 2013].\n\nInducing points. 
Recent work in scaling Gaussian processes to large datasets has focused on the idea of inducing points [Snelson and Ghahramani, 2005, Titsias, 2009], which are a relatively small number of artificial observations of a Gaussian process that approximately capture the information contained in the training data. In general, let f \u2208 Rm denote observations of a GP at inputs {xi}mi=1 and u \u2208 Rp denote inducing points at inputs {zi}pi=1. Titsias [2009] constructs the inducing points as variational parameters by introducing an augmented probability model:\n\nu \u223c N (0, Kpp), f | u \u223c N (KmpKpp\u207b\u00b9u, \u02dcKmm),   (7)\n\nwhere Kpp is the Gram matrix between inducing points, Kmm is the Gram matrix between observations, Kmp is the cross Gram matrix between observations and inducing points, and \u02dcKmm \u225c Kmm \u2212 KmpKpp\u207b\u00b9Kpm. We can marginalize over u to construct a low-rank approximate covariance matrix, which is computationally cheaper to invert using the Woodbury identity. Alternatively, Hensman et al. [2013] extends these ideas by explicitly maintaining a variational distribution over u that d-separates the observations and satisfies the conditions required to apply SVI [Hoffman et al., 2013]. Let yf = f + \u03b5 where \u03b5 is iid Gaussian noise with variance \u03c32; then we use the following inequality to lower bound our data log-likelihood:\n\nlog p(yf | u) \u2265 \u2211i Ep(fi|u)[log p(yfi | fi)].   (8)\n\nIn the interest of space, we refer the interested reader to Hensman et al. [2013] for details.\nDTM evidence lower bound. When marginalizing over the rows of F, we induced a Gaussian process over the trajectories, but by doing so we also implicitly induced a Gaussian process over the individual-specific basis coefficients. 
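A minimal numerical sketch of why inducing points help: with the low-rank approximation Q = Kmp Kpp^{-1} Kpm, the Woodbury identity inverts Q + sigma2 I in O(m p^2) rather than O(m^3). The RBF kernel, sizes, and function names below are assumptions for illustration.

```python
import numpy as np

def rbf_gram(A, B, ell=1.0):
    # RBF Gram matrix between two sets of inputs.
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * ell ** 2))

def woodbury_inverse(Kmp, Kpp, sigma2):
    """Invert (Kmp Kpp^{-1} Kpm + sigma2 I_m) via the Woodbury identity:
    only a p x p system is solved, instead of an m x m inverse."""
    m, p = Kmp.shape
    A = Kpp * sigma2 + Kmp.T @ Kmp      # p x p inner matrix
    inner = np.linalg.solve(A, Kmp.T)   # p x m
    return (np.eye(m) - Kmp @ inner) / sigma2

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))   # GP inputs (representations), m = 200
Z = rng.normal(size=(10, 2))    # inducing inputs, p = 10
Kmp = rbf_gram(X, Z)
Kpp = rbf_gram(Z, Z) + 1e-8 * np.eye(10)  # jitter for stability
approx_inv = woodbury_inverse(Kmp, Kpp, sigma2=0.5)
```

The p x p solve is what makes the per-iteration cost independent of a direct m x m inversion; the identity is exact, so the result matches the brute-force inverse of the low-rank-plus-noise matrix.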
Let wi \u225c Fxi \u2208 Rd denote the basis weights implied by the mapping F and representation xi in the reduced-rank LMM, and let w:,k for k \u2208 [d] denote the kth coefficient of all individuals in the dataset. After marginalizing the kth row of F and applying the kernel trick, we see that the vector of coefficients w:,k has a Gaussian process distribution with mean zero and covariance function Cov(wik, wjk) = \u03b1k(xi, xj). Moreover, the Gaussian processes across coefficients are statistically independent of one another. To lower bound the DTM log-likelihood, we introduce p inducing points uk for each vector of coefficients w:,k with shared inducing point inputs {zi}pi=1. To refer to all inducing points simultaneously, we use U \u225c [u1, . . . , ud] and u to denote the \u201cvectorized\u201d form of U obtained by stacking its columns. Applying (8) we have:\n\nlog p(y | u, x1:m) \u2265 \u2211i Ep(wi|u,xi)[log p(yi | wi)] = \u2211i ( log N (yi | \u00b5 + BiU\u22a4Kpp\u207b\u00b9ki, \u03c32Ini) \u2212 (\u02dckii/2\u03c32) Tr[Bi\u22a4Bi] ) \u225c \u2211i log \u02dcp(yi | u, xi),   (9)\n\nwhere ki \u225c [k(xi, z1), . . . , k(xi, zp)]\u22a4 and \u02dckii is the ith diagonal element of \u02dcKmm. We can then construct the variational lower bound on log p(y):\n\nlog p(y) \u2265 Eq(u,x1:m)[log p(y | u, x1:m)] \u2212 KL(q(u, x1:m) \u2016 p(u, x1:m))   (10)\n\u2265 \u2211i Eq(u,xi)[log \u02dcp(yi | u, xi)] \u2212 KL(q(u, x1:m) \u2016 p(u, x1:m)),   (11)\n\nwhere we use the lower bound in (9). Finally, to make the lower bound concrete we specify the variational distribution q(u, x1:m) to be a product of independent multivariate normal distributions:\n\nq(u, x1:m) \u225c N (u | m, S) \u220fi N (xi | mi, Si),   (12)\n\nwhere the variational parameters to be fit are m, S, and {mi, Si}mi=1.\nStochastic optimization of the lower bound. 
To apply SVI, we must be able to compute the gradient of the expected value of log \u02dcp(yi | u, xi) under the variational distributions. Because u and xi are assumed to be independent in the variational posteriors, we can analyze the expectation in either order. Fix xi; then we see that log \u02dcp(yi | u, xi) depends on u only through the mean of the Gaussian density, which is a quadratic term in the log likelihood. Because q(u) is multivariate normal, we can compute the expectation in closed form:\n\nEq(u)[log \u02dcp(yi | u, xi)] = Eq(U)[log N (yi | \u00b5 + (Bi \u2297 ki\u22a4Kpp\u207b\u00b9)u, \u03c32Ini)] \u2212 (\u02dckii/2\u03c32) Tr[Bi\u22a4Bi]\n= log N (yi | \u00b5 + Cim, \u03c32Ini) \u2212 (1/2\u03c32) Tr[SCi\u22a4Ci] \u2212 (\u02dckii/2\u03c32) Tr[Bi\u22a4Bi],\n\nwhere we have defined Ci \u225c (Bi \u2297 ki\u22a4Kpp\u207b\u00b9) to be the extended design matrix and \u2297 is the Kronecker product. We now need to compute the expectation of this expression with respect to q(xi), which entails computing the expectations of ki (a vector) and kiki\u22a4 (a matrix). In this paper, we assume an RBF kernel, and so the elements of the vector and matrix are all exponentiated quadratic functions of xi. This makes the expectations straightforward to compute given that q(xi) is multivariate normal.1 We therefore see that the expected value of log \u02dcp(yi) can be computed in closed form under the assumed variational distribution.\nWe use the standard SVI algorithm to optimize the lower bound. We subsample the data, optimize the likelihood of each example in the batch with respect to the variational parameters over the representation (mi, Si), and compute approximate gradients of the global variational parameters (m, S) and the hyperparameters. 
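The closed-form expectation mentioned above can be sketched for a unit-scale RBF kernel and an isotropic Gaussian q(xi) = N(m, s2 I): each dimension integrates to a rescaled RBF. This is a textbook Gaussian integral written out as a sanity check, not code from the paper.

```python
import numpy as np

def expected_rbf(m, s2, z, ell=1.0):
    """Closed-form E_q[k(x, z)] for k(x, z) = exp(-||x - z||^2 / (2 ell^2))
    when q(x) = N(m, s2 * I); dimensions factorize, giving a widened RBF."""
    q = len(m)
    scale = (ell ** 2 / (ell ** 2 + s2)) ** (q / 2.0)
    return scale * np.exp(-np.sum((m - z) ** 2) / (2 * (ell ** 2 + s2)))

m = np.array([0.5, -0.2])   # variational mean of x_i
z = np.array([1.0, 0.3])    # an inducing input

# Monte Carlo check of the closed form.
rng = np.random.default_rng(1)
x = rng.normal(m, np.sqrt(0.3), size=(200000, 2))
mc = np.exp(-np.sum((x - z) ** 2, axis=1) / 2).mean()
print(abs(mc - expected_rbf(m, 0.3, z)))  # small
```

When s2 = 0, the expectation collapses back to k(m, z), as it should; as the variational uncertainty grows, the effective kernel flattens.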
The likelihood term is conjugate to the prior over u, and so we can compute the natural gradients with respect to the global variational parameters m and S [Hoffman et al., 2013, Hensman et al., 2013]. Additional details on the approximate objective and the gradients required for SVI are given in the supplement. We provide details on initialization, minibatch selection, and learning rates for our experiments in Section 3.\nInference on new trajectories. The variational distribution over the inducing point values u can be used to approximate a posterior process over the basis coefficients wi [Hensman et al., 2013]. Therefore, given a representation xi, we have that\n\nwik | xi, m, S \u223c N (ki\u22a4Kpp\u207b\u00b9mk, \u02dckii + ki\u22a4Kpp\u207b\u00b9SkkKpp\u207b\u00b9ki),   (13)\n\nwhere mk is the approximate posterior mean of the kth column of U and Skk is its covariance. The approximate joint posterior distribution over all coefficients can be shown to be multivariate normal. Let \u00b5(xi) be the mean of this distribution given representation xi and \u03a3(xi) be the covariance; then the posterior predictive distribution over a new trajectory y\u2217 given the representation x\u2217 is\n\ny\u2217 | x\u2217 \u223c N (\u00b5 + B\u2217\u00b5(x\u2217), B\u2217\u03a3(x\u2217)B\u2217\u22a4 + \u03c32In\u2217).   (14)\n\nWe can then approximately marginalize with respect to the prior over x\u2217 or a variational approximation of the posterior given a partial trajectory using a Monte Carlo estimate.\n\n1Other kernels can be used instead, but the expectations may not have closed form expressions.\n\n3 Experiments\nWe now use DTM to analyze clinical marker trajectories of individuals with the autoimmune disease, scleroderma [Allanore et al., 2015]. Scleroderma is a heterogeneous and complex chronic autoimmune disease. 
It can potentially affect many of the visceral organs, such as the heart, lungs, kidneys, and\nvasculature. Any given individual may experience only a subset of complications, and the timing of\nthe symptoms relative to disease onset can vary considerably across individuals. Moreover, there are\nno known biomarkers that accurately predict an individual\u2019s disease course. Clinicians and medical\nresearchers are therefore interested in characterizing and understanding disease progression patterns.\nMoreover, there are a number of clinical outcomes responsible for the majority of morbidity among\npatients with scleroderma. These include congestive heart failure, pulmonary hypertension and\npulmonary arterial hypertension, gastrointestinal complications, and myositis [Varga et al., 2012].\nWe use the DTM to study associations between these outcomes and disease trajectories.\nWe study two scleroderma clinical markers. The \ufb01rst is the percent of predicted forced vital capacity\n(PFVC); a pulmonary function test result measuring lung function. PFVC is recorded in percentage\npoints, and a higher value (near 100) indicates that the individual\u2019s lungs are functioning as expected.\nThe second clinical marker that we study is the total modi\ufb01ed Rodnan skin score (TSS). Scleroderma\nis named after its effect on the skin, which becomes hard and \ufb01brous during periods of high disease\nactivity. Because it is the most clinically apparent symptom, many of the current sub-categorizations\nof scleroderma depend on an individual\u2019s pattern of skin disease activity over time [Varga et al.,\n2012]. 
To systematically monitor skin disease activity, clinicians use the TSS, which is a quantitative score between 0 and 55 computed by evaluating skin thickness at 17 sites across the body (higher scores indicate more active skin disease).\n3.1 Experimental Setup\nFor our experiments, we extract trajectories from the Johns Hopkins Hospital Scleroderma Center\u2019s patient registry, one of the largest in the world. For both PFVC and TSS, we study the trajectory from the time of first symptom until ten years of follow-up. The PFVC dataset contains trajectories for 2,323 individuals and the TSS dataset contains 2,239 individuals. The median number of observations per individual is 3 for the PFVC data and 2 for the TSS data. The maximum number of observations is 55 and 22 for PFVC and TSS respectively.\nWe present two sets of results. First, we visualize groups of similar trajectories obtained by clustering the representations learned by DTM. Although not quantitative, we use these visualizations as a way to check that the DTM uncovers subpopulations that are consistent with what is currently known about scleroderma. Second, we use the learned representations of trajectories obtained using the LMM, the reduced-rank LMM (which we refer to as FPCA), and the DTM to statistically test for relationships between important clinical outcomes and learned disease trajectory representations.\nFor all experiments and all models, we use a common 5-dimensional B-spline basis composed of degree-2 polynomials (see e.g. Chapter 20 in Gelman et al. [2014]). We choose knots using the percentiles of observation times across the entire training set [Ramsay et al., 2002]. For LMM and FPCA, we use EM to fit model parameters. To fit the DTM, we use the LMM estimate to set the mean \u00b5 and noise \u03c32, and average the diagonal elements of \u03a3 to set the kernel scale \u03b1. Length-scales \u2113 are set to 1. 
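A B-spline design matrix of the kind described in this setup can be sketched with SciPy. The knot-placement details below (clamped endpoints, interior knots at percentiles) are assumptions for illustration; the paper does not spell them out.

```python
import numpy as np
from scipy.interpolate import BSpline

def bspline_design(times, all_times, d=5, degree=2):
    """Design matrix of d B-spline basis functions of the given degree, with
    interior knots at percentiles of the pooled observation times (a sketch)."""
    n_interior = d - degree - 1
    interior = np.percentile(all_times, np.linspace(0, 100, n_interior + 2)[1:-1])
    lo, hi = all_times.min(), all_times.max()
    # Clamped knot vector of length d + degree + 1.
    knots = np.concatenate([[lo] * (degree + 1), interior, [hi] * (degree + 1)])
    # Evaluate each basis element by giving BSpline a one-hot coefficient vector.
    cols = [BSpline(knots, np.eye(d)[j], degree)(times) for j in range(d)]
    return np.column_stack(cols)

# Pooled observation times across the training set (simulated here).
all_t = np.sort(np.random.default_rng(2).uniform(0, 10, size=500))
B = bspline_design(np.array([0.5, 3.0, 7.5]), all_t, d=5, degree=2)
print(B.shape)  # (3, 5)
```

Each row of B is the 5-dimensional basis expansion b(t) of one observation time, i.e. one row of the design matrix Bi used throughout the model.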
For these experiments, we do not learn the kernel hyperparameters during optimization. We initialize the variational means over xi using the first two unit-scaled principal components of wi estimated by LMM and set the variational covariances to be diagonal with standard deviation 0.1. For both PFVC and TSS, we use minibatches of size 25 and learn for a total of five epochs (passes over the training data). The initial learning rate for m and S is 0.1 and decays as t\u22121 for each epoch t.\n3.2 Qualitative Analysis of Representations\nThe DTM returns approximate posteriors over the representations xi for all individuals in the training set. We examine these posteriors for both the PFVC and TSS datasets to check for consistency with what is currently known about scleroderma disease trajectories.\n\nFigure 1: (A) Groups of PFVC trajectories obtained by hierarchical clustering of DTM representations. (B) Trajectory representations are color-coded and labeled according to groups shown in (A). Contours reflect posterior GP over the second B-spline coefficient (blue contours denote smaller values, red denote larger values).\n\nFigure 2: Same presentation as in Figure 1 but for TSS trajectories.\n\nIn Figure 1 (A) we show groups of trajectories uncovered by clustering the posterior means over the representations, which are plotted in Figure 1 (B). Many of the groups shown here align with other work on scleroderma lung disease subtypes (e.g. Schulam et al. [2015]). In particular, we see rapidly declining trajectories (group [5]), slowly declining trajectories (group [22]), recovering trajectories (group [23]), and stable trajectories (group [34]). Surprisingly, we also see a group of individuals who we describe as \u201clate decliners\u201d (group [28]). These individuals are stable for the first 5-6 years, but begin to decline thereafter. 
This is surprising because the onset of scleroderma-related lung disease is currently thought to occur early in the disease course [Varga et al., 2012]. In Figure 2 (A) we show clusters of TSS trajectories and the corresponding mean representations in Figure 2 (B). These trajectories corroborate what is currently known about skin disease in scleroderma. In particular, we see individuals who have minimal activity (e.g. group [1]) and individuals with early activity that later stabilizes (e.g. group [11]), which correspond to what are known as the limited and diffuse variants of scleroderma [Varga et al., 2012]. We also find that there are a number of individuals with increasing activity over time (group [6]) and some whose activity remains high over the ten-year period (group [19]). These patterns are not currently considered to be canonical trajectories and warrant further investigation.

3.3 Associations between Representations and Clinical Outcomes
To quantitatively evaluate the low-dimensional representations learned by the DTM, we statistically test for relationships between the representations of clinical marker trajectories and important clinical outcomes. We compare the inferences of the hypothesis test with those made using representations derived from the LMM and FPCA baselines. For the LMM, we project wi into its 2-dimensional principal subspace.
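The LMM projection step can be sketched as follows; a minimal illustration, assuming the fitted random-effect coefficient vectors are stacked row-wise in a matrix `W` (the function name and the synthetic data are hypothetical stand-ins):

```python
import numpy as np

def project_principal_subspace(W, k=2):
    """Project per-individual coefficient vectors w_i onto their
    top-k principal subspace.

    W: (n, d) matrix, one row per individual.
    Returns an (n, k) matrix of low-dimensional representations.
    """
    Wc = W - W.mean(axis=0)                         # center across individuals
    U, s, Vt = np.linalg.svd(Wc, full_matrices=False)
    return Wc @ Vt[:k].T                            # coordinates in top-k subspace

# Example with synthetic coefficients for 100 individuals.
rng = np.random.default_rng(0)
W = rng.normal(size=(100, 5))
Z = project_principal_subspace(W)
print(Z.shape)  # (100, 2)
```

Each row of `Z` is a 2-dimensional summary of one individual's trajectory, comparable in dimension to the representations produced by FPCA and the DTM.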
For FPCA, we learn a rank-2 covariance, which yields 2-dimensional representations. To establish that the models are all equally expressive and achieve comparable generalization error, we present held-out data log-likelihoods in Table 1, which are estimated using 10-fold cross-validation. We see that the models are roughly equivalent with respect to generalization error.

To test associations between clinical outcomes and learned representations, we use a kernel density estimator test [Duong et al., 2012] of the null hypothesis that the distributions across subgroups with and without the outcome are equivalent. The p-values obtained are listed in Table 2. As a point of

Table 1: Disease Trajectory Held-out Log-Likelihoods

                       PFVC                                 TSS
Model    Subj. LL          Obs. LL           Subj. LL          Obs. LL
LMM      -17.59 (± 1.18)   -3.95 (± 0.04)    -13.63 (± 1.41)   -3.47 (± 0.05)
FPCA     -17.89 (± 1.19)   -4.03 (± 0.02)    -13.76 (± 1.42)   -3.47 (± 0.05)
DTM      -17.74 (± 1.23)   -3.98 (± 0.03)    -13.25 (± 1.38)   -3.32 (± 0.06)

Table 2: P-values under the null hypothesis that the distributions of trajectory representations are the same across individuals with and without clinical outcomes.
Lower values indicate stronger support for rejection.

                                            PFVC                       TSS
Outcome                             LMM     FPCA    DTM       LMM     FPCA    DTM
Congestive Heart Failure            0.170   0.081   0.013     0.107   0.383   0.189
Pulmonary Hypertension              0.270  *0.000  *0.000     0.485   0.606   0.564
Pulmonary Arterial Hypertension     0.013   0.020  *0.002     0.712   0.808   0.778
Gastrointestinal Complications      0.328   0.073   0.347     0.026   0.035   0.011
Myositis                            0.337  *0.002  *0.004    *0.000  *0.002  *0.000
Interstitial Lung Disease          *0.000  *0.000  *0.000     0.553   0.515   0.495
Ulcers and Gangrene                 0.410   0.714   0.514     0.573   0.316  *0.009

reference, we include two clinical outcomes that should be clearly related to the two clinical markers. Interstitial lung disease is the most common cause of lung damage in scleroderma [Varga et al., 2012], and so we confirm that the null hypothesis is rejected for all three PFVC representations. Similarly, for TSS we expect ulcers and gangrene to be associated with severe skin disease. In this case, only the representations learned by the DTM reveal this relationship. For the remaining outcomes, we see that FPCA and DTM reveal similar associations, but that only the DTM suggests a relationship with pulmonary arterial hypertension (PAH). Presence of fibrosis (which drives lung disease progression) has been shown to be a risk factor in the development of PAH (see Chapter 36 of Varga et al. [2012]), but only the representations learned by the DTM corroborate this association (see Figure 3).

4 Conclusion
We presented the Disease Trajectory Map (DTM), a novel probabilistic model that learns low-dimensional embeddings of sparse and irregularly sampled clinical time series data. The DTM is a reformulation of the LMM.
We derived it using an approach comparable to that of Lawrence [2004] in deriving the Gaussian process latent variable model (GPLVM) from probabilistic principal component analysis (PPCA) [Tipping and Bishop, 1999], and indeed the DTM can be interpreted as a "twin kernel" GPLVM (briefly discussed in the concluding paragraphs) over functional observations. The DTM can also be viewed as an LMM with a "warped" Gaussian prior over the random effects (see e.g. Damianou et al. [2015] for a discussion of distributions induced by mapping Gaussian random variables through non-linear maps). We demonstrated the model by analyzing data extracted from one of the nation's largest scleroderma patient registries, and found that the DTM discovers structure among trajectories that is consistent with previous findings and also uncovers several surprising disease trajectory shapes. We also explored associations between important clinical outcomes and the DTM's representations, and found statistically significant differences in representations between outcome-defined groups that were not uncovered by the two sets of baseline representations.

Acknowledgments. PS is supported by an NSF Graduate Research Fellowship. RA is supported in part by NSF BIGDATA grant IIS-1546482.

Figure 3: Scatter plots of PFVC representations for the three models color-coded by presence or absence of pulmonary arterial hypertension (PAH). Groups of trajectories with very few cases of PAH are circled in green.

References
Allanore et al. Systemic sclerosis. Nature Reviews Disease Primers, page 15002, 2015.
Jamie L Bigelow and David B Dunson. Bayesian semiparametric joint models for functional predictors. Journal of the American Statistical Association, 2012.
David R Brillinger. Time series: data analysis and theory, volume 36. SIAM, 2001.
Carvalho et al.
High-dimensional sparse factor modeling: applications in gene expression genomics. Journal of the American Statistical Association, 2012.
P.J. Castaldi et al. Cluster analysis in the COPDGene study identifies subtypes of smokers with distinct patterns of airway disease and emphysema. Thorax, 2014.
PE Castro, WH Lawton, and EA Sylvestre. Principal modes of variation for processes with continuous sample curves. Technometrics, 28(4):329-337, 1986.
J. Craig. Complex diseases: Research and applications. Nature Education, 1(1):184, 2008.
A. C. Damianou, M. K. Titsias, and N. D. Lawrence. Variational inference for latent variables and uncertain inputs in Gaussian processes. JMLR, 2, 2015.
T. Duong, B. Goud, and K. Schauer. Closed-form density-based framework for automatic detection of cellular morphology changes. Proceedings of the National Academy of Sciences, 109(22):8382-8387, 2012.
Andrew Gelman et al. Bayesian data analysis, volume 2. Taylor & Francis, 2014.
J. Hensman, N. Fusi, and N.D. Lawrence. Gaussian processes for big data. arXiv:1309.6835, 2013.
M.D. Hoffman, D.M. Blei, C. Wang, and J. Paisley. Stochastic variational inference. JMLR, 14(1):1303-1347, 2013.
G.M. James, T.J. Hastie, and C.A. Sugar. Principal component models for sparse functional data. Biometrika, 87(3):587-602, 2000.
H.F. Kaiser. The varimax criterion for analytic rotation in factor analysis. Psychometrika, 23(3):187-200, 1958.
E. Keogh et al. Locally adaptive dimensionality reduction for indexing large time series databases. ACM SIGMOD Record, 30(2):151-162, 2001.
K.P. Kleinman and J.G. Ibrahim. A semiparametric Bayesian approach to the random effects model. Biometrics, pages 921-938, 1998.
N.D. Lawrence. Gaussian process latent variable models for visualisation of high dimensional data. Advances in Neural Information Processing Systems, 16(3):329-336, 2004.
K. Levin, K. Henry, A.
Jansen, and K. Livescu. Fixed-dimensional acoustic embeddings of variable-length segments in low-resource settings. In ASRU, pages 410-415. IEEE, 2013.
Jessica Lin, Eamonn Keogh, Li Wei, and Stefano Lonardi. Experiencing SAX: a novel symbolic representation of time series. Data Mining and Knowledge Discovery, 15(2):107-144, 2007.
J. Lötvall et al. Asthma endotypes: a new approach to classification of disease entities within the asthma syndrome. Journal of Allergy and Clinical Immunology, 127(2):355-360, 2011.
Richard F MacLehose and David B Dunson. Nonparametric Bayes kernel-based priors for functional data analysis. Statistica Sinica, pages 611-629, 2009.
B.M. Marlin et al. Unsupervised pattern discovery in electronic health care data using probabilistic clustering models. In Proc. ACM SIGHIT International Health Informatics Symposium, pages 389-398. ACM, 2012.
James Ramsay et al. Applied functional data analysis: methods and case studies. Springer, 2002.
James O Ramsay. Functional data analysis. Wiley Online Library, 2006.
J.A. Rice and C.O. Wu. Nonparametric mixed effects models for unequally sampled noisy curves. Biometrics, 57(1):253-259, 2001.
S. Saria and A. Goldenberg. Subtyping: What it is and its role in precision medicine. Intelligent Systems, IEEE, 2015.
S. Saria et al. Discovering deformable motifs in continuous time series data. In IJCAI, volume 22, 2011.
P. Schulam and S. Saria. A framework for individualizing predictions of disease trajectories by exploiting multi-resolution structure. In Advances in Neural Information Processing Systems, pages 748-756, 2015.
P. Schulam, F. Wigley, and S. Saria. Clustering longitudinal clinical marker trajectories from electronic health data: Applications to phenotyping and endotype discovery. In AAAI, pages 2956-2964, 2015.
E. Snelson and Z. Ghahramani. Sparse Gaussian processes using pseudo-inputs. In NIPS, 2005.
M. E.
Tipping and C. M. Bishop. Probabilistic principal component analysis. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 61(3):611-622, 1999.
M.K. Titsias. Variational learning of inducing variables in sparse Gaussian processes. In AISTATS, 2009.
M.K. Titsias and N.D. Lawrence. Bayesian Gaussian process latent variable model. In AISTATS, 2010.
B. Varadarajan et al. Unsupervised learning of acoustic sub-word units. In Proc. ACL, pages 165-168, 2008.
J. Varga et al. Scleroderma: From pathogenesis to comprehensive management. Springer, 2012.
G. Verbeke and G. Molenberghs. Linear mixed models for longitudinal data. Springer, 2009.
S. Watanabe. Karhunen-Loève expansion and factor analysis, theoretical remarks and applications. In Proc. 4th Prague Conf. Inform. Theory, 1965.
L.D. Wiggins et al. Support for a dimensional view of autism spectrum disorders in toddlers. Journal of Autism and Developmental Disorders, 42(2):191-200, 2012.