{"title": "Patient Risk Stratification for Hospital-Associated C. diff as a Time-Series Classification Task", "book": "Advances in Neural Information Processing Systems", "page_first": 467, "page_last": 475, "abstract": "A patient's risk for adverse events is affected by temporal processes including the nature and timing of diagnostic and therapeutic activities, and the overall evolution of the patient's pathophysiology over time. Yet many investigators ignore this temporal aspect when modeling patient risk, considering only the patient's current or aggregate state. We explore representing patient risk as a time series. In doing so, patient risk stratification becomes a time-series classification task. The task differs from most applications of time-series analysis, like speech processing, since the time series itself must first be extracted. Thus, we begin by defining and extracting approximate \\textit{risk processes}, the evolving approximate daily risk of a patient. Once obtained, we use these signals to explore different approaches to time-series classification with the goal of identifying high-risk patterns. We apply the classification to the specific task of identifying patients at risk of testing positive for hospital acquired colonization with \\textit{Clostridium Difficile}. We achieve an area under the receiver operating characteristic curve of 0.79 on a held-out set of several hundred patients. Our two-stage approach to risk stratification outperforms classifiers that consider only a patient's current state (p$<$0.05).", "full_text": "Patient Risk Strati\ufb01cation for Hospital-Associated\n\nC. diff as a Time-Series Classi\ufb01cation Task\n\nJenna Wiens\n\njwiens@mit.edu\n\nJohn V. 
Guttag

guttag@mit.edu

Eric Horvitz

horvitz@microsoft.com

Abstract

A patient's risk for adverse events is affected by temporal processes including the nature and timing of diagnostic and therapeutic activities, and the overall evolution of the patient's pathophysiology over time. Yet many investigators ignore this temporal aspect when modeling patient outcomes, considering only the patient's current or aggregate state. In this paper, we represent patient risk as a time series. In doing so, patient risk stratification becomes a time-series classification task. The task differs from most applications of time-series analysis, like speech processing, since the time series itself must first be extracted. Thus, we begin by defining and extracting approximate risk processes, the evolving approximate daily risk of a patient. Once obtained, we use these signals to explore different approaches to time-series classification with the goal of identifying high-risk patterns. We apply the classification to the specific task of identifying patients at risk of testing positive for hospital-acquired Clostridium difficile. We achieve an area under the receiver operating characteristic curve of 0.79 on a held-out set of several hundred patients. Our two-stage approach to risk stratification outperforms classifiers that consider only a patient's current state (p<0.05).

1 Introduction

Time-series data are available in many different fields, including medicine, finance, information retrieval, and weather prediction. Much research has been devoted to the analysis and classification of such signals [1][2]. In recent years, researchers have had great success with identifying temporal patterns in such time series and with methods that forecast the value of variables.
In most applications there is an explicit time series, e.g., ECG signals, stock prices, audio recordings, or daily average temperatures.

We consider a novel application of time-series analysis: patient risk. Patient risk has an inherent temporal aspect; it evolves over time as it is influenced by intrinsic and extrinsic factors. However, it has no easily measurable time series. We hypothesize that, if one could measure risk over time, one could learn which patterns of risk are more likely to lead to adverse outcomes. In this work, we frame the problem of identifying hospitalized patients at high risk of adverse outcomes as a time-series classification task. We propose and motivate the study of patient risk processes to model the evolution of risk over the course of a hospital admission.

Specifically, we consider the problem of using time-series data to estimate the risk of an inpatient becoming colonized with Clostridium difficile (C. diff) during a hospital stay. (C. diff is a bacterium most often acquired in hospitals or nursing homes. Infection causes severe diarrhea and can lead to colitis and other serious complications.) Despite the fact that many of the risk factors are well known (e.g., exposure, age, underlying disease, use of antimicrobial agents) [3], C. diff continues to be a significant problem in many US hospitals. From 1996 to 2009, C. diff rates for hospitalized patients aged ≥ 65 years increased by 200% [4].

There are well-established clinical guidelines for predicting whether a test for C. diff is likely to be positive [5]. Such guidelines are based largely on the presence of symptoms associated with an existing C. diff infection, and thus are not useful for predicting whether a patient will become infected. In contrast, risk stratification models aim to identify patients at high risk of becoming infected.
The use of these models could lead to a better understanding of the risk factors involved and ultimately provide information about how to reduce the incidence of C. diff in hospitals.

There are many different ways to define the problem of estimating risk. The precise definition has important ramifications for both the potential utility of the estimate and the difficulty of the problem. Reported results in the medical literature for the problem of risk stratification for C. diff vary greatly, with areas under the receiver operating characteristic curve (AUC) of 0.628-0.896 [6][7][8][9][10]. The variation in classification performance is based in part on differences in the task definition, in part on differences in the study populations, and in part on the evaluation method. The highest reported AUCs were from studies of small populations (e.g., 50 patients), relatively easy tasks (e.g., inclusion of a large number of patients with predictably short stays, such as patients in labor), or both. Additionally, some of the reported results were not obtained from testing on held-out sets.

We consider patients with at least a 7-day hospital admission who do not test positive for C. diff until day 7 or later. This group of patients is already at an elevated risk of acquiring C. diff because of the duration of the hospital stay. Focusing on this group makes the problem more relevant (and more difficult) than other related tasks.

To the best of our knowledge, representing and studying the risk of acquiring C. diff (or any other infection) as a time series has not previously been explored. We propose a risk stratification method that aims to identify patterns of risk that are more likely to lead to adverse outcomes. In [11] we proposed a method for extracting patient risk processes. Once patient risk processes are extracted, the problem of risk stratification becomes that of time-series classification.
We explore a variety of different methods, including classification using similarity metrics, feature extraction, and hidden Markov models. A direct comparison with the reported results in the literature for C. diff risk prediction is difficult because of the differences among the studies mentioned above. Thus, to measure the added value of considering the temporal dimension, we implemented the standard approach represented in the related literature, classifying patients based on their current or average state, and applied it to our data set. Our method leads to a significant improvement over this more traditional approach.

2 The Data

Our dataset comes from a large US hospital database. We extracted all stays of ≥ 7 days from all inpatient admissions that occurred over the course of a year.

To ensure that we are in fact predicting the acquisition of C. diff during the current admission, we remove patients who tested positive for C. diff in the 60 days preceding or, if negative, following the current admission [3]. In addition, we remove patients who tested positive for C. diff before day 7 of the admission. Positive cases are those patients who test positive on or after 7 days in the hospital. Negative patients are all remaining patients.

We define the start of the risk period of a patient as the time of admission and define the end of the risk period according to the following rule: if the patient tests positive, the first positive test marks the end of the risk period; otherwise the patient is considered at risk until discharge. The final population consisted of 9,751 hospital admissions and 8,166 unique patients. Within this population, 177 admissions had a positive test result for C. diff.

3 Methods

Patient risk is not a directly measurable time series. Thus, we propose a two-stage approach to risk stratification.
We first extract approximate risk processes and then apply time-series classification techniques to those processes. Both stages are described here; for more detail regarding the first stage we direct the reader to [11].

3.1 Extracting Patient Risk Processes

We extract approximate patient risk processes, i.e., a risk time series for each admission, by independently calculating the daily risk of a patient and then concatenating these predictions. We begin by extracting more than 10,000 variables for each day of each hospital admission. Almost all of the features are categorical features that have been exploded into binary features; hence the high dimensionality. Approximately half of the features are based on data collected at the time of admission, e.g., patient history, admission reason, and patient demographics. These features remain constant throughout the stay. The remaining features are collected over the course of the admission and may change on a daily basis, e.g., lab results, room location, medications, and vital sign measurements.

We employ a support vector machine (SVM) to produce daily risk scores. Each day of an admission is associated with its own feature vector. We refer to this feature vector of observations as the patient's current state. However, we do not have ground-truth labels for each day of a patient's admission. We only know whether or not a patient eventually tests positive for C. diff. Thus we label each day of an admission in which the patient eventually tests positive as positive, even though the patient may not have actually been at high risk on each of those days. In doing so, we hope to identify high-risk patients as early as possible. Since we do not expect a patient's risk to remain constant during an entire admission, there is noise in the training labels.
For example, there may be some days that look almost identical in the feature space but have different labels. To handle this noise we use a soft-margin SVM, which allows for misclassifications. As long as our labeling assumption does not introduce more incorrect labels than correct labels, it is possible to learn a meaningful classifier despite the approximate labels. We do not use the SVM as a classifier but instead consider the continuous prediction made by the SVM, i.e., the signed distance to the decision boundary. We take the concatenated continuous outputs of the SVM for a hospital admission as a representation of the approximate risk process. We give some examples of these approximate risk processes for both case and non-case patients in Figure 1.

Figure 1: Approximate daily risk represented as a time series results in a risk process for each patient.

One could risk stratify patients based solely on their current state, i.e., use the daily risk value from the risk process to classify patients as either high risk or low risk on that day. This method, which ignores the temporal evolution of risk, achieves an AUC of 0.69 (95% CI 0.61-0.77). Intuitively, current risk should depend on previous risk. We tested this intuition by classifying patients based on the average of their risk process. This performed significantly better, achieving an AUC of 0.75 (95% CI 0.69-0.81). Still, averaging in this way ignores the possibility of leveraging richer temporal patterns, as discussed in the next section.

3.2 Classifying Patient Risk Processes

Given the risk processes of each patient, the risk stratification task becomes a time-series classification task. Time-series classification is a well-investigated area of research, with many proposed methods. For an in-depth review of sequence classification we refer the reader to [2].
Here, we explore three different approaches to the problem: classification based on feature vectors, similarity measures, and finally HMMs. We first describe each method, and then present results on their performance in Section 4.

3.2.1 Classification using Feature Extraction

There are many different ways to extract features from time series. In the literature many have proposed time-frequency representations extracted using various Fourier or wavelet transforms [12]. Given the small number of samples composing our time-series data, we were wary of applying such techniques. Instead we chose an approach inspired by the combination of classifiers in the text domain using reliability indicators [13]. We define a feature vector based on different combinations of the predictions made in the first stage.
We list the features in Table 1.

Table 1: Univariate summary statistics for observation vector x = [x_1, x_2, ..., x_n]

1.  length of time series:                            n
2.  average daily risk:                               (1/n) Σ_{i=1}^{n} x_i
3.  linear weighted average daily risk:               (2/(n(n+1))) Σ_{i=1}^{n} i·x_i
4.  quadratic weighted average daily risk:            (6/(n(n+1)(2n+1))) Σ_{i=1}^{n} i²·x_i
5.  risk on most recently observed day:               x_n
6.  standard deviation of daily risk:                 σ
7.  average absolute change in daily risk:            (1/n) Σ_{i=1}^{n-1} |x_i − x_{i+1}|
8.  average absolute change in 1st difference:        (1/n) Σ_{i=1}^{n-2} |x'_i − x'_{i+1}|
9.  fraction of the visit with positive risk score:   (1/n) Σ_{i=1}^{n} 1{x_i > 0}
10. fraction of the visit with negative risk score:   (1/n) Σ_{i=1}^{n} 1{x_i < 0}
11. sum of the risk over the most recent 3 days:      Σ_{i=n−2}^{n} x_i
12. longest positive run (normalized)
13. longest negative run (normalized)
14. maximum observation:                              max_i x_i
15. location of maximum (normalized):                 (1/n) argmax_i x_i
16. minimum observation:                              min_i x_i
17. location of minimum (normalized):                 (1/n) argmin_i x_i

Features 2-4 are averages; Features 3 and 4 weight days closer to the time of classification more heavily. Features 6-10 are different measures of the amount of fluctuation in the time series. Features 5 and 11 capture information about the most recent states of the patient. Features 12 and 13 identify runs in the data, i.e., periods of time during which the patient is consistently at high or low risk. Finally, Features 14-17 summarize information regarding the global maximum and minimum of the approximate risk process.

Given these feature definitions, we map each patient admission risk process to a fixed-length feature vector.
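The mapping from a variable-length risk process to the Table 1 feature vector can be sketched as follows. This is an illustrative reimplementation, not the authors' code; it follows the 1/n normalization shown for Features 7-8.

```python
import numpy as np

def longest_run(mask):
    """Length of the longest run of True values in a boolean array."""
    best = cur = 0
    for m in mask:
        cur = cur + 1 if m else 0
        best = max(best, cur)
    return best

def risk_process_features(x):
    """Map a variable-length risk process x to the 17 fixed-length
    summary features of Table 1 (illustrative reimplementation)."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    i = np.arange(1, n + 1)
    dx = np.abs(np.diff(x))            # |x_i - x_{i+1}|
    d2x = np.abs(np.diff(np.diff(x)))  # |x'_i - x'_{i+1}|
    return np.array([
        n,                                            # 1  length of time series
        x.mean(),                                     # 2  average daily risk
        (i * x).sum() / (n * (n + 1) / 2),            # 3  linear weighted average
        (i**2 * x).sum() / (n*(n+1)*(2*n+1)/6),       # 4  quadratic weighted average
        x[-1],                                        # 5  most recent day's risk
        x.std(),                                      # 6  standard deviation
        dx.sum() / n if n > 1 else 0.0,               # 7  avg abs change (1/n, as in Table 1)
        d2x.sum() / n if n > 2 else 0.0,              # 8  avg abs change in 1st difference
        (x > 0).mean(),                               # 9  fraction of visit with positive risk
        (x < 0).mean(),                               # 10 fraction of visit with negative risk
        x[-3:].sum(),                                 # 11 sum of risk over last 3 days
        longest_run(x > 0) / n,                       # 12 longest positive run (normalized)
        longest_run(x < 0) / n,                       # 13 longest negative run (normalized)
        x.max(),                                      # 14 maximum observation
        (np.argmax(x) + 1) / n,                       # 15 location of maximum (normalized)
        x.min(),                                      # 16 minimum observation
        (np.argmin(x) + 1) / n,                       # 17 location of minimum (normalized)
    ])
```

Each admission's risk process, whatever its length, is thereby reduced to a 17-dimensional vector suitable for a standard linear SVM.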
These summarization variables allow one to compare time series of different lengths, while still capturing temporal information, e.g., when the maximum risk occurs relative to the time of prediction. Given this feature space, one can learn a classifier to identify high-risk patients. This approach is described in Figure 2.

Figure 2: A two-step approach to risk stratification where predefined features are extracted from the time-series data. (1) Given an m×n admission P, where m is the number of observations for each day and n is the number of days, predict the daily risk x_i based on the observations p_i, for i = 1...n. (2) Concatenate the predictions and extract a feature vector x' based on the time series x. (3) Classify each admission based on x'; predict whether or not P will test positive for C. diff.

3.2.2 Classification using Similarity Metrics

In the previous section, we learned a second classifier based on extracted features. In this section, we consider classifiers based on the raw data, i.e., the concatenated time series from Step 2 in Figure 2. SVMs classify examples based on a kernel or similarity measure. One of the most common non-linear kernels is the Gaussian radial basis function kernel: k(x_i, x_j) = exp(−γ‖x_i − x_j‖²). Its output depends on the Euclidean distance between examples x_i and x_j. This distance measure requires vectors of the same length. We consider two approaches to generating vectors of the same length: (1) linear interpolation and (2) truncation. In the first approach we linearly interpolate between points. In the second approach we consider only the most recent 5 days of data, x_{n−4}, x_{n−3}, ..., x_n.

Euclidean distance is a one-to-one comparison. In contrast, the dynamic time warping (DTW) distance is a one-to-many comparison [14].
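The two length-normalization schemes and the Euclidean RBF kernel just described can be sketched as follows (illustrative; γ is a free parameter chosen by validation):

```python
import numpy as np

def truncate_last_k(x, k=5):
    """Keep only the most recent k days of the risk process."""
    return np.asarray(x, dtype=float)[-k:]

def interpolate_to_length(x, m):
    """Linearly interpolate a risk process onto m evenly spaced points."""
    x = np.asarray(x, dtype=float)
    old = np.linspace(0.0, 1.0, num=len(x))
    new = np.linspace(0.0, 1.0, num=m)
    return np.interp(new, old, x)

def rbf_kernel(xi, xj, gamma=1.0):
    """Gaussian RBF kernel on equal-length vectors."""
    xi, xj = np.asarray(xi, float), np.asarray(xj, float)
    return np.exp(-gamma * np.sum((xi - xj) ** 2))
```

Either normalization makes every pair of admissions comparable under the Euclidean distance inside the kernel.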
DTW computes the distance between two time series by finding the minimal cost alignment. Here, the cost is the absolute distance between aligned points. We linearly interpolate all time series to have the same length, the length of the longest admission within the dataset (54 days). To ensure that the warping path does not contain lengthy vertical and horizontal segments, we constrain the warping window (how far the warping path can stray from the diagonal) using the Sakoe-Chiba band with a width of 10% of the length of the time series [15]. We learn an SVM classifier based on this distance measure by replacing the Euclidean distance in the RBF kernel with the DTW distance, k(x_i, x_j) = exp(−γ·DTW(x_i, x_j)), as in [16].

3.2.3 Classification using Hidden Markov Models

We can make observations about a patient on a daily basis, but we cannot directly measure whether or not a patient is at high risk. Hence, we use the phrase approximate risk process. By applying HMMs we assume there is a sequence of hidden states, x_1, x_2, ..., x_n, that governs the observations y_1, y_2, ..., y_n. Here, the observations are the predictions made by the SVM. We consider a two-state HMM where each state, s_1 and s_2, is associated with a mixture of Gaussian distributions over possible observations. At an intuitive level, one can think of these states as representing low and high risk. Using the data, we learn and apply HMMs in two different ways.

Classification via Likelihood

We hypothesize that there may exist patterns of risk over time that are more likely to lead to a positive test result. To test this hypothesis, we first consider the classic approach to classification using HMMs, described in Section VI-B of [17]. We learn two separate HMMs: one using only observation sequences from positive patients and another using only observation sequences from negative patients.
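The Sakoe-Chiba-banded DTW distance and DTW-RBF kernel of Section 3.2.2 can be sketched as follows (an illustrative implementation; the band half-width defaults to the 10% used above, and series are assumed pre-interpolated to equal length):

```python
import numpy as np

def dtw_distance(a, b, band_frac=0.1):
    """DTW distance with absolute-difference cost and a Sakoe-Chiba band
    whose half-width is band_frac of the series length."""
    n, m = len(a), len(b)
    # band must be at least |n - m| wide for a path to exist
    w = max(int(band_frac * max(n, m)), abs(n - m))
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(max(1, i - w), min(m, i + w) + 1):
            cost = abs(a[i - 1] - b[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]

def dtw_rbf_kernel(a, b, gamma=1.0):
    """Gaussian-style kernel with DTW substituted for Euclidean distance."""
    return np.exp(-gamma * dtw_distance(a, b))
```

Note that exp(−γ·DTW) is not guaranteed to be positive semi-definite, a known caveat of DTW kernels; in practice SVM training with such kernels often still works well, as in [16].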
We initialize the emission probabilities differently for each model based on the data, but initialize the transition probabilities uniformly. Given a test observation sequence, we apply both models and calculate the log-likelihood of the data under each model using the forward-backward algorithm. We classify patients continuously, based on the ratio of the log-likelihoods.

Classification via Posterior State Probabilities

As we saw in Figure 1, the SVM output for a patient may fluctuate greatly from day to day. While large fluctuations in risk are not impossible, they are not common. Recall that in our initial calculation, while the variables from the time of admission are included in the prediction, the previous day's risk is not. The predictions produced by the SVM are independent. HMMs allow us to model the observations as a sequence and induce a temporal dependence in the model: the current state, x_t, depends on the previous state, x_{t−1}.

We learn an HMM on a training set. We consider a two-state model in which we initialize the emission probabilities as p(y_t | x_t = s_1) = N(μ_{s1}, 1) and p(y_t | x_t = s_2) = N(μ_{s2}, 1) for all t, where μ_{s1} = −1 and μ_{s2} = 1. Based on this initialization, s_1 and s_2 correspond to “low-risk” and “high-risk” states, as mentioned above. A key decision was to use a left-to-right model in which, once a patient reaches the “high-risk” state, they remain there. All remaining transition probabilities were initialized uniformly. Applied to a test example, we compute the posterior probabilities p(x_t | y_1, ..., y_n) for t = 1...n using the forward-backward algorithm. Because of the left-to-right assumption, if enough high-risk observations are made, a transition to the high-risk state is triggered.
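The posterior computation just described can be sketched as a scaled forward-backward pass for a two-state left-to-right HMM. This is illustrative, not the authors' code: the low-to-high transition probability a12 and the fixed unit-variance single Gaussians are assumptions (the paper initializes parameters this way and then learns them from data).

```python
import numpy as np

def gaussian_pdf(y, mu, sigma=1.0):
    return np.exp(-0.5 * ((y - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

def posterior_high_risk(y, mu=(-1.0, 1.0), pi=(0.5, 0.5), a12=0.1):
    """Posterior p(x_t = s2 | y_1..y_n) under a two-state left-to-right HMM
    with unit-variance Gaussian emissions. a12 is the low-to-high transition
    probability; the high-risk state s2 is absorbing."""
    A = np.array([[1 - a12, a12],
                  [0.0,     1.0]])   # left-to-right: s2 is absorbing
    y = np.asarray(y, dtype=float)
    n = len(y)
    B = np.stack([gaussian_pdf(y, mu[0]), gaussian_pdf(y, mu[1])], axis=1)
    # forward pass, rescaled each step for numerical stability
    alpha = np.zeros((n, 2))
    alpha[0] = np.asarray(pi) * B[0]
    alpha[0] /= alpha[0].sum()
    for t in range(1, n):
        alpha[t] = (alpha[t - 1] @ A) * B[t]
        alpha[t] /= alpha[t].sum()
    # backward pass, rescaled each step
    beta = np.ones((n, 2))
    for t in range(n - 2, -1, -1):
        beta[t] = A @ (B[t + 1] * beta[t + 1])
        beta[t] /= beta[t].sum()
    gamma = alpha * beta
    gamma /= gamma.sum(axis=1, keepdims=True)
    return gamma[:, 1]   # probability of the high-risk state on each day
```

Because s2 is absorbing, the posterior probability of being high-risk is non-decreasing over the admission, which matches the smoothing behavior seen in Figure 3.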
Figure 3 shows two examples of risk processes and their associated posterior state probabilities p(x_t = s_2 | y_1, ..., y_n) for t = 1...n.

Figure 3: Given all of the observations y_1, ..., y_n (in blue), we compute the posterior probability of being in a high-risk state for each day (in red). (a) Patient is discharged on day 40. (b) Patient tests positive for C. diff on day 24.

We classify each patient according to the probability of being in a high-risk state on the most recent day, i.e., p(x_n = s_2 | y_1, ..., y_n).

4 Experiments & Results

This section describes a set of experiments used to compare several methods for predicting a patient's risk of acquiring C. diff during the current hospital admission. We start by describing the experimental setup, which is maintained across all experiments, and later present the results.

4.1 Experimental Setup

In order to reduce the possibility of confusing the risk of becoming colonized with C. diff with the existence of a current infection, for patients from the positive class we consider only data collected up to two days before the positive test result. This reduces the possibility of learning a classifier based on symptoms or treatment (a problem with some earlier studies).

For patients who never test positive, researchers typically use the discharge day as the index event [3]. However, this can lead to deceptively good results because patients nearing discharge are typically healthier than patients not nearing discharge. To avoid this problem, we define the index event for negative examples as either the halfway point of the admission or 5 days into the admission, whichever is greater.
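The index-event rule for negative examples can be sketched as follows (a sketch; whether the halfway point rounds up or down is unstated in the text, so rounding up is our assumption):

```python
import math

def negative_index_day(length_of_stay):
    """Index event for a negative example: the later of the admission's
    halfway point (rounded up, an assumption) and day 5."""
    return max(math.ceil(length_of_stay / 2), 5)
```

For example, a negative patient with a 20-day stay is evaluated at day 10, while one with a 7-day stay is evaluated at day 5.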
We consider a minimum of 5 days for a negative patient since 5 days is the minimum amount of data we have for any positive patient (e.g., a patient who tests positive on day 7).

To handle class imbalance, we randomly subsample the negative class, selecting 10 negative examples for each positive example. When training the SVM we employ asymmetric cost parameters as in [18]. Additionally, we remove outliers: those patients with admissions longer than 60 days. Next, we randomly split the data into stratified training and test sets with a 70/30 split. The training set consisted of 1,251 admissions (127 positive), while the test set was composed of 532 admissions (50 positive). This split was maintained across all experiments. In all of the experiments, the training data was used for training and for validation of parameter selection, and the test set was used for evaluation. For training and classification, we employed SVMlight [19] and Kevin Murphy's HMM Toolbox [20].

4.2 Results

Table 2 compares the performance of eight different classifiers applied to the held-out test data. The first classifier is our baseline approach, described in Section 3.1; it classifies patients based solely on their current state. The second classifier, RP+Average, is an initial improvement on this approach, and classifies patients based on the average value of their risk process. The remaining classifiers are all based on time-series classification methods. RP+Similarity_Euc.5days classifies patients using a non-linear SVM based on the Euclidean distance between the most recent 5 days. RP+Similarity_Euc.interp. uses the entire risk process by interpolating between points. These two methods, in addition to DTW, are described in Section 3.2.2. The difference between RP+HMM_likelihood and RP+HMM_posterior is described in Section 3.2.3. RP+Features classifies patients based on a linear combination of the average and other summary statistics (described in Section 3.2.1) of the risk process. For all of the performance measures we compute 95% pointwise confidence intervals by bootstrapping (sampling with replacement) the held-out test set.

Table 2: Predicting a positive test result two days in advance using different classifiers. Current State represents the traditional approach to risk stratification, and is the only classifier that is not based on patient Risk Processes (RP).

Approach                    AUC   95% CI      F-Score  95% CI
Current State               0.69  0.61-0.77   0.28     0.19-0.38
RP+Average                  0.75  0.69-0.81   0.32     0.21-0.41
RP+Similarity_Euc.5days     0.73  0.67-0.80   0.27     0.18-0.37
RP+HMM_likelihood           0.74  0.68-0.81   0.30     0.20-0.38
RP+Similarity_Euc.interp.   0.75  0.69-0.82   0.31     0.22-0.41
RP+Similarity_DTW           0.76  0.69-0.82   0.31     0.22-0.41
RP+HMM_posterior            0.76  0.70-0.82   0.30     0.21-0.41
RP+Features                 0.79  0.73-0.85   0.37     0.24-0.49

Figure 4: Results of predicting a patient's risk of testing positive for C. diff in the held-out test set using RP+Features.

Figure 5: Feature weights from SVMs learned using different folds of the training set. The definition of the features is given in Table 1.

Figure 4 gives the ROC curve for the best method, RP+Features. The AUC is calculated by sweeping the decision threshold. RP+Features performed as well as or better than the Current State and RP+Average approaches at every point along the curve, thereby dominating both traditional approaches. Compared to the other classifiers, the classifier based on RP+Features dominates on both AUC and F-Score.
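The bootstrap confidence intervals for the AUC can be computed along these lines. This is a sketch, not the authors' evaluation code: the rank-based AUC statistic is standard, but the number of resamples and the percentile method are our assumptions.

```python
import numpy as np

def auc(scores, labels):
    """Area under the ROC curve via the rank-sum (Mann-Whitney) identity:
    the fraction of (positive, negative) pairs ranked correctly, ties 1/2."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=int)
    pos = scores[labels == 1]
    neg = scores[labels == 0]
    diff = pos[:, None] - neg[None, :]
    return ((diff > 0).sum() + 0.5 * (diff == 0).sum()) / (len(pos) * len(neg))

def bootstrap_auc_ci(scores, labels, n_boot=1000, alpha=0.05, seed=0):
    """Pointwise percentile CI by resampling the held-out set with replacement."""
    rng = np.random.default_rng(seed)
    scores = np.asarray(scores)
    labels = np.asarray(labels)
    n = len(labels)
    stats = []
    while len(stats) < n_boot:
        idx = rng.integers(0, n, size=n)
        if labels[idx].min() == labels[idx].max():
            continue                      # resample must contain both classes
        stats.append(auc(scores[idx], labels[idx]))
    return np.percentile(stats, [100 * alpha / 2, 100 * (1 - alpha / 2)])
```

The same resampling loop applies to the F-Score by swapping in that statistic.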
This classifier is based on a linear combination of statistics (listed in Table 1) computed from the patient risk processes. We learned the feature weights using the training data. To get a sense of the importance of each feature, we used repeated sub-sampling validation on the training set. We randomly subsampled 70% of the training data 100 times and learned 100 different SVMs; this resulted in 100 different sets of feature weights. The results of this experiment are shown in Figure 5. The most important features are the length of the time series (Feature 1), the fraction of the time for which the patient is at positive risk (Feature 9), and the maximum risk attained (Feature 14). The only two features with significantly negative weights are Feature 10 and Feature 13: the overall fraction of time a patient has a negative risk, and the longest consecutive period of time that a patient has negative risk.

It is difficult to interpret the performance of a classifier based on these results alone, especially since the classes are imbalanced. Figure 6 gives the confusion matrix for the mean performance of the best classifier, RP+Features. To further convey the ability of the classifier to risk stratify patients, we split the test patients into quintiles (as is often done in clinical studies) based on the continuous output of the classifier. Each quintile contains approximately 106 patients. For each quintile we calculated the probability of a positive test result, based on those patients who eventually test positive for C. diff. Figure 7 shows that the probability increases with each quintile.
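The quintile analysis can be sketched as follows (illustrative; exact tie-breaking and bin sizes are handled by simple equal splitting):

```python
import numpy as np

def quintile_positive_rates(scores, labels):
    """Split patients into quintiles of the classifier's continuous output
    and return the fraction testing positive in each, lowest to highest."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=int)
    order = np.argsort(scores)            # ascending risk score
    folds = np.array_split(order, 5)      # five roughly equal-sized bins
    return [labels[f].mean() for f in folds]
```

The ratio of the 5th-quintile rate to the 1st-quintile rate gives the fold-increase in risk reported below.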
The difference between the 1st and 5th quintiles is striking; relative to the 1st quintile, patients in the 5th quintile are at more than a 25-fold greater risk.

Figure 6: Confusion matrix. Using the best approach, RP+Features, we achieve a sensitivity of 50% and a specificity of 85% on the held-out data.

                    Predicted positive   Predicted negative
Actual positive     TP: 26               FN: 24
Actual negative     FP: 72               TN: 410

Figure 7: Test patients with RP+Features predictions in the 5th quintile are more than 25 times more likely to test positive for C. diff than those in the 1st quintile.

5 Discussion & Conclusion

To the best of our knowledge, we are the first to consider the risk of acquiring an infection as a time series. We use a two-stage process, first extracting approximate risk processes and then using the risk process as an input to a classifier. We explore three different approaches to classification: similarity metrics, feature vectors, and hidden Markov models. The majority of the methods based on time-series classification performed as well as, if not better than, the previous approach of classifying patients simply based on the average of their risk process. The differences were not statistically significant, perhaps because of the small number of positive examples in the held-out set. Still, we are encouraged by these results, which suggest that posing the risk stratification problem as a time-series classification task can provide more accurate models.

There is large overlap in the confidence intervals for many of the results reported in Table 2, in part because of the paucity of positive examples. Still, based on mean performance, all classifiers that incorporate patient risk processes outperform the Current State classifier, and the majority of those classifiers perform as well as or better than RP+Average.
Only two classifiers did not perform better than the latter: RP+Similarity_Euc.5days and RP+HMM_likelihood. RP+Similarity_Euc.5days classifies patients based on a similarity metric using only the most recent 5 days of the patient risk processes. Its relatively poor performance suggests that a patient's risk may depend on the entire risk process. The reasons for the relatively poor performance of the RP+HMM_likelihood approach are less clear. Initially, we thought that perhaps two states were insufficient, but experiments with larger numbers of states led to overfitting on the training data. It may well be that the Markovian assumption is problematic in this context. We plan to investigate other graphical models, e.g., conditional random fields, going forward.

The F-Scores reported in Table 2 are lower than those often seen in the machine-learning literature. However, when predicting outcomes in medicine, the problems are often so hard, the data so noisy, and the class imbalance so great that one cannot expect to achieve the kind of classification performance typically reported in the machine-learning literature. For this reason, the medical literature on risk stratification typically focuses on a combination of the AUC and the kind of odds ratios derivable from the data in Figure 7. As observed in the introduction, a direct comparison with the AUCs achieved by others is not possible because of differences in the datasets, the inclusion criteria, and the details of the task. We have yet to thoroughly investigate the clinical ramifications of this work. However, for the daunting task of risk stratifying patients already at an elevated risk for C. diff, an AUC of 0.79 and an odds ratio of >25 are quite good.

References

[1] M. M. Gaber, A. Zaslavsky, and S. Krishnaswamy. Mining data streams: A review.
SIGMOD Record, 34(2), June 2005.

[2] Z. Xing, J. Pei, and E. Keogh. A brief survey on sequence classification. ACM SIGKDD Explorations, 12(1):40–48, June 2010.

[3] E. R. Dubberke, K. A. Reske, Y. Yan, M. A. Olsen, L. C. McDonald, and V. J. Fraser. Clostridium difficile-associated disease in a setting of endemicity: Identification of novel risk factors. Clinical Infectious Diseases, 45:1543–9, December 2007.

[4] CDC. Rates for Clostridium difficile infection among hospitalized patients. Centers for Disease Control and Prevention Morbidity and Mortality Weekly Report, 60(34):1171, 2011.

[5] D. A. Katz, M. E. Lynch, and B. Littenberg. Clinical prediction rules to optimize cytotoxin testing for Clostridium difficile in hospitalized patients with diarrhea. American Journal of Medicine, 100(5):487–95, 1996.

[6] J. Tanner, D. Khan, D. Anthony, and J. Paton. Waterlow score to predict patients at risk of developing Clostridium difficile-associated disease. Journal of Hospital Infection, 71(3):239–244, 2009.

[7] E. R. Dubberke, Y. Yan, K. A. Reske, A. M. Butler, J. Doherty, V. Pham, and V. J. Fraser. Development and validation of a Clostridium difficile infection risk prediction model. Infection Control and Hospital Epidemiology, 32(4):360–366, 2011.

[8] K. W. Garey, T. K. Dao-Tran, Z. D. Jiang, M. P. Price, L. O. Gentry, and H. L. DuPont. A clinical risk index for Clostridium difficile infection in hospitalized patients receiving broad-spectrum antibiotics. Journal of Hospital Infection, 70(2):142–147, 2008.

[9] G. Krapohl. Preventing health care-associated infection: Development of a clinical prediction rule for Clostridium difficile infection. PhD thesis, 2011.

[10] N. Peled, S. Pitlik, Z. Samra, A. Kazakov, Y. Bloch, and J. Bishara. Predicting Clostridium difficile toxin in hospitalized patients with antibiotic-associated diarrhea.
Infection Control and Hospital Epidemiology, 28(4):377–81, 2007.

[11] J. Wiens, E. Horvitz, and J. Guttag. Learning evolving patient risk processes for C. diff colonization. In ICML Workshop on Machine Learning from Clinical Data, 2012.

[12] T. W. Liao. Clustering of time series data - a survey. Pattern Recognition, 38(11), 2005.

[13] P. Bennett, S. Dumais, and E. Horvitz. The combination of text classifiers using reliability indicators. Information Retrieval, 8(1):67–100, 2005.

[14] H. Sakoe and S. Chiba. Dynamic programming algorithm optimization for spoken word recognition. IEEE Transactions on Acoustics, Speech, and Signal Processing, 26(1):43–49, 1978.

[15] C. Ratanamahatana and E. Keogh. Three myths about dynamic time warping data mining. In Proceedings of the Fifth SIAM International Conference on Data Mining, 2005.

[16] C. Bahlmann, B. Haasdonk, and H. Burkhardt. On-line handwriting recognition with support vector machines - a kernel approach. In Proceedings of the 8th International Workshop on Frontiers in Handwriting Recognition, 2002.

[17] L. R. Rabiner. A tutorial on hidden Markov models and selected applications in speech recognition. Proceedings of the IEEE, 77(2), February 1989.

[18] K. Morik, P. Brockhausen, and T. Joachims. Combining statistical learning with a knowledge-based approach - a case study in intensive care monitoring. In Proceedings of the 16th International Conference on Machine Learning, 1999.

[19] T. Joachims. Making large-scale SVM learning practical. In Advances in Kernel Methods - Support Vector Learning, 1999.

[20] K. Murphy.
Hidden Markov Model (HMM) Toolbox for Matlab. www.cs.ubc.ca/~murphyk/Software/HMM/hmm.html.