{"title": "Multitask Boosting for Survival Analysis with Competing Risks", "book": "Advances in Neural Information Processing Systems", "page_first": 1390, "page_last": 1399, "abstract": "The co-occurrence of multiple diseases among the general population is an important problem as those patients have more risk of complications and represent a large share of health care expenditure. Learning to predict time-to-event probabilities for these patients is a challenging problem because the risks of events are correlated (there are competing risks) with often only few patients experiencing individual events of interest, and of those only a fraction are actually observed in the data. We introduce in this paper a survival model with the flexibility to leverage a common representation of related events that is designed to correct for the strong imbalance in observed outcomes. The procedure is sequential: outcome-specific survival distributions form the components of nonparametric multivariate estimators which we combine into an ensemble in such a way as to ensure accurate predictions on all outcome types simultaneously. Our algorithm is general and represents the first boosting-like method for time-to-event data with multiple outcomes. We demonstrate the performance of our algorithm on synthetic and real data.", "full_text": "Multitask Boosting for Survival Analysis with\n\nCompeting Risks\n\nAlexis Bellot\n\nUniversity of Oxford\n\nOxford, United Kingdom\n\nalexis.bellot@eng.ox.ac.uk\n\nMihaela van der Schaar\n\nUniversity of Oxford and The Alan Turing Institute\n\nLondon, United Kingdom\nmschaar@turing.ac.uk\n\nAbstract\n\nThe co-occurrence of multiple diseases among the general population is an impor-\ntant problem as those patients have more risk of complications and represent a large\nshare of health care expenditure. 
Learning to predict time-to-event probabilities for these patients is a challenging problem because the risks of events are correlated (there are competing risks), often with only a few patients experiencing individual events of interest, and of those only a fraction are actually observed in the data. We introduce in this paper a survival model with the flexibility to leverage a common representation of related events that is designed to correct for the strong imbalance in observed outcomes. The procedure is sequential: outcome-specific survival distributions form the components of nonparametric multivariate estimators which we combine into an ensemble in such a way as to ensure accurate predictions on all outcome types simultaneously. Our algorithm is general and represents the first boosting-like method for time-to-event data with multiple outcomes. We demonstrate the performance of our algorithm on synthetic and real data.\n\n1 Introduction\n\nThere is now significant evidence that the progressions of many diseases interact with one another, such that the prediction of events of interest, for example death due to breast cancer in a population of women, will be influenced by their simultaneous risks of developing related diseases, such as cardiovascular or pulmonary diseases [19, 20]. A central problem in survival analysis is to predict the relationship between variables and survival, which is especially challenging when a number of different correlated events might occur - i.e., there are competing risks. Current approaches jointly model competing risks in an attempt to capture shared latent biological traits or common risk factors. In the presence of multiple events, however, jointly modelling these conditions leads to predictive models that neglect individual diseases with lower incidence. 
Clinical prognosis tools may thus achieve high overall prediction accuracy while having unacceptably low performance with respect to an underrepresented disease outcome, which strongly reduces their explanatory power for practical purposes. The design of therapies and medical planning relies on the survival estimates of predictive models. Examples of similar settings can be found in many fields beyond medicine, including failure analysis in engineering and the prediction of multiple correlated events in economics.\n\nThe focus of this work is to provide a new interpretation of boosting algorithms [11] in a multitask learning framework [8] that extends this family to time-to-event data with multiple competing outcomes. Motivated by the ideas discussed above, we specifically intend to leverage the heterogeneity present in large modern data sets, the complexity in underlying relationships between events/tasks, and the strong imbalance often observed between events/tasks. The aim is a flexible simultaneous description of the likelihood of different events over time, which we achieve by estimating full probability distributions, in contrast to single-time prediction problems such as regression or classification.\n\n32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montréal, Canada.\n\nFigure 1: Boosting in the Toy Example.\n\nWe develop a boosting algorithm in which each task-specific time-to-event distribution is a component of a multi-output function. A distinctive feature is that each weak estimator (whose performance is sub-optimal) learns a shared representation between tasks by recursively partitioning the observed data (analogous to the construction of trees) from all related tasks, using a measure of similarity between instances that involves all related tasks. 
This means that we learn a shared representation directly by selecting appropriate subsets of patients who may experience different events but share a common time-to-event trajectory. Since this partitioning scheme is applied recursively, the learned relationships (and predictions) are adaptive to the complexity of the problem and, in addition, no assumptions on the data generating process, such as accelerated failure times or proportional hazards (common in most survival models [22, 17]), need to be imposed. We construct an ensemble by weighting the sample data so as to bias the next weak multivariate estimator towards mis-predicted instances. What distinguishes our weighting scheme from existing boosting methods is that while the output of each weak estimator is a multivariate probability distribution, the data only provide the specific event that occurred and the time of occurrence, and thus we introduce new notions of "prediction correctness" that apply in our setting.\n\nWhy is boosting useful for competing risks? A toy example, shown in Figure 1, may help to illustrate our method. We consider a population that experiences one of two events, death due to cardiovascular disease (CVD) or breast cancer. (For simplicity we ignore censoring.) Each patient is characterized by their body mass index (BMI), cholesterol level and age at menarche. The medical fact is that increased BMI increases the risk of both breast cancer and CVD; increased cholesterol increases the risk of CVD but is irrelevant for breast cancer; increased age at menarche decreases the risk of breast cancer but is irrelevant for CVD. (Note that the same patients are represented in all panels of Figure 1 - the vertical position remains the same while their horizontal position changes due to different features being considered.) 
The panels show three iterations of boosting using a stump as a weak predictor; the best partition of the data in each case is shown with the yellow threshold. The first stump recognizes BMI as best separating event times (on average), but mispredicts the survival of patient (a) (who has high survival despite having high BMI) and of patient (b), for whom the contrary is true. Iteration 2, encouraged by the higher weight of (a), considers a split along the cholesterol level and is able to better describe (a)'s survival (its high survival is due to a low cholesterol level). Iteration 3, after repeatedly mispredicting (b) in iterations 1 and 2, splits based on age at menarche, which explains (b)'s low survival.\n\nSurvival data in the presence of competing risks is often highly heterogeneous; boosting is effective at identifying patients that do not conform to a general pattern, a fact further exacerbated when only a few examples are available of each type or when the imbalance is large.\n\n2 Problem Formulation\n\nThe setup we consider is best defined within the context of medical patients at risk of mutually exclusive outcomes, such as causes of death, referred to more generally as tasks in other domains. In this context the goal of multitask learning is to estimate cumulative incidence functions (CIFs):\n\n$$F_1, \dots, F_K : \mathcal{X} \times \mathcal{T} \to [0, 1] \quad (1)$$\n\n$F_k$ represents the probability of a specific event of type k happening before time t, $F_k(t|X) = p(T \le t, Z = k \,|\, X)$.¹ This relationship is estimated from an observational sample of the random tuple $\{X, T, Z\}$, where the input space $\mathcal{X}$ describes patient characteristics - typically $\mathbb{R}^d$ -, $T \in \mathbb{R}_+$ defines the time to event and Z is the type of event observed, $Z \in \{\emptyset, 1, \dots, K\}$. A particularity of time-to-event data is that often the outcome will not be observed for every patient (e.g. 
a patient's follow-up might be interrupted); however, event-free survival is known up to a censoring time independent of (X, T) (a common assumption in the survival literature that ensures consistency of our estimates). This is the defining property of survival data and makes our setting distinct from classical supervised problems. We write $z_i = \emptyset$ for a right-censored observation and $z_i = 1, \dots, K$ to denote the occurrence of one of K competing events.\n\nThe key idea is to exploit the shared structure of $F_1, \dots, F_K$ by estimating them jointly, rather than independently, in the hope of improving prediction performance for all tasks. We aim to learn prognostic models $\hat{F}_k$ so as to minimize the discrepancy between predicted and actual survival status,\n\n$$L_k(\hat{F}) := \mathbb{E} \left[ \frac{1}{\tau} \int_0^\tau \big( I(T \le t, Z = k) - \hat{F}_k(t|X) \big)^2 dt \right], \qquad L(\hat{F}) = \frac{1}{K} \sum_k L_k(\hat{F}) \quad (2)$$\n\nwhich is extended on the right-hand side to consider multiple tasks simultaneously by averaging over k. In addition, we define the cause-specific hazard function, or subdistribution hazard,\n\n$$\lambda_k(t|X) = \lim_{dt \to 0} p(t \le T \le t + dt, Z = k \,|\, T \ge t, X)/dt \quad (3)$$\n\nIt represents the instantaneous risk of experiencing an end-point related to cause k and indicates the rate at which mortality with respect to that cause progresses with time. The cumulative cause-specific hazard is $\Lambda_k(t) = \int_0^t \lambda_k(s) ds$.\n\n3 Model Description\n\nIn this section we present our main contribution: a nonparametric boosting algorithm for jointly estimating survival distributions for multiple tasks, which we call Survival Multitask Boosting (SMTBoost). Boosting algorithms iteratively train simple predictive models on weighted samples of the data so as to encourage improvement on those data points that are mis-predicted in previous iterations. 
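A concrete reading of the loss in (2): on a discrete time grid it reduces to an averaged integrated squared difference between the observed event indicator and the predicted CIFs. The sketch below is our own illustration (the array layout, the evaluation grid and the trapezoidal quadrature are assumptions, and the censoring weights introduced later are omitted here):

```python
import numpy as np

def multitask_loss(T, Z, F_hat, grid):
    """Empirical version of the multitask loss in equation (2) (a sketch).

    T     : (n,) observed event times
    Z     : (n,) event types in {1, ..., K}
    F_hat : (n, K, m) predicted CIFs evaluated on `grid` (hypothetical input)
    grid  : (m,) increasing time points spanning [0, tau]
    """
    tau = grid[-1]
    # indicator I(T_i <= t, Z_i = k) on the grid, shape (n, K, m)
    ind = (T[:, None, None] <= grid[None, None, :]) & \
          (Z[:, None, None] == np.arange(1, F_hat.shape[1] + 1)[None, :, None])
    sq = (ind.astype(float) - F_hat) ** 2
    # approximate (1/tau) * integral over [0, tau] by the trapezoidal rule,
    # then average over tasks k and instances i
    dt = np.diff(grid)
    per_task = ((sq[..., 1:] + sq[..., :-1]) / 2.0 * dt).sum(axis=2) / tau
    return per_task.mean()
```

A quick sanity check on any implementation: a predictor that outputs the indicator itself attains loss zero.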
The following subsections detail first the training procedure of weak predictors and then the ensemble approach that results in flexible task-specific time-to-event distributions.\n\n3.1 Weak predictors\n\nWeak predictors are trees composed of leaves and nodes. Leaves define a partition of the data and are responsible for making predictions; nodes guide examples towards appropriate leaves using binary splits based on boolean-valued rules. We seek a binary recursive partitioning scheme - rules that partition the data at each node - resulting in the greatest difference between task-specific and overall time-to-event distributions.\n\n3.1.1 Splitting rule\n\nThe key to growing trees lies in the split rule used to recursively separate the population into homogeneous nodes. In the context of competing risks, the measures of homogeneity of time-to-event outcomes used in single-event survival settings, such as the log-rank test statistic and model deviance, are not applicable. We opt instead for a modified version of Gray's test statistic [14, 16] that explicitly compares the CIFs Fk between two populations. Gray proposed a non-parametric log-rank test statistic defined as a weighted sum of differences in estimates of the sub-distribution hazards λk, which effectively generalizes the log-rank test to competing risks. In order to measure similarity with respect to all causes simultaneously, we combine the task-specific splitting rules across the event types k and optimize a weighted sum of Gray's statistic over all tasks (weighted by an asymptotic estimate of the variance of each statistic; see [14] for details). Figure 2 illustrates this procedure for a single split. For conventional survival analysis, when a single event is analyzed in isolation, Gray's test statistic reduces to the log-rank test statistic commonly used in survival trees.\n\n¹Also called the subdistribution function because it does not converge to 1 as t → ∞, but to p(Z = k), the expected proportion of task k events. However, the CIFs for all possible event types always add up to the distribution function of T.\n\nFigure 2: An example of a partition of node C0 into C1 and C2 based on Gray's test statistic for two tasks. The estimated CIFs get updated based on the resulting partition (defined by x = δ) that maximizes Gray's test statistic, a composite measure of the difference between the CIFs in each subset.\n\n3.1.2 Leaf node predictions\n\nLeaf nodes are responsible for predictions. Based on the final partition we compute task-specific distributions with nonparametric estimators from the theory of counting processes. Let Cj denote the index set of training examples with leaf node membership j. Let Nk(t) and N(t) denote the number of events of type k and of any type recorded before time t respectively, Y(t) the number of individuals at risk at time t, and t(0) < t(1) < ... < t(m) the ordered distinct event times (these quantities refer to leaf node j only; we omit the subscript j for readability). 
We compute task-specific survival at leaf node j with the nonparametric Aalen-Johansen estimator [1],\n\n$$\hat{F}_k(\tau) = \int_0^\tau \hat{S}(t)\, d\hat{\Lambda}_k(t) \quad (4)$$\n\nwhere the probability of event-free survival $\hat{S}$ is estimated with the Kaplan-Meier estimator and the cumulative hazard function $\hat{\Lambda}_k$ is estimated with the Nelson-Aalen estimator,\n\n$$\hat{S}(t) = \prod_{i: t_{(i)} \le t} \left( 1 - \frac{N(t_{(i)}) - N(t_{(i-1)})}{Y(t_{(i)})} \right), \qquad \hat{\Lambda}_k(t) = \sum_{i: t_{(i)} < t} \frac{N_k(t_{(i)}) - N_k(t_{(i-1)})}{Y(t_{(i)})} \quad (5)$$\n\nThe leaf nodes partition the sample space, so the above construction defines the task-specific and overall cumulative incidence functions for the tree,\n\n$$\hat{F}_k(t; x_i) = \sum_j I(i \in C_j) \hat{F}_{k,j}(t), \qquad \hat{F}(t; x_i) = \sum_k \hat{F}_k(t; x_i) \quad (6)$$\n\nwhere the subscript j refers to CIFs estimated based on each leaf of the resulting tree. This process allows us to obtain completely nonparametric estimates of survival. In complex problems, and particularly from a medical perspective, this is important because subtle signals in heterogeneous populations are often unknown a priori and need to be discovered from relationships in the data.\n\n3.2 Ensemble approach by boosting\n\nIn the traditional boosting framework, misclassified examples are up-weighted to bias the next weak predictor to improve previous predictions. The contrast with time-to-event settings is that model outputs are probability distributions over time and hence notions of correctness of model predictions need to be accommodated. 
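The leaf-node estimators in (4)-(5) admit a compact implementation: sweep the distinct event times once, maintaining the Kaplan-Meier survival and accumulating the Aalen-Johansen increments. A minimal sketch under our own array conventions (`events` uses 0 for censoring; not the authors' code):

```python
import numpy as np

def aalen_johansen(times, events, K):
    """Leaf-node estimators of equations (4)-(5) (a sketch).

    times  : (n,) observed times in a leaf
    events : (n,) 0 for censored, 1..K for the competing event types
    Returns the distinct event times and the CIFs there, shape (K, m).
    """
    t_grid = np.unique(times[events > 0])   # ordered distinct event times
    F = np.zeros((K, t_grid.size))
    S = 1.0                                 # Kaplan-Meier survival S(t-)
    cum = np.zeros(K)
    for j, t in enumerate(t_grid):
        at_risk = np.sum(times >= t)        # Y(t)
        d_all = np.sum((times == t) & (events > 0))
        for k in range(K):
            d_k = np.sum((times == t) & (events == k + 1))
            cum[k] += S * d_k / at_risk     # increment S(t-) dLambda_k(t)
        S *= 1.0 - d_all / at_risk          # Kaplan-Meier update
        F[:, j] = cum
    return t_grid, F
```

In the absence of censoring the final CIFs of all causes sum to one minus the remaining survival mass, which makes for an easy correctness check.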
We use the extended loss function introduced in (2), which captures the joint performance over all tasks and gives a measure of the individual empirical error:\n\n$$e_i = \frac{1}{K\tau} \int_0^\tau \hat{W}_i(t) \sum_k \big( I(T_i \le t, Z_i = k) - \hat{F}_k(t; x_i) \big)^2 dt \quad (7)$$\n\nwhere $\hat{W}_i(t)$ are estimated inverse probability of censoring weights.\n\nEach iteration m of the boosting procedure grows a tree $F^{(m)}$ on a weighted random fraction s of the training data. A decomposition of the error suggests improved performance with uncorrelated weak predictors (see the supplementary material), which we encourage by randomly sub-sampling the data; empirically we have found this to impact performance favourably. The algorithm proceeds by re-weighting all samples in our training data as a function of the individual prediction error $e_i^{(m)}$ computed with (7) and a measure of overall confidence in the predictions of the mth tree, $\beta^{(m)}$. Those examples with poor event predictions get increased updated weights, where $\beta^{(m)}$ is set to control the magnitude of this update: more confident trees lead to larger updates. $\beta^{(m)}$ is adjusted to lie in the interval (0, 1); note that random guessing of survival probabilities results in an average error $\epsilon^{(m)}$ of 1/3, which any weak predictor $F^{(m)}$ is assumed to improve upon. Final time-to-event distributions for each task are computed from a weighted average of all weak predictors, weighted proportionally to their confidence $\beta^{(m)}$. In contrast to existing boosting methods [9, 25, 11, 13], the output is not in the form of a discrete class label or a real-valued number, but a set of distribution functions. One of the main contributions of this paper is to explicitly extend discrete label boosting to nonparametric multivariate density estimation. 
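The reweighting scheme just described - adjusted error, confidence, weight update - can be replayed in isolation. The sketch below is our own: it takes a hypothetical (M, n) matrix of per-instance errors rather than actual weak learners, and the explicit renormalisation of the weights is our reading of the proportionality:

```python
import numpy as np

def smtboost_weights(errors):
    """Replay of the error-adjustment and reweighting steps (a sketch).

    errors : hypothetical (M, n) array, errors[m, i] = e_i of weak model m,
             assumed to lie in [0, 1]
    Returns the final instance weights and the per-model confidences beta.
    """
    M, n = errors.shape
    w = np.full(n, 1.0 / n)
    betas = np.empty(M)
    for m in range(M):
        eps = np.sum(errors[m] * w)            # adjusted (weighted) error
        beta = eps / (2.0 / 3.0 - eps)         # confidence; beta < 1 iff eps < 1/3
        w = w * beta ** (1.0 - errors[m])      # well-predicted points shrink most
        w = w / w.sum()                        # keep w a distribution
        betas[m] = beta
    return w, betas
```

Because the exponent is 1 - e_i, accurately predicted instances are multiplied by the full factor beta < 1 while badly predicted ones keep their weight, so the next tree is biased towards the latter.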
We present all algorithmic details in Algorithm 1.\n\nAlgorithm 1 Multitask Boosting\n\nInput: time-to-event data with multiple tasks $D = \{(X_i, T_i, Z_i)\}_i$ of size n, number of iterations M, initial weights $w_i^{(1)} \propto 1$, sampling fraction s.\nfor m = 1 to M do\n1. Let $D^*$ be a randomly sampled fraction s of the training data D drawn with distribution $w^{(m)}$.\n2. Learn a weak model $F^{(m)} : \mathcal{X} \times \mathcal{T} \to [0, 1]^K$ on $D^*$.\n3. Calculate the prediction error $e_i^{(m)}$ for each instance i with equation (7).\n4. Calculate the adjusted error of $F^{(m)}$, $\epsilon^{(m)} = \sum_i e_i^{(m)} w_i^{(m)}$.\n5. Calculate the confidence in the individual weak model, $\beta^{(m)} = \epsilon^{(m)} / (2/3 - \epsilon^{(m)})$.\n6. Update the data distribution, $w_i^{(m+1)} \propto w_i^{(m)} (\beta^{(m)})^{1 - e_i^{(m)}}$.\nend for\nOutput: Final predictions $F_f$, the weighted average of $F^{(m)}$ for $1 \le m \le M$ using $\log(1/\beta^{(m)})$ as the weight of model $F^{(m)}$.\n\n3.3 Variable importance\n\nUnderstanding the influence of variables on each specific task is of crucial importance in medicine and other domains. The approach we use is based on a comparison of the prediction error (2) when a variable is randomly shuffled (such that the dependence between the response and the variable is broken) against the original best fit, similar to [26], who have shown such procedures to be effective in many practical settings. The randomization effectively voids the effect of a variable. The intuition is that variables used as splitting rules in many tree configurations will significantly alter individual predictions (when the variable value is shuffled in each patient), suggesting high predictive power relative to other variables. Let $e^*_{m,j}$ denote the error of tree m over the training data with variable j randomly shuffled and $e_m$ the error without shuffling. 
We define the importance of variable j as the weighted average of prediction error differences,\n\n$$\mathrm{Imp}(j) = \frac{\sum_m \log(1/\beta^{(m)}) \, |e_m - e^*_{j,m}|}{\sum_m \log(1/\beta^{(m)})} \quad (8)$$\n\nTask-specific variable importance measures can be computed by considering the error only over the task-specific component (i.e. using Lk instead of L in equation 2).\n\n4 Related Work\n\nSurvival analysis under competing risks departs from more common supervised learning problems by asking both what event will occur and when that event will occur. A number of recent papers [22, 17, 5, 3, 4] only consider a single event of interest and are thus not directly applicable to our context. We focus instead on contrasting with approaches which, like the present paper, treat competing risks.\n\nParametric models The most common techniques for the analysis of such data explicitly model some form of a cause-specific hazard (presented in (3)) as a parametric function of descriptive variables. [21] and later [10] are familiar examples of this approach. Applications of boosting [23, 15, 6], albeit in a gradient boosting framework which is very different from ours, have been proposed to improve parameter optimization by pursuing parameter updates that result in the steepest ascent of Cox's partial likelihood. With the exception of [6], all other works consider only one event type. A major downside of all the above is their dependence on proportional hazards - hazard rates between any two patients need to be in constant proportion over time - and their need to specify covariate interactions beforehand. By contrast, our work makes no such assumptions.\n\nTree-based models Closer to our work are tree-based approaches to competing risks. [16] extended Random Forests [7] to time-to-event estimation under competing risks. They propose a parallel ensemble in which fully grown trees are built independently on a bootstrapped sample of the data. 
As was empirically observed in classification problems in [24], the performance on important subsets of the population is undermined by the small contribution of underrepresented tasks to the construction of each tree. For this reason several authors [12] have suggested modifications that re-balance the data by over/under-sampling subsets of the data. However, our experiments using this approach, reported in section 5, produced only mixed results. By boosting, our model implicitly corrects for this imbalance by encouraging successive trees to improve performance on underrepresented tasks when they are mis-predicted. [5] use multivariate random forests within a parametric Bayesian mixture model in which the components of the mixture describe each task individually.\n\nOther Machine Learning models The approach to competing risks in terms of a multi-task learning problem is not new to the present work. For example, [2] builds a model that couples this point of view with a representation in terms of deep multi-task Gaussian Processes with vector-valued kernels. However, the objective in [2] is to predict fixed-time risk (e.g. 1-year mortality) rather than to predict full survival curves, which is our objective here. [18] is closer to the present work in that it shares the objective of predicting cause-specific survival probabilities, but the methodology is quite different, exploiting a deep learning architecture with shared and task-specific layers.\n\n5 Experiments\n\n5.1 Evaluation Protocol\n\nWe measure performance with a common metric used in the literature: the cause-specific concordance index (C-index). Formally, we define the (time-dependent) concordance index (C-index) for a cause k as follows [27]:\n\n$$C_k(t) := P\big( \hat{F}_k(t; X_i) > \hat{F}_k(t; X_j) \,\big|\, \{z_i = k\} \wedge \{T_i \le t\} \wedge \{T_i < T_j \vee \delta_j \ne k\} \big) \quad (9)$$\n\nwhere $\hat{F}_k(t; X_i)$ is the predicted CIF for a test patient i. 
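An empirical counterpart of (9) counts concordant pairs among the comparable ones. The sketch below is our own (the half credit for tied predictions is a common convention, not taken from the paper):

```python
import numpy as np

def cause_specific_cindex(F_pred, T, Z, k, t):
    """Empirical time-dependent C-index for cause k at horizon t (a sketch).

    F_pred : (n,) predicted CIFs for cause k evaluated at horizon t
    T, Z   : (n,) observed times and event types (0 = censored)
    """
    concordant, comparable = 0.0, 0
    for i in range(len(T)):
        if Z[i] != k or T[i] > t:
            continue                          # i must experience cause k by t
        for j in range(len(T)):
            # j is comparable if it survives past T_i or fails from another cause
            if T[i] < T[j] or Z[j] != k:
                comparable += 1
                if F_pred[i] > F_pred[j]:
                    concordant += 1.0
                elif F_pred[i] == F_pred[j]:
                    concordant += 0.5         # half credit for ties (our convention)
    return concordant / comparable if comparable else float("nan")
```

A predictor that ranks earlier cause-k failures as higher risk attains a C-index of 1, and reversing its ranking yields 0, matching the interpretation given below.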
The time-dependent C-index as defined above corresponds to the probability that predicted cause-specific survival probabilities are ranked in accordance with the actual observed survival times, given the occurrence of an event and the corresponding cause. The C-index thus serves as a measure of a model's discriminative power for a cause of interest and can be interpreted as an extension of the AUROC to censored data. Random guessing corresponds to a C-index of 0.5 and perfect prediction to a C-index of 1.\n\nBaseline Algorithms We compare our model with 9 baseline algorithms described in section 4. We consider the proportional hazards model on the cause-specific hazard (Cox) [21], the proportional hazards model on the subdistribution hazard (Fine-Gray) by [10] and the boosting approach to parameter optimization from [6]. These three baselines encode a linear effect of variables on survival and do not require hyper-parameter tuning, except [6], for which the number of boosting iterations is optimized by cross-validation. As nonparametric alternatives we consider Random Forests for survival data under competing risks (RSF) [16] and also a weighted version (Weighted RSF) that attempts to mitigate task imbalance by sampling low-incidence instances with higher probability so as to achieve balanced tasks in each bootstrapped sample. The size of the ensembles was optimized by cross-validation while the remaining hyper-parameters were set to default values. We compare with the Gaussian process model (DMGP) [2] with the suggested hyperparameter configurations and the deep learning architecture (DeepHit) [18] with hyperparameters optimized with a validation set. 
We have in addition evaluated SMTBoost on each cause separately, denoted SMTBoost (sep.), by using the log-rank test statistic instead of Gray's test (see section 3.1.1), and the deep neural network for survival prediction (DeepSurv) [17] - also evaluated on each cause separately, as it does not consider competing risks - to understand the benefit of considering all causes jointly. In all experiments we train SMTBoost with a tree depth of 3 and 250 boosting iterations, our default parameter settings.\n\n5.2 Synthetic Studies\n\nThis section explores the ability of SMTBoost to recover complex survival patterns.\n\n5.2.1 High Dimensional and Heterogeneous data\n\n$$X \sim U(0, 1), \quad T^1 \sim \exp(X_1^2 + \sin(X_2 + X_3) + 2X_4 + 2X_5), \quad T^2 \sim \exp(X_1 + X_2 + X_3 + 2X_6 + 2X_7)$$\n\nThis challenging setting mimics data that might be expected in genetic studies or medical data from electronic health records, in which the two tasks reflect heterogeneous interactions between patient variables. We generate 1000 event times T, each with probability 0.5 from task 1 or 2, based on 100 variables drawn from a uniform distribution. A random subset of 20% of generated times are censored by transforming their event time: C ← U(0, T). Only a very small number of variables, 7 out of 100, are set to influence time to event. The first 3 generated variables are shared between tasks, while variables 4 and 5 influence task 1 only, and variables 6 and 7 influence task 2 only. All remaining variables are introduced as noise.\n\nAs a first experiment we aim to evaluate and demonstrate the validity of our task-specific variable importance procedure introduced in section 3.3. Results (normalized) are shown in Figure 3. For each variable two estimates are presented: one deriving from the error on task 1 predictions only, and one considering the error on task 2 only. 
We note first that even in high-dimensional settings with a lot of noise, SMTBoost is able to distinguish between influential and noise variables. In addition, SMTBoost captures the larger effect of task-specific variables but, due to the presence of censoring, also overestimates the importance of variables that are present in only one of the two tasks.\n\nIn a second experiment we evaluate the model's task-specific predictions in comparison to the baseline algorithms introduced in section 5.1, with the C-index averaged over all time horizons t. We present these results in the last two columns of Table 1. Performance on task 1 demonstrates the representational ability of the more flexible approaches (tree-based, deep learning and Gaussian process), but they outperform only marginally on task 2, which has linear covariate influence. The performance of the tree-based approaches on both tasks suggests that these are more efficient in high-dimensional settings. In comparison to SMTBoost, we believe that it is the stronger focus (by boosting) on divergent instances that leads to the gain in performance with respect to RSF and weighted RSF, because the exponential transformation of covariate interactions for both tasks leads to highly divergent event times even between observations that have similar covariate values.\n\nFigure 3: Variable importance in the high-dimensional setting.\n\n5.3 Real data studies: SEER\n\nWe investigate a patient population extracted from the Surveillance, Epidemiology, and End Results (SEER) repository, similarly to [2]. The data contains 72,809 patients of which 14.4% died due to breast cancer, 1.7% due to cardiovascular diseases (CVD) and 6.1% due to other causes. The remaining patients were censored. 
We give a more detailed description in the Supplementary material. Performance is evaluated with the cause-specific C-index (section 5.1), averaged over equally spaced times of 10 months from registration to the last observed event. Table 1 gives all performance results; these are averages over 4-fold cross-validation estimates and confidence bands are standard deviations.\n\nModels | Breast Cancer | CVD | Other | Synthetic T1 | Synthetic T2\nCox | 0.773 ± 0.02 | 0.688 ± 0.02 | 0.639 ± 0.03 | 0.612 ± 0.01 | 0.705 ± 0.01\nCoxBoost | 0.774 ± 0.02 | 0.678 ± 0.02 | 0.642 ± 0.03 | 0.613 ± 0.01 | 0.705 ± 0.01\nFine-Gray | 0.777 ± 0.02 | 0.682 ± 0.02 | 0.636 ± 0.03 | 0.613 ± 0.01 | 0.706 ± 0.01\nRSF | 0.789 ± 0.03 | 0.643 ± 0.02 | 0.722 ± 0.03 | 0.654 ± 0.01 | 0.720 ± 0.01\nWeighted RSF | 0.778 ± 0.03 | 0.645 ± 0.02 | 0.730 ± 0.03 | 0.645 ± 0.01 | 0.717 ± 0.01\nDeepHit [18] | 0.800 ± 0.01 | 0.684 ± 0.01 | 0.662 ± 0.01 | 0.652 ± 0.01 | 0.720 ± 0.01\nDMGP [2] | 0.801 ± 0.02 | 0.646 ± 0.02 | 0.732 ± 0.03 | 0.651 ± 0.01 | 0.718 ± 0.01\nDeepSurv [17] | 0.781 ± 0.02 | 0.685 ± 0.03 | 0.659 ± 0.03 | 0.629 ± 0.01 | 0.710 ± 0.01\nSMTBoost (sep.) | 0.795 ± 0.02 | 0.721 ± 0.04 | 0.660 ± 0.03 | 0.631 ± 0.01 | 0.710 ± 0.01\nSMTBoost | 0.819 ± 0.02 | 0.766 ± 0.03 | 0.688 ± 0.02 | 0.664 ± 0.01 | 0.721 ± 0.01\n\nTable 1: C-index figures (higher better) and standard deviations on the SEER and synthetic datasets.\n\nSource of gain Patients suffering from chronic diseases tend to be very heterogeneous; mortality rates can be highly divergent even within narrow phenotypes. The limitations of proportional hazards models for this kind of data are evident from the performance results on both the Breast Cancer and CVD outcomes. 
Predictions of other causes tend to benefit from simpler modelling approaches, as SEER predominantly records patient information related to cancer (see Supplement), which suggests that few predictive variables are available for other causes. Performance gains of SMTBoost are largest with respect to CVD outcomes, which illustrates its ability to handle low-incidence tasks (only 1.7% of events relate to CVD). Both DeepHit and DMGP are competitive, as they leverage the influence of shared risk factors, but underperform SMTBoost. The results suggest that boosting to handle imbalance is crucial to improve predictions.\n\nFigure 4: Mean C-index results (higher better).\n\n5.3.1 Further exploring imbalanced heterogeneous data\n\nWe constructed an additional, more general synthetic experiment designed to express complex and heterogeneous survival patterns between 2 tasks, to further understand performance in imbalanced data sets. Consider the following data generation process,\n\n$$X \sim U(0, 1), \quad T^1 \sim \exp(\log(\alpha_1^T X) + \alpha_2^T X^2), \quad T^2 \sim \exp(\alpha_3^T X)$$\n\nThe variables X and parameters α1, α2, α3 are each of dimension 5, with components drawn at random from a uniform distribution. For each task we investigate predictive performance as a function of task prevalence by analyzing 4 scenarios with different task proportions in the resulting data. For instance, a first balanced scenario for task 1 would involve the split: 50% censored, 25% task 1, 25% task 2. We generate 5 data sets (by sampling variables and parameters randomly) of 1000 instances for each individual scenario and set a random 50% of the population to be uniformly censored. We show performance results in Figure 4, as a function of task 1 and task 2 occurrence in the data. 
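The generation process above can be sketched as follows; the exact rule for hitting the target task proportions and the independence of the uniform censoring are our assumptions, as the paper only states the target splits:

```python
import numpy as np

def simulate(n=1000, d=5, censor_frac=0.5, p1=0.25, seed=0):
    """Sketch of the synthetic design in section 5.3.1. The task-assignment
    rule (task 1 with probability p1 / (1 - censor_frac)) and the independent
    uniform censoring are assumptions; only the target split (censor_frac
    censored, p1 observed task-1 events) is stated in the text."""
    rng = np.random.default_rng(seed)
    X = rng.uniform(0.0, 1.0, (n, d))
    a1, a2, a3 = rng.uniform(0.0, 1.0, (3, d))
    T1 = np.exp(np.log(X @ a1) + X**2 @ a2)      # T^1 ~ exp(log(a1'X) + a2'X^2)
    T2 = np.exp(X @ a3)                          # T^2 ~ exp(a3'X)
    task = np.where(rng.uniform(size=n) < p1 / (1.0 - censor_frac), 1, 2)
    T = np.where(task == 1, T1, T2)
    Z = task.copy()
    cens = rng.uniform(size=n) < censor_frac     # censor a random fraction
    T = np.where(cens, rng.uniform(0.0, T), T)   # C ~ U(0, T)
    Z = np.where(cens, 0, Z)
    return X, T, Z
```

Lowering `p1` reproduces the imbalanced scenarios: with `p1=0.05`, roughly 5% of subjects are observed task-1 events, 45% task-2 events, and 50% censored.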
As expected,
the performance of all models deteriorates as fewer samples become available, but we observe
increasing relative performance gains for SMTBoost and weighted RSF, the only two approaches that
attempt to correct for the imbalance.

6 Conclusion

We have introduced a boosting-based algorithm for survival analysis with multiple outcomes, designed
to handle the heterogeneity present in modern medical data sets, including highly imbalanced data
and high-dimensional feature spaces. Our experiments on synthetic and real medical data have
demonstrated large performance improvements over current techniques and show the advantage
of a boosting framework, already observed in classification and regression problems, in the field
of time-to-event analysis. From a medical perspective our model contributes towards the field of
"individualized medicine". Our hope is that, based on our model, clinicians can improve long-term
prognosis and more accurately weigh the benefits of a treatment for each individual patient whose
characteristics may lead her to behave differently from the average.

References

[1] Odd Aalen. A model for nonparametric regression analysis of counting processes. In Mathematical Statistics and Probability Theory, pages 1-25. Springer, 1980.

[2] Ahmed Alaa and Mihaela van der Schaar. Deep multi-task Gaussian processes for survival analysis with competing risks. In Advances in Neural Information Processing Systems, 2017.

[3] Ahmed M Alaa and Mihaela van der Schaar. AutoPrognosis: Automated clinical prognostic modeling via Bayesian optimization with structured kernel learning. In International Conference on Machine Learning, 2018.

[4] Alexis Bellot and Mihaela van der Schaar. A hierarchical Bayesian model for personalized survival predictions. IEEE Journal of Biomedical and Health Informatics, 2018.

[5] Alexis Bellot and Mihaela van der Schaar.
Tree-based Bayesian mixture model for competing risks. In International Conference on Artificial Intelligence and Statistics, pages 910-918, 2018.

[6] Harald Binder, Arthur Allignol, Martin Schumacher, and Jan Beyersmann. Boosting for high-dimensional time-to-event data with competing risks. Bioinformatics, 25(7):890-896, 2009.

[7] Leo Breiman. Random forests. Machine Learning, 45(1):5-32, 2001.

[8] R Caruana. Multitask learning: A knowledge-based source of inductive bias. In International Conference on Machine Learning, pages 41-48, 1993.

[9] Harris Drucker. Improving regressors using boosting techniques. In International Conference on Machine Learning, pages 107-115, 1997.

[10] Jason P Fine and Robert J Gray. A proportional hazards model for the subdistribution of a competing risk. Journal of the American Statistical Association, 94(446):496-509, 1999.

[11] Yoav Freund and Robert E Schapire. A decision-theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences, 55(1):119-139, 1997.

[12] Mikel Galar, Alberto Fernandez, Edurne Barrenechea, Humberto Bustince, and Francisco Herrera. A review on ensembles for the class imbalance problem: bagging-, boosting-, and hybrid-based approaches. IEEE Transactions on Systems, Man, and Cybernetics, Part C (Applications and Reviews), 42(4):463-484, 2012.

[13] Nico Görnitz, Christian Widmer, Georg Zeller, André Kahles, Gunnar Rätsch, and Sören Sonnenburg. Hierarchical multitask structured output learning for large-scale sequence segmentation. In Advances in Neural Information Processing Systems, pages 2690-2698, 2011.

[14] Robert J Gray. A class of k-sample tests for comparing the cumulative incidence of a competing risk.
The Annals of Statistics, pages 1141-1154, 1988.

[15] Kevin He, Yanming Li, Ji Zhu, Hongliang Liu, Jeffrey E Lee, Christopher I Amos, Terry Hyslop, Jiashun Jin, Huazhen Lin, Qinyi Wei, et al. Component-wise gradient boosting and false discovery control in survival analysis with high-dimensional covariates. Bioinformatics, 32(1):50-57, 2015.

[16] Hemant Ishwaran, Thomas A Gerds, Udaya B Kogalur, Richard D Moore, Stephen J Gange, and Bryan M Lau. Random survival forests for competing risks. Biostatistics, 15(4):757-773, 2014.

[17] Jared Katzman, Uri Shaham, Jonathan Bates, Alexander Cloninger, Tingting Jiang, and Yuval Kluger. Deep survival: A deep Cox proportional hazards network. arXiv preprint arXiv:1606.00931, 2016.

[18] Changhee Lee, William R Zame, Jinsung Yoon, and Mihaela van der Schaar. DeepHit: A deep learning approach to survival analysis with competing risks. AAAI, 2018.

[19] Stewart Mercer, Chris Salisbury, and Martin Fortin. ABC of Multimorbidity. John Wiley & Sons, 2014.

[20] Deborah Morrison, Karolina Agur, Stewart Mercer, Andreia Eiras, Juan I González-Montalvo, and Kevin Gruffydd-Jones. Managing multimorbidity in primary care in patients with chronic respiratory conditions. NPJ Primary Care Respiratory Medicine, 26:16043, 2016.

[21] Ross L Prentice, John D Kalbfleisch, Arthur V Peterson Jr, Nancy Flournoy, Vern T Farewell, and Norman E Breslow. The analysis of failure times in the presence of competing risks. Biometrics, pages 541-554, 1978.

[22] Rajesh Ranganath, Adler Perotte, Noémie Elhadad, and David Blei. Deep survival analysis. In Machine Learning for Healthcare Conference, pages 101-114, 2016.

[23] Greg Ridgeway. The state of boosting. Computing Science and Statistics, pages 172-181, 1999.

[24] Chris Seiffert, Taghi M Khoshgoftaar, Jason Van Hulse, and Amri Napolitano.
Building useful models from imbalanced data with sampling and boosting. In FLAIRS Conference, pages 306-311, 2008.

[25] Dimitri P Solomatine and Durga L Shrestha. AdaBoost.RT: A boosting algorithm for regression problems. In Proceedings of the 2004 IEEE International Joint Conference on Neural Networks, volume 2, pages 1163-1168. IEEE, 2004.

[26] Carolin Strobl, Anne-Laure Boulesteix, Thomas Kneib, Thomas Augustin, and Achim Zeileis. Conditional variable importance for random forests. BMC Bioinformatics, 9(1):307, 2008.

[27] Marcel Wolbers, Paul Blanche, Michael T Koller, Jacqueline CM Witteman, and Thomas A Gerds. Concordance for prognostic models with competing risks. Biostatistics, 15(3):526-539, 2014.