{"title": "Forecasting Treatment Responses Over Time Using Recurrent Marginal Structural Networks", "book": "Advances in Neural Information Processing Systems", "page_first": 7483, "page_last": 7493, "abstract": "Electronic health records provide a rich source of data for machine learning methods to learn dynamic treatment responses over time. However, any direct estimation is hampered by the presence of time-dependent confounding, where actions taken are dependent on time-varying variables related to the outcome of interest. Drawing inspiration from marginal structural models, a class of methods in epidemiology which use propensity weighting to adjust for time-dependent confounders, we introduce the Recurrent Marginal Structural Network - a sequence-to-sequence architecture for forecasting a patient's expected response to a series of planned treatments. Using simulations of a state-of-the-art pharmacokinetic-pharmacodynamic (PK-PD) model of tumor growth, we demonstrate the ability of our network to accurately learn unbiased treatment responses from observational data \u2013 even under changes in the policy of treatment assignments \u2013 and performance gains over benchmarks.", "full_text": "Forecasting Treatment Responses Over Time Using\n\nRecurrent Marginal Structural Networks\n\nBryan Lim\n\nDepartment of Engineering Science\n\nUniversity of Oxford\n\nAhmed Alaa\n\nElectrical Engineering Department\nUniversity of California, Los Angeles\n\nbryan.lim@eng.ox.ac.uk\n\nahmedmalaa@ucla.edu\n\nMihaela van der Schaar\n\nUniversity of Oxford\n\nand The Alan Turing Institute\n\nmschaar@turing.ac.uk\n\nAbstract\n\nElectronic health records provide a rich source of data for machine learning meth-\nods to learn dynamic treatment responses over time. However, any direct estimation\nis hampered by the presence of time-dependent confounding, where actions taken\nare dependent on time-varying variables related to the outcome of interest. 
Drawing inspiration from marginal structural models, a class of methods in epidemiology which use propensity weighting to adjust for time-dependent confounders, we introduce the Recurrent Marginal Structural Network - a sequence-to-sequence architecture for forecasting a patient's expected response to a series of planned treatments. Using simulations of a state-of-the-art pharmacokinetic-pharmacodynamic (PK-PD) model of tumor growth [12], we demonstrate the ability of our network to accurately learn unbiased treatment responses from observational data – even under changes in the policy of treatment assignments – and performance gains over benchmarks.

1 Introduction

With the increasing prevalence of electronic health records, there has been much interest in the use of machine learning to estimate treatment effects directly from observational data [13, 41, 44, 2]. These records, collected over time as part of regular follow-ups, provide a more cost-effective method to gather insights on the effectiveness of past treatment regimens. While the majority of previous work focuses on the effects of interventions at a single point in time, observational data also captures information on complex time-dependent treatment scenarios, such as where the efficacy of treatments changes over time (e.g. drug resistance in cancer patients [40]), or where patients receive multiple interventions administered at different points in time (e.g. joint prescriptions of chemotherapy and radiotherapy [12]). As such, the ability to accurately estimate treatment effects over time would allow doctors to determine both the treatments to prescribe and the optimal time at which to administer them.

However, straightforward estimation in observational studies is hampered by the presence of time-dependent confounders, arising in cases where interventions are contingent on biomarkers whose values are affected by past treatments.
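As a single-time-step caricature of this problem (not the paper's model), the following simulation shows how a confounder that drives both treatment assignment and outcome flips the sign of the naive estimate, while adjusting for the confounder recovers the true effect. All variable names and numbers here are illustrative assumptions:

```python
import numpy as np

# Toy illustration: a biomarker L drives both treatment assignment and the
# outcome, so the naive treated-vs-untreated contrast has the wrong sign.
rng = np.random.default_rng(0)
n = 20000
L = rng.normal(size=n)                            # biomarker (confounder)
A = (L < 0).astype(float)                         # treat when biomarker is low
Y = L + 0.5 * A + rng.normal(scale=0.1, size=n)   # true treatment effect = +0.5

# Naive contrast: confounded, comes out negative despite the +0.5 effect.
naive = Y[A == 1].mean() - Y[A == 0].mean()

# Adjusting for L (least squares on [1, A, L]) recovers the true effect.
X = np.column_stack([np.ones(n), A, L])
coef, *_ = np.linalg.lstsq(X, Y, rcond=None)
adjusted = coef[1]
```

In the longitudinal setting of this paper the situation is harder still, because L itself is affected by earlier treatments, which is why simple regression adjustment no longer suffices.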
For example, asthma rescue drugs provide short-term rapid improvements to lung function measures, but are usually prescribed to patients with reduced lung function scores. As such, naïve methods can lead to the incorrect conclusion that the medication reduces lung function scores, contrary to the actual treatment effect [26]. Furthermore, [23] show

32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montréal, Canada.

that the standard adjustments for causal inference, e.g. stratification, matching and propensity scoring [16], can introduce bias into the estimation in the presence of time-dependent confounding.

Marginal structural models (MSMs) are a class of methods commonly used in epidemiology to estimate time-dependent effects of exposure while adjusting for time-dependent confounders [15, 24, 19, 14]. Using the probability of a treatment assignment, conditioned on past exposures and covariate history, MSMs typically adopt inverse probability of treatment weighting (IPTW) to correct for bias in standard regression methods [22], re-constructing a 'pseudo-population' from the observational dataset similar to that of a randomized clinical trial. However, the effectiveness of bias correction is dependent on a correct specification of the conditional probability of treatment assignment, which is difficult to do in practice given the complexity of treatment planning. In standard MSMs, IPTWs are produced using pooled logistic regression, which makes strong assumptions on the form of the conditional probability distribution.
This also requires a separate set of coefficients to be estimated per time step, and many models to be estimated for long trajectories.

In this paper, we propose a new deep learning model - which we refer to as Recurrent Marginal Structural Networks - to directly learn time-dependent treatment responses from observational data, based on the marginal structural modeling framework. Our key contributions are as follows:

Multi-step Prediction Using Sequence-to-sequence Architecture To forecast treatment responses at multiple time horizons in the future, we propose a new RNN architecture for multi-step prediction based on sequence-to-sequence architectures in natural language processing [36]. This comprises two halves: 1) an encoder RNN which learns representations for the patient's current clinical state, and 2) a decoder which is initialized using the encoder's final memory state and computes forward predictions given the intended treatment assignments. At run time, the R-MSN also allows prediction horizons to be flexibly adjusted to match the intended treatment duration, by expanding or contracting the number of decoder units in the sequence-to-sequence model.

Scenario Analysis for Complex Treatment Regimens Treatment planning in clinical settings is often based on the interaction of numerous variables - including 1) the desired outcomes for a patient (e.g. survival improvement or comorbidity risk reduction), 2) the treatments to assign (e.g. binary interventions or continuous dosages), and 3) the length of treatment, affected by both the number and duration of interventions. The R-MSN naturally encapsulates this by using multi-input/output RNNs, which can be configured to have multiple treatments and targets of different forms (e.g. continuous or discrete). Different sequences of treatments can also be evaluated using the sequence-to-sequence architecture of the network.
Moreover, given the susceptibility of IPTWs to model misspecification, the R-MSN uses Long Short-Term Memory units (LSTMs) to compute the probabilities required for propensity weighting. Combining these aspects together, the R-MSN is able to help clinicians evaluate the projected outcome of a complex treatment scenario – providing timely clinical decision support and helping them customize a treatment regimen to the patient. An example of scenario analysis for different cancer treatment regimens is shown in Figure 1, with the expected response of tumor growth to no treatment, chemotherapy and radiotherapy shown.

Figure 1: Forecasting Tumor Growth Under Multiple Treatment Scenarios

2 Related Works

Given the diversity of literature on causal inference, we focus here on works associated with time-dependent treatment responses and deep learning, with a wider survey in Appendix A.

G-computation and Structural Models. Counterfactual inference under time-dependent confounding has been extensively studied in the epidemiology literature, particularly in the seminal works of Robins [30, 31, 16]. Methods in this area can be categorized into 3 groups: models based on the G-computation formula, structural nested mean models, and marginal structural models [8]. While all these models provide strong theoretical foundations on the adjustments for time-dependent confounding, their prediction models are typically based on linear or logistic regression. These models would be misspecified when either the outcomes or the treatment policy exhibit complex dependencies on the covariate history.

Potential Outcomes with Longitudinal Data. Bayesian nonparametric models have been proposed to estimate the effects of both single [32, 33, 43, 34] and joint treatment assignments [35] over time.
These methods use Gaussian processes (GPs) to model the baseline progression, which can estimate the treatment effects at multiple points in the future. However, some limitations do exist. Firstly, to aid in calibration, most Bayesian methods make strong assumptions on model structure - such as 1) independent baseline progression and treatment response components [43, 35], and 2) the lack of heterogeneous effects, by either omitting baseline covariates (e.g. genetic or demographic information) [34, 33] or incorporating them as linear components [43, 35]. Recurrent neural networks (RNNs) avoid the need for any explicit model specifications, with the networks learning these relationships directly from the data. Secondly, inference with Bayesian models can be computationally complex, making them difficult to scale. This arises from the use of Markov chain Monte Carlo sampling for g-computation, and the use of sparse GPs that have at least O(NM²) complexity, where N and M are the number of observations and inducing points respectively [39]. From this perspective, RNNs have the benefit of scalability and update their internal states with new observations as they arrive. Lastly, apart from [35], which we evaluate in Section 5, existing models do not consider treatment responses for combined interventions and multiple targets. This is handled naturally in our network by using multi-input/multi-output RNN architectures.

Deep Learning for Causal Inference. Deep learning has also been used to estimate individualized treatment effects for a single intervention at a fixed time, using instrumental variable approaches [13], generative adversarial networks [44] and multi-task architectures [3]. To the best of our knowledge, ours is the first deep learning method for time-dependent effects, and it establishes a framework to use existing RNN architectures for treatment response estimation.

3 Problem Definition

Let Yt,i = [Yt,i(1), . . .
, Yt,i(Ωy)] be a vector of Ωy observed outcomes for patient i at time t, At,i = [At,i(1), . . . , At,i(Ωa)] a vector of the actual treatments administered, Lt,i = [Lt,i(1), . . . , Lt,i(Ωl)] time-dependent covariates, and Xi = [Xi(1), . . . , Xi(Ωv)] patient-specific static features. For notational simplicity, we will omit the subscript i going forward unless explicitly required.

Treatment Responses Over Time Determining an individual's response to a prescribed treatment can be characterized as learning a function g(.) for the expected outcomes over a prediction horizon τ, given an intended course of treatment and past observations, i.e.:

E[Yt+τ | a(t, τ − 1), H̄t] = g(τ, a(t, τ − 1), H̄t)    (1)

where g(.) represents a generic, possibly non-linear, function, a(t, τ − 1) = (at, . . . , at+τ−1) is an intended sequence of treatments ak from the current time until just before the outcome is observed, and H̄t = (L̄t, Āt−1, X) is the patient's history, with covariates L̄t = (L1, . . . , Lt) and actions Āt−1 = (A1, . . . , At−1).

Inverse Probability of Treatment Weighting Inverse probability of treatment weighting has been extensively studied in marginal structural modeling to adjust for time-dependent confounding [22, 16, 15, 24, 26], with extensions to joint treatment assignments [19], censored observations [14] and continuous dosages [10]. We list the key results for our problem below, with a more thorough discussion in Appendix B.

The stabilized weights for joint treatment assignments [21] can be expressed as:

SW(t, τ) = ∏_{n=t}^{t+τ} f(An | Ān−1) / f(An | H̄n) = ∏_{n=t}^{t+τ} [∏_{k=1}^{Ωa} f(An(k) | Ān−1)] / [∏_{k=1}^{Ωa} f(An(k) | H̄n)]    (2)

where f(.)
is the probability mass function for discrete treatment applications, or the probability density function when continuous dosages are used [10]. We also note that H̄n contains both past treatments Ān−1 and potential confounders L̄n. To account for censoring, we use the additional stabilized weights below:

SW*(t, τ) = ∏_{n=t}^{t+τ} f(Cn = 0 | T > n, Ān−1) / f(Cn = 0 | T > n, L̄n−1, Ān−1, X)    (3)

where Cn = 1 denotes right censoring of the trajectory, and T is the time at which censoring occurs. We also adopt the additional steps for stabilization proposed in [42], truncating stabilized weights at their 1st and 99th percentile values, and normalizing weights by their mean for a fixed prediction horizon, i.e. S̃W = SWi(t, τ) / (∑_{i=1}^{I} ∑_{t=1}^{Ti} SWi(t, τ) / N), where I is the total number of patients, Ti is the length of the patient's trajectory and N the total number of observations. Stabilized weights are then used to weight the loss contributions of each training observation, expressed in squared-error terms below for continuous predictions:

e(i, t, τ) = S̃Wi(t, τ − 1) × S̃W*i(t, τ − 1) × ‖Yt+τ,i − g(τ, a(t, τ − 1), H̄t)‖²    (4)

4 Recurrent Marginal Structural Networks

An MSM can be subdivided into two submodels, one modeling the IPTWs and the other estimating the treatment response itself. Adopting this framework, we use two sets of deep neural networks to build a Recurrent Marginal Structural Network (R-MSN): 1) a set of propensity networks to compute treatment probabilities used for IPTW, and 2) a prediction network used to determine the treatment response for a given set of planned interventions.
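The weight bookkeeping of Equations 2–4 can be sketched in numpy as follows. This is a minimal sketch, assuming the per-step numerator and denominator propensities have already been estimated elsewhere (e.g. by the propensity networks of Section 4.1); the array values are hypothetical, and the censoring weight of Equation 3 is computed the same way from its own propensities:

```python
import numpy as np

def stabilized_weights(f_num, f_den, tau):
    """SW(t, tau) = prod_{n=t}^{t+tau} f(A_n | Abar_{n-1}) / f(A_n | Hbar_n).

    f_num, f_den: shape-(T,) per-step numerator/denominator propensities for
    one trajectory. Returns SW(t, tau) for every valid start time t.
    """
    ratio = f_num / f_den
    T = len(ratio)
    return np.array([np.prod(ratio[t:t + tau + 1]) for t in range(T - tau)])

def truncate_and_normalize(sw):
    """Stabilization steps adopted from [42]: truncate at the 1st/99th
    percentiles, then normalize by the mean for a fixed horizon."""
    lo, hi = np.percentile(sw, [1, 99])
    sw = np.clip(sw, lo, hi)
    return sw / sw.mean()

def weighted_error(sw_t, y_true, y_pred):
    """Per-observation loss contribution as in Eq. (4) (censoring weight
    omitted here for brevity)."""
    return sw_t * np.sum((y_true - y_pred) ** 2)

# Hypothetical per-step propensities for one patient trajectory.
f_num = np.array([0.5, 0.6, 0.55, 0.5, 0.6])
f_den = np.array([0.7, 0.4, 0.6, 0.5, 0.3])
sw = stabilized_weights(f_num, f_den, tau=1)
sw_tilde = truncate_and_normalize(sw)
err = weighted_error(sw_tilde[0], np.array([1.0]), np.array([0.8]))
```

Normalizing by the mean makes the weights average to one per horizon, so the weighted loss stays on the same scale as an unweighted one.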
Additional details on the algorithm can be found in Appendix E, with the source code uploaded onto GitHub1.

4.1 Propensity Networks

From Equations 2 and 3, we can see that 4 key probability functions are required to calculate the stabilized weights. In all instances, probabilities are conditioned on the history of past observations (Ān−1 and H̄n), making RNNs natural candidates to learn these functions. Each probability function is parameterized with a different LSTM – collectively referred to as propensity networks – with action probabilities f(Ān | .) generated jointly by a set of multi-target LSTMs, and censoring probabilities f(Cn = 0 | .) by single-output LSTMs. This also accounts for possible correlations between treatment assignments, for instance in treatment regimens where complementary drugs are prescribed together to combat different aspects of the same disease.

The flexibility of RNN architectures also allows for the modeling of treatment assignments with different forms. In simple cases with discrete treatment assignments, a standard LSTM with a sigmoid output layer can be used for binary treatment probabilities, or a softmax layer for categorical ones. More complex architectures, such as variational RNNs [6], can be used to compute probabilities when treatments map to continuous dosages. To calculate the binary probabilities in the experiments in Section 5, LSTMs were fitted with tanh state activations and sigmoid outputs.

4.2 Prediction Network

The prediction network focuses on forecasting the treatment response of a patient, with time-dependent confounding accounted for using IPTWs from the propensity networks. Although standard RNNs can be used for one-step-ahead forecasts, actual treatment plans can be considerably more complex, with varying durations and numbers of interventions depending on the condition of the patient.
To remove any restrictions on the prediction horizon or number of planned interventions, we propose the sequence-to-sequence architecture depicted in Figure 2. One key difference between our model and standard sequence-to-sequence models (e.g. [36]) is that the last unit of the encoder is also used in making predictions for the first time step, in addition to the decoder units at further horizons. This allows the R-MSN to use all available information in making predictions, including the covariates available at the current time step t. For the continuous predictions in Section 5, we used Exponential Linear Unit (ELU [7]) state activations and a linear output layer.

1 https://github.com/sjblim/rmsn_nips_2018

Figure 2: R-MSN Architecture for Multi-step Treatment Response Prediction

Encoder The goal of the encoder is to learn good representations for the patient's current clinical state, and we do so with a standard LSTM that makes one-step-ahead predictions of the outcome (Ŷt+1) given observations of covariates and actual treatments. At the current follow-up time t, the encoder is also used in forecasting the expected response at t + 1, as the latest covariate measurements Lt are available to be fed into the LSTM along with the first planned treatment assignment.

Decoder While multi-step prediction can be performed by recursively feeding outputs into the inputs at the next time step, this would require output predictions for all covariates, with a high degree of accuracy to reduce error propagation through the network. Given that often only a small subset of treatment outcomes are of interest, it would be desirable to forecast treatment responses on the basis of planned future actions alone. As such, the purpose of the decoder is to propagate the encoder representation forwards in time - using only the proposed treatment assignments and avoiding the need to forecast input covariates.
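This information flow - an encoder over the observed history, an adapter, and a decoder that consumes only planned actions - can be sketched minimally in numpy. Plain tanh recurrences stand in for the LSTM cells, all weights are untrained placeholders, and the encoder's own first-step prediction is omitted for brevity; dimensions and names are assumptions for illustration only:

```python
import numpy as np

rng = np.random.default_rng(0)
d_cov, d_act, d_enc, d_dec = 4, 2, 8, 6   # placeholder dimensions

# Placeholder (untrained) parameters; a real R-MSN trains LSTM weights.
W_enc = rng.normal(scale=0.1, size=(d_enc, d_enc + d_cov + d_act))
W_ada = rng.normal(scale=0.1, size=(d_dec, d_enc))   # memory adapter
W_dec = rng.normal(scale=0.1, size=(d_dec, d_dec + d_act))
w_out = rng.normal(scale=0.1, size=d_dec)

def elu(x):
    return np.where(x > 0, x, np.exp(x) - 1.0)

def forecast(covariates, past_actions, planned_actions):
    # Encoder: consumes (covariate, action) pairs over the observed history.
    h = np.zeros(d_enc)
    for l_t, a_t in zip(covariates, past_actions):
        h = np.tanh(W_enc @ np.concatenate([h, l_t, a_t]))
    # Memory adapter: ELU layer mapping the encoder state to the decoder's
    # initial state, allowing the two state sizes to differ.
    z = elu(W_ada @ h)
    # Decoder: consumes only planned treatments -- no future covariates.
    preds = []
    for a_t in planned_actions:
        z = np.tanh(W_dec @ np.concatenate([z, a_t]))
        preds.append(w_out @ z)
    return np.array(preds)

covs = rng.normal(size=(5, d_cov))     # 5 observed follow-up visits
acts = rng.normal(size=(5, d_act))
plan = rng.normal(size=(3, d_act))     # horizon tracks the treatment plan
y_hat = forecast(covs, acts, plan)     # one prediction per planned step
```

Note how the prediction horizon is set purely by the length of `planned_actions`, mirroring how the R-MSN expands or contracts its decoder units at run time.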
This is achieved by training another LSTM that accepts only actions as inputs, but initializing the internal memory state of the first LSTM in the decoder sequence (zt) using encoder representations. To allow for different state sizes in the encoder and decoder, encoder internal states (ht) are passed through a single network layer with ELU activations, i.e. the memory adapter, before initializing the decoder. As the network is made up of LSTM units, the internal states here refer to the concatenation of the cell and hidden states [17] of the LSTM.

4.3 Training Procedure

The training procedure for R-MSNs can be subdivided into the 3 training steps shown in Figure 3 - starting with the propensity networks, followed by the encoder, and ending with the decoder.

(a) Step 1: Propensity Network Training (b) Step 2: Encoder Training (c) Step 3: Decoder Training

Figure 3: Training Procedure for R-MSNs

Step 1: Propensity Network Training From Figure 3(a), each propensity network is first trained to estimate the probability of the treatment assigned at each time step, and these are combined to compute SW(t, 0) and SW*(t, 0) at each time step. Stabilized weights for longer horizons can then be obtained from their cumulative product, i.e. SW(t, τ) = ∏_{j=0}^{τ} SW(t + j, 0). For tests in Section 5, propensity networks were trained using standard binary cross-entropy loss, with treatment assignments and censoring treated as binary observations.

Step 2: Encoder Training Next, decoder and encoder training was divided into separate steps - accelerating learning by first training the encoder to learn representations of the patient's clinical state, and then using the decoder to extrapolate them according to the intended treatment plan.
As such, the encoder was trained to forecast the standard one-step-ahead treatment response according to the structure in Figure 3(b), using all available information on treatments and covariates until the current time step. Upon completion, the encoder was used to perform a feed-forward pass over the training and validation data, extracting the internal states ht for the final training step. As tests in Section 5 were performed for continuous outcomes, we express the loss function for the encoder as a weighted mean-squared error loss (Lencoder in Equation 5), although we note that this approach is compatible with other loss functions, e.g. cross-entropy for discrete outcomes.

Step 3: Decoder Training Finally, the decoder and memory adapter were trained together based on the format in Figure 3(c). For a given patient, observations were batched into shorter sequences of up to τmax steps, such that each sequence commencing at time t is made up of [ht, {At+1, . . . , At+τmax−1}, {Yt+2, . . . , Yt+τmax}]. These were compiled for all patient-times and randomly grouped into minibatches to be used for backpropagation through time. For continuous predictions, the loss function for the decoder (Ldecoder) can also be found in Equation 5.

Lencoder = ∑_{i=1}^{I} ∑_{t=1}^{Ti} e(i, t, 1)    Ldecoder = ∑_{i=1}^{I} ∑_{t=1}^{Ti} ∑_{τ=2}^{min(Ti−t, τmax)} e(i, t, τ)    (5)

5 Experiments With Cancer Growth Simulation Model

5.1 Simulation Details

As confounding effects in real-world datasets are unknown a priori, methods for treatment response estimation are often evaluated using data simulations, where treatment application policies are explicitly modeled [34, 33, 35].
To ensure that our tests are fully reproducible and realistic from a medical perspective, we adopt the pharmacokinetic-pharmacodynamic (PK-PD) model of [12] - the state-of-the-art in treatment response modeling for non-small cell lung cancer patients. The model features key characteristics present in actual lung cancer treatments, such as the combined effects of chemo- and radiotherapy, cell repopulation after treatment, death/recovery of patients, and different starting distributions of tumor sizes based on the stage of cancer at diagnosis. On the whole, PK-PD models allow clinicians to explore hypotheses around dose-response relationships and propose optimal treatment schedules [5, 29, 11, 9, 1]. While we refer readers to [12] for the finer details of the model, such as the specific priors used, we examine the overall structure of the model below to illustrate treatment-response relationships and how time-dependent confounding is introduced.

PK-PD Model for Tumor Dynamics We use a discrete-time model for tumor volume V(t), where t is the number of days since diagnosis:

V(t) = V(t − 1) × (1 + ρ log(K / V(t − 1)) [tumor growth] − βc C(t) [chemotherapy] − (α d(t) + β d(t)²) [radiation]) + et [noise]    (6)

where ρ, K, βc, α, β are model parameters sampled for each patient according to the prior distributions in [12]. A Gaussian noise term et ∼ N(0, 0.01²) was added to account for randomness in the growth of the tumor.
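One update of the tumor dynamics in Eq. (6) can be sketched as follows; d(t) denotes the radiation dose and C(t) the drug concentration (here assumed to decay with a one-day half-life), and the parameter values are illustrative assumptions rather than the per-patient priors of [12]:

```python
import numpy as np

def tumor_volume_step(v_prev, chemo_conc, radio_dose, params, noise=0.0):
    """One step of the discrete-time PK-PD model of Eq. (6): growth,
    chemotherapy and radiation terms, all multiplicative in V(t-1)."""
    rho, K, beta_c, alpha, beta = params
    return v_prev * (1.0 + rho * np.log(K / v_prev)
                     - beta_c * chemo_conc
                     - (alpha * radio_dose + beta * radio_dose ** 2)) + noise

def drug_concentration_step(c_prev, new_dose):
    """Drug concentration update: exponential decay with a one-day half-life,
    plus any newly applied dose."""
    return new_dose + c_prev / 2.0

# Illustrative parameter values only (rho, K, beta_c, alpha, beta).
params = (0.1, 1150.0, 0.03, 0.05, 0.005)
v = 50.0
c = drug_concentration_step(0.0, 5.0)   # a 5.0 dose applied today
v_next = tumor_volume_step(v, chemo_conc=c, radio_dose=2.0, params=params)
```

With no treatment, the logistic-style growth term ρ log(K/V) keeps the tumor growing towards the carrying capacity K; the treatment terms subtract from that growth rate.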
d(t) is the dose of radiation applied at t, while drug concentration C(t) is modeled according to an exponential decay with a half-life of 1 day, i.e.:

C(t) = C̃(t) + C(t − 1)/2    (7)

where C̃(t) is a new continuous dose of chemotherapy drugs applied at time t. To account for heterogeneous effects, we added static features to the simulation model by randomly subclassing patients into 3 different groups, with each patient having a group label Si ∈ {1, 2, 3}. This represents specific characteristics which affect the patient's response to chemotherapy and radiotherapy (e.g. via genetic factors [4]), and which augment the prior means of βc and α according to:

μ′βc(i) = 1.1 μβc if Si = 3, and μβc otherwise;    μ′α(i) = 1.1 μα if Si = 1, and μα otherwise    (8)

where μ∗ are the mean parameters of [12], and μ′∗(i) those used to simulate patient i. We note that the value of β is set in relation to α, i.e.
α/β = 10, and it would also be adjusted accordingly by Si.

Censoring Mechanisms Patient censoring is incorporated by modeling 1) death when tumor diameters reach Dmax = 13 cm (or a volume of Vmax = 1150 cm³, assuming perfectly spherical tumors), 2) recovery, determined by a Bernoulli process with recovery probability pt = exp(−Vt), and 3) termination of observations after 60 days (administrative censoring).

Treatment Assignment Policy To introduce time-dependent confounders, we assume that chemotherapy prescriptions Ac(t) ∈ {0, 1} and radiotherapy prescriptions Ad(t) ∈ {0, 1} are Bernoulli random variables, with probabilities pc(t) and pd(t) respectively that are functions of the tumor diameter:

pc(t) = σ((γc/Dmax)(D̄(t) − θc))    pd(t) = σ((γd/Dmax)(D̄(t) − θd))    (9)

where D̄(t) is the average tumor diameter over the last 15 days, σ(.) is the sigmoid activation function, and θ∗ and γ∗ are constant parameters. θ∗ is fixed such that θc = θd = Dmax/2, giving a 0.5 probability of treatment application when the tumor is half its maximum size. When treatments are applied, i.e. Ac(t) or Ad(t) is 1, chemotherapy is assumed to be administered in 5.0 mg/m³ doses of Vinblastine, and radiotherapy in 2.0 Gy fractions. γ also controls the degree of time-dependent confounding - starting with no confounding at γ = 0, as treatment assignments are then independent of the response variable, and increasing as γ becomes larger.

5.2 Benchmarks

We evaluate the performance of R-MSNs against MSMs and Bayesian nonparametric models, focusing on their effectiveness in estimating unbiased treatment responses and their multi-step prediction performance.
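The assignment policy of Eq. (9) amounts to the following sketch; the diameter trajectory is hypothetical, and setting γ = 0 recovers the unconfounded 0.5 assignment probability:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

D_MAX = 13.0  # maximum tumor diameter (cm) from the censoring model

def treatment_probability(diam_history, gamma, theta=D_MAX / 2.0, window=15):
    """Eq. (9): p(t) = sigmoid((gamma / D_max) * (Dbar(t) - theta)),
    with Dbar(t) the average diameter over the last `window` days."""
    d_bar = np.mean(diam_history[-window:])
    return sigmoid((gamma / D_MAX) * (d_bar - theta))

diams = np.linspace(3.0, 9.0, 20)   # hypothetical diameter trajectory (cm)
p_confounded = treatment_probability(diams, gamma=10.0)
p_random = treatment_probability(diams, gamma=0.0)   # no confounding
```

Because the probability depends on the recent average diameter, which past treatments have themselves shrunk, larger γ couples assignment more tightly to the outcome history and strengthens the time-dependent confounding.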
An overview of the models tested is summarized below:

Standard Marginal Structural Models (MSM) For the MSMs used in our investigations, we adopt similar approximations to [19, 14], encoding historical actions via the cumulative sum of applied treatments, e.g. cum(āc(t − 1)) = ∑_{k=1}^{t−1} ac(k), and covariate history using the previous observed value V(t − 1). The exact forms of the propensity and prediction models are in Appendix D.

Bayesian Treatment Response Curves (BTRC) We also benchmark our performance against the model of [35] - the state-of-the-art in forecasting multi-step treatment responses for joint therapies with multiple outcomes. Given that the simulation model only has one target outcome, we also consider a simpler variant of the model without "shared" components, denoting this as the reduced BTRC (R-BTRC) model. This reduced parametrization was found to improve convergence during training, and additional details on calibration can be found in Appendix G.

Recurrent Marginal Structural Networks (R-MSN) R-MSNs were designed according to the description in Section 4, with full details on training and hyperparameters in Appendix F. To evaluate the effectiveness of the propensity networks, we also trained prediction networks using the IPTWs from the MSM, including this as an additional benchmark in Section 5.3 (Seq2Seq + Logistic).

5.3 Performance Evaluations

Time-Dependent Confounding Adjustments To investigate how well models learn unbiased treatment responses from observational data, we trained all models on simulations with γc = γd = 10 (biased policy) and examine the root-mean-squared errors (RMSEs) of one-step-ahead predictions as γ∗ is reduced. Both γ∗ parameters were set to be equal in this section for simplicity, i.e.
γc = γd = γ. Using the simulation model in Section 5.1, we simulated 10,000 paths to be used for model training, 1,000 for validation data used in hyperparameter optimization, and another 1,000 for out-of-sample testing. For linear and MSM models, which do not have hyperparameters to optimize, we combined both training and validation datasets for model calibration.

Figure 4: Normalized RMSEs for One-Step-Ahead Predictions

Figure 4 shows the RMSE values of various models at different values of γ, with RMSEs normalized by Vmax and reported in percentage terms. Here, we focus on the main comparisons of interest – 1) linear models, to provide a baseline on performance, 2) linear vs MSMs, to evaluate traditional methods for IPTWs, 3) Seq2Seq + logistic IPTWs vs MSMs, for the benefits of the Seq2Seq model, 4) R-MSN vs Seq2Seq + logistic, to determine the improvements of our model and RNN-estimated IPTWs, and 5) BTRC/R-BTRC, to benchmark against state-of-the-art methods. Additional results are also documented in Appendix C for reference.

From the graph, R-MSNs displayed the lowest RMSEs across all values of γ, decreasing slightly from a normalized RMSE of 1.02% at γ = 10 to 0.92% at γ = 0. Focusing on RMSEs at γ = 0, R-MSNs improve on MSMs by 80.9% and on R-BTRCs by 66.1%, demonstrating their effectiveness in learning unbiased treatment responses from confounded data. The propensity networks also improve unbiased treatment estimates by 78.7% (R-MSN vs. Seq2Seq + Logistic), indicating the benefits of more flexible models for IPTW estimation. While the IPTWs of MSMs do provide small gains for linear models, linear models still exhibit the largest unbiased RMSE across all benchmarks - highlighting the limitations of linear models in estimating complex treatment responses. Bayesian models also
Bayesian models also\nperform consistently across \u03b3, with normalized RMSEs for R-BTRC decreasing from 2.09% to 1.91%\nacross \u03b3 = 0 to 10, but were also observed to slightly underperform linear models on the training\ndata itself. Part of this can potentially be attributed to model misspeci\ufb01cation in the BTRC, which\nassumes that treatment responses are linear time-invariant and independent of the baseline progression.\nThe differences in modeling assumptions can be seen from Equation 6, where chemotherapy and\nradiotherapy contributions are modeled as multiplicative with V (t). This highlights the bene\ufb01ts of\nthe data-driven nature of the R-MSN, which can \ufb02exibly learn treatment response models of different\ntypes.\nMulti-step Prediction Performance To evaluate the bene\ufb01ts of the sequence-to-sequence archi-\ntecture, we report the normalized RMSEs for multi-step prediction in Table 1, using the best model of\neach category (R-MSN, MSM and R-BTRC). Once again, the R-MSN outperforms benchmarks for\nall timesteps, beating MSMs by 61% on the training policy and 95% for the unbiased one. While the\nR-BTRC does show improvements over MSMs for the unbiased treatment response, we also observe\na slight underperformance versus MSMs on the training policy itself, highlighting the advantages of\nR-MSNs.\n\n6 Conclusions\n\nThis paper introduces Recurrent Marginal Structural Networks - a novel learning approach for\npredicting unbiased treatment responses over time, grounded in the framework of marginal structural\nmodels. Networks are subdivided into two parts, a set of propensity networks to accurately compute\nthe IPTWs, and a sequence-to-sequence architecture to predict responses using only a planned\nsequence of future actions. 
Using tests on a medically realistic simulation model, the R-MSN demonstrated performance improvements over both traditional methods in epidemiology and state-of-the-art models for joint treatment response prediction over multiple timesteps.

Table 1: Normalized RMSE for Various Prediction Horizons τ

                                            τ=1     τ=2     τ=3     τ=4     τ=5     Ave. % Decrease in RMSE vs MSMs
Training Policy           MSM      1.67%   2.51%   3.12%   3.64%   4.09%   -
(γc = 10, γd = 10)        R-BTRC   2.09%   2.85%   3.50%   4.07%   4.58%   -32% (↑ RMSE)
                          R-MSN    1.02%   1.80%   1.90%   2.11%   2.46%   +61%
Unbiased Assignment       MSM      4.84%   5.29%   5.51%   5.65%   5.84%   -
(γc = 0, γd = 0)          R-BTRC   1.91%   2.74%   3.34%   3.75%   4.08%   +66%
                          R-MSN    0.92%   1.38%   1.30%   1.22%   1.14%   +95%
Unbiased Radiotherapy     MSM      3.85%   4.03%   4.32%   4.60%   4.91%   -
(γc = 10, γd = 0)         R-BTRC   1.74%   1.68%   2.14%   2.54%   2.91%   +74%
                          R-MSN    1.08%   1.66%   1.83%   1.98%   2.14%   +84%
Unbiased Chemotherapy     MSM      1.84%   2.65%   3.09%   3.44%   3.83%   -
(γc = 0, γd = 10)         R-BTRC   1.16%   2.45%   2.97%   3.34%   3.64%   +20%
                          R-MSN    0.65%   1.13%   1.05%   1.17%   1.31%   +87%

Acknowledgments

This research was supported by the Oxford-Man Institute of Quantitative Finance, the US Office of Naval Research (ONR), and the Alan Turing Institute.

References

[1] Optimizing drug regimens in cancer chemotherapy: a simulation study using a PK–PD model. Computers in Biology and Medicine, 31(3):157–172, 2001. Goal-Oriented Model-Based Drug Regimens.

[2] Ahmed M. Alaa and Mihaela van der Schaar. Bayesian inference of individualized treatment effects using multi-task Gaussian processes. In Proceedings of the Thirty-first Conference on Neural Information Processing Systems (NIPS), 2017.

[3] Ahmed M. Alaa, Michael Weisz, and Mihaela van der Schaar. Deep counterfactual networks with propensity dropout. In Proceedings of the 34th International Conference on Machine Learning (ICML), 2017.

[4] H. Bartsch, H.
Dally, O. Popanda, A. Risch, and P. Schmezer. Genetic risk profiles for cancer susceptibility and therapy response. Recent Results Cancer Res., 174:19–36, 2007.

[5] Letizia Carrara, Silvia Maria Lavezzi, Elisa Borella, Giuseppe De Nicolao, Paolo Magni, and Italo Poggesi. Current mathematical models for cancer drug discovery. Expert Opinion on Drug Discovery, 12(8):785–799, 2017.

[6] Junyoung Chung, Kyle Kastner, Laurent Dinh, Kratarth Goel, Aaron Courville, and Yoshua Bengio. A recurrent latent variable model for sequential data. In Proceedings of the 28th International Conference on Neural Information Processing Systems - Volume 2, NIPS'15, pages 2980–2988, Cambridge, MA, USA, 2015.

[7] Djork-Arné Clevert, Thomas Unterthiner, and Sepp Hochreiter. Fast and accurate deep network learning by exponential linear units (ELUs). CoRR, abs/1511.07289, 2015.

[8] R. M. Daniel, S. N. Cousens, B. L. De Stavola, M. G. Kenward, and J. A. C. Sterne. Methods for dealing with time-dependent confounding. Statistics in Medicine, 32(9):1584–1618, 2012.

[9] D. R. Mould, A.-C. Walz, T. Lave, J. P. Gibbs, and B. Frame. Developing exposure/response models for anticancer drug treatment: Special considerations. CPT: Pharmacometrics & Systems Pharmacology, 4(1):12–27.

[10] Peter H. Egger and Maximilian von Ehrlich. Generalized propensity scores for multiple continuous treatment variables. Economics Letters, 119(1):32–34, 2013.

[11] M. J. Eigenmann, N. Frances, T. Lavé, and A.-C. Walz. PKPD modeling of acquired resistance to anti-cancer drug treatment. Journal of Pharmacokinetics and Pharmacodynamics, 44(6):617–630, 2017.

[12] Changran Geng, Harald Paganetti, and Clemens Grassberger.
Prediction of treatment response for combined chemo- and radiation therapy for non-small cell lung cancer patients using a bio-mathematical model. Scientific Reports, 7, 2017.

[13] Jason Hartford, Greg Lewis, Kevin Leyton-Brown, and Matt Taddy. Deep IV: A flexible approach for counterfactual prediction. In Proceedings of the 34th International Conference on Machine Learning (ICML), 2017.

[14] Miguel A. Hernán, Babette Brumback, and James M. Robins. Marginal structural models to estimate the joint causal effect of nonrandomized treatments. Journal of the American Statistical Association, 96(454):440–448, 2001.

[15] Miguel A. Hernán, Babette Brumback, and James M. Robins. Marginal structural models to estimate the causal effect of zidovudine on the survival of HIV-positive men. Epidemiology, 11(5):561–570, 2000.

[16] Miguel A. Hernán and James M. Robins. Causal Inference. Chapman & Hall/CRC, 2018.

[17] Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural Comput., 9(8):1735–1780, November 1997.

[18] William Hoiles and Mihaela van der Schaar. A non-parametric learning method for confidently estimating patient's clinical state and dynamics. In Proceedings of the Twenty-ninth Conference on Neural Information Processing Systems (NIPS), 2016.

[19] Chanelle J. Howe, Stephen R. Cole, Shruti H. Mehta, and Gregory D. Kirk. Estimating the effects of multiple time-varying exposures using joint marginal structural models: alcohol consumption, injection drug use, and HIV acquisition. Epidemiology, 23(4):574–582, 2012.

[20] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In International Conference on Learning Representations (ICLR), 2015.

[21] Clovis Lusivika-Nzinga, Hana Selinger-Leneman, Sophie Grabar, Dominique Costagliola, and Fabrice Carrat. Performance of the marginal structural Cox model for estimating individual and joined effects of treatments given in combination.
BMC Medical Research Methodology, 17(1):160, Dec 2017.

[22] Felix Thoemmes and Anthony D. Ong. A primer on inverse probability of treatment weighting and marginal structural models. Emerging Adulthood, 4(1):40–59, 2016.

[23] Mohammad Ali Mansournia, Mahyar Etminan, Goodarz Danaei, Jay S. Kaufman, and Gary Collins. Handling time varying confounding in observational research. BMJ, 359, 2017.

[24] Mohammad Ali Mansournia, Goodarz Danaei, Mohammad Hossein Forouzanfar, Mahmood Mahmoodi, Mohsen Jamali, Nasrin Mansournia, and Kazem Mohammad. Effect of physical activity on functional performance and knee pain in patients with osteoarthritis: Analysis with marginal structural models. Epidemiology, 23(4):631–640, 2012.

[25] Charles E. McCulloch and Shayle R. Searle. Generalized, Linear and Mixed Models. Wiley, New York, 2001.

[26] Kathleen M. Mortimer, Romain Neugebauer, Mark van der Laan, and Ira B. Tager. An application of model-fitting procedures for marginal structural models. American Journal of Epidemiology, 162(4):382–388, 2005.

[27] Onur Atan, James Jordon, and Mihaela van der Schaar. Deep-Treat: Learning optimal personalized treatments from observational data using neural networks. In AAAI, 2018.

[28] Onur Atan, William Zame, and Mihaela van der Schaar. Constructing effective personalized policies using counterfactual inference from biased data sets with many features. Machine Learning, 2018.

[29] Kyungsoo Park. A review of modeling approaches to predict drug response in clinical oncology. Yonsei Medical Journal, 58(1):1–8, 2017.

[30] Thomas S. Richardson and Andrea Rotnitzky. Causal etiology of the research of James M. Robins. Statistical Science, 29(4):459–484, 2014.

[31] James M. Robins, Miguel Ángel Hernán, and Babette Brumback. Marginal structural models and causal inference in epidemiology.
Epidemiology, 11(5):550–560, 2000.

[32] Jason Roy, Kirsten J. Lum, and Michael J. Daniels. A Bayesian nonparametric approach to marginal structural models for point treatments and a continuous or survival outcome. Biostatistics, 18(1):32–47, 2017.

[33] Peter Schulam and Suchi Saria. Reliable decision support using counterfactual models. In Proceedings of the Thirty-first Conference on Neural Information Processing Systems (NIPS), 2017.

[34] Ricardo Silva. Observational-interventional priors for dose-response learning. In Proceedings of the Thirtieth Conference on Neural Information Processing Systems (NIPS), 2016.

[35] Hossein Soleimani, Adarsh Subbaswamy, and Suchi Saria. Treatment-response models for counterfactual reasoning with continuous-time, continuous-valued interventions. In Proceedings of the Thirty-Third Conference on Uncertainty in Artificial Intelligence (UAI), 2017.

[36] Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. Sequence to sequence learning with neural networks. In Proceedings of the Twenty-seventh Conference on Neural Information Processing Systems (NIPS), 2014.

[37] Adith Swaminathan and Thorsten Joachims. Batch learning from logged bandit feedback through counterfactual risk minimization. Journal of Machine Learning Research, 16:1731–1755, 2015.

[38] Adith Swaminathan and Thorsten Joachims. Counterfactual risk minimization: Learning from logged bandit feedback. CoRR, abs/1502.02362, 2015.

[39] Michalis K. Titsias. Variational learning of inducing variables in sparse Gaussian processes. In Proceedings of the Twelfth International Conference on Artificial Intelligence and Statistics (AISTATS), 2009.

[40] Panagiotis J. Vlachostergios and Bishoy M. Faltas. Treatment resistance in urothelial carcinoma: an evolutionary perspective. Nature Reviews Clinical Oncology, 2018.

[41] Stefan Wager and Susan Athey.
Estimation and inference of heterogeneous treatment effects using random forests. Journal of the American Statistical Association, 2017.

[42] Yongling Xiao, Michal Abrahamowicz, and Erica Moodie. Accuracy of conventional and marginal structural Cox model estimators: A simulation study. The International Journal of Biostatistics, 6(2), 2010.

[43] Yanbo Xu, Yanxun Xu, and Suchi Saria. A non-parametric Bayesian approach for estimating treatment-response curves from sparse time series. In Proceedings of the 1st Machine Learning for Healthcare Conference (MLHC), 2016.

[44] Jinsung Yoon, James Jordon, and Mihaela van der Schaar. GANITE: Estimation of individualized treatment effects using generative adversarial nets. In International Conference on Learning Representations (ICLR), 2018.