{"title": "Robust and Efficient Transfer Learning with Hidden Parameter Markov Decision Processes", "book": "Advances in Neural Information Processing Systems", "page_first": 6250, "page_last": 6261, "abstract": "We introduce a new formulation of the Hidden Parameter Markov Decision Process (HiP-MDP), a framework for modeling families of related tasks using low-dimensional latent embeddings. Our new framework correctly models the joint uncertainty in the latent parameters and the state space. We also replace the original Gaussian Process-based model with a Bayesian Neural Network, enabling more scalable inference. Thus, we expand the scope of the HiP-MDP to applications with higher dimensions and more complex dynamics.", "full_text": "Robust and Ef\ufb01cient Transfer Learning with Hidden\n\nParameter Markov Decision Processes\n\nTaylor Killian\u2217\n\ntaylorkillian@g.harvard.edu\n\nHarvard University\n\nGeorge Konidaris\ngdk@cs.brown.edu\n\nBrown University\n\nAbstract\n\nSamuel Daulton\u2217\n\nsdaulton@g.harvard.edu\nHarvard University, Facebook\u2020\n\nFinale Doshi-Velez\n\nfinale@seas.harvard.edu\n\nHarvard University\n\nWe introduce a new formulation of the Hidden Parameter Markov Decision Pro-\ncess (HiP-MDP), a framework for modeling families of related tasks using low-\ndimensional latent embeddings. Our new framework correctly models the joint\nuncertainty in the latent parameters and the state space. We also replace the original\nGaussian Process-based model with a Bayesian Neural Network, enabling more\nscalable inference. Thus, we expand the scope of the HiP-MDP to applications\nwith higher dimensions and more complex dynamics.\n\n1\n\nIntroduction\n\nThe world is \ufb01lled with families of tasks with similar, but not identical, dynamics. For example,\nconsider the task of training a robot to swing a bat with unknown length l and mass m. The task is a\nmember of a family of bat-swinging tasks. 
If a robot has already learned to swing several bats with various lengths and masses {(l_i, m_i)}_{i=1}^N, then the robot should learn to swing a new bat with length l′ and mass m′ more efficiently than learning from scratch. That is, it is grossly inefficient to develop a control policy from scratch each time a unique task is encountered.

The Hidden Parameter Markov Decision Process (HiP-MDP) [14] was developed to address this type of transfer learning, where optimal policies are adapted to subtle variations within tasks in an efficient and robust manner. Specifically, the HiP-MDP paradigm introduced a low-dimensional latent task parameterization w_b that, combined with a state and action, completely describes the system's dynamics T(s′|s, a, w_b). However, the original formulation did not account for nonlinear interactions between the latent parameterization and the state space when approximating these dynamics, which required all states to be visited during training. In addition, the original framework scaled poorly because it used Gaussian Processes (GPs) as basis functions for approximating the task's dynamics.

We present a new HiP-MDP formulation that models interactions between the latent parameters w_b and the state s when transitioning to state s′ after taking action a. We do so by including the latent parameters w_b, the state s, and the action a as input to a Bayesian Neural Network (BNN). The BNN both learns the common transition dynamics for a family of tasks and models how the unique variations of a particular instance impact the instance's overall dynamics. Embedding the latent parameters in this way allows for more accurate uncertainty estimation and more robust transfer when learning a control policy for a new and possibly unique task instance. 
Our formulation also inherits several desirable properties of BNNs: it can model multimodal and heteroskedastic transition functions, inference scales to data large in both dimension and number of samples, and all output dimensions are jointly modeled, which reduces computation and increases predictive accuracy [11]. Herein, a BNN can capture complex dynamical systems with highly non-linear interactions between state dimensions. Furthermore, model uncertainty is easily quantified through the BNN's output variance. Thus, we can scale to larger domains than previously possible.

We use the improved HiP-MDP formulation to develop control policies for acting in a simple two-dimensional navigation domain, playing acrobot [42], and designing treatment plans for simulated patients with HIV [15]. The HiP-MDP rapidly determines the dynamics of new instances, enabling us to quickly find near-optimal instance-specific control policies.

*Both contributed equally as primary authors
†Current affiliation, joined afterward

31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.

2 Background

Model-based reinforcement learning   We consider reinforcement learning (RL) problems in which an agent acts in a continuous state space S ⊆ R^D and a discrete action space A. We assume that the environment has some true transition dynamics T(s′|s, a), unknown to the agent, and we are given a reward function R(s, a) : S × A → R that provides the utility of taking action a from state s. 
In the model-based reinforcement learning setting, our goal is to learn an approximate transition function T̂(s′|s, a) based on observed transitions (s, a, s′) and then use T̂(s′|s, a) to learn a policy a = π(s) that maximizes the long-term expected reward E[Σ_t γ^t r_t], where γ ∈ (0, 1] governs the relative importance of immediate and future rewards.

HiP-MDPs   A HiP-MDP [14] describes a family of Markov Decision Processes (MDPs) and is defined by the tuple {S, A, W, T, R, γ, P_W}, where S is the set of states s, A is the set of actions a, and R is the reward function. The transition dynamics T(s′|s, a, w_b) for each task instance b depend on the value of the hidden parameters w_b ∈ W; for each instance, the parameters w_b are drawn from prior P_W. The HiP-MDP framework assumes that a finite-dimensional array of hidden parameters w_b can fully specify variations among the true task dynamics. It also assumes the system dynamics are invariant during a task and that the agent is signaled when one task ends and another begins.

Bayesian Neural Networks   A Bayesian Neural Network (BNN) is a neural network, f(·, ·; W), in which the parameters W are random variables with some prior P(W) [27]. We place independent Gaussian priors on each parameter: P(W) = Π_{w ∈ W} N(w; μ, σ²). Exact Bayesian inference for the posterior over parameters P(W|{(s′, s, a)}) is intractable, but several recent techniques have been developed to scale inference in BNNs [4, 17, 22, 33]. As probabilistic models, BNNs reduce the tendency of neural networks to overfit in the presence of low amounts of data, just as GPs do. In general, training a BNN is more computationally efficient than training a GP [22], while still providing coherent uncertainty measurements. Specifically, predictive distributions can be calculated by taking averages over samples of W from an approximate posterior distribution over the parameters. As such, BNNs are being adopted in the estimation of stochastic dynamical systems [11, 18].

3 A HiP-MDP with Joint Uncertainty

The original HiP-MDP transition function models variation across task instances as:³

    s′_d ≈ Σ_{k=1}^K w_bk T̂^(GP)_kad(s) + ε,    w_bk ∼ N(μ_wk, σ²_w),    ε ∼ N(0, σ²_ad),    (1)

where s_d is the dth dimension of s. Each basis transition function T̂_kad (indexed by the kth latent parameter, the action a, and the dimension d) is a GP using only s as input, linearly combined with instance-specific weights w_bk. Inference involves learning the parameters for the GP basis functions and the weights for each instance. GPs can robustly approximate stochastic state transitions in continuous dynamical systems in model-based reinforcement learning [9, 35, 36]. GPs have also been widely used in transfer learning outside of RL (e.g. [5]).

³We present a simplified version that omits their filtering variables z_kad ∈ {0, 1} to make the parallels between our formulation and the original more explicit; our simplification does not change any key properties.

While this formulation is expressive, it has limitations. The primary limitation is that the uncertainty in the latent parameters w_bk is modeled independently of the agent's state uncertainty. Hence, the model does not account for interactions between the latent parameterization w_b and the state s. As a result, Doshi-Velez and Konidaris [14] required that each task instance b performed the same set of state-action combinations (s, a) during training. 
While such training may sometimes be possible (e.g. robots that can be driven to identical positions), it is onerous at best and impossible for other systems such as human patients. The secondary limitation is that each output dimension s_d is modeled separately as a collection of GP basis functions {T̂_kad}_{k=1}^K. The basis functions for output dimension s_d are independent of the basis functions for output dimension s_d′, for d ≠ d′. Hence, the model does not account for correlation between output dimensions. Modeling such correlations typically requires knowledge of how dimensions interact in the approximated dynamical system [2, 19]. We choose not to constrain the HiP-MDP with such a priori knowledge, since the aim is to provide basis functions that can ascertain these relationships through observed transitions.

To overcome these limitations, we include the instance-specific weights w_b as input to the transition function and model all dimensions of the output jointly:

    s′ ≈ T̂^(BNN)(s, a, w_b) + ε,    w_b ∼ N(μ_w, Σ_b),    ε ∼ N(0, σ²_n).    (2)

This critical modeling change eliminates all of the above limitations: we can learn directly from data as observed (which is abundant in many industrial and health domains) and no longer require a highly constrained training procedure. We can also capture the correlations in the outputs of these domains, which occur in many natural processes.

Finally, the computational demands of using GPs as the transition function limited the application of the original HiP-MDP formulation to relatively small domains. In the following, we use a BNN rather than a GP to model this transition function. The computational requirements needed to learn a GP-based transition function make a direct comparison to our new BNN-based formulation infeasible within our experiments (Section 5). 
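To make the embedded-latent transition model of equation (2) concrete, the sketch below implements a toy version in NumPy: a single-hidden-layer network whose weights carry independent Gaussian uncertainty, taking the concatenation of state, one-hot action, and latent embedding as its joint input. The layer sizes, the action encoding, and the untrained diagonal-Gaussian weight distribution are illustrative assumptions, not the paper's architecture or inference scheme.

```python
import numpy as np

rng = np.random.default_rng(0)

class ToyBNNTransition:
    """Sketch of s' ~ T^(BNN)(s, a, w_b): one hidden layer whose weights have
    an independent Gaussian (mean, log-std) per parameter. Predictions average
    over weight samples, yielding a predictive mean and an uncertainty estimate."""

    def __init__(self, state_dim, n_actions, latent_dim, hidden=32):
        in_dim = state_dim + n_actions + latent_dim  # joint input: s, one-hot a, w_b
        self.shapes = [(in_dim, hidden), (hidden,), (hidden, state_dim), (state_dim,)]
        self.mu = [0.1 * rng.standard_normal(shape) for shape in self.shapes]
        self.log_sigma = [np.full(shape, -2.0) for shape in self.shapes]

    def _forward(self, x, params):
        W1, b1, W2, b2 = params
        return np.tanh(x @ W1 + b1) @ W2 + b2

    def predict(self, s, a_onehot, w_b, n_samples=50):
        x = np.concatenate([s, a_onehot, w_b])
        draws = []
        for _ in range(n_samples):
            # draw one set of network weights from the (diagonal) distribution
            params = [m + np.exp(ls) * rng.standard_normal(m.shape)
                      for m, ls in zip(self.mu, self.log_sigma)]
            draws.append(self._forward(x, params))
        draws = np.stack(draws)
        return draws.mean(axis=0), draws.std(axis=0)  # predictive mean, uncertainty

bnn = ToyBNNTransition(state_dim=2, n_actions=4, latent_dim=5)
mean, std = bnn.predict(s=np.array([0.5, -0.5]),
                        a_onehot=np.eye(4)[1],
                        w_b=rng.standard_normal(5))
```

Because w_b enters the network alongside s, the hidden layer can represent nonlinear state-latent interactions, and the sample variance `std` plays the role of the transition uncertainty discussed above.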
We demonstrate, in Appendix A, that the BNN-based transition model far exceeds the GP-based transition model in both computational and predictive performance. In addition, BNNs naturally produce multi-dimensional outputs s′ without requiring prior knowledge of the relationships between dimensions. This allows us to directly model output correlations between the D state dimensions, leading to a more unified and coherent transition model. Inference in a larger input space (s, a, w_b) with a large number of samples is tractable using efficient approaches that let us, given a distribution P(W) and input-output tuples (s, a, s′), estimate a distribution over the latent embedding P(w_b). This enables more robust, scalable transfer.

Demonstration   We present a toy domain (Figure 1) where an agent is tasked with navigating to a goal region. The state space is continuous (s ∈ (−2, 2)²), and the action space is discrete (a ∈ {N, E, S, W}). Task instances vary the following aspects of the domain: the location of a wall that blocks access to the goal region (either to the left of or below the goal region), the orientation of the cardinal directions (i.e. whether taking action North moves the agent up or down), and the direction of a nonlinear wind effect that increases as the agent moves away from the start region. Ignoring the wall and grid boundaries, the transition dynamics are:

    Δx = (−1)^θ_b c (a_x − (1 − θ_b) β √((x + 1.5)² + (y + 1.5)²))
    Δy = (−1)^θ_b c (a_y − θ_b β √((x + 1.5)² + (y + 1.5)²))

    a_x = 1 if a ∈ {E, W}, 0 otherwise;    a_y = 1 if a ∈ {N, S}, 0 otherwise,

where c is the step size (without wind), θ_b ∈ {0, 1} indicates which of the two classes the instance belongs to, and β ∈ (0, 1) controls the influence of the wind and is fixed for all instances. 
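Ignoring walls and boundaries as above, these dynamics translate directly into code. The sketch below mirrors the equations as written; the particular step size `c` and wind coefficient `beta` are arbitrary illustrative choices, not values from the paper.

```python
import math

def toy_step(x, y, a, theta_b, c=0.1, beta=0.2):
    """One transition of the 2D navigation domain (walls/boundaries ignored).
    theta_b in {0, 1} selects the instance class, flipping the cardinal
    directions and which coordinate the wind acts on; the wind term grows
    with distance from the start region near (-1.5, -1.5)."""
    ax = 1.0 if a in ("E", "W") else 0.0
    ay = 1.0 if a in ("N", "S") else 0.0
    wind = beta * math.sqrt((x + 1.5) ** 2 + (y + 1.5) ** 2)
    dx = (-1) ** theta_b * c * (ax - (1 - theta_b) * wind)
    dy = (-1) ** theta_b * c * (ay - theta_b * wind)
    return x + dx, y + dy

# At the start region the wind vanishes, so a theta_b = 0 instance moves a
# clean step of size c; away from the start, wind perturbs one coordinate.
toy_step(-1.5, -1.5, "N", theta_b=0)   # unperturbed step of size c upward
toy_step(0.0, 0.0, "E", theta_b=1)     # flipped directions plus wind on y
```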
Figure 1: A demonstration of the HiP-MDP modeling the joint uncertainty between the latent parameters w_b and the state space. On the left, blue and red dots show the exploration during the red (θ_b = 0) and blue (θ_b = 1) instances. The latent parameters learned from the red instance are used to predict transitions for taking action E from an area of the state space either unexplored (top right) or explored (bottom right) during the red instance. The prediction variance provides an estimate of the joint uncertainty between the latent parameters w_b and the state.

The agent is penalized for trying to cross a wall, and each step incurs a small cost until the agent reaches the goal region, encouraging the agent to discover the goal region with the shortest route possible. An episode terminates once the agent enters the goal region or after 100 time steps.

A linear function of the state s and latent parameters w_b would struggle to model both classes of instances (θ_b = 0 and θ_b = 1) in this domain because the state transition resulting from taking an action a is a nonlinear function with interactions between the state and hidden parameter θ_b. By contrast, our new HiP-MDP model allows nonlinear interactions between the state and the latent parameters w_b, as well as jointly modeling their uncertainty. In Figure 1, this produces measurable differences in transition uncertainty in regions where there are few related observed transitions, even if there are many observations from unrelated instances. Here, the HiP-MDP is trained on two instances from distinct classes (shown in blue (θ_b = 1) and red (θ_b = 0) on the left). We display the uncertainty of the transition function, T̂, using the latent parameters w_red inferred for a red instance in two regions of the domain: 1) an area explored during red instances and 2) an area not explored under red instances, but explored with blue instances. 
The transition uncertainty \u02c6T is three times larger in\nthe region where red instances have not been\u2014even if many blue instances have been there\u2014than in\nregions where red instances have commonly explored, demonstrating that the latent parameters can\nhave different effects on the transition uncertainty in different states.\n\n4\n\nInference\n\nAlgorithm 1 summarizes the inference procedure for learning a policy for a new task instance b,\nfacilitated by a pre-trained BNN for that task, and is similar in structure to prior work [9, 18]. The\nprocedure involves several parts. Speci\ufb01cally, at the start of a new instance b, we have a global replay\nbuffer D of all observed transitions (s, a, r, s(cid:48)) and a posterior over the weights W for our BNN\ntransition function \u02c6T learned with data from D. The \ufb01rst objective is to quickly determine the latent\nembedding, wb, of the current instance\u2019s speci\ufb01c dynamical variation as transitions (s, a, s(cid:48)) are\nobserved from the current instance. Transitions from instance b are stored in both the global replay\nbuffer D and an instance-speci\ufb01c replay buffer Db. The second objective is to develop an optimal\ncontrol policy using the transition model \u02c6T and learned latent parameters wb. The transition model \u02c6T\nand latent embedding wb are separately updated via mini-batch stochastic gradient descent (SGD)\nusing Adam [26]. Using \u02c6T for planning increases our sample ef\ufb01ciency as we reduce interactions\nwith the environment. We describe each of these parts in more detail below.\n4.1 Updating embedding wb and BNN parameters W\nFor each new instance, a new latent weighting wb is sampled from the prior PW (Alg. 1, step 2),\nin preparation of estimating unobserved dynamics introduced by \u03b8b. Next, we observe transitions\n(s, a, r, s(cid:48)) from the task instance for an initial exploratory episode (Alg. 1, steps 7-10). 
Algorithm 1 Learning a control policy w/ the HiP-MDP

1: procedure LEARNPOLICY(D, T̂, s⁰_b)
2:   Draw new w_b ∼ P_W
3:   Randomly init. policy π̂_b; θ, θ⁻
4:   Init. instance replay buffer D_b
5:   Init. fictional replay buffer Df_b
6:   for i = 0 to N_e episodes do
7:     repeat
8:       Take action a ← π̂_b(s)
9:       Store D, D_b ← (s, a, r, s′, w_b)
10:     until episode is complete
11:     if i = 0 OR T̂ is inaccurate then
12:       D_b, W, w_b ← TUNEMODEL(D_b, W, w_b)
13:       for j = 0 to N_f − 1 episodes do
14:         Df_b, π̂_b ← SIMEP(Df_b, T̂, w_b, π̂_b, s⁰_b)
15:     else
16:       Df_b, π̂_b ← SIMEP(Df_b, T̂, w_b, π̂_b, s⁰_b)

1: function SIMEP(Df_b, T̂, w_b, π̂_b, s⁰_b)
2:   for t = 0 to N_t time steps do
3:     Take action a ← π̂_b(s)
4:     Approx. ŝ′ ← T̂(s, a, w_b)
5:     Calc. reward r̂ ← R(s, a, ŝ′)
6:     Store Df_b ← (s, a, r̂, ŝ′)
7:     if mod(t, N_π) = 0 then
8:       Update π̂_b via θ from Df_b
9:       θ⁻ ← τθ + (1 − τ)θ⁻
10:  return Df_b, π̂_b

1: function TUNEMODEL(D_b, W, w_b)
2:   for k = 0 to N_u updates do
3:     Update w_b from D_b
4:     Update W from D_b
5:   return D_b, W, w_b

Given that data, we optimize the latent parameters w_b to minimize the α-divergence between the posterior predictions of T̂(s, a, w_b|W) and the true state transitions s′ (step 3 in TuneModel) [22]. Here, the minimization occurs by adjusting the latent embedding w_b while holding the BNN parameters W fixed. After an initial update of the w_b for a newly encountered instance, the parameters W of the BNN transition function T̂ are optimized (step 4 in TuneModel). 
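The alternating structure of TuneModel (a small step on w_b with W frozen, then a small step on W with w_b frozen) can be sketched with a toy *linear* stand-in for the BNN. The paper's actual procedure minimizes an α-divergence over a BNN using Adam and prioritized minibatches; the plain squared-error gradient steps below are purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

def tune_model(W, w_b, replay, n_updates=200, lr_w=5e-2, lr_W=5e-3):
    """Sketch of TuneModel for a toy linear model s' ~ concat(s, w_b) @ W.
    Alternates a gradient step on the latent embedding w_b holding the model
    parameters W fixed (step 3) with a gradient step on W holding w_b
    fixed (step 4), using squared error on sampled transitions."""
    for _ in range(n_updates):
        s, s_next = replay[rng.integers(len(replay))]
        # step 3: update the latent parameters, W frozen
        err = np.concatenate([s, w_b]) @ W - s_next
        w_b = w_b - lr_w * (W[len(s):] @ err)
        # step 4: update the model parameters, w_b frozen
        x = np.concatenate([s, w_b])
        err = x @ W - s_next
        W = W - lr_W * np.outer(x, err)
    return W, w_b
```

Keeping both learning rates small mirrors the observation above that overly aggressive updates let the model disregard either the latent parameters or the state inputs.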
As the BNN is trained on multiple instances of\na task, we found that the only additional data needed to re\ufb01ne the BNN and latent wb for some\nnew instance can be provided by an initial exploratory episode. Otherwise, additional data from\nsubsequent episodes can be used to further improve the BNN and latent estimates (Alg. 1, steps 11-14).\nThe mini-batches used for optimizing the latent wb and BNN network parameters W are sampled\nfrom Db with squared error prioritization [31]. We found that switching between small updates to the\nlatent parameters and small updates to the BNN parameters led to the best transfer performance. If\neither the BNN network or latent parameters are updated too aggressively (having a large learning\nrate or excessive number of training epochs), the BNN disregards the latent parameters or state inputs\nrespectively. After completing an instance, the BNN parameters and the latent parameters are updated\nusing samples from global replay buffer D. Speci\ufb01c modeling details such as number of epochs,\nlearning rates, etc. are described in Appendix C.\n\n4.2 Updating policy \u02c6\u03c0b\n\nWe construct an \u03b5-greedy policy to select actions based on an approximate action-value function\n\u02c6Q(s, a). We model the action value function \u02c6Q(s, a) with a Double Deep Q Network (DDQN) [21, 29].\nThe DDQN involves training two networks (parametrized by \u03b8 and \u03b8\u2212 respectively), a primary Q-\nnetwork, which informs the policy, and a target Q-network, which is a slowly annealed copy of the\nprimary network (step 9 of SimEp) providing greater stability when updating the policy \u02c6\u03c0b .\nWith the updated transition function, \u02c6T , we approximate the environment when developing a control\npolicy (SimEp). We simulate batches of entire episodes of length Nt using the approximate dynamical\nmodel \u02c6T , storing each transition in a \ufb01ctional experience replay buffer Df\nb (steps 2-6 in SimEp). 
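The policy-side machinery of SimEp (ε-greedy action selection, the Double-DQN target in which the primary network selects the action and the target network scores it, and the slow target-network anneal of step 9) can be sketched as below. The specific γ, τ, and ε values are illustrative assumptions, not the settings used in the experiments.

```python
import numpy as np

def epsilon_greedy(q_values, epsilon, rng):
    """Pick the argmax action with probability 1 - epsilon, else act uniformly."""
    if rng.random() < epsilon:
        return int(rng.integers(len(q_values)))
    return int(np.argmax(q_values))

def double_q_target(r, q_next_primary, q_next_target, gamma=0.99):
    """Double-DQN target: the primary net chooses the next action,
    the target net evaluates it, decoupling selection from evaluation."""
    a_star = int(np.argmax(q_next_primary))
    return r + gamma * q_next_target[a_star]

def soft_update(theta_target, theta_primary, tau=0.005):
    """Step 9 of SimEp: theta_minus <- tau * theta + (1 - tau) * theta_minus,
    slowly annealing the target network toward the primary network."""
    return {k: tau * theta_primary[k] + (1 - tau) * theta_target[k]
            for k in theta_target}
```

In a full agent these pieces would sit inside the SimEp loop: act ε-greedily against the simulated model T̂, form Double-DQN targets from the fictional replay buffer, and softly update the target parameters every N_π steps.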
The\nprimary network parameters \u03b8 are updated via SGD every N\u03c0 time steps (step 8 in SimEp) to minimize\nthe temporal-difference error between the primary network\u2019s and the target network\u2019s Q-values. The\nmini-batches used in the update are sampled from the \ufb01ctional experience replay buffer Df\nb , using\nTD-error-based prioritization [38].\n\n5\n\n\f5 Experiments and Results\n\nNow, we demonstrate the performance of the HiP-MDP with embedded latent parameters in transfer-\nring learning across various instances of the same task. We revisit the 2D demonstration problem\nfrom Section 3, as well as describe results on both the acrobot [42] and a more complex healthcare\ndomain: prescribing effective HIV treatments [15] to patients with varying physiologies.4\nFor each of these domains, we compare our formulation of the HiP-MDP with embedded latent\nparameters (equation 2) with four baselines (one model-free and three model-based) to demonstrate\nthe ef\ufb01ciency of learning a policy for a new instance b using the HiP-MDP. These comparisons\nare made across the \ufb01rst handful of episodes encountered in a new task instance to highlight the\nadvantage provided by transferring information through the HiP-MDP. The \u2018linear\u2019 baseline uses\na BNN to learn a set of basis functions that are linearly combined with the parameters wb (used to\napproximate the approach of Doshi-Velez and Konidaris [14], equation 1), which does not allow\ninteractions between states and weights. The \u2018model-based from scratch\u2019 baseline considers each task\ninstance b as unique; requiring the BNN transition function to be trained only on observations made\nfrom the current task instance. The \u2018average\u2019 model baseline is constructed under the assumption that\na single transition function can be used for every instance of the task; \u02c6T is trained from observations\nof all task instances together. 
For all model-based approaches, we replicated the HiP-MDP procedure as closely as possible. The BNN was trained on observations from a single episode before being used to generate a large batch of approximate transition data, from which a policy is learned. Finally, the model-free baseline learns a DDQN policy directly from observations of the current instance.

For more information on the experimental specifications and long-run policy learning see Appendix C and D, respectively.

5.1 Revisiting the 2D demonstration

Figure 2: (a) a demonstration of a model-free control policy, (b) a comparison of learning a policy at the outset of a new task instance b using the HiP-MDP versus four benchmarks. The HiP-MDP with embedded w_b outperforms all four benchmarks.

The HiP-MDP and the average model were supplied a transition model T̂ trained on two previous instances, one from each class, before being updated according to the procedure outlined in Sec. 4 for a newly encountered instance. After the first exploratory episode, the HiP-MDP has sufficiently determined the latent embedding, evidenced in Figure 2b where the developed policy clearly outperforms all four benchmarks. This implies that the transition model T̂ adequately provides the accuracy needed to develop an optimal policy, aided by the learned latent parametrization.

The HiP-MDP with linear w_b also quickly adapts to the new instance and learns a good policy. However, the HiP-MDP with linear w_b is unable to model the nonlinear interaction between the latent parameters and the state. Therefore the model is less accurate and learns a less consistent policy than the HiP-MDP with embedded w_b. 
(See Figure 2a in Appendix A.2)

⁴Example code for training and evaluating a HiP-MDP, including the simulators used in this section, can be found at http://github.com/dtak/hip-mdp-public.

Figure 3: (a) the acrobot domain, (b) a comparison of learning a policy for a new task instance b using the HiP-MDP versus four benchmarks.

With a single episode of data, the model trained from scratch on the current instance is not accurate enough to learn a good policy. Training a BNN from scratch requires more observations of the true dynamics than are necessary for the HiP-MDP to learn the latent parameterization and achieve a high level of accuracy. The model-free approach eventually learns an optimal policy, but requires significantly more observations to do so, as represented in Figure 2a. The model-free approach has no improvement in the first 10 episodes. The poor performance of the average model approach indicates that a single model cannot adequately represent the dynamics of the different task instances. Hence, learning a latent representation of the dynamics specific to each instance is crucial.

5.2 Acrobot

First introduced by Sutton and Barto [42], acrobot is a canonical RL and control problem. The most common objective of this domain is for the agent to swing up a two-link pendulum by applying a positive, neutral, or negative torque on the joint between the two links (see Figure 3a). These actions must be performed in sequence such that the tip of the bottom link reaches a predetermined height above the top of the pendulum. The state space consists of the angles θ₁, θ₂ and angular velocities θ̇₁, θ̇₂, with hidden parameters corresponding to the masses (m₁, m₂) and lengths (l₁, l₂) of the two links.⁵ See Appendix B.2 for details on how these hidden parameters were varied to create different task instances. 
A policy learned on one setting of the acrobot will generally perform poorly on\nother settings of the system, as noted in [3]. Thus, subtle changes in the physical parameters require\nseparate policies to adequately control the varied dynamical behavior introduced. This provides\na perfect opportunity to apply the HiP-MDP to transfer between separate acrobot instances when\nlearning a control policy \u02c6\u03c0b for the current instance.\nFigure 3b shows that the HiP-MDP learns an optimal policy after a single episode, whereas all other\nmodel-based benchmarks required an additional episode of training. As in the toy example, the\nmodel-free approach eventually learns an optimal policy, but requires more time.\n\n5.3 HIV treatment\n\nDetermining effective treatment protocols for patients with HIV was introduced as an RL problem by\nmathematically representing a patient\u2019s physiological response to separate classes of treatments [1, 15].\nIn this model, the state of a patient\u2019s health is recorded via 6 separate markers measured with a\nblood test.6 Patients are given one of four treatments on a regular schedule. Either they are given\ntreatment from one of two classes of drugs, a mixture of the two treatments, or provided no treatment\n(effectively a rest period). There are 22 hidden parameters in this system that control a patient\u2019s\nspeci\ufb01c physiology and dictate rates of virulence, cell birth, infection, and death. (See Appendix B.3\n\n5The centers of mass and moments of inertia can also be varied. 
For our purposes we left them unperturbed.

⁶These markers are: the viral load (V), the number of healthy and infected CD4+ T-lymphocytes (T₁, T₁*, respectively), the number of healthy and infected macrophages (T₂, T₂*, respectively), and the number of HIV-specific cytotoxic T-cells (E).

Figure 4: (a) a visual representation of a patient with HIV transitioning from an unhealthy steady state to a healthy steady state using a proper treatment schedule, (b) a comparison of learning a policy for a new task instance b using the HiP-MDP versus four benchmarks.

for more details.) The objective is to develop a treatment sequence that transitions the patient from an unhealthy steady state to a healthy steady state (Figure 4a; see Adams et al. [1] for a more thorough explanation). Small changes made to these parameters can greatly affect the behavior of the system and therefore introduce separate steady-state regions that require unique policies to transition between them.

Figure 4b shows that the HiP-MDP develops an optimal control policy after a single episode, learning an unmatched optimal policy in the shortest time. The HIV simulator is the most complex of our three domains, and the separation between each benchmark is more pronounced. Modeling an HIV dynamical system from scratch from a single episode of observations proved to be infeasible. The average model, which has been trained off a large batch of observations from related dynamical systems, learns a better policy. The HiP-MDP with linear w_b is able to transfer knowledge from previous task instances and quickly learn the latent parameterization for this new instance, leading to an even better policy. However, the dynamical system contains nonlinear interactions between the latent parameters and the state space. Unlike the HiP-MDP with embedded w_b, the HiP-MDP with linear w_b is unable to model those interactions. 
This demonstrates the superiority of the HiP-\nMDP with embedded wb for ef\ufb01ciently transferring knowledge between instances in highly complex\ndomains.\n\n6 Related Work\n\nThere has been a large body of work on solving single POMDP models ef\ufb01ciently [6, 16, 24, 37, 45].\nIn contrast, transfer learning approaches leverage training done on one task to perform related tasks.\nStrategies for transfer learning include: latent variable models, reusing pre-trained model parameters,\nand learning a mapping between separate tasks (see review in [43]).\nOur work falls into the latent variable model category. Using latent representation to relate tasks has\nbeen particularly popular in robotics where similar physical movements can be exploited across a\nvariety of tasks and platforms [10, 20]. In Chen et al. [8], these latent representations are encoded\nas separate MDPs with an accompanying index that an agent learns while adapting to observed\nvariations in the environment. Bai et al. [3] take a closely related approach to our updated formulation\nof the HiP-MDP by incorporating estimates of unknown or partially observed parameters of a\nknown environmental model and re\ufb01ning those estimates using model-based Bayesian RL. The\ncore difference between this and our work is that we learn the transition model and the observed\nvariations directly from the data while Bai et al. [3] assume it is given and the speci\ufb01c variations\nof the parameters are learned. Also related are multi-task approaches that train a single model for\nmultiple tasks simultaneously [5, 7]. Finally, there have been many applications of reinforcement\nlearning (e.g. [32, 40, 44]) and transfer learning in the healthcare domain by identifying subgroups\nwith similar response (e.g. [23, 28, 39]).\n\n8\n\n\fMore broadly, BNNs are powerful probabilistic inference models that allow for the estimation of\nstochastic dynamical systems [11, 18]. 
Core to this functionality is their ability to represent both model uncertainty and transition stochasticity [25]. Recent work decomposes these two forms of uncertainty, isolating the separate streams of information to improve learning. Our use of fixed latent variables as input to a BNN helps account for model uncertainty when transferring the pretrained BNN to a new instance of a task. Other approaches use stochastic latent variable inputs to introduce transition stochasticity [12, 30].
We view the HiP-MDP with latent embedding as a methodology that facilitates personalization, and does so robustly, by transferring knowledge of prior observations to the current instance. This approach can be especially useful in extending personalized care to groups of patients with similar diagnoses, but it applies to any control system where variations may be present.

7 Discussion and Conclusion

We present a new formulation for transfer learning among related tasks with similar, but not identical, dynamics within the HiP-MDP framework. Our approach leverages a latent embedding, learned and optimized in an online fashion, to approximate the true dynamics of a task. Our adjustment to the HiP-MDP provides robust and efficient learning when faced with varied dynamical systems distinct from those previously learned. By virtue of transfer learning, it rapidly determines optimal control policies when faced with a new, possibly unique, task instance.
The results in this work assume the presence of a large batch of already-collected data. This setting is common in many industrial and health domains, where there may be months, sometimes years, of operations data on plant function, product performance, or patient health. Even with large batches, each new instance still requires collapsing the uncertainty around the instance-specific parameters in order to quickly perform well on the task.
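The embedded-wb transition model discussed above can be sketched concretely. The following is a minimal, illustrative sketch in pure Python, not the paper's implementation: Monte Carlo dropout [17] stands in for full BNN inference, and every dimension, weight, and hyperparameter here is an arbitrary assumption. What it illustrates is the key design point: the state s, action a, and latent parameters wb are concatenated into a single input, so the network can capture nonlinear interactions between them, and repeated stochastic forward passes yield a predictive variance usable as a model-uncertainty signal when transferring to a new task instance.

```python
import math
import random

random.seed(0)

# Illustrative sizes only (e.g., a 6-dimensional state as in the HIV domain).
STATE_DIM, ACTION_DIM, LATENT_DIM, HIDDEN, OUT = 6, 4, 5, 32, 6

def init_layer(n_in, n_out):
    """Random weight matrix (n_in x n_out) with a simple scaled-Gaussian init."""
    return [[random.gauss(0.0, 1.0 / math.sqrt(n_in)) for _ in range(n_out)]
            for _ in range(n_in)]

W1 = init_layer(STATE_DIM + ACTION_DIM + LATENT_DIM, HIDDEN)
W2 = init_layer(HIDDEN, OUT)

def forward(s, a_onehot, w_b, drop_p=0.1):
    """One stochastic forward pass; dropout masks play the role of
    sampling from an approximate posterior over weights."""
    x = s + a_onehot + w_b  # list concatenation: [state, action, latent w_b]
    h = []
    for j in range(HIDDEN):
        pre = sum(x[i] * W1[i][j] for i in range(len(x)))
        keep = 0.0 if random.random() < drop_p else 1.0
        h.append(max(0.0, pre) * keep / (1.0 - drop_p))  # ReLU + dropout
    return [sum(h[j] * W2[j][k] for j in range(HIDDEN)) for k in range(OUT)]

def predict(s, a_onehot, w_b, n_samples=50):
    """Predictive mean and per-dimension variance over stochastic passes;
    the variance serves as the model-uncertainty signal."""
    samples = [forward(s, a_onehot, w_b) for _ in range(n_samples)]
    mean = [sum(col) / n_samples for col in zip(*samples)]
    var = [sum((v - m) ** 2 for v in col) / n_samples
           for col, m in zip(zip(*samples), mean)]
    return mean, var

state = [0.5] * STATE_DIM
action = [1.0, 0.0, 0.0, 0.0]   # one-hot encoded action
w_b = [0.1] * LATENT_DIM        # latent instance parameters (here fixed)
mean, var = predict(state, action, w_b)
```

In the actual framework, wb is itself optimized (here it is fixed for illustration), so the same network weights serve every instance in the family while wb absorbs instance-specific variation.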
In Section 5, we used a batch of transition data from multiple instances of a task, without any artificial exploration procedure, to train the BNN and learn the latent parameterizations. Seeded with data from diverse task instances, the BNN and latent parameters accounted for the variation between instances.
While we were primarily interested in settings where batches of observational data exist, one might also be interested in more traditional settings in which the first instance is completely new, the second instance only has information from the first, and so on. In our initial explorations, we found that one can indeed learn the BNN in an online manner for simpler domains. However, even with simple domains, the model-selection problem becomes more challenging: an overly expressive BNN can overfit to the first few instances and have a hard time adapting when it sees data from an instance with very different dynamics. Developing model-selection approaches that allow the BNN to learn online, starting from scratch, is an interesting direction for future research.
Another interesting extension is rapidly identifying the latent wb. Exploration to identify wb would supply the dynamical model with data from the regions of the domain with the largest uncertainty. This could lead to a more accurate latent representation of the observed dynamics while also improving the overall accuracy of the transition model. We also found that training a DQN requires careful exploration strategies. When exploration is constrained too early, the DQN quickly converges to a suboptimal deterministic policy, often choosing the same action at each step. Training a DQN along the BNN's trajectories of least certainty could lead to improved coverage of the domain and result in more robust policies. The development of effective policies would be greatly accelerated if exploration were more robust and stable.
One could also use the hidden parameters wb to learn a policy directly.
Recognizing structure between task variations, through latent embeddings, enables a form of transfer learning that is both robust and efficient. Our extension of the HiP-MDP demonstrates how embedding a low-dimensional latent representation in the input of an approximate dynamical model facilitates transfer and results in a more accurate model of a complex dynamical system, as interactions between the input state and the latent representation are modeled naturally. We also model correlations in the output dimensions by replacing the GP basis functions of the original HiP-MDP formulation with a BNN. The BNN transition function scales significantly better to larger and more complex problems. Our improvements to the HiP-MDP provide a foundation for robust and efficient transfer learning. Future improvements to this work will contribute to a general transfer learning framework capable of addressing the most nuanced and complex control problems.

Acknowledgements We thank Mike Hughes, Andrew Miller, Jessica Forde, and Andrew Ross for their helpful conversations. TWK was supported by the MIT Lincoln Laboratory Lincoln Scholars Program. GDK is supported in part by the NIH R01MH109177. The content of this work is solely the responsibility of the authors and does not necessarily represent the official views of the NIH.

References

[1] BM Adams, HT Banks, H Kwon, and HT Tran. Dynamic multidrug therapies for HIV: optimal and STI control approaches. Mathematical Biosciences and Engineering, pages 223–241, 2004.

[2] MA Alvarez, L Rosasco, ND Lawrence, et al. Kernels for vector-valued functions: a review. Foundations and Trends in Machine Learning, 4(3):195–266, 2012.

[3] H Bai, D Hsu, and WS Lee. Planning how to learn. In International Conference on Robotics and Automation, pages 2853–2859.
IEEE, 2013.

[4] C Blundell, J Cornebise, K Kavukcuoglu, and D Wierstra. Weight uncertainty in neural networks. In Proceedings of the 32nd International Conference on Machine Learning, pages 1613–1622, 2015.

[5] EV Bonilla, KM Chai, and CK Williams. Multi-task Gaussian process prediction. In Advances in Neural Information Processing Systems, volume 20, pages 153–160, 2008.

[6] E Brunskill and L Li. Sample complexity of multi-task reinforcement learning. In Conference on Uncertainty in Artificial Intelligence, 2013.

[7] R Caruana. Multitask learning. In Learning to Learn, pages 95–133. Springer, 1998.

[8] M Chen, E Frazzoli, D Hsu, and WS Lee. POMDP-lite for robust robot planning under uncertainty. In International Conference on Robotics and Automation, pages 5427–5433. IEEE, 2016.

[9] MP Deisenroth and CE Rasmussen. PILCO: a model-based and data-efficient approach to policy search. In Proceedings of the International Conference on Machine Learning, 2011.

[10] B Delhaisse, D Esteban, L Rozo, and D Caldwell. Transfer learning of shared latent spaces between robots with similar kinematic structure. In International Joint Conference on Neural Networks. IEEE, 2017.

[11] S Depeweg, JM Hernández-Lobato, F Doshi-Velez, and S Udluft. Learning and policy search in stochastic dynamical systems with Bayesian neural networks. In International Conference on Learning Representations, 2017.

[12] S Depeweg, JM Hernández-Lobato, F Doshi-Velez, and S Udluft. Uncertainty decomposition in Bayesian neural networks with latent variables. arXiv preprint arXiv:1706.08495, 2017.

[13] CR Dietrich and GN Newsam. Fast and exact simulation of stationary Gaussian processes through circulant embedding of the covariance matrix. SIAM Journal on Scientific Computing, 18(4):1088–1107, 1997.

[14] F Doshi-Velez and G Konidaris.
Hidden parameter Markov Decision Processes: a semiparametric regression approach for discovering latent task parametrizations. In Proceedings of the Twenty-Fifth International Joint Conference on Artificial Intelligence, volume 25, pages 1432–1440, 2016.

[15] D Ernst, G Stan, J Goncalves, and L Wehenkel. Clinical data based optimal STI strategies for HIV: a reinforcement learning approach. In Proceedings of the 45th IEEE Conference on Decision and Control, 2006.

[16] A Fern and P Tadepalli. A computational decision theory for interactive assistants. In Advances in Neural Information Processing Systems, pages 577–585, 2010.

[17] Y Gal and Z Ghahramani. Dropout as a Bayesian approximation: representing model uncertainty in deep learning. In Proceedings of the 33rd International Conference on Machine Learning, 2016.

[18] Y Gal, R McAllister, and CE Rasmussen. Improving PILCO with Bayesian neural network dynamics models. In Data-Efficient Machine Learning Workshop, ICML, 2016.

[19] MG Genton, W Kleiber, et al. Cross-covariance functions for multivariate geostatistics. Statistical Science, 30(2):147–163, 2015.

[20] A Gupta, C Devin, Y Liu, P Abbeel, and S Levine. Learning invariant feature spaces to transfer skills with reinforcement learning. In International Conference on Learning Representations, 2017.

[21] H van Hasselt, A Guez, and D Silver. Deep reinforcement learning with double Q-learning. In Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence, pages 2094–2100. AAAI Press, 2016.

[22] JM Hernández-Lobato, Y Li, M Rowland, D Hernández-Lobato, T Bui, and RE Turner. Black-box α-divergence minimization. In Proceedings of the 33rd International Conference on Machine Learning, 2016.

[23] N Jaques, S Taylor, A Sano, and R Picard. Multi-task, multi-kernel learning for estimating individual wellbeing.
In Proceedings of the NIPS Workshop on Multimodal Machine Learning, 2015.

[24] LP Kaelbling, ML Littman, and AR Cassandra. Planning and acting in partially observable stochastic domains. Artificial Intelligence, 101(1):99–134, 1998.

[25] A Kendall and Y Gal. What uncertainties do we need in Bayesian deep learning for computer vision? arXiv preprint arXiv:1703.04977, 2017.

[26] D Kingma and J Ba. Adam: a method for stochastic optimization. In International Conference on Learning Representations, 2015.

[27] DJC MacKay. A practical Bayesian framework for backpropagation networks. Neural Computation, 4(3):448–472, 1992.

[28] VN Marivate, J Chemali, E Brunskill, and M Littman. Quantifying uncertainty in batch personalized sequential decision making. In Workshops at the Twenty-Eighth AAAI Conference on Artificial Intelligence, 2014.

[29] V Mnih, K Kavukcuoglu, D Silver, AA Rusu, J Veness, MG Bellemare, A Graves, M Riedmiller, AK Fidjeland, G Ostrovski, et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529–533, 2015.

[30] TM Moerland, J Broekens, and CM Jonker. Learning multimodal transition dynamics for model-based reinforcement learning. arXiv preprint arXiv:1705.00470, 2017.

[31] AW Moore and CG Atkeson. Prioritized sweeping: reinforcement learning with less data and less time. Machine Learning, 13(1):103–130, 1993.

[32] BL Moore, LD Pyeatt, V Kulkarni, P Panousis, K Padrez, and AG Doufas. Reinforcement learning for closed-loop propofol anesthesia: a study in human volunteers. Journal of Machine Learning Research, 15(1):655–696, 2014.

[33] RM Neal. Bayesian training of backpropagation networks by the hybrid Monte Carlo method. Technical report, Citeseer, 1992.

[34] J Quiñonero-Candela and CE Rasmussen. A unifying view of sparse approximate Gaussian process regression.
Journal of Machine Learning Research, 6(Dec):1939–1959, 2005.

[35] CE Rasmussen and M Kuss. Gaussian processes in reinforcement learning. In Advances in Neural Information Processing Systems, volume 15, 2003.

[36] CE Rasmussen and CKI Williams. Gaussian processes for machine learning. MIT Press, Cambridge, 2006.

[37] B Rosman, M Hawasly, and S Ramamoorthy. Bayesian policy reuse. Machine Learning, 104(1):99–127, 2016.

[38] T Schaul, J Quan, I Antonoglou, and D Silver. Prioritized experience replay. In International Conference on Learning Representations, 2016.

[39] P Schulam and S Saria. Integrative analysis using coupled latent variable models for individualizing prognoses. Journal of Machine Learning Research, 17:1–35, 2016.

[40] SM Shortreed, E Laber, DJ Lizotte, TS Stroup, J Pineau, and SA Murphy. Informing sequential clinical decision-making through reinforcement learning: an empirical study. Machine Learning, 84(1-2):109–136, 2011.

[41] E Snelson and Z Ghahramani. Sparse Gaussian processes using pseudo-inputs. In Advances in Neural Information Processing Systems, pages 1257–1264, 2006.

[42] R Sutton and A Barto. Reinforcement learning: an introduction, volume 1. MIT Press, Cambridge, 1998.

[43] ME Taylor and P Stone. Transfer learning for reinforcement learning domains: a survey. Journal of Machine Learning Research, 10(Jul):1633–1685, 2009.

[44] M Tenenbaum, A Fern, L Getoor, M Littman, V Manasinghka, S Natarajan, D Page, J Shrager, Y Singer, and P Tadepalli. Personalizing cancer therapy via machine learning. Workshops of NIPS, 2010.

[45] JD Williams and S Young. Scaling POMDPs for dialog management with composite summary point-based value iteration (CSPBVI).
In AAAI Workshop on Statistical and Empirical Approaches for Spoken Dialogue Systems, pages 37–42, 2006.