{"title": "Correlation Priors for Reinforcement Learning", "book": "Advances in Neural Information Processing Systems", "page_first": 14155, "page_last": 14165, "abstract": "Many decision-making problems naturally exhibit pronounced structures inherited\nfrom the characteristics of the underlying environment. In a Markov decision process\nmodel, for example, two distinct states can have inherently related semantics\nor encode resembling physical state configurations. This often implies locally correlated transition dynamics among the states. In order to complete a certain task in such environments, the operating agent usually needs to execute a series of temporally and spatially correlated actions. Though there exists a variety of approaches to capture these correlations in continuous state-action domains, a principled solution for discrete environments is missing. In this work, we present a Bayesian learning framework based on P\u00f3lya-Gamma augmentation that enables an analogous reasoning in such cases. We demonstrate the framework on a number of common decision-making related problems, such as imitation learning, subgoal extraction, system identification and Bayesian reinforcement learning. By explicitly modeling the underlying correlation structures of these problems, the proposed approach yields superior predictive performance compared to correlation-agnostic models, even when trained on data sets that are an order of magnitude smaller in size.", "full_text": "Correlation Priors for Reinforcement Learning\n\nBastian Alt\u21e4\n\nAdrian \u0160o\u0161i\u00b4c\u21e4\n\nHeinz Koeppl\n\nDepartment of Electrical Engineering and Information Technology\n\nTechnische Universit\u00e4t Darmstadt\n\n{bastian.alt, adrian.sosic, heinz.koeppl}@bcs.tu-darmstadt.de\n\nAbstract\n\nMany decision-making problems naturally exhibit pronounced structures inherited\nfrom the characteristics of the underlying environment. 
In a Markov decision pro-\ncess model, for example, two distinct states can have inherently related semantics\nor encode resembling physical state con\ufb01gurations. This often implies locally cor-\nrelated transition dynamics among the states. In order to complete a certain task in\nsuch environments, the operating agent usually needs to execute a series of tempo-\nrally and spatially correlated actions. Though there exists a variety of approaches to\ncapture these correlations in continuous state-action domains, a principled solution\nfor discrete environments is missing. In this work, we present a Bayesian learn-\ning framework based on P\u00f3lya-Gamma augmentation that enables an analogous\nreasoning in such cases. We demonstrate the framework on a number of common\ndecision-making related problems, such as imitation learning, subgoal extraction,\nsystem identi\ufb01cation and Bayesian reinforcement learning. By explicitly modeling\nthe underlying correlation structures of these problems, the proposed approach\nyields superior predictive performance compared to correlation-agnostic models,\neven when trained on data sets that are an order of magnitude smaller in size.\n\n1\n\nIntroduction\n\nCorrelations arise naturally in many aspects of decision-making. The reason for this phenomenon is\nthat decision-making problems often exhibit pronounced structures, which substantially in\ufb02uence\nthe strategies of an agent. Examples of correlations are even found in stateless decision-making\nproblems, such as multi-armed bandits, where prominent patterns in the reward mechanisms of\ndifferent arms can translate into correlated action choices of the operating agent [7, 9]. 
However, these statistical relationships become more pronounced in the case of contextual bandits, where effective decision-making strategies not only exhibit temporal correlation but also take into account the state context at each time point, introducing a second source of correlation [12].

In more general decision-making models, such as Markov decision processes (MDPs), the agent can directly affect the state of the environment through its action choices. The effects caused by these actions often share common patterns between different states of the process, e.g., because the states have inherently related semantics or encode similar physical state configurations of the underlying system. Examples of this general principle are omnipresent in all disciplines and range from robotics, where similar actuator outputs result in similar kinematic responses for similar states of the robot's joints, to networking applications, where the servicing of a particular queue affects the surrounding network state (Section 4.3.3). The common consequence is that the structures of the environment are usually reflected in the decisions of the operating agent, who needs to execute a series of temporally and spatially correlated actions in order to complete a certain task. This is particularly true when two or more agents interact with each other in the same environment and need to coordinate their behavior [2].

*The first two authors contributed equally to this work.

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

Focusing on rational behavior, correlations can manifest themselves even in unstructured domains, though at a higher level of abstraction of the decision-making process. This is because rationality itself implies the existence of an underlying objective optimized by the agent that represents the agent's intentions and incentivizes it to choose one action over another.
Typically, these goals persist at least for a short period of time, causing dependencies between consecutive action choices (Section 4.2).

In this paper, we propose a learning framework that offers a direct way to model such correlations in finite decision-making problems, i.e., involving systems with discrete state and action spaces. A key feature of our framework is that it allows us to capture correlations at any level of the process, i.e., in the system environment, at the intentional level, or directly at the level of the executed actions. We encode the underlying structure in a hierarchical Bayesian model, for which we derive a tractable variational inference method based on Pólya-Gamma augmentation that allows a fully probabilistic treatment of the learning problem. Results on common benchmark problems and a queueing network simulation demonstrate the advantages of the framework. The accompanying code is publicly available via Git.¹

Related Work

Modeling correlations in decision-making is a common theme in reinforcement learning and related fields. Gaussian processes (GPs) offer a flexible tool for this purpose and are widely used in a broad variety of contexts. Moreover, movement primitives [18] provide an effective way to describe temporal relationships in control problems. However, both naturally operate on continuous state-action environments, which lie outside the scope of this work.

Inferring correlation structure from count data has been discussed extensively in the context of topic modeling [13, 14] and factor analysis [29]. Recently, a GP classification algorithm with a scalable variational approach based on Pólya-Gamma augmentation was proposed [30].
Though these approaches are promising, they do not address the problem-specific modeling aspects of decision-making.

For agents acting in discrete environments, a number of customized solutions exist that allow modeling specific characteristics of a decision-making problem. A broad class of methods that specifically target temporal correlations relies on hidden Markov models. Many of these approaches operate on the intentional level, modeling the temporal relationships of the different goals followed by the agent [22]. However, there also exist several approaches to capture spatial dependencies between these goals. For a recent overview, see [27] and the references therein. Dependencies on the action level have also been considered in the past but, like most intentional models, existing approaches largely focus on the temporal correlations in action sequences (such as probabilistic movement primitives [18]) or they are restricted to the special case of deterministic policies [26]. A probabilistic framework to capture correlations between discrete action distributions is described in [25].

When it comes to modeling transition dynamics, most existing approaches rely on GP models [4, 3]. In the Texplore method of [8], correlations within the transition dynamics are modeled with the help of a random forest, creating a mixture of decision tree outcomes. Yet, a full Bayesian description in the form of an explicit prior distribution is missing in this approach. For behavior acquisition, prior distributions over transition dynamics are advantageous since they can easily be used in Bayesian reinforcement learning algorithms such as BEETLE [21] or BAMCP [6]. A particular example of a prior distribution over transition probabilities is given in [19] in the form of a Dirichlet mixture.
However, the incorporation of prior knowledge expressing a particular correlation structure is difficult in this model.

To the best of our knowledge, there exists no principled method to explicitly model correlations in the transition dynamics of discrete environments. Also, a universally applicable inference tool for discrete environments, comparable to Gaussian processes, has not yet emerged. The goal of our work is to fill this gap by providing a flexible inference framework for such cases.

2 Background

2.1 Markov Decision Processes

In this paper, we consider finite Markov decision processes (MDPs) of the form (S, A, T, R), where S = {1, ..., S} is a finite state space containing S distinct states, A = {1, ..., A} is an action space comprising A actions for each state, T : S × S × A → [0, 1] is the state transition model specifying the probability distribution over next states for each current state and action, and R : S × A → R is a reward function. For further details, see [28].

¹https://git.rwth-aachen.de/bcs/correlation_priors_for_rl

2.2 Inference in MDPs

In decision-making with discrete state and action spaces, we are often faced with integer-valued quantities modeled as draws from multinomial distributions, x_c ~ Mult(x_c | N_c, p_c), N_c ∈ N, p_c ∈ Δ_K (the K-dimensional probability simplex), where K denotes the number of categories, c ∈ C indexes some finite covariate space with cardinality C, and p_c parametrizes the distribution at a given covariate value c. Herein, x_c can either represent actual count data observed during an experiment or describe some latent variable of our model. For example, when modeling the policy of an agent in an MDP, x_c may represent the vector of action counts observed at a particular state, in which case C = S, K = A, and N_c is the total number of times we observe the agent choosing an action at state c.
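The multinomial count model above, and the conjugate Dirichlet treatment discussed next, can be illustrated in a few lines of NumPy. The numbers here (A = 4 actions, N_c = 50 observations, the particular p_c) are hypothetical and chosen only for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: A = 4 actions, true local action distribution p_c at one covariate c.
p_c = np.array([0.7, 0.1, 0.1, 0.1])
N_c = 50                                 # number of observed action choices at state c
x_c = rng.multinomial(N_c, p_c)          # observed action counts, x_c ~ Mult(N_c, p_c)

# Independent Dirichlet model: with prior Dir(alpha_c), conjugacy gives the
# posterior Dir(alpha_c + x_c), whose mean is available in closed form.
alpha_c = np.ones(4)                     # symmetric local concentration parameter
posterior_mean = (alpha_c + x_c) / (alpha_c + x_c).sum()
```

Each covariate value is treated in isolation here, which is exactly the correlation-agnostic behavior the paper argues against.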
Similarly, when modeling state transition probabilities, x_c could be the vector counting outgoing transitions from some state s for a given action a (in which case C = S × A) or, when modeling the agent's intentions, x_c could describe the number of times the agent follows a particular goal, which itself might be unobservable (Section 4.2).

When facing the inverse problem of inferring the probability vectors {p_c} from the count data {x_c}, a computationally straightforward approach is to model the probability vectors using independent Dirichlet distributions for all covariate values, i.e., p_c ~ Dir(p_c | α_c) for all c ∈ C, where α_c ∈ R^K_{>0} is a local concentration parameter for covariate value c. However, the resulting model is agnostic to the rich correlation structure present in most MDPs (Section 1) and thus ignores much of the prior information we have about the underlying decision-making problem. A more natural approach would be to model the probability vectors {p_c} jointly using a common prior model, in order to capture their dependency structure. Unfortunately, this renders exact posterior inference intractable, since the resulting prior distributions are no longer conjugate to the multinomial likelihood.

Recently, a method for approximate inference in dependent multinomial models has been developed to account for the inherent correlation of the probability vectors [14]. To this end, the following prior model was introduced,

p_c = \Pi_{\mathrm{SB}}(\psi_{c\cdot}), \qquad \psi_{\cdot k} \sim \mathcal{N}(\psi_{\cdot k} \mid \mu_k, \Sigma), \quad k = 1, \ldots, K-1.    (1)

Herein, \Pi_{\mathrm{SB}}(\zeta) = [\Pi_{\mathrm{SB}}^{(1)}(\zeta), \ldots, \Pi_{\mathrm{SB}}^{(K)}(\zeta)]^\top is the logistic stick-breaking transformation, where

\Pi_{\mathrm{SB}}^{(k)}(\zeta) = \sigma(\zeta_k) \prod_{j<k} \bigl(1 - \sigma(\zeta_j)\bigr), \quad k = 1, \ldots, K-1, \qquad \Pi_{\mathrm{SB}}^{(K)}(\zeta) = 1 - \sum_{k=1}^{K-1} \Pi_{\mathrm{SB}}^{(k)}(\zeta),

with σ(·) denoting the logistic function, and κ_k = [κ_{1k}, ..., κ_{Ck}]^⊤.
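The stick-breaking map of Eq. (1) is easy to sketch in code; the following is a minimal NumPy version (the input vector and K = 4 are arbitrary choices for illustration):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def pi_sb(zeta):
    """Logistic stick-breaking map from R^{K-1} to the K-simplex, as in Eq. (1)."""
    sig = sigmoid(zeta)
    # Remaining stick length before breaking off piece k: prod_{j<k} (1 - sigma(zeta_j)).
    remaining = np.concatenate(([1.0], np.cumprod(1.0 - sig)))
    p = np.empty(len(zeta) + 1)
    p[:-1] = sig * remaining[:-1]
    p[-1] = remaining[-1]                # the last entry takes whatever stick is left
    return p

p = pi_sb(np.array([0.3, -1.2, 0.5]))   # maps R^3 to a distribution over K = 4 categories
```

Because the pieces telescope, the output always lies on the simplex, which is what makes a Gaussian prior on the unconstrained ψ values legitimate.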
A detailed derivation of these results and the resulting ELBO is provided in Section A. The variational approximation can be optimized through coordinate-wise ascent by cycling through the parameters and their moments. The corresponding distribution over probability vectors {p_c} is defined implicitly through the deterministic relationship in Eq. (1).

Hyper-Parameter Optimization

For hyper-parameter learning, we employ a variational expectation-maximization approach [15] to optimize the ELBO after each update of the variational parameters. Assuming a covariance matrix Σ_θ parametrized by a vector θ = [θ_1, ..., θ_J]^⊤, the ELBO can be written as

\mathcal{L}(q) = -\frac{K-1}{2} \log|\Sigma_\theta| + \frac{1}{2} \sum_{k=1}^{K-1} \log|V_k| - \frac{1}{2} \sum_{k=1}^{K-1} \operatorname{tr}(\Sigma_\theta^{-1} V_k) - \frac{1}{2} \sum_{k=1}^{K-1} (\mu_k - \bar{\psi}_k)^\top \Sigma_\theta^{-1} (\mu_k - \bar{\psi}_k)
    + \frac{C(K-1)}{2} + \sum_{k=1}^{K-1} \bar{\psi}_k^\top \kappa_k - \sum_{k=1}^{K-1} \sum_{c=1}^{C} b_{ck} \log 2 + \sum_{k=1}^{K-1} \sum_{c=1}^{C} \log \binom{b_{ck}}{x_{ck}} - \sum_{k=1}^{K-1} \sum_{c=1}^{C} b_{ck} \log\cosh\left(\frac{\bar{w}_{ck}}{2}\right).

A detailed derivation of this expression can be found in Section A.2. The corresponding gradients w.r.t. the hyper-parameters calculate to

\frac{\partial \mathcal{L}}{\partial \mu_k} = -\Sigma_\theta^{-1} (\mu_k - \bar{\psi}_k),

\frac{\partial \mathcal{L}}{\partial \theta_j} = -\frac{1}{2} \sum_{k=1}^{K-1} \left( \operatorname{tr}\Bigl(\Sigma_\theta^{-1} \frac{\partial \Sigma_\theta}{\partial \theta_j}\Bigr) - \operatorname{tr}\Bigl(\Sigma_\theta^{-1} \frac{\partial \Sigma_\theta}{\partial \theta_j} \Sigma_\theta^{-1} V_k\Bigr) - (\mu_k - \bar{\psi}_k)^\top \Sigma_\theta^{-1} \frac{\partial \Sigma_\theta}{\partial \theta_j} \Sigma_\theta^{-1} (\mu_k - \bar{\psi}_k) \right),

which admits a closed-form solution for the optimal mean parameters, given by µ_k = ψ̄_k.

For the optimization of the covariance parameters θ, we can resort to a numerical scheme using the above gradient expression; however, this requires a full inversion of the covariance matrix in each update step.
As it turns out, a closed-form expression can be found for the special case where θ is a scale parameter, i.e., Σ_θ = θ Σ̃, for some fixed Σ̃. The optimal parameter value can then be determined as

\theta = \frac{1}{(K-1)C} \sum_{k=1}^{K-1} \operatorname{tr}\Bigl( \tilde{\Sigma}^{-1} \bigl( V_k + (\mu_k - \bar{\psi}_k)(\mu_k - \bar{\psi}_k)^\top \bigr) \Bigr).

The closed-form solution avoids repeated matrix inversions since Σ̃^{-1}, being independent of all hyper-parameters and variational parameters, can be evaluated at the start of the optimization procedure. The full derivation of the gradients and the closed-form expression is provided in Section B.

For the experiments in the following section, we consider a squared exponential covariance function of the form (Σ_θ)_{cc'} = θ exp(−d(c, c')² / l²), with a covariate distance measure d : C × C → R_{≥0} and a length scale l ∈ R_{≥0} adapted to the specific modeling scenario. Yet, we note that for model selection purposes multiple covariance functions can be easily compared against each other based on the resulting values of the ELBO [15]. Also, a combination of functions can be employed, provided that the resulting covariance matrix is positive semi-definite (see covariance kernels of GPs [23]).

4 Experiments

To demonstrate the versatility of our inference framework, we test it on a number of modeling scenarios that commonly occur in decision-making contexts. Due to space limitations, we restrict ourselves to imitation learning, subgoal modeling, system identification, and Bayesian reinforcement learning. However, we would like to point out that the same modeling principles can be applied in many other situations, e.g., for behavior coordination among agents [2] or knowledge transfer between related tasks [11], to name just two examples.
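As an aside, the squared exponential covariance introduced above can be assembled in a few lines. In this sketch the 5×5 grid, the unit scale θ, and the choice of l as the maximum occurring distance are illustrative assumptions:

```python
import numpy as np

# Sketch of the squared exponential covariance (Sigma_theta)_{cc'} = theta * exp(-d(c,c')^2 / l^2)
# on a hypothetical 5x5 grid world, with d the Euclidean distance between grid positions.
coords = np.array([(i, j) for i in range(5) for j in range(5)], dtype=float)
d = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
theta, l = 1.0, d.max()                  # scale and length scale (here: maximum occurring distance)
Sigma = theta * np.exp(-d**2 / l**2)
```

The resulting matrix is symmetric and positive semi-definite, so it can serve directly as the prior covariance Σ_θ over the latent ψ values.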
A more comprehensive evaluation study is left for future work.

4.1 Imitation Learning

First, we illustrate our framework on an imitation learning example, where we aspire to reconstruct the policy of an agent (in this context called the expert) from observed behavior.

Figure 1: Imitation learning example. The expert policy in (a) is reconstructed using the posterior mean estimates of (b) an independent Dirichlet policy model and (c) a correlated PG model, based on action data observed at the states marked in gray. The PG joint estimate of the local policies yields a significantly improved reconstruction, as shown by the resulting Hellinger distance to the expert policy and the corresponding value loss [27] in (d).

For the reconstruction, we assume access to a demonstration data set D = {(s_d, a_d) ∈ S × A}_{d=1}^D containing D state-action pairs, where each action has been generated through the expert policy, i.e., a_d ~ π(a | s_d). Assuming a discrete state and action space, the policy can be represented as a stochastic matrix Π = [π_1, ..., π_S], whose ith column π_i ∈ Δ_A represents the local action distribution of the expert at state i in form of a vector. Our goal is to estimate this matrix from the demonstrations D. By constructing the count matrix (X)_{ij} = Σ_{d=1}^D 1(s_d = i ∧ a_d = j), the inference problem can be directly mapped to our PG model, which allows us to jointly estimate the coupled quantities {π_i} through their latent representation ψ by approximating the posterior distribution p(ψ | X) in Eq. (2). In this case, the covariate set C is described by the state space S.

To demonstrate the advantages of this joint inference approach over a correlation-agnostic estimation method, we compare our framework to the independent Dirichlet model described in Section 2.2.
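The count matrix construction above is straightforward; a minimal sketch with a hypothetical demonstration set (states and actions 0-indexed):

```python
import numpy as np

# Building the count matrix (X)_{ij} = sum_d 1(s_d = i and a_d = j) from a
# demonstration set D = {(s_d, a_d)}; the particular pairs below are made up.
S, A = 100, 4
demos = [(0, 1), (0, 1), (5, 2), (17, 0), (5, 2), (5, 3)]   # hypothetical (s_d, a_d) pairs
X = np.zeros((S, A), dtype=int)
for s, a in demos:
    X[s, a] += 1
```

Row i of X then plays the role of the multinomial observation x_c for covariate c = i in the inference problem.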
Both reconstruction methods are evaluated on a classical grid world scenario comprising S = 100 states and A = 4 actions. Each action triggers a noisy transition in one of the four cardinal directions such that the pattern of the resulting next-state distribution resembles a discretized Gaussian distribution centered around the targeted adjacent state. Rewards are distributed randomly in the environment. The expert follows a near-optimal stochastic policy, choosing actions from a softmax distribution obtained from the Q-values of the current state. An example scenario is shown in Fig. 1a, where the displayed arrows are obtained by weighting the four unit-length vectors associated with the action set A according to their local action probabilities. The reward locations are highlighted in green.

Fig. 1b shows the reconstruction of the policy obtained through the independent Dirichlet model. Since no dependencies between the local action distributions are considered in this approach, a posterior estimate can only be obtained for states where demonstration data is available, highlighted by the gray coloring of the background. For all remaining states, the mean estimate predicts a uniform action choice for the expert behavior since no action is preferred by the symmetry of the Dirichlet prior, resulting in an effective arrow length of zero. By contrast, the PG model (Fig. 1c) is able to generalize the expert behavior to unobserved regions of the state space, resulting in a significantly improved reconstruction of the policy (Fig. 1d).
To capture the underlying correlations, we used the Euclidean distance between the grid positions as covariate distance measure d and set l to the maximum occurring distance value.

4.2 Subgoal Modeling

In many situations, modeling the actions of an agent is not of primary interest or proves to be difficult, e.g., because a more comprehensive understanding of the agent's behavior is desired (see inverse reinforcement learning [16] and preference elicitation [24]) or because the policy is of complex form due to intricate system dynamics. A typical example is robot object manipulation, where contact-rich dynamics can make it difficult for a controller trained from a small number of demonstrations to appropriately generalize the expert behavior [31]. A simplistic example illustrating this problem is depicted in Fig. 2a, where the agent behavior is heavily affected by the geometry of the environment and the action profiles at two wall-separated states differ drastically. Similarly to Section 4.1, we aspire to reconstruct the shown behavior from a demonstration data set of the form D = {(s_d, a_d) ∈ S × A}_{d=1}^D, depicted in Fig. 2b. This time, however, we follow a conceptually different line of reasoning and assume that each state s ∈ S has an associated subgoal g_s that the agent is targeting at that state.

Figure 2: Subgoal modeling example. The expert policy in (a) targeting the green reward states is reconstructed from the demonstration data set in (b). By generalizing the demonstrations on the intentional level while taking into account the geometry of the problem, the PG subgoal model in (c) yields a significantly improved reconstruction compared to the corresponding action-based model in (d) and the uncorrelated subgoal model in (e). Red color encodes the Hellinger distance to the expert policy.
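As a toy illustration of such goal-directed behavior, the following sketches a hypothetical softmax action model on a grid, scoring each move by how close it brings the agent to the subgoal. This is only an illustrative stand-in, not the exact normalized softmax model of [27]:

```python
import numpy as np

# Hypothetical goal-dependent action model on a grid: each of the four moves is
# scored by the (negative) Euclidean distance to the subgoal after the move and
# passed through a softmax with inverse temperature beta.
MOVES = {0: (0, 1), 1: (0, -1), 2: (1, 0), 3: (-1, 0)}   # right, left, down, up

def action_distribution(s, g, beta=5.0, width=10):
    sy, sx = divmod(s, width)            # state index -> (row, column)
    gy, gx = divmod(g, width)
    scores = []
    for dy, dx in MOVES.values():
        ny, nx = sy + dy, sx + dx
        scores.append(-np.hypot(ny - gy, nx - gx))   # closer to the goal is better
    scores = beta * np.array(scores)
    p = np.exp(scores - scores.max())    # numerically stable softmax
    return p / p.sum()
```

Note that this toy model ignores walls; the point of the softmax model in [27] is precisely that obstacles can be respected through the underlying dynamics.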
Thus, action a_d is considered as being drawn from some goal-dependent action distribution p(a_d | s_d, g_{s_d}). For our example, we adopt the normalized softmax action model described in [27]. Spatial relationships between the agent's decisions are taken into account with the help of our PG framework, by coupling the probability vectors that govern the underlying subgoal selection process, i.e., g_s ~ Cat(p_s), where p_s is described through the stick-breaking construction in Eq. (1). Accordingly, the underlying covariate space of the PG model is C = S.

With the additional level of hierarchy introduced, the count data X to train our model is not directly available since the subgoals {g_s}_{s=1}^S are not observable. For demonstration purposes, instead of deriving the full variational update for the extended model, we follow a simpler strategy that leverages the existing inference framework within a Gibbs sampling procedure, switching between variational updates and drawing posterior samples of the latent subgoal variables. More precisely, we iterate between 1) computing the variational approximation in Eq. (3) for a given set of subgoals {g_s}_{s=1}^S, treating each subgoal as a single observation count, i.e., x_s = OneHot(g_s) ~ Mult(x_s | N_s = 1, p_s), and 2) updating the latent assignments based on the induced goal distributions, i.e., g_s ~ Cat(Π_SB(ψ_{s·})).

Fig. 2c shows the policy model obtained by averaging the predictive action distributions of M = 100 drawn subgoal configurations, i.e., π̂(a | s) = (1/M) Σ_{m=1}^M p(a | s, g_s^{⟨m⟩}), where g_s^{⟨m⟩} denotes the mth Gibbs sample of the subgoal assignment at state s. The obtained reconstruction is visibly better than the one produced by the corresponding imitation learning model in Fig. 2d, which interpolates the behavior on the action level and thus fails to navigate the agent around the walls. While the Dirichlet-based subgoal model (Fig. 2e) can generally account for the walls through the use of the underlying softmax action model, it cannot propagate the goal information to unvisited states. For the considered uninformative prior distribution over subgoal locations, this has the consequence that actions assigned to such states have the tendency to transport the agent to the center of the environment, as this is the dominating move obtained when blindly averaging over all possible goal locations.

4.3 System Identification & Bayesian Reinforcement Learning

Having focused our attention on learning a model of an observed policy, we now enter the realm of Bayesian reinforcement learning (BRL) and optimize a behavioral model to the particular dynamics of a given environment. For this purpose, we slightly modify our grid world from Section 4.1 by placing a target reward of +1 in one corner and repositioning the agent to the opposite corner whenever the target state is reached (compare "Grid10" domain in [6]). For the experiment, we assume that the agent is aware of the target reward but does not know the transition dynamics of the environment.

4.3.1 System Identification

To begin with, we ignore the reward mechanism altogether and focus on learning the transition dynamics of the environment. To this end, we let the agent perform a random walk on the grid, choosing actions uniformly at random and observing the resulting state transitions. The recorded state-action sequence (s_1, a_1, s_2, a_2, ..., a_{T−1}, s_T) is summarized in the form of count matrices [X^{(1)}, ..., X^{(A)}], where the element (X^{(a)})_{ij} = Σ_{t=1}^T 1(a_t = a ∧ s_t = i ∧ s_{t+1} = j) represents the number of observed transitions from state i to j for action a. Analogously to the previous two experiments, we estimate the transition dynamics of the environment from these matrices using an independent Dirichlet prior model and our PG framework, where we employ a separate model for each transition count matrix. The resulting estimation accuracy is described by the graphs in Fig. 3a, which show the distance between the ground truth dynamics of the environment and those predicted by the models, averaged over all states and actions. As expected, our PG model significantly outperforms the naive Dirichlet approach.

Figure 3: Bayesian reinforcement learning results. (a) Estimation error of the transition dynamics over the number of observed transitions. Shown are the Hellinger distances to the true next-state distribution and the standard deviation of the estimation error, both averaged over all states and actions of the MDP. (b) Expected returns of the learned policies (normalized by the optimal return) when replanning with the estimated transition dynamics after every fiftieth state transition.

4.3.2 Bayesian Reinforcement Learning

Next, we consider the problem of combined model-learning and decision-making by exploiting the experience gathered from previous system interactions to optimize future behavior. Bayesian reinforcement learning offers a natural playground for this task as it intrinsically balances the importance of information gathering and instantaneous reward maximization, avoiding the exploration-exploitation dilemma encountered in classical reinforcement learning schemes [5].

To determine the optimal trade-off between these two competing objectives computationally, we follow the principle of posterior sampling for reinforcement learning (PSRL) [17], where future actions are planned using a probabilistic model of the environment's transition dynamics.
Herein, we\nconsider two variants: (1) In the \ufb01rst variant, we compute the optimal Q-values for a \ufb01xed number of\nposterior samples representing instantiations of the transition model and choose the policy that yields\nthe highest expected return on average. (2) In the second variant, we select the greedy policy dictated\nby the posterior mean of the transition dynamics. In both cases, the obtained policy is followed for a\n\ufb01xed number of transitions before new observations are taken into account for updating the posterior\ndistribution. Fig. 3b shows the expected returns of the so-obtained policies over the entire execution\nperiod for the three prior models evaluated in Fig. 3a and both PSRL variants. The graphs reveal that\nthe PG approach requires signi\ufb01cantly fewer transitions to learn an effective decision-making strategy.\n\n4.3.3 Queueing Network Modeling\nAs a \ufb01nal experiment, we evaluate our model on a network scheduling problem, depicted in Fig. 4a.\nThe considered two-server network consists of two queues with buffer lengths B1 = B2 = 10.\nThe state of the system is determined by the number of packets in each queue, summarized by the\nqueueing vector b = [b1, b2], where bi denotes the number of packets in queue i. The underlying\nsystem state space is S = {0, . . . , B1}\u21e5{ 0, . . . , B2} with size S = (B1 + 1)(B2 + 1).\nFor our experiment, we consider a system with batch arrivals and batch servicing. The task for\nthe agent is to schedule the traf\ufb01c \ufb02ow of the network under the condition that only one of the\nqueues can be processed at a time. Accordingly, the actions are encoded as a = 1 for serving\nqueue 1 and a = 2 for serving queue 2. 
Figure 4: BRL for batch queueing. (a) Considered two-server queueing network. (b) Expected returns over the number of learning episodes, each consisting of twenty state transitions.

The number of packets arriving at queue 1 is modeled as q_1 ~ Pois(q_1 | ϑ_1) with mean rate ϑ_1 = 1. The packets are transferred to buffer 1 and subsequently processed in batches of random size q_2 ~ Pois(q_2 | ϑ_2), provided that the agent selects queue 1. Therefore, ϑ_2 = β_1 1(a = 1), where we consider an average batch size of β_1 = 3. Processed packets are transferred to the second queue, where they wait to be processed further in batches of size q_3 ~ Pois(q_3 | ϑ_3), with ϑ_3 = β_2 1(a = 2) and an average batch size of β_2 = 2. The resulting transition to the new queueing state b' after one processing step can be compactly written as b' = [(b_1 + q_1 − q_2)_0^{B_1}, (b_2 + q_2 − q_3)_0^{B_2}], where the truncation operation (·)_0^B = max(0, min(B, ·)) accounts for the nonnegativity and finiteness of the buffers. The reward function, which is known to the agent, computes the negative sum of the queue lengths, R(b) = −(b_1 + b_2). Despite the simplistic architecture of the network, finding an optimal policy for this problem is challenging since determining the state transition matrices requires nontrivial calculations involving concatenations of Poisson distributions. More importantly, when applied in a real-world context, the arrival and processing rates of the network are typically unknown so that planning-based methods cannot be applied.

Fig. 4b shows the evaluation of PSRL on the network.
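The batch transition rule above can be simulated directly. The following sketch follows the stated update b' = [(b_1 + q_1 − q_2)_0^{B_1}, (b_2 + q_2 − q_3)_0^{B_2}] literally, with the rates given in the text; evaluating the reward at the new state is a choice made here for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)

# One transition of the two-queue batch network: arrivals q1 ~ Pois(1) into queue 1;
# serving queue 1 (a = 1) draws a Pois(3) batch toward queue 2, serving queue 2
# (a = 2) drains a Pois(2) batch; both buffers are truncated to [0, B].
B1 = B2 = 10

def step(b, a):
    b1, b2 = b
    q1 = rng.poisson(1.0)                    # arrivals into queue 1, mean rate 1
    q2 = rng.poisson(3.0) if a == 1 else 0   # batch moved from queue 1 to queue 2
    q3 = rng.poisson(2.0) if a == 2 else 0   # batch drained from queue 2
    b1_new = max(0, min(B1, b1 + q1 - q2))   # truncation (.)_0^B from the text
    b2_new = max(0, min(B2, b2 + q2 - q3))
    return (b1_new, b2_new), -(b1_new + b2_new)   # reward R(b) = -(b1 + b2)
```

A learning agent only ever sees such sampled transitions, which is exactly the setting in which the rates are unknown and planning is impossible.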
As in the previous experiment, we use a separate PG model for each action and compute the covariance matrix Σ_θ based on the normalized Euclidean distances between the queueing states of the system. This encodes our prior knowledge that the queue lengths obtained after servicing two independent copies of the network tend to be similar if their previous buffer states were similar. Our agent follows a greedy strategy w.r.t. the posterior mean of the estimated model. The policy is evaluated after each policy update by performing one thousand steps from all possible queueing states of the system. As the graphs reveal, the PG approach significantly outperforms its correlation-agnostic counterpart, requiring fewer interactions with the system while yielding better scheduling strategies by generalizing the network's dynamics over queueing states.

5 Conclusion

With the proposed variational PG model, we have presented a self-contained learning framework for flexible use in many common decision-making contexts. The framework allows an intuitive consideration of prior knowledge about the behavior of an agent and the structures of its environment, which can significantly boost the predictive performance of the resulting models by leveraging correlations and reoccurring patterns in the decision-making process. A key feature is the adjustment of the model regularization through automatic calibration of its hyper-parameters to the specific decision-making scenario at hand, which provides a built-in solution to infer the effective range of correlations from the data. We have evaluated the framework on various benchmark tasks including a realistic queueing problem, which in a real-world situation admits no planning-based solution due to unknown system parameters.
In all presented scenarios, our framework consistently outperformed the naive baseline methods, which neglect the rich statistical relationships underlying the estimation problems.

Acknowledgments

This work has been funded by the German Research Foundation (DFG) as part of the projects B4 and C3 within the Collaborative Research Center (CRC) 1053 – MAKI.

References

[1] D. M. Blei, A. Kucukelbir, and J. D. McAuliffe. Variational inference: A review for statisticians. Journal of the American Statistical Association, 112(518):859–877, 2017.

[2] G. Chalkiadakis and C. Boutilier. Coordination in multiagent reinforcement learning: A Bayesian approach. In International Joint Conference on Autonomous Agents and Multiagent Systems, pages 709–716. ACM, 2003.

[3] M. Deisenroth and C. E. Rasmussen. PILCO: A model-based and data-efficient approach to policy search. In International Conference on Machine Learning, pages 465–472, 2011.

[4] Y. Engel, S. Mannor, and R. Meir. Bayes meets Bellman: The Gaussian process approach to temporal difference learning. In International Conference on Machine Learning, pages 154–161, 2003.

[5] M. Ghavamzadeh, S. Mannor, J. Pineau, A. Tamar, et al. Bayesian reinforcement learning: A survey. Foundations and Trends in Machine Learning, 8(5–6):359–483, 2015.

[6] A. Guez, D. Silver, and P. Dayan. Efficient Bayes-adaptive reinforcement learning using sample-based search. In Advances in Neural Information Processing Systems, pages 1025–1033, 2012.

[7] S. Gupta, G. Joshi, and O. Yağan. Correlated multi-armed bandits with a latent random source. arXiv:1808.05904, 2018.

[8] T. Hester and P. Stone. TEXPLORE: Real-time sample-efficient reinforcement learning for robots. Machine Learning, 90(3):385–429, 2013.

[9] M. Hoffman, B. Shahriari, and N. de Freitas.
On correlation and budget constraints in model-based bandit optimization with application to automatic machine learning. In Artificial Intelligence and Statistics, pages 365–374, 2014.

[10] H. Ishwaran and L. F. James. Gibbs sampling methods for stick-breaking priors. Journal of the American Statistical Association, 96(453):161–173, 2001.

[11] T. W. Killian, S. Daulton, G. Konidaris, and F. Doshi-Velez. Robust and efficient transfer learning with hidden parameter Markov decision processes. In Advances in Neural Information Processing Systems, pages 6250–6261, 2017.

[12] A. Krause and C. S. Ong. Contextual Gaussian process bandit optimization. In Advances in Neural Information Processing Systems, pages 2447–2455, 2011.

[13] J. D. Lafferty and D. M. Blei. Correlated topic models. In Advances in Neural Information Processing Systems, pages 147–154, 2006.

[14] S. Linderman, M. Johnson, and R. P. Adams. Dependent multinomial models made easy: Stick-breaking with the Pólya-Gamma augmentation. In Advances in Neural Information Processing Systems, pages 3456–3464, 2015.

[15] K. P. Murphy. Machine Learning: A Probabilistic Perspective. MIT Press, 2012.

[16] A. Y. Ng and S. J. Russell. Algorithms for inverse reinforcement learning. In International Conference on Machine Learning, pages 663–670, 2000.

[17] I. Osband, D. Russo, and B. Van Roy. (More) efficient reinforcement learning via posterior sampling. In Advances in Neural Information Processing Systems, pages 3003–3011, 2013.

[18] A. Paraschos, C. Daniel, J. R. Peters, and G. Neumann. Probabilistic movement primitives. In Advances in Neural Information Processing Systems, pages 2616–2624, 2013.

[19] M. Pavlov and P. Poupart. Towards global reinforcement learning. In NIPS Workshop on Model Uncertainty and Risk in Reinforcement Learning, 2008.

[20] N. G. Polson, J. G. Scott, and J. Windle.
Bayesian inference for logistic models using Pólya-Gamma latent variables. Journal of the American Statistical Association, 108(504):1339–1349, 2013.

[21] P. Poupart, N. Vlassis, J. Hoey, and K. Regan. An analytic solution to discrete Bayesian reinforcement learning. In International Conference on Machine Learning, pages 697–704. ACM, 2006.

[22] P. Ranchod, B. Rosman, and G. Konidaris. Nonparametric Bayesian reward segmentation for skill discovery using inverse reinforcement learning. In IEEE/RSJ International Conference on Intelligent Robots and Systems, pages 471–477, 2015.

[23] C. E. Rasmussen and C. K. I. Williams. Gaussian Processes for Machine Learning. MIT Press, 2005.

[24] C. A. Rothkopf and C. Dimitrakakis. Preference elicitation and inverse reinforcement learning. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pages 34–48, 2011.

[25] A. Šošić, A. M. Zoubir, and H. Koeppl. A Bayesian approach to policy recognition and state representation learning. IEEE Transactions on Pattern Analysis and Machine Intelligence, 40(6):1295–1308, 2017.

[26] A. Šošić, A. M. Zoubir, and H. Koeppl. Policy recognition via expectation maximization. In IEEE International Conference on Acoustics, Speech and Signal Processing, pages 4801–4805, 2016.

[27] A. Šošić, E. Rueckert, J. Peters, A. M. Zoubir, and H. Koeppl. Inverse reinforcement learning via nonparametric spatio-temporal subgoal modeling. Journal of Machine Learning Research, 19(69):1–45, 2018.

[28] R. S. Sutton and A. G. Barto. Reinforcement Learning: An Introduction. MIT Press, 2018.

[29] M. K. Titsias. The infinite Gamma-Poisson feature model. In Advances in Neural Information Processing Systems, pages 1513–1520, 2008.

[30] F. Wenzel, T. Galy-Fajou, C. Donner, M. Kloft, and M. Opper.
Efficient Gaussian process classification using Pólya-Gamma data augmentation. In AAAI Conference on Artificial Intelligence, volume 33, pages 5417–5424, 2019.

[31] Y. Zhu, Z. Wang, J. Merel, A. Rusu, T. Erez, S. Cabi, S. Tunyasuvunakool, J. Kramár, R. Hadsell, N. de Freitas, et al. Reinforcement and imitation learning for diverse visuomotor skills. arXiv:1802.09564, 2018.