{"title": "Scalable Coordinated Exploration in Concurrent Reinforcement Learning", "book": "Advances in Neural Information Processing Systems", "page_first": 4219, "page_last": 4227, "abstract": "We consider a team of reinforcement learning agents that concurrently operate in a common environment, and we develop an approach to efficient coordinated exploration that is suitable for problems of practical scale. Our approach builds on the seed sampling concept introduced in Dimakopoulou and Van Roy (2018) and on a randomized value function learning algorithm from Osband et al. (2016). We demonstrate that, for simple tabular contexts, the approach is competitive with those previously proposed in Dimakopoulou and Van Roy (2018) and with a higher-dimensional problem and a neural network value function representation, the approach learns quickly with far fewer agents than alternative exploration schemes.", "full_text": "Scalable Coordinated Exploration in\nConcurrent Reinforcement Learning\n\nMaria Dimakopoulou\nStanford University\n\nmadima@stanford.edu\n\nIan Osband\n\nGoogle DeepMind\n\niosband@google.com\n\nBenjamin Van Roy\nStanford University\nbvr@stanford.edu\n\nAbstract\n\nWe consider a team of reinforcement learning agents that concurrently operate\nin a common environment, and we develop an approach to ef\ufb01cient coordinated\nexploration that is suitable for problems of practical scale. Our approach builds on\nseed sampling[1] and randomized value function learning [11]. We demonstrate\nthat, for simple tabular contexts, the approach is competitive with previously\nproposed tabular model learning methods [1]. With a higher-dimensional problem\nand a neural network value function representation, the approach learns quickly\nwith far fewer agents than alternative exploration schemes.\n\n1\n\nIntroduction\n\nConsider a farm of robots operating concurrently, learning how to carry out a task, as studied in\n[3]. 
There are benefits to scale, since a larger number of robots can gather and share larger volumes of data that enable each to learn faster. These benefits are most dramatic if the robots explore in a coordinated fashion, diversifying their learning goals and adapting appropriately as data is gathered. Web services present a similar situation, as considered in [18]. Each user is served by an agent, and the collective of agents can accelerate learning by intelligently coordinating how they experiment. Considering its importance, the problem of coordinated exploration in reinforcement learning has received surprisingly little attention; while [3] and [18] consider teams of agents that gather data in parallel, they do not address coordination of data gathering, though this can be key to team performance. Dimakopoulou and Van Roy [1] recently identified properties that are essential to efficient coordinated exploration and proposed suitable tabular model learning methods based on seed sampling. Though this represents a conceptual advance, the methods do not scale to meet the needs of practical applications, which require generalization to address intractable state spaces. In this paper, we develop scalable reinforcement learning algorithms that aim to efficiently coordinate exploration, and we present computational results that establish their substantial benefit.\nWork on coordinated exploration builds on a large literature that addresses efficient exploration in single-agent reinforcement learning (see, e.g., [6, 5, 21]). A growing segment of this literature studies and extends posterior sampling for reinforcement learning (PSRL) [19], which has led to statistically efficient and computationally tractable approaches to exploration [10, 12, 13]. 
The methods we will propose leverage this line of work, particularly the use of randomized value function learning [14].\n\n32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montr\u00e9al, Canada.\n\nThe problem we address is known as concurrent reinforcement learning [18, 15, 4, 16, 1]. A team of reinforcement learning agents interact with the same unknown environment, share data with one another, and learn in parallel how to operate effectively. To learn efficiently in such settings, the agents should coordinate their exploratory effort. Three properties essential to efficient coordinated exploration, identified in [1], are real-time adaptivity to shared observations, commitment to carry through with action sequences that reveal new information, and diversity across learning opportunities pursued by different agents. That paper demonstrated that upper-confidence-bound (UCB) exploration schemes for concurrent reinforcement learning (concurrent UCRL), such as those discussed in [15, 4, 16], fail to satisfy the diversity property due to their deterministic nature. Further, a straightforward extension of PSRL to the concurrent multi-agent setting, in which each agent independently samples a new MDP at the start of each time period, as done in [7], fails to satisfy the commitment property because the agents are unable to explore the environment thoroughly [17]. As an alternative, [1] proposed seed sampling, which extends PSRL in a manner that simultaneously satisfies the three properties. The idea is that each concurrent agent independently samples a random seed; a mapping from seeds to MDPs is then determined by the prevailing posterior distribution. Independence among seeds diversifies exploratory effort among agents. 
If the mapping is defined in an appropriate manner, the fact that each agent maintains a consistent seed ensures a sufficient degree of commitment, while the fact that the posterior adapts to new data allows each agent to react intelligently to new information.\nAlgorithms presented in [1] are tabular and hence do not scale to address intractable state spaces. Further, computational studies carried out in [1] focus on simple stylized problems designed to illustrate the benefits of seed sampling. In the next section, we demonstrate that observations made in these stylized contexts extend to a more realistic problem involving swinging up and balancing a pole. Subsequent sections extend the seed sampling concept to operate with generalizing randomized value functions [14], leading to new algorithms such as seed temporal-difference learning (seed TD) and seed least-squares value iteration (seed LSVI). We show that on tabular problems, these scalable seed sampling algorithms perform as well as the tabular seed sampling algorithms of [1]. Finally, we present computational results demonstrating the effectiveness of one of our new algorithms applied in conjunction with a neural network representation of the value function on another pole balancing problem with a state space too large to be addressed by tabular methods.\n\n2 Seeding with Tabular Representations\n\nThis section shows that the advantages of seed sampling over alternative exploration schemes extend beyond the toy problems with known transition dynamics and a handful of unknown rewards considered in [1]. We consider a problem that is more realistic and complex, but of sufficiently small scale to be addressed by tabular methods, in which a group of agents learns to swing up and balance a pole. 
We demonstrate that seed sampling learns to achieve the goal quickly and with far fewer agents than other exploration strategies.\nIn the classic problem [20], a pole is attached to a cart that moves on a frictionless rail. We modify the problem so that deep exploration is crucial to identifying rewarding states and thus learning the optimal policy. Unlike the traditional cartpole problem, where the interaction begins with the pole stood upright and the agent must learn to balance it, in our problem the interaction begins with the pole hanging down and the agent must learn to swing it up. The cart moves on an infinite rail. Concretely, the agent interacts with the environment through the state st = (\u03c6t, \u02d9\u03c6t) \u2208 \u211d^2, where \u03c6t is the angle of the pole from the vertical, upright position \u03c6 = 0 and \u02d9\u03c6t is the respective angular velocity. The cart is of mass M = 1 and the pole has mass m = 0.1 and length l = 1, with acceleration due to gravity g = 9.8. At each timestep the agent can apply a horizontal force Ft to the cart. The second order differential equation governing the system is \u00a8\u03c6t = (g sin(\u03c6t) \u2212 cos(\u03c6t)\u03c4t) / ((l/2)(4/3 \u2212 (m/(m+M)) cos(\u03c6t)^2)), where \u03c4t = (Ft + (m l/2) \u02d9\u03c6t^2 sin(\u03c6t))/(m+M) [11]. We discretize the evolution of this second order differential equation with timescale \u2206t = 0.02 and present a choice of actions Ft \u2208 {\u221210, 0, 10} for all t. At each timestep the agent pays a cost |Ft|/1000 for its action but can receive a reward of 1 if the pole is balanced upright (cos(\u03c6t) > 0.75) and steady (angular velocity less than 1). The interaction ends after 1000 actions, i.e. at t = 20. 
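As a concrete illustration of the dynamics and reward just described, a minimal Euler-discretized simulation might look as follows. This is a sketch under our own naming (`step`, `reward` are illustrative, not the paper's code):

```python
import math

# Modified cartpole swing-up: pole starts hanging down (phi = pi).
# Constants from the text; Euler discretization with dt = 0.02.
M, m, l, g, dt = 1.0, 0.1, 1.0, 9.8, 0.02

def step(phi, phi_dot, F):
    """Advance (phi, phi_dot) one timestep under horizontal force F."""
    tau = (F + 0.5 * m * l * phi_dot**2 * math.sin(phi)) / (m + M)
    phi_ddot = (g * math.sin(phi) - math.cos(phi) * tau) / (
        0.5 * l * (4.0 / 3.0 - m * math.cos(phi)**2 / (m + M)))
    phi_dot = phi_dot + dt * phi_ddot
    phi = (phi + dt * phi_dot) % (2 * math.pi)
    return phi, phi_dot

def reward(phi, phi_dot, F):
    # Cost |F|/1000 per action; +1 when upright (cos phi > 0.75) and steady.
    r = -abs(F) / 1000.0
    if math.cos(phi) > 0.75 and abs(phi_dot) < 1.0:
        r += 1.0
    return r
```

Starting from the downward position `(math.pi, 0.0)`, no reward is available until the pole has been swung up, which is what makes deep exploration essential here.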
The environment is modeled as a time-homogeneous MDP, which is identified by M = (S, A, R, P, \u03c1), where S is the discretized state space [0, 2\u03c0] \u00d7 [\u22122\u03c0, 2\u03c0], A = {\u221210, 0, 10} is the action space, R is the reward model, P is the transition model and \u03c1 is the initial state distribution.\nConsider a group of K agents, who explore and learn to operate in parallel in this common environment. Each kth agent begins at state sk,0 = (\u03c0, 0) + wk, where each component of wk is uniformly distributed in [\u22120.05, 0.05]. Each agent k takes an action at arrival times tk,1, tk,2, . . . , tk,H of an independent Poisson process with rate \u03ba = 1. At time tk,m, the agent takes action ak,m, transitions from state sk,m\u22121 to state sk,m and observes reward rk,m. The agents are uncertain about the transition structure P and share a common Dirichlet prior over the transition probabilities associated with each state-action pair (s, a) \u2208 S \u00d7 A with parameters \u03b10(s, a, s') = 1, for all s' \u2208 S. 
The agents are also uncertain about the reward structure R and share a common Gaussian prior over the reward associated with each state-action pair (s, a) \u2208 S \u00d7 A with parameters \u00b50(s, a) = 0, \u03c3^2_0(s, a) = 1. Agents share information in real time and update their posterior beliefs.\n\nFigure 1: Performance of PSRL (no adaptivity), concurrent UCRL (no diversity), Thompson resampling (no commitment) and seed sampling in the tabular problem of learning how to swing and keep upright a pole attached to a cart that moves left and right on an infinite rail.\n\nWe compare seed sampling with three baselines: PSRL, concurrent UCRL and Thompson resampling. In PSRL, each agent k samples an MDP Mk,0 from the common prior at time tk,0 and computes the optimal policy \u03c0k,0(\u00b7) with respect to Mk,0, which does not change throughout the agent\u2019s interaction with the environment. Therefore, the PSRL agents do not adapt to the new information in real time. On the other hand, in concurrent UCRL, Thompson resampling and seed sampling, at each time tk,m, agent k generates a new MDP Mk,m based on the data gathered by all agents up to that time, computes the optimal policy \u03c0k,m for Mk,m and takes an action ak,m = \u03c0k,m(sk,m\u22121) according to the new policy. Concurrent UCRL is a deterministic approach according to which all the parallel agents construct the same optimistic MDP conditioned on the common shared information up to that time. Therefore, the concurrent UCRL agents do not diversify their exploratory effort. Thompson resampling has each agent independently sample a new MDP at each time period from the common posterior distribution conditioned on the shared information up to that time. 
Resampling an MDP independently at each time period breaks the agent\u2019s intent to pursue a sequence of actions revealing the rare reward states. Therefore, the Thompson resampling agents do not commit. Finally, in seed sampling, at the beginning of the experiment, each agent k samples a random seed \u03c9k with two components that remain fixed throughout the experiment. The first component is |S|^2|A| sequences of independent and identically distributed Exp(1) random variables; the second component is |S||A| independent and identically distributed N(0, 1) random variables. At each time tk,m, agent k maps the data gathered by all agents up to that time and its seed \u03c9k to an MDP Mk,m by combining the Exponential-Dirichlet seed sampling and the standard-Gaussian seed sampling methods described in [1]. Independence among seeds diversifies exploratory effort among agents. The fact that the agent maintains a consistent seed leads to a sufficient degree of commitment, while the fact that the posterior adapts to new data allows the agent to react intelligently to new information.\nAfter the end of the learning interaction, there is an evaluation of what the group of K agents learned. The performance of each algorithm is measured with respect to the reward achieved during this evaluation, where a greedy agent starts at s0 = (\u03c0, 0), generates the expected MDP of the cartpole environment based on the posterior beliefs formed by the K parallel agents at the end of their learning, and interacts with the cartpole as dictated by the optimal policy with respect to this MDP. 
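The seed-to-MDP mapping can be sketched roughly as follows. This is our illustrative reconstruction, not the paper's code: it uses the fact that, for integer Dirichlet parameters, a Gamma(alpha, 1) draw can be formed as the sum of the first alpha terms of a fixed Exp(1) sequence, so a fixed seed combined with updated posterior counts yields a deterministic posterior sample (commitment plus adaptivity), while independent seeds across agents yield diversity:

```python
import numpy as np

rng = np.random.default_rng(0)

# Per-agent seed: fixed Exp(1) streams per (s, a, s') and a fixed N(0, 1)
# variable per (s, a); drawn once and reused for the whole experiment.
def draw_seed(n_states, n_actions, stream_len=1000):
    return {
        "exp": rng.exponential(1.0, size=(n_states, n_actions, n_states, stream_len)),
        "gauss": rng.normal(0.0, 1.0, size=(n_states, n_actions)),
    }

# Exponential-Dirichlet: a Dirichlet sample is built from Gamma draws, each
# the sum of the first alpha(s, a, s') seed variables, then normalized.
def sample_transitions(seed, alpha):
    gammas = np.array([
        [[seed["exp"][s, a, s2, :alpha[s, a, s2]].sum()
          for s2 in range(alpha.shape[2])]
         for a in range(alpha.shape[1])]
        for s in range(alpha.shape[0])])
    return gammas / gammas.sum(axis=2, keepdims=True)

# Standard-Gaussian: posterior mean plus posterior std times the fixed seed.
def sample_rewards(seed, mu, sigma):
    return mu + sigma * seed["gauss"]
```

Calling `sample_transitions` twice with the same counts returns the identical MDP sample, whereas growing counts move the sample toward the data, which is the behavior the three properties require.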
Figure 1 plots the reward achieved by the evaluation agent for increasing numbers of PSRL, seed sampling, concurrent UCRL and Thompson resampling agents operating in parallel in the cartpole environment. As the number of parallel learning agents grows, seed sampling quickly increases its evaluation reward and soon attains a high reward within only 20 seconds of learning. On the other hand, the evaluation reward achieved by episodic PSRL (no adaptivity), concurrent UCRL (no diversity), and Thompson resampling (no commitment) does not improve at all, or improves at a much slower rate, as the number of parallel agents increases.\n\n3 Seeding with Generalizing Representations\n\nAs we demonstrated in Section 2, seed sampling can offer great advantage over other exploration schemes. However, our examples involved tabular learning, and the algorithms we considered do not scale gracefully to address practical problems that typically pose enormous state spaces. In this section, we propose an algorithmic framework that extends the seeding concept from tabular to generalizing representations. This framework supports scalable reinforcement learning algorithms with the degrees of adaptivity, commitment, and intent required for efficient coordinated exploration.\nWe consider algorithms with which each agent is instantiated with a seed and then learns a parameterized value function over the course of operation. When data is insufficient, the seeds govern behavior. As data accumulates and is shared across agents, each agent perturbs each observation in a manner distinguished by its seed before training its value function on the data. The varied perturbations of shared observations result in diverse value function estimates and, consequently, diverse behavior. 
By maintaining a constant seed throughout learning, an agent does not change its interpretation of the same observation from one time period to the next, and this achieves the desired level of commitment, which can be essential in the presence of delayed consequences. Finally, by using parameterized value functions, agents can cope with intractably large state spaces. Section 3.1 offers a more detailed description of our proposed algorithmic framework, and Section 3.2 provides examples of algorithms that fit this framework.\n\n3.1 Algorithmic Framework\n\nThere are K agents, indexed 1, . . . , K. The agents operate over H time periods in identical environments, each with state space S and action space A. Denote by tk,m the time at which agent k applies its mth action. The agents may progress synchronously (tk,m = tk',m) or asynchronously (tk,m \u2260 tk',m). Each agent k begins at state sk,0. At time tk,m, agent k is at state sk,m, takes action ak,m, observes reward rk,m and transitions to state sk,m+1. In order for the agents to adapt their policies in real time, each agent has access to a buffer B with observations of the form (s, a, r, s'). This buffer stores past observations of all K agents. Denote by Bt the content of this buffer at time t.\nWith value function learning, agent k uses a family \u02dcQk of state-action value functions indexed by a set of parameters \u0398k. Each \u03b8 \u2208 \u0398k defines a state-action value function \u02dcQk,\u03b8 : S \u00d7 A \u2192 \u211d. The value \u02dcQk,\u03b8(s, a) could be, for example, the output of a neural network with weights \u03b8 in response to an input (s, a). Initially, the agents may have prior beliefs over the parameter \u03b8, such as the expectation, \u00af\u03b8, or the level of uncertainty, \u03bb, on \u03b8.\nAgents diversify their behavior through a seeding mechanism. Under this mechanism, each agent k is instantiated with a seed \u03c9k. 
Seed \u03c9k is intrinsic to agent k and differentiates how agent k interprets the common history of observations in the buffer B. A form of seeding is that each agent k can independently and randomly perturb observations in the buffer. For example, different agents k, k' can add different noise terms zk,j and zk',j of variance v, determined by seeds \u03c9k and \u03c9k' respectively, to the reward from the same jth observation (sj, aj, rj, s'j) in the buffer B, as discussed in [14] for the single-agent setting. This induces diversity by creating modified training sets from the same history among the agents. Based on the prior distribution for the parameter \u03b8, agent k can initialize the value function with a sample \u02c6\u03b8k from this distribution, with the seed \u03c9k providing the source of randomness. These independent value function parameter samples diversify the exploration in initial stages of operation. The seed \u03c9k remains fixed throughout the course of learning. This induces a level of commitment in agent k, which can be important in reinforcement learning settings where delayed consequences are present.\nAt time tk,m, before taking the mth action, agent k fits its generalized representation model on the history (or a subset thereof) of observations (sj, aj, rj, s'j), j = 1, . . . , |Btk,m|, perturbed by the noise seeds zk,j. The initial parameter seed \u02c6\u03b8k can also play a role in subsequent stages of learning, other than the first time period, by influencing the model fitting. An example of employing the initial parameter seed \u02c6\u03b8k in the model fitting of subsequent time periods is by having a function \u03c8(\u00b7) as a regularization term in which \u02c6\u03b8k appears. By this model fitting, agent k obtains parameters \u03b8k,m at time period tk,m. 
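The per-agent perturbation of a shared buffer can be sketched in a few lines. This is a minimal illustration with names of our choosing: deriving the noise terms from the agent's fixed seed makes the jth observation look the same to agent k every time it retrains (commitment), while looking different across agents (diversity):

```python
import numpy as np

v = 0.01  # reward-noise variance (illustrative value)

def perturbed_view(buffer, agent_seed, v=v):
    """Return agent-specific training targets for a shared buffer.

    buffer: list of (s, a, r, s') tuples shared by all agents.
    agent_seed: the agent's fixed seed omega_k; reseeding the generator
    with it reproduces the same z_{k,j} on every call.
    """
    rng = np.random.default_rng(agent_seed)
    z = rng.normal(0.0, np.sqrt(v), size=len(buffer))
    return [(s, a, r + z_j, s2) for (s, a, r, s2), z_j in zip(buffer, z)]

buffer = [(0, 1, 1.0, 2), (2, 0, 0.0, 3)]
view_k = perturbed_view(buffer, agent_seed=7)
view_k_again = perturbed_view(buffer, agent_seed=7)  # identical: commitment
view_other = perturbed_view(buffer, agent_seed=8)    # different: diversity
```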
These parameters define a state-action value function \u02dcQk,\u03b8k,m(\u00b7, \u00b7) based on which a policy is computed. Based on the obtained policy and its current state sk,m, the agent takes a greedy action ak,m, observes reward rk,m and transitions to state sk,m+1. Agent k stores this observation (sk,m, ak,m, rk,m, sk,m+1) in the buffer B so that all agents can access it next time they fit their models. For learning problems with long learning periods, it may be practical to cap the common buffer at a certain capacity C and, once this capacity is exceeded, to start overwriting observations at random. In this case, the way observations are overwritten can also be different for each agent and determined by seed \u03c9k (e.g. by \u03c9k also defining a random permutation of indices 1, . . . , C).\nThe ability of the agents to make decisions in the high-dimensional environments of real systems, where the number of states is enormous or even infinite, is achieved through the value function representations, while coordination of the exploratory effort of the group of agents is achieved through the way that the seeding mechanism controls the fitting of these generalized representations. As the number of parallel agents increases, this framework enables the agents to learn to operate and achieve high rewards in complex environments very quickly.\n\n3.2 Examples of Algorithms\n\nWe now present examples of algorithms that fit the framework of Section 3.1.\nIn our proposed algorithms, agents share a Gaussian prior over unknown parameters \u03b8* \u223c N(\u00af\u03b8, \u03bbI) and a Gaussian likelihood, N(0, v). Each agent k independently samples noise seeds zk,j \u223c N(0, v) for each observation j in the buffer and an initial parameter seed \u02c6\u03b8k \u223c N(\u00af\u03b8, \u03bbI). These seeds remain fixed throughout learning. 
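In the linear-Gaussian case, the perturbation trick behind these algorithms has a closed form, and its posterior-sampling property can be checked numerically. The sketch below is ours (hypothetical names): it solves argmin over theta of (1/v)||y + z - X theta||^2 + (1/lambda)||theta - theta_hat||^2 with freshly drawn seeds, which in this case yields an exact sample from the posterior of theta*:

```python
import numpy as np

def posterior_sample(X, y, theta_bar, lam, v, rng):
    """One randomized-regression posterior sample for linear f_theta."""
    d = X.shape[1]
    theta_hat = rng.normal(theta_bar, np.sqrt(lam), size=d)  # parameter seed
    z = rng.normal(0.0, np.sqrt(v), size=len(y))             # noise seeds
    # Normal equations of the perturbed, regularized least-squares problem.
    A = X.T @ X / v + np.eye(d) / lam
    b = X.T @ (y + z) / v + theta_hat / lam
    return np.linalg.solve(A, b)
```

Averaging many such samples recovers the analytic posterior mean (X^T X / v + I / lambda)^{-1} (X^T y / v + theta_bar / lambda); in the algorithms above, each agent instead fixes its seeds once, so it holds a single coherent posterior sample that adapts as the shared data grows.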
We now explain how the algorithms we propose satisfy the three properties of efficient coordinated exploration.\n1. Adaptivity: The key idea behind randomized value functions is that fitting a model to a randomly perturbed prior and randomly perturbed observations can be used to generate posterior samples or approximate posterior samples. Consider the data (X, y) = ({xj}^N_{j=1}, {yj}^N_{j=1}), where yj = \u03b8*^T xj + \u03b5j, with IID \u03b5j \u223c N(0, v). Let f\u03b8(x) = \u03b8^T x, \u02c6\u03b8 \u223c N(\u00af\u03b8, \u03bbI) and zj \u223c N(0, v). Then, the solution to argmin_\u03b8 (1/v) \u03a3j (yj + zj \u2212 f\u03b8(xj))^2 + (1/\u03bb)\u2016\u03b8 \u2212 \u02c6\u03b8\u2016^2_2 is a sample from the posterior of \u03b8* given (X, y) [14]. This sample can be computed for non-linear f\u03b8 as well, although it will not be from the exact posterior. In the concurrent setting, each agent k draws an initial parameter seed \u02c6\u03b8k \u223c N(\u00af\u03b8, \u03bbI) and noise seeds zk,1, zk,2, \u00b7\u00b7\u00b7 \u223c N(0, v); at each time period it can solve this value-function optimization problem to obtain a posterior parameter sample based on the high-dimensional observations gathered by all agents so far.\n2. Diversity: The independence of the initial parameter seeds \u02c6\u03b8k and noise seeds zk,j among agents diversifies exploration both when there are no available observations and when the agents have access to the same shared observations.\n3. Commitment: Each agent k applies the same perturbation zk,j to each jth observation and uses the same regularization \u02c6\u03b8k throughout learning; this provides the requisite level of commitment.\n\n3.2.1 Seed Least Squares Value Iteration (Seed LSVI)\n\nLSVI computes a sequence of value function parameters reflecting optimal expected rewards over an expanding horizon based on observed data. 
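For linear value functions, a single regularized backup of this kind reduces to a least-squares solve. The sketch below is our illustration (all names hypothetical): it regresses perturbed Bellman targets onto state-action features, regularized toward the agent's parameter seed:

```python
import numpy as np

def lsvi_backup(phi, phi_next, r, z_k, theta_next, theta_hat_k, v, lam):
    """One backup theta_h from theta_{h+1} with linear Q(s, a) = phi . theta.

    phi:      (N, d) features of (s_j, a_j)
    phi_next: (N, n_actions, d) features of (s'_j, a) for every action a
    z_k:      (N,) the agent's fixed per-observation noise seeds
    """
    # Perturbed Bellman targets: r_j + max_a Q_{h+1}(s'_j, a) + z_{k,j}.
    targets = r + (phi_next @ theta_next).max(axis=1) + z_k
    # Regularized least squares toward the agent's parameter seed.
    A = phi.T @ phi / v + np.eye(phi.shape[1]) / lam
    b = phi.T @ targets / v + theta_hat_k / lam
    return np.linalg.solve(A, b)
```

Iterating this from theta_H = 0 down to h = 0 with the same fixed seeds gives one agent's value estimate; different agents differ only through z_k and theta_hat_k.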
In seed LSVI, each kth agent\u2019s initial parameter \u03b8k,0 is set to \u02c6\u03b8k. Before its mth action, agent k uses the buffer of observations gathered by all K agents up to that time, or a subset thereof, and the random noise terms zk to carry out LSVI, initialized with \u02dc\u03b8H = 0, where H is the LSVI planning horizon:\n\n\u02dc\u03b8h = argmin_\u03b8 [ (1/v) \u03a3_{(sj,aj,rj,s'j)} ( rj + max_{a\u2208A} \u02dcQk,\u02dc\u03b8h+1(s'j, a) + zk,j \u2212 \u02dcQk,\u03b8(sj, aj) )^2 + \u03c8(\u03b8, \u02c6\u03b8k) ]\n\nfor h = H \u2212 1, . . . , 0, where \u03c8(\u03b8, \u02c6\u03b8k) is a regularization penalty (e.g. \u03c8(\u03b8, \u02c6\u03b8k) = (1/\u03bb)\u2016\u03b8 \u2212 \u02c6\u03b8k\u2016^2_2). After setting \u03b8k,m = \u02dc\u03b80, agent k applies action ak,m = argmax_{a\u2208A} \u02dcQk,\u03b8k,m(sk,m, a). Note that the agent\u2019s random seed can be viewed as \u03c9k = (\u02c6\u03b8k, zk,1, zk,2, . . .).\n\n3.2.2 Seed Temporal-Difference Learning (Seed TD)\n\nWhen the dimension of \u03b8 is very large, significant computational time may be required to produce an estimate with LSVI, and using first-order algorithms in the vein of stochastic gradient descent, such as TD, can be beneficial. In seed TD, each kth agent\u2019s initial parameter \u03b8k,0 is set to \u02c6\u03b8k. 
Before its mth action, agent k uses the buffer of observations gathered by all K agents up to that time to carry out N iterations of stochastic gradient descent, initialized with \u02dc\u03b80 = \u03b8k,m\u22121:\n\nL(\u03b8) = (1/v) \u03a3_{(sj,aj,rj,s'j)} ( rj + \u03b3 max_{a\u2208A} \u02dcQk,\u02dc\u03b8n\u22121(s'j, a) + zk,j \u2212 \u02dcQk,\u03b8(sj, aj) )^2 + \u03c8(\u03b8, \u02c6\u03b8k)\n\u02dc\u03b8n = \u02dc\u03b8n\u22121 \u2212 \u03b1\u2207\u03b8L(\u02dc\u03b8n\u22121)\n\nfor n = 1, . . . , N, where \u03b1 is the TD learning rate, L(\u03b8) is the loss function, \u03b3 is the discount rate and \u03c8(\u03b8, \u02c6\u03b8k) is a regularization penalty (e.g. \u03c8(\u03b8, \u02c6\u03b8k) = (1/\u03bb)\u2016\u03b8 \u2212 \u02c6\u03b8k\u2016^2_2). After setting \u03b8k,m = \u02dc\u03b8N, agent k applies action ak,m = argmax_{a\u2208A} \u02dcQk,\u03b8k,m(sk,m, a). Note that the agent\u2019s random seed can be viewed as \u03c9k = (\u02c6\u03b8k, zk,1, zk,2, . . .).\n\n3.2.3 Seed Ensemble\n\nWhen the number of parallel agents is large, instead of having each one of the K agents fit a separate value function model (e.g. K separate neural networks), we can have an ensemble of E models, E < K, to decrease computational requirements. Each model e = 1, . . . , E is initialized with \u02c6\u03b8e \u223c N(\u00af\u03b8, \u03bbI) from the common prior belief on parameters \u03b8, which is fixed and specific to model e of the ensemble. Moreover, model e is trained on the buffer of observations B according to one of the methods of Section 3.2.1 or 3.2.2. Each observation (sj, aj, rj, s'j) \u2208 B is perturbed with noise ze,j, which is also fixed and specific to model e of the ensemble. Note that agent k\u2019s random seed, \u03c9k, is a randomly drawn index e = 1, . . . 
, E associated with a model from the ensemble.\n\n3.2.4 Extensions\n\nThe framework we propose is not necessarily constrained to value function approximation methods. For instance, one could use the same principles for policy function approximation, where each agent k defines a policy function \u02dc\u03c0k(s, a, \u03b8) and before its mth action uses the buffer of observations gathered by all K agents up to that time and its seeds zk to perform policy gradient.\n\n4 Computational Results\n\nIn this section, we present computational results that demonstrate the robustness and effectiveness of the approach we suggest in Section 3. In Section 4.1, we present results that serve as a sanity check for our approach. We show that in the tabular toy problems considered in [1], seeding with generalized representations performs equivalently to the seed sampling algorithm proposed in [1], which is particularly designed for tabular settings and can benefit from very informative priors. In Section 4.2, we scale up to a high-dimensional problem, which would be too difficult to address by any tabular approach. We use the concurrent reinforcement learning algorithm of Sections 3.2.2 and 3.2.3 with a neural network value function approximation, and we see that our approach explores quickly and achieves high rewards.\n\n4.1 Sanity Checks\n\nThe authors of [1] considered two toy problems that demonstrate the advantage of seed sampling over Thompson resampling or concurrent UCRL. We compare the performance of seed LSVI (Section 3.2.1) and seed TD (Section 3.2.2), which are designed for generalized representations, with seed sampling, Thompson resampling and concurrent UCRL, which are designed for tabular representations.\nThe first toy problem is the \u201cbipolar chain\u201d of Figure 2a. The chain has an even number of vertices, N, V = {0, 1, . . . , N \u2212 1}, and the endpoints are absorbing. 
From any inner vertex of the chain, there are two edges that lead deterministically to the left or to the right. The leftmost edge eL = (1, 0) has weight \u03b8L and the rightmost edge eR = (N \u2212 2, N \u2212 1) has weight \u03b8R, such that |\u03b8L| = |\u03b8R| = N and \u03b8R = \u2212\u03b8L. All other edges have weight \u03b8e = \u22120.1. Each one of the K agents starts from vertex N/2, and its goal is to maximize the accrued reward. We let the agents interact with the environment for 2N time periods. As in [1], seed sampling, Thompson resampling and concurrent UCRL know everything about the environment except for whether \u03b8L = N, \u03b8R = \u2212N or \u03b8L = \u2212N, \u03b8R = N, and they share a common prior that assigns probability p = 0.5 to either scenario.\n\n(a) Bipolar chain environment\n(b) Parallel chains environment\n(c) Bipolar chain mean regret per agent\n(d) Parallel chains mean regret per agent\n\nFigure 2: Comparison of the scalable seed algorithms, seed LSVI and seed TD, with their tabular counterpart, seed sampling, and the tabular alternatives, concurrent UCRL and Thompson resampling, in the toy settings considered in [1]. This comparison serves as a sanity check.\n\nOnce an agent reaches either of the endpoints, all K agents learn the true values of \u03b8L and \u03b8R. Seed LSVI and seed TD use N-dimensional one-hot encoding to represent any of the chain\u2019s states and a linear value function representation. Unlike the tabular algorithms, seed LSVI and seed TD start with a completely uninformative prior. We run the algorithms with different numbers of parallel agents K operating on a chain with N = 50 vertices. Figure 2c shows the mean reward per agent achieved as K increases. The \u201cbipolar chain\u201d example aims to highlight the importance of the commitment property. 
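The bipolar chain of Figure 2a is simple enough to sketch directly. The encoding below is ours (class and method names are illustrative): endpoints absorb, inner moves cost 0.1, and only the sign pattern of the two end weights is unknown to the agents:

```python
class BipolarChain:
    """Bipolar chain: vertices 0..N-1, start at N/2, endpoints absorbing."""

    def __init__(self, N, left_is_good):
        self.N = N
        self.theta_L = N if left_is_good else -N   # weight of edge (1, 0)
        self.theta_R = -self.theta_L               # weight of edge (N-2, N-1)
        self.pos = N // 2

    def step(self, action):
        """action: -1 moves left, +1 moves right; returns (pos, reward, done)."""
        if self.pos in (0, self.N - 1):
            return self.pos, 0.0, True             # already absorbed
        self.pos += action
        if self.pos == 0:
            return self.pos, float(self.theta_L), True
        if self.pos == self.N - 1:
            return self.pos, float(self.theta_R), True
        return self.pos, -0.1, False               # inner edge cost
```

An agent that commits to one direction pays 0.1 per step until it reveals an endpoint weight of magnitude N, while an agent that keeps resampling its target direction oscillates near N/2 and only accumulates costs.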
As explained in [1], concurrent UCRL and seed sampling are expected to perform on par because they exhibit commitment, but Thompson resampling is detrimental to exploration because resampling an MDP in every time period leads the agents to oscillate around the start vertex. Seed LSVI and seed TD exhibit commitment and perform almost as well as seed sampling, which not only is designed for tabular problems but also starts with a significantly more informed prior.\nThe second toy problem is the \u201cparallel chains\u201d of Figure 2b. Starting from vertex 0, each of the K agents chooses one of the c = 1, . . . , C chains, each of length L. Once a chain is chosen, the agent cannot switch to another chain. All the edges of each chain c have zero weights, apart from the edge incoming to the last vertex of the chain, which has weight \u03b8c \u223c N(0, \u03c3^2_0 + c). The objective is to choose the chain with the maximum reward. As in [1], seed sampling, Thompson resampling and concurrent UCRL know everything about the environment except for \u03b8c, \u2200c = 1, . . . , C, on which they share a common, well-specified prior. Once an agent traverses the last edge of chain c, all agents learn \u03b8c. Seed LSVI and seed TD use N-dimensional one-hot encoding to represent any of the chains\u2019 states and a linear value function representation. As before, seed LSVI and seed TD start with a completely uninformative prior. We run the algorithms with different numbers of parallel agents K operating on a parallel chains environment with C = 4, L = 4 and \u03c3^2_0 = 100. Figure 2d shows the mean reward per agent achieved as K increases. The \u201cparallel chains\u201d example aims to highlight the importance of the diversity property. 
As explained in [1], Thompson resampling and seed sampling are expected to perform on par because they diversify, but concurrent UCRL is wasteful of the agents' exploratory effort because it sends all the agents who have not yet left the source to the same chain, the one with the most optimistic last-edge reward. Seed LSVI and seed TD exhibit diversity and perform identically to seed sampling, which again starts with a very informed prior.

4.2 Scaling Up: Cartpole Swing-Up

In this section we extend the algorithms and insights we have developed in the rest of the paper to a complex non-linear control problem. We revisit a variant of the "cartpole" problem of Section 2, but we introduce two additional state variables: the horizontal distance x_t of the cart from the center x = 0 and its velocity ẋ_t. The second-order differential equations governing the system become

φ̈_t = (g sin(φ_t) − cos(φ_t) τ_t) / ( l (4/3 − m cos²(φ_t)/(m+M)) ),  τ_t = (F_t + m l φ̇_t² sin(φ_t)) / (m+M),  ẍ_t = τ_t − m l φ̈_t cos(φ_t) / (m+M)

[11]. We discretize the evolution of this second-order system with timescale Δt = 0.01. The agent receives a reward of 1 if the pole is balanced upright, steady, and the cart is centered (precisely, when cos(φ_t) > 0.95, |x_t| < 0.1, |ẋ_t| < 1 and |φ̇_t| < 1); otherwise the reward is 0. We evaluate performance over 30 seconds of interaction, equivalent to 3,000 actions. For our implementation, we use the DeepMind Control Suite, which imposes a rigid edge at |x| = 2 [22].

Due to the curse of dimensionality, tabular approaches to seed sampling quickly become intractable as we introduce more state variables.
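The discretized dynamics and the reward threshold above can be sketched as a forward-Euler step (a minimal sketch: the particular values of the pole mass m, cart mass M, pole half-length l and g are illustrative assumptions, not taken from the paper):

```python
import math

def cartpole_step(x, x_dot, phi, phi_dot, F, m=0.1, M=1.0, l=0.5, g=9.8, dt=0.01):
    """One Euler step (dt = 0.01) of the cartpole ODEs stated above."""
    tau = (F + m * l * phi_dot ** 2 * math.sin(phi)) / (m + M)
    phi_ddot = (g * math.sin(phi) - math.cos(phi) * tau) / (
        l * (4.0 / 3.0 - m * math.cos(phi) ** 2 / (m + M)))
    x_ddot = tau - m * l * phi_ddot * math.cos(phi) / (m + M)
    x, x_dot = x + dt * x_dot, x_dot + dt * x_ddot
    phi, phi_dot = phi + dt * phi_dot, phi_dot + dt * phi_ddot
    x = max(-2.0, min(2.0, x))  # rigid edge at |x| = 2, as in the control suite
    return x, x_dot, phi, phi_dot

def reward(x, x_dot, phi, phi_dot):
    # 1 iff upright (cos(phi) > 0.95), centered (|x| < 0.1)
    # and steady (|x_dot| < 1, |phi_dot| < 1); otherwise 0.
    ok = (math.cos(phi) > 0.95 and abs(x) < 0.1
          and abs(x_dot) < 1 and abs(phi_dot) < 1)
    return 1.0 if ok else 0.0
```

With the pole hanging down (φ = π) the reward is identically 0, which is why undirected exploration sees no signal at all in this task.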
For a practical approach to seed sampling in this domain, we apply the seed TD-ensemble algorithm of Sections 3.2.2 and 3.2.3 together with a neural network representation of the value function. We pass the neural network six features: cos(φ_t), sin(φ_t), φ̇_t/10, x_t/10, ẋ_t/10, 1{|x_t| < 0.1}. Let f_θ : S → R^A be a (50, 50)-MLP with rectified linear units and a linear skip connection. We initialize each Q^e(s, a | θ^e) = (f_{θ^e} + 3 f_{θ^e_0})(s)[a] for θ^e, θ^e_0 sampled from Glorot initialization [2]. After each action, for each agent we sample a minibatch of 16 transitions uniformly from the shared replay buffer and take gradient steps with respect to θ^e using the ADAM optimizer with learning rate 10−3 [8]. The parameter θ^e_0 plays a role similar to the prior regularization ψ when used in conjunction with SGD training [9]. We sample noise z_{e,j} ∼ N(0, 0.01) to be used in the shared replay buffer.

Figure 3: Comparison of seed sampling varying the number K of agents, with a model ensemble size min(K, 30). As a baseline we use DQN with 100 agents applying ε-greedy exploration.

Figure 3 presents the results of our seed sampling experiments on this cartpole problem. Each curve is averaged over 10 random instances. As a baseline, we consider DQN with 100 parallel agents, each with 0.1-greedy action selection. With this approach, the agents fail to see any reward over the duration of their experience. By contrast, a seed sampling approach is able to explore efficiently, with agents learning to balance the pole remarkably quickly¹. The average reward per agent increases as we increase the number K of parallel agents.
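The feature map and the additive prior-network initialization Q^e = (f_{θ^e} + 3 f_{θ^e_0})(s)[a] can be sketched in numpy as follows (a minimal sketch under our own simplifications: the number of actions is assumed to be 2, the linear skip connection is omitted, and all helper names are ours; the fixed network f_{θ^e_0} acts as the agent's prior while only the other copy would be trained):

```python
import numpy as np

def features(x, x_dot, phi, phi_dot):
    # The six inputs named in the text.
    return np.array([np.cos(phi), np.sin(phi), phi_dot / 10,
                     x / 10, x_dot / 10, float(abs(x) < 0.1)])

def glorot(rng, fan_in, fan_out):
    # Glorot (uniform) initialization [2].
    s = np.sqrt(6.0 / (fan_in + fan_out))
    return rng.uniform(-s, s, size=(fan_in, fan_out))

class MLP:
    """(50, 50)-MLP with ReLU hidden layers; skip connection omitted for brevity."""
    def __init__(self, rng, sizes=(6, 50, 50, 2)):   # 2 actions is an assumption
        self.W = [glorot(rng, a, b) for a, b in zip(sizes[:-1], sizes[1:])]
        self.b = [np.zeros(b) for b in sizes[1:]]
    def __call__(self, s):
        h = s
        for W, b in zip(self.W[:-1], self.b[:-1]):
            h = np.maximum(0.0, h @ W + b)
        return h @ self.W[-1] + self.b[-1]

class SeededQ:
    """Q^e(s, .) = f_{theta_e}(s) + 3 * f_{theta_e_0}(s); the seed fixes both
    copies, and theta_e_0 would stay frozen throughout training (the prior)."""
    def __init__(self, seed):
        rng = np.random.default_rng(seed)
        self.trainable, self.prior = MLP(rng), MLP(rng)
    def __call__(self, s):
        return self.trainable(s) + 3.0 * self.prior(s)
```

Each agent's seed deterministically produces a distinct value function, so agents interpret the shared replay buffer differently and diversify without any explicit coordination.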
To reduce compute time, we use the seed ensemble with min(K, 30) models; this does not seem to significantly degrade performance.

5 Closing Remarks

We have extended the concept of seeding from impractical tabular representations to generalized representations, and we have proposed an approach for designing scalable concurrent reinforcement learning algorithms that can intelligently coordinate the exploratory effort of agents learning in parallel in potentially enormous state spaces. This approach allows the concurrent agents (1) to adapt to each other's high-dimensional observations via value function learning, (2) to diversify their experience collection via an intrinsic random seed that uniquely initializes each agent's generalized representation and uniquely interprets the common history of observations, and (3) to commit to sequences of actions that reveal useful information by holding each agent's seed constant throughout learning. We envision multiple applications of practical interest in which a number of parallel agents that conform to the proposed framework can learn and achieve high rewards in short learning periods. Such application areas include web services, the management of a fleet of autonomous vehicles, and the management of a farm of networked robots, where each online user, vehicle or robot, respectively, is controlled by an agent.

¹For a demo, see https://youtu.be/kwvhfzbzb0o

References

[1] Maria Dimakopoulou and Benjamin Van Roy. Coordinated exploration in concurrent reinforcement learning. In ICML, 2018.

[2] Xavier Glorot and Yoshua Bengio. Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, pages 249–256, 2010.

[3] Shixiang Gu, Ethan Holly, Timothy Lillicrap, and Sergey Levine.
Deep reinforcement learning for robotic manipulation with asynchronous off-policy updates. arXiv preprint, 2016.

[4] Z. Guo and E. Brunskill. Concurrent PAC RL. In AAAI Conference on Artificial Intelligence, pages 2624–2630, 2015.

[5] Thomas Jaksch, Ronald Ortner, and Peter Auer. Near-optimal regret bounds for reinforcement learning. Journal of Machine Learning Research, 11:1563–1600, 2010.

[6] Michael J. Kearns and Satinder P. Singh. Near-optimal reinforcement learning in polynomial time. Machine Learning, 49(2-3):209–232, 2002.

[7] Michael Jong Kim. Thompson sampling for stochastic control: The finite parameter case. IEEE Transactions on Automatic Control, 2017.

[8] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.

[9] Ian Osband, John Aslanides, and Albin Cassirer. Randomized prior functions for deep reinforcement learning. arXiv preprint arXiv:1806.03335, 2018.

[10] Ian Osband, Daniel Russo, and Benjamin Van Roy. (More) efficient reinforcement learning via posterior sampling. In NIPS, pages 3003–3011. Curran Associates, Inc., 2013.

[11] Ian Osband, Daniel Russo, Benjamin Van Roy, and Zheng Wen. Deep exploration via randomized value functions. arXiv preprint arXiv:1608.02731, 2016.

[12] Ian Osband and Benjamin Van Roy. On optimistic versus randomized exploration in reinforcement learning. In The Multi-disciplinary Conference on Reinforcement Learning and Decision Making, 2017.

[13] Ian Osband and Benjamin Van Roy. Why is posterior sampling better than optimism for reinforcement learning? In ICML, 2017.

[14] Ian Osband, Benjamin Van Roy, and Zheng Wen. Generalization and exploration via randomized value functions. In Proceedings of The 33rd International Conference on Machine Learning, pages 2377–2386, 2016.

[15] Jason Pazis and Ronald Parr.
PAC optimal exploration in continuous space Markov decision processes. In AAAI, 2013.

[16] Jason Pazis and Ronald Parr. Efficient PAC-optimal exploration in concurrent, continuous state MDPs with delayed updates. In AAAI, 2016.

[17] Daniel Russo, Benjamin Van Roy, Abbas Kazerouni, Ian Osband, and Zheng Wen. A tutorial on Thompson sampling. arXiv preprint arXiv:1707.02038, 2017.

[18] D. Silver, L. Newnham, D. Barker, S. Weller, and J. McFall. Concurrent reinforcement learning from customer interactions. In Proceedings of The 30th International Conference on Machine Learning, pages 924–932, 2013.

[19] Malcolm J. A. Strens. A Bayesian framework for reinforcement learning. In ICML, pages 943–950, 2000.

[20] Richard Sutton and Andrew Barto. Reinforcement Learning: An Introduction. MIT Press, 2017.

[21] Csaba Szepesvári. Algorithms for Reinforcement Learning. Synthesis Lectures on Artificial Intelligence and Machine Learning. Morgan & Claypool Publishers, 2010.

[22] Yuval Tassa, Yotam Doron, Alistair Muldal, Tom Erez, Yazhe Li, Diego de Las Casas, David Budden, Abbas Abdolmaleki, Josh Merel, Andrew Lefrancq, et al. DeepMind Control Suite. arXiv preprint arXiv:1801.00690, 2018.