{"title": "A Deep Bayesian Policy Reuse Approach Against Non-Stationary Agents", "book": "Advances in Neural Information Processing Systems", "page_first": 954, "page_last": 964, "abstract": "In multiagent domains, coping with non-stationary agents that change behaviors from time to time is a challenging problem, where an agent is usually required to be able to quickly detect the other agent's policy during online interaction, and then adapt its own policy accordingly. This paper studies efficient policy detecting and reusing techniques when playing against non-stationary agents in Markov games. We propose a new deep BPR+ algorithm by extending the recent BPR+ algorithm with a neural network as the value-function approximator. To detect policy accurately, we propose the \\textit{rectified belief model} taking advantage of the \\textit{opponent model} to infer the other agent's policy from reward signals and its behaviors. Instead of directly storing individual policies as BPR+, we introduce \\textit{distilled policy network} that serves as the policy library in BPR+, using policy distillation to achieve efficient online policy learning and reuse. 
Deep BPR+ inherits all the advantages of BPR+ and empirically shows better performance in terms of detection accuracy, cumulative reward and speed of convergence compared to existing algorithms in complex Markov games with raw visual inputs.", "full_text": "A Deep Bayesian Policy Reuse Approach Against Non-Stationary Agents\n\nYan Zheng1, Zhaopeng Meng1, Jianye Hao1*, Zongzhang Zhang2, Tianpei Yang1, Changjie Fan3\n\n1College of Intelligence and Computing, Tianjin University, Tianjin, China\n\n2School of Computer Science and Technology, Soochow University, Suzhou, China\n\n3NetEase Fuxi Lab, NetEase, Inc., Hangzhou, China\n\n{yanzheng, mengzp, jianye.hao}@tju.edu.cn, zzzhang@suda.edu.cn, tpyang@tju.edu.cn, fanchangjie@netease.com\n\nAbstract\n\nIn multiagent domains, coping with non-stationary agents that change behaviors from time to time is a challenging problem: an agent must quickly detect the other agent's policy during online interaction and then adapt its own policy accordingly. This paper studies efficient policy detection and reuse techniques when playing against non-stationary agents in Markov games. We propose a new deep BPR+ algorithm that extends the recent BPR+ algorithm with a neural network as the value-function approximator. To detect policies accurately, we propose the rectified belief model, which takes advantage of the opponent model to infer the other agent's policy from both reward signals and its behaviors. Instead of directly storing individual policies as BPR+ does, we introduce the distilled policy network that serves as the policy library in BPR+, using policy distillation to achieve efficient online policy learning and reuse. 
Deep BPR+ inherits all the advantages of BPR+ and empirically shows better performance in terms of detection accuracy, cumulative reward and speed of convergence compared to existing algorithms in complex Markov games with raw visual inputs.\n\n1 Introduction\n\nWith recent advances in deep reinforcement learning (DRL) techniques [17, 18, 22, 24, 25], a large number of DRL algorithms have been successfully applied to challenging problems such as game playing [18], robotics [14] and recommendation [26]. Yet, most of these algorithms focus on single-agent domains, without explicitly considering the coexisting agents in the environment.\nThere also exist many application scenarios involving multiagent interactions, commonly known as multiagent systems (MAS). A few multiagent DRL algorithms [5, 19, 23] have been proposed that focus on directly searching for an optimal policy, without explicitly considering coexisting agents' behaviors. In MAS, however, it is essential for agents to learn to cope with each other by taking the other agent's behaviors into account [2, 6, 13, 15, 16]. None of these works explicitly categorizes the other agent's policy, and without such an explication, a learned policy cannot be directly exploited to achieve a higher cumulative reward, a.k.a. return.\nIn this work, we study the problem of playing against non-stationary agents, the second most sophisticated problem categorized in [9], by explicitly identifying their categorized policies and then reusing learned strategies against them. When an agent uses an unknown policy, the optimal policy against it can be efficiently learned using a starting policy built with a distilled policy network. 
*Corresponding author: Jianye Hao.\n\n32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montréal, Canada.\n\nThe BPR+ algorithm [7, 10] has similar ideas but with the following limitations: 1) BPR+'s internal belief model is updated only using the reward signal, which may be insufficient to detect the other agent's policy accurately; 2) BPR+ learns an optimal policy from scratch whenever the other agent is detected to be using an unknown policy, which can be inefficient, especially in large domains (additionally, it uses a model-based tabular algorithm, R-max, restricting it to small domains); and 3) BPR+ is a tabular-based algorithm that directly stores learned policies as Q-tables, which might be space-inefficient and even infeasible when handling larger problems or storing numerous policies.\nTo address the above limitations of BPR+, we propose a new algorithm, named deep BPR+, combining BPR+ with recent DRL techniques. To predict the other agent's policy accurately, we rectify the belief model in BPR+ using an opponent model, which encodes an agent's policy, analyzing both the reward signal and the coexisting agent's behavior to improve detection accuracy. For storing and reusing policies, we introduce the distilled policy network using policy distillation [21], which plays the role of the policy library in BPR+ and provides convenient policy switching. Most importantly, compared with learning from scratch, it allows us to learn new policies that compete well against previously unseen policies more efficiently. Another side benefit it brings is high space efficiency, which makes it suitable for applications with limited storage. 
Empirical results show that, compared to existing algorithms, deep BPR+ can achieve more accurate policy detection, more efficient policy reuse and faster convergence towards optimal policies in the face of a non-stationary agent using a new, unseen policy in complex Markov games with raw visual inputs.\n\n2 Preliminaries\n\nBayesian policy reuse (BPR) [20] provides an efficient framework for an agent to act by selecting the best response strategy when facing an unknown task. Formally, a task τ ∈ T is defined as an MDP, and a policy π(s) is a function that outputs an appropriate action a given state s. The return, or utility, generated from interacting with the MDP environment by following π over an episode of k steps, is defined as the cumulative reward u = Σ_{i=1}^{k} r_i, where r_i is the immediate reward received by acting at step i - 1. BPR uses a performance model P(U|τ, π), which is a probability distribution over the utility U, to describe how policy π behaves on task τ. For a set of previously-solved tasks T, the belief β(τ) is a probability distribution over T that measures to what extent the currently faced task τ* matches the known task τ based on the reward signal (i.e., the cumulative reward u). The belief model β_0(τ) is initialized with a prior probability and updated at time t using Bayes' rule:\n\nβ_t(τ) = (1/η) P(u_t|τ, π_t) β_{t-1}(τ), (1)\n\nwhere η is the normalization factor Σ_{τ'∈T} P(u_t|τ', π_t) β_{t-1}(τ'). Based on the belief model, to maximize the utility, BPR uses a policy-selection indicator, named probability of expected improvement, to reuse the most appropriate policy in a policy library Π [20]. 
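The belief update in Equation 1 and the expected-improvement selection introduced next (Equation 2) can be sketched in a few lines. This is a minimal sketch, not the paper's implementation: the Gaussian performance models and the task/policy names are hypothetical.

```python
import math

def gaussian_pdf(u, mean, std):
    """Density of a performance model P(U | tau, pi), assumed Gaussian here."""
    return math.exp(-0.5 * ((u - mean) / std) ** 2) / (std * math.sqrt(2 * math.pi))

def update_belief(belief, utility, perf, policy):
    """One Bayes step of Equation 1: beta_t(tau) ~ P(u_t | tau, pi_t) beta_{t-1}(tau)."""
    post = {tau: gaussian_pdf(utility, *perf[(tau, policy)]) * b
            for tau, b in belief.items()}
    eta = sum(post.values())  # normalization factor
    return {tau: p / eta for tau, p in post.items()}

def select_policy(belief, perf, policies, u_bar):
    """Equation 2: pick the policy most likely to exceed the current best
    expected utility u_bar (probability of expected improvement)."""
    def improvement(pi):
        # Gaussian tail probability P(U > u_bar) under each performance model
        return sum(b * 0.5 * math.erfc((u_bar - perf[(tau, pi)][0]) /
                                       (perf[(tau, pi)][1] * math.sqrt(2)))
                   for tau, b in belief.items())
    return max(policies, key=improvement)

# Hypothetical Gaussian performance models: (mean, std) per (task, policy) pair.
perf = {("tau1", "pi1"): (10.0, 1.0), ("tau2", "pi1"): (0.0, 1.0),
        ("tau1", "pi2"): (2.0, 1.0), ("tau2", "pi2"): (9.0, 1.0)}
belief = {"tau1": 0.5, "tau2": 0.5}
belief = update_belief(belief, utility=8.8, perf=perf, policy="pi1")
best = select_policy(belief, perf, ["pi1", "pi2"], u_bar=5.0)
# Observing u = 8.8 concentrates the belief on tau1, so pi1 is selected.
```

Note the selection averages the tail probability over the belief, exactly mirroring the integral in Equation 2 under the Gaussian assumption.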
The indicator considers the expected hypothesized increment over the utility that the current best policy can achieve. Assume that Ū is the expected utility of the best current policy under the current belief, measured by max_{π∈Π} Σ_{τ∈T} β(τ) E[U|τ, π]. Thus BPR chooses the best potential policy π*, i.e., the policy most likely to result in any improvement to the expected utility:\n\nπ* = arg max_{π∈Π} Σ_{τ∈T} β(τ) ∫_{Ū}^{+∞} P(U|τ, π) dU. (2)\n\nBPR+ extends BPR to handle non-stationary opponents in multiagent settings and learns new performance models in an online manner. Note that the tasks and policies that BPR+ faces correspond to opponent strategies and optimal policies against those strategies, respectively. Although BPR+ has the ability to detect a strategy switch and adjust its current strategy online, its effectiveness has only been tested with a tabular representation in a single-state iterated matrix game.\nPolicy distillation [21] can be used to consolidate multiple task-specific policies into a single policy network by transferring knowledge, i.e., the Q-function, from a teacher model ψ to a student model φ. The teacher ψ generates a dataset D_ψ = {(s_i, q_i)}_{i=0}^{N}, where each sample consists of a state s_i and a vector q_i of unnormalized Q-values with one value per action. Regression is then used to train the student φ with samples (s_i, q_i) drawn from D_ψ to produce output similar to the teacher ψ. 
Similar to [11], the Kullback-Leibler (KL) divergence with temperature t is adopted to measure the training loss between the models ψ and φ:\n\nLoss_KL(D_ψ, θ_φ) = Σ_{i=1}^{|D_ψ|} softmax(q_i^ψ / t) ln( softmax(q_i^ψ / t) / softmax(q_i^φ) ). (3)\n\nThe i-th element of softmax(z) is defined as exp z(i) / Σ_j exp z(j), where z represents a vector.\n\nAlgorithm 1: Deep BPR+\nInput: Episodes K, policy library Π, known opponent policy set T, performance model P(U|T, Π)\n1 Initialize beliefs β̄_0 with uniform distribution\n2 for episode t = 1 ... K do\n3   if execute a reuse stage then\n4     Choose a policy π_t based on β̄_{t-1} to execute, and receive utility u_t (see Equation 11)\n5     Estimate opponent's online policy τ̂_o^t based on its observed behaviors\n6     Update rectified belief model β̄_t using u_t and τ̂_o^t (see Equation 10)\n7     if a new policy is detected by moving averaged reward then\n8       Initialize policy π_t by distilled policy network, and then switch to learning stage\n9   else if execute a learning stage then\n10    Optimize π_t by DQN, and estimate opponent's online policy τ̂_o^t\n11    if an optimal policy is obtained then\n12      Update T, Π and P(U|T, Π), and then switch to the reuse stage\n\n3 Deep BPR+\n\nDeep BPR+ extends BPR+ from tabular representation to deep neural network approximation, and addresses the following two major drawbacks in such an extension. 
To be consistent below, we use "opponent" to refer to the coexisting agent, regardless of cooperative or competitive environments.\nFirst, the accuracy of the belief model β(τ) in Equation 1 is highly dependent on the performance model P(U|τ, π), which evaluates the policy π, named the response policy, behaving against an opponent using policy τ. However, the performance models of a response policy against different strategies might be identical in multiagent domains, making the belief model indistinguishable and detection inaccurate. To address this, we detect the opponent's policy simultaneously using the belief model based on reward signals and opponent models based on observations.\nSecond, when learning a response policy (e.g., with DQN) against a new opponent, BPR+ learns from scratch, which is inefficient due to expensive training time. Here, we propose the distilled policy network (DPN), using policy distillation to combine multiple response policies into a single one. When learning a new response policy, the distilled policy network can be used to initialize a starting policy to speed up the learning process, thus significantly improving the quick-response ability.\nThe overall flow of deep BPR+ is outlined in Algorithm 1. It consists of two stages: 1) a reuse stage (lines 3-8) selecting the most appropriate policy to execute; and 2) a learning stage (lines 9-12) obtaining an optimal response policy against a new opponent. In each episode, only one stage is executed. During the reuse stage, deep BPR+ selects a response policy π_t using the new belief model β̄_t rectified by the opponent model, receives cumulative reward u_t, estimates the opponent's online policy τ̂_o^t, and updates β̄_t using u_t and τ̂_o^t. At the end of the stage, deep BPR+ checks whether the opponent is using a previously unseen policy. If yes, it switches to the learning stage in the next episode and starts the policy optimization using a starting policy π_t initialized by DPN. In the learning stage, any DRL algorithm can be used to learn a response policy online. Once it is finished, the estimated opponent's policy τ̂_o^t, which depicts the opponent's new policy, will be added into the known opponent policy set T, the corresponding learned response policy will be added into the policy library Π, and the performance model P(U|T, Π) will be updated using the corresponding received cumulative reward. At last, deep BPR+ switches back to the reuse stage in the next episode.\nNote that deep BPR+ currently changes its policy only at the beginning of each episode, since the reward signal for detection is on a per-episode basis. However, the policy update can also be performed at each step if the instantaneous reward is used as the signal in the belief model (line 4). The choice of the signal is problem-specific, and it determines the granularity of the policy selection frequency.\n\n3.1 Rectified Belief Model (RBM)\n\nDetecting the opponent's policy is a critical part of deep BPR+, since higher detection accuracy enables more efficient policy reuse, resulting in better performance. However, in vanilla BPR+, the belief model, originally designed for measuring the similarity between different tasks in transfer learning, suffers from inaccurate detection in multiagent domains. More specifically, β_k(τ) ≡ β_k(τ|u_k, π_k) in Equation 1 describes the probability of the opponent using policy τ at episode k, given that the agent uses policy π against τ and receives reward u_k. 
At the beginning of episode k + 1, the agent chooses the most appropriate response policy to execute by reasoning about the opponent's policy τ*:\n\nτ* = arg max_τ β_k(τ), (4)\n\nwhere τ* is a unique solution only when β(τ*) > β(τ_i) for every τ_i ≠ τ*. However, this condition does not always hold, because the belief model is updated only using the performance model in Equation 1, making it proportional to and highly dependent on the performance model:\n\nβ_k(τ) ≡ β_k(τ|u_k, π_k) ∝ P(u_k|τ, π_k). (5)\n\nAssume a fully cooperative environment, where an agent uses policies π_1, ..., π_n to cooperate with an opponent using policies τ_1, ..., τ_n, respectively. In this case, any miscoordination results in zero reward, meaning P(0|τ_i, π_j) ≈ 1, where i ≠ j and i, j ∈ [1, n]. Suppose, at episode k, the agent uses policy π_j against its opponent using τ_i, resulting in a miscoordination. The performance model for each miscoordinated policy pair is then indistinguishable:\n\nP(u = 0|τ_1, π_j^k) = ··· = P(u = 0|τ_{j-1}, π_j^k) = P(u = 0|τ_{j+1}, π_j^k) = ··· = P(u = 0|τ_n, π_j^k) ≈ 1. (6)\n\nThis leads to the belief models over the different τ_m (m ≠ j) being indistinguishable as well:\n\nβ_k(τ_1) = ··· = β_k(τ_{j-1}) = β_k(τ_{j+1}) = ··· = β_k(τ_n). (7)\n\nIn Equation 4, there then exist multiple τ* solutions. One of them will be randomly selected, which leads to continuous miscoordination in the following episodes. 
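The ambiguity of Equations 5-7 can be reproduced numerically. The deterministic zero/nonzero reward model below is a hypothetical toy, not one of the paper's environments:

```python
# Toy illustration of Equations 5-7: under miscoordination, every pair (tau_i, pi_j)
# with i != j yields utility 0 with probability ~1, so a reward-only belief cannot
# separate the wrong opponent policies.
n = 4

def perf(u, tau_i, pi_j):
    """P(u | tau_i, pi_j): coordination (i == j) pays 5, any miscoordination pays 0."""
    expected = 5 if tau_i == pi_j else 0
    return 1.0 if u == expected else 0.0

# The agent plays pi_0 while the opponent actually uses tau_2; observed utility is 0.
belief = [1.0 / n] * n
likelihood = [perf(0, tau_i=i, pi_j=0) for i in range(n)]
posterior = [lk * b for lk, b in zip(likelihood, belief)]
eta = sum(posterior)
posterior = [p / eta for p in posterior]
# tau_1, tau_2 and tau_3 end up equally likely (1/3 each): Equation 4 has no
# unique argmax, so the agent may keep miscoordinating.
```

The same computation with a behavior-based posterior (Section 3.1) would break the tie, which is exactly the motivation for the rectified belief model.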
So, detecting from the single perspective of a policy's performance is not enough to exactly reason about the opponent's policy.\nTo overcome this, we propose the opponent model/policy τ̂ parameterized by θ, a neural network approximation of an opponent's true policy τ, depicting its behaviors from its past sequence of moves. The opponent model is critical in identifying the other agent's type by estimating its policy [1]. Similar ideas can be found in recent multiagent DRL algorithms (e.g., DRON [6] and MADDPG [16]). However, they either use handcrafted behavioral features or observed agents' behaviors to learn a generalized policy, without explicit opponent classification. DPIQN [12] learns features for different opponent policies, but uses them to train a generalized Q-network to execute, rather than directly reusing a more advantageous response policy. LOLA [4] optimizes the policy by considering that the coexisting agent learns at the same time; it belongs to the "theory of mind" category defined by [9], which is computationally expensive. In deep BPR+, we explicitly categorize the opponent's policy and try to achieve better performance by reusing the best response policy. Assuming that (s_0, a_0, ..., s_t, a_t, ...) is the observation during interactions with an opponent using policy τ, the opponent model τ̂ can be obtained by maximizing the log probability of policy τ̂. However, only using sampled observations may easily incur over-fitting. Besides, observations may vary greatly among different episodes, resulting in high variance. To alleviate this, an entropy regularizer is introduced into the loss function:\n\nL(θ) = -E_{s_i,a_i}[log τ̂(a_i|s_i) + H(τ̂)], (8)\n\nwhere H is the entropy of policy τ̂ and the (s_i, a_i) are training samples from the observations. With the opponent model τ̂, the similarity between the opponent's different policies τ_1 and τ_2 can be measured using a KL divergence, denoted by D_KL(τ̂_1, τ̂_2) ≈ E_{(s,a)} log{ τ̂_1(s,a) / τ̂_2(s,a) }, whereby the opponent's different policies with the same belief can be further distinguished by utilizing the corresponding opponent model. Specifically, during online interaction, line 5 in Algorithm 1 estimates the opponent's online policy τ̂_o; the (unnormalized) posterior probability of the opponent using policy τ can thus be defined as follows:\n\np(τ) ≡ p(τ̂|τ̂_o) = Σ_{τ̂_i∈T} D_KL(τ̂_o, τ̂_i) / D_KL(τ̂_o, τ̂). (9)\n\nHere, τ̂_i is the previously learned approximate policy of the known opponent policy τ_i stored in T, and the KL divergence is measured using observed state-action pairs. Note that, because the similarity between the opponent's different policies is inversely proportional to the value of the KL divergence, we use the reciprocal of the relative proportion of the KL divergence. Besides, the posterior probability is normalized using η in Equation 10.\nIntuitively, both the belief and opponent models can be understood as posterior probabilities measuring the opponent's policy, based on the received reward signal u and the observed online behaviors τ̂, respectively. β(τ) and p(τ) are independent of each other due to their dependence on u and τ̂ separately. 
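A minimal sketch of the opponent-model objective (Equation 8) and the KL-based posterior (Equation 9). The discrete action distributions and policy names are hypothetical; a real implementation would train τ̂ as a neural network and estimate the KL from observed state-action pairs:

```python
import math

def opponent_nll_with_entropy(probs_on_samples, policy_entropy):
    """Equation 8 (sketch): mean negative log-likelihood of observed actions,
    minus an entropy bonus; probs_on_samples are tau_hat(a_i | s_i) values."""
    nll = -sum(math.log(p) for p in probs_on_samples) / len(probs_on_samples)
    return nll - policy_entropy

def kl(p, q):
    """KL divergence D_KL(p || q) between two discrete action distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def unnormalized_posterior(online_model, known_models):
    """Equation 9 (sketch): p(tau) is the reciprocal share of D_KL(tau_hat_o,
    tau_hat) relative to the sum over all known opponent models."""
    total = sum(kl(online_model, m) for m in known_models.values())
    return {name: total / kl(online_model, m) for name, m in known_models.items()}

online = [0.7, 0.2, 0.1]              # estimated online opponent policy
known = {"tauA": [0.72, 0.18, 0.10],  # close to the online estimate
         "tauB": [0.10, 0.30, 0.60]}  # far from it
post = unnormalized_posterior(online, known)
# tauA, being closer in KL, receives far more (unnormalized) posterior mass.
```

Because similarity is inversely proportional to the KL value, the reciprocal form gives small-divergence models large weights, matching the text above.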
Thus, we propose a rectified belief model that measures the probability of the opponent using policy τ from both models, multiplying them together to obtain a more accurate prediction model:\n\nβ̄(τ) = (1/η) p(τ) P(u_k|τ, π_k) β_{k-1}(τ), (10)\n\nwhere η = Σ_{τ'∈T} p(τ') P(u_k|τ', π_k) β_{k-1}(τ'). RBM rectifies the performance model using the opponent model to select a more appropriate policy π*, in terms of cumulative reward, to execute:\n\nπ* = arg max_{π∈Π} Σ_{τ∈T} β̄(τ) ∫_{Ū}^{+∞} P(U|τ, π) dU. (11)\n\nFinally, deep BPR+ uses the same moving-average reward as BPR+ to detect whether it is encountering an unknown policy (line 8 in Algorithm 1). Specifically, let r_π^t be the accumulated reward in episode t using policy π. The probability of obtaining r_π^t, given the non-stationary agent's policy with label τ, is p_τ^t = P(r_π^t|τ, π). If the value of Σ_{i=t-n+1,...,t} p_τ^i / n (for all τ ∈ T) over the last n rounds is lower than a threshold P_thr, then the deep BPR+ agent infers that the other agent is using a previously unseen strategy, and switches to the learning stage to quickly learn an optimal policy against it. Intuitively, P_thr is the lower bound of the probability of the opponent using a known policy in T, and reflects the sensitivity of a deep BPR+ agent responding to a potentially unknown opponent.\n\n3.2 Distilled Policy Network (DPN)\n\nDeep BPR+ inherits the ability of online learning against a new opponent. 
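The rectified update of Equation 10 and the moving-average test for unseen policies, described just above, can be sketched as follows. The Gaussian performance models and the threshold value are hypothetical:

```python
import math

def gaussian_pdf(u, mean, std):
    """Gaussian performance-model density P(u | tau, pi)."""
    return math.exp(-0.5 * ((u - mean) / std) ** 2) / (std * math.sqrt(2 * math.pi))

def rectified_update(belief, opp_posterior, utility, perf, policy):
    """Equation 10 (sketch): weight the reward likelihood by the opponent-model
    posterior p(tau) before normalizing the belief."""
    post = {tau: opp_posterior[tau] * gaussian_pdf(utility, *perf[(tau, policy)]) * b
            for tau, b in belief.items()}
    eta = sum(post.values())
    return {tau: p / eta for tau, p in post.items()}

def is_unknown_policy(recent_likelihoods, p_thr):
    """Moving-average test (sketch): infer a previously unseen opponent policy when
    the mean reward likelihood over the last n episodes stays below P_thr for
    every known label tau."""
    return all(sum(ls) / len(ls) < p_thr for ls in recent_likelihoods.values())

# Two known opponent policies with identical reward models: the reward alone cannot
# separate them, but the opponent-model posterior p(tau) breaks the tie.
perf = {("tau1", "pi1"): (0.0, 1.0), ("tau2", "pi1"): (0.0, 1.0)}
belief = rectified_update({"tau1": 0.5, "tau2": 0.5}, {"tau1": 3.0, "tau2": 1.0},
                          utility=0.0, perf=perf, policy="pi1")
# belief["tau1"] is now 0.75 even though both reward likelihoods are equal.
```

This is exactly the miscoordination case of Equations 5-7: the reward term is uninformative, and the p(τ) factor supplies the missing evidence.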
However, vanilla BPR+ learns from scratch each time it encounters a new opponent, which can be extremely inefficient. There may exist certain similarities among the opponent's different policies, as well as among their associated response policies. Thus it is reasonable to reuse a learned response policy as the starting policy for subsequent policy optimization when its associated opponent model is highly similar to the new opponent model. A straightforward way to achieve this, given the opponent model τ̂_o, is to directly reuse the policy π_i whose associated opponent model τ̂_i is closest under D_KL(τ̂_o, τ̂_i) as the starting policy. Optimizing from a learned policy improves learning efficiency, but it may also incur insufficient exploration, resulting in suboptimal solutions. To address this issue, we introduce DPN to achieve efficient learning while avoiding getting stuck in suboptimal solutions.\nDPN, as depicted in Figure 1(a), is comprised of a shared convolution layer and multiple separate controller layers, each of which is trained for a distinct response policy. A different label is fed into DPN to switch among response policies by linking the shared convolution layer with the corresponding control layer, achieving convenient, fast policy switching. Specifically, the response policies against the opponent's n different policies are trained separately and distinguished by different labels. The training samples (i.e., Q-values) generated by different response policies are assigned the corresponding labels. 
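The distillation step of Equation 3, which consolidates the labeled teacher Q-values into the single student network, can be sketched as follows. This is a pure-Python stand-in for a network forward pass, and the temperature value is hypothetical:

```python
import math

def softmax(z, t=1.0):
    """Temperature-scaled softmax; t < 1 sharpens the teacher's Q-value targets."""
    m = max(v / t for v in z)
    exps = [math.exp(v / t - m) for v in z]
    s = sum(exps)
    return [e / s for e in exps]

def distillation_loss(teacher_q, student_q, t=0.01):
    """Equation 3 (sketch): KL between the temperature-sharpened teacher outputs
    and the student outputs, summed over the distillation dataset D_psi."""
    loss = 0.0
    for q_t, q_s in zip(teacher_q, student_q):
        p = softmax(q_t, t)   # teacher target softmax(q_psi / t)
        q = softmax(q_s)      # student prediction softmax(q_phi)
        loss += sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    return loss
```

In a real DPN, this loss would be minimized by gradient descent over the student's parameters, with one labeled controller head per response policy.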
To consolidate multiple policies into a single DPN, supervised learning is performed to minimize the distillation loss using training data with the n kinds of labels simultaneously. Intuitively, compared to training an independent convolution layer for each response policy, this architecture forces the shared convolution layer to learn more generalized features against different opponents, better describing the environment. As depicted in Figure 1(b), when encountering a new opponent, the starting policy is initialized by connecting the learned shared convolution layer in DPN with a randomly initialized control layer. By optimizing from this starting policy, deep BPR+ learns more efficiently and robustly against different opponents than by directly reusing a response policy.\n\nFigure 1: (a) the policy distillation in deep BPR+ and (b) the starting policy initialized by DPN.\n\nOnce the response policy is obtained, policy distillation is performed again to update DPN. Note that deep BPR+ does not require storing training data during online learning, because the training data can be regenerated using DPN whenever a new policy needs to be distilled. Another side benefit that DPN brings is high space efficiency, due to its shared convolution layers.\nIt is worth noting that deep BPR+ differs from DRON [6] in several ways: 1) deep BPR+ uses DPN to maintain accurate one-to-one response policies, while DRON uses an end-to-end trained response subnetwork, which cannot guarantee that each response policy is good enough against a particular type of opponent. 2) In DRON, the parameter K in the mixture-of-experts is fixed and thus cannot handle the case where the number of opponents changes dynamically. In contrast, deep BPR+ can flexibly add any new "expert policy" into its policy library, leading to continuous performance improvement in an online fashion. 
3) For policy switching, deep BPR+ is more general, as it requires no additional information except the opponent's past actions, whereas DRON usually requires additional hand-crafted features, which are difficult to generalize across domains.\n\n4 Experiments\n\nThis section presents empirical results on a gridworld game adapted from [8], a navigation game adapted from [3], and a soccer game adapted from [6, 15]. Comparisons among BPR [20], BPR+ [10] and deep BPR+ are performed to verify their performance. For comparisons in multiagent DRL with raw image inputs, BPR and BPR+ here use a neural network as the function approximator. An omniscient agent, equipped with pre-trained optimal policies against the coexisting agent, is adopted as the baseline. In all games involving a non-stationary agent, deep BPR+ is empirically verified in terms of detection accuracy, cumulative reward and learning speed of a new response policy. The detailed network architecture of deep BPR+ and the corresponding hyperparameters are described in the Appendix.\n\n4.1 Game Description\n\nFigure 2(a) is a cooperative gridworld game, where two agents (A and O) are required to enter their respective goals (G(A) and G(O)) while avoiding collisions. Each agent has five actions to choose from: {north, south, east, west, stay}. Every movement moves the agent into the neighboring grid cell in the corresponding direction, except that a collision with the edge of the grid or a thick wall results in no movement. Once entering a goal cell, the agent receives a positive reward of +5 and stays there until the episode finishes. A reward of 0 is received whenever entering a non-goal state. Besides, if the two agents try to enter the same cell, both of them stay and receive a negative reward of -1 as punishment. The episode ends only when both agents have reached their goals. 
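The gridworld reward rules just described can be captured in a small helper. The function names and argument conventions are illustrative only, taken directly from the values in the text:

```python
def gridworld_reward(entered_goal, collision):
    """Per-step reward for one agent in the cooperative gridworld: -1 when both
    agents tried to enter the same cell, +5 on reaching the goal cell, else 0."""
    if collision:
        return -1
    return 5 if entered_goal else 0

def episode_done(a_at_goal, o_at_goal):
    """The episode ends only when both agents have reached their goals."""
    return a_at_goal and o_at_goal
```

Such a reward structure is exactly what makes miscoordination uninformative for a reward-only belief: every failed joint move looks identical.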
Figure 2(b) is a cooperative navigation game sharing similar settings, except for some minor differences: 1) there is a thick wall (in gray) separating the area into two zones, and the two agents are required to enter the same goal (green cells), marked by "G"; 2) both agents receive a positive reward if they enter the same goal state together, upon which the episode ends; otherwise, a small positive reward proportional to their Euclidean distance is received due to miscoordination.\n\n(a) cooperative gridworld (b) cooperative navigation game (c) competitive soccer\n\nFigure 2: The configuration of tested games.\n\nFigure 2(c) depicts a competitive 5 × 5 soccer game, where two agents try to steal the ball (gray circle) from each other and carry it to their respective goal areas. Different from the previous settings, if both agents move to the same grid cell, the possession of the ball exchanges and the move does not take place. Once an agent takes the ball to its goal, it scores a reward of +10 while the other agent receives -10 as punishment; the game then ends and resets to the configuration shown in the figure, with agent O holding the ball. Intuitively, in all tested games, agent A can achieve higher performance by detecting the other agent's intention more accurately.\n\n4.2 Non-stationary Policy Detection\n\nIn the games shown in Figure 2, agent O has 6, 5 and 5 randomly initialized policies (denoted in Figure 2), respectively, in hand, and switches its policy every several episodes. Agent A is equipped with the corresponding pre-trained response policies and aims at selecting the most appropriate policy in hand to reuse against agent O by detecting its behavior. 
To improve detection accuracy, deep BPR+(D) uses RBM for non-stationary policy detection and policy selection against a non-stationary agent.\n\nFigure 3: Comparisons of average detection accuracy (10 random seeds) against a non-stationary agent.\n\nFigure 3 shows the average detection accuracy of the opponent's policy: deep BPR+(D), compared to BPR+, performs closer to the omniscient agent and achieves higher accuracy. It is worth noting that when the non-stationary agent switches its policy more frequently, the detection accuracy of BPR+ degrades dramatically, while deep BPR+(D) still maintains a relatively high accuracy. The accuracy improvement comes mainly from the RBM, which overcomes two drawbacks of vanilla BPR+: 1) once vanilla BPR+ fails to distinguish the opponent's policy, it has to try every known opponent policy, which is quite time-consuming; 2) the opponent may switch its policy again before vanilla BPR+ finds the correct policy, which perturbs the update of the internal belief model and further degrades detection accuracy. Empirically, thanks to the RBM, deep BPR+(D) performs better by choosing the best response policy to execute, especially when the opponent switches policies frequently.\n\n4.3 Efficient Learning against Unknown Policies\n\nIn this experiment, we evaluate the learning efficiency of deep BPR+ against agent O adopting new, unknown policies during interaction. Different from the previous settings, agent A only knows, in the beginning, how to respond to any of agent O's policies excluding policies A and E in Figure 2, and is required to learn new response policies against them in an online fashion.\n\nFigure 4: Comparisons of a family of BPR algorithms including BPR, BPR+, deep BPR+(D) using only the RBM in detection, deep BPR+(L) using the DPN in learning, and deep BPR+(D+L) using both. The cumulative reward is normalized by the omniscient reward. 
Following previous work [10], we assume that agent O will not switch its policy before agent A learns the new response policy. Otherwise, a policy switch happens after agent O executes a policy 5 to 20 times at random. The comparisons in terms of online learning speed and cumulative reward are given in Figure 4.\nBPR performs poorly in the three games due to its inability to detect and respond to agent O using unknown policies, while all the others can detect it and learn the corresponding response policies. Note that, due to the high learning efficiency that DPN brings, deep BPR+(D+L) and deep BPR+(L) achieve higher rewards than deep BPR+(D) and BPR+. As an example, in Figure 4(b), agent O first uses a new policy E, unknown to agent A, around the 20th episode. Deep BPR+(D+L) and deep BPR+(L) can detect this and efficiently learn the response policies by optimizing from the starting policy initialized by DPN, while deep BPR+(D) and BPR+ take relatively longer. The same situation happens around the 175th episode, when another new unknown policy A is adopted. Also, in the subsequent interaction, DPN can efficiently reuse the learned response policy whenever policy A or E is re-encountered. Another observation is that the RBM achieves higher performance than the vanilla belief model, regardless of whether DPN is used.\n\nFigure 5: Comparison of learning performance when optimizing from different policies.\n\nTo further verify the benefit of DPN in learning against an unknown policy, we compare optimizing directly from a starting policy initialized by DPN against optimizing from learned response policies, as demonstrated in Figure 5. To make the comparison fair, rather than reusing entire learned response policies, we reuse only their convolution layers. 
One observation is that, whether facing unknown policy A or E, optimizing from a starting\npolicy initialized by the DPN achieves higher performance and better stability (lower variance) than\noptimizing directly from response policies.\nAnother observation is that, although learning from an individual response policy with DQN may\noccasionally achieve similar performance (e.g., response policy πA in Figure 5(b)), a poorly selected\nresponse policy significantly degrades the online learning performance (e.g., πB, πC or πD in\nFigure 5). Moreover, it is difficult to determine in advance which response policy should be chosen as\nthe starting policy. In contrast, the DPN offers a generalized and elegant way of initializing the starting\npolicy, achieving promising performance without the need to choose among response policies.\n\nFigure 6: Visualization of trajectories when learning against unknown policies in the navigation game.\n\nIn the previous results, the DPN shows competitive performance and robustness against different\nunknown policies. To investigate the intuition behind this, we visualize the trajectories of the last 200\nepisodes when learning from different response policies πi as well as from the DPN against unknown\npolicies A and E, as shown in Figure 6. To be fair, the exploration rate during learning is the same for\nall settings. Intuitively, bootstrapping from learned policies tends to get stuck in local optima, while\nthe DPN can avoid this and thus achieve better performance. For example, when learning against\nunknown policy A, response policies πB, πC, πD and πE seem to explore in the wrong direction,\nresulting in inefficient exploration (Figure 6(a-d)). A similar phenomenon can be found when learning\nagainst policy E while optimizing from response policies πA, πC and πD (Figure 6(f, h, i)). 
In contrast, the DPN\nachieves more efficient exploration with the same exploration rate and thus obtains better results. We\nhypothesize that this is because the network architecture allows the shared convolution layers to learn\nmore generalized features across different opponents, better describing the environment and guiding\nthe agent to explore in the right direction.\n\n5 Conclusion and Future Work\n\nThis paper proposes the deep BPR+ algorithm to play against non-stationary agents in multiagent\nscenarios. The rectified belief model is introduced by utilizing the opponent model, achieving\naccurate policy detection from the perspectives of both received reward signals and the opponent's\nbehaviors. The distilled policy network is proposed as the policy library, achieving efficient learning\nagainst unknown policies, convenient policy reuse and efficient policy storage. Empirical evaluations\ndemonstrate that the deep BPR+ algorithm achieves better performance than other existing algorithms\non three complex Markov games. As future work, we would like to investigate how to act optimally\nin the face of adaptive agents whose behaviors change continuously over time.\n\nAcknowledgments\n\nThe work is supported by the National Natural Science Foundation of China (Grant Nos.: 61702362,\n61876119, 61502323), Special Program of Artificial Intelligence, Tianjin Research Program of\nApplication Foundation and Advanced Technology (No.: 16JCQNJC00100), Special Program\nof Artificial Intelligence of Tianjin Municipal Science and Technology Commission (No.:\n17ZXRGGX00150), Science and Technology Program of Tianjin, China (Grant Nos.:\n15PTCYSY00030, 16ZXHLGX00170), Natural Science Foundation of Jiangsu (No.: BK20181432),\nand High School Natural Foundation of Jiangsu (No.: 16KJB520041).\n\nReferences\n\n[1] Stefano V. Albrecht and Peter Stone. Autonomous agents modelling other agents: A comprehensive survey and open problems. 
Artificial Intelligence, 258:66–95, 2018.\n\n[2] Georgios Chalkiadakis and Craig Boutilier. Coordination in multiagent reinforcement learning:\na Bayesian approach. In Proceedings of the 2nd International Conference on Autonomous\nAgents and Multiagent Systems (AAMAS), pages 709–716, 2003.\n\n[3] Jacob W. Crandall. Just add pepper: extending learning algorithms for repeated matrix games to\nrepeated Markov games. In Proceedings of the 11th International Conference on Autonomous\nAgents and Multiagent Systems (AAMAS), pages 399–406, 2012.\n\n[4] Jakob N. Foerster, Richard Y. Chen, Maruan Al-Shedivat, Shimon Whiteson, Pieter Abbeel,\nand Igor Mordatch. Learning with opponent-learning awareness. In Proceedings of the 17th\nInternational Conference on Autonomous Agents and Multiagent Systems (AAMAS), pages\n122–130, 2018.\n\n[5] Jayesh K. Gupta, Maxim Egorov, and Mykel J. Kochenderfer. Cooperative multi-agent control\nusing deep reinforcement learning. In Adaptive Learning Agents Workshop, 2017.\n\n[6] He He and Jordan L. Boyd-Graber. Opponent modeling in deep reinforcement learning. In\nProceedings of the 33rd International Conference on Machine Learning (ICML), pages 1804–1813, 2016.\n\n[7] Pablo Hernandez-Leal and Michael Kaisers. Learning against sequential opponents in repeated\nstochastic games. In The 3rd Multi-disciplinary Conference on Reinforcement Learning and\nDecision Making, 2017.\n\n[8] Pablo Hernandez-Leal and Michael Kaisers. Towards a fast detection of opponents in repeated\nstochastic games. In Proceedings of the 16th International Conference on Autonomous Agents\nand Multiagent Systems (AAMAS) 2017 Workshops, pages 239–257, 2017.\n\n[9] Pablo Hernandez-Leal, Michael Kaisers, Tim Baarslag, and Enrique Munoz de Cote. A survey\nof learning in multiagent environments: Dealing with non-stationarity. CoRR, abs/1707.09183,\n2017.\n\n[10] Pablo Hernandez-Leal, Benjamin Rosman, Matthew E. 
Taylor, Luis Enrique Sucar, and Enrique\nMunoz de Cote. A Bayesian approach for learning and tracking switching, non-stationary\nopponents (extended abstract). In Proceedings of the 15th International Conference on\nAutonomous Agents and Multiagent Systems (AAMAS), pages 1315–1316, 2016.\n\n[11] Geoffrey E. Hinton, Oriol Vinyals, and Jeffrey Dean. Distilling the knowledge in a neural\nnetwork. CoRR, abs/1503.02531, 2015.\n\n[12] Zhang-Wei Hong, Shih-Yang Su, Tzu-Yun Shann, Yi-Hsiang Chang, and Chun-Yi Lee. A deep\npolicy inference q-network for multi-agent systems. In Proceedings of the 17th International\nConference on Autonomous Agents and Multiagent Systems (AAMAS), pages 1388–1396, 2018.\n\n[13] Junling Hu and Michael P. Wellman. Multiagent reinforcement learning: Theoretical framework\nand an algorithm. In Proceedings of the 15th International Conference on Machine Learning\n(ICML), pages 242–250, 1998.\n\n[14] Timothy P. Lillicrap, Jonathan J. Hunt, Alexander Pritzel, Nicolas Heess, Tom Erez, Yuval\nTassa, David Silver, and Daan Wierstra. Continuous control with deep reinforcement learning.\nIn International Conference on Learning Representations (ICLR), 2016.\n\n[15] Michael L. Littman. Markov games as a framework for multi-agent reinforcement learning. In\nProceedings of the 11th International Conference on Machine Learning (ICML), pages 157–163,\n1994.\n\n[16] Ryan Lowe, Yi Wu, Aviv Tamar, Jean Harb, Pieter Abbeel, and Igor Mordatch. Multi-agent\nactor-critic for mixed cooperative-competitive environments. In Advances in Neural Information\nProcessing Systems 30 (NIPS), pages 6382–6393, 2017.\n\n[17] Volodymyr Mnih, Adrià Puigdomènech Badia, Mehdi Mirza, Alex Graves, Timothy P. Lillicrap,\nTim Harley, David Silver, and Koray Kavukcuoglu. Asynchronous methods for deep reinforcement\nlearning. 
In Proceedings of the 33rd International Conference on Machine Learning\n(ICML), pages 1928–1937, 2016.\n\n[18] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A. Rusu, Joel Veness, Marc G.\nBellemare, Alex Graves, Martin A. Riedmiller, Andreas Fidjeland, Georg Ostrovski, Stig\nPetersen, Charles Beattie, Amir Sadik, Ioannis Antonoglou, Helen King, Dharshan Kumaran, Daan\nWierstra, Shane Legg, and Demis Hassabis. Human-level control through deep reinforcement\nlearning. Nature, 518(7540):529–533, 2015.\n\n[19] Gregory Palmer, Karl Tuyls, Daan Bloembergen, and Rahul Savani. Lenient multi-agent deep\nreinforcement learning. In Proceedings of the 17th International Conference on Autonomous\nAgents and Multiagent Systems (AAMAS), pages 443–451, 2018.\n\n[20] Benjamin Rosman, Majd Hawasly, and Subramanian Ramamoorthy. Bayesian policy reuse.\nMachine Learning, 104(1):99–127, 2016.\n\n[21] Andrei A. Rusu, Sergio Gomez Colmenarejo, Çağlar Gülçehre, Guillaume Desjardins, James\nKirkpatrick, Razvan Pascanu, Volodymyr Mnih, Koray Kavukcuoglu, and Raia Hadsell. Policy\ndistillation. CoRR, abs/1511.06295, 2015.\n\n[22] Tom Schaul, John Quan, Ioannis Antonoglou, and David Silver. Prioritized experience replay.\nIn International Conference on Learning Representations (ICLR), 2016.\n\n[23] Ardi Tampuu, Tambet Matiisen, Dorian Kodelja, Ilya Kuzovkin, Kristjan Korjus, Juhan Aru,\nJaan Aru, and Raul Vicente. Multiagent cooperation and competition with deep reinforcement\nlearning. PLOS ONE, 12(4):1–15, 2017.\n\n[24] Hado van Hasselt, Arthur Guez, and David Silver. Deep reinforcement learning with double\nQ-learning. In Proceedings of the 30th AAAI Conference on Artificial Intelligence (AAAI), pages\n2094–2100, 2016.\n\n[25] Ziyu Wang, Tom Schaul, Matteo Hessel, Hado van Hasselt, Marc Lanctot, and Nando de Freitas.\nDueling network architectures for deep reinforcement learning. 
In Proceedings of the 33rd\nInternational Conference on Machine Learning (ICML), pages 1995–2003, 2016.\n\n[26] Xiangyu Zhao, Liang Zhang, Zhuoye Ding, Dawei Yin, Yihong Zhao, and Jiliang Tang. Deep\nreinforcement learning for list-wise recommendations. CoRR, abs/1801.00209, 2018.\n", "award": [], "sourceid": 526, "authors": [{"given_name": "YAN", "family_name": "ZHENG", "institution": "Tianjin University"}, {"given_name": "Zhaopeng", "family_name": "Meng", "institution": "School of Computer Software, Tianjin University"}, {"given_name": "Jianye", "family_name": "Hao", "institution": "Tianjin University"}, {"given_name": "Zongzhang", "family_name": "Zhang", "institution": "Soochow University"}, {"given_name": "Tianpei", "family_name": "Yang", "institution": "Tianjin University"}, {"given_name": "Changjie", "family_name": "Fan", "institution": "Netease"}]}