{"title": "Distributional Reward Decomposition for Reinforcement Learning", "book": "Advances in Neural Information Processing Systems", "page_first": 6215, "page_last": 6224, "abstract": "Many reinforcement learning (RL) tasks have specific properties that can be leveraged to modify existing RL algorithms to adapt to those tasks and further improve performance, and a general class of such properties is the multiple reward channel. In those environments the full reward can be decomposed into sub-rewards obtained from different channels. Existing work on reward decomposition either requires prior knowledge of the environment to decompose the full reward, or decomposes reward without prior knowledge but with degraded performance. In this paper, we propose Distributional Reward Decomposition for Reinforcement Learning (DRDRL), a novel reward decomposition algorithm which captures the multiple reward channel structure under distributional setting. Empirically, our method captures the multi-channel structure and discovers meaningful reward decomposition, without any requirements on prior knowledge. 
Consequently, our agent achieves better performance than existing methods on environments with multiple reward channels.", "full_text": "Distributional Reward Decomposition for\n\nReinforcement Learning\n\nZichuan Lin\u2217\n\nTsinghua University\n\nlinzc16@mails.tsinghua.edu.cn\n\nLi Zhao\n\nMicrosoft Research\n\nlizo@microsoft.com\n\nDerek Yang\nUC San Diego\n\ndyang1206@gmail.com\n\nTao Qin\n\nMicrosoft Research\n\ntaoqin@microsoft.com\n\nGuangwen Yang\nTsinghua University\n\nygw@tsinghua.edu.cn\n\nAbstract\n\nTie-Yan Liu\n\nMicrosoft Research\n\ntyliu@microsoft.com\n\nMany reinforcement learning (RL) tasks have speci\ufb01c properties that can be lever-\naged to modify existing RL algorithms to adapt to those tasks and further improve\nperformance, and a general class of such properties is the multiple reward channel.\nIn those environments the full reward can be decomposed into sub-rewards obtained\nfrom different channels. Existing work on reward decomposition either requires\nprior knowledge of the environment to decompose the full reward, or decomposes\nreward without prior knowledge but with degraded performance. In this paper,\nwe propose Distributional Reward Decomposition for Reinforcement Learning\n(DRDRL), a novel reward decomposition algorithm which captures the multiple\nreward channel structure under distributional setting. Empirically, our method cap-\ntures the multi-channel structure and discovers meaningful reward decomposition,\nwithout any requirements on prior knowledge. Consequently, our agent achieves\nbetter performance than existing methods on environments with multiple reward\nchannels.\n\n1\n\nIntroduction\n\nReinforcement learning has achieved great success in decision making problems since Deep Q-\nlearning was proposed by Mnih et al. [2015]. 
While general RL algorithms have been deeply studied, here we focus on RL tasks with specific properties that can be utilized to modify general RL algorithms for better performance. Specifically, we focus on RL environments with multiple reward channels where only the full reward is available.\n\n∗Contributed during internship at Microsoft Research.\n\n33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.\n\nReward decomposition has been proposed to investigate such properties. For example, in the Atari game Seaquest, the environment's reward can be decomposed into sub-rewards for shooting sharks and sub-rewards for rescuing divers. Reward decomposition views the total reward as the sum of sub-rewards that are usually disentangled and can be obtained independently (Sprague and Ballard [2003], Russell and Zimdars [2003], Van Seijen et al. [2017], Grimm and Singh [2019]), and aims at recovering these sub-rewards from the total reward. The sub-rewards may further be leveraged to learn better policies.\nVan Seijen et al. [2017] propose to split a state into different sub-states, each with a sub-reward obtained by training a general value function, and to learn multiple value functions with the sub-rewards. The architecture is rather limited because it requires prior knowledge of how to split states into sub-states. Grimm and Singh [2019] propose a more general method for reward decomposition via maximizing disentanglement between sub-rewards. In their work, an explicit reward decomposition is learned by maximizing the disentanglement of two sub-rewards estimated with action-value functions. However, their work requires that the environment can be reset to an arbitrary state, and it cannot be applied to general RL settings where states can hardly be revisited. 
Furthermore, despite the meaningful reward decomposition they achieve, they do not leverage the decomposition to learn better policies.\nIn this paper, we propose Distributional Reward Decomposition for Reinforcement Learning (DRDRL), an RL algorithm that captures the latent multiple-channel structure of the reward under the distributional RL setting. Distributional RL differs from value-based RL in that it estimates the distribution rather than the expectation of returns, and therefore captures richer information than value-based RL. We propose an RL algorithm that estimates the distributions of the sub-returns and combines the sub-returns to obtain the distribution of the total return. In order to avoid naive decompositions such as 0-1 or half-half, we further propose a disentanglement regularization term that encourages the sub-returns to diverge. To better separate reward channels, we also design our network to learn different state representations for different channels.\nWe test our algorithm on Atari games with multiple reward channels. Empirically, our method:\n\n• Discovers meaningful reward decomposition.\n• Requires no external information.\n• Achieves better performance than existing RL methods.\n\n2 Background\n\nWe consider a general reinforcement learning setting in which the interaction of the agent and the environment can be viewed as a Markov Decision Process (MDP). Denoting the state space by X, the action space by A, the state transition function by P, the state-action dependent reward function by R and the discount factor by γ, we write this MDP as (X, A, R, P, γ).\nGiven a fixed policy π, reinforcement learning estimates the action-value function of π, defined by Q^π(x, a) = E[Σ_{t=0}^∞ γ^t r_t(x_t, a_t)], where (x_t, a_t) is the state-action pair at time t, x_0 = x, a_0 = a, and r_t is the corresponding reward. 
The Bellman equation characterizes the action-value function via temporal equivalence:\n\nQ^π(x, a) = R(x, a) + γ E_{x', a'}[Q^π(x', a')],\n\nwhere x' ∼ P(·|x, a) and a' ∼ π(·|x'). To maximize the total return E_{x_0, a_0}[Q^π(x_0, a_0)], one common approach is to find the fixed point of the Bellman optimality operator\n\nQ*(x, a) = T Q*(x, a) = R(x, a) + γ E_{x'}[max_{a'} Q*(x', a')]\n\nby minimizing the temporal difference (TD) error\n\nδ_t^2 = [r_t + γ max_{a'∈A} Q(x_{t+1}, a') - Q(x_t, a_t)]^2\n\nover samples (x_t, a_t, r_t, x_{t+1}) along the trajectory. Mnih et al. [2015] propose Deep Q-Networks (DQN), which learn the action-value function with a neural network and achieve human-level performance on the Atari-57 benchmark.\n\n2.1 Reward Decomposition\n\nStudies of reward decomposition have also led to state decomposition (Laversanne-Finot et al. [2018], Thomas et al. [2017]), where state decomposition is leveraged to learn different policies. Extending this line of work, Grimm and Singh [2019] explore decomposing the reward function directly, which is the work most closely related to ours. 
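The TD error in the background section can be made concrete with a tabular Q-learning update. Below is a minimal sketch on a toy one-state MDP; the MDP itself, the step size, and the iteration count are our own illustrative assumptions, not from the paper:

```python
import numpy as np

# Toy MDP: a single state with two actions; action 0 yields reward 1,
# action 1 yields reward 0. With gamma = 0.9, Q*(s, 0) = 1 / (1 - 0.9) = 10
# and Q*(s, 1) = 0 + 0.9 * 10 = 9.
gamma, alpha = 0.9, 0.5
rewards = np.array([1.0, 0.0])
Q = np.zeros(2)

for _ in range(500):
    for a in (0, 1):
        # TD error: delta = r + gamma * max_a' Q(s', a') - Q(s, a)
        delta = rewards[a] + gamma * Q.max() - Q[a]
        Q[a] += alpha * delta  # gradient step on the squared TD error

print(Q)  # approaches [10.0, 9.0]
```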
Denote the i-th (i = 1, 2, ..., N) sub-reward function at state-action pair (x, a) as r_i(x, a); the complete reward function is given by\n\nr = Σ_{i=1}^N r_i.\n\nFor each sub-reward function, consider the sub-value function U^π_i and the corresponding policy π_i:\n\nU^π_i(x_0, a_0) = E_{x_t, a_t}[Σ_{t=0}^∞ γ^t r_i(x_t, a_t)],    π_i = argmax_π U^π_i,\n\nwhere x_t ∼ P(·|π, x_0, a_0) and a_t ∼ π(·|x_t).\nIn their work, a reward decomposition is considered meaningful if each reward is obtained independently (i.e., π_i should not obtain r_j) and each reward is obtainable.\nTo evaluate the two desiderata, the work proposes the following values:\n\nJ_independent(r_1, ..., r_n) = E_{s∼μ}[Σ_{i≠j} α_{i,j}(s) U_j^{π*_i}(s)],    (1)\n\nJ_nontrivial(r_1, ..., r_n) = E_{s∼μ}[Σ_{i=1}^n α_{i,i}(s) U_i^{π*_i}(s)],    (2)\n\nwhere α_{i,j} is a weight-control term, set to 1 for simplicity in their work. During training, the network maximizes J_nontrivial - J_independent to achieve the desired reward decomposition.\n\n2.2 Distributional Reinforcement Learning\n\nIn most reinforcement learning settings, the environment is not deterministic. Moreover, people generally train RL models with an ε-greedy policy to allow exploration, which makes the agent stochastic as well. To better analyze the randomness under this setting, Bellemare et al. [2017] propose the C51 algorithm and conduct a theoretical analysis of distributional RL.\nIn distributional RL, the reward R_t is viewed as a random variable, and the total return is defined by Z = Σ_{t=0}^∞ γ^t R_t. 
The expectation of Z is the traditional action-value Q, and the distributional Bellman optimality operator is given by\n\nT Z(x, a) =_D R(x, a) + γ Z(x', argmax_{a'∈A} E[Z(x', a')]),\n\nwhere A =_D B means that random variables A and B follow the same distribution. In C51, the random variable is characterized by a categorical distribution over a fixed set of values, and C51 outperforms all previous variants of DQN on the Atari domain.\n\n3 Distributional Reward Decomposition for Reinforcement Learning\n\n3.1 Distributional Reward Decomposition\n\nIn many reinforcement learning environments, there are multiple sources from which an agent receives reward, as shown in Figure 1(b). Our method is mainly designed for environments with this property. Under the distributional setting, we assume the reward and sub-rewards are random variables and denote them by R and R_i respectively. In our architecture, the categorical distribution of each sub-return Z_i = Σ_{t=0}^∞ γ^t R_{i,t} is the output of a network, denoted by F_i(x, a). Note that in most cases sub-returns are not independent, i.e., P(Z_i = v) ≠ P(Z_i = v | Z_j). So in principle we would need F_{ij}(x, a) for each i and j to obtain the distribution of the full return. We call this architecture the non-factorial model or full-distribution model; its architecture is shown in the appendix. However, experiments show that using the approximation P(Z_i = v | Z_j) ≈ P(Z_i = v), so that only F_i(x, a) is required, performs much better than directly computing F_{ij}(x, a) for all i, j; we believe this is due to the increased sample number. In this paper, we therefore approximate the conditional probability P(Z_i = v | Z_j) with P(Z_i = v).\nConsider categorical distribution functions F_i and F_j with the same number of atoms K, where the k-th atom is denoted by a_k with value a_k = a_0 + kl, 1 ≤ k ≤ K, and l is a constant. 
Let random variables Z_i ∼ F_i and Z_j ∼ F_j; from basic probability theory we know that the distribution function of Z = Z_i + Z_j is the convolution of F_i and F_j:\n\nF(v) = P(Z_i + Z_j = v) = Σ_{k=1}^K P(Z_i = a_k) P(Z_j = v - a_k | Z_i = a_k) ≈ Σ_{k=1}^K P(Z_i = a_k) P(Z_j = v - a_k) = (F_i ∗ F_j)(v).    (3)\n\nWhen we use N sub-returns, the distribution function of the total return is then given by F = F_1 ∗ F_2 ∗ ··· ∗ F_N, where ∗ denotes linear 1D convolution.\n\nFigure 1: (a) Distributional reward decomposition network architecture. (b) Examples of multiple reward channels in Atari games: the top row shows examples of Seaquest, in which the submarine receives rewards from both shooting sharks and rescuing divers; the bottom row shows examples of Hero, where the hero receives rewards from both shooting bats and rescuing people.\n\nWhile reward decomposition is not explicitly done in our algorithm, we can derive the decomposed rewards using trained agents. Recall that the total return Z = Σ_{i=1}^N Z_i follows the Bellman equation, so naturally we have\n\nT Z =_D T(Σ_{i=1}^N Z_i) =_D R + γZ' = (Σ_{i=1}^N R_i) + γ(Σ_{i=1}^N Z'_i),    (4)\n\nwhere Z'_i represents the sub-return on the next state-action pair. Note that we only have access to samples of the full reward R; the sub-rewards R_i are latent, and for visualization a direct way of deriving them is given by\n\nR_i = Z_i - γZ'_i.    (5)\n\nIn the next section we present an example of these sub-rewards by taking their expectation E(R_i). Note that our reward decomposition is latent and we do not need R_i for our algorithm; Eq. 5 only provides an approach to visualize the decomposition.\n\n3.2 Disentangled Sub-returns\n\nTo obtain a meaningful reward decomposition, we want the sub-rewards to be disentangled. 
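The convolution in Eq. 3 is easy to check numerically. Below is a minimal NumPy sketch that combines two categorical sub-return distributions into the full-return distribution, as in F = F_1 ∗ F_2; the atom grid (here a_k = a_0 + kl with k starting at 0) and the probabilities are made-up illustrative values:

```python
import numpy as np

# Two categorical sub-return distributions over K atoms each,
# with atoms a_k = a0 + k*l for k = 0..K-1 (illustrative indexing).
K, a0, l = 5, 0.0, 1.0
F1 = np.array([0.1, 0.2, 0.4, 0.2, 0.1])
F2 = np.array([0.5, 0.2, 0.1, 0.1, 0.1])

# Under the independence approximation P(Zi = v | Zj) ~ P(Zi = v),
# the distribution of Z = Z1 + Z2 is the 1D convolution F1 * F2 (Eq. 3).
F = np.convolve(F1, F2)

# The summed return lives on 2K - 1 atoms starting at 2*a0, spacing l.
atoms = 2 * a0 + l * np.arange(2 * K - 1)
mean_Z = float(np.dot(atoms, F))  # equals E(Z1) + E(Z2)
```

Convolution preserves total probability mass and adds the means, which is a quick sanity check that the combined categorical distribution is still a valid distribution of the summed return.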
Inspired by Grimm and Singh [2019], we compute the disentanglement of the distributions of two sub-returns F_i and F_j on state x with the following value:\n\nJ^{ij}_disentang = D_KL(F_{x, argmax_a E(Z_i)} || F_{x, argmax_a E(Z_j)}),    (6)\n\nwhere D_KL denotes the cross-entropy term of the KL divergence.\nIntuitively, J^{ij}_disentang estimates the disentanglement of sub-returns Z_i and Z_j by first obtaining the actions that maximize E(Z_i) and E(Z_j) respectively, and then computing the KL divergence between the estimated total-return distributions of the two actions. If Z_i and Z_j are independent, the actions maximizing the two sub-returns will differ, and that difference is reflected in the estimate of the total return. By maximizing this value, we can expect a meaningful reward decomposition that learns independent rewards.\n\n3.3 Projected Bellman Update with Regularization\n\nFollowing C51 (Bellemare et al. [2017]), we use a projected Bellman update for our algorithm. When the Bellman optimality operator is applied, the atoms of T Z are shifted by r_t and scaled by γ. However, to compute the loss, usually the KL divergence between Z and T Z, the two categorical distributions must be defined on the same set of atoms, so the target distribution T Z needs to be projected onto the original set of atoms before the Bellman update. 
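Before moving on, the disentanglement value in Eq. 6 can be sketched with NumPy. Randomly generated per-action return distributions stand in for the network outputs; the shapes, the seed, and the epsilon for numerical stability are our assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
A, M = 6, 51  # number of actions, number of atoms

# Stand-ins for network outputs: a full-return distribution F[a] per action,
# and per-channel sub-return means E(Z_i)(a), E(Z_j)(a).
F = rng.random((A, M))
F /= F.sum(axis=1, keepdims=True)
zi_mean = rng.random(A)
zj_mean = rng.random(A)

def disentanglement(F, zi_mean, zj_mean, eps=1e-8):
    """Eq. 6: divergence between the full-return distributions at the
    greedy actions of the two channels (the paper uses the cross-entropy
    term; the full KL is shown here for a cleaner zero baseline)."""
    p = F[np.argmax(zi_mean)]
    q = F[np.argmax(zj_mean)]
    return float(np.sum(p * (np.log(p + eps) - np.log(q + eps))))

J = disentanglement(F, zi_mean, zj_mean)
```

When the two channels pick the same greedy action the value is zero, so maximizing it pushes the channels toward preferring different actions.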
Consider a sample transition (x, a, r, x'). The projection operator Φ proposed in C51 is given by\n\n(Φ T Z(x, a))_i = Σ_{j=0}^{M-1} [1 - |[r + γa_j]_{V_min}^{V_max} - a_i| / l]_0^1 F_{x',a'}(a_j),    (7)\n\nwhere M is the number of atoms in C51 and [·]_a^b bounds its argument in [a, b]. The sample loss for (x, a, r, x') is then given by the cross-entropy term of the KL divergence between Z and Φ T Z:\n\nL_{x,a,r,x'} = D_KL(Φ T Z(x, a) || Z(x, a)).    (8)\n\nLet F_θ be a neural network parameterized by θ. We combine the distributional TD error and the disentanglement term to jointly update θ. For each sample transition (x, a, r, x'), θ is updated by minimizing the following objective function:\n\nL_{x,a,r,x'} - λ Σ_i Σ_{j≠i} J^{ij}_disentang,    (9)\n\nwhere λ is a coefficient that controls the strength of the disentanglement regularization.\n\n3.4 Multi-channel State Representation\n\nOne complication of the approach outlined above is that very often the distribution F_i cannot distinguish itself from the other distributions (e.g., F_j, j ≠ i) during learning, since they all depend on the same state feature input. This makes it difficult to maximize disentanglement by joint training, as the different distribution functions are exchangeable. A naive idea is to split the state feature ψ(x) into N pieces (e.g., ψ(x)_1, ψ(x)_2, ..., ψ(x)_N) so that each distribution depends on a different sub-state-feature. However, we empirically found that this method is not enough to learn well-disentangled sub-returns.\nTo address this problem, we utilize an idea similar to universal value function approximation (UVFA) (Schaul et al. [2015]). 
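The projection operator in Eq. 7 can be sketched directly from the formula. This is a minimal single-transition NumPy version; the atom grid and inputs are illustrative, and a real implementation would vectorize over a batch:

```python
import numpy as np

def project(probs, r, gamma, atoms):
    """Phi T Z (Eq. 7): distribute each shifted-and-shrunk atom
    r + gamma * a_j onto its neighbouring atoms with triangular weights."""
    vmin, vmax = atoms[0], atoms[-1]
    l = atoms[1] - atoms[0]  # atom spacing
    out = np.zeros_like(probs)
    for a_j, p_j in zip(atoms, probs):
        tz = np.clip(r + gamma * a_j, vmin, vmax)           # [r + gamma*a_j] bounded
        w = np.clip(1.0 - np.abs(tz - atoms) / l, 0.0, 1.0)  # [1 - |.|/l] in [0, 1]
        out += w * p_j
    return out

M = 51
atoms = np.linspace(-10.0, 10.0, M)
probs = np.full(M, 1.0 / M)  # uniform next-state distribution, for illustration
target = project(probs, r=1.0, gamma=0.99, atoms=atoms)
```

Because each shifted atom lands on at most two neighbouring atoms whose triangular weights sum to one, the projected distribution still sums to one; with gamma = 0 and r equal to an atom value, the projection collapses to a point mass at that atom.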
The key idea is to take a one-hot embedding as an additional input to condition the categorical distribution function, and to apply element-wise multiplication ψ ⊙ φ to force interaction between the state features and the one-hot embedding feature:\n\nF_i(x, a) = F_{θ_i}(ψ(x) ⊙ φ(e_i))(a),    (10)\n\nwhere e_i denotes the one-hot embedding whose i-th element is one, and φ denotes a one-layer non-linear neural network that is updated by backpropagation during training.\nIn this way, the agent explicitly learns a different distribution function for each channel. The complete network architecture is shown in Figure 1(a).\n\n4 Experiment Results\n\nWe tested our algorithm on games from the Arcade Learning Environment (ALE; Bellemare et al. [2013]). We conduct experiments on six Atari games, some with complicated rules and some with simple rules.\n\nFigure 2: Performance comparison with Rainbow. RD(N) represents using N-channel reward decomposition. Each training curve is averaged over three random seeds.\n\nFor our study, we implemented our algorithm on top of Rainbow (Hessel et al. [2018]), an advanced variant of C51 (Bellemare et al. [2017]) that achieved state-of-the-art results in the Atari games domain. We replace the update rule of Rainbow with Eq. 9, and the network architecture of Rainbow with our convolution-based architecture shown in Figure 1(a). In Rainbow, the Q-value is bounded by [V_min, V_max] where V_max = -V_min = 10. In our method, we bound the categorical distribution of each sub-return Z_i (i = 1, 2, ..., N) by the range [V_min/N, V_max/N]. Rainbow uses a categorical distribution with M = 51 atoms. For fair comparison, we assign K = ⌊M/N⌋ atoms to the distribution of each sub-return, which results in the same network capacity as the original architecture.\nOur code is built upon the dopamine framework (Castro et al. [2018]). 
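The channel-conditioned distribution of Eq. 10 can be sketched as follows. The feature dimension, the exact form of φ (one linear layer plus ReLU), and the random weights are illustrative stand-ins for the learned network, not the paper's actual architecture:

```python
import numpy as np

rng = np.random.default_rng(1)
N, d, A, K = 2, 16, 6, 25  # channels, feature dim, actions, atoms per channel

psi_x = rng.random(d)                 # state feature psi(x), stand-in for the CNN
E = rng.standard_normal((N, d))       # weights of the one-layer embedding phi
heads = rng.standard_normal((N, d, A * K))  # per-channel heads F_theta_i

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def channel_distribution(i):
    e_i = np.eye(N)[i]                   # one-hot channel embedding e_i
    phi = np.maximum(E.T @ e_i, 0.0)     # phi(e_i): one linear layer + ReLU
    h = psi_x * phi                      # element-wise product psi(x) ⊙ phi(e_i)
    logits = (h @ heads[i]).reshape(A, K)
    return softmax(logits)               # F_i(x, ·): one categorical dist per action

F1 = channel_distribution(0)
```

Even with shared state features ψ(x), the one-hot conditioning gives each channel its own gated representation, so the channel heads are no longer exchangeable.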
We use the default well-tuned hyper-parameter setting in dopamine. For the update rule in Eq. 9, we set λ = 0.0001. We run our agents for 100 epochs, each with 0.25 million training steps and 0.125 million evaluation steps. For evaluation, we follow the common practice of Van Hasselt et al. [2016], starting the game with up to 30 no-op actions to provide random starting positions for the agent. All experiments are performed on NVIDIA Tesla V100 16GB graphics cards.\n\n4.1 Comparison with Rainbow\n\nTo verify that our architecture achieves reward decomposition without degraded performance, we compare our method with Rainbow. We are not able to compare with Van Seijen et al. [2017] or Grimm and Singh [2019], since they require either pre-defined state preprocessing or environments resettable to specific states. We test our reward decomposition (RD) with 2 and 3 channels (i.e., RD(2) and RD(3)). The results are shown in Figure 2. Our methods perform significantly better than Rainbow on the environments we tested, which implies that our distributional reward decomposition method can help accelerate the learning process. We also observe that on some environments RD(3) performs better than RD(2), while on the rest the two have similar performance. We conjecture that this is due to the intrinsic settings of the environments. For example, in Seaquest and UpNDown the rules are relatively complicated, so RD(3) characterizes such complex reward better. In simple environments like Gopher and Asterix, however, RD(2) and RD(3) obtain similar performance, and sometimes RD(2) even outperforms RD(3).\n\n4.2 Reward Decomposition Analysis\n\nHere we use Seaquest to illustrate our reward decomposition. Figure 3 shows the sub-rewards obtained by taking the expectation of the LHS of Eq. 5, together with the original reward, along an actual trajectory. 
We observe that while r_1 = E(R_1) and r_2 = E(R_2) basically add up to the original reward r, r_1 dominates when the submarine is close to the surface, i.e., when it rescues the divers and refills oxygen. When the submarine scores by shooting sharks, r_2 becomes the main source of reward.\n\nFigure 3: Reward decomposition along the trajectory. While the sub-rewards r_1 and r_2 usually add up to the original reward r, the proportion of the sub-rewards greatly depends on how the original reward is obtained.\n\nWe also monitor the distributions of the different sub-returns while the agent is playing the game. In Figure 4(a), the submarine floats to the surface to rescue the divers and refill oxygen, and Z_1 has higher values. In Figure 4(b), as the submarine dives into the sea and shoots sharks, the expected values of Z_2 (orange) are higher than those of Z_1 (blue). This result implies that the reward decomposition indeed captures different sources of return, in this case shooting sharks and rescuing divers/refilling oxygen. We also provide statistics on actions for a quantitative analysis to support this argument. In Figure 6(a), we count the occurrence of each action obtained with argmax_a E(Z_1) and argmax_a E(Z_2) in a single trajectory, using the same policy as in Figure 4. 
We see that while Z_1 prefers going up, Z_2 prefers going down with fire.\n\n4.3 Visualization by Saliency Maps\n\nTo better understand the roles of the different sub-rewards, we train a DRDRL agent with two channels (N = 2) and compute saliency maps (Simonyan et al. [2013]). Specifically, to visualize the salient parts of the images as seen by the different sub-policies, we compute the absolute value of the Jacobian |∇_x Q_i(x, argmax_{a'} Q(x, a'))| for each channel. Figure 5 shows the visualization results. We find that channel 1 (red region) focuses on refilling oxygen, while channel 2 (green region) pays more attention to shooting sharks as well as to the positions where sharks are likely to appear.\n\n4.4 Direct Control using Induced Sub-policies\n\nWe also provide videos2 of running the sub-policies defined by π_i = argmax_a E(Z_i). To clarify, the sub-policies are never rolled out during training or evaluation and are only used to compute J^{ij}_disentang in Eq. 6. We execute these sub-policies and observe their differences from the main policy π = argmax_a E(Σ_{i=1}^N Z_i) to get a better visual sense of the reward decomposition. Take Seaquest in Figure 6(b) as an example: the two sub-policies show distinctive preferences. As Z_1 mainly captures the reward for surviving and rescuing divers, π_1 tends to stay close to the surface. Z_2, however, represents the return gained from shooting sharks, so π_2 appears much more aggressive than π_1. Also, without π_1, we see that π_2 dies quickly from running out of oxygen.\n\n2https://sites.google.com/view/drdpaper\n\nFigure 4: An illustration of how the sub-returns discriminate at different stages of the game. 
In figure (a), the submarine is refilling oxygen, while in figure (b) the submarine is shooting sharks.\n\nFigure 5: Sub-distribution saliency maps on the Atari game Seaquest, for a trained DRDRL with two channels (N = 2). One channel learns to pay attention to the oxygen, while the other channel learns to pay attention to the sharks.\n\nFigure 6: (a) Action statistics in an example trajectory of Seaquest. (b) Direct control using the two induced sub-policies π_1 = argmax_a E(Z_1), π_2 = argmax_a E(Z_2): the top picture shows that π_1 prefers to stay at the top to keep the agent alive; the bottom picture shows that π_2 prefers the aggressive action of shooting sharks.\n\n5 Related Work\n\nOur method is closely related to previous work on reward decomposition. Reward function decomposition has been studied among others by Russell and Zimdars [2003] and Sprague and Ballard [2003]. While these earlier works mainly focus on how to achieve an optimal policy given the decomposed reward functions, several recent works attempt to learn latent decomposed rewards. Van Seijen et al. [2017] construct an easy-to-learn value function by decomposing the reward function of the environment into n different reward functions. To ensure the learned decomposition is non-trivial, they propose to split a state into different pieces following domain knowledge and then feed the different state pieces into separate reward-function branches. While such a method can accelerate the learning process, it always requires pre-defined preprocessing techniques. There has been other work that explores learning a reward decomposition network end-to-end. Grimm and Singh [2019] investigate how to learn independently-obtainable reward functions. 
While they learn an interesting reward decomposition, their method requires environments that can be reset to specific states, since it needs multiple trajectories from the same starting state to compute their objective function. Besides, their method aims at learning a different optimal policy for each decomposed reward function. Different from the works above, our method can learn a meaningful implicit reward decomposition without any requirements on prior knowledge. Moreover, our method can leverage the decomposed sub-rewards to find better behaviour for a single agent.\nOur work also relates to Horde (Sutton et al. [2011]). The Horde architecture consists of a large number of 'sub-agents' (demons) that learn in parallel via off-policy learning. Each demon trains a separate general value function (GVF) based on its own policy and pseudo-reward function. A pseudo-reward can be any feature-based signal that encodes useful information. The Horde architecture is focused on building up general knowledge about the world, encoded via a large number of GVFs. UVFA (Schaul et al. [2015]) extends Horde along a different direction, enabling a value function to generalize across different goals. Our method focuses on learning an implicit reward decomposition in order to learn a control policy more efficiently.\n\n6 Conclusion\n\nIn this paper, we propose Distributional Reward Decomposition for Reinforcement Learning (DRDRL), a novel reward decomposition algorithm that captures the multiple-reward-channel structure under the distributional setting. Our algorithm significantly outperforms the state-of-the-art RL method Rainbow on Atari games with multiple reward channels. We also provide experimental analysis to gain insight into our algorithm. In the future, we may develop reward decomposition methods based on quantile networks (Dabney et al. 
[2018a,b]).\n\nAcknowledgments\n\nThis work was supported in part by the National Key Research & Development Plan of China (grants No. 2016YFA0602200 and 2017YFA0604500), and by the Center for High Performance Computing and System Simulation, Pilot National Laboratory for Marine Science and Technology (Qingdao).\n\nReferences\n\nMarc G Bellemare, Yavar Naddaf, Joel Veness, and Michael Bowling. The arcade learning environment: An evaluation platform for general agents. Journal of Artificial Intelligence Research, 47:253–279, 2013.\n\nMarc G Bellemare, Will Dabney, and Rémi Munos. A distributional perspective on reinforcement learning. In Proceedings of the 34th International Conference on Machine Learning, pages 449–458, 2017.\n\nPablo Samuel Castro, Subhodeep Moitra, Carles Gelada, Saurabh Kumar, and Marc G. Bellemare. Dopamine: A research framework for deep reinforcement learning. 2018. URL http://arxiv.org/abs/1812.06110.\n\nWill Dabney, Georg Ostrovski, David Silver, and Rémi Munos. Implicit quantile networks for distributional reinforcement learning. In International Conference on Machine Learning, pages 1104–1113, 2018a.\n\nWill Dabney, Mark Rowland, Marc G Bellemare, and Rémi Munos. Distributional reinforcement learning with quantile regression. In Thirty-Second AAAI Conference on Artificial Intelligence, 2018b.\n\nChristopher Grimm and Satinder Singh. Learning independently-obtainable reward functions. arXiv preprint arXiv:1901.08649, 2019.\n\nMatteo Hessel, Joseph Modayil, Hado Van Hasselt, Tom Schaul, Georg Ostrovski, Will Dabney, Dan Horgan, Bilal Piot, Mohammad Azar, and David Silver. Rainbow: Combining improvements in deep reinforcement learning. In Thirty-Second AAAI Conference on Artificial Intelligence, 2018.\n\nAdrien Laversanne-Finot, Alexandre Péré, and Pierre-Yves Oudeyer. Curiosity driven exploration of learned disentangled goal spaces. 
arXiv preprint arXiv:1807.01521, 2018.\n\nVolodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A Rusu, Joel Veness, Marc G Bellemare, Alex Graves, Martin Riedmiller, Andreas K Fidjeland, Georg Ostrovski, et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529, 2015.\n\nStuart J Russell and Andrew Zimdars. Q-decomposition for reinforcement learning agents. In Proceedings of the 20th International Conference on Machine Learning (ICML-03), pages 656–663, 2003.\n\nTom Schaul, Daniel Horgan, Karol Gregor, and David Silver. Universal value function approximators. In International Conference on Machine Learning, pages 1312–1320, 2015.\n\nKaren Simonyan, Andrea Vedaldi, and Andrew Zisserman. Deep inside convolutional networks: Visualising image classification models and saliency maps. arXiv preprint arXiv:1312.6034, 2013.\n\nNathan Sprague and Dana Ballard. Multiple-goal reinforcement learning with modular sarsa(0). 2003.\n\nRichard S Sutton, Joseph Modayil, Michael Delp, Thomas Degris, Patrick M Pilarski, Adam White, and Doina Precup. Horde: A scalable real-time architecture for learning knowledge from unsupervised sensorimotor interaction. In The 10th International Conference on Autonomous Agents and Multiagent Systems, pages 761–768, 2011.\n\nValentin Thomas, Jules Pondard, Emmanuel Bengio, Marc Sarfati, Philippe Beaudoin, Marie-Jean Meurs, Joelle Pineau, Doina Precup, and Yoshua Bengio. Independently controllable features. arXiv preprint arXiv:1708.01289, 2017.\n\nHado Van Hasselt, Arthur Guez, and David Silver. Deep reinforcement learning with double q-learning. In Thirtieth AAAI Conference on Artificial Intelligence, 2016.\n\nHarm Van Seijen, Mehdi Fatemi, Joshua Romoff, Romain Laroche, Tavian Barnes, and Jeffrey Tsang. Hybrid reward architecture for reinforcement learning. 
In Advances in Neural Information Processing Systems, pages 5392–5402, 2017.\n", "award": [], "sourceid": 3358, "authors": [{"given_name": "Zichuan", "family_name": "Lin", "institution": "Tsinghua University"}, {"given_name": "Li", "family_name": "Zhao", "institution": "Microsoft Research"}, {"given_name": "Derek", "family_name": "Yang", "institution": "UC San Diego"}, {"given_name": "Tao", "family_name": "Qin", "institution": "Microsoft Research"}, {"given_name": "Tie-Yan", "family_name": "Liu", "institution": "Microsoft Research Asia"}, {"given_name": "Guangwen", "family_name": "Yang", "institution": "Tsinghua University"}]}