{"title": "Fully Parameterized Quantile Function for Distributional Reinforcement Learning", "book": "Advances in Neural Information Processing Systems", "page_first": 6193, "page_last": 6202, "abstract": "Distributional Reinforcement Learning (RL) differs from traditional RL in that, rather than the expectation of total returns, it estimates distributions and has achieved state-of-the-art performance on Atari Games. The key challenge in practical distributional RL algorithms lies in how to parameterize estimated distributions so as to better approximate the true continuous distribution. Existing distributional RL algorithms parameterize either the probability side or the return value side of the distribution function, leaving the other side uniformly fixed as in C51, QR-DQN or randomly sampled as in IQN. In this paper, we propose fully parameterized quantile function that parameterizes both the quantile fraction axis (i.e., the x-axis) and the value axis (i.e., y-axis) for distributional RL. Our algorithm contains a fraction proposal network that generates a discrete set of quantile fractions and a quantile value network that gives corresponding quantile values. The two networks are jointly trained to find the best approximation of the true distribution. 
Experiments on 55 Atari Games show that our algorithm significantly outperforms existing distributional RL algorithms and creates a new record for the Atari Learning Environment for non-distributed agents.", "full_text": "Fully Parameterized Quantile Function for\n\nDistributional Reinforcement Learning\n\nDerek Yang\u2217\nUC San Diego\n\ndyang1206@gmail.com\n\nLi Zhao\n\nMicrosoft Research\n\nlizo@microsoft.com\n\nZichuan Lin\n\nTsinghua University\n\nlinzc16@mails.tsinghua.edu.cn\n\nTao Qin\n\nMicrosoft Research\n\ntaoqin@microsoft.com\n\nJiang Bian\n\nMicrosoft Research\n\njiang.bian@microsoft.com\n\nTieyan Liu\n\nMicrosoft Research\n\ntyliu@microsoft.com\n\nAbstract\n\nDistributional Reinforcement Learning (RL) differs from traditional RL in that,\nrather than the expectation of total returns, it estimates distributions and has\nachieved state-of-the-art performance on Atari Games. The key challenge in\npractical distributional RL algorithms lies in how to parameterize estimated dis-\ntributions so as to better approximate the true continuous distribution. Existing\ndistributional RL algorithms parameterize either the probability side or the return\nvalue side of the distribution function, leaving the other side uniformly \ufb01xed as in\nC51, QR-DQN or randomly sampled as in IQN. In this paper, we propose fully\nparameterized quantile function that parameterizes both the quantile fraction axis\n(i.e., the x-axis) and the value axis (i.e., y-axis) for distributional RL. Our algo-\nrithm contains a fraction proposal network that generates a discrete set of quantile\nfractions and a quantile value network that gives corresponding quantile values.\nThe two networks are jointly trained to \ufb01nd the best approximation of the true\ndistribution. 
Experiments on 55 Atari Games show that our algorithm signi\ufb01cantly\noutperforms existing distributional RL algorithms and creates a new record for the\nAtari Learning Environment for non-distributed agents.\n\n1\n\nIntroduction\n\nDistributional reinforcement learning [Jaquette et al., 1973, Sobel, 1982, White, 1988, Morimura\net al., 2010, Bellemare et al., 2017] differs from value-based reinforcement learning in that, instead\nof focusing only on the expectation of the return, distributional reinforcement learning also takes\nthe intrinsic randomness of returns within the framework into consideration [Bellemare et al., 2017,\nDabney et al., 2018b,a, Rowland et al., 2018]. The randomness comes from both the environment\nitself and agent\u2019s policy. Distributional RL algorithms characterize the total return as random variable\nand estimate the distribution of such random variable, while traditional Q-learning algorithms estimate\nonly the mean (i.e., traditional value function) of such random variable.\nThe main challenge of distributional RL algorithm is how to parameterize and approximate the\ndistribution. In Categorical DQN [Bellemare et al., 2017](C51), the possible returns are limited to\na discrete set of \ufb01xed values, and the probability of each value is learned through interacting with\nenvironments. C51 out-performs all previous variants of DQN on a set of 57 Atari 2600 games in the\nArcade Learning Environment (ALE) [Bellemare et al., 2013]. Another approach for distributional\nreinforcement learning is to estimate the quantile values instead. Dabney et al. 
[2018b] proposed QR-DQN to compute the return quantiles on fixed, uniform quantile fractions using quantile regression, minimizing the quantile Huber loss [Huber, 1964] between the Bellman-updated distribution and the current return distribution. Unlike C51, QR-DQN places no restriction or bound on the return values and achieves significant improvements over C51. However, both C51 and QR-DQN approximate the distribution function or quantile function at fixed locations, either in value or in probability. Dabney et al. [2018a] propose learning the quantile values for sampled quantile fractions rather than fixed ones, using an implicit quantile value network (IQN) that maps quantile fractions to quantile values. With sufficient network capacity and an infinite number of quantiles, IQN is able to approximate the full quantile function.\n\n\u2217Contributed during internship at Microsoft Research.\n\n33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.\n\nHowever, it is impossible to have infinitely many quantiles in practice. With a limited number of quantile fractions, the efficiency and effectiveness of the samples must be reconsidered. The sampling method in IQN mainly helps train the implicit quantile value network rather than approximate the full quantile function, so there is no guarantee that sampled probabilities provide a better quantile function approximation than fixed probabilities.\nIn this work, we extend the methods of Dabney et al. [2018b] and Dabney et al. [2018a] and propose to fully parameterize the quantile function. By full parameterization, we mean that, unlike QR-DQN and IQN, where quantile fractions are fixed or sampled and only the corresponding quantile values are parameterized, both the quantile fractions and the corresponding quantile values are parameterized in our algorithm. 
In addition to a quantile value network similar to IQN that maps quantile fractions to corresponding quantile values, we propose a fraction proposal network that generates quantile fractions for each state-action pair. The fraction proposal network is trained so that, as the true distribution is approximated, the 1-Wasserstein distance between the approximated distribution and the true distribution is minimized. Given the fractions generated by the fraction proposal network, we can learn the quantile value network by quantile regression. With self-adjusting fractions, we can approximate the true distribution better than with fixed or sampled fractions.\nWe begin with related work and background on distributional RL in Section 2. We describe our algorithm in Section 3 and provide experimental results on the ALE environment [Bellemare et al., 2013] in Section 4. Finally, we discuss future extensions of our work and conclude in Section 5.\n\n2 Background and Related Work\n\nWe consider the standard reinforcement learning setting where agent-environment interactions are modeled as a Markov Decision Process $(\\mathcal{X}, \\mathcal{A}, R, P, \\gamma)$ [Puterman, 1994], where $\\mathcal{X}$ and $\\mathcal{A}$ denote the state space and action space, $P$ denotes the transition probability given state and action, $R$ denotes the state- and action-dependent reward function, and $\\gamma \\in (0, 1)$ denotes the reward discount factor.\n\nFor a policy $\\pi$, define the discounted return as the random variable\n\n$$Z^{\\pi}(x, a) = \\sum_{t=0}^{\\infty} \\gamma^{t} R(x_t, a_t),$$\n\nwhere $x_0 = x$, $a_0 = a$, $x_t \\sim P(\\cdot \\mid x_{t-1}, a_{t-1})$ and $a_t \\sim \\pi(\\cdot \\mid x_t)$. The objective in reinforcement learning can be summarized as finding the optimal $\\pi^{*}$ that maximizes the expectation of $Z^{\\pi}$, the action-value function $Q^{\\pi}(x, a) = \\mathbb{E}[Z^{\\pi}(x, a)]$. 
The most common approach is to find the unique fixed point of the Bellman optimality operator $\\mathcal{T}$ [Bellman, 1957]:\n\n$$Q^{*}(x, a) = \\mathcal{T} Q^{*}(x, a) := \\mathbb{E}[R(x, a)] + \\gamma \\mathbb{E}_{P} \\max_{a'} Q^{*}(x', a').$$\n\nTo update $Q$, which is approximated by a neural network in most deep reinforcement learning studies, Q-learning [Watkins, 1989] iteratively trains the network by minimizing the squared temporal difference (TD) error defined by\n\n$$\\delta_t^2 = \\left[ r_t + \\gamma \\max_{a' \\in \\mathcal{A}} Q(x_{t+1}, a') - Q(x_t, a_t) \\right]^2$$\n\nalong the trajectory observed while the agent interacts with the environment following an $\\epsilon$-greedy policy. DQN [Mnih et al., 2015] uses a convolutional neural network to represent $Q$ and achieves human-level play on the Atari-57 benchmark.\n\n2.1 Distributional RL\n\nInstead of a scalar $Q^{\\pi}(x, a)$, distributional RL looks into the intrinsic randomness of $Z^{\\pi}$ by studying its distribution. The distributional Bellman operator for policy evaluation is\n\n$$Z^{\\pi}(x, a) \\stackrel{D}{=} R(x, a) + \\gamma Z^{\\pi}(X', A'),$$\n\nwhere $X' \\sim P(\\cdot \\mid x, a)$ and $A' \\sim \\pi(\\cdot \\mid X')$, and $A \\stackrel{D}{=} B$ denotes that random variables $A$ and $B$ follow the same distribution.\nBoth theory and algorithms have been established for distributional RL. In theory, the distributional Bellman operator for policy evaluation is proved to be a contraction in the $p$-Wasserstein distance [Bellemare et al., 2017]. Bellemare et al. [2017] show that C51 outperforms value-based RL. In addition, Hessel et al. [2018] combined C51 with enhancements such as prioritized experience replay [Schaul et al., 2016], n-step updates [Sutton, 1988], and the dueling architecture [Wang et al., 2016], leading to the Rainbow agent, the current state-of-the-art in Atari-57 for non-distributed agents, while the distributed algorithm proposed by Kapturowski et al. 
[2018] achieves state-of-the-art performance for all agents. From an algorithmic perspective, it is impossible to represent the full space of probability distributions with a finite collection of parameters. Therefore the parameterization of quantile functions is usually the most crucial part of a general distributional RL algorithm. In C51, the true distribution is projected to a categorical distribution [Bellemare et al., 2017] with fixed values for parameterization. QR-DQN fixes probabilities instead of values and parameterizes the quantile values [Dabney et al., 2018b], while IQN randomly samples the probabilities [Dabney et al., 2018a]. We introduce QR-DQN and IQN in Section 2.2 and extend their work to ours.\n\n2.2 Quantile Regression for Distributional RL\n\nIn contrast to C51, which estimates probabilities for $N$ fixed locations in return, QR-DQN [Dabney et al., 2018b] estimates the respective quantile values for $N$ fixed, uniform probabilities. In QR-DQN, the distribution of the random return is approximated by a uniform mixture of $N$ Diracs,\n\n$$Z_{\\theta}(x, a) := \\frac{1}{N} \\sum_{i=1}^{N} \\delta_{\\theta_i(x, a)},$$\n\nwith each $\\theta_i$ assigned a quantile value trained with quantile regression.\nBased on QR-DQN, Dabney et al. [2018a] propose using probabilities sampled from a base distribution, e.g. $\\tau \\sim U([0, 1])$, rather than fixed probabilities. They further learn the quantile function that maps from embeddings of sampled probabilities to the corresponding quantiles, called the implicit quantile value network (IQN). At the time of this writing, IQN achieves the state-of-the-art performance on the Atari-57 benchmark, in human-normalized mean and median, among all agents that do not combine distributed RL, prioritized replay [Schaul et al., 2016] and n-step updates.\nDabney et al. 
[2018a] claimed that, with enough network capacity, IQN is able to approximate the full quantile function with an infinite number of quantile fractions. However, in practice one needs to use a finite number of quantile fractions to estimate action values for decision making, e.g. 32 randomly sampled quantile fractions as in Dabney et al. [2018a]. With limited fractions, a natural question arises: how can we best utilize those fractions to find the closest approximation of the true distribution?\n\n3 Our Algorithm\n\nWe propose the Fully parameterized Quantile Function (FQF) for distributional RL. Our algorithm consists of two networks: the fraction proposal network, which generates a set of quantile fractions for each state-action pair, and the quantile value network, which maps probabilities to quantile values. We first describe the fully parameterized quantile function in Section 3.1, with variables on both the probability axis and the value axis. Then, we show how to train the fraction proposal network in Section 3.2, and how to train the quantile value network with quantile regression in Section 3.3. Finally, we present our algorithm and describe the implementation details in Section 3.4.\n\n3.1 Fully Parameterized Quantile Function\n\nIn FQF, we estimate $N$ adjustable quantile values for $N$ adjustable quantile fractions to approximate the quantile function. The distribution of the return is approximated by a weighted mixture of $N$ Diracs given by\n\n$$Z_{\\theta, \\tau}(x, a) := \\sum_{i=0}^{N-1} (\\tau_{i+1} - \\tau_i) \\delta_{\\theta_i(x, a)}, \\qquad (1)$$\n\nwhere $\\delta_z$ denotes a Dirac at $z \\in \\mathbb{R}$, and $\\tau_1, \\ldots, \\tau_{N-1}$ represent the $N-1$ adjustable fractions satisfying $\\tau_{i-1} < \\tau_i$, with $\\tau_0 = 0$ and $\\tau_N = 1$ to simplify notation. Denote by $F_Z^{-1}$ the quantile function [M\u00fcller, 1997], the inverse of the cumulative distribution function $F_Z(z) = Pr(Z < z)$. By definition we have\n\n$$F_Z^{-1}(p) := \\inf \\{ z \\in \\mathbb{R} : p \\le F_Z(z) \\},$$\n\nwhere $p$ is what we refer to as the quantile fraction.\nBased on the distribution in Eq.(1), denote by $\\Pi^{\\theta, \\tau}$ the projection operator that projects a quantile function onto a staircase function supported by $\\theta$ and $\\tau$. The projected quantile function is given by\n\n$$F_Z^{-1, \\theta, \\tau}(\\omega) = \\Pi^{\\theta, \\tau} F_Z^{-1}(\\omega) = \\theta_0 + \\sum_{i=0}^{N-1} (\\theta_{i+1} - \\theta_i) H_{\\tau_{i+1}}(\\omega),$$\n\nwhere $H$ is the Heaviside step function and $H_{\\tau}(\\omega)$ is short for $H(\\omega - \\tau)$. Figure 1 gives an example of such a projection. For each state-action pair $(x, a)$, we first generate the set of fractions $\\tau$ using the fraction proposal network, and then obtain the quantile values $\\theta$ corresponding to $\\tau$ using the quantile value network.\nTo measure the distortion between the approximated quantile function and the true quantile function, we use the 1-Wasserstein metric given by\n\n$$W_1(Z, \\theta, \\tau) = \\sum_{i=0}^{N-1} \\int_{\\tau_i}^{\\tau_{i+1}} \\left| F_Z^{-1}(\\omega) - \\theta_i \\right| d\\omega. \\qquad (2)$$\n\nUnlike the KL divergence used in C51, which considers only the probabilities of the outcomes, the $p$-Wasserstein metric takes both the probability and the distance between outcomes into consideration. Figure 1 illustrates how different approximations affect the $W_1$ error, and shows an example of $\\Pi_{W_1}$. However, note that in practice Eq.(2) cannot be obtained without bias.\n\nFigure 1: Two approximations of the same quantile function using different sets of $\\tau$ with $N = 6$; the area of the shaded region is equal to the 1-Wasserstein error. (a) Finely-adjusted $\\tau$ with minimized $W_1$ error. 
(b) Randomly chosen $\\tau$ with larger $W_1$ error.\n\n3.2 Training fraction proposal Network\n\nTo achieve minimal 1-Wasserstein error, we start by fixing $\\tau$ and finding the optimal corresponding quantile values $\\theta$. In QR-DQN, Dabney et al. [2018b] give an explicit form of $\\theta$ that achieves this goal. We extend it to our setting:\nLemma 1. [Dabney et al., 2018b] For any $\\tau_1, \\ldots, \\tau_{N-1} \\in [0, 1]$ satisfying $\\tau_{i-1} < \\tau_i$ for all $i$, with $\\tau_0 = 0$ and $\\tau_N = 1$, and cumulative distribution function $F$ with inverse $F^{-1}$, the set of $\\theta$ minimizing Eq.(2) is given by\n\n$$\\theta_i = F_Z^{-1}\\left( \\frac{\\tau_i + \\tau_{i+1}}{2} \\right). \\qquad (3)$$\n\nWe can now substitute Eq.(3) for $\\theta_i$ in Eq.(2) and find the optimality condition for $\\tau$ that minimizes $W_1(Z, \\tau)$. For simplicity, we denote $\\hat{\\tau}_i = \\frac{\\tau_i + \\tau_{i+1}}{2}$.\nProposition 1. For any continuous quantile function $F_Z^{-1}$ that is non-decreasing, define the 1-Wasserstein loss of $F_Z^{-1}$ and $F_Z^{-1, \\tau}$ by\n\n$$W_1(Z, \\tau) = \\sum_{i=0}^{N-1} \\int_{\\tau_i}^{\\tau_{i+1}} \\left| F_Z^{-1}(\\omega) - F_Z^{-1}(\\hat{\\tau}_i) \\right| d\\omega. \\qquad (4)$$\n\nThen $\\frac{\\partial W_1}{\\partial \\tau_i}$ is given by\n\n$$\\frac{\\partial W_1}{\\partial \\tau_i} = 2 F_Z^{-1}(\\tau_i) - F_Z^{-1}(\\hat{\\tau}_i) - F_Z^{-1}(\\hat{\\tau}_{i-1}), \\quad \\forall i \\in (0, N). \\qquad (5)$$\n\nFurthermore, $\\forall \\tau_{i-1}, \\tau_{i+1} \\in [0, 1]$ with $\\tau_{i-1} < \\tau_{i+1}$, $\\exists \\tau_i \\in (\\tau_{i-1}, \\tau_{i+1})$ s.t. $\\frac{\\partial W_1}{\\partial \\tau_i} = 0$.\n\nThe proof of Proposition 1 is given in the appendix. While computing $W_1$ without bias is usually impractical, Eq.(5) provides us with a way to minimize $W_1$ without computing it. 
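As a hedged illustration of Eq.(5), the following Python sketch applies the fraction gradient to a toy quantile function. The function toy_quantile (F^-1(p) = p^2), the helper names, the learning rate and the iteration count are all assumptions made for this example and are not taken from the paper. The helper w1_error integrates Eq.(4) numerically only to check the result, which FQF itself never needs to do.

```python
def toy_quantile(p):
    # Hypothetical quantile function used only for illustration: F^-1(p) = p^2.
    return p * p


def midpoints(taus):
    # tau_hat_i = (tau_i + tau_{i+1}) / 2 for i = 0..N-1.
    return [(taus[i] + taus[i + 1]) / 2.0 for i in range(len(taus) - 1)]


def w1_gradient(taus, f_inv):
    # Eq.(5): dW1/dtau_i = 2 F^-1(tau_i) - F^-1(tau_hat_i) - F^-1(tau_hat_{i-1}),
    # evaluated for the interior fractions i = 1..N-1.
    hats = midpoints(taus)
    return [2.0 * f_inv(taus[i]) - f_inv(hats[i]) - f_inv(hats[i - 1])
            for i in range(1, len(taus) - 1)]


def w1_error(taus, f_inv, steps=2000):
    # Midpoint-rule approximation of Eq.(4); used for checking only.
    total = 0.0
    for i, hat in enumerate(midpoints(taus)):
        width = (taus[i + 1] - taus[i]) / steps
        target = f_inv(hat)
        for k in range(steps):
            omega = taus[i] + (k + 0.5) * width
            total += abs(f_inv(omega) - target) * width
    return total


def descend(taus, f_inv, lr=0.1, iters=200):
    # Plain gradient descent on the interior fractions; tau_0 and tau_N stay fixed.
    taus = list(taus)
    for _ in range(iters):
        for k, g in enumerate(w1_gradient(taus, f_inv)):
            taus[k + 1] -= lr * g
    return taus


uniform = [0.0, 0.25, 0.5, 0.75, 1.0]
adjusted = descend(uniform, toy_quantile)
```

In this toy setting the adjusted fractions stay ordered and give a smaller numerical W1 error than the uniform ones. In FQF the gradient is instead passed back through the fraction proposal network, with the quantile value network standing in for the unknown quantile function.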
Let $w_1$ be the parameters of the fraction proposal network $P$. For an arbitrary quantile function $F_Z^{-1}$, we can minimize $W_1$ by iteratively applying gradient descent to $w_1$ according to Eq.(5), and convergence is guaranteed. As the true quantile function $F_Z^{-1}$ is unknown to us in practice, we use the quantile value network $F_{Z, w_2}^{-1}$ with parameters $w_2$ for the current state and action as the true quantile function.\nThe expected return, also known as the action-value, based on FQF is then given by\n\n$$Q(x, a) = \\sum_{i=0}^{N-1} (\\tau_{i+1} - \\tau_i) F_{Z, w_2}^{-1}(\\hat{\\tau}_i),$$\n\nwhere $\\tau_0 = 0$ and $\\tau_N = 1$.\n\n3.3 Training quantile value network\n\nWith the properly chosen probabilities, we combine quantile regression and the distributional Bellman update on the optimized probabilities to train the quantile function. Let $Z$ be the random variable denoting the action-value at $(x_t, a_t)$ and $Z'$ the action-value random variable at $(x_{t+1}, a_{t+1})$. The weighted temporal difference (TD) error for two probabilities $\\hat{\\tau}_i$ and $\\hat{\\tau}_j$ is defined by\n\n$$\\delta_{ij}^{t} = r_t + \\gamma F_{Z', w_2}^{-1}(\\hat{\\tau}_i) - F_{Z, w_2}^{-1}(\\hat{\\tau}_j). \\qquad (6)$$\n\nQuantile regression is used in QR-DQN and IQN to stochastically adjust the quantile estimates so as to minimize the Wasserstein distance to a target distribution. 
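The action-value formula and the pairwise TD errors of Eq.(6) can be sketched in a few lines of Python. This is an illustrative sketch rather than the authors' implementation: the function names and the plain-list inputs are assumptions, and the quantile values are passed in directly instead of being produced by a quantile value network.

```python
def action_value(taus, quantile_values):
    # Q(x, a) = sum_i (tau_{i+1} - tau_i) * F^-1(tau_hat_i), with tau_0 = 0 and tau_N = 1.
    # quantile_values[i] is assumed to hold F^-1(tau_hat_i), so
    # len(taus) == len(quantile_values) + 1.
    return sum((taus[i + 1] - taus[i]) * quantile_values[i]
               for i in range(len(quantile_values)))


def pairwise_td_errors(reward, gamma, next_quantiles, current_quantiles):
    # Eq.(6): delta[i][j] = r + gamma * F^-1_{Z'}(tau_hat_i) - F^-1_{Z}(tau_hat_j).
    return [[reward + gamma * z_next - z_cur for z_cur in current_quantiles]
            for z_next in next_quantiles]


taus = [0.0, 0.5, 1.0]
q = action_value(taus, [1.0, 3.0])
deltas = pairwise_td_errors(1.0, 0.5, [2.0, 4.0], [1.0, 3.0])
```

Here q evaluates to 2.0, the probability-weighted mean of the two quantile values, and deltas is the N-by-N matrix of TD errors that the quantile regression loss is applied to.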
We follow QR-DQN and IQN, where quantile value networks are trained by minimizing the Huber quantile regression loss [Huber, 1964] with threshold $\\kappa$,\n\n$$\\rho_{\\tau}^{\\kappa}(\\delta_{ij}) = \\left| \\tau - \\mathbb{I}\\{\\delta_{ij} < 0\\} \\right| \\frac{\\mathcal{L}_{\\kappa}(\\delta_{ij})}{\\kappa}, \\quad \\text{with}$$\n\n$$\\mathcal{L}_{\\kappa}(\\delta_{ij}) = \\begin{cases} \\frac{1}{2} \\delta_{ij}^2, & \\text{if } |\\delta_{ij}| \\le \\kappa \\\\ \\kappa \\left( |\\delta_{ij}| - \\frac{1}{2} \\kappa \\right), & \\text{otherwise.} \\end{cases}$$\n\nThe loss of the quantile value network is then given by\n\n$$L(x_t, a_t, r_t, x_{t+1}) = \\frac{1}{N} \\sum_{i=0}^{N-1} \\sum_{j=0}^{N-1} \\rho_{\\hat{\\tau}_j}^{\\kappa}(\\delta_{ij}^{t}). \\qquad (7)$$\n\nNote that $F_Z^{-1}$ and its Bellman target share the same proposed quantile fractions $\\hat{\\tau}$ to reduce computation.\nWe perform a joint gradient update for $w_1$ and $w_2$, as illustrated in Algorithm 1.\n\nAlgorithm 1: FQF update\nParameter: $N$, $\\kappa$\nInput: $x, a, r, x', \\gamma \\in [0, 1)$\n// Compute proposed fractions for $x, a$\n$\\tau \\leftarrow P_{w_1}(x)$;\n// Compute proposed fractions for $x', a'$\nfor $a' \\in \\mathcal{A}$ do\n    $\\tau' \\leftarrow P_{w_1}(x')$;\n    $Q(s', a') \\leftarrow \\sum_{i=0}^{N-1} (\\tau'_{i+1} - \\tau'_i) F_{Z', w_2}^{-1}(\\hat{\\tau}_i)_a$;\nend\n// Compute greedy action\n$a^{*} \\leftarrow \\arg\\max_{a'} Q(s', a')$;\n// Compute $L$\nfor $0 \\le i \\le N - 1$ do\n    for $0 \\le j \\le N - 1$ do\n        $\\delta_{ij} \\leftarrow r + \\gamma F_{Z', w_2}^{-1}(\\hat{\\tau}_i) - F_{Z, w_2}^{-1}(\\hat{\\tau}_j)$\n    end\nend\n$L = \\frac{1}{N} \\sum_{i=0}^{N-1} \\sum_{j=0}^{N-1} \\rho_{\\hat{\\tau}_j}^{\\kappa}(\\delta_{ij})$;\n// Compute $\\frac{\\partial W_1}{\\partial \\tau_i}$ for $i \\in [1, N - 1]$\n$\\frac{\\partial W_1}{\\partial \\tau_i} = 2 F_{Z, w_2}^{-1}(\\tau_i) - F_{Z, w_2}^{-1}(\\hat{\\tau}_i) - F_{Z, w_2}^{-1}(\\hat{\\tau}_{i-1})$;\nUpdate $w_1$ with $\\frac{\\partial W_1}{\\partial \\tau_i}$; Update $w_2$ with $\\nabla L$;\nOutput: $Q$\n\n3.4 Implementation Details\n\nOur fraction proposal network is represented by one fully-connected MLP layer. It takes the state embedding of the original IQN as input and generates the fraction proposal. Recall that in Proposition 1, we require $\\tau_{i-1} < \\tau_i$ and $\\tau_0 = 0$, $\\tau_N = 1$. While it is feasible to keep $\\tau_0 = 0$, $\\tau_N = 1$ fixed and sort the output of $\\tau_{w_1}$, the sort operation would make the network hard to train. A more reasonable and practical way is to let the neural network automatically produce sorted output using a cumulated softmax. Let $q \\in \\mathbb{R}^N$ denote the output of a softmax layer, so that $q_i \\in (0, 1)$ for $i \\in [0, N - 1]$ and $\\sum_{i=0}^{N-1} q_i = 1$. Let $\\tau_i = \\sum_{j=0}^{i-1} q_j$ for $i \\in [0, N]$; then straightforwardly we have $\\tau_i < \\tau_j$ for all $i < j$, and $\\tau_0 = 0$, $\\tau_N = 1$ in our fraction proposal network. Note that as $W_1$ is not computed, we cannot directly perform gradient descent for the fraction proposal network. Instead, we use the grad_ys argument of the TensorFlow operator tf.gradients to assign $\\frac{\\partial W_1}{\\partial \\tau_i}$ to the optimizer. In addition, one can use the entropy of $q$, $H(q) = -\\sum_{i=0}^{N-1} q_i \\log q_i$, as a regularization term to prevent the distribution from degenerating into a deterministic one.\nWe borrow the idea of implicit representations from IQN for our quantile value network. To be specific, we compute the embedding of $\\tau$, denoted by $\\phi(\\tau)$, with\n\n$$\\phi_j(\\tau) := \\mathrm{ReLU}\\left( \\sum_{i=0}^{n-1} \\cos(i \\pi \\tau) w_{ij} + b_j \\right),$$\n\nwhere $w_{ij}$ and $b_j$ are network parameters. We then compute the element-wise (Hadamard) product of the state feature $\\psi(x)$ and the embedding $\\phi(\\tau)$. Letting $\\odot$ denote the element-wise product, the quantile values are given by $F_Z^{-1}(\\tau) \\approx F_{Z, w_2}^{-1}(\\psi(x) \\odot \\phi(\\tau))$.\n\nFigure 2: Performance comparison with IQN. Each training curve is averaged over 3 seeds. The training curves are smoothed with a moving average of 10 to improve readability.\n\nIn IQN, after the set of $\\tau$ is sampled from a uniform distribution, instead of using differences between $\\tau$ as probabilities of the quantiles, the mean of the quantile values is used to compute the action-value $Q$. While in expectation $Q = \\sum_{i=0}^{N-1} (\\tau_{i+1} - \\tau_i) F_Z^{-1}\\left( \\frac{\\tau_i + \\tau_{i+1}}{2} \\right)$ with $\\tau_0 = 0$, $\\tau_N = 1$ and $Q = \\frac{1}{N} \\sum_{i=1}^{N} F_Z^{-1}(\\tau_i)$ are equal, we use the former to be consistent with our projection operation.\n\n4 Experiments\n\nWe test our algorithm on the Atari games from the Arcade Learning Environment (ALE) [Bellemare et al., 2013]. We select the algorithm most closely related to ours, IQN [Dabney et al., 2018a], as the baseline, and compare FQF with QR-DQN [Dabney et al., 2018b], C51 [Bellemare et al., 2017], prioritized experience replay [Schaul et al., 2016] and Rainbow [Hessel et al., 2018], the current state-of-the-art that combines the advantages of several RL algorithms including distributional RL. The baseline algorithm is implemented by Castro et al. [2018] in the Dopamine framework, with slightly lower performance than reported in IQN. We implement FQF based on the Dopamine framework. Unfortunately, we were unable to test our algorithm on Surround and Defender, as Surround is not supported by the Dopamine framework and the scores of Defender are unreliable in Dopamine. Following common practice [Van Hasselt et al., 2016], we use the 30-noop evaluation setting to align with previous works. Results of FQF and IQN using the sticky-action evaluation proposed by Machado et al. [2018] are also provided in the appendix. In all, the algorithms are tested on 55 Atari games.\nOur hyper-parameter setting is aligned with IQN for fair comparison. 
The number of $\\tau$ for FQF is 32. The weights of the fraction proposal network are initialized so that the initial probabilities are uniform as in QR-DQN, and the learning rates are relatively small compared with the quantile value network to keep the probabilities relatively stable while training. We run all agents for 200 million frames. At the training stage, we use $\\epsilon$-greedy with $\\epsilon = 0.01$. For each evaluation stage, we test the agent for 0.125 million frames with $\\epsilon = 0.001$. For each algorithm we run 3 random seeds. All experiments are performed on NVIDIA Tesla V100 16GB graphics cards.\n\n[Figure 2 panels: training curves of IQN and FQF, Return versus Epoch (0 to 200), for Berzerk, Gopher, Kangaroo, ChopperCommand, Centipede, Breakout, Amidar, KungFuMaster and DoubleDunk.]\n\n           Mean    Median  >Human  >DQN\nDQN        221%    79%     24      0\nPRIOR.     580%    124%    39      48\nC51        701%    178%    40      50\nRAINBOW    1213%   227%    42      52\nQR-DQN     902%    193%    41      54\nIQN        1112%   218%    39      54\nFQF        1426%   272%    44      54\n\nTable 1: Mean and median scores across 55 Atari 2600 games, measured as percentages of human baseline. Scores are averages over 3 seeds.\n\nTable 1 compares the mean and median human-normalized scores across 55 Atari games with up to 30 random no-op starts, and the full score table is provided in the Appendix. 
It shows that FQF outperforms all existing distributional RL algorithms, including Rainbow [Hessel et al., 2018], which combines C51 with prioritized replay and n-step updates. We also set a new record for the number of games on which a non-distributed RL agent performs better than human.\nFigure 2 shows the training curves of several Atari games. Even on games where FQF and IQN reach similar performance, such as Centipede, FQF generally learns much faster thanks to self-adjusting fractions.\nHowever, one side effect of the full parameterization in FQF is that the training speed is decreased. With the same settings, FQF is roughly 20% slower than IQN due to the additional fraction proposal network. As the number of $\\tau$ increases, FQF slows down significantly, while IQN's training speed is not sensitive to the number of $\\tau$ samples.\n\n5 Discussion and Conclusions\n\nBased on previous work on distributional RL, we propose a more general and complete approximation of the return distribution. Compared with previous distributional RL algorithms, FQF focuses not only on learning the target, e.g. probabilities for C51 and quantile values for QR-DQN and IQN, but also on which target to learn, i.e. the quantile fractions. This allows FQF to learn a better approximation of the true distribution under restrictions of network capacity. The experimental results show that FQF does achieve significant improvement.\nThere are some open questions that we are not yet able to address in this paper, which we discuss here. First, does the 1-Wasserstein error converge to its minimal value when the quantile function is not fixed? We cannot guarantee convergence of the fraction proposal network in deep neural networks where quantile regression and the Bellman update are involved. Second, though we empirically believe so, does the contraction mapping result for fixed probabilities given by Dabney et al. [2018b] also apply to self-adjusting probabilities? 
Third, while FQF does provide a potentially better distribution approximation with the same number of fractions, how will a better approximated distribution affect the agent's policy, and how will it affect the training process? More generally, how important is quantile fraction selection during training?\nAs for future work, we believe that studying the trained quantile fractions will provide intriguing results, such as how sensitive the quantile fractions are to state and action, and how the quantile fractions evolve in a single run. Also, the combination of distributional RL and DDPG in D4PG [Barth-Maron et al., 2018] showed that distributional RL can also be extended to continuous control settings. Extending our algorithm to continuous settings is another interesting topic. Furthermore, in our algorithm we adopted the concept of selecting the best target to learn. Can this intuition be applied to areas other than RL?\nFinally, we also noticed that most of the games on which we fail to reach human-level performance involve complex rules that require exploration-based policies, such as Montezuma's Revenge and Venture. Integrating distributional RL with exploration will be another potential direction, as in [Tang and Agrawal, 2018]. In general, we believe that our algorithm can be viewed as a natural extension of existing distributional RL algorithms, and that distributional RL may integrate well with other algorithms to reach higher performance.\n\nReferences\nGabriel Barth-Maron, Matthew W Hoffman, David Budden, Will Dabney, Dan Horgan, Alistair Muldal, Nicolas Heess, and Timothy Lillicrap. Distributed distributional deterministic policy gradients. International Conference on Learning Representations, 2018.\n\nMarc G Bellemare, Yavar Naddaf, Joel Veness, and Michael Bowling. The arcade learning environment: An evaluation platform for general agents. 
Journal of Arti\ufb01cial Intelligence Research, 47:\n253\u2013279, 2013.\n\nMarc G Bellemare, Will Dabney, and R\u00e9mi Munos. A distributional perspective on reinforcement\nlearning. In Proceedings of the 34th International Conference on Machine Learning-Volume 70,\npages 449\u2013458. JMLR. org, 2017.\n\nRichard Bellman. Dynamic Programming. Princeton University Press, Princeton, NJ, USA, 1 edition,\n\n1957.\n\nPablo Samuel Castro, Subhodeep Moitra, Carles Gelada, Saurabh Kumar, and Marc G. Bellemare.\nDopamine: A Research Framework for Deep Reinforcement Learning. 2018. URL http:\n//arxiv.org/abs/1812.06110.\n\nWill Dabney, Georg Ostrovski, David Silver, and Remi Munos. Implicit quantile networks for\ndistributional reinforcement learning. In International Conference on Machine Learning, pages\n1104\u20131113, 2018a.\n\nWill Dabney, Mark Rowland, Marc G Bellemare, and R\u00e9mi Munos. Distributional reinforcement\nlearning with quantile regression. In Thirty-Second AAAI Conference on Arti\ufb01cial Intelligence,\n2018b.\n\nMatteo Hessel, Joseph Modayil, Hado Van Hasselt, Tom Schaul, Georg Ostrovski, Will Dabney, Dan\nHorgan, Bilal Piot, Mohammad Azar, and David Silver. Rainbow: Combining improvements in\ndeep reinforcement learning. In Thirty-Second AAAI Conference on Arti\ufb01cial Intelligence, 2018.\n\nPeter J. Huber. Robust estimation of a location parameter. Annals of Mathematical Statistics, 35(1):\n\n73\u2013101, March 1964. ISSN 0003-4851. doi: 10.1214/aoms/1177703732.\n\nStratton C Jaquette et al. Markov decision processes with a new optimality criterion: Discrete time.\n\nThe Annals of Statistics, 1(3):496\u2013505, 1973.\n\nSteven Kapturowski, Georg Ostrovski, John Quan, Remi Munos, and Will Dabney. Recurrent\n\nexperience replay in distributed reinforcement learning. 2018.\n\nMarlos C Machado, Marc G Bellemare, Erik Talvitie, Joel Veness, Matthew Hausknecht, and Michael\nBowling. 
Revisiting the arcade learning environment: Evaluation protocols and open problems for\ngeneral agents. Journal of Artificial Intelligence Research, 61:523\u2013562, 2018.\n\nVolodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A Rusu, Joel Veness, Marc G Bellemare,\nAlex Graves, Martin Riedmiller, Andreas K Fidjeland, Georg Ostrovski, et al. Human-level control\nthrough deep reinforcement learning. Nature, 518(7540):529, 2015.\n\nTetsuro Morimura, Masashi Sugiyama, Hisashi Kashima, Hirotaka Hachiya, and Toshiyuki Tanaka.\nNonparametric return distribution approximation for reinforcement learning. In Proceedings of the\n27th International Conference on Machine Learning (ICML-10), pages 799\u2013806, 2010.\n\nAlfred M\u00fcller. Integral probability metrics and their generating classes of functions. Advances in\nApplied Probability, 29(2):429\u2013443, 1997.\n\nMartin L. Puterman. Markov Decision Processes: Discrete Stochastic Dynamic Programming. John\nWiley & Sons, Inc., New York, NY, USA, 1st edition, 1994. ISBN 0471619779.\n\nMark Rowland, Marc Bellemare, Will Dabney, R\u00e9mi Munos, and Yee Whye Teh. An analysis\nof categorical distributional reinforcement learning. In International Conference on Artificial\nIntelligence and Statistics, pages 29\u201337, 2018.\n\nTom Schaul, John Quan, Ioannis Antonoglou, and David Silver. Prioritized experience replay.\nInternational Conference on Learning Representations, abs/1511.05952, 2016.\n\nMatthew J Sobel. The variance of discounted markov decision processes. Journal of Applied\nProbability, 19(4):794\u2013802, 1982.\n\nRichard S Sutton. Learning to predict by the methods of temporal differences. Machine learning,\n3(1):9\u201344, 1988.\n\nYunhao Tang and Shipra Agrawal. Exploration by distributional reinforcement learning. In Proceedings\nof the 27th International Joint Conference on Artificial Intelligence, pages 2710\u20132716. 
AAAI\nPress, 2018.\n\nHado Van Hasselt, Arthur Guez, and David Silver. Deep reinforcement learning with double\nq-learning. In Thirtieth AAAI Conference on Artificial Intelligence, 2016.\n\nZiyu Wang, Tom Schaul, Matteo Hessel, Hado Hasselt, Marc Lanctot, and Nando Freitas. Dueling\nnetwork architectures for deep reinforcement learning. In International Conference on Machine\nLearning, pages 1995\u20132003, 2016.\n\nChristopher John Cornish Hellaby Watkins. Learning from delayed rewards. 1989.\n\nDJ White. Mean, variance, and probabilistic criteria in finite markov decision processes: a review.\nJournal of Optimization Theory and Applications, 56(1):1\u201329, 1988.", "award": [], "sourceid": 3340, "authors": [{"given_name": "Derek", "family_name": "Yang", "institution": "UC San Diego"}, {"given_name": "Li", "family_name": "Zhao", "institution": "Microsoft Research"}, {"given_name": "Zichuan", "family_name": "Lin", "institution": "Tsinghua University"}, {"given_name": "Tao", "family_name": "Qin", "institution": "Microsoft Research"}, {"given_name": "Jiang", "family_name": "Bian", "institution": "Microsoft"}, {"given_name": "Tie-Yan", "family_name": "Liu", "institution": "Microsoft Research Asia"}]}