{"title": "Learning values across many orders of magnitude", "book": "Advances in Neural Information Processing Systems", "page_first": 4287, "page_last": 4295, "abstract": "Most learning algorithms are not invariant to the scale of the signal that is being approximated. We propose to adaptively normalize the targets used in the learning updates.  This is important in value-based reinforcement learning, where the magnitude of appropriate value approximations can change over time when we update the policy of behavior. Our main motivation is prior work on learning to play Atari games, where the rewards were clipped to a predetermined range. This clipping facilitates learning across many different games with a single learning algorithm, but a clipped reward function can result in qualitatively different behavior. Using adaptive normalization we can remove this domain-specific heuristic without diminishing overall performance.", "full_text": "Learning values across many orders of magnitude\n\nHado van Hasselt  Arthur Guez  Matteo Hessel  Volodymyr Mnih  David Silver\nGoogle DeepMind\n\nAbstract\n\nMost learning algorithms are not invariant to the scale of the signal that is being approximated. We propose to adaptively normalize the targets used in the learning updates. This is important in value-based reinforcement learning, where the magnitude of appropriate value approximations can change over time when we update the policy of behavior. Our main motivation is prior work on learning to play Atari games, where the rewards were clipped to a predetermined range. This clipping facilitates learning across many different games with a single learning algorithm, but a clipped reward function can result in qualitatively different behavior. 
Using adaptive normalization we can remove this domain-specific heuristic without diminishing overall performance.\n\n1 Introduction\n\nMany machine-learning algorithms rely on a-priori access to data to properly tune relevant hyper-parameters [Bergstra et al., 2011, Bergstra and Bengio, 2012, Snoek et al., 2012]. It is much harder to learn efficiently from a stream of data when we do not know the magnitude of the function we seek to approximate beforehand, or if these magnitudes can change over time, as is typically the case in reinforcement learning when the policy of behavior improves over time.\nOur main motivation is the work by Mnih et al. [2015], in which Q-learning [Watkins, 1989] is combined with a deep convolutional neural network [cf. LeCun et al., 2015]. The resulting deep Q network (DQN) algorithm learned to play a varied set of Atari 2600 games from the Arcade Learning Environment (ALE) [Bellemare et al., 2013], which was proposed as an evaluation framework to test general learning algorithms on solving many different interesting tasks. DQN was proposed as a singular solution, using a single set of hyperparameters.\nThe magnitudes and frequencies of rewards vary wildly between different games. For instance, in Pong the rewards are bounded by −1 and +1 while in Ms. Pac-Man eating a single ghost can yield a reward of up to +1600. To overcome this hurdle, rewards and temporal-difference errors were clipped to [−1, 1], so that DQN would perceive any positive reward as +1, and any negative reward as −1. This is not a satisfying solution for two reasons. First, the clipping introduces domain knowledge. Most games have sparse non-zero rewards. Clipping results in optimizing the frequency of rewards, rather than their sum. This is a fairly reasonable heuristic in Atari, but it does not generalize to many other domains. 
Second, and more importantly, the clipping changes the objective, sometimes resulting in qualitatively different policies of behavior.\nWe propose a method to adaptively normalize the targets used in the learning updates. If these targets are guaranteed to be normalized it is much easier to find suitable hyperparameters. The proposed technique is not specific to DQN or to reinforcement learning and is more generally applicable in supervised learning and reinforcement learning. There are several reasons such normalization can be desirable. First, sometimes we desire a single system that is able to solve multiple different problems with varying natural magnitudes, as in the Atari domain. Second, for multi-variate functions the normalization can be used to disentangle the natural magnitude of each component from its relative importance in the loss function. This is particularly useful when the components have different units, such as when we predict signals from sensors with different modalities. Finally, adaptive normalization can help deal with non-stationarity. For instance, in reinforcement learning the policy of behavior can change repeatedly during learning, thereby changing the distribution and magnitude of the values.\n\n30th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain.\n\n1.1 Related work\n\nInput normalization has long been recognized as important to efficiently learn non-linear approximations such as neural networks [LeCun et al., 1998], leading to research on how to achieve scale-invariance on the inputs [e.g., Ross et al., 2013, Ioffe and Szegedy, 2015, Desjardins et al., 2015]. Output or target normalization has not received as much attention, probably because in supervised learning data sets are commonly available before learning commences, making it straightforward to determine appropriate normalizations or to tune hyper-parameters. 
However, this assumes the data is available a priori, which is not true in online (potentially non-stationary) settings.\nNatural gradients [Amari, 1998] are invariant to reparameterizations of the function approximation, thereby avoiding many scaling issues, but these are computationally expensive for functions with many parameters such as deep neural networks. This is why approximations are regularly proposed, typically trading off accuracy against computation [Martens and Grosse, 2015], and sometimes focusing on a certain aspect such as input normalization [Desjardins et al., 2015, Ioffe and Szegedy, 2015]. Most such algorithms are not fully invariant to the scale of the target function.\nIn the Atari domain several algorithmic variants and improvements for DQN have been proposed [van Hasselt et al., 2016, Bellemare et al., 2016, Schaul et al., 2016, Wang et al., 2016], as well as alternative solutions [Liang et al., 2016, Mnih et al., 2016]. However, none of these address the clipping of the rewards or explicitly discuss the impact of clipping on performance or behavior.\n\n1.2 Preliminaries\n\nConcretely, we consider learning from a stream of data {(X_t, Y_t)}_{t=1}^∞ where the inputs X_t ∈ ℝ^n and targets Y_t ∈ ℝ^k are real-valued tensors. The aim is to update parameters θ of a function f_θ : ℝ^n → ℝ^k such that the output f_θ(X_t) is (in expectation) close to the target Y_t according to some loss l_t(f_θ), for instance defined as a squared difference: l_t(f_θ) = ½ (f_θ(X_t) − Y_t)^⊤ (f_θ(X_t) − Y_t). A canonical update is stochastic gradient descent (SGD). For a sample (X_t, Y_t), the update is then θ_{t+1} = θ_t − α ∇_θ l_t(f_θ), where α ∈ [0, 1] is a step size. 
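As a toy illustration (ours, not from the paper), consider a hypothetical linear f_θ(x) = θ·x. The SGD step on the squared loss then scales linearly with the magnitude of the target, which is why a step size tuned for one target scale can be far off for another:

```python
import numpy as np

def sgd_step(theta, x, y, alpha):
    # One gradient step on l(theta) = 0.5 * (theta @ x - y)**2
    err = theta @ x - y
    return theta - alpha * err * x

theta = np.zeros(3)
x = np.array([1.0, 2.0, 3.0])
small = sgd_step(theta, x, y=1.0, alpha=0.1)      # target of magnitude 1
large = sgd_step(theta, x, y=1000.0, alpha=0.1)   # same input, 1000x larger target
# The parameter change is 1000x larger as well.
assert np.isclose(np.linalg.norm(large), 1000.0 * np.linalg.norm(small))
```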
The magnitude of this update depends on both the step size and the loss, and it is hard to pick suitable step sizes when nothing is known about the magnitude of the loss.\nAn important special case is when f_θ is a neural network [McCulloch and Pitts, 1943, Rosenblatt, 1962]. Neural networks are often trained with a form of SGD [Rumelhart et al., 1986], with hyperparameters that interact with the scale of the loss. Especially for deep neural networks [LeCun et al., 2015, Schmidhuber, 2015] large updates may harm learning, because these networks are highly non-linear and such updates may 'bump' the parameters to regions with high error.\n\n2 Adaptive normalization with Pop-Art\n\nWe propose to normalize the targets Y_t, where the normalization is learned separately from the approximating function. We consider an affine transformation of the targets\n\nỸ_t = Σ_t^{-1} (Y_t − µ_t) ,   (1)\n\nwhere Σ_t and µ_t are scale and shift parameters that are learned from data. The scale matrix Σ_t can be dense, diagonal, or defined by a scalar σ_t as Σ_t = σ_t I. Similarly, the shift vector µ_t can contain separate components, or be defined by a scalar µ_t as µ_t = µ_t 1. We can then define a loss on a normalized function g(X_t) and the normalized target Ỹ_t. The unnormalized approximation for any input x is then given by f(x) = Σ g(x) + µ, where g is the normalized function and f is the unnormalized function.\nAt first glance it may seem we have made little progress. 
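In the scalar case (Σ_t = σ_t I, µ_t = µ_t 1), the transformation of Eq. (1) and its inverse can be sketched as follows; this is a minimal sketch with helper names of our choosing:

```python
def normalize_target(y, mu, sigma):
    # Normalized target of Eq. (1), scalar case: (y - mu) / sigma
    return (y - mu) / sigma

def unnormalized_output(g_x, mu, sigma):
    # f(x) = sigma * g(x) + mu maps a normalized prediction back to target scale
    return sigma * g_x + mu

# Round trip: un-normalizing a normalized target recovers the target.
y, mu, sigma = 65535.0, 10.0, 200.0
assert abs(unnormalized_output(normalize_target(y, mu, sigma), mu, sigma) - y) < 1e-9
```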
If we learn Σ and µ using the same algorithm as used for the parameters of the function g, then the problem has not become fundamentally different or easier; we would merely have changed the structure of the parameterized function slightly. Conversely, if we consider tuning the scale and shift as hyperparameters then tuning them is not fundamentally easier than tuning other hyperparameters, such as the step size, directly.\nFortunately, there is an alternative. We propose to update Σ and µ according to a separate objective with the aim of normalizing the updates for g. Thereby, we decompose the problem of learning an appropriate normalization from learning the specific shape of the function. The two properties that we want to simultaneously achieve are\n(ART) to update scale Σ and shift µ such that Σ^{-1}(Y − µ) is appropriately normalized, and\n(POP) to preserve the outputs of the unnormalized function when we change the scale and shift.\nWe discuss these properties separately below. We refer to algorithms that combine output-preserving updates and adaptive rescaling as Pop-Art algorithms, an acronym for \u201cPreserving Outputs Precisely, while Adaptively Rescaling Targets\u201d.\n\n2.1 Preserving outputs precisely\n\nUnless care is taken, repeated updates to the normalization might make learning harder rather than easier because the normalized targets become non-stationary. More importantly, whenever we adapt the normalization based on a certain target, this would simultaneously change the output of the unnormalized function for all inputs. 
If there is little reason to believe that other unnormalized outputs were incorrect, this is undesirable and may hurt performance in practice, as illustrated in Section 3. We now first discuss how to prevent these issues, before we discuss how to update the scale and shift.\nThe only way to avoid changing all outputs of the unnormalized function whenever we update the scale and shift is by changing the normalized function g itself simultaneously. The goal is to preserve the outputs from before the change of normalization, for all inputs. This prevents the normalization from affecting the approximation, which is appropriate because its objective is solely to make learning easier, and to leave solving the approximation itself to the optimization algorithm.\nWithout loss of generality the unnormalized function can be written as\n\nf_{θ,Σ,µ,W,b}(x) ≡ Σ g_{θ,W,b}(x) + µ ≡ Σ (W h_θ(x) + b) + µ ,   (2)\n\nwhere h_θ is a parametrized (non-linear) function, and g_{θ,W,b}(x) = W h_θ(x) + b is the normalized function. It is not uncommon for deep neural networks to end in a linear layer, and then h_θ can be the output of the last (hidden) layer of non-linearities. Alternatively, we can always add a square linear layer to any non-linear function h_θ to ensure this constraint, for instance initialized as W_0 = I and b_0 = 0.\nThe following proposition shows that we can update the parameters W and b to fulfill the second desideratum of preserving outputs precisely for any change in normalization.\nProposition 1. Consider a function f : ℝ^n → ℝ^k defined as in (2) as\n\nf_{θ,Σ,µ,W,b}(x) ≡ Σ (W h_θ(x) + b) + µ ,\n\nwhere h_θ : ℝ^n → ℝ^m is any non-linear function of x ∈ ℝ^n, Σ is a k × k matrix, µ and b are k-element vectors, and W is a k × m matrix. 
Consider any change of the scale and shift parameters from Σ to Σ_new and from µ to µ_new, where Σ_new is non-singular. If we then additionally change the parameters W and b to W_new and b_new, defined by\n\nW_new = Σ_new^{-1} Σ W   and   b_new = Σ_new^{-1} (Σ b + µ − µ_new) ,\n\nthen the outputs of the unnormalized function f are preserved precisely in the sense that\n\nf_{θ,Σ,µ,W,b}(x) = f_{θ,Σ_new,µ_new,W_new,b_new}(x) ,   ∀x .\n\nThis and later propositions are proven in the appendix. For the special case of scalar scale and shift, with Σ ≡ σI and µ ≡ µ1, the updates to W and b become W_new = (σ/σ_new) W and b_new = (σb + µ − µ_new)/σ_new. After updating the scale and shift we can update the output of the normalized function g_{θ,W,b}(X_t) toward the normalized output Ỹ_t, using any learning algorithm. Importantly, the normalization can be updated first, thereby avoiding harmful large updates just before they would otherwise occur. This observation is made more precise in Proposition 2 in Section 2.2.\n\nAlgorithm 1 SGD on squared loss with Pop-Art\n\nFor a given differentiable function h_θ, initialize θ.\nInitialize W = I, b = 0, Σ = I, and µ = 0.\nwhile learning do\n  Observe input X and target Y\n  Use Y to compute new scale Σ_new and new shift µ_new\n  W ← Σ_new^{-1} Σ W ,  b ← Σ_new^{-1} (Σ b + µ − µ_new)   (rescale W and b)\n  Σ ← Σ_new ,  µ ← µ_new   (update scale and shift)\n  h ← h_θ(X)   (store output of h_θ)\n  J ← (∇_θ h_{θ,1}(X), . . . , ∇_θ h_{θ,m}(X))   (compute Jacobian of h_θ)\n  δ ← W h + b − Σ^{-1}(Y − µ)   (compute normalized error)\n  θ ← θ − α J W^⊤ δ   (compute SGD update for θ)\n  W ← W − α δ h^⊤   (compute SGD update for W)\n  b ← b − α δ   (compute SGD update for b)\nend while\n\nAlgorithm 1 is an example implementation of SGD with Pop-Art for a squared loss. 
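As a numerical sketch (function and variable names are ours, not the paper's), the scalar-case rescaling of Proposition 1 can be checked to leave the unnormalized outputs unchanged:

```python
import numpy as np

def popart_rescale(W, b, sigma, mu, sigma_new, mu_new):
    # Scalar-case output-preserving rescale of Proposition 1:
    #   W_new = (sigma / sigma_new) W,  b_new = (sigma * b + mu - mu_new) / sigma_new
    W_new = (sigma / sigma_new) * W
    b_new = (sigma * b + mu - mu_new) / sigma_new
    return W_new, b_new

rng = np.random.default_rng(0)
W, b = rng.normal(size=(1, 4)), rng.normal(size=1)
h = rng.normal(size=4)                      # stand-in for the output of h_theta(x)
sigma, mu = 3.0, -2.0
sigma_new, mu_new = 50.0, 7.0               # an arbitrary new normalization
f_old = sigma * (W @ h + b) + mu
W2, b2 = popart_rescale(W, b, sigma, mu, sigma_new, mu_new)
f_new = sigma_new * (W2 @ h + b2) + mu_new
assert np.allclose(f_old, f_new)            # outputs preserved precisely
```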
It can be generalized easily to any other loss by changing the definition of δ. Notice that W and b are updated twice: first to adapt to the new scale and shift to preserve the outputs of the function, and then by SGD. The order of these updates is important because it allows us to use the new normalization immediately in the subsequent SGD update.\n\n2.2 Adaptively rescaling targets\n\nA natural choice is to normalize the targets to approximately have zero mean and unit variance. For clarity and conciseness, we consider scalar normalizations. It is straightforward to extend to diagonal or dense matrices. If we have data {(X_i, Y_i)}_{i=1}^t up to some time t, we then may desire\n\n(1/t) ∑_{i=1}^t (Y_i − µ_t)/σ_t = 0   and   (1/t) ∑_{i=1}^t (Y_i − µ_t)^2/σ_t^2 = 1 ,\n\nsuch that\n\nµ_t = (1/t) ∑_{i=1}^t Y_i   and   σ_t^2 = (1/t) ∑_{i=1}^t Y_i^2 − µ_t^2 .   (3)\n\nThis can be generalized to incremental updates\n\nµ_t = (1 − β_t) µ_{t−1} + β_t Y_t   and   σ_t^2 = ν_t − µ_t^2 ,  where  ν_t = (1 − β_t) ν_{t−1} + β_t Y_t^2 .   (4)\n\nHere ν_t estimates the second moment of the targets and β_t ∈ [0, 1] is a step size. If ν_t − µ_t^2 is positive initially then it will always remain so, although to avoid issues with numerical precision it can be useful to enforce a lower bound explicitly by requiring ν_t − µ_t^2 ≥ ε with ε > 0. For full equivalence to (3) we can use β_t = 1/t. If β_t = β is constant we get exponential moving averages, placing more weight on recent data points, which is appropriate in non-stationary settings.\nA constant β has the additional benefit of never becoming negligibly small. Consider the first time a target is observed that is much larger than all previously observed targets. If β_t is small, our statistics would adapt only slightly, and the resulting update may be large enough to harm the learning. 
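The incremental updates of Eq. (4), with an explicit lower bound on the variance as suggested above, can be sketched as follows (a minimal sketch; the function name and the value of ε are our choices):

```python
def update_stats(mu, nu, y, beta, eps=1e-4):
    # One step of Eq. (4):
    #   mu_t = (1 - beta) * mu + beta * y
    #   nu_t = (1 - beta) * nu + beta * y^2
    # with sigma_t^2 = nu_t - mu_t^2 floored at eps for numerical safety.
    mu = (1.0 - beta) * mu + beta * y
    nu = (1.0 - beta) * nu + beta * y * y
    var = max(nu - mu * mu, eps)
    return mu, nu, var ** 0.5

# With beta_t = 1/t the estimates match the batch mean/variance of Eq. (3).
mu, nu = 0.0, 0.0
ys = [1.0, 2.0, 3.0, 4.0]
for t, y in enumerate(ys, start=1):
    mu, nu, sigma = update_stats(mu, nu, y, beta=1.0 / t)
assert abs(mu - 2.5) < 1e-12                       # batch mean of ys
assert abs(sigma**2 - 1.25) < 1e-12                # batch variance of ys
```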
If β_t is not too small, the normalization can adapt to the large target before updating, potentially making learning more robust. In particular, the following proposition holds.\nProposition 2. When using updates (4) to adapt the normalization parameters σ and µ, the normalized targets are bounded for all t by\n\n−√((1 − β_t)/β_t) ≤ (Y_t − µ_t)/σ_t ≤ √((1 − β_t)/β_t) .\n\nFor instance, if β_t = β = 10^{-4} for all t, then the normalized target is guaranteed to be in (−100, 100). Note that Proposition 2 does not rely on any assumptions about the distribution of the targets. This is an important result, because it implies we can bound the potential normalized errors before learning, without any prior knowledge about the actual targets we may observe.\n\nAlgorithm 2 Normalized SGD\n\nFor a given differentiable function h_θ, initialize θ.\nwhile learning do\n  Observe input X and target Y\n  Use Y to compute new scale Σ\n  h ← h_θ(X)   (store output of h_θ)\n  J ← (∇_θ h_{θ,1}(X), . . . , ∇_θ h_{θ,m}(X))   (compute Jacobian of h_θ)\n  δ ← W h + b − Y   (compute unnormalized error)\n  θ ← θ − α J (Σ^{-1} W)^⊤ Σ^{-1} δ   (update θ with scaled SGD)\n  W ← W − α δ h^⊤   (update W with SGD)\n  b ← b − α δ   (update b with SGD)\nend while\n\nIt is an open question whether it is uniformly best to normalize by mean and variance. In the appendix we discuss other normalization updates, based on percentiles and mini-batches, and derive correspondences between all of these.\n\n2.3 An equivalence for stochastic gradient descent\n\nWe now step back and analyze the effect of the magnitude of the errors on the gradients when using regular SGD. This analysis suggests a different normalization algorithm, which has an interesting correspondence to Pop-Art SGD.\nWe consider SGD updates for an unnormalized multi-layer function of the form f_{θ,W,b}(X) = W h_θ(X) + b. 
The update for the weight matrix W is\n\nW_t = W_{t−1} − α_t δ_t h_{θ_t}(X_t)^⊤ ,\n\nwhere δ_t = f_{θ,W,b}(X_t) − Y_t is the gradient of the squared loss, which we here call the unnormalized error. The magnitude of this update depends linearly on the magnitude of the error, which is appropriate when the inputs are normalized, because then the ideal scale of the weights depends linearly on the magnitude of the targets.1\nNow consider the SGD update to the parameters of h_θ, θ_t = θ_{t−1} − α J_t W_{t−1}^⊤ δ_t, where J_t = (∇_θ h_{θ,1}(X), . . . , ∇_θ h_{θ,m}(X)) is the Jacobian for h_θ. The magnitudes of both the weights W and the errors δ depend linearly on the magnitude of the targets. This means that the magnitude of the update for θ depends quadratically on the magnitude of the targets. There is no compelling reason for these updates to depend at all on these magnitudes because the weights in the top layer already ensure appropriate scaling. In other words, for each doubling of the magnitudes of the targets, the updates to the lower layers quadruple for no clear reason.\nThis analysis suggests an algorithmic solution, which seems to be novel in and of itself, in which we track the magnitudes of the targets in a separate parameter σ_t, and then multiply the updates for all lower layers by a factor σ_t^{-2}. A more general version of this for matrix scalings is given in Algorithm 2. We prove an interesting, and perhaps surprising, connection to the Pop-Art algorithm.\nProposition 3. Consider two functions defined by\n\nf_{θ,Σ,µ,W,b}(x) = Σ (W h_θ(x) + b) + µ   and   f_{θ,W,b}(x) = W h_θ(x) + b ,\n\nwhere h_θ is the same differentiable function in both cases, and the functions are initialized identically, using Σ_0 = I and µ_0 = 0, and the same initial θ_0, W_0 and b_0. Consider updating the first function using Algorithm 1 (Pop-Art SGD) and the second using Algorithm 2 (Normalized SGD). 
Then, for any sequence of non-singular scales {Σ_t}_{t=1}^∞ and shifts {µ_t}_{t=1}^∞, the algorithms are equivalent in the sense that 1) the sequences {θ_t}_{t=0}^∞ are identical, and 2) the outputs of the functions are identical, for any input.\n\nThe proposition shows a duality between normalizing the targets, as in Algorithm 1, and changing the updates, as in Algorithm 2. This allows us to gain more intuition about the algorithm. In particular, in Algorithm 2 the updates in the top layer are not normalized, thereby allowing the last linear layer to adapt to the scale of the targets. This is in contrast to other algorithms that have some flavor of adaptive normalization, such as RMSprop [Tieleman and Hinton, 2012], AdaGrad [Duchi et al., 2011], and Adam [Kingma and Ba, 2015], which normalize each component of the gradient by a square root of an empirical second moment of that component. That said, these methods are complementary, and it is straightforward to combine Pop-Art with optimization algorithms other than SGD.\n\n1 In general care should be taken that the inputs are well-behaved; this is exactly the point of recent work on input normalization [Ioffe and Szegedy, 2015, Desjardins et al., 2015].\n\nFig. 1a. Median RMSE on binary regression for SGD without normalization (red), with normalization but without preserving outputs (blue, labeled 'Art'), and with Pop-Art (green). Shaded 10\u201390 percentiles.\n\nFig. 1b. ℓ2 gradient norms for DQN during learning on 57 Atari games with actual unclipped rewards (left, red), clipped rewards (middle, blue), and using Pop-Art (right, green) instead of clipping. Shaded areas correspond to 95%, 90% and 50% of games.\n\n3 Binary regression experiments\n\nWe first analyze the effect of rare events in online learning, when infrequently a much larger target is observed. 
Such events can for instance occur when learning from noisy sensors that sometimes capture an actual signal, or when learning from sparse non-zero reinforcements. We empirically compare three variants of SGD: without normalization, with normalization but without preserving outputs precisely (i.e., with 'Art', but without 'Pop'), and with Pop-Art.\nThe inputs are binary representations of integers drawn uniformly randomly between 0 and n = 2^{10} − 1. The desired outputs are the corresponding integer values. Every 1000 samples, we present the binary representation of 2^{16} − 1 as input (i.e., all 16 inputs are 1) and as target 2^{16} − 1 = 65,535. The approximating function is a fully connected neural network with 16 inputs, 3 hidden layers with 10 nodes per layer, and tanh internal activation functions. This simple setup allows extensive sweeps over hyper-parameters, to avoid bias towards any algorithm by the way we tune these. The step sizes α for SGD and β for the normalization are tuned by a grid search over {10^{-5}, 10^{-4.5}, . . . , 10^{-1}, 10^{-0.5}, 1}.\nFigure 1a shows the root mean squared error (RMSE, log scale) for each of 5000 samples, before updating the function (so this is a test error, not a train error). The solid line is the median of 50 repetitions, and the shaded region covers the 10th to 90th percentiles. The plotted results correspond to the best hyper-parameters according to the overall RMSE (i.e., area under the curve). The lines are slightly smoothed by averaging over each 10 consecutive samples.\nSGD favors a relatively small step size (α = 10^{-3.5}) to avoid harmful large updates, but this slows learning on the smaller updates; the error curve is almost flat in between spikes. 
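The data-generating process described above can be sketched as follows; this is our sketch of the setup, and the seeding and helper names are assumptions:

```python
import random

def target_stream(n_bits=16, n=2**10 - 1, rare_every=1000, seed=0):
    # Yield (binary-input, target) pairs: uniform integers in [0, n], except that
    # every `rare_every`-th sample is the rare event 2**16 - 1 (all bits set).
    rng = random.Random(seed)
    t = 0
    while True:
        t += 1
        v = 2**n_bits - 1 if t % rare_every == 0 else rng.randint(0, n)
        bits = [(v >> i) & 1 for i in range(n_bits)]   # 16-bit binary input
        yield bits, float(v)

stream = target_stream()
xs = [next(stream) for _ in range(1000)]
assert xs[-1] == ([1] * 16, 65535.0)            # the 1000th sample is the rare event
assert all(y <= 2**10 - 1 for _, y in xs[:-1])  # all other targets are at most n
```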
SGD with adaptive normalization (labeled 'Art') can use a larger step size (α = 10^{-2.5}) and therefore learns faster, but has high error after the spikes because the changing normalization also changes the outputs for the smaller inputs, increasing the errors on these. In comparison, Pop-Art performs much better. It prefers the same step size as Art (α = 10^{-2.5}), but Pop-Art can exploit a much faster rate for the statistics (best performance with β = 10^{-0.5} for Pop-Art and β = 10^{-4} for Art). The faster tracking of statistics protects Pop-Art from the large spikes, while the output preservation avoids invalidating the outputs for smaller targets. We ran experiments with RMSprop but left these out of the figure as the results were very similar to SGD.\n\n4 Atari 2600 experiments\n\nAn important motivation for this work is reinforcement learning with non-linear function approximators such as neural networks (sometimes called deep reinforcement learning). The goal is to predict and optimize action values defined as the expected sum of future rewards. These rewards can differ arbitrarily from one domain to the next, and non-zero rewards can be sparse. As a result, the action values can span a varied and wide range which is often unknown before learning commences.\nMnih et al. [2015] combined Q-learning with a deep neural network in an algorithm called DQN, which impressively learned to play many games using a single set of hyper-parameters. However, as discussed above, to handle the different reward magnitudes with a single system all rewards were clipped to the interval [−1, 1]. 
This is harmless in some games, such as Pong where no reward is ever higher than 1 or lower than −1, but it is not satisfactory as this heuristic introduces specific domain knowledge: that optimizing reward frequencies is approximately as useful as optimizing the total score. However, the clipping makes the DQN algorithm blind to differences between certain actions, such as the difference in reward between eating a ghost (reward ≥ 100) and eating a pellet (reward = 25) in Ms. Pac-Man. We hypothesize 1) that overall performance decreases when we turn off clipping, because it is not possible to tune a step size that works on many games, and 2) that we can regain much of the lost performance with Pop-Art. The goal is not to improve state-of-the-art performance, but to remove the domain-dependent heuristic that is induced by the clipping of the rewards, thereby uncovering the true rewards.\nWe ran the Double DQN algorithm [van Hasselt et al., 2016] in three versions: without changes, without clipping of both rewards and temporal-difference errors, and without clipping but additionally using Pop-Art. The targets are the cumulation of a reward and the discounted value at the next state:\n\nY_t = R_{t+1} + γ Q(S_{t+1}, argmax_a Q(S_{t+1}, a; θ); θ⁻) ,   (5)\n\nwhere Q(s, a; θ) is the estimated action value of action a in state s according to current parameters θ, and where θ⁻ is a more stable periodic copy of these parameters [cf. Mnih et al., 2015, van Hasselt et al., 2016, for more details]. This is a form of Double Q-learning [van Hasselt, 2010]. We roughly tuned the main step size and the step size for the normalization to 10^{-4}. It is not straightforward to tune the unclipped version, for reasons that will become clear soon.\nFigure 1b shows the ℓ2 norm of the gradient of Double DQN during learning as a function of the number of training steps. 
The left plot corresponds to no reward clipping, the middle to clipping (as per the original DQN and Double DQN), and the right to using Pop-Art instead of clipping. Each faint dashed line corresponds to the median norm (where the median is taken over time) on one game. The shaded areas correspond to 50%, 90%, and 95% of games.\nWithout clipping the rewards, Pop-Art produces a much narrower band within which the gradients fall. Across games, 95% of median norms range over less than two orders of magnitude (roughly between 1 and 20), compared to almost four orders of magnitude for clipped Double DQN, and more than six orders of magnitude for unclipped Double DQN without Pop-Art. The wide range for the latter shows why it is impossible to find a suitable step size with neither clipping nor Pop-Art: the updates are either far too small on some games or far too large on others.\nAfter 200M frames, we evaluated the actual scores of the best performing agent in each game on 100 episodes of up to 30 minutes of play, and then normalized by human and random scores as described by Mnih et al. [2015]. Figure 2 shows the differences in normalized scores between (clipped) Double DQN and Double DQN with Pop-Art.\nThe main eye-catching result is that the distribution in performance drastically changed. On some games (e.g., Gopher, Centipede) we observe dramatic improvements, while on other games (e.g., Video Pinball, Star Gunner) we see a substantial decrease. For instance, in Ms. Pac-Man the clipped Double DQN agent does not care more about ghosts than pellets, but Double DQN with Pop-Art learns to actively hunt ghosts, resulting in higher scores. Especially remarkable is the improved performance on games like Centipede and Gopher, but also notable is a game like Frostbite which went from below 50% to a near-human performance level. 
Raw scores can be found in the appendix.\n\nFigure 2: Differences between normalized scores for Double DQN with and without Pop-Art on 57 Atari games.\n\nSome games fare worse with unclipped rewards because it changes the nature of the problem. For instance, in Time Pilot the Pop-Art agent learns to quickly shoot a mothership to advance to the next level of the game, obtaining many points in the process. The clipped agent instead shoots at anything that moves, ignoring the mothership. However, in the long run in this game more points are scored with the safer and more homogeneous strategy of the clipped agent. One reason for the disconnect between the seemingly qualitatively good behavior and the lower scores is that the agents are fairly myopic: both use a discount factor of γ = 0.99, and therefore only optimize rewards that happen within a dozen or so seconds into the future.\nOn the whole, the results show that with Pop-Art we can successfully remove the clipping heuristic that has been present in all prior DQN variants, while retaining overall performance levels. Double DQN with Pop-Art performs slightly better than Double DQN with clipped rewards: on 32 out of 57 games performance is at least as good as clipped Double DQN, and the median (+0.4%) and mean (+34%) differences are positive.\n\n5 Discussion\n\nWe have demonstrated that Pop-Art can be used to adapt to different and non-stationary target magnitudes. This problem was perhaps not previously commonly appreciated, potentially because in deep learning it is common to tune or normalize a priori, using an existing data set. This is not as straightforward in reinforcement learning when the policy and the corresponding values may repeatedly change over time. 
This makes Pop-Art a promising tool for deep reinforcement learning, although it is not specific to this setting.\nWe saw that Pop-Art can successfully replace the clipping of rewards as done in DQN to handle the various magnitudes of the targets used in the Q-learning update. Now that the true problem is exposed to the learning algorithm we can hope to make further progress, for instance by improving the exploration [Osband et al., 2016], which can now be informed about the true unclipped rewards.\n\nReferences\n\nS. I. Amari. Natural gradient works efficiently in learning. Neural Computation, 10(2):251\u2013276, 1998.\nM. G. Bellemare, Y. Naddaf, J. Veness, and M. Bowling. The arcade learning environment: An evaluation platform for general agents. Journal of Artificial Intelligence Research (JAIR), 47:253\u2013279, 2013.\nM. G. Bellemare, G. Ostrovski, A. Guez, P. S. Thomas, and R. Munos. Increasing the action gap: New operators for reinforcement learning. In AAAI, 2016.\nJ. Bergstra and Y. Bengio. Random search for hyper-parameter optimization. The Journal of Machine Learning Research, 13(1):281\u2013305, 2012.\nJ. S. Bergstra, R. Bardenet, Y. Bengio, and B. K\u00e9gl. Algorithms for hyper-parameter optimization. In Advances in Neural Information Processing Systems, pages 2546\u20132554, 2011.\nG. Desjardins, K. Simonyan, R. Pascanu, and K. Kavukcuoglu. 
Natural neural networks. In Advances in Neural Information Processing Systems, pages 2062–2070, 2015.
J. Duchi, E. Hazan, and Y. Singer. Adaptive subgradient methods for online learning and stochastic optimization. The Journal of Machine Learning Research, 12:2121–2159, 2011.
S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167, 2015.
D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. In International Conference on Learning Representations, 2015.
Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
Y. LeCun, Y. Bengio, and G. Hinton. Deep learning. Nature, 521(7553):436–444, 2015.
Y. Liang, M. C. Machado, E. Talvitie, and M. H. Bowling. State of the art control of Atari games using shallow reinforcement learning. In International Conference on Autonomous Agents and Multiagent Systems, 2016.
J. Martens and R. B. Grosse. Optimizing neural networks with Kronecker-factored approximate curvature. In Proceedings of the 32nd International Conference on Machine Learning, volume 37, pages 2408–2417, 2015.
W. S. McCulloch and W. Pitts. A logical calculus of the ideas immanent in nervous activity. The Bulletin of Mathematical Biophysics, 5(4):115–133, 1943.
V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski, S. Petersen, C. Beattie, A. Sadik, I. Antonoglou, H. King, D. Kumaran, D. Wierstra, S. Legg, and D. Hassabis. Human-level control through deep reinforcement learning. Nature, 518(7540):529–533, 2015.
V. Mnih, A. P. Badia, M. Mirza, A. Graves, T. Lillicrap, T. Harley, D. Silver, and K. Kavukcuoglu. Asynchronous methods for deep reinforcement learning.
In International Conference on Machine Learning, 2016.
I. Osband, C. Blundell, A. Pritzel, and B. Van Roy. Deep exploration via bootstrapped DQN. CoRR, abs/1602.04621, 2016.
F. Rosenblatt. Principles of Neurodynamics. Spartan, New York, 1962.
S. Ross, P. Mineiro, and J. Langford. Normalized online learning. In Proceedings of the 29th Conference on Uncertainty in Artificial Intelligence, 2013.
D. E. Rumelhart, G. E. Hinton, and R. J. Williams. Learning internal representations by error propagation. In Parallel Distributed Processing, volume 1, pages 318–362. MIT Press, 1986.
T. Schaul, J. Quan, I. Antonoglou, and D. Silver. Prioritized experience replay. In International Conference on Learning Representations, Puerto Rico, 2016.
J. Schmidhuber. Deep learning in neural networks: An overview. Neural Networks, 61:85–117, 2015.
J. Snoek, H. Larochelle, and R. P. Adams. Practical Bayesian optimization of machine learning algorithms. In Advances in Neural Information Processing Systems, pages 2951–2959, 2012.
T. Tieleman and G. Hinton. Lecture 6.5 - RMSProp: Divide the gradient by a running average of its recent magnitude. COURSERA: Neural Networks for Machine Learning, 2012.
H. van Hasselt. Double Q-learning. Advances in Neural Information Processing Systems, 23:2613–2621, 2010.
H. van Hasselt, A. Guez, and D. Silver. Deep reinforcement learning with Double Q-learning. In AAAI, 2016.
Z. Wang, N. de Freitas, T. Schaul, M. Hessel, H. van Hasselt, and M. Lanctot. Dueling network architectures for deep reinforcement learning. In International Conference on Machine Learning, New York, NY, USA, 2016.
C. J. C. H. Watkins. Learning from delayed rewards.
PhD thesis, University of Cambridge, England, 1989.