{"title": "Regularized Anderson Acceleration for Off-Policy Deep Reinforcement Learning", "book": "Advances in Neural Information Processing Systems", "page_first": 10231, "page_last": 10241, "abstract": "Model-free deep reinforcement learning (RL) algorithms have been widely used for a range of complex control tasks. However, slow convergence and sample inefficiency remain challenging problems in RL, especially when handling continuous and high-dimensional state spaces. To tackle this problem, we propose a general acceleration method for model-free, off-policy deep RL algorithms by drawing the idea underlying regularized Anderson acceleration (RAA), which is an effective approach to accelerating the solving of fixed point problems with perturbations. Specifically, we first explain how policy iteration can be applied directly with Anderson acceleration. Then we extend RAA to the case of deep RL by introducing a regularization term to control the impact of perturbation induced by function approximation errors. We further propose two strategies, i.e., progressive update and adaptive restart, to enhance the performance. The effectiveness of our method is evaluated on a variety of benchmark tasks, including Atari 2600 and MuJoCo. Experimental results show that our approach substantially improves both the learning speed and final performance of state-of-the-art deep RL algorithms.", "full_text": "Regularized Anderson Acceleration for Off-Policy\n\nDeep Reinforcement Learning\n\nWenjie Shi, Shiji Song, Hui Wu, Ya-Chu Hsu, Cheng Wu, Gao Huang\u2217\n\nDepartment of Automation, Tsinghua University, Beijing, China\n\nBeijing National Research Center for Information Science and Technology (BNRist)\n\n{shiwj16, wuhui14, xuyz17}@mails.tsinghua.edu.cn\n\n{shijis, wuc, gaohuang}@tsinghua.edu.cn\n\nAbstract\n\nModel-free deep reinforcement learning (RL) algorithms have been widely used\nfor a range of complex control tasks. 
However, slow convergence and sample\ninef\ufb01ciency remain challenging problems in RL, especially when handling con-\ntinuous and high-dimensional state spaces. To tackle this problem, we propose\na general acceleration method for model-free, off-policy deep RL algorithms by\ndrawing the idea underlying regularized Anderson acceleration (RAA), which is an\neffective approach to accelerating the solving of \ufb01xed point problems with pertur-\nbations. Speci\ufb01cally, we \ufb01rst explain how policy iteration can be applied directly\nwith Anderson acceleration. Then we extend RAA to the case of deep RL by\nintroducing a regularization term to control the impact of perturbation induced by\nfunction approximation errors. We further propose two strategies, i.e., progressive\nupdate and adaptive restart, to enhance the performance. The effectiveness of our\nmethod is evaluated on a variety of benchmark tasks, including Atari 2600 and\nMuJoCo. Experimental results show that our approach substantially improves both\nthe learning speed and \ufb01nal performance of state-of-the-art deep RL algorithms.\nThe code and models are available at: https://github.com/shiwj16/raa-drl.\n\nIntroduction\n\n1\nReinforcement learning (RL) is a principled mathematical framework for experience-based au-\ntonomous learning of policies. In recent years, model-free deep RL algorithms have been applied\nin a variety of challenging domains, from game playing [1, 2] to robot navigation [3, 4]. However,\nsample inef\ufb01ciency, i.e., the required number of interactions with the environment is impractically\nhigh, remains a major limitation of current RL algorithms for problems with continuous and high-\ndimensional state spaces. For example, many RL approaches on tasks with low-dimensional state\nspaces and fairly benign dynamics may even require thousands of trials to learn. 
Sample inefficiency makes learning in real physical systems impractical and severely limits the applicability of RL approaches in more challenging scenarios.
A promising way to improve the sample efficiency of RL is to learn models of the underlying system dynamics. However, learning models of the underlying transition dynamics is difficult and inevitably leads to modelling errors. Alternatively, off-policy algorithms such as deep Q-learning (DQN) [1] and its variants [5, 6], deep deterministic policy gradient (DDPG) [7], soft actor-critic (SAC) [8, 9] and off-policy hierarchical RL [10], which instead aim to reuse past experience, are commonly used to alleviate the sample inefficiency problem. Unfortunately, off-policy algorithms are typically based on policy iteration or value iteration, which repeatedly apply the Bellman operator of interest and generally require an infinite number of iterations to converge exactly to the optimum. Moreover, the Bellman iteration constructs a contraction mapping that converges asymptotically to the optimal value function [11]. Iterating this mapping essentially amounts to solving a fixed-point problem [12] and may thus be unacceptably slow to converge. These issues are further exacerbated when nonlinear function approximators such as neural networks are used, or when the tasks have continuous state and action spaces.

*Corresponding author.

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

This paper explores how to accelerate convergence and improve sample efficiency for model-free, off-policy deep RL. 
We make the observation that RL is closely linked to fixed-point iteration: the optimal policy can be found by solving a fixed-point problem of the associated Bellman operator. Therefore, we attempt to embrace the idea underlying Anderson acceleration (also known as Anderson mixing or Pulay mixing) [13, 14], a method capable of speeding up the computation of fixed-point iterations. While the classic fixed-point iteration repeatedly applies the operator to the last estimate, Anderson acceleration searches for the point with minimal residual within the subspace spanned by several previous estimates, and then applies the operator to this optimal estimate. Prior work [15] has successfully applied Anderson acceleration to value iteration, and preliminary experiments show a significant speed-up of convergence. However, the existing application is only feasible on simple tasks with low-dimensional, discrete state and action spaces. Besides, as far as we know, Anderson acceleration has never been applied to deep RL, due to long-standing issues including the bias induced by sampling a minibatch and function approximation errors.
In this paper, Anderson acceleration is first applied to policy iteration under a tabular setting. Then, we propose a practical acceleration method for model-free, off-policy deep RL algorithms based on regularized Anderson acceleration (RAA) [16], a general paradigm with a Tikhonov regularization term to control the impact of perturbations. The perturbations may be noise injected from the outside or high-order error terms induced by a nonlinear fixed-point iteration function. In the context of deep RL, function approximation errors are the major source of perturbation for RAA. We present two bounds to characterize how the regularization term controls the impact of function approximation errors. 
Two strategies, i.e., progressive update and adaptive restart, are further proposed to enhance the performance. Moreover, our acceleration method can be readily implemented on top of deep RL algorithms, including Dueling-DQN [5] and twin delayed DDPG (TD3) [17], to solve very complex, high-dimensional tasks such as the Atari 2600 and MuJoCo [18] benchmarks. Finally, the empirical results show that our approach exhibits a substantial improvement in both learning speed and final performance over the vanilla deep RL algorithms.

2 Related Work
Prior works have made a number of efforts to improve the sample efficiency and speed up the convergence of deep RL from different perspectives, such as variance reduction [19, 20], model-based RL [21, 22, 23], guided exploration [24, 25], etc. One of the most widely used techniques is off-policy RL, which combines temporal difference learning [26] with experience replay [27, 28] so as to make use of all previous samples before each update to the policy parameters. Though it introduces bias by reusing previous samples, off-policy RL alleviates the high variance in the estimation of the Q-value and the policy gradient [29]; consequently, fast convergence can be achieved under careful parameter tuning.
As a core technique of off-policy RL, temporal difference learning is derived from the Bellman iteration, which can be regarded as a fixed-point problem [12]. Our work focuses on speeding up the convergence of off-policy RL by speeding up the convergence of this essential fixed-point problem, relying on a technique called Anderson acceleration. This method has been exploited in prior work [13, 30] to accelerate fixed-point iterations by computing the new iterate as a linear combination of previous estimates. In the linear case, the convergence rate of Anderson acceleration has been analyzed in detail and proved to be equal to or better than that of fixed-point iteration [14]. 
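To make this linear-combination idea concrete, here is a minimal NumPy sketch (our illustration, not code from the cited works) that runs windowed Anderson acceleration on a toy linear fixed-point problem x = Gx + b, choosing sum-to-one weights that minimize the combined residual of the last few estimates:

```python
import numpy as np

# Toy linear fixed-point problem x = T(x) := G x + b with a symmetric
# contraction G (spectral radius 0.9), so plain iteration converges slowly.
rng = np.random.default_rng(0)
n = 50
A = rng.standard_normal((n, n))
S = A + A.T
G = 0.9 * S / np.linalg.norm(S, 2)          # symmetric, spectral radius 0.9
b = rng.standard_normal(n)
x_star = np.linalg.solve(np.eye(n) - G, b)  # exact fixed point


def T(x):
    return G @ x + b


def anderson_step(X, TX):
    """One Anderson step: minimize ||sum_i a_i (T(x_i) - x_i)|| s.t. sum_i a_i = 1,
    then return sum_i a_i T(x_i)."""
    R = TX - X                              # columns are residuals T(x_i) - x_i
    m = R.shape[1]
    M = R.T @ R + 1e-12 * np.eye(m)         # tiny ridge for numerical stability
    coef = np.linalg.solve(M, np.ones(m))
    alpha = coef / coef.sum()               # closed-form constrained least squares
    return TX @ alpha


m = 5                                       # window of previous estimates
x_plain = np.zeros(n)
x_aa = np.zeros(n)
hist_x, hist_Tx = [], []
for _ in range(50):
    x_plain = T(x_plain)                    # classic fixed-point iteration
    hist_x.append(x_aa)
    hist_Tx.append(T(x_aa))
    X = np.stack(hist_x[-m:], axis=1)
    TX = np.stack(hist_Tx[-m:], axis=1)
    x_aa = anderson_step(X, TX)             # Anderson-accelerated iteration

err_plain = np.linalg.norm(x_plain - x_star)
err_aa = np.linalg.norm(x_aa - x_star)
print(err_plain, err_aa)
```

On this toy contraction, the accelerated iterate reaches a smaller error than plain iteration after the same number of operator applications, consistent with the linear-case analysis cited above.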
For nonlinear fixed-point iterations, regularized Anderson acceleration was proposed by [16] to constrain the norm of the coefficient vector and reduce the impact of perturbations. Recent works [15, 31] have applied Anderson acceleration to value iteration and deep neural networks, and preliminary experiments show that a significant speedup of convergence is achieved. However, as far as we know, there is still no research showing its acceleration effect on deep RL for complex, high-dimensional problems.

3 Preliminaries
Under the RL paradigm, the interaction between an agent and the environment is described as a Markov Decision Process (MDP). Specifically, at a discrete timestep t, the agent takes an action a_t in a state s_t and transits to a subsequent state s_{t+1} while obtaining a reward r_t = r(s_t, a_t) from the environment. The transitions between states satisfy the Markov property, i.e., P(s_{t+1} | s_t, a_t, ..., s_0, a_0) = P(s_{t+1} | s_t, a_t). The RL algorithm aims to find a policy π(a|s) that maximizes the expected sum of discounted future rewards. The Q-value function describes the expected return starting from a state-action pair (s, a):

Q^\pi(s, a) = \mathbb{E}\left[ \sum_{t=0}^{\infty} \gamma^t r_{t+1} \,\middle|\, s_0 = s, a_0 = a \right],

where the policy π(a|s) is a function or conditional distribution mapping the state space S to the action space A.

3.1 Off-policy reinforcement learning
Most off-policy RL algorithms are derived from policy iteration, which alternates between policy evaluation and policy improvement to monotonically improve the policy and the value function until convergence. For complex environments with unknown dynamics and continuous spaces, policy iteration is generally combined with function approximation, and a parameterized Q-value function (or critic) and policy function are learned from sampled interactions with the environment. 
Since the critic is represented as a parameterized function instead of a look-up table, policy evaluation is replaced with an optimization problem that minimizes the squared temporal difference error, i.e., the discrepancy between the outputs of the critic after and before applying the Bellman operator:

L(\theta) = \mathbb{E}\left[ \big( (\mathcal{T} Q_{\theta'})(s, a) - Q_\theta(s, a) \big)^2 \right],   (1)

where typically the Bellman operator is applied to a separate target value network Q_{θ'} whose parameters are periodically replaced or softly updated with a copy of the current Q-network weights.
In the off-policy RL field, prior works have proposed a number of modifications of the Bellman operator to alleviate the overestimation or function approximation error problems and have thus achieved significant improvements. Analogously to policy improvement, DQN replaces the current policy with a greedy policy for the next state in the Bellman operator:

(\mathcal{T} Q_{\theta'})(s_t, a_t) = \mathbb{E}_{s_{t+1}, r_t}\left[ r(s_t, a_t) + \gamma \max_a Q_{\theta'}(s_{t+1}, a) \right].   (2)

As a state-of-the-art actor-critic algorithm for continuous control, TD3 [17] proposes a clipped double Q-learning variant and a target policy smoothing regularization to modify the Bellman operator, which alleviates the overestimation and overfitting problems:

(\mathcal{T} Q_{\theta'})(s_t, a_t) = \mathbb{E}_{s_{t+1}, r_t}\left[ r(s_t, a_t) + \gamma \min_{j=1,2} Q_{\theta'_j}(s_{t+1}, \pi_{\phi'}(s_{t+1}) + \epsilon) \right],   (3)

where Q_{θ_j}(s, a) (j = 1, 2) denote two critics with decoupled parameters θ_j. The added noise ε ∼ clip(N(0, σ), −c, c) is clipped by the positive constant c.

3.2 Anderson acceleration for value iteration
Most RL algorithms are derived from a fundamental framework named policy iteration, which consists of two phases, i.e., policy evaluation and policy improvement. 
The policy evaluation phase estimates the Q-value function induced by the current policy by iterating a Bellman operator from an initial estimate. Following policy evaluation, the policy improvement phase acquires a better policy from a greedy strategy. Policy iteration alternates these two phases, updating the Q-value and the policy respectively, until convergence. As a special variant of policy iteration, value iteration merges policy evaluation and policy improvement into one iteration

V_{k+1}(s) \leftarrow (\mathcal{T} V_k)(s) = \max_a \mathbb{E}_{s', r}\left[ r + \gamma V_k(s') \right], \quad \forall s \in \mathcal{S},   (4)

and iterates it until convergence from an initial V_0, where the Bellman operator is repeatedly applied only to the last estimate. Anderson acceleration is a widely used technique to speed up the convergence of fixed-point iterations and has been successfully applied to value iteration [15] by linearly combining the previous m (m > 1) value estimates,

V_{k+1} \leftarrow \sum_{i=1}^{m} \alpha_k^i \, \mathcal{T} V_{k-m+i},   (5)

where the coefficient vector α_k ∈ R^m is determined by minimizing the norm of the total Bellman residuals of these estimates,

\alpha_k = \operatorname*{argmin}_{\alpha \in \mathbb{R}^m} \left\| \sum_{i=1}^{m} \alpha_i \left( \mathcal{T} V_{k-m+i} - V_{k-m+i} \right) \right\|, \quad \text{s.t.} \ \sum_{i=1}^{m} \alpha_i = 1.   (6)

For the ℓ2-norm, the minimum can be solved analytically using the Karush-Kuhn-Tucker conditions. The corresponding coefficient vector is given by

\alpha_k = \frac{(\Delta_k^\top \Delta_k)^{-1} \mathbf{1}}{\mathbf{1}^\top (\Delta_k^\top \Delta_k)^{-1} \mathbf{1}},   (7)

where Δ_k = [δ_{k−m+1}, ..., δ_k] ∈ R^{|S|×m} is a Bellman residual matrix with δ_i = T V_i − V_i ∈ R^{|S|}, and 1 ∈ R^m denotes the vector with all components equal to one [15].

4 Regularized Anderson Acceleration for Deep Reinforcement Learning
Our regularized Anderson acceleration (RAA) method for deep RL can be derived starting from a direct application of Anderson acceleration to the classic policy iteration algorithm. We first present this derivation to show that the resulting algorithm converges to the optimal policy faster than the vanilla form. Then, a regularized variant is proposed for the more general case with function approximation. Based on this theory, a progressive and practical acceleration method with adaptive restart is presented for off-policy deep RL algorithms.

4.1 Anderson acceleration for policy iteration
As described above, Anderson acceleration can be directly applied to value iteration. However, policy iteration is more fundamental and better suited to scale to deep RL than value iteration. Unfortunately, implementing Anderson acceleration is complicated in the case of policy iteration, because there is no explicit fixed-point mapping between the policies of any two consecutive steps, which makes it impossible to apply Anderson acceleration to the policy π directly. Due to the one-to-one mapping between policies and Q-value functions, policy iteration can instead be accelerated by applying Anderson acceleration to the policy improvement step, which establishes a mapping from the current Q-value estimate to the next policy. In this section, our derivation is based on a tabular setting to enable theoretical analysis. 
Specifically, for the prototype policy iteration, suppose that estimates have been computed up to iteration k, and that in addition to the current estimate Q^{π_k}, the m − 1 previous estimates Q^{π_{k−1}}, ..., Q^{π_{k−m+1}} are also known. Then, a linear combination of the estimates Q^{π_i} with coefficients α_i reads²

Q_\alpha^k = \sum_{i=1}^{m} \alpha_i Q^{\pi_{k-m+i}} \quad \text{with} \quad \sum_{i=1}^{m} \alpha_i = 1.   (8)

Due to this equality constraint, we define the combined Bellman operator T_c as follows:

\mathcal{T}_c Q_\alpha^k = \sum_{i=1}^{m} \alpha_i \, \mathcal{T} Q^{\pi_{k-m+i}}.   (9)

Then, one searches for a coefficient vector α_k that minimizes the objective function J, defined as the combined Bellman residual over the entire state-action space S × A:

\alpha_k = \operatorname*{argmin}_{\alpha \in \mathbb{R}^m} J(\alpha) = \operatorname*{argmin}_{\alpha \in \mathbb{R}^m} \left\| \sum_{i=1}^{m} \alpha_i \left( \mathcal{T} Q^{\pi_{k-m+i}} - Q^{\pi_{k-m+i}} \right) \right\|, \quad \text{s.t.} \ \sum_{i=1}^{m} \alpha_i = 1.   (10)

In this paper, we consider the ℓ2-norm, although a different norm may also be feasible (for example ℓ1 and ℓ∞, in which case the optimization problem becomes a linear program). The solution to this optimization problem is identical to (7), except that Δ_k = [δ_{k−m+1}, ..., δ_k] ∈ R^{|S×A|×m} with δ_i = T Q^{π_i} − Q^{π_i} ∈ R^{|S×A|}. A detailed derivation can be found in Appendix A.1 of the supplementary material. 
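As a sanity check of this closed-form solution, the snippet below (a minimal sketch of ours, with a synthetic matrix standing in for the Bellman residuals T Q^{π_i} − Q^{π_i}) computes the coefficient vector via the KKT formula and verifies that it attains a smaller combined residual than uniform averaging:

```python
import numpy as np


def anderson_coefficients(residuals):
    """Closed-form minimizer of ||residuals @ alpha|| subject to sum(alpha) = 1,
    obtained from the Karush-Kuhn-Tucker conditions (cf. eq. (7))."""
    m = residuals.shape[1]
    M = residuals.T @ residuals
    coef = np.linalg.solve(M, np.ones(m))  # (Delta^T Delta)^{-1} 1
    return coef / coef.sum()               # normalize so that 1^T alpha = 1


# Synthetic stand-in for the residual matrix: column i plays the role of
# T Q^{pi_i} - Q^{pi_i} flattened over the state-action space.
rng = np.random.default_rng(1)
deltas = rng.standard_normal((1000, 5))

alpha = anderson_coefficients(deltas)
uniform = np.full(5, 1 / 5)

print(np.linalg.norm(deltas @ alpha))    # combined residual of the optimal weights
print(np.linalg.norm(deltas @ uniform))  # combined residual of plain averaging
```

By construction, the returned vector sums to one, and by optimality its combined residual can never exceed that of any other sum-to-one weighting, including the uniform one.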
Then, the new policy improvement step is given by

\pi_{k+1}(s) = \operatorname*{argmax}_a Q_\alpha^k(s, a) = \operatorname*{argmax}_a \sum_{i=1}^{m} \alpha_k^i Q^{\pi_{k-m+i}}(s, a), \quad \forall s \in \mathcal{S}.   (11)

Meanwhile, the Q-value estimate Q^{π_{k+1}} can be obtained by iteratively applying the following policy evaluation operator, starting from some initial function Q_0:

Q_i(s, a) \leftarrow \mathbb{E}_{s', r}\left[ r + \gamma \, \mathbb{E}_{a' \sim \pi_{k+1}}[Q_{i-1}(s', a')] \right], \quad \forall (s, a) \in \mathcal{S} \times \mathcal{A}.   (12)

In fact, the effect of the acceleration can be explained intuitively. The linear combination Q_α^k is a better estimate of the Q-value than the last estimate Q^{π_k} in terms of the combined Bellman residual. Accordingly, the policy is improved from a better policy baseline corresponding to this better Q-value estimate.

4.2 Regularized variant with function approximation

For RL control tasks with continuous state and action spaces, or a high-dimensional state space, we generally consider the case in which the Q-value function is approximated by a parameterized function approximator. If the approximation is sufficiently good, it might be appropriate to use it in place of Q^π in (8)-(12). However, there are several key challenges when implementing Anderson acceleration with function approximation.

²Notice that we do not impose a positivity condition on the coefficients.

First, notice that the Bellman residuals in (10) are calculated over the entire state-action space. Unfortunately, sweeping the entire state-action space is intractable for continuous RL, and a fine-grained discretization leads to the curse of dimensionality. A feasible alternative to avoid this issue is to use a sampled Bellman residual matrix Δ̃_k instead. 
To alleviate the bias induced by sampling a minibatch, we adopt a large sample size N_A specifically for Anderson acceleration.
Second, function approximation errors are unavoidable and lead to a biased solution of Anderson acceleration. The intricacies of this issue are exacerbated by deep models. Therefore, function approximation errors induce severe perturbations when applying Anderson acceleration to policy iteration with function approximation. In addition to the perturbation, the solution (7) contains the inverse of a squared Bellman residual matrix, which may suffer from ill-conditioning when the squared Bellman residual matrix is rank-deficient; this is a major source of numerical instability in vanilla Anderson acceleration. In other words, even if the perturbation is small, its impact on the solution can be arbitrarily large.
Based on the above observations, we extend the idea underlying RAA to policy iteration with function approximation in this section. The coefficient vector (10) is now adjusted to α̃_k, which minimizes the perturbed objective function with an added Tikhonov regularization term:

\tilde{\alpha}_k = \operatorname*{argmin}_{\alpha \in \mathbb{R}^m} \left\| \sum_{i=1}^{m} \alpha_i \left( \mathcal{T} Q^{\pi_{k-m+i}} - Q^{\pi_{k-m+i}} + e_{k-m+i} \right) \right\|^2 + \lambda \|\alpha\|^2, \quad \text{s.t.} \ \sum_{i=1}^{m} \alpha_i = 1,   (13)

where e_{k−m+i} represents the perturbation induced by function approximation errors. The solution to this regularized optimization problem can be obtained analytically, similar to (10):

\tilde{\alpha}_k = \frac{(\tilde{\Delta}_k^\top \tilde{\Delta}_k + \lambda I)^{-1} \mathbf{1}}{\mathbf{1}^\top (\tilde{\Delta}_k^\top \tilde{\Delta}_k + \lambda I)^{-1} \mathbf{1}},   (14)

where λ is a positive scalar representing the scale of regularization, and Δ̃_k = [δ̃_{k−m+1}, ..., δ̃_k] ∈ R^{N_A×m} is the sampled Bellman residual matrix with δ̃_i = T Q^{π_i} − Q^{π_i} + e_i ∈ R^{N_A}.
In fact, the regularization term controls the norm of the coefficient vector produced by RAA and reduces the impact of the perturbation induced by function approximation errors, as shown analytically by the following proposition.
Proposition 1. Consider two identical policy iterations I1 and I2 with function approximation. I2 is implemented with regularized Anderson acceleration and takes approximation errors into account, whereas I1 is implemented only with vanilla Anderson acceleration. Let α_k and α̃_k be the coefficient vectors of I1 and I2, respectively. Then, we have the following bounds:

\|\tilde{\alpha}_k\| \le \sqrt{\frac{\lambda + \|\tilde{\Delta}_k\|^2}{m \lambda}}, \qquad \|\tilde{\alpha}_k - \alpha_k\| \le \frac{\|\tilde{\Delta}_k^\top \tilde{\Delta}_k - \Delta_k^\top \Delta_k\| + \lambda}{\lambda} \, \|\alpha_k\|.   (15)

Proof. See Appendix A.2 of the supplementary material.

From the above bounds, we observe that regularization allows better control of the impact of function approximation errors, but also causes an inevitable gap between α̃_k and α_k. Qualitatively, a large regularization scale λ means less impact of function approximation errors. On the other hand, an overly large λ leads to a very small norm of the coefficient vector α̃_k, which means the coefficients for the previous estimates are nearly identical. 
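The regularized closed form and the first bound of Proposition 1 can be checked numerically. The snippet below (our illustration, with a random matrix standing in for the sampled residuals Δ̃_k) also shows how an overly large λ drives the coefficients toward the uniform vector:

```python
import numpy as np


def raa_coefficients(residuals, lam):
    """Regularized closed form (cf. eq. (14)): normalize (D^T D + lam I)^{-1} 1."""
    m = residuals.shape[1]
    coef = np.linalg.solve(residuals.T @ residuals + lam * np.eye(m), np.ones(m))
    return coef / coef.sum()


rng = np.random.default_rng(2)
m, n_samples = 5, 256
deltas = rng.standard_normal((n_samples, m))   # stand-in for the sampled residuals

for lam in (1e-2, 1.0, 100.0):
    alpha = raa_coefficients(deltas, lam)
    # First bound of Proposition 1: ||alpha|| <= sqrt((lam + ||Delta||^2) / (m lam)),
    # with ||Delta|| the spectral norm of the residual matrix.
    bound = np.sqrt((lam + np.linalg.norm(deltas, 2) ** 2) / (m * lam))
    assert np.linalg.norm(alpha) <= bound + 1e-9

# An overly large lam pushes the coefficients toward the uniform vector 1/m:
alpha_big = raa_coefficients(deltas, 1e8)
print(alpha_big)                               # approximately [0.2 0.2 0.2 0.2 0.2]
```

The bound follows from comparing the regularized objective at its minimizer with its value at the uniform weighting, so it holds for any residual matrix, not just this synthetic one.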
However, according to (10), equal coefficients are probably far away from the optimum α_k and thus result in a great performance loss of Anderson acceleration.

4.3 Implementation on off-policy deep reinforcement learning
As discussed in the last section, it is impossible to directly use policy iteration in very large continuous domains. To that end, most off-policy deep RL algorithms apply the mechanism underlying policy iteration to learn approximations to both the Q-value function and the policy. Instead of iterating policy evaluation and policy improvement to convergence, these off-policy algorithms alternate between optimizing the two networks with stochastic gradient descent. For example, the actor-critic method is a well-known implementation of this mechanism. In this section, we show that RAA for policy iteration can be readily extended to existing off-policy deep RL algorithms for both discrete and continuous control tasks, with only a few modifications to the update of the critic.

Algorithm 1: RAA-Dueling-DQN Algorithm
  Initialize a critic network Q_θ with random parameters θ;
  Initialize m target networks θ_i ← θ (i = 1, ..., m) and replay buffer D;
  Initialize restart checking period T_r and maximum training steps K;
  Set k = 0, c_1 = 1, Δ_min = inf, Δ_Tr = 0;
  while k < K do
    Receive initial observation state s_0;
    for t = 1 to T do
      Set k = k + 1, and m_k = min(c_k, m);
      With probability ε select a random action a_t, otherwise select a_t = argmax_a Q_θ(s_t, a);
      Execute a_t, receive r_t and s_{t+1}, store transition (s_t, a_t, r_t, s_{t+1}) into D;
      Sample a minibatch of transitions (s, a, r, s') from D;
      Perform Anderson acceleration steps (13)-(14) to obtain α̃_k, and set Δ_Tr = Δ_Tr + ||δ̃_k||²;
      Update the critic by minimizing the loss function (18) with y_t equal to the RHS of (17);
      Update target networks every M steps: θ_i ← θ_{i+1} (i = 1, ..., m − 1) and θ_m ← θ;
      c_{k+1} = c_k + 1;
      if k mod T_r = 0 then
        Δ_min = min(Δ_min, Δ_Tr);
        if Δ_Tr > Δ_min then
          Δ_min = inf, and c_{k+1} = 1;

4.3.1 Regularized Anderson acceleration for actor-critic
Consider a parameterized Q-value function Q_θ(s_t, a_t) and a tractable policy π_φ(a_t|s_t), where the parameters of these networks are θ and φ. In the following, we first give the main results of RAA for actor-critic. Then, RAA is combined with Dueling-DQN and TD3, respectively.
Under the off-policy deep RL (actor-critic) paradigm, the RAA variant of policy iteration (11)-(12) reduces to the following Bellman equation:

Q_\theta(s_t, a_t) = \mathbb{E}_{s_{t+1}, r_t}\left[ r_t + \gamma \sum_{i=1}^{m} \tilde{\alpha}_i \max_{a_{t+1}} Q_{\theta_i}(s_{t+1}, a_{t+1}) \right],   (16)

where θ_i denotes the parameters of the target network i update steps earlier. Furthermore, to mitigate the instability resulting from the drastic update step of Anderson acceleration, the following progressive Bellman equation (or progressive update) with RAA is used in practice:

Q_\theta(s_t, a_t) = \beta \sum_{i=1}^{m} \tilde{\alpha}_i Q_{\theta_i}(s_t, a_t) + (1 - \beta) \, \mathbb{E}_{s_{t+1}, r_t}\left[ r_t + \gamma \sum_{i=1}^{m} \tilde{\alpha}_i \max_{a_{t+1}} Q_{\theta_i}(s_{t+1}, a_{t+1}) \right],   (17)

where β is a small positive coefficient.
Generally, the loss function of the critic is then formulated as the squared consistency error of the Bellman equation:

L_Q(\theta) = \mathbb{E}_{(s_t, a_t) \in \mathcal{D}}\left[ (Q_\theta(s_t, a_t) - y_t)^2 \right],   (18)

where D is the distribution of previously sampled transitions, i.e., a replay buffer, and y_t denotes the target value of the critic.
RAA-Dueling-DQN. Unlike the vanilla Dueling-DQN algorithm, which uses the general Bellman equation, we instead use the progressive Bellman equation with RAA (17) to update the critic. That is, y_t equals the RHS of (17) for RAA-Dueling-DQN.
RAA-TD3. 
For the case of TD3, where an actor and two critics are learned for the deterministic policy and the Q-value function respectively, the implementation of RAA is more complicated. Specifically, the two critics Q_{θ^j} (j = 1, 2) are trained simultaneously with clipped double Q-learning. The target values y_{j,t} (j = 1, 2) for RAA-TD3 are then given by

y_{j,t} = \beta \sum_{i=1}^{m} \tilde{\alpha}_i \hat{Q}_{\theta_i}(s_t, a_t) + (1 - \beta) \, \mathbb{E}_{s_{t+1}, r_t}\left[ r_t + \gamma \sum_{i=1}^{m} \tilde{\alpha}_i \hat{Q}_{\theta_i}(s_{t+1}, \pi_{\phi'}(s_{t+1}) + \epsilon) \right],   (19)

where \hat{Q}_{\theta_i}(s_t, a_t) = \min_{j=1,2} Q_{\theta_i^j}(s_t, a_t).

(a) Breakout (b) Enduro (c) Qbert (d) SpaceInvaders
(e) Ant-v2 (f) Hopper-v2 (g) Walker2d-v2 (h) HalfCheetah-v2
Figure 1: Learning curves of Dueling-DQN, TD3 and their RAA variants on discrete and continuous control tasks. The solid curves correspond to the mean and the shaded region to the standard deviation over several trials. Curves are smoothed uniformly for visual clarity.

4.3.2 Adaptive restart
The idea of restarting an algorithm is well known in the numerical analysis literature. Vanilla Anderson acceleration has shown substantial improvements when incorporating periodic restarts [30], where one periodically starts the acceleration scheme anew using only information from the most recent iteration. In this section, to alleviate the problem that deep RL is notoriously prone to being trapped in local optima, we propose an adaptive restart strategy for our RAA method.
During the training steps of actor-critic with RAA, periodic restart checks are enforced to clear the memory immediately before the iteration completely crashes. 
More explicitly, the iteration is restarted whenever the average squared residual of the current period exceeds that of the last period. A complete description of RAA-Dueling-DQN is summarized in Algorithm 1, and RAA-TD3 is given in Appendix B of the supplementary material.

5 Experiments
In this section, we present our experimental results and discuss their implications. We first give a detailed description of the environments (Atari 2600 and MuJoCo) used to evaluate our methods. Then, we report results on both discrete and continuous control tasks. Finally, we provide an ablative analysis of the proposed methodology. All default hyperparameters used in these experiments are listed in Appendix C of the supplementary material.

5.1 Experimental setup
Atari 2600. For discrete control tasks, we perform experiments in the Arcade Learning Environment. We select four games (Breakout, Enduro, Qbert and SpaceInvaders) varying in their difficulty of convergence. The agent receives 84 × 84 × 4 stacked grayscale images as inputs, as described in [1].
MuJoCo. For continuous control tasks, we conduct experiments in environments built on the MuJoCo physics engine. We select a number of control tasks to evaluate the performance of the proposed methodology and the baseline methods. In each task, the agent takes a vector of physical states as input and generates an action to manipulate the robots in the environment.

5.2 Comparative evaluation
To evaluate our RAA variants, we select Dueling-DQN and TD3 as the baselines for discrete and continuous control tasks, respectively. Please note that we do not select DDPG as the baseline for continuous control tasks, as DDPG performs poorly on difficult control tasks such as robotic manipulation. Figure 1 shows the total average return of evaluation rollouts during training for Dueling-DQN, TD3 and their RAA variants. 
We train five and seven different instances of each algorithm for Atari 2600 and MuJoCo, respectively. Each baseline and its corresponding RAA variant are trained with the same set of random seeds and evaluated every 10000 environment steps, where each evaluation reports the average return over ten different rollouts.
The results in Figure 1 show that, overall, the RAA variants outperform the corresponding baselines by a large margin on most tasks, such as HalfCheetah-v2, and perform comparably on the easier tasks, such as Enduro, in terms of learning speed, which indicates that RAA is a feasible method for making existing off-policy RL algorithms more sample efficient. 
In addition to the direct benefit of acceleration mentioned above, we also observe that our RAA variants demonstrate superior or comparable final performance to the baseline methods in all tasks. In fact, RAA-Dueling-DQN can be seen as a weighted variant of Averaged-DQN [32], which can effectively reduce the variance of approximation error in the target values and thus shows improved performance. In summary, our approach brings an improvement in both the learning speed and final performance.

5.3 Ablation studies
The results in the previous section suggest that our RAA method can improve the sample efficiency of existing off-policy RL algorithms. In this section, we further examine how sensitive our approach is to the scaling of regularization. We also perform ablation studies to understand the contribution of each individual component: progressive update and adaptive restart. Additionally, we analyze the impact of different numbers of previous estimates m and compare the behavior of our proposed RAA method over different learning rates.
Regularization scale. Our approach is sensitive to the regularization scale λ, because it controls the norm of the coefficient vector and reduces the impact of approximation error. According to the conclusions of Proposition 1, a larger regularization magnitude implies less impact of approximation error, but an overly large regularization will make the coefficients nearly identical and thus result in substantial degradation of acceleration performance. Figure 2 shows how learning performance changes on discrete control tasks when the regularization scale is varied, and the same conclusion can be drawn from Figure 2. For continuous control tasks, the same conclusion is difficult to obtain because the bias induced by sampling a minibatch dominates the function approximation errors.
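The role of λ can be made concrete with the regularized least-squares solve behind the RAA coefficients, following the regularized nonlinear acceleration formulation of [16]; this sketch omits the paper's progressive update and any damping:

```python
import numpy as np

def raa_coefficients(residuals, lam):
    """Solve  min_alpha ||R @ alpha||^2 + lam * ||alpha||^2  s.t.  sum(alpha) = 1,
    where the columns of R are the residuals of the m previous estimates.
    Closed form: alpha is proportional to (R^T R + lam * I)^{-1} 1."""
    R = np.column_stack(residuals)          # residual matrix, shape (d, m)
    m = R.shape[1]
    G = R.T @ R + lam * np.eye(m)           # regularized Gram matrix
    z = np.linalg.solve(G, np.ones(m))
    return z / z.sum()                      # normalize: coefficients sum to one
```

As λ grows, G is dominated by λI and the coefficients approach the uniform vector 1/m, i.e., plain averaging of the previous estimates; as λ → 0, the vanilla Anderson coefficients are recovered, which can have large norm when R is ill-conditioned.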
Additional learning curves on continuous control tasks can be found in Appendix D of the supplementary material.

(a) Breakout

(b) Enduro

(c) Qbert

(d) SpaceInvaders

Figure 2: Sensitivity of RAA-Dueling-DQN to the scaling of regularization on discrete control tasks.

Progressive update and adaptive restart. This experiment compares our proposed approach with: (i) RAA without the progressive update (no progressive); (ii) RAA without adaptive restart (no restart); (iii) RAA without either the progressive update or adaptive restart (no progressive and no restart). Figure 3 shows comparative learning curves on continuous control tasks. Although the significance of each component varies from task to task, we see that using the progressive update is essential for reducing the variance on all four tasks; a consistent conclusion can also be drawn from Figure 1. Moreover, adding adaptive restart marginally improves the performance. Additional results on discrete control tasks can be found in Appendix D of the supplementary material.

(a) Ant-v2

(b) Hopper-v2

(c) Walker2d-v2

(d) HalfCheetah-v2

Figure 3: Ablation analysis of RAA-TD3 (blue) over progressive update and adaptive restart.

The number of previous estimates m. In our experiments, the number of previous estimates m is set to 5. In fact, there is a tradeoff between performance and computational cost. Figure 4 shows the results of RAA-TD3 using different m (m = 1, 3, 5, 7, 9) on the Walker2d task. Overall, we can conclude that a larger m leads to faster convergence and better final performance, but the improvement becomes small when m exceeds a threshold. In practice, we suggest taking into account the available computing resources and sample efficiency when applying our proposed RAA method to other works.

Learning rate.
To compare the behavior of our proposed RAA method over different learning rates (lr), we perform additional experiments on the Walker2d task, and the results of TD3 and our RAA-TD3 are shown in Figure 5. Overall, the improvement of our method is consistent across all learning rates, although the performance of both TD3 and our RAA-TD3 degrades under non-optimal learning rates, and the improvement is more significant when the learning rate is smaller. This consistent improvement indicates that our proposed RAA method is effective and robust.

Figure 4: Learning curves of RAA-TD3 on Walker2d-v2 with different m.

Figure 5: Performance comparison on Walker2d-v2 with different learning rates.

6 Conclusion

In this paper, we presented a general acceleration method for existing deep reinforcement learning (RL) algorithms. The main idea is drawn from regularized Anderson acceleration (RAA), which is an effective approach to speeding up the solving of fixed point problems with perturbations. Our theoretical results explain that vanilla Anderson acceleration can be directly applied to policy iteration under a tabular setting. Furthermore, RAA is extended to model-free deep RL by introducing an additional regularization term. Two rigorous bounds on the coefficient vector demonstrate that the regularization term controls the norm of the coefficient vector produced by RAA and reduces the impact of perturbation induced by function approximation errors. Moreover, we verified that the proposed method can significantly accelerate off-policy deep RL algorithms such as Dueling-DQN and TD3. The ablation studies show that progressive update and adaptive restart strategies can enhance the performance.
For future work, combining Anderson acceleration or its variants with on-policy deep RL is an exciting avenue.

Acknowledgments

Gao Huang is supported in part by Beijing Academy of Artificial Intelligence (BAAI) under grant BAAI2019QN0106 and Tencent AI Lab Rhino-Bird Focused Research Program under grant JR201914. This research is supported by the National Science Foundation of China (NSFC) under grant 41427806.

References
[1] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski et al., “Human-level control through deep reinforcement learning,” Nature, vol. 518, no. 7540, p. 529, 2015.

[2] D. Silver, A. Huang, C. J. Maddison, A. Guez, L. Sifre, G. Van Den Driessche, J. Schrittwieser, I. Antonoglou, V. Panneershelvam, M. Lanctot et al., “Mastering the game of go with deep neural networks and tree search,” Nature, vol. 529, no. 7587, p. 484, 2016.

[3] W. Shi, S. Song, C. Wu, and C. P. Chen, “Multi pseudo q-learning-based deterministic policy gradient for tracking control of autonomous underwater vehicles,” IEEE Transactions on Neural Networks and Learning Systems, pp.
3534–3546, 2018.

[4] P. Mirowski, R. Pascanu, F. Viola, H. Soyer, A. J. Ballard, A. Banino, M. Denil, R. Goroshin, L. Sifre, K. Kavukcuoglu et al., “Learning to navigate in complex environments,” in Proceedings of the International Conference on Learning Representations, 2017.

[5] Z. Wang, T. Schaul, M. Hessel, H. Van Hasselt, M. Lanctot, and N. De Freitas, “Dueling network architectures for deep reinforcement learning,” in Proceedings of the 33rd International Conference on Machine Learning, vol. 48. PMLR, 2016, pp. 1995–2003.

[6] H. Van Hasselt, A. Guez, and D. Silver, “Deep reinforcement learning with double q-learning,” in Thirtieth AAAI Conference on Artificial Intelligence, 2016.

[7] T. P. Lillicrap, J. J. Hunt, A. Pritzel, N. Heess, T. Erez, Y. Tassa, D. Silver, and D. Wierstra, “Continuous control with deep reinforcement learning,” in Proceedings of the International Conference on Learning Representations, 2016.

[8] T. Haarnoja, A. Zhou, P. Abbeel, and S.
Levine, “Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor,” in Proceedings of the 35th International Conference on Machine Learning, vol. 80. PMLR, 2018, pp. 1861–1870.

[9] W. Shi, S. Song, and C. Wu, “Soft policy gradient method for maximum entropy deep reinforcement learning,” in Proceedings of the International Joint Conference on Artificial Intelligence, 2019, pp. 3425–3431.

[10] O. Nachum, S. S. Gu, H. Lee, and S. Levine, “Data-efficient hierarchical reinforcement learning,” in Advances in Neural Information Processing Systems, 2018, pp. 3303–3313.

[11] D. P. Bertsekas and J. N. Tsitsiklis, Neuro-dynamic programming. Athena Scientific Belmont, MA, 1996, vol. 5.

[12] A. Granas and J. Dugundji, Fixed point theory. Springer Science & Business Media, 2013.

[13] H. F. Walker and P. Ni, “Anderson acceleration for fixed-point iterations,” SIAM Journal on Numerical Analysis, vol. 49, no. 4, pp. 1715–1735, 2011.

[14] A. Toth and C. Kelley, “Convergence analysis for anderson acceleration,” SIAM Journal on Numerical Analysis, vol. 53, no. 2, pp. 805–819, 2015.

[15] M. Geist and B. Scherrer, “Anderson acceleration for reinforcement learning,” arXiv preprint arXiv:1809.09501, 2018.

[16] D. Scieur, A. d’Aspremont, and F. Bach, “Regularized nonlinear acceleration,” in Advances in Neural Information Processing Systems, 2016, pp. 712–720.

[17] S. Fujimoto, H. van Hoof, and D. Meger, “Addressing function approximation error in actor-critic methods,” in Proceedings of the 35th International Conference on Machine Learning, vol. 80. PMLR, 2018, pp. 1587–1596.

[18] E. Todorov, T. Erez, and Y.
Tassa, \u201cMujoco: A physics engine for model-based control,\u201d in 2012 IEEE/RSJ\n\nInternational Conference on Intelligent Robots and Systems, 2012, pp. 5026\u20135033.\n\n[19] E. Greensmith, P. L. Bartlett, and J. Baxter, \u201cVariance reduction techniques for gradient estimates in\nreinforcement learning,\u201d Journal of Machine Learning Research, vol. 5, no. Nov, pp. 1471\u20131530, 2004.\n[20] J. Schulman, P. Moritz, S. Levine, M. Jordan, and P. Abbeel, \u201cHigh-dimensional continuous control\nusing generalized advantage estimation,\u201d in Proceedings of the International Conference on Learning\nRepresentations, 2016.\n\n[21] M. Deisenroth and C. E. Rasmussen, \u201cPilco: A model-based and data-ef\ufb01cient approach to policy search,\u201d\n\nin Proceedings of the 28th International Conference on Machine Learning, 2011, pp. 465\u2013472.\n\n[22] G. Williams, N. Wagener, B. Goldfain, P. Drews, J. M. Rehg, B. Boots, and E. A. Theodorou, \u201cInformation\ntheoretic mpc for model-based reinforcement learning,\u201d in 2017 IEEE International Conference on Robotics\nand Automation, 2017, pp. 1714\u20131721.\n\n[23] J. Buckman, D. Hafner, G. Tucker, E. Brevdo, and H. Lee, \u201cSample-ef\ufb01cient reinforcement learning with\nstochastic ensemble value expansion,\u201d in Advances in Neural Information Processing Systems, 2018, pp.\n8224\u20138234.\n\n[24] S. Levine and P. Abbeel, \u201cLearning neural network policies with guided policy search under unknown\n\ndynamics,\u201d in Advances in Neural Information Processing Systems, 2014, pp. 1071\u20131079.\n\n[25] Y. Chebotar, M. Kalakrishnan, A. Yahya, A. Li, S. Schaal, and S. Levine, \u201cPath integral guided policy\n\nsearch,\u201d in 2017 IEEE International Conference on Robotics and Automation, 2017, pp. 3381\u20133388.\n\n[26] R. S. Sutton, \u201cLearning to predict by the methods of temporal differences,\u201d Machine learning, vol. 3, no. 1,\n\npp. 9\u201344, 1988.\n\n[27] L.-J. 
Lin, “Reinforcement learning for robots using neural networks,” Carnegie-Mellon Univ Pittsburgh PA School of Computer Science, Tech. Rep., 1993.

[28] Z. Wang, V. Bapst, N. Heess, V. Mnih, R. Munos, K. Kavukcuoglu, and N. de Freitas, “Sample efficient actor-critic with experience replay,” in Proceedings of the International Conference on Learning Representations, 2017.

[29] S. Gu, T. Lillicrap, Z. Ghahramani, R. E. Turner, and S. Levine, “Q-prop: Sample-efficient policy gradient with an off-policy critic,” in Proceedings of the International Conference on Learning Representations, 2017.

[30] N. C. Henderson and R. Varadhan, “Damped anderson acceleration with restarts and monotonicity control for accelerating em and em-like algorithms,” Journal of Computational and Graphical Statistics, pp. 1–42, 2019.

[31] G. Xie, Y. Wang, S. Zhou, and Z. Zhang, “Interpolatron: Interpolation or extrapolation schemes to accelerate optimization for deep neural networks,” arXiv preprint arXiv:1805.06753, 2018.

[32] O. Anschel, N. Baram, and N. Shimkin, “Averaged-dqn: Variance reduction and stabilization for deep reinforcement learning,” in Proceedings of the 34th International Conference on Machine Learning, vol. 70, 2017, pp. 176–185.