{"title": "Importance Resampling for Off-policy Prediction", "book": "Advances in Neural Information Processing Systems", "page_first": 1799, "page_last": 1809, "abstract": "Importance sampling (IS) is a common reweighting strategy for off-policy prediction in reinforcement learning. While it is consistent and unbiased, it can result in high variance updates to the weights for the value function. In this work, we explore a resampling strategy as an alternative to reweighting. We propose Importance Resampling (IR) for off-policy prediction, which resamples experience from a replay buffer and applies standard on-policy updates. The approach avoids using importance sampling ratios in the update, instead correcting the distribution before the update. We characterize the bias and consistency of IR, particularly compared to Weighted IS (WIS). We demonstrate in several microworlds that IR has improved sample efficiency and lower variance updates, as compared to IS and several variance-reduced IS strategies, including variants of WIS and V-trace which clips IS ratios. We also provide a demonstration showing IR improves over IS for learning a value function from images in a racing car simulator.", "full_text": "Importance Resampling for Off-policy Prediction\n\nMatthew Schlegel\nUniversity of Alberta\n\nmkschleg@ualberta.ca\n\nWesley Chung\n\nUniversity of Alberta\nwchung@ualberta.ca\n\nDaniel Graves\n\nHuawei\n\ndaniel.graves@huawei.com\n\nJian Qian\n\nUniversity of Alberta\njq1@ulberta.ca\n\nMartha White\n\nUniversity of Alberta\nwhitem@ulberta.ca\n\nAbstract\n\nImportance sampling (IS) is a common reweighting strategy for off-policy predic-\ntion in reinforcement learning. While it is consistent and unbiased, it can result\nin high variance updates to the weights for the value function. In this work, we\nexplore a resampling strategy as an alternative to reweighting. 
We propose Importance Resampling (IR) for off-policy prediction, which resamples experience from a replay buffer and applies standard on-policy updates. The approach avoids using importance sampling ratios in the update, instead correcting the distribution before the update. We characterize the bias and consistency of IR, particularly compared to Weighted IS (WIS). We demonstrate in several microworlds that IR has improved sample efficiency and lower variance updates, as compared to IS and several variance-reduced IS strategies, including variants of WIS and V-trace, which clips IS ratios. We also provide a demonstration showing IR improves over IS for learning a value function from images in a racing car simulator.

1 Introduction

An emerging direction for reinforcement learning systems is to learn many predictions, formalized as value function predictions contingent on many different policies. The idea is that such predictions can provide a powerful abstract model of the world. Some examples of systems that learn many value functions are the Horde architecture composed of General Value Functions (GVFs) [Sutton et al., 2011, Modayil et al., 2014], systems that use options [Sutton et al., 1999, Schaul et al., 2015a], predictive representation approaches [Sutton et al., 2005, Schaul and Ring, 2013, Silver et al., 2017] and systems with auxiliary tasks [Jaderberg et al., 2017]. Off-policy learning is critical for learning many value functions with different policies, because it enables data to be generated from one behavior policy to update the values for each target policy in parallel.

The typical strategy for off-policy learning is to reweight updates using importance sampling (IS). For a given state s, with action a selected according to behavior μ, the IS ratio is the ratio between the probability of the action under the target policy π and the behavior: π(a|s)/μ(a|s). 
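To make the reweighting concrete, here is a small self-contained sketch (our own illustration with made-up two-action policies, not from the paper) of estimating an expectation under π from actions sampled by μ:

```python
import numpy as np

# Hypothetical two-action policies; the large ratio pi/mu = 9 for action 1
# is what drives the high variance discussed in the text.
rng = np.random.default_rng(0)
mu = np.array([0.9, 0.1])   # behavior policy probabilities
pi = np.array([0.1, 0.9])   # target policy probabilities
f = np.array([0.0, 1.0])    # per-action quantity; E_pi[f] = 0.9

a = rng.choice(2, size=100_000, p=mu)  # actions sampled from the behavior
rho = pi[a] / mu[a]                    # IS ratios pi(a|s)/mu(a|s)
est = np.mean(rho * f[a])              # unbiased estimate of E_pi[f]
```

The estimate should land near 0.9, but each sampled term is either 0 or 9, so the per-sample variance is large; this reweighting variance is what motivates the resampling alternative below.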
The update is multiplied by this ratio, adjusting the action probabilities so that the expectation of the update is as if the actions were sampled according to the target policy π. Though the IS estimator is unbiased and consistent [Kahn and Marshall, 1953, Rubinstein and Kroese, 2016], it can suffer from high or even infinite variance due to large magnitude IS ratios, in theory [Andradottir et al., 1995] and in practice [Precup et al., 2001, Mahmood et al., 2014, 2017].

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

There have been some attempts to modify off-policy prediction algorithms to mitigate this variance.¹ Weighted IS (WIS) algorithms have been introduced [Precup et al., 2001, Mahmood et al., 2014, Mahmood and Sutton, 2015], which normalize each update by the sample average of the ratios. These algorithms improve learning over standard IS strategies, but are not straightforward to extend to nonlinear function approximation. In the offline setting, a reweighting scheme called importance sampling with unequal support [Thomas and Brunskill, 2017] was introduced to account for samples where the ratio is zero, in some cases significantly reducing variance. Another strategy is to rescale or truncate the IS ratios, as used by V-trace [Espeholt et al., 2018] for learning value functions and by Tree-Backup [Precup et al., 2000], Retrace [Munos et al., 2016] and ABQ [Mahmood et al., 2017] for learning action-values. Truncation of IS ratios in V-trace can incur significant bias, and this additional truncation parameter needs to be tuned.

An alternative to reweighting updates is to instead correct the distribution before updating the estimator, using weighted bootstrap sampling: resampling a new set of data from the previously generated samples [Smith et al., 1992, Arulampalam et al., 2002]. Consider a setting where a buffer of data is stored, generated by a behavior policy. 
Samples for policy π can be obtained by resampling from this buffer, proportionally to π(a|s)/μ(a|s) for state-action pairs (s, a) in the buffer. In the sampling literature, this strategy has been proposed under the name Sampling Importance Resampling (SIR) [Rubin, 1988, Smith et al., 1992, Gordon et al., 1993], and has been particularly successful for Sequential Monte Carlo sampling [Gordon et al., 1993, Skare et al., 2003]. Such resampling strategies have also been popular in classification, with over-sampling or under-sampling typically being preferred to weighted (cost-sensitive) updates [Lopez et al., 2013].

A resampling strategy has several potential benefits for off-policy prediction.² Resampling could have even larger benefits for learning approaches, as compared to averaging or numerical integration problems, because updates accumulate in the weight vector and change the optimization trajectory of the weights. For example, very large importance sampling ratios could destabilize the weights. This problem does not occur for resampling, as instead the same transition will be resampled multiple times, spreading out a large magnitude update across multiple updates. At the other extreme, with small ratios, IS will waste updates on transitions with very small IS ratios. By correcting the distribution before updating, standard on-policy updates can be applied. 
The magnitude of the updates varies less—because updates are not multiplied by very small or very large importance sampling ratios—potentially reducing the variance of stochastic updates and simplifying learning rate selection. We hypothesize that resampling (a) learns in fewer updates to the weights, because it focuses computation on samples that are likely under the target policy, and (b) is less sensitive to learning parameters and to target and behavior policy specification.

In this work, we investigate the use of resampling for online off-policy prediction with known, unchanging target and behavior policies. We first introduce Importance Resampling (IR), which samples transitions from a buffer of (recent) transitions according to IS ratios. These sampled transitions are then used for on-policy updates. We show that IR has the same bias as WIS, and that it can be made unbiased and consistent with the inclusion of a batch correction term—even under a sliding window buffer of experience. We provide additional theoretical results characterizing when we might expect the variance to be lower for IR than for IS. We then empirically investigate IR on three microworlds and a racing car simulator, learning from images, highlighting that (a) IR is less sensitive to learning rate than IS and V-trace (IS with clipping) and (b) IR converges more quickly in terms of the number of updates.

¹There is substantial literature on variance reduction for another area called off-policy policy evaluation, but which estimates only a single number or value for a policy (e.g., see [Thomas and Brunskill, 2016]). The resulting algorithms differ substantially, and are not appropriate for learning the value function.

²We explicitly use the term prediction rather than policy evaluation to make it clear that we are not learning value functions for control. Rather, our goal is to learn value functions solely for the sake of prediction.

2 Background

We consider the problem of learning General Value Functions (GVFs) [Sutton et al., 2011]. The agent interacts in an environment defined by a set of states S, a set of actions A and Markov transition dynamics, with probability P(s′|s, a) of transitioning to state s′ when taking action a in state s. A GVF is defined for policy π : S × A → [0, 1], cumulant c : S × A × S → ℝ and continuation function γ : S × A × S → [0, 1], with C_{t+1} := c(S_t, A_t, S_{t+1}) and γ_{t+1} := γ(S_t, A_t, S_{t+1}) for a (random) transition (S_t, A_t, S_{t+1}). The value for a state s ∈ S is

    V(s) := E_π[G_t | S_t = s]    where    G_t := C_{t+1} + γ_{t+1} C_{t+2} + γ_{t+1} γ_{t+2} C_{t+3} + ...

The operator E_π indicates an expectation with actions selected according to policy π. GVFs encompass standard value functions, where the cumulant is a reward. Otherwise, GVFs enable predictions about discounted sums of other signals into the future, when following a target policy π. These values are typically estimated using parametric function approximation, with weights θ ∈ ℝ^d defining approximate values V_θ(s).

In off-policy learning, transitions are sampled according to a behavior policy, rather than the target policy. To get an unbiased sample of an update to the weights, the action probabilities need to be adjusted. Consider on-policy temporal difference (TD) learning, with update α_t δ_t ∇_θ V_θ(s) for a given S_t = s, for learning rate α_t ∈ ℝ⁺ and TD-error δ_t := C_{t+1} + γ_{t+1} V_θ(S_{t+1}) − V_θ(s). If actions are instead sampled according to a behavior policy μ : S × A → [0, 1], then we can use importance sampling (IS) to modify the update, giving the off-policy TD update α_t ρ_t δ_t ∇_θ V_θ(s) for IS ratio ρ_t := π(A_t|S_t)/μ(A_t|S_t). Given state S_t = s, if μ(a|s) > 0 whenever π(a|s) > 0, then the expected values of these two updates are equal. To see why, notice that

    E_μ[α_t ρ_t δ_t ∇_θ V_θ(s) | S_t = s] = α_t ∇_θ V_θ(s) E_μ[ρ_t δ_t | S_t = s],

which equals E_π[α_t δ_t ∇_θ V_θ(s) | S_t = s] because

    E_μ[ρ_t δ_t | S_t = s] = Σ_{a∈A} μ(a|s) (π(a|s)/μ(a|s)) E[δ_t | S_t = s, A_t = a] = E_π[δ_t | S_t = s].

Though unbiased, IS can be high-variance. A lower variance alternative is Weighted IS (WIS). For a batch consisting of transitions {(s_i, a_i, s_{i+1}, c_{i+1}, ρ_i)}_{i=1}^n, batch WIS uses a normalized estimate for the update. For example, an offline batch WIS TD algorithm, denoted WIS-Optimal below, would use update α_t (ρ_t / Σ_{i=1}^n ρ_i) δ_t ∇_θ V_θ(s). Obtaining an efficient WIS update is not straightforward, however, when learning online, and has resulted in algorithms in the SGD setting (i.e., n = 1) specialized to tabular [Precup et al., 2001] and linear [Mahmood et al., 2014, Mahmood and Sutton, 2015] functions. We nonetheless use WIS as a baseline in the experiments and theory.

3 Importance Resampling

In this section, we introduce Importance Resampling (IR) for off-policy prediction and characterize its bias and variance. A resampling strategy requires a buffer of samples, from which we can resample. Replaying experience from a buffer was introduced as a biologically plausible way to reuse old experience [Lin, 1992, 1993], and has become common for improving sample efficiency, particularly for control [Mnih et al., 2015, Schaul et al., 2015b]. In the simplest case—which we assume here—the buffer is a sliding window of the most recent n samples, {(s_i, a_i, s_{i+1}, c_{i+1}, ρ_i)}_{i=t−n}^t, at time step t > n. We assume samples are generated by taking actions according to behavior μ. 
The transitions are generated with probability d_μ(s) μ(a|s) P(s′|s, a), where d_μ : S → [0, 1] is the stationary distribution for policy μ. The goal is to obtain samples according to d_μ(s) π(a|s) P(s′|s, a), as if we had taken actions according to policy π from states³ s ∼ d_μ.

The IR algorithm is simple: resample a mini-batch of size k on each step t from the buffer of size n, proportionally to ρ_i in the buffer. Using the resampled mini-batch, we can update our value function using standard on-policy approaches, such as on-policy TD or on-policy gradient TD. The key difference to IS and WIS is that the distribution itself is corrected before the update, whereas IS and WIS correct the update itself. This small difference, however, can have larger ramifications practically, as we show in this paper.

³The assumption that states are sampled from d_μ underlies most off-policy learning algorithms. Only a few attempt to adjust the probabilities d_μ to d_π, either by multiplying IS ratios before a transition [Precup et al., 2001] or by directly estimating state distributions [Hallak and Mannor, 2017, Liu et al., 2018]. In this work, we focus on using resampling to correct the action distribution—the standard setting. We expect, however, that some insights will extend to how to use resampling to correct the state distribution, particularly because wherever IS ratios are used it should be straightforward to use our resampling approach.

We consider two variants of IR: with and without bias correction. For point i_j sampled from the buffer, let δ_{i_j} be the on-policy update for that transition; for example, for TD, δ_{i_j} = δ_{i_j} ∇_θ V_θ(s_{i_j}), the TD-error for that transition times the gradient of the value function. The first step for either variant is to sample a mini-batch of size k from the buffer, proportionally to ρ_i. 
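The resampling step just described can be sketched in a few lines. The following is a minimal illustration under our own naming (not the authors' implementation), assuming the ratio ρ_i is stored with each transition:

```python
import numpy as np

def ir_sample(buffer, rhos, k, rng):
    """Sample k transitions from an n-sized buffer proportionally to IS ratios.

    The returned transitions are then used with *standard on-policy* updates;
    no ratio multiplies the update itself.
    """
    probs = rhos / rhos.sum()
    idx = rng.choice(len(buffer), size=k, p=probs)  # sampling with replacement
    return [buffer[i] for i in idx]

# Toy buffer: transition "s0" has ratio 0 (the target policy never takes that
# action), while "s4" has a large ratio and so is drawn often.
rng = np.random.default_rng(0)
buffer = [("s0",), ("s1",), ("s2",), ("s3",), ("s4",)]
rhos = np.array([0.0, 0.1, 0.1, 0.1, 4.0])
batch = ir_sample(buffer, rhos, k=3, rng=rng)
```

A large-ratio transition is resampled repeatedly, spreading what would be one large reweighted update across several ordinary ones; BC-IR, described next, would additionally scale the resulting mini-batch update by the buffer's average ratio.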
Bias-Corrected IR (BC-IR) additionally pre-multiplies by the average ratio in the buffer, ρ̄ := (1/n) Σ_{i=1}^n ρ_i, giving the following estimators for the update direction:

    X_IR := (1/k) Σ_{j=1}^k δ_{i_j}        X_BC := (ρ̄/k) Σ_{j=1}^k δ_{i_j}

BC-IR negates bias introduced by the average ratio in the buffer deviating significantly from the true mean. For reasonably large buffers, ρ̄ will be close to 1, making IR and BC-IR have near-identical updates.⁴ Nonetheless, they do have different theoretical properties, particularly for small buffer sizes n, so we characterize both.

Across most results, we make the following assumption.

Assumption 1. A buffer B_t = {X_{t+1}, ..., X_{t+n}} is constructed from the most recent n transitions sampled by time t + n, which are generated sequentially from an irreducible, finite MDP with a fixed policy μ.

To denote expectations under p(x) = d_μ(s) μ(a|s) P(s′|s, a) and q(x) = d_μ(s) π(a|s) P(s′|s, a), we overload the notation from above, using operators E_μ and E_π respectively. To reduce clutter, we write E to mean E_μ, because most expectations are under the sampling distribution. All proofs can be found in Appendix B.

3.1 Bias of IR

We first show that IR is biased, and that its bias is actually equal to that of WIS-Optimal, in Theorem 3.1.

Theorem 3.1 (Bias for a fixed buffer of size n). Assume a buffer B of n transitions sampled i.i.d. according to p(x = (s, a, s′)) = d_μ(s) μ(a|s) P(s′|s, a). Let X_WIS* := Σ_{i=1}^n (ρ_i / Σ_{j=1}^n ρ_j) δ_i be the WIS-Optimal estimator of the update. Then

    E[X_IR] = E[X_WIS*],

and so the bias of X_IR is proportional to

    Bias(X_IR) = E[X_IR] − E_π[δ] ∝ (1/n) (E_π[δ] σ²_ρ − σ_{ρ,δ})        (1)

where E_π[δ] is the expected update across all transitions, with actions from S taken by the target policy π; σ²_ρ = Var((1/n) Σ_{j=1}^n ρ_j); σ²_δ = Var((1/n) Σ_{i=1}^n ρ_i δ_i); and the covariance is σ_{ρ,δ} = Cov((1/n) Σ_{j=1}^n ρ_j, (1/n) Σ_{i=1}^n ρ_i δ_i).

Theorem 3.1 is the only result which follows a different set of assumptions, primarily due to using the bias characterization of X_WIS* found in Owen [2013]. The bias of IR will be small for reasonably large n, both because it is proportional to 1/n and because larger n will result in lower variance of the average ratios and average update for the buffer in Equation (1). In particular, as n grows, these variances decay proportionally to n. Nonetheless, for smaller buffers, such bias could have an impact. We can, however, easily mitigate this bias with a bias-correction term, as shown in the next corollary and proven in Appendix B.2.

Corollary 3.1.1. BC-IR is unbiased: E[X_BC] = E_π[δ].

3.2 Consistency of IR

Consistency of IR in terms of an increasing buffer, with n → ∞, is a relatively straightforward extension of prior results for SIR, with or without the bias correction, and from the derived bias of both estimators (see Theorem B.1 in Appendix B.3). More interesting, and reflective of practice, is consistency with a fixed length buffer and increasing interactions with the environment, t → ∞. IR, without bias correction, is asymptotically biased in this case; in fact, its asymptotic bias is the one characterized above for a fixed length buffer in Theorem 3.1. 
BC-IR, on the other hand, is consistent, even with a sliding window, as we show in the following theorem.

⁴ρ̄ ≈ E[ρ] = E[π(a|s)/μ(a|s)] = Σ_{s,a} (π(a|s)/μ(a|s)) μ(a|s) d_μ(s) = 1.

Theorem 3.2. Let B_t = {X_{t+1}, ..., X_{t+n}} be the buffer of the most recent n transitions sampled according to Assumption 1. Let X_BC^(t) be the BC-IR estimator computed from buffer B_t, and define the sliding-window estimator X̄_T := (1/T) Σ_{t=1}^T X_BC^(t). If E_π[‖δ‖] < ∞, then X̄_T converges to E_π[δ] almost surely as T → ∞.

3.3 Variance of Updates

It might seem that resampling avoids high variance in updates, because it does not reweight with large magnitude IS ratios. The notion of effective sample size from statistics, however, provides some intuition about why large magnitude IS ratios can also negatively affect IR, not just IS. The effective sample size is between 1 and n, with one estimator being (Σ_{i=1}^n ρ_i)² / Σ_{i=1}^n ρ_i² [Kong et al., 1994, Martino et al., 2017]. When the effective sample size is low, this indicates that most of the probability is concentrated on a few samples. For high magnitude ratios, IR will repeatedly sample the same transitions, and potentially never sample some of the transitions with small IS ratios.

Fortunately, we find that, despite this dependence on effective sample size, IR can significantly reduce variance over IS. In this section, we characterize the variance of the BC-IR estimator. We choose this variant of IR because it is unbiased, and so characterizing its variance is a fairer comparison to IS. We define the mini-batch IS estimator X_IS := (1/k) Σ_{j=1}^k ρ_{z_j} δ_{z_j}, where the indices z_j are sampled uniformly from {1, ..., n}. This contrasts with the indices i_1, ..., i_k for X_BC, which are sampled proportionally to ρ_i.

We begin by characterizing the variance under a fixed dataset B. For convenience, let μ_B := E_π[δ | B]. We characterize the sum of the variances of each component in the update estimator, which equivalently corresponds to the normed deviation of the update from its mean,

    V(X | B) := tr Cov(X | B) = Σ_{m=1}^d Var(X_m | B) = E[‖X − μ_B‖₂² | B],

for an unbiased stochastic update X ∈ ℝ^d. We show two theorems establishing that BC-IR has lower variance than IS, under two different conditions on the norm of the update. We first start with more general conditions, and then provide a theorem for conditions that are likely only true in early learning.

Theorem 3.3. Assume that, for a given buffer B, ‖δ_j‖₂² > c/ρ_j for samples where ρ_j ≥ ρ̄, and that ‖δ_j‖₂² < c/ρ_j for samples where ρ_j < ρ̄, for some c > 0. Then the BC-IR estimator has lower variance than the IS estimator: V(X_BC | B) < V(X_IS | B).

The conditions in Theorem 3.3 preclude having update norms for samples with small ρ be quite large—larger than a number ∝ 1/ρ—and a small norm for samples with large ρ. These conditions can be relaxed to a statement on average, where the cumulative weighted magnitude of the update norm for samples with ρ below the median needs to be smaller than for samples with ρ above the mean (see the proof in Appendix B.5).

We next consider a setting where the magnitude of the update is independent of the given state and action. We expect this condition to hold in early learning, where the weights are randomly initialized, and thus randomly incorrect across the state-action space. As learning progresses, and value estimates become more accurate in some states, it is unlikely for this condition to hold.

Theorem 3.4. Assume ρ and the magnitude of the update ‖δ‖₂² are independent: E[ρ_j ‖δ_j‖₂² | B] = E[ρ_j | B] E[‖δ_j‖₂² | B]. Then the BC-IR estimator will have equal or lower variance than the IS estimator: V(X_BC | B) ≤ V(X_IS | B).

These results have focused on the variance of each estimator for a fixed buffer, which provides insight into the variance of updates when executing the algorithms. We would, however, also like to characterize variability across buffers, especially for smaller buffers. Fortunately, such a characterization is a simple extension of the above results, because variability for a given buffer already demonstrates variability due to different samples. It is easy to check that E[E[X_BC | B]] = E[E[X_IS | B]] = E_π[δ]. The variances can be written using the law of total variance:

    V(X_BC) = E[V(X_BC | B)] + V(E[X_BC | B]) = E[V(X_BC | B)] + V(μ_B)
    V(X_IS) = E[V(X_IS | B)] + V(μ_B)
    ⟹ V(X_BC) − V(X_IS) = E[V(X_BC | B) − V(X_IS | B)],

with the expectation across buffers. Therefore, the analysis of V(X_BC | B) directly applies.

4 Empirical Results

We investigate the two hypothesized benefits of resampling as compared to reweighting: improved sample efficiency and reduced variance. These benefits are tested in two microworld domains—a Markov chain and the Four Rooms domain—where exhaustive experiments can be conducted. We also provide a demonstration that IR reduces sensitivity over IS and V-trace in a car simulator, TORCs, when learning from images.⁵

We compare IR and BC-IR against several reweighting strategies, including importance sampling (IS); two online approaches to weighted importance sampling, WIS-Minibatch with weighting ρ_i / Σ_{j=1}^k ρ_j and WIS-Buffer with weighting ρ_i / ((k/n) Σ_{j=1}^n ρ_j); and V-trace⁶, which corresponds to clipping importance weights [Espeholt et al., 2018]. 
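As a rough reference (our notation, not the authors' code), the per-sample weights that these reweighting baselines apply to otherwise-standard updates can be written as:

```python
import numpy as np

def baseline_weights(rhos_batch, rhos_buffer, method, clip=1.0):
    """Per-sample weights applied to on-policy updates for a mini-batch.

    rhos_batch:  IS ratios for the k sampled transitions
    rhos_buffer: IS ratios for all n transitions in the buffer
    """
    if method == "IS":
        return rhos_batch
    if method == "WIS-Minibatch":   # normalize by the mini-batch sum
        return rhos_batch / rhos_batch.sum()
    if method == "WIS-Buffer":      # normalize by (k/n) times the buffer sum
        k, n = len(rhos_batch), len(rhos_buffer)
        return rhos_batch / (k / n * rhos_buffer.sum())
    if method == "V-trace":         # clip ratios at a threshold
        return np.minimum(rhos_batch, clip)
    raise ValueError(method)

rhos_batch = np.array([0.1, 2.0, 9.0])
rhos_buffer = np.concatenate([rhos_batch, np.ones(7)])
w = baseline_weights(rhos_batch, rhos_buffer, "V-trace", clip=1.0)  # -> [0.1, 1.0, 1.0]
```

Note how only clipping bounds the largest weight, at the cost of bias, while the WIS variants rescale all weights by a common normalizer.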
We also compare to WIS-TD(0) [Mahmood and Sutton, 2015], when applicable, which uses an online approximation to WIS with a stepsize selection strategy (as described in Appendix A.2). This algorithm uses only one sample at a time, rather than a mini-batch, and so is only included in Figure 2. Where appropriate, we also include baselines using On-policy sampling; WIS-Optimal, which uses the whole buffer for each update; and Sarsa(0), which learns action-values—and so does not require IS ratios—and then produces the estimate V(s) = Σ_a π(a|s) Q(s, a). WIS-Optimal is included as an optimal baseline, rather than as a competitor, as it estimates the update using the whole buffer on every step.

In all the experiments, the data is generated off-policy. We compute the absolute value error (AVE) or the absolute return error (ARE) on every step. For the sensitivity plots we take the average over all the interactions, as specified for the environment, resulting in MAVE and MARE respectively. The error bars represent the standard error over runs, and are featured on every plot, although not visible in some instances. For the microworlds, the true value function is found using dynamic programming with threshold 10⁻¹⁵, and we compute AVE over all the states. For TORCs and continuous Four Rooms, the true value function is approximated using rollouts from a random subset of states generated when running the behavior policy μ, and the ARE is computed over this subset. For the TORCs domain, the same subset of states is used for each run due to computational constraints, and we report the mean squared return error (MSRE). Plots showing sensitivity over the number of updates show results for complete experiments with updates evenly spread over all the interactions. 
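For concreteness, the aggregate metric can be sketched as follows (our own helper, assuming the tabular case where AVE is computed over all states):

```python
import numpy as np

def mave(value_estimates, true_values):
    """Mean absolute value error over a run.

    value_estimates: (steps, states) array of learned values V_theta over training
    true_values:     (states,) array of true values from dynamic programming
    """
    ave_per_step = np.abs(value_estimates - true_values).mean(axis=1)  # AVE each step
    return ave_per_step.mean()                                         # average over the run

est = np.array([[0.0, 0.0],
                [0.5, 1.0]])
truth = np.array([1.0, 1.0])
print(mave(est, truth))  # mean of [1.0, 0.25] -> 0.625
```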
A tabular representation is used in the microworld experiments, tile-coded features with 64 tilings and 8 tiles are used in continuous Four Rooms, and a convolutional neural network is used for TORCs, with an architecture previously defined for self-driving cars [Bojarski et al., 2016].

4.1 Investigating Convergence Rate

We first investigate the convergence rate of IR. We report learning curves in Four Rooms, as well as sensitivity to the learning rate. The Four Rooms domain [Stolle and Precup, 2002] has four rooms in an 11x11 grid world. The four rooms are positioned in a grid pattern, with each room having two adjacent rooms. Adjacent rooms are separated by a wall with a single connecting hallway. The target policy takes the down action deterministically. The cumulant for the value function is 1 when the agent hits a wall and 0 otherwise. The continuation function is γ = 0.9, with termination when the agent hits a wall. The resulting value function can be thought of as the distance to the bottom wall. The behavior policy is uniform random everywhere except for 25 randomly selected states, which take the action down with probability 0.05, with the remaining probability split equally amongst the other actions. This choice of behavior and target policy induces high magnitude IS ratios.

As shown in Figure 1, IR has noticeable improvements over the reweighting strategies tested. The fact that IR resamples more important transitions from the replay buffer seems to significantly increase the learning speed. Further, IR has a wider range of usable learning rates. The same effect is seen even as we reduce the total number of updates, where the uniform sampling methods perform significantly worse as the number of interactions between updates increases—suggesting improved sample efficiency. WIS-Buffer performs almost equivalently to IS because, for reasonably sized buffers, its normalization factor (1/n) Σ_{j=1}^n ρ_j ≈ 1, since E[ρ] = 1. WIS-Minibatch and V-trace both reduce the variance significantly, with their bias having only a limited impact on the final performance compared to IS. Even the most aggressive clipping parameter for V-trace—a clipping of 1.0—outperforms IS. The bias may have limited impact because the target policy is deterministic, and so only updates for exactly one action in a state. Sarsa—which is the same as Retrace(0)—performs similarly to the reweighting strategies.

⁵Experimental code for every domain except TORCs can be found at https://mkschleg.github.io/Resampling.jl

⁶Retrace, ABQ and TreeBackup also use clipping to reduce variance. But they are designed for learning action-values and for mitigating variance in eligibility traces. When the trace parameter λ = 0—as we assume here—there are no IS ratios, and these methods become equivalent to using Sarsa(0) for learning action-values.

Figure 1: Four Rooms experiments (n = 2500, k = 16, 25 runs): left Learning curves for each method, with updates every 16 steps. IR and WIS-Optimal are overlapping. center Sensitivity over the number of interactions between updates. right Learning rate sensitivity plot. [Plot residue omitted; methods shown: (BC)IR, IS, WIS-Minibatch, WIS-Buffer, WIS-Optimal, Sarsa, and V-trace with Clip = 1.0, 0.5*max, 0.9*max.]

The above results highlight the convergence rate improvements from IR, in terms of the number of updates, without generalization across values. 
Conclusions might actually be different with function approximation, when updates for one state can be informative for others. For example, even if in one state the target policy differs significantly from the behavior policy, if the two are similar in a related state, generalization could overcome effective sample size issues. We therefore further investigate whether the above phenomena arise under function approximation with RMSProp learning rate selection.

Figure 2: Convergence rates in Continuous Four Rooms, averaged over 25 runs with 100000 interactions with the environment. left uniform random behavior policy and a target policy which takes the down action with probability 0.9 and probability 0.1/3 for all other actions; learning used incremental updates (as specified in Appendix A.2). right uniform random behavior policy and a target policy with persistent down action selection, learned with mini-batch updates with RMSProp. [Plot residue omitted; methods shown include IR, IS, WIS-TD(0), IR+WIS-TD(0), WIS-Minibatch, and V-trace.]

We conduct two experiments similar to the above, in a continuous-state Four Rooms variant. The agent is a circle with radius 0.1, and the state consists of a continuous tuple containing the x and y coordinates of the agent's center point. The agent takes an action in one of the 4 cardinal directions, moving 0.5 ± U(0.0, 0.1) in that direction, with random drift in the orthogonal direction sampled from N(0.0, 0.01). The representation is a tile-coded feature vector with 64 tilings and 8 tiles. 
We\nprovide results for both mini-batch updating (as above) and incremental updating (i.e. updating\non each transition of a mini-batch incrementally, see appendix A.2 for details). For the mini-batch\nexperiment, the target policy deterministically takes the down action. For the incremental experi-\nment, the target policy takes the down action with probability 0.9 and selects all other action with\nprobability 0.1/3.\nWe \ufb01nd that generalization can mitigate some of the differences between IR and IS above in some\nsettings, but in others the difference remains just as stark (see Figure 2 and Appendix C.2). If we\nuse the behavior policy from the tabular domain, which skews the behavior in a sparse set of states,\nthe nearby states mitigate this skew. However, if we use a behavior policy that selects all actions\n\n7\n\n\f(BC)IR\n\nIS\n\nMAVE\n\n1.0\n0.8\n0.6\n0.4\n0.2\n0.0\n\n0\n\n2\n\n4\n\n6\n\nLearning Rate\n\n1.0\n0.8\n0.6\n0.4\n0.2\n0.0\n\n1.25\nVar \n0.0125\n(10-2)\n0.0100\n0.75\n0.0075\n0.50\n0.0050\n0.25\n0.0025\n0.00\n0.0000\n10\n\nWIS-Minibatch\n\nClip = 1.0\n\nWIS-Bu\ufb00er\nClip = 0.5*max\n\nWIS-Optimal\nClip = 0.9*max\n\nSarsa\n\nOnPolicy\n\n8\n\n10\n\n0\n\n2\n\n4\n\n6\n\nLearning Rate\n\n8\n\nLess\n0\n\nLearning\n\n50\n\n100\n\n150\n\n200\n\nLearning Progress\n\nMore\n250\n\nLearning\n\nFigure 4: Learning Rate sensitivity plots in the Random Walk Markov Chain, with buffer size n =\n15000 and mini-batch size k = 16. Averaged over 100 runs. The policies, written as [probability\nleft, probability right] are \u00b5 = [0.9, 0.1],\u21e1 = [0.1, 0.9] left learning rate sensitivity plot for all\nmethods but V-trace. center learning rate sensitivity for V-trace with various clipping parameters\nright Variance study for IS, IR, and WISBatch. The x-axis corresponds to the training iteration,\nwith variance reported for the weights at that iteration generated by WIS-Optimal. 
These plots show a correlation between sensitivity to the learning rate and the magnitude of the variance.

uniformly, then again IR obtains noticeable gains over IS and V-trace in reducing the required number of updates, as shown in Figure 2.
We find similar results for the incremental setting, shown in Figure 2 (left), where resampling still outperforms all other methods in terms of convergence rate. Given WIS-TD(0)'s significant degradation in performance as the number of updates decreases, we also compare with using WIS-TD(0) when sampling according to resampling (IR+WIS-TD(0)). Interestingly, this method outperforms all others, albeit only slightly against IR with a constant learning rate. This result leads us to believe RMSProp may be a poor optimizer choice for this setting. Expanded results can be found in Appendix C.2.

4.2 Investigating Variance

To better investigate the update variance, we use a Markov chain where we can more easily control the dissimilarity between μ and π, and so control the magnitude of the IS ratios. The Markov chain is composed of 8 non-terminating states and 2 terminating states on the ends of the chain, with a cumulant of 1 on the transition to the right-most terminal state and 0 everywhere else. We consider policies with probabilities [left, right] equal in all states: μ = [0.9, 0.1], π = [0.1, 0.9]; further policy settings can be found in Appendix C.1.
We first measure the variance of the updates for fixed buffers. We compute the variance of the update (from a given weight vector) by simulating the many possible updates that could have occurred. We are interested in the variance of updates both for early learning, when the weight vector is quite incorrect and updates are larger, and for later learning. To obtain a sequence of such weight vectors, we use the sequence of weights generated by WIS-Optimal.
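The variance measurement described above can be sketched as follows: fill a buffer with transitions from the chain under μ, fix a weight vector, and compare the distribution of mini-batch TD(0) updates under IS (reweighting each TD error by ρ) against IR (pre-sampling transitions from the buffer with probability proportional to ρ, then applying an uncorrected update). The buffer size, batch size, restart state, discount of 1, and the random weight vector are illustrative assumptions, not the paper's exact setup.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 10                            # states 0..9; 0 and 9 are terminal
MU_RIGHT, PI_RIGHT = 0.1, 0.9     # mu = [0.9 left, 0.1 right], pi = [0.1, 0.9]

def collect_buffer(n_transitions):
    """Run the behavior policy mu and record (s, s', cumulant, rho) tuples."""
    buf, s = [], 4                           # assumed restart near the middle
    while len(buf) < n_transitions:
        right = rng.random() < MU_RIGHT
        rho = PI_RIGHT / MU_RIGHT if right else (1 - PI_RIGHT) / (1 - MU_RIGHT)
        s2 = s + 1 if right else s - 1
        c = 1.0 if s2 == N - 1 else 0.0      # cumulant on entering the right terminal
        buf.append((s, s2, c, rho))
        s = 4 if s2 in (0, N - 1) else s2
    return buf

buf = collect_buffer(5000)
rho = np.array([t[3] for t in buf])
s_idx = np.array([t[0] for t in buf])

# A fixed weight vector standing in for some point along training (here: random).
v = rng.uniform(0.0, 1.0, N)
v[0] = v[N - 1] = 0.0                        # terminal values are zero
delta = np.array([c + v[s2] - v[s] for (s, s2, c, _) in buf])  # TD(0) errors, gamma = 1

def is_update(idx):
    """Mini-batch IS update direction: average rho * delta onto each state."""
    g = np.zeros(N)
    np.add.at(g, s_idx[idx], rho[idx] * delta[idx])
    return g / len(idx)

def ir_update(idx):
    """Mini-batch IR update: indices are PRE-sampled proportional to rho, so the
    update itself is an uncorrected on-policy-style average; the bias-corrected
    variant (BC-IR) rescales the result by mean(rho) over the buffer."""
    g = np.zeros(N)
    np.add.at(g, s_idx[idx], delta[idx])
    return g / len(idx)

p = rho / rho.sum()
k, B = 16, 4000
is_batches = np.stack([is_update(rng.integers(0, len(buf), k)) for _ in range(B)])
ir_batches = np.stack([ir_update(rng.choice(len(buf), k, p=p)) for _ in range(B)])
bc = rho.mean()                              # BC-IR correction factor

print("update variance, IS:", is_batches.var(axis=0).sum())
print("update variance, IR:", ir_batches.var(axis=0).sum())
```

Because the IS batch mean and the BC-corrected IR batch mean share the same expectation over the fixed buffer, the two estimators differ only in variance, which is exactly the quantity plotted in Figure 4 (right).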
As shown in Figure 4, the variance of IR is lower than that of IS, particularly in early learning, where the difference is stark. Once the weight vector has largely converged, the variances of IR and IS are comparable and near zero.

We can also evaluate the update variance by proxy using learning rate sensitivity curves. As seen in Figure 4 (left) and (center), IR has the lowest sensitivity to the learning rate, on par with on-policy sampling. IS has the highest sensitivity, along with WIS-Buffer and WIS-Minibatch. Various clipping parameters for V-trace are also tested. V-trace does provide some level of variance reduction, but incurs more bias as the clipping becomes more aggressive.

Figure 3: Learning rate sensitivity in TORCs, averaged over 10 runs. V-trace has clipping parameter 1.0. All the methods performed worse with a higher learning rate than shown here, so we restrict to this range.

4.3 Demonstration on a Car Simulator

We use the TORCs racing car simulator to perform scaling experiments with neural networks to compare IR, IS, and V-trace. The simulator produces 64x128 cropped grayscale images. We use an underlying deterministic steering controller that produces steering actions a_det ∈ [−1, +1], and take an action with probability defined by a Gaussian: a ∼ N(a_det, 0.1). The target policy is a Gaussian N(0.15, 0.0075), which corresponds to steering left. Pseudo-termination (i.e., γ = 0) occurs when the car nears the center of the road, and the cumulant becomes 1. Otherwise, the cumulant is zero and γ = 0.9. The policies are specified using continuous action distributions, which results in IS ratios as high as ≈ 1000 and highly variable updates for IS.
Again, we can see that IR provides benefits over IS and V-trace, in Figure 3.
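To see why the continuous-action setting produces such large ratios, consider the density ratio of the two Gaussian policies directly. The sketch below treats the paper's N(0.15, 0.0075) as mean and standard deviation; if 0.0075 is instead the variance, the ratios are smaller but still large. The particular a_det and action values are illustrative choices.

```python
import math

def gaussian_pdf(x, mean, std):
    """Density of a univariate normal distribution."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))

def is_ratio(action, a_det, target_mean=0.15, target_std=0.0075, behavior_std=0.1):
    """IS ratio pi(a) / mu(a) for the TORCs policies: behavior N(a_det, 0.1),
    target N(0.15, 0.0075), treating 0.0075 as a standard deviation
    (an assumption about the paper's notation)."""
    return gaussian_pdf(action, target_mean, target_std) / gaussian_pdf(action, a_det, behavior_std)

# When the deterministic controller steers away from the target mean, actions
# near 0.15 receive very large ratios:
print(is_ratio(0.15, a_det=-0.1))   # behavior centered 2.5 std-devs away; ratio in the hundreds
```

With a narrow target density, the ratio already exceeds 10 even when the behavior is centered exactly on the target mean, and grows exponentially as a_det moves away, which is why clipping (V-trace) or resampling (IR) matters so much here.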
There is even more generalization from the neural network in this domain than in Four Rooms, where we found generalization did reduce some of the differences between IR and IS. Yet IR still obtains the best performance, and avoids some of the variance seen in IS for two of the learning rates. Additionally, BC-IR actually performs differently here, having worse performance for the largest learning rate. This suggests IR has an effect in reducing variance.

5 Conclusion
In this paper we introduced a new approach to off-policy learning: Importance Resampling. We showed that IR is consistent, and that its bias is the same as that of Weighted Importance Sampling. We also provided an unbiased variant of IR, called Bias-Corrected IR. We empirically showed that (a) IR has lower learning rate sensitivity than IS and V-trace, which is IS with varying clipping thresholds; (b) the variance of updates for IR is much lower in early learning than for IS; and (c) IR converges faster than IS and other competitors in terms of the number of updates. These results confirm the theory presented for IR, which states that the variance of updates for IR is lower than for IS in two settings, one being an early-learning setting. Such lower variance also explains why IR can converge faster in terms of the number of updates, for a given buffer of data.
The algorithm and results in this paper suggest new directions for off-policy prediction, particularly for faster convergence. Resampling is promising for scaling to learning many value functions in parallel, because many fewer updates can be made for each value function. A natural next step is a demonstration of IR in such a parallel prediction system. Resampling from a buffer also opens up questions about how to further focus updates. One such option is using an intermediate sampling policy.
Another option is including prioritization based on error, such as was done for control with prioritized sweeping [Peng and Williams, 1993] and prioritized replay [Schaul et al., 2015b].

Acknowledgments
We would like to thank Huawei for their support, and especially for allowing a portion of this work to be completed during Matthew's internship in the summer of 2018. We also would like to acknowledge the University of Alberta, the Alberta Machine Intelligence Institute, IVADO, and NSERC for their continued funding and support, as well as Compute Canada (www.computecanada.ca) for the computing resources used for this work.

References
Sigrun Andradottir, Daniel P Heyman, and Teunis J Ott. On the Choice of Alternative Measures in Importance Sampling with Markov Chains. Operations Research, 1995.

M S Arulampalam, S Maskell, N Gordon, and T Clapp. A Tutorial on Particle Filters for Online Nonlinear/Non-Gaussian Bayesian Tracking. IEEE Transactions on Signal Processing, 2002.

Mariusz Bojarski, Davide Del Testa, Daniel Dworakowski, Bernhard Firner, Beat Flepp, Prasoon Goyal, Lawrence D Jackel, Mathew Monfort, Urs Muller, Jiakai Zhang, et al. End to end learning for self-driving cars. arXiv preprint arXiv:1604.07316, 2016.

Lasse Espeholt, Hubert Soyer, Rémi Munos, Karen Simonyan, Volodymir Mnih, Tom Ward, Yotam Doron, Vlad Firoiu, Tim Harley, Iain Dunning, et al. IMPALA: Scalable Distributed Deep-RL with Importance Weighted Actor-Learner Architectures. arXiv preprint arXiv:1802.01561, 2018.

N J Gordon, D J Salmond, and A F M Smith. Novel approach to nonlinear/non-Gaussian Bayesian state estimation. IEE Proceedings F (Radar and Signal Processing), 1993.

Assaf Hallak and Shie Mannor. Consistent on-line off-policy evaluation. arXiv preprint arXiv:1702.07121, 2017.

Max Jaderberg, Volodymyr Mnih, Wojciech Marian Czarnecki, Tom Schaul, Joel Z Leibo, David Silver, and Koray Kavukcuoglu.
Reinforcement Learning with Unsupervised Auxiliary Tasks. In International Conference on Learning Representations, 2017.

H Kahn and A W Marshall. Methods of Reducing Sample Size in Monte Carlo Computations. Journal of the Operations Research Society of America, 1953.

Augustine Kong, Jun S Liu, and Wing Hung Wong. Sequential imputations and Bayesian missing data problems. Journal of the American Statistical Association, 1994.

David A Levin and Yuval Peres. Markov Chains and Mixing Times, volume 107. American Mathematical Society, 2017.

Long-Ji Lin. Self-Improving Reactive Agents Based On Reinforcement Learning, Planning and Teaching. Machine Learning, 1992.

Long-Ji Lin. Reinforcement Learning for Robots Using Neural Networks. PhD thesis, Carnegie Mellon University, 1993.

Qiang Liu, Lihong Li, Ziyang Tang, and Dengyong Zhou. Breaking the Curse of Horizon: Infinite-Horizon Off-Policy Estimation. In Advances in Neural Information Processing Systems 31. Curran Associates, Inc., 2018.

Victoria Lopez, Alberto Fernandez, Salvador Garcia, Vasile Palade, and Francisco Herrera. An insight into classification with imbalanced data: Empirical results and current trends on using data intrinsic characteristics. Information Sciences, 2013.

A R Mahmood and R S Sutton. Off-policy learning based on weighted importance sampling with linear computational complexity. In Conference on Uncertainty in Artificial Intelligence, 2015.

A Rupam Mahmood, Hado P van Hasselt, and Richard S Sutton. Weighted importance sampling for off-policy learning with linear function approximation. In Advances in Neural Information Processing Systems, 2014.

Ashique Rupam Mahmood, Huizhen Yu, and Richard S Sutton. Multi-step Off-policy Learning Without Importance Sampling Ratios.
arXiv preprint arXiv:1702.03006, 2017.

Luca Martino, Víctor Elvira, and Francisco Louzada. Effective sample size for importance sampling based on discrepancy measures. Signal Processing, 2017.

Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A Rusu, Joel Veness, Marc G Bellemare, Alex Graves, Martin Riedmiller, Andreas K Fidjeland, Georg Ostrovski, Stig Petersen, Charles Beattie, Amir Sadik, Ioannis Antonoglou, Helen King, Dharshan Kumaran, Daan Wierstra, Shane Legg, and Demis Hassabis. Human-level control through deep reinforcement learning. Nature, 2015.

Joseph Modayil, Adam White, and Richard S Sutton. Multi-timescale nexting in a reinforcement learning robot. Adaptive Behavior, 2014.

Rémi Munos, Tom Stepleton, Anna Harutyunyan, and Marc G Bellemare. Safe and Efficient Off-Policy Reinforcement Learning. In Advances in Neural Information Processing Systems, 2016.

Art B Owen. Monte Carlo Theory, Methods and Examples. 2013.

Jing Peng and Ronald J Williams. Efficient Learning and Planning Within the Dyna Framework. Adaptive Behavior, 1993.

Doina Precup, Richard S Sutton, and Satinder P Singh. Eligibility Traces for Off-Policy Policy Evaluation. In International Conference on Machine Learning, 2000.

Doina Precup, Richard S Sutton, and Sanjoy Dasgupta. Off-Policy Temporal-Difference Learning with Function Approximation. In International Conference on Machine Learning, 2001.

Donald B Rubin. Using the SIR algorithm to simulate posterior distributions. Bayesian Statistics, 1988.

Reuven Y Rubinstein and Dirk P Kroese. Simulation and the Monte Carlo Method. John Wiley & Sons, 2016.

Tom Schaul and Mark Ring. Better generalization with forecasts. In International Joint Conference on Artificial Intelligence, 2013.

Tom Schaul, Daniel Horgan, Karol Gregor, and David Silver. Universal Value Function Approximators.
In International Conference on Machine Learning, 2015a.

Tom Schaul, John Quan, Ioannis Antonoglou, and David Silver. Prioritized Experience Replay. arXiv preprint arXiv:1511.05952, 2015b.

David Silver, Hado van Hasselt, Matteo Hessel, Tom Schaul, Arthur Guez, Tim Harley, Gabriel Dulac-Arnold, David P Reichert, Neil C Rabinowitz, André Barreto, and Thomas Degris. The Predictron: End-To-End Learning and Planning. In AAAI Conference on Artificial Intelligence, 2017.

Øivind Skare, Erik Bølviken, and Lars Holden. Improved Sampling-Importance Resampling and Reduced Bias Importance Sampling. Scandinavian Journal of Statistics, 2003.

A F M Smith and A E Gelfand. Bayesian statistics without tears: a sampling-resampling perspective. The American Statistician, 1992.

Martin Stolle and Doina Precup. Learning Options in Reinforcement Learning. In International Symposium on Abstraction, Reformulation, and Approximation, 2002.

Richard S Sutton, Doina Precup, and Satinder Singh. Between MDPs and semi-MDPs: A framework for temporal abstraction in reinforcement learning. Artificial Intelligence, 1999.

Richard S Sutton, Eddie J Rafols, and Anna Koop. Temporal Abstraction in Temporal-difference Networks. In Advances in Neural Information Processing Systems, 2005.

Richard S Sutton, J Modayil, M Delp, T Degris, P M Pilarski, A White, and D Precup. Horde: A scalable real-time architecture for learning knowledge from unsupervised sensorimotor interaction. In International Conference on Autonomous Agents and Multiagent Systems, 2011.

Philip Thomas and Emma Brunskill. Data-Efficient Off-Policy Policy Evaluation for Reinforcement Learning.
In AAAI Conference on Artificial Intelligence, 2016.

Philip S Thomas and Emma Brunskill. Importance Sampling with Unequal Support. In AAAI Conference on Artificial Intelligence, 2017.