{"title": "Weighted importance sampling for off-policy learning with linear function approximation", "book": "Advances in Neural Information Processing Systems", "page_first": 3014, "page_last": 3022, "abstract": "Importance sampling is an essential component of off-policy model-free reinforcement learning algorithms. However, its most effective variant, \\emph{weighted} importance sampling, does not carry over easily to function approximation and, because of this, it is not utilized in existing off-policy learning algorithms. In this paper, we take two steps toward bridging this gap. First, we show that weighted importance sampling can be viewed as a special case of weighting the error of individual training samples, and that this weighting has theoretical and empirical benefits similar to those of weighted importance sampling. Second, we show that these benefits extend to a new weighted-importance-sampling version of off-policy LSTD(lambda). We show empirically that our new WIS-LSTD(lambda) algorithm can result in much more rapid and reliable convergence than conventional off-policy LSTD(lambda) (Yu 2010, Bertsekas & Yu 2009).", "full_text": "Weighted importance sampling for off-policy learning\n\nwith linear function approximation\n\nA. Rupam Mahmood, Hado van Hasselt, Richard S. Sutton\nReinforcement Learning and Arti\ufb01cial Intelligence Laboratory\n\nUniversity of Alberta\n\nEdmonton, Alberta, Canada T6G 1S2\n\n{ashique,vanhasse,sutton}@cs.ualberta.ca\n\nAbstract\n\nImportance sampling is an essential component of off-policy model-free rein-\nforcement learning algorithms. However, its most effective variant, weighted im-\nportance sampling, does not carry over easily to function approximation and, be-\ncause of this, it is not utilized in existing off-policy learning algorithms. In this\npaper, we take two steps toward bridging this gap. 
First, we show that weighted importance sampling can be viewed as a special case of weighting the error of individual training samples, and that this weighting has theoretical and empirical benefits similar to those of weighted importance sampling. Second, we show that these benefits extend to a new weighted-importance-sampling version of off-policy LSTD(λ). We show empirically that our new WIS-LSTD(λ) algorithm can result in much more rapid and reliable convergence than conventional off-policy LSTD(λ) (Yu 2010, Bertsekas & Yu 2009).

1  Importance sampling and weighted importance sampling

Importance sampling (Kahn & Marshall 1953, Rubinstein 1981, Koller & Friedman 2009) is a well-known Monte Carlo technique for estimating an expectation under one distribution given samples from a different distribution. Consider that data samples Y_k ∈ R are generated i.i.d. from a sample distribution l, but we are interested in estimating the expected value of these samples, v_g := E_g[Y_k], under a different distribution g. In importance sampling this is achieved simply by averaging the samples weighted by the ratio of their likelihoods, ρ_k := g(Y_k)/l(Y_k), called the importance-sampling ratio. That is, v_g is estimated as:

ṽ_g := (1/n) Σ_{k=1}^n ρ_k Y_k .  (1)

This is an unbiased estimate because each of the samples it averages is unbiased:

E_l[ρ_k Y_k] = ∫ l(y) (g(y)/l(y)) y dy = ∫ g(y) y dy = E_g[Y_k] = v_g .

Unfortunately, this importance-sampling estimate is often of unnecessarily high variance. To see how this can happen, consider a case in which the samples Y_k are all nearly the same (under both distributions) but the importance-sampling ratios ρ_k vary greatly from sample to sample.
This should be an easy case because the samples are so similar under the two distributions, but importance sampling averages the ρ_k Y_k, which are of high variance, and thus its estimates are also of high variance. In fact, without further bounds on the importance-sampling ratios, ṽ_g may have infinite variance (Andradóttir et al. 1995, Robert & Casella 2004).

An important variation on importance sampling that often has much lower variance is weighted importance sampling (Rubinstein 1981, Koller & Friedman 2009). The weighted importance sampling (WIS) estimate of v_g is a weighted average of the samples with importance-sampling ratios as weights:

v̂_g := (Σ_{k=1}^n ρ_k Y_k) / (Σ_{k=1}^n ρ_k) .

This estimate is biased, but consistent (asymptotically correct) and typically of much lower variance than the ordinary importance-sampling (OIS) estimate, as acknowledged by many authors (Hesterberg 1988, Casella & Robert 1998, Precup, Sutton & Singh 2000, Shelton 2001, Liu 2001, Koller & Friedman 2009). For example, in the problematic case sketched above (near-constant Y_k, widely varying ρ_k) the variance of the WIS estimate will be related to the variance of Y_k. Note also that when the samples are bounded, the WIS estimate has bounded variance, because the estimate itself is bounded by the highest absolute value of Y_k, no matter how large the ratios ρ_k are (Precup, Sutton & Dasgupta 2001).

Although WIS is the more successful importance sampling technique, it has not yet been extended to parametric function approximation. This is problematic for applications to off-policy reinforcement learning, in which function approximation is viewed as essential for large-scale applications to sequential decision problems with large state and action spaces.
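The contrast between OIS (Eq. 1) and WIS can be sketched numerically. The distributions, ratio values, and sample size below are illustrative choices, not taken from the paper; they reproduce the problematic case described above (near-constant samples, widely varying ratios):

```python
import numpy as np

rng = np.random.default_rng(0)

def ois(y, rho):
    """Ordinary importance-sampling estimate (Eq. 1): mean of rho_k * Y_k."""
    return np.mean(rho * y)

def wis(y, rho):
    """Weighted importance-sampling estimate: ratio-weighted average of Y_k."""
    return np.sum(rho * y) / np.sum(rho)

# Problematic case from the text: near-constant samples, widely varying ratios.
n = 1000
y = 1.0 + 0.01 * rng.standard_normal(n)   # Y_k nearly the same, around 1.0
rho = rng.choice([0.01, 10.0], size=n)    # illustrative ratios varying greatly

# OIS inherits the variance of rho; WIS stays near the common sample value,
# since a weighted average of the Y_k is bounded by min(Y_k) and max(Y_k).
print(ois(y, rho), wis(y, rho))
```

Because WIS is a weighted average, its estimate here stays close to 1.0 regardless of how extreme the ratios are, while the OIS estimate tracks the mean of the ratios instead.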
Here an important subproblem is the approximation of the value function—the expected sum of future discounted rewards as a function of state—for a designated target policy that may differ from that used to select actions. The existing methods for off-policy value-function approximation either use OIS (Maei & Sutton 2010, Yu 2010, Sutton et al. 2014, Geist & Scherrer 2014, Dann et al. 2014) or use WIS but are limited to the tabular or non-parametric case (Precup et al. 2000, Shelton 2001). How to extend WIS to parametric function approximation is important, but far from clear (as noted by Precup et al. 2001).

2  Importance sampling for linear function approximation

In this section, we take the first step toward bridging the gap between WIS and off-policy learning with function approximation. In a general supervised learning setting with linear function approximation, we develop and analyze two importance-sampling methods. Then we show that these two methods have theoretical properties similar to those of OIS and WIS. In the fully-representable case, one of the methods becomes equivalent to the OIS estimate and the other to the WIS estimate. The key idea is that OIS and WIS can be seen as least-squares solutions to two different empirical objectives.
The OIS estimate is the least-squares solution to an empirical mean-squared objective where the samples are importance weighted:

ṽ_g = arg min_v (1/n) Σ_{k=1}^n (ρ_k Y_k − v)^2  ⟹  Σ_{k=1}^n (ρ_k Y_k − ṽ_g) = 0  ⟹  ṽ_g = (Σ_{k=1}^n ρ_k Y_k) / n .  (2)

Similarly, the WIS estimate is the least-squares solution to an empirical mean-squared objective where the individual errors are importance weighted:

v̂_g = arg min_v (1/n) Σ_{k=1}^n ρ_k (Y_k − v)^2  ⟹  Σ_{k=1}^n ρ_k (Y_k − v̂_g) = 0  ⟹  v̂_g = (Σ_{k=1}^n ρ_k Y_k) / (Σ_{k=1}^n ρ_k) .  (3)

We solve similar empirical objectives in a general supervised learning setting with linear function approximation to derive the two new methods.

Consider two correlated random variables X_k and Y_k, where X_k takes values from a finite set 𝒳, and where Y_k ∈ R. We want to estimate the conditional expectation of Y_k for each x ∈ 𝒳 under a target distribution g_{Y|X}. However, the samples (X_k, Y_k) are generated i.i.d. according to a joint sample distribution l_{XY}(·) with conditional probabilities l_{Y|X} that may differ from the conditional target distribution. Each input is mapped to a feature vector φ_k := φ(X_k) ∈ R^m, and the goal is to estimate the expectation E_{g_{Y|X}}[Y_k | X_k = x] as a linear function of the features:

θ^⊤φ(x) ≈ v_g(x) := E_{g_{Y|X}}[Y_k | X_k = x] .

Estimating this expectation is again difficult because the target joint distribution of the input-output pairs, g_{XY}, can be different than the sample joint distribution l_{XY}. Generally, the discrepancy in the joint distribution may arise from two sources: difference in the marginal distribution of inputs, g_X ≠ l_X, and difference in the conditional distribution of outputs, g_{Y|X} ≠ l_{Y|X}.
Problems where only the former discrepancy arises are known as covariate shift problems (Shimodaira 2000). In these problems the conditional expectation of the outputs is assumed unchanged between the target and the sample distributions. In off-policy learning problems, the discrepancy between conditional probabilities is more important. Most off-policy learning methods correct only the discrepancy between the target and the sample conditional distributions of outputs (Hachiya et al. 2009, Maei & Sutton 2010, Yu 2010, Maei 2011, Geist & Scherrer 2014, Dann et al. 2014). In this paper, we also focus only on correcting the discrepancy between the conditional distributions.

The problem of estimating v_g(x) as a linear function of features using samples generated from l can be formulated as the minimization of the mean squared error (MSE), whose solution is:

θ_* := arg min_θ E_{l_X}[(E_{g_{Y|X}}[Y_k|X_k] − θ^⊤φ_k)^2] = E_{l_X}[φ_k φ_k^⊤]^{-1} E_{l_X}[E_{g_{Y|X}}[Y_k|X_k] φ_k] .  (4)

Similar to the empirical mean-squared objectives defined in (2) and (3), two different empirical objectives can be defined to approximate this solution. In one case the importance weighting is applied to the output samples, Y_k, and in the other case it is applied to the error, Y_k − θ^⊤φ_k:

J̃_n(θ) := (1/n) Σ_{k=1}^n (ρ_k Y_k − θ^⊤φ_k)^2 ;  Ĵ_n(θ) := (1/n) Σ_{k=1}^n ρ_k (Y_k − θ^⊤φ_k)^2 ,

where the importance-sampling ratios are defined by ρ_k := g_{Y|X}(Y_k|X_k) / l_{Y|X}(Y_k|X_k). We can minimize these objectives by equating their derivatives to zero.
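Minimizing each objective in closed form amounts to solving a (weighted) set of normal equations; a minimal numerical sketch, with synthetic data and illustrative ratio values of our own choosing:

```python
import numpy as np

rng = np.random.default_rng(1)

n, m = 500, 3
Phi = rng.standard_normal((n, m))        # feature vectors phi_k, one per row
rho = rng.uniform(0.1, 2.0, size=n)      # illustrative importance-sampling ratios
theta_true = np.array([1.0, -2.0, 0.5])
Y = Phi @ theta_true + 0.1 * rng.standard_normal(n)

# Objective with importance weighting on the outputs (rho_k * Y_k):
# normal equations (Phi^T Phi) theta = Phi^T (rho * Y).
theta_out = np.linalg.solve(Phi.T @ Phi, Phi.T @ (rho * Y))

# Objective with importance weighting on the squared errors:
# weighted normal equations (Phi^T diag(rho) Phi) theta = Phi^T (rho * Y).
theta_err = np.linalg.solve(Phi.T @ (rho[:, None] * Phi), Phi.T @ (rho * Y))
```

Note that the error-weighted solution is an ordinary weighted least-squares fit, so with ratios independent of the data it recovers the generating parameters up to noise, whereas the output-weighted solution scales with the magnitude of the ratios.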
Provided the relevant matrix inverses exist, the resulting solutions are, respectively,

θ̃_n := (Σ_{k=1}^n φ_k φ_k^⊤)^{-1} Σ_{k=1}^n ρ_k Y_k φ_k ,  (5)

θ̂_n := (Σ_{k=1}^n ρ_k φ_k φ_k^⊤)^{-1} Σ_{k=1}^n ρ_k Y_k φ_k .  (6)

We call θ̃ the OIS-LS estimator and θ̂ the WIS-LS estimator.

A least-squares method similar to WIS-LS above was introduced for covariate shift problems by Hachiya, Sugiyama and Ueda (2012). Although superficially similar, that method uses importance-sampling ratios to correct for the discrepancy in the marginal distributions of inputs, whereas WIS-LS corrects for the discrepancy in the conditional expectations of the outputs. In the fully-representable case, unlike WIS-LS, the method of Hachiya et al. becomes an ordinary Monte Carlo estimator with no importance sampling.

3  Analysis of the least-squares importance-sampling methods

The two least-squares importance-sampling methods have properties similar to those of the OIS and the WIS methods. In Theorems 1 and 2, we prove that when v_g can be represented as a linear function of the features, then OIS-LS is an unbiased estimator of θ_*, whereas WIS-LS is a biased estimator, similar to the WIS estimator. If the solution is not linearly representable, least-squares methods are not generally unbiased. In Theorems 3 and 4, we show that both least-squares estimators are consistent for θ_*. Finally, we demonstrate that the least-squares methods are generalizations of OIS and WIS by showing, in Theorems 5 and 6, that in the fully representable case (when the features form an orthonormal basis) OIS-LS is equivalent to OIS and WIS-LS is equivalent to WIS.

Theorem 1. If v_g is a linear function of the features, that is, v_g(x) = θ_*^⊤φ(x), then OIS-LS is an unbiased estimator, that is, E_{l_{XY}}[θ̃_n] = θ_*.

Theorem 2. 
Even if v_g is a linear function of the features, that is, v_g(x) = θ_*^⊤φ(x), WIS-LS is in general a biased estimator, that is, E_{l_{XY}}[θ̂_n] ≠ θ_*.

Theorem 3. The OIS-LS estimator θ̃_n is a consistent estimator of the MSE solution θ_* given in (4).

Theorem 4. The WIS-LS estimator θ̂_n is a consistent estimator of the MSE solution θ_* given in (4).

Theorem 5. If the features form an orthonormal basis, then the OIS-LS estimate θ̃_n^⊤φ(x) of input x is equivalent to the OIS estimate of the outputs corresponding to x.

Theorem 6. If the features form an orthonormal basis, then the WIS-LS estimate θ̂_n^⊤φ(x) of input x is equivalent to the WIS estimate of the outputs corresponding to x.

Proofs of Theorems 1-6 are given in the Appendix.

The WIS-LS estimate is perhaps the more interesting of the two least-squares estimates, because it generalizes WIS to parametric function approximation for the first time and extends its advantages.

4  A new off-policy LSTD(λ) with WIS

In sequential decision problems, off-policy learning methods based on importance sampling can suffer from the same high-variance issues as discussed above for the supervised case. To address this, we extend the idea of WIS-LS to off-policy reinforcement learning and construct a new off-policy WIS-LSTD(λ) algorithm.

We first explain the problem setting. Consider a learning agent that interacts with an environment where at each step t the state of the environment is S_t and the agent observes a feature vector φ_t := φ(S_t) ∈ R^m. The agent takes an action A_t based on a behavior policy b(·|S_t), which is typically a function of the state features. The environment provides the agent a scalar (reward) signal R_{t+1} and transitions to state S_{t+1}. This process continues, generating a trajectory of states, actions and rewards.
The goal is to estimate the values of the states under the target policy π, defined as the expected returns given by the sum of future discounted rewards:

v_π(s) := E[ Σ_{t=0}^∞ (Π_{k=1}^t γ(S_k)) R_{t+1} | S_0 = s, A_t ∼ π(·|S_t), ∀t ] ,

where γ(S_k) ∈ [0, 1] is a state-dependent degree of discounting on arrival in S_k (as in Sutton et al. 2014). We assume the rewards and discounting are chosen such that v_π(s) is well-defined and finite. Our primary objective is to estimate v_π as a linear function of the features: v_π(s) ≈ θ^⊤φ(s), where θ ∈ R^m is a parameter vector to be estimated. As before, we need to correct for the difference in sample distribution resulting from the behavior policy and the target distribution as induced by the target policy. Consider a partial trajectory from time step k to time t, consisting of a sequence S_k, A_k, R_{k+1}, S_{k+1}, ..., S_t. The probability of this trajectory occurring, given that it starts at S_k, under the target policy will generally differ from its probability under the behavior policy. The importance-sampling ratio ρ_k^t is defined to be the ratio of these probabilities. This ratio can be written in terms of the product of action-selection probabilities, without needing a model of the environment (Sutton & Barto 1998):

ρ_k^t := (Π_{i=k}^{t-1} π(A_i|S_i)) / (Π_{i=k}^{t-1} b(A_i|S_i)) = Π_{i=k}^{t-1} π(A_i|S_i)/b(A_i|S_i) = Π_{i=k}^{t-1} ρ_i ,

where we use the shorthand ρ_i := π(A_i|S_i)/b(A_i|S_i).

We incorporate a technique common in reinforcement learning (RL) where updates are estimated by bootstrapping, fully or partially, on previously constructed state-value estimates. Bootstrapping potentially reduces the variance of the updates compared to using full returns and makes RL algorithms applicable to non-episodic tasks.
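Computing ρ_k^t as a product of per-step action probabilities can be sketched as follows; the policy tables and trajectory below are illustrative examples, not from the paper:

```python
import numpy as np

def trajectory_ratio(actions, states, pi, b, k, t):
    """Importance-sampling ratio rho_k^t: the product of per-step ratios
    pi(A_i|S_i) / b(A_i|S_i) for i = k, ..., t-1; no environment model needed.
    pi and b are arrays of action probabilities indexed by [state, action]."""
    ratio = 1.0
    for i in range(k, t):
        ratio *= pi[states[i], actions[i]] / b[states[i], actions[i]]
    return ratio

# Illustrative 2-state, 2-action policies: pi prefers action 0 in state 0;
# b is uniform everywhere.
pi = np.array([[0.9, 0.1],
               [0.5, 0.5]])
b = np.full((2, 2), 0.5)
rho = trajectory_ratio(actions=[0, 1], states=[0, 1, 0], pi=pi, b=b, k=0, t=2)
```

Note that a single action with π(A_i|S_i) = 0 makes the whole product zero, which is exactly the situation the provisional-parameter machinery of Section 4 must handle efficiently.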
In this paper we assume that the bootstrapping parameter λ(s) ∈ [0, 1] may depend on the state s (as in Sutton & Barto 1998, Maei & Sutton 2010). In the following, we use the notational shorthands λ_k := λ(S_k) and γ_k := γ(S_k).

Following Sutton et al. (2014), we construct an empirical loss as a sum of pairs of squared corrected and uncorrected errors, each corresponding to a different number of steps of lookahead, and each weighted as a function of the intervening discounting and bootstrapping. Let G_k^t := R_{k+1} + ... + R_t be the undiscounted return truncated after looking ahead t − k steps. Imagine constructing the empirical loss for time 0. After leaving S_0 and observing R_1 and S_1, the first uncorrected error is G_0^1 − θ^⊤φ_0, with weight equal to the probability of terminating, 1 − γ_1. If we do not terminate, then we continue to S_1 and form the first corrected error G_0^1 + v^⊤φ_1 − θ^⊤φ_0, using the bootstrapping estimate v^⊤φ_1. The weight on this error is γ_1(1 − λ_1), and the parameter vector v may differ from θ. Continuing to the next time step, we obtain the second uncorrected error G_0^2 − θ^⊤φ_0 and the second corrected error G_0^2 + v^⊤φ_2 − θ^⊤φ_0, with respective weights γ_1λ_1(1 − γ_2) and γ_1λ_1γ_2(1 − λ_2). This goes on until we reach the horizon of our data, say at time t, when we bootstrap fully with v^⊤φ_t, generating a final corrected return error of G_0^t + v^⊤φ_t − θ^⊤φ_0 with weight γ_1λ_1 ··· γ_{t-1}λ_{t-1}γ_t.

The general form for the uncorrected error is ε_k^t(θ) := G_k^t − θ^⊤φ_k, and the general form for the corrected error is ε̄_k^t(θ, v) := G_k^t + v^⊤φ_t − θ^⊤φ_k. All these errors could be squared, weighted by their weights, and summed to form the overall empirical loss. In the off-policy case, we need to also weight the squares of the errors ε_k^t and ε̄_k^t by the importance-sampling ratio ρ_k^t.
Hence, the overall empirical loss at time k for data up to time t can be written as

ℓ_k^t(θ, v) := ρ_k Σ_{i=k+1}^{t-1} C_k^{i-1} [ (1 − γ_i)(ε_k^i(θ))^2 + γ_i(1 − λ_i)(ε̄_k^i(θ, v))^2 ] + ρ_k C_k^{t-1} [ (1 − λ_t)(ε_k^t(θ))^2 + λ_t(ε̄_k^t(θ, v))^2 ] ,  where C_k^t := Π_{j=k+1}^t γ_j λ_j ρ_j .

This loss differs from that used by other LSTD(λ) methods in that importance weighting is applied to the individual errors within ℓ_k^t(θ, v).

Now we are ready to state the least-squares problem. As noted by Geist & Scherrer (2014), LSTD(λ) methods can be derived by solving least-squares problems where estimates at each step are matched with multi-step returns starting from those steps, given that bootstrapping is done using the solution itself. Our proposed new method, called WIS-LSTD(λ), computes at each time t the solution to the least-squares problem:

θ_t := arg min_θ Σ_{k=0}^{t-1} ℓ_k^t(θ, θ_t) .

At the solution, the derivative of the objective is zero: (∂/∂θ) Σ_{k=0}^{t-1} ℓ_k^t(θ, θ_t) |_{θ=θ_t} = Σ_{k=0}^{t-1} 2 δ_{k,t}^ρ(θ_t, θ_t) φ_k = 0, where the errors δ_{k,t}^ρ are defined by

δ_{k,t}^ρ(θ, v) := ρ_k Σ_{i=k+1}^{t-1} C_k^{i-1} [ (1 − γ_i) ε_k^i(θ) + γ_i(1 − λ_i) ε̄_k^i(θ, v) ] + ρ_k C_k^{t-1} [ (1 − λ_t) ε_k^t(θ) + λ_t ε̄_k^t(θ, v) ] .

Next, we separate the terms of Σ_{k=0}^{t-1} δ_{k,t}^ρ(θ_t, θ_t) φ_k that involve θ_t from those that do not: δ_{k,t}^ρ(θ_t, θ_t) φ_k = b_{k,t} − A_{k,t} θ_t, where b_{k,t} ∈ R^m and A_{k,t} ∈ R^{m×m} are defined as

b_{k,t} := ρ_k Σ_{i=k+1}^{t-1} C_k^{i-1} (1 − γ_i λ_i) G_k^i φ_k + ρ_k C_k^{t-1} G_k^t φ_k ,

A_{k,t} := ρ_k Σ_{i=k+1}^{t-1} C_k^{i-1} φ_k ((1 − γ_i λ_i) φ_k − γ_i(1 − λ_i) φ_i)^⊤ + ρ_k C_k^{t-1} φ_k (φ_k − λ_t φ_t)^⊤ .

Therefore, the 
solution can be found as follows:

Σ_{k=0}^{t-1} (b_{k,t} − A_{k,t} θ_t) = 0  ⟹  θ_t = A_t^{-1} b_t ,  where A_t := Σ_{k=0}^{t-1} A_{k,t} ,  b_t := Σ_{k=0}^{t-1} b_{k,t} .  (7)

In the following we show that WIS-LS is a special case of the above algorithm defined by (7). As Theorem 6 shows that WIS-LS generalizes WIS, it follows that the above algorithm generalizes WIS as well.

Theorem 7. At termination, the algorithm defined by (7) is equivalent to the WIS-LS method, in the sense that if γ_0 = ··· = γ_t = λ_0 = ··· = λ_{t-1} = 1 and λ_t = 0, then θ_t defined in (7) equals θ̂_t as defined in (6), with Y_k := G_k^t. (Proved in the Appendix.)

Our last challenge is to find an equivalent efficient online algorithm for this method. The solution in (7) cannot be computed incrementally in this form. When a new sample arrives at time t + 1, A_{k,t+1} and b_{k,t+1} have to be computed for each k = 0, ..., t, and hence the computational complexity of this solution grows with time. It would be preferable if the solution at time t + 1 could be computed incrementally based on the estimates from time t, requiring only constant computational complexity per time step. It is not immediately obvious that such an efficient update exists. For instance, for λ = 1 this method achieves full Monte Carlo (weighted) importance-sampling estimation, which means that whenever the target policy deviates from the behavior policy, all previously made updates have to be unmade so that no updates are made toward a trajectory that is impossible under the target policy. Sutton et al. (2014) show it is possible to derive efficient updates in some cases with the use of provisional parameters, which keep track of the provisional updates that might need to be unmade when a deviation occurs.
In the following, we show that using such provisional parameters it is also possible to achieve an equivalent efficient update for (7).

We first write both b_{k,t} and A_{k,t} recursively in t (derivations in Appendix A.8):

b_{k,t+1} = b_{k,t} + ρ_k C_k^t R_{t+1} φ_k + (ρ_t − 1) γ_t λ_t ρ_k C_k^{t-1} G_k^t φ_k ,

A_{k,t+1} = A_{k,t} + ρ_k C_k^t φ_k (φ_t − γ_{t+1} λ_{t+1} φ_{t+1})^⊤ + (ρ_t − 1) γ_t λ_t ρ_k C_k^{t-1} φ_k (φ_k − φ_t)^⊤ .

Using the above recursions, we can write the updates of both b_t and A_t incrementally. The vector b_t can be updated incrementally as

b_{t+1} = b_t + R_{t+1} e_t + (ρ_t − 1) u_t ,  (8)

where the eligibility trace e_t ∈ R^m and the provisional vector u_t ∈ R^m are defined as follows:

e_t = ρ_t (φ_t + γ_t λ_t e_{t-1}) ,  (9)

u_t = γ_t λ_t (ρ_{t-1} u_{t-1} + R_t e_{t-1}) .  (10)

The matrix A_t can be updated incrementally as

A_{t+1} = A_t + e_t (φ_t − γ_{t+1} λ_{t+1} φ_{t+1})^⊤ + (ρ_t − 1) V_t ,  (11)

where the provisional matrix V_t ∈ R^{m×m} is defined as

V_t = γ_t λ_t (ρ_{t-1} V_{t-1} + e_{t-1} (φ_{t-1} − φ_t)^⊤) .  (12)

Then the parameter vector can be updated as

θ_{t+1} = (A_{t+1})^{-1} b_{t+1} .  (13)

Equations (8-13) comprise our WIS-LSTD(λ). Its per-step computational complexity is O(m^3), where m is the number of features.
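One step of the incremental updates (8-13) can be sketched as follows. This is a simplified sketch under our own assumptions: constant scalar γ and λ (the method allows state-dependent γ(s), λ(s)), a regularized initialization of A as in the experiments of Section 5, and illustrative variable names:

```python
import numpy as np

class WISLSTDSketch:
    """Sketch of the incremental WIS-LSTD(lambda) updates (8)-(13).
    Simplifying assumption: constant scalar gamma and lam."""

    def __init__(self, m, epsilon=1.0):
        self.A = epsilon * np.eye(m)   # regularized init, as in the experiments
        self.b = np.zeros(m)
        self.e = np.zeros(m)           # eligibility trace e_{t-1}
        self.u = np.zeros(m)           # provisional vector u_{t-1}
        self.V = np.zeros((m, m))      # provisional matrix V_{t-1}
        self.prev_rho, self.prev_R, self.prev_phi = 1.0, 0.0, np.zeros(m)

    def update(self, phi_t, rho_t, R_next, phi_next, gamma, lam):
        # (10) and (12): provisional quantities use step t-1 values.
        self.u = gamma * lam * (self.prev_rho * self.u + self.prev_R * self.e)
        self.V = gamma * lam * (self.prev_rho * self.V
                                + np.outer(self.e, self.prev_phi - phi_t))
        # (9): eligibility trace e_t.
        self.e = rho_t * (phi_t + gamma * lam * self.e)
        # (8) and (11): accumulate b_{t+1} and A_{t+1}.
        self.b += R_next * self.e + (rho_t - 1.0) * self.u
        self.A += (np.outer(self.e, phi_t - gamma * lam * phi_next)
                   + (rho_t - 1.0) * self.V)
        # Remember step-t quantities for the next call.
        self.prev_rho, self.prev_R, self.prev_phi = rho_t, R_next, phi_t

    def theta(self):
        # (13): solve A theta = b instead of forming the inverse explicitly.
        return np.linalg.solve(self.A, self.b)
```

In the on-policy case (all ρ_t = 1) the (ρ_t − 1) terms vanish, and the sketch reduces to regularized on-policy LSTD(λ), consistent with the equivalence noted after Theorem 8.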
The computational cost of this method does not increase with time. At present we are unsure whether or not there is an O(m^2) implementation.

Theorem 8. The off-policy LSTD(λ) method defined in (8-13) is equivalent to the off-policy LSTD(λ) method defined in (7), in the sense that they compute the same θ_t at each time t.

Proof. The result follows immediately from the above derivation.

It is easy to see that in the on-policy case this method becomes equivalent to on-policy LSTD(λ) (Boyan 1999), by noting that the third term of both the b_t and A_t updates in (8) and (11) becomes zero, because in the on-policy case all the importance-sampling ratios are 1.

Recently Dann et al. (2014) proposed another least-squares based off-policy method called recursive LSTD-TO(λ). Unlike our algorithm, that algorithm does not specialize to WIS in the fully representable case, and it does not seem as closely related to WIS. The Adaptive Per-Decision Importance Weighting (APDIW) method by Hachiya et al. (2009) is superficially similar to WIS-LSTD(λ), but there are several important differences. APDIW is a one-step method that always fully bootstraps, whereas WIS-LSTD(λ) covers the full spectrum of multi-step backups, including both one-step backups and Monte Carlo updates. In the fully representable case, APDIW does not become equivalent to the WIS estimate, whereas WIS-LSTD(1) does. Moreover, APDIW does not provide a consistent estimate of the off-policy target, whereas WIS algorithms do.

5  Experimental results

We compared the performance of the proposed WIS-LSTD(λ) method with the conventional off-policy LSTD(λ) of Yu (2010) on two random-walk tasks for off-policy policy evaluation. These random-walk tasks consist of a Markov chain with 11 non-terminal and two terminal states. They can be imagined to be laid out horizontally, where the two terminal states are at the left and the right ends of the chain.
From each non-terminal state, there are two actions available: left, which leads to the state to the left, and right, which leads to the state to the right. The reward is 0 for all transitions except the rightmost transition to the terminal state, where it is +1. The initial state was set to the state in the middle of the chain. The behavior policy chooses an action uniformly randomly, whereas the target policy chooses the right action with probability 0.99. The termination function γ was set to 1 for the non-terminal states and 0 for the terminal states.

We used two tasks based on this Markov chain in our experiments. These tasks differ in how the non-terminal states were mapped to features. The terminal states were always mapped to a vector with all zero elements. For each non-terminal state, the features were normalized so that the L2 norm of each feature vector was one. For the first task, the feature representation was tabular, that is, the feature vectors were standard basis vectors. In this representation, each feature corresponded to only one state. For the second task, the feature vectors were binary representations of state indices. There were 11 non-terminal states, hence each feature vector had ⌊log2(11)⌋ + 1 = 4 components. These vectors for the states from left to right were (0,0,0,1)^⊤, (0,0,1,0)^⊤, (0,0,1,1)^⊤, ..., (1,0,1,1)^⊤, which were then normalized to get unit vectors. These features heavily underrepresented the states, as 11 states were represented by only 4 features.

We tested both algorithms for different values of constant λ, from 0 to 0.9 in steps of 0.1 and from 0.9 to 1.0 in steps of 0.025. The matrix to be inverted in both methods was initialized to εI, where the regularization parameter ε was varied by powers of 10, with powers chosen from -3 to +3 in steps of 0.2.
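The binary feature construction for the second task can be sketched directly from the description above (function name is our own):

```python
import numpy as np

def binary_features(n_states=11):
    """Normalized binary-index features for the second random-walk task:
    state i (1-based, left to right) is encoded in floor(log2(n_states)) + 1
    bits, most significant bit first, then scaled to unit L2 norm."""
    n_bits = int(np.floor(np.log2(n_states))) + 1   # = 4 for 11 states
    feats = []
    for i in range(1, n_states + 1):
        v = np.array([(i >> b) & 1 for b in reversed(range(n_bits))],
                     dtype=float)
        feats.append(v / np.linalg.norm(v))
    return np.array(feats)

F = binary_features()   # shape (11, 4); e.g. F[0] encodes state 1 as (0,0,0,1)
```

Terminal states would be mapped to the all-zero vector separately, as stated in the text.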
Performance was measured as the empirical mean squared error (MSE) between the estimated value of the initial state and its true value under the target policy, projected to the space spanned by the given features. This error was measured at the end of each of 200 episodes for 100 independent runs.

Figure 1 shows the results for the two tasks in terms of empirical convergence rate, optimum performance and parameter sensitivity. Each curve shows MSE together with standard errors. The first row shows results for the tabular task and the second row shows results for the function approximation task. The first column shows learning curves using (λ, ε) = (0, 1) for the first task and (0.95, 10) for the second. It shows that in both cases WIS-LSTD(λ) learned faster and gave lower error throughout the period of learning. The second column shows performance with respect to different λ, optimized over ε. The x-axis is plotted in a reverse log scale, where higher values are more spread out than the lower values. In both tasks, WIS-LSTD(λ) outperformed the conventional LSTD(λ) for all values of λ. For the best parameter setting (best λ and ε), WIS-LSTD(λ) outperformed LSTD(λ) by an order of magnitude.

[Figure 1: Empirical comparison of our WIS-LSTD(λ) with conventional off-policy LSTD(λ) on the two random-walk tasks (tabular task, top row; function approximation task, bottom row). Panels plot MSE against episodes, λ, and the regularization parameter ε. The empirical MSE shown is for the initial state at the end of each episode, averaged over 100 independent runs (and also over 200 episodes in columns 2 and 3).]
The third column shows performance with respect to the regularization parameter ε for three representative values of λ. For a wide range of ε, WIS-LSTD(λ) outperformed conventional LSTD(λ) by an order of magnitude. Both methods performed similarly for large ε, as such large values essentially prevent learning for a long period of time. In the function approximation task, when smaller values of ε were chosen, λ close to 1 led to more stable estimates, whereas smaller λ introduced high variance for both methods. In both tasks, the better-performing regions of ε (the U-shaped depressions) were wider for WIS-LSTD(λ).

6  Conclusion

Although importance sampling is essential to off-policy learning and has become a key part of modern reinforcement learning algorithms, its most effective form, WIS, has been neglected because of the difficulty of combining it with parametric function approximation. In this paper, we have begun to overcome these difficulties. First, we have shown that the WIS estimate can be viewed as the solution to an empirical objective where the squared errors of individual samples are weighted by the importance-sampling ratios. Second, we have introduced a new method for general supervised learning, called WIS-LS, by extending the error-weighted empirical objective to linear function approximation, and we have shown that the new method has properties similar to those of the WIS estimate. Finally, we have introduced a new off-policy LSTD algorithm, WIS-LSTD(λ), that extends the benefits of WIS to reinforcement learning. Our empirical results show that the new WIS-LSTD(λ) can outperform Yu's off-policy LSTD(λ) in both tabular and function approximation tasks and shows robustness in terms of its parameters.
An interesting direction for future work is to extend these ideas to off-policy linear-complexity methods.

Acknowledgement

This work was supported by grants from Alberta Innovates Technology Futures, the National Science and Engineering Research Council, and the Alberta Innovates Centre for Machine Learning.

References

Andradóttir, S., Heyman, D. P., Ott, T. J. (1995). On the choice of alternative measures in importance sampling with Markov chains. Operations Research, 43(3):509–519.

Bertsekas, D. P., Yu, H. (2009). Projected equation methods for approximate solution of large linear systems. Journal of Computational and Applied Mathematics, 227(1):27–50.

Boyan, J. A. (1999). Least-squares temporal difference learning. In Proceedings of the 17th International Conference, pp. 49–56.

Casella, G., Robert, C. P. (1998). Post-processing accept-reject samples: recycling and rescaling. Journal of Computational and Graphical Statistics, 7(2):139–157.

Dann, C., Neumann, G., Peters, J. (2014). Policy evaluation with temporal differences: a survey and comparison. Journal of Machine Learning Research, 15:809–883.

Geist, M., Scherrer, B. (2014). Off-policy learning with eligibility traces: a survey. Journal of Machine Learning Research, 15:289–333.

Hachiya, H., Akiyama, T., Sugiyama, M., Peters, J. (2009). Adaptive importance sampling for value function approximation in off-policy reinforcement learning. Neural Networks, 22(10):1399–1410.

Hachiya, H., Sugiyama, M., Ueda, N. (2012). Importance-weighted least-squares probabilistic classifier for covariate shift adaptation with application to human activity recognition. Neurocomputing, 80:93–101.

Hesterberg, T. C. (1988). Advances in Importance Sampling. PhD dissertation, Statistics Department, Stanford University.

Kahn, H., Marshall, A. W. (1953). Methods of reducing sample size in Monte Carlo computations. 
Journal of the Operations Research Society of America, 1(5):263–278.

Koller, D., Friedman, N. (2009). Probabilistic Graphical Models: Principles and Techniques. MIT Press.

Liu, J. S. (2001). Monte Carlo Strategies in Scientific Computing. Berlin: Springer-Verlag.

Maei, H. R., Sutton, R. S. (2010). GQ(λ): A general gradient algorithm for temporal-difference prediction learning with eligibility traces. In Proceedings of the Third Conference on Artificial General Intelligence, pp. 91–96. Atlantis Press.

Maei, H. R. (2011). Gradient Temporal-Difference Learning Algorithms. PhD thesis, University of Alberta.

Precup, D., Sutton, R. S., Singh, S. (2000). Eligibility traces for off-policy policy evaluation. In Proceedings of the 17th International Conference on Machine Learning, pp. 759–766. Morgan Kaufmann.

Precup, D., Sutton, R. S., Dasgupta, S. (2001). Off-policy temporal-difference learning with function approximation. In Proceedings of the 18th International Conference on Machine Learning.

Robert, C. P., Casella, G. (2004). Monte Carlo Statistical Methods. New York: Springer-Verlag.

Rubinstein, R. Y. (1981). Simulation and the Monte Carlo Method. New York: Wiley.

Shelton, C. R. (2001). Importance Sampling for Reinforcement Learning with Multiple Objectives. PhD thesis, Massachusetts Institute of Technology.

Shimodaira, H. (2000). Improving predictive inference under covariate shift by weighting the log-likelihood function. Journal of Statistical Planning and Inference, 90(2):227–244.

Sutton, R. S., Barto, A. G. (1998). Reinforcement Learning: An Introduction. MIT Press.

Sutton, R. S., Mahmood, A. R., Precup, D., van Hasselt, H. (2014). A new Q(λ) with interim forward view and Monte Carlo equivalence. In Proceedings of the 31st International Conference on Machine Learning, Beijing, China.

Yu, H. (2010). 
Convergence of least squares temporal difference methods under general conditions. In Proceedings of the 27th International Conference on Machine Learning, pp. 1207–1214.
", "award": [], "sourceid": 1566, "authors": [{"given_name": "A. Rupam", "family_name": "Mahmood", "institution": "University of Alberta"}, {"given_name": "Hado", "family_name": "van Hasselt", "institution": "Centrum Wiskunde & Informatica (CWI)"}, {"given_name": "Richard", "family_name": "Sutton", "institution": "University of Alberta"}]}