{"title": "Policy Evaluation Using the \u03a9-Return", "book": "Advances in Neural Information Processing Systems", "page_first": 334, "page_last": 342, "abstract": "We propose the \u03a9-return as an alternative to the \u03bb-return currently used by the TD(\u03bb) family of algorithms. The benefit of the \u03a9-return is that it accounts for the correlation of different length returns. Because it is difficult to compute exactly, we suggest one way of approximating the \u03a9-return. We provide empirical studies that suggest that it is superior to the \u03bb-return and \u03b3-return for a variety of problems.", "full_text": "Policy Evaluation Using the \u2126-Return\n\nPhilip S. Thomas\n\nUniversity of Massachusetts Amherst\n\nCarnegie Mellon University\n\nScott Niekum\n\nUniversity of Texas at Austin\n\nGeorgios Theocharous\n\nAdobe Research\n\nGeorge Konidaris\nDuke University\n\nAbstract\n\nWe propose the \u2126-return as an alternative to the \u03bb-return currently used by the\nTD(\u03bb) family of algorithms. The bene\ufb01t of the \u2126-return is that it accounts for\nthe correlation of different length returns. Because it is dif\ufb01cult to compute ex-\nactly, we suggest one way of approximating the \u2126-return. We provide empirical\nstudies that suggest that it is superior to the \u03bb-return and \u03b3-return for a variety of\nproblems.\n\n1\n\nIntroduction\n\nMost reinforcement learning (RL) algorithms learn a value function\u2014a function that estimates the\nexpected return obtained by following a given policy from a given state. Ef\ufb01cient algorithms for esti-\nmating the value function have therefore been a primary focus of RL research. The most widely used\nfamily of RL algorithms, the TD(\u03bb) family [1], forms an estimate of return (called the \u03bb-return) that\nblends low-variance but biased temporal difference return estimates with high-variance but unbiased\nMonte Carlo return estimates, using a parameter \u03bb \u2208 [0, 1]. 
While several different algorithms exist within the TD(λ) family—the original linear-time algorithm [1], least-squares formulations [2], and methods for adapting λ [3], among others—the λ-return formulation has remained unchanged since its introduction in 1988 [1].

Recently Konidaris et al. [4] proposed the γ-return as an alternative to the λ-return, which uses a more accurate model of how the variance of a return increases with its length. However, both the γ and λ-returns fail to account for the correlation of returns of different lengths, instead treating them as statistically independent. We propose the Ω-return, which uses well-studied statistical techniques to directly account for the correlation of returns of different lengths. However, unlike the λ and γ-returns, the Ω-return is not simple to compute, and often can only be approximated. We propose a method for approximating the Ω-return, and show that it outperforms the λ and γ-returns on a range of off-policy evaluation problems.

2 Complex Backups

Estimates of return lie at the heart of value-function based RL algorithms: an estimate, $\hat{V}^\pi$, of the value function, $V^\pi$, estimates return from each state, and the learning process aims to reduce the error between estimated and observed returns. For brevity we suppress the dependencies of $V^\pi$ and $\hat{V}^\pi$ on π and write $V$ and $\hat{V}$. Temporal difference (TD) algorithms use an estimate of the return obtained by taking a single transition in the Markov decision process (MDP) [5] and then estimating the remaining return using the estimate of the value function:

$$R^{\text{TD}}_{s_t} = r_t + \gamma \hat{V}(s_{t+1}),$$

where $R^{\text{TD}}_{s_t}$ is the return estimate from state $s_t$, $r_t$ is the reward for going from $s_t$ to $s_{t+1}$ via action $a_t$, and γ ∈ [0, 1] is a discount parameter. 
Monte Carlo algorithms (for episodic tasks) do not use intermediate estimates but instead use the full return,

$$R^{\text{MC}}_{s_t} = \sum_{i=0}^{L-1} \gamma^i r_{t+i},$$

for an episode L transitions in length after time t (we assume that L is finite). These two types of return estimates can be considered instances of the more general notion of an n-step return,

$$R^{(n)}_{s_t} = \left( \sum_{i=0}^{n-1} \gamma^i r_{t+i} \right) + \gamma^n \hat{V}(s_{t+n}),$$

for n ≥ 1. Here, n transitions are observed from the MDP and the remaining portion of return is estimated using the estimate of the value function. Since $s_{t+L}$ is a state that occurs after the end of an episode, we assume that $\hat{V}(s_{t+L}) = 0$, always.

A complex return is a weighted average of the 1, . . . , L step returns:

$$R^{\dagger}_{s_t} = \sum_{n=1}^{L} w_\dagger(n, L)\, R^{(n)}_{s_t}, \qquad (1)$$

where $w_\dagger(n, L)$ are weights and † ∈ {λ, γ, Ω} will be used to specify the weighting schemes of different approaches. The question that this paper proposes an answer to is: what weighting scheme will produce the best estimates of the true expected return?

The λ-return, $R^{\lambda}_{s_t}$, is the weighting scheme that is used by the entire family of TD(λ) algorithms [5]. It uses a parameter λ ∈ [0, 1] that determines how the weight given to a return decreases as the length of the return increases:

$$w_\lambda(n, L) = \begin{cases} (1-\lambda)\lambda^{n-1} & \text{if } n < L \\ 1 - \sum_{i=1}^{n-1} w_\lambda(i, L) & \text{if } n = L. \end{cases}$$

When λ = 0, $R^{\lambda}_{s_t} = R^{\text{TD}}_{s_t}$, which has low variance but high bias. When λ = 1, $R^{\lambda}_{s_t} = R^{\text{MC}}_{s_t}$, which has high variance but is unbiased. 
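The n-step returns for one episode can all be computed in a single pass. Below is a minimal numpy sketch (the function names are ours, not the paper's), assuming the rewards $r_{t+i}$ and the bootstrapped values $\hat{V}(s_{t+n})$ are given as arrays, with the value of the post-terminal state taken to be 0:

```python
import numpy as np

def n_step_returns(rewards, values_next, gamma):
    """All n-step returns R^(n)_{s_t} for one episode, n = 1..L.

    rewards[i] is r_{t+i}; values_next[n-1] is V_hat(s_{t+n}),
    with V_hat of the post-terminal state equal to 0.
    """
    L = len(rewards)
    discounts = gamma ** np.arange(L)
    partial_sums = np.cumsum(discounts * rewards)           # sum_{i=0}^{n-1} gamma^i r_{t+i}
    bootstrap = gamma ** np.arange(1, L + 1) * values_next  # gamma^n V_hat(s_{t+n})
    return partial_sums + bootstrap

def complex_return(returns, weights):
    """Weighted average of the 1..L step returns, as in (1)."""
    return float(np.dot(weights, returns))
```

Any of the weighting schemes below (λ, γ, or Ω) can then be plugged into `complex_return`.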
Intermediate values of λ blend the high-bias but low-variance estimates from short returns with the low-bias but high-variance estimates from the longer returns.

The success of the λ-return is largely due to its simplicity—TD(λ) using linear function approximation has per-time-step time complexity linear in the number of features. However, this efficiency comes at a cost: the λ-return is not founded on a principled statistical derivation.¹ Konidaris et al. [4] remedied this recently by showing that the λ-return is the maximum likelihood estimator of $V(s_t)$ given three assumptions. Specifically, $R^{\lambda}_{s_t} \in \arg\max_{x\in\mathbb{R}} \Pr(R^{(1)}_{s_t}, \ldots, R^{(L)}_{s_t} \mid V(s_t) = x)$ if

Assumption 1 (Independence). $R^{(1)}_{s_t}, R^{(2)}_{s_t}, \ldots, R^{(L)}_{s_t}$ are independent random variables.
Assumption 2 (Unbiased Normal Estimators). $R^{(n)}_{s_t}$ is normally distributed with mean $\mathbb{E}[R^{(n)}_{s_t}] = V(s_t)$ for all n.
Assumption 3 (Geometric Variance). $\mathrm{Var}(R^{(n)}_{s_t}) \propto 1/\lambda^n$.

Although this result provides a theoretical foundation for the λ-return, it is based on three typically false assumptions: the returns are highly correlated, only the Monte Carlo return is unbiased, and the variance of the n-step returns from each state does not usually increase geometrically. This suggests three areas where the λ-return might be improved—it could be modified to better account for the correlation of returns, the bias of the different returns, and the true form of $\mathrm{Var}(R^{(n)}_{s_t})$.

The γ-return uses an approximate formula for the variance of an n-step return in place of Assumption 3. 
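The λ-return weighting scheme is simple to implement; a minimal numpy sketch (function name ours), which puts $(1-\lambda)\lambda^{n-1}$ on each return shorter than L and the remaining mass on the Monte Carlo return:

```python
import numpy as np

def lambda_weights(lam, L):
    """Weights w_lambda(n, L): (1 - lam) * lam**(n-1) for n < L,
    with the leftover probability mass assigned to n = L."""
    w = (1 - lam) * lam ** np.arange(L - 1)   # n = 1, ..., L-1
    return np.append(w, 1.0 - w.sum())        # n = L gets the remainder
```

As in the text, λ = 0 puts all weight on the one-step TD return and λ = 1 puts all weight on the Monte Carlo return.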
This allows the γ-return to better account for how the variance of returns increases with their length, while simultaneously removing the need for the λ parameter. The γ-return is given by the weighting scheme:

$$w_\gamma(n, L) = \frac{\left( \sum_{i=1}^{n} \gamma^{2(i-1)} \right)^{-1}}{\sum_{\hat{n}=1}^{L} \left( \sum_{i=1}^{\hat{n}} \gamma^{2(i-1)} \right)^{-1}}.$$

¹To be clear: there is a wealth of theoretical and empirical analyses of algorithms that use the λ-return. Until recently there was not a derivation of the λ-return as the estimator of $V(s_t)$ that optimizes some objective (e.g., maximizes log likelihood or minimizes expected squared error).

3 The Ω-Return

We propose a new complex return, the Ω-return, that improves upon the λ and γ returns by accounting for the correlations of the returns. To emphasize this problem, notice that $R^{(20)}_{s_t}$ and $R^{(21)}_{s_t}$ will be almost identical (perfectly correlated) for many MDPs (particularly when γ is small). This means that Assumption 1 is particularly egregious, and suggests that a new complex return might improve upon the λ and γ-returns by properly accounting for the correlation of returns.

We formulate the problem of how best to combine different length returns to estimate the true expected return as a linear regression problem. This reformulation allows us to leverage the well-understood properties of linear regression algorithms. Consider a regression problem with L points, $\{(x_i, y_i)\}_{i=1}^{L}$, where the value of $y_i$ depends on the value of $x_i$. The goal is to predict $y_i$ given $x_i$. We set $x_i = 1$ and $y_i = R^{(i)}_{s_t}$. We can then construct the design matrix (a vector in this case), $x = \mathbf{1} = [1, \ldots, 1]^\top \in \mathbb{R}^L$, and the response vector, $y = [R^{(1)}_{s_t}, R^{(2)}_{s_t}, \ldots, R^{(L)}_{s_t}]^\top$. We seek a regression coefficient, $\hat{\beta} \in \mathbb{R}$, such that $y \approx x\hat{\beta}$. 
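The γ-return weighting normalizes the inverse cumulative sums of $\gamma^{2(i-1)}$; a minimal numpy sketch (function name ours):

```python
import numpy as np

def gamma_weights(gamma, L):
    """Weights w_gamma(n, L): inverse cumulative sums of gamma^(2(i-1)),
    normalized to sum to 1."""
    csum = np.cumsum(gamma ** (2 * np.arange(L)))  # sum_{i=1}^{n} gamma^(2(i-1))
    inv = 1.0 / csum
    return inv / inv.sum()
```

Note that no free parameter analogous to λ remains: the weights depend only on γ and L.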
This $\hat{\beta}$ will be our estimate of the true expected return.

Generalized least squares (GLS) is a method for selecting $\hat{\beta}$ when the $y_i$ are not necessarily independent and may have different variances. Specifically, if we use a linear model with (possibly correlated) mean-zero noise to model the data, i.e., $y = x\beta + \epsilon$, where $\beta \in \mathbb{R}$ is unknown, $\epsilon$ is a random vector, $\mathbb{E}[\epsilon] = 0$, and $\mathrm{Var}(\epsilon|x) = \Omega$, then the GLS estimator

$$\hat{\beta} = (x^\top \Omega^{-1} x)^{-1} x^\top \Omega^{-1} y, \qquad (2)$$

is the best linear unbiased estimator (BLUE) for β [6]—the linear unbiased estimator with the lowest possible variance.

In our setting the assumptions about the true model that produced the data become that $[R^{(1)}_{s_t}, R^{(2)}_{s_t}, \ldots, R^{(L)}_{s_t}]^\top = [V(s_t), V(s_t), \ldots, V(s_t)]^\top + \epsilon$, where $\mathbb{E}[\epsilon] = 0$ (i.e., the returns are all unbiased estimates of the true expected return) and $\mathrm{Var}(\epsilon|x) = \Omega$. Since $x = \mathbf{1}$ in our case, $\mathrm{Var}(\epsilon|x)(i, j) = \mathrm{Cov}(R^{(i)}_{s_t} - V(s_t), R^{(j)}_{s_t} - V(s_t)) = \mathrm{Cov}(R^{(i)}_{s_t}, R^{(j)}_{s_t})$, where $\mathrm{Var}(\epsilon|x)(i, j)$ denotes the element of $\mathrm{Var}(\epsilon|x)$ in the ith row and jth column.

So, using only Assumption 2, GLS ((2), solved for $\hat{\beta}$) gives us the complex return:

$$\hat{\beta} = \left( \mathbf{1}^\top \Omega^{-1} \mathbf{1} \right)^{-1} \mathbf{1}^\top \Omega^{-1} \begin{bmatrix} R^{(1)}_{s_t} \\ R^{(2)}_{s_t} \\ \vdots \\ R^{(L)}_{s_t} \end{bmatrix} = \frac{\sum_{n,m=1}^{L} \Omega^{-1}(n,m)\, R^{(n)}_{s_t}}{\sum_{n,m=1}^{L} \Omega^{-1}(n,m)},
which can be written in the form of (1) with weights:

$$w_\Omega(n, L) = \frac{\sum_{m=1}^{L} \Omega^{-1}(n, m)}{\sum_{\hat{n},m=1}^{L} \Omega^{-1}(\hat{n}, m)}, \qquad (3)$$

where Ω is an L × L matrix with $\Omega(i, j) = \mathrm{Cov}(R^{(i)}_{s_t}, R^{(j)}_{s_t})$.

Notice that the Ω-return is a generalization of the λ and γ returns. The λ-return can be obtained by reintroducing the false assumption that the returns are independent and that their variance grows geometrically, i.e., by making Ω a diagonal matrix with $\Omega_{n,n} = \lambda^{-n}$. Similarly, the γ-return can be obtained by making Ω a diagonal matrix with $\Omega_{n,n} = \sum_{i=1}^{n} \gamma^{2(i-1)}$.

Notice that $R^{\Omega}_{s_t}$ is a BLUE of $V(s_t)$ if Assumption 2 holds. Since Assumption 2 does not hold, the Ω-return is not an unbiased estimator of $V(s_t)$. Still, we expect it to outperform the λ and γ-returns because it accounts for the correlation of n-step returns and they do not. However, in some cases it may perform worse because it is still based on the false assumption that all of the returns are unbiased estimators of $V(s_t)$. Furthermore, given Assumption 2, there may be biased estimators of $V(s_t)$ that have lower expected mean squared error than a BLUE (which must be unbiased).

4 Approximating the Ω-Return

In practice the covariance matrix, Ω, is unknown and must be approximated from data. This approach, known as feasible generalized least squares (FGLS), can perform worse than ordinary least squares given insufficient data to accurately estimate Ω. 
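Once Ω is in hand, the weights in (3) reduce to summing the rows of $\Omega^{-1}$ and normalizing; a minimal numpy sketch (function name ours):

```python
import numpy as np

def omega_weights(Omega):
    """Weights w_Omega(n, L) from (3): row sums of Omega^{-1},
    normalized so the weights sum to 1."""
    row_sums = np.linalg.inv(Omega).sum(axis=1)  # sum_m Omega^{-1}(n, m)
    return row_sums / row_sums.sum()
```

With Ω = I the weights are uniform, and with any diagonal Ω the weight on $R^{(n)}$ is proportional to $1/\Omega(n,n)$, which is exactly how the γ-return (and, up to its treatment of the final return, the λ-return) arises as a special case.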
We must therefore accurately approximate Ω from small amounts of data.

To study the accuracy of covariance matrix estimates, we estimated Ω using a large number of trajectories for four different domains: a 5 × 5 gridworld, a variant of the canonical mountain car domain, a real-world digital marketing problem, and a continuous control problem (DAS1), all of which are described in more detail in subsequent experiments. The covariance matrix estimates are depicted in Figures 1(a), 2(a), 3(a), and 4(a). We do not specify rows and columns in the figures because all covariance matrices and estimates thereof are symmetric. Because they were computed from a very large number of trajectories, we will treat them as ground truth.

We must estimate the Ω-return when only a few trajectories are available. Figures 1(b), 2(b), 3(b), and 4(b) show direct empirical estimates of the covariance matrices using only a few trajectories. These empirical approximations are poor due to the very limited amount of data, except for the digital marketing domain, where a "few" trajectories means 10,000. The solid black entries in Figures 1(f), 2(f), 3(f), and 4(f) show the weights, $w_\Omega(n, L)$, on different length returns when using different estimates of Ω. The noise in the direct empirical estimate of the covariance matrix using only a few trajectories leads to poor estimates of the return weights.

When approximating Ω from a small number of trajectories, we must be careful to avoid this overfitting of the available data. One way to do this is to assume a compact parametric model for Ω. Below we describe a parametric model of Ω that has only four parameters, regardless of L (which determines the size of Ω). 
We use this parametric model in our experiments as a proof of concept—we show that the Ω-return using even this simple estimate of Ω can produce improved results over the other existing complex returns. We do not claim that this scheme for estimating Ω is particularly principled or noteworthy.

4.1 Estimating Off-Diagonal Entries of Ω

Notice in Figures 1(a), 2(a), 3(a), and 4(a) that for j > i, $\mathrm{Cov}(R^{(i)}_{s_t}, R^{(j)}_{s_t}) \approx \mathrm{Cov}(R^{(i)}_{s_t}, R^{(i)}_{s_t}) = \mathrm{Var}(R^{(i)}_{s_t})$. This structure would mean that we can fill in Ω given its diagonal values, leaving only L parameters. We now explain why this relationship is reasonable in general, and not just an artifact of our domains. We can write each entry in Ω as a recurrence relation:

$$\mathrm{Cov}[R^{(i)}_{s_t}, R^{(j)}_{s_t}] = \mathrm{Cov}[R^{(i)}_{s_t}, R^{(j-1)}_{s_t} + \gamma^{j-1}(r_{t+j} + \gamma \hat{V}(s_{t+j}) - \hat{V}(s_{t+j-1}))] = \mathrm{Cov}[R^{(i)}_{s_t}, R^{(j-1)}_{s_t}] + \gamma^{j-1} \mathrm{Cov}[R^{(i)}_{s_t}, r_{t+j} + \gamma \hat{V}(s_{t+j}) - \hat{V}(s_{t+j-1})],$$

when i < j. The term $r_{t+j} + \gamma \hat{V}(s_{t+j}) - \hat{V}(s_{t+j-1})$ is the temporal difference error j steps in the future. The proposed assumption that $\mathrm{Cov}(R^{(i)}_{s_t}, R^{(j)}_{s_t}) \approx \mathrm{Var}(R^{(i)}_{s_t})$ is equivalent to assuming that the covariance of this temporal difference error and the i-step return is negligible: $\gamma^{j-1} \mathrm{Cov}[R^{(i)}_{s_t}, r_{t+j} + \gamma \hat{V}(s_{t+j}) - \hat{V}(s_{t+j-1})] \approx 0$. The approximate independence of these two terms is reasonable in general due to the Markov property, which ensures that at least the conditional covariance, $\mathrm{Cov}[R^{(i)}_{s_t}, r_{t+j} + \gamma \hat{V}(s_{t+j}) - \hat{V}(s_{t+j-1}) \mid s_t]$, is zero.

Because this relationship is not exact, the off-diagonal entries tend to grow as they get farther from the diagonal. 
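Under the approximation just described, Ω can be filled in from its diagonal alone, since $\Omega(i, j) \approx \mathrm{Var}(R^{(\min(i,j))}_{s_t})$. A minimal sketch (function name ours), assuming the diagonal is given:

```python
import numpy as np

def fill_from_diagonal(diag):
    """Build a symmetric Omega from its diagonal using
    Cov(R^(i), R^(j)) ~ Var(R^(min(i, j)))."""
    diag = np.asarray(diag, dtype=float)
    idx = np.arange(len(diag))
    return diag[np.minimum.outer(idx, idx)]
```

This does not yet include the correction for the final row and column discussed in the text, which reintroduces independence of the Monte Carlo return.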
However, especially when some trajectories are padded with absorbing states, this relationship is quite accurate when j = L, since the temporal difference errors at the absorbing state are all zero, and $\mathrm{Cov}[R^{(i)}_{s_t}, 0] = 0$. This results in a significant difference between $\mathrm{Cov}[R^{(i)}_{s_t}, R^{(L-1)}_{s_t}]$ and $\mathrm{Cov}[R^{(i)}_{s_t}, R^{(L)}_{s_t}]$. Rather than try to model this drop, which can influence the weights significantly, we reintroduce the assumption that the Monte Carlo return is independent of the other returns, making the off-diagonal elements of the last row and column zero.

[Figure 1: Gridworld Results. Panels: (a) empirical Ω from 1 million trajectories; (b) empirical Ω from 5 trajectories; (c) approximate Ω from 1 million trajectories; (d) approximate Ω from 5 trajectories; (e) approximate and empirical diagonals of Ω; (f) approximate and empirical weights for each return; (g) mean squared error from five trajectories (selected labels: λ=0.8, 3.19055; App Ω, 1.95394).]

[Figure 2: Mountain Car Results. Panels (a)–(g) as in Figure 1, with the small-sample estimates computed from 2 trajectories; panel (g) compares IS, WIS, λ ∈ {0, 0.1, ..., 1}, γ, Emp Ω, and App Ω (selected labels: WIS, 144.48; App Ω, 76.39).]

[Figure 3: Digital Marketing Results. Panels (a)–(g) as in Figure 1, with the small-sample estimates computed from 10,000 trajectories (selected panel (g) labels: λ=0, 0.0011; Emp Ω, 0.0007; App Ω, 0.0011).]

[Figure 4: Functional Electrical Stimulation Results. Panels (a)–(g) as in Figure 1, with the large-sample estimates computed from 10,000 trajectories and the small-sample estimates from 10 trajectories (selected panel (g) labels: λ=0, 3.2102; λ=1, 3.47436; App Ω, 3.10701).]

4.2 Estimating Diagonal Entries of Ω

The remaining question is how best to approximate the diagonal of Ω from a very small number of trajectories. Consider the solid and dotted black curves in Figures 1(e), 2(e), 3(e), and 4(e), which depict the diagonals of Ω when estimated from either a large number or small number of trajectories. When using only a few trajectories, the diagonal includes fluctuations that can have significant impacts on the resulting weights. However, when using many trajectories (which we treat as giving ground truth), the diagonal tends to be relatively smooth and monotonically increasing until it plateaus (ignoring the final entry).

This suggests using a smooth parametric form to approximate the diagonal, which we do as follows. Let $v_i$ denote the sample variance of $R^{(i)}_{s_t}$ for i = 1, . . . , L. Let $v_+$ be the largest sample variance: $v_+ = \max_{i\in\{1,\ldots,L\}} v_i$. We parameterize the diagonal using four parameters, $k_1$, $k_2$, $v_+$, and $v_L$:

$$\hat{\Omega}_{k_1,k_2,v_+,v_L}(i, i) = \begin{cases} k_1 & \text{if } i = 1 \\ v_L & \text{if } i = L \\ \min\{v_+,\, k_1 k_2^{(1-i)}\} & \text{otherwise.} \end{cases}$$

$\hat{\Omega}(1, 1) = k_1$ sets the initial variance, and $v_L$ is the variance of the Monte Carlo return. The parameter $v_+$ enforces a ceiling on the variance of the i-step return, and $k_2$ captures the growth rate of the variance, much like λ. 
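The four-parameter diagonal model can be sketched as follows. This is our reading of the garbled formula: we take the exponent to be $(1-i)$, so that with $k_2 \in [0, 1]$ the model grows from $k_1$ at i = 1 until capped at $v_+$; the grid search standing in for the $k_1$, $k_2$ fit is likewise our arbitrary choice, not the paper's procedure:

```python
import numpy as np

def omega_diag(k1, k2, v_plus, v_L, L):
    """Four-parameter model of the diagonal of Omega.

    Assumption: the growth term is k1 * k2**(1 - i), increasing
    from k1 at i = 1 when k2 is in (0, 1], capped at v_plus."""
    d = np.minimum(v_plus, k1 * k2 ** (1.0 - np.arange(1, L + 1)))
    d[0], d[-1] = k1, v_L
    return d

def fit_diag(v):
    """Fit k1, k2 to the sample variances v by a crude grid search,
    respecting k2 in [0, 1] and 0 <= k1 <= v_plus; any optimizer would do."""
    v = np.asarray(v, dtype=float)
    v_plus, v_L, L = v.max(), v[-1], len(v)
    grid = np.linspace(0.01, 1.0, 50)
    k1, k2 = min(((a * v_plus, b) for a in grid for b in grid),
                 key=lambda p: np.mean((omega_diag(p[0], p[1], v_plus, v_L, L) - v) ** 2))
    return omega_diag(k1, k2, v_plus, v_L, L)
```

The fitted diagonal can then be expanded into a full Ω with the off-diagonal rule of Section 4.1, zeroing the last row and column off the diagonal.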
We select the $k_1$ and $k_2$ that minimize the mean squared error between $\hat{\Omega}(i, i)$ and $v_i$, and set $v_+$ and $v_L$ directly from the data.²

This reduces the problem of estimating Ω, an L × L matrix, to estimating four numbers from return data. Consider Figures 1(c), 2(c), 3(c), and 4(c), which depict $\hat{\Omega}$ as computed from many trajectories. The differences between these estimates and the ground truth show that this parameterization is not perfect, as we cannot represent the true Ω exactly. However, the estimate is reasonable and the resulting weights (solid red) are visually similar to the ground truth weights (solid black) in Figures 1(f), 2(f), 3(f), and 4(f). We can now get accurate estimates of Ω from very few trajectories. Figures 1(d), 2(d), 3(d), and 4(d) show $\hat{\Omega}$ when computed from only a few trajectories. Note their similarity to $\hat{\Omega}$ when using a large number of trajectories, and that the resulting weights (unfilled red in Figures 1(f), 2(f), 3(f), and 4(f)) are similar to those obtained using many more trajectories (the filled red bars).

Pseudocode for approximating the Ω-return is provided in Algorithm 1. Unlike the λ-return, which can be computed from a single trajectory, the Ω-return requires a set of trajectories in order to estimate Ω. The pseudocode assumes that every trajectory is of length L, which can be achieved by padding shorter trajectories with absorbing states.

²We include the constraints that $k_2 \in [0, 1]$ and $0 \leq k_1 \leq v_+$.

Algorithm 1: Computing the Ω-return.
Require: n trajectories beginning at s and of length L.
1. Compute $R^{(i)}_s$ for i = 1, . . . , L and for each trajectory.
2. Compute the sample variances, $v_i = \mathrm{Var}(R^{(i)}_s)$, for i = 1, . . . , L.
3. Set $v_+ = \max_{i\in\{1,\ldots,L\}} v_i$.
4. Search for the $k_1$ and $k_2$ that minimize the mean squared error between $v_i$ and $\hat{\Omega}_{k_1,k_2,v_+,v_L}(i, i)$ for i = 1, . . . , L.
5. Fill the diagonal of the L × L matrix, Ω, with $\Omega(i, i) = \hat{\Omega}_{k_1,k_2,v_+,v_L}(i, i)$, using the optimized $k_1$ and $k_2$.
6. Fill all of the other entries with $\Omega(i, j) = \Omega(i, i)$ where j > i. If (i = L or j = L) and i ≠ j then set $\Omega(i, j) = 0$ instead.
7. Compute the weights for the returns according to (3).
8. Compute the Ω-return for each trajectory according to (1).

5 Experiments

Approximations of the Ω-return could, in principle, replace the λ-return in the whole family of TD(λ) algorithms. However, using the Ω-return for TD(λ) raises several interesting questions that are beyond the scope of this initial work (e.g., is there a linear-time way to estimate the Ω-return? Since a different Ω is needed for every state, how can the Ω-return be used with function approximation where most states will never be revisited?). We therefore focus on the specific problem of off-policy policy evaluation—estimating the performance of a policy using trajectories generated by a possibly different policy. This problem is of interest for applications that require the evaluation of a proposed policy using historical data.

Due to space constraints, we relegate the details of our experiments to the appendix in the supplemental documents. However, the results of the experiments are clear—Figures 1(g), 2(g), 3(g), and 4(g) show the mean squared error (MSE) of value estimates when using various methods.³ Notice that, for all domains, using the Ω-return (the Emp Ω and App Ω labels) results in lower MSE than the γ-return and the λ-return with any setting of λ.

6 Conclusions

Recent work has begun to explore the statistical basis of complex estimates of return, and how we might reformulate them to be more statistically efficient [4]. 
We have proposed a return estimator\nthat improves upon the \u03bb and \u03b3-returns by accounting for the covariance of return estimates. Our\nresults show that understanding and exploiting the fact that in control settings\u2014unlike in standard\nsupervised learning\u2014observed samples are typically neither independent nor identically distributed,\ncan substantially improve data ef\ufb01ciency in an algorithm of signi\ufb01cant practical importance.\nMany (largely positive) theoretical properties of the \u03bb-return and TD(\u03bb) have been discovered over\nthe past few decades. This line of research into other complex returns is still in its infancy, and\nso there are many open questions. For example, can the \u2126-return be improved upon by removing\nAssumption 2 or by keeping Assumption 2 but using a biased estimator (not a BLUE)? Is there\na method for approximating the \u2126-return that allows for value function approximation with the\nsame time complexity as TD(\u03bb), or which better leverages our knowledge that the environment is\nMarkovian? Would TD(\u03bb) using the \u2126-return be convergent in the same settings as TD(\u03bb)? While\nwe hope to answer these questions in future work, it is also our hope that this work will inspire other\nresearchers to revisit the problem of constructing a statistically principled complex return.\n\n3To compute the MSE we used a large number of Monte Carlo rollouts to estimate the true value of each\n\npolicy.\n\n8\n\n\fReferences\n[1] R.S. Sutton. Learning to predict by the methods of temporal differences. Machine Learning, 3(1):9\u201344,\n\n1988.\n\n[2] S.J. Bradtke and A.G. Barto. Linear least-squares algorithms for temporal difference learning. Machine\n\nLearning, 22(1-3):33\u201357, March 1996.\n\n[3] C. Downey and S. Sanner. Temporal difference Bayesian model averaging: A Bayesian perspective on\nadapting lambda. 
In Proceedings of the 27th International Conference on Machine Learning, pages 311\u2013\n318, 2010.\n\n[4] G.D. Konidaris, S. Niekum, and P.S. Thomas. TD\u03b3: Re-evaluating complex backups in temporal differ-\n\nence learning. In Advances in Neural Information Processing Systems 24, pages 2402\u20132410, 2011.\n\n[5] R. S. Sutton and A. G. Barto. Reinforcement Learning: An Introduction. MIT Press, Cambridge, MA,\n\n1998.\n\n[6] T. Kariya and H. Kurata. Generalized Least Squares. Wiley, 2004.\n[7] D. Precup, R. S. Sutton, and S. Singh. Eligibility traces for off-policy policy evaluation. In Proceedings\n\nof the 17th International Conference on Machine Learning, pages 759\u2013766, 2000.\n\n[8] A. R. Mahmood, H. Hasselt, and R. S. Sutton. Weighted importance sampling for off-policy learning with\n\nlinear function approximation. In Advances in Neural Information Processing Systems 27, 2014.\n\n[9] J. R. Tetreault and D. J. Litman. Comparing the utility of state features in spoken dialogue using rein-\nforcement learning. In Proceedings of the Human Language Technology/North American Association for\nComputational Linguistics, 2006.\n\n[10] G. D. Konidaris, S. Osentoski, and P. S. Thomas. Value function approximation in reinforcement learning\nusing the Fourier basis. In Proceedings of the Twenty-Fifth Conference on Arti\ufb01cial Intelligence, pages\n380\u2013395, 2011.\n\n[11] G. Theocharous and A. Hallak. Lifetime value marketing using reinforcement learning.\n\nMultidisciplinary Conference on Reinforcement Learning and Decision Making, 2013.\n\nIn The 1st\n\n[12] P. S. Thomas, G. Theocharous, and M. Ghavamzadeh. High con\ufb01dence off-policy evaluation. In Pro-\n\nceedings of the Twenty-Ninth Conference on Arti\ufb01cial Intelligence, 2015.\n\n[13] D. Blana, R. F. Kirsch, and E. K. Chadwick. Combined feedforward and feedback control of a redundant,\nnonlinear, dynamic musculoskeletal system. 
Medical and Biological Engineering and Computing, 47:\n533\u2013542, 2009.\n\n[14] P. S. Thomas, M. S. Branicky, A. J. van den Bogert, and K. M. Jagodnik. Application of the actor-critic\narchitecture to functional electrical stimulation control of a human arm. In Proceedings of the Twenty-\nFirst Innovative Applications of Arti\ufb01cial Intelligence, pages 165\u2013172, 2009.\n\n[15] P. M. Pilarski, M. R. Dawson, T. Degris, F. Fahimi, J. P. Carey, and R. S. Sutton. Online human training\nof a myoelectric prosthesis controller via actor-critic reinforcement learning. In Proceedings of the 2011\nIEEE International Conference on Rehabilitation Robotics, pages 134\u2013140, 2011.\n\n[16] K. Jagodnik and A. van den Bogert. A proportional derivative FES controller for planar arm movement.\n\nIn 12th Annual Conference International FES Society, Philadelphia, PA, 2007.\n\n[17] N. Hansen and A. Ostermeier. Completely derandomized self-adaptation in evolution strategies. Evolu-\n\ntionary Computation, 9(2):159\u2013195, 2001.\n\n9\n\n\f", "award": [], "sourceid": 198, "authors": [{"given_name": "Philip", "family_name": "Thomas", "institution": "University of Massachusetts Amherst, Carnegie Mellon University"}, {"given_name": "Scott", "family_name": "Niekum", "institution": "UT Austin"}, {"given_name": "Georgios", "family_name": "Theocharous", "institution": "Adobe"}, {"given_name": "George", "family_name": "Konidaris", "institution": "Duke"}]}