{"title": "Sketch-Based Linear Value Function Approximation", "book": "Advances in Neural Information Processing Systems", "page_first": 2213, "page_last": 2221, "abstract": "Hashing is a common method to reduce large, potentially infinite feature vectors to a fixed-size table. In reinforcement learning, hashing is often used in conjunction with tile coding to represent states in continuous spaces. Hashing is also a promising approach to value function approximation in large discrete domains such as Go and Hearts, where feature vectors can be constructed by exhaustively combining a set of atomic features. Unfortunately, the typical use of hashing in value function approximation results in biased value estimates due to the possibility of collisions. Recent work in data stream summaries has led to the development of the tug-of-war sketch, an unbiased estimator for approximating inner products. Our work investigates the application of this new data structure to linear value function approximation. Although in the reinforcement learning setting the use of the tug-of-war sketch leads to biased value estimates, we show that this bias can be orders of magnitude less than that of standard hashing. We provide empirical results on two RL benchmark domains and fifty-five Atari 2600 games to highlight the superior learning performance of tug-of-war hashing.", "full_text": "Sketch-Based Linear Value Function Approximation\n\nMarc G. Bellemare\nUniversity of Alberta\n\nJoel Veness\n\nUniversity of Alberta\n\nMichael Bowling\n\nUniversity of Alberta\n\nmg17@cs.ualberta.ca\n\nveness@cs.ualberta.ca\n\nbowling@cs.ualberta.ca\n\nAbstract\n\nHashing is a common method to reduce large, potentially in\ufb01nite feature vectors\nto a \ufb01xed-size table. In reinforcement learning, hashing is often used in conjunc-\ntion with tile coding to represent states in continuous spaces. 
Hashing is also a promising approach to value function approximation in large discrete domains such as Go and Hearts, where feature vectors can be constructed by exhaustively combining a set of atomic features. Unfortunately, the typical use of hashing in value function approximation results in biased value estimates due to the possibility of collisions. Recent work in data stream summaries has led to the development of the tug-of-war sketch, an unbiased estimator for approximating inner products. Our work investigates the application of this new data structure to linear value function approximation. Although in the reinforcement learning setting the use of the tug-of-war sketch leads to biased value estimates, we show that this bias can be orders of magnitude less than that of standard hashing. We provide empirical results on two RL benchmark domains and fifty-five Atari 2600 games to highlight the superior learning performance obtained when using tug-of-war hashing.

1 Introduction

Recent value-based reinforcement learning applications have shown the benefit of exhaustively generating features, both in discrete and continuous state domains. In discrete domains, exhaustive feature generation combines atomic features into logical predicates. In the game of Go, Silver et al. [19] showed that good features could be generated by enumerating all stone patterns up to a certain size. Sturtevant and White [21] similarly obtained promising reinforcement learning results using a feature generation method that enumerated all 2-, 3- and 4-wise combinations of a set of 60 atomic features.
In continuous-state RL domains, tile coding [23] is a canonical example of exhaustive feature generation; tile coding has been successfully applied to benchmark domains [22], to learn to play keepaway soccer [20], in multiagent robot learning [4], to train bipedal robots to walk [18, 24] and to learn mixed strategies in the game of Goofspiel [3].
Exhaustive feature generation, however, can result in feature vectors that are too large to be represented in memory, especially when applied to continuous spaces. Although such feature vectors are too large to be represented explicitly, in many domains of interest they are also sparse. For example, most stone patterns are absent from any particular Go position. Given a fixed memory budget, the standard approach is to hash features into a fixed-size table, with collisions implicitly handled by the learning algorithm; all but one of the applications discussed above use some form of hashing. With respect to its typical use for linear value function approximation, hashing lacks theoretical guarantees. In order to improve on the basic hashing idea, we turn to sketches: state-of-the-art methods for approximately storing large vectors [6]. Our goal is to show that one such sketch, the tug-of-war sketch [7], is particularly well-suited for linear value function approximation. Our work is related to recent developments on the use of random projections in reinforcement learning [11] and least-squares regression [16, 10]. Hashing, however, possesses a computational advantage over traditional random projections: each feature is hashed exactly once. In comparison, even sparse random projection methods [1, 14] carry a per-feature cost that increases with the size of the reduced space.
Tug-of-war hashing seeks to retain the computational efficiency that makes hashing a practical method for linear value function approximation on large feature spaces, while preserving the theoretical appeal of random projection methods.
A natural concern when using hashing in RL is that hash collisions irremediably degrade learning. In this paper we argue that tug-of-war hashing addresses this concern by providing us with a low-error approximation of large feature vectors at a fraction of the memory cost. To quote Sutton and Barto [23], "Hashing frees us from the curse of dimensionality in the sense that memory requirements need not be exponential in the number of dimensions, but need merely match the real demands of the task."

2 Background

We consider the reinforcement learning framework of Sutton and Barto [23]. An MDP M is a tuple ⟨S, A, P, R, γ⟩, where S is the set of states, A is the set of actions, P : S × A × S → [0, 1] is the transition probability function, R : S × A → R is the reward function and γ ∈ [0, 1] is the discount factor. At time step t the agent observes state s_t ∈ S, selects an action a_t ∈ A and receives a reward r_t := R(s_t, a_t). The agent then observes the new state s_{t+1} distributed according to P(·|s_t, a_t). From state s_t, the agent's goal is to maximize the expected discounted sum of future rewards E[Σ_{i=0}^∞ γ^i R(s_{t+i}, a_{t+i})]. A typical approach is to learn state-action values Q^π(s, a), where the stationary policy π : S × A → [0, 1] represents the agent's behaviour. Q^π(s, a) is recursively defined as:

Q^π(s, a) := R(s, a) + γ E_{s'∼P(·|s,a)} [ Σ_{a'∈A} π(a'|s') Q^π(s', a') ]    (1)

A special case of this equation is the optimal value function Q*(s, a) := R(s, a) + γ E_{s'}[max_{a'} Q*(s', a')]. The optimal value function corresponds to the value under an optimal policy π*. For a fixed π, the SARSA(λ) algorithm [23] learns Q^π from sample transitions (s_t, a_t, r_t, s_{t+1}, a_{t+1}). In domains where S is large (or infinite), learning Q^π exactly is impractical and one must rely on value function approximation. A common value function approximation scheme in reinforcement learning is linear approximation. Given φ : S × A → R^n mapping state-action pairs to feature vectors, we represent Q^π with the linear approximation Q_t(s, a) := θ_t · φ(s, a), where θ_t ∈ R^n is a weight vector. The gradient descent SARSA(λ) update is defined as:

δ_t ← r_t + γ θ_t · φ(s_{t+1}, a_{t+1}) − θ_t · φ(s_t, a_t)
e_t ← γλ e_{t−1} + φ(s_t, a_t)
θ_{t+1} ← θ_t + α δ_t e_t ,    (2)

where α ∈ [0, 1] is a step-size parameter and λ ∈ [0, 1] controls the degree to which changes in the value function are propagated back in time. Throughout the rest of this paper Q^π(s, a) refers to the exact value function computed from Equation 1 and we use Q_t(s, a) to refer to the linear approximation θ_t · φ(s, a); "gradient descent SARSA(λ) with linear approximation" is always implied when referring to SARSA(λ).
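As a concrete reference point, the update in Equation 2 can be written in a few lines of NumPy. This is a minimal sketch under our own naming (none of these identifiers come from the paper), using dense vectors for clarity:

```python
import numpy as np

def sarsa_lambda_update(theta, trace, phi, phi_next, r, alpha, gamma, lam):
    """One gradient-descent SARSA(lambda) update (Equation 2).

    theta: weight vector in R^n; trace: eligibility trace e_t in R^n;
    phi, phi_next: feature vectors for (s_t, a_t) and (s_{t+1}, a_{t+1}).
    """
    delta = r + gamma * theta.dot(phi_next) - theta.dot(phi)  # TD error delta_t
    trace = gamma * lam * trace + phi                          # decay trace, add current features
    theta = theta + alpha * delta * trace                      # gradient step on the weights
    return theta, trace

# Toy usage: one-hot features, a single transition with reward 1.
theta = np.zeros(4)
trace = np.zeros(4)
phi = np.zeros(4); phi[2] = 1.0
phi_next = np.zeros(4); phi_next[0] = 1.0
theta, trace = sarsa_lambda_update(theta, trace, phi, phi_next,
                                   r=1.0, alpha=0.1, gamma=0.99, lam=0.9)
```

Note that with λ = 0 the trace reduces to the current feature vector, so the update touches only the components active in φ(s_t, a_t); this is what makes sparse implementations fast.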
We call φ(s, a) the full feature vector and Q_t(s, a) the full-vector value function.
Asymptotically, SARSA(λ) is guaranteed to find the best solution within the span of φ(s, a), up to a multiplicative constant that depends on λ [25]. If we let Φ ∈ R^{|S||A|×n} denote the matrix of full feature vectors φ(s, a), and let µ : S × A → [0, 1] denote the steady state distribution over state-action pairs induced by π and P then, under mild assumptions, we can guarantee the existence and uniqueness of µ. We denote by ⟨·,·⟩_µ the inner product induced by µ, i.e. ⟨x, y⟩_µ := x^T D y, where x, y ∈ R^{|S||A|} and D ∈ R^{|S||A|×|S||A|} is a diagonal matrix with entries µ(s, a). The norm ‖·‖_µ is defined as √⟨·,·⟩_µ. We assume the following: 1) S and A are finite, 2) the Markov chain induced by π and P is irreducible and aperiodic, and 3) Φ has full rank. The following theorem bounds the error of SARSA(λ):
Theorem 1 (Restated from Tsitsiklis and Van Roy [25]). Let M = ⟨S, A, P, R, γ⟩ be an MDP and π : S × A → [0, 1] be a policy. Denote by Φ ∈ R^{|S||A|×n} the matrix of full feature vectors and by µ the stationary distribution on (S, A) induced by π and P. Under assumptions 1-3), SARSA(λ) converges to a unique θ^π ∈ R^n with probability one and

‖Φθ^π − Q^π‖_µ ≤ (1 − λγ)/(1 − γ) ‖ΠQ^π − Q^π‖_µ ,

where Q^π ∈ R^{|S||A|} is a vector representing the exact solution to Equation 1 and Π := Φ(Φ^T D Φ)^{−1} Φ^T D is the projection operator.
Because Π is the projection operator for Φ, for any θ we have ‖Φθ − Q^π‖_µ ≥ ‖ΠQ^π − Q^π‖_µ; Theorem 1 thus implies that SARSA(1) converges to θ^π = arg min_θ ‖Φθ − Q^π‖_µ.

2.1 Hashing in Reinforcement Learning

As discussed previously, it is often impractical to store the full weight vector θ_t in memory. A typical example of this is tile coding on continuous-state domains [22], which generates a number of features exponential in the dimensionality of the state space. In such cases, hashing can effectively be used to approximate Q^π(s, a) using a fixed memory budget. Let h : {1, . . . , n} → {1, . . . , m} be a hash function mapping full feature vector indices into hash table indices, where m ≪ n is the hash table size. We define standard hashing features as the feature map φ̂(s, a) whose ith component is defined as:

φ̂_i(s, a) := Σ_{j=1}^n I[h(j)=i] φ_j(s, a) ,    (3)

where φ_j(s, a) denotes the jth component of φ(s, a) and I[x] denotes the indicator function. We assume that our hash function h is drawn from a universal family: for any i, j ∈ {1, . . . , n}, i ≠ j, Pr(h(i) = h(j)) ≤ 1/m.¹ We define the standard hashing value function Q̂_t(s, a) := θ̂_t · φ̂(s, a), where θ̂_t ∈ R^m is a weight vector, and φ̂(s, a) is the hashed vector. Because of hashing collisions, the standard hashing value function is a biased estimator of Q_t(s, a), i.e., in general E_h[Q̂_t(s, a)] ≠ Q_t(s, a). For example, consider the extreme case where m = 1: all features share the same weight. We return to the issue of the bias introduced by standard hashing in Section 4.1.

2.2 Tug-of-War Hashing

The tug-of-war sketch, also known as the Fast-AGMS, was recently introduced as a powerful method for approximating inner products of large vectors [7]. The name "sketch" refers to the data structure's function as a summary of a stream of data. In the canonical sketch setting, we summarize a count vector θ ∈ R^n using a sketch vector θ̃ ∈ R^m. At each time step a vector φ_t ∈ R^n is received. The purpose of the sketch vector is to approximate the count vector θ_t := Σ_{i=0}^{t−1} φ_i. Given two hash functions, h and ξ : {1, . . . , n} → {−1, 1}, φ_t is mapped to a vector φ̃_t whose ith component is

φ̃_{t,i} := Σ_{j=1}^n I[h(j)=i] φ_{t,j} ξ(j)    (4)

The tug-of-war sketch vector is then updated as θ̃_{t+1} ← θ̃_t + φ̃_t. In addition to h being drawn from a universal family of hash functions, ξ is drawn from a four-wise independent family of hash functions: for all sets of four unique indices {i1, i2, i3, i4}, Pr_ξ(ξ(i1) = k1, ξ(i2) = k2, ξ(i3) = k3, ξ(i4) = k4) = 1/16 with k1 . . . k4 ∈ {−1, 1}. For an arbitrary φ ∈ R^n and its corresponding tug-of-war vector φ̃ ∈ R^m, E_{h,ξ}[θ̃_t · φ̃] = θ_t · φ: the tug-of-war sketch produces unbiased estimates of inner products [7]. This unbiasedness property can be derived as follows. First let θ̃_t = Σ_{i=0}^{t−1} φ̃_i. Then θ̃_t · φ̃_{t'} = Σ_{i=0}^{t−1} φ̃_i · φ̃_{t'} and

E_{h,ξ}[φ̃_i · φ̃_{t'}] = E_{h,ξ}[ Σ_{j1=1}^n Σ_{j2=1}^n I[h(j1)=h(j2)] φ_{i,j1} φ_{t',j2} ξ(j1) ξ(j2) ]

E_ξ[ξ(j1) ξ(j2)] = 1 if j1 = j2, and 0 otherwise    (by four-wise independence)

The result follows by noting that I[h(j1)=h(j2)] is independent from ξ(j1)ξ(j2) given j1, j2.

¹ While it may seem odd to randomly select your hash function, this can equivalently be thought of as sampling an indexing assignment for the MDP's features. While a particular hash function may be well- (or poorly-) suited for a particular MDP, it is hard to imagine how this could be known a priori. By considering a randomly selected hash function (or random permutation of the features), we are simulating the uncertainty of using a particular hash function on a never before encountered MDP.

3 Tug-of-War with Linear Value Function Approximation

We now extend the tug-of-war sketch to the reinforcement learning setting by defining the tug-of-war hashing features as φ̃ : S × A → R^m with φ̃_i(s, a) := Σ_{j=1}^n I[h(j)=i] φ_j(s, a) ξ(j). The SARSA(λ)
The SARSA(\u03bb)\n\nj=1\n\nupdate becomes:\n\n\u02dc\u03b4t \u2190 rt + \u03b3 \u02dc\u03b8t \u00b7 \u02dc\u03c6(st+1, at+1) \u2212 \u02dc\u03b8t \u00b7 \u02dc\u03c6(st, at)\n\u02dcet \u2190 \u03b3\u03bb\u02dcet\u22121 + \u02dc\u03c6(st, at)\n\n\u02dc\u03b8t+1 \u2190 \u02dc\u03b8t + \u03b1\u02dc\u03b4t\u02dcet.\n\n(5)\nWe also de\ufb01ne the tug-of-war value function \u02dcQt(s, a) := \u02dc\u03b8t \u00b7 \u02dc\u03c6(s, a) with \u02dc\u03b8t \u2208 Rm and refer to\n\u02dc\u03c6(s, a) as the tug-of-war vector.\n\n3.1 Value Function Approximation with Tug-of-War Hashing\n\nIntuitively, one might hope that the unbiasedness of the tug-of-war sketch for approximating inner\nproducts carries over to the case of linear value function approximation. Unfortunately, this is not\nthe case. However, it is still possible to bound the error of the tug-of-war value function learned with\nSARSA(1) in terms of the full-vector value function. Our bound relies on interpreting tug-of-war\nhashing as a special kind of Johnson-Lindenstrauss transform [8].\nWe de\ufb01ne a \u221e-universal family of hash functions H such that for any set of indices i1, i2, . . . , il\nPr(h(i1) = k1, . . . , h(il) = kl) \u2264 1|C|l , where C \u2282 N and h \u2208 H : {1, . . . , n} \u2192 C .\nLemma 1 (Dasgupta et al. [8], Theorem 2). Let h : {1, . . . , n} \u2192 {1, . . . , m} and \u03be : {1, . . . , n} \u2192\n{\u22121, 1} be two independent hash functions chosen uniformly at random from \u221e-universal families\nand let H \u2208 {0,\u00b11}m\u00d7n be a matrix with entries Hij = I[h(j)=i]\u03be(j). Let \u0001 < 1, \u03b4 < 1\n10 ,\nc ,\nm = 12\nwith probability 1 \u2212 3\u03b4, H satis\ufb01es the following property:\n\n(cid:1). 
For any given vector x \u2208 Rn such that (cid:107)x(cid:107)\u221e \u2264 1\u221a\n\n(cid:1) and c = 16\n\n(cid:1) log2(cid:0) m\n\n\u00012 log(cid:0) 1\n\n\u0001 log(cid:0) 1\n\n\u03b4\n\n\u03b4\n\n\u03b4\n\n(1 \u2212 \u0001)(cid:107)x(cid:107)2\n\n2 \u2264 (cid:107)Hx(cid:107)2\n\n2 \u2264 (1 + \u0001)(cid:107)x(cid:107)2\n2 .\n\nLemma 1 states that, under certain conditions on the input vector x, tug-of-war hashing approxi-\nmately preserves the norm of x. When \u03b4 and \u0001 are constant, the requirement on (cid:107)x(cid:107)\u221e can be waived\nby applying Theorem 1 to the normalized vector u =\nc. A clear discussion on hashing as\na Johnson-Lindenstrauss transform can be found in the work of Kane and Nelson [13], who also\nimprove Lemma 1 and extend it to the case where the family of hash functions is k-universal rather\nthan \u221e-universal.\nLemma 2 (Based on Maillard and Munos [16], Proposition 1). Let x1 . . . xK and y be vectors in\nRn. Let H \u2208 {0,\u00b11}m\u00d7n, \u0001, \u03b4 and m be de\ufb01ned as in Lemma 1. With probability at least 1\u2212 6K\u03b4,\nfor all k \u2208 {1, . . . , K},\n\nx(cid:107)x(cid:107)2\n\n\u221a\n\nxk \u00b7 y \u2212 \u0001(cid:107)xk(cid:107)2 (cid:107)y(cid:107)2 \u2264 Hxk \u00b7 Hy \u2264 xk \u00b7 y + \u0001(cid:107)xk(cid:107)2 (cid:107)y(cid:107)2 .\n\nProof (Sketch). The proof follows the steps of Maillard and Munos [16]. Given two unit vectors\nu, v \u2208 Rn, we can relate (Hu) \u00b7 (Hv) to (cid:107)Hu + Hv(cid:107)2\n2 using the parallelogram\nlaw. We then apply Lemma 1 to bound both sides of each squared norm and substitute xk for u and\ny for w to bound Hxk \u00b7 Hy. 
Applying the union bound yields the desired statement.

We are now in a position to bound the asymptotic error of SARSA(1) with tug-of-war hashing. Given hash functions h and ξ defined as per Lemma 1, we denote by H ∈ R^{m×n} the matrix whose entries are H_ij := I[h(j)=i] ξ(j), such that φ̃(s, a) = Hφ(s, a). We also denote by Φ̃ := ΦH^T the matrix of tug-of-war vectors. We again assume that 1) S and A are finite, that 2) π and P induce an irreducible, aperiodic Markov chain and that 3) Φ has full rank. For simplicity of argument, we also assume that 4) Φ̃ := ΦH^T has full rank; when Φ̃ is rank-deficient, SARSA(1) converges to a set of solutions Θ̃^π satisfying the bound of Theorem 2, rather than to a unique θ̃^π.
Theorem 2. Let M = ⟨S, A, P, R, γ⟩ be an MDP and π : S × A → [0, 1] be a policy. Let Φ ∈ R^{|S||A|×n} be the matrix of full feature vectors and Φ̃ ∈ R^{|S||A|×m} be the matrix of tug-of-war vectors. Denote by µ the stationary distribution on (S, A) induced by π and P. Let ε < 1, δ < 1, δ' = δ/(6|S||A|) and m ≥ (12/ε²) log(1/δ'). Under assumptions 1-4), gradient-descent SARSA(1) with tug-of-war hashing converges to a unique θ̃^π ∈ R^m and with probability at least 1 − δ,

‖Φ̃θ̃^π − Q^π‖_µ ≤ ‖Φθ^π − Q^π‖_µ + ε ‖θ^π‖₂ sup_{s∈S,a∈A} ‖φ(s, a)‖₂ ,

where Q^π is the exact solution to Equation 1 and θ^π = arg min_θ ‖Φθ − Q^π‖_µ.

Proof. First note that Theorem 1 implies the convergence of SARSA(1) with tug-of-war hashing to a unique solution, which we denote θ̃^π. We first apply Lemma 2 to the set {φ(s, a) : (s, a) ∈ S × A} and θ^π; note that we can safely assume |S||A| > 1, and therefore δ' < 1/10. By our choice of m, for all (s, a) ∈ S × A, |Hφ(s, a) · Hθ^π − φ(s, a) · θ^π| ≤ ε ‖φ(s, a)‖₂ ‖θ^π‖₂ with probability at least 1 − 6|S||A|δ' = 1 − δ. As previously noted, SARSA(1) converges to θ̃^π = arg min_θ ‖Φ̃θ − Q^π‖_µ; compared to Φ̃θ̃^π, the projected solution θ^π_H := Hθ^π thus yields an equal or worse approximation Φ̃θ^π_H to Q^π. It follows that

‖Φ̃θ̃^π − Q^π‖_µ ≤ ‖Φ̃θ^π_H − Q^π‖_µ ≤ ‖Φ̃θ^π_H − Φθ^π‖_µ + ‖Φθ^π − Q^π‖_µ
= √( Σ_{s∈S,a∈A} µ(s, a) [Hφ(s, a) · Hθ^π − φ(s, a) · θ^π]² ) + ‖Φθ^π − Q^π‖_µ
≤ √( Σ_{s∈S,a∈A} µ(s, a) [ε ‖φ(s, a)‖₂ ‖θ^π‖₂]² ) + ‖Φθ^π − Q^π‖_µ    (Lemma 2)
≤ ε ‖θ^π‖₂ sup_{s∈S,a∈A} ‖φ(s, a)‖₂ + ‖Φθ^π − Q^π‖_µ ,

as desired.

Our proof of Theorem 2 critically requires the use of λ = 1.
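The concentration step at the heart of the proof, namely that Hx · Hy stays close to x · y, is easy to check numerically. The sketch below is our own illustration, not code from the paper; for simplicity it draws h and ξ fully at random, which is stronger than the universal and four-wise families the lemmas require:

```python
import numpy as np

def tug_of_war_project(x, h, xi, m):
    """Compute Hx for H_ij = I[h(j)=i] * xi(j), returned as a length-m vector."""
    Hx = np.zeros(m)
    np.add.at(Hx, h, xi * x)  # accumulate signed components into hash buckets
    return Hx

rng = np.random.default_rng(0)
n, m, trials = 500, 50, 3000
x = rng.normal(size=n)
y = x + 0.1 * rng.normal(size=n)   # correlated, so x . y is far from zero
true = x.dot(y)

estimates = []
for _ in range(trials):
    h = rng.integers(0, m, size=n)          # random bucket assignment
    xi = rng.choice([-1.0, 1.0], size=n)    # random signs standing in for xi
    estimates.append(tug_of_war_project(x, h, xi, m).dot(
        tug_of_war_project(y, h, xi, m)))

# The average over hash draws is close to the true inner product,
# even though m is ten times smaller than n.
print(abs(np.mean(estimates) - true) / abs(true))
```

Any single draw of (h, ξ) can be off by the ε-sized term in Lemma 2; it is the average over hash functions that recovers x · y, which is exactly the unbiasedness property of Section 2.2.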
A natural next step would be to attempt to drop this restriction on λ. It also seems likely that the finite-sample analysis of LSTD with random projections [11] can be extended to cover the case of tug-of-war hashing. Theorem 2 suggests that, under the right conditions, the tug-of-war value function is a good approximation to the full-vector value function. A natural question now arises: does tug-of-war hashing lead to improved linear value function approximation compared with standard hashing? More importantly, does tug-of-war hashing result in better learned policies? These are the questions we investigate empirically in the next section.

4 Experimental Study

In the sketch setting, the appeal of tug-of-war hashing over standard hashing lies in its unbiasedness. We therefore begin with an empirical study of the magnitude of the bias when applying different hashing methods in a value function approximation setting.

4.1 Value Function Bias

We used standard hashing, tug-of-war hashing, and no hashing to learn a value function over a short trajectory in the Mountain Car domain [22]. Our evaluation uses a standard implementation available online [15].

Figure 1: Bias and Mean Squared Error of value estimates using standard and tug-of-war hashing in 1,000 learning steps of Mountain Car. Note the log scale of the y axis.

We generated a 1,000-step trajectory using an ε-greedy policy [23]. For this fixed trajectory we updated a full feature weight vector θ_t using SARSA(0) with γ = 1.0 and α = 0.01. We focus on SARSA(0) as it is commonly used in practice for its ease of implementation and its faster update speed in sparse settings. Parallel to the full-vector update we also updated both a tug-of-war weight vector θ̃_t and a standard hashing weight vector θ̂_t, with the same values of γ and α. Both methods use a hash table size of m = 100 and the same randomly selected hash function. This hash function is defined as ((ax + b) mod p) mod m, where p is a large prime and a, b < p are random integers [5]. At every step we compute the difference in value between the hashed value functions Q̃_t(s_t, a_t) and Q̂_t(s_t, a_t), and the full-vector value function Q_t(s_t, a_t). We repeated this experiment using 1 million hash functions selected uniformly at random. Figure 1 shows, for each time step, estimates of the magnitude of the biases E[Q̃_t(s_t, a_t)] − Q_t(s_t, a_t) and E[Q̂_t(s_t, a_t)] − Q_t(s_t, a_t), as well as estimates of the mean squared errors E[(Q̃_t(s_t, a_t) − Q_t(s_t, a_t))²] and E[(Q̂_t(s_t, a_t) − Q_t(s_t, a_t))²], over the different hash functions. To provide a sense of scale, the estimate of the value of the final state when using no hashing is approximately −4; note that the y-axis uses a logarithmic scale.
The tug-of-war value function has a small, almost negligible bias. In comparison, the bias of standard hashing is orders of magnitude larger, almost as large as the value it is trying to estimate. The mean squared error estimates show a similar trend. Furthermore, the same experiment on the Acrobot domain [22] yielded qualitatively similar results. Our results confirm the insights provided in Section 2: the tug-of-war value function can be significantly less biased than the standard hashing value function.

4.2 Reinforcement Learning Performance

Having smaller bias and mean squared error in the Q-value estimates does not necessarily imply improved agent performance. In reinforcement learning, actions are selected based on relative Q-values, so a consistent bias may be harmless.
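The collision-bias mechanism behind Figure 1 can nonetheless be isolated in a simplified, non-RL setting. The following illustration (ours, not the paper's experiment) approximates the inner product of two sparse binary vectors under both schemes, averaged over many random hash functions: standard hashing accumulates every colliding cross term with a positive sign, while the tug-of-war signs cancel such terms in expectation.

```python
import numpy as np

rng = np.random.default_rng(1)
n, m, trials = 200, 20, 4000
phi = (rng.random(n) < 0.2).astype(float)   # sparse binary "feature" vectors
psi = (rng.random(n) < 0.2).astype(float)
true = phi.dot(psi)

std_est, tow_est = [], []
for _ in range(trials):
    h = rng.integers(0, m, size=n)
    xi = rng.choice([-1.0, 1.0], size=n)
    # Standard hashing: colliding components all add with sign +1.
    a = np.zeros(m); np.add.at(a, h, phi)
    b = np.zeros(m); np.add.at(b, h, psi)
    std_est.append(a.dot(b))
    # Tug-of-war: same buckets, but each component carries the sign xi(j).
    a = np.zeros(m); np.add.at(a, h, xi * phi)
    b = np.zeros(m); np.add.at(b, h, xi * psi)
    tow_est.append(a.dot(b))

# Standard hashing overestimates the inner product; tug-of-war does not.
print(true, np.mean(std_est), np.mean(tow_est))
```

In this one-shot inner-product setting the tug-of-war estimate is exactly unbiased; the small residual bias reported in the paper arises only once the hashed estimates are fed back through the bootstrapped RL updates.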
In this section we evaluate the performance (cumulative reward per episode) of learning agents using both tug-of-war and standard hashing.

4.2.1 Tile Coding

We first studied the performance of agents using each of the two hashing methods in conjunction with tile coding. Our study is based on Mountain Car and Acrobot, two standard RL benchmark domains. For both domains we used the standard environment dynamics [22]; we used the fixed starting-state version of Mountain Car to reduce the variance in our results. We compared the two hashing methods using ε-greedy policies and the SARSA(λ) algorithm.
For each domain and each hashing method we performed a parameter sweep over the learning rate α and selected the best value which did not cause the value estimates to diverge. The Acrobot state was represented using 48 tilings of size 6 × 6 × 6 × 6, and the Mountain Car state using 10 tilings of size 9 × 9. Other parameters were set to γ = 1.0, λ = 0.9, ε = 0.0; the learning rate was further divided by the number of tilings.

Figure 2: Performance of standard hashing and tug-of-war hashing in two benchmark domains. The performance of the random agent is provided as reference.

We experimented with hash table sizes m ∈ [20, 1000] for Mountain Car and m ∈ [100, 2000] for Acrobot. Each experiment consisted of 100 trials, sampling a new hash function for each trial. Each trial consisted of 10,000 episodes, and episodes were restricted to 5,000 steps.
At the end of each trial, we disabled learning by setting α = 0 and evaluated the agent on an additional 500 episodes.
Figure 2 shows the performance of standard hashing and tug-of-war hashing as a function of the hash table size. The conclusion is clear: when the hashed vector is small relative to the full vector, tug-of-war hashing performs better than standard hashing. This is especially true in Acrobot, where the number of features (over 62,000) necessarily results in harmful collisions.

4.2.2 Atari

We next evaluated tug-of-war hashing and standard hashing on a suite of Atari 2600 games. The Atari domain was proposed as a game-independent platform for AI research by Naddaf [17]. Atari games pose a variety of challenges for learning agents. The learning agent's observation space is the game screen: 160x210 pixels, each taking on one of 128 colors. In the game-independent setting, agents are tuned using a small number of training games and subsequently evaluated over a large number of games for which no game-specific tuning takes place. The game-independent setting forces us to use features that are common to all games, for example, by encoding the presence of color patterns in game screens; such an encoding is a form of exhaustive feature generation. Different learning methods have been evaluated on the Atari 2600 platform [9, 26, 12]. We based our evaluation on prior work on a suite of Atari 2600 games [2], to which we refer the reader for full details on handling Atari 2600 games as RL domains. We performed parameter sweeps over five training games, and tested our algorithms on fifty testing games.
We used models of contingency awareness to locate the player avatar [2]. From a given game, we generate feature sets by exhaustively enumerating all single-color patterns of size 1x1 (single pixels), 2x2, and 3x3. The presence of each different pattern within a 4x5 tile is encoded as a binary feature.
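A drastically simplified version of this construction can be sketched as follows. The function below is our own illustration: it enumerates only 1x1 (single-pixel) patterns on a toy screen, and the feature-index scheme is ours, not that of [2]:

```python
import numpy as np

def pattern_features(screen, tile_h, tile_w, num_colors):
    """Binary features: one index per (tile row, tile col, color present in tile).

    screen: 2-D integer array of color codes. Returns the set of active
    feature indices; only 1x1 (single-pixel) patterns are enumerated here.
    """
    rows, cols = screen.shape
    tiles_per_row = (cols + tile_w - 1) // tile_w
    active = set()
    for tr in range(0, rows, tile_h):
        for tc in range(0, cols, tile_w):
            tile = screen[tr:tr + tile_h, tc:tc + tile_w]
            for color in np.unique(tile):
                # flatten (tile_row, tile_col, color) into a single feature index
                idx = ((tr // tile_h) * tiles_per_row
                       + (tc // tile_w)) * num_colors + int(color)
                active.add(idx)
    return active

# Toy screen: all background (color 0) except one pixel of color 3.
screen = np.zeros((8, 10), dtype=int)
screen[0, 0] = 3
feats = pattern_features(screen, tile_h=4, tile_w=5, num_colors=8)
```

Each active index would then be hashed exactly once by h (and signed by ξ under tug-of-war hashing), which is the per-feature cost advantage over dense random projections noted in the introduction.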
We also encode the relative presence of patterns with respect to the player avatar location.\nThis procedures gives rise to 569,856,000 different features, of which 5,000 to 15,000 are active at\na given time step.\nWe trained \u0001-greedy SARSA(0) agents using both standard hashing and tug-of-war hashing with\nhash tables of size m=1,000, 5,000 and 20,000. We chose the step-size \u03b1 using a parameter sweep\nover the training games: we selected the best-performing \u03b1 which never resulted in divergence in\nthe value function. For standard hashing, \u03b1 = 0.01, 0.05, 0.2 for m = 1,000, 5,000 and 20,000,\nrespectively. For tug-of-war hashing, \u03b1 = 0.5 across table sizes. We set \u03b3 = 0.999 and \u0001 = 0.05.\nEach experiment was repeated over ten trials lasting 10,000 episodes each; we limited episodes to\n18,000 frames to avoid issues with non-terminating policies.\n\n7\n\n 0 500 1000 1500 2000 2500 100 500 1000 1500 2000Random AgentStandard HashingTug-of-War HashingAcrobotSteps to GoalHash Table Size 0 1000 2000 3000 4000 5000 200 400 600 800 1000Mountain CarRandom AgentStandard HashingTug-of-War HashingSteps to GoalHash Table Size\fFigure 3: Inter-algorithm score distributions over \ufb01fty-\ufb01ve Atari games. Higher curves re\ufb02ect higher\nnormalized scores.\n\nAccurately comparing methods across \ufb01fty-\ufb01ve games poses a challenge, as each game exhibits a\ndifferent reward function and game dynamics. We compared methods using inter-algorithm score\ndistributions [2]. For each game, we extracted the average score achieved by our agents over the\nlast 500 episodes of training, yielding six different scores (three per hashing method) per game.\nDenoting these scores by sg,i, i = 1 . . . 6, we de\ufb01ned the inter-algorithm normalized score zg,i :=\n(sg,i \u2212 rg,min)/(rg,max \u2212 rg,min) with rg,min := mini {sg,i} and rg,max := maxi {sg,i}. 
Thus z_{g,i} = 1.0 indicates that the ith score was the highest for game g, and z_{g,i} = 0.0 similarly indicates the lowest score. For each combination of hashing method and memory size, its inter-algorithm score distribution shows the fraction of games for which the corresponding agent achieves a certain normalized score or better.

Figure 3 compares the score distributions of agents using either standard hashing or tug-of-war hashing for m = 1,000, 5,000 and 20,000. Tug-of-war hashing consistently outperforms standard hashing across hash table sizes. For each m and each game, we also performed a two-tailed Welch's t-test with 99% confidence intervals to determine the statistical significance of the average score difference between the two methods. For m = 1,000, tug-of-war hashing performed statistically better in 38 games and worse in 5; for m = 5,000, it performed better in 41 games and worse in 7; and for m = 20,000, it performed better in 35 games and worse in 5. Our results on Atari games confirm what we observed on Mountain Car and Acrobot: in practice, tug-of-war hashing performs much better than standard hashing. Furthermore, computing the ξ function took less than 0.3% of the total experiment time, a negligible cost in comparison to the benefits of using tug-of-war hashing.

5 Conclusion

In this paper, we cast the tug-of-war sketch into the reinforcement learning framework. We showed that, although the tug-of-war sketch is unbiased in the setting for which it was developed [7], the self-referential component of reinforcement learning induces a small bias.
We showed that this bias can be much smaller than the bias that results from standard hashing, and provided empirical results confirming the superiority of tug-of-war hashing for value function approximation.

As increasingly complex reinforcement learning problems arise and strain against the boundaries of practicality, the need for fast and reliable approximation methods grows. If standard hashing frees us from the curse of dimensionality, then tug-of-war hashing goes a step further by ensuring, when the demands of the task exceed available resources, a robust and principled shift from the exact solution to its approximation.

Acknowledgements

We would like to thank Bernardo Ávila Pires, Martha White, Yasin Abbasi-Yadkori and Csaba Szepesvári for the help they provided with the theoretical aspects of this paper, as well as Adam White and Rich Sutton for insightful discussions on hashing and tile coding. This research was supported by the Alberta Innovates Technology Futures and the Alberta Innovates Centre for Machine Learning at the University of Alberta. Invaluable computational resources were provided by Compute/Calcul Canada.

References

[1] Dimitris Achlioptas. Database-friendly random projections: Johnson-Lindenstrauss with binary coins. Journal of Computer and System Sciences, 66(4):671–687, 2003.

[2] Marc G. Bellemare, Joel Veness, and Michael Bowling. Investigating contingency awareness using Atari 2600 games. In Proceedings of the Twenty-Sixth AAAI Conference on Artificial Intelligence, 2012.

[3] Michael Bowling and Manuela Veloso. Scalable learning in stochastic games.
In AAAI Workshop on Game Theoretic and Decision Theoretic Agents, 2002.

[4] Michael Bowling and Manuela Veloso. Simultaneous adversarial multi-robot learning. In Proceedings of the Eighteenth International Joint Conference on Artificial Intelligence, pages 699–704, 2003.

[5] J. Lawrence Carter and Mark N. Wegman. Universal classes of hash functions. Journal of Computer and System Sciences, 18(2):143–154, 1979.

[6] Graham Cormode. Sketch techniques for massive data. In Graham Cormode, Minos Garofalakis, Peter Haas, and Chris Jermaine, editors, Synopses for Massive Data: Samples, Histograms, Wavelets and Sketches, Foundations and Trends in Databases. NOW Publishers, 2011.

[7] Graham Cormode and Minos Garofalakis. Sketching streams through the net: Distributed approximate query tracking. In Proceedings of the 31st International Conference on Very Large Data Bases, pages 13–24, 2005.

[8] Anirban Dasgupta, Ravi Kumar, and Tamás Sarlós. A sparse Johnson-Lindenstrauss transform. In Proceedings of the 42nd ACM Symposium on Theory of Computing, pages 341–350, 2010.

[9] Carlos Diuk, A. Andre Cohen, and Michael L. Littman. An object-oriented representation for efficient reinforcement learning. In Proceedings of the Twenty-Fifth International Conference on Machine Learning, pages 240–247, 2008.

[10] Mahdi Milani Fard, Yuri Grinberg, Joelle Pineau, and Doina Precup. Compressed least-squares regression on sparse spaces. In Proceedings of the Twenty-Sixth AAAI Conference on Artificial Intelligence, 2012.

[11] Mohammad Ghavamzadeh, Alessandro Lazaric, Odalric-Ambrym Maillard, and Rémi Munos. LSTD with random projections. In Advances in Neural Information Processing Systems 23, pages 721–729, 2010.

[12] Matthew Hausknecht, Piyush Khandelwal, Risto Miikkulainen, and Peter Stone.
HyperNEAT-GGP: A HyperNEAT-based Atari general game player. In Genetic and Evolutionary Computation Conference (GECCO), 2012.

[13] Daniel M. Kane and Jelani Nelson. A derandomized sparse Johnson-Lindenstrauss transform. arXiv preprint arXiv:1006.3585, 2010.

[14] Ping Li, Trevor J. Hastie, and Kenneth W. Church. Very sparse random projections. In Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 287–296, 2006.

[15] The Reinforcement Learning Library, 2010. http://library.rl-community.org.

[16] Odalric-Ambrym Maillard and Rémi Munos. Compressed least squares regression. In Advances in Neural Information Processing Systems 22, pages 1213–1221, 2009.

[17] Yavar Naddaf. Game-independent AI agents for playing Atari 2600 console games. PhD thesis, University of Alberta, 2010.

[18] E. Schuitema, D.G.E. Hobbelen, P.P. Jonker, M. Wisse, and J.G.D. Karssen. Using a controller based on reinforcement learning for a passive dynamic walking robot. In Proceedings of the Fifth IEEE-RAS International Conference on Humanoid Robots, pages 232–237, 2005.

[19] David Silver, Richard S. Sutton, and Martin Müller. Reinforcement learning of local shape in the game of Go. In 20th International Joint Conference on Artificial Intelligence, pages 1053–1058, 2007.

[20] Peter Stone, Richard S. Sutton, and Gregory Kuhlmann. Reinforcement learning for RoboCup soccer keepaway. Adaptive Behavior, 13(3):165, 2005.

[21] Nathan Sturtevant and Adam White. Feature construction for reinforcement learning in Hearts. Computers and Games, pages 122–134, 2006.

[22] Richard S. Sutton. Generalization in reinforcement learning: Successful examples using sparse coarse coding. In David S. Touretzky, Michael C. Mozer, and Michael E. Hasselmo, editors, Advances in Neural Information Processing Systems, volume 8, pages 1038–1044, 1996.

[23] Richard S.
Sutton and Andrew G. Barto. Reinforcement Learning: An Introduction. MIT Press, 1998.

[24] Russ Tedrake, Teresa Weirui Zhang, and H. Sebastian Seung. Stochastic policy gradient reinforcement learning on a simple 3D biped. In Proceedings of Intelligent Robots and Systems 2004, volume 3, pages 2849–2854, 2004.

[25] John N. Tsitsiklis and Benjamin Van Roy. An analysis of temporal-difference learning with function approximation. IEEE Transactions on Automatic Control, 42(5):674–690, 1997.

[26] Samuel Wintermute. Using imagery to simplify perceptual abstraction in reinforcement learning agents. In Proceedings of the Twenty-Fourth AAAI Conference on Artificial Intelligence, 2010.