{"title": "Bellman Error Based Feature Generation using Random Projections on Sparse Spaces", "book": "Advances in Neural Information Processing Systems", "page_first": 3030, "page_last": 3038, "abstract": "This paper addresses the problem of automatic generation of features for value function approximation in reinforcement learning.  Bellman Error Basis Functions (BEBFs) have been shown to improve the error of policy evaluation with function approximation, with a convergence rate similar to that of value iteration. We propose a simple, fast and robust algorithm based on random projections, which generates BEBFs for sparse feature spaces. We provide a finite sample analysis of the proposed method, and prove that projections logarithmic in the dimension of the original space guarantee a contraction in the error.  Empirical results demonstrate the strength of this method in domains in which choosing a good state representation is challenging.", "full_text": "Bellman Error Based Feature Generation using\n\nRandom Projections on Sparse Spaces\n\nMahdi Milani Fard, Yuri Grinberg, Amir massoud Farahmand, Joelle Pineau, Doina Precup\n\nSchool of Computer Science\n\nMcGill University\nMontreal, Canada\n\n{mmilan1,ygrinb,amirf,jpineau,dprecup}@cs.mcgill.ca\n\nAbstract\n\nThis paper addresses the problem of automatic generation of features for value\nfunction approximation in reinforcement learning. Bellman Error Basis Functions\n(BEBFs) have been shown to improve policy evaluation, with a convergence rate\nsimilar to that of value iteration. We propose a simple, fast and robust algorithm\nbased on random projections, which generates BEBFs for sparse feature spaces.\nWe provide a \ufb01nite sample analysis of the proposed method, and prove that pro-\njections logarithmic in the dimension of the original space guarantee a contraction\nin the error. 
Empirical results demonstrate the strength of this method in domains\nin which choosing a good state representation is challenging.\n\n1\n\nIntroduction\n\nPolicy evaluation, i.e. computing the expected return of a given policy, is at the core of many rein-\nforcement learning (RL) algorithms. In large problems, it is necessary to use function approximation\nin order to perform this task; a standard choice is to hand-craft parametric function approximators,\nsuch as a tile coding, radial basis functions or neural networks. The accuracy of parametrized pol-\nicy evaluation depends crucially on the quality of the features used in the function approximator,\nand thus often a lot of time and effort is spent on this step. The desire to make this process more\nautomatic has led to a lot of recent work on feature generation and feature selection in RL (e.g.\n[1, 2, 3, 4, 5]).\nAn approach that offers good theoretical guarantees is to generate features in the direction of the\nBellman error of the current value estimates (Bellman Error Based features, or BEBF). Successively\nadding exact BEBFs has been shown to reduce the error of a linear value function estimator at a\nrate similar to value iteration, which is the best one could hope to achieve [6]. Unlike \ufb01tted value\niteration [7], which works with a \ufb01xed feature set, iterative BEBF generation gradually increases the\ncomplexity of the hypothesis space by adding new features and thus does not diverge, as long as the\nerror in the generation does not cancel out the contraction effect of the Bellman operator [6]. Several\nsuccessful methods have been proposed for generating features related to the Bellman error [5, 1, 4,\n6, 3]. 
In practice, however, these methods can be computationally expensive when applied in high-dimensional\ninput spaces.\nWith the emergence of more high-dimensional RL problems, it has become necessary to design and\nadapt BEBF-based methods to be more scalable and computationally efficient. In this paper, we\npresent an algorithm that uses the idea of applying random projections specifically in very large and\nsparse feature spaces (e.g. $10^5$ to $10^6$ dimensions). The idea is to iteratively project the original features\ninto exponentially lower-dimensional spaces. Then, we apply linear regression in the smaller\nspaces, using temporal difference errors as targets, in order to approximate BEBFs.\nRandom projections have been studied extensively in signal processing [8, 9] as well as machine\nlearning [10, 11, 12, 13]. In reinforcement learning, Ghavamzadeh et al. [14] have used random\nprojections in conjunction with LSTD and have shown that this can reduce the estimation error,\nat the cost of a controlled bias. Instead of compressing the feature space for LSTD, we focus on\nthe BEBF generation setting, which offers better scalability and more flexibility in practice. Our\nalgorithm is well suited for sparse feature spaces, naturally occurring in domains with audio and\nvideo inputs [15], and also in tile-coded and discretized spaces.\nWe carry out a finite sample analysis, which helps determine the sizes that should be used for the\nprojections. Our analysis holds for both finite and continuous state spaces and is easy to apply\nwith discretized or tile-coded features, which are popular in many RL applications. The proposed\nmethod compares favourably, from a computational point of view, to many other feature extraction\nmethods in high dimensional spaces, as each iteration takes only poly-logarithmic time in the number\nof dimensions. 
The method provides guarantees on the reduction of the error, yet needs minimal\ndomain knowledge, as we use agnostic random projections.\nOur empirical analysis indicates that the proposed method provides similar results to L2-regularized\nLSTD, but scales much better in time complexity as the observed sparsity decreases. It significantly\noutperforms L1-regularized methods both in performance and computation time. The algorithm\nseems robust to the choice of parameters and has small computational and memory complexity.\n\n2 Notation and Background\n\nThroughout this paper, column vectors are represented by lower case bold letters, and matrices are\nrepresented by bold capital letters. |.| denotes the size of a set, and M(X) is the set of measures\non X. \u2016.\u20160 is Donoho\u2019s zero \u201cnorm\u201d indicating the number of non-zero elements in a vector. \u2016.\u2016\ndenotes the L2 norm for vectors and the operator norm for matrices: $\\|M\\| = \\sup_v \\|Mv\\| / \\|v\\|$. The\nFrobenius norm of a matrix is then defined as: $\\|M\\|_F = \\sqrt{\\sum_{i,j} M_{i,j}^2}$. Also, we denote the Moore-Penrose\npseudo-inverse of a matrix M with M\u2020. The weighted L2 norm of a function is defined as\n$\\|f(x)\\|_{\\rho(x)} = \\sqrt{\\int |f(x)|^2 \\, d\\rho(x)}$. We focus on spaces that are large, bounded and k-sparse. Our\nstate is represented by a vector x \u2208 X of D features, having \u2016x\u2016 \u2264 1. We assume that x is k-sparse\nin some known or unknown basis \u03a8: $\\mathcal{X} \\triangleq \\{\\Psi z, \\text{ s.t. } \\|z\\|_0 \\le k \\text{ and } \\|z\\| \\le 1\\}$. Such spaces occur\nboth naturally (e.g. image, audio and video signals [15]) as well as from most discretization-based\nmethods (e.g., tile-coding).\n\n2.1 Markov Decision Process\nA Markov Decision Process (MDP) M = (S, A, T, R) is defined by a (possibly infinite) set of\nstates S, a set of actions A, a transition kernel T : S \u00d7 A \u2192 M(S), where T(.|s, a) defines the\ndistribution of next state given that action a is taken in state s, and a (possibly stochastic) bounded\nreward function R : S \u00d7 A \u2192 M([0, Rmax]). We assume discounted-reward MDPs, with the\ndiscount factor denoted by \u03b3 \u2208 [0, 1). At each discrete time step, the RL agent chooses an action\nand receives a reward. The environment then changes to a new state, according to the transition\nkernel.\nA policy is a (possibly stochastic) function from states to actions. The value of a state s for policy\n\u03c0, denoted by $V^\\pi(s)$, is the expected value of the discounted sum of rewards ($\\sum_t \\gamma^t r_t$) if the agent\nstarts in state s and acts according to policy \u03c0. Let R(s, \u03c0(s)) be the expected reward at state s\nunder policy \u03c0. The value function satisfies:\n\n$V^\\pi(s) = R(s, \\pi(s)) + \\gamma \\int V^\\pi(s') \\, T(ds' | s, \\pi(s)).$ (1)\n\nMany methods have been developed for finding the value of a policy (policy evaluation) when the\ntransition and reward functions are known. Dynamic programming methods apply iteratively the\nBellman operator $\\mathcal{T}$ to an initial guess of the value function [16]:\n\n$\\mathcal{T} V(s) = R(s, \\pi(s)) + \\gamma \\int V(s') \\, T(ds' | s, \\pi(s)).$ (2)\n\nWhen the transition and reward models are not known, one can use a finite sample set of transitions\nto learn an approximate value function. When the state space is very large or continuous, the value\nfunction is also approximated using a feature vector xs, which is a function of the state s. Often,\nthis approximation is linear: $V(s) \\approx w^T x_s$. 
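To make the Bellman operator of Eq. (2) concrete, the following is a minimal sketch on a tiny finite chain, where the integral reduces to a sum over next states. The three-state chain, its rewards and all names (GAMMA, REWARD, TRANSITION, bellman_backup, evaluate_policy) are hypothetical and chosen only for illustration; they are not taken from the paper.

```python
# Hypothetical 3-state chain under a fixed policy (all numbers are made up).
GAMMA = 0.9
REWARD = [0.0, 0.0, 1.0]            # expected reward R(s, pi(s)) per state
TRANSITION = [                      # T(s' | s, pi(s)); each row sums to 1
    [0.5, 0.5, 0.0],
    [0.0, 0.5, 0.5],
    [0.0, 0.0, 1.0],
]


def bellman_backup(values):
    # One application of the Bellman operator for a finite state set:
    # (T V)(s) = R(s, pi(s)) + gamma * sum_s' T(s' | s, pi(s)) V(s')
    return [
        REWARD[s] + GAMMA * sum(p * v for p, v in zip(TRANSITION[s], values))
        for s in range(len(values))
    ]


def evaluate_policy(num_sweeps=500):
    # Repeated backups from V = 0; the error shrinks by a factor gamma per sweep.
    values = [0.0] * len(REWARD)
    for _ in range(num_sweeps):
        values = bellman_backup(values)
    return values


values = evaluate_policy()
# Bellman error e_V = T V - V, which is close to zero at the fixed point.
bellman_error = [tv - v for tv, v in zip(bellman_backup(values), values)]
```

In this toy chain the last state is absorbing with reward 1, so its value approaches 1 / (1 - GAMMA). With function approximation and sampled transitions the backup cannot be computed exactly, which is the setting the rest of the text addresses.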
To simplify the derivations, we use V(x) to directly\nrefer to the value estimate of a state with feature vector x.\nLeast-squares temporal difference learning (LSTD) and its variations [17, 18, 19] are among methods\nthat learn a value function based on a finite sample, especially when function approximation is\nneeded. LSTD-type methods are efficient in their use of data, but can be computationally expensive,\nas they rely on inverting a large matrix. Using LSTD in spaces induced by random projections is a\nway of dealing with this problem [14]. As we show in our experiments, if the observation space is\nsparse, we can also use conjugate gradient descent methods to solve the regularized LSTD problem.\nStochastic gradient descent methods are alternatives to LSTD in high-dimensional state spaces, as\ntheir memory and computational complexity per time step are linear in the number of state features,\nwhile providing convergence guarantees [20]. However, online gradient-type methods typically have\nslow convergence rates and do not make efficient use of the data.\n\n2.2 Bellman Error Based Feature Generation\n\nIn high-dimensional state spaces, direct estimation of the value function fails to provide good results\nwhen using a small number of sampled transitions. Feature selection/extraction methods have thus\nbeen used to build better approximation spaces for the value functions [1, 2, 3, 4, 5]. Among these,\nwe focus on methods that aim to generate features in the direction of the Bellman error defined as:\n\n$e_V(\\cdot) = \\mathcal{T} V(\\cdot) - V(\\cdot).$ (3)\n\nLet $S_n = ((x_t, r_t)_{t=1}^n)$ be a random sample of size n, collected on an MDP with a fixed policy.\nGiven an estimate V of the value function, temporal difference (TD) errors are defined to be:\n\n$\\delta_t = r_t + \\gamma V(x_{t+1}) - V(x_t).$ (4)\n\nIt is easy to show that the expectation of the temporal difference at xt equals the Bellman error at\nthat point [16]. 
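The TD errors of Eq. (4) are simple to compute once a value estimate is fixed. The sketch below assumes a linear estimate V(x) = w . x and a trajectory stored as plain Python lists; the function names and the toy numbers are hypothetical, not part of the paper.

```python
def linear_value(weights, features):
    # V(x) = w^T x for a dense feature vector x.
    return sum(w * x for w, x in zip(weights, features))


def td_errors(features, rewards, weights, gamma):
    # features holds x_1, ..., x_{n+1} (one more row than rewards), so that
    # delta_t = r_t + gamma * V(x_{t+1}) - V(x_t) is defined for every reward.
    deltas = []
    for t in range(len(rewards)):
        v_now = linear_value(weights, features[t])
        v_next = linear_value(weights, features[t + 1])
        deltas.append(rewards[t] + gamma * v_next - v_now)
    return deltas


# Toy trajectory with two features (made-up numbers).
features = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
rewards = [1.0, 0.0]
weights = [0.5, 2.0]
deltas = td_errors(features, rewards, weights, gamma=0.9)
```

These per-transition errors are the regression targets used later when fitting a BEBF.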
TD-errors are thus proxies to estimating the Bellman error.\nUsing temporal differences, Menache et al. [21] introduced two algorithms to construct basis func-\ntions for linear function approximation. Keller et al. [3] applied neighbourhood component analysis\nas a dimensionality reduction technique to construct a low dimensional state space based on the TD-\nerror. In their work, they iteratively add features that would help predict the Bellman error. Parr et al.\n[6] later showed that any BEBF extraction method with small angular error will provably tighten the\napproximation error of the value function estimate. Online BEBF extraction methods have also been\nstudied in the RL literature. The incremental Feature Dependency Discovery (iFDD) is a fast online\nalgorithm to extract non-linear binary features for linear function approximation [5].\nWe note that these algorithms, although theoretically interesting, are dif\ufb01cult to apply to very large\nstate spaces or need speci\ufb01c domain knowledge to generate good features. The problem lies in\nthe large estimation error when predicting BEBFs in high-dimensional state spaces. Our proposed\nsolution leverages the use of simple random projections to alleviate this problem.\n\n2.3 Random Projections and Inner Product\n\nRandom projections have been introduced in signal processing, as an ef\ufb01cient method for compress-\ning very high-dimensional signals (such as images or video). It is well known that random projec-\ntions of appropriate sizes preserve enough information to exactly reconstruct the original signal with\nhigh probability [22, 9]. This is because random projections are norm and distance-preserving in\nmany classes of feature spaces.\nThere are several types of random projection matrices that can be used. In this work, we assume that\neach entry in the projection matrix \u03a6D\u00d7d is an i.i.d. 
sample from a Gaussian distribution:\n\n$\\phi_{i,j} \\sim \\mathcal{N}(0, 1/d).$ (5)\n\nRecently, it has been shown that random projections of appropriate sizes preserve linearity of a target\nfunction on sparse feature spaces. A bound introduced in [11] and later tightened by [23] shows that\nif a function is linear in a sparse space, it is almost linear in an exponentially smaller projected space.\nAn immediate lemma based on Theorem 2 of [23] bounds the bias induced by random projections:\nLemma 1. Let X be a D-dimensional k-sparse space and $\\Phi_{D \\times d}$ be a random projection according\nto Eqn 5. Fix $w \\in \\mathbb{R}^D$ and 1 > \u03be0 > 0. Then, for $\\epsilon_{prj}^{(\\xi_0)} = \\sqrt{\\frac{48k}{d} \\log \\frac{4D}{\\xi_0}}$, with probability > 1 \u2212 \u03be0:\n\n$\\forall x \\in \\mathcal{X}: \\left| (\\Phi^T w)^T (\\Phi^T x) - w^T x \\right| \\le \\epsilon_{prj}^{(\\xi_0)} \\|w\\| \\|x\\|.$ (6)\n\nHence, projections of size $\\tilde{O}(k \\log D)$ preserve the linearity up to an arbitrary constant. Along with\nthe analysis of the variance of the estimators, this helps bound the prediction error of the linear fit in\nthe compressed space.\n\n3 Compressed Linear BEBFs\n\nIn this work, we propose a new method to generate BEBFs using linear regression in a small space\ninduced by random projections. We first project the state features into a much smaller space and\nthen regress a hyperplane to the TD-errors. For simplicity, we assume that regardless of the current\nestimate of the value function, the Bellman error is always linearly representable in the original feature\nspace. This seems like a strong assumption, but is true, for example, in virtually any discretized\nspace, and is also likely to hold in very high dimensional feature spaces (see footnote 1).\nLinear function approximators can be used to estimate the value of a given state. Let Vm be an\nestimated value function described in a linear space defined by a feature set \u03a8 = {\u03c81, . . . , \u03c8m}. 
Parr\net al. [6] show that if we add a new BEBF $\\psi_{m+1} = e_{V_m}$ to the feature set (with mild assumptions),\nthe approximation error on the new linear space shrinks by a factor of \u03b3. They also show that if we\ncan estimate the Bellman error within a constant angular error, $\\cos^{-1}(\\gamma)$, the error will still shrink.\nEstimating the Bellman error by regressing to temporal differences in high-dimensional sparse\nspaces can result in large prediction error. This is due to the large estimation error of regression\nin high dimensional spaces (over-fitting). However, as discussed in Lemma 1, random projections\nwere shown to exponentially reduce the dimension of a sparse feature space, only at the cost of a\ncontrolled constant bias. A variance analysis along with proper mixing conditions can also bound\nthe estimation error due to the variance in MDP returns. The computational cost of the estimation is\nalso much smaller when the regression is applied in the compressed space.\n\n3.1 General CBEBF Algorithm\n\nIn light of these results, we propose the Compressed Bellman Error Based Feature Generation algorithm\n(CBEBF). The algorithm iteratively constructs new features using compressed linear regression\nto the TD-errors, and uses these features with a policy evaluation algorithm to update the\nestimate of the value function.\n\nAlgorithm 1 Compressed Bellman Error Based Feature Generation (CBEBF)\n\nInput: Sample trajectory $S_n = ((x_t, r_t)_{t=1}^n)$, where xt is the observation received at time t, and\nrt is the observed reward; Number of BEBFs: m; Projection size schedule: d1, d2, . . . , dm\nOutput: V(.): estimate of the value function\nInitialize V(.) to be 0 for all x.\nInitialize the set of BEBFs linear weights \u03a8 \u2190 \u2205.\nfor i \u2190 1 to m do\n    Generate projection $\\Phi_{D \\times d_i}$ according to Eqn 5.\n    Calculate TD-errors: $\\delta_t = r_t + \\gamma V(x_{t+1}) - V(x_t)$.\n    Apply compressed regression:\n        Let $u_{d_i \\times 1}$ be the result of OLS regression in the compressed space,\n        using $\\Phi^T x_t$ as inputs and $\\delta_t$ as outputs.\n    Add \u03a6u to \u03a8.\n    Apply policy evaluation with features $\\{\\hat{e}_v(x) = x^T v \\mid v \\in \\Psi\\}$ to update V(.).\nend for\n\nThe optimal number of BEBFs and the schedule of projection sizes need to be determined and are\nsubjects of future work. But we show in the next section that logarithmic size projections should be\nenough to guarantee the reduction of error in value function prediction at each step. This makes the\nalgorithm very attractive when it comes to computational and memory complexity, as the regression\nat each step is only on a small projected feature space. As we discuss in our empirical analysis, the\nalgorithm is fast and robust with respect to the selection of parameters.\n\n1 For the more general case, the analysis can be done with respect to the projected Bellman error [6]. We\nassume linearity of the Bellman error to simplify the derivations.\n\n3.2 Simplified CBEBF as Regularized Value Iteration\n\nNote that in CBEBF, we can use any type of value function approximation to estimate the value function\nin each iteration. To simplify the bias\u2013variance analysis and avoid multiple levels of regression,\nwe present here a simplified version of the CBEBF algorithm (SCBEBF). In the simplified version,\ninstead of storing the features in each iteration, new features are added to the value function approximator\nwith constant weight 1. Therefore, the value estimate is simply the sum of all generated\nBEBFs. 
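As a rough illustration of the simplified variant just described, the sketch below runs SCBEBF on dense Python lists. It is a minimal sketch under stated assumptions, not the authors' implementation: the names scbebf, solve_linear_system and dot are hypothetical, features are stored densely rather than sparsely, the regression is solved through the normal equations, and the small ridge term is an added numerical safeguard that is not part of the OLS step in Algorithm 1.

```python
import random


def dot(a, b):
    return sum(p * q for p, q in zip(a, b))


def solve_linear_system(matrix, rhs):
    # Gaussian elimination with partial pivoting for a small dense system.
    size = len(rhs)
    aug = [row[:] + [rhs[i]] for i, row in enumerate(matrix)]
    for col in range(size):
        pivot = max(range(col, size), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for row in range(col + 1, size):
            factor = aug[row][col] / aug[col][col]
            for k in range(col, size + 1):
                aug[row][k] -= factor * aug[col][k]
    solution = [0.0] * size
    for row in range(size - 1, -1, -1):
        tail = sum(aug[row][k] * solution[k] for k in range(row + 1, size))
        solution[row] = (aug[row][size] - tail) / aug[row][row]
    return solution


def scbebf(features, rewards, gamma, proj_sizes, seed=0, ridge=1e-8):
    # features holds x_1, ..., x_{n+1}; rewards holds r_1, ..., r_n.
    rng = random.Random(seed)
    dim, n = len(features[0]), len(rewards)
    weights = [0.0] * dim                      # V(x) = weights . x, initially 0
    for d in proj_sizes:
        # Phi is dim x d with i.i.d. N(0, 1/d) entries (Eqn 5).
        sigma = (1.0 / d) ** 0.5
        phi = [[rng.gauss(0.0, sigma) for _ in range(d)] for _ in range(dim)]
        values = [dot(weights, x) for x in features]
        deltas = [rewards[t] + gamma * values[t + 1] - values[t] for t in range(n)]
        # Compressed inputs z_t = Phi^T x_t.
        z = [[sum(x[i] * phi[i][j] for i in range(dim)) for j in range(d)]
             for x in features[:n]]
        # Normal equations (Z^T Z + ridge * I) u = Z^T delta; ridge is an assumption.
        gram = [[sum(r[a] * r[b] for r in z) + (ridge if a == b else 0.0)
                 for b in range(d)] for a in range(d)]
        moment = [sum(r[a] * deltas[t] for t, r in enumerate(z)) for a in range(d)]
        u = solve_linear_system(gram, moment)
        # The new BEBF has weights Phi u and is added with constant weight 1.
        for i in range(dim):
            weights[i] += sum(phi[i][j] * u[j] for j in range(d))
    return weights


# Toy usage with made-up data: 6 one-hot style features, 4 transitions.
toy_features = [[1.0 if i == t else 0.0 for i in range(6)] for t in range(5)]
toy_rewards = [0.0, 0.0, 0.0, 1.0]
toy_weights = scbebf(toy_features, toy_rewards, gamma=0.9, proj_sizes=[3, 2])
```

Only the running sum of coefficients is kept between iterations, which mirrors the point that the simplified version does not store individual features. A practical version would exploit sparsity when forming the projected inputs.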
As compared to the general CBEBF, the simplified version trivially has lower computational\ncomplexity per iteration, as it avoids an extra level of regression based on the features. It also\navoids storing the features by simply keeping the sum of all previously generated coefficients.\nIt is important to note that once we use linear value function approximation, the entire BEBF generation\nprocess can be viewed as a regularized value iteration algorithm. Each iteration of the algorithm\nis a regularized Bellman backup which is linear in the features. The coefficients of this linear backup\nare confined to a lower-dimensional random subspace implicitly induced by the random projection\nused in each iteration.\n\n3.3 Finite Sample Analysis of Simplified CBEBF\n\nThis section provides a finite sample analysis of the simplified CBEBF algorithm. In order to provide\nsuch analysis, we need to have an assumption on the range of observed TD-errors. This is usually\npossible by assuming that the current estimate of the value function is bounded, which is easy to\nenforce by truncating any estimate of the value function between 0 and Vmax = Rmax/(1 \u2212 \u03b3).\nThe following theorem shows how well we can estimate the Bellman error by regression to the TD-errors\nin a compressed space. It highlights the bias\u2013variance trade-off with respect to the choice of\nthe projection size.\nTheorem 2. Let $\\Phi_{D \\times d}$ be a random projection according to Eqn 5. Let $S_n = ((x_t, r_t)_{t=1}^n)$ be\na sample trajectory collected on an MDP with a fixed policy with stationary distribution \u03c1, in a\nD-dimensional k-sparse feature space, with D > d \u2265 10. Let \u03c4 be the forgetting time of the chain\n(defined in the appendix). Fix any estimate V of the value function, and the corresponding TD-errors\n\u03b4t\u2019s bounded by \u00b1\u03b4max. Assume that the Bellman error is linear in the features with parameter w.\nWith compressed OLS regression we have $w_{ols}^{(\\Phi)} = (X\\Phi)^\\dagger \\delta$, where X is the matrix containing xt\u2019s\nand \u03b4 is the vector of TD-errors. Assume that X is of rank larger than d. For any fixed 0 < \u03be < 1/4,\nwith probability no less than 1 \u2212 \u03be, the prediction error $\\| x^T \\Phi w_{ols}^{(\\Phi)} - e_V(x) \\|_{\\rho(x)}$\nis bounded by:\n\n$12 \\alpha \\epsilon_{prj}^{(\\xi/4)} \\|w\\| \\|x\\|_\\rho \\sqrt{\\frac{1}{d\\nu}} + 4 \\alpha \\epsilon_{prj}^{(\\xi/4)} \\|w\\| \\sqrt{\\frac{d\\tau}{n\\nu} \\log\\frac{d}{\\xi}} + 2 \\alpha \\delta_{max} \\|x\\|_\\rho \\sqrt{\\frac{\\kappa d}{n\\nu} \\log\\frac{d}{\\xi}},$ (7)\n\nwhere $\\epsilon_{prj}^{(\\xi/4)}$ is according to Lemma 1, \u03ba and \u03bd are the condition number and the smallest positive\neigenvalue of the empirical gram matrix $\\frac{1}{n} \\Phi^T X^T X \\Phi$, and we define maximum norm scaling factor\n$\\alpha = \\max(1, \\max_{z \\in \\mathcal{X}} \\|z^T \\Phi\\| / \\|z\\|)$.\n\nA detailed proof is included in the appendix. The sketch of the proof is as follows: Lemma 1\nsuggests that if the Bellman error is linear in the original features, the bias due to the projection can\nbe bounded within a controlled constant error with logarithmic size projections. If the Markov chain\nuniformly quickly forgets its past, one can also bound the on-measure variance part of the error. The\nvariance terms, of course, go to 0 as the number of sampled transitions n goes to infinity.\nTheorem 2 can be further simplified by using concentration bounds on random projections as defined\nin Eqn 5. 
The norm of \u03a6 can be bounded using the bounds discussed in Cand\u00e8s and Tao [8]; we\nhave with probability 1 \u2212 \u03b4\u03a6:\n\n$\\|\\Phi\\| \\le \\sqrt{D/d} + \\sqrt{(2 \\log(2/\\delta_\\Phi))/d} + 1$ and $\\|\\Phi^\\dagger\\| \\le \\left[ \\sqrt{D/d} - \\sqrt{(2 \\log(2/\\delta_\\Phi))/d} - 1 \\right]^{-1}.$\n\nSimilarly, when n > d, we expect the smallest and biggest singular values of X\u03a6 to be of order of\n$\\tilde{O}(\\sqrt{n/d})$. Thus we have \u03ba = O(1) and \u03bd = O(1/d). Projections are norm-preserving and thus\n\u03b1 \u2243 1. Assuming that $n = \\tilde{O}(d^2)$, we can rewrite the bound on the error up to logarithmic terms as:\n\n$\\tilde{O}\\left( \\|w\\| \\|x\\|_{\\rho(x)} \\sqrt{k \\log D / d} \\right) + \\tilde{O}\\left( d / \\sqrt{n} \\right).$ (8)\n\nThe first term is a part of the bias due to the projection (excess approximation error). The rest is the\ncombined variance terms that shrink with larger training sets (estimation error). We clearly observe\nthe trade-off with respect to the compressed dimension d. With the assumptions discussed above,\nwe can see that projection of size $d = \\tilde{O}(k \\log D)$ should be enough to guarantee arbitrarily small\nbias, as long as \u2016w\u2016\u2016x\u2016\u03c1(x) is small. Thus, the bound is tight enough to prove reduction in the error\nas new BEBFs are added to the feature set.\n\nNote that this bound matches that of Ghavamzadeh et al. [14]. The variance term is of order $\\sqrt{d/(n\\nu)}$.\nThus, the dependence on the smallest eigenvalue of the gram matrix makes the variance term order\n$d/\\sqrt{n}$ rather than the expected $\\sqrt{d/n}$. We expect the use of ridge regression instead of OLS in\nthe inner loop of the algorithm to remove this dependence and help with the convergence rate (see\nappendix).\nAs mentioned before, our simplified version of the algorithm does not store the generated BEBFs\n(such that it could later apply value function approximation over them). It adds up all the features\nwith weight 1 to approximate the value function. Therefore our analysis is different from that of\nParr et al. [6]. The following lemma (simplification of results in Parr et al. [6]) provides a sufficient\ncondition for the shrinkage of the error in the value function prediction:\nLemma 3. Let $V^\\pi$ be the value function of a policy \u03c0 imposing stationary measure \u03c1, and let $e_V$ be\nthe Bellman error under policy \u03c0 for an estimate V. Given a BEBF \u03c8 satisfying:\n\n$\\|\\psi(x) - e_V(x)\\|_{\\rho(x)} \\le \\epsilon \\|e_V(x)\\|_{\\rho(x)},$ (9)\n\nwe have that:\n\n$\\|V^\\pi(x) - (V(x) + \\psi(x))\\|_{\\rho(x)} \\le (\\gamma + \\epsilon + \\epsilon\\gamma) \\|V^\\pi(x) - V(x)\\|_{\\rho(x)}.$ (10)\n\nTheorem 2 (simplified in Equation (8)) does not state the error in terms of $\\|e_V(x)\\|_\\rho = \\|w^T x\\|_\\rho$,\nas needed by this lemma, but rather does it in terms of \u2016w\u2016\u2016x\u2016\u03c1. Therefore, if there is a large gap\nbetween these terms, we cannot expect to see shrinkage in the error (we can only show that the\nerror can be shrunk to a bounded uncontrolled constant). Ghavamzadeh et al. [14] and Maillard and\nMunos [10, 12] provide some discussion on the cases where $\\|w^T x\\|_\\rho$ and \u2016w\u2016\u2016x\u2016\u03c1 are expected to\nbe close. 
These cases include when the features are rescaled orthonormal basis functions and also\nwith specific classes of wavelet functions.\nThe dependence on the norm of w is conjectured to be tight by the compressed sensing literature\n[24], making this bound asymptotically the best one can hope for. This dependence also points\nout an interesting link between our method and L2-regularized LSTD. We expect ridge regression to\nbe favourable in cases where the norm of the weight vector is small. The upper bound on the error\nof compressed regression is also smaller when the norm of w is small.\nLemma 4. Assume the conditions of Theorem 2. Further assume for some constants c1, c2, c3 \u2265 1:\n\n$\\|w\\| \\le c_1 \\|w^T x\\|_\\rho$ and $\\|x\\|_\\rho \\le c_2$ and $1/\\nu \\le c_3 d.$ (11)\n\nThere exist universal constants c4 and c5, such that for any \u03b3 < \u03b30 < 1 and 0 < \u03be < 1/4, if:\n\n$d \\ge \\alpha^2 c_1^2 c_2^2 c_3 c_4 \\left( \\frac{1+\\gamma}{\\gamma_0 - \\gamma} \\right)^2 k \\log \\frac{D}{\\xi}$ and $n \\ge \\left( \\tau + \\frac{\\alpha^2 c_2^2 c_3 \\delta_{max}^2 \\kappa}{\\|w^T x\\|_\\rho} \\right) c_5 \\left( \\frac{1+\\gamma}{\\gamma_0 - \\gamma} \\right)^2 d^2 \\log \\frac{d}{\\xi},$ (12)\n\nthen with the addition of the estimated BEBF, we have that with probability 1 \u2212 \u03be:\n\n$\\|V^\\pi(x) - (V(x) + \\psi(x))\\|_{\\rho(x)} \\le \\gamma_0 \\|V^\\pi(x) - V(x)\\|_{\\rho(x)}.$\n\nThe proof is included in the appendix. Lemma 4 shows that with enough sampled transitions, using\nrandom projections of size $d = \\tilde{O}\\left( \\left( \\frac{1+\\gamma}{\\gamma_0 - \\gamma} \\right)^2 k \\log D \\right)$ guarantees contraction in the error by a factor\nof \u03b30. Using union bound over m iterations of the algorithm, we prove that projections of size\n$d = \\tilde{O}\\left( \\left( \\frac{1+\\gamma}{\\gamma_0 - \\gamma} \\right)^2 k \\log(mD) \\right)$ and a sample of transitions of size $n = \\tilde{O}\\left( \\left( \\frac{1+\\gamma}{\\gamma_0 - \\gamma} \\right)^2 d^2 \\log(md) \\right)$\nsuffices to shrink the error by a factor of $\\gamma_0^m$ after m iterations.\n\n4 Empirical Analysis\n\nWe conduct a series of experiments to evaluate the performance of our algorithm and compare it\nagainst viable alternatives. Experiments are performed using a simulator that models an autonomous\nhelicopter in the flight regime close to hover [25]. Our goal is to evaluate the value function associated\nwith the manually tuned policy provided with the simulator. We let the helicopter free fall\nfor 5 time-steps before the policy takes control. We then collect 100 transitions while the helicopter\nhovers. We run this process multiple times to collect more trajectories on the policy.\nThe original state space of the helicopter domain consists of 12 continuous features. 6 of these\nfeatures corresponding to the velocities and position, capture most of the data needed for policy\nevaluation. We use tile-coding on these 6 features as follows: 8 randomly positioned grids of size\n16 \u00d7 16 \u00d7 16 are placed over forward, sideways and downward velocity. 8 grids of similar structure\nare placed on features corresponding to the hovering coordinates. The constructed feature space is\nthus of size 65536. Note that our choice of tile-coding for this domain is for demonstration purposes.\nSince the true value function is not known in our case, we evaluate the performance of the algorithm\nby measuring the normalized return prediction error (NRPE) on a large test set. Let U(xi) be the\nempirical return observed for xi in a testing trajectory, and $\\bar{U}$ be its average over the testing measure\n\u00b5(x). 
We define $\\mathrm{NRPE}(V) = \\|U(x) - V(x)\\|_{\\mu(x)} / \\|U(x) - \\bar{U}\\|_{\\mu(x)}$. Note that the best constant\npredictor has NRPE = 1.\nWe start with an experiment to observe the behaviour of the prediction error in SCBEBF as we run\nmore iterations of the algorithm. We collect 3000 sample transitions for training. We experiment\nwith 3 schedules for the projection size: (1) Fix d = 300 for 300 steps. (2) Fix d = 30 for 300 steps.\n(3) Let d decrease with each iteration i: $d = \\lfloor 300 e^{-i/30} \\rfloor$. Figure 1 (left) shows the error averaged\nover 5 runs. When d is fixed to a large number, the prediction error drops rapidly, but then rises due\nto over-fitting. This problem can be mitigated by using a smaller fixed projection size at the cost of\nslower convergence. In our experiments, we find a gradually decreasing schedule to provide fast and\nrobust convergence with minimal over-fitting effects.\n\nFigure 1: Left: NRPE of SCBEBF for different numbers of projections, under different choices of\nd, averaged over 5 runs. Right: Comparison of the prediction error of different methods for varying\nsample sizes. 95% confidence intervals are tight (less than 0.005 in width) and are not shown.\n\nWe next compare SCBEBF against other alternatives. There are only a few methods that can be\ncompared against our algorithm due to the high dimensional feature space. We compare against\nCompressed LSTD (CLSTD) [14], L2-Regularized LSTD using a Biconjugate gradient solver (L2-LSTD),\nand L1-Regularized LSTD using LARS-TD [2] with a Biconjugate gradient solver in the\ninner loop (L1-LSTD). These conjugate gradient solvers exploit the sparsity of the feature space to\nconverge faster to the solution of linear equations [26]. We avoided online and stochastic gradient\ntype methods as they are not very efficient in sample complexity.\nWe compare the described methods while increasing the size of the training set. 
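The NRPE metric and the exponentially decaying projection schedule used in these experiments can be sketched as follows. This is a minimal illustration, assuming a uniform weighting of the test points in place of the testing measure; the function names are hypothetical and the toy returns are made up.

```python
import math


def projection_schedule(start, decay, num_iterations):
    # d_i = floor(start * exp(-i / decay)) for i = 1, ..., num_iterations.
    return [int(math.floor(start * math.exp(-i / decay)))
            for i in range(1, num_iterations + 1)]


def nrpe(returns, predictions):
    # Normalized return prediction error with uniform weights (an assumption):
    # ||U - V|| / ||U - mean(U)||, so the best constant predictor scores 1.
    mean_return = sum(returns) / len(returns)
    num = sum((u - v) ** 2 for u, v in zip(returns, predictions))
    den = sum((u - mean_return) ** 2 for u in returns)
    return math.sqrt(num / den)


schedule = projection_schedule(300, 30.0, 3)
toy_returns = [1.0, 2.0, 3.0, 6.0]
constant_score = nrpe(toy_returns, [3.0] * 4)      # predicting the mean return
perfect_score = nrpe(toy_returns, toy_returns)     # predicting the returns exactly
```

For large iteration counts the floor in this schedule eventually reaches 0; the text does not say how that case is handled, so a caller would need to stop or clamp the size.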
The projection\nschedule for SCBEBF is set to $d = \\lfloor 500 e^{-i/300} \\rfloor$ for all sample sizes. The regularization parameter\nof L2-LSTD was chosen among a small set of values using 1/5 of the training data as validation set.\nDue to memory and time constraints, the optimal choice of parameters could not be set for CLSTD\nand L1-LSTD. The maximum size of projection for CLSTD and the maximum number of non-zero\ncoefficients for L1-LSTD were set to 3000. CLSTD would run out of memory and L1-LSTD would\ntake multiple hours if we increased these limits.\n\n[Figure 1 plots. Left panel: NRPE vs. Iteration for d=30, d=300 and d=300 exp(\u2212i/30). Right panel: NRPE vs. Sample Size for L2-LSTD, L1-LSTD, CLSTD and SCBEBF.]\n\nThe results, averaged over 5 runs, are shown in Figure 1 (right). We see that L2-LSTD outperforms\nother methods, closely followed by SCBEBF. Not surprisingly, L1-LSTD and CLSTD are not competitive\nhere as they are suboptimal with the mentioned constraints. This is a consequence of the\nfact that these algorithms scale worse with respect to memory and time complexity.\nWe conjecture that L2-LSTD is benefiting from the sparsity of the feature space, not only in running\ntime (due to the use of conjugate gradient solvers), but also in sample complexity. This makes L2-LSTD\nan attractive choice when the features are observed in the sparse basis. However, if the\nfeatures are sparse in some unknown basis (observation is not sparse), then the time complexity of\nany linear solver in the observation basis can be prohibitive. SCBEBF, however, scales much better\nin such cases as the main computation is done in the compressed space.\n\nFigure 2: Runtime of L2-LSTD and SCBEBF with varying observation sparsity.\n\nTo highlight this effect, we construct an experiment in which we gradually increase the number of\nnon-zero features using a change of basis. 
The errors of both L2-LSTD and SCBEBF remain mostly\nunchanged, as predicted by the theory. We thus only compare the running times as we change the\nobservation sparsity. Figure 2 shows the CPU time used by each method with a sample size of 3000,\naveraged over 5 runs (using Matlab on a 3.2GHz Quad-Core Intel Xeon processor). We run 100\niterations of SCBEBF with d = \u230a300 e^{\u2212i/30}\u230b (as in the first experiment), and set the regularization\nparameter of L2-LSTD to the optimal value. We can see that the running time of L2-LSTD quickly\nbecomes prohibitive as the observation sparsity decreases, whereas the running time of SCBEBF\ngrows very slowly (and linearly).\n\n5 Discussion\n\nWe provided a simple, fast and robust feature extraction algorithm for policy evaluation in sparse\nand high dimensional state spaces. Using recent results on the properties of random projections, we\nproved that in sparse spaces, random projections of sizes logarithmic in the original dimension are\nsufficient to preserve linearity. Therefore, BEBFs can be generated on compressed spaces induced\nby small random projections. Our finite sample analysis provides guarantees on the reduction in\nprediction error after the addition of such BEBFs.\nOur assumption of the linearity of the Bellman error in the original feature space might be too strong\nfor some problems. We introduced this assumption to simplify the analysis. However, most of the\ndiscussion can be rephrased in terms of the projected Bellman error, and we expect this approach to\ncarry through and provide more general results (e.g. see Parr et al. [6]).\nCompared to other regularization approaches to RL [2, 27, 28], our random projection method does\nnot require complex optimization, and thus is faster and more scalable. If features are observed in\nthe sparse basis, then conjugate gradient solvers can be used for regularized value function\napproximation. 
However, CBEBF seems to have better performance with smaller sample sizes and provably\nworks under any observation basis.\nFinding the optimal choice of the projection size schedule and the number of iterations is an\ninteresting subject of future research. We expect the use of cross-validation to suffice for the selection\nof the optimal parameters, due to the robustness that we observed in the results of the algorithm. A\ntighter theoretical bound might also help provide an analytical, closed-form answer to how\nparameters should be selected. One would expect a slow reduction in the projection size to be favourable.\nAcknowledgements: Financial support for this work was provided by Natural Sciences and\nEngineering Research Council Canada, through their Discovery Grants Program.\n\n[Figure 2 plot: CPU time (seconds) vs. number of non-zero features for L2-LSTD and SCBEBF.]\n\nReferences\n[1] D. Di Castro and S. Mannor. Adaptive bases for reinforcement learning. Machine Learning and Knowledge Discovery in Databases, pages 312\u2013327, 2010.\n\n[2] J.Z. Kolter and A.Y. Ng. Regularization and feature selection in least-squares temporal difference learning. In International Conference on Machine Learning, 2009.\n\n[3] P.W. Keller, S. Mannor, and D. Precup. Automatic basis function construction for approximate dynamic programming and reinforcement learning. In International Conference on Machine Learning, 2006.\n\n[4] P. Manoonpong, F. W\u00f6rg\u00f6tter, and J. Morimoto. Extraction of reward-related feature space using correlation-based and reward-based learning methods. Neural Information Processing. Theory and Algorithms, pages 414\u2013421, 2010.\n\n[5] A. Geramifard, F. Doshi, J. Redding, N. Roy, and J.P. How. Online discovery of feature dependencies. In International Conference on Machine Learning, 2011.\n\n[6] R. Parr, C. Painter-Wakefield, L. Li, and M. Littman. 
Analyzing feature generation for value-function approximation. In International Conference on Machine Learning, 2007.\n\n[7] J. Boyan and A.W. Moore. Generalization in reinforcement learning: Safely approximating the value function. In Advances in Neural Information Processing Systems, 1995.\n\n[8] E.J. Cand\u00e8s and T. Tao. Near-optimal signal recovery from random projections: Universal encoding strategies. Information Theory, IEEE Transactions on, 52(12):5406\u20135425, 2006.\n\n[9] E.J. Cand\u00e8s and M.B. Wakin. An introduction to compressive sampling. Signal Processing Magazine, IEEE, 25(2):21\u201330, 2008.\n\n[10] O.A. Maillard and R. Munos. Linear regression with random projections. Journal of Machine Learning Research, 13:2735\u20132772, 2012.\n\n[11] M.M. Fard, Y. Grinberg, J. Pineau, and D. Precup. Compressed least-squares regression on sparse spaces. In AAAI, 2012.\n\n[12] O.A. Maillard and R. Munos. Compressed least-squares regression. In Advances in Neural Information Processing Systems, 2009.\n\n[13] S. Zhou, J. Lafferty, and L. Wasserman. Compressed regression. In Proceedings of Advances in neural information processing systems, 2007.\n\n[14] M. Ghavamzadeh, A. Lazaric, O.A. Maillard, and R. Munos. LSTD with random projections. In Advances in Neural Information Processing Systems, 2010.\n\n[15] B.A. Olshausen, P. Sallee, and M.S. Lewicki. Learning sparse image codes using a wavelet pyramid architecture. In Advances in neural information processing systems, 2001.\n\n[16] R. S. Sutton and A. G. Barto. Reinforcement Learning: An Introduction. MIT Press, Cambridge, MA, 1998.\n\n[17] S.J. Bradtke and A.G. Barto. Linear least-squares algorithms for temporal difference learning. Machine Learning, 22(1):33\u201357, 1996.\n\n[18] J.A. Boyan. Technical update: Least-squares temporal difference learning. Machine Learning, 49(2):233\u2013246, 2002.\n\n[19] M.G. Lagoudakis and R. Parr. Least-squares policy iteration. Journal of Machine Learning Research, 4:1107\u20131149, 2003. ISSN 1532-4435.\n\n[20] H.R. Maei and R.S. Sutton. GQ(\u03bb): A general gradient algorithm for temporal-difference prediction learning with eligibility traces. In Third Conference on Artificial General Intelligence, 2010.\n\n[21] I. Menache, S. Mannor, and N. Shimkin. Basis function adaptation in temporal difference reinforcement learning. Annals of Operations Research, 134(1):215\u2013238, 2005.\n\n[22] M.A. Davenport, M.B. Wakin, and R.G. Baraniuk. Detection and estimation with compressive measurements. Dept. of ECE, Rice University, Tech. Rep, 2006.\n\n[23] M.M. Fard, Y. Grinberg, J. Pineau, and D. Precup. Random projections preserve linearity in sparse spaces. School of Computer Science, McGill University, Tech. Rep, 2012.\n\n[24] M.A. Davenport, P.T. Boufounos, M.B. Wakin, and R.G. Baraniuk. Signal processing with compressive measurements. Selected Topics in Signal Processing, IEEE Journal of, 4(2):445\u2013460, 2010.\n\n[25] Andrew Y Ng, Adam Coates, Mark Diel, Varun Ganapathi, Jamie Schulte, Ben Tse, Eric Berger, and Eric Liang. Autonomous inverted helicopter flight via reinforcement learning. In Experimental Robotics IX, pages 363\u2013372. Springer, 2006.\n\n[26] Richard Barrett, Michael Berry, Tony F Chan, James Demmel, June Donato, Jack Dongarra, Victor Eijkhout, Roldan Pozo, Charles Romine, and Henk Van der Vorst. Templates for the solution of linear systems: building blocks for iterative methods. Number 43. Society for Industrial and Applied Mathematics, 1987.\n\n[27] A.M. Farahmand, M. Ghavamzadeh, and C. Szepesv\u00e1ri. Regularized policy iteration. In Advances in Neural Information Processing Systems, 2010.\n\n[28] J. Johns, C. Painter-Wakefield, and R. Parr. Linear complementarity for regularized policy evaluation and improvement. 
In Advances in Neural Information Processing Systems, 2010.", "award": [], "sourceid": 1385, "authors": [{"given_name": "Mahdi", "family_name": "Milani Fard", "institution": "McGill University"}, {"given_name": "Yuri", "family_name": "Grinberg", "institution": "McGill University"}, {"given_name": "Amir-massoud", "family_name": "Farahmand", "institution": "McGill University"}, {"given_name": "Joelle", "family_name": "Pineau", "institution": "McGill University"}, {"given_name": "Doina", "family_name": "Precup", "institution": "McGill University"}]}