{"title": "Value Pursuit Iteration", "book": "Advances in Neural Information Processing Systems", "page_first": 1340, "page_last": 1348, "abstract": "Value Pursuit Iteration (VPI) is an approximate value iteration algorithm that finds a close to optimal policy for reinforcement learning and planning problems with large state spaces. VPI has two main features: First, it is a nonparametric algorithm that finds a good sparse approximation of the optimal value function given a dictionary of features. The algorithm is almost insensitive to the number of irrelevant features. Second, after each iteration of VPI, the algorithm adds a set of functions based on the currently learned value function to the dictionary. This increases the representation power of the dictionary in a way that is directly relevant to the goal of having a good approximation of the optimal value function. We theoretically study VPI and provide a finite-sample error upper bound for it.", "full_text": "Value Pursuit Iteration\n\nAmir-massoud Farahmand\u2217\n\nDoina Precup \u2020\nSchool of Computer Science, McGill University, Montreal, Canada\n\nAbstract\n\nValue Pursuit Iteration (VPI) is an approximate value iteration algorithm that \ufb01nds\na close to optimal policy for reinforcement learning problems with large state\nspaces. VPI has two main features: First, it is a nonparametric algorithm that\n\ufb01nds a good sparse approximation of the optimal value function given a dictio-\nnary of features. The algorithm is almost insensitive to the number of irrelevant\nfeatures. Second, after each iteration of VPI, the algorithm adds a set of functions\nbased on the currently learned value function to the dictionary. This increases the\nrepresentation power of the dictionary in a way that is directly relevant to the goal\nof having a good approximation of the optimal value function. 
We theoretically\nstudy VPI and provide a \ufb01nite-sample error upper bound for it.\n\n1\n\nIntroduction\n\nOne often has to use function approximation to represent the near optimal value function of the\nreinforcement learning (RL) and planning problems with large state spaces. Even though the con-\nventional approach of using a parametric model for the value function has had successes in many\napplications, it has one main weakness: Its success critically depends on whether the chosen func-\ntion approximation method is suitable for the particular task in hand. Manually designing a suitable\nfunction approximator, however, is dif\ufb01cult unless one has considerable domain knowledge about\nthe problem. To address this issue, the problem-dependent choice of function approximator and\nnonparametric approaches to RL/Planning problems have gained considerable attention in the RL\ncommunity, e.g., feature generation methods of Mahadevan and Maggioni [1], Parr et al. [2], and\nnonparametric regularization-based approaches of Farahmand et al. [3, 4], Taylor and Parr [5] .\nOne class of approaches that addresses the aforementioned problem is based on the idea of \ufb01nding\na sparse representation of the value function in a large dictionary of features (or atoms). In this\napproach, the designer does not necessarily know a priori whether or not a feature is relevant to the\nrepresentation of the value function. The feature, therefore, is simply added to the dictionary with\nthe hope that the algorithm itself \ufb01gures out the necessary subset of features. The usual approach to\ntackle irrelevant features is to use sparsity-inducing regularizers such as the l1-norm of the weights\nin the case of linear function approximators, e.g., Kolter and Ng [6], Johns et al. [7], Ghavamzadeh\net al. [8]. Another approach is based on greedily adding atoms to the representation of the target\nfunction. 
Examples of these approaches in the supervised learning setting are Matching Pursuit and\nOrthogonal Matching Pursuit (OMP) [9, 10]. These greedy algorithms have successfully been used\nin the signal processing and statistics/supervised machine learning communities for years, but their\napplication in the RL/Planning problems has just recently attracted some attention. Johns [11] em-\npirically investigated some greedy algorithms, including OMP, for the task of feature selection using\ndictionary of proto-value functions [1]. A recent paper by Painter-Wake\ufb01eld and Parr [12] considers\ntwo algorithms (OMP-TD and OMP-BRM; OMP-TD is the same as one of the algorithms by [11])\nin the context of policy evaluation and provides some conditions under which OMP-BRM can \ufb01nd\nthe minimally sparse solution.\n\u2217Academic.SoloGen.net.\n\u2020This research was funded in part by NSERC and ONR.\n\n1\n\n\fTo address the problem of value function representation in RL when not much a priori knowledge\nis available, we introduce the Value Pursuit Iteration (VPI) algorithm. VPI, which is an Approx-\nimate Value Iteration (AVI) algorithm (e.g., [13]), has two main features. The \ufb01rst is that it is a\nnonparametric algorithm that \ufb01nds a good sparse approximation of the optimal value function given\na set of features (dictionary), by using a modi\ufb01ed version of OMP. The second is that after each\niteration, the VPI algorithm adds a set of functions based on the currently learned value function to\nthe dictionary. This potentially increases the representation power of the dictionary in a way that is\ndirectly relevant to the goal of approximating the optimal value function.\nAt the core of VPI is the OMP algorithm equipped with a model selection procedure. Using OMP\nallows VPI to \ufb01nd a sparse representation of the value function in large dictionaries, even countably\nin\ufb01nite ones1. 
This property is very desirable for RL/Planning problems for which one usually does not know the right representation of the value function, and so one wishes to add as many features as possible and to let the algorithm automatically detect the best representation. A model selection procedure ensures that OMP is adaptive to the actual difficulty of the problem.\nThe second main feature of VPI is that it increases the size of the dictionary by adding some basis functions computed from previously learned value functions. To give an intuitive understanding of how this might help, consider the dictionary B = {g_1, g_2, . . .}, in which each atom g_i is a real-valued function defined on the state-action space. The goal is to learn the optimal value function by a representation in the form of Q = \u2211_{i\u22651} w_i g_i (footnote 2). Suppose that we are lucky and the optimal value function Q\u2217 belongs to the dictionary B, e.g., g_1 = Q\u2217. This is indeed an ideal atom to have in the dictionary since one may have a sparse representation of the optimal value function in the form of Q\u2217 = \u2211_{i\u22651} w_i g_i with w_1 = 1 and w_i = 0 for i \u2265 2. Algorithms such as OMP can find this sparse representation quite effectively (details will be specified later). Of course, we are not usually lucky enough to have the optimal value function in our dictionary, but we may still use approximations of the optimal value function. In the exact Value Iteration, Q_k \u2192 Q\u2217 exponentially fast. This ensures that Q_k and Q_{k+1} = T\u2217Q_k are close enough, so one may use Q_k to explain a large part of Q_{k+1} and use the other atoms of the dictionary to \u201cexplain\u201d the residual. In an AVI procedure, however, the estimated value function sequence (Q_k)_{k\u22651} does not necessarily converge to Q\u2217, but one may hope that it gets close to a region around the optimum. In that case, we may very well use the dictionary of {Q_1, . . . , Q_k} as the set of candidate atoms to be used in the representation of Q_{k+1}. We show that adding these learned atoms does not hurt and may actually help.\nTo summarize, the algorithmic contribution of this paper is to introduce the VPI algorithm that finds a sparse representation of the optimal value function in a huge function space and increases the representation capacity of the dictionary problem-dependently. The theoretical contribution of this work is to provide a finite-sample analysis of VPI and to show that the method is sound.\n\n2 Definitions\n\nWe follow the standard notation and definitions of Markov Decision Processes (MDP) and Reinforcement Learning (RL) (cf. [14]). We also need some definitions regarding the function spaces and norms, which are defined later in this section.\nFor a space \u2126 with \u03c3-algebra \u03c3_\u2126, M(\u2126) denotes the set of all probability measures over \u03c3_\u2126. B(\u2126) denotes the space of bounded measurable functions w.r.t. (with respect to) \u03c3_\u2126 and B(\u2126, L) denotes the subset of B(\u2126) with bound 0 < L < \u221e.\nA finite-action discounted MDP is a 5-tuple (X, A, P, R, \u03b3), where X is a measurable state space, A is a finite set of actions, P : X \u00d7 A \u2192 M(X) is the transition probability kernel, R : X \u00d7 A \u2192 M(\u211d) is the reward kernel, and \u03b3 \u2208 [0, 1) is a discount factor. Let r(x, a) = E[R(\u00b7|x, a)], and assume that r is uniformly bounded by Rmax. A measurable mapping \u03c0 : X \u2192 A is called a deterministic Markov stationary policy, or just a policy for short. A policy \u03c0 induces the m-step transition probability kernels (P^\u03c0)^m : X \u2192 M(X) and (P^\u03c0)^m : X \u00d7 A \u2192 M(X \u00d7 A) for m \u2265 1. We use V^\u03c0 and Q^\u03c0 to denote the value and action-value function of a policy \u03c0. 
We also use V\u2217 and Q\u2217 for the optimal value and optimal action-value functions, with the corresponding optimal policy \u03c0\u2217. A policy \u03c0 is greedy w.r.t. an action-value function Q, denoted \u03c0 = \u02c6\u03c0(\u00b7; Q), if \u03c0(x) = argmax_{a\u2208A} Q(x, a) holds for all x \u2208 X (if there exist multiple maximizers, one of them is chosen in an arbitrary deterministic manner). Define Qmax = Rmax/(1 \u2212 \u03b3). The Bellman optimality operator is denoted by T\u2217. We use (P V)(x) to denote the expected value of V after the transition according to a probability transition kernel P. Also for a probability measure \u03c1 \u2208 M(X), the symbol (\u03c1P) represents the distribution over states when the initial state distribution is \u03c1 and we follow P for a single step. A typical choice of P is (P^\u03c0)^m for m \u2265 1 (similarly for \u03c1 \u2208 M(X \u00d7 A) and action-value functions).\n\nFootnote 1: From the statistical viewpoint and ignoring the computational difficulty of working with large dictionaries.\nFootnote 2: The notation will be defined precisely in Section 2.\n\n2.1 Norms and Dictionaries\n\nFor a probability measure \u03c1 \u2208 M(X), and a measurable function V \u2208 B(X), we define the L_p(\u03c1)-norm (1 \u2264 p < \u221e) of V as \u2016V\u2016_{p,\u03c1} \u225c [\u222b_X |V(x)|^p d\u03c1(x)]^{1/p}. The L_\u221e(X)-norm is defined as \u2016V\u2016_\u221e \u225c sup_{x\u2208X} |V(x)|. Similarly for \u03bd \u2208 M(X \u00d7 A) and Q \u2208 B(X \u00d7 A), we define \u2016\u00b7\u2016_{p,\u03bd} as \u2016Q\u2016^p_{p,\u03bd} \u225c \u222b_{X\u00d7A} |Q(x, a)|^p d\u03bd(x, a) and \u2016Q\u2016_\u221e \u225c sup_{(x,a)\u2208X\u00d7A} |Q(x, a)|.\nLet z_{1:n} denote the Z-valued sequence (z_1, . . . , z_n). For D_n = z_{1:n}, define the empirical norm of function f : Z \u2192 \u211d as \u2016f\u2016^p_{p,D_n} = \u2016f\u2016^p_{p,z_{1:n}} \u225c (1/n) \u2211_{i=1}^n |f(z_i)|^p. Based on this definition, one may define \u2016V\u2016_{D_n} (with Z = X) and \u2016Q\u2016_{D_n} (with Z = X \u00d7 A). Note that if D_n = Z_{1:n} is random with Z_i \u223c \u03bd, the empirical norm is random as well. For any fixed function f, we have E[\u2016f\u2016^p_{p,D_n}] = \u2016f\u2016^p_{p,\u03bd}. The symbols \u2016\u00b7\u2016_\u03bd and \u2016\u00b7\u2016_{D_n} refer to an L_2-norm. When we do not want to emphasize the underlying measure, we use \u2016\u00b7\u2016 to denote an L_2-norm.\nConsider a Hilbert space H endowed with an inner product norm \u2016\u00b7\u2016. We call a family of functions B = {g_1, g_2, . . .} with atoms g_i \u2208 H a dictionary. The class L_1(B) = L_1(B; \u2016\u00b7\u2016) consists of those functions f \u2208 H that admit an expansion f = \u2211_{g\u2208B} c_g g with (c_g) being an absolutely summable sequence (these definitions are quoted from Barron et al. [15]). The norm of a function f in this space is defined as \u2016f\u2016_{L_1(B;\u2016\u00b7\u2016)} \u225c inf{\u2211_{g\u2208B} |c_g| : f = \u2211_{g\u2208B} c_g g}. To avoid clutter, when the norm is the empirical norm \u2016\u00b7\u2016_{D_n}, we may use L_1(B; D_n) instead of L_1(B; \u2016\u00b7\u2016_{D_n}), and when the norm is \u2016\u00b7\u2016_\u03bd, we may use L_1(B; \u03bd). We denote a ball with radius r > 0 w.r.t. the norm of L_1(B; \u03bd) by B_r(L_1(B; \u03bd)).\nFor a dictionary B, we introduce a fixed exhaustion B_1 \u2282 B_2 \u2282 . . . \u2282 B, with the number of atoms |B_m| being m. If we index our dictionaries as B_k, the symbol B_{k,m} refers to the m-th element of the exhaustion of B_k. 
For a real number \u03b1 > 0, the space L_{1,\u03b1}(B; \u2016\u00b7\u2016) is defined as the set of all functions f such that for all m = 1, 2, . . . , there exists a function h depending on m such that \u2016h\u2016_{L_1(B_m;\u2016\u00b7\u2016)} \u2264 C and \u2016f \u2212 h\u2016 \u2264 C m^{\u2212\u03b1}. The smallest constant C such that these inequalities hold defines a norm for L_{1,\u03b1}(B; \u2016\u00b7\u2016). Finally, we define the truncation operator \u03b2_L : B(X) \u2192 B(X) for some real number L > 0 as follows. For any function f \u2208 B(X), the truncated function of f at the threshold level L is the function \u03b2_L f : X \u2192 \u211d such that for any x \u2208 X, \u03b2_L f(x) is equal to f(x) if \u2212L \u2264 f(x) \u2264 L, is equal to L if f(x) > L, and is equal to \u2212L if f(x) < \u2212L. We overload \u03b2_L to be an operator from B(X \u00d7 A) to B(X \u00d7 A) by applying it component-wise, i.e., \u03b2_L Q(x, \u00b7) \u225c [\u03b2_L Q(x, a_1), . . . , \u03b2_L Q(x, a_{|A|})]^\u22a4.\n\n3 VPI Algorithm\n\nIn this section, we first describe the behaviour of VPI in the ideal situation when the Bellman optimality operator T\u2217 can be applied exactly in order to provide the intuitive understanding of why VPI might work. Afterwards, we describe the algorithm that does not have access to the Bellman optimality operator and only uses a finite sample of transitions.\nVPI belongs to the family of AVI algorithms, which start with an initial action-value function Q_0 and at each iteration k = 0, 1, . . . , approximately apply the Bellman optimality operator T\u2217 to the most recent estimate Q_k to get a new estimate Q_{k+1} \u2248 T\u2217Q_k. 
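As a rough illustration of the iteration template just described, the sketch below runs exact value iteration (the error-free special case of AVI) on a tiny tabular MDP. The two-state MDP, its rewards, and all function names are hypothetical and chosen only for this example; they do not come from the paper.

```python
# Hypothetical two-state, two-action MDP, used only for illustration.
GAMMA = 0.9
STATES = [0, 1]
ACTIONS = [0, 1]
# P[(x, a)] lists (next_state, probability); R[(x, a)] is the expected reward r(x, a).
P = {
    (0, 0): [(0, 1.0)],
    (0, 1): [(1, 1.0)],
    (1, 0): [(0, 0.5), (1, 0.5)],
    (1, 1): [(1, 1.0)],
}
R = {(0, 0): 0.0, (0, 1): 1.0, (1, 0): 0.5, (1, 1): 2.0}


def bellman_optimality(q):
    # (T* Q)(x, a) = r(x, a) + gamma * E[max over b of Q(next state, b)]
    return {
        (x, a): R[(x, a)]
        + GAMMA * sum(p * max(q[(y, b)] for b in ACTIONS) for y, p in P[(x, a)])
        for x in STATES
        for a in ACTIONS
    }


def sup_distance(q1, q2):
    # Supremum-norm distance between two tabular action-value functions.
    return max(abs(q1[key] - q2[key]) for key in q1)


def value_iteration(num_iterations):
    # Exact VI: Q_0 = 0 and Q_{k+1} = T* Q_k, with no approximation error.
    q = {(x, a): 0.0 for x in STATES for a in ACTIONS}
    history = [q]
    for _ in range(num_iterations):
        q = bellman_optimality(q)
        history.append(q)
    return history


history = value_iteration(50)
# Successive differences shrink at least by a factor gamma per iteration.
gaps = [sup_distance(history[k + 1], history[k]) for k in range(50)]
```

An AVI method such as VPI replaces the exact call to `bellman_optimality` with a regression step, so the contraction seen in `gaps` then holds only up to the per-iteration approximation error.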
The size of the error between Q_{k+1} and T\u2217Q_k is a key factor in determining the performance of an AVI procedure.\nSuppose that T\u2217Q_k can be calculated, but it is not possible to represent it exactly. In this case, one may use an approximant Q_{k+1} to represent T\u2217Q_k. In this paper we would like to represent Q_{k+1} as a linear function of some atoms in a dictionary B = {g_1, g_2, . . .} (g \u2208 H(X \u00d7 A) and \u2016g\u2016 = 1 for all g \u2208 B), that is Q_{k+1} = \u2211_{g\u2208B} c_g g. Our goal is to find a representation that is as sparse as possible, i.e., uses only a few atoms in B. From a statistical viewpoint, the smallest representation among all those that have the same function approximation error is desirable as it leads to smaller estimation error. The goal of finding the sparsest representation, however, is computationally intractable. Nevertheless, it is possible to find a \u201creasonable\u201d suboptimal sparse approximation using algorithms such as OMP, which is the focus of this paper.\nThe OMP algorithm works as follows. Let \u02dcQ^{(0)} = 0. For each i = 1, 2, . . . , define the residual r^{(i\u22121)} = T\u2217Q_k \u2212 \u02dcQ^{(i\u22121)}. Define the new atom to be added to the representation as g^{(i)} \u2208 Argmax_{g\u2208B} |\u27e8r^{(i\u22121)}, g\u27e9|, i.e., choose an element of the dictionary that has the maximum correlation with the residual. Here \u27e8\u00b7, \u00b7\u27e9 is the inner product for a Hilbert space H(X \u00d7 A) to which T\u2217Q_k and atoms of the dictionary belong. Let \u03a0^{(i)} be the orthogonal projection onto span(g^{(1)}, . . . , g^{(i)}), i.e., \u03a0^{(i)}T\u2217Q_k \u225c argmin_{Q\u2208span(g^{(1)},...,g^{(i)})} \u2016Q \u2212 T\u2217Q_k\u2016. We then have \u02dcQ^{(i)} = \u03a0^{(i)}T\u2217Q_k. 
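The OMP steps above can be sketched in a few lines. This is a minimal illustration under simplifying assumptions, not the modified procedure analysed in the paper: the Hilbert space is replaced by a finite-dimensional Euclidean space, the atoms are plain lists with unit Euclidean norm, and the orthogonal projection is computed by Gram-Schmidt. The names `omp`, `dot`, and the example dictionary are hypothetical.

```python
import math


def dot(u, v):
    # Euclidean inner product, standing in for the Hilbert-space inner product.
    return sum(a * b for a, b in zip(u, v))


def omp(target, atoms, num_steps):
    # Greedy OMP on plain lists; atoms are assumed to have unit Euclidean norm.
    basis = []   # orthonormal basis of span(g(1), ..., g(i))
    chosen = []  # indices of the selected atoms, in selection order
    approx = [0.0] * len(target)
    for _ in range(num_steps):
        residual = [t - q for t, q in zip(target, approx)]
        # Pick the atom with maximum absolute correlation with the residual.
        best = max(range(len(atoms)), key=lambda j: abs(dot(residual, atoms[j])))
        if abs(dot(residual, atoms[best])) < 1e-12:
            break  # the residual is orthogonal to every atom
        chosen.append(best)
        # Gram-Schmidt step: orthogonalize the new atom against the current basis.
        new = list(atoms[best])
        for q in basis:
            c = dot(new, q)
            new = [x - c * y for x, y in zip(new, q)]
        norm = math.sqrt(dot(new, new))
        basis.append([x / norm for x in new])
        # Orthogonal projection of the target onto the span of the chosen atoms.
        approx = [0.0] * len(target)
        for q in basis:
            c = dot(target, q)
            approx = [x + c * y for x, y in zip(approx, q)]
    return approx, chosen


s = 1.0 / math.sqrt(2.0)
atoms = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0], [s, s, 0.0]]
approx, chosen = omp([3.0, 0.0, 4.0], atoms, num_steps=3)
```

In this toy run the target is exactly 2-sparse in the dictionary, so the loop stops after two atoms have been selected.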
OMP continues iteratively.\nTo quantify the approximation error at the i-th iteration, we use the L_1(B; \u2016\u00b7\u2016)-norm of the target function of the OMP algorithm, which is T\u2217Q_k in our case (with the norm being the one induced by the inner product used in the OMP procedure). Recall that this class consists of functions that admit an expansion in the form \u2211_{g\u2208B} c_g g with (c_g) being an absolutely summable sequence. If T\u2217Q_k belongs to the class of L_1(B; \u2016\u00b7\u2016), it can be shown (e.g., Theorem 2.1 of Barron et al. [15]) that after i iterations of OMP, the returned function \u02dcQ^{(i)} is such that \u2016\u02dcQ^{(i)} \u2212 T\u2217Q_k\u2016 \u2264 \u2016T\u2217Q_k\u2016_{L_1(B;\u2016\u00b7\u2016)} / \u221a(i+1).\nThe problem with this result is that it requires T\u2217Q_k to belong to L_1(B; \u2016\u00b7\u2016). This depends on how expressive the dictionary B is. If it is not expressive enough, we still would like OMP to quickly converge to the best approximation of T\u2217Q_k \u2209 L_1(B; \u2016\u00b7\u2016) in L_1(B; \u2016\u00b7\u2016). Fortunately, such a result exists (Theorem 2.3 by Barron et al. [15], quoted in the supplementary material) and we use it in the proof of our main result.\nWe now turn to the more interesting case when we do not have access to T\u2217Q_k. Instead we are only given a set of transitions in the form of D_n^{(k)} = {(X_i^{(k)}, A_i^{(k)}, R_i^{(k)}, X\u2032_i^{(k)})}_{i=1}^n, where (X_i^{(k)}, A_i^{(k)}) are drawn from the sampling distribution \u03bd \u2208 M(X \u00d7 A), X\u2032_i \u223c P(\u00b7|X_i, A_i), and R_i \u223c R(\u00b7|X_i, A_i). Instead of using T\u2217Q_k, we use the empirical Bellman operator for the dataset D_n^{(k)}. The operator is defined as follows.\n\nDefinition 1 (Empirical Bellman Optimality Operator). Let D_n = {(X_1, A_1, R_1, X\u2032_1), . . . , (X_n, A_n, R_n, X\u2032_n)}, defined similarly as above. Define the ordered multiset S_n = {(X_1, A_1), . . . , (X_n, A_n)}. The empirical Bellman optimality operator \u02c6T\u2217 : S_n \u2192 \u211d^n is defined as (\u02c6T\u2217Q)(X_i, A_i) \u225c R_i + \u03b3 max_{a\u2032} Q(X\u2032_i, a\u2032) for 1 \u2264 i \u2264 n.\n\nSince E[\u02c6T\u2217Q_k(X_i^{(k)}, A_i^{(k)}) | Q_k, X_i^{(k)}, A_i^{(k)}] = T\u2217Q_k(X_i^{(k)}, A_i^{(k)}), we can solve a regression problem and find an estimate for Q_{k+1}, which is close to T\u2217Q_k. This regression problem is the core of the family of Fitted Q-Iteration (FQI) algorithms, e.g., [13, 4]. In this paper, the regression function at each iteration is estimated using a modified OMP procedure introduced by Barron et al. [15].\nWe are now ready to describe the VPI algorithm (Algorithm 1). It gets as input a predefined dictionary B_0. This can be a dictionary of wavelets, proto-value functions, etc. The size of this dictionary can be countably infinite. It also receives an integer m, which specifies how many atoms of B_0 should be used by the algorithm. This defines the effective dictionary B_{0,m}. This value can be set to m = \u2308n^a\u2309 for some finite a > 0, so it can actually be quite large. VPI also receives K, the number of iterations, and \u03bd, the sampling distribution. For the simplicity of analysis, we assume that the sampling distribution is fixed, but in practice one may change this sampling distribution after each iteration (e.g., sample new data according to the latest policy). Finally, VPI gets a set of m\u2032 link functions \u03c3_i : B(X \u00d7 A, Qmax) \u2192 B(X \u00d7 A, Qmax) for some m\u2032 that is smaller than m/K. 
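To make the empirical Bellman optimality operator of Definition 1 concrete, the following sketch computes the regression targets from a list of sampled transitions. The function name, the callable representation of Q, and the toy data are assumptions made for illustration only.

```python
def empirical_bellman_targets(q, transitions, actions, gamma):
    # q: callable (state, action) -> float, standing in for the current Q.
    # transitions: list of (x, a, r, x_next) tuples, playing the role of D_n.
    # Returns one target per sample: r_i + gamma * max over b of q(x_next_i, b).
    return [
        r + gamma * max(q(x_next, b) for b in actions)
        for (_x, _a, r, x_next) in transitions
    ]


def toy_q(x, a):
    # Hypothetical action-value function used only in this example.
    return float(x + a)


transitions = [(0, 0, 1.0, 1), (1, 1, 0.0, 2)]
targets = empirical_bellman_targets(toy_q, transitions, actions=[0, 1], gamma=0.5)
```

These targets are the responses of the regression problem; the covariates are the sampled state-action pairs.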
We describe the role of link functions shortly.\n\nAlgorithm 1 Value Pursuit Iteration(B_0, m, {\u03c3_i}_{i=1}^{m\u2032}, \u03bd, K)\nInput: Initial dictionary B_0, Number of dictionary atoms used m, Link functions {\u03c3_i}_{i=1}^{m\u2032}, State-action distribution \u03bd, Number of iterations K.\nReturn: Q_K\nQ_0 \u2190 0.\nB\u2032_0 \u2190 \u2205.\nfor k = 0, 1, . . . , K \u2212 1 do\n  Construct a dataset D_n^{(k)} = {(X_i^{(k)}, A_i^{(k)}, R_i^{(k)}, X\u2032_i^{(k)})}_{i=1}^n, with (X_i^{(k)}, A_i^{(k)}) i.i.d. \u223c \u03bd\n  \u02c6Q_{k+1}^{(0)} \u2190 0\n  // Orthogonal Matching Pursuit loop\n  Normalize elements of B_{0,m} and B\u2032_k according to \u2016\u00b7\u2016_{D_n^{(k)}} and call them \u02c6B_k and \u02c6B\u2032_k.\n  for i = 1, 2, . . . , c_1 n do\n    r^{(i\u22121)} \u2190 \u02c6T\u2217Q_k \u2212 \u02c6Q_{k+1}^{(i\u22121)}\n    g^{(i)} \u2190 Argmax_{g \u2208 \u02c6B_k \u222a \u02c6B\u2032_k} |\u27e8r^{(i\u22121)}, g\u27e9_{D_n^{(k)}}|\n    \u02c6Q_{k+1}^{(i)} \u2190 \u03a0^{(i)} \u02c6T\u2217Q_k   {\u03a0^{(i)}: Projection onto span(g^{(1)}, . . . , g^{(i)})}\n  end for\n  i\u2217 \u2190 argmin_{i\u22651} { \u2016\u03b2_{Qmax} \u02c6Q_{k+1}^{(i)} \u2212 \u02c6T\u2217Q_k\u2016^2_{D_n^{(k)}} + c_2(Qmax) i ln(n)/n }   {Complexity Regularization}\n  Q_{k+1} \u2190 \u02c6Q_{k+1}^{(i\u2217)}\n  B\u2032_{k+1} \u2190 B\u2032_k \u222a {\u03c3_i(\u03b2_{Qmax} Q_{k+1}; B_k \u222a B\u2032_k)}_{i=1}^{m\u2032}   {Extending the dictionary}\nend for\n\nAt the k-th iteration of the algorithm, we perform OMP for c_1 n iterations (c_1 > 0), similar to what is described above with the difference that instead of using T\u2217Q_k as the target, we use \u02c6T\u2217Q_k over empirical samples (footnote 3). This means that we use the empirical inner product \u27e8Q_1, Q_2\u27e9_{D_n^{(k)}} \u225c (1/n) \u2211_{i=1}^n |Q_1(X_i, A_i) \u00b7 Q_2(X_i, A_i)| for (X_i, A_i) \u2208 D_n^{(k)} and the empirical orthogonal projection (footnote 4). The result would be a sequence (\u02c6Q_{k+1}^{(i)})_{i\u22650}. Next, we perform a model selection procedure to choose the best candidate. This can be done in different ways such as using a separate dataset as a validation set. Here we use a complexity regularization technique that penalizes more complex estimates (those that have more atoms in their representation). Note that we use the truncated estimate \u03b2_{Qmax} \u02c6Q_{k+1}^{(i)} in the model selection procedure. This is required for the theoretical guarantees. The outcome of this model selection procedure will determine Q_{k+1}.\nFinally we use link functions {\u03c3_i}_{i=1}^{m\u2032} to generate m\u2032 new atoms, which are vector-valued Qmax-bounded measurable functions from X \u00d7 A to \u211d^{|A|}, to be added to the learned dictionary B\u2032_k. The link functions extract \u201cinteresting\u201d aspects of Q_{k+1}, potentially by considering the current dictionary B_k \u222a B\u2032_k. VPI is quite flexible in how the new atoms are generated and how large m\u2032 can be. The theory allows m\u2032 to be in the order of n^a (a > 0), so one may add many potentially useful atoms without much deterioration in the performance. Regarding the choice of the link functions, the theory requires that at least Q_{k+1} itself is being added to the dictionary, but it leaves other possibilities open. 
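The complexity-regularized model selection step of Algorithm 1 can be sketched as follows. This is only an illustration: `penalty_constant` stands in for c_2(Qmax), whose value the text does not specify, and the candidate estimates are represented simply by their values at the n samples. All names and numbers are hypothetical.

```python
import math


def truncate(value, bound):
    # Truncation operator: clip a value to the interval [-bound, bound].
    return max(-bound, min(bound, value))


def select_model(candidates, targets, q_max, penalty_constant):
    # candidates[i - 1] holds the values of the i-atom estimate at the n samples.
    # Returns the 1-based index minimizing the empirical squared error of the
    # truncated estimate plus penalty_constant * i * ln(n) / n.
    n = len(targets)
    best_index, best_score = None, float('inf')
    for i, values in enumerate(candidates, start=1):
        error = sum((truncate(v, q_max) - t) ** 2 for v, t in zip(values, targets)) / n
        score = error + penalty_constant * i * math.log(n) / n
        if score < best_score:
            best_index, best_score = i, score
    return best_index


targets = [1.0, 2.0, 3.0, 4.0]
candidates = [
    [2.5, 2.5, 2.5, 2.5],  # one atom: coarse fit
    [1.1, 2.0, 2.9, 4.0],  # two atoms: close fit
    [1.0, 2.0, 3.0, 4.0],  # three atoms: exact fit, but a larger penalty
]
chosen_size = select_model(candidates, targets, q_max=10.0, penalty_constant=1.0)
```

With these toy numbers the two-atom candidate is selected, because the exact three-atom fit does not reduce the error enough to pay for its extra penalty.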
For example, one might apply nonlinearities (e.g., sigmoid functions) to Q_{k+1}. Or one might add atoms localized in parts of the state-action space with high residual errors, a heuristic which has been used previously in basis function construction. This procedure continues for K iterations and the outcome will be Q_K. In the next section, we study the theoretical properties of the greedy policy w.r.t. Q_K, i.e., \u03c0_K = \u02c6\u03c0(\u00b7; Q_K).\nRemark 1 (Comparison of VPI with FQI). Both VPI and FQI are indeed instances of AVI. If we compare VPI with the conventional implementation of FQI that uses a fixed set of linear basis functions, we observe that FQI is the special case of VPI in which all atoms in the dictionary are used in the estimation. As VPI has a model selection step, its chosen estimator is not worse than FQI\u2019s (up to a small extra risk) and is possibly much better if the target is sparse in the dictionary. Moreover, extending the dictionary decreases the function approximation error with negligible effect on the model selection error. The same arguments apply to many other FQI versions that use a fixed data-independent set of basis functions and do not perform model selection.\n\nFootnote 3: The value of c_1 depends only on Qmax and a. We do not explicitly specify it since the value that is determined by the theory shall be quite conservative. One may instead find it by trial and error. Moreover, in practice we may stop much earlier than n iterations.\nFootnote 4: When the number of atoms is larger than the number of samples (i > n), one may use the Moore\u2013Penrose pseudoinverse to perform the orthogonal projection.\n\n4 Theoretical Analysis\n\nIn this section, we first study how the function approximation error propagates in VPI (Section 4.1) and then provide a finite-sample error upper bound as Theorem 3 in Section 4.2. 
All the proofs are in the supplementary material.\n\n4.1 Propagation of Function Approximation Error\n\nIn this section, we present tools to upper bound the function approximation error at each iteration.\nDefinition 2 (Concentrability Coefficient of Function Approximation Error Propagation). (I) Let \u03bd be a distribution over the state-action pairs, (X, A) \u223c \u03bd, \u03bd_X be the marginal distribution of X, and \u03c0_b(\u00b7|\u00b7) be the conditional probability of A given X. Further, let P be a transition probability kernel P : X \u00d7 A \u2192 M(X) and P_{x,a} = P(\u00b7|x, a). Define the concentrability coefficient of one-step transitions w.r.t. \u03bd by C_{\u03bd\u2192\u221e} = (E[sup_{(y,a\u2032)\u2208X\u00d7A} |(1/\u03c0_b(a\u2032|y)) (dP_{X,A}/d\u03bd_X)(y)|])^{1/2}, where C_{\u03bd\u2192\u221e} = \u221e if P_{x,a} is not absolutely continuous w.r.t. \u03bd_X for some (x, a) \u2208 X \u00d7 A, or if \u03c0_b(a\u2032|y) = 0 for some (y, a\u2032) \u2208 X \u00d7 A. (II) Furthermore, for an optimal policy \u03c0\u2217 and an integer m \u2265 0, let \u03bd(P^{\u03c0\u2217})^m \u2208 M(X \u00d7 A) denote the future state-action distribution obtained after m-steps of following \u03c0\u2217. Define c_\u03bd(m) \u225c \u2016d(\u03bd(P^{\u03c0\u2217})^m)/d\u03bd\u2016_\u221e. If the future state-action distribution \u03bd(P^{\u03c0\u2217})^m is not absolutely continuous w.r.t. \u03bd, we let c_\u03bd(m) = \u221e.\nThe constant C_{\u03bd\u2192\u221e} is large if after one transition step, the future states can be highly concentrated at some states where the probability of taking some action a\u2032 is small or d\u03bd_X is small. Hence, the name \u201cconcentrability of one-step transitions\u201d. The definition of C_{\u03bd\u2192\u221e} is from Chapter 5 of Farahmand [16]. The constant c_\u03bd(m) shows how much we deviate from \u03bd whenever we follow an optimal policy \u03c0\u2217. It is notable that if \u03bd happens to be the stationary distribution of the optimal policy \u03c0\u2217 (e.g., the samples are generated by an optimal expert), c_\u03bd(m) = 1 for all m \u2265 0.\nWe now provide the following result that upper bounds the error caused by using Q_k (which is the newly added atom to the dictionary) to approximate T\u2217Q_k. The proof is provided in the supplementary material.\nLemma 1. Let (Q_i)_{i=0}^k \u2282 B(X \u00d7 A, Qmax) be a Qmax-bounded sequence of measurable action-value functions. Define \u03b5_i \u225c T\u2217Q_i \u2212 Q_{i+1} (0 \u2264 i \u2264 k \u2212 1). Then, \u2016Q_k \u2212 T\u2217Q_k\u2016^2_\u03bd \u2264 [(1 + \u03b3C_{\u03bd\u2192\u221e})^2 / (1 \u2212 \u03b3)] [\u2211_{i=0}^{k\u22121} \u03b3^{k\u22121\u2212i} c_\u03bd(k \u2212 1 \u2212 i) \u2016\u03b5_i\u2016^2_\u03bd + \u03b3^k (2Qmax)^2].\n\nIf there was no error at earlier iterations (i.e., \u2016\u03b5_i\u2016_\u03bd = 0 for 0 \u2264 i \u2264 k \u2212 1), the error \u2016Q_k \u2212 T\u2217Q_k\u2016^2_\u03bd would be O(\u03b3^k Qmax^2), which is decaying toward zero with a geometrical rate. This is similar to the behaviour of the exact VI, i.e., \u2016T\u2217Q_k \u2212 Q_k\u2016_\u221e \u2264 (1 + \u03b3)\u03b3^k \u2016Q\u2217 \u2212 Q_0\u2016_\u221e.\nThe following result is Theorem 5.3 of Farahmand [16]. For the sake of completeness, we provide the proof in the supplementary material.\nTheorem 2. Let (Q_k)_{k=0}^{k\u22121} be a sequence of state-action value functions and define \u03b5_i \u225c T\u2217Q_i \u2212 Q_{i+1} (0 \u2264 i \u2264 k). Let F^{|A|} : X \u00d7 A \u2192 \u211d^{|A|} be a subset of vector-valued measurable functions. Then, inf_{Q\u2032\u2208F^{|A|}} \u2016Q\u2032 \u2212 T\u2217Q_k\u2016_\u03bd \u2264 inf_{Q\u2032\u2208F^{|A|}} \u2016Q\u2032 \u2212 (T\u2217)^{(k+1)}Q_0\u2016_\u03bd + \u2211_{i=0}^{k\u22121} (\u03b3C_{\u03bd\u2192\u221e})^{k\u2212i} \u2016\u03b5_i\u2016_\u03bd.\n\nThis result quantifies the behaviour of the function approximation error and relates it to the function approximation error of approximating (T\u2217)^{k+1}Q_0 (which is a deterministic quantity depending only on the MDP itself, the function space F^{|A|}, and Q_0) and the errors of earlier iterations. This allows us to provide a tighter upper bound for the function approximation error compared to the so-called inherent Bellman error sup_{Q\u2208F^{|A|}} inf_{Q\u2032\u2208F^{|A|}} \u2016Q\u2032 \u2212 T\u2217Q\u2016_\u03bd introduced by Munos and Szepesv\u00e1ri [17], whenever the errors at previous iterations are small.\n\n4.2 Finite Sample Error Bound for VPI\n\nIn this section, we provide an upper bound on the performance loss \u2016Q\u2217 \u2212 Q^{\u03c0_K}\u2016_{1,\u03c1}. This performance loss indicates the regret of following the policy \u03c0_K instead of an optimal policy when the initial state-action is distributed according to \u03c1. We define the following concentrability coefficients similar to Farahmand et al. [18].\nDefinition 3 (Expected Concentrability of the Future State-Action Distribution). Given \u03c1, \u03bd \u2208 M(X \u00d7 A), m \u2265 0, and an arbitrary sequence of stationary policies (\u03c0_m)_{m\u22651}, let \u03c1P^{\u03c0_1}P^{\u03c0_2} . . . 
P \u03c0m \u2208 M(X \u00d7 A) denote the future state-action distribu-\n(\u03c0m)m\u22651,\ntion obtained after m transitions, when the \ufb01rst state-action pair is distributed accord-\nFor integers m1, m2 \u2265\ning to \u03c1 and then we follow the sequence of policies (\u03c0k)m\n1, policy \u03c0 and the sequence of policies \u03c01, . . . , \u03c0k de\ufb01ne the concentrability coef\ufb01cients\ncVI1,\u03c1,\u03bd(m1, m2; \u03c0) (cid:44)\nand cVI2,\u03c1,\u03bd(m1; \u03c01, . . . , \u03c0k) (cid:44)\n\n(cid:34)(cid:12)(cid:12)(cid:12)(cid:12) d\n(cid:16)\n\u03c1(P \u03c0)m1 (P \u03c0\u2217\n(cid:20)(cid:12)(cid:12)(cid:12) d(\u03c1(P \u03c0k )m1 P \u03c0k\u22121 P \u03c0k\u22122\u00b7\u00b7\u00b7P \u03c01 )\n\n, where (X, A) \u223c \u03bd. If the future state-action dis-\n)m2 (similarly, if \u03c1(P \u03c0k )m1 P \u03c0k\u22121P \u03c0k\u22122 \u00b7\u00b7\u00b7 P \u03c01) is not absolutely continu-\n\nE\ntribution \u03c1(P \u03c0)m1(P \u03c0\u2217\nous w.r.t. \u03bd, we let cVI1,\u03c1,\u03bd(m1, m2; \u03c0) = \u221e (similarly, cVI2,\u03c1,\u03bd(m1; \u03c01, . . . , \u03c0k) = \u221e).\nAssumption A1 We make the following assumptions:\n\n(cid:12)(cid:12)(cid:12)(cid:12)2(cid:35)(cid:33) 1\n\n(cid:12)(cid:12)(cid:12)2(cid:21)(cid:19) 1\n\n(cid:32)\n\n(X, A)\n\n(X, A)\n\n(cid:18)\n\nk=1.\n\n(cid:17)\n\n)m2\n\nE\n\nd\u03bd\n\nd\u03bd\n\n2\n\n2\n\n\u2022 For all values of 0 \u2264 k \u2264 K \u2212 1,\n\n, R(k)\n) \u223c \u03bd \u2208 M(X \u00d7 A) and X(cid:48)\n\nthe dataset used by VPI at each iteration is\ni=1 with independent and identically distributed (i.i.d.)\ni \u223c\n\n(k) \u223c P (\u00b7|X (k)\n\n) and R(k)\n\n(k))}n\n\n, A(k)\n\n, X(cid:48)\n\ni\n\ni\n\ni\n\ni\n\ni\n\nn = {(X (k)\nD(k)\nsamples (X (k)\nR(\u00b7,\u00b7|X (k)\n\ni\n, A(k)\n\ni\n\n, A(k)\n, A(k)\n\ni\n\ni\n\ni\n\ni\n) for i = 1, 2, . . . 
, n.\n\nn\n\nn and D(k(cid:48))\n\nare independent.\n\nQmax almost surely (a.s).\n\n\u2022 For 1 \u2264 k, k(cid:48) \u2264 K \u2212 1 and k (cid:54)= k(cid:48), the datasets D(k)\n\u2022 There exists a constant Qmax \u2265 1 such that for any Q \u2208 B(X \u00d7 A; Qmax), | \u02c6T \u2217Q(X, A)| \u2264\n\u2022 For all g \u2208 B0, (cid:107)g(cid:107)\u221e \u2264 L < \u221e.\n\u2022 The number of atoms m used from the dictionary B0 is m = (cid:100)na(cid:101) for some \ufb01nite a > 0.\n\u2022 At iteration k, each of the link functions {\u03c3i}m(cid:48)\n\nThe number of link functions m(cid:48) used at each iteration is at most m/K.\nBk\nX \u00d7 A \u2192 R|A|. At least one of the mappings returns \u03b2QmaxQk+1.\n\ni=1 maps \u03b2QmaxQk+1 and the dictionary\nk to an element of the space of vector-valued Qmax-bounded measurable functions\n\n(cid:83)B(cid:48)\n\nMost of these assumptions are mild and some of them can be relaxed. The i.i.d. assumption can be\nrelaxed using the so called independent block technique [19], but it results in much more complicated\nproofs. We conjecture that the independence of datasets at different iterations might be relaxed as\nwell under certain condition on the Bellman operator (cf. Section 4.2 of [17]). The condition on the\nnumber of atoms m and the number of link functions being polynomial in n are indeed very mild.\nIn order to compactly present our result, we de\ufb01ne ak = (1\u2212\u03b3) \u03b3K\u2212k\u22121\nfor 0 \u2264 k < K. Note that\nthe behaviour of ak \u221d \u03b3K\u2212k\u22121, so it gives more weight to later iterations. Also de\ufb01ne C1(k) (cid:44)\n\n1\u2212\u03b3K+1\n\n\u03bd\u2192\u221e (k = 1, 2, . . . ) and C2 (cid:44) (1+\u03b3C\u03bd\u2192\u221e)2\n\n1\u2212\u03b3\n\n. 
For 0 \u2264 s \u2264 1, de\ufb01ne\n\n(cid:80)k\u22121\ni=0 \u03b3k\u2212iC 2(k\u2212i)\nK\u22121(cid:88)\n\nCVI,\u03c1,\u03bd(K; s) =\n1 \u2212 \u03b3\n2\n\n)2 sup\n\u03c0(cid:48)\n1,...,\u03c0(cid:48)\n\n(\n\nK\n\nk=0\n\n(cid:34)(cid:88)\n\nm\u22650\n\na2(1\u2212s)\n\nk\n\n\u03b3m(cid:0)cVI1,\u03c1,\u03bd(m, K \u2212 k; \u03c0(cid:48)\n\nK) + cVI2,\u03c1,\u03bd(m + 1; \u03c0(cid:48)\n\nk+1, . . . , \u03c0(cid:48)\n\nK)(cid:1)(cid:35)2\n\n,\n\nwhere in the last de\ufb01nition the supremum is taken over all policies. The following theorem is the\nmain theoretical result of this paper. Its proof is provided in the supplementary material.\n\n7\n\n\fTheorem 3. Consider the sequence (Qk)K\nhold. For any \ufb01xed 0 < \u03b4 < 1, recursively de\ufb01ne the sequence (bi)K\n\nk=0 generated by VPI (Algorithm 1). Let Assumptions A1\n\ni=0 as follows:\n\n(cid:44) c1Q3\n\nmax\n\nb2\n0\n\n+ 3\n\ninf\n\nQ(cid:48)\u2208BQmax (L1(B0,m;\u03bd))\n\n(cid:107)Q(cid:48) \u2212 T \u2217Q0(cid:107)2\n\u03bd ,\n\nlog(cid:0) nK\nlog(cid:0) nK\n\nn\n\n\u03b4\n\n\u03b4\n\n(cid:1)\n(cid:1)\n\n(cid:115)\n(cid:115)\n(cid:40)\n\n(cid:13)(cid:13)2\n\n\u03bd + C1(k)\n\nk\u22121(cid:88)\n\ni=0\n\n\u03b3k\u2212ib2\ni ,\n\n(cid:33)(cid:41)\n\n\u03b3k\u22121\u2212ic\u03bd(k \u2212 1 \u2212 i) b2\n\ni + \u03b3k(2Qmax)2\n\n(k \u2265 1)\n\n,\n\n(cid:44) c2Q3\n\nb2\nk\n\nn\n\n+\n\nmax\n\ninf\n\nc3 min\n\nQ(cid:48)\u2208BQmax (L1(B0,m;\u03bd))\n\n(cid:13)(cid:13)Q(cid:48) \u2212 (T \u2217)k+1Q0\n(cid:32)k\u22121(cid:88)\ndiscounted sum of errors as E(s) (cid:44) (cid:80)K\u22121\n(cid:20)\n\n\u03c1-weighted performance loss of \u03c0K is upper bounded as\n\n\u03bd \u2264 b2\n\nk=0 a2s\n\nC2\n\ni=0\n\n(cid:21)\n\nfor some c1, c2, c3 > 0 that are only functions of Qmax and L. Then, for any k = 0, 1, . . . , K \u2212 1,\nit holds that (cid:107)Qk+1 \u2212 T \u2217Qk(cid:107)2\nK . Furthermore, de\ufb01ne the\nk bk (for s \u2208 [0, 1]). Choose \u03c1 \u2208 M(X \u00d7 A). 
The\n\nk, with probability at least 1 \u2212 k\u03b4\n\n,\n\n2\u03b3\n\ninf\ns\u2208[0,1]\n\n(1 \u2212 \u03b3)2\n\n(cid:113) log(nK/\u03b4)\n\nVI,\u03c1,\u03bd(K; s)E 1/2(s) + 2\u03b3KQmax\nC 1/2\n\n(cid:107)Q\u2217 \u2212 Q\u03c0K(cid:107)1,\u03c1 \u2264\nwith probability at least 1 \u2212 \u03b4.\nThe value of bk is a deterministic upper bound on the error (cid:107)Qk+1 \u2212 T \u2217Qk(cid:107)\u03bd of each iteration of\nVPI. We would like bk to be close to zero, because the second part of the theorem implies that\n(cid:107)Q\u2217 \u2212 Q\u03c0K(cid:107)1,\u03c1 would be small too. If we study b2\nk, we observe two main terms: The \ufb01rst term,\nwhich behaves as\n, is the estimation error. The second term describes the function\napproximation error. For k \u2265 1, it consists of two terms from which the minimum is selected.\nThe \ufb01rst term inside min{\u00b7,\u00b7} describes the behaviour of the function approximation error when\nwe only use the prede\ufb01ned dictionary B0,m to approximate T \u2217Qk (see Theorem 2). The second\nterm describes the behaviour of the function approximation error when we only consider Qk as\nthe approximant of T \u2217Qk (see Lemma 1). The error caused by this approximation depends on\nthe error made in earlier iterations. The current analysis only considers the atom Qk from the\nlearned dictionary, but VPI may actually use other atoms to represent T \u2217Qk. This might lead to\nmuch smaller function approximation errors. Hence, our analysis shows that in terms of function\napproximation error, our method is sound and superior to not increasing the size of the dictionary.\nHowever, revealing the full power of VPI remains as future work. 
Just as an example, if $\\mathcal{B}_0$ is complete in $L_2(\\nu)$, by letting $n, m \\to \\infty$ both the estimation error and the function approximation error go to zero, so the method is consistent and converges to the optimal value function.\n\n5 Conclusion\n\nThis work introduced VPI, an approximate value iteration algorithm that aims to find a close-to-optimal policy using a dictionary of atoms (or features). The VPI algorithm uses a modified Orthogonal Matching Pursuit that is equipped with a model selection procedure. This allows VPI to find a sparse representation of the value function in large, and potentially overcomplete, dictionaries. We theoretically analyzed VPI and provided a finite-sample error upper bound for it. The error bound shows the effect of the number of samples as well as the function approximation properties of the predefined dictionary, and the effect of learned atoms.\n\nThis paper is a step toward better understanding how overcomplete dictionaries and sparsity can effectively be used in the RL/Planning context. A more complete theory describing the effect of adding atoms to the dictionary remains to be established. We are also planning to study VPI\u2019s empirical performance and to compare it with other feature construction methods. We note that our main focus was on the statistical properties of the algorithm, not on computational efficiency; optimizing computation speed will be an interesting topic for future investigation.\n\nReferences\n\n[1] Sridhar Mahadevan and Mauro Maggioni. Proto-value functions: A Laplacian framework for learning representation and control in Markov decision processes. Journal of Machine Learning Research, 8:2169\u20132231, 2007.\n\n[2] Ronald Parr, Christopher Painter-Wakefield, Lihong Li, and Michael Littman. Analyzing feature generation for value-function approximation. 
In ICML \u201907: Proceedings of the 24th international conference on Machine learning, pages 737\u2013744, New York, NY, USA, 2007. ACM.\n\n[3] Amir-massoud Farahmand, Mohammad Ghavamzadeh, Csaba Szepesv\u00e1ri, and Shie Mannor. Regularized policy iteration. In D. Koller, D. Schuurmans, Y. Bengio, and L. Bottou, editors, Advances in Neural Information Processing Systems (NIPS - 21), pages 441\u2013448. MIT Press, 2009.\n\n[4] Amir-massoud Farahmand, Mohammad Ghavamzadeh, Csaba Szepesv\u00e1ri, and Shie Mannor. Regularized fitted Q-iteration for planning in continuous-space Markovian Decision Problems. In Proceedings of American Control Conference (ACC), pages 725\u2013730, June 2009.\n\n[5] Gavin Taylor and Ronald Parr. Kernelized value function approximation for reinforcement learning. In ICML \u201909: Proceedings of the 26th Annual International Conference on Machine Learning, pages 1017\u20131024, New York, NY, USA, 2009. ACM.\n\n[6] J. Zico Kolter and Andrew Y. Ng. Regularization and feature selection in least-squares temporal difference learning. In ICML \u201909: Proceedings of the 26th Annual International Conference on Machine Learning, pages 521\u2013528. ACM, 2009.\n\n[7] Jeff Johns, Christopher Painter-Wakefield, and Ronald Parr. Linear complementarity for regularized policy evaluation and improvement. In J. Lafferty, C. K. I. Williams, J. Shawe-Taylor, R. S. Zemel, and A. Culotta, editors, Advances in Neural Information Processing Systems (NIPS - 23), pages 1009\u20131017. 2010.\n\n[8] Mohammad Ghavamzadeh, Alessandro Lazaric, R\u00e9mi Munos, and Matthew Hoffman. Finite-sample analysis of lasso-TD. In Lise Getoor and Tobias Scheffer, editors, Proceedings of the 28th International Conference on Machine Learning (ICML-11), ICML \u201911, pages 1177\u20131184, New York, NY, USA, June 2011. ACM. ISBN 978-1-4503-0619-5.\n\n[9] Y. C. Pati, R. Rezaiifar, and P. S. Krishnaprasad. Orthogonal matching pursuit: Recursive function approximation with applications to wavelet decomposition. In Proceedings of the 27th Annual Asilomar Conference on Signals, Systems, and Computers, pages 40\u201344, 1993.\n\n[10] Geoffrey M. Davis, St\u00e9phane Mallat, and Marco Avellaneda. Adaptive greedy approximations. Journal of Constructive Approximation, 13:57\u201398, 1997.\n\n[11] Jeff Johns. Basis Construction and Utilization for Markov Decision Processes using Graphs. PhD thesis, University of Massachusetts Amherst, 2010.\n\n[12] Christopher Painter-Wakefield and Ronald Parr. Greedy algorithms for sparse reinforcement learning. In Proceedings of the 29th International Conference on Machine Learning (ICML) (Accepted), 2012.\n\n[13] Damien Ernst, Pierre Geurts, and Louis Wehenkel. Tree-based batch mode reinforcement learning. Journal of Machine Learning Research, 6:503\u2013556, 2005.\n\n[14] Csaba Szepesv\u00e1ri. Algorithms for Reinforcement Learning. Morgan Claypool Publishers, 2010.\n\n[15] Andrew R. Barron, Albert Cohen, Wolfgang Dahmen, and Ronald A. Devore. Approximation and learning by greedy algorithms. The Annals of Statistics, 36(1):64\u201394, 2008.\n\n[16] Amir-massoud Farahmand. Regularization in Reinforcement Learning. PhD thesis, University of Alberta, 2011.\n\n[17] R\u00e9mi Munos and Csaba Szepesv\u00e1ri. Finite-time bounds for fitted value iteration. Journal of Machine Learning Research, 9:815\u2013857, 2008.\n\n[18] Amir-massoud Farahmand, R\u00e9mi Munos, and Csaba Szepesv\u00e1ri. Error propagation for approximate policy and value iteration. In J. Lafferty, C. K. I. Williams, J. Shawe-Taylor, R. S. Zemel, and A. Culotta, editors, Advances in Neural Information Processing Systems (NIPS - 23), pages 568\u2013576. 2010.\n\n[19] Bin Yu. Rates of convergence for empirical processes of stationary mixing sequences. 
The Annals of Probability, 22(1):94\u2013116, January 1994.", "award": [], "sourceid": 654, "authors": [{"given_name": "Amir", "family_name": "Farahmand", "institution": null}, {"given_name": "Doina", "family_name": "Precup", "institution": null}]}