{"title": "Basis refinement strategies for linear value function approximation in MDPs", "book": "Advances in Neural Information Processing Systems", "page_first": 2899, "page_last": 2907, "abstract": "We provide a theoretical framework for analyzing basis function construction for linear value function approximation in Markov Decision Processes (MDPs). We show that important existing methods, such as Krylov bases and Bellman-error-based methods are a special case of the general framework we develop. We provide a general algorithmic framework for computing basis function refinements which \u201crespect\u201d the dynamics of the environment, and we derive approximation error bounds that apply for any algorithm respecting this general framework. We also show how, using ideas related to bisimulation metrics, one can translate basis refinement into a process of finding \u201cprototypes\u201d that are diverse enough to represent the given MDP.", "full_text": "Basis Re\ufb01nement Strategies for Linear Value\n\nFunction Approximation in MDPs\n\nGheorghe Comanici\n\nDoina Precup\n\nSchool of Computer Science\n\nSchool of Computer Science\n\nPrakash Panangaden\n\nSchool of Computer Science\n\nMcGill University\nMontreal, Canada\n\nMcGill University\nMontreal, Canada\n\nMcGill University\nMontreal, Canada\n\ngcoman@cs.mcgill.ca\n\ndprecup@cs.mcgill.ca\n\nprakash@cs.mcgill.ca\n\nAbstract\n\nWe provide a theoretical framework for analyzing basis function construction for\nlinear value function approximation in Markov Decision Processes (MDPs). We\nshow that important existing methods, such as Krylov bases and Bellman-error-\nbased methods are a special case of the general framework we develop. We pro-\nvide a general algorithmic framework for computing basis function re\ufb01nements\nwhich \u201crespect\u201d the dynamics of the environment, and we derive approximation\nerror bounds that apply for any algorithm respecting this general framework. 
We also show how, using ideas related to bisimulation metrics, one can translate basis refinement into a process of finding "prototypes" that are diverse enough to represent the given MDP.

1 Introduction

Finding optimal or close-to-optimal policies in large Markov Decision Processes (MDPs) requires the use of approximation. A very popular approach is to use linear function approximation over a set of features [Sutton and Barto, 1998, Szepesvari, 2010]. An important problem is that of automatically determining this set of features in such a way as to obtain a good approximation of the problem at hand. Many approaches have been explored, including adaptive discretizations [Bertsekas and Castanon, 1989, Munos and Moore, 2002], proto-value functions [Mahadevan, 2005], Bellman error basis functions (BEBFs) [Keller et al., 2006, Parr et al., 2008a], Fourier bases [Konidaris et al., 2011], feature dependency discovery [Geramifard et al., 2011], etc. While many of these approaches have nice theoretical guarantees when constructing features for fixed-policy evaluation, the problem is significantly more difficult in the case of optimal control, where multiple policies have to be evaluated using the same representation.

We analyze this problem by introducing the concept of basis refinement, which can be used as a general framework encompassing a large class of iterative algorithms for automatic feature extraction. The main idea is to start with a set of basis functions consistent with the reward function, i.e. one which allows only states with similar immediate reward to be grouped together. One-step look-ahead is then used to find parts of the state space in which the current basis representation is inconsistent with the environment dynamics, and the basis functions are adjusted to fix this problem. The process continues iteratively.
We show that BEBFs [Keller et al., 2006, Parr et al., 2008a] can be viewed as a special case of this iterative framework. These methods iteratively expand an existing set of basis functions in order to capture the residual Bellman error. The relationship between such features and augmented Krylov bases allows us to show that every additional feature in these sets consistently refines the intermediate bases. Based on similar arguments, it can be shown that other methods, such as those based on the concept of MDP homomorphisms [Ravindran and Barto, 2002], bisimulation metrics [Ferns et al., 2004], and partition refinement algorithms [Ruan et al., 2015], are also special cases of the framework. We provide approximation bounds for sequences of refinements, as well as a basis convergence criterion, using mathematical tools rooted in bisimulation relations and metrics [Givan et al., 2003, Ferns et al., 2004].

A final contribution of this paper is a new approach for computing alternative representations based on a selection of prototypes that incorporate all the information necessary to approximate values over the entire state space. This is closely related to kernel-based approaches [Ormoneit and Sen, 2002, Jong and Stone, 2006, Barreto et al., 2011], but we do not assume that a metric over the state space (which would allow one to determine similarity between states) is provided.
Instead, we use an iterative approach, in which prototypes are selected to properly distinguish dynamics according to the current basis functions, then a new metric is estimated, and the set of prototypes is refined again. This process relies on pseudometrics which converge in the limit to bisimulation metrics.

2 Background and notation

We will use the framework of Markov Decision Processes, consisting of a finite state space S, a finite action space A, a transition function P : (S × A) → P(S)^1, where P(s, a) is a probability distribution over the state space S, and a reward function^2 R : (S × A) → R. For notational convenience, P^a(s) and R^a(s) will be used to denote P(s, a) and R(s, a), respectively. One of the main objectives of MDP solvers is to determine a good action choice, also known as a policy, for every state that the system may visit. A policy π : S → P(A) determines the probability π(s)(a) of choosing each action a given the state s (with Σ_{a∈A} π(s)(a) = 1). The value of a policy π from a state s_0 is defined as

V^π(s_0) = E[ Σ_{i=0}^∞ γ^i R^{a_i}(s_i) | s_{i+1} ∼ P^{a_i}(s_i), a_i ∼ π(s_i) ].

Note that V^π is a real-valued function [[S → R]]; the space of all such functions will be denoted by F_S. We will also call such functions features. Let R^π and P^π denote the reward and transition operators corresponding to choosing actions according to π. Note that R^π ∈ F_S and P^π ∈ [[F_S → F_S]], where^3 R^π(s) = E_{a∼π(s)}[R^a(s)] and P^π(f)(s) = E_{a∼π(s)}[E_{P^a(s)}[f]]. Let T^π ∈ [[F_S → F_S]] denote the Bellman operator: T^π(f) = R^π + γP^π(f). This operator is affine in f, and V^π is its fixed point, i.e. T^π(V^π) = V^π.
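The policy evaluation setup above can be made concrete on a toy example (ours, not from the paper): iterating the Bellman operator T^π(f) = R^π + γP^π(f) converges to its fixed point V^π. A minimal sketch, using a made-up 2-state, 2-action MDP:

```python
# Toy illustration (not from the paper): evaluating a fixed policy by
# iterating the Bellman operator T^pi(f) = R^pi + gamma * P^pi(f).
GAMMA = 0.9
S = [0, 1]
A = [0, 1]
# P[a][s] is the next-state distribution P^a(s); R[a][s] is R^a(s).
P = {0: {0: {0: 1.0}, 1: {0: 0.5, 1: 0.5}},
     1: {0: {1: 1.0}, 1: {1: 1.0}}}
R = {0: {0: 1.0, 1: 0.0}, 1: {0: 0.0, 1: 2.0}}
pi = {0: {0: 1.0, 1: 0.0}, 1: {0: 0.0, 1: 1.0}}  # deterministic policy

def R_pi(s):
    return sum(pi[s][a] * R[a][s] for a in A)

def P_pi(f, s):  # (P^pi f)(s) = E_{a ~ pi(s)} [ E_{P^a(s)}[f] ]
    return sum(pi[s][a] * sum(p * f[t] for t, p in P[a][s].items())
               for a in A)

def bellman(f):
    return {s: R_pi(s) + GAMMA * P_pi(f, s) for s in S}

V = {s: 0.0 for s in S}
for _ in range(1000):          # T^pi is a gamma-contraction, so this converges
    V = bellman(V)

TV = bellman(V)                # fixed-point check: T^pi(V) == V
assert all(abs(TV[s] - V[s]) < 1e-6 for s in S)
print(V)
```

Under this policy, state 0 self-loops with reward 1 and state 1 self-loops with reward 2, so V^π = (1/(1−γ), 2/(1−γ)) = (10, 20).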
Most algorithms for solving MDPs will either use the model (R^π, P^π) to find V^π (if this model is available and/or can be estimated efficiently), or they will estimate V^π directly using samples of the model, {(s_i, a_i, r_i, s_{i+1})}_{i=0}^∞. The value V* associated with the best policy π* is the fixed point of the Bellman optimality operator T* (not a linear operator), defined as: T*(f) = max_{a∈A} (R^a + γP^a(f)).

The main problem we address in this paper is that of finding alternative representations for a given MDP. In particular, we look for finite, linearly independent subsets Φ of F_S. These are bases for subspaces that will be used to speed up the search for V^π, by limiting it to span(Φ). We say that a basis B is a partition if there exists an equivalence relation ∼ on S such that B = {χ(C) | C ∈ S/∼}, where χ is the characteristic function (i.e. χ(X)(x) = 1 if x ∈ X and 0 otherwise). Given any equivalence relation ∼, we will use the notation ∆(∼) for the set of characteristic functions on the equivalence classes of ∼, i.e. ∆(∼) = {χ(C) | C ∈ S/∼}.^4

Our goal is to find subsets Φ ⊂ F_S which allow a value function approximation with strong quality guarantees. More precisely, for any policy π we would like to approximate V^π with V^π_Φ = Σ_{i=1}^k w_i φ_i for some choice of w_i's, which amounts to finding the best candidate inside the space spanned by Φ = {φ_1, φ_2, ..., φ_k}. A sufficient condition for V^π to be an element of span(Φ) (and therefore representable exactly using the chosen set of bases) is for Φ to span the reward function and to be an invariant subspace of the transition operator: R^π ∈ span(Φ) and ∀f ∈ Φ, P^π(f) ∈ span(Φ).
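The sufficient condition just stated can be checked numerically on a small example of our own. Below, P^π has uniform transitions, so Φ = {1, R^π} contains R^π and spans a P^π-invariant subspace (P^π maps both features to multiples of the constant feature), and the computed V^π indeed lies in span(Φ):

```python
# Toy check (ours, not from the paper) of the sufficient condition:
# R^pi in span(Phi) and P^pi-invariance of span(Phi) imply V^pi in span(Phi).
GAMMA = 0.9
R = [1.0, 3.0, 2.0]
n = len(R)

def P_apply(f):                  # uniform transitions: (P^pi f)(s) = mean(f)
    m = sum(f) / n
    return [m] * n

V = [0.0] * n
for _ in range(2000):            # value iteration for the fixed policy
    PV = P_apply(V)
    V = [R[s] + GAMMA * PV[s] for s in range(n)]

# Solve V = alpha * ones + beta * R using states 0 and 1, then verify
# the same expansion matches on state 2, i.e. V is in span{ones, R}.
beta = (V[0] - V[1]) / (R[0] - R[1])
alpha = V[0] - beta * R[0]
assert abs(alpha + beta * R[2] - V[2]) < 1e-6
print(alpha, beta)
```

Here mean(V) = mean(R) + γ·mean(V) gives mean(V) = 20, so V^π = R^π + 18·1: the expansion coefficients are α = 18, β = 1.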
Linear fixed point methods like TD, LSTD, and LSPE [Sutton, 1988, Bradtke and Barto, 1996, Yu and Bertsekas, 2006] can be used to find the least squares fixed point approximation V^π_Φ of V^π for a representation Φ; these constitute proper approximation schemes, as one can determine the number of iterations required to achieve a desired approximation error. Given a representation Φ, the approximate value function V^π_Φ is the fixed point of the operator T^π_Φ, defined as: T^π_Φ(f) := Π_Φ(R^π + γP^π(f)), where Π_Φ is the orthogonal projection operator onto span(Φ). Using the linearity of Π_Φ, it directly follows that T^π_Φ(f) = Π_Φ R^π + γΠ_Φ P^π(f), and that V^π_Φ is the fixed point of the Bellman operator over the transformed linear model (R^π_Φ, P^π_Φ) := (Π_Φ R^π, Π_Φ P^π). For more details, see [Parr et al., 2008a,b].

Footnotes:
^1 We will use P(X) to denote the set of probability distributions on a given set X.
^2 For simplicity, we assume WLOG that the reward is deterministic and independent of the state into which the system arrives.
^3 We will use E_μ[f] = Σ_x f(x)μ(x) to mean the expectation of a function f with respect to a distribution μ. If the function f is multivariate, we will use E_{x∼μ}[f(x, y)] = Σ_x f(x, y)μ(x) to denote the expectation of f when y is fixed.
^4 The equivalence class of an element s ∈ S is {s′ ∈ S | s ∼ s′}. S/∼ is used for the quotient set of all equivalence classes of ∼.

The analysis tools that we will use to establish our results are based on probabilistic bisimulation and its quantitative analogues.
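Returning to the least squares fixed point: for a single feature φ (and uniform projection weights), V^π_Φ = wφ with w = (φ·R^π) / (φ·(φ − γP^πφ)). A toy sketch with our own made-up numbers, which also verifies the fixed-point property of T^π_Φ:

```python
# Sketch (toy, not the paper's code): the least-squares fixed point
# V_Phi = w * phi of T^pi_Phi(f) = Proj_Phi(R^pi + gamma * P^pi f),
# for a single feature phi and uniform projection weights:
#   w = (phi . R^pi) / (phi . (phi - gamma * P^pi phi)).
GAMMA = 0.9
R_pi = [1.0, 2.0]
P_pi = [[0.0, 1.0], [0.0, 1.0]]   # row s: next-state distribution
phi = [1.0, 1.0]                  # one constant feature

Pphi = [sum(P_pi[s][t] * phi[t] for t in range(2)) for s in range(2)]
w = sum(phi[s] * R_pi[s] for s in range(2)) / \
    sum(phi[s] * (phi[s] - GAMMA * Pphi[s]) for s in range(2))
V_phi = [w * phi[s] for s in range(2)]

# Verify the fixed-point property: orthogonally projecting T^pi(V_phi)
# back onto span{phi} returns V_phi itself.
TV = [R_pi[s] + GAMMA * sum(P_pi[s][t] * V_phi[t] for t in range(2))
      for s in range(2)]
coef = sum(phi[s] * TV[s] for s in range(2)) / sum(p * p for p in phi)
proj_TV = [coef * phi[s] for s in range(2)]
assert all(abs(proj_TV[s] - V_phi[s]) < 1e-9 for s in range(2))
print(V_phi)
```

Here T^π(V_Φ) = (14.5, 15.5) is not itself in span(Φ), but its projection (15, 15) coincides with V_Φ, exactly the least squares fixed point behavior described above.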
Strong probabilistic bisimulation is a notion of behavioral equivalence between the states of a probabilistic system, due to [Larsen and Skou, 1991] and applied to MDPs with rewards by [Givan et al., 2003]. The metric analog is due to [Desharnais et al., 1999, 2004], and the extension of the metric to include rewards is due to [Ferns et al., 2004]. An equivalence relation ∼ is a bisimulation relation on the state space S if for every pair (s, s′) ∈ S × S, s ∼ s′ if and only if ∀a ∈ A, ∀C ∈ S/∼, R^a(s) = R^a(s′) and P^a(s)(C) = P^a(s′)(C) (we use P^a(s)(C) to denote the probability of transitioning into C under the transition model for s, a). A pseudo-metric d is a bisimulation metric if there exists some bisimulation relation ∼ such that ∀s, s′, d(s, s′) = 0 ⟺ s ∼ s′.

The bisimulation metrics described by [Ferns et al., 2004] are constructed using the Kantorovich metric for comparing two probability distributions. Given a ground metric d over S, let Ω(d) := {f ∈ F_S | ∀s, s′, f(s) − f(s′) ≤ d(s, s′)} denote the Lipschitz-1 functions with respect to d. The Kantorovich metric over P(S) takes the largest difference in the expected value of such functions: the distance between two probability distributions μ and ν is computed as K(d) : (μ, ν) ↦ sup_{φ∈Ω(d)} E_μ[φ] − E_ν[φ]. For more details on the Kantorovich metric, see [Villani, 2003]. The following approximation scheme converges to a bisimulation metric (starting with d_0 = 0, the metric that assigns 0 to all pairs):

d_{k+1}(s, s′) = T(d_k)(s, s′) := max_a ( (1 − γ) |R^a(s) − R^a(s′)| + γ K(d_k)(P^a(s), P^a(s′)) ).   (1)

The operator T has a fixed point d*, which is a bisimulation metric, and d_k → d* as k → ∞. [Ferns et al., 2004] provide bounds which allow one to assess the quality of general state aggregations using this metric.
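The iteration in Eq. (1) is easy to trace on a toy MDP of our own when transitions are deterministic: each P^a(s) is then a Dirac distribution, and the Kantorovich distance collapses to the ground distance, K(d)(δ_x, δ_y) = d(x, y). A minimal sketch under that assumption:

```python
# Toy sketch of the iteration in Eq. (1) (not the paper's code).  With
# deterministic transitions, K(d)(delta_x, delta_y) = d(x, y), so no
# optimal-transport solver is needed.
GAMMA = 0.9
S = [0, 1, 2]
A = [0]
nxt = {0: {0: 1, 1: 2, 2: 2}}        # nxt[a][s]: deterministic successor
Rew = {0: {0: 1.0, 1: 1.0, 2: 0.0}}  # R^a(s)

def step(d):
    return {(s, t): max((1 - GAMMA) * abs(Rew[a][s] - Rew[a][t])
                        + GAMMA * d[(nxt[a][s], nxt[a][t])]
                        for a in A)
            for s in S for t in S}

d = {(s, t): 0.0 for s in S for t in S}
for _ in range(200):          # T is a gamma-contraction, so this converges
    d = step(d)

# States 0 and 1 earn identical immediate rewards but reach different
# futures, so the limit metric separates them.
assert abs(d[(0, 1)] - GAMMA * d[(1, 2)]) < 1e-9
print(d[(0, 1)], d[(1, 2)], d[(0, 2)])
```

In the limit, d*(1, 2) = (1 − γ)·1 = 0.1, d*(0, 1) = γ·d*(1, 2) = 0.09, and d*(0, 2) = 0.1 + 0.09 = 0.19.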
Given a relation ∼ and its corresponding partition ∆(∼), one can define an MDP model over ∆(∼) as: R̂^a = Π_{∆(∼)} R^a and P̂^a = Π_{∆(∼)} P^a, ∀a ∈ A. The approximation error between the true MDP optimal value function V* and its approximation using this reduced MDP model, denoted by V*_{∆(∼)}, is bounded above by:

|V*_{∆(∼)}(s) − V*(s)| ≤ (1/(1 − γ)) d*_∼(s) + (γ/(1 − γ)²) max_{s′∈S} d*_∼(s′),   (2)

where d*_∼(s) is the average distance from a state s to its ∼-equivalence class, defined as an expectation over the uniform distribution U: d*_∼(s) = E_{ŝ∼U}[d*(s, ŝ) | s ∼ ŝ]. Similar bounds for representations that are not partitions can be found in [Comanici and Precup, 2011]. Note that these bounds are minimized by aggregating states which are "close" in terms of the bisimulation distance d*.

3 Basis refinement

In this section we describe the proposed basis refinement framework, which relies on "detecting" and "fixing" inconsistencies in the dynamics induced by a given set of features. Intuitively, states are dynamically consistent with respect to a set of basis functions if transitions out of these states are evaluated the same way by the model {P^a | a ∈ A}. Inconsistencies are "fixed" by augmenting a basis with features that are able to distinguish inconsistent states, relative to the initial basis. We are now ready to formalize these ideas.

Definition 3.1.
Given a subset F ⊂ F_S, two states s, s′ ∈ S are consistent with respect to F, denoted s ∼_F s′, if ∀f ∈ F, ∀a ∈ A, f(s) = f(s′) and E_{P^a(s)}[f] = E_{P^a(s′)}[f].

Definition 3.2. Given two subspaces F, G ⊂ F_S, we say that G refines F in an MDP M, and write F ⋉ G, if F ⊆ G and ∀s, s′ ∈ S, s ∼_F s′ ⟺ [∀g ∈ G, g(s) = g(s′)].

Using the linearity of expectation, one can prove that, given two probability distributions μ, ν and a finite subset Γ ⊂ F with span(Γ) = F, [∀f ∈ F, E_μ[f] = E_ν[f]] ⟺ [∀b ∈ Γ, E_μ[b] = E_ν[b]]. For the special case of Dirac distributions δ_s and δ_{s′}, for which E_{δ_s}[f] = f(s), it also holds that [∀f ∈ F, f(s) = f(s′)] ⟺ [∀b ∈ Γ, b(s) = b(s′)]. Therefore, Def. 3.2 gives a relation between two subspaces, but the refinement conditions can be checked on any choice of bases. It is the subspace itself rather than a particular basis that matters, i.e. Γ ⋉ Γ′ if span(Γ) ⋉ span(Γ′). To fix an inconsistency on a pair (s, s′), for which we can find f ∈ Γ and a ∈ A such that either f(s) ≠ f(s′) or E_{P^a(s)}[f] ≠ E_{P^a(s′)}[f], one should construct a new function φ with φ(s) ≠ φ(s′) and add it to Γ′. To guarantee that all inconsistencies have been addressed, if φ(s) ≠ φ(s′) for some φ ∈ Γ′, then Γ must contain a feature f such that, for some a ∈ A, either f(s) ≠ f(s′) or E_{P^a(s)}[f] ≠ E_{P^a(s′)}[f].

In Sec.
5 we present an algorithmic framework consisting of sequential improvement steps, in which a current basis Γ is refined into a new one, Γ′, with span(Γ) ⋉ span(Γ′). Def. 3.2 guarantees that following such strategies expands span(Γ), and that the approximation error for any policy decreases as a result. We now discuss bounds that can be obtained based on these definitions.

3.1 Value function approximation results

One simple way to create a refinement is to add to Γ a single element that addresses all inconsistencies: a feature that takes a different value on every element of ∆(∼_Γ). Given ω : ∆(∼_Γ) → R,

[∀b, b′ ∈ ∆(∼_Γ), b ≠ b′ ⇒ ω(b) ≠ ω(b′)] ⇒ Γ ⋉ Γ ∪ { Σ_{b∈∆(∼_Γ)} ω(b) b }.

On the other hand, such a construction provides no approximation guarantee for the optimal value function (unless we make additional assumptions on the problem; we will discuss this further in Section 3.2). Although it addresses inconsistencies in the dynamics over the set of features spanned by Γ, it does not necessarily provide the representation power required to properly approximate the value of the optimal policy. The main theoretical result in this section provides conditions describing refining sequences of bases which are not necessarily accurate, but whose approximation errors are bounded by an exponentially decreasing function. These results are based on ∆(∼_Γ), the largest basis refining a subspace: any feature that is constant over equivalence classes of ∼_Γ is spanned by ∆(∼_Γ), i.e. for any refinement V ⋉ W, V ⊆ W ⊆ span(∆(∼_V)).
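One refinement step can be sketched as follows (a toy of our own, not the paper's algorithm): group states by their feature values and expected next-state feature values under the current basis, i.e. compute the classes of ∼_Γ from Def. 3.1; the indicator functions of these classes form ∆(∼_Γ), the largest basis refining Γ.

```python
# Illustrative sketch (MDP and names are ours): one refinement step.
# States are grouped by the signature
#   ( f(s) for f in Gamma,  E_{P^a(s)}[f] for a in A, f in Gamma ),
# i.e. by the consistency relation ~_Gamma of Def. 3.1.
from collections import defaultdict

S, A = [0, 1, 2, 3], [0]
P = {0: {0: {1: 1.0}, 1: {3: 1.0}, 2: {3: 1.0}, 3: {3: 1.0}}}
Gamma = [{0: 1.0, 1: 1.0, 2: 0.0, 3: 0.0}]   # one feature over S

def expect(dist, f):
    return sum(p * f[t] for t, p in dist.items())

def refine(Gamma):
    groups = defaultdict(list)
    for s in S:
        sig = (tuple(round(f[s], 9) for f in Gamma),
               tuple(round(expect(P[a][s], f), 9) for a in A for f in Gamma))
        groups[sig].append(s)
    # indicator functions of the classes: the basis Delta(~_Gamma)
    return [{s: 1.0 if s in C else 0.0 for s in S} for C in groups.values()]

Delta = refine(Gamma)
classes = [frozenset(s for s in S if b[s] == 1.0) for b in Delta]
print(sorted(classes, key=min))
```

Here states 0 and 1 agree on the current feature but disagree on its expected next-step value (1.0 vs 0.0), so the refinement separates them, producing the classes {0}, {1}, {2, 3}.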
These subsets are convenient as they can be analyzed using the bisimulation metric introduced in [Ferns et al., 2004].

Lemma 3.1. The bisimulation operator T in Eq. (1) is a contraction with constant γ: for any metrics d, d′ over S, sup_{s,s′∈S} |T(d)(s, s′) − T(d′)(s, s′)| ≤ γ sup_{s,s′∈S} |d(s, s′) − d′(s, s′)|.

The proof relies on the Monge-Kantorovich duality (see [Villani, 2003]) to check that T satisfies sufficient conditions for being a contraction operator. An operator Z is a contraction (with constant γ < 1) if Z(x) ≤ Z(x′) whenever x ≤ x′, and if Z(x + c) ≤ Z(x) + γc for any constant c ∈ R [Blackwell, 1965]. One can easily check these conditions on the operator in Equation (1).

Theorem 3.1. Let ∼_0 represent reward consistency, i.e. s ∼_0 s′ ⟺ ∀a ∈ A, R^a(s) = R^a(s′), and let Γ_1 = ∆(∼_0). Additionally, assume {Γ_n}_{n=1}^∞ is a sequence of bases such that for all n ≥ 1, Γ_n ⋉ Γ_{n+1} and Γ_{n+1} is as large as the partition corresponding to consistency over Γ_n, i.e. |Γ_{n+1}| = |S/∼_{Γ_n}|. If V*_{Γ_n} is the optimal value function computed with respect to the representation Γ_n, then ||V*_{Γ_n} − V*||_∞ ≤ γ^{n+1} sup_{s,s′,a} |R^a(s) − R^a(s′)| / (1 − γ)².

Proof. We will use the bisimulation metric defined in Eq. (1) and the bound in Eq. (2), applied to the special case of reduced models over the bases {Γ_n}_{n=1}^∞.

First, note that the Monge-Kantorovich duality is crucial in this proof. It states that the Kantorovich metric is a solution to the Monge-Kantorovich problem when the cost function of the latter is equal to the base metric of the Kantorovich metric.
Specifically, for two measures μ and ν, and a cost function f ∈ [[S × S → R]], the Monge-Kantorovich problem computes:

J(f)(μ, ν) = inf{ E_ξ[f(x, y)] | ξ ∈ P(S × S) s.t. μ and ν are the marginals of ξ corresponding to x and y }.

The set of measures ξ with marginals μ and ν is also known as the set of couplings of μ and ν. For any metric d over S, J(d)(μ, ν) = K(d)(μ, ν) (for a proof, see [Villani, 2003]).

Next, we describe the relation between the metric T^n(0) and Γ_n. Since |Γ_{n+1}| = |S/∼_{Γ_n}| = |∆(∼_{Γ_n})| and Γ_{n+1} ⊆ span(∆(∼_{Γ_n})), it must be the case that span(Γ_{n+1}) = span(∆(∼_{Γ_n})). It is not hard to see that for the special case of partitions, a refinement can be determined based on transitions into equivalence classes: given two equivalence relations ∼_1 and ∼_2, the refinement ∆(∼_1) ⋉ ∆(∼_2) holds if and only if s ∼_2 s′ ⇒ s ∼_1 s′ and s ∼_2 s′ ⇒ [∀a ∈ A, ∀C ∈ S/∼_1, P^a(s)(C) = P^a(s′)(C)]. In particular, ∀s, s′ with s ∼_{Γ_{n+1}} s′, and ∀C ∈ S/∼_{Γ_n}, P^a(s)(C) = P^a(s′)(C). This equality is crucial in defining the following coupling for J(f)(P^a(s), P^a(s′)): let ξ_C ∈ P(S × S) be any coupling of P^a(s)|_C and P^a(s′)|_C, the restrictions of P^a(s) and P^a(s′) to C; the latter is possible as the two restrictions have equal total mass.
Next, define the coupling ξ of P^a(s) and P^a(s′) as ξ = Σ_{C∈S/∼_{Γ_n}} ξ_C. For any cost function f, if s ∼_{Γ_{n+1}} s′, then J(f)(P^a(s), P^a(s′)) ≤ Σ_{C∈S/∼_{Γ_n}} E_{ξ_C}[f].

Using an inductive argument, we will now show that ∀n, s ∼_{Γ_n} s′ ⇒ T^n(0)(s, s′) = 0. The base case is clear from the definition: s ∼_0 s′ ⇒ T(0)(s, s′) = 0. Now, assume the claim holds for n; that is, ∀C ∈ S/∼_{Γ_n}, ∀s, s′ ∈ C, T^n(0)(s, s′) = 0. But ξ_C is zero everywhere except on the set C × C, so E_{ξ_C}[T^n(0)] = 0. Combining the last two results, we get the following upper bound:

s ∼_{Γ_{n+1}} s′ ⇒ J(T^n(0))(P^a(s), P^a(s′)) ≤ Σ_{C∈S/∼_{Γ_n}} E_{ξ_C}[T^n(0)] = 0.

Since T^n(0) is a metric, it also holds that J(T^n(0))(P^a(s), P^a(s′)) ≥ 0. Moreover, as s and s′ are consistent over Γ_n ⊇ ∆(∼_0), this pair of states agrees on the reward function. Therefore, T^{n+1}(0)(s, s′) = max_a ( (1 − γ)|R^a(s) − R^a(s′)| + γJ(T^n(0))(P^a(s), P^a(s′)) ) = 0.

Finally, for any b ∈ ∆(∼_{Γ_n}) and s ∈ S with b(s) = 1, and any other state ŝ with b(ŝ) = 1, it must be the case that s ∼_{Γ_n} ŝ and T^n(0)(s, ŝ) = 0.
Therefore,

E_{ŝ∼U}[d*(s, ŝ) | s ∼_{Γ_n} ŝ] = E_{ŝ∼U}[d*(s, ŝ) − T^n(0)(s, ŝ) | s ∼_{Γ_n} ŝ] ≤ ||d* − T^n(0)||_∞.   (3)

As span(Γ_n) = span(∆(∼_{Γ_{n−1}})), V*_{Γ_n} is the optimal value function for the MDP model over ∆(∼_{Γ_{n−1}}). Based on (2) and (3), we can conclude that

||V*_{Γ_n} − V*||_∞ ≤ γ ||d* − T^n(0)||_∞ / (1 − γ)².   (4)

But we already know from Lemma 3.1 that d* (defined in Eq. (1)) is the fixed point of a contraction operator with constant γ. As J(0)(μ, ν) = 0, the following holds for all n ≥ 1:

||d* − T^n(0)||_∞ ≤ γ^n ||T(0) − 0||_∞ / (1 − γ) ≤ γ^n sup_{s,s′,a} |R^a(s) − R^a(s′)|.   (5)

The final result is obtained by putting together Equations (4) and (5).

The theorem provides a strategy for constructing refining sequences with strong approximation guarantees. Still, it might be inconvenient to generate refinements as large as S/∼_{Γ_n}, as this might be over-complete; although faithful to the assumptions of the theorem, it might generate features that distinguish states that are rarely visited, or pairs of states which are only slightly different. To address this issue, we provide a variation on the concept of refinement that can be used to derive more flexible refining algorithms: refinements that concentrate on local properties.

Definition 3.3.
Given a subset F ⊂ F_S and a subset ζ ⊂ S, two states s, s′ ∈ S are consistent on ζ with respect to F, denoted s ∼_{F,ζ} s′, if ∀f ∈ F, ∀a ∈ A, f(s) = f(s′) and ∀ŝ ∈ ζ, E_{P^a(ŝ)}[f] = E_{P^a(s)}[f] ⟺ E_{P^a(ŝ)}[f] = E_{P^a(s′)}[f].

Definition 3.4. Given two subspaces F, G ⊂ F_S, G refines F locally with respect to ζ, denoted F ⋉_ζ G, if F ⊆ G and ∀s, s′ ∈ S, s ∼_{F,ζ} s′ ⟺ [∀g ∈ G, g(s) = g(s′)].

Definition 3.2 is the special case of Definition 3.4 corresponding to a refinement with respect to the whole state space S, i.e. F ⋉ G ≡ F ⋉_S G. When the subset ζ is not important, we will use the notation V ⋉_◦ W to say that W refines V locally with respect to some subset of S. The result below states that even if one only provides local refinements ⋉_◦, one will eventually generate a pair of subspaces related through the global refinement property ⋉.

Proposition 3.1. Let {Γ_i}_{i=0}^n be a set of bases over S with Γ_{i−1} ⋉_{ζ_i} Γ_i, i = 1, ..., n, for some {ζ_i}_{i=1}^n, and let η = ∪_i ζ_i. Assume that Γ_n is the maximal refinement (i.e. |Γ_n| = |S/∼_{Γ_{n−1},ζ_n}|). Then ∆(∼_{Γ_0,η}) ⊆ span(Γ_n).

Proof. Assume s ∼_{Γ_{n−1},ζ_n} s′. We will check all conditions necessary to conclude that s ∼_{Γ_0,η} s′. First, let f ∈ Γ_0. It is immediate from the definition of local refinements that ∀j ≤ n − 1, Γ_j ⊆ Γ_{n−1}, so that s ∼_{Γ_0,ζ_n} s′. It follows that ∀f ∈ Γ_0, f(s) = f(s′).
Recall that η = ∪_i ζ_i. Next, fix f ∈ Γ_0, a ∈ A, and ŝ ∈ η. If ŝ ∈ ζ_n, then E_{P^a(ŝ)}[f] = E_{P^a(s)}[f] ⟺ E_{P^a(ŝ)}[f] = E_{P^a(s′)}[f], by the assumption above on the pair s, s′. Otherwise, ∃j < n such that ŝ ∈ ζ_j and Γ_{j−1} ⋉_{ζ_j} Γ_j. But we already know that ∀f ∈ Γ_j, f(s) = f(s′), as Γ_j ⊆ Γ_{n−1}. We can use this result in the definition of the local refinement Γ_{j−1} ⋉_{ζ_j} Γ_j to conclude that s ∼_{Γ_{j−1},ζ_j} s′. Moreover, as ŝ ∈ ζ_j and f ∈ Γ_0 ⊆ Γ_{j−1}, E_{P^a(ŝ)}[f] = E_{P^a(s)}[f] ⟺ E_{P^a(ŝ)}[f] = E_{P^a(s′)}[f]. This completes the verification of consistency on η, and it becomes clear that s ∼_{Γ_{n−1},ζ_n} s′ ⇒ s ∼_{Γ_0,η} s′, i.e. ∆(∼_{Γ_0,η}) ⊆ span(∆(∼_{Γ_{n−1},ζ_n})).

Finally, both Γ_n and ∆(∼_{Γ_{n−1},ζ_n}) are bases of the same size, and both refine Γ_{n−1}. It must be that span(Γ_n) = span(∆(∼_{Γ_{n−1},ζ_n})) ⊇ ∆(∼_{Γ_0,η}).

3.2 Examples of basis refinement for feature extraction

The concept of basis refinement applies not only to the feature extraction methods we will present later, but also to methods that have been studied in the past. In particular, methods based on Bellman error basis functions, state aggregation strategies, and spectral analysis using bisimulation metrics are all special cases of basis refinement.
We briefly describe the refinement property for the first two cases, and, in the next section, we elaborate on the connection between refinement and bisimulation metrics to provide a new condition for convergence to self-refining bases.

Krylov bases: Consider the uncontrolled (policy evaluation) case, in which one would like to find a set of features suited to evaluating a single policy of interest. A common approach to automatic feature generation in this context computes Bellman error basis functions (BEBFs), which have been shown to generate a sequence of representations known as Krylov bases. Given a policy π, a Krylov basis Φ_n of size n is built using the model (R^π, P^π) (defined in Section 2 as elements of F_S and [[F_S → F_S]], respectively): Φ_n = span{R^π, P^π R^π, (P^π)² R^π, ..., (P^π)^n R^π}. It is not hard to check that Φ_n ⋉ Φ_{n+1}, where ⋉ is the refinement relation of Def. 3.2. Since the initial feature R^π ∈ span(∆(∼_0)), the result in Theorem 3.1 holds for the Krylov bases.

Under the assumption of a finite-state MDP (i.e. |S| < ∞), Γ_χ := {χ({s}) | s ∈ S} is a basis for F_S, so this space of features is finite dimensional. It follows that one can find N ≤ |S| such that one of the Krylov bases is a self-refinement, i.e. Φ_N ⋉ Φ_N. This is by no means the only self-refining basis; in fact, the property also holds for the basis of characteristic functions, Γ_χ ⋉ Γ_χ. The purpose of our framework is to determine other self-refining bases which are suited for function approximation methods in the context of controlled systems.

State aggregation: One popular strategy used for solving MDPs is that of computing state aggregation maps.
Instead of working with alternative subspaces, these methods first compute equivalence relations on the state space. An aggregate/collapsed model is then derived, and the solution to this model is translated into one for the original problem: the resulting policy provides the same action choice for states that were originally related. Given any equivalence relation ∼ on S, a state aggregation map is a function ρ : S → X, for some set X, such that ∀s, s′, ρ(s′) = ρ(s) ⟺ s ∼ s′. In order to obtain a significant computational gain, one would like to work with aggregation maps ρ that reduce the size of the space over which one looks to provide action choices, i.e. |X| ≪ |S|. As discussed in Section 3.1, one can work with features defined on an aggregate state space instead of the original state space. That is, instead of computing a set of state features Γ ⊂ F_S, we can work with an aggregation map ρ : S → X and a set of features over X, Γ̂ ⊂ F_X. If ∼ is the relation such that s ∼ s′ ⟺ ρ(s) = ρ(s′), then ∀φ ∈ Γ̂, φ ∘ ρ ∈ span(∆(∼)).

4 Using bisimulation metrics for convergence of bases

In Section 3.2 we provided two examples of self-refining subspaces: the Krylov bases and the characteristic functions on single states. The latter is the largest and sparsest basis; it spans all of F_S and its features share no information. The former is potentially smaller, and it spans the value of the fixed policy for which it was designed. In this section we present a third self-refining construction, which is designed to capture bisimulation properties.
Based on the results presented in Section 3.1, it can be shown that given a bisimulation relation ∼, the partition it generates is self-refining, i.e. ∆(∼) ⋉ ∆(∼).

Desirable self-refining bases might be computationally demanding and/or too complex to use or represent. We propose iterative schemes which ultimately provide a self-refining result, while retaining the flexibility to stop the iterative process before reaching the final result. At the same time, we need a criterion to describe convergence of sequences of bases; that is, we want to know how close an iterative process is to obtaining a self-refining basis. Inspired by the fixed point theory used to study bisimulation metrics [Desharnais et al., 1999], instead of using a metric over the set of all bases to characterize convergence of such sequences, we will use corresponding metrics over the original state space. This choice is better suited for generalizing previously existing methods that compare pairs of states for bisimilarity through their associated reward models and expected realizations of features over the next-state distribution models associated with these states.

We will study metric construction strategies based on a map D, defined below, which takes an element of the powerset P(F_S) of F_S and returns an element of the set M(S) of pseudo-metrics over S:

D(Γ) : (s, s′) ↦ max_a [ (1 − γ) |R^a(s) − R^a(s′)| + γ sup_{φ∈Γ} |E_{P^a(s)}[φ] − E_{P^a(s′)}[φ]| ]   (6)

Here Γ is a set of features whose expectations over next-state distributions should be matched. It is not hard to see that bases Γ for which D(Γ) is a bisimulation metric are by definition self-refining. For example, consider the largest bisimulation relation ∼ on a given MDP; it is not hard to see that D(∆(∼)) is a bisimulation metric.
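For a finite MDP, the map D of Equation 6 can be evaluated exactly by enumerating state pairs. The sketch below is illustrative only; the array layout, the helper name `metric_D`, and the toy two-action MDP are assumptions introduced here.

```python
import numpy as np

def metric_D(R, P, Gamma, gamma):
    """Pseudo-metric D(Gamma) of Equation 6 on a finite MDP.

    R : (A, S) rewards; P : (A, S, S) transition matrices;
    Gamma : list of (S,) feature vectors; gamma : discount in (0, 1).
    Returns the (S, S) matrix of pairwise distances.
    """
    A, S = R.shape
    d = np.zeros((S, S))
    for s in range(S):
        for t in range(S):
            per_action = []
            for a in range(A):
                rew = (1 - gamma) * abs(R[a, s] - R[a, t])
                # sup over features of the difference in next-state expectations
                feat = max((abs(P[a, s] @ phi - P[a, t] @ phi) for phi in Gamma),
                           default=0.0)
                per_action.append(rew + gamma * feat)
            d[s, t] = max(per_action)
    return d

# Hypothetical 2-action, 3-state MDP (deterministic transitions)
R = np.array([[0.0, 0.0, 1.0],
              [0.0, 0.5, 1.0]])
P = np.stack([np.eye(3), np.roll(np.eye(3), 1, axis=1)])
d = metric_D(R, P, Gamma=[R[0], R[1]], gamma=0.9)
```

By construction the result is a symmetric matrix with zero diagonal; states with different reward models are kept at positive distance.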
A more elaborate example involves the set Ω(d) of Lipschitz-1 continuous functions in [[(S, d) → (R, L1)]] (recall the definition and computation details from Section 2). Define d* to be the fixed point of the operator T : d ↦ D(Ω(d)), i.e. d* = sup_{n∈N} T^n(0). d* has the same property as the bisimulation metric defined in Equation 1. Moreover, given any bisimulation metric d, D(Ω(d)) is a bisimulation metric.

Definition 4.1. We say a sequence {Γ_n}_{n=1}^∞ is a bisimulation sequence of bases if D(Γ_n) converges uniformly from below to a bisimulation metric. If one has a sequence of refining bases with Γ_n ⋉ Γ_{n+1}, ∀n, then {D(Γ_n)}_{n=1}^∞ is an increasing sequence, but not necessarily a bisimulation sequence.

A bisimulation sequence of bases provides an approximation scheme for bases that satisfy two important properties studied in the past: self-refinement and bisimilarity. One can show that the approximation schemes presented in [Ferns et al., 2004], [Comanici and Precup, 2011], and [Ruan et al., 2015] are all examples of bisimulation sequences. In the next section we present a framework that generalizes all these examples, and which can easily be extended to a broader set of approximation schemes that incorporate both refining and bisimulation principles.

5 Prototype based refinements

In this section we propose a strategy that iteratively builds sequences of refining sets of features, based on the concepts described in the previous sections.
This generates layered sets of features, where the nth layer in the construction depends only on the (n−1)th layer. Additionally, each feature will be associated with a reward-transition prototype: an element of Q := [[A → (R × P(S))]], associating to each action a reward and a next-state probability distribution. Prototypes can be viewed as "abstract" or representative states, such as those used in KBRL methods [Ormoneit and Sen, 2002]. In the layered structure, the similarity between prototypes at the nth layer is based on a measure of consistency with respect to features at the (n−1)th layer. The same measure of similarity is used to determine whether the entire state space is "covered" by the set of prototypes/features chosen for the nth layer. We say that a space is covered if every state of the space is close to at least one prototype generated by the construction, with respect to a predefined measure of similarity. This measure is designed to make sure that consecutive layers represent refining sets of features. Note that for any given MDP, the state space S is embedded into Q (i.e. S ⊂ Q), as (R^a(s), P^a(s)) ∈ Q for every state s ∈ S. Additionally, the metric generator D, as defined in Equation 6, can be generalized to a map from P(F_S) to M(Q).

The algorithmic strategy will look for a sequence {J_n, ι_n}_{n=1}^∞, where J_n ⊂ Q is a set of covering prototypes, and ι_n : J_n → F_S is a function that associates a feature with every prototype in J_n. Starting with J_0 = ∅ and Γ_0 = ∅, the strategy needs to find, at step n > 0, a cover Ĵ_n for S, based on the distance metric D(Γ_{n−1}). That is, it has to guarantee that ∀s ∈ S, ∃κ ∈ Ĵ_n with D(Γ_{n−1})(s, κ) = 0. With J_n = Ĵ_n ∪ J_{n−1} and using a strictly decreasing function τ : R_{≥0} → R (e.g.
the energy-based Gibbs measure τ(x) = exp(−βx) for some β > 0), the framework constructs ι_n : J_n → F_S, a map that associates prototypes with features as ι_n(κ)(s) = τ(D(Γ_{n−1})(κ, s)).

Algorithm 1 Prototype refinement
1: J_0 = ∅ and Γ_0 = ∅
2: for n = 1 to ∞ do
3:   choose a representative subset ζ_n ⊂ S and a cover approximation error ε_n ≥ 0
4:   find an ε_n-cover Ĵ_n for ζ_n
5:   define J_n = Ĵ_n ∪ J_{n−1}
6:   choose a strictly decreasing function τ : R_{≥0} → R
7:   define ι_n(κ) = s ↦ τ(D(Γ_{n−1})(κ, s)) if ∃ŝ ∈ ζ_n such that D(Γ_{n−1})(κ, ŝ) ≤ ε_n, and ι_n(κ) = ι_{n−1}(κ) otherwise
8:   define Γ_n = {ι_n(κ) | κ ∈ J_n} (note that Γ_n is a local refinement, Γ_{n−1} ⋉_{ζ_n} Γ_n)

It is not hard to see that the refinement property holds at every step, i.e. Γ_n ⋉ Γ_{n+1}. First, every equivalence class of ∼_{Γ_n} is represented by some prototype in J_n. Second, ι_n is purposely defined to make sure that a distinction is made between each prototype in J_{n+1}. Moreover, {Γ_n}_{n=1}^∞ is a bisimulation sequence of bases, as the metric generator D is the main tool used in "covering" the state space with the set of prototypes J_n. Two states will be represented by the same prototype (i.e. they will be equivalent with respect to ∼_{Γ_n}) if and only if the distance between their corresponding reward-transition models is 0.

Algorithm 1 provides pseudo-code for the framework described in this section.
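A minimal executable sketch of this loop is given below. It deliberately simplifies the general framework: states themselves serve as prototypes (using the embedding S ⊂ Q), ζ_n = S at every step, τ(x) = exp(−βx) is fixed, and lines 7–8 are collapsed by refreshing all feature activations each iteration. Every name here is a hypothetical illustration, not the authors' implementation.

```python
import numpy as np

def metric_D(R, P, Gamma, gamma):
    """Pairwise pseudo-metric of Equation 6 on a finite MDP (features in Gamma)."""
    A, S = R.shape
    d = np.zeros((S, S))
    for s in range(S):
        for t in range(S):
            vals = []
            for a in range(A):
                feat = max((abs(P[a, s] @ f - P[a, t] @ f) for f in Gamma),
                           default=0.0)
                vals.append((1 - gamma) * abs(R[a, s] - R[a, t]) + gamma * feat)
            d[s, t] = max(vals)
    return d

def prototype_refinement(R, P, gamma, n_iters=3, eps=1e-8, beta=1.0):
    """Sketch of Algorithm 1 with states as prototypes and a greedy eps-cover."""
    S = R.shape[1]
    J, Gamma = [], []                        # J_0 = empty set, Gamma_0 = empty set
    for _ in range(n_iters):
        d = metric_D(R, P, Gamma, gamma)     # D(Gamma_{n-1}) over state pairs
        # greedy eps-cover of the state space (here zeta_n = S)
        for s in range(S):
            if not any(d[s, k] <= eps for k in J):
                J.append(s)
        # feature activations tau(x) = exp(-beta x) per prototype
        Gamma = [np.exp(-beta * d[k]) for k in J]
    return J, Gamma

# Hypothetical 2-action, 3-state MDP with pairwise-distinct reward models
R = np.array([[0.0, 0.0, 1.0],
              [0.0, 0.5, 1.0]])
P = np.stack([np.eye(3), np.roll(np.eye(3), 1, axis=1)])
J, Gamma = prototype_refinement(R, P, gamma=0.9)
```

The greedy pass implements the cover of lines 3–4 in the simplest possible way; any cover-finding routine with the same guarantee could be substituted.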
Note that it also contains two additional modifications, used to illustrate the flexibility of this feature extraction process. Through the first modification, one could use the intermediate results at time step n to determine a subset ζ_n ⊂ S of states which are likely to have a model with significantly distinct dynamics over Γ_{n−1}. As such, the prototypes Ĵ_n can be specialized to cover only the significant subset ζ_n. Moreover, Theorem 3.1 guarantees that if every state in S is picked in ζ_n infinitely often as n → ∞, then the approximation power of the final result is not compromised. The second modification is based on using the values in the metric D(Γ_{n−1}) for more than just choosing feature activations: one could set at every step constants ε_n ≥ 0 and then find J_n such that ζ_n is covered using ε_n-balls, i.e. for every state s ∈ ζ_n, there exists a prototype κ ∈ J_n with D(Γ_{n−1})(κ, s) ≤ ε_n. One can easily show that the refinement property is maintained under the modified definition of ι_n described in Algorithm 1.

6 Discussion

We proposed a general framework for basis refinement for linear function approximation. The theoretical results show that any algorithmic scheme of this type satisfies strong bounds on the quality of the value function that can be obtained. In other words, this approach provides a "blueprint" for designing algorithms with good approximation guarantees. As discussed, some existing value function construction schemes fall into this category (such as state aggregation refinement, for example). Other methods, like BEBFs, can be interpreted in this way in the case of policy evaluation; however, the "traditional" BEBF approach in the case of control does not exactly fit this framework.
However, we suspect that it could be adapted to exactly follow this blueprint (something we leave for future work).

We provided ideas for a new algorithmic approach to this problem, which would provide strong guarantees while being significantly cheaper than other existing methods with similar bounds (which rely on bisimulation metrics). We plan to experiment with this approach in the future; the focus of this paper was to establish the theoretical underpinnings of the algorithm. The algorithm structure we propose is close in spirit to [Barreto et al., 2011], which selects prototype states in order to represent well the dynamics of the system by means of stochastic factorization. However, their approach assumes a given metric which measures state similarity, and selects representative states using k-means clustering based on this metric; instead, we iterate between computing the metric and choosing prototypes. We believe that the theory presented in this paper opens up the possibility of further development of algorithms for constructive function approximation that have quality guarantees in the control case, and which can also be effective in practice.

References

R. S. Sutton and A. G. Barto. Reinforcement Learning: An Introduction. MIT Press, 1998.
Cs. Szepesvari. Algorithms for Reinforcement Learning. Morgan & Claypool, 2010.
D. P. Bertsekas and D. A. Castanon. Adaptive Aggregation Methods for Infinite Horizon Dynamic Programming. IEEE Transactions on Automatic Control, 34, 1989.
R. Munos and A. Moore. Variable Resolution Discretization in Optimal Control. Machine Learning, 49(2-3):291–323, 2002.
S. Mahadevan. Proto-Value Functions: Developmental Reinforcement Learning. In ICML, pages 553–560, 2005.
P. W. Keller, S. Mannor, and D. Precup. Automatic Basis Function Construction for Approximate Dynamic Programming and Reinforcement Learning. In ICML, pages 449–456, 2006.
R. Parr, C.
Painter-Wakefield, L. Li, and M. L. Littman. Analyzing Feature Generation for Value Function Approximation. In ICML, pages 737–744, 2008a.
G. D. Konidaris, S. Osentoski, and P. S. Thomas. Value Function Approximation in Reinforcement Learning using the Fourier Basis. In AAAI, pages 380–385, 2011.
A. Geramifard, F. Doshi, J. Redding, N. Roy, and J. How. Online Discovery of Feature Dependencies. In ICML, pages 881–888, 2011.
B. Ravindran and A. G. Barto. Model Minimization in Hierarchical Reinforcement Learning. In Symposium on Abstraction, Reformulation and Approximation (SARA), pages 196–211, 2002.
N. Ferns, P. Panangaden, and D. Precup. Metrics for finite Markov Decision Processes. In UAI, pages 162–169, 2004.
S. Ruan, G. Comanici, P. Panangaden, and D. Precup. Representation Discovery for MDPs using Bisimulation Metrics. In AAAI, pages 3578–3584, 2015.
R. Givan, T. Dean, and M. Greig. Equivalence Notions and Model Minimization in Markov Decision Processes. Artificial Intelligence, 147(1-2):163–223, 2003.
D. Ormoneit and S. Sen. Kernel-Based Reinforcement Learning. Machine Learning, 49(2-3):161–178, 2002.
N. Jong and P. Stone. Kernel-Based Models for Reinforcement Learning. In ICML Workshop on Kernel Machines and Reinforcement Learning, 2006.
A. S. Barreto, D. Precup, and J. Pineau. Reinforcement Learning using Kernel-Based Stochastic Factorization. In NIPS, pages 720–728, 2011.
R. S. Sutton. Learning to Predict by the Methods of Temporal Differences. Machine Learning, 3(1):9–44, 1988.
S. J. Bradtke and A. G. Barto. Linear Least-Squares Algorithms for Temporal Difference Learning. Machine Learning, 22(1-3):33–57, 1996.
H. Yu and D. Bertsekas. Convergence Results for Some Temporal Difference Methods Based on Least Squares. Technical report, LIDS MIT, 2006.
R. Parr, L. Li, G. Taylor, C. Painter-Wakefield, and M. L.
Littman. An Analysis of Linear Models, Linear Value-Function Approximation, and Feature Selection for Reinforcement Learning. In ICML, pages 752–759, 2008b.
K. G. Larsen and A. Skou. Bisimulation through Probabilistic Testing. Information and Computation, 94:1–28, 1991.
J. Desharnais, V. Gupta, R. Jagadeesan, and P. Panangaden. Metrics for Labeled Markov Systems. In CONCUR, 1999.
J. Desharnais, V. Gupta, R. Jagadeesan, and P. Panangaden. A metric for labelled Markov processes. Theoretical Computer Science, 318(3):323–354, 2004.
C. Villani. Topics in optimal transportation. American Mathematical Society, 2003.
G. Comanici and D. Precup. Basis Function Discovery Using Spectral Clustering and Bisimulation Metrics. In AAAI, 2011.
D. Blackwell. Discounted Dynamic Programming. Annals of Mathematical Statistics, 36:226–235, 1965.