{"title": "Value Function Approximation with Diffusion Wavelets and Laplacian Eigenfunctions", "book": "Advances in Neural Information Processing Systems", "page_first": 843, "page_last": 850, "abstract": null, "full_text": "Value Function Approximation with Diffusion\n\nWavelets and Laplacian Eigenfunctions\n\nSridhar Mahadevan\n\nDepartment of Computer Science\n\nUniversity of Massachusetts\n\nAmherst, MA 01003\n\nmahadeva@cs.umass.edu\n\nMauro Maggioni\n\nProgram in Applied Mathematics\n\nDepartment of Mathematics\n\nYale University\n\nNew Haven, CT 06511\n\nmauro.maggioni@yale.edu\n\nAbstract\n\nWe investigate the problem of automatically constructing ef\ufb01cient rep-\nresentations or basis functions for approximating value functions based\non analyzing the structure and topology of the state space. In particu-\nlar, two novel approaches to value function approximation are explored\nbased on automatically constructing basis functions on state spaces that\ncan be represented as graphs or manifolds: one approach uses the eigen-\nfunctions of the Laplacian, in effect performing a global Fourier analysis\non the graph; the second approach is based on diffusion wavelets, which\ngeneralize classical wavelets to graphs using multiscale dilations induced\nby powers of a diffusion operator or random walk on the graph. Together,\nthese approaches form the foundation of a new generation of methods for\nsolving large Markov decision processes, in which the underlying repre-\nsentation and policies are simultaneously learned.\n\n1 Introduction\n\nValue function approximation (VFA) is a well-studied problem: a variety of linear and\nnonlinear architectures have been studied, which are not automatically derived from the\ngeometry of the underlying state space, but rather handcoded in an ad hoc trial-and-error\nprocess by a human designer [1]. A new framework for VFA called proto-reinforcement\nlearning (PRL) was recently proposed in [7, 8, 9]. 
Instead of learning task-specific value functions using a handcoded parametric architecture, agents learn proto-value functions: global basis functions that reflect intrinsic large-scale geometric constraints that all value functions on a manifold [11] or graph [3] adhere to, obtained through spectral analysis of the self-adjoint Laplace operator. This approach also yields new control learning algorithms, called representation policy iteration (RPI), in which both the underlying representations (basis functions) and policies are simultaneously learned. Laplacian eigenfunctions also provide ways of automatically decomposing state spaces, since they reflect bottlenecks and other global geometric invariants.

In this paper, we extend the earlier Laplacian approach in a new direction using the recently proposed diffusion wavelet transform (DWT), which is a compact multi-level representation of Markov diffusion processes on manifolds and graphs [4, 2]. Diffusion wavelets provide an interesting alternative to global Fourier eigenfunctions for value function approximation, since they encapsulate all the traditional advantages of wavelets: basis functions have compact support, and the representation is inherently hierarchical, since it is based on multi-resolution modeling of processes at different spatial and temporal scales.

2 Technical Background

This paper uses the framework of spectral graph theory [3] to build basis representations for smooth (value) functions on graphs induced by Markov decision processes. Given any graph G, an obvious but poor choice of representation is the "table-lookup" orthonormal encoding, where φ(i) = [0 . . . 1 . . . 0] (a 1 in the ith position) is the encoding of the ith node in the graph. This representation does not reflect the topology of the specific graph under consideration. Polynomials are another popular choice of orthonormal basis functions [5], where φ(s) = [1 s . . .
s^k] for some fixed k. This encoding has two disadvantages: it is numerically unstable for large graphs, and it is dependent on the ordering of vertices. In this paper, we outline a new approach to the problem of building basis functions on graphs using Laplacian eigenfunctions and diffusion wavelets.

A finite Markov decision process (MDP) M = (S, A, P^a_{ss'}, R^a_{ss'}) is defined as a finite set of states S, a finite set of actions A, a transition model P^a_{ss'} specifying the distribution over future states s' when an action a is performed in state s, and a corresponding reward model R^a_{ss'} specifying a scalar cost or reward [10]. A state value function is a mapping S → R, or equivalently a vector in R^{|S|}. Given a policy π : S → A mapping states to actions, its corresponding value function V^π specifies the expected long-term discounted sum of rewards received by the agent in any given state s when actions are chosen using the policy. Any optimal policy π* defines the same unique optimal value function V*, which satisfies the nonlinear constraints

V*(s) = max_a Σ_{s'} P^a_{ss'} (R^a_{ss'} + γ V*(s'))

For any MDP, any policy induces a Markov chain that partitions the states into classes: transient states are visited initially but not after a finite time, and recurrent states are visited infinitely often. In ergodic MDPs, the set of transient states is empty. The construction of basis functions below assumes that the Markov chain induced by a policy is a reversible random walk on the state space. While some policies may not induce such Markov chains, the set of basis functions learned from a reversible random walk can still be useful in approximating value functions for (reversible or non-reversible) policies.
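To make these definitions concrete, the Bellman optimality equation above can be solved by simple value iteration. The sketch below uses a hypothetical 5-state chain with deterministic left/right actions and a reward for entering the last state; all sizes, rewards, and the discount factor are illustrative, not taken from the paper.

```python
import numpy as np

# Value-iteration sketch for the Bellman optimality equation above,
# on a made-up 5-state chain MDP with "left"/"right" actions.
n, gamma = 5, 0.9
P = np.zeros((2, n, n))               # P[a, s, s']: deterministic moves, clipped at the ends
for s in range(n):
    P[0, s, max(s - 1, 0)] = 1.0      # action 0: move left
    P[1, s, min(s + 1, n - 1)] = 1.0  # action 1: move right
R = np.zeros((2, n, n))
R[:, :, n - 1] = 1.0                  # reward 1 for entering the last state

V = np.zeros(n)
for _ in range(500):
    # V*(s) = max_a sum_s' P[a,s,s'] (R[a,s,s'] + gamma V*(s'))
    Q = (P * (R + gamma * V)).sum(axis=2)   # backup for every (a, s)
    V_new = Q.max(axis=0)
    if np.abs(V_new - V).max() < 1e-10:
        break
    V = V_new
```

Since the backup is a contraction with factor γ, the loop converges geometrically; the paper's concern is then how to represent the resulting value function compactly.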
The construction of the basis functions can thus be considered an off-policy method: just as in Q-learning, where the exploration policy differs from the optimal learned policy, in the proposed approach the actual MDP dynamics may induce a different Markov chain than the one analyzed to build representations. Reversible random walks greatly simplify spectral analysis, since such random walks are similar to a symmetric operator on the state space.

2.1 Smooth Functions on Graphs and Value Function Representation

We assume the state space can be modeled as a finite undirected weighted graph (G, E, W), but the approach generalizes to Riemannian manifolds. We define x ∼ y to mean an edge between x and y, and the degree of x to be d(x) = Σ_{y∼x} w(x, y). D will denote the diagonal matrix defined by D_{xx} = d(x), and W the matrix defined by W_{xy} = w(x, y) = w(y, x). The L² norm of a function on G is ||f||²_2 = Σ_{x∈G} |f(x)|² d(x). The gradient of a function is ∇f(i, j) = w(i, j)(f(i) − f(j)) if there is an edge connecting i to j, and 0 otherwise. The smoothness of a function on a graph can be measured by the Sobolev norm

||f||²_{H²} = ||f||²_2 + ||∇f||²_2 = Σ_x |f(x)|² d(x) + Σ_{x∼y} |f(x) − f(y)|² w(x, y) .   (1)

The first term in this norm controls the size (in terms of L²-norm) of the function f, and the second term controls the size of the gradient. The smaller ||f||_{H²}, the smoother is f. We will assume that the value functions we consider have small H² norms, except at a few points, where the gradient may be large. Important variations exist, corresponding to different measures on the vertices and edges of G.

Classical techniques, such as value iteration and policy iteration [10], represent value functions using an orthonormal basis (e_1, . . . , e_{|S|}) for the space R^{|S|} [1].
For a fixed precision ε, a value function V^π can be approximated as

||V^π − Σ_{i∈S(ε)} α^π_i e_i|| ≤ ε

with α^π_i = ⟨V^π, e_i⟩, since the e_i's are orthonormal; the approximation error is measured in some norm, such as L² or H². The goal is to obtain representations in which the index set S(ε) in the summation is as small as possible, for a given approximation error ε. This hope is well founded at least when V^π is smooth or piecewise smooth, since in this case it should be compressible in some well-chosen basis {e_i}.

3 Function Approximation using Laplacian Eigenfunctions

The combinatorial Laplacian L [3] is defined as

Lf(x) = Σ_{y∼x} w(x, y)(f(x) − f(y)) = (D − W)f .

Often one considers the normalized Laplacian 𝓛 = D^{-1/2}(D − W)D^{-1/2}, which has spectrum in [0, 2]. The Laplacian is related to the notion of smoothness above, since ⟨f, Lf⟩ = Σ_x f(x) Lf(x) = Σ_{x,y} w(x, y)(f(x) − f(y))² = ||∇f||²_2, which should be compared with (1). Functions that satisfy the equation Lf = 0 are called harmonic. The Spectral Theorem can be applied to L (or 𝓛), yielding a discrete set of eigenvalues 0 ≤ λ_0 ≤ λ_1 ≤ . . . ≤ λ_i ≤ . . . and a corresponding orthonormal basis of eigenfunctions {ξ_i}_{i≥0}, solutions to the eigenvalue problem Lξ_i = λ_i ξ_i.

The eigenfunctions of the Laplacian can be viewed as an orthonormal basis of global Fourier smooth functions that can be used for approximating any value function on a graph. These basis functions capture large-scale features of the state space, and are particularly sensitive to "bottlenecks", a phenomenon widely studied in Riemannian geometry and spectral graph theory [3]. Observe that ξ_i satisfies ||∇ξ_i||²_2 = λ_i.
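A minimal numerical sketch of these ideas, assuming a simple 20-node path graph rather than one of the paper's environments: it builds L = D − W, checks that f^T L f equals the edge-difference sum ||∇f||², and projects a smooth function onto the k lowest-order eigenfunctions.

```python
import numpy as np

# Sketch: combinatorial Laplacian L = D - W of a 20-node path graph
# (unit edge weights), and low-order eigenfunctions as smooth basis functions.
n = 20
W = np.zeros((n, n))
for i in range(n - 1):
    W[i, i + 1] = W[i + 1, i] = 1.0       # chain topology, w(x, y) = 1
D = np.diag(W.sum(axis=1))
L = D - W

lam, xi = np.linalg.eigh(L)               # eigenvalues ascending, columns orthonormal
# For any f, f^T L f equals the sum of squared edge differences ||grad f||^2.
f = np.cos(np.linspace(0, np.pi, n))      # a smooth test function on the chain
grad_sq = sum((f[i] - f[i + 1]) ** 2 for i in range(n - 1))
assert np.isclose(f @ L @ f, grad_sq)

# Projecting onto the k lowest-order eigenfunctions gives the smoothest
# k-term approximation of f in the H^2 sense discussed above.
k = 5
f_hat = xi[:, :k] @ (xi[:, :k].T @ f)
```

Note that λ_0 = 0 with a constant (harmonic) eigenfunction, matching the discussion of Lf = 0 above.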
The variational characterization of eigenvectors shows that ξ_i is the normalized function orthogonal to ξ_0, . . . , ξ_{i−1} with minimal ||∇ξ_i||_2. Hence the projection of a function f on S onto the top k eigenvectors of the Laplacian is the smoothest approximation to f, in the sense of the H² norm. A potential drawback of Laplacian approximation is that it detects only global smoothness, and may poorly approximate a function which is not globally smooth but only piecewise smooth, or which has different smoothness in different regions. These drawbacks are addressed in the context of analysis with diffusion wavelets, and in fact partly motivated their construction.

4 Function Approximation using Diffusion Wavelets

Diffusion wavelets were introduced in [4, 2] in order to perform a fast multiscale analysis of functions on a manifold or graph, generalizing wavelet analysis and associated signal processing techniques (such as compression or denoising) to functions on manifolds and graphs. They allow the fast and accurate computation of high powers of a Markov chain P on the manifold or graph, including direct computation of the Green's function (or fundamental matrix) of the Markov chain, (I − P)^{−1}, which can be used to solve Bellman's equation. Here, "fast" means that the number of operations required is O(|S|), up to logarithmic factors.

DiffusionWaveletTree (H_0, Φ_0, J, ε):
// H_0: symmetric conjugate to random walk matrix, represented on the basis Φ_0
// Φ_0: initial basis (usually Dirac's δ-function basis), one function per column
// J: number of levels to compute
// ε: precision
for j from 0 to J do
  1. Compute the sparse factorization H_j ∼_ε Q_j R_j, with Q_j orthogonal.
  2. Φ_{j+1} ← Q_j = H_j R_j^{−1}, and H_{j+1} ← R_j R_j^* (so that [H_0^{2^{j+1}}]_{Φ_{j+1}}^{Φ_{j+1}} ∼_{jε} H_{j+1}).
  3. Compute the sparse factorization I − Φ_{j+1} Φ_{j+1}^* = Q′_j R′_j, with Q′_j orthogonal.
  4. Ψ_{j+1} ← Q′_j.
end

Figure 1: Pseudo-code for constructing a Diffusion Wavelet Tree

Space constraints permit only a brief description of the construction of diffusion wavelet trees; more details are provided in [4, 2]. The input to the algorithm is a "precision" parameter ε > 0 and a weighted graph (G, E, W). We can assume that G is connected; otherwise we can consider each connected component separately. The construction is based on using the natural random walk P = D^{−1}W on a graph and its powers to "dilate", or "diffuse", functions on the graph, and then defining an associated coarse-graining of the graph. We symmetrize P by conjugation and take powers to obtain

H^t = D^{1/2} P^t D^{−1/2} = (D^{−1/2} W D^{−1/2})^t = (I − 𝓛)^t = Σ_{i≥0} (1 − λ_i)^t ξ_i(·) ξ_i(·)   (2)

where {λ_i} and {ξ_i} are the eigenvalues and eigenfunctions of the Laplacian as above. Hence the eigenfunctions of H^t are again ξ_i, and the ith eigenvalue is (1 − λ_i)^t. We assume that H¹ is a sparse matrix, and that the spectrum of H¹ has rapid decay.

A diffusion wavelet tree consists of orthogonal diffusion scaling functions Φ_j that are smooth bump functions, with some oscillations, at scale roughly 2^j (measured with respect to geodesic distance, for small j), and orthogonal wavelets Ψ_j that are smooth localized oscillatory functions at the same scale.
The scaling functions Φ_j span a subspace V_j, with the property that V_{j+1} ⊆ V_j, and the span of Ψ_{j+1}, denoted W_{j+1}, is the orthogonal complement of V_{j+1} in V_j. This is achieved by using the dyadic powers H^{2^j} as "dilations" to create smoother and wider (always in a geodesic sense) "bump" functions (which represent densities for the symmetrized random walk after 2^j steps), and orthogonalizing and downsampling appropriately to transform sets of "bumps" into orthonormal scaling functions.

Computationally (Figure 1), we start with the basis Φ_0 = I and the matrix H_0 := H¹, sparse by assumption, and construct an orthonormal basis of well-localized functions for its range (the space spanned by its columns), up to precision ε, through a variation of the Gram-Schmidt orthonormalization scheme described in [4]. In matrix form, this is a sparse factorization H_0 ∼_ε Q_0 R_0, with Q_0 orthonormal. Notice that H_0 is |G| × |G|, but in general Q_0 is |G| × |G(1)| and R_0 is |G(1)| × |G|, with |G(1)| ≤ |G|. In fact, |G(1)| is approximately equal to the number of singular values of H_0 larger than ε. The columns of Q_0 are an orthonormal basis of scaling functions Φ_1 for the range of H_0, written as a linear combination of the initial basis Φ_0. We can now write H_0² on the basis Φ_1: H_1 := [H_0²]_{Φ_1}^{Φ_1} = Q_0^* H_0 H_0 Q_0 = R_0 R_0^*, where we used H_0 = H_0^*. This is a compressed representation of H_0² acting on the range of H_0, and it is a |G(1)| × |G(1)| matrix. We proceed by induction: at scale j we have an orthonormal basis Φ_j for the range of H_0^{2^j − 1}, up to precision jε, represented as a linear combination of elements in Φ_{j−1}. This basis contains |G(j)| functions, where |G(j)| is comparable with the number of eigenvalues λ_i of H_0 such that λ_i^{2^j − 1} ≥ ε. We have the operator H_0^{2^j} represented on Φ_j by a |G(j)| × |G(j)| matrix H_j, up to precision jε. We compute a sparse decomposition H_j ∼_ε Q_j R_j, obtain the next basis Φ_{j+1} = Q_j = H_j R_j^{−1}, and represent H_0^{2^{j+1}} on this basis by the matrix H_{j+1} := [H_0^{2^{j+1}}]_{Φ_{j+1}}^{Φ_{j+1}} = Q_j^* H_j H_j Q_j = R_j R_j^*.

Wavelet bases for the spaces W_j can be built analogously by factorizing I_{V_j} − Q_{j+1} Q_{j+1}^*, which is the orthogonal projection on the complement of V_{j+1} in V_j. The spaces can be further split to obtain wavelet packets [2]. A Fast Diffusion Wavelet Transform allows expanding any function in the wavelet, or wavelet packet, basis in O(n) computations (where n is the number of vertices), and efficiently searching for the most suitable basis set. Diffusion wavelets and wavelet packets are a very efficient tool for representation and approximation of functions on manifolds and graphs [4, 2], generalizing to these general spaces the nice properties of wavelets that have been so successfully applied to similar tasks in Euclidean spaces.

Diffusion wavelets allow computing H^{2^k} f for any fixed f in order O(kn) operations. This is nontrivial because, while the matrix H is sparse, large powers of it are not, and the computation H · (H · · · (H(Hf)) · · ·) involves 2^k matrix-vector products. As a notable consequence, this yields a fast algorithm for computing the Green's function, or fundamental matrix, associated with the Markov process H, via (I − H¹)^{−1} f = Σ_{k≥0} H^k f = Π_{k≥0} (I + H^{2^k}) f. In a similar way one can compute (I − P)^{−1}. For large classes of Markov chains we can perform this computation in time O(n), in a direct (as opposed to iterative) fashion.
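The product identity behind this fast Green's function computation can be checked by brute force. The sketch below uses a dense random substochastic matrix in place of the symmetrized random walk and truncates the infinite product; the actual diffusion wavelet scheme keeps each dyadic power H^{2^k} in compressed form, which this dense illustration does not attempt.

```python
import numpy as np

# Brute-force check of (I - H)^(-1) f = sum_{k>=0} H^k f = prod_{k>=0} (I + H^(2^k)) f,
# using a random substochastic H (spectral radius < 1, so the series converges).
rng = np.random.default_rng(0)
n = 30
H = rng.random((n, n))
H *= 0.9 / H.sum(axis=1, keepdims=True)   # rows sum to 0.9 < 1
f = rng.random(n)

g = f.copy()
Hp = H.copy()                             # holds H^(2^k), squared at each step
for _ in range(7):                        # truncated product: 128 Neumann-series terms
    g = g + Hp @ g                        # apply (I + H^(2^k))
    Hp = Hp @ Hp

exact = np.linalg.solve(np.eye(n) - H, f)
```

Each factor costs one matrix-vector product (plus, here, a dense squaring), so k applications cover 2^k terms of the Neumann series; this doubling is exactly what the dyadic powers in the diffusion wavelet tree exploit.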
This is remarkable, since in general the matrix (I − H¹)^{−1} is full, and merely writing down its entries would take time O(n²). It is the multiscale compression scheme that allows (I − H¹)^{−1} to be represented efficiently in compressed form, taking advantage of the smoothness of the entries of the matrix. This is discussed in general in [4]. We use this approach to develop a faster policy evaluation step for solving MDPs, described in [6].

5 Experiments

Figure 2 contrasts Laplacian eigenfunctions and diffusion wavelet basis functions in a three-room grid world environment. Laplacian eigenfunctions were produced by solving Lf = λf, where L is the combinatorial Laplacian, whereas diffusion wavelet basis functions were produced using the algorithm described in Figure 1. The input to both methods is an undirected graph, where edges connect states reachable through a single (reversible) action. Such graphs can easily be learned from a sample of transitions, such as that generated by RL agents while exploring the environment in early phases of policy learning. Note how the intrinsic multi-room structure of the environment is reflected in the Laplacian eigenfunctions. The Laplacian eigenfunctions are globally defined over the entire state space, whereas diffusion wavelet basis functions are progressively coarser (and fewer in number) at higher levels, beginning at the lowest level with the table-lookup representation and converging at the highest level to basis functions similar to Laplacian eigenfunctions. Figure 3 compares the approximations produced in a two-room grid world MDP with 630 states.
These experiments illustrate the superiority of diffusion wavelets: in the first experiment (top row), diffusion wavelets handily outperform Laplacian eigenfunctions because the function is highly nonlinear near the goal, but mostly linear elsewhere. The eigenfunctions contain many ripples in the flat region, causing a large residual error. In the second experiment (bottom row), Laplacian eigenfunctions fare significantly better because the value function is globally smooth; even here, however, diffusion wavelets retain a clear advantage.

Figure 2: Examples of Laplacian eigenfunctions (left) and diffusion wavelet basis functions (right) computed using the graph Laplacian on a complete undirected graph of a deterministic grid world environment with reversible actions.

Figure 3: Left column: value functions in a two-room grid world MDP, where each room has 21 × 15 states connected by a door in the middle of the common wall. Middle two columns: approximations produced by 5 diffusion wavelet bases and Laplacian eigenfunctions. Right column: least-squares approximation error (log scale) using up to 200 basis functions (bottom curve: diffusion wavelets; top curve: Laplacian eigenfunctions).
In the top row, the value function corresponds to a random walk. In the bottom row, the value function corresponds to the optimal policy.

5.1 Control Learning using Representation Policy Iteration

This section describes results of using the automatically generated basis functions inside a control learning algorithm, in particular the Representation Policy Iteration (RPI) algorithm [8]. RPI is an approximate policy iteration algorithm in which the basis functions φ(s, a), handcoded in other methods such as LSPI [5], are instead learned from a random walk of transitions, by computing the graph Laplacian and then computing the eigenfunctions or the diffusion wavelet bases as described above. One striking property of the eigenfunction and diffusion wavelet basis functions is their ability to reflect nonlinearities arising from "bottlenecks" in the state space. Figure 4 contrasts the value function approximation produced by RPI using Laplacian eigenfunctions with that produced by a polynomial approximator. The polynomial approximator yields a value function that is "blind" to the nonlinearities produced by the walls in the two-room grid world MDP.

Figure 4: This figure compares the value functions produced by RPI using Laplacian eigenfunctions with that produced by LSPI using a polynomial approximator in a two-room grid world MDP with a "bottleneck" region representing the door connecting the two rooms. The Laplacian basis functions on the left clearly capture the nonlinearity arising from the bottleneck, whereas the polynomial approximator on the right smooths the value function across the walls, as it is "blind" to the large-scale geometry of the environment.

Table 1 compares the performance of diffusion wavelets and Laplacian eigenfunctions using RPI on the classic chain MDP from [5].
Here, an initial random walk of 5000 steps was carried out to generate the basis functions in a 50-state chain. The chain MDP is a sequential open (or closed) chain with a varying number of states, where there are two actions for moving left or right along the chain. In the experiments shown, a reward of 1 was provided in states 10 and 41. Given a fixed k, the encoding φ(s) of a state s for Laplacian eigenfunctions is the vector comprised of the values of the k lowest-order eigenfunctions at state s. For diffusion wavelets, all the basis functions at level k were evaluated at state s to produce the encoding.

Method          #Trials  Error      Method          #Trials  Error
RPI DF (5)      4.4      2.4        LSPI RBF (6)    3.8      20.8
RPI DF (14)     6.8      4.8        LSPI RBF (14)   4.4      2.8
RPI DF (19)     8.2      0.6        LSPI RBF (26)   6.4      2.8
RPI Lap (5)     4.2      3.8        LSPI Poly (5)   4.2      4
RPI Lap (15)    7.2      3          LSPI Poly (15)  1        34.4
RPI Lap (25)    9.4      2          LSPI Poly (25)  1        36

Table 1: This table compares the performance of RPI using diffusion wavelets and Laplacian eigenfunctions with LSPI using handcoded polynomial and radial basis functions on a 50-state chain graph MDP.

Each row reflects the performance of either RPI using learned basis functions or LSPI with a handcoded basis function (values in parentheses indicate the number of basis functions used for each architecture). The two numbers reported are steps to convergence (#Trials) and the error in the learned policy (number of incorrect actions), averaged over 5 runs. Laplacian and diffusion wavelet basis functions provide more stable performance at both the low end and the high end, as compared to the handcoded basis functions. As the number of basis functions is increased, RPI with Laplacian basis functions takes longer to converge, but learns a more accurate policy.
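The Laplacian-eigenfunction encoding just described can be sketched as follows; the discount factor, the random-walk target value function, and the least-squares fit are illustrative assumptions, not the paper's exact experimental protocol.

```python
import numpy as np

# Sketch: 50-state chain graph, Laplacian eigenfunction features phi(s),
# and a least-squares fit of the random-walk value function with rewards
# in states 10 and 41 (gamma is an assumed discount factor).
n, gamma = 50, 0.95
W = np.zeros((n, n))
for i in range(n - 1):
    W[i, i + 1] = W[i + 1, i] = 1.0
L = np.diag(W.sum(axis=1)) - W           # combinatorial Laplacian
lam, xi = np.linalg.eigh(L)              # columns: eigenfunctions, smoothest first

P = W / W.sum(axis=1, keepdims=True)     # random walk on the chain
r = np.zeros(n)
r[9] = r[40] = 1.0                       # reward of 1 in states 10 and 41 (1-indexed)
V = np.linalg.solve(np.eye(n) - gamma * P, r)

def lstsq_error(k):
    # phi(s) = values of the k lowest-order eigenfunctions at s; fit V by least squares
    Phi = xi[:, :k]
    w, *_ = np.linalg.lstsq(Phi, V, rcond=None)
    return np.linalg.norm(V - Phi @ w)

err5, err25 = lstsq_error(5), lstsq_error(25)
```

Because the feature spaces are nested, the least-squares error can only decrease as k grows, mirroring the trade-off in Table 1 between more basis functions and slower convergence.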
RPI with diffusion wavelets also converges more slowly as the number of basis functions is increased, giving the best results overall with 19 basis functions. Unlike Laplacian eigenfunctions, the policy error is not monotonically decreasing as the number of basis functions is increased; this result is being investigated. LSPI with RBFs is unstable at the low end, converging to a very poor policy for 6 basis functions. LSPI with a degree-5 polynomial approximator works reasonably well, but its performance noticeably degrades at higher degrees, converging to a very poor policy in one step for k = 15 and k = 25.

6 Future Work

We are exploring many extensions of this framework, including extensions to factored MDPs, approximating action value functions, and handling large state spaces by exploiting symmetries defined by a group of automorphisms of the graph. These enhancements will facilitate efficient construction of eigenfunctions and diffusion wavelets. For large state spaces, one can randomly subsample the graph, construct the eigenfunctions of the Laplacian or the diffusion wavelets on the subgraph, and then interpolate these functions using the Nyström approximation and related low-rank linear algebraic methods. In experiments on the classic inverted pendulum control task, the Nyström approximation yielded excellent results compared to radial basis functions, learning a more stable policy with a smaller number of samples.

Acknowledgements

This research was supported in part by a grant from the National Science Foundation (IIS-0534999).

References

[1] D. P. Bertsekas and J. N. Tsitsiklis. Neuro-Dynamic Programming. Athena Scientific, Belmont, Massachusetts, 1996.

[2] J. Bremer, R. Coifman, M. Maggioni, and A. Szlam. Diffusion wavelet packets. Technical Report YALE/DCS/TR-1304, Yale University, 2004. To appear in Appl. Comp. Harm. Anal.

[3] F. Chung. Spectral Graph Theory.
American Mathematical Society, 1997.

[4] R. Coifman and M. Maggioni. Diffusion wavelets. Technical Report YALE/DCS/TR-1303, Yale University, 2004. To appear in Appl. Comp. Harm. Anal.

[5] M. Lagoudakis and R. Parr. Least-squares policy iteration. Journal of Machine Learning Research, 4:1107–1149, 2003.

[6] M. Maggioni and S. Mahadevan. Fast direct policy evaluation using multiscale Markov diffusion processes. Technical Report TR-2005-39, University of Massachusetts, 2005.

[7] S. Mahadevan. Proto-value functions: Developmental reinforcement learning. In Proceedings of the 22nd International Conference on Machine Learning, 2005.

[8] S. Mahadevan. Representation policy iteration. In Proceedings of the 21st International Conference on Uncertainty in Artificial Intelligence, 2005.

[9] S. Mahadevan. Samuel meets Amarel: Automating value function approximation using global state space analysis. In National Conference on Artificial Intelligence (AAAI), 2005.

[10] M. L. Puterman. Markov Decision Processes. Wiley Interscience, New York, USA, 1994.

[11] S. Rosenberg. The Laplacian on a Riemannian Manifold. Cambridge University Press, 1997.
", "award": [], "sourceid": 2871, "authors": [{"given_name": "Sridhar", "family_name": "Mahadevan", "institution": null}, {"given_name": "Mauro", "family_name": "Maggioni", "institution": null}]}