{"title": "Optimality of Reinforcement Learning Algorithms with Linear Function Approximation", "book": "Advances in Neural Information Processing Systems", "page_first": 1587, "page_last": 1594, "abstract": null, "full_text": "Optimality of Reinforcement Learning \n\nAlgorithms with Linear Function \n\nApproximation \n\nRalf Schoknecht \n\nILKD \n\nUniversity of Karlsruhe, Germany \nralf.schoknecht@ilkd.uni-karlsruhe.de \n\nAbstract \n\nThere are several reinforcement learning algorithms that yield ap(cid:173)\nproximate solutions for the problem of policy evaluation when the \nvalue function is represented with a linear function approximator. \nIn this paper we show that each of the solutions is optimal with \nrespect to a specific objective function. Moreover, we characterise \nthe different solutions as images of the optimal exact value func(cid:173)\ntion under different projection operations. The results presented \nhere will be useful for comparing the algorithms in terms of the \nerror they achieve relative to the error of the optimal approximate \nsolution. \n\n1 \n\nIntroduction \n\nIn large domains the determination of an optimal value function via a tabular rep(cid:173)\nresentation is no longer feasible with respect to time and memory considerations. \nTherefore, reinforcement learning (RL) algorithms are combined with linear func(cid:173)\ntion approximation schemes. However, the different RL algorithms, that all achieve \nthe same optimal solution in the tabular case, converge to different solutions when \ncombined with function approximation. Up to now it is not clear which of the \nsolutions, i.e. which of the algorithms, should be preferred. One reason is that a \ncharacterisation of the different solutions in terms of the objective functions they \noptimise is partly missing. 
In this paper we state objective functions for the TD(0) algorithm [9], the LSTD algorithm [4, 3] and the residual gradient algorithm [1] applied to the problem of policy evaluation, i.e. the determination of the value function for a fixed policy. Moreover, we characterise the different solutions as images of the optimal exact value function under different projection operations. We think that an analysis of the different optimisation criteria and the projection operations will be useful for determining the errors that the different algorithms achieve relative to the error of the theoretically optimal approximate solution. This will yield a criterion for selecting an optimal RL algorithm. For the TD(0) algorithm such error bounds with respect to a specific norm are already known [2, 10], but for the other algorithms there are no comparable results.

2 Exact Policy Evaluation

For a Markov decision process (MDP) with finite state space S (|S| = N), action space A, state transition probabilities p : S × S × A → [0, 1] and stochastic reward function r : S × A → ℝ, policy evaluation is concerned with solving the Bellman equation

V^μ = γ P^μ V^μ + R^μ    (1)

for a fixed policy μ : S → A. V^μ_i denotes the value of state s_i, P^μ_{ij} = p(s_i, s_j, μ(s_i)), R^μ_i = E{r(s_i, μ(s_i))} and γ is the discount factor. As the policy μ is fixed, we will omit it in the following to simplify notation.

The fixed point V* of equation (1) can be determined iteratively with an operator T : ℝ^N → ℝ^N by

T V_n = V_{n+1} = γ P V_n + R.    (2)

This iteration converges to a unique fixed point [2], which is given by

V* = (I − γP)^{-1} R,    (3)

where (I − γP) is invertible for every stochastic matrix P.

3 Approximate Policy Evaluation

If the state space S gets too large, the exact solution of equation (1) becomes very costly with respect to both memory and computation time.
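As a point of reference, the direct solution (3) and the iteration (2) can be sketched in a few lines of NumPy. The 3-state chain, its transition matrix and rewards below are illustrative assumptions, not taken from the paper:

```python
import numpy as np

# Toy 3-state chain; gamma, P and R are illustrative assumptions, not from the paper.
gamma = 0.9
P = np.array([[0.5, 0.5, 0.0],
              [0.1, 0.6, 0.3],
              [0.2, 0.2, 0.6]])  # row-stochastic transition matrix for the fixed policy
R = np.array([1.0, 0.0, 2.0])    # expected one-step rewards

# Direct solution of (3): V* = (I - gamma*P)^{-1} R
V_star = np.linalg.solve(np.eye(3) - gamma * P, R)

# The same point is reached by iterating (2): V_{n+1} = T V_n = gamma*P V_n + R
V = np.zeros(3)
for _ in range(2000):
    V = gamma * P @ V + R

assert np.allclose(V, V_star)
```

Since γ < 1, the operator T is a contraction and the iterates converge to the unique fixed point, in agreement with (3).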
Therefore, linear feature-based function approximation is often applied. The value function V is represented as a linear combination of basis functions {φ_1, ..., φ_F}, which can be written as V = Φw, where w ∈ ℝ^F is the parameter vector describing the linear combination and Φ = (φ_1 | ... | φ_F) ∈ ℝ^{N×F} is the matrix with the basis functions as columns. The rows of Φ are the feature vectors φ(s_i) ∈ ℝ^F for the states s_i.

3.1 The Optimal Approximate Solution

If the transition probability matrix P were known, then the optimal exact solution V* = (I − γP)^{-1} R could be computed directly. The optimal approximation to this solution is obtained by minimising ‖Φw − V*‖ with respect to w. Therefore, a notion of norm must exist. Generally a symmetric positive definite matrix D can be used to define a norm according to ‖x‖_D = √⟨x, x⟩_D with the scalar product ⟨x, y⟩_D = x^T D y. The optimal solution that can be achieved with the linear function approximator Φw then is the orthogonal projection of V* onto [Φ], i.e. the span of the columns of Φ. Let Φ have full column rank. Then the orthogonal projection on [Φ] according to the norm ‖·‖_D is defined as Π_D = Φ(Φ^T D Φ)^{-1} Φ^T D. We denote the optimal approximate solution by V^{SL}_D = Π_D V*. The corresponding parameter vector w^{SL}_D with V^{SL}_D = Φ w^{SL}_D is then given by

w^{SL}_D = (Φ^T D Φ)^{-1} Φ^T D V* = (Φ^T D Φ)^{-1} Φ^T D (I − γP)^{-1} R.    (4)

Here, SL stands for supervised learning because w^{SL}_D minimises the weighted quadratic error

min_{w ∈ ℝ^F} (1/2)‖Φw − V*‖²_D = (1/2)(Φ w^{SL}_D − V*)^T D (Φ w^{SL}_D − V*) = (1/2)‖V^{SL}_D − V*‖²_D    (5)

for a given D and V*, which is the objective of a supervised learning method. Note that V* equals the expected discounted accumulated reward along a sampled trajectory under the fixed policy μ, i.e. V*(s_0) = E[Σ_{t=0}^∞ γ^t r(s_t, μ(s_t))] for every s_0 ∈ S.
These are exactly the samples obtained by the TD(1) algorithm [9]. Thus, the TD(1) solution is equivalent to the optimal approximate solution.

3.2 The Iterative TD Algorithm

In the approximate case the Bellman equation (1) becomes

Φw = γ P Φ w + R.    (6)

A popular algorithm for updating the parameter vector w after a single transition x_i → z_i with reward r_i is the stochastic sampling-based TD(0) algorithm [9]

w_{n+1} = w_n + α φ(x_i)[r_i + γ φ(z_i)^T w_n − φ(x_i)^T w_n] = (I_F + α A_i) w_n + α b_i,    (7)

where α is the learning rate, A_i = φ(x_i)[γ φ(z_i) − φ(x_i)]^T, b_i = φ(x_i) r_i and I_F is the identity matrix in ℝ^F. Let p be a probability distribution on the state space S. Furthermore, let x_i be sampled according to p, z_i be sampled according to P(x_i, ·) and r_i be sampled according to r(x_i). We will use E_p[·] to denote the expectation with respect to the distribution p. Let A^{TD}_{D_p} = E_p[A_i] and b^{TD}_{D_p} = E_p[b_i]. If the learning rate decays according to

Σ_t α_t = ∞,  Σ_t α_t² < ∞,    (8)

then, in the average sense, the stochastic TD(0) algorithm (7) behaves like the deterministic iteration

w_{n+1} = (I_F + α A^{TD}_{D_p}) w_n + α b^{TD}_{D_p}    (9)

with

A^{TD}_{D_p} = −Φ^T D_p (I − γP) Φ,  b^{TD}_{D_p} = Φ^T D_p R,    (10)

where D_p = diag(p) is the diagonal matrix with the elements of p and R is the vector of expected rewards [2] (Lemma 6.5, Lemma 6.7). In particular, the stochastic TD(0) algorithm converges if and only if the deterministic algorithm (9) converges. Furthermore, if both algorithms converge, they converge to the same fixed point.

An iteration of the form (9) converges if all eigenvalues of the matrix I + α A^{TD}_{D_p} lie within the unit circle [5]. For a matrix A^{TD}_{D_p} that has only eigenvalues with negative real part and a learning rate α_t that decays according to (8), there is a t* such that the eigenvalues of I + α_t A^{TD}_{D_p} lie inside the unit circle for all t > t*.
Hence, for a decaying learning rate the deterministic TD(0) algorithm converges if all eigenvalues of A^{TD}_{D_p} have a negative real part. Since this requirement is not always fulfilled, the TD algorithm possibly diverges, as shown in [1]. This divergence is due to eigenvalues of A^{TD}_{D_p} with positive real part [8].

However, under special assumptions convergence of the TD(0) algorithm can be shown [2]. Let the feature matrix Φ ∈ ℝ^{N×F} have full rank, where F ≤ N, i.e. there are not more parameters than states. This results in no loss of generality because the linearly dependent columns of Φ can be eliminated without changing the power of the approximation architecture. The most important assumption concerns the sampling of the states, which is reflected in the matrix D. Let the Markov chain be aperiodic and recurrent. Besides the aperiodicity requirement, this assumption results in no loss of generality because transient states can be eliminated. Then a steady-state distribution π of the Markov chain exists. When sampling the states according to this steady-state distribution, i.e. D = D_π = diag(π), it can be shown that A^{TD}_{D_π} is negative definite [2] (Lemma 6.6). This immediately yields that all eigenvalues have negative real parts, which in turn yields convergence of the TD(0) algorithm with decaying learning rate.

In the next section we will characterise the limit value V^{TD}_{D_π} as the projection of V* in a more general setting. However, for the sampling distribution π there is another interesting interpretation of V^{TD}_{D_π} as the fixed point of Π_{D_π} T, where Π_{D_π} is the orthogonal projection with respect to D_π onto [Φ], as defined in section 3.1, and T is the update operator defined in (2) [2, 10]. In the following we use this fact to deduce a new formula for V^{TD}_{D_π} that has a form similar to V* in (3). Before we proceed, we need the following lemma.

Lemma 1 The matrix I − γ Π_{D_π} P is regular.
Proof: The matrix I − γ Π_{D_π} P is regular if and only if it does not have eigenvalue zero. An equivalent condition is that one is not an eigenvalue of γ Π_{D_π} P. Therefore, it is sufficient to show that the spectral radius satisfies ρ(γ Π_{D_π} P) < 1. For any matrix norm ‖·‖ it holds that ρ(A) ≤ ‖A‖ [5]. Therefore, we know that ρ(γ Π_{D_π} P) ≤ ‖γ Π_{D_π} P‖_{D_π}, where the vector norm ‖·‖_{D_π} induces the matrix norm ‖·‖_{D_π} by the standard definition ‖A‖_{D_π} = sup_{‖x‖_{D_π}=1} {‖Ax‖_{D_π}}. With this definition and with the fact that ‖Px‖_{D_π} ≤ ‖x‖_{D_π} for all x [2] (Lemma 6.4) we obtain ‖P‖_{D_π} = sup_{‖x‖_{D_π}=1} {‖Px‖_{D_π}} ≤ sup_{‖x‖_{D_π}=1} {‖x‖_{D_π}} = 1. Moreover, we have ‖Π_{D_π}‖_{D_π} = sup_{‖x‖_{D_π}=1} {‖Π_{D_π} x‖_{D_π}} ≤ sup_{‖x‖_{D_π}=1} {‖x‖_{D_π}} = 1, where we used the well-known fact that an orthogonal projection Π_{D_π} is a non-expansion with respect to the vector norm ‖·‖_{D_π}. Putting all together we obtain ρ(γ Π_{D_π} P) ≤ ‖γ Π_{D_π} P‖_{D_π} ≤ γ ‖Π_{D_π}‖_{D_π} · ‖P‖_{D_π} ≤ γ < 1.  □

We can now solve the fixed point equation V^{TD}_{D_π} = Π_{D_π} T V^{TD}_{D_π} and obtain

V^{TD}_{D_π} = (I − γ P̄)^{-1} R̄    (11)

with P̄ = Π_{D_π} P and R̄ = Π_{D_π} R. This resembles equation (3) for the exact solution of the policy evaluation problem. The TD(0) solution with sampling distribution π can thus be interpreted as the exact solution of the "projected" policy evaluation problem with P̄ and R̄. Note that, compared to the TD(1) solution of the approximate policy evaluation problem V^{SL}_{D_π} = Π_{D_π} (I − γP)^{-1} R with weighting matrix D_π, equation (11) only differs in the position of the projection operator. This leads to an interesting comparison of TD(0) and TD(1). While TD(0) yields the exact solution of the projected problem, TD(1) yields the projected solution of the exact problem.
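This projected-problem view of TD(0) can be checked numerically. The following sketch uses an assumed 3-state chain with a 2-feature approximator (all numbers are illustrative, not from the paper); it builds Π_{D_π}, solves the projected problem (11), and verifies the fixed-point property V^{TD} = Π_{D_π} T V^{TD}:

```python
import numpy as np

# Illustrative 3-state chain with a 2-feature approximator (assumed numbers).
gamma = 0.9
P = np.array([[0.2, 0.8, 0.0],
              [0.0, 0.2, 0.8],
              [0.8, 0.0, 0.2]])
R = np.array([0.0, 0.0, 1.0])
Phi = np.array([[1.0, 0.0],
                [1.0, 1.0],
                [0.0, 1.0]])  # full column rank, F = 2 < N = 3

# Steady-state distribution pi: left eigenvector of P for eigenvalue 1
evals, evecs = np.linalg.eig(P.T)
pi = np.real(evecs[:, np.argmax(np.real(evals))])
pi = pi / pi.sum()
D = np.diag(pi)

# Orthogonal projection onto [Phi] with respect to ||.||_{D_pi}
Pi = Phi @ np.linalg.solve(Phi.T @ D @ Phi, Phi.T @ D)

V_star = np.linalg.solve(np.eye(3) - gamma * P, R)

# TD(0): exact solution of the projected problem (11)
V_td = np.linalg.solve(np.eye(3) - gamma * (Pi @ P), Pi @ R)

# TD(1)/SL: projected solution of the exact problem
V_sl = Pi @ V_star

# V_td is the fixed point of Pi T and lies in the span [Phi]
assert np.allclose(V_td, Pi @ (gamma * P @ V_td + R))
assert np.allclose(Pi @ V_td, V_td)
```

Comparing `V_td` and `V_sl` on such examples makes the distinction concrete: the two solutions generally differ, although both lie in [Φ].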
3.3 The Least-Squares TD Algorithm

Besides the iterative solution of (6), often a direct solution by matrix inversion is computed using equation (9) in the fixed point form A^{TD}_{D_p} w^{TD}_{D_p} + b^{TD}_{D_p} = 0. This approach is known as least-squares TD (LSTD) [4, 3]. It is only required that A^{TD}_{D_p} be invertible, i.e. that the eigenvalues be unequal to zero. In contrast to the iterative TD algorithm, the eigenvalues need not have negative real parts. Therefore, LSTD offers the possibility of using sampling distributions p other than the steady-state distribution π [6, 7]. Thus, parts of the state space that would be rarely visited under the steady-state distribution can now be visited more frequently, which makes the approximation of the value function more reliable. This is necessary if the result of policy evaluation is to be used in a policy improvement step, because otherwise the action choice in rarely visited states may be bad [6].

For the following, let the feature matrix have full column rank. As described above, this results in no loss of generality. LSTD allows sampling the states with an arbitrary sampling distribution p. If there are states s that are not visited under p, i.e. p(s) = 0, then these states can be eliminated from the Markov chain. Hence, without loss of generality we assume that the matrix D_p = diag(p) is invertible. These conditions ensure the invertibility of A^{TD}_{D_p}, and according to [4, 3] the LSTD solution is given by

w^{TD}_{D_p} = (−A^{TD}_{D_p})^{-1} b^{TD}_{D_p}.    (12)

Note that the matrix A^{TD}_{D_p} and the vector b^{TD}_{D_p} can be computed from samples, so that the model P does not need to be known. Note also that in general w^{TD}_{D_p} ≠ w^{SL}_{D_p}, as discussed in [3]. This means that the TD(0) solution w^{TD}_{D_p} and the TD(1) solution w^{SL}_{D_p} may differ when function approximation is used.
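That A^{TD}_{D_p} and b^{TD}_{D_p} can be estimated purely from sampled transitions is easy to illustrate. The sketch below (an assumed toy chain with an arbitrary sampling distribution p ≠ π; all numbers are hypothetical) averages the per-transition quantities A_i and b_i from (7), solves the fixed point form (12), and checks the result against the model-based expression (10):

```python
import numpy as np

rng = np.random.default_rng(0)
gamma, N = 0.9, 3
# Illustrative model (assumed numbers); the sample-based estimate never uses P directly.
P = np.array([[0.2, 0.8, 0.0],
              [0.0, 0.2, 0.8],
              [0.8, 0.0, 0.2]])
R_exp = np.array([0.0, 0.0, 1.0])   # rewards, deterministic here for simplicity
Phi = np.array([[1.0, 0.0],
                [1.0, 1.0],
                [0.0, 1.0]])
p = np.array([0.5, 0.25, 0.25])     # arbitrary sampling distribution, p != pi

# Sample transitions x -> z; average A_i = phi(x)(gamma*phi(z) - phi(x))^T, b_i = phi(x)*r
n = 200_000
xs = rng.choice(N, size=n, p=p)
zs = (rng.random(n)[:, None] > np.cumsum(P, axis=1)[xs]).sum(axis=1)
Fx, Fz = Phi[xs], Phi[zs]
A = (Fx[:, :, None] * (gamma * Fz - Fx)[:, None, :]).mean(axis=0)
b = (Fx * R_exp[xs, None]).mean(axis=0)

w_lstd = np.linalg.solve(-A, b)     # fixed point form (12): A w + b = 0

# Model-based check against (10): A = -Phi^T D_p (I - gamma*P) Phi, b = Phi^T D_p R
D_p = np.diag(p)
A_model = -Phi.T @ D_p @ (np.eye(N) - gamma * P) @ Phi
w_model = np.linalg.solve(-A_model, Phi.T @ D_p @ R_exp)
assert np.allclose(w_lstd, w_model, atol=0.05)
```

The tolerance only absorbs Monte Carlo error; the sample averages converge to (10) as n grows.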
Depending on the sampling distribution p, the LSTD approach may be the only way of computing the fixed point of (9), because the corresponding iterative TD(0) algorithm may diverge due to positive eigenvalues. However, if the TD(0) algorithm converges, the limit coincides with the LSTD solution w^{TD}_{D_p}.

For the value function V^{TD}_{D_p} achieved by the LSTD algorithm the following holds:

V^{TD}_{D_p} = Φ w^{TD}_{D_p} = Φ (−A^{TD}_{D_p})^{-1} b^{TD}_{D_p} = Φ [(−A^{TD}_{D_p})^T (−A^{TD}_{D_p})]^{-1} (−A^{TD}_{D_p})^T b^{TD}_{D_p} = Π_{D^{TD}_p} V*,    (13)

where the second equality uses (12) and the last uses (3) and (10). We define D^{TD}_p = (I − γP)^T D_p Φ Φ^T D_p (I − γP). As Φ Φ^T is singular in general, the matrix D^{TD}_p is symmetric and positive semi-definite. Hence, it defines a semi-norm ‖·‖_{D^{TD}_p}. Thus, the LSTD solution is obtained by projecting V* onto [Φ] with respect to ‖·‖_{D^{TD}_p}. After having deduced this new relation between the optimal solution V* and V^{TD}_{D_p}, we can characterise w^{TD}_{D_p} as minimising the corresponding quadratic objective function:

min_{w ∈ ℝ^F} (1/2)‖Φw − V*‖²_{D^{TD}_p} = (1/2)(Φ w^{TD}_{D_p} − V*)^T D^{TD}_p (Φ w^{TD}_{D_p} − V*) = (1/2)‖V^{TD}_{D_p} − V*‖²_{D^{TD}_p}.    (14)

It can be shown that the value of the objective function for the LSTD solution is zero, i.e. ‖V^{TD}_{D_p} − V*‖²_{D^{TD}_p} = 0. With equation (14) we have shown that the LSTD solution minimises a certain error metric. The form of this error metric is similar to (5). The only difference lies in the norm that is used. This unifies the characterisation of the solutions that are achieved by different algorithms.

3.4 The Residual Gradient Algorithm

There is a third approach to solving equation (6). The residual gradient algorithm [1] directly minimises the weighted Bellman error

(1/2)‖(I − γP)Φw − R‖²_{D_p}    (15)

by gradient descent.
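A minimal numerical sketch of this minimisation, on an assumed toy chain (all numbers are illustrative): plain gradient descent on (15) converges to the closed-form fixed point derived below as (18), since the objective is a convex quadratic in w.

```python
import numpy as np

# Illustrative 3-state chain with 2 features (assumed numbers).
gamma, N = 0.9, 3
P = np.array([[0.2, 0.8, 0.0],
              [0.0, 0.2, 0.8],
              [0.8, 0.0, 0.2]])
R = np.array([0.0, 0.0, 1.0])
Phi = np.array([[1.0, 0.0],
                [1.0, 1.0],
                [0.0, 1.0]])
p = np.array([0.5, 0.25, 0.25])
D_p = np.diag(p)
M = np.eye(N) - gamma * P            # (I - gamma*P)

# Gradient of (15): d/dw 1/2 ||M Phi w - R||^2_{D_p} = (M Phi)^T D_p (M Phi w - R)
w = np.zeros(2)
for _ in range(20_000):
    w -= 0.5 * (M @ Phi).T @ D_p @ (M @ Phi @ w - R)  # fixed step, small enough here

# Closed-form minimiser, cf. (18): w = (Phi^T M^T D_p M Phi)^{-1} Phi^T M^T D_p R
w_closed = np.linalg.solve(Phi.T @ M.T @ D_p @ M @ Phi, Phi.T @ M.T @ D_p @ R)
assert np.allclose(w, w_closed)
```

A fixed step size is used only for brevity; the text's convergence argument is for a decaying learning rate.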
The resulting update rule of the deterministic algorithm has a form similar to (9),

w_{n+1} = (I_F + α A^{RG}_{D_p}) w_n + α b^{RG}_{D_p},    (16)

with

A^{RG}_{D_p} = −Φ^T (I − γP)^T D_p (I − γP) Φ,  b^{RG}_{D_p} = Φ^T (I − γP)^T D_p R,    (17)

where D_p is again the diagonal matrix with the visitation probabilities p_i on its diagonal. As all entries on the diagonal are nonnegative, D_p can be decomposed into √D_p^T √D_p. Hence, we can write A^{RG}_{D_p} = −(√D_p (I − γP) Φ)^T √D_p (I − γP) Φ. Therefore, A^{RG}_{D_p} is negative semidefinite. If Φ has full column rank and D_p is regular, i.e. the visitation probability for every state is positive, then A^{RG}_{D_p} is negative definite. Therefore, all eigenvalues of A^{RG}_{D_p} are negative, which yields convergence of the residual gradient algorithm (16) for a decaying learning rate, independently of the weighting D_p, the function approximator Φ and the transition probabilities P. The equivalence of the limit value of the deterministic and the stochastic version of the residual gradient algorithm can be proven with an argument similar to that in [2] for the equivalence of the deterministic and the stochastic version of the TD(0) algorithm in equations (7) and (9) respectively. Note also that the matrix A^{RG}_{D_p} and the vector b^{RG}_{D_p} can be computed from samples, so that the model P does not need to be known for the deterministic residual gradient algorithm.

If A^{RG}_{D_p} is invertible, a unique limit of the iteration (16) exists. It can be directly computed via the fixed point form, which yields the new identity

w^{RG}_{D_p} = (−A^{RG}_{D_p})^{-1} b^{RG}_{D_p} = (Φ^T (I − γP)^T D_p (I − γP) Φ)^{-1} Φ^T (I − γP)^T D_p R.    (18)

This solution of the residual gradient algorithm is related to the optimal solution (4) of the approximate Bellman equation (6) as described in the following lemma.
Lemma 2 The solution w^{RG}_{D_p} of the residual gradient algorithm with weighting matrix D_p is equivalent to the optimal supervised learning solution w^{SL}_{D^{RG}_p} of the approximate Bellman equation (6) with weighting matrix D^{RG}_p = (I − γP)^T D_p (I − γP).

Proof:

w^{RG}_{D_p} = (Φ^T (I − γP)^T D_p (I − γP) Φ)^{-1} Φ^T (I − γP)^T D_p R
= (Φ^T D^{RG}_p Φ)^{-1} Φ^T (I − γP)^T D_p (I − γP)(I − γP)^{-1} R
= (Φ^T D^{RG}_p Φ)^{-1} Φ^T D^{RG}_p V* = w^{SL}_{D^{RG}_p},

where we used the fact that V* = (I − γP)^{-1} R.  □

Therefore, w^{RG}_{D_p} can be interpreted as the orthogonal projection of the optimal solution V* onto [Φ] with respect to the scalar product defined by D^{RG}_p. This yields a new equivalent formula for the Bellman error (15):

(1/2)‖(I − γP)Φw − R‖²_{D_p} = (1/2)((I − γP)Φw − R)^T D_p ((I − γP)Φw − R)
= (1/2)(Φw − V*)^T (I − γP)^T D_p (I − γP)(Φw − V*) = (1/2)‖Φw − V*‖²_{D^{RG}_p}.    (19)

The Bellman error is the objective function that is minimised by the residual gradient algorithm. As we have just shown, this objective function can be expressed in a form similar to (5), where the only difference lies in the norm that is used. Thus, we have shown that the solution of the residual gradient algorithm can also be characterised in the general framework of quadratic error metrics ‖Φw − V*‖_D. As a direct consequence we can represent the solution as an orthogonal projection V^{RG}_{D_p} = Φ w^{RG}_{D_p} = Π_{D^{RG}_p} V*.

According to section 3.2, an iteration of the form (16) generally converges for matrices A with eigenvalues that have negative real parts. However, the fact that A^{RG}_{D_p} is symmetric assures convergence even for singular A^{RG}_{D_p} [8] (Proposition 1). Thus, the residual gradient algorithm (16) converges for any matrix A^{RG}_{D_p} that is of the form (17), and in case A^{RG}_{D_p} is regular the limit is given by (18). Note that a matrix Φ which does not have full column rank leads to ambiguous solutions w^{RG}_{D_p} that depend on the initial value w_0. However, the corresponding V^{RG}_{D_p} = Φ w^{RG}_{D_p} are the same. For singular D_p the matrix D^{RG}_p = (I − γP)^T D_p (I − γP) is also singular. Thus, the limit V^{RG}_{D_p} may not be unique but may itself depend on the initial value w_0. The reason is that there may be a whole subspace of [Φ] with dimension larger than zero that minimises ‖V^{RG}_{D_p} − V*‖_{D^{RG}_p}, because ‖·‖_{D^{RG}_p} is now only a semi-norm. But for all minimising V^{RG}_{D_p} the Bellman error is the same, i.e. with respect to the Bellman error all the solutions V^{RG}_{D_p} are equivalent [8] (Proposition 1).

3.5 Synopsis of the Different Solutions

Table 1: Overview of the solutions of different RL algorithms. The supervised learning (SL) approach, the TD(0) algorithm, the LSTD algorithm and the residual gradient (RG) algorithm are analysed in terms of their conditions of solvability. Moreover, we summarise the optimisation criteria that the different algorithms minimise and characterise the different solutions in terms of the projection of the optimal solution V* onto [Φ]. If the visitation distribution is arbitrary, we write ∀p.

                                    SL           TD                LSTD              RG
  solvability:
    condition for λ_i               -            Re(λ_i) < 0       λ_i ≠ 0           Re(λ_i) ≤ 0
    condition for p                 ∀p           p = π             p(s) ≠ 0          ∀p
  optimisation criterion            eq. (5)      eq. (14)          eq. (14)          eq. (19)
  characterisation as projection    Π_{D_p} V*   Π_{D^{TD}_π} V*   Π_{D^{TD}_p} V*   Π_{D^{RG}_p} V*

In Table 1 we give a brief overview of the solutions that the different RL algorithms yield. An SL solution can be computed for arbitrary weighting matrices D_p induced by a sampling distribution p.
For the three RL algorithms (TD, LSTD, RG), solvability conditions can be formulated either in terms of the eigenvalues λ_i of the iteration matrix A or in terms of the sampling distribution p. The iterative TD(0) algorithm has the most restrictive conditions for solvability, both for the eigenvalues of the iteration matrix A, whose real parts must be smaller than zero, and for the sampling distribution p, which must equal the steady-state distribution π. The LSTD method only requires invertibility of A^{TD}_{D_p}. This is satisfied if Φ has full column rank and if the visitation distribution p samples every state s infinitely often, i.e. p(s) ≠ 0 for all s ∈ S. In contrast to that, the residual gradient algorithm converges independently of p and the concrete A^{RG}_{D_p}, because all these matrices have eigenvalues with nonpositive real parts.

All solutions can be characterised as minimising a quadratic optimisation criterion ‖Φw − V*‖_D with a corresponding matrix D. The SL solution optimises the weighted quadratic error (5), RG optimises the weighted Bellman error (19), and both TD and LSTD optimise the quadratic function (14) with weighting matrices D^{TD}_π and D^{TD}_p respectively. With the assumption of regular D_p, i.e. p(s) ≠ 0 for all s ∈ S, the solutions V can be characterised as images of the optimal solution V* under different orthogonal projections (optimal, RG) and projections that minimise a semi-norm (TD, LSTD). For singular D_p see the remarks on ambiguous solutions in section 3.4.

Let us finally discuss the case of a quasi-tabular representation of the value function that is obtained for regular Φ, and let all states be visited infinitely often, i.e. let D_p be regular. Due to the invertibility of Φ we have [Φ] = ℝ^N. Thus, the optimal solution V* is exactly representable because V* ∈ [Φ]. Moreover, every projection operator Π : ℝ^N → [Φ] reduces to the identity.
Therefore, all the projection operators for the different algorithms are equivalent to the identity. Hence, with a quasi-tabular representation all the algorithms converge to the optimal solution V*.

4 Conclusions

We have presented an analysis of the solutions that are achieved by different reinforcement learning algorithms combined with linear function approximation. The solutions of all the examined algorithms, TD(0), LSTD and the residual gradient algorithm, can be characterised as minimising different corresponding quadratic objective functions. As a consequence, each of the value functions that one of the above algorithms converges to can be interpreted as the image of the optimal exact value function under a corresponding orthogonal projection. In this general framework we have given the first characterisation of the approximate TD(0) solution in terms of the minimisation of a quadratic objective function. This approach allows us to view the TD(0) solution as the exact solution of a projected learning problem. Moreover, we have shown that the residual gradient solution and the optimal approximate solution only differ in the weighting of the error between the exact and the approximate solution. In future research we intend to use the results presented here for determining the errors of the different solutions relative to the optimal approximate solution with respect to a given norm. This will yield a criterion for selecting reinforcement learning algorithms that achieve optimal solution quality.

References

[1] L. C. Baird. Residual algorithms: Reinforcement learning with function approximation. In Proc. of the Twelfth International Conference on Machine Learning, 1995.
[2] D. P. Bertsekas and J. N. Tsitsiklis. Neuro-Dynamic Programming. Athena Scientific, Belmont, Massachusetts, 1996.
[3] J. A. Boyan. Least-squares temporal difference learning. In Proc. of the Sixteenth International Conference on Machine Learning, pages 49-56, 1999.
[4] S. J. Bradtke and A. G. Barto. Linear least-squares algorithms for temporal difference learning. Machine Learning, 22:33-57, 1996.
[5] A. Greenbaum. Iterative Methods for Solving Linear Systems. SIAM, 1997.
[6] D. Koller and R. Parr. Policy iteration for factored MDPs. In Proc. of the Sixteenth Conference on Uncertainty in Artificial Intelligence (UAI), pages 326-334, 2000.
[7] M. G. Lagoudakis and R. Parr. Model-free least-squares policy iteration. In Advances in Neural Information Processing Systems, volume 14, 2002.
[8] R. Schoknecht and A. Merke. Convergent combinations of reinforcement learning with function approximation. In Advances in Neural Information Processing Systems, volume 15, 2003.
[9] R. S. Sutton. Learning to predict by the methods of temporal differences. Machine Learning, 3:9-44, 1988.
[10] J. N. Tsitsiklis and B. Van Roy. An analysis of temporal-difference learning with function approximation. IEEE Transactions on Automatic Control, 1997.
", "award": [], "sourceid": 2322, "authors": [{"given_name": "Ralf", "family_name": "Schoknecht", "institution": null}]}