{"title": "Convergent Combinations of Reinforcement Learning with Linear Function Approximation", "book": "Advances in Neural Information Processing Systems", "page_first": 1611, "page_last": 1618, "abstract": null, "full_text": "Convergent Combinations of \n\nReinforcement Learning with Linear \n\nFunction Approximation \n\nRalf Schoknecht \n\nILKD \n\nUniversity of Karlsruhe, Germany \nralf. schoknecht@ilkd. uni-karlsruhe. de \n\nArtur Merke \n\nLehrstuhl Informatik 1 \n\nUniversity of Dortmund, Germany \n\narturo merke@udo.edu \n\nAbstract \n\nConvergence for iterative reinforcement learning algorithms like \nTD(O) depends on the sampling strategy for the transitions. How(cid:173)\never, in practical applications it is convenient to take transition \ndata from arbitrary sources without losing convergence. In this \npaper we investigate the problem of repeated synchronous updates \nbased on a fixed set of transitions. Our main theorem yields suffi(cid:173)\ncient conditions of convergence for combinations of reinforcement \nlearning algorithms and linear function approximation. This allows \nto analyse if a certain reinforcement learning algorithm and a cer(cid:173)\ntain function approximator are compatible. For the combination of \nthe residual gradient algorithm with grid-based linear interpolation \nwe show that there exists a universal constant learning rate such \nthat the iteration converges independently of the concrete transi(cid:173)\ntion data. \n\nIntroduction \n\n1 \nThe strongest convergence guarantees for reinforcement learning (RL) algorithms \nare available for the tabular case, where temporal difference algorithms for both \npolicy evaluation and the general control problem converge with probability one \nindependently of the concrete sampling strategy as long as all states are sampled \ninfinitely often and the learning rate is decreased appropriately [2]. 
In large, possibly continuous, state spaces a tabular representation and adaptation of the value function is not feasible with respect to time and memory considerations. Therefore, linear feature-based function approximation is often used. However, it has been shown that synchronous TD(0), i.e. dynamic programming, diverges for general linear function approximation [1]. Convergence with probability one for TD(λ) with general linear function approximation has been proved in [12]. They establish the crucial condition of sampling states according to the steady-state distribution of the Markov chain in order to ensure convergence. This requirement is reasonable for the pure prediction task but may be disadvantageous for policy improvement, as shown in [6], because it may lead to bad action choices in rarely visited parts of the state space. When transition data is taken from arbitrary sources a certain sampling distribution cannot be assured, which may prevent convergence.

An alternative to such iterative TD approaches are least-squares TD (LSTD) methods [4, 3, 6, 8]. They eliminate the learning rate parameter and carry out a matrix inversion in order to compute the fixed point of the iteration directly. In [4] a least-squares approach for TD(0) is presented which is generalised to TD(λ) in [3]. Both approaches still sample the states according to the steady-state distribution. In [6, 8] arbitrary sampling distributions are used, such that the transition data could be taken from any source. This may yield solutions that are not achievable by the corresponding iterative approach because this iteration diverges. All the LSTD approaches have the problem that the matrix to be inverted may be singular. This case can occur if the basis functions are not linearly independent or if the Markov chain is not recurrent.
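This rank deficiency is easy to reproduce numerically. The sketch below is hypothetical (a three-state chain with a duplicated basis function; the system matrix is written as Φᵀ(I − γP)Φ, a common form for such least-squares systems, not notation taken from this paper):

```python
import numpy as np

gamma = 0.9

# Hypothetical 3-state chain 0 -> 1 -> 2 -> 2 and three basis functions,
# where the third basis function duplicates the second.
Phi = np.array([[1.0, 0.0, 0.0],
                [0.0, 1.0, 1.0],
                [0.0, 1.0, 1.0]])
P = np.array([[0.0, 1.0, 0.0],
              [0.0, 0.0, 1.0],
              [0.0, 0.0, 1.0]])

# System matrix of a least-squares TD solution with uniform state weights.
M = Phi.T @ (np.eye(3) - gamma * P) @ Phi

# Linearly dependent features make M singular, so it cannot be inverted.
print(np.linalg.matrix_rank(M))  # 2, although M is 3 x 3
```

Sorting out the duplicated column restores full rank, which is exactly the preprocessing step the text says one would like to avoid.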
In order to apply the LSTD approach the problem would have to be preprocessed by sorting out the linearly dependent basis functions and the transient states of the Markov chain. In practice one would like to save this additional work.

Thus, the least-squares TD algorithm can fail due to matrix singularity, and the iterative TD(0) algorithm can fail if the sampling distribution is different from the steady-state distribution. Hence, there are problems for which neither an iterative nor a least-squares TD solution exists. The actual reason for the failure of the iterative TD(0) approach lies in an incompatible combination of the RL algorithm and the function approximator. Thus, the idea is that either a change in the RL algorithm or a change in the approximator may yield a convergent iteration. Here, a change in the TD(0) algorithm is not meant to completely alter the character of the algorithm. We require that only modifications of the TD(0) algorithm be considered that are consistent according to the definition in the next section.

In this paper we propose a unified framework for the analysis of a whole class of synchronous iterative RL algorithms combined with arbitrary linear function approximation. For the sparse iteration matrices that occur in RL such an iterative approach is superior to a method that uses matrix inversion as the LSTD approach does [5]. Our main theorem states sufficient conditions under which combinations of RL algorithms and linear function approximation converge. We hope that these conditions and the convergence analysis, which is based on the eigenvalues of the iteration matrix, bring new insight into the interplay of RL and function approximation. For an arbitrary linear function approximator and for arbitrary fixed transition data the theorem makes it possible to predict the existence of a constant learning rate such that the synchronous residual gradient algorithm [1] converges.
Moreover, in combination with interpolating grid-based function approximators we are able to specify a formula for a constant learning rate such that the synchronous residual gradient algorithm converges independently of the transition data. This is very useful because otherwise the learning rate would have to be decreased, which slows down convergence.

2 A Framework for Synchronous Iterative RL Algorithms

For a Markov decision process (MDP) with N states S = {s_1, ..., s_N}, action space A, state transition probabilities p : S × S × A → [0,1] and stochastic reward function r : S × A → R, policy evaluation is concerned with solving the Bellman equation

    V^π = γ P^π V^π + R^π    (1)

for a fixed policy π : S → A. V^π_i denotes the value of state s_i, P^π_ij = p(s_i, s_j, π(s_i)), R^π_i = E{r(s_i, π(s_i))} and γ is the discount factor. As the policy π is fixed we will omit it in the following to make notation easier.

If the state space S gets too large, the exact solution of equation (1) becomes very costly with respect to both memory and computation time. Therefore, often linear feature-based function approximation is applied. The value function V is represented as a linear combination of basis functions [...]

Theorem 1 Consider the iteration w^{n+1} = w^n + α(Aw^n + b) with A = CᵀD, K = DCᵀ and b = Cᵀv, where D is a real m × F matrix and Cᵀ a real F × m matrix with m > F. Let λ_1, ..., λ_l be the distinct eigenvalues of K with algebraic multiplicities β_i, ordered such that |λ_1| ≥ ... ≥ |λ_l|. Also, let E_{λ_i} be the eigenspace corresponding to eigenvalue λ_i and H = max_i { |λ_i|² / |Re(λ_i)| }. If the following assumptions hold

(a) ∀i : (Re(λ_i) < 0) ∨ λ_i = 0
(b) dim(E_{λ_i}) = β_i for λ_i = 0
(c) [Cᵀ] ∩ [Dᵀ]^⊥ = {0}

where [·] denotes the span of the columns and ⊥ the orthogonal complement, then the limit w* = lim_{n→∞} w^n exists for all learning rates 0 < α < α_L, where the limit learning rate α_L satisfies α_L = 2/H. The limit w* may depend on the initial value w^0. Note that if the λ_i leading to the maximum in H is real, then H = |λ_i|.

A proof of this theorem can be found in the appendix. General convergence conditions of iterations have been examined in numerical mathematics.
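The learning-rate bound of the theorem can be probed numerically. The following is a minimal sketch with made-up matrices (not from the paper), assuming the bound α_L = 2/H with H = max_i |λ_i|²/|Re(λ_i)| for an A whose eigenvalues all have negative real part:

```python
import numpy as np

# Hypothetical A with eigenvalues -2 +/- i (negative real parts).
A = np.array([[-2.0, 1.0],
              [-1.0, -2.0]])
b = np.array([1.0, 0.0])

lams = np.linalg.eigvals(A)
H = max(abs(l) ** 2 / abs(l.real) for l in lams)   # |lambda|^2 / |Re(lambda)| = 2.5
alpha_L = 2.0 / H                                  # limit learning rate, here 0.8

# Run w^{n+1} = w^n + alpha*(A w^n + b) for some 0 < alpha < alpha_L.
w = np.zeros(2)
alpha = 0.9 * alpha_L
for _ in range(2000):
    w = w + alpha * (A @ w + b)

w_star = -np.linalg.solve(A, b)   # unique fixed point -A^{-1} b
print(np.allclose(w, w_star))     # True
```

For α above α_L the spectral radius of I + αA exceeds one and the same loop diverges, which is what makes the bound a limit learning rate.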
A standard result states that if the absolute value of the largest eigenvalue of the iteration matrix I_F + αA, i.e. the spectral radius, is smaller than one, then the iteration converges to the unique fixed point w* = −A⁻¹b [5] (Theorem 2.1.1). In our case, however, the matrix A may not be invertible. This happens, for example, if the features [...]

[...] Because 0 < |λ_min| ≤ |λ_max|, the optimal learning rate α* = 2/(|λ_min| + |λ_max|) satisfies α* < 2/|λ_max| = 2/H with H = |λ_max|. Thus, α* leads to convergence according to Theorem 1. Note also that a larger learning rate does not necessarily lead to a faster asymptotic convergence of the iteration.

4 Counterexample of Baird - Revisited

In this section we analyse the counterexample given by Baird in [1] and show how Theorem 1 and Proposition 1 can be applied to obtain explicit bounds for the learning rate α and the discount factor γ for which the residual gradient and TD(0) algorithms converge. The matrices Φ, X and Z are given by

    Φ =
        1 2 0 0 0 0 0 0
        1 0 2 0 0 0 0 0
        1 0 0 2 0 0 0 0
        1 0 0 0 2 0 0 0
        1 0 0 0 0 2 0 0
        1 0 0 0 0 0 2 0
        2 0 0 0 0 0 0 1

where X = I_7 is the 7 × 7 identity matrix and Z is the 7 × 7 matrix whose rows all equal (0 0 0 0 0 0 1). X being the identity corresponds to the synchronous update of every state transition. In the residual gradient case we have K_RG = −(γZ − X)Φ((γZ − X)Φ)ᵀ, and in the TD(0) case K_TD = (γZ − X)Φ(XΦ)ᵀ, which has an eigenvalue with real part > 0 as described in [1] for γ = 0.9. However, contradicting the argument in [1], the TD(0) algorithm converges for all γ ≤ 0.88 if the learning rate is chosen appropriately. For example, for γ = 0.4 all eigenvalues are negative (σ_TD = {−3.0, −4, −5.2}), so conditions (a) and (b) of Theorem 1 are trivially fulfilled. Condition (c) can also be shown by simple computation, and therefore using Theorem 1 we obtain convergence for α < 0.384 and optimal asymptotic convergence for α* ≈ 0.244, which is much smaller.
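The two numbers quoted for γ = 0.4 follow directly from the spectrum σ_TD = {−3.0, −4, −5.2}. The sketch below assumes α* = 2/(|λ_min| + |λ_max|) as the asymptotically optimal rate for a real negative spectrum, which reproduces the quoted values:

```python
import numpy as np

lams = np.array([-3.0, -4.0, -5.2])   # spectrum quoted in the text for gamma = 0.4

H = max(abs(l) ** 2 / abs(l.real) for l in lams)       # real spectrum: H = 5.2
alpha_L = 2.0 / H                                      # 0.3846..., i.e. alpha < 0.384
alpha_opt = 2.0 / (abs(lams).min() + abs(lams).max())  # 0.2439..., i.e. ~0.244

# Asymptotic rate of w^{n+1} = w^n + alpha*(A w^n + b) is max_i |1 + alpha*lambda_i|;
# alpha_opt balances the slowest and the fastest eigenmode.
rate = lambda a: max(abs(1 + a * lams))
assert rate(alpha_opt) <= min(rate(0.9 * alpha_opt), rate(1.1 * alpha_opt))
```

At α* the slowest and fastest eigenmodes contract at the same rate |1 + α*λ| ≈ 0.268, which illustrates why the largest admissible learning rate is not automatically the fastest one.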
5 Conclusions

For the problem of repeated synchronous updates based on a fixed set of transitions we have proved sufficient conditions of convergence for arbitrary combinations of reinforcement learning algorithms and linear function approximation. Our main theorem yields a rule for determining a problem-dependent learning rate such that the algorithm converges. For a combination of the residual gradient algorithm with grid-based linear interpolation we have deduced a constant learning rate such that the algorithm converges independently of the concrete transition data. Moreover, we have derived a general formula for an optimal learning rate with respect to asymptotic convergence. Finally, we have applied our main theorem to fully analyse the example Baird gives for the divergence of TD(0) [1].

Appendix

Lemma 1 Let D be a real m × F matrix and Cᵀ a real F × m matrix, where m > F. Then K = DCᵀ has the same eigenvalues as A = CᵀD and additionally the eigenvalue zero with multiplicity (m − F). Let H^K_λ be the generalised eigenspace of K corresponding to the eigenvalue λ and H^A_λ the generalised eigenspace of A corresponding to the eigenvalue λ. Then, CᵀH^K_λ ⊆ H^A_λ and DH^A_λ ⊆ H^K_λ. For λ ≠ 0 it even holds that CᵀH^K_λ = H^A_λ and DH^A_λ = H^K_λ.

Proof: The generalised eigenspace H^K_λ has index s^K_λ if s^K_λ is the smallest number for which ker(K − λI_m)^{s^K_λ} = ker(K − λI_m)^{s^K_λ + 1} holds, where I_k denotes the identity in R^{k×k}. Let x ∈ H^K_λ, i.e. (K − λI_m)^{s^K_λ} x = 0. With CᵀK^i = A^iCᵀ we have

    Cᵀ(K − λI_m)^{s^K_λ} x = Cᵀ( Σ_{i=0}^{s^K_λ} (s^K_λ choose i) K^i (−λ)^{s^K_λ − i} ) x = (A − λI_F)^{s^K_λ} Cᵀx.    (4)

Thus, Cᵀx ∈ H^A_λ. And with the same argument we obtain Dx ∈ H^K_λ from x ∈ H^A_λ. Therefore, CᵀH^K_λ ⊆ H^A_λ and DH^A_λ ⊆ H^K_λ. Let λ ≠ 0 and B^K_λ a basis of H^K_λ. As the Jordan block of K corresponding to H^K_λ is invertible, the vectors CᵀB^K_λ are linearly independent and therefore form a basis of the span [CᵀB^K_λ]. With the above consideration we have [CᵀB^K_λ] ⊆ H^A_λ.
If this is a proper subset, CᵀB^K_λ can be completed to form a basis B^A_λ of H^A_λ with |B^K_λ| < |B^A_λ|. Then we have that DB^A_λ is linearly independent and [DB^A_λ] ⊆ H^K_λ. Moreover, we have dim(H^K_λ) = |B^K_λ| < |B^A_λ| = dim([DB^A_λ]) ≤ dim(H^K_λ), which is a contradiction. Therefore, CᵀH^K_λ = [CᵀB^K_λ] = H^A_λ. Similarly, we obtain DH^A_λ = H^K_λ. Thus, the multiplicities of the eigenvalues λ ≠ 0 of A and K are the same. The multiplicity of the eigenvalue zero of matrix K is by (m − F) larger than that of matrix A. □

Proof of Theorem 1: Due to assumption (a) and Lemma 1 every eigenvalue of A is either zero or has a real part less than zero. If the real part of every eigenvalue of A is less than zero, A is invertible. For invertible matrices Theorem 2.1.1 from [5] states that the iteration converges if and only if the spectral radius ρ(I_F + αA), i.e. the largest absolute value of an eigenvalue, is less than 1. For every eigenvalue λ_i of A obviously 1 + αλ_i is an eigenvalue of I_F + αA. With H = max_i { |λ_i|² / |Re(λ_i)| } we obtain for α > 0

    ρ(I_F + αA) < 1 ⟺ ∀i : |1 + αλ_i| < 1 ⟺ α < 2/H.    (5)

This completes the proof if all eigenvalues of A have a negative real part.

In the following let A have the eigenvalue λ_1 = 0. The vector space R^F can be represented as the direct sum of the generalised eigenspaces R^F = H^A_0 ⊕ H^A_{λ_2} ⊕ ... ⊕ H^A_{λ_l}. In the following we write H̄^A_0 = H^A_{λ_2} ⊕ ... ⊕ H^A_{λ_l} because this is a complementary space of H^A_0. As the generalised eigenspaces of A are invariant under A, i.e. ∀x ∈ H^A_{λ_i} : Ax ∈ H^A_{λ_i}, the iteration w^{n+1} = (I_F + αA)w^n + αb can be decomposed in two parts, one in the generalised eigenspace H^A_0 and the other in the complementary space H̄^A_0. Let w^n = w̄^n + w̃^n and b = b̄ + b̃, where w̃^n, b̃ ∈ H^A_0 and w̄^n, b̄ ∈ H̄^A_0. Then we have

    w^{n+1} = w^n + α(Aw^n + b) = [w̄^n + α(Aw̄^n + b̄)] + [w̃^n + α(Aw̃^n + b̃)].    (6)

Thus, the convergence analysis can be carried out separately for the two iterations.
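A toy instance of this decomposition (hypothetical, not the paper's construction): with A = diag(−1, 0) the generalised eigenspace for the eigenvalue zero is the second coordinate axis; choosing b with no component there, the iteration is the identity on that subspace and contracts on the complement, so the limit keeps the initial second coordinate.

```python
import numpy as np

# A = diag(-1, 0): one eigenvalue with negative real part, the zero
# eigenvalue has a diagonalisable (here trivial) block.
A = np.diag([-1.0, 0.0])
b = np.array([1.0, 0.0])   # no component in the null-space direction
alpha = 0.5                # 0 < alpha < 2/H with H = |-1| = 1

for w0 in (np.array([0.0, 3.0]), np.array([5.0, -2.0])):
    w = w0.copy()
    for _ in range(200):
        w = w + alpha * (A @ w + b)
    # First coordinate converges to the fixed point 1 of -w1 + 1 = 0;
    # the second coordinate is never moved and stays at its initial value.
    assert np.isclose(w[0], 1.0) and np.isclose(w[1], w0[1])
```

This is exactly the behaviour claimed for the general case: the limit is unique on the complementary space and equals the initial value on the null-space part.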
The matrix A in the iteration w̄^{n+1} = w̄^n + α(Aw̄^n + b̄) is not invertible. However, the iteration takes place in the subspace H̄^A_0. In this subspace the mapping associated with A is invertible. Therefore, A can be replaced by an invertible matrix Ā that does not alter the iteration in H̄^A_0. The matrix Ā can be constructed such that ρ(I_F + αĀ) = max_{λ_i ≠ 0} |1 + αλ_i|. Therefore, according to the considerations above, this iteration converges for 0 < α < 2/H.

In the following we show that the iteration in H^A_0 is the identity and therefore trivially converges. According to assumption (b), H^K_0 = E^K_0. Every v ∈ R^m can be represented as v = v̄ + ṽ with ṽ ∈ E^K_0 and v̄ ∈ H̄^K_0 = H^K_{λ_2} ⊕ ... ⊕ H^K_{λ_l}. According to Lemma 1, CᵀH̄^K_0 = H̄^A_0 and CᵀH^K_0 ⊆ H^A_0 hold. Therefore, for b̄ + b̃ = b = Cᵀv we have b̄ = Cᵀv̄ and b̃ = Cᵀṽ. Let E^K_0 ≠ {0}. Then, for all ṽ ∈ E^K_0

    0 = Kṽ = DCᵀṽ ⟹ Cᵀṽ ∈ [Cᵀ] ∩ [Dᵀ]^⊥ = {0} ⟹ Cᵀṽ = 0.

For E^K_0 = {0} we also obtain Cᵀṽ = 0 because ṽ = 0. Therefore, we have CᵀE^K_0 = {0} and, as a consequence, b̃ = Cᵀṽ = 0. The last thing that remains to show is that Aw̃ = 0 for all w̃ ∈ H^A_0. According to Lemma 1 we know that Dw̃ ∈ H^K_0. Assumption (b) says that H^K_0 = E^K_0, and from the above considerations we know that CᵀE^K_0 = {0}. Therefore, Aw̃ = Cᵀ(Dw̃) = 0. Thus, the iteration in H^A_0 is the identity. As both parts of the iteration converge, the overall iteration also converges, which completes that part of the proof.

The limit w̄* of w̄^{n+1} = w̄^n + α(Aw̄^n + b̄) is unique and we have w̄* = −Ā⁻¹b̄. The limit of w̃^{n+1} = w̃^n + α(Aw̃^n + b̃) is not unique but depends on the initial value w̃^0. It holds that w̃* = w̃^0. Therefore, the limit w* = w̄* + w̃* depends on the initial value w^0. □

Proof of Proposition 1: For the residual gradient algorithm we have A_RG = −ΦᵀXDXᵀΦ and b_RG = −ΦᵀXDXᵀr. In order to apply Theorem 1 this is decomposed into A_RG = CᵀD and b_RG = Cᵀv with C = −D, where C = √D XᵀΦ and v = −√D Xᵀr.
As the diagonal entries of D are positive, we can write √D for the diagonal matrix whose entries are the square roots of those of D. Thus [Cᵀ] = [Dᵀ], which yields condition (c) of Theorem 1. Moreover, the matrix K = DCᵀ = −CCᵀ is symmetric and therefore diagonalisable. Hence, condition (b) is fulfilled and all eigenvalues are real. Let now λ ≠ 0 be an eigenvalue of K and let x be a corresponding eigenvector. Then 0 > −(Cᵀx)ᵀ(Cᵀx) = xᵀKx = λxᵀx, which yields λ < 0. Thus, all requirements are fulfilled and for an appropriate choice of α the residual gradient algorithm converges independently of the concrete form of the function approximation scheme.

The consistency of the residual gradient algorithm can be shown formally, but due to space limitations we only give the following informal proof. The algorithm minimises the Bellman error, which is a quadratic objective function. Hence, there are no local optima, and if the global optimum is not unique, the values of all global optima are identical. Due to its gradient descent property the residual gradient algorithm converges to such a global optimum independently of the initial value. In the case of a tabular representation a global minimum has Bellman error zero and corresponds to an optimal solution. Thus, the residual gradient algorithm is consistent.

A detailed description of how grid-based linear interpolation works in combination with RL can be found in [7]. Important for us is that in a d-dimensional grid each feature vector φ(x) satisfies 0 ≤ φ_i(x) ≤ 1 and Σ_{i=1}^F φ_i(x) = 1. With ⟨·,·⟩ denoting the standard scalar product and ‖·‖₂ denoting the corresponding Euclidean norm, we have |K_{i,j}| = |⟨C_i, C_j⟩| ≤ max_l{‖C_l‖₂²} = max_l Σ_{j=1}^F C²_{l,j}, where C_l denotes the l-th row of C. According to the definition, C_{l,j} = (√D)_{l,l} Σ_{k=1}^m X_{k,l}(γφ_j(z_k) − φ_j(x_k)) holds. Moreover, from D = XᵀX it follows that D_{l,l} = Σ_{k=1}^m X²_{k,l} = Σ_{k=1}^m X_{k,l}, because X_{k,l} is either zero or one.
And besides that we have [...] D_{l,l} = 1. Altogether we obtain

    |K_{i,j}| ≤ [...]
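The eigenvalue correspondence of Lemma 1, which the appendix uses throughout, is easy to check numerically. The following random sketch (hypothetical sizes m = 5, F = 3) confirms that K = DCᵀ carries the spectrum of A = CᵀD plus m − F extra zero eigenvalues:

```python
import numpy as np

rng = np.random.default_rng(0)
m, F = 5, 3
D = rng.standard_normal((m, F))     # real m x F matrix
C = rng.standard_normal((m, F))     # C^T is the real F x m matrix

K = D @ C.T                         # m x m
A = C.T @ D                         # F x F

eig_K = np.linalg.eigvals(K)
eig_A = np.linalg.eigvals(A)

# The m - F smallest eigenvalues of K (by magnitude) are numerically zero,
# and the remaining magnitudes match the spectrum of A.
print(np.sort(np.abs(eig_K))[:m - F])            # near machine precision
print(np.allclose(np.sort(np.abs(eig_K))[m - F:],
                  np.sort(np.abs(eig_A))))       # True
```

The check compares eigenvalue magnitudes rather than raw complex values so that complex-conjugate pairs are matched robustly.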