{"title": "Error Propagation for Approximate Policy and Value Iteration", "book": "Advances in Neural Information Processing Systems", "page_first": 568, "page_last": 576, "abstract": "We address the question of how the approximation error/Bellman residual at each iteration of the Approximate Policy/Value Iteration algorithms influences the quality of the resulted policy. We quantify the performance loss as the Lp norm of the approximation error/Bellman residual at each iteration. Moreover, we show that the performance loss depends on the expectation of the squared Radon-Nikodym derivative of a certain distribution rather than its supremum -- as opposed to what has been suggested by the previous results. Also our results indicate that the contribution of the approximation/Bellman error to the performance loss is more prominent in the later iterations of API/AVI, and the effect of an error term in the earlier iterations decays exponentially fast.", "full_text": "Error Propagation for Approximate Policy and\n\nValue Iteration\n\nAmir massoud Farahmand\n\nDepartment of Computing Science\n\nUniversity of Alberta\n\nEdmonton, Canada, T6G 2E8\n\namirf@ualberta.ca\n\nSequel Project, INRIA Lille\n\nR\u00b4emi Munos\n\nLille, France\n\nremi.munos@inria.fr\n\nCsaba Szepesv\u00b4ari \u2217\n\nDepartment of Computing Science\n\nUniversity of Alberta\n\nEdmonton, Canada, T6G 2E8\nszepesva@ualberta.ca\n\nAbstract\n\nWe address the question of how the approximation error/Bellman residual at each\niteration of the Approximate Policy/Value Iteration algorithms in\ufb02uences the qual-\nity of the resulted policy. We quantify the performance loss as the Lp norm of the\napproximation error/Bellman residual at each iteration. Moreover, we show that\nthe performance loss depends on the expectation of the squared Radon-Nikodym\nderivative of a certain distribution rather than its supremum \u2013 as opposed to what\nhas been suggested by the previous results. 
Also, our results indicate that the contribution of the approximation/Bellman error to the performance loss is more prominent in the later iterations of API/AVI, and that the effect of an error term in the earlier iterations decays exponentially fast.\n\n1 Introduction\n\nThe exact solution of reinforcement learning (RL) and planning problems with a large state space is difficult or impossible to obtain, so one usually has to aim for approximate solutions. Approximate Policy Iteration (API) and Approximate Value Iteration (AVI) are two classes of iterative algorithms for solving RL/planning problems with large state spaces. They try to approximately find the fixed-point solution of the Bellman optimality operator.\nAVI starts from an initial value function V0 (or Q0), and iteratively applies an approximation of T∗, the Bellman optimality operator (or T^π for the policy evaluation problem), to the previous estimate, i.e., V_{k+1} ≈ T∗Vk. In general, V_{k+1} is not equal to T∗Vk because (1) we do not have direct access to the Bellman operator, but only some samples from it, and (2) the function space to which V belongs is not representative enough. Thus there is an approximation error εk = T∗Vk − V_{k+1} between the result of exact VI and that of AVI.\nSome examples of AVI-based approaches are the tree-based Fitted Q-Iteration of Ernst et al. [1], the multi-layer perceptron-based Fitted Q-Iteration of Riedmiller [2], and the regularized Fitted Q-Iteration of Farahmand et al. [3]. See the work of Munos and Szepesvári [4] for more information about AVI.\n\n∗Csaba Szepesvári is on leave from MTA SZTAKI. We would like to acknowledge the insightful comments by the reviewers. This work was partly supported by AICML, AITF, NSERC, and PASCAL2 under no. 216886.\n\nAPI is another iterative algorithm for finding an approximate solution to the fixed point of the Bellman optimality operator. 
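The AVI recursion V_{k+1} ≈ T∗Vk described above can be made concrete on a small finite MDP. The following Python sketch is illustrative only (the function names, the two-state MDP, and the Gaussian-noise model for the approximation error εk are assumptions, not from the paper):

```python
import numpy as np

def bellman_optimality(V, P, R, gamma):
    """Exact Bellman optimality operator: (T*V)(s) = max_a [R[s,a] + gamma * sum_s' P[a][s,s'] V(s')]."""
    Q = np.stack([R[:, a] + gamma * P[a] @ V for a in range(R.shape[1])], axis=1)
    return Q.max(axis=1)

def approximate_value_iteration(P, R, gamma, K, noise=0.0, seed=0):
    """Iterate V_{k+1} = T* V_k + (error); Gaussian noise stands in for
    the approximation error eps_k = T* V_k - V_{k+1} of a real function
    approximator. noise=0 recovers exact value iteration."""
    rng = np.random.default_rng(seed)
    V = np.zeros(R.shape[0])
    for _ in range(K):
        V = bellman_optimality(V, P, R, gamma)
        V = V + noise * rng.standard_normal(V.shape)  # inject eps_k
    return V

# Toy two-state, two-action MDP: action 0 stays put, action 1 switches
# state; only the transition from state 0 to state 1 is rewarded.
P = [np.array([[1.0, 0.0], [0.0, 1.0]]),   # P[a][s, s']
     np.array([[0.0, 1.0], [1.0, 0.0]])]
R = np.array([[0.0, 1.0],
              [0.0, 0.0]])                 # R[s, a]
V = approximate_value_iteration(P, R, gamma=0.9, K=500)
```

With noise=0 the iterates contract toward the fixed point of T∗ at rate γ; with noise>0 they hover around it, which is exactly the situation whose effect on the final policy the paper quantifies.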
It starts from a policy π0, and then approximately evaluates that policy, i.e., it finds a Q0 that satisfies T^{π0} Q0 ≈ Q0. Afterwards, it performs a policy improvement step, which is to calculate the greedy policy with respect to (w.r.t.) the most recent action-value function, to get a new policy π1, i.e., π1(·) = arg max_{a∈A} Q0(·, a). The policy iteration algorithm continues by approximately evaluating the newly obtained policy π1 to get Q1 and repeating the whole process again, generating a sequence of policies and their corresponding approximate action-value functions Q0 → π1 → Q1 → π2 → · · · . As in AVI, we may encounter a difference between the approximate solution Qk (T^{πk} Qk ≈ Qk) and the true value of the policy Q^{πk}, which is the solution of the fixed-point equation T^{πk} Q^{πk} = Q^{πk}. Two convenient ways to describe this error are either the Bellman residual of Qk (εk = Qk − T^{πk} Qk) or the policy evaluation approximation error (εk = Qk − Q^{πk}).\nAPI is a popular approach in the RL literature. One well-known algorithm is the LSPI of Lagoudakis and Parr [5], which combines the Least-Squares Temporal Difference (LSTD) algorithm (Bradtke and Barto [6]) with a policy improvement step. Another API method is to use Bellman Residual Minimization (BRM) and its variants for policy evaluation and to iteratively apply the policy improvement step (Antos et al. [7], Maillard et al. [8]). Both LSPI and BRM have many extensions: Farahmand et al. [9] introduced a nonparametric extension of LSPI and BRM, formulated them as optimization problems in a reproducing kernel Hilbert space, and analyzed their statistical behavior. Kolter and Ng [10] formulated an l1-regularized extension of LSTD. See Xu et al. 
[11] and Jung and Polani [12] for other examples of kernel-based extensions of LSTD/LSPI, and Taylor and Parr [13] for a unified framework. Also see the proto-value function-based approach of Mahadevan and Maggioni [14] and the iLSTD of Geramifard et al. [15].\nA crucial question for the applicability of API/AVI, which is the main topic of this work, is to understand how either the approximation error or the Bellman residual at each iteration of API or AVI affects the quality of the resulting policy. Suppose we run API/AVI for K iterations to obtain a policy πK. Does the knowledge that all εk are small (maybe because we have had a lot of samples and used powerful function approximators) imply that V^{πK} is close to the optimal value function V∗ too? If so, how do the errors that occur at a certain iteration k propagate through the iterations of API/AVI and affect the final performance loss?\nThere have already been some results that partially address this question. As an example, Proposition 6.2 of Bertsekas and Tsitsiklis [16] shows that for API applied to a finite MDP, we have lim sup_{k→∞} ‖V∗ − V^{πk}‖∞ ≤ [2γ/(1−γ)^2] lim sup_{k→∞} ‖V^{πk} − Vk‖∞, where γ is the discount factor. Similarly for AVI, if the approximation errors are uniformly bounded (‖T∗Vk − V_{k+1}‖∞ ≤ ε), we have lim sup_{k→∞} ‖V∗ − V^{πk}‖∞ ≤ [2γ/(1−γ)^2] ε (Munos [17]).\nNevertheless, most of these results are pessimistic in several ways. One reason is that they are expressed in terms of the supremum norm of the approximation errors ‖V^{πk} − Vk‖∞ or the Bellman error ‖Qk − T^{πk} Qk‖∞. Compared to Lp norms, the supremum norm is conservative. 
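The gap between the two norms is easy to see numerically. The following toy computation (illustrative only; the discretization and spike error function are assumptions, not from the paper) builds an error function that is zero except on a single state, so its L∞ norm is large while its L2 norm under a uniform distribution is small:

```python
import numpy as np

# An "error function" on 10,000 discretized states that is zero
# everywhere except for one tall spike.
n = 10_000
err = np.zeros(n)
err[0] = 10.0                                     # large error on a single state

sup_norm = np.abs(err).max()                      # L_infty norm: 10.0
p = 2
lp_norm = np.mean(np.abs(err) ** p) ** (1.0 / p)  # L_p norm w.r.t. the uniform measure

print(sup_norm, lp_norm)                          # prints 10.0 0.1
```

A supremum-norm bound treats this function as a large error, even though under any measure that does not concentrate on the spike the error is tiny, which is why Lp-norm guarantees of the kind discussed next are more informative for learned value functions.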
It is quite possible that the result of a learning algorithm has a small Lp norm but a very large L∞ norm. Therefore, it is desirable to have a result expressed in terms of the Lp norm of the approximation/Bellman residual εk.\nIn the past couple of years, there have been attempts to extend L∞-norm results to Lp ones [18, 17, 7]. As a typical example, we quote the following from Antos et al. [7]:\nProposition 1 (Error Propagation for API – [7]). Let p ≥ 1 be a real number and K be a positive integer. Then, for any sequence of functions {Q^{(k)}} ⊂ B(X × A; Qmax), 0 ≤ k < K, the space of Qmax-bounded measurable functions, and their corresponding Bellman residuals εk = Qk − T^{πk} Qk, the following inequalities hold:\n\n‖Q∗ − Q^{πK}‖_{p,ρ} ≤ [2γ/(1−γ)^2] ( C^{1/p}_{ρ,ν} max_{0≤k<K} ‖εk‖_{p,ν} + γ^{K/p} Rmax ),\n\nwhere Rmax is an upper bound on the magnitude of the expected reward function and