Part of Advances in Neural Information Processing Systems 24 (NIPS 2011)

*Amir-massoud Farahmand*

<p>Many practitioners of reinforcement learning problems have observed that oftentimes the performance of the agent reaches very close to the optimal performance even though the estimated (action-)value function is still far from the optimal one. The goal of this paper is to explain and formalize this phenomenon by introducing the concept of the action-gap regularity. As a typical result, we prove that for an agent following the greedy policy (\hat{\pi}) with respect to an action-value function (\hat{Q}), the performance loss (E[V^<em>(X) - V^{\hat{X}} (X)]) is upper bounded by (O(|| \hat{Q} - Q^</em>||_\infty^{1+\zeta})), in which (\zeta >= 0) is the parameter quantifying the action-gap regularity. For (\zeta > 0), our results indicate smaller performance loss compared to what previous analyses had suggested. Finally, we show how this regularity affects the performance of the family of approximate value iteration algorithms.</p>

Do not remove: This comment is monitored to verify that the site is working properly