Michael Kearns, Yishay Mansour, Andrew Ng
We consider the problem of reliably choosing a near-best strategy from a restricted class of strategies TI in a partially observable Markov deci(cid:173) sion process (POMDP). We assume we are given the ability to simulate the POMDP, and study what might be called the sample complexity - that is, the amount of data one must generate in the POMDP in order to choose a good strategy. We prove upper bounds on the sample com(cid:173) plexity showing that, even for infinitely large and arbitrarily complex POMDPs, the amount of data needed can be finite, and depends only linearly on the complexity of the restricted strategy class TI, and expo(cid:173) nentially on the horizon time. This latter dependence can be eased in a variety of ways, including the application of gradient and local search algorithms. Our measure of complexity generalizes the classical super(cid:173) vised learning notion of VC dimension to the settings of reinforcement learning and planning.