{"title": "PAC-Bayesian Model Selection for Reinforcement Learning", "book": "Advances in Neural Information Processing Systems", "page_first": 1624, "page_last": 1632, "abstract": "This paper introduces the first set of PAC-Bayesian bounds for the batch reinforcement learning problem in finite state spaces. These bounds hold regardless of the correctness of the prior distribution. We demonstrate how such bounds can be used for model-selection in control problems where prior information is available either on the dynamics of the environment, or on the value of actions. Our empirical results confirm that PAC-Bayesian model-selection is able to leverage prior distributions when they are informative and, unlike standard Bayesian RL approaches, ignores them when they are misleading.", "full_text": "PAC-Bayesian Model Selection for Reinforcement Learning

Mahdi Milani Fard
School of Computer Science, McGill University, Montreal, Canada
mmilan1@cs.mcgill.ca

Joelle Pineau
School of Computer Science, McGill University, Montreal, Canada
jpineau@cs.mcgill.ca

Abstract

This paper introduces the first set of PAC-Bayesian bounds for the batch reinforcement learning problem in finite state spaces. These bounds hold regardless of the correctness of the prior distribution. We demonstrate how such bounds can be used for model-selection in control problems where prior information is available either on the dynamics of the environment, or on the value of actions. Our empirical results confirm that PAC-Bayesian model-selection is able to leverage prior distributions when they are informative and, unlike standard Bayesian RL approaches, ignores them when they are misleading.

1 Introduction

Bayesian methods in machine learning, although elegant and concrete, have often been criticized not only for their computational cost, but also for their strong assumptions on the correctness of the prior distribution.
There are usually no theoretical guarantees when performing Bayesian inference with priors that do not admit the correct posterior. Probably Approximately Correct (PAC) learning techniques, on the other hand, provide distribution-free convergence guarantees with polynomially-bounded sample sizes [1]. These bounds, however, are notoriously loose and impractical. One can argue that such loose bounds are to be expected, as they reflect the inherent difficulty of the problem when no assumptions are made on the distribution of the data.

Both PAC and Bayesian methods have been proposed for reinforcement learning (RL) [2, 3, 4, 5, 6, 7, 8], where an agent is learning to interact with an environment to maximize some objective function. Many of these methods aim to solve the so-called exploration–exploitation problem by balancing the amount of time spent on gathering information about the dynamics of the environment and the time spent acting optimally according to the currently built model. PAC methods are much more conservative than Bayesian methods, as they tend to spend more time exploring the system and collecting information [9]. Bayesian methods, on the other hand, are greedier and only solve the problem over a limited planning horizon. As a result of this greediness, Bayesian methods can converge to suboptimal solutions. It has been shown that Bayesian RL is not PAC [9]. We argue here that a more adaptive method can be PAC and at the same time more data efficient if an informative prior is taken into account. Such adaptive techniques have been studied within the PAC-Bayesian literature for supervised learning.

The PAC-Bayesian approach, first introduced by McAllester [10] (extending the work of Shawe-Taylor et al. [11]), combines the distribution-free correctness of PAC theorems with the data-efficiency of Bayesian inference.
This is achieved by removing the assumption of the correctness of the prior and, instead, measuring the consistency of the prior over the training data. The empirical results of model-selection algorithms for classification tasks using these bounds are comparable to some of the most popular learning algorithms, such as AdaBoost and Support Vector Machines [12]. PAC-Bayesian bounds have also been linked to margins in classification tasks [13].

This paper introduces the first results of the application of PAC-Bayesian techniques to the batch RL problem. We derive two PAC-Bayesian bounds on the approximation error in the value function of stochastic policies for reinforcement learning on observable and discrete state spaces. One is a bound on model-based RL, where a prior distribution is given on the space of possible models. The second one is for the case of model-free RL, where a prior is given on the space of value functions. In both cases, the bound depends both on an empirical estimate and on a measure of distance between the stochastic policy and the one imposed by the prior distribution. We present empirical results where model-selection is performed based on these bounds, and show that PAC-Bayesian bounds follow Bayesian policies when the prior is informative and mimic the PAC policies when the prior is not consistent with the data.
This allows us to adaptively balance between the distribution-free correctness of PAC and the data-efficiency of Bayesian inference.

2 Background and Notation

In this section, we introduce the notation and definitions used in the paper.

A Markov Decision Process (MDP) $M = (S, A, T, R)$ is defined by a set of states $S$, a set of actions $A$, a transition function $T(s, a, s')$ defined as:

$$T(s, a, s') = p(s_{t+1} = s' \mid s_t = s, a_t = a), \quad \forall s, s' \in S, \; a \in A, \tag{1}$$

and a (possibly stochastic) reward function $R(s, a) : S \times A \to [R_{\min}, R_{\max}]$. Throughout the paper we assume finite-state, finite-action, discounted-reward MDPs, with the discount factor denoted by $\gamma$. A reinforcement learning agent chooses an action and receives a reward. The environment will then change to a new state according to the transition probabilities.

A policy is a (possibly stochastic) function from states to actions. The value of a state–action pair $(s, a)$ for policy $\pi$, denoted by $Q^\pi(s, a)$, is the expected discounted sum of rewards ($\sum_t \gamma^t r_t$) if the agent acts according to that policy after taking action $a$ in the first step. The value function satisfies the Bellman equation [14]:

$$Q^\pi(s, a) = R(s, a) + \gamma \sum_{s' \in S} T(s, a, s') \, Q^\pi(s', \pi(s')). \tag{2}$$

The optimal policy is the policy that maximizes the value function.
The optimal value of a state–action pair, denoted by $Q^*(s, a)$, satisfies the Bellman optimality equation [14]:

$$Q^*(s, a) = R(s, a) + \gamma \sum_{s' \in S} T(s, a, s') \max_{a' \in A} Q^*(s', a'). \tag{3}$$

There are many methods developed to find the optimal policy for a given MDP when the transition and reward functions are known. Value iteration [14] is a simple dynamic programming method in which one iteratively applies the Bellman optimality operator, denoted by $B$, to an initial guess of the optimal value function:

$$BQ(s, a) = R(s, a) + \gamma \sum_{s' \in S} T(s, a, s') \max_{a' \in A} Q(s', a'). \tag{4}$$

For simplicity we write $BQ$ when $B$ is applied to the value of all state–action pairs. Since $B$ is a contraction with respect to the infinity norm [15] (i.e. $\|BQ - BQ'\|_\infty \le \gamma \|Q - Q'\|_\infty$), the value iteration algorithm will converge to the fixed point of the Bellman optimality operator, which is the optimal value function ($BQ^* = Q^*$).

3 Model-Based PAC-Bayesian Bound

In model-based RL, one aims to estimate the transition and reward functions and then act optimally according to the estimated models. PAC methods use the empirical average for their estimated model along with frequentist bounds.
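As a concrete illustration of the value iteration scheme of Eqn (4), here is a minimal sketch on a toy two-state MDP of our own (not a domain from the paper); the tolerance and array shapes are our choices:

```python
import numpy as np

def value_iteration(T, R, gamma, tol=1e-8):
    """Iterate the Bellman optimality operator B until convergence.

    T: transition tensor, shape (|S|, |A|, |S|), each row summing to 1.
    R: reward matrix, shape (|S|, |A|).
    Returns an approximation of the optimal value function Q*.
    """
    n_states, n_actions, _ = T.shape
    Q = np.zeros((n_states, n_actions))
    while True:
        # BQ(s,a) = R(s,a) + gamma * sum_s' T(s,a,s') * max_a' Q(s',a')
        BQ = R + gamma * T @ Q.max(axis=1)
        if np.abs(BQ - Q).max() < tol:  # contraction guarantees convergence
            return BQ
        Q = BQ

# Toy 2-state, 2-action MDP: state 1 pays reward 1, state 0 pays nothing.
T = np.array([[[0.9, 0.1], [0.8, 0.2]],
              [[0.8, 0.2], [0.2, 0.8]]])
R = np.array([[0.0, 0.0], [1.0, 1.0]])
Q_star = value_iteration(T, R, gamma=0.9)
```

Since $B$ is a $\gamma$-contraction, the loop terminates for any tolerance and any initial guess.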
Bayesian methods use the Bayesian posterior to estimate the model. This section provides a bound that suggests an adaptive method to choose a stochastic estimate between these two extremes, which is both data-efficient and has guaranteed performance.

Assuming that the reward model is known (we make this assumption throughout this section), one can build empirical models of the transition dynamics by gathering sample transitions, denoted by $U$, and taking the empirical average. Let this empirical average model be $\hat T(s, a, s') = n_{s,a,s'}/n_{s,a}$, where $n_{s,a,s'}$ and $n_{s,a}$ are the number of corresponding transitions and samples. Trivially, $\mathbb{E}\hat T = T$. The empirical value function, denoted by $\hat Q$, is defined to be the value function on an MDP with the empirical transition model. As one observes more and more sample trajectories on the MDP, the empirical model gets increasingly more accurate, and so will the empirical value function. Different forms of the following lemma, connecting the error rates on $\hat T$ and $\hat Q$, are used in many of the PAC-MDP results [4]:

Lemma 1. There is a constant $k \ge (1-\gamma)^2/\gamma$ such that:

$$\forall s, a: \|\hat T(s, a, \cdot) - T(s, a, \cdot)\|_1 \le k\epsilon \quad\Rightarrow\quad \forall \pi: \|\hat Q^\pi - Q^\pi\|_\infty \le \epsilon. \tag{5}$$

As a consequence of the above lemma, one can act near-optimally in the part of the MDP for which we have gathered enough samples to have a good empirical estimate of the transition model. PAC-MDP methods explicitly [2] or implicitly [3] use that fact to exploit the knowledge on the model as long as they are in the "known" part of the state space.
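The empirical average model $\hat T(s,a,s') = n_{s,a,s'}/n_{s,a}$ is simple to compute from a batch of transitions; a minimal sketch (the sample tuples are hypothetical):

```python
from collections import Counter

def empirical_model(transitions, n_states, n_actions):
    """Maximum-likelihood transition estimate T_hat(s,a,s') = n_{s,a,s'} / n_{s,a}.

    transitions: iterable of (s, a, s', r) tuples; unvisited pairs stay at 0.
    """
    counts = Counter((s, a, s2) for (s, a, s2, r) in transitions)
    n_sa = Counter((s, a) for (s, a, s2, r) in transitions)
    T_hat = [[[0.0] * n_states for _ in range(n_actions)] for _ in range(n_states)]
    for (s, a, s2), n in counts.items():
        T_hat[s][a][s2] = n / n_sa[(s, a)]
    return T_hat

# Hypothetical sample set U of (s, a, s', r) tuples.
U = [(0, 0, 1, 0.0), (0, 0, 1, 0.0), (0, 0, 0, 0.0), (1, 0, 1, 1.0)]
T_hat = empirical_model(U, n_states=2, n_actions=1)
```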
The downside of these methods is that without further assumptions on the model, it will take a large number of sample transitions to get a good empirical estimate of the transition model.

The Bayesian approach to modeling the transition dynamics, on the other hand, starts with a prior distribution over the transition probabilities and then marginalizes this prior over the data to get a posterior distribution. This is usually done by assuming independent Dirichlet distributions over the transition probabilities, with some initial count vector $\alpha$, and then adding up the observed counts to this initial vector to get the conjugate posterior [6]. The initial $\alpha$-vector encodes the prior knowledge on the transition probabilities, and larger initial values further bias the empirical observation towards the initial belief.

If a strong prior is close to the true values, the Bayesian posterior will be more accurate than the empirical point estimate. However, a strong prior peaked on the wrong values will bias the Bayesian model away from the correct probabilities. Therefore, the Bayesian posterior might not provide the optimal estimate of the model parameters. A good posterior distribution might be somewhere between the empirical point estimate and the Bayesian posterior.

The following theorem is the first PAC-Bayesian bound on the estimation error in the value function when we build a stochastic policy¹ based on some arbitrary posterior distribution $M_q$.

Theorem 2. Let $\pi^*_{T'}$ be the optimal policy with respect to the MDP with transition model $T'$ and $\Delta_{T'} = \|\hat Q^{\pi^*_{T'}} - Q^{\pi^*_{T'}}\|_\infty$. For any prior distribution $M_p$ on the transition model, any posterior $M_q$, any i.i.d.
sampling distribution $\mathcal{U}$, with probability no less than $1 - \delta$ over the sampling of $U \sim \mathcal{U}$:

$$\forall M_q: \quad \mathbb{E}_{T' \sim M_q} \Delta_{T'} \le \sqrt{\frac{D(M_q \| M_p) - \ln\delta + |S|\ln 2 + \ln|S| + \ln n_{\min}}{(n_{\min}-1)k^2/2}}, \tag{6}$$

where $n_{\min} = \min_{s,a} n_{s,a}$ and $D(\cdot\|\cdot)$ is the Kullback–Leibler (KL) divergence.

The above theorem (proved in the Appendix) provides a lower bound on the expectation of the true value function when the policy is taken to be optimal according to the model sampled from the posterior:

$$\mathbb{E} Q^{\pi^*_{T'}} \ge \mathbb{E} \hat Q^{\pi^*_{T'}} - \tilde O\!\left(\sqrt{D(M_q \| M_p)/n_{\min}}\right). \tag{7}$$

This lower bound suggests a stochastic model-selection method in which one searches in the space of posteriors to maximize the bound. Notice that there are two elements to the above bound. One is the PAC part of the bound, which suggests the selection of models with high empirical value functions for their optimal policy. There is also a penalty term (or a regularization term) that penalizes distributions that are far from the prior (the Bayesian side of the bound).

¹This is a more general form of stochastic policy than is usually seen in the RL literature. A complete policy is sampled from an imposed distribution, correlating the selection of actions on different states.

Margin for Deterministic Policies

One could apply Theorem 2 with any choice of $M_q$. Generally, this will result in a bound on the value of a stochastic policy. However, if the optimal policy is the same for all of the possible samples from the posterior, then we will get a bound for that particular deterministic policy.

We define the support of policy $\pi$, denoted by $T_\pi$, to be the set of transition models for which the optimal policy is $\pi$.
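The right-hand side of the Theorem 2 bound is cheap to evaluate numerically once the KL divergence between posterior and prior is known; a minimal sketch (all input values below are hypothetical):

```python
import math

def pac_bayes_model_bound(kl, n_min, n_states, k, delta):
    """Right-hand side of the model-based bound (Eqn 6).

    kl: KL divergence D(M_q || M_p) between posterior and prior.
    n_min: smallest per-state-action sample count; k, delta as in the theorem.
    """
    numerator = (kl - math.log(delta) + n_states * math.log(2)
                 + math.log(n_states) + math.log(n_min))
    denominator = (n_min - 1) * k ** 2 / 2.0
    return math.sqrt(numerator / denominator)

# Hypothetical values: the bound tightens as n_min grows and loosens with KL.
b_small = pac_bayes_model_bound(kl=0.5, n_min=100, n_states=10, k=0.01, delta=0.05)
b_large = pac_bayes_model_bound(kl=0.5, n_min=10000, n_states=10, k=0.01, delta=0.05)
```

Maximizing the lower bound of Eqn (7) then amounts to trading the empirical value term against this penalty.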
Putting all the posterior probability on $T_\pi$ will result in a tighter bound for the value of the policy $\pi$. The tightest bound occurs when $M_q$ is a scaled version of $M_p$ summing to 1 over $T_\pi$, that is when we have:

$$M_q(T') = \begin{cases} \dfrac{M_p(T')}{M_p(T_\pi)} & T' \in T_\pi \\ 0 & T' \notin T_\pi \end{cases} \tag{8}$$

In that case, the KL divergence is $D(M_q \| M_p) = -\ln M_p(T_\pi)$, and the bound will be:

$$\mathbb{E} Q^{\pi^*_{T'}} \ge \mathbb{E} \hat Q^{\pi^*_{T'}} - \tilde O\!\left(\sqrt{-\ln M_p(T_\pi)/n_{\min}}\right). \tag{9}$$

Intuitively, we will get tighter bounds for policies that have larger empirical values and higher prior probabilities supporting them.

Finding $M_p(T_\pi)$ might not be computationally tractable. Therefore, we define a notion of margin for transition functions and policies and use it to get tractable bounds. The margin of a transition function $T'$, denoted by $\theta_{T'}$, is the maximum distance we can move away from $T'$ such that the optimal policy does not change:

$$\|T'' - T'\|_1 \le \theta_{T'} \Rightarrow \pi^*_{T''} = \pi^*_{T'}. \tag{10}$$

The margin defines a hypercube around $T'$ for which the optimal policy does not change. In cases where the support set of a policy is difficult to find, one can use this hypercube to get a reasonable bound for the true value function of the corresponding policy. In that case, we would define the posterior to be the scaled prior defined only on the margin hypercube. The idea behind this method is similar to that of the Luckiness framework [11] and large-margin classifiers [16, 13].
This shows that the idea of maximizing margins can be applied to control problems as well as classification and regression tasks.

To find the margin of any given $T'$, if we know the value of the second best policy, we can calculate its regret according to $T'$ (it will be the smallest regret $\eta_{\min}$). Using Lemma 1, we can conclude that if $\|T'' - T'\|_1 \le k\eta_{\min}/2$, then the value of the best and second best policies can change by at most $\eta_{\min}/2$, and thus the optimal policy will not change. Therefore, $\theta_{T'} \ge k\eta_{\min}/2$. One can then define the posterior on the transitions inside the margin to get a bound for the value function.

4 Model-Free PAC-Bayes Bound

In this section we introduce a PAC-Bayesian bound for model-free reinforcement learning on discrete state spaces. This time we assume that we are given a prior distribution on the space of value functions, rather than on transition models. This prior encodes an initial belief about the optimal value function for a given RL domain. This could be useful, for example, in the context of transfer learning, where one has learned a value function in one environment and then uses that as the prior belief on a similar domain.

We start by defining the TD error of a given value function $Q$ to be $\|Q - BQ\|_\infty$. In most cases, we do not have access to the Bellman optimality operator. When we only have access to a sample set $U$ collected on the RL domain, we can define the empirical Bellman optimality operator $\hat B$ to be:

$$\hat BQ(s, a) = \frac{1}{n_{s,a}} \sum_{(s,a,s',r) \in U} \left( r + \gamma \max_{a'} Q(s', a') \right). \tag{11}$$

Note that $\mathbb{E}[\hat BQ] = BQ$. We further make an assumption that all the $BQ$ values we could observe are bounded in the range $[c_{\min}, c_{\max}]$, with $c = c_{\max} - c_{\min}$.
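The empirical Bellman operator of Eqn (11) can be computed directly from a batch of samples; a minimal sketch (the dict-based representation and the sample set are our own choices):

```python
def empirical_bellman(Q, samples, gamma):
    """Empirical Bellman optimality operator B_hat applied to Q.

    Q: dict mapping (s, a) -> value; samples: list of (s, a, s', r) tuples.
    Averages the one-step backups r + gamma * max_a' Q(s', a') per (s, a);
    state-action pairs with no samples keep their current value.
    """
    actions = {a for (_, a) in Q}
    totals, counts = {}, {}
    for s, a, s2, r in samples:
        backup = r + gamma * max(Q[(s2, a2)] for a2 in actions)
        totals[(s, a)] = totals.get((s, a), 0.0) + backup
        counts[(s, a)] = counts.get((s, a), 0) + 1
    return {sa: totals[sa] / counts[sa] if sa in counts else Q[sa] for sa in Q}

# Hypothetical two-state, two-action value function and sample set.
Q = {(s, a): 0.0 for s in (0, 1) for a in (0, 1)}
samples = [(0, 0, 1, 1.0), (0, 0, 1, 0.0), (1, 1, 0, 1.0)]
BQ = empirical_bellman(Q, samples, gamma=0.9)
```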
Using this assumption, one can use Hoeffding's inequality to bound the difference between the empirical and true Bellman operators:

$$\Pr\{|\hat BQ(s, a) - BQ(s, a)| > \epsilon\} \le e^{-2 n_{s,a} \epsilon^2 / c^2}. \tag{12}$$

When the true Bellman operator is not known, one makes use of the empirical TD error, similarly defined to be $\|Q - \hat BQ\|_\infty$. Q-learning [14] and its variants with function approximation [17], and also batch methods such as LSTD [18], often aim to minimize the empirical (projected) TD error. We argue that it might be better to choose a function that is not a fixed point of the empirical Bellman operator. Instead, we aim to minimize the upper bound on the approximation error (which might be referred to as loss) of the Q function, as compared to the true optimal value.

The following theorem (proved in the Appendix) is the first PAC-Bayesian bound for model-free batch RL on discrete state spaces:

Theorem 3. Let $\Delta_Q = \|Q - Q^*\|_\infty - \frac{\|Q - \hat BQ\|_\infty}{1-\gamma}$. For all prior distributions $J_p$ and posteriors $J_q$ over the space of value functions, with probability no less than $1 - \delta$ over the sampling of $U \sim \mathcal{U}$:

$$\forall J_q: \quad \mathbb{E}_{Q \sim J_q} \Delta_Q \le \sqrt{\frac{D(J_q \| J_p) - \ln\delta + \ln|S| + \ln|A| + \ln n_{\min}}{2(n_{\min}-1)(1-\gamma)^2/c^2}}. \tag{13}$$

This time we have an upper bound on the expected approximation error:

$$\mathbb{E}\|Q - Q^*\|_\infty \le \frac{\mathbb{E}\|Q - \hat BQ\|_\infty}{1-\gamma} + \tilde O\!\left(\sqrt{D(J_q \| J_p)/n_{\min}}\right). \tag{14}$$

This suggests a model-selection method in which one would search for a posterior $J_q$ to minimize the above bound. The PAC side of the bound guides this model-selection method to look for posteriors with smaller empirical TD error.
The Bayesian part, on the other hand, penalizes the selection of posteriors that are far from the prior distribution.

One can use general forms of priors that would impose smoothness or sparsity for this model-selection technique. In that sense, this method would act as a regularization technique that penalizes complex and irregular functions. The idea of regularization in RL with function approximation is not new to this work [19]. This bound, however, is more general, as it could incorporate not only smoothness constraints, but also other forms of prior knowledge into the learning process.

5 Empirical Results

To illustrate the model-selection techniques based on the bounds in the paper, we consider one model-based RL domain and one model-free problem. The model-based domain is a chain model in which states are ordered by their index. The last state has a reward of 1 and all other states have reward 0. There are two types of actions. One is a stochastic "forward" operation which moves us to the next state in the chain with probability 0.5 and otherwise makes a random transition. The second type is a stochastic "reset" which moves the system to the first state in the chain with probability 0.5 and makes a random transition otherwise. In this domain, we have at each state two actions that do a stochastic reset and one action that is a stochastic forward. There are 10 states and $\gamma = 0.9$.

When there is only a small number of sample transitions for each state–action pair, there is a high chance that the frequentist estimate confuses a reset action with a forward one. Therefore, we expect a good model-based prior to be useful in this case. We use independent Dirichlets as our prior. We experiment with priors for which the Dirichlet $\alpha$-vector sums up to 10. We define our good prior to have $\alpha$-vectors proportional to the true transition probabilities.
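For concreteness, the chain domain described above might be constructed as follows (a sketch; the state/action indexing, the handling of "forward" in the last state, and the reward encoding are our own assumptions, not spelled out in the paper):

```python
import numpy as np

def chain_mdp(n_states=10, p=0.5):
    """Chain domain: action 0 is a stochastic 'forward' (next state w.p. p,
    uniformly random transition otherwise); actions 1 and 2 are stochastic
    'resets' (first state w.p. p, random transition otherwise).
    Only the last state pays reward 1."""
    T = np.zeros((n_states, 3, n_states))
    for s in range(n_states):
        nxt = min(s + 1, n_states - 1)  # assumption: forward saturates at the end
        T[s, 0, :] = (1 - p) / n_states
        T[s, 0, nxt] += p
        for a in (1, 2):
            T[s, a, :] = (1 - p) / n_states
            T[s, a, 0] += p
    R = np.zeros((n_states, 3))
    R[n_states - 1, :] = 1.0  # reward 1 in the last state
    return T, R

T, R = chain_mdp()
```

With few samples per pair, the empirical rows of such a model easily confuse a forward with a reset, which is what makes an informative Dirichlet prior valuable here.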
A misleading prior is one for which the vector is proportional to a transition model in which the actions are switched between forward and reset. A weighted sum between the good and bad priors creates a range of priors that gradually change from being informative to misleading.

We compare the expected regret of three different methods. The empirical method uses the optimal policy with respect to the empirical models. The Bayesian method samples a transition model from the Bayesian Dirichlet posteriors (where the observed counts are added to the prior $\alpha$-vectors) and then uses the optimal policy with respect to the sampled model. The PAC-Bayesian method uses $\text{counts} + \lambda\alpha_{\text{prior}}$ as the $\alpha$-vector of the posterior and finds the value of $\lambda \in [0, 1]$, using a linear search over a grid of spacing 0.1, that maximizes the lower bound of Theorem 2 (with a more optimistic value for $k$ and $\delta = 0.05$). It then samples from that distribution and uses the optimal policy with respect to the sampled model. The running time for a single run is a few seconds.

Figure 1 (left) shows the comparison between the maximum regret in these methods for different sample sizes when the prior is informative. This is averaged over 50 runs for the Bayesian and PAC-Bayesian methods and 10000 runs for the empirical method. The number of sampled transitions is the same for all state–action pairs. As expected, the Bayesian method outperforms the empirical one for small sample sizes. We can see that the PAC-Bayesian method closely follows the Bayesian one in this case. With a misleading prior, however, as we can see in Figure 1 (center), the empirical method outperforms the Bayesian one. This time, the regret rate of the PAC-Bayesian method follows that of the empirical method.
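The linear search over $\lambda$ used by the PAC-Bayesian method can be sketched as follows. `bound_fn` stands in for the Theorem 2 bound computation, which is not reproduced here; the toy stand-in bound below is purely illustrative:

```python
def select_lambda(counts, alpha_prior, bound_fn, steps=10):
    """Grid search over lambda in {0, 0.1, ..., 1} for the mixing weight
    maximizing a PAC-Bayesian lower bound.

    bound_fn is assumed to map a Dirichlet alpha-vector to the bound's value
    (hypothetical interface)."""
    best_lam, best_val = 0.0, float("-inf")
    for i in range(steps + 1):
        lam = i / steps
        alpha_post = [c + lam * a for c, a in zip(counts, alpha_prior)]
        val = bound_fn(alpha_post)
        if val > best_val:
            best_lam, best_val = lam, val
    return best_lam

# Toy stand-in bound: prefers posteriors whose mean is close to a
# hypothetical "true" next-state distribution.
truth = [0.5, 0.3, 0.2]
counts = [4.0, 4.0, 2.0]
alpha_prior = [5.0, 3.0, 2.0]
def toy_bound(alpha):
    total = sum(alpha)
    return -sum(abs(x / total - t) for x, t in zip(alpha, truth))

lam_star = select_lambda(counts, alpha_prior, toy_bound)
```

With this toy bound the prior happens to help, so the search drives $\lambda$ toward 1; with a prior that disagrees with the data it would drive $\lambda$ toward 0.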
Figure 1 (right) shows how the PAC-Bayesian method switches between following the empirical estimate and the Bayesian posterior as the prior gradually changes from being misleading to informative (four sample transitions per state–action pair). This shows that the bound of Theorem 2 is helpful as a model-selection technique.

Figure 1: Average regrets of different methods. Error bars are 1 standard deviation of the mean.

The next experiment is to test the model-free bound of Theorem 3. The domain is a "puddle world". An agent moves around in a grid world of size 5×9 containing puddles with reward −1, an absorbing goal state with reward +1, and reward 0 for the remaining states. There are stochastic actions along each of the four cardinal directions that move in the correct direction with probability 0.7 and move in a random direction otherwise. If the agent moves towards the boundary, then it stays in its current position.

Figure 2: Maps of the puddle world RL domain. Shaded boxes are puddles.

We first learn the true value function of a known prior map of the world (Figure 2, left). We then use that value function as the prior for our model-selection technique on two other environments. One of them is a similar environment where the shape of the puddle is slightly changed (Figure 2, center). We expect the prior to be informative and useful in this case. The other environment is, however, largely different from the first map (Figure 2, right).
We thus expect the prior to be misleading.

We start with independent Gaussians (one for each state–action pair) as the prior, with the map's Q-values for the mean $\mu_0$, and $\sigma_0^2 = 0.01$ for the variance. The posterior is chosen to be the product of Gaussians with mean $\left(\frac{\lambda\mu_0}{\sigma_0^2} + \frac{n\hat Q(\cdot,\cdot)}{\hat\sigma^2}\right) \big/ \left(\frac{\lambda}{\sigma_0^2} + \frac{n}{\hat\sigma^2}\right)$ and variance $\left(\frac{\lambda}{\sigma_0^2} + \frac{n}{\hat\sigma^2}\right)^{-1}$, where $\hat\sigma^2$ is the empirical variance. We sample from this posterior and act according to its greedy policy. For $\lambda = 1$, this is the Bayesian posterior for the mean of a Gaussian with known variance. For $\lambda = 0$, the prior is completely ignored. We will, however, find the $\lambda \in [0, 1]$ that minimizes the PAC-Bayesian bound of Theorem 3 (with an optimistic choice of $c$ and $\delta = 0.05$) and compare it with the performance of the empirical policy and a semi-Bayesian policy that acts according to a sampled value from the Bayesian posterior.

Table 1 shows the average over 100 runs of the maximum regret for these methods and the average of the selected $\lambda$, with an equal sample size of 20 per state–action pair.

Table 1: Performance of different model-selection methods.

                 Empirical Regret   Bayesian Regret   PAC-Bayesian Regret   Average λ
  Similar Map    0.21 ± 0.03        0.10 ± 0.01       0.12 ± 0.01           0.58 ± 0.01
  Different Map  0.19 ± 0.03        1.16 ± 0.09       0.22 ± 0.04           0.03 ± 0.03

Again, it can be seen that the PAC-Bayesian method makes use of the prior (with higher values of $\lambda$) when the prior is informative, and otherwise follows the empirical estimate (smaller values of $\lambda$).
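The precision-weighted posterior mean and variance above reduce to a few lines; a minimal sketch (the numeric inputs are hypothetical):

```python
def gaussian_posterior(mu0, sigma0_sq, q_hat, sigma_hat_sq, n, lam):
    """Precision-weighted combination of the prior mean and the empirical
    Q-value. lam in [0, 1] scales the prior's precision: lam=1 recovers the
    Bayesian posterior for the mean of a Gaussian with known variance,
    lam=0 ignores the prior entirely."""
    precision = lam / sigma0_sq + n / sigma_hat_sq
    mean = (lam * mu0 / sigma0_sq + n * q_hat / sigma_hat_sq) / precision
    var = 1.0 / precision
    return mean, var

# Hypothetical numbers: prior mean 0.5, empirical mean 0.2 from n=20 samples.
m1, v1 = gaussian_posterior(0.5, 0.01, 0.2, 0.04, 20, lam=1.0)
m0, v0 = gaussian_posterior(0.5, 0.01, 0.2, 0.04, 20, lam=0.0)
```

At $\lambda = 1$ the mean is pulled toward the prior and the posterior variance shrinks; at $\lambda = 0$ the result is exactly the empirical estimate.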
It adaptively balances the usage of the prior based on its consistency over the observed data.

6 Discussion

This paper introduces the first set of PAC-Bayesian bounds for the batch RL problem in finite state spaces. We demonstrate how such bounds can be used for both model-based and model-free RL methods. Our empirical results show that PAC-Bayesian model-selection uses prior distributions when they are informative and useful, and ignores them when they are misleading.

For the model-based bound, we expect the running time of searching in the space of parametrized posteriors to increase rapidly with the size of the state space. A more scalable version would sample models around the posteriors, solve each model, and then use importance sampling to estimate the value of the bound for each possible posterior. This problem does not exist with the model-free approach, as we do not need to solve the MDP for each sampled model.

A natural extension to this work would be on domains with continuous state spaces, where one would use different forms of function approximation for the value function. There is also the possibility of future work in applications of PAC-Bayesian theorems in online reinforcement learning, where one targets the exploration–exploitation problem. Online PAC RL with Bayesian priors has recently been addressed with the BOSS algorithm [20].
PAC-Bayesian bounds could help derive similar model-free algorithms with theoretical guarantees.

Acknowledgements: Funding for this work was provided by the National Institutes of Health (grant R21 DA019800) and the NSERC Discovery Grant program.

Appendix

The following lemma, due to McAllester [21], forms the basis of the proofs for both bounds:

Lemma 4. For $\beta > 0$, $K > 0$, and $\mathcal{Q}, \mathcal{P}, \Delta \in \mathbb{R}^n$ satisfying $\mathcal{P}_i, \mathcal{Q}_i, \Delta_i \ge 0$ and $\sum_{i=1}^n \mathcal{Q}_i = 1$:

$$\sum_{i=1}^n \mathcal{P}_i \, e^{\beta \Delta_i^2} \le K \quad\Rightarrow\quad \sum_{i=1}^n \mathcal{Q}_i \Delta_i \le \sqrt{(D(\mathcal{Q} \| \mathcal{P}) + \ln K)/\beta}. \tag{15}$$

Note that even when we have arbitrary probability measures $\mathcal{Q}$ and $\mathcal{P}$ on a continuous space of $\Delta$'s, it might still be possible to define a sequence of vectors $\mathcal{Q}^{(1)}, \mathcal{Q}^{(2)}, \ldots$, $\mathcal{P}^{(1)}, \mathcal{P}^{(2)}, \ldots$ and $\Delta^{(1)}, \Delta^{(2)}, \ldots$ such that $\mathcal{Q}^{(n)}, \mathcal{P}^{(n)}$ and $\Delta^{(n)}$ satisfy the condition of the lemma and

$$\mathbb{E}_{\mathcal{Q}} \Delta = \lim_{n\to\infty} \sum_{i=1}^n \mathcal{Q}^{(n)}_i \Delta^{(n)}_i, \qquad D(\mathcal{Q} \| \mathcal{P}) = \lim_{n\to\infty} \sum_{i=1}^n \mathcal{Q}^{(n)}_i \ln \frac{\mathcal{Q}^{(n)}_i}{\mathcal{P}^{(n)}_i}. \tag{16}$$

We will then take the limit of the conclusion of the lemma to get a bound for the continuous case [21].

Proof of Theorem 2 (Model-Based Bound)

Lemma 5. Let $\Delta_{T'} = \|\hat Q^{\pi^*_{T'}} - Q^{\pi^*_{T'}}\|_\infty$. With probability no less than $1 - \delta$ over the sampling:

$$\mathbb{E}_{T' \sim M_p}\left[e^{\frac{1}{2}(n_{\min}-1)k^2 \Delta_{T'}^2}\right] \le \frac{|S| \, 2^{|S|} \, n_{\min}}{\delta}. \tag{17}$$

Before proving Lemma 5, note that Lemma 5 and Lemma 4 together imply Theorem 2. We only need to apply the method described above for arbitrary probability measures.
To prove Lemma 5, it suffices to prove the following, then swap the expectations and apply Markov's inequality:

$$\mathbb{E}_{T' \sim M_p} \mathbb{E}_{U \sim \mathcal{U}}\left[e^{\frac{1}{2}(n_{\min}-1)k^2 \Delta_{T'}^2}\right] \le |S| \, 2^{|S|} \, n_{\min}. \tag{18}$$

Therefore, we only need to show that for any choice of $T'$, $\mathbb{E}_{U \sim \mathcal{U}}[e^{\frac{1}{2}(n_{\min}-1)k^2 \Delta_{T'}^2}]$ follows the bound. Let $a_s = \pi^*_{T'}(s)$. We have:

$$\Pr\{\Delta_{T'} \ge \epsilon\} \le \sum_s \Pr\{\|\hat T(s, a_s, \cdot) - T(s, a_s, \cdot)\|_1 > k\epsilon\} \tag{19}$$
$$\le \sum_s 2^{|S|} e^{-\frac{1}{2} n_{s,a_s} (k\epsilon)^2} \le |S| \, 2^{|S|} e^{-\frac{1}{2} n_{\min} (k\epsilon)^2}. \tag{20}$$

The first line is by Lemma 1. The second line is a concentration inequality for multinomials [22]. We choose to maximize $\mathbb{E}_{U \sim \mathcal{U}}[e^{\frac{1}{2}(n_{\min}-1)k^2 \Delta_{T'}^2}]$, subject to $\Pr\{\Delta_{T'} \ge \epsilon\} \le |S| \, 2^{|S|} e^{-\frac{1}{2} n_{\min}(k\epsilon)^2}$. The maximum occurs when the inequality is tight and the p.d.f. for $\Delta_{T'}$ is:

$$f(\Delta) = |S| \, 2^{|S|} k^2 n_{\min} \Delta \, e^{-\frac{1}{2} n_{\min} k^2 \Delta^2}. \tag{21}$$

We thus get:

$$\mathbb{E}_{U \sim \mathcal{U}}\left[e^{\frac{1}{2}(n_{\min}-1)k^2 \Delta_{T'}^2}\right] \le \int_0^\infty e^{\frac{1}{2}(n_{\min}-1)k^2 \Delta^2} f(\Delta) \, d\Delta \tag{22}$$
$$= \int_0^\infty |S| \, 2^{|S|} k^2 n_{\min} \Delta \, e^{-\frac{1}{2} k^2 \Delta^2} \, d\Delta \le |S| \, 2^{|S|} n_{\min}. \tag{23}$$

This concludes the proof of Lemma 5 and consequently Theorem 2.

Proof of Theorem 3 (Model-Free Bound)

Since $B$ is a contraction with respect to the infinity norm and $Q^*$ is its fixed point, we have:

$$\|Q - Q^*\|_\infty = \|Q - BQ + BQ - BQ^*\|_\infty \le \|Q - BQ\|_\infty + \|BQ - BQ^*\|_\infty \tag{24}$$
$$\le \|Q - BQ\|_\infty + \gamma \|Q - Q^*\|_\infty, \tag{25}$$

and thus $\|Q - Q^*\|_\infty \le \frac{1}{1-\gamma}\|Q - BQ\|_\infty$.

Lemma 6. Let $\Delta_Q = \max\left(0, \|Q - Q^*\|_\infty - \frac{\|Q - \hat BQ\|_\infty}{1-\gamma}\right)$. With probability no less than $1 - \delta$:

$$\mathbb{E}_{Q \sim J_p}\left[e^{2(n_{\min}-1)(1-\gamma)^2 \Delta_Q^2 / c^2}\right] \le \frac{|S||A| \, n_{\min}}{\delta}. \tag{26}$$

Similar to the previous section, Lemma 6 and Lemma 4 together imply Theorem 3.

To prove Lemma 6, similar to the previous proof, we only need to show that for any choice of $Q$, $\mathbb{E}_{U \sim \mathcal{U}}[e^{2(n_{\min}-1)(1-\gamma)^2 \Delta_Q^2/c^2}]$ follows the bound. We have that:

$$\Pr\{\Delta_Q \ge \epsilon\} = \Pr\left\{\|Q - Q^*\|_\infty \ge \epsilon + \|Q - \hat BQ\|_\infty/(1-\gamma)\right\} \tag{27}$$
$$\le \Pr\left\{\|Q - BQ\|_\infty \ge (1-\gamma)\left(\epsilon + \|Q - \hat BQ\|_\infty/(1-\gamma)\right)\right\} \tag{28}$$
$$\le \sum_{s,a} \Pr\left\{|Q(s,a) - BQ(s,a)| \ge (1-\gamma)\epsilon + \|Q - \hat BQ\|_\infty\right\} \tag{29}$$
$$\le \sum_{s,a} \Pr\left\{|Q(s,a) - \hat BQ(s,a)| + |\hat BQ(s,a) - BQ(s,a)| \ge (1-\gamma)\epsilon + \|Q - \hat BQ\|_\infty\right\} \tag{30}$$
$$\le \sum_{s,a} \Pr\left\{|\hat BQ(s,a) - BQ(s,a)| \ge (1-\gamma)\epsilon\right\} \tag{31}$$
$$\le \sum_{s,a} e^{-2 n_{s,a}(1-\gamma)^2 \epsilon^2/c^2} \le |S||A| \, e^{-2 n_{\min}(1-\gamma)^2 \epsilon^2/c^2}.$$

Eqn (28) follows from the derivations at the beginning of this section. Eqn (29) is by the union bound. Eqn (31) is by the definition of the infinity norm. The last derivation is by the Hoeffding inequality of Equation (12). Now again, similar to the model-based case, when the inequality is tight the p.d.f.
is:
\[
f(\Delta) = 4|S||A|\, n_{\min} (1-\gamma)^2 c^{-2}\, \Delta\, e^{-2 n_{\min}(1-\gamma)^2\Delta^2/c^2}.
\]
We thus get:
\begin{align*}
\mathbb{E}_{U\sim\mathcal{U}}\big[e^{2(n_{\min}-1)(1-\gamma)^2\Delta_Q^2/c^2}\big] &\le \int_0^\infty e^{2(n_{\min}-1)(1-\gamma)^2\Delta^2/c^2} f(\Delta)\, d\Delta \\
&= \int_0^\infty 4|S||A|\, n_{\min} (1-\gamma)^2 c^{-2}\, \Delta\, e^{-2(1-\gamma)^2\Delta^2/c^2}\, d\Delta \le |S||A|\, n_{\min}.
\end{align*}
This concludes the proof of Lemma 6 and consequently Theorem 3.

References

[1] L. G. Valiant. A theory of the learnable. Communications of the ACM, 27(11):1134-1142, 1984.
[2] M. Kearns and S. Singh. Near-optimal reinforcement learning in polynomial time. Machine Learning, 49(2-3):209-232, 2002.
[3] R. I. Brafman and M. Tennenholtz. R-max - a general polynomial time algorithm for near-optimal reinforcement learning. The Journal of Machine Learning Research, 3:213-231, 2003.
[4] A. L. Strehl and M. L. Littman. A theoretical analysis of model-based interval estimation. In Proceedings of the 22nd International Conference on Machine Learning, pages 856-863, 2005.
[5] S. M. Kakade. On the sample complexity of reinforcement learning. PhD thesis, University College London, 2003.
[6] M. O. G. Duff. Optimal learning: Computational procedures for Bayes-adaptive Markov decision processes. PhD thesis, University of Massachusetts Amherst, 2002.
[7] M. J. A. Strens. A Bayesian framework for reinforcement learning. In Proceedings of the 17th International Conference on Machine Learning, pages 943-950, 2000.
[8] T. Wang, D. Lizotte, M. Bowling, and D. Schuurmans. Bayesian sparse sampling for on-line reward optimization. In Proceedings of the 22nd International Conference on Machine Learning, page 963, 2005.
[9] J. Z. Kolter and A. Y. Ng.
Near-Bayesian exploration in polynomial time. In Proceedings of the 26th International Conference on Machine Learning, pages 513-520, 2009.
[10] D. A. McAllester. Some PAC-Bayesian theorems. Machine Learning, 37(3):355-363, 1999.
[11] J. Shawe-Taylor and R. C. Williamson. A PAC analysis of a Bayesian estimator. In Proceedings of the 10th Annual Conference on Computational Learning Theory, pages 2-9, 1997.
[12] P. Germain, A. Lacasse, F. Laviolette, and M. Marchand. PAC-Bayesian learning of linear classifiers. In Proceedings of the 26th International Conference on Machine Learning, pages 353-360, 2009.
[13] J. Langford and J. Shawe-Taylor. PAC-Bayes and margins. In Proceedings of Advances in Neural Information Processing Systems, pages 439-446, 2002.
[14] R. S. Sutton and A. G. Barto. Reinforcement Learning: An Introduction. MIT Press, Cambridge, MA, 1998.
[15] D. P. Bertsekas and J. N. Tsitsiklis. Neuro-Dynamic Programming (Optimization and Neural Computation Series, 3). Athena Scientific, 1996.
[16] R. Herbrich and T. Graepel. A PAC-Bayesian margin bound for linear classifiers. IEEE Transactions on Information Theory, 48(12):3140-3150, 2002.
[17] R. S. Sutton, D. McAllester, S. Singh, and Y. Mansour. Policy gradient methods for reinforcement learning with function approximation. In Proceedings of Advances in Neural Information Processing Systems, 12:1057-1063, 2000.
[18] J. A. Boyan. Technical update: Least-squares temporal difference learning. Machine Learning, 49(2):233-246, 2002.
[19] A. Farahmand, M. Ghavamzadeh, C. Szepesvári, and S. Mannor. Regularized fitted Q-iteration: Application to planning. In Recent Advances in Reinforcement Learning, pages 55-68, 2008.
[20] J. Asmuth, L. Li, M. L. Littman, A. Nouri, and D. Wingate. A Bayesian sampling approach to exploration in reinforcement learning.
In Proceedings of the 25th Conference on Uncertainty in Artificial Intelligence, 2009.
[21] D. A. McAllester. PAC-Bayesian model averaging. In Proceedings of the 12th Annual Conference on Computational Learning Theory, pages 164-170, 1999.
[22] T. Weissman, E. Ordentlich, G. Seroussi, S. Verdú, and M. J. Weinberger. Inequalities for the L1 deviation of the empirical distribution. Technical report, Information Theory Research Group, HP Laboratories, 2003.