{"title": "From Weighted Classification to Policy Search", "book": "Advances in Neural Information Processing Systems", "page_first": 139, "page_last": 146, "abstract": null, "full_text": "From Weighted Classification to Policy Search\n\nD. Blatt Department of Electrical Engineering and Computer Science University of Michigan Ann Arbor, MI 48109-2122 dblatt@eecs.umich.edu\n\nA. O. Hero Department of Electrical Engineering and Computer Science University of Michigan Ann Arbor, MI 48109-2122 hero@eecs.umich.edu\n\nAbstract\nThis paper proposes an algorithm to convert a T-stage stochastic decision problem with a continuous state space to a sequence of supervised learning problems. The optimization problem associated with the trajectory tree and random trajectory methods of Kearns, Mansour, and Ng, 2000, is solved using the Gauss-Seidel method. The algorithm breaks a multistage reinforcement learning problem into a sequence of single-stage reinforcement learning subproblems, each of which is solved via an exact reduction to a weighted classification problem that can be solved using off-the-shelf methods. Thus the algorithm converts a reinforcement learning problem into simpler supervised learning subproblems. It is shown that the method converges in a finite number of steps to a solution that cannot be further improved by componentwise optimization. The implication of the proposed algorithm is that a plethora of classification methods can be applied to find policies in the reinforcement learning problem.\n\n1\n\nIntroduction\n\nThere has been increased interest in applying tools from supervised learning to problems in reinforcement learning. The goal is to leverage techniques and theoretical results from supervised learning for solving the more complex problem of reinforcement learning [3]. In [6] and [4], classification was incorporated into approximate policy iteration. In [2], regression and classification are used to perform dynamic programming. 
Bounds on the performance of a policy which is built from a sequence of classifiers were derived in [8] and [9]. Similar to [8], we adopt the generative model assumption of [5] and tackle the problem of finding good policies within an infinite class of policies, where performance is evaluated in terms of empirical averages over a set of trajectory trees. In [8] the T-step reinforcement learning problem was converted to a set of weighted classification problems by trying to fit the classifiers to the maximal path on the trajectory tree of the decision process. In this paper we take a different approach. We show that while the task of finding the global optimum within a class of non-stationary policies may be overwhelming, the componentwise search leads to single step reinforcement learning problems which can be reduced to a sequence of weighted classification problems. Our reduction is exact and is different from the one proposed in [8]; it gives more weight to regions of the state space in which the difference between the possible actions in terms of future reward is large, rather than giving more weight to regions in which the maximal future reward is large. The weighted classification problems can be solved by applying weights-sensitive classifiers or by further reducing the weighted classification problem to a standard classification problem using re-sampling methods (see [7], [1], and references therein for a description of both approaches). Based on this observation, an algorithm that converts the policy search problem into a sequence of weighted classification problems is given. It is shown that the algorithm converges in a finite number of steps to a solution, which cannot be further improved by changing the control of a single stage while holding the rest of the policy fixed.\n\n2\n\nProblem Formulation\n\nThe results are presented in the context of MDPs but can be applied to POMDPs and non-Markovian decision processes as well. 
Consider a T-step MDP M = {S, A, D, P_{s,a}}, where S is a (possibly continuous) state space, A = {0, . . . , L − 1} is a finite set of possible actions, D is the distribution of the initial state, and P_{s,a} is the distribution of the next state given that the current state is s and the action taken is a. The reward granted when taking action a at state s and making a transition to state s' is assumed to be a known deterministic and bounded function of s', denoted by r : S → [−M, M]. No generality is lost in specifying a known deterministic reward since it is possible to augment the state variable by an additional random component whose distribution depends on the previous state and action, and specify the function r to extract this random component. Denote by S_0, S_1, . . . , S_T the random state variables. A non-stationary deterministic policy π = (π_0, π_1, . . . , π_{T−1}) is a sequence of mappings π_t : S → A, which are called controls. The control π_t specifies the action taken at time t as a function of the state at time t. The expected sum of rewards of a non-stationary deterministic policy is given by\n\nV(π) = E{ Σ_{t=1}^{T} r(S_t) },   (1)\n\nwhere the expectation is taken with respect to the distribution over the random state variables induced by the policy π. We call V(π) the value of policy π. Non-stationary deterministic policies are considered since the optimal policy for a finite horizon MDP is non-stationary and deterministic [10]. Usually the optimal policy is defined as the policy that maximizes the value conditioned on the initial state, i.e.,\n\nV_π(s) = E{ Σ_{t=1}^{T} r(S_t) | S_0 = s }   (2)\n\nfor any realization s of S_0 [10]. The policy that maximizes the conditional value given each realization of the initial state also maximizes the value averaged over the initial state, and it is the unique maximizer if the distribution of the initial state D is positive over S. Therefore, when optimizing over all possible policies, the maximizations of (1) and (2) are equivalent. 
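As a concrete illustration (not part of the original paper), the value (1) of a non-stationary deterministic policy can be estimated by Monte Carlo rollouts through the generative model. In this minimal sketch, `d_sample` and `p_sample` are hypothetical stand-ins for the two samplers of the generative model (initial state from D, next state from P_{s,a}):

```python
def rollout_value(d_sample, p_sample, reward, policy, T):
    """One Monte Carlo rollout of the sum of rewards in Eq. (1).

    d_sample()     -- draws an initial state from D
    p_sample(s, a) -- draws the next state from P_{s,a}
    reward(s)      -- deterministic bounded reward r(s)
    policy         -- list of T controls, policy[t]: state -> action
    """
    s = d_sample()
    total = 0.0
    for t in range(T):
        s = p_sample(s, policy[t](s))
        total += reward(s)  # rewards are collected at S_1, ..., S_T
    return total


def estimate_value(d_sample, p_sample, reward, policy, T, n=1000):
    """Average of n independent rollouts, an unbiased estimate of V(pi)."""
    return sum(rollout_value(d_sample, p_sample, reward, policy, T)
               for _ in range(n)) / n
```

With a deterministic toy model (next state s + a, reward r(s) = s, always taking action 1 for T = 2 steps) the rollout visits states 1 and 2 and the estimate is exactly 3.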
When optimizing (1) over a restricted class of policies, which does not contain the optimal policy, the distribution over the initial state specifies the importance of different regions of the state space in terms of the approximation error. For example, assigning high probability to a certain region of S will favor policies that approximate the optimal policy well over that region. Alternatively, maximizing (1) when D is a point mass at state s is equivalent to maximizing (2). Following the generative model assumption of [5], the initial distribution D and the conditional distribution P_{s,a} are unknown, but it is possible to generate realizations of the initial state according to D and of the next state according to P_{s,a} for arbitrary state-action pairs (s, a). Given the generative model, n trajectory trees are constructed in the following manner. The root of each tree is a realization of S_0 generated according to the distribution D. Given the realization of the initial state, realizations of the next state S_1 given each of the L possible actions, denoted by S_1|a, a ∈ A, are generated. Note that this notation omits the dependence on the value of the initial state. Each of the L realizations of S_1 is now the root of a subtree. These iterations continue to generate a depth-T tree. Denote by S_t|i_0, i_1, . . . , i_{t−1} the random variable generated at the node that follows the sequence of actions i_0, i_1, . . . , i_{t−1}. Hence, each tree is constructed using a single call to the initial state generator and L + L^2 + · · · + L^T calls to the next state generator.\n\nFigure 1: A binary trajectory tree.\n\nConsider a class of policies Π, i.e., each element of Π is a sequence of T mappings from S to A. It is possible to estimate the value of any policy in the class from the set of trajectory trees by simply averaging the sum of rewards on each tree along the path that agrees with the policy [5]. Denote by V^i(π) the observed value on the i'th tree along the path that corresponds to the policy π. 
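The tree construction just described, and the observed value V^i(π) of a policy on one tree, can be sketched as follows. This is an illustrative implementation under assumed conventions (trees as nested dicts, `d_sample`/`p_sample` as the generative model's samplers), not code from the paper:

```python
def build_tree(d_sample, p_sample, num_actions, T):
    """Build one trajectory tree of depth T.

    The root is a draw from D; every child of a state s under action a
    is a draw from P_{s,a}.  Each node is {"state": ..., "children": ...},
    with one child per action and children=None at the leaves.
    """
    def expand(state, depth):
        node = {"state": state, "children": None}
        if depth < T:
            node["children"] = [expand(p_sample(state, a), depth + 1)
                                for a in range(num_actions)]
        return node
    return expand(d_sample(), 0)


def tree_value(node, reward, policy, t=0):
    """Observed value V^i(pi): sum of rewards along the path on the
    tree that agrees with the policy (a list of T controls)."""
    if node["children"] is None:
        return 0.0
    child = node["children"][policy[t](node["state"])]
    return reward(child["state"]) + tree_value(child, reward, policy, t + 1)
```

Averaging `tree_value` over the n trees gives the estimate V_n(π) used throughout the paper.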
Then the value of the policy is estimated by\n\nV_n(π) = n^{−1} Σ_{i=1}^{n} V^i(π).   (3)\n\nIn [5], the authors show that with high probability (over the data set) V_n(π) converges uniformly to V(π) of (1) with rates that depend on the VC-dimension of the policy class. This result motivates the use of policies with high V_n(π), since with high probability these policies have high values of V(π). In this paper, we consider the problem of finding policies that obtain high values of V_n(π).\n\n3\n\nA Reduction From a Single Step Reinforcement Learning Problem to Weighted Classification\n\nThe building block of the proposed algorithm is an exact reduction from a single step reinforcement learning problem to a weighted classification problem. Consider the single step decision process. An initial state S_0 generated according to the distribution D is followed by one of L possible actions A = {0, 1, . . . , L − 1}, which leads to a transition to state S_1 whose conditional distribution, given that the initial state is s and the action is a, is P_{s,a}. Given a class of policies Π, where each policy π in Π is a map from S to A, the goal is to find\n\narg max_{π∈Π} V_n(π).   (4)\n\nIn this single step problem the data are n realizations of the random element {S_0, S_1|0, S_1|1, . . . , S_1|L−1}. Denote the i'th realization by {s_0^i, s_1^i|0, s_1^i|1, . . . , s_1^i|L−1}. In this case, V_n(π) can be written explicitly as\n\nV_n(π) = E_n{ Σ_{l=0}^{L−1} r(S_1|l) I(π(S_0) = l) },   (5)\n\nwhere, for a function f, E_n{f(S_0, S_1|0, S_1|1, . . . , S_1|L−1)} denotes its empirical expectation n^{−1} Σ_{i=1}^{n} f(s_0^i, s_1^i|0, s_1^i|1, . . . , s_1^i|L−1), and I(·) is the indicator function taking the value one when its argument is true and zero otherwise.\n\nThe following proposition shows that the problem of maximizing the empirical reward (5) is equivalent to a weighted classification problem. 
Proposition 1 Given a class of policies Π and a set of n trajectory trees,\n\narg max_{π∈Π} E_n{ Σ_{l=0}^{L−1} r(S_1|l) I(π(S_0) = l) } = arg min_{π∈Π} E_n{ Σ_{l=0}^{L−1} [max_k r(S_1|k) − r(S_1|l)] I(π(S_0) = l) }.   (6)\n\nThe proposition implies that the maximizer of the empirical reward over a class of policies is the output of an optimal weights-dependent classifier for the data set\n\n{ (s_0^i, arg max_k r(s_1^i|k), w^i) }_{i=1}^{n},\n\nwhere for each sample the first argument is the example, the second is the label, and\n\nw^i = ( max_k r(s_1^i|k) − r(s_1^i|0), max_k r(s_1^i|k) − r(s_1^i|1), . . . , max_k r(s_1^i|k) − r(s_1^i|L−1) )\n\nis the realization of the L costs of classifying example i to each of the possible labels. Note that the realizations of the costs are always non-negative and the cost of the correct classification (arg max_k r(s_1^i|k)) is always zero. The solution to the weighted classification problem is a map from S to A which minimizes the empirical weighted misclassification error (6). The proposition asserts that this mapping is also the control which maximizes the empirical reward (5).\n\nProof 1 For all j ∈ {0, 1, . . . , L − 1},\n\nΣ_{l=0}^{L−1} r(S_1|l) I(π(S_0) = l) = r(S_1|j) + (r(S_1|0) − r(S_1|j)) I(π(S_0) = 0) + (r(S_1|1) − r(S_1|j)) I(π(S_0) = 1) + · · · + (r(S_1|L−1) − r(S_1|j)) I(π(S_0) = L−1).   (7)\n\nIn addition,\n\nE_n{ Σ_{l=0}^{L−1} r(S_1|l) I(π(S_0) = l) } = E_n{ I(arg max_k r(S_1|k) = 0) Σ_{l=0}^{L−1} r(S_1|l) I(π(S_0) = l) } + E_n{ I(arg max_k r(S_1|k) = 1) Σ_{l=0}^{L−1} r(S_1|l) I(π(S_0) = l) } + · · · + E_n{ I(arg max_k r(S_1|k) = L−1) Σ_{l=0}^{L−1} r(S_1|l) I(π(S_0) = l) }.\n\nSubstituting (7) with j = arg max_k r(S_1|k) in each term, we obtain\n\nE_n{ Σ_{l=0}^{L−1} r(S_1|l) I(π(S_0) = l) } = Σ_{j=0}^{L−1} E_n{ I(arg max_k r(S_1|k) = j) [ r(S_1|j) − (max_k r(S_1|k) − r(S_1|0)) I(π(S_0) = 0) − (max_k r(S_1|k) − r(S_1|1)) I(π(S_0) = 1) − · · · − (max_k r(S_1|k) − r(S_1|L−1)) I(π(S_0) = L−1) ] } = Σ_{j=0}^{L−1} E_n{ I(arg max_k r(S_1|k) = j) r(S_1|j) } − E_n{ Σ_{l=0}^{L−1} [max_k r(S_1|k) − r(S_1|l)] I(π(S_0) = l) }.\n\nThe first sum in the last line is independent of π, so maximizing the empirical reward over Π is equivalent to minimizing the second term, which is the weighted misclassification error, and the result follows.\n\nIn the binary case, the optimization problem is\n\narg min_{π∈Π} E_n{ |r(S_1|0) − r(S_1|1)| I(π(S_0) ≠ arg max_{k∈{0,1}} r(S_1|k)) },\n\ni.e., the single step reinforcement learning problem reduces to the weighted classification problem with samples\n\n{ (s_0^i, arg max_{k∈{0,1}} r(s_1^i|k), |r(s_1^i|0) − r(s_1^i|1)|) }_{i=1}^{n},\n\nwhere for each sample the first argument is the example, the second is the label, and the third is a realization of the cost incurred when misclassifying the example. Note that this is different from the reduction in [8]. When applying the reduction in [8] to our single step problem the costs are taken to be max_{k∈{0,1}} r(s_1^i|k) rather than |r(s_1^i|0) − r(s_1^i|1)|. Setting the costs to max_{k∈{0,1}} r(s_1^i|k) instead of |r(s_1^i|0) − r(s_1^i|1)| favors classifiers which perform well in regions where the maximal reward is large (regardless of the difference between the two actions) instead of regions where the difference between the rewards that result from the two actions is large. It is easy to construct an example of a simple MDP and a restricted class of policies, which does not include the optimal policy, in which the classifier that minimizes the weighted misclassification error with costs max_{k∈{0,1}} r(s_1^i|k) is not equivalent to the optimal policy. When using our reduction, they are always equivalent. On the other hand, in [8] the choice max_{k∈{0,1}} r(s_1^i|k) led to a bound on the performance of the policy in terms of the performance of the classifier. We do not pursue this type of bound here since, given the classifier, the performance of the resulting policy can be directly estimated from (5). 
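The reduction of Proposition 1 can be made concrete in a few lines. In this sketch (illustrative naming, assuming each single-step sample is a pair of a state and its L observed rewards), the dataset construction sets the label to the empirically best action and the cost vector to max_k r(s_1|k) − r(s_1|l), so empirical reward (5) and weighted risk (6) sum to a constant independent of the control:

```python
def reduction_dataset(samples):
    """Convert single-step data to a weighted classification problem.

    `samples` is a list of (s0, rewards) pairs where rewards[l] is the
    observed reward r(s1|l) of taking action l at state s0.  Returns
    (example, label, costs) triples with costs[l] = max_k r(s1|k) - r(s1|l),
    so the cost of the best action is zero, as in Proposition 1.
    """
    out = []
    for s0, rewards in samples:
        best = max(range(len(rewards)), key=lambda l: rewards[l])
        out.append((s0, best, [rewards[best] - r for r in rewards]))
    return out


def weighted_risk(dataset, control):
    """Empirical weighted misclassification error (RHS of Eq. (6))."""
    return sum(costs[control(s0)] for s0, _, costs in dataset) / len(dataset)


def empirical_reward(samples, control):
    """Empirical reward (Eq. (5)) of a control on the same data."""
    return sum(rewards[control(s0)] for s0, rewards in samples) / len(samples)
```

Because reward plus risk is the same constant for every control, the minimizer of `weighted_risk` is exactly the maximizer of `empirical_reward`, which is the content of the proposition.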
Given a sequence of classifiers, the value of the induced sequence of controls (or policy) can be estimated directly by (3), with generalization guarantees provided by the bounds in [5]. In [2], a certain single step binary reinforcement learning problem is converted to weighted classification by averaging multiple realizations of the rewards under the two possible actions for each state. As seen here, this Monte Carlo approach is not necessary; it is sufficient to sample the rewards once for each state.\n\n4\n\nFinding Good Policies for a T-Step Markov Decision Process By Solving a Sequence of Weighted Classification Problems\n\nGiven the class of policies Π, the algorithm updates the controls π_0, . . . , π_{T−1} one at a time in a cyclic manner while holding the rest constant. Each update is formulated as a single step reinforcement learning problem which is then converted to a weighted classification problem. In practice, if the weighted classification problem is only approximately solved, then the new control is accepted only if it leads to a higher value of V_n. When updating π_t, the trees are pruned from the root to stage t by keeping only the branch which agrees with the controls π_0, π_1, . . . , π_{t−1}. Then a single step reinforcement learning problem is formulated at time step t, where the realization of the reward which follows action a ∈ A at stage t is the immediate reward obtained at the state which follows action a plus the sum of rewards accumulated along the branch which agrees with the controls π_{t+1}, π_{t+2}, . . . , π_{T−1}. The iterations end after the first complete cycle with no parameter modifications. Note that when updating π_t, each tree contributes one realization of the state at time t. As a result of the pruning process, the ensemble of state realizations is drawn from the distribution induced by the policy up to time t − 1. 
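The cyclic update just described can be sketched as follows. This is a minimal illustration under assumed conventions (trees as nested {"state", "children"} dicts as in the earlier sketches; `learn` is any off-the-shelf weights-sensitive classifier returning a new control from (example, label, costs) triples), not the authors' implementation:

```python
def cyclic_policy_search(trees, reward, policy, T, num_actions, learn):
    """Cyclic (Gauss-Seidel) policy search over the controls of `policy`.

    For each stage t: prune each tree to stage t along pi_0..pi_{t-1},
    form the single-step weighted classification problem, fit a new
    control, and keep it only if V_n improves.  Stops after a full
    cycle with no improvement.
    """
    def path_reward(node, t):
        # Sum of rewards from stage t onward along the current policy.
        if node["children"] is None:
            return 0.0
        child = node["children"][policy[t](node["state"])]
        return reward(child["state"]) + path_reward(child, t + 1)

    def value():  # empirical value V_n of the current policy
        return sum(path_reward(tree, 0) for tree in trees) / len(trees)

    improved = True
    while improved:
        improved = False
        for t in range(T):
            dataset = []
            for tree in trees:
                node = tree
                for u in range(t):  # prune along pi_0, ..., pi_{t-1}
                    node = node["children"][policy[u](node["state"])]
                # reward of action a = immediate reward + downstream sum
                rewards = [reward(c["state"]) + path_reward(c, t + 1)
                           for c in node["children"]]
                best = max(range(num_actions), key=lambda a: rewards[a])
                costs = [rewards[best] - r for r in rewards]
                dataset.append((node["state"], best, costs))
            old_value, old_control = value(), policy[t]
            policy[t] = learn(dataset)
            if value() > old_value + 1e-12:
                improved = True
            else:
                policy[t] = old_control  # reject non-improving update
    return policy
```

Accepting an update only when V_n strictly increases mirrors the convergence argument of Proposition 2: the empirical value is non-decreasing and takes finitely many values, so a full cycle without updates is reached.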
In other words, the algorithm relaxes the requirement in [2] to have access to a baseline distribution - a distribution over the states that is induced by a good policy. Our algorithm automatically generates samples from distributions that are induced by a sequence of monotonically improving policies.\n\nFigure 2: Updating π_1. In the example: pruning down according to π_0(S_0) = 0, and propagating rewards up according to π_2(S_2|00) = 1 and π_2(S_2|01) = 0.\n\nProposition 2 The algorithm converges after a finite number of iterations to a policy that cannot be further improved by changing one of the controls and holding the rest fixed.\n\nProof 2 Writing the empirical average sum of rewards V_n(π) explicitly as\n\nV_n(π) = E_n{ Σ_{i_0,...,i_{T−1} ∈ A} I(π_0(S_0) = i_0) I(π_1(S_1|i_0) = i_1) · · · I(π_{T−1}(S_{T−1}|i_0, i_1, . . . , i_{T−2}) = i_{T−1}) Σ_{t=1}^{T} r(S_t|i_0, i_1, . . . , i_{t−1}) },\n\nit can be seen that the algorithm is a Gauss-Seidel algorithm for maximizing V_n(π), where, at each iteration, optimization is carried out over the control π_t of one stage t while keeping π_{t'}, t' ≠ t, fixed. At each iteration the previous control is a valid solution and hence the objective function is non-decreasing. Since V_n(π) is evaluated using a finite number of trees, it can take only a finite set of values. Therefore, we must reach a cycle with no updates after a finite number of iterations. A cycle with no improvements implies that we cannot increase the empirical average sum of rewards by updating one of the π_t's.\n\n5\n\nInitialization\n\nThere are two possible initial policies that can be extracted from the set of trajectory trees. One possible initial policy is the myopic policy, which is computed from the root of the tree downwards. Starting from the root, π_0 is found by solving the single stage reinforcement learning problem resulting from taking into account only the immediate reward at the next state. 
Once the weighted classification problem is solved, the trees are pruned by following the action which agrees with π_0. The remaining realizations of the state S_1 follow the distribution induced by the myopic control of the first stage. The process is continued to stage T − 1. The second possible initial policy is computed from the leaves backward to the root. Note that the distribution of the state at a leaf that is chosen at random is the distribution of the state when a randomized policy is used. Therefore, to find the best control at stage T − 1, given that the previous T − 1 controls choose random actions, we solve the weighted classification problem induced by considering either all the realizations of the state S_{T−1} from all the trees (these are not independent observations) or one randomly chosen realization from each tree (these are independent realizations). Given the classifier, we use the equivalent control π_{T−1} to propagate the rewards up to the previous stage and solve the resulting weighted classification problem. This is carried out recursively up to the root of the tree.\n\n6\n\nExtensions\n\nThe results presented in this paper generalize to the non-Markovian setting as well. In particular, when the state space, action space, and reward function depend on time, and the distribution over the next state depends on all past states and actions, we will be dealing with non-stationary deterministic policies π = (π_0, π_1, . . . , π_{T−1}); π_t : S_0 × A_0 × · · · × S_{t−1} × A_{t−1} × S_t → A_t, t = 0, 1, . . . , T − 1. POMDPs can be dealt with in terms of the belief states, as a continuous state space MDP, or as a non-Markovian process in which policies depend directly on all past observations. 
While we focused on the trajectory tree method, the algorithm can be easily modified to solve the optimization problem associated with the random trajectory method [5] by adjusting the single step reinforcement learning reduction and the pruning method presented here.\n\n7\n\nIllustrative Example\n\nThe following example illustrates the aspects of the problem and the components of our solution. The simulated system is a two-step MDP, with continuous state space S = [0, 1] and a binary action space A = {0, 1}. The distribution over the initial state is uniform. Given state s and action a, the next state s' is generated by s' = mod(s + 0.33a + 0.1 randn, 1), where mod(x, 1) is the fractional part of x, and randn is a Gaussian random variable independent of the other variables in the problem. The reward function is r(s) = s sin(πs). We consider a class of policies parameterized by a continuous parameter: Π = {π(·; θ) | θ = (θ_0, θ_1) ∈ [0, 2]^2}, where π_i(s; θ_i) = 1 when θ_i ≤ 1 and s > θ_i, or when θ_i > 1 and s < θ_i − 1, and zero otherwise, i = 0, 1. In Figure 3 the objective function V_n(π(θ)), estimated from n = 20 trees, is presented as a function of θ_0 and θ_1, together with the path taken by the algorithm superimposed on the contour plot of V_n(π(θ)). Starting from the arbitrary point θ^0, the algorithm performs optimization with respect to one coordinate at a time and converges after 3 iterations.\n\nFigure 3: The objective function V_n(π(θ)) and the path taken by the algorithm.\n\nReferences\n\n[1] N. Abe, B. Zadrozny, and J. Langford. An iterative method for multi-class cost-sensitive learning. In Proceedings of the Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 3-11, 2004. [2] J. Bagnell, S. Kakade, A. 
Ng, and J. Schneider. Policy search by dynamic programming. In Advances in Neural Information Processing Systems, volume 16. MIT Press, 2003. [3] A. G. Barto and T. G. Dietterich. Reinforcement learning and its relationship to supervised learning. In J. Si, A. Barto, W. Powell, and D. Wunsch, editors, Handbook of Learning and Approximate Dynamic Programming. John Wiley and Sons, Inc, 2004. [4] A. Fern, S. Yoon, and R. Givan. Approximate policy iteration with a policy language bias. In Advances in Neural Information Processing Systems, volume 16, 2003. [5] M. Kearns, Y. Mansour, and A. Ng. Approximate planning in large POMDPs via reusable trajectories. In Advances in Neural Information Processing Systems, volume 12. MIT Press, 2000. [6] M. Lagoudakis and R. Parr. Reinforcement learning as classification: Leveraging modern classifiers. In Proceedings of the Twentieth International Conference on Machine Learning, 2003. [7] J. Langford and A. Beygelzimer. Sensitive error correcting output codes. In Proceedings of the 18th Annual Conference on Learning Theory, pages 158-172, 2005. [8] J. Langford and B. Zadrozny. Reducing T-step reinforcement learning to classification. http://hunch.net/jl/projects/reductions/reductions.html, 2003. [9] J. Langford and B. Zadrozny. Relating reinforcement learning performance to classification performance. In Proceedings of the Twenty-Second International Conference on Machine Learning, pages 473-480, 2005. [10] M. L. Puterman. Markov Decision Processes: Discrete Stochastic Dynamic Programming. John Wiley & Sons, Inc, 1994.", "award": [], "sourceid": 2778, "authors": [{"given_name": "Doron", "family_name": "Blatt", "institution": null}, {"given_name": "Alfred", "family_name": "Hero", "institution": null}]}