{"title": "Linear Feature Encoding for Reinforcement Learning", "book": "Advances in Neural Information Processing Systems", "page_first": 4224, "page_last": 4232, "abstract": "Feature construction is of vital importance in reinforcement learning, as the quality of a value function or policy is largely determined by the corresponding features. The recent successes of deep reinforcement learning (RL) only increase the importance of understanding feature construction. Typical deep RL approaches use a linear output layer, which means that deep RL can be interpreted as a feature construction/encoding network followed by linear value function approximation. This paper develops and evaluates a theory of linear feature encoding. We extend theoretical results on feature quality for linear value function approximation from the uncontrolled case to the controlled case. We then develop a supervised linear feature encoding method that is motivated by insights from linear value function approximation theory, as well as empirical successes from deep RL. The resulting encoder is a surprisingly effective method for linear value function approximation using raw images as inputs.", "full_text": "Linear Feature Encoding for Reinforcement Learning\n\nZhao Song, Ronald Parr\u2020, Xuejun Liao, Lawrence Carin\n\nDepartment of Electrical and Computer Engineering\n\n\u2020 Department of Computer Science\n\nDuke University, Durham, NC 27708, USA\n\nAbstract\n\nFeature construction is of vital importance in reinforcement learning, as the quality\nof a value function or policy is largely determined by the corresponding features.\nThe recent successes of deep reinforcement learning (RL) only increase the im-\nportance of understanding feature construction. 
Typical deep RL approaches use a linear output layer, which means that deep RL can be interpreted as a feature construction/encoding network followed by linear value function approximation. This paper develops and evaluates a theory of linear feature encoding. We extend theoretical results on feature quality for linear value function approximation from the uncontrolled case to the controlled case. We then develop a supervised linear feature encoding method that is motivated by insights from linear value function approximation theory, as well as empirical successes from deep RL. The resulting encoder is a surprisingly effective method for linear value function approximation using raw images as inputs.

1 Introduction

Feature construction has been and remains an important topic for reinforcement learning. One of the earliest, high-profile successes of reinforcement learning, TD-Gammon [1], demonstrated a huge performance improvement when expert features were used instead of the raw state, and recent years have seen a great deal of practical and theoretical work on understanding feature selection and generation for linear value function approximation [2-5].

More recent practical advances in deep reinforcement learning have initiated a new wave of interest in the combination of neural networks and reinforcement learning. For example, Mnih et al. [6] described a reinforcement learning (RL) system, referred to as Deep Q-Networks (DQN), which learned to play a large number of Atari video games as well as a good human player. Despite these successes and, arguably because of them, a great deal of work remains to be done in understanding the role of features in RL. It is common in deep RL methods to have a linear output layer.
This means that there is potential to apply the insights gained from years of work in linear value function approximation to these networks, potentially giving insight to practitioners and improving the interpretability of the results. For example, the layers preceding the output layer could be interpreted as feature extractors or encoders for linear value function approximation.

As an example of the connection between practical neural network techniques and linear value function approximation theory, we note that Oh et al. [7] introduced spatio-temporal prediction architectures that trained an action-conditional encoder to predict next states, leading to improved performance on Atari games. Oh et al. cited examples of next state prediction as a technique used in neural networks in prior work dating back several decades, though this approach is also suggested by more recent linear value function approximation theory [4].

30th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain.

In an effort to extend previous theory in a direction that would be more useful for linear value function approximation and, hopefully, lead to greater insights into deep RL, we generalize previous work on analyzing features for uncontrolled linear value function approximation [4] to the controlled case. We then build on this result to provide a set of sufficient conditions which guarantee that encoded features will result in good value function approximation. Although inspired by deep RL, our results (aside from one negative one in Section 3.2) apply most directly to the linear case, which has been empirically explored in Liang et al. [8]. This implies the use of a rich, original (raw) feature space, such as sequences of images from a video game without persistent, hidden state. The role of feature encoding in such cases is to find a lower dimensional representation that is suitable for linear value function approximation.
Feature encoding is still needed in such cases because the raw state representation is so large that it is impractical to use directly.

Our approach works by defining an encoder and a decoder that use a lower dimensional representation to encode and predict both reward and next state. Our results differ from previous results [4] in linear value function approximation theory that provided sufficient conditions for good approximation. Specifically, our results span two different representations, a large, raw state representation and a reduced one. We propose an efficient coordinate descent algorithm to learn parameters for the encoder and decoder. To demonstrate the effectiveness of this approach, we consider the challenging (for linear techniques) problem of learning features from raw images in pendulum balancing and blackjack. Surprisingly, we are able to discover good features and learn value functions in these domains using just linear encoding and linear value function approximation.

2 Framework and Notation

Markov Decision Processes (MDPs) can be represented as a tuple $\langle S, A, R, P, \gamma \rangle$, where $S = \{s_1, s_2, \ldots, s_n\}$ is the state set, $A = \{a_1, a_2, \ldots, a_m\}$ is the action set, $R \in \mathbb{R}^{nm \times 1}$ represents the reward function whose element $R(s_i, a_j)$ denotes the expected immediate reward when taking action $a_j$ in state $s_i$, $P \in \mathbb{R}^{nm \times n}$ denotes the transition probabilities of the underlying states, whose element $P[(s_i, a), s_j]$ is the probability of transiting from state $s_i$ to state $s_j$ when taking action $a$, and $\gamma \in [0, 1)$ is the discount factor for the future reward. The policy $\pi$ in an MDP can be represented in terms of the probability of taking action $a$ when in state $s$, i.e., $\pi(a|s) \in [0, 1]$ and $\sum_a \pi(a|s) = 1$.

Given a policy $\pi$, we define $P^\pi \in \mathbb{R}^{nm \times nm}$ as the transition probability for the state-action pairs, where $P^\pi(s', a'|s, a) = P[(s, a), s'] \, \pi(a'|s')$. For any policy $\pi$, its Q-function is defined over the state-action pairs, where $Q^\pi(s, a)$ represents the expected total $\gamma$-discounted rewards when taking action $a$ in state $s$ and following $\pi$ afterwards. For the state-action pair $(s, a)$, the Q-function satisfies the following Bellman equation:

$$Q^\pi(s, a) = R(s, a) + \gamma \sum_{s', a'} P^\pi(s', a'|s, a) \, Q^\pi(s', a'). \qquad (1)$$

2.1 The Bellman operator

We define the Bellman operator $T^\pi$ on the Q-functions as

$$(T^\pi Q)(s, a) = R(s, a) + \gamma \sum_{s', a'} P^\pi(s', a'|s, a) \, Q(s', a').$$

$Q^\pi$ is known to be a fixed point of $T^\pi$, i.e., $T^\pi Q^\pi = Q^\pi$. Of particular interest in this paper is the Bellman error of an approximate Q-function $\hat{Q}^\pi$, specifically $\mathrm{BE}(\hat{Q}^\pi) = T^\pi \hat{Q}^\pi - \hat{Q}^\pi$. When the Bellman error is 0, the Q-function is at the fixed point. Otherwise, we have [9]:

$$\|\hat{Q}^\pi - Q^\pi\|_\infty \le \|\hat{Q}^\pi - T^\pi \hat{Q}^\pi\|_\infty \, / \, (1 - \gamma),$$

where $\|x\|_\infty$ refers to the $\ell_\infty$ norm of a vector $x$.

2.2 Linear Approximation

When the Q-function cannot be represented exactly, we can approximate it with a linear function as $\hat{Q}^\pi(s, a) = \Phi w^\pi_\Phi$, with $\Phi = [\Phi(s_1, a_1) \ldots \Phi(s_n, a_m)]^T \in \mathbb{R}^{nm \times km}$, where $\Phi(s_i, a_j) \in \mathbb{R}^{km \times 1}$ is a feature vector for state $s_i$ and action $a_j$, superscript $T$ represents matrix transpose, and $w^\pi_\Phi \in \mathbb{R}^{km \times 1}$ is the weight vector.

Given the features $\Phi$, the linear fixed-point methods [10-12] aim to estimate $w^\pi_\Phi$ by solving the following fixed-point equation:

$$\Phi w^\pi_\Phi = \Pi (R + \gamma P^\pi \Phi w^\pi_\Phi), \qquad (2)$$

where $\Pi = \Phi (\Phi^T \Phi)^{-1} \Phi^T$ is the orthogonal $\ell_2$ projector on $\mathrm{span}(\Phi)$. Solving (2) leads to the following linear fixed-point solution:

$$w^\pi_\Phi = (\Phi^T \Phi - \gamma \Phi^T P^\pi \Phi)^{-1} \Phi^T R.$$

2.3 Feature Selection/Construction

There has been great interest in recent years in automating feature selection or construction for reinforcement learning. Research in this area has typically focused on using a linear value function approximation method with a feature selection wrapper.

Parr et al. [2] proposed using the Bellman error to generate new features, but this approach did not scale well in practice. Mahadevan and Maggioni [3] explored a feature generation approach based upon the Laplacian of a connectivity graph of the MDP. This approach has many desirable features, though it did not connect directly to the optimization problem implied by the MDP and could produce worthless features in pathological cases [4].

Geramifard et al. [13] and Farahmand and Precup [14] consider feature construction where features are built up through composition of base or atomic features. Such approaches are reminiscent of classical approaches to feature construction.
These composition-based approaches can be useful, but they can also be myopic if the needed features are not reachable through chains of simpler features where each step along the chain is a demonstrable improvement.

Feature selection solves a somewhat different problem from feature construction. Feature selection assumes that a reasonable set of candidate features is presented to the learner, and the learner's task is to find the good ones in a potentially large set of mostly worthless or redundant ones. LASSO [15] and Orthogonal Matching Pursuit (OMP) [16] are methods of feature selection for regression that have been applied to reinforcement learning [17, 5, 18, 19]. In practice, these approaches do require that good features are present within the larger set, so they do not address the question of how to generate good features in the first place.

3 Theory for Feature Encoding

Previous work demonstrated an equivalence between linear value function approximation and linear model approximation [20, 21, 4], as well as the relationship between errors in the linear model and the Bellman error for the linear fixed point [4]. Specifically, low error in the linear model could imply low Bellman error in the linear fixed-point approximation. These results were for the uncontrolled case. A natural extension of these results would be to construct features for action-conditional linear models, one for each action, and use those features across multiple policies, i.e., through several iterations of policy iteration. Anecdotally, this approach seemed to work well in some cases, but there were no theoretical results to justify it. The following example demonstrates that features which are sufficient for perfect linear action models and reward models may not suffice for perfect linear value function approximation.

Example 1.
Consider an MDP with a single feature $\phi(x) = x$, two actions that have no effect, $p(x|x, a_1) = 1.0$ and $p(x|x, a_2) = 1.0$, and with $R(x, a_1) = x$ and $R(x, a_2) = -x$. The single feature $\phi$ is sufficient to construct a linear predictor of the expected next state and reward. However, the value function is not linear in $\phi$ since $V^*(x) = |x| \, / \, (1 - \gamma)$.

The significance of this example is that existing theory on the connection between linear models and linear features does not provide sufficient conditions on the quality of the features for model approximation that would ensure good value function approximation for all policies. Existing theory also does not extend to provide a connection between the model error for a set of features and the Bellman error of a Q-function based on these features. To make this connection, the features must be thought of as predicting not only expected next features, but expected next feature-action combinations. Below, we extend the results of Parr et al. [4] to Q-functions and linear state-action models.

The linear model. Similar to Parr et al. [4], we approximate the reward $R$ and the expected policy-conditional next feature $P^\pi \Phi$ in the controlled case, using the following linear model:

$$\hat{R} = \Phi r_\Phi = \Phi (\Phi^T \Phi)^{-1} \Phi^T R \qquad (3a)$$
$$\widehat{P^\pi \Phi} = \Phi P^\pi_\Phi = \Phi (\Phi^T \Phi)^{-1} \Phi^T P^\pi \Phi. \qquad (3b)$$

Since $\hat{Q}^\pi = \Phi w$ for some $w$, the fixed-point equation in (1) becomes

$$\Phi w = \Phi r_\Phi + \gamma \Phi P^\pi_\Phi w \qquad (4a)$$
$$w = (I - \gamma P^\pi_\Phi)^{-1} r_\Phi. \qquad (4b)$$

Lemma 2. For any MDP M with features $\Phi$ and policy $\pi$ represented as the fixed point of the approximate Q-function, the linear-model solution and the linear fixed-point solution are the same.

Proof: See Supplemental Materials.

To analyze the error in the controlled case, we define the Bellman error for the state-action value function, given a policy $\pi$, as

$$\mathrm{BE}\big(\hat{Q}^\pi(s, a)\big) = R(s, a) + \gamma \sum_{s', a'} P^\pi(s', a'|s, a) \, \hat{Q}^\pi(s', a') - \hat{Q}^\pi(s, a).$$

As a counterpart to Parr et al. [4], we introduce the following reward error and policy-conditional per-feature error in the controlled case:

$$\Delta_R = R - \hat{R} = R - \Phi r_\Phi \qquad (5a)$$
$$\Delta^\pi_\Phi = P^\pi \Phi - \widehat{P^\pi \Phi} = P^\pi \Phi - \Phi P^\pi_\Phi. \qquad (5b)$$

Theorem 3. For any MDP M with features $\Phi$ and policy $\pi$ represented as the fixed point of the approximate Q-function, the Bellman error can be represented as

$$\mathrm{BE}(\hat{Q}^\pi) = \Delta_R + \gamma \Delta^\pi_\Phi w^\pi_\Phi.$$

Proof: See Supplemental Materials.

Theorem 3 suggests a sufficient condition for a good set of features: if the model prediction error $\Delta^\pi_\Phi$ and the reward prediction error $\Delta_R$ are low, then the Bellman error must also be low. Previous work did not give an in-depth understanding of how to construct such features. In Parr et al. [2], the Bellman error is defined only on the training data.
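Theorem 3 can be checked numerically on a small synthetic problem. The sketch below treats the rows as state-action pairs, so the random stochastic matrix stands in for P^π; all quantities are randomly generated and purely illustrative:

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy check of Theorem 3. Rows index state-action pairs, so P plays the
# role of P^pi; all quantities are random and illustrative.
N, k, gamma = 6, 3, 0.9
P = rng.random((N, N))
P /= P.sum(axis=1, keepdims=True)          # row-stochastic transition matrix
R = rng.random((N, 1))                     # reward vector
Phi = rng.random((N, k))                   # feature matrix, full rank w.h.p.

def proj(X):
    """Orthogonal projection of X onto span(Phi)."""
    return Phi @ np.linalg.solve(Phi.T @ Phi, Phi.T @ X)

# Linear fixed-point solution (Section 2.2) and its Bellman error.
w = np.linalg.solve(Phi.T @ Phi - gamma * Phi.T @ P @ Phi, Phi.T @ R)
Q = Phi @ w
BE = R + gamma * P @ Q - Q

# Reward error (5a) and policy-conditional per-feature error (5b).
delta_R = R - proj(R)
delta_Phi = P @ Phi - proj(P @ Phi)

# Theorem 3: BE = Delta_R + gamma * Delta_Phi @ w.
assert np.allclose(BE, delta_R + gamma * delta_Phi @ w)
```

The assertion holds to machine precision because the decomposition is an exact algebraic identity at the linear fixed point, not an approximation.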
Since this Bellman error is orthogonal to the span of the existing features, there is no convenient way to approximate it, and the extension to off-sample states is not obvious. They used locally weighted regression with limited success, but the process was slow and prone to the usual perils of non-parametric approximators, such as high sensitivity to the distance function used.

One might hope to minimize (5a) and (5b) directly, perhaps using sampled states and next states, but this is not a straightforward optimization problem to solve in general, because the search space for Φ is the space of functions, and because Φ appears inconveniently on both sides of (5b), making it difficult to rearrange terms so as to solve for Φ as an optimization problem with a fixed target. Thus, without additional assumptions about how the states are initially encoded and what space of features will be searched, it is challenging to apply Theorem 3 directly. Our solution to this difficulty is to apply the theorem in a somewhat indirect manner: first, we assume that the input is a rich, raw feature set (e.g., images) and that the primary challenge is reducing the size of the feature set rather than constructing more elaborate features. Next, we restrict our search space for Φ to the space of linear encodings of these raw features. Finally, we require that these encoded features be predictive of next raw features rather than next encoded features. This approach differs from what Theorem 3 requires, but it results in an easier optimization problem and, as shown below, we are able to use Theorem 3 to show that this alternative condition is sufficient to represent the true value function.

We now present a theory of predictively optimal feature encoding. We refer to the features that ultimately are used by a linear value function approximation step using the familiar Φ notation, and we refer to the inputs before feature encoding as the raw features, A.
For n samples and l raw features, we can think of A as an nm × lm matrix. For every row in A, only the block corresponding to the action taken is non-zero. The raw features are operated on by an encoder:

Definition 4. The encoder, ℰπ (or Eπ in the linear case), is a transformation ℰπ(A) = Φ. We use the notation ℰπ because we think of it as encoding the raw state. When the encoder is linear, ℰπ = Eπ, where Eπ is an lm × km matrix that right multiplies A: AEπ = Φ.

We want to encode a reduced-size representation of the raw features sufficient to predict the next expected reward and raw features because, as proven below, doing so is a sufficient (though not necessary) condition for good linear value function approximation. Prediction of next raw features and rewards is done via a decoder, which is a matrix in this paper but could be non-linear in general:

Definition 5. The decoder, D, is a km × (lm + 1) matrix predicting [P^π A, R] from ℰπ(A).

This approach is distinct from the work of Parr et al. [4] for several reasons. We study a set of conditions on a reduced-size feature set, we study the relationship between the reduced feature set and the original features, and we provide an algorithm in the next section for constructing these features.

Definition 6. Φ = ℰπ(A) is predictively optimal with respect to A and π if there exists a Dπ such that ℰπ(A)Dπ = [P^π A, R].

3.1 Linear Encoder and Linear Decoder

In the linear case, a predictively optimal set of features satisfies:

$$A E_\pi D_\pi = A E_\pi [D^s_\pi, D^r_\pi] = [P^\pi A, R], \qquad (6)$$

where $D^s_\pi$ and $D^r_\pi$ represent the first lm columns and the last column of $D_\pi$, respectively.

Theorem 7. For any MDP M with predictively optimal Φ = AEπ for policy π, if the linear fixed point for Φ is Q̂^π, then BE(Q̂^π) = 0.

Proof: See Supplemental Materials.

3.2 Non-linear Encoder and Linear Decoder

One might expect that the results above generalize easily to the case where a more powerful encoder is used. This could correspond, for example, to a deep network with a linear output layer used for value function approximation. Surprisingly, the generalization is not straightforward:

Theorem 8. The existence of a non-linear encoder ℰ and linear decoder D such that ℰ(A)D = [P^π A, R] is not sufficient to ensure predictive optimality of Φ = ℰ(A).

Proof: See Supplemental Materials.

This negative result doesn't shut the door on combining non-linear encoders with linear decoders. Rather, it indicates that additional conditions beyond those needed in the linear case are required to ensure optimal encoding. For example, requiring that the encoded features lie in an invariant subspace of P^π [4] would be a sufficient condition (though of questionable practicality).

4 Iterative Learning of Policy and Encoder

In practice we do not have access to P^π, but we do have access to the raw feature representation of sampled states and sampled next states. To train the encoder Eπ and decoder Dπ, we sample states and next states from a data collection policy. When exploration is not the key challenge, this can be done with a single data collection run using a policy that randomly selects actions (as is often done with LSPI [22]). For larger problems, it may be desirable to collect additional samples as the policy changes.
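Definition 6 and Theorem 7 can be illustrated by engineering a toy problem in which the next raw features and rewards lie exactly in the span of the encoded features. In this sketch P is not constrained to be stochastic, which does not change the algebra, and all dimensions and names are illustrative:

```python
import numpy as np

rng = np.random.default_rng(2)

# Engineer predictively optimal features (Definition 6): choose P and R so
# that P A and R lie exactly in span(A E). Illustrative construction only.
N, L, k, gamma = 8, 5, 3, 0.9
A = rng.random((N, L))                 # raw features, full column rank w.h.p.
E = rng.random((L, k))                 # linear encoder
Phi = A @ E                            # encoded features

G = 0.1 * rng.random((k, L))
P = Phi @ G @ np.linalg.pinv(A)        # then P A = Phi G, since pinv(A) A = I
h = rng.random((k, 1))
R = Phi @ h                            # reward also lies in span(Phi)

# A decoder D with Phi D = [P A, R] exists by construction.
D = np.hstack([G, h])
assert np.allclose(Phi @ D, np.hstack([P @ A, R]))

# Theorem 7: the linear fixed point for Phi then has zero Bellman error.
w = np.linalg.solve(Phi.T @ Phi - gamma * Phi.T @ P @ Phi, Phi.T @ R)
Q = Phi @ w
assert np.allclose(R + gamma * P @ Q - Q, np.zeros((N, 1)))
```

Both assertions hold to machine precision: predictive optimality forces the reward error and per-feature error of Theorem 3 to vanish, so the Bellman error is exactly zero.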
These sampled states and next states are represented by matrices Ã and A′, respectively.

Theorem 7 suggests that, given a policy π, zero Bellman error can be achieved if features are encoded appropriately. Subsequently, the obtained features and resulting Q-functions can be used to update the policy, with an algorithm such as LSPI. In a manner similar to the policy update in LSPI, the non-zero blocks in A′ are changed accordingly after a new policy is learned. With the updated A′, we re-learn the encoder and then repeat the process, as summarized in Algorithm 1. It may be desirable to update the policy several times while estimating Q̂^π, since the encoded features may still be useful if the policy has not changed much. Termination conditions for this algorithm are typical approximate policy iteration termination conditions.

Algorithm 1 Iterative Learning of Encoder and Policy
  while Termination Conditions Not Satisfied do
    Learn the encoder Eπ and decoder Dπ
    Estimate Q̂^π
    Update the next raw state A′, by changing the position of non-zero blocks according to the greedy policy for Q̂^π
  end while

4.1 Learning Algorithm for Encoder

In our implementation, the encoder Eπ and decoder Dπ are jointly learned using Algorithm 2, which seeks to minimize ‖ÃEπDπ − [A′, R]‖_F by coordinate descent [23], where ‖X‖_F represents the Frobenius norm of a matrix X. Note that Ã can be constructed as a block-diagonal matrix, where each block corresponds to the samples from each action.
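This block structure is worth exploiting: the pseudoinverse of a block-diagonal matrix is simply the block-diagonal arrangement of the blocks' pseudoinverses, so each action's block can be handled on its own. A small NumPy check, with random illustrative blocks:

```python
import numpy as np

rng = np.random.default_rng(4)

# pinv(blkdiag(B1, B2)) equals blkdiag(pinv(B1), pinv(B2)), so the
# pseudoinverse of a block-diagonal matrix can be assembled blockwise.
B1 = rng.random((6, 3))    # e.g., samples for action 1
B2 = rng.random((5, 2))    # e.g., samples for action 2

def blkdiag(X, Y):
    """Place X and Y on the diagonal of a zero matrix."""
    Z = np.zeros((X.shape[0] + Y.shape[0], X.shape[1] + Y.shape[1]))
    Z[:X.shape[0], :X.shape[1]] = X
    Z[X.shape[0]:, X.shape[1]:] = Y
    return Z

full = np.linalg.pinv(blkdiag(B1, B2))
blockwise = blkdiag(np.linalg.pinv(B1), np.linalg.pinv(B2))
assert np.allclose(full, blockwise)
```

Working blockwise avoids forming and decomposing the full matrix, which matters when each block holds thousands of image-sized rows.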
Subsequently, the pseudoinverse of Ã in Algorithm 2 can be efficiently computed by operating on the pseudoinverse of each block in Ã.

Algorithm 2 alternately updates Eπ and Dπ until one of the following conditions is met: (1) the number of iterations reaches the maximum allowed; (2) the relative residual ‖ÃEπDπ − [A′, R]‖_F / ‖[A′, R]‖_F is below a threshold; (3) the current residual is greater than the previous residual. For regularization, we use the truncated singular value decomposition (SVD) [24] when taking the pseudoinverses of Ã, discarding all but the top k singular vectors in each block of Ã.

Algorithm 2 Linear Feature Discovery
  LINEARENCODERFEATURES(Ã, A′, R, k)
  Dπ ← rand(km, lm + 1)
  while Convergence Conditions Not Satisfied do
    Eπ ← Ã† [A′, R] Dπ†
    Dπ ← (ÃEπ)† [A′, R]
  end while
  return Eπ
(See text for termination conditions; rand represents samples from uniform [0, 1]; † is the (truncated) Moore-Penrose pseudoinverse.)

Algorithm 2 is based on a linear encoder and a linear decoder. Consequently, one may notice that the value function is also linear in the domain of the raw features, i.e., the value function can be represented as Q̂^π = ÃEπw = Ãw′ with w′ = Eπw. One may wonder why it is not better to solve for w′ directly, with regularization on w′. Although it is impractical to do this using batch linear value function approximation methods, due to the size of the feature matrix, one might argue that on-line approaches such as deep RL techniques approximate this approach by stochastic gradient descent.
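Algorithm 2's alternating updates can be sketched as follows. This is a simplified, single-block NumPy version, not the authors' MATLAB implementation: the per-action block structure is omitted and all dimensions are illustrative.

```python
import numpy as np

rng = np.random.default_rng(3)

def linear_encoder_features(A_tilde, A_next, R, k, iters=50):
    """Alternating least-squares sketch of Algorithm 2 (single-block version)."""
    target = np.hstack([A_next, R])
    D = rng.random((k, target.shape[1]))          # random initial decoder
    # Truncated pseudoinverse of A_tilde: keep only the top-k singular vectors,
    # the regularization described in the text.
    U, s, Vt = np.linalg.svd(A_tilde, full_matrices=False)
    A_pinv = Vt[:k].T @ np.diag(1.0 / s[:k]) @ U[:, :k].T
    prev = np.inf
    for _ in range(iters):
        E = A_pinv @ target @ np.linalg.pinv(D)   # encoder update
        D = np.linalg.pinv(A_tilde @ E) @ target  # decoder update
        resid = np.linalg.norm(A_tilde @ E @ D - target) / np.linalg.norm(target)
        if resid >= prev:                         # stop once the residual stalls
            break
        prev = resid
    return E, D

# Synthetic check: a rank-3 raw-feature matrix whose next features and rewards
# are exact linear functions of 3 encoded features is recovered exactly.
A_tilde = rng.random((40, 3)) @ rng.random((3, 10))
target = A_tilde @ rng.random((10, 3)) @ rng.random((3, 11))
E, D = linear_encoder_features(A_tilde, target[:, :10], target[:, 10:], k=3)
resid = np.linalg.norm(A_tilde @ E @ D - target) / np.linalg.norm(target)
```

On this synthetic problem the residual drops to machine precision in a single pass, because the target lies exactly in a rank-k subspace of the column space of Ã; real image data would leave a non-zero residual.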
To the extent this characterization of deep RL as stochastic gradient descent on w′ is accurate, it only increases the importance of having a clear understanding of feature encoding as an important sub-problem, since this is the natural interpretation of everything up to the final layer in such networks, and is even an explicit objective in some cases [7].

5 Experiments

The goal of our experiments is to show that the model of, and algorithms for, feature encoding presented above are practical and effective. The use of our encoder allows us to learn good policies using linear value function approximation on raw images, something that is not generally perceived to be easy to do. These experiments should be viewed as validating this approach to feature encoding, not as competing with deep RL methods, which are non-linear and use far greater computational resources.

We implement our proposed linear encoder-decoder model and, for comparison, the random projection model in Ghavamzadeh et al. [25]. We tested them on the Inverted Pendulum and Blackjack [26], two popular benchmark domains in RL. Our test framework creates raw features using images, where the elements in the non-zero block of Ã correspond to an image that has been converted to a vector by concatenating the rows of the image. For each problem, we run Algorithm 1 50 times independently to account for the randomness in the training data. Our training data are formed by running a simulation for the desired number of steps and choosing actions at random. For the encoder, the number of features k is selected over the validation set to achieve the best performance. All code is written in MATLAB and tested on a machine with a 3.1 GHz CPU and 8 GB RAM.
Our test results show that Algorithm 1 took at most half an hour to run for the inverted pendulum and blackjack problems.

To verify that the encoder is doing something interesting, rather than simply picking features from Ã, we also tried a greedy, sparse reinforcement learning algorithm, OMP-TD [5], using Ã as the candidate feature set. Our results, however, showed that OMP-TD's performance was much worse than that of the approach using the linear encoder. We omit further details on OMP-TD's performance for conciseness.

5.1 Inverted Pendulum

We used a version of the inverted pendulum adapted from Lagoudakis and Parr [22], a continuous control problem with 3 discrete actions, left, right, or nothing, corresponding to the force applied to a cart on an infinite rail upon which an inverted pendulum is mounted. The true state is described by two continuous variables, the angle and angular velocity of the pendulum. For the version of the problem used here, there is a reward of 0 for each time step the pendulum is balanced, and a penalty of −1 for allowing the pendulum to fall, after which the system enters an absorbing state with value 0. The discount factor is set to 0.95.

For the training data, we collected a desired number of trajectories with starting angle and angular velocity sampled uniformly on [−0.2, 0.2]. These trajectories were truncated after 100 steps if the pendulum had not already fallen. Algorithm 2 did not see the angle or angular velocity. Instead, the algorithm was given two successive, rendered, grayscale images of the pendulum. Each image has 35 × 52 pixels, and hence the raw state is a 35 × 52 × 2 = 3640-dimensional vector. To ensure that these two images are a Markovian representation of the state, it was necessary to modify the simulator. The original simulator integrated the effects of gravity and the force applied over the time step of the simulator.
This made the simulation more accurate, but it has the consequence that the change in angle between two successive time steps could differ from the angular velocity. We forced the angular velocity to match the change in angle per time step, thereby making the two successive images a Markovian state.

We compare the linear encoder with the features using radial basis functions (RBFs) in Lagoudakis and Parr [22], and the random projection in Ghavamzadeh et al. [25]. The learned policy was then evaluated 100 times to obtain the average number of balancing steps. Each episode is allowed to run for a maximum of 3000 steps. If a run achieves this maximum number, we count it as a success when computing the probability of success. We used k = 50 features for both the linear encoder and the random projection.

Figure 1 shows the results with means and 95% confidence intervals, given different numbers of training episodes, where Encoder-τ corresponds to the version of Algorithm 1 with τ changes in the encoder. We observe that at most of the points, our proposed encoder achieves better performance than RBFs and random projections, in terms of both balancing steps and the probability of success. This is a remarkable result because the RBFs had access to the underlying state, while the encoder was forced to discover an underlying state representation based upon the images. Moreover, Encoder-2 achieves slightly better performance than Encoder-1 at most of the testing points. We also notice that further increasing τ did not bring any obvious improvement in our tests.

Figure 1: (a) Number of balancing steps and (b) probability of success, vs. number of training episodes.

5.2 Blackjack

There are 203 states in this problem, so we can solve directly for the optimal value V* and the optimal policy π* explicitly.
The states 1-200 in this problem can be completely described by the ace status (usable or not), the player's current sum (12-21), and the dealer's one showing card (A-10). The terminal states 201-203 correspond to win, lose, and draw, respectively. We set k = 203 features for the linear encoder.

To represent raw states for the encoder, we use three concatenated sampled MNIST digits, and hence a raw state is a 28 × 28 × 3 = 2352-dimensional vector. Two examples of such raw states are shown in Figure 2. Note that the three terminal states are represented by "300", "400", and "500", respectively.

The training data are formed by executing the random policy for the desired number of episodes. Our evaluation metrics for a policy represented by the value V and the corresponding action a are

$$\text{Relative value error} = \|V - V^*\|_2 \, / \, \|V^*\|_2, \qquad \text{Action error} = \|a - a^*\|_0.$$

We compare the features discovered by the linear encoder and the random projection against indicator functions on the true state, since such indicator features should be the gold standard. We can make the encoder's and random projection's tasks more challenging by adding noise to the raw state. Although it is not guaranteed in general (Example 1), it sufficed to learn a single encoder that persisted across policies for this problem, so we report results for a single set of encoded features. We denote the algorithms using the linear encoder as Encoder-Image-κ and the algorithms using random projection as Random-Image-κ, where κ is the number of possible images used for each digit.
For example, κ = 10 means that the image for each digit is randomly selected from the first 10 images in the MNIST training dataset.

Figure 2: Two examples of the blackjack state rendered as three MNIST digits.

Figure 3 shows the surprising result that Encoder-Image-1 and Random-Image-1 achieve superior performance to indicator functions on the true state when the number of training episodes is less than or equal to 6000. In this case, the encoded state representation wound up having fewer than 203 effective parameters, because the SVD in the pseudoinverse found lower-dimensional structure that explained most of the variation and discarded the rest as noise, since the corresponding singular values were below threshold. This put the encoder on the favorable side of the bias-variance tradeoff when training data were scarce. When the number of training episodes becomes larger, the indicator function outperforms the linear encoder, which is consistent with its asymptotically optimal property. Furthermore, the performance of the encoder degrades as κ grows. This matches our expectation: a larger κ means that a state can be mapped to more possible digit images, and thus extracting features for the same state becomes more difficult.

Figure 3: (a) Relative value error and (b) action error, as functions of the number of training episodes. An additional plot for the actual return is provided in the Supplemental Materials.
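The truncation effect described above, where the SVD inside the pseudoinverse drops singular values below a threshold and thereby reduces the number of effective parameters, can be sketched as follows. This is a hypothetical illustration; the matrix, tolerance, and helper name are ours, not the paper's actual data or code:

```python
import numpy as np

def truncated_pinv(A, tol=1e-2):
    # Pseudoinverse via SVD, discarding singular values below tol.
    # Small singular values are treated as noise, so the least-squares
    # solution uses fewer effective parameters than A has columns.
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    keep = s > tol
    A_pinv = Vt[keep].T @ np.diag(1.0 / s[keep]) @ U[:, keep].T
    return A_pinv, int(keep.sum())

# A nearly rank-1 matrix: one dominant direction plus small "noise"
rng = np.random.default_rng(0)
u = rng.normal(size=(100, 1))
v = rng.normal(size=(1, 10))
A = u @ v + 1e-4 * rng.normal(size=(100, 10))

A_pinv, effective_rank = truncated_pinv(A, tol=1e-2)
print(effective_rank)  # only the dominant direction survives truncation
```

Here the noise singular values fall below the tolerance, so the truncated pseudoinverse retains a single direction even though the matrix has ten columns, mirroring how the encoder ended up with fewer than 203 effective parameters.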
Finally, we notice that our proposed encoder is more robust to noise than random projection: Encoder-Image-10 outperforms Random-Image-10 by remarkable margins, measured in both relative value error and action error.

6 Conclusions and Future Work

We provide a theory of feature encoding for reinforcement learning that offers guidance on how to reduce a rich, raw state to a lower-dimensional representation suitable for linear value function approximation. Our results are most compelling in the linear case, where we provide a framework and algorithm that enable linear value function approximation using a linear encoding of raw images. Although our framework aligns with practice for deep learning [7], our results indicate that future work is needed to elucidate the additional conditions required to extend the theory to guarantee good performance in the non-linear case.

Acknowledgements

We thank the anonymous reviewers for their helpful comments and suggestions. This research was supported in part by ARO, DARPA, DOE, NGA, ONR and NSF.

References

[1] G. Tesauro, “TD-Gammon, a self-teaching backgammon program, achieves master-level play,” Neural Computation, 1994.

[2] R. Parr, C. Painter-Wakefield, L. Li, and M. Littman, “Analyzing feature generation for value-function approximation,” in ICML, 2007.

[3] S. Mahadevan and M. Maggioni, “Proto-value functions: A Laplacian framework for learning representation and control in Markov decision processes,” JMLR, 2007.

[4] R. Parr, L. Li, G. Taylor, C. Painter-Wakefield, and M. L.
Littman, “An analysis of linear models, linear value-function approximation, and feature selection for reinforcement learning,” in ICML, 2008.

[5] C. Painter-Wakefield and R. Parr, “Greedy algorithms for sparse reinforcement learning,” in ICML, 2012.

[6] V. Mnih et al., “Human-level control through deep reinforcement learning,” Nature, 2015.

[7] J. Oh, X. Guo, H. Lee, R. L. Lewis, and S. Singh, “Action-conditional video prediction using deep networks in Atari games,” in NIPS, 2015.

[8] Y. Liang, M. C. Machado, E. Talvitie, and M. Bowling, “State of the art control of Atari games using shallow reinforcement learning,” in AAMAS, 2016.

[9] R. J. Williams and L. C. Baird III, “Tight performance bounds on greedy policies based on imperfect value functions,” Northeastern University, Tech. Rep., 1993.

[10] R. S. Sutton, “Learning to predict by the method of temporal differences,” Machine Learning, 1988.

[11] S. Bradtke and A. Barto, “Linear least-squares algorithms for temporal difference learning,” Machine Learning, 1996.

[12] H. Yu and D. P. Bertsekas, “Convergence results for some temporal difference methods based on least squares,” IEEE TAC, 2009.

[13] A. Geramifard, T. J. Walsh, N. Roy, and J. How, “Batch iFDD: A scalable matching pursuit algorithm for solving MDPs,” in UAI, 2013.

[14] A. M. Farahmand and D. Precup, “Value pursuit iteration,” in NIPS, 2012.

[15] R. Tibshirani, “Regression shrinkage and selection via the Lasso,” JRSSB, 1996.

[16] S. G. Mallat and Z. Zhang, “Matching pursuits with time-frequency dictionaries,” IEEE TSP, 1993.

[17] J. Z. Kolter and A. Y. Ng, “Regularization and feature selection in least-squares temporal difference learning,” in ICML, 2009.

[18] M. Petrik, G. Taylor, R. Parr, and S.
Zilberstein, “Feature selection using regularization in approximate linear programs for Markov decision processes,” in ICML, 2010.

[19] J. Johns, C. Painter-Wakefield, and R. Parr, “Linear complementarity for regularized policy evaluation and improvement,” in NIPS, 2010.

[20] R. Schoknecht, “Optimality of reinforcement learning algorithms with linear function approximation,” in NIPS, 2002.

[21] R. S. Sutton, C. Szepesvári, A. Geramifard, and M. H. Bowling, “Dyna-style planning with linear function approximation and prioritized sweeping,” in UAI, 2008.

[22] M. Lagoudakis and R. Parr, “Least-squares policy iteration,” JMLR, 2003.

[23] S. Boyd and L. Vandenberghe, Convex Optimization. Cambridge University Press, 2004.

[24] P. C. Hansen, “The truncated SVD as a method for regularization,” BIT Numerical Mathematics, 1987.

[25] M. Ghavamzadeh, A. Lazaric, O. Maillard, and R. Munos, “LSTD with random projections,” in NIPS, 2010.

[26] R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction. The MIT Press, 1998.