{"title": "Learning Reward Machines for Partially Observable Reinforcement Learning", "book": "Advances in Neural Information Processing Systems", "page_first": 15523, "page_last": 15534, "abstract": "Reward Machines (RMs), originally proposed for specifying problems in Reinforcement Learning (RL), provide a structured, automata-based representation of a reward function that allows an agent to decompose problems into subproblems that can be efficiently learned using off-policy learning. Here we show that RMs can be learned from experience, instead of being specified by the user, and that the resulting problem decomposition can be used to effectively solve partially observable RL problems. We pose the task of learning RMs as a discrete optimization problem where the objective is to find an RM that decomposes the problem into a set of subproblems such that the combination of their optimal memoryless policies is an optimal policy for the original problem. We show the effectiveness of this approach on three partially observable domains, where it significantly outperforms A3C, PPO, and ACER, and discuss its advantages, limitations, and broader potential.", "full_text": "Learning Reward Machines for Partially\n\nObservable Reinforcement Learning\n\nRodrigo Toro Icarte\u2217\nUniversity of Toronto\n\nVector Institute\n\nEthan Waldie\n\nUniversity of Toronto\n\nRichard Valenzano\n\nElement AI\n\nMargarita P. Castro\nUniversity of Toronto\n\nToryn Q. Klassen\nUniversity of Toronto\n\nVector Institute\n\nSheila A. McIlraith\nUniversity of Toronto\n\nVector Institute\n\nAbstract\n\nReward Machines (RMs) provide a structured, automata-based representation of a\nreward function that enables a Reinforcement Learning (RL) agent to decompose\nan RL problem into structured subproblems that can be ef\ufb01ciently learned via\noff-policy learning. 
Here we show that RMs can be learned from experience,\ninstead of being specified by the user, and that the resulting problem decomposition\ncan be used to effectively solve partially observable RL problems. We pose the\ntask of learning RMs as a discrete optimization problem where the objective is\nto find an RM that decomposes the problem into a set of subproblems such that\nthe combination of their optimal memoryless policies is an optimal policy for the\noriginal problem. We show the effectiveness of this approach on three partially\nobservable domains, where it significantly outperforms A3C, PPO, and ACER, and\ndiscuss its advantages, limitations, and broader potential.\u00b9\n\n1 Introduction\n\nThe use of neural networks for function approximation has led to many recent advances in\nReinforcement Learning (RL). Such deep RL methods have allowed agents to learn effective policies in\nmany complex environments, including board games [30], video games [23], and robotic systems [2].\nHowever, RL methods (including deep RL methods) often struggle when the environment is partially\nobservable. This is because agents in such environments usually require some form of memory to\nlearn optimal behaviour [31]. Recent approaches for giving memory to an RL agent rely on either\nrecurrent neural networks [24, 15, 37, 29] or memory-augmented neural networks [25, 18].\nIn this work, we show that Reward Machines (RMs) are another useful tool for providing memory in\na partially observable environment. RMs were originally conceived to provide a structured,\nautomata-based representation of a reward function [33, 4, 14, 39]. The exposed structure can be exploited by the\nQ-Learning for Reward Machines (QRM) algorithm [33], which simultaneously learns a separate\npolicy for each state in the RM. QRM has been shown to outperform standard and hierarchical deep\nRL over a variety of discrete and continuous domains.\n
However, QRM was only defined for fully\nobservable environments. Furthermore, the RMs were handcrafted.\nIn this paper, we propose a method for learning an RM directly from experience in a partially\nobservable environment, in a manner that allows the RM to serve as memory for an RL algorithm.\nA requirement is that the RM learning method be given a finite set of detectors for properties that\nserve as the vocabulary for the RM. We characterize an objective for RM learning that allows us to\nformulate the task as a discrete optimization problem and propose an efficient local search approach\nto solve it. By simultaneously learning an RM and a policy for the environment, we are able to\nsignificantly outperform several deep RL baselines that use recurrent neural networks as memory in\nthree partially observable domains. We also extend QRM to the case of partial observability where\nwe see further gains when combined with our RM learning method.\n\n\u2217 Correspondence to: Rodrigo Toro Icarte <rntoro@cs.toronto.edu>.\n\u00b9 Our code is available at https://bitbucket.org/RToroIcarte/lrm.\n\n33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.\n\n2 Preliminaries\n\nRL agents learn policies from experience. When the problem is fully-observable, the underlying\nenvironment model is typically assumed to be a Markov Decision Process (MDP). An MDP is a tuple\nM = \u27e8S, A, r, p, \u03b3\u27e9, where S is a finite set of states, A is a finite set of actions, r : S \u00d7 A \u2192 \u211d is the\nreward function, p(s, a, s\u2032) is the transition probability distribution, and \u03b3 is the discount factor. The\nagent starts not knowing what r or p are. At every time step t, the agent observes the current state\ns_t \u2208 S and executes an action a_t \u2208 A following a policy \u03c0(a_t|s_t). As a result, the state s_t changes\nto s_{t+1} \u223c p(s_{t+1}|s_t, a_t) and the agent receives a reward signal r(s_t, a_t).\n
The goal is to learn the\noptimal policy \u03c0\u2217, which maximizes the future expected discounted reward for every state in S [32].\nQ-learning [38] is a well-known RL algorithm that uses samples of experience of the form\n(s_t, a_t, r_t, s_{t+1}) to estimate the optimal q-function q\u2217(s, a). Here, q\u2217(s, a) is the expected return of\nselecting action a in state s and following an optimal policy \u03c0\u2217. Deep RL methods like DQN [23]\nand DDQN [35] represent the q-function as q\u0303_\u03b8(s, a), where q\u0303_\u03b8 is a neural network whose inputs are\nfeatures of the state and action, and whose weights \u03b8 are updated using stochastic gradient descent.\nIn partially observable problems, the underlying environment model is typically assumed to\nbe a Partially Observable Markov Decision Process (POMDP). A POMDP is a tuple P_O =\n\u27e8S, O, A, r, p, \u03c9, \u03b3\u27e9, where S, A, r, p, and \u03b3 are defined as in an MDP, O is a finite set of\nobservations, and \u03c9(s, o) is the observation probability distribution. At every time step t, the agent is\nin exactly one state s_t \u2208 S, executes an action a_t \u2208 A, receives reward r_t = r(s_t, a_t), and moves to\nstate s_{t+1} according to p(s_t, a_t, s_{t+1}). However, the agent does not observe s_{t+1}, but only receives\nan observation o_{t+1} \u2208 O. This observation provides the agent a clue about what the state s_{t+1} \u2208 S is\nvia \u03c9. In particular, \u03c9(s_{t+1}, o_{t+1}) is the probability of observing o_{t+1} from state s_{t+1} [5].\nRL methods cannot be immediately applied to POMDPs because the transition probabilities and\nreward function are not necessarily Markovian w.r.t. O (though by definition they are w.r.t. S). As\nsuch, optimal policies may need to consider the complete history o_0, a_0, . . . , a_{t\u22121}, o_t of observations\nand actions when selecting the next action.\n
Several partially observable RL methods use a recurrent\nneural network to compactly represent the history, and then use a policy gradient method to train it.\nHowever, when we do have access to a full POMDP model P_O, then the history can be summarized\ninto a belief state. A belief state is a probability distribution b_t : S \u2192 [0, 1] over S, such that b_t(s) is\nthe probability that the agent is in state s \u2208 S given the history up to time t. The initial belief state is\ncomputed using the initial observation o_0: b_0(s) \u221d \u03c9(s, o_0) for all s \u2208 S. The belief state b_{t+1} is\nthen determined from the previous belief state b_t, the executed action a_t, and the resulting observation\no_{t+1} as b_{t+1}(s\u2032) \u221d \u03c9(s\u2032, o_{t+1}) \u2211_{s\u2208S} p(s, a_t, s\u2032) b_t(s) for all s\u2032 \u2208 S. Since the state transitions and\nreward function are Markovian w.r.t. b_t, the set of all belief states B can be used to construct the\nbelief MDP M_B. Optimal policies for M_B are also optimal for the POMDP [5].\n\n3 Reward Machines for Partially Observable Environments\n\nIn this section, we define RMs for the case of partial observability. We use the following problem as\na running example to help explain various concepts.\nExample 3.1 (The cookie domain). The cookie domain, shown in Figure 1a, has three rooms\nconnected by a hallway. The agent (purple triangle) can move in the four cardinal directions. There\nis a button in the yellow room that, when pressed, causes a cookie to randomly appear in the red or\nblue room. The agent receives a reward of +1 for reaching (and thus eating) the cookie and may\nthen go and press the button again. Pressing the button before reaching a cookie will move it to a\nrandom location. There is no cookie at the beginning of the episode.\n
This is a partially observable\nenvironment since the agent can only see what is in the room that it currently occupies.\n\nFigure 1: Partially observable environments where the agent can only see what is in the current room. Panels: (a) Cookie domain, (b) Symbol domain, (c) 2-keys domain.\n\nRMs are finite state machines that are used to encode a reward function [33]. In the case of partial\nobservability, they are defined over a set of propositional symbols P that correspond to a set of\nhigh-level features that the agent can detect using a labelling function L : O_\u2205 \u00d7 A_\u2205 \u00d7 O \u2192 2^P\nwhere (for any set X) X_\u2205 \u225c X \u222a {\u2205}. L assigns truth values to symbols in P given an environment\nexperience e = (o, a, o\u2032) where o\u2032 is the observation seen after executing action a when observing o.\nWe use L(\u2205, \u2205, o) to assign truth values to the initial observation. We call a truth value assignment\nof P an abstract observation because it provides a high-level view of the low-level environment\nobservations via the labelling function L. A formal definition of an RM follows:\nDefinition 3.1 (reward machine). Given a set of propositional symbols P, a Reward Machine is\na tuple R_P = \u27e8U, u_0, \u03b4_u, \u03b4_r\u27e9 where U is a finite set of states, u_0 \u2208 U is an initial state, \u03b4_u is the\nstate-transition function, \u03b4_u : U \u00d7 2^P \u2192 U, and \u03b4_r is the reward-transition function, \u03b4_r : U \u00d7 2^P \u2192 \u211d.\nRMs decompose problems into a set of high-level states U and define transitions using if-like\nconditions defined by \u03b4_u. These conditions are over a set of binary properties P that the agent can\ndetect using L. For example, in the cookie domain, P contains one symbol per room color together with\nthe symbols [cookie], [button], and [cookie eaten]. These properties\nare true (i.e., part of an experience label according to L) in the following situations:\na room-color symbol is true if the agent ends the experience in a room of that color;\n[cookie] is true if the agent ends the experience in the same room as a cookie;\n[button] is true if the agent pushed the button with its last action; and\n[cookie eaten] is true if the agent ate a cookie with its last action (by moving onto the space where the cookie was).\nFigure 2 shows three possible RMs for the cookie domain. They all define the same reward signal (1\nfor eating a cookie and 0 otherwise) but differ in their states and transitions. As a result, they differ\nwith respect to the amount of information about the current domain state that can be inferred from the\ncurrent RM state, as we will see below.\nEach RM starts in the initial state u_0. Edge labels in the figures provide a visual representation of the\nfunctions \u03b4_u and \u03b4_r. For example, a label of the form \u27e8[room color] [cookie eaten], 1\u27e9 between state u_2 and u_0 in Figure 2b represents\n\u03b4_u(u_2, {[room color], [cookie eaten]}) = u_0 and \u03b4_r(u_2, {[room color], [cookie eaten]}) = 1. Intuitively, this means that if the RM is in state u_2\nand the agent\u2019s experience ended in the room of that color immediately after eating the cookie, then the agent\nwill receive a reward of 1 and the RM will transition to u_0. Notice that any properties not listed\nin the label are false (e.g., a symbol that does not appear in the label above must be false for that transition to be taken). We also use\nmultiple labels separated by a semicolon (e.g., \u201c\u27e8[label], 0\u27e9; \u27e8[label], 0\u27e9\u201d) to describe different conditions\nfor transitioning between the RM states, each with their own associated reward. The label \u27e8o/w, r\u27e9\n(\u201co/w\u201d for \u201cotherwise\u201d) on an edge from u_i to u_j means that that transition will be made (and reward\nr received) if none of the other transitions from u_i can be taken.\nLet us illustrate the behaviour of an RM using the one shown in Figure 2c.\n
The RM will stay in\nu_0 until the agent presses the button (causing a cookie to appear), whereupon the RM moves to u_1.\nFrom u_1 the RM may move to u_2 or u_3 depending on whether the agent finds a cookie when it enters\nanother room. It is also possible to associate meaning with being in RM states: u_0 means that there is\nno cookie available, u_1 means that there is a cookie in some room (either blue or red), etc.\nWhen learning a policy for a given RM, one simple technique is to learn a policy \u03c0(o, u) that considers\nthe current observation o \u2208 O and the current RM state u \u2208 U. Interestingly, a partially observable\nproblem might be non-Markovian over O, but Markovian over O \u00d7 U for some RM R_P. This is the\ncase for the cookie domain with the RM from Figure 2c, for example.\n\nFigure 2: Three possible Reward Machines for the Cookie domain. Panels: (a) Naive RM, (b) \u201cOptimal\u201d RM, (c) Perfect RM.\n\nQ-Learning for RMs (QRM) is another way to learn a policy by exploiting a given RM [33]. QRM\nlearns one q-function q\u0303_u (i.e., policy) per RM state u \u2208 U. Then, given any sample experience,\nthe RM can be used to emulate how much reward would have been received had the RM been in\nany one of\n
its states. Formally, experience e = (o, a, o\u2032) can be transformed into a valid experience\n(\u27e8o, u\u27e9, a, \u27e8o\u2032, u\u2032\u27e9, r) used for updating q\u0303_u for each u \u2208 U, where u\u2032 = \u03b4_u(u, L(e)) and r =\n\u03b4_r(u, L(e)). Hence, any off-policy learning method can take advantage of these \u201csynthetically\u201d\ngenerated experiences to update all subpolicies simultaneously.\nWhen tabular q-learning is used, QRM is guaranteed to converge to an optimal policy on\nfully-observable problems [33]. However, in a partially observable environment, an experience e might\nbe more or less likely depending on the RM state that the agent was in when the experience was\ncollected. For example, experience e might be possible in one RM state u_i but not in RM state u_j.\nThus, updating the policy for u_j using e, as QRM does, would introduce an unwanted bias to q\u0303_{u_j}. We\nwill discuss how to (partially) address this problem in \u00a75.\n\n4 Learning Reward Machines from Traces\n\nOur overall idea is to search for an RM that can be used as external memory by an agent for a given\ntask. As input, our method will only take a set of high-level propositional symbols P, and a labelling\nfunction L that can detect them. Then, the key question is what properties such an RM should have.\nThree proposals naturally emerge from the literature. The first comes from the work on learning\nFinite State Machines (FSMs) [3, 40, 10], which suggests learning the smallest RM that correctly\nmimics the external reward signal given by the environment, as in Giantamidis and Tripakis\u2019 method\nfor learning Moore Machines [10]. Unfortunately, such approaches would learn RMs of limited\nutility, like the one in Figure 2a. This naive RM correctly predicts reward in the cookie domain (i.e.,\n+1 for eating a cookie, zero otherwise) but provides no memory in support of solving the task.\nThe second proposal comes from the literature on learning Finite State Controllers (FSC) [22] and on\nmodel-free RL methods [32]. This work suggests looking for the RM whose optimal policy receives\nthe most reward. For instance, the RM from Figure 2b is \u201coptimal\u201d in this sense. It decomposes the\nproblem into three states. The optimal policy for u_0 goes directly to press the button, the optimal\npolicy for u_1 goes to the blue room and eats the cookie if present, and the optimal policy for u_2 goes\nto the red room and eats the cookie. Together, these three policies give rise to an optimal policy for\nthe complete problem. This is a desirable property for RMs, but requires computing optimal policies\nin order to compare the relative quality of RMs, which seems prohibitively expensive. However, we\nbelieve that finding ways to efficiently learn \u201coptimal\u201d RMs is a promising direction for future work.\nFinally, the third proposal comes from the literature on Predictive State Representations (PSR)\n[20], Deterministic Markov Models (DMMs) [21], and model-based RL [16]. These works suggest\nlearning the RM that remembers sufficient information about the history to make accurate Markovian\npredictions about the next observation. For instance, the cookie domain RM shown in Figure 2c\nis perfect w.r.t. this criterion. Intuitively, every transition in the cookie environment is already\nMarkovian except for transitioning from one room to another. Depending on different factors, when\nentering the red room there could be a cookie there (or not).\n
The perfect RM is able to encode\nsuch information using 4 states: when at u_0 the agent knows that there is no cookie, at u_1 the agent\nknows that there is a cookie in the blue or the red room, at u_2 the agent knows that there is a cookie\nin the red room, and at u_3 the agent knows that there is a cookie in the blue room. Since keeping\ntrack of more information will not result in better predictions, this RM is perfect. Below, we develop\na theory about perfect RMs and describe an approach to learn them.\n\n4.1 Perfect Reward Machines: Formal Definition and Properties\n\nThe key insight behind perfect RMs is to use their states U and transitions \u03b4_u to keep track of relevant\npast information such that the partially observable environment P_O becomes Markovian w.r.t. O \u00d7 U.\nDefinition 4.1 (perfect reward machine). An RM R_P = \u27e8U, u_0, \u03b4_u, \u03b4_r\u27e9 is considered perfect for a\nPOMDP P_O = \u27e8S, O, A, r, p, \u03c9, \u03b3\u27e9 with respect to a labelling function L if and only if for every\ntrace o_0, a_0, . . . , o_t, a_t generated by any policy over P_O, the following holds:\n\nPr(o_{t+1}, r_t | o_0, a_0, . . . , o_t, a_t) = Pr(o_{t+1}, r_t | o_t, x_t, a_t)    (1)\n\nwhere x_0 = u_0 and x_t = \u03b4_u(x_{t\u22121}, L(o_{t\u22121}, a_{t\u22121}, o_t)).\n\nTwo interesting properties follow from Definition 4.1. First, if the set of belief states B for the\nPOMDP P_O is finite, then there exists a perfect RM for P_O with respect to some L. Second, the\noptimal policies for perfect RMs are also optimal for the POMDP (see supplementary material \u00a73).\nTheorem 4.1. Given any POMDP P_O with a finite reachable belief space, there will always exist at\nleast one perfect RM for P_O with respect to some labelling function L.\nTheorem 4.2. Let R_P be a perfect RM for a POMDP P_O w.r.t. a labelling function L, then any\noptimal policy for R_P w.r.t.\n
the environmental reward is also optimal for P_O.\n\n4.2 Perfect Reward Machines: How to Learn Them\n\nWe now consider the problem of learning a perfect RM from traces, assuming one exists w.r.t. the\ngiven labelling function L. Recall that a perfect RM transforms the original problem into a Markovian\nproblem over O \u00d7 U. Hence, we should prefer RMs that accurately predict the next observation\no\u2032 and immediate reward r from the current observation o, RM state u, and action a. This might\nbe achieved by collecting a training set of traces from the environment, fitting a predictive model\nfor Pr(o\u2032, r | o, u, a), and picking the RM that makes better predictions. However, this can be very\nexpensive, especially considering that the observations might be images.\nInstead, we propose an alternative that focuses on a necessary condition for a perfect RM: the RM\nmust predict what is possible and impossible in the environment at the abstract level. For example,\nit is impossible to be at u_3 in the RM from Figure 2c and make the abstract observation {[red room], [cookie]},\nbecause the RM reaches u_3 only if the cookie was seen in the blue room or seen not to be in the red room.\nThis idea is formalized in the optimization model LRM. Let \ud835\udcaf = {\ud835\udcaf_0, . . . , \ud835\udcaf_n} be a set of traces,\nwhere each trace \ud835\udcaf_i is a sequence of observations, actions, and rewards:\n\n\ud835\udcaf_i = (o_{i,0}, a_{i,0}, r_{i,0}, . . . , a_{i,t_i\u22121}, r_{i,t_i\u22121}, o_{i,t_i}).    (2)\n\nWe now look for an RM \u27e8U, u_0, \u03b4_u, \u03b4_r\u27e9 that can be used to predict L(e_{i,t+1}) from L(e_{i,t}) and the\ncurrent RM state x_{i,t}, where e_{i,t+1} is the experience (o_{i,t}, a_{i,t}, o_{i,t+1}) and e_{i,0} is (\u2205, \u2205, o_{i,0}) by\ndefinition. The model parameters are the set of traces \ud835\udcaf, the set of propositional symbols P, the\nlabelling function L, and a maximum number of states in the RM u_max. The model also uses the sets\nI = {0 . . . n} and T_i = {0 . . . t_i \u2212 1}, where I contains the indices of the traces and T_i their time\nsteps.\n
The model has two auxiliary variables x_{i,t} and N_{u,l}. Variable x_{i,t} \u2208 U represents the state of\nthe RM after observing trace \ud835\udcaf_i up to time t. Variable N_{u,l} \u2286 2^{2^P} is the set of all the next abstract\nobservations seen from the RM state u and the abstract observation l at some point in \ud835\udcaf. In other\nwords, l\u2032 \u2208 N_{u,l} iff u = x_{i,t}, l = L(e_{i,t}), and l\u2032 = L(e_{i,t+1}) for some trace \ud835\udcaf_i and time t.\n\nminimize over \u27e8U, u_0, \u03b4_u, \u03b4_r\u27e9:  \u2211_{i\u2208I} \u2211_{t\u2208T_i} log(|N_{x_{i,t}, L(e_{i,t})}|)    (LRM)\ns.t.  \u27e8U, u_0, \u03b4_u, \u03b4_r\u27e9 \u2208 R_P    (3)\n      |U| \u2264 u_max    (4)\n      x_{i,t} \u2208 U    \u2200i \u2208 I, t \u2208 T_i \u222a {t_i}    (5)\n      x_{i,0} = u_0    \u2200i \u2208 I    (6)\n      x_{i,t+1} = \u03b4_u(x_{i,t}, L(e_{i,t+1}))    \u2200i \u2208 I, t \u2208 T_i    (7)\n      N_{u,l} \u2286 2^{2^P}    \u2200u \u2208 U, l \u2208 2^P    (8)\n      L(e_{i,t+1}) \u2208 N_{x_{i,t}, L(e_{i,t})}    \u2200i \u2208 I, t \u2208 T_i    (9)\n\nConstraints (3) and (4) ensure that we find a well-formed RM over P with at most u_max states.\nConstraints (5), (6), and (7) ensure that x_{i,t} is equal to the current state of the RM, starting from u_0\nand following \u03b4_u. Constraints (8) and (9) ensure that the sets N_{u,l} contain every L(e_{i,t+1}) that has\nbeen seen right after l and u in \ud835\udcaf. The objective function comes from maximizing the log-likelihood\nfor predicting L(e_{i,t+1}) using a uniform distribution over all the possible options given by N_{u,l}.\nA key property of this formulation is that any perfect RM is optimal w.r.t. the objective function in\nLRM when the number of traces tends to infinity (see supplementary material \u00a73):\nTheorem 4.3.\n
When the number of training traces (and their lengths) tends to infinity and the traces are collected by\na policy such that \u03c0(a|o) > \u03b5 for all o \u2208 O and a \u2208 A, any perfect RM with respect to L and at most\nu_max states will be an optimal solution to the formulation LRM.\n\nFinally, note that the definition of a perfect RM does not impose conditions over the rewards associated\nwith the RM (i.e., \u03b4_r). This is why \u03b4_r is a free variable in the model LRM. However, we still expect \u03b4_r\nto model the external reward signals given by the environment. To do so, we estimate \u03b4_r(u, l) using\nits empirical expectation over \ud835\udcaf (as commonly done when constructing belief MDPs [5]).\n\n4.3 Searching for a Perfect Reward Machine Using Tabu Search\n\nWe now describe the specific optimization technique used to solve LRM. We experimented with many\ndiscrete optimization approaches (including mixed integer programming [6], Benders\ndecomposition [8], and evolutionary algorithms [17], among others) and found local search algorithms [1] to be\nthe most effective at finding high quality RMs given short time limits. In particular, we use Tabu\nsearch [11], a simple and versatile local search procedure with convergence guarantees and many\nsuccessful applications in the literature [36]. We also include our unsuccessful mixed integer linear\nprogramming model for LRM in the supplementary material \u00a71.\nIn the context of our work, Tabu search starts from a random RM and, on each iteration, it evaluates all\n\u201cneighbouring\u201d RMs. We define the neighbourhood of an RM as the set of RMs that differ by exactly\none transition (i.e., removing/adding a transition, or changing its value) and evaluate RMs using the\nobjective function of LRM. When all neighbouring RMs are evaluated, the algorithm chooses the one\nwith the lowest objective value and sets it as the current RM.\n
To avoid local minima, Tabu search maintains a\nTabu list of all the RMs that were previously used as the current RM. Then, RMs in the Tabu list are\npruned when examining the neighbourhood of the current RM.\n\n5 Simultaneously Learning a Reward Machine and a Policy\n\nWe now describe our overall approach to simultaneously finding an RM and exploiting that RM to\nlearn a policy. The complete pseudo-code can be found in the supplementary material (Algorithm 1).\nOur approach starts by collecting a training set of traces \ud835\udcaf generated by a random policy during t_w\n\u201cwarmup\u201d steps. This set of traces is used to find an initial RM R using Tabu search. The algorithm\nthen initializes policy \u03c0, sets the RM state to the initial state u_0, and sets the current label l to the\ninitial abstract observation L(\u2205, \u2205, o). The standard RL learning loop is then followed: an action\na is selected following \u03c0(o, u) where u is the current RM state, and the agent receives the next\nobservation o\u2032 and the immediate reward r. The RM state is then updated to u\u2032 = \u03b4_u(u, L(o, a, o\u2032))\nand the last experience (\u27e8o, u\u27e9, a, r, \u27e8o\u2032, u\u2032\u27e9) is used by any RL method of choice to update \u03c0. Note\nthat in an episodic task, the environment and RM are reset whenever a terminal state is reached.\nIf on any step, there is evidence that the current RM might not be the best one, our approach will\nattempt to find a new one. Recall that the RM R was selected using the cardinality of its prediction\nsets N (LRM). Hence, if the current abstract observation l\u2032 is not in N_{u,l}, adding the current trace to \ud835\udcaf\nwill increase the size of N_{u,l} for R. As such, the cost of R will increase and it may no longer be the\nbest RM. Thus, if l\u2032 \u2209 N_{u,l}, we add the current trace to \ud835\udcaf and search for a new RM.\n
Recall that we\nuse Tabu search, though any discrete optimization method could be applied. Our method only uses\nthe new RM if its cost is lower than R\u2019s. If the RM is updated, a new policy is learned from scratch.\nGiven the current RM, we can use any RL algorithm to learn a policy \u03c0(o, u), by treating the\ncombination of o and u as the current state. If the RM is perfect, then the optimal policy \u03c0\u2217(o, u) will\nalso be optimal for the original POMDP (as stated in Theorem 4.2). However, to exploit the problem\nstructure exposed by the RM, we can use the QRM algorithm.\n\nFigure 3: Total reward collected every 10,000 training steps. Panels: Cookie domain, Symbol domain, 2-keys domain. Axes: training steps (x), reward (y). Legend: DDQN, A3C, PPO, ACER, LRM + DDQN, LRM + DQRM, Optimal.\n\nAs explained in \u00a73, standard QRM under partial observability can introduce a bias because an\nexperience e = (o, a, o\u2032) might be more or less likely depending on the RM state that the agent was\nin when the experience was collected. We partially address this issue by updating q\u0303_u using (o, a, o\u2032)\nif and only if L(o, a, o\u2032) \u2208 N_{u,l}, where l was the current abstract observation that generated the\nexperience (o, a, o\u2032). Hence, we do not transfer experiences from u_i to u_j if the current RM does not\nbelieve that (o, a, o\u2032) is possible in u_j. For example, consider the cookie domain and the perfect RM\nfrom Figure 2c. If some experience consists of entering the red room and seeing a cookie, then\nthis experience will not be used by states u_0 and u_3 as it is impossible to observe a cookie in the red\nroom from those states. Note that adding this rule may work in many cases, but it will not address\nthe problem in all environments (more discussion in \u00a77).\n
We consider addressing this problem as an\ninteresting area for future work.\n\n6 Experimental Evaluation\n\nWe tested our approach on three partially observable grid domains (Figure 1). The agent can move\nin the four cardinal directions and can only see what is in the current room. These are stochastic\ndomains where the outcome of an action randomly changes with a 5% probability.\nThe first environment is the cookie domain (Figure 1a) described in \u00a73. Each episode is 5,000 steps\nlong, during which the agent should attempt to get as many cookies as possible.\nThe second environment is the symbol domain (Figure 1b). It has three symbols \u2663, \u2660, and \u29eb in the\nred and blue rooms. One symbol from {\u2663, \u2660, \u29eb} and possibly a right or left arrow are randomly\nplaced in the yellow room. Intuitively, that symbol and arrow tell the agent where to go, e.g., \u2663 and\n\u2192 tell the agent to go to \u2663 in the east room. If there is no arrow, the agent can go to the target symbol\nin either room. An episode ends when the agent reaches any symbol in the red or blue room, at which\npoint it receives a reward of +1 if it reached the correct symbol and \u22121 otherwise.\nThe third environment is the 2-keys domain (Figure 1c). The agent receives a reward of +1 when\nit reaches the coffee (in the yellow room). To do so, it must open the two doors (shown in brown).\nEach door requires a different key to open it, and the agent can only carry one key at a time. Initially,\nthe two keys are randomly located in either the blue room, the red room, or split between them.\nWe tested two versions of our Learned Reward Machine (LRM) approach: LRM+DDQN and\nLRM+DQRM. Both learn an RM from experience as described in \u00a74.2, but LRM+DDQN learns\na policy using DDQN [35] while LRM+DQRM uses the modified version of QRM described in\n\u00a75.\n
In all domains, we used umax = 10, tw = 200,000, an epsilon-greedy policy with \u03b5 = 0.1, and\na discount factor \u03b3 = 0.9. The size of the Tabu list and the number of steps that the Tabu search\nperforms before returning the best RM found are both 100. We compared against four baselines: DDQN [35],\nA3C [24], ACER [37], and PPO [29] using the OpenAI baseline implementations [12]. DDQN uses\nthe concatenation of the last 10 observations as input, which gives DDQN a limited memory to better\nhandle the domains. A3C, ACER, and PPO use an LSTM to summarize the history. Note that the\noutput of the labelling function was also given to the baselines. Details on the hyperparameters and\nnetworks can be found in the supplementary material \u00a74.\nFigure 3 shows the total reward that each approach collects every 10,000 training steps and\ncompares it to that of the optimal policy. For the LRM algorithms, the figure shows the median performance\nover 30 runs per domain, and the 25th to 75th percentiles in the shaded area. For the DDQN baseline, we\nshow the maximum performance seen for each time period over 5 runs per problem. Similarly, we\nalso show the maximum performance over the 30 runs of A3C, ACER, and PPO per period. All the\nbaselines outperformed a random policy, but none made much progress on any of the domains.\nFurthermore, LRM approaches largely outperform all the baselines, reaching close-to-optimal policies\nin the cookie and symbol domains. We also note that LRM+DQRM learns faster than LRM+DDQN,\nbut is more unstable. In particular, LRM+DQRM converged to a considerably better policy than\nLRM+DDQN in the 2-keys domain. 
We believe this is due to QRM\u2019s experience sharing mechanism,\nwhich allows for propagating sparse reward backwards faster (see supplementary material \u00a74.3).\nA key factor in the strong performance of the LRM approaches is that Tabu search finds high-quality\nRMs in fewer than 100 local search steps (Figure 5, supplementary material). In fact, our results show\nthat Tabu search finds perfect RMs in most runs, in particular when tested over the symbol domain.\n\n7 Discussion, Limitations, and Broader Potential\n\nSolving partially observable RL problems is challenging, and LRM was able to solve three problems that\nwere conceptually simple but presented a major challenge to A3C, ACER, and PPO with LSTM-based\nmemories. A key idea behind these results was to optimize over a necessary condition for perfect RMs.\nThis objective favors RMs that are able to predict possible and impossible future observations at the\nabstract level given by the labelling function L. In this section, we discuss the advantages and current\nlimitations of such an approach.\n\nFigure 4: The gravity domain\n\nWe begin by considering the performance of Tabu search in our domains. Given a training set composed\nof one million transitions, a simple Python implementation of Tabu search takes less than 2.5 minutes\nto learn an RM across all our environments, when using 62 workers on a Threadripper 2990WX processor.\nNote that Tabu search\u2019s main bottleneck is evaluating the neighbourhood around the current RM solution. As\nthe size of the neighbourhood depends on the size of the set of propositional symbols P, exhaustively\nevaluating the neighbourhood may sometimes become impractical. 
To handle such problems, it will\nbe necessary to import ideas from the Large Neighborhood Search literature [27].\nRegarding limitations, learning the RM at the abstract level is efficient but requires ignoring (possibly\nrelevant) low-level information. For instance, Figure 4 shows an adversarial example for LRM.\nThe agent receives reward for eating the cookie. There is an external force pulling the agent\ndown, i.e., the outcome of the \u201cmove-up\u201d action is actually a downward movement with high\nprobability. There is a button that the agent can press to turn off (or back on) the external force.\nHence, the optimal policy is to press the button and then eat the cookie. Given a set P containing one\nsymbol for the cookie and one for the button, a perfect\nRM for this environment is fairly simple (see Figure 4) but LRM might not find it. The reason is that\npressing the button changes the low-level probabilities in the environment but does not change what\nis possible or impossible at the abstract level. In other words, while the LRM objective optimizes\nover necessary conditions for finding a perfect RM, those conditions are not sufficient to ensure that\nan optimal solution will be a perfect RM. In addition, if a perfect RM is found, our heuristic approach\nto sharing experiences in QRM would not work as intended because the experiences collected when the\nforce is on (at u0) would be used to update the policy for the case where the force is off (at u1).\nOther current limitations include that it is unclear how to handle noise over the high-level detectors L\nand how to transfer learning from previously learned policies when a new RM is learned. Finally,\ndefining a set of proper high-level detectors for a given environment might be a challenge to deploying\nLRM. 
Hence, looking for ways to automate that step is an important direction for future work.\n\n8 Related Work\n\nState-of-the-art approaches to partially observable RL use Recurrent Neural Networks (RNNs) as\nmemory in combination with policy gradient [24, 37, 29, 15], or use external neural-based memories\n[25, 18, 13]. Other approaches extend Model-Based Bayesian RL to work under partial\nobservability [28, 7, 9] or provide a small binary memory to the agent together with a special set of actions\nto modify it [26]. While our experiments highlight the merits of our approach w.r.t. RNN-based\napproaches, we rely on ideas that are largely orthogonal. As such, we believe there is significant\npotential in mixing these approaches to get the benefit of memory at both the high and the low level.\nThe effectiveness of automata-based memory has long been recognized in the POMDP literature [5],\nwhere the objective is to find policies given a complete specification of the environment. The idea\nis to encode policies using Finite State Controllers (FSCs), which are finite state machines (FSMs) where the transitions\nare defined in terms of low-level observations from the environment and each state in the FSM is\nassociated with one primitive action. When interacting with the environment, the agent always selects\nthe action associated with the current state in the controller. Meuleau et al. [22] adapted this idea to\nwork in the RL setting by exploiting policy gradient to learn policies encoded as FSCs. RMs can be\nconsidered a generalization of FSCs, as they allow for transitions using conditions over high-level\nevents and associate complete policies (instead of just one primitive action) with each state. This allows\nour approach to easily leverage existing deep RL methods to learn policies from low-level inputs,\nsuch as images, which is not achievable by Meuleau et al. [22]. 
That said, investigating how\nideas for learning FSMs [3, 40, 10] can be applied to learning RMs is a promising direction for future work.\nOur approach to learning RMs is greatly influenced by Predictive State Representations (PSRs) [20].\nThe idea behind PSRs is to find a set of core tests (i.e., sequences of actions and observations) such\nthat if the agent can predict the probabilities of these occurring, given any history H, then those\nprobabilities can be used to compute the probability of any other test given H. The insight is that\nstate representations that are good for predicting the next observation are good for solving partially\nobservable environments. We adapted this idea to the context of RM learning as discussed in \u00a74.\nWhile our work was under review, two interesting papers were submitted to arXiv. The first paper, by\nXu et al. [39], proposes a polynomial time algorithm to learn reward machines in fully observable\ndomains. Their goal is to learn the smallest reward machine that is consistent with the reward\nfunction, which makes sense for fully observable domains, but would have limited utility under\npartial observability (as discussed in \u00a74). The second paper, by Zhang et al. [41], proposes to learn a\ndiscrete PSR representation of the environment directly from low-level observations and then plan\nover that representation using tabular Q-learning. This is a promising research direction, with some\nclear synergies with LRM.\n\n9 Concluding Remarks\n\nWe have presented a method for learning (perfect) Reward Machines in partially observable environments\nand demonstrated the effectiveness of these learned RMs in tackling partially observable\nRL problems that are unsolvable by A3C, ACER and PPO. Informed by criteria from the POMDP,\nFSC, and PSR literature, we proposed a set of RM properties that support tackling RL in partially\nobservable environments. 
We used these properties to formulate RM learning as a discrete optimization\nproblem. We experimented with several optimization methods, finding Tabu search to be the\nmost effective. We then combined this RM learning with policy learning for partially observable\nRL problems. Our combined approach outperformed a set of strong LSTM-based approaches on\ndifferent domains.\nWe believe this work represents an important building block for creating RL agents that can solve\ncognitively challenging partially observable tasks. Not only did our approach solve problems that\nwere unsolvable by A3C, ACER and PPO, but it did so in a relatively small number of training steps.\nRM learning provided the agent with memory, but more importantly the combination of RM learning\nand policy learning provided it with discrete reasoning capabilities that operated at a higher level of\nabstraction, while leveraging deep RL\u2019s ability to learn policies from low-level inputs. This work\nleaves open many interesting questions relating to abstraction, observability, and properties of the\nlanguage over which RMs are constructed. We believe that addressing these questions, among many\nothers, will push the boundary of partially observable RL problems that can be solved.\n\nAcknowledgments\n\nWe gratefully acknowledge funding from the Natural Sciences and Engineering Research Council of\nCanada (NSERC) and Microsoft Research. The first author also gratefully acknowledges funding\nfrom CONICYT (Becas Chile). A preliminary version of this work was presented at RLDM [34].\n\nReferences\n\n[1] E. Aarts, E. H. Aarts, and J. K. Lenstra. Local search in combinatorial optimization. Princeton\nUniversity Press, 2003.\n\n[2] M. Andrychowicz, B. Baker, M. Chociej, R. Jozefowicz, B. McGrew, J. Pachocki, A. Petron,\nM. Plappert, G. Powell, A. Ray, et al. Learning dexterous in-hand manipulation. arXiv preprint\narXiv:1808.00177, 2018.\n\n[3] D. Angluin and C. H. Smith. 
Inductive inference: Theory and methods. ACM Computing\nSurveys (CSUR), 15(3):237\u2013269, 1983.\n\n[4] A. Camacho, R. Toro Icarte, T. Q. Klassen, R. Valenzano, and S. A. McIlraith. LTL and beyond:\nFormal languages for reward function specification in reinforcement learning. In Proceedings\nof the 28th International Joint Conference on Artificial Intelligence (IJCAI), pages 6065\u20136073,\n2019.\n\n[5] A. R. Cassandra, L. P. Kaelbling, and M. L. Littman. Acting optimally in partially observable\nstochastic domains. In Proceedings of the 12th National Conference on Artificial Intelligence\n(AAAI), pages 1023\u20131028, 1994.\n\n[6] M. Conforti, G. Cornu\u00e9jols, and G. Zambelli. Integer programming, volume 271. Springer,\n2014.\n\n[7] F. Doshi-Velez, D. Pfau, F. Wood, and N. Roy. Bayesian nonparametric methods for partially-observable\nreinforcement learning. IEEE transactions on pattern analysis and machine intelligence,\n37(2):394\u2013407, 2013.\n\n[8] A. M. Geoffrion. Generalized Benders decomposition. Journal of optimization theory and\napplications, 10(4):237\u2013260, 1972.\n\n[9] M. Ghavamzadeh, S. Mannor, J. Pineau, A. Tamar, et al. Bayesian reinforcement learning: A\nsurvey. Foundations and Trends in Machine Learning, 8(5-6):359\u2013483, 2015.\n\n[10] G. Giantamidis and S. Tripakis. Learning Moore machines from input-output traces. In\nProceedings of the 21st International Symposium on Formal Methods (FM), pages 291\u2013309,\n2016.\n\n[11] F. Glover and M. Laguna. Tabu search. In Handbook of combinatorial optimization, pages\n2093\u20132229. Springer, 1998.\n\n[12] C. Hesse, M. Plappert, A. Radford, J. Schulman, S. Sidor, and Y. Wu. OpenAI baselines.\nhttps://github.com/openai/baselines, 2017.\n\n[13] C.-C. Hung, T. Lillicrap, J. Abramson, Y. Wu, M. Mirza, F. Carnevale, A. Ahuja, and\nG. Wayne. Optimizing agent behavior over long time scales by transporting value. arXiv\npreprint arXiv:1810.06721, 2018.\n\n[14] L. 
Illanes, X. Yan, R. Toro Icarte, and S. A. McIlraith. Symbolic planning and model-free\nreinforcement learning: Training taskable agents. In Proceedings of the 4th Multi-disciplinary\nConference on Reinforcement Learning and Decision (RLDM), pages 191\u2013195, 2019.\n\n[15] M. Jaderberg, V. Mnih, W. M. Czarnecki, T. Schaul, J. Z. Leibo, D. Silver, and K. Kavukcuoglu.\nReinforcement learning with unsupervised auxiliary tasks. arXiv preprint arXiv:1611.05397,\n2016.\n\n[16] L. P. Kaelbling, M. L. Littman, and A. W. Moore. Reinforcement learning: A survey. Journal\nof artificial intelligence research, 4:237\u2013285, 1996.\n\n[17] D. Kasenberg and M. Scheutz. Interpretable apprenticeship learning with temporal logic\nspecifications. In Proceedings of the 56th IEEE Annual Conference on Decision and Control\n(CDC), pages 4914\u20134921, 2017.\n\n[18] A. Khan, C. Zhang, N. Atanasov, K. Karydis, V. Kumar, and D. D. Lee. Memory augmented\ncontrol networks. arXiv preprint arXiv:1709.05706, 2017.\n\n[19] D. Kingma and J. Ba. Adam: A method for stochastic optimization. arXiv preprint\narXiv:1412.6980, 2014.\n\n[20] M. L. Littman, R. S. Sutton, and S. Singh. Predictive representations of state. In Proceedings\nof the 15th Conference on Advances in Neural Information Processing Systems (NIPS), pages\n1555\u20131561, 2002.\n\n[21] M. Mahmud. Constructing states for reinforcement learning. In Proceedings of the 27th\nInternational Conference on Machine Learning (ICML), pages 727\u2013734, 2010.\n\n[22] N. Meuleau, L. Peshkin, K.-E. Kim, and L. P. Kaelbling. Learning finite-state controllers for\npartially observable environments. In Proceedings of the 15th Conference on Uncertainty in\nArtificial Intelligence (UAI), pages 427\u2013436, 1999.\n\n[23] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves,\nM. Riedmiller, A. K. Fidjeland, G. Ostrovski, et al. 
Human-level control through deep reinforcement\nlearning. Nature, 518(7540):529\u2013533, 2015.\n\n[24] V. Mnih, A. P. Badia, M. Mirza, A. Graves, T. Lillicrap, T. Harley, D. Silver, and K. Kavukcuoglu.\nAsynchronous methods for deep reinforcement learning. In Proceedings of the 33rd International\nConference on Machine Learning (ICML), pages 1928\u20131937, 2016.\n\n[25] J. Oh, V. Chockalingam, S. Singh, and H. Lee. Control of memory, active perception, and\naction in Minecraft. In Proceedings of the 33rd International Conference on Machine Learning\n(ICML), pages 2790\u20132799, 2016.\n\n[26] L. Peshkin, N. Meuleau, and L. P. Kaelbling. Learning policies with external memory. In\nProceedings of the 16th International Conference on Machine Learning (ICML), pages 307\u2013314,\n1999.\n\n[27] D. Pisinger and S. Ropke. Large neighborhood search. In Handbook of metaheuristics, pages\n399\u2013419. Springer, 2010.\n\n[28] P. Poupart and N. Vlassis. Model-based Bayesian reinforcement learning in partially observable\ndomains. In Proceedings of the 10th International Symposium on Artificial Intelligence and\nMathematics (ISAIM), pages 1\u20132, 2008.\n\n[29] J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov. Proximal policy optimization\nalgorithms. arXiv preprint arXiv:1707.06347, 2017.\n\n[30] D. Silver, J. Schrittwieser, K. Simonyan, I. Antonoglou, A. Huang, A. Guez, T. Hubert, L. Baker,\nM. Lai, A. Bolton, et al. Mastering the game of Go without human knowledge. Nature,\n550(7676):354, 2017.\n\n[31] S. P. Singh, T. Jaakkola, and M. I. Jordan. Learning without state-estimation in partially\nobservable Markovian decision processes. In Machine Learning Proceedings 1994, pages\n284\u2013292. Elsevier, 1994.\n\n[32] R. S. Sutton and A. G. Barto. Reinforcement learning: An introduction. MIT Press, 2018.\n\n[33] R. Toro Icarte, T. Q. Klassen, R. Valenzano, and S. A. McIlraith. 
Using reward machines for\nhigh-level task specification and decomposition in reinforcement learning. In Proceedings of\nthe 35th International Conference on Machine Learning (ICML), pages 2112\u20132121, 2018.\n\n[34] R. Toro Icarte, E. Waldie, T. Q. Klassen, R. Valenzano, M. P. Castro, and S. A. McIlraith.\nSearching for Markovian subproblems to address partially observable reinforcement learning. In\nProceedings of the 4th Multi-disciplinary Conference on Reinforcement Learning and Decision\n(RLDM), pages 22\u201326, 2019.\n\n[35] H. Van Hasselt, A. Guez, and D. Silver. Deep reinforcement learning with double Q-learning. In\nProceedings of the 30th AAAI Conference on Artificial Intelligence (AAAI), pages 2094\u20132100,\n2016.\n\n[36] S. Vo\u00df, S. Martello, I. H. Osman, and C. Roucairol. Meta-heuristics: Advances and trends in\nlocal search paradigms for optimization. Springer Science & Business Media, 2012.\n\n[37] Z. Wang, V. Bapst, N. Heess, V. Mnih, R. Munos, K. Kavukcuoglu, and N. de Freitas. Sample\nefficient actor-critic with experience replay. arXiv preprint arXiv:1611.01224, 2016.\n\n[38] C. J. C. H. Watkins and P. Dayan. Q-learning. Machine learning, 8(3-4):279\u2013292, 1992.\n\n[39] Z. Xu, I. Gavran, Y. Ahmad, R. Majumdar, D. Neider, U. Topcu, and B. Wu. Joint inference of\nreward machines and policies for reinforcement learning. arXiv preprint arXiv:1909.05912,\n2019.\n\n[40] Z. Zeng, R. M. Goodman, and P. Smyth. Learning finite state machines with self-clustering\nrecurrent networks. Neural Computation, 5(6):976\u2013990, 1993.\n\n[41] A. Zhang, Z. C. Lipton, L. Pineda, K. Azizzadenesheli, A. Anandkumar, L. Itti, J. Pineau, and\nT. Furlanello. Learning causal state representations of partially observable environments. 
arXiv\npreprint arXiv:1906.10437, 2019.\n", "award": [], "sourceid": 9003, "authors": [{"given_name": "Rodrigo", "family_name": "Toro Icarte", "institution": "University of Toronto and Vector Institute"}, {"given_name": "Ethan", "family_name": "Waldie", "institution": "University of Toronto & Palantir Technologies"}, {"given_name": "Toryn", "family_name": "Klassen", "institution": "University of Toronto"}, {"given_name": "Rick", "family_name": "Valenzano", "institution": "Element AI"}, {"given_name": "Margarita", "family_name": "Castro", "institution": "University of Toronto"}, {"given_name": "Sheila", "family_name": "McIlraith", "institution": "University of Toronto"}]}