{"title": "Learning Task Specifications from Demonstrations", "book": "Advances in Neural Information Processing Systems", "page_first": 5367, "page_last": 5377, "abstract": "Real-world applications often naturally decompose into several\n sub-tasks. In many settings (e.g., robotics) demonstrations provide\n a natural way to specify the sub-tasks. However, most methods for\n learning from demonstrations either do not provide guarantees that\n the artifacts learned for the sub-tasks can be safely recombined or\n limit the types of composition available. Motivated by this\n deficit, we consider the problem of inferring Boolean non-Markovian\n rewards (also known as logical trace properties or\n specifications) from demonstrations provided by an agent\n operating in an uncertain, stochastic environment. Crucially,\n specifications admit well-defined composition rules that are\n typically easy to interpret. In this paper, we formulate the\n specification inference task as a maximum a posteriori (MAP)\n probability inference problem, apply the principle of maximum\n entropy to derive an analytic demonstration likelihood model and\n give an efficient approach to search for the most likely\n specification in a large candidate pool of specifications. In our\n experiments, we demonstrate how learning specifications can help\n avoid common problems that often arise due to ad-hoc reward composition.", "full_text": "Learning Task Speci\ufb01cations from Demonstrations\n\nMarcell Vazquez-Chanlatte1, Susmit Jha2, Ashish Tiwari2, Mark K. Ho1, Sanjit A. Seshia1\n\n1 University of California, Berkeley 2 SRI International, Menlo Park\n\n{marcell.vc, sseshia, mark_ho}@eecs.berkeley.edu\n\n{susmit.jha, tiwari}@sri.com\n\nAbstract\n\nReal-world applications often naturally decompose into several sub-tasks.\nIn\nmany settings (e.g., robotics) demonstrations provide a natural way to specify the\nsub-tasks. 
However, most methods for learning from demonstrations either do\nnot provide guarantees that the artifacts learned for the sub-tasks can be safely\nrecombined or limit the types of composition available. Motivated by this de\ufb01cit,\nwe consider the problem of inferring Boolean non-Markovian rewards (also known\nas logical trace properties or speci\ufb01cations) from demonstrations provided by an\nagent operating in an uncertain, stochastic environment. Crucially, speci\ufb01cations\nadmit well-de\ufb01ned composition rules that are typically easy to interpret. In this\npaper, we formulate the speci\ufb01cation inference task as a maximum a posteriori\n(MAP) probability inference problem, apply the principle of maximum entropy to\nderive an analytic demonstration likelihood model and give an ef\ufb01cient approach to\nsearch for the most likely speci\ufb01cation in a large candidate pool of speci\ufb01cations.\nIn our experiments, we demonstrate how learning speci\ufb01cations can help avoid\ncommon problems that often arise due to ad-hoc reward composition.\n\n1\n\nIntroduction\n\nIn many settings (e.g., robotics) demonstrations provide a natural way to specify a task. For ex-\nample, an agent (e.g., human expert) gives one or more demonstrations of the task from which\nwe seek to automatically synthesize a policy for the robot to execute. Typically, one models the\ndemonstrator as episodically operating within a dynamical system whose transition relation only\ndepends on the current state and action (called the Markov condition). However, even if the dy-\nnamics are Markovian, many problems are naturally modeled in non-Markovian terms (see Ex 1).\n\nExample 1. Consider the task illustrated in Figure 1. In this task,\nthe agent moves in a discrete gridworld and can take actions to move\nin the cardinal directions (north, south, east, west). Further, the agent\ncan sense abstract features of the domain represented as colors. 
The\ntask is to reach any of the yellow (recharge) tiles without touching\na red tile (lava) \u2013 we refer to this sub-task as YR. Additionally, if a\nblue tile (water) is stepped on, the agent must step on a brown tile\n(drying tile) before going to a yellow tile \u2013 we refer to this sub-task\nas BBY. The last constraint requires recall of two state bits of history\n(and is thus not Markovian): one bit for whether the robot is wet and\nanother bit encoding if the robot recharged while wet.\n\nFigure 1\n\nFurther, like Ex 1, many tasks are naturally decomposed into several sub-tasks. This work aims to\naddress the question of how to systematically and separately learn non-Markovian sub-tasks such\nthat they can be readily and safely recomposed into the larger meta-task.\nHere, we argue that non-Markovian Boolean speci\ufb01cations provide a powerful, \ufb02exible, and easily\ntransferable formalism for task representations when learning from demonstrations. This stands\nin contrast to the quantitative scalar reward functions commonly associated with Markov Decision\n\n32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montr\u00e9al, Canada.\n\n\fProcesses. Focusing on Boolean speci\ufb01cations has certain bene\ufb01ts: (1) The ability to naturally\nexpress tasks with temporal dependencies; (2) the ability to take advantage of the compositionality\npresent in many problems, and (3) use of formal methods for planning and veri\ufb01cation [29].\nAlthough standard quantitative scalar reward functions could be used to learn this task from demon-\nstrations, three issues arise. First, consider the problem of temporal speci\ufb01cations: reward functions\nare typically Markovian, so requirements like those in Ex 1 cannot be directly expressed in the task\nrepresentation. 
One could explicitly encode time into the state and reduce the problem to learning a Markovian reward on new time-dependent dynamics; however, in general, such a reduction suffers from an exponential blow up in the state size (commonly known as the curse of history [24]). When inferring tasks from demonstrations, where different hypotheses may have different historical dependencies, naively encoding the entire history quickly becomes intractable.

A second limitation relates to the compositionality of task representations. As suggested, Ex 1 naturally decomposes into two sub-tasks. Ideally, we would want an algorithm that could learn each sub-task and combine them into the complete task, rather than only being able to learn single monolithic tasks. However, for many classes of quantitative rewards, "combining" rewards remains an ad-hoc procedure. The situation is further exacerbated by humans being notoriously bad at anticipating or identifying when quantitative rewards will lead to unintended consequences [11], which poses a serious problem for AI safety [1] and has led to investigations into reward repair [9]. For instance, we could take a linear combination of rewards for each of the sub-tasks in Ex 1, but depending on the relative scales of the rewards and the temporal discount rate, wildly different behaviors would result.

The third limitation is brittleness: the correctness of the learned agent can change due to a simple change in the environment. Namely, imagine for a moment we remove the water and drying tiles from Fig 1 and attempt to learn a reward that encodes the "recharge while avoiding lava" task in Ex 1. Fig 2a illustrates the reward resulting from performing Maximum Entropy Inverse Reinforcement Learning [35] with the demonstrations shown in Fig 1 and the binary features: red (lava tile), yellow (recharge tile), and "is wet". 
As is easy to verify, a reward-optimizing agent, maximizing $\sum_{i=0}^{\infty} \gamma^i r_i(s)$ with a discount factor of $\gamma = 0.69$, would generate the trajectory shown in Fig 2a, which avoids lava and eventually recharges.

Unfortunately, using the same reward and discount factor on a nearly identical world can result in the agent entering the lava. For example, Fig 2b illustrates the learned reward being applied to a change in the gridworld where the top left charging tile has been removed. An acceptable trajectory is indicated via a dashed arrow. Observe that now the discounted sum of rewards is maximized on the solid arrow's path, resulting in the agent entering the lava! While it is possible to find new discount factors to avoid this behavior, such a supervised process would go against the spirit of automatically learning the task.

Finally, we briefly remark that while non-Markovian Boolean rewards cannot encode all possible rewards, e.g., "run as fast as possible", often such objectives can be reframed as policies for a Boolean task. For example, consider modeling a race. If at each time step there is a non-zero probability of entering a losing state, the agent will run forward as fast as possible even for the Boolean task "win the race".

Thus, quantitative Markovian rewards are limited as a task representation when learning tasks containing temporal specifications or compositionality from demonstrations. 
Moreover, the need to fine-tune learned tasks with such properties seemingly undercuts the original purpose of learning task representations that are generalizable and invariant to irrelevant aspects of a task [21].

Figure 2: Illustration of a bug in the learnt quantitative Markovian reward resulting from slight changes in the environment.

Related Work: Our work is intimately related to Maximum Entropy Inverse Reinforcement Learning. In Inverse Reinforcement Learning (IRL) [23], the demonstrator, operating in a stochastic environment, is assumed to attempt to (approximately) optimize some unknown reward function over the trajectories. In particular, one traditionally assumes a trajectory's reward is the sum of state rewards of the trajectory. This formalism offers a succinct mechanism to encode and generalize the goals of the demonstrator to new and unseen environments.

In the IRL framework, the problem of learning from demonstrations can then be cast as a Bayesian inference problem [26] to predict the most probable reward function. To make this inference procedure well-defined and robust to demonstration/modeling noise, Maximum Entropy IRL [35] appeals to the principle of maximum entropy [12]. This results in a likelihood over the demonstrations which is no more committed to any particular behavior than what is required for matching the empirically observed reward expectation. While this approach was initially limited to learning a linear combination of feature vectors, IRL has been successfully adapted to arbitrary function approximators such as Gaussian processes [19] and neural networks [8]. 
As stated in the introduction, while powerful, traditional IRL provides no principled mechanism for composing the resulting rewards. To address this deficit, composition using soft optimality has recently received a fair amount of attention; however, the compositions are limited to either strict disjunction (do X or Y) [30, 31] or conjunction (do X and Y) [10]. Further, because soft optimality only bounds the deviation from simultaneously optimizing both rewards, optimizing the composition does not preclude violating safety constraints embedded in the rewards (e.g., do not enter the lava).

The closest work to ours is recent work on inferring Linear Temporal Logic (LTL) by finding the specification that minimizes the expected number of violations by an optimal agent compared to the expected number of violations by an agent applying actions uniformly at random [16]. The computation of the optimal agent's expected violations is done via dynamic programming on the explicit product of the deterministic Rabin automaton [7] of the specification and the state dynamics. A fundamental drawback to this procedure is that, due to the curse of history, it incurs a heavy run-time cost, even on simple two-state and two-action Markov Decision Processes. We also note that the literature on learning logical specifications from examples (e.g., [15, 33, 20]) does not handle noise in the examples, while our approach does. Finally, once a specification has been identified, one can leverage the rich literature on planning using temporal logic to synthesize a policy [17, 28, 27, 13, 14].

Contributions: (i) We formulate the problem of learning specifications from demonstrations in terms of Maximum a Posteriori inference. (ii) To make this inference well-defined, we appeal to the principle of maximum entropy, culminating in the distribution given in (9). 
The main contribution of this model is that it only depends on the probability that the demonstrator will successfully perform the task and the probability that the task is satisfied by performing actions uniformly at random. Because these properties can be estimated without explicitly unrolling the dynamics in time, this model avoids many of the pitfalls characteristic of the curse of history. (iii) We provide an algorithm that exploits the piece-wise convex structure in our posterior model (9) to efficiently perform Maximum a Posteriori inference for the most probable specification.

Outline: In Sec 2, we define specifications and probabilistic automata (Markov Decision Processes without rewards). In Sec 3, we introduce the problem of specification inference from demonstrations and, inspired by Maximum Entropy IRL [35], develop a model of the posterior probability of a specification given a sequence of demonstrations. In Sec 4, we develop an algorithm to perform inference under (9). Finally, in Sec 5, we demonstrate how, due to their inherent composability, learning specifications can avoid common bugs that often occur due to ad-hoc reward composition.

2 Background

We seek to learn specifications from demonstrations provided by a teacher who executes a sequence of actions that probabilistically changes the system state. For simplicity, we assume that the sets of actions and states are finite and fully observed. The system is naturally modeled as a probabilistic automaton¹, formally defined below:

Definition 1 (Probabilistic Automaton). A probabilistic automaton is a tuple M = (S, s₀, A, δ), where S is the finite set of states, s₀ ∈ S is the initial state, A is the finite set of actions, and δ : S × A × S → [0, 1] specifies the transition probability of going from s to s′ given action a, i.e. 
$\delta(s, a, s') = \Pr(s' \mid s, a)$ and $\sum_{s' \in S} \Pr(s' \mid s, a) = 1$ for all states $s$ and actions $a$.

¹ Probabilistic Automata are often constructed as a Markov Decision Process, M, without its Markovian reward map R, denoted M \ R.

Definition 2 (Trace). A sequence of state/action pairs is called a trace (trajectory, demonstration). A trace of length τ ∈ ℕ is an element of (S × A)^τ.

Next, we develop machinery to distinguish between desirable and undesirable traces. For simplicity, we focus on finite trace properties, referred to as specifications, that are decidable within some fixed τ ∈ ℕ time steps, e.g., "event A occurred in the last 20 steps".

Definition 3 (Specification). Given a set of states S, a set of actions A, and a fixed trace length τ ∈ ℕ, a specification is a subset of traces φ ⊆ (S × A)^τ. We define true ≝ (S × A)^τ, ¬φ ≝ true \ φ, and false ≝ ¬true. A collection of specifications, Φ, is called a concept class. Finally, we abuse notation and use φ to also denote its indicator function (interpreted as a non-Markovian Boolean reward),

$$\varphi(\xi) \stackrel{\text{def}}{=} \begin{cases} 1 & \text{if } \xi \in \varphi \\ 0 & \text{otherwise.} \end{cases} \quad (1)$$

Specifications may be given in formal notation, as sets, or as automata. Further, each representation facilitates defining a plethora of composition rules. For example, consider two specifications, φ_A, φ_B, that encode tasks A and B respectively, and the composition rule φ_A ∩ φ_B : ξ ↦ min(φ_A(ξ), φ_B(ξ)). Because the agent only receives a non-zero reward if φ_A(ξ) = φ_B(ξ) = 1, a reward-maximizing agent must necessarily perform tasks A and B simultaneously; thus, φ_A ∩ φ_B corresponds to conjunction (logical and). Similarly, maximizing φ_A ∪ φ_B : ξ ↦ max(φ_A(ξ), φ_B(ξ)) corresponds to disjunction (logical or). One can also encode conditional requirements using subset inclusion, e.g., maximizing φ_A ⊆ φ_B : ξ ↦ max(1 − φ_A(ξ), φ_B(ξ)) corresponds to task A triggering task B.

Complicated temporal connectives can also be defined using temporal logics [25] or automata [32]. For our purposes, it suffices to informally extend propositional logic with three temporal operators: (1) Let H a, read "historically a", denote that property a held at all previous time steps. (2) Let P a ≝ ¬(H ¬a), read "once a", denote that property a at least once held in the past. (3) Let a S b, read "a since b", denote that property a has held at every time step since b last held. The true power of temporal operators is realized when they are composed to make more complicated sentences. For example, H(a ⟹ (b S c)) translates to "it was always the case that if a was true, then b has held since the last time c held." Observe that the property BBY from the introductory example takes this form, H((yellow ∧ P blue) ⟹ (¬blue S brown)), i.e., "historically, if the agent had once visited blue and is currently visiting yellow, then the agent has not visited blue since it last visited brown".

3 Specification Inference from Demonstrations

In the spirit of Inverse Reinforcement Learning, we now seek to find the specification that best explains the behavior of the agent. 
We refer to this as Specification Inference from Demonstrations.

Definition 4 (Specification Inference from Demonstrations). The specification inference from demonstrations problem is a tuple (M, X, Φ), where M = (S, s₀, A, δ) is a probabilistic automaton, X is a (multi-)set of τ-length traces drawn from an unknown distribution induced by a teacher attempting to demonstrate some unknown specification within M, and Φ is a concept class of specifications. A solution to (M, X, Φ) is:

$$\varphi^* \in \operatorname*{arg\,max}_{\varphi \in \Phi} \Pr(\varphi \mid M, X) \quad (2)$$

where Pr(φ | M, X) denotes the probability that the teacher demonstrated φ given the observed traces, X, and the dynamics, M.

To make this inference well-defined, we make a series of assumptions culminating in (9).

Likelihood of a demonstration: We begin by leveraging the principle of maximum entropy to disambiguate the likelihood distributions. Concretely, define

$$w\big(\xi = (\vec{s}, \vec{a}), M\big) = \prod_{i=0}^{\tau-1} \Pr(s_{i+1} \mid s_i, a_i, M), \quad (3)$$

where $\vec{s}$ and $\vec{a}$ are the projected sequences of states and actions of ξ respectively, to be the weight of each possible demonstration ξ induced by the dynamics M. Given a demonstrator who on average satisfies the specification φ with probability $\bar{\varphi}$, we approximate the likelihood function by:

$$\Pr\big(\xi \mid M, \varphi, \bar{\varphi}\big) = \frac{w(\xi, M) \cdot \exp\big(\lambda_\varphi \varphi(\xi)\big)}{Z_\varphi} \quad (4)$$

where $\lambda_\varphi, Z_\varphi$ are normalization factors such that $\mathbb{E}_\xi[\varphi] = \bar{\varphi}$ and $\sum_\xi \Pr(\xi \mid M, \varphi) = 1$. For a detailed derivation that (4) is the maximal entropy distribution, we point the reader to [18]. 
Next, observe that due to the Boolean nature of φ, (4) admits a simple closed form:

$$\Pr(\xi \mid M, \varphi, \bar{\varphi}) = \widetilde{\{\xi\}} \cdot \begin{cases} \bar{\varphi}/\widetilde{\varphi} & \xi \in \varphi \\ (1-\bar{\varphi})/\widetilde{\neg\varphi} & \xi \notin \varphi \end{cases} \quad (5)$$

where in general we use $\widetilde{(\cdot)}$ to denote the probability of satisfying a specification using uniformly random actions. Thus, we denote by $\widetilde{\{\xi\}}$ the probability of randomly generating demonstration ξ within M. Further, note that by the law of the excluded middle, for any specification: $\widetilde{\neg\varphi} = 1 - \widetilde{\varphi}$.

Proof Sketch. For brevity, let $W_\varphi \stackrel{\text{def}}{=} \sum_{\xi \in \varphi} w(\xi, M)$ and $c \stackrel{\text{def}}{=} e^{\lambda_\varphi}$. Via the constraints on (4),

$$Z_\varphi \cdot \bar{\varphi} = 1 \cdot \sum_{\xi \in \varphi} c^1\, w(\xi, M) + 0 \cdot \sum_{\xi \notin \varphi} c^0\, w(\xi, M) = c W_\varphi, \qquad Z_\varphi = c^1 \sum_{\xi \in \varphi} w(\xi, M) + c^0 \sum_{\xi \notin \varphi} w(\xi, M) = c W_\varphi + W_{\neg\varphi} \quad (6)$$

Combining gives $Z_\varphi = W_{\neg\varphi}/(1 - \bar{\varphi})$. Next, observe that if ξ ∉ φ, then $e^{\lambda_\varphi \varphi(\xi)} = 1$, and substituting into (4) yields $\Pr(\xi \mid \varphi, M, \xi \notin \varphi) = w_\xi (1-\bar{\varphi})/W_{\neg\varphi}$. If ξ ∈ φ (implying $W_\varphi \neq 0$), then $e^{\lambda_\varphi} = Z_\varphi \bar{\varphi}/W_\varphi$ and $\Pr(\xi \mid \varphi, M, \xi \in \varphi) = w_\xi \bar{\varphi}/W_\varphi$. Finally, observe that $\widetilde{\varphi} = W_\varphi/W_{\text{true}}$ and $\widetilde{\{\xi\}} = w_\xi/W_{\text{true}}$. Substituting and factoring yields (5).

Likelihood of a set of demonstrations: If the teacher gives a finite sequence of τ-length demonstrations, X, drawn i.i.d. from (5), then the log likelihood, L, of X under (5) is:²

$$L(X \mid M, \varphi, \bar{\varphi}) = \log\Big(\prod_{\xi \in X} \Pr(\xi \mid M, \varphi, \bar{\varphi})\Big) = N_\varphi \ln\Big(\frac{\bar{\varphi}}{\widetilde{\varphi}}\Big) + N_{\neg\varphi} \ln\Big(\frac{1-\bar{\varphi}}{\widetilde{\neg\varphi}}\Big) + \sum_{\xi \in X} \ln \widetilde{\{\xi\}} \quad (7)$$

where by definition we take 0 · ln(…) = 0 and $N_\varphi \stackrel{\text{def}}{=} \sum_{\xi \in X} \varphi(\xi)$. Next, observe that

$$\bar{\varphi} \ln\Big(\frac{\bar{\varphi}}{\widetilde{\varphi}}\Big) + (1-\bar{\varphi}) \ln\Big(\frac{1-\bar{\varphi}}{1-\widetilde{\varphi}}\Big)$$

is the information gain (KL divergence) between two Bernoulli distributions with means $\bar{\varphi}$ and $\widetilde{\varphi}$ respectively. Syntactically, let B(µ) denote a Bernoulli distribution with mean µ and let $D_{KL}(P \parallel Q) \stackrel{\text{def}}{=} \sum_i P(i)\ln(P(i)/Q(i))$ denote the information gain when using distribution P compared to Q. If X is "representative", such that $N_\varphi \approx \bar{\varphi} \cdot |X|$, we can (up to a φ-independent normalization) approximate (7):

$$\Pr(X \mid M, \varphi, \bar{\varphi}) \mathrel{\dot{\propto}} \exp\Big(|X| \cdot D_{KL}\big(B(\bar{\varphi}) \parallel B(\widetilde{\varphi})\big)\Big) \quad (8)$$

where $\dot{\propto}$ denotes "approximately proportional to". Unfortunately, the approximation $|X| \cdot \bar{\varphi} \approx N_\varphi$ implies that $\overline{\neg\varphi} = 1 - \bar{\varphi}$, which introduces the undesirable symmetry $\Pr(X \mid M, \varphi, \bar{\varphi}) = \Pr(X \mid M, \neg\varphi, \overline{\neg\varphi})$ into (8). To break this symmetry, we assert that the demonstrator must be at least as good as random. Operationally, we assert that $\Pr(\bar{\varphi} \mid \bar{\varphi} < \widetilde{\varphi}) = 0$ and is otherwise uniform. Finally, we arrive at the posterior distribution given in (9), where 1[·] denotes an indicator function:

$$\Pr(\varphi \mid M, X, \bar{\varphi}) \mathrel{\dot{\propto}} \underbrace{\mathbb{1}\big[\bar{\varphi} \geq \widetilde{\varphi}\big]}_{\text{demonstrator is better than random}} \cdot \underbrace{\exp\Big(|X| \cdot D_{KL}\big(B(\bar{\varphi}) \parallel B(\widetilde{\varphi})\big)\Big)}_{\text{information gain over random actions}} \quad (9)$$

² We have suppressed a multinomial coefficient required if any two demonstrations are the same. However, this term will not change as φ varies, and thus cancels when comparing across specifications.

4 Algorithm

In this section, we exploit the structure imposed by (9) to efficiently search for the most probable specification (2) within a (potentially large) concept class, Φ. Namely, observe that under (9), the specification inference problem (2) reduces to maximizing the information gain over random actions:

$$\varphi^* \in \operatorname*{arg\,max}_{\varphi \in \Phi} \Big\{ \mathbb{1}\big[\bar{\varphi} \geq \widetilde{\varphi}\big] \cdot D_{KL}\big(B(\bar{\varphi}) \parallel B(\widetilde{\varphi})\big) \Big\} \quad (10)$$

Because gradients on $\widetilde{\varphi}$ and $\bar{\varphi}$ are not well-defined, gradient-descent-based algorithms are not applicable. Further, while evaluating whether a trace satisfies a specification is fairly efficient (and thus our $N_\varphi/|X|$ approximation to $\bar{\varphi}$ is assumed easy to compute), computing $\widetilde{\varphi}$ is in general known to be #P-complete [2]. Nevertheless, in practice, moderately efficient methods for computing or approximating $\widetilde{\varphi}$ exist, including Monte Carlo simulation [22] and weighted model counting [5] via Binary Decision Diagrams (BDDs) [3] or repeated SAT queries [4]. As such, we seek an algorithm that poses few $\widetilde{\varphi}$ queries. 
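Concretely, the score maximized in (10) is just a thresholded KL divergence between two Bernoulli distributions, which takes only a few lines to compute. Below is a minimal sketch (function and variable names are our own, not from the paper; `phi_rand` stands in for an externally supplied estimate of the random-satisfaction probability):

```python
from math import log

def kl_bernoulli(p, q):
    """D_KL(B(p) || B(q)) in nats, with the 0 * ln(0) = 0 convention.

    Assumes 0 < q < 1 so both logarithms are defined.
    """
    total = 0.0
    if p > 0:
        total += p * log(p / q)
    if p < 1:
        total += (1 - p) * log((1 - p) / (1 - q))
    return total

def info_gain(n_sat, n_demos, phi_rand):
    """Thresholded information gain over random actions, as in (10).

    n_sat / n_demos approximates the demonstrator's satisfaction rate
    (N_phi / |X|); phi_rand is the probability of satisfying the
    candidate under uniformly random actions. The indicator zeroes out
    any candidate that explains the demonstrations worse than chance.
    """
    phi_bar = n_sat / n_demos
    if phi_bar < phi_rand:
        return 0.0
    return kl_bernoulli(phi_bar, phi_rand)

# A specification satisfied by all 5 demos but rarely by random actions
# scores much higher than one that random actions satisfy half the time.
assert info_gain(5, 5, 0.01) > info_gain(5, 5, 0.5)
assert info_gain(1, 5, 0.9) == 0.0  # worse than random => pruned
```

Note that the score depends on the candidate only through the two scalars $\bar{\varphi}$ and $\widetilde{\varphi}$, which is what makes batching many candidates against a few random-satisfaction queries attractive.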
We begin with the observation that adding a trace to a specification cannot lower its probability of satisfaction under random actions.

Lemma 1. $\forall \varphi', \varphi \in \Phi$, $\varphi' \subseteq \varphi$ implies $\widetilde{\varphi'} \leq \widetilde{\varphi}$ and $\bar{\varphi'} \leq \bar{\varphi}$.

Proof. The probability of sampling an element of a set monotonically increases as elements are added to the set, independent of the fixed underlying distribution over elements.

Further, note that $N_\varphi$ (and thus our approximation to $\bar{\varphi}$) can only take on |X| + 1 possible values. This suggests a piece-wise analysis of (10) by conditioning on the value of $\bar{\varphi}$.

Definition 5. Given candidate specifications Φ and a subset of demonstrations S ⊆ X, define

$$\Phi_S \stackrel{\text{def}}{=} \{\varphi \in \Phi : \varphi \cap X = S\}, \qquad J_{|S|}(x) \stackrel{\text{def}}{=} \mathbb{1}\Big[\frac{|S|}{|X|} \geq x\Big] \cdot D_{KL}\Big(B\Big(\frac{|S|}{|X|}\Big) \parallel B(x)\Big) \quad (11)$$

The next key observation is that $J_{|S|} : [0, 1] \to \mathbb{R}_{\geq 0}$ monotonically decreases in x.

Lemma 2. $\forall S \subseteq X$, $x < x' \implies J_{|S|}(x) \geq J_{|S|}(x')$.

Proof. To begin, observe that $D_{KL}$ is always non-negative. Due to the $\mathbb{1}[\frac{|S|}{|X|} \geq x]$ indicator, $J_{|S|}(x) = 0$ for all $x > |S|/|X|$. Next, observe that $J_{|S|}$ is convex due to the convexity of $D_{KL}$ on Bernoulli distributions and is minimized at $x = |S|/|X|$ (the KL divergence of identical distributions is 0). Thus, $J_{|S|}(x)$ monotonically decreases as x increases.

Figure 3: Left: An example of a series of specifications $\varphi_1, \ldots, \varphi_4$ ordered by subset inclusion. 
The dots represent demonstrations, and so each specification has $\bar{\varphi}_i = 6/9$. Right: Plot of $J_{|S|}(x)$ for hypothetical values of $\widetilde{\varphi}_i$ annotated as points. Notice that the sequence of specifications is ordered on the x-axis, and thus the maximum must occur at the start of the sequence.

These insights are combined in Theorem 1 and illustrated in Fig 3.

Theorem 1. If A denotes a sequence of specifications $\varphi_1, \ldots, \varphi_n$ ordered by subset inclusion ($j \leq k \implies \varphi_j \subseteq \varphi_k$) and S ⊆ X is an arbitrary subset of demonstrations, then:

$$\max_{\varphi \in A} J_{|S|}(\widetilde{\varphi}) = J_{|S|}(\widetilde{\varphi}_1) \quad (12)$$

Proof. $\widetilde{\varphi}$ is monotonically increasing on A (Lemma 1). Via Lemma 2, $J_{|S|}(x)$ is monotonically decreasing, and thus the maximum of $J_{|S|}(\widetilde{\varphi})$ must occur at the beginning of A.

Lattice Concept Classes. Theorem 1 suggests specializing to concept classes where determining subset relations is easy. We propose studying concept classes organized into a finite (bounded) lattice, (Φ, ⪯), that respects subset inclusion: (φ ⪯ φ′ ⟹ φ ⊆ φ′). To enforce the bounded constraint, we assert that false and true are always assumed to be in Φ and act as the bottom and top of the partial order respectively. Intuitively, this lattice structure encodes our partial knowledge of which specifications imply other specifications. These implication relations can be represented as a directed graph where the nodes correspond to elements of Φ and an edge is present if the source is known to imply the target. Because implication is transitive, many of the edges can be omitted without losing any information. 
The graph resulting from this transitive reduction is called a Hasse diagram [6] (see Fig 4). In terms of the graphical model, the Hasse diagram encodes that for certain pairs of specifications, φ, φ′, we know that Pr(φ(ξ) = 1 | φ′(ξ) = 1, M) = 1 or Pr(φ(ξ) = 0 | φ′(ξ) = 0, M) = 1.

Figure 4: Hasse diagram of an example lattice Φ with an anti-chain annotated. Directed edges represent known subset relations and paths represent chains.

Inference on chain concept classes. Sequences of specifications ordered by subset inclusion generalize naturally to ascending chains.

Definition 6 (Chains and Anti-Chains). Given a partial order (Φ, ⪯), an ascending chain (or just chain), A, is a sequence of elements of Φ ordered by ⪯. The smallest element of the chain is denoted ↓(A). Finally, an anti-chain is a set of incomparable elements. An anti-chain is called maximal if no element can be added to it without making two of its elements comparable.

Recasting Theorem 1 in the parlance of chains yields:

Corollary 1. If S ⊆ X is a subset of demonstrations and A is a chain in (Φ_S, ⪯), then:

$$\max_{\varphi \in A} J_{|S|}(\widetilde{\varphi}) = J_{|S|}\big(\widetilde{\downarrow(A)}\big) \quad (13)$$

Observe that if the lattice (Φ, ⪯) is itself a chain, then there are at most |X| + 1 non-empty demonstration partitions, Φ_S. In fact, the non-empty partitions can be re-indexed by the cardinality of S, e.g., Φ_S ↦ Φ_{|S|}. 
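Because a chain is totally ordered and $N_\varphi$ is monotone along it (Lemma 1), the smallest element of each non-empty partition $\Phi_i$ can be located by binary search. The following is a minimal sketch under our own modeling assumptions: specifications are plain Python sets of traces, and the names are illustrative rather than taken from the paper.

```python
def n_sat(phi, demos):
    """N_phi: number of demonstrations the specification phi accepts."""
    return sum(xi in phi for xi in demos)

def find_smallest(chain, demos, i):
    """First (smallest) spec in an ascending chain with N_phi >= i.

    `chain` must be ordered by subset inclusion, so n_sat is
    non-decreasing along it and binary search applies. Returns None
    if no such spec exists. A caller would keep the result only when
    N_phi is exactly i (i.e., the partition Phi_i is non-empty).
    """
    lo, hi = 0, len(chain)
    while lo < hi:
        mid = (lo + hi) // 2
        if n_sat(chain[mid], demos) >= i:
            hi = mid
        else:
            lo = mid + 1
    return chain[lo] if lo < len(chain) else None

demos = ["xi1", "xi2", "xi3"]
chain = [set(), {"xi1"}, {"xi1", "xi2"}, {"xi1", "xi2", "xi3", "other"}]
assert find_smallest(chain, demos, 2) == {"xi1", "xi2"}
assert find_smallest(chain, demos, 4) is None
```

Each search touches only O(ln |A|) specifications, which is the source of the ln(|A|) factor in the run-time analysis that follows.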
Further, note that since chains are totally ordered, the smallest element of each non-empty partition can be found by performing a binary search (indicated by find_smallest below). These insights are combined into Algorithm 1, with a relativized run-time analysis given in Thm 2.

Algorithm 1 Inference on chains
1: procedure CHAIN_INFERENCE(X, (A, ⪯))
2:     Ψ ← { (i, find_smallest(A, i)) | i ∈ {0, 1, . . . , |X|} }    ▷ O(T_data |X| ln(|A|))
3:     return (i, φ*) ← arg max_{(i,φ) ∈ Ψ} J_i($\widetilde{\varphi}$)    ▷ O(T_rand |X|)

Theorem 2. Let T_data and T_rand respectively represent the worst-case execution time of computing $\bar{\varphi}$ and $\widetilde{\varphi}$ for φ in chain A. Given demonstrations X, Alg 1 runs in time:

$$O\Big(|X| \big(T_{\text{data}} \ln(|A|) + T_{\text{rand}}\big)\Big) \quad (14)$$

Proof Sketch. A binary search over |A| elements takes ln(|A|) time. There are |X| binary searches required to find the smallest element of each partition. Finally, for each smallest element, a single random satisfaction query is made.

Lattice inference. Of course, in general, (Φ, ⪯) is not a chain, but a complicated lattice. Nevertheless, observe that any path from false to true is a chain. Further, the smallest element of each partition must either be the same specification or incomparable in (Φ, ⪯). That is, for each k ∈ {0, 1, . . . , |X|}, the set

$$B_k \stackrel{\text{def}}{=} \Big\{ \downarrow(\Phi_S) \,:\, S \in \binom{X}{k} \Big\} \quad (15)$$

is a maximal anti-chain. Thus, Corollary 1 can be extended to:

Corollary 2. 
Given a lattice (Φ, ⊴) and demonstrations X:

$$\max_{\varphi \in \Phi} J_{N_\varphi}(\widetilde{\varphi}) \;=\; \max_{k \in \{0, 1, \ldots, |X|\}} \; \max_{\varphi \in B_k} J_k(\widetilde{\varphi}) \qquad (16)$$

Recalling that N_ϕ increases on paths from false to true, we arrive at the following simple algorithm, which takes as input the demonstrations and the lattice Φ encoded as a directed acyclic graph rooted at false: (i) perform a breadth first traversal (BFT) of the lattice (Φ, ⊴) starting at false; (ii) during the traversal, if specification ϕ has a larger N_ϕ than all of its direct predecessors, check whether it is more probable than the best specification seen so far (if so, make it the most probable specification seen so far); (iii) at the end of the traversal, return the most probable specification. Pseudo code is provided in Algorithm 2 with a run-time analysis given in Theorem 3.

Algorithm 2 Inference on Partial Orders
1: procedure PARTIALORDER_INFERENCE(X, (Φ, ⊴))
2:     (ϕ*, best_info_gain) ← (false, 0)
3:     for ϕ in breadth_first_traversal((Φ, ⊴)) do
4:         parents ← direct_predecessors(ϕ)
5:         if ∃ϕ′ ∈ parents . N_{ϕ′} = N_ϕ then
6:             continue
7:         info_gain ← J_{N_ϕ}(ϕ̃)
8:         if info_gain > best_info_gain then
9:             (ϕ*, best_info_gain) ← (ϕ, info_gain)
10:    return ϕ*

Theorem 3. Let (Φ, ⊴) be a bounded partial order encoded as a Directed Acyclic Graph (DAG), G = (V, E), with vertices V and edges E. Further, let B denote the largest anti-chain in Φ. If T_data and T_rand respectively represent the worst case execution times of computing ϕ and ϕ̃, then for demonstrations X, Alg 2 runs in time:

$$O\big(E + T_{\mathrm{data}} \cdot V + T_{\mathrm{rand}} \cdot |B||X|\big) \qquad (17)$$

Proof sketch. BFT takes O(V + E).
Further, for each node, ϕ is computed (O(T_data · V)). Finally, for each node in each of the candidate anti-chains B_k, ϕ̃ is computed. Since |B| is the size of the largest anti-chain, this query happens no more than |B||X| times.

5 Experiments and Discussion

Scenario. Recall our introductory gridworld example Ex 1. Now imagine that the robot is pre-programmed to perform the "recharge and avoid lava" task, but is unaware of the second requirement, "do not recharge when wet". To signal this additional constraint to the robot, the human operator provides the five demonstrations shown in Fig 1. We now illustrate how learning specifications rather than Markovian rewards enables the robot to safely compose the new constraint with its existing knowledge to perform the joint task in a manner that is robust to changes in the task. To begin, we assume the robot has access to the Boolean features: red (lava tile), blue (water tile), brown (drying tile), and yellow (recharge tile). Using these features, the robot has encoded the "recharge and avoid lava" task as: H(¬red) ∧ P(yellow).

Concept Class. We designed the robot's concept class to be the conjunction of the known requirements and a specification generated by the grammar below. The motivation in choosing this grammar was that (i) it generates a moderately large concept class (930 possible specifications after pruning trivially false specifications), and (ii) it contains several interesting alternative tasks such as H(red ⟹ (¬brown S blue)), which semantically translates to: "the robot should be wet before entering lava". To generate the edges in the Hasse diagram, we unrolled the formulas into their corresponding Boolean formulas and used a SAT solver to determine subset relations.
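The subset queries used to generate these edges can be sketched by brute-force enumeration over bounded traces (standing in for the SAT solver on the unrolled Boolean formulas; the predicate encodings, alphabet, and horizon below are illustrative assumptions):

```python
from itertools import product

def entails(phi, phi_prime, alphabet, horizon):
    """phi ⊴ phi' iff every bounded trace satisfying phi also satisfies
    phi_prime. A brute-force stand-in for the paper's SAT query."""
    return all(phi_prime(t) for t in product(alphabet, repeat=horizon)
               if phi(t))

# Toy past-time properties over traces of colors (tuples of strings):
H_not_red = lambda t: all(c != "red" for c in t)                 # H(¬red)
H_not_red_and_P_yellow = lambda t: (H_not_red(t)                 # H(¬red) ∧ P(yellow)
                                    and any(c == "yellow" for c in t))

colors = ("red", "yellow", "blue", "brown")
# The conjunction is strictly more restrictive, hence below H(¬red) in the lattice:
assert entails(H_not_red_and_P_yellow, H_not_red, colors, horizon=3)
assert not entails(H_not_red, H_not_red_and_P_yellow, colors, horizon=3)
```

Enumeration is exponential in the horizon, which is exactly why the paper uses a SAT solver for this check instead.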
While potentially slow, we make three observations regarding this process: (i) the process is trivially parallelizable; (ii) so long as the atomic predicates remain the same, this Hasse diagram need not be recomputed, since it is otherwise invariant to the dynamics; (iii) most of the edges in the resulting diagram could have been syntactically identified using well-known identities on temporal logic formulas.

Concept Class Grammar:
⟨ϕ⟩  ::= H⟨ψ⟩ | P⟨ψ⟩
⟨ψ⟩  ::= ⟨β⟩ | ⟨β⟩ ⟹ ⟨β⟩
⟨β⟩  ::= ⟨α⟩ | ⟨α⟩ ∧ ⟨α⟩ | ⟨α⟩ S ⟨α⟩
⟨α⟩  ::= AP | ¬AP
⟨AP⟩ ::= yellow | red | brown | blue

Computing ϕ̃. To perform random satisfaction rate queries, ϕ̃, we first ran Monte Carlo to get a coarse estimate and then symbolically encoded the dynamics, color sensor, and specification into a Binary Decision Diagram to get exact values. This data structure serves as an incredibly succinct encoding of the specification-aware unrolling of the dynamics, which in practice avoids the exponential blow-up suggested by the curse of history. We then counted the number of satisfying assignments and divided by the total possible number of satisfying assignments.³ On average in these candidate pools, each query took 0.4 seconds with a standard deviation of 0.32 seconds.

Results. Running a fairly unoptimized implementation of Algorithm 2 on the concept class and demonstrations took approximately 95 seconds and resulted in 172 ϕ̃ queries (≈ 18% of the concept class). The inferred additional requirement was H((yellow ∧ P blue) ⟹ (¬blue S brown)), which exactly captures the "do not recharge while wet" constraint.
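The coarse Monte Carlo stage described above can be sketched as follows (the trace sampler, two-color alphabet, and 3-step horizon are illustrative assumptions, not the paper's setup):

```python
import random

def random_sat_rate(spec, sample_trace, n_samples=20000, seed=0):
    """Monte Carlo estimate of the random satisfaction rate: the fraction
    of uniformly sampled feasible traces that satisfy `spec`."""
    rng = random.Random(seed)
    return sum(spec(sample_trace(rng)) for _ in range(n_samples)) / n_samples

# Illustrative sampler: length-3 traces over two colors, uniformly at random.
colors = ("yellow", "blue")
sample = lambda rng: tuple(rng.choice(colors) for _ in range(3))

# P(yellow): some step visits a yellow tile; holds for 7/8 of random traces.
P_yellow = lambda t: any(c == "yellow" for c in t)
est = random_sat_rate(P_yellow, sample)  # close to 7/8 = 0.875
```

The BDD-based model counting then replaces this estimate with an exact ratio of satisfying assignments.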
Compared to a brute force search over the concept class, our algorithm offered an approximately 5.5-fold improvement. Crucially, there exist controllable trajectories satisfying the joint specification:

$$\Big( H \neg \mathrm{red} \;\wedge\; P\,\mathrm{yellow} \Big) \;\wedge\; H\Big( (\mathrm{yellow} \wedge P\,\mathrm{blue}) \implies (\neg \mathrm{blue} \; S \; \mathrm{brown}) \Big) \qquad (18)$$

Thus, a specification optimizing agent must jointly perform both tasks. This holds true even under task changes such as that in Fig 2. Further, observe that it was fairly painless to incorporate the previously known "recharge while avoiding lava" constraint. Thus, in contrast to quantitative Markovian rewards, learning Boolean specifications enabled encoding compositional temporal specifications that are robust to changes in the environment.

6 Conclusion and Future work

Motivated by the problem of compositionally learning from demonstrations, we developed a technique for learning binary non-Markovian rewards, which we referred to as specifications. Because of their limited structure, specifications enabled first learning sub-specifications for subtasks and then later creating a composite specification that encodes the larger task. To learn these specifications from demonstrations, we applied the principle of maximum entropy to derive a novel model for the likelihood of a specification given the demonstrations. We then developed an algorithm to efficiently search for the most probable specification in a candidate pool of specifications in which some subset relations between specifications are known. Finally, in our experiment, we gave a concrete instance where learning traditional composite reward functions is non-obvious and error-prone, but inferring specifications enables trivial composition.
Future work includes extending the formalism to infinite horizon specifications and continuous dynamics, characterizing the optimal set of teacher demonstrations under our posterior model [34], efficiently marginalizing over the whole concept class, and exploring alternative data-driven methods for generating concept classes.

Acknowledgments. We would like to thank the anonymous referees as well as Daniel Fremont, Markus Rabe, Ben Caulfield, Marissa Ramirez Zweiger, Shromona Ghosh, Gil Lederman, Tommaso Dreossi, Anca Dragan, and Natarajan Shankar for their useful suggestions and feedback. The work of the authors on this paper was funded in part by the US National Science Foundation (NSF) under award numbers CNS-1750009, CNS-1740079, CNS-1545126 (VeHICaL), the DARPA BRASS program under agreement number FA8750-16-C0043, the DARPA Assured Autonomy program, by Toyota under the iCyPhy center, and the US ARL Cooperative Agreement W911NF-17-2-0196.

³ One can add probabilities to transitions by adding to transition constraints additional fresh variables such that the number of satisfying assignments is proportional to the probability.

References

[1] D. Amodei, C. Olah, J. Steinhardt, P. Christiano, J. Schulman, and D. Mané. Concrete problems in AI safety. arXiv preprint arXiv:1606.06565, 2016.

[2] F. Bacchus, S. Dalmao, and T. Pitassi. Algorithms and complexity results for #SAT and Bayesian inference. In Proceedings of the 44th Annual IEEE Symposium on Foundations of Computer Science, pages 340–351. IEEE, 2003.

[3] R. E. Bryant. Symbolic Boolean manipulation with ordered binary-decision diagrams. ACM Computing Surveys (CSUR), 24(3):293–318, 1992.

[4] S. Chakraborty, K. S. Meel, and M. Y. Vardi. Algorithmic improvements in approximate counting for probabilistic inference: From linear to logarithmic SAT calls. In
Proceedings of the\n25th International Joint Conference on Arti\ufb01cial Intelligence (IJCAI-16), 2016.\n\n[5] M. Chavira and A. Darwiche. On probabilistic inference by weighted model counting. Arti\ufb01cial\n\nIntelligence, 172(6-7):772\u2013799, 2008.\n\n[6] N. Christo\ufb01des. Graph theory: An algorithmic approach (Computer science and applied\n\nmathematics). Academic Press, Inc., 1975.\n\n[7] B. Farwer. \u03c9-automata. In Automata logics, and in\ufb01nite games, pages 3\u201321. Springer, 2002.\n[8] C. Finn, S. Levine, and P. Abbeel. Guided cost learning: Deep inverse optimal control via policy\n\noptimization. In International Conference on Machine Learning, pages 49\u201358, 2016.\n\n[9] S. Ghosh, S. Jha, A. Tiwari, P. Lincoln, and X. Zhu. Model, data and reward repair: Trusted\nmachine learning for Markov Decision Processes. In 2018 48th Annual IEEE/IFIP International\nConference on Dependable Systems and Networks Workshops (DSN-W), pages 194\u2013199. IEEE,\n2018.\n\n[10] T. Haarnoja, V. Pong, A. Zhou, M. Dalal, P. Abbeel, and S. Levine. Composable deep reinforce-\n\nment learning for robotic manipulation. arXiv preprint arXiv:1803.06773, 2018.\n\n[11] M. K. Ho, M. L. Littman, F. Cushman, and J. L. Austerweil. Teaching with Rewards and\nPunishments: Reinforcement or Communication? In D. Noelle, R. Dale, A. S. Warlaumont,\nJ. Yoshimi, T. Matlock, C. D. Jennings, and P. P. Maglio, editors, Proceedings of the 37th Annual\nConference of the Cognitive Science Society, pages 920\u2013925, Austin, TX, 2015. Cognitive\nScience Society.\n\n[12] E. T. Jaynes. Information theory and statistical mechanics. Physical review, 106(4):620, 1957.\n[13] S. Jha and V. Raman. Automated synthesis of safe autonomous vehicle control under perception\n\nuncertainty. In NASA Formal Methods Symposium, pages 117\u2013132. Springer, 2016.\n\n[14] S. Jha, V. Raman, D. Sadigh, and S. A. Seshia. Safe autonomy under perception uncertainty\nusing chance-constrained temporal logic. 
Journal of Automated Reasoning, 60(1):43–62, 2018.

[15] S. Jha, A. Tiwari, S. A. Seshia, T. Sahai, and N. Shankar. TeLEx: Passive STL learning using only positive examples. In International Conference on Runtime Verification, pages 208–224. Springer, 2017.

[16] D. Kasenberg and M. Scheutz. Interpretable apprenticeship learning with temporal logic specifications. arXiv preprint arXiv:1710.10532, 2017.

[17] H. Kress-Gazit, G. E. Fainekos, and G. J. Pappas. Temporal-logic-based reactive mission and motion planning. IEEE Transactions on Robotics, 25(6):1370–1381, 2009.

[18] S. Levine. Reinforcement learning and control as probabilistic inference: Tutorial and review. CoRR, abs/1805.00909, 2018.

[19] S. Levine, Z. Popovic, and V. Koltun. Nonlinear inverse reinforcement learning with Gaussian processes. In J. Shawe-Taylor, R. S. Zemel, P. L. Bartlett, F. Pereira, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 24, pages 19–27. Curran Associates, Inc., 2011.

[20] W. Li. Specification Mining: New Formalisms, Algorithms and Applications. PhD thesis, EECS Department, University of California, Berkeley, Mar 2014.

[21] M. L. Littman, U. Topcu, J. Fu, C. Isbell, M. Wen, and J. MacGlashan. Environment-Independent Task Specifications via GLTL. arXiv preprint arXiv:1704.04341, Apr. 2017.

[22] N. Metropolis and S. Ulam. The Monte Carlo method. Journal of the American Statistical Association, 44(247):335–341, 1949.

[23] A. Y. Ng, S. J. Russell, et al. Algorithms for inverse reinforcement learning. In ICML, pages 663–670, 2000.

[24] J. Pineau, G. Gordon, S. Thrun, et al. Point-based value iteration: An anytime algorithm for POMDPs. In IJCAI, volume 3, pages 1025–1032, 2003.

[25] A. Pnueli. The temporal logic of programs. In 18th Annual Symposium on Foundations of Computer Science, pages 46–57.
IEEE, 1977.

[26] D. Ramachandran and E. Amir. Bayesian inverse reinforcement learning. IJCAI, 2007.

[27] V. Raman, A. Donzé, D. Sadigh, R. M. Murray, and S. A. Seshia. Reactive synthesis from signal temporal logic specifications. In Proceedings of the 18th International Conference on Hybrid Systems: Computation and Control, pages 239–248. ACM, 2015.

[28] I. Saha, R. Ramaithitima, V. Kumar, G. J. Pappas, and S. A. Seshia. Automated composition of motion primitives for multi-robot systems from safe LTL specifications. In 2014 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS 2014), pages 1525–1532. IEEE, 2014.

[29] S. A. Seshia, D. Sadigh, and S. S. Sastry. Towards Verified Artificial Intelligence. ArXiv e-prints, July 2016.

[30] E. Todorov. Linearly-solvable Markov decision problems. In Advances in Neural Information Processing Systems, pages 1369–1376, 2007.

[31] E. Todorov. General duality between optimal control and estimation. In 2008 47th IEEE Conference on Decision and Control (CDC), pages 4286–4292. IEEE, 2008.

[32] M. Y. Vardi. An automata-theoretic approach to linear temporal logic. In Logics for Concurrency, pages 238–266. Springer, 1996.

[33] M. Vazquez-Chanlatte, J. V. Deshmukh, X. Jin, and S. A. Seshia. Logical clustering and learning for time-series data. In International Conference on Computer Aided Verification, pages 305–325. Springer, 2017.

[34] M. Vazquez-Chanlatte, M. K. Ho, T. L. Griffiths, and S. A. Seshia. Communicating Compositional and Temporal Specifications by Demonstrations, extended abstract. In Symposium on Cyber-Physical Human Systems (CPHS), 2018.

[35] B. D. Ziebart, A. L. Maas, J. A. Bagnell, and A. K. Dey. Maximum entropy inverse reinforcement learning. In AAAI, volume 8, pages 1433–1438,
Chicago, IL, USA, 2008.