{"title": "Bayesian Inference of Temporal Task Specifications from Demonstrations", "book": "Advances in Neural Information Processing Systems", "page_first": 3804, "page_last": 3813, "abstract": "When observing task demonstrations, human apprentices are able to identify whether a given task is executed correctly long before they gain expertise in actually performing that task. Prior research into learning from demonstrations (LfD) has failed to capture this notion of the acceptability of an execution; meanwhile, temporal logics provide a flexible language for expressing task specifications. Inspired by this, we present Bayesian specification inference, a probabilistic model for inferring task specification as a temporal logic formula. We incorporate methods from probabilistic programming to define our priors, along with a domain-independent likelihood function to enable sampling-based inference. We demonstrate the efficacy of our model for inferring true specifications with over 90% similarity between the inferred specification and the ground truth, both within a synthetic domain and a real-world table setting task.", "full_text": "Bayesian Inference of Temporal Task Speci\ufb01cations\n\nfrom Demonstrations\n\nAnkit Shah\nCSAIL, MIT\n\najshah@mit.edu\n\nPritish Kamath\n\nCSAIL, MIT\n\npritish@mit.edu\n\nShen Li\n\nCSAIL, MIT\n\nshenli@mit.edu\n\nJulie Shah\nCSAIL, MIT\n\njulie_a_shah@mit.edu\n\nAbstract\n\nWhen observing task demonstrations, human apprentices are able to identify\nwhether a given task is executed correctly long before they gain expertise in actually\nperforming that task. Prior research into learning from demonstrations (LfD) has\nfailed to capture this notion of the acceptability of an execution; meanwhile, tem-\nporal logics provide a \ufb02exible language for expressing task speci\ufb01cations. 
Inspired by this, we present Bayesian specification inference, a probabilistic model for inferring task specification as a temporal logic formula. We incorporate methods from probabilistic programming to define our priors, along with a domain-independent likelihood function to enable sampling-based inference. We demonstrate the efficacy of our model for inferring specifications with over 90% similarity between the inferred specification and the ground truth, both within a synthetic domain and a real-world table setting task.

1 Introduction

Imagine showing a friend how to play your favorite quest-based video game. A mission within such a game might be composed of multiple sub-quests, all of which must be finished in order to complete that level. In this scenario, it is likely that your friend would comprehend what needs to be done in order to complete the mission well before he or she was actually able to play the game effectively. While learning from demonstrations, human apprentices can identify whether a task is executed correctly well before gaining expertise in that task. Most current approaches to learning from demonstration frame this problem as one of learning a reward function or policy within a Markov decision process setting; however, user specification of acceptable behaviors through reward functions and policies remains an open problem [1]. Temporal logics have been used in prior research as a language for expressing desirable system behaviors, and can improve the interpretability of specifications if expressed as compositions of simpler templates (akin to those described by Dwyer et al. [2]).
In this work, we propose a probabilistic model for inferring the temporal structure of a task as a linear temporal logic (LTL) specification.
A specification inferred from demonstrations is valuable in conjunction with synthesis algorithms for verifiable controllers ([3] and [4]), as a reward signal during reinforcement learning ([5], [6]), and as a system model for execution monitoring. In our work, we frame specification learning as a Bayesian inference problem.
The flexibility of LTL for specifying behaviors also represents a key challenge with regard to inference due to a large hypothesis space. We define prior and likelihood distributions over a smaller but relevant subset of LTL formulas, using templates based on work by Dwyer et al [2]. Ideas from universal probabilistic programming languages formalized by Freer et al [7] and Goodman et al [8], [9] are key to our modeling approach. Indeed, probabilistic programming languages enabled Ellis et al [10], [11] to perform inference over complex, recursively defined hypothesis spaces of graphics programs and pronunciation rules. We demonstrate the capability of our model to achieve greater than 90% similarity between the ground truth specification and the inferred specification, both within a synthetic domain and a real-world task of setting a dinner table.

32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montréal, Canada.

2 Linear Temporal Logic

Linear temporal logic (LTL), introduced by Pnueli [12], provides an expressive grammar for describing temporal behaviors. An LTL formula is composed of atomic propositions (discrete time sequences of Boolean literals) and both logical and temporal operators, and is interpreted over traces [α] of the set of propositions α. The notation [α], t |= φ indicates that φ holds at time t.
The trace [α] satisfies φ (denoted as [α] |= φ) iff [α], 0 |= φ. The minimal syntax of LTL can be described as follows:

φ ::= p | ¬φ1 | φ1 ∨ φ2 | Xφ1 | φ1 U φ2    (1)

p is an atomic proposition; φ1 and φ2 are valid LTL formulas. The operator X is read as 'next', and Xφ1 evaluates as true at time t if φ1 evaluates to true at t + 1. The operator U is read as 'until', and the formula φ1 U φ2 evaluates as true at a time t1 if φ2 evaluates as true at some time t2 > t1 and φ1 evaluates as true for all time steps t such that t1 ≤ t ≤ t2. In addition to the minimal syntax, we also use the additional first-order logic operators ∧ (and) and → (implies), as well as other higher-order temporal operators, F (eventually) and G (globally). Fφ1 evaluates to true at t1 if φ1 evaluates as true for some t ≥ t1. Gφ1 evaluates to true at t1 if φ1 evaluates as true for all t ≥ t1.

3 Bayesian Specification Inference

A large number of tasks composed of multiple subtasks can be represented by a combination of three temporal behaviors among those defined by Dwyer et al [2] — namely, global satisfaction of a proposition, eventual completion of a subtask, and temporal ordering between subtasks. With φglobal, φeventual, and φorder representing LTL formulas for these behaviors, the task specification is written as the conjunction shown in Equation 2.
We represent the task demonstrations as an observed sequence of state variables, x. Let α ∈ {0, 1}^n represent a vector of finite dimension formed by Boolean propositions. α = f(x) (i.e., the propositions) are a function of the state variables of the system at a given time instant.
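The LTL semantics defined in Section 2 can be made concrete with a small finite-trace evaluator. The sketch below is our illustration, not the paper's code: formulas are nested tuples, a trace is a list of dicts mapping proposition names to Boolean values, and the 'until' case follows the inclusive definition given above.

```python
# Finite-trace LTL evaluation (an illustrative sketch, not the paper's code).
# A formula is a nested tuple, e.g. ("U", ("prop", "p"), ("prop", "q"));
# a trace is a list of dicts mapping proposition names to Boolean values.
def holds(phi, trace, t=0):
    op = phi[0]
    if op == "prop":
        return trace[t][phi[1]]
    if op == "not":
        return not holds(phi[1], trace, t)
    if op == "or":
        return holds(phi[1], trace, t) or holds(phi[2], trace, t)
    if op == "and":
        return holds(phi[1], trace, t) and holds(phi[2], trace, t)
    if op == "X":   # next: phi holds at t + 1 (false past the end of the trace)
        return t + 1 < len(trace) and holds(phi[1], trace, t + 1)
    if op == "F":   # eventually: phi holds at some u >= t
        return any(holds(phi[1], trace, u) for u in range(t, len(trace)))
    if op == "G":   # globally: phi holds at every u >= t
        return all(holds(phi[1], trace, u) for u in range(t, len(trace)))
    if op == "U":   # until, per the inclusive definition above:
        # phi2 holds at some t2 > t, with phi1 true for all t <= v <= t2
        return any(holds(phi[2], trace, u)
                   and all(holds(phi[1], trace, v) for v in range(t, u + 1))
                   for u in range(t + 1, len(trace)))
    raise ValueError(f"unknown operator: {op}")
```

Truncating the quantifiers at the end of the trace is the usual finite-trace convention; demonstrations are finite, so any such evaluator must make this choice explicit.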
The output of specification learning is a formula, φ ∈ Φ, that best explains the demonstrations, where Φ is the set of all formulas satisfying the template described in Equation 2:

φ = φglobal ∧ φeventual ∧ φorder    (2)

3.1 Formula Template

Global satisfaction: Let τ be the set of candidate propositions to be globally satisfied, and let T ⊆ τ be the actual subset of propositions globally satisfied. The LTL formula that specifies this behavior is written as follows:

φglobal = ⋀_{τ∈T} (G(τ))    (3)

Such formulas are useful for specifying that some constraints must always be met — for example, a robot avoiding collisions during motion, or an aircraft avoiding no-fly zones.

Eventual completion: Let Ω be the set of all candidate subtasks, and let W1 ⊆ Ω be the set of subtasks that must be completed if the conditions represented by πw, w ∈ W1, are met. ωw are propositions representing the completion of a subtask. The LTL formula that specifies this behavior is written as follows:

φeventual = ⋀_{w∈W1} (πw → Fωw)    (4)

Temporal ordering: Every set of feasible ordering constraints over a set of subtasks is mapped to a directed acyclic graph (DAG) over nodes representing these subtasks. Each edge in the DAG corresponds to a binary precedence constraint. Let W2 be the set of binary temporal orders defined by W2 = {(w1, w2) : w1 ∈ V, w2 ∈ Descendants(w1)}, where V is the set of all nodes in the task graph. Thus, the ordering constraints include an enumeration of not just the edges in the task graph, but all descendants of a given node.
For subtasks w1 and w2, the ordering constraint is written as follows:

φorder = ⋀_{(w1,w2)∈W2} (πw1 → (¬ωw2 U ωw1))    (5)

This formula states that if the conditions for the execution of w1, i.e., πw1, are satisfied, w2 must not be completed until w1 has been completed.
For the purposes of this paper, we assume that all required propositions α = [τ, π, ω]^T and labeling functions f(x) are known, along with the sets τ and Ω, and the mapping of the condition propositions πw to their subtasks. Under these assumptions, the problem of inferring the correct formula for a task is equivalent to identifying the correct subsets T, W1, and W2, which explain the observed demonstrations well.

3.2 Specification Learning as Bayesian Inference

Bayes' theorem is fundamental to the problem of inference, and is stated as follows:

P(h | D) = P(h) P(D | h) / Σ_{h∈H} P(h) P(D | h)    (6)

P(h) is the prior distribution over the hypothesis space, and P(D | h) is the likelihood of observing the data given a hypothesis. Our hypothesis space is defined by H = Φ, where Φ is the set of all formulas that can be generated by the production rule defined by the template in Equation 2. The observed data comprises the set of demonstrations provided to the system by expert demonstrators (note that we assume all these demonstrations are acceptable). D represents a set of sequences of the propositions, defined by D = {[α]}.

3.2.1 Prior specification

While sampling candidate formulas as per the template depicted in Equation 2, we treat the sub-formulas in Equations 3, 4, and 5 as independent of each other. As generating the actual formula, given the selected subsets, is deterministic, sampling φglobal and φeventual is equivalent to selecting a subset of a given finite universal set.
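The three template behaviors of Equations 3, 4, and 5 can be checked directly on a finite trace. The sketch below is our illustration, with several stated assumptions: propositions are stored per time step under keys such as "pi_a" and "omega_a", the implications in Equations 4 and 5 are read off the initial state, and simultaneous completion of ordered subtasks is permitted.

```python
# Check the three template behaviors (Eqs. 3-5) on a finite trace; a sketch
# under our assumptions, with each time step a dict of Boolean propositions.
def satisfies_template(trace, T, W1, W2):
    # phi_global: each tau in T must hold at every time step (G tau).
    if not all(all(step[tau] for step in trace) for tau in T):
        return False
    # phi_eventual: pi_w -> F omega_w, with pi_w read off the initial state.
    for w in W1:
        if trace[0][f"pi_{w}"] and not any(step[f"omega_{w}"] for step in trace):
            return False
    # phi_order: pi_w1 -> (not omega_w2 U omega_w1), read off the initial state.
    for (w1, w2) in W2:
        if trace[0][f"pi_{w1}"]:
            t1 = next((t for t, step in enumerate(trace)
                       if step[f"omega_{w1}"]), None)
            # w1 must complete, and w2 must not complete strictly before w1
            # (simultaneous completion is allowed here; a modeling choice).
            if t1 is None or any(step[f"omega_{w2}"] for step in trace[:t1]):
                return False
    return True
```

A demonstration is then "acceptable" with respect to a candidate (T, W1, W2) exactly when this check passes.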
Given a set A, we define SampleSubset(A, p) as the process of applying a Bernoulli trial with success probability p to each element of A and returning the subset of elements for which the trial was successful. Thus, sampling φglobal and φeventual is accomplished by performing SampleSubset(τ, pG) and SampleSubset(Ω, pE). Sampling φorder is equivalent to sampling a DAG, with the nodes of the graph representing subtasks. Based on domain knowledge, appropriately constraining the DAG topologies would result in better inference with fewer demonstrations. Here, we present three possible methods of sampling a DAG, with different restrictions on the graph topology.

Linear chains: A linear chain is a DAG such that all subtasks must occur in a single, unique sequence out of all permutations. Sampling a linear chain is equivalent to selecting a permutation from a uniform distribution and is achieved via the following probabilistic program: for a set of size n, sample n − 1 elements from that set without replacement, with uniform probability.

Table 1: Prior definitions and hyperparameters.

    Prior     φorder                                 Hyperparameters
    Prior 1   RandomPermutation(Ω)                   pG, pE
    Prior 2   SampleSetsOfLinearChains(Ω, ppart)     pG, pE, ppart
    Prior 3   SampleForestOfSubtasks(Ω, Nnew)        pG, pE, Nnew

Sets of linear chains: This graph topology includes graphs formed by a set of disjoint sub-graphs, each of which is either a linear chain or a solitary node.

Algorithm 1 SampleSetsOfLinearChains
1: function SampleSetsOfLinearChains(Ω, ppart)
2:     i ← 1; Ci ← []
3:     P ← random permutation(Ω)
4:     for a ∈ P do
5:         Ci.append(a)
6:         k ← Bernoulli(ppart)
7:         if k = 1 then
8:             i ← i + 1; Ci ← []
9:     return Cj ∀ j
The execution of subtasks within a particular linear chain must be completed in the specified order; however, no temporal constraints exist between the chains. Algorithm 1 depicts a probabilistic program for constructing these sets of chains. In line 2, the first active linear chain is initialized as an empty sequence. In line 3, a random permutation of the nodes is produced. For each element a ∈ P, line 5 adds the element to the last active chain. Lines 6 and 8 ensure that after each element, either a new active chain is initiated (with probability ppart) or the old active chain continues (with probability 1 − ppart).
Forest of sub-tasks: This graph topology includes forests (i.e., sets of disjoint trees). A given node has no temporal constraints with respect to its siblings, but must precede all its descendants. Algorithm 2 depicts a probabilistic program for sampling a forest. Line 2 creates a random permutation of the subtasks, P. Line 3 initializes an empty forest. In order to support a recursive sampling algorithm, the data structure representing forests is defined as an array of trees, F. The ith tree has two attributes: a root node, F[i].root, and a 'descendant forest', F[i].descendants, in which the root node of each tree is a child of F[i].root. The length of the forest, F.length, is the number of trees included in that forest. The size of a tree, F[i].size, is the number of nodes within the tree (i.e., the root node and all of its descendants). For each subtask in the random permutation P, line 5 inserts the given subtask into the forest as per the recursive function InsertIntoForest defined in lines 7 through 13. In line 8, an integer i is sampled from a categorical distribution, with {1, 2, . . . , F.length + 1} as the possible outcomes.
The probability of each outcome is proportional to the size of the trees in the forest, while the probability of F.length + 1 being the outcome is proportional to Nnew, a user-defined parameter. This sampling process is similar in spirit to the Chinese restaurant process [13]. If the outcome of the draw is F.length + 1, then a new tree with root node a is created in line 10; otherwise, InsertIntoForest is called recursively to add a to the forest F[i].descendants, as per line 12.

Algorithm 2 SampleForestOfSubtasks
1: function SampleForestOfSubtasks(Ω, Nnew)
2:     P ← random permutation(Ω)
3:     F ← []
4:     for a ∈ P do
5:         F ← InsertIntoForest(F, a)
6:     return F
7: function InsertIntoForest(F, a)
8:     i ← Categorical([F[1].size, F[2].size, . . . , F[F.length].size, Nnew])
9:     if i = F.length + 1 then
10:        Create new tree F[F.length + 1].root = a
11:    else
12:        F[i].descendants = InsertIntoForest(F[i].descendants, a)
13:    return F

Three prior distributions based on the four probabilistic programs are described in Table 1. In all the priors, φglobal and φeventual are sampled using SampleSubset(τ, pG) and SampleSubset(Ω, pE), respectively.

3.2.2 Likelihood function

The likelihood distribution, P({[α]} | φ), is the probability of observing the trajectories within the data set given the candidate specification. It is reasonable to assume that the demonstrations are independent of each other; thus, the total likelihood can be factored as P({[α]} | φ) = ∏_{[α]∈D} P([α] | φ).
The probability of observing a given trajectory demonstration is dependent upon the underlying dynamics of the domain and the characteristics of the agents producing the demonstrations.
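The three sampling routines above (SampleSubset, Algorithm 1, and Algorithm 2) can be sketched in Python as follows. This is our illustrative translation, not the paper's webppl code: a seeded random generator stands in for the probabilistic-program primitives, trees are (root, children) pairs, and empty trailing chains are dropped as a minor simplification.

```python
import random

def sample_subset(universe, p, rng):
    # SampleSubset(A, p): keep each element independently via a Bernoulli(p) trial.
    return {x for x in universe if rng.random() < p}

def sample_sets_of_linear_chains(omega, p_part, rng):
    # Algorithm 1: scan a random permutation; after each element, start a new
    # chain with probability p_part, otherwise extend the current chain.
    perm = list(omega)
    rng.shuffle(perm)
    chains, current = [], []
    for a in perm:
        current.append(a)
        if rng.random() < p_part:
            chains.append(current)
            current = []
    if current:                         # drop an empty trailing chain (our choice)
        chains.append(current)
    return chains

def sample_forest_of_subtasks(omega, n_new, rng):
    # Algorithm 2: insert each subtask into a forest, choosing an existing tree
    # with probability proportional to its size, or a new tree with weight
    # n_new (similar in spirit to the Chinese restaurant process).
    def size(tree):
        root, children = tree
        return 1 + sum(size(t) for t in children)

    def insert(forest, a):
        weights = [size(t) for t in forest] + [n_new]
        i = rng.choices(range(len(forest) + 1), weights=weights)[0]
        if i == len(forest):
            forest.append((a, []))      # new tree rooted at a
        else:
            insert(forest[i][1], a)     # recurse into the chosen subtree
        return forest

    perm = list(omega)
    rng.shuffle(perm)
    forest = []
    for a in perm:
        insert(forest, a)
    return forest
```

Together these generate the candidate subsets T and W1 and the DAG from which W2 is enumerated.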
In the absence of this knowledge, our aim is to develop an informative, domain-independent proxy for the true likelihood function based only on the properties of the candidate formula; we call this the 'Complexity-based' (CB) likelihood function. Our approach is founded upon the classical interpretation of probability championed by Laplace [14], which involves computing probabilities in terms of a set of equally likely outcomes. Let there be Nconj conjunctive clauses in φ; there are then 2^Nconj possible outcomes in terms of the truth values of the conjunctive clauses. In the absence of any additional information, we assign equal probabilities to each of the potential outcomes. Then, according to the classical interpretation of probability, for candidate formula φ1, defined by subsets T1, W1^1, and W2^1; and φ2, defined by subsets T2, W1^2, and W2^2, the likelihood odds ratio is defined as follows:

P([α] | φ1) / P([α] | φ2) = 2^(Nconj,1) / 2^(Nconj,2) = 2^(|T1| + |W1^1| + |W2^1|) / 2^(|T2| + |W1^2| + |W2^2|),   if [α] |= φ2
P([α] | φ1) / P([α] | φ2) = 2^(|T1| + |W1^1| + |W2^1|) / (ε · 2^(|T2| + |W1^2| + |W2^2|)),                        if [α] ⊭ φ2    (7)

Here, a finite probability proportional to ε is assigned to a demonstration that does not satisfy the given candidate formula. With this likelihood distribution, a more restrictive formula with a low prior probability can gain favor over a simpler formula with higher prior probability given a large number of observations that would satisfy it. However, if the candidate formula is not the true specification, a larger set of demonstrations is more likely to include non-satisfying examples, thereby substantially decreasing the posterior probability of the candidate formula.
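Working in log space makes this trade-off concrete. The sketch below is our reading of the CB likelihood (each demonstration contributes a factor of 2^Nconj, and each non-satisfying demonstration an additional factor of ε), together with a helper for ε that mirrors the paper's stated choice that one violation offsets four satisfying demonstrations; both are our interpretation, not the authors' code.

```python
import math

def log_likelihood_cb(n_satisfied, n_violated, n_conj, log_eps):
    # Our reading of the CB likelihood: every demonstration contributes
    # n_conj * log(2); every violating demonstration additionally pays
    # the log(eps) penalty (eps << 1).
    total = n_satisfied + n_violated
    return total * n_conj * math.log(2) + n_violated * log_eps

def default_log_eps(n_tau, n_omega):
    # eps chosen so that a single violation offsets four satisfying
    # demonstrations at the maximum possible clause count (our interpretation
    # of the setting described in the inference section).
    n_max = n_tau + n_omega + 0.5 * n_omega * (n_omega - 1)
    return -4.0 * math.log(2) * n_max
```

With enough satisfying demonstrations, a formula with more conjunctive clauses overtakes a simpler one; a single violating demonstration reverses that advantage.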
The design of this likelihood function is inspired by the size principle described by Tenenbaum [15].
A second choice for a likelihood function, inspired by Shepard [16], is defined as the SIM model by Tenenbaum [15]. We call this the 'Complexity-independent' (CI) likelihood function, and it is defined as follows:

P([α] | φ) = 1 − ε,   if [α] |= φ
P([α] | φ) = ε,       otherwise    (8)

3.2.3 Inference

We implemented our probabilistic model in webppl [9], a Turing-complete probabilistic programming language. The hyperparameters, including those defined in Table 1 and ε, were set as follows: pE, pG = 0.8; ppart = 0.3; Nnew = 5; log(ε) = −4 × log(2) × (|τ| + |Ω| + 0.5|Ω|(|Ω| − 1)). These values were held constant for all evaluation scenarios. The equation for ε was defined such that evidence of a single non-satisfying demonstration would negate the contribution of four satisfying demonstrations to the posterior probability. The posterior distribution of candidate formulas is constructed using webppl's Markov chain Monte Carlo (MCMC) sampling algorithm from 10,000 samples, with 100 samples used as burn-in. The posterior distribution is stored as a categorical distribution, with each possibility representing a unique formula. The maximum a posteriori (MAP) candidate represents the best estimate for the specification as per the model.
The inference was run on a desktop with an Intel i7-7700 processor.

4 Evaluations

We evaluated the performance of our model within two different domains: a synthetic domain in which we could easily vary the complexity of the ground truth specifications, and a domain representing the real-world task of setting a dinner table — a task often incorporated into studies of learning from demonstration ([17]).
If the ground truth formula is defined using subsets T∗, W1∗, and W2∗ (as per Equations 3, 4, and 5), and a candidate formula φ is defined by subsets T, W1, and W2, we define the degree of similarity using the Jaccard index [18] as follows:

L(φ) = |(T∗ ∪ W1∗ ∪ W2∗) ∩ (T ∪ W1 ∪ W2)| / |(T∗ ∪ W1∗ ∪ W2∗) ∪ (T ∪ W1 ∪ W2)|    (9)

The maximum possible value of L(φ) is one, attained when both formulas are equivalent. One key benefit of our approach is that we compute a posterior distribution over candidate formulas; thus, we report the expected value E[L(φ)] as a measure of the deviation of the inferred distribution from the ground truth.

Figure 1: Example trajectories from Scenario 1. Green circles denote the POIs and the red circles denote the avoidance zones of threats.

Figure 2: Figure 2a depicts the results from Scenario 1, with the dotted line representing the maximum possible value of L(φ). Figure 2b shows the number of unique formulas in the posterior distribution, Figure 2c indicates the L(φ) values for Scenario 2, and Figure 2d depicts the correct and extra orderings inferred in Scenario 2. The dotted lines represent the number of orderings in the true specification. (Panels: (a) Scenario 1, (b) Scenario 1, (c) Scenario 2 L(φ), (d) Scenario 2 orders.)
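Equation 9 amounts to a Jaccard index over the pooled clause sets of the two formulas. A minimal sketch (the set elements, proposition names and ordered pairs, are illustrative):

```python
# Equation 9: Jaccard index between the ground-truth and candidate clause
# sets; T holds proposition names, W1 subtask names, W2 ordered pairs, so
# the three kinds of elements never collide inside the pooled sets.
def similarity(t_true, w1_true, w2_true, t_cand, w1_cand, w2_cand):
    truth = set(t_true) | set(w1_true) | set(w2_true)
    cand = set(t_cand) | set(w1_cand) | set(w2_cand)
    return len(truth & cand) / len(truth | cand)
```

A score of 1.0 indicates that the inferred formula recovers exactly the ground-truth clause sets; missing or extra clauses lower the score symmetrically.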
We also report the maximum value of L(φ) among the top 5 candidates in the posterior distribution. We additionally classify the inferred orders in W2 as correct if they are included in the ground truth, incorrect if they reverse any constraint within the ground truth, and "extra" otherwise. (Extra orders over-constrain the problem, but do not induce incorrect behaviors.)
We evaluated our approach against the temporal logic inference (TempLogIn) algorithm proposed by Kong et al [19]. TempLogIn mines parametric STL (PSTL) specifications by conducting a breadth-first search through a DAG induced by a partial ordering relation between PSTL formulas. Note that while our approach requires only positive examples, temporal logic inference must be trained on both positive and negative examples.

4.1 Synthetic Domain

In our synthetic domain, an agent navigates within a two-dimensional space that includes points of interest (POIs) to visit and threats to avoid. A predicate, ωi, is associated with each POI and evaluates as true if the agent is within a tolerance region of the given POI. Each threat has a predicate, τi, associated with it, which evaluates as true if the agent enters an avoidance region for that threat.

Figure 3: Figure 3a depicts all the final configurations. Figure 3b depicts the demonstration setup.

Finally, propositions πi are associated with the accessibility of the ith POI, and evaluate as true if the given POI is not within the avoidance region of any threat.
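A labeling function f(x) for this domain can be sketched as follows; the geometry, proposition names, and radii here are our illustrative assumptions, not the paper's implementation.

```python
import math

def propositions(agent, pois, threats, tol=0.5, avoid=1.0):
    # Labeling function f(x): map raw 2-D positions to Boolean propositions.
    # omega_i: agent within the tolerance region of POI i;
    # tau_j:   agent inside threat j's avoidance region;
    # pi_i:    POI i is accessible, i.e. outside every avoidance region.
    # Names and radii are illustrative assumptions.
    def near(p, q, radius):
        return math.hypot(p[0] - q[0], p[1] - q[1]) <= radius

    alpha = {}
    for i, poi in enumerate(pois):
        alpha[f"omega_{i}"] = near(agent, poi, tol)
        alpha[f"pi_{i}"] = not any(near(poi, th, avoid) for th in threats)
    for j, th in enumerate(threats):
        alpha[f"tau_{j}"] = near(agent, th, avoid)
    return alpha
```

Applying this function to every time step of a raw trajectory yields the proposition trace [α] over which the candidate formulas are evaluated.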
The agent is programmed to visit the accessible POIs and avoid threats, as per the ground truth specification. In Scenario 1, we generated example trajectories in which the agent visits four POIs in a specific order [1, 2, 3, 4]. During each demonstration, five threat locations were sampled from a uniform distribution in the task space. Figure 1 depicts some of the demonstrated trajectories. In Scenario 2, we incorporated five POIs: [1, 3, 5] must be visited in that specific order, while {2, 4} can be visited in any order if accessible.
For Scenario 1, the posterior distribution was computed using prior 1 (defined in Table 1) and both CB (Equation 7) and CI (Equation 8) likelihood functions for different training set sizes. The expected value of L(φ) and its maximum value among the top 5 formula candidates are depicted in Figure 2a.
We observed that the CB likelihood function performed better than the CI likelihood function at inferring the complete specification. Using the CI likelihood resulted in a higher posterior probability assigned to formulas with high prior probability that were satisfied by all demonstrations. (These tended to be simple, non-informative formulas; the CB likelihood function assigned higher probability mass to more complex formulas that explained the demonstrations correctly.) Figure 2b depicts the number of unique formulas in the posterior distributions. The CB likelihood function resulted in more peaked posteriors, with fewer unique formulas as the training set size increased; this effect was not observed with the CI likelihood function.
For Scenario 2, the posterior distribution was computed using priors 2 and 3, as the ground truth specification did not lie in the support of prior 1. The expected value of L(φ) and its maximum value among the top 5 formula candidates are depicted in Figure 2c.
Given a sufficient number of training examples, both priors were able to infer the complete formula; with 10 or more training examples, both priors returned the ground truth formula among the top 5 candidates with regard to posterior probabilities. Figure 2d depicts the correct and extra orders inferred in Scenario 2. Prior 3 assigns a larger prior probability to longer task chains compared with prior 2, but the two priors converge to the correct specification given enough training examples.
The runtime for MCMC inference is a function of the number of samples generated, the number of demonstrations in the training set, and the length of the demonstrations. Scenarios 1 and 2 required an average runtime of 10 and 90 minutes for training set sizes of 5 and 50, respectively.
TempLogIn [19] required 33 minutes to terminate with three PSTL clauses. For both Scenarios 1 and 2, the mined formulas did not capture any of the temporal behaviors in Section 3.1, indicating that more PSTL clauses were required. With five and 10 PSTL clauses, the algorithm did not terminate within the 24-hour runtime cutoff. Scaling TempLogIn to larger formula lengths is difficult as the size of the search graph increases exponentially with the number of PSTL clauses, and the algorithm must evaluate all formula candidates of length n before candidates of length n + 1.

4.2 Dinner Table Domain

We also tested our model on a real-world task: setting a dinner table. The task featured eight dining set pieces that had to be organized on a table while avoiding contact with a centerpiece. Figure 3a depicts each of the final configurations of the dining set pieces. The pieces were varied in each configuration, but the position of a given piece on the table was constant across configurations, with positions marked on the table in order to guide the demonstrator.
A predicate τ was associated with the centerpiece, encoding whether the demonstrators' wrists got too close to it. πi was associated with the ith dinner piece, and encoded whether that piece needed to be placed on the table. ωi was associated with the ith dinner piece, and encoded whether it was at its correct and final position.

Figure 4: Figure 4a depicts the L(φ) values for the dinner table domain, with the dotted line representing the maximum possible value. Figure 4b depicts the correct and extra orderings inferred within this domain. The dotted lines represent the number of orderings in the true specification. (Panels: (a) dinner table L(φ), (b) dinner table orders.)

In some of the configurations, the dinner plate, small plate and bowl were stacked on top of each other; in this case, the true specification would be to eventually position all the required dinner pieces by placing the dinner plate, small plate, and bowl, in that order. The state space x consisted of the positions of each of the dinner pieces and the wrists of the demonstrator, all of which were tracked via a motion capture system. The truth values of ωi and τ were evaluated using task-space region constraints defined by Berenson et al [20]. A total of 71 demonstrations were collected, and randomly sampled subsets of different sizes were used to learn the specifications. The expected value of L(φ) and its maximum value among the top 5 candidates are depicted in Figure 4a; the numbers of correct and extra orders are depicted in Figure 4b. With prior 2, our model correctly identified the ground truth in all cases. With prior 3, the inferred formula contained additional ordering constraints compared with the ground truth. Using all 71 demonstrations, the MAP candidate had one additional ordering constraint: that the fork be placed before the spoon.
Upon review, this constraint was satisfied in all but four of the demonstrations.

5 Related Work

One common approach in prior research frames learning from demonstration as an inverse reinforcement learning (IRL) problem. Ng et al [21] and Abbeel et al [22] first formalized the problem of inverse reinforcement learning as one of optimization in order to identify the reward function that best explains observed demonstrations. Ziebart et al [23] introduced algorithms to compute the optimal policy for imitation using the maximum entropy criterion. Konidaris et al [24] and Niekum et al [25] framed IRL in a semi-Markov setting, allowing for an implicit representation of the temporal structure of the task. Surveys by Argall et al [26] and Chernova et al [27] provided a comprehensive review of techniques built upon these works as applied to robotics. However, according to Arnold et al [1], one drawback of inverse reinforcement learning is that it is non-trivial to extract task specifications from a learned reward function or policy. Our method bridges this gap by directly learning the specifications for acceptable execution of the given task.
Temporal logics, introduced by Pnueli [12], are an expressive grammar used to describe the desirable temporal properties of task execution. Temporal logics have previously been used as a language for goal definitions in reinforcement learning algorithms ([5], [6]), reactive controller synthesis ([3], [4]), and domain-independent planning [28].
Kasenberg and Scheutz [29] explored mining globally persistent specifications from optimal traces of a finite-state Markov decision process (MDP). Jin et al [30] proposed algorithms for mining temporal specifications similar to rise time and settling time for closed-loop control systems.
Works by Kong et al [31], [19], Yoo and Belta [32], and Lemieux et al [33] are most closely related to our own, as our work incorporates only the observed state variables (and not the actions of the demonstrators) as input to the model. Lemieux et al [33] introduced Texada, a general specification mining tool for software logs. Texada outputs all possible instances of a particular formula template that are satisfied; however, it treats each time step as a string, with all unique strings within the log treated as unique propositions. Texada would treat a system with n propositions as a system with 2^n distinct propositions — thus, interpreting a mined formula is non-trivial. Kong et al [31], [19] and Yoo and Belta [32] mined PSTL specifications for given demonstrations while simultaneously inferring signal propositions akin to our own user-defined atomic propositions by conducting breadth-first search over a DAG formed by candidate formulas. Our prior specifications allow for better connectivity between different formulas, while using MCMC-based approximate inference allows for fixed runtimes.
We adopt a fully Bayesian approach to model the inference problem, enabling our model to maintain a posterior distribution over candidate formulas. This distribution provides a measure of confidence when predicting the acceptability of a new demonstration that the aforementioned approaches do not.

6 Conclusion

In conclusion, we presented a probabilistic model to infer task specifications in terms of three behaviors encoded as LTL templates. We presented three prior distributions that allow for efficient sampling of candidate formulas as per the templates.
We also presented a likelihood function that depends only on the number of conjunctive clauses in the candidate formula and is transferable across domains, as it requires no information about the domain itself. Finally, we demonstrated that our model inferred specifications with over 90% similarity to the ground truth, both within a synthetic domain and a real-world task of setting a dinner table.

Acknowledgements

This research was funded in part by Lockheed Martin Corporation and the Air Force Research Laboratory. Approved for Public Release: distribution unlimited, 88ABW-2018-2502, 16 May 2018.

References
[1] T. Arnold, D. Kasenberg, and M. Scheutz, “Value alignment or misalignment–what will keep systems accountable,” in 3rd International Workshop on AI, Ethics, and Society, 2017.

[2] M. B. Dwyer, G. S. Avrunin, and J. C. Corbett, “Patterns in property specifications for finite-state verification,” in Proceedings of the 21st International Conference on Software Engineering, pp. 411–420, ACM, 1999.

[3] H. Kress-Gazit, G. E. Fainekos, and G. J. Pappas, “Temporal-logic-based reactive mission and motion planning,” IEEE Transactions on Robotics, vol. 25, no. 6, pp. 1370–1381, 2009.

[4] V. Raman, A. Donzé, D. Sadigh, R. M. Murray, and S. A. Seshia, “Reactive synthesis from signal temporal logic specifications,” in Proceedings of the 18th International Conference on Hybrid Systems: Computation and Control, pp. 239–248, ACM, 2015.

[5] X. Li, C.-I. Vasile, and C. Belta, “Reinforcement learning with temporal logic rewards,” in Intelligent Robots and Systems (IROS), 2017 IEEE/RSJ International Conference on, pp. 3834–3839, IEEE, 2017.

[6] M. L. Littman, U. Topcu, J. Fu, C. Isbell, M. Wen, and J. MacGlashan, “Environment-independent task specifications via GLTL,” arXiv preprint arXiv:1704.04341, 2017.

[7] C. E.
Freer, D. M. Roy, and J. B. Tenenbaum, “Towards common-sense reasoning via conditional simulation: legacies of Turing in artificial intelligence,” in Turing’s Legacy (R. Downey, ed.), vol. 42 of Lecture Notes in Logic, pp. 195–252, Cambridge University Press, 2014.

[8] N. Goodman, V. Mansinghka, D. M. Roy, K. Bonawitz, and J. B. Tenenbaum, “Church: a language for generative models,” arXiv preprint arXiv:1206.3255, 2012.

[9] N. D. Goodman and A. Stuhlmüller, “The Design and Implementation of Probabilistic Programming Languages.” http://dippl.org, 2014. Accessed: 2018-4-9.

[10] K. Ellis, D. Ritchie, A. Solar-Lezama, and J. B. Tenenbaum, “Learning to infer graphics programs from hand-drawn images,” arXiv preprint arXiv:1707.09627, 2017.

[11] K. Ellis, A. Solar-Lezama, and J. Tenenbaum, “Unsupervised learning by program synthesis,” in Advances in Neural Information Processing Systems, pp. 973–981, 2015.

[12] A. Pnueli, “The temporal logic of programs,” in Foundations of Computer Science, 18th Annual Symposium on, pp. 46–57, IEEE, 1977.

[13] D. J. Aldous, “Exchangeability and related topics,” in École d’Été de Probabilités de Saint-Flour XIII — 1983 (P. L. Hennequin, ed.), (Berlin, Heidelberg), p. 92, Springer Berlin Heidelberg, 1985.

[14] P.-S. Laplace, “A philosophical essay on probabilities, translated from the 6th French edition by Frederick Wilson Truscott and Frederick Lincoln Emory,” 1951.

[15] J. B. Tenenbaum, “Rules and similarity in concept learning,” in Advances in Neural Information Processing Systems, pp. 59–65, 2000.

[16] R. Shepard, “Toward a universal law of generalization for psychological science,” Science, vol. 237, no. 4820, pp. 1317–1323, 1987.

[17] R. Toris, D. Kent, and S.
Chernova, “Unsupervised learning of multi-hypothesized pick-and-place task templates via crowdsourcing,” in Robotics and Automation (ICRA), 2015 IEEE International Conference on, pp. 4504–4510, IEEE, 2015.

[18] P. Jaccard, “The distribution of the flora in the alpine zone,” New Phytologist, vol. 11, no. 2, pp. 37–50, 1912.

[19] Z. Kong, A. Jones, and C. Belta, “Temporal logics for learning and detection of anomalous behavior,” IEEE Transactions on Automatic Control, vol. 62, no. 3, pp. 1210–1222, 2017.

[20] D. Berenson, S. Srinivasa, and J. Kuffner, “Task space regions: A framework for pose-constrained manipulation planning,” The International Journal of Robotics Research, vol. 30, no. 12, pp. 1435–1460, 2011.

[21] A. Y. Ng and S. J. Russell, “Algorithms for inverse reinforcement learning,” in Proceedings of the Seventeenth International Conference on Machine Learning, ICML ’00, (San Francisco, CA, USA), pp. 663–670, Morgan Kaufmann Publishers Inc., 2000.

[22] P. Abbeel and A. Y. Ng, “Apprenticeship learning via inverse reinforcement learning,” in Proceedings of the Twenty-First International Conference on Machine Learning, p. 1, ACM, 2004.

[23] B. D. Ziebart, A. L. Maas, J. A. Bagnell, and A. K. Dey, “Maximum entropy inverse reinforcement learning,” in AAAI, vol. 8, pp. 1433–1438, Chicago, IL, USA, 2008.

[24] G. Konidaris, S. Kuindersma, R. Grupen, and A. Barto, “Robot learning from demonstration by constructing skill trees,” The International Journal of Robotics Research, vol. 31, no. 3, pp. 360–375, 2012.

[25] S. Niekum, S. Osentoski, G. Konidaris, S. Chitta, B. Marthi, and A. G. Barto, “Learning grounded finite-state representations from unstructured demonstrations,” The International Journal of Robotics Research, vol. 34, no. 2, pp. 131–157, 2015.

[26] B. D. Argall, S.
Chernova, M. Veloso, and B. Browning, “A survey of robot learning from demonstration,” Robotics and Autonomous Systems, vol. 57, no. 5, pp. 469–483, 2009.

[27] S. Chernova and A. L. Thomaz, “Robot learning from human teachers,” Synthesis Lectures on Artificial Intelligence and Machine Learning, vol. 8, no. 3, pp. 1–121, 2014.

[28] J. Kim, C. J. Banks, and J. A. Shah, “Collaborative planning with encoding of users’ high-level strategies,” in AAAI, pp. 955–962, 2017.

[29] D. Kasenberg and M. Scheutz, “Interpretable apprenticeship learning with temporal logic specifications,” arXiv preprint arXiv:1710.10532, 2017.

[30] X. Jin, A. Donzé, J. V. Deshmukh, and S. A. Seshia, “Mining requirements from closed-loop control models,” IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 34, no. 11, pp. 1704–1717, 2015.

[31] Z. Kong, A. Jones, A. Medina Ayala, E. Aydin Gol, and C. Belta, “Temporal logic inference for classification and prediction from data,” in Proceedings of the 17th International Conference on Hybrid Systems: Computation and Control, pp. 273–282, ACM, 2014.

[32] C. Yoo and C. Belta, “Rich time series classification using temporal logic,” in Robotics: Science and Systems, 2017.

[33] C. Lemieux, D. Park, and I. Beschastnikh, “General LTL specification mining (T),” in Automated Software Engineering (ASE), 2015 30th IEEE/ACM International Conference on, pp. 81–92, IEEE, 2015.