{"title": "Unsupervised Structure Learning of Stochastic And-Or Grammars", "book": "Advances in Neural Information Processing Systems", "page_first": 1322, "page_last": 1330, "abstract": "Stochastic And-Or grammars compactly represent both compositionality and reconfigurability and have been used to model different types of data such as images and events. We present a unified formalization of stochastic And-Or grammars that is agnostic to the type of the data being modeled, and propose an unsupervised approach to learning the structures as well as the parameters of such grammars. Starting from a trivial initial grammar, our approach iteratively induces compositions and reconfigurations in a unified manner and optimizes the posterior probability of the grammar. In our empirical evaluation, we applied our approach to learning event grammars and image grammars and achieved comparable or better performance than previous approaches.", "full_text": "Unsupervised Structure Learning of Stochastic\n\nAnd-Or Grammars\n\nMaria Pavlovskaia\n\nSong-Chun Zhu\n\nKewei Tu\n\nCenter for Vision, Cognition, Learning and Art\nDepartments of Statistics and Computer Science\n{tukw,mariapavl,sczhu}@ucla.edu\n\nUniversity of California, Los Angeles\n\nAbstract\n\nStochastic And-Or grammars compactly represent both compositionality and re-\ncon\ufb01gurability and have been used to model different types of data such as images\nand events. We present a uni\ufb01ed formalization of stochastic And-Or grammars\nthat is agnostic to the type of the data being modeled, and propose an unsupervised\napproach to learning the structures as well as the parameters of such grammars.\nStarting from a trivial initial grammar, our approach iteratively induces composi-\ntions and recon\ufb01gurations in a uni\ufb01ed manner and optimizes the posterior prob-\nability of the grammar. 
In our empirical evaluation, we applied our approach to\nlearning event grammars and image grammars and achieved comparable or better\nperformance than previous approaches.\n\n1\n\nIntroduction\n\nStochastic grammars are traditionally used to represent natural language syntax and semantics, but\nthey have also been extended to model other types of data like images [1, 2, 3] and events [4, 5,\n6, 7]. It has been shown that stochastic grammars are powerful models of patterns that combine\ncompositionality (i.e., a pattern can be decomposed into a certain con\ufb01guration of sub-patterns) and\nrecon\ufb01gurability (i.e., a pattern may have multiple alternative con\ufb01gurations). Stochastic grammars\ncan be used to parse data samples into their compositional structures, which help solve tasks like\nclassi\ufb01cation, annotation and segmentation in a uni\ufb01ed way. We study stochastic grammars in the\nform of stochastic And-Or grammars [1], which are an extension of stochastic grammars in natural\nlanguage processing [8, 9] and are closely related to sum-product networks [10]. Stochastic And-Or\ngrammars have been used to model spatial structures of objects and scenes [1, 3] as well as temporal\nstructures of actions and events [7].\nManual speci\ufb01cation of a stochastic grammar is typically very dif\ufb01cult and therefore machine learn-\ning approaches are often employed to automatically induce unknown stochastic grammars from data.\nIn this paper we study unsupervised learning of stochastic And-Or grammars in which the training\ndata are unannotated (e.g., images or action sequences).\nThe learning of a stochastic grammar involves two parts: learning the grammar rules (i.e., the struc-\nture of the grammar) and learning the rule probabilities or energy terms (i.e., the parameters of the\ngrammar). 
One strategy in unsupervised learning of stochastic grammars is to manually specify\na \ufb01xed grammar structure (in most cases, the full set of valid grammar rules) and try to optimize\nthe parameters of the grammar. Many approaches to learning natural language grammars (e.g.,\n[11, 12]) as well as some approaches to learning image grammars [10, 13] adopt this strategy. The\nmain problem of this strategy is that in some scenarios the full set of valid grammar rules is too large\nfor practical learning and inference, while manual speci\ufb01cation of a compact grammar structure is\nchallenging. For example, in an image grammar the number of possible grammar rules to decompose\nan image patch is exponential in the size of the patch; previous approaches restrict the valid\nways of decomposing an image patch (e.g., allowing only horizontal and vertical segmentations),\nwhich however reduces the expressive power of the image grammar.\nIn this paper, we propose an approach to learning both the structure and the parameters of a stochastic\nAnd-Or grammar. Our approach extends the previous work on structure learning of natural\nlanguage grammars [14, 15, 16] and improves upon the recent work on structure learning of And-Or\ngrammars of images [17] and events [18]. Starting from a trivial initial grammar, our approach\niteratively inserts new fragments into the grammar to optimize its posterior probability. 
Most of\nthe previous structure learning approaches learn new compositions and recon\ufb01gurations modeled\nin the grammar in a separate manner, which can be error-prone when the training data is scarce or\nambiguous; in contrast, we induce And-Or fragments of the grammar, which uni\ufb01es the search for\nnew compositions and recon\ufb01gurations, making our approach more ef\ufb01cient and robust.\nOur main contributions are as follows.\n\n\u2022 We present a formalization of stochastic And-Or grammars that is agnostic to the types of\natomic patterns and their compositions. Consequently, our learning approach is capable of\nlearning from different types of data, e.g., text, images, events.\n\u2022 Unlike some previous approaches that rely on heuristics for structure learning, we explicitly\noptimize the posterior probability of both the structure and the parameters of the grammar.\nThe optimization procedure is made ef\ufb01cient by deriving and utilizing a set of suf\ufb01cient\nstatistics from the training data.\n\u2022 We learn compositions and recon\ufb01gurations modeled in the grammar in a uni\ufb01ed manner\nthat is more ef\ufb01cient and robust to data scarcity and ambiguity than previous approaches.\n\u2022 We empirically evaluated our approach in learning event grammars and image grammars\nand it achieved comparable or better performance than previous approaches.\n\n2 Stochastic And-Or Grammars\n\nStochastic And-Or grammars were \ufb01rst proposed to model images [1] and were later adapted to model\nevents [7]. Here we provide a uni\ufb01ed de\ufb01nition of stochastic And-Or grammars that is agnostic to\nthe type of the data being modeled. We restrict ourselves to the context-free subclass of stochastic\nAnd-Or grammars, which can be seen as an extension of stochastic context-free grammars in formal\nlanguage theory [8] as well as an extension of decomposable sum-product networks [10]. 
A\nstochastic context-free And-Or grammar is de\ufb01ned as a 5-tuple \u27e8\u03a3, N, S, R, P\u27e9. \u03a3 is a set of terminal\nnodes representing atomic patterns that are not decomposable; N is a set of nonterminal nodes\nrepresenting decomposable patterns, which is divided into two disjoint sets: And-nodes N^AND and\nOr-nodes N^OR; S \u2208 N is a start symbol that represents a complete entity; R is a set of grammar\nrules, each of which represents the generation from a nonterminal node to a set of nonterminal or\nterminal nodes; P is the set of probabilities assigned to the grammar rules. The set of grammar rules\nR is divided into two disjoint sets: And-rules and Or-rules.\n\n\u2022 An And-rule represents the decomposition of a pattern into a con\ufb01guration of non-overlapping\nsub-patterns. It takes the form of A \u2192 a1a2 . . . an, where A \u2208 N^AND is a\nnonterminal And-node and a1a2 . . . an is a set of terminal or nonterminal nodes representing\nthe sub-patterns. A set of relations are speci\ufb01ed between the sub-patterns and between\nthe nonterminal node A and the sub-patterns, which con\ufb01gure how these sub-patterns form\nthe composite pattern represented by A. The probability of an And-rule is speci\ufb01ed by the\nenergy terms de\ufb01ned on the relations. Note that one can specify different types of relations\nin different And-rules, which allows multiple types of compositions to be modeled in the\nsame grammar.\n\u2022 An Or-rule represents an alternative con\ufb01guration of a composite pattern. It takes the form\nof O \u2192 a, where O \u2208 N^OR is a nonterminal Or-node, and a is either a terminal or a\nnonterminal node representing a possible con\ufb01guration. The set of Or-rules with the same\nleft-hand side can be written as O \u2192 a1|a2| . . .|an. 
The probability of an Or-rule speci\ufb01es\nhow likely the alternative con\ufb01guration represented by the Or-rule is selected.\n\nA stochastic And-Or grammar de\ufb01nes generative processes of valid entities, i.e., starting from an\nentity containing only the start symbol S and recursively applying the grammar rules in R to convert\nnonterminal nodes until the entity contains only terminal nodes (atomic patterns). Table 1 gives a\nfew examples of stochastic context-free And-Or grammars that model different types of data.\n\nTable 1: Examples of stochastic And-Or grammars\n\nGrammar | Terminal node | Nonterminal node | Relations in And-rules\nNatural language grammar | Word | Phrase | Deterministic \u201cconcatenating\u201d relations\nEvent And-Or grammar [7] | Atomic action (e.g., standing, drinking) | Event or sub-event | Temporal relations (e.g., those proposed in [19])\nImage And-Or grammar [1] | Visual word (e.g., Gabor bases) | Image patch | Spatial relations (e.g., those specifying relative positions, rotations and scales)\n\nFigure 1: An illustration of the learning process. (a) The initial grammar. (b) Iteration 1: learning a\ngrammar fragment rooted at N1. (c) Iteration 2: learning a grammar fragment rooted at N2.\n\n3 Unsupervised Structure Learning\n\n3.1 Problem De\ufb01nition\n\nIn unsupervised learning of stochastic And-Or grammars, we aim to learn a grammar from a set\nof unannotated i.i.d. data samples (e.g., natural language sentences, quantized images, action\nsequences). The objective function is the posterior probability of the grammar given the training data:\n\nP(G|X) \u221d P(G) P(X|G) = (1/Z) e^{\u2212\u03b1\u2016G\u2016} \u220f_{xi\u2208X} P(xi|G)\n\nwhere G is the grammar, X = {xi} is the set of training samples, Z is the normalization factor\nof the prior, \u03b1 is a constant, and \u2016G\u2016 is the size of the grammar. 
By adopting a sparsity prior that\npenalizes the size of the grammar, we hope to learn a compact grammar with good generalizability.\nIn order to ease the learning process, during learning we approximate the likelihood P(xi|G) with\nthe Viterbi likelihood (the probability of the best parse of the data sample xi). Viterbi likelihood has\nbeen empirically shown to lead to better grammar learning results [20, 10] and can be interpreted as\ncombining the standard likelihood with an unambiguity bias [21].\n\n3.2 Algorithm Framework\n\nWe \ufb01rst de\ufb01ne an initial grammar that generates the exact set of training samples. Speci\ufb01cally, for\neach training sample xi \u2208 X, there is an Or-rule S \u2192 Ai in the initial grammar where S is the start\nsymbol and Ai is an And-node, and the probability of the rule is 1/\u2016X\u2016 where \u2016X\u2016 is the number of\ntraining samples; for each xi there is also an And-rule Ai \u2192 ai1ai2 . . . ain where aij (j = 1 . . . n)\nare the terminal nodes representing the set of atomic patterns contained in sample xi, and a set of\nrelations are speci\ufb01ed between these terminal nodes such that they compose sample xi. Figure 1(a)\nshows an example initial grammar. 
This initial grammar leads to the maximal likelihood on the\ntraining data but has a very small prior probability because of its large size.\nStarting from the initial grammar, we introduce new intermediate nonterminal nodes between the\nterminal nodes and the top-level nonterminal nodes in an iterative bottom-up fashion to generalize\nthe grammar and increase its posterior probability. At each iteration, we add a grammar fragment\ninto the grammar that is rooted at a new nonterminal node and contains a set of grammar rules that\nspecify how the new nonterminal node generates one or more con\ufb01gurations of existing terminal\nor nonterminal nodes; we also try to reduce each training sample using the new grammar rules and\nupdate the top-level And-rules accordingly. Figure 1 illustrates this learning process. 
There are\ntypically multiple candidate grammar fragments that can be added at each iteration, and we employ\ngreedy search or beam search to explore the search space and maximize the posterior probability of\nthe grammar. We also restrict the types of grammar fragments that can be added in order to reduce\nthe number of candidate grammar fragments, which will be discussed in the next subsection. The\nalgorithm terminates when no more grammar fragment can be found that increases the posterior\nprobability of the grammar.\n\n3.3 And-Or Fragments\n\nIn each iteration of our learning algorithm framework, we search for a new grammar fragment and\nadd it into the grammar. There are many different types of grammar fragments, the choice of which\ngreatly in\ufb02uences the ef\ufb01ciency and accuracy of the learning algorithm. Two simplest types of\ngrammar fragments are And-fragments and Or-fragments. An And-fragment contains a new And-\nnode A and an And-rule A \u2192 a1a2 . . . an specifying the generation from the And-node A to a\ncon\ufb01guration of existing nodes a1a2 . . . an. An Or-fragment contains a new Or-node O and a set\nof Or-rules O \u2192 a1|a2| . . .|an each specifying the generation from the Or-node O to an existing\nnode ai. While these two types of fragments are simple and intuitive, they both have important\ndisadvantages if they are searched for separately in the learning algorithm. For And-fragments, when\nthe training data is scarce, many compositions modeled by the target grammar would be missing\nfrom the training data and hence cannot be learned by searching for And-fragments alone; besides,\nif the search for And-fragments is not properly coupled with the search for Or-fragments, the learned\ngrammar would become large and redundant. 
For Or-fragments, it can be shown that in most cases\nadding an Or-fragment into the grammar decreases the posterior probability of the grammar even\nif the target grammar does contain the Or-fragment, so in order to learn Or-rules we need more\nexpensive search techniques than greedy or beam search employed in our algorithm; in addition, the\nsearch for Or-fragments can be error-prone if different Or-rules can generate the same node in the\ntarget grammar.\nInstead of And-fragments and Or-fragments, we propose to search for And-Or fragments in the\nlearning algorithm. An And-Or fragment contains a new And-node A, a set of new Or-nodes\nO1, O2, . . . , On, an And-rule A \u2192 O1O2 . . . On, and a set of Or-rules Oi \u2192 ai1|ai2| . . .|aimi\nfor each Or-node Oi (where ai1, ai2, . . . , aimi are existing nodes of the grammar). Such an And-Or\nfragment can generate \u220f_{i=1}^{n} mi con\ufb01gurations of existing nodes. Figure 2(a) shows an\nexample And-Or fragment. It can be shown that by adding only And-Or fragments, our algorithm is\nstill capable of constructing any context-free And-Or grammar. 
Using And-Or fragments can avoid\nor alleviate the problems associated with And-fragments and Or-fragments: since an And-Or fragment\nsystematically covers multiple compositions, the data scarcity problem of And-fragments is\nalleviated; since And-rules and Or-rules are learned in a more uni\ufb01ed manner, the resulting grammar\nis often more compact; reasonable And-Or fragments usually increase the posterior probability\nof the grammar, therefore easing the search procedure; \ufb01nally, ambiguous Or-rules can be better\ndistinguished since they are learned jointly with their sibling Or-nodes in the And-Or fragments.\nTo perform greedy search or beam search, in each iteration of our learning algorithm we need to\n\ufb01nd the And-Or fragments that lead to the highest gain in the posterior probability of the grammar.\nComputing the posterior gain by re-parsing the training samples can be very time-consuming if the\ntraining set or the grammar is large. Fortunately, we show that by assuming grammar unambiguity\nthe posterior gain of adding an And-Or fragment can be formulated based on a set of suf\ufb01cient statistics\nof the training data and is ef\ufb01cient to compute. Since the posterior probability is proportional to\nthe product of the likelihood and the prior probability, the posterior gain is equal to the product of\nthe likelihood gain and the prior gain, which we formulate separately below.\n\nFigure 2: (a) An example And-Or fragment. (b) The n-gram tensor of the And-Or fragment based\non the training data (here n = 3). (c) The context matrix of the And-Or fragment based on the\ntraining data.\n\nLikelihood Gain. Recall that in our learning algorithm when an And-Or fragment is added\ninto the grammar, we try to reduce the training samples using the new grammar rules and update the\ntop-level And-rules accordingly. Denote the set of reductions being made on the training samples\nby RD. 
Suppose in reduction rd \u2208 RD, we replace a con\ufb01guration e of nodes a1j1a2j2 . . . anjn\nwith the new And-node A, where aiji (i = 1 . . . n) is an existing terminal or nonterminal node that\ncan be generated by the new Or-node Oi in the And-Or fragment. With reduction rd, the Viterbi\nlikelihood of the training sample x where rd occurs is changed by two factors. First, since the\ngrammar now generates the And-node A \ufb01rst, which then generates a1j1 a2j2 . . . anjn, the Viterbi\nlikelihood of sample x is reduced by a factor of P (A \u2192 a1j1 a2j2 . . . anjn ). Second, the reduction\nmay make sample x identical to some other training samples, which increases the Viterbi likelihood\nof sample x by a factor equal to the ratio of the numbers of such identical samples after and before\nthe reduction. To facilitate the computation of this factor, we can construct a context matrix CM\nwhere each row is a con\ufb01guration of existing nodes covered by the And-Or fragment, each column\nis a context which is the surrounding patterns of a con\ufb01guration, and each element is the number of\ntimes that the corresponding con\ufb01guration and context co-occur in the training set. See Figure 2(c)\nfor the context matrix of the example And-Or fragment. 
Putting these two types of changes to the\nlikelihood together, we can formulate the likelihood gain of adding the And-Or fragment as follows\n(see the supplementary material for the full derivation).\n\nP(X|G_{t+1}) / P(X|G_t) = [\u220f_{i=1}^{n} \u220f_{j=1}^{m_i} \u2016RD_i(a_ij)\u2016^{\u2016RD_i(a_ij)\u2016}] / \u2016RD\u2016^{n\u2016RD\u2016} \u00d7 [\u220f_c (\u2211_e CM[e,c])^{\u2211_e CM[e,c]}] / [\u220f_{e,c} CM[e,c]^{CM[e,c]}]\n\nwhere G_t and G_{t+1} are the grammars before and after learning from the And-Or fragment, RD_i(a_ij)\ndenotes the subset of reductions in RD in which the i-th node of the con\ufb01guration being reduced\nis a_ij, e in the summation or product ranges over all the con\ufb01gurations covered by the And-Or\nfragment, and c in the product ranges over all the contexts that appear in CM.\nIt can be shown that the likelihood gain can be factorized as the product of two tensor/matrix coherence\nmeasures as de\ufb01ned in [22]. The \ufb01rst is the coherence of the n-gram tensor of the And-Or\nfragment (which tabulates the number of times each con\ufb01guration covered by the And-Or fragment\nappears in the training samples, as illustrated in Figure 2(b)). The second is the coherence of the\ncontext matrix. These two factors provide a surrogate measure of how much the training data support\nthe context-freeness within the And-Or fragment and the context-freeness of the And-Or fragment\nagainst its context respectively. See the supplementary material for the derivation and discussion.\nThe formulation of likelihood gain also entails the optimal probabilities of the Or-rules in the And-Or\nfragment.\n\n\u2200i, j: P(O_i \u2192 a_ij) = \u2016RD_i(a_ij)\u2016 / \u2211_{j'=1}^{m_i} \u2016RD_i(a_ij')\u2016 = \u2016RD_i(a_ij)\u2016 / \u2016RD\u2016\n\nPrior Gain. The prior probability of the grammar is determined by the grammar size. 
When the\nAnd-Or fragment is added into the grammar, the size of the grammar is changed in two aspects:\n\ufb01rst, the size of the grammar is increased by the size of the And-Or fragment; second, the size of the\ngrammar is decreased because of the reductions from con\ufb01gurations of multiple nodes to the new\nAnd-node. Therefore, the prior gain of learning from the And-Or fragment is:\n\nP(G_{t+1}) / P(G_t) = e^{\u2212\u03b1(\u2016G_{t+1}\u2016 \u2212 \u2016G_t\u2016)} = e^{\u2212\u03b1((n s_a + \u2211_{i=1}^{n} m_i s_o) \u2212 \u2016RD\u2016(n\u22121) s_a)}\n\nwhere s_a and s_o are the number of bits needed to encode each node on the right-hand side of an\nAnd-rule and Or-rule respectively.\n\nFigure 3: An illustration of the procedure of \ufb01nding the best And-Or fragment. r1, r2, r3 denote\ndifferent relations between patterns. (a) Collecting statistics from the training samples to construct\nor update the n-gram tensors. (b) Finding one or more sub-tensors that lead to the highest posterior\ngain and constructing the corresponding And-Or fragments.\n\nFigure 4: An example video and the action annotations from the human activity dataset [23]. Each\ncolored bar denotes the start/end time of an occurrence of an action.\n\n
It can be seen that the prior gain penalizes And-Or fragments\nthat have a large size but only cover a small number of con\ufb01gurations in the training data.\nIn order to \ufb01nd the And-Or fragments with the highest posterior gain, we could construct n-gram\ntensors from all the training samples for different values of n and different And-rule relations, and\nwithin these n-gram tensors we search for sub-tensors that correspond to And-Or fragments with\nthe highest posterior gain. Figure 3 illustrates this procedure. In practice, we \ufb01nd it suf\ufb01cient to\nuse greedy search or beam search with random restarts in identifying good And-Or fragments. See\nthe supplementary material for the pseudocode of the complete algorithm of grammar learning.\nThe algorithm runs reasonably fast: our prototype implementation can \ufb01nish running within a few\nminutes on a desktop with 5000 training samples each containing more than 10 atomic patterns.\n\n4 Experiments\n\n4.1 Learning Event Grammars\n\nWe applied our approach to learn event grammars from human activity data. The \ufb01rst dataset con-\ntains 61 videos of indoor activities, e.g., using a computer and making a phone call [23]. The atomic\nactions and their start/end time are annotated in each video, as shown in Figure 4. Based on this\ndataset, we also synthesized a more complicated second dataset by dividing each of the two most\nfrequent actions, sitting and standing, into three subtypes and assigning each occurrence of the two\nactions randomly to one of the subtypes. This simulates the scenarios in which the actions are de-\ntected in an unsupervised way and therefore actions of the same type may be regarded as different\nbecause of the difference in the posture or viewpoint.\nWe employed three different methods to apply our grammar learning approach on these two datasets.\nThe \ufb01rst method is similar to that proposed in [18]. 
For each frame of a video in the dataset, we\nconstruct a binary vector that indicates which of the atomic actions are observed in this frame. In this\nway, each video is represented by a sequence of vectors. Consecutive vectors that are identical are\nmerged. We then map each distinct vector to a unique ID and thus convert each video into a sequence\nof IDs. Our learning approach is applied on the ID sequences, where each terminal node represents\nan ID and each And-node speci\ufb01es the temporal \u201cfollowed-by\u201d relation between its child nodes. In\nthe second and third methods, instead of the ID sequences, our learning approach is directly applied\nto the vector sequences. Each terminal node now represents an occurrence of an atomic action. In\naddition to the \u201cfollowed-by\u201d relation, an And-node may also specify the \u201cco-occurring\u201d relation\nbetween its child nodes. In this way, the resulting And-Or grammar is directly grounded to the\nobserved atomic actions and is therefore more \ufb02exible and expressive than the grammar learned\nfrom IDs as in the \ufb01rst method. Figure 5 shows such a grammar.\n\nTable 2: The experimental results (F-measure) on the event datasets. For our approach, f, c+f and cf\ndenote the \ufb01rst, second and third methods respectively.\n\nMethod | Data 1 | Data 2\nADIOS [15] | 0.810 | 0.204\nSPYZ [18] | 0.756 | 0.582\nOurs (f) | 0.831 | 0.702\nOurs (c+f) | 0.768 | 0.624\nOurs (cf) | 0.767 | 0.813\n\nFigure 5: An example event And-Or grammar with two types of relations that grounds to atomic actions\n\n
The difference between the second\nand the third method is: in the second method we require the And-nodes with the \u201cco-occurring\u201d\nrelation to be learned before any And-node with the \u201cfollowed-by\u201d relation is learned, which is\nequivalent to applying the \ufb01rst method based on a set of IDs that are also learned; on the other hand,\nthe third method does not restrict the order of learning of the two types of And-nodes.\nNote that in our learning algorithm we assume that each training sample consists of a single pattern\ngenerated from the target grammar, but here each video may contain multiple unrelated events. We\nslightly modi\ufb01ed our algorithm to accommodate this issue: right before the algorithm terminates, we\nchange the top-level And-nodes in the grammar to Or-nodes, which removes any temporal relation\nbetween the learned events in each training sample and renders them independent of each other.\nWhen parsing a new sample using the learned grammar, we employ the CYK algorithm to ef\ufb01ciently\nidentify all the subsequences that can be parsed as an event by the grammar.\nWe used 55 samples of each dataset as the training set and evaluated the learned grammars on the\nremaining 6 samples. On each testing sample, the events identi\ufb01ed by the learned grammars were\ncompared against manual annotations. We measured the purity (the percentage of the identi\ufb01ed\nevent durations overlapping with the annotated event durations) and inverse purity (the percentage\nof the annotated event durations overlapping with the identi\ufb01ed event durations), and report the F-\nmeasure (the harmonic mean of purity and inverse purity). We compared our approach with two\nprevious approaches [15, 18], both of which can only learn from ID sequences.\nTable 2 shows the experimental results. 
It can be seen that our approach is competitive with the\nprevious approaches on the \ufb01rst dataset and outperforms the previous approaches on the more complicated\nsecond dataset. Among the three methods of applying our approach, the second method has\nthe worst performance, mostly because the restriction of learning the \u201cco-occurring\u201d relation \ufb01rst\noften leads to premature equating of different vectors. The third method leads to the best overall\nperformance, which implies the advantage of grounding the grammar to atomic actions and simultaneously\nlearning different relations. Note that the third method has better performance on the more\ncomplicated second dataset, and our analysis suggests that the division of sitting/standing into subtypes\nin the second dataset actually helps the third method to avoid learning erroneous compositions\nof continuous sitting or standing.\n\n4.2 Learning Image Grammars\n\nWe \ufb01rst tested our approach in learning image grammars from a synthetic dataset of animal face\nsketches [24]. Figure 6 shows some example images from the dataset. We constructed 15 training\nsets of 5 different sizes and ran our approach three times on each training set. 
We set the terminal\nnodes to represent the atomic sketches in the images and set the relations in And-rules to represent\nrelative positions between image patches. The hyperparameter \u03b1 of our approach is \ufb01xed to 0.5.\nWe evaluated the learned grammars against the true grammar. We estimated the precision and recall\nof the sets of images generated from the learned grammars versus the true grammar, from which\nwe computed the F-measure. We also estimated the KL-divergence of the probability distributions\nde\ufb01ned by the learned grammars from that of the true grammar. We compared our approach with\nthe image grammar learning approach proposed in [17]. Figure 7 shows the experimental results. It\ncan be seen that our approach signi\ufb01cantly outperforms the competing approach.\n\nFigure 6: Example images from the synthetic dataset\n\nFigure 7: The experimental results on the synthetic image dataset\n\nFigure 8: Example images and atomic patterns of the real dataset [17]\n\nTable 3: The average perplexity on the testing sets from the real image experiments (lower is better)\n\nMethod | Perplexity\nOurs | 67.5\nSZ [17] | 129.4\n\nWe then ran our approach on a real dataset of animal faces that was used in [17]. The dataset contains\n320 images of four categories of animals: bear, cat, cow and wolf. We followed the method described\nin [17] to quantize the images and learn the atomic patterns, which become the terminal nodes of the\ngrammar. Figure 8 shows some images from the dataset, the quantization examples and the atomic\npatterns learned. We again used the relative positions between image patches as the type of relations\nin And-rules. 
Since the true grammar is unknown, we evaluated the learned grammars by measuring\ntheir perplexity (the reciprocal of the geometric mean probability of a sample from a testing set).\nWe ran 10-fold cross-validation on the dataset: learning an image grammar from each training set\nand then evaluating its perplexity on the testing set. Before estimating the perplexity, the probability\ndistribution represented by each learned grammar was smoothed to avoid zero probability on the\ntesting images. Table 3 shows the results of our approach and the approach from [17]. Once again\nour approach significantly outperforms the competing approach.\n\n5 Conclusion\n\nWe have presented a unified formalization of stochastic And-Or grammars that is agnostic to the type\nof the data being modeled, and have proposed an unsupervised approach to learning the structures\nas well as the parameters of such grammars. Our approach optimizes the posterior probability of the\ngrammar and induces compositions and reconfigurations in a unified manner. Our experiments in\nlearning event grammars and image grammars show satisfactory performance of our approach.\n\nAcknowledgments\n\nThe work is supported by grants from DARPA MSEE project FA 8650-11-1-7149, ONR MURI\nN00014-10-1-0933, NSF CNS 1028381, and NSF IIS 1018751.\n\n[Figure 7 plots: F-measure vs. number of training samples; KL-divergence vs. number of training samples. Curves: Ours, SZ [17]]\n\n[Figure 8 panels: example images; example quantized images; atomic patterns (terminal nodes)]\n\nReferences\n\n[1] S.-C. Zhu and D. Mumford, \u201cA stochastic grammar of images,\u201d Found. Trends. Comput. Graph. Vis., vol. 2, no. 4, pp. 259\u2013362, 2006.\n[2] Y. Jin and S. Geman, \u201cContext and hierarchy in a probabilistic image model,\u201d in CVPR, 2006.\n[3] Y. Zhao and S. C. Zhu, \u201cImage parsing with stochastic scene grammar,\u201d in NIPS, 2011.\n[4] Y. A. 
Ivanov and A. F. Bobick, \u201cRecognition of visual activities and interactions by stochastic parsing,\u201d IEEE Trans. Pattern Anal. Mach. Intell., vol. 22, no. 8, pp. 852\u2013872, 2000.\n[5] M. S. Ryoo and J. K. Aggarwal, \u201cRecognition of composite human activities through context-free grammar based representation,\u201d in CVPR, 2006.\n[6] Z. Zhang, T. Tan, and K. Huang, \u201cAn extended grammar system for learning and recognizing complex visual events,\u201d IEEE Trans. Pattern Anal. Mach. Intell., vol. 33, no. 2, pp. 240\u2013255, Feb. 2011.\n[7] M. Pei, Y. Jia, and S.-C. Zhu, \u201cParsing video events with goal inference and intent prediction,\u201d in ICCV, 2011.\n[8] C. D. Manning and H. Sch\u00fctze, Foundations of statistical natural language processing. Cambridge, MA, USA: MIT Press, 1999.\n[9] P. Liang, M. I. Jordan, and D. Klein, \u201cProbabilistic grammars and hierarchical Dirichlet processes,\u201d The handbook of applied Bayesian analysis, 2009.\n[10] H. Poon and P. Domingos, \u201cSum-product networks: A new deep architecture,\u201d in Proceedings of the Twenty-Seventh Conference on Uncertainty in Artificial Intelligence (UAI), 2011.\n[11] J. K. Baker, \u201cTrainable grammars for speech recognition,\u201d in Speech Communication Papers for the 97th Meeting of the Acoustical Society of America, 1979.\n[12] D. Klein and C. D. Manning, \u201cCorpus-based induction of syntactic structure: Models of dependency and constituency,\u201d in Proceedings of ACL, 2004.\n[13] S. Wang, Y. Wang, and S.-C. Zhu, \u201cHierarchical space tiling for scene modeling,\u201d in Computer Vision\u2013ACCV 2012. Springer, 2013, pp. 796\u2013810.\n[14] A. Stolcke and S. M. Omohundro, \u201cInducing probabilistic grammars by Bayesian model merging,\u201d in ICGI, 1994, pp. 106\u2013118.\n[15] Z. Solan, D. Horn, E. Ruppin, and S. 
Edelman, \u201cUnsupervised learning of natural languages,\u201d Proc. Natl. Acad. Sci., vol. 102, no. 33, pp. 11629\u201311634, August 2005.\n[16] K. Tu and V. Honavar, \u201cUnsupervised learning of probabilistic context-free grammar using iterative biclustering,\u201d in Proceedings of 9th International Colloquium on Grammatical Inference (ICGI 2008), ser. LNCS 5278, 2008.\n[17] Z. Si and S. Zhu, \u201cLearning and-or templates for object modeling and recognition,\u201d IEEE Trans. Pattern Anal. Mach. Intell., 2013.\n[18] Z. Si, M. Pei, B. Yao, and S.-C. Zhu, \u201cUnsupervised learning of event and-or grammar and semantics from video,\u201d in ICCV, 2011.\n[19] J. F. Allen, \u201cTowards a general theory of action and time,\u201d Artificial Intelligence, vol. 23, no. 2, pp. 123\u2013154, 1984.\n[20] V. I. Spitkovsky, H. Alshawi, D. Jurafsky, and C. D. Manning, \u201cViterbi training improves unsupervised dependency parsing,\u201d in Proceedings of the Fourteenth Conference on Computational Natural Language Learning, ser. CoNLL \u201910, 2010.\n[21] K. Tu and V. Honavar, \u201cUnambiguity regularization for unsupervised learning of probabilistic grammars,\u201d in Proceedings of the 2012 Conference on Empirical Methods in Natural Language Processing and Natural Language Learning (EMNLP-CoNLL 2012), 2012.\n[22] S. C. Madeira and A. L. Oliveira, \u201cBiclustering algorithms for biological data analysis: A survey,\u201d IEEE/ACM Trans. on Comp. Biol. and Bioinformatics, vol. 1, no. 1, pp. 24\u201345, 2004.\n[23] P. Wei, N. Zheng, Y. Zhao, and S.-C. Zhu, \u201cConcurrent action detection with structural prediction,\u201d in Proc. Intl Conference on Computer Vision (ICCV), 2013.\n[24] A. Barbu, M. Pavlovskaia, and S. 
Zhu, \u201cRates for inductive learning of compositional models,\u201d in AAAI Workshop on Learning Rich Representations from Low-Level Sensors (RepLearning), 2013.\n", "award": [], "sourceid": 683, "authors": [{"given_name": "Kewei", "family_name": "Tu", "institution": "UCLA"}, {"given_name": "Maria", "family_name": "Pavlovskaia", "institution": "UCLA"}, {"given_name": "Song-Chun", "family_name": "Zhu", "institution": "UCLA"}]}