{"title": "Learning annotated hierarchies from relational data", "book": "Advances in Neural Information Processing Systems", "page_first": 1185, "page_last": 1192, "abstract": null, "full_text": "Learning annotated hierarchies from relational data\n\nDaniel M. Roy, Charles Kemp, Vikash K. Mansinghka, and Joshua B. Tenenbaum\n\nCSAIL, Dept. of Brain & Cognitive Sciences, MIT, Cambridge, MA 02139\n\n{droy, ckemp, vkm, jbt}@mit.edu\n\nAbstract\n\nThe objects in many real-world domains can be organized into hierarchies, where\neach internal node picks out a category of objects. Given a collection of fea-\ntures and relations de\ufb01ned over a set of objects, an annotated hierarchy includes a\nspeci\ufb01cation of the categories that are most useful for describing each individual\nfeature and relation. We de\ufb01ne a generative model for annotated hierarchies and\nthe features and relations that they describe, and develop a Markov chain Monte\nCarlo scheme for learning annotated hierarchies. We show that our model discov-\ners interpretable structure in several real-world data sets.\n\n1 Introduction\n\nResearchers in AI and cognitive science [1, 7] have proposed that hierarchies are useful for rep-\nresenting and reasoning about the objects in many real-world domains. One of the reasons that\nhierarchies are valuable is that they compactly specify categories at many levels of resolution, each\nnode representing the category of objects at the leaves below the node. Consider, for example, the\nsimple hierarchy shown in Figure 1a, which picks out \ufb01ve categories relevant to a typical university\ndepartment: employees, staff, faculty, professors, and assistant professors.\n\nSuppose that we are given a large data set describing the features of these employees and the in-\nteractions among these employees. 
Each of the five categories will account for some aspects of the data, but different categories will be needed for understanding different features and relations. "Faculty," for example, is the single most useful category for describing the employees that publish papers (Figure 1b), but three categories may be needed to describe the social interactions among the employees (Figure 1c). In order to understand the structure of the department, it is important not only to understand the hierarchical organization of the employees, but also to understand which levels in the hierarchy are appropriate for describing each feature and each relation. An annotated hierarchy, then, is a hierarchy along with a specification of the categories in the hierarchy that are relevant to each feature and relation.

The idea of an annotated hierarchy is one of the oldest proposals in cognitive science, and researchers including Collins and Quillian [1] and Keil [7] have argued that semantic knowledge is organized into representations with this form. Previous treatments of annotated hierarchies, however, usually suffer from two limitations. First, annotated hierarchies are usually hand-engineered, and there are few proposals describing how they might be learned from data. Second, annotated hierarchies typically capture knowledge only about the features of objects: relations between objects are rarely considered. We address both problems by defining a generative model for objects, features, relations, and hierarchies, and showing how it can be used to recover an annotated hierarchy from raw data.

Our generative model for feature data assumes that the objects are located at the leaves of a rooted tree, and that each feature is generated from a partition of the objects "consistent" with the hierarchy. A tree-consistent partition (henceforth, t-c partition) of the objects is a partition of the objects into disjoint categories, i.e.
each class in the partition is exactly the set of leaves descending from some node in the tree. Therefore, a t-c partition can be uniquely encoded as the set of nodes whose leaf descendants comprise the classes (Figure 1a,b). The simplest t-c partition is the singleton set containing the root node, which places all objects into a single class. The most complex t-c partition is the set of all leaves, which assigns each object to its own class. We assume that the features of objects in different classes are independent, but that objects in the same class tend to have similar features.

[Figure 1 appears here: panels (a)-(c); features shown: Direct-Deposit, Has Tenure, Publishes; relations shown: friends with, works with, orders around.]

Figure 1: (a) A hierarchy over 15 members of a university department: 5 staff members, 5 professors and 5 assistant professors. (b) Three binary features, each of which is associated with a different t-c partition of the objects. Each class in each partition is labeled with the corresponding node in the tree. (c) Three binary relations, each of which is associated with a different t-c partition of the set of object pairs. Each class in each partition is labeled with the corresponding pair of nodes.
Therefore, finding the categories in the tree most relevant to a feature can be formalized as finding the simplest t-c partition that best accounts for the distribution of the feature (Figure 1b). We define an annotated hierarchy as a hierarchy together with a t-c partition for each feature.

Although most discussions of annotated hierarchies focus on features, much of the data available to human learners comes in the form of relations. Understanding the structure of social groups, for instance, involves inferences about relations like admires(·, ·), friend-of(·, ·) and brother-of(·, ·). Like the feature case, our generative model for relational data assumes that each (binary) relation is generated from a t-c partition of the set of all pairs of objects. Each class in a t-c partition now corresponds to a pair of categories (i.e. a pair of nodes) (Figure 1c), and we assume that all pairs in a given class tend to take similar values. As in the feature case, finding the categories in the tree most relevant to a relation can be formalized as finding the t-c partition that best accounts for the distribution of the relation. The t-c partition for each relation can be viewed as an additional annotation of the tree. The final piece of our generative model is a prior over rooted trees representing hierarchies. Roughly speaking, the best hierarchy will then be the one that provides the best categories with which to summarize all the features and relations.

Like other methods for discovering structure in data, our approach may be useful both as a tool for data analysis and as a model of human learning. After describing our approach, we apply it to several data sets inspired by problems faced by human learners. Our first analysis suggests that the model recovers coherent domains given objects and features from several domains (animals, foods, tools and vehicles).
Next we show that the model discovers interpretable structure in kinship data, and in data representing relationships between ontological kinds.

2 A generative model for features and relations

Our approach is organized around a generative model for feature data and relational data. For simplicity, we present our model for feature and relational data separately, focusing on the case where we have a single binary feature or a single binary relation. After presenting our generative model, we describe how it can be used to recover annotated hierarchies from data.

We begin with the case of a single binary feature and define a joint distribution over three entities: a rooted, weighted, binary tree T with O objects at the leaves; a t-c partition of the objects; and feature observations, d. For a feature, a t-c partition π is a set of nodes {n_1, n_2, ..., n_k} such that each object is a descendant of exactly one node in π. We will identify each node with the category of objects descending from it. We denote the data for all objects in the category n as d_n. If o is a leaf (a single-object category), then d_o is the value of the feature for object o. In Figure 1b, three t-c partitions associated with the hierarchy are represented, and each class in each partition is labeled with the corresponding category.

The joint distribution P(T, w, π, d | λ, γ_f) is induced by the following generative process:

i. Sample a tree T from a uniform distribution over rooted binary trees with O leaves (each leaf will represent an object and there are O objects). Each node n represents a category.

ii. For each category n, sample its weight, w_n, according to an exponential distribution with parameter λ, i.e. p(w_n | λ) = λ e^{-λ w_n}.

iii. Sample a t-c partition π_f = {n_1, n_2, ...
, n_k} ∼ Π(root-of(T)), where Π(n) is a stochastic, set-valued function:

    Π(n) = {n}           if n is a leaf, or with probability φ(w_n);
    Π(n) = ∪_i Π(n_i)    otherwise,                                    (1)

where φ(x) = 1 − e^{−x} and the n_i are the children of n. Intuitively, categories with large weight are more likely to be classes in the partition. For the publishes feature in Figure 1b, the t-c partition is {F, S}.

iv. For each category n ∈ π_f, sample θ_n ∼ Beta(γ_f, γ_f), where θ_n is the probability that objects in category n exhibit the feature f. Returning to the publishes example in Figure 1b, two parameters, θ_F and θ_S, would be drawn for this feature.

v. For each object o, sample its feature value d_o ∼ Bernoulli(θ_n), where n ∈ π_f is the category containing o.

Consider now the case where we have a single binary relation defined over all ordered pairs of objects {(o_i, o_j)}. In the relational case, our joint distribution is defined over a rooted, weighted, binary tree; a t-c partition of ordered pairs of objects; and observed relational data represented as a matrix D where D_{i,j} = 1 if the relation holds between o_i and o_j.

Given a pair of categories (n_i, m_j), let n_i × m_j be the set of all pairs of objects (o_i, o_j) such that o_i is an object in the category n_i and o_j is an object in the category m_j. With respect to pairs of trees, a t-c partition π is a set of pairs of categories {(n_1, m_1), (n_2, m_2), ..., (n_k, m_k)} such that, for every pair of objects (o_i, o_j), there exists exactly one pair (n_k, m_k) ∈ π such that (o_i, o_j) ∈ n_k × m_k. To help visualize these 2D t-c partitions, we can reorder the columns and rows of the matrix D according to an in-order traversal of the binary tree T.
Each t-c partition now splits the matrix into contiguous, rectangular blocks (see Figure 1c, where each rectangular block is labeled with its category pair). Assuming we have already generated a rooted, weighted binary tree, we now specify the generative process for a single binary relation (cf. steps iii through v in the feature case):

iii. Sample a t-c partition π_r = {(n_1, m_1), ..., (n_k, m_k)} ∼ Π(root-of(T), root-of(T)), where Π(n, m) is a stochastic, set-valued function:

    Π(n, m) = {(n, m)}         with probability φ(w_n) · φ(w_m);
    Π(n, m) = ∪_i Π(n_i, m)    otherwise, with probability 1/2;
    Π(n, m) = ∪_j Π(n, m_j)    otherwise, with probability 1/2,        (2)

where the n_i/m_j are the children of n/m. To handle special cases: if both n and m are leaves, Π(n, m) = {(n, m)}; if only one of the nodes is a leaf, we default to the feature case on the remaining tree, halting with probability φ(w_n) · φ(w_m). Intuitively, if a pair of categories (n, m) both have large weight, the process is more likely to group all pairs of objects in n × m into a single class. In Figure 1c, the t-c partition for the works with relation is {(S, S), (S, F), (F, S), (F, F)}.

iv. For each pair of categories (n, m) ∈ π_r, sample θ_{n,m} ∼ Beta(γ_r, γ_r), where θ_{n,m} is the probability that the relation holds between any pair of objects in n × m. For the works with relation in Figure 1c, parameters would be drawn for each of the four classes in the t-c partition.

v. For each pair of objects (o_i, o_j), sample the relation D_{i,j} ∼ Bernoulli(θ_{n,m}), where (n, m) ∈ π_r and (o_i, o_j) ∈ n × m.
That is, the probability that the relation holds is the same for all pairs in a given class.

For data sets with multiple relations and features, we assume that all relations and features are conditionally independent given the weighted tree T.

2.1 Inference

Given observations of features and relations, we can use the generative model to ask various questions about the latent hierarchy and its annotations. We start by determining the posterior distribution on the weighted tree topologies, (T, w), given data D = ({d^{(f)}}_{f=1..F}, {D^{(r)}}_{r=1..R}) over O objects, F features and R relations, and hyperparameters λ and γ = ({γ_f}_{f=1..F}, {γ_r}_{r=1..R}). By Bayes' rule,

    P(T, w | D, λ, γ) ∝ P(T) P(w | T, λ) P(D | T, w, γ)
                      ∝ (1) · (∏_n λ e^{−λ w_n}) · (∏_{f=1..F} P(d^{(f)} | T, w, γ_f)) · (∏_{r=1..R} P(D^{(r)} | T, w, γ_r)).

But P(d^{(f)} | T, w, γ_f) = Σ_π P(π | T, w) P(d^{(f)} | π, γ_f), where P(π | T, w) is the distribution over t-c partitions induced by the stochastic function Π, and P(d^{(f)} | π, γ_f) is the likelihood given the partition, marginalizing over the feature probabilities, θ_n. Because the classes are independent, P(d^{(f)} | π, γ_f) = ∏_{n ∈ π} M_f(n), where M_f(n) = P(d_n^{(f)} | n ∈ π, γ_f) is the marginal likelihood for d_n^{(f)}, the features for objects in category n. For our binary-valued data sets, M_f(n) is the standard marginal likelihood for the beta-binomial model. Because there are an exponential number of t-c partitions, we present an efficient dynamic program for calculating T_f(n) = P(d_n^{(f)} | T, w, γ_f); then T_f(root-of(T)) = P(d^{(f)} | T, w, γ_f) is the desired quantity.

First observe that, for all objects (i.e. leaf nodes) o, T_f(o) = M_f(o). Let n be a node and assume no ancestor of n is in π. With probability φ(w_n) = 1 − e^{−w_n}, category n will be a single class and the contribution to T_f will be M_f(n). Otherwise, Π(n) splits category n into its children, n_1 and n_2. Now the possible partitions of the objects in category n are every t-c partition of the objects below n_1 paired with every t-c partition below n_2. By independence, this contributes T_f(n_1) T_f(n_2). Hence,

    T_f(n) = φ(w_n) M_f(n) + (1 − φ(w_n)) T_f(n_1) T_f(n_2)    if n is an internal node;
    T_f(n) = M_f(n)                                            otherwise.

For the relational case, we describe a dynamic program T_r(n, m) that calculates P(D_{n,m}^{(r)} | T, w, γ_r), the probability of all relations between objects in n × m, conditioned on the tree, having marginalized out the t-c partitions and relation probabilities. Let M_r(n, m) = P(D_{n,m}^{(r)} | (n, m) ∈ π, γ_r) be the marginal likelihood of the relations in n × m. For relations, M_r(n, m) is also the beta-binomial marginal likelihood. If n and m are both leaves, then T_r(n, m) = M_r(n, m). Otherwise,

    T_r(n, m) = φ(w_n) φ(w_m) M_r(n, m) + (1 − φ(w_n) φ(w_m)) · S(n, m), where
    S(n, m) = T_r(n, m_1) T_r(n, m_2)                                       if n is a leaf;
    S(n, m) = T_r(n_1, m) T_r(n_2, m)                                       if m is a leaf;
    S(n, m) = (1/2) (T_r(n, m_1) T_r(n, m_2) + T_r(n_1, m) T_r(n_2, m))     otherwise.

The above dynamic programs have linear and quadratic complexity in the number of objects, respectively. Because we can efficiently compute the posterior density of a weighted tree, we can search for the maximum a posteriori (MAP) weighted tree. Conditioned on the MAP tree, we can efficiently compute the MAP t-c partition for each feature and relation. We find the MAP tree first, rather than jointly optimizing over both the topology and the partitions, because marginalizing over the t-c partitions produces more robust trees; marginalization has a (Bayesian) "Occam's razor" effect and helps avoid overfitting.
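The feature-case recursion T_f(n) is straightforward to implement in log space. The following is a minimal sketch under stated assumptions, not the authors' code: it assumes a nested-tuple tree representation ((weight, left, right) for internal nodes, integer leaf indices for objects) and γ_f = 0.5 as in the paper.

```python
import math

GAMMA = 0.5  # Beta(0.5, 0.5) hyperparameter, as used in the paper

def log_beta(a, b):
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)

def leaves(node):
    # A node is a leaf index (int) or a tuple (weight, left, right).
    if isinstance(node, int):
        return [node]
    _, left, right = node
    return leaves(left) + leaves(right)

def log_marginal(node, data):
    # M_f(n): beta-binomial marginal likelihood of the binary feature
    # values at the leaves below n, with theta_n integrated out.
    values = [data[o] for o in leaves(node)]
    k, total = sum(values), len(values)
    return log_beta(GAMMA + k, GAMMA + total - k) - log_beta(GAMMA, GAMMA)

def log_add(a, b):
    # Numerically stable log(exp(a) + exp(b)).
    m = max(a, b)
    return m + math.log(math.exp(a - m) + math.exp(b - m))

def log_T(node, data):
    # T_f(n) = phi(w_n) M_f(n) + (1 - phi(w_n)) T_f(n_1) T_f(n_2),
    # with T_f(o) = M_f(o) at the leaves. Assumes all weights are > 0
    # (as they are almost surely under the exponential prior).
    if isinstance(node, int):
        return log_marginal(node, data)
    w, left, right = node
    phi = 1.0 - math.exp(-w)
    halt = math.log(phi) + log_marginal(node, data)
    split = math.log(1.0 - phi) + log_T(left, data) + log_T(right, data)
    return log_add(halt, split)
```

For a balanced four-leaf tree such as (0.7, (1.5, 0, 1), (1.5, 2, 3)), log_T implicitly sums over all five t-c partitions of that tree while touching each node only once.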
MAP t-c partitions can be computed by a straightforward modification of the above dynamic programs, replacing sums with max operations and maintaining a list of nodes representing the MAP t-c partition at each node in the tree.

We chose to implement global search by building a Markov chain Monte Carlo (MCMC) algorithm with the posterior as the stationary distribution and keeping track of the best tree as the chain mixes. For all the results in this paper, we fixed the hyperparameters of all beta distributions to γ = 0.5 (i.e. the asymptotically least informative prior) and report the (empirical) MAP tree and the MAP t-c partitions conditioned on that tree. The MCMC algorithm searches for the MAP tree by cycling through three Metropolis-Hastings (MH) moves adapted from [14]:

i. Subtree pruning and regrafting: Choose a node n uniformly at random (except the root). Choose a non-descendant node m. Detach n from its parent and collapse the parent (remove the node, attaching the remaining child to the parent's parent and adding the parent's weight to the child's). Sample u ∼ Uniform(0, 1), insert a new node m′ between m and its parent, attach n to m′, and set w_{m′} := (1 − u) w_m and w_m := u w_m.

ii. Edge weight adjustment: Choose a node n uniformly at random (including the root) and propose a new weight w_n (e.g. sample x ∼ Normal(log(w_n), 1) and let the new weight be e^x).

iii. Subtree swapping: Choose a node n uniformly at random (except the root). Choose another node n′ such that neither n nor n′ is a descendant of the other, and swap n and n′.

The first two moves suffice to make the chain ergodic; subtree swapping is included to improve mixing. The first and last moves are symmetric.
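Conditioned on a weighted tree (for example, the MAP tree found by this search), t-c partitions can also be sampled forward from the stochastic function Π of equation (1). A minimal illustrative sketch, not the authors' implementation: the nested-tuple tree representation ((weight, left, right) for internal nodes, string labels for leaves) and function names are assumptions.

```python
import math
import random

def sample_partition(node, rng=random):
    # Forward sampler for Pi(n) in Eq. (1): with probability
    # phi(w_n) = 1 - exp(-w_n), category n halts and forms a single
    # class; otherwise we recurse into its children. Leaves always halt.
    if isinstance(node, str):          # leaf: singleton category
        return [node]
    w, left, right = node
    if rng.random() < 1.0 - math.exp(-w):
        return [node]                  # all leaves below n form one class
    return sample_partition(left, rng) + sample_partition(right, rng)

def leaves(node):
    # The set of objects (leaf labels) below a node.
    if isinstance(node, str):
        return [node]
    _, left, right = node
    return leaves(left) + leaves(right)
```

A weight of 0 forces a split (φ(0) = 0) and a very large weight forces a halt (φ(w) → 1), so large-weight categories tend to appear as classes in the sampled partition, matching the intuition given after equation (1). By construction, every object lands in exactly one class.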
We initialized the chain on a random tree with weights set to one, ran the chain for approximately one million iterations, and assessed convergence by comparing separate chains started from multiple random initial states.

2.2 Related Work

There are several methods that discover hierarchical structure in feature data. Hierarchical clustering [4] has been successfully used for analyzing both biological data [18] and psychological data, but cannot learn the annotated hierarchies that we consider. Bayesian hierarchical clustering (BHC) [6] is a recent alternative which constructs a tree as a byproduct of approximate inference in a flat clustering model, but lacks any notion of annotations. It is possible that a BHC-inspired algorithm could be derived to find approximate MAP annotated hierarchies. Our model for feature data is most closely related to methods for Bayesian phylogenetics [14]. These methods typically assume that features are generated directly by a stochastic process over a tree. Our model adds an intervening layer of abstraction by assuming that partitions are generated by a stochastic process over a tree, and that features are generated from these partitions. By introducing a partition for each feature, we gain the ability to annotate a hierarchy with the levels most relevant to each feature.

There are several methods for discovering hierarchical structure in relational data [5, 13], but none of these methods provides a general-purpose solution to the problem we consider. Most of these methods take a single relation as input, and assume that the hierarchy captures an underlying community structure: in other words, objects that are often paired in the input are assumed to lie nearby in the tree. Our approach handles multiple relations simultaneously, and allows a more flexible mapping between each relation and the underlying hierarchy.
Different relations may depend on very different regions of the hierarchy, and some relations may establish connections between categories that are quite distant in the tree (see Figure 4).

Many non-hierarchical methods for relational clustering have also been developed [10, 16, 17]. One family of approaches is based on the stochastic blockmodel [15], of which the Infinite Relational Model (IRM) [9] is perhaps the most flexible. The IRM handles multiple relations simultaneously, and does not assume that each relation has underlying community structure. The IRM, however, does not discover hierarchical structure; instead it partitions the objects into a set of non-overlapping categories. Our relational model is an extension of the blockmodel that discovers a nested set of categories as well as which categories are useful for understanding each relation in the data set.

3 Results

We applied our model to three problems inspired by tasks that human learners are required to solve. Our first application used data collected in a feature-listing task by Cree and McRae [2]. Participants in this task listed the features that came to mind when they thought about a given object: when asked to think about a lemon, for example, subjects listed features like "yellow," "sour," and "grows on trees."1 We analyzed a subset of the full data set including 60 common objects and the 100 features most commonly listed for these objects. The 60 objects are shown in Figure 2, and were chosen to represent four domains: animals, food, vehicles and tools.

Figure 2 shows the MAP tree identified by our algorithm. The model discovers the four domains as well as superordinate categories (e.g. "living things", including fruits, vegetables, and animals) and subordinate categories (e.g. "wheeled vehicles"). Figure 2 also shows MAP partitions for 10 representative features. The model discovers that some features are associated only with certain parts of the tree: "is juicy" is associated with the fruits, and "is metal" is associated with the man-made items. Discovering domains is a fundamental cognitive problem that may be solved early in development [11], but that is ignored by many cognitive models, which consider only carefully chosen data from a single domain (e.g. data including only animals and only biological features). By organizing the 60 objects into domains and identifying a subset of features that are associated with each domain, our model begins to suggest how infants may parse their environment into coherent domains of objects and features.

1 Note that some of the features are noisy: according to these data, onions are not edible, since none of the participants chose to list this feature for onion.

[Figure 2 appears here: the MAP tree over the 60 objects, grouped into Food, Tools, Vehicles and Animals, with MAP partitions for the features "a tool," "an animal," "made of metal," "is edible," "has a handle," "is juicy," "is fast," "eaten in salads," "is white," "has 4 wheels," and "has 4 legs."]

Figure 2: MAP tree recovered from a data set including 60 objects from four domains. MAP partitions for several features are shown: the model discovers, for example, that "is juicy" is associated with only one part of the tree. The weight of each edge in the tree is proportional to its vertical extent.

[Figure 3 appears here: the MAP tree over 30 biomedical entities, including clusters of Diseases and Chemicals, with MAP partitions for the relations analyzes, affects, process of, and causes, plus the partition for causes found by the IRM.]

Figure 3: MAP tree recovered from 49 relations between entities in a biomedical data set. Four relations are shown (rows and columns permuted to match an in-order traversal of the MAP tree). Consider the circled subset of the t-c partition for causes. This block captures the knowledge that "chemicals" cause "diseases." The Infinite Relational Model (IRM) does not capture the appropriate structure in the relation causes because it does not model the latent hierarchy, instead choosing a single partition to describe the structure across all relations.

Our second application explores the acquisition of ontological knowledge, a problem that has been previously discussed by Keil [7]. We demonstrate that our model discovers a simple biomedical ontology given data from the Unified Medical Language System (UMLS) [12]. The full data set includes 135 entities and 49 binary relations, where the entities are ontological categories like 'Sign or Symptom', 'Cell', and 'Disease or Syndrome,' and the relations include verbs like causes, analyzes
We applied our model to a subset of the data including the 30 entities shown in Figure 3.\n\n\f1\nM\nO\n\n1\nM\nO\n\n1\nM\nO\n\n1\nF\nO\n\n1\nF\nO\n\n1\nF\nO\n\n1\nF\nO\n\n1\nF\nY\n\n1\nF\nY\n\n1\nF\n1Y\nF\nY\n\n1\nM\nO\n\n1\nM\nY\n\n1\nM\nY\n\n1\nM\nY\n\n3\nF\nO\n\n3\nF\nY\n\n3\nF\nO\n\n3\nF\nY\n\n3\nF\nO\n\n3\nF\nO\n\n3\nM\nO\n\n3\nM\nO\n\n3\nM\nY\n\n3\nM\nY\n\n3\nF\nY\n\n1\nM\nY\n\n3\nF\nY\n\n3\nM\nO\n\n3\nM\nO\n\n3\nM\nY\n\n3\nM\nY\n\n4\nF\nO\n\n4\nF\nO\n\n4\nF\nO\n\n4\nF\nY\n\n4\nF\nY\n\n4\nF\nO\n\n4\nF\nY\n\n4\nF\nY\n\n4\nM\nO\n\n4\nM\nO\n\n2\nM\nO\n\n2\nM\nO\n\n2\nM\nY\n\n2\nM\nY\n\n2\nF\n2O\nF\nO\n\n2\nF\nO\n\n2\nF\nY\n\n2\nF\nO\n\n2\nF\nY\n\n2\nF\nY\n\n2\nF\nY\n\n4\nM\nY\n\n4\nM\nO\n\n4\nM\nO\n\n2\nM\nY\n\n2\nM\nY\n\n2\nM\nO\n\n2\nM\nO\n\n4\nM\nY\n\n4\nM\nY\n\n4\nM\nY\n\nSection 1\n\nSection 3\n\nSection 4\n\nSection 2\n\nAgngiya\n\nAdiaya\n\nUmbaidya\n\nAnowadya\n\nAnowadya (IRM)\n\nFigure 4: MAP tree recovered from kinship relations between 64 members of the Alyawarra tribe. Individuals\nhave been labelled with their age, gender and kinship section (e.g. \u201cYF1\u201d is a young female from section 1).\nMAP partitions are shown for four representative relations: the model discovers that different relations depend\non the tree in very different ways; hierarchical structure allows for a compact representation (c.f. IRM).\n\nThe MAP tree is an ontology that captures several natural groupings, including a category for \u201cliving\nthings\u201d (plant, bird, animal and mammal), a category for \u201cchemical substances\u201d (amino acid, lipid,\nantibiotic, enzyme etc.) and a category for abnormalities. 
The MAP partitions for each relation identify the relevant categories in the tree relatively cleanly: the model discovers, for example, that the distinction between "living things" and "abnormalities" is irrelevant to the first place of the relation causes, since neither of these categories can cause anything (according to the data set). This distinction, however, is relevant to the second place of causes: substances can cause abnormalities and dysfunctions, but cannot cause "living things". Note that the MAP partitions for causes and analyzes are rather different: one of the reasons why discovering separate t-c partitions for each relation is important is that different relations can depend on very different parts of an ontology.

Our third application is inspired by the problem children face when learning the kinship structure of their social group. This problem is especially acute for children growing up in Australian tribes, which have kinship systems that are in many ways more complicated than Western kinship systems, but which nevertheless display some striking regularities. We focus here on data from the Alyawarra tribe [3]. Denham [3] collected a large data set by asking 104 tribe members to provide kinship terms for each other. Twenty-six different terms were mentioned in total, and four of them are represented in Figure 4. More than one kinship term may describe the relationship between a pair of individuals; since the data set includes only one term per pair, some of the zeros in each matrix represent missing data rather than relationships that do not hold. For simplicity, however, we assume that relationships that were never mentioned do not exist.

The Alyawarra tribe is divided into four kinship sections, and these sections are fundamental to the social structure of the tribe. Each individual, for instance, is permitted only to marry individuals from one of the other sections.
Whether a kinship term applies between a pair of individuals depends on their sections, ages and genders [3, 8]. We analyzed a subset of the full data set including 64 individuals chosen to equally represent all four sections, both genders, and people young and old. The MAP tree divides the individuals perfectly according to kinship section, and discovers additional structure within each section. Group three, for example, is split by age and then by gender. The MAP partitions for each relation indicate that different relations depend very differently on the structure of the tree. Adiadya refers to a younger member of one's own kinship section. The MAP partition for this relation contains fine-level structure only along the diagonal, indicating that the model has discovered that the term only applies between individuals from the same kinship section. Umbaidya can be used only between members of sections 1 and 3, and members of sections 2 and 4. Again the MAP partition indicates that the model has discovered this structure. In some places the MAP partitions appear to overfit the data: the partition for Umbaidya, for example, appears to capture some of the noise in this relation. This result may reflect the fact that our generative process is not quite right for these data: in particular, it does not capture the idea that some of the zeros in each relation represent missing data.

4 Conclusions

We developed a probabilistic model that assumes that features and relations are generated over an annotated hierarchy, and showed how this model can be used to recover annotated hierarchies from raw data. Three applications of the model suggested that it is able to recover interpretable structure in real-world data, and may help to explain the computational principles that allow human learners to acquire hierarchical representations of real-world domains.

Our approach opens up several avenues for future work.
A hierarchy specifies a set of categories, and annotations indicate which of these categories are important for understanding specific features and relations. A natural extension is to learn sets of categories that possess other kinds of structure, such as factorial structure [17]. For example, the kinship data we analyzed may be well described by three sets of overlapping categories, where each individual belongs to a kinship section, a gender, and an age group. We have already extended the model to handle continuous data, and we can imagine other extensions, including higher-order relations, multiple trees, and relations between distinct sets of objects (e.g., given information about the book-buying habits of a set of customers, this extension of our model could discover hierarchical representations of both the customers and the books, along with the categories of books that tend to be preferred by different kinds of customers). We are also actively exploring variants of our model that permit accurate online approximations for inference; for example, by placing an exchangeable prior over tree structures based on a Polya-urn scheme, we can derive an efficient particle filter.

We have shown that formalizing the intuition behind annotated hierarchies in terms of a prior on trees and partitions and a noise-robust likelihood enables us to discover interesting structure in real-world data. We expect that a fruitful area of future research will involve similar marriages between intuitions about structured representation from classical AI and cognitive science and modern inferential machinery from Bayesian statistics and machine learning.

References

[1] A. M. Collins and M. R. Quillian. Retrieval Time from Semantic Memory. JVLVB, 8:240–248, 1969.
[2] G. Cree and K. McRae.
Analyzing the factors underlying the structure and computation of the meaning of chipmunk, chisel, cheese, and cello (and many other concrete nouns). JEP Gen., 132:163–201, 2003.
[3] W. Denham. The detection of patterns in Alyawarra nonverbal behaviour. PhD thesis, U. of Wash., 1973.
[4] R. O. Duda and P. E. Hart. Pattern Classification and Scene Analysis. Wiley, 2001.
[5] M. Girvan and M. E. J. Newman. Community structure in social and biological networks. Proceedings of the National Academy of Sciences, 99(12):7821–7826, 2002.
[6] K. Heller and Z. Ghahramani. Bayesian Hierarchical Clustering. In ICML, 2005.
[7] F. C. Keil. Semantic and Conceptual Development. Harvard University Press, Cambridge, MA, 1979.
[8] C. Kemp, T. L. Griffiths, and J. B. Tenenbaum. Discovering Latent Classes in Relational Data. Technical Report AI Memo 2004-019, MIT, 2004.
[9] C. Kemp, J. B. Tenenbaum, T. L. Griffiths, T. Yamada, and N. Ueda. Learning systems of concepts with an infinite relational model. In AAAI, 2006.
[10] J. Kubica, A. Moore, J. Schneider, and Y. Yang. Stochastic link and group detection. In NCAI, 2002.
[11] J. M. Mandler and L. McDonough. Concept formation in infancy. Cog. Devel., 8:291–318, 1993.
[12] A. T. McCray. An upper level ontology for the biomedical domain. Comp. Func. Genom., 4:80–84, 2001.
[13] J. Neville, M. Adler, and D. Jensen. Clustering relational data using attribute and link information. In Proc. of the Text Mining and Link Analysis Workshop, IJCAI, 2003.
[14] D. L. Swofford, G. J. Olsen, P. J. Waddell, and D. M. Hillis. Phylogenetic inference. Molecular Systematics, 2nd edition, 1996.
[15] Y. J. Wang and G. Y. Wong. Stochastic blockmodels for directed graphs. JASA, 82:8–19, 1987.
[16] S. Wasserman and K. Faust. Social network analysis: Methods and applications. Cambridge Press, 1994.
[17] A. P. Wolfe and D. Jensen.
Playing multiple roles: discovering overlapping roles in social networks. In Proc. of the Workshop on statistical relational learning and its connections to other fields, ICML, 2004.
[18] K. Y. Yeung, M. Medvedovic, and R. E. Bumgarner. Clustering gene-expression data with repeated measurements. Genome Biology, 2003.