{"title": "Learning concept graphs from text with stick-breaking priors", "book": "Advances in Neural Information Processing Systems", "page_first": 334, "page_last": 342, "abstract": "We present a generative probabilistic model for learning general graph structures, which we term concept graphs, from text. Concept graphs provide a visual summary of the thematic content of a collection of documents-a task that is difficult to accomplish using only keyword search. The proposed model can learn different types of concept graph structures and is capable of utilizing partial prior knowledge about graph structure as well as labeled documents. We describe a generative model that is based on a stick-breaking process for graphs, and a Markov Chain Monte Carlo inference procedure. Experiments on simulated data show that the model can recover known graph structure when learning in both unsupervised and semi-supervised modes. We also show that the proposed model is competitive in terms of empirical log likelihood with existing structure-based topic models (such as hPAM and hLDA) on real-world text data sets. Finally, we illustrate the application of the model to the problem of updating Wikipedia category graphs.", "full_text": "Learning Concept Graphs from Text with\n\nStick-Breaking Priors\n\nAmerica L. Chambers\n\nDepartment of Computer Science\nUniversity of California, Irvine\n\nIrvine, CA 92697\n\nahollowa@ics.uci.edu\n\nPadhraic Smyth\n\nDepartment of Computer Science\nUniversity of California, Irvine\n\nIrvine, CA 92607\n\nsmyth@ics.uci.edu\n\nMark Steyvers\n\nmark.steyvers@uci.edu\n\nDepartment of Cognitive Science\nUniversity of California, Irvine\n\nIrvine, CA 92697\n\nAbstract\n\nWe present a generative probabilistic model for learning general graph structures,\nwhich we term concept graphs, from text. 
Concept graphs provide a visual sum-\nmary of the thematic content of a collection of documents\u2014a task that is dif\ufb01cult\nto accomplish using only keyword search. The proposed model can learn different\ntypes of concept graph structures and is capable of utilizing partial prior knowl-\nedge about graph structure as well as labeled documents. We describe a generative\nmodel that is based on a stick-breaking process for graphs, and a Markov Chain\nMonte Carlo inference procedure. Experiments on simulated data show that the\nmodel can recover known graph structure when learning in both unsupervised and\nsemi-supervised modes. We also show that the proposed model is competitive\nin terms of empirical log likelihood with existing structure-based topic models\n(hPAM and hLDA) on real-world text data sets. Finally, we illustrate the applica-\ntion of the model to the problem of updating Wikipedia category graphs.\n\n1\n\nIntroduction\n\nWe present a generative probabilistic model for learning concept graphs from text. We de\ufb01ne a\nconcept graph as a rooted, directed graph where the nodes represent thematic units (called concepts)\nand the edges represent relationships between concepts. Concept graphs are useful for summarizing\ndocument collections and providing a visualization of the thematic content and structure of large\ndocument sets - a task that is dif\ufb01cult to accomplish using only keyword search. An example of\na concept graph is Wikipedia\u2019s category graph1. Figure 1 shows a small portion of the Wikipedia\ncategory graph rooted at the category MACHINE LEARNING2. 
From the graph we can quickly in-\nfer that the collection of machine learning articles in Wikipedia focuses primarily on evolutionary\nalgorithms and Markov models with less emphasis on other aspects of machine learning such as\nBayesian networks and kernel methods.\nThe problem we address in this paper is that of learning a concept graph given a collection of\ndocuments where (optionally) we may have concept labels for the documents and an initial graph\nstructure. In the latter scenario, the task is to identify additional concepts in the corpus that are\n\n1http://en.wikipedia.org/wiki/Category:Main topic classi\ufb01cations\n2As of May 5, 2009\n\n1\n\n\fFigure 1: A portion of the Wikipedia category supergraph for the node MACHINE LEARNING\n\nFigure 2: A portion of the Wikipedia category subgraph rooted at the node MACHINE LEARNING\n\nnot re\ufb02ected in the graph or additional relationships between concepts in the corpus (via the co-\noccurrence of concepts in documents) that are not re\ufb02ected in the graph. This is particularly suited\nfor document collections like Wikipedia where the set of articles is changing at such a fast rate\nthat an automatic method for updating the concept graph may be preferable to manual editing or\nre-learning the hierarchy from scratch. The foundation of our approach is latent Dirichlet allocation\n(LDA) [1]. LDA is a probabilistic model for automatically identifying topics within a document\ncollection where a topic is a probability distribution over words. The standard LDA model does\nnot include any notion of relationships, or dependence, between topics. In contrast, methods such\nas the hierarchical topic model (hLDA) [2] learn a set of topics in the form of a tree structure. The\nrestriction to tree structures however is not well suited for large document collections like Wikipedia.\nFigure 1 gives an example of the highly non-tree like nature of the Wikipedia category graph. 
The hierarchical Pachinko allocation model (hPAM) [3] is able to learn a set of topics arranged in a fixed-size graph, with a nonparametric version introduced in [4]. The model we propose in this paper is a simpler alternative to hPAM and nonparametric hPAM that can achieve the same flexibility (i.e. learning arbitrary directed acyclic graphs over a possibly infinite number of nodes) within a simpler probabilistic framework. In addition, our model provides a formal mechanism for utilizing labeled data and existing concept graph structures. Other methods for creating concept graphs include the use of techniques such as hierarchical clustering, pattern mining and formal concept analysis to construct ontologies from document collections [5, 6, 7]. Our approach differs in that we utilize a probabilistic framework which enables us (for example) to make inferences about concepts and documents. Our primary novel contribution is the introduction of a flexible probabilistic framework for learning general graph structures from text that is capable of utilizing both unlabeled documents as well as labeled documents and prior knowledge in the form of existing graph structures.\nIn the next section we introduce the stick-breaking distribution and show how it can be used as a prior for graph structures. We then introduce our generative model and explain how it can be adapted for the case where we have an initial graph structure. We derive collapsed Gibbs sampling equations for our model and present a series of experiments on simulated and real text data. We compare our performance against hLDA and hPAM as baselines. 
We conclude with a discussion of the merits and limitations of our approach.\n\n2 Stick-breaking Distributions\n\nStick-breaking distributions P(\u00b7) are discrete probability distributions of the form\n\nP(\u00b7) = \u2211_{j=1}^{\u221e} \u03c0j \u03b4xj(\u00b7), where \u2211_{j=1}^{\u221e} \u03c0j = 1 and 0 \u2264 \u03c0j \u2264 1,\n\nand \u03b4xj(\u00b7) is the delta function centered at the atom xj. The xj variables are sampled independently from a base distribution H (where H is assumed to be continuous). The stick-breaking weights \u03c0j have the form\n\n\u03c01 = v1,  \u03c0j = vj \u220f_{k=1}^{j\u22121} (1 \u2212 vk)  for j = 2, 3, . . . , \u221e\n\nwhere the vj are independent Beta(\u03b1j, \u03b2j) random variables. Stick-breaking distributions derive their name from the analogy of repeatedly breaking the remainder of a unit-length stick at a randomly chosen breakpoint. See [8] for more details.\nUnlike the Chinese restaurant process, the stick-breaking process lacks exchangeability. The probability of sampling a particular cluster from P(\u00b7) given the sequences {xj} and {vj} is not equal to the probability of sampling the same cluster given a permutation of the sequences {x\u03c3(j)} and {v\u03c3(j)}. 
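The weight construction just described is easy to check numerically. The sketch below is illustrative Python of our own (not the authors' implementation): it maps stick-break proportions v_j to weights pi_j and shows that permuting the proportions changes the weight at a given position, which is the lack of exchangeability discussed here.

```python
def stick_breaking_weights(v):
    """Map stick-break proportions v_1..v_n to weights:
    pi_1 = v_1 and pi_j = v_j * prod_{k<j} (1 - v_k)."""
    weights, remaining = [], 1.0
    for vj in v:
        weights.append(vj * remaining)
        remaining *= 1.0 - vj  # length of stick left to break
    return weights

# Toy, fixed proportions (hypothetical values, not fitted to anything).
pi = stick_breaking_weights([0.5, 0.25, 0.4])       # approx. [0.5, 0.125, 0.15]
leftover = 1.0 - sum(pi)                            # mass left for later atoms
pi_perm = stick_breaking_weights([0.4, 0.25, 0.5])  # approx. [0.4, 0.15, 0.225]
# The weight at each position changes under the permutation,
# so the process is not exchangeable.
```

The first few weights plus the leftover stick always total one; the leftover is exactly the mass a stick-breaking prior reserves for atoms not yet drawn.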
This can be seen from the form of the weights above: the probability of sampling xj depends upon the values of the j \u2212 1 preceding Beta random variables {v1, v2, . . . , vj\u22121}. If we fix xj and permute every other atom, then the probability of sampling xj changes: it is now determined by the Beta random variables {v\u03c3(1), v\u03c3(2), . . . , v\u03c3(j\u22121)}.\nThe stick-breaking distribution can be utilized as a prior distribution on graph structures. We construct this prior by specifying a distribution at each node (denoted as Pt) that governs the probability of transitioning from node t to another node in the graph. There is some freedom in choosing Pt; however, we have two constraints. First, adding a new transition between existing nodes must have non-zero probability. In Figure 1 it is clear that from MACHINE LEARNING we should be able to transition to any of its children. However, we may discover evidence for passing directly to a leaf node such as STATISTICAL NATURAL LANGUAGE PROCESSING (e.g. if we observe new articles related to statistical natural language processing that do not use Markov models). Second, making a transition to a new node must have non-zero probability. For example, we may observe new articles related to the topic of bioinformatics. In this case, we want to add a new node to the graph (BIOINFORMATICS) and assign some probability of transitioning to it from other nodes.\nWith these two requirements we can now provide a formal definition for Pt. We begin with an initial graph structure G0 with t = 1 . . . T nodes. For each node t we define a feasible set Ft as the collection of nodes to which t can transition. The feasible set may contain the children of node t or possible child nodes of node t (as discussed above). In general, Ft is some subset of the nodes in G0. We add a special node called the \u201cexit node\u201d to Ft. If we sample the exit node then we exit from the graph instead of transitioning forward. 
We define Pt as a stick-breaking distribution over the finite set of nodes Ft where the remaining probability mass is assigned to an infinite set of new nodes (nodes that exist but have not yet been observed). The exact form of Pt is shown below.\n\nPt(\u00b7) = \u2211_{j=1}^{|Ft|} \u03c0tj \u03b4ftj(\u00b7) + \u2211_{j=|Ft|+1}^{\u221e} \u03c0tj \u03b4xtj(\u00b7)\n\nThe first |Ft| atoms of the stick-breaking distribution are the feasible nodes ftj \u2208 Ft. The remaining atoms are unidentifiable nodes that have yet to be observed (denoted as xtj for simplicity).\nThis is not yet a working definition unless we explicitly state which nodes are in the set Ft. Our model does not in general assume any specific form for Ft. Instead, the user is free to define it as they like. In our experiments, we first assign each node to a unique depth and then define Ft as any node at the next lower depth. The choice of Ft determines the type of graph structures that can be learned. For the choice of Ft used in this paper, edges that traverse multiple depths are not allowed and edges between nodes at the same depth are not allowed. This prevents cycles from forming and allows inference to be performed in a timely manner. More generally, one could extend the definition of Ft to include any node at a lower depth.\n\n1. For node t \u2208 {1, . . . , \u221e}\n   i. Sample stick-break weights {vtj} | \u03b1, \u03b2 \u223c Beta(\u03b1, \u03b2)\n   ii. Sample word distribution \u03c6t | \u03b7 \u223c Dirichlet(\u03b7)\n2. For document d \u2208 {1, 2, . . . , D}\n   i. Sample a distribution over levels \u03c4d | a, b \u223c Beta(a, b)\n   ii. Sample path pd \u223c {Pt}_{t=1}^{\u221e}\n   iii. For word i \u2208 {1, 2, . . . , Nd}\n      Sample level ld,i \u223c TruncatedDiscrete(\u03c4d)\n      Generate word xd,i | {pd, ld,i, \u03a6} \u223c Multinomial(\u03c6pd[ld,i])\n\nFigure 3: Generative process for GraphLDA\n\nDue to a lack of exchangeability, we must specify the stick-breaking order of the elements in Ft. Note that regardless of this order, the elements of Ft always occur before the infinite set of new nodes in the stick-breaking permutation. We use a Metropolis-Hastings sampler proposed by [10] to learn the permutation of feasible nodes with the highest likelihood given the data.\n\n3 Generative Process\n\nFigure 3 shows the generative process for our proposed model, which we refer to as GraphLDA. We observe a collection of documents d = 1 . . . D where document d has Nd words. As discussed earlier, each node t is associated with a stick-breaking prior Pt. In addition, we associate with each node a multinomial distribution \u03c6t over words in the fashion of topic models.\nA two-stage process is used to generate document d. First, a path through the graph is sampled from the stick-breaking distributions. We denote this path as pd. The (i + 1)st node in the path is sampled from Ppd,i(\u00b7), which is the stick-breaking distribution at the ith node in the path. This process continues until an exit node is sampled. Then for each word xi a level in the path, ldi, is sampled from a truncated discrete distribution. The word xi is generated by the topic at level ldi of the path pd, which we denote as pd[ldi]. In the case where we observe labeled documents and an initial graph structure, the path for document d is restricted to end at the concept label of document d.\nOne possible option for the length distribution is a multinomial distribution over levels. We take a different approach and instead use a parametric smooth form. 
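As a concrete illustration of Figure 3's two-stage process, here is a small self-contained sketch. All graph structures, vocabularies, and names below are hypothetical toy inputs of our own, not the authors' code, and the level distribution is the truncated geometric the paper settles on.

```python
import random

rng = random.Random(7)

def sample_path(transition, root, exit_node="EXIT"):
    """Stage 1: walk from the root, drawing each next node from the
    current node's transition distribution, until the exit node is drawn."""
    path, node = [root], root
    while True:
        nodes, probs = zip(*transition[node].items())
        nxt = rng.choices(nodes, probs)[0]
        if nxt == exit_node:
            return path
        path.append(nxt)
        node = nxt

def sample_level(tau, max_level):
    """Geometric over levels 0..max_level, truncated and renormalized by
    rng.choices (level 0 corresponds to the last node on the path)."""
    probs = [(1 - tau) ** l * tau for l in range(max_level + 1)]
    return rng.choices(range(max_level + 1), probs)[0]

def generate_document(transition, topics, root, tau, n_words):
    """Stage 2: for each word, pick a level on the path, then draw the
    word from that node's topic (a multinomial over the vocabulary)."""
    path = sample_path(transition, root)
    words = []
    for _ in range(n_words):
        level = sample_level(tau, len(path) - 1)
        node = path[len(path) - 1 - level]  # level 0 -> last node on path
        vocab, probs = zip(*topics[node].items())
        words.append(rng.choices(vocab, probs)[0])
    return path, words

# Toy two-level graph (hypothetical): root -> {A, B}, then exit.
transition = {"root": {"A": 0.6, "B": 0.4},
              "A": {"EXIT": 1.0},
              "B": {"EXIT": 1.0}}
topics = {"root": {"the": 0.5, "of": 0.5},
          "A": {"kernel": 0.7, "margin": 0.3},
          "B": {"markov": 0.6, "chain": 0.4}}
path, words = generate_document(transition, topics, "root", tau=0.7, n_words=20)
```

With labeled documents, one would additionally constrain `sample_path` to stop only at the document's concept label, matching the restriction described above.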
The motivation is to constrain the length distribution to have the same general functional form across documents (in contrast to the relatively unconstrained multinomial), but to allow the parameters of the distribution to be document-specific. We considered two simple options: Geometric and Poisson (both truncated to the number of possible levels). In initial experiments the Geometric performed better than the Poisson, so the Geometric was used in all experiments reported in this paper. If word xdi has level ldi = 0 then the word is generated by the topic at the last node on the path, and successive levels correspond to earlier nodes in the path. In the case of labeled documents, this matches our belief that a majority of words in the document should be assigned to the concept label itself.\n\n4 Inference\n\nWe marginalize over the topic distributions \u03c6t and the stick-breaking weights {vtj}. We use a collapsed Gibbs sampler [9] to infer the path assignment pd for each document, the level distribution parameter \u03c4d for each document, and the level assignment ldi for each word. Of the five hyperparameters in the model, inference is sensitive to the values of \u03b2 and \u03b7, so we place an Exponential prior on both and use a Metropolis-Hastings sampler to learn the best setting.\n\n4.1 Sampling Paths\n\nFor each document, we must sample a path pd conditioned on all other paths p\u2212d, the level variables, and the word tokens. We only consider paths whose length is greater than or equal to the maximum level of the words in the document.\n\np(pd | x, l, p\u2212d, \u03c4) \u221d p(xd | x\u2212d, l, p) \u00b7 p(pd | p\u2212d)   (1)\n\nThe first term in Equation 1 is the probability of all words in the document given the path pd. 
We compute this probability by marginalizing over the topic distributions \u03c6t:\n\np(xd | x\u2212d, l, p) = \u220f_{l=1}^{\u03bbd} [ ( \u220f_{v=1}^{V} \u0393(\u03b7 + N_{pd[l],v}) / \u0393(\u03b7 + N^{\u2212d}_{pd[l],v}) ) \u00d7 \u0393(V\u03b7 + \u2211_v N^{\u2212d}_{pd[l],v}) / \u0393(V\u03b7 + \u2211_v N_{pd[l],v}) ]\n\nWe use \u03bbd to denote the length of path pd. The notation N_{pd[l],v} stands for the number of times word type v has been assigned to node pd[l]. The superscript \u2212d means we first decrement the count N_{pd[l],v} for every word in document d.\nThe second term is the conditional probability of the path pd given all other paths p\u2212d. We present the sampling equation under the assumption that there is a maximum number of nodes M allowed at each level. We first consider the probability of sampling a single edge in the path from a node x to one of its feasible nodes {y1, y2, . . . , yM} where the node y1 has the first position in the stick-breaking permutation, y2 has the second position, y3 the third and so on.\nWe denote the number of paths that have gone from x to yi as N(x,yi). We denote the number of paths that have gone from x to a node with a strictly higher position in the stick-breaking distribution than yi as N(x,>yi). That is, N(x,>yi) = \u2211_{k=i+1}^{M} N(x,yk). Extending this notation we denote the sum N(x,yi) + N(x,>yi) as N(x,\u2265yi). The probability of selecting node yi is given by:\n\np(x \u2192 yi | p\u2212d) = ( (\u03b1 + N(x,yi)) / (\u03b1 + \u03b2 + N(x,\u2265yi)) ) \u00d7 \u220f_{r=1}^{i\u22121} (\u03b2 + N(x,>yr)) / (\u03b1 + \u03b2 + N(x,\u2265yr))   for i = 1 . . . M\n\nIf ym is the last node with a nonzero count N(x,ym) and m \u226a M, it is convenient to compute the probability of transitioning to yi, for i \u2264 m, and the probability of transitioning to any node higher than ym. The probability of transitioning to a node higher than ym is given by\n\n\u2211_{k=m+1}^{M} p(x \u2192 yk | p\u2212d) = \u2206 [ 1 \u2212 ( \u03b2 / (\u03b1 + \u03b2) )^{M\u2212m} ]\n\nwhere \u2206 = \u220f_{r=1}^{m} (\u03b2 + N(x,>yr)) / (\u03b1 + \u03b2 + N(x,\u2265yr)). A similar derivation can be used to compute the probability of sampling a node higher than ym when M is equal to infinity. Now that we have computed the probability of a single edge, we can compute the probability of an entire path pd:\n\np(pd | p\u2212d) = \u220f_{j=1}^{\u03bbd} p(pd,j \u2192 pd,j+1 | p\u2212d)\n\n4.2 Sampling Levels\n\nFor the ith word in the dth document we must sample a level ldi conditioned on all other levels l\u2212di, the document paths, the level parameters \u03c4, and the word tokens.\n\np(ldi | x, l\u2212di, p, \u03c4) = ( (\u03b7 + N^{\u2212di}_{pd[ldi],xdi}) / (W\u03b7 + N^{\u2212di}_{pd[ldi],\u00b7}) ) \u00d7 ( (1 \u2212 \u03c4d)^{ldi} \u03c4d / (1 \u2212 (1 \u2212 \u03c4d)^{\u03bbd+1}) )\n\nThe first term is the probability of word type xdi given the topic at node pd[ldi]. 
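To make the edge probabilities of Section 4.1 concrete, the following sketch (our own illustrative code, not the authors') computes p(x -> yi) for each observed child directly from the transition counts, together with the leftover mass reserved for all later, as-yet-unseen nodes; by construction these quantities sum to one.

```python
def edge_probabilities(counts, alpha, beta):
    """Posterior probability of each edge x -> y_i under the stick-breaking
    prior, given counts[i] = N(x, y_i) with children listed in stick-breaking
    order. Returns (per-child probabilities, leftover mass for later nodes)."""
    probs, stick = [], 1.0
    for i, n_i in enumerate(counts):
        n_greater = sum(counts[i + 1:])  # N(x, > y_i)
        # Posterior mean of the Beta stick-break proportion at position i.
        v = (alpha + n_i) / (alpha + beta + n_i + n_greater)
        probs.append(stick * v)          # pi_i = v_i * prod_{r<i} (1 - v_r)
        stick *= 1.0 - v
    return probs, stick

# Toy counts for three feasible children (hypothetical values).
probs, rest = edge_probabilities([5, 3, 0], alpha=1.0, beta=1.0)
# probs is approximately [0.6, 0.32, 0.04] and rest approximately 0.04;
# the total is 1 up to floating-point error.
```

The leftover `rest` is what the paper's formula lumps together as the probability of transitioning to any node beyond the last one with a nonzero count.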
The second term is the probability of the level ldi given the level parameter \u03c4d.\n\n4.3 Sampling \u03c4 Variables\n\nFinally, we must sample the level distribution \u03c4d conditioned on the rest of the level parameters \u03c4\u2212d, the level variables, and the word tokens.\n\np(\u03c4d | x, l, p, \u03c4\u2212d) = ( \u220f_{i=1}^{Nd} (1 \u2212 \u03c4d)^{ldi} \u03c4d / (1 \u2212 (1 \u2212 \u03c4d)^{\u03bbd+1}) ) \u00d7 \u03c4d^{a\u22121} (1 \u2212 \u03c4d)^{b\u22121} / B(a, b)   (2)\n\nFigure 4: Learning results with simulated data. (a) Simulated graph; (b) learned graph (0 labeled documents); (c) learned graph (250 labeled documents); (d) learned graph (4000 labeled documents).\n\nDue to the normalization constant (1 \u2212 (1 \u2212 \u03c4d)^{\u03bbd+1}), Equation 2 is not a recognizable probability distribution and we must use rejection sampling. Since the first term in Equation 2 is always less than or equal to 1, the sampling distribution is dominated by a Beta(a, b) distribution. According to the rejection sampling algorithm, we sample a candidate value for \u03c4d from Beta(a, b) and either accept with probability \u220f_{i=1}^{Nd} (1 \u2212 \u03c4d)^{ldi} \u03c4d / (1 \u2212 (1 \u2212 \u03c4d)^{\u03bbd+1}) or reject and sample again.\n\n4.4 Metropolis Hastings for Stick-Breaking Permutations\n\nIn addition to the Gibbs sampling, we employ a Metropolis Hastings sampler presented in [10] to mix over stick-breaking permutations. Consider a node x with feasible nodes {y1, y2, . . . , yM}. We sample two feasible nodes yi and yj from a uniform distribution3. Assume yi comes before yj in the stick-breaking distribution. Then the probability of swapping the positions of nodes yi and yj is given by\n\nmin{ 1, \u220f_{k=0}^{N(x,yi)\u22121} (\u03b1 + \u03b2 + N(x,>yj) + k) / (\u03b1 + \u03b2 + N\u2217(x,>yi) + k) \u00d7 \u220f_{k=0}^{N(x,yj)\u22121} (\u03b1 + \u03b2 + N\u2217(x,>yi) + k) / (\u03b1 + \u03b2 + N(x,>yj) + k) }\n\nwhere N\u2217(x,>yi) = N(x,>yi) \u2212 N(x,yj). See [10] for a full derivation. After every new path assignment, we propose one swap for each node in the graph.\n\n3In [10] feasible nodes are sampled from the prior probability distribution. However for small values of \u03b1 and \u03b2 this results in extremely slow mixing.\n\n5 Experiments and Results\n\nIn this section, we present experiments performed on both simulated and real text data. We compare the performance of GraphLDA against hPAM and hLDA.\n\n5.1 Simulated Text Data\n\nIn this section, we illustrate how the performance of GraphLDA improves as the fraction of labeled data increases. Figure 4(a) shows a simulated concept graph with 10 nodes drawn according to the stick-breaking generative process with parameter values \u03b7 = .025, \u03b1 = 10, \u03b2 = 10, a = 2 and b = 5. The vocabulary size is 1,000 words and we generate 4,000 documents with 250 words each. Each edge in the graph is labeled with the number of paths that traverse it.\nFigures 4(b)-(d) show the learned graph structures as the fraction of labeled data increases from 0 labeled and 4,000 unlabeled documents to all 4,000 documents being labeled. In addition to labeling the edges, we label each node based upon the similarity of the learned topic at the node to the topics of the original graph structure. The Gibbs sampler is initialized to a root node when there is no labeled data. With labeled data, the Gibbs sampler is initialized with the correct placement of nodes to levels. 
The sampler does not observe the edge structure of the graph nor the correct number\nof nodes at each level (i.e. the sampler may add additional nodes). With no labeled data, the sampler\nis unable to recover the relationship between concepts 8 and 10 (due to the relatively small number\nof documents that contain words from both concepts). With 250 labeled documents, the sampler is\nable to learn the correct placement of both nodes 8 and 10 (although the topics contain some noise).\n\n5.2 Wikipedia Articles\n\nIn this section, we compare the performance of GraphLDA to hPAM and hLDA on a set of 518\nmachine-learning articles taken from Wikipedia. The input to each model is only the article text. All\nmodels are restricted to learning a three-level hierarchical structure. For both GraphLDA and hPAM,\nthe number of nodes at each level was set to 25. For GraphLDA, the parameters were \ufb01xed at \u03b1 = 1,\na = 1 and b = 1. The parameters \u03b2 and \u03b7 were initialized to 1 and .001 respectively and optimized\nusing a Metropolis Hastings sampler. We used the MALLET toolkit implementation of hPAM4 and\nhLDA [11]. For hPAM, we used different settings for the topic hyperparameter \u03b7 = (.001, .01, .1).\nFor hLDA we set \u03b7 = .1 and considered \u03b3 = (.1, 1, 10) where \u03b3 is the smoothing parameter for the\nChinese restaurant process and \u03b1 = (.1, 1, 10) where \u03b1 is the smoothing over levels in the graph.\nAll models were run for 9, 000 iterations to ensure burn-in and samples were taken every 100 it-\nerations thereafter, for a total of 10, 000 iterations. The performance of each model was evaluated\non a hold-out set consisting of 20% of the articles using both empirical likelihood and the left-to-\nright evaluation algorithm (see Sections 4.1 and 4.5 of [12]) which are measures of generalization\nto unseen data. For both GraphLDA and hLDA we use the distribution over paths that was learned\nduring training to compute the per-word log likelihood. 
For hPAM we compute the MLE estimate of the Dirichlet hyperparameters for both the distribution over super-topics and the distributions over sub-topics from the training documents. Table 1 shows the per-word log-likelihood for each model averaged over the ten samples. GraphLDA is competitive when computing the empirical log likelihood. We speculate that GraphLDA\u2019s lower performance in terms of left-to-right log-likelihood is due to our choice of the geometric distribution over levels (and our choice to position the geometric distribution at the last node of the path) and that a more flexible approach could result in better performance.\n\nTable 1: Per-word log likelihood of test documents\n\nModel | Parameters | Empirical LL | Left-to-Right LL\nGraphLDA | MH opt. | -7.10 \u00b1 .003 | -7.13 \u00b1 .009\nhPAM | \u03b7 = .1 | -7.36 \u00b1 .013 | -6.11 \u00b1 .007\nhPAM | \u03b7 = .01 | -7.33 \u00b1 .012 | -6.47 \u00b1 .012\nhPAM | \u03b7 = .001 | -7.38 \u00b1 .006 | -6.71 \u00b1 .013\nhLDA | \u03b3 = .1, \u03b1 = .1 | -7.10 \u00b1 .004 | -6.82 \u00b1 .007\nhLDA | \u03b3 = .1, \u03b1 = 1 | -7.09 \u00b1 .003 | -6.86 \u00b1 .006\nhLDA | \u03b3 = .1, \u03b1 = 10 | -7.08 \u00b1 .003 | -6.90 \u00b1 .008\nhLDA | \u03b3 = 1, \u03b1 = .1 | -7.08 \u00b1 .003 | -6.83 \u00b1 .007\nhLDA | \u03b3 = 1, \u03b1 = 1 | -7.08 \u00b1 .002 | -6.86 \u00b1 .006\nhLDA | \u03b3 = 1, \u03b1 = 10 | -7.06 \u00b1 .003 | -6.88 \u00b1 .008\nhLDA | \u03b3 = 10, \u03b1 = .1 | -7.07 \u00b1 .004 | -6.81 \u00b1 .006\nhLDA | \u03b3 = 10, \u03b1 = 1 | -7.07 \u00b1 .003 | -6.83 \u00b1 .005\nhLDA | \u03b3 = 10, \u03b1 = 10 | -7.06 \u00b1 .003 | -6.88 \u00b1 .010\n\nFigure 5: Wikipedia graph structure with additional machine learning abstracts. The edge widths correspond to the probability of the edge in the graph.\n\n5.3 Wikipedia Articles with a Graph Structure\n\nIn our final experiment we illustrate how GraphLDA can be used to update an existing category graph. 
We use the aforementioned 518 machine-learning Wikipedia articles, along with their cat-\negory labels, to learn topic distributions for each node in Figure 1. The sampler is initialized with\nthe correct placement of nodes and each document is initialized to a random path from the root to\nits category label. After 2, 000 iterations, we \ufb01x the path assignments for the Wikipedia articles\nand introduce a new set of documents. We use a collection of 400 machine learning abstracts from\nthe International Conference on Machine Learning (ICML). We sample paths for the new collec-\ntion of documents keeping the paths from the Wikipedia articles \ufb01xed. The sampler was allowed\nto add new nodes to each level to explain any new concepts that occurred in the ICML text set.\nFigure 5 illustrates a portion of the \ufb01nal graph structure. The nodes in bold are the original nodes\nfrom the Wikipedia category graph. The results show that the model is capable of augmenting an\nexisting concept graph with new concepts (e.g. clustering, support vector machines (SVMs), etc.)\nand learning meaningful relationships (e.g. boosting/ensembles are on the same path as the concepts\nfor SVMs and neural networks).\n\n6 Discussion and Conclusion\n\nMotivated by the increasing availability of large-scale structured collections of documents such as\nWikipedia, we have presented a \ufb02exible non-parametric Bayesian framework for learning concept\ngraphs from text. The proposed approach can combine unlabeled data with prior knowledge in the\nform of labeled documents and existing graph structures. Extensions such as allowing the model\nto handle multiple paths per document are likely to be worth pursuing. In this paper we did not\ndiscuss scalability to large graphs which is likely to be an important issue in practice. 
Computing the probability of every path during sampling, where the number of paths is a product over the number of nodes at each level, is a computational bottleneck in the current inference algorithm and will not scale. Approximate inference methods that can address this issue should be quite useful in this context.\n\n7 Acknowledgements\n\nThis material is based upon work supported in part by the National Science Foundation under Award Number IIS-0083489, by a Microsoft Scholarship (AC), and by a Google Faculty Research award (PS). The authors would also like to thank Ian Porteous and Alex Ihler for useful discussions.\n\n4MALLET implements the \u201cexit node\u201d version of hPAM\n\nReferences\n\n[1] David Blei, Andrew Ng, and Michael Jordan. Latent Dirichlet allocation. Journal of Machine Learning Research, 3:993\u20131022, 2003.\n[2] David M. Blei, Thomas L. Griffiths, and Michael I. Jordan. The nested Chinese restaurant process and Bayesian nonparametric inference of topic hierarchies. Journal of the ACM, 57, 2010.\n[3] David Mimno, Wei Li, and Andrew McCallum. Mixtures of hierarchical topics with Pachinko allocation. In Proceedings of the 24th Intl. Conf. on Machine Learning, 2007.\n[4] Wei Li, David Blei, and Andrew McCallum. Nonparametric Bayes Pachinko allocation. In Proceedings of the Twenty-Third Annual Conference on Uncertainty in Artificial Intelligence (UAI-07), pages 243\u2013250, 2007.\n[5] Blaz Fortuna, Marko Grobelnik, and Dunja Mladenic. OntoGen: Semi-automatic ontology editor. In Proceedings of the Human Computer Interaction International Conference, volume 4558, pages 309\u2013318, 2007.\n[6] S. Bloehdorn, P. Cimiano, and A. Hotho. Learning ontologies to improve text clustering and classification. In From Data and Inf. Analysis to Know. Eng.: Proc. of the 29th Annual Conf. of the German Classification Society (GfKl '05), volume 30 of Studies in Classification, Data Analysis and Know. Org., pages 334\u2013341. Springer, Feb. 2005.\n[7] P. Cimiano, A. Hotho, and S. Staab. Learning concept hierarchies from text using formal concept analysis. J. Artificial Intelligence Research (JAIR), 24:305\u2013339, 2005.\n[8] Hemant Ishwaran and Lancelot F. James. Gibbs sampling methods for stick-breaking priors. Journal of the American Statistical Association, 96(453):161\u2013173, March 2001.\n[9] Tom Griffiths and Mark Steyvers. Finding scientific topics. Proceedings of the Natl. Academy of Sciences of the U.S.A., 101 Suppl 1:5228\u20135235, 2004.\n[10] Ian Porteous, Alex Ihler, Padhraic Smyth, and Max Welling. Gibbs sampling for coupled infinite mixture models in the stick-breaking representation. In Proceedings of UAI 2006, pages 385\u2013392, July 2006.\n[11] Andrew Kachites McCallum. MALLET: A machine learning for language toolkit. http://mallet.cs.umass.edu, 2002.\n[12] Hanna M. Wallach, Iain Murray, Ruslan Salakhutdinov, and David Mimno. Evaluation methods for topic models. In Proceedings of the 26th Intl. Conf. on Machine Learning (ICML 2009), 2009.\n", "award": [], "sourceid": 1264, "authors": [{"given_name": "America", "family_name": "Chambers", "institution": null}, {"given_name": "Padhraic", "family_name": "Smyth", "institution": null}, {"given_name": "Mark", "family_name": "Steyvers", "institution": null}]}