{"title": "On the Dirichlet Prior and Bayesian Regularization", "book": "Advances in Neural Information Processing Systems", "page_first": 713, "page_last": 720, "abstract": null, "full_text": "On the Dirichlet Prior and Bayesian Regularization \n\nHarald Steck \nArtificial Intelligence Laboratory \nMassachusetts Institute of Technology \nCambridge, MA 02139 \nharald@ai.mit.edu \n\nTommi S. Jaakkola \nArtificial Intelligence Laboratory \nMassachusetts Institute of Technology \nCambridge, MA 02139 \ntommi@ai.mit.edu \n\nAbstract \n\nA common objective in learning a model from data is to recover its network structure, while the model parameters are of minor interest. For example, we may wish to recover regulatory networks from high-throughput data sources. In this paper we examine how Bayesian regularization using a product of independent Dirichlet priors over the model parameters affects the learned model structure in a domain with discrete variables. We show that a small scale parameter - often interpreted as \"equivalent sample size\" or \"prior strength\" - leads to a strong regularization of the model structure (sparse graph) given a sufficiently large data set. In particular, the empty graph is obtained in the limit of a vanishing scale parameter. This is diametrically opposite to what one may expect in this limit, namely the complete graph from an (unregularized) maximum likelihood estimate. Since the prior affects the parameters as expected, the scale parameter balances a trade-off between regularizing the parameters vs. the structure of the model. We demonstrate the benefits of optimizing this trade-off in the sense of predictive accuracy. \n\n1 Introduction \n\nRegularization is essential when learning from finite data sets.
In the Bayesian approach, regularization is achieved by specifying a prior distribution over the parameters and subsequently averaging over the posterior distribution. This regularization provides not only smoother estimates of the parameters compared to maximum likelihood but also guides the selection of model structures. \n\nIt was pointed out in [6] that a very large scale parameter of the Dirichlet prior can degrade predictive accuracy due to severe regularization of the parameter estimates. We complement this discussion here and show that a very small scale parameter can lead to poor over-regularized structures when a product of (conjugate) Dirichlet priors is used over multinomial conditional distributions (Section 3). Section 4 demonstrates the effect of the scale parameter and how it can be calibrated. We focus on the class of Bayesian network models throughout this paper. \n\n2 Regularization of Parameters \n\nWe briefly review Bayesian regularization of parameters. We follow the assumptions outlined in [6]: multinomial sample, complete data, parameter modularity, parameter independence, and Dirichlet prior. Note that the Dirichlet prior over the parameters is often used for two reasons: (1) the conjugate prior permits analytical calculations, and (2) the Dirichlet prior is intimately tied to the desirable likelihood-equivalence property of network structures [6]. The Dirichlet prior over the parameters θ_{X_i|π_i} is given by \n\np(θ_{X_i|π_i}) = [Γ(α_{π_i}) / ∏_{x_i} Γ(α_{x_i,π_i})] ∏_{x_i} θ_{x_i|π_i}^{α_{x_i,π_i} - 1}, (1) \n\nwhere θ_{x_i|π_i} pertains to variable X_i in state x_i given that its parents Π_i are in joint state π_i. The number of variables in the domain is denoted by n, and i = 1, ..., n. The normalization terms in Eq. 1 involve the Gamma function Γ(·).
There are a number of approaches to specifying the positive hyper-parameters α_{x_i,π_i} of the Dirichlet prior [2, 1, 6]; we adopt the common choice, \n\nα_{x_i,π_i} = α · p(x_i, π_i), (2) \n\nwhere p is a (marginal) prior distribution over the (joint) states, as this assignment ensures likelihood equivalence of the network structures [6]. Due to lack of prior knowledge, p is often chosen to be uniform, p(x_i, π_i) = 1/(|X_i| · |Π_i|), where |X_i|, |Π_i| denote the number of (joint) states [1]. The scale parameter α of the Dirichlet prior is positive and independent of i, i.e., α = Σ_{x_i,π_i} α_{x_i,π_i}. \n\nThe average parameter value θ̄_{x_i|π_i}, which typically serves as the regularized parameter estimate given a network structure m, is given by \n\nθ̄_{x_i|π_i} = E_{p(θ_{x_i|π_i}|D,m)}[θ_{x_i|π_i}] = (N_{x_i,π_i} + α_{x_i,π_i}) / (N_{π_i} + α_{π_i}), (3) \n\nwhere N_{x_i,π_i} are the cell-counts from data D; E[·] is the expectation. Positive hyper-parameters α_{x_i,π_i} lead to regularized parameter estimates, i.e., the estimated parameters become \"smoother\" or \"less extreme\" when the prior distribution p is close to uniform. An increasing scale parameter α leads to a stronger regularization, while in the limit α → 0, the (unregularized) maximum likelihood estimate is obtained, as expected. \n\n3 Regularization of Structure \n\nIn the remainder of this paper, we outline effects due to Bayesian regularization of the Bayesian network structure when using a product of Dirichlet priors. Let us briefly introduce relevant notation. In the Bayesian approach to structure learning, the posterior probability of the network structure m is given by p(m|D) = p(D|m)p(m)/p(D), where p(D) is the (unknown) probability of given data D, and p(m) denotes the prior distribution over the network structures; we assume p(m) > 0 for all m.
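For concreteness, the posterior-mean estimate of Eq. 3 can be sketched in a few lines of code. This is an illustrative helper, not code from the paper; the function name and the toy counts are our own:

```python
def posterior_mean(counts, alpha, prior):
    """Regularized estimate theta_bar (Eq. 3) for one conditional distribution.

    counts: dict x -> N_{x,pi} for a fixed parent configuration pi
    prior:  dict x -> p(x, pi); hyper-parameters are alpha * p(x, pi) (Eq. 2)
    """
    n_pi = sum(counts.values())           # N_pi
    a_pi = alpha * sum(prior.values())    # alpha_pi = sum_x alpha_{x,pi}
    return {x: (counts[x] + alpha * prior[x]) / (n_pi + a_pi) for x in counts}

# toy binary child with counts 9 and 1; the ML estimate would be 0.9
est = posterior_mean({0: 9, 1: 1}, alpha=2.0, prior={0: 0.5, 1: 0.5})
print(est[0])  # pulled from the ML value 0.9 toward the uniform prior
```

As α → 0 the estimate approaches the maximum likelihood ratio, and larger α pulls it toward the prior, matching the discussion above.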
Following the assumptions outlined in [6], including the Dirichlet prior over the parameters θ, the marginal likelihood p(D|m) = E_{p(θ|m)}[p(D|m, θ)] can be calculated analytically. Pretending that the (i.i.d.) data arrived in a sequential manner, it can be written as \n\np(D|m) = ∏_{k=1}^N ∏_{i=1}^n (N^{(k-1)}_{x_i^k,π_i^k} + α_{x_i^k,π_i^k}) / (N^{(k-1)}_{π_i^k} + α_{π_i^k}), (4) \n\nwhere N^{(k-1)} denotes the counts implied by data D^{(k-1)} seen before step k along the sequence (k = 1, ..., N). The (joint) state of variable X_i and its parents Π_i occurring in the kth data point is denoted by x_i^k, π_i^k. In Eq. 4, we also decomposed the joint probability into a product of conditional probabilities according to the Bayesian network structure m. Eq. 4 is independent of the sequential ordering of the data points, and the ratio in Eq. 3 corresponds to the one in Eq. 4 when based on data D^{(k-1)} at each step k along the sequence. \n\n3.1 Limit of Vanishing Scale-Parameter \n\nThis section is concerned with the limit of a vanishing scale parameter of the Dirichlet prior, α → 0. In this limit Bayesian regularization depends crucially on the number of zero-cell-counts in the contingency table implied by the data, or in other words, on the number of different configurations (data points) contained in the data. Let the Effective Number of Parameters (EP) be defined as \n\nd_EP^{(m)} = Σ_{i=1}^n [ Σ_{x_i,π_i} I(N_{x_i,π_i}) - Σ_{π_i} I(N_{π_i}) ], (5) \n\nwhere N_{x_i,π_i}, N_{π_i} are the (marginal) cell counts in the contingency table implied by data D; m refers to the Bayesian network structure, and I(·) is an indicator function such that I(z) = 0 if z = 0 and I(z) = 1 otherwise. When all cell counts are positive, EP is identical to the well-known number of parameters (P), d_EP^{(m)} = d_P^{(m)} = Σ_i (|X_i| - 1) · |Π_i|, which plays an important role in regularizing the learned network structure.
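The quantity d_EP^{(m)} of Eq. 5 is straightforward to compute from data by counting positive cells in the marginal contingency tables. The following sketch is our own illustrative helper, not the authors' code; it uses the toy data set {(0,0), (1,1)} that reappears in the example of Section 3.1:

```python
from collections import Counter

def effective_params(data, parents):
    """Effective Number of Parameters (EP, Eq. 5) of a network structure.

    data:    list of tuples, one joint configuration per data point
    parents: parents[i] is the tuple of parent indices of variable i
    """
    ep = 0
    for i, pa in enumerate(parents):
        # positive cells N_{x_i,pi_i} > 0 and N_{pi_i} > 0 implied by the data
        child_cfg = Counter((x[i],) + tuple(x[j] for j in pa) for x in data)
        parent_cfg = Counter(tuple(x[j] for j in pa) for x in data)
        # I(N) = 1 for positive counts; zero cells contribute nothing
        ep += len(child_cfg) - len(parent_cfg)
    return ep

# two binary variables, data {(0,0), (1,1)}
data = [(0, 0), (1, 1)]
print(effective_params(data, parents=[(), ()]))      # empty graph m0 -> 2
print(effective_params(data, parents=[(), (0,)]))    # X0 -> X1 (m1) -> 1
```

With all cell counts positive, the same function returns the usual parameter count d_P^{(m)}.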
The key difference is that EP accounts for zero-cell-counts implied by the data. \n\nLet us now consider the behavior of the marginal likelihood (cf. Eq. 4) in the limit of a small scale parameter α. We find \n\nProposition 1: Under the assumptions concerning the prior distribution outlined in Section 2, the marginal likelihood of a Bayesian network structure m vanishes in the limit α → 0 if the data D contain two or more different configurations. This property is independent of the network structure. The leading polynomial order is given by \n\np(D|m) ~ α^{d_EP^{(m)}} as α → 0, (6) \n\nwhich depends both on the network structure and the data. However, the dependence on the data is through the number of different data points only. This holds independently of a particular choice of strictly positive prior distributions p(X_i, Π_i). If the prior over the network structures is strictly positive, this limiting behavior also holds for the posterior probability p(m|D). \n\nIn the following we give a derivation of Proposition 1 that also facilitates the intuitive understanding of the result. First, let us consider the behavior of the Dirichlet distribution in the limit α → 0. The hyper-parameters α_{x_i,π_i} vanish when α → 0, and thus the Dirichlet prior converges to a discrete distribution over the parameter simplex in the sense that the probability mass concentrates at a particular, randomly chosen corner of the simplex containing θ_{X_i|π_i} (cf. [9]). Since the randomly chosen points (for different π_i, i) do not change when sampling (several) data points from the distribution implied by the model, it follows immediately that the marginal likelihood of any network structure vanishes whenever there are two or more different configurations contained in the data. \n\nThis well-known fact also shows that the limit α → 0 actually corresponds to a very strong prior belief [9, 12].
This is in contrast to many traditional interpretations where the limit α → 0 is considered as \"no prior information\", often motivated by Eq. 3. As pointed out in [9, 12], the interpretation of the scale parameter α as \"equivalent sample size\" or as the \"strength\" of prior belief may be misleading, particularly in the case where α_{x_i,π_i} < 1 for some configurations x_i, π_i. A review of different notions of \"noninformative\" priors (including their limitations) can be found in [7]. Note that the noninformative prior in the sense of entropy is achieved by setting α_{x_i,π_i} = 1 for each x_i, π_i and for all i = 1, ..., n. This is the assignment originally proposed in [2]; however, this assignment generally is inconsistent with Eq. 2, and hence with likelihood equivalence [6]. \n\nIn order to explain the behavior of the marginal likelihood in leading order of the scale parameter α, the properties of the Dirichlet distribution are not sufficient by themselves. Additionally, it is essential that the probability distribution described by a Bayesian network decomposes into a product of conditional probabilities, and that there is a Dirichlet prior pertaining to each variable for each parent configuration. All these Dirichlet priors are independent of each other under the standard assumption of parameter independence. Obviously, the ratio (for given k and i) in Eq. 4 can only vanish in the limit α → 0 if N^{(k-1)}_{x_i^k,π_i^k} = 0 while N^{(k-1)}_{π_i^k} > 0; in other words, the parent-configuration π_i^k must already have occurred previously along the sequence (π_i^k is \"old\"), while the child-state x_i^k occurs simultaneously with this parent-state for the first time (x_i^k, π_i^k is \"new\"). In this case, the leading polynomial order of the ratio (for given k and i) is linear in α, assuming p(X_i, Π_i) > 0; otherwise the ratio (for given k and i) converges to a finite positive value in the limit α → 0.
Consequently, the dependence of the marginal likelihood in leading polynomial order on α is completely determined by the number of different configurations in the data. It follows immediately that the leading polynomial order in α is given by EP (cf. Eq. 5). This is because the first term counts the number of all the different joint configurations of X_i, Π_i in the data, while the second term ensures that EP counts only those configurations where (x_i^k, π_i^k) is \"new\" while π_i^k is \"old\". \n\nNote that the behavior of the marginal likelihood in Proposition 1 is not entirely determined by the network structure in the limit α → 0, as it still depends on the data. This is illustrated in the following example. First, let us consider two binary variables, X_0 and X_1, and the data D containing only two data points, say (0,0) and (1,1). Given data D, three Dirichlet priors are relevant regarding graph m_1, X_0 → X_1, but only two Dirichlet priors pertain to the empty graph, m_0. The resulting additional \"flexibility\" due to an increased number of priors favours more complex models: p(D|m_1) ~ α, while p(D|m_0) ~ α^2. Second, let us now assume that all possible configurations occur in data D. Then we still have p(D|m_0) ~ α^2 for the empty graph. Concerning graph m_1, however, the marginal likelihood now also involves the vanishing terms due to the two priors pertaining to θ_{X_1|X_0=0} and θ_{X_1|X_0=1}, and it hence becomes p(D|m_1) ~ α^3. \n\nThis dependence on the data can be formalized as follows. Let us compare the marginal likelihoods of two graphs, say m+ and m-. In particular, let us consider two graphs that are identical except for a single edge, say A ← B, between the variables A and B. Let the edge be present in graph m+ and absent in m-. The fact that the marginal likelihood decomposes into terms pertaining to each of the variables (cf. Eq.
4) entails that all the terms regarding the remaining variables cancel out in the Bayes factor p(D|m+)/p(D|m-), which is the standard relative Bayesian score. With the definition of the Effective Degrees of Freedom (EDF)¹ \n\nd_EDF = d_EP^{(m+)} - d_EP^{(m-)}, (7) \n\nwe immediately obtain from Proposition 1 that p(D|m+)/p(D|m-) ~ α^{d_EDF} in the limit α → 0, and hence \n\nProposition 2: Let m+ and m- be the two network structures as defined above. Let the prior belief be given according to Eq. 2. Then in the limit α → 0: \n\nlog p(D|m+)/p(D|m-) → -∞ if d_EDF > 0, and → +∞ if d_EDF < 0. (8) \n\nThe result holds independently of a particular choice of strictly positive prior distributions p(X_i, Π_i). If the prior over the network structures is strictly positive, this limiting behavior also holds for the posterior ratio. \n\n¹Note that EDF are not necessarily non-negative. \n\nA positive value of the log Bayes factor indicates that the presence of the edge A ← B is favored, given the parents Π_A; conversely, a negative relative score suggests that the absence of this edge is preferred. The divergence of this relative Bayesian score implies that there exists a (small) positive threshold value α_0 > 0 such that, for any α < α_0, the same graph(s) are favored as in the limit. \n\nSince Proposition 2 applies to every edge in the network, it follows immediately that the empty (complete) graph is assigned the highest relative Bayesian score when EDF are positive (negative). Regularization of network structure in the case of positive EDF is therefore extreme, permitting only the empty graph. This is precisely the opposite of what one may have expected in this limit, namely the complete graph corresponding to the unregularized maximum likelihood estimate (MLE). In contrast, when EDF are negative, the complete graph is favored. This agrees with MLE. \n\nRoughly speaking, positive (negative) EDF correspond to large (small) data sets.
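Proposition 2 can be checked numerically. The sketch below implements the sequential form of Eq. 4 for tiny discrete domains under a uniform prior p(x_i, π_i); it is a toy implementation with our own naming, not the authors' code. For the two-binary-variable example above, the Bayes factor for the edge X_0 → X_1 diverges or collapses as α → 0 according to the sign of d_EDF:

```python
from collections import defaultdict
from math import prod

def marginal_likelihood(data, parents, cards, alpha):
    """Exact p(D|m) via the sequential product of Eq. 4, uniform p(x_i, pi_i).

    cards[i] is the number of states of variable i; the hyper-parameters are
    alpha_{x_i,pi_i} = alpha / (|X_i| * |Pi_i|) as in Eq. 2.
    """
    lik = 1.0
    joint = defaultdict(int)   # N^{(k-1)}_{x_i,pi_i}, keyed by (i, x_i, pi_i)
    marg = defaultdict(int)    # N^{(k-1)}_{pi_i},     keyed by (i, pi_i)
    for x in data:
        for i, pa in enumerate(parents):
            pi = tuple(x[j] for j in pa)
            card_pi = prod(cards[j] for j in pa)   # |Pi_i| (1 if no parents)
            a_xpi = alpha / (cards[i] * card_pi)
            a_pi = alpha / card_pi                 # sum of a_xpi over x_i
            lik *= (joint[(i, x[i], pi)] + a_xpi) / (marg[(i, pi)] + a_pi)
            joint[(i, x[i], pi)] += 1
            marg[(i, pi)] += 1
    return lik

m_plus, m_empty = [(), (0,)], [(), ()]         # X0 -> X1 vs. empty graph
small = [(0, 0), (1, 1)]                       # d_EDF = 1 - 2 = -1 < 0
full = [(0, 0), (0, 1), (1, 0), (1, 1)]        # d_EDF = 3 - 2 = +1 > 0
for D in (small, full):
    bf = (marginal_likelihood(D, m_plus, [2, 2], 1e-6)
          / marginal_likelihood(D, m_empty, [2, 2], 1e-6))
    print(bf)   # ~ alpha^{d_EDF}: huge for the small D, tiny for the full D
```

The two data sets thus favor the complete and the empty graph, respectively, as α → 0, in line with Proposition 2.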
\nIt is thus surprising that a small data set, where one might expect an increased \nrestriction on model complexity, actually gives rise to the complete graph, while \na large data set yields the - most regularized - empty graph in the limit a -+ O. \nMoreover, it is conceivable that a \"medium\" sized data set may give rise to both \npositive and negative EDF. This is because the marginal contingency tables implied \nby the data with respect to a sparse (dense) graph may contain a small (large) \nnumber of zero-cell-counts. The relative Bayesian score can hence become rather \nunstable in this case, as completely different graph structures are optimal in the \nlimit a -+ 0, namely graphs where each variable has either the maximal number of \nparents or none. \nNote that there are two reasons for the hyper-parameters a Xi , 1fi to take on small \nvalues (cf. Eq. 2): (1) a small equivalent sample size a, or (2) a large number of \nIXi l\u00b7 IIIil \u00bb a , due to a large number of parents (with a large \njoint states, i.e. \nnumber of states). Thus, these hyper-parameters can also vanish in the limit of a \nlarge number of configurations (x , 1f) even though the scale parameter a remains \nfixed. This is precisely the limit defining Dirichlet processes [4], which, analogously, \nproduce discrete samples. With a finite data set and a large number of joint config(cid:173)\nurations, only the typical limit in Proposition 2 is possible. This follows from the \nfact that a large number of zero-cell-counts forces EDF to be negative. The sur(cid:173)\nprising behavior implied by Proposition 2 therefore does not carryover to Dirichlet \nprocesses. As found in [8], however, the use of a product of Dirichlet process priors \nin non parametric inference can also lead to surprising effects. \nWhen dEDF = 0, it is indeed true that the value of the log Bayes factor can converge \nto any (possibly finite) value as a -+ O. 
\nIts value is determined by the priors \nP(Xi' IIi), as well as by the counts implied by the data. The value of the Bayes \nfactor can be therefore easily set by adjusting the prior weights p(Xi' 1fi). \n\n3.2 Large Scale-Parameter \nIn the other limiting case, where a -+ 00, the Bayes factor approaches a finite value, \nwhich in general depends on the given data and on the prior distributions p(Xi' IIi). \n\n\flBF \n\n2 \n1.5 \n\n1 \n\n0.5 \n\nz=3 \n\n-0.5 \n-1 \n\n~ 100 \n\nz=o \n\n150 \n\n200 \n\n250 \n\na \n\n300 \n\nFigure 1: The log Bayes factor (lBF) is depicted as a function of the scale parameter \n0:. It is assumed that the two variables A and B are binary and have no parents; \nand that the \"data\" imply the contingency table: NA= O,B= O = NA= l,B= l = 10 + z \nand NA=l,B=O = NA=O,B=l = 10 - z, where z is a free parameter determining the \nstatistical dependence between A and B. The prior p(Xi,IIi ) was chosen to be \nuniform. \n\nThis can be seen easily by applying the Stirling approximation in the limit 0: -+ 00 \nafter rewriting Eq. 4 in terms of Gamma functions (cf. also [2, 6]). When the \npopular choice of a uniform prior p(Xi,IIi ) is used [1], then \n\np(Dlm+) \n\nlog p(Dlm-) -+ 0 as 0:-+00, \n\n(9) \n\nwhich is independent of the data. Hence, neither the presence nor the absence of \nthe edge between A and B is favored in this limit. Given a uniform prior over the \nnetwork structures, p(m) =const, the posterior distribution p(mID) over the graphs \nthus becomes increasingly spread out as 0: grows, permitting more variable network \nstructures. \n\nThe behavior of the Bayes factor between the two limits 0: -+ 0 and 0: -+ 00 is exem(cid:173)\nplified for positive EDF in Figure 1: there are two qualitatively different behaviors, \ndepending on the degree of statistical dependence between A and B. 
A sufficiently weak dependence results in a monotonically increasing Bayes factor which favors the absence of the edge A ← B at any finite value of α. In contrast, given a sufficiently strong dependence between A and B, the log Bayes factor takes on positive values for all (finite) α exceeding a certain value α+ of the scale parameter. Moreover, α+ grows as the statistical dependence between A and B diminishes. Consequently, given a domain with a range of degrees of statistical dependences, the number of edges in the learned graph increases monotonically with growing scale parameter α when each variable has at most one parent (i.e., in the class of trees or forests). This is because increasingly weaker statistical dependencies between variables are recovered as α grows; the restriction to forests excludes possible \"interactions\" among (several) parents of a variable. As suggested by our experiments, this increase in the number of edges can also be expected to hold for general Bayesian network structures (although not necessarily in a monotonic way). \n\nThis reveals that regularization of network structure tends to diminish with a growing scale parameter. Note that this is in the opposite direction to the regularization of parameters (cf. Section 2). Hence, the scale parameter α of the Dirichlet prior determines the trade-off between regularizing the parameters vs. the structure of the Bayesian network model. \n\nIf a uniform prior over the network structures is chosen, p(m) = const, the above discussion also holds for the posterior ratio (instead of the Bayes factor). The behavior is more complicated, however, when a non-uniform prior is assumed. For instance, when a prior is chosen that penalizes the presence of edges, the posterior favours the absence of an edge not only when the scale parameter is sufficiently small, but also when it is sufficiently large. This is apparent from Fig.
1, when the log Bayes factor is compared to a positive threshold value (instead of zero). \n\n4 Example \n\nThis section exemplifies that the entire model (parameters and structure) has to be considered when learning from data. This is because regularization of model structure diminishes, while regularization of parameters increases with a growing scale parameter α of the Dirichlet prior, as discussed in the previous sections. \n\nWhen the entire model is taken into account, one can use a sensitivity analysis in order to determine the dependence of the learned model on the scale parameter α, given the prior p(X_i, Π_i) (cf. Eq. 2). The influence of the scale parameter α on predictive accuracy of the model can be assessed by cross-validation or, in a Bayesian approach, prequential validation [11, 3]. Another possibility is to treat the scale parameter α as an additional parameter of the model to be learned from data. Hence, prior belief regarding the parameters θ can then enter only through the (normalized) distributions p(X_i, Π_i). However, note that this is sufficient to determine the (average) prior parameter estimate θ̄ (cf. Eq. 3), i.e., when N = 0. Assuming an (improper) uniform prior distribution over α, its posterior distribution is p(α|D) ∝ p(D|α), given data D. Then α_D = argmax_α p(D|α), where p(D|α) = Σ_m p(D|α, m)p(m)² can be calculated exactly if the summation is feasible (as in the example below). Alternatively, assuming that the posterior over α is strongly peaked, the likelihood may also be approximated by summing over the k most likely graphs m only (k = 1 in the most extreme case; empirical Bayes). Subsequently, model structure m and parameters θ can be learned with respect to the Bayesian score employing α_D. \n\nIn the following, the effect of various values assigned to the scale parameter α is exemplified concerning the data set gathered from Wisconsin high-school students by Sewell and Shah [10].
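The calibration α_D = argmax_α Σ_m p(D|α, m) p(m) described above can be sketched in code. The snippet below uses the closed Gamma-function form of the marginal likelihood with a uniform p(x_i, π_i); the two-variable data and the tiny model space are our own toy illustration (not the Sewell-Shah data), and a coarse grid stands in for the maximization over α:

```python
from collections import defaultdict
from itertools import product as state_space
from math import exp, lgamma, log, prod

def log_marginal(data, parents, cards, alpha):
    """log p(D | alpha, m) in closed Gamma-function form, uniform p(x_i, pi_i)."""
    n = defaultdict(int)
    for x in data:
        for i, pa in enumerate(parents):
            n[(i, x[i], tuple(x[j] for j in pa))] += 1
    ll = 0.0
    for i, pa in enumerate(parents):
        a_xpi = alpha / (cards[i] * prod(cards[j] for j in pa))  # Eq. 2, uniform p
        a_pi = cards[i] * a_xpi                                  # sum over x_i
        for pi in state_space(*(range(cards[j]) for j in pa)):
            counts = [n[(i, x, pi)] for x in range(cards[i])]
            ll += lgamma(a_pi) - lgamma(sum(counts) + a_pi)
            ll += sum(lgamma(c + a_xpi) - lgamma(a_xpi) for c in counts)
    return ll

# tiny two-variable domain: empty graph vs. X0 -> X1, uniform p(m)
models = [[(), ()], [(), (0,)]]
data = [(0, 0)] * 6 + [(1, 1)] * 6 + [(0, 1)] * 2 + [(1, 0)] * 2

def log_p_D_given_alpha(a):
    ls = [log_marginal(data, m, [2, 2], a) for m in models]
    mx = max(ls)
    return mx + log(sum(exp(l - mx) for l in ls)) - log(len(models))

grid = [1, 2, 5, 10, 20, 50, 100, 200]
best = max(grid, key=log_p_D_given_alpha)   # crude stand-in for argmax over alpha
```

In a real application the sum would run over all (or the k most likely) network structures, exactly as described in the text.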
This domain comprises 5 discrete variables, each with 2 or 4 states; the sample size is 10,318. In this small domain, exhaustive search in the space of Bayesian network structures is feasible (29,281 graphs). Both the prior distributions p(m) for all m and p(X_i, Π_i) are chosen to be uniform. Figure 2 shows that the number of edges in the graph with the highest posterior probability grows with an increasing value of the scale parameter, as expected (cf. Section 3). In addition, cross-validation indicates best predictive accuracy of the learned model at α ≈ 100, ..., 300, while the likelihood p(D|α) takes on its maximum at α_D ≈ 69. Both approaches agree on the same network structure, which is depicted in Fig. 3. This graph can easily be interpreted in a causal manner, as outlined in [5].³ We note that this graph was also obtained in [5] due, however, to additional constraints concerning network structure, as a rather small prior strength of α = 5 was used. For comparison, Fig. 3 also shows the highest-scoring unconstrained graph due to α = 5, which does not permit a causal interpretation, cf. also [5]. This illustrates that the \"right\" choice of the scale parameter α of the Dirichlet prior, when accounting for both model structure and parameters, can have a crucial impact on the learned network structure and the resulting insight into the (\"true\") dependencies among the variables in the domain. \n\n²We assume that m and α are independent a priori, p(m|α) = p(m). \n³Since we did not impose any constraints on the network structure, unlike [5], Markov-equivalence leaves the orientation of the edge between the variables IQ and CP unspecified. \n\n[Figure 2: As a function of α: number of arcs (a.) in the highest-scoring graph; average KL divergence in 5-fold cross-validation (XV5), std = 0.006; likelihood of α when treated as an additional model parameter (α_D = 69). \nα      a.  XV5    p(D|α)/p(D|α_D) \n5      6   0.045  10^-10 \n50     7   0.044  0.13 \n100    7   0.040  0.05 \n200    7   0.040  10^-14 \n300    7   0.040  10^-30 \n500    7   0.042  10^-65 \n1,000  8   0.047  10^-151] \n\nVariables: SES: socioeconomic status; SEX: gender of student; PE: parental encouragement; CP: college plans; IQ: intelligence quotient. \n\n[Figure 3: Highest-scoring (unconstrained) graphs when α = 5 (left), and when α = 46, ..., 522 (right). Note that the latter graph can also be obtained at α = 5 when additional constraints are imposed on the structure, cf. [5].] \n\nAcknowledgments \n\nWe would like to thank Chen-Hsiang Yeang and the anonymous referees for valuable comments. Harald Steck acknowledges support from the German Research Foundation (DFG) under grant STE 1045/1-1. Tommi Jaakkola acknowledges support from Nippon Telegraph and Telephone Corporation, NSF ITR grant IIS-0085836, and from the Sloan Foundation in the form of the Sloan Research Fellowship. \n\nReferences \n\n[1] W. Buntine. Theory refinement on Bayesian networks. In Conference on Uncertainty in Artificial Intelligence, pages 52-60. Morgan Kaufmann, 1991. \n\n[2] G. Cooper and E. Herskovits. A Bayesian method for the induction of probabilistic networks from data. Machine Learning, 9:309-47, 1992. \n\n[3] A. P. Dawid. Statistical theory. The prequential approach. Journal of the Royal Statistical Society, Series A, 147:277-305, 1984. \n\n[4] T. S. Ferguson. A Bayesian analysis of some nonparametric problems. Annals of Statistics, 1:209-30, 1973. \n\n[5] D. Heckerman. A tutorial on learning with Bayesian networks. In M. I. Jordan (Ed.), Learning in Graphical Models, pages 301-54. Kluwer, 1996. \n\n[6] D. Heckerman, D. Geiger, and D. M. Chickering.
Learning Bayesian networks: the combination of knowledge and statistical data. Machine Learning, 20:197-243, 1995. \n\n[7] R. E. Kass and L. Wasserman. Formal rules for selecting prior distributions: a review and annotated bibliography. Technical Report 583, CMU, 1993. \n\n[8] S. Petrone and A. E. Raftery. A note on the Dirichlet process prior in Bayesian nonparametric inference with partial exchangeability. Technical Report 297, University of Washington, Seattle, 1995. \n\n[9] J. Sethuraman and R. C. Tiwari. Convergence of Dirichlet measures and the interpretation of their parameter. In S. S. Gupta and J. O. Berger (Eds.), Statistical Decision Theory and Related Topics III, pages 305-15, 1982. \n\n[10] W. Sewell and V. Shah. Social class, parental encouragement, and educational aspirations. American Journal of Sociology, 73:559-72, 1968. \n\n[11] M. Stone. Cross-validatory choice and assessment of statistical predictions. Journal of the Royal Statistical Society, Series B, 36:111-47, 1974. \n\n[12] S. G. Walker and B. K. Mallick. A note on the scale parameter of the Dirichlet process. The Canadian Journal of Statistics, 25:473-9, 1997.", "award": [], "sourceid": 2160, "authors": [{"given_name": "Harald", "family_name": "Steck", "institution": null}, {"given_name": "Tommi", "family_name": "Jaakkola", "institution": null}]}