{"title": "Automated Refinement of Bayes Networks' Parameters based on Test Ordering Constraints", "book": "Advances in Neural Information Processing Systems", "page_first": 2591, "page_last": 2599, "abstract": "In this paper, we derive a method to refine a Bayes network diagnostic model by exploiting constraints implied by expert decisions on test ordering. At each step, the expert executes an evidence gathering test, which suggests the test's relative diagnostic value. We demonstrate that consistency with an expert's test selection leads to non-convex constraints on the model parameters.  We incorporate these constraints by augmenting the network with nodes that represent the constraint likelihoods. Gibbs sampling, stochastic hill climbing and greedy search algorithms are proposed to find a MAP estimate that takes into account test ordering constraints and any data available. We demonstrate our approach on diagnostic sessions from a manufacturing scenario.", "full_text": "Automated Re\ufb01nement of Bayes Networks\u2019\n\nParameters based on Test Ordering Constraints\n\nOmar Zia Khan & Pascal Poupart\n\nDavid R. Cheriton School of Computer Science\n\nUniversity of Waterloo\nWaterloo, ON Canada\n\n{ozkhan,ppoupart}@cs.uwaterloo.ca\n\nJohn Mark Agosta\u2217\n\nIntel Labs\n\nSanta Clara, CA, USA\n\njohnmark.agosta@gmail.com\n\nAbstract\n\nIn this paper, we derive a method to re\ufb01ne a Bayes network diagnostic model by\nexploiting constraints implied by expert decisions on test ordering. At each step,\nthe expert executes an evidence gathering test, which suggests the test\u2019s relative\ndiagnostic value. We demonstrate that consistency with an expert\u2019s test selection\nleads to non-convex constraints on the model parameters. We incorporate these\nconstraints by augmenting the network with nodes that represent the constraint\nlikelihoods. 
Gibbs sampling, stochastic hill climbing and greedy search algo-\nrithms are proposed to \ufb01nd a MAP estimate that takes into account test ordering\nconstraints and any data available. We demonstrate our approach on diagnostic\nsessions from a manufacturing scenario.\n\n1 INTRODUCTION\n\nThe problem of learning-by-example has the promise to create strong models from a restricted num-\nber of cases; certainly humans show the ability to generalize from limited experience. Machine\nLearning has seen numerous approaches to learning task performance by imitation, going back to\nsome of the approaches to inductive learning from examples [14]. Of particular interest are problem-\nsolving tasks that use a model to infer the source, or cause of a problem from a sequence of investi-\ngatory steps or tests. The speci\ufb01c example we adopt is a diagnostic task such as appears in medicine,\nelectro-mechanical fault isolation, customer support and network diagnostics, among others.\nWe de\ufb01ne a diagnostic sequence as consisting of the assignment of values to a subset of tests. The\ndiagnostic process embodies the choice of the best next test to execute at each step in the sequence,\nby measuring the diagnostic value among the set of available tests at each step, that is, the ability of\na test to distinguish among the possible causes. One possible implementation with which to carry\nout this process, the one we apply, is a Bayes network [9]. As with all model-based approaches,\nprovisioning an adequate model can be daunting, resulting in a \u201cknowledge elicitation bottleneck.\u201d\nA recent approach for easing the bottleneck grew out of the realization that the best time to gain an\nexpert\u2019s insight into the model structure is during the diagnostic process. Recent work in \u201cQuery-\nBased Diagnostics\u201d [1] demonstrated a way to improve model quality by merging model use and\nmodel building into a single process. 
More precisely, the expert can take steps to modify the network\nstructure to add or remove nodes or links, interspersed within the diagnostic sequence. In this paper\nwe show how to extend this variety of learning-by-example to also include re\ufb01nement of model\nparameters based on the expert\u2019s choice of test, from which we determine constraints. The nature\nof these constraints, as shown herein, is derived from the value of the tests to distinguish causes, a\nvalue referred to informally as value of information [10]. It is the effect of these novel constraints\non network parameter learning that is elucidated in this paper.\n\n\u2217 J. M. Agosta is no longer af\ufb01liated with Intel Corporation.\n\nConventional statistical learning approaches are not suited to this problem, since the number of cases\navailable from diagnostic sessions is small, and the data from any case is sparse. (Only a fraction of\nthe tests are taken.) But more relevant is that one diagnostic sequence from an expert user represents\nthe true behavior expected of the model, rather than a noisy realization of a case generated by the\ntrue model. We adopt a Bayesian approach, which offers a principled way to incorporate knowledge\n(constraints and data, when available) and also consider weakening the constraints, by applying a\nlikelihood to them, so that possibly con\ufb02icting constraints can be incorporated consistently.\nSec. 2 reviews related work and Sec. 3 provides some background on diagnostic networks and model\nconsistency. Then, Sec. 4 describes an augmented Bayesian network that incorporates constraints\nimplied by an expert\u2019s choice of tests. Some sampling techniques are proposed to \ufb01nd the maximum\na posteriori setting of the parameters given the constraints (and any data available). The approach is\nevaluated in Sec. 5 on synthetic data and a real world manufacturing diagnostic scenario. Finally,\nSec. 
6 discusses some future work.\n\n2 RELATED WORK\n\nParameter learning for Bayesian networks can be viewed as searching in a high-dimensional space.\nAdopting constraints on the parameters based on some domain knowledge is a way of pruning this\nsearch space and learning the parameters more ef\ufb01ciently, both in terms of data needed and time\nrequired. Qualitative probabilistic networks [17] allow qualitative constraints on the parameter space\nto be speci\ufb01ed by experts. For instance, the in\ufb02uence of one variable on another, or the combined\nin\ufb02uence of multiple variables on another variable [5] leads to linear inequalities on the parameters.\nWittig and Jameson [18] explain how to transform the likelihood of violating qualitative constraints\ninto a penalty term to adjust maximum likelihood, which allows gradient ascent and Expectation\nMaximization (EM) to take into account linear qualitative constraints.\nOther examples of qualitative constraints include some parameters being larger than others, bounded\nin a range, within \u03f5 of each other, etc. Various proposals have been made that exploit such con-\nstraints. Altendorf et al. [2] provide an approximate technique based on constrained convex opti-\nmization for parameter learning. Niculescu et al. [15] also provide a technique based on constrained\noptimization with closed form solutions for different classes of constraints. Feelders [6] provides an\nalternate method based on isotonic regression while Liao and Ji [12] combine gradient descent with\nEM. de Campos and Ji [4] also use constrained convex optimization, however, they use Dirichlet\npriors on the parameters to incorporate any additional knowledge. Mao and Lebanon [13] also use\nDirichlet priors, but they use probabilistic constraints to allow inaccuracies in the speci\ufb01cation of\nthe constraints.\nA major difference between our technique and previous work is on the type of constraints. 
Our\nconstraints do not need to be explicitly speci\ufb01ed by an expert. Instead, we passively observe the\nexpert and learn from what choices are made and not made [16]. Furthermore, as we shall show\nlater, our constraints are non-convex, preventing the direct application of existing techniques that\nassume linear or convex functions. We use Beta priors on the parameters, which can easily be extended\nto Dirichlet priors like previous work. We incorporate constraints in an augmented Bayesian\nnetwork, similar to Liang et al. [11], though their constraints are on model predictions as opposed\nto ours which are on the parameters of the network. Finally, we also use the notion of probabilistic\nconstraints to handle potential mistakes made by experts.\n\n3 BACKGROUND\n\n3.1 DIAGNOSTIC BAYES NETWORKS\n\nWe consider the class of bipartite Bayes networks that are widely used as diagnostic models, though\nour approach can be used for networks with any structure. The network forms a sparse, directed,\ncausal graph, where arcs go from causes to observable node variables. We use upper case to denote\nrandom variables; C for causes, and T for observables (tests). Lower case letters denote values in\nthe domain of a variable, e.g. c \u2208 dom(C) = {c, c\u0304}, and bold letters denote sets of variables. A\nset of marginally independent binary-valued node variables C with distributions Pr(C) represent\nunobserved causes, and condition the remaining conditionally independent binary-valued test variable\nnodes T. Each cause conditions one or more tests; likewise each test is conditioned by one or\nmore causes, resulting in a graph with one or more possibly multiply-connected components. The\ntest variable distributions Pr(T|C) incorporate the further modeling assumption of Independence of\nCausal In\ufb02uence, the most familiar example being the Noisy-Or model [8]. 
To keep the exposition\nsimple, we assume that all variables are binary and that conditional distributions are parametrized by\nthe Noisy-Or; however, the algorithms described in the rest of the paper generalize to any discrete\nnon-binary variable models.\nConventionally, unobserved tests are ranked in a diagnostic Bayes network by their Value Of Information\n(VOI) conditioned on tests already observed. To be precise, VOI is the expected gain in\nutility if the test were to be observed. The complete computation requires a model equivalent to a\npartially observable Markov decision process. Instead, VOI is commonly approximated by a greedy\ncomputation of the Mutual Information between a test and the set of causes [3]. In this case, it\nis easy to show that Mutual Information is in turn well approximated to second order by the Gini\nimpurity [7] as shown in Equation 1.\n\nGI(C|T) = \u2211_t Pr(T = t) [ \u2211_c Pr(C = c|T = t)(1 \u2212 Pr(C = c|T = t)) ]     (1)\n\nWe will use the Gini measure as a surrogate for VOI, as a way to rank the best next test in the\ndiagnostic sequence.\n\n3.2 MODEL CONSISTENCY\n\nA model that is consistent with an expert would generate Gini impurity rankings consistent with\nthe expert\u2019s diagnostic sequence. We interpret the expert\u2019s test choices as implying constraints on\nGini impurity rankings between tests. To that effect, [1] de\ufb01nes the notion of Cause Consistency\nand Test Consistency, which indicate whether the cause and test orderings induced by the posterior\ndistribution over causes and the VOI of each test agree with an expert\u2019s observed choice. 
Assuming\nthat the expert greedily chooses the most informative test T\u2217 (i.e., the test that yields the lowest Gini\nimpurity) at each step, the model is consistent with the expert\u2019s choices when the following\nconstraints are satis\ufb01ed:\n\nGI(C|T\u2217) \u2264 GI(C|T_i)  \u2200i     (2)\n\nWe demonstrate next how to exploit these constraints to re\ufb01ne the Bayes network.\n\n4 MODEL REFINEMENT\n\nConsider a simple diagnosis example with two possible causes C1 and C2 and two tests T1 and T2 as\nshown in Figure 1. To keep the exposition simple, suppose that the priors for each cause are known\n(generally separate data is available to estimate these), but the conditional distribution of each test\nis unknown. Using the Noisy-Or parameterization for the conditional distributions, the number of\nparameters is linear in the number of parents instead of exponential.\n\nPr(T_i = true|C) = 1 \u2212 (1 \u2212 \u03b8^i_0) \u220f_{j|C_j = true} (1 \u2212 \u03b8^i_j)     (3)\n\nHere, \u03b8^i_0 = Pr(T_i = true|C_j = false \u2200j) is the leak probability that Ti will be true when none of\nthe causes are true and \u03b8^i_j = Pr(T_i = true|C_j = true, C_k = false \u2200k \u2260 j) is the link reliability,\nwhich indicates the independent contribution of cause Cj to the probability that test Ti will be true.\nIn the rest of this section, we describe how to learn the \u03b8 parameters while respecting the constraints\nimplied by test consistency.\n\n4.1 TEST CONSISTENCY CONSTRAINTS\n\nSuppose that an expert chooses test T1 instead of test T2 during the diagnostic process. This ordering\nby the expert implies that the current model (parametrized by the \u03b8\u2019s) must be consistent with the\nconstraint GI(C|T2) \u2212 GI(C|T1) \u2265 0. Using the de\ufb01nition of Gini impurity in Eq. 
1, we can rewrite\nthe constraint for the network shown in Fig. 1 as follows:\n\n\u2211_{t_2} ( Pr(t_2) \u2212 \u2211_{c_1,c_2} (Pr(t_2|c_1, c_2) Pr(c_1) Pr(c_2))^2 / Pr(t_2) ) \u2212 \u2211_{t_1} ( Pr(t_1) \u2212 \u2211_{c_1,c_2} (Pr(t_1|c_1, c_2) Pr(c_1) Pr(c_2))^2 / Pr(t_1) ) \u2265 0     (4)\n\nFurthermore, using the Noisy-Or encoding from Eq. 3, we can rewrite the constraint as a polynomial\nin the \u03b8\u2019s. This polynomial is non-linear, and in general, not concave. The feasible space may\nconsist of disconnected regions. Fig. 4 shows the surface corresponding to the polynomial for the\ncase where \u03b8^i_0 = 0 and \u03b8^i_1 = 0.5 for each test i, which leaves \u03b8^1_2 and \u03b8^2_2 as the only free variables.\nThe parameters\u2019 feasible space, satisfying the constraint, consists of the two disconnected regions\nwhere the surface is positive.\n\nFigure 1: Network with 2 causes and 2 tests\n\nFigure 2: Augmented network with parameters and constraints\n\nFigure 3: Augmented network extended to handle inaccurate feedback\n\n4.2 AUGMENTED BAYES NETWORK\n\nOur objective is to learn the \u03b8 parameters of diagnostic Bayes networks given test constraints of the\nform described in Eq. 4. To deal with non-convex constraints and disconnected feasible regions, we\npursue a Bayesian approach whereby we explicitly model the parameters and constraints as random\nvariables in an augmented Bayes network (see Fig. 2). This allows us to frame the problem of\nlearning the parameters as an inference problem in a hybrid Bayes network of discrete (T, C, V) and\ncontinuous (\u0398) variables. 
As we will see shortly, this augmented Bayes network provides a unifying\nframework to simultaneously learn from constraints and data, to deal with possibly inconsistent\nconstraints, and to express preferences over the degree of satisfaction of the constraints.\nWe encode the constraint derived from the expert feedback as a binary random variable V in the\nBayes network. If V is true the constraint is satis\ufb01ed, otherwise it is violated. Thus, if V is true\nthen \u0398 lies in the positive region of Fig. 4, and if V is false then \u0398 lies in the negative region.\nWe model the CPT for V as Pr(V = true|\u0398) = max(0, \u03c0), where \u03c0 = GI(C|T2) \u2212 GI(C|T1), matching\nthe constraint GI(C|T2) \u2212 GI(C|T1) \u2265 0 of Sec. 4.1. Note that\nthe value of GI(C|T) lies in the interval [0,1], so the probability \u03c0 will always be normalized. The\nintuition behind this de\ufb01nition of the CPT for V is that a constraint is more likely to be satis\ufb01ed if\nthe parameters lie in the interior of the constraint region.\nWe place a Beta prior over each \u0398 parameter. Since the test variables are conditioned on the \u0398\nparameters that are now part of the network, their conditional distributions become known. For instance,\nthe conditional distribution for Ti (given in Eq. 3) is fully de\ufb01ned given the Noisy-Or parameters\n\u03b8^i_j. Hence the problem of learning the parameters becomes an inference problem to compute\nposteriors over the parameters given that the constraint is satis\ufb01ed (and any data). In practice, it is\nmore convenient to obtain a single value for the parameters instead of a posterior distribution since\nit is easier to make diagnostic predictions based on one Bayes network. 
We estimate the parameters\nby computing a maximum a posteriori (MAP) hypothesis given that the constraint is satis\ufb01ed (and\nany data): \u0398\u2217 = arg max_\u0398 Pr(\u0398|V = true).\n\nAlgorithm 1 Pseudo Code for Gibbs Sampling, Stochastic Hill Climbing and Greedy Search\n\n1  Fix observed variables, let V = true and randomly sample feasible starting state S\n2  for i = 1 to #samples\n3    for j = 1 to #hiddenVariables\n4      acceptSample = false; k = 0\n5      repeat\n6        Sample s\u2032 from conditional of jth hidden variable Sj\n7        S\u2032 = S; S\u2032j = s\u2032\n8        if Sj is cause or test, then acceptSample = true\n9        elseif S\u2032 obeys constraints V\u2217\n10         if algo == Gibbs\n11           Sample u from uniform distribution, U(0,1)\n12           if u < p(S\u2032) / (M q(S\u2032)), where p and q are the true and proposal distributions and M > 1\n13             acceptSample = true\n14         elseif algo == StochasticHillClimbing\n15           if likelihood(S\u2032) > likelihood(S), then acceptSample = true\n16         elseif algo == Greedy, then acceptSample = true\n17       elseif algo == Greedy\n18         k = k + 1\n19         if k == maxIterations, then s\u2032 = Sj; acceptSample = true\n20     until acceptSample == true\n21     Sj = s\u2032\n\n4.3 MAP ESTIMATION\n\nPrevious approaches for parameter learning with domain knowledge include modi\ufb01ed versions of\nEM or some other optimization techniques that account for linear/convex constraints on the parameters.\nSince our constraints are non-convex, we propose a new approach based on Gibbs sampling\nto approximate the posterior distribution, from which we compute the MAP estimate. Although\nthe technique converges to the MAP in the limit, it may require excessive time. Hence, we modify\nGibbs sampling to obtain more ef\ufb01cient stochastic hill climbing and greedy search algorithms with\nanytime properties.\nThe pseudo code for our Gibbs sampler is provided in Algorithm 1. 
The two key steps are sampling\nthe conditional distributions of each variable (line 6) and rejection sampling to ensure that the\nconstraints are satis\ufb01ed (lines 9 and 12). We sample each variable given the rest according to the\nfollowing distributions:\n\nt_i \u223c Pr(T_i|c, \u03b8^i)  \u2200i     (5)\n\nc_j \u223c Pr(C_j|c \u2212 c_j, t, \u03b8) \u221d \u220f_j Pr(C_j) \u220f_i Pr(t_i|c, \u03b8^i)  \u2200j     (6)\n\n\u03b8^i_j \u223c Pr(\u0398^i_j|\u0398 \u2212 \u0398^i_j, t, c, v) \u221d Pr(v|t, \u0398) \u220f_i Pr(t_i|c_j, \u03b8^i)  \u2200i, j     (7)\n\nThe tests and causes are easily sampled from the multinomials as described in the equations above.\nHowever, sampling the \u03b8\u2019s is more dif\ufb01cult due to the factor Pr(v|\u0398, t) = max(0, \u03c0), which is a\ntruncated mixture of Betas. So, instead of sampling \u03b8 from its true conditional, we sample it from\na proposal distribution that replaces max(0, \u03c0) by an un-truncated mixture of Betas equal to \u03c0 + a\nwhere a is a constant that ensures that \u03c0 + a is always positive. This is equivalent to ignoring the\nconstraints. Then we ensure that the constraints are satis\ufb01ed by rejecting the samples that violate the\nconstraints. Once Gibbs sampling has been performed, we obtain a sample that approximates the\nposterior distribution over the parameters given the constraints (and any data). We return a single\nsetting of the parameters by selecting the sampled instance with the highest posterior probability\n(i.e., MAP estimate). Since we will only return the MAP estimate, it is possible to speed up the\nsearch by modifying Gibbs sampling. In particular, we obtain a stochastic hill climbing algorithm\nby accepting a new sample only if its posterior probability improves upon that of the previous sample\n\nFigure 4: Difference in Gini impurity for the network in Fig. 
1 when \u03b8^1_2 and \u03b8^2_2 are the only parameters allowed to vary.\n\nFigure 5: Posterior over parameters computed through calculation after discretization.\n\nFigure 6: Posterior over parameters calculated through sampling.\n\n(line 15). Thus, each iteration of the stochastic hill climber requires more time, but always improves\nthe solution.\nAs the number of constraints grows and the feasibility region shrinks, the Gibbs sampler and stochastic\nhill climber will reject most samples. We can mitigate this by using a Greedy sampler that caps\nthe number of rejected samples, after which it abandons the sampling for the current variable to\nmove on to the next variable (line 19). Even though the feasibility region is small overall, it may still\nbe large in some dimensions, so it makes sense to try sampling another variable (that may have a\nlarger range of feasible values) when it is taking too long to \ufb01nd a new feasible value for the current\nvariable.\n\n4.4 MODEL REFINEMENT WITH INCONSISTENT CONSTRAINTS\n\nSo far, we have assumed that the expert\u2019s actions generate a feasible region as a consequence of\nconsistent constraints. We handle inconsistencies by further extending our augmented diagnostic\nBayes network. We treat the observed constraint variable, V, as a probabilistic indicator of the true\nconstraint V\u2217 as shown in Figure 3. We can easily extend our techniques for computing the MAP to\ncater for this new constraint node by sampling an extra variable.\n\n5 EVALUATION AND EXPERIMENTS\n\n5.1 EVALUATION CRITERIA\n\nFormally, for M\u2217, the true model that we aim to learn, the diagnostic process determines the choice\nof best next test as the one with the smallest Gini impurity. If the correct choice for the next test is\nknown (such as demonstrated by an expert), we can use this information to include a constraint on the\nmodel. We denote by V+ the set of observed constraints and by V\u2217 the set of all possible constraints\nthat hold for M\u2217. Having only observed V+, our technique will consider any M+ \u2208 \u2133+ as a\npossible true model, where \u2133+ is the set of all models that obey V+. We denote by \u2133\u2217 the set\nof all models that are diagnostically equivalent to M\u2217 (i.e., obey V\u2217 and would recommend the\nsame steps as M\u2217) and by M^MAP_{V+} the particular model obtained by MAP estimation based on the\nconstraints V+. Similarly, when a dataset D is available, we denote by M^MAP_D the model obtained\nby MAP estimation based on D and by M^MAP_{DV+}, the model based on D and V+.\nIdeally we would like to \ufb01nd the true underlying model M\u2217, hence we will report the KL divergence\nbetween the models found and M\u2217. However, other diagnostically equivalent models in \u2133\u2217 may recommend\nthe same tests as M\u2217 and thus have similar constraints, so we also report test consistency with M\u2217\n(i.e., # of recommended tests that are the same).\n\nFigure 7: Mean KL-divergence and one standard deviation for a 3 cause 3 test network on learning with data, constraints and data+constraints.\n\nFigure 8: Test Consistency for a 3 cause 3 test network on learning with data, constraints and data+constraints.\n\nFigure 9: Convergence rate comparison.\n\n5.2 CORRECTNESS OF MODEL REFINEMENT\n\nGiven V\u2217, our technique for model adjustment is guaranteed to choose a model M^MAP \u2208 \u2133\u2217 by\nconstruction. If any constraint in V\u2217 is violated, the rejection sampling step of our technique\nwould reject that set of parameters. 
To illustrate this, consider the network in Fig. 2. There are six\nparameters (four link reliabilities and two leak parameters). Let us \ufb01x the leak parameters and the\nlink reliability from the \ufb01rst cause to each test. Now we can compute the posterior surface over\nthe two variable parameters after discretizing each parameter in small steps and then calculating the\nposterior probability at each step as shown in Fig. 5. We can compare this surface with that obtained\nafter Gibbs sampling using our technique as shown in Fig. 6. We can see that our technique recovers\nthe posterior surface from which we can compute the MAP. We obtain the same MAP estimate with\nthe stochastic hill climbing and greedy search algorithms.\n\n5.3 EXPERIMENTAL RESULTS ON SYNTHETIC PROBLEMS\n\nWe start by presenting our results on a 3-cause by 3-test fully-connected bipartite Bayes network.\nWe assume that there exists some M\u2217 \u2208 \u2133\u2217 that we want to learn given V+. We use our technique\nto \ufb01nd M^MAP. To evaluate M^MAP, we \ufb01rst compute the constraints, V\u2217 for M\u2217 to get the feasible\nregion associated with the true model. Next, we sample 100 other models from this feasible region\nthat are diagnostically equivalent. We compare these models with M^MAP (after collecting 200\nsamples with non-informative priors for the parameters).\nWe compute the KL-divergence of M^MAP with respect to each sampled model. We expect KL-divergence\nto decrease as the number of constraints in V+ increases since the feasible region becomes\nsmaller. Figure 7 con\ufb01rms this trend and shows that M^MAP_{DV+} has lower mean KL-divergence\nthan M^MAP_{V+}, which has lower mean KL-divergence than M^MAP_D. The data points in D are limited\nto the results of the diagnostic sessions needed to obtain V+. 
As the number of constraints increases, more data is\navailable and so the results for the data-only approach also improve with increasing constraints.\nWe also compare the test consistency when learning from data only, constraints only or both. Given\na \ufb01xed number of constraints, we enumerate the unobserved trajectories, and then compute the\nhighest ranked test using the learned model and the sampled true models, for each trajectory. The test\nconsistency is reported as a percentage, with 100% consistency indicating that the learned and true\nmodels had the same highest ranked tests on every trajectory. Figure 8 presents these percentages\nfor the greedy sampling technique (the results are similar for the other techniques). It again appears\nthat learning parameters with both constraints and data is better than learning with only constraints,\nwhich is most of the time better than learning with only data.\nFigure 9 compares the convergence rate of each technique to \ufb01nd the MAP estimate. As expected,\nStochastic Hill Climbing and Greedy Sampling take less time than Gibbs sampling to \ufb01nd parameter\nsettings with high posterior probability.\n\n5.4 EXPERIMENTAL RESULTS ON REAL-WORLD PROBLEMS\n\nWe evaluate our technique on a real-world diagnostic network collected and reported by Agosta et\nal. 
[1], where the authors collected detailed session logs over a period of seven weeks in which the\nentire diagnostic sequence was recorded. The sequences intermingle model building and querying\nphases. The model network structure was inferred from an expert\u2019s sequence of positing causes\nand tests. Test-ranking constraints were deduced from the expert\u2019s test query sequences once the\nnetwork structure was established.\n\n[Figure 8 plot: x-axis, Number of constraints used; y-axis, Percentage of tests correctly predicted; legend: Data Only, Constraints Only, Data+Constraints.]\n\n[Figure 9 plot, titled Comparing convergence of Different Techniques: x-axis, Elapsed Time (plotted on log scale from 0 to 1500 seconds); y-axis, Negative Log Likelihood of MAP Estimate; legend: Gibbs Sampling, Stochastic Hill Climbing, Greedy Sampling.]\n\nFigure 10: Diagnostic Bayesian network collected from user trials and pruned to retain sub-networks with at least one constraint\n\nFigure 11: KL divergence comparison as the number of constraints increases for the real world problem.\n\nThe 157 sessions captured over the seven weeks resulted in a Bayes network with 115 tests, 82 root\ncauses and 188 arcs. The network consists of several disconnected sub-networks, each identi\ufb01ed\nwith a symptom represented by the \ufb01rst test in the sequence, and all subsequent tests applied within\nthe same subnet. There were 20 sessions from which we were able to observe trajectories with\nat least two tests, resulting in a total of 32 test constraints. 
We pruned our diagnostic network to\nremove the sub-networks with no constraints to get a Bayes network with 54 tests, 30 root causes,\nand 67 parameters divided into 7 sub-networks, as shown in Figure 10, on which we apply our model\nre\ufb01nement technique to learn the parameters for each sub-network separately.\nSince we don\u2019t have the true underlying network and the full set of constraints (more constraints\ncould be observed in future diagnostic sessions), we treated the 32 constraints as if they were V\u2217\nand the corresponding feasible region \u2133\u2217 as if it contained models diagnostically equivalent to\nthe unknown true model. Figure 11 reports the KL divergence between the models found by our\nalgorithms and sampled models from \u2133\u2217 as we increase the number of constraints. With such\nlimited constraints and consequently large feasible regions, it is not surprising that the variation in\nKL divergence is large. Again, the MAP estimate based on both the constraints and the data has\nlower KL divergence than constraints only and data only.\n\n6 CONCLUSION AND FUTURE WORK\n\nIn summary, we presented an approach that can learn the parameters of a Bayes network based on\nconstraints implied by test consistency and any data available. While several approaches exist to\nincorporate qualitative constraints in learning procedures, our work makes two important contributions:\nFirst, this is the \ufb01rst approach that exploits implicit constraints based on value of information\nassessments. Second, it is the \ufb01rst approach that can handle non-convex constraints. We demonstrated\nthe approach on synthetic data and on a real-world manufacturing diagnostic problem. Since\ndata is generally sparse in diagnostics, this work makes an important advance to mitigate the model\nacquisition bottleneck, which has prevented the widespread application of diagnostic networks so\nfar. 
In the future, it would be interesting to generalize this work to reinforcement learning in applications\nwhere data is sparse, but constraints may be inferred from expert interactions.\n\nAcknowledgments\n\nThis work was supported by a grant from Intel Corporation.\n\n[Figure 11 plot: x-axis, Number of constraints used; y-axis, KL-divergence when computing joint over all tests; legend: Data Only, Constraints Only, Data+Constraints.]\n\nReferences\n\n[1] John Mark Agosta, Omar Zia Khan, and Pascal Poupart. Evaluation results for a query-based\ndiagnostics application. In The Fifth European Workshop on Probabilistic Graphical Models\n(PGM 10), Helsinki, Finland, September 13\u201315 2010.\n\n[2] Eric E. Altendorf, Angelo C. Resti\ufb01car, and Thomas G. Dietterich. Learning from sparse\ndata by exploiting monotonicity constraints. In Proceedings of Twenty First Conference on\nUncertainty in Arti\ufb01cial Intelligence (UAI), Edinburgh, Scotland, July 2005.\n\n[3] Brigham S. Anderson and Andrew W. Moore. Fast information value for graphical models.\nIn Proceedings of Nineteenth Annual Conference on Neural Information Processing Systems\n(NIPS), pages 51\u201358, Vancouver, BC, Canada, December 2005.\n\n[4] Cassio P. de Campos and Qiang Ji. Improving Bayesian network parameter learning using\nconstraints. In International Conference in Pattern Recognition (ICPR), Tampa, FL, USA,\n2008.\n\n[5] Marek J. Druzdzel and Linda C. van der Gaag. Elicitation of probabilities for belief networks:\ncombining qualitative and quantitative information. In Proceedings of the Eleventh Annual\nConference on Uncertainty in Arti\ufb01cial Intelligence (UAI), pages 141\u2013148, Montreal, QC,\nCanada, 1995.\n\n[6] Ad J. Feelders. A new parameter learning method for Bayesian networks with qualitative\nin\ufb02uences. In Proceedings of Twenty Third International Conference on Uncertainty in Arti\ufb01cial\nIntelligence (UAI), Vancouver, BC, July 2007.\n\n[7] Mar\u00eda Angeles Gil and Pedro Gil. 
A procedure to test the suitability of a factor for strati\ufb01cation\nin estimating diversity. Applied Mathematics and Computation, 43(3):221\u2013229, 1991.\n\n[8] David Heckerman and John S. Breese. Causal independence for probability assessment and\ninference using Bayesian networks. IEEE Systems, Man, and Cybernetics, 26(6):826\u2013831,\nNovember 1996.\n\n[9] David Heckerman, John S. Breese, and Koos Rommelse. Decision-theoretic troubleshooting.\nCommunications of the ACM, 38(3):49\u201356, 1995.\n\n[10] Ronald A. Howard. Information value theory. IEEE Transactions on Systems Science and\nCybernetics, 2(1):22\u201326, August 1966.\n\n[11] Percy Liang, Michael I. Jordan, and Dan Klein. Learning from measurements in exponential\nfamilies. In Proceedings of Twenty Sixth Annual International Conference on Machine\nLearning (ICML), Montreal, QC, Canada, June 2009.\n\n[12] Wenhui Liao and Qiang Ji. Learning Bayesian network parameters under incomplete data with\ndomain knowledge. Pattern Recognition, 42:3046\u20133056, 2009.\n\n[13] Yi Mao and Guy Lebanon. Domain knowledge uncertainty and probabilistic parameter constraints.\nIn Proceedings of Twenty Fifth Conference on Uncertainty in Arti\ufb01cial Intelligence\n(UAI), Montreal, QC, Canada, 2009.\n\n[14] Ryszard S. Michalski. A theory and methodology of inductive learning. Arti\ufb01cial Intelligence,\n20:111\u2013116, 1984.\n\n[15] Radu Stefan Niculescu, Tom M. Mitchell, and R. Bharat Rao. Bayesian network learning with\nparameter constraints. Journal of Machine Learning Research, 7:1357\u20131383, 2006.\n\n[16] Mark A. Peot and Ross D. Shachter. Learning from what you don\u2019t observe. In Proceedings\nof the Fourteenth Conference on Uncertainty in Arti\ufb01cial Intelligence (UAI), pages 439\u2013446,\nMadison, WI, July 1998.\n\n[17] Michael P. Wellman. Fundamental concepts of qualitative probabilistic networks. 
Arti\ufb01cial Intelligence, 44(3):257\u2013303, August 1990.\n\n[18] Frank Wittig and Anthony Jameson. Exploiting qualitative knowledge in the learning of conditional\nprobabilities of Bayesian networks. In Proceedings of the Sixteenth Conference on\nUncertainty in Arti\ufb01cial Intelligence (UAI), San Francisco, CA, July 2000.\n", "award": [], "sourceid": 1402, "authors": [{"given_name": "Omar", "family_name": "Khan", "institution": null}, {"given_name": "Pascal", "family_name": "Poupart", "institution": null}, {"given_name": "John-mark", "family_name": "Agosta", "institution": null}]}