{"title": "Active Learning for Parameter Estimation in Bayesian Networks", "book": "Advances in Neural Information Processing Systems", "page_first": 647, "page_last": 653, "abstract": null, "full_text": "Active Learning for Parameter Estimation \n\nin Bayesian Networks \n\nSimon Tong \n\nComputer Science Department \n\nStanford University \n\nsimon. tong@cs.stanford.edu \n\nDaphne Koller \n\nComputer Science Department \n\nStanford University \n\nkoller@cs.stanford.edu \n\nAbstract \n\nBayesian networks are graphical representations of probability distributions. In virtually \nall of the work on learning these networks, the assumption is that we are presented with \na data set consisting of randomly generated instances from the underlying distribution. In \nmany situations, however, we also have the option of active learning, where we have the \npossibility of guiding the sampling process by querying for certain types of samples. This \npaper addresses the problem of estimating the parameters of Bayesian networks in an active \nlearning setting. We provide a theoretical framework for this problem, and an algorithm \nthat chooses which active learning queries to generate based on the model learned so far. \nWe present experimental results showing that our active learning algorithm can significantly \nreduce the need for training data in many situations. \n\n1 Introduction \nIn many machine learning applications, the most time-consuming and costly task is the \ncollection of a sufficiently large data set. Thus, it is important to find ways to minimize the \nnumber of instances required. One possible method for reducing the number of instances \nis to choose better instances from which to learn. Almost universally, the machine learning \nliterature assumes that we are given a set of instances chosen randomly from the underlying \ndistribution. 
In this paper, we assume that the learner has the ability to guide the instances it gets, selecting instances that are more likely to lead to more accurate models. This approach is called active learning. \n\nThe possibility of active learning arises naturally in a variety of domains, in several variants. In selective active learning, we can explicitly ask for an example of a certain \"type\"; i.e., we can ask for a full instance in which some of the attributes take on requested values. For example, if our domain involves webpages, the learner might be able to ask a human teacher for examples of homepages of graduate students in a Computer Science department. A variant of selective active learning is pool-based active learning, where the learner has access to a large pool of instances, about which it knows only the value of certain attributes. It can then ask for instances in this pool for which these known attributes take on certain values. For example, one could redesign the U.S. census to have everyone fill out only the short form; the active learner could then select among the respondents those that should fill out the more detailed long form. Another example is a cancer study in which we have a list of people's ages and whether they smoke, and we can ask a subset of these people to undergo a thorough examination. \n\nIn such active learning settings, we need a mechanism that tells us which instances to select. This problem has been explored in the context of supervised learning [1, 2, 7, 9]. In this paper, we consider its application in the unsupervised learning task of density estimation. We present a formal framework for active learning in Bayesian networks (BNs). We assume that the graphical structure of the BN is fixed, and focus on the task of parameter estimation. 
We define a notion of model accuracy, and provide an algorithm that selects queries greedily, designed to improve model accuracy as much as possible. At first sight, the applicability of active learning to density estimation is unclear. Given that we are not simply sampling, it is not even obvious that an active learning algorithm learns the correct density. In fact, we can show that our algorithm is consistent, i.e., it converges to the right density in the limit. Furthermore, it is not clear that active learning is necessarily beneficial in this setting: if we are trying to estimate a distribution, then random samples from that distribution would seem the best source. Surprisingly, we provide empirical evidence showing that, in a range of interesting circumstances, our approach learns from significantly fewer instances than random sampling. \n\n2 Learning Bayesian Networks \nLet X = {X1, ..., Xn} be a set of random variables, with each variable Xi taking values in some finite domain Dom[Xi]. A Bayesian network over X is a pair (G, θ) that represents a distribution over the joint space of X. G is a directed acyclic graph whose nodes correspond to the random variables in X and whose structure encodes conditional independence properties of the joint distribution. We use Ui to denote the set of parents of Xi. θ is a set of parameters which quantify the network by specifying the conditional probability distributions (CPDs) P(Xi | Ui). We assume that the CPD of each node consists of a separate multinomial distribution over Dom[Xi] for each instantiation u of the parents Ui. Hence, we have a parameter θ_{xij|u} for each value xij ∈ Dom[Xi]; we use θ_{Xi|u} to represent the vector of parameters associated with the multinomial P(Xi | u). \n\nOur focus is on the parameter estimation task: we are given the network structure G, and our goal is to use data to estimate the network parameters θ. 
We will use Bayesian parameter estimation, keeping a density over possible parameter values. As usual [5], we make the assumption of parameter independence, which allows us to represent the joint distribution p(θ) as a set of independent distributions, one for each multinomial θ_{Xi|u}. \n\nFor multinomials, the conjugate prior is a Dirichlet distribution [4], which is parameterized by hyperparameters αj ∈ R+, with α* = Σj αj. Intuitively, αj represents the number of \"imaginary samples\" observed prior to observing any data. In particular, if X is distributed multinomial with parameters θ = (θ1, ..., θr), and p(θ) is Dirichlet, then the probability that our next observation is xj is αj/α*. If we obtain a new instance X = xj sampled from this distribution, then our posterior distribution p(θ) is also Dirichlet, with hyperparameters (α1, ..., αj + 1, ..., αr). In a BN with the parameter independence assumption, we have a Dirichlet distribution for every multinomial distribution θ_{Xi|u}. Given a distribution p(θ), we use α_{xij|u} to denote the hyperparameter corresponding to the parameter θ_{xij|u}. \n\n3 Active Learning \nAssume we start out with a network structure G and a prior distribution p(θ) over the parameters of G. In a standard machine learning framework, data instances are independently, randomly sampled from some underlying distribution. In an active learning setting, we have the ability to request certain types of instances. We formalize this idea by assuming that some subset C of the variables are controllable. The learner can select a subset of variables Q ⊆ C and a particular instantiation q to Q. The request Q = q is called a query. The result of such a query is a randomly sampled instance x conditioned on Q = q. \n\nA (myopic) active learner is a querying function that takes G and p(θ), and selects a query Q = q. 
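As a concrete illustration of the conjugate update described above, the following sketch (ours, with illustrative values; not code from the paper) computes the predictive probability αj/α* and the posterior hyperparameters after one observation:

```python
# Illustrative sketch of the Dirichlet machinery described above.
# alpha[j] plays the role of the hyperparameter alpha_j of a single
# multinomial theta = (theta_1, ..., theta_r).

def predictive(alpha):
    # P(next observation = x_j) = alpha_j / alpha_*
    total = sum(alpha)
    return [a / total for a in alpha]

def dirichlet_update(alpha, j):
    # Posterior after observing X = x_j: increment alpha_j by one.
    posterior = list(alpha)
    posterior[j] += 1.0
    return posterior

alpha = [1.0, 2.0, 1.0]            # prior: 'imaginary samples'
print(predictive(alpha))           # [0.25, 0.5, 0.25]
print(dirichlet_update(alpha, 0))  # [2.0, 2.0, 1.0]
```

Under parameter independence, a BN simply maintains one such hyperparameter vector per node and parent instantiation, each updated independently.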
It takes the resulting instance x, and uses it to update its distribution p(θ) to obtain a posterior p′(θ). It then repeats the process, using p′ for p. We note that p(θ) summarizes all the relevant aspects of the data seen so far, so we do not need to maintain the history of previous instances. To fully specify the algorithm, we need to address two issues: we need to describe how our parameter distribution is updated given that x is not a random sample, and we need to construct a mechanism for selecting the next query based on p. \n\nTo address the first issue, assume for simplicity that our query is Q = q for a single node Q. First, it is clear that we cannot use the resulting instance x to update the parameters of the node Q itself. However, we also have a more subtle problem. Consider a parent U of Q. Although x does give us information about the distribution of U, it is not information that we can conveniently use. Intuitively, U is sampled from P(U | Q = q), which is specified by a complex formula involving multiple parameters. We avoid this problem simply by ignoring the information provided by x on nodes that are \"upstream\" of Q. More generally, we define a variable Y to be updateable in the context of a selective query Q if it is not in Q and not an ancestor of a node in Q. \n\nOur update rule is now very simple. Given a prior distribution p(θ) and an instance x from a query Q = q, we do standard Bayesian updating, as in the case of randomly sampled instances, but we update only the Dirichlet distributions of updateable nodes. We use p(θ ← Q = q, x) to denote the distribution p′(θ) obtained from this algorithm; this can be read as \"the density of θ after asking query q and obtaining the response x\". \n\nOur second task is to construct an algorithm for deciding on our next query given our current distribution p. 
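A minimal sketch of the updateable-node rule just described (the DAG encoding and node names are our own illustrative choices):

```python
# A node is updateable for query Q if it is neither in Q nor an
# ancestor of a node in Q; only updateable nodes get their Dirichlet
# hyperparameters incremented after a query.

def ancestors(parents, node):
    # All ancestors of node in a DAG given as {node: [parent, ...]}.
    seen = set()
    stack = list(parents[node])
    while stack:
        p = stack.pop()
        if p not in seen:
            seen.add(p)
            stack.extend(parents[p])
    return seen

def updateable(parents, query_nodes):
    blocked = set(query_nodes)
    for q in query_nodes:
        blocked |= ancestors(parents, q)
    return [n for n in parents if n not in blocked]

# Toy chain A -> B -> C: querying B leaves only C updateable.
parents = {'A': [], 'B': ['A'], 'C': ['B']}
print(updateable(parents, ['B']))  # ['C']
```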
The key step in our approach is the definition of a measure for the quality of our learned model. This allows us to evaluate the extent to which various instances would improve the quality of our model, thereby providing us with an approach for selecting the next query to perform. Our formulation is based on the framework of Bayesian point estimation. In the Bayesian learning framework, we maintain a distribution p(θ) over all of the model parameters. However, when we are asked to reason using the model, we typically \"collapse\" this distribution over parameters, generate a single representative model θ̂, and answer questions relative to that. If we choose to use θ̂, whereas the \"true\" model is θ*, we incur some loss Loss(θ̂ || θ*). Our goal is to minimize this loss. Of course, we do not have access to θ*. However, our posterior distribution p(θ) represents our \"optimal\" beliefs about the different possible values of θ*, given our prior knowledge and the evidence. Therefore, we can define the risk of a particular θ̂ with respect to p as: \n\nE_{θ~p(θ)}[Loss(θ || θ̂)] = ∫_Θ Loss(θ || θ̂) p(θ) dθ. (1) \n\nWe then define the Bayesian point estimate to be the value of θ̂ that minimizes the risk. We shall only be considering the Bayesian point estimate; thus we define the risk of a density p, Risk(p(θ)), to be the risk of the optimal θ̂ with respect to p. \n\nThe risk of our density p(θ) is our measure for the quality of our current state of knowledge, as represented by p(θ). In a greedy scheme, our goal is to obtain an instance x such that the risk of the p′ obtained by updating p with x is lowest. Of course, we do not know exactly which x we are going to get. We know only that it will be sampled from a distribution induced by our query. Our expected posterior risk is therefore: \n\nExPRisk(p(θ) | Q = q) = E_{θ~p(θ)} E_{x~P_θ(X|Q=q)} Risk(p(θ ← Q = q, x)). 
\n\n(2) \n\nThis definition leads immediately to the following simple algorithm: for each candidate query Q = q, we evaluate the expected posterior risk, and then select the query for which it is lowest. \n\n4 Active Learning Algorithm \nTo obtain a concrete algorithm from the active learning framework of the previous section, we must pick a loss function. There are many possible choices, but perhaps the best justified is the relative entropy or Kullback-Leibler divergence (KL-divergence) [3]: KL(θ || θ̂) = Σ_x P_θ(x) ln (P_θ(x) / P_θ̂(x)). The KL-divergence has several independent justifications, and a variety of properties that make it particularly suitable as a measure of distance between distributions. We therefore proceed using KL-divergence as our loss function. (An analogous analysis can be carried through for another very natural loss function, the negative log-likelihood of future data; in the case of multinomial CPDs with Dirichlet densities over the parameters this results in an identical final algorithm.) \n\nWe now want an efficient approach to computing the risk. Two properties of KL-divergence turn out to be crucial. The first is that the value θ̂ that minimizes the risk relative to p is the mean value of the parameters, E_{θ~p(θ)}[θ]. For a Bayesian network with independent Dirichlet distributions over the parameters, this expression reduces to θ̂_{xij|u} = α_{xij|u}/α_{*|u}, the standard (Bayesian) approach for collapsing a distribution over BN models into a single model. The second observation is that, for BNs, KL-divergence decomposes with the graphical structure of the network: \n\nKL(θ || θ′) = Σ_i KL(P_θ(Xi | Ui) || P_θ′(Xi | Ui)), (3) \n\nwhere KL(P(Xi | Ui) || P′(Xi | Ui)) is the conditional KL-divergence, given by Σ_u P(u) KL(P(Xi | u) || P′(Xi | u)). 
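The decomposition in Eq. (3) can be checked numerically on a toy two-node network U -> X; the parameter values below are arbitrary illustrative choices:

```python
import math

def kl(p, q):
    # KL-divergence between two discrete distributions.
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Two networks over U -> X with arbitrary illustrative parameters.
pU, qU = [0.3, 0.7], [0.4, 0.6]
pX = {0: [0.9, 0.1], 1: [0.2, 0.8]}   # P(X | u) in the first network
qX = {0: [0.8, 0.2], 1: [0.3, 0.7]}   # P(X | u) in the second

# Brute-force KL over the joint distribution of (U, X).
joint = sum(pU[u] * pX[u][x]
            * math.log((pU[u] * pX[u][x]) / (qU[u] * qX[u][x]))
            for u in (0, 1) for x in (0, 1))

# Eq. (3): one conditional-KL term per node.
decomposed = kl(pU, qU) + sum(pU[u] * kl(pX[u], qX[u]) for u in (0, 1))

print(abs(joint - decomposed) < 1e-12)  # True
```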
With these two facts, we can prove the following: \n\nTheorem 4.1 Let Γ(α) be the Gamma function, Ψ(α) be the digamma function Γ′(α)/Γ(α), and H be the entropy function. Define δ(α1, ..., αr) = Σ_{j=1}^r (αj/α*)(Ψ(αj + 1) − Ψ(α* + 1)) + H(α1/α*, ..., αr/α*). Then the risk decomposes as: \n\nRisk(p(θ)) = Σ_i Σ_{u ∈ Dom[Ui]} P̂(u) δ(α_{xi1|u}, ..., α_{xiri|u}). (4) \n\nEq. (4) gives us a concrete expression for evaluating the risk of p(θ). However, to evaluate a potential query, we also need its expected posterior risk. Recall that this is the expectation, over all possible answers to the query, of the risk of the posterior distribution p′. In other words, it is an average over an exponentially large set of possibilities. \n\nTo understand how we can evaluate this expression efficiently, we first consider a much simpler case. Consider a BN where we have only one child node X and its parents U, i.e., the only edges are from the nodes U to X. We also restrict attention to queries where we control all and only the parents U. In this case, a query q is an instantiation to U, and the possible outcomes of the query are the possible values of the variable X. \n\nThe expected posterior risk contains a term for each variable Xi and each instantiation of its parents. In particular, it contains a term for each of the parent variables U. However, as these variables are not updateable, their hyperparameters remain the same following any query q. Hence, their contribution to the risk is the same in every p(θ ← U = q, x), and in our prior p(θ). Thus, we can ignore the terms corresponding to the parents, and focus on the terms associated with the conditional distribution P(X | U). Hence, we have: \n\nRisk_X(p(θ)) = Σ_u P̂(u) δ(α_{x1|u}, ..., α_{xr|u}) (5) \n\nExPRisk_X(p(θ) | U = q) = Σ_u P̂(u) Σ_j P̂(xj | q) δ(α′_{x1|u}, ..., α′_{xr|u}), (6) \n\nwhere α′_{xj'|u} is the hyperparameter in p(θ ← U = q, xj). 
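Theorem 4.1's per-family term δ is straightforward to evaluate. In this sketch (ours), the digamma function Ψ is approximated by a standard recurrence plus asymptotic series rather than a library call:

```python
import math

def digamma(x):
    # Psi(x) via the recurrence Psi(x) = Psi(x + 1) - 1/x, then an
    # asymptotic series once the argument is large enough.
    result = 0.0
    while x < 6.0:
        result -= 1.0 / x
        x += 1.0
    inv2 = 1.0 / (x * x)
    return result + math.log(x) - 0.5 / x - inv2 * (1.0 / 12 - inv2 / 120)

def delta(alpha):
    # Risk contribution of one Dirichlet with hyperparameters alpha:
    # sum_j (alpha_j/alpha_*)(Psi(alpha_j+1) - Psi(alpha_*+1)) + H(mean).
    a_star = sum(alpha)
    mean = [a / a_star for a in alpha]
    expected = sum(m * (digamma(a + 1) - digamma(a_star + 1))
                   for m, a in zip(mean, alpha))
    entropy = -sum(m * math.log(m) for m in mean if m > 0)
    return expected + entropy

# A vague Dirichlet(1, 1) carries more risk than a sharp Dirichlet(100, 100).
print(delta([1.0, 1.0]) > delta([100.0, 100.0]) > 0.0)  # True
```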
Rather than evaluating the expected posterior risk directly, we will evaluate the reduction in risk obtained by asking a query U = q: \n\nΔ(X | q) = Risk(p(θ)) − ExPRisk(p(θ) | q) = Risk_X(p(θ)) − ExPRisk_X(p(θ) | q). \n\nOur first key observation relies on the fact that the variables U are not updateable for this query, so their hyperparameters do not change. Hence, P̂(u) and P̂′(u) are the same. The second observation is that the hyperparameters corresponding to an instantiation u are the same in p and p′ except for u = q. Hence, terms cancel and the expression simplifies to: P̂(q) (δ(α_{x1|q}, ..., α_{xr|q}) − Σ_j P̂(xj | q) δ(α′_{x1|q}, ..., α′_{xr|q})). By taking advantage of certain functional properties of Ψ, we finally obtain: \n\nΔ(X | q) = P̂(q) (H(α_{x1|q}/α_{*|q}, ..., α_{xr|q}/α_{*|q}) − Σ_j P̂(xj | q) H(α′_{x1|q}/α′_{*|q}, ..., α′_{xr|q}/α′_{*|q})) (7) \n\nIf we now select our query q so as to maximize the difference between our current risk and the expected posterior risk, we get a very natural behavior: we will select the query q that leads to the greatest reduction in the entropy of X given its parents. It is also here that we can gain an insight as to where active learning has an edge over random sampling. Consider a situation with two queries, where q1 is 100 times less likely than q2; q1 will lead us to update a parameter whose current density is Dirichlet(1, 1), whereas q2 will lead us to update a parameter whose current density is Dirichlet(100, 100). However, according to Δ, updating the former is worth more than the latter. In other words, if we are confident about commonly occurring situations, it is worth more to ask about the rare cases. \n\nWe now generalize this derivation to the case of an arbitrary BN and an arbitrary query. Here, our average over possible query answers encompasses exponentially many terms. 
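Eq. (7) needs only entropies of the current and expected posterior parameter means. The sketch below (ours) reproduces the Dirichlet(1, 1) versus Dirichlet(100, 100) comparison above:

```python
import math

def entropy(p):
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

def delta_x_given_q(alpha, p_hat_q):
    # Eq. (7): alpha holds the hyperparameters alpha_{xj|q} for the
    # family X | U = q, and p_hat_q is P-hat(q).
    a_star = sum(alpha)
    prior_mean = [a / a_star for a in alpha]
    exp_post_entropy = 0.0
    for j in range(len(alpha)):
        # Posterior mean if the answer to the query turns out to be x_j.
        post = [(a + (1.0 if i == j else 0.0)) / (a_star + 1.0)
                for i, a in enumerate(alpha)]
        exp_post_entropy += prior_mean[j] * entropy(post)
    return p_hat_q * (entropy(prior_mean) - exp_post_entropy)

# Even down-weighted by P-hat(q) = 0.01, the vague Dirichlet(1, 1) scores
# higher than the sharp Dirichlet(100, 100) at full weight.
print(delta_x_given_q([1.0, 1.0], 0.01)
      > delta_x_given_q([100.0, 100.0], 1.0) > 0.0)  # True
```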
Fortunately, we can utilize the structure of the BN to avoid an exhaustive enumeration. \n\nTheorem 4.2 For an arbitrary BN and an arbitrary query Q = q, the expected KL posterior risk decomposes as: \n\nExPRisk(p(θ) | Q = q) = Σ_i Σ_{u ∈ Dom[Ui]} P̂(u | Q = q) ExPRisk_{Xi}(p(θ) | Ui = u). \n\nIn other words, the expected posterior risk is a weighted sum of expected posterior risks for the conditional distributions of individual nodes Xi, where for each node we consider \"queries\" that are complete instantiations of the parents Ui of Xi. \n\nWe now have similar decompositions for the risk and the expected posterior risk. The obvious next step is to consider the difference between them, and then simplify it as we did for the case of a single variable. Unfortunately, in the case of general BNs, we can no longer exploit one of our main simplifying assumptions. Recall that, in the expression for the risk (Eq. (5)), the term involving Xi and u is weighted by P̂(u). In the expected posterior risk, the weight is P̂′(u). In the case of a single node and a full parent query, the hyperparameters of the parents could not change, so these two weights were necessarily the same. In the more general setting, an instantiation x can change hyperparameters all through the network, leading to different weights. \n\nHowever, we believe that a single data instance will not usually lead to a dramatic change in the distributions. Hence, these weights are likely to be quite close. To simplify the formula (and the associated computation), we therefore choose to approximate the posterior probability P̂′(u) using the prior probability P̂(u). Under this assumption, we can use the same simplification as we did in the single node case. 
Assuming that this approximation is a good one, we have: \n\nΔ(X | q) = Risk(p(θ)) − ExPRisk(p(θ) | q) ≈ Σ_i Σ_{u ∈ Dom[Ui]} P̂(u | q) Δ(Xi | u), (8) \n\nwhere Δ(Xi | u) is as defined in Eq. (7). Notice that we actually only need to sum over the updateable Xi's, since Δ(Xi | u) will be zero for all non-updateable Xi's. \n\nThe above analysis provides us with an efficient implementation of our general active learning scheme. We simply choose a set of variables in the Bayesian network that we wish to control, and for each instantiation of the controllable variables we compute the expected change in risk given by Eq. (8). We then ask the query with the greatest expected change and update the parameters of the updateable nodes. \n\nWe now consider the computational complexity of the algorithm. It turns out that, for each potential query, all of the desired quantities can be obtained via two inference passes using a standard join tree algorithm [6]. Thus, the run time complexity of the algorithm is O(|Q| · cost of BN join tree inference), where Q is the set of candidate queries. \n\nOur algorithm (approximately) finds the query that reduces the expected risk the most. We can show that our specific querying scheme (including the approximation) is consistent. As we mentioned before, this statement is non-trivial and depends heavily on the specific querying algorithm. \n\nFigure 1: (a) Alarm network with three controllable nodes. (b) Asia network with two controllable nodes. (c) Cancer network with one controllable node. Each plot shows KL-divergence against the number of queries; the axes are zoomed for resolution. \n\nTheorem 4.3 Let U be the set of nodes which are updateable for at least one candidate query at each querying step. 
Assuming that the underlying true distribution is not deterministic, our querying algorithm produces consistent estimates for the CPD parameters of every member of U. \n\n5 Experimental Results \nWe performed experiments on three commonly used networks: Alarm, Asia and Cancer. Alarm has 37 nodes and 518 independent parameters, Asia has eight nodes and 18 independent parameters, and Cancer has five nodes and 11 independent parameters. \n\nWe first needed to set the priors for each network. We use the standard approach [5] of eliciting a network and an equivalent sample size. In our experiments, we assumed that we had fairly good background knowledge of the domain. To simulate this, we obtained our prior by sampling a few hundred instances from the true network and using the counts (together with smoothing from a uniform prior) as our prior. This is akin to asking for a prior network from a domain expert, or using an existing set of complete data to find initial settings of the parameters. We then compared refining the parameters either by using active learning or by random sampling. We permitted the active learner to abstain from choosing a value for a controlled node if it did not wish to; that node is then sampled as usual. \n\nFigure 1 presents the results for the three networks. The graphs compare the KL-divergence between the learned networks and the true network that is generating the data. We see that active learning provides a substantial improvement in all three networks. The improvement in the Alarm network is particularly striking given that we had control of just three of the 37 nodes. The extent of the improvement depends on the extent to which queries allow us to reach rare events. For example, Smoking is one of the controllable variables in the Asia network. In the original network, P(Smoking) = 0.5. 
Although there was a significant gain from using active learning in this network, we found that there was a greater increase in performance if we altered the generating network to have P(Smoking) = 0.9; this is the graph that is shown. \n\nWe also experimented with specifying uniform priors with a small equivalent sample size. Here, we obtained significant benefit in the Asia network, and some marginal improvement in the other two. One possible reason is that the improvement is \"washed out\" by randomness, as the active learner and standard learner are learning from different instances. Another explanation is that the approximation in Eq. (8) may not hold as well when the prior p(θ) is uninformed and thereby easily perturbed even by a single instance. This indicates that our algorithm may perform best when refining an existing domain model. \n\nOverall, we found that in almost all situations active learning performed as well as or better than random sampling. The situations where active learning produced most benefit were, unsurprisingly, those in which the prior was confident and correct about the commonly occurring cases and uncertain and incorrect about the rare ones. Clearly, this is precisely the scenario we are most likely to encounter in practice when the prior is elicited from an expert. By experimenting with forcing different priors we found that active learning was worse in one type of situation: where the prior was confident yet incorrect about the commonly occurring cases and uncertain but actually correct about the rare ones. This type of scenario is unlikely to occur in practice. Another factor affecting the performance was the degree to which the controllable nodes could influence the updateable nodes. \n\n6 Discussion and Conclusions \nWe have presented a formal framework and resulting querying algorithm for parameter estimation in Bayesian networks. 
To our knowledge, this is one of the first applications of active learning in an unsupervised context. Our algorithm uses parameter distributions to guide the learner to ask queries that will improve the quality of its estimate the most. \n\nBN active learning can also be performed in a causal setting. A query now acts as an experiment: it intervenes in a model and forces variables to take particular values. Using Pearl's intervention theory [8], we can easily extend our analysis to deal with this case. The only difference is that the notion of an updateable node is even simpler: any node that is not part of a query is updateable. Regrettably, space prohibits a more complete exposition. \n\nWe have demonstrated that active learning can have significant advantages for the task of parameter estimation in BNs, particularly in the case where our parameter prior is of the type that a human expert is likely to provide. Intuitively, the benefit comes from estimating the parameters associated with rare events. Although it is less important to estimate the probabilities of rare events accurately, random sampling from the distribution still does not yield enough instances of them to do so. We note that this advantage arises even though we have used a loss function that considers only the accuracy of the distribution. In many practical settings such as medical or fault diagnosis, the rare cases are even more important, as they are often the ones that it is critical for the system to deal with correctly. \n\nA further direction that we are pursuing is active learning for the causal structure of a domain. In other words, we are presented with a domain whose causal structure we wish to understand, and we want to know the best sequence of experiments to perform. \n\nAcknowledgements The experiments were performed using the PHROG system, developed primarily by Lise Getoor, Uri Lerner, and Ben Taskar. 
Thanks to Carlos Guestrin and Andrew Ng for helpful discussions. The work was supported by DARPA's Information Assurance program under subcontract to SRI International, and by ARO grant DAAH04-96-1-0341 under the MURI program \"Integrated Approach to Intelligent Systems\". \n\nReferences \n[1] A.C. Atkinson and A.N. Donev. Optimal Experimental Designs. Oxford University Press, 1992. \n[2] D. Cohn, Z. Ghahramani, and M. Jordan. Active learning with statistical models. Journal of Artificial Intelligence Research, 4, 1996. \n[3] T.M. Cover and J.A. Thomas. Elements of Information Theory. Wiley, 1991. \n[4] M. H. DeGroot. Optimal Statistical Decisions. McGraw-Hill, New York, 1970. \n[5] D. Heckerman, D. Geiger, and D. M. Chickering. Learning Bayesian networks: The combination of knowledge and statistical data. Machine Learning, 20:197-243, 1995. \n[6] S. L. Lauritzen and D. J. Spiegelhalter. Local computations with probabilities on graphical structures and their application to expert systems. J. Royal Statistical Society, B 50(2), 1988. \n[7] D. MacKay. Information-based objective functions for active data selection. Neural Computation, 4:590-604, 1992. \n[8] J. Pearl. Causality: Models, Reasoning, and Inference. Cambridge University Press, 2000. \n[9] H.S. Seung, M. Opper, and H. Sompolinsky. Query by committee. In Proc. COLT, pages 287-294, 1992. \n", "award": [], "sourceid": 1795, "authors": [{"given_name": "Simon", "family_name": "Tong", "institution": null}, {"given_name": "Daphne", "family_name": "Koller", "institution": null}]}