{"title": "Instance-Specific Bayesian Model Averaging for Classification", "book": "Advances in Neural Information Processing Systems", "page_first": 1449, "page_last": 1456, "abstract": null, "full_text": "Instance-Specific Bayesian Model \n\nAveraging for Classification \n\n \n \n \n \n \n \n\nShyam Visweswaran \n\nCenter for Biomedical Informatics \n\nIntelligent Systems Program \n\nPittsburgh, PA 15213 \nshyam@cbmi.pitt.edu \n\nGregory F. Cooper \n\nCenter for Biomedical Informatics \n\nIntelligent Systems Program \n\nPittsburgh, PA 15213 \n\ngfc@cbmi.pitt.edu \n\n \n\nAbstract \n\nClassification algorithms typically induce population-wide models \nthat are trained to perform well on average on expected future \ninstances. We introduce a Bayesian framework for learning \ninstance-specific models from data that are optimized to predict \nwell for a particular instance. Based on this framework, we present \na \nthat performs \nselective model averaging over a restricted class of Bayesian \nnetworks. On experimental evaluation, this algorithm shows \nsuperior performance over model selection. We intend to apply \nsuch instance-specific algorithms to improve the performance of \npatient-specific predictive models induced from medical data. \n\ninstance-specific algorithm called ISA \n\nlazy \n\n1 Introduction \n\nCommonly used classification algorithms, such as neural networks, decision trees, \nBayesian networks and support vector machines, typically induce a single model \nfrom a training set of instances, with the intent of applying it to all future instances. \nWe call such a model a population-wide model because it is intended to be applied \nto an entire population of future instances. A population-wide model is optimized to \npredict well on average when applied to expected future instances. In contrast, an \ninstance-specific model is one that is constructed specifically for a particular \ninstance. 
The structure and parameters of an instance-specific model are specialized to the particular features of an instance, so that it is optimized to predict especially well for that instance. \nUsually, methods that induce population-wide models employ eager learning, in which the model is induced from the training data before the test instance is encountered. In contrast, lazy learning defers most or all processing until a response to a test instance is required. Learners that induce instance-specific models are necessarily lazy in nature since they take advantage of the information in the test instance. An example of a lazy instance-specific method is the lazy Bayesian rule (LBR) learner, implemented by Zheng and Webb [1], which induces rules in a lazy fashion from examples in the neighborhood of the test instance. A rule generated by LBR consists of a conjunction of the attribute-value pairs present in the test instance as the antecedent and a local simple (naïve) Bayes classifier as the consequent. The structure of the local simple Bayes classifier consists of the attribute of interest as the parent of all other attributes that do not appear in the antecedent, and the parameters of the classifier are estimated from the subset of training instances that satisfy the antecedent. A greedy step-forward search selects the optimal LBR rule for a test instance to be classified. When evaluated on 29 UCI datasets, LBR had the lowest average error rate when compared to several eager learning methods [1]. \nTypically, both eager and lazy algorithms select a single model from some model space, ignoring the uncertainty in model selection. Bayesian model averaging is a coherent approach to dealing with the uncertainty in model selection, and it has been shown to improve the predictive performance of classifiers [2]. 
However, since the number of models in practically useful model spaces is enormous, exact model averaging over the entire model space is usually not feasible. In this paper, we describe a lazy instance-specific averaging (ISA) algorithm for classification that approximates Bayesian model averaging in an instance-sensitive manner. ISA extends LBR by adding Bayesian model averaging to an instance-specific model selection algorithm. \nWhile the ISA algorithm is currently able to directly handle only discrete variables and is computationally more intensive than comparable eager algorithms, the results in this paper show that it performs well. In medicine, such lazy instance-specific algorithms can be applied to patient-specific modeling for improving the accuracy of diagnosis, prognosis and risk assessment. \nThe rest of this paper is structured as follows. Section 2 introduces a Bayesian framework for instance-specific learning. Section 3 describes the implementation of ISA. In Section 4, we evaluate ISA and compare its performance to that of LBR. Finally, in Section 5 we discuss the results of the comparison. \n\n2 Decision Theoretic Framework \n\nWe use the following notation. Capital letters like X, Z denote random variables and corresponding lower case letters, x, z, denote specific values assigned to them. Thus, X = x denotes that variable X is assigned the value x. Bold upper case letters, such as X, Z, represent sets of variables or random vectors, and their realization is denoted by the corresponding bold lower case letters, x, z. Hence, X = x denotes that the variables in X have the states given by x. In addition, Z denotes the target variable being predicted, X denotes the set of attribute variables, M denotes a model, D denotes the training dataset, and (Xt, Zt) denotes a generic test instance that is not in D. \nWe now characterize population-wide and instance-specific model selection in decision theoretic terms. 
Given training data D and a separate generic test instance, the Bayes optimal prediction for Zt is obtained by combining the predictions of all models weighted by their posterior probabilities, as follows: \n\n   P(Zt | Xt, D) = ∫_M P(Zt | Xt, M) P(M | D) dM .   (1) \n\nThe optimal population-wide model for predicting Zt is as follows: \n\n   max_M { Σ_Xt U[ P(Zt | Xt, D), P(Zt | Xt, M) ] P(Xt | D) } ,   (2) \n\nwhere the function U gives the utility of approximating the Bayes optimal estimate P(Zt | Xt, D) with the estimate P(Zt | Xt, M) obtained from model M. The term P(Xt | D) is given by: \n\n   P(Xt | D) = ∫_M P(Xt | M) P(M | D) dM .   (3) \n\nThe optimal instance-specific model for predicting Zt is as follows: \n\n   max_M { U[ P(Zt | Xt = xt, D), P(Zt | Xt = xt, M) ] } ,   (4) \n\nwhere xt are the values of the attributes of the test instance Xt for which we want to predict Zt. The Bayes optimal estimate P(Zt | Xt = xt, D) in Equation 4 is derived using Equation 1, for the special case in which Xt = xt. \nThe difference between the population-wide and the instance-specific models can be noted by comparing Equations 2 and 4. Equation 2 for the population-wide model selects the model that on average will have the greatest utility. Equation 4 for the instance-specific model, however, selects the model that will have the greatest expected utility for the specific instance Xt = xt. For predicting Zt in a given instance Xt = xt, the model selected using Equation 2 can never have an expected utility greater than the model selected using Equation 4. 
This observation provides support for developing instance-specific models. \nEquations 2 and 4 represent theoretical ideals for population-wide and instance-specific model selection, respectively; we are not suggesting they are practical to compute. The current paper focuses on model averaging, rather than model selection. Ideal Bayesian model averaging is given by Equation 1. Model averaging has previously been applied using population-wide models. Studies have shown that approximate Bayesian model averaging using population-wide models can improve predictive performance over population-wide model selection [2]. The current paper concentrates on investigating the predictive performance of approximate Bayesian model averaging using instance-specific models. \n\n3 Instance-Specific Algorithm \n\nWe present the implementation of the lazy instance-specific algorithm based on the above framework. ISA searches the space of a restricted class of Bayesian networks to select a subset of the models over which to derive a weighted (averaged) posterior of the target variable Zt. A key characteristic of the search is the use of a heuristic to select models that will have a significant influence on the weighted posterior. We introduce Bayesian networks briefly and then describe ISA in detail. \n\n3.1 Bayesian Networks \n\nA Bayesian network is a probabilistic model that combines a graphical representation (the Bayesian network structure) with quantitative information (the parameters of the Bayesian network) to represent the joint probability distribution over a set of random variables [3]. Specifically, a Bayesian network M representing the set of variables X consists of a pair (G, ΘG). G is a directed acyclic graph that contains a node for every variable in X and an arc between every pair of nodes if the corresponding variables are directly probabilistically dependent. 
Conversely, the absence of an arc between a pair of nodes denotes probabilistic independence between the corresponding variables. ΘG represents the parameterization of the model. \nIn a Bayesian network M, the immediate predecessors of a node Xi in X are called the parents of Xi and the successors, both immediate and remote, of Xi in X are called the descendants of Xi. The immediate successors of Xi are called the children of Xi. For each node Xi there is a local probability distribution (that may be discrete or continuous) on that node given the state of its parents. The complete joint probability distribution over X, represented by the parameterization ΘG, can be factored into a product of local probability distributions defined on each node in the network. This factorization is determined by the independences captured by the structure of the Bayesian network and is formalized in the Bayesian network Markov condition: a node (representing a variable) is independent of its non-descendants given just its parents. According to this Markov condition, the joint probability distribution on model variables X = (X1, X2, …, Xn) can be factored as follows: \n\n   P(X1, X2, ..., Xn) = ∏_{i=1}^{n} P(Xi | parents(Xi)) ,   (5) \n\nwhere parents(Xi) denotes the set of nodes that are the parents of Xi. If Xi has no parents, then the set parents(Xi) is empty and P(Xi | parents(Xi)) is just P(Xi). \n\n3.2 ISA Models \n\nThe LBR models of Zheng and Webb [1] can be represented as members of a restricted class of Bayesian networks (see Figure 1). We use the same class of Bayesian networks for the ISA models, to facilitate comparison between the two algorithms. In Figure 1, all nodes represent attributes that are discrete. Each node in X has either an outgoing arc into the target node, Z, or receives an arc from Z. 
That is, each node is either a parent or a child of Z. Thus, X is partitioned into two sets: the first containing nodes (X1, …, Xj in Figure 1) each of which is a parent of Z and every node in the second set, and the second containing nodes (Xj+1, …, Xk in Figure 1) that have as parents the node Z and every node in the first set. The nodes in the first set are instantiated to the corresponding values in the test instance for which Zt is to be predicted. Thus, the first set of nodes represents the antecedent of the LBR rule and the second set of nodes represents the consequent. \n\n[Figure 1 appears here.] \n\nFigure 1: An example of a Bayesian network LBR model with target node Z and k attribute nodes of which X1, …, Xj are instantiated to values x1, …, xj in xt. X1, …, Xj are present in the antecedent of the LBR rule and Z, Xj+1, …, Xk (that form the local simple Bayes classifier) are present in the consequent. The indices need not be ordered as shown, but are presented in this example for convenience of exposition. \n\n3.3 Model Averaging \n\nFor Bayesian networks, Equation 1 can be evaluated as follows: \n\n   P(Zt | xt, D) = Σ_M P(Zt | xt, M) P(M | D) ,   (6) \n\nwith M being a Bayesian network comprised of structure G and parameters ΘG. The probability distribution of interest is a weighted average of the posterior distribution over all possible Bayesian networks, where the weight is the probability of the Bayesian network given the data. Since exhaustive enumeration of all possible models is not feasible, even for this class of simple Bayesian networks, we approximate exact model averaging with selective model averaging. Let R be the set of models selected by the search procedure from all possible models in the model space, as described in the next section. 
Then, with selective model averaging, P(Zt | xt, D) is estimated as: \n\n   P(Zt | xt, D) ≅ Σ_{M∈R} P(Zt | xt, M) P(M | D) / Σ_{M∈R} P(M | D) .   (7) \n\nAssuming uniform prior belief over all possible models, the model posterior P(M | D) in Equation 7 can be replaced by the marginal likelihood P(D | M), to obtain the following equation: \n\n   P(Zt | xt, D) ≅ Σ_{M∈R} P(Zt | xt, M) P(D | M) / Σ_{M∈R} P(D | M) .   (8) \n\nThe (unconditional) marginal likelihood P(D | M) in Equation 8 is a measure of the goodness of fit of the model to the data and is also known as the model score. While this score is suitable for assessing the model’s fit to the joint probability distribution, it is not necessarily appropriate for assessing the goodness of fit to a conditional probability distribution, which is the focus in prediction and classification tasks, as is the case here. A more suitable score in this situation is a conditional model score that is computed from training data D of d instances as: \n\n   score(D, M) = ∏_{p=1}^{d} P(zp | x1, ..., xp, z1, ..., zp-1, M) .   (9) \n\nThis score is computed in a predictive and sequential fashion: for the pth training instance the probability of predicting the observed value zp for the target variable is computed based on the values of all the variables in the preceding p-1 training instances and the values xp of the attributes in the pth instance. One limitation of this score is that its value depends on the ordering of the data. Despite this limitation, it has been shown to be an effective scoring criterion for classification models [4]. 
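To make the scoring concrete, the conditional score of Equation 9 can be sketched in code for the special case of a simple Bayes model over discrete attributes. This is an illustrative Python sketch, not the paper's implementation; all names are hypothetical, and a single symmetric prior alpha is assumed for the Dirichlet-style smoothing of Equation 10.

```python
from collections import defaultdict

def prequential_score(data, class_values, attr_values, alpha=1.0):
    # data: ordered sequence of (x, z) pairs; x is a tuple of discrete values.
    # class_values: possible values of Z; attr_values[i]: values of attribute i.
    nz = defaultdict(float)    # N(Z = c) over the instances seen so far
    nxz = defaultdict(float)   # N(X_i = v, Z = c) over the instances seen so far
    score = 1.0
    for x, z in data:
        # P(z | x, preceding instances) under simple Bayes, smoothed counts
        post = {}
        for c in class_values:
            p = (nz[c] + alpha) / (sum(nz[cc] for cc in class_values)
                                   + alpha * len(class_values))
            for i, v in enumerate(x):
                p *= (nxz[(i, v, c)] + alpha) / (nz[c] + alpha * len(attr_values[i]))
            post[c] = p
        score *= post[z] / sum(post.values())
        # absorb the p-th instance into the counts before scoring the next one
        nz[z] += 1
        for i, v in enumerate(x):
            nxz[(i, v, z)] += 1
    return score
```

Because each instance is scored before it is counted, the product depends on the ordering of D, exactly as noted above.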
\nThe parameters of the Bayesian network M, used in the above computations, are defined as follows: \n\n   P(Xi = k | parents(Xi) = j) ≡ θijk = (Nijk + αijk) / (Nij + αij) ,   (10) \n\nwhere (i) Nijk is the number of instances in the training dataset D where variable Xi has value k and the parents of Xi are in state j, (ii) Nij = Σ_k Nijk, (iii) αijk is a parameter prior that can be interpreted as the belief equivalent of having previously observed αijk instances in which variable Xi has value k and the parents of Xi are in state j, and (iv) αij = Σ_k αijk. \n\n3.4 Model Search \n\nWe use a two-phase best-first heuristic search to sample the model space. The first phase ignores the evidence xt in the test instance while searching for models that have high scores as given by Equation 9. This is followed by the second phase that searches for models having the greatest impact on the prediction of Zt for the test instance, which we formalize below. \nThe first phase searches for models that predict Z in the training data very well; these are the models that have high conditional model scores. The initial model is the simple Bayes network that includes all the attributes in X as children of Z. A succeeding model is derived from a current model by reversing the arc of a child node in the current model, adding new outgoing arcs from it to Z and the remaining children, and instantiating this node to the value in the test instance. This process is performed for each child in the current model. An incoming arc of a child node is considered for reversal only if the node’s value is not missing in the test instance. The newly derived models are added to a priority queue, Q. 
During each iteration of the search, the model with the highest score (given by Equation 9) is removed from Q and placed in a set R, following which new models are generated as described just above, scored and added to Q. The first phase terminates after a user-specified number of models have accumulated in R. \nThe second phase searches for models that change the current model-averaged estimate of P(Zt | xt, D) the most. The idea here is to find viable competing models for making this posterior probability prediction. When no competitive models can be found, the prediction becomes stable. During each iteration of the search, the highest ranked model M* is removed from Q and added to R. The ranking is based on how much the model changes the current estimate of P(Zt | xt, D). More change is better. In particular, M* is the model in Q that maximizes the following function: \n\n   f(R, M*) = | g(R ∪ {M*}) − g(R) | ,   (11) \n\nwhere for a set of models S, the function g(S) computes the approximate model averaged prediction for Zt, as follows: \n\n   g(S) = Σ_{M∈S} P(Zt | xt, M) score(D, M) / Σ_{M∈S} score(D, M) .   (12) \n\nThe second phase terminates when no new model can be found that has a value (as given by Equation 11) that is greater than a user-specified minimum threshold T. The final distribution of Zt is then computed from the models in R using Equation 8. \n\n4 Evaluation \n\nWe evaluated ISA on the 29 UCI datasets that Zheng and Webb used for the evaluation of LBR. On the same datasets, we also evaluated a simple Bayes classifier (SB) and LBR. For SB and LBR, we used the Weka implementations (Weka v3.3.6, http://www.cs.waikato.ac.nz/ml/weka/) with default settings [5]. We implemented the ISA algorithm as a standalone application in Java. 
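As an illustration of the averaging and ranking steps of Equations 11 and 12, the core computation can be sketched as follows. This is a Python sketch, not the Java implementation; the model objects and their attributes are hypothetical stand-ins for whatever the search produces, and Equation 11 is reconstructed here as the largest absolute change over the target values.

```python
def averaged_posterior(models, target_values):
    # Equation 12: score-weighted average of the per-model posteriors.
    # Each model is assumed to carry .posterior (a dict over target values,
    # already conditioned on xt) and .score (its Equation 9 score on D).
    total = sum(m.score for m in models)
    return {z: sum(m.posterior.get(z, 0.0) * m.score for m in models) / total
            for z in target_values}

def phase2_change(R, m_star, target_values):
    # Equation 11: how much adding M* changes the model-averaged estimate,
    # taken here as the largest absolute change over the target values.
    g_r = averaged_posterior(R, target_values)
    g_new = averaged_posterior(R + [m_star], target_values)
    return max(abs(g_new[z] - g_r[z]) for z in target_values)
```

In phase 2, the candidate in Q with the largest such change would be moved into R, and the search would stop once no candidate changes the estimate by more than the threshold T.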
The following settings were used for ISA: a maximum of 100 phase-1 models, a threshold T of 0.001 in phase-2, and an upper limit of 500 models in R. For the parameter priors in Equation 10, all αijk were set to 1. \nAll error rates were obtained by averaging the results from two stratified 10-fold cross-validations (20 trials total), similar to the procedure used by Zheng and Webb. Since both LBR and ISA can handle only discrete attributes, all numeric attributes were discretized in a pre-processing step using the entropy-based discretization method described in [6]. For each pair of training and test folds, the discretization intervals were first estimated from the training fold and then applied to both folds. The error rates of two algorithms on a dataset were compared with a paired t-test carried out at the 5% significance level on the error rate statistics obtained from the 20 trials. \nThe results are shown in Table 1. Compared to SB, ISA has significantly fewer errors on 9 datasets and significantly more errors on one dataset. Compared to LBR, ISA has significantly fewer errors on 7 datasets and significantly more errors on two datasets. On two datasets, chess and tic-tac-toe, ISA shows considerable improvement in performance over both SB and LBR. \n\nTable 1: Percent error rates of simple Bayes (SB), Lazy Bayesian Rule (LBR) and Instance-Specific Averaging (ISA). A - indicates that the ISA error rate is statistically significantly lower than the marked SB or LBR error rate. A + indicates that the ISA error rate is statistically significantly higher. \n\nDataset | Size | No. of classes | Num. Attrib. | Nom. Attrib. | ISA | SB | LBR \nAnnealing | 898 | 6 | 6 | 32 | 1.9 | 3.5 - | 2.7 - \nAudiology | 226 | 24 | 0 | 69 | 30.9 | 29.6 | 29.4 \nBreast (W) | 699 | 2 | 9 | 0 | 3.7 | 2.9 + | 2.8 + \nChess (KR-KP) | 3169 | 2 | 0 | 36 | 1.1 | 12.1 - | 3.0 - \nCredit (A) | 690 | 2 | 6 | 9 | 13.9 | 13.8 | 14.0 \nEchocardiogram | 131 | 2 | 6 | 1 | 35.9 | 33.2 | 34.0 \nGlass | 214 | 6 | 9 | 0 | 29.0 | 26.9 | 27.8 \nHeart (C) | 303 | 2 | 13 | 0 | 16.2 | 17.5 | 16.2 \nHepatitis | 155 | 2 | 6 | 13 | 11.3 | 14.2 - | 14.2 - \nHorse colic | 368 | 2 | 7 | 15 | 17.8 | 20.2 | 16.0 \nHouse votes 84 | 435 | 2 | 0 | 16 | 5.1 | 10.1 - | 7.0 - \nHypothyroid | 3163 | 2 | 7 | 18 | 0.9 | 0.9 | 1.4 - \nIris | 150 | 3 | 4 | 0 | 6.0 | 5.3 | 6.0 \nLabor | 57 | 2 | 8 | 8 | 7.0 | 8.8 | 6.1 \nLED 24 | 200 | 10 | 0 | 24 | 40.3 | 40.5 | 40.5 \nLiver disorders | 345 | 2 | 6 | 0 | 36.8 | 36.8 | 36.8 \nLung cancer | 32 | 3 | 0 | 56 | 56.3 | 56.3 | 56.3 \nLymphography | 148 | 4 | 0 | 18 | 13.2 | 15.5 - | 15.5 - \nPima | 768 | 2 | 8 | 0 | 22.3 | 21.8 | 22.0 \nPostoperative | 90 | 3 | 1 | 7 | 33.3 | 33.3 | 33.3 \nPrimary tumor | 339 | 22 | 0 | 17 | 54.2 | 54.4 | 53.5 \nPromoters | 106 | 2 | 0 | 57 | 7.5 | 7.5 | 7.5 \nSolar flare | 1389 | 2 | 0 | 10 | 20.2 | 18.3 + | 19.4 \nSonar | 208 | 2 | 60 | 0 | 15.9 | 15.6 | 15.4 \nSoybean | 683 | 19 | 0 | 35 | 7.2 | 7.1 | 7.9 - \nSplice junction | 3177 | 3 | 0 | 60 | 4.4 | 4.3 | 4.7 \nTic-Tac-Toe | 958 | 2 | 0 | 9 | 10.3 | 13.7 - | 30.3 - \nWine | 178 | 3 | 13 | 0 | 1.1 | 1.1 | 1.1 \nZoo | 101 | 7 | 0 | 16 | 6.4 | 8.4 - | 8.4 - \n\nWith respect to computation times, ISA took 6 times longer to run than LBR on average for a single test instance on a desktop computer with a 2 GHz Pentium 4 processor and 3 GB of RAM. 
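The significance testing used above can be sketched as follows — a minimal Python illustration of a paired t-test over the 20 per-trial error rates of two algorithms, not the actual evaluation harness; the input lists are hypothetical.

```python
import math
from statistics import mean, stdev

def paired_t(errors_a, errors_b):
    # Paired t statistic over per-trial error rates of two algorithms;
    # with 20 trials (two stratified 10-fold cross-validations), df = 19.
    d = [a - b for a, b in zip(errors_a, errors_b)]
    return mean(d) / (stdev(d) / math.sqrt(len(d)))

# A difference is significant at the 5% level (two-tailed, df = 19)
# when |t| exceeds the critical value of about 2.093.
```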
\n\n5 Conclusions and Future Research \n\nWe have introduced a Bayesian framework for instance-specific model averaging and presented ISA as one example of a classification algorithm based on this framework. An instance-specific algorithm like LBR that does model selection has been shown by Zheng and Webb to perform classification better than several eager algorithms [1]. Our results show that ISA, which extends LBR by adding Bayesian model averaging, improves overall on LBR; this supports the claim that additional prediction improvement can be obtained by performing instance-specific model averaging rather than just instance-specific model selection. \nIn future work, we plan to explore further the behavior of ISA with respect to the number of models being averaged and the effect of the number of models selected in each of the two phases of the search. We will also investigate methods to improve the computational efficiency of ISA. In addition, we plan to examine other heuristics for model search as well as more general model spaces such as unrestricted Bayesian networks. \nThe instance-specific framework is not restricted to the Bayesian network models that we have used in this investigation. In the future, we plan to explore other models using this framework. Our ultimate interest is to apply these instance-specific algorithms to improve patient-specific predictions (for diagnosis, therapy selection, and prognosis) and thereby to improve patient care. \n\nAcknowledgments \n\nThis work was supported by the grant T15-LM/DE07059 from the National Library of Medicine (NLM) to the University of Pittsburgh’s Biomedical Informatics Training Program. We would like to thank the three anonymous reviewers for their helpful comments. \n\nReferences \n\n[1] Zheng, Z. and Webb, G.I. (2000). Lazy Learning of Bayesian Rules. Machine Learning, 41(1):53-84. \n[2] Hoeting, J.A., Madigan, D., Raftery, A.E. and Volinsky, C.T. (1999). 
Bayesian Model Averaging: A Tutorial. Statistical Science, 14:382-417. \n[3] Pearl, J. (1988). Probabilistic Reasoning in Intelligent Systems. Morgan Kaufmann, San Mateo, CA. \n[4] Kontkanen, P., Myllymaki, P., Silander, T., and Tirri, H. (1999). On Supervised Selection of Bayesian Networks. In Proceedings of the 15th International Conference on Uncertainty in Artificial Intelligence, pages 334-342, Stockholm, Sweden. Morgan Kaufmann. \n[5] Witten, I.H. and Frank, E. (2000). Data Mining: Practical Machine Learning Tools with Java Implementations. Morgan Kaufmann, San Francisco, CA. \n[6] Fayyad, U.M., and Irani, K.B. (1993). Multi-Interval Discretization of Continuous-Valued Attributes for Classification Learning. In Proceedings of the Thirteenth International Joint Conference on Artificial Intelligence, pages 1022-1027, San Mateo, CA. Morgan Kaufmann. \n", "award": [], "sourceid": 2565, "authors": [{"given_name": "Shyam", "family_name": "Visweswaran", "institution": null}, {"given_name": "Gregory", "family_name": "Cooper", "institution": null}]}