{"title": "Breaking Boundaries Between Induction Time and Diagnosis Time Active Information Acquisition", "book": "Advances in Neural Information Processing Systems", "page_first": 898, "page_last": 906, "abstract": "There has been a clear distinction between induction or training time and diagnosis time active information acquisition. While active learning during induction focuses on acquiring data that promises to provide the best classification model, the goal at diagnosis time focuses completely on next features to observe about the test case at hand in order to make better predictions about the case. We introduce a model and inferential methods that breaks this distinction. The methods can be used to extend case libraries under a budget but, more fundamentally, provide a framework for guiding agents to collect data under scarce resources, focused by diagnostic challenges. This extension to active learning leads to a new class of policies for real-time diagnosis, where recommended information-gathering sequences include actions that simultaneously seek new data for the case at hand and for cases in the training set.", "full_text": "Breaking Boundaries: Active Information Acquisition\n\nAcross Learning and Diagnosis\n\nAshish Kapoor and Eric Horvitz\n\nMicrosoft Research\n1 Microsoft Way\n\nRedmond, WA 98052\n\nAbstract\n\nTo date, the processes employed for active information acquisition during periods\nof learning and diagnosis have been considered as separate and have been applied\nin distinct phases of analysis. While active learning centers on the collection of\ninformation about training cases in order to build better predictive models, diag-\nnosis uses \ufb01xed predictive models for guiding the collection of observations about\na speci\ufb01c test case at hand. 
We introduce a model and inferential methods that bridge these phases of analysis into a holistic approach to information acquisition that considers simultaneously the extension of the predictive model and the probing of a case at hand. The bridging of active learning and real-time diagnostic feature acquisition leads to a new class of policies for learning and diagnosis.\n\n1 Introduction\n\nConsider a real-world problem scenario where the challenge is to diagnose a patient who presents with several salient symptoms by performing inference with a probabilistic diagnostic model. The diagnostic model is trained from a database of patients, where training cases may have missing features. Assume we have at our discretion an evidential budget that enables us to acquire additional information so as to make a good diagnosis. Traditionally, such a budget has been spent solely on performing real-time observations about the case at hand, for example, by carrying out additional tests on a patient presenting to a physician with some previously identi\ufb01ed complaints, signs, and symptoms. However, there lies another opportunity for improving diagnostic models: allocating some or all of the evidential budget to extending some portion of the training database, and then learning an updated diagnostic model for use in inference about the case at hand. This broader perspective on diagnostic reasoning has real-world implications. For instance, investing efforts to observe features that are currently missing in training cases, such as missing details on presenting symptoms or on outcomes of prior patient cases, might preempt the need for carrying out a painful or risky medical test on the patient at hand. 
We focus on the promise of developing methods that jointly consider the informational value and the costs of acquiring information about both the case at hand and about cases in the training library, and that weigh the potential contributions of each of these potential sources of information during diagnosis.\nTo date, the process of diagnosis has focused on the use of a \ufb01xed predictive model, which in turn is used to generate recommendations for the observations to gather. Similarly, efforts in active learning have focused on gathering information about the training cases in order to build better predictive models. The active collection of the different types of missing information under a budget, spanning methods that have been referred to separately as learning and diagnosis, is graphically depicted in Figure 1. While diagnosis-time information acquisition methods focus on acquiring information about the test case at hand, induction-time methods focus on collecting information about training cases for learning a good predictive model. We shall describe methods that weave together these two perspectives on information acquisition that have been handled separately to date, yielding a holistic approach to evidence collection in the context of the larger learning and prediction system. The methodology applies to situations where there is a single diagnostic challenge, as well as broader conceptions of diagnosis over streams of cases over time.\n\n\fFigure 1: Illustration of induction-time and diagnosis-time active information acquisition. Induction-time active learning focuses on acquiring information for the pool of data used to train a diagnostic model; diagnosis-time information acquisition focuses on the next best observations to acquire from the test case at hand.\n\nWe take a decision-theoretic perspective on the joint consideration of observations about the case at hand and about options for extending the training set. 
We start by directly modeling how the training data might affect the outcome of the predictions about test cases at hand, thus relaxing the common assumption that a predictive model is \ufb01xed during diagnosis. Real-world diagnostic applications have made this assumption to date, often employing an information-theoretic or decision-theoretic criterion, such as value of information (VOI), during diagnosis to collect data about the case at hand. The holistic method can guide the acquisition of data for training cases that are missing arbitrary combinations of features and labels. The methodology extends active learning beyond the situation where training is done from a case library of completely speci\ufb01ed instances, where each case contains a complete set of observations. We shall show how the more holistic active-learning approach allows for a \ufb01ne-grained triaging of the information to acquire by deliberating in parallel about the value of acquiring missing information from cases either in the training or the test set.\n\n2 Related Research\n\nAs we mentioned, efforts to date on the use of active learning for training classi\ufb01cation models have largely focused on the task of acquiring labels, and assume that all of the features are observed in advance. Popular heuristics for selecting unlabeled data points include uncertainty in classi\ufb01cation [1, 2], reduction in version space for SVMs [13], expected informativeness [9], disagreement among a committee of classi\ufb01ers [3], and expected reduction in classi\ufb01cation error [10]. There has been limited work on methods for actively selecting missing features for instantiation. Lizotte et al. [8] tackle the problem of selecting features in a budgeted learning scenario. 
Speci\ufb01cally, they solve a problem that can be viewed as the inverse of traditional active learning; given class labels, they seek to determine the best features to compute for each instance such that a good predictive model can be trained under a budget. Even rarer are attempts to unify active acquisition of features with the acquisition of missing class labels. Research on this more general active learning includes work with graphical probabilistic models by Tong and Koller [14] and by Saar-Tsechansky et al. [11].\nSeveral methods have been used for guiding data acquisition at diagnosis time. The goal is to identify the best additional observations to acquire for making inferences and for ultimately taking actions given inferences about the class of a test case at hand [4, 5, 6, 7, 12]. The best tests and observations to make are computed with methods that compute or approximate the VOI. VOI for each potential new observation is computed by considering the probability distribution over the class of the case at the focus of attention, based on observations made so far, and the uncertainties expected after making each proposed observation. New evidence to collect is triaged by considering the expected utility of the best immediate actions versus the actions taken after the new observations, considering the costs of making each proposed observation. Thus, VOI balances the informational bene\ufb01ts and the observational costs of the new observations under uncertainty.\n\n3 Approach\n\nWe shall now describe a Bayesian model that smoothly combines induction-time and diagnosis-time information acquisition. 
The methods move beyond the task of parameter and structure estimation explored in the prior studies of active learning and directly model the statistical relationships amongst the data points.\nAssume that we are given a training corpus with n independent training instances Di = {(xi, ti)}. Here, xi are the d-dimensional features and their labels are denoted as ti. The training cases can be incomplete; not all of the labels and features in the training set D are observed. Hence, we represent Di = Do_i \u222a Dh_i, where Do_i and Dh_i represent the mutually exclusive subsets of observed and unobserved components, respectively, in the ith data instance.\nLet us consider a test data point x\u2217, where our task is to recover the label t\u2217 for the test case1. Similar to the training cases, we again assume that x\u2217 is not fully observed and that there are unobserved features. Given a budget for acquiring information, our goal is to determine the missing components either from the training set or among the missing features in the test case so that we make the best prediction on t\u2217.\nApproaches to active learning leverage the statistical relationships among sets of observations within cases and their class labels. The computation of expected value of information has been carried out with information-theoretic methods, such as procedures that seek to minimize entropy or maximize information gain. We compute such measures by directly modeling the conditional density of the test label t\u2217, given all that has been observed:\n\np(t\u2217|xo\u2217,Do) = p(t\u2217|xo\u2217,Do_1, .., Do_n)   (1)\n\nHere, xo\u2217 represents the observed components of the test case and we de\ufb01ne the set of all observed
We\nvariables in the training corpus as Do = {Do\nnote that the strategy of directly modeling the statistical dependencies among all of the training data\nand the test case is a departure from most existing classi\ufb01cation methods. Given a training corpus,\nmost methods try to \ufb01t a model or learn a classi\ufb01er that best explains the training data and use this\nlearned model to classify test cases. This two-phase approach introduces a separation in information\nacquisition for training and testing; consequently, active information acquisition is limited either\nto real-time diagnosis or to training-time active learning and does not fully allow modeling of the\njoint statistics for the training and the test data. Directly modeling the dependency of the test label\nt\u2217 on the training and the test data as described in Equation 1 allows us to reason about next best\ninformation to observe by considering how posterior distributions changes with the acquisition of\nmissing information. Assuming that we can compute predictive distributions as given in Equation\n1, the next section describes how we can utilize such models to actively seek information.\n\n1 , ..,Dh\n\n3.1 Decision-Theoretic Selective Sampling\n\nWe are interested in selectively sampling unobserved information, either about the training set or the\ntest case, in order to make a better prediction. If available budget allows for multiple observations,\nour the goal is to determine an optimal set of variables to observe. However, performing such non-\nmyopic analyses is prohibitively expensive for many active learning heuristics [7]. In practice, the\nselective sampling task is performed in a greedy manner. That is starting from an empty set, the\nalgorithm selects one element at a time according to the active learning criterion. We note that\nKrause et al. 
[7] provide a detailed analysis of myopic and non-myopic strategies, and describe situations where the losses of a greedy approach can be bounded. In this work, we adopt a greedy strategy.\nThe decision-theoretic selective sampling criterion we use estimates the value of acquiring information, which in turn can be used as a guiding principle in active learning. We can quantify such value in terms of information gain. Intuitively, knowing one more bit of information may tighten a probability distribution over the class of the test case. On the other hand, observations are acquired at a price. By considering this reduction in uncertainty along with the cost of obtaining such information, we can formulate a selective sampling criterion.\n\n1For simplicity, we limit our discussion to a single test point; the analysis described generalizes directly to considering a larger set of test points.\n\nLet us assume that we have a probabilistic model and appropriate inference procedures that would allow us to compute the conditional distribution of the test label t\u2217 given all the observed entities Do (Equation 1). Then, such computations can be used to determine the expected information gain. Expected information gain is formally de\ufb01ned as the expected reduction in uncertainty over t\u2217 as we observe more evidence. In order to balance the bene\ufb01t of observing a feature/label with the cost of its observation, we use expected return on information (ROI) as a selection criterion that aims to maximize information gain per unit cost:\n\nROI: \u02c6d = arg max_{d \u2208 Dh} [H(t\u2217|Do) \u2212 E_d[H(t\u2217|d \u222a Do)]] / C(d)   (2)\n\nHere, H(\u00b7) denotes the entropy and E_d[\u00b7] is the expectation with respect to the current model. Note that here d can either be a feature value or a label, and C(\u00b7) denotes the cost associated with observing information d. 
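To make the selection rule concrete, a single greedy step of the ROI criterion in Equation 2 can be sketched as follows. This is a minimal illustration, not the paper's implementation: the `predict` callback (returning the posterior over t\u2217 and marginals over the candidate bits) and all names are hypothetical stand-ins for the model's inference procedure.

```python
import math

def entropy(p1):
    """Binary entropy H(t*) in bits, given P(t* = 1) = p1."""
    if p1 <= 0.0 or p1 >= 1.0:
        return 0.0
    return -(p1 * math.log2(p1) + (1.0 - p1) * math.log2(1.0 - p1))

def greedy_roi_pick(candidates, predict, observed, cost):
    """One greedy step of the ROI criterion (Equation 2).

    candidates: unobserved variables d (training or test bits).
    predict(observed): hypothetical inference routine returning
        (P(t* = 1 | observed), {d: P(d = 1 | observed)}).
    cost(d): observation cost C(d).
    Returns (best candidate, its ROI score).
    """
    p_star, p_d = predict(observed)
    h_now = entropy(p_star)
    best, best_roi = None, -math.inf
    for d in candidates:
        # Expected posterior entropy after observing the binary bit d.
        exp_h = 0.0
        for value in (0, 1):
            p_value = p_d[d] if value == 1 else 1.0 - p_d[d]
            p_star_after, _ = predict({**observed, d: value})
            exp_h += p_value * entropy(p_star_after)
        roi = (h_now - exp_h) / cost(d)  # information gain per unit cost
        if roi > best_roi:
            best, best_roi = d, roi
    return best, best_roi
```

Under a budget, this step would be repeated, adding each chosen observation to `observed`, until the budget is exhausted.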
This strategy differs from the VOI criterion, which aims to minimize the total operational cost of the system. Unlike VOI, the proposed criterion does not require that the gain from selective sampling and the cost of observation be expressed in the same currency; consequently, ROI can be used more generally. Note that the proposed framework for active information acquisition easily extends to scenarios where the costs and the bene\ufb01ts of the system can be measured in a single currency and VOI can be applied. Also note that while the ROI formulation we introduce considers a single test case, similar computations can be done for a larger set of test points by considering the joint entropy over the test labels. With the introduction of assumptions of conditional independence that are not overly restrictive (described below), the joint formulation can be computed as the sum of the ROI evaluated for each of the test cases. We now describe how we can model the joint statistics among the training and the test cases simultaneously.\n\n3.2 Modeling Joint Dependencies\n\nLet us consider a probabilistic model to describe the joint dependencies among the features and the label of an instance. If we denote the parameters of the model with \u03bb, then, given the training data, the classical approach to learning the model would attempt to \ufb01nd a best value \u02c6\u03bb according to some optimization criterion. However, in our case we are interested in modeling the joint dependencies among all of the data (both training and testing). Consequently, in our analysis, we consider the model parameters \u03bb as a random variable over which we marginalize in order to generate a posterior predictive distribution. 
Formally, we rewrite Equation 1 as:\n\np(t\u2217|xo\u2217,Do) = \u222b_\u03bb p(t\u2217, \u03bb|xo\u2217,Do),   (3)\n\nwhere the Bayesian treatment of \u03bb allows us to marginalize over \u03bb and model direct statistical dependencies between the different data points; consequently, we can determine how different features and labels directly affect the test prediction. Note that considering the model parameters \u03bb as random variables is consistent with principles of Bayesian modeling and is similar in spirit to prior research, such as [9] and [15].\nIn order to compute the integral in Equation (3), we need to characterize p(t\u2217, \u03bb|xo\u2217,Do), which in turn de\ufb01nes a joint distribution over all of the data instances and the parameters \u03bb of the model. First, we consider individual data instances and model the joint distribution of the features and label of an instance as a Markov Random Field (MRF)2. Then, assuming conditional independence between data points3 given the model parameters, the joint distribution that includes all the instances and the parameters \u03bb can be written as:\n\np(D, \u03bb) \u221d p(\u03bb) \u220f_{i=1}^{n} (1/Z(\u03bb)) exp[\u03bb^T \u03c6(xi, ti)]\n\n2We limit ourselves to the case where both the labels and the features are binary (0 or 1).\n3The conditional independence assumption also allows us to compute ROI for a set of test cases by summing individual ROI values.\n\nHere, Z(\u03bb) is the partition function that normalizes the distribution, and \u03bb are the parameters of the model with a Gaussian prior p(\u03bb) \u223c N (0, \u039b). Also, \u03c6(x, t) = [t, tx1, .., txd, \u03c6(x)] is the appended feature set and is in correspondence with the underlying undirected graphical model. In theory, the features can be functions of all the individual features of x. 
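For intuition, the unnormalized model above can be evaluated by brute force when the number of features is tiny. The sketch below is our own illustration under the paper's binary assumption (function names are ours); it enumerates all states to compute the partition function, which is tractable only for toy dimensionalities:

```python
import itertools
import math

def phi(x, t, edges):
    """Appended feature vector [t, t*x_1, .., t*x_d, x_1, .., x_d, pairwise]."""
    return [t] + [t * xi for xi in x] + list(x) + [x[i] * x[j] for (i, j) in edges]

def log_potential(lam, x, t, edges):
    """Unnormalized log-probability lambda^T phi(x, t)."""
    return sum(w * f for w, f in zip(lam, phi(x, t, edges)))

def joint(lam, x, t, edges):
    """p(x, t | lambda) by exhaustive normalization (tiny d only)."""
    z = sum(math.exp(log_potential(lam, xs, ts, edges))
            for ts in (0, 1)
            for xs in itertools.product((0, 1), repeat=len(x)))
    return math.exp(log_potential(lam, x, t, edges)) / z
```

With all parameters at zero the model is uniform over the 2^(d+1) joint states, which gives a quick sanity check on the normalization.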
However, we restrict ourselves to a Boltzmann machine that has individual and pairwise features only and corresponds to an undirected graphical model GF = {VF, EF}, where each node in VF corresponds to an individual feature and the edges in EF between the nodes correspond to the pairwise features. A fully connected graph GF can represent an arbitrary distribution. However, the computational complexity of situations involving large numbers of features may require pruning of the graph to achieve tractability.\nUsing Bayes rule and the conditional independence assumption, Equation 3 reduces to:\n\np(t\u2217|xo\u2217,Do) = \u222b_\u03bb p(t\u2217|xo\u2217, \u03bb) \u00b7 p(\u03bb|Do)   (4)\n\nThe \ufb01rst term p(t\u2217|xo\u2217, \u03bb) inside the integral can be interpreted as the likelihood of t\u2217 given the observed components xo\u2217 of the test case and the parameter \u03bb. Similarly, p(\u03bb|Do) is the posterior distribution over the parameter \u03bb given all the observations in the training corpus. We review details of these computations below.\n\n3.3 Computational Challenges\n\nGiven the set of all observations Do, we \ufb01rst seek to infer the posterior distribution p(\u03bb|Do), which can be written as:\n\np(\u03bb|Do) \u221d p(\u03bb) \u220f_{i=1}^{n} \u222b_{Dh_i} p(Do_i, Dh_i|\u03bb)\n\nComputing the posterior is intractable as it is a product of the Gaussian prior with non-Gaussian data likelihood terms. In general, the problem of inferring model parameters in an undirected graphical model is a hard one. Welling and Parise [15] propose the Bethe-Laplace approximation to infer model parameters for a Markov Random Field. In a similar spirit, we employ a Laplace approximation that uses a Bethe or a tree-structured approximation, albeit with data that is partially observed. The idea behind the Laplace approximation is to \ufb01t a Gaussian at the mode \u02c6\u03bb of the exact posterior distribution, p(\u03bb|Do) \u2248 N (\u02c6\u03bb, \u03a3), where:\n\n\u03a3 = E_\u02c6\u03bb[\u03c6(x, t)\u03c6(x, t)^T] \u2212 E_\u02c6\u03bb[\u03c6(x, t)]E_\u02c6\u03bb[\u03c6(x, t)]^T\n\nHere, E_\u02c6\u03bb[\u00b7] denotes expectation with respect to p(x, t|\u02c6\u03bb). Note that it is non-trivial to \ufb01nd the mode \u02c6\u03bb as well as the covariance matrix \u03a3, as the underlying graphical structure is complex. While the covariance \u03a3 is approximated using the linear response algorithm [15], the mode \u02c6\u03bb is usually found by running a gradient descent procedure that minimizes the negative log of the posterior (L = \u2212 log p(\u03bb|Do)). The gradients of this objective can be succinctly written as:\n\n\u2207L = \u039b^{\u22121}\u03bb \u2212 \u2211_{i=1}^{n} [E_{\u03bb,Do_i}[\u03c6(x, t)] \u2212 E_\u03bb[\u03c6(x, t)]]   (5)\n\nHere, E_{\u03bb,Do_i}[\u00b7] is the expectation with respect to the distribution conditioned on the observed variables: p(x|\u03bb,Do_i). Note that computing the \ufb01rst expectation term is trivial for the fully observed case. However, partially observed cases require exact inference. Similarly, the computation of the second expectation term in the gradient requires exact inference. For fully connected graphs, exact inference is hard and we must rely on approximations.\nOne approach is to approximate GF by a tree, which we denote as GMI, that preserves an estimate of the mutual information among variables. Speci\ufb01cally, GMI is the maximal spanning tree of an undirected graphical model which has the same structure as the original graph and whose edges are weighted by the empirical mutual information.\nWe have the choice of either running loopy belief propagation (BP) for approximate inference on the full graph GF or doing exact inference on the tree approximation GMI. 
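The construction of GMI can be sketched with standard tools: empirical mutual information between binary columns as edge weights, and Kruskal's algorithm run in decreasing weight order. This is an illustrative sketch under the binary assumption (function names are our own), not the authors' code:

```python
import math

def empirical_mi(a, b):
    """Empirical mutual information (in nats) between two binary columns."""
    n = len(a)
    mi = 0.0
    for u in (0, 1):
        p_u = sum(1 for x in a if x == u) / n
        for v in (0, 1):
            p_v = sum(1 for y in b if y == v) / n
            p_uv = sum(1 for x, y in zip(a, b) if x == u and y == v) / n
            if p_uv > 0.0:
                mi += p_uv * math.log(p_uv / (p_u * p_v))
    return mi

def max_spanning_tree(n, weights):
    """Kruskal's algorithm with union-find; weights[(i, j)] = empirical MI.
    Returns the edge set of the maximal spanning tree G_MI."""
    parent = list(range(n))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    tree = []
    for (i, j), w in sorted(weights.items(), key=lambda kv: -kv[1]):
        ri, rj = find(i), find(j)
        if ri != rj:  # keep the heaviest edges that do not form a cycle
            parent[ri] = rj
            tree.append((i, j))
    return tree
```

Exact inference on the resulting tree is then linear in the number of nodes via belief propagation.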
As the features \u03c6(x, t) only consist of single and pairwise variables, belief propagation directly provides the required expectations over the features of the MRF. In our work, we observed better results when using loopy BP; however, it was much faster to run inference on the tree-structured graphs. Consequently, we used loopy BP to compute the posterior p(\u03bb|Do) given the training data. Also note that, given the Gaussian approximation to p(\u03bb|Do), the required predictive distribution p(t\u2217|x\u2217,Do) can be computed using sampling [15]. Finally, the ROI computations require that for each d \u2208 Dh, we infer p(t\u2217|d \u222a Do) for d = 0 and d = 1 and compute the expected conditional entropy. This repeated inference for all the missing bits in the data can be time consuming; thus, the tree-structured approximation was used to do all ROI computations and to determine the next bit of information to seek.\n\n4 Experiments and Results\n\nWe compare the proposed active information acquisition scheme, which does not distinguish between induction-time and diagnosis-time analyses, against other alternatives on a synthetic dataset and two real-world applications. Previewing our results, we \ufb01nd that the proposed scheme outperforms its competitors in terms of accuracy over the test points and provides a signi\ufb01cant boost at considerably less incurred cost. The signi\ufb01cant gains we obtained over approaches that limit themselves to considering induction-time or diagnosis-time information acquisition separately suggest that the holistic perspective can provide broader and more ef\ufb01cient options for acquiring information.\n\n4.1 Experiments with Synthetic Data\n\nWe \ufb01rst sought to evaluate the basic operation of the proposed framework with a synthetic training set of Boolean data generated by randomly sampling labels with a fair coin toss. 
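The synthetic set described next (seven fair-coin feature bits, plus seven bits formed by multiplying the label into each coin bit) can be generated with a short sketch; the function name and seeding below are our own choices, not from the paper:

```python
import random

def make_synthetic(n, seed=0):
    """Generate n Boolean instances: label t ~ fair coin; features
    x_1..x_7 ~ fair coins; x_8..x_14 = t * x_1, .., t * x_7."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        t = rng.randint(0, 1)
        noise = [rng.randint(0, 1) for _ in range(7)]
        data.append((noise + [t * b for b in noise], t))
    return data
```

Note that whenever all seven coin bits are zero, the informative bits are zero regardless of the label, which is the source of the irreducible error discussed below.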
The features of the data are 14 dimensional and consist of partially informative and partially random features. Out of the 14 features, seven are randomly generated using a fair coin toss, while the rest of the features are generated by multiplying the label with each of the seven randomly generated features individually. We note that, even with full observations and a perfect data model, for 0.78% of the cases the prediction cannot be better than random. This arises whenever all of the randomly generated bits are 0, which in turn blocks any information about the label from being observed. For the rest of the cases, perfect prediction is feasible with only seven features. We considered a dataset with 100 examples for experiments on this synthetic data. Further, we consider a 50-50 train and test split and assume that 25% of the total bits are unobserved; the target of the selective sampling procedure is to determine the best next observations to make so as to best predict the labels for the test cases.\nWe assume that the cost of observing a label in the training data is directly proportional to the number of features that can be computed for every data point (that is, c(d) = Dim). The features, drawn from either the training or the testing set, are much cheaper and have a unit cost of observation. We set the costs of observing labels of test cases to in\ufb01nity; consequently, the active learning methods never observe them.\nWe compared the joint selection (Diagnosis+Induction) advocated in this work with 1) diagnosis-time active information acquisition (Diagnosis), where information bits are sampled only from the test case at hand, and 2) induction-time active acquisition (Induction). In addition, we considered two different \ufb02avors of induction-time active acquisition where either only features or only labels were allowed to be sampled. 
We refer to these two \ufb02avors as Induction (features only) and Induction (labels only), respectively. In all of the cases, we used ROI for active learning as described in Section 3.1. Finally, we compare these methods with the baseline of a random sampling strategy.\nFigure 2 (left) shows the recognition results with increasing costs during active acquisition of information. We plot the overall classi\ufb01cation accuracy over the test set on the y-axis and the cost incurred on the x-axis. Each point on the graph signi\ufb01es an average recognition on the test set over 10 random training and test splits. From the \ufb01gure, we see that all sampling strategies show increases in accuracy as the cost increases, but Diagnosis+Induction has advantages over the other methods. First, Diagnosis+Induction obtains better recognition results for a \ufb01xed incurred cost, outperforming the diagnosis-time sampling strategy as well as all the \ufb02avors of induction-time information acquisition. Second, the Diagnosis+Induction sampling strategy levels off to the maximum performance fairly quickly when compared to the other methods. The performance of Diagnosis only and Random sampling is noticeably worse than the other alternatives. Also, we note that Induction (features only) stops abruptly for the synthetic case as most of the features in the learning problem are uninformative; after the initial rounds, the algorithm stops sampling.\n\n\fFigure 2: Comparison of various selection schemes on the Boolean, Path\ufb01nder, and Voting datasets (best viewed in color).\n\nIn summary, all of the active methods for information acquisition do better than random; however, the Diagnosis+Induction strategy achieves the best combination of recognition performance and cost ef\ufb01ciency.\nIn order to analyze the different sampling methods, we look at the sampling behavior of the different active learning mechanisms. 
Figure 3 (left) illustrates the statistics of the sampled information at the termination of the active learning procedure. The bars with different shades denote the sampling distribution amongst training labels, training features, and test features, generated by averaging over the 10 runs. While the Induction (features only), Induction (labels only), and Diagnosis strategies acquire only training features, training labels, and test-case features, respectively, the Diagnosis+Induction approach acquires information from different kinds of sources. We note that the random sampling strategy also samples from both labels and features; however, as indicated by Figure 2 (left), this strategy is not optimal as it does not take the cost structure into account. Diagnosis+Induction is the most \ufb02exible scheme and it aims to acquire information from all facets of the classi\ufb01cation problem by properly considering gains in predictive power and balancing them against the cost of information acquisition.\n\n4.2 Experiment on Path\ufb01nder Data\n\nThe availability of and access to large medical databases enables us to build better predictive models for various diagnostic purposes. While most efforts have focused on active data acquisition for diagnosis only [5], our framework promises a broader set of options to a diagnostician, who can reason about whether to perform additional tests on a patient or to seek more information about the training set.\nWe analyze one such scenario where the goal is to build a predictive model that would guide surgical pathologists who study the lymphatic system with the diagnosis of lymph-node diseases. This dataset consists of labels of \u201cbenign\u201d or \u201cmalignant\u201d assigned to lymph-node follicles from 48 subjects. The features signify sets of histological features, viewed at low and high power under the microscope, that an expert surgical pathologist believed could be informative to that label. 
The proposed holistic perspective on active learning supports the scenario where pathologists in pursuit of a diagnosis need to determine the next observations either from the test case at hand or consider querying for historical records in order to successfully label the lymph node (or, more generally, diagnose the disease). For this experiment, we consider random splits with 30 training examples and 18 test cases and again assume that 25% of the total bits are unobserved. The experimental protocol is the same as the one for the synthetic data: we report results averaged over 10 runs and use the test set to compare the recognition performance.\nThe results on the Path\ufb01nder data are shown in Figure 2 (middle). As before, the x-axis and y-axis denote the costs incurred and the overall classi\ufb01cation accuracy on the test data over 10 random training and test splits. Again we see that Diagnosis+Induction performs better than the other methods and attains high accuracy at a fairly low cost. However, one difference in this experiment is the fact that the Random sampling strategy outperforms active Diagnosis and active Induction (features only). This suggests that the labels in the training cases are highly informative when compared to the features. This in turn is re\ufb02ected by the similar performance of Diagnosis+Induction, Induction, and Induction (labels only) towards the end of the active learning run. Upon further analysis, we found that Diagnosis+Induction, Induction, and Induction (labels only) end up selecting similar training labels, consequently reaching similar performance towards the end. This further reinforces the validity of the hypothesis that the training labels are very informative. On analyzing the sampling behavior of the different methods (Figure 3 (middle)), we again \ufb01nd that the Diagnosis+Induction approach acquires information from different kinds of sources. 
However, we also note that the proportion of sampled training labels is remarkably small and very similar for both Diagnosis+Induction and Induction, hinting that there might be particular cases that are highly informative about the prediction task. In summary, Diagnosis+Induction again provides the best recognition rates at low costs, demonstrating the effectiveness of the uni\ufb01ed perspective on active learning.\n\n\fFigure 3: Statistics of different information selected in active learning.\n\n4.3 Experiments on Congressional Voting Records\n\nSurveys have been popular information-gathering tools; however, acquiring information by surveying can be costly and is often fraught with missing information. Intelligent information acquisition with active learning promises ef\ufb01cient use of limited resources. The holistic perspective on data acquisition can help avoid probing subjects with potentially risky or expensive questions by considering accessible information (for example, information such as demographics, age, etc.) or initially unavailable labels about past survey takers.\nWe analyze a similar survey task of determining the af\ufb01liation of subjects based on incomplete historical data. This data set includes the votes of each of the U.S. House of Representatives Congressmen on 16 key votes on United States policies. 
There are 435 data instances classified as Democrat versus Republican, where each of the 16 attributes represents a Yes or No on a vote. Further, of the 435 × 16 feature values, 392 are missing. The presence of missing features makes this a challenging active-learning problem. We consider 10 random splits with 100 training instances and 335 test cases and report results averaged over these splits.
Experimental results on the voting data are shown in Figure 2 (right). Each point on the graph signifies the average recognition accuracy on the test set over the 10 random training and test splits. Similar to the earlier experiments, we see improvements in recognition accuracy on the test set for the different sampling schemes. The performance of Diagnosis only, Induction (features only), and Random sampling is noticeably worse than that of the other alternatives. Diagnosis+Induction again shows superior performance, attaining high accuracy at a relatively low cost. Upon analyzing the statistics of the sampled information (Figure 3 (right)) at the termination of the active-learning procedure, we see that while Diagnosis+Induction acquires information from different kinds of sources, its sampling distribution is significantly different from that of the Random strategy, which is close to the true distribution of the available information bits. By considering information gain and the cost structure through return on investment (ROI), Diagnosis+Induction achieves the best combination of recognition performance and cost efficiency.

5 Conclusion

We introduced a scheme for active data acquisition that removes the separation between diagnosis-time and induction-time active information acquisition.
The task of diagnosis changes qualitatively with the use of methods that take a more holistic perspective on active learning, simultaneously considering information acquisition for extending a case library as well as for identifying the next best features to observe about the diagnostic challenge at hand. We ran several experiments that demonstrated the effectiveness of combining diagnosis-time and induction-time active learning. We are pursuing several related challenges and opportunities, including the analysis of approximate inference techniques and non-myopic extensions.