{"title": "Optimal Decision Tree with Noisy Outcomes", "book": "Advances in Neural Information Processing Systems", "page_first": 3303, "page_last": 3313, "abstract": "A fundamental task in active learning involves performing a sequence of tests to identify an unknown hypothesis that is drawn from a known distribution. This problem, known as optimal decision tree induction, has been widely studied for decades and the asymptotically best-possible approximation algorithm has been devised for it. We study a generalization where certain test outcomes are noisy, even in the more general case when the noise is persistent, i.e., repeating the test on the scenario gives the same noisy output, disallowing simple repetition as a way to gain confidence. \nWe design new approximation algorithms for both the non-adaptive setting, where the test sequence must be fixed a-priori, and the adaptive setting where the test sequence depends on the outcomes of prior tests. \nPrevious work in the area assumed at most a constant number of noisy outcomes per test and per scenario and provided approximation ratios that were problem dependent (such as the minimum probability of a hypothesis). Our new approximation algorithms provide guarantees that are nearly best-possible and work for the general case of a large number of noisy outcomes per test or per hypothesis where the performance degrades smoothly with this number. \nOur results adapt and generalize methods used for submodular ranking and stochastic set cover. \nWe evaluate the performance of our algorithms on two natural applications with noise: toxic chemical identification and active learning of linear classifiers. 
Despite our logarithmic theoretical approximation guarantees, our methods give solutions with cost very close to the information theoretic minimum, demonstrating the effectiveness of our methods.", "full_text": "Optimal Decision Tree with Noisy Outcomes\n\nSu Jia\u2217\n\nCarnegie Mellon University\nsjia1@andrew.cmu.edu\n\nViswanath Nagarajan\nUniversity of Michigan\n\nviswa@umich.edu\n\nFatemeh Navidi\u2020\n\nUniversity of Michigan\nnavidi@umich.edu\n\nR. Ravi\n\nCarnegie Mellon University\nravi@andrew.cmu.edu\n\nAbstract\n\nA fundamental task in active learning involves performing a sequence of tests to\nidentify an unknown hypothesis that is drawn from a known distribution. This\nproblem, known as optimal decision tree induction, has been widely studied for\ndecades and the asymptotically best-possible approximation algorithm has been\ndevised for it. We study a generalization where certain test outcomes are noisy,\neven in the more general case when the noise is persistent, i.e., repeating a test gives\nthe same noisy output, disallowing simple repetition as a way to gain con\ufb01dence.\nWe design new approximation algorithms for both the non-adaptive setting, where\nthe test sequence must be \ufb01xed a-priori, and the adaptive setting where the test\nsequence depends on the outcomes of prior tests. Previous work in the area\nassumed at most a logarithmic number of noisy outcomes per hypothesis and\nprovided approximation ratios that depended on parameters such as the minimum\nprobability of a hypothesis. Our new approximation algorithms provide guarantees\nthat are nearly best-possible and work for the general case of a large number of\nnoisy outcomes per test or per hypothesis where the performance degrades smoothly\nwith this number. Our results adapt and generalize methods used for submodular\nranking and stochastic set cover. 
We evaluate the performance of our algorithms\non two natural applications with noise: toxic chemical identi\ufb01cation and active\nlearning of linear classi\ufb01ers. Despite our theoretical logarithmic approximation\nguarantees, our methods give solutions with cost very close to the information\ntheoretic minimum, demonstrating the effectiveness of our methods.\n\n1\n\nIntroduction\n\nThe classic optimal decision tree (ODT) problem involves identifying an initially unknown hypothesis\n\u00afx that is drawn from a known probability distribution over a set of m possible hypotheses. We can\nperform tests in order to distinguish between these hypotheses. Each test produces a binary outcome\n(positive or negative) and the precise outcome of each hypothesis-test pair is known beforehand. 3 So\nan instance of ODT can be viewed as a \u00b11 matrix with the hypotheses as rows and tests as columns.\nThe goal is to identify hypothesis \u00afx using the minimum number of tests in expectation.\nAs a motivating application, consider the following task in medical diagnosis [25]. A doctor needs\nto diagnose a patient\u2019s disease by performing tests. Given an a priori probability distribution over\npossible diseases, what sequence of tests should the doctor perform? Another application is in active\n\n\u2217Su Jia and Fatemeh Navidi contributed equally to this work.\n\u2020Research of F. Navidi and V. Nagarajan partly supported by NSF grant CCF-1750127.\n3We consider binary test outcomes only for simplicity: our results also hold for \ufb01nitely many outcomes.\n\n33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.\n\n\flearning [10]. Given a set of data points, one wants to learn a classi\ufb01er that labels the points correctly\nas positive and negative. There is a set of m possible classi\ufb01ers which is assumed to contain the true\nclassi\ufb01er. 
In the Bayesian setting, which we consider, the true classi\ufb01er is drawn from some known\nprobability distribution. The goal is to identify the true classi\ufb01er by querying labels at the minimum\nnumber of points (in expectation). Other applications include entity identi\ufb01cation in databases [6]\nand experimental design to choose the most accurate theory among competing candidates [14].\nAn important issue that is not considered in the classic ODT model is that of unknown or noisy\noutcomes. In fact, our research was motivated by a dataset involving toxic chemical identi\ufb01cation\nwhere the outcomes of many hypothesis-test pairs are stated as unknown (one of our experimental\nresults is also on this dataset). Prior work incorporating noise in ODT [14] is restricted to settings\nwith very sparse noise. In this paper, we design approximation algorithms for the noisy optimal\ndecision tree problem in full generality.\nWe consider a standard model for persistent noise. Certain outcomes (i.e., entries in the hypothesis-\ntest matrix) are random with a known distribution: for simplicity we treat each noisy outcome as\nan unbiased \u00b11 random variable. Our results extend directly to the case when each noisy outcome\nhas a different probability of being \u00b11. Persistent noise means that repeating the same test always\nproduces the same \u00b11 outcome. We assume that the instance is identi\ufb01able, i.e., a unique hypothesis\ncan always be identi\ufb01ed irrespective of the noisy outcomes. 
(This assumption can be relaxed: see §6.) We consider both non-adaptive policies (where the test sequence is fixed upfront) and adaptive policies (where the test sequence is built incrementally and depends on observed test outcomes). Clearly, adaptive policies perform at least as well as non-adaptive ones.4 However, non-adaptive policies are very simple to implement (requiring minimal incremental computation) and may be preferred in time-sensitive applications. Our main contributions are:

• an O(log m)-approximation algorithm for non-adaptive ODT with noise.
• an O(min(h, r) + log m)-approximation algorithm for adaptive ODT with noise, where h (resp. r) is the maximum number of noisy outcomes for any hypothesis (resp. test).
• an O(log m)-approximation algorithm for adaptive ODT with noise when every test has at least m − O(√m) noisy outcomes.
• experimental results on applications to toxic chemical identification and active learning.

We note that both non-adaptive and adaptive versions (even for usual ODT) generalize the set cover problem: so an Ω(log m) approximation ratio is the best possible (unless P=NP).

Related Work The optimal decision tree problem (without noise) has been extensively studied for several decades [11, 20, 25, 24, 1, 2, 7, 18]. The state-of-the-art result [18] is an O(log m)-approximation, for instances with arbitrary probability distribution and costs. It is also known that ODT cannot be approximated to a factor better than O(log m), unless P=NP [6].
The application of ODT to Bayesian active learning was formalized in [10]. There are also several results on the statistical complexity of active learning, e.g. [4, 19, 29], where the focus is on proving bounds for structured hypothesis classes.
In contrast, we consider arbitrary hypothesis classes and obtain computationally efficient policies with provable approximation bounds relative to the optimal (instance specific) policy. This approach is similar to that in [10, 15, 12, 14, 9, 22].
The noisy ODT problem was studied previously in [14]. Using a connection to adaptive submodularity [12], they obtained an O(log² 1/pmin)-approximation algorithm for noisy ODT in the presence of very few noisy outcomes; here pmin ≤ 1/m is the minimum probability of any hypothesis.5 In particular, the running time of the algorithm in [14] is exponential in the number of noisy outcomes per hypothesis, which is polynomial only if this number is at most logarithmic in the number of hypotheses/tests. Our result provides the following improvements: (i) the running time is polynomial irrespective of the number of noisy outcomes and (ii) the approximation ratio is better by at least one logarithmic factor. We note that a better O(log m) approximation ratio (still only for very sparse noise) follows from subsequent work on the "equivalence class determination" problem by [9].

4There are also instances where the relative gap between the best adaptive and non-adaptive policies is Ω̃(m).
5The paper [14] states the approximation ratio as O(log 1/pmin) because it relied on an erroneous claim in [12]. The correct approximation ratio, based on [28, 13], is O(log² 1/pmin).

For this setting, our result is also an O(log m) approximation, but the algorithm is simpler. More importantly, ours is the first result that can handle any number of noisy outcomes.
Other variants of noisy ODT have also been considered, e.g. [27, 5, 8], where the goal is to identify the correct hypothesis with at least some target probability.
The theoretical results in [8] provide\n\u201cbicriteria\u201d approximation bounds where the algorithm has a larger error probability than the optimal\npolicy. Our setting is different because we require zero probability of error.\nMany algorithms for ODT (including ours) rely on some underlying submodularity properties. We\nbrie\ufb02y survey some background results. The basic submodular cover problem was \ufb01rst considered by\n[31], who proved that the natural greedy algorithm is a (1 + ln 1\n\u0001 )-approximation algorithm, where \u0001\nis the minimal positive marginal increment of the function. [3] obtained an O(log 1\n\u0001 )-approximation\nalgorithm for the submodular ranking problem, that involves simultaneously covering multiple\nsubmodular functions; [21] extended this result to also handle costs. [23] studied an adaptive version\nof the submodular ranking problem. We utilize results/techniques from these papers.\nFinally, we note that there is also work on minimizing the worst-case (instead of average case) cost\nin ODT and active learning [26, 30, 16, 17]. These results are incomparable to ours because we are\ninterested in the average case, i.e. minimizing expected cost.\n\n2 Problem De\ufb01nition\n\nWe start with de\ufb01ning the optimal decision tree with noise (ODTN) formally. There is a set of m\npossible hypotheses with a probability distribution {\u03c0x}m\nx=1, from which an unknown hypothesis\n\u00afx is drawn. There is also a set U = [n] of binary tests. Each test e \u2208 U is associated with a 3-way\npartition T +(e), T \u2212(e), T \u2217(e) of the hypotheses, where the test outcome is (a) positive if \u00afx lies in\nT +(e), (b) negative if \u00afx \u2208 T \u2212(e), and (c) positive or negative with probability 1\n2 each if \u00afx \u2208 T \u2217(e)\n(these are noisy outcomes). 
We assume that conditioned on x̄, each noisy outcome is independent. We also use r_x(e) to denote the part of test e that hypothesis x lies in, i.e.

r_x(e) = −1 if x ∈ T−(e);  +1 if x ∈ T+(e);  ∗ if x ∈ T∗(e).

While we know the 3-way partition T+(e), T−(e), T∗(e) for each test e ∈ U upfront, we are not aware of the actual outcomes for the noisy hypothesis-test pairs. It is assumed that the realized hypothesis x̄ can be uniquely identified by performing all tests, regardless of the outcomes of ∗-tests. This means that for every pair x, y ∈ [m] of hypotheses, there is some test e ∈ U with x ∈ T+(e) and y ∈ T−(e) or vice-versa. The goal is to perform an adaptive (or non-adaptive) sequence of tests to identify hypothesis x̄ using the minimum expected number of tests. Note that the expectation is taken over both the prior distribution of x̄ and the random outcomes of noisy tests for x̄.
In our algorithms and analysis, it will be convenient to work with an expanded set of hypotheses M. For a binary vector b ∈ {±1}^U and hypothesis x ∈ [m], we say b is consistent with x, denoted b ∼ x, if b_e = r_x(e) for each e ∈ U with r_x(e) ≠ ∗. Let M = {(b, x) ∈ {±1}^U × [m] : b ∼ x}, and let M_x ⊆ M be all copies associated with a particular hypothesis x ∈ [m]; note that {M_x}_{x=1}^m is a partition of M. Each "expanded" hypothesis (b, x) ∈ M corresponds to the case where the true hypothesis x̄ = x and the test outcomes are given by b. We assign the probability q_{b,x} = π_x / 2^{h_x} to each (b, x) ∈ M, where h_x is the number of ∗-tests for x.
Note that conditioned on x̄ = x, the probability of observing outcomes b is exactly 2^{−h_x}; so Pr[x̄ = x and test outcomes are b] = q_{b,x}. For any (b, x) ∈ M and e ∈ U, define r_{b,x}(e) = b(e) to be the observed outcome of test e if x̄ = x and test outcomes are b. For every expanded hypothesis (b, x) ∈ M and test e ∈ U, define

T_{b,x}(e) = T+(e) if r_{b,x}(e) = −1,  and  T_{b,x}(e) = T−(e) if r_{b,x}(e) = +1,   (1)

which is the subset of (original) hypotheses that can definitely be ruled out based on test e if x̄ = x and the test outcomes are given by b. Note that hypotheses in T∗(e) are never part of T_{b,x}(e) as their outcome on test e can be positive/negative (so they cannot be ruled out). For every hypothesis (b, x) ∈ M, define a monotone submodular function f_{b,x} : 2^U → [0, 1]:

f_{b,x}(S) = |∪_{e∈S} T_{b,x}(e)| · 1/(m − 1),  for all S ⊆ U,   (2)

which equals the fraction of the m − 1 hypotheses (excluding x) that have been ruled out based on the tests in S if x̄ = x and the test outcomes are given by b. Assuming x̄ = x and test outcomes given by b, hypothesis x is uniquely identified after tests S if and only if f_{b,x}(S) = 1.
A non-adaptive policy is specified by just a permutation of tests. The policy performs tests in this sequence and eliminates incompatible hypotheses until there is a unique compatible hypothesis (which is x̄). Note that the number of tests performed under such a policy is still random (it depends on x̄ and the outcomes of noisy tests). An adaptive policy chooses tests incrementally, depending on prior test outcomes. The state of a policy is a tuple (E, d) where E ⊆ U is a subset of tests and d ∈ {±1}^E denotes the observed outcomes on tests in E.
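The definitions above are easy to make concrete. Below is a minimal Python sketch on a toy ODTN instance (the instance, names, and prior are illustrative, not from the paper): it enumerates the expanded copies M_x, assigns them the probabilities q_{b,x} = π_x/2^{h_x}, and evaluates the coverage function f_{b,x} of equation (2).

```python
import itertools

# Toy ODTN instance (illustrative): 3 hypotheses, 3 tests.
# r_enc[x][e] encodes r_x(e): +1, -1, or '*' for a noisy outcome.
r_enc = {
    'x1': {'e1': +1, 'e2': +1, 'e3': '*'},
    'x2': {'e1': -1, 'e2': '*', 'e3': +1},
    'x3': {'e1': +1, 'e2': -1, 'e3': -1},
}
tests = ['e1', 'e2', 'e3']
hyps = list(r_enc)
m = len(hyps)
pi = {'x1': 0.5, 'x2': 0.25, 'x3': 0.25}   # prior over hypotheses

def expanded_copies(x):
    """All copies (b, x) in M_x: outcome vectors b consistent with x."""
    stars = [e for e in tests if r_enc[x][e] == '*']
    for bits in itertools.product([+1, -1], repeat=len(stars)):
        b = {e: r_enc[x][e] for e in tests if r_enc[x][e] != '*'}
        b.update(dict(zip(stars, bits)))
        yield b

def q(b, x):
    """q_{b,x} = pi_x / 2^{h_x}, where h_x = number of *-tests for x."""
    h_x = sum(1 for e in tests if r_enc[x][e] == '*')
    return pi[x] / 2 ** h_x

def f(b, x, S):
    """f_{b,x}(S), equation (2): fraction of the other m-1 hypotheses that
    the tests in S rule out, given true hypothesis x and outcomes b."""
    ruled_out = set()
    for e in S:
        # T_{b,x}(e), equation (1): hypotheses whose deterministic outcome
        # disagrees with the observed b[e]; '*'-hypotheses never qualify.
        ruled_out |= {y for y in hyps if r_enc[y][e] == -b[e]}
    return len(ruled_out) / (m - 1)
```

On this instance the copy probabilities sum to one, and performing all tests drives every f_{b,x} to 1, which is exactly the identifiability assumption.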
An adaptive policy is specified by a mapping Φ : 2^U × {±1}^U → U from states to tests, where Φ(E, d) is the next test to perform at state (E, d). Equivalently, we can view a policy as a decision tree with nodes corresponding to states, labels at nodes representing the test performed at that state, and branches corresponding to the ±1 outcome of the current test. As the number of states can be exponential, we cannot hope to specify arbitrary adaptive policies. Instead, we want implicit policies Φ, where given any state (E, d), the test Φ(E, d) can be computed efficiently. This implies that the total time taken under any outcome is polynomial. We note that an optimal policy Φ* can be very complex and the map Φ*(E, d) may not be efficiently computable. We will still compare the performance of our (efficient) policy to Φ*.
In this paper, we consider the persistent noise model. That is, repeating a test e with x̄ ∈ T∗(e) always produces the same outcome. An alternative model is non-persistent noise, where each run of test e with x̄ ∈ T∗(e) produces an independent random outcome. The persistent noise model is more appropriate for handling missing data. It also contains the non-persistent noise model as a special case (by introducing multiple tests with identical partitions). One can easily obtain an adaptive O(log² m)-approximation for the non-persistent model using existing algorithms for noiseless ODT [7] and repeating each test O(log m) times.
The persistent-noise model that we consider is much harder.

3 Non-Adaptive Algorithm

Our algorithm is based on a reduction to the submodular ranking problem [3], defined below.

Submodular Function Ranking (SFR) An instance of SFR consists of a ground set U of elements and a collection of monotone submodular functions {f_1, ..., f_m}, f_x : 2^U → [0, 1], with f_x(∅) = 0 and f_x(U) = 1 for all x ∈ [m]. Additionally, there is a weight w_x ≥ 0 for each x ∈ [m]. A solution is a permutation of the elements U. Given any permutation σ of U, the cover time of function f is C(f, σ) := min{t : f(∪_{i∈[t]} σ(i)) = 1}, where σ(i) is the i-th element in σ. In words, it is the earliest time when the value of f reaches the unit threshold. The goal is to find a permutation σ of U with minimal total cover time Σ_{x∈[m]} w_x · C(f_x, σ). We will use the following result:

Theorem 3.1 ([3]). There is an O(log 1/ε)-approximation for SFR, where ε is the minimum marginal increment of any function.

The non-adaptive ODTN problem can be expressed as an instance of SFR as follows. The elements are the tests U. For each hypothesis-copy (b, x) ∈ M there is a function f_{b,x} (see (2)) with weight q_{b,x}. Based on the definition of these functions, the parameter ε = 1/(m − 1). To see the equivalence, note that a solution to non-adaptive ODTN is also a permutation σ of U, and hypothesis x is uniquely identified under outcome (b, x) exactly when function f_{b,x} has value one. Moreover, the objective of the ODTN problem is the expected number of tests in σ to identify the realized hypothesis x̄, which equals

Σ_{x=1}^{m} π_x Σ_{b∼x} 2^{−h_x} · C_{b,x}(σ) = Σ_{(b,x)∈M} q_{b,x} · C_{b,x}(σ),

where C_{b,x}(σ) is the cover time of function f_{b,x}.
It now follows that this SFR instance is equivalent to the non-adaptive ODTN instance. However, we cannot apply Theorem 3.1 directly to obtain an O(log m) approximation. This is because we have an exponential number of functions (note |M| can be exponential in m), which means that a direct implementation of the algorithm from [3] requires exponential time. Nevertheless, we show that a variant of the SFR algorithm can be used to obtain:

Theorem 3.2. There is an O(log m)-approximation for non-adaptive ODTN.

The SFR algorithm [3] is a greedy-style algorithm that at any point, having already chosen tests E, assigns to each test e ∈ U \ E the score

G_E(e) := Σ_{(b,x)∈M : f_{b,x}(E)<1} q_{b,x} · (f_{b,x}({e} ∪ E) − f_{b,x}(E)) / (1 − f_{b,x}(E)) = Σ_{(b,x)∈M} q_{b,x} · Δ_E(b, x, e),   (3)

where

Δ_E(b, x, e) = (f_{b,x}({e} ∪ E) − f_{b,x}(E)) / (1 − f_{b,x}(E)) if f_{b,x}(E) < 1, and 0 otherwise,   (4)

is the "gain" of test e for the hypothesis-copy (b, x). At each step, the algorithm chooses the test of maximum score. However, we do not know how to compute the score (3) in polynomial time. Instead, using the fact that G_E(e) is the expectation of Δ_E(b, x, e) over the hypothesis-copies (b, x) ∈ M, we show that we can obtain an approximate maximizer by sampling. Moreover, Theorem 3.1 also holds when we choose a test with approximately maximum score: this follows directly from the analysis in [21]. This sampling approach is still not sufficient because it can fail when the value G_E(e) is very small. A key observation is that when the score G_E(e) is small for all tests e, then with high probability the already-performed tests E uniquely identify hypothesis x̄.
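Since G_E(e) in (3) is the expectation of Δ_E(b, x, e) over copies drawn with probability q_{b,x}, it can be estimated by Monte Carlo sampling: draw x from the prior, then flip a fair coin for each ∗-test of x. A hedged Python sketch of this estimator on a toy instance (the instance, names, and sample size are illustrative, not from the paper):

```python
import random

random.seed(0)

# Toy instance (illustrative): hypothesis -> outcome per test (+1/-1/'*').
r_enc = {
    'x1': {'e1': +1, 'e2': +1, 'e3': '*'},
    'x2': {'e1': -1, 'e2': '*', 'e3': +1},
    'x3': {'e1': +1, 'e2': -1, 'e3': -1},
}
tests = ['e1', 'e2', 'e3']
hyps = list(r_enc)
m = len(hyps)
pi = {'x1': 0.5, 'x2': 0.25, 'x3': 0.25}

def f(b, x, S):
    """f_{b,x}(S): fraction of the other m-1 hypotheses ruled out by S."""
    out = set()
    for e in S:
        out |= {y for y in hyps if r_enc[y][e] == -b[e]}
    return len(out) / (m - 1)

def sample_copy():
    """Draw (b, x) with probability q_{b,x}: x ~ pi, then a fair coin for
    every *-test of x (exactly the product distribution over copies)."""
    x = random.choices(hyps, weights=[pi[h] for h in hyps])[0]
    b = {}
    for e in tests:
        v = r_enc[x][e]
        b[e] = v if v != '*' else random.choice([+1, -1])
    return b, x

def gain(b, x, e, E):
    """Delta_E(b, x, e) from equation (4)."""
    fE = f(b, x, E)
    if fE >= 1.0:
        return 0.0
    return (f(b, x, E + [e]) - fE) / (1.0 - fE)

def estimated_score(e, E, n_samples=4000):
    """Monte Carlo estimate of G_E(e), equation (3)."""
    return sum(gain(*sample_copy(), e, E) for _ in range(n_samples)) / n_samples
```

On this toy instance the exact scores with E = ∅ can be worked out by hand (e.g. G(e1) = 0.5·0.5 + 0.25·1 + 0.25·0.5 = 0.625), and the sampled estimates concentrate around them.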
Hence the future tests won't affect the expected cover time by much.
As the realized hypothesis x̄ can always be identified uniquely, for any pair x, y ∈ [m] of hypotheses there is a test where x and y have opposite outcomes (i.e. one is + and the other −). So there is a set L of at most m(m − 1)/2 tests such that hypothesis x̄ will be uniquely identified by performing all the tests in L.
The non-adaptive ODTN algorithm (Non-Adap) involves two phases. In the first phase, we run the SFR algorithm using sampling to get estimates Ḡ_E(e) of the scores G_E(e) at each step; let e* = arg max_{e∈U} Ḡ_E(e) denote the chosen test. If at some step the maximum sampled score is less than m^{−5}, then we go to the second phase, where we perform all the tests in L and stop. The number of samples used to obtain each estimate is polynomial in m; so the overall runtime is polynomial. The complete proof can be found in the full version of this paper.

4 Adaptive Algorithms

Our adaptive algorithm chooses between two algorithms (ODTN_r and ODTN_h) based on the noise sparsity parameters h (the maximum number of noisy outcomes per hypothesis) and r (the maximum number of noisy outcomes per test). These two algorithms maintain the posterior probability of each hypothesis based on the previous test outcomes, and use these probabilities to calculate a "score" for each test. The score of a test has two components: (i) a term that prioritizes splitting the candidate hypotheses in a balanced way and (ii) terms that correspond to the expected number of hypotheses eliminated. We maintain the following information at each point in the algorithm: already performed tests E ⊆ U, compatible hypotheses H ⊆ [m], and (posterior) probability p_x for each x ∈ H. Given values {p_x : x ∈ [m]}, to reduce notation we use the shorthand p(S) = Σ_{x∈S} p_x for any subset S ⊆ [m].
The main difference between the two algorithms is in how the score for component (i) is defined. First we discuss ODTN_r:

Algorithm 1 ODTN_r
initially E ← ∅, H ← [m] and p_x ← π_x for all x ∈ [m].
while |H| > 1 do
  for any test e ∈ U, let L_e(H) be the smaller-cardinality set among T+(e) ∩ H and T−(e) ∩ H
  select the test e ∈ U \ E that maximizes:

  p(L_e(H)) + (|T−(e) ∩ H| / (|H| − 1)) · p(T+(e) ∩ H) + (|T+(e) ∩ H| / (|H| − 1)) · p(T−(e) ∩ H) + (|H \ T∗(e)| / (2(|H| − 1))) · p(T∗(e) ∩ H).   (5)

  if the outcome of test e is + then H ← H \ T−(e); else H ← H \ T+(e).
  E ← E ∪ {e} and update p_x ← p_x/2 for all x ∈ T∗(e).
end while

Theorem 4.1. Algorithm 1 is an O(r + log m)-approximation algorithm for adaptive ODTN, where r is the maximum number of noisy outcomes per test.

The high-level idea is to view any ODTN instance I as a suitable instance J of adaptive submodular ranking (ASR). Then we use and modify an existing framework of analysis for ASR from [23].
An equivalent ASR instance J. This involves the expanded hypothesis set M where each hypothesis (b, x) ∈ M occurs with probability q_{b,x} = π_x/2^{h_x}. Each hypothesis (b, x) is also associated with: (i) a submodular function f_{b,x} : 2^U → [0, 1] and (ii) a feedback function r_{b,x} : U → {+, −}, where r_{b,x}(e) is the outcome of test e under hypothesis (b, x). The goal in the ASR instance is to adaptively select a subset S ⊆ U such that f_{b,x}(S) = 1 for the realized hypothesis (b, x). The objective is to minimize the expected cost E[|S|].
Lemma 4.2.
The ASR instance J is equivalent to the ODTN instance I.
Now, we present an algorithm for the ASR instance J that we will show is equivalent to running Algorithm 1 on instance I. Crucially, the ASR algorithm is almost identical to that studied in prior work [23], and therefore we can essentially re-use the analysis from that paper to prove our bound. Recall that the expanded hypotheses M = ∪_{x=1}^m M_x, where M_x are all copies of hypothesis x ∈ [m]. To reduce notation, we use q(S) = Σ_{(b,x)∈S} q_{b,x} for any subset S ⊆ M. Also note that hypothesis (b, x) is covered when f_{b,x}(E) = 1, which implies identifying hypothesis x ∈ [m].

Algorithm 2 Algorithm for ASR instance J
initially E ← ∅, H′ ← M.
while H′ ≠ ∅ do
  H ← {x ∈ [m] : M_x ∩ H′ ≠ ∅}.
  for any test e ∈ U, let L′_e(H′) = {(b, x) ∈ H′ : x ∈ L_e(H)} = H′ ∩ (∪_{x∈L_e(H)} M_x)
  select the test e ∈ U \ E that maximizes:

  q(L′_e(H′)) + Σ_{(b,x)∈M_x∩H′} q_{b,x} · (f_{b,x}({e} ∪ E) − f_{b,x}(E)) / (1 − f_{b,x}(E)).   (6)

  remove incompatible and covered hypotheses from H′ based on the feedback from e.
  E ← E ∪ {e}
end while

We now prove the equivalence between Algorithms 1 and 2. The state of either algorithm is represented by the set E of tests performed along with their outcomes.
Lemma 4.3. The decision tree produced by Algorithm 1 on I is the same as that produced by Algorithm 2 on J.

Based on Lemmas 4.2 and 4.3, in order to prove Theorem 4.1 it suffices to show that Algorithm 2 is an O(r + log m)-approximation algorithm for the ASR instance J. The proof is very similar to the analysis in [23].
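One greedy step of Algorithm 1, i.e. evaluating score (5) and picking its maximizer, can be sketched in a few lines of Python. The toy instance and names below are illustrative, not from the paper:

```python
# Toy instance (illustrative): r_enc[x][e] in {+1, -1, '*'}.
r_enc = {
    'x1': {'e1': +1, 'e2': +1, 'e3': '*'},
    'x2': {'e1': -1, 'e2': '*', 'e3': +1},
    'x3': {'e1': +1, 'e2': -1, 'e3': -1},
}
tests = ['e1', 'e2', 'e3']
pi = {'x1': 0.5, 'x2': 0.25, 'x3': 0.25}

def score(e, H, p):
    """Score (5) of test e given compatible hypotheses H and posteriors p."""
    Tp = {x for x in H if r_enc[x][e] == +1}
    Tm = {x for x in H if r_enc[x][e] == -1}
    Ts = {x for x in H if r_enc[x][e] == '*'}
    Le = Tp if len(Tp) <= len(Tm) else Tm        # smaller side of the split
    P = lambda S: sum(p[x] for x in S)
    n = len(H) - 1
    return (P(Le)
            + len(Tm) / n * P(Tp)                # eliminated on a '-' outcome
            + len(Tp) / n * P(Tm)                # eliminated on a '+' outcome
            + len(H - Ts) / (2 * n) * P(Ts))     # expected elimination of '*'

def select_test(H, p, E):
    """One iteration of Algorithm 1's greedy choice."""
    return max((e for e in tests if e not in E), key=lambda e: score(e, H, p))
```

On this instance the first selected test is the one that both splits the candidates evenly and eliminates the most probability mass in expectation.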
So we only provide an outline of the overall proof, while emphasizing the differences. For k = 0, 1, ..., define the following quantities:

• A_k ⊆ M is the set of uncovered hypotheses in ALG at time L · 2^k, and a_k = q(A_k).
• Y_k is the set of uncovered hypotheses in OPT at time 2^{k−1}, and y_k = q(Y_k).

Here L = O(r + log m). The key step is to show:

a_k ≤ 0.2 a_{k−1} + 3 y_k,  for all k ≥ 1.   (7)

As shown in [23], this implies an O(L) approximation ratio. In order to prove (7) we use the quantity:

Z := Σ_{t = L·2^{k−1}+1}^{L·2^k} Σ_{(E,H′)∈R(t)} max_{e∈U\E} ( Σ_{(b,x)∈L′_e(H′)} q_{b,x} + Σ_{(b,x)∈H′} q_{b,x} · (f_{b,x}({e} ∪ E) − f_{b,x}(E)) / (1 − f_{b,x}(E)) ).   (8)

Above, R(t) denotes the set of states (E, H′) that occur at time t in ALG. (7) will be proved by separately lower and upper bounding Z.
Lemma 4.4 ([23]). We have Z ≥ L · (a_k − 3y_k)/3.

Lemma 4.5. We have Z ≤ a_{k−1} · (1 + ln m + r + log m).
Proof. For any hypothesis (b, x) ∈ A_{k−1} (i.e. uncovered in ALG by time L·2^{k−1}) let σ_{b,x} be the path traced by (b, x) in ALG's decision tree, starting from time 2^{k−1}L and ending at 2^k L or when (b, x) gets covered. Recall that for any L·2^{k−1} < t ≤ L·2^k, any hypothesis in H′ for any state in R(t) appears in A_{k−1}.
So only hypotheses in A_{k−1} can contribute to Z, and we rewrite (8) as:

Z = Σ_{(b,x)∈A_{k−1}} q_{b,x} · Σ_{e∈σ_{b,x}} ( (f_{b,x}({e} ∪ E) − f_{b,x}(E)) / (1 − f_{b,x}(E)) + 1[(b, x) ∈ L′_e(H′)] )
  ≤ Σ_{(b,x)∈A_{k−1}} q_{b,x} · ( Σ_{e∈σ_{b,x}} (f_{b,x}({e} ∪ E) − f_{b,x}(E)) / (1 − f_{b,x}(E)) + Σ_{e∈σ_{b,x}} 1[(b, x) ∈ L′_e(H′)] ).   (9)

Above, for any e ∈ σ_{b,x} we use (E, H′) to denote the state at which e is selected.
Fix any hypothesis (b, x) ∈ A_{k−1}. For the first term, we use Lemma 4.6 below and the definition of ε. This implies Σ_{e∈σ_{b,x}} (f_{b,x}({e} ∪ E) − f_{b,x}(E)) / (1 − f_{b,x}(E)) ≤ 1 + ln(1/ε) ≤ 1 + ln m, as the parameter ε ≥ 1/m for f_{b,x}.
Now, we bound the second term by proving the inequality below:

Σ_{e∈σ_{b,x}} 1[(b, x) ∈ L′_e(H′)] ≤ r + log m.   (10)

To prove this inequality, consider the hypotheses in H′. If hypothesis (b, x) ∈ L′_e(H′) when ALG selects test e, then x is in L_e(H). Suppose L_e(H) = T+(e) ∩ H; the other case is identical. Let D_e(H) = T−(e) ∩ H and S_e(H) = T∗(e) ∩ H. As x ∈ L_e(H), it must be that path σ_{b,x} follows the + branch out of e. Also, the number of candidate hypotheses on this path after test e is

|L_e(H)| + |S_e(H)| ≤ |L_e(H)|/2 + |D_e(H)|/2 + |S_e(H)| = |H|/2 + |S_e(H)|/2 ≤ |H|/2 + r/2.

The first inequality uses the definition of L_e(H) and the last inequality uses the bound of r on the number of hypotheses with ∗ outcomes.
Hence, each time that (b, x) ∈ L′_e(H′) along path σ_{b,x}, the number of candidate hypotheses changes as |H_new| ≤ |H_old|/2 + r/2. This implies that after log₂ m such events, |H| ≤ r. Let σ′_{b,x} denote the portion of path σ_{b,x} after |H| drops below r. Note that each time (b, x) ∈ L′_e(H′) we have L_e(H) ≠ ∅: so |H| reduces by at least one after each such test e. Hence Σ_{e∈σ′_{b,x}} 1[(b, x) ∈ L′_e(H′)] ≤ r. As the portion of path σ_{b,x} until |H| ≤ r contributes at most log₂ m to the left-hand side in (10), the total is at most r + log₂ m, as needed.

Lemma 4.6 ([3]). Let f : 2^U → [0, 1] be any monotone function with f(∅) = 0 and ε = min{f(S ∪ {e}) − f(S) : e ∈ U, S ⊆ U, f(S ∪ {e}) − f(S) > 0}. Then, for any sequence ∅ = S_0 ⊆ S_1 ⊆ ... ⊆ S_k ⊆ U of subsets, we have Σ_{t=1}^k (f(S_t) − f(S_{t−1})) / (1 − f(S_{t−1})) ≤ 1 + ln(1/ε).

Setting L = 15(1 + ln m + r + log₂ m) and applying Lemmas 4.4 and 4.5 completes the proof of (7) and hence Theorem 4.1.
We now discuss algorithm ODTN_h. This is based on directly applying the ASR algorithm from [23] to J.
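The halving argument behind inequality (10) is easy to sanity-check numerically: iterating the worst-case bound |H_new| ≤ |H_old|/2 + r/2 for about log₂ m steps drives |H| down to at most r (plus one, from the continuous relaxation). A small illustrative check, not part of the paper's proof:

```python
import math

def worst_case_H(m, r):
    """Iterate the worst-case recursion |H_new| = |H_old|/2 + r/2
    for ceil(log2 m) steps, starting from |H| = m. The closed form is
    m / 2^k + r * (1 - 2^-k), which is at most r + 1 at k = ceil(log2 m)."""
    H = float(m)
    for _ in range(math.ceil(math.log2(m))):
        H = H / 2 + r / 2
    return H
```

For example, worst_case_H(1000, 5) is just under 6, matching the r + 1 bound.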
The resulting algorithm is very similar to Algorithm 2 and involves a change in the definition of L′_e(H′) in score (6) to be the smaller of the following sets:

{(b, x) ∈ H′ : r_{b,x}(e) = +}  and  {(b, x) ∈ H′ : r_{b,x}(e) = −}.

The main difference in the analysis is in proving the following inequality instead of inequality (10) in the proof of Lemma 4.5:

Σ_{e∈σ_{b,x}} 1[(b, x) ∈ L′_e(H′)] ≤ h + log m.   (11)

This follows from the observation that each time that (b, x) ∈ L′_e(H′) along path σ_{b,x}, the size |H′| reduces by at least a factor 1/2 (note that initially |H′| = |M| ≤ 2^h · m). Finally, by choosing the better of ODTN_r and ODTN_h, we obtain:

Theorem 4.7. There is an O(min(h, r) + log m)-approximation algorithm for adaptive ODTN, where h (resp. r) is the maximum number of noisy outcomes per hypothesis (resp. test).

We note that our analysis is tight (up to constant factors).
In order to understand the dependence of the approximation ratio on the noise sparsity min(h, r), we study instances that have a very large number of noisy outcomes. Formally, we define an α-sparse (α ≤ 1/2) instance as follows: there is a constant C such that max{|T+(e)|, |T−(e)|} ≤ C · m^α for all tests e ∈ U. Somewhat surprisingly, there is a very different algorithm (see the full version) that can actually take advantage of the large noise:

Theorem 4.8. There is an adaptive O(log m)-approximation for ODTN on α-sparse instances with α ≤ 1/2.

5 Non-identifiable Instances

We have assumed that for every pair x, y of hypotheses, there is some test that distinguishes them deterministically. Without this assumption, we can still obtain similar results by slightly changing the stopping criterion.
Define a similarity graph $G$ on $m$ nodes (corresponding to hypotheses) with an edge $(x, y)$ if there is no test separating $x$ and $y$ deterministically. Let $D_x$ denote the set containing $x$ and all its neighbors in $G$, for each $x \in [m]$. The neighborhood stopping criterion involves stopping when the set $H$ of compatible hypotheses is contained in some $D_x$, where $x$ might or might not be $\bar{x}$. The clique stopping criterion involves stopping when $H$ is contained in some clique of $G$. Note that clique stopping is a stronger notion of identification than neighborhood stopping. Our algorithms' performance guarantees will now also depend on the maximum degree $d$ of $G$; note that $d = 0$ in the perfectly identifiable case. We obtain a non-adaptive algorithm with approximation ratio $O((d + 1) \log m)$ and an adaptive algorithm with approximation ratio $O(d + \min(h, r) + \log m)$. Below we outline our approach for the adaptive algorithm; the details can be found in the full version.

For the adaptive version, we run a two-phase algorithm. In the first phase, we identify some subset $N \subseteq [m]$ containing $\bar{x}$ with $|N| \le d$. This can be done using the algorithm in §4 with the following submodular function for each $(b, x) \in M$:
$$f_{b,x}(S) = \frac{1}{m - d}\,\Big|\bigcup_{e \in S} T_{b,x}(e)\Big|, \qquad \forall S \subseteq U.$$
The expected cost of this phase is $O(\min(r, h) + \log m) \cdot OPT$ using an analysis identical to §4; here $OPT$ denotes the optimal value. Then, in the second phase, we run a simple splitting algorithm that iteratively selects any test that splits the current set $H$ of candidate scenarios, until the neighborhood or clique stopping criterion is satisfied. The expected cost of this phase is at most $d \cdot OPT$. Combining both phases, we obtain an $O(d + \min(h, r) + \log m)$-approximation algorithm.

6 Extensions

Non-binary outcomes.
We can also handle tests with an arbitrary set $\Sigma$ of outcomes (instead of $\pm 1$). This requires extending the outcomes $b$ to be in $\Sigma^U$ and applying this change to the definitions of the sets $T_{b,x}$ (1) and the submodular function $f_{b,x}$ (2).

Non-uniform noise distribution. Our results extend directly to the case where each noisy outcome has a different probability of being $\pm 1$. Suppose that the probability of every noisy outcome is between $\delta$ and $1 - \delta$. Then Theorems 3.2 and 4.7 continue to hold (irrespective of $\delta$), and Theorem 4.8 holds with a slightly worse $O(\frac{1}{\delta} \log m)$ approximation ratio.

7 Experiments

We implemented our algorithms and performed experiments on real-world and synthetic data sets. We compared our algorithms' cost (expected number of tests) with an information-theoretic lower bound on the optimal cost and show that the difference is negligible. Thus, despite our logarithmic approximation ratios, the practical performance can be much better.

Chemicals with Unknown Test Outcomes. One natural application of ODT is identifying chemical or biological materials. We considered a data set called WISER⁶, which includes 400+ chemicals (hypotheses) and 78 binary tests. Every chemical has either a positive, negative, or unknown result on each test. We ran our algorithms on both the original instance, in which some chemicals are not perfectly identifiable, and a modified version. In the modified version, to ensure that every pair of chemicals can be distinguished, we removed chemicals that are not identifiable from each other, leaving 255 chemicals (to do this, we used a greedy rule that iteratively drops the highest-degree hypothesis in the similarity graph).

Random Binary Classifiers with Margin Error. We construct a dataset containing 100 two-dimensional points, by picking each of their attributes uniformly in $[-1000, 1000]$.
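The point-cloud sampling just described can be sketched as follows (the paper does not provide its generation code; the seed and library choice are ours):

```python
import random

# Sketch of the synthetic point generation described above (seed is our choice):
# 100 points with both coordinates drawn uniformly from [-1000, 1000].
random.seed(0)
points = [(random.uniform(-1000, 1000), random.uniform(-1000, 1000))
          for _ in range(100)]

assert len(points) == 100
assert all(-1000 <= x <= 1000 and -1000 <= y <= 1000 for x, y in points)
```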
We also choose 2000 random triples $(a, b, c)$, with $a, b \leftarrow N(0, 1)$ and $c \leftarrow U(-1000, 1000)$, to form linear classifiers $\frac{ax + by}{\sqrt{a^2 + b^2}} + c \le 0$. The point labels are binary and we introduce noisy outcomes based on the distance of each point to a classifier. Specifically, for each threshold $d \in \{0, 5, 10, 20, 30\}$ we define a dataset CL-$d$ that has a noisy outcome for any classifier-point pair where the distance of the point to the boundary of the classifier is smaller than $d$. In order to ensure that the instances are perfectly identifiable, we remove "equivalent" classifiers and are left with 234 classifiers.

Algorithms. We implement the following algorithms: the adaptive $O(r + \log m)$-approximation (ODTNr), the adaptive $O(h + \log m)$-approximation (ODTNh), the non-adaptive $O(\log m)$-approximation (Non-Adap), and a slightly adaptive version of Non-Adap (Low-Adap). Algorithm Low-Adap considers the same sequence of tests as Non-Adap while (adaptively) skipping non-informative tests based on observed outcomes. The implementations of the adaptive and non-adaptive algorithms are available online⁷. We also consider three different stopping criteria: unique stopping for perfectly identifiable instances, and neighborhood and clique stopping (defined in Section 5) for the original WISER dataset.

Results. Table 1 shows the results of the different algorithms with unique stopping on all identifiable datasets when the distribution over hypotheses is uniform, and Table 2 summarizes the results on the original WISER dataset with the other stopping criteria. Experiments with some non-uniform distributions are presented in the supplementary material. Table 1 also reports the values of an information-theoretic lower bound (the entropy $\log_2 m$) on the optimal cost (Low-BND). We can see that ODTNr consistently outperforms the other algorithms and is very close to the lower bound.
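The lower bound is easy to reproduce: for a uniform prior over $m$ hypotheses the entropy is $\log_2 m$, matching the Low-BND row of Table 1 for the identifiable WISER ($m = 255$) and CL ($m = 234$) instances. A small check of our own:

```python
import math

def entropy_lower_bound(probs):
    """Shannon entropy (in bits) of the prior over hypotheses: an information-
    theoretic lower bound on the expected number of binary tests."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Uniform priors reproduce the Low-BND values reported in Table 1.
uniform = lambda m: [1 / m] * m
print(round(entropy_lower_bound(uniform(255)), 3))   # → 7.994
print(round(entropy_lower_bound(uniform(234)), 3))   # → 7.87
```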
Note that the original WISER dataset used to produce the results in Table 2 has m = 414 hypotheses and maximum similarity-graph degree d = 54, while the processed WISER dataset in Table 1 is perfectly identifiable with m = 255 hypotheses.

Algorithm    WISER      CL-0     CL-5     CL-10    CL-20    CL-30
(r, h)       (245,45)   (0,0)    (5,3)    (7,6)    (12,8)   (13,8)
Low-BND      7.994      7.870    7.870    7.870    7.870    7.870
ODTNr        8.357      7.910    7.927    7.915    7.962    8.000
ODTNh        9.707      7.910    7.979    8.211    8.671    8.729
Non-Adap     11.568     9.731    9.831    9.941    9.996    10.204
Low-Adap     9.152      8.619    8.517    8.777    8.692    8.803

Table 1: Cost of algorithms with unique stopping for the uniform distribution. For each dataset we also indicate the noise parameters h and r (max number of noisy outcomes per hypothesis/test).

Algorithm   Neighborhood Stopping   Clique Stopping
ODTNr       11.163                  11.817
ODTNh       11.908                  12.506
Non-Adap    16.995                  21.281
Low-Adap    16.983                  20.559

Table 2: Cost of algorithms on the original WISER dataset with neighborhood and clique stopping for the uniform distribution.

⁶ https://wiser.nlm.nih.gov
⁷ https://github.com/FatemehNavidi/ODTN ; https://github.com/sjia1/ODT-with-noisy-outcomes

References

[1] M. Adler and B. Heeringa. Approximating optimal binary decision trees. In Approximation, Randomization and Combinatorial Optimization. Algorithms and Techniques, pages 1–9. Springer, 2008.

[2] E. M. Arkin, H. Meijer, J. S. Mitchell, D. Rappaport, and S. S. Skiena. Decision trees for geometric models. International Journal of Computational Geometry & Applications, 8(03):343–363, 1998.

[3] Y. Azar and I. Gamzu. Ranking with submodular valuations. In Proceedings of the Twenty-Second Annual ACM-SIAM Symposium on Discrete Algorithms, pages 1070–1079. SIAM, 2011.

[4] M. Balcan, A. Beygelzimer, and J. Langford. Agnostic active learning.
In Machine Learning, Proceedings of the Twenty-Third International Conference (ICML 2006), Pittsburgh, Pennsylvania, USA, June 25-29, 2006, pages 65–72, 2006.

[5] G. Bellala, S. K. Bhavnani, and C. Scott. Active diagnosis under persistent noise with unknown noise distribution: A rank-based approach. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, AISTATS 2011, Fort Lauderdale, USA, April 11-13, 2011, pages 155–163, 2011.

[6] V. T. Chakaravarthy, V. Pandit, S. Roy, P. Awasthi, and M. K. Mohania. Decision trees for entity identification: Approximation algorithms and hardness results. ACM Trans. Algorithms, 7(2):15:1–15:22, 2011.

[7] V. T. Chakaravarthy, V. Pandit, S. Roy, and Y. Sabharwal. Approximating decision trees with multiway branches. In International Colloquium on Automata, Languages, and Programming, pages 210–221. Springer, 2009.

[8] Y. Chen, S. H. Hassani, and A. Krause. Near-optimal Bayesian active learning with correlated and noisy tests. In Proceedings of the 20th International Conference on Artificial Intelligence and Statistics, AISTATS 2017, 20-22 April 2017, Fort Lauderdale, FL, USA, pages 223–231, 2017.

[9] F. Cicalese, E. S. Laber, and A. M. Saettler. Diagnosis determination: decision trees optimizing simultaneously worst and expected testing cost. In Proceedings of the 31st International Conference on Machine Learning, ICML 2014, Beijing, China, 21-26 June 2014, pages 414–422, 2014.

[10] S. Dasgupta. Analysis of a greedy active learning strategy. In Advances in Neural Information Processing Systems, pages 337–344, 2005.

[11] M. Garey and R. Graham. Performance bounds on the splitting algorithm for binary testing. Acta Informatica, 3:347–355, 1974.

[12] D. Golovin and A. Krause. Adaptive submodularity: Theory and applications in active learning and stochastic optimization. J. Artif. Intell.
Res., 42:427–486, 2011.

[13] D. Golovin and A. Krause. Adaptive submodularity: A new approach to active learning and stochastic optimization. CoRR, abs/1003.3967, 2017.

[14] D. Golovin, A. Krause, and D. Ray. Near-optimal Bayesian active learning with noisy observations. In Advances in Neural Information Processing Systems 23: 24th Annual Conference on Neural Information Processing Systems 2010, Vancouver, British Columbia, Canada, pages 766–774, 2010.

[15] A. Guillory and J. A. Bilmes. Average-case active learning with costs. In Algorithmic Learning Theory, 20th International Conference, ALT 2009, Porto, Portugal, October 3-5, 2009, pages 141–155, 2009.

[16] A. Guillory and J. A. Bilmes. Interactive submodular set cover. In Proceedings of the 27th International Conference on Machine Learning (ICML-10), June 21-24, 2010, Haifa, Israel, pages 415–422, 2010.

[17] A. Guillory and J. A. Bilmes. Simultaneous learning and covering with adversarial noise. In Proceedings of the 28th International Conference on Machine Learning, ICML 2011, Bellevue, Washington, USA, June 28 - July 2, 2011, pages 369–376, 2011.

[18] A. Gupta, V. Nagarajan, and R. Ravi. Approximation algorithms for optimal decision trees and adaptive TSP problems. Mathematics of Operations Research, 42(3):876–896, 2017.

[19] S. Hanneke. A bound on the label complexity of agnostic active learning. In Machine Learning, Proceedings of the Twenty-Fourth International Conference (ICML 2007), Corvallis, Oregon, USA, June 20-24, 2007, pages 353–360, 2007.

[20] L. Hyafil and R. L. Rivest. Constructing optimal binary decision trees is NP-complete. Information Processing Letters, 5(1):15–17, 1976/77.

[21] S. Im, V. Nagarajan, and R. V. D. Zwaan. Minimum latency submodular cover. ACM Transactions on Algorithms (TALG), 13(1):13, 2016.

[22] S. Javdani, Y.
Chen, A. Karbasi, A. Krause, D. Bagnell, and S. S. Srinivasa. Near optimal Bayesian active learning for decision making. In Proceedings of the Seventeenth International Conference on Artificial Intelligence and Statistics, AISTATS 2014, Reykjavik, Iceland, April 22-25, 2014, pages 430–438, 2014.

[23] P. Kambadur, V. Nagarajan, and F. Navidi. Adaptive submodular ranking. In Integer Programming and Combinatorial Optimization - 19th International Conference, IPCO 2017, Waterloo, ON, Canada, June 26-28, 2017, Proceedings, pages 317–329, 2017 (full version: https://arxiv.org/abs/1606.01530).

[24] S. R. Kosaraju, T. M. Przytycka, and R. Borgstrom. On an optimal split tree problem. In Workshop on Algorithms and Data Structures, pages 157–168. Springer, 1999.

[25] D. W. Loveland. Performance bounds for binary testing with arbitrary weights. Acta Inform., 22(1):101–114, 1985.

[26] M. J. Moshkov. Greedy algorithm with weights for decision tree construction. Fundam. Inform., 104(3):285–292, 2010.

[27] M. Naghshvar, T. Javidi, and K. Chaudhuri. Noisy Bayesian active learning. In 50th Annual Allerton Conference on Communication, Control, and Computing, Allerton 2012, Allerton Park & Retreat Center, Monticello, IL, USA, October 1-5, 2012, pages 1626–1633, 2012.

[28] F. Nan and V. Saligrama. Comments on the proof of adaptive stochastic set cover based on adaptive submodularity and its implications for the group identification problem in "Group-based active query selection for rapid diagnosis in time-critical situations". IEEE Trans. Information Theory, 63(11):7612–7614, 2017.

[29] R. D. Nowak. Noisy generalized binary search. In Advances in Neural Information Processing Systems 22: 23rd Annual Conference on Neural Information Processing Systems 2009, Vancouver, British Columbia, Canada, pages 1366–1374, 2009.

[30] A. M. Saettler, E.
S. Laber, and F. Cicalese. Trading off worst and expected cost in decision tree problems. Algorithmica, 79(3):886–908, 2017.

[31] L. A. Wolsey. An analysis of the greedy algorithm for the submodular set covering problem. Combinatorica, 2(4):385–393, 1982.