{"title": "Near-Optimal Bayesian Active Learning with Noisy Observations", "book": "Advances in Neural Information Processing Systems", "page_first": 766, "page_last": 774, "abstract": "We tackle the fundamental problem of Bayesian active learning with noise, where we need to adaptively select from a number of expensive tests in order to identify an unknown hypothesis sampled from a known prior distribution. In the case of noise-free observations, a greedy algorithm called generalized binary search (GBS) is known to perform near-optimally. We show that if the observations are noisy, perhaps surprisingly, GBS can perform very poorly. We develop EC2, a novel, greedy active learning algorithm and prove that it is competitive with the optimal policy, thus obtaining the first competitiveness guarantees for Bayesian active learning with noisy observations. Our bounds rely on a recently discovered diminishing returns property called adaptive submodularity, generalizing the classical notion of submodular set functions to adaptive policies. Our results hold even if the tests have non\u2013uniform cost and their noise is correlated. We also propose EffECXtive, a particularly fast approximation of EC2, and evaluate it on a Bayesian experimental design problem involving human subjects, intended to tease apart competing economic theories of how people make decisions under uncertainty.", "full_text": "Near\u2013Optimal Bayesian Active Learning\n\nwith Noisy Observations\n\nDaniel Golovin\n\nCaltech\n\nAndreas Krause\n\nCaltech\n\nDebajyoti Ray\n\nCaltech\n\nAbstract\n\nWe tackle the fundamental problem of Bayesian active learning with noise, where\nwe need to adaptively select from a number of expensive tests in order to identify\nan unknown hypothesis sampled from a known prior distribution. In the case of\nnoise\u2013free observations, a greedy algorithm called generalized binary search (GBS)\nis known to perform near\u2013optimally. 
We show that if the observations are noisy,\nperhaps surprisingly, GBS can perform very poorly. We develop EC2, a novel,\ngreedy active learning algorithm and prove that it is competitive with the optimal\npolicy, thus obtaining the \ufb01rst competitiveness guarantees for Bayesian active learn-\ning with noisy observations. Our bounds rely on a recently discovered diminishing\nreturns property called adaptive submodularity, generalizing the classical notion\nof submodular set functions to adaptive policies. Our results hold even if the tests\nhave non\u2013uniform cost and their noise is correlated. We also propose EFFECX-\nTIVE, a particularly fast approximation of EC2, and evaluate it on a Bayesian\nexperimental design problem involving human subjects, intended to tease apart\ncompeting economic theories of how people make decisions under uncertainty.\n\nIntroduction\n\n1\nHow should we perform experiments to determine the most accurate scienti\ufb01c theory among com-\npeting candidates, or choose among expensive medical procedures to accurately determine a patient\u2019s\ncondition, or select which labels to obtain in order to determine the hypothesis that minimizes general-\nization error? In all these applications, we have to sequentially select among a set of noisy, expensive\nobservations (outcomes of experiments, medical tests, expert labels) in order to determine which hy-\npothesis (theory, diagnosis, classi\ufb01er) is most accurate. This fundamental problem has been studied in\na number of areas, including statistics [17], decision theory [13], machine learning [19, 7] and others.\nOne way to formalize such active learning problems is Bayesian experimental design [6], where one\nassumes a prior on the hypotheses, as well as probabilistic assumptions on the outcomes of tests. The\ngoal then is to determine the correct hypothesis while minimizing the cost of the experimentation. 
Un-\nfortunately, \ufb01nding this optimal policy is not just NP-hard, but also hard to approximate [5]. Several\nheuristic approaches have been proposed that perform well in some applications, but do not carry theo-\nretical guarantees (e.g., [18]). In the case where observations are noise-free1, a simple algorithm, gen-\neralized binary search2(GBS) run on a modi\ufb01ed prior, is guaranteed to be competitive with the optimal\npolicy; the expected number of queries is a factor of O(log n) (where n is the number of hypotheses)\nmore than that of the optimal policy [15], which matches lower bounds up to constant factors [5].\nThe important case of noisy observations, however, as present in most applications, is much less\nwell understood. While there are some recent positive results in understanding the label complexity\nof noisy active learning [19, 1], to our knowledge, so far there are no algorithms that are provably\ncompetitive with the optimal sequential policy, except in very restricted settings [16]. In this paper, we\n\n1This case is known as the Optimal Decision Tree (ODT) problem.\n2GBS greedily selects tests to maximize, in expectation over the test outcomes, the prior probability mass of\neliminated hypotheses (i.e., those with zero posterior probability, computed w.r.t. the observed test outcomes).\n\n1\n\n\fintroduce a general formulation of Bayesian active learning with noisy observations that we call the\nEquivalence Class Determination problem. We show that, perhaps surprisingly, generalized binary\nsearch performs poorly in this setting, as do greedily (myopically) maximizing the information gain\n(measured w.r.t. 
the distribution on equivalence classes) or the decision-theoretic value of information.\nThis motivates us to introduce a novel active learning criterion, and use it to develop a greedy active\nlearning algorithm called the Equivalence Class Edge Cutting algorithm (EC2), whose expected cost\nis competitive to that of the optimal policy. Our key insight is that our new objective function satis\ufb01es\nadaptive submodularity [9], a natural diminishing returns property that generalizes the classical notion\nof submodularity to adaptive policies. Our results also allow us to relax the common assumption\nthat the outcomes of the tests are conditionally independent given the true hypothesis. We also\ndevelop the Efficient Edge Cutting approXimate objective algorithm (EFFECXTIVE), an ef\ufb01cient\napproximation to EC2, and evaluate it on a Bayesian experimental design problem intended to tease\napart competing theories on how people make decisions under uncertainty, including Expected Value\n[22], Prospect Theory [14], Mean-Variance-Skewness [12] and Constant Relative Risk Aversion [20].\nIn our experiments, EFFECXTIVE typically outperforms existing experimental design criteria such as\ninformation gain, uncertainty sampling, GBS, and decision-theoretic value of information. Our results\nfrom human subject experiments further reveal that EFFECXTIVE can be used as a real-time tool\nto classify people according to the economic theory that best describes their behaviour in \ufb01nancial\ndecision-making, and reveal some interesting heterogeneity in the population.\n\n2 Bayesian Active Learning in the Noiseless Case\nIn the Bayesian active learning problem, we would like to distinguish among a given set of hypotheses\nH = {h1, . . . , hn} by performing tests from a set T = {1, . . . , N} of possible tests. Running test\nt incurs a cost of c(t) and produces an outcome from a \ufb01nite set of outcomes X = {1, 2, . . . 
, ℓ}. We let H denote the random variable which equals the true hypothesis, and model the outcome of each test t by a random variable Xt taking values in X. We denote the observed outcome of test t by xt. We further suppose we have a prior distribution P modeling our assumptions on the joint probability P(H, X1, . . . , XN) over the hypotheses and test outcomes. In the noiseless case, we assume that the outcome of each test is deterministic given the true hypothesis, i.e., for each h ∈ H, P(X1, . . . , XN | H = h) is a deterministic distribution. Thus, each hypothesis h is associated with a particular vector of test outcomes. We assume, w.l.o.g., that no two hypotheses lead to the same outcomes for all tests. Thus, if we perform all tests, we can uniquely determine the true hypothesis. However, in most applications we will wish to avoid performing every possible test, as this is prohibitively expensive.

Our goal is to find an adaptive policy for running tests that allows us to determine the value of H while minimizing the cost of the tests performed. Formally, a policy π (also called a conditional plan) is a partial mapping π from partial observation vectors xA to tests, specifying which test to run next (or whether we should stop testing) for any observation vector xA. Here, xA ∈ X^A is a vector of outcomes indexed by a set of tests A ⊆ T that we have performed so far³ (e.g., the set of labeled examples in active learning, or outcomes of a set of medical tests that we ran). After having made observations xA, we can rule out inconsistent hypotheses. We denote the set of hypotheses consistent with event Λ (often called the version space associated with Λ) by V(Λ) := {h ∈ H : P(h | Λ) > 0}. We call a policy feasible if it is guaranteed to uniquely determine the correct hypothesis.
That is, upon termination with observation xA, it must hold that |V(xA)| = 1. We can define the expected cost of a policy π by

c(π) := Σh P(h) c(T(π, h)),

where T(π, h) ⊆ T is the set of tests run by policy π in case H = h. Our goal is to find a feasible policy π* of minimum expected cost, i.e.,

π* = arg min {c(π) : π is feasible}.   (2.1)

A policy π can be naturally represented as a decision tree T^π, and thus problem (2.1) is often called the Optimal Decision Tree (ODT) problem. Unfortunately, obtaining an approximate policy π for which c(π) ≤ c(π*) · o(log(n)) is NP-hard [5]. Hence, various heuristics are employed to solve the Optimal Decision Tree problem and its variants. Two of the most popular heuristics are to select tests greedily to maximize the information gain (IG) conditioned on previous test outcomes, and generalized binary search (GBS).

³Formally, we also require that (xt)t∈B ∈ dom(π) and A ⊆ B implies (xt)t∈A ∈ dom(π) (c.f. [9]).

Both heuristics are greedy, and after having made observations xA will select

t* = arg max t∈T ∆Alg(t | xA) / c(t),

where Alg ∈ {IG, GBS}. Here, ∆IG(t | xA) := H(XT | xA) − E xt∼Xt|xA [H(XT | xA, xt)] is the marginal information gain measured with respect to the Shannon entropy H(X) := Ex[− log2 P(x)], and ∆GBS(t | xA) := P(V(xA)) − Σx∈X P(Xt = x | xA) P(V(xA, Xt = x)) is the expected reduction in version space probability mass. Thus, both heuristics greedily choose the test that maximizes the benefit-cost ratio, measured with respect to their particular benefit functions.
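For illustration, the greedy benefit-cost selection for GBS in the noiseless case can be sketched as follows (a minimal sketch, not the authors' code; hypotheses are represented as dictionaries of test outcomes, and all names are ours):

```python
# A minimal sketch of generalized binary search (GBS) in the noiseless
# setting.  Each hypothesis is a dict mapping test -> outcome; `prior`
# maps hypothesis names to probabilities.  All names are illustrative.

def version_space(hyps, observed):
    """Hypotheses consistent with all observations made so far."""
    return {h: v for h, v in hyps.items()
            if all(v[t] == x for t, x in observed.items())}

def gbs_benefit(hyps, prior, observed, t):
    """Delta_GBS(t | x_A): expected reduction in version-space mass."""
    vs = version_space(hyps, observed)
    mass = sum(prior[h] for h in vs)
    exp_mass = 0.0
    for x in {v[t] for v in vs.values()}:
        m_x = sum(prior[h] for h, v in vs.items() if v[t] == x)
        exp_mass += (m_x / mass) * m_x  # P(Xt=x | xA) * P(V(xA, Xt=x))
    return mass - exp_mass

def run_gbs(hyps, prior, cost, truth):
    """Greedily query tests until the version space is a singleton."""
    observed = {}
    while len(version_space(hyps, observed)) > 1:
        remaining = [t for t in hyps[truth] if t not in observed]
        t = max(remaining,
                key=lambda t: gbs_benefit(hyps, prior, observed, t) / cost[t])
        observed[t] = hyps[truth][t]  # noiseless oracle answer
    return next(iter(version_space(hyps, observed))), observed

hyps = {"h1": {1: 0, 2: 0}, "h2": {1: 0, 2: 1}, "h3": {1: 1, 2: 0}}
prior = {"h1": 0.5, "h2": 0.25, "h3": 0.25}
cost = {1: 1.0, 2: 1.0}
print(run_gbs(hyps, prior, cost, truth="h2")[0])
```

On this toy instance the policy identifies the true hypothesis after at most two unit-cost tests.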
They stop after running a set of tests A such that |V(xA)| = 1, i.e., once the true hypothesis has been uniquely determined. It turns out that for the (noiseless) Optimal Decision Tree problem, these two heuristics are equivalent [23], as can be proved using the chain rule of entropy. Interestingly, despite its myopic nature, GBS has been shown [15, 7, 11, 9] to obtain near-optimal expected cost: the strongest known bound is c(πGBS) ≤ c(π*)(ln(1/pmin) + 1), where pmin := min h∈H P(h). Let xS(h) be the unique vector xS ∈ X^S such that P(xS | h) = 1. The result above is proved by exploiting the fact that fGBS(S, h) := 1 − P(V(xS(h))) + P(h) is adaptive submodular and strongly adaptively monotone [9]. Call xA a subvector of xB if A ⊆ B and P(xB | xA) > 0. In this case we write xA ≺ xB. A function f : 2^T × H → R is called adaptive submodular w.r.t. a distribution P if, for any xA ≺ xB and any test t, it holds that ∆(t | xA) ≥ ∆(t | xB), where

∆(t | xA) := EH[f(A ∪ {t}, H) − f(A, H) | xA].

Thus, f is adaptive submodular if the expected marginal benefits ∆(t | xA) of adding a new test t can only decrease as we gather more observations. f is called strongly adaptively monotone w.r.t. P if, informally, "observations never hurt" with respect to the expected reward. Formally, for all A, all t ∉ A, and all x ∈ X we require EH[f(A, H) | xA] ≤ EH[f(A ∪ {t}, H) | xA, Xt = x]. The performance guarantee for GBS follows from the following general result about the greedy algorithm for adaptive submodular functions (applied with Q = 1 and η = pmin):

Theorem 1 (Theorem 10 of [9] with α = 1). Suppose f : 2^T × H → R≥0 is adaptive submodular and strongly adaptively monotone with respect to P, and there exists Q such that f(T, h) = Q for all h.
Let η be any value such that f(S, h) > Q − η implies f(S, h) = Q for all sets S and hypotheses h. Then for self-certifying instances the adaptive greedy policy π satisfies c(π) ≤ c(π*)(ln(Q/η) + 1).

The technical requirement that instances be self-certifying means that the policy will have proof that it has obtained the maximum possible objective value, Q, immediately upon doing so. It is not difficult to show that this is the case with the instances we consider in this paper. We refer the interested reader to [9] for more detail. In the following sections, we will use the concept of adaptive submodularity to provide the first approximation guarantees for Bayesian active learning with noisy observations.

3 The Equivalence Class Determination Problem and the EC2 Algorithm
We now wish to consider the Bayesian active learning problem where tests can have noisy outcomes. Our general strategy is to reduce the problem of noisy observations to the noiseless setting. To gain intuition, consider a simple model where tests have binary outcomes, and we know that the outcome of exactly one test, chosen uniformly at random unbeknown to us, is flipped. If any pair of hypotheses h ≠ h′ differs by the outcome of at least three tests, we can still uniquely determine the correct hypothesis after running all tests. In this case we can reduce the noisy active learning problem to the noiseless setting by, for each hypothesis, creating N "noisy" copies, each obtained by flipping the outcome of one of the N tests. The modified prior P′ would then assign mass P′(h′) = P(h)/N to each noisy copy h′ of h. The conditional distribution P′(XT | h′) is still deterministic (obtained by flipping the outcome of one of the tests).
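The noisy-copies construction just described can be sketched in a few lines (a minimal sketch under the one-flip, binary-outcome assumption stated above; the naming scheme is ours, for illustration):

```python
# Sketch of the reduction: exactly one binary test outcome is flipped,
# so each hypothesis h spawns N "noisy" copies, one per test, each
# receiving prior mass P(h)/N.  Names (e.g. "h1_flip2") are ours.

def noisy_copies(hyps, prior):
    """Return (copies, modified prior P', equivalence classes)."""
    copies, prior2, classes = {}, {}, {}
    for h, outcomes in hyps.items():
        classes[h] = []
        n = len(outcomes)
        for t in outcomes:
            name = f"{h}_flip{t}"
            v = dict(outcomes)
            v[t] = 1 - v[t]              # flip the outcome of test t
            copies[name] = v
            prior2[name] = prior[h] / n  # P'(h') = P(h)/N
            classes[h].append(name)
    return copies, prior2, classes

hyps = {"h1": {1: 0, 2: 0, 3: 0}, "h2": {1: 1, 2: 1, 3: 1}}
prior = {"h1": 0.5, "h2": 0.5}
copies, prior2, classes = noisy_copies(hyps, prior)
print(len(copies), sum(prior2.values()))
```

The modified instance has 2 × 3 = 6 deterministic hypotheses whose prior masses sum to one, grouped into one equivalence class per original hypothesis.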
Thus, each hypothesis hi in the original problem is now associated with a set Hi of hypotheses in the modified problem instance. However, instead of selecting tests to determine which noisy copy has been realized, we only care which set Hi is realized.

The Equivalence Class Determination problem (ECD). More generally, we introduce the Equivalence Class Determination problem⁴, where our set of hypotheses H is partitioned into a set of m equivalence classes {H1, . . . , Hm} so that H = ⊎ i=1..m Hi, and the goal is to determine which class Hi the true hypothesis lies in. Formally, upon termination with observations xA we require that V(xA) ⊆ Hi for some i. As with the ODT problem, the goal is to minimize the expected cost of the tests, where the expectation is taken over the true hypothesis sampled from P. In §4, we will show how the Equivalence Class Determination problem arises naturally from Bayesian experimental design problems in probabilistic models.

Given the fact that GBS performs near-optimally on the Optimal Decision Tree problem, a natural approach to solving ECD would be to run GBS until the termination condition is met. Unfortunately, and perhaps surprisingly, GBS can perform very poorly on the ECD problem. Consider an instance with a uniform prior over n hypotheses, h1, . . . , hn, and two equivalence classes H1 := {hi : 1 ≤ i < n} and H2 := {hn}. There are tests T = {1, . . . , n} such that hi(t) = 1[i = t], all of unit cost. Here, 1[Λ] is the indicator variable for event Λ. In this case, the optimal policy only needs to select test n; however, GBS may select tests 1, 2, . . . , n in order until running test t, where H = ht is the true hypothesis.
Given our uniform prior, it takes n/2 tests in expectation until this happens, so that GBS\npays, in expectation, n/2 times the optimal expected cost in this instance.\nThe poor performance of GBS in this instance may be attributed to its lack of consideration for the\nequivalence classes. Another natural heuristic would be to run the greedy information gain policy,\nonly with the entropy measured with respect to the probability distribution on equivalence classes\nrather than hypotheses. Call this policy \u03c0IG. It is clearly aware of the equivalence classes, as it\nadaptively and myopically selects tests to reduce the uncertainty of the realized class, measured w.r.t.\nthe Shannon entropy. However, we can show there are instances in which it pays \u2126(n/ log(n)) times\nthe optimal cost, even under a uniform prior. See the long version of this paper [10] for details.\n\nThe EC2 algorithm. The reason why GBS fails is because reducing the version space mass does\nnot necessarily facilitate differentiation among the classes Hi. The reason why \u03c0IG fails is that there\nare complementarities among tests; a set of tests can be far better than the sum of its parts. Thus, we\nwould like to optimize an objective function that encourages differentiation among classes, but lacks\ncomplementarities. We adopt a very elegant idea from Dasgupta [8], and de\ufb01ne weighted edges be-\ntween hypotheses that we aim to distinguish between. However, instead of introducing edges between\narbitrary pairs of hypotheses (as done in [8]), we only introduce edges between hypotheses in different\nclasses. Tests will allow us to cut edges inconsistent with their outcomes, and we aim to eliminate\nall inconsistent edges while minimizing the expected cost incurred. 
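The edge-cutting intuition above can be sketched concretely (our illustrative code, assuming product edge weights w({h, h′}) = P(h)P(h′) consistent with the class-mass products used later in the paper; all names are ours):

```python
# Sketch of the edge-cutting idea: weighted edges connect hypotheses in
# *different* equivalence classes, and observing a test outcome "cuts"
# every edge with at least one endpoint inconsistent with that outcome.
from itertools import combinations

def edges(classes, prior):
    """Weighted edges between hypotheses in distinct classes."""
    E = {}
    for ci, cj in combinations(classes, 2):
        for h in classes[ci]:
            for g in classes[cj]:
                E[(h, g)] = prior[h] * prior[g]  # product weighting (assumed)
    return E

def weight_cut(E, hyps, t, x):
    """Total weight of edges cut by observing X_t = x."""
    return sum(w for (h, g), w in E.items()
               if hyps[h][t] != x or hyps[g][t] != x)

classes = {"A": ["h1", "h2"], "B": ["h3"]}
hyps = {"h1": {1: 0, 2: 0}, "h2": {1: 0, 2: 1}, "h3": {1: 1, 2: 1}}
prior = {"h1": 0.25, "h2": 0.25, "h3": 0.5}
E = edges(classes, prior)
print(weight_cut(E, hyps, t=1, x=0))  # cuts both (h1,h3) and (h2,h3)
```

Once every remaining edge has been cut, all surviving hypotheses lie in a single class, which is exactly the ECD termination condition.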
We now formalize this intuition. Specifically, we define a set of edges E := ∪ 1≤i<j≤m {{h, h′} : h ∈ Hi, h′ ∈ Hj}, i.e., all (unordered) pairs of hypotheses from distinct equivalence classes, and EC2 greedily cuts these edges; its expected cost satisfies c(π) ≤ c(π*)(2 ln(1/p′min) + 1), where p′min := min{P(h) : h ∈ H, P(h) > 0}. If all tests have unit cost, by using a modified prior [15] the approximation factor can be improved to O(log |H| + log |supp(Θ)|), as in the case of Theorem 3.

The EFFECXTIVE algorithm. For some noise models, Θ may have exponentially large support. In this case, reducing Bayesian active learning with noise to Equivalence Class Determination results in instances with exponentially large equivalence classes. This makes running EC2 on them challenging, since explicitly keeping track of the equivalence classes is impractical. To overcome this challenge, we develop EFFECXTIVE, a particularly efficient algorithm which approximates EC2.

For clarity, we only consider the 0–1 loss, i.e., our goal is to find the most likely hypothesis (MAP estimate) given all the data xT, namely h*(xT) := arg max h P(h | xT). Recall definition (4.1), and consider the weight of edges between distinct equivalence classes Hi and Hj:

w(Hi × Hj) = Σ xT∈Hi, x′T∈Hj P(xT) P(x′T) = (Σ xT∈Hi P(xT)) (Σ x′T∈Hj P(x′T)) = P(XT ∈ Hi) P(XT ∈ Hj).

In general, P(XT ∈ Hi) can be estimated to arbitrary accuracy using a rejection sampling approach with bounded sample complexity. We defer details to the full version of the paper. Here, we focus on the case where, upon observing all tests, the hypothesis is uniquely determined, i.e., P(H | xT) is deterministic for all xT in the support of P.
In this case, it holds that P(XT ∈ Hi) = P(H = hi). Thus, the total weight is

Σ i≠j w(Hi × Hj) = (Σi P(hi))² − Σi P(hi)² = 1 − Σi P(hi)².

This insight motivates us to use the objective function

∆Eff(t | xA) := [Σx P(Xt = x | xA) (Σi P(hi | xA, Xt = x)²)] − Σi P(hi | xA)²,

which is the expected reduction in weight from the prior to the posterior distribution. Note that the weight of a distribution, 1 − Σi P(hi)², is a monotonically increasing function of the Rényi entropy (of order 2), which is −(1/2) log Σi P(hi)². Thus the objective ∆Eff can be interpreted as a (non-standard) information gain in terms of the (exponentiated) Rényi entropy. In our experiments, we show that this criterion performs well in comparison to existing experimental design criteria, including the classical Shannon information gain. Computing ∆Eff(t | xA) requires us to perform one inference task for each outcome x of Xt, and O(n) computations to calculate the weight for each outcome.
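As an illustration, this computation can be sketched as follows (a minimal sketch assuming deterministic test outcomes given the hypothesis, so posterior updates reduce to renormalizing over consistent hypotheses; function and variable names are ours):

```python
# Sketch of computing Delta_Eff(t | x_A): the expected reduction of
# 1 - sum_i P(h_i)^2 when test t is run, under deterministic outcomes.

def posterior(hyps, prior, observed):
    """P(h | x_A): renormalize the prior over consistent hypotheses."""
    consistent = {h: prior[h] for h, v in hyps.items()
                  if all(v[t] == x for t, x in observed.items())}
    z = sum(consistent.values())
    return {h: p / z for h, p in consistent.items()}

def delta_eff(hyps, prior, observed, t):
    """Expected reduction in weight 1 - sum_i P(h_i)^2 from running t."""
    post = posterior(hyps, prior, observed)
    gain = -sum(p * p for p in post.values())
    for x in {hyps[h][t] for h in post}:
        p_x = sum(p for h, p in post.items() if hyps[h][t] == x)
        post_x = posterior(hyps, prior, {**observed, t: x})
        gain += p_x * sum(p * p for p in post_x.values())
    return gain

hyps = {"h1": {1: 0, 2: 0}, "h2": {1: 0, 2: 1}, "h3": {1: 1, 2: 1}}
prior = {"h1": 0.25, "h2": 0.25, "h3": 0.5}
print(delta_eff(hyps, prior, {}, 1), delta_eff(hyps, prior, {}, 2))
```

On this toy instance the objective prefers test 1, which splits the probability mass more evenly than test 2.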
We call the algorithm that greedily optimizes ∆Eff the EFFECXTIVE algorithm (since it uses an Efficient Edge Cutting approXimate objective), and present pseudocode in Algorithm 1.

Input: Set of hypotheses H; set of tests T; prior distribution P; function f.
begin
  A ← ∅;
  while ∃ h ≠ h′ : P(h | xA) > 0 and P(h′ | xA) > 0 do
    foreach t ∈ T \ A do
      ∆Eff(t | xA) := [Σx P(Xt = x | xA) (Σi P(hi | xA, Xt = x)²)] − Σi P(hi | xA)²;
    Select t* ∈ arg max t ∆Eff(t | xA)/c(t); set A ← A ∪ {t*} and observe outcome xt*;
end
Algorithm 1: The EFFECXTIVE algorithm using the Efficient Edge Cutting approXimate objective.

5 Experiments
Several economic theories make claims to explain how people make decisions when the payoffs are uncertain. Here we use human subject experiments to compare four key theories proposed in the literature. The uncertainty of the payoff in a given situation is represented by a lottery L, which is simply a random variable with a range of payoffs L := {ℓ1, . . . , ℓk}. For our purposes, a payoff is an integer denoting how many dollars you receive (or lose, if the payoff is negative). Fix lottery L, and let pi := P[L = ℓi]. The four theories posit distinct utility functions, with agents preferring larger utility lotteries. Three of the theories have associated parameters. The Expected Value theory [22] posits simply UEV(L) = E[L], and has no parameters. Prospect Theory [14] posits UPT(L) = Σi f(ℓi) w(pi) for nonlinear functions f(ℓi) = ℓi^ρ if ℓi ≥ 0 and f(ℓi) = −λ(−ℓi)^ρ if ℓi < 0, and w(pi) = e^−(log(1/pi))^α [21]. The parameters ΘPT = {ρ, λ, α} represent risk aversion, loss aversion and probability weighing factor respectively. For portfolio optimization problems, financial economists have used value functions that give weights to different moments of the lottery [12]: UMVS(L) = wµ µ − wσ σ + wν ν, where ΘMVS = {wµ, wσ, wν} are the weights for the mean, standard deviation and standardized skewness of the lottery respectively. In Constant Relative Risk Aversion theory [20], there is a parameter ΘCRRA = a representing the level of risk aversion, and the utility posited is UCRRA(L) = Σi pi ℓi^(1−a)/(1 − a) if a ≠ 1, and UCRRA(L) = Σi pi log(ℓi) if a = 1.

The goal is to adaptively select a sequence of tests to present to a human subject in order to distinguish which of the four theories best explains the subject's responses. Here a test t is a pair of lotteries (Lt1, Lt2). Based on the theory that represents behaviour, one of the lotteries would be preferred to the other, denoted by a binary response xt ∈ {1, 2}. The possible payoffs were fixed to L = {−10, 0, 10} (in dollars), and the distribution (p1, p2, p3) over the payoffs was varied, where pi ∈ {0.01, 0.99} ∪ {0.1, 0.2, . . . , 0.9}. By considering all non-identical pairs of such lotteries, we obtained the set of possible tests.

We compare six algorithms: EFFECXTIVE, greedily maximizing Information Gain (IG), Value of Information (VOI), Uncertainty Sampling⁵ (US), Generalized Binary Search (GBS), and tests selected at Random. We evaluated the ability of the algorithms to recover the true model based on simulated responses. We chose parameter values for the theories such that they made distinct predictions and were consistent with the values proposed in the literature [14]. We drew 1000 samples of the true model and fixed the parameters of the model to some canonical values, ΘPT = {0.9, 2.2, 0.9}, ΘMVS = {0.8, 0.25, 0.25}, ΘCRRA = 1.
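For concreteness, the four utility models above might be sketched as follows (our illustrative code, not the authors'; a lottery is a list of (payoff, probability) pairs, and the CRRA log case requires strictly positive payoffs):

```python
# Illustrative implementations of the four utility models (names ours).
import math

def u_ev(lottery):
    """Expected Value theory: U_EV(L) = E[L]."""
    return sum(p * l for l, p in lottery)

def u_pt(lottery, rho, lam, alpha):
    """Prospect Theory with Prelec probability weighting."""
    def f(l):  # value function: concave for gains, loss-averse for losses
        return l ** rho if l >= 0 else -lam * (-l) ** rho
    def w(p):  # Prelec weighting w(p) = exp(-(log(1/p))^alpha)
        return math.exp(-(math.log(1 / p)) ** alpha)
    return sum(f(l) * w(p) for l, p in lottery if p > 0)

def u_mvs(lottery, wm, ws, wn):
    """Mean-Variance-Skewness: U = wm*mu - ws*sigma + wn*nu."""
    mu = sum(p * l for l, p in lottery)
    var = sum(p * (l - mu) ** 2 for l, p in lottery)
    sd = math.sqrt(var)
    skew = (sum(p * (l - mu) ** 3 for l, p in lottery) / sd ** 3) if sd else 0.0
    return wm * mu - ws * sd + wn * skew

def u_crra(lottery, a):
    """Constant Relative Risk Aversion (log case needs payoffs > 0)."""
    if a == 1:
        return sum(p * math.log(l) for l, p in lottery)
    return sum(p * l ** (1 - a) / (1 - a) for l, p in lottery)

lot = [(-10, 0.2), (0, 0.3), (10, 0.5)]
print(u_ev(lot))                              # prints 3.0
print(u_pt(lot, 0.9, 2.2, 0.9) < u_ev(lot))  # loss aversion lowers utility
```

With the canonical Prospect Theory parameters, the loss-averse value function pulls the lottery's utility well below its expected value, which is what lets tests of this kind separate the theories.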
Responses were generated using a softmax function, with the probability of response xt = 1 given by P(xt = 1) = 1/(1 + e^(U(Lt2) − U(Lt1))). Fig. 2(a) shows the performance of the 6 methods, in terms of the accuracy of recovering the true model with the number of tests. We find that US, GBS and VOI perform significantly worse than Random in the presence of noise. EFFECXTIVE outperforms InfoGain significantly, which outperforms Random.

⁵Uncertainty sampling greedily selects the test whose outcome distribution has maximum Shannon entropy.

(a) Fixed parameters  (b) With parameter uncertainty  (c) Human subject data
Figure 2: (a) Accuracy of identifying the true model with fixed parameters, (b) Accuracy using a grid of parameters, incorporating uncertainty in their values, (c) Experimental results: 11 subjects were classified into the theories that described their behavior best. We plot probability of classified type.

We also considered uncertainty in the values of the parameters, by setting ρ from 0.85–0.95, λ from 2.1–2.3, α from 0.9–1; wµ from 0.8–1.0, wσ from 0.2–0.3, wν from 0.2–0.3; and a from 0.9–1.0, all with 3 values per parameter. We generated 500 random samples by first randomly sampling a model and then randomly sampling parameter values. EFFECXTIVE and InfoGain outperformed Random significantly (Fig. 2(b)), although InfoGain did marginally better among the two. The increased parameter range potentially poses model identifiability issues, and violates some of the assumptions behind EFFECXTIVE, decreasing its performance to the level of InfoGain.

After obtaining informed consent according to a protocol approved by the Institutional Review Board of Caltech, we tested 11 human subjects to determine which model fit their behaviour best.
Laboratory experiments have been used previously to distinguish economic theories [4], and here we used a real-time, dynamically optimized experiment that required fewer tests. Subjects were presented 30 tests using EFFECXTIVE. To incentivise the subjects, one of these tests was picked at random, and subjects received payment based on the outcome of their chosen lottery. The behavior of most subjects (7 out of 11) was best described by EV. This is not unexpected given the high quantitative abilities of the subjects. We also found heterogeneity in classification: one subject was classified as MVS, as identified by violations of stochastic dominance in the last few choices; 2 subjects were best described by Prospect Theory, since they exhibited a high degree of loss aversion and risk aversion; and one subject was classified as a CRRA-type (log-utility maximizer). Figure 2(c) shows the probability of the classified model with the number of tests. Although we need a larger sample to make significant claims about the validity of different economic theories, our preliminary results indicate that subject types can be identified and that there is heterogeneity in the population. They also serve as an example of the benefits of using real-time dynamic experimental design to collect data on human behavior.

6 Conclusions
In this paper, we considered the problem of adaptively selecting which noisy tests to perform in order to identify an unknown hypothesis sampled from a known prior distribution. We studied the Equivalence Class Determination problem as a means to reduce the case of noisy observations to the classic, noiseless case. We introduced EC2, an adaptive greedy algorithm that is guaranteed to choose the same hypothesis as if it had observed the outcome of all tests, and incurs near-minimal expected cost among all policies with this guarantee.
This is in contrast to popular heuristics that are greedy w.r.t. version space mass reduction, information gain or value of information, all of which we show can be very far from optimal. EC2 works by greedily optimizing an objective tailored to differentiate between sets of observations that lead to different decisions. Our bounds rely on the fact that this objective function is adaptive submodular. We also develop EFFECXTIVE, a practical algorithm based on EC2, that can be applied to arbitrary probabilistic models in which efficient exact inference is possible. We apply EFFECXTIVE to a Bayesian experimental design problem, and our results indicate its effectiveness in comparison to existing algorithms. We believe that our results provide an interesting direction towards providing a theoretical foundation for practical active learning and experimental design problems.

Acknowledgments. This research was partially supported by ONR grant N00014-09-1-1044, NSF grant CNS-0932392, NSF grant IIS-0953413, a gift by Microsoft Corporation, an Okawa Foundation Research Grant, and by the Caltech Center for the Mathematics of Information.

References
[1] N. Balcan, A. Beygelzimer, and J. Langford. Agnostic active learning. In ICML, 2006.
[2] Gowtham Bellala, Suresh Bhavnani, and Clayton Scott. Extensions of generalized binary search to group identification and exponential costs. In Advances in Neural Information Processing Systems (NIPS), 2010.
[3] Gowtham Bellala, Suresh K. Bhavnani, and Clayton D. Scott. Group-based query learning for rapid diagnosis in time-critical situations. CoRR, abs/0911.4511, 2009.
[4] Colin F. Camerer.
An experimental test of several generalized utility theories. The Journal of Risk and\n\nUncertainty, 2(1):61\u2013104, 1989.\n\n[5] V. T. Chakaravarthy, V. Pandit, S. Roy, P. Awasthi, and M. Mohania. Decision trees for entity identi\ufb01cation:\nApproximation algorithms and hardness results. In In Proceedings of the ACM- SIGMOD Symposium on\nPrinciples of Database Systems, 2007.\n\n[6] K. Chaloner and I. Verdinelli. Bayesian experimental design: A review. Statistical Science, 10(3):273\u2013304,\n\nAug. 1995.\n\n[7] Sanjoy Dasgupta. Analysis of a greedy active learning strategy. In NIPS, 2004.\n[8] Sanjoy Dasgupta. Coarse sample complexity bounds for active learning. In Y. Weiss, B. Sch\u00a8olkopf,\nand J. Platt, editors, Advances in Neural Information Processing Systems 18, pages 235\u2013242. MIT Press,\nCambridge, MA, 2006.\n\n[9] Daniel Golovin and Andreas Krause. Adaptive submodularity: Theory and applications in active learning\n\nand stochastic optimization. CoRR, abs/1003.3967v3, 2010.\n\n[10] Daniel Golovin, Andreas Krause, and Debajyoti Ray. Near-optimal Bayesian active learning with noisy\n\nobservations. CoRR, abs/1010.3091, 2010.\n\n[11] Andrew Guillory and Jeff Bilmes. Average-case active learning with costs. In The 20th International\n\nConference on Algorithmic Learning Theory, University of Porto, Portugal, October 2009.\n\n[12] Giora Hanoch and Haim Levy. Ef\ufb01cient portfolio selection with quadratic and cubic utility. The Journal of\n\nBusiness, 43(2):181\u2013189, 1970.\n\n[13] R. A. Howard. Information value theory. In IEEE Transactions on Systems Science and Cybernetics\n\n(SSC-2), 1966.\n\n[14] D. Kahneman and A. Tversky. Prospect theory: An analysis of decision under risk. Econometrica,\n\n47(2):263\u2013292, 1979.\n\n[15] S. Rao Kosaraju, Teresa M. Przytycka, and Ryan S. Borgstrom. On an optimal split tree problem. 
In WADS\n\u201999: Proceedings of the 6th International Workshop on Algorithms and Data Structures, pages 157\u2013168,\nLondon, UK, 1999. Springer-Verlag.\n\n[16] Andreas Krause and Carlos Guestrin. Optimal value of information in graphical models. Journal of\n\nArti\ufb01cial Intelligence Research (JAIR), 35:557\u2013591, 2009.\n\n[17] D. V. Lindley. On a measure of the information provided by an experiment. Annals of Mathematical\n\nStatistics, 27:986\u20131005, 1956.\n\n[18] D. MacKay. Information-based objective functions for active data selection. Neural Computation, 4(4):590\u2013\n\n604, 1992.\n\n[19] Rob Nowak. Noisy generalized binary search. In Y. Bengio, D. Schuurmans, J. Lafferty, C. K. I. Williams,\nand A. Culotta, editors, Advances in Neural Information Processing Systems 22, pages 1366\u20131374. 2009.\n\n[20] John W. Pratt. Risk aversion in the small and in the large. Econometrica, 32(1):122\u2013136, 1964.\n[21] D. Prelec. The probablity weighting function. Econometrica, 66(3):497\u2013527, 1998.\n[22] John von Neumann and Oskar Morgenstern. Theory of Games and Economic Behaviour. Princeton\n\nUniversity Press, 1947.\n\n[23] Alice X. Zheng, Irina Rish, and Alina Beygelzimer. Ef\ufb01cient test selection in active diagnosis via entropy\napproximation. In UAI \u201905, Proceedings of the 21st Conference in Uncertainty in Arti\ufb01cial Intelligence,\n2005.\n\n9\n\n\f", "award": [], "sourceid": 1100, "authors": [{"given_name": "Daniel", "family_name": "Golovin", "institution": null}, {"given_name": "Andreas", "family_name": "Krause", "institution": null}, {"given_name": "Debajyoti", "family_name": "Ray", "institution": null}]}