{"title": "Adaptive Active Hypothesis Testing under Limited Information", "book": "Advances in Neural Information Processing Systems", "page_first": 4035, "page_last": 4043, "abstract": "We consider the problem of active sequential hypothesis testing where a Bayesian decision maker must infer the true hypothesis from a set of hypotheses. The decision maker may choose from a set of actions, where the outcome of an action is corrupted by independent noise. In this paper we consider a special case where the decision maker has limited knowledge about the distribution of observations for each action, in that only a binary value is observed. Our objective is to infer the true hypothesis with low error, while minimizing the number of actions sampled. Our main results include the derivation of a lower bound on sample size for our system under limited knowledge and the design of an active learning policy that matches this lower bound and outperforms similar known algorithms.", "full_text": "Adaptive Active Hypothesis Testing under Limited Information\n\nFabio Cecchi\n\nEindhoven University of Technology, Eindhoven, The Netherlands\n\nf.cecchi@tue.nl\n\nNidhi Hegde\n\nNokia Bell Labs, Paris-Saclay, France\n\nnidhi.hegde@nokia-bell-labs.com\n\nAbstract\n\nWe consider the problem of active sequential hypothesis testing where a Bayesian\ndecision maker must infer the true hypothesis from a set of hypotheses. The\ndecision maker may choose from a set of actions, where the outcome of an action is\ncorrupted by independent noise. In this paper we consider a special case where the\ndecision maker has limited knowledge about the distribution of observations for\neach action, in that only a binary value is observed. 
Our objective is to infer the\ntrue hypothesis with low error, while minimizing the number of actions sampled.\nOur main results include the derivation of a lower bound on sample size for our\nsystem under limited knowledge and the design of an active learning policy that\nmatches this lower bound and outperforms similar known algorithms.\n\n1 Introduction\n\nWe consider the problem of active sequential hypothesis testing with incomplete information. The\noriginal problem, \ufb01rst studied by Chernoff [1], is one where a Bayesian decision maker must infer\nthe correct hypothesis from a set of J hypotheses. At each step the decision maker may choose from\nW actions where the outcome of an action is a random variable that depends on the action and the\ntrue (hidden) hypothesis. In prior work, the probability distribution functions on the outcomes are\nassumed to be known. In the present work we assume that these distributions are not known, and\nonly some rough information about the outcomes of the actions is known, to be made more precise\nfurther on.\nActive hypothesis testing is an increasingly important problem, with applications that\ninclude the following. (a) Medical diagnostics [2], including clinical trials for testing a\nnew treatment, or diagnostics of a new disease. (b) Crowdsourcing: online platforms for task-worker\nmatching such as Amazon\u2019s Mechanical Turk or TaskRabbit, where, as new tasks arrive, they must\nbe matched to workers capable of working on them. (c) Customer hotline centres or Q&A forums:\nonline platforms such as StackExchange where questions are submitted, and users with varying\ncapabilities are available for providing an answer. This includes customer service centres where\ncustomer tickets are submitted and the nature of the problem must be learned before it is treated (an\nexample where supervised learning techniques are used is [3]). 
(d) Content search problems where\nan incoming image must be matched to known contents, as studied in [4].\nWe now informally describe our model. In the general instance of our problem, the true hypothesis\n\u03b8\u2217 is one in a set of J hypotheses, J = {\u03b81, . . . , \u03b8J}, and a set of W actions is available, where\nthe outcomes of the actions depend on the true hypothesis. When the true hypothesis is \u03b8j and\naction w is chosen, a noisy outcome Xw,j \u2208 J is observed, whose distribution, pw,j(\u00b7) \u2208 P(J ), is\ngiven. The objective then is to select an action at each step so as to infer the true hypothesis in a\nminimum number of steps, with a given accuracy. In our model, we assume that the decision maker\nhas limited information about the outcome distributions. We de\ufb01ne the principal set of an action w as\nJw \u2286 J . When action w is sampled, a noisy binary outcome y \u2208 {\u22121, 1} is observed, which\nindicates whether the action classi\ufb01es the hypothesis in the set Jw. The quality of action w,\n\u03b1w, is related to the noise in the outcome. Rather than the distributions pw,j(\u00b7), we assume that the\ndecision maker only has knowledge of the principal set Jw and quality \u03b1w of each action.\n\n31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.\n\n1.1 Related work\n\nSince the seminal work by Chernoff [1], active hypothesis testing and variants of the problem have\nbeen studied through various perspectives (see [5] for a brief survey). Chernoff derived a simple\nheuristic algorithm whose performance is shown to achieve asymptotic optimality in the regime\nwhere the probability of error vanishes. Speci\ufb01cally, it is shown that as the probability of error \u03b4\ndecreases the expected number of samples needed by Chernoff\u2019s algorithm grows as \u2212 log(\u03b4). 
Most\nof the past literature in active sequential hypothesis testing has dealt with extensions of Chernoff\u2019s\nmodel, and has shown that Chernoff\u2019s algorithm performs well in more general settings [6, 7]. A\nnotable exception is [8], where the impact of the number of hypotheses is analyzed and an algorithm\nthat performs better than Chernoff\u2019s benchmark is provided for the case of large values of J.\nOur work differs from prior work in a few ways. First, the hypotheses need not be locally identi\ufb01able.\nWhile in [1] each action is able to distinguish each pair of hypotheses, we assume only that each hypothesis\nis globally identi\ufb01able, i.e., each pair of hypotheses can be discerned by at least one action. This is a\ncommon assumption in the area of distributed hypothesis testing ([9, 10]) and a weaker assumption\nthan that of Chernoff. Note that dropping the local identi\ufb01ability assumption is not novel in itself, and has been done\nin other work such as [8]. Second, a novel extension in our work, differing from [8], is that we do\nnot assume full knowledge of the actions\u2019 statistical parameters. The responses of actions are noisy,\nand in past literature the probability distributions governing them were assumed to be known. In\nour model, we drop this assumption, and we only require knowledge of a lower bound \u03b1w > 1/2 on the\nprobability that action w will provide a correct response, no matter the hypothesis we want to test. As\nfar as we know, no previous work in active sequential learning has tackled the problem of incomplete\nstatistical information and we believe that such an extension may have a non-negligible impact in\nreal-life applications.\nActive hypothesis testing is similar to the problem of Bayesian active learning. 
This latter perspective\nis considered in [11], where a noisy Bayesian active learning setting is used on the hypothesis\ntesting problem with asymmetric noise and a heuristic based on the extrinsic Jensen-Shannon (EJS)\ndivergence [12] is proposed. As in [8], full knowledge of the probability distributions governing\nthe noise is assumed to be available. In contrast, in our work we consider a more restricted model where only a\nbinary outcome with noise is given by the actions on the large hypothesis space. Inference with\nbinary responses is considered in work on generalized binary search (GBS) [13], which is a special\ncase where the label set (outcome of actions) is binary and the noise is symmetric and non-persistent.\nOur work differs from this type of work in that we consider asymmetric label-dependent noise,\nthat is, \u03b1w varies with action w.\nWe thus position our work between [11, 8] and [13]. While the former assume full knowledge of\nthe noise distributions, we assume that only a binary response is provided and only a lower bound\non the value that governs the outcome is known, and while the latter considers symmetric noise, we\nextend to asymmetric label-dependent noise.\n\nOur contribution. Our main objective is to investigate the minimum sample query size of this\nsystem for a certain level of accuracy in the inference of the true hypothesis, and to design ef\ufb01cient\npolicies for this inference. Our contributions in the present paper are as follows. First, we consider the\nsystem under limited knowledge of outcome distribution. This restricted scenario adds a signi\ufb01cant\nconstraint for the action selection policy, and the belief vector update policy. To the best of our\nknowledge, this restricted scenario has not been considered in past literature. 
Second, under the\nlimited knowledge constraint, we propose the Incomplete-Bayesian Adaptive Gradient (IBAG) policy\nwhich includes a belief vector update rule that we call Incomplete-Bayesian, and an action selection\nrule, named Adaptive Gradient, that follows the drift of the (unknown) coordinate of interest in the\nbelief vector. Third, we derive a lower bound on the sample size for the system under incomplete\ninformation, and show that the performance of IBAG matches this bound. We also carry out numerical\nexperiments to compare IBAG to prior work.\n\n2 Model\n\nThe classic model of the active sequential learning problem consists in sequentially selecting one\nof several available sensing actions, in order to collect enough information to identify the true\nhypothesis, as considered in [1]. We thus consider a system where a decision maker has at his\ndisposal a \ufb01nite set of actions W = {1, . . . , W}, and there is a set of J = |J | < \u221e possible\nhypotheses, J = {\u03b81, . . . , \u03b8J}. (For the rest of the paper, we refer to a hypothesis only by its index,\ni.e., j for hypothesis \u03b8j, for ease of notation.) When the true hypothesis is j and action w is sensed, the\noutcome Xw,j \u2208 J is sampled from the distribution pw,j(\u00b7) \u2208 P(J ), i.e., P{Xw,j = j\u2032} = pw,j(j\u2032).\nIn our model, we assume that only limited information about the actions is available, and this affects the classic\nmodel in two ways. 
First, for every sampled action w, a binary outcome y \u2208 {\u22121, 1} is observed,\nindicating whether the hypothesis inferred by this action is in Jw or not, i.e., the response\nobserved is Yw,j \u2208 {\u22121, 1} where\n\n$$Y_{w,j} = \\begin{cases} 1, & \\text{if } X_{w,j} \\in \\mathcal{J}_w, \\\\ -1, & \\text{if } X_{w,j} \\notin \\mathcal{J}_w. \\end{cases}$$\n\nThe subset Jw \u2286 J is assumed to be known, and it is described by the matrix g \u2208 {\u22121, 1}^(W\u00d7J),\nwhere\n\n$$g_{w,j} = \\begin{cases} 1, & \\text{if } j \\in \\mathcal{J}_w, \\\\ -1, & \\text{if } j \\notin \\mathcal{J}_w. \\end{cases} \\qquad (1)$$\n\nObserve that the probability that an action w correctly identi\ufb01es the subset to which the true hypothesis\nj belongs is given by\n\n$$q_{w,j} := P\\{Y_{w,j} = g_{w,j}\\} = \\sum_{j' : g_{w,j} = g_{w,j'}} p_{w,j}(j').$$\n\nHowever, as a second\nrestriction, instead of knowing qw,j, the capacity, or quality, of an action w is captured by \u03b1w where\nwe assume that\n\n$$q_{w,j} \\geq \\alpha_w, \\qquad \\forall\\, j \\in \\mathcal{J},\\ w \\in \\mathcal{W}. \\qquad (2)$$\n\nWe thus characterize each action by its principal set, Jw, and its quality, \u03b1w.\nAssumption 1. For every action w \u2208 W, the principal set Jw \u2286 J and the quality \u03b1w \u2208 (1/2, 1)\nare known. De\ufb01ne \u2206w = 2\u03b1w \u2212 1, where \u2206w \u2208 [\u2206m, \u2206M ] and \u2206m, \u2206M \u2208 (0, 1).\nSince each action can only indicate whether the hypothesis belongs to a subset or not, there must exist\nan action w \u2208 W for which j1 and j2 belong to different subsets, for all pairs j1, j2 \u2208 J . De\ufb01ne the\nsubset Wj1,j2 \u2286 W as Wj1,j2 = {w \u2208 W : gw,j1 gw,j2 = \u22121}.\nAssumption 2. 
For every pair of distinct hypotheses j1, j2 \u2208 J , the subset Wj1,j2 is nonempty, i.e., each hypothesis is globally\nidenti\ufb01able.\nFor every action w \u2208 W and hypothesis j \u2208 J we de\ufb01ne the subsets Jw,+j and Jw,\u2212j which are,\nrespectively, given by the hypotheses that action w cannot and can distinguish from j, i.e.,\n\n$$\\mathcal{J}_{w,+j} = \\{j' \\in \\mathcal{J} : g_{w,j'} g_{w,j} = 1\\}, \\qquad \\mathcal{J}_{w,-j} = \\{j' \\in \\mathcal{J} : g_{w,j'} g_{w,j} = -1\\}.$$\n\nNote that w \u2208 Wj1,j2 if and only if j2 \u2208 Jw,\u2212j1 (or equivalently j1 \u2208 Jw,\u2212j2).\nWe aim to design a simple algorithm to infer the correct hypothesis using as few actions as possible.\nThe true hypothesis will be denoted by j\u2217 \u2208 J . The learning process is captured by the evolution of\nthe belief vector \u03bd(t) \u2208 P(J ), where \u03bdj(t) denotes the decision maker\u2019s con\ufb01dence at time t that\nthe true hypothesis is j. At the initial step t = 1, the belief vector \u03bd(1) \u2208 P(J ) is initialized so that\n\u03bdj(1) > 0, j \u2208 J . Since we assume that no information on the true hypothesis is initially available, without\nloss of generality, we set \u03bdj(1) = 1/J for every j \u2208 J .\nAt every step t \u2265 1, according to the belief vector \u03bd(t), the decision maker determines the next\naction to sense FW (\u03bd(t)) = w(t) \u2208 W according to some selection rule FW (\u00b7). The outcome\ny(t) \u2208 {\u22121, 1} from the chosen action w(t) is used to update the belief vector according to an\nupdate rule FU (\u03bd(t), w(t), y(t)) = \u03bd(t + 1) \u2208 P(J ). The algorithm ends at time T \u2217, and the\ninferred hypothesis is given by \u02c6j = arg maxj\u2208J \u03bdj(T \u2217). 
Sensing stops when one of\nthe posteriors is larger than 1 \u2212 \u03b4, for some \u03b4 > 0:\n\n$$T^* = \\inf_{t \\geq 0} \\Big\\{ \\max_{j \\in \\mathcal{J}} \\nu_j(t) > 1 - \\delta \\Big\\}. \\qquad (3)$$\n\n3 The Incomplete-Bayesian update rule\n\nWe now describe how the decision maker updates the belief vector after he observes the outcome of\nan action. Given a belief vector \u03bd \u2208 P(J ) and the observation y \u2208 {\u22121, 1} obtained from action\nw \u2208 W, de\ufb01ne\n\n$$\\tilde{f}(y, j, w) = \\begin{cases} q_{w,j}, & y = g_{w,j}, \\\\ 1 - q_{w,j}, & y = -g_{w,j}, \\end{cases} \\qquad f(y, j, w) = \\begin{cases} \\alpha_w, & y = g_{w,j}, \\\\ 1 - \\alpha_w, & y = -g_{w,j}. \\end{cases}$$\n\nNote that \u02dcf (y, j, w) denotes the probability of having outcome y given that the action w is chosen and\nthe true hypothesis is j. The standard Bayesian update rule is given by the map F B\nU (\u03bd, w, y), where\n\n$$F^B_{U,j}(\\nu, w, y) = \\frac{\\tilde{f}(y, j, w)\\,\\nu_j}{\\sum_{i \\in \\mathcal{J}} \\tilde{f}(y, i, w)\\,\\nu_i}.$$\n\nIn our model, however, the values qw,j for w \u2208 W are unknown to\nthe decision maker. Hence, we introduce the Incomplete Bayesian (IB) update rule, which mimics the\nBayesian rule, but with limited knowledge of the outcome probabilities. The IB update rule is given by\nthe map F U (\u03bd, w, y), where\n\n$$F_{U,j}(\\nu, w, y) = \\frac{f(y, j, w)\\,\\nu_j}{\\sum_{i \\in \\mathcal{J}} f(y, i, w)\\,\\nu_i}. \\qquad (4)$$\n\nObserve that the Bayesian and IB update rules are identical when qw,j = \u03b1w.\nIn practice, each coordinate \u03bdj(t) evolves according to both the quality of the chosen action, \u03b1w, and the\nrelation between this action\u2019s principal set Jw and the current state of the belief vector \u03bd(t). This\ndependence is formalized in the following lemma whose proof is included in the supplementary\nmaterial, Section B.\nLemma 1. 
Given \u03bd(t) \u2208 P(J ) and w(t) \u2208 W, it holds that\n\n$$\\frac{\\nu_{j^*}(t+1)}{\\nu_j(t+1)} = \\frac{\\nu_{j^*}(t)}{\\nu_j(t)} \\times \\begin{cases} 1, & \\text{w.p. } \\mathbf{1}\\{w(t) \\notin \\mathcal{W}_{j^*,j}\\}, \\\\ \\frac{1+\\Delta_{w(t)}}{1-\\Delta_{w(t)}}, & \\text{w.p. } \\mathbf{1}\\{w(t) \\in \\mathcal{W}_{j^*,j}\\}\\, q_{w(t),j^*}, \\\\ \\frac{1-\\Delta_{w(t)}}{1+\\Delta_{w(t)}}, & \\text{w.p. } \\mathbf{1}\\{w(t) \\in \\mathcal{W}_{j^*,j}\\}\\, (1 - q_{w(t),j^*}). \\end{cases}$$\n\n3.1 A lower bound on the sample size\n\nNote that the IB update rule alone sets some constraints on the performance. In particular, if we\nrequire the error probability to be low, then the expected number of samples is necessarily larger than\na certain quantity depending on the model parameters. We show that this quantity\ngrows as \u2212 log \u03b4 in the asymptotic regime where \u03b4 \u2192 0.\nTheorem 1. Assume the IB update rule is applied to the belief vector and that\n\n$$\\lim_{\\delta \\to 0} P\\{\\nu_{j^*}(T^*) \\leq \\delta\\} \\leq \\tilde{\\gamma} < 1.$$\n\nThen, there exist functions $K^l_0(\\delta)$, $K^l_1(\\delta)$ such that\n\n$$E[T^*] \\geq K^l_1(\\delta) \\log \\frac{1}{\\delta} + K^l_0(\\delta), \\qquad \\lim_{\\delta \\to 0} K^l_i(\\delta) \\geq K^l_i > 0, \\quad \\text{for } i = 0, 1.$$\n\nThe proof of this result is presented in the supplement, Section A.2. We sketch the proof here. 
We\n\ufb01rst de\ufb01ne\n\n$$S_t(j_1, j_2) = \\log \\frac{\\nu_{j_1}(t)}{\\nu_{j_2}(t)}, \\qquad S(j_1, j_2) = S_{T^*}(j_1, j_2),$$\n\nand show that, on the one hand, if P{\u02c6j \u2260 j\u2217} is small, then $\\sum_{j \\neq j^*} S(j^*, j)$ is large with high\nprobability, and on the other hand, if t is small, then $\\sum_{j \\neq j^*} S_t(j^*, j)$ is small with high probability.\nWe use these properties to derive a lower bound on the tail probability of T \u2217, and thus on its expected\nvalue.\nFurther, we can control the belief vector evolution by deriving bounds on the ratio between coordinates\nof the belief vector under the IB policy. Speci\ufb01cally, in the supplementary material Section A.3,\nwe bound the probability that \u03bdj(t) > \u03bdj\u2217 (t) at a certain time, and investigate how this probability\nevolves with t.\n\n4 Adaptive Gradient: the action selection policy\n\n4.1 A gradient-based selection policy\n\nWe now present an action selection policy that, together with the IB update rule, de\ufb01nes our active\nlearning algorithm, which we call the Incomplete-Bayesian Adaptive Gradient (IBAG) policy. We\nwill then analyze the complete algorithm showing that its performance asymptotically matches the\nlower bound provided in Theorem 1 as \u03b4 \u2192 0.\nWe focus on the j\u2217-th coordinate of the belief vector, and de\ufb01ne the drift at time t as\n\nDw(\u03bd(t)) = E[\u03bdj\u2217 (t + 1)|\u03bd(t), w(t) = w] \u2212 \u03bdj\u2217 (t).\n\nSimple algebra and (4) yield the following Lemma.\nLemma 2. 
It holds that\n\n$$D_w(\\nu(t)) = 4 \\Delta_w \\nu_{j^*}(t)\\, \\nu_{w,-j^*}(t) \\left( \\frac{q_{w,j^*} - \\alpha_w + \\Delta_w \\nu_{w,-j^*}(t)}{1 - \\Delta_w^2 \\big(1 - 2\\nu_{w,-j^*}(t)\\big)^2} \\right), \\qquad (5)$$\n\nwhere\n\n$$\\nu_{w,+j} = \\sum_{j' \\in \\mathcal{J}_{w,+j}} \\nu_{j'}, \\qquad \\nu_{w,-j} = \\sum_{j' \\in \\mathcal{J}_{w,-j}} \\nu_{j'}.$$\n\nAssume for a moment that we know the true hypothesis j\u2217 and qw,j\u2217 for every w \u2208 W. Then, in\norder to let \u03bdj\u2217 (t) grow as much as possible, we would greedily select the action w which maximizes\nDw(\u03bd(t)). Our action selection policy will attempt to mimic this greedy\npolicy as closely as possible, while operating without complete information.\nLemma 3. It holds that $D_w(\\nu(t)) \\geq D^L_w(\\nu(t))$, where\n\n$$D^L_w(\\nu(t)) = 4 \\nu_{j^*}(t) \\frac{\\Delta_w^2 \\nu_{-w}^2(t)}{1 - \\Delta_w^2 \\big(1 - 2\\nu_{-w}(t)\\big)^2}, \\qquad (6)$$\n\nand\n\n$$\\nu_{-w}(t) = \\min\\Big\\{ \\sum_{j \\in \\mathcal{J}_w} \\nu_j(t),\\ \\sum_{j \\notin \\mathcal{J}_w} \\nu_j(t) \\Big\\}.$$\n\nThe proof follows from the fact that Dw(\u03bd(t)) is increasing both in qw,j\u2217 and \u03bdw,\u2212j\u2217 (t) for every\nw \u2208 W, and the observation that qw,j\u2217 \u2265 \u03b1w and \u03bdw,\u2212j\u2217 (t) \u2265 \u03bd\u2212w(t).\nNote that $D^L_w(\\nu(t))$ provides us a tight lower bound on the expected growth of the coordinate of the\ntrue hypothesis if action w is chosen at step t. 
Indeed, $D^L_w(\\nu(t))$ can be decomposed into a part that\nuses the j\u2217-th coordinate of the belief vector and a part that can be computed without knowing j\u2217.\nThe Adaptive Gradient (AG) selection rule then chooses, at step t, the action wD(t) \u2208 W such that\n\n$$w^D(t) = F_W(\\nu(t)) = \\arg\\max_{w \\in \\mathcal{W}} G(\\nu_{-w}, \\Delta_w), \\qquad G(v, d) = \\frac{d^2 v^2}{1 - d^2 (1 - 2v)^2}, \\qquad (7)$$\n\ni.e., we select the action maximizing the current lower bound on the expected growth of the j\u2217-\ncoordinate of the belief vector. Ties are broken uniformly at random.\nRemark: Assume the actions have different costs of sensing, with cw denoting the cost of sensing action w. The AG selection rule can then be\ngeneralized as follows:\n\n$$w^D(t) = F^c_W(\\nu(t)) = \\arg\\max_{w \\in \\mathcal{W}} \\frac{G(\\nu_{-w}, \\Delta_w)}{c_w}. \\qquad (8)$$\n\n4.2 An upper bound\n\nWe now present our main result. We show that the expected number of samples required by our\nalgorithm IBAG asymptotically matches the lower bound obtained in Theorem 1.\nTheorem 2. Under the IBAG algorithm, there exist constants $K^u_0, K^u_1 > 0$ independent of \u03b4 such\nthat\n\n$$E[T^*] \\leq K^u_1 \\log \\frac{1}{\\delta} + K^u_0.$$\n\nThe proof is provided in supplementary material, Section A.5. This result is based on the intuition\nthat IBAG never selects an action that is too uninformative relative to the other actions. Speci\ufb01cally,\nthe information provided by an action w at time t depends on its quality \u03b1w and outcome over the\nsubset Jw,\u2212j\u2217. In other words, the value \u03bdw,\u2212j\u2217 must decrease to 0, hence the higher this value is\nfor a given action w, the more we can still learn from sensing this action. As a proxy for \u03bdw,\u2212j\u2217 we\nuse \u03bd\u2212w which also must be as large as possible. The following lemma, whose proof is given in\nsupplementary material, Section B, provides bounds on the relative quality of \u03bd\u2212wD(t) compared to\n\u03bd\u2212w.\nLemma 4. 
For every w \u2208 W, it holds that $\\nu_{-w^D(t)} \\geq \\frac{\\Delta_m}{\\Delta_M}\\, \\nu_{-w}$.\n\n5 Numerical results\n\nWe now present numerical results based on simulations. In order to gain practical insight, we will\nfocus on a task labelling application. A task labelling problem might arise in a crowdsourcing scenario\nsuch as Amazon\u2019s Mechanical Turk or in content search problems where an incoming image must be\nmatched to known contents. The mapping to the hypothesis testing problem is as follows. The set\nof hypotheses J corresponds to the set of task labels, with j\u2217 the true hypothesis being the latent\ntask label that must be inferred. The set of W actions corresponds to W workers who perform the\nlabelling when sampled, where pw,j(j\u2032) is the probability that worker w assigns the task the label j\u2032\nwhen the true label is j. For each worker w, we will call Jw the expertise of the worker (principal\nset of the action), and \u03b1w will be the quality of the worker. We will \ufb01rst investigate the impact of\nthe lack of exact knowledge, i.e., the difference between \u03b1w and qw,j, which we call slack. We then\ncompare our algorithm to that in [1] and that of [13] for a few scenarios of interest.\n\n5.1 The effect of the slack\n\nHere we present a simulated scenario with J = 100, W = 15, and \ufb01xed subsets {Jw}w\u2208W satisfying\nAssumption 2. We set \u03b4 \u2248 0.001, and assume the incoming job-type to be j\u2217 = 1. In Figure 1\nwe present the results of 1000 runs of the simulation for every instance of the \ufb01rst and\nsecond scenarios described below. Recalling that the simulation stops as soon as maxj \u03bdj(t) > 1 \u2212 \u03b4,\nwe note that across the entire set of simulations of these scenarios the algorithm never failed to infer\nthe correct incoming job-type j\u2217 = 1. 
For both scenarios, in Figure 1(left) we display the averaged\nsample paths of the coordinate \u03bdj\u2217 (t) and in Figure 1(right) the average sample size required for the\ndecision maker to make an inference.\n\nThe performance upper bound is pessimistic. In the \ufb01rst set of simulations, scenario A, we \ufb01x\nthe quality vector \u03b1 with \u03b1w \u2208 (0.55, 0.6) for every worker w \u2208 W. We then let the parameter s\nvary in {0, .05, .1, .15, .2, .25, .3} and assume qw,j\u2217 = \u03b1w + s for every w \u2208 W. In Theorem 2 we\nproved an upper bound for E[T \u2217] when the IBAG algorithm is employed. It can be observed that\nthe upper bound does not depend on qw,j\u2217, but only on \u03b1w. In fact, the upper bound is obtained by\nlooking at the worst case scenario, where qw,j = \u03b1w for every w \u2208 W and j \u2208 J . As the slack s\ngrows, the performance of the algorithm drastically improves even though this is not re\ufb02ected in the upper\nbound.\n\nRobustness to perturbations in estimate of worker skills. In the second set of simulations,\nscenario B, we \ufb01x the true quality qw,j\u2217 \u2208 (0.85, 0.9) for every worker w \u2208 W. We then let\nthe parameter s vary in {0, .05, .1, .15, .2, .25, .3} and set \u03b1w = qw,j\u2217 \u2212 s for every w \u2208 W. 
It is\nobserved that the IBAG algorithm performs well even when the decision maker\u2019s knowledge of the\nskills is not precise, and he decides to play safe by reducing the lower bound \u03b1w.\n\nFigure 1: (a) Scenario A, (b) Scenario B. (left) Empirical average of the sample paths of the process \u03bdj\u2217 (t); (right)\nempirical average of the sample size T \u2217.\n\nWe therefore deduce that the learning process strongly depends on the true skills of the workers qw,j\n(Figure 1(a)); however, exact knowledge of these skills is not essential for IBAG to behave well (Figure\n1(b)): it is robust to small perturbations.\n\n5.2 Comparison to existing algorithms\n\nChernoff algorithm. As we mentioned, most of the existing sequential hypothesis testing algorithms\nare based on Chernoff\u2019s algorithm presented in [1]. At step t, this algorithm identi\ufb01es the\njob-types j1, j2 \u2208 J associated with the two highest values of \u03bd(t) and selects the class of workers\nwC that best distinguishes j1 and j2, i.e., $w^C = \\arg\\max_{w \\in \\mathcal{W}_{j_1,j_2}} \\alpha_w$. In the asymptotic regime with\n\u03b4 \u2192 0, the expected sample size required by Chernoff\u2019s algorithm is of order \u2212 log \u03b4, exactly\nas with IBAG. This has been proven ([1, 8]) in the case with full knowledge of the matrix pw,j(\u00b7).\nWhat we emphasize here is that by focusing only on the two highest components of \u03bd(t), the decision\nmaker loses information that might help him make a better selection of worker w(t). In particular,\nChernoff\u2019s algorithm bases its decision largely on the workers\u2019 skills and thus does not behave as\nwell as it should when these are not informative enough.\nSoft-Decision GBS algorithm. The algorithm proposed in [13] generalizes the intuition behind\noptimal GBS algorithms in noiseless environments. 
This algorithm, given a belief vector \u03bd(t) at\nstep t, picks the worker \u00afw such that\n\n$$\\bar{w} = \\arg\\min_w \\Big| \\sum_{j \\in \\mathcal{J}} \\nu_j g_{w,j} \\Big| = \\arg\\min_w \\Big| \\sum_{j \\in \\mathcal{J}_w} \\nu_j - \\sum_{j \\notin \\mathcal{J}_w} \\nu_j \\Big| = \\arg\\max_w \\{\\nu_{-w}\\}.$$\n\nIntuitively, the Soft-Decision GBS algorithm selects the worker that\nis the most \"unsure\", in the sense that the worker splits the belief vector as evenly as possible. Since\nthe model in [13] does not allow for different qualities of the workers (noise is symmetric there), this\nfeature does not play a role in the worker selection policy. Note that when the qualities of all workers\nare identical, the Soft-Decision GBS and the IBAG algorithms are identical. In [13], an asymptotic\nperformance analysis is presented, and under certain constraints on the problem geometry, it is shown\nthat the sample size required is of order \u2212 log \u03b4 + log J, and once again the performance in terms of\nthe error probability matches that of IBAG.\nWe now compare our algorithm IBAG with the Chernoff algorithm under three scenarios and with\nSoft-Decision GBS only for the third scenario, where the quality \u03b1w of the workers (noise in GBS) differs\namong the workers.\nIn the \ufb01rst scenario, we set J = 32, j\u2217 = 1, and \u03b4 = 0.003. We assume two classes of workers.\nWe have 5 \u2018generalist\u2019 workers, each of whom has |Jw| = J/2 = 16 and moreover for every pair of\njob-types (j1, j2) there exists a generalist belonging to Wj1,j2. In addition, we have 32 \u2018specialist\u2019\nworkers who can distinguish exactly 1 job-type, i.e., |Jw| = 1. We assume that there is one specialist\nper job-type, and note that among them there is also w\u2217 such that Jw\u2217 = {j\u2217}. 
We consider two\ncases: in case A, the skills of the workers are identical, \u03b1w = 0.8 for every w \u2208 W, and in case B we\ndrop the generalists\u2019 skill level to \u03b1w = 0.75. We assume qw,j = \u03b1w for every w \u2208 W and j \u2208 J .\nIn the second scenario, we set J = 30 with only specialists present. We set \u03b4 = 0.003 and j\u2217 = 1. In\nthis scenario we consider two cases as well: in case A \u03b1w = 0.7 for every worker, while in case B we\ndrop the skill level of the specialist on job-type j\u2217 to 0.65, representing a situation where the system\nis ill-prepared for an incoming job. We assume qw,j = \u03b1w for every w \u2208 W and j \u2208 J .\n\nFigure 2: (a) Scenario 1, (b) Scenario 2, (c) Scenario 3. (top) Boxplot of the sample size T \u2217. (bottom) Empirical expected number of times the\ndifferent groups of workers are queried.\n\nWe display the results for both scenarios in Figure 2. In Figure 2(top) we display boxplots of the\nnumber of queries required and in Figure 2(bottom) we show the expectation of the number of\nqueries per kind of worker. In both scenarios, the performance of Chernoff\u2019s algorithm is drastically\nweakened by only a tiny variation in \u03b1w, yielding a very different behavior. In the \ufb01rst scenario,\nalthough it is very informative to query the generalists in an early explorative stage, under Chernoff\u2019s\nalgorithm the selection of the workers relies too much on the skill levels and therefore always queries\nthe specialists. 
The IBAG algorithm, on the other hand, sensibly decides at each step on the trade-off\nbetween getting rough information on a larger set of job pairs, or getting more precise information on\na smaller set, and seems to better grasp this quality vs quantity dilemma.\nSimilarly, in case B of the second scenario, the low-quality workers (the specialist in j\u2217) are never\nselected by Chernoff\u2019s algorithm, even if their responses have a large impact on the growth of \u03bdj\u2217 (t).\nFor both cases A and B we see that IBAG outperforms Chernoff.\nIn the third scenario we set J = 32, W = 42, and \u03b4 = 0.03. We have \ufb01ve low-quality generalist\nworkers with \u03b1w = 0.55 and \ufb01ve high-quality generalist workers with \u03b1w = 0.75. The remaining\n32 workers are specialists with \u03b1w = 0.8. The plots comparing all three algorithms are shown in\nFigure 2(c). We observe again that the Chernoff algorithm never queries generalists and performs\nthe worst. IBAG outperforms Soft-GBS because it queries high-quality workers preferentially while\nSoft-GBS doesn\u2019t consider quality.\n\n6 Discussion and conclusion\n\nWe have presented and analyzed the IBAG algorithm, an intuitive active sequential learning algorithm\nwhich requires only a rough knowledge of the quality and principal set of each available action.\nThe algorithm is shown to be competitive and in many cases outperforms Chernoff\u2019s algorithm, the\nbenchmark in the area.\nAs far as we know, this is the \ufb01rst attempt to analyze a scenario where the decision maker has limited\nknowledge of the system parameters. In Section 5 we studied, through simulations, the effect of\nthis lack of exact knowledge on the performance of the system, in order to quantify the tradeoff\nbetween caution, i.e., how close \u03b1w is to qw,j, and the cost. The numerical analysis suggests that\nmoderate caution does not drastically worsen the performance. 
In Section C of the supplement we\nformally analyze this tradeoff and show results on how cautious the decision maker can be while still\nensuring good performance.\nA further element of incomplete knowledge would be to allow slight perturbations on the principal\nsets of the actions. In the present paper we have assumed that we know with certainty, for every w \u2208 W\nand j \u2208 J, whether w has j in its principal set (j \u2208 Jw) or not. In future work we will investigate\nthe impact of uncertainty in the expertise, for instance having j \u2208 Jw with some probability pj,w.\nAs a last remark, it would be interesting to analyze the model when the different actions have\nheterogeneous costs. Note that the IBAG algorithm naturally extends to such a case, as mentioned in\nequation (8). The IBAG algorithm in the framework of the task-worker system could give definitive\nanswers on whether it is better to sample a response from a cheap worker with a general expertise\nand low skill or from more expensive workers with narrow expertise and higher skill.\n\nReferences\n[1] H. Chernoff, \u201cSequential design of experiments,\u201d The Annals of Mathematical Statistics, vol. 30, no. 3, pp. 755\u2013770, 1959.\n[2] S. Berry, B. Carlin, J. Lee, and P. Muller, Bayesian Adaptive Methods for Clinical Trials. CRC Press, 2010.\n[3] S. C. Hui and G. Jha, \u201cData mining for customer service support,\u201d Information & Management, vol. 38, no. 1, pp. 1\u201313, 2000.\n[4] N. Vaidhiyan, S. P. Arun, and R. Sundaresan, \u201cActive sequential hypothesis testing with application to a visual search problem,\u201d in 2012 IEEE International Symposium on Information Theory Proceedings (ISIT), pp. 2201\u20132205, IEEE, 2012.\n[5] B. Ghosh, \u201cA brief history of sequential analysis,\u201d Handbook of Sequential Analysis, vol. 1, 1991.\n[6] A. 
Albert, \u201cThe sequential design of experiments for infinitely many states of nature,\u201d The Annals of Mathematical Statistics, vol. 32, pp. 774\u2013799, 1961.\n[7] J. Kiefer and J. Sacks, \u201cAsymptotically optimum sequential inference and design,\u201d The Annals of Mathematical Statistics, vol. 34, pp. 705\u2013750, 1963.\n[8] M. Naghshvar and T. Javidi, \u201cActive sequential hypothesis testing,\u201d The Annals of Statistics, vol. 41, no. 6, pp. 2703\u20132738, 2013.\n[9] A. Lalitha, A. Sarwate, and T. Javidi, \u201cSocial learning and distributed hypothesis testing,\u201d in Information Theory (ISIT), 2014 IEEE International Symposium on, pp. 551\u2013555, IEEE, 2014.\n[10] R. Olfati-Saber, J. Fax, and R. Murray, \u201cConsensus and cooperation in networked multi-agent systems,\u201d Proceedings of the IEEE, vol. 95, no. 1, pp. 215\u2013233, 2007.\n[11] M. Naghshvar, T. Javidi, and K. Chaudhuri, \u201cNoisy Bayesian active learning,\u201d in Communication, Control, and Computing (Allerton), 2012 50th Annual Allerton Conference on, pp. 1626\u20131633, IEEE, 2012.\n[12] M. Naghshvar and T. Javidi, \u201cExtrinsic Jensen-Shannon divergence with application in active hypothesis testing,\u201d in IEEE International Symposium on Information Theory (ISIT), 2012.\n[13] R. Nowak, \u201cNoisy generalized binary search,\u201d in Advances in Neural Information Processing Systems, pp. 1366\u20131374, 2009.", "award": [], "sourceid": 2143, "authors": [{"given_name": "Fabio", "family_name": "Cecchi", "institution": "Eindhoven University of Technology"}, {"given_name": "Nidhi", "family_name": "Hegde", "institution": "Nokia Bell Labs"}]}