Paper ID: 460

Title: The Power of Adaptivity in Identifying Statistical Alternatives

The paper studies the following problem. There is a bag with infinitely many coins; a drawn coin has bias \theta_0 with probability \alpha and bias \theta_1 < \theta_0 with probability 1-\alpha. In each step you either retoss the previous coin or draw a fresh coin. The goal is to find a coin with bias \theta_0 with probability at least 1-\delta while minimizing the number of coin tosses. Previous literature: When the exact values of (\alpha, \theta_0, \theta_1) are known, tight bounds on the number of coin tosses were previously known. Additionally, that algorithm may keep more than one coin out of the bag at a time. Contributions of the paper: a) For the setting where (\alpha, \theta_0, \theta_1) are unknown, it gives an adaptive algorithm (the number of times the current coin is tossed is chosen adaptively) that achieves, up to log factors, the same bounds as the algorithm that knows (\alpha, \theta_0, \theta_1). The algorithm is non-trivial. b) The paper also shows a lower bound for the class of non-adaptive algorithms that first toss N coins m times each, estimate (\alpha, \theta_0, \theta_1) from these tosses, and then run the previous algorithm that assumes knowledge of (\alpha, \theta_0, \theta_1). They show that such algorithms require quadratically more tosses than adaptive algorithms.
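For concreteness, here is a minimal simulation of this setup in this review's notation (heavy bias \theta_0, light bias \theta_1 < \theta_0). The strategy shown is only the simple known-parameters baseline, with a Hoeffding-style toss budget; it is a sketch, not the paper's algorithm.

```python
import math
import random

def make_coin(alpha, theta0, theta1, rng):
    # A fresh coin is heavy (bias theta0) w.p. alpha, light (bias theta1) otherwise.
    return theta0 if rng.random() < alpha else theta1

def find_heavy_coin(alpha, theta0, theta1, delta, rng):
    """Known-parameters baseline: toss each fresh coin a fixed number of
    times and keep it iff its empirical mean clears the midpoint threshold."""
    gap = theta0 - theta1
    # Hoeffding-style budget so every comparison is correct w.h.p.
    m = int(math.ceil(8 * math.log(2 / (delta * alpha)) / gap**2))
    total_tosses = 0
    while True:
        bias = make_coin(alpha, theta0, theta1, rng)
        heads = sum(rng.random() < bias for _ in range(m))
        total_tosses += m
        if heads / m > (theta0 + theta1) / 2:
            return bias, total_tosses

rng = random.Random(0)
bias, cost = find_heavy_coin(alpha=0.1, theta0=0.7, theta1=0.4, delta=0.05, rng=rng)
print(bias, cost)  # the returned coin should (w.h.p.) have bias theta0 = 0.7
```

The adaptive algorithms in the paper improve on this by stopping a coin's tosses early once it looks light, instead of spending the full budget m on every drawn coin.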

Nice theoretical work. A bit more detail on applications would strengthen the paper further.

2-Confident (read it all; understood it all reasonably well)

The paper considers the advantage of applying adaptive strategies to the problem of finding a coin with mean \theta_1 in an (infinitely) big bag containing both heavy coins (= having mean \theta_1) and light ones (= having mean \theta_0 < \theta_1). The paper proposes solutions that weaken the requirements of previous works. In particular, it does not require prior knowledge of the parameters. In addition to that, the results also generalize the original setting, which could lead to further interesting developments.

The presentation should be improved somewhat. For example, the lower bound in line 148 seems much larger than the upper bound in line 158. The resolution seems to be that the former concerns problem P1 and the latter does not, but this separation should be presented much more clearly. Similar confusion between the various models occurs in other places too. This does not make the paper unreadable, but it would be much easier to follow if these small issues were taken care of. Apart from this, the authors did a good job of motivating the problem and emphasizing the most important details (including Table 1, which presents the main results in an easy-to-digest form), and all this makes the paper really enjoyable. The proposed solutions offer some nice novel insights into the problem, and the proposed methods improve seriously on existing methods, most notably in applicability. The most valuable contribution, however, might be the generalization of the problem itself, which potentially opens up novel topics for future research, both in theory and in application.

Additional questions and remarks:
- Extending the math formula in lines 169-170 with '=\int(dP/dQ)dP-1' would be helpful.
- Please explain the inequality in line 173. The lower bound in line 192 should also be explained.
- Please move the sentence in lines 173-174 outside the example.
- Please rephrase the sentence in lines 256-257.
- Please add a short note to the paper explaining why always halving \epsilon_0 and \alpha_0 in parallel could not be applied in Algorithm 3 without losing the sample complexity guarantees. (This is obvious from the analysis, but deserves to be discussed briefly in the main part of the paper too. The reader might otherwise be left with questions.)
- Line 534 in the supplementary material should refer to Lemma 2 instead of Lemma 1.
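The extension the reviewer requests presumably follows from the definition of the chi-square divergence (assuming lines 169-170 of the paper state it in Radon-Nikodym form, with P absolutely continuous with respect to Q):

```latex
\chi^2(P \,\|\, Q)
  \;=\; \int \left(\frac{dP}{dQ} - 1\right)^2 dQ
  \;=\; \int \left(\frac{dP}{dQ}\right)^2 dQ \;-\; 1
  \;=\; \int \frac{dP}{dQ}\, dP \;-\; 1 .
```

The middle step expands the square and uses \int (dP/dQ)\, dQ = \int dQ = 1; the last step rewrites (dP/dQ)\, dQ as dP.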

2-Confident (read it all; understood it all reasonably well)

This paper provided mathematical analyses of what the authors call the "most biased coin" problem. This is to identify the maximally biased coin out of a bag of "coins" (Bernoulli random variables with different parameters), where the generating process depends on alpha (the probability of drawing a "heavy" coin vs. a "light" coin), theta_1 (the probability of success for a heavy coin), and theta_0 (the probability of success for a "light" coin, which is always less than theta_1). The authors considered the novel case where alpha, theta_1, and theta_0 are unknown, and analyzed adaptive vs. non-adaptive strategies for solving the "most biased coin" problem. For the cases where alpha, theta_1, theta_0 were known and unknown, they proved lower and upper bounds for algorithms that were non-adaptive (the fixed "estimate-then-explore" strategy) and adaptive, placing their novel bounds in the context of what was already known. They considered how their analysis and proofs of bounds connect to multi-armed bandit problems, and how they could be extended from "coins" (Bernoulli random variables) to detecting the presence of mixtures of exponential family distributions.

> The connections to anomaly detection could be better spelled out.
> It wasn't completely clear to what extent the authors see the connections they draw between most biased coins and other kinds of detection of mixture distributions, and the subsequent implications of their results, as being important novel contributions of the paper. E.g. see "2 From Identifying Coins to Detecting Mixture Distributions".
> The authors could better motivate the problem in the intro and other parts. Diving into formal details before grounded examples made it hard to know what they were getting at, why this problem is important, and how widespread it is. E.g. it wasn't immediately clear how the most biased coin problem is relevant to "automated hiring of crowd workers for data processing tasks, anomaly and intrusion detection, and discovery of vacant frequencies in the radio spectrum".

*Writing*
> Writing should be clearer and less convoluted; there are too many confusing clauses to parse. E.g. "We can model each worker's performance (e.g. accuracy or speed) as a random variable with some expected performance so that selecting a good worker is equivalent to identifying a worker with a high mean." E.g. 2: What is the (1) referring to in "comes within log factors of the known information-theoretic lower bound and (1) which is achieved by an algorithm that knows the parameters"?
> Up until this point I wasn't clear whether it was a 'standard' problem or just how you were describing it: "The most biased coin problem was first proposed by Chandrasekaran and Karp [7]."
> The link to MAB was initially confusing until it was clear you were talking about a fixed-confidence vs. fixed-budget formulation, which is then no longer true when you go to the infinite-armed problem. People shouldn't have to read your reference to understand "In the best-arm identification problem".

1-Less confident (might not have understood significant parts)

This paper studies a heavy-coin detection problem where there are (potentially infinitely) many coins, of which a $1-\alpha$ fraction are light coins with parameter $\theta_0$ and an $\alpha$ fraction are heavy coins with parameter $\theta_1 (> \theta_0)$. This paper proposes an adaptive strategy that is based on estimation (of the thetas) and exploration (to find heavy coins), but also improves upon it using a doubling trick. Hardness results for other alternatives (particularly the baseline estimate-then-explore strategy without the doubling trick) are used as evidence for the novelty of this contribution.
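The doubling trick mentioned here follows a standard pattern; as a rough sketch (with hypothetical helper names, not the paper's actual Algorithm 3): rerun a budgeted subroutine with a geometrically growing budget until it reports success, paying at most a constant factor over the cost of the unknown "right" budget.

```python
def doubling_search(try_with_budget, max_rounds=30):
    """Generic doubling trick: call try_with_budget(B) with B = 1, 2, 4, ...
    until it returns a non-None result. try_with_budget(B) must return a
    (result_or_None, cost) pair. If some budget B* suffices, the total cost
    is at most ~4*B* (geometric sum up to the first power of two >= B*)."""
    budget = 1
    total_cost = 0
    for _ in range(max_rounds):
        result, cost = try_with_budget(budget)
        total_cost += cost
        if result is not None:
            return result, total_cost
        budget *= 2
    return None, total_cost

# Toy subroutine: succeeds only once the budget reaches 100 tosses.
result, total = doubling_search(lambda b: ("ok", b) if b >= 100 else (None, b))
print(result, total)  # budgets 1, 2, ..., 128 are tried; total cost 1+2+...+128 = 255
```

In the paper's setting the budgeted subroutine would be a run of the known-parameters algorithm under the current guesses of the unknown parameters, which is what makes the overall strategy adaptive without prior knowledge.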

This problem is interesting, and there is a history of discussion of similar topics in EE/ML/TCS proceedings, particularly in the multi-armed bandits community. There is some novelty in this paper that could be a nice contribution. However, my biggest problems with the paper are the organization of the material and a lack of discussion of some key papers that are very relevant. The authors' discussion of the background is not entirely comprehensive. For example, eq. (1) is fairly common knowledge in multi-armed bandits, dating back to Lai and Robbins 1985. A very relevant literature for discovery of all heavy coins in a finite-arm case is studied in Bubeck et al., ICML 2013. Additionally, these papers do not require knowledge of $\theta_0, \theta_1$ either, and further allow each coin to have its own parameter while the result adapts to the actual realization (e.g., if the parameters fit the model in the paper under review, then an equivalent of (1) can be obtained). Nonetheless, the paper under review bears some novelty because the other two papers focused on discovery of all "heavier coins," whereas this paper is able to find one heavy coin using only amortized complexity.

The connection to the hypothesis test (P1) is also unclear -- unless the authors suggest that there are many distributions that are either H_0 or H_1 and one wants to find one distribution that belongs to H_1. I am not sure this is the idea of a hypothesis test. Also, why would I want just one example distribution from H_1 instead of finding all distributions belonging to H_1? The second equality in the line above (2) contains a typo. It makes sense to write dP/dQ as a Radon-Nikodym derivative when P is absolutely continuous with respect to Q; however, it makes no sense to write dP in isolation in the integrand. I would instead write it in terms of a pdf or pmf.

Section 3 is tedious and confusing. I would put it in the appendix, and only state why each term should appear in the result (instead of explaining how one should go about proving the theorems). The paragraph below 3.2 is particularly confusing; I skipped directly to Table 1. In Section 4, what is SPRT? Algorithm 3 may need a better introduction, as my understanding of it is an informed guess based on the term "doubling trick." I did not check the proofs of Theorem 5; it seems to be a core part of the paper, but it is also at the very end of the paper. I can read it further upon request.

As for novelty/impact: if I perceive this paper only as an algorithm that finds one heavy coin with sample complexity equal to the amortized cost of finding all heavy coins (and assuming the proofs are correct), it has some novelty. However, the setting is rather restrictive (there are only two types of arms). As the authors suggest, neither this paper nor the Carpentier and Valko paper (assuming arm parameters follow an exponential distribution) covers the full flexibility of the original best-arm identification problem. These assumptions on the arm distribution may hinder real-world applications. The authors should also revise the paper for clarity and put more emphasis on the algorithms and some real-world motivations.
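SPRT presumably stands for Wald's sequential probability ratio test, a natural subroutine for deciding whether a given coin is light ($\theta_0$) or heavy ($\theta_1$) with an adaptive number of tosses. A minimal Bernoulli version (a sketch of the standard test, not necessarily the paper's exact subroutine) looks like this:

```python
import math

def sprt_bernoulli(tosses, theta0, theta1, alpha=0.05, beta=0.05):
    """Wald's SPRT for H1: theta = theta1 (heavy) vs. H0: theta = theta0
    (light), given an iterable of 0/1 tosses. alpha/beta are the target
    type-I/type-II error rates. Stops as soon as the log-likelihood ratio
    crosses a threshold; returns 'undecided' if the sample runs out."""
    upper = math.log((1 - beta) / alpha)   # accept H1 (heavy) above this
    lower = math.log(beta / (1 - alpha))   # accept H0 (light) below this
    llr = 0.0
    for x in tosses:
        if x:
            llr += math.log(theta1 / theta0)
        else:
            llr += math.log((1 - theta1) / (1 - theta0))
        if llr >= upper:
            return "H1"
        if llr <= lower:
            return "H0"
    return "undecided"

print(sprt_bernoulli([1] * 20, theta0=0.4, theta1=0.7))  # a run of heads -> "H1"
```

The appeal for this problem is exactly the adaptivity: a clearly light coin is discarded after only a few tosses, rather than after a fixed worst-case budget.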

1-Less confident (might not have understood significant parts)

The authors propose an algorithm to solve the so-called 'most biased coin problem' when the parameters of the problem are unknown. The claim is that their approach uses fewer samples than the 'estimate-then-explore' approach. The overall idea is that there are infinitely many coins of two types, with different probabilities of 'heads'. The objective is to pick coins one by one and raise an alarm the first time a coin with a high probability of heads is detected. A picked coin from the infinite collection can be tossed many times to estimate its bias. This results in two types of exploration: exploring new coins, or exploring (tossing) a given coin again and again. The authors claim that this problem has many applications: crowdsourcing, spectrum sensing in cognitive radios, and anomaly and intrusion detection.

I think the paper addresses an interesting problem and should be accepted into the conference. I have some comments that the authors can use to improve the paper.
1. I think the problem can be motivated in a better way. It is not obvious how the problem is applicable in crowdsourcing, spectrum sensing in cognitive radios, and anomaly and intrusion detection. For example, in the crowdsourcing application, are you trying to find the first good candidate? In spectrum sensing, are you going to use the same spectrum again and again to find out if there is a collision? What I am trying to say is that you should map it to a concrete problem and not just mention an area. So the problem sounds mathematically interesting, but is not sold well for applications. I can see an explanation in lines 48-58, but it is not very convincing to me.
2. Line 74: 'omits' should be 'emits'?
3. Line 80: \mu is earlier defined as a mean, so the probability statement in this line is not very concrete. Something is missing here.
4. Algorithm 1: this description is unclear. Is the variable N used both for the stopping time and for the index of the iteration?
5. What does 'heavy' mean in the abstract description between lines 113-119: \theta_1 > \theta_0?
6. Theorem 2: You may want to specify where Problem P2 is stated. The paper is dense; it is hard to search back.
7. Why choose 'the most biased coin problem' as the name? It does not make sense to me. There are many coins with high bias, so how do you define what is most biased? Even if you phrase it as finding the first coin with the most bias among those observed, it still is not a convincing name. Perhaps a different name?
8. The paper title sounds too generic (and grand?) for what is being done in the paper.
9. What is the technical difference between 'adaptive' and 'fully adaptive'? To me 'fully' is redundant. The choice should be between 'adaptive' and 'semi-adaptive'.

1-Less confident (might not have understood significant parts)