Review for NeurIPS paper: Confidence sequences for sampling without replacement

NeurIPS 2020

Confidence sequences for sampling without replacement

Review 1

Summary and Contributions: The paper extends results for sequential hypothesis testing and confidence sequences to the case where data is sampled without replacement. ===== Update: I have read the rebuttal and look forward to seeing the final version of the paper.

Strengths: The paper is presents interesting and novel techniques. It is very well written with a good discussion of issues and choices one faces when applying the methods. While sequential inference problems are relevant to NeurIPS, the problem seems somewhat niche. However, it presents an interesting and potentially high impact possible use case in election auditing.

Weaknesses: The primary weakness is the limited scope of problems which it can offer significant advantages over with replacement sampling. In some cases the work seems to have limited benefit over with replacement sampling. In figure 1D show, the benefits of without replacement sampling for uniform distributions appear to be so slight as to be nearly unreadable from the plot, and one would expect it only appears when the sampled fraction is a large of the population size.

Correctness: Yes, I believe they are. The high level explanations are consistent with the given results though I did not verify the proofs in detail.

Clarity: The paper is very well written. The main suggestion is to reduce the liberal use of quotes and underlining. e.g. "peeking", "p-hacking". I'm not sure why 'filtration' is in quotes in line 80. Is it not a filtration? Underlining the part on 'working priors/posteriors' seems excessive since it's not important to the contributions of the paper.

Relation to Prior Work: Yes

Reproducibility: Yes

Additional Feedback: Overall, the good "science" of the paper makes what will possibly be lasting contributions to the field, not just short term incremental improvements. It would be nice to discuss briefly the advantages/disadvantages of the novel mean estimator in eq 3.1. In particular, the estimator appears to be inconsistent even though it is unbiased since it seems to weight items that occur early more heavily. What are the implications of that? Is there then a point at which without replacement confidence interval is worse than a with replacement confidence interval? details: In Figure 4 (left), it's not clear that the value at t_1=100 is visible on the plot or what the difference in the CIs at that time.

Review 2

Summary and Contributions: The paper provides a novel approach to constructing confidence sequences, i.e., confidence sets (intervals) for parameter estimates that are true simultaneously in all steps of a sequentially refined estimation process. The key idea/result is that if one defines a ‘working prior’ over the parameter space then, at all times, the set of parameters that have a prior/posterior probability ratio of less than 1/alpha form a confidence sequence. Importantly, the choice of the prior does not influence the frequentist validity of the confidence sequence but merely influences its shape. The authors apply this idea to a range of discrete and continuous parameter estimation problems by constructing suitable priors with tractable prior/posterior ratios.

Strengths: The authors attack an extremely important problem that we all face in empirical science where experiments are most naturally performed sequentially and one needs practical but probabilistically sound stopping criteria. Furthermore, the paper and the appendix are full of impressive probabilistic results that are required to apply the core idea to the various settings. Finally, despite the inherent complexity of the topic, the authors have created an accessible exposition and provide well designed didactic application examples.

Weaknesses: I don’t see major weaknesses. See below for a few suggestions for improvement.

Correctness: I did not check the proofs of the main results, which are extensive and only provided in the appendix. However, all the statements are plausible, and the content of the main paper is technically rigorous. Also the authors provide a range of examples which empirically support all key claims, most importantly that the true parameter is contained in the complete confidence sequence, and also that the confidence sequences follow expected shapes (e.g., when comparing the Bernstein versus the Hoeffding style approaches for estimating real-valued parameters).

Clarity: As mentioned above, the paper is excellently written with an accessible introduction to the general problem setting including canonical examples for the different specific settings. At two points I felt some additional details could be provided to further increase the accessibility to non-expert readers: a) The definitions of filtrations and martingale are not recapped. While this is understandable given the space limitations, it constitutes quite a large gap from the intuitive examples. b) In contrast to the other applications, the introduction to Shapley values does not feel self-contained.

Relation to Prior Work: The paper generally does an excellent job in discussing relations to traditional approaches to the construction of confidence sets. A notable potential improvement could be to also include experiments that contrast the confidence sequence approach with traditional confidence bounds that only hold at a single moment in time (i.e., predefined stopping time).

Reproducibility: Yes

Additional Feedback: T_n in page 2, l47 should be T_m UPDATE: I thank the authors for their response to my comments and their intention to use them to further improve the (already very good) presentation of their work.

Review 3

Summary and Contributions: Paper provided CS under sampling without replacement, by a method of PPR martingale. Cases with binary values and real values are discussed. --- I'd like to thank the authors for the discussion and addressing my comments. My score stays as top 50% acceptance.

Strengths: The work addresses important problem of constructing CS under the setup of sampling without replacement. Several imporatant cases are carefully analyzed and corresponding methods, along with associated theory are developed. Paper is very clear and easy to read and the problem is of great interest for researchers in other fields like politics and game theory.

Weaknesses: My only concern is around N - sometime we are sampling without replacement, and without knowing the population size N (e.g. if doing political polls, the analyst usually have a target sample size in mind, say I am gonna sample 1k this time, but the true population under there is not known) Is this method still viable in this case with another layer of randomness? Also some examples are useful in illustrating the type of cases considered, but not quite useful for adapting the method. e.g. permutation test - one can totally do a sampling without replacement (actually that would be easier, just do sampling without replacement on [n] independently), and as n! is usually very large, the difference between CS on sampling with/without replacement would be very small (I am making this claim w/o deriving, but purely on intuition). So another example may be better for selling the paper.

Correctness: I checked the prop 2.1 but not the rest. Prop2.1 is solid.

Clarity: Yes. Clear and easy to read.

Relation to Prior Work: Yes

Reproducibility: Yes

Additional Feedback:

Review 4

Summary and Contributions: This paper develops a variety of confidence sequences for sampling without replacement. In particular, confidence sequences are constructed for categorical observations and the mean of a finite set of bounded real numbers.

Strengths: The main strength is its theoretical contribution, not only the proposed four confidence sequences, but also new technical tools such as prior-posterior-ratio martingale and two additional exponential supermartingales in eq (3.3) and (3.7). These make this paper not a naïve application/extension of existing confidence sequence results, but of its own interest.

Weaknesses: The PPR framework in Section 2 is restricted to the case where the posterior distribution is computable, and the supermartingales setting in Section 3 are designed specifically for the mean of a finite set of bounded real; these restrictions prevent the proposed CS to be used in more general settings.

Correctness: Correct.

Clarity: This paper is well written and well organized.

Relation to Prior Work: Yes.

Reproducibility: Yes

Additional Feedback: The four motivating examples are of limited interest within the machine learning community, and the author might add one or more examples that could have a stronger connection with the community, Moreover, this paper focuses on the discrete categorial setting and a finite set of bounded real, so it might be interesting to add some discussion for more general model and enlarge the applicability of the proposed CS. For the PPR method, the example in sec 2.2 utilize the advantage of a conjugate prior, but that is not always possible; so I am wondering when the posterior distribution is not available in close-form, can we use some sampling method (like MCMC) such that the proposed CS can still be applied and what about the computation complexity? Minor comments: There lacks several right bracket in the equation in step 1 of Appendix A.1. The notation J in Thm 3.1 appears abruptly without any further explanation. -------------------------------------------------------- I have read the author response and I stand by my original score (vote for acceptance).