Sun Dec 8th through Sat the 14th, 2019 at Vancouver Convention Center
This is an interesting paper, and well written. Overall I like the contributions. I have the following comments to consider. I am not sure "feedforward" is an appropriate prefix for the technique, as it seems to suggest that the approach is feedforward neural networks based. Though, it is completely upto the authors. If I understand correctly, the proposed approach differs from the prior well known technique, one-coin Dawid-Skene model, in the sense that there is a prior distribution on worker accuracies. I suppose, similar prior model has been considered in other works? In section 3, although references are provided for known concepts used in the proposed technique, it doesn't give a clear picture on the novelty of the technique. May be, there should be a related work section. It is claimed that the proposed technique differs from the previous works, in terms of employing variational mean field approach, in the sense that KL divergence is minimized between the true and approximate posterior. It would be good to provide expressions for KL divergence, and to show how it becomes tractable to minimize it for the assumed prior distributions (so the posterior) and the mean field approximation of the posterior. In the previous works, coordinate decent is used for optimization, whereas it is proposed to perform (stochastic) updates, using a simple observation in each step, leading to local analytical optimizations. I suppose, the updates become analytical because of the specified beta distribution for the prior? Is it necessary to perform updates using a single observation? That seems like an extreme scenario even for online settings. How about updating with small sized batches of observations? I would like to mention it here that, in the recent literature of machine learning, information theoretic objectives like KL-divergence, mutual information have been used for optimization even if not computable analytically. And, it is a standard practice to use stochastic gradient rather than gradient. While it is interesting to see similar concepts being used for Bayesian inference, it should not come as a big surprise in terms of novelty. A small suggestions: section 3.2 could benefit from a sketch so as to explain the reasoning behind reordering the labels. In Theorem 1 and 2, the expressions for F(.) are too complicated to have clear interpretations, though useful to compute the upper bounds numerically. The authors do not provide any statements or remarks regarding the interpretability of the theorems. May be, the expressions should be simplified in the original theorems or in the corollaries, just a thought. In section 5, for synthetic data, Monte Carlo sampling seem to be outperforming the others. Though, its performance is not as good for the real datasets. Any intuitions on why that is the case? Also, I didn't see a reference for MC method. Is it true that the value of parameters, alpha, beta, are assumed to be known, rather than optimized or tuned? There are no errors bars, or statistical significance numbers for the experimental results presented in Section 5. I don't know how impactful the empirical evaluation is. I don't see significant improvement in accuracy numbers. What does this statement mean: "average accuracy just above 0.5"? ----------------- The rebuttal clarifies on the questions asked in the review well. I like this paper, and vote for its acceptance. Also, increasing the score from 7 to 8.
Overview: This paper proposes a Bayesian algorithm to aggregate the labels in a crowdsourced application. The authors consider the standard Dawid-Skene one coin model, in which each worker j is associated with parameter p_j, which represents the probability of the worker correctly assigning a label. In the Bayesian setting, the authors assume that the parameters p_j’s follow a Beta distribution. Computing the exact posterior distribution is computationally difficult. Therefore, the authors use variational approximation in which the posterior distribution is assumed to have a product form. Even with the variational approximation, parameter inference is computationally difficult. Therefore, the authors propose to perform a single coordinate descent step as opposed to an expensive coordinate descent step. The authors derive bounds for error rates. They also demonstrate the efficacy of their methods on both synthetic and real-world datasets. Strengths of the paper: The paper is tackling the classic problem of aggregating labels in a crowdsourced application. The paper is focusing on speed. The algorithms proposed are fast and simple to implement. Plus, they come with theoretical guarantees on the bounds for error rates. The fact that the theoretical bounds are close to empirical error rates is impressive. Weaknesses: The paper has the following main weaknesses: 1. The paper starts with the objective of designing fast label aggregation algorithms for a streaming setting. But it doesn’t spend any time motivating the applications in which such algorithms are needed. All the datasets used in the empirical analysis are static datasets. For the paper to be useful, the problem considered should be well motivated. 2. It appears that the output from the algorithm depends on the order in which the data are processed. This should be clarified. 3. The theoretical results are presented under the assumption that the predictions of FBI converge to the ground truth. Why should this assumption be true? It is not clear to me how this assumption is valid for finite R. This needs to clarified/justified. 3. The takeaways from the empirical analysis are not fully clear. It appears that the big advantage of the proposed methods is their speed. However, the experiments don’t seem to be explicitly making this point (the running times are reported in the appendix; perhaps they should be moved to the main body). Plus, the paper is lacking the key EM benchmark. Also, perhaps the authors should use a different dataset in which speed is most important to showcase the benefits of this approach. Update after the author response: I read the author rebuttal. I suggest the authors to add the clarifications they detailed in the rebuttal to the final paper. Update after the author response: I read the author rebuttal. I suggest the authors to add the clarifications they detailed in the rebuttal to the final paper. Also, the motivating crowdsourcing application where speed is really important is not completely clear to me from the rebuttal. I suggest the authors clarify this properly in the final paper.
The authors proposed Feedforward Bayesian Inference (FBI) for aggregating crowdsourced annotations in binary tasks. FBI follows the one-coin DS model, but adopts a streaming variational Bayes approach. It has two variants, namely Fast FBI and Sorted FBI. Fast FBI could run as fast as majority voting, while Sorted FBI archives better performance but runs slower. Bounds for the asymptotic accuracy of the two variants in both non-adaptive and adaptive settings are provided. Empirical results validates the bounds and show the efficiency and effectiveness of the two variants. The paper is well-organized and easy to follow. Efficient/real-time truth inference for crowdsourcing is an important and relevant problem. The proposed FBI algorithms are novel and are suitable for streaming settings but don’t achieve the best performance on average when all labels have been collected. Some questions: In practice, the prior distribution q may not be known in most cases. How will it affect the FBI algorithms if q is not known? How does Sorted FBI work in the streaming setting? From Algorithm 2, it seems that the Fast-FBI algorithm is only called when all labels have been collected (because the call is outside the t loop). What would Algorithm 2 look like if the t loop is the outer loop?