Sun Dec 8th through Sat the 14th, 2019 at Vancouver Convention Center
The paper presents the outcome of reproduction efforts for 255 prior studies, analyzing the relation between reproduction success and approximately 25 features (some quantitative, some more subjective/qualitative) extracted from these papers. The authors present and discuss the features that significantly predict reproducibility, as well as those not found to be significant despite the authors' expectations (e.g., conceptualization figures, or step-by-step examples presented in the paper).

The paper is well written, and I found the discussion insightful. The analysis of the data is based on traditional statistical testing. Despite some shortcomings (e.g., a biased sample, also acknowledged by the authors), this is a rather large-scale study, and it contributes to the ongoing discussion on replicability/reproducibility in the field. The main weakness of the study is the biased sample.

-------

After author response: Thank you for your response. The additional discussion/information suggested will improve the paper, and my "automatic text classification" suggestion was not realistic to incorporate in this paper anyway. After reading the other reviews and the author response, I (still) think that this paper provides valuable data and may stimulate important/useful discussion in the field. I believe it should be accepted.
The topic of evaluating paper reproducibility is important to a growing research community. This paper conducts an empirical study on what impacts the reproducibility of research papers. However: 1) The definition of reproducibility is vague. The authors say that a paper is reproducible if they managed to reproduce its main claims. What does that mean exactly? Coming close to the reported performance scores? Beating competitors? 2) How is it possible to reimplement the code of 255 papers independently without looking at the authors' code? This appears to be an enormous labor effort.

The paper is mostly clear and of good quality. I am not entirely sure about its significance for NIPS, as the topic is not directly concerned with machine learning, but rather with meta-introspection. One might think that a workshop like https://sites.google.com/view/icml-reproducibility-workshop/home could be a suitable venue as well.
This is a well-written and carefully researched paper on which factors in a paper are correlated with it being more or less reproducible. Reproducible here means the authors were able to independently implement the same algorithm and reproduce the majority of the claims of the original paper. The statistical analysis is careful and complete, and the discussion of the results is nice.

It would be interesting to note not only whether a paper was reproducible, but how long it took to reproduce (scaled by the "size" of the thing being reproduced). Perhaps something like days to implement divided by lines of code in the implementation. The authors mention that extracting the features per paper took only about 30 minutes, but they did not say how long implementation took, other than mentioning one case that took over 4 years. It would be interesting to see the distribution of how long reimplementation took.

It would also be nice if the paper had a bit more discussion of how it was decided whether an implementation successfully reproduced the claims of a paper. I'm assuming you did not require getting exactly the same results down to the hundredth of a percent, but presumably required getting "similar" results? Some more clarity here would be helpful.

There's really nothing wrong with this paper. I think it's a good, solid contribution and valuable to the field. The only potential issue is whether it might fit better at a different conference focused more on the science of science, but really I think it fits fine at NeurIPS too.

I read the author response and agree with their assessment that it is reasonable to publish this paper at a conference like NeurIPS, given the ML focus of the papers being reproduced.