| Paper ID: | 805 |
|---|---|
| Title: | Failing Loudly: An Empirical Study of Methods for Detecting Dataset Shift |

Originality: The paper addresses the rather classical question of detecting distribution shift, using as few examples as possible. It also proposes a preliminary idea for identifying examples that are representative of the shift, which seems more novel to me, and beyond that proposes to distinguish between benign and malign shifts. However, this aspect is not very developed. The study is restricted to "natural" shifts, i.e. no attacks.

Quality: The submission is similar to a review paper. Many techniques are compared and very well presented, and the supplementary material contains a lot of results that could be exploited. There is no theoretical part, which is not really a problem for this kind of paper. Section 2 is fine for the related work, as far as I know. Section 3 is a (quite long) description of the studied algorithms, followed by a (also quite long) description of the tests used. Then come two short parts on "most anomalous samples" and "malignancy of the shift": the main contribution of this work should be there, but the description is rather vague. I am confident that the authors could summarize the way they select the most anomalous samples with more formalism (for instance, l178: "we can identify...": there has to be an equation to make this clearer?). Section 4 seems sufficient to reproduce the results. Section 5 is more a summary of the experimental results than a discussion. I would expect the discussion to offer more intuitions or interpretations of the results.

Clarity: There is a well-thought-out labelling of the different methods, so that one can read the results more easily. The whole paper is very organized.

Significance: The subject is interesting and there are good ideas in the paper. However, the most significant parts (characterization and malignancy detection) are not supported well enough to say that this work can be used in its present form.
Details, in the order of appearance in the paper:

- Section 3.1: can you make clear which of the algorithms used are original (I suspect "label classifier" and "domain classifier" are not standard tools?), or give a reference if they are not?
- l139: you specify that you used a squared exponential kernel: is it a usual setting, and can it have an impact on the results? (The "we use" somehow indicates that it is an arbitrary choice.)
- l165-166: notation: c and r are not defined.
- l196: "the difference classifier": which one is it?
- l197: I think a word is missing.
- l219: I am not sure what "label proportion" refers to.
- Table 1: how is the "detection accuracy" obtained?
- Figures 2 and 3: I have a hard time trying to interpret all your figures (linked with the previous comment on the discussion part): can you help your reader in this task?
- Fig. 3.c: I don't see why those examples are different?
- Bibliography: ref [3] was published in UAI 2018; ref [24]: typo "reviewpart"; ref [38]: ICLR 2014?

UPDATE: Thanks for your reply and the clarifications about your work. In an updated version of the paper, I would recommend comparing against related work that does not follow the two-sample testing paradigm.

---

Summary: The general framework for detecting dataset shift that is considered in this work consists of two main building blocks: (i) a dimensionality reduction step, followed by (ii) a two-sample test determining whether source and target data come from the same distribution. The authors consider a number of possibilities for both (i) and (ii). The problem setting is that labeled data from the source domain and unlabeled data from the target domain are available. Moreover, the authors describe how to use a domain classifier to obtain the “most anomalous samples” to characterize the shift. Lastly, the paper proposes heuristics to determine how severe a shift is.

Originality: The paper builds on a number of well-known techniques and combines them for detecting dataset shift. In that sense, the contribution is not very “original”, but it is still an interesting analysis. I am not aware of previous work that has performed the same analysis.

Quality: The authors themselves state that their ideas on characterizing the shift and determining its harmfulness are preliminary and heuristic. I agree with this judgement. In particular, I would like to see a discussion of the implicit assumptions underlying these ideas and of their failure cases. For instance, in 3.4 1) the authors propose to look at the labeled top anomalous samples (i.e. those labeled examples that the domain classifier predicts with certainty to be from the target domain) and to use the black-box model’s accuracy on those as a characterization of the shift. Assuming that such labeled anomalous samples exist is of course a pretty strong assumption.
Related to that point, the authors state in line 196: “Since the difference classifier picks up on the most noticeable deviations, the domain classifier’s accuracy functions as a worst-case bound on the model’s accuracy under the identified shift type.” I don’t think this statement is true: a high domain classifier accuracy would mean that the shift is large, and hence that the (source domain) model’s accuracy is probably rather low. Could you please clarify the statement? Regarding the empirical study, the comparison between the different shift detection techniques seems thorough, but I wonder why no related work or other baselines appear in the experimental results section. I would also recommend commenting on how the distribution of the p-values changes as the number of samples from the test set increases. Can you characterize in which cases they decrease monotonically with the number of samples and when they do not?

Clarity: The paper is clearly written for the most part, but I found some of the experimental details hard to understand. In particular, the results in Table 1 seem to be averaged over MNIST and CIFAR-10, which I find a bit odd. Also, it is not clear to me what sort of intervals are shown in the plots; is that information given somewhere?

Significance: I do find the considered problem setting important, and more work is needed in this area. However, my main doubts about this paper are that related work is not considered in the experiments, that the weaknesses of the approach are discussed too little, and that the ideas on characterizing the shift and determining its harmfulness are too preliminary.
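For readers of this review, the use of a domain classifier to surface the “most anomalous samples”, as described in the summary above, might be sketched roughly as follows. This is only an illustrative reading, not the authors' implementation: the logistic-regression classifier, the synthetic 2-D Gaussian data, and the names `train_domain_classifier`, `most_anomalous` and `sample` are all assumptions made for the example.

```python
import math
import random


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def score(x, w, b):
    # Predicted probability that x comes from the target domain.
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)


def train_domain_classifier(source, target, lr=0.1, epochs=300):
    """Fit logistic regression by gradient descent (0 = source, 1 = target)."""
    data = [(x, 0.0) for x in source] + [(x, 1.0) for x in target]
    dim = len(data[0][0])
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * dim, 0.0
        for x, y in data:
            err = score(x, w, b) - y
            for i in range(dim):
                grad_w[i] += err * x[i]
            grad_b += err
        w = [wi - lr * g / len(data) for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / len(data)
    return w, b


def most_anomalous(held_out_target, w, b, k):
    """Rank held-out target samples by P(target | x) and keep the top k."""
    scored = [(score(x, w, b), x) for x in held_out_target]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


def sample(n, shift):
    # Hypothetical data: the shift only affects the first coordinate.
    return [[random.gauss(shift, 1.0), random.gauss(0.0, 1.0)] for _ in range(n)]


random.seed(0)
source = sample(200, 0.0)
target_train = sample(200, 2.0)
target_held_out = sample(100, 2.0)

w, b = train_domain_classifier(source, target_train)
top = most_anomalous(target_held_out, w, b, k=5)
for prob, x in top:
    print(f"P(target)={prob:.3f}  x={x}")
```

Under these assumptions, the top-ranked samples are simply those the classifier is most confident belong to the target domain. Whether labels are available for them, which is the assumption questioned above, is a separate matter that the sketch does not address.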

The authors conduct a fairly detailed study of different ways of thinking about statistical tests for detecting distribution shift. The paper and its contribution are well contextualized, and the manuscript is clear. There is a clear takeaway message for engineers designing tools for monitoring deployed machine learning models: use pre-trained classifiers with single-dimensional statistical testing. The shortcoming, which the authors themselves bring up, is that no temporal dependence is considered (i.e. streaming data) and that only images are considered. To be really comprehensive, data of these other forms could have been considered.
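One possible reading of that takeaway, given here as a minimal sketch rather than the authors' code, is to run a univariate two-sample Kolmogorov-Smirnov test on each softmax output dimension of a classifier and combine the tests with a Bonferroni correction. The classifier below is a hypothetical stand-in with fixed linear logits, the data are synthetic, and the p-value uses the standard asymptotic Kolmogorov series, which is only an approximation for finite samples.

```python
import math
import random


def ks_statistic(a, b):
    """Two-sample KS statistic: max gap between the empirical CDFs."""
    a, b = sorted(a), sorted(b)
    i = j = 0
    d = 0.0
    while i < len(a) and j < len(b):
        x = min(a[i], b[j])
        while i < len(a) and a[i] <= x:
            i += 1
        while j < len(b) and b[j] <= x:
            j += 1
        d = max(d, abs(i / len(a) - j / len(b)))
    return d


def ks_pvalue(d, n, m, terms=100):
    """Asymptotic p-value via the Kolmogorov series (an approximation)."""
    lam = d * math.sqrt(n * m / (n + m))
    if lam < 0.3:
        # The series converges poorly here; the true value is close to 1.
        return 1.0
    total = sum((-1) ** (k - 1) * math.exp(-2.0 * k * k * lam * lam)
                for k in range(1, terms + 1))
    return min(1.0, max(0.0, 2.0 * total))


def softmax(logits):
    top = max(logits)
    exps = [math.exp(v - top) for v in logits]
    norm = sum(exps)
    return [e / norm for e in exps]


def classifier_outputs(x):
    # Hypothetical stand-in for a pre-trained classifier: fixed linear logits.
    weights = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
    return softmax([sum(wi * xi for wi, xi in zip(row, x)) for row in weights])


def detect_shift(source_x, target_x, alpha=0.05):
    """One KS test per softmax dimension, Bonferroni-corrected."""
    src = [classifier_outputs(x) for x in source_x]
    tgt = [classifier_outputs(x) for x in target_x]
    num_classes = len(src[0])
    p_values = [
        ks_pvalue(ks_statistic([s[c] for s in src], [t[c] for t in tgt]),
                  len(src), len(tgt))
        for c in range(num_classes)
    ]
    return min(p_values) <= alpha / num_classes, p_values


random.seed(1)
source_x = [[random.gauss(0.0, 1.0), random.gauss(0.0, 1.0)] for _ in range(300)]
target_x = [[random.gauss(1.5, 1.0), random.gauss(0.0, 1.0)] for _ in range(300)]

shift_detected, p_values = detect_shift(source_x, target_x)
print(shift_detected, p_values)
```

Testing each output dimension separately keeps every test one-dimensional, at the cost of a conservative correction for the number of dimensions.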