Sun, Dec 8 through Sat, Dec 14, 2019, at the Vancouver Convention Center
When comparing the distributions of log-likelihoods of in-distribution vs. out-of-distribution inputs (Fig 1a + line 94), the authors state that they largely overlap. Can they be more specific, i.e., what is the difference in means, and can a p-value be computed using e.g. a Wilcoxon rank-sum test, rather than simply saying they “largely overlap”?

The authors assume that (i) each dimension of the input space can be described as either background or semantic, and (ii) these are independent (eq 1). How true is this in practice? One could imagine, e.g., a background effect of darkening an image, in which case the probability of observing an input depends on an interaction between the semantic and background components; similarly, the GC content of a sequence is itself a function of the semantic component when classifying bacterial sequences. Can the authors demonstrate that this assumption holds?

The LLR as defined in equation (5) depends only on the semantic features. How are these identified in practice on the test set, since, as the authors note, z is unknown a priori? This independence of semantic and background features seems a key component of the model, and the likelihood ratio should be computed on the semantic elements alone, but it is not obvious how these are identified in practice. Or is the full feature set used and this approximate equality assumed to hold?

Do the authors have an explanation for the AUROC being significantly worse than random on the Fashion-MNIST dataset?
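The quantitative comparison requested above could look something like the following sketch. The log-likelihood arrays here are synthetic placeholders; in practice they would be the per-example log-likelihoods from the trained model on the two test sets.

```python
import numpy as np
from scipy.stats import ranksums

rng = np.random.default_rng(0)
# Placeholder log-likelihoods standing in for model outputs on the two test sets.
ll_in = rng.normal(loc=-780.0, scale=35.0, size=1000)   # in-distribution
ll_ood = rng.normal(loc=-770.0, scale=40.0, size=1000)  # out-of-distribution

# Report the difference in means and a rank-sum p-value instead of only
# noting that the two distributions "largely overlap".
mean_diff = ll_in.mean() - ll_ood.mean()
stat, p_value = ranksums(ll_in, ll_ood)
print(f"difference in means: {mean_diff:.2f}")
print(f"Wilcoxon rank-sum statistic: {stat:.3f}, p-value: {p_value:.3g}")
```

A small p-value here would quantify how distinguishable the two log-likelihood distributions are, which is more informative than a visual overlap judgment.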
The authors are motivated by the problem of bacterial identification in the presence of out-of-distribution (OOD) examples: when a classifier is trained on known bacterial classes and deployed in the real world, it may erroneously assign yet-unknown bacterial strains to one of the existing classes with high confidence. Methods for OOD detection try to address this problem. The authors propose a novel statistic to identify OOD examples: their method is based on taking the log-likelihood ratio (LLR) between a model trained on in-distribution data and a background model. Autoregressive models are used for both; the background model is trained on perturbed in-distribution data (where the amount of perturbation is a hyper-parameter that needs to be tuned). Combined with the assumption that the likelihood factorises into semantic and background contributions, the statistic can be approximated as the difference in log-likelihoods under the two models, effectively focusing on the semantic components only. The paper introduces a bacterial identification task, which is suggested as a benchmark for future evaluations of OOD methods. The authors compare the performance of their method on this task in terms of three metrics (area under the ROC curve, area under the precision-recall curve, and false positive rate at 80% true positive rate). The proposed LLR-based method outperforms the other methods. Apart from genomic sequences, the LLR approach is also applied to image datasets.

Quality

This is a technically sound submission. The motivation for using the LLR statistic for OOD detection is clearly explained, and the LLR method is compared against a representative set of baseline methods. A couple of points:

- On CIFAR-10 versus SVHN (table 4), only the likelihood and likelihood-ratio methods are reported; baselines are not included for comparison. This renders the comparison of methods incomplete, and the authors should include these results in the revision.
- I appreciated the extensive discussion of finding hyper-parameters in the supplement. However, I do not fully agree with the conclusion the authors draw: when no OOD data is available for validation and OOD data is simulated instead (table S1b), AUROC scores are high regardless of the strength of the L2-regularization, whereas the impression from table S1a is that high lambda parameters are detrimental to performance. Thus, the conclusion about plausible ranges of hyper-parameters is not the same in both settings. Would it be better to abstain from L2 regularisation altogether in this situation, since it only seems to bring marginal gains in the best-case scenario?
- There is no report of variance in the metrics. To see whether the results are sensitive to e.g. the initialisation of the neural networks, the authors should report errors on the metrics / run the baselines repeatedly.
- It would be informative if runtimes of the different methods (given hardware) were reported.

Clarity

The paper is easy to follow and well organised. Results are cleanly presented in figures and tables. Information for reproducing the results is reported. Additional points:

- The authors state they "do not compare .. with other methods that do utilize OOD inputs in training.". It would be good to provide references to the methods that are excluded for this reason. It might be fair to include them, but trained on synthetic OOD data instead.

Originality

The authors propose a novel method for OOD detection that shows promising results and introduce a novel benchmark problem to the literature.

Significance

The idea of building test statistics on likelihood ratios for OOD detection is interesting and opens room for developing novel techniques in future papers. Introducing new benchmarks is important; however, there is no mention of releasing code for the baselines and the benchmark. The impact/adoption of the benchmark in the community will depend on the availability and ease of use of the code.
Without a code release, the significance of the paper for the community is reduced. The LLR statistic reaches state-of-the-art performance on the bacterial identification benchmark. It is difficult to judge, however, how this will generalize to other problem settings. Results on the CIFAR-10 versus SVHN problem are not reported with respect to the other baselines (only a comparison to likelihood is included).

Typos
- L279: additonal
- L506 (supplement): indepedent
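The variance reporting requested in the points above could be sketched as follows. The scores and the five "seeds" are synthetic placeholders standing in for repeated training runs, and `fpr_at_tpr` is a hypothetical helper for the paper's third metric, not code from the submission.

```python
import numpy as np
from sklearn.metrics import roc_auc_score, average_precision_score, roc_curve

def fpr_at_tpr(y_true, scores, target_tpr=0.8):
    """False positive rate at the operating point achieving the target TPR."""
    fpr, tpr, _ = roc_curve(y_true, scores)
    return float(np.interp(target_tpr, tpr, fpr))

aurocs, auprcs, fprs = [], [], []
for seed in range(5):  # stand-in for retraining with different initialisations
    rng = np.random.default_rng(seed)
    y = np.concatenate([np.ones(500), np.zeros(500)])    # 1 = in-distribution
    scores = np.concatenate([rng.normal(1.0, 1.0, 500),  # in-distribution scores
                             rng.normal(0.0, 1.0, 500)])  # OOD scores
    aurocs.append(roc_auc_score(y, scores))
    auprcs.append(average_precision_score(y, scores))
    fprs.append(fpr_at_tpr(y, scores))

# Mean +/- standard deviation over runs, as requested in the review.
print(f"AUROC       {np.mean(aurocs):.3f} +/- {np.std(aurocs):.3f}")
print(f"AUPRC       {np.mean(auprcs):.3f} +/- {np.std(auprcs):.3f}")
print(f"FPR@80%TPR  {np.mean(fprs):.3f} +/- {np.std(fprs):.3f}")
```

Reporting all three metrics with error bars over repeated runs would make it clear whether the gaps between methods exceed the variation due to initialisation.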
Originality: The idea of using a likelihood ratio (LLR) with respect to a background distribution to detect OOD inputs is novel (it has similarities to the learning of a statistical potential in the context of protein folding - https://en.wikipedia.org/wiki/Statistical_potential).

Quality: The work is technically sound.

Clarity: The paper is very well written and easy to understand.

Significance: The experimental results are strong, with almost perfect results on the Fashion-MNIST vs MNIST task. Furthermore, the authors collect a bacterial genomic dataset (from publicly available genomes) to further evaluate their method and obtain significantly better results on that task as well. They also find that their OOD score is correlated with distance to the in-distribution set, giving further evidence for the LLR approach. Given the experimental evidence and the novelty of the method, I think this is an important contribution to the stable of OOD detectors.
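A minimal toy sketch of the background-LLR idea discussed across these reviews, assuming unigram stand-ins for the two autoregressive models and a made-up perturbation rate `mu=0.3`; `perturb`, `log_likelihood`, and `llr` are illustrative names, not the authors' code.

```python
import numpy as np

VOCAB = 4  # e.g. A, C, G, T encoded as 0..3

def perturb(seq, mu, rng):
    """Replace each position with a uniformly random symbol with
    probability mu (the perturbation rate is a tuned hyper-parameter)."""
    mask = rng.random(len(seq)) < mu
    noise = rng.integers(0, VOCAB, size=len(seq))
    return np.where(mask, noise, seq)

def log_likelihood(seq, probs):
    """Log-likelihood under a per-symbol distribution (a toy stand-in
    for an autoregressive model's factorised likelihood)."""
    return float(np.log(probs[seq]).sum())

def llr(seq, probs_full, probs_bg):
    """LLR(x) = log p_full(x) - log p_bg(x): background structure shared
    by both models cancels, leaving mostly the semantic signal."""
    return log_likelihood(seq, probs_full) - log_likelihood(seq, probs_bg)

rng = np.random.default_rng(0)
train = rng.choice(VOCAB, size=(500, 100), p=[0.4, 0.3, 0.2, 0.1])
train_bg = np.stack([perturb(s, mu=0.3, rng=rng) for s in train])

# Fit toy unigram "models" from the clean and perturbed training data.
probs_full = np.bincount(train.ravel(), minlength=VOCAB) / train.size
probs_bg = np.bincount(train_bg.ravel(), minlength=VOCAB) / train_bg.size

x = rng.choice(VOCAB, size=100, p=[0.4, 0.3, 0.2, 0.1])
print(f"LLR score: {llr(x, probs_full, probs_bg):.3f}")
```

Under this sketch, in-distribution sequences tend to score higher than sequences drawn from a different composition, which is the separation the LLR statistic is designed to produce.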