{"title": "Likelihood Ratios for Out-of-Distribution Detection", "book": "Advances in Neural Information Processing Systems", "page_first": 14707, "page_last": 14718, "abstract": "Discriminative neural networks offer little or no performance guarantees when deployed on data not generated by the same process as the training distribution. On such out-of-distribution (OOD) inputs, the prediction may not only be erroneous, but confidently so, limiting the safe deployment of classifiers in real-world applications. One such challenging application is bacteria identification based on genomic sequences, which holds the promise of early detection of diseases, but requires a model that can output low confidence predictions on OOD genomic sequences from new bacteria that were not present in the training data. We introduce a genomics dataset for OOD detection that allows other researchers to benchmark progress on this important problem. We investigate deep generative model based approaches for OOD detection and observe that the likelihood score is heavily affected by population level background statistics. We propose a likelihood ratio method for deep generative models which effectively corrects for these confounding background statistics. We benchmark the OOD detection performance of the proposed method against existing approaches on the genomics dataset and show that our method achieves state-of-the-art performance. Finally, we demonstrate the generality of the proposed method by showing that it significantly improves OOD detection when applied to deep generative models of images.", "full_text": "Likelihood Ratios for Out-of-Distribution Detection\n\nJie Ren\u21e4\u2020\n\nGoogle Research\n\njjren@google.com\n\nPeter J. Liu \u2021\nGoogle Research\n\npeterjliu@google.com\n\nEmily Fertig\u2020\nGoogle Research\n\nemilyaf@google.com\n\nJasper Snoek\nGoogle Research\n\njsnoek@google.com\n\nRyan Poplin\n\nGoogle Research\n\nrpoplin@google.com\n\nMark A. 
DePristo\nGoogle Research\n\nmdepristo@google.com\n\nJoshua V. Dillon \u2021\nGoogle Research\n\njvdillon@google.com\n\nBalaji Lakshminarayanan\u21e4\u2021\n\nDeepMind\n\nbalajiln@google.com\n\nAbstract\n\nDiscriminative neural networks offer little or no performance guarantees when\ndeployed on data not generated by the same process as the training distribution. On\nsuch out-of-distribution (OOD) inputs, the prediction may not only be erroneous,\nbut con\ufb01dently so, limiting the safe deployment of classi\ufb01ers in real-world applica-\ntions. One such challenging application is bacteria identi\ufb01cation based on genomic\nsequences, which holds the promise of early detection of diseases, but requires a\nmodel that can output low con\ufb01dence predictions on OOD genomic sequences from\nnew bacteria that were not present in the training data. We introduce a genomics\ndataset for OOD detection that allows other researchers to benchmark progress on\nthis important problem. We investigate deep generative model based approaches\nfor OOD detection and observe that the likelihood score is heavily affected by\npopulation level background statistics. We propose a likelihood ratio method for\ndeep generative models which effectively corrects for these confounding back-\nground statistics. We benchmark the OOD detection performance of the proposed\nmethod against existing approaches on the genomics dataset and show that our\nmethod achieves state-of-the-art performance. We demonstrate the generality of\nthe proposed method by showing that it signi\ufb01cantly improves OOD detection\nwhen applied to deep generative models of images.\n\n1\n\nIntroduction\n\nFor many machine learning systems, being able to detect data that is anomalous or signi\ufb01cantly\ndifferent from that used in training can be critical to maintaining safe and reliable predictions. 
This\nis particularly important for deep neural network classi\ufb01ers which have been shown to incorrectly\nclassify such out-of-distribution (OOD) inputs into in-distribution classes with high con\ufb01dence (Good-\nfellow et al., 2014; Nguyen et al., 2015). This behaviour can have serious consequences when the\npredictions inform real-world decisions such as medical diagnosis, e.g. falsely classifying a healthy\nsample as pathogenic or vice versa can have extremely high cost. The importance of dealing with\nOOD inputs, also referred to as distributional shift, has been recognized as an important problem for\n\n\u21e4Corresponding authors\n\u2020Google AI Resident\n\u2021Mentors\n\n33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.\n\n\fAI safety (Amodei et al., 2016). The majority of recent work on OOD detection for neural networks\nis evaluated on image datasets where the neural network is trained on one benchmark dataset (e.g.\nCIFAR-10) and tested on another (e.g. SVHN). While these benchmarks are important, there is a\nneed for more realistic datasets which re\ufb02ect the challenges of dealing with OOD inputs in practical\napplications.\nBacterial identi\ufb01cation is one of the most important sub-problems of many types of medical diagnosis.\nFor example, diagnosis and treatment of infectious diseases, such as sepsis, relies on the accurate\ndetection of bacterial infections in blood (Blauwkamp et al., 2019). Several machine learning\nmethods have been developed to perform bacteria identi\ufb01cation by classifying existing known\ngenomic sequences (Patil et al., 2011; Rosen et al., 2010), including deep learning methods (Busia\net al., 2018) which are state-of-the-art. Even if neural network classi\ufb01ers achieve high accuracy as\nmeasured through cross-validation, deploying them is challenging as real data is highly likely to\ncontain genomes from unseen classes not present in the training data. 
Different bacterial classes\ncontinue to be discovered gradually over the years (see Figure S4 in Appendix C.1) and it is estimated\nthat 60%-80% of genomic sequences belong to as yet unknown bacteria (Zhu et al., 2018; Eckburg\net al., 2005; Nayfach et al., 2019). Training a classi\ufb01er on existing bacterial classes and deploying\nit may result in OOD inputs being wrongly classi\ufb01ed as one of the classes from the training data\nwith high con\ufb01dence. In addition, OOD inputs can also be the contaminations from the bacteria\u2019s\nhost genomes such as human, plant, fungi, etc., which also need to be detected and excluded from\npredictions (Ponsero & Hurwitz, 2019). Thus having a method for accurately detecting OOD inputs\nis critical to enable the practical application of machine learning methods to this important problem.\nA popular and intuitive strategy for detecting OOD inputs is to train a generative model (or a hybrid\nmodel cf. Nalisnick et al. (2019)) on training data and use that to detect OOD inputs at test time\n(Bishop, 1994). However, Nalisnick et al. (2018) and Choi et al. (2018) recently showed that deep\ngenerative models trained on image datasets can assign higher likelihood to OOD inputs. We report a\nsimilar failure mode for likelihood based OOD detection using deep generative models of genomic\nsequences. We investigate this phenomenon and \ufb01nd that the likelihood can be confounded by\ngeneral population level background statistics. We propose a likelihood ratio method which uses a\nbackground model to correct for the background statistics and enhances the in-distribution speci\ufb01c\nfeatures for OOD detection. While our investigation was motivated by the genomics problem, we\nfound our methodology to be more general and it shows positive results on image datasets as well. 
In\nsummary, our contributions are:\n\u2022 We create a realistic benchmark for OOD detection that is motivated by challenges faced in applying deep learning models to genomics data. The sequential nature of genetic sequences provides a new modality and hopefully encourages the OOD research community to contribute to \u201cmachine learning that matters\u201d (Wagstaff, 2012).\n\u2022 We show that the likelihood from deep generative models can be confounded by background statistics.\n\u2022 We propose a likelihood ratio method for OOD detection, which significantly outperforms the raw likelihood on OOD detection for deep generative models on image datasets.\n\u2022 We evaluate existing OOD methods on the proposed genomics benchmark and demonstrate that our method achieves state-of-the-art (SOTA) performance on this challenging problem.\n\n2 Background\nSuppose we have an in-distribution dataset D of (x, y) pairs sampled from the distribution p*(x, y), where x is the extracted feature vector or raw input and y \u2208 Y := {1, . . . , k, . . . , K} is the label assigning membership to one of K in-distribution classes. For simplicity, we assume inputs to be discrete, i.e. xd \u2208 {A, C, G, T} for genomic sequences and xd \u2208 {0, . . . , 255} for images. In general, OOD inputs are samples (x, y) generated from an underlying distribution other than p*(x, y). In this paper, we consider an input (x, y) to be OOD if y \u2209 Y: that is, the class y does not belong to one of the K in-distribution classes. Our goal is to accurately detect if an input x is OOD or not.\nMany existing methods involve computing statistics using the predictions of (ensembles of) discriminative classifiers trained on in-distribution data, e.g. taking the confidence or entropy of the predictive distribution p(y|x) (Hendrycks & Gimpel, 2016; Lakshminarayanan et al., 2017). 
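As a concrete sketch of these two classifier-based statistics (not from the paper's code; function names are ours), both can be computed directly from a predictive distribution:

```python
import numpy as np

def max_prob_score(probs):
    """Maximum class probability max_k p(y = k|x); OOD inputs tend to score lower."""
    return np.max(probs, axis=-1)

def entropy_score(probs, eps=1e-12):
    """Predictive entropy -sum_k p(y = k|x) log p(y = k|x); higher suggests OOD."""
    return -np.sum(probs * np.log(probs + eps), axis=-1)

# Toy predictive distributions over K = 4 classes:
probs = np.array([[0.97, 0.01, 0.01, 0.01],   # confident -> likely in-distribution
                  [0.30, 0.25, 0.25, 0.20]])  # diffuse   -> possibly OOD
```

Thresholding either score then yields an OOD decision rule.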
An\nalternative is to use generative model-based methods, which are appealing as they do not require labeled data and directly model the input distribution. These methods fit a generative model p(x) to the input data, and then evaluate the likelihood of new inputs under that model. However, recent work has highlighted significant issues with this approach for OOD detection on images, showing that deep generative models such as Glow (Kingma & Dhariwal, 2018) and PixelCNN (Oord et al., 2016; Salimans et al., 2017) sometimes assign higher likelihoods to OOD than in-distribution inputs. For example, Nalisnick et al. (2018) and Choi et al. (2018) show that Glow models trained on the CIFAR-10 image dataset assign higher likelihood to OOD inputs from the SVHN dataset than they do to in-distribution CIFAR-10 inputs; Nalisnick et al. (2018), Shafaei et al. (2018) and Hendrycks et al. (2018) show failure modes of PixelCNN and PixelCNN++ for OOD detection.\nFailure of density estimation for OOD detection We investigate whether density estimation-based methods work well for OOD detection in genomics. As a motivating observation, we train a deep generative model, more precisely an LSTM (Hochreiter & Schmidhuber, 1997), on in-distribution genomic sequences (composed of {A, C, G, T}), and plot the log-likelihoods of both in-distribution and OOD inputs (see Section 5.2 for the dataset and the full experimental details). Figure 1a shows that the histogram of the log-likelihood for OOD sequences largely overlaps with that of in-distribution sequences with an AUROC of 0.626, making it unsuitable for OOD detection. 
Our\nobservations show a failure mode of deep generative models for OOD detection on genomic sequences and are complementary to earlier work which showed similar results for deep generative models on images (Nalisnick et al., 2018; Choi et al., 2018).\n\nFigure 1: (a) Log-likelihood hardly separates in-distribution and OOD inputs, with an AUROC of 0.626. (b) The log-likelihood is heavily affected by the GC-content of a sequence.\n\nWhen investigating this failure mode, we discovered that the log-likelihood under the model is heavily affected by a sequence\u2019s GC-content, see Figure 1b. GC-content is defined as the percentage of bases that are either G or C. It is used widely in genomic studies as a basic statistic for describing overall genomic composition (Sueoka, 1962), and studies have shown that bacteria have an astonishing diversity of genomic GC-content, from 16.5% to 75% (Hildebrand et al., 2010). Bacteria from similar groups tend to have similar GC-content at the population level, but they also have characteristic biological patterns that can distinguish them well from each other. The confounding effect of GC-content in Figure 1b makes the likelihood less reliable as a score for OOD detection: an OOD input may receive a higher likelihood than an in-distribution input simply because it has high GC-content (cf. the bottom right part of Figure 1b), and not necessarily because it contains characteristic patterns specific to the in-distribution bacterial classes.\n\n3 Likelihood Ratio for OOD detection\n\nWe first describe the high level idea and then describe how to adapt it to deep generative models.\nHigh level idea Assume that an input x is composed of two components: (1) a background component characterized by population level background statistics, and (2) a semantic component characterized by patterns specific to the in-distribution data. 
For example, images can be modeled as backgrounds plus objects; text can be considered as a combination of high frequency stop words plus semantic words (Luhn, 1960); genomes can be modeled as background sequences plus motifs (Bailey & Elkan, 1995; Reinert et al., 2009). More formally, for a D-dimensional input x = x1, . . . , xD, we assume that there exists an unobserved variable z = z1, . . . , zD, where zd \u2208 {B, S} indicates if the dth dimension of the input xd is generated from the Background model or the Semantic model. Grouping the semantic and background parts, the input can be factored as x = {xB, xS} where xB = {xd | zd = B, d = 1, . . . , D}. For simplicity, assume that the background and semantic components are generated independently. The likelihood can then be decomposed as follows,\n\np(x) = p(xB) p(xS). (1)\n\nWhen training and evaluating deep generative models, we typically do not distinguish between these two terms in the likelihood. However, we may want to use just the semantic likelihood p(xS) to avoid the likelihood term being dominated by the background term (e.g. an OOD input with the same background but a different semantic component). In practice, we only observe x, and it is not always easy to split an input into background and semantic parts {xB, xS}. As a practical alternative, we propose training a background model by perturbing inputs. Adding the right amount of perturbation to inputs can corrupt the semantic structure in the data, and hence a model trained on perturbed inputs captures only the population level background statistics.\nAssume that p\u03b8(\u00b7) is a model trained using in-distribution data, and p\u03b80(\u00b7) is a background model that captures general background statistics. 
We propose a likelihood ratio statistic that is defined as\n\nLLR(x) = log [p\u03b8(x) / p\u03b80(x)] = log [p\u03b8(xB) p\u03b8(xS) / (p\u03b80(xB) p\u03b80(xS))], (2)\n\nwhere we use the factorization from Equation 1. Assume that (i) both models capture the background information equally well, that is p\u03b8(xB) \u2248 p\u03b80(xB), and (ii) p\u03b8(xS) is more peaky than p\u03b80(xS), as the former is trained on data containing semantic information while the latter model \u03b80 is trained using data with noise perturbations. Then, the likelihood ratio can be approximated as\n\nLLR(x) \u2248 log p\u03b8(xS) \u2212 log p\u03b80(xS). (3)\n\nAfter taking the ratio, the likelihood for the background component xB is cancelled out, and only the likelihood for the semantic component xS remains. Our method produces a background contrastive score that captures the significance of the semantics compared with the background model.\nLikelihood ratio for auto-regressive models Auto-regressive models are one of the popular choices for generating images (Oord et al., 2016; Van den Oord et al., 2016; Salimans et al., 2017) and sequence data such as genomics (Zou et al., 2018; Killoran et al., 2017), drug molecules (Olivecrona et al., 2017; Gupta et al., 2018), and text (Jozefowicz et al., 2016). In auto-regressive models, the log-likelihood of an input can be expressed as log p\u03b8(x) = \u2211_{d=1}^{D} log p\u03b8(xd|x<d), where x<d = x1 . . . xd\u22121. Decomposing the log-likelihood into background and semantic parts, we have\n\nlog p\u03b8(x) = \u2211_{d: xd\u2208xB} log p\u03b8(xd|x<d) + \u2211_{d: xd\u2208xS} log p\u03b8(xd|x<d). (4)\n\nWe can use a similar auto-regressive decomposition for the background model p\u03b80(x) as well. Assuming that both models capture the background information equally well, i.e. \u2211_{d: xd\u2208xB} log p\u03b8(xd|x<d) \u2248 \u2211_{d: xd\u2208xB} log p\u03b80(xd|x<d), the likelihood ratio is approximated as\n\nLLR(x) \u2248 \u2211_{d: xd\u2208xS} log p\u03b8(xd|x<d) \u2212 \u2211_{d: xd\u2208xS} log p\u03b80(xd|x<d) = \u2211_{d: xd\u2208xS} log [p\u03b8(xd|x<d) / p\u03b80(xd|x<d)]. (5)\n\nTraining the Background Model In practice, we add perturbations to the input data by randomly selecting positions in x1 . . . xD following an independent and identically distributed Bernoulli distribution with rate \u00b5 and substituting the original character with one of the other characters with equal probability. The procedure is inspired by genetic mutations. See Algorithm 1 in Appendix A for the pseudocode for generating input perturbations. The rate \u00b5 is a hyperparameter and can be easily tuned using a small validation OOD dataset (different from the actual OOD dataset of interest). In the case where a validation OOD dataset is not available, we show that \u00b5 can also be tuned using simulated OOD data. In practice, we observe that \u00b5 \u2208 [0.1, 0.2] achieves good performance empirically for most of the experiments in our paper. Besides adding perturbations to the input data, we found that other techniques which improve model generalization and prevent memorization, such as adding L2 regularization with coefficient \u03bb to the model weights, can also help to train a good background model. In fact, it has been shown that adding noise to the input is equivalent to adding L2 regularization to the model weights under some conditions (Bishop, 1995a,b). Beyond the methods above, we expect that adding other types of noise or regularization would show a similar effect. The pseudocode
The pseudocode\nfor our proposed OOD detection algorithm can be found in Algorithm 2 in Appendix A.\n\n4\n\n\f4 Experimental setup\n\nregressive model for computing the log-likelihood log p\u2713(x) =PD\n\nWe design experiments on multiple data modalities (images, genomic sequences) to evaluate our\nmethod and compare with other baseline methods. For each of the datasets, we build an auto-\nd=1 log p\u2713(xd|x<d). For training\nthe background model p\u27130(x), we use the exact same architecture as p\u2713(x), and the only differences\nare that it is trained on perturbed inputs and (optionally) we apply L2 regularization to model weights.\nBaseline methods for comparison We compare our approach to several existing methods.\n1. The maximum class probability, p(\u02c6y|x) = maxk p(y = k|x). OOD inputs tend to have lower\n2. The entropy of the predicted class distribution, Pk p(y = k|x) log p(y = k|x). High entropy\n\nscores than in-distribution data (Hendrycks & Gimpel, 2016).\n\nof the predicted class distribution, and therefore a high predictive uncertainty, which suggests that\nthe input may be OOD.\n\n3. The ODIN method proposed by Liang et al. (2017). ODIN uses temperature scaling (Guo et al.,\n2017), adds small perturbations to the input, and applies a threshold to the resulting predicted\nclass to distinguish in- and out-of- distribution inputs. This method was designed for continuous\ninputs and cannot be directly applied to discrete genomic sequences. We propose instead to add\nperturbations to the input of the last layer that is closest to the output of the neural network.\n\n4. The Mahalanobis distance of the input to the nearest class-conditional Gaussian distribution esti-\nmated from the in-distribution data. Lee et al. (2018) \ufb01t class-conditional Gaussian distributions\nto the activations from the last layer of the neural network.\n\n5. 
The classifier-based ensemble method that uses the average of the predictions from multiple independently trained models with random initialization of network parameters and random shuffling of training inputs (Lakshminarayanan et al., 2017).\n6. The log-odds of a binary classifier trained to distinguish between in-distribution inputs from all classes as one class and randomly perturbed in-distribution inputs as the other.\n7. The maximum class probability over K in-distribution classes of a (K + 1)-class classifier where the additional class is perturbed in-distribution data.\n8. The maximum class probability of a K-class classifier for in-distribution classes, where the predicted class distribution is explicitly trained to output a uniform distribution on perturbed in-distribution inputs. This is similar to using simulated OOD inputs from a GAN (Lee et al., 2017) or using auxiliary datasets of outliers (Hendrycks et al., 2018) for calibration purposes.\n9. The generative model-based ensemble method that measures E[log p\u03b8(x)] \u2212 Var[log p\u03b8(x)] over multiple likelihood estimates from independently trained models with random initialization and random shuffling of the inputs (Choi et al., 2018).\nBaseline methods 1-8 are classifier-based and method 9 is generative model-based. For classifier-based methods, we choose a commonly used model architecture, convolutional neural networks (CNNs). Methods 6-8 are based on perturbed inputs, which aim to mimic OOD inputs. Perturbations are added to the input in the same way as for training the background models. Our method and methods 3, 6, 7, and 8 involve hyperparameter tuning; we follow the protocol of Hendrycks et al. (2018) where optimal hyperparameters are picked on a different OOD validation set than the final OOD dataset on which it is tested. For the Fashion-MNIST vs. MNIST experiment, we use the NotMNIST (Bulatov, 2011) dataset for hyperparameter tuning. 
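As an illustration of the generative ensemble statistic in baseline 9 above, here is a minimal sketch (not the authors' code; it assumes per-model log-likelihoods are already available as an array):

```python
import numpy as np

def waic_score(log_probs):
    """WAIC-style ensemble score E[log p(x)] - Var[log p(x)] (Choi et al., 2018).

    log_probs: array of shape (num_models, num_inputs), where row m holds
    log-likelihoods from the m-th independently trained generative model.
    Higher scores indicate more in-distribution inputs.
    """
    return np.mean(log_probs, axis=0) - np.var(log_probs, axis=0)

# Toy log-likelihoods from 3 models for 2 inputs: the models agree on the
# first input but disagree on the second, so the second is penalized.
log_probs = np.array([[-3.1, -7.9],
                      [-3.0, -6.2],
                      [-2.9, -9.4]])
scores = waic_score(log_probs)
```

The variance term penalizes inputs on which the ensemble members disagree.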
For CIFAR-10 vs SVHN, we used gray-scaled CIFAR-10 for hyperparameter tuning. For genomics, we use the OOD bacteria classes discovered between 2011-2016, which are disjoint from the final OOD classes discovered after 2016. While this set of baselines is not exhaustive, it is broadly representative of the range of existing methods. Note that since our method does not rely on OOD inputs for training, we do not compare it with other methods that do utilize OOD inputs in training.\nEvaluation metrics for OOD detection We trained the model using only in-distribution inputs, and we tuned the hyperparameters using validation datasets that include both in-distribution and OOD inputs. The test dataset is used for the final evaluation of the method. For the final evaluation, we randomly selected the same number of in-distribution and OOD inputs from the test dataset, and for each example x we computed the log likelihood-ratio statistic LLR(x) as the score. A small value of the score suggests a high likelihood of being OOD. We use the area under the ROC curve (AUROC\u2191), the area under the precision-recall curve (AUPRC\u2191), and the false positive rate at 80% true positive rate (FPR80\u2193) as the metrics for evaluation. These three metrics are commonly used for evaluating OOD detection methods (Hendrycks & Gimpel, 2016; Hendrycks et al., 2018; Alemi et al., 2018).\n\n5 Results\n\nWe first present results on image datasets as they are easier to visualize, and then present results on our proposed genomic dataset. For image experiments, our goal is not to achieve state-of-the-art performance but to show that our likelihood ratio effectively corrects for background statistics and significantly outperforms the likelihood. 
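The three evaluation metrics described above can be computed from in-distribution and OOD score arrays with scikit-learn; a sketch (function name ours, treating in-distribution as the positive class and higher scores as more in-distribution):

```python
import numpy as np
from sklearn.metrics import average_precision_score, roc_auc_score, roc_curve

def ood_metrics(scores_in, scores_ood):
    """AUROC, AUPRC, and FPR80 for OOD detection."""
    y_true = np.concatenate([np.ones_like(scores_in), np.zeros_like(scores_ood)])
    y_score = np.concatenate([scores_in, scores_ood])
    fpr, tpr, _ = roc_curve(y_true, y_score)
    # False positive rate at the first threshold reaching 80% true positive rate.
    fpr80 = fpr[np.argmax(tpr >= 0.8)]
    return {"auroc": roc_auc_score(y_true, y_score),
            "auprc": average_precision_score(y_true, y_score),
            "fpr80": fpr80}

# Perfectly separated toy scores give AUROC/AUPRC of 1 and FPR80 of 0.
metrics = ood_metrics(np.array([2.0, 3.0, 4.0]), np.array([-1.0, 0.0, 1.0]))
```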
While previous work has shown the failure of PixelCNN for OOD detection, we believe ours is the first to provide an explanation for why this phenomenon happens for PixelCNN, through the lens of background statistics.\n\n5.1 Likelihood ratio for detecting OOD images\n\nFollowing existing literature (Nalisnick et al., 2018; Hendrycks et al., 2018), we evaluate our method using two experiments for detecting OOD images: (a) Fashion-MNIST as in-distribution and MNIST as OOD, (b) CIFAR-10 as in-distribution and SVHN as OOD. For each experiment, we train a PixelCNN++ (Salimans et al., 2017; Van den Oord et al., 2016) model using in-distribution data. We train a background model by adding perturbations to the training data. To compare with classifier-based baseline methods, we use CNN-based classifiers. See Appendix B.1 for model details. Based on the likelihood from the PixelCNN++ model, we confirm that the model assigns a higher likelihood to MNIST than Fashion-MNIST, as previously reported by Nalisnick et al. (2018), and the AUROC for OOD detection is only 0.091, even worse than random (Figure 2a). We discover that the proportion of zeros, i.e. the number of pixels belonging to the background of an image, is a confounding factor for the likelihood score (Pearson correlation coefficient 0.85, see Figure 2b, Figure S1). Taking the likelihood ratio between the original and the background models, we see that the AUROC improves significantly from 0.091 to 0.996 (Figure 2d). The log likelihood-ratios for OOD images are highly concentrated around the value 0, while those for in-distribution images are mostly positive (Figure 2c).\n\nFigure 2: (a) Log-likelihood of MNIST images (OOD) is higher than that of Fashion-MNIST images (in-distribution). (b) Log-likelihood is highly correlated with the background (proportion of zeros in an image). (c) Log-likelihood ratio is higher for Fashion-MNIST (in-dist) than MNIST (OOD). 
(d) Likelihood ratio significantly improves the AUROC of OOD detection from 0.091 to 0.996.\n\nWhich pixels contribute the most to the likelihood (ratio)? To qualitatively evaluate the difference between the likelihood and the likelihood ratio, we plot their values for each pixel for Fashion-MNIST and MNIST images. This allows us to visualize which pixels contribute the most to the two terms respectively. Figure 3 shows a heatmap, with lighter (darker) gray colors indicating higher (lower) values. Figures 3(a,b) show that the likelihood value is dominated by the \u201cbackground\u201d pixels, whereas the likelihood ratio focuses on the \u201csemantic\u201d pixels. Figures 3(c,d) confirm that the background pixels cause MNIST images to be assigned high likelihood, whereas the likelihood ratio focuses on the semantic pixels. We present additional qualitative results in Appendix B. For instance, Figure S2 shows that images with the highest likelihood-ratios are those with prototypical Fashion-MNIST icons, e.g. \u201cshirts\u201d and \u201cbags\u201d, highly contrastive with the background, while images with the lowest likelihood-ratios are those with rare patterns, e.g. dresses with stripes or sandals with high ropes.\n\nFigure 3: The log-likelihood of each pixel in an image, log p\u03b8(xd|x<d), and the log likelihood-ratio of each pixel, log p\u03b8(xd|x<d) \u2212 log p\u03b80(xd|x<d), d = 1, . . . , 784, for 16 Fashion-MNIST images (panels a, b) and MNIST images (panels c, d); panels (a) and (c) show the likelihood, (b) and (d) the likelihood-ratio. Lighter gray color indicates larger value (see colorbar). Note that the range of the log-likelihood (negative values) is different from that of the log likelihood-ratio (mostly positive values). For ease of visualization, we unify the colorbar by adding a constant to the log-likelihood score. 
The images are randomly sampled from the test dataset and sorted by their likelihood p\u03b8(x). Looking at which pixels contribute the most to each quantity, we observe that the likelihood value is dominated by the \u201cbackground\u201d pixels on both Fashion-MNIST and MNIST, whereas the likelihood ratio focuses on the \u201csemantic\u201d pixels.\n\nWe compare our method with other baselines. The classifier-based baseline methods are built using the LeNet architecture. Table 1a shows that our method achieves the highest AUROC\u2191, AUPRC\u2191, and the lowest FPR80\u2193. The method using Mahalanobis distance performs better than the other baselines. Note that the binary classifier between in-distribution and perturbed in-distribution inputs does not perform as well as our method, possibly because while the features learned by the discriminator can be good for detecting perturbed inputs, they may not generalize well for OOD detection. The generative model approach based on p(x) captures more fundamental features of the data generation process than the discriminative approach.\nFor the experiment using CIFAR-10 as in-distribution and SVHN as OOD, we apply the same training procedure using the PixelCNN++ model and choose hyperparameters using gray-scaled CIFAR-10, which was shown to be OOD by Nalisnick et al. (2018). See Appendix B.1 for model details. Looking at the results in Table 2, we observe that the OOD images from SVHN have higher likelihood than the in-distribution images from CIFAR-10, confirming the observations of Nalisnick et al. (2018), with an AUROC of 0.095. Our likelihood-ratio method significantly improves the AUROC to 0.931. Figure S3 in Appendix B shows additional qualitative results. For detailed results including other baseline methods, see Table S2 in Appendix B.3.\n\n5.2 OOD detection for genomic sequences\nDataset for detecting OOD genomic sequences We design a new dataset for evaluating OOD methods. 
As bacterial classes are discovered gradually over time, in- and out-of-distribution data\ncan be naturally separated by the time of discovery. Classes discovered before a cutoff time can be\nregarded as in-distribution classes, and those discovered afterward, which were unidenti\ufb01ed at the\ncutoff time, can be regarded as OOD. We choose two cutoff years, 2011 and 2016, to de\ufb01ne the\ntraining, validation, and test splits (Figure 4). Our dataset contains of 10 in-distribution classes, 60\nOOD classes for validation, and 60 OOD classes for testing. Note that the validation OOD dataset is\nonly used for hyperparameter tuning, and the validation OOD classes are disjoint from the test OOD\n\n7\n\n\fTable 1: AUROC\", AUPRC\", and FPR80# for detecting OOD inputs using likelihood and likelihood-\nratio method and other baselines on (a) Fashion-MNIST vs. MNIST datasets and (b) genomic dataset.\nThe up and down arrows on the metric names indicate whether greater or smaller is better. \u00b5 in the\nparentheses indicates the background model is tuned only using noise perturbed input, and (\u00b5 and\n) indicates the background model is tuned by both perturbation and L2 regularization. Numbers in\nfront and inside of the brackets are mean and standard error respectively based on 10 independent\nruns with random initialization of network parameters and random shuf\ufb02ing of training inputs. 
For ensemble models, the mean and standard error are estimated based on 10 bootstrap samples from 30 independent runs, which may underestimate the true standard errors.

(a) Fashion-MNIST vs. MNIST

Method                          AUROC↑          AUPRC↑          FPR80↓
Likelihood                      0.089 (0.002)   0.320 (0.000)   1.000 (0.001)
Likelihood Ratio (ours, µ)      0.973 (0.031)   0.951 (0.063)   0.005 (0.008)
Likelihood Ratio (ours, µ, λ)   0.994 (0.001)   0.993 (0.002)   0.001 (0.000)
p(ŷ|x)                          0.734 (0.028)   0.702 (0.026)   0.506 (0.046)
Entropy of p(y|x)               0.746 (0.027)   0.726 (0.026)   0.448 (0.049)
ODIN                            0.752 (0.069)   0.763 (0.062)   0.432 (0.116)
Mahalanobis distance            0.942 (0.017)   0.928 (0.021)   0.088 (0.028)
Ensemble, 5 classifiers         0.839 (0.010)   0.833 (0.009)   0.275 (0.019)
Ensemble, 10 classifiers        0.851 (0.007)   0.844 (0.006)   0.241 (0.014)
Ensemble, 20 classifiers        0.857 (0.005)   0.849 (0.004)   0.240 (0.011)
Binary classifier               0.455 (0.105)   0.505 (0.064)   0.886 (0.126)
p(ŷ|x) with noise class         0.877 (0.050)   0.871 (0.054)   0.195 (0.101)
p(ŷ|x) with calibration         0.904 (0.023)   0.895 (0.023)   0.139 (0.044)
WAIC, 5 models                  0.221 (0.013)   0.401 (0.008)   0.911 (0.008)

(b) Genomic dataset

Method                          AUROC↑          AUPRC↑          FPR80↓
Likelihood                      0.626 (0.001)   0.613 (0.001)   0.661 (0.002)
Likelihood Ratio (ours, µ)      0.732 (0.015)   0.685 (0.017)   0.534 (0.031)
Likelihood Ratio (ours, µ, λ)   0.755 (0.005)   0.719 (0.006)   0.474 (0.011)
p(ŷ|x)                          0.634 (0.003)   0.599 (0.003)   0.669 (0.007)
Entropy of p(y|x)               0.634 (0.003)   0.599 (0.003)   0.617 (0.007)
Adjusted ODIN                   0.697 (0.010)   0.671 (0.012)   0.550 (0.021)
Mahalanobis distance            0.525 (0.010)   0.503 (0.007)   0.747 (0.014)
Ensemble, 5 classifiers         0.682 (0.002)   0.647 (0.002)   0.589 (0.004)
Ensemble, 10 classifiers        0.690 (0.001)   0.655 (0.002)   0.574 (0.004)
Ensemble, 20 classifiers        0.695 (0.001)   0.659 (0.001)   0.570 (0.004)
Binary classifier               0.635 (0.016)   0.634 (0.015)   0.619 (0.025)
p(ŷ|x) with noise class         0.652 (0.004)   0.627 (0.005)   0.643 (0.008)
p(ŷ|x) with calibration         0.669 (0.005)   0.635 (0.004)   0.627 (0.006)
WAIC, 5 models                  0.628 (0.001)   0.616 (0.001)   0.657 (0.002)

Table 2: CIFAR-10 vs SVHN results: AUROC↑, AUPRC↑, FPR80↓ for detecting OOD inputs using likelihood and our likelihood-ratio method.

Method                          AUROC↑          AUPRC↑          FPR80↓
Likelihood                      0.095 (0.003)   0.320 (0.001)   1.000 (0.000)
Likelihood Ratio (ours, µ)      0.931 (0.032)   0.888 (0.049)   0.062 (0.073)
Likelihood Ratio (ours, µ, λ)   0.930 (0.042)   0.881 (0.064)   0.066 (0.123)

classes. To mimic sequencing data, we fragment the genomes in each class into short sequences of 250 base pairs, a common read length generated by current sequencing technology. Among all the short sequences, we randomly choose 100,000 sequences per class for training, validation, and test. Additional details about the dataset, including pre-processing and the information for the in- and out-of-distribution classes, can be found in Appendix C.1.

Likelihood ratio method for detecting OOD sequences. We build an LSTM model for estimating the likelihood p(x) based on the transition probabilities p(x_d | x_{<d}), d = 1, ..., D. In particular, we feed the one-hot encoded DNA sequences into an LSTM layer, followed by a dense layer and a softmax function to predict the probability distribution over the 4 letters {A, C, G, T}, and train the model using only the in-distribution training data. We evaluate the likelihood for sequences in the OOD test dataset under the trained model, and compare it with the likelihood for sequences in the in-distribution test dataset. The AUROC, AUPRC, and FPR80 scores are 0.626, 0.613, and 0.661, respectively (Table 1b).

We train a background model by using the perturbed in-distribution data and optionally adding L2 regularization to the model weights.
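To make the pipeline above concrete, here is a minimal, self-contained sketch. It is not the paper's code: the LSTM is replaced by an add-one-smoothed bigram model as a stand-in autoregressive model, the training data are synthetic GC-rich strings rather than real reads, and the perturbation rate is set to an exaggerated µ = 0.9 so that this tiny background model is close to uniform (the paper instead tunes µ on the validation OOD data). The names `perturb`, `BigramModel`, and `llr_score` are ours.

```python
import numpy as np

ALPHABET = "ACGT"

def perturb(seq, mu, rng):
    """Independently replace each position with prob. mu by a uniform random letter."""
    return "".join(c if rng.random() >= mu else ALPHABET[rng.integers(4)] for c in seq)

class BigramModel:
    """Add-one-smoothed first-order Markov model: a toy stand-in for the LSTM p(x_d | x_<d)."""
    def __init__(self):
        self.start = np.ones(4)        # counts for the first position
        self.trans = np.ones((4, 4))   # transition counts

    def fit(self, seqs):
        for s in seqs:
            idx = [ALPHABET.index(c) for c in s]
            self.start[idx[0]] += 1
            for a, b in zip(idx, idx[1:]):
                self.trans[a, b] += 1
        return self

    def log_prob(self, s):
        idx = [ALPHABET.index(c) for c in s]
        p = self.trans / self.trans.sum(axis=1, keepdims=True)
        lp = np.log(self.start[idx[0]] / self.start.sum())
        lp += sum(np.log(p[a, b]) for a, b in zip(idx, idx[1:]))
        return lp

def llr_score(x, foreground, background):
    """OOD score: log p(x | foreground model) - log p(x | background model)."""
    return foreground.log_prob(x) - background.log_prob(x)

rng = np.random.default_rng(0)
train = ["GCGCGCGCGC" * 5] * 100  # toy GC-rich "in-distribution" sequences
foreground = BigramModel().fit(train)
background = BigramModel().fit([perturb(s, mu=0.9, rng=rng) for s in train])

in_dist, ood = "GCGCGCGCGC" * 5, "ATATATATAT" * 5
score_in = llr_score(in_dist, foreground, background)
score_ood = llr_score(ood, foreground, background)
# The in-distribution sequence receives a higher likelihood-ratio score than the OOD one.
assert score_in > score_ood
```

Contrasting against the near-uniform background model removes the probability mass that both models assign to generic background composition, so the score focuses on the sequence structure that is specific to the in-distribution data.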
Hyperparameters are tuned on a validation dataset that contains the in-distribution classes and the validation OOD classes; the validation OOD classes are disjoint from the test OOD classes. Contrasting against the background model, the AUROC, AUPRC, and FPR80 for the likelihood ratio improve significantly to 0.755, 0.719, and 0.474, respectively (Table 1b, Figure 5b). Compared with the likelihood, the AUROC and AUPRC for the likelihood ratio increase by 20% and 17%, respectively, and the FPR80 decreases by 28%. Furthermore, Figure 5a shows that the likelihood ratio is less sensitive to GC-content, and the separation between the in-distribution and OOD distributions becomes clearer. We evaluate the other baseline methods on the test dataset as well. For classifier-based baselines, we construct CNNs with one convolutional layer, one max-pooling layer, one dense layer, and a final dense layer with softmax activation for predicting class probabilities, as in Alipanahi et al. (2015); Busia et al. (2018); Ren et al. (2018b). Comparing our method to the baselines in Table 1b, our method achieves the highest AUROC and AUPRC and the lowest FPR80 on the test dataset. The ensemble method and ODIN perform better than the other baseline methods. Compared with the Fashion-MNIST vs. MNIST experiment, the Mahalanobis distance performs worse for detecting genomic OOD inputs, possibly because Fashion-MNIST and MNIST images are quite distinct, while the in-distribution and OOD bacteria classes are interlaced under the same taxonomy (see Figure S5 for the phylogenetic tree of the in-distribution and OOD classes).

Figure 4: The design of the training, validation, and test datasets for genomic sequence classification, including in-distribution and OOD data.

Figure 5: (a) The likelihood-ratio score is roughly independent of the GC-content, which makes it less susceptible to background statistics and better suited for OOD detection.
(b) ROCs and AUROCs for OOD detection using the likelihood and the likelihood ratio. (c) Correlation between the AUROC of OOD detection and the distance to the in-distribution classes, for the likelihood ratio and the ensemble method.

OOD detection performance correlates with distance to the in-distribution. We investigate the effect of the distance between an OOD class and the in-distribution classes on the performance of OOD detection. To measure this distance, we randomly select a representative genome from each of the in-distribution and OOD classes. We use a state-of-the-art alignment-free method for genome comparison, d_2^S (Ren et al., 2018a; Reinert et al., 2009), to compute the genetic distance between each pair of genomes in the set. This genetic distance is calculated based on the similarity between the normalized nucleotide word (k-tuple) frequencies of the two genomes, and studies have shown that it reflects true evolutionary distances between genomes (Chan et al., 2014; Bernard et al., 2016; Lu et al., 2017). For each OOD class, we use the minimum distance between its representative genome and all genomes in the in-distribution classes as the measure of the genetic distance between that OOD class and the in-distribution. Not surprisingly, the AUROC for OOD detection is positively correlated with the genetic distance (Figure 5c): an OOD class far away from the in-distribution is easier to detect. Comparing our likelihood ratio method with one of the best classifier-based methods, the ensemble method, we observe that our likelihood ratio method generally achieves a higher AUROC across OOD classes. Furthermore, the Pearson correlation coefficient (PCC) between the minimum distance and the AUROC is higher for our likelihood ratio method (0.570) than for the classifier-based ensemble method with 20 models (0.277). The dataset and code for the genomics study are available at https://github.com/google-research/google-research/tree/master/genomics_ood.

6 Discussion and Conclusion

We investigate deep generative model-based methods for OOD detection and show that the likelihood of auto-regressive models can be confounded by background statistics, providing an explanation for the failure of PixelCNN for OOD detection observed in recent work (Nalisnick et al., 2018; Hendrycks et al., 2018; Shafaei et al., 2018). We propose a likelihood ratio method that alleviates this issue by contrasting the likelihood against a background model. We show that our method effectively corrects for the background components and significantly improves the accuracy of OOD detection on both image and genomic datasets. Finally, we create and release a realistic genomic sequence dataset for OOD detection that highlights an important real-world problem, and we hope that it serves as a valuable OOD detection benchmark for the research community.

Acknowledgments

We thank Alexander A. Alemi, Andreea Gane, Brian Lee, D. Sculley, Eric Jang, Jacob Burnim, Katherine Lee, Matthew D. Hoffman, Noah Fiedel, Rif A. Saurous, Suman Ravuri, Thomas Colthurst, Yaniv Ovadia, the Google Brain Genomics team, and the Google TensorFlow Probability team for helpful feedback and discussions.

References

Ahlgren, N. A., Ren, J., Lu, Y. Y., Fuhrman, J. A., and Sun, F. Alignment-free oligonucleotide frequency dissimilarity measure improves prediction of hosts from metagenomically-derived viral sequences.
Nucleic Acids Research, 45(1):39–53, 2016.

Alemi, A. A., Fischer, I., and Dillon, J. V. Uncertainty in the variational information bottleneck. arXiv preprint arXiv:1807.00906, 2018.

Alipanahi, B., Delong, A., Weirauch, M. T., and Frey, B. J. Predicting the sequence specificities of DNA- and RNA-binding proteins by deep learning. Nature Biotechnology, 33(8):831, 2015.

Amodei, D., Olah, C., Steinhardt, J., Christiano, P., Schulman, J., and Mané, D. Concrete problems in AI safety. arXiv preprint arXiv:1606.06565, 2016.

Bailey, T. L. and Elkan, C. The value of prior knowledge in discovering motifs with MEME. In ISMB, volume 3, pp. 21–29, 1995.

Bernard, G., Chan, C. X., and Ragan, M. A. Alignment-free microbial phylogenomics under scenarios of sequence divergence, genome rearrangement and lateral genetic transfer. Scientific Reports, 6:28970, 2016.

Bishop, C. M. Novelty detection and neural network validation. IEE Proceedings-Vision, Image and Signal Processing, 141(4):217–222, 1994.

Bishop, C. M. Regularization and complexity control in feed-forward networks. In Proceedings International Conference on Artificial Neural Networks ICANN, volume 95, pp. 141–148, 1995a.

Bishop, C. M. Training with noise is equivalent to Tikhonov regularization. Neural Computation, 7(1):108–116, 1995b.

Blauwkamp, T. A., Thair, S., Rosen, M. J., Blair, L., Lindner, M. S., Vilfan, I. D., Kawli, T., Christians, F. C., Venkatasubrahmanyam, S., Wall, G. D., et al. Analytical and clinical validation of a microbial cell-free DNA sequencing test for infectious disease. Nature Microbiology, 4(4):663, 2019.

Brady, A. and Salzberg, S. L. Phymm and PhymmBL: metagenomic phylogenetic classification with interpolated Markov models. Nature Methods, 6(9):673, 2009.

Bulatov, Y. NotMNIST dataset, 2011. URL http://yaroslavvb.blogspot.com/2011/09/notmnist-dataset.html.

Busia, A., Dahl, G.
E., Fannjiang, C., Alexander, D. H., Dorfman, E., Poplin, R., McLean, C. Y., Chang, P.-C., and DePristo, M. A deep learning approach to pattern recognition for short DNA sequences. bioRxiv, pp. 353474, 2018.

Chan, C. X., Bernard, G., Poirion, O., Hogan, J. M., and Ragan, M. A. Inferring phylogenies of evolving sequences without multiple sequence alignment. Scientific Reports, 4:6504, 2014.

Choi, H., Jang, E., and Alemi, A. A. WAIC, but why? Generative ensembles for robust anomaly detection. arXiv preprint arXiv:1810.01392, 2018.

Eckburg, P. B., Bik, E. M., Bernstein, C. N., Purdom, E., Dethlefsen, L., Sargent, M., Gill, S. R., Nelson, K. E., and Relman, D. A. Diversity of the human intestinal microbial flora. Science, 308(5728):1635–1638, 2005.

Goodfellow, I. J., Shlens, J., and Szegedy, C. Explaining and harnessing adversarial examples. arXiv preprint arXiv:1412.6572, 2014.

Guo, C., Pleiss, G., Sun, Y., and Weinberger, K. Q. On calibration of modern neural networks. arXiv preprint arXiv:1706.04599, 2017.

Gupta, A., Müller, A. T., Huisman, B. J., Fuchs, J. A., Schneider, P., and Schneider, G. Generative recurrent networks for de novo drug design. Molecular Informatics, 37(1-2):1700111, 2018.

He, K., Zhang, X., Ren, S., and Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778, 2016.

Hendrycks, D. and Gimpel, K. A baseline for detecting misclassified and out-of-distribution examples in neural networks. arXiv preprint arXiv:1610.02136, 2016.

Hendrycks, D., Mazeika, M., and Dietterich, T. G. Deep anomaly detection with outlier exposure. arXiv preprint arXiv:1812.04606, 2018.

Hildebrand, F., Meyer, A., and Eyre-Walker, A. Evidence of selection upon genomic GC-content in bacteria. PLoS Genetics, 6(9):e1001107, 2010.

Hochreiter, S. and Schmidhuber, J. Long short-term memory.
Neural Computation, 9(8):1735–1780, November 1997.

Jozefowicz, R., Vinyals, O., Schuster, M., Shazeer, N., and Wu, Y. Exploring the limits of language modeling. arXiv preprint arXiv:1602.02410, 2016.

Killoran, N., Lee, L. J., Delong, A., Duvenaud, D., and Frey, B. J. Generating and designing DNA with deep generative models. arXiv preprint arXiv:1712.06148, 2017.

Kingma, D. P. and Dhariwal, P. Glow: Generative flow with invertible 1x1 convolutions. In NeurIPS, 2018.

Lakshminarayanan, B., Pritzel, A., and Blundell, C. Simple and scalable predictive uncertainty estimation using deep ensembles. In NeurIPS, 2017.

LeCun, Y., Bottou, L., Bengio, Y., Haffner, P., et al. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.

Lee, K., Lee, H., Lee, K., and Shin, J. Training confidence-calibrated classifiers for detecting out-of-distribution samples. arXiv preprint arXiv:1711.09325, 2017.

Lee, K., Lee, K., Lee, H., and Shin, J. A simple unified framework for detecting out-of-distribution samples and adversarial attacks. In NeurIPS, 2018.

Liang, S., Li, Y., and Srikant, R. Enhancing the reliability of out-of-distribution image detection in neural networks. arXiv preprint arXiv:1706.02690, 2017.

Lu, Y. Y., Tang, K., Ren, J., Fuhrman, J. A., Waterman, M. S., and Sun, F. CAFE: aCcelerated Alignment-FrEe sequence analysis. Nucleic Acids Research, 45(W1):W554–W559, 2017.

Luhn, H. P. Key word-in-context index for technical literature (KWIC index). American Documentation, 11(4):288–295, 1960.

Nalisnick, E., Matsukawa, A., Teh, Y. W., Gorur, D., and Lakshminarayanan, B. Do deep generative models know what they don't know? arXiv preprint arXiv:1810.09136, 2018.

Nalisnick, E., Matsukawa, A., Teh, Y. W., Gorur, D., and Lakshminarayanan, B. Hybrid models with deep and invertible features.
In ICML, 2019.

Nayfach, S., Shi, Z. J., Seshadri, R., Pollard, K. S., and Kyrpides, N. C. New insights from uncultivated genomes of the global human gut microbiome. Nature, pp. 1, 2019.

Nguyen, A., Yosinski, J., and Clune, J. Deep neural networks are easily fooled: High confidence predictions for unrecognizable images. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 427–436, 2015.

Olivecrona, M., Blaschke, T., Engkvist, O., and Chen, H. Molecular de-novo design through deep reinforcement learning. Journal of Cheminformatics, 9(1):48, 2017.

Oord, A. v. d., Kalchbrenner, N., and Kavukcuoglu, K. Pixel recurrent neural networks. arXiv preprint arXiv:1601.06759, 2016.

Ovadia, Y., Fertig, E., Ren, J., Nado, Z., Sculley, D., Nowozin, S., Dillon, J. V., Lakshminarayanan, B., and Snoek, J. Can you trust your model's uncertainty? Evaluating predictive uncertainty under dataset shift. arXiv preprint arXiv:1906.02530, 2019.

Patil, K. R., Haider, P., Pope, P. B., Turnbaugh, P. J., Morrison, M., Scheffer, T., and McHardy, A. C. Taxonomic metagenome sequence assignment with structured output models. Nature Methods, 8(3):191, 2011.

Ponsero, A. J. and Hurwitz, B. L. The promises and pitfalls of machine learning for detecting viruses in aquatic metagenomes. Frontiers in Microbiology, 10:806, 2019.

Reinert, G., Chew, D., Sun, F., and Waterman, M. S. Alignment-free sequence comparison (I): statistics and power. Journal of Computational Biology, 16(12):1615–1634, 2009.

Ren, J., Bai, X., Lu, Y. Y., Tang, K., Wang, Y., Reinert, G., and Sun, F. Alignment-free sequence analysis and applications. Annual Review of Biomedical Data Science, 1:93–114, 2018a.

Ren, J., Song, K., Deng, C., Ahlgren, N. A., Fuhrman, J. A., Li, Y., Xie, X., and Sun, F. Identifying viruses from metagenomic data by deep learning. arXiv preprint arXiv:1806.07810, 2018b.

Rosen, G.
L., Reichenberger, E. R., and Rosenfeld, A. M. NBC: the naive Bayes classification tool webserver for taxonomic classification of metagenomic reads. Bioinformatics, 27(1):127–129, 2010.

Salimans, T., Karpathy, A., Chen, X., Kingma, D. P., and Bulatov, Y. PixelCNN++: A PixelCNN implementation with discretized logistic mixture likelihood and other modifications. In ICLR, 2017.

Shafaei, A., Schmidt, M., and Little, J. J. Does your model know the digit 6 is not a cat? A less biased evaluation of "outlier" detectors. arXiv preprint arXiv:1809.04729, 2018.

Sueoka, N. On the genetic basis of variation and heterogeneity of DNA base composition. Proceedings of the National Academy of Sciences, 48(4):582–592, 1962.

Van den Oord, A., Kalchbrenner, N., Espeholt, L., Vinyals, O., Graves, A., et al. Conditional image generation with PixelCNN decoders. In NeurIPS, 2016.

Wagstaff, K. L. Machine learning that matters. In ICML, 2012.

Yarza, P., Richter, M., Peplies, J., Euzeby, J., Amann, R., Schleifer, K.-H., Ludwig, W., Glöckner, F. O., and Rosselló-Móra, R. The All-Species Living Tree project: a 16S rRNA-based phylogenetic tree of all sequenced type strains. Systematic and Applied Microbiology, 31(4):241–250, 2008.

Zhou, J. and Troyanskaya, O. G. Predicting effects of noncoding variants with deep learning–based sequence model. Nature Methods, 12(10):931, 2015.

Zhu, Z., Ren, J., Michail, S., and Sun, F. Metagenomic unmapped reads provide important insights into human microbiota and disease associations. bioRxiv, pp. 504829, 2018.

Zou, J., Huss, M., Abid, A., Mohammadi, P., Torkamani, A., and Telenti, A. A primer on deep learning in genomics. Nature Genetics, pp.
1, 2018.