{"title": "Reverse KL-Divergence Training of Prior Networks: Improved Uncertainty and Adversarial Robustness", "book": "Advances in Neural Information Processing Systems", "page_first": 14547, "page_last": 14558, "abstract": "Ensemble approaches for uncertainty estimation have recently been applied to the tasks of misclassification detection, out-of-distribution input detection and adversarial attack detection. Prior Networks have been proposed as an approach to efficiently emulate an ensemble of models for classification by parameterising a Dirichlet prior distribution over output distributions. These models have been shown to outperform alternative ensemble approaches, such as Monte-Carlo Dropout, on the task of out-of-distribution input detection. However, scaling Prior Networks to complex datasets with many classes is difficult using the training criteria originally proposed. This paper makes two contributions. First, we show that the appropriate training criterion for Prior Networks is the reverse KL-divergence between Dirichlet distributions. This addresses issues in the nature of the training data target distributions, enabling prior networks to be successfully trained on classification tasks with arbitrarily many classes, as well as improving out-of-distribution detection performance. Second, taking advantage of this new training criterion, this paper investigates using Prior Networks to detect adversarial attacks and proposes a generalized form of adversarial training. 
It is shown that the construction of successful adaptive whitebox attacks, which affect the prediction and evade detection, against Prior Networks trained on CIFAR-10 and CIFAR-100 using the proposed approach requires a greater amount of computational effort than against networks defended using standard adversarial training or MC-dropout.", "full_text": "Reverse KL-Divergence Training of Prior Networks:\nImproved Uncertainty and Adversarial Robustness\n\nAndrey Malinin \u21e4\nYandex Research\n\nam969@yandex-team.ru\n\nMark Gales\n\nDepartment of Engineering\nUniversity of Cambridge\nmjfg@eng.cam.ac.uk\n\nAbstract\n\nEnsemble approaches for uncertainty estimation have recently been applied to\nthe tasks of misclassi\ufb01cation detection, out-of-distribution input detection and\nadversarial attack detection. Prior Networks have been proposed as an approach\nto ef\ufb01ciently emulate an ensemble of models for classi\ufb01cation by parameteris-\ning a Dirichlet prior distribution over output distributions. These models have\nbeen shown to outperform alternative ensemble approaches, such as Monte-Carlo\nDropout, on the task of out-of-distribution input detection. However, scaling\nPrior Networks to complex datasets with many classes is dif\ufb01cult using the train-\ning criteria originally proposed. This paper makes two contributions. First, we\nshow that the appropriate training criterion for Prior Networks is the reverse KL-\ndivergence between Dirichlet distributions. This addresses issues in the nature of\nthe training data target distributions, enabling prior networks to be successfully\ntrained on classi\ufb01cation tasks with arbitrarily many classes, as well as improving\nout-of-distribution detection performance. Second, taking advantage of this new\ntraining criterion, this paper investigates using Prior Networks to detect adversarial\nattacks and proposes a generalized form of adversarial training. 
It is shown that the\nconstruction of successful adaptive whitebox attacks, which affect the prediction\nand evade detection, against Prior Networks trained on CIFAR-10 and CIFAR-100\nusing the proposed approach requires a greater amount of computational effort than\nagainst networks defended using standard adversarial training or MC-dropout.\n\n1\n\nIntroduction\n\nNeural Networks (NNs) have become the dominant approach to addressing computer vision (CV)\n[1, 2, 3], natural language processing (NLP) [4, 5, 6], speech recognition (ASR) [7, 8] and bio-\ninformatics [9, 10] tasks. One important challenge is for NNs to make reliable estimates of con\ufb01dence\nin their predictions. Notable progress has recently been made on predictive uncertainty estimation for\nDeep Learning through the de\ufb01nition of baselines, tasks and metrics [11], and the development of\npractical methods for estimating uncertainty using ensemble methods, such as Monte-Carlo Dropout\n[12] and Deep Ensembles [13]. Uncertainty estimates derived from ensemble approaches have\nbeen successfully applied to the tasks of detecting misclassi\ufb01cations and out-of-distribution inputs,\nand have also been investigated for adversarial attack detection [14, 15]. However, ensembles can\nbe computationally expensive and it is hard to control their behaviour. Recently, [16] proposed\nPrior Networks - a new approach to modelling uncertainty which has been shown to outperform\nMonte-Carlo dropout on a range of tasks. 
Prior Networks parameterize a Dirichlet prior over output\ndistributions, which allows them to emulate an ensemble of models using a single network, whose\nbehaviour can be explicitly controlled via choice of training data.\n\n\u21e4Work done while at Cambridge University Department of Engineering\n\n33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.\n\n\fIn [16], Prior Networks are trained using the forward KL-divergence between the model and a target\nDirichlet distribution. It is, however, necessary to use auxiliary losses, such as cross-entropy, to\nyield competitive classi\ufb01cation performance. Furthermore, it is also dif\ufb01cult to train Prior Networks\nusing this criterion on complex datasets with many classes. In this work we show that the forward\nKL-divergence (KL) is an inappropriate optimization criterion and instead propose to train Prior\nNetworks with the reverse KL-divergence (RKL) between the model and a target Dirichlet distribution.\nIn sections 3 and 4 of this paper it is shown, both theoretically and empirically on synthetic data, that\nthis loss yields the desired behaviours of a Prior Network and does not require auxiliary losses. In\nsection 5 Prior Networks are successfully trained on a range of image classi\ufb01cation tasks using the\nproposed criterion without loss of classi\ufb01cation performance. It is also shown that these models yield\nbetter out-of-distribution detection performance on the CIFAR-10 and CIFAR-100 datasets than Prior\nNetworks trained using forward KL-divergence.\nAn interesting application of uncertainty estimation is the detection of adversarial attacks, which are\nsmall perturbations to the input that are almost imperceptible to humans, yet which drastically affect\nthe predictions of the neural network [17]. 
Adversarial attacks are a serious security concern, as there\nexists a plethora of adversarial attacks which are quite easy to construct [18, 19, 20, 21, 22, 23, 24, 25].\nAt the same time, while it is possible to improve the robustness of a network to adversarial attacks\nusing adversarial training [17] and adversarial distillation [26], it is still possible to craft successful\nadversarial attacks against these networks [21]. Instead of considering robustness to adversarial\nattacks, [14] investigates detection of adversarial attacks and shows that adversarial attacks can be\ndetectable using a range of approaches. While, adaptive attacks can be crafted to successfully attack\nthe proposed detection schemes, [14] singles out detection of adversarial attacks using uncertainty\nmeasures derived from Monte-Carlo dropout as being more challenging to successfully overcome\nusing adaptive attacks. Thus, in this work we investigate the detection of adversarial attacks using\nPrior Networks, which have previously outperformed Monte-Carlo dropout on other tasks.\nUsing the greater degree of control over the behaviour of Prior Networks which the reverse KL-\ndivergence loss affords, Prior Networks are trained to predict the correct class on adversarial inputs,\nbut yield a higher measure of uncertainty than on natural inputs. Effectively, this is a direct general-\nization of adversarial training [17] which improves both the robustness of the model to adversarial\nattacks and also allows them to be detected. 
As Prior Networks yield measures of uncertainty derived\nfrom distributions over output distributions, rather than simple con\ufb01dences, adversarial attacks need to\nsatisfy far more constraints in order to both successfully attack the Prior Network and evade detection.\nResults in section 6 show that on the CIFAR-10 and CIFAR-100 datasets it is more computationally\nchallenging to construct adaptive adversarial attacks against Prior Networks than against standard\nDNNs, adversarially trained DNNs and Monte-Carlo dropout defended networks.\nThus, the two main contributions of this paper are the following. Firstly, a new reverse KL-divergence\ntraining criterion which yields the desired behaviour of Prior Networks and allows them to be trained\non more complex datasets. Secondly, a generalized form of adversarial training, enabled using the\nproposed training criterion, which makes successful adaptive whitebox attacks, which aim to both\nattack the network and evade detection, far more computationally expensive to construct for Prior\nNetworks than for models defended using standard adversarial training or Monte-Carlo dropout.\n\n2 Prior Networks\n\nAn ensemble of models can be interpreted as a set of output distributions drawn from an implicit\nconditional distribution over output distributions. A Prior Network p(\u21e1|x\u21e4; \u02c6\u2713) 2, is a neural network\nwhich explicitly parametrizes a prior distribution over output distributions. This effectively allows a\nPrior Network to emulate an ensemble and yield the same measures of uncertainty [27, 28], but in\nclosed form and without sampling.\n\np(\u21e1|x\u21e4; \u02c6\u2713) = p(\u21e1| \u02c6\u21b5),\n\n\u02c6\u21b5 = f (x\u21e4; \u02c6\u2713)\n\n(1)\n\nA Prior Network typically parameterizes the Dirichlet distribution3 (eq. 2), which is the conjugate\nprior to the categorical, due to its tractable analytic properties. 
The Dirichlet distribution is defined as:

p(\pi; \alpha) = \frac{\Gamma(\alpha_0)}{\prod_{c=1}^{K} \Gamma(\alpha_c)} \prod_{c=1}^{K} \pi_c^{\alpha_c - 1}, \quad \alpha_c > 0, \quad \alpha_0 = \sum_{c=1}^{K} \alpha_c    (2)

where \Gamma(\cdot) is the gamma function. The Dirichlet distribution is parameterized by its concentration parameters \alpha, where \alpha_0, the sum of all \alpha_c, is called the precision of the Dirichlet distribution. Higher values of \alpha_0 lead to sharper, more confident distributions. The predictive distribution of a Prior Network is given by the expected categorical distribution under the conditional Dirichlet prior:

P(y = \omega_c | x^*; \hat{\theta}) = E_{p(\pi|x^*;\hat{\theta})}[P(y = \omega_c | \pi)] = \hat{\pi}_c = \frac{\hat{\alpha}_c}{\sum_{k=1}^{K} \hat{\alpha}_k} = \frac{e^{\hat{z}_c}}{\sum_{k=1}^{K} e^{\hat{z}_k}}    (3)

where \hat{z}_c are the logits predicted by the model. The desired behaviors of a Prior Network, as described in [16], can be visualized on a simplex in figure 1.

2 Here \pi = [P(y = \omega_1), ..., P(y = \omega_K)]^T are the parameters of a categorical distribution.
3 Alternate choices of distribution, such as a mixture of Dirichlets or the Logistic-Normal, are possible.
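Equation 3 implies that, under the exponential parameterization \hat{\alpha}_c = e^{\hat{z}_c}, the Prior Network's predictive distribution coincides with the softmax of the logits. A minimal numpy sketch of this mapping (function names are illustrative, not from the paper's code):

```python
import numpy as np

def dirichlet_predictive(logits):
    """Expected categorical under the Dirichlet prior (eq. 3).

    The logits z parameterize the concentration via alpha_c = exp(z_c);
    the predictive distribution is alpha / alpha_0, i.e. softmax(z).
    """
    alphas = np.exp(logits)
    alpha0 = alphas.sum(axis=-1, keepdims=True)  # Dirichlet precision
    return alphas / alpha0

logits = np.array([2.0, 0.5, -1.0])
probs = dirichlet_predictive(logits)  # identical to softmax(logits)
```

Note that the precision \alpha_0 carries additional information about the model's confidence that is discarded when only the softmax output is kept.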
Here, figure 1:a describes confident behavior (low-entropy prior focused on low-entropy output distributions), figure 1:b describes uncertainty due to severe class overlap (data uncertainty) and figure 1:c describes the behaviour for an out-of-distribution input (knowledge uncertainty).

Figure 1: Desired behaviors of a Dirichlet distribution over categorical distributions: (a) low uncertainty, (b) high data uncertainty, (c) out-of-distribution.

Given a Prior Network which yields the desired behaviours, it is possible to derive measures of uncertainty in the prediction by considering the mutual information between y and \pi:

\underbrace{MI[y, \pi | x^*; \hat{\theta}]}_{Knowledge\ Uncertainty} = \underbrace{H\big[E_{p(\pi|x^*;\hat{\theta})}[P(y|\pi)]\big]}_{Total\ Uncertainty} - \underbrace{E_{p(\pi|x^*;\hat{\theta})}\big[H[P(y|\pi)]\big]}_{Expected\ Data\ Uncertainty}    (4)

The given expression allows total uncertainty, given by the entropy of the predictive distribution, to be decomposed into data uncertainty and knowledge uncertainty. Data uncertainty arises due to class-overlap in the data, which is the equivalent of noise for classification problems. Knowledge uncertainty, also known as epistemic uncertainty [12] or distributional uncertainty [16], arises due to the model's lack of understanding or knowledge about the input. In other words, knowledge uncertainty arises due to a mismatch between the training and test data.

3 Forward and Reverse KL-Divergence Losses

As Prior Networks parameterize the Dirichlet distribution, ideally we would like to have a dataset D_trn = \{x^{(i)}, \beta^{(i)}\}_{i=1}^{N}, where \beta^{(i)} are the parameters of a target Dirichlet distribution p(\pi|\beta). In this scenario, we could simply minimize the (forward) KL-divergence between the model and the target for every training sample x^{(i)}.
Alternatively, if we had a set of samples of categorical distributions from the target Dirichlet distribution for every input, then we could maximize their likelihood under the predicted Dirichlet [29], which, in expectation, is equivalent to minimizing the KL-divergence. In practice, however, we only have access to the target class label y^{(i)} \in \{\omega_1, ..., \omega_K\} for every input x^{(i)}. When training standard DNNs with cross-entropy loss this isn't a problem, as the correct target distribution \hat{P}_{tr}(y|x) is induced in expectation, as shown below:

L(\theta, D_{trn}) = E_{\hat{p}_{tr}(x)}\big[ -\sum_{c=1}^{K} E_{\hat{P}_{tr}(y|x)}[I(y = \omega_c)] \ln P(\hat{y} = \omega_c | x; \theta) \big]
                   = E_{\hat{p}_{tr}(x)}\big[ -\sum_{c=1}^{K} \hat{P}_{tr}(y = \omega_c|x) \ln P(\hat{y} = \omega_c | x; \theta) \big]
                   = E_{\hat{p}_{tr}(x)}\big[ KL[\hat{P}_{tr}(y|x) \,||\, P(y|x; \theta)] \big] + const
                   = E_{\hat{p}_{tr}(x)}\big[ KL[\pi_{tr} \,||\, \hat{\pi}] \big] + const    (5)

Unfortunately, training models which are a (higher-order) distribution over predictive distributions based on samples from the (lower-order) predictive distribution is more challenging. The solution to this problem proposed in the original work on Prior Networks [16] was to minimize the forward KL-divergence between the model and a target Dirichlet distribution p(\pi|\beta^{(c)}):

L^{KL}(y, x, \theta; \beta) = \sum_{c=1}^{K} I(y = \omega_c) \cdot KL[p(\pi|\beta^{(c)}) \,||\, p(\pi|x; \theta)]    (6)

The target concentration parameters \beta^{(c)} depend on the class c and are set manually as follows:

\beta^{(c)}_k = \begin{cases} \beta + 1 & \text{if } c = k \\ 1 & \text{if } c \neq k \end{cases}    (7)

where \beta is a hyper-parameter which is set by hand, rather than learned from the data. This criterion is jointly optimized on in-domain and out-of-domain data D_trn and D_out as follows:

L(\theta, D; \beta_{in}, \beta_{out}, \gamma) = L^{KL}_{in}(\theta, D_{trn}; \beta_{in}) + \gamma \cdot L^{KL}_{out}(\theta, D_{out}; \beta_{out})    (8)

where \gamma is the out-of-distribution loss weight.
In-domain, \beta_{in} should take on a large value, for example 1e2, so that the concentration is high only in the corner corresponding to the target class, and low elsewhere. Note, the concentration parameters have to be strictly positive, so it is not possible to set the rest of the concentration parameters to 0. Instead, they are set to one, which also provides a small degree of smoothing. Out-of-domain, \beta_{out} = 0, which results in a flat Dirichlet distribution. However, there is a significant issue with this criterion. Consider taking the expectation of equation 6 with respect to the empirical distribution \hat{p}_{tr}(x, y) = \{x^{(i)}, y^{(i)}\}_{i=1}^{N} = D_{trn}:

L^{KL}(\theta, D_{trn}; \beta) = E_{\hat{p}_{tr}(x,y)}\big[ \sum_{c=1}^{K} I(y = \omega_c) \cdot KL[p(\pi|\beta^{(c)}) \,||\, p(\pi|x; \theta)] \big]
                               = E_{\hat{p}_{tr}(x)}\big[ KL\big[ \sum_{c=1}^{K} \hat{P}_{tr}(y = \omega_c|x) \, p(\pi|\beta^{(c)}) \,\big|\big|\, p(\pi|x; \theta) \big] \big] + const    (9)

In expectation this loss induces a target distribution which is a mixture of Dirichlet distributions that has a mode in each corner of the simplex, as shown in figure 2:a. When the level of data uncertainty is low, this is not a problem, as there will only be a single significant mode. However, the target distribution will be multi-modal when there is a significant amount of data uncertainty. As the forward KL-divergence is zero-avoiding, it will drive the model to spread itself over each mode, effectively 'inverting' the Dirichlet distribution and forcing the precision \hat{\alpha}_0 to a low value, as depicted in figure 2:b. This is an undesirable behaviour and can compromise predictive performance. Rather, as previously stated, in regions of significant data uncertainty the model should yield a distribution with a single high-precision mode at the center of the simplex, as shown in figure 1:b. The main issue with the forward KL-divergence loss is that it induces an arithmetic mixture of target distributions p(\pi|\beta^{(c)}) in expectation.
This can be avoided by instead considering the reverse KL-divergence between the model and the target distribution p(\pi|\beta^{(c)}):

L^{RKL}(y, x, \theta; \beta) = \sum_{c=1}^{K} I(y = \omega_c) \cdot KL[p(\pi|x; \theta) \,||\, p(\pi|\beta^{(c)})]    (10)

Figure 2: (a) Induced target and (b) actual behaviour of the predicted Dirichlet distribution when trained with equation 6.

The expectation of this criterion with respect to the empirical distribution induces a geometric mixture of target Dirichlet distributions:

L^{RKL}(\theta, D_{trn}; \beta) = E_{\hat{p}_{tr}(x)}\big[ \sum_{c=1}^{K} \hat{P}_{tr}(y = \omega_c|x) \, KL[p(\pi|x; \theta) \,||\, p(\pi|\beta^{(c)})] \big]
                                = E_{\hat{p}_{tr}(x)}\big[ E_{p(\pi|x;\theta)}\big[ \ln p(\pi|x; \theta) - \ln \prod_{c=1}^{K} p(\pi|\beta^{(c)})^{\hat{P}_{tr}(y=\omega_c|x)} \big] \big]
                                = E_{\hat{p}_{tr}(x)}\big[ KL[p(\pi|x; \theta) \,||\, p(\pi|\bar{\beta})] \big] + const, \quad \bar{\beta} = \sum_{c=1}^{K} \hat{P}_{tr}(y = \omega_c|x) \cdot \beta^{(c)}    (11)

A geometric mixture of Dirichlet distributions results in a standard Dirichlet distribution whose concentration parameters \bar{\beta} are an arithmetic mixture of the target concentration parameters for each class. Thus, this loss induces a target distribution which is always a standard uni-modal Dirichlet with a mode at the point on the simplex which reflects the correct level of data uncertainty (figure 1a-b). Furthermore, as a consequence of using this loss in equation 8 instead of the forward KL-divergence, the concentration parameters are appropriately interpolated on the boundary of the in-domain and out-of-distribution regions, where the degree of interpolation depends on the OOD loss weight \gamma. Further analysis of the properties of the reverse KL-divergence loss is provided in appendix A. Finally, it is important to emphasize that this discussion is about what target distribution is induced in expectation when training models which parameterize a distribution over output distributions using samples from the output distribution.
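Per-sample, the reverse KL criterion of equation 10 reduces to the closed-form KL-divergence between two Dirichlet distributions. A minimal numpy/scipy sketch, using the equation 7 target construction (function names and the \beta = 1e2 default are illustrative):

```python
import numpy as np
from scipy.special import gammaln, digamma

def dirichlet_kl(alpha, beta):
    """Closed-form KL[Dir(alpha) || Dir(beta)]."""
    a0, b0 = alpha.sum(), beta.sum()
    return (gammaln(a0) - gammaln(b0)
            - np.sum(gammaln(alpha) - gammaln(beta))
            + np.sum((alpha - beta) * (digamma(alpha) - digamma(a0))))

def rkl_loss(model_alphas, label, beta=100.0):
    """Reverse KL loss of eq. 10 for one sample: the model's Dirichlet is
    pulled towards the eq. 7 target (beta + 1 on the label, 1 elsewhere)."""
    target = np.ones_like(model_alphas)
    target[label] += beta
    return dirichlet_kl(model_alphas, target)

# A model that already matches the sharp target incurs zero loss:
loss = rkl_loss(np.array([101.0, 1.0, 1.0]), label=0)
```

Because the model's Dirichlet appears as the first argument here, the loss is zero-forcing rather than zero-avoiding, which is what prevents the 'inverted' multi-modal solutions discussed above.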
It is necessary to stress that if either the parameters of, or samples from, the correct target distribution over output distributions are available for every input, then forward KL-divergence is a sensible training criterion.

4 Experiments on Synthetic Data

The previous section investigated the theoretical properties of forward and reverse KL-divergence training criteria for Prior Networks. In this section these criteria are assessed empirically by using them to train Prior Networks on the artificial high-uncertainty 3-class dataset4 introduced in [16]. In these experiments, the out-of-distribution training data D_out was sampled such that it forms a thin shell around the training data. The target concentration parameters \beta^{(c)} were constructed as described in equation 7, with \beta_{in} = 1e2 and \beta_{out} = 0. The in-domain and out-of-distribution losses were equally weighted (\gamma = 1). Figure 3 depicts the total uncertainty, expected data uncertainty and mutual information, which is a measure of knowledge uncertainty, derived using equation 4 from Prior Networks trained using both criteria. By comparing figures 3a and 3d it is clear that a Prior Network trained using forward KL-divergence over-estimates total uncertainty in domain, as the total uncertainty is equally high along the decision boundaries, in the region of class overlap and out-of-domain. The Prior Network trained using the reverse KL-divergence, on the other hand, yields an estimate of total uncertainty which better reflects the structure of the dataset.
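The uncertainty maps in figure 3 are derived via the decomposition in equation 4, which has a closed form for a Dirichlet. A minimal sketch, assuming the standard digamma identity for the expected entropy of a categorical under a Dirichlet (the identity itself is not spelled out in the text):

```python
import numpy as np
from scipy.special import digamma

def uncertainty_decomposition(alphas):
    """Eq. 4 for a Dirichlet with concentration parameters `alphas`.

    Returns (total, expected_data, mutual_information):
      total         = H[ E_p(pi)[P(y|pi)] ]   (entropy of the mean)
      expected_data = E_p(pi)[ H[P(y|pi)] ]   (closed form via digamma)
      mutual info   = total - expected_data   (knowledge uncertainty)
    """
    alpha0 = alphas.sum()
    probs = alphas / alpha0
    total = -np.sum(probs * np.log(probs))
    expected_data = -np.sum(probs * (digamma(alphas + 1.0) - digamma(alpha0 + 1.0)))
    return total, expected_data, total - expected_data

# A flat Dirichlet (the out-of-distribution target): high total uncertainty,
# with a large share of it being knowledge uncertainty.
t, d, mi = uncertainty_decomposition(np.array([1.0, 1.0, 1.0]))
```

As the precision \alpha_0 grows with the mean held fixed, the mutual information shrinks towards zero, matching the desired in-domain behaviour in figure 1:b.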
Figure 3b shows that the expected data uncertainty is altogether incorrectly estimated by the Prior Network trained via forward KL-divergence, as it is uniform over the entire in-domain region. As a result, the mutual information is higher in-domain along the decision boundaries than out-of-domain. In contrast, figures 3c and 3f show that the measures of uncertainty provided by a Prior Network trained using the reverse KL-divergence decompose correctly: data uncertainty is highest in regions of class overlap, while mutual information is low in-domain and high out-of-domain. Thus, these experiments support the analysis in the previous section, and illustrate how the reverse KL-divergence is the more suitable optimization criterion.

Figure 3: Comparison of measures of uncertainty derived from Prior Networks trained with forward KL (a: total uncertainty, b: data uncertainty, c: mutual information) and reverse KL (d: total uncertainty, e: data uncertainty, f: mutual information). Measures of uncertainty are derived via equation 4.

4Described in appendix B.

5 Image Classification Experiments

Having evaluated the forward and reverse KL-divergence losses on a synthetic dataset in the previous section, we now evaluate these losses on a range of image classification datasets. The training configurations are described in appendix C. Table 1 presents the classification error rates of standard DNNs, an ensemble of 5 DNNs [13], and Prior Networks trained using both the forward and reverse KL-divergence losses. From table 1 it is clear that Prior Networks trained using forward KL-divergence (PN-KL) achieve increasingly worse classification performance as the datasets become more complex and have a larger number of classes.
At the same time, Prior Networks trained using the reverse KL-divergence loss (PN-RKL) have similar error rates to ensembles and standard DNNs. Note that in these experiments no auxiliary losses were used.5

Table 1: Mean classification performance (% Error) ±2σ across 5 random initializations.

Dataset       | DNN        | PN-KL      | PN-RKL     | ENSM
MNIST         | 0.5 ±0.1   | 0.6 ±0.1   | 0.5 ±0.1   | 0.5 ± NA
SVHN          | 4.3 ±0.3   | 5.7 ±0.2   | 4.2 ±0.2   | 3.3 ± NA
CIFAR-10      | 8.0 ±0.4   | 14.7 ±0.4  | 7.5 ±0.3   | 6.6 ± NA
CIFAR-100     | 30.4 ±0.6  | -          | 28.1 ±0.2  | 26.9 ± NA
TinyImageNet  | 41.7 ±0.4  | -          | 40.3 ±0.4  | 36.9 ± NA

5An on-going PyTorch re-implementation of this paper, along with updated results, is available at https://github.com/KaosEngineer/PriorNetworks

Table 2 presents the out-of-distribution detection performance of Prior Networks trained on CIFAR-10 and CIFAR-100 [30] using the forward and reverse KL-divergences. Prior Networks trained on CIFAR-10 use CIFAR-100 as OOD training data, while Prior Networks trained on CIFAR-100 use TinyImageNet [31] as OOD training data. Performance is assessed using area under an ROC curve (AUROC) in the same fashion as in [16, 11]. The results on CIFAR-10 show that PN-RKL consistently yields better performance than PN-KL and the ensemble on all OOD test datasets (SVHN, LSUN and TinyImageNet). The results using models trained on CIFAR-100 show that Prior Networks are capable of out-performing the ensembles when evaluated against the LSUN and SVHN datasets. However, Prior Networks have difficulty distinguishing between the CIFAR-10 and CIFAR-100 test sets. This, however, represents a limitation of both the classification model and the OOD training data, rather than the training criterion.
Improving classification performance of Prior Networks on CIFAR-100, which improves understanding of what is 'in-domain', and using a more appropriate OOD training dataset, which provides a better contrast, is likely to improve OOD detection performance.

Table 2: Out-of-domain detection results (mean % AUROC ±2σ across 5 rand. inits) using mutual information (eqn. 4) derived from models trained on CIFAR-10 and CIFAR-100.

           |              CIFAR-10                  |             CIFAR-100
Model      | SVHN      | LSUN      | TinyImageNet   | SVHN      | LSUN        | CIFAR-10
ENSM       | 89.5 ± NA | 93.2 ± NA | 90.3 ± NA      | 78.9 ± NA | 85.6 ± NA   | 76.5 ± NA
PN-KL      | 97.8 ±1.1 | 91.6 ±1.7 | 92.4 ±0.9      | -         | -           | -
PN-RKL     | 98.2 ±1.1 | 95.7 ±0.9 | 95.7 ±0.7      | 84.8 ±0.8 | 100.0 ±0.0  | 57.8 ±0.4

6 Adversarial Attack Detection

The previous section discussed the use of the reverse KL-divergence training criterion for training Prior Networks. Here, we show that the proposed loss also offers a generalization of adversarial training [17, 25] which allows Prior Networks both to be more robust to adversarial attacks and to detect them as OOD samples. The use of measures of uncertainty for adversarial attack detection was previously studied in [14], where it was shown that Monte-Carlo dropout ensembles yield measures of uncertainty which are more challenging to attack than other considered methods. In a similar fashion to Monte-Carlo dropout, Prior Networks yield rich measures of uncertainty derived from distributions over distributions. For Prior Networks this means that for an adversarial attack to both affect the prediction and evade detection, it must satisfy several criteria. Firstly, the adversarial input must be located in a region of input space classified as the desired class.
Secondly, the adversarial input must be in a region of input space where both the relative and absolute magnitudes of the model's logits \hat{z}, and therefore all the measures of uncertainty derivable from the predicted distribution over distributions, are the same as for the natural input, making it challenging to distinguish between the natural and adversarial input. Clearly, this places more constraints on the space of solutions for successful adversarial attacks than detection based on the confidence of the prediction, which places a constraint only on the relative value of just a single logit.

Figure 4: Target Dirichlet distributions for (a) natural and (b) adversarial inputs.

Using the greater degree of control over the behaviour of Prior Networks which the reverse KL-divergence loss affords, Prior Networks can be explicitly trained to yield high uncertainty for example adversarial attacks, further constraining the space of successful solutions. Here, adversarially perturbed inputs are used as the out-of-distribution training data for which the Prior Network is trained to both yield the correct prediction and high measures of uncertainty. Thus, the Prior Network is jointly trained to yield either a sharp or wide Dirichlet distribution at the appropriate corner of the simplex for natural or adversarial data, respectively, as described in figure 4.

L(\theta, D; \beta_{in}, \beta_{adv}, \gamma) = L^{RKL}_{in}(\theta, D_{trn}; \beta_{in}) + \gamma \cdot L^{RKL}_{adv}(\theta, D_{adv}; \beta_{adv})    (12)

The target concentration parameters are set using equation 7, where \beta_{in} = 1e2 for natural and \beta_{adv} = 1 for adversarial data, for example. This approach can be seen as a generalization of adversarial training [17, 25].
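The combined objective of equation 12 can be sketched as follows. This is a hedged illustration of the loss computation only: the generation of the adversarial inputs (targeted FGSM during training) and the gradient step are framework-dependent and omitted, and all names are illustrative:

```python
import numpy as np
from scipy.special import gammaln, digamma

def dirichlet_kl(alpha, beta):
    """Closed-form KL[Dir(alpha) || Dir(beta)]."""
    a0, b0 = alpha.sum(), beta.sum()
    return (gammaln(a0) - gammaln(b0)
            - np.sum(gammaln(alpha) - gammaln(beta))
            + np.sum((alpha - beta) * (digamma(alpha) - digamma(a0))))

def target_dirichlet(label, beta, num_classes):
    """Eq. 7 target: concentration beta + 1 on the label, 1 elsewhere."""
    t = np.ones(num_classes)
    t[label] += beta
    return t

def generalized_adv_loss(nat_alphas, adv_alphas, labels,
                         beta_in=100.0, beta_adv=1.0, gamma=1.0):
    """Eq. 12: natural inputs are pulled towards a sharp Dirichlet at the
    correct corner, adversarial inputs towards a wide Dirichlet at the same
    corner, so they are correctly classified but flagged as uncertain."""
    num_classes = nat_alphas.shape[1]
    loss_in = np.mean([dirichlet_kl(a, target_dirichlet(y, beta_in, num_classes))
                       for a, y in zip(nat_alphas, labels)])
    loss_adv = np.mean([dirichlet_kl(a, target_dirichlet(y, beta_adv, num_classes))
                        for a, y in zip(adv_alphas, labels)])
    return loss_in + gamma * loss_adv
```

Because both terms share the same target corner of the simplex and differ only in precision, the model keeps the correct decision boundary on adversarial inputs while its mutual information rises, which is what makes them detectable.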
The difference is that here we are training the model to yield a particular behaviour of an entire distribution over output distributions, rather than simply making sure that the decision boundaries are correct in regions of input space which correspond to adversarial attacks. Furthermore, it is important to highlight that this generalized form of adversarial training is a drop-in replacement for standard adversarial training which only requires changing the loss function.

Figure 5: Adaptive attack detection performance in terms of mean Success Rate, % AUROC and Joint Success Rate (JSR) across 5 random inits, for whitebox attacks on CIFAR-10 (a-c), whitebox attacks on CIFAR-100 (d-f) and blackbox attacks on Prior Networks (g-i). L∞ bound on adversarial perturbation is 30 pixels.

As discussed in [14, 32], approaches to detecting adversarial attacks need to be evaluated against the strongest possible attacks: adaptive whitebox attacks which have full knowledge of the detection scheme and actively seek to bypass it. Here, targeted iterative PGD-MIM [20, 25] attacks are used for evaluation and simple targeted FGSM [17] attacks are used during training. The goal is to switch the prediction to a target class but leave measures of uncertainty derived from the model unchanged. Two forms of criteria, expressed in equation 13, are used to generate the adversarial sample \tilde{x}. For both criteria the target for the attacks is set to the second most likely class, as this should yield the least 'unnatural' perturbation of the outputs. The first approach involves permuting the model's predictive distribution over class labels \hat{\pi} and minimizing the forward KL-divergence between \hat{\pi} and the target permuted distribution \pi_{adv}.
This ensures that the target class is predicted, but places constraints only on the relative values of the logits, and therefore only on measures of uncertainty derived from the predictive posterior. The second approach involves permuting the concentration parameters \hat{\alpha} and minimizing the forward KL-divergence to the permuted target Dirichlet distribution p_{adv}(\pi). This places constraints on both the relative and absolute values of the logits, and therefore on measures of uncertainty derived from the entire distribution over distributions.

L^{KL}_{PMF}(y|\tilde{x}; \hat{\theta}, t) = KL[\pi_{adv} \,||\, \hat{\pi}], \qquad L^{KL}_{DIR}(\pi|\tilde{x}; \hat{\theta}, t) = KL[p_{adv}(\pi) \,||\, p(\pi|\tilde{x}; \hat{\theta})]    (13)

Though L^{KL}_{DIR} has more explicit constraints, it was found to be more challenging to optimize and to yield less aggressive attacks than L^{KL}_{PMF}.6 Thus, only attacks generated via L^{KL}_{PMF} are considered. In the following set of experiments Prior Networks are trained on either the CIFAR-10 or CIFAR-100 datasets [30] using the procedure discussed above and detailed in appendix C. The baseline models are an undefended DNN and a DNN trained using standard adversarial training (DNN-ADV). For these models uncertainty is estimated via the entropy of the predictive posterior. Additionally, estimates of mutual information (knowledge uncertainty) are derived via a Monte-Carlo dropout ensemble generated from each of these models. Similarly, Prior Networks also use the mutual information (eqn. 4) for adversarial attack detection. Performance is assessed via the Success Rate, AUROC and Joint Success Rate (JSR).
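The L^{KL}_{PMF} criterion can be sketched in numpy as follows: the clean prediction is permuted (top class swapped with the attack target) to form \pi_{adv}, and the attacker then minimizes KL[\pi_{adv} || \hat{\pi}(\tilde{x})] with respect to the perturbation. The gradient-based PGD-MIM update itself is framework-dependent and omitted; names are illustrative:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def permuted_target(clean_logits, target_class):
    """pi_adv: the clean predictive distribution with the probabilities
    of the top class and the attack target swapped."""
    probs = softmax(clean_logits)
    top = int(np.argmax(probs))
    perm = probs.copy()
    perm[[top, target_class]] = perm[[target_class, top]]
    return perm

def pmf_attack_loss(adv_logits, target_probs):
    """L_PMF of eq. 13: KL[pi_adv || pi_hat] on the perturbed input."""
    probs = softmax(adv_logits)
    return float(np.sum(target_probs * (np.log(target_probs) - np.log(probs))))

clean = np.array([3.0, 1.0, 0.0])
pi_adv = permuted_target(clean, target_class=1)  # second most likely class
```

Minimizing this loss drives the prediction on the perturbed input to the target class while preserving the shape of the predictive distribution, and hence the entropy-based detection scores, of a natural input.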
For the ROC curves considered here the true positive rate is computed using natural examples, while the false-positive rate is computed using only successful adversarial attacks7. The JSR, described in greater detail in appendix D, is the equal error rate, where the false positive rate equals the false negative rate, and allows joint assessment of adversarial robustness and detection. The results presented in figure 5 show that on both the CIFAR-10 and CIFAR-100 datasets whitebox attacks successfully change the prediction of DNN and DNN-ADV models to the second most likely class and evade detection (AUROC goes to 50). Monte-Carlo dropout ensembles are marginally harder to adversarially overcome, due to the random noise. At the same time, it takes far more iterations of gradient descent to successfully attack Prior Networks such that they fail to detect the attack. On CIFAR-10 the Joint Success Rate is only 0.25 at 1000 iterations, while the JSR for the other models is 0.5 (the maximum). Results on the more challenging CIFAR-100 dataset show that adversarially trained Prior Networks yield a more modest increase in robustness over baseline approaches, but it still takes significantly more computational effort to attack the model. Thus, these results support the assertion that adversarially trained Prior Networks constrain the solution space for adaptive adversarial attacks, making them computationally more difficult to successfully construct. At the same time, blackbox attacks, computed on identical networks trained on the same data from a different random initialization, fail entirely against Prior Networks trained on CIFAR-10 and CIFAR-100. This shows that the adaptive attacks considered here are non-transferable.

7 Conclusion

Prior Networks have been shown to be an interesting approach to deriving rich and interpretable measures of uncertainty from neural networks.
This work makes two contributions which aim to improve these models. Firstly, a new training criterion for Prior Networks, the reverse KL-divergence between Dirichlet distributions, is proposed. It is shown, both theoretically and empirically, that this criterion yields the desired set of behaviours of a Prior Network and allows these models to be trained on more complex datasets with arbitrary numbers of classes. Furthermore, it is shown that this loss improves out-of-distribution detection performance on the CIFAR-10 and CIFAR-100 datasets relative to the forward KL-divergence loss used in [16]. However, it is necessary to investigate the proper choice of out-of-distribution training data, as an inappropriate choice can limit OOD detection performance on complex datasets. Secondly, this improved training criterion enables Prior Networks to be applied to the task of detecting whitebox adaptive adversarial attacks. Specifically, adversarial training of Prior Networks can be seen as both a generalization of, and a drop-in replacement for, standard adversarial training which improves robustness to adversarial attacks and the ability to detect them by placing more constraints on the space of solutions to the optimization problem which yields adversarial attacks. It is shown that it is significantly more computationally challenging to construct successful adaptive whitebox PGD attacks against Prior Networks than against baseline models. It is necessary to point out that the evaluation of adversarial attack detection using Prior Networks is limited to only strong L∞ attacks.
It is of interest to assess how well Prior Networks are able to detect adaptive C&W L2 attacks [21] and EAD L1 attacks [33].

6 Results are described in appendix E.
7 This may result in the minimum AUROC performance being a little greater than 50 if the success rate is not 100%, as is the case for the MCDP AUROC in figure 5.

Acknowledgments
This paper reports on research partly supported by Cambridge Assessment, University of Cambridge. This work is also partly funded by a DTA EPSRC award.

References
[1] Ross Girshick, "Fast R-CNN," in Proc. 2015 IEEE International Conference on Computer Vision (ICCV), 2015, pp. 1440–1448.
[2] Karen Simonyan and Andrew Zisserman, "Very Deep Convolutional Networks for Large-Scale Image Recognition," in Proc. International Conference on Learning Representations (ICLR), 2015.
[3] Ruben Villegas, Jimei Yang, Yuliang Zou, Sungryull Sohn, Xunyu Lin, and Honglak Lee, "Learning to Generate Long-term Future via Hierarchical Prediction," in Proc. International Conference on Machine Learning (ICML), 2017.
[4] Tomas Mikolov et al., "Linguistic Regularities in Continuous Space Word Representations," in Proc. NAACL-HLT, 2013.
[5] Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean, "Efficient Estimation of Word Representations in Vector Space," 2013, arXiv:1301.3781.
[6] Tomas Mikolov, Martin Karafiát, Lukás Burget, Jan Cernocký, and Sanjeev Khudanpur, "Recurrent Neural Network Based Language Model," in Proc. INTERSPEECH, 2010.
[7] Geoffrey Hinton, Li Deng, Dong Yu, George Dahl, Abdel-rahman Mohamed, Navdeep Jaitly, Andrew Senior, Vincent Vanhoucke, Patrick Nguyen, Tara Sainath, and Brian Kingsbury, "Deep neural networks for acoustic modeling in speech recognition," IEEE Signal Processing Magazine, 2012.
[8] Awni Y.
Hannun, Carl Case, Jared Casper, Bryan Catanzaro, Greg Diamos, Erich Elsen, Ryan Prenger, Sanjeev Satheesh, Shubho Sengupta, Adam Coates, and Andrew Y. Ng, "Deep Speech: Scaling up end-to-end speech recognition," 2014, arXiv:1412.5567.
[9] Rich Caruana, Yin Lou, Johannes Gehrke, Paul Koch, Marc Sturm, and Noemie Elhadad, "Intelligible models for healthcare: Predicting pneumonia risk and hospital 30-day readmission," in Proc. 21st ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, New York, NY, USA, 2015, KDD '15, pp. 1721–1730, ACM.
[10] Babak Alipanahi, Andrew Delong, Matthew T. Weirauch, and Brendan J. Frey, "Predicting the sequence specificities of DNA- and RNA-binding proteins by deep learning," Nature Biotechnology, vol. 33, no. 8, pp. 831–838, July 2015.
[11] Dan Hendrycks and Kevin Gimpel, "A Baseline for Detecting Misclassified and Out-of-Distribution Examples in Neural Networks," 2016, arXiv:1610.02136.
[12] Yarin Gal and Zoubin Ghahramani, "Dropout as a Bayesian Approximation: Representing Model Uncertainty in Deep Learning," in Proc. 33rd International Conference on Machine Learning (ICML-16), 2016.
[13] B. Lakshminarayanan, A. Pritzel, and C. Blundell, "Simple and Scalable Predictive Uncertainty Estimation using Deep Ensembles," in Proc. Conference on Neural Information Processing Systems (NIPS), 2017.
[14] Nicholas Carlini and David A. Wagner, "Adversarial examples are not easily detected: Bypassing ten detection methods," CoRR, 2017.
[15] L. Smith and Y. Gal, "Understanding Measures of Uncertainty for Adversarial Example Detection," in UAI, 2018.
[16] Andrey Malinin and Mark Gales, "Predictive uncertainty estimation via prior networks," in Advances in Neural Information Processing Systems, 2018, pp.
7047–7058.
[17] Christian Szegedy, Alexander Toshev, and Dumitru Erhan, "Deep neural networks for object detection," in Advances in Neural Information Processing Systems, 2013.
[18] Ian Goodfellow, Jonathon Shlens, and Christian Szegedy, "Explaining and harnessing adversarial examples," in International Conference on Learning Representations, 2015.
[19] Alexey Kurakin, Ian J. Goodfellow, and Samy Bengio, "Adversarial examples in the physical world," 2016, arXiv:1607.02533.
[20] Yinpeng Dong, Fangzhou Liao, Tianyu Pang, Hang Su, Jun Zhu, Xiaolin Hu, and Jianguo Li, "Boosting adversarial attacks with momentum," in The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.
[21] Nicholas Carlini and David A. Wagner, "Towards evaluating the robustness of neural networks," CoRR, 2016.
[22] Nicolas Papernot, Patrick D. McDaniel, Ian J. Goodfellow, Somesh Jha, Z. Berkay Celik, and Ananthram Swami, "Practical black-box attacks against deep learning systems using adversarial examples," CoRR, vol. abs/1602.02697, 2016.
[23] Nicolas Papernot, Patrick D. McDaniel, Somesh Jha, Matt Fredrikson, Z. Berkay Celik, and Ananthram Swami, "The limitations of deep learning in adversarial settings," in IEEE European Symposium on Security and Privacy, EuroS&P 2016, Saarbrücken, Germany, March 21-24, 2016, pp. 372–387.
[24] Yanpei Liu, Xinyun Chen, Chang Liu, and Dawn Song, "Delving into transferable adversarial examples and black-box attacks," CoRR, vol. abs/1611.02770, 2016.
[25] Aleksander Madry, Aleksandar Makelov, Ludwig Schmidt, Dimitris Tsipras, and Adrian Vladu, "Towards deep learning models resistant to adversarial attacks," arXiv preprint arXiv:1706.06083, 2017.
[26] Nicolas Papernot, Patrick D.
McDaniel, Xi Wu, Somesh Jha, and Ananthram Swami, "Distillation as a defense to adversarial perturbations against deep neural networks," in IEEE Symposium on Security and Privacy, SP 2016, San Jose, CA, USA, May 22-26, 2016, pp. 582–597.
[27] Yarin Gal, Uncertainty in Deep Learning, Ph.D. thesis, University of Cambridge, 2016.
[28] Stefan Depeweg, José Miguel Hernández-Lobato, Finale Doshi-Velez, and Steffen Udluft, "Decomposition of uncertainty for active learning and reliable reinforcement learning in stochastic systems," stat, vol. 1050, pp. 11, 2017.
[29] Anonymous, "Ensemble distribution distillation," in Submitted to International Conference on Learning Representations, 2020, under review.
[30] Alex Krizhevsky, "Learning multiple layers of features from tiny images," 2009.
[31] Stanford CS231N, "Tiny ImageNet," https://tiny-imagenet.herokuapp.com/, 2017.
[32] Nicholas Carlini, Anish Athalye, Nicolas Papernot, Wieland Brendel, Jonas Rauber, Dimitris Tsipras, Ian Goodfellow, and Aleksander Madry, "On evaluating adversarial robustness," arXiv preprint arXiv:1902.06705, 2019.
[33] Pin-Yu Chen, Yash Sharma, Huan Zhang, Jinfeng Yi, and Cho-Jui Hsieh, "EAD: Elastic-net attacks to deep neural networks via adversarial examples," in Thirty-Second AAAI Conference on Artificial Intelligence, 2018.
[34] Murat Sensoy, Lance Kaplan, and Melih Kandemir, "Evidential deep learning to quantify classification uncertainty," in Advances in Neural Information Processing Systems 31, S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett, Eds., pp. 3179–3189, Curran Associates, Inc., 2018.
[35] Martín Abadi et al., "TensorFlow: Large-Scale Machine Learning on Heterogeneous Systems," 2015, software available from tensorflow.org.
[36] Diederik P.
Kingma and Jimmy Ba, "Adam: A Method for Stochastic Optimization," in Proc. 3rd International Conference on Learning Representations (ICLR), 2015.
[37] Zhitao Gong, Wenlu Wang, and Wei-Shinn Ku, "Adversarial and clean data are not twins," CoRR, vol. abs/1704.04960, 2017.
[38] Kathrin Grosse, Praveen Manoharan, Nicolas Papernot, Michael Backes, and Patrick D. McDaniel, "On the (statistical) detection of adversarial examples," CoRR, vol. abs/1702.06280, 2017.
[39] Jan Hendrik Metzen, Tim Genewein, Volker Fischer, and Bastian Bischoff, "On detecting adversarial perturbations," in Proceedings of 5th International Conference on Learning Representations (ICLR), 2017.