{"title": "Predictive Uncertainty Estimation via Prior Networks", "book": "Advances in Neural Information Processing Systems", "page_first": 7047, "page_last": 7058, "abstract": "Estimating how uncertain an AI system is in its predictions is important to improve the safety of such systems. Uncertainty in predictive can result from uncertainty in model parameters, irreducible \\emph{data uncertainty} and uncertainty due to distributional mismatch between the test and training data distributions. Different actions might be taken depending on the source of the uncertainty so it is important to be able to distinguish between them. Recently, baseline tasks and metrics have been defined and several practical methods to estimate uncertainty developed. These methods, however, attempt to model uncertainty due to distributional mismatch either implicitly through \\emph{model uncertainty} or as \\emph{data uncertainty}. This work proposes a new framework for modeling predictive uncertainty called Prior Networks (PNs) which explicitly models \\emph{distributional uncertainty}. PNs do this by parameterizing a prior distribution over predictive distributions. This work focuses on uncertainty for classification and evaluates PNs on the tasks of identifying out-of-distribution (OOD) samples and detecting misclassification on the MNIST and CIFAR-10 datasets, where they are found to outperform previous methods. Experiments on synthetic and MNIST and CIFAR-10 data show that unlike previous non-Bayesian methods PNs are able to distinguish between data and distributional uncertainty.", "full_text": "Predictive Uncertainty Estimation via Prior Networks\n\nAndrey Malinin\n\nDepartment of Engineering\nUniversity of Cambridge\nam969@cam.ac.uk\n\nMark Gales\n\nDepartment of Engineering\nUniversity of Cambridge\nmjfg@eng.cam.ac.uk\n\nAbstract\n\nEstimating how uncertain an AI system is in its predictions is important to improve\nthe safety of such systems. 
Uncertainty in predictions can result from uncertainty in model parameters, irreducible data uncertainty and uncertainty due to distributional mismatch between the test and training data distributions. Different actions might be taken depending on the source of the uncertainty, so it is important to be able to distinguish between them. Recently, baseline tasks and metrics have been defined and several practical methods to estimate uncertainty developed. These methods, however, attempt to model uncertainty due to distributional mismatch either implicitly through model uncertainty or as data uncertainty. This work proposes a new framework for modeling predictive uncertainty called Prior Networks (PNs) which explicitly models distributional uncertainty. PNs do this by parameterizing a prior distribution over predictive distributions. This work focuses on uncertainty for classification and evaluates PNs on the tasks of identifying out-of-distribution (OOD) samples and detecting misclassification on the MNIST and CIFAR-10 datasets, where they are found to outperform previous methods. Experiments on synthetic data and on MNIST and CIFAR-10 show that, unlike previous non-Bayesian methods, PNs are able to distinguish between data and distributional uncertainty.\n\n1 Introduction\n\nNeural Networks (NNs) have become the dominant approach to addressing computer vision (CV) [1, 2, 3], natural language processing (NLP) [4, 5, 6], speech recognition (ASR) [7, 8] and bio-informatics (BI) [9, 10] tasks. Despite impressive, and ever improving, supervised learning performance, NNs tend to make over-confident predictions [11] and until recently have been unable to provide measures of uncertainty in their predictions. Estimating uncertainty in a model's predictions is important, as it enables, for example, the safety of an AI system [12] to be increased by acting on the model's prediction in an informed manner. 
This is crucial in applications where the cost of an error is high, such as in autonomous vehicle control and the medical, financial and legal fields.\n\n32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montréal, Canada.\n\nRecently, notable progress has been made on predictive uncertainty for Deep Learning through the definition of baselines, tasks and metrics [13] and the development of practical methods for estimating uncertainty. One class of approaches stems from Bayesian Neural Networks [14, 15, 16, 17]. Traditionally, these approaches have been computationally more demanding and conceptually more complicated than non-Bayesian NNs. Crucially, their performance depends on the form of approximation made due to computational constraints and on the nature of the prior distribution over parameters. A recent development has been the technique of Monte-Carlo Dropout [18], which estimates predictive uncertainty using an ensemble of multiple stochastic forward passes, computing the mean and spread of the ensemble. This technique has been successfully applied to tasks in computer vision [19, 20]. A number of non-Bayesian ensemble approaches have also been proposed. One approach based on explicitly training an ensemble of DNNs, called Deep Ensembles [11], yields uncertainty estimates competitive with MC dropout. Another class of approaches, developed for both regression [21] and classification [22], involves explicitly training a model in a multi-task fashion to minimize its Kullback-Leibler (KL) divergence to both a sharp in-domain predictive posterior and a flat out-of-domain predictive posterior, where the out-of-domain inputs are sampled either from a synthetic noise distribution or from a different dataset during training. 
These methods are explicitly trained to detect out-of-distribution inputs and have the advantage of being more computationally efficient at test time.\n\nThe primary issue with these approaches is that they conflate different aspects of predictive uncertainty, which results from three separate factors: model uncertainty, data uncertainty and distributional uncertainty. Model uncertainty, or epistemic uncertainty [23], measures the uncertainty in estimating the model parameters given the training data - this measures how well the model is matched to the data. Model uncertainty is reducible¹ as the size of the training data increases. Data uncertainty, or aleatoric uncertainty [23], is irreducible uncertainty which arises from the natural complexity of the data, such as class overlap, label noise, and homoscedastic and heteroscedastic noise. Data uncertainty can be considered a 'known-unknown': the model understands (knows) the data and can confidently state whether a given input is difficult to classify (an unknown). Distributional uncertainty arises due to mismatch between the training and test distributions (also called dataset shift [24]) - a situation which often arises for real-world problems. Distributional uncertainty is an 'unknown-unknown': the model is unfamiliar with the test data and thus cannot confidently make predictions. The approaches discussed above either conflate distributional uncertainty with data uncertainty or implicitly model distributional uncertainty through model uncertainty, as in Bayesian approaches. The ability to separately model the three types of predictive uncertainty is important, as different actions can be taken by the model depending on the source of uncertainty. 
For example, in active learning tasks the detection of distributional uncertainty would indicate the need to collect training data from this distribution. This work addresses the explicit prediction of each of the three types of predictive uncertainty by extending the work done in [21, 22] while taking inspiration from Bayesian approaches.\n\nSummary of Contributions. This work describes the limitations of previous methods of obtaining uncertainty estimates and proposes a new framework for modeling predictive uncertainty, called Prior Networks (PNs), which allows distributional uncertainty to be treated as distinct from both data uncertainty and model uncertainty. This work focuses on the application of PNs to classification tasks. Additionally, this work presents a discussion of a range of uncertainty metrics in the context of each source of uncertainty. Experiments on synthetic and real data show that, unlike previous non-Bayesian methods, PNs are able to distinguish between data uncertainty and distributional uncertainty. Finally, PNs are evaluated² on the tasks of identifying out-of-distribution (OOD) samples and detecting misclassification outlined in [13], where they outperform previous methods on the MNIST and CIFAR-10 datasets.\n\n2 Current Approaches to Uncertainty Estimation\n\nThis section describes current approaches to predictive uncertainty estimation. Consider a distribution p(x, y) over input features x and labels y. For image classification, x corresponds to images and y to object labels. In a Bayesian framework the predictive uncertainty of a classification model P(ω_c|x*, D)³ trained on a finite dataset D = {x_j, y_j}_{j=1}^N ∼ p(x, y) will result from data (aleatoric) uncertainty and model (epistemic) uncertainty. 
A model's estimates of data uncertainty are described by the posterior distribution over class labels given a set of model parameters θ, and model uncertainty is described by the posterior distribution over the parameters given the data (eq. 1):\n\nP(\\omega_c|x^*, \\mathcal{D}) = \\int \\underbrace{P(\\omega_c|x^*, \\theta)}_{\\text{Data}} \\, \\underbrace{p(\\theta|\\mathcal{D})}_{\\text{Model}} \\, d\\theta \\quad (1)\n\nHere, uncertainty in the model parameters induces a distribution over distributions P(ω_c|x*, θ). The expected distribution P(ω_c|x*, D) is obtained by marginalizing out the parameters θ. Unfortunately, obtaining the true posterior p(θ|D) using Bayes' rule is intractable, and it is necessary to use either an explicit or implicit variational approximation q(θ) [25, 26, 27, 28]:\n\np(\\theta|\\mathcal{D}) \\approx q(\\theta) \\quad (2)\n\n¹ Up to identifiability limits. In the limit of infinite data p(θ|D) yields equivalent parameterizations.\n² Code available at https://github.com/KaosEngineer/DirichletPriorNetworks\n³ Using the standard shorthand for P(y = ω_c|x*, D).\n\nFurthermore, the integral in eq. 1 is also intractable for neural networks and is typically approximated via sampling (eq. 3), using approaches like Monte-Carlo dropout [18], Langevin Dynamics [29] or explicit ensembling [11]. Thus,\n\nP(\\omega_c|x^*, \\mathcal{D}) \\approx \\frac{1}{M} \\sum_{i=1}^{M} P(\\omega_c|x^*, \\theta^{(i)}), \\quad \\theta^{(i)} \\sim q(\\theta) \\quad (3)\n\nEach P(ω_c|x*, θ^(i)) in an ensemble {P(ω_c|x*, θ^(i))}_{i=1}^M sampled from q(θ) is a categorical distribution µ⁴ over class labels y conditioned on the input x*, and can be visualized as a point on a simplex. For the same x* this ensemble is a collection of points on the simplex (fig. 1a), which can be seen as samples of categorical distributions from an implicit conditional distribution over a simplex (fig. 
1b) induced via the posterior over model parameters.\n\n(a) Ensemble (b) Distribution\n\nFigure 1: Distributions on a Simplex\n\nBy selecting an appropriate approximate inference scheme and model prior p(θ), Bayesian approaches aim to craft an approximate model posterior q(θ) such that the ensemble {P(ω_c|x*, θ^(i))}_{i=1}^M is consistent in the region of training data and becomes increasingly diverse when the input x* is far from the training data. Thus, these approaches aim to craft an implicit conditional distribution over a simplex (fig. 1b) which is sharp at the corners of the simplex for inputs similar to the training data and flat over the simplex for out-of-distribution inputs. Given an ensemble from such a distribution, the entropy of the expected distribution P(ω_c|x*, D) will indicate uncertainty in predictions. It is not possible, however, to determine from the entropy whether this uncertainty is due to a high degree of data uncertainty or to the input being far from the region of training data. It is necessary to use measures of the spread of the ensemble, such as Mutual Information, to assess uncertainty in predictions due to model uncertainty. This allows the sources of uncertainty to be determined.\n\nIn practice, however, for deep, distributed black-box models with tens of millions of parameters, such as DNNs, it is difficult to select an appropriate model prior and approximate inference scheme that craft a model posterior which induces an implicit distribution with the desired properties. This makes it hard to guarantee the desired properties of the induced distribution for current state-of-the-art Deep Learning approaches. Furthermore, creating an ensemble can be computationally expensive.\n\nAn alternative, non-Bayesian class of approaches derives measures of uncertainty from the predictive posteriors of regression [21] and classification [13, 22, 30] DNNs. 
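The ensemble-based separation of uncertainties described above can be illustrated with a small sketch: the entropy of the expected distribution measures total uncertainty, while the mutual information (entropy of the mean minus mean of the entropies) measures the spread of the ensemble, i.e. model uncertainty. The toy ensembles below are made-up values, not outputs of any model from this paper.

```python
# Sketch: separating total and model uncertainty for an ensemble of categorical
# predictions (e.g. from MC dropout). The toy ensembles are made-up values.
import math

def entropy(p):
    """Shannon entropy of a categorical distribution (natural log)."""
    return -sum(pc * math.log(pc) for pc in p if pc > 0.0)

def ensemble_uncertainties(ensemble):
    """Entropy of the mean (total uncertainty) and MI (model uncertainty)."""
    M, K = len(ensemble), len(ensemble[0])
    mean = [sum(p[c] for p in ensemble) / M for c in range(K)]
    total = entropy(mean)                                  # H[E[P]]
    expected_data = sum(entropy(p) for p in ensemble) / M  # E[H[P]]
    return total, total - expected_data                    # MI = H[E[P]] - E[H[P]]

# A consistent ensemble (in-distribution-like) has near-zero MI; a diverse
# ensemble (members disagree, as far from the training data) has high MI.
consistent = [[0.96, 0.02, 0.02]] * 3
diverse = [[0.98, 0.01, 0.01], [0.01, 0.98, 0.01], [0.01, 0.01, 0.98]]
```

Note that each member of the diverse ensemble is individually confident, so only the spread measure, not the per-member entropy, reveals the disagreement.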
Here, DNNs are explicitly trained [22, 21] to yield high-entropy posterior distributions for out-of-distribution inputs. These approaches are easy to train and inference is computationally cheap. However, a high-entropy posterior over classes could indicate uncertainty in the prediction due to either an in-distribution input in a region of class overlap or an out-of-distribution input far from the training data. Thus, it is not possible to robustly determine the source of uncertainty using these approaches. Further discussion of uncertainty measures can be found in section 4.\n\n3 Prior Networks\n\nHaving described existing approaches, an alternative approach to modeling predictive uncertainty, called Prior Networks, is proposed in this section. As previously described, Bayesian approaches aim to construct an implicit conditional distribution over distributions on a simplex (fig. 1b) with certain desirable attributes by appropriate selection of model prior and approximate inference method. In practice this is a difficult task and an open research problem.\n\n⁴ Where µ is a vector of probabilities: [µ_1, ..., µ_K]^T = [P(y = ω_1), ..., P(y = ω_K)]^T.\n\nThis work proposes to instead explicitly parameterize a distribution over distributions on a simplex, p(µ|x*, θ), using a DNN referred to as a Prior Network, and to train it to behave like the implicit distribution in the Bayesian approach. Specifically, when it is confident in its prediction a Prior Network should yield a sharp distribution centered on one of the corners of the simplex (fig. 2a). For an input in a region with a high degree of noise or class overlap (data uncertainty) a Prior Network should yield a sharp distribution focused on the center of the simplex, which corresponds to being confident in predicting a flat categorical distribution over class labels (known-unknown) (fig. 
2b). Finally, for 'out-of-distribution' inputs the Prior Network should yield a flat distribution over the simplex, indicating large uncertainty in the mapping x ↦ y (unknown-unknown) (fig. 2c).\n\n(a) Confident Prediction (b) High data uncertainty (c) Out-of-distribution\n\nFigure 2: Desired behaviors of a distribution over distributions\n\nIn the Bayesian framework distributional uncertainty, or uncertainty due to mismatch between the distributions of test and training data, is considered a part of model uncertainty. In this work it will be considered a source of uncertainty separate from data uncertainty and model uncertainty. Prior Networks will be explicitly constructed to capture data uncertainty and distributional uncertainty. In Prior Networks data uncertainty is described by the point-estimate categorical distribution µ and distributional uncertainty is described by the distribution over predictive categoricals p(µ|x*, θ). The parameters θ of the Prior Network must encapsulate knowledge both about the in-domain distribution and about the decision boundary which separates the in-domain region from everything else. Construction of a Prior Network is discussed in sections 3.1 and 3.2. Before this it is necessary to discuss its theoretical properties.\n\nConsider modifying eq. 1 by introducing the term p(µ|x*, θ) as follows:\n\nP(\\omega_c|x^*, \\mathcal{D}) = \\int\\!\\!\\int \\underbrace{p(\\omega_c|\\mu)}_{\\text{Data}} \\, \\underbrace{p(\\mu|x^*, \\theta)}_{\\text{Distributional}} \\, \\underbrace{p(\\theta|\\mathcal{D})}_{\\text{Model}} \\, d\\mu \\, d\\theta \\quad (4)\n\nIn this expression data, distributional and model uncertainty are now each modeled by a separate term within an interpretable probabilistic framework. The relationship between uncertainties is made explicit: model uncertainty affects estimates of distributional uncertainty, which in turn affect the estimates of data uncertainty. This is expected, as a large degree of model uncertainty will yield a large variation in p(µ|x*, θ), and large uncertainty in µ will lead to large uncertainty in estimates of data uncertainty. Thus, model uncertainty affects estimates of data and distributional uncertainty, and distributional uncertainty affects estimates of data uncertainty. This forms a hierarchical model - there are now three layers of uncertainty: the posterior over classes, the per-data prior distribution and the global posterior distribution over model parameters. Similar constructions have been previously explored for non-neural Bayesian models, such as Latent Dirichlet Allocation [31]. However, typically additional levels of uncertainty are added in order to increase the flexibility of models, and predictions are obtained by marginalizing or sampling. In this work, however, the additional level of uncertainty is added in order to be able to extract additional measures of uncertainty, depending on how the model is marginalized. For example, consider marginalizing out µ in eq. 4, thus re-obtaining eq. 1:\n\n\\int \\Big[ \\int p(\\omega_c|\\mu) p(\\mu|x^*, \\theta) \\, d\\mu \\Big] p(\\theta|\\mathcal{D}) \\, d\\theta = \\int P(\\omega_c|x^*, \\theta) p(\\theta|\\mathcal{D}) \\, d\\theta \\quad (5)\n\nSince the distribution over µ is lost in the marginalization, it is unknown how sharp or flat it was around the point estimate. If the expected categorical P(ω_c|x*, θ) is flat, it is now unknown whether this is due to high data or distributional uncertainty. In this situation it is necessary to again rely on measures which assess the spread of an MC ensemble, like mutual information (section 4), to establish the source of uncertainty. 
Thus, Prior Networks are consistent with previous approaches to modeling uncertainty, both Bayesian and non-Bayesian - they can be viewed as an 'extra tool in the uncertainty toolbox' which is explicitly crafted to capture the effects of distributional mismatch in a probabilistically interpretable way. Alternatively, consider marginalizing out θ in eq. 4 as follows:\n\n\\int p(\\omega_c|\\mu) \\Big[ \\int p(\\mu|x^*, \\theta) p(\\theta|\\mathcal{D}) \\, d\\theta \\Big] d\\mu = \\int p(\\omega_c|\\mu) p(\\mu|x^*, \\mathcal{D}) \\, d\\mu \\quad (6)\n\nThis yields expected estimates of data and distributional uncertainty given model uncertainty. Eq. 6 can be seen as a modification of eq. 1 where the model is redefined as p(ω_c|µ) and the distribution over model parameters p(µ|x*, D) is now conditional on both the training data D and the test input x*. This explicitly yields the distribution over the simplex which the Bayesian approach implicitly induces. Further discussion of how measures of uncertainty are derived from the marginalizations of equation 4 is presented in section 4.\n\nUnfortunately, like eq. 1, the marginalization in eq. 6 is generally intractable, though it can be approximated via Bayesian MC methods. For simplicity, this work will assume that a point estimate (eq. 7) of the parameters will be sufficient given appropriate regularization and training data size:\n\np(\\theta|\\mathcal{D}) = \\delta(\\theta - \\hat{\\theta}) \\implies p(\\mu|x^*; \\mathcal{D}) \\approx p(\\mu|x^*; \\hat{\\theta}) \\quad (7)\n\n3.1 Dirichlet Prior Networks\n\nA Prior Network for classification parametrizes a distribution over a simplex, such as a Dirichlet (eq. 8), a mixture of Dirichlet distributions or the Logistic-Normal distribution. In this work the Dirichlet distribution is chosen due to its tractable analytic properties. 
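The three behaviours in fig. 2 can be previewed by sampling categorical distributions from Dirichlets with different concentration parameters; a minimal stdlib sketch, where the α values are illustrative choices and not taken from the paper:

```python
# Sketch: the three behaviours of Fig. 2 as settings of the Dirichlet
# concentration alpha. The alpha values are illustrative, not from the paper.
import random

def sample_dirichlet(alpha, rng):
    """Draw one categorical distribution mu ~ Dir(alpha) from Gamma variates."""
    g = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(g)
    return [x / total for x in g]

rng = random.Random(0)
behaviours = {
    "confident":   [90.0, 1.0, 1.0],    # sharp at a corner (fig. 2a)
    "data_unc":    [30.0, 30.0, 30.0],  # sharp at the centre (fig. 2b)
    "out_of_dist": [1.0, 1.0, 1.0],     # flat over the simplex (fig. 2c)
}
samples = {name: [sample_dirichlet(a, rng) for _ in range(5)]
           for name, a in behaviours.items()}
```

Plotting such samples on the 2-simplex reproduces the qualitative pictures in fig. 2: points clustered at a corner, points clustered at the centre, and points spread uniformly.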
A Dirichlet distribution is a prior distribution over categorical distributions, parameterized by its concentration parameters α, where α₀, the sum of all α_c, is called the precision of the Dirichlet distribution. Higher values of α₀ lead to sharper distributions:\n\n\\mathrm{Dir}(\\mu|\\alpha) = \\frac{\\Gamma(\\alpha_0)}{\\prod_{c=1}^{K}\\Gamma(\\alpha_c)} \\prod_{c=1}^{K} \\mu_c^{\\alpha_c - 1}, \\quad \\alpha_c > 0, \\; \\alpha_0 = \\sum_{c=1}^{K} \\alpha_c \\quad (8)\n\nA Prior Network which parametrizes a Dirichlet will be referred to as a Dirichlet Prior Network (DPN). A DPN generates the concentration parameters α of the Dirichlet distribution:\n\np(\\mu|x^*; \\hat{\\theta}) = \\mathrm{Dir}(\\mu|\\alpha), \\quad \\alpha = f(x^*; \\hat{\\theta}) \\quad (9)\n\nThe posterior over class labels is given by the mean of the Dirichlet:\n\nP(\\omega_c|x^*; \\hat{\\theta}) = \\int p(\\omega_c|\\mu) p(\\mu|x^*; \\hat{\\theta}) \\, d\\mu = \\frac{\\alpha_c}{\\alpha_0} \\quad (10)\n\nIf an exponential output function is used for the DPN, where α_c = e^{z_c}, then the expected posterior probability of a label ω_c is given by the output of the softmax (eq. 11):\n\nP(\\omega_c|x^*; \\hat{\\theta}) = \\frac{e^{z_c(x^*)}}{\\sum_{k=1}^{K} e^{z_k(x^*)}} \\quad (11)\n\nThus, standard DNNs for classification with a softmax output function can be viewed as predicting the expected categorical distribution under a Dirichlet prior. The mean, however, is insensitive to arbitrary scaling of α_c. Thus the precision α₀, which controls the sharpness of the Dirichlet, is degenerate under standard cross-entropy training. It is necessary to change the cost function to explicitly train a DPN to yield a sharp or flat prior distribution around the expected categorical depending on the input data.\n\n3.2 Dirichlet Prior Network Training\n\nThere are potentially many ways in which a Prior Network can be trained and it is not the focus of this work to investigate them all. 
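The parameterization of section 3.1 (eqs. 9-11) amounts to exponentiating the network's logits; a minimal sketch of such an output head, where the logit vectors are illustrative stand-ins for f(x*; θ̂):

```python
# Sketch of a DPN output head: logits z -> concentration alpha = exp(z).
# The logit vectors below are illustrative stand-ins for a trained network.
import math

def dpn_head(logits):
    """Return Dirichlet concentration, precision and expected categorical (eq. 10)."""
    alpha = [math.exp(z) for z in logits]
    alpha0 = sum(alpha)                  # precision: sharpness of the Dirichlet
    mean = [a / alpha0 for a in alpha]   # alpha_c / alpha_0 == softmax(z)_c
    return alpha, alpha0, mean

# Shifting all logits up leaves the mean (the softmax) unchanged but raises
# the precision - exactly the degree of freedom that standard cross-entropy
# training leaves degenerate.
_, a0_small, mean_small = dpn_head([2.0, 0.0, 0.0])
_, a0_large, mean_large = dpn_head([4.0, 2.0, 2.0])
```

This makes the degeneracy concrete: the two logit settings are indistinguishable to a softmax classifier, yet parameterize Dirichlets of very different sharpness.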
This work considers one approach to training a DPN, based on the work done in [21, 22]. The DPN is explicitly trained in a multi-task fashion to minimize the KL divergence (eq. 12) between the model and a sharp Dirichlet distribution focused on the appropriate class for in-distribution data, and between the model and a flat Dirichlet distribution for out-of-distribution data. A flat Dirichlet is chosen as the uncertain distribution in accordance with the principle of insufficient reason [32], as all possible categorical distributions are then equiprobable:\n\n\\mathcal{L}(\\theta) = \\mathbb{E}_{p_{in}(x)}\\big[KL[\\mathrm{Dir}(\\mu|\\hat{\\alpha}) \\,\\|\\, p(\\mu|x; \\theta)]\\big] + \\mathbb{E}_{p_{out}(x)}\\big[KL[\\mathrm{Dir}(\\mu|\\tilde{\\alpha}) \\,\\|\\, p(\\mu|x; \\theta)]\\big] \\quad (12)\n\nIn order to train using this loss function, the in-distribution targets α̂ and out-of-distribution targets α̃ must be defined. It is simple to specify a flat Dirichlet distribution by setting all α̃_c = 1. However, directly setting the in-distribution targets α̂_c is not convenient. Instead the concentration parameters α̂_c are re-parametrized into α̂₀, the target precision, and the means µ̂_c = α̂_c/α̂₀. α̂₀ is a hyper-parameter set during training and the means are simply the 1-hot targets used for classification. A further complication is that learning sparse '1-hot' continuous distributions, which are effectively delta functions, is challenging under the defined KL loss, as the error surface becomes poorly suited for optimization. There are two solutions: first, it is possible to smooth the target means (eq. 13), which redistributes a small amount of probability density to the other corners of the Dirichlet; alternatively, teacher-student training [33] can be used to specify non-sparse target means µ̂. The smoothing approach is used in this work. 
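The KL divergence between two Dirichlets has a closed form, so the terms of eq. 12 can be computed analytically. A stdlib-only sketch follows; the digamma series and the target precision value of 100 are implementation choices, not values taken from the paper.

```python
# Sketch of the KL term of eq. 12 between a target Dirichlet and the model's
# Dirichlet, stdlib only. Target construction follows eq. 13; the precision
# target_a0 = 100 is an illustrative hyper-parameter choice.
import math

def digamma(x):
    """Psi function for x > 0 via recurrence plus an asymptotic series."""
    r = 0.0
    while x < 6.0:
        r -= 1.0 / x
        x += 1.0
    inv2 = 1.0 / (x * x)
    return r + math.log(x) - 0.5 / x - inv2 * (1.0 / 12 - inv2 * (1.0 / 120 - inv2 / 252))

def kl_dirichlet(a, b):
    """Closed-form KL[Dir(a) || Dir(b)]."""
    a0, b0 = sum(a), sum(b)
    kl = math.lgamma(a0) - math.lgamma(b0)
    kl += sum(math.lgamma(bc) - math.lgamma(ac) for ac, bc in zip(a, b))
    kl += sum((ac - bc) * (digamma(ac) - digamma(a0)) for ac, bc in zip(a, b))
    return kl

def target_alpha(label, K, target_a0, eps=0.01):
    """Smoothed in-distribution target concentration (eq. 13 times precision)."""
    mu = [1.0 - (K - 1) * eps if c == label else eps for c in range(K)]
    return [target_a0 * m for m in mu]

hat_alpha = target_alpha(label=0, K=3, target_a0=100.0)  # sharp in-domain target
flat_alpha = [1.0, 1.0, 1.0]                             # flat OOD target
loss_term = kl_dirichlet(hat_alpha, [2.0, 2.0, 2.0])     # KL[target || model]
```

The out-of-distribution term is the same function evaluated with `flat_alpha` as the target, so both halves of eq. 12 reuse one closed-form routine.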
Additionally, cross-entropy can be used as an auxiliary loss for in-distribution data:\n\n\\hat{\\mu}_c = \\begin{cases} 1 - (K-1)\\epsilon, & y = \\omega_c \\\\ \\epsilon, & y \\neq \\omega_c \\end{cases} \\quad (13)\n\nThe multi-task training objective (eq. 12) requires samples x̃ from the out-of-domain distribution p_out(x). However, the true out-of-domain distribution is unknown and samples are unavailable. One solution is to synthetically generate points on the boundary of the in-domain region using a generative model [21, 22]. An alternative is to use a different, real dataset as a set of samples from the out-of-domain distribution [22].\n\n4 Uncertainty Measures\n\nThe previous section introduced a new framework for modeling uncertainty. This section explores a range of measures for quantifying uncertainty given a trained DNN, DPN or Bayesian MC ensemble. The discussion is broken down into four classes of measure, depending on how eq. 4 is marginalized. Details of the derivations can be found in Appendix C.\n\nThe first class derives measures of uncertainty from the expected predictive categorical P(ω_c|x*; D), given a full marginalization of eq. 4, which can be approximated either with a point estimate of the parameters θ̂ or with a Bayesian MC ensemble. The first measure is the probability of the predicted class (mode), or max probability (eq. 14), which is a measure of confidence in the prediction used in [13, 22, 30, 23, 11]:\n\n\\mathcal{P} = \\max_c P(\\omega_c|x^*; \\mathcal{D}) \\quad (14)\n\nThe second measure is the entropy (eq. 15) of the predictive distribution [23, 18, 11]. It behaves similarly to max probability, but represents the uncertainty encapsulated in the entire distribution:\n\n\\mathcal{H}[P(y|x^*; \\mathcal{D})] = -\\sum_{c=1}^{K} P(\\omega_c|x^*; \\mathcal{D}) \\ln P(\\omega_c|x^*; \\mathcal{D}) \\quad (15)\n\nMax probability and entropy of the expected distribution can be seen as measures of the total uncertainty in predictions.\n\nThe second class of measures considers marginalizing out µ in eq. 4, yielding eq. 1. 
Mutual Information (MI) [23] between the categorical label y and the parameters of the model θ is a measure of the spread of an ensemble {P(ω_c|x*, θ^(i))}_{i=1}^M [18] which assesses uncertainty in predictions due to model uncertainty. Thus, MI implicitly captures elements of distributional uncertainty. MI can be expressed as the difference between the total uncertainty, captured by the entropy of the expected distribution, and the expected data uncertainty, captured by the expected entropy of each member of the ensemble (eq. 16). This interpretation was given in [34]:\n\n\\underbrace{\\mathcal{I}[y, \\theta|x^*, \\mathcal{D}]}_{\\text{Model Uncertainty}} = \\underbrace{\\mathcal{H}[\\mathbb{E}_{p(\\theta|\\mathcal{D})}[P(y|x^*, \\theta)]]}_{\\text{Total Uncertainty}} - \\underbrace{\\mathbb{E}_{p(\\theta|\\mathcal{D})}[\\mathcal{H}[P(y|x^*, \\theta)]]}_{\\text{Expected Data Uncertainty}} \\quad (16)\n\nThe third class of measures considers marginalizing out θ in eq. 4, yielding eq. 6. The first measure in this class is the mutual information between y and µ (eq. 17), which behaves in exactly the same way as the MI between y and θ, but the spread is now explicitly due to distributional uncertainty rather than model uncertainty:\n\n\\underbrace{\\mathcal{I}[y, \\mu|x^*; \\mathcal{D}]}_{\\text{Distributional Uncertainty}} = \\underbrace{\\mathcal{H}[\\mathbb{E}_{p(\\mu|x^*; \\mathcal{D})}[P(y|\\mu)]]}_{\\text{Total Uncertainty}} - \\underbrace{\\mathbb{E}_{p(\\mu|x^*; \\mathcal{D})}[\\mathcal{H}[P(y|\\mu)]]}_{\\text{Expected Data Uncertainty}} \\quad (17)\n\nAnother measure of uncertainty is the differential entropy (eq. 18) of the DPN. This measure is maximized when all categorical distributions are equiprobable, which occurs when the Dirichlet distribution is flat - in other words, when there is the greatest variety of samples from the Dirichlet prior. 
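For a DPN, the total uncertainty (eq. 15), the distributional MI (eq. 17) and the differential entropy (eq. 18) all have closed forms in the concentration parameters α; a sketch, with digamma implemented via a standard asymptotic series (an implementation choice), and illustrative α values:

```python
# Sketch: closed-form DPN uncertainty measures from concentration parameters.
# The digamma series is an implementation choice; alpha values are illustrative.
import math

def digamma(x):
    """Psi function for x > 0 via recurrence plus an asymptotic series."""
    r = 0.0
    while x < 6.0:
        r -= 1.0 / x
        x += 1.0
    inv2 = 1.0 / (x * x)
    return r + math.log(x) - 0.5 / x - inv2 * (1.0 / 12 - inv2 * (1.0 / 120 - inv2 / 252))

def dirichlet_uncertainties(alpha):
    """Total uncertainty (eq. 15), distributional MI (eq. 17), diff. entropy (eq. 18)."""
    a0 = sum(alpha)
    mean = [a / a0 for a in alpha]
    total = -sum(m * math.log(m) for m in mean)
    # Expected entropy of categoricals drawn from Dir(alpha).
    expected_data = sum(m * (digamma(a0 + 1.0) - digamma(a + 1.0))
                        for a, m in zip(alpha, mean))
    mutual_info = total - expected_data
    diff_entropy = (sum(math.lgamma(a) for a in alpha) - math.lgamma(a0)
                    - sum((a - 1.0) * (digamma(a) - digamma(a0)) for a in alpha))
    return total, mutual_info, diff_entropy

flat = dirichlet_uncertainties([1.0, 1.0, 1.0])      # OOD-like: flat Dirichlet
sharp_centre = dirichlet_uncertainties([100.0] * 3)  # data-uncertainty-like
```

Both settings have the same flat expected categorical (identical total uncertainty), but the flat Dirichlet has much higher differential entropy and MI - which is exactly what separates distributional from data uncertainty.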
Differential entropy is well suited to measuring distributional uncertainty, as it can be low even when the expected categorical under the Dirichlet prior has high entropy, and it also captures elements of data uncertainty:\n\n\\mathcal{H}[p(\\mu|x^*; \\mathcal{D})] = -\\int_{\\mathcal{S}^{K-1}} p(\\mu|x^*; \\mathcal{D}) \\ln p(\\mu|x^*; \\mathcal{D}) \\, d\\mu \\quad (18)\n\nThe final class of measures uses the full eq. 4 and assesses the spread of p(µ|x*; θ) due to model uncertainty via the MI between µ and θ, which can be computed via Bayesian ensemble approaches.\n\n5 Experiments\n\nThe previous sections discussed modeling different aspects of predictive uncertainty and presented several measures for quantifying it. This section compares the proposed and previous methods in two sets of experiments. The first experiment illustrates the advantages of a DPN over other non-Bayesian methods [22, 30] on synthetic data, and the second set of experiments evaluates DPNs on MNIST and CIFAR-10, comparing them to DNNs and to ensembles generated via Monte-Carlo Dropout (MCDP) on the tasks of misclassification detection and out-of-distribution data detection. The experimental setup is described in Appendix A and additional experiments are described in Appendix B.\n\n5.1 Synthetic Experiments\n\nA synthetic experiment was designed to illustrate the limitation of using uncertainty measures derived from P(ω_c|x*; D) [22, 30] to detect out-of-distribution samples. A simple dataset with three Gaussian-distributed classes with equidistant means and tied isotropic variance σ is created. The classes are non-overlapping when σ = 1 (fig. 3a) and overlap when σ = 4 (fig. 3d). The entropy of the true posterior over class labels is plotted in blue in figures 3a and 3d, which show that when the classes are distinct the entropy is high only on the decision boundaries, but when the classes overlap the entropy is high also within the data region. 
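A dataset of this kind is easy to reproduce: three Gaussian classes at equidistant means with a tied isotropic standard deviation. In the sketch below the specific means (vertices of an equilateral triangle at an assumed radius of 4) are illustrative choices, not the paper's exact configuration.

```python
# Sketch: a 3-class dataset of Gaussians with equidistant means and tied
# isotropic variance, as in the synthetic experiment. The means (equilateral
# triangle, radius 4) are an illustrative choice, not the paper's.
import math
import random

def make_dataset(sigma, n_per_class, rng):
    """Return ((x1, x2), label) pairs for three Gaussian classes sharing sigma."""
    radius = 4.0  # assumed distance of each class mean from the origin
    means = [(radius * math.cos(2 * math.pi * k / 3),
              radius * math.sin(2 * math.pi * k / 3)) for k in range(3)]
    data = []
    for label, (mx, my) in enumerate(means):
        for _ in range(n_per_class):
            data.append(((rng.gauss(mx, sigma), rng.gauss(my, sigma)), label))
    return data

separable = make_dataset(sigma=1.0, n_per_class=100, rng=random.Random(0))
overlapping = make_dataset(sigma=4.0, n_per_class=100, rng=random.Random(0))
```

With σ = 1 the clusters are well separated; with σ = 4 their spread exceeds the inter-mean distance, reproducing the class-overlap regime of figs. 3d-3f.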
A small DPN with one hidden layer of 50 neurons is trained on this data. Figures 3b and 3c show that when classes are distinct, both the entropy of the DPN's predictive posterior and the differential entropy of the DPN have identical behaviour - low in the region of data and high elsewhere - allowing in-distribution and out-of-distribution regions to be distinguished. Figures 3e and 3f, however, show that when there is a large degree of class overlap the entropy and differential entropy behave differently: entropy is high both in the region of class overlap and far from the training data, making it difficult to distinguish out-of-distribution samples from in-distribution samples at a decision boundary. In contrast, the differential entropy is low over the whole region of training data and high outside it, allowing the in-distribution region to be clearly distinguished from the out-of-distribution region.\n\n(a) σ = 1 (b) Entropy, σ = 1 (c) Diff. Entropy, σ = 1 (d) σ = 4 (e) Entropy, σ = 4 (f) Diff. Entropy, σ = 4\n\nFigure 3: Synthetic Experiment\n\n5.2 MNIST and CIFAR-10 Experiments\n\nAn in-domain misclassification detection experiment and an out-of-distribution (OOD) input detection experiment were run on the MNIST and CIFAR-10 datasets [35, 36] to assess the DPN's ability to estimate uncertainty. The misclassification detection experiment involves detecting whether a given prediction is incorrect given an uncertainty measure. Misclassifications are chosen as the positive class. The misclassification detection experiment was run on the MNIST valid+test set and the CIFAR-10 test set. The out-of-distribution detection experiment involves detecting whether an input is out-of-distribution given a measure of uncertainty. Out-of-distribution samples are chosen as the positive class. The OMNIGLOT dataset [37], scaled down to 28x28 pixels, was used as real 'OOD' data for MNIST. 
15000 samples of OMNIGLOT data were randomly selected to form a balanced set of positive (OMNIGLOT) and negative (MNIST valid+test) samples. For CIFAR-10 three OOD datasets were considered - SVHN, LSUN and TinyImageNet (TIM) [38, 39, 40]. The two considered baseline approaches derive uncertainty measures from either the class posterior of a DNN [13] or an ensemble generated via MC dropout applied to the same DNN [23, 18]. All uncertainty measures described in section 4 are explored for both tasks in order to see which yields the best performance. Performance is assessed by area under the ROC (AUROC) and Precision-Recall (AUPR) curves in both experiments, as in [13].

Table 1: MNIST and CIFAR-10 misclassification detection

                |           AUROC           |           AUPR            |
Data    Model   | Max.P  Ent.  M.I.  D.Ent. | Max.P  Ent.  M.I.  D.Ent. | % Err.
MNIST   DNN     |  98.0  98.6    -      -   |  26.6  25.0    -      -   |  0.4
MNIST   MCDP    |  97.2  97.2  96.9     -   |  33.0  29.0  25.5     -   |  0.4
MNIST   DPN     |  99.0  98.9  98.6   92.1  |  43.6  39.7  37.6   27.8  |  0.6
CIFAR10 DNN     |  92.4  92.3    -      -   |  48.7  47.1    -      -   |  8.0
CIFAR10 MCDP    |  92.5  92.0  90.4     -   |  48.4  45.5  45.5     -   |  8.0
CIFAR10 DPN     |  92.2  92.1  90.9   92.9  |  52.7  51.0  51.0   30.7  |  8.5

Table 1 shows that the DPN consistently outperforms both a DNN and an MC dropout ensemble (MCDP) in misclassification detection performance, although there is a negligible drop in accuracy of the DPN as compared to a DNN or MCDP. Max probability yields the best results, closely followed by the entropy of the predictive distribution. This is expected, as they are measures of total uncertainty in predictions, while the other measures capture either model or distributional uncertainty. The performance difference is more pronounced on AUPR, which is sensitive to class imbalance. Table 2 shows that a DPN consistently outperforms the baselines in OOD sample detection for both the MNIST and CIFAR-10 datasets.
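The AUROC and AUPR figures reported in Tables 1 and 2 can be computed with standard tools; a minimal sketch using scikit-learn, where the helper `detection_scores` and the toy uncertainty values are ours, not from the paper:

```python
import numpy as np
from sklearn.metrics import roc_auc_score, average_precision_score

def detection_scores(uncertainty_neg, uncertainty_pos):
    """AUROC/AUPR for detection tasks where the positive class (misclassified
    or OOD samples) should receive higher uncertainty scores."""
    scores = np.concatenate([uncertainty_neg, uncertainty_pos])
    labels = np.concatenate([np.zeros(len(uncertainty_neg)),   # negatives: correct / in-distribution
                             np.ones(len(uncertainty_pos))])   # positives: misclassified / OOD
    auroc = roc_auc_score(labels, scores)
    aupr = average_precision_score(labels, scores)  # average precision as an AUPR estimate
    return auroc, aupr

# Toy example: OOD inputs tend to receive higher uncertainty than in-distribution ones.
rng = np.random.default_rng(0)
in_unc = rng.normal(0.5, 0.2, size=1000)    # hypothetical in-distribution uncertainties
out_unc = rng.normal(1.5, 0.2, size=1000)   # hypothetical OOD uncertainties
auroc, aupr = detection_scores(in_unc, out_unc)
```

Here `average_precision_score` is used as a step-wise estimate of the area under the precision-recall curve, a common choice for this metric.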
On MNIST, the DPN is able to perfectly classify all samples using max probability, entropy and differential entropy. On the CIFAR-10 dataset the DPN consistently outperforms the baselines by a large margin. While high performance against SVHN and LSUN is expected, as LSUN, SVHN and CIFAR-10 are quite different, the high performance against TinyImageNet, which is also a dataset of real objects and therefore closer to CIFAR-10, is more impressive. Curiously, MC dropout does not always yield better results than a standard DNN, which supports the assertion that it is difficult to achieve the desired behaviour from a Bayesian distribution over distributions.

Table 2: MNIST and CIFAR-10 out-of-domain detection

                      |           AUROC            |            AUPR            |
ID      OOD   Model   | Max.P  Ent.   M.I.  D.Ent. | Max.P  Ent.   M.I.  D.Ent. |
MNIST   OMNI  DNN     |  98.7  98.3     -      -   |  98.8  98.5     -      -   |
MNIST   OMNI  MCDP    |  99.2  99.0   99.3     -   |  99.2  99.1   99.3     -   |
MNIST   OMNI  DPN     | 100.0 100.0   99.5  100.0  | 100.0 100.0   97.5  100.0  |
CIFAR10 SVHN  DNN     |  90.1  90.8     -      -   |  84.6  85.1     -      -   |
CIFAR10 SVHN  MCDP    |  89.6  90.6   83.7     -   |  84.1  84.8   73.1     -   |
CIFAR10 SVHN  DPN     |  98.1  98.2   98.5   98.2  |  97.7  97.8   98.2   97.8  |
CIFAR10 LSUN  DNN     |  89.8  91.4     -      -   |  87.0  90.0     -      -   |
CIFAR10 LSUN  MCDP    |  89.1  90.9   86.9     -   |  86.5  89.6   86.4     -   |
CIFAR10 LSUN  DPN     |  94.4  94.4   94.6   94.3  |  93.3  93.4   93.3   93.4  |
CIFAR10 TIM   DNN     |  87.5  88.7     -      -   |  84.7  87.2     -      -   |
CIFAR10 TIM   MCDP    |  87.6  89.2   89.3     -   |  85.1  87.9   83.2     -   |
CIFAR10 TIM   DPN     |  94.3  94.3   94.6   94.4  |  94.0  94.0   94.2   94.0  |

The experiments above suggest that there is little benefit in using measures such as differential entropy and mutual information over standard entropy. However, this is because MNIST and CIFAR-10 are low data uncertainty datasets - all classes are distinct.
It is interesting to see whether the differential entropy of the Dirichlet prior is able to distinguish in-domain and out-of-distribution data better than entropy when the classes are less distinct. To this end, zero-mean isotropic Gaussian noise with standard deviation σ = 3 is added to the inputs of the DNN and DPN during both training and evaluation on the MNIST dataset. Table 3 shows that in the presence of strong noise, entropy and MI fail to successfully discriminate between in-domain and out-of-distribution samples, while performance using differential entropy barely falls.

Table 3: MNIST vs OMNIGLOT. Out-of-distribution detection AUROC on noisy data.

        |     Ent.      |     M.I.      |    D.Ent.     |
Model   | σ=0.0  σ=3.0  | σ=0.0  σ=3.0  | σ=0.0  σ=3.0  |
DNN     |  98.8   58.4  |   -      -    |   -      -    |
MCDP    |  98.8   58.4  |  99.3   79.1  |   -      -    |
DPN     | 100.0   51.8  |  99.5   22.3  | 100.0   99.8  |

6 Conclusion

This work describes the limitations of previous work on predictive uncertainty estimation within the context of sources of uncertainty, and proposes to treat out-of-distribution (OOD) inputs as a separate source of uncertainty, called Distributional Uncertainty. To this end, this work presents a novel framework, called Prior Networks (PNs), which allows data, distributional and model uncertainty to be treated separately within a consistent, probabilistically interpretable framework. A particular form of PN is applied to classification: the Dirichlet Prior Network (DPN). DPNs are shown to yield more accurate estimates of distributional uncertainty than MC Dropout and standard DNNs on the task of OOD detection on the MNIST and CIFAR-10 datasets. The DPNs also outperform other methods on the task of misclassification detection. A range of uncertainty measures is presented and analyzed in the context of the types of uncertainty which they assess.
It was noted that measures of total uncertainty, such as the max probability or the entropy of the predictive distribution, yield the best results on misclassification detection. The differential entropy of the DPN was the best measure of uncertainty for OOD detection, especially when classes are less distinct. This was illustrated both on a synthetic experiment and on a noise-corrupted MNIST task. Uncertainty measures can be analytically calculated at test time for DPNs, reducing computational cost relative to ensemble approaches. Having investigated PNs for image classification, it would be interesting to apply them to other tasks in computer vision, NLP, machine translation, speech recognition and reinforcement learning. Finally, it is necessary to explore Prior Networks for regression tasks.

Acknowledgments
This paper reports on research partly supported by Cambridge Assessment, University of Cambridge. This work was also partly funded by a DTA EPSRC award and a Google Research award. We would also like to thank members of the CUED Machine Learning group, especially Dr. Richard Turner, for fruitful discussions.

References
[1] Ross Girshick, "Fast R-CNN," in Proc. 2015 IEEE International Conference on Computer Vision (ICCV), 2015, pp. 1440-1448.

[2] Karen Simonyan and Andrew Zisserman, "Very Deep Convolutional Networks for Large-Scale Image Recognition," in Proc. International Conference on Learning Representations (ICLR), 2015.

[3] Ruben Villegas, Jimei Yang, Yuliang Zou, Sungryull Sohn, Xunyu Lin, and Honglak Lee, "Learning to Generate Long-term Future via Hierarchical Prediction," in Proc. International Conference on Machine Learning (ICML), 2017.

[4] Tomas Mikolov et al., "Linguistic Regularities in Continuous Space Word Representations," in Proc.
NAACL-HLT, 2013.

[5] Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean, "Efficient Estimation of Word Representations in Vector Space," 2013, arXiv:1301.3781.

[6] Tomas Mikolov, Martin Karafiát, Lukás Burget, Jan Cernocký, and Sanjeev Khudanpur, "Recurrent Neural Network Based Language Model," in Proc. INTERSPEECH, 2010.

[7] Geoffrey Hinton, Li Deng, Dong Yu, George Dahl, Abdel-rahman Mohamed, Navdeep Jaitly, Andrew Senior, Vincent Vanhoucke, Patrick Nguyen, Tara Sainath, and Brian Kingsbury, "Deep neural networks for acoustic modeling in speech recognition," Signal Processing Magazine, 2012.

[8] Awni Y. Hannun, Carl Case, Jared Casper, Bryan Catanzaro, Greg Diamos, Erich Elsen, Ryan Prenger, Sanjeev Satheesh, Shubho Sengupta, Adam Coates, and Andrew Y. Ng, "Deep speech: Scaling up end-to-end speech recognition," 2014, arXiv:1412.5567.

[9] Rich Caruana, Yin Lou, Johannes Gehrke, Paul Koch, Marc Sturm, and Noemie Elhadad, "Intelligible models for healthcare: Predicting pneumonia risk and hospital 30-day readmission," in Proc. 21st ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, New York, NY, USA, 2015, KDD '15, pp. 1721-1730, ACM.

[10] Babak Alipanahi, Andrew Delong, Matthew T. Weirauch, and Brendan J. Frey, "Predicting the sequence specificities of DNA- and RNA-binding proteins by deep learning," Nature Biotechnology, vol. 33, no. 8, pp. 831-838, July 2015.

[11] B. Lakshminarayanan, A. Pritzel, and C. Blundell, "Simple and Scalable Predictive Uncertainty Estimation using Deep Ensembles," in Proc. Conference on Neural Information Processing Systems (NIPS), 2017.

[12] Dario Amodei, Chris Olah, Jacob Steinhardt, Paul F.
Christiano, John Schulman, and Dan Mané, "Concrete problems in AI safety," http://arxiv.org/abs/1606.06565, 2016, arXiv:1606.06565.

[13] Dan Hendrycks and Kevin Gimpel, "A Baseline for Detecting Misclassified and Out-of-Distribution Examples in Neural Networks," http://arxiv.org/abs/1610.02136, 2016, arXiv:1610.02136.

[14] David JC MacKay, "A practical Bayesian framework for backpropagation networks," Neural Computation, vol. 4, no. 3, pp. 448-472, 1992.

[15] David JC MacKay, Bayesian methods for adaptive models, Ph.D. thesis, California Institute of Technology, 1992.

[16] Geoffrey E. Hinton and Drew van Camp, "Keeping the neural networks simple by minimizing the description length of the weights," in Proc. Sixth Annual Conference on Computational Learning Theory, New York, NY, USA, 1993, COLT '93, pp. 5-13, ACM.

[17] Radford M. Neal, Bayesian learning for neural networks, Springer Science & Business Media, 1996.

[18] Yarin Gal and Zoubin Ghahramani, "Dropout as a Bayesian Approximation: Representing Model Uncertainty in Deep Learning," in Proc. 33rd International Conference on Machine Learning (ICML-16), 2016.

[19] A. Kendall, Y. Gal, and R. Cipolla, "Multi-Task Learning Using Uncertainty to Weight Losses for Scene Geometry and Semantics," in Proc. Conference on Neural Information Processing Systems (NIPS), 2017.

[20] A. Kendall and Y. Gal, "What Uncertainties Do We Need in Bayesian Deep Learning for Computer Vision," in Proc. Conference on Neural Information Processing Systems (NIPS), 2017.

[21] A. Malinin, A. Ragni, M.J.F. Gales, and K.M. Knill, "Incorporating Uncertainty into Deep Learning for Spoken Language Assessment," in Proc.
55th Annual Meeting of the Association for Computational Linguistics (ACL), 2017.

[22] Kimin Lee, Honglak Lee, Kibok Lee, and Jinwoo Shin, "Training confidence-calibrated classifiers for detecting out-of-distribution samples," in Proc. International Conference on Learning Representations (ICLR), 2018.

[23] Yarin Gal, Uncertainty in Deep Learning, Ph.D. thesis, University of Cambridge, 2016.

[24] Joaquin Quiñonero-Candela, Dataset Shift in Machine Learning, The MIT Press, 2009.

[25] Charles Blundell, Julien Cornebise, Koray Kavukcuoglu, and Daan Wierstra, "Weight Uncertainty in Neural Networks," in Proc. International Conference on Machine Learning (ICML), 2015.

[26] Alex Graves, "Practical variational inference for neural networks," in Advances in Neural Information Processing Systems, 2011, pp. 2348-2356.

[27] Christos Louizos and Max Welling, "Structured and efficient variational deep learning with matrix gaussian posteriors," in Proc. International Conference on Machine Learning (ICML), 2016, pp. 1708-1716.

[28] Diederik P. Kingma, Tim Salimans, and Max Welling, "Variational dropout and the local reparameterization trick," in Advances in Neural Information Processing Systems, 2015, pp. 2575-2583.

[29] Max Welling and Yee Whye Teh, "Bayesian Learning via Stochastic Gradient Langevin Dynamics," in Proc. International Conference on Machine Learning (ICML), 2011.

[30] Shiyu Liang, Yixuan Li, and R. Srikant, "Enhancing the reliability of out-of-distribution image detection in neural networks," in Proc. International Conference on Learning Representations (ICLR), 2018.

[31] David M. Blei, Andrew Y. Ng, and Michael I. Jordan, "Latent Dirichlet Allocation," Journal of Machine Learning Research, vol. 3, pp. 993-1022, Mar. 2003.

[32] Kevin P.
Murphy, Machine Learning: A Probabilistic Perspective, The MIT Press, 2012.

[33] Geoffrey Hinton, Oriol Vinyals, and Jeff Dean, "Distilling the knowledge in a neural network," 2015, arXiv:1503.02531.

[34] Stefan Depeweg, José Miguel Hernández-Lobato, Finale Doshi-Velez, and Steffen Udluft, "Decomposition of uncertainty for active learning and reliable reinforcement learning in stochastic systems," 2017, arXiv:1710.07283.

[35] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, "Gradient-based learning applied to document recognition," Proceedings of the IEEE, vol. 86, pp. 2278-2324, 1998.

[36] Alex Krizhevsky, "Learning multiple layers of features from tiny images," 2009.

[37] Brenden M. Lake, Ruslan Salakhutdinov, and Joshua B. Tenenbaum, "Human-level concept learning through probabilistic program induction," Science, vol. 350, no. 6266, pp. 1332-1338, 2015.

[38] Ian J. Goodfellow, Yaroslav Bulatov, Julian Ibarz, Sacha Arnoud, and Vinay D. Shet, "Multi-digit number recognition from street view imagery using deep convolutional neural networks," 2013, arXiv:1312.6082.

[39] Fisher Yu, Yinda Zhang, Shuran Song, Ari Seff, and Jianxiong Xiao, "LSUN: construction of a large-scale image dataset using deep learning with humans in the loop," 2015, arXiv:1506.03365.

[40] Stanford CS231N, "Tiny ImageNet," https://tiny-imagenet.herokuapp.com/, 2017.

[41] M. Buscema, "Metanet: The theory of independent judges," Substance Use & Misuse, vol. 33, no. 2, pp. 439-461, 1998.

[42] Martín Abadi et al., "TensorFlow: Large-Scale Machine Learning on Heterogeneous Systems," 2015, software available from tensorflow.org.

[43] Timothy Dozat, "Incorporating Nesterov Momentum into Adam," in Proc.
International Conference on Learning Representations (ICLR), 2016.