{"title": "Modeling Uncertainty by Learning a Hierarchy of Deep Neural Connections", "book": "Advances in Neural Information Processing Systems", "page_first": 4244, "page_last": 4254, "abstract": "Modeling uncertainty in deep neural networks, despite recent important advances, is still an open problem. Bayesian neural networks are a powerful solution, where the prior over network weights is a design choice, often a normal distribution or other distribution encouraging sparsity. However, this prior is agnostic to the generative process of the input data, which might lead to unwarranted generalization for out-of-distribution tested data. We suggest the presence of a confounder for the relation between the input data and the discriminative function given the target label.\nWe propose an approach for modeling this confounder by sharing neural connectivity patterns between the generative and discriminative networks. This approach leads to a new deep architecture, where networks are sampled from the posterior of local causal structures, and coupled into a compact hierarchy. We demonstrate that sampling networks from this hierarchy, proportionally to their posterior, is efficient and enables estimating various types of uncertainties. Empirical evaluations of our method demonstrate significant improvement compared to state-of-the-art calibration and out-of-distribution detection methods.", "full_text": "Modeling Uncertainty by Learning a Hierarchy of\n\nDeep Neural Connections\n\nRaanan Y. Rohekar\n\nIntel AI Lab\n\nraanan.yehezkel@intel.com\n\nShami Nisimov\n\nIntel AI Lab\n\nshami.nisimov@intel.com\n\nYaniv Gurwicz\n\nIntel AI Lab\n\nyaniv.gurwicz@intel.com\n\nGal Novik\nIntel AI Lab\n\ngal.novik@intel.com\n\nAbstract\n\nModeling uncertainty in deep neural networks, despite recent important advances, is\nstill an open problem. 
Bayesian neural networks are a powerful solution, where the\nprior over network weights is a design choice, often a normal distribution or other\ndistribution encouraging sparsity. However, this prior is agnostic to the generative\nprocess of the input data, which might lead to unwarranted generalization for out-\nof-distribution tested data. We suggest the presence of a confounder for the relation\nbetween the input data and the discriminative function given the target label. We\npropose an approach for modeling this confounder by sharing neural connectivity\npatterns between the generative and discriminative networks. This approach leads\nto a new deep architecture, where networks are sampled from the posterior of\nlocal causal structures, and coupled into a compact hierarchy. We demonstrate that\nsampling networks from this hierarchy, proportionally to their posterior, is ef\ufb01cient\nand enables estimating various types of uncertainties. Empirical evaluations of\nour method demonstrate signi\ufb01cant improvement compared to state-of-the-art\ncalibration and out-of-distribution detection methods.\n\n1\n\nIntroduction\n\nDeep neural networks have become an important tool in applied machine learning, achieving state-of-\nthe-art regression and classi\ufb01cation accuracy in many domains. However, quantifying and measuring\nuncertainty in these discriminative models, despite recent important advances, is still an open problem.\nRepresentation of uncertainty is crucial for many domains, such as safety-critical applications,\npersonalized medicine, and recommendation systems [6]. Common deep neural networks are not\ndesigned to capture model uncertainty, hence estimating it implicitly from the prediction is often\ninaccurate. Several types of uncertainties are commonly discussed [5, 19], where the two main types\nare: 1) epistemic uncertainty and 2) aleatoric uncertainty. 
Epistemic uncertainty is caused by a lack\nof knowledge, typically in cases where only a small training set exists, or for out-of-distribution\ninputs. Aleatoric uncertainty is caused by noisy data and, contrary to epistemic uncertainty, does\nnot vanish in the large-sample limit. In this work, we focus on two aspects of uncertainty that often\nrequire addressing in practical uses of neural networks: calibration [2] and out-of-distribution\n(OOD) detection.\nCalibration is a notion that describes the relation between the predicted probability of an event and\nthe actual proportion of occurrences of that event. It is generally measured using (strictly) proper\nscoring rules [7], such as the negative log-likelihood (NLL) and the Brier score, both of which are minimized for\ncalibrated models. Guo et al. [8] examined the calibration of recent deep architectures by computing\nthe expected calibration error (ECE), the difference between an approximation of the empirical\nreliability curve and the optimal reliability curve [21]. Miscalibration is often addressed by post-processing the outputs [24, 8], approximating the posterior over the weights of a pre-trained network\n[25], or by learning an ensemble of networks [13, 18, 5, 16].\n\n\u2217All authors contributed equally.\n\n33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.\n\nOOD detection is often addressed by designing a loss function to optimize during parameter learning\nof a given network architecture [19, 3]. Many of these methods are speci\ufb01cally tailored for detecting\nOOD, requiring some information about OOD samples, which is often impractical to obtain in\nreal-world cases. In addition, these methods are often not capable of modeling different types of\nuncertainty. 
Ensemble methods, on the other hand, learn multiple sets of network parameters for\na given structure [16], or approximate the posterior distribution over the weights, from which they\nsample at test time [1, 18, 5].\nIn this paper, we make the distinction between structure-based methods, which include ensemble\nmethods that replicate the structure or sample subsets from it, and parameter-based methods, which\nspecify a loss function to be used for a given structure. Ensemble methods, in general, do not specify\nthe loss function to be used for parameter learning, nor do they restrict post-processing of their output.\nIt is interesting to note that while the majority of ensemble methods use distinct sets of parameters\nfor each network [16], in the MC-dropout method [5] the parameters are shared across multiple\nnetworks. Common to all these methods is that they use a single network architecture (structure), as\nit is generally unclear how to fuse outputs from different structures.\nWe propose a method that samples network structures, where parts of one sampled structure may be\nsimilar to parts of another sampled structure but with different weight values. In addition, these\nstructures may share some parts, along with their weights, with other structures (weight sharing),\nspeci\ufb01cally in the deeper layers. All these properties are learned from the input data.\n\n2 Background\n\nWe focus on two approaches that are commonly used for modeling uncertainty: 1) Bayesian neural\nnetworks [5, 1], and 2) ensembles [16, 18]. Both approaches employ multiple networks to model\nuncertainty, where the main difference is whether parameters are shared across networks\nduring training and inference.\nIn Bayesian neural networks the weights, \u03c6, are treated as random variables, and the posterior\ndistribution p(\u03c6|x, y) is learned from the training data. 
Then, the probability of label y\u2217 for a test\nsample x\u2217 is evaluated by\n\np(y\u2217|x\u2217, x, y) = \u222b p(y\u2217|x\u2217, \u03c6) p(\u03c6|x, y) d\u03c6.   (1)\n\nHowever, since learning the posterior over \u03c6 and estimating Equation 1 are usually intractable, variational methods are often used, where an approximating variational distribution q(\u03c6) is de\ufb01ned, and\nthe KL-divergence between the true posterior and the variational distribution, KL(q(\u03c6) || p(\u03c6|X, Y )),\nis minimized.\nA common practice is to set a prior p(\u03c6) and use Bayes' rule,\n\np(\u03c6|X, Y ) = p(Y |X, \u03c6) p(\u03c6) / p(Y |X).   (2)\n\nThus, minimizing the KL-divergence is equivalent to maximizing a variational lower bound,\n\nL_VI = \u222b q(\u03c6) log[p(Y |X, \u03c6)] d\u03c6 \u2212 KL(q(\u03c6) || p(\u03c6)).   (3)\n\nGal & Ghahramani [5] showed that the dropout objective, when applied before every layer, maximizes\nEquation 3. However, the prior p(\u03c6) is agnostic to the unlabeled data distribution p(x), which may\nlead to unwarranted generalization for out-of-distribution test samples. As a remedy, in this work we\npropose to condition the parameters of the discriminative model, \u03c6, on the unlabeled training data,\nX. 
That is, we replace the prior p(\u03c6) in Equation 2 with p(\u03c6|x), thereby letting the unlabeled training\ndata guide the posterior distribution rather than relying on a \ufb01xed prior assumption.\n\n3 A Hierarchy of Deep Neural Networks\n\nWe \ufb01rst describe the key idea, then introduce a new neural architecture, and \ufb01nally describe a\nstochastic inference algorithm for estimating different types of uncertainties.\n\n3.1 Key Idea\nWe approximate the prediction in Equation 1 by sampling from the posterior, \u03c6i \u223c p(\u03c6|x, y),\n\nP (y\u2217|x\u2217, x, y) \u2248 (1/m) \u2211_{i=1}^{m} P (y\u2217|x\u2217, \u03c6i).   (4)\n\nSince sampling from the posterior is intractable for deep neural networks, we follow a Bayesian\napproach and propose a prior distribution for the parameters. However, in contrast to the common\npractice of assuming a Gaussian distribution (or some other prior independent of the data), our prior\ndepends on the unlabeled training data, p(\u03c6|x). We de\ufb01ne p(\u03c6|x) by \ufb01rst considering a generative\nmodel for x, with parameters \u03b8. Next, we assume a dependency relation between \u03b8 and \u03c6, such that\nthe joint distribution factorizes as\n\np(X, Y, \u03c6, \u03b8) = p(Y |X, \u03c6) p(X|\u03b8) p(\u03c6|\u03b8) p(\u03b8).   (5)\n\nIn essence, we assume a generative process of X that confounds the relation X \u2192 \u03c6 conditioned on\nY . 
That is, in contrast to the common practice where a \u201cv-structure\u201d, X \u2192 Y \u2190 \u03c6, is assumed, we\nassume a generative function, parametrized by \u03b8, to be a parent of X and \u03c6, as illustrated in Figure 1.\nGiven a training set {x, y}, the posterior is\n\np(\u03c6|x, y) \u221d \u222b p(y|x, \u03c6) p(\u03c6|\u03b8) p(\u03b8|x) d\u03b8.   (6)\n\nFigure 1: A causal diagram describing our assumptions.\n\nThe prior p(\u03c6|\u03b8) is expected to diminish unwarranted generalization for out-of-distribution samples, i.e.,\nsamples with low p(X|\u03b8). As these samples are absent or scarce in the training set, they have\nno or negligible in\ufb02uence on \u03b8. Hence, for in-distribution samples, Xin \u223c p(x), p(\u03c6|Xin, \u03b8) = p(\u03c6|\u03b8),\nwhereas for out-of-distribution samples this relation does not hold. During inference we \ufb01rst sample \u03b8 from\np(\u03b8|x), which for an arbitrary out-of-distribution sample is a uniform distribution. Thus, p(\u03c6|\u03b8) is\nexpected to spread probability mass across \u03c6 for out-of-distribution data.\nTwo main questions arise: 1) what generative model should be used, and 2) how can we de\ufb01ne\nthe conditional relation p(\u03c6|\u03b8)? It is desirable to de\ufb01ne a generative model such that the conditional\ndistribution p(\u03c6|\u03b8) can be computed ef\ufb01ciently.\nOur solution includes a new deep model, called BRAINet, which consists of multiple networks\ncoupled into a single hierarchical structure. This structure (inter-layer neural connectivity pattern)\nis learned and scored from unlabeled training data, x, such that multiple generative structures,\n{ \u02dcG1, \u02dcG2, \u02dcG3, . . .}, can be sampled from their posterior p( \u02dcG|x) during training and inference. We\nde\ufb01ne p(\u03b8i|x) \u2261 p( \u02dcGi|x), where for each \u03b8i, the corresponding discriminative network parameters\nare estimated as \u03c6i = arg max p(\u03c6|\u03b8i). 
This can be described as using multiple networks during\ninference and training, similarly to MC-dropout [5] and Deep Ensembles [16]. However, as these\nstructures are sampled from a single network, they share some of their parameters, speci\ufb01cally in the\ndeeper layers. This is different from MC-dropout, where all the parameters are shared across networks,\nand from Deep Ensembles, where none of the parameters are shared.\n\n3.2 BRAINet: A Hierarchy of Deep Networks\nRecently, Rohekar et al. [27] introduced an algorithm, called B2N, for learning the structure, G,\nof discriminative deep neural networks in an unsupervised manner. The B2N algorithm learns an\ninter-layer connectivity pattern, where neurons in a layer may connect to other neurons in any deeper\nlayer, not just to the ones in the next layer. Initially, B2N learns a deep generative graph \u02dcG with latent\nnodes H. This graph is constructed by unfolding the recursive calls in the RAI structure learning\nalgorithm [30]. RAI learns a causal structure\u00b9 B [23], and B2N learns a deep generative graph\nsuch that\n\npB(X) = \u222b p\u02dcG(X, H) dH.   (7)\n\nInterestingly, a discriminative network structure G is proved to mimic [23] a generative structure \u02dcG,\nhaving the exact same structure for (X, H). That is, for any \u03b8\u2032, there exists \u03c6\u2032 which can produce the\nposterior distribution over the latent variables in the generative model, p\u03b8\u2032(H|X) = p\u03c6\u2032(H|X, Y )\u00b2.\nRecently, an extension of the RAI algorithm [30] was proposed, called B-RAI [26]. B-RAI is\na Bayesian approach that learns multiple causal structures, scores them, and couples them into a\nhierarchy. This hierarchy is represented by a tree, called a graph generating tree (GGT). The Bayesian score of\neach structure is encoded ef\ufb01ciently in the GGT, which allows structures {B1, B2, . . .} to be sampled\nfrom the GGT proportionally to their posterior distribution, P (B|X). Based on the principles of the\nB2N algorithm, we propose converting this GGT (generated by B-RAI) into a deep neural network\nhierarchy. We call this hierarchy the B-RAI neural network, abbreviated BRAINet. Then, a neural\nnetwork structure, G, can be sampled from the BRAINet model proportionally to P (B|X), where\nG has the same connectivity as a generative structure \u02dcG, and where the relation in Equation 7 holds.\nThis yields a dependency between the generative P (X) and discriminative P (Y |X) models.\n\n3.2.1 BRAINet Structure Learning\n\nBefore describing the BRAINet structure learning algorithm, we provide de\ufb01nitions of relevant\nconcepts introduced by Yehezkel & Lerner [30].\nDe\ufb01nition 1 (Autonomous set of nodes). In a graph de\ufb01ned over X, a set of nodes X\u2032 \u2286 X is\ncalled autonomous given Xex \u2282 X if the parents\u2019 set Pa(X) of every X \u2208 X\u2032 satis\ufb01es Pa(X) \u2282 X\u2032 \u222a Xex.\nDe\ufb01nition 2 (d-separation resolution). The resolution of a d-separation relation between a pair of\nnon-adjacent nodes in a graph is the size of the smallest condition set that d-separates the two nodes.\nDe\ufb01nition 3 (d-separation resolution of a graph). The d-separation resolution of a graph is the\nhighest d-separation resolution in the graph.\n\nWe present a recursive algorithm, Algorithm 1, for learning the structure of a BRAINet model. Each\nrecursive call receives a causal structure B (a CPDAG), a set of endogenous nodes X, a set of exogenous\nnodes Xex, and a target conditional independence order n. The CPDAG encodes P (X|Xex), providing\nan ef\ufb01cient factorization of this distribution. The d-separation resolution of B is assumed to be n \u2212 1.\nAt the beginning of each recursive call, an exit condition is tested (line 2). 
This condition is satis\ufb01ed\nif conditional independence of order n cannot be tested (a conditional independence order is de\ufb01ned\nto be the size of the condition set). In this case, the maximal depth is reached and an empty graph is\nreturned (a gather layer composed of observed nodes). From this point, the recursive procedure will\ntrace back, adding latent parent layers.\nEach recursive call consists of three stages:\n\na) Increase the d-separation resolution of B to n and decompose the input features, X, into autonomous sets of nodes (lines 7\u20139): one descendant set, XD, and k ancestor sets, {XAi} for i = 1, . . . , k.\nb) Call recursively to learn BRAINet structures for each autonomous set (lines 10\u201312).\nc) Merge the returned BRAINet structures into a single structure (lines 13\u201315).\n\n\u00b9A CPDAG (complete partially directed acyclic graph), a Markov equivalence class encoding causal relations\namong X, is learned from nonexperimental observed data.\n\u00b2As the structures of \u02dcG and G are identical, differing only in edge direction, we assume a one-to-one mapping\nfrom each latent node in \u02dcG to its corresponding latent node in G.\n\nFigure 2: An example of BRAINet structure learning with s = 2. Network inputs, X = {A, . . . , E},\nare depicted on a gray plane. Red arrows indicate recursive calls for learning local causal structures.\nInitially, no prior information exists and a fully-connected graph over X is assumed (a). Then,\n0-order (condition set size of 0) independence is tested between pairs of connected nodes, using\ntwo (s) different bootstrap samples of the training data. This results in two CPDAGs (b, c). These\nCPDAGs are further re\ufb01ned by recursive calls with higher-order independence tests, until no more\nedges can be removed (d-g). Tracing back, a deep NN is constructed by adding NN layers, where\neach CPDAG leads to its corresponding neural connectivity pattern (h-n). 
Each rectangle represents a\nlayer of neurons (denoted Lt_i in Algorithm 1, line 14). A Venn diagram, above a set of rectangles,\nindicates the \ufb01eld-of-view each layer has over X. Finally, a discriminative structure is created (o, p).\n\nAlgorithm 1: BRAINet structure learning\n1  BRAINet_SL(B, X, Xex, n)\n   Input: an initial CPDAG B over endogenous X and exogenous Xex observed variables, and a desired resolution n.\n   Output: L, the deepest layer in a learned structure\n2  if the maximal indegree of X in B is lower than n + 1 then   \u25b7 exit condition\n3      r := Score(X|B)   \u25b7 a Bayesian score (e.g., BDeu)\n4      L := a gather layer for X with score r\n5      return L\n6  for t = 1 : s do\n7      x\u2217 := sample with replacement from training data x   \u25b7 bootstrap sample\n8      B\u2217 := IncSeparation(B, n, x\u2217)   \u25b7 increase d-separation resolution to n\n9      {XD, XA1, . . . , XAk} := FindAutonomous(X|B\u2217)   \u25b7 decompose\n10     for i = 1 : k do\n11         LAi := BRAINet_SL(B\u2217, XAi, Xex, n + 1)   \u25b7 recursively call for ancestors\n12     LD := BRAINet_SL(B\u2217, XD, Xex \u222a {XAi} for i = 1, . . . , k, n + 1)   \u25b7 recursively call for descendant\n13     Create an empty layer container Lt (tagged with index t)\n14     In Lt create k independent layers: Lt_1, . . . , Lt_k\n15     \u2200i \u2208 {1, . . . , k}, connect: LAi \u2192 Lt_i \u2190 LD   \u25b7 connect\n16 return L = {Lt} for t = 1, . . . , s\n\nFigure 3: Stochastic training/inference in a BRAINet model. In a single stochastic training step, a\nsubset of the network is selected. In this example there are four possible sub-network selections: a, b, c,\nand d, encoded in a GGT (e). In the GGT, one of s (here s = 2) branches, having scores r1, . . . , rs,\nis sampled according to Equation 8. As an example, the highlighted branches were sampled, resulting\nin sub-network d. Note that, in every possible stochastic step (a, b, c, d), all the inputs, X = A, . . . , E,\nare selected, and each input is selected only once.\n\nThese stages are executed s times (line 6), resulting in an ensemble of s BRAINet structures. The\ndeepest layers, {L1, . . . , Ls}, of these structures are grouped together (line 16), while maintaining\ntheir index 1, . . . , s in L, and returned. Note that the caller function treats this group as a single\nlayer. A detailed description of Algorithm 1 can be found in Appendix A, and a complexity analysis in\nAppendix B. An example of learning a BRAINet structure with s = 2 (each recursive call returns an\nensemble of two BRAINet models) is given in Figure 2.\n\n3.2.2 BRAINet Training and Inference\n\nThe BRAINet model allows us to sample sub-networks with respect to their relative posterior\nprobability. The score calculation and sub-network selection are performed recursively, from the\nleaves to the root. For each autonomous set, given s sampled sub-networks and their scores, r1, . . . , rs,\nreturned from s recursive calls, one of the s results is sampled. 
We use the Boltzmann distribution,\n\nP (t; r1, . . . , rs) = exp[rt/\u03b3] / \u2211_{t\u2032=1}^{s} exp[rt\u2032/\u03b3],   (8)\n\nwhere \u03b3 is a \u201ctemperature\u201d term. When \u03b3 \u2192 \u221e, results are sampled from a uniform distribution, and\nwhen \u03b3 \u2192 0 the index of the maximal value is selected (arg max). We use \u03b3 = 1 and the Bayesian\nscore, BDeu [10]. Finally, the sampled sub-networks, each corresponding to an autonomous set, are\nmerged. The score of the resulting network is the sum of the scores\u00b3 of all autonomous sets merged\ninto the network. 
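The score-proportional branch selection in Equation 8 is just a softmax with temperature over the s Bayesian scores. A minimal sketch of that sampling rule in Python (the scores below are hypothetical BDeu-style log scores, not from the paper's BNT implementation):

```python
import math
import random

def boltzmann(scores, gamma=1.0):
    """Equation 8: P(t) proportional to exp(r_t / gamma) over s branch scores."""
    # Subtract the max score for numerical stability (BDeu scores are large negative numbers).
    m = max(scores)
    weights = [math.exp((r - m) / gamma) for r in scores]
    z = sum(weights)
    return [w / z for w in weights]

def sample_branch(scores, gamma=1.0, rng=random):
    """Pick one of the s sub-network branches with probability given by Eq. 8."""
    probs = boltzmann(scores, gamma)
    u, acc = rng.random(), 0.0
    for t, p in enumerate(probs):
        acc += p
        if u < acc:
            return t
    return len(probs) - 1

# Toy example with two branches (hypothetical scores): the higher-scoring
# branch is sampled more often; gamma -> 0 recovers arg max, gamma -> inf uniform.
probs = boltzmann([-1520.3, -1523.1], gamma=1.0)
```

At training time the paper samples sub-networks uniformly, while at stochastic inference this score-proportional rule weights each forward pass by the branch posterior.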
When training the parameters, at each step, a single sub-network is sampled using\na uniform distribution and its weights are updated. At inference, however, there are two options.\nIn the \ufb01rst, which we call \u201cstochastic\u201d, we run T forward passes, each time sampling a single\nnetwork with respect to Equation 8. Then, the outputs of the sampled networks are averaged. Figure 3\nillustrates several forward passes sampled from the BRAINet model. This is similar to dropout at\ninference, except that a sub-network is sampled with respect to its posterior. Note that, in BRAINet, in\ncontrast to MC-dropout, weights are not sampled independently, and there is an implicit dependency\nbetween sampling variables. In the second inference option, which we call \u201csimultaneous\u201d, we run\na single forward pass through the BRAINet and recursively perform a weighted average of the s\nactivations for each autonomous set.\n\n3.2.3 BRAINet Uncertainty Estimation\n\nSeveral measures of uncertainty can be computed using the BRAINet model. Firstly, the max-softmax\n[28] and entropy [5] can be computed on the outputs of a single \u201csimultaneous\u201d forward pass or\non the average of outputs from multiple stochastic forward passes. Secondly, using the distinct\noutputs of multiple forward passes, we can compute the expected entropy, E_p(\u03c6|x,y) H[p(y\u2217|x\u2217, \u03c6)],\nor the mutual information, MI(y\u2217, \u03c6|x\u2217, x, y) [29]. In Appendix C-Figure 2, we qualitatively\nshow epistemic uncertainty estimation using MI, calculated from the BRAINet model, for images\ngenerated by a VAE [15] trained on MNIST. In addition, using the outputs of multiple stochastic passes,\nwe can estimate the distribution over the network output. Finally, we demonstrate an interesting\nproperty of our method: it learns a broader prior over \u03c6 as the training set size decreases. 
That\nis, there is a relation between the predictive uncertainty and the number of unique structures (connectivity\npatterns) encoded in a BRAINet model (exempli\ufb01ed in Appendix C-Figure 1).\n\n4 Empirical Evaluation\n\nThe BRAINet structure learning algorithm is implemented using BNT [20] and runs ef\ufb01ciently on a\nstandard desktop CPU. In all experiments, we used MLP (dense) layers, ReLU activations, ADAM\noptimization [14], a \ufb01xed learning rate, and batch normalization [12]. Unless otherwise stated, each\nexperiment was repeated 10 times. Mean and standard deviation on the test set are reported. The BRAINet structure was\nlearned directly for pixels in the ablation study (Section 4.1) and calibration experiments (Section 4.2).\nIn the out-of-distribution detection experiments, higher-level features were extracted using the \ufb01rst\nconvolutional layers of common pre-trained NNs, and the BRAINet structure was learned for these\nfeatures.\nBRAINet parameters were learned by sampling sub-networks uniformly and updating the parameters by SGD. At inference, BRAINet performs Bayesian model averaging (each sampled sub-network\nis weighted) using one of the two strategies described in Section 3.2.2.\n\n4.1 An Ablation Study for Evaluating the Effect of Confounding with a Generative Process\n\nFirst, we conduct an ablation study by gradually reducing the dependence of the discriminative\nparameters \u03c6 on the generative parameters \u03b8, i.e., the strength of the link \u03b8 \u2192 \u03c6 in Figure 1, and\nmeasuring the performance of the resulting model on the MNIST dataset [17]. In the extreme case of\ndisconnecting this link, the BRAINet structure simply becomes a Deep Ensembles model [16]\nwith s independent networks. Figure 4 demonstrates that even for a small dependence between \u03b8 and\n\u03c6, as restricted by BRAINet, a signi\ufb01cant improvement is achieved in performance (high calibration\nand classi\ufb01cation accuracy). 
The X-axis represents the strength of the link \u03b8 \u2192 \u03c6 (see Figure 1), where\nthe values represent the amount of mutual information that is required for a pair of nodes in X to be\nconsidered dependent (line 8, Algorithm 1). When this value is low, all the nodes in X are considered\ndependent and no structure is learned. For a mutual information threshold of 0, a simple ensemble of\ns networks is obtained, where each network is composed of stacked fully-connected layers.\n\n\u00b3Using a decomposable score, such as BDeu, the score of a CPDAG is r = \u2211_{i=1}^{n} r(Xi|Pai).\n\nFigure 4: Results of an ablation study. The effect of conditioning the discriminative function on the\ngenerative process, p(\u03c6|\u03b8), as measured by the test NLL, classi\ufb01cation error, and Brier score. It is\nevident that performance worsens after weakening the dependence of \u03c6 on \u03b8.\n\n4.2 Calibration\n\nBRAINet can be perceived as a compact representation of an ensemble. We demonstrate on MNIST\nthat it achieves higher classi\ufb01cation accuracy and is better calibrated than Deep Ensembles for the\nsame model size (Figure 5). Here, we used the simultaneous inference mode of BRAINet. Next,\nwe evaluate the accuracy and calibration of BRAINet as a function of the number of stochastic forward passes,\nand \ufb01nd it to signi\ufb01cantly outperform Bayes-by-Backprop [1] and MC-dropout (Figure 6). We also\n\ufb01nd that using the BRAINet structure within either the Deep Ensembles or the MC-dropout method\nfurther improves these latter approaches. For that we use common UCI-repository [4] regression\nbenchmarks. Results are reported in Appendix D-Table 1 for Deep Ensembles and Appendix D-Table 2\nfor MC-dropout. Finally, we compare BRAINet to various state-of-the-art methods on large networks.\nIn all benchmarks, BRAINet achieves the lowest expected calibration error [8] (Appendix D-Table 3).\n\nFigure 5: Performance as a function of normalized model size. 
Test NLL, classification error, and Brier score of BRAINet, compared to Deep Ensembles [16]. The x-axis is the model size divided by the size of a single network (240K parameters) in the Deep Ensembles model. For BRAINet, up to model size 5, s = 2, and from model size 7 and above, s = 3. Different BRAINet sizes, for a given s, are obtained by varying the number of neurons in the dense layers.

Figure 6: Performance as a function of the number of forward passes (number of sampled networks). Test NLL, classification error, and Brier score of BRAINet, compared to MC-dropout [5] and Bayes-by-Backprop [1]. We used a BRAINet with s = 2. The model size is 240K parameters for MC-dropout and BRAINet, and double that for Bayes-by-Backprop.

4.3 Out-of-Distribution Detection

Next, we evaluate the performance of detecting OOD samples by applying the BRAINet structure after the feature-extracting layers of common NN topologies. First, we compare it to a baseline network and MC-dropout (we found MC-dropout to significantly outperform Bayes-by-Backprop and Deep Ensembles on this task).
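The OOD scores reported in Table 1 (maximum probability of the averaged prediction, predictive entropy, mutual information, and expected entropy) can all be derived from the class probabilities produced by multiple stochastic forward passes, where the averaged prediction itself is the Bayesian model average described at the start of Section 4. The sketch below follows the standard definitions of these measures [29]; the names are ours, not taken from the paper's code:

```python
import numpy as np

def uncertainty_scores(mc_probs, eps=1e-12):
    """Derive OOD/uncertainty scores from T stochastic forward passes.

    mc_probs: (T, N, C) class probabilities from T sampled sub-networks
    Returns per-sample scores; higher values indicate more uncertainty,
    except max_p, where lower values do.
    """
    # Bayesian model average: mean prediction over sampled sub-networks.
    mean_p = mc_probs.mean(axis=0)                                # (N, C)
    max_p = mean_p.max(axis=1)                                    # confidence
    # Entropy of the averaged prediction: total predictive uncertainty.
    pred_ent = -(mean_p * np.log(mean_p + eps)).sum(axis=1)
    # Mean entropy of the individual passes: aleatoric component.
    exp_ent = -(mc_probs * np.log(mc_probs + eps)).sum(axis=2).mean(axis=0)
    # Mutual information: disagreement between passes (epistemic component).
    mutual_info = pred_ent - exp_ent
    return max_p, pred_ent, exp_ent, mutual_info
```

Mutual information and expected entropy require several passes, which is why Table 1 reports them only for the multi-pass methods (MC-dropout and stochastic BRAINet), not for the single-pass baseline and BRAINet sm.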
In order to evaluate the gain in performance resulting only from the structure, we use the cross-entropy loss for parameter learning. Using other loss functions suited for improving OOD detection [3, 19] may further improve the results (see the next experiment). In this experiment we used a ResNet-20 network [9], pre-trained on CIFAR-10 data. For MC-dropout, we used a structure with 2 fully-connected layers, and for BRAINet, we learned a 2-layer structure with s = 2. Both structures have 16K parameters and replace the last layer of ResNet-20. The SVHN dataset [22] provides the OOD samples. We calculated the area under the ROC and precision-recall curves, treating OOD samples as the positive class (Table 1).

Table 1: OOD detection (SVHN dataset) by replacing the last layer of ResNet-20, pre-trained on CIFAR-10, with a learned structure. Parameters are trained using the cross-entropy loss. Two inference modes of BRAINet are compared: a single simultaneous forward pass (BRAINet sm.) and multiple stochastic forward passes. MC-dropout and BRAINet use 15 forward passes (see also Appendix D-Figure 3).

METHOD        ERR       AUC-ROC                                           AUC-PR
                        MAX.P       ENT.        MI          E.ENT.        MAX.P       ENT.        MI          E.ENT.
BASELINE      7.7       90.38       90.74       —           —             93.94       94.09       —           —
BRAINET SM.   8.0       91.62       92.18       —           —             94.91       94.80       —           —
MC-DROPOUT    8.2±0.1   90.73±0.78  90.26±0.81  84.74±1.07  90.61±0.93    94.31±0.33  93.71±0.34  88.89±0.45  94.46±0.46
BRAINET       7.5±0.1   92.13±0.15  92.61±0.03  91.87±0.13  92.98±0.10    95.37±0.13  95.36±0.07  94.6±0.25   95.64±0.05

Lastly, we demonstrate that training the parameters of BRAINet using a loss function specifically designed for OOD detection [3] achieves a significant improvement over a common baseline [11] and improves state-of-the-art results (Table 2).

Table 2: OOD detection by training BRAINet parameters using a loss function designed for OOD detection. A comparison between a baseline [11], confidence-based thresholding [3], and BRAINet with the same confidence measure. Architecture: VGG-13; in-distribution: CIFAR-10; OOD: TinyImageNet. BRAINet replaces the last layer. FPR @TPR=95%: false positive rate at a true positive rate of 95%. Detection error: minimum classification error over all possible thresholds. AUC-ROC and AUC-PR: area under the ROC and precision-recall curves. “in”/“out” indicate that in/out-of-distribution data is the positive class. An arrow indicates whether lower (↓) or higher (↑) is better.

MEASURE                    BASELINE [11]   CONFIDENCE [3]   BRAINET
CLASSIFICATION ERROR (↓)   5.28            5.63             5.65
FPR @TPR=95% (↓)           0.438           0.195            0.124
DETECTION ERROR (↓)        0.120           0.092            0.076
AUC-ROC (↑)                0.935           0.970            0.980
AUC-PR (IN) (↑)            0.946           0.974            0.982
AUC-PR (OUT) (↑)           0.917           0.965            0.979

5 Conclusions

We proposed a method for confounding the training process in deep neural networks, where the discriminative network is conditioned on the generative process of the input. This led to a new architecture, BRAINet: a hierarchy of deep neural connections.
From this hierarchy, local sub-networks can be sampled proportionally to the posterior of local causal structures of the input. Using an ablation study, we found that even a weak relation between the generative and discriminative functions results in a significant gain in calibration and accuracy. In addition, we found that the number of neural connectivity patterns in BRAINet is adjusted automatically according to the uncertainty in the input training data. We demonstrated that this enables estimating different types of uncertainties better than common and state-of-the-art methods, as well as achieving higher accuracy on both small and large datasets. We conjecture that the resulting model can also be effective at detecting adversarial attacks, and we plan to explore this in future work.

References

[1] Blundell, C., Cornebise, J., Kavukcuoglu, K., and Wierstra, D. Weight uncertainty in neural network. In International Conference on Machine Learning (ICML), pp. 1613–1622, 2015.

[2] Dawid, A. P. The well-calibrated Bayesian. Journal of the American Statistical Association, 77(379):605–610, 1982.

[3] DeVries, T. and Taylor, G. W. Learning confidence for out-of-distribution detection in neural networks. arXiv preprint arXiv:1802.04865, 2018.

[4] Dua, D. and Graff, C. UCI machine learning repository, 2017. URL http://archive.ics.uci.edu/ml.

[5] Gal, Y. and Ghahramani, Z. Dropout as a Bayesian approximation: Representing model uncertainty in deep learning. In International Conference on Machine Learning (ICML), pp. 1050–1059, 2016.

[6] Ghahramani, Z. Probabilistic machine learning and artificial intelligence. Nature, 521(7553):452, 2015.

[7] Gneiting, T. and Raftery, A. E. Strictly proper scoring rules, prediction, and estimation. Journal of the American Statistical Association, 102(477):359–378, 2007.

[8] Guo, C., Pleiss, G., Sun, Y., and Weinberger, K. Q.
On calibration of modern neural networks. In International Conference on Machine Learning (ICML), pp. 1321–1330, 2017.

[9] He, K., Zhang, X., Ren, S., and Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 770–778, 2016.

[10] Heckerman, D., Geiger, D., and Chickering, D. M. Learning Bayesian networks: The combination of knowledge and statistical data. Machine Learning, 20(3):197–243, 1995.

[11] Hendrycks, D. and Gimpel, K. A baseline for detecting misclassified and out-of-distribution examples in neural networks. In International Conference on Learning Representations (ICLR), 2017.

[12] Ioffe, S. and Szegedy, C. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In International Conference on Machine Learning (ICML), pp. 448–456, 2015.

[13] Izmailov, P., Podoprikhin, D., Garipov, T., Vetrov, D., and Wilson, A. G. Averaging weights leads to wider optima and better generalization. arXiv preprint arXiv:1803.05407, 2018.

[14] Kingma, D. and Ba, J. Adam: A method for stochastic optimization. In International Conference on Learning Representations (ICLR), 2015.

[15] Kingma, D. P. and Welling, M. Auto-encoding variational Bayes. In International Conference on Learning Representations (ICLR), 2014.

[16] Lakshminarayanan, B., Pritzel, A., and Blundell, C. Simple and scalable predictive uncertainty estimation using deep ensembles. In Advances in Neural Information Processing Systems (NIPS), pp. 6402–6413, 2017.

[17] LeCun, Y., Bottou, L., Bengio, Y., and Haffner, P. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.

[18] Maddox, W., Garipov, T., Izmailov, P., Vetrov, D., and Wilson, A. G. A simple baseline for Bayesian uncertainty in deep learning.
arXiv preprint arXiv:1902.02476, 2019.

[19] Malinin, A. and Gales, M. Predictive uncertainty estimation via prior networks. In Advances in Neural Information Processing Systems (NeurIPS), 2018.

[20] Murphy, K. The Bayes net toolbox for Matlab. Computing Science and Statistics, 33:331–350, 2001.

[21] Naeini, M. P., Cooper, G., and Hauskrecht, M. Obtaining well calibrated probabilities using Bayesian binning. In Twenty-Ninth AAAI Conference on Artificial Intelligence, 2015.

[22] Netzer, Y., Wang, T., Coates, A., Bissacco, A., Wu, B., and Ng, A. Y. Reading digits in natural images with unsupervised feature learning. In NIPS Workshop on Deep Learning and Unsupervised Feature Learning, 2011.

[23] Pearl, J. Causality: Models, Reasoning, and Inference. Cambridge University Press, second edition, 2009.

[24] Platt, J. et al. Probabilistic outputs for support vector machines and comparisons to regularized likelihood methods. Advances in Large Margin Classifiers, 10(3):61–74, 1999.

[25] Ritter, H., Botev, A., and Barber, D. A scalable Laplace approximation for neural networks. In International Conference on Learning Representations (ICLR), 2018.

[26] Rohekar, R. Y., Gurwicz, Y., Nisimov, S., Koren, G., and Novik, G. Bayesian structure learning by recursive bootstrap. In Advances in Neural Information Processing Systems (NeurIPS), 2018.

[27] Rohekar, R. Y., Nisimov, S., Gurwicz, Y., Koren, G., and Novik, G. Constructing deep neural networks by Bayesian network structure learning. In Advances in Neural Information Processing Systems (NeurIPS), 2018.

[28] Liang, S., Li, Y., and Srikant, R. Enhancing the reliability of out-of-distribution image detection in neural networks. In International Conference on Learning Representations (ICLR), 2018.

[29] Smith, L. and Gal, Y. Understanding measures of uncertainty for adversarial example detection.
In Uncertainty in Artificial Intelligence (UAI), 2018.

[30] Yehezkel, R. and Lerner, B. Bayesian network structure learning by recursive autonomy identification. Journal of Machine Learning Research (JMLR), 10(Jul):1527–1570, 2009.