{"title": "The Functional Neural Process", "book": "Advances in Neural Information Processing Systems", "page_first": 8746, "page_last": 8757, "abstract": "We present a new family of exchangeable stochastic processes, the Functional Neural Processes (FNPs). FNPs model distributions over functions by learning a graph of dependencies on top of latent representations of the points in the given dataset. In doing so, they define a Bayesian model without explicitly positing a prior distribution over latent global parameters; they instead adopt priors over the relational structure of the given dataset, a task that is much simpler. We show how we can learn such models from data, demonstrate that they are scalable to large datasets through mini-batch optimization and describe how we can make predictions for new points via their posterior predictive distribution. We experimentally evaluate FNPs on the tasks of toy regression and image classification and show that, when compared to baselines that employ global latent parameters, they offer both competitive predictions as well as more robust uncertainty estimates.", "full_text": "The Functional Neural Process\n\nChristos Louizos\n\nUniversity of Amsterdam\nTNO Intelligent Imaging\nc.louizos@uva.nl\n\nKlamer Schutte\n\nTNO Intelligent Imaging\n\nklamer.schutter@tno.nl\n\nXiahan Shi\n\nBosch Center for Arti\ufb01cial Intelligence\n\nUvA-Bosch Delta Lab\n\nxiahan.shi@de.bosch.com\n\nMax Welling\n\nUniversity of Amsterdam\n\nQualcomm\n\nm.welling@uva.nl\n\nAbstract\n\nWe present a new family of exchangeable stochastic processes, the Functional\nNeural Processes (FNPs). FNPs model distributions over functions by learning a\ngraph of dependencies on top of latent representations of the points in the given\ndataset. 
In doing so, they define a Bayesian model without explicitly positing a prior distribution over latent global parameters; they instead adopt priors over the relational structure of the given dataset, a task that is much simpler. We show how we can learn such models from data, demonstrate that they are scalable to large datasets through mini-batch optimization and describe how we can make predictions for new points via their posterior predictive distribution. We experimentally evaluate FNPs on the tasks of toy regression and image classification and show that, when compared to baselines that employ global latent parameters, they offer both competitive predictions as well as more robust uncertainty estimates.\n\n1 Introduction\n\nNeural networks are a prevalent paradigm for approximating functions of almost any kind. Their highly flexible parametric form coupled with large amounts of data allows for accurate modelling of the underlying task, a fact that usually leads to state-of-the-art prediction performance. While predictive performance is definitely an important aspect, in a lot of safety-critical applications, such as self-driving cars, we also require accurate uncertainty estimates about the predictions.\n\nBayesian neural networks [33, 37, 15, 5] have been an attempt at imbuing neural networks with the ability to model uncertainty; they posit a prior distribution over the weights of the network and through inference they can represent their uncertainty in the posterior distribution. Nevertheless, for such complex models, the choice of the prior is quite difficult, since understanding the interactions of the parameters with the data is a non-trivial task. As a result, priors are usually employed for computational convenience and tractability. 
Furthermore, inference over the weights of a neural network can be a daunting task due to the high dimensionality and posterior complexity [31, 44].\n\nAn alternative way that can \u201cbypass\u201d the aforementioned issues is that of adopting a stochastic process [25]. Stochastic processes posit distributions over functions, e.g. neural networks, directly, without the necessity of adopting prior distributions over global parameters, such as the neural network weights. Gaussian processes [41] (GPs) are a prime example of a stochastic process; they can encode any inductive bias in the form of a covariance structure among the datapoints in the given dataset, a more intuitive modelling task than positing priors over weights. Furthermore, for vanilla GPs, posterior inference is much simpler. Despite these advantages, they also have two main limitations: 1) the underlying model is not very flexible for high dimensional problems and 2) training and inference are quite costly, since they generally scale cubically with the size of the dataset.\n\n33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.\n\nFigure 1: Venn diagram of the sets used in this work. The blue is the training inputs Dx, the red is the reference set R and the parts enclosed in the dashed and solid lines are M, the training points not in R, and B, the union of the training points and R. The white background corresponds to O, the complement of R.\n\nFigure 2: The Functional Neural Process (FNP) model. We embed the inputs (dots) from a complicated domain X to a simpler domain U where we then sample directed graphs of dependencies among them, G, A. Conditioned on those graphs, we use the parents from the reference set R as well as their labels yR to parameterize a latent variable zi that is used to predict the target yi. Each of the points has a specific number id for clarity.\n\nGiven the aforementioned limitations of GPs, one might seek a more general way to parametrize stochastic processes that can bypass these issues. To this end, we present our main contribution, Functional Neural Processes (FNPs), a family of exchangeable stochastic processes that posit distributions over functions in a way that combines the properties of neural networks and stochastic processes. We show that, in contrast to prior literature such as Neural Processes (NPs) [14], FNPs do not require explicit global latent variables in their construction, but rather operate by building a graph of dependencies among local latent variables, reminiscent more of autoencoder-type latent variable models [24, 42]. We further show that we can exploit the local latent variable structure in a way that allows us to easily encode inductive biases, and illustrate one particular instance of this ability by designing an FNP model that behaves similarly to a GP with an RBF kernel. Furthermore, we demonstrate that FNPs are scalable to large datasets, as they facilitate minibatch gradient optimization of their parameters, and have a posterior predictive distribution that is simple to evaluate and sample. Finally, we evaluate FNPs on toy regression and image classification tasks and show that they can obtain competitive performance and more robust uncertainty estimates. 
We have open sourced an implementation of FNPs for both classification and regression along with example usages at https://github.com/AMLab-Amsterdam/FNP.\n\n2 The Functional Neural Process\n\nIn the following we assume that we are operating in the supervised learning setup, where we are given tuples of points (x, y), with x \u2208 X being the input covariates and y \u2208 Y being the given label. Let D = {(x1, y1), . . . , (xN , yN )} be a sequence of N observed datapoints. We are interested in constructing a stochastic process that can bypass the limitations of GPs and can offer the predictive capabilities of neural networks. There are two necessary conditions that have to be satisfied during the construction of such a model: exchangeability and consistency [25]. An exchangeable distribution over D is a joint probability over these elements that is invariant to permutations of these points, i.e.\n\np(y1:N |x1:N ) = p(y\u03c3(1:N )|x\u03c3(1:N )), (1)\n\nwhere \u03c3(\u00b7) corresponds to the permutation function. Consistency refers to the property that the probability defined on an observed sequence of points {(x1, y1), . . . , (xn, yn)}, pn(\u00b7), is the same as the probability defined on an extended sequence {(x1, y1), . . . , (xn, yn), . . . , (xn+m, yn+m)}, pn+m(\u00b7), when we marginalize over the new points:\n\npn(y1:n|x1:n) = \u222b pn+m(y1:n+m|x1:n+m) dyn+1:n+m. (2)\n\nEnsuring that both of these conditions hold allows us to invoke the Kolmogorov Extension and de Finetti\u2019s theorems [25], hence prove that the model we defined is an exchangeable stochastic process. In this way we can guarantee that there is an underlying Bayesian model with an implied prior over global latent parameters p\u03b8(w) such that we can express the joint distribution in a conditional i.i.d. fashion, i.e. p\u03b8(y1, . . . , yN |x1, . . . 
, xN ) = \u222b p\u03b8(w) \u220f_{i=1}^{N} p(yi|xi, w) dw.\n\nThis constitutes the main objective of this work: how can we parametrize and optimize such distributions? Essentially, our target is to introduce dependence among the points of D in a manner that respects the two aforementioned conditions. We can then encode prior assumptions and inductive biases into the model by considering the relations among said points, a task much simpler than specifying a prior over latent global parameters p\u03b8(w). To this end, we introduce in the following our main contribution, the Functional Neural Process (FNP).\n\n2.1 Designing the Functional Neural Process\n\nOn a high level the FNP follows the construction of a stochastic process as described at [11]; it posits a distribution over functions h \u2208 H from x to y by first selecting a \u201creference\u201d set of points from X , and then basing the probability distribution over h around those points. This concept is similar to the \u201cinducing inputs\u201d that are used in sparse GPs [46, 51]. More specifically, let R = {x^r_1, . . . , x^r_K} be such a reference set and let O = X \\ R be the \u201cother\u201d set, i.e. the set of all possible points that are not in R. Now let Dx = {x1, . . . , xN } be any finite random set from X that constitutes our observed inputs. To facilitate the exposition we also introduce two more sets: M = Dx \\ R, which contains the points of Dx that are from O, and B = R \u222a M, which contains all of the points in Dx and R. We provide a Venn diagram in Fig. 1. In the following we describe the construction of the model, shown in Fig. 
2, and then prove that it corresponds to an infinitely exchangeable stochastic process.\n\nEmbedding the inputs to a latent space The first step of the FNP is to embed each of the xi of B independently to a latent representation ui\n\np\u03b8(UB|XB) = \u220f_{i\u2208B} p\u03b8(ui|xi), (3)\n\nwhere p\u03b8(ui|xi) can be any distribution, e.g. a Gaussian or a delta peak, where its parameters, e.g. the mean and variance, are given by a function of xi. This function can be any function, provided that it is flexible enough to provide a meaningful representation for xi. For this reason, we employ neural networks, as their representational capacity has been demonstrated on a variety of complex high dimensional tasks, such as natural image generation and classification.\n\nConstructing a graph of dependencies in the embedding space The next step is to construct a dependency graph among the points in B; it encodes the correlations among the points in D that arise in the stochastic process. For example, in GPs such a correlation structure is encoded in the covariance matrix according to a kernel function g(\u00b7, \u00b7) that measures the similarity between two inputs. In the FNP we adopt a different approach. Given the latent embeddings UB that we obtained in the previous step we construct two directed graphs of dependencies among the points in B: a directed acyclic graph (DAG) G among the points in R and a bipartite graph A from R to M. These graphs are represented as random binary adjacency matrices, where e.g. Aij = 1 corresponds to the vertex j being a parent of the vertex i. The distribution of the bipartite graph can be defined as\n\np(A|UR, UM ) = \u220f_{i\u2208M} \u220f_{j\u2208R} Bern (Aij|g(ui, uj)), (4)\n\nwhere g(ui, uj) provides the probability that a point i \u2208 M depends on a point j in the reference set R. 
This graph construction is reminiscent of graphon [39] models, with, however, two important distinctions. Firstly, the embedding of each node is a vector rather than a scalar and, secondly, the prior distribution over u is conditioned on an initial vertex representation x rather than being the same for all vertices. We believe that the latter is an important aspect, as it is what allows us to maintain enough information about the vertices and construct more informative graphs.\n\nFigure 3: An example of the bipartite graph A that the FNP learns. The first column of each image is a query point and the rest are the five most probable parents from the R. We can see that the FNP associates same class inputs.\n\nFigure 4: A DAG over R on MNIST, obtained after propagating the means of U and thresholding edges that have less than 0.5 probability in G. We can see that FNP learns a meaningful G by connecting points that have the same class.\n\nThe DAG among the points in R is a bit trickier, as we have to adopt a topological ordering of the vectors in UR in order to avoid cycles. Inspired by the concept of stochastic orderings [43], we define an ordering according to a parameter-free scalar projection t(\u00b7) of u, i.e. ui > uj when t(ui) > t(uj). The function t(\u00b7) is defined as t(ui) = \u2211_k tk(uik), where each individual tk(\u00b7) is a monotonic function (e.g. the log CDF of a standard normal distribution); in this case we can guarantee that ui > uj when, individually for all of the dimensions k, we have that uik > ujk under tk(\u00b7). This ordering can then be used in\n\np(G|UR) = \u220f_{i\u2208R} \u220f_{j\u2208R,j\u2260i} Bern (Gij|I[t(ui) > t(uj)]g(ui, uj)), (5)\n\nwhich leads to random adjacency matrices G that can be re-arranged into a triangular structure with zeros in the diagonal (i.e. DAGs). 
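To make the construction concrete, here is a minimal NumPy sketch of sampling the bipartite graph A (Eq. 4) and the DAG G (Eq. 5) from toy embeddings. The embedding dimensionality, the set sizes, \u03c4 = 1 and the standard-normal log-CDF choice for the tk are illustrative assumptions, not the paper's exact settings:

```python
import math

import numpy as np

rng = np.random.default_rng(0)

def g(u_i, u_j, tau=1.0):
    # RBF similarity from Sec. 2.1: the probability of an edge (Eqs. 4, 5)
    return float(np.exp(-0.5 * tau * np.sum((u_i - u_j) ** 2)))

def t(u):
    # parameter-free monotonic projection: sum of per-dimension log CDFs
    # of a standard normal (one valid choice of t_k mentioned in the text)
    return sum(math.log(0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))) for x in u)

# toy latent embeddings: 4 reference points (R), 3 "other" points (M), 2 dims
U_R = rng.normal(size=(4, 2))
U_M = rng.normal(size=(3, 2))

# bipartite graph A from R to M (Eq. 4): independent Bernoulli edges
A = np.array([[rng.binomial(1, g(ui, uj)) for uj in U_R] for ui in U_M])

# DAG G over R (Eq. 5): an edge i <- j is only possible when t(u_i) > t(u_j)
G = np.array([[rng.binomial(1, g(ui, uj)) if t(ui) > t(uj) else 0
               for uj in U_R] for ui in U_R])

# sorting the reference points by t(.) exposes the acyclic structure of G:
# no edges on or above the diagonal, i.e. strictly lower triangular
order = np.argsort([t(u) for u in U_R])
assert not np.triu(G[np.ix_(order, order)]).any()
```

The final assertion is exactly the re-arrangement into a triangular structure mentioned above: the ordering by t(\u00b7) rules out cycles by construction.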
In a similar manner, such a DAG construction is reminiscent of digraphon models [6], a generalization of graphons to the directed case. The same two important distinctions still apply: we are using vector instead of scalar representations and the prior over the representation of each vertex i depends on xi. It is now straightforward to bake in any relational inductive biases that we want our function to have by appropriately defining the g(\u00b7, \u00b7) that is used for the construction of G and A. For example, we can encode an inductive bias that neighboring points should be dependent by choosing g(ui, uj) = exp(\u2212(\u03c4/2)||ui \u2212 uj||^2). This is what we used in practice. We provide examples of the A, G that FNPs learn in Figures 3, 4 respectively.\n\nParametrizing the predictive distribution Having obtained the dependency graphs A, G, we are now interested in how to construct a predictive model that induces them. To this end, we parametrize predictive distributions for each target variable yi that explicitly depend on the reference set R according to the structure of G and A. This is realized via a local latent variable zi that summarizes the context from the selected parent points in R and their targets yR:\n\n\u222b p\u03b8(yB, ZB|R, G, A) dZB = \u222b p\u03b8(yR, ZR|R, G) dZR \u222b p\u03b8(yM , ZM |R, yR, A) dZM = \u220f_{i\u2208R} \u222b p\u03b8(zi|par_{Gi}(R, yR)) p\u03b8(yi|zi) dzi \u220f_{j\u2208M} \u222b p\u03b8(zj|par_{Aj}(R, yR)) p\u03b8(yj|zj) dzj, (6)\n\nwhere par_{Gi}(\u00b7), par_{Aj}(\u00b7) are functions that return the parents of the point i, j according to G, A respectively. Notice that we are guaranteed that the decomposition into the conditionals at Eq. 6 is valid, since the DAG G coupled with A corresponds to another DAG. 
Since permutation invariance in the parents is necessary for an overall exchangeable model, we define each distribution over z, e.g. p(zi|par_{Ai}(R, yR)), as an independent Gaussian distribution per dimension k of z:1\n\np\u03b8(zik|par_{Ai}(R, yR)) = N(zik | Ci \u2211_{j\u2208R} Aij \u00b5\u03b8(x^r_j , y^r_j)k, exp(Ci \u2211_{j\u2208R} Aij \u03bd\u03b8(x^r_j , y^r_j)k)), (7)\n\nwhere the \u00b5\u03b8(\u00b7, \u00b7) and \u03bd\u03b8(\u00b7, \u00b7) are vector valued functions with a codomain in R^|z| that transform the data tuples of R, yR. The Ci is a normalization constant with Ci = (\u2211_j Aij + \u03b5)^{\u22121}, i.e. it corresponds to the reciprocal of the number of parents of point i, with an extra small \u03b5 to avoid division by zero when a point has no parents. By observing Eq. 6 we can see that the prediction for a given yi depends on the input covariates xi only indirectly via the graphs G, A, which are a function of ui. Intuitively, this encodes the inductive bias that predictions on points that are \u201cfar away\u201d, i.e. have very small probability of being connected to the reference set via A, will default to an uninformative standard normal prior over zi, hence a constant prediction for yi. This is similar to the behaviour that GPs with RBF kernels exhibit.\n\n1 The factorized Gaussian distribution was chosen for simplicity, and it is not a limitation. Any distribution is valid for z provided that it defines a permutation invariant probability density w.r.t. the parents.\n\nNevertheless, Eq. 6 can also hinder extrapolation, something that neural networks can do well. In case extrapolation is important, we can always add a direct path by conditioning the prediction on ui, the latent embedding of xi, i.e. p(yi|zi, ui). 
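A small NumPy sketch of the parent-conditioned prior of Eq. 7; the per-parent feature vectors below stand in for the outputs of \u00b5\u03b8 and \u03bd\u03b8 and are hypothetical placeholders:

```python
import numpy as np

rng = np.random.default_rng(1)

def z_prior_params(A_i, mu_feats, nu_feats, eps=1e-8):
    # Eq. 7: mean and variance of p(z_i | par_{A_i}(R, y_R)) are normalized
    # sums of per-parent features mu_theta(x_j^r, y_j^r), nu_theta(x_j^r, y_j^r)
    C_i = 1.0 / (A_i.sum() + eps)            # reciprocal number of parents
    mean = C_i * (A_i[:, None] * mu_feats).sum(axis=0)
    var = np.exp(C_i * (A_i[:, None] * nu_feats).sum(axis=0))
    return mean, var

# hypothetical per-parent feature vectors for 4 reference points, |z| = 2
mu_feats = rng.normal(size=(4, 2))
nu_feats = rng.normal(size=(4, 2))

# a point with two parents averages the features of those parents
m2, v2 = z_prior_params(np.array([1, 0, 1, 0]), mu_feats, nu_feats)

# a point with no parents defaults to the uninformative N(0, 1) prior,
# which is the "far away" behaviour discussed above
m0, v0 = z_prior_params(np.zeros(4), mu_feats, nu_feats)
```

Note how the \u03b5 in Ci makes the no-parent case collapse to a standard normal: the weighted sums are exactly zero, so the mean is 0 and the variance is exp(0) = 1.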
Conditioning on ui can serve as a middle ground, where we allow some extrapolation via u. In general, it provides a knob, as we can now interpolate between GP and neural network behaviours by e.g. changing the dimensionalities of z and u.\n\nPutting everything together: the FNP and FNP+ models By putting everything together we arrive at the overall definitions of the two FNP models that we propose:\n\nFNP\u03b8(D) := \u2211_{G,A} \u222b p\u03b8(UB|XB) p(G, A|UB) p\u03b8(yB, ZB|R, G, A) dUB dZB dy_{i\u2208R\\Dx}, (8)\n\nFNP+\u03b8(D) := \u2211_{G,A} \u222b p\u03b8(UB, G, A|XB) p\u03b8(yB, ZB|R, UB, G, A) dUB dZB dy_{i\u2208R\\Dx}, (9)\n\nwhere the first makes predictions according to Eq. 6 and the second further conditions on u. Notice that besides the marginalizations over the latent variables and graphs, we also marginalize over any of the points in the reference set that are not part of the observed dataset D. This is necessary for the proof of consistency that we provide later. For this work, we always chose the reference set to be a part of the dataset D, so the extra integration is omitted. In general, the marginalization can provide a mechanism to include unlabelled data in the model, which could be used to e.g. learn a better embedding u or \u201cimpute\u201d the missing labels. We leave the exploration of such an avenue for future work. Having defined the models at Eq. 8, 9 we now prove that they both define valid permutation invariant stochastic processes by borrowing the methodology described at [11].\n\nProposition 1. The distributions defined at Eq. 8, 9 are valid permutation invariant stochastic processes, hence they correspond to Bayesian models.\n\nProof sketch. The full proof can be found in the Appendix. Permutation invariance can be proved by noting that each of the terms in the products is permutation equivariant w.r.t. permutations of D, hence each of the individual distributions defined at Eq. 
8, 9 is permutation invariant due to the products. To prove consistency we have to consider two cases [11]: the case where we add a point that is part of R and the case where we add one that is not part of R. In the first case, marginalizing out that point will lead to the same distribution (as we were marginalizing over that point already), whereas in the second case the point that we are adding is a leaf in the dependency graph, hence marginalizing it doesn\u2019t affect the other points.\n\n2.2 The FNPs in practice: fitting and predictions\n\nHaving defined the two models, we are now interested in how we can fit their parameters \u03b8 when we are presented with a dataset D, as well as how to make predictions for novel inputs x\u2217. For simplicity, we assume that R \u2286 Dx and focus on the FNP, as the derivations for the FNP+ are analogous. Notice that in this case we have that B = Dx = XD.\n\nFitting the model to data Fitting the model parameters with maximum marginal likelihood is difficult, as the necessary integrals / sums of Eq. 8 are intractable. For this reason, we employ variational inference and maximize the following lower bound to the marginal likelihood of D:\n\nL = E_{q\u03c6(UD,G,A,ZD|XD)}[log p\u03b8(UD, G, A, ZD, yD|XD) \u2212 log q\u03c6(UD, G, A, ZD|XD)], (10)\n\nwith respect to the model parameters \u03b8 and variational parameters \u03c6. For a tractable lower bound, we assume that the variational posterior distribution q\u03c6(UD, G, A, ZD|XD) factorizes as p\u03b8(UD|XD)p(G|UR)p(A|UD)q\u03c6(ZD|XD) with q\u03c6(ZD|XD) = \u220f_{i=1}^{|D|} q\u03c6(zi|xi). 
This leads to\n\nL_R + L_{M|R} = E_{p\u03b8(UR,G|XR)q\u03c6(ZR|XR)}[log p\u03b8(yR, ZR|R, G) \u2212 log q\u03c6(ZR|XR)] + E_{p\u03b8(UD,A|XD)q\u03c6(ZM |XM )}[log p\u03b8(yM |ZM ) + log p\u03b8(ZM |par_A(R, yR)) \u2212 log q\u03c6(ZM |XM )], (11)\n\nwhere we decomposed the lower bound into the terms for the reference set R, L_R, and the terms that correspond to M, L_{M|R}. For large datasets D we are interested in doing efficient optimization of this bound. While the first term is not, in general, amenable to minibatching, the second term is. As a result, we can use minibatches that scale according to the size of the reference set R. We provide more details in the Appendix.\n\nIn practice, for all of the distributions over u and z, we use diagonal Gaussians, whereas for G, A we use the concrete / Gumbel-softmax relaxations [34, 21] during training. In this way we can jointly optimize \u03b8, \u03c6 with gradient based optimization by employing the pathwise derivatives obtained with the reparametrization trick [24, 42]. Furthermore, we tie most of the parameters \u03b8 of the model and \u03c6 of the inference network, as the regularizing nature of the lower bound can alleviate potential overfitting of the model parameters \u03b8. More specifically, for p\u03b8(ui|xi), q\u03c6(zi|xi) we share a neural network torso and have two output heads, one for each distribution. We also parametrize the priors over the latent z in terms of the q\u03c6(zi|xi) for the points in R; the \u00b5\u03b8(x^r_i , y^r_i), \u03bd\u03b8(x^r_i , y^r_i) are both defined as \u00b5_q(x^r_i) + \u00b5^r_y, \u03bd_q(x^r_i) + \u03bd^r_y, where \u00b5_q(\u00b7), \u03bd_q(\u00b7) are the functions that provide the mean and variance for q\u03c6(zi|xi) and \u00b5^r_y, \u03bd^r_y are linear embeddings of the labels.\n\nIt is interesting to see that the overall bound at Eq. 11 is reminiscent of the bound of a latent variable model such as a variational autoencoder (VAE) [24, 42] or a deep variational information bottleneck model (VIB) [1]. 
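For intuition on the relaxation mentioned above, here is a minimal NumPy sketch of binary Concrete / Gumbel-softmax sampling of Bernoulli edge variables [34, 21]; the temperatures and edge probabilities are illustrative assumptions, not the paper's settings:

```python
import numpy as np

rng = np.random.default_rng(2)

def relaxed_bernoulli(p, temperature, eps=1e-7):
    # binary Concrete / Gumbel-softmax relaxation of Bern(p): a
    # differentiable sample in (0, 1) obtained by perturbing the logits
    # with Logistic(0, 1) noise; it hardens to {0, 1} as temperature -> 0
    u = rng.uniform(eps, 1.0 - eps, size=np.shape(p))
    logistic_noise = np.log(u) - np.log1p(-u)
    logits = np.log(p) - np.log1p(-p)
    return 1.0 / (1.0 + np.exp(-(logits + logistic_noise) / temperature))

# hypothetical edge probabilities g(u_i, u_j) for three candidate edges
probs = np.array([0.1, 0.5, 0.9])

soft_edges = relaxed_bernoulli(probs, temperature=0.5)   # training surrogate
hard_edges = relaxed_bernoulli(probs, temperature=1e-4)  # ~ exact Bernoulli
```

Because the relaxed sample is a deterministic, differentiable function of the noise and the edge probabilities, pathwise (reparametrization) gradients flow through the sampled graphs into g(\u00b7, \u00b7).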
As in those models, we aim to predict the label yi of a given point xi from its latent code zi, where the prior, instead of being globally the same as in [24, 42, 1], is conditioned on the parents of that particular point. The conditioning is also intuitive, as it is what converts the i.i.d. model to the more general exchangeable model. This is also similar to the VAE for unsupervised learning described at associative compression networks (ACN) [16] and is reminiscent of works on few-shot learning [4].\n\nThe posterior predictive distribution In order to perform predictions for unseen points x\u2217, we employ the posterior predictive distribution of FNPs. More specifically, we can show that by using Bayes rule, the predictive distribution of the FNPs has the following simple form:\n\n\u2211_{a\u2217} \u222b p\u03b8(UR, u\u2217|XR, x\u2217) p(a\u2217|UR, u\u2217) p\u03b8(z\u2217|par_{a\u2217}(R, yR)) p\u03b8(y\u2217|z\u2217) dUR du\u2217 dz\u2217, (12)\n\nwhere u are the representations given by the neural network and a\u2217 is the binary vector that denotes which points from R are the parents of the new point. We provide more details in the Appendix. Intuitively, we first project the reference set and the new point on the latent space u with a neural network and then make a prediction y\u2217 by basing it on the parents from R according to a\u2217. This predictive distribution is reminiscent of the models employed in few-shot learning [53].\n\n3 Related work\n\nThere has been a long line of research in Bayesian Neural Networks (BNNs) [15, 5, 23, 19, 31, 44]. A lot of works have focused on the hard task of posterior inference for BNNs, by positing more flexible posteriors [31, 44, 30, 56, 3]. The exploration of more involved priors has so far not gained much traction, with the exception of a handful of works [23, 29, 2, 17]. 
For flexible stochastic processes, we have a line of works that focus on (scalable) Gaussian Processes (GPs); these revolve around sparse GPs [46, 51], using neural networks to parametrize the kernel of a GP [55, 54], employing finite rank approximations to the kernel [9, 18] or parametrizing kernels over structured data [35, 52]. Compared to such approaches, FNPs can in general be more scalable, due to not having to invert a matrix for prediction, and, furthermore, they can easily support arbitrary likelihood models (e.g. for discrete data) without having to consider appropriate transformations of a base Gaussian distribution (which usually requires further approximations).\n\nThere have been interesting recent works that attempt to merge stochastic processes and neural networks. Neural Processes (NPs) [14] define distributions over global latent variables in terms of subsets of the data, while Attentive NPs [22] extend NPs with a deterministic path that has a cross-attention mechanism among the datapoints. In a sense, FNPs can be seen as a variant where we discard the global latent variables and instead incorporate cross-attention in the form of a dependency graph among local latent variables. Another line of works is the Variational Implicit Processes (VIPs) [32], which consider BNN priors and then use GPs for inference, and functional variational BNNs (fBNNs) [47], which employ GP priors and use BNNs for inference. Both methods have their drawbacks, as with VIPs we have to posit a meaningful prior over global parameters and the objective of fBNNs does not always correspond to a bound of the marginal likelihood. 
Finally, there is also an interesting line of works that study wide neural networks with random Gaussian parameters and discuss their equivalences with Gaussian Processes [38, 27], as well as the resulting kernel [20]. Similarities can also be seen in other works; Associative Compression Networks (ACNs) [16] employ similar ideas for generative modelling with VAEs and condition the prior over the latent variable of a point on its nearest neighbors. Correlated VAEs [50] similarly employ an (a-priori known) dependency structure across the latent variables of the points in the dataset. In few-shot learning, metric-based approaches [53, 4, 48, 45, 26] similarly rely on similarities w.r.t. a reference set for predictions.\n\n4 Experiments\n\nWe performed two main experiments in order to verify the effectiveness of FNPs. We implemented and compared against 4 baselines: a standard neural network (denoted as NN), a neural network trained and evaluated with Monte Carlo (MC) dropout [13] and a Neural Process (NP) [14] architecture. The architecture of the NP was designed in a way that is similar to the FNP. For the first experiment we explored the inductive biases we can encode in FNPs by visualizing the predictive distributions in toy 1d regression tasks. For the second, we measured the prediction performance and uncertainty quality that FNPs can offer on the benchmark image classification tasks of MNIST and CIFAR 10. For this experiment, we also implemented and compared against a Bayesian neural network trained with variational inference [5]. We provide the experimental details in the Appendix.\n\nFor all of the experiments in the paper, the NP was trained in a way that mimics the FNP, albeit we used a different set R at every training iteration in order to conform to the standard NP training regime. 
More specifically, a random amount from 3 to num(R) points was selected as a context from each batch, with num(R) being the maximum amount of points allocated for R. For the toy regression task we set num(R) = N \u2212 1.\n\nExploring the inductive biases in toy regression To visually assess the inductive biases we encode in the FNP we experiment with two toy 1-d regression tasks described at [40] and [19] respectively. The generative process of the first corresponds to drawing 12 points from U[0, 0.6], 8 points from U[0.8, 1] and then parametrizing the target as yi = xi + \u03b5 + sin(4(xi + \u03b5)) + sin(13(xi + \u03b5)) with \u03b5 \u223c N(0, 0.03^2). This generates a nonlinear function with \u201cgaps\u201d in between the data where we, ideally, want the uncertainty to be high. For the second we sampled 20 points from U[\u22124, 4] and then parametrized the target as yi = x_i^3 + \u03b5, where \u03b5 \u223c N(0, 9). For all of the models we used a heteroscedastic noise model. Furthermore, due to the toy nature of this experiment, we also included a Gaussian Process (GP) with an RBF kernel. We used 50 dimensions for the global latent of the NP for the first task and 10 dimensions for the second. For the FNP models we used 3, 50 dimensions for the u, z for the first task and 3, 10 for the second. For the reference set R we used 10 random points for the FNPs and the full dataset for the NP.\n\nThe results we obtain are presented in Figure 5. We can see that for the first task the FNP with the RBF function for g(\u00b7, \u00b7) has a behaviour that is very similar to the GP. We can also see that in the second task it has the tendency to quickly move towards a flat prediction outside the areas where we observe points, something which we argued about at Section 2.1. 
This is not the case for MC-dropout or NP, where we see a more linear behaviour of the uncertainty and erroneous overconfidence, in the case of the first task, in the areas in-between the data. Nevertheless, they do seem to extrapolate better compared to the FNP and GP. The FNP+ seems to combine the best of both worlds, as it allows for extrapolation and GP-like uncertainty, although a free bits [7] modification of the bound for z was helpful in encouraging the model to rely more on these particular latent variables. Empirically, we observed that adding more capacity on u can move the FNP+ closer to the behaviour we observe for MC-dropout and NPs. In addition, increasing the amount of model parameters \u03b8 can make FNPs overfit, a fact that can result in a reduction of predictive uncertainty.\n\n(a) MC-dropout (b) Neural Process (c) Gaussian Process (d) FNP (e) FNP+\n\nFigure 5: Predictive distributions for the two toy regression tasks according to the different models we considered. Shaded areas correspond to \u00b1 3 standard deviations.\n\nPrediction performance and uncertainty quality For the second task we considered the image classification of MNIST and CIFAR 10. For MNIST we used a LeNet-5 architecture that had two convolutional and two fully connected layers, whereas for CIFAR we used a VGG-like architecture that had 6 convolutional and two fully connected. In both experiments we used 300 random points from D as R for the FNPs and, for NPs, in order to be comparable, we randomly selected up to 300 points from the current batch as the context points during training and used the same 300 points as FNPs for evaluation. The dimensionality of u, z was 32, 64 for the FNP models in both datasets, whereas for the NP the dimensionality of the global variable was 32 for MNIST and 64 for CIFAR.\n\nAs a proxy for the uncertainty quality we used the task of out of distribution (o.o.d.) 
detection; given the fact that FNPs are Bayesian models, we would expect their epistemic uncertainty to increase in areas where we have no data (i.e. o.o.d. datasets). The metric that we report is the average entropy on those datasets as well as the area under an ROC curve (AUCR) that determines whether a point is in or out of distribution according to the predictive entropy. Notice that it is simple to increase the first metric by just learning a trivial model, but that would be detrimental for AUCR; in order to have good AUCR the model must have low entropy on the in-distribution test set but high entropy on the o.o.d. datasets. For the MNIST model we considered notMNIST, Fashion MNIST, Omniglot, Gaussian N(0, 1) and uniform U[0, 1] noise as o.o.d. datasets, whereas for CIFAR 10 we considered SVHN, tinyImagenet resized to 32 pixels, iSUN and, similarly, Gaussian and uniform noise. The summary of the results can be seen in Table 1.

Table 1: Accuracy and uncertainty on MNIST and CIFAR 10 from 100 posterior predictive samples. For all of the datasets the first column is the average predictive entropy; for the o.o.d. datasets the second column is the AUCR, whereas for the in-distribution datasets it is the test error in %.

             NN                   VI BNN               MC-Dropout           NP                   FNP                  FNP+
MNIST        0.01 / 0.6           0.05 / 0.5           0.02 / 0.6           0.01 / 0.6           0.04 / 0.7           0.02 / 0.7
nMNIST       1.03 / 99.73         1.30 / 99.48         1.33 / 99.80         1.31 / 99.90         1.94 / 99.90         1.77 / 99.96
fMNIST       0.81 / 99.16         1.23 / 99.07         0.92 / 98.61         0.71 / 98.98         1.85 / 99.66         1.55 / 99.58
Omniglot     0.71 / 99.44         1.18 / 99.29         1.61 / 99.91         0.86 / 99.69         1.87 / 99.79         1.71 / 99.92
Gaussian     0.99 / 99.63         2.03 / 100.0         1.77 / 100.0         1.58 / 99.94         1.94 / 99.86         2.03 / 100.0
Uniform      0.85 / 99.65         0.65 / 97.58         1.41 / 99.87         1.46 / 99.96         2.11 / 99.98         1.88 / 99.99
Average      0.9±0.1 / 99.5±0.1   1.3±0.2 / 99.1±0.4   1.4±0.1 / 99.6±0.3   1.2±0.2 / 99.7±0.2   1.9±0.1 / 99.8±0.1   1.8±0.1 / 99.9±0.1
CIFAR10      0.05 / 6.9           0.06 / 7.0           0.06 / 6.4           0.06 / 7.5           0.18 / 7.2           0.08 / 7.2
SVHN         0.44 / 93.1          0.42 / 91.3          0.45 / 91.8          0.38 / 90.2          1.09 / 94.3          0.42 / 89.8
tImag32      0.51 / 92.7          0.59 / 93.1          0.52 / 91.9          0.45 / 89.8          1.20 / 94.0          0.74 / 93.8
iSUN         0.52 / 93.2          0.59 / 93.1          0.57 / 93.2          0.47 / 90.8          1.30 / 95.1          0.81 / 94.8
Gaussian     0.01 / 72.3          0.05 / 72.1          0.76 / 96.9          0.37 / 91.9          1.13 / 95.4          0.96 / 97.9
Uniform      0.93 / 98.4          0.08 / 77.3          0.65 / 96.1          0.17 / 87.8          0.71 / 89.7          0.99 / 98.4
Average      0.5±0.2 / 89.9±4.5   0.4±0.1 / 85.4±4.5   0.6±0.1 / 94±1.1     0.4±0.1 / 90.1±0.7   1.1±0.1 / 93.7±1.0   0.8±0.1 / 94.9±1.6

We observe that both FNPs have comparable accuracy to the baseline models while having higher average entropies and AUCR on the o.o.d. datasets. FNP+ in general seems to perform better than FNP. The FNP did have a relatively high in-distribution entropy for CIFAR 10, perhaps denoting that a larger R might be more appropriate. We further see that the FNPs almost always have better AUCR than all of the baselines we considered. Interestingly, out of all the non-noise o.o.d. datasets we did observe that Fashion MNIST and SVHN were the hardest to distinguish on average across all the models. This effect seems to agree with the observations from [36], although more investigation is required. We also observed that, sometimes, the noise datasets on all of the baselines can act as "adversarial examples" [49], thus leading to lower entropy than the in-distribution test set (e.g. Gaussian noise for the NN on CIFAR 10). FNPs did have a similar effect on CIFAR 10, e.g. the FNP on uniform noise, although to a much lesser extent. We leave the exploration of this phenomenon for future work. It should be mentioned that other advances in o.o.d. detection, e.g.
[28, 8], are orthogonal to FNPs and could further improve performance.

We further performed additional experiments in order to better disentangle the performance differences between NPs and FNPs: we trained an NP with the same fixed reference set R as the FNPs throughout training, as well as an FNP+ where we randomly sample a new R for every batch (akin to the NP) and use the same R as the NP for evaluation. While we argued in the construction of the FNPs that with a fixed R we can obtain a stochastic process, we could view the case with random R as an ensemble of stochastic processes, one for each realization of R. The results from these models can be seen in Table 2. On the one hand, the FNP+ still provides robust uncertainty, and the randomness in R seems to improve the o.o.d. detection, possibly due to the implicit regularization. On the other hand, the fixed R seems to hurt the NP, as the o.o.d. detection decreased, similarly hinting that the random R has beneficial regularizing effects.

Finally, we provide some additional insights after doing ablation studies on MNIST w.r.t. the sensitivity to the number of points in R for NP, FNP and FNP+, as well as varying the number of dimensions for u, z in the FNP+. The results can be found in the Appendix. We generally observed that NP models have lower average entropy on the o.o.d. datasets than both FNP and FNP+, irrespective of the size of R. The choice of R seems to be more important for the FNPs than for the NPs, with FNP needing a larger R, compared to FNP+, to fit the data well. In general, it seems that it is not the quantity of points that matters but rather the quality; the performance did not always increase with more points. This supports the idea of a "coreset" of points, so exploring ways to infer it is a promising research direction that could improve scalability and alleviate the dependence of FNPs on a reasonable R.
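The uncertainty metrics used throughout these comparisons, average predictive entropy and AUCR, can be sketched as follows. This is a minimal NumPy illustration (using the rank-based Mann-Whitney form of the AUC), not the authors' evaluation code:

```python
import numpy as np

def predictive_entropy(probs):
    # probs: (num_points, num_classes) posterior predictive probabilities,
    # e.g. averaged over 100 posterior predictive samples.
    return -np.sum(probs * np.log(probs + 1e-12), axis=1)

def aucr(entropy_in, entropy_out):
    # Area under the ROC curve for "is this point o.o.d.?", using the
    # predictive entropy as the score (Mann-Whitney rank statistic).
    scores = np.concatenate([entropy_in, entropy_out])
    labels = np.concatenate([np.zeros_like(entropy_in), np.ones_like(entropy_out)])
    order = np.argsort(scores)
    ranks = np.empty_like(order, dtype=float)
    ranks[order] = np.arange(1, len(scores) + 1)
    n_out, n_in = len(entropy_out), len(entropy_in)
    return (ranks[labels == 1].sum() - n_out * (n_out + 1) / 2) / (n_out * n_in)
```

Note that a model which is uniformly uncertain everywhere drives the AUCR towards 0.5, which is why high average entropy alone is not a sufficient proxy for uncertainty quality.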
As for the trade-off between z and u in FNP+: a larger capacity for z, compared to u, leads to better uncertainty, whereas the other way around seems to improve accuracy. These observations are conditioned on having a reasonably large u, which facilitates a meaningful G and A.

Table 2: Results obtained by training an NP model with a fixed reference set (akin to the FNP) and an FNP+ model with a random reference set (akin to the NP).

             NP fixed R     FNP+ random R
MNIST        0.01 / 0.6     0.02 / 0.8
nMNIST       1.09 / 99.78   2.20 / 100.0
fMNIST       0.64 / 98.34   1.58 / 99.78
Omniglot     0.79 / 99.53   2.06 / 99.99
Gaussian     1.79 / 99.96   2.28 / 100.0
Uniform      1.42 / 99.93   2.23 / 100.0
CIFAR10      0.07 / 7.5     0.09 / 6.9
SVHN         0.46 / 91.5    0.56 / 91.4
tImag32      0.55 / 91.5    0.77 / 93.4
iSUN         0.60 / 92.6    0.83 / 94.0
Gaussian     0.20 / 87.2    1.23 / 99.1
Uniform      0.53 / 94.3    0.90 / 97.2

5 Discussion

We presented a novel family of exchangeable stochastic processes, the Functional Neural Processes (FNPs). In contrast to NPs [14], which employ global latent variables, FNPs operate by employing local latent variables along with a dependency structure among them, a fact that allows for easier encoding of inductive biases. We verified the potential of FNPs experimentally and showed that they can serve as competitive alternatives. We believe that FNPs open the door to plenty of exciting avenues for future research: designing better function priors by e.g. imposing a manifold structure on the FNP latents [12], extending FNPs to unsupervised learning by e.g.
adapting ACNs [16], or considering hierarchical models similar to deep GPs [10].

Acknowledgments

We would like to thank Patrick Forré for helpful discussions over the course of this project, and Peter Orbanz and Benjamin Bloem-Reddy for helpful discussions during a preliminary version of this work. We would also like to thank Daniel Worrall, Tim Bakker and Stephan Alaniz for helpful feedback on an initial draft.

References

[1] Alexander A Alemi, Ian Fischer, Joshua V Dillon, and Kevin Murphy. Deep variational information bottleneck. arXiv preprint arXiv:1612.00410, 2016.

[2] Andrei Atanov, Arsenii Ashukha, Kirill Struminsky, Dmitry Vetrov, and Max Welling. The deep weight prior: Modeling a prior distribution for CNNs using generative models. arXiv preprint arXiv:1810.06943, 2018.

[3] Juhan Bae, Guodong Zhang, and Roger Grosse. Eigenvalue corrected noisy natural gradient. arXiv preprint arXiv:1811.12565, 2018.

[4] Sergey Bartunov and Dmitry Vetrov. Few-shot generative modelling with generative matching networks. In International Conference on Artificial Intelligence and Statistics, pages 670–678, 2018.

[5] Charles Blundell, Julien Cornebise, Koray Kavukcuoglu, and Daan Wierstra. Weight uncertainty in neural networks. In Proceedings of the 32nd International Conference on Machine Learning, ICML 2015, Lille, France, 6-11 July 2015, 2015.

[6] Diana Cai, Nathanael Ackerman, Cameron Freer, et al. Priors on exchangeable directed graphs. Electronic Journal of Statistics, 10(2):3490–3515, 2016.

[7] Xi Chen, Diederik P Kingma, Tim Salimans, Yan Duan, Prafulla Dhariwal, John Schulman, Ilya Sutskever, and Pieter Abbeel. Variational lossy autoencoder. arXiv preprint arXiv:1611.02731, 2016.

[8] Hyunsun Choi and Eric Jang. Generative ensembles for robust anomaly detection. arXiv preprint arXiv:1810.01392, 2018.

[9] Kurt Cutajar, Edwin V Bonilla, Pietro Michiardi, and Maurizio Filippone.
Random feature expansions for deep gaussian processes. In Proceedings of the 34th International Conference on Machine Learning - Volume 70, pages 884–893. JMLR.org, 2017.

[10] Andreas C. Damianou and Neil D. Lawrence. Deep gaussian processes. In Proceedings of the Sixteenth International Conference on Artificial Intelligence and Statistics, AISTATS 2013, Scottsdale, AZ, USA, April 29 - May 1, 2013, pages 207–215, 2013.

[11] Abhirup Datta, Sudipto Banerjee, Andrew O Finley, and Alan E Gelfand. Hierarchical nearest-neighbor gaussian process models for large geostatistical datasets. Journal of the American Statistical Association, 111(514):800–812, 2016.

[12] Luca Falorsi, Pim de Haan, Tim R Davidson, and Patrick Forré. Reparameterizing distributions on lie groups. arXiv preprint arXiv:1903.02958, 2019.

[13] Yarin Gal and Zoubin Ghahramani. Dropout as a bayesian approximation: Representing model uncertainty in deep learning. In International Conference on Machine Learning, pages 1050–1059, 2016.

[14] Marta Garnelo, Jonathan Schwarz, Dan Rosenbaum, Fabio Viola, Danilo J Rezende, SM Eslami, and Yee Whye Teh. Neural processes. arXiv preprint arXiv:1807.01622, 2018.

[15] Alex Graves. Practical variational inference for neural networks. In Advances in Neural Information Processing Systems, pages 2348–2356, 2011.

[16] Alex Graves, Jacob Menick, and Aaron van den Oord. Associative compression networks. arXiv preprint arXiv:1804.02476, 2018.

[17] Danijar Hafner, Dustin Tran, Alex Irpan, Timothy Lillicrap, and James Davidson. Reliable uncertainty estimates in deep neural networks using noise contrastive priors. arXiv preprint arXiv:1807.09289, 2018.

[18] James Hensman, Nicolas Durrande, Arno Solin, et al. Variational fourier features for gaussian processes. Journal of Machine Learning Research, 18:151–1, 2017.

[19] José Miguel Hernández-Lobato and Ryan Adams.
Probabilistic backpropagation for scalable learning of bayesian neural networks. In International Conference on Machine Learning, pages 1861–1869, 2015.

[20] Arthur Jacot, Franck Gabriel, and Clément Hongler. Neural tangent kernel: Convergence and generalization in neural networks. In Advances in Neural Information Processing Systems, pages 8571–8580, 2018.

[21] Eric Jang, Shixiang Gu, and Ben Poole. Categorical reparameterization with gumbel-softmax. arXiv preprint arXiv:1611.01144, 2016.

[22] Hyunjik Kim, Andriy Mnih, Jonathan Schwarz, Marta Garnelo, Ali Eslami, Dan Rosenbaum, Oriol Vinyals, and Yee Whye Teh. Attentive neural processes. arXiv preprint arXiv:1901.05761, 2019.

[23] Diederik P Kingma, Tim Salimans, and Max Welling. Variational dropout and the local reparametrization trick. In Advances in Neural Information Processing Systems, 2015.

[24] Diederik P Kingma and Max Welling. Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114, 2013.

[25] Achim Klenke. Probability theory: a comprehensive course. Springer Science & Business Media, 2013.

[26] Gregory Koch, Richard Zemel, and Ruslan Salakhutdinov. Siamese neural networks for one-shot image recognition. In ICML Deep Learning Workshop, 2015.

[27] Jaehoon Lee, Lechao Xiao, Samuel S Schoenholz, Yasaman Bahri, Jascha Sohl-Dickstein, and Jeffrey Pennington. Wide neural networks of any depth evolve as linear models under gradient descent. arXiv preprint arXiv:1902.06720, 2019.

[28] Shiyu Liang, Yixuan Li, and R Srikant. Enhancing the reliability of out-of-distribution image detection in neural networks. arXiv preprint arXiv:1706.02690, 2017.

[29] Christos Louizos, Karen Ullrich, and Max Welling. Bayesian compression for deep learning. In Advances in Neural Information Processing Systems, pages 3288–3298, 2017.

[30] Christos Louizos and Max Welling.
Structured and efficient variational deep learning with matrix gaussian posteriors. In International Conference on Machine Learning, pages 1708–1716, 2016.

[31] Christos Louizos and Max Welling. Multiplicative normalizing flows for variational bayesian neural networks. In Proceedings of the 34th International Conference on Machine Learning - Volume 70, pages 2218–2227. JMLR.org, 2017.

[32] Chao Ma, Yingzhen Li, and José Miguel Hernández-Lobato. Variational implicit processes. arXiv preprint arXiv:1806.02390, 2018.

[33] David JC MacKay. Probable networks and plausible predictions—a review of practical bayesian methods for supervised neural networks. Network: Computation in Neural Systems, 6(3):469–505, 1995.

[34] Chris J Maddison, Andriy Mnih, and Yee Whye Teh. The concrete distribution: A continuous relaxation of discrete random variables. arXiv preprint arXiv:1611.00712, 2016.

[35] César Lincoln C Mattos, Zhenwen Dai, Andreas Damianou, Jeremy Forth, Guilherme A Barreto, and Neil D Lawrence. Recurrent gaussian processes. arXiv preprint arXiv:1511.06644, 2015.

[36] Eric Nalisnick, Akihiro Matsukawa, Yee Whye Teh, Dilan Gorur, and Balaji Lakshminarayanan. Do deep generative models know what they don't know? arXiv preprint arXiv:1810.09136, 2018.

[37] Radford M Neal. Bayesian learning for neural networks. PhD thesis, University of Toronto, 1995.

[38] Roman Novak, Lechao Xiao, Yasaman Bahri, Jaehoon Lee, Greg Yang, Jiri Hron, Daniel A Abolafia, Jeffrey Pennington, and Jascha Sohl-Dickstein. Bayesian deep convolutional networks with many channels are gaussian processes. 2018.

[39] Peter Orbanz and Daniel M Roy. Bayesian models of graphs, arrays and other exchangeable random structures. IEEE Transactions on Pattern Analysis and Machine Intelligence, 37(2):437–461, 2015.

[40] Ian Osband, Charles Blundell, Alexander Pritzel, and Benjamin Van Roy.
Deep exploration via bootstrapped dqn. In Advances in Neural Information Processing Systems, pages 4026–4034, 2016.

[41] Carl Edward Rasmussen. Gaussian processes in machine learning. In Summer School on Machine Learning, pages 63–71. Springer, 2003.

[42] Danilo Jimenez Rezende, Shakir Mohamed, and Daan Wierstra. Stochastic backpropagation and approximate inference in deep generative models. arXiv preprint arXiv:1401.4082, 2014.

[43] Moshe Shaked and J George Shanthikumar. Stochastic orders. Springer Science & Business Media, 2007.

[44] Jiaxin Shi, Shengyang Sun, and Jun Zhu. Kernel implicit variational inference. arXiv preprint arXiv:1705.10119, 2017.

[45] Jake Snell, Kevin Swersky, and Richard Zemel. Prototypical networks for few-shot learning. In NIPS, 2017.

[46] Edward Snelson and Zoubin Ghahramani. Sparse gaussian processes using pseudo-inputs. In Advances in Neural Information Processing Systems, pages 1257–1264, 2006.

[47] Shengyang Sun, Guodong Zhang, Jiaxin Shi, and Roger Grosse. Functional variational bayesian neural networks. arXiv preprint arXiv:1903.05779, 2019.

[48] Flood Sung, Yongxin Yang, Li Zhang, Tao Xiang, Philip HS Torr, and Timothy M Hospedales. Learning to compare: Relation network for few-shot learning. 2018.

[49] Christian Szegedy, Wojciech Zaremba, Ilya Sutskever, Joan Bruna, Dumitru Erhan, Ian Goodfellow, and Rob Fergus. Intriguing properties of neural networks. arXiv preprint arXiv:1312.6199, 2013.

[50] Da Tang, Dawen Liang, Tony Jebara, and Nicholas Ruozzi. Correlated variational auto-encoders. arXiv preprint arXiv:1905.05335, 2019.

[51] Michalis Titsias. Variational learning of inducing variables in sparse gaussian processes. In Artificial Intelligence and Statistics, pages 567–574, 2009.

[52] Mark Van der Wilk, Carl Edward Rasmussen, and James Hensman. Convolutional gaussian processes.
In Advances in Neural Information Processing Systems, pages 2849–2858, 2017.

[53] Oriol Vinyals, Charles Blundell, Timothy Lillicrap, Daan Wierstra, et al. Matching networks for one shot learning. In Advances in Neural Information Processing Systems, pages 3630–3638, 2016.

[54] Andrew G Wilson, Zhiting Hu, Ruslan R Salakhutdinov, and Eric P Xing. Stochastic variational deep kernel learning. In Advances in Neural Information Processing Systems, pages 2586–2594, 2016.

[55] Andrew Gordon Wilson, Zhiting Hu, Ruslan Salakhutdinov, and Eric P Xing. Deep kernel learning. In Artificial Intelligence and Statistics, pages 370–378, 2016.

[56] Guodong Zhang, Shengyang Sun, David Duvenaud, and Roger Grosse. Noisy natural gradient as variational inference. arXiv preprint arXiv:1712.02390, 2017.