{"title": "BRUNO: A Deep Recurrent Model for Exchangeable Data", "book": "Advances in Neural Information Processing Systems", "page_first": 7190, "page_last": 7198, "abstract": "We present a novel model architecture which leverages deep learning tools to perform exact Bayesian inference on sets of high dimensional, complex observations. Our model is provably exchangeable, meaning that the joint distribution over observations is invariant under permutation: this property lies at the heart of Bayesian inference. The model does not require variational approximations to train, and new samples can be generated conditional on previous samples, with cost linear in the size of the conditioning set. The advantages of our architecture are demonstrated on learning tasks that require generalisation from short observed sequences while modelling sequence variability, such as conditional image generation, few-shot learning, and anomaly detection.", "full_text": "BRUNO: A Deep Recurrent Model for Exchangeable\n\nData\n\nIryna Korshunova \u2665\n\nGhent University\n\niryna.korshunova@ugent.be\n\nJonas Degrave \u2665 \u2020\nGhent University\n\njonas.degrave@ugent.be\n\nFerenc Husz\u00e1r\n\nTwitter\n\nfhuszar@twitter.com\n\nYarin Gal\n\nUniversity of Oxford\nyarin@cs.ox.ac.uk\n\nArthur Gretton \u2660\nGatsby Unit, UCL\n\narthur.gretton@gmail.com\n\nJoni Dambre \u2660\nGhent University\n\njoni.dambre@ugent.be\n\nAbstract\n\nWe present a novel model architecture which leverages deep learning tools to per-\nform exact Bayesian inference on sets of high dimensional, complex observations.\nOur model is provably exchangeable, meaning that the joint distribution over obser-\nvations is invariant under permutation: this property lies at the heart of Bayesian\ninference. The model does not require variational approximations to train, and new\nsamples can be generated conditional on previous samples, with cost linear in the\nsize of the conditioning set. 
The advantages of our architecture are demonstrated on learning tasks that require generalisation from short observed sequences while modelling sequence variability, such as conditional image generation, few-shot learning, and anomaly detection.

1 Introduction

We address the problem of modelling unordered sets of objects that have some characteristic in common. Set modelling has been a recent focus in machine learning, both due to relevant application domains and to efficiency gains when dealing with groups of objects [5, 18, 20, 23]. The relevant concept in statistics is the notion of an exchangeable sequence of random variables – a sequence where any re-ordering of the elements is equally likely. To fulfil this definition, subsequent observations must behave like previous ones, which implies that we can make predictions about the future. This property allows the formulation of some machine learning problems in terms of modelling exchangeable data. For instance, one can think of few-shot concept learning as learning to complete short exchangeable sequences [10]. A related example comes from the generative image modelling field, where we might want to generate images that are in some ways similar to the ones from a given set. At present, however, there are few flexible and provably exchangeable deep generative models to solve this problem.

Formally, a finite or infinite sequence of random variables x1, x2, x3, . . . is said to be exchangeable if for all n and all permutations π

p(x1, . . . , xn) = p(xπ(1), . . . , xπ(n)),  (1)

i.e. the joint probability remains the same under any permutation of the sequence. If random variables in the sequence are independent and identically distributed (i.i.d.), then it is easy to see that the sequence is exchangeable. The converse is false: exchangeable random variables can be correlated. One example of an exchangeable but non-i.i.d.
sequence is a sequence of variables x1, . . . , xn, which jointly have a multivariate normal distribution Nn(0, Σ) with the same variance and covariance for all the dimensions [1]: Σii = 1 and Σij,i≠j = ρ, with 0 ≤ ρ < 1.

♥♠ Equal contribution. † Now at DeepMind.
32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montréal, Canada.

The concept of exchangeability is intimately related to Bayesian statistics. De Finetti's theorem states that every exchangeable process (an infinite sequence of random variables) is a mixture of i.i.d. processes:

p(x1, . . . , xn) = ∫ p(θ) ∏_{i=1}^{n} p(xi|θ) dθ,  (2)

where θ is some parameter (finite or infinite dimensional) conditioned on which the random variables are i.i.d. [1]. In our previous Gaussian example, one can prove that x1, . . . , xn are i.i.d. with xi ∼ N(θ, 1 − ρ) conditioned on θ ∼ N(0, ρ).

In terms of predictive distributions p(xn|x1:n−1), the stochastic process in Eq. 2 can be written as

p(xn|x1:n−1) = ∫ p(xn|θ) p(θ|x1:n−1) dθ,  (3)

by conditioning both sides on x1:n−1. Eq. 3 is exactly the posterior predictive distribution, where we marginalise the likelihood of xn given θ with respect to the posterior distribution of θ. From this follows one possible interpretation of de Finetti's theorem: learning to fit an exchangeable model to sequences of data is implicitly the same as learning to reason about the hidden variables behind the data.

One strategy for defining models of exchangeable sequences is through explicit Bayesian modelling: one defines a prior p(θ), a likelihood p(xi|θ) and calculates the posterior in Eq. 2 directly.
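The correlated Gaussian example above can be checked numerically: with unit variances and a constant off-diagonal correlation ρ, the joint log-density is invariant under every permutation of the sequence, even though the variables are clearly not independent. A minimal sketch (all variable names are ours):

```python
import numpy as np
from itertools import permutations

n, rho = 4, 0.7
# Exchangeable covariance: unit variances, constant correlation rho
Sigma = np.full((n, n), rho)
np.fill_diagonal(Sigma, 1.0)

def log_density(x, Sigma):
    """Log-density of N(0, Sigma) at x."""
    _, logdet = np.linalg.slogdet(Sigma)
    quad = x @ np.linalg.solve(Sigma, x)
    return -0.5 * (len(x) * np.log(2 * np.pi) + logdet + quad)

x = np.array([0.3, -1.2, 0.8, 2.1])
ref = log_density(x, Sigma)
# Eq. 1: the joint density is the same for every reordering of the sequence
assert all(np.isclose(log_density(x[list(p)], Sigma), ref)
           for p in permutations(range(n)))
```

The same invariance is what an explicit Bayesian construction of a prior and a conditionally i.i.d. likelihood guarantees by design.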
Here,\nthe key dif\ufb01culty is the intractability of the posterior and the predictive distribution p(xn|x1:n\u22121).\nBoth of these expressions require integrating over the parameter \u03b8, so we might end up having to\nuse approximations. This could violate the exchangeability property and make explicit Bayesian\nmodelling dif\ufb01cult.\nOn the other hand, we do not have to explicitly represent the posterior to ensure exchangeability.\nOne could de\ufb01ne a predictive distribution p(xn|x1:n\u22121) directly, and as long as the process is\nexchangeable, it is consistent with Bayesian reasoning. The key dif\ufb01culty here is de\ufb01ning an easy-to-\ncalculate p(xn|x1:n\u22121) which satis\ufb01es exchangeability. For example, it is not clear how to train or\nmodify an ordinary recurrent neural network (RNN) to model exchangeable data. In our opinion, the\nmain challenge is to ensure that a hidden state contains information about all previous inputs x1:n\nregardless of sequence length.\nIn this paper, we propose a novel architecture which combines features of the approaches above,\nwhich we will refer to as BRUNO: Bayesian RecUrrent Neural mOdel. Our model is provably\nexchangeable, and makes use of deep features learned from observations so as to model complex data\ntypes such as images. To achieve this, we construct a bijective mapping between random variables\nxi \u2208 X in the observation space and features zi \u2208 Z, and explicitly de\ufb01ne an exchangeable model\nfor the sequences z1, z2, z3, . . . , where we know an analytic form of p(zn|z1:n\u22121) without explicitly\ncomputing the integral in Eq. 3.\nUsing BRUNO, we are able to generate samples conditioned on the input sequence by sampling\ndirectly from p(xn|x1:n\u22121). The latter is also tractable to evaluate, i. e. has linear complexity in the\nnumber of data points. 
In respect of model training, evaluating the predictive distribution requires a single pass through the neural network that implements the X ↦ Z mapping. The model can be learned straightforwardly, since p(xn|x1:n−1) is differentiable with respect to the model parameters.

The paper is structured as follows. In Section 2 we will look at two methods selected to highlight the relation of our work with previous approaches to modelling exchangeable data. Section 3 will describe BRUNO, along with necessary background information. In Section 4, we will use our model for conditional image generation, few-shot learning, set expansion and set anomaly detection. Our code is available at github.com/IraKorshunova/bruno.

2 Related work

Bayesian sets [6] aim to model exchangeable sequences of binary random variables by analytically computing the integrals in Eq. 2, 3. This is made possible by using a Bernoulli distribution for the likelihood and a beta distribution for the prior. To apply this method to other types of data, e.g. images, one needs to engineer a set of binary features [7]. In that case, there is usually no one-to-one mapping between the input space X and the feature space Z: in consequence, it is not possible to draw samples from p(xn|x1:n−1). Unlike Bayesian sets, our approach does have a bijective transformation, which guarantees that inference in Z is equivalent to inference in space X.

The neural statistician [5] is an extension of a variational autoencoder model [8, 15] applied to datasets. In addition to learning an approximate inference network over the latent variable zi for every xi in the set, approximate inference is also implemented over a latent variable c – a context that is global to the dataset. The architecture for the inference network q(c|x1, . . . , xn) maps every xi into a feature vector and applies a mean pooling operation across these representations.
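The permutation invariance that mean pooling provides is easy to verify in isolation. A toy sketch, where the per-element feature map is an arbitrary linear stand-in rather than the neural statistician's actual network:

```python
import numpy as np

rng = np.random.default_rng(1)
W = rng.normal(size=(5, 3))    # stand-in per-element feature map
xs = rng.normal(size=(7, 5))   # a "set" of 7 five-dimensional inputs

def pooled_features(xs):
    # Embed every element, then mean-pool across the set dimension:
    # the result is invariant to any reordering of the inputs.
    return (xs @ W).mean(axis=0)

perm = rng.permutation(len(xs))
assert np.allclose(pooled_features(xs), pooled_features(xs[perm]))
```

The pooled vector is what the inference network then maps to the parameters of its distribution over the context.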
The\nresulting vector is then used to produce parameters of a Gaussian distribution over c. Mean pooling\nmakes q(c|x1, . . . , xn) invariant under permutations of the inputs. In addition to the inference\nnetworks, the neural statistician also has a generative component p(x1, . . . , xn|c) which assumes\nthat xi\u2019s are independent given c. Here, it is easy to see that c plays the role of \u03b8 from Eq. 2. In\nthe neural statistician, it is intractable to compute p(x1, . . . , xn), so its variational lower bound\nis used instead. In our model, we perform an implicit inference over \u03b8 and can exactly compute\npredictive distributions and the marginal likelihood. Despite these differences, both neural statistician\nand BRUNO can be applied in similar settings, namely few-shot learning and conditional image\ngeneration, albeit with some restrictions, as we will see in Section 4.\n\n3 Method\n\nWe begin this section with an overview of the mathematical tools needed to construct our model: \ufb01rst\nthe Student-t process [17]; and then the Real NVP \u2013 a deep, stably invertible and learnable neural\nnetwork architecture for density estimation [4]. We next propose BRUNO, wherein we combine an\nexchangeable Student-t process with the Real NVP, and derive recurrent equations for the predictive\ndistribution such that our model can be trained as an RNN. Our model is illustrated in Figure 1.\n\nFigure 1: A schematic of the BRUNO model. It depicts how Bayesian thinking can lead to an\nRNN-like computational graph in which Real NVP is a bijective feature extractor and the recurrence\nis represented by Bayesian updates of an exchangeable Student-t process.\n\n3.1 Student-t processes\n\nThe Student-t process (T P) is the most general elliptically symmetric process with an analytically\nrepresentable density [17]. The more commonly used Gaussian processes (GP s) can be seen as\nlimiting case of T Ps. 
In what follows, we provide the background and definition of TPs.

Let us assume that z = (z1, . . . , zn) ∈ R^n follows a multivariate Student-t distribution MVTn(ν, µ, K) with degrees of freedom ν ∈ R+ \ [0, 2], mean µ ∈ R^n and a positive definite n × n covariance matrix K. Its density is given by

p(z) = [Γ((ν + n)/2) / (((ν − 2)π)^{n/2} Γ(ν/2))] |K|^{−1/2} (1 + (z − µ)^T K^{−1} (z − µ)/(ν − 2))^{−(ν + n)/2}.  (4)

For our problem, we are interested in computing a conditional distribution. Suppose we can partition z into two consecutive parts za ∈ R^{na} and zb ∈ R^{nb}, such that

(za, zb) ∼ MVTn(ν, (µa, µb), ((Kaa, Kab), (Kba, Kbb))).  (5)

Then the conditional distribution p(zb|za) is given by

p(zb|za) = MVT_{nb}(ν + na, µ̃b, ((ν + βa − 2)/(ν + na − 2)) K̃bb),
µ̃b = Kba Kaa^{−1} (za − µa) + µb,
βa = (za − µa)^T Kaa^{−1} (za − µa),
K̃bb = Kbb − Kba Kaa^{−1} Kab.  (6)

In the general case, when one needs to invert the covariance matrix, the complexity of computing p(zb|za) is O(na^3). These computations become infeasible for large datasets, which is a known bottleneck for GPs and TPs [13]. In Section 3.3, we will show that exchangeable processes do not have this issue.

The parameter ν, representing the degrees of freedom, has a large impact on the behaviour of TPs.
It controls how heavy-tailed the t-distribution is: as ν increases, the tails get lighter and the t-distribution gets closer to the Gaussian. From Eq. 6, we can see that as ν or na tends to infinity, the predictive distribution tends to the one from a GP. Thus, for small ν and na, a TP would give less certain predictions than its corresponding GP.

A second feature of the TP is the scaling of the predictive variance with the βa coefficient, which explicitly depends on the values of the conditioning observations. From Eq. 6, the value of βa is precisely the Hotelling statistic for the vector za, and has a χ²_{na} distribution with mean na in the event that za ∼ N_{na}(µa, Kaa). Looking at the weight (ν + βa − 2)/(ν + na − 2), we see that the variance of p(zb|za) is increased over the Gaussian default when βa > na, and is reduced otherwise. In other words, when the samples are dispersed more than they would be under the Gaussian distribution, the predictive uncertainty is increased compared with the Gaussian case. It is helpful in understanding these two properties to recall that the multivariate Student-t distribution can be thought of as a Gaussian distribution with an inverse Wishart prior on the covariance [17].

3.2 Real NVP

Real NVP [4] is a member of the normalising flows family of models, where some density in the input space X is transformed into a desired probability distribution in space Z through a sequence of invertible mappings [14]. Specifically, Real NVP proposes a design for a bijective function f : X ↦ Z with X = R^D and Z = R^D such that (a) the inverse is easy to evaluate, i.e. the cost of computing x = f⁻¹(z) is the same as for the forward mapping, and (b) computing the Jacobian determinant takes linear time in the number of dimensions D. Additionally, Real NVP assumes a simple distribution for z, e.g.
an isotropic Gaussian, so one can use a change of variables formula to evaluate p(x):

p(x) = p(z) |det(∂f(x)/∂x)|.  (7)

The main building block of Real NVP is a coupling layer. It implements a mapping X ↦ Y that transforms half of its inputs while copying the other half directly to the output:

y_{1:d} = x_{1:d}
y_{d+1:D} = x_{d+1:D} ⊙ exp(s(x_{1:d})) + t(x_{1:d}),  (8)

where ⊙ is an elementwise product, and s (scale) and t (translation) are arbitrarily complex functions, e.g. convolutional neural networks.

One can show that the coupling layer is a bijective, easily invertible mapping with a triangular Jacobian, and that composition of such layers preserves these properties. To obtain a highly nonlinear mapping f(x), one needs to stack coupling layers X ↦ Y1 ↦ Y2 · · · ↦ Z while alternating the dimensions that are being copied to the output.

To make good use of modelling densities, the Real NVP has to treat its inputs as instances of a continuous random variable [19]. To do so, integer pixel values in x are dequantised by adding uniform noise u ∈ [0, 1)^D. The values x + u ∈ [0, 256)^D are then rescaled to the [0, 1) interval and transformed with an elementwise function: f(x) = logit(α + (1 − 2α)x) with some small α.

3.3 BRUNO: the exchangeable sequence model

We now combine Bayesian and deep learning tools from the previous sections and present our model for exchangeable sequences, whose schematic is given in Figure 1.

Assume we are given an exchangeable sequence x1, . . . , xn, where every element is a D-dimensional vector: xi = (x^1_i, . . . , x^D_i). We apply a Real NVP transformation to every xi, which results in an exchangeable sequence in the latent space: z1, . . . , zn, where zi ∈ R^D.
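The coupling layer of Eq. 8 and its inverse can be sketched in a few lines; here s and t are tiny stand-in functions rather than the convolutional networks used in practice:

```python
import numpy as np

D, d = 6, 3
rng = np.random.default_rng(0)
Ws = rng.normal(size=(d, D - d))   # stand-in "scale" network weights
Wt = rng.normal(size=(d, D - d))   # stand-in "translation" network weights

def s(x1): return np.tanh(x1 @ Ws)
def t(x1): return x1 @ Wt

def coupling_forward(x):
    # Eq. 8: first half copied, second half transformed elementwise
    y = x.copy()
    y[d:] = x[d:] * np.exp(s(x[:d])) + t(x[:d])
    # Triangular Jacobian: log|det| is just the sum of the log-scales
    return y, np.sum(s(x[:d]))

def coupling_inverse(y):
    x = y.copy()
    x[d:] = (y[d:] - t(y[:d])) * np.exp(-s(y[:d]))
    return x

x = rng.normal(size=D)
y, logdet = coupling_forward(x)
assert np.allclose(coupling_inverse(y), x)   # exact, cheap inversion
```

Stacking several such layers while alternating which half is copied yields the full bijection f, and the log-Jacobian of the stack is the sum of the per-layer log-scale terms.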
The proof that the latter sequence is exchangeable is given in Appendix A.

We make the following assumptions about the latents:

A1: dimensions {z^d}_{d=1,...,D} are independent, so p(z) = ∏_{d=1}^{D} p(z^d)

A2: for every dimension d, we assume that (z^d_1, . . . , z^d_n) ∼ MVTn(ν^d, µ^d 1, K^d), with parameters:

• degrees of freedom ν^d ∈ R+ \ [0, 2]
• mean µ^d 1, a 1 × n dimensional vector of ones multiplied by the scalar µ^d ∈ R
• n × n covariance matrix K^d with K^d_{ii} = v^d and K^d_{ij,i≠j} = ρ^d, where 0 ≤ ρ^d < v^d to make sure that K^d is a positive-definite matrix that complies with the covariance properties of exchangeable sequences [1].

The exchangeable structure of the covariance matrix and having the same mean for every n guarantees that the sequence z^d_1, z^d_2, . . . , z^d_n is exchangeable. Because the covariance matrix is simple, we can derive recurrent updates for the parameters of p(z^d_{n+1}|z^d_{1:n}). Using the recurrence is a lot more efficient compared to the closed-form expressions in Eq. 6, since we want to compute the predictive distribution at every step n.

We start from a prior Student-t distribution for p(z1) with parameters µ1 = µ, v1 = v, ν1 = ν, β1 = 0. Here, we will drop the dimension index d to simplify the notation. A detailed derivation of the following results is given in Appendix B. To compute the degrees of freedom, mean and variance of p(zn+1|z1:n) for every n, we begin with the recurrent relations

νn+1 = νn + 1,   µn+1 = (1 − dn)µn + dn zn,   vn+1 = (1 − dn)vn + dn(v − ρ),  (9)

where dn = ρ/(v + ρ(n − 1)). Note that the GP recursions simply use the latter two equations, i.e. if we were to assume that (z^d_1, . . . , z^d_n) ∼ Nn(µ^d 1, K^d). For TPs, however, we also need to compute β – a data-dependent term that scales the covariance matrix as in Eq. 6.
To update β, we introduce recurrent expressions for the auxiliary variables:

ẑi = zi − µ,
a_n = (v + ρ(n − 2)) / ((v − ρ)(v + ρ(n − 1))),
b_n = −ρ / ((v − ρ)(v + ρ(n − 1))),
β_{n+1} = β_n + (a_n − b_n) ẑ_n^2 + b_n (Σ_{i=1}^{n} ẑ_i)^2 − b_{n−1} (Σ_{i=1}^{n−1} ẑ_i)^2.

From these equations, we see that the computational complexity of making predictions in exchangeable GPs or TPs scales linearly with the number of observations, i.e. O(n) instead of the general O(n^3) case where one needs to compute an inverse covariance matrix.

So far, we have constructed an exchangeable Student-t process in the latent space Z. By coupling it with a bijective Real NVP mapping, we get an exchangeable process in space X. Although we do not have an explicit analytic form of the transitions in X, we still can sample from this process and evaluate the predictive distribution via the change of variables formula in Eq. 7.

3.4 Training

Having an easy-to-evaluate autoregressive distribution p(xn+1|x1:n) allows us to use a training scheme that is common for RNNs, i.e. maximise the likelihood of the next element in the sequence at every step. Thus, our objective function for a single sequence of fixed length N can be written as L = Σ_{n=0}^{N−1} log p(xn+1|x1:n), which is equivalent to maximising the joint log-likelihood log p(x1, . . . , xN). While we do have a closed-form expression for the latter, we chose not to use it during training in order to minimise the difference between the implementation of training and testing phases.
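The recursions of Section 3.3 can be checked by brute force against the closed-form conditional of Eq. 6; a sketch with our own toy parameter values, using the fact that a_n − b_n simplifies to 1/(v − ρ):

```python
import numpy as np

mu0, v, rho, nu = 0.0, 2.0, 0.8, 5.0   # prior parameters (our toy values)
rng = np.random.default_rng(0)
N = 6
z = rng.normal(size=N)

def b(m):
    return -rho / ((v - rho) * (v + rho * (m - 1)))

# --- recurrent updates: Eq. 9 plus the beta recursion ---
mu_n, v_n, nu_n, beta, cum = mu0, v, nu, 0.0, 0.0
means, variances, betas = [], [], []
for n in range(1, N + 1):
    zt = z[n - 1] - mu0                     # centred observation
    d_n = rho / (v + rho * (n - 1))
    # a_n - b_n simplifies to 1/(v - rho)
    beta += zt**2 / (v - rho) + b(n) * (cum + zt)**2 - b(n - 1) * cum**2
    cum += zt
    mu_n = (1 - d_n) * mu_n + d_n * z[n - 1]
    v_n = (1 - d_n) * v_n + d_n * (v - rho)
    nu_n += 1
    means.append(mu_n); variances.append(v_n); betas.append(beta)

# --- closed form via Eq. 6, inverting the full covariance at each step ---
for n in range(1, N + 1):
    K = np.full((n, n), rho); np.fill_diagonal(K, v)
    za = z[:n] - mu0
    w = np.linalg.solve(K, za)
    ones = np.ones(n)
    assert np.isclose(means[n - 1], mu0 + rho * ones @ w)
    assert np.isclose(variances[n - 1],
                      v - rho**2 * ones @ np.linalg.solve(K, ones))
    assert np.isclose(betas[n - 1], za @ w)   # beta is the quadratic form
```

The O(n) loop and the O(n^3) matrix inversions agree at every step, which is the point of the recurrence.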
Note that at test time, dealing with the joint log-likelihood would be inconvenient or\neven impossible due to high memory costs when N gets large, which again motivates the use of a\nrecurrent formulation.\nDuring training, we update the weights of the Real NVP model and also learn the parameters of\nthe prior Student-t distribution. For the latter, we have three trainable parameters per dimension:\ndegrees of freedom \u03bdd, variance vd and covariance \u03c1d. The mean \u00b5d is \ufb01xed to 0 for every d and is\nnot updated during training.\n\n4 Experiments\n\nIn this section, we will consider a few problems that \ufb01t naturally into the framework of modeling\nexchangeable data. We chose to work with sequences of images, so the results are easy to analyse; yet\nBRUNO does not make any image-speci\ufb01c assumptions, and our conclusions can generalise to other\ntypes of data. Speci\ufb01cally, for non-image data, one can use a general-purpose Real NVP coupling\nlayer as proposed by Papamakarios et al. [12]. In contrast to the original Real NVP model, which uses\nconvolutional architecture for scaling and translation functions in Eq. 8, a general implementation\nhas s and t composed from fully connected layers. We experimented with both convolutional and\nnon-convolutional architectures, the details of which are given in Appendix C.\nIn our experiments, the models are trained on image sequences of length 20. We form each sequence\nby uniformly sampling a class and then selecting 20 random images from that class. This scheme\nimplies that a model is trained to implicitly infer a class label that is global to a sequence. In what\nfollows, we will see how this property can be used in a few tasks.\n\n4.1 Conditional image generation\n\nWe \ufb01rst consider a problem of generating samples conditionally on a set of images, which reduces to\nsampling from a predictive distribution. 
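With independent latent dimensions, sampling reduces to drawing from a univariate Student-t with a given mean and variance per dimension. The paper uses a GPU-friendly modification of Bailey's polar method (Appendix D); the sketch below instead uses the standard t generator available in numpy, rescaled to the convention in Eq. 4 where the variance parameter is the actual variance:

```python
import numpy as np

def sample_predictive_t(nu, mu, var, size, rng):
    """Draw from a univariate Student-t with mean mu and variance var.

    A standard t with nu degrees of freedom has variance nu / (nu - 2),
    so the sqrt((nu - 2) / nu) factor rescales it to variance var."""
    t = rng.standard_t(df=nu, size=size)
    return mu + np.sqrt(var * (nu - 2.0) / nu) * t

rng = np.random.default_rng(0)
samples = sample_predictive_t(nu=6.0, mu=1.5, var=0.5, size=200_000, rng=rng)
assert abs(samples.mean() - 1.5) < 0.05
assert abs(samples.var() - 0.5) < 0.05
```

This is a stand-in for the paper's sampler, not a reproduction of it; any method that respects the (ν − 2) variance convention can be substituted.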
This is different from a general Bayesian approach, where\none needs to infer the posterior over some meaningful latent variable and then \u2018decode\u2019 it.\nTo draw samples from p(xn+1|x1:n), we \ufb01rst sample z \u223c p(zn+1|z1:n) and then compute the\ninverse Real NVP mapping: x = f\u22121(z). Since we assumed that dimensions of z are independent,\nwe can sample each zd from a univariate Student-t distribution. To do so, we modi\ufb01ed Bailey\u2019s polar\nt-distribution generation method [2] to be computationally ef\ufb01cient for GPU. Its algorithm is given in\nAppendix D.\nIn Figure 2, we show samples from the prior distribution p(x1) and conditional samples from a\npredictive distribution p(xn+1|x1:n) at steps n = 1, . . . , 20. Here, we used a convolutional Real NVP\nmodel as a part of BRUNO. The model was trained on Omniglot [10] same-class image sequences of\nlength 20 and we used the train-test split and preprocessing as de\ufb01ned by Vinyals et al. [21]. Namely,\nwe resized the images to 28 \u00d7 28 pixels and augmented the dataset with rotations by multiples of 90\ndegrees yielding 4,800 and 1,692 classes for training and testing respectively.\nTo better understand how BRUNO behaves, we test it on special types of input sequences that were\nnot seen during training. In Appendix E, we give an example where the same image is used throughout\nthe sequence. In that case, the variability of the samples reduces as the models gets more of the\nsame input. This property does not hold for the neural statistician model [5], discussed in Section 2.\nAs mentioned earlier, the neural statistician computes the approximate posterior q(c|x1, . . . , xn)\nand then uses its mean to sample x from a conditional model p(x|cmean). This scheme does not\naccount for the variability in the inputs as a consequence of applying mean pooling over the features\nof x1, . . . , xn when computing q(c|x1, . . . , xn). 
Thus, when all xi\u2019s are the same, it would still\nsample different instances from the class speci\ufb01ed by xi. Given the code provided by the authors of\n\n6\n\n\fFigure 2: Samples generated conditionally on the sequence of the unseen Omniglot character class.\nAn input sequence is shown in the top row and samples in the bottom 4 rows. Every column of the\nbottom subplot contains 4 samples from the predictive distribution conditioned on the input images\nup to and including that column. That is, the 1st column shows samples from the prior p(x) when no\ninput image is given; the 2nd column shows samples from p(x|x1) where x1 is the 1st input image\nin the top row and so on.\n\nthe neural statistician and following an email exchange, we could not reproduce the results from their\npaper, so we refrained from making any direct comparisons.\nMore generated samples from convolutional and non-convolutional architectures trained on\nMNIST [11], Fashion-MNIST [22] and CIFAR-10 [9] are given in the appendix. For a couple\nof these models, we analyse the parameters of the learnt latent distributions (see Appendix F).\n\n4.2 Few-shot learning\n\nPreviously, we saw that BRUNO can generate images of the unseen classes even after being\nconditioned on a couple of examples. In this section, we will see how one can use its conditional\nprobabilities not only for generation, but also for a few-shot classi\ufb01cation.\nWe evaluate the few-shot learning accuracy of the model from Section 4.1 on the unseen Omniglot\ncharacters from the 1,692 testing classes following the n-shot and k-way classi\ufb01cation setup proposed\nby Vinyals et al. [21]. For every test case, we randomly draw a test image xn+1 and a sequence of n\nimages from the target class. At the same time, we draw n images for every of the k \u2212 1 random\ndecoy classes. To classify an image xn+1, we compute p(xn+1|xC=i\n1:n ) for each class i = 1 . . . k in\nthe batch. 
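This scoring rule can be illustrated with a one-dimensional toy stand-in for BRUNO's conditional density, where each class is scored with the exchangeable Gaussian predictive (our simplification; the real model scores images through the Real NVP and Student-t recursions):

```python
import numpy as np

def predictive_logpdf(x_new, support, mu0=0.0, v=1.0, rho=0.8):
    """log p(x_new | support) under the exchangeable Gaussian latent model
    with variance v and constant covariance rho (a 1-d toy for BRUNO)."""
    n = len(support)
    mean = mu0 + rho * np.sum(support - mu0) / (v + rho * (n - 1))
    var = v - n * rho**2 / (v + rho * (n - 1))
    return -0.5 * (np.log(2 * np.pi * var) + (x_new - mean)**2 / var)

rng = np.random.default_rng(0)
# Two "classes": exchangeable sequences centred at different latent means
support = {0: 2.0 + 0.3 * rng.normal(size=5),
           1: -2.0 + 0.3 * rng.normal(size=5)}
x_test = 1.9
# n-shot, k-way rule: pick the class whose conditioning set gives the
# highest predictive density for the test point
pred = max(support, key=lambda c: predictive_logpdf(x_test, support[c]))
assert pred == 0
```

The same argmax over conditional densities is what the evaluation on Omniglot performs, with images in place of scalars.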
An image is classi\ufb01ed correctly when the conditional probability is highest for the target\nclass compared to the decoy classes. This evaluation is performed 20 times for each of the test classes\nand the average classi\ufb01cation accuracy is reported in Table 1.\nFor comparison, we considered three models from Vinyals et al. [21]: (a) k-nearest neighbours\n(k-NN), where matching is done on raw pixels (Pixels), (b) k-NN with matching on discriminative\nfeatures from a state-of-the-art classi\ufb01er (Baseline Classi\ufb01er), and (c) Matching networks.\nWe observe that BRUNO model from Section 4.1 outperforms the baseline classi\ufb01er, despite having\nbeen trained on relatively long sequences with a generative objective, i.e. maximising the likelihood\nof the input images. Yet, it cannot compete with matching networks \u2013 a model tailored for a few-shot\nlearning and trained in a discriminative way on short sequences such that its test-time protocol exactly\nmatches the training time protocol. One can argue, however, that a comparison between models\ntrained generatively and discriminatively is not fair. Generative modelling is a more general, harder\nproblem to solve than discrimination, so a generatively trained model may waste a lot of statistical\npower on modelling aspects of the data which are irrelevant for the classi\ufb01cation task. To verify our\nintuition, we \ufb01ne-tuned BRUNO with a discriminative objective, i.e. maximising the likelihood of\ncorrect labels in n-shot, k-way classi\ufb01cation episodes formed from the training examples of Omniglot.\nWhile we could sample a different n and k for every training episode like in matching networks,\nwe found it suf\ufb01cient to \ufb01x n and k during training. Namely, we chose the setting with n = 1 and\nk = 20. 
From Table 1, we see that this additional discriminative training makes BRUNO competitive with state-of-the-art models across all n-shot and k-way tasks.

As an extension to the few-shot learning task, we showed that BRUNO could also be used for online set anomaly detection. These experiments can be found in Appendix H.

Table 1: Classification accuracy for a few-shot learning task on the Omniglot dataset.

Model                                 5-way 1-shot  5-way 5-shot  20-way 1-shot  20-way 5-shot
PIXELS [21]                           41.7%         63.2%         26.7%          42.6%
BASELINE CLASSIFIER [21]              80.0%         95.0%         69.5%          89.1%
MATCHING NETS [21]                    98.1%         98.9%         93.8%          98.5%
BRUNO                                 86.3%         95.6%         69.2%          87.7%
BRUNO (discriminative fine-tuning)    97.1%         99.4%         91.3%          97.8%

4.3 GP-based models

In practice, we noticed that training TP-based models can be easier compared to GP-based models, as they are more robust to anomalous training inputs and less sensitive to the choice of hyperparameters. Under certain conditions, we were not able to obtain convergent training with GP-based models, which was not the case when using TPs; an example is given in Appendix G. However, we found a few heuristics that make for successful training, such that TP- and GP-based models perform equally well in terms of test likelihoods, sample quality and few-shot classification results. For instance, it was crucial to use weight normalisation with a data-dependent initialisation of the parameters of the Real NVP [16]. As a result, one can opt for using GPs due to their simpler implementation. Nevertheless, a Student-t process remains a strictly richer model class for the latent space with negligible additional computational costs.

5 Discussion and conclusion

In this paper, we introduced BRUNO, a new technique combining deep learning and Student-t or Gaussian processes for modelling exchangeable data.
With this architecture, we may carry out implicit\nBayesian inference, avoiding the need to compute posteriors and eliminating the high computational\ncost or approximation errors often associated with explicit Bayesian inference.\nBased on our experiments, BRUNO shows promise for applications such as conditional image\ngeneration, few-shot concept learning, few-shot classi\ufb01cation and online anomaly detection. The\nprobabilistic construction makes the BRUNO approach particularly useful and versatile in transfer\nlearning and multi-task situations. To demonstrate this, we showed that BRUNO trained in a\ngenerative way achieves good performance in a downstream few-shot classi\ufb01cation task without any\ntask-speci\ufb01c retraining. Though, the performance can be signi\ufb01cantly improved with discriminative\n\ufb01ne-tuning.\nTraining BRUNO is a form of meta-learning or learning-to-learn: it learns to perform Bayesian\ninference on various sets of data. Just as encoding translational invariance in convolutional neural\nnetworks seems to be the key to success in vision applications, we believe that the notion of\nexchangeability is equally central to data-ef\ufb01cient meta-learning. In this sense, architectures like\nBRUNO and Deep Sets [23] can be seen as the most natural starting point for these applications.\nAs a consequence of exchangeability-by-design, BRUNO is endowed with a hidden state which\nintegrates information about all inputs regardless of sequence length. This desired property for\nmeta-learning is usually dif\ufb01cult to ensure in general RNNs as they do not automatically generalise to\nlonger sequences than they were trained on and are sensitive to the ordering of inputs. Based on this\nobservation, the most promising applications for BRUNO may fall in the many-shot meta-learning\nregime, where larger sets of data are available in each episode. 
Such problems naturally arise in
privacy-preserving on-device machine learning, or federated meta-learning [3], which is a potential
future application area for BRUNO.

Acknowledgements

We would like to thank Lucas Theis for his conceptual contributions to BRUNO, Conrado Miranda
and Frederic Godin for their helpful comments on the paper, Wittawat Jitkrittum for useful discussions,
and Lionel Pigou for setting up the hardware.

References

[1] Aldous, D., Hennequin, P., Ibragimov, I., and Jacod, J. (1985). École d'Été de Probabilités de Saint-Flour
XIII, 1983. Lecture Notes in Mathematics. Springer Berlin Heidelberg.

[2] Bailey, R. W. (1994). Polar generation of random variates with the t-distribution. Math. Comp., 62(206):779–781.

[3] Chen, F., Dong, Z., Li, Z., and He, X. (2018). Federated meta-learning for recommendation. arXiv preprint
arXiv:1802.07876.

[4] Dinh, L., Sohl-Dickstein, J., and Bengio, S. (2017). Density estimation using Real NVP. In Proceedings of
the 5th International Conference on Learning Representations.

[5] Edwards, H. and Storkey, A. (2017). Towards a neural statistician. In Proceedings of the 5th International
Conference on Learning Representations.

[6] Ghahramani, Z. and Heller, K. A. (2006). Bayesian sets. In Weiss, Y., Schölkopf, B., and Platt, J. C., editors,
Advances in Neural Information Processing Systems 18, pages 435–442. MIT Press.

[7] Heller, K. A. and Ghahramani, Z. (2006). A simple Bayesian framework for content-based image retrieval.
In IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pages 2110–2117.

[8] Kingma, D. P. and Welling, M. (2014). Auto-encoding variational Bayes. In Proceedings of the 2nd
International Conference on Learning Representations.

[9] Krizhevsky, A. (2009). Learning multiple layers of features from tiny images. Technical report.

[10] Lake, B. M., Salakhutdinov, R., and Tenenbaum, J. B. (2015). Human-level concept learning through
probabilistic program induction. Science.

[11] LeCun, Y., Cortes, C., and Burges, C. J. (1998). The MNIST database of handwritten digits.

[12] Papamakarios, G., Murray, I., and Pavlakou, T. (2017). Masked autoregressive flow for density estimation.
In Advances in Neural Information Processing Systems 30, pages 2335–2344.

[13] Rasmussen, C. E. and Williams, C. K. I. (2005). Gaussian Processes for Machine Learning (Adaptive
Computation and Machine Learning). The MIT Press.

[14] Rezende, D. and Mohamed, S. (2015). Variational inference with normalizing flows. In Proceedings of
the 32nd International Conference on Machine Learning, volume 37 of Proceedings of Machine Learning
Research, pages 1530–1538.

[15] Rezende, D. J., Mohamed, S., and Wierstra, D. (2014). Stochastic backpropagation and approximate
inference in deep generative models. In Proceedings of the 31st International Conference on Machine
Learning, pages 1278–1286.

[16] Salimans, T. and Kingma, D. P. (2016). Weight normalization: A simple reparameterization to accelerate
training of deep neural networks. In Proceedings of the 30th International Conference on Neural Information
Processing Systems.

[17] Shah, A., Wilson, A. G., and Ghahramani, Z. (2014). Student-t processes as alternatives to Gaussian
processes. In Proceedings of the 17th International Conference on Artificial Intelligence and Statistics, pages
877–885.

[18] Szabo, Z., Sriperumbudur, B., Poczos, B., and Gretton, A. (2016). Learning theory for distribution
regression. Journal of Machine Learning Research, 17(152).

[19] Theis, L., van den Oord, A., and Bethge, M. (2016). A note on the evaluation of generative models. In
Proceedings of the 4th International Conference on Learning Representations.

[20] Vinyals, O., Bengio, S., and Kudlur, M. (2016a). Order matters: Sequence to sequence for sets. In
Proceedings of the 4th International Conference on Learning Representations.

[21] Vinyals, O., Blundell, C., Lillicrap, T., Kavukcuoglu, K., and Wierstra, D. (2016b). Matching networks for
one shot learning. In Advances in Neural Information Processing Systems 29, pages 3630–3638.

[22] Xiao, H., Rasul, K., and Vollgraf, R. (2017). Fashion-MNIST: a novel image dataset for benchmarking
machine learning algorithms. arXiv preprint arXiv:1708.07747.

[23] Zaheer, M., Kottur, S., Ravanbakhsh, S., Poczos, B., Salakhutdinov, R. R., and Smola, A. J. (2017). Deep
sets. In Advances in Neural Information Processing Systems 30, pages 3394–3404.