{"title": "Learning Invariances using the Marginal Likelihood", "book": "Advances in Neural Information Processing Systems", "page_first": 9938, "page_last": 9948, "abstract": "In many supervised learning tasks, learning what changes do not affect the prediction target is as crucial to generalisation as learning what does. Data augmentation is a common way to enforce a model to exhibit an invariance: training data is modified according to an invariance designed by a human and added to the training data. We argue that invariances should be incorporated in the model structure, and learned using the marginal likelihood, which can correctly reward the reduced complexity of invariant models. We incorporate invariances in a Gaussian process, due to good marginal likelihood approximations being available for these models. Our main contribution is a derivation of a variational inference scheme for invariant Gaussian processes where the invariance is described by a probability distribution that can be sampled from, much like how data augmentation is implemented in practice.", "full_text": "Learning Invariances using the Marginal Likelihood\n\nMark van der Wilk\nPROWLER.io, Cambridge, UK\nmark@prowler.io\n\nMatthias Bauer\nMPI for Intelligent Systems, University of Cambridge\nmsb55@cam.ac.uk\n\nST John\nPROWLER.io, Cambridge, UK\nst@prowler.io\n\nJames Hensman\nPROWLER.io, Cambridge, UK\njames@prowler.io\n\nAbstract\n\nGeneralising well in supervised learning tasks relies on correctly extrapolating the training data to a large region of the input space. One way to achieve this is to constrain the predictions to be invariant to transformations of the input that are known to be irrelevant (e.g. translation). Commonly, this is done through data augmentation, where the training set is enlarged by applying hand-crafted transformations to the inputs. 
We argue that invariances should instead be incorporated in the model structure, and learned using the marginal likelihood, which correctly rewards the reduced complexity of invariant models. We demonstrate this for Gaussian process models, due to the ease with which their marginal likelihood can be estimated. Our main contribution is a variational inference scheme for Gaussian processes containing invariances described by a sampling procedure. We learn the sampling procedure by backpropagating through it to maximise the marginal likelihood.\n\n1 Introduction\n\nIn supervised learning, we want to predict some quantity y \u2208 Y given an input x \u2208 X from a limited number of N training examples {xn, yn}_{n=1}^N. We want our model to make correct predictions in as much of the input space X as possible. By constraining our predictor to make similar predictions for inputs which are modified in ways that are irrelevant to the prediction (e.g. small translations, rotations, or deformations for handwritten digits), we can generalise what we learn from a single training example to a wide range of new inputs. It is common to encourage these invariances by training on a dataset that is enlarged by training examples that have undergone modifications that are known to not influence the output \u2013 a technique known as data augmentation. Developing an augmentation for a particular dataset relies on expert knowledge, trial and error, and cross-validation. This human input makes data augmentation undesirable from a machine learning perspective, akin to hand-crafting features. It is also unsatisfactory from a Bayesian perspective, according to which assumptions and expert knowledge should be explicitly encoded in the prior distribution only. 
By adding data that are not true observations, the posterior may become overconfident, and the marginal likelihood can no longer be used to compare to other models.\nIn this work, we argue that data augmentation should be formulated as an invariance in the functions that are learnable by the model. To do so, we investigate prior distributions which incorporate invariances. The main benefit of treating invariances in this way is that models with different invariances can be compared using the marginal likelihood. As a consequence, parameterised invariances can even be learned by backpropagating through the marginal likelihood.\n\nFigure 1: Samples describing the learned invariance for some example MNIST digits (squares). The method becomes insensitive to the rotations and shears that are present in the samples.\n\n32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montr\u00e9al, Canada.\n\nWe start from Gaussian process (GP) approximations, as they provide high-quality marginal likelihood approximations. We build on earlier work by developing a practical variational inference scheme for Gaussian process models that have invariances built into the kernel. Our approach overcomes the major technical obstacle that our invariant kernels cannot be computed in closed form, which has been a requirement for kernel methods until now. Instead, we only require unbiased estimates of the kernel for learning the GP and its hyperparameters. The estimate is constructed by sampling from a distribution that characterises the invariance (fig. 1).\nWe believe this work to be exciting, as it simultaneously provides a Bayesian alternative to data augmentation, and a natural method for learning invariances in a supervised manner. 
Additionally, the ability to use kernels that do not admit closed-form evaluation may be of use for the kernel community in general, as it may open the door to new kernels with interesting properties beyond invariance.\n\n2 Related work\n\nIncorporating invariances into machine learning models is commonplace, and has been addressed in many ways over the years. Despite the wide variety of methods for incorporating given invariances, there are few solutions for learning which invariances to use. Our approach is unique in that it performs direct end-to-end training using a supervised objective function. Here we present a brief review of existing methods, grouped into three rough categories.\nData augmentation. As discussed, data augmentation refers to creating additional training examples by transforming training inputs in ways that do not change the prediction [Beymer and Poggio, 1995; Niyogi et al., 1998]. The larger dataset constrains the model\u2019s predictions to be correct for a larger region of the input space. For example, Loosli et al. [2007] propose augmenting with small rotations, scaling, thickening/thinning and deformations. On modern deep learning tasks like ImageNet [Deng et al., 2009], it is standard to apply flips, crops, and colour alterations [Krizhevsky et al., 2012; He et al., 2016]. Most attempts at learning the data augmentation focus on generating more data from unsupervised models trained on the inputs. Hauberg et al. [2016] learn a distribution of mappings that transform between pairs of input images, which are then sampled and applied to random training images, while Antoniou et al. [2017] use a GAN to capture the input density.\nModel constraints. 
An alternative to letting a flexible model learn an invariance described by a data augmentation is to constrain the model to exhibit invariances through clever parameterisation. Convolutional Neural Networks (CNNs) [LeCun et al., 1989, 1998] are a ubiquitous example of this, and work by applying the same filters across different image locations, giving a form of translation invariance. Cohen and Welling [2016] extend this with filters that are shared across other transformations like rotations. Invariances have also been incorporated into kernel methods. MacKay [1998] introduced the periodic kernel for functions invariant to shifts by the period. More sophisticated invariances suitable for images, like translation invariance, were discussed by Kondor [2008]. Ginsbourger et al. [2012, 2013, 2016] investigated similar kernels in the context of Gaussian processes. More recently, van der Wilk et al. [2017] introduced a Gaussian process with generalisation properties similar to CNNs, together with a tractable approximation. The same method can also be used to improve the scaling of the invariant kernels introduced by the earlier papers, and our method is based on it. For similar kernels, Raj et al. [2017] present a random feature based model approximation for invariances that are not learned. A final approach is to map the inputs to some fundamental space which is constant for all inputs that we want to be invariant to [Kondor, 2008; Ginsbourger et al., 2012]. For example, we can achieve rotational invariance by mapping the input image to some canonical rotation, on which classification is then performed. Jaderberg et al. 
[2015] do this by learning to \u201cuntransform\u201d digits to a canonical orientation before performing classification.\nRegularisation. Instead of placing hard constraints on the functions that can be represented, regularisation encourages desired solutions by adding extra penalties to the objective function. Simard et al. [1992] encourage locally invariant solutions by penalising the derivative of the classifier in the directions that the model should be invariant to. SVMs have also been regularised to encourage invariance to local perturbations, notably in Sch\u00f6lkopf et al. [1998], Chapelle and Sch\u00f6lkopf [2002], and Graepel and Herbrich [2004].\n\n3 The influence of invariance on the marginal likelihood\n\nIn this work, we aim to improve the generalisation ability of a function f : X \u2192 Y by constraining it to be invariant. By following the Bayesian approach and making the invariance part of the prior on f(\u00b7), we can use the marginal likelihood to learn the correct invariances in a supervised manner. In this section we will justify our approach, first by defining invariance, and then by showing why the marginal likelihood, rather than the \u201cregular\u201d likelihood, is a natural objective for learning.\n3.1 Invariance\nIn this work we will distinguish between what we will refer to as \u201cstrict invariance\u201d and \u201cinsensitivity\u201d. For the definition of strict invariance we follow the standard definition that is also used by Kondor [2008, \u00a74.4] and Ginsbourger et al. [2012], where we require the value of our function f(\u00b7) to remain unchanged if any transformation t : X \u2192 X from a set T is applied to the input:\n\nf(x) = f(t(x)), \u2200x \u2208 X, \u2200t \u2208 T. (1)\n\nThe set of all transformations T determines the invariance. For example, T would be the set of all rotations if we want f(\u00b7) to be rotationally invariant.\nFor many tasks, imposing strict invariance is too restrictive. For example, imposing rotational invariance will likely help the recognition of handwritten 2s, especially if they are presented in a haphazardly rotated way, while this same invariance may be detrimental for telling apart 6s and 9s in their natural orientation. For this reason, our main focus in this paper is on approximate invariances, where we want our function to not change \u201ctoo much\u201d after transformations on the input. We call this notion of invariance insensitivity. This notion is actually the most common in the related work, with data augmentation and regularisation approaches only enforcing f(\u00b7) to take similar values for transformed inputs, rather than exactly the same value. In this work we formalise insensitivity as controlling the probability of a large deviation in f(\u00b7) after applying a random transformation t \u2208 T to the input:\n\nP([f(x) \u2212 f(t(x))]^2 > L) < \u03b5, \u2200x \u2208 X, t \u223c p(t). (2)\n\nWhen working with insensitivity in practice, we conceptually think about the distribution of points that are generated by the transformations, rather than the transformations themselves. This gives a formulation that is much closer to the aim of data augmentation: a limit on the variation of the function for augmented points. Writing the distribution of points xa obtained from applying the transformations as p(xa|x), we can instead write:\n\nP([f(x) \u2212 f(xa)]^2 > L) < \u03b5, \u2200x \u2208 X, xa \u223c p(xa|x). (3)\n\nFor the remainder of this paper we will refer to both strict invariance and insensitivity as simply \u201cinvariance\u201d. Our method treats both similarly, with strict invariance being a special case of insensitivity.\nFrom these definitions, we can see how these constraints can improve generalisation. 
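As a toy illustration (not from the paper's code), a strictly invariant function in the sense of eq. (1) can be constructed for the coordinate-swap transformation by symmetrising an arbitrary base function g over the two argument orderings; the base function `g` below is a made-up example:

```python
import numpy as np

def make_swap_invariant(g):
    """Return f(x1, x2) = g(x1, x2) + g(x2, x1), which satisfies
    f(x) = f(t(x)) for the swap transformation t(x1, x2) = (x2, x1)."""
    return lambda x1, x2: g(x1, x2) + g(x2, x1)

# hypothetical non-invariant base function
g = lambda x1, x2: np.sin(x1) * x2
f = make_swap_invariant(g)

print(np.isclose(f(0.3, 1.7), f(1.7, 0.3)))  # True: invariant to the swap
```

The base function g itself takes different values under the swap; only the symmetrised f is constrained.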
While the prediction of a non-invariant learning method is only influenced in a small region around a training point, invariant models are constrained to make similar predictions in a much larger area, with the area being determined by the set or distribution of transformations. Insensitivity is particularly useful, as it allows local invariances. Making f(\u00b7) insensitive to rotation can help the classification of 6s that have been rotated by small angles, while also allowing f(\u00b7) to give a different prediction for 9s.\n3.2 Marginal likelihood\nMost current machine learning models are trained by maximising the regularised likelihood p(y|f(\u00b7)) with respect to parameters of the regression function f(\u00b7). One issue with this objective function is that it does not distinguish between models which fit the training data equally well, but will have different generalisation characteristics. We see an example of this in fig. 2. The figure shows data from a function that is invariant to switching the two input coordinates. We can train a model with the invariance embedded into the prior, and a non-invariant model. In terms of RMSE on the training set (which is proportional to the log likelihood), both models fit the training data very well, with the non-invariant model even fitting marginally better. However, the invariant model generalises much better, as points on one half of the input inform the function on the other half.\nThe marginal likelihood is found by integrating the likelihood p(y|f) over the prior on f(\u00b7),\n\np(y|\u03b8) = \u222b p(y|f) p(f|\u03b8) df, (4)\n\nand effectively captures the model complexity as well as the data fit [Rasmussen and Ghahramani, 2001; MacKay, 2002; Rasmussen and Williams, 2005]. It is also closely related to bounds on the generalisation error [Seeger, 2003; Germain et al., 2016]. Our example in fig. 
2 also shows that the marginal likelihood does correctly identify the invariant model as the one that generalises best.\n\nFigure 2: Data from a symmetric function (left) with the solutions learned by invariant (middle) and non-invariant (right) Gaussian processes. While the non-invariant model fits better to the training data, the invariant model generalises better. The marginal likelihood identifies the best model. (Invariant model: log marginal lik. = \u221234.906, training set RMSE = 0.007, test set RMSE = 0.43. Non-invariant model: log marginal lik. = \u221243.768, training set RMSE = 0.001, test set RMSE = 0.87.)\n\nTo understand how the invariance affects the marginal likelihood, and why a high marginal likelihood can indicate good generalisation performance, we decompose it using the product rule of probability and by splitting up our dataset y into chunks {y1, y2, . . . , yC}:\n\np(y|\u03b8) = p(y1|\u03b8) p(y2|y1, \u03b8) p(y3|y1:2, \u03b8) \u220f_{c=4}^C p(yc|y1:c\u22121, \u03b8). (5)\n\nFrom this we see that the marginal likelihood measures how well previous chunks of data predict future ones. It seems reasonable that if previous chunks of the training set accurately predict later ones, that our entire training set will predict well on a test set as well. We can apply this insight to the example in fig. 2 by dividing the training set into chunks consisting of the points in the top left, and the bottom right halves. The non-invariant model only generalises locally, and will therefore make predictions close to the prior for the opposing half. 
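The chunked decomposition of eq. (5) can be verified numerically in the Gaussian case. The sketch below (illustrative, not the paper's code; kernel and data are made up) checks that the full GP log marginal likelihood equals log p(y1) + log p(y2 | y1):

```python
import numpy as np

def log_gauss(y, mu, K):
    """Log density of N(y; mu, K) via a Cholesky factorisation."""
    d = y - mu
    L = np.linalg.cholesky(K)
    a = np.linalg.solve(L, d)
    return -0.5 * (a @ a) - np.log(np.diag(L)).sum() - 0.5 * len(y) * np.log(2 * np.pi)

def rbf(A, B, var=1.0, ell=1.0):
    """Squared-exponential kernel on 1-D inputs."""
    return var * np.exp(-0.5 * ((A[:, None] - B[None, :]) / ell) ** 2)

X = np.linspace(0, 5, 6)
y = np.sin(X)
K = rbf(X, X) + 0.1 * np.eye(6)          # kernel matrix plus noise variance

# full marginal likelihood p(y | theta)
full = log_gauss(y, np.zeros(6), K)

# chunked: p(y) = p(y1) p(y2 | y1), with y1 the first three points
K11, K12, K22 = K[:3, :3], K[:3, 3:], K[3:, 3:]
chunk1 = log_gauss(y[:3], np.zeros(3), K11)
mu2 = K12.T @ np.linalg.solve(K11, y[:3])        # conditional mean of y2 given y1
S2 = K22 - K12.T @ np.linalg.solve(K11, K12)     # conditional covariance
chunk2 = log_gauss(y[3:], mu2, S2)

print(np.isclose(full, chunk1 + chunk2))  # True
```

The same identity holds for any number of chunks, which is what makes the "previous chunks predict future ones" reading of the marginal likelihood exact rather than heuristic.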
The invariant model is constrained to predict exactly the same for the opposing half as it has learned for the observed half, and will therefore be confident and correct, giving a much higher marginal likelihood. Note that if the invariance had been detrimental to predictive performance (e.g. if f(\u00b7) was constrained to be anti-symmetric) the marginal likelihood would have been poor, as the model would have made incorrect predictions for y2.\nGiven that the marginal likelihood correctly identifies which invariances in the prior benefit generalisation, we focus our efforts in the rest of this paper on finding a good approximation that can be practically used to learn invariances.\n\n4 Inference for Gaussian processes with invariances\n\nIn the previous section we argued that the marginal likelihood was a more appropriate objective function for learning invariances than the regular likelihood. Marginal likelihoods are commonly hard to compute, but good approximations exist for Gaussian processes1 with simple kernels. In this section, we focus our efforts on Gaussian processes based on kernels with complex, parameterised invariances built in, for which we will derive a practical marginal likelihood approximation.\n1Conceptually, a neural network could be used if an accurate estimate of its marginal likelihood were available.\nOur approximation is based on earlier variational lower bounds for Gaussian processes. While Turner and Sahani [2011] point out that variational bounds do introduce bias to hyperparameter estimates, the bias is well-understood in our case, and is reduced by using sufficiently many inducing points [Bauer et al., 2016]. In the literature, this method is commonplace for both regression and classification tasks [Titsias, 2009; Hensman et al., 2013, 2015a; van der Wilk et al., 2017; Kim and Teh, 2018].\n4.1 Invariant Gaussian processes\nOur starting point is the double-sum construction for priors over strictly invariant functions [Kondor, 2008; Ginsbourger et al., 2012], which we briefly review. If f(\u00b7) is strictly invariant to a set of transformations, f(\u00b7) must also be invariant to compositions of transformations, as the same invariance holds at the transformed point t(x). The set of all compositions of transformations forms the group G. The set of all points that can be obtained by applying transformations in G to a point x forms the orbit of x: A(x) = {t(x) | t \u2208 G}. We use P to denote the number of elements in A(x). All input points which can be transformed into one another by an element of G share an orbit, and must also have the same function value. This implies a simple construction of a strictly invariant function f(\u00b7) from a non-invariant function g(\u00b7). We can simply sum g(\u00b7) over the orbit of a point:\n\nf(x) = \u2211_{xa \u2208 A(x)} g(xa). (6)\n\nBy placing a GP prior on g(\u00b7) \u223c GP(0, kg(\u00b7,\u00b7)), we imply a GP on invariant functions f(\u00b7), since Gaussians are closed under summation. We find that the prior on f(\u00b7) has a double-sum kernel:\n\nkf(x, x') = Eg[\u2211_{xa \u2208 A(x)} g(xa) \u2211_{x'a \u2208 A(x')} g(x'a)] = \u2211_{xa \u2208 A(x)} \u2211_{x'a \u2208 A(x')} kg(xa, x'a). (7)\n\nEarlier we argued that insensitivity is often more desirable. In order to relax the constraint of strict invariance, we relax the constraint that we sum over an orbit. Instead, we consider A(x) to be an arbitrary set of points, which we will refer to as an augmentation set, describing what input changes f(\u00b7) should be insensitive to. 
If two inputs have significantly overlapping augmentation sets, then their corresponding function values are constrained to be similar, as many terms in the sum of eq. (6) are shared (see appendix A for how this achieves insensitivity in the sense of eq. (2)). This kernel was also studied by Dao et al. [2018] as a first-order approximation of data augmentation, and Raj et al. [2017] as a local invariance.\nWe can also consider infinite augmentation sets, where we describe the relative density of elements using a probability density, which we refer to as the augmentation density p(xa|x). We will take p(xa|x) to be a process which perturbs the training data, much like how data augmentation is performed. Following a similar argument to the above, this results in a kernel that is doubly integrated over the augmentation distribution p(xa|x):\n\nkf(x, x') = \u222b\u222b p(xa|x) p(x'a|x') kg(xa, x'a) dxa dx'a. (8)\n\nWe collect all the parameters of the kernel, consisting of the parameters of the augmentation distribution and the base kernel, in \u03b8 (dropped from notation for brevity), and treat them as hyperparameters of the model, which we will tune using an approximation to the marginal likelihood.\nWhen using these kernels, we face an additional obstacle on top of the usual problems with scalability and non-conjugate inference. The sums over large orbits prohibit the evaluation of eq. (7), while the integrals in eq. (8) are analytically intractable for interesting invariances. Over the next few sections, we develop an approximation that will allow us to evaluate a lower bound to the marginal likelihood which only requires samples of the orbit A(x) or augmentation distribution p(xa|x). 
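Although eq. (8) has no closed form for interesting augmentation densities, a Monte Carlo estimate only needs samples from p(xa|x). A minimal sketch (illustrative, not the paper's code: a Gaussian jitter stands in for a real augmentation such as random rotations, and the base kernel is an assumed RBF on scalars):

```python
import numpy as np

rng = np.random.default_rng(0)

def kg(a, b, ell=1.0):
    """Base RBF kernel, evaluated elementwise."""
    return np.exp(-0.5 * ((a - b) / ell) ** 2)

def sample_aug(x, S, noise=0.1):
    """Draw S samples from an assumed augmentation density p(xa | x);
    here a small Gaussian perturbation of the input."""
    return x + noise * rng.standard_normal(S)

def kf_estimate(x, xp, S=500):
    """Monte Carlo estimate of kf(x, x') = E[kg(xa, x'a)],
    averaging kg over all S x S pairs of augmentation samples."""
    xa = sample_aug(x, S)
    xpa = sample_aug(xp, S)
    return kg(xa[:, None], xpa[None, :]).mean()

print(kf_estimate(0.0, 0.5))  # close to kg(0, 0.5) when the augmentation noise is small
```

For a small augmentation noise the estimate stays near the base kernel value; widening the augmentation density smooths kf and hence makes f less sensitive to the corresponding input changes.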
This allows us to minibatch not only over examples in the training set, but also over samples that describe the desired invariances.\n4.2 Variational inference using inducing points\nWe want to use the invariant GP defined in the previous section as a prior over functions for regression and classification models. With slight abuse of notation we denote our prior p(f), which will be Gaussian for the marginal of any finite set of input points. We denote sets of inputs as matrices X \u2208 R^{N\u00d7D}, and observations as vectors f = {f(xn)}_{n=1}^N = f(X). We denote our model:\n\nf | \u03b8 \u223c GP(0, kf(\u00b7,\u00b7)), yn | f, xn \u223c p(yn | f(xn)) i.i.d., (9)\n\nwhere p(yi|f(xi)) is a Gaussian likelihood for regression, Bernoulli for binary classification, etc. The marginal of p(f) for a finite number of observations is a Gaussian distribution with covariance Kff:\n\np(f(X)) = p(f) = N(0, Kff), [Kff]_{nn'} = kf(xn, xn'). (10)\n\nInference in GPs suffers from two main difficulties. First, inference is only analytically tractable for Gaussian likelihoods. Second, computation in GP models is well-known to scale badly with the dataset size, requiring O(N^3) computations on Kff. Approximate inference using inducing variables [Qui\u00f1onero-Candela and Rasmussen, 2005] can be used to address both of these problems. We follow the variational approach (referring to Titsias [2009]; Hensman et al. [2013, 2015b] for the full details) by introducing an approximate Gaussian process posterior denoted q(f), which is constructed by conditioning on M < N \u201cinducing observations\u201d. 
The shape of q(f) can be adjusted by changing the input locations {zm}_{m=1}^M = Z and output mean m and covariance S of these inducing observations. This results in an approximate posterior of the form q(f) = GP(\u00b5(\u00b7), \u03bd(\u00b7,\u00b7)) with\n\n\u00b5(\u00b7) = ku(\u00b7)^T Kuu^{\u22121} m, \u03bd(\u00b7,\u00b7) = k(\u00b7,\u00b7) \u2212 ku(\u00b7)^T Kuu^{\u22121} [Kuu \u2212 S] Kuu^{\u22121} ku(\u00b7), (11)\n\nwhere [Kuu]_{mm'} = k(zm, zm') (analogous to Kff only using the inducing input locations Z), and ku(\u00b7) = [k(zm,\u00b7)]_{m=1}^M is the covariance between the inducing outputs and the rest of the process. We select the variational parameters by numerical maximisation of the marginal likelihood lower bound (or, evidence lower bound: ELBO) using stochastic optimisation [Hensman et al., 2013]. This correctly minimises the KL divergence between the approximate and exact posteriors KL[q(f)||p(f|y)] [Matthews et al., 2016]. The ELBO is given by\n\nlog p(y) \u2265 L = \u2211_{n=1}^N E_{q(f(xn))}[log p(yn|f(xn))] \u2212 KL[q(u)||p(u)]. (12)\n\nWe find the expectation under q(f(xn)) either analytically or by Monte Carlo. In order to reduce the cost of evaluating the whole sum, we evaluate the bound stochastically by sub-sampling. This technique allows Gaussian processes to be applied to large datasets with general likelihoods. However, it does not address the issue of needing to evaluate our intractable kernel kf (eqs. (7) and (8)).\n4.3 Inter-domain inducing variables\nThe problem of evaluating large double sums was encountered by van der Wilk et al. [2017] for convolutional and strictly invariant kernels. Their solution was based on the observation that problems with evaluating the bound (eq. (12)) stemmed from intractabilities in the approximate posterior q(f), since this is where the kernel evaluations are needed. 
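The approximate posterior of eq. (11) is straightforward to compute once Kuu and ku(·) are available. A minimal numpy sketch (illustrative, not the paper's implementation; the RBF kernel, inducing locations Z, and variational parameters m and S are all made up):

```python
import numpy as np

rng = np.random.default_rng(1)

def rbf(A, B):
    """Assumed base kernel k on 1-D inputs."""
    return np.exp(-0.5 * (A[:, None] - B[None, :]) ** 2)

Z = np.linspace(0, 5, 4)        # M = 4 inducing inputs
m = rng.standard_normal(4)      # variational mean of the inducing observations
S = 0.1 * np.eye(4)             # variational covariance of the inducing observations

Kuu = rbf(Z, Z) + 1e-8 * np.eye(4)   # jitter for numerical stability

def q_f(xstar):
    """Marginal mean and variance of q(f) at test points, following eq. (11)."""
    ku = rbf(Z, xstar)                   # M x N* cross-covariance ku(.)
    A = np.linalg.solve(Kuu, ku)         # Kuu^{-1} ku(.)
    mu = A.T @ m
    var = rbf(xstar, xstar).diagonal() - np.einsum('mn,mk,kn->n', A, Kuu - S, A)
    return mu, var

mu, var = q_f(np.array([0.5, 2.5]))
print(mu.shape, var.shape)  # (2,) (2,)
```

Optimising the ELBO of eq. (12) then amounts to adjusting Z, m, and S (and the kernel hyperparameters) by stochastic gradient ascent.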
By choosing a different parameterisation of q(f), the cost of evaluating the approximate posterior for a minibatch of \u02dcN points can be reduced from O(\u02dcN^2 + (\u02dcN M + M^2)P^2 + M^3) to O(\u02dcN P^2 + \u02dcN M P + M^2 + M^3) \u2013 a significant saving, particularly when \u02dcN is small, and M and P are large.\nThis can be achieved simply by constructing the posterior from inducing variables in g(\u00b7) instead of f(\u00b7). Approximations constructed using observations in different spaces are said to use inter-domain inducing variables [Figueiras-Vidal and L\u00e1zaro-Gredilla, 2009], and can use the same variational bound as in eq. (12) [Matthews et al., 2016], with only Kuu and ku(\u00b7) being modified in q(f(xn)):\n\nkuu(z, z') = kg(z, z'), kfu(x, z) = E_{p(g)}[f(x)g(z)] = \u2211_{xa \u2208 A(x)} kg(xa, z). (13)\n\nThe new covariances require only a single sum, or no sum at all. Only kf(xn, xn) still requires a double sum, although this can be mitigated by keeping the minibatch size \u02dcN small.\nWhile this technique allows variational inference to be applied to invariant kernels with moderately sized orbits, similar to the convolutional kernels in van der Wilk et al. [2017], it does not help when even a single sum is too large. This technique is not applicable when intractable integrals appear (e.g. eq. (8)), since evaluations of the intractable kf are still needed, and the covariance function kfu also requires an intractable integral:\n\nkfu(x, z) = E_{p(g)}[\u222b p(xa|x) g(xa) dxa g(z)] = \u222b p(xa|x) kg(xa, z) dxa. (14)\n\n4.4 An estimator using samples describing invariances\nWe now show that the inter-domain parameterisation above allows us to, for certain likelihoods, create an unbiased estimator of the lower bound in eq. 
(12) using unbiased estimates of kf and kfu. We start with discussing the estimator for the Gaussian likelihood here, leaving some non-Gaussian likelihoods for the next section. We only consider the integral formulation of the kernel, as in eq. (8), as sub-sampling of augmentation sets requires only a minor tweak2.\n2Sub-sampling without replacement requires a different re-weighting of diagonal elements (appendix B).\n\nFor Gaussian likelihoods, the expectation in eq. (12) can be evaluated analytically, giving the bound\n\nL = \u2211_{n=1}^N [\u2212(1/2) log 2\u03c0\u03c3^2 \u2212 (1/(2\u03c3^2))(yn^2 \u2212 2yn\u00b5n + \u00b5n^2 + \u03c3n^2)] \u2212 KL[q(u)||p(u)], (15)\n\nwith \u00b5n = \u00b5(xn) and \u03c3n^2 = \u03bd(xn, xn) (eq. (11)) being the only terms which depend on the intractable kf and kfu covariances. The KL term is tractable, as it only depends on Kuu, which can be evaluated from kg directly (eq. (13)).\nWe aim to construct unbiased estimators \u02c6\u00b5n, \u02c6\u00b5n^2 and \u02c6\u03c3n^2 for the intractable terms, which only rely on samples of p(xa|x). 
The posterior mean can be estimated easily using a Monte Carlo estimate of kfu:\n\n\u00b5n = kfnu Kuu^{\u22121} m = [\u222b p(xa|xn) kg(xa, Z) dxa] Kuu^{\u22121} m \u21d2 \u02c6\u00b5n = \u02c6kfnu Kuu^{\u22121} m, (16)\n\n\u02c6kfu(x, z) = (1/S) \u2211_{s=1}^S kg(x^{(s)}, z), x^{(s)} \u223c p(xa|x). (17)\n\nFor \u00b5n^2 and \u03c3n^2, we start by rewriting them so we can focus on estimators of kf(xn, xn) and kfnu^T kfnu:\n\n\u00b5n^2 = kfnu Kuu^{\u22121} m m^T Kuu^{\u22121} kfnu^T = Tr[Kuu^{\u22121} m m^T Kuu^{\u22121} (kfnu^T kfnu)], (18)\n\n\u03c3n^2 = kf(xn, xn) \u2212 Tr[Kuu^{\u22121} (Kuu \u2212 S) Kuu^{\u22121} (kfnu^T kfnu)]. (19)\n\nWe treat kf(xn, xn) and an element of kfnu^T kfnu identically, as they can both be written as the integral\n\nI = \u222b\u222b p(xa|xn) p(x'a|xn) r(xa, x'a) dxa dx'a, (20)\n\nwith r = kg(xa, x'a) and r = kg(xa, zm) kg(x'a, zm'), respectively. A simple Monte Carlo estimate would require sampling two independent sets of points for xa and x'a. We would like to sample only a single one, so all the necessary quantities can be estimated with the same \u201cminibatch\u201d of augmented points. 
We propose to use the following estimator, which we show to be unbiased in appendix B.\n\n\u02c6I = (1/(S(S \u2212 1))) \u2211_{s=1}^S \u2211_{s'=1}^S r(x^{(s)}, x^{(s')}) (1 \u2212 \u03b4ss'), x^{(s)} \u223c p(xa|x). (21)\n\nWe now have unbiased estimates for all quantities needed to optimise the variational lower bound.\n4.5 Logistic classification with P\u00f3lya-Gamma latent variables\nThe estimators we developed in the previous section allowed us to estimate the bound in an unbiased way, as long as the variational expectation over the likelihood only depended on \u00b5n, \u00b5n^2, and \u03c3n^2. This limits the applicability of our method to likelihoods of a Gaussian form. Luckily, some likelihoods can be written as expectations over unnormalised Gaussians. For example, the logistic likelihood can be written as an expectation over a P\u00f3lya-Gamma variable \u03c9 [Polson et al., 2013]:\n\n\u03c3(yn f(xn)) = (1 + exp(\u2212yn f(xn)))^{\u22121} = \u222b c N(f(xn); (1/2) \u03c9n^{\u22121} yn, \u03c9n^{\u22121}) p(\u03c9n) d\u03c9n. (22)\n\nThis trick was investigated by Gibbs and Mackay [2000] and recently revisited by Wenzel et al. [2018] to construct a variational lower bound to the logistic likelihood with a Gaussian form:\n\nlog p(yn|f(xn)) = log \u03c3(yn f(xn)) \u2265 E_{q(f(xn))q(\u03c9n)}[log c N(f(xn); (1/2) \u03c9n^{\u22121} yn, \u03c9n^{\u22121})]. (23)\n\nThis bound for the likelihood can be used as a drop-in approximation for the exact likelihood, at the cost of adding an additional KL gap to the true marginal likelihood. For our purpose, the crucial benefit is that we again obtain a Gaussian form in the expectation over q(f(xn)), for which we can use the unbiased estimators we developed above. 
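As a sanity check on those estimators, the diagonal-excluding form of eq. (21) can be exercised numerically. The sketch below (illustrative only; the integrand r and the standard-normal stand-in for p(xa|x) are made up) averages many estimates and confirms they concentrate on the true double integral, here zero:

```python
import numpy as np

rng = np.random.default_rng(2)

def I_hat(x_samples, r):
    """Diagonal-excluding estimator of the double integral (eq. (21)):
    average r over all ordered pairs (s, s') with s != s'."""
    S = len(x_samples)
    R = r(x_samples[:, None], x_samples[None, :])
    return (R.sum() - np.trace(R)) / (S * (S - 1))

# hypothetical integrand and augmentation density p(xa|x) = N(0, 1)
r = lambda a, b: a * b
est = np.mean([I_hat(rng.standard_normal(10), r) for _ in range(20000)])
print(est)  # close to the true value E[xa] E[x'a] = 0
```

Including the diagonal terms s = s' and normalising by S^2 would instead estimate E[r(xa, xa)]-contaminated averages (here biased upwards by E[xa^2]/S), which is exactly what the (1 − δss') factor removes.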
Gibbs and MacKay [2000] and Wenzel et al. [2018] go on to find the optimal parameters for q(ωn) in closed form. We cannot use this, as the optimal parameters depend non-linearly on μn and σn. Instead, we choose to parameterise q(ωn) as a Pólya-Gamma distribution with its parameters given by a recognition network mapping from the corresponding input and label, following Kingma and Welling [2014]. This method can likely be extended to the multi-class setting using the stick-breaking construction by Linderman et al. [2015].

5 Experiments
We demonstrate our approach in a series of experiments on variants of the MNIST dataset. While MNIST has been accurately solved by other methods, we intend to show that a model like an RBF GP (Radial Basis Function or squared exponential kernel), for which MNIST is challenging, can be significantly improved by learning the correct invariances. For binary classification tasks, we use the Pólya-Gamma approximation for the logistic likelihood, while for multi-class classification, we are currently forced to use the Gaussian likelihood. We consider two classes of transformations for which we automatically learn parameters: (i) global affine transformations, and (ii) local deformations. Note that we must be able to backpropagate through these transformations in order to learn their parameters.

Affine transformations. 2D affine transformations are determined by 6 parameters φ and allow for scaling, rotation, shear, and translation. To sample from p(xa|x), we first draw φ ∼ Unif(φmin, φmax) and then apply³ the transformation to obtain xa = Aff_φ(x). Since the transformation Aff_φ(·) is differentiable w.r.t. φ, we can backpropagate to {φmin, φmax} using the reparameterisation trick.

Local deformations.
As a second class of transformations, we consider the local deformations introduced with the infiniteMNIST dataset [Loosli et al., 2007; Simard et al., 2000]. Samples are created by first generating a smooth vector field to describe the local deformations, followed by local rotations, scalings, and various other transformations. The parameters determining the size of the transformations can be backpropagated through in the same way as for the affine transformations.

5.1 Recovering invariances in binary MNIST classification
As a first test, we demonstrate that our approach can recover the parameter of a known transformation in an odd-vs-even MNIST binary classification problem. We take the regular MNIST dataset and rotate each example by a randomly chosen angle φ ∈ [−αtrue, αtrue] for αtrue ∈ {90°, 180°}. We choose p(xa|x) to be a uniform distribution over rotated images, leading to a rotational invariance, and use the variational lower bound to train the amount of rotation α. To perform well on this task, we expect the recovered α to be at least as large as the true value αtrue to account for the rotational invariance. Values that are too large, i.e. α ≈ 180°, should be avoided due to ambiguities between, for example, 6s and 9s.

Figure 3: Binary classification on the partially rotated (by ±90° or ±180°) MNIST dataset. Left: Amount of rotation invariance over time. Middle: Test error. Right: Log marginal likelihood bound.

We find that the trained GP models with invariances are able to approximately recover the true angles (fig. 3, left). When αtrue = 180°, the angle is under-estimated, whereas αtrue = 90° is recovered well. Regardless, all models outperform the simple RBF GP by a large margin, both in terms of error and in terms of the marginal likelihood bound (fig. 3, right).
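The learned rotation bound α in these experiments is trained through the reparameterisation trick described above for {φmin, φmax}. A minimal sketch of that mechanism follows, with our own illustrative function names; the actual implementation differentiates through a spatial-transformer-style image warp rather than through scalars.

```python
import numpy as np

def sample_phi(phi_min, phi_max, u):
    """Reparameterised draw phi ~ Unif(phi_min, phi_max).
    Sampling the noise u ~ Unif(0, 1) first makes phi a deterministic,
    differentiable function of the learned bounds."""
    return phi_min + u * (phi_max - phi_min)

def grad_phi_bounds(u):
    """Pathwise gradients of phi w.r.t. the bounds:
    d phi / d phi_min = 1 - u,   d phi / d phi_max = u."""
    return 1.0 - u, u
```

Because xa = Aff_φ(x) is differentiable in φ, the chain rule carries a gradient of the lower bound through the sampled φ all the way to {φmin, φmax}, averaged over the sampled noise u.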
These results show that our approach can be combined effectively with certain non-Gaussian likelihood models using the Pólya-Gamma trick.

5.2 Classification of MNIST digits
Next, we consider full MNIST classification, using a Gaussian likelihood, and compare the non-invariant RBF kernel to various invariant kernels. Figure 4 shows that the GPs with invariant kernels clearly outperform the baseline RBF kernel. Both types of learned invariances, affine transformations and local deformations, lead to similar performance for a wide range of initial conditions. When constrained to rotational invariances only, the results are only slightly better than the baseline. This indicates that deformations (stretching, shear) are more important than rotations for MNIST. Crucially, we do not require a validation set, but can use the log marginal likelihood of the training data to monitor performance. In fig. 1 we show samples from p(xa|x) for the model that uses all affine transformations.

³Affine transform implementation from github.com/kevinzakka/spatial-transformer-network.

Method      Error in %
RBF         2.15 ± 0.03
rot. only   2.08 ± 0.06
all affine  1.35 ± 0.07
local def.  1.47 ± 0.05

Figure 4: MNIST classification results. Left: Test error. Middle: Log marginal likelihood bound. Right: Final test error.
All invariant models outperform the RBF baseline.

5.3 Classification of rotated MNIST digits
We also consider the fully rotated MNIST dataset⁴. In this case, we only run GP models that are invariant to affine transformations. We compare general affine transformations (learning all parameters), rotations with learned angle bounds, and fixed rotational invariance (fig. 5). We found that all invariant models outperform the baseline (RBF) by a large margin. However, the models with fixed angles (no free parameters of the transformation) outperform their learned counterparts. We attribute this to the optimisation dynamics, as the problem of optimising the variational, kernel, and transformation parameters jointly is more difficult than optimising only the variational and kernel parameters for fixed transformations. We emphasise that the marginal likelihood bound does correctly identify the best-performing invariance in this case as well.

Figure 5: Rotated MNIST classification results. Left: Test error. Right: Log marginal likelihood bound. The optimiser has difficulty finding a good solution with the learned invariances, although the marginal likelihood bound correctly identifies the best model.

6 Conclusion
In this work, we show how invariances described by general data transformations can be incorporated into Gaussian process models. We use "double-sum" kernels, which sum a base kernel over all points that are similar under the invariance. These kernels cannot be evaluated in closed form, due to intractable integrals or a prohibitively large number of terms in the sums. Our method solves this technical issue by constructing a variational lower bound which only requires unbiased estimates of the kernel. Crucially, this variational lower bound also allows us to learn a good invariance by maximising the marginal likelihood bound, backpropagating through the sampling procedure.
We show experimentally that our method can learn useful invariances for variants of MNIST. In some experiments, the joint optimisation problem does not achieve the performance obtained when the method is initialised with the correct invariance. Despite this drawback, the objective function correctly identifies the best solution. While in this work we focus on kernels with invariances, we hope that our demonstration of learning with kernels that do not admit a closed-form evaluation will prove more generally useful, by enlarging the set of kernels with interesting generalisation properties that can be used.

⁴http://www.iro.umontreal.ca/~lisa/twiki/bin/view.cgi/Public/MnistVariations

Acknowledgements
M.B. gratefully acknowledges partial funding through a Qualcomm studentship in technology.

References
Antoniou, A., Storkey, A., and Edwards, H. (2017). Data augmentation generative adversarial networks. arXiv preprint arXiv:1711.04340.

Bauer, M., van der Wilk, M., and Rasmussen, C. E. (2016). Understanding probabilistic sparse Gaussian process approximations. In Advances in Neural Information Processing Systems 29.

Beymer, D. and Poggio, T. (1995). Face recognition from one example view. In Proceedings of the IEEE International Conference on Computer Vision.

Chapelle, O. and Schölkopf, B. (2002). Incorporating invariances in non-linear support vector machines.
In Advances in Neural Information Processing Systems 14.

Cohen, T. and Welling, M. (2016). Group equivariant convolutional networks. In Proceedings of The 33rd International Conference on Machine Learning.

Dao, T., Gu, A., Ratner, A. J., Smith, V., Sa, C. D., and Ré, C. (2018). A kernel theory of modern data augmentation. arXiv preprint arXiv:1803.06084.

Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., and Fei-Fei, L. (2009). ImageNet: A large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition.

Figueiras-Vidal, A. and Lázaro-Gredilla, M. (2009). Inter-domain Gaussian processes for sparse inference using inducing features. In Advances in Neural Information Processing Systems 22.

Germain, P., Bach, F., Lacoste, A., and Lacoste-Julien, S. (2016). PAC-Bayesian theory meets Bayesian inference. In Advances in Neural Information Processing Systems 29.

Gibbs, M. N. and MacKay, D. J. C. (2000). Variational Gaussian process classifiers. IEEE Transactions on Neural Networks, 11(6):1458–1464.

Ginsbourger, D., Bay, X., Roustant, O., and Carraro, L. (2012). Argumentwise invariant kernels for the approximation of invariant functions. Annales de la Faculté de Sciences de Toulouse.

Ginsbourger, D., Roustant, O., and Durrande, N. (2013). Invariances of random fields paths, with applications in Gaussian process regression. arXiv preprint arXiv:1308.1359.

Ginsbourger, D., Roustant, O., and Durrande, N. (2016). On degeneracy and invariances of random fields paths with applications in Gaussian process modelling. Journal of Statistical Planning and Inference, 170:117–128.

Graepel, T. and Herbrich, R. (2004). Invariant pattern recognition by semi-definite programming machines. In Advances in Neural Information Processing Systems 16.

Hauberg, S., Freifeld, O., Larsen, A. B. L., Fisher, J., and Hansen, L. (2016).
Dreaming more data: Class-dependent distributions over diffeomorphisms for learned data augmentation. In Proceedings of the 19th International Conference on Artificial Intelligence and Statistics.

He, K., Zhang, X., Ren, S., and Sun, J. (2016). Deep residual learning for image recognition. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

Hensman, J., Fusi, N., and Lawrence, N. D. (2013). Gaussian processes for big data. In Proceedings of the 29th Conference on Uncertainty in Artificial Intelligence (UAI).

Hensman, J., Matthews, A. G., Filippone, M., and Ghahramani, Z. (2015a). MCMC for variationally sparse Gaussian processes. In Advances in Neural Information Processing Systems 28.

Hensman, J., Matthews, A. G., and Ghahramani, Z. (2015b). Scalable variational Gaussian process classification. In Proceedings of the 18th International Conference on Artificial Intelligence and Statistics.

Jaderberg, M., Simonyan, K., Zisserman, A., and Kavukcuoglu, K. (2015). Spatial transformer networks. In Advances in Neural Information Processing Systems 28.

Kim, H. and Teh, Y. W. (2018). Scaling up the Automatic Statistician: Scalable structure discovery using Gaussian processes. In Proceedings of the 21st International Conference on Artificial Intelligence and Statistics.

Kingma, D. P. and Welling, M. (2014). Auto-encoding variational Bayes. In Proceedings of the 2nd International Conference on Learning Representations (ICLR).

Kondor, R. (2008). Group theoretical methods in machine learning. PhD thesis, Columbia University.

Krizhevsky, A., Sutskever, I., and Hinton, G. E. (2012). ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems 25.

LeCun, Y., Boser, B., Denker, J. S., Henderson, D., Howard, R. E., Hubbard, W., and Jackel, L. D. (1989). Backpropagation applied to handwritten zip code recognition.
Neural Computation, 1(4):541–551.

LeCun, Y., Bottou, L., Bengio, Y., and Haffner, P. (1998). Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324.

Linderman, S., Johnson, M., and Adams, R. P. (2015). Dependent multinomial models made easy: Stick-breaking with the Pólya-Gamma augmentation. In Advances in Neural Information Processing Systems 28.

Loosli, G., Canu, S., and Bottou, L. (2007). Training invariant support vector machines using selective sampling. Large Scale Kernel Machines, pages 301–320.

MacKay, D. J. C. (1998). Introduction to Gaussian processes. In Bishop, C. M., editor, Neural Networks and Machine Learning, NATO ASI Series, pages 133–166. Kluwer Academic Press.

MacKay, D. J. C. (2002). Model Comparison and Occam's Razor. In Information Theory, Inference & Learning Algorithms, chapter 28, pages 343–355. Cambridge University Press, Cambridge.

Matthews, A. G., Hensman, J., Turner, R. E., and Ghahramani, Z. (2016). On sparse variational methods and the Kullback-Leibler divergence between stochastic processes. In Proceedings of the 19th International Conference on Artificial Intelligence and Statistics.

Niyogi, P., Girosi, F., and Poggio, T. (1998). Incorporating prior information in machine learning by creating virtual examples. Proceedings of the IEEE, 86(11):2196–2209.

Polson, N. G., Scott, J. G., and Windle, J. (2013). Bayesian inference for logistic models using Pólya-Gamma latent variables. Journal of the American Statistical Association, 108(504):1339–1349.

Quiñonero-Candela, J. and Rasmussen, C. E. (2005). A unifying view of sparse approximate Gaussian process regression. Journal of Machine Learning Research, 6:1939–1959.

Raj, A., Kumar, A., Mroueh, Y., Fletcher, T., and Schoelkopf, B. (2017). Local group invariant representations via orbit embeddings.
In Proceedings of the 20th International Conference on Artificial Intelligence and Statistics.

Rasmussen, C. E. and Ghahramani, Z. (2001). Occam's razor. In Advances in Neural Information Processing Systems 13.

Rasmussen, C. E. and Williams, C. K. I. (2005). Model selection and adaptation of hyperparameters. In Gaussian Processes for Machine Learning, chapter 5. The MIT Press.

Schölkopf, B., Simard, P., Smola, A., and Vapnik, V. (1998). Prior knowledge in support vector kernels. In Advances in Neural Information Processing Systems 10.

Seeger, M. (2003). Bayesian Gaussian Process Models: PAC-Bayesian Generalisation Error Bounds and Sparse Approximations. PhD thesis, University of Edinburgh.

Simard, P., Victorri, B., LeCun, Y., and Denker, J. (1992). Tangent Prop – A formalism for specifying selected invariances in an adaptive network. In Advances in Neural Information Processing Systems 4.

Simard, P. Y., Le Cun, Y. A., Denker, J. S., and Victorri, B. (2000). Transformation invariance in pattern recognition: Tangent distance and propagation. International Journal of Imaging Systems and Technology, 11(3):181–197.

Titsias, M. K. (2009). Variational learning of inducing variables in sparse Gaussian processes. In Proceedings of the 12th International Conference on Artificial Intelligence and Statistics.

Turner, R. E. and Sahani, M. (2011). Two problems with variational expectation maximisation for time-series models. In Barber, D., Cemgil, T., and Chiappa, S., editors, Bayesian Time Series Models, chapter 5, pages 109–130. Cambridge University Press.

van der Wilk, M., Rasmussen, C. E., and Hensman, J. (2017). Convolutional Gaussian processes. In Advances in Neural Information Processing Systems 30.

Wenzel, F., Galy-Fajou, T., Donner, C., Kloft, M., and Opper, M. (2018). Efficient Gaussian process classification using Pólya-Gamma data augmentation.
arXiv preprint arXiv:1802.06383.