{"title": "Fast Kernel Learning for Multidimensional Pattern Extrapolation", "book": "Advances in Neural Information Processing Systems", "page_first": 3626, "page_last": 3634, "abstract": "The ability to automatically discover patterns and perform extrapolation is an essential quality of intelligent systems. Kernel methods, such as Gaussian processes, have great potential for pattern extrapolation, since the kernel flexibly and interpretably controls the generalisation properties of these methods. However, automatically extrapolating large scale multidimensional patterns is in general difficult, and developing Gaussian process models for this purpose involves several challenges. A vast majority of kernels, and kernel learning methods, currently only succeed in smoothing and interpolation. This difficulty is compounded by the fact that Gaussian processes are typically only tractable for small datasets, and scaling an expressive kernel learning approach poses different challenges than scaling a standard Gaussian process model. One faces additional computational constraints, and the need to retain significant model structure for expressing the rich information available in a large dataset. In this paper, we propose a Gaussian process approach for large scale multidimensional pattern extrapolation. We recover sophisticated out of class kernels, perform texture extrapolation, inpainting, and video extrapolation, and long range forecasting of land surface temperatures, all on large multidimensional datasets, including a problem with 383,400 training points. The proposed method significantly outperforms alternative scalable and flexible Gaussian process methods, in speed and accuracy. 
Moreover, we show that a distinct combination of expressive kernels, a fully non-parametric representation, and scalable inference which exploits existing model structure, are critical for large scale multidimensional pattern extrapolation.", "full_text": "Fast Kernel Learning for Multidimensional Pattern\n\nExtrapolation\n\nAndrew Gordon Wilson\u2217\n\nCMU\n\nElad Gilboa\u2217\n\nWUSTL\n\nArye Nehorai\n\nWUSTL\n\nJohn P. Cunningham\n\nColumbia\n\nAbstract\n\nThe ability to automatically discover patterns and perform extrapolation is an es-\nsential quality of intelligent systems. Kernel methods, such as Gaussian processes,\nhave great potential for pattern extrapolation, since the kernel \ufb02exibly and inter-\npretably controls the generalisation properties of these methods. However, auto-\nmatically extrapolating large scale multidimensional patterns is in general dif\ufb01-\ncult, and developing Gaussian process models for this purpose involves several\nchallenges. A vast majority of kernels, and kernel learning methods, currently\nonly succeed in smoothing and interpolation. This dif\ufb01culty is compounded by\nthe fact that Gaussian processes are typically only tractable for small datasets, and\nscaling an expressive kernel learning approach poses different challenges than\nscaling a standard Gaussian process model. One faces additional computational\nconstraints, and the need to retain signi\ufb01cant model structure for expressing the\nrich information available in a large dataset. In this paper, we propose a Gaussian\nprocess approach for large scale multidimensional pattern extrapolation. We re-\ncover sophisticated out of class kernels, perform texture extrapolation, inpainting,\nand video extrapolation, and long range forecasting of land surface temperatures,\nall on large multidimensional datasets, including a problem with 383,400 training\npoints. 
The proposed method signi\ufb01cantly outperforms alternative scalable and\n\ufb02exible Gaussian process methods, in speed and accuracy. Moreover, we show\nthat a distinct combination of expressive kernels, a fully non-parametric represen-\ntation, and scalable inference which exploits existing model structure, are critical\nfor large scale multidimensional pattern extrapolation.\n\n1\n\nIntroduction\n\nOur ability to effortlessly extrapolate patterns is a hallmark of intelligent systems: even with large\nmissing regions in our \ufb01eld of view, we can see patterns and textures, and we can visualise in our\nmind how they generalise across space.\nIndeed machine learning methods aim to automatically\nlearn and generalise representations to new situations. Kernel methods, such as Gaussian processes\n(GPs), are popular machine learning approaches for non-linear regression and classi\ufb01cation [1, 2, 3].\nFlexibility is achieved through a kernel function, which implicitly represents an inner product of\narbitrarily many basis functions. The kernel interpretably controls the smoothness and generalisation\nproperties of a GP. A well chosen kernel leads to impressive empirical performances [2].\nHowever, it is extremely dif\ufb01cult to perform large scale multidimensional pattern extrapolation with\nkernel methods. In this context, the ability to learn a representation of the data entirely depends\non learning a kernel, which is a priori unknown. Moreover, kernel learning methods [4] are not\ntypically intended for automatic pattern extrapolation; these methods often involve hand crafting\ncombinations of Gaussian kernels (for smoothing and interpolation), for speci\ufb01c applications such\nas modelling low dimensional structure in high dimensional data. 
Without human intervention, the vast majority of existing GP models are unable to perform pattern discovery and extrapolation.

∗Authors contributed equally.

While recent approaches such as [5] enable extrapolation on small one dimensional datasets, it is difficult to generalise these approaches to larger multidimensional situations. These difficulties arise because Gaussian processes are computationally intractable on large scale data, and while scalable approximate GP methods have been developed [6, 7, 8, 9, 10, 11, 12, 13], it is uncertain how best to scale expressive kernel learning approaches. Furthermore, the need for flexible kernel learning on large datasets is especially great, since such datasets often provide more information with which to automatically learn an appropriate statistical representation.

In this paper, we introduce GPatt, a flexible, non-parametric, and computationally tractable approach to kernel learning for multidimensional pattern extrapolation, with particular applicability to data with grid structure, such as images, video, and spatial-temporal statistics. Specifically:

• We extend fast Kronecker-based GP inference (e.g., [14, 15]) to account for non-grid data. Our experiments include data where more than 70% of the training data are not on a grid. Indeed most applications where one would want to exploit Kronecker structure involve missing and non-grid data – caused by, e.g., water, government boundaries, missing pixels and image artifacts.
By adapting expressive spectral mixture kernels to the setting of multidimensional inputs and Kronecker structure, we achieve exact inference and learning costs of O(P N^((P+1)/P)) computations and O(P N^(2/P)) storage, for N datapoints and P input dimensions, compared to the standard O(N^3) computations and O(N^2) storage associated with GPs.

• We show that i) spectral mixture kernels (adapted for Kronecker structure); ii) scalable inference based on Kronecker methods (adapted for incomplete grids); and, iii) truly non-parametric representations, when used in combination (to form GPatt), distinctly enable large-scale multidimensional pattern extrapolation with GPs. We demonstrate this through a comparison with various expressive models and inference techniques: i) spectral mixture kernels with arguably the most popular scalable GP inference method (FITC) [10]; ii) a flexible and efficient recent spectral based kernel learning method (SSGP) [6]; and, iii) the most popular GP kernels with Kronecker based inference.

• The information capacity of non-parametric methods grows with the size of the data. A truly non-parametric GP must have a kernel that is derived from an infinite basis function expansion. We find that a truly non-parametric representation is necessary for pattern extrapolation on large datasets, and provide insights into this surprising result.

• GPatt is highly scalable and accurate. This is the first time, as far as we are aware, that highly expressive non-parametric kernels, with in some cases hundreds of hyperparameters, on datasets exceeding N = 10^5 training instances, can be learned from the marginal likelihood of a GP in only minutes.
Such experiments show that one can, to some extent, solve kernel selection, and automatically extract useful features from the data, on large datasets, using a special combination of expressive kernels and scalable inference.

• We show the proposed methodology provides a distinct approach to texture extrapolation and inpainting; it was not previously known how to make GPs work for these fundamental applications.

• Moreover, unlike typical inpainting approaches, such as patch-based methods (which work by recursively copying pixels or patches into a gap in an image, preserving neighbourhood similarities), GPatt is not restricted to spatial inpainting. This is demonstrated on a video extrapolation example, for which standard inpainting methods would be inapplicable [16]. Similarly, we apply GPatt to perform large-scale long range forecasting of land surface temperatures, through learning a sophisticated correlation structure across space and time. This learned correlation structure also provides insights into the underlying statistical properties of these data.

• We demonstrate that GPatt can precisely recover sophisticated out-of-class kernels automatically.

2 Spectral Mixture Product Kernels for Pattern Discovery

The spectral mixture kernel has recently been introduced [5] to offer a flexible kernel that can learn any stationary kernel. By appealing to Bochner's theorem [17] and building a scale mixture of A Gaussian pairs in the spectral domain, [5] produced the spectral mixture kernel

k_SM(τ) = Σ_{a=1}^A w_a^2 exp{−2π^2 τ^2 σ_a^2} cos(2π τ μ_a) ,    (1)

which they applied to one-dimensional input data with a small number of points.
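As a minimal sketch of Eq. (1) (the function name and the plain-array hyperparameter representation are ours, not the authors' implementation):

```python
import numpy as np

def spectral_mixture_kernel(tau, weights, sigmas, mus):
    """Spectral mixture kernel of Eq. (1), for 1-D inputs:
    k_SM(tau) = sum_a w_a^2 exp(-2 pi^2 tau^2 sigma_a^2) cos(2 pi tau mu_a)."""
    tau = np.asarray(tau, dtype=float)
    k = np.zeros_like(tau)
    for w, s, m in zip(weights, sigmas, mus):
        k = k + w**2 * np.exp(-2.0 * np.pi**2 * tau**2 * s**2) * np.cos(2.0 * np.pi * tau * m)
    return k
```

At τ = 0 the kernel evaluates to the sum of squared weights, and each component decays at a rate set by σ_a while oscillating at frequency μ_a.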
For tractability with multidimensional inputs and large data, we propose a spectral mixture product (SMP) kernel:

k_SMP(τ|θ) = Π_{p=1}^P k_SM(τ_p|θ_p) ,    (2)

where τ_p is the pth component of τ = x − x′ ∈ R^P, θ_p are the hyperparameters {μ_a, σ_a^2, w_a^2}_{a=1}^A of the pth spectral mixture kernel in the product of Eq. (2), and θ = {θ_p}_{p=1}^P are the hyperparameters of the SMP kernel. The SMP kernel of Eq. (2) has Kronecker structure which we exploit for scalable and exact inference in section 2.1. With enough components A, the SMP kernel of Eq. (2) can model any stationary product kernel to arbitrary precision, and is flexible even with a small number of components, since scale-location Gaussian mixture models can approximate many spectral densities. We use SMP-A as shorthand for an SMP kernel with A components in each dimension (for a total of 3PA kernel hyperparameters and 1 noise hyperparameter). Wilson [18, 19] contains detailed discussions of spectral mixture kernels.

Critically, a GP with an SMP kernel is not a finite basis function method, but instead corresponds to a finite (A component) mixture of infinite basis function expansions. Therefore such a GP is a truly nonparametric method. This difference between a truly nonparametric representation – namely a mixture of infinite bases – and a parametric kernel method, a finite basis expansion corresponding to a degenerate GP, is critical both conceptually and practically, as our results will show.

2.1 Fast Exact Inference with Spectral Mixture Product Kernels

Gaussian process inference and learning requires evaluating (K + σ^2 I)^(−1) y and log|K + σ^2 I|, for an N × N covariance matrix K, a vector of N datapoints y, and noise variance σ^2, as described in the supplementary material.
For this purpose, it is standard practice to take the Cholesky decomposition of (K + σ^2 I), which requires O(N^3) computations and O(N^2) storage, for a dataset of size N. However, many real world applications are engineered for grid structure, including spatial statistics, sensor arrays, image analysis, and time sampling. [14] has shown that the Kronecker structure in product kernels can be exploited for exact inference and hyperparameter learning in O(P N^(2/P)) storage and O(P N^((P+1)/P)) operations, so long as the inputs x ∈ X are on a multidimensional grid, meaning X = X_1 × ··· × X_P ⊂ R^P. Details are in the supplement.

Here we relax this grid assumption. Assuming we have a dataset of M observations which are not necessarily on a grid, we propose to form a complete grid using W imaginary observations, y_W ∼ N(f_W, ε^(−1) I_W), ε → 0. The total observation vector y = [y_M, y_W]ᵀ has N = M + W entries: y ∼ N(f, D_N), where the noise covariance matrix D_N = diag(D_M, ε^(−1) I_W), D_M = σ^2 I_M. The imaginary observations y_W have no corrupting effect on inference: the moments of the resulting predictive distribution are exactly the same as for the standard predictive distribution, namely lim_{ε→0} (K_N + D_N)^(−1) y = (K_M + D_M)^(−1) y_M (proof in the supplement).

For inference, we must evaluate (K_N + D_N)^(−1) y. Since D_N is not a scaled identity (as is the usual case in Kronecker methods), we cannot efficiently decompose K_N + D_N, but we can efficiently take matrix vector products involving K_N and D_N. We therefore use preconditioned conjugate gradients (PCG) [20], an iterative method involving only matrix vector products, to compute (K_N + D_N)^(−1) y by solving Cᵀ (K_N + D_N) C z = Cᵀ y.
We use the preconditioning matrix C = D_N^(−1/2). The preconditioning matrix C speeds up convergence by ignoring the imaginary observations y_W. Exploiting the fast multiplication of Kronecker matrices, PCG takes O(J P N^((P+1)/P)) total operations (where the number of iterations J ≪ N) to compute (K_N + D_N)^(−1) y to convergence within machine precision (supplement). This procedure can also be used to handle heteroscedastic noise.

For learning (hyperparameter training) we must evaluate the marginal likelihood (supplement). We cannot efficiently compute the log|K_M + D_M| complexity penalty in the marginal likelihood, because K_M is not a Kronecker matrix. We approximate the complexity penalty as

log|K_M + D_M| = Σ_{i=1}^M log(λ_i^M + σ^2) ≈ Σ_{i=1}^M log(λ̃_i^M + σ^2) ,    (3)

for noise variance σ^2. We approximate the eigenvalues λ_i^M of K_M using the eigenvalues of K_N, such that λ̃_i^M = (M/N) λ_i^N for i = 1, . . . , M, which is particularly effective for large M (e.g. M > 1000) [7]. [21] proves this eigenvalue approximation is asymptotically consistent (e.g., converges in the limit of large M), and [22] shows how one can bound the true eigenvalues by their approximation using PCA. Notably, only the log determinant (complexity penalty) term in the marginal likelihood undergoes a small approximation, and inference remains exact.

All remaining terms in the marginal likelihood can be computed exactly and efficiently using PCG. The total runtime cost of hyperparameter learning and exact inference with an incomplete grid is thus O(P N^((P+1)/P)). In image problems, for example, P = 2, and so the runtime complexity reduces to O(N^1.5). Although the proposed inference can handle non-grid data, this inference is most suited to inputs where there is some grid structure – images, video, spatial statistics, etc.
If there is no\nsuch grid structure (e.g., none of the training data fall onto a grid), then the computational expense\nnecessary to augment the data with imaginary grid observations can be prohibitive. Although in-\ncomplete grids have been brie\ufb02y considered in, e.g. [23], such approaches generally involve costly\nand numerically unstable rank 1 updates, inducing inputs, and separate (and restricted) treatments\nof \u2018missing\u2019 and \u2018extra\u2019 data. Moreover, the marginal likelihood, critical for kernel learning, is not\ntypically considered in alternate approaches to incomplete grids.\n\n3 Experiments\n\nIn our experiments we combine the SMP kernel of Eq. (2) with the fast exact inference and learning\nprocedures of section 2.1, in a GP method we henceforth call GPatt1,2.\nWe contrast GPatt with many alternative Gaussian process kernel methods. We are particularly\ninterested in kernel methods, since they are considered to be general purpose regression methods, but\nconventionally have dif\ufb01culty with large scale multidimensional pattern extrapolation. Speci\ufb01cally,\nwe compare to the recent sparse spectrum Gaussian process regression (SSGP) [6] method, which\nprovides fast and \ufb02exible kernel learning. SSGP models the kernel spectrum (spectral density)\nas a sum of point masses, such that SSGP is a \ufb01nite basis function (parametric) model, with as\nmany basis functions as there are spectral point masses. SSGP is similar to the recent models of\nLe et al. [8] and Rahimi and Recht [9], except it learns the locations of the point masses through\nmarginal likelihood optimization. 
We use the SSGP implementation provided by the authors at\nhttp://www.tsc.uc3m.es/\u02dcmiguel/downloads.php.\nTo further test the importance of the fast inference (section 2.1) used in GPatt, we compare to a GP\nwhich uses the SMP kernel of section 2 but with the popular fast FITC [10, 24] inference, which\nuses inducing inputs, and is implemented in GPML (http://www.gaussianprocess.org/\ngpml). We also compare to GPs with the popular squared exponential (SE), rational quadratic (RQ)\nand Mat\u00b4ern (MA) (with 3 degrees of freedom) kernels, catalogued in Rasmussen and Williams [1],\nrespectively for smooth, multi-scale, and \ufb01nitely differentiable functions. Since GPs with these\nkernels cannot scale to the large datasets we consider, we combine these kernels with the same fast\ninference techniques that we use with GPatt, to enable a comparison.3 Moreover, we stress test each\nof these methods in terms of speed and accuracy, as a function of available data and extrapolation\nrange, and number of components. All of our experiments contain a large percentage of non-grid\ndata, and we test accuracy and ef\ufb01ciency as a function of the percentage of missing data.\nIn all experiments we assume Gaussian noise, to express the marginal likelihood of the data p(y|\u03b8)\nsolely as a function of kernel hyperparameters \u03b8. To learn \u03b8 we optimize the marginal likelihood\nusing BFGS. We use a simple initialisation scheme: any frequencies {\u00b5a} are drawn from a uniform\ndistribution from 0 to the Nyquist frequency (1/2 the sampling rate), length-scales {1/\u03c3a} from a\ntruncated Gaussian distribution, with mean proportional to the range of the data, and weights {wa}\nare initialised as the empirical standard deviation of the data divided by the number of components\nused in the model. In general, we \ufb01nd GPatt is robust to initialisation, particularly for N > 104\ndatapoints. 
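The simple initialisation scheme described above can be sketched as follows for one input dimension (a sketch under our assumptions: the function name is ours, and since the exact truncation and scale of the length-scale distribution are not specified, |N(range, range^2)| is used as a stand-in for the truncated Gaussian):

```python
import numpy as np

def init_smp_hypers(y, sample_rate, data_range, A, rng):
    """Draw initial SMP hyperparameters for one input dimension.
    mu ~ Uniform(0, Nyquist); length-scales 1/sigma from a positive Gaussian
    with mean proportional to the data range (our stand-in for the truncated
    Gaussian); weights set to the empirical std divided by the number of
    components A."""
    mu = rng.uniform(0.0, 0.5 * sample_rate, size=A)      # up to the Nyquist frequency
    ls = np.abs(rng.normal(data_range, data_range, size=A))  # length-scales 1/sigma_a
    w = np.full(A, np.std(y) / A)
    return {"mu": mu, "sigma": 1.0 / ls, "w": w}
```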
We show a representative initialisation in the experiments.\nThis range of tests allows us to separately understand the effects of the SMP kernel, a non-parametric\nrepresentation, and the proposed inference methods of section 2.1; we will show that all are required\nfor good extrapolation performance.\n\n1We write GPatt-A when GPatt uses an SMP-A kernel.\n2Experiments were run on a 64bit PC, with 8GB RAM and a 2.8 GHz Intel i7 processor.\n3We also considered the model of [25], but this model is intractable for the datasets we considered and is\n\nnot structured for the fast inference of section 2.1.\n\n4\n\n\f3.1 Extrapolating Metal Tread Plate and Pores Patterns\nWe extrapolate the missing region, shown in Figure 1a, on a real metal tread plate texture. There\nare 12675 training instances (Figure 1a), and 4225 test instances (Figure 1b). The inputs are pixel\nlocations x \u2208 R2 (P = 2), and the outputs are pixel intensities. The full pattern is shown in Figure\n1c. This texture contains shadows and subtle irregularities, no two identical diagonal markings, and\npatterns that have correlations across both input dimensions.\n\n(a) Train\n\n(b) Test\n\n(c) Full\n\n(d) GPatt\n\n(e) SSGP\n\n(f) FITC\n\n(g) GP-SE\n\n(h) GP-MA\n\n(i) GP-RQ\n\n(j) GPatt Initialisation\n\n(k) Train\n\n(l) GPatt\n\n(m) GP-MA\n\n(n) Train\n\n(o) GPatt\n\n(p) GP-MA\n\nFigure 1: (a)-(j): Extrapolation on a Metal Tread Plate Pattern. Missing data are shown in black. a)\nTraining region (12675 points), b) Testing region (4225 points), c) Full tread plate pattern, d) GPatt-\n30, e) SSGP with 500 basis functions, f) FITC with 500 inducing (pseudo) inputs, and the SMP-30\nkernel, and GPs with the fast exact inference in section 2.1, and g) squared exponential (SE), h)\nMat\u00b4ern (MA), and i) rational quadratic (RQ) kernels. j) Initial and learned hyperparameters using\nGPatt using simple initialisation. During training, weights of extraneous components automatically\nshrink to zero. 
(k)-(m) and (n)-(p): Extrapolation on tread plate and pore patterns, respectively, with added artifacts and non-stationary lighting changes.

To reconstruct the missing and training regions, we use GPatt-30. The GPatt reconstruction shown in Fig 1d is as plausible as the true full pattern shown in Fig 1c, and largely automatic. Without hand crafting of kernel features to suit this image, exposure to similar images, or a sophisticated initialisation, GPatt has automatically discovered the underlying structure of this image, and extrapolated that structure across a large missing region, even though the structure of this pattern is not independent across the two spatial input dimensions. Indeed the separability of the SMP kernel represents only a soft prior assumption, and does not rule out posterior correlations between input dimensions.

The reconstruction in Figure 1e was produced with SSGP, using 500 basis functions. In principle SSGP can model any spectral density (and thus any stationary kernel) with infinitely many components (basis functions). However, since these components are point masses (in frequency space), each component has highly limited expressive power. Moreover, with many components SSGP experiences practical difficulties regarding initialisation, over-fitting, and computation time (scaling quadratically with the number of basis functions). Although SSGP does discover some interesting structure (a diagonal pattern), and has equal training and test performance, it is unable to capture enough information for a convincing reconstruction, and we did not find that more basis functions improved performance. Likewise, FITC with an SMP-30 kernel and 500 inducing (pseudo) inputs cannot capture the necessary information to interpolate or extrapolate.
On this example, FITC ran\nfor 2 days, and SSGP-500 for 1 hour, compared to GPatt which took under 5 minutes.\nGPs with SE, MA, and RQ kernels are all truly Bayesian nonparametric models \u2013 these kernels\nare derived from in\ufb01nite basis function expansions. Therefore, as seen in Figure 1 g), h), i), these\nmethods are completely able to capture the information in the training region; however, these kernels\ndo not have the proper structure to reasonably extrapolate across the missing region \u2013 they simply\nact as smoothing \ufb01lters. Moreover, this comparison is only possible because we have implemented\nthese GPs using the fast exact inference techniques introduced in section 2.1.\n\n5\n\n \f(a) Runtime Stress Test\n\n(b) Accuracy Stress Test\n\n(c) Recovering Sophisticated Kernels\n\nFigure 2: Stress Tests. a) Runtime Stress Test. We show the runtimes in seconds, as a function\nof training instances, for evaluating the log marginal likelihood, and any relevant derivatives, for a\nstandard GP with SE kernel (as implemented in GPML), FITC with 500 inducing (pseudo) inputs\nand SMP-25 and SMP-5 kernels, SSGP with 90 and 500 basis functions, and GPatt-100, GPatt-25,\nand GPatt-5. Runtimes are for a 64bit PC, with 8GB RAM and a 2.8 GHz Intel i7 processor, on the\ncone pattern (P = 2), shown in the supplement. The ratio of training inputs to the sum of imaginary\nand training inputs for GPatt is 0.4 and 0.6 for the smallest two training sizes, and 0.7 for all other\ntraining sets. b) Accuracy Stress Test. MSLL as a function of holesize on the metal pattern of\nFigure 1. The values on the horizontal axis represent the fraction of missing (testing) data from\nthe full pattern (for comparison Fig 1a has 25% missing data). We compare GPatt-30 and GPatt-15\nwith GPs with SE, MA, and RQ kernels (and the inference of section 2.1), and SSGP with 100 basis\nfunctions. The MSLL for GPatt-15 at a holesize of 0.01 is \u22121.5886. 
c) Recovering Sophisticated Kernels. A product of three kernels (shown in green) was used to generate a movie of 112,500 training points. From this data, GPatt-20 reconstructs these component kernels (the learned SMP-20 kernel is shown in blue). All kernels are a function of τ = x − x′ and have been scaled by k(0).

Overall, these results indicate that both expressive nonparametric kernels, such as the SMP kernel, and the specific fast inference in section 2.1, are needed to extrapolate patterns in these images.

We note that the SMP-30 kernel used with GPatt has more components than needed for this problem. However, as shown in Fig. 1j, if the model is overspecified, the complexity penalty in the marginal likelihood shrinks the weights ({w_a} in Eq. (1)) of extraneous components, as a proxy for model selection – an effect similar to automatic relevance determination [26]. Components which do not significantly contribute to model fit are automatically pruned, as shrinking the weights decreases the eigenvalues of K and thus minimizes the complexity penalty (a sum of log eigenvalues). The simple GPatt initialisation in Fig 1j is used in all experiments and is especially effective for N > 10^4.

In Figure 1 (k)-(m) and (n)-(p) we use GPatt to extrapolate on treadplate and pore patterns with added artifacts and lighting changes. GPatt still provides a convincing extrapolation – able to uncover both local and global structure. Alternative GPs with the inference of section 2.1 can interpolate small artifacts quite accurately, but have trouble with larger missing regions.

3.2 Stress Tests and Recovering Complex 3D Kernels from Video

We stress test GPatt and alternative methods in terms of speed and accuracy, with varying data sizes, extrapolation ranges, basis functions, inducing (pseudo) inputs, and components.
We assess accuracy using standardised mean square error (SMSE) and mean standardised log loss (MSLL) (a scaled negative log likelihood), as defined in Rasmussen and Williams [1] on page 23. Using the empirical mean and variance to fit the data would give an SMSE and MSLL of 1 and 0 respectively. Smaller SMSE and more negative MSLL values correspond to better fits of the data.

The runtime stress test in Figure 2a shows that the number of components used in GPatt does not significantly affect runtime, and that GPatt is much faster than FITC (using 500 inducing inputs) and SSGP (using 90 or 500 basis functions), even with 100 components (601 kernel hyperparameters). The slope of each curve roughly indicates the asymptotic scaling of each method. In this experiment, the standard GP (with SE kernel) has a slope of 2.9, which is close to the cubic scaling we expect. All other curves have a slope of 1 ± 0.1, indicating linear scaling with the number of training instances. However, FITC and SSGP are used here with a fixed number of inducing inputs and basis functions. More inducing inputs and basis functions should be used when there are more training instances – and these methods scale quadratically with inducing inputs and basis functions for a fixed number of training instances. GPatt, on the other hand, can scale linearly in runtime as a function of training size, without any deterioration in performance. Furthermore, the fixed 2-3 orders of magnitude by which GPatt outperforms the alternatives is as practically important as asymptotic scaling.

The accuracy stress test in Figure 2b shows extrapolation (MSLL) performance on the metal tread plate pattern of Figure 1c with varying holesizes, running from 0% to 60% missing data for testing (for comparison the hole in Fig 1a has 25% missing data). GPs with SE, RQ, and MA kernels (and the fast inference of section 2.1) all steadily increase in error as a function of holesize. Conversely, SSGP does not increase in error as a function of holesize – with finite basis functions SSGP cannot extract as much information from larger datasets as the alternatives. GPatt performs well relative to the other methods, even with a small number of components. GPatt is particularly able to exploit the extra information in additional training instances: only when the holesize is so large that over 60% of the data are missing does GPatt's performance degrade to the same level as alternative methods.

In Table 1 we compare the test performance of GPatt with SSGP, and GPs using SE, MA, and RQ kernels, for extrapolating five different patterns, with the same train test split as for the tread plate pattern in Figure 1.

Table 1: We compare the test performance of GPatt-30 with SSGP (using 100 basis functions), and GPs using SE, MA, and RQ kernels, combined with the inference of section 2.1, on patterns with a train test split as in the metal treadplate pattern of Figure 1. We show the results as SMSE (MSLL).

             Rubber mat     Tread plate    Pores          Wood           Chain mail
train, test  12675, 4225    12675, 4225    12675, 4225    14259, 4941    14101, 4779
GPatt        0.31 (−0.57)   0.45 (−0.38)   0.0038 (−2.8)  0.015 (−1.4)   0.79 (−0.052)
SSGP         0.65 (−0.21)   1.06 (0.018)   1.04 (−0.024)  0.19 (−0.80)   1.1 (0.036)
SE           0.97 (0.14)    0.90 (−0.10)   0.89 (−0.21)   0.64 (1.6)     1.1 (1.6)
MA           0.86 (−0.069)  0.88 (−0.10)   0.88 (−0.24)   0.43 (1.6)     0.99 (0.26)
RQ           0.89 (0.039)   0.90 (−0.10)   0.88 (−0.048)  0.077 (0.77)   0.97 (−0.0025)
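The SMSE and MSLL metrics reported above can be sketched as follows, following the definitions in Rasmussen and Williams [1] (the function names are ours):

```python
import numpy as np

def smse(y_true, y_pred):
    """Standardised mean squared error: MSE divided by the variance of the
    test targets, so trivially predicting the mean gives SMSE of about 1."""
    return np.mean((y_true - y_pred) ** 2) / np.var(y_true)

def msll(y_true, y_pred, var_pred, y_train):
    """Mean standardised log loss: mean Gaussian negative log predictive
    density, minus that of a trivial Gaussian fit to the training data."""
    model_nll = 0.5 * np.log(2 * np.pi * var_pred) + (y_true - y_pred) ** 2 / (2 * var_pred)
    mu0, var0 = y_train.mean(), y_train.var()
    trivial_nll = 0.5 * np.log(2 * np.pi * var0) + (y_true - mu0) ** 2 / (2 * var0)
    return np.mean(model_nll - trivial_nll)
```

By construction the trivial predictor scores SMSE = 1 and MSLL = 0, which is why smaller SMSE and more negative MSLL indicate better fits.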
All patterns are shown in the supplement. GPatt consistently has the lowest SMSE and MSLL. Note that many of these datasets are sophisticated patterns, containing intricate details which are not strictly periodic, such as lighting irregularities, metal impurities, etc. Indeed SSGP has a periodic kernel (unlike the SMP kernel which is not strictly periodic), and is capable of modelling multiple periodic components, but does not perform as well as GPatt on these examples.

We also consider a particularly large example, where we use GPatt-10 to perform learning and exact inference on the Pores pattern, with 383,400 training points, to extrapolate a large missing region with 96,600 test points. The SMSE is 0.077, and the total runtime was 2800 seconds. Images of the successful extrapolation are shown in the supplement.

We end this section by showing that GPatt can accurately recover a wide range of kernels, even using a small number of components. To test GPatt's ability to recover ground truth kernels, we simulate a 50 × 50 × 50 movie of data (e.g. two spatial input dimensions, one temporal) using a GP with kernel k = k1 k2 k3 (each component kernel in this product operates on a different input dimension), where k1 = k_SE + k_SE × k_PER, k2 = k_MA × k_PER + k_MA × k_PER, and k3 = (k_RQ + k_PER) × k_PER + k_SE. (k_PER(τ) = exp[−2 sin^2(π τ ω)/ℓ^2], τ = x − x′). We use 5 consecutive 50 × 50 slices for testing, leaving a large number N = 112500 of training points, providing much information to learn the true generating kernels. Moreover, GPatt-20 reconstructs these complex out of class kernels in under 10 minutes, as shown in Fig 2c.
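A sketch of the periodic kernel defined above and the per-dimension product composition used to build the generating kernel (our function names; an SE factor stands in for the Matérn and rational quadratic components, which are not spelled out here):

```python
import numpy as np

def k_per(tau, omega, ell):
    """Periodic kernel: k_PER(tau) = exp[-2 sin^2(pi tau omega) / ell^2]."""
    return np.exp(-2.0 * np.sin(np.pi * tau * omega) ** 2 / ell**2)

def k_se(tau, ell):
    """Squared exponential kernel, k_SE(tau) = exp(-tau^2 / (2 ell^2))."""
    return np.exp(-0.5 * tau**2 / ell**2)

def k_product(taus, kernels):
    """Product kernel k = k_1 k_2 k_3: one factor per input dimension,
    each evaluated on that dimension's component of tau."""
    out = 1.0
    for tau, k in zip(taus, kernels):
        out = out * k(tau)
    return out
```

For instance, `k_product([t1, t2, t3], [lambda t: k_se(t, 1.0), lambda t: k_per(t, 1.0, 1.0), lambda t: k_se(t, 2.0)])` evaluates a three-dimensional product kernel; k_PER repeats with period 1/ω.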
In the supplement, we show true and predicted frames from the movie.

3.3 Wallpaper and Scene Reconstruction and Long Range Temperature Forecasting

Although GPatt is a general purpose regression method, it can also be used for inpainting: image restoration, object removal, etc. We first consider a wallpaper image stained by a black apple mark, shown in Figure 3. To remove the stain, we apply a mask and then separate the image into its three channels (red, green, and blue), resulting in 15047 pixels in each channel for training. In each channel we ran GPatt using SMP-30. We then combined the results from each channel to restore the image without any stain, which is impressive given the subtleties in the pattern and lighting.

In our next example, we wish to reconstruct a natural scene obscured by a prominent rooftop, shown in the second row of Figure 3a. By applying a mask, and following the same procedure as for the stain, this time with 32269 pixels in each channel for training, GPatt reconstructs the scene without the rooftop. This reconstruction captures subtle details, such as waves, with only a single training image.

Figure 3: a) Image inpainting with GPatt. From left to right: a mask is applied to the original image, GPatt extrapolates the masked region in each of the three (red, blue, green) image channels, and the results are joined to produce the restored image. Top row: removing a stain (train: 15047 × 3). Bottom row: removing a rooftop to restore a natural scene (train: 32269 × 3). We do not extrapolate the coast. (b)-(c): Kernels learned for land surface temperatures using GPatt and GP-SE.

In fact this example has been used with inpainting algorithms which were given access to a repository of thousands of similar images [27].
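The mask-and-recombine procedure described above can be sketched as follows. Here `fit_predict` is a hypothetical placeholder for any regressor on pixel coordinates (the paper uses GPatt with an SMP-30 kernel); the sketch illustrates only the per-channel workflow:

```python
import numpy as np

def inpaint_per_channel(image, mask, fit_predict):
    """Restore masked pixels by regressing each colour channel
    independently on pixel coordinates, then recombining.

    image: (H, W, 3) float array; mask: (H, W) bool, True = missing.
    fit_predict(X_train, y_train, X_test) -> y_pred is any regressor;
    it stands in for GPatt here and is purely illustrative.
    """
    h, w, _ = image.shape
    yy, xx = np.mgrid[0:h, 0:w]
    coords = np.stack([yy.ravel(), xx.ravel()], axis=1).astype(float)
    missing = mask.ravel()
    restored = image.copy()
    for c in range(3):  # red, green, blue channels handled separately
        channel = image[:, :, c].ravel()
        pred = fit_predict(coords[~missing], channel[~missing],
                           coords[missing])
        filled = channel.copy()
        filled[missing] = pred
        restored[:, :, c] = filled.reshape(h, w)
    return restored
```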
The results emphasized that conventional inpainting algorithms and GPatt have profoundly different objectives, which are sometimes even at cross purposes: inpainting attempts to make the image look good to a human (e.g., the example in [27] placed boats in the water), while GPatt is a general purpose regression algorithm, which simply aims to make accurate predictions at test input locations, from training data alone. For example, GPatt can naturally learn temporal correlations to make predictions in the video example of section 3.2, for which standard patch-based inpainting methods would be inapplicable [16].

Similarly, we use GPatt to perform long range forecasting of land surface temperatures. After training on 108 months (9 years) of temperature data across North America (299,268 training points; a 71 × 66 × 108 completed grid, with missing data for water), we forecast 12 months (1 year) ahead (33,252 testing points). The runtime was under 30 minutes. The learned kernels using GPatt and GP-SE are shown in Figure 3 b) and c). The learned kernels for GPatt are highly non-standard: both quasi-periodic and heavy-tailed. These learned correlation patterns provide insights into features (such as seasonal influences) which affect how temperatures vary in space and time. Indeed, learning the kernel allows us to discover fundamental properties of the data. The temperature forecasts using GPatt and GP-SE, superimposed on maps of North America, are shown in the supplement.

4 Discussion

Large scale multidimensional pattern extrapolation problems are of fundamental importance in machine learning, where we wish to develop scalable models which can make impressive generalisations. However, there are many obstacles towards applying popular kernel methods, such as Gaussian processes, to these fundamental problems.
We have shown that a combination of expressive kernels, truly Bayesian nonparametric representations, and inference which exploits model structure can distinctly enable a kernel approach to these problems. Moreover, there is much promise in further exploring Bayesian nonparametric kernel methods for large scale pattern extrapolation. Such methods can be extremely expressive, and expressive methods are most needed for large scale problems, which provide relatively more information for automatically learning a rich statistical representation of the data.

Acknowledgements AGW thanks ONR grant N000141410684 and NIH grant R01GM093156. JPC thanks Simons Foundation grants SCGB #325171, #325233, and the Grossman Center at Columbia.

References

[1] C.E. Rasmussen and C.K.I. Williams. Gaussian Processes for Machine Learning. The MIT Press, 2006.

[2] C.E. Rasmussen. Evaluation of Gaussian Processes and Other Methods for Non-linear Regression. PhD thesis, University of Toronto, 1996.

[3] A. O'Hagan. Curve fitting and optimal design for prediction. Journal of the Royal Statistical Society, B(40):1–42, 1978.

[4] M. Gönen and E. Alpaydın. Multiple kernel learning algorithms. Journal of Machine Learning Research, 12:2211–2268, 2011.

[5] A.G. Wilson and R.P. Adams. Gaussian process kernels for pattern discovery and extrapolation. International Conference on Machine Learning, 2013.

[6] M. Lázaro-Gredilla, J. Quiñonero-Candela, C.E. Rasmussen, and A.R. Figueiras-Vidal. Sparse spectrum Gaussian process regression. Journal of Machine Learning Research, 11:1865–1881, 2010.

[7] C.K.I. Williams and M. Seeger. Using the Nyström method to speed up kernel machines. In Advances in Neural Information Processing Systems, pages 682–688.
MIT Press, 2001.

[8] Q. Le, T. Sarlos, and A. Smola. Fastfood: computing Hilbert space expansions in loglinear time. In International Conference on Machine Learning, pages 244–252, 2013.

[9] A. Rahimi and B. Recht. Random features for large-scale kernel machines. In Neural Information Processing Systems, 2007.

[10] E. Snelson and Z. Ghahramani. Sparse Gaussian processes using pseudo-inputs. In Advances in Neural Information Processing Systems, volume 18, page 1257. MIT Press, 2006.

[11] J. Hensman, N. Fusi, and N.D. Lawrence. Gaussian processes for big data. In Uncertainty in Artificial Intelligence (UAI). AUAI Press, 2013.

[12] M. Seeger, C.K.I. Williams, and N.D. Lawrence. Fast forward selection to speed up sparse Gaussian process regression. In Workshop on AI and Statistics, volume 9, 2003.

[13] J. Quiñonero-Candela and C.E. Rasmussen. A unifying view of sparse approximate Gaussian process regression. The Journal of Machine Learning Research, 6:1939–1959, 2005.

[14] Y. Saatçi. Scalable Inference for Structured Gaussian Process Models. PhD thesis, University of Cambridge, 2011.

[15] E. Gilboa, Y. Saatçi, and J.P. Cunningham. Scaling multidimensional inference for structured Gaussian processes. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2013.

[16] C. Guillemot and O. Le Meur. Image inpainting: overview and recent advances. IEEE Signal Processing Magazine, 31(1):127–144, 2014.

[17] S. Bochner. Lectures on Fourier Integrals, volume 42. Princeton University Press, 1959.

[18] A.G. Wilson. A process over all stationary kernels. Technical report, University of Cambridge, June 2012. http://www.cs.cmu.edu/~andrewgw/spectralkernel.pdf.

[19] A.G. Wilson. Covariance Kernels for Fast Automatic Pattern Discovery and Extrapolation with Gaussian Processes. PhD thesis, University of Cambridge, 2014.
URL http://www.cs.cmu.edu/~andrewgw/andrewgwthesis.pdf.

[20] K.E. Atkinson. An Introduction to Numerical Analysis. John Wiley & Sons, 2008.

[21] C.T.H. Baker. The Numerical Treatment of Integral Equations. 1977.

[22] C.K.I. Williams and J. Shawe-Taylor. The stability of kernel principal components analysis and its relation to the process eigenspectrum. In Advances in Neural Information Processing Systems, volume 15, page 383. MIT Press, 2003.

[23] Y. Luo and R. Duraiswami. Fast near-grid Gaussian process regression. In International Conference on Artificial Intelligence and Statistics, 2013.

[24] A. Naish-Guzman and S. Holden. The generalized FITC approximation. In Advances in Neural Information Processing Systems, pages 1057–1064, 2007.

[25] D. Duvenaud, J.R. Lloyd, R. Grosse, J.B. Tenenbaum, and Z. Ghahramani. Structure discovery in nonparametric regression through compositional kernel search. In International Conference on Machine Learning, 2013.

[26] D.J.C. MacKay. Bayesian nonlinear modeling for the prediction competition. ASHRAE Transactions, 100(2):1053–1062, 1994.

[27] J. Hays and A. Efros. Scene completion using millions of photographs. Communications of the ACM, 51(10):87–94, 2008.