Paper ID:1907
Title: Fast Kernel Learning for Multidimensional Pattern Extrapolation
Current Reviews

Submitted by Assigned_Reviewer_21

Q1: Comments to author(s). First provide a summary of the paper, and then address the following criteria: Quality, clarity, originality and significance. (For detailed reviewing guidelines, see http://nips.cc/PaperInformation/ReviewerInstructions)
The authors combine two recent advances in Gaussian processes, spectral mixture kernels [5] and scalable Gaussian processes for data on grids [14, 22; see below], in order to tackle applications with large numbers of data points, such as texture extrapolation, inpainting, and video extrapolation. The paper includes a thorough evaluation of the proposed framework, and comparisons against sparse GP methods, with general purpose covariance functions, and spectral mixture kernels.

Quality

The paper is technically sound. The framework proposed by the authors achieves outstanding results in the different applications studied in the paper. The authors claim to relax the grid assumption by completing the grid with imaginary observations. It is important to discuss this issue when each of the experiments is introduced, since the imaginary observations are mentioned in Section 2.1 and hardly mentioned again. How would the method work in higher-dimensional settings (greater than three)? How prone to overfitting is the method? What are the weaknesses of the method?

Clarity

The paper is clearly written and well organized.

Originality

As the authors clearly state, the proposed framework is a combination of spectral mixture kernels and fast Gaussian process regression on grids. From this point of view, the novelty of the method is incremental.

Significance

I must say that the quality of the results obtained by the proposed method is impressive. The method beats all of the GP alternatives considered. I wonder how methods from the image inpainting literature would compare against the ones presented by the authors.

[22] Yuancheng Luo and Ramani Duraiswami (2013). Fast Near-GRID Gaussian Process Regression. AISTATS 2013.
Q2: Please summarize your review in 1-2 sentences
A straightforward combination of fast Gaussian process regression on grids with spectral mixture kernels, with outstanding results on several (and difficult) examples.

Submitted by Assigned_Reviewer_24

Q1: Comments to author(s). First provide a summary of the paper, and then address the following criteria: Quality, clarity, originality and significance. (For detailed reviewing guidelines, see http://nips.cc/PaperInformation/ReviewerInstructions)
This paper proposes to scale up Gaussian process (GP) computations by extending Kronecker-based methods, and to allow GPs to extrapolate using spectral mixture kernels.

Even though the ideas might not be extremely novel and the applicability of the proposed methodology is restricted to a class of problems (see my comments below), I found the paper quite enjoyable and I believe that it deserves to be considered for publication. Here are a few comments on what I liked and what I think might be improved:

- I fully agree with the authors when they motivate the need for spectral mixture kernels (end of Section 2).

- the assumption that data lie on a grid is very restrictive in machine learning/statistics applications, so the proposed extension to tractably deal with non-grid data is useful and I found it quite interesting. However, I don't think that the applicability of this idea goes beyond image data or time series data (or perhaps some 3d data). I think it would be good to stress this aspect of the proposed idea in the introduction.

- one of the novelties of this work lies in the extension of the Kronecker-based inference for GPs in [14] to deal with non-grid data. There is some recent work by E. Gilboa, Y. Saatçi, and J. Cunningham, "Scaling multidimensional inference for structured Gaussian processes", to appear in IEEE TPAMI, that I think is relevant and potentially more general than what is proposed here.

- the experimental evaluation is quite thorough and the results are compelling.
Q2: Please summarize your review in 1-2 sentences
The paper proposes an interesting combination of Kronecker-based inference for Gaussian processes and spectral mixture kernels.

The experimental validation is extensive and the results are compelling.

Submitted by Assigned_Reviewer_41

Q1: Comments to author(s). First provide a summary of the paper, and then address the following criteria: Quality, clarity, originality and significance. (For detailed reviewing guidelines, see http://nips.cc/PaperInformation/ReviewerInstructions)
The paper builds on previous work of Wilson and Adams on Gaussian process (GP) kernels for pattern discovery and extrapolation. More precisely, the authors are interested in extrapolation for multi-dimensional inputs, and they consider a spectral mixture product kernel. Furthermore, the proposed method is suited to low-dimensional input spaces (up to three dimensions are considered in the experiments) by making use of grid structure and Kronecker products in the computation of the covariance matrix. The authors also propose a method to deal with datasets that do not necessarily lie on a grid. In such cases, the authors augment the training data with additional function values that lie on a grid.

One unclear point in the theory is the approximation used for the log determinant in equation (3). It is not clear what the theoretical basis of this approximation is. This is certainly a weakness of the paper, and it needs further discussion.

The experiments are quite convincing, and the authors show a series of results that demonstrate the ability of their algorithm to extrapolate image patterns and fill in missing parts of images (inpainting). The ability to develop a GP framework that extrapolates these image textures is quite impressive, although the image patterns are rather periodic.

Q2: Please summarize your review in 1-2 sentences
To summarize, the paper builds on previous work on flexible kernels that extrapolate well. Some theoretical issues deserve more discussion. Experiments look convincing.
Author Feedback
Q1:Author rebuttal: Please respond to any concerns raised in the reviews. There are no constraints on how you want to argue your case, except for the fact that your text should be limited to a maximum of 6000 characters. Note however, that reviewers and area chairs are busy and may not read long vague rebuttals. It is in your own interest to be concise and to the point.
We thank the reviewers for supportive comments. We respond to the reviewers individually.

Reviewer 21:

- We find, especially with N > 10^4 points, that the method does not generally suffer from over-fitting when using marginal likelihood optimisation. The number of free parameters (hyperparameters) is typically far smaller than the number of data points (e.g. 60 hyperparameters vs 10^5 data points). Also, Figure 1j shows how the marginal likelihood can act as a proxy for model selection, with ARD over the weight parameters, through the automatically calibrated complexity penalty. The weights of extraneous components, which do not help much with model fit, will shrink towards zero. This helps prevent over-fitting and helps us interpret the appropriateness of the model specification.
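For reference, the standard GP log marginal likelihood being optimised makes this automatically calibrated complexity penalty explicit (Rasmussen and Williams, 2006); in our notation, with hyperparameters \(\boldsymbol{\theta}\) and noise variance \(\sigma^2\),

\[
\log p(\mathbf{y} \mid \boldsymbol{\theta}) = \underbrace{-\tfrac{1}{2}\, \mathbf{y}^{\top} (K_{\boldsymbol{\theta}} + \sigma^2 I)^{-1} \mathbf{y}}_{\text{model fit}} \; \underbrace{-\, \tfrac{1}{2} \log \lvert K_{\boldsymbol{\theta}} + \sigma^2 I \rvert}_{\text{complexity penalty}} \; - \; \tfrac{N}{2} \log 2\pi ,
\]

so spectral mixture components whose weights do not improve the fit term only increase the complexity term, and those weights are shrunk towards zero during optimisation.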

- The method scales well with the number of data points and input dimensions, but there has to be some grid structure in the 'M' points to see efficiency gains. We have included a caveat in this regard in lines 165-168 of the text. If there is grid structure, the scaling with the total number of points improves with P. This sort of grid structure tends to be more natural in lower dimensional (e.g., P < 5) settings (images, video, spatial statistics, etc.), although somewhat higher dimensional problems could be engineered for grid structure. Given limited space, we wanted to focus on experiments that would be most natural for the proposed approach.
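As a minimal sketch of where these gains come from (illustrative only, in our own notation and with a hypothetical helper name, not code from the paper): for a product kernel evaluated on a P-dimensional grid, the N x N covariance factorises as a Kronecker product of small per-dimension matrices, so a matrix-vector product never requires forming the full matrix and costs O(N * sum_p n_p) rather than O(N^2).

    import numpy as np

    def kron_mvm(Ks, x):
        # Computes (K_1 kron K_2 kron ... kron K_P) @ x without forming the
        # full N x N matrix, where N is the product of the per-dimension sizes.
        N = x.size
        for K in Ks:
            n = K.shape[0]
            X = x.reshape(n, N // n)      # fold out the current grid dimension
            x = (K @ X).T.reshape(-1)     # apply K along it, then rotate axes
        return x

    # Toy check on a 4 x 5 x 6 grid (N = 120) with squared-exponential factors
    grids = [np.linspace(0, 1, n) for n in (4, 5, 6)]
    Ks = [np.exp(-0.5 * np.subtract.outer(g, g) ** 2) for g in grids]
    x = np.random.randn(4 * 5 * 6)
    full = np.kron(np.kron(Ks[0], Ks[1]), Ks[2])
    assert np.allclose(kron_mvm(Ks, x), full @ x)

The same per-dimension factorisation underlies the eigendecomposition and log determinant computations; without grid structure in the inputs, the covariance does not factorise and these gains disappear.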

- We believe that, in addition to the specific algorithm, there is valuable novelty in the experimental applications (multidimensional pattern extrapolation, inpainting, large scale long range land surface temperature forecasting, etc.), and in the findings of the experiments (e.g., the helpfulness of a non-parametric representation, the natural cohesion of spectral mixture product kernels and Kronecker inference, the separate effects of various models and inference algorithms). Thanks for mentioning Luo and Duraiswami (2013), which we will cite. We note that GPatt is a significant advance on this method when approaching extrapolation problems: due to the need for inducing inputs, Luo and Duraiswami (2013) would perform similarly to the results shown in Figure 1f, for FITC.


Reviewer 24:

- Thanks for the supportive comments.

- If there is grid structure, the scaling with the total number of points improves with the input dimension P. However, grid structure tends to be more natural in lower dimensional (e.g., P < 5) settings (images, video, spatial statistics, etc.), although somewhat higher dimensional problems could be engineered to have grid structure. We have tried to focus our experiments, showing a variety of properties of the proposed approach, on examples where GPatt is most natural. We have included caveats in this regard in lines 62-63 and 165-168, and we will be sure to make these points more clear. We considered some video data, and 3D spatio-temporal data (long range forecasting of land surface temperatures), in addition to the 2D image data.


Reviewer 41:

- Thanks for the question on the approximation. We have included some extra general background on Kronecker methods in the supplementary material. Regarding the log determinant approximation, some useful additional background can be found in Rasmussen and Williams (pages 98-99), Theorem 3.4 of Baker (1977), and Shawe-Taylor and Williams (2003). Baker (1977) proves that the approximation will converge in the limit N -> infinity. One can also use PCA to upper and lower bound the true eigenvalues by their approximation, as described in p.99 of R&W (2006) and Shawe-Taylor and Williams (2003). The larger eigenvalues are approximated more accurately than the smaller ones. In practice, we find the approximation of the overall log determinant essentially exact for N > 1000 points. In addition to depending on the number of data points, the approximation improves in practice with the 'smoothness' of the covariance matrix (longer length-scales, less support for high frequencies in the spectral densities). We hope this explanation and the additional references help supplement this part of the paper, and we will add more detail in an updated version.
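As a sketch of the eigenvalue scaling at the heart of the approximation (our notation here; see equation (3) of the paper for the precise statement): with M observed points completed to an N-point grid, the eigenvalues \(\lambda_i^N\) of the full grid covariance \(K_N\), which are cheap to obtain from its Kronecker structure, are rescaled to approximate the eigenvalues of the observed-data covariance \(K_M\),

\[
\lambda_i^{M} \approx \tfrac{M}{N}\, \lambda_i^{N}, \qquad \log \lvert K_M + \sigma^2 I_M \rvert = \sum_{i=1}^{M} \log\!\left( \lambda_i^{M} + \sigma^2 \right) \approx \sum_{i=1}^{M} \log\!\left( \tfrac{M}{N}\, \lambda_i^{N} + \sigma^2 \right).
\]

The references above concern the convergence of such matrix eigenvalues to those of the underlying kernel operator as the number of points grows, and the noise variance \(\sigma^2\) damps the contribution of the smaller, less accurately approximated eigenvalues.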

References:
Baker, C. T. H. (1977). The Numerical Treatment of Integral Equations. Clarendon Press, Oxford.
Rasmussen, C. E. and Williams, C. K. I. (2006). Gaussian Processes for Machine Learning. MIT Press. pages 98-99.
Shawe-Taylor, J. and Williams, C. K. I. (2003). The Stability of Kernel Principal Components Analysis and its Relation to the Process Eigenspectrum. NIPS.