{"title": "Additive Gaussian Processes", "book": "Advances in Neural Information Processing Systems", "page_first": 226, "page_last": 234, "abstract": "We introduce a Gaussian process model of functions which are additive. An additive function is one which decomposes into a sum of low-dimensional functions, each depending on only a subset of the input variables. Additive GPs generalize both Generalized Additive Models, and the standard GP models which use squared-exponential kernels. Hyperparameter learning in this model can be seen as Bayesian Hierarchical Kernel Learning (HKL). We introduce an expressive but tractable parameterization of the kernel function, which allows efficient evaluation of all input interaction terms, whose number is exponential in the input dimension. The additional structure discoverable by this model results in increased interpretability, as well as state-of-the-art predictive power in regression tasks.", "full_text": "Additive Gaussian Processes\n\nDavid Duvenaud\n\nHannes Nickisch\n\nDepartment of Engineering\n\nMPI for Intelligent Systems\n\nCambridge University\ndkd23@cam.ac.uk\n\nT\u00a8ubingen, Germany\nhn@tue.mpg.de\n\nCarl Edward Rasmussen\nDepartment of Engineering\n\nCambridge University\ncer54@cam.ac.uk\n\nAbstract\n\nWe introduce a Gaussian process model of functions which are additive. An addi-\ntive function is one which decomposes into a sum of low-dimensional functions,\neach depending on only a subset of the input variables. Additive GPs general-\nize both Generalized Additive Models, and the standard GP models which use\nsquared-exponential kernels. Hyperparameter learning in this model can be seen\nas Bayesian Hierarchical Kernel Learning (HKL). We introduce an expressive but\ntractable parameterization of the kernel function, which allows ef\ufb01cient evalua-\ntion of all input interaction terms, whose number is exponential in the input di-\nmension. 
The additional structure discoverable by this model results in increased interpretability, as well as state-of-the-art predictive power in regression tasks.\n\n1 Introduction\n\nMost statistical regression models in use today are of the form: g(y) = f(x1) + f(x2) + \u00b7\u00b7\u00b7 + f(xD). Popular examples include logistic regression, linear regression, and Generalized Linear Models [1]. This family of functions, known as Generalized Additive Models (GAM) [2], is typically easy to fit and interpret. Some extensions of this family, such as smoothing-splines ANOVA [3], add terms depending on more than one variable. However, such models generally become intractable and difficult to fit as the number of terms increases.\n\nAt the other end of the spectrum are kernel-based models, which typically allow the response to depend on all input variables simultaneously. These have the form: y = f(x1, x2, . . . , xD). A popular example would be a Gaussian process model using a squared-exponential (or Gaussian) kernel. We denote this model as SE-GP. This model is much more flexible than the GAM, but its flexibility makes it difficult to generalize to new combinations of input variables.\n\nIn this paper, we introduce a Gaussian process model that generalizes both GAMs and the SE-GP. This is achieved through a kernel which allows additive interactions of all orders, ranging from first-order interactions (as in a GAM) all the way to Dth-order interactions (as in an SE-GP). Although this kernel amounts to a sum over an exponential number of terms, we show how to compute this kernel efficiently, and introduce a parameterization which limits the number of hyperparameters to O(D). A Gaussian process with this kernel function (an additive GP) constitutes a powerful model that allows one to automatically determine which orders of interaction are important. 
We show that this model can significantly improve modeling efficacy, and has major advantages for model interpretability. This model is also extremely simple to implement, and we provide example code.\n\nWe note that a similar breakthrough, called Hierarchical Kernel Learning (HKL) [4], has recently been made. HKL explores a similar class of models, and sidesteps the possibly exponential number of interaction terms by cleverly selecting only a tractable subset. However, this method suffers considerably from the fact that cross-validation must be used to set hyperparameters. In addition, the machinery necessary to train these models is immense. Finally, on real datasets, HKL is outperformed by the standard SE-GP [4].\n\n[Figure 1 panels: draws f1(x1), f2(x2) from 1D GP priors with kernels k1(x1), k2(x2); their sum f1(x1) + f2(x2), a draw from the 1st-order kernel k1(x1) + k2(x2); and f(x1, x2), a draw from the 2nd-order product kernel k1(x1)k2(x2).]\n\nFigure 1: A first-order additive kernel, and a product kernel. Left: a draw from a first-order additive kernel corresponds to a sum of draws from one-dimensional kernels. Right: functions drawn from a product kernel prior have weaker long-range dependencies, and less long-range structure.\n\n2 Gaussian Process Models\n\nGaussian processes are a flexible and tractable prior over functions, useful for solving regression and classification tasks [5]. The kind of structure which can be captured by a GP model is mainly determined by its kernel: the covariance function. 
One of the main difficulties in specifying a Gaussian process model is in choosing a kernel which can represent the structure present in the data. For small to medium-sized datasets, the kernel has a large impact on modeling efficacy.\n\nFigure 1 compares, for two-dimensional functions, a first-order additive kernel with a second-order kernel. We can see that a GP with a first-order additive kernel is an example of a GAM: each function drawn from this model is a sum of orthogonal one-dimensional functions. Compared to functions drawn from the higher-order GP, draws from the first-order GP have more long-range structure.\n\nWe can expect many natural functions to depend only on sums of low-order interactions. For example, the price of a house or car will presumably be well approximated by a sum of prices of individual features, such as a sun-roof. Other parts of the price may depend jointly on a small set of features, such as the size and building materials of a house. Capturing these regularities will mean that a model can confidently extrapolate to unseen combinations of features.\n\n3 Additive Kernels\n\nWe now give a precise definition of additive kernels. We first assign each dimension i \u2208 {1 . . . D} a one-dimensional base kernel ki(xi, x'i). We then define the first-order, second-order and nth-order additive kernels as:\n\nkadd1(x, x') = \u03c3_1^2 \u2211_{i=1}^{D} ki(xi, x'i)    (1)\n\nkadd2(x, x') = \u03c3_2^2 \u2211_{i=1}^{D} \u2211_{j=i+1}^{D} ki(xi, x'i) kj(xj, x'j)    (2)\n\nkaddn(x, x') = \u03c3_n^2 \u2211_{1 \u2264 i1 < i2 < \u00b7\u00b7\u00b7 < in \u2264 D} \u220f_{d=1}^{n} k_{id}(x_{id}, x'_{id})    (3)
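The definitions above are simple enough to implement directly. The following is a minimal sketch (not the example code the paper provides) that evaluates the order-n additive kernel of eq. (3) by brute-force enumeration of dimension subsets, assuming squared-exponential base kernels with a unit lengthscale; the function names are illustrative. Its cost grows combinatorially in n, unlike the efficient evaluation the paper develops.

```python
import itertools
import math

def se_kernel_1d(x, y, lengthscale=1.0):
    """One-dimensional squared-exponential base kernel k_i (an illustrative choice)."""
    return math.exp(-0.5 * (x - y) ** 2 / lengthscale ** 2)

def additive_kernel(x, xp, order, variance=1.0):
    """Order-n additive kernel, eq. (3): variance * sum over all size-n
    subsets {i_1 < ... < i_n} of dimensions of the product of base kernels."""
    D = len(x)
    total = 0.0
    for subset in itertools.combinations(range(D), order):
        prod = 1.0
        for d in subset:
            prod *= se_kernel_1d(x[d], xp[d])
        total += prod
    return variance * total
```

With order=1 this reduces to the sum of base kernels in eq. (1), and with order=D it reduces to the product of all base kernels, recovering SE-GP-like behavior.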