{"title": "Bayesian optimization for automated model selection", "book": "Advances in Neural Information Processing Systems", "page_first": 2900, "page_last": 2908, "abstract": "Despite the success of kernel-based nonparametric methods, kernel selection still requires considerable expertise, and is often described as a \u201cblack art.\u201d We present a sophisticated method for automatically searching for an appropriate kernel from an infinite space of potential choices. Previous efforts in this direction have focused on traversing a kernel grammar, only examining the data via computation of marginal likelihood. Our proposed search method is based on Bayesian optimization in model space, where we reason about model evidence as a function to be maximized. We explicitly reason about the data distribution and how it induces similarity between potential model choices in terms of the explanations they can offer for observed data. In this light, we construct a novel kernel between models to explain a given dataset. Our method is capable of finding a model that explains a given dataset well without any human assistance, often with fewer computations of model evidence than previous approaches, a claim we demonstrate empirically.", "full_text": "Bayesian optimization for automated model selection\n\nGustavo Malkomes,\u2020 Chip Schaff,\u2020 Roman Garnett\nDepartment of Computer Science and Engineering\n\n{luizgustavo, cbschaff, garnett}@wustl.edu\n\nWashington University in St. Louis\n\nSt. Louis, MO 63130\n\nAbstract\n\nDespite the success of kernel-based nonparametric methods, kernel selection still\nrequires considerable expertise, and is often described as a \u201cblack art.\u201d We present a\nsophisticated method for automatically searching for an appropriate kernel from an\nin\ufb01nite space of potential choices. Previous efforts in this direction have focused on\ntraversing a kernel grammar, only examining the data via computation of marginal\nlikelihood. 
Our proposed search method is based on Bayesian optimization in\nmodel space, where we reason about model evidence as a function to be maximized.\nWe explicitly reason about the data distribution and how it induces similarity\nbetween potential model choices in terms of the explanations they can offer for\nobserved data. In this light, we construct a novel kernel between models to explain\na given dataset. Our method is capable of \ufb01nding a model that explains a given\ndataset well without any human assistance, often with fewer computations of model\nevidence than previous approaches, a claim we demonstrate empirically.\n\n1\n\nIntroduction\n\nOver the past decades, enormous human effort has been devoted to machine learning; preprocessing\ndata, model selection, and hyperparameter optimization are some examples of critical and often\nexpert-dependent tasks. The complexity of these tasks has in some cases relegated them to the realm\nof \u201cblack art.\u201d In kernel methods in particular, the selection of an appropriate kernel to explain\na given dataset is critical to success in terms of the \ufb01delity of predictions, but the vast space of\npotential kernels renders the problem nontrivial. We consider the problem of automatically \ufb01nding\nan appropriate probabilistic model to explain a given dataset. Although our proposed algorithm is\ngeneral, we will focus on the case where a model can be completely speci\ufb01ed by a kernel, as is the\ncase for example for centered Gaussian processes (GPs).\nRecent work has begun to tackle the kernel-selection problem in a systematic way. Duvenaud et al.\n[1] and Grosse et al. [2] described generative grammars for enumerating a countably in\ufb01nite space of\narbitrarily complex kernels via exploiting the closure of kernels under additive and multiplicative\ncomposition. We adopt this kernel grammar in this work as well. 
Given a dataset, Duvenaud et al.\n[1] proposed searching this in\ufb01nite space of models using a greedy search mechanism. Beginning\nat the root of the grammar, we traverse the tree greedily attempting to maximize the (approximate)\nevidence for the data given by a GP model incorporating the kernel.\nIn this work, we develop a more sophisticated mechanism for searching through this space. The\ngreedy search described above only considers a given dataset by querying a model\u2019s evidence. Our\nsearch performs a metalearning procedure, which, conditional on a dataset, establishes similarities\namong the models in terms of the space of explanations they can offer for the data. With this\nviewpoint, we construct a novel kernel between models (a \u201ckernel kernel\u201d). We then approach\n\n\u2020These authors contributed equally to this work\n\n30th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain.\n\n\fthe model-search problem via Bayesian optimization, treating the model evidence as an expensive\nblack-box function to be optimized as a function of the kernel. The dependence of our kernel between\nmodels on the distribution of the data is critical; depending on a given dataset, the kernels generated\nby a compositional grammar could be especially rich or deceptively so.\nWe develop an automatic framework for exploring a set of potential models, seeking the model that\nbest explains a given dataset. Although we focus on Gaussian process models de\ufb01ned by a grammar,\nour method could be easily extended to any probabilistic model with a parametric or structured model\nspace. 
Our search performs competitively with other baselines, including the greedy method from [1], across a variety of datasets, especially in terms of the number of models for which we must compute the (expensive) evidence, which typically scales cubically for kernel methods.\n\n2 Related work\n\nThere are several works attempting to create more expressive kernels, either by combining kernels or designing custom ones. Multiple kernel learning approaches, for instance, construct a kernel for a given dataset through a weighted sum of a predefined and fixed set of kernels, adjusting the weights to best explain the observed data. Besides limiting the space of kernels considered, these approaches require the hyperparameters of the component kernels to be specified in advance [3, 4]. Another approach is to design flexible kernel families [5\u20137]. These methods often use Bochner\u2019s theorem to reason in spectral space, and can approximate any stationary kernel function. In contrast, our method does not depend on stationarity. Other work has developed expressive kernels by combining Gaussian processes with deep belief networks; see, for example, [8\u201310]. Unfortunately, there is no free lunch; these methods require complicated inference techniques that are much more costly than using standard kernels.\nThe goal of automated machine learning (autoML) is to automate complex machine-learning procedures using insights and techniques from other areas of machine learning. Our work falls into this broad category of research. By applying machine learning methods throughout the entire modeling process, it is possible to create more automated and, eventually, better systems. Bergstra et al. [11] and Snoek et al. [12], for instance, have shown how to use modern optimization tools such as Bayesian optimization to set the hyperparameters of machine learning methods (e.g., deep neural networks and structured SVMs). 
Our approach to model search is also based on Bayesian optimization, and its success in similar settings is encouraging for our adoption here. Gardner et al. [13] also considered the automated model selection problem, but in an active learning framework with a fixed set of models. We note that our method could be adapted to their Bayesian active model selection framework with minor changes, but we focus on the classical supervised learning case with a fixed training set.\n\n3 Bayesian optimization for model search\nSuppose we face a classical supervised learning problem defined on an input space X and output space Y. We are given a set of training observations D = (X, y), where X represents the design matrix of explanatory variables xi \u2208 X , and yi \u2208 Y is the respective value or label to be predicted. Ultimately, we want to use D to predict the value y\u2217 associated with an unseen point x\u2217. Given a probabilistic model M, we may accomplish this via formation of the predictive distribution.\nSuppose, however, that we are given a collection of probabilistic models M that could have plausibly generated the data. Ideally, finding the source of D would let us solve our prediction task with the highest fidelity. Let M \u2208 M be a probabilistic model, and let \u0398M be the corresponding parameter space. These models are typically parametric families of distributions, each of which encodes a structural assumption about the data, for example, that the data can be described by a linear, quadratic, or periodic trend. Further, the member distributions (M\u03b8 \u2208 M, \u03b8 \u2208 \u0398M) of M differ from each other by the particular values of some properties\u2014represented by the hyperparameters \u03b8\u2014related to the data, such as amplitude, characteristic length scales, etc.\nWe wish to select one model from this collection of models M to explain D. 
From a Bayesian perspective, the principled approach to this problem is Bayesian model selection.2 The critical value is the model evidence, the probability of generating the observed data given a model M:\n\np(y | X,M) = \u222b\u0398M p(y | X, \u03b8,M) p(\u03b8 | M) d\u03b8.  (1)\n\n2\u201cModel selection\u201d is unfortunately sometimes also used in the GP literature for the process of hyperparameter learning (selecting some M\u03b8 \u2208 M), rather than selecting a model class M, the focus of our work.\n\n2\n\n\fThe evidence (also called the marginal likelihood) integrates over \u03b8 to account for all possible explanations of the data offered by the model, under a prior p(\u03b8 | M) associated with that model.\nOur goal is to automatically explore a space of models M to select a model3 M\u2217 \u2208 M that explains a given dataset D as well as possible, according to the model evidence. The essence of our method, which we call Bayesian optimization for model search (BOMS), is viewing the evidence as a function g : M \u2192 R to be optimized. We note two important aspects of g. First, for large datasets and/or complex models, g is an expensive function, for example growing cubically with |D| for GP models. Further, gradient information about g is impossible to compute due to the discrete nature of M. We can, however, query a model\u2019s evidence as a black-box function. For these reasons, we propose to optimize the evidence over M using Bayesian optimization, a technique well-suited for optimizing expensive, gradient-free, black-box objectives [14]. 
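As a toy illustration (not part of the method described in this paper), the evidence integral (1) can be approximated for a trivial one-hyperparameter model by simple Monte Carlo over the hyperparameter prior; the model, data, and sample sizes below are all hypothetical stand-ins:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical model M: y_i ~ N(0, sigma^2), with hyperparameter
# prior log sigma ~ N(0, 1) playing the role of p(theta | M).
y = rng.normal(scale=1.5, size=50)  # stand-in "observed" data

def log_likelihood(y, log_sigma):
    # log p(y | theta, M) for the toy Gaussian model
    s2 = np.exp(2.0 * log_sigma)
    return -0.5 * np.sum(y**2 / s2 + np.log(2.0 * np.pi * s2))

# Monte Carlo estimate of (1): p(y | M) ~= (1/n) sum_j p(y | theta_j, M),
# with theta_j drawn from the prior p(theta | M).
theta = rng.normal(size=4000)  # samples of log sigma from the prior
lls = np.array([log_likelihood(y, t) for t in theta])
log_evidence = np.logaddexp.reduce(lls) - np.log(len(theta))
```

This naive estimator becomes impractical for GP hyperparameter spaces, which is why the paper instead uses the Laplace approximation of §4.3.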
In this framework, we seek an optimal model\n\nM\u2217 = arg max_{M\u2208M} g(M;D),  (2)\n\nwhere g(M;D) is the (log) model evidence:\n\ng(M;D) = log p(y | X,M).\n\nWe begin by placing a Gaussian process (GP) prior on g,\n\np(g) = GP(g; \u00b5g, Kg),  (3)\n\nwhere \u00b5g : M \u2192 R is a mean function and Kg : M2 \u2192 R is a covariance function appropriately defined over the model space M. This is a nontrivial task due to the discrete and potentially complex nature of M. We will suggest useful choices for \u00b5g and Kg when M is a space of Gaussian process models below. Now, given observations of the evidence of a selected set of models,\n\nDg = {(Mi, g(Mi;D))},  (4)\n\nwe may compute the posterior distribution on g conditioned on Dg, which will be an updated Gaussian process [15]. Bayesian optimization uses this probabilistic belief about g to induce an inexpensive acquisition function indicating which model we should evaluate next. Here we use the classical expected improvement (EI) [16] acquisition function, or a slight variation described below, because it naturally considers the trade-off between exploration and exploitation. The exact choice of acquisition function, however, is not critical to our proposal. 
In each round of our model search, we will evaluate the acquisition function for a number of candidate models C(Dg) = {Mi}, and compute the evidence of the candidate where it is maximized:\n\nM\u2032 = arg max_{M\u2208C} \u03b1EI(M;Dg).\n\nWe then incorporate the chosen model M\u2032 and the observed model evidence g(M\u2032;D) into our model evidence training set Dg, update the posterior on g, select a new set of candidates, and continue. We repeat this iterative procedure until a budget is expended, typically measured in terms of the number of models considered.\nWe have observed that expected improvement [16] works well, especially for small and/or low-dimensional problems. When the dataset is large and/or high-dimensional, training costs can be considerable and variable, especially for complex models. To give better anytime performance on such datasets, we use expected improvement per second, where we divide the expected improvement by an estimate of the time required to compute the evidence. In our experiments, this estimate was obtained by fitting a linear regression model to the log time to compute g(M;D) as a function of the number of hyperparameters (the dimension of \u0398M), trained on the models available in Dg.\nThe acquisition function allows us to quickly determine which models are more promising than others, given the evidence we have observed so far. Since M is an infinite set of models, we cannot consider every model in every round. 
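One scoring step of such a round can be sketched as follows. This is only a schematic illustration of expected improvement over a discrete candidate pool: the surrogate posterior moments and the best-evidence value below are made-up numbers, not the output of the model described in this paper.

```python
import numpy as np
from scipy.stats import norm

def expected_improvement(mu, sigma, best):
    # EI for maximization, assuming a Gaussian posterior N(mu, sigma^2)
    # on the objective value at each candidate.
    sigma = np.maximum(sigma, 1e-12)
    z = (mu - best) / sigma
    return (mu - best) * norm.cdf(z) + sigma * norm.pdf(z)

# Hypothetical surrogate posterior over five candidate models:
mu = np.array([-1.2, -0.8, -1.0, -0.5, -0.9])   # posterior means of g
sigma = np.array([0.3, 0.1, 0.4, 0.05, 0.2])    # posterior std devs
best_evidence = -0.7                            # best g(M; D) seen so far

ei = expected_improvement(mu, sigma, best_evidence)
next_model = int(np.argmax(ei))  # evaluate this candidate's evidence next
```

Note how the trade-off plays out: the candidate with the highest mean wins here, but a candidate with a mediocre mean and large uncertainty (index 2) outscores better-known candidates.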
Instead, we will define a heuristic to evaluate the acquisition function at a smaller set of active candidate models below.\n\n3We could also select a set of models but, for simplicity, we assume that there is one model that best explains the data with overwhelming probability, which would imply that there is no benefit in considering more than one model, e.g., via Bayesian model averaging.\n\n3\n\n\f4 Bayesian optimization for Gaussian process kernel search\nWe introduced above a general framework for searching over a space of probabilistic models M to explain a dataset D without making further assumptions about the nature of the models. In the following, we will provide specific suggestions for the case that all members of M are Gaussian process priors on a latent function.\nWe assume that our observations y were generated according to an unknown function f : X \u2192 R via a fixed probabilistic observation mechanism p(y | f ), where fi = f (xi). In our experiments here, we will consider regression with additive Gaussian observation noise, but this is not integral to our approach. We further assume a GP prior distribution on f, p(f ) = GP(f ; \u00b5f , Kf ), where \u00b5f : X \u2192 R is a mean function and Kf : X 2 \u2192 R is a positive-definite covariance function or kernel. For simplicity, we will assume that the prior on f is centered, \u00b5f (x) = 0, which lets us fully define the prior on f by the kernel function Kf . We assume that the kernel function is parameterized by hyperparameters that we concatenate into a vector \u03b8. In this restricted context, a model M is completely determined by the choice of kernel function and an associated hyperparameter prior p(\u03b8 | M). Below we briefly review a previously suggested method for constructing an infinite space of potential kernels to model the latent function f, and thus an infinite family of models M. 
We will then discuss the standardized and automated construction of associated hyperparameter priors.\n\n4.1 Space of compositional Gaussian process kernels\n\nWe adopt the same space of kernels defined by Duvenaud et al. [1], which we briefly summarize here. We refer the reader to the original paper for more details. Given a set of simple, so-called base kernels, such as the common squared exponential (SE), periodic (PER), linear (LIN), and rational quadratic (RQ) kernels, we create new and potentially complex kernels by summation and multiplication of these base units. The entire kernel space can be described by the following grammar rules:\n\n1. Any subexpression S can be replaced with S + B, where B is a base kernel.\n2. Any subexpression S can be replaced with S \u00d7 B, where B is a base kernel.\n3. Any base kernel B may be replaced with another base kernel B\u2032.\n\n4.2 Creating hyperparameter priors\n\nThe base kernels we will use are well understood, as are their hyperparameters, which have simple interpretations and can be thematically grouped together. We take advantage of the Bayesian framework to encode prior knowledge over hyperparameters, i.e., p(\u03b8 | M). Conveniently, these priors can also mitigate numerical problems during the training of the GPs. Here we derive a consistent method for constructing such priors for arbitrary kernels and datasets in regression problems.\nWe first standardize the dataset, i.e., we subtract the mean and divide by the standard deviation of both the predictive features {xi} and the outputs y. This gives each dataset a consistent scale. Now we can reason about what real-world datasets usually look like on this scale. For example, we do not typically expect to see datasets spanning 10 000 length scales. Here we encode what we judge to be reasonable priors for groups of thematically related hyperparameters for most datasets. 
These include three types of hyperparameters common to virtually any problem: length scales \u2113 (including, for example, the period parameter of a periodic covariance), signal variance \u03c3, and observation noise \u03c3n. We also consider separately three other parameters specific to particular covariances we use here: the \u03b1 parameter of the rational quadratic covariance [15, (4.19)], the \u201clength scale\u201d of the periodic covariance \u2113p [15, \u2113 in (4.31)], and the offset \u03c30 in the linear covariance. We define the following:\n\np(log \u2113) = N (0.1, 0.7\u00b2)\np(log \u03b1) = N (0.05, 0.7\u00b2)\n\np(log \u03c3) = N (0.4, 0.7\u00b2)\np(log \u2113p) = N (2, 0.7\u00b2)\n\np(log \u03c3n) = N (0.1, 1\u00b2)\np(\u03c30) = N (0, 2\u00b2)\n\nGiven these, each model was given an independent prior over each of its hyperparameters, using the appropriate selection from the above for each.\n\n4.3 Approximating the model evidence\nThe model evidence p(y | X,M) is in general intractable for GPs [17, 15]. Instead, we use a Laplace approximation to compute the model evidence approximately. This approximation works by making a second-order Taylor expansion of log p(\u03b8 | D,M) around its mode \u02c6\u03b8 and approximates the model evidence as follows:\n\nlog p(y | X,M) \u2248 log p(y | X, \u02c6\u03b8,M) + log p(\u02c6\u03b8 | M) \u2212 (1/2) log det \u03a3\u22121 + (d/2) log 2\u03c0,  (5)\n\nwhere d is the dimension of \u03b8 and \u03a3\u22121 = \u2212\u2207\u00b2 log p(\u03b8 | D,M)|\u03b8=\u02c6\u03b8 [18, 19]. We can view (5) as rewarding model fit while penalizing model complexity. Note that the Bayesian information criterion (BIC), commonly used for model selection and also used by Duvenaud et al. [1], can be seen as an approximation to the Laplace approximation [20, 21].\n\n4\n\n\f4.4 Creating a \u201ckernel kernel\u201d\n\nIn \u00a74.1, \u00a74.2, and \u00a74.3, we focused on modeling a latent function f with a GP, creating an infinite space of models M to explain f (along with associated hyperparameter priors), and approximating the log model evidence function g(M;D). The evidence function g is the objective function we are trying to optimize via Bayesian optimization. We described in \u00a73 how this search progresses in the general case, described in terms of an arbitrary Gaussian process prior on g. Here we will provide specific suggestions for the modeling of g in the case that the model family M comprises Gaussian process priors on a latent function f, as discussed here and considered in our experiments.\nOur prior belief about g is given by a GP prior p(g) = GP(g; \u00b5g, Kg), which is fully specified by the mean function \u00b5g and covariance function Kg. We define the former as a simple constant mean function \u00b5g(M) = \u03b8\u00b5, where \u03b8\u00b5 is a hyperparameter to be learned through a regular GP training procedure given a set of observations. The latter we will construct as follows.\nThe basic idea in our construction is that we will consider the distribution of the observation locations in our dataset D, X (the design matrix of the underlying problem). We note that selecting a model class M induces a prior distribution over the latent function values at X, p(f | X,M):\n\np(f | X,M) = \u222b p(f | X,M, \u03b8) p(\u03b8 | M) d\u03b8.\n\nThis prior distribution is an infinite mixture of multivariate Gaussian prior distributions, each conditioned on specific hyperparameters \u03b8. We consider these prior distributions as different explanations of the latent function f, restricted to the observed locations, offered by the model M. 
We will compare two models in M according to how different the explanations they offer for f are, a priori.\nThe Hellinger distance is a probability metric that we adopt as a basic measure of similarity between two distributions. Although this quantity is defined between arbitrary probability distributions (and thus could be used with non-GP model spaces), we focus on the multivariate normal case. Suppose that M,M\u2032 \u2208 M are two models that we wish to compare, in the context of explaining a fixed dataset D. For now, suppose that we have conditioned each of these models on arbitrary hyperparameters (that is, we select a particular prior for f from each of these two families), giving M\u03b8 and M\u2032\u03b8\u2032, with \u03b8 \u2208 \u0398M and \u03b8\u2032 \u2208 \u0398M\u2032. Now, we define the two distributions\n\nP = p(f | X,M, \u03b8) = N (f ; \u00b5P , \u03a3P ),  Q = p(f | X,M\u2032, \u03b8\u2032) = N (f ; \u00b5Q, \u03a3Q).\n\nThe squared Hellinger distance between P and Q is\n\nd\u00b2H(P, Q) = 1 \u2212 (|\u03a3P|^{1/4} |\u03a3Q|^{1/4} / |(\u03a3P + \u03a3Q)/2|^{1/2}) exp{\u2212(1/8) (\u00b5P \u2212 \u00b5Q)\u22a4 ((\u03a3P + \u03a3Q)/2)^{\u22121} (\u00b5P \u2212 \u00b5Q)}.  (6)\n\nThe Hellinger distance will be small when P and Q are highly overlapping, and thus M\u03b8 and M\u2032\u03b8\u2032 provide similar explanations for this dataset. The distance will be larger, conversely, when M\u03b8 and M\u2032\u03b8\u2032 provide divergent explanations. Critically, we note that this distance depends on the dataset under consideration in addition to the GP priors.\nObserve that the distance above is not sufficient to compare the similarity of two models M, M\u2032 due to the fixing of hyperparameters above. 
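The Gaussian case of (6) is straightforward to compute directly; the following sketch (with hypothetical inputs) illustrates it, using log-determinants for numerical stability:

```python
import numpy as np

# Squared Hellinger distance (6) between two multivariate Gaussians
# P = N(mu_p, Sp) and Q = N(mu_q, Sq).
def squared_hellinger(mu_p, Sp, mu_q, Sq):
    Sm = 0.5 * (Sp + Sq)
    # determinant ratio |Sp|^{1/4} |Sq|^{1/4} / |Sm|^{1/2}, in log space
    log_ratio = (0.25 * np.linalg.slogdet(Sp)[1]
                 + 0.25 * np.linalg.slogdet(Sq)[1]
                 - 0.5 * np.linalg.slogdet(Sm)[1])
    diff = mu_p - mu_q
    quad = diff @ np.linalg.solve(Sm, diff)  # (mu_p - mu_q)^T Sm^{-1} (...)
    return 1.0 - np.exp(log_ratio - 0.125 * quad)

# Identical priors have distance zero; divergent priors approach one.
mu = np.zeros(3)
S = np.eye(3)
d_same = squared_hellinger(mu, S, mu, S)         # -> 0.0
d_far = squared_hellinger(mu, S, mu + 10.0, S)   # -> near 1
```

In the paper's setting the means and covariances are those of the priors p(f | X, M, θ) induced at the observed locations X.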
To properly account for the different hyperparameters of different models, and the priors associated with them, we define the expected squared Hellinger distance of two models M,M\u2032 \u2208 M as\n\n\u00afd\u00b2H(M,M\u2032; X) = E[d\u00b2H(M\u03b8,M\u2032\u03b8\u2032)] = \u222b\u222b d\u00b2H(M\u03b8,M\u2032\u03b8\u2032; X) p(\u03b8 | M) p(\u03b8\u2032 | M\u2032) d\u03b8 d\u03b8\u2032,  (7)\n\n5\n\n\fFigure 1: A demonstration of our model kernel Kg (8) based on the expected Hellinger distance of induced latent priors. Left: four simple model classes on a 1d domain, showing samples from the prior p(f | M) \u221d p(f | \u03b8,M) p(\u03b8 | M). Right: our Hellinger squared exponential covariance evaluated for the grid domains on the left. Increasing intensity indicates stronger covariance. The sets {SE, RQ} and {SE, PER, SE+PER} show strong mutual correlation.\n\nwhere the distance is understood to be evaluated between the priors on f induced at X. Finally, we construct the Hellinger squared exponential covariance between models as\n\nKg(M,M\u2032; \u03b8g, X) = \u03c3\u00b2 exp(\u2212\u00afd\u00b2H(M,M\u2032; X) / (2\u2113\u00b2)),  (8)\n\nwhere \u03b8g = (\u03c3, \u2113) specifies output and length scale hyperparameters in this kernel/evidence space. This covariance is illustrated in Figure 1 for a few simple kernels on a fictitious domain.\nWe make two notes before continuing. The first observation is that computing (6) scales cubically with |X|, so it might appear that we might as well compute the evidence instead. This is misleading for two reasons. First, the (approximate) computation of a given model\u2019s evidence via either a Laplace approximation or the BIC requires optimizing its hyperparameters. Especially for complex models, this can require hundreds to thousands of computations that each require cubic time. 
Further, as a result of our investigations, we have concluded that in practice we may approximate (6) and (7) by considering only a small subset of the observation locations X, and that this is usually sufficient to capture the similarity between models in terms of explaining a given dataset. In our experiments, we chose 20 points uniformly at random from those available in each dataset, fixed once for the entire procedure and for all kernels under consideration in the search. We then used these points to compute the distances (6\u20138), significantly reducing the overall time to compute Kg.\nSecond, we note that the expectation in (7) is intractable. Here we approximate the expectation via quasi-Monte Carlo, using a low-discrepancy sequence (a Sobol sequence) of the appropriate dimension, and inverse transform sampling, to give consistent, representative samples from the hyperparameter space of each model. Here we used 100 (\u03b8, \u03b8\u2032) samples with good results.\n\n4.5 Active set of candidate models\n\nAnother challenge in exploring an infinite set of models is how we advance the search. In each round, we only compute the acquisition function on a set of candidate models C. Here we discuss our policy for creating and maintaining this set. From the kernel grammar (\u00a74.1), we can define a model graph where two models are connected if we can apply one rule to produce the other. We seek to traverse this graph, balancing exploration (diversity) against exploitation (models likely to have higher evidence). We begin each round with a set of already chosen candidates C. To encourage exploitation, we add to C all neighbors of the best model seen thus far. To encourage exploration, we perform random walks to create diverse models, which we also add to C. We start each random walk from the empty kernel and repeatedly apply a random number of grammatical transformations. 
The number of such steps is sampled from a geometric distribution with termination probability 1/3. We find that 15 random walks work well. To constrain the number of candidates, we discard the models with the lowest EI values at the end of each round, keeping |C| no larger than 600.\n\n6\n\n\fTable 1: Root mean square error for model-evidence regression experiment.\n\nDataset    Train %    Mean           k-NN (SP)      k-NN (\u00afdH)     GP (\u00afdH)\nCONCRETE   20         0.109 (0.000)  0.200 (0.020)  0.233 (0.008)  0.107 (0.001)\nCONCRETE   40         0.107 (0.000)  0.260 (0.025)  0.221 (0.007)  0.102 (0.001)\nCONCRETE   60         0.107 (0.000)  0.266 (0.007)  0.215 (0.005)  0.097 (0.001)\nCONCRETE   80         0.106 (0.000)  0.339 (0.015)  0.200 (0.003)  0.093 (0.002)\nHOUSING    20         0.210 (0.001)  0.226 (0.002)  0.347 (0.004)  0.175 (0.002)\nHOUSING    40         0.207 (0.001)  0.235 (0.004)  0.348 (0.004)  0.140 (0.002)\nHOUSING    60         0.206 (0.000)  0.235 (0.004)  0.348 (0.004)  0.123 (0.002)\nHOUSING    80         0.206 (0.000)  0.257 (0.004)  0.344 (0.004)  0.114 (0.002)\nMAUNA LOA  20         0.543 (0.002)  0.736 (0.051)  0.685 (0.010)  0.513 (0.003)\nMAUNA LOA  40         0.537 (0.001)  0.878 (0.062)  0.667 (0.005)  0.499 (0.003)\nMAUNA LOA  60         0.535 (0.001)  1.051 (0.058)  0.686 (0.010)  0.487 (0.004)\nMAUNA LOA  80         0.534 (0.001)  1.207 (0.048)  0.707 (0.005)  0.474 (0.004)\n\n5 Experiments\n\nHere we evaluate our proposed algorithm. We split our evaluation into two parts: first, we show that our GP model for predicting a model\u2019s evidence is suitable; we then demonstrate that our model search method quickly finds a good model for a range of regression datasets. The datasets we consider are publicly available4 and were used in previous related work [1, 3]. 
AIRLINE, MAUNA LOA, METHANE,\nand SOLAR are 1d time series, and CONCRETE and HOUSING have, respectively, 8 and 13 dimensions.\nTo facilitate comparison of evidence across datasets, we report log evidence divided by dataset size,\nrede\ufb01ning\n\ng(M;D) = log(p(y | X,M))/|D|.\n\n(9)\nWe use the aforementioned base kernels {SE, RQ, LIN, PER} when the dataset is one-dimensional.\nFor multi-dimensional datasets, we consider the set {SEi} \u222a {RQi}, where the subscript indicates that\nthe kernel is applied only to the ith dimension. This setup is the same as in [1].\n\n5.1 Predicting a model\u2019s evidence\nWe \ufb01rst demonstrate that our proposed regression model in model space (i.e., the GP on g : M \u2192 R) is\nsound. We set up a simple prediction task where we predict model evidence on a set of models given\ntraining data. We construct a dataset Dg (4) of 1 000 models as follows. We initialize a set M with the\nset of base kernels, which varies for each dataset (see above). Then, we select one model uniformly\nat random from M and add its neighbors in the model grammar to M. We repeat this procedure until\n|M| = 1 000 and computed g(M;D) for the entire set generated. We train several baselines on a\nsubset of Dg and test their ability to predict the evidence of the remaining models, as measured by\nthe root mean squared error (RMSE). To achieve reliable results we repeat this experiment ten times.\nWe considered a subset of the datasets (including both high-dimensional problems), because training\n1 000 models demands considerable time. We compare with several alternatives:\n\n1. Mean prediction. Predicts the mean evidence on the training models.\n2. k-nearest neighbors. We perform k-NN regression with two distances: shortest-path\ndistance in the directed model graph described in \u00a74.5 (SP), and the expected squared\nHellinger distance (7). 
Neighbors were weighted by inverse distance.\n\nWe select k for both k-NN algorithms through cross-validation, trying all values of k from 1 to 10. We show the average RMSE along with standard error in Table 1. The GP with our Hellinger distance model covariance universally achieves the lowest error. Both k-NN methods are outperformed by the simple mean prediction. We note that in these experiments, many models perform similarly in terms of evidence (usually, this is because many models are \u201cbad\u201d in the same way, e.g., explaining the dataset away entirely as independent noise). We note, however, that the GP model is able to exploit correlations in deviations from the mean, for example in \u201cgood pockets\u201d of model space, to achieve better performance. We also note that both the k-NN and GP models have decreasing error with the number of training models, suggesting our novel model distance is also useful in itself.\n\n4https://archive.ics.uci.edu/ml/datasets.html\n\n7\n\n\fFigure 2 (panels: AIRLINE, METHANE, HOUSING, SOLAR, MAUNA LOA, CONCRETE): A plot of the best model evidence found (normalized by |D|, (9)) as a function of the number of models evaluated, g(M\u2217;D), for six of the datasets considered (identical vertical axis labels omitted for greater horizontal resolution).\n\n5.2 Model search\n\nWe also evaluate our method\u2019s ability to quickly find a suitable model to explain a given dataset. We compare our approach with the greedy compositional kernel search (CKS) of [1]. Both algorithms used the same kernel grammar (\u00a74.1), hyperparameter priors (\u00a74.2), and evidence approximation (\u00a74.3, (5)). We used L-BFGS to optimize model hyperparameters, using multiple restarts to avoid bad local maxima; each restart begins from a sample from p(\u03b8 | M).\nFor BOMS, we always began our search by evaluating SE first. 
The active set of models C (§4.5) was initialized with all models at most two edges distant from the base kernels. To avoid unnecessary re-training of the model over g, we optimized the hyperparameters of µ_g and K_g only every 10 iterations. This also allows us to perform rank-one updates for fast inference during the intervening iterations.

Results are depicted in Figure 2 for a budget of 50 evaluations of the model evidence. In four of the six datasets we substantially outperform CKS; note the vertical axis is in the log domain. The overhead of computing the kernel K_g and performing inference about g was approximately 10% of the total running time. On MAUNA LOA our method is competitive, finding a model of similar quality, but earlier. The results for METHANE, on the other hand, indicate that our search initially focused on a suboptimal region of the graph, but we eventually catch up.

6 Conclusion

We introduced a novel automated search for an appropriate kernel to explain a given dataset. Our mechanism explores an infinite space of candidate kernels and quickly and effectively selects a promising model. Focusing on the case where the models represent structural assumptions in GPs, we introduced a novel "kernel kernel" to capture the similarity in the prior explanations that two models ascribe to a given dataset.
We have empirically demonstrated that modeling the evidence (or marginal likelihood) with a GP in model space is capable of predicting the evidence of unseen models with enough fidelity to effectively explore model space via Bayesian optimization.

Acknowledgments

This material is based upon work supported by the National Science Foundation (NSF) under award number IIA–1355406. Additionally, GM acknowledges support from the Brazilian Federal Agency for Support and Evaluation of Graduate Education (CAPES).

References

[1] D. Duvenaud, J. R. Lloyd, R. Grosse, J. B. Tenenbaum, and Z. Ghahramani. Structure Discovery in Nonparametric Regression through Compositional Kernel Search. In International Conference on Machine Learning (ICML), 2013.

[2] R. Grosse, R. Salakhutdinov, W. Freeman, and J. Tenenbaum. Exploiting compositionality to explore a large space of model structures. In Conference on Uncertainty in Artificial Intelligence (UAI), 2012.

[3] F. R. Bach. Exploring large feature spaces with hierarchical multiple kernel learning. In Conference on Neural Information Processing Systems (NIPS), 2008.

[4] M. Gönen and E. Alpaydın. Multiple kernel learning algorithms. Journal of Machine Learning Research, 12:2211–2268, 2011.

[5] M. Lázaro-Gredilla, J. Q. Candela, C. E. Rasmussen, and A. R. Figueiras-Vidal. Sparse Spectrum Gaussian Process Regression. Journal of Machine Learning Research, 11:1865–1881, 2010.

[6] A. G. Wilson and R. P. Adams. Gaussian Process Kernels for Pattern Discovery and Extrapolation. In International Conference on Machine Learning (ICML), 2013.

[7] A. Wilson, E. Gilboa, J. P. Cunningham, and A. Nehorai.
Fast kernel learning for multidimensional pattern extrapolation. In Conference on Neural Information Processing Systems (NIPS), 2014.

[8] A. G. Wilson, D. A. Knowles, and Z. Ghahramani. Gaussian process regression networks. In International Conference on Machine Learning (ICML), 2012.

[9] G. E. Hinton and R. R. Salakhutdinov. Using Deep Belief Nets to Learn Covariance Kernels for Gaussian Processes. In Conference on Neural Information Processing Systems (NIPS), 2008.

[10] A. C. Damianou and N. D. Lawrence. Deep Gaussian Processes. In International Conference on Artificial Intelligence and Statistics (AISTATS), 2013.

[11] J. S. Bergstra, R. Bardenet, Y. Bengio, and B. Kégl. Algorithms for hyper-parameter optimization. In Conference on Neural Information Processing Systems (NIPS), 2011.

[12] J. Snoek, H. Larochelle, and R. P. Adams. Practical Bayesian optimization of machine learning algorithms. In Conference on Neural Information Processing Systems (NIPS), 2012.

[13] J. Gardner, G. Malkomes, R. Garnett, K. Q. Weinberger, D. Barbour, and J. P. Cunningham. Bayesian active model selection with an application to automated audiometry. In Conference on Neural Information Processing Systems (NIPS), 2015.

[14] E. Brochu, V. M. Cora, and N. de Freitas. A tutorial on Bayesian optimization of expensive cost functions, with application to active user modeling and hierarchical reinforcement learning. arXiv preprint arXiv:1012.2599, 2010.

[15] C. E. Rasmussen and C. K. I. Williams. Gaussian Processes for Machine Learning. MIT Press, 2006.

[16] D. R. Jones, M. Schonlau, and W. J. Welch. Efficient global optimization of expensive black-box functions. Journal of Global Optimization, 13(4):455–492, 1998.

[17] D. J. C. MacKay. Introduction to Gaussian processes. In C. M. Bishop, editor, Neural Networks and Machine Learning, pages 133–165. Springer, Berlin, 1998.

[18] A. E. Raftery.
Approximate Bayes Factors and Accounting for Model Uncertainty in Generalised Linear Models. Biometrika, 83(2):251–266, 1996.

[19] J. Kuha. AIC and BIC: Comparisons of Assumptions and Performance. Sociological Methods and Research, 33(2):188–229, 2004.

[20] G. Schwarz. Estimating the Dimension of a Model. Annals of Statistics, 6(2):461–464, 1978.

[21] K. P. Murphy. Machine Learning: A Probabilistic Perspective. MIT Press, 2012.