{"title": "Multi-Task Bayesian Optimization", "book": "Advances in Neural Information Processing Systems", "page_first": 2004, "page_last": 2012, "abstract": "Bayesian optimization has recently been proposed as a framework for automatically tuning the hyperparameters of machine learning models and has been shown to yield state-of-the-art performance with impressive ease and efficiency. In this paper, we explore whether it is possible to transfer the knowledge gained from previous optimizations to new tasks in order to find optimal hyperparameter settings more efficiently. Our approach is based on extending multi-task Gaussian processes to the framework of Bayesian optimization. We show that this method significantly speeds up the optimization process when compared to the standard single-task approach. We further propose a straightforward extension of our algorithm in order to jointly minimize the average error across multiple tasks and demonstrate how this can be used to greatly speed up $k$-fold cross-validation. Lastly, our most significant contribution is an adaptation of a recently proposed acquisition function, entropy search, to the cost-sensitive and multi-task settings. We demonstrate the utility of this new acquisition function by utilizing a small dataset in order to explore hyperparameter settings for a large dataset. Our algorithm dynamically chooses which dataset to query in order to yield the most information per unit cost.", "full_text": "Multi-Task Bayesian Optimization\n\nKevin Swersky\n\nDepartment of Computer Science\n\nUniversity of Toronto\n\nkswersky@cs.toronto.edu\n\nJasper Snoek\u21e4\n\nSchool of Engineering and Applied Sciences\n\nHarvard University\n\njsnoek@seas.harvard.edu\n\nRyan P. 
Adams
School of Engineering and Applied Sciences
Harvard University
rpa@seas.harvard.edu

Abstract

Bayesian optimization has recently been proposed as a framework for automatically tuning the hyperparameters of machine learning models and has been shown to yield state-of-the-art performance with impressive ease and efficiency. In this paper, we explore whether it is possible to transfer the knowledge gained from previous optimizations to new tasks in order to find optimal hyperparameter settings more efficiently. Our approach is based on extending multi-task Gaussian processes to the framework of Bayesian optimization. We show that this method significantly speeds up the optimization process when compared to the standard single-task approach. We further propose a straightforward extension of our algorithm in order to jointly minimize the average error across multiple tasks and demonstrate how this can be used to greatly speed up k-fold cross-validation. Lastly, we propose an adaptation of a recently developed acquisition function, entropy search, to the cost-sensitive, multi-task setting. We demonstrate the utility of this new acquisition function by leveraging a small dataset to explore hyperparameter settings for a large dataset. Our algorithm dynamically chooses which dataset to query in order to yield the most information per unit cost.

1 Introduction

The proper setting of high-level hyperparameters in machine learning algorithms – regularization weights, learning rates, etc. – is crucial for successful generalization. The difference between poor settings and good settings of hyperparameters can be the difference between a useless model and state-of-the-art performance. Surprisingly, hyperparameters are often treated as secondary considerations and are not set in a documented and repeatable way. 
As the field matures, machine learning models are becoming more complex, leading to an increase in the number of hyperparameters, which often interact with each other in non-trivial ways. As the space of hyperparameters grows, the task of tuning them can become daunting: well-established techniques such as grid search become either too slow or too coarse, leading to poor results in both performance and training time.
Recent work in machine learning has revisited the idea of Bayesian optimization [1, 2, 3, 4, 5, 6, 7], a framework for global optimization that provides an appealing approach to the difficult exploration-exploitation tradeoff. These techniques have been shown to obtain excellent performance on a variety of models, while remaining efficient in terms of the number of required function evaluations, corresponding to the number of times a model needs to be trained.
One issue with Bayesian optimization is the so-called "cold start" problem. The optimization must be carried out from scratch each time a model is applied to new data. If a model will be applied to many different datasets, or even just a few extremely large datasets, then there may be a significant overhead to re-exploring the same hyperparameter space. Machine learning researchers are often faced with this problem, and one appealing solution is to transfer knowledge from one domain to the next. This could manifest itself in many ways, including establishing the values for a grid search, or simply taking certain hyperparameters as fixed with some commonly accepted value. Indeed, it is this knowledge that often separates an expert machine learning practitioner from a novice.
(* Jasper Snoek: research was performed while at the University of Toronto.)
The question that this paper explores is whether we can incorporate the same kind of transfer of knowledge within the Bayesian optimization framework. 
Such a tool would allow researchers and practitioners to leverage previously trained models in order to quickly tune new ones. Furthermore, for large datasets one could imagine exploring a wide range of hyperparameters on a small subset of data, and then using this knowledge to quickly find an effective setting on the full dataset with just a few function evaluations.
In this paper, we propose multi-task Bayesian optimization to solve this problem. The basis for the idea is to apply well-studied multi-task Gaussian process models to the Bayesian optimization framework. By treating new domains as new tasks, we can adaptively learn the degree of correlation between domains and use this information to hone the search algorithm. We demonstrate the utility of this approach in a number of different settings: using prior optimization runs to bootstrap new runs; optimizing multiple tasks simultaneously when the goal is maximizing average performance; and utilizing a small version of a dataset to explore hyperparameter settings for the full dataset. Our approach is fully automatic, requires minimal human intervention, and yields substantial improvements in terms of the speed of optimization.

2 Background

2.1 Gaussian Processes

Gaussian processes (GPs) [8] are a flexible class of models for specifying prior distributions over functions f : X → R. They are defined by the property that any finite set of N points X = {x_n ∈ X}_{n=1}^N induces a Gaussian distribution on R^N. The convenient properties of the Gaussian distribution allow us to compute marginal and conditional means and variances in closed form. GPs are specified by a mean function m : X → R and a positive definite covariance, or kernel, function K : X × X → R. 
The predictive mean and covariance under a GP can be respectively expressed as:

µ(x ; {x_n, y_n}, θ) = K(X, x)^T K(X, X)^{-1} (y − m(X)),   (1)
Σ(x, x' ; {x_n, y_n}, θ) = K(x, x') − K(X, x)^T K(X, X)^{-1} K(X, x').   (2)

Here K(X, x) is the N-dimensional column vector of cross-covariances between x and the set X. The N × N matrix K(X, X) is the Gram matrix for the set X. As in [6] we use the Matérn 5/2 kernel and we marginalize over kernel parameters θ using slice sampling [9].

2.2 Multi-Task Gaussian Processes

In the field of geostatistics [10, 11], and more recently in the field of machine learning [12, 13, 14], Gaussian processes have been extended to the case of vector-valued functions, i.e., f : X → R^T. We can interpret the T outputs of such functions as belonging to different regression tasks. The key to modeling such functions with Gaussian processes is to define a useful covariance function K((x, t), (x', t')) between input-task pairs. One simple approach is called the intrinsic model of coregionalization [12, 11, 13], which transforms a latent function to produce each output. Formally,

K_multi((x, t), (x', t')) = K_t(t, t') ⊗ K_x(x, x'),   (3)

where ⊗ denotes the Kronecker product, K_x measures the relationship between inputs, and K_t measures the relationship between tasks. Given K_multi, this is simply a standard GP. Therefore, the complexity still grows cubically in the total number of observations.
Along with the other kernel parameters, we infer the parameters of K_t using slice sampling. Specifically, we represent K_t by its Cholesky factor and sample in that space. For our purposes, it is reasonable to assume a positive correlation between tasks. 
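As an illustrative sketch (not the authors' code), the multi-task posterior of Equations (1)–(3) can be written down directly. Here a squared-exponential input kernel stands in for the Matérn 5/2 used in the paper, the task covariance K_t is supplied as a PSD matrix, and all names are assumptions for the example:

```python
import numpy as np

def k_x(A, B, lengthscale=1.0):
    """Input covariance K_x between row-vectors in A and B (SE kernel stand-in)."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * d2 / lengthscale ** 2)

def k_multi(X, t, X2, t2, K_t, lengthscale=1.0):
    """Intrinsic coregionalization kernel, Eq. (3):
    K((x,t),(x',t')) = K_t(t,t') * K_x(x,x')."""
    return K_t[np.ix_(t, t2)] * k_x(X, X2, lengthscale)

def gp_posterior(X, t, y, Xs, ts, K_t, noise=1e-6):
    """Zero-mean GP posterior mean/variance, Eqs. (1)-(2), on (x, task) pairs."""
    K = k_multi(X, t, X, t, K_t) + noise * np.eye(len(y))
    Ks = k_multi(X, t, Xs, ts, K_t)            # cross-covariances K(X, x*)
    Kss = k_multi(Xs, ts, Xs, ts, K_t)
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    mu = Ks.T @ alpha                          # K(X,x*)^T K(X,X)^{-1} y
    v = np.linalg.solve(L, Ks)
    var = np.diag(Kss - v.T @ v)
    return mu, var

# Two perfectly correlated tasks: observations on task 0 transfer to task 1.
K_t = np.array([[1.0, 1.0],
                [1.0, 1.0]])
X_obs = np.array([[0.0], [1.0]])
t_obs = np.array([0, 0])
y_obs = np.array([0.3, -0.2])
mu, var = gp_posterior(X_obs, t_obs, y_obs, X_obs, np.array([1, 1]), K_t)
# Predictions on task 1 track the task-0 observations with near-zero variance.
```

With a less-than-perfect task correlation in `K_t`, the task-1 predictions would shrink toward the prior mean instead, which is exactly the behavior the optimization exploits.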
We found that sampling each element of the Cholesky factor in log space and then exponentiating adequately satisfied this positivity constraint.

2.3 Bayesian Optimization for a Single Task

Bayesian optimization is a general framework for the global optimization of noisy, expensive, black-box functions [15]. The strategy is based on the notion that one can use a relatively cheap probabilistic model to query as a surrogate for the financially, computationally or physically expensive function that is subject to the optimization. Bayes' rule is used to derive the posterior estimate of the true function given observations, and the surrogate is then used to determine the next most promising point to query. A common approach is to use a GP to define a distribution over objective functions from the input space to a loss that one wishes to minimize. That is, given observation pairs of the form {x_n, y_n}_{n=1}^N, where x_n ∈ X and y_n ∈ R, we assume that the function f(x) is drawn from a Gaussian process prior, with y_n ∼ N(f(x_n), ν), where ν is the function observation noise variance.
A standard approach is to select the next point to query by finding the maximum of an acquisition function a(x ; {x_n, y_n}, θ) over a bounded domain in X. This is a heuristic function that uses the posterior mean and uncertainty, conditioned on the GP hyperparameters θ, in order to balance exploration and exploitation. There have been many proposals for acquisition functions, or combinations thereof [16, 2]. 
We will use the expected improvement criterion (EI) [15, 17],

a_EI(x ; {x_n, y_n}, θ) = sqrt(Σ(x, x ; {x_n, y_n}, θ)) (γ(x) Φ(γ(x)) + N(γ(x) ; 0, 1)),   (4)
γ(x) = (y_best − µ(x ; {x_n, y_n}, θ)) / sqrt(Σ(x, x ; {x_n, y_n}, θ)),   (5)

where Φ(·) is the cumulative distribution function of the standard normal, and γ(x) is a Z-score. Due to its simple form, EI can be locally optimized using standard black-box optimization algorithms [6].
An alternative to heuristic acquisition functions such as EI is to consider a distribution over the minimum of the function and to iteratively evaluate points that will most decrease the entropy of this distribution. This entropy search strategy [18] has the appealing interpretation of decreasing the uncertainty over the location of the minimum at each optimization step. Here, we formulate the entropy search problem as that of selecting the next point from a pre-specified candidate set. Given a set of C points X̃ ⊂ X, we can write the probability of a point x ∈ X̃ having the minimum function value among the points in X̃ via:

Pr(min at x | θ, X̃, {x_n, y_n}_{n=1}^N) = ∫_{R^C} p(f | x, θ, {x_n, y_n}_{n=1}^N) ∏_{x̃ ∈ X̃ \ x} h(f(x̃) − f(x)) df,   (6)

where f is the vector of function values at the points X̃ and h is the Heaviside step function. The entropy search procedure relies on an estimate of the reduction in uncertainty over this distribution if the value y at x is revealed. 
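For concreteness, both quantities above admit short implementations: the closed-form EI of Equations (4)–(5), and a Monte Carlo estimate of the probability in Equation (6) obtained by sampling the joint GP values at the candidate set and counting which candidate attains the minimum. A minimal sketch with illustrative names, not the authors' code:

```python
import numpy as np
from math import erf, exp, pi, sqrt

def expected_improvement(mu, sigma, y_best):
    """Closed-form EI of Eqs. (4)-(5): sigma * (gamma*Phi(gamma) + phi(gamma)),
    with gamma = (y_best - mu) / sigma (minimization convention)."""
    gamma = (y_best - mu) / sigma
    Phi = 0.5 * (1.0 + erf(gamma / sqrt(2.0)))        # standard normal CDF
    phi = exp(-0.5 * gamma * gamma) / sqrt(2.0 * pi)  # standard normal PDF
    return sigma * (gamma * Phi + phi)

def pmin_monte_carlo(mean, cov, n_samples=5000, seed=0):
    """Estimate Pr(min at x) of Eq. (6) over a candidate set by sampling
    f ~ N(mean, cov) jointly and recording which candidate is the argmin."""
    rng = np.random.default_rng(seed)
    f = rng.multivariate_normal(mean, cov, size=n_samples)
    counts = np.bincount(np.argmin(f, axis=1), minlength=len(mean))
    return counts / n_samples

# With no improvement expected (mu == y_best), EI reduces to sigma * phi(0).
ei = expected_improvement(mu=0.0, sigma=1.0, y_best=0.0)

# A candidate whose posterior mean is clearly lowest should dominate Pmin.
p = pmin_monte_carlo(np.array([0.0, 1.0, 1.5]), 0.01 * np.eye(3))
```

The Heaviside product in Equation (6) is exactly the indicator that a sampled `f` attains its minimum at `x`, which is what the argmin count estimates.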
Writing Pr(min at x | θ, X̃, {x_n, y_n}_{n=1}^N) as Pmin, p(f | x, θ, {x_n, y_n}_{n=1}^N) as p(f | x), and the GP likelihood function as p(y | f) for brevity, and using H(P) to denote the entropy of P, the objective is to find the point x from a set of candidates which maximizes the information gain over the distribution of the location of the minimum,

a_KL(x) = ∫∫ [H(Pmin) − H(P^y_min)] p(y | f) p(f | x) dy df,   (7)

where P^y_min indicates that the fantasized observation {x, y} has been added to the observation set. Although (7) does not have a simple form, we can use Monte Carlo to approximate it by sampling f. An alternative to this formulation is to consider the reduction in entropy relative to a uniform base distribution; however, we found that the formulation given by Equation (7) works better in practice.

3 Multi-Task Bayesian Optimization

3.1 Transferring Bayesian Optimization to a New Task

Under the framework of multi-task GPs, performing optimization on a related task is fairly straightforward. We simply restrict our future observations to the task of interest and proceed as normal. Once we have enough observations on the task of interest to properly estimate K_t, the other tasks will act as additional observations without requiring any additional function evaluations. An illustration of a multi-task GP versus a single-task GP and its effect on EI is given in Figure 1. This approach can be thought of as a special case of contextual Gaussian process bandits [19].

[Figure 1 panels: (a) Multi-task GP sample functions, (b) Independent GP predictions, (c) Multi-task GP predictions.]
Figure 1: (a) A sample function with three tasks from a multi-task GP. Tasks 2 and 3 are correlated, 1 and 3 are anti-correlated, and 1 and 2 are uncorrelated. (b) Independent and (c) multi-task predictions on the third task. The dots represent observations, while the dashed line represents the predictive mean. 
Here we show a function over three tasks and corresponding observations, where the goal is to minimize the function over the third task. The curve shown on the bottom represents the expected improvement at each input location on this task. The independent GP fails to adequately represent the function, and optimizing EI leads to a spurious evaluation. The multi-task GP utilizes the other tasks, and the maximal EI point corresponds to the true minimum.

3.2 Optimizing an Average Function over Multiple Tasks

Here we consider optimizing the average function over multiple tasks. This has elements of both the single- and multi-task settings, since we have a single objective representing a joint function over multiple tasks. We motivate this approach by considering a finer-grained version of Bayesian optimization over k-fold cross-validation. We wish to optimize the average performance over all k folds, but it may not be necessary to actually evaluate all of them in order to identify the quality of the hyperparameters under consideration. The predictive mean and variance of the average objective are given by:

µ̄(x) = (1/k) ∑_{t=1}^{k} µ(x, t ; {x_n, y_n}, θ),   σ̄(x)^2 = (1/k^2) ∑_{t=1}^{k} ∑_{t'=1}^{k} Σ(x, x, t, t' ; {x_n, y_n}, θ).   (8)

If we are willing to spend one function evaluation on each task for every point x that we query, then the optimization of this objective can proceed using standard approaches. In many situations, though, this can be expensive and perhaps even wasteful. As an extreme case, if we have two perfectly correlated tasks, then spending two function evaluations per query provides no additional information, at twice the cost of a single-task optimization. The more interesting case, then, is to jointly choose both x and the task t and spend only one function evaluation per query.
We choose an (x, t) pair using a two-step heuristic. 
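Equation (8) is just the mean and variance of a linear functional of the joint posterior over tasks. A minimal sketch, assuming the k-task posterior mean and covariance at a query point x have already been computed by the multi-task GP (names are illustrative):

```python
import numpy as np

def average_task_posterior(mu_tasks, cov_tasks):
    """Eq. (8): posterior mean and variance of the average objective
    (1/k) * sum_t f(x, t), given the joint posterior over the k tasks at x.
    mu_tasks has shape (k,); cov_tasks has shape (k, k)."""
    k = len(mu_tasks)
    mu_bar = float(np.mean(mu_tasks))
    var_bar = float(np.sum(cov_tasks)) / k ** 2
    return mu_bar, var_bar

# Two perfectly correlated folds: averaging buys no variance reduction,
# matching the text's point that the second evaluation adds no information.
mu_corr, var_corr = average_task_posterior(np.array([0.2, 0.4]),
                                           np.array([[1.0, 1.0],
                                                     [1.0, 1.0]]))

# Two independent folds: the variance of the average halves.
mu_ind, var_ind = average_task_posterior(np.array([0.2, 0.4]), np.eye(2))
```

The off-diagonal terms of `cov_tasks` are what make evaluating a correlated fold informative about the average without querying every fold.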
First we impute missing observations using the predictive means. We then use the estimated average function to pick a promising candidate x by optimizing EI. Conditioned on x, we then choose the task that yields the highest single-task expected improvement.
The problem of minimizing the average error over multiple tasks has been considered in [20], where Bayesian optimization was applied in order to tune a single model on multiple datasets. Their approach is to project each function to a joint latent space and then iteratively visit each dataset in turn. Another approach can be found in [3], where additional task-specific features are used in conjunction with the inputs x to make predictions about each task.

3.3 A Principled Multi-Task Acquisition Function

Rather than transferring knowledge from an already completed search on a related task to bootstrap a new one, a more desirable strategy would have the optimization routine dynamically query the related, and possibly significantly cheaper, task. Intuitively, if two tasks are closely related, then evaluating a cheaper one can reveal information and reduce uncertainty about the location of the minimum on the more expensive task. A clever strategy may, for example, perform low-cost exploration of a promising location on the cheaper task before risking an evaluation of the expensive task. In this section we develop an acquisition function for such a dynamic multi-task strategy which specifically takes noisy estimates of cost into account, based on the entropy search strategy.
Although the EI criterion is intuitive and effective in the single-task case, it does not directly generalize to the multi-task case. However, entropy search does translate naturally to the multi-task problem. 
[Figure 2 panels: (a) Uncorrelated functions, (b) Correlated functions, (c) Correlated functions scaled by cost; each panel plots the task functions and the information gain.]
Figure 2: A visualization of the multi-task information gain per unit cost acquisition function. In each figure, the objective is to find the minimum of the solid blue function. The green function is an auxiliary objective function. In the bottom of each figure are lines indicating the expected information gain with regard to the primary objective function. The green dashed line shows the information gain about the primary objective that results from evaluating the auxiliary objective function. Figure 2a shows two sampled functions from a GP that are uncorrelated. Evaluating the primary objective gains information, but evaluating the auxiliary does not. In Figure 2b we see that with two strongly correlated functions, not only do observations on either task reduce uncertainty about the other, but observations from the auxiliary task acquire information about the primary task. Finally, in 2c we assume that the primary objective is three times more expensive than the auxiliary task, and thus evaluating the related task gives more information gain per unit cost.

In this setting we have observation pairs from multiple tasks, {x_n^t, y_n^t}_{n=1}^N, and we wish to pick the candidate x^t that maximally reduces the entropy of Pmin for the primary task, which we take to be t = 1. Naturally, Pmin evaluates to zero for x^{t>1}. However, we can evaluate P^y_min for y^{t>1}, and if the auxiliary task is related to the primary task, P^y_min will change from the base distribution and H(Pmin) − H(P^y_min) will be positive. 
Through reducing uncertainty about f, evaluating an observation on a related auxiliary task can reduce the entropy of Pmin on the primary task of interest. However, observe that evaluating a point on a related task can never reveal more information than evaluating the same point on the task of interest. Thus, the above strategy would never choose to evaluate a related task. Nevertheless, when cost is taken into account, the auxiliary task may convey more information per unit cost. Thus we translate the objective from Equation (7) to instead reflect the information gain per unit cost of evaluating a candidate point,

a_IG(x^t) = ∫∫ ( [H(Pmin) − H(P^y_min)] / c_t(x) ) p(y | f) p(f | x^t) dy df,   (9)

where c_t(x), c_t : X → R^+, is the real-valued cost of evaluating task t at x. Although we do not know this cost function in advance, we can estimate it similarly to the task functions, f(x^t), using the same multi-task GP machinery to model log c_t(x).
Figure 2 provides a visualization of this acquisition function using a two-task example. It shows how selecting a point on a related auxiliary task can reduce uncertainty about the location of the minimum on the primary task of interest (blue solid line). In this paper, we assume that all the candidate points for which we compute a_IG come from a fixed subset. Following [18], we pick these candidates by taking the top C points according to the EI criterion on the primary task of interest.

4 Empirical Analyses

4.1 Addressing the Cold Start Problem

Here we compare Bayesian optimization with no initial information to the case where we can leverage results from an already completed optimization on a related task. In each classification experiment the target of Bayesian optimization is the error on a held-out validation set. 
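The integrand in Equation (9) suggests a simple Monte Carlo estimator: average the entropy reduction of Pmin over fantasized outcomes y, then divide by the estimated cost of the candidate. A minimal sketch with hypothetical inputs (the Pmin vectors would come from the entropy search procedure described above, and the cost from the log-cost GP):

```python
import numpy as np

def entropy(p, eps=1e-12):
    """Shannon entropy of a discrete distribution over candidate minima."""
    p = np.clip(p, eps, 1.0)
    return float(-np.sum(p * np.log(p)))

def info_gain_per_cost(p_min, p_min_fantasies, cost):
    """Monte Carlo estimator in the spirit of Eq. (9): the expected entropy
    reduction of Pmin over fantasized outcomes y (one row per fantasy),
    divided by the estimated cost of evaluating the candidate."""
    h_before = entropy(p_min)
    h_after = float(np.mean([entropy(p) for p in p_min_fantasies]))
    return (h_before - h_after) / cost

# A cheap auxiliary evaluation whose fantasies collapse a uniform Pmin over
# four candidates to near-certainty: gain ≈ log(4), spread over cost 2.0.
gain = info_gain_per_cost(np.full(4, 0.25), np.eye(4), cost=2.0)
```

Dividing by cost is what lets a cheap auxiliary-task query beat a primary-task query even though, per evaluation, the auxiliary can never be more informative.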
Further details on these experiments can be found in the supplementary material.
Branin-Hoo The Branin-Hoo function is a common benchmark for optimization techniques [17] that is defined over a bounded set on R^2. As a related task we consider a shifted Branin-Hoo where the function is translated by 10% along either axis. We used Bayesian optimization to find the minimum of the original function and then added the shifted function as an additional task.
Logistic regression We optimize four hyperparameters of logistic regression (LR) on the MNIST dataset using 10000 validation examples. We assume that we have already completed 50 iterations of an optimization of the same model on the related USPS digits task. The USPS data is only 1/6 the size of MNIST and each image contains 16 × 16 pixels, so it is considerably cheaper to evaluate.

[Figure 3 panels: (a) Shifted Branin-Hoo, (b) CNN on SVHN, (c) CNN on STL-10, (d) LR on MNIST, (e) SVHN ACE, (f) STL-10 ACE; curves compare optimizing from scratch, transferring from a related task, and a baseline.]
Figure 3: (a)-(d) Validation error per function evaluation. (e),(f) ACE over function evaluations.

Convolutional neural networks on pixels We applied convolutional neural networks1 (CNNs) to the Street View House Numbers (SVHN) [21] dataset and bootstrapped from a previous run of Bayesian optimization using the same model trained on CIFAR-10 [22, 6]. At the time, this model represented the state-of-the-art. The SVHN dataset has the same input dimension as CIFAR-10, but is 10 times larger. We used 6000 held-out examples for validation. Additionally, we consider training on 1/10th of the SVHN dataset to warm-start the full optimization. The best settings yielded 4.77 ± 0.22% error, which is comparable to domain experts using non-dropout CNNs [23].
Convolutional networks on k-means features As an extension to the previous CNN experiment, we incorporate a more sophisticated pipeline in order to learn a model for the STL-10 dataset [24]. This dataset consists of images with 96 × 96 pixels, and each training set has only 1000 images. Overfitting is a significant challenge for this dataset, so we utilize a CNN2 on top of k-means features in a similar approach to [25], as well as dropout [26]. 
We bootstrapped Bayesian optimization using the same model trained on CIFAR-10, which had achieved 14.2% test error on that dataset. During the optimization, we used the first fold for training, and the remaining 4000 points from the other folds for validation. We then trained separate networks on each fold using the best hyperparameter settings found by Bayesian optimization. Following reporting conventions for this dataset, the model achieved 70.1 ± 0.6% test-set accuracy, exceeding the previous state-of-the-art of 64.5 ± 1% [27].
The results of these experiments are shown in Figure 3(a)-(d). In each case, the multi-task optimization finds a better function value much more quickly than single-task optimization. Clearly there is information in the related tasks that can be exploited. To better understand the behaviour of the different methods, we plot the average cumulative error (ACE), i.e., the average of all function values seen up to a given time, in Figure 3(e),(f). The single-task method wastes many more evaluations exploring poor hyperparameter settings. In the multi-task case, this exploration has already been performed and more evaluations are spent on exploitation.
As a baseline (the dashed black line), we took the best model from the first task and applied it directly to the task of interest. For example, in the CNN experiments this involved taking the best settings from CIFAR-10. This "direct transfer" performed well in some cases and poorly in others. In general, we have found that the best settings for one task are usually not optimal for the other.

4.2 Fast Cross-Validation

k-fold cross-validation is a widely used technique for estimating the generalization error of machine learning models, but requires retraining a model k times. This can be prohibitively expensive with complex models and large datasets. 
It is reasonable to expect, however, that if the data are randomly partitioned among folds, the errors for each fold will be highly correlated. For a given set of hyperparameters, we can therefore expect diminishing returns in estimating the average error for each subsequently evaluated fold. With a good GP model, we can very likely obtain a high-quality estimate by evaluating just one fold per setting. In this experiment, we apply the algorithm described in Section 3.2 in order to dynamically determine which points/folds to query.
We demonstrate this procedure on the task of training probabilistic matrix factorization (PMF) models for recommender systems [28]. The hyperparameters of the PMF model are the learning rate, an ℓ2 regularizer, the matrix rank, and the number of epochs. We use 5-fold cross-validation on the Movielens-100k dataset [29].

[Footnotes: 1 Using the Cuda Convnet package: https://code.google.com/p/cuda-convnet  2 Using the Deepnet package: https://github.com/nitishsrivastava/deepnet]

Figure 4: (a) PMF cross-validation error per function evaluation on Movielens-100k. (b) Lowest error observed for each fold per function evaluation for a single run.

In Figure 4(a) we show the best error obtained after a given number of function evaluations, as measured by the number of folds queried, averaged over 50 optimization runs. 
For the multi-task version, we show both the true average cross-validation error and the estimated error according to the GP. In the beginning, the GP fit is highly uncertain, so the optimization exhibits some noise. As the GP model becomes more certain, however, the true error and the GP estimate converge and the search proceeds rapidly compared to the single-task counterpart. In Figure 4(b), we show the best observed error after a given number of function evaluations on a randomly selected run. For a particular fold, the error cannot improve unless that fold is directly queried. The algorithm makes nontrivial decisions in terms of which fold to query, steadily reducing the average error.

4.3 Using Small Datasets to Quickly Optimize for Large Datasets

As a final empirical analysis, we evaluate the dynamic multi-task entropy search strategy developed in Section 3.3 on two hyperparameter tuning problems. We treat the cost, c_t(x), of a function evaluation as the real running time of training and evaluating the machine learning algorithm with hyperparameter settings x on task t. We assume no prior knowledge about either task, their correlation, or their respective costs, but instead estimate these as the optimization progresses. In both problems we compare using our multi-task entropy search strategy (MTBO) to optimizing the task of interest independently (STBO).
First, we revisit the logistic regression problem from Section 4.1 (Figure 3(d)) using the same experimental protocol, but rather than assuming that there is a completed optimization of the USPS data, the Bayesian optimization routine can instead dynamically query USPS as needed. Figure 5(a) shows the average time taken by either strategy to reach the values along the blue line. We see that MTBO reaches the minimum value on the validation set within 40 minutes, while STBO reaches it in 100 minutes. 
Figures 5(b) and 5(c) show that MTBO reaches better values significantly faster by spending more function evaluations on the related, but relatively cheaper, task.
Finally, we evaluate the very expensive problem of optimizing the hyperparameters of online Latent Dirichlet Allocation [30] on a large corpus of 200,000 documents. Snoek et al. [6] demonstrated that on this problem, Bayesian optimization could find better hyperparameters in significantly less time than the grid search conducted by the authors. We repeat this experiment here using the exact same grid as [6] and [30], but provide an auxiliary task involving a subset of 50,000 documents and 25 topics on the same grid. Each function evaluation on the large corpus took an average of 5.8 hours, while each evaluation on the smaller corpus took 2.5 hours. We performed our multi-task Bayesian optimization restricted to the same grid and compare to the results of the standard Bayesian optimization of [6] (the GP EI MCMC algorithm). In Figure 5(d), we see that our MTBO strategy finds the minimum in approximately 6 days of computation while the STBO strategy takes 10 days. Our algorithm saves almost 4 days of computation by being able to dynamically explore the cheaper alternative task. 
We see in Figure 5(f) that, particularly early in the optimization, the algorithm explores the cheaper task to gather information about the expensive one.

[Figure 5 panels: (a) LR Time Taken, (b) LR Time, (c) LR Fn Evaluations, (d) Online LDA Time Taken, (e) Online LDA Time, (f) Online LDA Fn Evaluations.]

Figure 5: (a),(d) Time taken to reach a given validation error. (b),(e) Validation error as a function of time spent training the models.
(c),(f) Validation error over the number of function evaluations.

5 Conclusion

As datasets grow larger and models become more expensive, it has become necessary to develop new search strategies in order to find optimal hyperparameter settings as quickly as possible. Bayesian optimization has emerged as a powerful framework for guiding this search. What the framework currently lacks, however, is a principled way to leverage prior knowledge gained from searches over similar domains. A wealth of information can be carried over from related tasks, and taking advantage of it can yield substantial cost savings by allowing the search to focus on regions of the hyperparameter space that are already known to be promising.

In this paper we introduced multi-task Bayesian optimization as a method to address this issue. We showed how multi-task GPs can be utilized within the existing framework in order to capture correlation between related tasks. Using this technique, we demonstrated that one can bootstrap previous searches, resulting in significantly faster optimization.

We further showed how this idea can be extended to solving multiple problems simultaneously. The first application we considered was optimizing an average score over several related tasks, motivated by the problem of k-fold cross-validation. Our fast cross-validation procedure obviates the need to evaluate every fold for each hyperparameter query and therefore eliminates redundant and costly function evaluations.

The next application we considered employed a cost-sensitive version of the entropy search acquisition function in order to utilize a cheap auxiliary task in the minimization of an expensive primary task. Our algorithm dynamically chooses which task to evaluate, and we showed that it can substantially reduce the amount of time required to find good hyperparameter settings.
This technique should prove useful for tuning sophisticated models on extremely large datasets.

As future work, we would like to extend this framework to multiple architectures. For example, we might want to train a one-layer neural network on one task, and a two-layer neural network on another task. This provides another avenue for utilizing one task to bootstrap another.

Acknowledgements

The authors would like to thank Nitish Srivastava for providing help with the Deepnet package, Robert Gens for providing feature extraction code, and Richard Zemel for helpful discussions. Jasper Snoek was supported by a grant from Google. This work was funded by DARPA Young Faculty Award N66001-12-1-4219 and an Amazon AWS in Research grant.

References

[1] Eric Brochu, T. Brochu, and Nando de Freitas. A Bayesian interactive optimization approach to procedural animation design. In ACM SIGGRAPH/Eurographics Symposium on Computer Animation, 2010.

[2] Niranjan Srinivas, Andreas Krause, Sham Kakade, and Matthias Seeger. Gaussian process optimization in the bandit setting: no regret and experimental design. In ICML, 2010.

[3] Frank Hutter, Holger H. Hoos, and Kevin Leyton-Brown. Sequential model-based optimization for general algorithm configuration. In Learning and Intelligent Optimization 5, 2011.

[4] M. A. Osborne, R. Garnett, and S. J. Roberts. Gaussian processes for global optimization. In LION, 2009.

[5] James Bergstra, Rémi Bardenet, Yoshua Bengio, and Balázs Kégl. Algorithms for hyper-parameter optimization. In NIPS, 2011.

[6] Jasper Snoek, Hugo Larochelle, and Ryan P. Adams. Practical Bayesian optimization of machine learning algorithms. In NIPS, 2012.

[7] James Bergstra, Daniel Yamins, and David Cox. Making a science of model search: hyperparameter optimization in hundreds of dimensions for vision architectures. In ICML, 2013.

[8] Carl E. Rasmussen and Christopher Williams.
Gaussian Processes for Machine Learning. MIT Press, 2006.

[9] Iain Murray and Ryan P. Adams. Slice sampling covariance hyperparameters of latent Gaussian models. In NIPS, 2010.

[10] Andre G. Journel and Charles J. Huijbregts. Mining Geostatistics. Academic Press, London, 1978.

[11] Pierre Goovaerts. Geostatistics for Natural Resources Evaluation. Oxford University Press, 1997.

[12] Matthias Seeger, Yee-Whye Teh, and Michael I. Jordan. Semiparametric latent factor models. In AISTATS, 2005.

[13] Edwin V. Bonilla, Kian Ming A. Chai, and Christopher K. I. Williams. Multi-task Gaussian process prediction. In NIPS, 2008.

[14] Mauricio A. Alvarez and Neil D. Lawrence. Computationally efficient convolved multiple output Gaussian processes. Journal of Machine Learning Research, 12, 2011.

[15] Jonas Mockus, Vytautas Tiesis, and Antanas Zilinskas. The application of Bayesian methods for seeking the extremum. Towards Global Optimization, 2, 1978.

[16] Matthew Hoffman, Eric Brochu, and Nando de Freitas. Portfolio allocation for Bayesian optimization. In UAI, 2011.

[17] Donald R. Jones. A taxonomy of global optimization methods based on response surfaces. Journal of Global Optimization, 21, 2001.

[18] Philipp Hennig and Christian J. Schuler. Entropy search for information-efficient global optimization. Journal of Machine Learning Research, 13, 2012.

[19] Andreas Krause and Cheng Soon Ong. Contextual Gaussian process bandit optimization. In NIPS, 2011.

[20] Rémi Bardenet, Mátyás Brendel, Balázs Kégl, and Michèle Sebag. Collaborative hyperparameter tuning. In ICML, 2013.

[21] Yuval Netzer, Tao Wang, Adam Coates, Alessandro Bissacco, Bo Wu, and Andrew Y. Ng. Reading digits in natural images with unsupervised feature learning. In NIPS Workshop on Deep Learning and Unsupervised Feature Learning, 2011.

[22] Alex Krizhevsky. Learning multiple layers of features from tiny images.
Technical report, Department of Computer Science, University of Toronto, 2009.

[23] Pierre Sermanet, Soumith Chintala, and Yann LeCun. Convolutional neural networks applied to house numbers digit classification. In ICPR, 2012.

[24] Adam Coates, Honglak Lee, and Andrew Y. Ng. An analysis of single-layer networks in unsupervised feature learning. In AISTATS, 2011.

[25] Robert Gens and Pedro Domingos. Discriminative learning of sum-product networks. In NIPS, 2012.

[26] Geoffrey E. Hinton, Nitish Srivastava, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Improving neural networks by preventing co-adaptation of feature detectors. arXiv preprint, 2012.

[27] Liefeng Bo, Xiaofeng Ren, and Dieter Fox. Unsupervised feature learning for RGB-D based object recognition. In ISER, 2012.

[28] Ruslan Salakhutdinov and Andriy Mnih. Probabilistic matrix factorization. In NIPS, 2008.

[29] Jonathan L. Herlocker, Joseph A. Konstan, Al Borchers, and John Riedl. An algorithmic framework for performing collaborative filtering. In ACM SIGIR Conference on Research and Development in Information Retrieval, 1999.

[30] Matthew Hoffman, David M. Blei, and Francis Bach. Online learning for latent Dirichlet allocation. In NIPS, 2010.