{"title": "Probabilistic Matrix Factorization for Automated Machine Learning", "book": "Advances in Neural Information Processing Systems", "page_first": 3348, "page_last": 3357, "abstract": "In order to achieve state-of-the-art performance, modern machine learning techniques require careful data pre-processing and hyperparameter tuning. Moreover, given the ever increasing number of machine learning models being developed, model selection is becoming increasingly important. Automating the selection and tuning of machine learning pipelines, which can include different data pre-processing methods and machine learning models, has long been one of the goals of the machine learning community. \nIn this paper, we propose to solve this meta-learning task by combining ideas from collaborative filtering and Bayesian optimization. Specifically, we use a probabilistic matrix factorization model to transfer knowledge across experiments performed in hundreds of different datasets and use an acquisition function to guide the exploration of the space of possible ML pipelines. In our experiments, we show that our approach quickly identifies high-performing pipelines across a wide range of datasets, significantly outperforming the current state-of-the-art.", "full_text": "Probabilistic Matrix Factorization for Automated\n\nMachine Learning\n\nNicolo Fusi, Rishit Sheth\n\nMicrosoft Research, New England\n{nfusi,rishet}@microsoft.com\n\nMelih Elibol\u2217\n\nEECS, University of California, Berkeley\n\nelibol@cs.berkeley.edu\n\nAbstract\n\nIn order to achieve state-of-the-art performance, modern machine learning tech-\nniques require careful data pre-processing and hyperparameter tuning. Moreover,\ngiven the ever increasing number of machine learning models being developed,\nmodel selection is becoming increasingly important. 
Automating the selection\nand tuning of machine learning pipelines, which can include different data pre-\nprocessing methods and machine learning models, has long been one of the goals\nof the machine learning community.\nIn this paper, we propose to solve this\nmeta-learning task by combining ideas from collaborative \ufb01ltering and Bayesian\noptimization. Speci\ufb01cally, we use a probabilistic matrix factorization model to\ntransfer knowledge across experiments performed in hundreds of different datasets\nand use an acquisition function to guide the exploration of the space of possible\npipelines. In our experiments, we show that our approach quickly identi\ufb01es high-\nperforming pipelines across a wide range of datasets, signi\ufb01cantly outperforming\nthe current state-of-the-art.\n\n1\n\nIntroduction\n\nMachine learning models often depend on hyperparameters that require extensive \ufb01ne-tuning in order\nto achieve optimal performance. For example, state-of-the-art deep neural networks have highly tuned\narchitectures and require careful initialization of the weights and learning algorithm (for example, by\nsetting the initial learning rate and various decay parameters). These hyperparameters can be learned\nby cross-validation (or holdout set performance) over a grid of values, or by randomly sampling\nthe hyperparameter space [2]; but, these approaches do not take advantage of any continuity in the\nparameter space. More recently, Bayesian optimization has emerged as a promising alternative to\nthese approaches [25, 8, 16, 1, 23, 3]. In Bayesian optimization, the loss (e.g. root mean square\nerror) is modeled as a function of the hyperparameters. A regression model (usually a Gaussian\nprocess) and an acquisition function are then used to iteratively decide which hyperparameter setting\nshould be evaluated next. 
More formally, the goal of Bayesian optimization is to find the vector of hyperparameters $\theta$ that corresponds to

$$\arg\min_{\theta} \; \mathcal{L}(\mathcal{M}(\mathbf{x}; \theta), \mathbf{y}),$$

where $\mathcal{M}(\mathbf{x}; \theta)$ are the predictions generated by a machine learning model $\mathcal{M}$ (e.g. SVM, random forest, etc.) with hyperparameters $\theta$ on some inputs $\mathbf{x}$, $\mathbf{y}$ are the targets/labels, and $\mathcal{L}$ is a loss function. Usually, the hyperparameters are a subset of $\mathbb{R}^D$, although in practice many hyperparameters can be discrete (e.g. the number of layers in a neural network) or categorical (e.g. the loss function to use in a gradient boosted regression tree).

*Work conducted at Microsoft Research, New England

32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montréal, Canada.

Figure 1: Two-dimensional embedding of 5,000 ML pipelines across 576 OpenML datasets. Each point corresponds to a pipeline and is colored by the AUROC obtained by that pipeline in one of the OpenML datasets (OpenML dataset id 943).

Bayesian optimization techniques have been shown to be very effective in practice and sometimes identify better hyperparameters than human experts, leading to state-of-the-art performance in computer vision tasks [23]. One drawback of these techniques is that they are known to suffer in high-dimensional hyperparameter spaces, where they often perform comparably to random search [12]. This limitation has been shown in practice [12] as well as studied theoretically [25, 6], and is due to the necessity of sampling enough hyperparameter configurations to obtain a good estimate of the predictive posterior over a high-dimensional space.
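To make the loop concrete, here is a minimal numpy sketch of Bayesian optimization over a one-dimensional hyperparameter grid: fit a Gaussian process to the settings evaluated so far, then use an acquisition rule to choose the next setting. Everything here (the toy loss, the kernel length-scale, the lower-confidence-bound rule) is illustrative, not taken from the paper.

```python
import numpy as np

def rbf(A, B, ls=0.3):
    """Squared-exponential kernel between the row vectors of A and B."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * d2 / ls ** 2)

def gp_posterior(X_obs, y_obs, X_grid, noise=1e-6):
    """GP predictive mean/variance on a grid, given observed losses."""
    K = rbf(X_obs, X_obs) + noise * np.eye(len(X_obs))
    Ks = rbf(X_grid, X_obs)
    mu = Ks @ np.linalg.solve(K, y_obs)
    var = 1.0 - np.einsum('ij,ji->i', Ks, np.linalg.solve(K, Ks.T))
    return mu, np.maximum(var, 1e-12)

# Toy loss over a single hyperparameter theta (assumed for illustration).
loss = lambda theta: (theta - 0.6) ** 2
grid = np.linspace(0, 1, 101)[:, None]
idx = [0, 50, 100]                      # initial evaluations
for _ in range(10):
    X_obs = grid[idx]
    y_obs = np.array([loss(t[0]) for t in X_obs])
    mu, var = gp_posterior(X_obs, y_obs, grid)
    # Lower confidence bound: favor low predicted loss and high uncertainty.
    score = mu - 2.0 * np.sqrt(var)
    nxt = int(np.argmin(score))
    if nxt not in idx:
        idx.append(nxt)

best = grid[idx][np.argmin([loss(t[0]) for t in grid[idx]])][0]
print(best)  # typically lands near the true minimizer 0.6
```

The same loop applies unchanged in higher dimensions; it is there that the posterior estimate degrades, as discussed above.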
In practice, this is not an insurmountable obstacle when considering the fine-tuning of a handful of parameters in a single model, but it is becoming increasingly impractical as the focus of the community shifts from tuning individual hyperparameters to identifying entire ML pipelines consisting of data pre-processing methods, machine learning models, and their hyperparameters [4].

Our goal in this paper is indeed not only to tune the hyperparameters of a given model, but also to identify which model to use and how to pre-process the data. We do so by leveraging experiments already performed across different datasets $\mathcal{D} = \{\mathcal{D}_1, \ldots, \mathcal{D}_D\}$ to solve the optimization problem

$$\arg\min_{\mathcal{M},\, \mathcal{P},\, \theta_m,\, \theta_p} \; \mathcal{L}(\mathcal{M}(\mathcal{P}(\mathbf{x}; \theta_p); \theta_m), \mathbf{y}),$$

where $\mathcal{M}$ is the ML model with hyperparameters $\theta_m$ and $\mathcal{P}$ is the pre-processing method with hyperparameters $\theta_p$. In the rest of the paper, we refer to the combination of pre-processing method, machine learning model and their hyperparameters as an ML pipeline. ML pipeline space can have a combination of continuous, discrete, and categorical dimensions (e.g. the "model" dimension can be a choice between a random forest or an SVM), as well as encode a complex hierarchical structure with some dimensions conditioned on others (e.g. the "number of trees" dimension is only relevant when a random forest is chosen). This mixture of types makes modeling continuity in this space particularly challenging. For this reason, unlike previous work, we consider "instantiations" of pipelines, meaning that we fix the set of pipelines ahead of training.
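Concretely, "instantiating" pipelines means enumerating a fixed grid of (pre-processor, model, hyperparameter) combinations ahead of time. A minimal sketch follows; the spaces below are illustrative stand-ins for the paper's supplementary Table 1, not the actual grid.

```python
from itertools import product

# Illustrative search spaces: each pipeline is one fixed combination of
# pre-processor, model, and hyperparameter values.
preprocessors = {
    "pca": {"n_components": [5, 10, 20]},
    "none": {},
}
models = {
    "random_forest": {"n_estimators": [100, 1000], "max_depth": [None, 8]},
    "svm": {"C": [0.1, 1.0, 10.0]},
}

def instantiate(space):
    """Expand {name: {param: [values]}} into concrete (name, settings) pairs."""
    out = []
    for name, params in space.items():
        keys = sorted(params)
        for vals in product(*(params[k] for k in keys)):
            out.append((name, dict(zip(keys, vals))))
    return out

pipelines = [(p, m) for p in instantiate(preprocessors)
             for m in instantiate(models)]
print(len(pipelines))  # 4 pre-processor x 7 model instantiations = 28
```

Each tuple in `pipelines` plays the role of one column of the performance matrix introduced in Section 3; in practice each entry would map to a scikit-learn estimator.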
For example, an instantiated pipeline can consist of computing the top 5 principal components of the input data and then applying a random forest with 1000 trees. Importantly, extensive experiments in Section 4 demonstrate that (i) the low effective dimensionality of hyperparameter spaces suggested by [2] extends to pipeline space, and (ii) Bayesian optimization performed in these discretized spaces leads to significantly better performance than approaches that utilize continuity.

We show that the problem of predicting the performance of ML pipelines on a new dataset can be cast as a collaborative filtering problem that can be solved with probabilistic matrix factorization techniques. The approach we follow in the rest of this paper, based on Gaussian process latent variable models [10, 9], embeds different pipelines in a latent space based on their performance across different datasets. For example, Figure 1 shows the first two dimensions of the latent space of ML pipelines identified by our model on OpenML [28] datasets. Each dot corresponds to an ML pipeline and is colored depending on the AUROC achieved on a holdout set for a given OpenML dataset. Since our probabilistic approach produces a full predictive distribution over the performance of the ML pipelines considered, we can use it in conjunction with acquisition functions commonly used in Bayesian optimization to guide the exploration of the ML pipeline space.

2 Related work

The concept of leveraging experiments performed in previous problem instances has been explored in different ways by two different communities.
In the Bayesian optimization community, most of the\nwork revolves around either casting this problem as an instance of multi-task learning or by selecting\nthe \ufb01rst parameter settings to evaluate on a new dataset by looking at what worked in related datasets\n(we will refer to this as meta-learning for cold-start). In the multi-task setting, Swersky et al. (2013)\n[27] have proposed a multi-task Bayesian optimization approach leveraging multiple related datasets\nin order to \ufb01nd the best hyperparameter setting for a new task. For instance, they suggested using a\nsmaller dataset to tune the hyperparameters of a bigger dataset that is more expensive to evaluate.\nSchilling et al. (2015) [21] also treat this problem as an instance of multi-task learning, but instead of\ntreating each dataset as a separate task (or output), they effectively consider the tasks as conditionally\nindependent given an indicator variable specifying which dataset was used to run a given experiment.\nSpringenburg et al. (2016) [24] do something similar with Bayesian neural networks, but instead of\npassing an indicator variable, their approach learns a dataset-speci\ufb01c embedding vector. Perrone et al.\n(2017) [18] also effectively learn a task-speci\ufb01c embedding, but instead of using Bayesian neural\nnetworks end-to-end like in [24], they use feed-forward neural networks to learn the basis functions\nof a Bayesian linear regression model.\nOther approaches address the cold-start problem by evaluating parameter settings that worked well in\nprevious datasets. The most successful attempt to do so for automated machine learning problems\n(i.e. in very high-dimensional and structured parameter spaces) is the work by [4]. In their paper, the\nauthors compute meta-features of both the dataset under examination as well as a variety of OpenML\n[28] datasets. These meta-features include for example the number of classes or the number of\nsamples in each dataset. 
They measure similarity between datasets by computing the L1 norm of the\nmeta-features and use the optimization runs from the nearest datasets to \u201cwarm-start\u201d the optimization.\nReif et al. (2012) [19] also use meta-features of the dataset to warm-start the optimization performed\nby a genetic algorithm. Feurer et al. (2015) [5] consider learning the dataset similarity function from\ntraining data for warm-starting, and Wistuba et al. (2015) [30] extend this by additionally taking into\naccount the performance of hyperparameter con\ufb01gurations evaluated on the new dataset. In the same\npaper, they also propose to carefully pick these evaluations such that the similarity between datasets\nis more accurately represented, although they found that this doesn\u2019t result in improved performance\nin their experiments.\nOther related work has been produced in the context of algorithm selection for satis\ufb01ability problems.\nIn particular, Stern et al. (2010) [26] tackled constraint solving problems and combinatorial auction\nwinner determination problems using a latent variable model to select which algorithm to use. Their\nmodel performs a joint linear embedding of problem instances and experts (e.g. different SAT solvers)\nbased on their meta-features and a sparse matrix containing the results of previous algorithm runs.\nMalitsky and O\u2019Sullivan (2014) [13] also proposed to learn a latent variable model by decomposing\nthe matrix containing the performance of each solver on each problem. They then develop a model to\nproject commonly used hand-crafted meta-features used to select algorithms onto the latent space\nidenti\ufb01ed by their model. They use this last model to do one-shot (i.e. non-iterative) algorithm\nselection. This is similar to what was done by [14], but they do not use the second regression model\nand instead perform one-shot algorithm selection directly.\nOur work is most related to [4] in terms of scope (i.e. 
joint automated pre-processing, model selection\nand hyperparameter tuning), but we discretize the space and set up a multi-task model, while they\ncapture continuity in parameter space in a single-task model with a smart initialization. Our approach\nis also loosely related to the work of [26], but we perform sequential model based optimization with\na non-linear mapping between latent and observed space in an unsupervised model, while they use a\nsupervised linear model trained on ranks for one-shot algorithm selection. The application domain of\ntheir model also required a different utility function and a time-based feedback model. The work of\n[11] constructs an acquisition function that also exploits performance on previous datasets, but their\nmethod does not do signi\ufb01cantly better than random selection, whereas the experiments of Section 4\ndemonstrate that our proposed method provides signi\ufb01cant improvement over random selection.\n\n3\n\n\f3 AutoML as probabilistic matrix factorization\n\nIn this paper, we develop a method that can draw information from all of the datasets for which\nexperiments are available, whether they are immediately related (e.g. a smaller version of the current\ndataset) or not. The idea behind our approach is that if two datasets have similar (i.e. correlated)\nresults for a few pipelines, it\u2019s likely that the remaining pipelines will produce results that are similar\nas well. This is somewhat reminiscent of a collaborative \ufb01ltering problem for movie recommendation,\nwhere if two users liked the same movies in the past, it\u2019s more likely that they will like similar ones\nin the future.\nMore formally, given N machine learning pipelines and D datasets, we train each pipeline on part of\neach dataset and we evaluate it on a holdout set. This gives us a matrix Y \u2208 RN\u00d7D summarizing the\nperformance of each pipeline in each dataset. 
For example, Y may represent balanced accuracies in a classification setting or RMSE in a regression setting. Having observed these performances, the task of predicting the performance of any of them on a new dataset can be cast as a matrix factorization problem.

Specifically, we are seeking a low rank decomposition such that $\mathbf{Y} \approx \mathbf{X}\mathbf{W}$, where $\mathbf{X} \in \mathbb{R}^{N \times Q}$ and $\mathbf{W} \in \mathbb{R}^{Q \times D}$, where $Q$ is the dimensionality of the latent space. As done in [10] and [20], we consider the probabilistic version of this task, known as probabilistic matrix factorization:

$$p(\mathbf{Y} \mid \mathbf{X}, \mathbf{W}, \sigma^2) = \prod_{n=1}^{N} \mathcal{N}(\mathbf{y}_n \mid \mathbf{x}_n \mathbf{W}, \sigma^2 \mathbf{I}), \qquad (1)$$

where $\mathbf{x}_n$ is a row of the latent variables $\mathbf{X}$ and $\mathbf{y}_n$ is a row of measured performances for pipeline $n$. In this setting both $\mathbf{X}$ and $\mathbf{W}$ are unknown and must be inferred.

3.1 Non-linear matrix factorization with Gaussian process priors

The probabilistic matrix factorization approach just introduced assumes that the entries of Y are linearly related to the latent variables. In nonlinear probabilistic matrix factorization [10], the elements of Y are given by a nonlinear function of the latent variables, $y_{n,d} = f_d(\mathbf{x}_n) + \epsilon$, where $\epsilon$ is independent Gaussian noise. This gives a likelihood of the form

$$p(\mathbf{Y} \mid \mathbf{X}, \mathbf{f}, \sigma^2) = \prod_{n=1}^{N} \prod_{d=1}^{D} \mathcal{N}(y_{n,d} \mid f_d(\mathbf{x}_n), \sigma^2). \qquad (2)$$

Following [10], we place a Gaussian process prior over $f_d(\mathbf{x}_n)$ so that any vector $\mathbf{f}$ is governed by a joint Gaussian density, $p(\mathbf{f} \mid \mathbf{X}) = \mathcal{N}(\mathbf{f} \mid \mathbf{0}, \mathbf{K})$, where $\mathbf{K}$ is a covariance matrix whose elements $K_{i,j} = k(\mathbf{x}_i, \mathbf{x}_j)$ encode the degree of correlation between two samples as a function of the latent variables. If we use the covariance function $k(\mathbf{x}_i, \mathbf{x}_j) = \mathbf{x}_i^{\top} \mathbf{x}_j$, which is a prior corresponding to linear functions, we recover a model equivalent to (1).
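As a concrete illustration of the linear factorization, the sketch below fits X and W by gradient descent on the squared error over observed entries, which is equivalent to maximizing the Gaussian likelihood above with a fixed noise variance. The dimensions, data, and learning rate are synthetic and chosen only for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
N, D, Q = 30, 12, 3                     # pipelines, datasets, latent dims

# Synthetic low-rank performance matrix with ~25% of entries missing.
X_true, W_true = rng.normal(size=(N, Q)), rng.normal(size=(Q, D))
Y = X_true @ W_true + 0.01 * rng.normal(size=(N, D))
mask = rng.random((N, D)) > 0.25        # True where y_{n,d} was observed

X, W = rng.normal(size=(N, Q)), rng.normal(size=(Q, D))
lr = 0.01
for _ in range(2000):
    R = np.where(mask, X @ W - Y, 0.0)  # residuals on observed entries only
    X -= lr * (R @ W.T)                 # gradient of 0.5*||R||^2 w.r.t. X
    W -= lr * (X.T @ R)

rmse = np.sqrt((np.where(mask, X @ W - Y, 0.0) ** 2).sum() / mask.sum())
print(rmse)  # small residual error on the observed entries
```

The GP extension described next replaces the linear map x_n W with a nonlinear function of the latent variables.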
Alternatively, we can choose a prior over non-linear functions, such as a squared exponential covariance function with automatic relevance determination (ARD, one length-scale per dimension),

$$k(\mathbf{x}_i, \mathbf{x}_j) = \alpha \exp\!\left(-\sum_{q=1}^{Q} \frac{\gamma_q}{2} (x_{i,q} - x_{j,q})^2\right),$$

where $\alpha$ is a variance (or amplitude) parameter and the $\gamma_q$ are length-scales. The squared exponential covariance function is infinitely differentiable and hence is a prior over very smooth functions. In practice, such a strong smoothness assumption can be unrealistic and is the reason why the Matérn class of kernels is sometimes preferred [29]. In the rest of this paper we use the squared exponential kernel and leave the investigation of the performance of Matérn kernels to future work.

After specifying a GP prior, the marginal likelihood is obtained by integrating out the function $\mathbf{f}$ under the prior:

$$p(\mathbf{Y} \mid \mathbf{X}, \theta, \sigma^2) = \int p(\mathbf{Y} \mid \mathbf{X}, \mathbf{f})\, p(\mathbf{f} \mid \mathbf{X})\, d\mathbf{f} = \prod_{d=1}^{D} \mathcal{N}(\mathbf{Y}_{:,d} \mid \mathbf{0}, \mathbf{K}(\mathbf{X}, \mathbf{X}) + \sigma^2 \mathbf{I}), \qquad (3)$$

where $\theta = \{\alpha, \gamma_1, \ldots, \gamma_Q\}$.

In principle, we could add metadata about the pipelines and/or the datasets by adding additional kernels. As we discuss in Section 4 and show in Figure 2 (and supplementary Figure 1), we didn't find this to help in practice, since the latent variable model is able to capture all the necessary information even in the fully unsupervised setting.

Figure 2: Latent embeddings of 42,000 machine learning pipelines colored according to which model was included in each pipeline. These are paired plots of the first 6 dimensions of our 20-dimensional latent space.
The latent space effectively captures structure in the space of models.

3.2 Inference with missing data

Running multiple pipelines on multiple datasets is an embarrassingly parallel operation, and our proposed method readily takes advantage of these kinds of computationally cheap observations. However, in applications where it is expensive to gather such observations, Y will be a sparse matrix, and it becomes necessary to be able to perform inference with missing data. Given that the marginal likelihood in (3) follows a multivariate Gaussian distribution, marginalizing over missing values is straightforward and simply requires "dropping" the missing observations from the mean and covariance. More formally, we let $N_d$ denote the number of pipelines evaluated for dataset $d$, and define an indexing function $e(d) : \mathbb{N} \to \mathbb{N}^{N_d}$ that, given a dataset index $d$, returns the list of $N_d$ pipelines that have been evaluated on $d$. We can then rewrite (3) as

$$p(\mathbf{Y} \mid \mathbf{X}, \theta, \sigma^2) = \prod_{d=1}^{D} \mathcal{N}(\mathbf{Y}_{e(d),d} \mid \mathbf{0}, \mathbf{C}_d), \qquad (4)$$

where $\mathbf{C}_d = \mathbf{K}(\mathbf{X}_{e(d)}, \mathbf{X}_{e(d)}) + \sigma^2 \mathbf{I}$. The negative log-likelihood (NLL) of the data under this model is given by $\sum_{d=1}^{D} \mathrm{NLL}_d$, where

$$\mathrm{NLL}_d = \frac{1}{2}\left(N_d \log(2\pi) + \log|\mathbf{C}_d| + \mathbf{Y}_{e(d),d}^{\top} \mathbf{C}_d^{-1} \mathbf{Y}_{e(d),d}\right). \qquad (5)$$

Similar to [10], we infer the parameters $(\theta, \sigma)$ and latent variables $\mathbf{X}$ by minimizing the NLL using stochastic gradient-based optimization.
Specifically, we update the parameters and latent variables with a randomly selected batch of datasets $\mathcal{B}$ as

$$\theta^{t+1} = \theta^{t} - \eta \frac{D}{|\mathcal{B}|} \sum_{d \in \mathcal{B}} \frac{\partial \mathrm{NLL}_d}{\partial \theta}, \qquad \mathbf{x}_n^{t+1} = \mathbf{x}_n^{t} - \eta \frac{N_n}{N_{\mathcal{B}_n}} \sum_{d \in \mathcal{B}} \frac{\partial \mathrm{NLL}_d}{\partial \mathbf{x}_n}, \qquad (6)$$

where $t$ denotes the training iteration, $\eta$ is the learning rate, $N_n$ denotes the total number of observations for pipeline $n$, and $N_{\mathcal{B}_n}$ denotes the number of observations for pipeline $n$ in the batch (we use the convention $\partial \mathrm{NLL}_d / \partial \mathbf{x}_n = 0$ when the entry $y_{n,d}$ is unobserved). Since our approach allows incremental re-training, existing pipeline embeddings can be updated with observations from new datasets, and new pipelines can be included given their performance observations.

3.3 Predictions

Predictions from the model can be easily computed by following the standard derivations for Gaussian process regression [29]. The predicted performance $y_{n,*}$ of pipeline $n$ for a new dataset $*$ is given by

$$p(y_{n,*} \mid \mathbf{X}, \theta, \sigma) = \mathcal{N}(y_{n,*} \mid \mu_{n,*}, v_{n,*}), \qquad (7)$$
$$\mu_{n,*} = \mathbf{k}_{e(*),n}^{\top} \mathbf{C}_*^{-1} \mathbf{y}_*, \qquad v_{n,*} = k_{n,n} + \sigma^2 - \mathbf{k}_{e(*),n}^{\top} \mathbf{C}_*^{-1} \mathbf{k}_{e(*),n},$$

remembering that $\mathbf{C}_* = \mathbf{K}(\mathbf{X}_{e(*)}, \mathbf{X}_{e(*)}) + \sigma^2 \mathbf{I}$ and defining $\mathbf{k}_{e(*),n} = \mathbf{K}(\mathbf{X}_{e(*)}, \mathbf{x}_n)$ and $k_{n,n} = k(\mathbf{x}_n, \mathbf{x}_n)$.

The computational complexity for generating these predictions is largely determined by the number of pipelines already evaluated for a test dataset and is due to the inversion of an $N_* \times N_*$ matrix.
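A numpy sketch of equations (4), (5) and (7) follows, using a squared exponential kernel with a single length-scale for brevity (the paper uses ARD). The latent embeddings, observed indices, and kernel parameters below are illustrative.

```python
import numpy as np

def kernel(A, B, amp=1.0, ls=1.0):
    """Squared-exponential covariance between latent rows of A and B."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return amp * np.exp(-0.5 * d2 / ls ** 2)

def nll_dataset(X, y_obs, obs_idx, sigma2=0.01):
    """Negative log-likelihood of one dataset's observed column (eqs. 4-5)."""
    Xe = X[obs_idx]                     # latent points of evaluated pipelines
    C = kernel(Xe, Xe) + sigma2 * np.eye(len(obs_idx))
    _, logdet = np.linalg.slogdet(C)
    quad = y_obs @ np.linalg.solve(C, y_obs)
    return 0.5 * (len(obs_idx) * np.log(2 * np.pi) + logdet + quad)

def predict(X, y_obs, obs_idx, n, sigma2=0.01):
    """Predictive mean and variance for pipeline n on a new dataset (eq. 7)."""
    Xe = X[obs_idx]
    C = kernel(Xe, Xe) + sigma2 * np.eye(len(obs_idx))
    k_star = kernel(Xe, X[n:n+1])[:, 0]
    mu = k_star @ np.linalg.solve(C, y_obs)
    var = kernel(X[n:n+1], X[n:n+1])[0, 0] + sigma2 \
        - k_star @ np.linalg.solve(C, k_star)
    return mu, var

rng = np.random.default_rng(1)
X = rng.normal(size=(50, 2))            # latent embeddings of 50 pipelines
obs = [3, 10, 42]                       # pipelines already run on the new dataset
y = rng.normal(size=3)                  # their observed performances
mu, var = predict(X, y, obs, n=7)
```

Only the small matrix over the pipelines evaluated on the new dataset is ever inverted, which is what keeps prediction cheap.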
This is not particularly onerous because the typical number of evaluations is likely to be in the hundreds, given the cost of training each pipeline and the risk of overfitting to the validation set if too many pipelines are evaluated.

3.4 Acquisition functions

The model described so far can be used to predict the expected performance of each ML pipeline as a function of the pipelines already evaluated, but does not yet give any guidance as to which pipeline should be tried next. A simple approach to pick the next pipeline to evaluate is to select the pipeline with the maximum predicted performance, $\arg\max_n \{\mu_{n,*}\}$. However, such a utility function, also known as an acquisition function, would discard information about the uncertainty of the predictions. One of the most widely used acquisition functions is expected improvement (EI) [15], which is given by the expectation of the improvement function:

$$I(y_{n,*}, y_{\mathrm{best}}) \triangleq (y_{n,*} - y_{\mathrm{best}})\, \mathbb{I}(y_{n,*} > y_{\mathrm{best}}), \qquad \mathrm{EI}_{n,*} \triangleq \mathbb{E}[I(y_{n,*}, y_{\mathrm{best}})],$$

where $y_{\mathrm{best}}$ is the best result observed so far. Since $y_{n,*}$ is Gaussian distributed (see (7)), this expectation can be computed analytically:

$$\mathrm{EI}_{n,*} = \sqrt{v_{n,*}} \left[ \gamma_{n,*} \Phi(\gamma_{n,*}) + \mathcal{N}(\gamma_{n,*} \mid 0, 1) \right], \qquad \gamma_{n,*} = \frac{\mu_{n,*} - y_{\mathrm{best}} - \xi}{\sqrt{v_{n,*}}},$$

where $\Phi$ is the cumulative distribution function of the standard normal and $\xi$ is a free parameter to encourage exploration. After computing the expected improvement for each pipeline, the next pipeline to evaluate is simply given by $\arg\max_n (\mathrm{EI}_{n,*})$. The expected improvement is just one of many possible acquisition functions, and different problems may require different acquisition functions.
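The closed-form expression above can be sketched directly; this is the maximization form, matching the paper's use of normalized accuracy, and the predictive means and variances below are made up for illustration.

```python
import numpy as np
from math import erf

def expected_improvement(mu, var, y_best, xi=0.0):
    """Closed-form EI for maximization under a Gaussian predictive (mu, var)."""
    s = np.sqrt(np.maximum(var, 1e-12))
    gamma = (mu - y_best - xi) / s
    Phi = 0.5 * (1.0 + np.vectorize(erf)(gamma / np.sqrt(2.0)))  # std normal CDF
    phi = np.exp(-0.5 * gamma ** 2) / np.sqrt(2.0 * np.pi)       # std normal pdf
    return s * (gamma * Phi + phi)

mu = np.array([0.80, 0.85, 0.70])       # predicted accuracies per pipeline
var = np.array([0.01, 0.04, 0.09])
ei = expected_improvement(mu, var, y_best=0.82)
nxt = int(np.argmax(ei))                # index of the pipeline to evaluate next
```

Note how the third pipeline, despite the lowest predicted mean, retains non-trivial EI because of its large variance; the second pipeline wins here by combining a high mean with moderate uncertainty.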
See [22] for a review.

4 Experiments

In this section, we compare our method in a classification setting to a series of baselines that includes auto-sklearn [4], the current state-of-the-art approach and overall winner of the ChaLearn AutoML competition [7]. We ran all of the experiments on 553 OpenML [28] datasets selected by filtering for binary and multi-class classification problems with no more than 10,000 samples and no missing values, although our method is capable of handling datasets which cause ML pipeline runs to be unsuccessful (described below).

4.1 Generation of training data

We generated training data for our method by splitting each OpenML dataset into 80% training data, 10% validation data and 10% test data, running 42,000 ML pipelines on each dataset and measuring the normalized accuracy, i.e. accuracy rescaled such that random performance is 0 and perfect performance is 1.0.

We generated the pipelines by sampling a combination of pre-processors $\mathcal{P} = \{P^1, P^2, \ldots, P^n\}$, machine learning models $\mathcal{M} = \{M^1, M^2, \ldots, M^m\}$, and their corresponding hyperparameters $\Theta_P = \{\theta_P^1, \ldots, \theta_P^n\}$ and $\Theta_M = \{\theta_M^1, \ldots, \theta_M^m\}$ from the entries in supplementary Table 1. All the models and pre-processing methods we considered were implemented in scikit-learn [17]. We sampled the parameter space by using functions provided in the auto-sklearn library [4]. Similar to what was done in [4], we limited the maximum training time of each individual model within a pipeline to 30 seconds and its memory consumption to 16GB. Because of network failures and the cluster occasionally running out of memory, the resulting matrix Y was not fully sampled and had approximately 21% missing entries.
As pointed out in the previous section, this is expected in realistic applications and is not a problem for our method, since it can easily handle sparse data. Out of the 553 total datasets, 100 were identified to comprise the held-out test set. Full details, including the IDs of both training and test sets, are provided in the supplementary material.

Figure 3: (Left) Average rank of all the approaches we considered as a function of the number of iterations. For each holdout dataset, the methods are ranked based on the normalized accuracy obtained on the validation set at each iteration. The ranks are then averaged across datasets. Lower is better. The shaded areas represent the standard error for each method. (Right) Difference between the maximum normalized accuracy observed on the test set and the normalized accuracy. Lower is better.

4.2 Parameter settings

We set the number of latent dimensions to Q = 20, the stochastic gradient descent learning rate to η = 10⁻⁷, and the (column) batch size to 50. The latent space was initialized using PCA, and training was run for 300 epochs (corresponding to approximately 3 hours on a 16-core Azure machine). Finally, we configured the acquisition function with ξ = 0.01.²

4.3 Results

We compared the model described in this paper, PMF, to the following methods:

• random. For each test dataset, we performed a random search by sampling each pipeline to be evaluated from the set of 42,000 at random without replacement.
• random 2x. Same as random, but with twice the budget. This simulates parallel evaluation of pipelines and is a strong baseline [12].
• random 4x. Same as random, but with 4 times the budget.
• auto-sklearn [4]. We ran auto-sklearn for 4 hours per dataset, set to optimize normalized accuracy on a holdout set.
We disabled the automated ensembling of models in order to obtain a fair comparison to the other non-ensembling methods.
• fmlp [21]. Factorized multi-layer perceptron as an additional baseline (implemented by us).

Our method uses the same procedure used in [4] to warm-start the process by selecting the first 5 pipelines, after which the acquisition function selects subsequent pipelines.

The left plot of Figure 3 shows the average rank for each method as a function of the number of iterations (i.e. the number of pipelines evaluated). Starting from the first iteration, our approach consistently achieves the best average rank. Auto-sklearn is the second best model, outperforming random 2x and almost matched by random 4x. The performance of the factorized MLP is between random and random 2x, even after tuning its hyperparameters to improve performance. This is significantly worse than both auto-sklearn and the proposed method. Please note that random 2x and random 4x are only intended as baselines that are easy to understand and interpret; they can in no way be considered practical solutions, since they both have a much larger computational budget than the non-random methods. Additionally, we measured the difference between the maximum normalized accuracy obtained by any pipeline in each dataset and the one obtained by the pipeline selected at each iteration.

²Although not shown in this section, we also ran the proposed method with ξ = 0 in the acquisition function, and this did not produce any distinguishable effect in our results.
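This regret metric (the gap between the best achievable normalized accuracy on a dataset and the best accuracy found so far) can be sketched as follows; the accuracy trace is invented for illustration.

```python
import numpy as np

def regret_curve(acc_sequence, best_possible):
    """Per-iteration regret: best achievable accuracy minus best found so far."""
    best_so_far = np.maximum.accumulate(np.asarray(acc_sequence))
    return best_possible - best_so_far

# Accuracies of pipelines in the order one method evaluated them (toy values).
accs = [0.61, 0.70, 0.66, 0.74, 0.74, 0.79]
r = regret_curve(accs, best_possible=0.80)
print(r)  # monotonically non-increasing, ending at 0.80 - 0.79
```

Averaging such curves over the held-out datasets gives a curve of the kind shown in the right plot of Figure 3.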
The results summarized in the right plot of Figure 3 show that our method still outperforms all the others.

We also investigated how well our method performs when fewer observations/training datasets are available. In the first variant, we ran our method in the setting where 90% of the entries in Y are missing.³ The test set remains unchanged. The additional curves labeled "PMF-90% missing (Q = 5)" in the plots of Figure 3 demonstrate that our method degrades in performance only slightly when training on 10% of the observations rather than the 80% otherwise available, but still outperforms its competitors, demonstrating very good robustness to missing data. In the second experiment, we matched the number (and, for the most part, the identity) of datasets that auto-sklearn uses to initialize its Bayesian optimization procedure. The results, shown in supplementary Figure 3, confirm that our model still outperforms competing approaches.

Including pipeline metadata. Our approach can easily incorporate information about the composition and the hyperparameters of the pipelines considered. This metadata could, for example, include information about which model is used within each pipeline or which pre-processor is applied to the data before passing it to the model. Empirically, we found that including this information in our model didn't improve performance (shown in supplementary Figure 4). Indeed, our model is able to effectively capture most of this information in a completely unsupervised fashion, just by observing the sparse pipelines-dataset matrix Y.
This is visible in Figure 2, where we show the latent\nembedding colored according to which model was included in which pipeline.\n\n5 Discussion\n\nWe have presented a new approach to automatically build predictive ML pipelines for a given dataset,\nautomating the selection of data pre-processing method and machine learning model as well as the\ntuning of their hyperparameters. Our approach combines techniques from collaborative \ufb01ltering\nand ideas from Bayesian optimization to intelligently explore the space of ML pipelines, exploiting\nexperiments performed in previous datasets. We have benchmarked our approach against the state-of-\nthe-art with a large number of OpenML datasets with different sample sizes, number of features and\nnumber of classes. Overall, our results show that our approach outperforms both the state-of-the-art\nas well as a set of strong baselines.\nOne potential concern with our method is that it requires sampling (i.e. instantiating pipelines) from\na potentially high-dimensional space and thus could require exponentially many samples in order\nto explore all areas of this space. We have found this not to be a problem for three reasons. First,\nmany of the dimensions in the space of pipelines are conditioned on the choice of other dimensions.\nFor example, the number of trees or depth of a random forest are parameters that are only relevant\nif a random forest is chosen in the \u201cmodel\u201d dimension. This reduces the effective search space\nsigni\ufb01cantly. Second, in our model we treat every pipeline as an additional sample, so increasing\nthe sampling density also results in an increase in sample size (and similarly, adding a dataset also\nincreases the effective sample size). Finally, very dense sampling of the pipeline space is only needed\nif the performance is very sensitive to small parameter changes, something that we haven\u2019t observed\nin practice. 
If this is a concern, we advise using our approach in conjunction with traditional Bayesian optimization methods (such as [23]) to further fine-tune the parameters.
We are currently investigating several extensions of this work. First, we would like to include dataset-specific information in our model. As discussed in Section 3, the only data taken into account by our model is the performance of each method on each dataset. Similarity between different pipelines is induced by correlated performance across multiple datasets, which ignores potentially relevant metadata about the datasets themselves, such as their sample size or number of classes. We are currently working on including such information by extending our model with additional kernels and dual embeddings (i.e. embedding both pipelines and datasets in separate latent spaces). Second, we are interested in using acquisition functions that include a factor representing the computational cost of running a given pipeline [23], to handle instances where datasets have a large number of samples. The machine learning models we used for our experiments were constrained not to exceed a certain runtime, but this could be impractical in real applications. Finally, we are planning to experiment with different probabilistic matrix factorization models based on variational autoencoders.

³ From the original observation matrix, which has ≈ 20% missing entries, an additional 70% of entries are dropped uniformly at random. The model was trained with Adam using a learning rate of 10⁻² and Q = 5, and 3 pipelines were used for warm-starting.

6 Data and software

Data and software are available at https://github.com/rsheth80/pmf-automl/.

References

[1] James Bergstra, Rémi Bardenet, Yoshua Bengio, and Balázs Kégl. Algorithms for hyper-parameter optimization. In NIPS, pages 2546–2554, 2011.

[2] James Bergstra and Yoshua Bengio. Random search for hyper-parameter optimization.
JMLR, 13:281–305, 2012.

[3] James Bergstra, Daniel Yamins, and David D Cox. Making a science of model search: Hyperparameter optimization in hundreds of dimensions for vision architectures. ICML, pages 115–123, 2013.

[4] Matthias Feurer, Aaron Klein, Katharina Eggensperger, Jost Springenberg, Manuel Blum, and Frank Hutter. Efficient and robust automated machine learning. In NIPS, pages 2962–2970, 2015.

[5] Matthias Feurer, Jost Tobias Springenberg, and Frank Hutter. Initializing Bayesian hyperparameter optimization via meta-learning. In AAAI, pages 1128–1135, 2015.

[6] Steffen Grünewälder, Jean-Yves Audibert, Manfred Opper, and John Shawe-Taylor. Regret bounds for Gaussian process bandit problems. In AISTATS, pages 273–280, 2010.

[7] Isabelle Guyon, Imad Chaabane, Hugo Jair Escalante, Sergio Escalera, Damir Jajetic, James Robert Lloyd, Núria Macià, Bisakha Ray, Lukasz Romaszko, Michèle Sebag, Alexander Statnikov, Sébastien Treguer, and Evelyne Viegas. A brief review of the ChaLearn AutoML challenge: Any-time any-dataset learning without human intervention. In Proceedings of the Workshop on Automatic Machine Learning, volume 64, pages 21–30, 2016.

[8] Frank Hutter, Holger H Hoos, and Kevin Leyton-Brown. Sequential model-based optimization for general algorithm configuration. In International Conference on Learning and Intelligent Optimization, pages 507–523, 2011.

[9] Neil Lawrence. Probabilistic non-linear principal component analysis with Gaussian process latent variable models. JMLR, 6:1783–1816, 2005.

[10] Neil Lawrence and Raquel Urtasun. Non-linear matrix factorization with Gaussian processes. ICML, 2009.

[11] Rui Leite, Pavel Brazdil, and Joaquin Vanschoren. Selecting classification algorithms with active testing. In International Workshop on Machine Learning and Data Mining in Pattern Recognition, pages 117–131, 2012.

[12] Lisha Li, Kevin Jamieson, Giulia DeSalvo, Afshin Rostamizadeh, and Ameet Talwalkar. Hyperband: A novel bandit-based approach to hyperparameter optimization. arXiv:1603.06560, 2016.

[13] Yuri Malitsky and Barry O'Sullivan. Latent features for algorithm selection. In Seventh Annual Symposium on Combinatorial Search, 2014.

[14] Mustafa Mısır and Michèle Sebag. ALORS: An algorithm recommender system. Artificial Intelligence, 244:291–314, 2017.

[15] J. Močkus. On Bayesian methods for seeking the extremum. In Optimization Techniques IFIP Technical Conference, pages 400–404, 1975.

[16] Michael A Osborne, Roman Garnett, and Stephen J Roberts. Gaussian processes for global optimization. In 3rd International Conference on Learning and Intelligent Optimization (LION3), pages 1–15, 2009.

[17] Fabian Pedregosa, Gaël Varoquaux, Alexandre Gramfort, Vincent Michel, Bertrand Thirion, Olivier Grisel, Mathieu Blondel, Peter Prettenhofer, Ron Weiss, Vincent Dubourg, Jake Vanderplas, Alexandre Passos, David Cournapeau, Matthieu Brucher, Matthieu Perrot, and Édouard Duchesnay. Scikit-learn: Machine learning in Python. JMLR, 12:2825–2830, 2011.

[18] Valerio Perrone, Rodolphe Jenatton, Matthias Seeger, and Cedric Archambeau. Multiple adaptive Bayesian linear regression for scalable Bayesian optimization with warm start. arXiv:1712.02902, 2017.

[19] Matthias Reif, Faisal Shafait, and Andreas Dengel. Meta-learning for evolutionary parameter optimization of classifiers. Machine Learning, 87:357–380, 2012.

[20] Ruslan Salakhutdinov and Andriy Mnih. Bayesian probabilistic matrix factorization using Markov chain Monte Carlo. In ICML, pages 880–887, 2008.

[21] Nicolas Schilling, Martin Wistuba, Lucas Drumond, and Lars Schmidt-Thieme. Hyperparameter optimization with factorized multilayer perceptrons. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pages 87–103, 2015.

[22] Bobak Shahriari, Kevin Swersky, Ziyu Wang, Ryan P Adams, and Nando de Freitas. Taking the human out of the loop: A review of Bayesian optimization. Proceedings of the IEEE, 104:148–175, 2016.

[23] Jasper Snoek, Hugo Larochelle, and Ryan P Adams. Practical Bayesian optimization of machine learning algorithms. In NIPS, pages 2951–2959, 2012.

[24] Jost Tobias Springenberg, Aaron Klein, Stefan Falkner, and Frank Hutter. Bayesian optimization with robust Bayesian neural networks. In NIPS, pages 4134–4142, 2016.

[25] Niranjan Srinivas, Andreas Krause, Sham M Kakade, and Matthias Seeger. Gaussian process optimization in the bandit setting: No regret and experimental design. arXiv:0912.3995, 2009.

[26] David H Stern, Horst Samulowitz, Ralf Herbrich, Thore Graepel, Luca Pulina, and Armando Tacchella. Collaborative expert portfolio management. In AAAI, pages 179–184, 2010.

[27] Kevin Swersky, Jasper Snoek, and Ryan P Adams. Multi-task Bayesian optimization. In NIPS, pages 2004–2012, 2013.

[28] Joaquin Vanschoren, Jan N. van Rijn, Bernd Bischl, and Luis Torgo. OpenML: Networked science in machine learning. SIGKDD Explorations, 15:49–60, 2013.

[29] Christopher KI Williams and Carl Edward Rasmussen. Gaussian Processes for Machine Learning. The MIT Press, 2006.

[30] Martin Wistuba, Nicolas Schilling, and Lars Schmidt-Thieme. Learning data set similarities for hyperparameter optimization initializations. In MetaSel @ PKDD/ECML, pages 15–26, 2015.