{"title": "Combinatorial Bayesian Optimization using the Graph Cartesian Product", "book": "Advances in Neural Information Processing Systems", "page_first": 2914, "page_last": 2924, "abstract": "This paper focuses on Bayesian Optimization (BO) for objectives on combinatorial\nsearch spaces, including ordinal and categorical variables. Despite the abundance\nof potential applications of Combinatorial BO, including chipset configuration\nsearch and neural architecture search, only a handful of methods have been pro-\nposed. We introduce COMBO, a new Gaussian Process (GP) BO. COMBO\nquantifies \u201csmoothness\u201d of functions on combinatorial search spaces by utilizing\na combinatorial graph. The vertex set of the combinatorial graph consists of all\npossible joint assignments of the variables, while edges are constructed using the\ngraph Cartesian product of the sub-graphs that represent the individual variables.\nOn this combinatorial graph, we propose an ARD diffusion kernel with which the\nGP is able to model high-order interactions between variables leading to better\nperformance. Moreover, using the Horseshoe prior for the scale parameter in the\nARD diffusion kernel results in an effective variable selection procedure, making\nCOMBO suitable for high dimensional problems. Computationally, in COMBO\nthe graph Cartesian product allows the Graph Fourier Transform calculation to\nscale linearly instead of exponentially.We validate COMBO in a wide array of real-\nistic benchmarks, including weighted maximum satisfiability problems and neural\narchitecture search. COMBO outperforms consistently the latest state-of-the-art\nwhile maintaining computational and statistical efficiency", "full_text": "Combinatorial Bayesian Optimization\nusing the Graph Cartesian Product\n\nChangyong Oh1 Jakub M. 
Tomczak2 Efstratios Gavves1 Max Welling1,2,3\n\n1 University of Amsterdam 2 Qualcomm AI Research 3 CIFAR\n\nC.Oh@uva.nl, jtomczak@qti.qualcomm.com, egavves@uva.nl, m.welling@uva.nl\n\nAbstract\n\nThis paper focuses on Bayesian Optimization (BO) for objectives on combinatorial\nsearch spaces, including ordinal and categorical variables. Despite the abundance\nof potential applications of Combinatorial BO, including chipset con\ufb01guration\nsearch and neural architecture search, only a handful of methods have been pro-\nposed. We introduce COMBO, a new Gaussian Process (GP) BO. COMBO\nquanti\ufb01es \u201csmoothness\u201d of functions on combinatorial search spaces by utilizing\na combinatorial graph. The vertex set of the combinatorial graph consists of all\npossible joint assignments of the variables, while edges are constructed using the\ngraph Cartesian product of the sub-graphs that represent the individual variables.\nOn this combinatorial graph, we propose an ARD diffusion kernel with which the\nGP is able to model high-order interactions between variables leading to better\nperformance. Moreover, using the Horseshoe prior for the scale parameter in the\nARD diffusion kernel results in an effective variable selection procedure, making\nCOMBO suitable for high dimensional problems. Computationally, in COMBO\nthe graph Cartesian product allows the Graph Fourier Transform calculation to\nscale linearly instead of exponentially.We validate COMBO in a wide array of real-\nistic benchmarks, including weighted maximum satis\ufb01ability problems and neural\narchitecture search. COMBO outperforms consistently the latest state-of-the-art\nwhile maintaining computational and statistical ef\ufb01ciency.\n\n1\n\nIntroduction\n\nThis paper focuses on Bayesian Optimization (BO) [42] for objectives on combinatorial search\nspaces consisting of ordinal or categorical variables. 
Combinatorial BO [21] has many applications, including finding optimal chipset configurations, discovering the optimal architecture of a deep neural network, or optimizing compilers to embed software on hardware optimally. All the applications where Combinatorial BO is potentially useful share the following properties. They (i) have black-box objectives to which gradient-based optimizers [47] cannot be applied, (ii) have expensive evaluation procedures, for which methods with low sample efficiency, such as evolutionary [12] or genetic [9] algorithms, are unsuitable, and (iii) have noisy evaluations and highly non-linear objectives for which simple and exact solutions are inaccurate [5, 11, 40].
Interestingly, most BO methods in the literature have focused on continuous [29] rather than combinatorial search spaces. One of the reasons is that the most successful BO methods are built on top of Gaussian Processes (GPs) [22, 33, 42]. As GPs rely on the smoothness defined by a kernel to model uncertainty [37], they were originally proposed for, and are mostly used in, continuous input spaces. In spite of the presence of kernels proposed on combinatorial structures [17, 25, 41], to date the relation between the smoothness of graph signals and the smoothness of functions defined on combinatorial structures has been overlooked and not been exploited for BO on combinatorial structures. A simple solution is to use continuous kernels and round the suggested continuous points. This rounding, however, is not incorporated

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

when computing the covariances at the next BO iteration [14], leading to unwanted artifacts. Furthermore, when considering combinatorial search spaces the number of possible configurations quickly explodes: for M categorical variables with k categories each, the number of possible combinations scales with O(k^M).
Applying BO with GPs on combinatorial spaces is, therefore, not straightforward.
We propose COMBO, a novel Combinatorial BO designed to tackle the aforementioned problems of lack of smoothness and computational complexity on combinatorial structures. To introduce smoothness of a function on combinatorial structures, we propose the combinatorial graph. The combinatorial graph comprises sub-graphs (one per categorical or ordinal variable) that are combined by the graph Cartesian product. The combinatorial graph contains as vertices all possible combinatorial choices. We then define the smoothness of functions on combinatorial structures to be the smoothness of graph signals, using the Graph Fourier Transform (GFT) [35]. Specifically, we propose as our GP kernel on the graph a variant of the diffusion kernel, the automatic relevance determination (ARD) diffusion kernel, for which computing the GFT is computationally tractable via a decomposition of the eigensystem. With a GP on a graph, COMBO accounts for arbitrarily high order interactions between variables. Moreover, using the sparsity-inducing Horseshoe prior [6] on the ARD parameters, COMBO performs variable selection and scales up to high-dimensional problems. COMBO allows for accurate, efficient and large-scale BO on combinatorial search spaces.
In this work, we make the following contributions. First, we show how to introduce smoothness on combinatorial search spaces by introducing combinatorial graphs. On top of a combinatorial graph, we define a kernel using the GFT. Second, we present an algorithm for Combinatorial BO that is computationally scalable to high dimensional problems. Third, we introduce individual scale parameters for each variable, making the diffusion kernel more flexible. When adopting a sparsity-inducing Horseshoe prior [6, 7], COMBO performs variable selection, which makes it scalable to high dimensional problems.
We validate COMBO extensively on (i) four numerical benchmarks, as well as two realistic test cases: (ii) the weighted maximum satisfiability problem [16, 39], where one must find the boolean assignment of the variables in a formula that maximizes the combined weight of the satisfied clauses, and (iii) neural architecture search [10, 48]. Results show that COMBO consistently outperforms all competitors.

2 Method

2.1 Bayesian optimization with Gaussian processes

Bayesian optimization (BO) aims at finding the global optimum of a black-box function f over a search space X, namely, x_opt = arg min_{x∈X} f(x). At each round, a surrogate model attempts to approximate f(x) based on the evaluations so far, D = {(x_i, y_i = f(x_i))}. Then an acquisition function suggests the most promising next point x_{i+1} to be evaluated. D is augmented with the new evaluation, D = D ∪ {(x_{i+1}, y_{i+1})}. The process repeats until the evaluation budget is depleted.
The crucial design choice in BO is the surrogate model that models f(·) in terms of (i) a predictive mean to predict f(·), and (ii) a predictive variance to quantify the prediction uncertainty. With a GP surrogate model, we have the predictive mean μ(x∗|D) = K∗D (KDD + σ_n² I)⁻¹ y and variance σ²(x∗|D) = K∗∗ − K∗D (KDD + σ_n² I)⁻¹ KD∗, where K∗∗ = K(x∗, x∗), [K∗D]_{1,i} = K(x∗, x_i), KD∗ = (K∗D)ᵀ, [KDD]_{i,j} = K(x_i, x_j), and σ_n² is the noise variance.

2.2 Combinatorial graphs and kernels

In BO on continuous search spaces, the most popular surrogate models rely on GPs [22, 33, 42]. Their popularity does not extend to combinatorial spaces, although kernels on combinatorial structures have also been proposed [17, 25, 41].
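For concreteness, the predictive equations of Sect. 2.1 can be sketched in a few lines of NumPy. This is a minimal illustration with a placeholder RBF kernel on scalar inputs, not COMBO's combinatorial kernel; the helper names are ours.

```python
import numpy as np

def gp_posterior(K_DD, K_sD, K_ss, y, noise_var):
    # mu  = K_*D (K_DD + s^2 I)^-1 y
    # var = K_** - K_*D (K_DD + s^2 I)^-1 K_D*
    A = K_DD + noise_var * np.eye(len(y))
    mu = K_sD @ np.linalg.solve(A, y)                # predictive mean
    var = K_ss - K_sD @ np.linalg.solve(A, K_sD.T)   # predictive covariance
    return mu, var

# Toy example: an RBF kernel on scalar inputs (illustrative only)
def rbf(a, b, ls=1.0):
    return np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2 / ls ** 2)

X = np.array([0.0, 1.0, 2.0])
y = np.sin(X)
Xs = np.array([0.5, 1.5])
mu, var = gp_posterior(rbf(X, X), rbf(Xs, X), rbf(Xs, Xs), y, 1e-4)
```

The predictive variance (the diagonal of `var`) is what the acquisition function later uses to trade off exploration against exploitation.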
To design an effective GP-based BO algorithm on combinatorial structures, a space of smooth functions (defined by the GP) is needed. We meet this requirement with the notion of the combinatorial graph, defined as the graph that contains all possible combinatorial choices as its vertices for a given combinatorial problem. That is, each vertex corresponds to a different joint assignment of categorical or ordinal variables. If two vertices are connected by an edge, then their respective sets of combinatorial choices differ by a single combinatorial choice only. As a consequence, we can now revisit the notion of smoothness on combinatorial structures as the smoothness of a graph signal [8, 35] defined on the combinatorial graph. On a combinatorial graph, the shortest path is closely related to the Hamming distance.

The combinatorial graph. To construct the combinatorial graph, we first define one sub-graph G(Ci) per combinatorial variable Ci. For a categorical variable Ci, the sub-graph G(Ci) is chosen to be a complete graph, while for an ordinal variable we have a path graph. We aim at building a search space for combinatorial choices, i.e., a combinatorial graph, by combining the sub-graphs G(Ci) in such a way that the distance between two adjacent vertices corresponds to a change in the value of a single combinatorial variable. It turns out that the graph Cartesian product [15] ensures this property.
Then, the graph Cartesian product of sub-graphs G(Cj) = (V^j, E^j) is defined as G = (V, E) = □_i G(Ci), where V = ×_i V^i and (v1 = (c_1^(1), ..., c_N^(1)), v2 = (c_1^(2), ..., c_N^(2))) ∈ E if and only if ∃j such that ∀i ≠ j, c_i^(1) = c_i^(2) and (c_j^(1), c_j^(2)) ∈ E^j.
As an example, let us consider a simplistic hyperparameter optimization problem for learning a neural network with three combinatorial variables: (i) the batch size, c1 ∈ C1 = {16, 32, 64}, (ii) the optimizer c2 ∈ C2 = {AdaDelta, RMSProp, Adam} and (iii) the learning rate annealing c3 ∈ C3 = {Constant, Annealing}. The sub-graphs {G(Ci)}_{i=1,2,3} for each of the combinatorial variables, as well as the final combinatorial graph after the graph Cartesian product, are illustrated in Figure 1. For the ordinal batch size variable we have a path graph, whereas for the categorical optimizer and learning rate annealing variables we have complete graphs. The final combinatorial graph contains all possible combinations of batch size, optimizer and learning rate annealing.

Figure 1: Combinatorial graph: graph Cartesian product of sub-graphs G(C1)□G(C2)□G(C3)

Cartesian product and Hamming distance. The Hamming distance is a natural choice of distance on categorical variables. With all sub-graphs complete, the shortest path between two vertices in the combinatorial graph is exactly equivalent to the Hamming distance between the respective categorical choices.

Theorem 2.2.1. Assume a combinatorial graph G = (V, E) constructed from categorical variables C1, ..., CN, that is, G is a graph Cartesian product □_i G(Ci) of complete sub-graphs {G(Ci)}_i. Then the shortest path s(v1, v2; G) between vertices v1 = (c_1^(1), ..., c_N^(1)), v2 = (c_1^(2), ..., c_N^(2)) ∈ V on G is equal to the Hamming distance between (c_1^(1), ..., c_N^(1)) and (c_1^(2), ..., c_N^(2)).

Proof. The proof of Theorem 2.2.1 can be found in Supp. 1.

When a sub-graph is not complete, the result below follows from Thm. 2.2.1:

Corollary 2.2.1. If a sub-graph is not a complete graph, then the shortest path is equal to or greater than the Hamming distance.

The combinatorial graph built with the graph Cartesian product is thus a natural search space for combinatorial variables: it encodes a widely used metric on combinatorial variables, the Hamming distance.

Kernels on combinatorial graphs. In order to define the GP surrogate model for a combinatorial problem, we need to specify a proper kernel on the combinatorial graph G = (V, E). The role of the surrogate model is to smoothly interpolate and extrapolate neighboring data. To define a smooth function on a graph, i.e., a smooth graph signal f : V ↦ R, we adopt the Graph Fourier Transform (GFT) from graph signal processing [35]. Similar to Fourier analysis on Euclidean spaces, the GFT can represent any graph signal as a linear combination of graph Fourier bases. Suppressing the high frequency modes of the eigendecomposition approximates the signal with a smooth function on the graph. We adopt the diffusion kernel, which penalizes basis-functions in accordance with the magnitude of their frequency [25, 41].
To compute the diffusion kernel on the combinatorial graph G, we need the eigensystem of the graph Laplacian L(G) = DG − AG, where AG is the adjacency matrix and DG is the degree matrix of the graph G. The eigenvalues {λ1, λ2, ..., λ_{|V|}} and eigenvectors {u1, u2, ..., u_{|V|}} of the graph Laplacian L(G) are the graph Fourier frequencies and bases, respectively. Eigenvectors paired with large eigenvalues correspond to high-frequency Fourier bases.
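The construction so far (sub-graphs, graph Cartesian product, and the shortest-path/Hamming equivalence of Theorem 2.2.1) can be checked on a toy example. This is a pure-NumPy sketch with our own helper names; the adjacency of the product graph is built via the Kronecker sum.

```python
import itertools
from collections import deque
import numpy as np

def complete_adj(k):
    # Sub-graph of a categorical variable with k categories: complete graph
    return np.ones((k, k)) - np.eye(k)

def cartesian_product_adj(adjs):
    # Adjacency of the graph Cartesian product via the Kronecker sum:
    # A(G1 x G2) = A1 (kron) I + I (kron) A2, applied iteratively
    A = adjs[0]
    for B in adjs[1:]:
        A = np.kron(A, np.eye(len(B))) + np.kron(np.eye(len(A)), B)
    return A

def shortest_path(A, s, t):
    # Breadth-first search over the adjacency matrix
    dist, q = {s: 0}, deque([s])
    while q:
        u = q.popleft()
        if u == t:
            return dist[u]
        for v in np.flatnonzero(A[u]):
            if v not in dist:
                dist[v] = dist[u] + 1
                q.append(int(v))
    return None

# Three categorical variables with 3, 3 and 2 categories; all sub-graphs are
# complete, so Theorem 2.2.1 applies.
sizes = [3, 3, 2]
A = cartesian_product_adj([complete_adj(k) for k in sizes])
vertices = list(itertools.product(*[range(k) for k in sizes]))

v1, v2 = vertices.index((0, 1, 0)), vertices.index((2, 1, 1))
hamming = sum(a != b for a, b in zip((0, 1, 0), (2, 1, 1)))
assert shortest_path(A, v1, v2) == hamming  # both equal 2
```

Note that the vertex ordering produced by `itertools.product` matches the index ordering induced by the iterated Kronecker products, which is what makes the index lookup valid.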
The diffusion kernel is defined as

k([p], [q] | β) = Σ_{i=1}^{|V|} e^{−β λ_i} u_i([p]) u_i([q]),   (1)

from which it is clear that higher frequencies, λ_i ≫ 1, are penalized more. In matrix form, with Λ = diag(λ_1, ..., λ_{|V|}) and U = [u_1, ..., u_{|V|}], the kernel takes the following form:

K(V, V) = U exp(−βΛ) Uᵀ,   (2)

which is the Gram matrix on all vertices; the Gram matrix for any subset of vertices is a submatrix of it.

2.3 Scalable combinatorial Bayesian optimization with the graph Cartesian product

The direct computation of the diffusion kernel is infeasible because it involves the eigendecomposition of the Laplacian L(G), an operation with cubic complexity in the number of vertices |V|. As we rely on the graph Cartesian product □_i G_i to construct our combinatorial graph, we can take advantage of its properties and dramatically increase the efficiency of the eigendecomposition of the Laplacian L(G). Further, due to the construction of the combinatorial graph, we can propose a variant of the diffusion kernel: the automatic relevance determination (ARD) diffusion kernel. The ARD diffusion kernel has more flexibility in its modeling capacity. Moreover, in combination with the sparsity-inducing Horseshoe prior [6], the ARD diffusion kernel performs variable selection automatically, which allows COMBO to scale to high dimensional problems.

Speeding up the eigendecomposition with graph Cartesian products. Computing the eigensystem of the Laplacian L(G) naively is infeasible, even for problems of moderate size: for instance, for 15 binary variables, the eigendecomposition complexity is O(|V|³) = (2^15)³. The graph Cartesian product allows us to improve the scalability of the eigendecomposition.
The Laplacian of the Cartesian product of two sub-graphs G1 and G2, G1□G2, can be algebraically expressed using the Kronecker product ⊗ and the Kronecker sum ⊕ [15]:

L(G1□G2) = L(G1) ⊕ L(G2) = L(G1) ⊗ I2 + I1 ⊗ L(G2),   (3)

where Ii denotes the identity matrix of the size of L(Gi). Considering the eigensystems {(λ_i^(1), u_i^(1))} and {(λ_j^(2), u_j^(2))} of G1 and G2, respectively, the eigensystem of G1□G2 is {(λ_i^(1) + λ_j^(2), u_i^(1) ⊗ u_j^(2))}. Given Eq. (3) and matrix exponentiation, for the diffusion kernel of m categorical (or ordinal) variables we have

K = exp(−β ⊕_{i=1}^m L(G_i)) = ⊗_{i=1}^m exp(−β L(G_i)).   (4)

This means we can compute the kernel matrix by calculating the Kronecker product of the per-sub-graph kernels. Specifically, we obtain the kernel for the i-th sub-graph from the eigendecomposition of its Laplacian as per Eq. (2).
Importantly, the decomposition of the final kernel into the Kronecker product of individual kernels in Eq. (4) leads to the following proposition.

Proposition 2.3.1. Assume a graph G = (V, E) is the graph Cartesian product of sub-graphs, G = □_i G_i. The Graph Fourier Transform of G can be computed in O(Σ_{i=1}^m |V_i|³), while the direct computation takes O(Π_{i=1}^m |V_i|³).

Proof. The proof of Proposition 2.3.1 can be found in Supp. 1.

Variable-wise edge scaling. We can make the kernel more flexible by considering individual scaling factors {β_i}, a single β_i for each variable. The diffusion kernel then becomes:

K = exp(⊕_{i=1}^m −β_i L(G_i)) = ⊗_{i=1}^m exp(−β_i L(G_i)),   (5)

where β_i ≥ 0 for i = 1, ..., m.
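The factorization in Eqs. (3)-(5) can be verified numerically on two small sub-graphs. This is a sketch with arbitrary sub-graphs and scale parameters of our choosing: the Kronecker product of per-sub-graph diffusion kernels must equal the diffusion kernel computed directly on the Kronecker-sum Laplacian of the product graph.

```python
import numpy as np

def laplacian(A):
    # Graph Laplacian L = D - A
    return np.diag(A.sum(axis=1)) - A

def diffusion_kernel(L, beta):
    # K = U exp(-beta * Lambda) U^T from the eigensystem of L, as in Eq. (2)
    lam, U = np.linalg.eigh(L)
    return (U * np.exp(-beta * lam)) @ U.T

# Two small sub-graphs: a complete graph (categorical variable) and a
# path graph on 3 vertices (ordinal variable); betas are arbitrary.
A1 = np.ones((3, 3)) - np.eye(3)
A2 = np.diag([1.0, 1.0], 1) + np.diag([1.0, 1.0], -1)
betas = [0.5, 1.2]

# Factorized ARD kernel: Kronecker product of per-variable kernels (Eq. 5)
K_factorized = np.kron(diffusion_kernel(laplacian(A1), betas[0]),
                       diffusion_kernel(laplacian(A2), betas[1]))

# Direct computation on the product graph: Kronecker-sum Laplacian (Eq. 3)
L_full = (np.kron(betas[0] * laplacian(A1), np.eye(3)) +
          np.kron(np.eye(3), betas[1] * laplacian(A2)))
K_direct = diffusion_kernel(L_full, 1.0)

assert np.allclose(K_factorized, K_direct)
```

The factorized route only ever eigendecomposes the small per-variable Laplacians, which is the source of the complexity gain stated in Proposition 2.3.1.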
Since the diffusion kernel is a discrete version of the exponential kernel, applying an individual β_i to each variable is equivalent to the ARD kernel [27, 31]. Hence, we can perform variable (sub-graph) selection automatically. We refer to this kernel as the ARD diffusion kernel.

Prior on β_i. To determine β_i, and to prevent the GP with the ARD kernel from overfitting, we apply posterior sampling with a Horseshoe prior [6] on the {β_i}. The Horseshoe prior encourages sparsity, and thus enables variable selection, which in turn makes COMBO statistically scalable to high dimensional problems. For instance, if β_i is set to zero, then L(G_i) does not contribute in Eq. (5).

Algorithm 1 COMBO: Combinatorial Bayesian Optimization on the combinatorial graph
1: Input: N combinatorial variables {Ci}_{i=1,...,N}
2: Set a search space and compute Fourier frequencies and bases: # See Sect. 2.2
3:   Set sub-graphs G(Ci) for each variable Ci.
4:   Compute the eigensystem {(λ_k^(i), u_k^(i))}_{i,k} of each sub-graph G(Ci).
5:   Construct the combinatorial graph G = (V, E) = □_i G(Ci) using the graph Cartesian product.
6: Initialize D.
7: repeat
8:   Fit a GP using the ARD diffusion kernel to D with slice sampling: μ(v∗|D), σ²(v∗|D)
9:   Maximize the acquisition function: v_next = arg max_{v∗∈V} a(μ(v∗|D), σ²(v∗|D))
10:  Evaluate f(v_next), append to D = D ∪ {(v_next, f(v_next))}
11: until stopping criterion

2.4 COMBO algorithm

We present the COMBO approach in Algorithm 1. More details about COMBO can be found in Supp. Sections 2 and 3.
We start the algorithm by defining all sub-graphs. Then, we calculate the GFT (line 4 of Alg. 1), whose result is needed to compute the ARD diffusion kernel; this computation is sped up by the application of the graph Cartesian product.
Next, we fit the surrogate model parameters using slice sampling [30, 32] (line 8). Sampling begins with 100 burn-in steps. With the updated set D of evaluated data, 10 points are sampled without thinning. More details on the surrogate model fitting are given in Supp. 2.
Last, we maximize the acquisition function to find the next point to evaluate (line 9). For this purpose, we begin by evaluating 20,000 randomly selected vertices. The twenty vertices with the highest acquisition values are used as initial points for the acquisition function optimization. We use breadth-first local search (BFLS): at a given vertex we compare the acquisition values on adjacent vertices, move to the vertex with the highest acquisition value, and repeat until no adjacent vertex has a higher acquisition value than the current vertex. BFLS is a local search; however, the initial random search and the multi-starts help to escape from local minima. In experiments (Supp. 3.1) we found that BFLS performs on par with or better than non-local search, while being more efficient.
In our framework we can use any acquisition function, such as GP-UCB, the Expected Improvement (EI) [37], predictive entropy search [18] or the knowledge gradient [49]. We opt for EI, which generally works well in practice [40].

3 Related work

While for continuous inputs, X ⊆ R^D, there exist efficient algorithms to cope with high-dimensional search spaces using Gaussian processes (GPs) [33] or neural networks [44], few Bayesian Optimization (BO) algorithms have been proposed for combinatorial search spaces [2, 4, 20].
A basic BO approach to combinatorial inputs is to represent all combinatorial variables using a one-hot encoding and to treat all integer-valued variables as values on the real line. Further, for the integer-valued variables the acquisition function considers the closest integer to the chosen real value.
This approach is used in Spearmint [42]. However, applying this method naively may result in severe problems: the acquisition function may repeatedly evaluate the same points, due to the rounding of real values to integers and the one-hot representation of categorical variables. As pointed out in [14], this issue can be fixed by making the objective constant over the regions of input variables for which the actual objective has to be evaluated. The method was presented on a synthetic problem with two integer-valued variables, and on a problem with one categorical variable and one integer-valued variable. Unfortunately, it remains unclear whether this approach is suitable for high-dimensional problems. Additionally, the proposed transformation of the covariance function seems better suited to ordinal-valued variables than to categorical variables, further restricting the utility of this approach. In contrast, we propose a method that can deal with high-dimensional combinatorial (categorical and/or ordinal) spaces.

Another approach to combinatorial optimization was proposed in BOCS [2], where sparse Bayesian linear regression was used instead of GPs. The acquisition function was optimized by semi-definite programming or simulated annealing, which sped up the procedure of picking new points for the next evaluations. However, BOCS has certain limitations which restrict its application mostly to problems with low-order interactions between variables. BOCS requires users to specify the highest order of interactions among categorical variables, which inevitably ignores interaction terms of orders higher than the user-specified order. Moreover, due to its parametric nature, the surrogate model of BOCS has an excessively large number of parameters even for moderately high orders (e.g., up to the 4th or 5th order).
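The parameter growth of such a parametric surrogate can be made concrete: a linear model with all interaction terms up to order k over n binary variables has Σ_{j=0}^{k} C(n, j) coefficients. A quick check reproduces the counts reported for the 43-variable wMaxSAT benchmark in Sect. 4.3.

```python
from math import comb

def num_params(n, order):
    # One bias term plus all interaction terms up to the given order
    return sum(comb(n, j) for j in range(order + 1))

# 43 binary variables, as in the wMaxSAT43 benchmark of Sect. 4.3:
# 3rd/4th/5th-order models already need 13k/137k/1.1M parameters.
assert num_params(43, 3) == 13288
assert num_params(43, 4) == 136698
assert num_params(43, 5) == 1099296
```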
Nevertheless, this approach achieved state-of-the-art results on four high-dimensional binary optimization problems. Different from [2], we use non-parametric regression, i.e., GPs, and perform variable selection, both of which give more statistical efficiency.

4 Experiments

We evaluate COMBO on two binary-variable benchmarks and one ordinal and one multi-categorical variable benchmark, as well as on two realistic problems: weighted Maximum Satisfiability and Neural Architecture Search. We convert all into minimization problems. We compare with SMAC [20], TPE [4], Simulated Annealing (SA) [45], as well as with BOCS (BOCS-SDP and BOCS-SA3)¹ [2]. All details regarding experiments, baselines and results are in the supplementary material. The code is available at: https://github.com/QUVA-Lab/COMBO

4.1 Bayesian optimization with binary variables²

Table 1: Results on the binary benchmarks (Mean ± Std.Err. over 25 runs)

           CONTAMINATION CONTROL                       ISING SPARSIFICATION
METHOD     λ = 0       λ = 10⁻⁴    λ = 10⁻²            λ = 0         λ = 10⁻⁴      λ = 10⁻²
SMAC       21.61±0.04  21.50±0.03  21.68±0.04          0.152±0.040   0.219±0.052   0.350±0.045
TPE        21.64±0.04  21.69±0.04  21.84±0.04          0.404±0.109   0.444±0.095   0.609±0.107
SA         21.47±0.04  21.49±0.04  21.61±0.04          0.095±0.033   0.117±0.035   0.334±0.064
BOCS-SDP   21.37±0.03  21.38±0.03  21.52±0.03          0.105±0.031   0.059±0.013   0.300±0.039
COMBO      21.28±0.03  21.28±0.03  21.44±0.03          0.103±0.035   0.081±0.028   0.317±0.042

Contamination control. The contamination control of a food supply chain is a binary optimization problem with 21 binary variables (≈ 2.10 × 10⁶ configurations) [19], where one can intervene at each stage of the supply chain to quarantine uncontaminated food, at a cost.
The goal is to minimize food contamination while minimizing the prevention cost. We set the budget to 270 evaluations, including 20 random initial points. We report results in Table 1 and figures in Supp. 4.1.2. COMBO outperforms all competing methods. Although the optimized variables are binary, there exist higher-order interactions among the variables due to the sequential nature of the problem, showcasing the importance of the modelling flexibility of COMBO.

Ising sparsification. A probability mass function (p.m.f.) p(z) can be defined by an Ising model I_p. In Ising sparsification, we approximate the p.m.f. p(z) of I_p with the p.m.f. q(z) of a sparser model I_q. The objective is the KL-divergence between p and q with a λ-parameterized regularizer: L(x) = D_KL(p‖q) + λ‖x‖₁, where the binary vector x indicates which interactions of I_p are kept in I_q. We consider 24-binary-variable Ising models on a 4 × 4 spin grid (≈ 1.68 × 10⁷ configurations) with a budget of 170 evaluations, including 20 random initial points. We report results in Table 1 and figures in Supp. 4.1.1. We observe that COMBO is competitive, obtaining slightly worse results, probably because in Ising sparsification no complex interactions between variables exist.

1 We exclude BOCS from the ordinal/multi-categorical experiments because, at the time of the paper submission, the open source implementation provided by the authors did not support ordinal/multi-categorical variables. For an explanation of how to use BOCS for ordinal/multi-categorical variables, please refer to the supplementary material of [2].

2 In [34], the workshop version of this paper, we found that the methods were compared on different sets of initial evaluations and different objectives coming from the random processes involved in the generation of objectives, which turned out to be disadvantageous to COMBO.
We reran these experiments, making sure that all methods are evaluated on the same set of 25 pairs of an objective and a set of initial evaluations.

4.2 Bayesian optimization with ordinal and multi-categorical variables

Ordinal variables. The Branin benchmark is an optimization problem of a non-linear function over a 2D search space [21]. We discretize the search space, namely, we consider a grid of points, which leads to an optimization problem with ordinal variables. We set the budget to 100 evaluations and report results in Table 2 and Figure 9 in the Supp. COMBO converges to a better solution faster and with better stability.

Table 2: Non-binary benchmark results (Mean ± Std.Err. over 25 runs).

METHOD   BRANIN          PEST CONTROL
SMAC     0.6962±0.0705   14.2614±0.0753
TPE      0.7578±0.0844   14.9776±0.0446
SA       0.4659±0.0211   12.7154±0.0918
COMBO    0.4113±0.0012   12.0012±0.0033

We exclude BOCS, as the open source implementation provided by the authors does not support ordinal/multi-categorical variables.

Multi-categorical variables. The Pest control benchmark is a modified version of the contamination control with more complex, higher-order variable interactions, as detailed in Supp. 4.2.2. We consider 21 pest control stations, each having 5 choices (≈ 4.77 × 10¹⁴ combinatorial choices). We set the budget to 320 evaluations, including 20 random initial points. Results are in Table 2 and Figure 10 in the Supp. COMBO outperforms all methods and converges faster.

4.3 Weighted maximum satisfiability

The satisfiability (SAT) problem is an important combinatorial optimization problem, in which one decides how to set the variables of a Boolean formula to make the whole formula true. Many other optimization problems can be reformulated as SAT/MaxSAT problems. Although highly successful specialized MaxSAT solvers [1] exist, we use MaxSAT as a testbed for BO evaluation.
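The weighted MaxSAT objective is simple to state in code: given a weighted CNF formula, sum the weights of the satisfied clauses and negate the result so BO can minimize it. This is a minimal sketch with an illustrative toy formula, not one of the competition benchmarks; the function and clause names are ours.

```python
def neg_weighted_maxsat(assignment, clauses):
    """assignment: dict variable -> bool; clauses: list of (weight, literals).
    A positive literal v is satisfied when assignment[v] is True,
    a negative literal -v when assignment[v] is False."""
    total = 0.0
    for weight, literals in clauses:
        # A clause is satisfied if any of its literals is satisfied
        if any(assignment[abs(l)] == (l > 0) for l in literals):
            total += weight
    return -total  # negated: BO minimizes

# Toy formula: (x1 or not x2) with weight 0.75, (x2 or x3) with weight 0.25
clauses = [(0.75, [1, -2]), (0.25, [2, 3])]
print(neg_weighted_maxsat({1: True, 2: False, 3: True}, clauses))  # -1.0
```

Each boolean assignment is one vertex of the combinatorial graph, so this objective plugs directly into the search space of Sect. 2.2.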
We run tests on three benchmarks from the Maximum Satisfiability Competition 2018.³ The wMaxSAT weights are unit normalized. All evaluations are negated to obtain minimization problems. We set the budget to 270 evaluations, including 20 random initial points. We report results in Table 3 and figures in Supp. 4.3; runtimes on wMaxSAT43 are shown in the figure next to Table 3, and runtimes on wMaxSAT28 in Figure 14 in the Supp.⁴

Table 3: (left) Negated wMaxSAT minimum and (right) runtime vs. minimum on wMaxSAT43.

Method     wMaxSAT28     wMaxSAT43       wMaxSAT60
SMAC       -20.05±0.67   -57.42±1.76     -148.60±1.01
TPE        -25.20±0.88   -52.39±1.99     -137.21±2.83
SA         -31.81±1.19   -75.76±2.30     -187.55±1.50
BOCS-SDP   -29.49±0.53   -51.13±1.69     -153.67±2.01
BOCS-SA3   -34.79±0.78   -61.02±2.28 (a)  N.A. (b)
COMBO      -37.80±0.27   -85.02±2.14     -195.65±0.00

(a) 270 evaluations were not finished after 168 hours.
(b) Not tried, due to a computation time longer than that of wMaxSAT43.

COMBO performs best in all cases. BOCS benefits from third-order interactions on wMaxSAT28 and wMaxSAT43. However, this comes at the cost of a large number of parameters [2], incurring expensive computations. When considering higher-order terms, BOCS suffers severely from inefficient training. This is due to a bad ratio between the number of parameters and the number of training samples (e.g., for the 43-binary-variable case, BOCS-SA3/SA4/SA5 with, respectively, 3rd/4th/5th-order interactions has 13288/136698/1099296 parameters to train). In contrast, COMBO models arbitrarily high-order interactions, thanks to the GP's nonparametric nature, in a statistically efficient way.
Focusing on the largest problem, wMaxSAT60 with ≈ 1.15 × 10¹⁸ configurations, COMBO maintains superior performance.
After also examining non-sparsity-inducing priors (Supp. 4.3), we attribute this to the sparsity-inducing properties of the Horseshoe prior. The Horseshoe prior helps COMBO attain further statistical efficiency. We can interpret this reductionist behavior as the combinatorial version of methods exploiting low effective dimensionality [3] on continuous search spaces [46].

The runtime, including evaluation time, was measured on a dual 8-core 2.4 GHz (Intel Haswell E5-2630-v3) CPU with 64 GB memory using Python implementations. SA, SMAC and TPE are faster but less accurate compared to BOCS. COMBO is faster than BOCS-SA3, which needed 168 hours to collect around 200 evaluations. COMBO, which models arbitrarily high-order interactions, is also faster than BOCS-SDP, which is constrained to second-order interactions only.

³ https://maxsat-evaluations.github.io/2018/benchmarks.html

We conclude that in the realistic maximum satisfiability problem COMBO yields accurate solutions in reasonable runtimes, easily scaling up to high-dimensional combinatorial search problems.

4.4 Neural architecture search

Last, we compare BO methods on a neural architecture search (NAS) problem, a typical combinatorial optimization problem [48]. We compare COMBO with BOCS, as well as Regularized Evolution (RE) [38], one of the most successful evolutionary search algorithms for NAS [48]. We include Random Search (RS), which can be competitive in well-designed search spaces [48]. We do not compare with the BO-based NASBOT [23]. NASBOT focuses exclusively on NAS problems and optimizes over a different search space than ours using an optimal-transport-based metric between architectures, which is out of scope for this work.

Figure 2: Results for Neural Architecture Search (Mean ± Std.Err.
over 4 runs).

Table 4: (left) Connectivity (X: no connection, O: states are connected), (right) Computation type.

    | IN | H1 | H2 | H3 | H4 | H5 | OUT
IN  | -  | O  | X  | X  | X  | O  | X
H1  | -  | -  | X  | O  | X  | X  | O
H2  | -  | -  | -  | X  | O  | X  | X
H3  | -  | -  | -  | -  | X  | O  | X
H4  | -  | -  | -  | -  | -  | O  | O
H5  | -  | -  | -  | -  | -  | -  | X
OUT | -  | -  | -  | -  | -  | -  | -

      | MAXPOOL           | CONV
SMALL | ID ≡ MAXPOOL(1×1) | CONV(3×3)
LARGE | MAXPOOL(3×3)      | CONV(5×5)

For the considered NAS problem we aim at finding the optimal cell comprising one input node (IN), one output node (OUT) and five possible hidden nodes (H1–H5). We allow connections from IN to all other nodes, from H1 to all subsequent nodes, and so on. We exclude connections that could cause loops. An example of connections within a cell can be found in Table 4 on the left, where the input state IN connects to H1, H1 connects to H3 and OUT, and so on. The input state and output state have identity computation types, whereas the computation type of each hidden state is determined by a combination of 2 binary choices from the table on the right of Table 4. In total, the search space consists of 31 binary variables: 21 for the connectivity and 2 for each of the 5 hidden states' computation types.

The objective is to minimize the classification error on the validation set of CIFAR10 [26], with a penalty on the number of FLOPs of the neural network constructed with a given cell. We search for an architecture that balances accuracy and computational efficiency. In each evaluation, we construct a cell, and stack three cells to build a final neural network. More details are given in Supp. 4.4.

In Figure 2 we observe that COMBO outperforms the other methods significantly. BOCS-SDP and RS exhibit similar performance, confirming that for NAS modeling high-order interactions between variables is crucial.
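To make the 31-bit parameterization concrete, here is a minimal decoding sketch. The layout is an illustrative assumption of ours (21 connectivity bits enumerating forward node pairs in lexicographic order, then 2 bits per hidden node selecting among the four computation types) and may differ from COMBO's actual encoding:

```python
from itertools import combinations

NODES = ["IN", "H1", "H2", "H3", "H4", "H5", "OUT"]
# The four computation types arise from two binary choices:
# {MAXPOOL, CONV} x {SMALL, LARGE}.
COMP_TYPES = ["MAXPOOL(1x1)", "MAXPOOL(3x3)", "CONV(3x3)", "CONV(5x5)"]

def decode_cell(bits):
    """Decode 31 binary variables into a cell: 21 connectivity bits
    (one per forward node pair, so no loops are possible) followed by
    2 bits per hidden node selecting its computation type."""
    assert len(bits) == 31
    pairs = list(combinations(NODES, 2))  # exactly C(7, 2) = 21 forward pairs
    edges = [p for p, on in zip(pairs, bits[:21]) if on]
    comp = {f"H{i + 1}": COMP_TYPES[2 * bits[21 + 2 * i] + bits[22 + 2 * i]]
            for i in range(5)}
    return edges, comp

edges, comp = decode_cell([1] + [0] * 30)
print(edges)       # [('IN', 'H1')]
print(comp["H1"])  # MAXPOOL(1x1)
```

Because only forward pairs are enumerated, every bit vector decodes to an acyclic cell, so the BO surrogate can treat all 2^31 configurations as valid.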
Furthermore, COMBO outperforms the specialized RE, one of the most successful evolutionary search (ES) algorithms, shown to perform better on NAS than reinforcement learning (RL) algorithms [38, 48]. Even when increasing the number of evaluations to 500, RE still cannot reach the performance of COMBO with 260 evaluations, see Figure 17 in the Supp. A possible explanation for this behavior is RE's high sensitivity to hyperparameter choices; moreover, ES requires far more evaluations in general. Details about the RE hyperparameters can be found in Supp. 4.4.

Due to the difficulty of using BO on combinatorial structures, BO has not been widely used for NAS, with few exceptions [23]. COMBO's performance suggests that a well-designed general combinatorial BO can be competitive with or even better than ES and RL in NAS, especially when computational resources are constrained. Since COMBO is applicable to any set of combinatorial variables, its use in NAS is not restricted to the typical NASNet search space. Interestingly, COMBO can approximately optimize continuous variables by discretization, as shown in the ordinal variable experiment, thus jointly optimizing the architecture and hyperparameter learning.

5 Conclusion

In this work, we propose COMBO, a Bayesian Optimization method for combinatorial search spaces. To the best of our knowledge, COMBO is the first Bayesian Optimization algorithm using Gaussian Processes as a surrogate model suitable for problems with complex high-order interactions between variables.
To efficiently tackle the exponentially increasing complexity of combinatorial search spaces, we rest upon the following ideas: (i) we represent the search space as the combinatorial graph, which combines the sub-graphs assigned to the individual combinatorial variables using the graph Cartesian product; moreover, the combinatorial graph reflects a natural metric on categorical choices (Hamming distance) when all combinatorial variables are categorical. (ii) We adopt the Graph Fourier Transform (GFT) to define the "smoothness" of functions on combinatorial structures. (iii) We propose a flexible ARD diffusion kernel for GPs on the combinatorial graph with a Horseshoe prior on the scale parameters, which makes COMBO scale up to high-dimensional problems by performing variable selection. Together, these features make COMBO outperform competitors consistently on a wide range of problems. COMBO is a statistically and computationally scalable Bayesian Optimization tool for combinatorial spaces, a field that has not been extensively explored.

References

[1] F. Bacchus, M. Jarvisalo, and R. Martins, editors. MaxSAT Evaluation 2018: Solver and Benchmark Descriptions, volume B-2018-2. University of Helsinki, 2018.

[2] R. Baptista and M. Poloczek. Bayesian optimization of combinatorial structures. In International Conference on Machine Learning, pages 462–471, 2018.

[3] J. Bergstra and Y. Bengio. Random search for hyper-parameter optimization. Journal of Machine Learning Research, 13(Feb):281–305, 2012.

[4] J. Bergstra, D. Yamins, and D. D. Cox. Making a science of model search: Hyperparameter optimization in hundreds of dimensions for vision architectures. 2013.

[5] E. Brochu, V. M. Cora, and N. de Freitas. A tutorial on Bayesian optimization of expensive cost functions, with application to active user modeling and hierarchical reinforcement learning. arXiv preprint arXiv:1012.2599, 2010.

[6] C. M. Carvalho, N. G.
Polson, and J. G. Scott. Handling sparsity via the horseshoe. In Artificial Intelligence and Statistics, pages 73–80, 2009.

[7] C. M. Carvalho, N. G. Polson, and J. G. Scott. The horseshoe estimator for sparse signals. Biometrika, 97(2):465–480, 2010.

[8] F. R. Chung. Spectral Graph Theory (CBMS Regional Conference Series in Mathematics, No. 92). 1996.

[9] L. Davis. Handbook of Genetic Algorithms. 1991.

[10] T. Elsken, J. H. Metzen, and F. Hutter. Neural architecture search: A survey. arXiv preprint arXiv:1808.05377, 2018.

[11] P. I. Frazier. A tutorial on Bayesian optimization. arXiv preprint arXiv:1807.02811, 2018.

[12] A. A. Freitas. A review of evolutionary algorithms for data mining. In Data Mining and Knowledge Discovery Handbook, pages 371–400. Springer, 2009.

[13] R. Garnett, M. A. Osborne, and S. J. Roberts. Bayesian optimization for sensor set selection. In Proceedings of the 9th ACM/IEEE International Conference on Information Processing in Sensor Networks, pages 209–219. ACM, 2010.

[14] E. C. Garrido-Merchán and D. Hernández-Lobato. Dealing with categorical and integer-valued variables in Bayesian optimization with Gaussian processes. arXiv preprint arXiv:1805.03463, 2018.

[15] R. Hammack, W. Imrich, and S. Klavžar. Handbook of Product Graphs. CRC Press, 2011.

[16] P. Hansen and B. Jaumard. Algorithms for the maximum satisfiability problem. Computing, 44(4):279–303, 1990.

[17] D. Haussler. Convolution kernels on discrete structures. Technical report, Department of Computer Science, University of California, 1999.

[18] J. M. Hernández-Lobato, M. W. Hoffman, and Z. Ghahramani. Predictive entropy search for efficient global optimization of black-box functions. In Advances in Neural Information Processing Systems, pages 918–926, 2014.

[19] Y. Hu, J. Hu, Y. Xu, F. Wang, and R. Z. Cao.
Contamination control in food supply chain. In Proceedings of the 2010 Winter Simulation Conference (WSC), pages 2678–2681. IEEE, 2010.

[20] F. Hutter, H. H. Hoos, and K. Leyton-Brown. Sequential model-based optimization for general algorithm configuration. In International Conference on Learning and Intelligent Optimization, pages 507–523. Springer, 2011.

[21] D. R. Jones, M. Schonlau, and W. J. Welch. Efficient global optimization of expensive black-box functions. Journal of Global Optimization, 13(4):455–492, 1998.

[22] K. Kandasamy, G. Dasarathy, J. B. Oliva, J. Schneider, and B. Póczos. Gaussian process bandit optimisation with multi-fidelity evaluations. In Advances in Neural Information Processing Systems, pages 992–1000, 2016.

[23] K. Kandasamy, W. Neiswanger, J. Schneider, B. Poczos, and E. P. Xing. Neural architecture search with Bayesian optimisation and optimal transport. In Advances in Neural Information Processing Systems, pages 2016–2025, 2018.

[24] D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.

[25] R. I. Kondor and J. Lafferty. Diffusion kernels on graphs and other discrete structures. In Proceedings of the 19th International Conference on Machine Learning, pages 315–322, 2002.

[26] A. Krizhevsky and G. Hinton. Learning multiple layers of features from tiny images. Technical report, Citeseer, 2009.

[27] D. J. MacKay. Bayesian nonlinear modeling for the prediction competition. ASHRAE Transactions, 100(2):1053–1062, 1994.

[28] G. Malkomes, C. Schaff, and R. Garnett. Bayesian optimization for automated model selection. In Advances in Neural Information Processing Systems, pages 2900–2908, 2016.

[29] J. Močkus. On Bayesian methods for seeking the extremum. In Optimization Techniques IFIP Technical Conference, pages 400–404. Springer, 1975.

[30] I.
Murray and R. P. Adams. Slice sampling covariance hyperparameters of latent Gaussian models. In Advances in Neural Information Processing Systems, pages 1732–1740, 2010.

[31] R. M. Neal. Bayesian Learning for Neural Networks. PhD thesis, University of Toronto, 1995.

[32] R. M. Neal. Slice sampling. Annals of Statistics, pages 705–741, 2003.

[33] C. Oh, E. Gavves, and M. Welling. BOCK: Bayesian optimization with cylindrical kernels. arXiv preprint arXiv:1806.01619, 2018.

[34] C. Oh, J. Tomczak, E. Gavves, and M. Welling. COMBO: Combinatorial Bayesian optimization using graph representations. In ICML Workshop on Learning and Reasoning with Graph-Structured Data, 2019.

[35] A. Ortega, P. Frossard, J. Kovačević, J. M. Moura, and P. Vandergheynst. Graph signal processing: Overview, challenges, and applications. Proceedings of the IEEE, 106(5):808–828, 2018.

[36] A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. Lin, A. Desmaison, L. Antiga, and A. Lerer. Automatic differentiation in PyTorch. 2017.

[37] C. E. Rasmussen and C. K. Williams. Gaussian Processes for Machine Learning. The MIT Press, 2006.

[38] E. Real, A. Aggarwal, Y. Huang, and Q. V. Le. Regularized evolution for image classifier architecture search. arXiv preprint arXiv:1802.01548, 2018.

[39] M. G. Resende, L. Pitsoulis, and P. Pardalos. Approximate solution of weighted max-sat problems using GRASP. Satisfiability Problems, 35:393–405, 1997.

[40] B. Shahriari, K. Swersky, Z. Wang, R. P. Adams, and N. de Freitas. Taking the human out of the loop: A review of Bayesian optimization. Proceedings of the IEEE, 104(1):148–175, 2015.

[41] A. J. Smola and R. Kondor. Kernels and regularization on graphs. In Learning Theory and Kernel Machines, pages 144–158. Springer, 2003.

[42] J. Snoek, H. Larochelle, and R. P. Adams.
Practical Bayesian optimization of machine learning algorithms. In Advances in Neural Information Processing Systems, pages 2951–2959, 2012.

[43] J. Snoek, K. Swersky, R. Zemel, and R. Adams. Input warping for Bayesian optimization of non-stationary functions. In International Conference on Machine Learning, pages 1674–1682, 2014.

[44] J. Snoek, O. Rippel, K. Swersky, R. Kiros, N. Satish, N. Sundaram, M. Patwary, M. Prabhat, and R. Adams. Scalable Bayesian optimization using deep neural networks. In International Conference on Machine Learning, pages 2171–2180, 2015.

[45] W. M. Spears. Simulated annealing for hard satisfiability problems. In Cliques, Coloring, and Satisfiability, pages 533–558. Citeseer, 1993.

[46] Z. Wang, F. Hutter, M. Zoghi, D. Matheson, and N. de Freitas. Bayesian optimization in a billion dimensions via random embeddings. Journal of Artificial Intelligence Research, 55:361–387, 2016.

[47] A. Wilson, A. Fern, and P. Tadepalli. Using trajectory data to improve Bayesian optimization for reinforcement learning. The Journal of Machine Learning Research, 15(1):253–282, 2014.

[48] M. Wistuba, A. Rawat, and T. Pedapati. A survey on neural architecture search. arXiv preprint arXiv:1905.01392, 2019.

[49] J. Wu, M. Poloczek, A. G. Wilson, and P. Frazier. Bayesian optimization with gradients. In Advances in Neural Information Processing Systems, pages 5267–5278, 2017.