{"title": "Bayesian optimization under mixed constraints with a slack-variable augmented Lagrangian", "book": "Advances in Neural Information Processing Systems", "page_first": 1435, "page_last": 1443, "abstract": "An augmented Lagrangian (AL) can convert a constrained optimization problem into a sequence of simpler (e.g., unconstrained) problems which are then usually solved with local solvers. Recently, surrogate-based Bayesian optimization (BO) sub-solvers have been successfully deployed in the AL framework for a more global search in the presence of inequality constraints; however a drawback was that expected improvement (EI) evaluations relied on Monte Carlo. Here we introduce an alternative slack variable AL, and show that in this formulation the EI may be evaluated with library routines. The slack variables furthermore facilitate equality as well as inequality constraints, and mixtures thereof. We show our new slack \"ALBO\" compares favorably to the original. Its superiority over conventional alternatives is reinforced on several new mixed constraint examples.", "full_text": "Bayesian optimization under mixed constraints with a slack-variable augmented Lagrangian\n\nVictor Picheny, MIAT, Université de Toulouse, INRA, Castanet-Tolosan, France, victor.picheny@toulouse.inra.fr\n\nStefan Wild, Argonne National Laboratory, Argonne, IL, USA, wild@mcs.anl.gov\n\nRobert B. Gramacy, Virginia Tech, Blacksburg, VA, USA, rbg@vt.edu\n\nSébastien Le Digabel, École Polytechnique de Montréal, Montréal, QC, Canada, sebastien.le-digabel@polymtl.ca\n\nAbstract\n\nAn augmented Lagrangian (AL) can convert a constrained optimization problem into a sequence of simpler (e.g., unconstrained) problems, which are then usually solved with local solvers. 
Recently, surrogate-based Bayesian optimization (BO) sub-solvers have been successfully deployed in the AL framework for a more global search in the presence of inequality constraints; however, a drawback was that expected improvement (EI) evaluations relied on Monte Carlo. Here we introduce an alternative slack variable AL, and show that in this formulation the EI may be evaluated with library routines. The slack variables furthermore facilitate equality as well as inequality constraints, and mixtures thereof. We show our new slack "ALBO" compares favorably to the original. Its superiority over conventional alternatives is reinforced on several mixed constraint examples.\n\n1 Introduction\n\nBayesian optimization (BO), as applied to so-called blackbox objectives, is a modernization of 1970-80s statistical response surface methodology for sequential design [3, 14]. In BO, nonparametric (Gaussian) processes (GPs) provide flexible response surface fits. Sequential design decisions, so-called acquisitions, judiciously balance exploration and exploitation in the search for global optima. For reviews, see [5, 4]; until recently this literature has focused on unconstrained optimization.\n\nMany interesting problems contain constraints, typically specified as equalities or inequalities:\n\nmin_x { f(x) : g(x) ≤ 0, h(x) = 0, x ∈ B },    (1)\n\nwhere B ⊂ R^d is usually a bounded hyperrectangle, f : R^d → R is a scalar-valued objective function, and g : R^d → R^m and h : R^d → R^p are vector-valued constraint functions taken componentwise (i.e., g_j(x) ≤ 0, j = 1, . . . , m; h_k(x) = 0, k = 1, . . . , p). The typical setup treats f, g, and h as a "joint" blackbox, meaning that providing x to a single computer code reveals f(x), g(x), and h(x) simultaneously, often at great computational expense. 
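To make the joint-blackbox setup concrete, here is a minimal Python sketch of a toy evaluator in the form of (1). The particular f, g, and h below are our own illustrative choices, not the paper's test problems:

```python
import numpy as np

def blackbox(x):
    """A toy 'joint' blackbox in the form of (1): one call reveals f(x), g(x),
    and h(x) together. Illustrative functions only (d = 2 inputs,
    m = 2 inequality constraints, p = 1 equality constraint)."""
    x = np.asarray(x, dtype=float)
    f = x[0] + x[1]                                       # scalar objective
    g = np.array([np.sin(2 * np.pi * x[0]) - 0.5 * x[1],  # g_1(x) <= 0
                  x[0] ** 2 + x[1] ** 2 - 1.0])           # g_2(x) <= 0
    h = np.array([x[0] - x[1] ** 2])                      # h_1(x) = 0
    return f, g, h

def is_valid(x, eps=1e-2):
    """Feasibility check, relaxing each equality constraint to |h_k(x)| < eps
    (the same relaxation the paper uses for its progress metric)."""
    _, g, h = blackbox(x)
    return bool(np.all(g <= 0) and np.all(np.abs(h) < eps))
```

In an expensive simulation setting each call to `blackbox` would be a costly computer experiment; the optimizer only ever sees these joint (f, g, h) triples.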
A common special case treats f(x) as known (e.g., linear); however, the problem is still hard when g(x) ≤ 0 defines a nonconvex valid region.\n\nNot many algorithms target global solutions to this general, constrained blackbox optimization problem. Statistical methods are acutely few. We know of no methods from the BO literature natively accommodating equality constraints, let alone mixed (equality and inequality) ones. Schonlau et al. [21] describe how their expected improvement (EI) heuristic can be extended to multiple inequality constraints by multiplying by an estimated probability of constraint satisfaction. Here, we call this expected feasible improvement (EFI). EFI has recently been revisited by several authors [23, 7, 6]. However, the technique has pathological behavior in otherwise idealized setups [9], which is related to a so-called "decoupled" pathology [7]. Some recent information-theoretic alternatives have shown promise in the inequality constrained setting [10, 17].\n\n30th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain.\n\nWe remark that any problem with equality constraints can be "transformed" to inequality constraints only, by applying h(x) ≤ 0 and h(x) ≥ 0 simultaneously. However, the effect of such a reformulation is rather uncertain. It puts double weight on equalities and violates certain regularity (i.e., constraint qualification [15]) conditions. Numerical issues have been reported in empirical work [1, 20].\n\nIn this paper we show how a recent BO method for inequality constraints [9] is naturally enhanced to handle equality constraints, and therefore mixed ones too. 
The method involves converting inequality constrained problems into a sequence of simpler subproblems via the augmented Lagrangian (AL, [2]). AL-based solvers can, under certain regularity conditions, be shown to converge to locally optimal solutions that satisfy the constraints, so long as the sub-solver converges to local solutions. By deploying modern BO on the subproblems, as opposed to the usual local solvers, the resulting meta-optimizer is able to find better, less local solutions with fewer evaluations of the expensive blackbox, compared to several classical and statistical alternatives. Here we dub that method ALBO.\n\nTo extend ALBO to equality constraints, we suggest the opposite transformation to the one described above: we convert inequality constraints into equalities by introducing slack variables. In the context of earlier work with the AL, via conventional solvers, this is rather textbook [15, Ch. 17]. Handling the inequalities in this way leads naturally to solutions for mixed constraints and, more importantly, dramatically improves the original inequality-only version. In the original (non-slack) ALBO setup, the density and distribution of an important composite random predictive quantity is not known in closed form. Except in a few particular cases [18], calculating EI and related quantities under the AL required Monte Carlo integration, which means that acquisition function evaluations are computationally expensive, noisy, or both. A reformulated slack-AL version emits a composite that has a known distribution, a weighted sum of non-central chi-square (WSNC) variates. We show that, in that setting, EI calculations involve a simple 1-d integral via ordinary quadrature. 
Adding slack variables increases the input dimension of the optimization subproblems, but only artificially so. The effects of expansion can be mitigated through optimal default settings, which we provide.\n\nThe remainder of the paper is organized as follows. Section 2 outlines the components germane to the ALBO approach: AL, Bayesian surrogate modeling, and acquisition via EI. Section 3 contains the bulk of our methodological contribution: a slack variable AL, a closed form EI, optimal default slack settings, and open-source software. Implementation details are provided by our online supplementary material. Section 4 provides empirical comparisons, and Section 5 concludes.\n\n2 A review of relevant concepts: EI and AL\n\nEI: The canonical acquisition function in BO is expected improvement (EI) [12]. Consider a surrogate f^n(x), trained on n pairs (x_i, y_i = f(x_i)), emitting Gaussian predictive equations with mean μ^n(x) and standard deviation σ^n(x). Define f^n_min = min_{i=1,...,n} y_i, the smallest y-value seen so far, and let I(x) = max{0, f^n_min − Y(x)} be the improvement at x. I(x) is largest when Y(x) ∼ f^n(x) has substantial distribution below f^n_min. The expectation of I(x) over Y(x) has a convenient closed form, revealing a balance between exploitation (μ^n(x) under f^n_min) and exploration (large σ^n(x)):\n\nE{I(x)} = (f^n_min − μ^n(x)) Φ((f^n_min − μ^n(x))/σ^n(x)) + σ^n(x) φ((f^n_min − μ^n(x))/σ^n(x)),    (2)\n\nwhere Φ (φ) is the standard normal cdf (pdf). Accurate, approximately Gaussian predictive equations are provided by many statistical models (e.g., GPs). 
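The closed form (2) is a one-liner given Gaussian predictive equations; a minimal sketch using scipy for Φ and φ (ours, for illustration):

```python
from scipy.stats import norm

def expected_improvement(mu, sigma, fmin):
    """Closed-form EI of Eq. (2): E I(x) for Y(x) ~ N(mu, sigma^2),
    relative to the best objective value fmin seen so far."""
    if sigma <= 0:
        # Degenerate, noiseless prediction: improvement is deterministic.
        return max(fmin - mu, 0.0)
    z = (fmin - mu) / sigma
    # Exploitation term (mean below fmin) plus exploration term (large sigma).
    return (fmin - mu) * norm.cdf(z) + sigma * norm.pdf(z)
```

For example, at a point predicted exactly at the incumbent (mu = fmin) with unit uncertainty, EI reduces to φ(0) ≈ 0.399, entirely driven by exploration.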
In non-Gaussian contexts, Monte Carlo schemes—sampling Y(x)'s and averaging I(x)'s—offer a computationally intensive alternative.\n\nAL: Although several authors have suggested extensions to EI for constraints, the BO literature has primarily focused on unconstrained problems. The range of constrained BO options was recently extended by borrowing an apparatus from the mathematical optimization literature, the augmented Lagrangian, allowing unconstrained methods to be adapted to constrained problems. The AL, as a device for solving problems with inequality constraints (no h(x) in Eq. (1)), may be defined as\n\nL_A(x; λ, ρ) = f(x) + λ⊤g(x) + (1/(2ρ)) Σ_{j=1}^m max{0, g_j(x)}²,    (3)\n\nwhere ρ > 0 is a penalty parameter on constraint violation and λ ∈ R^m_+ serves as a Lagrange multiplier. AL methods are iterative, involving a particular sequence of (x; λ, ρ). Given the current values ρ^{k−1} and λ^{k−1}, one approximately solves the subproblem\n\nmin_x { L_A(x; λ^{k−1}, ρ^{k−1}) : x ∈ B },    (4)\n\nvia a conventional (bound-constrained) solver. The parameters (λ, ρ) are updated depending on the nature of the solution found, and the process repeats. The particulars in our setup are provided in Alg. 1; for more details see [15, Ch. 17]. Local convergence is guaranteed under relatively mild conditions involving the choice of subroutine solving (4). Loosely, all that is required is that the solver "makes progress" on the subproblem. In contexts where termination depends more upon computational budget than on a measure of convergence, as in many BO problems, that added flexibility is welcome. However, the AL does not typically enjoy global scope. 
The local minima found by the method are sensitive to initialization—of starting choices for (λ⁰, ρ⁰) or x⁰; local searches in iteration k are usually started from x^{k−1}. However, this dependence is broken when statistical surrogates drive the search for solutions to the subproblems.\n\nAlgorithm 1: Basic augmented Lagrangian method\nRequire: λ⁰ ≥ 0, ρ⁰ > 0\n1: for k = 1, 2, . . . do\n2:   Let x^k (approximately) solve (4)\n3:   Set λ^k_j = max{0, λ^{k−1}_j + (1/ρ^{k−1}) g_j(x^k)}, j = 1, . . . , m\n4:   If g(x^k) ≤ 0, set ρ^k = ρ^{k−1}; else, set ρ^k = ρ^{k−1}/2\n5: end for\n\nIndependently fit GP surrogates, f^n(x) for the objective and g^n(x) = (g^n_1(x), . . . , g^n_m(x)) for the constraints, yield predictive distributions for Y^n_f(x) and Y^n_g(x) = (Y^n_{g1}(x), . . . , Y^n_{gm}(x)). Dropping the n superscripts, the AL composite random variable Y(x) = Y_f(x) + λ⊤Y_g(x) + (1/(2ρ)) Σ_{j=1}^m max{0, Y_{gj}(x)}² can serve as a surrogate for (3); however, it is difficult to deduce its distribution from the components of Y_f and Y_g, even when those are independently Gaussian. While its mean is available in closed form, EI requires Monte Carlo.\n\n3 A novel formulation involving slack variables\n\nAn equivalent formulation of (1) involves introducing slack variables, s_j, for j = 1, . . . , m (i.e., one for each inequality constraint g_j(x)), and converting the mixed constraint problem (1) to one with only equality constraints (plus bound constraints for s_j): g_j(x) + s_j = 0, s_j ∈ R_+, for j = 1, . . . , m. Observe that introducing the slack "inputs" increases the dimension of the problem from d to d + m. Reducing a mixed constraint problem to one involving only equality and bound constraints is valuable insofar as one has good solvers for those problems. 
Suppose, for the moment, that the original problem (1) has no equality constraints (i.e., p = 0). In this case, a slack variable-based AL method is readily available—as an alternative to the description in Section 2. Although we frame it as an "alternative", some would describe this as the standard version [see, e.g., 15, Ch. 17]. The AL is\n\nL_A(x, s; λ_g, ρ) = f(x) + λ_g⊤(g(x) + s) + (1/(2ρ)) Σ_{j=1}^m (g_j(x) + s_j)².    (5)\n\nThis formulation is more convenient than (3) because the "max" is missing, but the extra slack variables mean solving a higher (d + m)-dimensional subproblem compared to (4). That AL can be expanded to handle equality (and thereby mixed) constraints as follows:\n\nL_A(x, s; λ_g, λ_h, ρ) = f(x) + λ_g⊤(g(x) + s) + λ_h⊤h(x) + (1/(2ρ)) [ Σ_{j=1}^m (g_j(x) + s_j)² + Σ_{k=1}^p h_k(x)² ].    (6)\n\nDefining c(x) := [g(x)⊤, h(x)⊤]⊤ and λ := [λ_g⊤, λ_h⊤]⊤, and enlarging the dimension of s with the understanding that s_{m+1} = · · · = s_{m+p} = 0, leads to a streamlined AL for mixed constraints\n\nL_A(x, s; λ, ρ) = f(x) + λ⊤(c(x) + s) + (1/(2ρ)) Σ_{j=1}^{m+p} (c_j(x) + s_j)²,    (7)\n\nwith λ ∈ R^{m+p}. A non-slack AL formulation (3) can analogously be written as\n\nL_A(x; λ_g, λ_h, ρ) = f(x) + λ_g⊤g(x) + λ_h⊤h(x) + (1/(2ρ)) [ Σ_{j=1}^m max{0, g_j(x)}² + Σ_{k=1}^p h_k(x)² ],\n\nwith λ_g ∈ R^m_+ and λ_h ∈ R^p. Eq. (7), by contrast, is easier to work with because it is a smooth quadratic in the objective (f) and constraints (c). 
In what follows, we show that (7) facilitates calculation of important quantities like EI, in the GP-based BO framework, via a library routine. So slack variables not only facilitate mixed constraints in a unified framework, but they also lead to a more efficient handling of the original inequality (only) constrained problem.\n\n3.1 Distribution of the slack-AL composite\n\nIf Y_f and Y_{c1}, . . . , Y_{c,m+p} represent random predictive variables from m + p + 1 surrogates fitted to n realized objective and constraint evaluations, then the analogous slack-AL random variable is\n\nY(x, s) = Y_f(x) + Σ_{j=1}^{m+p} λ_j (Y_{cj}(x) + s_j) + (1/(2ρ)) Σ_{j=1}^{m+p} (Y_{cj}(x) + s_j)².    (8)\n\nAs for the original AL, the mean of this RV has a simple closed form in terms of the means and variances of the surrogates. In the Gaussian case, we show that we can obtain a closed form for the full distribution of the slack-AL variate (8). Toward that aim, first rewrite Y as:\n\nY(x, s) = Y_f(x) + Σ_{j=1}^{m+p} λ_j s_j + (1/(2ρ)) Σ_{j=1}^{m+p} s_j² + (1/(2ρ)) Σ_{j=1}^{m+p} [ 2λ_jρ Y_{cj}(x) + 2s_j Y_{cj}(x) + Y_{cj}(x)² ]\n       = Y_f(x) + Σ_{j=1}^{m+p} λ_j s_j + (1/(2ρ)) Σ_{j=1}^{m+p} s_j² + (1/(2ρ)) Σ_{j=1}^{m+p} [ (α_j + Y_{cj}(x))² − α_j² ],\n\nwith α_j = λ_jρ + s_j. 
Now decompose Y(x, s) into a sum of three quantities:\n\nY(x, s) = Y_f(x) + r(s) + (1/(2ρ)) W(x, s), with r(s) = Σ_{j=1}^{m+p} λ_j s_j + (1/(2ρ)) Σ_{j=1}^{m+p} s_j² − (1/(2ρ)) Σ_{j=1}^{m+p} α_j², and W(x, s) = Σ_{j=1}^{m+p} (α_j + Y_{cj}(x))².    (9)\n\nUsing Y_{cj} ∼ N(μ_{cj}(x), σ²_{cj}(x)), i.e., leveraging Gaussianity, W can be written as\n\nW(x, s) = Σ_{j=1}^{m+p} σ²_{cj}(x) X_j(x, s), with X_j(x, s) ∼ χ²( dof = 1, δ = ((μ_{cj}(x) + α_j)/σ_{cj}(x))² ).    (10)\n\nThe line above is the expression of a weighted sum of non-central chi-square (WSNC) variates. Each of the m + p variates involves a unit degrees-of-freedom (dof) parameter and a non-centrality parameter δ. A number of efficient methods exist for evaluating the density, distribution, and quantile functions of WSNC random variables. Details and code are provided in our supplementary materials.\n\nSome constrained optimization problems involve a known objective f(x). In that case, referring back to (9), we are done: Y(x, s) is WSNC (as in (10)) shifted by the known quantity f(x) + r(s). When Y_f(x) is instead conditionally Gaussian, W̃(x, s) = 2ρ Y_f(x) + W(x, s) is the weighted sum of a Gaussian and WSNC variates, a problem that is again well-studied—see the supplementary material.\n\n3.2 Slack-AL expected improvement\n\nEvaluating EI at candidate (x, s) locations under the AL-composite involves working with EI(x, s) = E[(y^n_min − Y(x, s)) I{Y(x,s) ≤ y^n_min}], given the current minimum y^n_min of the AL over all n runs.\n\nWhen f(x) is known, let w^n_min(x, s) = 2ρ (y^n_min − f(x) − r(s)) absorb all of the non-random quantities involved in the EI calculation. 
Then, with D_W(·; x, s) denoting the distribution function of W(x, s),\n\nEI(x, s) = (1/(2ρ)) E[(w^n_min(x, s) − W(x, s)) I{W(x,s) ≤ w^n_min(x,s)}] = (1/(2ρ)) ∫_{−∞}^{w^n_min(x,s)} D_W(t; x, s) dt = (1/(2ρ)) ∫_{0}^{w^n_min(x,s)} D_W(t; x, s) dt,    (11)\n\nif w^n_min(x, s) ≥ 0, and zero otherwise. That is, the EI boils down to integrating the distribution function of W(x, s) between 0 (since W is positive) and w^n_min(x, s). This is a one-dimensional definite integral that is easy to approximate via quadrature; details are in the supplementary material. Since W(x, s) is quadratic in the Y_c(x) values, it is often the case, especially for smaller ρ-values in later AL iterations, that D_W(t; x, s) is zero over most of [0, w^n_min(x, s)], simplifying numerical integration. However, this has deleterious impacts on search over (x, s), as we discuss in our supplement.\n\nWhen f(x) is unknown and Y_f(x) is conditionally normal, let w̃^n_min(s) = 2ρ (y^n_min − r(s)), and let W̃(x, s) = 2ρ Y_f(x) + W(x, s) denote the Gaussian-plus-WSNC composite from Section 3.1. Then,\n\nEI(x, s) = (1/(2ρ)) E[(w̃^n_min(s) − W̃(x, s)) I{W̃(x,s) ≤ w̃^n_min(s)}] = (1/(2ρ)) ∫_{−∞}^{w̃^n_min(s)} D_W̃(t; x, s) dt.\n\nHere the lower bound of the definite integral cannot be zero, since Y_f(x) may be negative and thus W̃(x, s) may have non-zero distribution for negative t-values. This can challenge the numerical quadrature, although many library functions allow indefinite bounds. We obtain better performance by supplying a conservative finite lower bound, for example three standard deviations in Y_f(x), in units of the penalty (2ρ), below zero: −6ρσ_f(x). 
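For intuition, the known-f EI (11) can be sketched in the simplest setting of a single constraint (m + p = 1), where W(x, s) is just a scaled non-central chi-square variate and its distribution function comes straight from scipy. This is our illustration, not the paper's implementation (which handles general WSNC sums via library routines described in its supplement):

```python
from scipy.integrate import quad
from scipy.stats import ncx2

def slack_al_ei(wmin, mu_c, sigma_c, alpha, rho):
    """EI of Eq. (11) with a single constraint and known f: here
    W = sigma_c^2 * X with X ~ chi^2(dof=1, delta) per Eq. (10), so D_W is
    available from scipy.stats.ncx2 and EI is a 1-d quadrature over [0, wmin]."""
    if wmin <= 0:
        return 0.0  # no improvement possible
    delta = ((mu_c + alpha) / sigma_c) ** 2  # non-centrality, Eq. (10)
    D_W = lambda t: ncx2.cdf(t / sigma_c ** 2, df=1, nc=delta)
    integral, _ = quad(D_W, 0.0, wmin)
    return integral / (2.0 * rho)
```

Because W is the square of the Gaussian α + Y_c(x), the same D_W can be cross-checked as a difference of two normal cdfs, which makes the quadrature easy to validate.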
Implementation details are in our supplement.\n\n3.3 AL updates, optimal slack settings, and other implementation notes\n\nThe new slack-AL method is completed by describing when the subproblem (7) is deemed to be "solved" (step 2 in Alg. 1), and how λ and ρ are updated (steps 3–4). We terminate the BO search sub-solver after a single iteration, as this matches the spirit of EI-based search, whose choice of next location can be shown to be optimal, in a certain sense, if it is the final point being selected. It also meshes well with an updating scheme analogous to that in steps 3–4: updating only when no actual improvement (in terms of constraint violation) is realized by that choice. That is,\n\nstep 2: Let (x^k, s^k) approximately solve min_{x,s} { L_A(x, s; λ^{k−1}, ρ^{k−1}) : (x, s_{1:m}) ∈ B̃ }\nstep 3: Set λ^k_j = λ^{k−1}_j + (1/ρ^{k−1}) (c_j(x^k) + s^k_j), for j = 1, . . . , m + p\nstep 4: If c_{1:m}(x^k) ≤ 0 and |c_{m+1:m+p}(x^k)| ≤ ε, set ρ^k = ρ^{k−1}; else set ρ^k = ρ^{k−1}/2\n\nAbove, step 3 is the same as in Alg. 1 except without the "max", and with slacks augmenting the constraint values. The "if" statement in step 4 checks for validity at x^k, deploying a threshold ε > 0 on equality constraints; further discussion of the threshold ε is deferred to Section 4, where we discuss progress metrics under mixed constraints. If validity holds at (x^k, s^k), the current AL iteration is deemed to have "made progress" and the penalty remains unchanged; otherwise the penalty is doubled. An alternate formulation may check |c_{1:m}(x^k) + s^k_{1:m}| ≤ ε. We find that the version in step 4, above, is cleaner because it limits sensitivity to the choice of threshold ε. 
In our supplementary material we recommend initial (λ⁰, ρ⁰) values which are analogous to the original, non-slack AL settings.\n\nOptimal choice of slacks: The biggest difference between the original AL (3) and the slack-AL (7) is that the latter requires searching over both x and s, whereas the former involves only x-values. In what follows we show that there are automatic choices for the s-values as a function of the corresponding x's, keeping the search space d-dimensional, rather than (d + m)-dimensional.\n\nFor an observed c_j(x) value, associated slack variables minimizing the AL (7) can be obtained analytically. Using the form of (9), observe that min_{s ∈ R^m_+} y(x, s) is equivalent to min_{s ∈ R^m_+} Σ_{j=1}^m [ 2λ_jρ s_j + s_j² + 2 s_j c_j(x) ]. For fixed x, this is strictly convex in s. Therefore, its unconstrained minimum can only be its stationary point, which satisfies 0 = 2λ_jρ + 2 s*_j(x) + 2 c_j(x), for j = 1, . . . , m. Accounting for the nonnegativity constraint, we obtain the following optimal slack as a function of x:\n\ns*_j(x) = max{0, −λ_jρ − c_j(x)},  j = 1, . . . , m.    (12)\n\nAbove we write s* as a function of x to convey that x remains a "free" quantity in y(x, s*(x)). Recall that slacks on equality constraints are zero, s_k(x) = 0, k = m + 1, . . . , m + p, for all x.\n\nIn the blackbox c(x) setting, y(x, s*(x)) is only directly accessible at the data locations x_i. At other x-values, however, the surrogates provide a useful approximation. When Y_c(x) is (approximately) Gaussian it is straightforward to show that the optimal settings of the slack variables, solving min_{s ∈ R^m_+} E[Y(x, s)], are s*_j(x) = max{0, −λ_jρ − μ_{cj}(x)}, i.e., the same as (12) with a prediction μ_{cj}(x) for Y_{cj}(x), the unknown c_j(x) value. 
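As a numerical sanity check (ours, not from the paper's software), the closed form (12) can be compared against direct minimization of the s-dependent part of the slack AL for one constraint:

```python
import numpy as np
from scipy.optimize import minimize_scalar

def optimal_slack(c, lam, rho):
    """Closed-form optimal slack of Eq. (12): s_j* = max{0, -lam_j*rho - c_j}."""
    return np.maximum(0.0, -lam * rho - c)

def al_slack_term(s, c, lam, rho):
    """The s-dependent part of the slack AL (7) for a single constraint:
    lam*(c + s) + (c + s)^2/(2*rho), to be minimized over s >= 0. Its
    stationary point matches the derivation leading to (12)."""
    return lam * (c + s) + (c + s) ** 2 / (2.0 * rho)
```

For any (c, λ, ρ) with λ, ρ > 0, the formula and a bounded 1-d numerical search agree, whether the optimum is interior (c < −λρ) or pinned at s = 0.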
Again, slacks on the equality constraints are set to zero.\n\nOther criteria can be used to choose slack variables. Instead of minimizing the mean of the composite, one could maximize the EI. In our supplementary material we explain how this is of dubious practical value, being more computationally intensive and providing near identical results in practice.\n\nImplementation notes: Code supporting all methods in this manuscript is provided in two open-source R packages: laGP [8] and DiceOptim [19], both on CRAN [22]. Implementation details vary somewhat across those packages, due primarily to particulars of their surrogate modeling capability and how they search the EI surface. For example, laGP can accommodate a smaller initial design size because it learns fewer parameters (i.e., has fewer degrees of freedom). DiceOptim uses a multi-start search procedure for EI, whereas laGP deploys a random candidate grid, which may optionally be "finished" with an L-BFGS-B search. Nevertheless, their qualitative behavior exhibits strong similarity. Both packages also implement the original AL scheme (i.e., without slack variables) updated (6) for mixed constraints. Further details are provided in our supplementary material.\n\n4 Empirical comparison\n\nHere we describe three test problems, each mixing challenging elements from traditional unconstrained blackbox optimization benchmarks, but in a constrained optimization format. We run our optimizers on these problems 100 times under random initializations. In the case of our GP surrogate comparators, this initialization involves choosing random space-filling designs. 
Our primary means of comparison is an averaged (over the 100 runs) measure of progress, defined by the best valid value of the objective for increasing budgets (number of evaluations of the blackbox), n.\n\nIn the presence of equality constraints it is necessary to relax this definition somewhat, as the valid set may be of measure zero. In such cases we choose a tolerance ε ≥ 0 and declare a solution to be "valid" when all inequality constraints are satisfied and when |h_k(x)| < ε for all k = 1, . . . , p. In our figures we choose ε = 10^−2; however, the results are similar under stronger thresholds, with a higher variability over initializations. As finding a valid solution is, in itself, sometimes a difficult task, we additionally report the proportion of runs that find valid and optimal solutions as a function of budget, n, for problems with equality (and mixed) constraints.\n\n4.1 An inequality constrained problem\n\nWe first revisit the "toy" problem from [9], having a 2d input space limited to the unit cube, a (known) linear objective, and sinusoidal and quadratic inequality constraints (henceforth the LSQ problem; see the supplementary material for details). Figure 1 shows progress over repeated solves with a maximum budget of 40 blackbox evaluations. The left-hand plot in Figure 1 tracks the average best valid value of the objective found over the iterations, using the progress metric described above. Random initial designs of size n = 5 were used, as indicated by the vertical-dashed gray line. The solid gray lines are extracted from a similar plot from [9], containing both AL-based comparators, and several from the derivative-free optimization and BO literatures. 
The details are omitted here. Our new ALBO comparators are shown in thicker colored lines; the solid black line is the original AL(BO)-EI comparator, under a revised (compared to [9]) initialization and updating scheme. The two red lines are variations on the slack-AL algorithm under EI: with (dashed) and without (solid) L-BFGS-B optimizing the EI acquisition at each iteration. Finally, the blue line is PESC [10], using the Python library available at https://github.com/HIPS/Spearmint/tree/PESC. The take-home message from the plot is that all four new methods outperform those considered by the original ALBO paper [9]. Focusing on the new comparators only, observe that their progress is nearly statistically equivalent during the first 20 iterations. However, in the latter iterations stark distinctions emerge, with Slack-AL+optim and PESC, both leveraging L-BFGS-B subroutines, outperforming. This discrepancy is more easily visualized in the right panel with a so-called log "utility-gap" plot [10], tracking the log difference between the theoretical best valid value and those found by search.\n\nFigure 1: Results on the LSQ problem with initial designs of size n = 10. The left panel shows the best valid value of the objective over the first 40 evaluations, whereas the right shows the log utility-gap for the second 20 evaluations. The solid gray lines show comparators from [9].\n\n4.2 Mixed inequality and equality constrained problems\n\nNext consider a problem in four input dimensions with a (known) linear objective and two constraints. The first inequality constraint is the so-called "Ackley" function in d = 4 input dimensions. The second is an equality constraint following the so-called "Hartman 4-dimensional function". Our supplementary material provides a full mathematical specification. Figure 2 shows two views into\n\nFigure 2: Results on the Linear-Ackley-Hartman mixed constraint problem. 
The left panel shows a progress comparison based on laGP code with initial designs of size n = 10. The x-scale has been divided by 140 for the nlopt comparator. A value of four indicates that no valid solution has been found. The right panel shows the proportion of valid (thin lines) and optimal (thick lines) solutions for the EFI and "Slack AL + optim" comparators.\n\nprogress on this problem. Since it involves mixed constraints, comparators from the BO literature are scarce. Our EFI implementation deploys the (−h, h) heuristic mentioned in the introduction. As representatives from the nonlinear optimization literature we include nlopt [11] and three adapted NOMAD [13] comparators, which are detailed in our supplementary material. In the left-hand plot we can see that our new ALBO comparators are the clear winner, with an L-BFGS-B optimized EI search under the slack-variable AL implementation performing exceptionally well. The nlopt and NOMAD comparators are particularly poor. We allowed those to run up to 7000 and 1000 iterations, respectively, and in the plot we scaled the x-axis (i.e., n) to put them on the same scale as the others.\n\nThe right-hand plot provides a view into the distribution of two key aspects of performance over the MC repetitions. Observe that "Slack AL + optim" finds valid values quickly, and optimal values not much later. 
Our adapted EFI is particularly slow at converging to optimal (valid) solutions.\n\nOur final problem involves two input dimensions, an unknown objective function (i.e., one that must be modeled with a GP), one inequality constraint, and two equality constraints. The objective is a centered and re-scaled version of the "Goldstein–Price" function. The inequality constraint is the sinusoidal constraint from the LSQ problem [Section 4.1]. The first equality constraint is a centered "Branin" function; the second equality constraint is taken from [16] (henceforth the GBSP problem). Our supplement contains a full mathematical specification. Figure 3 shows our results on this problem.\n\nFigure 3: Results on the GBSP problem. See Figure 2 caption.\n\nObserve (left panel) that the original ALBO comparator makes rapid progress at first, but dramatically slows for later iterations. The other ALBO comparators, including EFI, converge much more reliably, with the "Slack AL + optim" comparator leading in both stages (early progress and ultimate convergence). Again, nlopt and NOMAD are poor; note, however, that their relative ordering is reversed. As before, we scaled the x-axis to view these on a similar scale as the others. The right panel shows the proportion of valid and optimal solutions for "Slack AL + optim" and EFI. Notice that the AL method finds an optimal solution almost as quickly as it finds a valid one—both substantially faster than EFI.\n\n5 Conclusion\n\nThe augmented Lagrangian (AL) is an established apparatus from the mathematical optimization literature, enabling objective-only or bound-constrained optimizers to be deployed in settings with constraints. Recent work involving Bayesian optimization (BO) within the AL framework (ALBO) has shown great promise, especially toward obtaining global solutions under constraints. 
However, those methods were deficient in at least two respects. One was that only inequality constraints could be supported. Another was that evaluating the acquisition function, combining predictive mean and variance information via expected improvement (EI), required Monte Carlo approximation. In this paper we showed that both drawbacks can be addressed via a slack-variable reformulation of the AL. Our method supports inequality, equality, and mixed constraints, and to our knowledge this updated ALBO procedure is unique in the BO literature in its applicability to the most general mixed constraint problem (1). We showed that the slack ALBO method outperforms modern alternatives on several challenging constrained optimization problems.

Acknowledgments

We are grateful to Mickael Binois for comments on early drafts. RBG is grateful for partial support from National Science Foundation grant DMS-1521702. The work of SMW is supported by the U.S. Department of Energy, Office of Science, Office of Advanced Scientific Computing Research under Contract No. DE-AC02-06CH11357. The work of SLD is supported by the Natural Sciences and Engineering Research Council of Canada grant 418250.

References

[1] C. Audet, J. Dennis, Jr., D. W. Moore, A. Booker, and P. D. Frank. Surrogate-model-based method for constrained optimization. In AIAA/USAF/NASA/ISSMO Symposium on Multidisciplinary Analysis and Optimization, 2000.

[2] D. Bertsekas. Constrained Optimization and Lagrange Multiplier Methods. Academic Press, New York, NY, 1982.

[3] G. E. P. Box and N. R. Draper. Empirical Model Building and Response Surfaces.
Wiley, Oxford, 1987.

[4] P. Boyle. Gaussian Processes for Regression and Optimization. PhD thesis, Victoria University of Wellington, 2007.

[5] E. Brochu, V. M. Cora, and N. de Freitas. A tutorial on Bayesian optimization of expensive cost functions, with application to active user modeling and hierarchical reinforcement learning. Technical report, University of British Columbia, 2010. arXiv:1012.2599v1.

[6] J. R. Gardner, M. J. Kusner, Z. Xu, K. Q. Weinberger, and J. P. Cunningham. Bayesian optimization with inequality constraints. In Proceedings of the 31st International Conference on Machine Learning, volume 32. JMLR, W&CP, 2014.

[7] M. A. Gelbart, J. Snoek, and R. P. Adams. Bayesian optimization with unknown constraints. In Uncertainty in Artificial Intelligence (UAI), 2014.

[8] R. B. Gramacy. laGP: Large-scale spatial modeling via local approximate Gaussian processes in R. Journal of Statistical Software, 72(1):1–46, 2016.

[9] R. B. Gramacy, G. A. Gray, S. Le Digabel, H. K. H. Lee, P. Ranjan, G. Wells, and S. M. Wild. Modeling an augmented Lagrangian for blackbox constrained optimization. Technometrics, 58:1–11, 2016.

[10] J. M. Hernández-Lobato, M. A. Gelbart, M. W. Hoffman, R. P. Adams, and Z. Ghahramani. Predictive entropy search for Bayesian optimization with unknown constraints. In Proceedings of the 32nd International Conference on Machine Learning, volume 37. JMLR, W&CP, 2015.

[11] S. G. Johnson. The NLopt nonlinear-optimization package, 2014. Via the R package nloptr.

[12] D. R. Jones, M. Schonlau, and W. J. Welch. Efficient global optimization of expensive black-box functions. Journal of Global Optimization, 13:455–492, 1998.

[13] S. Le Digabel. Algorithm 909: NOMAD: Nonlinear optimization with the MADS algorithm. ACM Transactions on Mathematical Software, 37(4):44:1–44:15, 2011. doi: 10.1145/1916461.1916468.

[14] J. Mockus.
Bayesian Approach to Global Optimization: Theory and Applications. Springer, 1989.

[15] J. Nocedal and S. J. Wright. Numerical Optimization. Springer, second edition, 2006.

[16] J. Parr, A. Keane, A. Forrester, and C. Holden. Infill sampling criteria for surrogate-based optimization with constraint handling. Engineering Optimization, 44:1147–1166, 2012.

[17] V. Picheny. A stepwise uncertainty reduction approach to constrained global optimization. In Proceedings of the 17th International Conference on Artificial Intelligence and Statistics, volume 33, pages 787–795. JMLR W&CP, 2014.

[18] V. Picheny, D. Ginsbourger, and T. Krityakierne. Comment: Some enhancements over the augmented Lagrangian approach. Technometrics, 58(1):17–21, 2016.

[19] V. Picheny, D. Ginsbourger, O. Roustant, with contributions by M. Binois, C. Chevalier, S. Marmin, and T. Wagner. DiceOptim: Kriging-Based Optimization for Computer Experiments, 2016. R package version 2.0.

[20] M. J. Sasena. Flexibility and Efficiency Enhancement for Constrained Global Design Optimization with Kriging Approximations. PhD thesis, University of Michigan, 2002.

[21] M. Schonlau, W. J. Welch, and D. R. Jones. Global versus local search in constrained optimization of computer models. Lecture Notes-Monograph Series, pages 11–25, 1998.

[22] R Development Core Team. R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria, 2004. URL http://www.R-project.org. ISBN 3-900051-00-3.

[23] J. Snoek, H. Larochelle, and R. P. Adams. Practical Bayesian optimization of machine learning algorithms.
In Neural Information Processing Systems (NIPS), 2012.