{"title": "Black-box optimization of noisy functions with unknown smoothness", "book": "Advances in Neural Information Processing Systems", "page_first": 667, "page_last": 675, "abstract": "We study the problem of black-box optimization of a function $f$ of any dimension, given function evaluations perturbed by noise. The function is assumed to be locally smooth around one of its global optima, but this smoothness is unknown. Our contribution is an adaptive optimization algorithm, POO or parallel optimistic optimization, that is able to deal with this setting. POO performs almost as well as the best known algorithms requiring the knowledge of the smoothness. Furthermore, POO works for a larger class of functions than what was previously considered, especially for functions that are difficult to optimize, in a very precise sense. We provide a finite-time analysis of POO's performance, which shows that its error after $n$ evaluations is at most a factor of $\\sqrt{\\ln n}$ away from the error of the best known optimization algorithms using the knowledge of the smoothness.", "full_text": "Black-box optimization of noisy functions with unknown smoothness

Jean-Bastien Grill
Michal Valko
SequeL team, INRIA Lille - Nord Europe, France
jean-bastien.grill@inria.fr
michal.valko@inria.fr

Rémi Munos
Google DeepMind, UK*
munos@google.com

Abstract

We study the problem of black-box optimization of a function f of any dimension, given function evaluations perturbed by noise. The function is assumed to be locally smooth around one of its global optima, but this smoothness is unknown. Our contribution is an adaptive optimization algorithm, POO or parallel optimistic optimization, that is able to deal with this setting.
POO performs almost as well as the best known algorithms requiring the knowledge of the smoothness. Furthermore, POO works for a larger class of functions than what was previously considered, especially for functions that are difficult to optimize, in a very precise sense. We provide a finite-time analysis of POO's performance, which shows that its error after n evaluations is at most a factor of √(ln n) away from the error of the best known optimization algorithms using the knowledge of the smoothness.

1 Introduction

We treat the problem of optimizing a function f : X → R given a finite budget of n noisy evaluations. We consider that the cost of any of these function evaluations is high. That means we care about assessing the optimization performance in terms of the sample complexity, i.e., the number n of function evaluations. This is typically the case when one needs to tune parameters for a complex system seen as a black box, whose performance can only be evaluated by a costly simulation. One such example is hyper-parameter tuning, where the sensitivity to perturbations is large and the derivatives of the objective function with respect to these parameters do not exist or are unknown.

Such a setting fits the sequential decision-making setting under bandit feedback. In this setting, the actions are the points that lie in a domain X. At each step t, an algorithm selects an action x_t ∈ X and receives a reward r_t, which is a noisy function evaluation such that r_t = f(x_t) + ε_t, where ε_t is a bounded noise with E[ε_t | x_t] = 0. After n evaluations, the algorithm outputs its best guess x(n), which can be different from x_n.
The performance measure we want to minimize is the value of the function at the returned point compared to the optimum, also referred to as simple regret,

$$R_n \stackrel{\text{def}}{=} \sup_{x \in \mathcal{X}} f(x) - f(x(n)).$$

We assume there exists at least one point x⋆ ∈ X such that f(x⋆) = sup_{x∈X} f(x).

The relationship with bandit settings motivated UCT [10, 8], an empirically successful heuristic that hierarchically partitions the domain X and selects the next point x_t ∈ X using upper confidence bounds [1]. The empirical success of UCT on one side, but the absence of performance guarantees for it on the other, incited research on similar but theoretically founded algorithms [4, 9, 12, 2, 6].

As the global optimization of an unknown function without any assumptions at all would be a daunting needle-in-a-haystack problem, most of the algorithms make at least the very weak assumption that the function does not decrease faster than a known rate around one of its global optima. In other words, they assume a certain local smoothness property of f. This smoothness is often expressed in the form of a semi-metric ℓ that quantifies this regularity [4]. Naturally, this regularity also influences the guarantees that these algorithms are able to furnish. Many of them define a near-optimality dimension d or a zooming dimension. These are ℓ-dependent quantities used to bound the simple regret R_n or a related notion called cumulative regret.

Our work focuses on a notion of such near-optimality dimension d that does not relate the smoothness property of f to a specific metric ℓ but directly to the hierarchical partitioning P = {P_{h,i}}, a tree-based representation of the space used by the algorithm.

*on leave from the SequeL team, INRIA Lille - Nord Europe, France
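As a concrete illustration of the bandit-feedback protocol and of the simple regret just defined, here is a minimal sketch in Python. The quadratic test function, the uniform noise model, and the pure random-search strategy are illustrative assumptions only, not part of the paper:

```python
import random


def noisy_evaluate(f, x, noise_range=0.1):
    """One bandit query: reward r_t = f(x_t) + eps_t, where eps_t is a
    bounded, zero-mean noise (here uniform on [-noise_range, noise_range])."""
    return f(x) + random.uniform(-noise_range, noise_range)


def simple_regret(f, x_star, x_returned):
    """R_n = sup_x f(x) - f(x(n)); the maximizer x_star is known here."""
    return f(x_star) - f(x_returned)


# Toy run: n noisy evaluations of an assumed test function on [0, 1],
# returning the query point with the highest observed reward as x(n).
f = lambda x: 1.0 - (x - 0.3) ** 2   # global optimum at x_star = 0.3
random.seed(0)
queries = [random.random() for _ in range(200)]
rewards = [noisy_evaluate(f, x) for x in queries]
x_n = max(zip(queries, rewards), key=lambda pair: pair[1])[0]
regret = simple_regret(f, 0.3, x_n)
```

Note that because the rewards are noisy, the point with the highest observed reward need not be the best queried point, which is exactly why the returned guess x(n) can differ from the last query x_n.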
Indeed, an interesting fundamental question is to determine a good characterization of the difficulty of the optimization for an algorithm that uses a given hierarchical partitioning of the space X as its input. The kind of hierarchical partitioning {P_{h,i}} we consider is similar to the ones introduced in prior work: for any depth h ≥ 0 in the tree representation, the set of cells {P_{h,i}}_{1≤i≤I_h} forms a partition of X, where I_h is the number of cells at depth h. At depth 0, the root of the tree, there is a single cell P_{0,1} = X. A cell P_{h,i} of depth h is split into several children subcells {P_{h+1,j}}_j of depth h + 1. We refer to the standard partitioning as the one where each cell is split into regular same-sized subcells [13].

An important insight, detailed in Section 2, is that a near-optimality dimension d that is independent of the partitioning used by an algorithm (as defined in prior work [4, 9, 2]) does not embody the optimization difficulty perfectly. This is easy to see, as for any f we could define a partitioning perfectly suited to f. An example is a partitioning that, at the root, splits X into {x⋆} and X \ {x⋆}, which makes the optimization trivial, whatever d is. This insight was already observed by Slivkins [14] and Bull [6], whose zooming dimension depends both on the function and the partitioning.

In this paper, we define a notion of near-optimality dimension d which measures the complexity of the optimization problem directly in terms of the partitioning used by an algorithm. First, we make the following local smoothness assumption about the function, expressed in terms of the partitioning and not of any metric: for a given partitioning P, we assume that there exist ν > 0 and ρ ∈ (0, 1) such that

$$\forall h \ge 0, \ \forall x \in P_{h,i^\star_h}, \quad f(x) \ge f(x^\star) - \nu\rho^h,$$

where (h, i⋆_h) is the (unique) cell of depth h containing x⋆. Then, we define the near-optimality dimension d(ν, ρ) as

$$d(\nu, \rho) \stackrel{\text{def}}{=} \inf\left\{d' \in \mathbb{R}_+ : \exists C > 0, \ \forall h \ge 0, \ \mathcal{N}_h(2\nu\rho^h) \le C\rho^{-d'h}\right\},$$

where for all ε > 0, N_h(ε) is the number of cells P_{h,i} of depth h such that sup_{x∈P_{h,i}} f(x) ≥ f(x⋆) − ε. Intuitively, functions with smaller d are easier to optimize, and we denote the (ν, ρ) for which d(ν, ρ) is the smallest as (ν⋆, ρ⋆). Obviously, d(ν, ρ) depends on P and f, but does not depend on any choice of a specific metric. In Section 2, we argue that this definition of d¹ encompasses the optimization complexity better. We stress that this is not an artifact of our analysis: previous algorithms, such as HOO [4], TaxonomyZoom [14], or HCT [2], can be shown to scale with this new notion of d.

Most of the prior bandit-based algorithms proposed for function optimization, in either the deterministic or the stochastic setting, assume that the smoothness of the optimized function is known. This is the case for the known semi-metric [4, 2] and pseudo-metric [9]. This assumption limits the applicability of these algorithms and opened a very compelling question of whether this knowledge is necessary. Prior work responded with algorithms not requiring this knowledge. Bubeck et al. [5] provided an algorithm for the optimization of Lipschitz functions without the knowledge of the Lipschitz constant. However, they have to assume that f is twice differentiable and that a bound on the second-order derivative is known. Combes and Proutière [7] treat unimodal f restricted to dimension one. Slivkins [14] considered a general optimization problem embedded in a taxonomy² and provided guarantees as a function of the quality of the taxonomy.
The quality refers to the probability of reaching two cells belonging to the same branch whose values differ by more than half of the diameter (expressed by the true metric) of the branch. The problem is that the algorithm needs a lower bound on this quality (which can be tiny) and the performance depends inversely on this quantity. It also assumes that the quality is strictly positive. In this paper, we do not rely on the knowledge of the quality and also consider a more general class of functions for which the quality can be 0 (Appendix E).

¹we use the simplified notation d instead of d(ν, ρ) for clarity when no confusion is possible
²which is similar to the hierarchical partitioning previously defined

Figure 1: Difficult function $f : x \mapsto s(\log_2 |x - 0.5|)\cdot\left(\sqrt{|x - 0.5|} - (x - 0.5)^2\right) - \sqrt{|x - 0.5|}$, where s(x) = 1 if the fractional part of x, that is, x − ⌊x⌋, is in [0, 0.5], and s(x) = 0 if it is in (0.5, 1). Left: Oscillation between two envelopes of different smoothness, leading to a nonzero d for a standard partitioning. Right: Regret of HOO after 5000 evaluations for different values of ρ.

Another direction has been followed by Munos [11], where in the deterministic case (the function evaluations are not perturbed by noise), their SOO algorithm performs almost as well as the best known algorithms without the knowledge of the function smoothness. SOO was later extended to StoSOO [15] for the stochastic case. However, StoSOO extends SOO only to a limited class of easy instances of functions, those for which there exists a semi-metric under which d = 0. Also, Bull [6] provided a similar regret bound for the ATB algorithm for a class of functions, called zooming continuous functions, which is related to the class of functions for which there exists a semi-metric under which the near-optimality dimension is d = 0.
But none of the prior work considers a more general class of functions for which there is no semi-metric adapted to the standard partitioning under which d = 0. To give an example of a difficult function, consider the function in Figure 1. It possesses a lower and an upper envelope around its global optimum that are equivalent to x² and √x, respectively, and that therefore have different smoothness. Thus, for a standard partitioning, there is no semi-metric of the form ℓ(x, y) = ||x − y||^α for which the near-optimality dimension is d = 0, as shown by Valko et al. [15]. Other examples of nonzero near-optimality dimension are the functions that, for a standard partitioning, behave differently depending on the direction, for instance f : (x, y) ↦ 1 − |x| − y².

Using a bad value for the ρ parameter can have dramatic consequences on the simple regret. In Figure 1, we show the simple regret after 5000 function evaluations for different values of ρ. For values of ρ that are too low, the algorithm does not explore enough and gets stuck in a local maximum, while for values of ρ that are too high, the algorithm wastes evaluations by exploring too much.

In this paper, we provide a new algorithm, POO, parallel optimistic optimization, which competes with the best algorithms that assume the knowledge of the function smoothness, for a larger class of functions than was previously done. Indeed, POO handles a panoply of functions, including hard instances, i.e., those with d > 0, like the function illustrated above. We also recover the results of StoSOO and ATB for functions with d = 0.
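The difficult function of Figure 1 can be transcribed directly into code; the treatment of the point x = 0.5 itself (where the logarithm is undefined and f attains its maximum value 0) is our own convention:

```python
import math


def s(x):
    """Square wave from Figure 1: 1 if the fractional part of x is in
    [0, 0.5], and 0 if it is in (0.5, 1)."""
    return 1.0 if x - math.floor(x) <= 0.5 else 0.0


def difficult_f(x):
    """f(x) = s(log2 |x-0.5|) * (sqrt(|x-0.5|) - (x-0.5)^2) - sqrt(|x-0.5|).
    Where s = 1 this equals -(x-0.5)^2 (the quadratic envelope); where s = 0
    it equals -sqrt(|x-0.5|) (the square-root envelope), so f oscillates
    between two envelopes of different smoothness around x* = 0.5."""
    if x == 0.5:
        return 0.0  # limit value at the global optimum
    u = abs(x - 0.5)
    return s(math.log2(u)) * (math.sqrt(u) - u ** 2) - math.sqrt(u)
```

For every x in [0, 1], the value stays between the two envelopes, which is precisely why no single semi-metric of the form ||x − y||^α yields d = 0 for a standard partitioning.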
In particular, we bound POO's simple regret as

$$\mathbb{E}[R_n] \le O\!\left(\left(\left(\ln^2 n\right)/n\right)^{1/(2+d(\nu^\star,\rho^\star))}\right).$$

This result should be compared to the simple regret of the best known algorithm that uses the knowledge of the metric under which the function is smooth, or equivalently (ν, ρ), which is of the order of O((ln n/n)^{1/(2+d)}). Thus, POO's performance is at most a factor of (ln n)^{1/(2+d)} away from that of the best known optimization algorithms that require the knowledge of the function smoothness. Interestingly, this factor decreases with the complexity measure d: the harder the function is to optimize, the less important it is to know its precise smoothness.

2 Background and assumptions

2.1 Hierarchical optimistic optimization

POO optimizes functions without the knowledge of their smoothness by using as a subroutine an anytime algorithm that optimizes functions with the knowledge of their smoothness. In this paper, we use a modified version of HOO [4] as this subroutine. Therefore, we begin with a quick review of HOO.

HOO follows an optimistic strategy close to UCT [10], but unlike UCT, it uses proper confidence bounds to provide theoretical guarantees. HOO refines a partition of the space based on a hierarchical partitioning, where at each step, a yet unexplored cell (a leaf of the corresponding tree) is selected, and the function is evaluated at a point within this cell. The selected path (from the root to the leaf) is the one that maximizes the minimum value U_{h,i}(t) among all cells of each depth, where the value U_{h,i}(t) of any cell P_{h,i} is defined as

$$U_{h,i}(t) = \hat{\mu}_{h,i}(t) + \sqrt{\frac{2\ln t}{N_{h,i}(t)}} + \nu\rho^h,$$

where t is the number of evaluations done so far, μ̂_{h,i}(t) is the empirical average of all evaluations done within P_{h,i}, and N_{h,i}(t) is the number of them. The second term in the definition of U_{h,i}(t) is a Chernoff-Hoeffding-type confidence interval, measuring the estimation error induced by the noise. The third term, νρ^h with ρ ∈ (0, 1), is, by assumption, a bound on the difference f(x⋆) − f(x) for any x ∈ P_{h,i⋆_h}, a cell containing x⋆. It is through this bound that HOO relies on the knowledge of the smoothness, because the algorithm requires the values of ν and ρ. In the next sections, we clarify the assumptions made by HOO vs. related algorithms and point out the differences with POO.

2.2 Assumptions made in prior work

Most of the previous work relies on the knowledge of a semi-metric on X such that the function is either locally smooth near one of its maxima with respect to this metric [11, 15, 2] or satisfies a stronger, weakly-Lipschitz assumption [4, 12, 2]. Furthermore, Kleinberg et al. [9] assume a full metric. Note that a semi-metric does not require the triangle inequality to hold. For instance, consider the semi-metric ℓ(x, y) = ||x − y||^α on R^p with ||·|| being the Euclidean metric. When α < 1, this semi-metric does not satisfy the triangle inequality; however, it is a metric for α ≥ 1. Therefore, using only a semi-metric allows us to consider a larger class of functions.

Prior work typically requires two assumptions. The first one is on the semi-metric ℓ and the function. An example is the weakly-Lipschitz assumption needed by Bubeck et al.
[4], which requires that

$$\forall x, y \in \mathcal{X}, \quad f(x^\star) - f(y) \le f(x^\star) - f(x) + \max\left\{f(x^\star) - f(x), \ \ell(x, y)\right\}.$$

It is a weak version of a Lipschitz condition, restricting f in particular for values close to f(x⋆). More recent results [11, 15, 2] assume only a local smoothness around one of the function's maxima,

$$\forall x \in \mathcal{X}, \quad f(x^\star) - f(x) \le \ell(x^\star, x).$$

The second common assumption links the hierarchical partitioning with the semi-metric. It requires the partitioning to be adapted to the (semi-)metric. More precisely, the well-shaped assumption states that there exist ρ < 1 and ν₁ ≥ ν₂ > 0 such that for any depth h ≥ 0 and index i = 1, ..., I_h, the cell P_{h,i} is contained in an open ball of radius ν₁ρ^h and contains an open ball of radius ν₂ρ^h, where the balls are w.r.t. the same semi-metric used in the definition of the function smoothness.

'Local smoothness' is weaker than 'weakly Lipschitz' and therefore preferable. Algorithms requiring the local-smoothness assumption always sample a cell P_{h,i} at a special representative point and, in the stochastic case, collect several function evaluations at the same point before splitting the cell. This is not the case for HOO, which allows sampling any point inside the selected cell and expands each cell after one sample. This additional flexibility comes at the price of requiring the stronger weakly-Lipschitz assumption. Nevertheless, although HOO does not wait before expanding a cell, it does something similar by selecting a path from the root to this leaf that maximizes the minimum of the U-value over the cells of the path, as mentioned in Section 2.1.
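The U-value of Section 2.1 can be sketched as follows; the cell statistics are passed in explicitly, and the tree bookkeeping of a full HOO implementation is omitted:

```python
import math


def u_value(mean, count, t, nu, rho, h):
    """U_{h,i}(t) = empirical mean + Chernoff-Hoeffding confidence width
    sqrt(2 ln t / N_{h,i}(t)) + smoothness bonus nu * rho**h.
    An unvisited cell is treated as maximally optimistic."""
    if count == 0:
        return math.inf
    return mean + math.sqrt(2.0 * math.log(t) / count) + nu * rho ** h


# The bonus nu * rho**h shrinks with the depth h, while the confidence
# width shrinks with the number of visits of the cell.
shallow = u_value(mean=0.4, count=10, t=100, nu=1.0, rho=0.5, h=1)
deep = u_value(mean=0.4, count=10, t=100, nu=1.0, rho=0.5, h=5)
```

With identical statistics, the shallower cell receives the larger optimistic value, since its smoothness bonus covers a larger region of the space.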
The fact that HOO follows an optimistic strategy even after reaching the cell that possesses the minimal U-value along the path is not used in the analysis of the HOO algorithm.

Furthermore, a reason for the better dependency on the smoothness in other algorithms, e.g., HCT [2], is not only algorithmic: HCT needs to assume a slightly stronger condition on the cells, i.e., that the single center of the two balls (the one containing the cell and the one contained in it) is actually the same point that HCT uses for sampling. This is stronger than just assuming that such centers of the two balls exist, not necessarily at the same points where we sample (which is the HOO assumption). Therefore, this is in contrast with HOO, which samples any point from the cell. In fact, it is straightforward to modify HOO to only sample a representative point in each cell and only require the local-smoothness assumption. In our analysis and algorithm, we use this modified version of HOO, thereby profiting from this weaker assumption.

Prior work [9, 4, 11, 2, 12] often defined some 'dimension' d of the near-optimal space of f measured according to the (semi-)metric ℓ. For example, the so-called near-optimality dimension [4] measures the size of the near-optimal space X_ε = {x ∈ X : f(x) > f(x⋆) − ε} in terms of packing numbers: For any c > 0 and ε₀ > 0, the (c, ε₀)-near-optimality dimension d of f with respect to ℓ is defined as

$$\inf\left\{d \in [0, \infty) : \exists C \text{ s.t. } \forall \varepsilon \le \varepsilon_0, \ \mathcal{N}(\mathcal{X}_{c\varepsilon}, \ell, \varepsilon) \le C\varepsilon^{-d}\right\}, \qquad (1)$$

where for any subset A ⊆ X, the packing number N(A, ℓ, ε) is the maximum number of disjoint ℓ-balls of radius ε contained in A.

2.3 Our assumption

Contrary to the previous approaches, we need only a single assumption.
We do not introduce any (semi-)metric and instead directly relate f to the hierarchical partitioning P, defined in Section 1. Let K be the maximum number of children cells (P_{h+1,j_k})_{1≤k≤K} per cell P_{h,i}. We remind the reader that given a global maximum x⋆ of f, i⋆_h denotes the index of the unique cell of depth h containing x⋆, i.e., such that x⋆ ∈ P_{h,i⋆_h}. With this notation we can state our sole assumption on both the partitioning (P_{h,i}) and the function f.

Assumption 1. There exist ν > 0 and ρ ∈ (0, 1) such that

$$\forall h \ge 0, \ \forall x \in P_{h,i^\star_h}, \quad f(x) \ge f(x^\star) - \nu\rho^h.$$

The values (ν, ρ) define a lower bound on the possible drop of f near the optimum x⋆ according to the partitioning. The choice of the exponential rate νρ^h is made to cover a very large class of functions, as well as to relate to results from prior work. In particular, for a standard partitioning on R^p and any α, β > 0, any function f such that f(x) ∼_{x→x⋆} β||x − x⋆||^α fits this assumption. This is also the case for more complicated functions such as the one illustrated in Figure 1. An example of a function and a partitioning that do not satisfy this assumption is the function f : x ↦ 1/ln x with a standard partitioning of [0, 1), because the function decreases too fast around x⋆ = 0. As observed by Valko et al. [15], this assumption can be weakened to hold only for values of f that are η-close to f(x⋆), at the price of an η-dependent constant in the regret.

Let us note that the set of assumptions made by prior work (Section 2.2) can be reformulated using solely Assumption 1.
For example, for any f(x) ∼_{x→x⋆} β||x − x⋆||^α, one could consider the semi-metric ℓ(x, y) = β||x − y||^α, for which the corresponding near-optimality dimension defined by Equation 1 for a standard partitioning is d = 0. Yet we argue that our setting provides a more natural way to describe the complexity of the optimization problem for a given hierarchical partitioning. Indeed, existing algorithms that use a hierarchical partitioning of X, like HOO, do not use the full metric information but only the values ν and ρ, paired with the partitioning. Hence, the precise value of the metric impacts neither the algorithms' decisions nor their performance. What really matters is how the hierarchical partitioning of X fits f. Indeed, this fit is what we measure. To reinforce this argument, notice again that any function can be trivially optimized given a perfectly adapted partitioning, for instance one that associates x⋆ with one child of the root.

Also, the previous analyses tried to provide performance guarantees based only on the metric and f. However, since the metric is assumed to be such that the cells of the partitioning are well shaped, the large diversity of possible metrics vanishes. Choosing such a metric then comes down to choosing only ν, ρ, and a hierarchical decomposition of X. Another way of seeing this is to remark that previous works make one assumption on both the function and the metric, and another on both the metric and the partitioning. We underline that the metric is actually there just to create a link between the function and the partitioning. By discarding the metric, we merge the two assumptions into a single one and convert a topological problem into a combinatorial one, leading to an easier analysis.

To proceed, we define a new near-optimality dimension.
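Before stating it, note that the counting quantity it relies on, the number N_h(ε) of ε-near-optimal cells at depth h introduced in Section 1, can be approximated numerically for the standard partitioning of [0, 1]; the grid-based supremum inside each cell and the test function below are illustrative assumptions:

```python
def near_optimal_cells(f, f_star, h, eps, samples_per_cell=50):
    """Approximate N_h(eps): the number of depth-h cells of the standard
    binary partitioning of [0, 1) whose (approximate) supremum of f is
    at least f_star - eps."""
    num_cells = 2 ** h
    width = 1.0 / num_cells
    count = 0
    for i in range(num_cells):
        lo = i * width
        # Approximate sup_{x in cell} f(x) on a uniform grid inside the cell.
        sup_f = max(f(lo + width * (j + 0.5) / samples_per_cell)
                    for j in range(samples_per_cell))
        if sup_f >= f_star - eps:
            count += 1
    return count


# For f(x) = -|x - 0.5| and eps = 2 * nu * rho**h with nu = 1, rho = 1/2,
# only a constant number of cells per depth is near-optimal, suggesting d = 0.
f = lambda x: -abs(x - 0.5)
counts = [near_optimal_cells(f, 0.0, h, 2.0 * 0.5 ** h) for h in range(1, 8)]
```

A bounded sequence of counts across depths corresponds to C ρ^{-d'h} with d' = 0 in the definition that follows, whereas counts growing geometrically with h would indicate d > 0.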
For any ν > 0 and ρ ∈ (0, 1), the near-optimality dimension d(ν, ρ) of f with respect to the partitioning P is defined as follows.

Definition 1. The near-optimality dimension of f is

$$d(\nu, \rho) \stackrel{\text{def}}{=} \inf\left\{d' \in \mathbb{R}_+ : \exists C > 0, \ \forall h \ge 0, \ \mathcal{N}_h(2\nu\rho^h) \le C\rho^{-d'h}\right\},$$

where N_h(ε) is the number of cells P_{h,i} of depth h such that sup_{x∈P_{h,i}} f(x) ≥ f(x⋆) − ε.

The hierarchical decomposition of the space X is the only prior information available to the algorithm. The (new) near-optimality dimension is a measure of how well this partitioning is adapted to f. More precisely, it is a measure of the size of the near-optimal set, i.e., the cells such that sup_{x∈P_{h,i}} f(x) ≥ f(x⋆) − ε. Intuitively, this corresponds to the set of cells that any algorithm would have to sample in order to discover the optimum.

As an example, any f such that f(x) ∼_{x→x⋆} ||x − x⋆||^α, for any α > 0, has a zero near-optimality dimension with respect to the standard partitioning and an appropriate choice of ρ. As discussed by Valko et al. [15], any function such that the upper and lower envelopes of f near its maximum are of the same order has a near-optimality dimension of zero for a standard partitioning of [0, 1]. An example of a function with d > 0 for the standard partitioning is in Figure 1. Functions that behave differently in different dimensions also have d > 0 for the standard partitioning. Nonetheless, for some handcrafted partitioning, it is possible to have d = 0 even for those troublesome functions.

Under our new assumption and our new definition of the near-optimality dimension, one can prove the same regret bound for HOO as Bubeck et al.
[4], and the same can be done for other related algorithms.

3 The POO algorithm

3.1 Description of POO

The POO algorithm uses, as a subroutine, an optimizing algorithm that requires the knowledge of the function smoothness. We use HOO [4] as the base algorithm, but other algorithms, such as HCT [2], could be used as well. POO, with pseudocode in Algorithm 1, runs several HOO instances in parallel, hence the name parallel optimistic optimization. The number of base HOO instances and the other parameters are adapted to the budget of evaluations and are automatically decided on the fly.

Each instance of HOO requires two real numbers ν and ρ. Running HOO parametrized with a (ν, ρ) that is far from the optimal one (ν⋆, ρ⋆)³ would cause HOO to underperform. Surprisingly, our analysis of this suboptimality gap reveals that it does not decrease too fast as we stray away from (ν⋆, ρ⋆). This motivates the following observation: if we simultaneously run a slew of HOOs with different (ν, ρ)s, one of them is going to perform decently well. In fact, we show that to achieve good performance, we only require of the order of ln n HOO instances, where n is the current number of function evaluations. Notice that we do not need to know the total number of rounds in advance, which hints that we can hope for a naturally anytime algorithm.

The strategy of POO is quite simple: it consists of running N instances of HOO in parallel, all launched with different (ν, ρ)s. At the end of the whole process, POO selects the instance s⋆ which performed best and returns one of the points selected by this instance, chosen uniformly at random. Note that just using a doubling trick in HOO with increasing values of ρ and ν is not enough to guarantee a good performance. Indeed, it is important to keep track of all HOO instances; otherwise, the regret rate would suffer way too much from using a value of ρ that is too far from the optimal one.

Algorithm 1 POO
Parameters: K, P = {P_{h,i}}
Optional parameters: ρmax, νmax
Initialization:
  Dmax ← ln K / ln(1/ρmax)
  n ← 0 {number of evaluations performed}
  N ← 1 {number of HOO instances}
  S ← {(νmax, ρmax)} {set of HOO instances}
while computational budget is available do
  while N ≤ (1/2) Dmax ln(n/(ln n)) do {ensure there are enough HOOs}
    for i ← 1, ..., N do {start new HOOs}
      s ← (νmax, ρmax^{2N/(2i+1)})
      S ← S ∪ {s}
      Perform n/N function evaluations with HOO(s)
      Update the average reward μ̂[s] of HOO(s)
    end for
    n ← 2n
    N ← 2N
  end while
  for s ∈ S do
    Perform a function evaluation with HOO(s)
    Update the average reward μ̂[s] of HOO(s)
  end for
  n ← n + N
end while
s⋆ ← argmax_{s∈S} μ̂[s]
Output: A random point evaluated by HOO(s⋆)

³the parameters (ν, ρ) satisfying Assumption 1 for which d(ν, ρ) is the smallest

For clarity, the pseudocode of Algorithm 1 takes ρmax and νmax as parameters, but in Appendix C we show how to set ρmax and νmax automatically as functions of the number of evaluations, i.e., ρmax(n), νmax(n). Furthermore, in Appendix D, we explain how to share information between the HOO instances, which improves the empirical performance dramatically.

Since POO is anytime, the number of instances N(n) is time-dependent and does not need to be known in advance. In fact, N(n) is increased alongside the execution of the algorithm.
More precisely, we want to ensure that

$$N(n) \ge \tfrac{1}{2}\, D_{\max} \ln\left(n/\ln n\right), \quad \text{where } D_{\max} \stackrel{\text{def}}{=} (\ln K)/\ln(1/\rho_{\max}).$$

To keep the set of different (ν, ρ)s well distributed, the number of HOOs is not increased one by one but is instead doubled when needed. Moreover, we also require that the HOOs run in parallel perform the same number of function evaluations. Consequently, when we start running new instances, we first bring them on par with the already existing ones in terms of the number of evaluations. Finally, as our analysis reveals, a good choice of parameters (ρ_i) is not a uniform grid on [0, 1]. Instead, as suggested by our analysis, we require that 1/ln(1/ρ_i) forms a uniform grid on [0, 1/ln(1/ρmax)]. As a consequence, we add HOO instances in batches such that ρ_i = ρmax^{N/i}.

3.2 Upper bound on POO's regret

POO does not require the knowledge of a (ν, ρ) verifying Assumption 1,⁴ and yet we prove that it achieves a performance close⁵ to the one obtained by HOO using the best parameters (ν⋆, ρ⋆). This result solves the open question of Valko et al. [15] of whether the stochastic optimization of f with unknown parameters (ν, ρ) is possible when d > 0 for the standard partitioning.

Theorem 1. Let R_n be the simple regret of POO at step n. For any (ν, ρ) verifying Assumption 1 such that ν ≤ νmax and ρ ≤ ρmax, there exists κ such that for all n,

$$\mathbb{E}[R_n] \le \kappa \cdot \left(\left(\ln^2 n\right)/n\right)^{1/(d(\nu,\rho)+2)}.$$

Moreover, κ = α · Dmax (νmax/ν⋆)^{Dmax}, where α is a constant independent of ρmax and νmax.

We prove Theorem 1 in Appendices A and B.
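The scheduling arithmetic of Section 3.1 can be sketched as follows; this reproduces only the parameter grid ρ_i = ρmax^{N/i} and the invariant on N(n), not the bandit logic itself:

```python
import math


def rho_grid(rho_max, num_instances):
    """The rho values for the N running HOO instances: rho_i = rho_max**(N/i),
    chosen so that 1/ln(1/rho_i) is a uniform grid on [0, 1/ln(1/rho_max)]."""
    n = num_instances
    return [rho_max ** (n / i) for i in range(1, n + 1)]


def enough_instances(n_evals, num_instances, rho_max, num_children=2):
    """Check the invariant N(n) >= (1/2) * D_max * ln(n / ln n), with
    D_max = ln(K) / ln(1/rho_max) and K the number of children per cell."""
    d_max = math.log(num_children) / math.log(1.0 / rho_max)
    return num_instances >= 0.5 * d_max * math.log(n_evals / math.log(n_evals))


grid = rho_grid(0.9, 8)
```

The grid is increasing in i, ends exactly at ρmax, and its images under ρ ↦ 1/ln(1/ρ) are equally spaced, which is the distribution the analysis asks for.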
Notice that Theorem 1 holds for any ν ≤ νmax and ρ ≤ ρmax, and in particular for the parameters (ν⋆, ρ⋆) for which d(ν, ρ) is minimal, as long as ν⋆ ≤ νmax and ρ⋆ ≤ ρmax. In Appendix C, we show how to make ρmax and νmax optional. To give some intuition on Dmax, it is easy to prove that it is the attainable upper bound on the near-optimality dimension of functions verifying Assumption 1 with ρ ≤ ρmax. Moreover, any function on [0, 1]^p that is Lipschitz for the Euclidean metric has (ln K)/ln(1/ρ) = p for a standard partitioning.

POO's performance should be compared to the simple regret of HOO run with the best parameters ν⋆ and ρ⋆, which is of order

$$O\left(\left((\ln n)/n\right)^{1/(d(\nu^\star,\rho^\star)+2)}\right).$$

Thus, POO's performance is only a factor of O((ln n)^{1/(d(ν⋆,ρ⋆)+2)}) away from the optimally fitted HOO. Furthermore, our regret bound for POO is slightly better than the known regret bound for StoSOO [15] in the case when d(ν, ρ) = 0 for the same partitioning, i.e., E[R_n] = O(ln n/√n). With our algorithm and analysis, we generalize this bound to any value of d ≥ 0.

Note that we only give a simple regret bound for POO, whereas HOO ensures a bound on both the cumulative and the simple regret.⁶ Notice that since POO runs several HOOs with non-optimal values of the (ν, ρ) parameters, this algorithm explores much more than the optimally fitted HOO, which dramatically impacts the cumulative regret.
As a consequence, our result applies to the simple regret only.

4 note that several values of those parameters may be valid for the same function
5 up to a logarithmic term
6 in fact, the bound on the simple regret is a direct consequence of the bound on the cumulative regret [3], at the price of an additional √ln n in the simple regret

Figure 2: Regret of POO and HOO run for different values of ρ.

4 Experiments

We ran experiments on the function plotted in Figure 1 with HOO algorithms for different values of ρ and the POO7 algorithm with ρmax = 0.9. This function, as described in Section 1, has upper and lower envelopes that are not of the same order and therefore has d > 0 for a standard partitioning. In Figure 2, we show the simple regret of the algorithms as a function of the number of evaluations. In the figure on the left, we plot the simple regret after 500 evaluations. In the right one, we plot the regret after 5000 evaluations on a log-log scale, in order to see the trend better. The HOO algorithms return a random point chosen uniformly among those evaluated. POO does the same for the best empirical instance of HOO. We compare the algorithms according to the expected simple regret, which is the difference between the optimum and the expected value of the function at the point they return. We compute it as the average of the values of the function at all evaluated points. While we did not investigate possibly different heuristics, we believe that returning the deepest evaluated point would give a better empirical performance.

As expected, the HOO algorithms using values of ρ that are too low do not explore enough and quickly become stuck in a local optimum. This is the case for both UCT (HOO run with ρ = 0) and HOO run with ρ = 0.3. The HOO algorithms using a ρ that is too high waste their budget on exploring too much.
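The regret estimate above, the optimum minus the average function value over all evaluated points (since the returned point is drawn uniformly among them), can be sketched as follows. This is a minimal illustration with a toy 1-d objective (not the function of Figure 1) and uniformly random evaluation points standing in for an HOO instance; all names are ours.

```python
import math
import random

def f(x: float) -> float:
    # Toy 1-d objective with values in [0, 1]; a stand-in, not the function of Figure 1.
    return math.sin(13.0 * x) * math.sin(27.0 * x) / 2.0 + 0.5

def empirical_simple_regret(f_max: float, evaluated_points: list) -> float:
    """Expected simple regret of an algorithm returning a point chosen
    uniformly among those it evaluated: f_max minus the average value."""
    return f_max - sum(f(x) for x in evaluated_points) / len(evaluated_points)

# Approximate the optimum of the toy function on a fine grid.
f_max = max(f(i / 100000.0) for i in range(100001))

random.seed(0)
points = [random.random() for _ in range(500)]  # stand-in for HOO's evaluations
regret = empirical_simple_regret(f_max, points)
```

With a real HOO instance, the evaluated points concentrate around the optimum as the budget grows, so this average-based estimate decreases with n; the plots in Figure 2 are obtained this way.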
This way, we empirically confirmed that the performance of the HOO algorithm is greatly impacted by the choice of the ρ parameter for the function we considered. In particular, at T = 500, the empirical regret of HOO with ρ = 0.66 was half the regret of UCT.

In our experiments, HOO with ρ = 0.66 performed the best, which is a bit lower than what the theory would suggest, since ρ⋆ = 1/√2 ≈ 0.7. The performance of HOO using this parameter is almost matched by POO. This is surprising, considering that POO was simultaneously running 100 different HOOs. It shows that carefully sharing information between the instances of HOO, as described and justified in Appendix D, has a major impact on empirical performance. Indeed, among the 100 HOO instances, only two (on average) actually needed a fresh function evaluation; the remaining 98 could reuse the evaluations performed by another HOO instance.

5 Conclusion

We introduced POO for the global optimization of stochastic functions with unknown smoothness and showed that it competes with the best known optimization algorithms that know this smoothness. This result extends the previous work of Valko et al. [15], which is only able to deal with a near-optimality dimension d = 0. POO is provably able to deal with a large class of functions for which d ≥ 0 for a standard partitioning.
Furthermore, we gave new insights into several assumptions required by prior work and provided a more natural measure of the complexity of optimizing a function given a hierarchical partitioning of the space, without relying on any (semi-)metric.

Acknowledgements The research presented in this paper was supported by French Ministry of Higher Education and Research, Nord-Pas-de-Calais Regional Council, a doctoral grant of École Normale Supérieure in Paris, Inria and Carnegie Mellon University associated-team project EduBand, and French National Research Agency project ExTra-Learn (n. ANR-14-CE24-0010-01).

7 code available at https://sequel.lille.inria.fr/Software/POO

References

[1] Peter Auer, Nicolò Cesa-Bianchi, and Paul Fischer. Finite-time Analysis of the Multiarmed Bandit Problem. Machine Learning, 47(2-3):235–256, 2002.

[2] Mohammad Gheshlaghi Azar, Alessandro Lazaric, and Emma Brunskill. Online Stochastic Optimization under Correlated Bandit Feedback. In International Conference on Machine Learning, 2014.

[3] Sébastien Bubeck, Rémi Munos, and Gilles Stoltz. Pure Exploration in Finitely-Armed and Continuously-Armed Bandits. Theoretical Computer Science, 412:1832–1852, 2011.

[4] Sébastien Bubeck, Rémi Munos, Gilles Stoltz, and Csaba Szepesvári. X-armed Bandits. Journal of Machine Learning Research, 12:1587–1627, 2011.

[5] Sébastien Bubeck, Gilles Stoltz, and Jia Yuan Yu. Lipschitz Bandits without the Lipschitz Constant. In Algorithmic Learning Theory, 2011.

[6] Adam D. Bull. Adaptive-treed bandits.
Bernoulli, 21(4):2289–2307, 2015.

[7] Richard Combes and Alexandre Proutière. Unimodal Bandits without Smoothness. ArXiv e-prints: http://arxiv.org/abs/1406.7447, 2015.

[8] Pierre-Arnaud Coquelin and Rémi Munos. Bandit Algorithms for Tree Search. In Uncertainty in Artificial Intelligence, 2007.

[9] Robert Kleinberg, Alexander Slivkins, and Eli Upfal. Multi-armed Bandit Problems in Metric Spaces. In Symposium on Theory Of Computing, 2008.

[10] Levente Kocsis and Csaba Szepesvári. Bandit based Monte-Carlo Planning. In European Conference on Machine Learning, 2006.

[11] Rémi Munos. Optimistic Optimization of Deterministic Functions without the Knowledge of its Smoothness. In Neural Information Processing Systems, 2011.

[12] Rémi Munos. From Bandits to Monte-Carlo Tree Search: The Optimistic Principle Applied to Optimization and Planning. Foundations and Trends in Machine Learning, 7(1):1–130, 2014.

[13] Philippe Preux, Rémi Munos, and Michal Valko. Bandits Attack Function Optimization. In Congress on Evolutionary Computation, 2014.

[14] Aleksandrs Slivkins. Multi-armed Bandits on Implicit Metric Spaces. In Neural Information Processing Systems, 2011.

[15] Michal Valko, Alexandra Carpentier, and Rémi Munos. Stochastic Simultaneous Optimistic Optimization. In International Conference on Machine Learning, 2013.