{"title": "Safe Exploration for Interactive Machine Learning", "book": "Advances in Neural Information Processing Systems", "page_first": 2891, "page_last": 2901, "abstract": "In interactive machine learning (IML), we iteratively make decisions and obtain noisy observations of an unknown function. While IML methods, e.g., Bayesian optimization and active learning, have been successful in applications, on real-world systems they must provably avoid unsafe decisions. To this end, safe IML algorithms must carefully learn about a priori unknown constraints without making unsafe decisions. Existing algorithms for this problem learn about the safety of all decisions to ensure convergence. This is sample-inefficient, as it explores decisions that are not relevant for the original IML objective. In this paper, we introduce a novel framework that renders any existing unsafe IML algorithm safe. Our method works as an add-on that takes suggested decisions as input and exploits regularity assumptions in terms of a Gaussian process prior in order to efficiently learn about their safety. As a result, we only explore the safe set when necessary for the IML problem. We apply our framework to safe Bayesian optimization and to safe exploration in deterministic Markov Decision Processes (MDP), which have been analyzed separately before. Our method outperforms other algorithms empirically.", "full_text": "Safe Exploration for Interactive Machine Learning\n\nMatteo Turchetta\n\nDept. of Computer Science\n\nETH Zurich\n\nmatteotu@inf.ethz.ch\n\nbefelix@inf.ethz.ch\n\nFelix Berkenkamp\n\nDept. of Computer Science\n\nETH Zurich\n\nAndreas Krause\n\nDept. of Computer Science\n\nETH Zurich\n\nkrausea@ethz.ch\n\nAbstract\n\nIn Interactive Machine Learning (IML), we iteratively make decisions and obtain\nnoisy observations of an unknown function. 
While IML methods, e.g., Bayesian\noptimization and active learning, have been successful in applications, on real-\nworld systems they must provably avoid unsafe decisions. To this end, safe IML\nalgorithms must carefully learn about a priori unknown constraints without making\nunsafe decisions. Existing algorithms for this problem learn about the safety of all\ndecisions to ensure convergence. This is sample-inef\ufb01cient, as it explores decisions\nthat are not relevant for the original IML objective. In this paper, we introduce a\nnovel framework that renders any existing unsafe IML algorithm safe. Our method\nworks as an add-on that takes suggested decisions as input and exploits regularity\nassumptions in terms of a Gaussian process prior in order to ef\ufb01ciently learn about\ntheir safety. As a result, we only explore the safe set when necessary for the\nIML problem. We apply our framework to safe Bayesian optimization and to safe\nexploration in deterministic Markov Decision Processes (MDP), which have been\nanalyzed separately before. Our method outperforms other algorithms empirically.\n\n1\n\nIntroduction\n\nInteractive Machine Learning (IML) problems, where an autonomous agent actively queries an\nunknown function to optimize it, learn it, or otherwise act based on the observations made, are\npervasive in science and engineering. For example, Bayesian optimization (BO) (Mockus et al., 1978)\nis an established paradigm to optimize unknown functions and has been applied to diverse tasks\nsuch as optimizing robotic controllers (Marco et al., 2017) and hyperparameter tuning in machine\nlearning (Snoek et al., 2012). 
Similarly, Markov Decision Processes (MDPs) (Puterman, 2014) model\nsequential decision making problems with long term consequences and are applied to a wide range of\nproblems including \ufb01nance and management of water resources (White, 1993).\nHowever, real-world applications are subject to safety constraints, which cannot be violated during\nthe learning process. Since the dependence of the safety constraints on the decisions is unknown a\npriori, existing algorithms are not applicable. To optimize the objective without violating the safety\nconstraints, we must carefully explore the space and ensure that decisions are safe before evaluating\nthem. In this paper, we propose a data-ef\ufb01cient algorithm for safety-constrained IML problems.\nRelated work One class of IML algorithms that consider safety are those for BO with Gaussian\nProcess (GP) (Rasmussen, 2004) models of the objective. While classical BO algorithms focus\non ef\ufb01cient optimization (Srinivas et al., 2010; Thompson, 1933; Wang and Jegelka, 2017), these\nmethods have been extended to incorporate safety constraints. For example, Gelbart et al. (2014)\npresent a variant of expected improvement with unknown constraints, while Hern\u00e1ndez-Lobato et al.\n(2016) extend an information-theoretic BO criterion to handle black-box constraints. However, these\nmethods only consider \ufb01nding a safe solution, but allow unsafe evaluations during the optimization\nprocess. Wu et al. (2016) de\ufb01ne safety as a constraint on the cumulative reward, while Schreiter et al.\n(2015) consider the safe exploration task on its own. The algorithms SAFEOPT (Sui et al., 2015;\n\n33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.\n\n\f(a) Set illustration.\n\n(b) STAGEOPT.\n\n(c) GP-UCB + GOOSE (ours)\n\nFigure 1: Existing algorithms for safe IML aim to expand the safe set \u00afSp (green shaded) in Fig. 
1a by\nevaluating decisions on the boundary of the pessimistic safe set (dark green shaded). This can be\ninef\ufb01cient: to solve the safe BO problem in Fig. 1b, STAGEOPT evaluates decisions (green crosses,\nhistogram) close to the safety constraint q(\u00b7) > 0 (black dashed), even though the maximum (black\ncross) is known to be safe. In contrast, our method uses decisions x?\ni from existing unsafe IML\nalgorithms (oracle) within the optimistic safe set \u00afSo,\u270f\n(blue shaded, Fig. 1a). It can then use any\nheuristic to select learning targets At (blue cross) that are informative about the safety of x?\ni and\nlearns about them ef\ufb01ciently within Gt \u2713 \u00afSp\nt (blue shaded region). Since this method only learns\nabout the safe set when necessary, we evaluate more close-to-optimal decisions in Fig. 1c.\n\nt\n\nBerkenkamp et al., 2016) and STAGEOPT (Sui et al., 2018) both guarantee safety of the exploration\nand near-optimality of the solution. However, they treat the exploration of the safe set as a proxy\nobjective, which leads to sample-inef\ufb01cient exploration as they explore the entire safe set, even if this\nis not necessary for the optimization task, see the evaluation counts (green) in Fig. 1b for an example.\nSafety has also been investigated in IML problems in directed graphs, where decisions have long-term\neffects in terms of safety. Moldovan and Abbeel (2012) address this problem in the context of discrete\nMDPs by optimizing over ergodic policies, i.e., policies that are able to return to a known set of\nsafe states with high probability. However, they do not provide exploration guarantees. Biyik et al.\n(2019) study the ergodic exploration problem in discrete and deterministic MDPs with unknown\ndynamics and noiseless observations. Turchetta et al. 
(2016) investigate the ergodic exploration\nproblem subject to unknown external safety constraints under the assumption of known dynamics by\nimposing additional ergodicity constraints on the SAFEOPT algorithm. Wachi et al. (2018) compute\napproximately optimal policies in the same context but do not actively learn about the constraint. In\ncontinuous domains, safety has been investigated by, for example, Akametalu et al. (2014); Koller\net al. (2018). While these methods provide safety guarantees, current exploration guarantees rely on\nuncertainty sampling on a discretized domain (Berkenkamp et al., 2017). Thus, their analysis can\nbene\ufb01t from the more ef\ufb01cient, goal-oriented exploration introduced in this paper.\nContribution\nIn this paper, we introduce the Goal Oriented Safe Exploration algorithm, GOOSE;\na novel framework that works as an add-on to existing IML algorithms and renders them safe. Given\na possibly unsafe suggestion by an IML algorithm, it safely and ef\ufb01ciently learns about the safety of\nthis decision by exploiting continuity properties of the constraints in terms of a GP prior. Thus, unlike\nprevious work, GOOSE only learns about the safety of decisions relevant for the IML problem. We\nanalyze our algorithm and prove that, with high probability, it only takes safe actions while learning\nabout the safety of the suggested decisions. On safe BO problems, our algorithm leads to a bound on\na natural notion of safe cumulative regret when combined with a no-regret BO algorithm. Similarly,\nwe use our algorithm for the safe exploration in deterministic MDPs. Our experiments show that\nGOOSE is signi\ufb01cantly more data-ef\ufb01cient than existing methods in both settings.\n\n2 Problem Statement and Background\n\nIn IML, an agent iteratively makes decisions and observes their consequences, which it can use\nto make better decisions over time. 
Formally, at iteration i, the agent O_i uses the previous i − 1 observations to make a new decision x*_i = O_i(D_i) from a finite decision space D_i ⊆ D ⊆ R^d. It then observes a noisy measurement of the unknown objective function f : D → R and uses the new information in the next iteration. This is illustrated in the top-left corner (blue shaded) of Fig. 2. Depending on the goal of the agent, this formulation captures a broad class of problems, and many solutions to these problems have been proposed. For example, in Bayesian optimization the agent aims to find the global optimum max_x f(x) (Mockus et al., 1978). Similarly, in active learning (Schreiter et al., 2015), one aims to learn about the function f. In the general case, the decision process may be stateful, e.g., as in dynamical systems, so that the decisions D_i available to the agent depend on those made in the past. This dependency among decisions can be modeled with a directed graph, where nodes represent decisions and an edge connects node x to node x' if the agent is allowed to evaluate x' given that it evaluated x at the previous decision step. In the BO setting, the graph is fully connected and any decision may be evaluated, while in a deterministic MDP decisions are states and edges represent transitions (Turchetta et al., 2016).

Figure 2: Overview of GOOSE. If the oracle's suggestion x*_i is safe, it can be evaluated. This is equivalent to the standard unsafe IML pipeline (top-left, blue shaded). Otherwise, GOOSE learns about the safety of x*_i by actively querying observations at decisions x_t. Any provably unsafe decision is removed from the decision space and we query a new x*_i without providing a new observation of f(x*_i).

In this paper, we consider IML problems with safety constraints, which frequently occur in real-world settings. 
The safety constraint can be written as q(x) ≥ 0 for some function q. Any decision x*_i for i ≥ 1 evaluated by the agent must be safe. For example, Berkenkamp et al. (2016) optimize the control policy of a flying robot and must evaluate only policies that induce trajectories satisfying given constraints. However, it is unknown a priori which policy parameters induce safe trajectories. Thus, we do not know which decisions are safe in advance, that is, q : D → R is a priori unknown. However, we can learn about the safety constraint by selecting decisions x_t and obtaining noisy observations of q(x_t). We denote queries to f with x*_i and queries to q with x_t. As a result, we face a two-tiered safe exploration problem: on one hand we have to safely learn about the constraint q to determine which decisions are safe, while on the other hand we want to learn about f to solve the IML problem. The goal is to minimize the number of queries x_t required to solve the IML problem.

Regularity. Without further assumptions, it is impossible to evaluate decisions without violating the safety constraint q (Sui et al., 2015). For example, without an initial set of decisions that is known to be safe a priori, we may fail at the first step. Moreover, if the constraint does not exhibit any regularity, we cannot infer the safety of decisions without evaluating them first. We assume that a small initial safe set of decisions, S_0, is available, which may come from domain knowledge. Additionally, we assume that D is endowed with a positive definite kernel function, k(·,·), and that the safety constraint q has bounded norm in the induced Reproducing Kernel Hilbert Space (RKHS) (Schölkopf and Smola, 2002), ||q||_k ≤ B_q. 
The RKHS norm measures the smoothness of the safety feature with respect to the kernel, so that q is L-Lipschitz continuous with respect to the kernel metric d(x, x') = sqrt(k(x, x) − 2k(x, x') + k(x', x')) with L = B_q (Steinwart and Christmann, 2008, (4.21)).

This assumption allows us to model the safety constraint function q with a GP (Rasmussen, 2004). A GP is a distribution over functions parameterized by a mean function μ(·) and a covariance function k(·,·). We set μ(x) = 0 for all x ∈ D without loss of generality. The covariance function encodes our assumptions about the safety constraint. Given t observations of the constraint y_t = (q(x_1) + η_1, ..., q(x_t) + η_t) at decisions D_t = {x_n}_{n=1..t}, where η_n ~ N(0, σ²) is zero-mean i.i.d. Gaussian noise, the posterior belief is distributed as a GP with mean, covariance, and variance

μ_t(x) = k_t(x)^T (K_t + σ²I)^{-1} y_t,
k_t(x, x') = k(x, x') − k_t(x)^T (K_t + σ²I)^{-1} k_t(x'),
σ_t²(x) = k_t(x, x),

respectively. Here, k_t(x) = (k(x_1, x), ..., k(x_t, x)), K_t is the positive definite kernel matrix [k(x, x')]_{x,x' ∈ D_t}, and I ∈ R^{t×t} denotes the identity matrix.

Safe decisions. The previous regularity assumptions can be used to determine which decisions are safe to evaluate. Our classification of the decision space is related to the one by Turchetta et al. (2016), which combines non-increasing and reliable confidence intervals on q with a reachability analysis of the underlying graph structure for decisions. Based on a result by Chowdhury and Gopalan (2017), they use the posterior GP distribution to construct confidence bounds l_t(x) := max(l_{t-1}(x), μ_{t-1}(x) − β_t σ_{t-1}(x)) and u_t(x) := min(u_{t-1}(x), μ_{t-1}(x) + β_t σ_{t-1}(x)) on the function q. In particular, we have l_t(x) ≤ q(x) ≤ u_t(x) with high probability when the scaling factor β_t is chosen as in Theorem 1. 
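The posterior update above can be sketched in a few lines of NumPy. This is a minimal illustration, not the paper's implementation: the RBF kernel, the function name `gp_posterior`, and all hyperparameter values are illustrative assumptions.

```python
import numpy as np

def rbf(a, b, lengthscale=0.1, variance=1.0):
    # Squared-exponential kernel k(x, x'), an illustrative choice for 1-D inputs.
    d = a[:, None] - b[None, :]
    return variance * np.exp(-0.5 * (d / lengthscale) ** 2)

def gp_posterior(x_train, y_train, x_query, noise=0.01):
    """Posterior mean mu_t(x) and variance sigma_t^2(x) of the constraint q
    given t noisy observations, following the equations in Sec. 2."""
    K = rbf(x_train, x_train) + noise**2 * np.eye(len(x_train))   # K_t + sigma^2 I
    k_star = rbf(x_train, x_query)                                # k_t(x) per query
    mean = k_star.T @ np.linalg.solve(K, y_train)                 # mu_t(x)
    v = np.linalg.solve(K, k_star)
    var = rbf(x_query, x_query).diagonal() - np.sum(k_star * v, axis=0)  # sigma_t^2
    return mean, var
```

From `mean` and `var`, the confidence bounds are then mu ± beta_t * sqrt(var), intersected with the previous bounds to keep them monotone.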
Thus, any decision x with l_t(x) ≥ 0 satisfies the safety constraint q(x) ≥ 0 with high probability.

To analyze the exploration behavior of their algorithm, Turchetta et al. (2016) use the confidence intervals within the current safe set, starting from S_0, and the Lipschitz continuity of q to define S^p_t, the set of decisions that satisfy the constraint with high probability. We use a similar, albeit more efficient, definition in Sec. 3. In practice, one may use the confidence intervals directly. Moreover, in order to avoid exploring decisions that are instantaneously safe but that would force the agent to eventually evaluate unsafe ones due to the graph structure G, Turchetta et al. (2016) define S̄^p_t, the subset of safe and ergodic decisions, i.e., decisions that are safe to evaluate in the short and long term.

Previous exploration schemes. Given that only decisions in S̄^p_t are safe to evaluate, any safe IML algorithm faces an extended exploration-exploitation problem: it can either optimize decisions within S̄^p_t, or expand the set of safe decisions by evaluating decisions on its boundary. Existing solutions to the safe exploration problem in both discrete and continuous domains either do not provide theoretical exploration guarantees (Wachi et al., 2018) or treat the exploration of the safe set as a proxy objective for optimality. That is, the methods uniformly reduce uncertainty on the boundary of the safe set in Fig. 1a until the entire safe set is learned. Since learning about the entire safe set is often unnecessary for the IML algorithm, this procedure can be sample-inefficient. For example, in the safe BO problem in Fig. 1b with f = q, this exploration scheme leads to a large number of unnecessary evaluations on the boundary of the safe set.

3 Goal-oriented Safe Exploration (GOOSE)

In this section, we present our algorithm, GOOSE. We do not propose a new safe algorithm for a specific IML setting, but instead exploit that, for specific IML problems, high-performance unsafe algorithms already exist. We treat any such unsafe algorithm as an IML oracle O_i(S) which, given a domain S and i − 1 observations of f, suggests a new decision x*_i ∈ S, see Fig. 2 (blue shaded). GOOSE can extend any such unsafe IML algorithm to the safety-constrained setting. Thus, we effectively leave the problem of querying f to the oracle and only consider safety. Given an unsafe oracle decision x*_i, GOOSE only evaluates f(x*_i) if the decision x*_i is known to be safe. Otherwise, it safely and efficiently learns about q(x*_i) by collecting observations q(x_t). Eventually, it either learns that the decision x*_i is safe and allows the oracle to evaluate f(x*_i), or that x*_i cannot be guaranteed to be safe given an ε-accurate knowledge of the constraint, in which case the decision set of the oracle is restricted and a new decision is queried, see Fig. 2.

Previous approaches treat the expansion of the safe set as a proxy objective to provide completeness guarantees. Instead, GOOSE employs a goal-directed exploration scheme with a novel theoretical analysis that shifts the focus from greedily reducing the uncertainty inside the safe set to learning about the safety of decisions outside of it. This scheme retains the worst-case guarantees of existing methods, but is significantly more sample-efficient in practice. Moreover, GOOSE encompasses existing methods for this problem. We now describe the detailed steps of GOOSE in Alg. 
1 and 2.

Figure 3: Fig. 3a shows the pessimistic and optimistic constraint satisfaction operators, which use the confidence intervals on the constraint and its Lipschitz continuity to make inferences about the safety of decisions that have not yet been evaluated. Fig. 3b illustrates the long-term safety definition: while decisions in p_t(S) are myopically safe, decisions in P^1_t(S) are safe in the long term. This excludes x_4 and x_5, as no safe path from/to them exists.

Pessimistic and optimistic expansion. To effectively shift the focus from inside the safe set to outside of it, GOOSE must reason not only about the decisions that are currently known to be safe but also about those that could eventually be classified as safe in the future. In particular, it maintains two sets, which are an inner/outer approximation of the set of safe decisions that are reachable from S_0 and are based on a pessimistic/optimistic estimate of the constraint given the data, respectively. The pessimistic safe set contains the decisions that are safe with high probability and is necessary to guarantee safe exploration (Turchetta et al., 2016). It is defined in two steps: discarding the decisions that are not instantaneously safe, and discarding those that we cannot safely reach or return from (see Fig. 3b) and, thus, are not safe in the long term. To characterize it starting from a given set of safe decisions S, we define the pessimistic constraint satisfaction operator,

p_t(S) = {x ∈ D | ∃ z ∈ S : l_t(z) − L d(x, z) ≥ 0},    (1)

which uses the lower bound on the safety constraint of the decisions in S and the Lipschitz continuity of q to determine the decisions that instantaneously satisfy the constraint with high probability, see Fig. 3a. However, for a general graph G, decisions in p_t(S) may be unsafe in the long term, as in Fig. 
3b: no safe path to the decision x_5 exists, so that it cannot be safely reached. Similarly, if we were to evaluate x_4, the graph structure forces us to eventually evaluate x_3, which is not contained in p_t(S) and might be unsafe. That is, we cannot safely return from x_4. To exclude these decisions, we use the ergodicity operator introduced by Turchetta et al. (2016), which allows us to find those decisions that are pessimistically safe in the short and long term, P^1_t(S) = p_t(S) ∩ R_ergodic(p_t(S), S) (see Appendix A or (Turchetta et al., 2016) for the definition of R_ergodic). Alternating these operations n times, we obtain the n-step pessimistic expansion operator, P^n_t(S) = p_t(P^{n-1}_t(S)) ∩ R_ergodic(p_t(P^{n-1}_t(S)), S), which, after a finite number of steps, converges to its limiting set P̃_t(S) = lim_{n→∞} P^n_t(S).

The optimistic safe set excludes the decisions that are unsafe with high probability and makes the exploration efficient by restricting the decision space of the oracle. Similarly to the pessimistic one, it is defined in two steps. However, it uses the following optimistic constraint satisfaction operator,

o^ε_t(S) = {x ∈ D | ∃ z ∈ S : u_t(z) − L d(x, z) − ε ≥ 0}.    (2)

See Fig. 3a for a graphical intuition. The additional ε-uncertainty term in the optimistic operator accounts for the fact that we only have access to noisy measurements of the constraint and, therefore, we can only learn it up to a specified statistical accuracy. The definitions of the optimistic expansion operators O^{ε,n}_t(S) and Õ^ε_t(S) are analogous to the pessimistic case by substituting p_t with o^ε_t. 
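On a finite domain, the two constraint satisfaction operators in Eq. (1) and Eq. (2) can be sketched directly. This is an illustrative sketch only: it omits the ergodicity intersection with R_ergodic, and the function names, the dictionary representation of the bounds l_t and u_t, and the distance callable are assumptions for the example.

```python
def pessimistic_expand(domain, safe_set, l, L, dist):
    # Eq. (1): x is certified safe if some z in S gives l_t(z) - L*d(x, z) >= 0.
    return {x for x in domain
            if any(l[z] - L * dist(x, z) >= 0 for z in safe_set)}

def optimistic_expand(domain, safe_set, u, L, dist, eps):
    # Eq. (2): x could plausibly be safe if u_t(z) - L*d(x, z) - eps >= 0
    # for some z in S; the eps slack accounts for the statistical accuracy limit.
    return {x for x in domain
            if any(u[z] - L * dist(x, z) - eps >= 0 for z in safe_set)}
```

Since l_t(z) ≤ u_t(z), the pessimistic set is always contained in the optimistic one until the bounds are ε-tight.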
The sets P̃_t and Õ^ε_t indicate the largest sets of decisions that can be classified as safe in the short and long term assuming the constraint attains the worst/best possible value within S, given the observations available and, for the optimistic case, despite an ε uncertainty.

Optimistic oracle. The IML oracle O_i(S) suggests decisions x*_i ∈ S to evaluate within a given subset S of D. To make the oracle efficient, we restrict its decision space to decisions that could optimistically be safe in the long and short term. In particular, we define the optimistic safe set S̄^{o,ε}_t in Line 8 of Alg. 1 based on the optimistic expansion operator introduced above. The oracle uses this set to suggest a potentially unsafe, candidate decision x*_i = O_i(S̄^{o,ε}_t) in Line 4.

Safe evaluation. We determine safety of the suggestion x*_i similarly to Turchetta et al. (2016) by constructing the set S̄^p_t of decisions that are safe to evaluate. However, while Turchetta et al. (2016) use the one-step pessimistic expansion operator P^1_t in their definition, we use its limit set P̃_t in Line 7 of Alg. 1. While both operators eventually identify the same safe set, our definition allows for a more efficient expansion. For example, consider the case where the graph over the decision space G is a chain of length m and where, for all j = 1, ..., m, the lower bound on the safety of decision j − 1 guarantees the safety of decision j with high probability. In this case, Turchetta et al. (2016) require m − 1 iterations to fully expand the safe set, while our classification requires only one.

If we know that x*_i is safe to evaluate, i.e., x*_i ∈ S̄^p_t, then the oracle obtains a noisy observation of f(x*_i) in Line 10. Otherwise, GOOSE proceeds to safely learn about the safety of x*_i using a safe expansion strategy in Lines 5–8 that we outline in the following. This routine is repeated until we can either include x*_i in S̄^p_t, in which case we can safely evaluate f(x*_i), or remove it from the decision space S̄^{o,ε}_t and query the oracle for a new suggestion.

Algorithm 1 GOOSE
1: Inputs: Lipschitz constant L, seed S_0, graph G, oracle O, accuracy ε
2: S̄^p_0 ← S_0, S̄^{o,ε}_0 ← D, t ← 0, l_0(x) ← 0 for x ∈ S_0
3: for i = 1, 2, ... do
4:   x*_i ← O(S̄^{o,ε}_t)
5:   while x*_i ∉ S̄^p_t do
6:     SE(S̄^{o,ε}_t, S̄^p_t, G, x*_i)
7:     S̄^p_t ← P̃_t(S̄^p_{t-1})
8:     S̄^{o,ε}_t ← Õ^ε_t(S̄^p_{t-1})
9:     if x*_i ∉ S̄^{o,ε}_t then go to Line 4
10:  Evaluate f(x*_i) and update oracle

Algorithm 2 Safe Expansion (SE)
1: Inputs: S̄^{o,ε}_t, S̄^p_t, G, x*_i
2: W^ε_t ← {x ∈ S̄^p_t | u_t(x) − l_t(x) > ε}
3: A_t(p) ← {x ∈ S̄^{o,ε}_t \ p^0_t(S̄^p_t) | h(x) = p}
4: α* ← max α s.t. |G^ε_t(α)| > 0   // highest-priority targets in A_t with expanders
5: if optimization problem feasible then
6:   x_t ← argmax_{x ∈ G^ε_t(α*)} w_t(x)
7:   Update GP with y_t = q(x_t) + η_t, t ← t + 1

Safe expansion. If the oracle suggestion x*_i is not considered safe, x*_i ∉ S̄^p_t, GOOSE employs a goal-directed scheme to evaluate a safe decision x_t ∈ S̄^p_t that is informative about q(x*_i), see Fig. 1a. In practice, it is desirable to avoid learning about decisions beyond a certain accuracy ε, as the number of observations required to reduce the uncertainty grows exponentially with ε (Sui et al., 2018). 
Thus, in Line 2 we only learn about decisions in S̄^p_t whose safety values are not yet known ε-accurately, W^ε_t = {x ∈ S̄^p_t | u_t(x) − l_t(x) > ε}, where u_t(x) − l_t(x) is the width of the confidence interval at x. To decide which decision in W^ε_t to learn about, we first determine a set of learning targets outside the safe set (dark blue cross in Fig. 1a), and then learn about them efficiently within S̄^p_t. To quantify how useful a learning target x is for learning about q(x*_i), we use any given iteration-dependent heuristic h_t(x). We discuss particular choices later, but a large priority h(x) indicates a relevant learning target (dashed line, Fig. 1a). Since p^0_t(S̄^p_t) denotes the decisions that are known to satisfy the constraint with high probability and S̄^{o,ε}_t excludes the decisions that are unsafe with high probability, S̄^{o,ε}_t \ p^0_t(S̄^p_t) indicates the decisions whose safety we are uncertain about. We sort them according to their priority and let A_t(α) denote the subset of decisions with equal priority.

Ideally, we want to learn about the decisions with the highest priority. However, this may not be immediately possible by evaluating decisions within W^ε_t. Thus, we must identify the decisions with the highest priority that we can learn about starting from W^ε_t. Therefore, similarly to the definition of the optimistic safe set, we identify decisions x in W^ε_t whose plausible value q(x) is large enough that it could guarantee q(z) ≥ 0 for some z in A_t(α). However, in this case, we are only interested in decisions that can be instantly classified as safe (rather than eventually). Therefore, we focus on this set of potential immediate expanders, G^ε_t(α) = {x ∈ W^ε_t | ∃ z ∈ A_t(α) : u_t(x) − L d(x, z) ≥ 0}.

In Line 4 of Alg. 2 we select the priority level α* such that there exist uncertain, safe decisions in W^ε_t that could allow us to classify a decision in A_t(α*) as safe and thereby expand the current safe set S̄^p_t. Intuitively, we look for the highest-priority targets that can potentially be classified as safe by safely evaluating decisions that we have not already learned about to ε-accuracy. Given these learning targets A_t(α*) (blue cross, Fig. 1a), we evaluate the most uncertain decision in G^ε_t(α*) (blue shaded, Fig. 1a) in Line 6 and update the GP model with the corresponding observation of q(x_t) in Line 7. This uncertainty sampling is restricted to a small set of decisions close to the goal. This is different from methods without a heuristic, which select the most uncertain decision on the boundary of S̄^p (green shaded in Fig. 1a). In fact, our method is equivalent to the one by Turchetta et al. (2016) when an uninformative heuristic h(x) = 1 is used for all x. We iteratively select and evaluate decisions x_t until we either determine that x*_i is safe, in which case it is added to S̄^p, or we prove that we cannot safely learn about it for the given accuracy ε, in which case it is removed from S̄^{o,ε} and the oracle is queried with an updated decision space for a new suggestion.

To analyze our algorithm, we define the largest set that we can learn about as R̃_ε(S_0). This set contains all the decisions that we could certify as safe if we used a full-exploration scheme that learns the safety constraint q up to ε accuracy for all decisions inside the current safe set. This is a natural exploration target for our safe exploration problem (see Appendix A for a formal definition). 
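The target-selection step of Alg. 2 (Lines 4–6) can be sketched as follows. This is an illustrative sketch, not the paper's implementation: it assumes the sets W^ε_t and A_t(·) and the bounds u_t and the widths w_t have already been computed, and all names are hypothetical.

```python
def select_expander(W_eps, targets_by_priority, u, width, L, dist):
    """Among the highest-priority targets that some uncertain safe decision
    could certify, return the most uncertain such decision (uncertainty
    sampling restricted to the immediate expanders G_t^eps(alpha*))."""
    for alpha in sorted(targets_by_priority, reverse=True):
        # Immediate expanders for priority level alpha: safe, still-uncertain
        # decisions x whose optimistic value u_t(x) - L*d(x, z) could certify
        # some target z as safe.
        expanders = [x for x in W_eps
                     if any(u[x] - L * dist(x, z) >= 0
                            for z in targets_by_priority[alpha])]
        if expanders:
            return max(expanders, key=lambda x: width[x])
    return None  # infeasible: no target can be certified at accuracy eps
```

Returning `None` corresponds to the infeasible branch, after which the suggestion is removed from the oracle's decision space.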
We have the following main result, which holds for any heuristic:

Theorem 1. Assume that q(·) is L-Lipschitz continuous w.r.t. d(·,·) with ||q||_k ≤ B_q, σ-sub-Gaussian noise, S_0 ≠ ∅, q(x) ≥ 0 for all x ∈ S_0, and that, for any two decisions x, x' ∈ S_0, there is a path in the graph G connecting them within S_0. Let β_t^{1/2} = B_q + 4σ sqrt(γ_t + 1 + ln(1/δ)). Then, for any h_t : D → R, with probability at least 1 − δ, we have q(x) ≥ 0 for any x visited by GOOSE. Moreover, let γ_t denote the information capacity associated with the kernel k and let t* be the smallest integer such that t*/γ_{t*} ≥ C |R̃_0(S_0)| / ε², with C = 8/log(1 + σ^{-2}). Then there exists a t ≤ t* such that, with probability at least 1 − δ, R̃_ε(S_0) ⊆ S̄^{o,ε}_t ⊆ S̄^p_t ⊆ R̃_0(S_0).

Theorem 1 guarantees that GOOSE is safe with high probability. Moreover, for any priority function h in Alg. 2, it upper bounds the number of measurements that Alg. 1 requires to explore the largest safely reachable region R̃_ε(S_0). Note that GOOSE only incurs this upper bound if it is required by the IML oracle. In particular, the following is a direct consequence of Theorem 1:

Corollary 1. Under the assumptions of Theorem 1, let the IML oracle be deterministic given the observations. Then there exists a set S with R̃_ε(S_0) ⊆ S ⊆ R̃_0(S_0) so that x*_i = O(S) for all i ≥ 1.

That is, the oracle decisions x*_i that we end up evaluating are the same as those by an oracle that was given the safe set S in Corollary 1 from the beginning. This is true since the set S̄^{o,ε}_t converges to this set S. 
Since Theorem 1 bounds the number of safety evaluations by t*, Corollary 1 implies that, up to t* safety evaluations, GOOSE retains the properties (e.g., no-regret) of the IML oracle O over S.

Choice of heuristic. While our worst-case guarantees hold for any heuristic, the empirical performance of GOOSE depends on this choice. We propose to use the graph structure directly and additionally define a positive cost for each edge between two nodes. For a given edge cost, we define c(x, x*_i, S̄^{o,ε}_t) as the cost of the minimum-cost path from x to x*_i within the optimistic safe set S̄^{o,ε}_t, which is equal to ∞ if no path exists, and we derive the priority h(x) from this cost so that the node x with the lowest-cost path to x*_i has the highest priority. This reduces the design of a general heuristic to a more intuitive weight-assignment problem, where the edge costs determine the planned path for learning about x*_i (dashed line in Fig. 1a). One option for the edge cost is the inverse mutual information between x and the suggestion x*_i, so that the resulting paths contain nodes that are informative about x*_i. Alternatively, having successive nodes in the path close to each other under the metric d(·,·), so that they can be easily added to the safe set and eventually lead us to x*_i, can be desirable. Thus, increasing monotone functions of the metric d(·,·) can be effective edge costs.

4 Applications and Experiments

In this section, we introduce two safety-critical IML applications, discuss the consequences of Theorem 1 for these problems, and empirically compare GOOSE to state-of-the-art competing methods. In our experiments, we set β_t = 3 for all t ≥ 1 as suggested by Turchetta et al. (2016). This choice of β_t ensures safety in practice, but leads to more efficient exploration than the theoretical choice in Theorem 1 (Turchetta et al., 2016; Wachi et al., 2018). 
Moreover, since in practice it is hard
to estimate the Lipschitz constant of an unknown function, in our experiments we use the confidence
intervals to define the safe set and the expanders, as suggested by Berkenkamp et al. (2016).

4.1 Safe Bayesian optimization

In safe BO we want to optimize the unknown function $f$ subject to the unknown safety constraint
$q$, see Sec. 2. In this setting, we aim to find the best input over the largest set we can hope to
explore safely, $\tilde{R}_\epsilon(S_0)$. The performance of an agent is measured in terms of the $\epsilon$-safe regret
$\max_{x \in \tilde{R}_\epsilon(S_0)} f(x) - f(x_t)$ of not having evaluated the function at the optimum in $\tilde{R}_\epsilon(S_0)$.
We combine GOOSE with the unsafe GP-UCB (Srinivas et al., 2010) algorithm as an oracle. For
computational efficiency, we do not use a fully connected graph, but instead connect decisions only
to their immediate neighbors as measured by the kernel, and assign equal weight to each edge for the
heuristic $h$. We compare GOOSE to SAFEOPT (Sui et al., 2015) and STAGEOPT (Sui et al., 2018) in
terms of $\epsilon$-safe average regret. Both algorithms use safe exploration as a proxy objective, see Fig.
1.
We optimize samples from a GP with zero mean and Radial Basis Function (RBF) kernel with
variance 1.0, and lengthscale 0.1 and 0.4 for the one-dimensional and two-dimensional domain, respectively.

Table 1: Mars experiment performance normalized to SMDP in terms of samples to find the first
path, exploration cost and computation time per iteration.

          GOOSE     SEO
Sample    30.0 %    38.4 %
Cost      12.7 %    0.7 %
Time      37.8 %    518 %

Figure 4: Average normalized $\epsilon$-safe regret for the safe optimization of GP samples over 40 (d=1, left)
and 10 (d=2, right) samples. GOOSE only evaluates inputs that are relevant for the BO problem
and, therefore, it converges faster than its competitors.

Figure 5: Performance of GOOSE and SEO normalized to SMDP, as a function of the world size, in
terms of (a) cost of exploration, (b) samples to find the first path, and (c) computation time per iteration.

The observations are perturbed by i.i.d. Gaussian noise with σ = 0.01. For simplicity, we set the
objective and the constraint to be the same, f = q. Fig. 4 (left) shows the average regret as a function
of the number of evaluations k + t, averaged over 40 different samples from the GP described above
over a one-dimensional domain (200 points evenly distributed in [−1, 1]). Fig. 4 (right) shows similar
results averaged over 10 samples for a two-dimensional domain (25 × 25 uniform grid in [0, 1]²).
These results confirm the intuition from Fig. 1 that using safe exploration as a proxy objective reduces
the empirical performance of safe BO algorithms. The impact is more evident in the two-dimensional
case, where there are more points along the boundaries that are not relevant to the optimization and
that are evaluated for exploration purposes.

4.2 Safe shortest path in deterministic MDPs
The graph that we introduced in Sec.
2 can model states (nodes) and state transitions (edges) in
deterministic, discrete MDPs. Hence, GOOSE naturally extends to the goal-oriented safe exploration
problem in these models. We aim to find the minimum-cost safe path from a starting state $x^\dagger$ to a
goal state $x^\star$, without violating the unknown safety constraint $q$. At best, we can hope to find the
path within the largest safely learnable set $\tilde{R}_\epsilon(S_0)$ as in Theorem 1, with cost $c(x^\dagger, x^\star, \tilde{R}_\epsilon(S_0))$.
Algorithms  We compare GOOSE to SEO (Wachi et al., 2018) and SMDP (Turchetta et al.,
2016) in terms of samples required to discover the first path, total exploration cost and computation
cost on synthetic and real-world data. The SMDP algorithm cannot take goals into account and
serves as a benchmark for comparison. The SEO algorithm aims to safely learn a near-optimal
policy for any given cost function and can be adapted to the safe shortest path problem by setting
the cost to $c(x) = \|x - x^\star\|_1$. However, it cannot guarantee that a path to $x^\star$ is found, if one
exists. Since the goal $x^\star$ is fixed, GOOSE does not need an oracle. For the heuristic, we use an
optimistic estimate of the cost of the safe shortest path from $x^\dagger$ to $x^\star$ passing through $x$; that is,
$h_t(x) = \min_{x' \in \mathrm{Pred}(x)} c(x^\dagger, x', \bar{S}^p_t) + \kappa\, c(x, x^\star, \bar{S}^{o,\epsilon}_t)$. The first term is a conservative estimate of the
safe optimal cost from $x^\dagger$ to the best predecessor of $x$ in $G$, and the second term is an optimistic
estimate of the safe optimal cost from $x$ to $x^\star$, multiplied by $\kappa > 1$ to encourage early path discovery.
Here, we use the predecessor node because $c(x^\dagger, x, \bar{S}^p_t) = \infty$ for all $x$ not in $\bar{S}^p_t$. Notice that, if a safe path
exists, Theorem 1 guarantees that GOOSE eventually finds the shortest one.
Synthetic data  Similarly to the setting in Turchetta et al. (2016); Wachi et al. (2018), we construct
a two-dimensional grid world. At every location, the agent takes one of four actions: left, right, up
and down.
We use the state augmentation in Turchetta et al. (2016) to define a constraint over state
transitions. The constraint function is a sample from a GP with mean µ = 0.6 and RBF kernel with
lengthscale l = 2 and variance σ² = 1. If the agent takes an unsafe action, it ends up in a failure state;
otherwise, it moves to the desired adjacent state. We make the constraint independent of the direction
of motion, i.e., q(x, x') = q(x', x). We generate 800 worlds by sampling 100 different constraints for
square maps with sides of 20, 30, 40, ..., 90 tiles, and a source-target pair for each one.
We show the geometric mean of the performance of SEO and GOOSE relative to SMDP as a
function of the world size in Fig. 5. Fig. 5b shows that GOOSE needs a factor of 2.5 fewer samples
than SMDP. Fig. 5c shows that the overhead of computing the heuristic of GOOSE is negligible, while
the solution of the two MDPs¹ required by SEO is computationally intensive. Fig. 5a shows that SEO
outperforms GOOSE in terms of the cost of the exploration trajectory. This is expected, as SEO aims
to minimize it, while GOOSE optimizes sample-efficiency. However, it is easy to modify the
heuristic of GOOSE to account for the exploration cost by, for example, reducing the priority of a state
based on its distance from the current location of the agent. In conclusion, GOOSE leads to a drastic
improvement in performance with respect to the previously known safe exploration strategy with
exploration guarantees, SMDP. Moreover, it achieves similar or better performance than SEO while
providing exploration guarantees that SEO lacks.
Mars exploration  We simulate the exploration of Mars with a rover.
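The terrain-slope safety constraint used in this Mars experiment can be computed directly from an altitude map. The sketch below is a minimal illustration, assuming the altitudes are stored in a NumPy array with a fixed grid spacing; the function names and sizes are our own, not part of the experiment code:

```python
import numpy as np

def transition_steepness(H, step=10.0):
    # q(x, x') = |H(x) - H(x')| / d(x, x') for horizontally and
    # vertically adjacent cells of the altitude map H (grid spacing
    # `step`, in meters).
    east = np.abs(np.diff(H, axis=1)) / step    # transitions along rows
    south = np.abs(np.diff(H, axis=0)) / step   # transitions along columns
    return east, south

def safe_transitions(H, max_slope_deg=25.0, step=10.0):
    # A transition is considered safe when its steepness does not
    # exceed tan(25 deg), the conservative threshold for the rover.
    threshold = np.tan(np.radians(max_slope_deg))
    east, south = transition_steepness(H, step)
    return east <= threshold, south <= threshold
```

Since the steepness is symmetric in x and x', this constraint automatically satisfies q(x, x') = q(x', x), like the direction-independent constraint of the synthetic experiments.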
In this context, communication delays between the rover and the operator on Earth make autonomous exploration extremely
important, while the high degree of uncertainty about the environment requires the agent to consider
safety constraints. In our experiment, we consider the Mars Science Laboratory (MSL, 2007, Sec.
2.1.3), a rover deployed on Mars that can climb a maximum slope of 30°. We use Digital Terrain
Models of Mars available from McEwen et al. (2007).
We use a grid world similar to the one introduced above. The safety constraint is the absolute value
of the steepness of the slope between two locations: given two states x and x', the constraint over
the state transition is defined as q(x, x') = |H(x) − H(x')|/d(x, x'), where H(x), H(x') indicate the
altitudes at x and x', respectively, and d(x, x') is the distance between them. We conservatively set
the safety constraint to q(x, x') ≤ tan(25°). The step of the grid is 10 m. We use square maps
from 16 different locations on Mars with sides between 100 and 150 tiles. We generate 64 scenarios
by sampling 4 source-target pairs for each map. We model the steepness with a GP with Matérn
kernel with ν = 5/2. We set the hyperprior on the lengthscale and on the standard deviation to
Lognormal(30 m, 0.25 m²) and Lognormal(tan(10°), 0.04), respectively. These encode our
prior belief about the surface of Mars. Next, we take 1000 noisy measurements at random locations
from each map, which, in reality, could come from satellite images, to find a maximum a posteriori
estimate of the hyperparameters and fine-tune our prior to each site.
In Tab.
1, we show the geometric mean of the performance of SEO and GOOSE relative to SMDP.
The results confirm those of the synthetic experiments, but with larger changes in performance with
respect to the benchmark due to the increased size of the world.

5 Conclusion

We presented GOOSE, an add-on module that enables existing interactive machine learning algorithms to safely explore the decision space without violating a priori unknown safety constraints.
Our method is provably safe and learns about the safety of decisions suggested by existing, unsafe
algorithms. As a result, it is more data-efficient than previous safe exploration methods in practice.
Acknowledgments  This research was partially supported by the Max Planck ETH Center for Learning
Systems and by the European Research Council (ERC) under the European Union's Horizon 2020
research and innovation programme, grant agreement No 815943.

¹We use policy iteration. Policy evaluation is performed by solving a sparse linear system with SciPy (Virtanen
et al., 2019). At iteration t, we initialize policy iteration with the optimal policy from iteration t − 1.

References

Anayo K Akametalu, Jaime F Fisac, Jeremy H Gillula, Shahab Kaynama, Melanie N Zeilinger,
and Claire J Tomlin. Reachability-based safe learning with Gaussian processes. In Decision and
Control (CDC), 2014 IEEE 53rd Annual Conference on, pages 1424–1431. IEEE, 2014.

Felix Berkenkamp, Andreas Krause, and Angela P Schoellig. Bayesian optimization with safety
constraints: safe and automatic parameter tuning in robotics. arXiv preprint arXiv:1602.04450,
2016.

Felix Berkenkamp, Matteo Turchetta, Angela Schoellig, and Andreas Krause. Safe model-based
reinforcement learning with stability guarantees. In Advances in Neural Information Processing
Systems, pages 908–919, 2017.

Erdem Biyik, Jonathan Margoliash, Shahrouz R. Alimo, and Dorsa Sadigh.
Efficient and safe
exploration in deterministic Markov decision processes with unknown transition models. In
Proceedings of the American Control Conference (ACC), July 2019.

Sayak Ray Chowdhury and Aditya Gopalan. On kernelized multi-armed bandits. In Proceedings of
the 34th International Conference on Machine Learning, volume 70, pages 844–853, 2017.

Michael A Gelbart, Jasper Snoek, and Ryan P Adams. Bayesian optimization with unknown
constraints. In Proceedings of the Thirtieth Conference on Uncertainty in Artificial Intelligence,
pages 250–259. AUAI Press, 2014.

José Miguel Hernández-Lobato, Michael A Gelbart, Ryan P Adams, Matthew W Hoffman, and
Zoubin Ghahramani. A general framework for constrained Bayesian optimization using information-based search. The Journal of Machine Learning Research, pages 5549–5601, 2016.

Torsten Koller, Felix Berkenkamp, Matteo Turchetta, and Andreas Krause. Learning-based model
predictive control for safe exploration and reinforcement learning. In Proc. of the IEEE Conference
on Decision and Control (CDC), December 2018.

Alonso Marco, Felix Berkenkamp, Philipp Hennig, Angela P Schoellig, Andreas Krause, Stefan
Schaal, and Sebastian Trimpe. Virtual vs. real: Trading off simulations and physical experiments
in reinforcement learning with Bayesian optimization. In Robotics and Automation (ICRA), 2017
IEEE International Conference on, pages 1557–1563. IEEE, 2017.

Alfred S McEwen, Eric M Eliason, James W Bergstrom, Nathan T Bridges, Candice J Hansen,
W Alan Delamere, John A Grant, Virginia C Gulick, Kenneth E Herkenhoff, Laszlo Keszthelyi,
et al. Mars Reconnaissance Orbiter's High Resolution Imaging Science Experiment (HiRISE). Journal
of Geophysical Research: Planets, 112(E5), 2007.

Jonas Mockus, Vytautas Tiesis, and Antanas Zilinskas. The application of Bayesian methods for
seeking the extremum.
Towards Global Optimization, (117–129), 1978.

Teodor Mihai Moldovan and Pieter Abbeel. Safe exploration in Markov decision processes. In
Proceedings of the 29th International Conference on Machine Learning, ICML'12, pages 1451–1458, 2012.

MSL. MSL Landing Site Selection User's Guide to Engineering Constraints, 2007.
URL http://marsoweb.nas.nasa.gov/landingsites/msl/memoranda/MSL_Eng_User_
Guide_v4.5.1.pdf.

Martin L Puterman. Markov decision processes: discrete stochastic dynamic programming. John
Wiley & Sons, 2014.

Carl Edward Rasmussen. Gaussian processes in machine learning. In Advanced Lectures on Machine
Learning, pages 63–71. Springer, 2004.

Bernhard Schölkopf and Alexander J Smola. Learning with kernels: support vector machines,
regularization, optimization, and beyond. MIT Press, 2002.

Jens Schreiter, Duy Nguyen-Tuong, Mona Eberts, Bastian Bischoff, Heiner Markert, and Marc Toussaint. Safe exploration for active learning with Gaussian processes. In Joint European Conference
on Machine Learning and Knowledge Discovery in Databases, pages 133–149. Springer, 2015.

Jasper Snoek, Hugo Larochelle, and Ryan P Adams. Practical Bayesian optimization of machine
learning algorithms. In Advances in Neural Information Processing Systems, pages 2951–2959,
2012.

Niranjan Srinivas, Andreas Krause, Sham Kakade, and Matthias Seeger. Gaussian process optimization in the bandit setting: No regret and experimental design. In Proceedings of the 27th
International Conference on Machine Learning, pages 1015–1022, 2010.

Ingo Steinwart and Andreas Christmann. Support vector machines. Springer Science & Business
Media, 2008.

Yanan Sui, Alkis Gotovos, Joel Burdick, and Andreas Krause. Safe exploration for optimization with
Gaussian processes.
In International Conference on Machine Learning, pages 997–1005, 2015.

Yanan Sui, Vincent Zhuang, Joel W Burdick, and Yisong Yue. Stagewise safe Bayesian optimization
with Gaussian processes. arXiv preprint arXiv:1806.07555, 2018.

William R Thompson. On the likelihood that one unknown probability exceeds another in view of
the evidence of two samples. Biometrika, 25:285–294, 1933.

Matteo Turchetta, Felix Berkenkamp, and Andreas Krause. Safe exploration in finite Markov decision
processes with Gaussian processes. In Advances in Neural Information Processing Systems, pages
4312–4320, 2016.

Pauli Virtanen, Ralf Gommers, Travis E. Oliphant, Matt Haberland, Tyler Reddy, David Cournapeau,
Evgeni Burovski, Pearu Peterson, Warren Weckesser, Jonathan Bright, Stéfan J. van der Walt,
Matthew Brett, Joshua Wilson, K. Jarrod Millman, Nikolay Mayorov, Andrew R. J. Nelson, Eric
Jones, Robert Kern, Eric Larson, CJ Carey, İlhan Polat, Yu Feng, Eric W. Moore, Jake VanderPlas,
Denis Laxalde, Josef Perktold, Robert Cimrman, Ian Henriksen, E. A. Quintero, Charles R Harris,
Anne M. Archibald, Antônio H. Ribeiro, Fabian Pedregosa, Paul van Mulbregt, and SciPy 1.0
Contributors. SciPy 1.0: Fundamental Algorithms for Scientific Computing in Python. arXiv
e-prints, arXiv:1907.10121, July 2019.

Akifumi Wachi, Yanan Sui, Yisong Yue, and Masahiro Ono. Safe exploration and optimization
of constrained MDPs using Gaussian processes. In Association for the Advancement of Artificial
Intelligence (AAAI), 2018.

Zi Wang and Stefanie Jegelka. Max-value entropy search for efficient Bayesian optimization. In
Proceedings of the 34th International Conference on Machine Learning, volume 70, pages 3627–3635. JMLR.org, 2017.

Douglas J White. A survey of applications of Markov decision processes.
Journal of the Operational Research Society, pages 1073–1096, 1993.

Yifan Wu, Roshan Shariff, Tor Lattimore, and Csaba Szepesvári. Conservative bandits. In International Conference on Machine Learning, pages 1254–1262, 2016.