{"title": "Linear Stochastic Bandits Under Safety Constraints", "book": "Advances in Neural Information Processing Systems", "page_first": 9256, "page_last": 9266, "abstract": "Bandit algorithms have various application in safety-critical systems, where it is important to respect the system constraints that rely on the bandit's unknown parameters at every round. In this paper, we formulate a linear stochastic multi-armed bandit problem with safety constraints that depend (linearly) on an unknown parameter vector. As such, the learner is unable to identify all safe actions and must act conservatively in ensuring that her actions satisfy the safety constraint at all rounds (at least with high probability). For these bandits, we propose a new UCB-based algorithm called Safe-LUCB, which includes necessary modifications to respect safety constraints. The algorithm has two phases. During the pure exploration phase the learner chooses her actions at random from a restricted set of safe actions with the goal of learning a good approximation of the entire unknown safe set. Once this goal is achieved, the algorithm begins a safe exploration-exploitation phase where the learner gradually expands their estimate of the set of safe actions while controlling the growth of regret. We provide a general regret bound for the algorithm, as well as a problem dependent bound that is connected to the location of the optimal action within the safe set. 
We then propose a modified heuristic that exploits our problem-dependent analysis to improve the regret.", "full_text": "Linear Stochastic Bandits Under Safety Constraints\n\nSanae Amani\n\nUniversity of California, Santa Barbara\n\nsamanigeshnigani@ucsb.edu\n\nMahnoosh Alizadeh\n\nUniversity of California, Santa Barbara\n\nalizadeh@ucsb.edu\n\nChristos Thrampoulidis\n\nUniversity of California, Santa Barbara\n\ncthrampo@ucsb.edu\n\nAbstract\n\nBandit algorithms have various applications in safety-critical systems, where it is important to respect the system constraints that rely on the bandit's unknown parameters at every round. In this paper, we formulate a linear stochastic multi-armed bandit problem with safety constraints that depend (linearly) on an unknown parameter vector. As such, the learner is unable to identify all safe actions and must act conservatively to ensure that her actions satisfy the safety constraint at all rounds (at least with high probability). For these bandits, we propose a new UCB-based algorithm called Safe-LUCB, which includes necessary modifications to respect safety constraints. The algorithm has two phases. During the pure exploration phase the learner chooses her actions at random from a restricted set of safe actions with the goal of learning a good approximation of the entire unknown safe set. Once this goal is achieved, the algorithm begins a safe exploration-exploitation phase where the learner gradually expands her estimate of the set of safe actions while controlling the growth of regret. We provide a general regret bound for the algorithm, as well as a problem-dependent bound that is connected to the location of the optimal action within the safe set.
We then propose a modified heuristic that exploits our problem-dependent analysis to improve the regret.\n\n1 Introduction\n\nThe stochastic multi-armed bandit (MAB) problem is a sequential decision-making problem where, at each step of a T-period run, a learner plays one of k arms and observes a corresponding loss that is sampled independently from an underlying distribution with unknown parameters. The learner's goal is to minimize the pseudo-regret, i.e., the difference between the expected T-period loss incurred by the decision-making algorithm and the optimal loss if the unknown parameters were given. The linear stochastic bandit problem generalizes MAB to the setting where each arm is associated with a feature vector x and the expected loss of each arm is equal to the inner product of its feature vector x and an unknown parameter vector µ. There are several variants of linear stochastic bandits that consider a finite or infinite number of arms, as well as the case where the set of feature vectors changes over time. A detailed account of previous work in this area will be provided in Section 1.2.\nBandit algorithms have found many applications in systems that repeatedly deal with unknown stochastic environments (such as humans) and seek to optimize a long-term reward by simultaneously learning and exploiting the unknown environment (e.g., ad display optimization algorithms with unknown user preferences, path routing, ranking in search engines). They are also naturally relevant for many cyber-physical systems with humans in the loop (e.g., pricing end-use demand in societal-scale infrastructure systems such as power grids or transportation networks to minimize system costs given the limited number of user interactions possible).\n\n33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.\n\nHowever, existing bandit heuristics might not be directly applicable in these latter cases. 
One critical reason is the existence of safety guarantees that have to be met at every single round. For example, when managing demand to minimize costs in a power system, it is required that the operational constraints of the power grid are not violated in response to our actions (these can be formulated as linear constraints that depend on the demand). Thus, for such systems, it becomes important to develop new bandit algorithms that account for critical safety requirements.\nGiven the high level of uncertainty about the system parameters in the initial rounds, any such bandit algorithm will be initially highly constrained in terms of safe actions that can be chosen. However, as further samples are obtained and the algorithm becomes more confident about the value of the unknown parameters, it is intuitive that safe actions become easier to distinguish and it seems plausible that the effect of the system safety requirements on the growth of regret can be diminished.\nIn this paper, we formulate a variant of linear stochastic bandits where at each round t, the learner's choice of arm should also satisfy a safety constraint that is dependent on the unknown parameter vector µ. While the formulation presented is certainly an abstraction of the complications that might arise in the systems discussed above, we believe that it is a natural first step towards understanding and evaluating the effect of safety constraints on the performance of bandit heuristics.\nSpecifically, we assume that the learner's goal is twofold: 1) Minimize the T-period cumulative pseudo-regret; 2) Ensure that a linear side constraint of the form µ†Bx ≤ c is respected at every round during the T-period run of the algorithm, where B and c are known. 
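To make the setup concrete, here is a minimal simulation sketch of such an environment (all names and numbers below are illustrative assumptions, not from the paper): the learner observes only the noisy loss µ†x + η, never the constraint value µ†Bx.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical problem instance (illustrative choices of d, B, c, R).
d = 3
mu = rng.normal(size=d)          # unknown parameter vector
B = np.eye(d)                    # known constraint matrix
c = 1.0                          # known constraint level (c > 0)
R = 0.1                          # sub-Gaussian noise scale

def play(x):
    """Return the noisy loss of action x; the learner never observes mu'Bx."""
    return mu @ x + rng.normal(scale=R)

def is_safe(x):
    """Ground-truth safety check mu'Bx <= c (unavailable to the learner)."""
    return mu @ (B @ x) <= c
```

Since c > 0, the zero action is always safe, which is the anchor exploited by the conservative set Dw defined later.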
See Section 1.1 for details.\nGiven the learner's uncertainty about µ, the existence of this safety constraint effectively restricts the learner's choice of actions to what we will refer to as the safe decision set at each round t. To tackle this constraint, in Section 2, we present Safe-LUCB as a safe version of the standard linear UCB (LUCB) algorithm Dani et al. (2008); Abbasi-Yadkori et al. (2011); Rusmevichientong and Tsitsiklis (2010). In Section 3 we provide general regret bounds that characterize the effect of safety constraints on regret. We show that the regret of the modified algorithm is dependent on the parameter ∆ = c − µ†Bx∗, where x∗ denotes the optimal safe action given µ. When ∆ > 0 and is known to the learner, we show that the regret of Safe-LUCB is Õ(√T); thus, the effect of the system safety requirements on the growth of regret can be diminished (for large enough T). In Section 4, we also present a heuristic modification of Safe-LUCB that empirically approaches the same regret without a-priori knowledge of the value of ∆. On the other hand, when ∆ = 0, the regret of Safe-LUCB is Õ(T^{2/3}). Technical proofs and some further discussions are deferred to the appendix provided in the supplementary material.\nNotation. The Euclidean norm of a vector x is denoted by ‖x‖₂ and the spectral norm of a matrix M is denoted by ‖M‖. We denote the transpose of any column vector x by x†. Let A be a positive definite d × d matrix and v ∈ R^d. The weighted 2-norm of v with respect to A is defined by ‖v‖_A = √(v†Av). We denote the minimum and maximum eigenvalue of A by λmin(A) and λmax(A). The maximum of two numbers α, β is denoted α ∨ β. For a positive integer n, [n] denotes the set {1, 2, . . . , n}. 
Finally, we use standard Õ notation for big-O notation that ignores logarithmic factors.\n\n1.1 Safe linear stochastic bandit problem\nCost model. The learner is given a convex compact decision set D0 ⊂ R^d. At each round t, the learner chooses an action xt ∈ D0 which results in an observed loss ℓt that is linear in the unknown parameter µ with additive random noise ηt, i.e., ℓt := ct(xt) := µ†xt + ηt.\nSafety Constraint. The learning environment is subject to a side constraint that restricts the choice of actions by dividing D0 into a safe and an unsafe set. The learner is restricted to actions xt from the safe set Ds_0(µ). As the notation suggests, the safe set depends on the unknown parameter. Since µ is unknown, the learner is unable to identify the safe set and must act conservatively in ensuring that actions xt are feasible for all t. In this paper, we assume that Ds_0(µ) is defined via a linear constraint\n\nµ†Bxt ≤ c, (1)\n\nwhich needs to be satisfied by xt at all rounds t with high probability. Thus, Ds_0(µ) is defined as,\n\nDs_0(µ) := {x ∈ D0 : µ†Bx ≤ c}. (2)\n\nThe matrix B ∈ R^{d×d} and the positive constant c > 0 are known to the learner. However, after playing any action xt, the value µ†Bxt is not observed by the learner. When clear from context, we drop the argument µ in the definition of the safe set and simply refer to it as Ds_0.\nRegret. Let T be the total number of rounds. If xt, t ∈ [T] are the actions chosen, then the cumulative pseudo-regret (Audibert et al. (2009)) of the learner's algorithm for choosing the actions xt is defined by RT = Σ_{t=1}^T (µ†xt − µ†x∗), where x∗ is the optimal safe action that minimizes the loss ℓt in expectation, i.e., x∗ ∈ arg min_{x∈Ds_0(µ)} µ†x.\nGoal. 
The goal of the learner is to keep RT as small as possible. At the bare minimum, we require that the algorithm leads to RT/T → 0 (as T grows large). In contrast to existing linear stochastic bandit formulations, we require that the chosen actions xt, t ∈ [T] are safe (i.e., belong in Ds_0 (2)) with high probability. For the rest of this paper, we simply use regret to refer to the pseudo-regret RT.\nIn Section 2.1 we place some further technical assumptions on D0 (bounded), on Ds_0 (non-empty), on µ (bounded) and on the distribution of ηt (subgaussian).\n\n1.2 Related Works\n\nOur algorithm relies on a modified version of the famous UCB algorithm known as UCB1, which was first developed by Auer et al. (2002). For linear stochastic bandits, the regret of the LUCB algorithm was analyzed by, e.g., Dani et al. (2008); Abbasi-Yadkori et al. (2011); Rusmevichientong and Tsitsiklis (2010); Russo and Van Roy (2014); Chu et al. (2011) and it was shown that the regret grows at the rate of √(T log T). Extensions to generalized linear bandit models have also been considered by, e.g., Filippi et al. (2010); Li et al. (2017). There are two different contexts where constraints have been applied to the stochastic MAB problem. The first line of work considers the MAB problem with global budget (a.k.a. knapsack) constraints, where each arm is associated with a random resource consumption and the objective is to maximize the total reward before the learner runs out of resources; see, e.g., Badanidiyuru et al. (2013); Agrawal and Devanur (2016); Wu et al. (2015); Badanidiyuru et al. (2014). The second line of work considers stage-wise safety for bandit problems in the context of ensuring that the algorithm's regret performance stays above a fixed percentage of the performance of a baseline strategy at every round during its run Kazerouni et al. (2017); Wu et al. (2016). 
In Kazerouni et al. (2017), which is most closely related to our setting, the authors study a variant of LUCB in which the chosen actions are constrained such that the cumulative reward remains strictly greater than (1 − α) times a given baseline reward for all t. In both of the above-mentioned lines of work, the constraint applies to the cumulative resource consumption (or reward) across the entire run of the algorithm. As such, the set of permitted actions at each round varies depending on the round and on the history of the algorithm. This is unlike our constraint, which is applied at each individual round, is deterministic, and does not depend on the history of past actions.\nIn a more general context, the concept of safe learning has received significant attention in recent years from different communities. Most existing works that consider mechanisms for safe exploration in unknown and stochastic environments are in reinforcement learning or control. However, the notion of safety has many diverse definitions in this literature. For example, Moldovan and Abbeel (2012) proposes an algorithm that allows safe exploration in Markov Decision Processes (MDP) in order to avoid fatal absorbing states that must never be visited during the exploration process. By considering constrained MDPs that are augmented with a set of auxiliary cost functions and replacing them with surrogates that are easy to estimate, Achiam et al. (2017) proposes a policy search algorithm for constrained reinforcement learning with guarantees for near constraint satisfaction at each iteration. In the framework of global optimization or active data selection, Schreiter et al. (2015); Berkenkamp et al. (2016) assume that the underlying system is safety-critical and present active learning frameworks that use Gaussian Processes (GP) as non-parametric models to learn the safe decision set. More closely related to our setting, Sui et al. 
(2015, 2018) extend the application of UCB to nonlinear bandits with nonlinear constraints modeled through Gaussian processes (GPs). The algorithms in Sui et al. (2015, 2018) come with convergence guarantees, but no regret bounds as provided in our paper. Regret guarantees imply convergence guarantees from an optimization perspective (see Srinivas et al. (2010)), but not the other way around. Such approaches for safety-constrained optimization using GPs have shown great promise in robotics applications with safety constraints Ostafew et al. (2016); Akametalu et al. (2014). With a control-theoretic point of view, Gillula and Tomlin (2011) combines reachability analysis and machine learning for autonomously learning the dynamics of a target vehicle, and Aswani et al. (2013) designs a learning-based MPC scheme that provides deterministic guarantees on robustness when the underlying system model is linear and has a known level of uncertainty. In a very recent related work Usmanova et al. (2019), the authors propose and analyze a (safe) variant of the Frank-Wolfe algorithm to solve a smooth optimization problem with unknown linear constraints that are accessed by the learner via stochastic zeroth-order feedback. The main goal in Usmanova et al. (2019) is to provide a convergence rate for a more general convex objective, whereas we aim to provide regret bounds for a linear but otherwise unknown objective.\n\n2 A Safe-LUCB Algorithm\nOur proposed algorithm is a safe version of LUCB. As such, it relies on the well-known heuristic principle of optimism in the face of uncertainty (OFU). The algorithm constructs a confidence set Ct at each round t, within which the unknown parameter µ lies with high probability. In the absence of any constraints, the learner chooses the most “favorable” environment µ from the set Ct and plays the action xt that minimizes the expected loss in that environment. 
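For an ellipsoidal confidence set Ct = {v : ‖v − µ̂‖_A ≤ β}, the inner minimization min_{v∈Ct} v†x has the closed form µ̂†x − β‖x‖_{A⁻¹}, so the unconstrained OFU step reduces to a simple score over arms. A minimal sketch for a finite arm set (the function and variable names are ours, not the paper's):

```python
import numpy as np

def optimistic_arm(arms, mu_hat, A, beta):
    """Unconstrained LUCB step: for each arm x, the most favorable parameter in
    the ellipsoid {v : ||v - mu_hat||_A <= beta} yields the optimistic loss
    mu_hat @ x - beta * ||x||_{A^{-1}}; return the index of the arm minimizing it."""
    A_inv = np.linalg.inv(A)
    scores = [mu_hat @ x - beta * np.sqrt(x @ A_inv @ x) for x in arms]
    return int(np.argmin(scores))
```

With beta = 0 this is plain greedy minimization under µ̂; a larger beta favors poorly explored directions, i.e. arms with large ‖x‖_{A⁻¹}.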
However, the presence of the constraint (1) complicates the choice of the learner. To address this, we propose an algorithm called safe linear upper confidence bound (Safe-LUCB), which attempts to minimize regret while making sure that the safety constraints (1) are satisfied. Safe-LUCB is summarized in Algorithm 1 and a detailed presentation follows in Sections 2.2 and 2.3, where we discuss the pure-exploration and safe exploration-exploitation phases of the algorithm, respectively. Before these, in Section 2.1 we introduce the necessary conditions under which our proposed algorithm operates and achieves good regret bounds as will be shown in Section 3.\n\n2.1 Model Assumptions\nLet Ft = σ(x1, x2, . . . , xt+1, η1, η2, . . . , ηt) be the σ-algebra (or, history) at round t. We make the following standard assumptions on the noise distribution, on the parameter µ and on the actions.\nAssumption 1 (Subgaussian Noise). For all t, ηt is conditionally zero-mean R-sub-Gaussian for a fixed constant R ≥ 0, i.e., E[ηt | x1:t, η1:t−1] = 0 and E[e^{ληt} | Ft−1] ≤ exp(λ²R²/2), ∀λ ∈ R.\nAssumption 2 (Boundedness). There exist positive constants S, L such that ‖µ‖₂ ≤ S and ‖x‖₂ ≤ L, ∀x ∈ D0. Also, µ†x ∈ [−1, 1], ∀x ∈ D0.\nIn order to avoid trivialities, we also make the following assumption. This, together with the assumption that c > 0 in (1), guarantees that the safe set Ds_0(µ) is non-empty (for every µ).\nAssumption 3 (Non-empty safe set). 
The decision set D0 is a convex body in R^d that contains the origin in its interior.\n\nAlgorithm 1 Safe-LUCB\n1: Pure exploration phase:\n2: for t = 1, 2, . . . , T′\n3:    Randomly choose xt ∈ Dw (defined in (3)) and observe loss ℓt = ct(xt).\n4: end for\n5: Safe exploration-exploitation phase:\n6: for t = T′ + 1, T′ + 2, . . . , T\n7:    Set At = λI + Σ_{τ=1}^{t−1} xτxτ† and compute µ̂t = At⁻¹ Σ_{τ=1}^{t−1} ℓτxτ\n8:    Ct = {v ∈ R^d : ‖v − µ̂t‖_{At} ≤ βt}, with βt chosen as in (7)\n9:    Ds_t = {x ∈ D0 : v†Bx ≤ c, ∀v ∈ Ct}\n10:   xt = arg min_{x∈Ds_t} min_{v∈Ct} v†x\n11:   Choose xt and observe loss ℓt = ct(xt).\n12: end for\n\n2.2 Pure exploration phase\nThe pure exploration phase of the algorithm runs for rounds t ∈ [T′], where T′ is passed as input to the algorithm. In Section 3, we will show how to appropriately choose its value to guarantee that the cumulative regret is controlled. During this phase, the algorithm selects random actions from a safe subset Dw ⊂ D0 that we define next. For every chosen action xt, we observe a loss ℓt. The collected action-loss pairs (xt, ℓt) over the T′ rounds are used in the second phase to obtain a good estimate of µ. We will see in Section 2.3 that this is important since the quality of the estimate of µ determines our belief of which actions are safe. Now, let us define the safe subset Dw.\n\nThe safe set Ds_0 is unknown to the learner (since µ is unknown). 
However, it can be deduced from the constraint (1) and the boundedness Assumption 2 on µ, that the following subset Dw ⊂ D0 is safe:\n\nDw := {x ∈ D0 : max_{‖v‖₂≤S} v†Bx ≤ c} = {x ∈ D0 : ‖Bx‖₂ ≤ c/S}. (3)\n\nNote that the set Dw is only a conservative (inner) approximation of Ds_0, but this is inevitable, since the learner has not yet collected enough information on the unknown parameter µ.\nIn order to make the choice of random actions xt, t ∈ [T′] concrete, let X ∼ Unif(Dw) be a d-dimensional random vector uniformly distributed in Dw according to the probability measure given by the normalized volume in Dw (recall that Dw is a convex body by Assumption 3). During rounds t ∈ [T′], Safe-LUCB chooses safe IID actions xt ∼ X. For future reference, we denote the covariance matrix of X by Σ = E[XX†] and its minimum eigenvalue by\n\nλ₋ := λmin(Σ) > 0. (4)\n\nRemark 1. Since D0 is compact with zero in its interior, we can always find 0 < ε ≤ c/S such that\n\nD̃w := {x ∈ R^d : ‖Bx‖₂ = ε} ⊂ Dw. (5)\n\nThus, an effective way to choose (random) actions xt during the safe-exploration phase for which an explicit expression for λ₋ is easily derived, is as follows. For simplicity, we assume B is invertible. Let ε be the largest value 0 < ε ≤ c/S such that (5) holds. Then, generate samples xt ∼ Unif(D̃w), t = 1, . . . , T′, by choosing xt = εB⁻¹zt, where zt are iid samples on the unit sphere S^{d−1}. Clearly, E[ztzt†] = (1/d)I. Thus, Σ := E[xtxt†] = (ε²/d)(B†B)⁻¹, from which it follows that λ₋ := λmin(Σ) = ε²/(d·λmax(B†B)) = ε²/(d‖B‖²).\n\n2.3 Safe exploration-exploitation phase\nWe implement the OFU principle while respecting the safety constraints. First, at each t = T′ + 1, T′ + 2, . . . , T, the algorithm uses the previous action-observation pairs and obtains a λ-regularized least-squares estimate µ̂t of µ with regularization parameter λ > 0 as follows:\n\nµ̂t = At⁻¹ Σ_{τ=1}^{t−1} ℓτxτ, where At = λI + Σ_{τ=1}^{t−1} xτxτ†.\n\nThen, based on µ̂t the algorithm builds a confidence set\n\nCt := {v ∈ R^d : ‖v − µ̂t‖_{At} ≤ βt}, (6)\n\nwhere βt is chosen according to Theorem 1 below (Abbasi-Yadkori et al. (2011)) to guarantee that µ ∈ Ct with high probability.\nTheorem 1 (Confidence Region, Abbasi-Yadkori et al. (2011)). Let Assumptions 1 and 2 hold. Fix any δ ∈ (0, 1) and let βt in (6) be chosen as follows,\n\nβt = R√(d log((1 + (t − 1)L²/λ)/δ)) + λ^{1/2}S, for all t > 0. (7)\n\nThen, with probability at least 1 − δ, for all t > 0, it holds that µ ∈ Ct.\nThe remaining steps of the algorithm also build on existing principles of UCB algorithms. However, here we introduce necessary modifications to account for the safety constraint (1). Specifically, we choose the actions with the following two principles.\nCaution in the face of constraint violation. At each round t, the algorithm performs conservatively, to ensure that the constraint (1) is satisfied for the chosen action xt. As such, at the beginning of each round t = T′ + 1, . . . 
, T, Safe-LUCB forms the so-called safe decision set denoted as Ds_t:\n\nDs_t = {x ∈ D0 : v†Bx ≤ c, ∀v ∈ Ct}. (8)\n\nRecall from Theorem 1 that µ ∈ Ct with high probability. Thus, Ds_t is guaranteed to be a set of safe actions that satisfy (1) with the same probability. On the other hand, note that Ds_t is still a conservative inner approximation of Ds_0(µ) (actions in it are safe for all parameter vectors in Ct, not only for the true µ). This (unavoidable) conservative definition of safe decision sets could contribute to the growth of the regret. This is further studied in Section 3.\nOptimism in the face of uncertainty in cost. After choosing safe actions randomly at rounds 1, . . . , T′, the algorithm creates the safe decision set Ds_t at all rounds t ≥ T′ + 1, and chooses an action xt based on the OFU principle. Specifically, a pair (xt, µ̃t) is chosen such that\n\nµ̃t†xt = min_{x∈Ds_t, v∈Ct} v†x. (9)\n\n3 Regret Analysis of Safe-LUCB\n\n3.1 The regret of safety\nIn the safe linear bandit problem, the safe set Ds_0 is not known, since µ is unknown. Therefore, at each round, the learner chooses actions from a conservative inner approximation of Ds_0. Intuitively, the better this approximation, the more likely that the optimistic actions of Safe-LUCB lead to good cumulative regret, ideally of the same order as that of LUCB in the original linear bandit setting.\nA key difference in the analysis of Safe-LUCB compared to the classical LUCB is that x∗ may not lie within the estimated safe set Ds_t at each round. To see what changes, consider the standard decomposition of the instantaneous regret rt, t = T′ + 1, . . . , T in two terms as follows (e.g., Dani et al. (2008); Abbasi-Yadkori et al. 
(2011)):\n\nrt := µ†xt − µ†x∗ = (µ†xt − µ̃t†xt) + (µ̃t†xt − µ†x∗), (10)\n\nwhere the first summand is Term I, the second summand is Term II, and (µ̃t, xt) is the optimistic pair, i.e. the solution to the minimization in Step 10 of Algorithm 1. On the one hand, controlling Term I is more or less standard and closely follows previous such bounds on UCB-type algorithms (e.g., Abbasi-Yadkori et al. (2011)); see Appendix B.2 for details. On the other hand, controlling Term II, which we call the regret of safety, is more delicate. This complication lies at the heart of the new formulation with additional safety constraints. When safety constraints are absent, classical LUCB guarantees that Term II is non-positive. Unfortunately, this is not the case here: x∗ does not necessarily belong to Ds_t in (8), thus Term II can be positive. This extra regret of safety is the price paid by Safe-LUCB for choosing safe actions at each round. Our main contribution towards establishing regret guarantees is upper bounding Term II. We show in Section 3.2 that the pure-exploration phase is critical in this direction.\n\n3.2 Learning the safe set\nThe challenge in controlling the regret of safety is that, in general, Ds_t ≠ Ds_0. At a high level, we proceed as follows (see Appendix B.3 for details). First, we relate Term II with a certain notion of “distance” in the direction of x∗ between the estimated set Ds_t at rounds t = T′ + 1, . . . , T and the true safe set Ds_0. Next, we show that this “distance” term can be controlled by appropriately lower bounding the minimum eigenvalue λmin(At) of the Gram matrix At. Due to the interdependency of the actions xt, it is difficult to directly establish such a lower bound for each round t. 
Instead, we use that λmin(At) ≥ λmin(A_{T′+1}), t ≥ T′ + 1, and we are able to bound λmin(A_{T′+1}) thanks to the pure exploration phase of Safe-LUCB. Hence, the pure exploration phase guarantees that Ds_t is a sufficiently good approximation to the true Ds_0 once the exploration-exploitation phase begins.\n\nLemma 1. Let A_{T′+1} = λI + Σ_{t=1}^{T′} xtxt† be the Gram matrix corresponding to the first T′ actions of Safe-LUCB (pure-exploration phase). Recall the definition of λ₋ in (4). Then, for any δ ∈ (0, 1), it holds with probability at least 1 − δ,\n\nλmin(A_{T′+1}) ≥ λ + λ₋T′/2, (11)\n\nprovided that T′ ≥ tδ := (8L²/λ₋) log(d/δ).\n\nThe proof of the lemma and technical details relating the result to a desired bound on Term II are deferred to Appendices A and B.3, respectively.\n\n3.3 Problem-dependent upper bound\n\nIn this section, we present a problem-dependent upper bound on the regret of Safe-LUCB in terms of the following critical parameter, which we call the safety gap:\n\n∆ := c − µ†Bx∗. (12)\n\nNote that ∆ ≥ 0. In this section, we assume that ∆ is known to the learner. The next lemma shows that if ∆ > 0,¹ then choosing T′ = O(log T) guarantees that x∗ ∈ Ds_t for all t = T′ + 1, . . . , T.\nLemma 2 (x∗ ∈ Ds_t). Let Assumptions 1, 2 and 3 hold. Fix any δ ∈ (0, 1) and assume a positive safety gap ∆ > 0. 
Initialize Safe-LUCB with (recall the definition of tδ in Lemma 1)\n\nT′ ≥ T∆ := (8L²‖B‖²βT²/(λ₋∆²) − 2λ/λ₋) ∨ tδ. (13)\n\nThen, with probability at least 1 − δ, for all t = T′ + 1, . . . , T it holds that x∗ ∈ Ds_t.\nIn light of our discussion in Sections 3.1 and 3.2, once we have established that x∗ ∈ Ds_t for t = T′ + 1, . . . , T, the regret of safety becomes nonpositive and we can show that the algorithm performs just like classical LUCB during the exploration-exploitation phase.² This is formalized in Theorem 2, showing that when ∆ > 0 (and is known), the regret of Safe-LUCB is Õ(√T).\n\nTheorem 2 (Problem-dependent bound; ∆ > 0). Let the same assumptions as in Lemma 2 hold. Initialize Safe-LUCB with T′ ≥ T∆ specified in (13). Then, for T ≥ T′, with probability at least 1 − 2δ, the cumulative regret of Safe-LUCB satisfies\n\nRT ≤ 2T′ + 2βT √(2d(T − T′) log(2TL²/(d(λ₋T′ + 2λ)))). (14)\n\nSpecifically, choosing T′ = T∆ guarantees cumulative regret O(T^{1/2} log T).\n\nThe bound in (14) is comprised of two terms. The first one is a trivial bound on the regret of the exploration-only phase of Safe-LUCB and is proportional to its duration T′. Thanks to Lemma 2 the duration of the exploration phase is limited to T∆ rounds and T∆ is (at most) logarithmic in the total number of rounds T. Thus, the first summand in (14) contributes only O(log T) in the total regret. Note, however, that T∆ grows larger as the normalized safety gap ∆/‖B‖ becomes smaller. 
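The exploration length in (13) can be computed directly; a minimal sketch (a direct transcription of (13) and of tδ from Lemma 1, with the function name and argument names chosen by us):

```python
import numpy as np

def exploration_length(L, B_norm, beta_T, lam, lam_minus, Delta, d, delta):
    """T' >= T_Delta from (13): (8 L^2 ||B||^2 beta_T^2 / (lam_minus * Delta^2)
    - 2 lam / lam_minus), joined by a max (the paper's "v") with t_delta
    from Lemma 1."""
    t_delta = (8 * L**2 / lam_minus) * np.log(d / delta)
    T_Delta = (8 * L**2 * B_norm**2 * beta_T**2) / (lam_minus * Delta**2) - 2 * lam / lam_minus
    return int(np.ceil(max(T_Delta, t_delta)))

# A smaller safety gap Delta forces a longer pure exploration phase (1/Delta^2 scaling).
```

Since βT = O(√(d log T)), T∆ is O(log T) for fixed ∆, matching the O(log T) exploration-phase contribution to the regret noted above.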
The second summand in (14) contributes O(T^{1/2} log T) and bounds the cumulative regret of the exploration-exploitation phase, which takes the bulk of the algorithm. More specifically, it bounds the contribution of Term I in (10), since Term II is zeroed out once x∗ ∈ Ds_t thanks to Lemma 2. Finally, note that Theorem 2 requires the total number of rounds T to be large enough for the desired regret performance. This is the price paid for the extra safety constraints compared to the performance of the classical LUCB in the original linear bandit setting. We remark that existing lower bounds for the simpler problem without safety constraints (e.g. Rusmevichientong and Tsitsiklis (2010); Dani et al. (2008)) show that the regret Õ(√(Td)) of Theorem 2 cannot be improved modulo logarithmic factors. The proofs of Lemma 2 and Theorem 2 are in Appendix B.\n\n3.4 General upper bound\nWe now extend the results of Section 3.3 to instances where the safety gap is zero, i.e. ∆ = 0. In this case, we cannot guarantee an exploration phase that results in x∗ ∈ Ds_t, t > T′ in a reasonable time length T′. Thus, the regret of safety is not necessarily non-positive and it is unclear whether a sub-linear cumulative regret is possible.\nTheorem 3 shows that Safe-LUCB achieves regret Õ(T^{2/3}) when ∆ = 0. Note that this (worst-case) bound is also applicable when the safety gap is unknown to the learner. While it is significantly worse than the performance guaranteed by Theorem 2, it proves that Safe-LUCB always leads to RT/T → 0 as T grows large. The proof is deferred to Appendix B.\n\n¹We remark that the case ∆ > 0 studied here is somewhat reminiscent of the assumption αrℓ > 0 in Kazerouni et al. 
(2017).

²Our simulation results in Appendix F emphasize the critical role of a sufficiently long pure exploration phase for Safe-LUCB, as suggested by Lemma 2. Specifically, Figure 1b depicts an instance where no exploration leads to a significantly worse order of regret.

Theorem 3 (General bound: worst-case). Suppose Assumptions 1, 2 and 3 hold. Fix any δ ∈ (0, 0.5). Initialize Safe-LUCB with T' ≥ t_δ specified in Lemma 1. Then, with probability at least 1 − 2δ, the cumulative regret R_T of Safe-LUCB for T ≥ T' satisfies
$$R_T \le 2T' + 2\beta_T\sqrt{2d\,(T - T')\,\log\frac{2TL^2}{d(\lambda_- T' + 2\lambda)}} + \frac{2\sqrt{2}\,\|B\|L\beta_T(T - T')}{c\sqrt{\lambda_- T' + 2\lambda}}. \tag{15}$$
Specifically, choosing $T' = T_0 := \left(\frac{\|B\|L\beta_T T}{c\sqrt{2\lambda_-}}\right)^{2/3} \vee t_\delta$ guarantees regret O(T^{2/3} log T).

Compared to Theorem 2, the bound in (15) is now comprised of three terms. The first one captures again the exploration-only phase and is linear in its duration T'. However, note that T' is now O(T^{2/3} log T), i.e., of the same order as the total bound. The second term bounds the total contribution of Term I of the exploration-exploitation phase. As usual, its order is Õ(T^{1/2}). Finally, the additional third term bounds the regret of safety and is of the same order as the first term.

4 Unknown Safety Gap
In Section 3.3 we showed that when the safety gap Δ > 0, Safe-LUCB achieves good regret performance Õ(√T). However, this requires that the value of Δ, or at least a (non-trivial) lower bound on it, be known to the learner so that T' is initialized appropriately according to Lemma 2.
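When Δ is unknown (or zero), the worst-case choice T' = T_0 of Theorem 3 is always available. A minimal sketch under placeholder constants (c is the constant appearing in the theorem, left unspecified here) illustrates that T_0 grows like T^{2/3}, so exploration still occupies a vanishing fraction of the horizon:

```python
import math

def worst_case_exploration_length(T, B_norm, L, beta_T, c, lam_minus, t_delta):
    """Worst-case exploration length T_0 from Theorem 3:
    T_0 = (||B|| * L * beta_T * T / (c * sqrt(2 * lam_minus)))^(2/3),
    or t_delta if that is larger."""
    t0 = (B_norm * L * beta_T * T / (c * math.sqrt(2 * lam_minus))) ** (2 / 3)
    return max(math.ceil(t0), t_delta)

# Illustrative constants only: T_0 / T -> 0 as the horizon T grows.
T_0 = worst_case_exploration_length(T=10_000, B_norm=1.0, L=1.0, beta_T=2.0,
                                    c=1.0, lam_minus=0.5, t_delta=100)
print(T_0)
```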
This requirement might be restrictive in certain applications. When that is the case, one option is to run Safe-LUCB with the choice of T' suggested by Theorem 3, but this could result in an unnecessarily long pure exploration period (during which regret grows linearly). Here, we present an alternative. Specifically, we propose a variation of Safe-LUCB referred to as generalized safe linear upper confidence bound (GSLUCB). The key idea behind GSLUCB is to build a lower confidence bound Δ_t on the safety gap Δ and to calculate the length of the pure exploration phase associated with Δ_t, denoted T'_t. This allows the learner to stop the pure exploration phase at the first round t for which the condition t ≤ T'_{t−1} no longer holds. While we do not provide a separate regret analysis for GSLUCB, it is clear that its worst-case regret performance would match that of Safe-LUCB with Δ = 0. However, our numerical experiments highlight the improvements that GSLUCB can provide in the cases where Δ ≠ 0. We give a full explanation of GSLUCB, including how we calculate the lower confidence bound Δ_t, in Appendix E.

Figure 1a compares the average per-step regret of 1) Safe-LUCB with knowledge of Δ; 2) Safe-LUCB without knowledge of Δ (hence, assuming Δ = 0); 3) GSLUCB without knowledge of Δ, in a simplified setting of K-armed linear bandits with strictly positive safety gap (see Appendix C).
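The adaptive stopping rule described above can be sketched as runnable code; `lcb_gap` (the lower confidence bound Δ_t), `T_gap` (the exploration length T_Δ as a function of the gap), `T_0`, and `sample_safe_action` are placeholders for the quantities defined in the paper and Appendix E:

```python
def pure_exploration_length(T_0, lcb_gap, T_gap, sample_safe_action):
    """Sketch of GSLUCB's pure exploration phase (Algorithm 2): keep
    exploring while t <= min(T'_{t-1}, T_0), where T'_t is the exploration
    length implied by the current lower confidence bound Delta_t."""
    t, T_prev = 1, T_0  # T'_0 = T_0
    while t <= min(T_prev, T_0):
        sample_safe_action(t)     # random x_t from the known-safe set D_w
        Delta_t = lcb_gap(t)      # lower confidence bound on Delta
        T_prev = T_gap(Delta_t) if Delta_t > 0 else T_0
        t += 1
    return t - 1  # number of pure-exploration rounds actually used

# Toy illustration: the lower bound on Delta becomes positive at t = 50,
# at which point the required exploration length shrinks from 1000 to 200,
# so exploration stops after 200 rounds instead of 1000.
rounds = pure_exploration_length(
    T_0=1000,
    lcb_gap=lambda t: 0.5 if t >= 50 else 0.0,
    T_gap=lambda Delta: 200,
    sample_safe_action=lambda t: None,
)
print(rounds)
```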
The details on the parameters of the simulations are deferred to Appendix F.

Figure 1: Simulation of per-step regret. (a) Average per-step regret of Safe-LUCB and GSLUCB with a decision set of K arms. (b) Per-step regret of Safe-LUCB with and without pure exploration phase.

Algorithm 2 GSLUCB
1: Pure exploration phase:
2: t ← 1, T'_0 = T_0
3: while t ≤ min(T'_{t−1}, T_0) do
4:   Randomly choose x_t ∈ D_w and observe loss ℓ_t = c_t(x_t).
5:   Δ_t = Lower confidence bound on Δ at round t
6:   if Δ_t > 0 then T'_t = T_{Δ_t}
7:   else T'_t = T_0
8:   end if
9:   t ← t + 1
10: end while
11: Safe exploration-exploitation phase: Lines 6 - 12 of Safe-LUCB for all remaining rounds.

5 Conclusions
We have formulated a linear stochastic bandit problem with safety constraints that depend linearly on the unknown problem parameter µ. While simplified, the model captures the additional complexity introduced in the problem by the requirement that chosen actions belong to an unknown safe set. As such, it allows us to quantify tradeoffs between learning the safe set and minimizing the regret. Specifically, we propose Safe-LUCB, which is comprised of two phases: (i) a pure-exploration phase that speeds up learning the safe set; (ii) a safe exploration-exploitation phase that minimizes the regret. Our analysis suggests that the safety gap Δ plays a critical role. When Δ > 0 we show how to achieve regret Õ(√T) as in the classical linear bandit setting. However, when Δ = 0, the regret of Safe-LUCB is Õ(T^{2/3}). It is an interesting open problem to establish lower bounds for an arbitrary policy that accounts for the safety constraints.
Our analysis of Safe-LUCB suggests that Δ = 0 is a worst-case scenario, but it remains open whether the Õ(T^{2/3}) regret bound can be improved in that case. Natural extensions of the problem setting to multiple constraints and generalized linear bandits (possibly with generalized linear constraints) might also be of interest.

6 Acknowledgement

This research is supported by UCOP grant LFR-18-548175 and NSF grant 1847096.

References

Abbasi-Yadkori, Y., Pál, D., and Szepesvári, C. (2011). Improved algorithms for linear stochastic bandits. In Advances in Neural Information Processing Systems, pages 2312–2320.

Achiam, J., Held, D., Tamar, A., and Abbeel, P. (2017). Constrained policy optimization. In Proceedings of the 34th International Conference on Machine Learning - Volume 70, pages 22–31. JMLR.org.

Agrawal, S. and Devanur, N. (2016). Linear contextual bandits with knapsacks. In Advances in Neural Information Processing Systems 29, pages 3450–3458. Curran Associates, Inc.

Akametalu, A. K., Fisac, J. F., Gillula, J. H., Kaynama, S., Zeilinger, M. N., and Tomlin, C. J. (2014). Reachability-based safe learning with Gaussian processes. In 53rd IEEE Conference on Decision and Control, pages 1424–1431.

Aswani, A., Gonzalez, H., Sastry, S. S., and Tomlin, C. (2013). Provably safe and robust learning-based model predictive control. Automatica, 49(5):1216–1226.

Audibert, J.-Y., Munos, R., and Szepesvári, C. (2009). Exploration–exploitation tradeoff using variance estimates in multi-armed bandits. Theoretical Computer Science, 410(19):1876–1902.

Auer, P., Cesa-Bianchi, N., and Fischer, P. (2002). Finite-time analysis of the multiarmed bandit problem. Machine Learning, 47(2-3):235–256.

Badanidiyuru, A., Kleinberg, R., and Slivkins, A. (2013). Bandits with knapsacks.
In 2013 IEEE 54th Annual Symposium on Foundations of Computer Science, pages 207–216.

Badanidiyuru, A., Langford, J., and Slivkins, A. (2014). Resourceful contextual bandits. In Proceedings of The 27th Conference on Learning Theory, volume 35 of Proceedings of Machine Learning Research, pages 1109–1134, Barcelona, Spain. PMLR.

Berkenkamp, F., Krause, A., and Schoellig, A. P. (2016). Bayesian optimization with safety constraints: safe and automatic parameter tuning in robotics. arXiv preprint arXiv:1602.04450.

Chu, W., Li, L., Reyzin, L., and Schapire, R. (2011). Contextual bandits with linear payoff functions. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, volume 15 of Proceedings of Machine Learning Research, pages 208–214, Fort Lauderdale, FL, USA. PMLR.

Dani, V., Hayes, T. P., and Kakade, S. M. (2008). Stochastic linear optimization under bandit feedback.

Filippi, S., Cappe, O., Garivier, A., and Szepesvári, C. (2010). Parametric bandits: The generalized linear case. In Advances in Neural Information Processing Systems, pages 586–594.

Gillulay, J. H. and Tomlin, C. J. (2011). Guaranteed safe online learning of a bounded system. In 2011 IEEE/RSJ International Conference on Intelligent Robots and Systems, pages 2979–2984.

Kazerouni, A., Ghavamzadeh, M., Abbasi, Y., and Van Roy, B. (2017). Conservative contextual linear bandits. In Advances in Neural Information Processing Systems 30, pages 3910–3919. Curran Associates, Inc.

Li, L., Lu, Y., and Zhou, D. (2017). Provably optimal algorithms for generalized linear contextual bandits.
In Proceedings of the 34th International Conference on Machine Learning - Volume 70, pages 2071–2080. JMLR.org.

Moldovan, T. M. and Abbeel, P. (2012). Safe exploration in Markov decision processes. arXiv preprint arXiv:1205.4810.

Ostafew, C. J., Schoellig, A. P., and Barfoot, T. D. (2016). Robust constrained learning-based NMPC enabling reliable mobile robot path tracking. The International Journal of Robotics Research, 35(13):1547–1563.

Rusmevichientong, P. and Tsitsiklis, J. N. (2010). Linearly parameterized bandits. Mathematics of Operations Research, 35(2):395–411.

Russo, D. and Van Roy, B. (2014). Learning to optimize via posterior sampling. Mathematics of Operations Research, 39(4):1221–1243.

Schreiter, J., Nguyen-Tuong, D., Eberts, M., Bischoff, B., Markert, H., and Toussaint, M. (2015). Safe exploration for active learning with Gaussian processes. In Machine Learning and Knowledge Discovery in Databases, pages 133–149, Cham. Springer International Publishing.

Srinivas, N., Krause, A., Kakade, S., and Seeger, M. (2010). Gaussian process optimization in the bandit setting: no regret and experimental design. In Proceedings of the 27th International Conference on Machine Learning, pages 1015–1022. Omnipress.

Sui, Y., Burdick, J., Yue, Y., et al. (2018). Stagewise safe Bayesian optimization with Gaussian processes. In International Conference on Machine Learning, pages 4788–4796.

Sui, Y., Gotovos, A., Burdick, J. W., and Krause, A. (2015). Safe exploration for optimization with Gaussian processes. In Proceedings of the 32nd International Conference on Machine Learning - Volume 37, ICML'15, pages 997–1005. JMLR.org.

Tropp, J. A. (2015). An introduction to matrix concentration inequalities.
Foundations and Trends® in Machine Learning, 8(1-2):1–230.

Usmanova, I., Krause, A., and Kamgarpour, M. (2019). Safe convex learning under uncertain constraints. In Proceedings of Machine Learning Research, volume 89, pages 2106–2114. PMLR.

Wu, H., Srikant, R., Liu, X., and Jiang, C. (2015). Algorithms with logarithmic or sublinear regret for constrained contextual bandits. In Advances in Neural Information Processing Systems 28, pages 433–441. Curran Associates, Inc.

Wu, Y., Shariff, R., Lattimore, T., and Szepesvári, C. (2016). Conservative bandits. In Proceedings of the 33rd International Conference on Machine Learning - Volume 48, ICML'16, pages 1254–1262. JMLR.org.