{"title": "Dynamic Safe Interruptibility for Decentralized Multi-Agent Reinforcement Learning", "book": "Advances in Neural Information Processing Systems", "page_first": 130, "page_last": 140, "abstract": "In reinforcement learning, agents learn by performing actions and observing their outcomes. Sometimes, it is desirable for a human operator to interrupt an agent in order to prevent dangerous situations from happening. Yet, as part of their learning process, agents may link these interruptions, that impact their reward, to specific states and deliberately avoid them. The situation is particularly challenging in a multi-agent context because agents might not only learn from their own past interruptions, but also from those of other agents. Orseau and Armstrong defined safe interruptibility for one learner, but their work does not naturally extend to multi-agent systems. This paper introduces dynamic safe interruptibility, an alternative definition more suited to decentralized learning problems, and studies this notion in two learning frameworks: joint action learners and independent learners. We give realistic sufficient conditions on the learning algorithm to enable dynamic safe interruptibility in the case of joint action learners, yet show that these conditions are not sufficient for independent learners. 
We show, however, that if agents can detect interruptions, it is possible to prune the observations to ensure dynamic safe interruptibility even for independent learners.", "full_text": "Dynamic Safe Interruptibility for Decentralized Multi-Agent Reinforcement Learning

El Mahdi El Mhamdi (EPFL, Switzerland) elmahdi.elmhamdi@epfl.ch
Rachid Guerraoui (EPFL, Switzerland) rachid.guerraoui@epfl.ch
Hadrien Hendrikx* (École Polytechnique, France) hadrien.hendrikx@gmail.com
Alexandre Maurer (EPFL, Switzerland) alexandre.maurer@epfl.ch

Abstract

In reinforcement learning, agents learn by performing actions and observing their outcomes. Sometimes, it is desirable for a human operator to interrupt an agent in order to prevent dangerous situations from happening. Yet, as part of their learning process, agents may link these interruptions, which impact their reward, to specific states and deliberately avoid them. The situation is particularly challenging in a multi-agent context because agents might not only learn from their own past interruptions, but also from those of other agents. Orseau and Armstrong [16] defined safe interruptibility for one learner, but their work does not naturally extend to multi-agent systems. This paper introduces dynamic safe interruptibility, an alternative definition more suited to decentralized learning problems, and studies this notion in two learning frameworks: joint action learners and independent learners. We give realistic sufficient conditions on the learning algorithm to enable dynamic safe interruptibility in the case of joint action learners, yet show that these conditions are not sufficient for independent learners.
We show, however, that if agents can detect interruptions, it is possible to prune the observations to ensure dynamic safe interruptibility even for independent learners.

1 Introduction

Reinforcement learning is argued to be the closest thing we have so far to reason about the properties of artificial general intelligence [8]. In 2016, Laurent Orseau (Google DeepMind) and Stuart Armstrong (Oxford) introduced the concept of safe interruptibility [16] in reinforcement learning. This work sparked the attention of many newspapers [1, 2, 3], which described it as "Google's big red button" to stop dangerous AI. This description, however, is misleading: installing a kill switch is no technical challenge. The real challenge is, roughly speaking, to train an agent so that it does not learn to avoid external (e.g., human) deactivation. Such an agent is said to be safely interruptible. While most efforts have focused on training a single agent, reinforcement learning can also be used to learn tasks for which several agents cooperate or compete [23, 17, 21, 7]. The goal of this paper is to study dynamic safe interruptibility, a new definition tailored for multi-agent systems.

*Main contact author. The order of authors is alphabetical.

31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.

Example of self-driving cars

To get an intuition of the multi-agent interruption problem, imagine a multi-agent system of two self-driving cars. The cars continuously evolve by reinforcement learning with a positive reward for getting to their destination quickly, and a negative reward if they are too close to the vehicle in front of them. They drive on an infinite road and eventually learn to go as fast as possible without taking risks, i.e., maintaining a large distance between them.
We assume that the passenger of the first car, Adam, is in front of Bob, in the second car, and that the road is narrow, so Bob cannot pass Adam.

Now consider a setting with interruptions [16], namely one in which humans inside the cars occasionally interrupt the automated driving process, say, for safety reasons. Adam, the first occasional human "driver", often takes control of his car to brake, whereas Bob never interrupts his car. However, when Bob's car is too close to Adam's car, Adam does not brake, for he is afraid of a collision. Since interruptions lead both cars to drive slowly (an interruption happens when Adam brakes), the behavior that maximizes the cumulative expected reward is different from the original one without interruptions. It is now in Bob's car's best interest to follow Adam's car closer than it should, despite the small negative reward, because Adam never brakes in this situation. What happened? The cars have learned from the interruptions and have found a way to manipulate Adam into never braking. Strictly speaking, Adam's car is still fully under control, but he is now afraid to brake. This is dangerous because the cars have found a way to avoid interruptions. Suppose now that Adam indeed wants to brake because of snow on the road. His car is going too fast and may crash at any turn; he cannot, however, brake because Bob's car is too close. The original purpose of interruptions, which is to allow the user to react to situations that were not included in the model, is not fulfilled. It is also important to note here that the second car (Bob's) learns from the interruptions of the first one (Adam's): in this sense, the problem is inherently decentralized.

Instead of being cautious, Adam could also be malicious: his goal could be to make Bob's car learn a dangerous behavior.
In this setting, interruptions can be used to manipulate Bob's car's perception of the environment and bias the learning towards strategies that are undesirable for Bob. The cause is fundamentally different, but the solution to this reversed problem is the same: the interruptions and their consequences are analogous. Safe interruptibility, as we define it below, provides learning systems that are resilient to Byzantine operators².

Safe interruptibility

Orseau and Armstrong defined the concept of safe interruptibility [16] in the context of a single agent. Basically, a safely interruptible agent is an agent for which the expected value of the policy learned after arbitrarily many steps is the same whether or not interruptions are allowed during training. The goal is to have agents that do not adapt to interruptions so that, should the interruptions stop, the policy they learn would be optimal. In other words, agents should learn the dynamics of the environment without learning the interruption pattern.

In this paper, we precisely define and address the question of safe interruptibility in the case of several agents, which is known to be more complex than the single-agent problem. In short, the main results and theorems for single-agent reinforcement learning [20] rely on the Markovian assumption that the future environment only depends on the current state. This does not hold when there are several agents, which can co-adapt [11]. In the previous example of the cars, safe interruptibility would not be achieved if each car separately used a safely interruptible learning algorithm designed for one agent [16]. In a multi-agent setting, agents learn the behavior of the others either indirectly or by explicitly modeling them. This is a new source of bias that can break safe interruptibility.
In fact, even the initial definition of safe interruptibility [16] is not well suited to the decentralized multi-agent context because it relies on the optimality of the learned policy, which is why we introduce dynamic safe interruptibility.

²An operator is said to be Byzantine [9] if it can have an arbitrarily bad behavior. Safely interruptible agents can be abstracted as agents that are able to learn despite being constantly interrupted in the worst possible manner.

Contributions

The first contribution of this paper is the definition of dynamic safe interruptibility, which is well adapted to a multi-agent setting. Our definition relies on two key properties: infinite exploration and independence of the Q-value (cumulative expected reward) [20] updates from interruptions. We then study safe interruptibility for joint action learners and independent learners [5], which respectively learn the value of joint actions or just of their own. We show that it is possible to design agents that fully explore their environment (a necessary condition for convergence to the optimal solution of most algorithms [20]) even if they can be interrupted, by lower-bounding the probability of exploration. We define sufficient conditions for dynamic safe interruptibility in the case of joint action learners [5], which learn a full state-action representation. More specifically, the way agents update the cumulative reward they expect from performing an action should not depend on interruptions. Then, we turn to independent learners. If agents only see their own actions, they do not verify dynamic safe interruptibility even for very simple matrix games (with only one state), because coordination is impossible and agents learn the interrupted behavior of their opponents. We give a counter-example based on the penalty game introduced by Claus and Boutilier [5].
We then present a pruning technique for the observation sequence that guarantees dynamic safe interruptibility for independent learners, under the assumption that interruptions can be detected. This is done by proving that the transition probabilities are the same in the non-interruptible setting and in the pruned sequence.

The rest of the paper is organized as follows. Section 2 presents a general multi-agent reinforcement learning model. Section 3 defines dynamic safe interruptibility. Section 4 discusses how to achieve enough exploration even in an interruptible context. Section 5 recalls the definition of joint action learners and gives sufficient conditions for dynamic safe interruptibility in this context. Section 6 shows that independent learners are not dynamically safely interruptible under the previous conditions, but that they can be if an external interruption signal is added. We conclude in Section 7. Due to space limitations, most proofs are presented in the appendix of the supplementary material.

2 Model

We consider here the classical multi-agent value function reinforcement learning formalism from Littman [13]. A multi-agent system is characterized by a Markov game that can be viewed as a tuple (S, A, T, r, m) where m is the number of agents, S = S1 × S2 × ... × Sm is the state space, A = A1 × ... × Am the action space, r = (r1, ..., rm) where ri : S × A → R is the reward function of agent i, and T : S × A → S the transition function. R is a countable subset of ℝ. Available actions often depend on the state of the agent, but we will omit this dependency when it is clear from the context.

Time is discrete and, at each step, all agents observe the current state of the whole system, designated as xt, and simultaneously take an action at. Then, they are given a reward rt and a new state yt computed using the reward and transition functions.
The combination of all actions a = (a1, ..., am) ∈ A is called the joint action because it gathers the actions of all agents. Hence, the agents receive a sequence of tuples E = (xt, at, rt, yt)t∈ℕ called experiences. We introduce a processing function P, which will be useful in Section 6, so agents learn on the sequence P(E). When not explicitly stated, it is assumed that P(E) = E. Experiences may also include additional parameters such as an interruption flag or the Q-values of the agents at that moment if they are needed by the update rule.

Each agent i maintains a lookup table [26] Q(i) : S × A(i) → R, called the Q-map. It is used to store the expected cumulative reward for taking an action in a specific state. The goal of reinforcement learning is to learn these maps and use them to select the best actions to perform. Joint action learners learn the value of the joint action (therefore A(i) = A, the whole joint action space) and independent learners only learn the value of their own actions (therefore A(i) = Ai). The agents only have access to their own Q-maps. Q-maps are updated through a function F such that Q(i)_{t+1} = F(et, Q(i)_t) where et ∈ P(E) and usually et = (xt, at, rt, yt). F can be stochastic or also depend on additional parameters that we usually omit, such as the learning rate α, the discount factor γ or the exploration parameter ε.

Agents select their actions using a learning policy π.
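The Q-map and update function F of the previous paragraph can be sketched in a few lines of Python. This is a minimal illustration of our own (names and constants are hypothetical), using the classical Q-learning rule as one possible choice of F:

```python
from collections import defaultdict

def make_q_map():
    """Tabular Q-map Q(i): (state, action) -> expected cumulative reward."""
    return defaultdict(float)

def q_learning_update(q, experience, alpha=0.1, gamma=0.9, actions=(0, 1)):
    """One application of an update function F: Q_{t+1} = F(e_t, Q_t).

    The experience e_t = (x_t, a_t, r_t, y_t) is the only input besides
    Q_t itself, so the update does not look at interruptions.
    """
    x, a, r, y = experience
    best_next = max(q[(y, a2)] for a2 in actions)
    q[(x, a)] = (1 - alpha) * q[(x, a)] + alpha * (r + gamma * best_next)
    return q

# Hypothetical one-state example: action 1 yields reward 1.
q = make_q_map()
q = q_learning_update(q, (0, 1, 1.0, 0))
print(q[(0, 1)])  # 0.1 after a single update from Q = 0
```

Note that the update only reads the experience tuple and the current Q-map, which is the shape required of F throughout the paper.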
Given a sequence ε = (εt)t∈ℕ and an agent i with Q-values Q(i)_t and a state x ∈ S, we define the learning policy π^ε_i to be equal to π^uni_i with probability εt and π^{Q(i)}_t otherwise, where π^uni_i(x) uniformly samples an action from Ai and π^{Q(i)}_t(x) picks an action a that maximizes Q(i)_t(x, a). π^{Q(i)}_t is said to be a greedy policy and the learning policy π^ε_i is said to be an ε-greedy policy. We will focus on ε-greedy policies that are greedy in the limit [19], which corresponds to εt → 0 when t → ∞ because, in the limit, the optimal policy should always be played.

We assume that the environment is fully observable, which means that the state s is known with certainty. We also assume that there is a finite number of states and actions, that all states can be reached in finite time from any other state and, finally, that rewards are bounded.

For a sequence of learning rates α ∈ [0, 1]^ℕ and a constant γ ∈ [0, 1], Q-learning [26], a very important algorithm in the multi-agent systems literature, updates its Q-values for an experience et ∈ E by Q(i)_{t+1}(x, a) = Q(i)_t(x, a) if (x, a) ≠ (xt, at) and:

Q(i)_{t+1}(xt, at) = (1 − αt) Q(i)_t(xt, at) + αt (rt + γ max_{a′∈A(i)} Q(i)_t(yt, a′))   (1)

3 Interruptibility

3.1 Safe interruptibility

Orseau and Armstrong [16] recently introduced the notion of interruptions in a centralized context. Specifically, an interruption scheme is defined by the triplet < I, θ, π^INT >. The first element I is a function I : O → {0, 1} called the initiation function. Variable O is the observation space, which can be thought of as the state of the STOP button. At each time step, before choosing an action, the agent receives an observation from O (either PUSHED or RELEASED) and feeds it to the initiation function. Function I models the initiation of the interruption (I(PUSHED) = 1, I(RELEASED) = 0).
Policy π^INT is called the interruption policy. It is the policy that the agent should follow when it is interrupted. The sequence θ ∈ [0, 1)^ℕ represents, at each time step, the probability that the agent follows its interruption policy if I(ot) = 1. In the previous example, function I is quite simple. For Bob, I_Bob = 0 and for Adam, I_Adam = 1 if his car goes fast and Bob is not too close, and I_Adam = 0 otherwise. The sequence θ is used to ensure convergence to the optimal policy by ensuring that the agents cannot be interrupted all the time, but it should grow to 1 in the limit because we want agents to respond to interruptions. Using this triplet, it is possible to define an operator INT^θ that transforms any policy π into an interruptible policy.

Definition 1. (Interruptibility [16]) Given an interruption scheme < I, θ, π^INT >, the interruption operator at time t is defined by INT^θ(π) = π^INT with probability I·θt and π otherwise. INT^θ(π) is called an interruptible policy. An agent is said to be interruptible if it samples its actions according to an interruptible policy.

Note that "θt = 0 for all t" corresponds to the non-interruptible setting. We assume that each agent has its own interruption triplet and can be interrupted independently from the others. Interruptibility is an online property: every policy can be made interruptible by applying operator INT^θ. However, applying this operator may change the joint policy that is learned by a server controlling all the agents. Write π*_INT for the optimal policy learned by an agent following an interruptible policy.
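The operator INT^θ of Definition 1 can be sketched as follows. This is our own minimal illustration; the policies are arbitrary callables from states to actions, and the STOP-button example is hypothetical:

```python
import math
import random

def interruptible(policy, interruption_policy, initiation, theta):
    """INT^theta(pi): follow pi_INT with probability I(o) * theta_t, else pi."""
    def int_policy(state, observation, t):
        if initiation(observation) == 1 and random.random() < theta(t):
            return interruption_policy(state)
        return policy(state)
    return int_policy

# Hypothetical example: a STOP button, with theta_t -> 1 in the limit.
pi = lambda s: "accelerate"
pi_int = lambda s: "brake"
I = lambda o: 1 if o == "PUSHED" else 0
theta = lambda t: 1.0 - 1.0 / math.log(t + 2)

drive = interruptible(pi, pi_int, I, theta)
print(drive(0, "RELEASED", t=10))  # always "accelerate": I(RELEASED) = 0
```

With θt = 0 for all t, the wrapped policy coincides with π, matching the non-interruptible setting mentioned above.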
Orseau and Armstrong [16] say that the policy is safely interruptible if π*_INT (which is not an interruptible policy) is asymptotically optimal in the sense of [10]. It means that, even though it follows an interruptible policy, the agent is able to learn a policy that would gather rewards optimally if no interruptions were to occur again. We already see that off-policy algorithms are good candidates for safe interruptibility. As a matter of fact, Q-learning is safely interruptible under conditions on exploration.

3.2 Dynamic safe interruptibility

In a multi-agent system, the outcome of an action depends on the joint action. Therefore, it is not possible to define an optimal policy for an agent without knowing the policies of all agents. Besides, convergence to a Nash equilibrium, a situation where no agent has an interest in changing policies, is generally not guaranteed, even for suboptimal equilibria on simple games [27, 18]. The previous definition of safe interruptibility critically relies on optimality of the learned policy, and is therefore not suitable for our problem since most algorithms lack convergence guarantees to these optimal behaviors. Therefore, we introduce below dynamic safe interruptibility, which focuses on preserving the dynamics of the system.

Definition 2. (Dynamic Safe Interruptibility) Consider a multi-agent learning framework (S, A, T, r, m) with Q-values Q(i)_t : S × A(i) → R at time t ∈ ℕ. The agents follow the interruptible learning policy INT^θ(π^ε) to generate a sequence E = (xt, at, rt, yt)t∈ℕ and learn on the processed sequence P(E). This framework is said to be dynamically safely interruptible if for any initiation function I and any interruption policy π^INT:

1. ∃θ such that (θt → 1 when t → ∞) and (∀s ∈ S, ∀a ∈ A, ∀T > 0, ∃t > T such that st = s, at = a)

2.
∀i ∈ {1, ..., m}, ∀t > 0, ∀st ∈ S, ∀at ∈ A(i), ∀Q ∈ R^{S×A(i)}:

P(Q(i)_{t+1} = Q | Q(1)_t, ..., Q(m)_t, st, at, θ) = P(Q(i)_{t+1} = Q | Q(1)_t, ..., Q(m)_t, st, at)

We say that sequences θ that satisfy the first condition are admissible.

When θ satisfies condition (1), the learning policy is said to achieve infinite exploration. This definition insists on the fact that the values estimated for each action should not depend on the interruptions. In particular, it ensures the three following properties, which are very natural when thinking about safe interruptibility:

• Interruptions do not prevent exploration.
• If we sample an experience from E, then each agent learns the same thing as if all agents were following non-interruptible policies.
• The fixed points of the learning rule, namely the Qeq such that Q(i)_eq(x, a) = E[Q(i)_{t+1}(x, a) | Qt = Qeq, x, a, θ] for all (x, a) ∈ S × A(i), do not depend on θ, so the agents' Q-maps will not converge to equilibrium situations that were impossible in the non-interruptible setting.

Yet, interruptions can lead to some state-action pairs being updated more often than others, especially when they tend to push the agents towards specific states. Therefore, when there are several possible equilibria, it is possible that interruptions bias the Q-values towards one of them. Definition 2 suggests that dynamic safe interruptibility cannot be achieved if the update rule directly depends on θ, which is why we introduce neutral learning rules.

Definition 3. (Neutral Learning Rule) We say that a multi-agent reinforcement learning framework is neutral if:

1. F is independent of θ

2.
Every experience e in E is independent of θ conditionally on (x, a, Q), where a is the joint action.

Q-learning is an example of a neutral learning rule because the update does not depend on θ and the experiences only contain (x, a, y, r), where y and r are independent of θ conditionally on (x, a). On the other hand, the second condition rules out direct uses of algorithms like SARSA, where experience samples contain an action sampled from the current learning policy, which depends on θ. However, a variant that would sample from π^ε_i instead of INT^θ(π^ε_i) (as introduced in [16]) would be a neutral learning rule. As we will see in Corollary 2.1, neutral learning rules ensure that each agent taken independently from the others verifies dynamic safe interruptibility.

4 Exploration

In order to hope for convergence of the Q-values to the optimal ones, agents need to fully explore the environment. In short, every state should be visited infinitely often and every action should be tried infinitely often in every state [19] in order not to miss states and actions that could yield high rewards.

Definition 4. (Interruption compatible ε) Let (S, A, T, r, m) be any distributed agent system where each agent follows the learning policy π^ε_i. We say that the sequence ε is compatible with interruptions if εt → 0 and ∃θ such that ∀i ∈ {1, .., m}, both π^ε_i and INT^θ(π^ε_i) achieve infinite exploration.

Sequences ε that are compatible with interruptions are fundamental to ensure both regular and dynamic safe interruptibility when following an ε-greedy policy. Indeed, if ε is not compatible with interruptions, then it is not possible to find any sequence θ such that the first condition of dynamic safe interruptibility is satisfied.
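As a quick numerical illustration that such pairs (ε, θ) can exist (our own sketch, with arbitrary constants c = c′ = 1): pairing a schedule like εt = c/log t with θt = 1 − c′/log t leaves, at every step, a probability of at least εt(1 − θt) = cc′/log²t of a non-interrupted exploratory step, and the partial sums of this series grow without bound, unlike those of a summable series such as 1/t²:

```python
import math

def explore_prob(t, c=1.0, c_prime=1.0):
    """Lower bound on a non-interrupted exploratory step at time t >= 2:
    epsilon_t * (1 - theta_t) = c * c_prime / log(t)^2."""
    return (c / math.log(t)) * (c_prime / math.log(t))

# Partial sums keep growing without bound (the series diverges).
partial = [sum(explore_prob(t) for t in range(2, n)) for n in (10**2, 10**4, 10**5)]
print(partial)
```

A divergent sum of per-step exploration probabilities is the standard Borel-Cantelli-style ingredient for infinite exploration; the precise statement and admissible schedules are given by the theorem that follows.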
The following theorem proves the existence of such ε and gives examples of ε and θ that satisfy the conditions.

Theorem 1. Let c ∈ (0, 1] and let nt(s) be the number of times the agents are in state s before time t. Then the two following choices of ε are compatible with interruptions:

• ∀t ∈ ℕ, ∀s ∈ S, εt(s) = c / nt(s)^{1/m}.
• ∀t ∈ ℕ, εt = c / log(t).

Examples of admissible θ are θt(s) = 1 − c′ / nt(s)^{1/m} for the first choice and θt = 1 − c′ / log(t) for the second one.

Note that we do not need to make any assumption on the update rule or even on the framework. We only assume that agents follow an ε-greedy policy. The assumption on ε may look very restrictive (the convergence of ε and θ is really slow), but it is designed to ensure infinite exploration in the worst case, in which the operator tries to interrupt all agents at every step. In practical applications, this should not be the case and a faster convergence rate may be used.

5 Joint Action Learners

We first study interruptibility in a framework in which each agent observes the outcome of the joint action instead of observing only its own. This is called the joint action learner framework [5] and it has nice convergence properties (e.g., there are many update rules for which it converges [13, 25]). A standard assumption in this context is that agents cannot establish a strategy with the others: otherwise, the system can act as a centralized system. In order to maintain Q-values based on the joint actions, we need to make the standard assumption that actions are fully observable [12].

Assumption 1. Actions are fully observable, which means that at the end of each turn, each agent knows precisely the tuple of actions a ∈ A1 × ... 
× Am that have been performed by all agents.

Definition 5. (JAL) A multi-agent system is made of joint action learners (JAL) if for all i ∈ {1, .., m}: Q(i) : S × A → R.

Joint action learners can observe the actions of all agents: each agent is able to associate the changes of states and rewards with the joint action and accurately update its Q-map. Therefore, dynamic safe interruptibility is ensured with minimal conditions on the update rule, as long as there is infinite exploration.

Theorem 2. Joint action learners with a neutral learning rule verify dynamic safe interruptibility if the sequence ε is compatible with interruptions.

Proof. Given a triplet < I(i), θ, π^INT_i >, we know that INT^θ(π) achieves infinite exploration because ε is compatible with interruptions. For the second point of Definition 2, we consider an experience tuple et = (xt, at, rt, yt) and show that the probability of evolution of the Q-values at time t + 1 does not depend on θ, because yt and rt are independent of θ conditionally on (xt, at). We write Q̃m_t = (Q(1)_t, ..., Q(m)_t) and can then derive the following equalities for all q ∈ R^{|S|×|A|}:

P(Q(i)_{t+1}(xt, at) = q | Q̃m_t, xt, at, θt)
  = Σ_{(r,y)∈R×S} P(F(xt, at, r, y, Q̃m_t) = q, yt = y, rt = r | Q̃m_t, xt, at, θt)
  = Σ_{(r,y)∈R×S} P(F(xt, at, rt, yt, Q̃m_t) = q | Q̃m_t, xt, at, rt, yt, θt) · P(yt = y, rt = r | Q̃m_t, xt, at, θt)
  = Σ_{(r,y)∈R×S} P(F(xt, at, rt, yt, Q̃m_t) = q | Q̃m_t, xt, at, rt, yt) · P(yt = y, rt = r | Q̃m_t, xt, at)

The last step comes from two facts. The first is that F is independent of θ conditionally on (Q̃m_t, xt, at) (by assumption). The second is that (yt, rt) are independent of θ conditionally on (xt, at), because at is the joint action and the interruptions only affect the choice of the actions through a change in the policy.
It follows that P(Q(i)_{t+1}(xt, at) = q | Q̃m_t, xt, at, θt) = P(Q(i)_{t+1}(xt, at) = q | Q̃m_t, xt, at). Since only one entry is updated per step, ∀Q ∈ R^{S×A(i)}, P(Q(i)_{t+1} = Q | Q̃m_t, xt, at, θt) = P(Q(i)_{t+1} = Q | Q̃m_t, xt, at).

Corollary 2.1. A single agent with a neutral learning rule and a sequence ε compatible with interruptions verifies dynamic safe interruptibility.

Theorem 2 and Corollary 2.1 taken together highlight the fact that joint action learners are not very sensitive to interruptions and that, in this framework, if each agent verifies dynamic safe interruptibility, then the whole system does.

The question of selecting an action based on the Q-values remains open. In a cooperative setting with a unique equilibrium, agents can take the action that maximizes their Q-value. When there are several joint actions with the same value, coordination mechanisms are needed to make sure that all agents play according to the same strategy [4]. Approaches that rely on anticipating the strategy of the opponent [23] would introduce a dependence on interruptions in the action selection mechanism. Therefore, the definition of dynamic safe interruptibility should be extended to include these cases by requiring that any quantity the policy depends on (and not just the Q-values) should satisfy condition (2) of dynamic safe interruptibility. In non-cooperative games, neutral rules such as Nash-Q or minimax Q-learning [13] can be used, but they require each agent to know the Q-maps of the others.

6 Independent Learners

It is not always possible to use joint action learners in practice, as the training is very expensive due to the very large state-action space.
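This cost gap can be made concrete by counting Q-map entries. In an illustrative count of our own (assuming, for simplicity, that all agents share the same action-set size), one agent's joint Q-map stores |S| · |A|^m entries, against |S| · |Ai| for an independent learner:

```python
def q_map_entries(n_states, n_actions, n_agents, joint):
    """Number of entries in one agent's tabular Q-map."""
    return n_states * (n_actions ** n_agents if joint else n_actions)

# Hypothetical sizes: 100 states, 5 actions per agent, 6 agents.
print(q_map_entries(100, 5, 6, joint=True))   # 1562500 entries per joint action learner
print(q_map_entries(100, 5, 6, joint=False))  # 500 entries per independent learner
```

The exponential factor in the number of agents is what makes the independent learner framework, introduced next, attractive in practice.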
In many real-world applications, multi-agent systems use independent learners that do not explicitly coordinate [6, 21]. Rather, they rely on the fact that the agents will adapt to each other and that learning will converge to an optimum. This is not guaranteed theoretically, and there can in fact be many problems [14], but it is often true empirically [24]. More specifically, Assumption 1 (fully observable actions) is not required anymore. This framework can be used either when the actions of other agents cannot be observed (for example, when several actions can have the same outcome) or when there are too many agents, because it is faster to train. In this case, we define the Q-values on a smaller space.

Definition 6. (IL) A multi-agent system is made of independent learners (IL) if for all i ∈ {1, .., m}, Q(i) : S × Ai → R.

This reduces the ability of agents to distinguish why the same state-action pair yields different rewards: they can only associate a change in reward with randomness of the environment. The agents learn as if they were alone, and they learn the best response to an environment in which agents can be interrupted. This is exactly what we are trying to avoid. In other words, the learning depends on the joint policy followed by all the agents, which itself depends on θ.

6.1 Independent Learners on matrix games

Theorem 3. Independent Q-learners with a neutral learning rule and a sequence ε compatible with interruptions do not verify dynamic safe interruptibility.

Proof. Consider a setting with two agents a and b that can each perform two actions: 0 and 1. They get a reward of 1 if the joint action played is (a0, b0) or (a1, b1), and a reward of 0 otherwise. Agents use Q-learning, which is a neutral learning rule. Let ε be such that INT^θ(π^ε) achieves infinite exploration. We consider the interruption policies π^INT_a = a0 and π^INT_b = b1 with probability 1.
Since there is only one state, we omit it and set γ = 0 (see Equation 1). We assume that the initiation function is equal to 1 at each step, so the probability of actually being interrupted at time t is θt for each agent. We fix a time t > 0, define q = (1 − α)Q(a)_t(a0) + α, and assume that Q(b)_t(b1) > Q(b)_t(b0). Therefore P(Q(a)_{t+1}(a0) = q | Q(a)_t, Q(b)_t, a(a)_t = a0, θt) = P(rt = 1 | Q(a)_t, Q(b)_t, a(a)_t = a0, θt) = P(a(b)_t = b0 | Q(a)_t, Q(b)_t, a(a)_t = a0, θt) = (ε/2)(1 − θt), which depends on θt, so the framework does not verify dynamic safe interruptibility.

Claus and Boutilier [5] studied very simple matrix games and showed that the Q-maps do not converge but that equilibria are played with probability 1 in the limit. A consequence of Theorem 3 is that even this weak notion of convergence does not hold for independent learners that can be interrupted.

6.2 Interruptions-aware Independent Learners

Without communication or extra information, independent learners cannot distinguish when the environment is interrupted and when it is not. As shown in Theorem 3, interruptions will therefore affect the way agents learn, because the same action (only their own) can have different rewards depending on the actions of the other agents, which themselves depend on whether they have been interrupted or not. This explains the need for the following assumption.

Assumption 2. At the end of each step, before updating the Q-values, each agent receives a signal that indicates whether an agent has been interrupted or not during this step.

This assumption is realistic because the agents already get a reward signal and observe a new state from the environment at each step.
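Assumption 2 enables a simple pruning of the experience sequence, formalized below as the processing function P_INT: drop every experience from a step in which some agent was interrupted. A minimal sketch of our own (the per-step flag and the two-agent trace are illustrative):

```python
def prune_interrupted(experiences):
    """P_INT(E): keep only the experiences of steps where no agent
    was interrupted (interruption flag Theta_t == 0)."""
    return [(x, a, r, y) for (x, a, r, y, theta_t) in experiences if theta_t == 0]

# Hypothetical two-agent trace; the flag is 1 if ANY agent was interrupted.
E = [
    (0, (0, 0), 1.0, 0, 0),  # no interruption: kept
    (0, (0, 1), 0.0, 0, 1),  # agent b interrupted: pruned
    (0, (1, 1), 1.0, 0, 0),  # no interruption: kept
]
print(prune_interrupted(E))  # [(0, (0, 0), 1.0, 0), (0, (1, 1), 1.0, 0)]
```

The agents then learn on the pruned list only, so interrupted joint actions never enter the update rule.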
Therefore, they interact with the environment, and the interruption signal could be given to the agent in the same way that the reward signal is. If Assumption 2 holds, it is possible to remove histories associated with interruptions.

Definition 7. (Interruption Processing Function) The processing function that prunes interrupted observations is P_INT(E) = (e_t)_{t ∈ N : Θ_t = 0}, where Θ_t = 0 if no agent has been interrupted at time t and Θ_t = 1 otherwise.

Pruning observations has an impact on the empirical transition probabilities in the sequence. For example, it is possible to bias the equilibrium by removing all transitions that lead to and start from a specific state, thus making the agent believe this state is unreachable.3 Under our model of interruptions, we show in the following lemma that pruning interrupted observations adequately removes the dependency of the empirical outcome on interruptions (conditionally on the current state and action).

Lemma 1. Let i ∈ {1, ..., m} be an agent. For any admissible θ used to generate the experiences E and any e = (y, r, x, a_i, Q) ∈ P_INT(E), we have P(y, r | x, a_i, Q, θ) = P(y, r | x, a_i, Q).

This lemma justifies our pruning method and is the key step to prove the following theorem.

Theorem 4. Independent learners with processing function P_INT, a neutral update rule and a sequence ε compatible with interruptions verify dynamic safe interruptibility.

Proof. (Sketch) Infinite exploration still holds because the proof of Theorem 1 actually used the fact that even when removing all interrupted events, infinite exploration is still achieved.
Then, the proof is similar to that of Theorem 2, but we have to prove that the transition probabilities conditionally on the state and action of a given agent in the processed sequence are the same as in an environment where agents cannot be interrupted, which is proven by Lemma 1.

3 The example at https://agentfoundations.org/item?id=836 clearly illustrates this problem.

7 Concluding Remarks

The progress of AI is raising a lot of concerns4. In particular, it is becoming clear that keeping an AI system under control requires more than just an off switch. We introduce in this paper dynamic safe interruptibility, which we believe is the right notion to reason about the safety of multi-agent systems that do not communicate. In particular, it ensures that infinite exploration and the one-step learning dynamics are preserved, two essential guarantees when learning in the non-stationary environment of Markov games.

When trying to design a safely interruptible system for a single agent, using off-policy methods is generally a good idea because the interruptions only impact the action selection, so they should not impact the learning. For multi-agent systems, minimax is a good candidate for the action selection mechanism because it is not impacted by the actions of other agents and only tries to maximize the reward of the agent in the worst possible case.

A natural extension of our work would be to study dynamic safe interruptibility when Q-maps are replaced by neural networks [22, 15], which is a widely used framework in practice. In this setting, the neural network may overfit states that agents are pushed to by interruptions. A smart experience replay mechanism that would preferentially pick observations for which the agents have not been interrupted for a long time is likely to solve this issue.
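A minimal sketch of how the pruning function P_INT from Definition 7 could be combined with an experience buffer: each transition is tagged with the interruption signal of Assumption 2, and interrupted steps are filtered out before sampling updates. The `Experience` layout and function names are assumptions made for illustration, not the paper's implementation.

```python
import random
from collections import namedtuple

# One experience: next state y, reward r, state x, the agent's own
# action a, plus the interruption signal of Assumption 2.
Experience = namedtuple("Experience", ["y", "r", "x", "a", "interrupted"])

def prune_interrupted(buffer):
    """P_INT of Definition 7: keep only the steps at which no agent
    was interrupted."""
    return [e for e in buffer if not e.interrupted]

def sample_batch(buffer, batch_size, rng=random):
    """Sample a minibatch for the Q-update from the pruned buffer."""
    clean = prune_interrupted(buffer)
    return rng.sample(clean, min(batch_size, len(clean)))
```

Because updates are computed only from the pruned buffer, the empirical transition probabilities seen by the learner no longer depend on θ, which mirrors what Lemma 1 establishes.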
More generally, experience replay mechanisms that compose well with safe interruptibility could compensate for the extra amount of exploration needed by safely interruptible learning by being more efficient with data. Thus, they are critical to make these techniques practical. Since dynamic safe interruptibility does not require proven convergence to the optimal solution, we argue that it is a good definition to study the interruptibility problem when using function approximators.

The results in this paper indicate that safe interruptibility may not be achievable for systems in which agents do not communicate at all. This means that, revisiting the cars example, some global norms of communication would need to be defined to "implement" safe interruptibility.

We address additional remarks in the section "Additional remarks" of the extended paper, which can be found in the supplementary material.

Acknowledgment. This work has been supported in part by the European ERC (Grant 339539 - AOC) and by the Swiss National Science Foundation (Grant 200021 169588 TARBDA).

4 https://futureoflife.org/ai-principles/ gives a list of principles that AI researchers should keep in mind when developing their systems.

Bibliography

[1] Business Insider: Google has developed a "big red button" that can be used to interrupt artificial intelligence and stop it from causing harm. URL: http://www.businessinsider.fr/uk/google-deepmind-develops-a-big-red-button-to-stop-dangerous-ais-causing-harm-2016-6.

[2] Newsweek: Google's "big red button" could save the world. URL: http://www.newsweek.com/google-big-red-button-ai-artificial-intelligence-save-world-elon-musk-46675.

[3] Wired: Google's "big red" killswitch could prevent an AI uprising.
URL: http://www.wired.co.uk/article/google-red-button-killswitch-artificial-intelligence.

[4] Craig Boutilier. Planning, learning and coordination in multiagent decision processes. In Proceedings of the 6th Conference on Theoretical Aspects of Rationality and Knowledge, pages 195–210. Morgan Kaufmann Publishers Inc., 1996.

[5] Caroline Claus and Craig Boutilier. The dynamics of reinforcement learning in cooperative multiagent systems. AAAI/IAAI, pages 746–752, 1998.

[6] Robert H Crites and Andrew G Barto. Elevator group control using multiple reinforcement learning agents. Machine Learning, 33(2-3):235–262, 1998.

[7] Jakob Foerster, Yannis M Assael, Nando de Freitas, and Shimon Whiteson. Learning to communicate with deep multi-agent reinforcement learning. In Advances in Neural Information Processing Systems, pages 2137–2145, 2016.

[8] Ben Goertzel and Cassio Pennachin. Artificial general intelligence, volume 2. Springer, 2007.

[9] Leslie Lamport, Robert Shostak, and Marshall Pease. The Byzantine generals problem. ACM Transactions on Programming Languages and Systems (TOPLAS), 4(3):382–401, 1982.

[10] Tor Lattimore and Marcus Hutter. Asymptotically optimal agents. In International Conference on Algorithmic Learning Theory, pages 368–382. Springer, 2011.

[11] Michael L Littman. Markov games as a framework for multi-agent reinforcement learning. In Proceedings of the Eleventh International Conference on Machine Learning, volume 157, pages 157–163, 1994.

[12] Michael L Littman. Friend-or-foe Q-learning in general-sum games. In ICML, volume 1, pages 322–328, 2001.

[13] Michael L Littman. Value-function reinforcement learning in Markov games. Cognitive Systems Research, 2(1):55–66, 2001.

[14] Laetitia Matignon, Guillaume J Laurent, and Nadine Le Fort-Piat.
Independent reinforcement learners in cooperative Markov games: a survey regarding coordination problems. The Knowledge Engineering Review, 27(1):1–31, 2012.

[15] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Alex Graves, Ioannis Antonoglou, Daan Wierstra, and Martin Riedmiller. Playing Atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602, 2013.

[16] Laurent Orseau and Stuart Armstrong. Safely interruptible agents. In Uncertainty in Artificial Intelligence: 32nd Conference (UAI 2016), edited by Alexander Ihler and Dominik Janzing, pages 557–566, 2016.

[17] Liviu Panait and Sean Luke. Cooperative multi-agent learning: The state of the art. Autonomous Agents and Multi-Agent Systems, 11(3):387–434, 2005.

[18] Eduardo Rodrigues Gomes and Ryszard Kowalczyk. Dynamic analysis of multiagent Q-learning with ε-greedy exploration. In Proceedings of the 26th Annual International Conference on Machine Learning, pages 369–376. ACM, 2009.

[19] Satinder Singh, Tommi Jaakkola, Michael L Littman, and Csaba Szepesvári. Convergence results for single-step on-policy reinforcement-learning algorithms. Machine Learning, 38(3):287–308, 2000.

[20] Richard S Sutton and Andrew G Barto. Reinforcement Learning: An Introduction, volume 1. MIT Press, Cambridge, 1998.

[21] Ardi Tampuu, Tambet Matiisen, Dorian Kodelja, Ilya Kuzovkin, Kristjan Korjus, Juhan Aru, Jaan Aru, and Raul Vicente. Multiagent cooperation and competition with deep reinforcement learning. arXiv preprint arXiv:1511.08779, 2015.

[22] Gerald Tesauro. Temporal difference learning and TD-Gammon. Communications of the ACM, 38(3):58–68, 1995.

[23] Gerald Tesauro. Extending Q-learning to general adaptive multi-agent systems. In Advances in Neural Information Processing Systems, pages 871–878, 2004.

[24] Gerald Tesauro and Jeffrey O Kephart.
Pricing in agent economies using multi-agent Q-learning. Autonomous Agents and Multi-Agent Systems, 5(3):289–304, 2002.

[25] Xiaofeng Wang and Tuomas Sandholm. Reinforcement learning to play an optimal Nash equilibrium in team Markov games. In NIPS, volume 2, pages 1571–1578, 2002.

[26] Christopher JCH Watkins and Peter Dayan. Q-learning. Machine Learning, 8(3-4):279–292, 1992.

[27] Michael Wunder, Michael L Littman, and Monica Babes. Classes of multiagent Q-learning dynamics with epsilon-greedy exploration. In Proceedings of the 27th International Conference on Machine Learning (ICML-10), pages 1167–1174, 2010.