{"title": "Non-Cooperative Inverse Reinforcement Learning", "book": "Advances in Neural Information Processing Systems", "page_first": 9487, "page_last": 9497, "abstract": "Making decisions in the presence of a strategic opponent requires one to take into account the opponent\u2019s ability to actively mask its intended objective. To describe such strategic situations, we introduce the non-cooperative inverse reinforcement learning (N-CIRL) formalism. The N-CIRL formalism consists of two agents with completely misaligned objectives, where only one of the agents knows the true objective function. Formally, we model the N-CIRL formalism as a zero-sum Markov game with one-sided incomplete information. Through interacting with the more informed player, the less informed player attempts to both infer and optimize the true objective function. As a result of the one-sided incomplete information, the multi-stage game can be decomposed into a sequence of single- stage games expressed by a recursive formula. Solving this recursive formula yields the value of the N-CIRL game and the more informed player\u2019s equilibrium strategy. Another recursive formula, constructed by forming an auxiliary game, termed the dual game, yields the less informed player\u2019s strategy. Building upon these two recursive formulas, we develop a computationally tractable algorithm to approximately solve for the equilibrium strategies. 
Finally, we demonstrate the benefits of our N-CIRL formalism over the existing multi-agent IRL formalism via extensive numerical simulation in a novel cyber security setting.", "full_text": "Non-Cooperative Inverse Reinforcement Learning

Xiangyuan Zhang    Kaiqing Zhang    Erik Miehling    Tamer Başar

Coordinated Science Laboratory
University of Illinois at Urbana-Champaign
{xz7,kzhang66,miehling,basar1}@illinois.edu

Abstract

Making decisions in the presence of a strategic opponent requires one to take into account the opponent's ability to actively mask its intended objective. To describe such strategic situations, we introduce the non-cooperative inverse reinforcement learning (N-CIRL) formalism. The N-CIRL formalism consists of two agents with completely misaligned objectives, where only one of the agents knows the true objective function. Formally, we model the N-CIRL formalism as a zero-sum Markov game with one-sided incomplete information. Through interacting with the more informed player, the less informed player attempts to both infer, and act according to, the true objective function. As a result of the one-sided incomplete information, the multi-stage game can be decomposed into a sequence of single-stage games expressed by a recursive formula. Solving this recursive formula yields the value of the N-CIRL game and the more informed player's equilibrium strategy. Another recursive formula, constructed by forming an auxiliary game, termed the dual game, yields the less informed player's strategy. Building upon these two recursive formulas, we develop a computationally tractable algorithm to approximately solve for the equilibrium strategies. 
Finally, we demonstrate the\nbene\ufb01ts of our N-CIRL formalism over the existing multi-agent IRL formalism\nvia extensive numerical simulation in a novel cyber security setting.\n\n1\n\nIntroduction\n\nIn any decision-making problem, the decision-maker\u2019s goal is characterized by some underlying,\npotentially unknown, objective function. In machine learning, ensuring that the learning agent does\nwhat the human intends it to do requires speci\ufb01cation of a correct objective function, a problem\nknown as value alignment [42, 37]. Solving this problem is nontrivial even in the simplest (single-\nagent) reinforcement learning (RL) settings [3] and becomes even more challenging in multi-agent\nenvironments [19, 45, 44]. Failing to specify a correct objective function can lead to unexpected and\npotentially dangerous behavior [38].\nInstead of specifying the correct objective function, a growing body of research has taken an alter-\nnative perspective of letting the agent learn the intended task from observed behavior of other agents\nand/or human experts, a concept known as inverse reinforcement learning (IRL) [31]. IRL describes\na setting where one agent is attempting to learn the true reward function by observing trajectories of\nsample behavior of another agent, termed demonstrations, under the assumption that the observed\nagent is acting (optimally) according to the true reward function. More recently, the concept of co-\noperative inverse reinforcement learning (CIRL) was introduced in [9]. Instead of passively learning\nfrom demonstrations (as is the case in IRL), CIRL allows for agents to interact during the learning\nprocess, which results in improved performance over IRL. CIRL inherently assumes that the two\nagents are on the same team, that is, the expert is actively trying to help the agent learn and achieve\nsome common goal. 
33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

However, in many practical multi-agent decision-making problems, the objectives of the agents may be misaligned, and in some settings completely opposed [22, 21, 40, 43] (e.g., zero-sum interactions in cyber security [2, 27]). In such settings, the expert is still trying to achieve its objective, but may act in a way so as to make the agent think it has a different objective. Development of a non-cooperative analogue of CIRL to describe learning in these settings has not yet been investigated.

In this paper, we introduce the non-cooperative inverse reinforcement learning (N-CIRL) formalism. The N-CIRL formalism consists of two agents with completely misaligned objectives, where only one agent knows the true reward function. The problem is modeled as a zero-sum Markov game with one-sided incomplete information. In particular, at any stage of the game the information available to the player that does not know the true reward function, termed the less informed player, is contained within the information available to the player that knows the reward function, termed the more informed player. This one-sided information structure allows for a simplification in the form of the beliefs (compared to the fully general asymmetric information case [10]) and, more importantly, allows one to define strategies as the stage-wise solutions to two recursive equations [7, 35]. By taking advantage of the structure of these recursive equations, and their corresponding solutions (fixed points), computationally tractable algorithms for approximately computing the players' strategies are developed.

Our primary motivation for developing the N-CIRL formalism is cyber security. In particular, we are interested in settings where the attacker has some intent that is unknown to the defender. 
In reality, the motivation of attackers can vary significantly. For example, if the attacker is financially motivated, its goal may be to find personal payment details, whereas if the attacker is aiming to disrupt the normal operation of a system, reaching a computer responsible for controlling physical components may be of more interest. This variability in what the attacker values is captured by the N-CIRL formalism through an intent parameter that serves to parameterize the attacker's true reward function. The defender, who does not know the true intent, then faces the problem of learning the attacker's intent while simultaneously defending the network. The attacker, knowing this, aims to reach its goal but may behave in a way that makes its true intent as unclear as possible.1

Throughout the remainder of the paper, we will refer to the more informed player as the attacker and the less informed player as the defender. While we use the cyber security example throughout the paper, this is primarily for ease of exposition. The results presented here apply to any zero-sum Markov game setting where one player does not know the true reward.

Limitations of Existing IRL Approaches. The application of N-CIRL to cyber security is especially fitting due to the challenges associated with collecting useful attack data. Obtaining accurate attack logs is a computationally formidable task [16]. Furthermore, learning from sample equilibrium behavior, as is done in the multi-agent inverse reinforcement learning (MA-IRL) settings of [21, 41, 20], is only useful if the goal(s) do not change between learning and execution/deployment. Such an assumption is not appropriate in cyber security settings – the attacker's goal, as well as the overall system structure, may frequently change. This non-stationary behavior necessitates the ability to intertwine learning and execution. 
N-CIRL provides a formalism for specifying actions that adapt to the information revealed during the game. We demonstrate this adaptivity via an illustrative example in Sec. A and extensive numerical results in Sec. 5.

Contribution. The contribution of the present work is three-fold: 1) We propose a new formalism for IRL, termed N-CIRL, that describes how two competing agents make strategic decisions when only one of the agents possesses knowledge of the true reward; 2) By recognizing that N-CIRL is a zero-sum Markov game with one-sided incomplete information, we leverage the recursive structure of the game to develop a computationally tractable algorithm, termed non-cooperative point-based value iteration (NC-PBVI), for computing both players' strategies; 3) We demonstrate in a novel cyber security model that the adaptive strategies obtained from N-CIRL outperform strategies obtained from existing multi-agent IRL techniques.

2 Related Work

Decision-making when the agents are uncertain of the true objective function(s) has been extensively studied within the fields of both RL and game theory. One standard and popular way to infer the actual reward function is via inverse RL, the idea of which was first introduced by [17] under the title of inverse optimal control. Later, [31] introduced the notion of IRL with the goal of inferring the reward function being optimized by observing the behavior of an actor, termed an expert, over time [31, 1, 33, 8]. 

1An interesting real-world example of such behavior was Operation Fortitude in World War II [12].
Fundamental to the IRL setting is the assumption that the agent inferring the reward\npassively observes the expert\u2019s behavior, while the expert behaves optimally in its own interest\nwithout knowing that the agent will later use the observed behavior to learn.\nAs pointed out in [9], such an assumption is not valid in certain cooperative settings where the agent\nand the expert are able to interact in order to achieve some common objective. In fact, IRL-type\nsolutions were shown to be suboptimal and generally less effective at instilling the knowledge of\nthe expert to the agent [9]. As argued by [9], the value alignment problem with cooperative agents\nis more appropriately viewed as an interactive decision-making process. The proposed formalism,\ntermed CIRL, is formulated as a two-player game of partial information with a common reward\nfunction2. Due to the special structure of CIRL, the problem can be transformed into a partially\nobservable Markov decision process (POMDP), see [30, 9], allowing for single-agent RL algorithms\nto be applied. Further improvements in computational ef\ufb01ciency can be achieved by exploiting the\nfact that the expert expects the agent to respond optimally [23].\nInverse RL under a non-cooperative setting has not received as much attention as its cooperative\ncounterpart. A recent collection of work on multi-agent IRL (MA-IRL) [21, 41, 20] addresses the\nproblem of IRL in stochastic games with multiple (usually more than two) agents. Distinct from our\nN-CIRL setting, MA-IRL aims to recover the reward function of multiple agents under the assump-\ntion that the demonstrations are generated from the Nash equilibrium strategies. Moreover, in the\nMA-IRL formalism, agents determine their strategies based on the inferred reward function, i.e., re-\ngarding the inferred reward as some \ufb01xed ground truth. In contrast, under our N-CIRL setting, only\none agent is unaware of the true reward function. 
Furthermore, the goal of the less informed player\nin N-CIRL goes beyond just inferring the true reward function; its ultimate goal is to determine an\noptimal strategy against a worst-case opponent who possesses a private reward.\nFrom a game theoretic perspective, the N-CIRL formalism can be viewed as a stochastic dynamic\ngame with asymmetric information, see [6, 28, 29, 4] and references therein. In particular, N-CIRL\nlies within the class of games with one-sided incomplete information [35, 39, 34, 14, 13]. This type\nof game allows for a simpli\ufb01ed belief and allows the game to be decomposed into a sequence of\nsingle-stage games [35, 13]. In particular, our N-CIRL formalism can be recognized as one of the\ngame settings discussed in [35], in which a dual game was formulated to solve for the less informed\nplayer\u2019s strategy. Our algorithm for computing the defender\u2019s strategy is built upon this formulation,\nand can be viewed as one way to approximately solve the dual game.\n\n3 Non-Cooperative Inverse Reinforcement Learning\n\nIn this section, we introduce the N-CIRL formalism and describe its information structure. As will\nbe shown, the information structure of N-CIRL admits compact information states for each player.\n\n3.1 N-CIRL Formulation\n\nThe N-CIRL formalism is modeled as a two-player zero-sum Markov game with one-sided incom-\nplete information. In particular, the attacker knows the true reward function, while the defender does\nnot. In the context of the cyber security setting of this paper, the reward function is assumed to be\nparameterized by an intent parameter that is only known to the attacker. 
The N-CIRL formalism is described by the tuple ⟨S, {A, D}, T(· | ·, ·, ·), {Θ, R(·, ·, ·, ·; ·)}, P0(·, ·), γ⟩, where

• S is the finite set of states; s ∈ S.
• A is the finite set of actions for the attacker A; a ∈ A.
• D is the finite set of actions for the defender D; d ∈ D.
• T(s′ | s, a, d) is the conditional distribution of the next state s′ given current state s and actions a, d.
• Θ is the finite set of intent parameters that parameterize the reward function; the true intent parameter θ ∈ Θ is only observed by the attacker.
• R(s, a, d, s′; θ) is the parameterized reward function that maps the current state s ∈ S, actions (a, d) ∈ A × D, next state s′ ∈ S, and parameter θ ∈ Θ to a reward for the attacker.
• P0 is the distribution over the initial state s0 and the true reward parameter θ, assumed to be common knowledge between A and D.
• γ ∈ [0, 1) is the discount factor.

2Also known as a team problem [24].

The game proceeds as follows. Initially, a state-parameter pair (s0, θ) is sampled from the prior distribution P0. The state s0 is publicly observed by both players, whereas the intent parameter θ is only observed by the attacker.3 For each stage, the attacker and the defender act simultaneously, choosing actions a ∈ A and d ∈ D. Note that the action sets may be state-dependent, i.e., a ∈ A(s), d ∈ D(s). Given both actions, the current state s transitions to a successor state s′ according to the transition model T(s′ | s, a, d). The attacker receives a bounded reward R(s, a, d, s′; θ); the defender receives the reward −R(s, a, d, s′; θ) (incurs a cost R(s, a, d, s′; θ)). 
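Concretely, the tuple above can be collected into a small simulator. The following is a minimal sketch under our own encoding assumptions (it is not from the paper): `T[s][(a, d)]` maps successor states to probabilities, `R` is a callable, and `P0` maps `(s0, theta)` pairs to probabilities.

```python
import random

class NCIRLGame:
    """Minimal container for the N-CIRL tuple <S, {A, D}, T, {Theta, R}, P0, gamma>.

    Encodings are illustrative: T[s][(a, d)] maps successor states to
    probabilities, R is a callable R(s, a, d, s2, theta), and P0 maps
    (s0, theta) pairs to probabilities.
    """

    def __init__(self, S, A, D, T, Theta, R, P0, gamma):
        self.S, self.A, self.D = S, A, D
        self.T, self.Theta, self.R = T, Theta, R
        self.P0, self.gamma = P0, gamma

    def reset(self, rng=random):
        # Sample (s0, theta) ~ P0; s0 is public, theta is private to the attacker.
        pairs, probs = zip(*self.P0.items())
        self.s, self.theta = rng.choices(pairs, weights=probs)[0]
        return self.s

    def step(self, a, d, rng=random):
        # Simultaneous moves; the successor state is drawn from T(. | s, a, d).
        succ = self.T[self.s][(a, d)]
        s2 = rng.choices(list(succ), weights=list(succ.values()))[0]
        r = self.R(self.s, a, d, s2, self.theta)  # attacker's reward = defender's cost
        self.s = s2
        return s2, r
```

Both players would observe the returned `(s2, r)` pair only through `s2` and the played actions; per the model, realized rewards are not observed during play, so `r` here is bookkeeping for evaluation only.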
Neither player observes rewards during the game. Before each subsequent stage, both players are informed of the successor state s′ and the actions from the previous stage. While both players are aware of the current state, only the attacker is aware of the true intent parameter θ ∈ Θ. This results in the defender possessing incomplete information, requiring it to maintain a belief over the true intent parameter. The goals of the attacker and the defender are to maximize and minimize the expected γ-discounted accumulated reward induced by R, respectively.

3.2 The Information Structure of N-CIRL

The N-CIRL formalism falls within the class of partially observable stochastic games [10]. In such games, perfect recall ensures that behavioral strategies, i.e., strategies that mix over actions, are outcome equivalent to mixed strategies, i.e., strategies that mix over pure strategies [18]. As a result, players in N-CIRL can restrict attention to behavioral strategies, defined for each stage t as π^A_t : I^A_t → Δ(A) and π^D_t : I^D_t → Δ(D), where I^A_t (resp. I^D_t) represents the space of information available to the attacker (resp. defender) at stage t and Δ(A), Δ(D) represent distributions over actions. Given any realized information sequences I^A_t ∈ I^A_t and I^D_t ∈ I^D_t, represented as

I^A_t = (s0, θ, a0, d0, . . . , a_{t−1}, d_{t−1}, s_t),
I^D_t = (s0, a0, d0, . . . , a_{t−1}, d_{t−1}, s_t),

the defender's information is always contained within the attacker's information for any stage t, i.e., there is one-sided incomplete information. Furthermore, note that the attacker has complete information (it knows everything that has happened in the game); its information at stage t is the full history of the game at t, denoted by I_t = (s0, θ, a0, d0, . . . , a_{t−1}, d_{t−1}, s_t) ∈ I_t = I^A_t.

Information States. In general, games of incomplete information require players to reason over the entire belief hierarchy; that is, players' decisions not only depend on their beliefs on the state of nature, but also on their beliefs on others' beliefs on the state of nature, and so on [25, 10]. Fortunately, players do not need to resort to this infinite regress in games of one-sided incomplete information. Instead, each player is able to maintain a compact state of knowledge, termed an information state, that is sufficient for making optimal decisions. The more informed player maintains a pair consisting of the observable state and a distribution over the private state [35, 39]. The less informed player, through construction of a dual game (discussed in Sec. 4.2), maintains a pair consisting of the observable state and a vector (in Euclidean space) of size equal to the number of private states [7, 35]. In the context of N-CIRL, the attacker's information state at each stage is a pair, denoted by (s, b), in the space S × Δ(Θ), whereas the defender's information state at each stage (of the dual game) is a pair, denoted by (s, ζ), in the space S × R^|Θ|.

4 Solving N-CIRL

The theoretical results used to solve the CIRL problem [30, 9] do not extend to the N-CIRL setting. As outlined in [9], the form of CIRL allows one to convert the problem into a centralized control problem [30]. The problem can then be solved using existing techniques from reinforcement learning. In N-CIRL, such a conversion to a centralized control problem is not possible; one is instead faced with a dynamic game. 
As we show in this section, the one-sided incomplete information allows one to recursively define both the value of the game and the attacker's strategy. One can further recursively define the defender's strategy via the construction and sequential decomposition of a dual game. The two recursive formulas permit the development of a computational procedure, based on linear programming, for approximating both players' strategies.

3The intent parameter θ is further assumed to be fixed throughout the problem.

4.1 Sequential Decomposition

Solving a game involves finding strategies for all players, termed a strategy profile, such that the resulting interaction is in (Nash) equilibrium. A strategy profile for N-CIRL is defined as follows.

Definition 4.1 (Strategy Profile). A strategy profile, denoted by (σA, σD), is a pair of strategies σA = (π^A_0, π^A_1, . . .) and σD = (π^D_0, π^D_1, . . .), where π^A_t and π^D_t are behavioral strategies as defined in Sec. 3.2.

A simplification of behavioral strategies is given by strategies that only depend on the most recent information rather than the entire history. These strategies, termed one-stage strategies, are defined below.

Definition 4.2 (One-Stage Strategies). The one-stage strategies of the attacker and defender are denoted by π̄A : S × Θ → Δ(A) and π̄D : S → Δ(D), respectively. The pair (π̄A, π̄D) is termed a one-stage strategy profile.

Due to the information structure of N-CIRL, the attacker's information state (s, b) can be updated using one-stage strategies instead of the full strategy profile. 
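This information-state update is a Bayes rule over Θ, made precise in Lemma 1 below. As a preview, a minimal sketch with our own dict encoding (`b` maps each intent to its prior probability, `pi_A[(s, theta)]` maps actions to probabilities):

```python
def update_belief(b, s, a, pi_A):
    """Posterior over the intent parameter after observing attacker action a in state s.

    b: dict theta -> prior probability; pi_A[(s, theta)]: dict a -> probability.
    Implements b'(theta) proportional to pi_A(a | s, theta) * b(theta).
    """
    unnorm = {th: pi_A[(s, th)].get(a, 0.0) * p for th, p in b.items()}
    z = sum(unnorm.values())
    if z == 0.0:  # the action has zero probability under every intent
        raise ValueError("zero-probability action under the given one-stage strategy")
    return {th: u / z for th, u in unnorm.items()}
```

Note the update depends only on the attacker's one-stage strategy and action, not on the defender's strategy, which is exactly the simplification the one-sided information structure buys.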
In fact, as illustrated by Lemma 1, the update of the attacker's information state only depends on the attacker's one-stage strategy π̄A, the attacker's action a, and the successor state s′.

Lemma 1 (Information State Update). Given the attacker's one-stage strategy π̄A, the current attacker's information state (s, b) ∈ S × Δ(Θ), the attacker's action a ∈ A, and the successor state s′ ∈ S, the attacker's updated information state is (s′, b′) ∈ S × Δ(Θ), where the posterior b′ is computed via the function τ : S × Δ(Θ) × A → Δ(Θ), defined elementwise as

b′(ϑ) = τ_ϑ(s, b, a) = π̄A(a | s, ϑ) b(ϑ) / Σ_{ϑ′∈Θ} π̄A(a | s, ϑ′) b(ϑ′).   (1)

The attacker's information state (s, b) also leads to the following definition of the value function v : S × Δ(Θ) → R of the game

v(s, b) = max_{σA} min_{σD} E[ Σ_{t≥0} γ^t R(s_t, a_t, d_t, s_{t+1}; θ) | s0 = s, θ ∼ b(·) ],

which denotes the minimax accumulated reward if the initial state is s0 = s and the belief over Θ is b. Note that the value exists as all spaces in the game are finite and the discount factor lies in [0, 1) [35]. The value function v can be computed recursively via a sequential decomposition. In fact, it is given by the fixed point of a value backup operator [Gv](s, b), as illustrated in Proposition 1 below. To differentiate from the dual game to be introduced in Sec. 4.2, we refer to the original N-CIRL game as the primal game and to [Gv](s, b) as the primal backup operator.

Proposition 1 (Sequential Decomposition of Primal Game). 
The primal game can be sequentially decomposed into a sequence of single-stage games. Specifically, the primal value function v satisfies the following recursive formula

v(s, b) = [Gv](s, b) = max_{π̄A} min_{π̄D} { g_{π̄A,π̄D}(s, b) + γ V_{π̄A,π̄D}(v; s, b) },   (2)

where [Gv](s, b) is referred to as the primal value backup operator, and g_{π̄A,π̄D}(s, b), V_{π̄A,π̄D}(v; s, b) correspond to the instantaneous reward and the expected value of the continuation game, respectively, defined as

g_{π̄A,π̄D}(s, b) = Σ_{a,d,s′,ϑ} b(ϑ) π̄A(a | s, ϑ) π̄D(d | s) T(s′ | s, a, d) R(s, a, d, s′; ϑ),   (3)

V_{π̄A,π̄D}(v; s, b) = Σ_{a,d,s′,ϑ} b(ϑ) π̄A(a | s, ϑ) π̄D(d | s) T(s′ | s, a, d) v(s′, b′),   (4)

where b′ represents the posterior distribution on Θ as computed by Eq. (1).

For purposes of constructing an algorithm, we need to establish some properties of the backup operator defined in Proposition 1. The following lemma ensures that each application of the primal value backup yields a closer approximation of the value of the game.

Lemma 2 (Contraction of Primal Backup Operator). The primal value backup operator [Gv](s, b), defined in Eq. (2), is a contraction mapping. 
As a result, iterating the operator converges to the value of the primal game that solves the fixed point equation (2).

Though conceptually correct, iterating the backup operator [Gv](s, b) exactly does not lead to a computationally tractable algorithm, as the belief b lies in a continuous space with an infinite cardinality. Thus, an approximate value iteration algorithm is required for solving the fixed point equation (2). We will address this computational challenge in Sec. 4.3.

Another challenge in solving N-CIRL is that the fixed point problem, given by Eq. (2), cannot be solved by the defender. In fact, as pointed out in [35, Sec. 1.2], if the defender is unaware of the attacker's strategy, it cannot form the posterior on Θ. The following section discusses the formulation of an auxiliary game to address this challenge.

4.2 The Defender's Strategy

As shown by [7, 35], the defender's equilibrium strategy can be determined by construction of a dual game, characterized by a tuple ⟨S, {A, D}, T(· | ·, ·, ·), {Θ, R(·, ·, ·, ·; ·)}, ζ0, PS_0(·), γ⟩. Note that the sets S, A, D, Θ, the reward function R(·, ·, ·, ·; ·), the discount factor γ, and the state transition distribution T are identical to those in the primal game. The quantity ζ0 ∈ R^|Θ| is the parameter of the dual game, and PS_0(·) ∈ Δ(S) is the initial distribution of the state s0, which is obtained by marginalizing P0(·, ·) over θ, i.e., PS_0(s) = Σ_{θ∈Θ} P0(s, θ). 
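The two ingredients of the dual game that differ from the primal one, the marginal initial state distribution PS_0 and the ζ0-shifted reward, are mechanical to construct. A small sketch under our own dict encoding (not the paper's notation):

```python
def marginal_initial_state(P0):
    """P0 maps (s0, theta) -> probability; returns PS_0(s) = sum over theta of P0(s, theta)."""
    PS0 = {}
    for (s, _theta), p in P0.items():
        PS0[s] = PS0.get(s, 0.0) + p
    return PS0

def dual_reward(R, zeta0):
    """Stage reward of the dual game: R(s, a, d, s'; theta) + zeta0(theta)."""
    return lambda s, a, d, s2, theta: R(s, a, d, s2, theta) + zeta0[theta]
```

The shift ζ0(θ) is what lets the attacker, who now *chooses* θ rather than having it drawn from a prior, be penalized or rewarded for each choice of intent.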
The dual game proceeds as follows: at the initial stage, s0 is sampled from PS_0 and revealed to both players, and the attacker chooses some θ ∈ Θ; then the game is played identically to the primal one, namely, both players choose actions a ∈ A and d ∈ D simultaneously, and the state transitions from s to s′ following T(· | s, a, d). Both players are then informed of the chosen actions and the successor state s′. Furthermore, a reward of R(s, a, d, s′; θ) + ζ0(θ) is received by the attacker (thus −R(s, a, d, s′; θ) − ζ0(θ) is incurred by the defender). Note that the value of θ is decided and only known by the attacker, instead of being drawn from some probability distribution. This is one of the key differences from the primal game.

The value function of the dual game, denoted by w : S × R^|Θ| → R, is defined as the maximin γ-discounted accumulated reward received by the attacker, if the state starts from some s0 = s ∈ S and the game parameter is ζ0 = ζ ∈ R^|Θ|. The value w(s, ζ) exists since the dual game is finite [35]. Similarly as in Proposition 1, the dual game value function w also satisfies a recursive formula, as formally stated below.

Proposition 2 (Sequential Decomposition of Dual Game). The dual game can be decomposed into a sequence of single-stage games. Specifically, the dual value function w satisfies the following recursive formula

w(s, ζ) = [Hw](s, ζ) = min_{π̄D, ξ} max_{μ} { h_{π̄D, μ}(s, ζ) + γ W_{π̄D, μ}(w, ξ; s) },   (5)

where [Hw](s, ζ) is referred to as the dual value backup operator, π̄D(· | s) ∈ Δ(D) and ξ ∈ R^{S×A×Θ} are decision variables with ξ_{a,s} ∈ R^|Θ| the (a, s)th vector of ξ, and μ ∈ Δ(A × Θ). Moreover, h_{π̄D, μ}(s, ζ) and W_{π̄D, μ}(w, ξ; s) are defined as

h_{π̄D, μ}(s, ζ) := Σ_{a,ϑ} μ(a, ϑ) ( ζ(ϑ) + Σ_{d,s′} π̄D(d | s) T(s′ | s, a, d) R(s, a, d, s′; ϑ) ),   (6)

W_{π̄D, μ}(w, ξ; s) := Σ_{a,d,s′,ϑ} μ(a, ϑ) π̄D(d | s) T(s′ | s, a, d) ( w(s′, ξ_{a,s′}) − ξ_{a,s′}(ϑ) ).   (7)

The recursive formula above allows for a stage-wise calculation of the defender's strategy π̄D. In particular, by [35], the defender's equilibrium strategy obtained from the dual game is indeed its equilibrium strategy for the primal game. Moreover, the pair (s, ζ) is indeed the information state of the defender in the dual game. More importantly, the formula of Eq. (5) does not involve the update of the belief b, as opposed to Eq. (2). Instead, the vector ξ plays a similar role to the updated belief b′, and can be calculated by the defender as a decision variable. This way, as the defender plays the N-CIRL game, it can calculate its equilibrium strategy by only observing the attacker's action a and the successor state s′, with no need to know the attacker's strategy. 
Besides the defender's strategy π̄D, the decision variable μ ∈ Δ(A × Θ) in Eq. (5) is essentially the attacker's strategy in the dual game, i.e., given ζ, the attacker has the flexibility to choose both θ ∈ Θ and action a ∈ A.

The remaining issue is to solve the dual game by solving the fixed point equation (5). As with the primal counterpart, the dual value backup operator is also a contraction mapping, as shown below.

Lemma 3 (Contraction of Dual Backup Operator). The value backup operator [Hw](s, ζ), defined in Eq. (5), is a contraction mapping. As a result, iterating the operator converges to the dual game value that solves the fixed point equation (5).

By Lemma 3, the defender's strategy can be obtained by iterating the backup operator. However, as ζ and ξ both lie in continuous spaces, such an iterative approach is not computationally tractable. This motivates the approximate value iteration algorithm to be introduced next.

4.3 Computational Procedure

One standard algorithm for computing strategies in single-agent partially observable settings is the point-based value iteration (PBVI) algorithm [32]. The standard PBVI algorithm approximates the value function, which is convex in the beliefs, using a set of α-vectors. While the value functions v and w in N-CIRL also have desirable structure (as shown in [35], v : S × Δ(Θ) → R is concave on Δ(Θ) and w : S × R^|Θ| → R is convex on R^|Θ| for each s ∈ S), the standard PBVI algorithm does not directly apply to N-CIRL. The challenge arises from the update step of the α-vector set in PBVI, which requires carrying out one step of the Bellman update [32] for every action-observation pair of the agent. 
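One ingredient any such point-based scheme needs is a way to generalize values stored at finitely many points to unseen information states. As a loose illustration only, a crude inverse-distance average over stored pairs; this is our own stand-in, not the paper's scheme, whose interpolation exploits the concavity/convexity structure noted above:

```python
def interpolate_value(point_set, s, b, eps=1e-12):
    """Estimate v(s, b) from stored ((s, b), v) pairs at the same observable state.

    point_set: list of ((s, b_dict), v) entries. Uses inverse-distance
    weighting in the belief simplex -- a crude stand-in for structured
    interpolation that exploits concavity of v in b.
    """
    weights, total = 0.0, 0.0
    for (s_i, b_i), v_i in point_set:
        if s_i != s:
            continue  # only stored points at the same observable state contribute
        dist = sum(abs(b_i[t] - b[t]) for t in b)  # L1 distance between beliefs
        w = 1.0 / (dist + eps)
        weights += w
        total += w * v_i
    if weights == 0.0:
        raise ValueError("no stored points for state s")
    return total / weights
```

At a stored belief the estimate reproduces the stored value (up to eps); between stored beliefs it blends neighbors, which is the qualitative behavior a point-based approximation relies on.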
The corresponding Bellman update in N-CIRL is the primal backup operator in Eq. (2), which requires knowledge of the defender's strategy, something that is unknown to the attacker. To address this challenge, we develop a modified version of PBVI, termed non-cooperative PBVI (NC-PBVI), in which the value functions v and w are approximated by sets of information state-value pairs, i.e., ((s, b), v) and ((s, ζ), w), instead of α-vectors. Importantly, updating the sets only requires evaluations at individual information states, avoiding the need to know the opponent's strategy.
The evaluations can be approximated using linear programming. Specifically, to approximate the value function of the primal game, NC-PBVI updates the value at a given attacker's information state (s, b) by solving the primal backup operator of Eq. (2). Using a standard reformulation, the minimax operation in the primal backup operator can be approximately solved via a linear program, denoted by PA(s, b). Similarly, one can approximately solve the dual game's fixed-point equation, Eq. (5), at a given defender's information state (s, ζ) via another linear program, denoted by PD(s, ζ). For notational convenience, define T_{sad}(s′) = T(s′ | s, a, d), P^ϑ_{sad} = Σ_{s′} T(s′ | s, a, d) R(s, a, d, s′; ϑ), A_{sϑ}(a) = π̄_A(a | s, ϑ) b(ϑ), and D_s(d) = π̄_D(d | s). The two linear programs, PA(s, b) and PD(s, ζ), are given below.
V \u2264(cid:88)\n\na,\u03d1\n\nbds(cid:48)(\u03d1) =\nVds(cid:48) \u2264 \u03a5v\n\n(cid:88)\n\nAs\u03d1(a)P \u03d1\n\nV\n\nsad + \u03b3\n\nPA(s, b)\n\nVds(cid:48) \u2200 d\n\n(cid:88)\ns , bds(cid:48)(cid:1) \u2200 d, s(cid:48)\n\n(cid:88)\nAs\u03d1(a)Tsad(s(cid:48)) \u2200 d, s(cid:48), \u03d1\n(cid:0)Y A\ns ,W A\n\ns(cid:48)\n\na\n\nAs\u03d1(a) = b(\u03d1) \u2200 \u03d1\n\nmin\n\n[Ds(d)]\u22650,[Was(cid:48) ],\n[\u03bbas(cid:48) (\u03d1)]\u22650,W\ns.t. W \u2265 \u03b6(\u03d1) +\n\n(cid:88)\n\nW\n\nPD(s, \u03b6)\n\nDs(d)P \u03d1\nsad\n\n(cid:88)\n(cid:0)Was(cid:48) \u2212 \u03bbas(cid:48)(\u03d1)(cid:1) \u2200 a, \u03d1\ns , \u03bbas(cid:48)(cid:1) \u2200 a, s(cid:48)\n(cid:0)Y D\n\ns ,W D\n\nd\n\n+ \u03b3\n\ns(cid:48)\nWas(cid:48) \u2265 \u03a5w\n\n(cid:88)\n\nDs(d) = 1\n\na\n\nd\n\nThe objective functions of the two linear programs estimate the values of the primal and dual\ngames. The decision variables [As\u03d1], [Vds(cid:48)], [bds(cid:48)(\u03d1)] in PA(s, b) are used to \ufb01nd the attacker\u2019s\nstrategy, the continuation value of the (primal) game, and the updated belief, respectively. Similarly,\n\n7\n\n\f(cid:0)Y A\n(cid:0)Y D\n\n[Ds(d)], [Was(cid:48)], [\u03bbas(cid:48)(\u03d1)] in PD(s, \u03b6) are used to \ufb01nd the defender\u2019s strategy, the continuation value\nof the dual game, and the updated parameter \u03b6 for the dual game, respectively. The \ufb01rst constraint\nin PA(s, b) encodes the defender\u2019s best response, replacing the minimization over \u00af\u03c0D in Eq. (2).\nSimilarly, the \ufb01rst constraint in PD(s, \u03b6) replaces the maximization over \u00b5 in Eq. (5). The second\nand last constraints in PA(s, b) and the last constraint in PD(s, \u03b6) enforce basic rules of probabil-\nity. 
The third constraint in PA(s, b) and the second constraint in PD(s, ζ) provide the information state-value approximations of the continuation value estimates V_{ds′} and W_{as′}, respectively.
Due to the concavity (convexity) of the value function v (resp. w), we use the sawtooth function [11] as an information state-value approximation. In particular, a lower bound on the primal game's value function v(s, b) is given by Υ^v(Y^A_s, W^A_s, b_{ds′}), whereas an upper bound on the dual game's value function w(s, ζ) is given by Υ^w(Y^D_s, W^D_s, λ_{as′}). The set Y^A_s contains the belief-value pairs associated with the beliefs that are non-corner points of the simplex over Θ for given s, whereas the set W^A_s contains the belief-value pairs associated with the corner points of the simplex. Analogously, Y^D_s, W^D_s represent subsets of R^{|Θ|} that contain the vectors with only one, and more than one, non-zero element, respectively. Details of both Υ^v and Υ^w using sawtooth functions can be found in the pseudocode in Sec. C in the Appendix.
Lemma 4 ensures that the sawtooth constraints are linear in the decision variables of the respective problem, which verifies the computational tractability of PA(s, b) and PD(s, ζ). The proof of Lemma 4 can be found in Sec. B in the Appendix.
Lemma 4. By the definitions of SAWTOOTH-A and SAWTOOTH-D in Algorithm 1, given Y^A_s, W^A_s, Y^D_s, W^D_s, the constraints V_{ds′} ≤ Υ^v(Y^A_s, W^A_s, b_{ds′}) and W_{as′} ≥ Υ^w(Y^D_s, W^D_s, λ_{as′}) are both linear in the decision variables (V_{ds′}, b_{ds′}) and (W_{as′}, λ_{as′}), respectively.

Next, we numerically analyze the proposed algorithm in a cyber security environment.

5 Experiment: Intrusion Response

A recent trend in the security literature concerns the development of automated defense systems, termed state-based intrusion response systems, that automatically prescribe defense actions in response to intrusion alerts [26, 15, 27]. Core to these systems is the construction of a model that describes the possible ways an attacker can infiltrate the system, termed a threat model. Deriving a correct threat model is a challenging task and has a significant impact on the effectiveness of the intrusion response system. N-CIRL addresses one of the main challenges in this domain: the defender's uncertainty about the attacker's true intent.
The threat model in our experiment is based on an attack graph, a common graphical model in the security literature [2]. An attack graph is represented by a directed acyclic graph G = (N, E) where each node n ∈ N represents a system condition and each edge e_{ij} ∈ N × N represents an exploit. Each exploit e_{ij} relates a precondition i, the condition needed for the attack to be attempted, to a postcondition j, the condition satisfied if the attack succeeds. Each exploit e_{ij} is associated with a probability of success, β_{ij}, describing the likelihood of the exploit succeeding (if attempted).
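The attack-graph threat model just described can be sketched in a few lines. The graph, success probabilities, per-condition rewards, and action costs below are made-up placeholders, not the paper's experimental instances; the stage reward follows the form R(s, a, d, s′; θ) = r_e(s, s′; θ) − c_A(a) + c_D(d) used in the experiment.

```python
# Toy encoding of an attack-graph threat model (all names and numbers are
# illustrative placeholders): nodes are system conditions, directed edges are
# exploits e_ij with success probability beta_ij.
import random

beta = {(0, 1): 0.7, (0, 2): 1.0, (1, 3): 0.9}   # exploits (i, j) -> beta_ij
node_value = {1: 2.0, 2: 1.0, 3: 5.0}            # theta: reward per condition

def available_exploits(enabled):
    """Exploits with an enabled precondition and a not-yet-enabled postcondition."""
    return [e for e in beta if e[0] in enabled and e[1] not in enabled]

def stage(s, a, d, c_A=0.5, c_D=0.3, rng=random.Random(0)):
    """One stage: attacker attempts exploit a, defender blocks exploit d.
    A blocked exploit succeeds with probability zero for this stage."""
    succeeds = (a != d) and (rng.random() < beta[a])
    s_next = s | {a[1]} if succeeds else set(s)
    r_e = node_value.get(a[1], 0.0) if succeeds else 0.0
    return s_next, r_e - c_A + c_D               # attacker's stage reward

print(available_exploits({0}))                   # -> [(0, 1), (0, 2)]
s_next, r = stage({0}, (0, 2), (0, 1))           # beta = 1.0 and not blocked
print(s_next, r)                                 # -> {0, 2} 0.8
```

The state is the set of enabled conditions, and the attacker's feasible actions in each state are exactly the exploits returned by `available_exploits`, matching the description below.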
The state space S is the set of currently satisfied conditions (enabled nodes). For a given state s ∈ S, the attacker chooses among exploits that have enabled preconditions and at least one not-yet-enabled postcondition. The defender simultaneously chooses which exploits to block for the current stage; blocked edges have a probability of success of zero for the stage in which they are blocked. The attacker's reward is R(s, a, d, s′; θ) = r_e(s, s′; θ) − c_A(a) + c_D(d), where s′ is the updated state, r_e(s, s′; θ) is the attacker's reward for any newly enabled conditions, and c_A(a) and c_D(d) are the costs of attack and defense actions, respectively. The experiments are run on random instances of attack graphs; some instances are shown in Figure 1. See Sec. C for more details of the experimental setup.
As seen in Figure 1, the strategies obtained from N-CIRL yield a lower attacker reward than the strategies obtained from MA-IRL. Empirically, this implies that the defender benefits more from interleaving learning and execution than the attacker.
Even though the interleaved setting may provide more ground for the attacker to exercise deceptive tactics, [36] states that in games of incomplete information, the more informed player "cannot exploit its private information without revealing it, at least to some extent." In the context of our example, we believe that the performance gain of N-CIRL arises from the fact that the attacker can only deceive for so long; eventually it must fulfill its true objective and, in turn, reveal its intent.

Footnote 4: b_{ds′} is the vector consisting of b_{ds′}(ϑ) over all ϑ ∈ Θ (similarly for λ_{as′}).

Figure 1: Top: Instances of randomly generated graphs of sizes n = 6 to n = 10, with state cardinalities |S| = {8, 24, 32, 48, 64} and action cardinalities |A| = |D| = {54, 88, 256, 368, 676}. Bottom-left: Attacker's average reward for each graph size n; filled/outlined markers represent the attacker's accumulated reward against defense strategies obtained from MA-IRL/N-CIRL. Bottom-middle: The average relative reduction of the attacker's reward in N-CIRL compared to MA-IRL as a function of n. Bottom-right: Average runtime of NC-PBVI (in seconds) as a function of n.

6 Concluding Remarks

The goal of our paper was to introduce the N-CIRL formalism and provide some theoretical results for the design of learning algorithms in the presence of strategic opponents. The primary motivation for this work was cyber security, specifically, problems where the defender is actively trying to defend a network while being uncertain of the attacker's true intent. Learning from past attack traces (i.e., equilibrium behavior/demonstrations) can lead to poor defense strategies, demonstrating that such approaches (e.g., MA-IRL) are not directly applicable in settings where rewards may change between demonstrations.
Empirical studies illustrate that the defender can benefit from interleaving the learning and execution phases compared to just learning from equilibrium behavior (an analogous conclusion to the one found in CIRL, where an interactive scenario can lead to better performance). The reason for this is that defense strategies computed using N-CIRL learn the intent adaptively through interaction with the attacker.
As shown in [9], the value alignment problem is more appropriately addressed in a dynamic and cooperative setting. The cooperative reformulation converts the IRL problem into a decentralized stochastic control problem. In our paper, we have shown that the non-cooperative analog of CIRL, i.e., when agents possess goals that are misaligned, becomes a zero-sum Markov game with one-sided incomplete information. Such games are conceptually challenging due to the ability of agents to influence others' beliefs about their private information through their actions, termed signaling in game-theoretic parlance. We hope that the N-CIRL setting can provide a foundation for an algorithmic perspective on these games and a deeper investigation into signaling effects in general stochastic games of asymmetric information.

Acknowledgements

This work was supported in part by the US Army Research Laboratory (ARL) Cooperative Agreement W911NF-17-2-0196, and in part by the Office of Naval Research (ONR) MURI Grant N00014-16-1-2710.

References

[1] P. Abbeel and A. Y. Ng. Apprenticeship learning via inverse reinforcement learning. In International Conference on Machine Learning, 2004.

[2] P. Ammann, D. Wijesekera, and S. Kaushik. Scalable, graph-based network vulnerability analysis. In ACM Conference on Computer and Communications Security, 2002.

[3] T. Arnold, D. Kasenberg, and M. Scheutz. Value alignment or misalignment – what will keep systems accountable?
In AAAI Conference on Artificial Intelligence, 2017.

[4] T. Başar. Stochastic differential games and intricacy of information structures. In Dynamic Games in Economics, pages 23–49. Springer, 2014.

[5] D. Blackwell. Discounted dynamic programming. The Annals of Mathematical Statistics, 36(1):226–235, 1965.

[6] P. Cardaliaguet and C. Rainer. Stochastic differential games with asymmetric information. Applied Mathematics and Optimization, 59(1):1–36, 2009.

[7] B. De Meyer. Repeated games, duality and the central limit theorem. Mathematics of Operations Research, 21(1):237–251, 1996.

[8] C. Finn, S. Levine, and P. Abbeel. Guided cost learning: Deep inverse optimal control via policy optimization. In International Conference on Machine Learning, 2016.

[9] D. Hadfield-Menell, S. J. Russell, P. Abbeel, and A. Dragan. Cooperative inverse reinforcement learning. In Advances in Neural Information Processing Systems, 2016.

[10] E. A. Hansen, D. S. Bernstein, and S. Zilberstein. Dynamic programming for partially observable stochastic games. In AAAI Conference on Artificial Intelligence, 2004.

[11] M. Hauskrecht. Value-function approximations for partially observable Markov decision processes. Journal of Artificial Intelligence Research, 13:33–94, 2000.

[12] K. Hendricks and R. P. McAfee. Feints. Journal of Economics & Management Strategy, 15(2):431–456, 2006.

[13] K. Horák, B. Bosanský, and M. Pechoucek. Heuristic search value iteration for one-sided partially observable stochastic games. In AAAI Conference on Artificial Intelligence, 2017.

[14] J. Hörner, D. Rosenberg, E. Solan, and N. Vieille. On a Markov game with one-sided information. Operations Research, 58(4-part-2):1107–1115, 2010.

[15] S. Iannucci and S. Abdelwahed. A probabilistic approach to autonomic security management. In IEEE International Conference on Autonomic Computing, 2016.

[16] Y. Ji, S. Lee, E. Downing, W. Wang, M. Fazzini, T. Kim, A. Orso, and W. Lee. RAIN: Refinable attack investigation with on-demand inter-process information flow tracking. In ACM SIGSAC Conference on Computer and Communications Security, 2017.

[17] R. E. Kalman. When is a linear control system optimal? Journal of Basic Engineering, 86(1):51–60, 1964.

[18] H. W. Kuhn. Extensive Games and the Problem of Information, volume 2. Princeton University Press, 1953.

[19] M. Lanctot, V. Zambaldi, A. Gruslys, A. Lazaridou, K. Tuyls, J. Pérolat, D. Silver, and T. Graepel. A unified game-theoretic approach to multiagent reinforcement learning. In Advances in Neural Information Processing Systems, 2017.

[20] X. Lin, S. C. Adams, and P. A. Beling. Multi-agent inverse reinforcement learning for certain general-sum stochastic games. Journal of Artificial Intelligence Research, 66:473–502, 2019.

[21] X. Lin, P. A. Beling, and R. Cogill. Multiagent inverse reinforcement learning for two-person zero-sum games. IEEE Transactions on Games, 10(1):56–68, 2018.

[22] M. L. Littman. Markov games as a framework for multi-agent reinforcement learning. In Machine Learning Proceedings, pages 157–163. Elsevier, 1994.

[23] D. Malik, M. Palaniappan, J. Fisac, D. Hadfield-Menell, S. Russell, and A. Dragan. An efficient, generalized Bellman update for cooperative inverse reinforcement learning. In International Conference on Machine Learning, 2018.

[24] J. Marschak and R. Radner. Economic Theory of Teams. Yale University Press, 1972.

[25] J.-F. Mertens and S. Zamir. Formulation of Bayesian analysis for games with incomplete information. International Journal of Game Theory, 14(1):1–29, 1985.

[26] E. Miehling, M. Rasouli, and D. Teneketzis. Optimal defense policies for partially observable spreading processes on Bayesian attack graphs. In Second ACM Workshop on Moving Target Defense, 2015.

[27] E. Miehling, M. Rasouli, and D. Teneketzis. A POMDP approach to the dynamic defense of large-scale cyber networks. IEEE Transactions on Information Forensics and Security, 13(10):2490–2505, 2018.

[28] A. Nayyar and T. Başar. Dynamic stochastic games with asymmetric information. In IEEE Conference on Decision and Control, 2012.

[29] A. Nayyar, A. Gupta, C. Langbort, and T. Başar. Common information based Markov perfect equilibria for stochastic games with asymmetric information: Finite games. IEEE Transactions on Automatic Control, 59(3):555–570, 2014.

[30] A. Nayyar, A. Mahajan, and D. Teneketzis. Decentralized stochastic control with partial history sharing: A common information approach. IEEE Transactions on Automatic Control, 58(7):1644–1658, 2013.

[31] A. Y. Ng and S. J. Russell. Algorithms for inverse reinforcement learning. In International Conference on Machine Learning, 2000.

[32] J. Pineau, G. Gordon, and S. Thrun. Point-based value iteration: An anytime algorithm for POMDPs. In International Joint Conference on Artificial Intelligence, 2003.

[33] N. D. Ratliff, J. A. Bagnell, and M. A. Zinkevich. Maximum margin planning. In International Conference on Machine Learning, 2006.

[34] J. Renault. The value of Markov chain games with lack of information on one side. Mathematics of Operations Research, 31(3):490–512, 2006.

[35] D. Rosenberg. Duality and Markovian strategies. International Journal of Game Theory, 27(4):577–597, 1998.

[36] D. Rosenberg and N. Vieille. The maxmin of recursive games with incomplete information on one side. Mathematics of Operations Research, 25(1):23–35, 2000.

[37] S. Russell, D. Dewey, and M. Tegmark. Research priorities for robust and beneficial artificial intelligence. AI Magazine, 36(4):105–114, 2015.

[38] S. J. Russell and P. Norvig. Artificial Intelligence: A Modern Approach. Pearson Education Limited, 2016.

[39] S. Sorin. Stochastic games with incomplete information. In Stochastic Games and Applications, pages 375–395. Springer, 2003.

[40] S. Srinivasan, M. Lanctot, V. Zambaldi, J. Pérolat, K. Tuyls, R. Munos, and M. Bowling. Actor-critic policy optimization in partially observable multiagent environments. In Advances in Neural Information Processing Systems, 2018.

[41] X. Wang and D. Klabjan. Competitive multi-agent inverse reinforcement learning with suboptimal demonstrations. In International Conference on Machine Learning, 2018.

[42] N. Wiener. Some moral and technical consequences of automation. Science, 131(3410):1355–1358, 1960.

[43] K. Zhang, Z. Yang, and T. Başar. Policy optimization provably converges to Nash equilibria in zero-sum linear quadratic games. arXiv preprint arXiv:1906.00729, 2019.

[44] K. Zhang, Z. Yang, H. Liu, T. Zhang, and T. Başar. Finite-sample analyses for fully decentralized multi-agent reinforcement learning. arXiv preprint arXiv:1812.02783, 2018.

[45] K. Zhang, Z. Yang, H. Liu, T. Zhang, and T. Başar. Fully decentralized multi-agent reinforcement learning with networked agents. In International Conference on Machine Learning, 2018.