{"title": "Manipulating a Learning Defender and Ways to Counteract", "book": "Advances in Neural Information Processing Systems", "page_first": 8274, "page_last": 8283, "abstract": "In Stackelberg security games when information about the attacker's payoffs is uncertain, algorithms have been proposed to learn the optimal defender commitment by interacting with the attacker and observing their best responses. In this paper, we show that, however, these algorithms can be easily manipulated if the attacker responds untruthfully. As a key finding, attacker manipulation normally leads to the defender learning a maximin strategy, which effectively renders the learning attempt meaningless as to compute a maximin strategy requires no additional information about the other player at all. We then apply a game-theoretic framework at a higher level to counteract such manipulation, in which the defender commits to a policy that specifies her strategy commitment according to the learned information. We provide a polynomial-time algorithm to compute the optimal such policy, and in addition, a heuristic approach that applies even when the attacker's payoff space is infinite or completely unknown. 
Empirical evaluation shows that our approaches can improve the defender's utility significantly as compared to the situation when attacker manipulation is ignored.", "full_text": "Manipulating a Learning Defender and Ways to Counteract

Jiarui Gan
University of Oxford
Oxford, UK
jiarui.gan@cs.ox.ac.uk

Qingyu Guo
Nanyang Technological University
Singapore
qguo005@e.ntu.edu.sg

Long Tran-Thanh
University of Southampton
Southampton, UK
l.tran-thanh@soton.ac.uk

Bo An
Nanyang Technological University
Singapore
boan@ntu.edu.sg

Michael Wooldridge
University of Oxford
Oxford, UK
mjw@cs.ox.ac.uk

Abstract

In Stackelberg security games, when information about the attacker's payoffs is uncertain, algorithms have been proposed to learn the optimal defender commitment by interacting with the attacker and observing their best responses. In this paper, however, we show that these algorithms can be easily manipulated if the attacker responds untruthfully. As a key finding, attacker manipulation normally leads to the defender learning a maximin strategy, which effectively renders the learning attempt meaningless, as computing a maximin strategy requires no additional information about the other player at all. We then apply a game-theoretic framework at a higher level to counteract such manipulation, in which the defender commits to a policy that specifies her strategy commitment according to the learned information. We provide a polynomial-time algorithm to compute the optimal such policy, and in addition, a heuristic approach that applies even when the attacker's payoff space is infinite or completely unknown.
Empirical evaluation shows that our approaches can improve the defender's utility significantly as compared to the situation when attacker manipulation is ignored.

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

1 Introduction

Stackelberg security games (SSGs) are Stackelberg game models developed for deriving optimal security resource allocation in strategic scenarios. In the AI community, a line of work applying SSG models forms the algorithmic basis of resource scheduling systems that are in use by the Los Angeles Airport, the US Coast Guard, the Federal Air Marshal Service, etc., to assist in protecting high-profile infrastructures and public and natural resources [21].
The standard solution concept of SSGs, the strong Stackelberg equilibrium (SSE), captures the situation where a defender (the leader) commits to her optimal strategy, assuming that an attacker (the follower) will respond optimally to her commitment. Many algorithms have been designed to compute SSEs in different SSG models when complete information about the attacker's type (i.e., his payoff parameters) is provided. As payoff information may be incomplete in many real environments, algorithms have also been designed for the defender to learn the optimal commitment through interacting with the attacker: by committing to a series of carefully chosen defender strategies and observing the attacker's best responses to these strategies [11, 4, 8, 19, 17]. The optimality of the learned commitment thus relies crucially on the assumption of a truthful attacker, one who responds to the defender's commitment optimally according to their actual payoffs. Unfortunately, when there is no guarantee that the attacker will indeed be truthful, a strategic attacker can easily manipulate the learning algorithm by using fake best responses, typically by imitating the responses of a different attacker type.
The defender will then learn a commitment that is optimal with respect to the imitated type but, very likely, suboptimal with respect to the true attacker type. As we will show in this paper, the attacker is often incentivized to imitate a type that makes the game zero-sum; a credulous defender would then only learn a maximin strategy (i.e., the optimal commitment in a zero-sum game). Effectively, the learning attempt now becomes meaningless: to compute a maximin strategy, one needs no additional information about the other player's payoffs at all!
Driven by this issue, we study what can be done to reduce the defender's loss due to attacker manipulation. We apply a game-theoretic framework at a higher level. In the framework, the defender commits to a policy that specifies her strategy commitment according to the learned information. A strategic attacker then takes into account the defender's policy, choosing optimally what he wants the defender to learn so that the policy outputs a strategy that benefits him the most. We make several other contributions under this framework. (i) We propose a novel quality measure of the defender's policy and argue why it is a reasonable choice in the context of SSGs. (ii) We develop a polynomial-time algorithm to compute the optimal policy with respect to this quality measure, as well as (iii) a heuristic approach which applies even when the attacker type space is infinite or completely unknown. The heuristic approach is inspired by the famous quantal response model that was initially proposed to model bounded rationality of human players. It suggests a rationale for behaving in a "boundedly rational" manner in the presence of attacker manipulation.
(iv) Finally, through empirical evaluation we show that our approaches can improve the defender's utility significantly in randomly generated games, as compared to the situation when attacker manipulations are ignored.
Our work follows some recent research effort on understanding manipulation of leader learning algorithms in general Stackelberg games [7] and sheds light on the same issue in SSGs. The SSG model offers us an appropriate level of specification that enables us to derive a richer set of results than a general model would (consider, e.g., that when the interests of the leader and the follower completely align, there is simply no incentive for the follower to manipulate the leader), while it also captures sufficiently many real-world applications of significant practical value.

Additional Related Work  Apart from [7], manipulation of leader learning algorithms remains largely an under-explored topic, though there are many papers focusing on the design of learning algorithms for the leader. In addition to the aforementioned efforts to learn the optimal leader commitment against a fixed but unknown follower type [11, 4, 8, 19, 17], a couple of papers also take the regret-minimization perspective and design online learning algorithms for the leader to use in the adversarial setting [1, 22]. Our work can be seen as a middle-ground approach between the over-optimistic assumption of a truthful follower adopted by the former line of work and the pessimistic assumption of a worst-case opponent made by the latter. Our approach to dealing with attacker manipulation is, in a nutshell, to reduce information uncertainty by acquiring additional information, while bearing in mind that the information acquired may be manipulated by the attacker. There are also approaches that deal with uncertainties without additional information retrieval attempts; they are immune to manipulation as a result.
In particular, algorithms were designed to compute robust leader strategies when the leader can bound the follower's payoffs within certain intervals [11, 10, 14], or knows a probability distribution over the follower's types [6, 16, 9, 18]. Follower manipulation is not a concern in applying these approaches, but the leader strategies they yield are weaker. In a broader sense, our work is also related to poisoning attacks in adversarial machine learning, where an attacker manipulates the training data (in our model, their payoffs) to undermine the performance of learning algorithms; see, e.g., pioneering work in this area [2, 3] and some recent surveys [5, 12].

2 SSG Preliminaries

An SSG is played between a defender (the leader) and an attacker (the follower). The defender allocates m security resources to a set of targets T = {1, ..., n} (without loss of generality, n > m), and the attacker chooses a target to attack. In the pure strategy setting, an attack on a target i is unsuccessful as long as one resource is allocated to i, in which case the attacker receives a penalty p^a_i and the defender a reward r^d_i. Otherwise, i.e., when no resource is allocated to i, the attack is successful, in which case the attacker receives a reward r^a_i and the defender a penalty p^d_i. We say that a target is protected, or covered, if at least one resource is assigned to it; and unprotected, or uncovered, otherwise. It is assumed that r^a_i > p^a_i and r^d_i > p^d_i for all i, so the attacker always prefers a successful attack and the defender prefers the opposite.
The defender can further randomize the resource allocation and commit to a mixed strategy, i.e., a probability distribution over pure strategies. The structure of SSGs allows a defender mixed strategy to be represented more compactly as a coverage (vector) c = (c_i)_{i∈T}, with each c_i representing the probability that target i is protected.
We will stick to this representation and use the terms coverage and defender mixed strategy interchangeably. Under the constraint that the defender can use at most m resources, the set of feasible mixed strategies available to the defender is C = {c ∈ R^n : 0 ≤ c ≤ 1, Σ_{i∈T} c_i ≤ m}; provably, any coverage vector in C can be implemented as a distribution over pure strategies each involving at most m resources, and any such distribution results in a coverage vector in C. When the defender plays a mixed strategy c and the attacker attacks target i, let u^d(c, i) and u^a(c, i) be the expected utilities of the defender and the attacker, respectively. With slight abuse of notation, we write

    u^d(c, i) = u^d(c_i, i) = c_i · r^d_i + (1 − c_i) · p^d_i,    (1)
    u^a(c, i) = u^a(c_i, i) = (1 − c_i) · r^a_i + c_i · p^a_i.    (2)

It is worth noting that u^d(c, i) is strictly increasing with respect to c_i, and u^a(c, i) strictly decreasing. By the standard assumption, in an SSG the attacker is able to observe the defender's mixed strategy through surveillance before he launches an attack, but the instantiated pure strategy is not observable.
The strong Stackelberg equilibrium (SSE) is the standard solution concept of SSGs. In an SSE, the defender commits to an optimal mixed strategy, taking into account that the attacker will observe this strategy and respond optimally. It is assumed that ties are broken in favor of the defender when the attacker has multiple best responses; hence, without loss of generality we can assume that the attacker always responds by playing a pure strategy.
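The expected utilities in (1)-(2) and the best-response set are straightforward to evaluate for any coverage vector; the following is an illustrative sketch (plain Python lists indexed by target, names ours, not from the paper):

```python
# A minimal sketch of the coverage representation: the expected utilities of
# (1)-(2) and the attacker's best-response set BR(c).

def u_d(c_i, r_d_i, p_d_i):
    """Defender's expected utility on a target covered with probability c_i, eq. (1)."""
    return c_i * r_d_i + (1 - c_i) * p_d_i

def u_a(c_i, r_a_i, p_a_i):
    """Attacker's expected utility on a target covered with probability c_i, eq. (2)."""
    return (1 - c_i) * r_a_i + c_i * p_a_i

def best_responses(c, r_a, p_a, eps=1e-9):
    """BR(c): the set of targets maximizing the attacker's expected utility."""
    utils = [u_a(ci, r, p) for ci, r, p in zip(c, r_a, p_a)]
    top = max(utils)
    return [i for i, u in enumerate(utils) if u >= top - eps]
```

For instance, with attacker payoffs r^a = (3, 1), p^a = (0, 0) and coverage c = (3/4, 1/4), both targets are best responses, illustrating a tie the defender can break in her favor.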
The assumption is justified by the fact that the attacker's strict preference for the favored target can be induced if the defender reduces the coverage of this target by an infinitesimal amount.¹ Formally, a strategy profile (ĉ, î) forms an SSE if:

    (ĉ, î) ∈ arg max_{c∈C, i∈BR(c)} u^d(c, i),

where BR(c) = arg max_{i∈T} u^a(c, i) denotes the set of attacker best responses to c. An SSE always exists and can be computed in polynomial time, e.g., using a multiple-LP approach [6].
We will refer to a full set of attacker payoffs (r^a, p^a) as an attacker type. To distinguish, we extend (2) and define u^a_θ(c, i) = (1 − c_i) · x_i + c_i · y_i to be the utility function parameterized by attacker type θ = (x, y). The definition of the best response set is extended likewise, with BR_θ(c) = arg max_{i∈T} u^a_θ(c, i). We will refer to an SSE in a game where the attacker's type is θ as an SSE on attacker type θ.
Example 1. Consider an SSG where the defender allocates one security guard to protect two targets A and B. The defender has three pure strategies: to assign the guard to protect A or B, or to send the guard on vacation; the corresponding mixed strategy space is C = {c ∈ R^2 : 0 ≤ c ≤ 1, c_1 + c_2 ≤ 1}. The attacker can choose to attack A or B. In this game, the targets are of equal importance to the defender: a successful attack on any target i ∈ {A, B} results in utility p^d_i = −1 for the defender, and an unsuccessful one results in r^d_i = 0. For the attacker, the payoffs are r^a_A = 3, r^a_B = 1, and p^a_A = p^a_B = 0. The bi-matrix representation of the game is shown below, in which the defender and the attacker are the row and column players, respectively.

                attack A    attack B
    protect A     0, 0       -1, 1
    protect B    -1, 3        0, 0
    protect ∅    -1, 3       -1, 1

The SSE of this game can be identified using the indifference rule, i.e., by identifying a point where the attacker is indifferent between attacking A and B, while the defender cannot improve the coverage of the targets any further (however, not in every game can an SSE be found in this way). In the only SSE of this game, the defender protects (A, B) with probabilities c = (3/4, 1/4) (which is equivalent to a mixed strategy x = (3/4, 1/4, 0) in the bi-matrix representation), to which the attacker finds his best responses to be BR(c) = {A, B} and, by the SSE assumption, breaks the tie in favor of the defender by attacking A. The defender gets utility u^d(c, A) = −1/4 and the attacker gets u^a(c, A) = 3/4.

¹See [20] and [21] (Chapter 8) for more discussion about the SSG and the solution concepts.

3 Manipulating a Learning Defender

We investigate how attacker manipulation can take place. Let us begin with a warm-up example.
Example 2. Suppose now that the attacker in Example 1 pretends to have payoff r^a_A = 1 (all other parameters remain the same) and "best" responds to queries of the defender's learning algorithm according to this fake parameter. Let the resultant fake attacker type be β. The defender will be misled into learning an SSE on type β, in which her commitment is c̃ = (1/2, 1/2). We have BR_β(c̃) = {A, B}. The attacker can respond (still with ties broken in favor of the defender) by attacking A, and this results in the attacker's utility increasing to 3/2, but the defender's dropping to −1/2. There is a loss of 1/4 for the defender compared to the truthful situation!
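The numbers in Example 2 can be checked with a short computation; the sketch below is specific to this two-target game (fractions as floats, and the two-target indifference formula is hard-coded, so the helper is ours, not a general SSE solver):

```python
# Check of Example 2: the attacker fakes r_a_A = 1, which makes the game
# zero-sum against the defender payoffs r_d = (0, 0), p_d = (-1, -1).
# The true attacker payoffs are r_a = (3, 1) with zero attack penalties.

def indifference_coverage(ra_A, ra_B):
    """Coverage (cA, cB) with cA + cB = 1 making the attacker indifferent
    between the two targets when both attack penalties are 0:
    (1 - cA) * ra_A = (1 - cB) * ra_B."""
    cA = ra_A / (ra_A + ra_B)
    return (cA, 1 - cA)

c_true = indifference_coverage(3, 1)   # (0.75, 0.25): truthful SSE coverage
c_fake = indifference_coverage(1, 1)   # (0.5, 0.5): learned from the fake type

# In both cases the attacker is induced to attack A (ties favor the defender);
# utilities are evaluated with the attacker's TRUE payoff r_a_A = 3.
ua_true = (1 - c_true[0]) * 3          # attacker utility when truthful
ua_fake = (1 - c_fake[0]) * 3          # attacker utility after manipulating
ud_true = (1 - c_true[0]) * (-1)       # defender utility when attacker truthful
ud_fake = (1 - c_fake[0]) * (-1)       # defender utility after manipulation
```

The manipulation raises the attacker's utility from 3/4 to 3/2 and drops the defender's from −1/4 to −1/2, matching the example.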
Note that the attacker behaves consistently according to the fake type β even after having misled the defender into learning the fake commitment. Hence, there is no way to distinguish him from a truthful type-β attacker.

In the above example, the attacker in effect lies to the defender that the game they are playing is zero-sum. It turns out that this is not a coincidence specific to this example but a general phenomenon in SSGs. We show next that it is always optimal for the attacker to mislead the defender into playing her maximin strategy. A maximin strategy c maximizes the defender's utility against the worst possible attacker type, i.e., c ∈ arg max_c min_i u^d(c, i); it is exactly the defender's optimal commitment in a zero-sum game (see Lemma 13 in the appendix).
A couple of "disclaimers" are appropriate before we delve into our analysis. First, in line with previous work (e.g., [11, 4, 17]), we only consider the players' utilities in the (fake) SSE the defender learns. The cost incurred by both players during the learning process is omitted, as we expect the learning algorithm to run efficiently and the learned SSE to be repeated over sufficiently many rounds. Without loss of generality, we view the learning process as a reporting step in which the attacker simply reports his type to the defender. To manipulate the defender, the attacker reports a fake type, and we refer to this as his reporting strategy.
Second, we assume that the attacker behaves consistently according to the reported type. This means that the attacker may be playing a fake best response, hence a suboptimal one, in the learned SSE and might thus exploit the defender for an even higher utility by switching back to his true best response.
Nevertheless, since such a change in his behavior would inevitably make the defender aware of the manipulation and further complicate the interaction, we ignore the possibility of such a behavior change and adopt this cleaner model to capture the essence of the manipulation problem.

Optimal Attacker Report  The following program computes the optimal reporting strategy of a type-θ attacker.

    max_{β, z, t}  u^a_θ(z, t)                                    (3)
    s.t.  (z, t) ∈ arg max_{c∈C, i∈BR_β(c)} u^d(c, i)             (3a)
          β ∈ Θ                                                   (3b)

Here Θ = {(r^a, p^a) ∈ R^n × R^n : r^a_i > p^a_i for all i ∈ T} contains all types that adhere to the basic assumption that an attacker always prefers a successful attack (we will show a stronger result that allows for other specifications of Θ). In the program, the attacker reports a fake type β that leads the defender to learn an SSE (z, t) on type β (by (3a)). An optimal solution thus yields a reporting strategy for a type-θ attacker that maximizes his true utility, as the objective function specifies.
Using Program (3), we show with Theorem 3 that it is always optimal for the attacker to mislead the defender into playing her maximin strategy. This result is surprising as the defender essentially learns nothing: she could just as well compute the maximin strategy without any additional knowledge of the attacker's payoffs. We present a proof sketch. All omitted proofs can be found in the appendix.
Theorem 3. There exists an optimal solution (β, z, t) of Program (3) such that z is a maximin strategy of the defender, i.e., z ∈ arg max_{c∈C} min_{i∈T} u^d(c, i).

Proof sketch. Let c be a maximin strategy of the defender and u be her maximin utility. Consider a solution (β, z, t) such that: z_i = max{0, (u − p^d_i)/(r^d_i − p^d_i)} for all i ∈ T; t ∈ BR_θ(z); and β = (r, p), such that

    r_i = −min{p^d_t, u} if i = t,    r_i = −p^d_i if i ≠ t;    and p_i = −r^d_i for all i ∈ T.

It can be verified that z is indeed also a maximin strategy of the defender and that (β, z, t) is an optimal solution.

Theorem 3 can be further strengthened under the assumption that the defender's maximin strategy c is fully mixed, i.e., 0 < c_i < 1 for all i. (In this case the maximin strategy is also unique; see Lemma 15 in the appendix.) The assumption is mild: it is normally expected that no target would be so worthless that the defender would leave it wide open for the attacker to attack, while on the other hand resources are normally insufficient to allow any target to be fully protected. The strengthening is two-fold: (i) under the additional assumption, the defender's maximin strategy is her only SSE strategy induced by any optimal attacker report, so the equilibrium selection issue that arises when a reported type induces multiple SSEs is avoided; (ii) one optimal attacker report, in particular, is the type that makes the game zero-sum, so our result holds even for a more stringent specification of Θ (e.g., when the defender has more precise knowledge about possible attacker types) as long as Θ contains the zero-sum attacker type. (Indeed, it is very natural for an attacker to have the zero-sum type given the adversarial nature of SSGs.) We state the result in Theorem 4.
Theorem 4. Suppose c is a maximin defender strategy and it is fully mixed, i.e., 0 < c_i < 1 for all i ∈ T. Let (β, z, t) be an arbitrary optimal solution of Program (3). For every SSE (ĉ, î) on type β, it holds that ĉ = c.
In addition, there exists an optimal solution (β′, z′, t′) such that β′ = (−p^d, −r^d).

4 Handling Attacker Manipulation — A New Playbook

Recall our analysis. The key to the success of the attacker's trick is the naive playbook the defender follows: to always play the learned optimal commitment as is. It appears that the defender can be more strategic. Consider Example 2. Suppose the defender tweaks her strategy slightly, playing c̃ = (1/2, 49/100) even when she learns that (1/2, 1/2) is optimal. The attacker, who imitates type β (which makes the game zero-sum), should then attack B, as the best response set of β now becomes BR_β(c̃) = {B}. The attacker obtains utility 1/2, which is even lower than his utility 3/4 in the truthful situation. Therefore, if the defender commits to playing, e.g., (c_1, c_2 − 1/100) whenever she learns that (c_1, c_2) is optimal, the attacker will at least lose the incentive to mislead the defender into playing her maximin strategy. The question then becomes: what is the best the defender can achieve by revising her playbook in similar ways? We formalize this problem as finding an optimal policy to commit to.

Committing to a Policy  Formally, a policy is a function π : Θ → C × T that maps a reported attacker type to an outcome (c, i) ∈ C × T.
An outcome (c, i) is a strategy profile consisting of a defender strategy c and a best response i ∈ BR_θ(c) of the reported attacker type θ.² As an example, the way the defender plays when she ignores attacker manipulation can itself be viewed as a policy that maps every reported type θ to an SSE on θ; we will refer to this policy as the SSE policy.
We assume that a policy can be observed or learned by the attacker through constant interaction with the defender, or the defender can simply announce it to the attacker. In response, a strategic attacker of true type θ chooses to report an optimal type β* ∈ arg max_{β∈Θ} u^a_θ(π(β)) that maximizes his utility in the outcome of the policy. At a higher level, this can be seen as a Stackelberg game in which the defender commits to a policy and the attacker reports optimally in response to this commitment.
To find the optimal policy, we need a good measure of the quality of a policy. When there is no other prior information about the attacker's type, worst-case analysis seems appropriate, and a straightforward choice of quality measure is the utility the defender obtains when playing against the worst attacker type. However, as Proposition 5 suggests, this measure does not allow us to distinguish well between the qualities of many policies. Specifically, when playing against the zero-sum attacker type, no feasible policy can achieve anything better than the maximin utility, so if we take the worst-case utility as the measure, the quality of every policy would be capped by this attacker type, and the SSE policy, which achieves exactly the maximin utility in the worst case, would then be the best policy we could hope for. Essentially, there would be no room for improvement other than letting the attacker lie.
Proposition 5. Let c be a maximin strategy of the defender, and u = min_{i∈T} u^d(c, i) be the maximin utility.
For any policy π, let γ ∈ arg max_{θ∈Θ} u^a_β(π(θ)) be an optimal report of a type-β attacker, where β = (−p^d, −r^d). Then u^d(π(γ)) ≤ u.

This is unreasonable: even in the truthful situation, it is impossible for the defender to achieve more than the maximin utility when the attacker is of the zero-sum type, so it would be unfair to underrate a policy simply because it underperforms against the zero-sum type. For this reason, we propose an alternative measure, termed the efficiency of a policy (EoP), which takes into consideration the hardness of playing against each attacker type in the truthful setting. As in Definition 6, the EoP is the worst-case ratio between the utility the defender obtains and what she would have obtained had the attacker been truthful. A higher EoP indicates a smaller loss due to attacker manipulation, and the value of the EoP always lies between 0 and 1 according to Proposition 7. For the EoP to be meaningful, we shift all payoffs to be non-negative. Without loss of generality, we will hereafter also assume Θ (previously defined as the set of types the attacker is allowed to report) to be also the set of possible attacker types, which is common knowledge to both players.
Definition 6 (EoP). For each θ ∈ Θ, let β^π_θ ∈ arg max_{β∈Θ} u^a_θ(π(β)) be the attacker's optimal reporting strategy in response to a policy π (ties broken in favor of the defender). The efficiency of π on attacker type θ is EoP_θ(π) = u^d(π(β^π_θ)) / û(θ), where û(θ) = max_{c∈C, i∈BR_θ(c)} u^d(c, i) is the defender's utility in an SSE on type θ. The (overall) efficiency of π is EoP(π) = min_{θ∈Θ} EoP_θ(π).
Proposition 7. EoP(π) ∈ [0, 1] for any feasible defender policy π.
Another challenge we face is the representation of a policy, which is a function to be optimized. We follow a modeling approach in the literature and consider a discrete version of the problem where the set Θ of attacker types is finite. This approach has been widely adopted to model Bayesian games (e.g., [6, 16, 9, 18]). A finite type set can be seen as an approximation of the continuous type space, while in some scenarios attacker types may also be discrete by nature. For example, in defense against poaching, payoffs of the poachers may depend on the type of animal products they are interested in, which falls in a finite set. In addition to this approach, we also propose a heuristic policy that applies to an infinite or even an unknown type set. We present these approaches next.

²We also specify an attacker response in the output of a policy in order to deal with the tie-breaking issue explicitly. As in SSEs, the defender can induce specific attacker responses through infinitesimal deviations.

5 Computing the Optimal Policy

5.1 Optimal Policy for Finite Attacker Types

When Θ is a finite set, a defender policy can be represented as a list of λ = |Θ| outcomes; we will therefore also write a policy as π = (c_θ, i_θ)_{θ∈Θ}, meaning that π(θ) = (c_θ, i_θ) for each θ ∈ Θ.
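With Θ finite and a policy stored as a mapping θ ↦ (c_θ, i_θ), the EoP of Definition 6 can be evaluated directly; the sketch below assumes payoffs already shifted to be non-negative, and the helper names and the toy two-type instance used to exercise it are ours, not the paper's:

```python
# Sketch of Definition 6 for a finite type set. A policy maps each reported
# type to an outcome (coverage vector c, induced target i); each true attacker
# type reports the type whose outcome maximizes his TRUE expected utility,
# with ties broken in favor of the defender.

def u_a(c_i, r, p):
    """Attacker's expected utility on one target, eq. (2)."""
    return (1 - c_i) * r + c_i * p

def eop(policy, types, u_d_out, u_hat):
    """policy:  {theta: (c, i)}      outcome per reported type
    types:   {theta: (r_a, p_a)}  true payoff vectors per type
    u_d_out: {theta: float}       defender utility of the outcome policy[theta]
    u_hat:   {theta: float}       defender SSE utility against a truthful theta
    Returns min over true types theta of u_d(pi(beta*)) / u_hat(theta)."""
    worst = 1.0
    for theta, (r_a, p_a) in types.items():
        def report_key(beta):
            c, i = policy[beta]
            # Primary key: the true type's utility; secondary: defender utility
            # (ties broken in favor of the defender, as in Definition 6).
            return (u_a(c[i], r_a[i], p_a[i]), u_d_out[beta])
        beta_star = max(policy, key=report_key)
        worst = min(worst, u_d_out[beta_star] / u_hat[theta])
    return worst
```

On a toy two-type instance built from Example 1 (with defender payoffs shifted non-negative), the SSE policy's EoP drops below 1 precisely because the high-value type prefers to misreport as the zero-sum type.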
Our analysis reveals that computing the EoP-maximizing policy is NP-hard in general Stackelberg games (see Section D in the appendix), but thanks to the special utility structure of SSGs, the problem admits a polynomial-time algorithm when the underlying game is an SSG.
We consider the decision version of the optimization problem: for a given value ξ, decide whether any defender policy π achieves EoP(π) ≥ ξ. Trivially, once we have an efficient algorithm for this decision problem, the best EoP can be found efficiently using binary search (in particular, we already know that the value always lies in [0, 1]). Our algorithm for this decision problem, presented as Algorithm 1, is constructive and produces a satisfying policy when one exists. In the remainder of this section, we let Θ = {θ_1, ..., θ_λ} such that θ_1, ..., θ_λ are ordered by the utility they offer the defender in an SSE, i.e., û(θ_1) ≥ û(θ_2) ≥ ··· ≥ û(θ_λ); the order can be obtained efficiently given that an SSE can be computed in polynomial time. We call a policy ℓ-compatible if truthful reporting is incentivized for every attacker type θ_j, j ≤ ℓ (Definition 8).
The correctness of Algorithm 1 is shown via Theorem 10. Briefly speaking, Algorithm 1 can be viewed as a process of repeatedly replacing the ℓ-th outcome of a satisfying policy (suppose we are given one) with the outcome generated in the ℓ-th iteration of Step 3. The observation in Lemma 9 ensures that the new policy obtained after every replacement will still be a satisfying one.
Hence, eventually, we will obtain a satisfying policy that consists entirely of outcomes generated through the algorithm, and this means that we do not actually need to be provided a satisfying policy to begin with. Interestingly, the policy generated by Algorithm 1 is also incentive compatible (IC) (λ-compatible as in Definition 8); it always incentivizes the attacker to report their true type.
Definition 8. A policy π is ℓ-compatible (0 ≤ ℓ ≤ λ) if, in response to π, it is optimal for every attacker type θ ∈ {θ_1, ..., θ_ℓ} to report truthfully, i.e., u^a_θ(π(θ)) ≥ u^a_θ(π(β)) for all β ∈ Θ.
Lemma 9. Let π be the policy generated in Step 3 of Algorithm 1. Suppose that there exists an (ℓ−1)-compatible policy π* with EoP(π*) ≥ ξ. Then the policy π̃, such that

    π̃(θ) = π*(θ) if θ ∈ Θ \ {θ_ℓ},    and    π̃(θ) = π(θ) if θ = θ_ℓ,

is feasible and ℓ-compatible, and EoP(π̃) ≥ ξ.

Algorithm 1: Decide if there exists a policy π such that EoP(π) ≥ ξ.
1. For each θ ∈ Θ, compute an SSE (ĉ_θ, î_θ) on type θ. Let û(θ) = u^d(ĉ_θ, î_θ).
2. Sort attacker types in Θ by û(θ), so that û(θ_1) ≥ û(θ_2) ≥ ··· ≥ û(θ_λ), λ = |Θ|.
3. For each ℓ = 1, ..., λ, let π(θ_ℓ) = (z, t), where z_i = min{ĉ^{θ_ℓ}_i, h_i}, t ∈ BR_{θ_ℓ}(h), and

       h_i = max{ 0,  (ξ·û(θ_ℓ) − p^d_i)/(r^d_i − p^d_i),  max_{θ∈{θ_1,...,θ_{ℓ−1}}} (u^a_θ(π(θ)) − r^θ_i)/(p^θ_i − r^θ_i) }.

4. If EoP(π) ≥ ξ, return π as a satisfying policy; otherwise, claim that no such policy exists.

Theorem 10. In time polynomial in m, n, and |Θ|, Algorithm 1 either outputs a policy π with EoP(π) ≥ ξ, or decides correctly that no such policy exists. The policy generated is IC.

Proof. The polynomial runtime is readily seen. By Lemma 9, π(θ_ℓ) generated in Step 3 must be a feasible outcome, so π is a feasible policy. When no feasible policy can achieve EoP ξ, we have EoP(π) < ξ and Algorithm 1 will decide correctly in Step 4 that no satisfying policy exists.
Suppose that there exists a policy π* with EoP(π*) ≥ ξ. Let π̃_0 = π*, and for each ℓ = 1, ..., λ, iteratively construct a policy π̃_ℓ by replacing π̃_{ℓ−1}(θ_ℓ) in policy π̃_{ℓ−1} with the outcome π(θ_ℓ) generated in Step 3; thus, π̃_λ = π. Trivially, π̃_0 is 0-compatible, as is any feasible policy. Applying Lemma 9 iteratively, we can also conclude that π̃_ℓ is ℓ-compatible and EoP(π̃_ℓ) ≥ ξ for every ℓ; in particular, π̃_λ is λ-compatible and EoP(π̃_λ) ≥ ξ.
Algorithm 1 outputs π = π̃_λ as a satisfying policy.

5.2 Beyond Finite Attacker Types

The above approach applies only to a finite type set; we now present a heuristic approach to deal with a continuous or even unknown type set. The approach is inspired by the quantal response (QR) model, which was developed to study the bounded rationality of human players [13]. In a QR equilibrium, players are assumed to play not only their optimal pure strategy but also every other strategy, with a probability positively related to the utility the player gets from playing that strategy.
The QR policy imitates the irrational behavior in a QR equilibrium. Recall that in an SSE, a rational defender commits to an optimal strategy c and induces a type-θ attacker to choose a response i* ∈ BR_θ(c) that maximizes the defender's utility. The QR policy, however, induces the attacker's tie-breaking choice in an "irrational" way. It induces the attacker to choose not only i*, but every target in BR_θ(c) with some probability; the probability of a target being chosen is positively related to the defender's utility when that target is attacked. The idea is to add some uncertainty to the outcome, so that the attacker cannot benefit from being induced to choose a particular response with certainty, which is crucial for the success of his manipulation. This also encourages truthful reporting to some extent: a truthful attacker, who reports his true type θ, is indifferent to which response he is induced to choose in BR_θ(c) and hence immune to such uncertainty. The QR policy is as follows.
Definition 11 (QR policy). For each type θ, let ĉ_θ be the defender strategy in an SSE on attacker type θ.
A QR policy π_QR maps a report θ to a distribution σ over outcomes in {(ĉ_θ, i) : i ∈ BR_θ(ĉ_θ)}; the probability σ(i) of each outcome (ĉ_θ, i) is σ(i) = f_i / Σ_{j ∈ BR_θ(ĉ_θ)} f_j, where f_j = e^{φ·u^d(ĉ_θ, j)} for each j ∈ T, with φ > 0 being a parameter that represents a player's rationality level in the QR model.³
A defender who uses π_QR then samples an outcome from π_QR(θ) to implement when θ is reported. The players are now concerned with their expected utility over the outcome distribution, e.g., for the defender: u^d(π_QR(θ)) = Σ_{i ∈ BR_θ(ĉ_θ)} σ(i) · u^d(ĉ_θ, i). The EoP can also be redefined accordingly.
Since ĉ_θ and σ are independent of other types in Θ, the QR policy can be implemented on-the-fly for the type reported, and is thus able to handle infinite or unknown type sets.
Intuitively, the QR policy strikes a balance between two unaligned aspects of playing against attacker manipulation: 1) it adds some uncertainty to discourage attacker manipulation; meanwhile, 2) the softmax function that defines σ loosely ties the induced attacker response to the one optimal for the defender, so that the cost of achieving 1) is kept from being too high. We empirically evaluate the performance of the QR policy in randomly generated games in the next section.

³When φ → 0, a player behaves completely irrationally, playing each strategy uniformly at random; when φ → +∞, a player becomes perfectly rational, choosing the optimal strategy with certainty.

[Figure 1: Comparison of the EoP, for the policies QR (φ = 10), QR (φ = 50), QR (φ = 100), Optimal, and SSE. In (a), results are obtained with other parameters set to λ = 100, m = 10, and n = 50; and in (b) with m = n/5, ρ = 0.5, and λ = 100. Figures (c) and (d) repeat (a) and (b), respectively, with the difference that the zero-sum attacker type is always included in Θ in the experiments.]

6 Empirical Evaluation

In our evaluations, attacker types are randomly generated using the covariance model [15], with a parameter ρ ∈ [0, 1] to control the closeness of the generated game to a zero-sum game. That is, we shift each payoff parameter x towards the corresponding one y of a zero-sum attacker type, letting x ← (1 − ρ)·x + ρ·y. Thus, when ρ = 1 the game generated is exactly zero-sum, and when ρ = 0 all payoffs are generated uniformly at random. All evaluations are conducted on finite type sets. This also simulates situations with an unknown (but finite) type set, though situations with infinite type sets require more advanced approaches (in this case, it is unclear to us how to compute the optimal attacker report in response to the QR policy). All results shown are averages of at least 50 runs.
We compare the EoP achieved by our optimal and heuristic policies, using the SSE policy as a benchmark (i.e., the situation when attacker manipulation is ignored). The first set of results, Figure 1 (a) and (b), shows the variation of the EoP with respect to ρ and the size of the game. Except for the QR policy with φ = 10, the performance of all other policies is very close to each other, though there is a discernible gap between the optimal policy and the SSE policy.
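To make the mechanism behind the QR policy concrete, here is a minimal sketch of the softmax tie-breaking of Definition 11. It is not the authors' implementation: the payoff values and the best-response set below are hypothetical inputs, and computing the SSE coverage ĉ_θ itself is assumed to happen elsewhere.

```python
import math
import random

def qr_distribution(defender_utils, best_responses, phi):
    """Softmax distribution sigma over the attacker's best-response set:
    sigma(i) is proportional to exp(phi * u^d(c_hat, i)), for i in BR_theta(c_hat)."""
    weights = {i: math.exp(phi * defender_utils[i]) for i in best_responses}
    total = sum(weights.values())
    return {i: w / total for i, w in weights.items()}

def qr_policy_outcome(defender_utils, best_responses, phi, rng=random):
    """Sample the induced target, as the defender does when implementing pi_QR."""
    sigma = qr_distribution(defender_utils, best_responses, phi)
    targets, probs = zip(*sigma.items())
    return rng.choices(targets, weights=probs)[0]

# Toy example with 4 targets: suppose targets 1 and 3 tie as best responses
# for the reported type under the (assumed) SSE coverage c_hat.
defender_utils = [0.2, 0.9, -0.5, 0.4]   # hypothetical u^d(c_hat, i) per target i
best_responses = [1, 3]

sigma = qr_distribution(defender_utils, best_responses, phi=10)
expected_ud = sum(p * defender_utils[i] for i, p in sigma.items())
induced_target = qr_policy_outcome(defender_utils, best_responses, phi=10)
```

Consistent with footnote 3, as phi grows, sigma concentrates on the tie-break target that is best for the defender (recovering SSE-style induction), while small phi spreads probability more uniformly and adds the uncertainty that deters manipulation.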
In general, in these results, the loss due to ignoring attacker manipulation appears to be very marginal.
A more interesting set of results is shown in (c) and (d), in which we slightly tweak the randomly generated type set by always adding a zero-sum attacker type to it. This small change leads to a very different pattern in the results. There is a wide gap between the optimal and the SSE policies, and the QR policies normally rest in between them, also exhibiting good performance. The results corroborate our theoretical analysis that all attacker types will be incentivized to report the zero-sum type when they are allowed to, which undermines the performance of the SSE policy significantly. The optimal policy, however, is able to achieve a very high EoP, sometimes close to recovering the defender's utility in the truthful situation (EoP = 1).

7 Conclusion

In this paper, we investigate the manipulation of algorithms that are designed to learn the optimal strategy to commit to in Stackelberg security games, and aim to remedy the overoptimistic assumption of a truthful attacker adopted by these algorithms. We propose exact and heuristic approaches to reduce the loss due to manipulation. The effectiveness of our approaches is evaluated both theoretically and empirically. One promising direction for future work is to look at similar problems in other variants of Stackelberg games, where our framework and approaches may apply.

Acknowledgments

Jiarui Gan was supported by the EPSRC International Doctoral Scholars Grant EP/N509711/1.

References

[1] Maria-Florina Balcan, Avrim Blum, Nika Haghtalab, and Ariel D. Procaccia. Commitment without regrets: Online learning in Stackelberg security games. In Proceedings of the 16th ACM Conference on Economics and Computation (EC'15), pages 61–78, 2015.

[2] Battista Biggio, Blaine Nelson, and Pavel Laskov. Support vector machines under adversarial label noise.
In Proceedings of the Asian Conference on Machine Learning (ACML'11), volume 20, pages 97–112, 2011.

[3] Battista Biggio, Blaine Nelson, and Pavel Laskov. Poisoning attacks against support vector machines. In Proceedings of the 29th International Conference on Machine Learning (ICML'12), pages 1467–1474, 2012.

[4] Avrim Blum, Nika Haghtalab, and Ariel D. Procaccia. Learning optimal commitment to overcome insecurity. In Proceedings of the 27th International Conference on Neural Information Processing Systems (NIPS'14), pages 1826–1834, 2014.

[5] Anirban Chakraborty, Manaar Alam, Vishal Dey, Anupam Chattopadhyay, and Debdeep Mukhopadhyay. Adversarial attacks and defences: A survey. CoRR, abs/1810.00069, 2018.

[6] Vincent Conitzer and Tuomas Sandholm. Computing the optimal strategy to commit to. In Proceedings of the 7th ACM Conference on Electronic Commerce (EC'06), pages 82–90, 2006.

[7] Jiarui Gan, Haifeng Xu, Qingyu Guo, Long Tran-Thanh, Zinovi Rabinovich, and Michael Wooldridge. Imitative follower deception in Stackelberg games. In Proceedings of the 2019 ACM Conference on Economics and Computation (EC'19), pages 639–657, 2019.

[8] Nika Haghtalab, Fei Fang, Thanh H. Nguyen, Arunesh Sinha, Ariel D. Procaccia, and Milind Tambe. Three strategies to success: Learning adversary models in security games. In Proceedings of the 25th International Joint Conference on Artificial Intelligence (IJCAI'16), pages 308–314, 2016.

[9] Manish Jain, James Pita, Milind Tambe, Fernando Ordóñez, Praveen Paruchuri, and Sarit Kraus. Bayesian Stackelberg games and their application for security at Los Angeles International Airport. SIGecom Exchanges, 7(2):10:1–10:3, June 2008.

[10] Christopher Kiekintveld, Towhidul Islam, and Vladik Kreinovich. Security games with interval uncertainty.
In Proceedings of the 12th International Conference on Autonomous Agents and Multiagent Systems (AAMAS'13), pages 231–238, 2013.

[11] Joshua Letchford, Vincent Conitzer, and Kamesh Munagala. Learning and approximating the optimal strategy to commit to. In International Symposium on Algorithmic Game Theory, pages 250–262, 2009.

[12] Qiang Liu, Pan Li, Wentao Zhao, Wei Cai, Shui Yu, and Victor C. M. Leung. A survey on security threats and defensive techniques of machine learning: A data driven view. IEEE Access, 6:12103–12117, 2018.

[13] Richard D. McKelvey and Thomas R. Palfrey. Quantal response equilibria for normal form games. Games and Economic Behavior, 10(1):6–38, 1995.

[14] Thanh Hong Nguyen, Amulya Yadav, Bo An, Milind Tambe, and Craig Boutilier. Regret-based optimization and preference elicitation for Stackelberg security games with uncertainty. In Proceedings of the 28th AAAI Conference on Artificial Intelligence (AAAI'14), pages 756–762, 2014.

[15] Eugene Nudelman, Jennifer Wortman, Yoav Shoham, and Kevin Leyton-Brown. Run the GAMUT: A comprehensive approach to evaluating game-theoretic algorithms. In Proceedings of the 3rd International Joint Conference on Autonomous Agents and Multiagent Systems, Volume 2, pages 880–887, 2004.

[16] Praveen Paruchuri, Jonathan P. Pearce, Janusz Marecki, Milind Tambe, Fernando Ordóñez, and Sarit Kraus. Playing games for security: An efficient exact algorithm for solving Bayesian Stackelberg games. In Proceedings of the 7th International Joint Conference on Autonomous Agents and Multiagent Systems (AAMAS'08), pages 895–902, 2008.

[17] Binghui Peng, Weiran Shen, Pingzhong Tang, and Song Zuo. Learning optimal strategies to commit to.
In Proceedings of the 33rd AAAI Conference on Artificial Intelligence (AAAI'19), 2019.

[18] James Pita, Manish Jain, Fernando Ordóñez, Milind Tambe, Sarit Kraus, and Reuma Magori-Cohen. Effective solutions for real-world Stackelberg games: When agents must deal with human uncertainties. In Proceedings of the 8th International Conference on Autonomous Agents and Multiagent Systems (AAMAS'09), pages 369–376, 2009.

[19] Aaron Roth, Jonathan Ullman, and Zhiwei Steven Wu. Watch and learn: Optimizing from revealed preferences feedback. In Proceedings of the 48th Annual ACM Symposium on Theory of Computing (STOC'16), pages 949–962, 2016.

[20] Bernhard von Stengel and Shmuel Zamir. Leadership with commitment to mixed strategies. CDAM Research Report LSE-CDAM-2004-01, London School of Economics, 2004.

[21] Milind Tambe. Security and Game Theory: Algorithms, Deployed Systems, Lessons Learned. Cambridge University Press, 2011.

[22] Haifeng Xu, Long Tran-Thanh, and Nicholas R. Jennings. Playing repeated security games with no prior knowledge. In Proceedings of the 2016 International Conference on Autonomous Agents and Multiagent Systems (AAMAS'16), pages 104–112, 2016.