{"title": "Online Robust Policy Learning in the Presence of Unknown Adversaries", "book": "Advances in Neural Information Processing Systems", "page_first": 9916, "page_last": 9926, "abstract": "The growing prospect of deep reinforcement learning (DRL) being used in cyber-physical systems has raised concerns around safety and robustness of autonomous agents. Recent work on generating adversarial attacks have shown that it is computationally feasible for a bad actor to fool a DRL policy into behaving sub optimally. Although certain adversarial attacks with specific attack models have been addressed, most studies are only interested in off-line optimization in the data space (e.g., example fitting, distillation). This paper introduces a Meta-Learned Advantage Hierarchy (MLAH) framework that is attack model-agnostic and more suited to reinforcement learning, via handling the attacks in the decision space (as opposed to data space) and directly mitigating learned bias introduced by the adversary. In MLAH, we learn separate sub-policies (nominal and adversarial) in an online manner, as guided by a supervisory master agent that detects the presence of the adversary by leveraging the advantage function for the sub-policies. We demonstrate that the proposed algorithm enables policy learning with significantly lower bias as compared to the state-of-the-art policy learning approaches even in the presence of heavy state information attacks. We present algorithm analysis and simulation results using popular OpenAI Gym environments.", "full_text": "Online Robust Policy Learning in the Presence of\n\nUnknown Adversaries\n\nAaron J. Havens, Zhanhong Jiang, Soumik Sarkar\n\nDepartment of Mechanical Engineering\n\n{ajhavens,zhjiang,soumiks}@iastate.edu\n\nIowa State University\n\nAmes, IA 50011\n\nAbstract\n\nThe growing prospect of deep reinforcement learning (DRL) being used in cyber-\nphysical systems has raised concerns around safety and robustness of autonomous\nagents. 
Recent work on generating adversarial attacks has shown that it is computationally feasible for a bad actor to fool a DRL policy into behaving sub-optimally. Although certain adversarial attacks with specific attack models have been addressed, most studies are only interested in off-line optimization in the data space (e.g., example fitting, distillation). This paper introduces a Meta-Learned Advantage Hierarchy (MLAH) framework that is attack model-agnostic and more suited to reinforcement learning, via handling the attacks in the decision space (as opposed to the data space) and directly mitigating the learned bias introduced by the adversary. In MLAH, we learn separate sub-policies (nominal and adversarial) in an online manner, as guided by a supervisory master agent that detects the presence of the adversary by leveraging the advantage function for the sub-policies. We demonstrate that the proposed algorithm enables policy learning with significantly lower bias as compared to state-of-the-art policy learning approaches even in the presence of heavy state information attacks. We present algorithm analysis and simulation results using popular OpenAI Gym environments.

1 Introduction

Real applications of cyber-physical systems that utilize learning techniques are already abundant, such as smart buildings [1], intelligent transportation networks [2], and intelligent surveillance and reconnaissance [3]. In such systems, Reinforcement Learning (RL) [4, 5] is becoming a more attractive formulation for control of complex and highly non-linear systems. The application of Deep Learning (DL) has pushed recent advances in RL, namely Deep RL (DRL) [6, 7, 8]. Particularly in 3D continuous control tasks, DL is an indispensable tool due to its ability to generalize high-dimensional state-action spaces in policy optimization algorithms [9], [10].
Notable variance reduction and trust-region optimization strategies have only furthered the performance and stability of DRL controllers [11].

Although DL is generally useful for these control problems, it has inherent vulnerabilities in that even very small perturbations in state inputs can result in significant loss of policy learning performance. This becomes a very reasonable cause for concern when contemplating DRL controllers in real-world tasks where there exists not only environmental uncertainty, but perhaps also adversarial actors that aim to fool a DRL agent into making sub-optimal decisions. During policy learning, information perturbation can generally be thought of as a bias that can prevent the agent from effectively learning the desired policy. Previous attempts at mitigating adversarial attacks have been successful against specific attack models; however, such robust training strategies are typically off-line (e.g., using augmented datasets [12]) and may fail to adapt to different attacker strategies in an online fashion. Recently, [13] has taken a model-agnostic approach by predicting future states; however, it may be susceptible to multiple consecutive attacks.

Contributions: In this paper, we consider a policy learning problem where there are periods of adversarial attacks (via corrupted state inputs) while the agent is continuously learning in its environment.
Our main objective is online mitigation of the bias introduced into the nominal policy by the attack. We only consider how an attack affects the return instead of optimizing over the observation space. In this context, our specific contributions are:

1. Algorithm: We propose a new hierarchical meta-learning framework, MLAH, that can effectively detect and mitigate the impacts of adversarial state information attacks in an attack-model-agnostic manner, using only the advantage observation.

2. Analysis: Based on a temporal expectation definition, we analyze the performance of a single mapping policy and our proposed multi-policy mapping. Visitation frequency estimates lead us to a new pessimistic lower bound for TRPO and its variants.

3. Implementation: We implement the framework in widely utilized Gym benchmarks [14]. It is shown that MLAH is able to learn minimally biased policies under frequent attacks by learning to identify the adversary's presence from the return.

Although we mention several relevant techniques on learning with adversaries, we only contrast methodologies in Table 1 that aim to mitigate adversarial attacks, as other papers [15], [16] do not claim to do so. We compare our results with the state-of-the-art PPO [17], which is sufficiently robust to uncertainties, to understand the gain from multi-policy mapping.

Table 1: Comparisons with different robust adversarial RL methods

Method             | Online | Adaptive | Attack-model agnostic | Mitigation
VFAS [13]          |   ✓    |    ✓     |          ✓            |     ✗
ARDL [12]          |   ✗    |    ✓     |          ✗            |     ✗
MLAH [This paper]  |   ✓    |    ✓     |          ✓            |     ✓

Online: no offline training/retraining required; Adaptive: can adapt to a change in attack strategy; Attack-model agnostic: assumes no specific attack model; Mitigation: is the impact of the attack actively mitigated?

32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montréal, Canada.
The source code is available at https://github.com/AaronHavens/safe_rl.

Related work: Attacks on deep neural networks and mitigation strategies have only recently been studied, primarily for supervised classification problems. These attacks are most commonly formulated as first-order gradient-based attacks, first seen as FGSM by Goodfellow et al. [18]. These gradient-based perturbation attacks have proven to be effective at causing misclassification, with the corrupted input often being indistinguishable from the original. The same principle applies to DRL agents, where such attacks can drastically affect agent performance and bias the policy learning process. The authors in [19] showed a threat model that considered adversaries capable of dramatically degrading performance even with small adversarial perturbations imperceptible to humans. Three new attacks for different distance metrics were introduced in [20] for finding adversarial examples on defensively distilled networks. The authors in [21] introduced three new dimensions of adversarial attacks and used the policy's value function as a guide for when to inject perturbations. Interestingly, it has been seen that training DRL agents on designed adversarial perturbations can improve robustness against general model uncertainties [16], [15]. The adversarial robust policy learning algorithm [22] was introduced to leverage active computation of physically-plausible adversarial examples during training to enable robust performance under either random or adversarial input perturbations. Another robust DRL algorithm, EPOpt-ε, a robust policy search algorithm [23], was proposed to find a robust policy using the source distribution.
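For reference, the FGSM-style perturbation mentioned above can be illustrated with a minimal sketch. A tiny logistic model with an analytic input gradient stands in for a deep network here; the weights, input, and step size ε are illustrative assumptions, not values from [18].

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def bce_loss(x, y, w):
    # Binary cross-entropy loss of a logistic model p = sigma(w . x).
    p = sigmoid(w @ x)
    return -(y * np.log(p) + (1 - y) * np.log(1 - p))

def fgsm_perturb(x, y, w, eps):
    # FGSM: step in the direction of the sign of the loss gradient w.r.t. the input.
    # For this logistic model the gradient is analytic: dL/dx = (sigma(w . x) - y) * w.
    grad_x = (sigmoid(w @ x) - y) * w
    return x + eps * np.sign(grad_x)

w = np.array([1.0, -2.0, 0.5])   # illustrative model weights
x = np.array([0.2, -0.1, 0.4])   # clean input
y = 1.0                          # true label
x_adv = fgsm_perturb(x, y, w, eps=0.1)
# The perturbation has l-infinity norm exactly eps, and the loss on x_adv increases.
```

Note the perturbation budget is tiny relative to the input scale, mirroring the "indistinguishable from the original" property discussed above.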
Note that the recently mentioned methods do not aim to mitigate adversarial attacks at all, but rather intentionally bias the agent to perform better under model uncertainties.

2 Preliminaries and Problem Formulation

In this paper, we consider finite-horizon discounted Markov decision processes (MDPs), where each MDP mi is defined by a tuple (S, A, P, r, γ, ρ0), where S is a finite set of states, A is a finite set of actions, P is a mapping function that signifies the transition probability distribution, i.e., S × A × S → R, r is a reward function S → R with respect to a given state and r ∈ [rmin, rmax], ρ0 is a distribution of the initial states, and γ ∈ (0, 1) is the discount factor. The finite-horizon expected discounted reward R(π) following a policy π is defined as follows:

R(π) = E_{s0,a0,...}[ Σ_{t=0}^{T} γ^t r(s_t) ]    (1)

where s0 ∼ ρ0(s0), at ∼ πi(at|st), st+1 ∼ P(st+1|st, at). We want to maximize this discounted reward sum by optimizing a policy map π : S → A, discussed next.

2.1 Trust Region Optimization for Parameterized Policies

For more complex 3D control problems, policy optimization has proven to be the state-of-the-art approach. A multi-step policy optimization scheme presented in [11] dually maximizes the improvement (advantage function) of the new policy while penalizing the change between the old and new policy as described by a statistical distance, namely the Kullback-Leibler divergence. For continuous control policy optimization, a variant of the advantage function is often used, namely the generalized advantage estimate (GAE) from [24], which is parameterized by γ and λ, where V(st) is the value function. Intuitively, GAE attempts to balance the trade-off between bias and variance in the advantage estimate through the control parameter λ.
We will use this in policy optimization as well as in a method for temporal state abstraction later in the proposed algorithm.

A_{GAE,t} = ζ_t + (γλ)ζ_{t+1} + ... + (γλ)^{T−t−1} ζ_{T−1}    (2)

where ζ_t = r_t + γV(s_{t+1}) − V(s_t) and γ, λ ∈ [0, 1].

2.2 Meta-Learned Hierarchies

As a basis for our proposed MLAH framework, we consider a task with multiple objectives or latent states. In this context, we define a finite set of MDPs M : {m0, m1, ..., mn}, where an MDP mi, i ∈ {0, 1, ..., n}, is sampled for learning at time t. There exists a set of corresponding sub-policies Π : {π0, π1, ..., πn}, any of which may individually be used at any instant. We then have M → R and define a joint hierarchical objective for M composed of sub-policies:

R(Π) = E_{s0,π0,m0,...}[ Σ_{t=0}^{T} γ^t r(s_t) | m_i, π_i ]    (3)

Figure 1: A meta-learning hierarchy similar to MLSH in [25]. The master is tasked with choosing a sub-policy to maximize return in the current MDP mi.

Every mi can be thought of as a unique objective in the same state-action space. In our case, the RL agent is not aware of the specific mi at time t. This could alternatively be thought of as a partially observable MDP (POMDP); however, in this work we introduce a hierarchical RL architecture to explain the latent state. This hierarchical framework, depicted in Figure 1, was presented in [25] as Meta-Learned Shared Hierarchies (MLSH). π_master describes an agent whose task is to select the appropriate sub-policy to maximize return. The master policy π_master receives the observed reward and environment state. This mapping is far easier to learn as opposed to re-learning each sub-policy, which may be re-used.
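The GAE estimate of Eq. (2) above can be sketched directly from its definition; the reward and value arrays below are illustrative placeholders, and the backward recursion is an equivalent rewriting of the truncated sum.

```python
import numpy as np

def gae(rewards, values, gamma=0.99, lam=0.95):
    # Generalized advantage estimate, Eq. (2):
    #   A_t = sum_l (gamma * lam)^l * zeta_{t+l},
    # with TD residual zeta_t = r_t + gamma * V(s_{t+1}) - V(s_t).
    T = len(rewards)
    zeta = rewards + gamma * values[1:] - values[:-1]  # values has length T + 1
    adv = np.zeros(T)
    running = 0.0
    for t in reversed(range(T)):   # backward recursion: A_t = zeta_t + gamma*lam*A_{t+1}
        running = zeta[t] + gamma * lam * running
        adv[t] = running
    return adv

rewards = np.array([1.0, 0.5, -0.2, 1.0])
values = np.array([0.8, 0.7, 0.6, 0.9, 0.0])  # V(s_0) .. V(s_T), terminal value 0
A = gae(rewards, values)
```

Setting λ = 0 recovers the one-step residual ζ_t used later for the master agent's advantage coordinate.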
Since each mi ∈ M has a different S → R mapping, π_master has a non-stationary mapping across S, which requires the parameters of π_master to be reset on a predetermined interval.

2.3 Adversary Models

We consider adversaries that perturb the state provided to the agent at any given time instant. Formally,

Definition 1. An adversarial attack is any possible state observation perturbation that leads the agent into incurring a sub-optimal return, which is less than the return of the learned optimal policy. In other words, R(π|attack) < R(π). The adversary may only perturb the state observation channel, and not the reward channel itself.

Note that when discussing adversarial attacks, a common practice is to mathematically define a feasible perturbation with respect to the observation space. This work presents an alternative approach (later in the analysis in Section 4) by focusing only on the expected frequency of attacks and how it is realized in the RL decision space. This results in a framework that is more agnostic to a specific attack model and considers more than just the observation (data) space. However, it is important to note that the RL agent is not aware of any attack-model specifications.

3 Proposed Algorithms

We begin with a brief motivation for the proposed Meta-Learned Advantage Hierarchy (MLAH) algorithm. An intelligent agent, such as a human with a set of skills, when presented with a new task, should try out one of the known skills or policies and examine its effectiveness. When the task changes, based on the expectation of usefulness of that skill, the agent may keep using the same skill or try another skill that seems most appropriate for that task.
In this context, given that the agent has developed accurate expectations of its sub-policies (skills), if the underlying task were to change at any time, the agent may notice that the result of its action has changed with respect to what was expected. In an RL framework, comparing the expected return of a state to the observed return of some action is typically known as the advantage. Therefore, such an advantage estimate can serve as a good indicator of underlying changes in a task, which can be leveraged to switch from one sub-policy to another, more appropriate one.

With this motivation, we can map the current problem of learning a policy under intermittent adversarial perturbations to a meta-learning problem. As our adversarial attacks (by Definition 1) create a different state-reward map, a master policy may be able to detect an attack and help choose an appropriate sub-policy that corresponds to the adversarial scenario. More formally, we begin with two random policies that are meant to represent the two distinct partially observable conditions in our MDP: nominal states and adversarially perturbed states. One may begin by pre-training π_nom in isolation, seeing only nominal experiences. Since we cannot assume or simulate the adversary, it is typically not possible to pre-train π_adv, and it must be left to π_master to identify this alternative mapping. For each episode, we begin by collecting a trajectory of length T, allowing π_master at every time step (or on an interval) to either switch or continue using a sub-policy to act, based on the observed advantage coordinate. The advantage for π_master, represented by At, can be calculated using only the previous state-reward pair, or it can be computed as a generalized estimate over the past h time steps as a rolling window. The following Eq. 5 describes the optimal policy for the master agent choosing either to stay with the same policy or to switch to another policy.
While this form is similar to the generic policy in deep reinforcement learning, the only difference is the conditioning on the MDP, which indicates the nominal or adversarial scenario.

A_t = [ A_{GAE,t−h}|π_nom , A_{GAE,t−h}|π_adv ] ∈ R²    (4)

a_{master,t} = π*,t = argmax_a E_{s_t,π_i,m_i,...}[ Σ_{t=0}^{T} γ^t r(s_t, a) | m_i ] ∈ {0 : stay, 1 : switch}    (5)

Figure 2: a) Adversary interaction model: illustration of the adversarial attack mechanism, corrupting the state observation by injecting a perturbation ε before it reaches the agent, with no perturbation in the reward signal. b) MLAH framework: while similar to MLSH, the key differences are: 1) the master policy only observes the advantage of the sub-policies as its state, and 2) only two sub-policies (nominal/adversarial) are considered.

Observing the advantage rather than states and actions can be justified philosophically and has technical benefits when compared to other temporal state abstraction techniques that may be used to estimate the latent condition (RNN, LSTM). Although this mapping has the potential to be noisy, as the advantage can be trajectory-dependent, it is static across the multiple sub-policies, as opposed to a state-policy selection mapping, which must be re-learned with every change in the latent condition.

Advantage map as an effective metric to detect the adversary: To fool an RL agent into taking an alternative action, an adversary may use the policy network to compute a perturbation [18]. For attack mitigation, the RNN-based visual-foresight method [13] is practical, considering the distance of the predicted policy from the chosen policy. However, it was reported [13] that such a scheme can be fooled with a series of likely state perturbations.
However, in MLAH, even if the adversary could compute a series of likely states to fool the agent, the advantage would still be affected and the master agent may detect the attack. The adversary would have to consecutively fool the agent with a state that would be expected to give an equally bad reward. This constraint would make the perturbation especially hard or infeasible to compute. We do acknowledge, however, that this method is slightly delayed in that the agent has to experience an off-trajectory reward before it can detect the adversary's presence, and it may also have to observe long attack periods before learning the advantage mapping.

4 Analysis of Bias Mitigation and Policy Improvement

Here we present analysis to show that the proposed MLAH framework reduces bias in the value function baseline under adversarial attacks. We then show how reducing bias is inherently beneficial for policy learning (an improvement in the expected reward lower bound compared to the state of the art as presented in [11]) in the presence of adversaries.

Algorithm 1: MLAH
Input: π_nom and π_adv sub-policies parameterized by θ_nom and θ_adv; master policy π_master with parameter vector φ.
1  Initialize θ_nom, θ_adv, φ
2  for pre-training iterations [optional] do
3      Train π_nom and θ_nom on only nominal experiences.
4  end
5  for learning life-time do
6      for time steps t to t + T do
7          Compute A_t over sub-policies (see Eq. 4)
8          π_master selects to switch or stay with a sub-policy based on the A_t observations to take an action
9      end
10     Estimate all A_GAE for π_nom, π_adv over T
11     Estimate all A_GAE for π_master over T with respect to the A_t observations
12     Optimize θ_nom based on experiences collected from π_nom
13     Optimize θ_adv based on experiences collected from π_adv
14     Optimize φ based on all experiences with respect to the A_t observations
15 end
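As a toy illustration of the advantage-based detection that drives the switching in Algorithm 1, the sketch below flags an attack when the rolling mean of the nominal sub-policy's advantages drops well below zero (i.e., observed returns fall persistently short of the learned expectation). The window length, threshold, and advantage trace are illustrative assumptions, not values from the paper.

```python
import numpy as np

def detect_attack(advantages, window=5, threshold=-0.5):
    # A well-learned sub-policy has advantages centred near zero on its own MDP.
    # A sustained, strongly negative rolling advantage suggests the state-reward
    # map has changed, e.g. an adversary is perturbing the observations.
    advantages = np.asarray(advantages, dtype=float)
    flags = np.zeros(len(advantages), dtype=bool)
    for t in range(len(advantages)):
        lo = max(0, t - window + 1)
        flags[t] = advantages[lo:t + 1].mean() < threshold
    return flags

# Nominal period (advantage noise around 0) followed by an attack period
# (returns much worse than expected, so advantages turn strongly negative).
adv_trace = [0.1, -0.05, 0.02, -0.1, -1.2, -1.5, -1.1, -1.3]
flags = detect_attack(adv_trace)
```

Note that the rolling window makes the flag lag the onset of the attack by a few steps, mirroring the detection delay acknowledged above.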
In order to estimate the expected value learned by a policy, we consider a first-order stochastic transition model (from nominal-0 to adversary-1 and vice versa) for the temporal profile of the attack as follows:

P = [ p_{0|0}  p_{1|0} ] = [ m  1 − m ]
    [ p_{0|1}  p_{1|1} ]   [ n  1 − n ]

This defines a Markov chain (p_{b|a} denotes the probability of transitioning from a to b). Let the stationary distribution of this Markov chain be denoted by v = [p0, p1], which satisfies v = vP. Therefore,

p0 = n / (1 − m + n),    p1 = (1 − m) / (1 − m + n)

which describes the long-term expectation of visiting a nominal or adversarial state. As discussed in the preliminaries, trajectory experiences are handled with a distinct policy and value network when the adversarial attack is present. As the condition is perceived by the master agent, we can define two independent MDPs separately, i.e., one given a nominal state (p_{∼|0}) and another given the state perturbed by the adversary (p_{∼|1}). With this setup, we present an assumption as follows:

Assumption 1. The long-term expectation of visiting a nominal state is higher than that of an adversarial state, i.e., for the stochastic transition model P, n < m.

Let E_{s∼S|0}V(s) be the expected discounted reward over states S given that the policy only sees nominal conditions (m = 1). Similarly, let E_{s∼S|1}V(s) be the expected discounted reward for a policy that sees the adversarial states alone (m = 0, n = 0). We simplify the notation as follows: E_{s∼S|0}V(s) = V0 and E_{s∼S|1}V(s) = V1, as two value primitives. According to the definition of the adversary (Definition 1), we have V1 < V0, as a successful adversarial attack leads to a sub-optimal return. We can now compare the expected discounted return for the unconditioned and conditioned learning schemes.
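The stationary distribution above, and the gap between the unconditioned and conditioned expectations that Lemma 1 formalizes below, can be checked numerically. The values of m, n, V0, V1 here are arbitrary illustrative choices satisfying Assumption 1 (n < m) and V1 < V0.

```python
import numpy as np

m, n = 0.9, 0.2          # transition model parameters, n < m (Assumption 1)
V0, V1 = 1.0, -0.5       # nominal and adversarial value primitives, V1 < V0

# First-order transition model P (rows: from state 0, from state 1).
P = np.array([[m, 1 - m],
              [n, 1 - n]])

# Stationary distribution v = [p0, p1] satisfying v = vP.
p0 = n / (1 - m + n)
p1 = (1 - m) / (1 - m + n)
v = np.array([p0, p1])
assert np.allclose(v @ P, v)       # v is indeed stationary
assert np.isclose(p0 + p1, 1.0)

# Expected discounted-value primitives for the two schemes:
E_unc = V0 * p0 + V1 * p1          # single unconditioned policy
E_con = V0 * m + V1 * (1 - m)      # policy conditioned on the nominal state
# Lemma 1: conditioning on the nominal state yields a higher expectation.
assert E_unc < E_con
```

The same numbers also exhibit the bias ordering of Lemma 2, since V0 − E_con < V0 − E_unc.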
Here, the unconditioned scheme refers to the learning scheme of a classical DRL agent with one policy. In this case, the expected discounted reward under adversarial attacks can be expressed as:

E_{unc,s∼S}V(s) = V0 p0 + V1 p1 = V0 · n/(1 − m + n) + V1 · (1 − m)/(1 − m + n)    (6)

On the other hand, the conditioned scheme refers to the two sub-policies (one given the nominal state and the other given the adversarial state) based on the proposed MLAH framework. In this context, the expected discounted reward conditioned on the nominal state under adversarial attacks can be expressed as:

E_{con,s∼S|0}V(s) = V0 p_{0|0} + V1 p_{1|0} = V0 m + V1(1 − m)    (7)

We now provide a lemma to compare the unconditioned and conditioned (given a nominal state) expected discounted rewards.

Lemma 1. Let Assumption 1 hold. Then E_{unc,s∼S}V(s) < E_{con,s∼S|0}V(s).    (8)

See the proof in the supplementary material.

We next discuss different lower bounds of the expected discounted rewards for the conditioned and unconditioned policies. We begin by defining the observed bias in the state value for both the conditioned and unconditioned policies by comparing the expected discounted reward to the original nominal value primitive V0. Then we have

δ_{con|0} = V0 − E_{con,s∼S|0}V(s) = (1 − m)(V0 − V1)

With this setup, we present the following lemma.

Lemma 2. Let Assumption 1 hold. Then δ_{con|0} < δ_unc.

The proof is straightforward using Lemma 1 (see the supplementary material). In this context, we express V0 = E_{con,s∼S|0}V(s) + δ_{con|0} and V0 = E_{unc,s∼S}V(s) + δ_unc in a general way as V(s) = V̂(s) + δ, where δ is the observed bias in the state value. According to the definition of the advantage function in Eq.
2, letting λ = 0, we have A_π(s_t, a_t) = r_t + γV(s_{t+1}) − V(s_t). Substituting V(s) = V̂(s) + δ into the last equation yields

A_π(s_t, a_t) = r_t + γV̂(s_{t+1}) − V̂(s_t) + γδ_{s,t+1} − δ_{s,t} = Â_π(s_t, a_t) + γδ_{s,t+1} − δ_{s,t}    (9)

where Â_π(s_t, a_t) is the actual advantage function, and δ_unc = V0 − E_{unc,s∼S}V(s) = (1 − m)(V0 − V1)/(1 − m + n) is the corresponding unconditioned bias. While Lemma 2 shows that δ is reduced due to conditioning in our proposed framework, we note that the observed bias in the expected discounted reward can differ from that in the state value due to the complex and uncertain environment. Following the definition of the expected discounted reward in [11] and recalling V(s) = V̂(s) + δ, the relationship between the true and actual expected discounted reward is R(π) = E_{s∼π}[V̂_π(s_t, a_t) + δ] = R̂(π) + δ̂, where δ̂ is the observed bias in the expected discounted reward. We denote the observed bias in the reward for the unconditioned and conditioned cases as δ̂_unc and δ̂_{con|0}. Let Δδ̂ = δ̂_unc − δ̂_{con|0} and Δδ = δ_unc − δ_{con|0}. We are now ready to discuss the lower bounds of the expected discounted rewards for the conditioned and unconditioned schemes. Before that, based on [11], we introduce the maximum total variation divergence for any two different policies and use α to denote it for the rest of the analysis. We also first present a proposition to show the relationship between the actual expected discounted reward and its approximation. It is an extension of Theorem 1 in [11], which helps characterize the main claim of the paper.

Proposition 1. Let Assumption 1 hold.
Then the following inequality holds:

R̂(π_new) ≥ L̂_{π_old}(π_new) − 4ε̃γα² / (1 − γ)²    (10)

where π_new indicates the new policy, π_old indicates the current policy, L̂_{π_old}(π_new) = L_{π_old}(π_new) + δ − δ̂, L_{π_old}(π_new) is the approximation of R(π_new), i.e., L_{π_old}(π_new) = R(π_old) + Σ_s ρ_{π_old}(s) Σ_a π_new(a|s) A_{π_old}(s, a), ρ is the discounted visitation frequency as similarly defined in [11], and ε̃ satisfies the following relationship:

ε̃ = { max_{s,a}|Â_π(s, a)| + (γ − 1)δ,    if Â_π(s, a) ≥ (1 − γ)δ
     { −max_{s,a}|Â_π(s, a)| + (1 − γ)δ,   if 0 < Â_π(s, a) < (1 − γ)δ    (11)
     { max_{s,a}|Â_π(s, a)| + (1 − γ)δ,    if Â_π(s, a) ≤ 0

Proof sketch: Combining the proofs of Lemmas 1, 2, and 3 in [11], we immediately arrive at a form similar to the conclusion of Theorem 1 in [11]. We then discuss the relationship between the actual advantage value Â_π(s, a) and the observed bias in the state value (1 − γ)δ to complete the proof. Due to space limits, we give only the proof sketch here; the complete proof can be found in the supplementary material and [26].

We then arrive at the following result, showing that using the conditioned policy achieves a higher lower bound on the expected discounted reward.

Proposition 2.
If Δδ̂ < CΔV, where C ≥ (m − n)(1 − m)(4γα² + 1 − γ) / ((1 − m + n)(1 − γ)) and ΔV = V0 − V1, then the conditioned policy has a higher lower bound on the expected discounted reward compared to that of the unconditioned policy.

Proof sketch: Based on Theorem 1 in [11] and Proposition 1, we obtain the approximation of the actual expected discounted reward for both the conditioned and unconditioned policies. Similarly, we can obtain ε̃_{con|0} and ε̃_unc. Due to the condition Â_π(s, a) ≥ (1 − γ)δ, we get the relationship between the approximation of the actual expected discounted reward and the observed bias in the expected discounted reward. Combining this with the condition Δδ̂ < CΔV and some mathematical manipulation completes the proof. Due to space limits, we present only the proof sketch; the complete proof can be found in the supplementary material and [26].

Remark 1. Proposition 2 suggests that under a certain condition, using the conditioned policy can improve the lower bound of the expected discounted return over the unconditioned policy. Intuitively, the condition demands that the adversary be sufficiently intelligent in order to produce a large enough value of ΔV.

5 Experimental Results

In order to justify the theoretical implications of bias reduction using conditioned policy optimization, we implemented the framework introduced in Section 3 with a selection of simple adversary models. Because the meta-learned framework has many moving parts and can be subject to instabilities, we first consider a case where the master agent is an oracle in determining the presence of an adversary.
Then we consider advantage-based adversary detection by the master agent.

5.1 Experimental Setup

For all experiments, we use the proximal clipped objective L^{CLIP+VF}(θ) from [17] instead of a constrained trust-region optimization, in accordance with recent results showing similar performance and ease of implementation. We use the same optimization for the master agent; although we acknowledge this may not be the best method for only two action choices (nominal or adversarial), we propose it to generalize to an arbitrary number of sub-policies. In every example, training denotes the agent acting with an ε-greedy exploration policy under adversarial attacks. Simultaneously, we run an evaluation that executes deterministic actions with the same policy, without adversarial attacks, and hence obtains much higher return values. For the examples shown, we introduce the adversary on a fixed interval (e.g., 5000 time steps with the adversary, 10000 without). During that period, the adversary perturbs the state at every time step. Due to page limit constraints, the PPO parameters used in the experiments, such as deep network sizes and actor-batches, can be found in the supplementary material.

Stochastic l∞-bounded attacks: In this paper, for the purpose of experiments, we consider an attacker model that has the ability to perturb state information from the environment before it reaches the agent. Since gradient-based attacks on continuous-action policies have not been thoroughly studied, we focus on naive attacks that sample the perturbation size and direction from a defined uniform distribution U(a, b) about the current state s = [s0, s1, ..., sn].
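A minimal numpy sketch of such a stochastic l∞-bounded perturbation is given below; the bound ε_attack and the interval (a, b) are illustrative values, and the clipping step is one simple way to enforce the l∞ constraint described next.

```python
import numpy as np

def linf_uniform_attack(s, a, b, eps_attack, rng=None):
    # Perturb each state component by a sample from U(a, b), then clip the
    # perturbation so that max_i |s_i - s_i_adv| <= eps_attack (l-infinity bound).
    # a = -b gives a white-noise attack; a != b with a < b gives a bias attack.
    rng = np.random.default_rng() if rng is None else rng
    delta = rng.uniform(a, b, size=np.shape(s))
    delta = np.clip(delta, -eps_attack, eps_attack)
    return np.asarray(s) + delta

rng = np.random.default_rng(0)
s = np.array([0.1, -0.3, 0.7])
s_adv = linf_uniform_attack(s, a=0.05, b=0.4, eps_attack=0.2, rng=rng)  # bias attack
```

Only the observation channel is touched, consistent with Definition 1: the environment's true state and reward are left unchanged.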
This results in an attack where s_{i,adversary} = s_i + U(a, b), with the perturbation bounded in the l∞ norm so that max_i |s_i − s_{i,adversary}| ≤ ε_attack for all s_i ∈ s. We find that this naive attack is effective enough to significantly decrease the return of a policy, although we do provide one example of an iterative gradient-based attack in the supplementary material, Section 7.4. We specifically utilize white-noise attacks, where a = −b, as well as bias attacks, where a ≠ b and a < b.

5.2 Adversarial Bias Reduction with MLAH

We begin by examining an RL environment where the master agent is asked to select the policy that corresponds to the current condition, i.e., nominal or adversarial. We acknowledge that this "policy" may not be the optimal master policy, since a game may not be perfectly Markov. However, we find that this is sufficient to examine the policy improvement in some OpenAI MuJoCo control environments [14].

Figure 3: Results of Oracle-MLAH and vanilla PPO applied to the InvertedPendulum-v2 game with repeatedly scheduled attacks (on for 5000 time steps, then off for 10000), displaying a 1σ bound. Left: Case study with an extreme bias attack spanning the entire state space. The vanilla policy is unable to resolve the correct mapping due to large disturbances in the state information, while MLAH improves nearly monotonically.
Right: Case study with a weaker bias attack; the Vanilla agent still struggles.

Table 2: Performance evaluation of Oracle-MLAH

                 Normalized avg. training return     Normalized avg. evaluation return
m/n              Vanilla         Oracle-MLAH         Vanilla         Oracle-MLAH
1.0/−            0.96 ± 0.03     0.96 ± 0.03         1.0             1.0
0.995/0.005      0.238 ± 0.082   0.553 ± 0.242       0.471 ± 0.051   0.99 ± 0.001
0.95/0.05        0.612 ± 0.08    0.677 ± 0.149       0.644 ± 0.078   0.99 ± 0.001
0.8/0.2          0.613 ± 0.043   0.728 ± 0.063       0.539 ± 0.023   0.994 ± 0.165
0.5/0.5          0.749 ± 0.093   0.764 ± 0.078       0.787 ± 0.010   0.948 ± 0.086

Comparison of the returns of Vanilla PPO and Oracle-MLAH under attacks over 40 policy optimization iterations with 1σ uncertainty bounds. The training return uses a stochastic policy for exploration, while evaluation acts deterministically. The evaluation bias for Oracle-MLAH remains substantially lower over all attack severity levels. Note that when m = n, training returns are very similar, as predicted by Eq. 8.

The returns shown in Table 2 and Figure 3 for long and intermittent bias attacks (large m and small n) clearly demonstrate the benefit of using distinct policies for nominal and adversarial states, respectively. According to Eq. 8, this attack condition produces the largest difference in bias between conditioned and unconditioned policies. As a policy can only solve for one state-action mapping and there are clearly two separate MDP state-reward distributions existing across time, a single policy has no choice but to optimize over the mean of these two distributions. Often this results in not developing a useful policy for either condition, as shown in Figure 3. Enabling the use of multiple policies in this intermittent attack case allows the agent to optimize for both mappings, even learning to mitigate the reduced return during the adversarial attack.
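For concreteness, the stochastic l∞-bounded attack described in Section 5.1 can be sketched in a few lines of NumPy. This is an illustrative sketch under the paper's notation; the function name and the explicit clip are ours (the paper simply requires a and b to lie within the attack bound):

```python
import numpy as np

def linf_uniform_attack(s, a, b, eps_attack):
    """Naive state-information attack: add U(a, b) noise to every component
    of the observed state, clipped so the l-infinity bound
    max_i |s_i - s_{i,adversary}| <= eps_attack always holds.
    White-noise attack: a = -b. Bias attack: a != b with a < b."""
    delta = np.random.uniform(a, b, size=np.shape(s))
    delta = np.clip(delta, -eps_attack, eps_attack)
    return np.asarray(s) + delta
```

A bias attack with 0 < a < b shifts every state component in the same direction at every step, which is the condition that produces the large learned bias seen in Table 2.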
More simulation results using OpenAI Gym environments such as MountainCarContinuous-v0 and Hopper-v2 [14] are included in the supplementary material.
It can be seen in Table 2 that as the switching expectations between nominal and adversarial states rise, the unconditioned (Vanilla) policy actually performs increasingly well, though still worse than the conditioned (MLAH) policy. This is perhaps because the switching is quick enough to map the scenario to one state-reward distribution, which is favorable for a single-policy agent.
As anticipated by the analysis, when m = n, the training performances of both policies approach a similar value; however, the conditioned MLAH agent was able to maintain a nearly unbiased evaluation return. This may be an artifact of the environment or adversary, which is relatively simple and unintelligent. Over longer attack periods, it may be unrealistic to expect the return to behave according to the stationary distribution expectation, because the average resolves on a longer time scale than policy optimization.

Figure 4: Master agent's performance in learning from two random policies to decide which to employ to maximize the reward of InvertedPendulum-v2, with bounded bias attacks scheduled 5000 time steps on and 10000 off. The master agent is not given any information on which states are perturbed by the adversary. After initial learning, the policy choices clearly diverge during the attack intervals, with few exceptions.

Next we put our master agent to the test, using the relative advantage coordinate mappings. This formulation is a novel alternative to previous meta-learned hierarchies, which are non-stationary and need to be reset over time [25]. The relative advantage mapping is stationary across multiple MDPs under certain conditions.
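Schematically, the decision the master agent must learn can be caricatured with a simple threshold rule over the nominal sub-policy's advantage. This is purely an illustrative stand-in, not the authors' implementation: in MLAH the master is itself a PPO-trained policy over advantage observations, and the class and parameter names below are ours.

```python
class ToyMaster:
    """Illustrative threshold-based stand-in for the learned master agent."""

    def __init__(self, threshold=-0.5, smoothing=0.9):
        self.threshold = threshold    # advantage level below which an attack is suspected
        self.smoothing = smoothing    # EMA factor to damp single-step advantage noise
        self.avg_adv = 0.0

    def choose(self, adv_nominal):
        # A persistently negative advantage under the nominal sub-policy signals
        # that the state-reward mapping has changed (Definition 1), so hand
        # control to the adversarial sub-policy.
        self.avg_adv = self.smoothing * self.avg_adv + (1 - self.smoothing) * adv_nominal
        return "adversarial" if self.avg_adv < self.threshold else "nominal"
```

In the actual algorithm this hand-set threshold is replaced by a learned mapping from relative advantage to sub-policy choice, which is what allows the hierarchy to generalize to an arbitrary number of sub-policies.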
In order for the master agent to arrive at the correct advantage-policy mapping, the policies themselves must also optimize to produce better advantage estimates in this expectation-maximization (EM) type algorithm. This makes it challenging to produce a stable learning sequence of policies and advantage mappings. However, this mapping can be learned from "nothing" if an adversary creates a strong enough presence by altering the state-reward mapping (by Definition 1). This optimization process is explained in more depth in the supplementary material. Depending on whether the nominal policy is pre-trained and on the effectiveness of the adversary, the master agent can reliably use each policy during the respective condition. As seen in Figure 4, an adversary is introduced in an intermittent manner and the master agent has two random sub-policies at its disposal.

Figure 5: Shown is MLAH simultaneously learning to switch policies and a mitigation strategy on a small 11 × 11 grid world. The adversary simply gives the agent a deterministic observation with the column mirrored about the center of the grid, so that under attack the optimal policy is different for every state. The attacks are applied intermittently on intervals of 5000 actions; shading shows 1σ variance.

The agent learns to use one policy for the nominal condition and the other for the adversarial condition in order to maximize its reward. The policy-selection results in Figure 4 may resemble a Bayesian non-parametric latent-state estimator [27]. However, being entirely in the context of RL, MLAH is unique and uses the advantage observation and a meta-learning objective to form a belief over the latent conditions.
5.3 Countering Deterministic Attack Strategies
Given a deterministic attack strategy, it is likely that there is a learnable counter-policy that performs optimally once the attack is detected.
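The deterministic mirror attack used in the grid-world study of Figure 5 can be sketched directly. This is a minimal sketch assuming an 11 × 11 grid with zero-indexed columns; the names are illustrative, not from the authors' code:

```python
GRID_SIZE = 11  # 11 x 11 grid world, as in Figure 5

def mirror_column_attack(row, col, grid_size=GRID_SIZE):
    """Deterministic adversary: reflect the agent's observed column about the
    center column of the grid. Applying it twice recovers the true state, and
    in every non-center column the optimal action under the attacked
    observation differs from the nominal one."""
    return row, (grid_size - 1) - col
```

Because the attack is a fixed bijection on observations, a second sub-policy can learn the mirrored S → A map exactly, which is what makes a learnable counter-policy possible here.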
To clearly demonstrate MLAH, we create a GridWorld environment and an attack which simply reflects the agent's column position about the center of the grid. This attack is interesting to us because it is completely deterministic and, for the single-policy Vanilla PPO, it is impossible to be optimal in both nominal and adversarial conditions, which would require an additional mirrored S → A map. In this experiment we let the policy train in the nominal environment with no attack for 40 iterations (∼ 40000 actions). Once attacks are introduced, it only takes MLAH about 50000 actions not only to solve the meta-task of switching policies, but also to learn the adversary-mitigation policy from a random initial policy. The success of each of these tasks strongly depends on the other; that is, a noisy value function will result in poor policy switching and vice versa. This is quite evident from the variation in reward when MLAH is learning the attack strategy and the switching policy simultaneously in Figure 5. There may be further improvements to address the stability of this algorithm for more complicated tasks.
6 Conclusions
We have discussed a new MLAH framework for handling adversarial attacks in an online manner, specifically in the context of RL. This framework is attack-model agnostic and presents a general way of examining adversarial attacks in the temporal domain. Analyzing the hierarchical MLAH policy in this way, we can show that under certain conditions the return lower-bound is improved compared to a single-policy agent. In future research, we aim to improve the stability of MLAH by optimizing the master agent function, perhaps using a simpler method to regress the advantage space. We will also attempt to extend MLAH to a more general framework for decision problems with multiple time-varying objectives.
Acknowledgements
This work has been supported in part by the U.S.
AFOSR under the YIP grant FA9550-17-1-0220. Any opinions, findings and conclusions or recommendations expressed in this publication are those of the authors and do not necessarily reflect the views of the sponsoring agencies.

References
[1] Chi-Sheng Shih, Jyun-Jhe Chou, Niels Reijers, and Tei-Wei Kuo. Designing CPS/IoT applications for smart buildings and cities. IET Cyber-Physical Systems: Theory & Applications, 1(1):3–12, 2016.

[2] Danda B Rawat, Chandra Bajracharya, and Gongjun Yan. Towards intelligent transportation cyber-physical systems: Real-time computing and communications perspectives. In SoutheastCon 2015, pages 1–6. IEEE, 2015.

[3] Antreas Antoniou and Plamen Angelov. A general purpose intelligent surveillance system for mobile devices using deep learning. In Neural Networks (IJCNN), 2016 International Joint Conference on, pages 2879–2886. IEEE, 2016.

[4] Richard S Sutton, Andrew G Barto, and Ronald J Williams. Reinforcement learning is direct adaptive optimal control. IEEE Control Systems, 12(2):19–22, 1992.

[5] RS Sutton and AG Barto. Reinforcement learning: An introduction (in preparation), 2017.

[6] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A Rusu, Joel Veness, Marc G Bellemare, Alex Graves, Martin Riedmiller, Andreas K Fidjeland, Georg Ostrovski, et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529, 2015.

[7] Volodymyr Mnih, Adria Puigdomenech Badia, Mehdi Mirza, Alex Graves, Timothy Lillicrap, Tim Harley, David Silver, and Koray Kavukcuoglu. Asynchronous methods for deep reinforcement learning. In International Conference on Machine Learning, pages 1928–1937, 2016.

[8] Hado Van Hasselt, Arthur Guez, and David Silver. Deep reinforcement learning with double q-learning. In AAAI, volume 16, pages 2094–2100, 2016.

[9] T. P. Lillicrap, J. J. Hunt, A. Pritzel, N. Heess, T. Erez, Y. Tassa, D. Silver, and D.
Wierstra. Continuous control with deep reinforcement learning. ArXiv e-prints, September 2015.

[10] Sergey Levine, Chelsea Finn, Trevor Darrell, and Pieter Abbeel. End-to-end training of deep visuomotor policies. Journal of Machine Learning Research, 17(39):1–40, 2016.

[11] John Schulman, Sergey Levine, Pieter Abbeel, Michael Jordan, and Philipp Moritz. Trust region policy optimization. In International Conference on Machine Learning, pages 1889–1897, 2015.

[12] Aleksander Madry, Aleksandar Makelov, Ludwig Schmidt, Dimitris Tsipras, and Adrian Vladu. Towards deep learning models resistant to adversarial attacks. arXiv preprint arXiv:1706.06083, 2017.

[13] Y.-C. Lin, M.-Y. Liu, M. Sun, and J.-B. Huang. Detecting adversarial attacks on neural network policies with visual foresight. ArXiv e-prints, October 2017.

[14] Greg Brockman, Vicki Cheung, Ludwig Pettersson, Jonas Schneider, John Schulman, Jie Tang, and Wojciech Zaremba. OpenAI Gym, 2016.

[15] A. Pattanaik, Z. Tang, S. Liu, G. Bommannan, and G. Chowdhary. Robust deep reinforcement learning with adversarial attacks. ArXiv e-prints, December 2017.

[16] L. Pinto, J. Davidson, R. Sukthankar, and A. Gupta. Robust adversarial reinforcement learning. ArXiv e-prints, March 2017.

[17] John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.

[18] I. J. Goodfellow, J. Shlens, and C. Szegedy. Explaining and harnessing adversarial examples. ArXiv e-prints, December 2014.

[19] Sandy Huang, Nicolas Papernot, Ian Goodfellow, Yan Duan, and Pieter Abbeel. Adversarial attacks on neural network policies. arXiv preprint arXiv:1702.02284, 2017.

[20] Nicholas Carlini and David Wagner. Towards evaluating the robustness of neural networks. In Security and Privacy (SP), 2017 IEEE Symposium on, pages 39–57.
IEEE, 2017.

[21] Jernej Kos and Dawn Song. Delving into adversarial attacks on deep policies. arXiv preprint arXiv:1705.06452, 2017.

[22] Ajay Mandlekar, Yuke Zhu, Animesh Garg, Li Fei-Fei, and Silvio Savarese. Adversarially robust policy learning: Active construction of physically-plausible perturbations. In IEEE International Conference on Intelligent Robots and Systems (to appear), 2017.

[23] Aravind Rajeswaran, Sarvjeet Ghotra, Balaraman Ravindran, and Sergey Levine. EPOpt: Learning robust neural network policies using model ensembles. arXiv preprint arXiv:1610.01283, 2016.

[24] J. Schulman, P. Moritz, S. Levine, M. Jordan, and P. Abbeel. High-dimensional continuous control using generalized advantage estimation. ArXiv e-prints, June 2015.

[25] K. Frans, J. Ho, X. Chen, P. Abbeel, and J. Schulman. Meta learning shared hierarchies. ArXiv e-prints, October 2017.

[26] Aaron J Havens, Zhanhong Jiang, and Soumik Sarkar. Online robust policy learning in the presence of unknown adversaries. arXiv preprint arXiv:1807.06064, 2018.

[27] Emily Fox, Erik B Sudderth, Michael I Jordan, and Alan S Willsky. Bayesian nonparametric inference of switching dynamic linear models. IEEE Transactions on Signal Processing, 59(4):1569–1585, 2011.