{"title": "Structural Causal Bandits: Where to Intervene?", "book": "Advances in Neural Information Processing Systems", "page_first": 2568, "page_last": 2578, "abstract": "We study the problem of identifying the best action in a sequential decision-making setting when the reward distributions of the arms exhibit a non-trivial dependence structure, which is governed by the underlying causal model of the domain where the agent is deployed. In this setting, playing an arm corresponds to intervening on a set of variables and setting them to specific values. In this paper, we show that whenever the underlying causal model is not taken into account during the decision-making process, the standard strategies of simultaneously intervening on all variables or on all the subsets of the variables may, in general, lead to suboptimal policies, regardless of the number of interventions performed by the agent in the environment. We formally acknowledge this phenomenon and investigate structural properties implied by the underlying causal model, which lead to a complete characterization of the relationships between the arms' distributions. We leverage this characterization to build a new algorithm that takes as input a causal structure and finds a minimal, sound, and complete set of qualified arms that an agent should play to maximize its expected reward. 
We empirically demonstrate that the new strategy learns an optimal policy and leads to orders of magnitude faster convergence rates when compared with its causal-insensitive counterparts.", "full_text": "Structural Causal Bandits: Where to Intervene?\n\nSanghack Lee\nDepartment of Computer Science\nPurdue University\nlee2995@purdue.edu\n\nElias Bareinboim\nDepartment of Computer Science\nPurdue University\neb@purdue.edu\n\nAbstract\n\nWe study the problem of identifying the best action in a sequential decision-making setting when the reward distributions of the arms exhibit a non-trivial dependence structure, which is governed by the underlying causal model of the domain where the agent is deployed. In this setting, playing an arm corresponds to intervening on a set of variables and setting them to specific values. In this paper, we show that whenever the underlying causal model is not taken into account during the decision-making process, the standard strategies of simultaneously intervening on all variables or on all the subsets of the variables may, in general, lead to suboptimal policies, regardless of the number of interventions performed by the agent in the environment. We formally acknowledge this phenomenon and investigate structural properties implied by the underlying causal model, which lead to a complete characterization of the relationships between the arms' distributions. We leverage this characterization to build a new algorithm that takes as input a causal structure and finds a minimal, sound, and complete set of qualified arms that an agent should play to maximize its expected reward. 
We empirically demonstrate that the new strategy learns an optimal policy and leads to orders of magnitude faster convergence rates when compared with its causal-insensitive counterparts.\n\n1 Introduction\n\nThe multi-armed bandit (MAB) problem is one of the prototypical settings studied in the sequential decision-making literature [Lai and Robbins, 1985, Even-Dar et al., 2006, Bubeck and Cesa-Bianchi, 2012]. An agent needs to decide which arm to pull and receives a corresponding reward at each time step while keeping the goal of maximizing its cumulative reward in the long run. The challenge is the inherent trade-off between exploiting known arms versus exploring new reward opportunities [Sutton and Barto, 1998, Szepesvári, 2010]. There is a wide range of assumptions underlying MABs, but in most of the traditional settings, the arms' rewards are assumed to be independent, which means that knowing the reward distribution of one arm has no implication for the reward of the other arms. Many strategies were developed to solve this problem, including classic algorithms such as ε-greedy, variants of UCB [Auer et al., 2002, Cappé et al., 2013], and Thompson sampling [Thompson, 1933].\n\nRecently, the existence of some non-trivial dependencies among arms has been acknowledged in the literature and studied under the rubric of structured bandits, which include settings such as linear [Dani et al., 2008], combinatorial [Cesa-Bianchi and Lugosi, 2012], unimodal [Combes and Proutiere, 2014], and Lipschitz [Magureanu et al., 2014], just to name a few. For example, a linear (or combinatorial) bandit imposes that an action x_t ∈ R^d (or {0, 1}^d) at a time step t incurs a cost ℓ_t^⊤ x_t, where ℓ_t is a loss vector chosen by, e.g., an adversary. 
In this case, an index-based MAB algorithm, oblivious to the structural properties, can be suboptimal.\n\n32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montréal, Canada.\n\nFigure 1: MAB problems as directed acyclic graphs where U is an unobserved variable; panels: (a) Standard MAB, (b) IV-MAB, (c) Cumulative Regrets, (d) Probability. Plots of cumulative regrets and probability of selecting an optimal arm when a MAB algorithm intervenes on X1 and X2 simultaneously (All-at-once) or on all subsets of {X1, X2} for IV-MAB. The IV-MAB is also used in the experimental section (see Appendix D [Lee and Bareinboim, 2018] for its parametrization).\n\nIn another line of investigation, rich environments with complex dependency structures are modeled explicitly through the use of causal graphs, where nodes represent decisions and outcome variables, and directed edges represent direct influence of one variable on another [Pearl, 2000]. Despite the apparent connection between MABs and causality, only recently has the use of causal reasoning been incorporated into the design of MAB algorithms. For instance, Bareinboim et al. [2015] first explored the connection between causal models with unobserved confounders (UCs) and reinforcement learning, where latent factors affect both the reward distribution and the player's intuition. The key observation used in the paper is that while standard MAB algorithms optimize based on the do-distribution (formally written as E[Y | do(X)] or E[Y_x]), the simplest type of counterfactuals, this approach is dominated by another strategy using a more detailed counterfactual as the basis of the optimization process (i.e., E[Y_x | X = x′]); this general strategy was called regret decision criterion (RDC). This strategy was later extended to handle counterfactual distributions of higher dimensionality by Forney et al. [2017]. Further, Lattimore et al. 
[2016] and Sen et al. [2017] studied the problem of best arm identification through importance weighting, where information on how playing arms influences the direct causes (parents, in causal terminology) of a reward variable is available. Zhang and Bareinboim [2017] leveraged causal graphs to solve the problem of off-policy evaluation in the presence of UCs. They noted that whenever UCs are present, traditional off-policy methods can be arbitrarily biased, leading to linear regret. They then showed how to solve the off-policy evaluation problem by incorporating the causal bounds into the decision-making procedure [footnote 1]. Overall, these works showed different aspects of the same phenomenon: whenever UCs are present in the real world, the expected guarantees provided by standard methods are no longer valid, which translates to an inability to converge to any reasonable policy. They then showed that convergence can be restored once the causal structure is acknowledged and used during the decision-making process.\n\nIn this paper, we focus on the challenge of identifying the best action in MABs where the arms correspond to interventions on an arbitrary causal graph, including when latent variables confound the observed relations (i.e., semi-Markovian causal models). To understand this challenge, we first note that a standard MAB can be seen as the simple causal model shown in Fig. 1a, where X represents an arm (with K different values), Y the reward variable, and U the unobserved variable that generates the randomness of Y [footnote 2]. After a sufficiently large number of pulls of X (chosen by the specific algorithm), Y's average reward can be determined with high confidence.\n\nWhenever a set of UCs affects more than one observed variable, however, novel, non-trivial challenges arise. To witness, consider the more involved MAB structure shown in Fig. 
1b, where an unobserved confounder U affects both the action variable X1 and the reward Y. A naive approach for an algorithm to play such a bandit would be to pull arms in a combinatorial manner, i.e., combining both variables (X1 × X2) so that arms are D(X1) × D(X2), where D(X) is the domain of X. One may surmise that this is a valid strategy, albeit not the most efficient one. Somewhat unexpectedly, however, Fig. 1c shows that this is not the case: the optimal action comes from pulling X2 and ignoring X1, while pulling {X1, X2} together would lead to subpar cumulative rewards (regardless of the number of iterations) since it simply cannot pull the optimal arm (Fig. 1d). After all, if one is oblivious to the causal structure and decides to take all intervenable variables as one (in this case, X1 × X2), indiscriminately, one may be doomed to learn a suboptimal policy.\n\nFootnote 1: On another line of investigation, Ortega and Braun [2014] introduced a generalized version of Thompson sampling applied to the problem of adaptive control.\n\nFootnote 2: In causal notation, Y ← f_Y(U, X), which means that Y's value is determined by X and the realization of the latent variable U. If f_Y is linear, we would have a (stochastic) linear bandit. Our results do not constrain the types of structural functions, which is usually within nonparametric causal inference [Pearl, 2000, Ch. 7].\n\n[Figure 1c-d axes: Trials vs. Cum. Regrets and Trials vs. Opt. Arm Prob.; legend: All Subsets, All-at-once.]\n\nIn this paper, we investigate this phenomenon, and more broadly, causal MABs with non-trivial dependency structure between the arms. More specifically, our contributions are as follows: (1) We formulate a SCM-MAB problem, which is a structured multi-armed bandit instance within the causal framework. 
We then derive the structural properties of a SCM-MAB, which are computable from any causal model, including arms' equivalence based on do-calculus [Pearl, 1995], and partial orderedness among sets of variables associated with arms in regards to the maximum rewards achievable. (2) We characterize a special set of variables called POMIS (possibly-optimal minimal intervention set), which is worth intervening on based on the aforementioned partial orders. We then introduce an algorithm that identifies a complete set of POMISs so that only the subset of arms associated with them needs to be explored in a MAB algorithm. Simulations corroborate our findings.\n\n[Figure 2 axis labels: causal, acausal.]\n\nBig picture. The multi-armed bandit is a rich setting in which a huge number of variants has been studied in the literature. Different aspects of the decision-making process have been analyzed and well-understood in the last decades, which include different functional forms (e.g., linear, Lipschitz, Gaussian process), types of feedback experienced by the agent (bandit, semi-bandit, full), the adversarial or i.i.d. nature of the interactions, just to cite some of the most popular ones. Our study of SCM-MABs puts the causal dimension front and center in the map. In particular, we fully acknowledge the existence of a causal structure among the underlying variables (whenever not known a priori, see Footnote 3), and leverage the qualitative relations among them. This is in clear contrast with the prevailing practice that is more quantitative and, almost invariably, is oblivious to the underlying causal structure (as shown in Fig. 1a). We outline in Fig. 2 an initial map that shows the relationship between these dimensions; our goal here is not to be exhaustive, nor prescriptive, but to help to give some perspective. 
In this paper, we study bandits with no constraints over the underlying functional form (nonparametric, in causality language), i.i.d. stochastic rewards, and with an explicit causal structure acknowledged by the agent.\n\nFigure 2: A bandit space with various dimensions (not all dimensions are shown). [Axis labels: iid stochastic, Markovian, adversarial; linear, Lipschitz, GP.]\n\nPreliminaries: notations and structural causal models\n\nWe follow the notation used in the causal inference literature. A capital letter is used for a variable or a mathematical object. The domain of X is denoted by D(X). A bold capital letter is for a set of variables, e.g., X = {X_i}_{i=1}^{n}, while a lowercase letter x ∈ D(X) is a value assigned to X, and a bold lowercase x ∈ D(X) = ×_{X∈X} D(X) is a value assignment to a set X. We denote by x[W] the values of x corresponding to W ∩ X. A graph G = ⟨V, E⟩ is a pair of vertices V and edges E. We adopt family relationships pa, ch, an, and de to denote parents, children, ancestors, and descendants of a given variable; Pa, Ch, An, and De extend pa, ch, an, and de by including the argument in the result, e.g., Pa(X)_G = pa(X)_G ∪ {X}. With a set of variables as argument, pa(X)_G = ∪_{X∈X} pa(X)_G, and similarly for the other relations. We denote by V(G) the set of variables in G. G[V′] for V′ ⊆ V(G) is a vertex-induced subgraph where all edges among V′ are preserved. We define G\\X as G[V(G)\\X] for X ⊆ V(G).\n\nWe adopt the language of Structural Causal Models (SCM) [Pearl, 2000, Ch. 7]. An SCM M is a tuple ⟨U, V, F, P(U)⟩, where U is a set of exogenous (unobserved or latent) variables and V is a set of endogenous (observed) variables. F is a set of deterministic functions F = {f_i}, where f_i determines the value of V_i ∈ V based on endogenous variables PA_i ⊆ V\\{V_i} and exogenous variables U_i ⊆ U, that is, e.g., v_i ← f_i(pa_i, u_i). P(U) is a joint distribution over the exogenous variables. 
A causal diagram G = ⟨V, E⟩, associated with M, is a tuple of vertices V (the endogenous variables) and edges E, where a directed edge V_i → V_j ∈ E if V_i ∈ PA_j, and a bidirected edge between V_i and V_j if they share an unobserved confounder, i.e., U_i ∩ U_j ≠ ∅. Note that pa(V_i)_G corresponds to PA_i. The probability of Y = y when X is held fixed at x (i.e., intervened) is denoted by P(y | do(x)), where intervention on X is graphically represented by G_X, the graph G with incoming edges onto X removed. We denote by CC(X)_G the c-component of G that contains X, where a c-component is a maximal set of vertices connected with bidirected edges [Tian and Pearl, 2002]. We define CC(X)_G = ∪_{X∈X} CC(X)_G. For a more detailed discussion on the properties of SCMs, we refer readers to [Pearl, 2000, Bareinboim and Pearl, 2016]. For all the proofs and appendices, please refer to the full technical report [Lee and Bareinboim, 2018].\n\n[Figure 3: panels (a)-(d) are causal graphs over X, Z, and Y; panel (e) is the table below, where a check mark denotes a non-dominated set of arms.]\n\n(e)   ∅   D(X)   D(Z)\n(a)   -   ✓   -\n(b)   ✓   ✓   -\n(c)   -   ✓   ✓\n(d)   ✓   ✓   ✓\n\nFigure 3: (a-d) Causal graphs such that µ_x = µ_{x,z}, and (e) non-dominated arms\n\n2 Multi-armed bandits with structural causal models\n\nWe recall that MABs consider a sequential decision-making setting where pulling one of the K available arms at each round gives the player a stochastic reward from an unknown distribution associated with the corresponding arm. The goal is to minimize (maximize) the cumulative regret (reward) after T rounds. The mean reward of an arm a is denoted by µ_a and the maximal reward is µ* = max_{1≤a≤K} µ_a. We focus on the cumulative regret, Reg_T = T µ* − Σ_{t=1}^{T} E[Y_{A_t}] = Σ_{a=1}^{K} Δ_a E[T_a(T)], where A_t is the arm played at time t, T_a(t) is the number of times arm a has been played after t rounds, and Δ_a = µ* − µ_a.\n\nWe now can explicitly connect a MAB instance to its SCM counterpart. 
Let M be a SCM ⟨U, V, F, P(U)⟩ and Y ∈ V be a reward variable, where D(Y) ⊆ R. The bandit contains arms {x ∈ D(X) | X ⊆ V\\{Y}}, a set of all possible interventions on endogenous variables except the reward variable. Each arm A_x (or simply x) associates with a reward distribution P(Y | do(x)) where its mean reward µ_x is E[Y | do(x)]. We call this setting a SCM-MAB, which is fully represented by the pair ⟨M, Y⟩. Throughout this paper, we assume that the causal graph G of M is fully accessible to the agent [footnote 3], although its parametrization is unknown: that is, an agent facing a SCM-MAB ⟨M, Y⟩ plays arms with knowledge of G and Y, but not of F and P(U). For simplicity, we denote information provided to an agent playing a SCM-MAB by ⟦G, Y⟧. We now investigate some key structural properties that follow from the causal structure G of the SCM-MAB.\n\nProperty 1. Equivalence among arms\n\nWe start by noting that do-calculus [Pearl, 1995] provides rules to evaluate invariances in the interventional space. In particular, we focus here on Rule 3, which ascertains the condition such that a set of interventions does not have an effect on the outcome variable, i.e., P(y | do(x, z), w) = P(y | do(x), w). Since arms correspond to interventions (including the null intervention) and there is no contextual information, we consider examining P(y | do(x, z)) = P(y | do(x)) through Y ⊥⊥ Z | X in G_{X∪Z}, which implies µ_{x,z} = µ_x. If valid, this condition implies that it is sufficient to play only one arm among arms in the equivalence class.\n\nDefinition 1 (Minimal Intervention Set (MIS)). A set of variables X ⊆ V\\{Y} is said to be a minimal intervention set relative to ⟦G, Y⟧ if there is no X′ ⊂ X such that µ_{x[X′]} = µ_x for every SCM conforming to G.\n\nFor instance, the MISs corresponding to the causal graphs in Fig. 3 are {∅, {X}, {Z}}, which do not include {X, Z} since µ_x = µ_{x,z}. 
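The arm set {x ∈ D(X) | X ⊆ V\{Y}} defined above can be made concrete with a small enumeration. The sketch below is only illustrative: the helper name all_arms and the two binary domains are assumptions chosen for the example, not part of the paper or its code.

```python
from itertools import combinations, product

def all_arms(domains):
    # Every arm is an assignment to some subset of the intervenable variables;
    # the empty assignment {} is the null intervention.
    names = sorted(domains)
    arms = []
    for k in range(len(names) + 1):
        for subset in combinations(names, k):
            for values in product(*(domains[v] for v in subset)):
                arms.append(dict(zip(subset, values)))
    return arms

# Hypothetical example: two binary intervenable variables X and Z.
domains = {'X': (0, 1), 'Z': (0, 1)}
arms = all_arms(domains)
print(len(arms))
```

With n binary variables this enumeration yields 3^n arms (each variable is either left alone or set to one of two values), whereas intervening on all variables at once yields only 2^n of them.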
The MISs are determined without considering the UCs in a causal graph. The empty set and all singletons in an(Y)_G are MISs for G with respect to Y. The task of finding the best arm among all possible arms can be reduced to a search within the MISs.\n\nProposition 1 (Minimality). A set of variables X ⊆ V\\{Y} is a minimal intervention set for G with respect to Y if and only if X ⊆ an(Y)_{G_X}.\n\nAll the MISs given ⟦G, Y⟧ can be determined without explicitly enumerating 2^{V\\{Y}} while checking the condition in Prop. 1. We provide an efficient recursive algorithm enumerating the complete set of MISs given G and Y (Appendix A), which runs in O(mn^2) where m is the number of MISs.\n\nFootnote 3: In settings where this is not the case, one can spend the first interactions with the environment to learn the causal graph G from observational [Spirtes et al., 2001] or experimental data [Kocaoglu et al., 2017].\n\nProperty 2. Partial-orders among arms\n\nWe now explore the partial-orders among subsets of V\\{Y} within the MISs. Given the causal diagram G, it is possible that intervening on some variables is always as good as intervening on another set of variables (regardless of the parametrization of the underlying model). Formally, there can be two different sets of variables W, Z ⊆ V\\{Y} such that\n\nmax_{w ∈ D(W)} µ_w ≤ max_{z ∈ D(Z)} µ_z\n\nin every possible SCM conforming to G. If that is the case, it would be unnecessary (and possibly harmful in terms of sample efficiency) to play arms D(W). We next define Possibly-Optimal MIS, which incorporates the partial-orderedness among subsets of V\\{Y} into MIS, denoting the optimal value for a X ⊆ V\\{Y} given a SCM by x*.\n\nDefinition 2 (Possibly-Optimal Minimal Intervention Set (POMIS)). Given information ⟦G, Y⟧, let X be a MIS. If there exists a SCM conforming to G such that µ_{x*} > ∀_{Z ∈ Z\\{X}} µ_{z*}, where Z is the set of MISs with respect to G and Y, then X is a possibly-optimal minimal intervention set with respect to the information ⟦G, Y⟧.\n\nIntuitively, one may believe that the best action will be to intervene on the direct causes (parents) of the reward variable Y, since this would entail a higher degree of “controllability” of Y within the system. This, in fact, holds true if Y is not confounded with any of its ancestors, which includes the case where no unobserved confounders are present in the system (i.e., Markovian models).\n\nProposition 2. Given information ⟦G, Y⟧, if Y is not confounded with an(Y)_G via unobserved confounders, then pa(Y)_G is the only POMIS.\n\nCorollary 3 (Markovian POMIS). Given ⟦G, Y⟧, if G is Markovian, then pa(Y)_G is the only POMIS.\n\nFor instance, in Fig. 3a, {{X}} is the set of POMISs. Whenever unobserved confounders (UCs) are present [footnote 4], on the other hand, the analysis becomes more involved. To witness, let us analyze the maximum achievable rewards of the MISs in the other causal diagrams in Fig. 3. We start with Fig. 3b and note that µ_{z*} ≤ µ_{x*} since µ_{z*} = Σ_x µ_x P(x | do(z*)) ≤ Σ_x µ_{x*} P(x | do(z*)) = µ_{x*}. On the other hand, µ_∅ is not comparable to µ_{x*}. For a concrete example, consider a SCM where the domains of variables are {0, 1}. Let U be the UC between Y and Z where P(U = 1) = 0.5. Let f_Z(u) = 1 − u, f_X(z) = z, and f_Y(x, u) = x ⊕ u, where ⊕ is the exclusive-or function. If X is not intervened on, x will be 1 − u, yielding y = 1 for both cases u = 0 or u = 1 so that µ_∅ = 1. However, if X is intervened to either 0 or 1, y will be 1 only half the time since P(U = 1) = 0.5, which results in µ_{x*} = 0.5. 
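The exclusive-or example can be checked by exact enumeration over the two values of U. The following is a minimal sketch, assuming only the structural functions stated in the text (f_Z(u) = 1 − u, f_X(z) = z, f_Y(x, u) = x XOR u, P(U = 1) = 0.5); the helper name mean_reward is hypothetical and not taken from the authors' code.

```python
from fractions import Fraction

def mean_reward(do_x=None):
    # Exact E[Y] under the XOR SCM of the text; do_x=None means no intervention.
    total = Fraction(0)
    for u in (0, 1):                      # U is a fair coin, P(U = 1) = 0.5
        z = 1 - u                         # f_Z(u) = 1 - u
        x = z if do_x is None else do_x   # f_X(z) = z unless X is intervened on
        y = x ^ u                         # f_Y(x, u) = x XOR u
        total += Fraction(1, 2) * y
    return total

mu_empty = mean_reward()                              # no intervention
mu_x_star = max(mean_reward(0), mean_reward(1))       # best intervention on X
print(mu_empty, mu_x_star)
```

Under these assumptions the null intervention attains a mean reward of 1 while either intervention on X attains 1/2, matching the values derived in the text.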
We also provide in Appendix A a SCM such that µ_∅ < µ_{x*} holds true. The model described here (µ_∅ > µ_{x*}) illustrates an interesting phenomenon: allowing an UC to affect Y freely may lead to a higher reward, which may be broken upon interventions. We now consider the different confounding structure shown in Fig. 3c (similar to Fig. 1b), where the variable Z lies outside of the influence of the UC associated with Y. In this case, intervening on Z leads to a higher reward, µ_{z*} ≥ µ_∅. To witness, note that µ_∅ = Σ_z E[Y | z] P(z) = Σ_z µ_z P(z) ≤ Σ_z µ_{z*} P(z) = µ_{z*}. However, µ_{z*} and µ_{x*} are incomparable, which is shown through two models provided in Appendix A. Finally, we can add the confounders of the two previous models, which is shown in Fig. 3d. In this case, all three µ_{x*}, µ_{z*}, and µ_∅ are incomparable. One can imagine scenarios where the influence of the UCs is weak enough so that corresponding models produce results similar to Figs. 3a to 3c.\n\nIt is clear that the interplay between the location of the intervened variable, the outcome variable, and the UCs entails non-trivial interactions and consequences in terms of the reward. The table in Fig. 3e highlights the arms that are contenders to generate the highest rewards in each model (i.e., each arm intervenes a POMIS to specific values), while intervening on a non-POMIS represents a waste of resources. Interestingly, the only parent of Y, i.e., X, is not dominated by any other arms in any of the scenarios discussed. In words, this suggests that the intuition of controlling variables closer to Y is not entirely lost even when UCs are present; they are not the only POMIS, but certainly one of them. Given that more complex mechanisms cannot be, in general, ruled out, performing experiments would be required to identify the best arm. 
Still, the results of the table guarantee that the search can be refined so that MAB solvers can discard arms that cannot lead to profitable outcomes, and converge faster to playing the optimal arm.\n\nFootnote 4: Recall that unobserved confounders are represented in the graph as bidirected dashed edges.\n\n[Figure 4 panels: (a) G, (b) G_X, (c) G_Z, (d) G_W, each over the nodes S, T, W, X, Y, Z; (e) a diagram over the subsets ∅, {X}, {Z}, {W}, {X, Z}, {X, W}, {Z, W}, {X, Z, W}.]\n\nFigure 4: Causal graphs where pink and blue nodes are MUCT and IB, respectively. (Right most) A schematic showing an exploration order of subsets of variables.\n\n3 Graphical characterization of POMIS\n\nOur goal in this section is to graphically characterize POMISs. We will leverage the discussion in the previous section and note that UCs connected to a reward variable affect the reward distributions in a way that intervening on a variable outside the coverage of such UCs (including no UC) can be optimal, e.g., {X} for Fig. 3a, ∅ for Figs. 3b and 3d, and {Z} for Fig. 3c. We introduce two graphical concepts to help characterize this property.\n\nDefinition 3 (Unobserved-Confounders' Territory). Given information ⟦G, Y⟧, let H be G[An(Y)_G]. A set of variables T ⊆ V(H) containing Y is called an UC-territory on G with respect to Y if De(T)_H = T and CC(T)_H = T.\n\nAn UC-territory T is said to be minimal if no T′ ⊂ T is an UC-territory. A minimal UC-Territory (MUCT) for G and Y can be constructed by extending a set of variables, starting from {Y}, alternately updating the set with the c-component and descendants of the set.\n\nDefinition 4 (Interventional Border). Let T be a minimal UC-territory on G with respect to Y. Then, X = pa(T)_G \\ T is called an interventional border for G with respect to Y.\n\nThe interventional border (IB) encompasses essentially the parents of the MUCT. 
For concreteness, consider Fig. 4a, and note that {W, X, Y, Z} is the MUCT for the causal graph with respect to Y, and the IB is {S, T} (marked in pink and blue in the graph, respectively). As its name suggests, MUCT is a set of endogenous variables governed by a set of UCs where at least one UC is adjacent to a reward variable. Specifically, the reward is determined by values of: (1) the UCs governing the MUCT; (2) a set of unobserved variables (other than the UCs) where each affects an endogenous variable in the MUCT; and (3) the IB. In other words, there is no UC interplaying across MUCT and its outside so that µ_x = E[Y | x] where x is a value assigned to the IB X. We now connect MUCT and IB with POMIS. Let MUCT(G, Y) and IB(G, Y) be, respectively, the MUCT and IB given ⟦G, Y⟧.\n\nProposition 4. IB(G, Y) is a POMIS given ⟦G, Y⟧.\n\nThe main strategy of the proof is to construct a SCM M where intervening on any variable in MUCT(G, Y) causes significant loss of reward. It seems that MUCT and IB can only identify a single POMIS given ⟦G, Y⟧. However, they, in fact, serve as basic units to identify all POMISs.\n\nProposition 5. Given ⟦G, Y⟧, IB(G_W, Y) is a POMIS, for any W ⊆ V\\{Y}.\n\nProp. 5 generalizes Prop. 4 for when W ≠ ∅ while taking care of UCs across MUCT(G_W, Y) and its outside in the original causal graph G. See Fig. 4d, for instance, where IB(G_W, Y) = {W, T}. Intervening on W cuts the influence of S and the UC between W and X, while still allowing the UC to affect X [footnote 5]. Similarly, one can see in Fig. 4b that IB(G_X, Y) = {T, W, X} where intervening on X lets Y be the only element of MUCT making its parents an interventional border, hence, a POMIS. Note that pa(Y)_G is always a POMIS since MUCT(G_{pa(Y)_G}, Y) = {Y} and IB(G_{pa(Y)_G}, Y) = pa(Y)_G. With Prop. 5, one can enumerate the POMISs given ⟦G, Y⟧ considering all subsets of V\\{Y}. 
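The MUCT construction (alternately closing under c-components and descendants, starting from {Y}) and the enumeration suggested by Prop. 5 can be sketched by brute force. The code below is an illustrative assumption-laden sketch, not the authors' implementation: the edge lists describe a hypothetical six-node diagram chosen so that the MUCT, IB, and POMIS sets quoted in the text are reproduced (the actual edges of Fig. 4a are not recoverable from this text), and the function names intervene, ancestors, muct_ib, and ib_over_all_subsets are invented for the example.

```python
from itertools import combinations

# Hypothetical diagram (an assumption, not read off the paper's figure).
DIRECTED = {('S', 'W'), ('W', 'Y'), ('T', 'X'), ('T', 'Y'), ('Z', 'X'), ('X', 'Y')}
BIDIRECTED = {frozenset(('W', 'X')), frozenset(('Z', 'Y'))}

def intervene(directed, bidirected, ws):
    # G_W: drop every directed and bidirected edge pointing into a member of W.
    d = {(a, b) for a, b in directed if b not in ws}
    b = {e for e in bidirected if not (e & ws)}
    return d, b

def ancestors(directed, y):
    # An(Y): Y together with all of its ancestors.
    found, frontier = {y}, [y]
    while frontier:
        v = frontier.pop()
        for a, b in directed:
            if b == v and a not in found:
                found.add(a)
                frontier.append(a)
    return found

def muct_ib(directed, bidirected, y):
    # Restrict to H = G[An(Y)], then grow {Y} until it is closed under
    # c-components and descendants; the IB is the set of outside parents.
    keep = ancestors(directed, y)
    h_dir = {(a, b) for a, b in directed if a in keep and b in keep}
    h_bi = {e for e in bidirected if e <= keep}
    t = {y}
    while True:
        grown = set(t)
        for e in h_bi:
            if e & grown:
                grown |= e
        for a, b in h_dir:
            if a in grown:
                grown.add(b)
        if grown == t:
            break
        t = grown
    ib = {a for a, b in directed if b in t and a not in t}
    return t, ib

def ib_over_all_subsets(directed, bidirected, y):
    # Collect IB(G_W, Y) for every W; exponential in the number of variables.
    nodes = {v for e in directed for v in e} - {y}
    found = set()
    for k in range(len(nodes) + 1):
        for ws in combinations(sorted(nodes), k):
            d, b = intervene(directed, bidirected, set(ws))
            found.add(frozenset(muct_ib(d, b, y)[1]))
    return found

print(muct_ib(DIRECTED, BIDIRECTED, 'Y'))
print(ib_over_all_subsets(DIRECTED, BIDIRECTED, 'Y'))
```

On this hypothetical diagram the sketch yields MUCT {W, X, Y, Z} with IB {S, T}, and the collected borders are {S, T}, {T, W}, and {T, W, X}. The loop over all subsets is deliberately naive and serves only to make the definitions concrete.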
We show in the sequel that this strategy encompasses all the POMISs.\n\nTheorem 6. Given ⟦G, Y⟧, X ⊆ V\\{Y} is a POMIS if and only if IB(G_X, Y) = X.\n\nFootnote 5: Note that exogenous variables that do not affect more than one endogenous variable (i.e., non-UCs) are not explicitly represented in the graph.\n\nAlgorithm 1 Algorithm enumerating all POMISs with ⟦G, Y⟧\n1: function POMISS(G, Y)\n2:     T, X = MUCT(G, Y), IB(G, Y); H = G_X[T ∪ X]\n3:     return {X} ∪ subPOMISs(H, Y, reversed(topological-sort(H)) ∩ (T \\ {Y}), ∅)\n4: function SUBPOMISS(G, Y, π, O)\n5:     P = ∅\n6:     for π_i ∈ π do\n7:         T, X, π′, O′ = MUCT(G_{π_i}, Y), IB(G_{π_i}, Y), π_{i+1:|π|} ∩ T, O ∪ π_{1:i−1}\n8:         if X ∩ O′ = ∅ then\n9:             P = P ∪ {X} ∪ (subPOMISs(G_X[T ∪ X], Y, π′, O′) if π′ ≠ ∅ else ∅)\n10:     return P\n\nAlgorithm 2 POMIS-based kl-UCB\n1: function POMIS-KL-UCB(B, G, Y, f, T)\n2:     Input: B, a SCM-MAB; G, a causal diagram; Y, a reward variable\n3:     A = ∪_{X ∈ POMISs(G, Y)} D(X)\n4:     kl-UCB(B, A, f, T)\n\nThm. 6 provides a graphical necessary and sufficient condition for a set of variables being a POMIS given ⟦G, Y⟧. This characterization allows one to determine all possible arms in a SCM-MAB that are worth intervening on, and, therefore, being free from pulling the other unnecessary arms.\n\n4 Algorithmic characterization of POMIS\n\nAlthough the graphical characterization provides a means to enumerate the complete set of POMISs given ⟦G, Y⟧, a naively implemented algorithm requires time exponential in |V|. We construct an efficient algorithm (Alg. 1) that enumerates all the POMISs based on Props. 7 and 8 below and the graphical characterization introduced in the previous section (Thm. 6).\n\nProposition 7. Let T and X be the MUCT(G_W, Y) and IB(G_W, Y), respectively, relative to G and Y. Then, for any Z ⊆ V\\T, MUCT(G_{X∪Z}, Y) = T and IB(G_{X∪Z}, Y) = X.\n\nProposition 8. Let H = G_X[T ∪ X] where T and X are MUCT and IB given ⟦G_W, Y⟧, respectively. Then, for any W′ ⊆ T\\{Y}, H_{W′} and G_{W∪W′} yield the same MUCT and IB with respect to Y.\n\nProp. 7 allows one to avoid having to examine G_W for every W ⊆ V\\{Y}. Prop. 8 characterizes the recursive nature of MUCT and IB, where identification of POMISs can be evaluated by subgraphs. Based on these results, we design a recursive algorithm (Alg. 1) to explore subsets of V\\{Y} with a certain order. See Fig. 4e for an example where subsets of {X, Z, W} are connected based on set inclusion relationship and an order of variables, e.g., (X, Z, W). That is, there exists a directed edge between two sets if (i) one set is larger than the other by a variable and (ii) the variable's index (as in the order) is larger than other variable's index in the smaller set. The diagram traces how the algorithm will explore the subsets following the edges, while effectively skipping nodes.\n\nGiven G and Y, POMISs (Alg. 1) computes a POMIS, i.e., IB(G, Y). Then, a recursive procedure subPOMISs is called with an order of variables (Line 3). Then subPOMISs examines POMISs by intervening on a single variable against the given graph (Lines 6-9). If the IB (X in Line 7) of such an intervened graph intersects with O′ (a set of variables that should be considered in other branch), then no subsequent call is made (Line 8). Otherwise, a subsequent subPOMISs call will take as arguments an MUCT-IB induced subgraph (Prop. 8), a refined order, and a set of variables not to be intervened in the given branch. For clarity, we provide a detailed working example in Appendix C with Fig. 4a where the algorithm explores only four intervened graphs (G, G_{X}, G_{Z}, G_{W}) and generates the complete set of POMISs {{S, T}, {T, W}, {T, W, X}}.\n\nTheorem 9 (Soundness and Completeness). Given information ⟦G, Y⟧, the algorithm POMISs (Alg. 1) returns all, and only POMISs.\n\nThe POMISs algorithm can be combined with a MAB algorithm, such as the kl-UCB, creating a simple yet effective SCM-MAB solver (see Alg. 2). kl-UCB satisfies lim sup_{n→∞} E[Reg_n] / log(n) ≤ Σ_{x: µ_x < µ*} (µ* − µ_x) / KL(µ_x, µ*), where KL is Kullback-Leibler divergence between two Bernoulli distributions [Garivier and Cappé, 2011]. It is clear that the reduction in the size of arms will lower the upper bounds of the corresponding cumulative regrets.\n\nFigure 5: Comparisons across tasks (columns: (a) Task 1, (b) Task 2, (c) Task 3) with cumulative regrets (top) and optimal arm selection probability (bottom) with TS for solid and kl-UCB for dashed lines. Best viewed in color.\n\n5 Experiments\n\nIn this section, we present empirical results demonstrating that the selection of arms based on POMISs makes standard MAB solvers converge faster to an optimal arm. We employ two popular MAB solvers, kl-UCB, which enjoys cumulative regret growing logarithmically with the number of rounds [Cappé et al., 2013], and Thompson sampling (TS, Thompson [1933]), which has strong empirical performance [Kaufmann et al., 2012]. We considered four strategies for selecting arms, including POMISs, MISs, Brute-force, and All-at-once, where Brute-force evaluates all combinations of arms ∪_{X ⊆ V\\{Y}} D(X), and All-at-once considers intervening in all variables simultaneously, D(V\\{Y}), oblivious to the causal structure and any knowledge about the action space. The performance of the eight (4 × 2) algorithms is evaluated relative to three different SCM-MAB instances (the detailed parametrizations are provided in Appendix D). We set the horizon large enough so as to observe near convergence, and repeat each simulation 300 times. 
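For readers unfamiliar with kl-UCB, its index and the constant appearing in the regret bound quoted above can be sketched as follows. This is a generic illustration of the standard kl-UCB index for Bernoulli rewards, computed by bisection; it is not the authors' implementation, and the function names, the default c = 0, and the example means are assumptions made for the sketch.

```python
import math

def bernoulli_kl(p, q, eps=1e-12):
    # KL divergence between Bernoulli(p) and Bernoulli(q), clipped for stability.
    p = min(max(p, eps), 1 - eps)
    q = min(max(q, eps), 1 - eps)
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

def kl_ucb_index(mean, pulls, t, c=0.0, iters=50):
    # Largest q >= mean such that pulls * KL(mean, q) <= log(t) + c * log(log(t)).
    budget = math.log(t) + (c * math.log(math.log(t)) if c and t > 2 else 0.0)
    lo, hi = mean, 1.0
    for _ in range(iters):
        mid = (lo + hi) / 2
        if pulls * bernoulli_kl(mean, mid) <= budget:
            lo = mid
        else:
            hi = mid
    return lo

def regret_constant(means):
    # Sum over suboptimal arms of (mu* - mu_x) / KL(mu_x, mu*).
    best = max(means)
    return sum((best - m) / bernoulli_kl(m, best) for m in means if m < best)

# Hypothetical mean rewards: the same optimal arm with one or three suboptimal arms.
few_arms = [0.5, 0.4]
many_arms = [0.5, 0.4, 0.4, 0.4]
print(regret_constant(few_arms), regret_constant(many_arms))
```

Each suboptimal arm contributes a positive term to the constant, so discarding arms that cannot be optimal can only shrink the bound, which is the point made in the text about reducing the number of arms.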
We plot (i) the average\ncumulative regrets (CR) along with their respective standard deviations and (ii) the probability of an\noptimal arm being selected averaged over the repeated tests (OAP).6,7\n\n6 All the code is available at https://github.com/sanghack81/SCMMAB-NIPS2018\n7 One may surmise that combinatorial bandit (CB) algorithms can be used to solve SCM-MAB instances by\nnoting that an intervention can be encoded as a binary vector, where each dimension in the vector corresponds\nto intervening on a single variable with a specific value. However, the two settings invoke very different sets\nof assumptions, which makes their solvers difficult to compare in a reasonably fair way. For\ninstance, the current generation of CB algorithms is oblivious to the underlying causal structure, which makes\nthem closely resemble the Brute-force strategy, the worst possible method for SCM-MABs. Further, the\nassumption of linearity is arguably one of the most popular considered by CB solvers. The corresponding\nalgorithms, however, will be unable to learn the arms\u2019 rewards properly since an SCM-MAB is nonparametric,\nmaking no assumption about the underlying structural mechanisms. These are just a few immediate examples of\nthe mismatches between the current generation of algorithms for causal and combinatorial bandits.\n\n[Figure 5 plot panels: x-axis Trials; y-axes Cum. Regrets (top) and Probability (bottom); legend POMIS, MIS, Brute-force, All-at-once.]\n\nTask 1: We start by analyzing a Markovian model. We note that by Cor. 3, searching for the arms\nwithin the parent set is sufficient in this case. The numbers of arms for POMISs, MISs, Brute-force,\nand All-at-once are 4, 49, 81, and 16, respectively. Note that there are 4 optimal arms within\nthe All-at-once arms; for instance, if the parent configuration is X1 = x1, X2 = x2, this strategy\nwill also include combinations of Z1 = z1, Z2 = z2, \u2200z1, z2. The simulated results are shown in\nFig. 5a. The CRs at round 1000 with kl-UCB are 3.0, 48.0, 72, and 12 (in that order), and all strategies\nwere able to find the optimal arms at this time. POMIS and All-at-once first reached 95% OAP\nat rounds 20 and 66, respectively. There are two interesting observations at this point. First, at an\nearly stage, the OAP for MISs is smaller than that for Brute-force since MISs has only 1 optimal arm among 49\narms, while Brute-force has 9 among 81. The advantage of employing MIS over Brute-force is only\nobserved after a sufficiently large number of plays. More interestingly, POMIS and All-at-once\nhave the same ratio of optimal to non-optimal arms (1:3 versus 4:12); however, POMIS dominates\nAll-at-once since the agent can learn better about the mean reward of the optimal arm while playing\nnon-optimal arms less. Naturally, this translates into less variability and additional certainty about the\noptimal arm even in Markovian settings.\n
Task 2: We consider the setting known as instrumental variable (IV), which was shown in Fig. 3c.\nThe optimal arm in this simulation is setting Z = 0. The numbers of arms for the four strategies are 4, 5,\n9, and 4, respectively. The results are shown in Fig. 5b. Since the All-at-once strategy only considers\nnon-optimal arms (i.e., pulling Z, X together), it incurs linear regret without selecting an optimal\narm (0%). The CRs (and OAPs) at round 1000 with TS are POMIS 16.1 (98.67%), MIS 21.4 (99.00%),\nBrute-force 42.9 (93.33%), and All-at-once 272.1 (0%). At round 5000, where Brute-force has nearly\nconverged, the ratio of the CRs of Brute-force and POMIS is $\\frac{54.2}{18.1} = 2.99 \\simeq 2.67 = \\frac{9-1}{4-1}$. POMIS, MIS,\nand Brute-force first hit 95% OAP at rounds 172, 214, and 435, respectively.\nTask 3: Finally, we study the more involved scenario shown in Fig. 4a. 
In this case, the optimal\narm is intervening on {S, T}, which means that the system should follow its natural flow of UCs,\nwhich All-at-once is unable to \u201cpull.\u201d There are 16, 75, 243, and 32 arms for the strategies (in that\norder). The results are shown in Fig. 5c. The CRs (and OAPs) at round 10000 with TS are POMIS 91.4\n(99.0%), MIS 472.4 (97.0%), Brute-force 1469.0 (85.0%), and All-at-once 2784.8 (0%). Similarly,\nthe ratio (at round 10000) is $\\frac{1469.0}{91.4} = 16.07 \\approx 16.13 = \\frac{243-1}{16-1}$, which is expected to increase since\nBrute-force has not yet converged at that point. Only POMIS and MIS achieved an OAP of 95%, first doing so in\n684 and 3544 steps, respectively.\nWe start by noticing that the reduction in the CRs is approximately proportional to the reduction in the\nnumber of non-optimal arms pulled by the corresponding (PO)MIS-based algorithm, which makes the\nPOMIS-based solver the clear winner throughout the simulations. It is still conceivable that the\nnumber of arms examined by All-at-once is smaller than for POMIS in a specific SCM-MAB instance,\nwhich would entail a lower CR for the former. However, such a lower CR in some instances does\nnot constitute any sort of assurance since arms excluded from All-at-once, but included in POMIS,\ncan be optimal in some SCM-MAB instance conforming to \u27e6G, Y\u27e7. Furthermore, a POMIS-based\nstrategy always dominates the corresponding MIS and Brute-force ones. These observations together\nsuggest that, in practice, a POMIS-based strategy should be preferred given that it will always\nconverge and will usually be faster than its counterparts. Remarkably, there is an interesting trade-off\nbetween having knowledge of the causal structure versus not knowing the corresponding dependency\nstructure among arms, and potentially incurring linear regret (All-at-once) or exponential\nslowdown (Brute-force). 
In practice, for the cases in which the causal structure is unknown, the pulls of\nthe arms themselves can be used as experiments and could be coupled with efficient strategies to\nsimultaneously learn the causal structure [Kocaoglu et al., 2017].\n\n6 Conclusions\n\nWe studied the problem of deciding whether an agent should perform a causal intervention and, if so,\nwhich variables it should intervene upon. The problem was formalized using the logic of structural\ncausal models (SCMs) and cast as a new type of multi-armed bandit called SCM-MAB.\nWe started by noting that whenever the agent cannot measure all the variables in the environment (i.e.,\nunobserved confounders exist), standard MAB algorithms that are oblivious to the underlying causal\nstructure may not converge, regardless of the number of interventions performed in the environment.\n(We note that the causal structure can easily be learned in a typical MAB setting since the agent always\nhas interventional capabilities.) We introduced a novel decision-making strategy based on properties\nfollowing from the do-calculus, which allowed the removal of redundant arms, and on the partial orders\namong the sets of variables existing in the underlying causal system, which led to the understanding\nof the maximum achievable reward of each interventional set. Leveraging this new strategy based\non the possibly-optimal minimal intervention sets (called POMISs), we developed an algorithm that\ndecides whether (and if so, where) interventions should be performed in the underlying system.\nFinally, we showed by simulations that this causally-sensible strategy performs more efficiently and\nmore robustly than its non-causal counterparts. 
We hope that the formal machinery and the algorithms\ndeveloped here can help decision-makers to make more principled and efficient decisions.\n\nAcknowledgments\nThis research is supported in part by grants from IBM Research, Adobe Research, NSF IIS-1704352,\nand IIS-1750807 (CAREER).\n\nReferences\nPeter Auer, Nicol\u00f2 Cesa-Bianchi, and Paul Fischer. Finite-time analysis of the multiarmed bandit problem. Machine Learning, 47(2/3):235\u2013256, 2002.\n\nElias Bareinboim and Judea Pearl. Causal inference and the data-fusion problem. Proceedings of the National Academy of Sciences, 113(27):7345\u20137352, 2016.\n\nElias Bareinboim, Andrew Forney, and Judea Pearl. Bandits with unobserved confounders: A causal approach. In Advances in Neural Information Processing Systems 28, pages 1342\u20131350. 2015.\n\nS\u00e9bastien Bubeck and Nicol\u00f2 Cesa-Bianchi. Regret analysis of stochastic and nonstochastic multi-armed bandit problems. Foundations and Trends in Machine Learning, 5(1):1\u2013122, 2012.\n\nOlivier Capp\u00e9, Aur\u00e9lien Garivier, Odalric-Ambrym Maillard, R\u00e9mi Munos, and Gilles Stoltz. Kullback-Leibler upper confidence bounds for optimal sequential allocation. The Annals of Statistics, 41(3):1516\u20131541, 2013.\n\nNicol\u00f2 Cesa-Bianchi and G\u00e1bor Lugosi. Combinatorial bandits. Journal of Computer and System Sciences, 78(5):1404\u20131422, 2012.\n\nRichard Combes and Alexandre Proutiere. Unimodal bandits: Regret lower bounds and optimal algorithms. In International Conference on Machine Learning, pages 521\u2013529, 2014.\n\nVarsha Dani, Thomas P. Hayes, and Sham M. Kakade. Stochastic linear optimization under bandit feedback. In Proceedings of Conference On Learning Theory (COLT), pages 355\u2013366, 2008.\n\nEyal Even-Dar, Shie Mannor, and Yishay Mansour. Action elimination and stopping conditions for the multi-armed bandit and reinforcement learning problems. 
Journal of Machine Learning Research, 7(Jun):1079\u20131105, 2006.\n\nAndrew Forney, Judea Pearl, and Elias Bareinboim. Counterfactual data-fusion for online reinforcement learners. In Proceedings of the 34th International Conference on Machine Learning, pages 1156\u20131164, 2017.\n\nAur\u00e9lien Garivier and Olivier Capp\u00e9. The KL-UCB algorithm for bounded stochastic bandits and beyond. In Proceedings of the 24th annual Conference On Learning Theory, pages 359\u2013376, 2011.\n\nEmilie Kaufmann, Nathaniel Korda, and R\u00e9mi Munos. Thompson sampling: An asymptotically optimal finite-time analysis. In Algorithmic Learning Theory, pages 199\u2013213, 2012.\n\nMurat Kocaoglu, Karthikeyan Shanmugam, and Elias Bareinboim. Experimental design for learning causal graphs with latent variables. In Advances in Neural Information Processing Systems 30, pages 7021\u20137031, 2017.\n\nTze Leung Lai and Herbert Robbins. Asymptotically efficient adaptive allocation rules. Advances in Applied Mathematics, 6(1):4\u201322, 1985.\n\nFinnian Lattimore, Tor Lattimore, and Mark D. Reid. Causal bandits: Learning good interventions via causal inference. In Advances in Neural Information Processing Systems 29, pages 1181\u20131189. 2016.\n\nSanghack Lee and Elias Bareinboim. Structural causal bandits: Where to intervene? Technical Report R-36, Purdue AI Lab, Department of Computer Science, Purdue University, 2018.\n\nStefan Magureanu, Richard Combes, and Alexandre Proutiere. Lipschitz bandits: Regret lower bound and optimal algorithms. In Proceedings of The 27th Conference on Learning Theory, pages 975\u2013999, 2014.\n\nPedro A. Ortega and Daniel A. Braun. Generalized Thompson sampling for sequential decision-making and causal inference. Complex Adaptive Systems Modeling, 2(2), 2014.\n\nJudea Pearl. Causal diagrams for empirical research. Biometrika, 82(4):669\u2013688, 1995.\n\nJudea Pearl. Causality: Models, Reasoning, and Inference. 
Cambridge University Press, New York, 2000. Second ed., 2009.\n\nRajat Sen, Karthikeyan Shanmugam, Alexandros G. Dimakis, and Sanjay Shakkottai. Identifying best interventions through online importance sampling. In Proceedings of the 34th International Conference on Machine Learning, pages 3057\u20133066, 2017.\n\nPeter Spirtes, Clark Glymour, and Richard Scheines. Causation, Prediction, and Search. A Bradford Book, 2001.\n\nRichard S. Sutton and Andrew G. Barto. Reinforcement learning: An introduction. MIT Press, 1998.\n\nCsaba Szepesv\u00e1ri. Algorithms for reinforcement learning. Morgan and Claypool, 2010.\n\nWilliam R. Thompson. On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika, 25(3/4):285\u2013294, 1933.\n\nJin Tian and Judea Pearl. A general identification condition for causal effects. In Proceedings of the Eighteenth National Conference on Artificial Intelligence, pages 567\u2013573, 2002.\n\nJunzhe Zhang and Elias Bareinboim. Transfer learning in multi-armed bandits: A causal approach. In Proceedings of the Twenty-Sixth International Joint Conference on Artificial Intelligence, IJCAI-17, pages 1340\u20131346, 2017.", "award": [], "sourceid": 1286, "authors": [{"given_name": "Sanghack", "family_name": "Lee", "institution": "Purdue University"}, {"given_name": "Elias", "family_name": "Bareinboim", "institution": "Purdue"}]}