{"title": "Bandits with Unobserved Confounders: A Causal Approach", "book": "Advances in Neural Information Processing Systems", "page_first": 1342, "page_last": 1350, "abstract": "The Multi-Armed Bandit problem constitutes an archetypal setting for sequential decision-making, permeating multiple domains including engineering, business, and medicine. One of the hallmarks of a bandit setting is the agent's capacity to explore its environment through active intervention, which contrasts with the ability to collect passive data by estimating associational relationships between actions and payouts. The existence of unobserved confounders, namely unmeasured variables affecting both the action and the outcome variables, implies that these two data-collection modes will in general not coincide. In this paper, we show that formalizing this distinction has conceptual and algorithmic implications to the bandit setting. The current generation of bandit algorithms implicitly try to maximize rewards based on estimation of the experimental distribution, which we show is not always the best strategy to pursue. Indeed, to achieve low regret in certain realistic classes of bandit problems (namely, in the face of unobserved confounders), both experimental and observational quantities are required by the rational agent. 
After this realization, we propose an optimization metric (employing both experimental and observational distributions) that bandit agents should pursue, and illustrate its benefits over traditional algorithms.", "full_text": "Bandits with Unobserved Confounders:\n\nA Causal Approach\n\nElias Bareinboim\u2217\n\nDepartment of Computer Science\n\nPurdue University\neb@purdue.edu\n\nAndrew Forney\u2217\n\nDepartment of Computer Science\n\nUniversity of California, Los Angeles\n\nforns@cs.ucla.edu\n\nJudea Pearl\n\nDepartment of Computer Science\n\nUniversity of California, Los Angeles\n\njudea@cs.ucla.edu\n\nAbstract\n\nThe Multi-Armed Bandit problem constitutes an archetypal setting for sequential\ndecision-making, permeating multiple domains including engineering, business,\nand medicine. One of the hallmarks of a bandit setting is the agent\u2019s capacity\nto explore its environment through active intervention, which contrasts with the\nability to collect passive data by estimating associational relationships between\nactions and payouts. The existence of unobserved confounders, namely unmea-\nsured variables affecting both the action and the outcome variables, implies that\nthese two data-collection modes will in general not coincide. In this paper, we\nshow that formalizing this distinction has conceptual and algorithmic implications\nto the bandit setting. The current generation of bandit algorithms implicitly try to\nmaximize rewards based on estimation of the experimental distribution, which\nwe show is not always the best strategy to pursue. Indeed, to achieve low regret\nin certain realistic classes of bandit problems (namely, in the face of unobserved\nconfounders), both experimental and observational quantities are required by the\nrational agent. 
After this realization, we propose an optimization metric (employ-\ning both experimental and observational distributions) that bandit agents should\npursue, and illustrate its bene\ufb01ts over traditional algorithms.\n\nIntroduction\n\n1\nThe Multi-Armed Bandit (MAB) problem is one of the most popular settings encountered in the\nsequential decision-making literature [Rob52, LR85, EDMM06, Sco10, BCB12] with applications\nacross multiple disciplines. The main challenge in a prototypical bandit instance is to determine\na sequence of actions that maximizes payouts given that each arm\u2019s reward distribution is initially\nunknown to the agent. Accordingly, the problem revolves around determining the best strategy for\nlearning this distribution (exploring) while, simultaneously, using the agent\u2019s accumulated samples\nto identify the current \u201cbest\u201d arm so as to maximize pro\ufb01t (exploiting). Different algorithms employ\ndifferent strategies to balance exploration and exploitation, but a standard de\ufb01nition for the \u201cbest\u201d\narm is the one that has the highest payout rate associated with it. We will show that, perhaps\nsurprisingly, the de\ufb01nition of \u201cbest\u201d arm is more involved when unobserved confounders are present.\nThis paper complements the vast literature of MAB that encompasses many variants including ad-\nversarial bandits (in which an omnipotent adversary can dynamically shift the reward distributions\nto thwart the player\u2019s best strategies) [BFK10, AS95, BS12] contextual bandits (in which the payout,\n\n\u2217The authors contributed equally to this paper.\n\n1\n\n\fand therefore the best choice of action, is a function of one or more observed environmental vari-\nables) [LZ08, DHK+11, Sli14], and many different constraints and assumptions over the underlying\ngenerative model and payout structure [SBCAY14]. 
For a recent survey, see [BCB12].\nThis work addresses the MAB problem when unobserved confounders are present (called MABUC,\nfor short), which is arguably the most sensible assumption in real-world, practical applications (ob-\nviously weaker than assuming the inexistence of confounders). To support this claim, we should \ufb01rst\nnote that in the experimental design literature, Fisher\u2019s very motivation for considering randomizing\nthe treatment assignment was to eliminate the in\ufb02uence of unobserved confounders \u2013 factors that\nsimultaneously affect the treatment (or bandit arm) and outcome (or bandit payout), but are not ac-\ncounted for in the analysis. In reality, the reason for not accounting for such factors explicitly in the\nanalysis is that many of them are unknown a priori by the modeller [Fis51].\nThe study of unobserved confounders is one of the central themes in the modern literature of causal\ninference. To appreciate the challenges posed by these confounders, consider the comparison be-\ntween a randomized clinical trial conducted by the Food and Drug Administration (FDA) versus\nphysicians prescribing drugs in their of\ufb01ces. A key tenet in any FDA trial is the use of randomiza-\ntion for the treatment assignment, which precisely protects against biases that might be introduced\nby physicians. Speci\ufb01cally, physicians may prescribe Drug A for their wealthier patients who have\nbetter nutrition than their less wealthy ones, when unknown to the doctors, the wealthy patients\nwould recover without treatment. On the other hand, physicians may avoid prescribing the expen-\nsive Drug A to their less privileged patients, who (again unknown to the doctors) tend to suffer less\nstable immune systems causing negative reactions to the drug. 
If a naive estimate of the drug\u2019s causal\neffect is computed based on physicians\u2019 data (obtained through random sampling, but not random\nassignment), the drug would appear more effective than it is in practice \u2013 a bias that would otherwise\nbe avoided by random assignment. Confounding biases (of varying magnitude) appear in almost any\napplication in which the goal is to learn policies (instead of statistical associations), and the use of\nrandomization of the treatment assignment is one established tool to combat them [Pea00].\nTo the best of our knowledge, no method in the bandit literature has studied the issue of unobserved\nconfounding explicitly, in spite of its pervasiveness in real-world applications. Specifically, no MAB\ntechnique makes a clear-cut distinction between experimental exploration (through random assignment\nas required by the FDA) and observational data (as given by random sampling in the doctors\u2019\noffices). In this paper, we explicitly acknowledge, formalize, and then exploit these different types\nof data-collection. More specifically, our contributions are as follows:\n\n\u2022 We show that the current bandit algorithms implicitly attempt to maximize rewards by estimating\nthe experimental distribution, which does not guarantee an optimal strategy when\nunobserved confounders are present (Section 2).\n\u2022 Based on this observation, we translate the MAB problem to causal language, and then\nsuggest a more appropriate metric that bandit players should optimize for when unobserved\nconfounders are present. This leads to a new exploitation principle that can take advantage\nof data collected under both observational and experimental modes (Section 3).\n\u2022 We empower Thompson Sampling with this new principle and run extensive simulations.\nThe experiments suggest that the new strategy is statistically efficient and consistent (Sec. 
4).\n\n2 Challenges due to Unobserved Confounders\nIn this section, we discuss the mechanics of how the maximization of rewards is treated based\non a bandit instance with unobserved confounders. Consider a scenario in which a greedy casino\ndecides to demo two new models of slot machines, say M1 and M2 for simplicity, and wishes to\nmake them as lucrative as possible. As such, they perform a battery of observational studies (using\nrandom sampling) to compare various traits of the casino\u2019s gamblers to their typical slot machine\nchoices. From these studies, the casino learns that two factors well predict the gambling habits of\nplayers when combined (unknown by the players): player inebriation and machine conspicuousness\n(say, whether or not a machine is blinking). Coding both of these traits as binary variables, we let\nB \u2208 {0, 1} denote whether or not a machine is blinking, and D \u2208 {0, 1} denote whether or not the\ngambler is drunk. As it turns out, a gambler\u2019s \u201cnatural\u201d choice of machine, X \u2208 {M1, M2}, can be\nmodelled by the structural equation indicating the index of their chosen arm (starting at 0):\n\nX \u2190 fX (B, D) = (D \u2227 \u00acB) \u2228 (\u00acD \u2227 B) = D \u2295 B\n\n(1)\n\n2\n\n\fFigure 1: Performance of different bandit strategies in the greedy casino example. Left panel:\nno algorithm is able to perform better than random guessing. Right panel: Regret grows without\nbounds.\n\nMoreover, the casino learns that every gambler has an equal chance of being intoxicated and each\nmachine has an equal chance of blinking its lights at a given time, namely, P (D = 0) = P (D =\n1) = 0.5 and P (B = 0) = P (B = 1) = 0.5.\nThe casino\u2019s executives decide to take advantage of these propensities by introducing a new type\nof reactive slot machine that will tailor payout rates to whether or not it believes (via sensor input,\nassumed to be perfect for this problem) a gambler is intoxicated. 
Suppose also that a new gambling\nlaw requires that casinos maintain a minimum attainable payout rate for slots of 30%. Cognizant of\nthis new law, while still wanting to maximize profits by exploiting gamblers\u2019 natural arm choices,\nthe casino executives modify their new slots with the payout rates depicted in Table 1a.\n\n(a)\n         D = 0           D = 1\n         B = 0   B = 1   B = 0   B = 1\nX = M1   *0.10   0.50    0.40    *0.20\nX = M2   0.50    *0.10   *0.20   0.40\n\n(b)\n         P(y|X)   P(y|do(X))\nX = M1   0.15     0.3\nX = M2   0.15     0.3\n\nTable 1: (a) Payout rates decided by reactive slot machines as a function of arm choice, sobriety,\nand machine conspicuousness. Players\u2019 natural arm choices under D, B are indicated by asterisks.\n(b) Payout rates according to the observational, P(Y = 1|X), and experimental, P(Y = 1|do(X)),\ndistributions, where Y = 1 represents winning (shown in the table), and 0 otherwise.\n\nThe state, blind to the casino\u2019s payout strategy, decides to perform a randomized study to verify\nwhether the win rates meet the 30% payout requisite. Wary that the casino might try to inflate\npayout rates for the inspectors, the state recruits random players from the casino floor, pays them to\nplay a random slot, and then observes the outcome. Their randomized experiment yields a favorable\noutcome for the casino, with win rates meeting precisely the 30% cutoff. The data looks like Table\n1b (third column), assuming binary payout Y \u2208 {0, 1}, where 0 represents losing, and 1 winning.\nAs students of causal inference and still suspicious of the casino\u2019s ethical standards, we decide to go\nto the casino\u2019s floor and observe the win rates of players based on their natural arm choices (through\nrandom sampling). 
We encounter a distribution close to Table 1b (second column), which shows\nthat the casino is actually paying ordinary gamblers only 15% of the time.\nIn summary, the casino is at the same time (1) exploiting the natural predilections of the gamblers\u2019\narm choices as a function of their intoxication and the machine\u2019s blinking behavior (based on eq. 1),\n(2) paying, on average, less than the legally required rate (15% instead of 30%), and (3) fooling the state\u2019s\ninspectors since the randomized trial payout meets the 30% legal requirement.\nAs machine learning researchers, we decide to run a battery of experiments using the standard bandit\nalgorithms (e.g., \u03b5-greedy, Thompson Sampling, UCB1, EXP3) to test the new slot machines on the\ncasino floor. We obtain data encoded in Figure 1a, which shows that the probability of choosing\nthe correct action is no better than a random coin flip even after a considerable number of steps.\nWe note, somewhat surprised, that the cumulative regret (Fig. 1b) shows no signs of abating, and\nthat we are apparently unable to learn a superior arm. We also note that the results obtained by the\nstandard algorithms coincide with the randomized study conducted by the state (purple line).\nIn the presence of unobserved confounders such as in the casino example, however, P(y|do(X))\ndoes not seem to capture the information required to maximize payout, but rather the average payout\nakin to choosing arms by a coin flip. Specifically, the payout given by coin flipping is the same for\nboth machines, P(Y = 1|do(X = M1)) = P(Y = 1|do(X = M2)) = 0.3, which means that\nthe arms are statistically indistinguishable in the limit of large sample size. 
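The observational and experimental payout rates quoted for the greedy casino can be checked by direct enumeration of Table 1a together with Eq. 1. The following Python sketch is illustrative only: the dictionary PAYOUT and the helper names are assumptions introduced here (they do not come from the paper or its released code), arms are indexed 0 (M1) and 1 (M2), and the four (D, B) states are taken to be equally likely, as stated in Section 2.

```python
from itertools import product

# Payout rates as read from Table 1a, keyed by (arm, D, B); arm 0 is M1, arm 1 is M2.
# PAYOUT and the helper names below are illustrative, not taken from the paper.
PAYOUT = {
    (0, 0, 0): 0.10, (0, 0, 1): 0.50, (0, 1, 0): 0.40, (0, 1, 1): 0.20,
    (1, 0, 0): 0.50, (1, 0, 1): 0.10, (1, 1, 0): 0.20, (1, 1, 1): 0.40,
}
STATES = list(product((0, 1), repeat=2))  # all (D, B) pairs, each with probability 1/4


def natural_choice(d, b):
    # Eq. 1: X = D xor B, the index of the arm a gambler picks unprompted.
    return d ^ b


def p_do(arm):
    # Experimental payout P(Y=1 | do(X=arm)): the arm is forced, so the
    # average runs over every confounder state.
    return sum(PAYOUT[(arm, d, b)] for d, b in STATES) / len(STATES)


def p_obs(arm):
    # Observational payout P(Y=1 | X=arm): the average runs only over the
    # states in which the gambler would have chosen this arm anyway.
    chosen = [(d, b) for d, b in STATES if natural_choice(d, b) == arm]
    return sum(PAYOUT[(arm, d, b)] for d, b in chosen) / len(chosen)


for arm in (0, 1):
    print('arm', arm, 'P(y|X) =', round(p_obs(arm), 2), 'P(y|do(X)) =', round(p_do(arm), 2))
```

Under these assumptions both arms come out at 0.15 observationally and 0.3 experimentally, in agreement with Table 1b.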
Further, if we consider\nusing the observational data from watching gamblers on the casino floor (based on their natural\npredilections), the average payoff will also appear independent of the machine choice, P(Y =\n1|X = M1) = P(Y = 1|X = M2) = 0.15, albeit with an even lower payout. 1\nBased on these observations, we can see why no arm choice is better than the other under either distribution\nalone, which explains why any algorithm based on these distributions will certainly\nfail to learn an optimal policy. More fundamentally, we should be puzzled by the disagreement\nbetween observational and interventional distributions. This residual difference may be encoding\nknowledge about the unobserved confounders, which may indicate how to differentiate the arms\nand suggest a sensible strategy to play better than pure chance. In the next section, we will use causal machinery\nto realize this idea.\n\n3 Bandits as a Causal Inference Problem\n\nWe will use the language of structural causal models [Pea00, Ch. 7] for expressing the bandit data-\ngenerating process and for allowing the explicit manipulation of some key concepts in our analysis \u2013\ni.e., confounding, observational and experimental distributions, and counterfactuals (to be defined).\nDefinition 3.1. (Structural Causal Model) ([Pea00, Ch. 7]) A structural causal model M is a\n4-tuple \u27e8U, V, F, P(u)\u27e9 where:\n\n1. U is a set of background variables (also called exogenous), that are determined by factors\noutside of the model,\n2. V is a set {V1, V2, ..., Vn} of observable variables (also called endogenous), that are determined\nby variables in the model (i.e., determined by variables in U \u222a V),\n3. F is a set of functions {f1, f2, ..., fn} such that each fi is a mapping from the respective\ndomains of Ui \u222a PAi to Vi, where Ui \u2286 U and PAi \u2286 V \\ Vi and the entire set F forms a\nmapping from U to V. In other words, each fi in vi \u2190 fi(pai, ui), i = 1, ..., n, assigns a\nvalue to Vi that depends on the values of the select set of variables (Ui \u222a PAi), and\n4. P(u) is a probability distribution over the exogenous variables.\n\nEach structural model M is associated with a directed acyclic graph G, where nodes correspond to\nendogenous variables V and edges represent functional relationships \u2013 i.e., there exists an edge from\nX to Y whenever X appears in the argument of Y\u2019s function. We define next the MABUC problem\nwithin the structural semantics.\nDefinition 3.2. (K-Armed Bandits with Unobserved Confounders) A K-Armed bandit problem\nwith unobserved confounders is defined as a model M with a reward distribution over P(u) where:\n1. Xt \u2208 {x1, ..., xk} is an observable variable encoding the player\u2019s arm choice from one of k\narms, decided by Nature in the observational case, and do(Xt = \u03c0(x0, y0, ..., xt\u22121, yt\u22121)),\nfor strategy \u03c0 in the experimental case (i.e., when the strategy decides the choice),\n2. Ut represents the unobserved variable encoding the payout rate of arm xt as well as the\npropensity to choose xt, and\n3. Yt \u2208 {0, 1} is a reward (0 for losing, 1 for winning) from choosing arm xt under unobserved\nconfounder state ut decided by yt = fy(xt, ut).\n\n1 One may surmise that these ties are just contrived examples, or perhaps numerical coincidences, which\ndo not appear in realistic bandit instances. Unfortunately, that\u2019s not the case as shown in the other scenarios\ndiscussed in the paper. 
This phenomenon is indeed a manifestation of the deeper problem arising due to the\nlack of control for the unobserved confounders.\n\nFigure 2: (a) Model for the standard MAB sequential decision game. (b) Model for the MABUC\nsequential decision game. In each model, solid nodes denote observed variables and open nodes\nrepresent unobserved variables. Square nodes denote the player\u2019s strategy\u2019s arm choice at time t.\nDashed lines illustrate influences on future time trials that are not pictured.\n\nFirst note that this definition also applies to the MAB problem (without confounding) as shown\nin Fig. 2a. This standard MAB instance is defined by constraining the MABUC definition such\nthat Ut affects only the outcome variable Yt \u2013 there is no edge from Ut to Xt (Def. 3.2.2). In the\nunconfounded case, it is clear that P(y|do(x)) = P(y|x) [Pea00, Ch. 3], which means that\npayouts associated with flipping a coin to randomize the treatment or observing (through random\nsampling) the player gambling on the casino\u2019s floor based on their natural predilections will yield\nthe same answer. The variable U carries the unobserved payout parameters of each arm, which are\nusually the target of analysis. 2 3\nFig. 2b provides a graphical representation of the MABUC problem. Note that \u03c0t represents the\nsystem\u2019s choice policy, which is affected by the unobserved factors encoded through the arrow from\nUt to \u03c0t. One way to understand this arrow is through the idea of players\u2019 natural predilections.\nIn the example from the previous section, the predilection would correspond to the choices arising\nwhen the gambler is allowed to play freely on the casino\u2019s floor (e.g., drunk players desiring to play\non the blinking machines) or doctors prescribing drugs based on their gut feeling (e.g., physicians\nprescribing the more expensive drug to their wealthier patients). 
These predilections are encoded in\nthe observational distribution P(y|x). On the other hand, the experimental distribution P(y|do(x))\nencodes the process in which the natural predilections are overridden, or halted, by external policies.\nIn our example, this distribution arises when the government\u2019s inspectors flip a coin and send\ngamblers to machines based on the coin\u2019s outcome, regardless of their predilections.\nRemarkably, it is possible to use the information embedded in these distinct data-collection modes\n(and their corresponding distributions) to understand players\u2019 predilections and perform better than\nrandom guessing in these bandit instances. To see this, assume there exists an oracle on the casino\u2019s\nfloor operating by the following protocol. The oracle observes the gamblers until they are about to\nplay a given machine. The oracle intercepts each gambler who is about to pull the arm of machine\nM1, for example, and suggests that the player contemplate whether following his predilection (M1)\nor going against it (playing M2) would lead to a better outcome. The drunk gambler, who is a clever\nmachine learning student and familiar with Fig. 1, says that this evaluation cannot be computed a\npriori. He affirms that, despite spending hours in the casino estimating the payoff distribution based\non players\u2019 natural predilections (namely, P(y|x)), it is not feasible to relate this distribution to\nthe hypothetical construction of what would have happened had he decided to play differently. He also\nacknowledges that the experimental distribution P(y|do(x)), devoid of the gamblers\u2019 predilections,\ndoes not support any clear comparison against his personal strategy. The oracle says that this type\nof reasoning is possible, but first one needs to define the concept of counterfactual.\nDefinition 3.3. (Counterfactual) ([Pea00, p. 204]) Let X and Y be two subsets of endogenous\nvariables in V. 
The counterfactual sentence \u201cY would be y (in situation u), had X been x\u201d is\ninterpreted as the equality Yx(u) = y, with Yx(u) being the potential response of Y to X = x.\n\n2 On a more fundamental level, it is clear that unconfoundedness is (implicitly) assumed not to hold in the\ngeneral case. Otherwise, the equality between observational and experimental distributions would imply that\nno randomization of the action needs to be carried out since standard random sampling would recover the same\ndistribution. In this case, this would imply that many works in the literature are acting in a suboptimal way\nsince, in general, experiments are more expensive to perform than collecting data through random sampling.\n\n3 The interventional nature of the MAB problem is virtually not discussed in the literature; one of the few\nexceptions is the causal interpretation of Thompson Sampling established in [OB10].\n\nThis definition naturally leads to the judgement suggested by the oracle, namely, \u201cwould I (the agent)\nwin (Y = 1) had I played on machine M1 (X = 1)\u201d, which can be formally written as YX=1 = 1\n(we drop the M for simplicity). Assuming that the agent\u2019s natural predilection is to play machine\n1, the oracle suggests an introspection comparing the odds of winning following his intuition or\ngoing against it. The former statement can be written in counterfactual notation, probabilistically,\nas E(YX=1 = 1|X = 1), which reads as \u201cthe expected value of winning (Y = 1) had I played\nmachine 1 given that I am about to play machine 1\u201d, which contrasts with the alternative hypothesis\nE(YX=0 = 1|X = 1), which reads as \u201cthe expected value of winning (Y = 1) had I played machine\n0 given that I am about to play machine 1\u201d. This is also known in the literature as the effect of the\ntreatment on the treated (ETT) [Pea00]. 
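To make the two counterfactual quantities concrete, they can be enumerated for the greedy casino of Section 2, where the full payout table is known to us (though not to the player). This is a sketch under that assumption: PAYOUT and ett are names introduced here for illustration, arms are indexed 0 (M1) and 1 (M2), and the intuition is taken to follow Eq. 1 with equally likely (D, B) states.

```python
# Payout rates as read from Table 1a, keyed by (arm, D, B); arm 0 is M1, arm 1 is M2.
# PAYOUT and ett are illustrative names, not taken from the paper.
PAYOUT = {
    (0, 0, 0): 0.10, (0, 0, 1): 0.50, (0, 1, 0): 0.40, (0, 1, 1): 0.20,
    (1, 0, 0): 0.50, (1, 0, 1): 0.10, (1, 1, 0): 0.20, (1, 1, 1): 0.40,
}


def ett(action, intuition):
    # E(Y_{X=action} = 1 | X = intuition): expected payout of playing `action`
    # among gamblers whose natural choice (D xor B, Eq. 1) is `intuition`.
    states = [(d, b) for d in (0, 1) for b in (0, 1) if d ^ b == intuition]
    return sum(PAYOUT[(action, d, b)] for d, b in states) / len(states)


for intuition in (0, 1):
    follow = ett(intuition, intuition)      # obey the predilection
    oppose = ett(1 - intuition, intuition)  # go against the predilection
    print('intuition', intuition, 'follow', round(follow, 2), 'oppose', round(oppose, 2))
```

In this parameterization, going against the intuition pays 0.45 on average versus 0.15 for following it, whichever arm the gambler was inclined to play, even though neither P(y|x) nor P(y|do(x)) alone separates the arms.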
So, instead of using a decision rule comparing the average\npayouts across arms, namely (for action a),\n\nargmax_a E(Y | do(X = a)),    (2)\n\nwhich was shown in the previous section to be insufficient to handle the MABUC, we should consider\nthe rule using the comparison between the average payouts obtained by players for choosing\nin favour or against their intuition, respectively,\n\nargmax_a E(YX=a = 1 | X = x),    (3)\n\nwhere x is the player\u2019s natural predilection and a is their final decision. We will call this procedure\nRDC (regret decision criterion), to emphasize the counterfactual nature of this reasoning step and\nthe idea of following or disobeying the agent\u2019s intuition, which is motivated by the notion of regret.\nRemarkably, RDC accounts for the agent\u2019s individuality and the fact that their natural inclination\nencodes valuable information about the confounders that also affect the payout. In the binary case,\nfor example, assuming that X = 1 is the player\u2019s natural choice at some time step, if E(YX=0 =\n1|X = 1) is greater than E(YX=1 = 1|X = 1), this would imply that the player should refrain from\nplaying machine X = 1 and play machine X = 0 instead.\nAssuming one wants to implement an algorithm based on RDC, the natural question that arises\nis how the quantities entailed by Eq. 3 can be computed from data. For the factors in the form\nE(YX=1 = 1|X = 1), the consistency axiom [Pea00, pp. 229] implies that E(YX=1 = 1|X =\n1) = E(Y = 1|X = 1), where the l.h.s. is estimable from observational data. Counterfactuals\nin the form E(YX=a = 1|X = x), where a \u2260 x, can be computed in the binary case through\nalgebraic means [Pea00, pp. 396-7]. For the general case, however, ETT is not computable without\nknowledge of the causal graph. 4 Here, ETT will be computed in an alternative fashion, based on the\nidea of intention-specific randomization. 
The main idea is to randomize intention-specific groups,\nnamely, interrupt any reasoning agent before they execute their choice, treat this choice as intention,\ndeliberate, and then act. We next discuss the algorithmic implementation of this randomization.\n\n4 Graphical conditions for identifying ETT [Pea00, SP07] are orthogonal to the bandit problem studied in\nthis paper, since no detailed knowledge about the causal graph (as well as infinite samples) is assumed here.\n\n4 Applications & Experiments\n\nBased on the previous discussion, we can revisit the greedy casino example from Section 2, apply\nRDC and use the following inequality to guide agent\u2019s decisions:\n\nE(YX=0|X = 1) > E(YX=1|X = 1) \u21d4 E(YX=0|X = 1) > P(Y|X = 1)    (4)\n\nThere are different ways of incorporating this heuristic into traditional bandit algorithms, and we\ndescribe one such approach taking the Thompson Sampling algorithm as the basis [OB10, CL11,\nAG11]. (For simulation source code, see https://github.com/ucla-csl/mabuc )\nOur proposed algorithm, Causal Thompson Sampling (TS^C), takes the following steps: (1) TS^C\nfirst accepts an observational distribution as input, which it then uses to seed estimates of ETT quantities;\ni.e., for actions a and intuition x, by consistency we may seed knowledge of E(YX=a|X =\nx) = Pobs(y|x), \u2200a = x. With large samples from an input set of observations, this seeding reduces\n(and possibly eliminates) the need to explore the payout rates associated with following intuition,\nleaving only the \u201cdisobeying intuition\u201d payout rates left for the agent to learn. As such, (2) at each\ntime step, our oracle observes the agent\u2019s arm-choice predilection, and then uses RDC to determine\ntheir best choice.5 Lastly, note that our seeding in (1) immediately improves the accuracy of\nour comparison between arms, viz. that a superior arm will emerge more quickly than had we not\nseeded. 
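Steps (1) and (2) can be sketched in a few lines of Python on the greedy casino. This is a simplified illustration, not the authors' released implementation: it keeps one Beta posterior per (intuition, arm) pair, seeds the follow-intuition cells with pseudo-counts derived from P(y|X), and leaves out any arm-weighting refinement. The names PAYOUT, P_OBS, OBS_N and run are assumptions made here, and OBS_N = 100 is an arbitrary choice for the size of the observational sample.

```python
import random

# Payout rates as read from Table 1a, keyed by (arm, D, B); arm 0 is M1, arm 1 is M2.
PAYOUT = {
    (0, 0, 0): 0.10, (0, 0, 1): 0.50, (0, 1, 0): 0.40, (0, 1, 1): 0.20,
    (1, 0, 0): 0.50, (1, 0, 1): 0.10, (1, 1, 0): 0.20, (1, 1, 1): 0.40,
}
P_OBS = [0.15, 0.15]  # P(Y=1 | X=x) from Table 1b
OBS_N = 100           # assumed number of observational samples per arm


def run(trials, seed=0):
    rng = random.Random(seed)
    # wins[x][a] and losses[x][a] are Beta parameters for playing arm a when
    # the intuition is x; every cell starts from a uniform Beta(1, 1) prior.
    wins = [[1.0, 1.0], [1.0, 1.0]]
    losses = [[1.0, 1.0], [1.0, 1.0]]
    # Step (1): by consistency, seed the follow-intuition cells (a == x)
    # with pseudo-counts taken from the observational distribution.
    for x in (0, 1):
        wins[x][x] += P_OBS[x] * OBS_N
        losses[x][x] += (1.0 - P_OBS[x]) * OBS_N
    total = 0
    for _ in range(trials):
        d, b = rng.randint(0, 1), rng.randint(0, 1)  # hidden from the player
        x = d ^ b                                    # observed intuition (Eq. 1)
        # Step (2): compare sampled payouts for obeying and disobeying x.
        draws = [rng.betavariate(wins[x][a], losses[x][a]) for a in (0, 1)]
        a = 0 if draws[0] >= draws[1] else 1
        y = 1 if rng.random() < PAYOUT[(a, d, b)] else 0
        wins[x][a] += y
        losses[x][a] += 1 - y
        total += y
    return wins, losses, total


wins, losses, total = run(5000)
for x in (0, 1):
    means = [wins[x][a] / (wins[x][a] + losses[x][a]) for a in (0, 1)]
    print('intuition', x, 'posterior means', [round(m, 2) for m in means])
print('average reward', round(total / 5000, 3))
```

Because the posteriors are indexed by intuition, the sampler can learn that the counter-intuitive arm pays more for each intuition, which a sampler that ignores x cannot detect in this parameterization.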
We can exploit this early lead in accuracy by weighting the more favorable arm, making it\nmore likely to be chosen earlier in the learning process (which empirically improves the convergence\nrate as shown in the simulations).\n\nAlgorithm 1 Causal Thompson Sampling (TS^C)\nprocedure TS^C(Pobs, T)\n    E(YX=a|X) \u2190 Pobs(y|X)    (seed distribution)\n    for t = [1, ..., T] do\n        x \u2190 intuition(t)    (get intuition for trial)\n        Q1 \u2190 E(YX=x\u2032|X = x)    (estimated payout for counter-intuition)\n        Q2 \u2190 P(y|X = x)    (estimated payout for intuition)\n        w \u2190 [1, 1]    (initialize weights)\n        bias \u2190 1 \u2212 |Q1 \u2212 Q2|    (compute weighting strength)\n        if Q1 > Q2 then w[x] \u2190 bias else w[x\u2032] \u2190 bias    (choose arm to bias)\n        a \u2190 max(\u03b2(sM1,x, fM1,x) \u00d7 w[1], \u03b2(sM2,x, fM2,x) \u00d7 w[2])    (choose arm) 6\n        y \u2190 pull(a)    (receive reward)\n        E(YX=a|X = x) \u2190 y|a, x    (update)\n\nIn the next section, we provide simulations to support the efficacy of TS^C in the MABUC context.\nFor simplicity, we present two simulation results for the model described in Section 2 (see footnote 7). Experiment\n1 employs the \u201cGreedy Casino\u201d parameterization found in Table 1, whereas Experiment 2 employs\nthe \u201cParadoxical Switching\u201d parameterization found in Table 2. Each experiment compares the\nperformance of traditional Thompson Sampling bandit players versus TS^C.\n\n(a)\n         D = 0           D = 1\n         B = 0   B = 1   B = 0   B = 1\nX = M1   *0.40   0.30    0.30    *0.40\nX = M2   0.60    *0.10   *0.20   0.60\n\n(b)\n         P(y|X)   P(y|do(X))\nX = M1   0.4      0.35\nX = M2   0.15     0.375\n\nTable 2: \u201cParadoxical Switching\u201d parameterization. (a) Payout rates decided by reactive slot machines\nas a function of arm choice, sobriety, and machine conspicuousness. Players\u2019 natural arm\nchoices under D, B are indicated by asterisks. 
(b) Payout rates associated with the observational\nand experimental distributions, respectively.\n\nProcedure. All reported simulations are partitioned into rounds of T = 1000 trials averaged over\nN = 1000 Monte Carlo repetitions. At each time step in a single round, (1) values for the unobserved\nconfounders (B, D) and observed intuitive arm choice (x) are selected by their respective\nstructural equations (see Section 2), (2) the player observes the value of x, (3) the player chooses\nan arm based on their given strategy to maximize reward (which may or may not employ x), and\nfinally, (4) the player receives a Bernoulli reward Y \u2208 {0, 1} and records the outcome.\nFurthermore, at the start of every round, players possess knowledge of the problem\u2019s observational\ndistribution, i.e., each player begins knowing P(Y|X) (see Table 2b). However, only causally-\nempowered strategies will be able to make use of this knowledge, since this distribution is not, as\nwe\u2019ve seen, the correct one to maximize.\nCandidate algorithms. Standard Thompson Sampling (TS) attempts to maximize rewards based\non P(y|do(X)), ignoring the intuition x. Z-Empowered Thompson Sampling (TS^Z) treats the\npredilection as a new context variable, Z, and attempts to maximize based on P(y|do(X), Z) at\neach round. Causal Thompson Sampling (TS^C), as described above, employs the ETT inequality\nand input observational distribution.\n\nFigure 3: Simulation results for Experiments 1 and 2 comparing standard Thompson Sampling\n(TS), Z-Empowered Thompson Sampling (TS^Z), and Causal Thompson Sampling (TS*)\n\n5 Note that using predilection as a criterion for the inequality does not uniquely map to the contextual bandit\nproblem. To understand this point, note that not all variables are equally legitimate for confounding control in\ncausal settings, while the agent\u2019s predilection is certainly one such variable in our setup. Specifically, determining\nwhether a variable qualifies as a causal context requires a much deeper understanding of the data-\ngenerating model, which is usually not available in the general case.\n\n6 The notation \u03b2(sMk,x, fMk,x) means to sample from a Beta distribution with parameters equal to the\nsuccesses encountered choosing action x on machine Mk (sMk,x) and the failures encountered choosing action\nx on that machine (fMk,x).\n\n7 For additional experimental results and parameterizations, see Appendix [BFP15].\n\nEvaluation metrics. We assessed each algorithm\u2019s performance with standard bandit evaluation\nmetrics: (1) the probability of choosing the optimal arm and (2) cumulative regret. As in traditional\nbandit problems, these measures are recorded as a function of the time step t averaged over all\nN round repetitions. Note, however, that traditional definitions of regret are not phrased in terms\nof unobserved confounders; our metrics, by contrast, compare each algorithm\u2019s chosen arm to the\noptimal arm for a given instantiation of Bt and Dt, even though these instantiations are never directly\navailable to the players. 
We believe that this is a fair operationalization for our evaluation metrics\nbecause it allows us to compare regret experienced by our algorithms to a truly optimal (albeit\nhypothetical) policy that has access to the unobserved confounders.\nExperiment 1: \u201cGreedy Casino.\u201d The Greedy Casino parameterization (specified in Table 1) illustrates\nthe scenario where each arm\u2019s payout appears to be equivalent under the observational and\nexperimental distributions alone. Only when we combine the two distributions and condition on a\nplayer\u2019s predilection can we obtain the optimal policy. Simulations for Experiment 1 support the efficacy\nof the causal approach (see Figure 3). Analyses revealed a significant difference in the regret\nexperienced by TS^Z (M = 11.03, SD = 15.96) compared to TS^C (M = 0.94, SD = 15.39),\nt(999) = 14.52, p < .001. Standard TS was, predictably, not competitive, experiencing high regret\n(M = 150.47, SD = 14.09).\nExperiment 2: \u201cParadoxical Switching.\u201d The Paradoxical Switching parameterization (specified\nin Table 2a) illustrates the scenario where one arm (M1) appears superior in the observational\ndistribution, but the other arm (M2) appears superior in the experimental distribution. Again, we must use\ncausal analyses to resolve this ambiguity and obtain the optimal policy. Simulations for Experiment 2\nalso support the efficacy of the causal approach (see Figure 3). Analyses revealed a significant\ndifference in the regret experienced by TS^Z (M = 13.39, SD = 17.15) compared to TS^C\n(M = 4.71, SD = 17.90), t(999) = 11.28, p < .001. Standard TS was, again predictably, not\ncompetitive, experiencing high regret (M = 83.56, SD = 15.75).\n\n5 Conclusions\n\nIn this paper, we considered a new class of bandit problems with unobserved confounders (MABUC)\nthat are arguably more realistic than traditional formulations. 
We showed that MABUC instances\nare not amenable to standard algorithms that rely solely on the experimental distribution. More\nfundamentally, this led to an understanding that in MABUC instances the optimization task is not\nattainable through the estimation of the experimental distribution alone, but relies on both experimental\nand observational quantities rooted in counterfactual theory and based on the agents\u2019 predilections.\nTo take advantage of our findings, we empowered the Thompson Sampling algorithm in two different\nways. We first added a new rule capable of improving the choice of which arm to explore. We\nthen jumpstarted the algorithm by leveraging non-experimental (observational) data that is often\navailable, but overlooked. Simulations demonstrated that in general settings these changes lead to\nmore effective decision-making with faster convergence and lower regret.\n\nReferences\n[AG11] S. Agrawal and N. Goyal. Analysis of thompson sampling for the multi-armed bandit problem. CoRR, abs/1111.1797, 2011.\n[AS95] P. Auer, N. Cesa-Bianchi, Y. Freund, and R. E. Schapire. Gambling in a rigged casino: The adversarial multi-armed bandit problem. In Foundations of Computer Science, 1995. Proceedings., 36th Annual Symposium on, pages 322\u2013331, Oct 1995.\n[BCB12] S\u00e9bastien Bubeck and Nicolo Cesa-Bianchi. Regret analysis of stochastic and nonstochastic multi-armed bandit problems. Foundations and Trends in Machine Learning, 5:1\u2013122, 2012.\n[BFK10] R. Busa-Fekete and B. K\u00e9gl. Fast boosting using adversarial bandits. In J. F\u00fcrnkranz and T. Joachims, editors, 27th International Conference on Machine Learning (ICML 2010), pages 143\u2013150, Haifa, Israel, June 2010.\n[BFP15] E. Bareinboim, A. Forney, and J. Pearl. Bandits with unobserved confounders: A causal approach. Technical Report R-460, <http://ftp.cs.ucla.edu/pub/stat_ser/r460-L.pdf>, Cognitive Systems Laboratory, UCLA, 2015.\n[BS12] S. Bubeck and A. Slivkins. The best of both worlds: stochastic and adversarial bandits. CoRR, abs/1202.4473, 2012.\n[CL11] O. Chapelle and L. Li. An empirical evaluation of thompson sampling. In J. Shawe-Taylor, R.S. Zemel, P.L. Bartlett, F. Pereira, and K.Q. Weinberger, editors, Advances in Neural Information Processing Systems 24, pages 2249\u20132257. Curran Associates, Inc., 2011.\n[DHK+11] Miroslav Dud\u00edk, Daniel Hsu, Satyen Kale, Nikos Karampatziakis, John Langford, Lev Reyzin, and Tong Zhang. Efficient optimal learning for contextual bandits. CoRR, abs/1106.2369, 2011.\n[EDMM06] Eyal Even-Dar, Shie Mannor, and Yishay Mansour. Action elimination and stopping conditions for the multi-armed bandit and reinforcement learning problems. J. Mach. Learn. Res., 7:1079\u20131105, December 2006.\n[Fis51] R.A. Fisher. The Design of Experiments. Oliver and Boyd, Edinburgh, 6th edition, 1951.\n[LR85] T. L. Lai and Herbert Robbins. Asymptotically efficient adaptive allocation rules. Advances in Applied Mathematics, 6(1):4\u201322, 1985.\n[LZ08] John Langford and Tong Zhang. The epoch-greedy algorithm for multi-armed bandits with side information. In J.C. Platt, D. Koller, Y. Singer, and S.T. Roweis, editors, Advances in Neural Information Processing Systems 20, pages 817\u2013824. Curran Associates, Inc., 2008.\n[OB10] P. A. Ortega and D. A. Braun. A minimum relative entropy principle for learning and acting. J. Artif. Int. Res., 38(1):475\u2013511, May 2010.\n[Pea00] J. Pearl. Causality: Models, Reasoning, and Inference. Cambridge University Press, New York, 2000. Second ed., 2009.\n[Rob52] Herbert Robbins. Some aspects of the sequential design of experiments. Bull. Amer. Math. Soc., 58(5):527\u2013535, September 1952.\n[SBCAY14] Yevgeny Seldin, Peter L. Bartlett, Koby Crammer, and Yasin Abbasi-Yadkori. Prediction with limited advice and multiarmed bandits with paid observations. In International Conference on Machine Learning, Beijing, China, 2014.\n[Sco10] S. L. Scott. A modern bayesian look at the multi-armed bandit. Applied Stochastic Models in Business and Industry, 26(6):639\u2013658, 2010.\n[Sli14] A. Slivkins. Contextual bandits with similarity information. J. Mach. Learn. Res., 15(1):2533\u20132568, January 2014.\n[SP07] I. Shpitser and J. Pearl. What counterfactuals can be tested. In Proceedings of the Twenty-Third Conference on Uncertainty in Artificial Intelligence (UAI 2007), pages 352\u2013359. AUAI Press, Vancouver, BC, Canada, 2007.\n", "award": [], "sourceid": 826, "authors": [{"given_name": "Elias", "family_name": "Bareinboim", "institution": "Purdue University"}, {"given_name": "Andrew", "family_name": "Forney", "institution": "UCLA"}, {"given_name": "Judea", "family_name": "Pearl", "institution": "UCLA"}]}