{"title": "Fairness in Learning: Classic and Contextual Bandits", "book": "Advances in Neural Information Processing Systems", "page_first": 325, "page_last": 333, "abstract": "We introduce the study of fairness in multi-armed bandit problems. Our fairness definition demands that, given a pool of applicants, a worse applicant is never favored over a better one, despite a learning algorithm\u2019s uncertainty over the true payoffs. In the classic stochastic bandits problem we provide a provably fair algorithm based on \u201cchained\u201d confidence intervals, and prove a cumulative regret bound with a cubic dependence on the number of arms. We further show that any fair algorithm must have such a dependence, providing a strong separation between fair and unfair learning that extends to the general contextual case. In the general contextual case, we prove a tight connection between fairness and the KWIK (Knows What It Knows) learning model: a KWIK algorithm for a class of functions can be transformed into a provably fair contextual bandit algorithm and vice versa. This tight connection allows us to provide a provably fair algorithm for the linear contextual bandit problem with a polynomial dependence on the dimension, and to show (for a different class of functions) a worst-case exponential gap in regret between fair and non-fair learning algorithms.", "full_text": "Fairness in Learning: Classic and Contextual Bandits \u2217\n\nMatthew Joseph\n\nMichael Kearns\n\nJamie Morgenstern\n\nAaron Roth\n\nUniversity of Pennsylvania, Department of Computer and Information Science\n\nmajos, mkearns, jamiemor, aaroth@cis.upenn.edu\n\nAbstract\n\nWe introduce the study of fairness in multi-armed bandit problems. Our fairness\nde\ufb01nition demands that, given a pool of applicants, a worse applicant is never\nfavored over a better one, despite a learning algorithm\u2019s uncertainty over the true\npayoffs. 
In the classic stochastic bandits problem we provide a provably fair\nalgorithm based on \u201cchained\u201d con\ufb01dence intervals, and prove a cumulative regret\nbound with a cubic dependence on the number of arms. We further show that\nany fair algorithm must have such a dependence, providing a strong separation\nbetween fair and unfair learning that extends to the general contextual case. In\nthe general contextual case, we prove a tight connection between fairness and the\nKWIK (Knows What It Knows) learning model: a KWIK algorithm for a class of\nfunctions can be transformed into a provably fair contextual bandit algorithm and\nvice versa. This tight connection allows us to provide a provably fair algorithm\nfor the linear contextual bandit problem with a polynomial dependence on the\ndimension, and to show (for a different class of functions) a worst-case exponential\ngap in regret between fair and non-fair learning algorithms.\n\n1\n\nIntroduction\n\nAutomated techniques from statistics and machine learning are increasingly being used to make\ndecisions that have important consequences on people\u2019s lives, including hiring [24], lending [10],\npolicing [25], and even criminal sentencing [7]. These high stakes uses of machine learning have led\nto increasing concern in law and policy circles about the potential for (often opaque) machine learning\ntechniques to be discriminatory or unfair [13, 6]. At the same time, despite the recognized importance\nof this problem, very little is known about technical solutions to the problem of \u201cunfairness\u201d, or the\nextent to which \u201cfairness\u201d is in con\ufb02ict with the goals of learning.\nIn this paper, we consider the extent to which a natural fairness notion is compatible with learning in\na bandit setting, which models many of the applications of machine learning mentioned above. 
In\nthis setting, the learner is a sequential decision maker, which must choose at each time step t which\ndecision to make from a \ufb01nite set of k \u201carms\". The learner then observes a stochastic reward from\n(only) the arm chosen, and is tasked with maximizing total earned reward (equivalently, minimizing\ntotal regret) by learning the relationships between arms and rewards over time. This models, for\nexample, the problem of learning the association between loan applicants and repayment rates over\ntime by repeatedly granting loans and observing repayment.\nWe analyze two variants of the setting: in the classic case, the learner\u2019s only source of information\ncomes from choices made in previous rounds. In the contextual case, before each round the learner\nadditionally observes some potentially informative context for each arm (for example representing\nthe content of an individual\u2019s loan application), and the expected reward is some unknown function of\n\n\u2217A full technical version of this paper is available on arXiv [17].\n\n30th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain.\n\n\fthe context. The dif\ufb01culty in this task stems from the unknown relationships between arms, rewards,\nand (in the contextual case) contexts: these relationships must be learned.\nWe introduce fairness into the bandit learning framework by saying that it is unfair to preferentially\nchoose one arm over another if the chosen arm has lower expected quality than the unchosen arm.\nIn the loan application example, this means that it is unfair to preferentially choose a less-quali\ufb01ed\napplicant (in terms of repayment probability) over a more-quali\ufb01ed applicant.\nIt is worth noting that this de\ufb01nition of fairness (formalized in the preliminaries) is entirely consistent\nwith the optimal policy, which can simply choose at each round to play uniformly at random from\nthe arms maximizing the expected reward. 
This is because – it seems – this definition of fairness is entirely consistent with the goal of maximizing expected reward. Indeed, the fairness constraint exactly states that the algorithm cannot favor low-reward arms!\nOur main conceptual result is that this intuition is incorrect in the face of unknown reward functions. Although fairness is consistent with implementing the optimal policy, it may not be consistent with learning the optimal policy. We show that fairness always has a cost, in terms of the achievable learning rate of the algorithm. For some problems, the cost is mild, but for others, the cost is large.\n\n1.1 Our Results\n\nWe divide our results into two parts. First, in Section 3 we study the classic stochastic multi-armed bandit problem [20, 19]. In this case, there are no contexts, and each arm i has a fixed but unknown average reward µ_i. In Section 3.1 we give a fair algorithm, FAIRBANDITS, and show that it guarantees nontrivial regret after T = O(k^3) rounds. We then show in Section 3.2 that it is not possible to do better – any fair learning algorithm can be forced to endure constant per-round regret for T = Ω(k^3) rounds – thus tightly characterizing the optimal regret attainable by fair algorithms in this setting, and formally separating it from the regret attainable by algorithms absent a fairness constraint.\nWe then move on to the general contextual bandit setting in Section 4 and prove a broad characterization result, relating fair contextual bandit learning to KWIK ("Knows What It Knows") learning [22]. Informally, a KWIK learning algorithm receives a series of unlabeled examples and must either predict a label or announce "I Don't Know". The KWIK requirement then stipulates that any predicted label must be close to its true label. 
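Informally, the prediction protocol just described can be sketched as a tiny learner. The class below – constant functions with noiseless feedback, with None standing in for the ⊥ symbol – is a toy illustrative assumption (real KWIK learners must also handle noisy feedback), not a construction from the paper:

```python
class KwikConstantLearner:
    """Toy KWIK learner for the class of constant functions f(x) = c
    with noiseless feedback: it answers "I don't know" (None) until it
    has seen one label, then predicts that label forever."""

    def __init__(self):
        self.c = None

    def predict(self, x):
        return self.c  # None plays the role of the ⊥ symbol

    def feedback(self, x, y):
        # Feedback is only provided on rounds where we answered ⊥.
        self.c = y

learner = KwikConstantLearner()
outputs = []
for x, y in [(0.1, 0.7), (0.5, 0.7), (0.9, 0.7)]:
    p = learner.predict(x)
    outputs.append(p)
    if p is None:
        learner.feedback(x, y)
# outputs == [None, 0.7, 0.7]: exactly one ⊥, then accurate predictions
```

On this sequence the learner outputs ⊥ exactly once, so for this toy class its KWIK bound is m = 1.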
The quality of a KWIK learning algorithm is characterized by its "KWIK bound", which provides an upper bound on the maximum number of times the algorithm can be forced to announce "I Don't Know". For any contextual bandit problem (defined by the set of functions C from which the payoff functions may be selected), we show that the optimal learning rate of any fair algorithm is determined by the best KWIK bound for the class C. We prove this constructively via a reduction showing how to convert a KWIK learning algorithm into a fair contextual bandit algorithm in Section 4.1, and vice versa in Section 4.2.\nThis general connection immediately allows us to import known results for KWIK learning [22]. It implies that some fair contextual bandit problems are easy, and achieve non-trivial regret guarantees in only polynomially many rounds. Conversely, it also implies that some contextual bandit problems which are easy without the fairness constraint become hard once we impose the fairness constraint, in that any fair algorithm must suffer constant per-round regret for exponentially many rounds. By way of example, we will show in Section 4.1 that real contexts with linear reward functions are easy, and we will show in Section 4.3 that Boolean context vectors and conjunction reward functions are hard.\n\n1.2 Other Related Work\n\nMany papers study the problem of fairness in machine learning. One line of work studies algorithms for batch classification which achieve group fairness – otherwise known as equality of outcomes or statistical parity – or algorithms that avoid disparate impact (see e.g. [11, 23, 18, 15, 16], and [2] for a study of auditing existing algorithms for disparate impact). While statistical parity is sometimes a desirable or legally required goal, as observed by Dwork et al. [14] and others, it suffers from a number of drawbacks. 
First, if different populations indeed have different statistical properties, then it can be at odds with accurate classification. Second, even in cases when statistical parity is attainable with an optimal classifier, it does not prevent discrimination at an individual level. This led Dwork et al. [14] to encourage the study of individual fairness, which we focus on here.\n\nDwork et al. [14] also proposed and explored a technical definition of individual fairness formalizing the idea that "similar individuals should be treated similarly" by presupposing a task-specific quality metric on individuals and proposing that fair algorithms should satisfy a Lipschitz condition on this metric. Our definition of fairness is similar, in that the expected reward of each arm is a natural metric through which we define fairness. However, where Dwork et al. [14] presupposes the existence of a "fair" metric on individuals – thus encoding much of the relevant challenge, as studied by Zemel et al. [27] – our notion of fairness is entirely aligned with the goal of the algorithm designer and is satisfied by the optimal policy. Nevertheless, it affects the space of feasible learning algorithms, because it interferes with learning an optimal policy, which depends on the unknown reward functions.\nAt a technical level, our work is related to Amin et al. [4] and Abernethy et al. [1], which also relate KWIK learning to bandit learning, in a different context unrelated to fairness.\n\n2 Preliminaries\n\nWe study the contextual bandit setting, defined by a domain X, a set of "arms" [k] := {1, . . . , k}, and a class C of functions of the form f : X → [0, 1]. For each arm j there is some function f_j ∈ C, unknown to the learner. In rounds t = 1, . . . , T, an adversary reveals to the algorithm a context x^t_j for each arm². 
An algorithm A then chooses an arm i_t, and observes stochastic reward r^t_{i_t} for the arm it chose. We assume r^t_j ∼ D^t_j with E[r^t_j] = f_j(x^t_j), for some distribution D^t_j over [0, 1].\nLet Π be the set of policies mapping contexts to distributions over arms, X^k → Δ^k, and π* the optimal policy, which selects a distribution over arms as a function of contexts to maximize the expected reward of those arms. The pseudo-regret of an algorithm A on contexts x^1, . . . , x^T is defined as follows, where π^t represents A's distribution on arms at round t:\n\nRegret(x^1, . . . , x^T) = \sum_t E_{i*_t ∼ π*(x^t)}[f_{i*_t}(x^t_{i*_t})] − \sum_t E_{i_t ∼ π^t}[f_{i_t}(x^t_{i_t})],\n\nshorthanded as A's regret. The optimal policy π* pulls arms with highest expectation at each round, so:\n\nRegret(x^1, . . . , x^T) = \sum_t max_j f_j(x^t_j) − \sum_t E_{i_t ∼ π^t}[f_{i_t}(x^t_{i_t})].\n\nWe say that A satisfies regret bound R(T) if max_{x^1,...,x^T} Regret(x^1, . . . , x^T) ≤ R(T).\nLet the history h^t ∈ (X^k × [k] × [0, 1])^{t−1} be a record of the t − 1 rounds experienced by A: t − 1 3-tuples encoding the realization of the contexts, arm chosen, and reward observed. π^t_j|h^t denotes the probability that A chooses arm j after observing contexts x^t, given h^t. For simplicity, we will often drop the superscript t on the history when referring to the distribution over arms: π^t_j|h := π^t_j|h^t.\nWe now define what it means for a contextual bandit algorithm to be δ-fair with respect to its arms. Informally, this will mean that A will play arm i with higher probability than arm j in round t only if i has higher mean than j in round t, for all i, j ∈ [k], and in all rounds t.\nDefinition 1 (δ-fair). 
A is δ-fair if, for all sequences of contexts x^1, . . . , x^T and all payoff distributions D^t_1, . . . , D^t_k, with probability at least 1 − δ over the realization of the history h, for all rounds t ∈ [T] and all pairs of arms j, j′ ∈ [k],\n\nπ^t_j|h > π^t_{j′}|h only if f_j(x^t_j) > f_{j′}(x^t_{j′}).\n\nKWIK learning. Let B be an algorithm which takes as input a sequence of examples x^1, . . . , x^T, and when given some x^t ∈ X, outputs either a prediction ŷ^t ∈ [0, 1] or else outputs ŷ^t = ⊥, representing "I don't know". When ŷ^t = ⊥, B receives feedback y^t such that E[y^t] = f(x^t). B is an (ε, δ)-KWIK learning algorithm for C : X → [0, 1], with KWIK bound m(ε, δ), if for any sequence of examples x^1, x^2, . . . and any target f ∈ C, with probability at least 1 − δ, both:\n\n1. Its numerical predictions are accurate: for all t, ŷ^t ∈ {⊥} ∪ [f(x^t) − ε, f(x^t) + ε], and\n\n2. B rarely outputs "I Don't Know": \sum_{t=1}^∞ I[ŷ^t = ⊥] ≤ m(ε, δ).\n\n² Often, the contextual bandit problem is defined such that there is a single context x^t every day. Our model is equivalent – we could take x^t_j := x^t for each j.\n\n2.1 Specializing to Classic Stochastic Bandits\n\nIn Sections 3.1 and 3.2, we study the classic stochastic bandit problem, an important special case of the contextual bandit setting described above. Here we specialize our notation to this setting, in which there are no contexts. For each arm j ∈ [k], there is an unknown distribution D_j over [0, 1] with unknown mean µ_j. A learning algorithm A chooses an arm i_t in round t, and observes the reward r^t_{i_t} ∼ D_{i_t} for the arm that it chose. 
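To make the classic stochastic setting concrete, here is a minimal simulation sketch in Python; the Bernoulli arm means and the uniform-random baseline policy are illustrative assumptions, not the paper's algorithm:

```python
import random

def pull(mu):
    """Draw a stochastic Bernoulli reward for an arm with true mean mu."""
    return 1.0 if random.random() < mu else 0.0

def run_uniform_baseline(mus, T, seed=0):
    """Play uniformly at random for T rounds and return the pseudo-regret:
    T times the best mean, minus the expected reward of the arms chosen."""
    random.seed(seed)
    k, best, regret = len(mus), max(mus), 0.0
    for _ in range(T):
        i = random.randrange(k)
        pull(mus[i])              # reward is observed but unused by this policy
        regret += best - mus[i]   # expected per-round regret of this choice
    return regret

regret = run_uniform_baseline([0.9, 0.6, 0.5], T=1000)
```

Because the uniform policy never learns, its pseudo-regret grows linearly in T; the algorithms studied below aim to do much better while remaining fair.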
Let i* ∈ [k] be the arm with highest expected reward: i* ∈ arg max_{i∈[k]} µ_i. The pseudo-regret of an algorithm A on D_1, . . . , D_k is now just:\n\nRegret(T, D_1, . . . , D_k) = T · µ_{i*} − \sum_{0≤t≤T} E_{i_t ∼ π^t}[µ_{i_t}].\n\nLet h^t ∈ ([k] × [0, 1])^{t−1} denote a record of the t − 1 rounds experienced by the algorithm so far, represented by t − 1 2-tuples encoding the previous arms chosen and rewards observed. We write π^t_j|h^t to denote the probability that A chooses arm j given history h^t. Again, we will often drop the superscript t on the history when referring to the distribution over arms: π^t_j|h := π^t_j|h^t. δ-fairness in the classic bandit setting then specializes as follows:\nDefinition 2 (δ-fairness in the classic bandits setting). A is δ-fair if, for all distributions D_1, . . . , D_k, with probability at least 1 − δ over the history h, for all t ∈ [T] and all j, j′ ∈ [k]:\n\nπ^t_j|h > π^t_{j′}|h only if µ_j > µ_{j′}.\n\n3 Classic Bandits Setting\n\n3.1 Fair Classic Stochastic Bandits: An Algorithm\n\nIn this section, we describe a simple and intuitive modification of the standard UCB algorithm [5], called FAIRBANDITS, prove that it is fair, and analyze its regret bound. The algorithm and its analysis highlight a key idea that is important to the design of fair algorithms in this setting: that of chaining confidence intervals. Intuitively, as a δ-fair algorithm explores different arms, it must play two arms j_1 and j_2 with equal probability until it has sufficient data to deduce, with confidence 1 − δ, either that µ_{j_1} > µ_{j_2} or vice versa. FAIRBANDITS does this by maintaining empirical estimates of the means of both arms, together with confidence intervals around those means. 
To be safe, the algorithm must play the arms with equal probability while their confidence intervals overlap. The same reasoning applies simultaneously to every pair of arms. Thus, if the confidence intervals of each pair of arms j_i and j_{i+1} overlap for each i ∈ [k], the algorithm is forced to play all arms j with equal probability. This is the case even if the confidence intervals around arm j_k and arm j_1 are far from overlapping – i.e. when the algorithm can be confident that µ_{j_1} > µ_{j_k}.\nThis chaining approach initially seems overly conservative when ruling out arms, as reflected in its regret bound, which is only non-trivial after T ≫ k^3. In contrast, the UCB algorithm [5] achieves non-trivial regret after T = O(k) rounds. However, our lower bound in Section 3.2 shows that any fair algorithm must suffer constant per-round regret for T ≫ k^3 rounds on some instances.\nWe now give an overview of the behavior of FAIRBANDITS. At every round t, FAIRBANDITS identifies the arm i*_t = arg max_i u^t_i that has the largest upper confidence bound amongst the active arms. At each round t, we say i is linked to j if [ℓ^t_i, u^t_i] ∩ [ℓ^t_j, u^t_j] ≠ ∅, and i is chained to j if i and j are in the same component of the transitive closure of the linked relation. FAIRBANDITS plays uniformly at random among all active arms chained to arm i*_t.\nInitially, the active set contains all arms. The active set of arms at each subsequent round is defined to be the set of arms that are chained to the arm with highest upper confidence bound at the previous round. The algorithm can be confident that arms that have become unchained from the arm with the highest upper confidence bound at any round have means that are lower than the means of any chained arms, and hence such arms can be safely removed from the active set, never to be played again. 
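The linked and chained relations, and the resulting active set, can be computed directly; a minimal sketch in Python (the interval endpoints below are illustrative, not from the paper):

```python
def chained_to(intervals, start):
    """Return the set of arm indices chained to arm `start`: the connected
    component of `start` in the graph where arms a and b are linked when
    their confidence intervals [l_a, u_a] and [l_b, u_b] overlap
    (i.e. the transitive closure of the linked relation)."""
    def linked(a, b):
        (la, ua), (lb, ub) = intervals[a], intervals[b]
        return max(la, lb) <= min(ua, ub)  # nonempty intersection

    component, frontier = {start}, [start]
    while frontier:  # breadth-first search over the linked relation
        a = frontier.pop()
        for b in range(len(intervals)):
            if b not in component and linked(a, b):
                component.add(b)
                frontier.append(b)
    return component

# Arm 0 has the highest upper confidence bound; arm 2 is chained to it
# only through arm 1, while arm 3 is unchained.
intervals = [(0.80, 1.00), (0.70, 0.85), (0.50, 0.72), (0.10, 0.30)]
active = chained_to(intervals, start=0)   # {0, 1, 2}
```

Arms outside the returned component (arm 3 above) are exactly those that may be safely removed from the active set.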
This has the useful property that the active set of arms can only shrink: at any round t, S^t ⊆ S^{t−1}.\nWe first observe that with probability 1 − δ, all of the confidence intervals maintained by FAIRBANDITS(δ) contain the true means of their respective arms over all rounds. We prove this claim, along with all other claims in this paper, in the full technical version of this paper [17].\n\n1: procedure FAIRBANDITS(δ)\n2:   S^0 ← {1, . . . , k}   ▷ Initialize the active set\n3:   for i = 1, . . . , k do\n4:     µ̂^0_i ← 1/2, u^0_i ← 1, ℓ^0_i ← 0, n^0_i ← 0   ▷ Initialize each arm\n5:   for t = 1 to T do\n6:     i*_t ← arg max_{i ∈ S^{t−1}} u^t_i   ▷ Find arm with highest ucb\n7:     S^t ← {j | j chains to i*_t, j ∈ S^{t−1}}   ▷ Update active set\n8:     j* ← (x ∈_R S^t)   ▷ Select active arm at random\n9:     n^{t+1}_{j*} ← n^t_{j*} + 1\n10:    µ̂^{t+1}_{j*} ← (µ̂^t_{j*} · n^t_{j*} + r^t_{j*}) / n^{t+1}_{j*}   ▷ Pull arm j*, update its mean estimate\n11:    B ← √( ln((π · (t+1))^2 / 3δ) / (2 n^{t+1}_{j*}) )\n12:    [ℓ^{t+1}_{j*}, u^{t+1}_{j*}] ← [µ̂^{t+1}_{j*} − B, µ̂^{t+1}_{j*} + B]   ▷ Update interval for pulled arm\n13:    for j ∈ S^t, j ≠ j* do\n14:      µ̂^{t+1}_j ← µ̂^t_j, n^{t+1}_j ← n^t_j, u^{t+1}_j ← u^t_j, ℓ^{t+1}_j ← ℓ^t_j\n\nLemma 1. With probability at least 1 − δ, for every arm i and round t, ℓ^t_i ≤ µ_i ≤ u^t_i.\nThe fairness of FAIRBANDITS follows: with high probability the algorithm constructs good confidence intervals, so it can confidently choose between arms without violating fairness.\nTheorem 1. FAIRBANDITS(δ) is δ-fair.\n\nHaving proven that FAIRBANDITS is indeed fair, it remains to upper-bound its regret. We proceed by a series of lemmas, first upper bounding the probability that any arm active in round t has been pulled substantially fewer times than its expectation.\nLemma 2. With probability at least 1 − δ/(2t^2), n^t_i ≥ t/k − √((t/2) · ln(2k · t^2/δ)) for all i ∈ S^t (for all active arms in round t).\n\nWe now use this lower bound on the number of pulls of active arm i in round t to upper-bound η(t), an upper bound on the confidence interval width FAIRBANDITS uses for any active arm i in round t.\nLemma 3. Consider any round t and any arm i ∈ S^t. Condition on n^t_i ≥ t/k − √((t/2) · ln(2kt^2/δ)). Then,\n\nu^t_i − ℓ^t_i ≤ 2 √( ln((π · t)^2 / 3δ) / ( 2 ( t/k − √((t/2) · ln(2kt^2/δ)) ) ) ) = η(t).\n\nWe stitch together these lemmas as follows: Lemma 2 upper bounds the probability that any arm i active in round t has been pulled substantially fewer times than its expectation, and Lemma 3 upper bounds the width of any confidence interval used by FAIRBANDITS in round t by η(t). Together, these enable us to determine how both the number of arms in the active set, as well as the spread of their confidence intervals, evolve over time. This translates into the following regret bound.\nTheorem 2. If δ < 1/√T, then FAIRBANDITS has regret R(T) = O( √( k^3 T ln(Tk/δ) ) ).\n\nTwo points are worth highlighting in Theorem 2. First, this bound becomes non-trivial (i.e. the average per-round regret is ≪ 1) for T = Ω(k^3). 
As we show in the next section, it is not possible to improve on this. Second, the bound may appear to have suboptimal dependence on T when compared to unconstrained regret bounds (where the dependence on T is often described as logarithmic). However, it is known that Ω(√(kT)) regret is necessary even in the unrestricted setting (without fairness) if one does not make data-specific assumptions on an instance [9]. It would be possible to state a logarithmic dependence on T in our setting as well while making assumptions on the gaps between arms, but since our fairness constraint manifests itself as a cost that depends on k, we choose for clarity to avoid such assumptions; without them, our dependence on T is also optimal.\n\n3.2 Fair Classic Stochastic Bandits: A Lower Bound\n\nWe now show that the regret bound for FAIRBANDITS has an optimal dependence on k: no fair algorithm has diminishing regret before T = Ω(k^3) rounds. At a high level, we construct our lower bound example to embody the "worst of both worlds" for fair algorithms: the arm payoff means are just close enough together that the chain takes a long time to break, and the arm payoff means are just far enough apart that the algorithm incurs high regret while the chain remains unbroken. This lets us prove the formal statement below. The full proof, which proceeds via Bayesian reasoning using priors for the arm means, may be found in our technical companion paper [17].\nTheorem 3. 
There is a distribution P over k-arm instances of the stochastic multi-armed bandit problem such that any fair algorithm run on P experiences constant per-round regret for at least T = Ω(k^3 ln(1/δ)) rounds.\n\nThus, we tightly characterize the optimal regret attainable by fair algorithms in the classic bandits setting, and formally separate it from the regret attainable by algorithms absent a fairness constraint. Note that this already shows a separation between the best possible learning rates for contextual bandit learning with and without the fairness constraint – the classic multi-armed bandit problem is a special case of every contextual bandit problem, and for general contextual bandit problems, it is also known how to get non-trivial regret after only T = O(k) rounds [3, 8, 12].\n\n4 Contextual Bandits Setting\n\n4.1 KWIK Learnability Implies Fair Bandit Learnability\n\nIn this section, we show that if a class of functions is KWIK learnable, then there is a fair algorithm for learning the same class of functions in the contextual bandit setting, with a regret bound polynomially related to the function class' KWIK bound. Intuitively, KWIK-learnability of a class of functions guarantees we can learn the function's behavior to a high degree of accuracy with a high degree of confidence. As fairness constrains an algorithm most before the algorithm has determined the payoff functions' behavior accurately, this guarantee enables us to learn fairly without incurring much additional regret. Formally, we prove the following polynomial relationship.\nTheorem 4. 
For an instance of the contextual multi-armed bandit problem where f_j ∈ C for all j ∈ [k], if C is (ε, δ)-KWIK learnable with bound m(ε, δ), then KWIKTOFAIR(δ, T) is δ-fair and achieves regret bound:\n\nR(T) = O( max( k^2 · m(ε*, min(δ, 1/T) / (T^2 k)), k^3 ln(k/δ) ) ) for δ ≤ 1/√T,\n\nwhere ε* = arg min_ε ( max( ε · T, k · m(ε, min(δ, 1/T) / (kT^2)) ) ).\n\nFirst, we construct an algorithm KWIKTOFAIR(δ, T) that uses the KWIK learning algorithm as a subroutine, and prove that it is δ-fair. A call to KWIKTOFAIR(δ, T) will initialize a KWIK learner for each arm, and in each of the T rounds will implicitly construct a confidence interval around the prediction of each learner. If a learner makes a numeric-valued prediction, we will interpret this as a confidence interval centered at the prediction with width ε*. If a learner outputs ⊥, we interpret this as a trivial confidence interval (covering all of [0, 1]). We then use the same chaining technique used in the classic setting to choose an arm from the set of arms chained to the predicted top arm. Whenever all learners output predictions, they need no feedback. When a learner for j outputs ⊥, if j is selected then we have feedback r^t_j to give it; on the other hand, if j isn't selected, we "roll back" the learning algorithm for j to before this round by not updating the algorithm's state.\n\n1: procedure KWIKTOFAIR(δ, T)\n2:   δ* ← min(δ, 1/T), ε* ← arg min_ε(max(ε · T, k · m(ε, δ*/(kT^2))))\n3:   Initialize KWIK(ε*, δ*)-learner L_i, h_i ← [ ] ∀i ∈ [k]\n4:   for 1 ≤ t ≤ T do\n5:     S ← ∅   ▷ Initialize set of predictions S\n6:     for i = 1, . . . , k do\n7:       s^t_i ← L_i(x^t_i, h_i)   ▷ Store prediction s^t_i\n8:       S ← S ∪ s^t_i\n9:     if ⊥ ∈ S then\n10:      Pull j* ← (x ∈_R [k]), receive reward r^t_{j*}   ▷ Pick arm at random from all arms\n11:      h_{j*} ← h_{j*} :: (x^t_{j*}, r^t_{j*})   ▷ Update the history for L_{j*}\n12:    else\n13:      i*_t ← arg max_i s^t_i\n14:      S^t ← {j | (s^t_j − ε*, s^t_j + ε*) chains to (s^t_{i*_t} − ε*, s^t_{i*_t} + ε*)}\n15:      Pull j* ← (x ∈_R S^t), receive reward r^t_{j*}   ▷ Pick arm at random from active set S^t\n\nWe begin by bounding the probability of certain failures of KWIKTOFAIR in Lemma 4.\nLemma 4. With probability at least 1 − min(δ, 1/T), for all rounds t and all arms i, (a) if s^t_i ∈ R then |s^t_i − f_i(x^t_i)| ≤ ε*, and (b) \sum_t I[s^t_i = ⊥ and i is pulled] ≤ m(ε*, δ*).\n\nThis in turn lets us prove the fairness of KWIKTOFAIR in Theorem 5. Intuitively, the KWIK algorithm's confidence about predictions translates into confidence about expected rewards, which lets us choose between arms without violating fairness.\nTheorem 5. KWIKTOFAIR(δ, T) is δ-fair.\n\nWe now use the KWIK bounds of the KWIK learners to upper-bound the regret of KWIKTOFAIR(δ, T). We proceed by bounding the regret incurred in those rounds when all KWIK algorithms make a prediction (i.e., when we have a nontrivial confidence interval for each arm's expected reward) and then bounding the number of rounds for which some learner outputs ⊥ (i.e., when we choose randomly from all arms and thus incur constant regret). These results combine to produce Lemma 5.\nLemma 5. 
KWIKTOFAIR(δ, T) achieves regret O( max( k^2 · m(ε*, δ*), k^3 ln(Tk/δ) ) ).\n\nOur presentation of KWIKTOFAIR(δ, T) has a known time horizon T. Its guarantees extend to the case in which T is unknown via the standard "doubling trick" to prove Theorem 4.\nAn important instance of the contextual bandit problem is the linear case, where C consists of the set of all linear functions of bounded norm in d dimensions, i.e. when the rewards of each arm are governed by an underlying linear regression model on contexts. Known KWIK algorithms [26] for the set of linear functions C then allow us, via our reduction, to give a fair contextual bandit algorithm for this setting with a polynomial regret bound.\nLemma 6 ([26]). Let C = {f_θ | f_θ(x) = ⟨θ, x⟩, θ ∈ R^d, ||θ|| ≤ 1} and X = {x ∈ R^d : ||x|| ≤ 1}. C is KWIK learnable with KWIK bound m(ε, δ) = Õ(d^3/ε^4).\n\nThen, an application of Theorem 4 implies that KWIKTOFAIR has a polynomial regret guarantee for the class of linear functions.\nCorollary 1. Let C and X be as in Lemma 6, and f_j ∈ C for each j ∈ [k]. Then, KWIKTOFAIR(T, δ) using the learner from [26] has regret R(T) = Õ( max( T^{4/5} k^{6/5} d^{3/5}, k^3 ln(k/δ) ) ).\n\n4.2 Fair Bandit Learnability Implies KWIK Learnability\n\nIn this section, we show how to use a fair, no-regret contextual bandit algorithm to construct a KWIK learning algorithm whose KWIK bound has logarithmic dependence on the number of rounds T. Intuitively, any fair algorithm which achieves low regret must both be able to find and exploit an optimal arm (since the algorithm is no-regret) and can only exploit that arm once it has a tight understanding of the qualities of all arms (since the algorithm is fair). 
Thus, any fair no-regret algorithm will ultimately have tight (1 − δ)-confidence about each arm's reward function.

Theorem 6. Suppose A is a δ-fair algorithm for the contextual bandit problem over the class of functions C, with regret bound R(T, δ). Suppose also there exist f ∈ C and x(ℓ) ∈ X such that f(x(ℓ)) = ℓ · ε for every ℓ ∈ [⌈1/ε⌉]. Then FAIRTOKWIK is an (ε, δ)-KWIK algorithm for C with KWIK bound m(ε, δ), where m(ε, δ) is the solution to m(ε, δ) · ε/4 = R(m(ε, δ), εδ/(2T)).

Remark 1. The condition that C should contain a function that can take on values that are multiples of ε is for technical convenience; C can always be augmented by adding a single such function.

Our aim is to construct a KWIK algorithm B to predict labels for a sequence of examples labeled by some unknown function f* ∈ C. We provide a sketch of the algorithm, FAIRTOKWIK, below, and refer interested readers to our full technical paper [17] for a complete and formal description.

We use our fair algorithm to construct a KWIK algorithm as follows: we run our fair contextual bandit algorithm A on an instance that we construct online as examples x^t arrive for B. The idea is to simulate a two-arm instance, in which one arm's rewards are governed by f* (the function to be KWIK learned), and the other arm's rewards are governed by a function f that we can set to take any value in {0, ε, 2ε, . . . , 1}. For each input x^t, we perform a thought experiment and consider A's probability distribution over arms when facing a context which forces arm 2's payoff to take each of the values 0, ε*, 2ε*, . . . , 1.
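A stylized sketch of this probe and the resulting prediction rule follows, under the simplifying assumption (not made by the actual construction, which also manages A's simulated history) that we can read off A's exact probability `p1[l]` of playing arm 1 when arm 2's payoff is forced to `l * eps`; `None` again stands in for ⊥.

```python
def sandwich_predict(p1, eps):
    """p1[l] is the probability that the simulated fair algorithm A plays
    arm 1 when the constructed context forces arm 2's expected payoff to
    l * eps, for l = 0, 1, ..., len(p1) - 1. With two arms, a tie means
    A plays arm 1 with probability exactly 1/2."""
    if sum(1 for p in p1 if p == 0.5) >= 2:
        return None  # two or more exact ties: B abstains and outputs ⊥
    # Fairness forces p1 to cross 1/2 as l * eps passes the target value;
    # the first l favoring arm 2 sandwiches the target in [(l-1)*eps, l*eps].
    l = next((l for l, p in enumerate(p1) if p < 0.5), len(p1) - 1)
    return max(l - 1, 0) * eps
```

The key point mirrored in the code is that a fair algorithm's arm-1 probability must be monotonically squeezed through 1/2 as the forced payoff grid passes the true value, so the crossing point pins the prediction down to within one grid step.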
Since A is fair, A will play arm 1 with weakly higher probability than arm 2 for those ℓ with ℓε* ≤ f*(x^t); analogously, A will play arm 1 with weakly lower probability than arm 2 for those ℓ with ℓε* ≥ f*(x^t). If there are at least 2 values of ℓ for which arm 1 and arm 2 are played with equal probability, one of those contexts will force A to suffer ε* regret, so we continue the simulation of A on one of those instances selected at random, forcing at least ε*/2 regret in expectation, and at the same time have B return ⊥. B receives f*(x^t) on such a round, which is used to construct feedback for A. Otherwise, A must transition from playing arm 1 with strictly higher probability to playing arm 2 with strictly higher probability as ℓ increases: the point at which that transition occurs "sandwiches" the value of f*(x^t), since A's fairness implies the transition must occur where the expected payoff of arm 2 exceeds that of arm 1. B uses this value to output a numeric prediction.

An important fact we exploit is that we can query A's behavior on (x^t, x(ℓ)), for any x^t and any ℓ ∈ [⌈1/ε*⌉], without providing it feedback (and instead "roll back" its history to h^t, not including the query (x^t, x(ℓ))). We update A's history by providing it feedback only in rounds where B outputs ⊥. Finally, we note that, as in KWIKTOFAIR, the claims of FAIRTOKWIK extend to the infinite horizon case via the doubling trick.

4.3 An Exponential Separation Between Fair and Unfair Learning

In this section, we use this Fair-KWIK equivalence to give a simple contextual bandit problem for which fairness imposes an exponential cost in its regret bound, unlike the polynomial cost proven for the linear case in Section 4.1. In this problem, the context domain is the d-dimensional boolean
In this problem, the context domain is the d-dimensional boolean\nhypercube: X = {0, 1}d \u2013 i.e. the context each round for each individual consists of d boolean\nattributes. Our class of functions C is the class of boolean conjunctions: C = {f | f (x) =\nxi1 \u2227 xi2 \u2227 . . . \u2227 xik where 0 \u2264 k \u2264 d and i1, . . . , ik \u2208 [d]}.\nWe \ufb01rst note that there exists a simple but unfair algorithm for this problem which obtains regret\nR(T ) = O(k2d). A full description of this algorithm, called CONJUNCTIONBANDIT, may be\nfound in our technical companion paper [17]. We now show that, in contrast, fair algorithms cannot\nguarantee subexponential regret in d. This relies upon a known lower bound for KWIK learning\nconjunctions [21]:\nLemma 7. There exists a sequence of examples (x1, . . . , x2d\u22121) such that for \u0001, \u03b4 \u2264 1/2, every\n(\u0001, \u03b4)-KWIK learning algorithm B for the class C of conjunctions on d variables must output \u22a5 for\nxt for each t \u2208 [2d \u2212 1]. Thus, B has a KWIK bound of at least m(\u0001, \u03b4) = \u2126(2d).\n\nWe then use the equivalence between fair algorithms and KWIK learning to translate this lower bound\non m(\u0001, \u03b4) into a minimum worst case regret bound for fair algorithms on conjunctions. We modify\nTheorem 6 to yield the following lemma.\nLemma 8. Suppose A is a \u03b4-fair algorithm for the contextual bandit problem over the class C of\nconjunctions on d variables. If A has regret bound R(T, \u03b4) then for \u03b4(cid:48) = 2T \u03b4, FAIRTOKWIK is an\n(0, \u03b4(cid:48))-KWIK algorithm for C with KWIK bound m(0, \u03b4(cid:48)) = 4R(m(0, \u03b4(cid:48)), \u03b4).\n\nLemma 7 then lets us lower-bound the worst case regret of fair learning algorithms on conjunctions.\nCorollary 2. 
For \u03b4 < 1\n2T , any \u03b4-fair algorithm for the contextual bandit problem over the class C of\nconjunctions on d boolean variables has a worst case regret bound of R(T ) = \u2126(2d).\n\nTogether with the analysis of CONJUNCTIONBANDIT, this demonstrates a strong separation between\nfair and unfair contextual bandit algorithms: when the underlying functions mapping contexts to\npayoffs are conjunctions on d variables, there exist a sequence of contexts on which fair algorithms\nmust incur regret exponential in d while unfair algorithms can achieve regret linear in d.\n\n8\n\n\fReferences\n[1] Jacob D. Abernethy, Kareem Amin, , Moez Draief, and Michael Kearns. Large-scale bandit problems and\n\nkwik learning. In Proceedings of (ICML-13), pages 588\u2013596, 2013.\n\n[2] Philip Adler, Casey Falk, Sorelle A. Friedler, Gabriel Rybeck, Carlos Scheidegger, Brandon Smith, and\nSuresh Venkatasubramanian. Auditing black-box models by obscuring features. CoRR, abs/1602.07043,\n2016. URL http://arxiv.org/abs/1602.07043.\n\n[3] Alekh Agarwal, Daniel J. Hsu, Satyen Kale, John Langford, Lihong Li, and Robert E. Schapire. Taming\nthe monster: A fast and simple algorithm for contextual bandits. In Proceedings of ICML 2014, Beijing,\nChina, 21-26 June 2014, pages 1638\u20131646, 2014.\n\n[4] Kareem Amin, Michael Kearns, and Umar Syed. Graphical models for bandit problems. arXiv preprint\n\narXiv:1202.3782, 2012.\n\n[5] Peter Auer, Nicolo Cesa-Bianchi, and Paul Fischer. Finite-time analysis of the multiarmed bandit problem.\n\nMachine learning, 47(2-3):235\u2013256, 2002.\n\n[6] Solon Barocas and Andrew D. Selbst. Big data\u2019s disparate impact. California Law Review, 104, 2016.\n\nAvailable at SSRN: http://ssrn.com/abstract=2477899.\n\n[7] Anna Maria Barry-Jester, Ben Casselman, and Dana Goldstein. The new science of sentencing. The\nMarshall Project, August 8 2015. URL https://www.themarshallproject.org/2015/08/04/\nthe-new-science-of-sentencing. 
Retrieved 4/28/2016.

[8] Alina Beygelzimer, John Langford, Lihong Li, Lev Reyzin, and Robert E. Schapire. Contextual bandit algorithms with supervised learning guarantees. In Proceedings of AISTATS 2011, Fort Lauderdale, USA, April 11-13, 2011, pages 19–26, 2011.

[9] Sébastien Bubeck and Nicolò Cesa-Bianchi. Regret analysis of stochastic and nonstochastic multi-armed bandit problems. Foundations and Trends in Machine Learning, 5(1):1–122, 2012.

[10] Nanette Byrnes. Artificial intolerance. MIT Technology Review, March 28 2016.

[11] Toon Calders and Sicco Verwer. Three naive Bayes approaches for discrimination-free classification. Data Mining and Knowledge Discovery, 21(2):277–292, 2010.

[12] Wei Chu, Lihong Li, Lev Reyzin, and Robert E. Schapire. Contextual bandits with linear payoff functions. In Proceedings of AISTATS 2011, Fort Lauderdale, USA, April 11-13, 2011, pages 208–214, 2011.

[13] Cary Coglianese and David Lehr. Regulating by robot: Administrative decision-making in the machine-learning era. Georgetown Law Journal, 2016. Forthcoming.

[14] Cynthia Dwork, Moritz Hardt, Toniann Pitassi, Omer Reingold, and Richard Zemel. Fairness through awareness. In Proceedings of ITCS 2012, pages 214–226. ACM, 2012.

[15] Michael Feldman, Sorelle A. Friedler, John Moeller, Carlos Scheidegger, and Suresh Venkatasubramanian. Certifying and removing disparate impact. In Proceedings of ACM SIGKDD 2015, Sydney, NSW, Australia, August 10-13, 2015, pages 259–268, 2015.

[16] Benjamin Fish, Jeremy Kun, and Ádám D. Lelkes. A confidence-based approach for balancing fairness and accuracy. SIAM International Symposium on Data Mining, 2016.

[17] Matthew Joseph, Michael Kearns, Jamie Morgenstern, and Aaron Roth. Fairness in learning: Classic and contextual bandits. CoRR, abs/1605.07139, 2016. URL http://arxiv.org/abs/1605.07139.

[18] Toshihiro Kamishima, Shotaro Akaho, and Jun Sakuma.
Fairness-aware learning through regularization approach. In Data Mining Workshops (ICDMW), 2011 IEEE 11th International Conference on, pages 643–650. IEEE, 2011.

[19] Michael N. Katehakis and Herbert Robbins. Sequential choice from several populations. Proceedings of the National Academy of Sciences USA, 92:8584–8585, 1995.

[20] Tze Leung Lai and Herbert Robbins. Asymptotically efficient adaptive allocation rules. Advances in Applied Mathematics, 6(1):4–22, 1985.

[21] Lihong Li. A unifying framework for computational reinforcement learning theory. PhD thesis, Rutgers, The State University of New Jersey, 2009.

[22] Lihong Li, Michael L. Littman, Thomas J. Walsh, and Alexander L. Strehl. Knows what it knows: a framework for self-aware learning. Machine Learning, 82(3):399–443, 2011.

[23] Binh Thanh Luong, Salvatore Ruggieri, and Franco Turini. k-NN as an implementation of situation testing for discrimination discovery and prevention. In Proceedings of ACM SIGKDD 2011, pages 502–510. ACM, 2011.

[24] Claire Cain Miller. Can an algorithm hire better than a human? The New York Times, June 25 2015.

[25] Cynthia Rudin. Predictive policing using machine learning to detect patterns of crime. Wired Magazine, August 2013.

[26] Alexander L. Strehl and Michael L. Littman. Online linear regression and its application to model-based reinforcement learning. In Advances in Neural Information Processing Systems, pages 1417–1424, 2008.

[27] Rich Zemel, Yu Wu, Kevin Swersky, Toni Pitassi, and Cynthia Dwork. Learning fair representations.
In Proceedings of ICML 2013, pages 325–333, 2013.