{"title": "Differentially Private Contextual Linear Bandits", "book": "Advances in Neural Information Processing Systems", "page_first": 4296, "page_last": 4306, "abstract": "We study the contextual linear bandit problem, a version of the standard stochastic multi-armed bandit (MAB) problem where a learner sequentially selects actions to maximize a reward which depends also on a user provided per-round context. Though the context is chosen arbitrarily or adversarially, the reward is assumed to be a stochastic function of a feature vector that encodes the context and selected action. Our goal is to devise private learners for the contextual linear bandit problem.\n\nWe first show that using the standard definition of differential privacy results in linear regret. So instead, we adopt the notion of joint differential privacy, where we assume that the action chosen on day t is only revealed to user t and thus needn't be kept private that day, only on following days. We give a general scheme converting the classic linear-UCB algorithm into a joint differentially private algorithm using the tree-based algorithm. We then apply either Gaussian noise or Wishart noise to achieve joint-differentially private algorithms and bound the resulting algorithms' regrets. 
In addition, we give the first lower bound on the additional regret any private algorithm for the MAB problem must incur.", "full_text": "Differentially Private Contextual Linear Bandits\n\nRoshan Shariff\nDepartment of Computing Science\nUniversity of Alberta\nEdmonton, Alberta, Canada\nroshan.shariff@ualberta.ca\n\nOr Sheffet\nDepartment of Computing Science\nUniversity of Alberta\nEdmonton, Alberta, Canada\nosheffet@ualberta.ca\n\nAbstract\n\nWe study the contextual linear bandit problem, a version of the standard stochastic multi-armed bandit (MAB) problem in which a learner sequentially selects actions to maximize a reward that also depends on a user-provided per-round context. Though the context is chosen arbitrarily or adversarially, the reward is assumed to be a stochastic function of a feature vector that encodes the context and the selected action. Our goal is to devise private learners for the contextual linear bandit problem. We first show that using the standard definition of differential privacy results in linear regret. So instead, we adopt the notion of joint differential privacy, where we assume that the action chosen on day t is only revealed to user t and thus needn't be kept private that day, only on following days. We give a general scheme converting the classic linear-UCB algorithm into a jointly differentially private algorithm using the tree-based algorithm [10, 18]. We then apply either Gaussian noise or Wishart noise to obtain jointly differentially private algorithms and bound the resulting algorithms' regrets. In addition, we give the first lower bound on the additional regret any private algorithm for the MAB problem must incur.\n\n1 Introduction\n\nThe well-known stochastic multi-armed bandit (MAB) is a sequential decision-making task in which a learner repeatedly chooses an action (or arm) and receives a noisy reward. 
The objective is to maximize cumulative reward by exploring the actions to discover the optimal ones (those with the best expected reward), balanced with exploiting them. The contextual bandit problem is an extension of the MAB problem in which the learner also receives a context in each round, and the expected reward depends on both the context and the selected action.\n\nAs a motivating example, consider online shopping: the user provides a context (composed of query words, past purchases, etc.), and the website responds with a suggested product and receives a reward if the user buys it. Ignoring the context and modeling the problem as a standard MAB (with an action for each possible product) suffers from the drawback of ignoring the variety of users' preferences, whereas separately learning each user's preferences doesn't allow us to generalize between users. Therefore it is common to model the task as a contextual linear bandit problem: based on the user-given context, each action is mapped to a feature vector; the expected reward is then assumed to depend on the same unknown linear function of the feature vector across all users.\n\nThe above example motivates the need for privacy in the contextual bandit setting: users' past purchases and search queries are sensitive personal information, yet they strongly predict future purchases. In this work, we give upper and lower bounds for the problem of (joint) differentially private contextual linear bandits. Differential privacy is the de facto gold standard of privacy-preserving data analysis in both academia and industry, requiring that an algorithm's output have very limited dependency on any single user interaction (one context and reward). 
However, as we later illustrate, adhering to the standard notion of differential privacy (under event-level continual observation) in the contextual bandit setting requires us to essentially ignore the context and thus incur linear regret.\n\n32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montréal, Canada.\n\nWe therefore adopt the more relaxed notion of joint differential privacy [23] which, intuitively, allows us to present the t-th user with products corresponding to her preferences, while guaranteeing that all interactions with all users at times t' > t have very limited dependence on user t's preferences. The guarantee of differential privacy under continual observation assures us that even if all later users collude in an effort to learn user t's context or preference, they still have very limited advantage over a random guess.\n\n1.1 Problem Formulation\n\nStochastic Contextual Linear Bandits. In the classic MAB, in every round t a learner selects an action at from a fixed set A and receives a reward yt. In the (stationary) stochastic MAB, the reward is noisy with a fixed but unknown expectation E[yt | at] that depends only on the selected action. In the stochastic contextual bandit problem, before each round the learner also receives a context ct ∈ C; the expected reward E[yt | ct, at] depends on both ct and at. It is common to assume that the context affects the reward in a linear way: map every context-action pair to a feature vector φ(c, a) ∈ Rd (where φ is an arbitrary but known function) and assume that E[yt | ct, at] = ⟨θ∗, φ(ct, at)⟩. The vector θ∗ ∈ Rd is the key unknown parameter of the environment, which the learner must discover to maximize reward. 
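To make the reward model concrete, here is a minimal simulation sketch of one round of this setting. The feature map `phi`, the parameter `theta_star`, and the context/action names are all illustrative stand-ins (the paper treats φ as an arbitrary but known function and θ∗ as unknown), not constructions from the paper:

```python
import zlib

import numpy as np

rng = np.random.default_rng(0)

d = 4  # dimensionality of the feature space
theta_star = rng.normal(size=d)  # the unknown environment parameter
theta_star /= np.linalg.norm(theta_star)


def phi(context: str, action: str) -> np.ndarray:
    """Illustrative feature map phi(c, a) in R^d: a fixed pseudo-random
    unit vector per (context, action) pair."""
    seed = zlib.crc32(f"{context}|{action}".encode())
    v = np.random.default_rng(seed).normal(size=d)
    return v / np.linalg.norm(v)


def reward(context: str, action: str, noise_std: float = 0.1) -> float:
    """Stochastic reward y_t = <theta_star, phi(c_t, a_t)> + eta_t."""
    return float(theta_star @ phi(context, action)) + rng.normal(scale=noise_std)


# One round: the context induces the decision set D_t = {phi(c_t, a) : a in A},
# and choosing x_t in D_t is the same as choosing the action a_t.
actions = ["a1", "a2", "a3"]
context = "user-7"
decision_set = {a: phi(context, a) for a in actions}
expected = {a: float(theta_star @ x) for a, x in decision_set.items()}
best_action = max(expected, key=expected.get)  # what an omniscient learner picks
per_round_regret = expected[best_action] - expected["a1"]  # if the learner picked a1
```

The quantity `per_round_regret` here is the per-round term that the paper's cumulative pseudo-regret sums over all n rounds.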
Alternatively, we say that on every round the learner is given a decision set Dt := {φ(ct, a) | a ∈ A} of all the pre-computed feature vectors: choosing xt ∈ Dt effectively determines the action at ∈ A. Thus, the contextual stochastic linear bandit framework consists of repeated rounds in which the learner: (i) receives a decision set Dt ⊂ Rd; (ii) chooses an action xt ∈ Dt; and (iii) receives a stochastic reward yt = ⟨θ∗, xt⟩ + ηt. When all the Dt are identical and consist of the standard basis vectors, the problem reduces to the standard MAB.\n\nThe learner's objective is to maximize cumulative reward, which is equivalent to minimizing regret: the extra reward a learner would have received by always choosing the best available action. In other words, the regret characterizes the cost of having to learn the optimal action over just knowing it beforehand. For stochastic problems, we are usually interested in a related quantity called pseudo-regret, which is the extra expected reward that the learner could have earned if it had known θ∗ in advance. In our setting, the cumulative pseudo-regret after n rounds is R̂n := Σ_{t=1}^{n} max_{x ∈ Dt} ⟨θ∗, x − xt⟩.¹\n\nJoint Differential Privacy. As discussed above, the context and reward may be considered private information about the users, which we wish to keep private from all other users. We thus introduce the notion of jointly differentially private learners under continual observation, a combination of two definitions [given in 23, 18]. First, we say two sequences S = ⟨(D1, y1), (D2, y2), . . . , (Dn, yn)⟩ and S' = ⟨(D'1, y'1), . . . , (D'n, y'n)⟩ are t-neighbors if for all t' ≠ t it holds that (Dt', yt') = (D't', y't').\n\nDefinition 1. A randomized algorithm A for the contextual bandit problem is (ε, δ)-jointly differentially private (JDP) under continual observation if for any t, any pair of t-neighboring sequences S and S', and any subset S>t ⊂ Dt+1 × Dt+2 × · · · × Dn of sequences of actions ranging from day t + 1 to the end of the sequence, it holds that P(A(S) ∈ S>t) ≤ e^ε · P(A(S') ∈ S>t) + δ.\n\nThe standard notion of differential privacy under continual observation would require that changing the context ct cannot have much effect on the probability of choosing action at, even for round t itself (not just for future rounds as with JDP). In our problem formulation, however, changing ct to c't may change the decision set Dt to a possibly disjoint D't, making that notion ill-defined. Therefore, when we discuss the impossibility of regret-minimization under standard differential privacy in Section 5, we revert back to a fixed action set A with an explicit per-round context ct.\n\n1.2 Our Contributions and Paper Organization\n\nIn this work, in addition to formulating the definition of JDP under continual observation, we also present a framework for implementing JDP algorithms for the contextual linear bandit problem. Not surprisingly, our framework combines a tree-based privacy algorithm [10, 18] with a linear upper confidence bound (LinUCB) algorithm [13]. For modularity, in Section 3 we analyze a family of linear UCB algorithms that use different regularizers in every round, under the premise that the\n\n¹The pseudo-regret ignores the stochasticity of the reward but not the resulting randomness in the learner's choice of actions. 
It equals the regret in expectation, but is more amenable to high-probability bounds such as ours. In particular, in some cases we can achieve polylog(n) bounds on pseudo-regret because, unlike regret, it doesn't have added noise of variance Ω(n).\n\nregularizers are PSD with bounded singular values. Moreover, we repeat our analysis twice: first we obtain a general Õ(√n) upper bound on regret; then, for problem instances that maintain a Δ reward gap separating the optimal and sub-optimal actions, we obtain a polylog(n)/Δ regret upper bound. Our leading application of course is privacy, though one could postulate other reasons why such changing regularizers would be useful (e.g., if parameter estimates turn out to be wrong and have to be updated). We then plug two particular regularizers into our scheme: the first is a privacy-preserving mechanism that uses additive Wishart noise [30] (which is always PSD); the second uses additive Gaussian noise [19] (shifted to make it PSD w.h.p. over all rounds). The main term in the two regret bounds obtained by both algorithms is Õ(√n · d^{3/4}/√ε) (the bound itself depends on numerous parameters; a notation list appears in Section 2). Details of both techniques appear in Section 4. Experiments with a few variants of our algorithms are detailed in Section D of the supplementary material. In Section 5 we also give a lower bound for the ε-differentially private MAB problem. Whereas all previous work on the private MAB problem uses standard (non-private) bounds, we show that any private algorithm must incur an additional regret of Ω(k log(n)/ε). While the result resembles the lower bound in the adversarial setting, the proof technique cannot rely on standard packing arguments [e.g. 
20] since the input for the problem is stochastic rather than adversarial. Instead, we rely on a recent coupling argument [22] to prove that any private algorithm must substantially explore suboptimal arms. By showing that the contextual bandit problem does not become much harder with the addition of privacy, we open the possibility of machine learning systems that are useful to their users without significantly compromising their privacy.\n\nFuture Directions. The linear UCB algorithm we adapt in this work is a canonical approach to the linear bandit problem, using the principle of "optimism in the face of uncertainty." However, recent work [24] shows that all such "optimistic" algorithms are sub-optimal, and instead proposes adapting to the decision set in a particular way by solving an intricate optimization problem. It remains an open question to devise a private version of this algorithm which interpolates between UCB and fine-tuning to the specific action set.\n\n1.3 Related Work\n\nMAB and the Contextual Bandit Problem. The MAB problem dates back to the seminal work of Robbins [28], with the UCB approach developed in a series of works [8, 4] culminating in [6]. Stochastic linear bandits were formally first introduced in [3], and [5] was the first paper to consider UCB-style algorithms for them. An algorithm based on a confidence ellipsoid is described by [13], with a variant based on ridge regression given in [12], an explore-then-commit variant in [29], and a variant for the sparse setting in [2]. Abbasi-Yadkori et al. [1] give an instance-dependent bound for linear bandits, which we convert to the contextual setting.\n\nDifferential Privacy. Differential privacy, first introduced by Dwork et al. [17, 16], is a rigorous mathematical notion of privacy that requires the probability of any observable output to change very little when any single datum changes. 
(We omit the formal definition, having already defined JDP.) Among its many elegant traits is the notion of group privacy: should k datums change, the change in the probability of any event is still limited by (roughly) k times the change incurred when a single datum changes. Differential privacy also composes: the combination of k (ε, δ)-differentially private algorithms is (O(kε² + 2ε√(k log(1/δ'))), kδ + δ')-differentially private for any δ' > 0 [14].\n\nThe notion of differential privacy under continual observation was first defined by Dwork et al. [18] using the tree-based algorithm [originally appearing in 10]. This algorithm maintains a binary tree whose n leaves correspond to the n entries in the input sequence. Each node in the tree maintains a noisy (privacy-preserving) sum of the input entries in its subtree; the cumulative sums of the inputs can thus be obtained by combining at most log(n) noisy sums. This algorithm is the key ingredient of a variety of works that deal with privacy in an online setting, including counts [18], online convex optimization [21], and regret minimization in both the adversarial [31, 34] and stochastic [26, 33] settings. We comment that Mishra and Thakurta [26] proposed an algorithm similar to our own for the contextual bandit setting, however (i) without maintaining the PSD property, (ii) without any analysis, only empirical evidence, and (iii) without presenting lower bounds. A partial utility analysis of this algorithm, in the reward-privacy model (where the context's privacy is not guaranteed), appears in the recent work of Neel and Roth [27]. Further details about achieving differential privacy via additive noise and the tree-based algorithm appear in Section A of the supplementary material. 
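The tree-based algorithm can be sketched for scalar prefix sums as follows. The Laplace noise and the simple per-level privacy split are illustrative simplifications for exposition; the paper's instantiations add matrix-valued Wishart or Gaussian noise at each node instead:

```python
import math
import random


def dyadic_intervals(t):
    """Decompose the prefix [1, t] into O(log t) dyadic intervals."""
    out, hi = [], t
    while hi > 0:
        length = hi & (-hi)  # largest power of two dividing hi
        out.append((hi - length + 1, hi))
        hi -= length
    return out


class TreeMechanism:
    """Sketch of the tree-based algorithm for private prefix sums: each node
    of an implicit binary tree over the n rounds holds a noisy partial sum,
    and any prefix sum combines at most ceil(log2 n) + 1 noisy nodes."""

    def __init__(self, n, eps, seed=0):
        self.levels = int(math.log2(max(n, 2))) + 1
        self.eps_per_level = eps / self.levels  # each datum lies in <= levels nodes
        self.exact = {}  # node -> exact partial sum (internal state only)
        self.noise = {}  # node -> its single Laplace noise draw
        self.rng = random.Random(seed)

    def _laplace(self):
        # inverse-CDF sampling of Laplace(1 / eps_per_level); sensitivity 1 per node
        u = self.rng.random() - 0.5
        b = 1.0 / self.eps_per_level
        return -b * math.copysign(math.log(1.0 - 2.0 * abs(u)), u)

    def add(self, t, value):
        """Insert round t's value (1-based), updating every node containing t."""
        for level in range(self.levels):
            node = (level, (t - 1) >> level)
            self.exact[node] = self.exact.get(node, 0.0) + value
            if node not in self.noise:  # noise each node once, not per update
                self.noise[node] = self._laplace()

    def prefix_sum(self, t):
        """Noisy sum of rounds 1..t, combining <= self.levels noisy nodes."""
        total = 0.0
        for lo, hi in dyadic_intervals(t):
            level = (hi - lo + 1).bit_length() - 1
            node = (level, (lo - 1) >> level)
            total += self.exact.get(node, 0.0) + self.noise.get(node, 0.0)
        return total
```

In the paper's use of this mechanism, the per-round datum is the rank-one update of the Gram matrix (and the corresponding reward-weighted feature vector), so each node holds noisy matrix and vector partial sums, yielding the regularizers Ht and ht.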
The related problem of private linear regression has also been extensively studied in the offline setting [11, 7].\n\n2 Preliminaries and Notation\n\nWe use bold letters to denote vectors and bold CAPITALS for matrices. Given a d-column matrix M, its Gram matrix is the (d × d)-matrix M⊤M. A symmetric matrix M is positive-semidefinite (PSD, denoted M ⪰ 0) if x⊤Mx ≥ 0 for any vector x. Any such M defines a norm on vectors, so we define ‖x‖²_M := x⊤Mx. We use M ⪰ N to mean M − N ⪰ 0. The Gaussian distribution N(µ, σ²) is defined by the density function (2πσ²)^{−1/2} exp(−(x − µ)²/2σ²). The squared L2-norm of a d-dimensional vector whose coordinates are drawn i.i.d. from N(0, 1) is given by the χ²(d) distribution, which is tightly concentrated around d. Given two distributions P and Q we denote their total variation distance by dTV(P, Q) = max_{event E} |P(E) − Q(E)|.\n\nNotation. Our bounds depend on many parameters of the problem, specified below. Additional parameters (bounds) are specified in the assumptions stated below.\n\nn — horizon, i.e. the number of rounds\ns, t — indices of rounds\nd — dimensionality of the action space\nDt ⊂ Rd — decision set at round t\nxt ∈ Dt — action at round t\nyt ∈ R — reward at round t\nθ∗ ∈ Rd — unknown parameter vector\nm := ⌈log2(n) + 1⌉\n\nProposition 9. Fix any α > 0. 
If for each t the Ht and ht are generated by the tree-based algorithm with Wishart noise W_{d+1}(L̃²I, k), then the following are (α/2n)-accurate bounds:\n\nρmin = L̃²(√(mk) − √d − √(2 ln(8n/α)))²,\nρmax = L̃²(√(mk) + √d + √(2 ln(8n/α)))²,\nγ = L̃ √(mk) (√d + √(2 ln(2n/α))).\n\nMoreover, if we use the shifted regularizer H't := Ht − cI with c as given in Eq. (4), then the following are (α/2n)-accurate bounds:\n\nρ'min = 4 L̃² √(mk) (√d + √(2 ln(8n/α))),\nρ'max = 8 L̃² √(mk) (√d + √(2 ln(8n/α))),\nγ' = L̃ √(√(mk) + √d + √(2 ln(8n/α))) (√d + √(2 ln(2n/α))).\n\nPlugging these into Theorems 5 and 6 gives us the following upper bounds on pseudo-regret.\n\nCorollary 10. Applying Algorithm 1 with Ht and ht generated by the tree-based mechanism, with each node adding noise independently from W_{d+1}((L² + B̃²)I, k) and then subtracting cI using Eq. (4), we get a pseudo-regret bound of\n\nO( B√n [ σ(log(1/α) + d log(n)) + S L̃ √d log(n)^{3/4} (d^{1/4} + ε^{−1/2} log(1/δ)^{1/4})(d^{1/4} + log(n/α)^{1/4}) ] )\n\nin general, and a gap-dependent pseudo-regret bound of\n\nO( (B/Δ) [ σ(log(1/α) + d log(n)) + S L̃ √d log(n)^{3/4} (d^{1/4} + ε^{−1/2} log(1/δ)^{1/4})(d^{1/4} + log(n/α)^{1/4}) ]² ).\n\n4.2 Differential Privacy via Additive Gaussian Noise\n\nOur second alternative is to instantiate the tree-based algorithm with symmetric Gaussian noise: sample Z' ∈ R^{(d+1)×(d+1)} with each Z'_{i,j} ∼ N(0, σ²noise) i.i.d. and symmetrize to get Z = (Z' + Z'⊤)/√2.⁴ Recall that each datum has a bounded L2-norm of L̃, hence a change to a single datum may alter the Frobenius norm of Mt by L̃². It follows that in order to make sure each node in the tree-based algorithm preserves (ε/√(8m ln(2/δ)), δ/2)-differential privacy,⁵ the variance in each coordinate must be σ²noise = 16m L̃⁴ ln(4/δ)²/ε². When all entries of Z are sampled from N(0, 1), known concentration results [32] on the top singular value of Z give that P[‖Z‖ > 4√(d + 1) + 2 ln(2n/α)] < α/2n. Note however that in each day t the noise Nt is the sum of ≤ m such matrices, thus the variance of each coordinate is mσ²noise. The top-left (d × d)-submatrix of Nt has operator norm of at most\n\nΥ := σnoise √(2m) (4√d + 2 ln(2n/α)) = √32 · m L̃² ln(4/δ) (4√d + 2 ln(2n/α))/ε.\n\nHowever, it is important to note that the result of adding Gaussian noise may not preserve the PSD property of the noisy Gram matrix. To that end, we ought to shift Nt by some cI in order to make sure that we maintain strictly positive eigenvalues throughout the execution of the algorithm. Since the bounds in Theorems 5 and 6 mainly depend on √ρmax + γ, we choose the shift-magnitude to be 2ΥI. This makes ρmax = 3Υ and ρmin = Υ, and as a result ‖ht‖_{Ht⁻¹} ≤ √(Υ⁻¹) ‖ht‖, which we can bound using standard concentration bounds on the χ²-distribution (see Claim 17). This culminates in the following bounds.\n\nProposition 11. Fix any α > 0. 
Given that for each t the regularizers Ht and ht are taken by applying the tree-based algorithm with symmetrized shifted Gaussian noise whose entries are sampled i.i.d. from N(0, σ²noise), then the following ρmin, ρmax, and γ are (α/2n)-accurate bounds:\n\nρmin = Υ,  ρmax = 3Υ,  γ = σnoise √(Υ⁻¹m) (√d + √(2 ln(2n/α))) ≤ √( m L̃² (√d + 2 ln(2n/α)) / (√2 ε) ).\n\n⁴This increases the variance along the diagonal entries beyond the noise magnitude required to preserve privacy, but only by a constant factor of 2.\n⁵We use here the slightly better bounds for the composition of Gaussian noise based on zero-Concentrated DP [9].\n\nNote how this choice of shift indeed makes both ρmax and γ² roughly on the order of O(Υ). The end result is that for each day t, ht is given by summing at most m d-dimensional vectors whose entries are sampled i.i.d. from N(0, σ²noise); the symmetrization doesn't change the distribution of each coordinate. The matrix Ht is given by: (i) summing at most m matrices whose entries are sampled i.i.d. from N(0, σ²noise); (ii) symmetrizing the result as shown above; and (iii) adding 2ΥI. This leads to a bound on the regret of Algorithm 1 with the tree-based algorithm using Gaussian noise.\n\nCorollary 12. 
Applying Algorithm 1 where the regularizers Ht and ht are derived by applying the tree-based algorithm, where each node holds a symmetrized matrix whose entries are sampled i.i.d. from N(0, σ²noise), and adding 2ΥI, we get a regret bound of\n\nO( B√n ( σ(d log(n) + log(1/α)) + S L̃ log(n) √( d(√d + ln(n/α)) ln(1/δ)/ε ) ) )\n\nin general, and a gap-dependent pseudo-regret bound of\n\nO( (B/Δ) ( σ(d log(n) + log(1/α)) + S L̃ log(n) √( d(√d + ln(n/α)) ln(1/δ)/ε ) )² ).\n\n5 Lower Bounds\n\nIn this section, we present lower bounds for two versions of the problem we deal with in this work. The first, and probably the more obvious of the two, deals with the impossibility of obtaining sub-linear regret for the contextual bandit problem under the standard notion of differential privacy (under continual observation). Here, we assume user t provides a context ct which actually determines the mapping of the arms into feature vectors: φ(ct, a) ∈ Rd. The sequence of choices thus made by the learner is a1, . . . , an ∈ Aⁿ, which we aim to keep private. The next claim, whose proof is deferred to Section C in the supplementary material, shows that this is impossible without effectively losing any reasonable notion of utility.\n\nClaim 13. For any ε < ln(2) and δ < 0.25, any (ε, δ)-differentially private algorithm A for the contextual bandit problem must incur pseudo-regret of Ω(n).\n\nThe second lower bound we show is more challenging. We show that any ε-differentially private algorithm for the classic MAB problem must incur an additional pseudo-regret of Ω(k log(n)/ε) on top of the standard (non-private) regret bounds. 
We consider an instance of the MAB where the leading arm is a1, the rewards are drawn from a distribution over {−1, 1}, and the gap between the means of arm a1 and arm a ≠ a1 is Δa. A simple calculation shows that for such distributions, the total-variation distance between two distributions whose means are µ and µ − Δ is Δ/2. Fix Δ2, Δ3, . . . , Δk as some small constants; we now argue the following.\n\nClaim 14. Let A be any ε-differentially private algorithm for the MAB problem with k arms whose expected regret is at most n^{3/4}. Fix any arm a ≠ a1, and let Δa be the gap between it and the optimal arm a1. Then, for sufficiently large n, A pulls arm a at least log(n)/(100εΔa) many times with probability ≥ 1/2.\n\nWe comment that the bound n^{3/4} was chosen arbitrarily; we only require a regret upper bound of n^{1−c} for some c > 0. Of course, we could have used standard assumptions, where the regret is asymptotically smaller than any polynomial; or discuss algorithms of regret Õ(√n) (the best minimax regret). Aiming to separate the standard lower bounds on regret from the private bounds, we decided to use n^{3/4}. As an immediate corollary we obtain the following private regret bound:\n\nCorollary 15. The expected pseudo-regret of any ε-differentially private algorithm for the MAB is Ω(k log(n)/ε). Combined with the non-private bound of Ω(Σ_{a≠a1} log(n)/Δa), we get that the private regret bound is the max of the two terms, i.e., Ω(k log(n)/ε + Σ_{a≠a1} log(n)/Δa).\n\nProof. Based on Claim 14, the expected pseudo-regret is at least Σ_{a≠a1} Δa · log(n)/(200εΔa) = (k − 1) log(n)/(200ε). □\n\nProof of Claim 14. Fix arm a. 
Let P̄ be the vector of the k probability distributions associated with the k arms. Denote by E the event that arm a is pulled fewer than log(n)/(100εΔa) =: ta many times. Our goal is to show that P_{A; rewards∼P̄}[E] < 1/2.\n\nTo that end, we postulate a different distribution for the rewards of arm a: a new distribution whose mean is greater by Δa than the mean reward of arm a1. The total-variation distance between the given distribution and the postulated distribution is Δa. Denote by Q̄ the vector of distributions of arm-rewards (where only Pa ≠ Qa). We now argue that should the rewards be drawn from Q̄, then the event E is very unlikely: P_{A; rewards∼Q̄}[E] ≤ 2n^{−1/4}/Δa. Indeed, the argument is based on a standard Markov-like argument: the expected pseudo-regret of A is at most n^{3/4}, yet it is at least P_{A; rewards∼Q̄}[E] · (n − ta)Δa ≥ (nΔa/2) · P_{A; rewards∼Q̄}[E], for sufficiently large n.\n\nWe now apply a beautiful result of Karwa and Vadhan [22, Lemma 6.1], stating that the "effective" group privacy between the case where the n datums of the input are drawn i.i.d. from either distribution P or from distribution Q is proportional to εn · dTV(P, Q). In our case, the key point is that we only consider this change under the event E; thus the number of samples we need to redraw from the distribution Pa rather than Qa is strictly smaller than ta, and the elegant coupling argument of [22] reduces it to 6Δa · ta. To better illustrate the argument, consider the coupling argument of [22] as an oracle O. The oracle generates a collection of precisely ta pairs of points, where the left ones are i.i.d. samples from Pa and the right ones are i.i.d. samples from Qa, and, in expectation, in a (1 − Δa) fraction of the pairs the right- and the left-samples are identical. Whenever the learner A pulls arm a
Whenever the learner A pulls arm a\nit makes an oracle call to O, and depending on the environment (whether the distribution of rewards\nis \u00afP or \u00afQ) O provides either a fresh left-sample or a right-sample. Moreover, suppose there exists a\ncounter C that stands between A and O, and in case O runs out of examples then C routes A\u2019s oracle\ncalls to a di\ufb00erent oracle. Now, Karwa and Vadhan [22, Lemma 6.1] assures that the probability\nof the event \u201cC never re-routes the requests\u201d happens with similar probability under P or under Q,\ndi\ufb00erent only up to a multiplicative factor of exp(\u0001\u2206ata). And seeing as the event \u201cC never re-routes\nthe requests\u201d is quite unlikely when O only provides right-samples (from \u00afQ), it is also fairly unlikely\nwhen O only provides left-samples (from \u00afP).\nA; rewards\u223c \u00afP[E] \u2264\nFormally, we conclude the proof by applying the result of [22] to infer that P\nexp(6\u03b5\u2206ata)P\n\u2264 1/2 for su\ufb03ciently large ns,\nproving the required claim.\n(cid:3)\n\nA; rewards\u223c \u00afQ[E] \u2264 exp(0.06 log(n)) \u00b7 2\n\nn\u22121/4 = n\u22120.19 2\n\n\u2206a\n\n\u2206a\n\nAcknowledgements\nWe gratefully acknowledge the Natural Sciences and Engineering Research Council of Canada\n(NSERC) for supporting R.S. with the Alexander Graham Bell Canada Graduate Scholarship and O.S.\nwith grant #2017\u201306701. R.S. was also supported by Alberta Innovates and O.S. is also an unpaid\ncollaborator on NSF grant #1565387.\n\nReferences\n[1] Yasin Abbasi-Yadkori, D\u00e1vid P\u00e1l, and Csaba Szepesv\u00e1ri. Improved algorithms for linear stochastic bandits.\nIn J. Shawe-Taylor, R. S. Zemel, P. L. Bartlett, F. Pereira, and K. Q. Weinberger, editors, Advances in\nNeural Information Processing Systems, volume 24, pages 2312\u20132320. Curran Associates, Inc., 2011.\n\n[2] Yasin Abbasi-Yadkori, D\u00e1vid P\u00e1l, and Csaba Szepesv\u00e1ri. 
Online-to-confidence-set conversions and application to sparse stochastic bandits. In AISTATS, pages 1-9, 2012.\n\n[3] Naoki Abe, Alan W. Biermann, and Philip M. Long. Reinforcement learning with immediate rewards and linear hypotheses. Algorithmica, 37(4):263-293, 2003.\n\n[4] Rajeev Agrawal. Sample mean based index policies with O(log n) regret for the multi-armed bandit problem, volume 27, pages 1054-1078. Applied Probability Trust, 1995.\n\n[5] Peter Auer. Using confidence bounds for exploitation-exploration trade-offs. JMLR, 3:397-422, 2003.\n\n[6] Peter Auer, Nicolò Cesa-Bianchi, and Paul Fischer. Finite-time analysis of the multiarmed bandit problem. Machine Learning, 47(2-3):235-256, 2002.\n\n[7] Raef Bassily, Adam Smith, and Abhradeep Thakurta. Private empirical risk minimization: Efficient algorithms and tight error bounds. In Proceedings of the 2014 IEEE 55th Annual Symposium on Foundations of Computer Science, FOCS '14, pages 464-473, Washington, DC, USA, 2014. IEEE Computer Society. ISBN 978-1-4799-6517-5. doi: 10.1109/FOCS.2014.56.\n\n[8] Donald A. Berry and Bert Fristedt. Bandit Problems: Sequential Allocation of Experiments. Chapman and Hall, London; New York, 1985.\n\n[9] Mark Bun and Thomas Steinke. Concentrated differential privacy: Simplifications, extensions, and lower bounds. In Theory of Cryptography, Lecture Notes in Computer Science, pages 635-658. Springer, Berlin, Heidelberg, November 2016. ISBN 978-3-662-53640-7 978-3-662-53641-4. doi: 10.1007/978-3-662-53641-4_24.\n\n[10] T.-H. Hubert Chan, Elaine Shi, and Dawn Song. Private and continual release of statistics. In Automata, Languages and Programming, Lecture Notes in Computer Science, pages 405-417. Springer, Berlin, Heidelberg, July 2010. ISBN 978-3-642-14161-4 978-3-642-14162-1. doi: 10.1007/978-3-642-14162-1_34.\n\n[11] Kamalika Chaudhuri, Claire Monteleoni, and Anand D. Sarwate. 
Differentially private empirical risk minimization. J. Mach. Learn. Res., 12:1069–1109, July 2011. ISSN 1532-4435.

[12] Wei Chu, Lihong Li, Lev Reyzin, and Robert E. Schapire. Contextual bandits with linear payoff functions. In AISTATS, volume 15 of JMLR Proceedings, pages 208–214, 2011.

[13] Varsha Dani, Thomas Hayes, and Sham Kakade. Stochastic linear optimization under bandit feedback. In 21st Annual Conference on Learning Theory, pages 355–366, January 2008.

[14] C. Dwork, G. N. Rothblum, and S. Vadhan. Boosting and differential privacy. In 2010 IEEE 51st Annual Symposium on Foundations of Computer Science, pages 51–60, October 2010. doi: 10.1109/FOCS.2010.12.

[15] Cynthia Dwork and Aaron Roth. The algorithmic foundations of differential privacy. Foundations and Trends® in Theoretical Computer Science, 9(3–4):211–407, August 2014. ISSN 1551-305X. doi: 10.1561/0400000042.

[16] Cynthia Dwork, Krishnaram Kenthapadi, Frank McSherry, Ilya Mironov, and Moni Naor. Our data, ourselves: Privacy via distributed noise generation. In Advances in Cryptology – EUROCRYPT 2006, Lecture Notes in Computer Science, pages 486–503. Springer, Berlin, Heidelberg, May 2006. ISBN 978-3-540-34546-6. doi: 10.1007/11761679_29.

[17] Cynthia Dwork, Frank McSherry, Kobbi Nissim, and Adam Smith. Calibrating noise to sensitivity in private data analysis. In Theory of Cryptography, Lecture Notes in Computer Science, pages 265–284. Springer, Berlin, Heidelberg, March 2006. ISBN 978-3-540-32731-8. doi: 10.1007/11681878_14.

[18] Cynthia Dwork, Moni Naor, Toniann Pitassi, and Guy N. Rothblum. Differential privacy under continual observation. In Proceedings of the Forty-Second ACM Symposium on Theory of Computing, STOC '10, pages 715–724, New York, NY, USA, 2010. ACM. ISBN 978-1-4503-0050-6.
doi: 10.1145/1806689.1806787.

[19] Cynthia Dwork, Kunal Talwar, Abhradeep Thakurta, and Li Zhang. Analyze Gauss: Optimal bounds for privacy-preserving principal component analysis. In Proceedings of the Forty-Sixth Annual ACM Symposium on Theory of Computing, STOC '14, pages 11–20, New York, NY, USA, 2014. ACM. ISBN 978-1-4503-2710-7. doi: 10.1145/2591796.2591883.

[20] Moritz Hardt and Kunal Talwar. On the geometry of differential privacy. In Proceedings of the Forty-Second ACM Symposium on Theory of Computing, STOC '10, pages 705–714, New York, NY, USA, 2010. ACM. ISBN 978-1-4503-0050-6. doi: 10.1145/1806689.1806786.

[21] Prateek Jain, Pravesh Kothari, and Abhradeep Thakurta. Differentially private online learning. In Conference on Learning Theory, pages 24.1–24.34, June 2012.

[22] Vishesh Karwa and Salil Vadhan. Finite sample differentially private confidence intervals. arXiv:1711.03908 [cs, math, stat], November 2017.

[23] Michael Kearns, Mallesh Pai, Aaron Roth, and Jonathan Ullman. Mechanism design in large games: Incentives and privacy. In Proceedings of the 5th Conference on Innovations in Theoretical Computer Science, ITCS '14, pages 403–410. ACM Press, 2014. ISBN 978-1-4503-2698-8. doi: 10.1145/2554797.2554834.

[24] Tor Lattimore and Csaba Szepesvári. The end of optimism? An asymptotic analysis of finite-armed linear bandits. In AISTATS, pages 728–737, 2017.

[25] B. Laurent and P. Massart. Adaptive estimation of a quadratic functional by model selection. The Annals of Statistics, 28(5):1302–1338, October 2000. ISSN 0090-5364. doi: 10.1214/aos/1015957395.

[26] Nikita Mishra and Abhradeep Thakurta. (Nearly) optimal differentially private stochastic multi-arm bandits. In Proceedings of the Thirty-First Conference on Uncertainty in Artificial Intelligence, UAI '15, pages 592–601, Arlington, Virginia, United States, 2015. AUAI Press. ISBN 978-0-9966431-0-8.

[27] Seth Neel and Aaron Roth.
Mitigating bias in adaptive data gathering via differential privacy. In ICML, pages 3717–3726, 2018.

[28] Herbert Robbins. Some aspects of the sequential design of experiments. Bull. Amer. Math. Soc., 58(5):527–535, September 1952.

[29] Paat Rusmevichientong and John N. Tsitsiklis. Linearly parameterized bandits. Mathematics of Operations Research, 35(2):395–411, April 2010. ISSN 0364-765X. doi: 10.1287/moor.1100.0446.

[30] Or Sheffet. Private approximations of the 2nd-moment matrix using existing techniques in linear regression. arXiv:1507.00056 [cs], June 2015.

[31] Adam Smith and Abhradeep Thakurta. (Nearly) optimal algorithms for private online learning in full-information and bandit settings. In C. J. C. Burges, L. Bottou, M. Welling, Z. Ghahramani, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 26, pages 2733–2741. Curran Associates, Inc., 2013.

[32] Terence Tao. Topics in Random Matrix Theory, volume 132. American Mathematical Society, Providence, RI, 2012.

[33] Aristide C. Y. Tossou and Christos Dimitrakakis. Algorithms for differentially private multi-armed bandits. In Thirtieth AAAI Conference on Artificial Intelligence, March 2016.

[34] Aristide C. Y. Tossou and Christos Dimitrakakis. Achieving privacy in the adversarial multi-armed bandit. arXiv:1701.04222 [cs], January 2017.

[35] Roman Vershynin. Introduction to the non-asymptotic analysis of random matrices. arXiv:1011.3027 [cs, math], November 2010.

[36] Fuzhen Zhang. Matrix Theory: Basic Results and Techniques. Universitext. Springer, New York, 2nd edition, 2011. ISBN 978-1-4614-1098-0.