{"title": "Private Learning Implies Online Learning: An Efficient Reduction", "book": "Advances in Neural Information Processing Systems", "page_first": 8702, "page_last": 8712, "abstract": "We study the relationship between the notions of differentially private learning and online learning. Several recent works have shown that differentially private learning implies online learning, but an open problem of Neel, Roth, and Wu \\cite{NeelAaronRoth2018} asks whether this implication is {\it efficient}. \nSpecifically, does an efficient differentially private learner imply an efficient online learner? \n\nIn this paper we resolve this open question in the context of pure differential privacy.\nWe derive an efficient black-box reduction from differentially private learning to online learning from expert advice.", "full_text": "Private Learning Implies Online Learning: An Efficient Reduction\n\nAlon Gonen, University of California San Diego, algonen@cs.ucsd.edu\n\nElad Hazan, Princeton University and Google AI Princeton, ehazan@princeton.edu\n\nShay Moran, Google AI Princeton, shaymoran1@gmail.com\n\nAbstract\n\nWe study the relationship between the notions of differentially private learning and online learning in games. Several recent works have shown that differentially private learning implies online learning, but an open problem of Neel, Roth, and Wu [27] asks whether this implication is efficient. Specifically, does an efficient differentially private learner imply an efficient online learner?\n\nIn this paper we resolve this open question in the context of pure differential privacy. 
We derive an efficient black-box reduction from differentially private learning to online learning from expert advice.\n\n1 Introduction\n\nDifferentially Private Learning and Online Learning are two well-studied areas in machine learning. While at first glance these two subjects may seem disparate, recent works have gathered a growing amount of evidence which suggests otherwise. For example, Adaptive Data Analysis [15, 14, 24, 19, 3] shares strong similarities with adversarial frameworks studied in online learning, and on the other hand exploits ideas and tools from differential privacy. A more formal relation between private and online learning is manifested by the following fact:\n\nEvery privately learnable class is online learnable.\n\nThis implication and variants of it were derived by several recent works [20, 9, 1] (see the related work section for more details). One caveat of the latter results is that they are non-constructive: they show that every privately learnable class has a finite Littlestone dimension. Then, since the Littlestone dimension is known to capture online learnability [26, 5], it follows that privately learnable classes are indeed online learnable. Consequently, the implied online learner is not necessarily efficient, even if the assumed private learner is. Thus, the following question emerges:\n\nDoes efficient differentially private learning imply efficient online learning?\n\nThis question was explicitly raised by Neel, Roth, and Wu [27]. In this work we resolve this question affirmatively under the assumption that the given private learner satisfies Pure Differential Privacy (the case of Approximate Differential Privacy remains open; see Section 4 for a short discussion). We give an efficient black-box reduction which transforms an efficient pure private learner into an efficient online learner. 
Our reduction exploits a characterization of private learning due to [4], together with tools from online boosting [6], and a lemma which converts oblivious online learning to adaptive online learning. The latter lemma is novel and may be of independent interest.\n\n33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.\n\n\f1.1 Main result\n\nTheorem 1. Let A be a differentially private learning algorithm for an hypothesis class H in the realizable setting. Denote its sample complexity by m(·,·) and denote m0 := m(1/4, 1/2). Then, Algorithm 3 is an efficient online learner for H in the realizable setting which attains an expected regret of at most O(√(ln T)).\n\nThe (standard) notation used in the theorem statement is detailed in Section 2.\n\nAgnostic versus Realizable. It is natural to ask whether Theorem 1 can be generalized to the agnostic setting, namely, whether Algorithm 3 can be extended to an (efficient) online learner which achieves sublinear regret against arbitrary adversaries. It turns out that the answer is no, at least if one is willing to assume certain customary complexity-theoretic assumptions and consider a non-uniform1 model of computation. Specifically, consider the class of all halfspaces over the domain {0, 1}^n ⊆ R^n whose margin is at least poly(n). This class satisfies: (i) it is efficiently learnable by a pure differentially private algorithm [7, 18, 28]; (ii) conditioned on certain average-case hardness assumptions, there is no efficient online learner2 for this class which achieves sublinear regret against arbitrary adversaries [11]. We note that this argument only invalidates the possibility of reducing agnostic online learning to realizable private learning. The question of whether there exists an efficient reduction from agnostic online learning to agnostic private learning remains open.\n\nProof overview. 
Here is a short outline of the proof. A characterization of differentially private learning due to [4] implies that if H is privately learnable in the pure setting, then the representation dimension of H is finite. Roughly, this means that for any fixed distribution D over labeled examples, by repeatedly sampling the (random) outputs of the algorithm A on a “dummy” input sample, we eventually get an hypothesis that performs well with respect to D. In more detail, if one samples (roughly) exp(1/α) random hypotheses, then with high probability one of them will have excess population loss ≤ α with respect to D. This suggests the following approach: sample exp(1/α) random hypotheses (α will be specified later) and treat them as a class of experts, denoted by Hα; then, use Multiplicative Weights to online learn Hα with regret (roughly) √(T log|Hα|) ≈ √(T/α), and thus the total regret will be\n\nα·T + √(T/α),\n\nwhich is at most T^(2/3) if we set α = T^(−1/3).\n\nThere are two caveats with this approach: (i) the number of experts in Hα is exp(T^(1/3)), which is too large for applying Multiplicative Weights efficiently; (ii) a more subtle issue is that the above regret analysis only applies in the oblivious setting: an adaptive adversary may “learn” the random class Hα from the responses of our online learner, and eventually produce a (non-typical) sequence of examples for which it is no longer the case that the best expert in Hα has loss ≤ α. To handle the first obstacle we only require a constant accuracy of α = 1/4, an error which we later reduce using online boosting from [6]. 
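The first stage of this plan (sample random hypotheses from the private learner on a dummy sample, then run Multiplicative Weights over them as experts) can be sketched in code. This is an illustrative sketch only: `private_learner` is a hypothetical oracle standing in for the randomized algorithm A, and the interface is an assumed convention, not the paper's implementation.

```python
import math
import random

def mw_over_sampled_experts(private_learner, dummy_sample, stream, n_experts, seed=0):
    """Stage one of the reduction, as a sketch: draw `n_experts` hypotheses by
    re-running the (randomized) private learner on a fixed dummy sample, then
    run Multiplicative Weights over them on the online stream.
    Returns the number of prediction mistakes."""
    rng = random.Random(seed)
    experts = [private_learner(dummy_sample, rng) for _ in range(n_experts)]
    T = max(1, len(stream))
    eta = math.sqrt(math.log(n_experts) / T)  # step size as in the MW regret bound
    weights = [1.0] * n_experts
    mistakes = 0
    for x, y in stream:
        # randomized prediction: follow expert j with probability proportional to weights[j]
        j = rng.choices(range(n_experts), weights=weights)[0]
        mistakes += int(experts[j](x) != y)
        # multiplicative update on every expert's 0-1 loss
        weights = [w * math.exp(-eta * int(h(x) != y)) for w, h in zip(weights, experts)]
    return mistakes
```

If the sampled pool contains at least one expert with small loss on the stream (as Lemma 9 below guarantees for the private-learner oracle), the MW regret bound makes the total number of mistakes small.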
As for the second obstacle, to cope with adaptive adversaries we propose a general reduction from the adaptive to the oblivious setting, which might be of independent interest.\n\n1.2 Related work\n\nOnline and private learning. Feldman and Xiao [20] exploited techniques from communication complexity to show that every pure differentially private (DP) learnable class has a finite Littlestone dimension (and hence is online learnable). Their work actually proved that pure private learning is strictly more difficult than online learning: that is, there exist classes with a finite Littlestone dimension which are not pure-DP learnable. More recently, Alon et al. [9, 1] extended the former result to approximate differential privacy, showing that every approximate-DP learnable class has a finite Littlestone dimension. It remains open whether the converse holds.\n\n1Complexity theory distinguishes between uniform and non-uniform models, such as Turing machines vs. arithmetic circuits. In this paper we consider the uniform model. However, the lower bound we sketch applies to non-uniform computation.\n\n2The result in [11] is in fact stronger: it shows that there exists no efficient agnostic PAC learner for this class (see Theorem 1.4 therein).\n\n\fAnother line of work [27, 8] exploits online learning techniques to derive results in differential privacy related to sanitization and uniform convergence.\n\nAdaptive data analysis. A growing area which intersects both fields of online learning and private learning is adaptive data analysis [15, 14, 24, 19, 3]. 
This framework studies scenarios in which a data analyst wishes to test multiple hypotheses on a finite sample in an adaptive manner. The adaptive nature of this setting resembles scenarios that are traditionally studied in online learning, and the connection with differential privacy is manifested in the technical tools used to study adaptive data analysis, many of which were developed in differential privacy (e.g., composition theorems).\n\nOracle complexity of online learning. One feature of our algorithm is that it uses oracle access to a private learner. Several works have studied online learning in the oracle model ([23, 25, 13]). This framework is natural in scenarios in which it is computationally hard to achieve sublinear regret in the worst case, but the online learner has access to an offline optimization and/or learning oracle. Our results fall into the same paradigm, where the oracle is a differentially private learner.\n\n2 Definitions and Preliminaries\n\n2.1 PAC learning\n\nLet X be an instance space, Y = {−1, 1} be a label set, and let D be an (unknown) distribution over X × Y. An “X → Y” function is called a concept/hypothesis. The goal here is to design a learning algorithm which, given a large enough input sample S = ((x1, y1), . . . , (xm, ym)) drawn i.i.d. from D, outputs an hypothesis h : X → Y whose expected risk\n\nLD(h) := E_{(x,y)∼D}[ℓ(h(x), y)], where ℓ(a, b) = 1[a ≠ b],\n\nis small compared to the best hypothesis in a hypothesis class H, which is fixed and known to the algorithm.\n\nThe distribution D is said to be realizable with respect to H if there exists h⋆ ∈ H such that LD(h⋆) = 0. We also define the empirical risk of an hypothesis h with respect to a sample S = ((x1, y1), . . . , (xm, ym)) as LS(h) = (1/m)·Σ_{i=1}^m ℓ(h(xi), yi).\n\nDefinition 2. (PAC learning) An hypothesis class H is PAC learnable with sample complexity m(α, β) if there exists an algorithm A such that for any distribution D over X × Y and any accuracy and confidence parameters α, β ∈ (0, 1), if A is given an input sample S = ((x1, y1), . . . , (xm, ym)) ∼ D^m such that m ≥ m(α, β), then it outputs an hypothesis h : X → Y satisfying LD(h) ≤ α with probability at least 1 − β. The class H is efficiently PAC learnable if the runtime of A (and thus its sample complexity) is polynomial in 1/α and 1/β. If the above holds only for realizable distributions then we say that H is PAC learnable in the realizable setting.\n\n2.2 Differentially private PAC learning\n\nIn some important learning tasks (e.g., medical analysis, social networks, financial records) the input sample consists of sensitive data that should be kept private. Differential privacy ([12, 16]) is a by-now standard formalism that captures such requirements.\n\nThe definition of differentially private algorithms is as follows. Two samples S′, S′′ ∈ (X × Y)^m are called neighbors if there exists at most one i ∈ [m] such that the i’th example in S′ differs from the i’th example in S′′.\n\nDefinition 3. (Differentially private learning) A learning algorithm A is said to be ε-differentially private3 (DP) if for any two neighboring samples S, S′ and for any measurable subset F ⊆ Y^X,\n\nPr[A(S) ∈ F] ≤ exp(ε)·Pr[A(S′) ∈ F] and Pr[A(S′) ∈ F] ≤ exp(ε)·Pr[A(S) ∈ F].\n\n3The algorithm is said to be (ε, δ)-approximately differentially private if the above inequalities hold up to an additive factor δ. 
In this work we focus on the so-called pure case where δ = 0.\n\n\fGroup privacy is a simple extension of the above definition [17]: two samples S, S′ are q-neighbors if they differ in at most q of their pairs.\n\nLemma 4. Let A be an ε-DP learner. Then for any q ∈ N, any two q-neighboring samples S, S′, and any subset F of the range of A, Pr[A(S) ∈ F] ≤ exp(εq)·Pr[A(S′) ∈ F].\n\nCombining the requirements of PAC and DP learnability yields the definition of a private PAC (PPAC) learner.\n\nDefinition 5. (PPAC learning) A concept class H is differentially private PAC learnable with sample complexity m(α, β) if it is PAC learnable with sample complexity m(α, β) by an algorithm A which is ε-differentially private with ε = 0.1.\n\nRemark. Setting ε = 0.1 is without loss of generality; the reason is that there are efficient methods to reduce ε to an arbitrarily small constant, see [30] and references within.\n\n2.3 Online Learning\n\nThe online model can be seen as a repeated game between a learner A and an environment (a.k.a. adversary) E. Let T be a (known4) horizon parameter. On each round t ∈ [T] the adversary decides on a pair (xt, yt) ∈ X × Y, and the learner decides on a prediction rule ht : X → Y. Then, the learner suffers the loss ℓ(ŷt, yt), where ŷt = ht(xt). Both players may base their decisions on the entire history and may use randomness. Unlike in the statistical setting, the adversary E can generate the examples in an adaptive manner. 
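As a concrete illustration, this repeated game can be simulated with a short loop. This is only a toy sketch: the `predict`/`update` learner interface and the majority-vote baseline learner are illustrative conventions, not the paper's algorithm.

```python
class MajorityOfPastLabels:
    """Toy learner used only to make the protocol runnable: it predicts the
    majority label observed so far (ties broken towards +1)."""
    def __init__(self):
        self.counts = {1: 0, -1: 0}
    def predict(self, x):
        return 1 if self.counts[1] >= self.counts[-1] else -1
    def update(self, x, y):
        self.counts[y] += 1

def play_online_game(learner, adversary, T):
    """One run of the online protocol: on round t the adversary picks (x_t, y_t),
    possibly as a function of the learner's past predictions (adaptivity),
    the learner predicts, and the 0-1 loss is accumulated."""
    past_predictions = []
    mistakes = 0
    for t in range(T):
        x, y = adversary(t, past_predictions)  # an adaptive adversary may inspect history
        y_hat = learner.predict(x)
        mistakes += int(y_hat != y)
        learner.update(x, y)
        past_predictions.append(y_hat)
    return mistakes
```

Passing an adversary that ignores `past_predictions` recovers the oblivious setting; letting it read the history models the adaptive one.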
In this work we focus on the realizable setting, where it is assumed that the labels are realized by some target concept c ∈ H, i.e., for all t ∈ [T], yt = c(xt).5 The measure of success is the expected number of mistakes made by the learner:\n\nE[MA] = E[ Σ_{t=1}^T ℓ(ŷt, yt) ],\n\nwhere the expectation is taken over the randomness of the learner and the adversary. An algorithm A is a (strong) online learner if for any horizon parameter T and any realizable sequence ((x1, y1), . . . , (xT, yT)), the expected number of mistakes made by A is sublinear in T.\n\n2.3.1 Weak Online Learning\n\nWe describe an extension due to [6] of the boosting framework ([29]) from the statistical setting to the online setting.\n\nDefinition 6. (Weak online learning) An online learner A is called a weak online learner for a class H with an edge parameter γ ∈ (0, 1/2) and excess loss parameter T0 > 0 if for any horizon parameter T and every sequence ((x1, y1), . . . , (xT, yT)) realized by some target concept c ∈ H, the expected number of mistakes made by A satisfies\n\nE[MA] ≤ (1/2 − γ)·T + T0.\n\n2.3.2 Oblivious vs. Non-oblivious Adversaries\n\nThe general adversary considered in this paper is adaptive in the sense that it can choose the pair (xt, yt) based on the actual predictions ŷ1, . . . , ŷ_{t−1} made by the learner on rounds 1, . . . , t−1. An adversary is called oblivious if it chooses the entire sequence ((x1, y1), . . . , (xT, yT)) in advance. We will first develop a weak online learner for the oblivious setting and then extend it to the adaptive setting.\n\n4Standard doubling techniques allow the learner to cope with scenarios where T is not known.\n\n5However, the adversary does not need to decide on the identity of c in advance.\n\n2.3.3 Regret bounds using Multiplicative Weights\n\nAlthough we focus our attention on the realizable setting, our development also requires working in the so-called agnostic setting, where the sequence ((x1, y1), . . . , (xT, yT)) is not assumed to be realized by some c ∈ H. The standard measure of success in this setting is the expected regret, defined as\n\nE[Regret_T] = E[ Σ_{t=1}^T ℓ(ŷt, yt) − inf_{h∈H} Σ_{t=1}^T ℓ(h(xt), yt) ].\n\nAccordingly, an online learner in this context needs to achieve sublinear regret in terms of the horizon parameter T.\n\nWhen the class H is finite, there is a well-known algorithm named Multiplicative Weights (MW) which maintains a weight w_{t,j} for each hypothesis (a.k.a. expert in the online model) hj according to\n\nw_{1,j} = 1,  w_{t+1,j} = w_{t,j}·exp(−η·ℓ(hj(xt), yt)),\n\nwhere η > 0 is a step-size parameter. At each time t, MW predicts ŷt = hj(xt) with probability proportional to w_{t,j}. We refer to [2] for an extensive survey on Multiplicative Weights and its many applications. The following theorem establishes an upper bound on the regret of MW.\n\nTheorem 7. (Regret of MW) If the class H is finite then the expected regret of MW with step-size parameter η = √(log(|H|)/T) is at most √(2T log|H|).\n\n3 The Reduction and its Analysis\n\nIn this section we formally present our efficient reduction from online learning to private PAC learning. Our reduction only requires black-box oracle access to the private learner. The reduction can be roughly partitioned into three parts: (i) We first use this oracle to construct an efficient weak online learner against oblivious adversaries. (ii) Then, we transform this learner so that it also handles adaptive adversaries. 
This step is based on a general reduction which may be of independent interest. (iii) Finally, we boost the weak online learner to a strong one using online boosting.\n\n3.1 A Weak Online Learner in the Oblivious Setting\n\nLet Ap be a PPAC algorithm with sample complexity m(α, β) for H and denote m0 := m(1/4, 1/2) = Θ(1). We only assume oracle access to Ap, and in the first part we use it to construct a distribution over hypotheses/experts. Specifically, let S0 be a dummy sample consisting of m0 occurrences of the pair (x̄, 0), where x̄ is an arbitrary instance from X. Note that the hypothesis/expert Ap(S0) is random.6\n\nDefinition 8. Let P0 be the distribution over hypotheses/experts induced by applying Ap on the input sample S0.\n\nLemma 9. For any realizable distribution D over X × Y, with probability at least 15/16 over the draw of N = Θ(exp(m0)) = Θ(1) i.i.d. hypotheses h1, . . . , hN ∼ P0, there exists i ∈ [N] such that LD(hi) ≤ 1/4.\n\nProof. Let c ∈ H be such that LD(c) = 0, and denote\n\nH(D) = {h ∈ range(Ap) : LD(h) ≤ 1/4}.\n\nBy assumption, if we feed the PPAC algorithm Ap with a sample S ∼ D^{m0}, then with probability at least 1/2 over both the internal randomness of Ap and the draw of S, the output of Ap belongs to H(D). It follows that there exists at least one sample, which we denote by S̄, such that with probability at least 1/2 over the randomness of Ap, the output h = Ap(S̄) belongs to H(D). Since Ap is differentially private and (S̄, S0) are m0-neighbors, we obtain by group privacy (Lemma 4) that\n\nPr[Ap(S0) ∈ H(D)] ≥ (1/2)·exp(−0.1·m0).\n\nConsequently, if we draw N = Θ(exp(m0)) hypotheses hj ∼ P0 then with probability at least 15/16, at least one of the hj’s belongs to H(D). 
This completes the proof.\n\nArmed with this lemma, we proceed by applying the Multiplicative Weights method to the random class of experts h1, . . . , hN produced by the PPAC learner Ap. The algorithm is detailed as Algorithm 1. The next lemma establishes its weak learnability in the oblivious setting.\n\n6The definition of differential privacy implies that every private algorithm is randomized (ignoring trivialities).\n\n\fAlgorithm 1 Weak online learner for oblivious adversaries\n\nOracle access: Let P0 denote the distribution from Definition 8, and let m0 = m(1/4, 1/2), where m(α, β) is the sample complexity of the private learner Ap.\nSet: N = Θ(exp(m0)), η = √(log(N)/T).\nfor j = 1 to N do\n  hj ∼ P0, w_{1,j} = 1    ▷ Initializing MW w.r.t. h1, . . . , hN\nend for\nfor t = 1 to T do\n  Receive an instance xt\n  Predict ŷt = hj(xt) with probability w_{t,j} / Σ_{k=1}^N w_{t,k}\n  Receive the true label yt\n  w_{t+1,j} = w_{t,j}·exp(−η·ℓ(hj(xt), yt)) for all j ∈ [N]\nend for\n\nLemma 10. For any oblivious adversary and horizon parameter T, the expected number of mistakes made by Algorithm 1 is at most O(√(T·m0) + T/4). In particular, the algorithm is a weak online learner with an edge parameter 1/8 and excess loss T0 = O(1).\n\nProof. Since the adversary is oblivious, it chooses the (realizable) sequence (x1, y1), . . . , (xT, yT) in advance. In particular, these choices do not depend on the hypotheses h1, . . . , hN drawn from P0. 
Define a distribution D over X × Y by\n\nD[{(x, y)}] = |{t ∈ [T] : (xt, yt) = (x, y)}| / T.\n\nBy the previous lemma, with probability at least 15/16 there exists j ∈ [N] such that\n\n(1/T)·Σ_{t=1}^T ℓ(hj(xt), yt) = LD(hj) ≤ 1/4.\n\nUsing the standard regret bound of Multiplicative Weights (Theorem 7), we obtain that the expected number of mistakes made by our algorithm is at most\n\n2√(T log N) + T/4 + T/16.\n\n(The T/16 term is because the success probability of the event in Lemma 9 is 15/16, so the complementary event contributes at most T/16 expected mistakes.) In particular, setting T0 = C·log N = O(m0) for a sufficiently large constant C,\n\n2√(T log N) + T/4 + T/16 ≤ (1/2 − 1/8)·T + T0.\n\nThis concludes the proof.\n\n3.2 General reduction from adaptive to oblivious environments\n\nIn this part we describe a simple general-purpose extension from the oblivious setting to the adaptive setting. Let Ao be an online learner for H that handles oblivious adversaries. We may assume that Ao is randomized, since otherwise any guarantee with respect to an oblivious adversary holds also with respect to an adaptive adversary. Given an horizon parameter T, we initialize T instances of this algorithm (each with an independent random seed of its own). Finally, on round t we follow the prediction of the t-th instance, A(t)_o.\n\nLemma 11. Suppose that Ao is an online learner for a class H in the oblivious setting whose expected regret is upper bounded by R(T). Then, the expected regret of Algorithm 2 is also upper bounded by R(T).\n\nProof. The proof relies on a lemma by [10] which provides a reduction from the adaptive to the oblivious setting given a certain condition on the responses of the online learner. 
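In code, the reduction is very short. The sketch below is a minimal simulation under assumed conventions: `make_learner` constructs one independent copy of the oblivious-setting learner, and the `predict`/`update` interface is hypothetical.

```python
def adaptive_from_oblivious(make_learner, stream):
    """Sketch of Algorithm 2: spawn T independent copies of an oblivious-setting
    online learner; on round t, play the prediction of the t-th copy, while
    every copy keeps observing the full example sequence."""
    T = len(stream)
    copies = [make_learner(seed) for seed in range(T)]  # independent randomness per copy
    mistakes = 0
    for t, (x, y) in enumerate(stream):
        y_hat = copies[t].predict(x)   # only the t-th copy's prediction is played
        mistakes += int(y_hat != y)
        for c in copies:               # all copies receive the example as feedback
            c.update(x, y)
        # copies with index > t have not yet influenced the adversary's choices,
        # which is the independence property driving Lemma 11
    return mistakes
```

The key design point is that copy j's internal randomness is independent of the examples chosen before round j, so the oblivious-setting guarantee applies to it.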
Since this lemma is somewhat technical, we defer the proof of the stated bound to the appendix (Section A), and prove here a slightly weaker bound, which is off by a factor of log T. This weaker bound, however, follows from elementary arguments in a self-contained manner.\n\n\fAlgorithm 2 Reduction from Oblivious to Adaptive Setting\n\nOracle access: Online algorithm Ao for the oblivious setting.\nInitialize T independent instances of Ao, denoted A(1)_o, . . . , A(T)_o.\nfor t = 1 to T do\n  ŷ(j)_t := prediction of A(j)_o, for j = 1, . . . , T\n  Predict ŷt = ŷ(t)_t\nend for\n\nNote that the algorithms A(j)_o for j = 1, . . . , T are i.i.d. (i.e., have independent internal randomness). Therefore, the sequence of examples chosen by the adversary up to time t is independent of the predictions of A(j)_o whenever j ≥ t, and thus we can use the assumed guarantee for A(j)_o in the oblivious setting:\n\n(∀j ≥ t):  E[ Σ_{i=1}^t ℓ̂(j)_i ] ≤ R(T),    (1)\n\nwhere ℓ̂(j)_i = ℓ(yi, ŷ(j)_i), and ℓ̂t = ℓ(yt, ŷt) denotes the loss of the combined learner. Similarly, it follows that\n\nE[ℓ̂t] = E[ℓ̂(t)_t] = E[ℓ̂(t+1)_t] = · · · = E[ℓ̂(T)_t] = E[ (1/(T−t+1))·Σ_{j=t}^T ℓ̂(j)_t ].    (2)\n\nTherefore,\n\nE[ Σ_{t=1}^T ℓ̂t ] = E[ Σ_{t=1}^T (1/(T−t+1))·Σ_{j=t}^T ℓ̂(j)_t ]    (by Equation 2)\n= E[ Σ_{j=1}^T Σ_{t=1}^j ℓ̂(j)_t / (T−t+1) ]    (swapping the order of summation)\n≤ E[ Σ_{j=1}^T Σ_{t=1}^j ℓ̂(j)_t / (T−j+1) ]    (since t ≤ j implies T−j+1 ≤ T−t+1)\n= Σ_{j=1}^T E[ Σ_{t=1}^j ℓ̂(j)_t ] / (T−j+1)\n≤ Σ_{j=1}^T R(T) / (T−j+1)    (by Equation 1)\n≤ R(T)·log T.\n\n3.3 Applying Online Boosting\n\nIn this part we apply an online boosting algorithm due to [6] to improve the accuracy of our weak learner. The algorithm is named Online Boosting-by-Majority (online BBM). We start by briefly describing online BBM and stating an upper bound on its expected regret.\n\nOnline BBM can be seen as an extension of the Boosting-by-Majority algorithm due to [21]. Let WL be a weak learner with an edge parameter γ ∈ (0, 1/2) and excess loss T0. The online BBM algorithm maintains N copies of WL, denoted by WL(1), . . . , WL(N). On each round t it uses a simple (unweighted) majority vote over WL(1), . . . , WL(N) to make a prediction ŷt. The pair (xt, yt) is passed to the weak learner WL(j) with probability that depends on the accuracy of the majority vote based on the weak learners WL(1), . . . , WL(j−1) with respect to (xt, yt). Similarly to the well-known AdaBoost algorithm by [22], the worse the accuracy of the previous weak learners, the larger the probability that (xt, yt) is passed to WL(j) (see Algorithm 1 in [6]).\n\nTheorem 12. 
([6]) For any T and any N, the expected number of mistakes made by the Online Boosting-by-Majority algorithm is bounded by7\n\nexp(−N·γ²/2)·T + Õ(√N·(T0 + 1/γ)).\n\nIn particular, if γ and T0 are constants then for any ε > 0, it suffices to pick N = Θ(ln(1/ε)) weak learners to obtain an upper bound of\n\nO(T·ε + √(ln(1/ε)))    (3)\n\non the expected number of mistakes.\n\nWe have collected all the pieces of our algorithm.\n\nAlgorithm 3 Online Learning using a Private Oracle\n\nHorizon parameter: T\nSet ε := 1/T\nWeak learner WL: Algorithm 2 applied to Algorithm 1\nApply online BBM using N = Θ(ln(1/ε)) = Θ(ln T) instances of WL\n\nProof. (of Theorem 1) Combining Lemma 10 and Lemma 11, we obtain that WL is a weak online learner with an edge parameter γ = 1/8 and constant excess loss. Plugging ε = 1/T into the accuracy parameter in Theorem 12 (Equation 3) yields the stated bound.\n\n4 Discussion\n\nWe have considered online learning in the presence of a private learning oracle, and gave an efficient reduction from online learning to private learning. We conclude with two questions for future research.\n\n• Can our result be extended to the approximate case? That is, does an efficient approximately differentially private learner for a class H imply an efficient online algorithm with sublinear regret? Can the online learner be derived using only oracle access to the private learner?\n\n• Can our result be extended to the agnostic setting? That is, does an efficient agnostic private learner for a class H imply an efficient agnostic online learner for it?\n\nAs for the first question, one difference with pure privacy is that Lemma 9 ceases to hold. 
Recall that this lemma guarantees that applying a pure private learner on a dummy sample yields, with non-negligible probability, an output hypothesis which is correlated with any realizable target concept. This lemma manifests a certain obliviousness of pure private learners which is crucial in our transformation from the statistical i.i.d. setting to the adversarial setting. Approximately private learners do not share a similar obliviousness: in particular, to obtain an output hypothesis which is non-trivially correlated with the realizable target concept, one must apply the learner to an i.i.d. sample consistent with the concept (rather than an arbitrary dummy sample).\n\nAs for the second question, it is natural to try to implement the approach used in this paper, but there are several major missing components. Most notably, we would need a boosting algorithm for agnostic online learning, which is not known to exist. We consider this as a direction for future research, which is interesting in its own right.\n\n7The bound in [6] is a high-probability bound. It is easy to translate it to a bound in expectation.\n\n\fReferences\n\n[1] Noga Alon, Roi Livni, Maryanthe Malliaris, and Shay Moran. Private PAC learning implies finite Littlestone dimension. arXiv preprint arXiv:1806.00949, 2018.\n\n[2] Sanjeev Arora, Elad Hazan, and Satyen Kale. The Multiplicative Weights Update Method: A Meta-Algorithm and Applications. Theory of Computing, 8(1):121–164, 2012.\n\n[3] Raef Bassily, Kobbi Nissim, Adam Smith, Thomas Steinke, Uri Stemmer, and Jonathan Ullman. Algorithmic Stability for Adaptive Data Analysis. In Proceedings of the forty-eighth annual ACM symposium on Theory of Computing, pages 1046–1059, 2015.\n\n[4] Amos Beimel, Kobbi Nissim, and Uri Stemmer. Characterizing the Sample Complexity of Private Learners. 
In Proceedings of the 4th Conference on Innovations in Theoretical Computer Science, pages 97–110, 2013.\n\n[5] Shai Ben-David, Dávid Pál, and Shai Shalev-Shwartz. Agnostic online learning. In Conference on Learning Theory, 2009.\n\n[6] Alina Beygelzimer, Satyen Kale, and Haipeng Luo. Optimal and Adaptive Algorithms for Online Boosting. In International Conference on Machine Learning, pages 2323–2331, 2015.\n\n[7] Avrim Blum, Cynthia Dwork, Frank McSherry, and Kobbi Nissim. Practical privacy: the SuLQ framework. In Proceedings of the Twenty-fourth ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems, June 13-15, 2005, Baltimore, Maryland, USA, pages 128–138, 2005.\n\n[8] Olivier Bousquet, Roi Livni, and Shay Moran. Passing tests without memorizing: Two models for fooling discriminators. CoRR, abs/1902.03468, 2019.\n\n[9] Mark Bun, Kobbi Nissim, Uri Stemmer, and Salil Vadhan. Differentially private release and learning of threshold functions. In 2015 IEEE 56th Annual Symposium on Foundations of Computer Science, 2015.\n\n[10] Nicolo Cesa-Bianchi and Gabor Lugosi. Prediction, Learning, and Games. Cambridge University Press, 2006.\n\n[11] Amit Daniely. Complexity theoretic limitations on learning halfspaces. In Proceedings of the 48th Annual ACM SIGACT Symposium on Theory of Computing, STOC 2016, Cambridge, MA, USA, June 18-21, 2016, pages 105–117, 2016.\n\n[12] Irit Dinur and Kobbi Nissim. Revealing Information while Preserving Privacy. In Proceedings of the twenty-second ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems, pages 202–210, 2003.\n\n[13] Miroslav Dudik, Nika Haghtalab, Haipeng Luo, Robert E. Schapire, Vasilis Syrgkanis, and Jennifer Wortman Vaughan. Oracle-efficient online learning and auction design. 
In 58th Annual IEEE Symposium on Foundations of Computer Science (FOCS), 2017.

[14] Cynthia Dwork, Vitaly Feldman, Moritz Hardt, Toniann Pitassi, Omer Reingold, and Aaron Roth. Preserving Statistical Validity in Adaptive Data Analysis. In Proceedings of the Forty-Seventh Annual ACM Symposium on Theory of Computing, pages 117–126, 2015.

[15] Cynthia Dwork, Vitaly Feldman, Moritz Hardt, Toniann Pitassi, Omer Reingold, and Aaron Roth. Generalization in Adaptive Data Analysis and Holdout Reuse. In Advances in Neural Information Processing Systems, pages 2350–2358, 2015.

[16] Cynthia Dwork, Frank McSherry, Kobbi Nissim, and Adam Smith. Calibrating Noise to Sensitivity in Private Data Analysis. In Theory of Cryptography Conference, pages 265–284. Springer, 2006.

[17] Cynthia Dwork and Aaron Roth. The algorithmic foundations of differential privacy. Foundations and Trends in Theoretical Computer Science, 9(3–4):211–407, 2014.

[18] Vitaly Feldman, Cristóbal Guzmán, and Santosh Vempala. Statistical query algorithms for mean vector estimation and stochastic convex optimization. In Proceedings of the Twenty-Eighth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA 2017, pages 1265–1277, 2017.

[19] Vitaly Feldman and Thomas Steinke. Calibrating Noise to Variance in Adaptive Data Analysis. In Conference on Learning Theory, pages 535–544, 2018.

[20] Vitaly Feldman and David Xiao. Sample Complexity Bounds on Differentially Private Learning via Communication Complexity. In Conference on Learning Theory, pages 1000–1019, 2014.

[21] Yoav Freund. Boosting a Weak Learning Algorithm by Majority. Information and Computation, 121(2):256–285, 1995.

[22] Yoav Freund and Robert E. Schapire. A decision-theoretic generalization of on-line learning and an application to boosting. J. Comput. Syst.
Sci., 55(1):119–139, 1997.

[23] Alon Gonen and Elad Hazan. Learning in non-convex games with an optimization oracle. arXiv preprint arXiv:1810.07362, 2018.

[24] Moritz Hardt and Jonathan Ullman. Preventing false discovery in interactive data analysis is hard. In IEEE Annual Symposium on Foundations of Computer Science (FOCS), pages 454–463, 2014.

[25] Elad Hazan and Tomer Koren. A linear-time algorithm for trust region problems. Mathematical Programming, 158(1–2):363–381, 2016.

[26] Nick Littlestone and Manfred K. Warmuth. Relating Data Compression and Learnability. Technical report, University of California, Santa Cruz, 1986.

[27] Seth Neel, Aaron Roth, and Zhiwei Steven Wu. How to Use Heuristics for Differential Privacy. arXiv preprint arXiv:1811.07765, 2018.

[28] Huy L. Nguyen, Jonathan Ullman, and Lydia Zakynthinou. Efficient private algorithms for learning halfspaces. CoRR, abs/1902.09009, 2019.

[29] Robert E. Schapire and Yoav Freund. Boosting: Foundations and Algorithms. MIT Press, 2012.

[30] Salil Vadhan. The Complexity of Differential Privacy. In Tutorials on the Foundations of Cryptography, pages 347–450. Springer, 2017.

A Proof of Lemma 11

The proof exploits Lemma 4.1 from [10], which we explain next. Let $A$ be a (possibly randomized) online learner, and let $u_t$ denote the response of $A$ at time $t \le T$. Since $A$ may be randomized, $u_t$ is drawn from a random variable $U_t$ that may depend on the entire history, namely on both the responses of $A$ and those of the adversary up to time $t$. So
\[ U_t = U_t(u_1, \dots, u_{t-1}, v_1, \dots, v_{t-1}), \]
where $u_i \sim U_i$ denotes the response of $A$ and $v_i \sim V_i$ denotes the response of the (possibly randomized) adversary in round $i < t$ (in the classification setting, $v_i$ is the labelled example $(x_i, y_i)$, and $u_i$ is the prediction rule $h_i : X \to \{0,1\}$ used by $A$). Lemma 4.1 in [10] asserts that if $U_t$ is a function of the $v_i$'s only, namely
\[ U_t = U_t(v_1, \dots, v_{t-1}), \tag{4} \]
then the expected regret of $A$ in the adaptive setting is the same as in the oblivious setting.

The proof now follows by noticing that Algorithm 2 satisfies Equation (4). To see this, note that at each round $t$, Algorithm 2 uses the response of algorithm $A_o^{(t)}$, which depends only on the responses of the adversary and of $A_o^{(t)}$ up to time $t$. In particular, it does not additionally depend on the responses of Algorithm 2 at times up to $t$. Put differently, given the responses of the adversary $z_1, \dots, z_{t-1}$, one can produce the response of Algorithm 2 at time $t$ by simulating $A_o^{(t)}$ on this sequence.

Thus, we may assume that the adversary is oblivious, and therefore that the sequence of examples $(x_1, y_1), \dots, (x_t, y_t)$ is fixed in advance and independent of the algorithms $A_o^{(j)}$. Now, since $A_o^{(1)}, \dots, A_o^{(T)}$ are i.i.d. (i.e., have independent internal randomness), the expected loss of Algorithm 2 at time $t$ satisfies
\[ \mathbb{E}[\hat\ell_t] = \mathbb{E}\big[\hat\ell_t^{(t)}\big] = \mathbb{E}\big[\hat\ell_t^{(1)}\big] = \dots = \mathbb{E}\big[\hat\ell_t^{(T)}\big] = \mathbb{E}\Big[\frac{1}{T}\sum_{j=1}^{T} \hat\ell_t^{(j)}\Big], \]
where $\hat\ell_i^{(j)} = \ell(y_i, \hat y_i^{(j)})$.
Thus, its expected number of mistakes is at most
\[ \mathbb{E}\Big[\sum_{t=1}^{T} \hat\ell_t\Big] = \mathbb{E}\Big[\sum_{t=1}^{T} \frac{1}{T}\sum_{j=1}^{T} \hat\ell_t^{(j)}\Big] = \frac{1}{T}\sum_{j=1}^{T} \mathbb{E}\Big[\sum_{t=1}^{T} \hat\ell_t^{(j)}\Big]. \]
Therefore, the expected regret satisfies
\[ \mathbb{E}[\mathrm{Regret}_T] = \frac{1}{T}\sum_{j=1}^{T} \mathbb{E}\big[\mathrm{Regret}_T^{(j)}\big] \le R(T). \]
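The construction analyzed above can be illustrated with a small simulation. The sketch below is not the paper's Algorithm 2 (all names and the choice of base learner are hypothetical); it wraps a standard multiplicative-weights learner over a finite expert class, and at each round $t$ spawns a fresh copy with independent internal randomness, replays the adversary's first $t-1$ examples into it, and plays its prediction on $x_t$. The played action therefore depends only on the adversary's past responses, which is exactly the condition of Equation (4).

```python
import math
import random


class MultiplicativeWeights:
    """Randomized online learner over a finite class of experts
    (prediction rules); a stand-in for the base learner A_o."""

    def __init__(self, experts, eta, rng):
        self.experts = experts
        self.eta = eta
        self.rng = rng
        self.weights = [1.0] * len(experts)

    def predict(self, x):
        # Sample an expert with probability proportional to its weight.
        r = self.rng.random() * sum(self.weights)
        for h, w in zip(self.experts, self.weights):
            r -= w
            if r <= 0:
                return h(x)
        return self.experts[-1](x)

    def update(self, x, y):
        # Exponentially down-weight every expert that errs on (x, y).
        for i, h in enumerate(self.experts):
            if h(x) != y:
                self.weights[i] *= math.exp(-self.eta)


def wrapper_predictions(experts, examples, eta, rng):
    """At round t, spawn a FRESH copy with independent randomness,
    replay the first t-1 examples into it, and play its prediction on
    x_t; the played action depends only on the adversary's past moves."""
    preds = []
    for t, (x_t, _) in enumerate(examples):
        copy = MultiplicativeWeights(experts, eta, random.Random(rng.random()))
        for x, y in examples[:t]:
            copy.update(x, y)
        preds.append(copy.predict(x_t))
    return preds


# Realizable stream: threshold concepts on {0,...,9}, target threshold 5.
experts = [lambda x, t=t: int(x >= t) for t in range(11)]
target = experts[5]
rng = random.Random(0)
xs = [rng.randrange(10) for _ in range(200)]
examples = [(x, target(x)) for x in xs]

preds = wrapper_predictions(experts, examples, eta=0.5, rng=random.Random(1))
mistakes = sum(int(p != y) for p, (_, y) in zip(preds, examples))
# The best expert makes 0 mistakes here, so `mistakes` equals the regret.
print(mistakes)
```

Since the wrapper's expected loss at each round matches that of any single copy on the fixed (oblivious) sequence, its regret inherits the oblivious-setting guarantee of the base learner, mirroring the averaging step in the proof.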