{"title": "Differential Privacy for Growing Databases", "book": "Advances in Neural Information Processing Systems", "page_first": 8864, "page_last": 8873, "abstract": "The large majority of differentially private algorithms focus on the static setting, where queries are made on an unchanging database. This is unsuitable for the myriad applications involving databases that grow over time. To address this gap in the literature, we consider the dynamic setting, in which new data arrive over time. Previous results in this setting have been limited to answering a single non-adaptive query repeatedly as the database grows. In contrast, we provide tools for richer and more adaptive analysis of growing databases. Our first contribution is a novel modification of the private multiplicative weights algorithm, which provides accurate analysis of exponentially many adaptive linear queries (an expressive query class including all counting queries) for a static database. Our modification maintains the accuracy guarantee of the static setting even as the database grows without bound. Our second contribution is a set of general results which show that many other private and accurate algorithms can be immediately extended to the dynamic setting by rerunning them at appropriate points of data growth with minimal loss of accuracy, even when data growth is unbounded.", "full_text": "Differential Privacy for Growing Databases\n\nRachel Cummings\u21e4\n\nGeorgia Institute of Technology\n\nrachelc@gatech.edu\n\nKevin A. Lai\u21e4\n\nGeorgia Institute of Technology\n\nkevinlai@gatech.edu\n\nSara Krehbiel\u21e4\n\nUniversity of Richmond\n\nkrehbiel@richmond.edu\n\nUthaipon Tantipongpipat\u21e4\n\nGeorgia Institute of Technology\n\ntao@gatech.edu\n\nAbstract\n\nThe large majority of differentially private algorithms focus on the static setting,\nwhere queries are made on an unchanging database. This is unsuitable for the\nmyriad applications involving databases that grow over time. 
To address this gap in the literature, we consider the dynamic setting, in which new data arrive over time. Previous results in this setting have been limited to answering a single non-adaptive query repeatedly as the database grows [DNPR10, CSS11]. In contrast, we provide tools for richer and more adaptive analysis of growing databases. Our first contribution is a novel modification of the private multiplicative weights algorithm of [HR10], which provides accurate analysis of exponentially many adaptive linear queries (an expressive query class including all counting queries) for a static database. Our modification maintains the accuracy guarantee of the static setting even as the database grows without bound. Our second contribution is a set of general results which show that many other private and accurate algorithms can be immediately extended to the dynamic setting by rerunning them at appropriate points of data growth with minimal loss of accuracy, even when data growth is unbounded.

1 Introduction

Differential privacy is a well-studied framework for data privacy. First defined by [DMNS06], differential privacy gives a mathematically rigorous worst-case bound on the maximum amount of information that can be learned about any one individual's data from the output of an algorithm. The theoretical computer science community has been prolific in designing differentially private algorithms that provide accuracy guarantees for a wide variety of machine learning problems (see [JLE14] for a survey). Differentially private algorithms have also begun to be implemented in practice by major organizations such as Apple, Google, Uber, and the United States Census Bureau.

The large majority of work in differential privacy focuses on the static setting, in which adaptive or non-adaptive queries are made on an unchanging database. However, this is unsuitable for the myriad applications involving databases that grow over time. For example, a hospital may want to publish updated statistics on its growing database of patients, or a company may want to maintain an up-to-date classifier for its expanding user base. To harness the value of growing databases and keep up with data analysis needs, guarantees of private machine learning algorithms and other statistical tools must apply not just to fixed databases but also to dynamic databases.

To address this gap in the literature, we consider the dynamic setting, in which new data arrive over time. Previous results in this setting have been limited to answering a single non-adaptive query repeatedly as the database grows [DNPR10, CSS11]. In contrast, we provide tools for richer and more adaptive analysis of growing databases. Our first contribution is a novel modification of the private multiplicative weights algorithm of [HR10], which provides accurate analysis of exponentially many adaptive linear queries (an expressive query class including all counting queries) for a static database. Our modification maintains the accuracy guarantee of the static setting even in the presence of unbounded data growth. Our second contribution is a set of more general techniques to adapt any existing algorithm providing privacy and accuracy in the static setting to the dynamic setting. Our techniques schedule black box access to a static algorithm as data accumulate, allowing for up-to-date analysis of growing data with only a small accuracy cost relative to the static setting. Our work gives the first private algorithms for answering adaptive queries in the dynamic setting.

*Author order is alphabetical and all authors contributed equally.

32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montréal, Canada.

1.1 Our results

Here we outline our two sets of results for adaptive analysis of dynamically growing databases. Throughout the paper, we refer to the setting in which a database of n elements from a universe of size N is fixed for the life of the analysis as the static setting, and we refer to the setting in which a database is accumulating new data entries while the analysis is ongoing as the dynamic setting. We use the standard definition of differential privacy, presented formally along with other notation in the preliminaries.

Adaptive linear queries for growing databases. Our first result is a novel modification of the private multiplicative weights (PMW) algorithm [HR10], a broadly useful algorithm for privately answering an adaptive stream of linear queries. The static PMW algorithm works by maintaining a public histogram that reflects the current estimate of the database given all previously answered queries. It categorizes incoming queries as either easy or hard, updating the histogram and suffering significant privacy loss only for the hard queries. The number of hard queries is bounded using a potential argument, where the potential is defined as the relative entropy between the true database and the public histogram. This quantity is initially bounded, decreases by a substantial amount after every hard query, and never increases.

The main challenge in adapting PMW to the dynamic setting is that new data increase the number of opportunities for privacy loss, harming the privacy-accuracy tradeoff. If we run static PMW on a growing database, the previous potential argument fails because the relative entropy between the database and the public histogram can increase as new data arrive.
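As a toy numeric illustration of this failure (with made-up numbers, not part of the formal analysis), the potential can be computed directly and seen to rise after adversarial growth:

```python
import math

def relative_entropy(x, y):
    """RE(x||y) = sum_i x_i * log(x_i / y_i), the potential used by PMW."""
    return sum(xi * math.log(xi / yi) for xi, yi in zip(x, y) if xi > 0)

# Universe of N = 2 types. PMW has learned the database well: the true
# database x and public histogram y agree, so the potential is exhausted.
x = [0.5, 0.5]
y = [0.5, 0.5]
before = relative_entropy(x, y)   # 0.0: no "budget" left for hard queries

# Adversarial growth: many new entries of type 0 arrive, shifting the
# true database away from the public histogram.
x_grown = [0.9, 0.1]
after = relative_entropy(x_grown, y)

# The potential increased, so the static bound on hard queries fails.
assert after > before
```
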
In the worst case, PMW can learn the true database with high accuracy (using many hard queries), and then adversarial data growth will change the composition of the database dramatically, essentially requiring the maximum possible number of additional hard queries to retain the same accuracy.

We modify PMW so that when new data arrive, the algorithm adds a uniform distribution to the public histogram and re-normalizes. This leads to no additional privacy loss and requires no assumptions on the actual distribution of the new data. This technique defends against adversarial data growth that could dramatically increase the relative entropy between the public histogram and the true database incorporating the new data, allowing us to maintain the accuracy guarantee of the static setting through unbounded data growth. Specifically, static PMW works on a fixed database of size n and answers k linear queries. In comparison, our modification for growing databases (PMWG) works on a database of starting size n and at each time step when the database is size t ≥ n answers up to κ · exp(√(t/n)) queries.

Theorem 1 (Informal version of Theorem 5). PMWG is ε-differentially private and for any stream with up to κ · exp(√(t/n)) queries at each time t ≥ n incurs additive error at most α = O((log N log κ / (εn))^{1/3}) for all queries with high probability.

This error bound is tight with respect to static PMW, which incurs additive error O((log N log k / (εn))^{1/3}) for only k total queries. This is somewhat surprising, given that the dynamic setting is strictly harder than the static setting. Even on just the first time step when t = n, PMWG must answer κ queries on a database of size n, and it achieves the same error guarantee on those queries as static PMW.
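For intuition about the growing query budget in Theorem 1, the per-step budget can be tabulated directly (constants are suppressed as in the informal statement; the values of κ and n are illustrative):

```python
import math

def query_budget(t, n, kappa):
    """Per-step query budget from Theorem 1 (informal): up to
    kappa * exp(sqrt(t/n)) queries may be answered at time t."""
    return kappa * math.exp(math.sqrt(t / n))

n, kappa = 100, 50
budgets = [query_budget(t, n, kappa) for t in range(n, 2 * n + 1)]

# At the first step (t = n) the budget is already kappa * e ...
assert abs(budgets[0] - kappa * math.e) < 1e-9
# ... and it strictly grows as the database grows.
assert all(b2 > b1 for b1, b2 in zip(budgets, budgets[1:]))
```
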
Static PMW terminates at this point, while PMWG will answer another κ · exp(√((n + 1)/n)) queries at the next time step and will continue answering queries as the database grows.

In the process of proving Theorem 1, we develop extensions of several static differentially private algorithms to the dynamic setting, which may be of independent interest for future work on the design of differentially private algorithms for growing databases. These algorithms are presented in Appendix C.

General transformations of static algorithms into algorithms for growing databases. Our second set of results consists of two methods, BBSCHEDULER and BBIMPROVER, for generically transforming a black box algorithm that is private and accurate in the static setting into an algorithm that is private and accurate in the dynamic setting. BBSCHEDULER reruns the black box algorithm every time the database increases in size (starting from n) by a small multiplicative factor, and it provides privacy and accuracy guarantees that are independent of the total number of queries and the current database size (Theorem 27). BBSCHEDULER instantiates each successive run of the black box algorithm with an exponentially shrinking privacy parameter to achieve any desired total privacy loss. The privacy parameter's decay is tied to database growth so that the two scale together, yielding a time-independent accuracy guarantee. We instantiate this scheduler using the SMALLDB algorithm for answering linear queries as a black box (Corollary 10).

Our second transformation, BBIMPROVER, runs the black box every time a new entry is added to the database. As with BBSCHEDULER, the privacy parameter decreases for successive calls to the black box, but in this case this shrinking eventually dominates the database growth to yield accuracy guarantees that improve as more data accumulate. This algorithm is well-suited for problems where data points are sampled from a distribution, where one would expect the accuracy guarantees of static analysis to improve with the size of the sample. We apply this scheduler to private empirical risk minimization (ERM) algorithms to output classifiers with generalization error that improves as the training database grows (Table 3).

The following informal theorem statement summarizes our results for BBSCHEDULER (Theorem 27) and BBIMPROVER (Theorem 29). Taken together, these results show that almost any private and accurate algorithm can be rerun at appropriate points of data growth with minimal loss of accuracy, even when data growth is unbounded.

Theorem 2 (Informal). Let M be an ε-differentially private algorithm that for some constant p incurs additive error α = Õ((1/(εn))^p) for all queries with high probability. Then,

1. BBSCHEDULER running M is ε-differentially private and incurs additive error α = Õ((1/(εn))^{p/(2p+1)}) for all queries with high probability.

2. BBIMPROVER running M is (ε, δ)-differentially private and incurs additive error α_t = Õ((√(log(1/δ)) / (ε√t))^p) for all queries at time t, for all t ≥ n, with high probability.

1.2 Related Work

Differential privacy for growing databases has been studied for a limited class of problems. We summarize the relationship between our work and the most relevant previous work in Table 1. Both [DNPR10] and [CSS11] adapted the notion of differential privacy to streaming environments in a setting where each entry in the database is a single bit, and bits arrive one per unit time. [DNPR10] and [CSS11] design differentially private algorithms for an analyst to maintain an approximately accurate count of the number of 1-bits seen thus far in the stream. This technique was later extended by [ST13] to maintain private sums of real vectors arriving online in a stream. We note that both of these settings correspond to only a single query repeatedly asked on a dynamic database, precluding meaningful adaptive analysis. In contrast, we consider the much richer class of linear queries, including 2^{|X|} counting queries, allowing for adaptive analysis of a dynamically growing database. Our setting also resembles the online learning setting, but differs in that we are interested in per-round accuracy bounds, rather than regret bounds. We discuss this connection in more detail in Appendix A, along with background on private adaptive analysis of static databases.

2 Preliminaries

All algorithms in this paper take as inputs a database over some fixed data universe X of finite size N. Our algorithms and analyses represent a finite database D ∈ X^n equivalently as a fractional histogram x ∈ Δ(X) ⊆ R^N, where x_i is the fraction of the database of type i ∈ [N]. When we say a database x ∈ Δ(X) has size n, this means that for each i ∈ [N] there exists some n_i ∈ N such that x_i = n_i/n.

Table 1: Asymptotic accuracy guarantees for answering adaptive linear queries

  Work                                 Database  Queries                        Accuracy
  Previous work:
    SmallDB [BLR08]                    static    linear queries, non-adaptive   fixed
    PMW [HR10]                         static    linear queries, adaptive       fixed
    Counting bits [DNPR10, CSS11]      dynamic   one fixed query, non-adaptive  improving as database grows
  Our work:
    PMWG                               dynamic   linear queries, adaptive       fixed
    BBSCHEDULER                        dynamic   any queries, adaptive          fixed
    BBIMPROVER                         dynamic   any queries, adaptive          improving as database grows

If an algorithm operates over a single fixed database, we refer to this as the static setting.
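The fractional histogram representation can be sketched in a few lines; encoding universe elements as integers in [N] is an illustrative choice, not a requirement of the formal model:

```python
from collections import Counter

def to_histogram(records, N):
    """Represent a database as a fractional histogram x in the simplex:
    x_i is the fraction of entries of type i, so x_i = n_i / n."""
    n = len(records)
    counts = Counter(records)
    return [counts[i] / n for i in range(N)]

# A database of n = 5 entries over a universe of N = 3 types.
db = [0, 0, 2, 1, 0]
x = to_histogram(db, N=3)

assert x == [0.6, 0.2, 0.2]          # each x_i = n_i / n
assert abs(sum(x) - 1.0) < 1e-12     # x lies in the simplex
```
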
In the dynamic setting, algorithms operate over a stream of databases, defined as a sequence of databases X = {x_t}_{t≥n} starting with a database x_n of size n at time t = n and increasing by one data entry per time step, so that t always denotes both a time and the size of the database at that time. Our dynamic algorithms also take a parameter n, which denotes the starting size of the database.

We consider algorithms that answer real-valued queries f : R^N → R with particular focus on linear queries. A linear query assigns a weight to each entry depending on its type and averages these weights over the database. We can interpret a linear query as a vector f ∈ [0, 1]^N and write the answer to the query on database x ∈ Δ(X) as ⟨f, x⟩, f(x), or x(f), depending on context. For f viewed as a vector, f^i denotes the ith entry. We note that an important special case of linear queries are counting queries, which calculate the proportion of entries in a database satisfying some boolean predicate over X.

Many of the algorithms we study allow queries to be chosen adaptively, i.e., the algorithm accepts a stream of queries F = {f_j}_{j=1}^k where the choice of f_{j+1} can depend on the previous j queries and answers. For the dynamic setting, we doubly index a stream of queries as F = {f_{t,:}}_{t≥n} = {{f_{t,j}}_{j=1}^{ℓ_t}}_{t≥n}, so that t denotes the size of the database at the time f_{t,j} is received and j = 1, . . . , ℓ_t indexes the queries received when the database is size t.

The algorithms studied produce outputs of various forms. To evaluate accuracy, we assume that an output y of an algorithm for query class F (possibly specified by an adaptively chosen query stream) can be interpreted as a function over F, i.e., we write y(f) to denote the answer to f ∈ F based on the mechanism's output. We seek to develop mechanisms that are accurate in the following sense.

Definition 1 (Accuracy in the static setting). For α, β > 0, an algorithm M is (α, β)-accurate for real query class F if for any input database x ∈ Δ(X), the algorithm outputs y such that |f(x) − y(f)| ≤ α for all f ∈ F with probability at least 1 − β.

In the dynamic setting, accuracy must be with respect to the current database, and the bounds may be parametrized by time.

Definition 2 (Accuracy in the dynamic setting). For α_n, α_{n+1}, . . . > 0 and β > 0, an algorithm M is ({α_t}_{t≥n}, β)-accurate for query stream F = {f_{t,:}}_{t≥n} if for any input data stream X = {x_t}_{t≥n}, the algorithm outputs y such that |f_{t,j}(x_t) − y(f_{t,j})| ≤ α_t for all f_{t,j} ∈ F with probability at least 1 − β.

2.1 Differential privacy and composition lemmas

Differential privacy in the static setting requires that an algorithm produce similar outputs on neighboring databases x ∼ x′, which differ by a single entry. In the dynamic setting, differential privacy requires similar outputs on neighboring database streams X, X′ that satisfy, for some t ≥ n, x_τ = x′_τ for τ = n, . . . , t − 1 and x_τ ∼ x′_τ for τ = t, t + 1, . . . . In the definition below, a pair of neighboring inputs refers to a pair of neighboring databases in the static setting or a pair of neighboring database streams in the dynamic setting. We note that in the dynamic setting, an element in Range(M) is an entire (potentially infinite) transcript of outputs that may be produced by M.

Definition 3 (Differential privacy [DMNS06]). For ε, δ ≥ 0, an algorithm M is (ε, δ)-differentially private if for any pair of neighboring inputs x, x′ and any subset S ⊆ Range(M),

  Pr[M(x) ∈ S] ≤ e^ε · Pr[M(x′) ∈ S] + δ.

When δ = 0, we will say that M is ε-differentially private.

Differential privacy is typically achieved by adding random noise that scales with the sensitivity of the computation being performed. The sensitivity of any real-valued query f : Δ(X) → R is the maximum change in the query's answer due to the change of a single entry in the database, denoted Δf = max_{x∼x′} |f(x) − f(x′)|. Note that a linear query on a database of size n has sensitivity 1/n.

The following composition theorems quantify how the privacy guarantee degrades as additional computations are performed on a database.

Theorem 3 (Basic composition, [DMNS06]). Let M_i be an ε_i-differentially private algorithm for all i ∈ [k]. Then the composition M defined as M(x) = (M_i(x))_{i=1}^k is ε-differentially private for ε = Σ_{i=1}^k ε_i.

Theorem 4 (CDP composition, Corollary of [BS16]). Let M_i be an ε_i-differentially private algorithm for all i ∈ [k]. Then the composition M defined as M(x) = (M_i(x))_{i=1}^k is (ε, δ)-differentially private for ε = (1/2)(Σ_{i=1}^k ε_i²) + √(2(Σ_{i=1}^k ε_i²) log(1/δ)). In particular, for δ ≤ e^{−1} and Σ_{i=1}^k ε_i² ≤ 1, we have ε ≤ 2√((Σ_{i=1}^k ε_i²) log(1/δ)).

3 Adaptive linear queries for growing databases

In this section we show how to modify the static private multiplicative weights (PMW) algorithm [HR10] for the dynamic setting to allow for private and accurate adaptive analysis of a growing database. Static PMW answers an adaptive stream of linear queries while maintaining a public histogram y reflecting the current estimate of the static database x given all previously answered queries. Critical to the performance of the algorithm is that it uses the public histogram to categorize incoming queries as either easy or hard, and it updates the histogram after hard queries in a way that moves it closer to a correct answer on that query. The number of hard queries is bounded using a potential argument, where potential is defined as the relative entropy between the database and the public histogram, i.e., RE(x‖y) = Σ_{i∈[N]} x_i log(x_i/y_i).
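For instance, against the uniform histogram used at initialization, the potential is at most log N, with equality for a point-mass database (a small sanity check with illustrative values):

```python
import math

def relative_entropy(x, y):
    """RE(x||y) = sum over i of x_i * log(x_i / y_i)."""
    return sum(xi * math.log(xi / yi) for xi, yi in zip(x, y) if xi > 0)

N = 8
uniform = [1 / N] * N

# Against a uniform y, RE(x||y) = log N - H(x) <= log N,
# with equality when x is a point mass.
point_mass = [1.0] + [0.0] * (N - 1)
assert abs(relative_entropy(point_mass, uniform) - math.log(N)) < 1e-12

for x in ([1 / N] * N, [0.5, 0.5] + [0.0] * (N - 2)):
    assert relative_entropy(x, uniform) <= math.log(N) + 1e-12
```
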
This quantity is initially bounded, decreases by a substantial amount after every hard query, and never increases. However, this argument does not extend to the dynamic setting because the potential can increase with the arrival of new data. We instead modify the algorithm so the public histogram updates in response to new data arrivals as well as hard queries. This modification allows us to suffer only constant loss in accuracy per query relative to the static setting, while maintaining this accuracy through unbounded data growth and a growing query budget at each stage of growth. Table 2 compares our results to the static setting.

We remark that PMW runs in time linear in the data universe size N. If the incoming data entries are drawn from a distribution that satisfies a mild smoothness condition, a compact representation of the data universe can significantly reduce the runtime [HR10]. The same idea applies to our modification of PMW for the dynamic setting without requiring new technical tools.

3.1 Private multiplicative weights for growing databases (PMWG)

Our algorithm for PMW for growing databases (PMWG) is given as Algorithm 1 in Appendix B. We give an overview here to motivate our main results. The algorithm takes as inputs a data stream X = {x_t}_{t≥n} and an adaptively chosen query stream F = {{f_{t,j}}_{j=1}^{ℓ_t}}_{t≥n}. It also accepts privacy and accuracy parameters ε, δ, α > 0, although in this section we consider the case that δ = 0.

The algorithm maintains a fractional histogram y over X, where y_{t,j} denotes the histogram after the jth query at time t has been processed. This histogram is initialized to uniform, i.e., y^i_{n,0} = 1/N for all i ∈ [N]. As with static PMW, when a query is deemed hard, our algorithm performs a multiplicative weights update of y with learning rate α/6. As an extension of the static case, we also update the weights of y when a new data entry arrives to reflect a data-independent prior belief that data arrive from a uniform distribution, i.e., for all t > n, i ∈ [N],

  y^i_{t,0} = ((t − 1)/t) · y^i_{t−1,ℓ_{t−1}} + 1/(tN).

It is important to note that a multiplicative weights update depends only on the noisy answer to a hard query as in the static case, and the uniform update only depends on the knowledge that a new entry arrived, so this histogram can be thought of as public.

As in static PMW, we determine hardness using a numeric sparse subroutine. As part of our proof, we adapt the Numeric Sparse and the underlying Above Threshold algorithms of [DNR+09] to the dynamic setting. The proofs for our dynamic versions of these algorithms are in Appendix C and may be of independent interest for future work in the design of private algorithms for growing databases.

We now present our main result for PMWG, Theorem 5. We sketch its proof here and give the full proof in Appendix B.1. Whereas the accuracy results for static PMW are parametrized by the total allowed queries k, our noise scaling means our algorithm can accommodate more and more queries as new data continue to arrive. Our accuracy result is with respect to a query stream respecting a query budget. This budget increases at each time t by a quantity increasing exponentially with √t, and it is parametrized by some time-independent κ ≥ 1, which is somewhat analogous to the total query budget k in static PMW. This theorem tells us that PMWG can accommodate poly(κ) queries on the original database. Since κ degrades accuracy logarithmically, this means we can accurately answer exponentially many queries before any new data arrive.
In particular, our accuracy bounds are tight with respect to the static setting², and we maintain this accuracy through unbounded data growth, subject to a generous query budget specified by the theorem's bound on Σ_{τ=n}^t ℓ_τ.

Theorem 5. The algorithm PMWG(X, F, ε, 0, α, n) is (ε, 0)-differentially private, and for any time-independent κ ≥ 1 and β > 0 it is (α, β)-accurate for any query stream F such that

  Σ_{τ=n}^t ℓ_τ ≤ κ Σ_{τ=n}^t exp(α³ε√(nτ) / (C log(Nn)))

for all t ≥ n and sufficiently large constant C, as long as N ≥ 3, n ≥ 21, and

  α ≥ C (log(Nn) log(κn/β) / (nε))^{1/3}.

Proof sketch. The proof hinges on showing that we do not have to answer too many hard queries, even as the composition of the database changes with new data, which can increase the relative entropy between the database and the public histogram. We first show that our new public histogram update rule bounds this relative entropy increase (Lemma 6), and then our bound on the number of hard queries suffers accordingly relative to static PMW (Corollary 7).

Lemma 6. Let x, y, x̄, ȳ ∈ Δ(X) be databases of size t, t, t + 1, t + 1, respectively, where x̄ is obtained by adding one entry to x and ȳ^i = (t/(t+1)) y^i + 1/((t+1)N) for i ∈ [N]. Then,

  RE(x̄‖ȳ) − RE(x‖y) ≤ (log N)/(t+1) + (log t)/(t+1) + log((t+1)/t).

The corollary below comes from a straightforward modification of the proof of the bound on hard queries in static PMW using the result above.

Corollary 7. If the numeric sparse subroutine returns α/3-accurate answers for each query for a particular run of PMWG, then the total number of hard queries answered by any time t ≥ n is

  Σ_{τ=n}^t h_τ ≤ (36/α²) (log N + Σ_{τ=n+1}^t [(log N)/τ + (log(τ−1))/τ + log(τ/(τ−1))]).

With this corollary, we separately prove privacy and accuracy (Theorems 11 and 12) in terms of the noise function ξ, which yield our desired result when instantiated with the ξ specified by Algorithm 1. As with static PMW, privacy is leaked only by the numeric sparse subroutine. Privacy loss depends in the usual ways on the noise parameter, query sensitivity, and number of hard queries, although in our setting both the noise parameter and query sensitivity change over time.

²This tightness claim assumes n = O(poly(N)). We think of PMW as being useful in this setting when the data universe is large relative to the size of the database; otherwise an analyst could learn the dataset more accurately with N ≪ n counting queries using output perturbation.

Table 2: Asymptotic accuracy guarantees for answering adaptive linear queries

  Work       Setting   Accuracy for (ε, 0)-DP             Accuracy for (ε, δ)-DP
  [HR10]     Static    (log N log(k/β) / (εn))^{1/3}      (log^{1/2} N log(k/β) log^{1/2}(1/δ) / (εn))^{1/2}
  This work  Dynamic   (log(Nn) log(κn/β) / (εn))^{1/3}   (log^{1/2}(Nn) log(κn/β) log^{1/2}(1/δ) / (εn))^{1/2}

After the proof of the above theorem in Appendix B.1, Theorem 16 generalizes PMWG as specified by Equation (B.5). This generalization leaves a free parameter in the noise function ξ used by the subroutine, allowing one to trade off between accuracy and a query budget that increases more with time. See Observation 17.

We remark that we can tighten our accuracy bounds if we allow (ε, δ)-differential privacy and use CDP composition [BS16]. These results are proven in Appendix B.2 and included informally in Table 2.

Theorem 8. The algorithm PMWG(X, F, ε, δ, α, n) is (ε, δ)-differentially private for any ε ∈ (0, 1], δ ∈ (0, e^{−1}), and for any time-independent κ ≥ 1 and β ∈ (0, 2^{−15/2}) it is (α, β)-accurate for any query stream F such that

  Σ_{τ=n}^t ℓ_τ ≤ κ Σ_{τ=n}^t exp(α²ε√(nτ) / (C log^{1/2}(Nn) log^{1/2}(1/δ)))

for all t ≥ n and sufficiently large constant C, as long as N ≥ 3, n ≥ 17, and

  α ≥ C (log^{1/2}(Nn) log^{1/2}(1/δ) log(κn/β) / (nε))^{1/2}.

4 General transformations from static to dynamic settings

In this section, we give two schemes for answering a stream of queries on a growing database, given black box access to a differentially private algorithm for the static setting.³ In Section 4.1, we describe an algorithm BBSCHEDULER for scheduling repeated runs of a static algorithm. BBSCHEDULER runs an underlying offline mechanism with exponentially decreasing frequency and offers the same accuracy guarantee at every point in data growth. We instantiate BBSCHEDULER with the SmallDB algorithm as an illustrative example. In Section 4.2, we describe a second algorithm BBIMPROVER, which runs an underlying mechanism at every time step. Its results are initially inferior but improve over BBSCHEDULER with sufficient data growth. This result is well-suited for problems where data points are sampled from a distribution, where one would expect the accuracy guarantees of static analysis to improve with the size of the sample. We showcase our result by applying it to solve private empirical risk minimization on a growing database.
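The rerun schedule underlying BBSCHEDULER can be sketched as follows; the geometric privacy split shown is illustrative only (the paper ties the true decay rate to the growth parameter; exact settings appear in Appendix D), meant to show that shrinking per-epoch budgets keep the total privacy loss bounded:

```python
def epoch_schedule(n, eta, epochs):
    """BBSCHEDULER-style rerun times: the black box is rerun whenever the
    database has grown by a multiplicative (1 + eta) factor."""
    return [int(n * (1 + eta) ** i) for i in range(epochs)]

def privacy_split(eps, epochs, ratio=0.5):
    """Illustrative geometric split of the overall budget eps across
    epochs; by basic composition the total loss is sum(eps_i) < eps."""
    return [eps * (1 - ratio) * ratio ** i for i in range(epochs)]

times = epoch_schedule(n=1000, eta=0.1, epochs=5)
eps_i = privacy_split(eps=1.0, epochs=5)

assert times[0] == 1000 and times == sorted(times)
# Basic composition over all runs stays within the overall budget.
assert sum(eps_i) <= 1.0
# Reruns become less frequent in absolute time: the gaps grow.
gaps = [b - a for a, b in zip(times, times[1:])]
assert all(g2 >= g1 for g1, g2 in zip(gaps, gaps[1:]))
```
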
We formalize these algorithms and give privacy and accuracy guarantees in full generality in Appendix D.

4.1 Fixed accuracy as data accumulate

In this section, we give results for using a private and accurate algorithm for the static setting as a black box to solve the analogous problem in the dynamic setting. Our general purpose algorithm BBSCHEDULER treats a static algorithm as a black box endowed with privacy and accuracy guarantees, and it reruns the black box whenever the database grows by a small multiplicative factor. This schedule can be applied to any algorithm that satisfies ε-differential privacy and (α, β)-accuracy for α of a certain form, as specified in Definition 4 below.

Definition 4 ((p, g)-black box). An algorithm M(x_n, ε, α, β, n) is a (p, g)-black box for a class of linear queries F if it is (ε, 0)-differentially private and with probability 1 − β it outputs y : F → R such that |y(f) − x_n(f)| ≤ α for every f ∈ F when α ≥ g · (log(1/β)/(εn))^p, for some g that is independent of ε, n, β.

The parameter g captures dependence on domain-specific parameters that affect accuracy of the black box algorithm, such as the dependence on log N for SMALLDB.

³For ease of presentation, we restrict our results to accuracy of real-valued queries, but the algorithms we propose could be applied to settings with more general notions of accuracy or to settings where the black box algorithm itself can change across time steps, adding to the adaptivity of this scheme.
are constant, then α = Θ((log(1/β)/(εn))^p). As a concrete example, see Corollary 10 and the surrounding discussion for an instantiation of BBSCHEDULER with the SMALLDB algorithm as a black box.

Our generic algorithm BBSCHEDULER runs the black box M(x_{t_i}, ε_i, α_i, β_i, t_i) at times {t_i}_{i=0}^∞ for t_i = (1 + η)^i · n, with parameters as listed below, and receives output y_i. Upon receipt of a query f_{t,j} for t ∈ [t_i, t_{i+1}), we output y_i(f_{t,j}). We give the δ = 0 case below; the full algorithm, including parameter settings for the δ > 0 case, is presented in Appendix D. For δ = 0, the failure probabilities are β_i = β/2^{i+1} (so that the β_i sum to β), the privacy parameters ε_i decrease geometrically in i (so that the ε_i sum to ε), and the per-epoch accuracy is α_i = g · (log(1/β_i)/(ε_i(1+η)^i n))^p.

There are two key technical properties that allow this result to hold. First, since the epochs are exponentially far apart, the total privacy loss from multiple calls to M is not too large. Second, each data point added to a database of size t can only change a linear query by roughly 1/t, so since a database grows by ηt_i in epoch i, an answer to a query at the end of epoch i using y_i incurs at most η extra additive error relative to a query issued at time t_i. We now state our main result for BBSCHEDULER, including the result for δ > 0.

Theorem 9. Let M be a (p, g)-black box for query class F. Then for any database stream X and stream of linear queries F over F, BBSCHEDULER(X, F, M, ε, δ, β, n, p, g) is (ε, δ)-differentially private for ε < 1 and (α, β)-accurate for sufficiently large constant C and

    α ≥ C · g^{1/(2p+1)} · (log(1/β)/(εn))^{p/(2p+1)}                              if δ = 0,
    α ≥ C · g^{1/(1.5p+1)} · (√(log(1/δ)) · log(1/β)/(εn))^{p/(1.5p+1)}            if δ > 0.

For concreteness, we instantiate this general result with SMALLDB [BLR08], a differentially private algorithm for generating a synthetic database y that closely approximates a true database x on every query from some fixed set F of k linear queries. Specifically, SMALLDB outputs some y : F → R such that |y(f) − x(f)| ≤ α for every f ∈ F when α ≥ C · ((log N · log k + log(1/β))/(εn))^{1/3}. SMALLDB is thus a (1/3, C(log N log k)^{1/3})-black box for an arbitrary set of k linear queries over a data universe of size N, and so we have the following corollary of Theorem 27 (with p = 1/3, the exponent p/(2p+1) becomes 1/5).

Corollary 10. BBSCHEDULER instantiated with SMALLDB is ε-differentially private and can answer all queries in F with (α, β)-accuracy for sufficiently large constant C and

    α ≥ C · (log N · log |F| · log(1/β)/(εn))^{1/5}.

4.2 Improving accuracy as data accumulate

In some applications it is more natural for accuracy bounds to improve as the database grows. For instance, in empirical risk minimization (ERM), we expect to be able to find classifiers with diminishing empirical risk, which implies diminishing generalization error.

We can extend our black box scheduler framework to allow for accuracy guarantees that improve as data accumulate. Like our first scheduler, our new algorithm BBIMPROVER takes in a private and accurate static black box M. Unlike the first scheduler, it reruns M on the current database at every time step.
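To make the contrast concrete, here is a minimal code sketch of BBSCHEDULER's epoch schedule. This is an illustration, not the paper's algorithm: the `black_box` interface and the particular geometric split of the privacy budget (chosen here so that the ε_i sum to exactly ε under basic composition) are assumptions of this sketch; the exact parameter settings appear in Appendix D.

```python
def epsilon_schedule(eps, eta, num_epochs):
    """One illustrative geometric split of the privacy budget across epochs:
    eps_i = eta*(1+eta)*eps / (1+eta)**(i+2). Summed over all i >= 0 this
    geometric series equals eps, so basic composition bounds total loss by eps."""
    return [eta * (1 + eta) * eps / (1 + eta) ** (i + 2) for i in range(num_epochs)]

def run_scheduler(stream, n, eps, eta, queries, black_box):
    """Rerun `black_box` (a stand-in for a (p, g)-black box M) each time the
    database reaches the next epoch boundary t_i = (1+eta)**i * n, and answer
    every query from the synopsis built at the start of the current epoch.

    stream:  iterable of data points, one arriving per time step
    queries: dict mapping time t -> a query function of the synopsis
    """
    data, answers, synopsis = [], {}, None
    i, next_rerun = 0, n
    for t, point in enumerate(stream, start=1):
        data.append(point)
        if t >= next_rerun:  # database has grown by a (1+eta) factor
            eps_i = eta * (1 + eta) * eps / (1 + eta) ** (i + 2)
            synopsis = black_box(list(data), eps_i)  # fresh synopsis y_i
            i += 1
            next_rerun = (1 + eta) ** i * n
        if t in queries and synopsis is not None:
            # mid-epoch queries are answered from the cached y_i,
            # ignoring points that arrived since t_i
            answers[t] = queries[t](synopsis)
    return answers
```

BBIMPROVER, by contrast, calls the black box at every time step rather than only at epoch boundaries, at the cost of composing many more private computations.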
The algorithm no longer incurs accuracy loss from ignoring new data points mid-epoch because it runs M at every time step. However, this also means that privacy loss accumulates much faster, because more computations are being composed. To combat this and achieve overall privacy loss ε, each run of M will have an increasingly strict (i.e., smaller) privacy parameter ε_t. The additional noise needed to preserve privacy will overpower the improvements in accuracy until the database grows sufficiently large (t ≳ n²), when the accuracy of BBIMPROVER will surpass the comparable fixed accuracy guarantee of BBSCHEDULER. Our BBIMPROVER algorithm and general results (Theorem 29) are presented in Appendix D. We also instantiate BBIMPROVER with various private ERM algorithms in Theorem 31 in Appendix E.

Acknowledgements

R.C. and S.K. supported in part by a Mozilla Research Grant. K.L. supported in part by NSF grant IIS-1453304. U.T. supported in part by NSF grants CCF-24067E5 and CCF-1740776, and by a Georgia Institute of Technology ARC fellowship.

References

[AS17] Naman Agarwal and Karan Singh. The price of differential privacy for online learning. In International Conference on Machine Learning (ICML), 2017.

[BLR08] Avrim Blum, Katrina Ligett, and Aaron Roth. A learning theory approach to non-interactive database privacy. In Proceedings of the 40th Annual ACM Symposium on Theory of Computing, STOC '08, pages 609–618, 2008.

[BNS+16] Raef Bassily, Kobbi Nissim, Adam D. Smith, Thomas Steinke, Uri Stemmer, and Jonathan Ullman. Algorithmic stability for adaptive data analysis. In Proceedings of the 48th Annual ACM Symposium on Theory of Computing, STOC '16, 2016.

[BS16] Mark Bun and Thomas Steinke. Concentrated differential privacy: Simplifications, extensions, and lower bounds.
In Proceedings of the 13th Conference on Theory of Cryptography, TCC '16, pages 635–658, 2016.

[BST14] Raef Bassily, Adam Smith, and Abhradeep Thakurta. Differentially private empirical risk minimization: Efficient algorithms and tight error bounds. In Proceedings of the 2014 IEEE 55th Annual Symposium on Foundations of Computer Science, FOCS '14, pages 464–473, 2014.

[CLN+16] Rachel Cummings, Katrina Ligett, Kobbi Nissim, Aaron Roth, and Zhiwei Steven Wu. Adaptive learning with robust generalization guarantees. In Proceedings of the 29th Annual Conference on Learning Theory, COLT '16, pages 772–814, 2016.

[CSS11] T.-H. Hubert Chan, Elaine Shi, and Dawn Song. Private and continual release of statistics. ACM Transactions on Information and System Security, 14(3):26, 2011.

[DFH+15] Cynthia Dwork, Vitaly Feldman, Moritz Hardt, Toniann Pitassi, Omer Reingold, and Aaron Roth. The reusable holdout: Preserving validity in adaptive data analysis. Science, 349(6248):636–638, 2015.

[DMNS06] Cynthia Dwork, Frank McSherry, Kobbi Nissim, and Adam Smith. Calibrating noise to sensitivity in private data analysis. In Proceedings of the 3rd Conference on Theory of Cryptography, TCC '06, pages 265–284, 2006.

[DNPR10] Cynthia Dwork, Moni Naor, Toniann Pitassi, and Guy N. Rothblum. Differential privacy under continual observation. In Proceedings of the 42nd ACM Symposium on Theory of Computing, STOC '10, 2010.

[DNR+09] Cynthia Dwork, Moni Naor, Omer Reingold, Guy N. Rothblum, and Salil Vadhan. On the complexity of differentially private data release. In Proceedings of the 41st Annual ACM Symposium on Theory of Computing, STOC '09, pages 381–390, 2009.

[DR14] Cynthia Dwork and Aaron Roth. The algorithmic foundations of differential privacy. Foundations and Trends in Theoretical Computer Science, 9(3–4):211–407, 2014.

[HR10] Moritz Hardt and Guy N. Rothblum.
A multiplicative weights mechanism for privacy-preserving data analysis. In Proceedings of the 51st Annual IEEE Symposium on Foundations of Computer Science, FOCS '10, pages 61–70, 2010.

[Jam16] G. J. O. Jameson. The incomplete gamma functions. The Mathematical Gazette, 100(548):298–306, 2016.

[JKT12] Prateek Jain, Pravesh Kothari, and Abhradeep Thakurta. Differentially private online learning. In Proceedings of the 25th Annual Conference on Learning Theory, COLT '12, pages 1–34, 2012.

[JLE14] Zhanglong Ji, Zachary C. Lipton, and Charles Elkan. Differential privacy and machine learning: a survey and review. arXiv preprint arXiv:1412.7584, 2014.

[KST+12] Daniel Kifer, Adam Smith, and Abhradeep Thakurta. Private convex empirical risk minimization and high-dimensional regression. In Proceedings of the 25th Annual Conference on Learning Theory, COLT '12, pages 1–40, 2012.

[SSSSS09] Shai Shalev-Shwartz, Ohad Shamir, Nathan Srebro, and Karthik Sridharan. Stochastic convex optimization. In Proceedings of the 22nd Annual Conference on Learning Theory, COLT '09, 2009.

[ST13] Adam Smith and Abhradeep Guha Thakurta. (Nearly) optimal algorithms for private online learning in full information and bandit settings. In Advances in Neural Information Processing Systems, NIPS '13, pages 2733–2741, 2013.