{"title": "The Everlasting Database: Statistical Validity at a Fair Price", "book": "Advances in Neural Information Processing Systems", "page_first": 6531, "page_last": 6540, "abstract": "The problem of handling adaptivity in data analysis, intentional or not, permeates\n a variety of fields, including test-set overfitting in ML challenges and the\n accumulation of invalid scientific discoveries.\n We propose a mechanism for answering an arbitrarily long sequence of\n potentially adaptive statistical queries, by charging a price for\n each query and using the proceeds to collect additional samples.\n Crucially, we guarantee statistical validity without any assumptions on\n how the queries are generated. We also ensure with high probability that\n the cost for $M$ non-adaptive queries is $O(\\log M)$,\n while the cost to a potentially adaptive user who makes $M$\n queries that do not depend on any others is $O(\\sqrt{M})$.", "full_text": "The Everlasting Database:\n\nStatistical Validity at a Fair Price\n\nBlake Woodworth\nToyota Technological\nInstitute at Chicago\n\nVitaly Feldman\n\nGoogle\n\nSaharon Rosset\nTel Aviv University\n\nNathan Srebro\n\nToyota Technological\nInstitute at Chicago\n\nAbstract\n\nThe problem of handling adaptivity in data analysis, intentional or not, permeates a\nvariety of \ufb01elds, including test-set over\ufb01tting in ML challenges and the accumula-\ntion of invalid scienti\ufb01c discoveries. We propose a mechanism for answering an\narbitrarily long sequence of potentially adaptive statistical queries, by charging a\nprice for each query and using the proceeds to collect additional samples. Crucially,\nwe guarantee statistical validity without any assumptions on how the queries are\ngenerated. 
We also ensure with high probability that the cost for M non-adaptive queries is O(log M), while the cost to a potentially adaptive user who makes M queries that do not depend on any others is O(√M).

1 Introduction

Consider the problem of running a server that provides the test loss of a model on held out data, e.g. for evaluation in a machine learning challenge. We would like to ensure that all test losses returned by the server are accurate estimates of the true generalization error of the predictors. Returning the empirical error on held out test data would initially be a good estimate of the generalization error. However, an analyst can use the empirical errors to adjust their model and improve their performance on the test data. In fact, with a number of queries only linear in the amount of test data, one can easily create a predictor that completely overfits, having empirical error on the test data that is artificially small [5, 12]. Even without such intentional overfitting, sequential querying can lead to unintentional adaptation since analysts are biased toward tweaks that lead to improved test errors.

If the queries were non-adaptive, i.e. the sequence of predictors is not influenced by previous test results, then we could handle a much larger number of queries before overfitting: a number exponential in the size of the dataset. Nevertheless, the test set will eventually be "used up" and estimates of the test error (specifically those of the best performers) might be over-optimistic.

A similar situation arises in other contexts such as validating potential scientific discoveries.
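To make the overfitting threat above concrete, here is a toy simulation (ours, not from the paper) of the classic attack in the spirit of [5, 12]: with a number of queries linear in the test-set size, querying random predictors and aggregating the above-chance ones by majority vote produces a predictor whose empirical test accuracy is far above its true accuracy of 0.5.

```python
import random

random.seed(0)
n = 100                                          # size of the held-out test set
labels = [random.choice([0, 1]) for _ in range(n)]

def empirical_accuracy(pred):
    return sum(p == y for p, y in zip(pred, labels)) / n

# Adaptive attack: query the empirical accuracy of many random predictors,
# keep the above-chance ones, and aggregate them by majority vote.
kept = []
for _ in range(300):                             # only ~3n queries
    f = [random.choice([0, 1]) for _ in range(n)]
    if empirical_accuracy(f) > 0.5:              # this is the adaptive step
        kept.append(f)

majority = [int(sum(f[i] for f in kept) * 2 > len(kept)) for i in range(n)]
print(empirical_accuracy(majority))              # well above the true accuracy of 0.5
```

The labels are uniformly random, so no predictor can truly beat 0.5 accuracy on fresh data; the high empirical score is pure overfitting to the held-out set.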
One can evaluate potential discoveries using set-aside validation data, but if analyses are refined adaptively based on the results, one may again overfit the validation data and arrive at false discoveries [14, 17].

One way to ensure the validity of answers in the face of adaptive querying is to collect all queries before giving any answers, and answer them all at once, e.g. at the end of a competition. However, analysts typically want more immediate feedback, both for ML challenges and in scientific research. Additionally, if we want to answer more queries later, ensuring statistical validity would require collecting a whole new dataset. This might be unnecessarily expensive if few or none of the queries are in fact adaptive. It also raises the question of who should bear the cost of collecting new data.

Alternatively, we could try to limit the number or frequency of queries from each user, forbid adaptive querying, or assume users work independently of each other, remaining oblivious to other users' queries and answers. However, it is nearly impossible to enforce such restrictions. Determined users can avoid querying restrictions by creating spurious user accounts and working in groups; there is no feasible way to check if queries are chosen adaptively; and information can leak between analysts, intentionally or not, e.g. through explicit collaboration or published results.

32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montréal, Canada.

In this paper, we address the fundamental challenge of providing statistically valid answers to an arbitrarily long sequence of potentially adaptive queries. We assume that it is possible to collect additional samples from the same data distribution at a fixed cost per sample. To pay for new samples, users of the database will be charged for their queries.
We propose a mechanism, EVERLASTINGVALIDATION, that guarantees "everlasting" statistical validity and maintains the following properties:

Validity: Without any assumptions about the users, and even with arbitrary adaptivity, with high probability, all answers ever returned by the database are accurate.

Self-Sustainability: The database collects enough revenue to purchase as many new samples as necessary in perpetuity, and can answer an unlimited number of queries.

Cost for Non-Adaptive Users: With high probability, a user making M non-adaptive queries will pay at most O(log M), so the average cost per query decreases as Õ(1/M).

Cost for Autonomous Users: With high probability, a user (or group of users) making M potentially adaptive queries that depend on each other arbitrarily, but not on any queries made by others, will pay at most Õ(√M), so the average cost per query decreases as Õ(1/√M).

We emphasize that the database mechanism needs no notion of "user" or "account" when answering the queries; it does not need to know which "user" made which query; and most of all, it does not need to know whether a query was made adaptively or not. Rather, the cost guarantees hold for any collection of queries that are either non-adaptive or autonomous in the sense described above: a "user" could thus refer to a single individual, or if an analyst uses answers from another person's queries, we can consider them together as an "autonomous user" and get cost guarantees based on their combined number of queries. The database's cost guarantees are nearly optimal; the cost to non-adaptive users and the cost to autonomous users cannot be improved (beyond log-factors) while still maintaining validity and sustainability (Section 5).

As is indicated by the guarantees above, using the mechanism adaptively may be far more expensive than using it non-adaptively.
We view this as a positive feature. Although we cannot enforce non-adaptivity, and it is sometimes unreasonable to expect that analysts are entirely non-adaptive, we intend the mechanism to be used for validation. That is, analysts should do their discovery, training, tuning, development, and adaptive data analysis on unrestricted "training" or "discovery" datasets, and only use the protected database when they wish to receive a stamp of approval on their model, predictor, or discovery. Instead of trying to police or forbid adaptivity, we discourage it with pricing, but in a way that is essentially guaranteed not to affect non-adaptive users. Further, users will need to pay a high price only when their queries explicitly cause overfitting, so only adaptivity that is harmful to statistical validity will be penalized.

Relationship to prior work: Our work is inspired by a number of mechanisms for dealing with potentially adaptive queries that have been proposed and analyzed using techniques from differential privacy and information theory. These mechanisms handle only a pre-determined number of queries using a fixed dataset. We use techniques developed in this literature, in particular the addition of noise to ensure that a quadratically larger number of adaptive queries can be answered in the worst case [6, 10]. Our main innovations over this prior work are the self-sustaining nature of the database, as opposed to handling only a pre-determined number of queries of each type, and also the per-query pricing scheme that places the cost burden on the adaptive users. To ensure that the cost burden on non-adaptive users does not grow by more than a constant factor, we need to adapt existing algorithms.

LADDER [5] and SHAKYLADDER [16] are mechanisms tailored to maintaining an ML competition leaderboard.
These algorithms reveal the answer to a user's query for the error of their model only if it is significantly lower than the error of the previous best submission from the user. While these mechanisms can handle an exponential number of arbitrarily adaptive submissions, each user will receive answers to a relatively small number of queries. Our setting is more suitable for the case where we want to validate the errors of all submissions, or for scientific discovery where there is more than one discovery to be made.

A separate line of work in the statistics literature on "Quality Preserving Databases" (Aharoni and Rosset [2] and references therein) has suggested schemes for databases that maintain everlasting validity, while charging for use. The fundamental difference from our work is that these schemes do not account for adaptivity and thus are limited to non-adaptive querying. A second difference is that they focus on hypothesis testing for scientific discovery, with pricing schemes that depend on considerations of statistical power, which are not part of our framework. We further compare with existing methods at the end of Section 4.

2 Model formulation

We consider a setting in which a database curator has access to samples from some unknown distribution D over a sample space X. Multiple analysts submit a sequence of statistical queries q_i : X → [0, 1], the database responds with answers a_i ∈ R, and the goal is to ensure that with high probability, all answers satisfy |a_i − E_{x∼D}[q_i(x)]| ≤ τ for some fixed accuracy parameter τ. In a prediction validation application, each query would measure the expected loss of a particular model, while in scientific applications a single query might measure the value of some phenomenon of interest, or compare it to a "null" reference. We denote by Q the set of all possible queries, i.e. measurable functions q : X →
[0, 1], and use the shorthand E[q] = E_{x∼D}[q(x)] to denote the mean value (desired answer) for each query. Given a data sample S ∼ D^n, we use E_S[q] = (1/|S|) Σ_{x∈S} q(x) as shorthand for the empirical mean of q on S.

In our framework, the database can, at any time, acquire new samples from D at some fixed cost per sample, e.g. by running more experiments or paying workers to label more data. To answer a given query, the database can use the samples it has already purchased in any way it chooses, and the database is allowed to charge analysts for their queries in order to purchase additional samples. The price p_i of query q_i may be determined by the database after it receives query q_i, allowing the database to charge more for queries that force it to collect more data.

We do not assume the queries are chosen in advance, and instead allow the sequence of queries to depend adaptively on past answers. More formally, we define a "querying rule" R_i : (Q × R × R)^{i−1} → Q as a randomized mapping from the history of all previously made queries and their answers and prices to the statistical query to be made next:

q_i = R_i((q_1, a_1, p_1), (q_2, a_2, p_2), ..., (q_{i−1}, a_{i−1}, p_{i−1})).

The interaction of users with the database can then be modeled as a sequence of querying rules {R_i}_{i∈N}. The combination of the data distribution, database mechanism, and sequence of querying rules together defines a joint distribution over queries, answers, and prices {Q_i, A_i, P_i}_{i∈N}. All our results will hold for any data distribution and any querying sequence, with high probability over {Q_i, A_i, P_i}_{i∈N}.

We think of the query sequence as representing a combination of queries from multiple users, but the database itself is unaware of the identity or behavior of the users.
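In this notation, a query is just a bounded function of a single sample, and its empirical mean on a sample concentrates around its true mean. A minimal sketch of the objects above, with a made-up distribution and query (not from the paper):

```python
import random

random.seed(1)

# Hypothetical setup: X = [0, 1] with D the uniform distribution, and a
# statistical query q : X -> [0, 1] given by an indicator function.
def q(x):
    return 1.0 if x > 0.5 else 0.0

S = [random.random() for _ in range(10_000)]    # a sample S ~ D^n
empirical = sum(q(x) for x in S) / len(S)       # E_S[q], the empirical mean
true_mean = 0.5                                 # E[q] = E_{x~D}[q(x)] for this q and D

assert abs(empirical - true_mean) < 0.02        # tau-accurate for tau = 0.02
```

For a single, non-adaptively chosen query this accuracy follows from Hoeffding's inequality; the whole difficulty of the paper is preserving it when later queries depend on earlier answers.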
Our validity guarantees do not assume any particular user structure, nor any constraints on the interactions of the different users. Thus, the guarantees are always valid regardless of what a "user" means, how "users" are allowed to collaborate, how many "users" there are, or how many queries each "user" makes: the guarantees simply hold for any (arbitrarily adaptive) querying sequence.

However, our cost guarantees will, and must, refer to analysts (or perhaps groups of analysts) behaving in specific ways. In particular, we define a non-adaptive user as a subsequence {u_j}_{j∈[M]} consisting of queries which do not depend on any of the history, i.e. R_{u_j} is a fixed (pre-determined) distribution over queries, so Q_{u_j} is independent of all of the history. We further define an autonomous user of the database as a subsequence {u_j}_{j∈[M]} of the querying rules that depend only on the history within the subsequence, i.e.

R_{u_j}((q_1, a_1, p_1), ..., (q_{u_j−1}, a_{u_j−1}, p_{u_j−1})) = R_{u_j}((q_{u_1}, a_{u_1}, p_{u_1}), ..., (q_{u_{j−1}}, a_{u_{j−1}}, p_{u_{j−1}})).

That is, Q_{u_j} is independent of the overall past history given the past history pertaining to the autonomous user. The "cost to a user" is the total price paid for queries in the subsequence {u_j}: Σ_{j=1}^{M} p_{u_j}.

3 VALIDATIONROUND

Our mechanism for providing "everlasting" validity guarantees is based on a query answering mechanism which we call VALIDATIONROUND. It uses n samples from D in order to answer exp(Ω(n)) non-adaptive and at least Ω̃(n²) adaptive statistical queries (and potentially many more). Our analysis is based on ideas developed in the context of adaptive data analysis [10] and relies on techniques from differential privacy [9]. Differential privacy is a strong stability property of randomized algorithms that operate on a dataset.
Composition properties of differential privacy imply that this form of stability holds even when the same dataset is used by multiple algorithms that can depend on the outputs of preceding algorithms. Most importantly, differential privacy implies generalization with high probability [4, 10].

VALIDATIONROUND splits its data into two sets S and T. Upon receiving each query, it first checks whether the answers on these datasets approximately agree. If so, the query has almost certainly not overfit to the data, and the algorithm simply returns the empirical mean of the query on S plus additional random noise. We show that the addition of noise ensures that the algorithm, as a function from the data sample S to an answer, satisfies differential privacy. This can be leveraged to show that any query which depends on a limited number of previous queries will have an empirical mean on S that is close to the true expectation. This ensures that VALIDATIONROUND can accurately answer a large number of queries, while allowing some (unknown) subset of the queries to be adaptive.

VALIDATIONROUND uses truncated Gaussian noise ξ ∼ N(0, σ², [−γ, γ]), i.e. Gaussian noise Z ∼ N(0, σ²) conditioned on the event |Z| ≤ γ. Its density is f_ξ(x) ∝ exp(−x²/(2σ²)) · 1{|x| ≤ γ}.

Algorithm 1 VALIDATIONROUND(τ, δ, n, S, T)
1: Set I(τ, δ, n) = (δ/4) exp(nτ²/8), σ² = τ²/(32 ln(8n²/δ))
2: for each query q_1, q_2, ... do
3:   if |E_S[q_i] − E_T[q_i]| ≤ τ/2 and i ≤ I(τ, δ, n) then
4:     Draw truncated Gaussian ξ_i ∼ N(0, σ², [−τ/4, τ/4])
5:     Output: a_i = E_S[q_i] + ξ_i
6:   else
7:     Halt (η = i)

Here, η is the index of the query that causes the algorithm to halt.
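Algorithm 1 translates almost line-for-line into code. The following sketch is ours, not the authors' implementation; in particular, drawing the truncated Gaussian by rejection sampling is our choice of method.

```python
import math
import random

def truncated_gaussian(sigma, bound, rng):
    """Draw Z ~ N(0, sigma^2) conditioned on |Z| <= bound (rejection sampling)."""
    while True:
        z = rng.gauss(0.0, sigma)
        if abs(z) <= bound:
            return z

def validation_round(tau, delta, n, S, T, queries, seed=0):
    """Answer queries while S and T approximately agree; halt at the first failure."""
    rng = random.Random(seed)
    limit = (delta / 4) * math.exp(n * tau**2 / 8)        # I(tau, delta, n)
    sigma = math.sqrt(tau**2 / (32 * math.log(8 * n**2 / delta)))
    answers = []
    for i, q in enumerate(queries, start=1):
        e_s = sum(q(x) for x in S) / len(S)               # E_S[q_i]
        e_t = sum(q(x) for x in T) / len(T)               # E_T[q_i]
        if abs(e_s - e_t) <= tau / 2 and i <= limit:
            answers.append(e_s + truncated_gaussian(sigma, tau / 4, rng))
        else:
            return answers, i                             # halt: eta = i
    return answers, None                                  # never halted
```

Note how the constants interact: with τ = 0.1, the bound I(τ, δ, n) only becomes large once n is in the tens of thousands, at which point the empirical means are far more accurate than τ and the added noise (at most τ/4) cannot push an answer out of the τ-accuracy window.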
If η ≤ I(τ, δ, n), the maximum allowed number of answers, we say that VALIDATIONROUND halted "prematurely." The following three lemmas characterize the behavior of VALIDATIONROUND.

Lemma 1. For any τ, δ, and n, for any sequence of querying rules (with arbitrary adaptivity) and any probability distribution D, the answers provided by VALIDATIONROUND(τ, δ, n, S, T) satisfy

P[∀ i < η: |A_i − E_{x∼D}[Q_i(x)]| ≤ τ] ≥ 1 − δ/2,

where the probability is taken over the randomness in the draw of datasets S and T from D^n, the querying rules, and VALIDATIONROUND.

Lemma 2. For any τ, δ, and n, any sequence of querying rules, and any non-adaptive user {u_j}_{j∈[M]} interacting with VALIDATIONROUND(τ, δ, n, S, T), P[η ≤ I(τ, δ, n) ∧ η ∈ {u_j}_{j∈[M]}] ≤ δ.

Lemma 3. For any τ, δ, and n, any sequence of querying rules, and any autonomous user {u_j}_{j∈[M]} interacting with VALIDATIONROUND(τ, δ, n, S, T), if σ² = τ²/(32 ln(8n²/δ)) and M ≤ n²τ⁴/(175760 ln²(8n²/δ)), then P[η ≤ I(τ, δ, n) ∧ η ∈ {u_j}_{j∈[M]}] ≤ δ.

Lemma 1 indicates that all returned answers are accurate with high probability, regardless of adaptivity. The proof involves showing that E_T[q_i] is close to E[q_i] for each query, so any query that is answered must be accurate since |E_S[q_i] − E_T[q_i]| and |ξ| are small. Lemma 2 indicates that with high probability, non-adaptive queries never cause a premature halt, which is a simple application of Hoeffding's inequality. Finally, Lemma 3 shows that with high probability, an autonomous user who makes Õ(n²) queries will not cause a premature halt. This requires showing that E_S[q_i] is close to E[q_i] despite the potential adaptivity.

The proof of Lemma 3 uses existing results from adaptive data analysis together with a simple argument that noise truncation does not significantly affect the results.
For reference, the results we cite are included in Appendix E. While using Gaussian noise to answer queries is mentioned in other work, we are not aware of an explicit analysis, so we analyze the method here. To simplify parts of the derivation, we rely on the notion of concentrated differential privacy, which is particularly well suited for analysis of composition with Gaussian noise addition [6]. Lemmas 1-3 are proven in Appendix A.

4 EVERLASTINGVALIDATION and pricing

VALIDATIONROUND uses a fixed number, n, of samples and with high probability returns accurate answers for at least exp(Ω(n)) non-adaptive queries and Ω̃(n²) adaptive queries. In order to handle infinitely many queries, we chain together multiple instances of VALIDATIONROUND. We start with an initial dataset and answer queries using VALIDATIONROUND with that data until it halts. At this point, we buy more data and repeat. The used-up data can be released to the public as a "training set," which can be used with no restriction without affecting any guarantees.

Algorithm 2 EVERLASTINGVALIDATION(τ, δ)
1: Require initial budget 36 ln(8/δ)/τ²
2: N_0 = 18 ln(8/δ)/τ², δ_0 = δ/2, t = 0, i = 1
3: Buy datasets S_0, T_0 ∼ D^{N_0}
4: loop
5:   Pass q_i to VALIDATIONROUND(τ, δ_t, N_t, S_t, T_t)
6:   if VALIDATIONROUND does not halt then
7:     Output: a_i
8:     Charge 96/(τ² i), move on to i = i + 1
9:   else
10:    Charge 6N_t minus current capital
11:    N_{t+1} = 3N_t, δ_{t+1} = δ_t/2, t = t + 1
12:    Buy datasets S_t, T_t ∼ D^{N_t}
13:    Restart loop with same i

The key ingredient is a pricing system with which we can always afford new data when an instance of VALIDATIONROUND halts.
Our method has two price types: a low price, which is charged for all queries and decreases like 1/i; and a high price, which is charged for any query that causes an instance of VALIDATIONROUND to halt prematurely, and which may grow with the size of the current dataset. EVERLASTINGVALIDATION(τ, δ) guarantees the following:

Theorem 1 (Validity). For any sequence of querying rules (with arbitrary adaptivity), EVERLASTINGVALIDATION will provide answers such that

P[∀ i ∈ N: |A_i − E_{x∼D}[Q_i(x)]| ≤ τ] ≥ 1 − δ/2.

Proof. Consider the sequence of query rules that are answered by the t-th instantiation of the VALIDATIONROUND mechanism. By Lemma 1, for any sequence of querying rules, with probability 1 − δ_t/2, all of the answers during round t are answered accurately. By a union bound over all rounds, all answers in all rounds are accurate with probability at least 1 − Σ_{t=0}^∞ δ_t/2 = 1 − δ/2.

Theorem 2 (Sustainability). For any sequence of queries, the revenue collected can pay for all samples ever needed by EVERLASTINGVALIDATION, excluding the initial budget of 36 ln(8/δ)/τ².

Proof. When VALIDATIONROUND halts, we charge exactly enough for the next S_t, T_t (line 10).

Lemma 4. If N_0 ≥ 18 ln(2/δ)/τ² and I(τ, δ_t, N_t) = (δ_t/4) exp(N_t τ²/8) queries are answered during round t, then at least 6N_t revenue is collected.

The proof of Lemma 4 involves a straightforward computation. We find an upper bound, B_T, on the number of queries made before round T begins, and then lower bound the revenue collected in round T with Σ_i 96/(τ²(B_T + i)). We defer the details to Appendix B.

Theorem 3 (Cost for non-adaptive users). For any sequence of querying rules and any non-adaptive user indexed by {u_j}_{j∈[M]}, the cost to the user satisfies

P[Σ_{j∈[M]} P_{u_j} ≤ (96/τ²)(1 + ln(M))] ≥ 1 − δ.

Proof.
By Lemma 4, if a round t ends after I(τ, δ_t, N_t) queries are answered, then the total revenue collected from queries in that round is at least 6N_t, so the "high price" at the end of the round is 0. Consequently, a query q_{u_j} from the non-adaptive user costs the low price 96/(τ² u_j) unless it causes an instantiation of VALIDATIONROUND to halt prematurely. By Lemma 2 and a union bound, this never occurs in any round with probability at least 1 − Σ_{t=0}^∞ δ_t = 1 − δ, and the cost to the user is

Σ_{j∈[M]} p_{u_j} = Σ_{j∈[M]} 96/(τ² u_j) ≤ Σ_{i∈[M]} 96/(τ² i) ≤ (96/τ²)(1 + ln(M)).

Theorem 4 (Cost for adaptive users). For any sequence of querying rules and any autonomous user indexed by {u_j}_{j∈[M]}, there is a fixed constant c_0 such that the cost to the user satisfies

P[Σ_{j∈[M]} P_{u_j} ≤ c_0 · √M ln²(M/δ)/τ²] ≥ 1 − δ.

Proof. Ideally, none of the M queries causes a premature halt, and the total cost is at most (96/τ²)(1 + ln(M)), but the adaptive user may cause rounds to end prematurely and pay up to 6N_t. However, by Lemma 3, with probability 1 − δ_t, if one of the adaptive user's queries causes a round t to end prematurely, then the amount of data, N_t, and the number of the user's queries answered in that round, M_t, must satisfy

M_t ≥ N_t² τ⁴ / (175760 ln²(8N_t²/δ_t)).     (1)

Given M, there is a largest t for which this is possible, since N_t = 3^t N_0 and δ_t = 2^{−t} δ_0. That is,

9^t N_0² τ⁴ / (175760 ln²(18^t · 8N_0²/δ_0)) ≤ M,

which implies t_max ≤ (1/2) ln(24√M ln(144N_0/δ_0)). Let T be the set of rounds in which the adaptive user pays the high 6N_t price; then with probability at least 1 − Σ_{t∈T} δ_t ≥ 1 − δ, inequality (1) holds for all t ∈ T.
In this case, the total cost to the adaptive user is no more than

Σ_{t∈T} 6N_t ≤ t_max · 2520√M ln(8M²/δ_{t_max})/τ² ≤ 1890√M ln²(16M²/δ)/τ².

Relationship to prior work on adaptive data analysis: We handle adaptivity using ideas developed in recent work on adaptive data analysis. In this line of work, all queries are typically assumed to be adaptively chosen and the overall number of queries known in advance. For completeness, we briefly describe several algorithms that have been developed in this context and compare them with our algorithm. Dwork et al. [10] analyze an algorithm that adds Laplace or Gaussian noise to the empirical mean in order to answer M adaptive queries using Õ(√M) samples, a method that forms the basis of VALIDATIONROUND. However, adding untruncated Laplace or Gaussian noise to exponentially many non-adaptive queries would likely cause large errors when the variance is large enough to ensure that the sample mean is accurate. We use truncated Gaussian noise instead and show that it does not substantially affect the analysis for autonomous queries.

THRESHOLDOUT [11] answers verification queries in which the user submits both a query and an estimate of the answer. The algorithm uses n = Õ(√M · log I) samples to answer I queries of which at most M estimates are far from correct. Similar to our use of the second dataset T, this algorithm can be used to detect overfitting and answer adaptive queries (this is the basis of the EFFECTIVEROUNDS algorithm [10]). However, in our application this algorithm would have a sample complexity of n = Õ(√M · log T), for M autonomous queries in T total queries. Consequently, direct use of this mechanism would result in a pricing for non-adaptive users that depends on the number of queries by autonomous users.
This is in contrast to the n = Õ(√M + log T) samples that suffice for VALIDATIONROUND, where the improvement relies on our definition of autonomy and truncation of the noise variables.

5 Optimality

One might ask if it is possible to devise a mechanism with similar properties but lower costs. We argue that the prices set by EVERLASTINGVALIDATION are near optimal. The total cost to a non-adaptive user who makes M queries is O(log M/τ²). Even if we knew in advance that we would receive only M non-adaptive queries, we would still need Ω(log M/τ²) samples to answer all of them accurately with high probability. Thus, our price for non-adaptive queries is optimal up to constant factors.

It is also known that answering a sequence of M adaptively chosen queries with accuracy τ requires Ω̃(√M/τ) samples [15, 19]. Hence, the cost to a possibly adaptive autonomous user is nearly optimal in its dependence on M (up to log factors). One natural concern is that our guarantee in this case is only for the amortized (or total) cost, and not for the cost of each individual query. Indeed, although the average cost of adaptive queries decreases as Õ(1/√M), the maximal cost of a single query might increase as Õ(√M). A natural question is whether the maximum price can be reduced, to spread the high price over more queries.

Finally, an individual who queries our mechanism with M entirely non-adaptive queries will only pay log M in the worst case; generally, they will benefit from the economies of scale associated with collecting more and more data. For instance, if there are K users each making M non-adaptive queries, then the total cost of all KM queries will be log(KM), so the average cost to each user is only log(KM)/K ≪ log M.

6 An Alternative Approach: EVERLASTINGTO

The EVERLASTINGVALIDATION mechanism provides cost guarantees that are, in certain ways, nearly optimal.
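Concretely, the ledger of Section 4 is easy to check numerically. The sketch below is ours; δ = 0.05 is an arbitrary example value, and taking N_0 to be half the initial budget of 36 ln(8/δ)/τ² is our reading of Algorithm 2 (two datasets of size N_0 each).

```python
import math

TAU = 0.1
DELTA = 0.05
N0 = 18 * math.log(8 / DELTA) / TAU**2      # assumed: half the initial budget

def low_price(i):
    """Low price of the i-th query overall: 96 / (tau^2 * i)."""
    return 96 / (TAU**2 * i)

def high_price(t):
    """High price ending round t: up to 6 * N_t, where N_t = 3^t * N_0."""
    return 6 * (3**t) * N0

# Non-adaptive users only ever pay low prices, which sum like a harmonic series,
# so M queries cost O(log M / tau^2), matching Theorem 3's bound.
total = sum(low_price(i) for i in range(1, 1001))
bound = (96 / TAU**2) * (1 + math.log(1000))
assert total <= bound
```

The geometric growth of `high_price` is what lets a single prematurely halted round pay for a dataset three times larger, while the harmonic decay of `low_price` keeps the average cost per non-adaptive query shrinking.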
The two main shortcomings are that (1) the price is guaranteed only for non-adaptive or autonomous users, not arbitrary adaptive ones, and (2) the cost of an individual adaptive query cannot be upper bounded. One might also ask if inventing VALIDATIONROUND was necessary in the first place. Another mechanism, THRESHOLDOUT [11], is already well suited to the setting of mixed adaptive and non-adaptive queries, and it gives accuracy guarantees for quadratically many arbitrary adaptive queries or exponentially many non-adaptive queries. Perhaps using THRESHOLDOUT instead would be better? We will now describe an alternative mechanism, EVERLASTINGTO, which allows us to provide price guarantees for individual queries, including arbitrarily adaptive ones, but with an exponential increase in the cost for both non-adaptive and adaptive queries.

The EVERLASTINGTO mechanism is very similar to EVERLASTINGVALIDATION, except it uses THRESHOLDOUT in the place of VALIDATIONROUND. In each round, the algorithm determines an overfitting budget, B_t, and a maximum number of queries, M_t, as a function of the tradeoff parameter p. It then answers queries using THRESHOLDOUT, charging a high price 2N_{t+1}/B_t for queries that fail the overfitting check, and charging a low price 2N_{t+1}/M_t for all of the other queries. Once THRESHOLDOUT cannot answer more queries, the mechanism buys more data, reinitializes THRESHOLDOUT, and continues as before.

We analyze EVERLASTINGTO in Appendix D. Theorems 6-9 closely parallel the guarantees of EVERLASTINGVALIDATION and establish the following for any τ, δ ∈ (0, 1) and any p ∈ (0, 2/3): Validity: with high probability, for any sequence of querying rules, all answers provided by EVERLASTINGTO are τ-accurate. Sustainability: EVERLASTINGTO charges high enough prices to be able to afford new samples as needed, excluding the initial budget.
Cost: with high probability, any M non-adaptive queries and any B adaptive queries cost at most O(ln^{1/p}(M) + B^{1/(2−3p)}) (ignoring the dependence on τ, δ).

Algorithm 3 EVERLASTINGTO(τ, δ, p)
1: Require sufficiently large initial budget n = n(τ, δ, p)
2: For all t, set N_t = n e^t, δ_t = ((e−1)/e) δ e^{−t}, B_t = Θ̃(τ⁴ N_t^{2−2p} / ln(1/δ_t)), M_t = (δ_t/4) exp(2N_t^p)
3: for t = 0, 1, ... do
4:   Purchase datasets S_t, T_t ∼ D^{N_t} and initialize THRESHOLDOUT(S_t, T_t, B_t, δ_t)
5:   while THRESHOLDOUT(S_t, T_t, B_t, δ_t) has not halted do
6:     Accept query q
7:     (a, o) = THRESHOLDOUT(S_t, T_t, B_t, δ_t)(q)
8:     Output: a
9:     if o = ⊥ then
10:      Charge: 2N_{t+1}/M_t
11:    else
12:      Charge: 2N_{t+1}/B_t

Unlike EVERLASTINGVALIDATION, which prioritized charging as little as possible for non-adaptive queries, EVERLASTINGTO increases the O(log M) cost to O(polylog M) in order to bound the price of arbitrary adaptive queries. The parameter p allows the database manager to control the tradeoff; for p near zero, the cost of B adaptive queries is roughly the optimal O(√B), but non-adaptive queries are extremely expensive. On the other side, for p near 2/3, the cost of adaptive queries becomes very high, but the cost of non-adaptive queries is relatively small, although it does not approach optimality. Further details of the mechanism are contained in Appendix D. We also provide a tighter analysis of the THRESHOLDOUT algorithm, which guarantees accurate answers using a substantially smaller amount of data, in Appendix C. This analysis allows us to reduce the exponent in EVERLASTINGTO's cost guarantee for non-adaptive queries.

7 Potential applications

In the ML challenge scenario, validation results are often displayed on a scoreboard.
Although it is often assumed that scoreboards cannot be used for extensive adaptation, it appears that such adaptations have played roles in determining the outcome of various well-known competitions, including the Netflix challenge, where the final test set performance was significantly worse than performance on the leaderboard data set. EVERLASTINGVALIDATION would guarantee that test errors returned by the validation database are accurate, regardless of adaptation, collusion, the number of queries made by each user, or other intentional or unintentional dependencies. We do charge a price per validation, but as long as users are non-adaptive, the price is very small. Adaptive users, on the other hand, pay what is required in order to ensure validity (which could be a lot). Nevertheless, even if a wealthy user could afford paying the higher cost of adaptive queries, she would still not be able to "cheat" and overfit the scoreboard set, and a poor user could still afford the quickly diminishing costs of validating non-adaptive queries.

Another feature of our mechanism is that once a round t is over, we can safely release the datasets S_t and T_t to the public as unrestricted training data. This way, poor analysts also benefit from adaptive queries made by others, as all data is eventually released, and at any given time, a substantial fraction of all the data ever collected is public. Also, the ratio of public data to validation data can easily be adjusted by slightly amending the pricing.

In the context of scientific discovery, one use case is very similar to the ML competition. Scientists can search for interesting phenomena using unprotected data, and then re-evaluate "interesting" discoveries with the database mechanism in order to get an accurate and almost-unbiased estimate of the true value.
This could be useful, for example, in building prediction models for scientific phenomena such as genetic risk of disease, which often involve complex modeling [7].
However, most scientific research is done in the context of hypothesis testing rather than estimation. Declarations of discoveries like the Higgs boson [1] and genetic associations of disease [8] are based on performing a potentially large number of hypothesis tests and identifying statistically significant discoveries while controlling for multiplicity. Because of the complexity of the discovery process, it is often quite difficult to properly control for all potential tests, causing many difficulties, the most well known of which is the problem of publication bias (cf. "Why Most Published Research Findings Are False" [17]). An alternative approach that has gained popularity in recent years is requiring replication of any declared discoveries on new and independent data [3]. Because the new data is used only for replication, it is much easier to control multiplicity and false discovery concerns.
Our everlasting database can be useful in both the discovery and replication phases. We now briefly explain how its validity guarantees can be used for multiplicity control in testing. Assume we have a collection of hypothesis tests on functionals of D with null hypotheses H_{0i}: E[q_i] = e_{0i}. We employ our scheme to obtain estimates A_i of E[q_i]. Setting α = δ/2, Theorem 1 guarantees P[max_{i: H_{0i} true} |A_i − e_{0i}| > τ] ≤ α, meaning that for any combination of true nulls, the rejection policy "reject if |A_i − e_{0i}| > τ" makes no false rejections with probability at least 1 − α, thus controlling the family-wise error rate (FWER) at level α.
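As a toy illustration of this rejection policy: assuming, per the guarantee above, that all returned estimates are simultaneously within τ of the true means except with probability α, the FWER-controlling rule is just a thresholded comparison. The numbers below are made up for illustration.

```python
def fwer_rejections(estimates, nulls, tau):
    """Reject H_{0i}: E[q_i] = e0_i exactly when |A_i - e0_i| > tau.
    If every estimate is within tau of its true mean (an event of
    probability >= 1 - alpha), no true null is rejected, so the
    family-wise error rate is at most alpha."""
    return [i for i, (a_i, e0_i) in enumerate(zip(estimates, nulls))
            if abs(a_i - e0_i) > tau]

# Illustrative values: estimates A_i from the database, hypothesized means e0_i.
estimates = [0.52, 0.49, 0.71, 0.50]
nulls = [0.50, 0.50, 0.50, 0.50]
print(fwer_rejections(estimates, nulls, tau=0.05))  # -> [2]
```

Note that no per-test correction (e.g. Bonferroni) is applied to τ: the simultaneity over all queries is already supplied by the mechanism's uniform accuracy guarantee.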
This is easily used in the replication phase, where an entire community (say, type-I diabetes researchers) could share a single replication server using the everlasting database scheme in order to guarantee validity. It could also be used in the discovery phase for analyses that can be described through a set of measurements and tests of the form above.

8 Conclusion and extensions

Our primary contribution is in designing a database mechanism that brings together two important properties that have not been previously combined: everlasting validity and robustness to adaptivity. Furthermore, we do so in an asymptotically efficient manner that guarantees that non-adaptive queries are inexpensive with high probability, and that the potentially high cost of handling adaptivity falls only upon truly adaptive users. Currently, there are large constants in the cost guarantees, but these are pessimistic and can likely be reduced with a tighter analysis and a more refined pricing scheme. We believe that with some improvements, our scheme can form the basis of practical implementations for use in ML competitions and scientific discovery. Also, our cost guarantees themselves are worst-case and only guarantee a low price to entirely non-adaptive users. It would be useful to investigate experimentally how much users would actually end up being charged under "typical use," especially users who are only "slightly adaptive." However, there is no established framework for understanding what would constitute "typical" or "slightly adaptive" usage of a statistical query answering mechanism, so more work is needed before such experiments would be insightful.
Our mechanism can be improved in several ways. It only provides answers at a fixed additive accuracy τ, and it only answers statistical queries; however, these issues have already been addressed in the adaptive data analysis literature. E.g.
arbitrary low-sensitivity queries can be handled without any modification to the algorithm, and arbitrary real-valued queries can be answered with error proportional to their standard deviation (instead of 1/√n as in our analysis) [13]. These approaches can be combined with our algorithms, but we restrict our attention to the basic case since our focus is different.
Finally, one potentially objectionable element of our approach is that it discards samples at the end of each round (although these samples are not wasted, since they become part of the public dataset). An alternative approach is to add the new samples to the dataset as they are purchased. While this might be a more practical approach, existing analysis techniques that are based on differential privacy do not appear to suffice for dealing with such mechanisms. Developing more flexible analysis techniques for this purpose is another natural direction for future work.

Acknowledgements  BW is supported by the NSF Graduate Research Fellowship under award 1754881.

References

[1] Georges Aad, T Abajyan, B Abbott, J Abdallah, S Abdel Khalek, AA Abdelalim, O Abdinov, R Aben, B Abi, M Abolins, et al. Observation of a new particle in the search for the Standard Model Higgs boson with the ATLAS detector at the LHC. Physics Letters B, 716(1):1–29, 2012.

[2] Ehud Aharoni and Saharon Rosset. Generalized alpha-investing: Definitions, optimality results and application to public databases. Journal of the Royal Statistical Society: Series B, 76(4):771–794, 2014.

[3] Monya Baker. 1,500 scientists lift the lid on reproducibility. Nature News, 533(7604):452, 2016.

[4] Raef Bassily, Kobbi Nissim, Adam D. Smith, Thomas Steinke, Uri Stemmer, and Jonathan Ullman. Algorithmic stability for adaptive data analysis. In STOC, pages 1046–1059, 2016.

[5] Avrim Blum and Moritz Hardt. The ladder: A reliable leaderboard for machine learning competitions.
In International Conference on Machine Learning, pages 1006–1014, 2015.

[6] Mark Bun and Thomas Steinke. Concentrated differential privacy: Simplifications, extensions, and lower bounds. In Theory of Cryptography Conference, pages 635–658. Springer, 2016.

[7] Nilanjan Chatterjee, Jianxin Shi, and Montserrat García-Closas. Developing and evaluating polygenic risk prediction models for stratified disease prevention. Nature Reviews Genetics, 17(7):392, 2016.

[8] Nick Craddock, Matthew E Hurles, Niall Cardin, Richard D Pearson, Vincent Plagnol, Samuel Robson, Damjan Vukcevic, Chris Barnes, Donald F Conrad, Eleni Giannoulatou, et al. Genome-wide association study of CNVs in 16,000 cases of eight common diseases and 3,000 shared controls. Nature, 464(7289):713, 2010.

[9] C. Dwork, F. McSherry, K. Nissim, and A. Smith. Calibrating noise to sensitivity in private data analysis. In TCC, pages 265–284, 2006.

[10] Cynthia Dwork, Vitaly Feldman, Moritz Hardt, Toniann Pitassi, Omer Reingold, and Aaron Roth. Preserving statistical validity in adaptive data analysis. CoRR, abs/1411.2664, 2014. Extended abstract in STOC 2015.

[11] Cynthia Dwork, Vitaly Feldman, Moritz Hardt, Toniann Pitassi, Omer Reingold, and Aaron Roth. Generalization in adaptive data analysis and holdout reuse. In Advances in Neural Information Processing Systems, pages 2350–2358, 2015.

[12] Cynthia Dwork, Vitaly Feldman, Moritz Hardt, Toniann Pitassi, Omer Reingold, and Aaron Roth. The reusable holdout: Preserving validity in adaptive data analysis. Science, 349(6248):636–638, 2015. doi: 10.1126/science.aaa9375. URL http://www.sciencemag.org/content/349/6248/636.abstract.

[13] Vitaly Feldman and Thomas Steinke. Generalization for adaptively-chosen estimators via stable median. In Conference on Learning Theory (COLT), 2017.

[14] Andrew Gelman and Eric Loken. The statistical crisis in science.
American Scientist, 102(6):460, 2014.

[15] M. Hardt and J. Ullman. Preventing false discovery in interactive data analysis is hard. In FOCS, pages 454–463, 2014.

[16] Moritz Hardt. Climbing a shaky ladder: Better adaptive risk estimation. CoRR, abs/1706.02733, 2017. URL http://arxiv.org/abs/1706.02733.

[17] John PA Ioannidis. Why most published research findings are false. PLoS Medicine, 2(8):e124, 2005.

[18] Kobbi Nissim and Uri Stemmer. On the generalization properties of differential privacy. CoRR, abs/1504.05800, 2015.

[19] Thomas Steinke and Jonathan Ullman. Interactive fingerprinting codes and the hardness of preventing false discovery. In COLT, pages 1588–1628, 2015. URL http://jmlr.org/proceedings/papers/v40/Steinke15.html.