{"title": "Generalization in Adaptive Data Analysis and Holdout Reuse", "book": "Advances in Neural Information Processing Systems", "page_first": 2350, "page_last": 2358, "abstract": "Overfitting is the bane of data analysts, even when data are plentiful. Formal approaches to understanding this problem focus on statistical inference and generalization of individual analysis procedures. Yet the practice of data analysis is an inherently interactive and adaptive process: new analyses and hypotheses are proposed after seeing the results of previous ones, parameters are tuned on the basis of obtained results, and datasets are shared and reused. An investigation of this gap has recently been initiated by the authors in (Dwork et al., 2014), where we focused on the problem of estimating expectations of adaptively chosen functions.In this paper, we give a simple and practical method for reusing a holdout (or testing) set to validate the accuracy of hypotheses produced by a learning algorithm operating on a training set. Reusing a holdout set adaptively multiple times can easily lead to overfitting to the holdout set itself. We give an algorithm that enables the validation of a large number of adaptively chosen hypotheses, while provably avoiding overfitting. We illustrate the advantages of our algorithm over the standard use of the holdout set via a simple synthetic experiment.We also formalize and address the general problem of data reuse in adaptive data analysis. We show how the differential-privacy based approach in (Dwork et al., 2014) is applicable much more broadly to adaptive data analysis. We then show that a simple approach based on description length can also be used to give guarantees of statistical validity in adaptive settings. Finally, we demonstrate that these incomparable approaches can be unified via the notion of approximate max-information that we introduce. 
This, in particular, allows the preservation of statistical validity guarantees even when an analyst adaptively composes algorithms which have guarantees based on either of the two approaches.", "full_text": "Generalization in Adaptive Data Analysis and\n\nHoldout Reuse\u2217\n\nCynthia Dwork\nMicrosoft Research\n\nVitaly Feldman\n\nIBM Almaden Research Center\u2020\n\nMoritz Hardt\nGoogle Research\n\nToniann Pitassi\n\nUniversity of Toronto\n\nOmer Reingold\n\nSamsung Research America\n\nAaron Roth\n\nUniversity of Pennsylvania\n\nAbstract\n\nOver\ufb01tting is the bane of data analysts, even when data are plentiful. Formal\napproaches to understanding this problem focus on statistical inference and gen-\neralization of individual analysis procedures. Yet the practice of data analysis is\nan inherently interactive and adaptive process: new analyses and hypotheses are\nproposed after seeing the results of previous ones, parameters are tuned on the\nbasis of obtained results, and datasets are shared and reused. An investigation of\nthis gap has recently been initiated by the authors in [7], where we focused on the\nproblem of estimating expectations of adaptively chosen functions.\nIn this paper, we give a simple and practical method for reusing a holdout (or\ntesting) set to validate the accuracy of hypotheses produced by a learning algorithm\noperating on a training set. Reusing a holdout set adaptively multiple times can\neasily lead to over\ufb01tting to the holdout set itself. We give an algorithm that enables\nthe validation of a large number of adaptively chosen hypotheses, while provably\navoiding over\ufb01tting. We illustrate the advantages of our algorithm over the standard\nuse of the holdout set via a simple synthetic experiment.\nWe also formalize and address the general problem of data reuse in adaptive data\nanalysis. We show how the differential-privacy based approach given in [7] is\napplicable much more broadly to adaptive data analysis. 
We then show that a simple approach based on description length can also be used to give guarantees of statistical validity in adaptive settings. Finally, we demonstrate that these incomparable approaches can be unified via the notion of approximate max-information that we introduce. This, in particular, allows the preservation of statistical validity guarantees even when an analyst adaptively composes algorithms which have guarantees based on either of the two approaches.\n\n1 Introduction\n\nThe goal of machine learning is to produce hypotheses or models that generalize well to the unseen instances of the problem. More generally, statistical data analysis is concerned with estimating properties of the underlying data distribution, rather than properties that are specific to the finite data set at hand. Indeed, a large body of theoretical and empirical research was developed for ensuring generalization in a variety of settings. In this line of work, it is commonly assumed that each analysis procedure (such as a learning algorithm) operates on a freshly sampled dataset – or if not, is validated on a freshly sampled holdout (or testing) set.\n\n∗See [6] for the full version of this work.\n†Part of this work done while visiting the Simons Institute, UC Berkeley.\n\nUnfortunately, learning and inference can be more difficult in practice, where data samples are often reused. For example, a common practice is to perform feature selection on a dataset, and then use the features for some supervised learning task. When these two steps are performed on the same dataset, it is no longer clear that the results obtained from the combined algorithm will generalize. Although not usually understood in these terms, “Freedman’s paradox” is an elegant demonstration of the powerful (negative) effect of adaptive analysis on the same data [10]. 
In Freedman’s simulation, variables with significant t-statistics are selected and linear regression is performed on this adaptively chosen subset of variables, with famously misleading results: when the relationship between the dependent and explanatory variables is non-existent, the procedure overfits, erroneously declaring significant relationships.\nMost of machine learning practice does not rely on formal guarantees of generalization for learning algorithms. Instead a dataset is split randomly into two (or sometimes more) parts: the training set and the testing, or holdout, set. The training set is used for learning a predictor, and then the holdout set is used to estimate the accuracy of the predictor on the true distribution (additional averaging over different partitions is used in cross-validation). Because the predictor is independent of the holdout dataset, such an estimate is a valid estimate of the true prediction accuracy (formally, this allows one to construct a confidence interval for the prediction accuracy on the data distribution). However, in practice the holdout dataset is rarely used only once, and as a result the predictor may not be independent of the holdout set, resulting in overfitting to the holdout set [17, 16, 4]. One well-known reason for such dependence is that the holdout data is used to test a large number of predictors and only the best one is reported. If the set of all tested hypotheses is known and independent of the holdout set, then it is easy to account for such multiple testing.\nHowever, such static approaches do not apply if the estimates or hypotheses tested on the holdout are chosen adaptively: that is, if the choice of hypotheses depends on previous analyses performed on the dataset. One prominent example in which a holdout set is often adaptively reused is hyperparameter tuning (e.g., [5]). 
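Freedman's effect is easy to reproduce numerically. The sketch below (the sample sizes, seed, and the |t| > 2 cutoff are illustrative choices, not those of the original simulation) screens pure-noise variables by their univariate t-statistics and then refits a regression on the survivors:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 100, 200                     # observations, candidate variables (illustrative)
X = rng.standard_normal((n, p))
y = rng.standard_normal(n)          # y is pure noise: no true relationship to X

# Stage 1: screen variables by the t-statistic of their univariate correlation.
r = (X - X.mean(axis=0)).T @ (y - y.mean()) / (n * X.std(axis=0) * y.std())
t_screen = r * np.sqrt((n - 2) / (1 - r**2))
selected = np.flatnonzero(np.abs(t_screen) > 2.0)

# Stage 2: ordinary least squares on the adaptively selected columns only.
Xs = np.column_stack([np.ones(n), X[:, selected]])
beta, *_ = np.linalg.lstsq(Xs, y, rcond=None)
resid = y - Xs @ beta
sigma2 = resid @ resid / (n - Xs.shape[1])
se = np.sqrt(sigma2 * np.diag(np.linalg.inv(Xs.T @ Xs)))
t_refit = beta[1:] / se[1:]         # t-statistics of the selected coefficients

print(f"{len(selected)} of {p} noise variables survive screening")
print(f"{np.sum(np.abs(t_refit) > 2.0)} of them look 'significant' after refitting")
```

Even though y is independent of X, roughly 5% of the variables survive screening, and a number of them typically remain "significant" in the refitted model: precisely the overfitting described above.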
Similarly, the holdout set in a machine learning competition, such as the famous ImageNet competition, is typically reused many times adaptively. Other examples include using the holdout set for feature selection, generation of base learners (in aggregation techniques such as boosting and bagging), checking a stopping condition, and analyst-in-the-loop decisions. See [13] for a discussion of several subtle causes of overfitting.\nThe concrete practical problem we address is how to ensure that the holdout set can be reused to perform validation in the adaptive setting. Towards addressing this problem we also ask the more general question of how one can ensure that the final output of adaptive data analysis generalizes to the underlying data distribution. This line of research was recently initiated by the authors in [7], where we focused on the case of estimating expectations of functions from i.i.d. samples (these are also referred to as statistical queries).\n\n1.1 Our Results\n\nWe propose a simple and general formulation of the problem of preserving statistical validity in adaptive data analysis. We show that the connection between differentially private algorithms and generalization from [7] can be extended to this more general setting, and show that similar (but sometimes incomparable) guarantees can be obtained from algorithms whose outputs can be described by short strings. We then define a new notion, approximate max-information, that unifies these two basic techniques and gives a new perspective on the problem. In particular, we give an adaptive composition theorem for max-information, which gives a simple way to obtain generalization guarantees for analyses in which some of the procedures are differentially private and some have short description length outputs. 
We apply our techniques to the problem of reusing the holdout set for validation in the adaptive setting.\nA reusable holdout: We describe a simple and general method, together with two specific instantiations, for reusing a holdout set for validating results while provably avoiding overfitting to the holdout set. The analyst can perform any analysis on the training dataset, but can only access the holdout set via an algorithm that allows the analyst to validate her hypotheses against the holdout set. Crucially, our algorithm prevents overfitting to the holdout set even when the analyst’s hypotheses are chosen adaptively on the basis of the previous responses of our algorithm.\nOur first algorithm, referred to as Thresholdout, derives its guarantees from differential privacy and the results in [7, 14]. For any function φ : X → [0, 1] given by the analyst, Thresholdout uses the holdout set to validate that φ does not overfit to the training set, that is, it checks that the mean value of φ evaluated on the training set is close to the mean value of φ evaluated on the distribution P from which the data was sampled. The standard approach to such validation would be to compute the mean value of φ on the holdout set. The use of the holdout set in Thresholdout differs from the standard use in that it exposes very little information about the mean of φ on the holdout set: if φ does not overfit to the training set, then the analyst receives only the confirmation of closeness, that is, just a single bit. On the other hand, if φ overfits, then Thresholdout returns the mean value of φ on the holdout set perturbed by carefully calibrated noise.\nUsing results from [7, 14] we show that for datasets consisting of i.i.d. samples these modifications provably prevent the analyst from constructing functions that overfit to the holdout set. 
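As a concrete illustration of this mechanism, the following Python sketch mimics Thresholdout's check-then-release logic (the class name, parameter defaults, and noise scales here are illustrative; the pseudocode and Theorem 9 in Section 3 give the calibrated choices):

```python
import numpy as np

class Thresholdout:
    """Sketch of the Thresholdout mechanism. For the formal guarantees,
    threshold, sigma, and budget must be calibrated as in the paper."""

    def __init__(self, train, holdout, threshold=0.04, sigma=0.01, budget=100):
        self.train, self.holdout = train, holdout
        self.T, self.sigma, self.B = threshold, sigma, budget
        self.rng = np.random.default_rng()
        self.That = self.T + self.rng.laplace(scale=2 * self.sigma)

    def query(self, phi):
        """phi maps one sample to [0, 1]; returns an estimate of its mean."""
        if self.B < 1:
            return None                      # overfitting budget exhausted
        mean_train = np.mean([phi(x) for x in self.train])
        mean_hold = np.mean([phi(x) for x in self.holdout])
        eta = self.rng.laplace(scale=4 * self.sigma)
        if abs(mean_hold - mean_train) > self.That + eta:
            # Overfitting detected: spend budget, refresh the noisy threshold,
            # and release only a noisy version of the holdout mean.
            self.B -= 1
            self.That = self.T + self.rng.laplace(scale=2 * self.sigma)
            return mean_hold + self.rng.laplace(scale=self.sigma)
        return mean_train                    # training mean is already accurate
```

Note that in the common case the analyst learns nothing about the holdout set beyond the fact that the two means agree, which is what makes the budget last so long.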
This ensures correctness of Thresholdout’s responses. Naturally, the specific guarantees depend on the number of samples n in the holdout set. The number of queries that Thresholdout can answer is exponential in n as long as the number of times that the analyst overfits is at most quadratic in n.\nOur second algorithm, SparseValidate, is based on the idea that if most of the time the analyst’s procedures generate results that do not overfit, then validating them against the holdout set does not reveal much information about the holdout set. Specifically, the generalization guarantees of this method follow from the observation that the transcript of the interaction between a data analyst and the holdout set can be described concisely. More formally, this method allows the analyst to pick any Boolean function ψ of a dataset (described by an algorithm) and receive back its value on the holdout set. A simple example of such a function would be whether the accuracy of a predictor on the holdout set is at least a certain value α. (Unlike in the case of Thresholdout, here there is no need to assume that the function that measures the accuracy has a bounded range or is even Lipschitz, making it qualitatively different from the kinds of results achievable subject to differential privacy.) A more involved example of validation would be to run an algorithm on the holdout dataset to select a hypothesis and check if the hypothesis is similar to that obtained on the training set (for any desired notion of similarity). Such validation can be applied to other results of analysis; for example, one could check if the variables selected on the holdout set have large overlap with those selected on the training set. 
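The bookkeeping behind SparseValidate is minimal; the sketch below (names and interface are illustrative) tracks exactly the two budgets described in the text:

```python
class SparseValidate:
    """Sketch of SparseValidate: answers arbitrary Boolean queries about the
    holdout set, stopping after m total queries or B positive answers."""

    def __init__(self, holdout, m, B):
        self.holdout = holdout
        self.m, self.B = m, B

    def query(self, psi):
        """psi is any Boolean function of a dataset (e.g. 'is the accuracy
        of this predictor on the holdout set at least alpha?')."""
        if self.m < 1 or self.B < 1:
            return None                  # one of the budgets is exhausted
        self.m -= 1
        answer = bool(psi(self.holdout))
        if answer:
            self.B -= 1                  # positive answers are the costly ones
        return answer
```

The guarantee rests on the observation in the text: as long as positive answers are rare, the whole transcript of the interaction has a short description.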
An instantiation of the SparseValidate algorithm has already been applied to the problem\nof answering statistical (and more general) queries in the adaptive setting [1].\nWe describe a simple experiment on synthetic data that illustrates the danger of reusing a standard\nholdout set, and how this issue can be resolved by our reusable holdout. The design of this experiment\nis inspired by Freedman\u2019s classical experiment, which demonstrated the dangers of performing\nvariable selection and regression on the same data [10].\nGeneralization in adaptive data analysis: We view adaptive analysis on the same dataset as an\nexecution of a sequence of steps A1 \u2192 A2 \u2192 \u00b7\u00b7\u00b7 \u2192 Am. Each step is described by an algorithm\nAi that takes as input a \ufb01xed dataset S = (x1, . . . , xn) drawn from some distribution D over\nX n, which remains unchanged over the course of the analysis. Each algorithm Ai also takes as\ninput the outputs of the previously run algorithms A1 through Ai\u22121 and produces a value in some\nrange Yi. The dependence on previous outputs represents all the adaptive choices that are made\nat step i of data analysis. For example, depending on the previous outputs, Ai can run different\ntypes of analysis on S. We note that at this level of generality, the algorithms can represent the\nchoices of the data analyst, and need not be explicitly speci\ufb01ed. We assume that the analyst uses\nalgorithms which individually are known to generalize when executed on a fresh dataset sampled\nindependently from a distribution D. We formalize this by assuming that for every \ufb01xed value\ny1, . . . , yi\u22121 \u2208 Y1 \u00d7 \u00b7\u00b7\u00b7 \u00d7 Yi\u22121, with probability at least 1 \u2212 \u03b2i over the choice of S according\nto distribution D, the output of Ai on inputs y1, . . . , yi\u22121 and S has a desired property relative to\nthe data distribution D (for example has low generalization error). 
Note that in this assumption y1, . . . , yi−1 are fixed and independent of the choice of S, whereas the analyst will execute Ai on values Y1, . . . , Yi−1, where Yj = Aj(S, Y1, . . . , Yj−1). In other words, in the adaptive setup, the algorithm Ai can depend on the previous outputs, which depend on S, and thus the set S given to Ai is no longer an independently sampled dataset. Such dependence invalidates the generalization guarantees of individual procedures, potentially leading to overfitting.\nDifferential privacy: First, we spell out how the differential privacy based approach from [7] can be applied to this more general setting. Specifically, a simple corollary of results in [7] is that for a dataset consisting of i.i.d. samples any output of a differentially private algorithm can be used in subsequent analysis while controlling the risk of overfitting, even beyond the setting of statistical queries studied in [7]. A key property of differential privacy in this context is that it composes adaptively: namely, if each of the algorithms used by the analyst is differentially private, then the whole procedure will be differentially private (albeit with worse privacy parameters). Therefore, one way to avoid overfitting in the adaptive setting is to use algorithms that satisfy (sufficiently strong) guarantees of differential privacy.\nDescription length: We then show how description length bounds can be applied in the context of guaranteeing generalization in the presence of adaptivity. If the total length of the outputs of algorithms A1, . . . , Ai−1 can be described with k bits then there are at most 2^k possible values of the input y1, . . . , yi−1 to Ai. For each of these individual inputs, Ai generalizes with probability 1 − βi. Taking a union bound over failure probabilities implies generalization with probability at least 1 − 2^k·βi. 
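To make the union-bound arithmetic concrete, suppose each βi comes from a Hoeffding-style bound; the numbers below are purely illustrative:

```python
import math

n, tau = 10_000, 0.05
beta_i = 2 * math.exp(-2 * tau**2 * n)   # per-step failure probability on fresh data

# A k-bit transcript of earlier outputs takes at most 2**k values, so a union
# bound over transcripts degrades the guarantee from beta_i to 2**k * beta_i.
for k in (1, 10, 20, 40):
    print(f"k = {k:2d} bits: failure probability <= {min(1.0, 2**k * beta_i):.2e}")
```

With these (hypothetical) parameters the fresh-data failure probability is so small that even forty bits of adaptively released output leave a meaningful guarantee, while a transcript of thousands of bits would not.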
Occam’s Razor famously implies that shorter hypotheses have lower generalization error. Our observation is that shorter hypotheses (and the results of analysis more generally) are also better in the adaptive setting, since they reveal less about the dataset and lead to better generalization of subsequent analyses. Note that this result makes no assumptions about the data distribution D. In the full version we also show that description length-based analysis suffices for obtaining an algorithm (albeit not an efficient one) that can answer an exponentially large number of adaptively chosen statistical queries. This provides an alternative proof for one of the results in [7].\nApproximate max-information: Our main technical contribution is the introduction and analysis of a new information-theoretic measure, which unifies the generalization arguments that come from both differential privacy and description length, and that quantifies how much information has been learned about the data by the analyst. Formally, for jointly distributed random variables (S, Y), the max-information is the maximum of the logarithm of the factor by which uncertainty about S is reduced given the value of Y, namely I∞(S; Y) = log max P[S = S | Y = y]/P[S = S], where the maximum is taken over all S in the support of S and y in the support of Y. Approximate max-information is a relaxation of max-information. In our use, S denotes a dataset drawn randomly from the distribution D and Y denotes the output of a (possibly randomized) algorithm on S. We prove that approximate max-information has the following properties:\n\n• An upper bound on (approximate) max-information gives generalization guarantees.\n• Differentially private algorithms have low max-information for any distribution D over datasets. A stronger bound holds for approximate max-information on i.i.d. datasets. 
These\nbounds apply only to so-called pure differential privacy (the \u03b4 = 0 case).\n\u2022 Bounds on the description length of the output of an algorithm give bounds on the approxi-\nmate max-information of the algorithm for any D.\n\u2022 Approximate max-information composes adaptively.\n\nComposition properties of approximate max-information imply that one can easily obtain general-\nization guarantees for adaptive sequences of algorithms, some of which are differentially private,\nand others of which have outputs with short description length. These properties also imply that\ndifferential privacy can be used to control generalization for any distribution D over datasets, which\nextends its generalization guarantees beyond the restriction to datasets drawn i.i.d. from a \ufb01xed\ndistribution, as in [7].\nWe remark that (pure) differential privacy and description length are otherwise incomparable. Bounds\non max-information or differential privacy of an algorithm can, however, be translated to bounds on\nrandomized description length for a different algorithm with statistically indistinguishable output.\nHere we say that a randomized algorithm has randomized description length of k if for every \ufb01xing\nof the algorithm\u2019s random bits, it has description length of k. Details of these results and additional\ndiscussion appear in Section 2 and the full version.\n\n1.2 Related Work\n\nThis work complements [7] where we initiated the formal study of adaptivity in data analysis. The\nprimary focus of [7] is the problem of answering adaptively chosen statistical queries. The main\ntechnique is a strong connection between differential privacy and generalization: differential privacy\n\n4\n\n\fguarantees that the distribution of outputs does not depend too much on any one of the data samples,\nand thus, differential privacy gives a strong stability guarantee that behaves well under adaptive data\nanalysis. 
The link between generalization and approximate differential privacy made in [7] has been\nsubsequently strengthened, both qualitatively \u2014 by [1], who make the connection for a broader\nrange of queries \u2014 and quantitatively, by [14] and [1], who give tighter quantitative bounds. These\npapers, among other results, give methods for accurately answering exponentially (in the dataset\nsize) many adaptively chosen queries, but the algorithms for this task are not ef\ufb01cient. It turns out\nthis is for fundamental reasons \u2013 Hardt and Ullman [11] and Steinke and Ullman [19] prove that,\nunder cryptographic assumptions, no ef\ufb01cient algorithm can answer more than quadratically many\nstatistical queries chosen adaptively by an adversary who knows the true data distribution.\nThe classical approach in theoretical machine learning to ensure that empirical estimates generalize\nto the underlying distribution is based on the various notions of complexity of the set of functions\noutput by the algorithm, most notably the VC dimension. If one has a sample of data large enough\nto guarantee generalization for all functions in some class of bounded complexity, then it does not\nmatter whether the data analyst chooses functions in this class adaptively or non-adaptively. Our goal,\nin contrast, is to prove generalization bounds without making any assumptions about the class from\nwhich the analyst can output functions.\nAn important line of work [3, 15, 18] establishes connections between the stability of a learning\nalgorithm and its ability to generalize. Stability is a measure of how much the output of a learning\nalgorithm is perturbed by changes to its input. It is known that certain stability notions are necessary\nand suf\ufb01cient for generalization. 
Unfortunately, the stability notions considered in these prior works\ndo not compose in the sense that running multiple stable algorithms sequentially and adaptively may\nresult in a procedure that is not stable. The measure we introduce in this work (max information),\nlike differential privacy, has the strength that it enjoys adaptive composition guarantees. This makes\nit amenable to reasoning about the generalization properties of adaptively applied sequences of\nalgorithms, while having to analyze only the individual components of these algorithms. Connections\nbetween stability, empirical risk minimization and differential privacy in the context of learnability\nhave been recently explored in [21].\nNumerous techniques have been developed by statisticians to address common special cases of\nadaptive data analysis. Most of them address a single round of adaptivity such as variable selection\nfollowed by regression on selected variables or model selection followed by testing and are optimized\nfor speci\ufb01c inference procedures (the literature is too vast to adequately cover here, see Ch. 7 in [12]\nfor a textbook introduction and [20] for a survey of some recent work). In contrast, our framework\naddresses multiple stages of adaptive decisions, possible lack of a predetermined analysis protocol\nand is not restricted to any speci\ufb01c procedures.\nFinally, inspired by our work, Blum and Hardt [2] showed how to reuse the holdout set to maintain\nan accurate leaderboard in a machine learning competition that allows the participants to submit\nadaptively chosen models in the process of the competition (such as those organized by Kaggle Inc.).\nTheir analysis also relies on the description length-based technique we used to analyze SparseValidate.\n\n2 Max-Information\n\nPreliminaries: In the discussion below log refers to binary logarithm and ln refers to the natural\nlogarithm. 
For two random variables X and Y over the same domain X, the max-divergence of X from Y is defined as D∞(X‖Y) = log max_{x∈X} P[X = x]/P[Y = x]. The δ-approximate max-divergence is defined as\n\nD^δ∞(X‖Y) = log max_{O⊆X, P[X∈O]>δ} (P[X ∈ O] − δ)/P[Y ∈ O].\n\nDefinition 1. [9, 8] A randomized algorithm A with domain X^n for n > 0 is (ε, δ)-differentially private if for all pairs of datasets S, S′ ∈ X^n that differ in a single element: D^δ∞(A(S)‖A(S′)) ≤ log(e^ε). The case δ = 0 is sometimes referred to as pure differential privacy, and in this case we may say simply that A is ε-differentially private.\nConsider two algorithms A : X^n → Y and B : X^n × Y → Y′ that are composed adaptively, and assume that for every fixed input y ∈ Y, B generalizes for all but a β fraction of datasets. Here we are speaking of generalization informally: our definitions will support any property of input y ∈ Y and dataset S. Intuitively, to preserve generalization of B we want to make sure that the output of A does not reveal too much information about the dataset S. We demonstrate that this intuition can be captured via a notion of max-information and its relaxation, approximate max-information.\nFor two random variables X and Y we use X × Y to denote the random variable obtained by drawing X and Y independently from their probability distributions.\nDefinition 2. Let X and Y be jointly distributed random variables. The max-information between X and Y is defined as I∞(X; Y) = D∞((X, Y)‖X × Y). 
The β-approximate max-information is defined as I^β∞(X; Y) = D^β∞((X, Y)‖X × Y).\nIn our use, (X, Y) is going to be a joint distribution (S, A(S)), where S is a random n-element dataset and A is a (possibly randomized) algorithm taking a dataset as an input.\nDefinition 3. We say that an algorithm A has β-approximate max-information of k if for every distribution S over n-element datasets, I^β∞(S; A(S)) ≤ k, where S is a dataset chosen randomly according to S. We denote this by I^β∞(A, n) ≤ k.\nAn immediate corollary of our definition of approximate max-information is that it controls the probability of “bad events” that can happen as a result of the dependence of A(S) on S.\nTheorem 4. Let S be a random dataset in X^n and A be an algorithm with range Y such that for some β ≥ 0, I^β∞(S; A(S)) = k. Then for any event O ⊆ X^n × Y,\n\nP[(S, A(S)) ∈ O] ≤ 2^k · P[S × A(S) ∈ O] + β.\n\nIn particular, P[(S, A(S)) ∈ O] ≤ 2^k · max_{y∈Y} P[(S, y) ∈ O] + β.\nWe remark that mutual information between S and A(S) would not suffice for ensuring that bad events happen with tiny probability. For example mutual information of k allows P[(S, A(S)) ∈ O] to be as high as k/(2 log(1/δ)), where δ = P[S × A(S) ∈ O].\nApproximate max-information satisfies the following adaptive composition property:\nLemma 5. Let A : X^n → Y be an algorithm such that I^β1∞(A, n) ≤ k1, and let B : X^n × Y → Z be an algorithm such that for every y ∈ Y, B(·, y) has β2-approximate max-information k2. Let C : X^n → Z be defined such that C(S) = B(S, A(S)). Then I^(β1+β2)∞(C, n) ≤ k1 + k2.\nBounds on Max-information: Description length k gives the following bound on max-information.\nTheorem 6. 
Let A be a randomized algorithm taking as input an n-element dataset and outputting a value in a finite set Y. Then for every β > 0, I^β∞(A, n) ≤ log(|Y|/β).\nNext we prove a simple bound on the max-information of differentially private algorithms that applies to all distributions over datasets.\nTheorem 7. Let A be an ε-differentially private algorithm. Then I∞(A, n) ≤ log e · εn.\nFinally, we prove a stronger bound on approximate max-information for datasets consisting of i.i.d. samples using the technique from [7].\nTheorem 8. Let A be an ε-differentially private algorithm with range Y. For a distribution P over X, let S be a random variable drawn from P^n. Let Y = A(S) denote the random variable output by A on input S. Then for any β > 0, I^β∞(S; A(S)) ≤ log e · (ε²n/2 + ε√(n ln(2/β)/2)).\nOne way to apply a bound on max-information is to start with a concentration-of-measure result which ensures that the estimate of the predictor’s accuracy is correct with high probability when the predictor is chosen independently of the samples. For example, for a loss function with range [0, 1], Hoeffding’s bound implies that for a dataset consisting of i.i.d. samples the empirical estimate is not within τ of the true accuracy with probability ≤ 2e^(−2τ²n). Now, given a bound of log e · τ²n on the β-approximate max-information of the algorithm that produces the estimator, Thm. 4 implies that the produced estimate is not within τ of the true accuracy with probability ≤ 2^(log e · τ²n) · 2e^(−2τ²n) + β ≤ 2e^(−τ²n) + β. Thm. 7 implies that any τ²-differentially private algorithm has max-information of at most log e · τ²n. For a dataset consisting of i.i.d. samples Thm. 
8 implies that a τ-differentially private algorithm has β-approximate max-information of 1.25 log e · τ²n for β = 2e^(−τ²n).\n\n3 Reusable Holdout\n\nWe describe two simple algorithms that enable validation of the analyst’s queries in the adaptive setting.\nThresholdout: Our first algorithm, Thresholdout, follows the approach in [7], where differentially private algorithms are used to answer adaptively chosen statistical queries. This approach can also be applied to any low-sensitivity functions of the dataset, but for simplicity we present the results for statistical queries. Here we address an easier problem in which the analyst’s queries only need to be answered when they overfit. Also, unlike in [7], the analyst has full access to the training set and the holdout algorithm only prevents overfitting to the holdout dataset. As a result, unlike in the general query-answering setting, our algorithm can efficiently validate an exponential in n number of queries as long as a relatively small number of them overfit.\nFor a function φ : X → R and a dataset S = (x1, . . . , xn), let E_S[φ] = (1/n) Σ_{i=1}^n φ(xi). Thresholdout is given access to the training dataset St and holdout dataset Sh and a budget limit B. It allows any query of the form φ : X → [0, 1] and its goal is to provide an estimate of P[φ]. To achieve this the algorithm gives an estimate of E_Sh[φ] in a way that prevents overfitting of functions generated by the analyst to the holdout set. 
In other words, responses of Thresholdout are designed to ensure that, with high probability, ESh[φ] is close to P[φ], and hence an estimate of ESh[φ] gives an estimate of the true expectation P[φ].

Given a function φ, Thresholdout first checks whether the difference between the average value of φ on the training set St (that is, ESt[φ]) and the average value of φ on the holdout set Sh (that is, ESh[φ]) is below a certain threshold T + η. Here, T is a fixed number such as 0.01 and η is a Laplace noise variable whose standard deviation needs to be chosen depending on the desired guarantees (the Laplace distribution is a symmetric exponential distribution). If the difference is below the threshold, then the algorithm returns ESt[φ]. If the difference is above the threshold, then the algorithm returns ESh[φ] + ξ for another Laplace noise variable ξ. Each time the difference is above the threshold, the "overfitting" budget B is reduced by one. Once it is exhausted, Thresholdout stops answering queries. We provide the pseudocode of Thresholdout below.

Input: training set St, holdout set Sh, threshold T, noise rate σ, budget B

1. Sample γ ∼ Lap(2σ); T̂ ← T + γ.
2. For each query φ do:
   (a) If B < 1, output "⊥".
   (b) Else:
       i. Sample η ∼ Lap(4σ).
       ii. If |ESh[φ] − ESt[φ]| > T̂ + η:
           A. Sample ξ ∼ Lap(σ) and γ ∼ Lap(2σ).
           B. B ← B − 1 and T̂ ← T + γ.
           C. Output ESh[φ] + ξ.
       iii. Else output ESt[φ].

We now establish the formal generalization guarantees that Thresholdout enjoys.
Theorem 9. Let β, τ > 0 and m ≥ B > 0. We set T = 3τ/4 and σ = τ/(96 ln(4m/β)). Let S denote a holdout dataset of size n drawn i.i.d.
from a distribution P and St be any additional dataset over X. Consider an algorithm that is given access to St and adaptively chooses functions φ1, . . . , φm while interacting with Thresholdout, which is given the datasets S, St and values σ, B, T. For every i ∈ [m], let ai denote the answer of Thresholdout on function φi : X → [0, 1]. Further, for every i ∈ [m], we define the counter of overfitting Zi := |{j ≤ i : |P[φj] − ESt[φj]| > τ/2}|. Then

P[∃i ∈ [m] : Zi < B and |ai − P[φi]| ≥ τ] ≤ β

whenever n ≥ n0 = O((ln(m/β)/τ²) · min{B, √(B ln(ln(m/β)/τ))}).

SparseValidate: We now present a general algorithm for validation on the holdout set that can validate many arbitrary queries as long as few of them fail the validation. More formally, our algorithm allows the analyst to pick any Boolean function of a dataset ψ (or even any algorithm that outputs a single bit) and provides back the value of ψ on the holdout set, ψ(Sh). SparseValidate has a budget m for the total number of queries that can be asked and a budget B for the number of queries that returned 1. Once either of the budgets is exhausted, no additional answers are given. We now give a general description of the guarantees of SparseValidate.
Theorem 10. Let S denote a randomly chosen holdout set of size n. Let A be an algorithm that is given access to SparseValidate(m, B) and outputs queries ψ1, . . . , ψm such that each ψi is in some set Ψi of functions from X^n to {0, 1}. Assume that for every i ∈ [m] and ψi ∈ Ψi, P[ψi(S) = 1] ≤ βi. Let ψi be the random variable equal to the i'th query of A on S.
Then P[ψi(S) = 1] ≤ ℓi · βi, where ℓi = Σ_{j=0}^{min{i−1, B}} (i choose j) ≤ m^B.

In this general formulation it is the analyst's responsibility to use the budgets economically and to pick query functions that do not fail validation often. At the same time, SparseValidate ensures that (for the appropriate values of the parameters) the analyst can think of the holdout set as a fresh sample for the purposes of validation. Hence the analyst can pick queries in such a way that failing the validation reliably indicates overfitting. An example of the application of SparseValidate for answering statistical and low-sensitivity queries that is based on our analysis can be found in [1]. The analysis of generalization on the holdout set in [2] and the analysis of the Median Mechanism we give in the full version also rely on this sparsity-based technique.

Experiments: In our experiment the analyst is given a d-dimensional labeled dataset S of size 2n and splits it randomly into a training set St and a holdout set Sh of equal size. We denote an element of S by a tuple (x, y), where x is a d-dimensional vector and y ∈ {−1, 1} is the corresponding class label. The analyst wishes to select variables to be included in her classifier. For various values of the number of variables to select, k, she picks the k variables with the largest absolute correlations with the label. However, she verifies the correlations (with the label) on the holdout set and uses only those variables whose correlation agrees in sign with the correlation on the training set and whose correlations on both sets are larger than some threshold in absolute value. She then creates a simple linear threshold classifier on the selected variables using only the signs of the correlations of the selected variables.
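This selection-and-classification procedure can be sketched in pure Python. This is a minimal illustration under names of our own choosing, not the code used in the experiments; for simplicity, "correlation" here is the unnormalized inner product ⟨x_j, y⟩/n, which has the sign and ordering behavior needed for the selection step.

```python
import math


def correlations(X, y):
    """Unnormalized correlation of each column of X with labels y in {-1, 1}."""
    n, d = len(X), len(X[0])
    return [sum(X[i][j] * y[i] for i in range(n)) / n for j in range(d)]


def select_and_classify(Xt, yt, Xh, yh, k, threshold):
    """Pick the k variables most correlated with the label on the training
    set; keep only those whose holdout correlation agrees in sign and whose
    correlations on both sets exceed `threshold` in absolute value; then
    build a linear threshold classifier from the correlation signs alone."""
    ct = correlations(Xt, yt)           # training-set correlations
    ch = correlations(Xh, yh)           # holdout-set correlations
    top_k = sorted(range(len(ct)), key=lambda j: -abs(ct[j]))[:k]
    kept = [j for j in top_k
            if ct[j] * ch[j] > 0                       # signs agree
            and abs(ct[j]) > threshold
            and abs(ch[j]) > threshold]
    weights = {j: math.copysign(1.0, ct[j]) for j in kept}

    def classify(x):
        s = sum(w * x[j] for j, w in weights.items())
        return 1 if s >= 0 else -1

    return kept, classify
```

With adaptively reused data, the sign-agreement check against the holdout is exactly the step that overfits when repeated many times, which is what Thresholdout is designed to protect.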
A \ufb01nal test\nevaluates the classi\ufb01cation accuracy of the classi\ufb01er on both the training set and the holdout set.\n\nIn our \ufb01rst experiment, each attribute of x is drawn independently from the normal distribution\nN (0, 1) and we choose the class label y \u2208 {\u22121, 1} uniformly at random so that there is no correlation\nbetween the data point and its label. We chose n = 10, 000, d = 10, 000 and varied the number\nof selected variables k. In this scenario no classi\ufb01er can achieve true accuracy better than 50%.\nNevertheless, reusing a standard holdout results in reported accuracy of over 63% for k = 500 on\nboth the training set and the holdout set (the standard deviation of the error is less than 0.5%). The\naverage and standard deviation of results obtained from 100 independent executions of the experiment\nare plotted above. For comparison, the plot also includes the accuracy of the classi\ufb01er on another\nfresh data set of size n drawn from the same distribution. We then executed the same algorithm with\nour reusable holdout. Thresholdout was invoked with T = 0.04 and \u03c4 = 0.01 explaining why the\naccuracy of the classi\ufb01er reported by Thresholdout is off by up to 0.04 whenever the accuracy on the\nholdout set is within 0.04 of the accuracy on the training set. We also used Gaussian noise instead of\nLaplacian noise as it has stronger concentration properties. Thresholdout prevents the algorithm from\nover\ufb01tting to the holdout set and gives a valid estimate of classi\ufb01er accuracy. Additional experiments\nand discussion are presented in the full version.\n\n8\n\n\fReferences\n[1] Raef Bassily, Adam Smith, Thomas Steinke, and Jonathan Ullman. More general queries and\n\nless generalization error in adaptive data analysis. CoRR, abs/1503.04843, 2015.\n\n[2] Avrim Blum and Moritz Hardt. The ladder: A reliable leaderboard for machine learning\n\ncompetitions. 
CoRR, abs/1502.04585, 2015.
[3] Olivier Bousquet and André Elisseeff. Stability and generalization. JMLR, 2:499–526, 2002.
[4] Gavin C. Cawley and Nicola L. C. Talbot. On over-fitting in model selection and subsequent selection bias in performance evaluation. Journal of Machine Learning Research, 11:2079–2107, 2010.
[5] Chuong B. Do, Chuan-Sheng Foo, and Andrew Y. Ng. Efficient multiple hyperparameter learning for log-linear models. In NIPS, pages 377–384, 2007.
[6] Cynthia Dwork, Vitaly Feldman, Moritz Hardt, Toniann Pitassi, Omer Reingold, and Aaron Roth. Generalization in adaptive data analysis and holdout reuse. CoRR, abs/1506. Extended abstract to appear in NIPS 2015.
[7] Cynthia Dwork, Vitaly Feldman, Moritz Hardt, Toniann Pitassi, Omer Reingold, and Aaron Roth. Preserving statistical validity in adaptive data analysis. CoRR, abs/1411.2664, 2014. Extended abstract in STOC 2015.
[8] Cynthia Dwork, Krishnaram Kenthapadi, Frank McSherry, Ilya Mironov, and Moni Naor. Our data, ourselves: Privacy via distributed noise generation. In EUROCRYPT, pages 486–503, 2006.
[9] Cynthia Dwork, Frank McSherry, Kobbi Nissim, and Adam Smith. Calibrating noise to sensitivity in private data analysis. In Theory of Cryptography, pages 265–284. Springer, 2006.
[10] David A. Freedman. A note on screening regression equations. The American Statistician, 37(2):152–155, 1983.
[11] Moritz Hardt and Jonathan Ullman. Preventing false discovery in interactive data analysis is hard. In FOCS, pages 454–463, 2014.
[12] Trevor Hastie, Robert Tibshirani, and Jerome H. Friedman. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer Series in Statistics. Springer, 2009.
[13] John Langford. Clever methods of overfitting. http://hunch.net/?p=22, 2005.
[14] Kobbi Nissim and Uri Stemmer. On the generalization properties of differential privacy.
CoRR, abs/1504.05800, 2015.
[15] Tomaso Poggio, Ryan Rifkin, Sayan Mukherjee, and Partha Niyogi. General conditions for predictivity in learning theory. Nature, 428(6981):419–422, 2004.
[16] R. Bharat Rao and Glenn Fung. On the dangers of cross-validation: an experimental evaluation. In International Conference on Data Mining, pages 588–596. SIAM, 2008.
[17] Juha Reunanen. Overfitting in making comparisons between variable selection methods. Journal of Machine Learning Research, 3:1371–1382, 2003.
[18] Shai Shalev-Shwartz, Ohad Shamir, Nathan Srebro, and Karthik Sridharan. Learnability, stability and uniform convergence. The Journal of Machine Learning Research, 11:2635–2670, 2010.
[19] Thomas Steinke and Jonathan Ullman. Interactive fingerprinting codes and the hardness of preventing false discovery. arXiv preprint arXiv:1410.1228, 2014.
[20] Jonathan Taylor and Robert J. Tibshirani. Statistical learning and selective inference. Proceedings of the National Academy of Sciences, 112(25):7629–7634, 2015.
[21] Yu-Xiang Wang, Jing Lei, and Stephen E. Fienberg. Learning with differential privacy: Stability, learnability and the sufficiency and necessity of ERM principle. CoRR, abs/1502.06309, 2015.