{"title": "A Simple and Practical Algorithm for Differentially Private Data Release", "book": "Advances in Neural Information Processing Systems", "page_first": 2339, "page_last": 2347, "abstract": "We present a new algorithm for differentially private data release, based on a simple combination of the Exponential Mechanism with the Multiplicative Weights update rule. Our MWEM algorithm achieves what are the best known and nearly optimal theoretical guarantees, while at the same time being simple to implement and experimentally more accurate on actual data sets than existing techniques.", "full_text": "A Simple and Practical Algorithm\n\nfor Differentially Private Data Release\n\nMoritz Hardt\n\nIBM Almaden Research\n\nSan Jose, CA\n\nmhardt@us.ibm.com\n\nKatrina Ligett\u21e4\n\nCaltech\n\nkatrina@caltech.edu\n\nFrank McSherry\n\nMicrosoft Research SVC\n\nmcsherry@microsoft.com\n\nAbstract\n\nWe present a new algorithm for differentially private data release, based on a sim-\nple combination of the Multiplicative Weights update rule with the Exponential\nMechanism. Our MWEM algorithm achieves what are the best known and nearly\noptimal theoretical guarantees, while at the same time being simple to implement\nand experimentally more accurate on actual data sets than existing techniques.\n\n1\n\nIntroduction\n\nSensitive statistical data on individuals are ubiquitous, and publishable analysis of such private data\nis an important objective. When releasing statistics or synthetic data based on sensitive data sets, one\nmust balance the inherent tradeoff between the usefulness of the released information and the pri-\nvacy of the affected individuals. Against this backdrop, differential privacy [1, 2, 3] has emerged as a\ncompelling privacy de\ufb01nition that allows one to understand this tradeoff via formal, provable guaran-\ntees. 
In recent years, the theoretical literature on differential privacy has provided a large repertoire of techniques for achieving the definition in a variety of settings (see, e.g., [4, 5]). However, data analysts have found that several algorithms for achieving differential privacy add unacceptable levels of noise.\n\nIn this work we develop a broadly applicable, simple, and easy-to-implement algorithm, capable of substantially improving the performance of linear queries on many realistic datasets. Linear queries are equivalent to statistical queries (in the sense of [6]) and can serve as the basis of a wide range of data analysis and learning algorithms (see [7] for some examples).\n\nOur algorithm is a combination of the Multiplicative Weights approach of [8, 9], maintaining and correcting an approximating distribution through queries on which the approximate and true datasets differ, and the Exponential Mechanism [10], which selects the queries most informative to the Multiplicative Weights algorithm (specifically, those most incorrect vis-a-vis the current approximation). One can view our approach as combining expert learning techniques (multiplicative weights) with an active learning component (via the exponential mechanism).\n\nWe present experimental results for differentially private data release for a variety of problems studied in prior work: range queries as studied by [11, 12], contingency table release across a collection of statistical benchmarks as in [13], and datacube release as studied by [14]. We empirically evaluate the accuracy of the differentially private data produced by MWEM using the same query class and accuracy metric proposed by each of the corresponding prior works, improving on all. 
Beyond empirical improvements in these settings, MWEM matches the best known and nearly optimal theoretical accuracy guarantees for differentially private data analysis with linear queries.\n\n*Computer Science Department, Cornell University. Work supported in part by an NSF Computing Innovation Fellowship (NSF Award CNF-0937060) and an NSF Mathematical Sciences Postdoctoral Fellowship (NSF Award DMS-1004416).\n\nFinally, we describe a scalable implementation of MWEM capable of processing datasets of substantial complexity. Producing synthetic data for the classes of queries we consider is known to be computationally hard in the worst case [15, 16]. Indeed, almost all prior work performs computation proportional to the size of the data domain, which limits it to datasets with relatively few attributes. In contrast, we are able to process datasets with thousands of attributes, corresponding to domains of size 2^1000. Our implementation integrates a scalable parallel implementation of Multiplicative Weights, and a representation of the approximating distribution in a factored form that only exhibits complexity when the model requires it.\n\n2 Our Approach\n\nThe MWEM algorithm (Figure 1) maintains an approximating distribution over the domain D of data records, scaled up by the number of records. We repeatedly improve the accuracy of this approximation with respect to the private dataset and the desired query set by selecting and posing a query poorly served by our approximation, and improving the approximation to better reflect the true answer to this query. We select and pose queries using the Exponential [10] and Laplace Mechanisms [3], whose definitions and privacy properties we review in Subsection 2.1. 
We improve our approximation using the Multiplicative Weights update rule [8], reviewed in Subsection 2.2.\n\n2.1 Differential Privacy and Mechanisms\n\nDifferential privacy is a constraint on a randomized computation, requiring that the computation not reveal specifics of individual records present in the input. It places this constraint by requiring the mechanism to behave almost identically on any two datasets that are sufficiently close.\n\nImagine a dataset A whose records are drawn from some abstract domain D, and which is described as a function from D to the natural numbers N, with A(x) indicating the frequency (number of occurrences) of x in the dataset. We use ‖A − B‖ to indicate the sum of the absolute values of the differences in frequencies (how many records would have to be added or removed to change A to B).\n\nDefinition 2.1 (Differential Privacy). A mechanism M mapping datasets to distributions over an output space R provides (ε, δ)-differential privacy if for every S ⊆ R and for all data sets A, B where ‖A − B‖ ≤ 1,\n\nPr[M(A) ∈ S] ≤ e^ε Pr[M(B) ∈ S] + δ.\n\nIf δ = 0 we say that M provides ε-differential privacy.\n\nThe Exponential Mechanism [10] is an ε-differentially private mechanism that can be used to select among the best of a discrete set of alternatives, where "best" is defined by a function relating each alternative to the underlying secret data. Formally, for a set of alternative results R, we require a quality scoring function s : dataset × R → R, where s(B, r) is interpreted as the quality of the result r for the dataset B. To guarantee ε-differential privacy, the quality function is required to satisfy a stability property: for each result r the difference |s(A, r) − s(B, r)| is at most ‖A − B‖. 
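As an aside, both mechanisms reviewed in this subsection admit very short implementations. The sketch below (all names and structure are ours, not the paper's) samples from the Exponential Mechanism distribution Pr[r] ∝ exp(ε · s(B, r)/2) and adds Laplace noise of scale 1/ε for the Laplace Mechanism; it assumes a sensitivity-1 score function, as required above.

```python
import math
import random

def exponential_mechanism(dataset, results, score, epsilon):
    """Sample r with Pr[r] proportional to exp(epsilon * score(dataset, r) / 2).

    Assumes the score is stable: |s(A, r) - s(B, r)| <= ||A - B||.
    """
    scores = [score(dataset, r) for r in results]
    mx = max(scores)  # shift by the max score for numerical stability
    weights = [math.exp(epsilon * (s - mx) / 2) for s in scores]
    u = random.random() * sum(weights)
    for r, w in zip(results, weights):
        u -= w
        if u <= 0:
            return r
    return results[-1]  # guard against floating-point round-off

def laplace_mechanism(value, epsilon):
    """Add Laplace noise of scale 1/epsilon to a sensitivity-1 quantity.

    A difference of two Exp(epsilon) variables is Laplace-distributed.
    """
    return value + random.expovariate(epsilon) - random.expovariate(epsilon)
```

Note the running time is linear in the number of results, matching the remark in the text that s(B, r) is evaluated once per r.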
The Exponential Mechanism E simply selects a result r from the distribution satisfying\n\nPr[E(B) = r] ∝ exp(ε × s(B, r)/2).\n\nIntuitively, the mechanism selects result r biased exponentially by its quality score. The Exponential Mechanism takes time linear in the number of possible results, evaluating s(B, r) once for each r.\n\nA linear query (also referred to as a counting query or statistical query) is specified by a function q mapping data records to the interval [−1, +1]. The answer of a linear query on a data set B, denoted q(B), is the sum Σ_{x ∈ D} q(x) × B(x).\n\nThe Laplace Mechanism is an ε-differentially private mechanism which reports approximate sums of bounded functions across a dataset. If q is a linear query, the Laplace Mechanism L obeys\n\nPr[L(B) = r] ∝ exp(−ε × |r − q(B)|).\n\nAlthough the Laplace Mechanism is an instance of the Exponential Mechanism, it can be implemented much more efficiently, by adding Laplace noise with parameter 1/ε to the value q(B). As the Laplace distribution is exponentially concentrated, the Laplace Mechanism provides an excellent approximation to the true sum.\n\nInputs: Data set B over a universe D; set Q of linear queries; number of iterations T ∈ N; privacy parameter ε > 0; number of records n.\n\nLet A_0 denote n times the uniform distribution over D.\nFor iteration i = 1, ..., T:\n\n1. Exponential Mechanism: Select a query q_i ∈ Q using the Exponential Mechanism parameterized with epsilon value ε/2T and the score function\n\ns_i(B, q) = |q(A_{i−1}) − q(B)|.\n\n2. Laplace Mechanism: Let measurement m_i = q_i(B) + Lap(2T/ε).\n\n3. Multiplicative Weights: Let A_i be n times the distribution whose entries satisfy\n\nA_i(x) ∝ A_{i−1}(x) × exp(q_i(x) × (m_i − q_i(A_{i−1}))/2n).\n\nOutput: A = avg_{i ≤ T} A_i.\n\nFigure 1: The MWEM algorithm.\n\nTheorem 2.2. For any dataset B, set of linear queries Q, T ∈ N, and ε > 0, with probability at least 1 − 2T/|Q|, MWEM produces A such that\n\nmax_{q ∈ Q} |q(A) − q(B)| ≤ 2n √(log |D| / T) + 10T log |Q| / ε.\n\nProof. 
The proof of this theorem is an integration of pre-existing analyses of both the Exponential Mechanism and the Multiplicative Weights update rule, omitted for reasons of space.\n\nNote that these bounds are worst-case bounds, over adversarially chosen data and query sets. We will see in Section 3 that MWEM works very well in more realistic settings.\n\n2.3.1 Running time\n\nThe running time of our basic algorithm as described in Figure 1 is O(n|Q| + T|D||Q|). The algorithm is embarrassingly parallel: query evaluation can be conducted independently, implemented using modern database technology; the only required serialization is that the T steps must proceed in sequence, but within each step essentially all work is parallelizable.\n\nResults of Dwork et al. [17] show that for worst-case data, producing differentially private synthetic data for a set of counting queries requires time |D|^0.99 under reasonable cryptographic hardness assumptions. Moreover, Ullman and Vadhan [16] showed that similar lower bounds also hold for more basic query classes such as we consider in Section 3.2. Despite these hardness results, we provide an alternate implementation of our algorithm in Section 4 and demonstrate that its running time is acceptable on real-world data even in cases where |D| is as large as 2^77, and on simple synthetic input datasets where |D| is as large as 2^1000.\n\n2.3.2 Improvements and Variations\n\nThere are several ways to improve the empirical performance of MWEM at the expense of the theoretical guarantees. First, rather than use the average of the distributions A_i we use only the final distribution. Second, in each iteration we apply the multiplicative weights update rule for all measurements taken, multiple times; as long as any measurements do not agree with the approximating distribution (within error) we can improve the result. 
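To make the basic loop of Figure 1 concrete before discussing further variations, here is a minimal, illustrative sketch of MWEM over an explicit histogram. It is our own rendering, not the authors' code: queries are dense weight vectors with entries in [−1, 1], the exponential-mechanism sampling and Laplace noise (as a difference of exponentials) are inlined, and no attention is paid to efficiency.

```python
import math
import random

def evaluate(q, hist):
    """Answer of linear query q (weights in [-1, 1]) on histogram hist."""
    return sum(qx * hx for qx, hx in zip(q, hist))

def mwem(B, queries, T, epsilon, n):
    """Illustrative MWEM: B is a histogram over the domain, counts summing to n."""
    D = len(B)
    A = [n / D] * D  # A_0: n times the uniform distribution over the domain
    for _ in range(T):
        # 1. Exponential Mechanism with budget eps/2T and score |q(A) - q(B)|
        scores = [abs(evaluate(q, A) - evaluate(q, B)) for q in queries]
        mx = max(scores)  # shift by the max for numerical stability
        w = [math.exp(epsilon / (2 * T) * (s - mx) / 2) for s in scores]
        u = random.random() * sum(w)
        i = 0
        while i < len(w) - 1 and u > w[i]:
            u -= w[i]
            i += 1
        q = queries[i]
        # 2. Laplace Mechanism: noisy measurement with noise of scale 2T/eps
        lam = epsilon / (2 * T)
        m = evaluate(q, B) + random.expovariate(lam) - random.expovariate(lam)
        # 3. Multiplicative Weights update toward the noisy measurement
        err = m - evaluate(q, A)
        A = [a * math.exp(qx * err / (2 * n)) for a, qx in zip(A, q)]
        total = sum(A)
        A = [a * n / total for a in A]  # renormalize to total mass n
    return A
```

Following the first variation described above, this sketch returns only the final distribution rather than the average of the A_i.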
Finally, it is occasionally helpful to initialize A_0 by performing a noisy count for each element of the domain; this consumes from the privacy budget and lessens the accuracy of subsequent queries, but is often a good trade-off.\n\n2.4 Related Work\n\nThe study of differentially private synthetic data release mechanisms for arbitrary counting queries began with the work of Blum, Ligett, and Roth [18], who gave a computationally inefficient (superpolynomial in |D|) ε-differentially private algorithm that achieves error scaling only logarithmically with the number of queries. The dependence on n and |Q| achieved by their algorithm is O(n^{2/3} log^{1/3} |Q|) (which is the same dependence achieved by optimizing the choice of T in Theorem 2.2). Since [18], subsequent work [17, 19, 20, 8] has focused on computationally more efficient algorithms (i.e., polynomial in |D|) as well as algorithms that work in the interactive query setting. The latest of these results is the private Multiplicative Weights method of Hardt and Rothblum [8], which achieves error rates of O(√n log(|Q|)) for (ε, δ)-differential privacy (which is the same dependence achieved by applying k-fold adaptive composition [19] and optimizing T in our Theorem 2.2). While their algorithm works in the interactive setting, it can also be used non-interactively to produce synthetic data, albeit at a computational overhead of O(n). MWEM can also be cast as an instance of a more general Multiplicative-Weights based framework of Gupta et al. [9], though our specific instantiation and its practical appeal were not anticipated in their work.\n\nPrior work on linear queries includes Fienberg et al. [13] and Barak et al. [21] on contingency tables; Li et al. [22] on range queries (and substantial related work [23, 24, 22, 11, 12, 25], which Li and Miklau [11, 25] show can all be seen as instances of the matrix mechanism of [22]); and Ding et al. [14] on data cubes. 
In each case, MWEM's theoretical guarantees and experimental performance improve on prior work. We compare further in Section 3.\n\n3 Experimental Evaluation\n\nWe evaluate MWEM across a variety of query classes, datasets, and metrics as explored by prior work, demonstrating improvement in the quality of approximation (often significant) in each case. The problems we consider are: (1) range queries under the total squared error metric, (2) binary contingency table release under the relative entropy metric, and (3) datacube release under the average absolute error metric. Although contingency table release and datacube release are very similar, prior work on the two has had different focuses: small datasets over many binary attributes vs. large datasets over few categorical attributes, low-order marginals vs. all cuboids as queries, and relative entropy vs. the average error within a cuboid as metrics.\n\nOur general conclusion is that intelligently selecting the queries to measure can result in significant accuracy improvements, in settings where accuracy is a scarce resource. When the privacy parameters are very lax, or the query set very simple, direct measurement of all queries yields better results than expending some fraction of the privacy budget determining what to measure. On the other hand, in the more challenging case of restrictions on privacy for complex data and query sets, MWEM can substantially outperform previous algorithms.\n\n3.1 Range Queries\n\nA range query over a domain D = {1, ..., N} is a counting query specified by the indicator function of an interval I ⊆ D. Over a multi-dimensional domain D = D_1 × ... × D_d a range query is defined by the product of indicator functions. Differentially private algorithms for range queries were specifically considered by [18, 23, 24, 22, 11, 12, 25]. 
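The range queries just defined are simple to state in code; the following sketch (our own naming, over a zero-indexed domain for convenience) builds the indicator of an interval and evaluates it as a linear query against a histogram.

```python
def range_query(lo, hi):
    """Indicator function of the interval [lo, hi] over domain {0, ..., N-1}."""
    return lambda x: 1 if lo <= x <= hi else 0

def range_query_2d(r1, r2):
    """Multi-dimensional range query: a product of per-coordinate indicators."""
    return lambda x: range_query(*r1)(x[0]) * range_query(*r2)(x[1])

def answer(query, hist):
    """Linear query answer: sum over x of q(x) times the count hist[x]."""
    return sum(query(x) * count for x, count in enumerate(hist))
```

For example, on the histogram [1, 2, 3, 4], the query for the interval [1, 2] counts the 2 + 3 = 5 records falling in that range.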
As noted in [11, 25], all previously implemented algorithms for range queries can be seen as instances of the matrix mechanism of [22]. Moreover, [11, 25] show a lower bound on the total squared error achieved by the matrix mechanism in terms of the singular values of a matrix associated with the set of queries. We refer to this bound as the SVD bound.\n\nFigure 2: Comparison of MWEM with the SVD lower bound on four data sets. The y-axis measures the average squared error per query, averaged over 5 independent repetitions of the experiment, as epsilon varies. The improvement is most significant for small epsilon, diminishing as epsilon increases.\n\nWe empirically evaluate MWEM for range queries on restrictions of the Adult data set [26] to (a) the "capital loss" attribute, and (b) the "age" and "hours" attributes, as well as the restriction of the Blood Transfusion data set [26, 27] to (c) the "recency" and "frequency" attributes, and (d) the "monetary" attribute. We chose these data sets as they feature numerical attributes of suitable size. In Figure 2, we compare the performance of MWEM on sets of randomly chosen range queries against the SVD lower bound proved by [11, 25], varying ε while keeping the number of queries fixed. The SVD lower bound holds for algorithms achieving the strictly weaker guarantee of (ε, δ)-differential privacy with δ > 0, permitting some probability of unbounded disclosure. 
The SVD bound depends on δ; in our experiments we fixed δ = 1/n when instantiating the SVD bound, as any larger value of δ permits mechanisms capable of exact release of individual records.\n\n[Figure 2 panels: "Adult: capital loss", "Adult: age x hours", "Transfusion: recency x frequency", and "Transfusion: monetary", each plotting MWEM (T = 10) against the SVD Lower Bound.]\n\nFigure 3: Relative entropy (y-axis) as a function of epsilon (x-axis) for the mildew, rochdale, and czech datasets, respectively. The lines represent averages across 100 runs, and the corresponding shaded areas one standard deviation in each direction. Red (dashed) represents the modified Barak et al. [21] algorithm, green (dot-dashed) represents unoptimized MWEM, and blue (solid) represents the optimized version thereof. The solid black horizontal line is the stated relative entropy values from Fienberg et al. [13].\n\n3.2 Contingency Tables\n\nA contingency table can be thought of as a table of records over d binary attributes, and the k-way marginal is represented by the 2^k counts of the records with each possible setting of attributes. In previous work, Barak et al. [21] describe an approach to differentially private contingency table release using linear queries defined by the Hadamard matrix. 
Importantly, all k-dimensional marginals of a contingency table, which correspond to the (d choose k) possible choices of k attributes, can be exactly recovered by examination of relatively few such queries: roughly (d choose k) out of the possible 2^d, improving over direct measurement of the marginals by a factor of 2^k. This algorithm is evaluated by Fienberg et al. [13], and was found to do poorly on several benchmark datasets.\n\nWe evaluate our approximate dataset following Fienberg et al. [13] using relative entropy, also known as the Kullback-Leibler (or KL) divergence. Formally, the relative entropy between our two distributions (A/n and B/n) is\n\nRE(B‖A) = Σ_{x ∈ D} B(x) log(B(x)/A(x)) / n.\n\nWe use several statistical datasets from Fienberg et al. [13], and evaluate two variants of MWEM (both with and without initialization of A_0) against a modification of Barak et al. [21] which combines its observations using multiplicative weights (we find that without this modification, [21] performs terribly with respect to relative entropy). These experiments are therefore largely assessing the selective choice of measurements to take, rather than the efficacy of multiplicative weights.\n\nFigure 3 presents the evaluation of MWEM on several small datasets in common use by statisticians. Our findings here are fairly uniform across the datasets: the ability to measure only those queries that are informative about the dataset results in substantial savings over taking all possible measurements. In many cases MWEM approaches the good non-private values of [13], indicating that we can approach levels of accuracy at the limit of statistical validity.\n\nWe also consider a larger dataset, the National Long-Term Care Study (NLTCS), in Figure 4. This dataset contains orders of magnitude more records, and has 16 binary attributes. 
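As an aside, the relative-entropy metric defined above is straightforward to compute directly from the two histograms; a sketch under our own naming, with the usual convention of skipping cells where B is empty:

```python
import math

def relative_entropy(B, A, n):
    """RE(B || A) = sum over x of B(x) * log(B(x) / A(x)) / n.

    B and A are histograms with total mass n; cells with B(x) = 0
    contribute nothing (0 * log 0 is taken to be 0).
    """
    return sum(b * math.log(b / a) for b, a in zip(B, A) if b > 0) / n
```

The quantity is zero exactly when the approximation matches the data, and grows as mass in B lands where A places little weight.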
For our initial settings, maintaining all three-way marginals, we see similar behavior as above: the ability to choose the measurements that are important allows substantially higher accuracy on those that matter. However, we see that the algorithm of Barak et al. [21] is substantially more competitive in the regime where we are interested in querying all two-dimensional marginals, rather than the default three we have been using. In this case, for values of epsilon at least 0.1, it seems that there is enough signal present to simply measure all corresponding entries of the Hadamard transform; each is sufficiently informative that measuring substantially fewer at higher accuracy imparts less information, rather than more.\n\nFigure 4: Curves comparing our approach with that of Barak et al. on the National Long Term Care Survey. The red (dashed) curve represents Barak et al., and the multiple blue (solid) curves represent MWEM, with 20, 30, and 40 queries (top to bottom, respectively). From left to right, the first two figures correspond to degree 2 marginals, and the third to degree 3 marginals. As before, the x-axis is the value of epsilon guaranteed, and the y-axis is the relative entropy between the produced distribution and the actual dataset. The lines represent averages across only 10 runs, owing to the high complexity of Barak et al. on this many-attributed dataset, and the corresponding shaded areas one standard deviation in each direction.\n\n3.3 Data Cubes\n\nWe now change our terminology and objectives, shifting our view of contingency tables to one of datacubes. 
The two concepts are interchangeable: a contingency table corresponds to a datacube, and a marginal to one of its cuboids. However, the datasets studied and the metrics applied are different. We focus on the restriction of the Adult dataset [26] to its eight categorical attributes, as done in [14], and evaluate our approximations using average error within a cuboid, also as in [14].\n\nAlthough MWEM is defined with respect to a single query at a time, it generalizes to sets of counting queries, as reflected in a cuboid. The Exponential Mechanism can select a cuboid to measure using a quality score function summing the absolute values of the errors within the cells of the cuboid. We also (heuristically) subtract the number of cells from the score of a cuboid to bias the selection away from cuboids with many cells, which would collect Laplace error in each cell. This subtraction does not affect privacy properties. An entire cuboid can be measured with a single differentially private query, as any record contributes to at most one cell (this is a generalization of the Laplace Mechanism to multiple dimensions, from [3]). Finally, Multiplicative Weights works unmodified, increasing and decreasing weights based on the over- or under-estimation of the count to which the record contributes.\n\nFigure 5: Comparison of MWEM with the custom approaches from [14], varying epsilon through the reported values from [14]. Each cuboid (marginal) is assessed by its average error, and either the average or maximum over all 256 marginals is taken to evaluate the technique.\n\nWe compare MWEM with the work of [14] in Figure 5. The average average error improves noticeably, by approximately a factor of four. The maximum average error is less clear; experimentally we have found we can bring the numbers lower using different heuristic variants of MWEM, but without principled guidance we report only the default behavior. 
Of note, our results are achieved by a single algorithm, whereas the best results for maximum and average error in [14] are achieved by two different algorithms, each designed to optimize one specific metric.\n\n4 A Scalable Implementation\n\nThe implementation of MWEM used in the previous experiments quite literally maintains a distribution A_i over the elements of the universe D. As the number of attributes grows, the universe D grows exponentially, and it can quickly become infeasible to track the distribution explicitly. In this section, we consider a scalable implementation with essentially no memory footprint, whose running time is in the worst case proportional to |D|, but which for many classes of simple datasets remains linear in the number of attributes.\n\nRecall that the heart of MWEM maintains a distribution A_i over D that is then used in the Exponential Mechanism to select queries poorly approximated by the current distribution. From the definition of the Multiplicative Weights distribution, we see that the weight A_i(x) can be determined from the history H_i = {(q_j, m_j) : j ≤ i}:\n\nA_i(x) ∝ exp( Σ_{j ≤ i} q_j(x) × (m_j − q_j(A_{j−1}))/2n ).\n\nWe explicitly record the scaling factors l_j = m_j − q_j(A_{j−1}) as part of the history H_i = {(q_j, m_j, l_j) : j ≤ i}, to remove the dependence on prior A_j.\n\nThe domain D is often the product of many attributes. If we partition these attributes into disjoint parts D_1, D_2, ..., D_k so that no query in H_i involves attributes from more than one part, then the distribution produced by Multiplicative Weights is a product distribution over D_1 × D_2 × ... × D_k. 
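To illustrate why the product form helps, a query that itself factors across the parts can be evaluated part by part, without ever materializing the full domain. The sketch below is our own (all names are assumptions): each part holds a small histogram, and a query supplies one per-part factor, with the constant function 1 for parts it does not touch.

```python
def evaluate_factored(parts, query_parts):
    """Evaluate sum over x of q(x) * A(x) for a product distribution A.

    parts: one dict per part D_j, mapping an attribute setting to its weight.
    query_parts: one function q_j per part; the product of the q_j is q.
    The sum over the product domain factors into a product of per-part sums.
    """
    total = 1.0
    for hist, qj in zip(parts, query_parts):
        total *= sum(qj(x) * w for x, w in hist.items())
    return total
```

The cost is the sum of the part sizes rather than their product, which is the source of the linear-in-attributes behavior reported below for datasets whose measurements stay within modest independent groups.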
For query classes that factorize over the attributes of the domain (for example, range queries, marginal queries, and cuboid queries) we can rewrite and efficiently perform the integration over D using\n\nΣ_{x ∈ D_1 × D_2 × ... × D_k} q(x) × A_i(x) = Π_{1 ≤ j ≤ k} ( Σ_{x_j ∈ D_j} q(x_j) × A_i^j(x_j) ),\n\nwhere A_i^j is a mini Multiplicative Weights over the attributes in part D_j, using only the relevant queries from H_i. So long as the measurements taken reflect modest groups of independent attributes, the integration can be efficiently performed. As the measurements overlap more and more, additional computation or approximation is required. The memory footprint is only the combined size of the data, query, and history sets.\n\nExperimentally, we are able to process a binarized form of the Adult dataset with 27 attributes efficiently (taking 80 seconds to process completely), and the addition of 50 new independent binary attributes, corresponding to a domain of size 2^77, results in negligible performance impact. For a simple synthetic dataset with up to 1,000 independent binary attributes, the factorized implementation of MWEM takes only 19 seconds for a complete execution.\n\n5 Conclusions\n\nWe introduced MWEM, a simple algorithm for releasing data that maintains high fidelity to the protected source data, as well as differential privacy with respect to the records. The approach builds upon the Multiplicative Weights approach of [8, 9], by introducing the Exponential Mechanism [10] as a more judicious approach to determining which measurements to take. The theoretical analysis matches previous work in the area, and experimentally we have evidence that for many interesting settings, MWEM represents a substantial improvement over existing techniques.\n\nAs well as improving on experimental error, the algorithm is both simple to implement and simple to use. 
An analyst does not require a complicated mathematical understanding of the nature of the queries (as the community has for linear algebra [11] and the Hadamard transform [21]), but rather only needs to enumerate those measurements that should be preserved. We hope that this generality leads to a broader class of high-fidelity differentially-private data releases across a variety of data domains.\n\nReferences\n\n[1] I. Dinur and K. Nissim. Revealing information while preserving privacy. In PODS, 2003.\n[2] Cynthia Dwork and Kobbi Nissim. Privacy-preserving datamining on vertically partitioned databases. In CRYPTO. Springer, 2004.\n[3] C. Dwork, F. McSherry, K. Nissim, and A. Smith. Calibrating noise to sensitivity in private data analysis. In TCC, 2006.\n[4] Cynthia Dwork. The differential privacy frontier (extended abstract). In TCC, 2009.\n[5] Cynthia Dwork. The promise of differential privacy: A tutorial on algorithmic techniques. In FOCS, 2011.\n[6] Michael J. Kearns. Efficient noise-tolerant learning from statistical queries. Journal of the ACM (JACM), 45(6):983–1006, 1998.\n[7] Avrim Blum, Cynthia Dwork, Frank McSherry, and Kobbi Nissim. Practical privacy: the SuLQ framework. In Proc. 24th PODS, pages 128–138. ACM, 2005.\n[8] Moritz Hardt and Guy Rothblum. A multiplicative weights mechanism for interactive privacy-preserving data analysis. In FOCS, 2010.\n[9] Anupam Gupta, Moritz Hardt, Aaron Roth, and Jon Ullman. Privately releasing conjunctions and the statistical query barrier. In STOC, 2011.\n[10] Frank McSherry and Kunal Talwar. Mechanism design via differential privacy. In FOCS, 2007.\n[11] Chao Li and Gerome Miklau. Efficient batch query answering under differential privacy. CoRR, abs/1103.1367, 2011.\n[12] Chao Li and Gerome Miklau. An adaptive mechanism for accurate query answering under differential privacy. To appear, PVLDB, 2012.\n[13] Stephen E. 
Fienberg, Alessandro Rinaldo, and Xiaolin Yang. Differential privacy and the risk-utility trade-off for multi-dimensional contingency tables. In Privacy in Statistical Databases, 2010.\n[14] Bolin Ding, Marianne Winslett, Jiawei Han, and Zhenhui Li. Differentially private data cubes: optimizing noise sources and consistency. In SIGMOD, 2011.\n[15] Cynthia Dwork, Moni Naor, Omer Reingold, Guy N. Rothblum, and Salil P. Vadhan. On the complexity of differentially private data release: efficient algorithms and hardness results. In STOC, 2009.\n[16] Jonathan Ullman and Salil P. Vadhan. PCPs and the hardness of generating private synthetic data. In TCC, 2011.\n[17] C. Dwork, M. Naor, O. Reingold, G.N. Rothblum, and S. Vadhan. On the complexity of differentially private data release: efficient algorithms and hardness results. In STOC, 2009.\n[18] Avrim Blum, Katrina Ligett, and Aaron Roth. A learning theory approach to non-interactive database privacy. In STOC, 2008.\n[19] Cynthia Dwork, Guy Rothblum, and Salil Vadhan. Boosting and differential privacy. In FOCS, 2010.\n[20] Aaron Roth and Tim Roughgarden. The median mechanism: Interactive and efficient privacy with multiple queries. In STOC, 2010.\n[21] B. Barak, K. Chaudhuri, C. Dwork, S. Kale, F. McSherry, and K. Talwar. Privacy, accuracy, and consistency too: a holistic solution to contingency table release. In PODS, 2007.\n[22] C. Li, M. Hay, V. Rastogi, G. Miklau, and A. McGregor. Optimizing linear counting queries under differential privacy. In PODS, 2010.\n[23] Xiaokui Xiao, Guozhang Wang, and Johannes Gehrke. Differential privacy via wavelet transforms. IEEE Transactions on Knowledge and Data Engineering, 23:1200–1214, 2011.\n[24] Michael Hay, Vibhor Rastogi, Gerome Miklau, and Dan Suciu. Boosting the accuracy of differentially-private queries through consistency. In VLDB, 2010.\n[25] Chao Li and Gerome Miklau. 
Measuring the achievable error of query sets under differential privacy. CoRR, abs/1202.3399v2, 2012.\n[26] A. Frank and A. Asuncion. UCI machine learning repository, 2010.\n[27] I-Cheng Yeh, King-Jang Yang, and Tao-Ming Ting. Knowledge discovery on RFM model using Bernoulli sequence. Expert Systems with Applications, 36(3), 2008.\n", "award": [], "sourceid": 1143, "authors": [{"given_name": "Moritz", "family_name": "Hardt", "institution": null}, {"given_name": "Katrina", "family_name": "Ligett", "institution": null}, {"given_name": "Frank", "family_name": "Mcsherry", "institution": null}]}