{"title": "Differentially Private M-Estimators", "book": "Advances in Neural Information Processing Systems", "page_first": 361, "page_last": 369, "abstract": "This paper studies privacy preserving M-estimators using perturbed histograms. The proposed approach allows the release of a wide class of M-estimators with both differential privacy and statistical utility without knowing a priori the particular inference procedure. The performance of the proposed method is demonstrated through a careful study of the convergence rates. A practical algorithm is given and applied on a real world data set containing both continuous and categorical variables.", "full_text": "Differentially Private M-Estimators\n\nLei, Jing\n\nDepartment of Statistics\n\nCarnegie Mellon University\n\nPittsburgh, PA 15213\n\njinglei@andrew.cmu.edu\n\nAbstract\n\nThis paper studies privacy preserving M-estimators using perturbed histograms.\nThe proposed approach allows the release of a wide class of M-estimators with\nboth differential privacy and statistical utility without knowing a priori the particular\ninference procedure. The performance of the proposed method is demonstrated\nthrough a careful study of the convergence rates. A practical algorithm is given\nand applied on a real world data set containing both continuous and categorical\nvariables.\n\n1 Introduction\n\nPrivacy-preserving data analysis has received increasing attention in recent years. Among various\nnotions of privacy, differential privacy [1, 2] provides a mathematically rigorous privacy guarantee\nand protects against essentially all kinds of identity attacks regardless of the auxiliary information\nthat may be available to the attackers. 
Differential privacy requires that the presence or absence of\nany individual data record can never greatly change the outcome and hence the user can hardly learn\nmuch about any individual data record from the output.\nHowever, designing differentially private statistical inference procedures has been a challenging\nproblem. Differential privacy protects individual data by introducing uncertainty in the outcome,\nwhich generally requires the output of any inference procedure to be random even for a fixed input\ndata set. This makes differentially private statistical analysis different from most traditional statistical\ninference procedures, which are deterministic once the data set is given. Most existing works\n[3, 4, 5] focus on the interactive data release where a particular statistical inference problem is chosen\na priori and the randomized output for that particular inference is released to the users. In reality\na data release that allows multiple inference procedures is often desired because real world statistical\nanalyses usually consist of a series of inferences such as exploratory analysis, model fitting,\nand model selection, where the exact inference problem in a later stage is determined by results of\nprevious steps and cannot be determined in advance.\nIn this work we study M-estimators under a differentially private framework. The proposed method\nuses perturbed histograms to provide a systematic way of releasing a class of M-estimators in a\nnon-interactive fashion. Such a non-interactive method uses randomization independent of any particular\ninference procedure, so it allows the users to apply different inference procedures on\nthe same synthetic data set without additional privacy compromise. 
The accuracy of these privacy-preserving\nestimates has also been studied and we prove that, under mild conditions on the contrast\nfunctions of the M-estimators, the proposed differentially private M-estimators are consistent. As\na special case, this approach gives 1/\u221an-consistent estimates for quantiles, providing a simple and\nefficient alternative solution to similar problems considered in [4, 5]. Our main condition requires\nconvexity and bounded partial derivatives of the contrast function. The convexity is used to ensure\nthe existence and stability of the M-estimator whereas the bounded derivative controls the bias\ncaused by the perturbed histogram. In the classical theory of M-estimators, a contrast function with\nbounded derivative implies robustness of the corresponding M-estimator. This is further evidence\nof the natural connection between robustness and differential privacy [4].\nWe also describe an algorithm that is conceptually simple and computationally feasible. It is flexible\nenough to accommodate continuous, ordinal, and categorical variables at the same time, as\ndemonstrated by its application to a Bay Area housing data set.\n\n1.1 Related Work\n\nThe perturbed histogram is first described under the context of differential privacy in [1]. The problem\nof non-interactive release has also been studied by [6], which targets releasing the differentially\nprivate distribution function or the density function in a non-parametric setting. Theoretically,\nM-estimators could be indirectly obtained from the released density function. However, the more\ndirect perspective taken in this paper leads to an improved rate of convergence as well as an efficient\nalgorithm.\nSeveral aspects of parameter estimation problems have been studied with differential privacy under\nthe interactive framework. 
In particular, [4] shows that many robust estimators can be made differentially\nprivate and that general private estimators can be obtained from composition of robust\nlocation and scale estimators. [5] shows that statistical estimators with generic asymptotic normality\ncan be made differentially private with the same asymptotic variance. Both works involve estimating\nthe inter-quartile range in a differentially private manner, where the algorithm may output \u201cNo\nResponse\u201d [4], or the data is assumed to have known upper and lower bounds [5]. In a slightly\ndifferent context, [3] considers penalized logistic regression as a special case of empirical risk minimization,\nwhere the penalized logistic regression coefficients are estimated with differential privacy\nby minimizing a perturbed objective function. Their method uses a different form of perturbation\nand is still interactive. It connects with the present paper in the sense that the perturbation is finally\nexpressed in the objective function. Both papers assume convexity, which ensures that the shift in\nthe minimizer is small when the deviation in the objective function is small. We also note that the\nmethod in [3] depends on a strictly convex penalty term which is typically used in high-dimensional\nproblems, while our method works for problems where no penalization is used.\n\n2 Preliminaries\n\n2.1 Definition of Privacy\nA database is modeled as a set of data points D = {x_1, . . . , x_n} \u2208 X^n, where X is the data universe.\nIn most cases each data entry x_i represents the microdata of an individual. We use the Hamming\ndistance to measure the proximity between two databases of the same size. For |D| = |D\u2032|, the\nHamming distance is H(D, D\u2032) = |D\\D\u2032| = |D\u2032\\D|. 
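As a small illustration of the Hamming distance defined above, the following sketch counts the records of D that have no match in D′. It is only an illustration: treating a database as a multiset stored in a Python list, and the function name hamming, are assumptions made here rather than anything specified in the paper.

```python
from collections import Counter


def hamming(D, D_prime):
    # Assumption: databases are multisets of hashable records of equal size.
    if len(D) != len(D_prime):
        raise ValueError('databases must have the same size')
    # Counter subtraction keeps only the records of D left unmatched in D'.
    unmatched = Counter(D) - Counter(D_prime)
    return sum(unmatched.values())
```

Two databases are neighbors in the sense used by Definition 1 exactly when this function returns 1, that is, when one record has been replaced by another.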
The objective of data privacy is to release\nuseful information from the data set while protecting information about any individual data entry.\nDefinition 1 (Differential Privacy [1]). A randomized function T(D) gives \u03b1-differential privacy if\nfor all pairs of databases (D, D\u2032) with H(D, D\u2032) = 1 and all measurable subsets E of the image of\nT:\n\n|log( P(T \u2208 E | D) / P(T \u2208 E | D\u2032) )| \u2264 \u03b1.    (1)\n\nIn the rest of this paper we assume that n, the size of the database, is public.\n\n2.2 The Perturbed Histogram\n\nIn most statistical problems, a database D consists of n independent copies of a random variable\nX with density f(x). For simplicity, we assume X = [0, 1]^d. As we will see in Section 3.2,\nour method can be extended to non-compact X for some important examples. Suppose [0, 1]^d is\npartitioned into cubic cells with equal bandwidth h_n such that k_n = h_n^{\u22121} is an integer. Denote each\ncell as B_r = \u2297_{j=1}^{d} [(r_j \u2212 1)h_n, r_j h_n) for all r = (r_1, ..., r_d) \u2208 {1, ..., k_n}^d. (To make sure that\nthe B_r\u2019s do form a partition of [0, 1]^d, the interval should be [(k_n \u2212 1)h_n, 1] when r_j = k_n.) The histogram\ndensity estimator is then\n\n\u02c6f_{hist}(x) = h_n^{\u2212d} \u2211_r (n_r / n) 1(x \u2208 B_r),    (2)\n\nwhere n_r := \u2211_{i=1}^{n} 1(X_i \u2208 B_r) is the number of data points in B_r.\n\nClearly the density estimator described above depends on the data only through the histogram counts\n(n_r, r \u2208 {1, . . . , k_n}^d). If we can find a differentially private version of (n_r, r \u2208 {1, . . . , k_n}^d),\nthen the corresponding density estimator \u02c6f will also be differentially private by a simple change-of-measure\nargument. We consider the following perturbed histogram as described in [1]:\n\n\u02c6n_r = n_r + z_r, \u2200 r \u2208 {1, . . . , k_n}^d,    (3)\n\nwhere the z_r\u2019s are independent with density \u03b1 exp(\u2212\u03b1|z|/2)/4. We have\nLemma 2 ([1]). (\u02c6n_r, r \u2208 {1, . . . , k_n}^d) satisfies \u03b1-differential privacy.\nWe call (\u02c6n_r, r \u2208 {1, . . . , k_n}^d) the Perturbed Histogram. Substituting \u02c6n_r for n_r in (2), we obtain a\ndifferentially private version of \u02c6f_{hist}:\n\n\u02c6f_{PH}(x) = h_n^{\u2212d} \u2211_r (\u02c6n_r / n) 1(x \u2208 B_r).    (4)\n\nIn general \u02c6f_{PH} given by (4) is not a valid density function, since it can take negative values and may\nnot integrate to 1. To avoid these undesirable properties, [6] uses \u02dcn_r = (\u02c6n_r \u2228 0) instead of \u02c6n_r and\n\u02dcn = \u2211_r \u02dcn_r instead of n so that the resulting density estimator is non-negative and integrates to 1.\n\n2.3 M-estimators\nGiven a random variable X with density f(x), the parameter of interest is defined as \u03b8\u2217 =\narg min_{\u0398} M(\u03b8), where M(\u03b8) = \u222b m(x, \u03b8) f(x) dx, \u0398 \u2286 R^p, and m(x, \u03b8) is the contrast function.\nAssuming the X_i are iid \u223c f, the corresponding M-estimator is usually obtained by minimizing the\nempirical average of the contrast function:\n\n\u02c6\u03b8 = arg min_{\u03b8\u2208\u0398} M_n(\u03b8), where M_n(\u03b8) = n^{\u22121} \u2211_{i=1}^{n} m(X_i, \u03b8).    (5)\n\nM-estimators cover many important statistical inference procedures such as sample quantiles, maximum\nlikelihood estimators (MLE), and least squares estimators. Most M-estimators are 1/\u221an-consistent\nand asymptotically normal. For more details about the theory and application of M-estimators,\nsee [7].\n\n3 Differentially private M-estimators\n\nCombining equations (4) and (5) gives a differentially private objective function:\n\nM_{n,PH}(\u03b8) = \u222b_{[0,1]^d} \u02c6f_{PH}(x) m(x, \u03b8) dx.    (6)\n\nWe wish to use the minimizer of M_{n,PH} as a differentially private estimate of \u03b8\u2217. 
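The perturbed histogram in (3) and the density estimate in (4) can be sketched in a few lines for the one-dimensional case d = 1. This is a minimal sketch under stated assumptions, not the implementation used in the paper: the function names and the inverse-free Laplace sampler (a difference of two exponential draws) are choices made here, and the noise scale 2/alpha is read off the density alpha exp(-alpha|z|/2)/4 given above.

```python
import random


def histogram_counts(xs, k):
    # Counts n_r for points in [0, 1] over k equal-width cells (h_n = 1/k).
    counts = [0] * k
    for x in xs:
        r = min(int(x * k), k - 1)  # the last cell is closed at 1
        counts[r] += 1
    return counts


def laplace(scale, rng):
    # The difference of two iid exponentials with mean `scale`
    # follows a Laplace distribution centered at 0 with that scale.
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def perturbed_histogram(xs, k, alpha, rng):
    # Equation (3): add independent Laplace noise of scale 2/alpha to each count.
    return [n_r + laplace(2.0 / alpha, rng) for n_r in histogram_counts(xs, k)]


def f_ph(x, noisy_counts, n):
    # Equation (4) for d = 1: h_n^{-1} * n_hat_r / n on the cell containing x.
    k = len(noisy_counts)
    r = min(int(x * k), k - 1)
    return k * noisy_counts[r] / n
```

The noisy counts are real-valued and can be negative, which is why the resulting estimate need not be a valid density.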
Consider the\nfollowing set of conditions on the contrast function m(x, \u03b8).\n\n(A1) g(x, \u03b8) := (\u2202/\u2202\u03b8) m(x, \u03b8) exists and |g(x, \u03b8)| \u2264 C_1 on [0, 1]^d \u00d7 \u0398.\n\n(A2) g(x, \u03b8) is Lipschitz in x and \u03b8: ||g(x_1, \u03b8) \u2212 g(x_2, \u03b8)||_2 \u2264 C_2 ||x_1 \u2212 x_2||_2, for all \u03b8; and\n||g(x, \u03b8_1) \u2212 g(x, \u03b8_2)||_2 \u2264 C_2 ||\u03b8_1 \u2212 \u03b8_2||_2, for all x.\n\n(A3) m(x, \u03b8) is convex in \u03b8 for all x and M(\u03b8) is twice continuously differentiable with\nM\u2032\u2032(\u03b8\u2217) := \u222b f(x) (\u2202/\u2202\u03b8) g(x, \u03b8\u2217) dx positive definite.\n\nCondition (A1) requires a bounded derivative of the contrast function, which is closely related to\nthe robustness of the corresponding M-estimator [8]. It indicates that any small changes in the\nunderlying distribution cannot change the outcome by too much, which is also required implicitly\nby the definition of differential privacy. Condition (A2) has two parts. The Lipschitz condition on\nx is used to bound the bias caused by histogram approximation, while the Lipschitz condition on \u03b8\nis used to establish a uniform upper bound of the sampling error in M\u2032_n(\u03b8) = n^{\u22121} \u2211_i g(x_i, \u03b8) as\nwell as a uniform upper bound on the error caused by the additive Laplacian noises. Condition (A3)\nrequires some curvature in the objective function in a neighborhood of the true parameter, which\nensures that the minimizer is stable under small perturbations.\nThe following theorem is our first main result:\nTheorem 3. Under conditions (A1)-(A3), if h_n \u224d (\u221a(log n)/n)^{2/(d+2)}, then there exists a local\nminimizer, \u02c6\u03b8\u2217_{PH}, of M_{n,PH}, such that\n\n|\u02c6\u03b8\u2217_{PH} \u2212 \u03b8\u2217| = O_P( n^{\u22121/2} \u2228 (\u221a(log n)/n)^{2/(d+2)} ).    (7)\n\nA proof of Theorem 3 is given in the supplementary material. 
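To get a feel for the bandwidth and the rate in Theorem 3, the following sketch evaluates both for given n and d. It ignores all constants hidden in the asymptotic notation, so the numbers only indicate orders of magnitude; the function names bandwidth and rate are illustrative assumptions.

```python
import math


def bandwidth(n, d):
    # h_n of Theorem 3 up to a constant: (sqrt(log n) / n) ** (2 / (d + 2)).
    return (math.sqrt(math.log(n)) / n) ** (2.0 / (d + 2))


def rate(n, d):
    # Order of the bound in (7): the larger of n^{-1/2} and the bandwidth term.
    return max(n ** -0.5, bandwidth(n, d))
```

For d = 1 the parametric term n^{-1/2} dominates for large n, while for larger d the bandwidth term takes over, which reflects the loss in accuracy as the histogram dimension grows.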
At a high level, by assumption (A3) it\nsuffices to show (Lemma 9) that sup_{\u03b8\u2208\u0398_0} |M\u2032_{n,PH}(\u03b8) \u2212 M\u2032(\u03b8)| = O_P(1/\u221an \u2228 (\u221a(log n)/n)^{2/(2+d)}),\nfor some compact neighborhood \u0398_0 of \u03b8\u2217.\nThe approximation error of M\u2032_{n,PH}(\u03b8) can be decomposed into three parts:\n\n\u222b (\u02c6f_{PH}(x) \u2212 f(x)) g(x, \u03b8) dx = n^{\u22121} \u2211_r z_r h_n^{\u2212d} \u222b_{B_r} g(x, \u03b8) dx\n    + n^{\u22121} \u2211_r ( n_r h_n^{\u2212d} \u222b_{B_r} g(x, \u03b8) dx \u2212 \u2211_{i: X_i \u2208 B_r} g(X_i, \u03b8) )\n    + n^{\u22121} \u2211_i g(X_i, \u03b8) \u2212 E g(X, \u03b8).    (8)\n\nThe three terms on the right hand side of (8) correspond to the effect of Laplace noises added for\nprivacy, the bias caused by using the histogram, and the sampling error, respectively. As in the general\ntheory of histogram estimators, the approximation error depends on the choice of bandwidth h_n.\nGenerally speaking, if the bandwidth is small, then the histogram bias term will be small. However,\na smaller bandwidth leads to a larger number of cells and hence more Laplacian noises. As a result,\nthere is a trade-off between the histogram bias and Laplacian noises in the choice of bandwidth. The\nbandwidth given in Theorem 3 balances these two parts. We also comment on practical choices of\nh_n in Section 4.\nWe prove Theorem 3 by investigating the convergence rate of each term in the right hand side of\n(8). First (Lemma 10), by empirical process theory [9, 10], under conditions (A1) and (A2)\nthe sampling error term in (8) is of order O_P(1/\u221an), uniformly on \u0398_0. Second, using the Lipschitz\nproperty of g, the histogram bias term in (8) is of order O(h_n). 
Therefore it suffices to show that\n\nsup_{\u03b8\u2208\u0398_0} | \u2211_r n^{\u22121} z_r h_n^{\u2212d} \u222b_{B_r} g(x, \u03b8) dx | = O_P( (\u221a(log n)/n) h_n^{\u2212d/2} ),\n\nwhich can be established using a concentration inequality due to Talagrand [11] (see also [12, Equation 1.3]), together with a\n\u03b4-net argument (Lemma 11) enabled by the Lipschitz property of g in \u03b8.\n\n3.1 Algorithm based on perturbed histogram\n\nIn practice, exact integration of \u02c6f_{PH}(x) m(x, \u03b8) over each cell B_r may be computationally expensive\nand approximations must be adopted to make the implementation feasible. Note that \u02c6f_{PH}(x) is\npiecewise constant. The integration can be simplified by using a piecewise constant approximation\nof m(x, \u03b8). Formally, we introduce the following algorithm:\n\nAlgorithm 1 (M-estimator using perturbed histogram)\nInput: D = {X_1, . . . , X_n}, m(\u00b7,\u00b7), \u03b1, h_n.\n\n1. Construct the perturbed histogram with bandwidth h_n and privacy parameter \u03b1 as in (3).\n\n2. Let M_{n,PH}(\u03b8) = n^{\u22121} \u2211_r \u02c6n_r m(a_r, \u03b8), where a_r \u2208 [0, 1]^d is the center of B_r, with a_r(j) =\n(r_j \u2212 0.5)h_n for all 1 \u2264 j \u2264 d.\n\n3. Output \u02c6\u03b8_{PH} = arg min M_{n,PH}(\u03b8).\n\nCompared with \u02c6\u03b8\u2217_{PH} obtained by minimizing the exact integral, the only term in (8) impacted by\nusing g(a_r, \u03b8) instead of h_n^{\u2212d} \u222b_{B_r} g(x, \u03b8) dx is the histogram bias term. However, note that\n\n| g(a_r, \u03b8) \u2212 h_n^{\u2212d} \u222b_{B_r} g(x, \u03b8) dx | = O(h_n).\n\nAs a result, the convergence rate of \u02c6\u03b8_{PH} remains the same:\nTheorem 4 (Statistical Utility of Algorithm 1). 
Under Assumptions (A1-A3), if M_{n,PH}(\u03b8) is given\nby Algorithm 1 with h_n \u224d (\u221a(log n)/n)^{2/(2+d)} then there exists a local minimizer, \u02c6\u03b8_{PH}, of\nM_{n,PH}(\u03b8), such that\n\n|\u02c6\u03b8_{PH} \u2212 \u03b8\u2217| = O_P(1/\u221an \u2228 (\u221a(log n)/n)^{2/(2+d)}).    (9)\n\nExample (Logistic regression) We give a concrete example that satisfies (A1)-(A3). Let D =\n{(X_i, Y_i) \u2208 [0, 1] \u00d7 {0, 1} : 1 \u2264 i \u2264 n}, where the conditional distribution of Y_i given\nX_i is Bernoulli with parameter exp(\u03b2X_i)/[1 + exp(\u03b2X_i)]. The maximum likelihood estimator\nfor \u03b2 is \u03b2_{MLE} = arg min \u2211_i [\u2212\u03b2Y_iX_i + log(1 + exp(\u03b2X_i))]. Here the contrast function is\nm(x, y; \u03b2) = \u2212\u03b2xy + log(1 + exp(\u03b2x)) and it is easy to check that (A1)-(A3) hold. In this\nexample X is continuous and Y is binary, so it is only necessary to discretize X when constructing\nthe histogram. To be specific, suppose [0, 1] is partitioned into equal-sized cells (B_r, 1 \u2264 r \u2264 k_n)\nas in the ordinary univariate histogram. The joint histogram for (X, Y) is constructed by counting\nthe number of data points in each of the product cells B_{r,j} := B_r \u00d7 {j} for j = 0, 1. See Subsection\n4.1 for more details on constructing histograms when there are categorical variables.\nNote that Theorems 3 and 4 do not guarantee the uniqueness or even existence of a global minimizer\nfor the perturbed objective function M_{n,PH}(\u03b8). This is because, with small probability,\nsome perturbed histogram count \u02c6n_r can be negative and hence the corresponding objective function\nM_{n,PH} may not be convex. In our simulation and real data experience, this is usually not a real\nproblem since a similar argument as in Theorem 3 shows that, with high probability, the second\nderivative M\u2032\u2032_{n,PH} is uniformly close to M\u2032\u2032 in any compact subset of \u0398. To completely avoid this\nissue, one can use thresholding after perturbation as described in the following algorithm.\nAlgorithm 1\u2032 (Perturbed histogram with nonnegative counts)\nInput: D = {X_1, . . . , X_n}, m(\u00b7,\u00b7), \u03b1, h_n.\n\n1. Construct the perturbed histogram with bandwidth h_n and privacy parameter \u03b1 as in (3).\n\n2. Let \u02dcM_{n,PH}(\u03b8) = n^{\u22121} \u2211_r \u02dcn_r m(a_r, \u03b8), where \u02dcn_r = max(\u02c6n_r, 0).\n\n3. Output \u02dc\u03b8_{PH} = arg min \u02dcM_{n,PH}(\u03b8).\n\nAlthough the thresholding guarantees that any zero of \u02dcM\u2032_{n,PH}(\u03b8) is indeed a global minimizer\nby convexity of \u02dcM_{n,PH}(\u03b8), it increases the approximation error introduced by the Laplacian noises\nbecause now these noises no longer cancel with each other nicely in the first term of the right hand\nside of equation (8). We have the following utility result for Algorithm 1\u2032:\nTheorem 5. Under Assumptions (A1-A3) and h_n \u224d (log n/n)^{1/(1+d)}, the estimator given by Algorithm\n1\u2032 satisfies\n\n|\u02dc\u03b8_{PH} \u2212 \u03b8\u2217| = O_P((log n/n)^{1/(1+d)}).\n\nProof. The proof follows essentially from that of Theorem 3, with a different choice of bandwidth\nh_n. The concentration inequality result no longer holds for \u2211_r \u02dcz_r g(a_r, \u03b8) where \u02dcz_r =\nmax(z_r, \u2212n_r), because the \u02dcz_r\u2019s are not independent. Instead, we consider a direct union bound:\nsup_r |\u02dcz_r| \u2264 sup_r |z_r| = O_P(log h_n^{\u2212d}) = O_P(log n). Therefore the Laplacian noise term in the right\nhand side of (8) is bounded uniformly for all \u03b8 by O_P(n^{\u22121} h_n^{\u2212d} log n). The histogram bias is still\nO(h_n) as we mentioned in the discussion of Algorithm 1. Therefore the convergence rate is optimized\nby choosing h_n \u224d (log n/n)^{1/(1+d)}.\n\n3.2 Non-differentiable contrast functions\n\nNow we consider the possibility of relaxing condition (A2). 
Allowing discontinuity in g(x, \u03b8) is\nmotivated by a class of M-estimators whose contrast functions m(x, \u03b8) are non-differentiable on\na set of zero measure. An important example is the quantile. For a random variable X \u2208 R^1\nwith cumulative distribution function F(\u00b7) and any given \u03c4 \u2208 (0, 1), the \u03c4-th quantile of X is\nq(\u03c4) := F^{\u22121}(\u03c4), which corresponds to an M-estimator with m(x, \u03b8) = (1 \u2212 \u03c4)(x \u2212 \u03b8)^{\u2212} + \u03c4(x \u2212 \u03b8)^{+}\n(see [13]). Quantiles provide important information about the distribution, including both location\n(median) and scale (inter-quartile range). The robustness of sample quantiles also makes them good\ncandidates for differentially private data release. Differentially private quantile estimators are indeed\nmajor building blocks for some existing privacy preserving statistical estimators [4, 5]. Our result\nin this subsection shows that perturbed histograms can give simple, consistent, and differentially\nprivate quantile estimators. The following set of conditions will suffice for this purpose and the\nargument is largely the same as Theorem 4:\n\n(B1) m(x, \u03b8) is convex and Lipschitz in both x and \u03b8.\n(B2) M(\u03b8) is twice differentiable at \u03b8\u2217 with M\u2032\u2032(\u03b8\u2217) > 0.\n(B3) \u0398 is compact and convex.\n\nCorollary 6 (Statistical utility of Algorithm 1). Under conditions (B1-B3) and h_n \u224d\n(\u221a(log n)/n)^{2/(2+d)}, any minimizer \u02c6\u03b8_{PH} of M_{n,PH} given by Algorithm 1 satisfies (9).\n\nProof. The argument is largely the same as the proof of Theorem 3. Here we consider the original\nobjective functions M_{n,PH} and M instead of their derivatives. By a similar decomposition as in eq.\n(8), using the compactness of \u0398, we have sup_{\u0398} |M_{n,PH} \u2212 M| = O_P(1/\u221an \u2228 (\u221a(log n)/n)^{2/(2+d)}).\nThen the convergence of \u02c6\u03b8_{PH} follows from the convexity of M.\n\nRemark 7. 
Condition (B3) is the most restrictive one. It requires \u0398 to be bounded. This is because\nthe proof uses the fact that M_n(\u03b8) and M(\u03b8) are uniformly close for large n, which is usually true\nfor a bounded set of \u03b8.\nRemark 8. For quantiles the contrast function is piecewise linear, so for most cells in the histogram\nthere would be no approximation error if the data points are approximated by the cell center. The\nM-estimators for quantiles actually enjoy faster convergence rates.\nExtension to distributions supported on (\u2212\u221e,\u221e). Recall that we assume X \u2208 [0, 1]^d. For quantiles,\nwe have d = 1 and the quantile estimators described above can be extended to any continuous\nrandom variable whose density function is supported on (\u2212\u221e,\u221e). Let {Z_i, i = 1, . . . , n} be an independent\nsample from density f_Z with f_Z(z) > 0, \u2200 z \u2208 R^1. Let \u03c4 \u2208 (0, 1) and suppose we want\nto estimate q_Z(\u03c4), the \u03c4-th quantile of Z. To apply our method, define X = exp(Z)/(1 + exp(Z)).\nClearly the quantiles are preserved under this monotone transformation. Applying the perturbed\nhistogram quantile estimator on {X_i, i = 1, . . . , n} we obtain \u02c6q_{X,PH}(\u03c4), the differentially private\n\u03c4-th quantile of X, which is 1/\u221an-consistent by Corollary 6. As a result, the estimate\n\u02c6q_{Z,PH}(\u03c4) := log[\u02c6q_{X,PH}(\u03c4)/(1 \u2212 \u02c6q_{X,PH}(\u03c4))] is a 1/\u221an-consistent estimator for q_Z(\u03c4).\n\n4 Practical Aspects\n\n4.1 Complexity and Flexibility\nFrom now on we will drop the logarithm terms to simplify presentation. Suppose h_n \u224d n^{\u22122/(2+d)}.\nThen the perturbed histogram (\u02c6n_r : r \u2208 {1, . . . , h_n^{\u22121}}^d) can be constructed in O(n^{2d/(2+d)})\ntime by specifying the corresponding cell for each data point. 
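Specifying the cell for each data point amounts to one integer division per coordinate. The sketch below is one possible way to do this for points in [0, 1]^d with k cells per dimension; the helper names and the sparse Counter representation (which stores only non-empty cells and omits the Laplace noise) are assumptions made for illustration.

```python
from collections import Counter


def cell_index(x, k):
    # 1-based index r = (r_1, ..., r_d) with x_j in [(r_j - 1)/k, r_j/k);
    # the last cell in each dimension is closed at 1.
    return tuple(min(int(xj * k), k - 1) + 1 for xj in x)


def cell_center(r, k):
    # Center a_r of cell B_r, with a_r(j) = (r_j - 0.5) * h for h = 1/k.
    return tuple((rj - 0.5) / k for rj in r)


def histogram(points, k):
    # Unperturbed counts n_r, keyed by cell index.
    return Counter(cell_index(x, k) for x in points)
```

Pairing each cell center with its count gives the weighted-sample view of the histogram; in the private version the counts would be the perturbed ones and every cell, including empty ones, would carry a noisy count.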
Once the histogram is constructed,\nfollowing Algorithm 1, we can view it as a set of h_n^{\u2212d} = O(n^{2d/(2+d)}) weighted data points\n{a_r, r \u2208 {1, . . . , h_n^{\u22121}}^d} associated with weights {\u02c6n_r}, where each data point a_r is the center of\ncell B_r as defined in Step 2 of Algorithm 1. For M-estimators that allow a closed form solution in\nterms of the minimum sufficient statistics, such as least squares regression, M_{n,PH}(\u03b8) (and hence\n\u02c6\u03b8_{PH}) can be calculated in O(n^{2d/(2+d)}) time. For general M-estimators that require an iterative\noptimization, such as logistic regression, the Hessian and gradients can be calculated in O(n^{2d/(2+d)})\ntime in each iteration. Such a weighted sample representation can be easily implemented using\nstandard data structures in common statistical programming packages such as R and Matlab.\nAnother attractive property of the proposed approach is its flexibility to accommodate different\ndata types. As seen in the logistic regression example in Subsection 3.1, it is straightforward to\nconstruct multivariate histograms when some variables are categorical and some are continuous. In\nsuch cases it suffices to discretize the continuous variables. To be specific, let (X^1, . . . , X^{d1}) \u2208\n[0, 1]^{d1} be a d1-dimensional continuous variable and (Y^1, . . . , Y^{d2}) \u2208 \u220f_{j=1}^{d2} {1, . . . , k_j} be a set\nof d2 discrete variables where Y^j takes values in {1, . . . , k_j}. For any bandwidth h, let {B_r, r \u2208\n{1, . . . , h^{\u22121}}^{d1}} be the corresponding set of histogram cells in [0, 1]^{d1}. Then the joint histogram for\n(X, Y) is constructed with cells\n\n{B_{r,y}, r \u2208 {1, . . . , h^{\u22121}}^{d1}, y \u2208 \u2297_{j=1}^{d2} {1, . . . , k_j}}.\n\nBecause only the continuous variables have histogram approximation error, the theoretical results\ndeveloped in Section 3 are applicable with sample size n and dimensionality d1.\n\n4.2 Improvement by enhanced thresholding\n\nIn applications such as regression, the multivariate distribution often concentrates on a subset (usually\na lower dimensional manifold) of [0, 1]^d. Therefore many non-zero cells are artificially created\nby additive noises. To alleviate this problem, we threshold the histogram with an enhanced cut-off\nvalue: \u02dcn_r = \u02c6n_r 1(\u02c6n_r \u2265 A log n/\u03b1), where A > 0 is a tuning parameter. This is based on the\nintuition that the maximal noise will be O(log n/\u03b1). As shown in the following data example, such\na simple thresholding step remarkably improves the accuracy.\n\n4.3 Application to housing price data\n\nAs an illustration, we apply our method to a housing price data set consisting of 348,189 houses sold in\nthe San Francisco Bay Area between 2003 and 2006. For each house, the data contains the price, size,\nyear of transaction, and county in which the house is located. The inference problem of interest is\nto study the relationship between housing price and other variables [14]. In our case, we want to\nbuild a simple linear regression model to predict the housing price using the other variables while\nprotecting each individual transaction record with differential privacy.\nThe data set has two continuous variables (price and size), one ordinal variable (year of sale) with 4\nlevels, and one categorical variable (county) with 9 levels. The preprocessing filters out data points\nwith price outside of the range $10^5 to $9 \u00d7 10^5 or with size larger than 3000 sqft. We also combine\nsmall counties that are geographically close and have similar housing prices. 
After the preprocessing,\nthere are 250,070 data points and the county variable has 6 levels after the combination.\nFor each (year, county) combination, a perturbed histogram is constructed over the two continuous\nvariables with privacy parameter \u03b1 and K levels in each continuous dimension. Then there are\n4 \u00d7 6 \u00d7 K^2 cells, each having a perturbed histogram count. Using the weighted sample representation\ndescribed in Subsection 4.1, the perturbed data can be viewed as a data set with 24K^2 data\npoints weighted by the perturbed histogram counts. A differentially private regression coefficient is\nobtained by applying a weighted least squares regression on this data set. To assess the performance,\nthe privacy preserving regression coefficients are compared with those given by the non-private\nordinary least squares (OLS) estimates. In particular, we look at the coordinate-wise relative deviance\nfrom OLS coefficients: \u03b5 = |\u02c6\u03b8_{priv}/\u02c6\u03b8_{OLS} \u2212 1|. To account for the randomness of additive noises,\nwe repeat 100 times and report the root mean square error: \u00af\u03b5 = (\u2211_{i=1}^{100} \u03b5_i^2 / 100)^{1/2}, where \u03b5_i is the\nrelative error obtained in the ith repetition. The results are summarized in Table 1.\nWe test 2 values of \u03b1, the privacy parameter. Recall that a smaller value of \u03b1 indicates a stronger\nprivacy guarantee. For each value of \u03b1 we apply both the original Algorithm 1 and the enhanced\nthresholding described in Subsection 4.2, with tuning parameter A = 1/2. For \u03b1 = 1 the coefficients\ngiven by the perturbed histogram are close to those given by OLS with most relative deviances\nbelow 5%. 
When \u03b1 = 0.1, which is a conservative choice because exp(0.1) \u2248 1.1, the perturbed\nhistogram still gives reasonably close estimates with average deviance below 10% for all parameters\nexcept the county dummy variable \u201cCounty4\u201d. This variable has the smallest OLS coefficient among\nall county dummy variables, so weight fluctuation in the histogram causes a relatively larger impact\non the relative deviance. Even so, the perturbed histogram still gives an at least qualitatively\ncorrect estimate. We also observe that the thresholded histogram gives more accurate estimates for\nall coefficients except for County4 when \u03b1 = 0.1.\n\nTable 1: Linear regression coefficients using the Bay Area housing data. The second column is\nthe regression coefficients given by the ordinary least squares method without any perturbation. We\ncompare estimates given by (1) perturbed histogram (PH, Algorithm 1) and (2) perturbed histogram\nwith enhanced thresholding (THLD) as described in Subsection 4.2. The reported number is the root\nmean square relative error (in percentage) over 100 perturbations as described above. The histograms\nuse K = 10 segments in each continuous dimension.\n\nVariable     OLS        \u03b1 = 0.1          \u03b1 = 1\n                        PH      THLD     PH      THLD\nIntercept    135141     10.6    7.7      7.2     4.4\nSize         209        4.7     3.5      3.6     2.3\nYear         56375      4.6     2.8      1.0     0.4\nCounty2      -53765     8.0     7.8      1.5     0.7\nCounty3      146593     4.2     2.5      0.8     0.3\nCounty4      -27546     29.8    37.1     2.8     2.1\nCounty5      45828      9.8     7.9      1.4     1.3\nCounty6      -140738    7.1     3.3      1.0     0.4\n\nThe choice of K should depend on the sample size and dimensionality. Our theory suggests\nK = O(n^{2/(2+d)}) where d is the dimensionality of the histogram and hence equals the number\nof continuous variables. In this data set n = 250,070 and d = 2, which suggests K \u2248 500. This is\nnot a good choice since it produces 24 \u00d7 500^2 = 6 \u00d7 10^6 cells. 
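The arithmetic behind this choice of K can be checked directly. The sketch below uses the figures stated in the text (n = 250,070, d = 2, and 4 x 6 = 24 year-county groups) and compares the theoretical K with K = 10, the value used for Table 1; the variable and function names are illustrative assumptions.

```python
import math

n, d, groups = 250070, 2, 24  # groups = 4 years x 6 counties


def num_cells(K):
    # Total number of histogram cells when each continuous dimension has K levels.
    return groups * K ** d


K_theory = round(n ** (2 / (2 + d)))  # n^{2/(2+d)} with logarithmic terms dropped

for K in (K_theory, 10):
    c = num_cells(K)
    # Average number of data points per cell, and log of the cell count,
    # which is the order of the largest Laplace noise.
    print(K, c, round(n / c, 2), round(math.log(c), 2))
```

With K = 500 the average cell holds far fewer than one data point, so the noise would swamp the counts, whereas K = 10 leaves roughly a hundred points per cell against a noise level of order log(2400), about 7.78.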
Let the number of cells be c(K).\nIn practice, it makes sense to choose K such that the average data count in a cell, n/c(K), is much\nlarger than the maximum additive noise max_r |z_r|, which is O_P(log c(K)). For this data set, when\nK = 10 we have n/c(K) \u2248 100 and log(c(K)) \u2248 7.78.\n\n5 Further Discussions\n\nWe demonstrate how histograms can be used as a basic tool for statistical parameter estimation\nunder strong privacy constraints. The perturbed histogram adds to each histogram count a double-exponential\nnoise with constant parameter depending only on the privacy budget \u03b1. The histogram\napproximation bias and the additive noise on the cell counts result in a bias-variance trade-off as\nusually seen for histogram-based methods. Such an algorithm should work well for low-dimensional\nproblems. Solutions to higher dimensional problems are yet to be developed. One possibility is to\nperturb the minimum sufficient statistics because the dimensionality of minimum sufficient statistics\nis usually much smaller than the number of histogram cells. For example, in linear regression\nanalysis, it suffices to obtain the first and second moments of all variables in a privacy-preserving\nway. However, perturbing minimum sufficient statistics would only work for a single estimator and\nis only possible for interactive release. We are seeing another type of privacy-utility trade-off, where\nthe utility is not only about the rate of convergence, but also about the range of possible analyses\nallowed by the data releasing mechanism.\nThe perturbed histogram is also related to \u201cerror in variable\u201d inference problems. Suppose the\noriginal data is just the histogram; then the perturbed version can be thought of as the true histogram\ncounts contaminated by some measurement errors. In this paper we provide consistency results\nfor a class of inference problems in presence of such measurement errors. 
However, plugging in\nthe perturbed values does not necessarily give the best inference procedure, and better alternatives\nmay be possible; see [15] for a hypothesis testing example in contingency tables. An important and\nchallenging question is how to find the optimal inference procedure in the presence of such measurement\nerrors. An answer to this question would help establish a lower bound on the approximation error\nand clarify the power and limits of perturbed histograms.\n\nAcknowledgements\n\nJing Lei was partially supported by NSF Grant BCS-0941518.\n\nReferences\n\n[1] C. Dwork, F. McSherry, K. Nissim, and A. Smith. Calibrating noise to sensitivity in private data analysis. In Proceedings of the 3rd Theory of Cryptography Conference, pages 265\u2013284, 2006.\n\n[2] C. Dwork. Differential privacy. In Proceedings of the 33rd International Colloquium on Automata, Languages and Programming (ICALP)(2), pages 1\u201312, 2006.\n\n[3] K. Chaudhuri and C. Monteleoni. Privacy-preserving logistic regression. In Advances in Neural Information Processing Systems, 2008.\n\n[4] C. Dwork and J. Lei. Differential privacy and robust statistics. In Proceedings of the 41st Annual ACM Symposium on Theory of Computing, 2009.\n\n[5] A. Smith. Privacy-preserving statistical estimation with optimal convergence rates. In Proceedings of the 41st Annual ACM Symposium on Theory of Computing, 2011.\n\n[6] L. Wasserman and S. Zhou. A statistical framework for differential privacy. Journal of the American Statistical Association, 105:375\u2013389, 2010.\n\n[7] P. J. Huber and E. M. Ronchetti. Robust Statistics. John Wiley & Sons, Inc., 2nd edition, 2009.\n\n[8] F. Hampel, E. Ronchetti, P. Rousseeuw, and W. Stahel. Robust Statistics: The Approach Based on Influence Functions. John Wiley, New York, 1986.\n\n[9] A. W. van der Vaart. Asymptotic Statistics. Cambridge University Press, 1998.\n\n[10] M. Talagrand. 
Sharper bounds for Gaussian and empirical processes. The Annals of Probability, 22:28\u201376, 1994.\n\n[11] M. Talagrand. A new isoperimetric inequality and the concentration of measure phenomenon. Lecture Notes in Mathematics, 1469:94\u2013124, 1991.\n\n[12] S. Bobkov and M. Ledoux. Poincar\u00e9\u2019s inequalities and Talagrand\u2019s concentration phenomenon for the exponential distribution. Probability Theory and Related Fields, 107:383\u2013400, 1997.\n\n[13] R. Koenker and K. F. Hallock. Quantile regression. Journal of Economic Perspectives, 15:143\u2013156, 2001.\n\n[14] R. K. Pace and R. Barry. Sparse spatial autoregressions. Statistics & Probability Letters, 33:291\u2013297, 1997.\n\n[15] D. Vu and A. Slavkovic. Differential privacy for clinical trial data: Preliminary evaluations. In Proceedings of the 2009 IEEE International Conference on Data Mining Workshops, 2009.\n", "award": [], "sourceid": 256, "authors": [{"given_name": "Jing", "family_name": "Lei", "institution": null}]}