{"title": "Efficient Algorithm for Privately Releasing Smooth Queries", "book": "Advances in Neural Information Processing Systems", "page_first": 782, "page_last": 790, "abstract": "We study differentially private mechanisms for answering \\emph{smooth} queries on databases consisting of data points in $\\mathbb{R}^d$. A $K$-smooth query is specified by a function whose partial derivatives up to order $K$ are all bounded. We develop an $\\epsilon$-differentially private mechanism which for the class of $K$-smooth queries has accuracy $O (\\left(\\frac{1}{n}\\right)^{\\frac{K}{2d+K}}/\\epsilon)$. The mechanism first outputs a summary of the database. To obtain an answer of a query, the user runs a public evaluation algorithm which contains no information of the database. Outputting the summary runs in time $O(n^{1+\\frac{d}{2d+K}})$, and the evaluation algorithm for answering a query runs in time $\\tilde O (n^{\\frac{d+2+\\frac{2d}{K}}{2d+K}} )$. Our mechanism is based on $L_{\\infty}$-approximation of (transformed) smooth functions by low degree even trigonometric polynomials with small and efficiently computable coefficients.", "full_text": "Ef\ufb01cient Algorithm for Privately Releasing Smooth\n\nQueries\n\nZiteng Wang\n\nSchool of EECS\nPeking University\n\nJiaqi Zhang\n\nSchool of EECS\nPeking University\n\nKai Fan\n\nSchool of EECS\nPeking University\n\nLiwei Wang\n\nSchool of EECS\nPeking University\n\nKey Laboratory of Machine Perception, MOE\n\nKey Laboratory of Machine Perception, MOE\n\nwangzt@cis.pku.edu.cn\n\ninterfk@hotmail.com\n\nKey Laboratory of Machine Perception, MOE\n\nKey Laboratory of Machine Perception, MOE\n\nZhangjq@cis.pku.edu.cn\n\nwanglw@cis.pku.edu.cn\n\nAbstract\n\nWe study differentially private mechanisms for answering smooth queries on\ndatabases consisting of data points in Rd. A K-smooth query is speci\ufb01ed by a\nfunction whose partial derivatives up to order K are all bounded. 
We develop an\n\u03b5-differentially private mechanism which, for the class of K-smooth queries, has\naccuracy O(n^{\u2212K/(2d+K)}/\u03b5). The mechanism first outputs a summary of the database.\nTo obtain an answer to a query, the user runs a public evaluation algorithm which\ncontains no information about the database. Outputting the summary runs in time\nO(n^{1+d/(2d+K)}), and the evaluation algorithm for answering a query runs in time\n\u00d5(n^{(d+2+2d/K)/(2d+K)}). Our mechanism is based on L\u221e-approximation of (transformed)\nsmooth functions by low-degree even trigonometric polynomials with small and\nefficiently computable coefficients.\n\n1 Introduction\n\nPrivacy is an important problem in data analysis. Often people want to learn useful information from\ndata that are sensitive. But when releasing statistics of sensitive data, one must trade off the\naccuracy against the amount of privacy loss of the individuals in the database.\nIn this paper we consider differential privacy [9], which has become a standard concept of privacy.\nRoughly speaking, a mechanism which releases information about the database is said to preserve\ndifferential privacy if the change of a single database element does not affect the probability\ndistribution of the output significantly. Differential privacy provides strong guarantees against attacks. It\nensures that the risk to any individual of submitting her information to the database is very small. An\nadversary can discover almost nothing new from the database that contains the individual\u2019s\ninformation compared with the database without the individual\u2019s information. Recently there\nhave been extensive studies of machine learning, statistical estimation, and data mining under the\ndifferential privacy framework [29, 5, 18, 17, 6, 30, 20, 4].\nAccurately answering statistical queries is an important problem in differential privacy. 
A simple\nand efficient method is the Laplace mechanism [9], which adds Laplace noise to the true answers.\nThe Laplace mechanism is especially useful for query functions with low sensitivity, which is the\nmaximal difference of the query values of two databases that differ in only one item. A typical\nclass of queries that has low sensitivity is linear queries, whose sensitivity is O(1/n), where n is the\nsize of the database.\nThe Laplace mechanism has a limitation. It can answer at most O(n^2) queries. If the number\nof queries is substantially larger than n^2, the Laplace mechanism is not able to provide differentially\nprivate answers with nontrivial accuracy. Considering that potentially there are many users and\neach user may submit a set of queries, limiting the total number of queries to be smaller than n^2 is\ntoo restrictive in some situations. A remarkable result due to Blum, Ligett and Roth [2] shows that\ninformation-theoretically it is possible for a mechanism to answer far more than n^2 linear queries\nwhile preserving differential privacy and nontrivial accuracy simultaneously.\nThere is a series of works [10, 11, 21, 16] improving the result of [2]. All these mechanisms\nare very powerful in the sense that they can answer general and adversarially chosen queries. On the\nother hand, even the fastest algorithms [16, 14] run in time linear in the size of the data universe to\nanswer a query. Often the size of the data universe is much larger than that of the database, so these\nmechanisms are inefficient. Recently, [25] showed that there is no polynomial-time algorithm that\ncan answer n^{2+o(1)} general queries while preserving privacy and accuracy (assuming the existence\nof one-way functions).\nGiven the hardness result, there is growing interest in studying efficient and differentially\nprivate mechanisms for restricted classes of queries. 
From a practical point of view, if there exists a\nclass of queries which is rich enough to contain most queries used in applications and allows one to\ndevelop fast mechanisms, then the hardness result is not a serious barrier for differential privacy.\nOne class of queries that has attracted a lot of attention is the k-way conjunctions. The data universe for\nthis problem is {0, 1}^d. Thus each individual record has d binary attributes. A k-way conjunction\nquery is specified by k features. The query asks what fraction of the individual records in the\ndatabase has all these k features equal to 1. A series of works attack this problem using several different\ntechniques [1, 13, 7, 15, 24]. They propose elegant mechanisms which run in time poly(n) when\nk is a constant. Another class of queries that yields efficient mechanisms is sparse queries. A query\nis m-sparse if it takes non-zero values on at most m elements in the data universe. [3] develops\nmechanisms which are efficient when m = poly(n).\nWhen the data universe is [\u22121, 1]^d, where d is a constant, [2] considers rectangle queries. A rectangle\nquery is specified by an axis-aligned rectangle. The answer to the query is the fraction of the data\npoints that lie in the rectangle. [2] shows that if [\u22121, 1]^d is discretized to poly(n) bits of precision,\nthen there are efficient mechanisms for the class of rectangle queries. There are also works studying\nrelated range queries [19].\nIn this paper we study smooth queries, also defined on the data universe [\u22121, 1]^d for constant d. A smooth\nquery is specified by a smooth function, which has bounded partial derivatives up to a certain order.\nThe answer to the query is the average of the function values on data points in the database. Smooth\nfunctions are widely used in machine learning and data analysis [28]. 
There are extensive studies\non the relation between smoothness, regularization, reproducing kernels and generalization ability\n[27, 22].\nOur main result is an \u03b5-differentially private mechanism for the class of K-smooth queries, which\nare specified by functions with bounded partial derivatives up to order K. The mechanism has\n(\u03b1, \u03b2)-accuracy, where \u03b1 = O(n^{\u2212K/(2d+K)}/\u03b5) for \u03b2 \u2265 e^{\u2212O(n^{d/(2d+K)})}. The mechanism first outputs a\nsummary of the database. To obtain an answer to a smooth query, the user runs a public evaluation\nprocedure which contains no information about the database. Outputting the summary has running time\nO(n^{1+d/(2d+K)}), and the evaluation procedure for answering a query runs in time \u00d5(n^{(d+2+2d/K)/(2d+K)}). The\nmechanism has the advantage that both the accuracy and the running time for answering a query\nimprove quickly as K/d increases (see also Table 1 in Section 3).\nOur algorithm is an L\u221e-approximation based mechanism and is motivated by [24], which considers\napproximation of k-way conjunctions by low-degree polynomials. The basic idea is to approximate\nthe whole query class by linear combinations of a small set of basis functions. The technical difficulty\nlies in the fact that, for the approximation to induce an efficient and differentially private\nmechanism, all the linear coefficients of the basis functions must be small and efficiently computable.\nTo guarantee these properties, we first transform the query function. Then by using even trigonometric\npolynomials as basis functions we prove a constant upper bound for the linear coefficients.\nThe smoothness of the functions also allows us to use an efficient numerical method to compute the\ncoefficients to a precision so that the accuracy of the mechanism is not affected significantly.\n\n2 Background\n\nLet D be a database containing n data points in the data universe X. In this paper, we consider the\ncase that X \u2282 R^d where d is a constant. Typically, we assume that the data universe X = [\u22121, 1]^d.\nTwo databases D and D\u2032 are called neighbors if |D| = |D\u2032| = n and they differ in exactly one data\npoint. The following is the formal definition of differential privacy.\nDefinition 2.1 ((\u03b5, \u03b4)-differential privacy). A sanitizer S, which is an algorithm that maps an input\ndatabase into some range R, is said to preserve (\u03b5, \u03b4)-differential privacy if for all pairs of neighbor\ndatabases D, D\u2032 and for any subset A \u2282 R, it holds that\n\nP(S(D) \u2208 A) \u2264 P(S(D\u2032) \u2208 A) \u00b7 e^\u03b5 + \u03b4.\n\nIf S preserves (\u03b5, 0)-differential privacy, we say S is \u03b5-differentially private.\n\nWe consider linear queries. Each linear query q_f is specified by a function f which maps the data\nuniverse [\u22121, 1]^d to R, and q_f is defined by q_f(D) := (1/|D|) \u2211_{x\u2208D} f(x).\nLet Q be a set of queries. The accuracy of a mechanism with respect to Q is defined as follows.\nDefinition 2.2 ((\u03b1, \u03b2)-accuracy). Let Q be a set of queries. A sanitizer S is said to have (\u03b1, \u03b2)-accuracy\nfor size n databases with respect to Q, if for every database D with |D| = n the following\nholds\n\nP(\u2203q \u2208 Q, |S(D, q) \u2212 q(D)| \u2265 \u03b1) \u2264 \u03b2,\n\nwhere S(D, q) is the answer to q given by S.\n\nWe will make use of the Laplace mechanism [9] in our algorithm. The Laplace mechanism adds Laplace\nnoise to the output. We denote by Lap(\u03c3) the random variable distributed according to the Laplace\ndistribution with parameter \u03c3: P(Lap(\u03c3) = x) = (1/(2\u03c3)) exp(\u2212|x|/\u03c3).\nWe will design a differentially private mechanism which is accurate with respect to a query set\nQ possibly consisting of an infinite number of queries. Given a database D, the sanitizer outputs a\nsummary which preserves differential privacy. For any q_f \u2208 Q, the user makes use of an evaluation\nprocedure to measure f on the summary and obtain an approximate answer to q_f(D). Although we\nmay think of the evaluation procedure as part of the mechanism, it does not contain any information\nabout the database and therefore is public. We will study the running time for the sanitizer to output\nthe summary. Ideally it is O(n^c) for some constant c not much larger than 1. For the evaluation\nprocedure, the running time per query is the focus. Ideally it is sublinear in n. Here and in the rest\nof the paper, we assume that calculating the value of f on a data point x can be done in unit time.\nIn this work we will frequently use trigonometric polynomials. For the univariate case, a function\np(\u03b8) is called a trigonometric polynomial of degree m if p(\u03b8) = a_0 + \u2211_{l=1}^{m} (a_l cos l\u03b8 + b_l sin l\u03b8),\nwhere a_l, b_l are constants. If p(\u03b8) is an even function, we say that it is an even trigonometric\npolynomial, and p(\u03b8) = a_0 + \u2211_{l=1}^{m} a_l cos l\u03b8. For the multivariate case, if p(\u03b8_1, . . . , \u03b8_d) =\n\u2211_{l=(l_1,...,l_d)} a_l cos(l_1\u03b8_1) . . . cos(l_d\u03b8_d), then p is said to be an even trigonometric polynomial (with\nrespect to each variable), and the degree of \u03b8_i is the upper limit of l_i.\n\n3 Efficient differentially private mechanism\n\nLet us first describe the set of queries considered in this work. Since each query q_f is specified by a\nfunction f, a set of queries Q_F can be specified by a set of functions F. Remember that each f \u2208 F\nmaps [\u22121, 1]^d to R. For any point x = (x_1, . . . , x_d) \u2208 [\u22121, 1]^d, if k = (k_1, . . . , k_d) is a d-tuple\nof nonnegative integers, then we define\n\nD^k := D_1^{k_1} \u00b7\u00b7\u00b7 D_d^{k_d} := (\u2202^{k_1}/\u2202x_1^{k_1}) \u00b7\u00b7\u00b7 (\u2202^{k_d}/\u2202x_d^{k_d}).\n\nParameters: Privacy parameters \u03b5, \u03b4 > 0; Failure probability \u03b2 > 0;\n    Smoothness order K \u2208 N; Set t = n^{1/(2d+K)}.\nInput: Database D \u2208 ([\u22121, 1]^d)^n.\nOutput: A t^d-dimensional vector as the summary.\nAlgorithm:\n    For each x = (x_1, . . . , x_d) \u2208 D:\n        Set: \u03b8_i(x) = arccos(x_i), i = 1, . . . , d;\n    For every d-tuple of nonnegative integers m = (m_1, . . . , m_d), where \u2016m\u2016\u221e \u2264 t \u2212 1:\n        Compute: Su_m(D) = (1/n) \u2211_{x\u2208D} cos(m_1\u03b8_1(x)) . . . cos(m_d\u03b8_d(x));\n        \u015cu_m(D) \u2190 Su_m(D) + Lap(t^d/(n\u03b5));\n    Let \u015cu(D) = (\u015cu_m(D))_{\u2016m\u2016\u221e\u2264t\u22121} be a t^d-dimensional vector;\n    Return: \u015cu(D).\n\nAlgorithm 1: Outputting the summary\n\nParameters: t = n^{1/(2d+K)}.\nInput: A query q_f, where f : [\u22121, 1]^d \u2192 R and f \u2208 C^K_B,\n    Summary \u015cu(D) (a t^d-dimensional vector).\nOutput: Approximate answer to q_f(D).\nAlgorithm:\n    Let g_f(\u03b8) = f(cos(\u03b8_1), . . . , cos(\u03b8_d)), \u03b8 = (\u03b8_1, . . . , \u03b8_d) \u2208 [\u2212\u03c0, \u03c0]^d;\n    Compute a trigonometric polynomial approximation p_t(\u03b8) of g_f(\u03b8),\n        where the degree of each \u03b8_i is t;\n    Denote p_t(\u03b8) = \u2211_{\u2016m\u2016\u221e\u2264t\u22121} c_m cos(m_1\u03b8_1) . . . cos(m_d\u03b8_d);\n    Let c = (c_m)_{\u2016m\u2016\u221e\u2264t\u22121};\n    Return: the inner product \u27e8c, \u015cu(D)\u27e9.\n\n[...] 0. Formally, C^K_B [...] of queries specified by C^K_B [...] depth in machine learning [26, 28, 27] and found wide applications [22].\nThe following theorem is our main result. It says that if the query class is specified by smooth\nfunctions, then there is a very efficient mechanism which preserves \u03b5-differential privacy and good\naccuracy. The mechanism consists of two parts: one for outputting a summary of the database,\nthe other for answering a query. The two parts are described in Algorithm 1 and Algorithm 2\nrespectively. The second part of the mechanism contains no private information of the database.\nTheorem 3.1. Let the query set be Q_{C^K_B} = {q_f = (1/n) \u2211_{x\u2208D} f(x) : f \u2208 C^K_B}, where K \u2208 N\nand B > 0 are constants. Let the data universe be [\u22121, 1]^d, where d \u2208 N is a constant. Then the\nmechanism S given in Algorithm 1 and Algorithm 2 satisfies that for any \u03b5 > 0, the following hold:\n1) The mechanism is \u03b5-differentially private.\n2) For any \u03b2 \u2265 10 \u00b7 e^{\u2212(1/5) n^{d/(2d+K)}} the mechanism is (\u03b1, \u03b2)-accurate, where \u03b1 = O((1/n)^{K/(2d+K)}/\u03b5),\nand the hidden constant depends only on d, K and B.\n3) The running time for S to output the summary is O(n^{(3d+K)/(2d+K)}).\n4) The running time for S to answer a query is O(n^{(d+2+2d/K)/(2d+K)} polylog(n)).\n\nTable 1: Performances vs. Order of smoothness\n\nOrder of smoothness | Accuracy \u03b1 | Time: Outputting summary | Time: Answering a query\nK = 1 | O((1/n)^{1/(2d+1)}) | O(n^{3/2}) | \u00d5(n^{3/2+1/(4d+2)})\nK = 2d | O(1/\u221an) | O(n^{5/4}) | \u00d5(n^{1/4+3/(4d)})\nd/K = \u03b5_0 \u226a 1 | O((1/n)^{1\u22122\u03b5_0}) | O(n^{1+\u03b5_0}) | \u00d5(n^{\u03b5_0(1+3/d)})\n\nThe proof of Theorem 3.1 is given in the supplementary material. To have a better idea of how\nthe performances depend on the order of smoothness, let us consider three cases. The first case\nis K = 1, i.e., the query functions only have the first order derivatives. Another extreme case is\nK \u226b d, and we assume d/K = \u03b5_0 \u226a 1. We also consider a case in the middle by assuming\nK = 2d. Table 1 gives simplified upper bounds for the error and running time in these cases. 
We\nhave the following observations:\n1) The accuracy \u03b1 improves dramatically from roughly O(n^{\u22121/(2d)}) to nearly O(n^{\u22121}) as K increases.\nFor K > 2d, the error is smaller than the sampling error O(1/\u221an).\n\n2) The running time for outputting the summary does not change too much, because reading through\nthe database requires \u03a9(n) time.\n\n3) The running time for answering a query reduces significantly from roughly O(n^{3/2}) to nearly\nO(n^{\u03b5_0}) as K gets large. When K = 2d, it is about n^{1/4} if d is not too small. In practice, the\nspeed of answering a query may be more important than that of outputting the summary, since\nthe sanitizer only outputs the summary once. Thus having an n^c-time (c \u226a 1) algorithm for query\nanswering will be appealing.\nConceptually our mechanism is simple. First, by change of variables we have g_f(\u03b8_1, . . . , \u03b8_d) =\nf(cos \u03b8_1, . . . , cos \u03b8_d). It also transforms the data universe from [\u22121, 1]^d to [\u2212\u03c0, \u03c0]^d. Note that for\neach variable \u03b8_i, g_f is an even function. To compute the summary, the mechanism just gives noisy\nanswers to queries specified by even trigonometric monomials cos(m_1\u03b8_1) . . . cos(m_d\u03b8_d). For each\ntrigonometric monomial, the highest degree of any variable is t := max_d m_d = O(n^{1/(2d+K)}). The\nsummary is an O(n^{d/(2d+K)})-dimensional vector. To answer a query specified by a smooth function f,\nthe mechanism computes a trigonometric polynomial approximation of g_f. The answer to the query\nq_f is a linear combination of the summary entries, weighted by the coefficients of the approximating\ntrigonometric polynomial.\nOur algorithm is an L\u221e-approximation based mechanism, which is motivated by [24]. An approximation\nbased mechanism relies on three conditions: 1) There exists a small set of basis functions\nsuch that every query function can be well approximated by a linear combination of them; 2) All the\nlinear coefficients are small; 3) The whole set of the linear coefficients can be computed efficiently.\nIf these conditions hold, then the mechanism just outputs noisy answers to the set of queries specified\nby the basis functions as the summary. When answering a query, the mechanism computes the\ncoefficients with which the linear combination of the basis functions approximates the query function.\nThe answer to the query is simply the inner product of the coefficients and the summary vector.\nThe following theorem guarantees that by change of variables and using even trigonometric\npolynomials as the basis functions, the class of smooth functions has all the three properties described\nabove.\nTheorem 3.2. Let \u03b3 > 0. For every f \u2208 C^K_B defined on [\u22121, 1]^d, let\n\ng_f(\u03b8_1, . . . , \u03b8_d) = f(cos \u03b8_1, . . . , cos \u03b8_d), \u03b8_i \u2208 [\u2212\u03c0, \u03c0].\n\nThen, there is an even trigonometric polynomial p whose degree of each variable is t(\u03b3) =\n\np(\u03b81, . . . , \u03b8d) =\n\ncl1,...,ld cos(l1\u03b81) . . . cos(ld\u03b8d),\n\n(cid:88)\n\n0\u2264l1,...,ld