{"title": "Privacy Aware Learning", "book": "Advances in Neural Information Processing Systems", "page_first": 1430, "page_last": 1438, "abstract": "We study statistical risk minimization problems under a version of privacy in which the data is kept confidential even from the learner. In this local privacy framework, we show sharp upper and lower bounds on the convergence rates of statistical estimation procedures. As a consequence, we exhibit a precise tradeoff between the amount of privacy the data preserves and the utility, measured by convergence rate, of any statistical estimator.", "full_text": "Privacy Aware Learning\n\nJohn C. Duchi1\n\nMichael I. Jordan1,2\n\nMartin J. Wainwright1,2\n\n1Department of Electrical Engineering and Computer Science, 2Department of Statistics\n\nUniversity of California, Berkeley\n\nBerkeley, CA USA 94720\n\n{jduchi,jordan,wainwrig}@eecs.berkeley.edu\n\nAbstract\n\nWe study statistical risk minimization problems under a version of privacy in\nwhich the data is kept con\ufb01dential even from the learner. In this local privacy\nframework, we establish sharp upper and lower bounds on the convergence rates\nof statistical estimation procedures. As a consequence, we exhibit a precise trade-\noff between the amount of privacy the data preserves and the utility, measured by\nconvergence rate, of any statistical estimator.\n\n1\n\nIntroduction\n\nThere are natural tensions between learning and privacy that arise whenever a learner must aggregate\ndata across multiple individuals. 
The learner wishes to make optimal use of each data point, but the\nproviders of the data may wish to limit detailed exposure, either to the learner or to other individuals.\nIt is of great interest to characterize such tensions in the form of quantitative tradeoffs that can be\nboth part of the public discourse surrounding the design of systems that learn from data and can be\nemployed as controllable degrees of freedom whenever such a system is deployed.\n\nWe approach this problem from the point of view of statistical decision theory. The decision-\ntheoretic perspective offers a number of advantages. First, the use of loss functions and risk functions\nprovides a compelling formal foundation for de\ufb01ning \u201clearning,\u201d one that dates back to Wald [28] in\nthe 1930\u2019s, and which has seen continued development in the context of research on machine learn-\ning over the past two decades. Second, by formulating the goals of a learning system in terms of\nloss functions, we make it possible for individuals to assess whether the goals of a learning system\nalign with their own personal utility, and thereby determine the extent to which they are willing to\nsacri\ufb01ce some privacy. Third, an appeal to decision theory permits abstraction over the details of\nspeci\ufb01c learning procedures, providing (under certain conditions) minimax lower bounds that apply\nto any speci\ufb01c procedure. Finally, the use of loss functions, in particular convex loss functions, in\nthe design of a learning system allows powerful tools of optimization theory to be brought to bear.\nIn more formal detail, our framework is as follows. Given a compact convex set \u0398 \u2282 Rd, we\nwish to \ufb01nd a parameter value \u03b8 \u2208 \u0398 achieving good average performance under a loss function\n\" : X\u00d7 Rd \u2192 R+. 
Here the value \"(X, \u03b8) measures the performance of the parameter vector \u03b8 \u2208 \u0398\non the sample X \u2208X , and \"(x,\u00b7) : Rd \u2192 R+ is convex for x \u2208X . We measure the expected\nperformance of \u03b8 \u2208 \u0398 via the risk function\n\nR(\u03b8) := E[\"(X, \u03b8)].\n\n(1)\n\nIn the standard formulation of statistical risk minimization, a method M is given n samples\nX1, . . . , Xn, and outputs an estimate \u03b8n approximately minimizing R(\u03b8). Instead of allowing M\naccess to the samples Xi, however, we study the effect of giving only a perturbed view Zi of each\ndatum Xi, quantifying the rate of convergence of R(\u03b8n) to inf \u03b8\u2208\u0398 R(\u03b8) as a function of both the\nnumber of samples n and the amount of privacy Zi provides for Xi.\n\n1\n\n\fThere is a long history of research at the intersection of privacy and statistics, where there is a natural\ncompetition between maintaining the privacy of elements in a dataset {X1, . . . , Xn} and the output\nof statistical procedures. Study of this issue goes back at least to the 1960s, when Warner [29]\nsuggested privacy-preserving methods for survey sampling. Recently, there has been substantial\nwork on privacy\u2014focusing on a measure known as differential privacy [12]\u2014in statistics, computer\nscience, and other \ufb01elds. We cannot hope to do justice to the large body of related work, referring\nthe reader to the survey by Dwork [10] and the statistical framework studied by Wasserman and\nZhou [30] for background and references.\n\nIn this paper, we study local privacy [13, 17], in which each datum Xi is kept private from the\n\nmethod M. The goal of many types of privacy is to guarantee that the output!\u03b8n of the method M\n\nbased on the data cannot be used to discover information about the individual samples X1, . . . , Xn,\nbut locally private algorithms access only disguised views of each datum Xi. 
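As a toy illustration of this local-privacy setting, a learner can estimate a population quantity while seeing only disguised views Z_i of the data. The additive-Laplace channel and the function names below are our own illustrative choices, not the paper's mechanism:

```python
import random

def privatize(x, scale, rng):
    """Release a disguised view Z of the datum x; the learner never sees x.
    Additive Laplace noise keeps E[Z | X = x] = x, so averages stay unbiased.
    (This channel is an illustrative choice, not the paper's construction.)"""
    return x + rng.choice((-1.0, 1.0)) * rng.expovariate(1.0 / scale)

def private_mean(xs, scale, rng):
    """Estimate E[X] while observing only the perturbed views Z_i."""
    zs = [privatize(x, scale, rng) for x in xs]
    return sum(zs) / len(zs)
```

A larger `scale` reveals less about any individual X_i but makes the estimate concentrate more slowly; quantifying exactly that tradeoff, for general convex risks, is the subject of the paper.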
Local algorithms\nare among the most classical approaches to privacy, tracing back to Warner\u2019s work on randomized\nresponse [29], and rely on communication only of some disguised view Zi of each true sample Xi.\nLocally private algorithms are natural when the providers of the data\u2014the population sampled to\ngive X1, . . . , Xn\u2014do not trust even the statistician or statistical method M, but the providers are\ninterested in the parameters \u03b8\u2217 minimizing R(\u03b8). For example, in medical applications, a participant\nmay be embarrassed about his use of drugs, but if the loss \" is able to measure the likelihood of\ndeveloping cancer, the participant has high utility for access to the optimal parameters \u03b8\u2217. In essence,\nwe would like the statistical procedure M to learn from the data X1, . . . , Xn but not about it.\nOur goal is to understand the fundamental tradeoffs between maintaining privacy while still retain-\ning the utility of the statistical inference method M. Though intuitively there must be some tradeoff,\nquantifying it precisely has been dif\ufb01cult. In the machine learning literature, Chaudhuri et al. [7]\ndevelop differentially private empirical risk minimization algorithms, and Dwork and Lei [11] and\nSmith [26] analyze similar statistical procedures, but do not show that there must be negative effects\nof privacy. Rubinstein et al. [24] are able to show that it is impossible to obtain a useful parameter\nvector \u03b8 that is substantially differentially private; it is unclear whether their guarantees are improv-\nable. Recent work by Hall et al. [15] gives sharp minimax rates of convergence for differentially\nprivate histogram estimation. Blum et al. [5] also give lower bounds on the closeness of certain\nstatistical quantities computed from the dataset, though their upper and lower bounds do not match.\nSankar et al. 
[25] provide rate-distortion theorems for utility models involving information-theoretic\nquantities, which have some similarity to our risk-based framework, but it appears challenging to\nmap their setting onto ours. The work most related to ours is probably that of Kasiviswanathan et al.\n[17], who show that locally private algorithms coincide with concepts that can be learned with\npolynomial sample complexity in Kearns\u2019s statistical query (SQ) model. In contrast, our analysis\naddresses sharp rates of convergence, and applies to estimators for a broad class of convex risks (1).\n\n2 Main results and approach\n\nOur approach to local privacy is based on a worst-case measure of mutual information, where we\nview privacy preservation as a game between the providers of the data\u2014who wish to preserve\nprivacy\u2014and nature. Recalling that the method sees only the perturbed version Zi of Xi, we adopt\na uniform variant of the mutual information I(Zi; Xi) between the random variables Xi and Zi\nas our measure for privacy. This use of mutual information is by no means original [13, 25], but\nbecause standard mutual information has de\ufb01ciencies as a measure of privacy [e.g. 13], we say the\ndistribution Q generating Z from X is private only if I(X; Z) is small for all possible distributions\nP on X (possibly subject to constraints). This is similar to the worst-case information approach of\nEv\ufb01mievski et al. [13], which limits privacy breaches. (In the long version of this paper [9] we also\nconsider differentially private algorithms.)\n\nThe central consequences of our main results are, under standard conditions on the loss functions \",\nsharp upper and lower bounds on the possible convergence rates for estimation procedures when we\nwish to guarantee a level of privacy I(Xi; Zi) \u2264 I \u2217. 
We show there are problem dependent constants\na(\u0398,\" ) and b(\u0398,\" ) such that the rates of convergence of all possible procedures are lower bounded\nby a(\u0398,\" )/\u221anI \u2217 and that there exist procedures achieving convergence rates of b(\u0398,\" )/\u221anI \u2217,\nwhere the ratio b(\u0398,\" )/a(\u0398,\" ) is upper bounded by a universal constant. Thus, we establish and\nquantify explicitly the tradeoff between statistical estimation and the amount of privacy.\n\n2\n\n\fWe show that stochastic gradient descent is one procedure that achieves the optimal convergence\nrates, which means additionally that our upper bounds apply in streaming and online settings, re-\nquiring only a \ufb01xed-size memory footprint. Our subsequent analysis builds on this favorable prop-\nerty of gradient-based methods, whence we focus on statistical estimation procedures that access\ndata through the subgradients of the loss functions \u2202\"(X, \u03b8). This is a natural restriction. Gradients\nof the loss \" are asymptotically suf\ufb01cient [18] (in an asymptotic sense, gradients contain all of the\nstatistical information for risk minimization problems), stochastic gradient-based estimation proce-\ndures are (sample) minimax optimal and Bahadur ef\ufb01cient [23, 1, 27, Chapter 8], many estimation\nprocedures are gradient-based [20, 6], and distributed optimization procedures that send gradient\ninformation across a network to a centralized procedure M are natural [e.g. 3]. Our mechanism\ngives M access to a vector Zi that is a stochastic (sub)gradient of the loss evaluated on the sample\nXi at a parameter \u03b8 of the method\u2019s choosing:\n\n(2)\nwhere \u2202\"(Xi,\u03b8 ) denotes the subgradient set of the function \u03b8 \u2019\u2192 \"(Xi,\u03b8 ). 
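A minimal sketch of this gradient-oracle protocol (2): the method queries a parameter value, and the data owner returns an unbiased, randomized-response-style perturbation of a subgradient. Here we instantiate it for median estimation with the loss \(x, \theta) = |\theta - x|; the flip probability p and all names are our illustrative choices:

```python
import math
import random

def noisy_subgradient(x, theta, p, rng):
    """Owner-side channel for the loss l(x, theta) = |theta - x|: report the
    true subgradient sign with probability p, flip it otherwise, and rescale
    by 1 / (2p - 1) so that E[Z | x, theta] is a subgradient, as in (2)."""
    g = 1.0 if theta >= x else -1.0
    s = g if rng.random() < p else -g
    return s / (2.0 * p - 1.0)

def private_sgd(xs, p, rng, radius=1.0):
    """Projected SGD driven only by privatized subgradients; the averaged
    iterate estimates a population median (a minimizer of E|theta - X|)."""
    theta, avg = 0.0, 0.0
    for t, x in enumerate(xs, start=1):
        theta -= 0.5 / math.sqrt(t) * noisy_subgradient(x, theta, p, rng)
        theta = min(max(theta, -radius), radius)  # project onto Theta
        avg += (theta - avg) / t  # running average of iterates
    return avg
```

Note that the true subgradients lie in C = [-1, 1] while the reports lie in the enlarged set D = [-M, M] with M = 1/(2p - 1), mirroring the containment C in D used in Section 3.1.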
In a sense, the unbiased-\nness of the subgradient inclusion (2) is information-theoretically necessary [1].\n\nE[Zi | Xi,\u03b8 ] \u2208 \u2202\"(Xi,\u03b8 ),\n\nTo obtain upper and lower bound on the convergence rate of estimation procedures, we provide a\ntwo-part analysis. One part requires studying saddle points of the mutual information I(X; Z) (as a\nfunction of the distributions P of X and Q(\u00b7 | X) of Z) under natural constraints that allow inference\nof the optimal parameters \u03b8\u2217 for the risk R. We show that for certain classes of loss functions \" and\nconstraints on the communicated version Zi of the data Xi, there is a unique distribution Q(\u00b7 | Xi)\nthat attains the smallest possible mutual information I(X; Z) for all distributions on X. Using this\nunique distribution, we can adapt information-theoretic techniques for obtaining lower bounds on\nestimation [31, 1] to derive our lower bounds. The uniqueness results for the conditional distribution\nQ show that no algorithm guaranteeing privacy between M and the samples Xi can do better. We\ncan obtain matching upper bounds by application of known convergence rates for stochastic gradient\nand mirror descent algorithms [20, 21], which are computationally ef\ufb01cient.\n\n3 Optimal learning rates and tradeoffs\n\nHaving outlined our general approach, we turn in this section to providing statements of our main\nresults. Before doing so, we require some formalization of our notions of privacy and error measures,\nwhich we now provide.\n\n3.1 Optimal Local Privacy\n\nWe begin by describing in slightly more detail the communication protocol by which information\nabout the random variables X is communicated to the procedure M. We assume throughout that\nthere exist two d-dimensional compact sets C, D, where C \u2282 int D \u2282 Rd, and we have that\n\u2202\"(x, \u03b8) \u2282 C for all \u03b8 \u2208 \u0398 and x \u2208X . 
We wish to maximally \u201cdisguise\u201d the random variable\nX with the random variable Z satisfying Z \u2208 D. Such a setting is natural; indeed, many online\noptimization and stochastic approximation algorithms [34, 21, 1] assume that for any x \u2208X and\n\u03b8 \u2208 \u0398, if g \u2208 \u2202\"(x, \u03b8) then (g( \u2264 L for some norm (\u00b7(. We may obtain privacy by allowing a\nperturbation to the subgradient g so long as the perturbation lives in a (larger) norm ball of radius\nM \u2265 L, so that C = {g \u2208 Rd : (g( \u2264 L}\u2282 D = {g \u2208 Rd : (g( \u2264 M}.\nNow let X have distribution P , and for each x \u2208X , let Q(\u00b7 | x) denote the regular conditional\nprobability measure of Z given that X = x. Let Q(\u00b7) denote the marginal probability de\ufb01ned by\nQ(A) = EP [Q(A | X)]. The mutual information between X and Z is the expected Kullback-\nLeibler (KL) divergence between Q(\u00b7 | X) and Q(\u00b7):\n\nI(P, Q) = I(X; Z) := EP [Dkl (Q(\u00b7 | X)||Q(\u00b7))] .\n\n(3)\n\nWe view the problem of privacy as a game between the adversary controlling P and the data owners,\nwho use Q to obscure the samples X. In particular, we say a distribution Q guarantees a level of\nprivacy I \u2217 if and only if supP I(P, Q) \u2264 I \u2217. (Ev\ufb01mievski et al. [13, De\ufb01nition 6] present a similar\ncondition.) Thus we seek a saddle point P \u2217, Q\u2217 such that\n\nsup\nP\n\nI(P, Q\u2217) \u2264 I(P \u2217, Q\u2217) \u2264 inf\n\nQ\n\nI(P \u2217, Q),\n\n(4)\n\n3\n\n\fP (C) := {Distributions P such that supp P \u2282 C}\n\nand the set of regular conditional distributions (r.c.d.\u2019s), or communicating distributions,\n\nQ (C, D) :=\"r.c.d.\u2019s Q s.t. supp Q(\u00b7 | c) \u2282 D and #D\n\nzdQ(z | c) = c for c \u2208 C$.\n\nThe de\ufb01nitions (5a) and (5b) formally de\ufb01ne the sets over which we may take in\ufb01ma and suprema\nin the saddle point calculations, and they capture what may be communicated. 
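To make the worst-case criterion sup_P I(P, Q) <= I* concrete, the following toy computation evaluates I(P, Q) from definition (3) for a binary randomized-response channel (illustrative only, not one of the paper's constructions) and locates the worst-case input distribution:

```python
import math

def mutual_information(q, p):
    """I(P, Q) in nats, per definition (3), for X ~ Bernoulli(q) and the
    channel that reports Z = X with probability p and Z = 1 - X otherwise.
    Requires 0 < q < 1 and 0 < p < 1."""
    px = {0: 1.0 - q, 1: q}
    cond = {0: {0: p, 1: 1.0 - p}, 1: {0: 1.0 - p, 1: p}}
    marg = {z: sum(px[x] * cond[x][z] for x in (0, 1)) for z in (0, 1)}
    return sum(
        px[x] * cond[x][z] * math.log(cond[x][z] / marg[z])
        for x in (0, 1)
        for z in (0, 1)
    )
```

For this symmetric channel the supremum over P is the channel capacity log 2 - h(p), attained at the uniform input distribution; the paper's saddle-point analysis generalizes this picture to the norm-ball alphabets defined above.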
The condi-\ntional distributions Q \u2208Q (C, D) are de\ufb01ned so that if \u2207\"(x, \u03b8) \u2208 C then EQ[Z | X, \u03b8] :=\n%D zdQ (z |\u2207 \"(x, \u03b8)) = \u2207\"(x, \u03b8). We now make the following key de\ufb01nition:\nDe\ufb01nition 1. The conditional distribution Q\u2217 satis\ufb01es optimal\nlocal privacy for the sets\nC \u2282 D \u2282 Rd at level I \u2217 if\n\n(5a)\n\n(5b)\n\nwhere the \ufb01rst supremum is taken over all distributions P on X such that \u2207\"(X, \u03b8) \u2208 C with\nP -probability 1, and the in\ufb01mum is taken over all regular conditional distributions Q such that if\nZ \u223c Q(\u00b7 | X), then Z \u2208 D and EQ[Z | X, \u03b8] = \u2207\"(X, \u03b8). Indeed, if we can \ufb01nd P \u2217 and Q\u2217\nsatisfying the saddle point (4), then the trivial direction of the max-min inequality yields\n\nsup\nP\n\ninf\nQ\n\nI(P, Q) = I(P \u2217, Q\u2217) = inf\nQ\n\nsup\nP\n\nI(P, Q).\n\nTo fully formalize this idea and our notions of privacy, we de\ufb01ne two collections of probability\nmeasures and associated losses. For sets C \u2282 D \u2282 Rd, we de\ufb01ne the source set\n\nsup\nP\n\nI(P, Q\u2217) = inf\nQ\n\nsup\nP\n\nI(P, Q) = I \u2217,\n\nwhere the supremum is taken over distributions P \u2208P (C) and the in\ufb01mum is taken over regular\nconditional distributions Q \u2208Q (C, D).\nIf a distribution Q\u2217 satis\ufb01es optimal local privacy, then it guarantees that even for the worst possible\ndistribution on X, the information communicated about X is limited.\nIn a sense, De\ufb01nition 1\ncaptures the natural competition between privacy and learnability. The method M speci\ufb01es the\nset D to which the data Z it receives must belong; the \u201cteachers,\u201d or owners of the data X, choose\nthe distribution Q to guarantee as much privacy as possible subject to this constraint. 
Using this\nmechanism, if we can characterize a unique distribution Q\u2217 attaining the in\ufb01mum (4) for P \u2217 (and\nby extension, for any P ), then we may study the effects of privacy between the method M and X.\n\n3.2 Minimax error and loss functions\n\nHaving de\ufb01ned our privacy metric, we now turn to our original goal: quanti\ufb01cation of the effect\nprivacy has on statistical estimation rates. Let M denote any statistical procedure or method (that\nuses n stochastic gradient samples) and let \u03b8n denote the output of M after receiving n such samples.\nLet P denote the distribution according to which samples X are drawn. We de\ufb01ne the (random) error\nof the method M on the risk R(\u03b8) = E[\"(X, \u03b8)] after receiving n sample gradients as\n\n\u0001n(M, \", \u0398, P ) := R(\u03b8n) \u2212 inf\u03b8\u2208\u0398 R(\u03b8) = EP [\"(X, \u03b8n)] \u2212 inf\u03b8\u2208\u0398 EP [\"(X, \u03b8)]. (6)\n\nIn our settings, in addition to the randomness in the sampling distribution P , there is additional\nrandomness from the perturbation applied to stochastic gradients of the objective \"(X,\u00b7) to mask X\nfrom the statistician. Let Q denote the regular conditional probability\u2014the channel distribution\u2014\nwhose conditional part is de\ufb01ned on the range of the subgradient mapping \u2202\"(X,\u00b7). As the output\n\u03b8n of the statistical procedure M is a random function of both P and Q, we measure the expected\nsub-optimality of the risk according to both P and Q. Now, let L be a collection of loss functions,\nwhere L(P ) denotes the losses \" : supp P \u00d7 \u0398 \u2192 R belonging to L. We de\ufb01ne the minimax error\n\n\u0001\u2217n(L, \u0398) := infM sup\"\u2208L(P ),P EP,Q[\u0001n(M, \", \u0398, P )], (7)\n\nwhere the expectation is taken over the random samples X \u223c P and Z \u223c Q(\u00b7 | X). 
We characterize\nthe minimax error (7) for several classes of loss functions L(P ), giving sharp results when the\nprivacy distribution Q satis\ufb01es optimal local privacy.\n\nWe assume that our collection of loss functions obey certain natural smoothness conditions, which\nare often (as we see presently) satis\ufb01ed. We de\ufb01ne the class of losses as follows.\n\n4\n\n\fDe\ufb01nition 2. Let L > 0 and p \u2265 1. The set of (L, p)-loss functions are those measurable functions\n\" : X\u00d7 \u0398 \u2192 R such that x \u2208X , the function \u03b8 \u2019\u2192 \"(x, \u03b8) is convex and\n\n|\"(x, \u03b8) \u2212 \"(x, \u03b8#)|\u2264 L(\u03b8 \u2212 \u03b8#(q\nfor any \u03b8, \u03b8 # \u2208 \u0398, where q is the conjugate of p: 1/p + 1/q = 1.\nA loss \" satis\ufb01es the condition (8) if and only if for all \u03b8 \u2208 \u0398, we have the inequality (g(p \u2264 L for\nany subgradient g \u2208 \u2202\"(x, \u03b8) (e.g. [16]). We give a few standard examples of such loss functions.\nFirst, we consider \ufb01nding a multi-dimensional median, in which case the data x \u2208 Rd and \"(x, \u03b8) =\nL(\u03b8 \u2212 x(1. This loss is L-Lipschitz with respect to the \"1 norm, so it belongs to the class of (L,\u221e)\nlosses. A second example includes classi\ufb01cation problems, using either the hinge loss or logistic\nregression loss. In these cases, the data comes in pairs x = (a, b), where a \u2208 Rd is the set of\nregressors and b \u2208{\u2212 1, 1} is the label; the losses are\n\n(8)\n\n\"(x, \u03b8) = [1 \u2212 b.a, \u03b8/]+ or \"(x, \u03b8) = log (1 + exp(\u2212b.a, \u03b8/))\n\nBy computing (sub)gradients, we may verify that each of these belong to the class of (L, p)-losses\nif and only if the data a satis\ufb01es (a(p \u2264 L, which is a common assumption [7, 24].\nThe privacy-guaranteeing channel distributions Q\u2217 we construct in Section 4 are motivated by our\nconcern with the (L, p) families of loss functions. 
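Membership in the (L, p) class of Definition 2 can be checked by computing subgradients directly. The sketch below (function names ours) verifies that the hinge and logistic losses have subgradients bounded in \(\ell_p\)-norm by the norm of the regressor a, as claimed:

```python
import math

def dot(a, theta):
    return sum(ai * ti for ai, ti in zip(a, theta))

def hinge_subgrad(a, b, theta):
    """A subgradient of l((a, b), theta) = max(0, 1 - b <a, theta>)."""
    return [-b * ai for ai in a] if b * dot(a, theta) < 1 else [0.0] * len(a)

def logistic_grad(a, b, theta):
    """Gradient of l((a, b), theta) = log(1 + exp(-b <a, theta>))."""
    s = 1.0 / (1.0 + math.exp(b * dot(a, theta)))  # sigmoid(-b <a, theta>)
    return [-b * ai * s for ai in a]

def lp_norm(v, p):
    if math.isinf(p):
        return max(abs(vi) for vi in v)
    return sum(abs(vi) ** p for vi in v) ** (1.0 / p)
```

Since the label b is a sign and the sigmoid factor is at most 1, both gradients satisfy the norm bound, so data with regressors bounded in \(\ell_p\)-norm by L yields an (L, p) loss.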
In our model of computation, the learning method\nM queries the loss \"(Xi,\u00b7) at the point \u03b8; the owner of the datum Xi then computes the subgradient\n\u2202\"(Xi,\u03b8 ) and returns a masked version Zi with the property that E[Zi | Xi,\u03b8 ] \u2208 \u2202\"(Xi,\u03b8 ). In\nthe following two theorems, we give lower bounds on \u0001\u2217\nn for the (L,\u221e) and (L, 1) families of loss\nfunctions under the constraint that the channel distribution Q must guarantee that a limited amount\nof information I(Xi; Zi) is communicated: the channel distribution Q satis\ufb01es our De\ufb01nition 1 of\noptimal local privacy.\n\n3.3 Main theorems\n\nWe now state our two main theorems, deferring proofs to Appendix B. Our \ufb01rst theorem applies to\nthe class of (L,\u221e) loss functions (recall De\ufb01nition 2). We assume that the set to which the perturbed\ndata Z must belong is [\u2212M\u221e, M\u221e]d, where M\u221e \u2265 L. We state two variants of the theorem, as one\ngives sharper results for an important special case.\nTheorem 1. Let L be the collection of (L,\u221e) loss functions and assume the conditions of the\npreceding paragraph. Let Q satisfy be optimally private for the collection L. Then\n\n(a) If \u0398 contains the \"\u221e ball of radius r,\n\n(b) If \u0398= {\u03b8 \u2208 Rd : (\u03b8(1 \u2264 r},\n\n\u0001\u2217\nn(L, \u0398) \u2265\n\n1\n163 \u00b7\n\nM\u221erd\n\u221an\n\n.\n\n\u0001\u2217\nn(L, \u0398) \u2265\n\nrM\u221e&log(2d)\n\n17\u221an\n\n.\n\nFor our second theorem, we assume that the loss functions L consist of (L, 1) losses, and that the\nperturbed data must belong to the \"1 ball of radius M1, i.e., Z \u2208{ z \u2208 Rd |( z(1 \u2264 M1}. 
Setting\nM = M1/L, we de\ufb01ne (these constants relate to the optimal local privacy distribution for \"1-balls)\n\n\u03b3 := log\u2019 2d \u2212 2 +&(2d \u2212 2)2 + 4(M 2 \u2212 1)\n\n2(M \u2212 1)\n\n( ,\n\nand \u2206(\u03b3) :=\n\ne\u03b3 \u2212 e\u2212\u03b3\n\ne\u03b3 + e\u2212\u03b3 + 2(d \u2212 1)\n\n.\n\n(9)\n\nTheorem 2. Let L be the collection of (L, 1) loss functions and assume the conditions of the pre-\nceding paragraph. Let Q be optimally locally private for the collection L. Then\n\nrL\u221ad\n\u221an\u2206(\u03b3)\n\n.\n\n\u0001\u2217\nn(L, \u0398) \u2265\n\n1\n163 \u00b7\n\n5\n\n\fRemarks We make two main remarks about Theorems 1 and 2. First, we note that each result\nyields a minimax rate for stochastic optimization problems when there is no random distribution Q.\nIndeed, in Theorem 1, we may take M\u221e = L, in which case (focusing on the second statement of\n\nthe theorem) we obtain the lower bound rL&log(2d)/17\u221an when \u0398= {\u03b8 \u2208 Rd : (\u03b8(1 \u2264 r}.\n\nMirror descent algorithms [20, 21] attain a matching upper bound (see the long version of this\npaper [9, Sec. 3.3] for more substantial explanation). Moreover, our analysis is sharper than previous\nanalyses [1, 20], as none (to our knowledge) recover the logarithmic dependence on the dimension\nd, which is evidently necessary. Theorem 2 provides a similar result when we take M1 \u2193 L, though\nin this case stochastic gradient descent attains the matching upper bound.\n\nOur second set of remarks are somewhat more striking. In these, we show that the lower bounds in\nTheorems 1 and 2 give sharp tradeoffs between the statistical rate of convergence for any statistical\nprocedure and the desired privacy of a user. We present two corollaries establishing this tradeoff. 
In\neach corollary, we look ahead to Section 4 and use one of Propositions 1 or 2 to derive a bijection\nbetween the size M\u221e or M1 of the perturbation set and the amount of privacy\u2014as measured by the\nworst case mutual information I \u2217\u2014provided. We then combine Theorems 1 and 2 with results on\nstochastic approximation to demonstrate the tradeoffs.\nCorollary 1. Let the conditions of Theorem 1(b) hold, and assume that M\u221e \u2265 2L. Assume Q\u2217\nsatis\ufb01es optimal local privacy at information level I \u2217. For universal constants c \u2264 C,\n\nrL\u221ad log d\n\n\u221anI \u2217\n\nc \u00b7\n\n\u2264 \u0001\u2217\n\nn(L, \u0398) \u2264 C \u00b7\n\nrL\u221ad log d\n\n\u221anI \u2217\n\n.\n\nSince \u0398 \u2286{ \u03b8 \u2208 Rd : (\u03b8(1 \u2264 r}, mirror descent [2, 21, 20, Chapter 5], using n un-\nProof\nbiased stochastic gradient samples whose \"\u221e norms are bounded by M\u221e, obtains convergence\nrate O(M\u221er\u221alog d/\u221an). This matches the second statement of Theorem 1. Now \ufb01x our desired\namount of mutual information I \u2217. From the remarks following Proposition 1, if we must guarantee\nthat I \u2217 \u2265 supP I(P, Q) for any distribution P and loss function \" whose gradients are bounded in\n\"\u221e-norm by L, we must (by the remarks following Proposition 1) have\n\nI \u2217 2\n\ndL2\nM 2\n\u221e\n\n.\n\nUp to higher-order terms, to guarantee a level of privacy with mutual information I \u2217, we must allow\ngradient noise up to M\u221e = L&d/I \u2217. Using the bijection between M\u221e and the maximal allowed\nmutual information I \u2217 under local privacy that we have shown, we substitute M\u221e = L\u221ad/\u221aI \u2217\n\ninto the upper and lower bounds that we have already attained.\n\nSimilar upper and lower bounds can be obtained under the conditions of part (a) of Theorem 1,\nwhere we need not assume \u0398 is an \"1-ball, but we lose a factor of \u221alog d in the lower bound. 
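Corollary 1's bijection between the perturbation radius M and the privacy level I* gives a simple back-of-the-envelope calculator. The sketch below suppresses the universal constants c, C, and the function names are ours:

```python
import math

def noise_radius_linf(L, d, I_star):
    """Perturbation radius M_inf needed, up to higher-order terms, so that
    sup_P I(P, Q) <= I_star when gradients are bounded by L in l_inf norm."""
    return L * math.sqrt(d / I_star)

def rate_bound(L, r, d, n, I_star):
    """Scale of the minimax excess risk, r L sqrt(d log d / (n I_star)),
    with the universal constants c, C of Corollary 1 suppressed."""
    return r * L * math.sqrt(d * math.log(d) / (n * I_star))
```

Halving I* (demanding more privacy) inflates both the required noise radius and the excess-risk scale by a factor of sqrt(2).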
Now\nwe turn to a parallel result, but applying Theorem 2 and Proposition 2.\nCorollary 2. Let the conditions of Theorem 2 hold and assume that M1 \u2265 2L. Assume that Q\u2217\nsatis\ufb01es optimal local privacy at information level I \u2217. For universal constants c \u2264 C,\n\nrLd\n\u221anI \u2217 \u2264 \u0001\u2217\n\nc \u00b7\n\nn(L, \u0398) \u2264 C \u00b7\n\nrLd\n\u221anI \u2217\n\n.\n\nProof By the conditions of optimal local privacy (Proposition 2 and Corollary 3), to have I \u2217 \u2265\nsupP I(P, Q) for any loss \" whose gradients are bounded in \"1-norm by L, we must have\n\nI \u2217 2\n\ndL2\n2M 2\n1\n\n,\n\nusing Corollary 3. Rewriting this, we see that we must have M1 = L&d/2I \u2217 (to higher-order\nterms) to be able to guarantee an amount of privacy I \u2217. As in the \"\u221e case, we have a bijection\nbetween the multiplier M1 and the amount of information I \u2217 and can apply similar techniques.\nIndeed, stochastic gradient descent (SGD) enjoys the following convergence guarantees (e.g. [21]).\nLet \u0398 \u2286 Rd be contained in the \"\u221e ball of radius r and the gradients of the loss \" belong to the\nn(L, \u0398) \u2264 CM1r\u221ad/\u221an. Now apply the lower bound\n\"1-ball of radius M1. Then SGD has \u0001\u2217\nprovided by Theorem 2 and substitute for M1.\n\n6\n\n\f4 Saddle points, optimal privacy, and mutual information\n\nIn this section, we explore conditions for a distribution Q\u2217 to satisfy optimal local privacy, as given\nby De\ufb01nition 1. We give characterizations of necessary and suf\ufb01cient conditions based on the com-\npact sets C \u2282 D for distributions P \u2217 and Q\u2217 to achieve the saddle point (4). Our results can be\nviewed as rate distortion theorems [14, 8] (with source P and channel Q) for certain compact al-\nphabets, though as far as we know, they are all new. 
Thus we sometimes refer to the conditional\ndistribution Q, which is designed to maintain the privacy of the data X by communication of Z, as\nthe channel distribution. Since we wish to bound I(X; Z) for arbitrary losses \", we must address the\ncase when \"(X, \u03b8) = .\u03b8, X/, in which case \u2207\"(X, \u03b8) = X; by the data-processing inequality [14,\nChapter 5] it is thus no loss of generality to assume that X \u2208 C and that E[Z | X] = X.\nWe begin by de\ufb01ning the types of sets C and D that we use in our characterization of privacy. As\nwe see in Section 3, such sets are reasonable for many applications. We focus on the case when the\ncompact sets C and D are (suitably symmetric) norm balls:\nDe\ufb01nition 3. Let C \u2282 Rd be a compact convex set with extreme points ui \u2208 Rd, i \u2208 I for some\nindex set I. Then C is rotationally invariant through its extreme points if (ui(2 = (uj(2 for each\ni, j, and for any unitary matrix U such that U ui = uj for some i 3= j, then U C = C.\nSome examples of convex sets rotationally invariant through their extreme points include \"p-norm\nballs for p = 1, 2,\u221e, though \"p-balls for p 3\u2208 {1, 2,\u221e} are not. The following theorem gives\na general characterization of the minimax mutual information for rotationally invariant norm balls\nwith \ufb01nite numbers of extreme points by providing saddle point distributions P \u2217 and Q\u2217. We provide\nthe proof of Theorem 3 in Section A.1.\nTheorem 3. Let C be a compact, convex, polytope rotationally invariant through its extreme points\n{ui}m\ni=1 and D = (1 + \u03b1)C for some \u03b1> 0. Let Q\u2217 be the conditional distribution on Z | X that\nmaximizes the entropy H(Z | X = x) subject to the constraints that\n\nEQ[Z | X = x] = x\n\ni=1 uniquely attains the saddle point (4).\n\nfor x \u2208 C and that Z is supported on (1 + \u03b1)ui for i = 1, . . . , m. 
Then Q\u2217 satis\ufb01es De\ufb01nition 1,\noptimal local privacy, and Q\u2217 is (up to measure zero sets) unique. Moreover, the distribution P \u2217\nuniform on {ui}m\nRemarks: While in the theorem we assume that Q\u2217(\u00b7 | X = x) maximizes the entropy for each\nx \u2208 C, this is not in fact essential. In fact, we may introduce a random variable X # between X and\nZ: let X # be distributed among the extreme points {ui}m\ni=1 of C in any way such that E[X # | X] =\nX, then use the maximum entropy distribution Q\u2217(\u00b7 | ui) de\ufb01ned in the theorem when X \u2208{ ui}m\ni=1\nto sample Z from X #. The information processing inequality [14, Chapter 5] guarantees the Markov\nchain X \u2192 X # \u2192 Z satis\ufb01es the minimax bound I(X; Z) \u2264 inf Q supP I(P, Q).\nWith Theorem 3 in place, we can explicitly characterize the distributions achieving optimal local\nprivacy (recall De\ufb01nition 1) for \"1 and \"\u221e balls. We present the propositions in turn, providing\nsome discussion here and deferring proofs to Appendices A.2 and A.3.\nFirst, consider the case where X \u2208 [\u22121, 1]d and Z \u2208 [\u2212M, M ]d. For notational convenience, we\nde\ufb01ne the binary entropy h(p) = \u2212p log p \u2212 (1 \u2212 p) log(1 \u2212 p). We have\nProposition 1. Let X \u2208 [\u22121, 1]d and Z \u2208 [\u2212M, M ]d be random variables with M \u2265 1 and\nE[Z | X] = X almost surely. De\ufb01ne Q\u2217 to be the conditional distribution on Z | X such that the\ncoordinates of Z are independent, have range {\u2212M, M}, and\n\nQ\u2217(Zi = M | X) =\n\n1\n2\n\n+\n\nXi\n2M\n\nThen Q\u2217 satis\ufb01es De\ufb01nition 1, optimal local privacy, and moreover,\n\nand Q\u2217(Zi = \u2212M | X) =\n\n1\n2 \u2212\n\nXi\n2M\n\n.\n\nI(P, Q\u2217) = d \u2212 d \u00b7 h) 1\n\n2\n\n+\n\n1\n\n2M* .\n\nsup\nP\n\nBefore continuing, we give a more intuitive understanding of Proposition 1. 
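One concrete way to build that intuition is to implement the channel directly: each coordinate of Z is an independent +/-M randomized response, unbiased for the corresponding coordinate of X. The helper names below are ours; the sampling rule and the information formula are those of Proposition 1:

```python
import math
import random

def sample_z_linf(x, M, rng):
    """Proposition 1's channel: coordinates of Z are independent and equal
    to +/- M, with P(Z_i = M | X) = 1/2 + X_i / (2M); hence E[Z | X] = X.
    Requires each |X_i| <= 1 <= M."""
    return [M if rng.random() < 0.5 + xi / (2 * M) else -M for xi in x]

def worst_case_information(d, M):
    """sup_P I(P, Q*) = d * (log 2 - h(1/2 + 1/(2M))), in nats."""
    q = 0.5 + 1.0 / (2 * M)
    h = -(q * math.log(q) + (1 - q) * math.log(1 - q))
    return d * (math.log(2) - h)
```

Numerically, worst_case_information(d, M) is at most d / M**2, the bound used in the proof of Corollary 1.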
Concavity implies that\nfor a, b > 0, log(a) \u2264 log b + b\u22121(a \u2212 b), or \u2212 log(a) \u2265 \u2212 log(b) + b\u22121(b \u2212 a), so in particular\nM* = log 2\u2212\nh) 1\n1\nM 2 .\n\n2M*)\u2212 log 2 +\n\n2M*)\u2212 log 2 \u2212\n\nM*\u2212) 1\n\n1\n\n2M* \u2265 \u2212) 1\n\n2 \u2212\n\n+\n\n+\n\n2\n\n1\n\n2\n\n1\n\n1\n\n1\n\n7\n\n\fThat is, we have for any distribution P on X \u2208 [\u22121, 1]d that (in natural logarithms)\n\nI(P, Q\u2217) \u2264\n\nd\nM 2\n\nand I(P, Q\u2217) =\n\nd\nM 2 + O(M \u22123).\n\nWe now consider the case when X \u2208+x \u2208 Rd |( x(1 \u2264 1, and Z \u2208+z \u2208 Rd |( z(1 \u2264 M,. Here\nthe arguments are slightly more complicated, as the coordinates of the random variables are no\nlonger independent, but Theorem 3 still allows us to explicitly characterize the saddle point of the\nmutual information.\nProposition 2. Let X \u2208{ x \u2208 Rd |( x(1 \u2264 1} and Z \u2208{ z \u2208 Rd |( z(1 \u2264 M} be random\nvariables with M > 1. De\ufb01ne the parameter \u03b3 as in Eq. (9), and let Q\u2217 be the distribution on Z | X\nsuch that Z is supported on {\u00b1M ei}d\n\ni=1, and\n\nQ\u2217(Z = M ei | X = ei) =\nQ\u2217(Z = \u2212M ei | X = ei) =\nQ\u2217(Z = \u00b1M ej | X = ei, j 3= i) =\n\ne\u03b3\n\ne\u03b3 + e\u2212\u03b3 + (2d \u2212 2)\n\ne\u2212\u03b3\n\ne\u03b3 + e\u2212\u03b3 + (2d \u2212 2)\ne\u03b3 + e\u2212\u03b3 + (2d \u2212 2)\n\n1\n\n,\n\n,\n\n.\n\n(10a)\n\n(10b)\n\n(10c)\n\nsup\nP\n\n(For X 3\u2208 {\u00b1ei}, de\ufb01ne X # to be randomly selected in any way from among {\u00b1ei} such that\nE[X # | X] = X, then sample Z conditioned on X # according to (10a)\u2013(10c).) 
Then Q∗ satisfies Definition 1, optimal local privacy, and

  sup_P I(P, Q∗) = log(2d) − log(e^γ + e^{−γ} + 2d − 2) + γ (e^γ − e^{−γ}) / (e^γ + e^{−γ} + 2d − 2).

We remark that the additional sampling to guarantee that X′ ∈ {±e_i} (where the conditional distribution Q∗ is defined) can be accomplished simply: define the random variable X′ so that X′ = e_i sign(x_i) with probability |x_i|/‖x‖₁. Evidently E[X′ | X] = x, and X → X′ → Z for Z distributed according to Q∗ defines a Markov chain as in our remarks following Theorem 3. Additionally, an asymptotic expansion allows us to gain a somewhat clearer picture of the values of the mutual information, though we do not derive upper bounds as we did for Proposition 1. We have the following corollary, proved in Appendix E.1.

Corollary 3. Let Q∗ denote the conditional distribution in Proposition 2. Then

  sup_P I(P, Q∗) = d/(2M²) + Θ(min{d³/M⁴, log⁴(d)/d}).

5 Discussion and open questions

This study leaves a number of open issues and areas for future work. We study procedures that access each datum only once and through a perturbed view Z_i of the subgradient ∂ℓ(X_i, θ), which allows us to use (essentially) any convex loss. A natural question is whether there are restrictions on the loss function so that a transformed version (Z_1, . . . , Z_n) of the data is sufficient for inference. Zhou et al. [33] study one such procedure, and nonparametric data releases, such as those Hall et al. [15] study, may also provide insights. Unfortunately, these (and other) current approaches require that the data be aggregated by a trusted curator. Our constraints on the privacy-inducing channel distribution Q require that its support lie in some compact set.
We find this restriction useful, but perhaps it is possible to achieve faster estimation rates under other conditions. A better understanding of general privacy-preserving channels Q for alternative constraints to those we have proposed is also desirable.

These questions do not appear to have easy answers, especially when we wish to allow each provider of a single datum to be able to guarantee his or her own privacy. Nevertheless, we hope that our view of privacy and the techniques we have developed herein prove fruitful, and we hope to investigate some of the above issues in future work.

Acknowledgments We thank Cynthia Dwork, Guy Rothblum, and Kunal Talwar for feedback on early versions of this work. This material was supported in part by ONR MURI grant N00014-11-1-0688 and the U.S. Army Research Laboratory and the U.S. Army Research Office under grant W911NF-11-1-0391. JCD was partially supported by an NDSEG fellowship and a Facebook fellowship.

References

[1] A. Agarwal, P. Bartlett, P. Ravikumar, and M. Wainwright. Information-theoretic lower bounds on the oracle complexity of convex optimization. IEEE Transactions on Information Theory, 58(5):3235–3249, 2012.
[2] A. Beck and M. Teboulle. Mirror descent and nonlinear projected subgradient methods for convex optimization. Operations Research Letters, 31:167–175, 2003.
[3] D. P. Bertsekas and J. N. Tsitsiklis. Parallel and Distributed Computation: Numerical Methods. Prentice-Hall, Inc., 1989.
[4] P. Billingsley. Probability and Measure. Wiley, second edition, 1986.
[5] A. Blum, K. Ligett, and A. Roth. A learning theory approach to non-interactive database privacy. In Proceedings of the Fortieth Annual ACM Symposium on the Theory of Computing, 2008.
[6] S. Boyd and L. Vandenberghe. Convex Optimization. Cambridge University Press, 2004.
[7] K. Chaudhuri, C. Monteleoni, and A. D. Sarwate.
Differentially private empirical risk minimization. Journal of Machine Learning Research, 12:1069–1109, 2011.
[8] T. M. Cover and J. A. Thomas. Elements of Information Theory, Second Edition. Wiley, 2006.
[9] J. C. Duchi, M. I. Jordan, and M. J. Wainwright. Privacy aware learning. URL http://arxiv.org/abs/1210.2085, 2012.
[10] C. Dwork. Differential privacy: a survey of results. In Theory and Applications of Models of Computation, volume 4978 of Lecture Notes in Computer Science, pages 1–19. Springer, 2008.
[11] C. Dwork and J. Lei. Differential privacy and robust statistics. In Proceedings of the Forty-First Annual ACM Symposium on the Theory of Computing, 2009.
[12] C. Dwork, F. McSherry, K. Nissim, and A. Smith. Calibrating noise to sensitivity in private data analysis. In Proceedings of the 3rd Theory of Cryptography Conference, pages 265–284, 2006.
[13] A. V. Evfimievski, J. Gehrke, and R. Srikant. Limiting privacy breaches in privacy preserving data mining. In Proceedings of the Twenty-Second Symposium on Principles of Database Systems, pages 211–222, 2003.
[14] R. M. Gray. Entropy and Information Theory. Springer, 1990.
[15] R. Hall, A. Rinaldo, and L. Wasserman. Random differential privacy. URL http://arxiv.org/abs/1112.2680, 2011.
[16] J. Hiriart-Urruty and C. Lemaréchal. Convex Analysis and Minimization Algorithms. Springer, 1996.
[17] S. P. Kasiviswanathan, H. K. Lee, K. Nissim, S. Raskhodnikova, and A. Smith. What can we learn privately? SIAM Journal on Computing, 40(3):793–826, 2011.
[18] L. Le Cam. On the asymptotic theory of estimation and hypothesis testing. In Proceedings of the Third Berkeley Symposium on Mathematical Statistics and Probability, pages 129–156, 1956.
[19] L. Le Cam. Convergence of estimates under dimensionality restrictions. Annals of Statistics, 1(1):38–53, 1973.
[20] A. Nemirovski and D. Yudin.
Problem Complexity and Method Efficiency in Optimization. Wiley, 1983.
[21] A. Nemirovski, A. Juditsky, G. Lan, and A. Shapiro. Robust stochastic approximation approach to stochastic programming. SIAM Journal on Optimization, 19(4):1574–1609, 2009.
[22] R. R. Phelps. Lectures on Choquet's Theorem, Second Edition. Springer, 2001.
[23] B. T. Polyak and A. B. Juditsky. Acceleration of stochastic approximation by averaging. SIAM Journal on Control and Optimization, 30(4):838–855, 1992.
[24] B. I. P. Rubinstein, P. L. Bartlett, L. Huang, and N. Taft. Learning in a large function space: privacy-preserving mechanisms for SVM learning. Journal of Privacy and Confidentiality, 4(1):65–100, 2012.
[25] L. Sankar, S. R. Rajagopalan, and H. V. Poor. An information-theoretic approach to privacy. In The 48th Allerton Conference on Communication, Control, and Computing, pages 1220–1227, 2010.
[26] A. Smith. Privacy-preserving statistical estimation with optimal convergence rates. In Proceedings of the Forty-Third Annual ACM Symposium on the Theory of Computing, 2011.
[27] A. W. van der Vaart. Asymptotic Statistics. Cambridge Series in Statistical and Probabilistic Mathematics. Cambridge University Press, 1998. ISBN 0-521-49603-9.
[28] A. Wald. Contributions to the theory of statistical estimation and testing hypotheses. Annals of Mathematical Statistics, 10(4):299–326, 1939.
[29] S. L. Warner. Randomized response: a survey technique for eliminating evasive answer bias. Journal of the American Statistical Association, 60(309):63–69, 1965.
[30] L. Wasserman and S. Zhou. A statistical framework for differential privacy. Journal of the American Statistical Association, 105(489):375–389, 2010.
[31] Y. Yang and A. Barron. Information-theoretic determination of minimax rates of convergence. Annals of Statistics, 27(5):1564–1599, 1999.
[32] B. Yu. Assouad, Fano, and Le Cam.
In Festschrift for Lucien Le Cam, pages 423–435. Springer-Verlag, 1997.
[33] S. Zhou, J. Lafferty, and L. Wasserman. Compressed regression. IEEE Transactions on Information Theory, 55(2):846–866, 2009.
[34] M. Zinkevich. Online convex programming and generalized infinitesimal gradient ascent. In Proceedings of the Twentieth International Conference on Machine Learning, 2003.