{"title": "Private Stochastic Convex Optimization with Optimal Rates", "book": "Advances in Neural Information Processing Systems", "page_first": 11282, "page_last": 11291, "abstract": "We study differentially private (DP) algorithms for stochastic convex optimization (SCO). In this problem the goal is to approximately minimize the population loss given i.i.d.~samples from a distribution over convex and Lipschitz loss functions. A long line of existing work on private convex optimization focuses on the empirical loss and derives asymptotically tight bounds on the excess empirical loss.  However a significant gap exists in the known bounds for the population loss.\n\nWe show that, up to logarithmic factors, the optimal excess population loss for DP algorithms is equal to the larger of the optimal non-private excess population loss, and the optimal excess empirical loss of DP algorithms. This implies that, contrary to intuition based on private ERM, private SCO has asymptotically the same rate of $1/\\sqrt{n}$ as non-private SCO in the parameter regime most common in practice. The best previous result in this setting gives rate of $1/n^{1/4}$. Our approach builds on existing differentially private algorithms and relies on the analysis of algorithmic stability to ensure generalization.", "full_text": "Private Stochastic Convex Optimization with Optimal\n\nRates\n\nRaef Bassily\u2217\n\nThe Ohio State University\nbassily.1@osu.edu\n\nVitaly Feldman\u2217\n\nGoogle Research. Brain Team.\n\nKunal Talwar\u2217\n\nGoogle Research. Brain Team.\n\nkunal@google.com\n\nAbhradeep Thakurta\u2217\n\nUniversity of California Santa Cruz\n\nGoogle Research. Brain Team.\n\nAbstract\n\nWe study differentially private (DP) algorithms for stochastic convex optimization\n(SCO). In this problem the goal is to approximately minimize the population loss\ngiven i.i.d. samples from a distribution over convex and Lipschitz loss functions. 
A long line of existing work on private convex optimization focuses on the empirical loss and derives asymptotically tight bounds on the excess empirical loss. However, a significant gap exists in the known bounds for the population loss.
We show that, up to logarithmic factors, the optimal excess population loss for DP algorithms is equal to the larger of the optimal non-private excess population loss and the optimal excess empirical loss of DP algorithms. This implies that, contrary to intuition based on private ERM, private SCO has asymptotically the same rate of $1/\sqrt{n}$ as non-private SCO in the parameter regime most common in practice. The best previous result in this setting gives a rate of $1/n^{1/4}$. Our approach builds on existing differentially private algorithms and relies on the analysis of algorithmic stability to ensure generalization.

1 Introduction

Many fundamental problems in machine learning reduce to the problem of minimizing the expected loss (a.k.a. population loss) $L(w) = \mathbb{E}_{z \sim \mathcal{D}}[\ell(w, z)]$ for convex loss functions of $w$ given access to samples $z_1, \ldots, z_n$ from the data distribution $\mathcal{D}$. This problem arises in various settings, such as estimating the mean of a distribution, least squares regression, or minimizing a convex surrogate loss for a classification problem. This problem is commonly referred to as Stochastic Convex Optimization (SCO) and has been the subject of extensive study in machine learning and optimization [SSBD14]. In this work we study this problem with the additional constraint of differential privacy.

A closely related problem is that of minimizing the loss $\hat{L}(w) = \frac{1}{n}\sum_i \ell(w, z_i)$ on the sampled set of functions, often known as Empirical Risk Minimization (ERM). The problem of private ERM has been well-studied, and tight upper and lower bounds are known for private ERM. We give nearly tight upper and lower bounds on the excess population loss (a.k.a. 
excess population risk). At first glance these two problems may appear to be essentially the same, as an optimal algorithm for minimizing the empirical risk should also achieve the best bounds for the population risk itself, i.e., the best approach to private SCO is to use the best private ERM.
This simple intuition is unfortunately false, even in the non-private case. A natural approach to bounding the population loss is by proving an upper bound on $\mathbb{E}_{z_1, \ldots, z_n}\left[\sup_{w}\left(L(w) - \hat{L}(w)\right)\right]$. This is known as uniform convergence.

*Part of this work was done while visiting the Simons Institute for the Theory of Computing.

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

There are examples of distributions over losses where uniform convergence based bounds are provably sub-optimal. For example, for convex Lipschitz losses in $d$-dimensional Euclidean space, the best bound on the population loss achievable via uniform convergence is $\Omega(\sqrt{d/n})$ [Fel16]. In contrast, SGD is known to achieve excess loss of $O(\sqrt{1/n})$, which is independent of the dimension. As a result, in the high-dimensional settings often considered in modern ML (when $n = \Theta(d)$), the optimal achievable excess loss is $O(\sqrt{1/n})$, whereas the uniform convergence bound is $\Omega(1)$.
This discrepancy implies that using private ERM and appealing to uniform convergence will not lead to optimal bounds for private SCO. The first work to address the population loss for private SCO is [BST14], which gives bounds based on several natural approaches. Their first approach is to use the generalization properties of differential privacy itself to bound the gap between the empirical and population losses [DFH+15, BNS+16], and thus derive bounds for SCO from bounds on ERM. This approach leads to a suboptimal bound for private SCO (specifically², $\approx \max\left(\frac{d^{1/4}}{\sqrt{n}},\; \frac{\sqrt{d}}{\epsilon n}\right)$ [BST14, Sec. F]). 
For the important case of $d$ being on the order of $n$ and $\epsilon$ being on the order of one, this results in an $\Omega(n^{-1/4})$ bound on excess population loss. Their second approach uses stability induced by regularizing the empirical loss before it is minimized via a private ERM algorithm for strongly convex losses. This technique also yields a suboptimal bound on the excess population loss, $\approx d^{1/4}/\sqrt{\epsilon n}$.
There are two natural lower bounds that apply to private SCO. The lower bound of $\Omega(\sqrt{1/n})$ for the excess loss of non-private SCO applies for private SCO. Further, it is not hard to show that lower bounds for private ERM translate to essentially the same lower bound for private SCO, leading to a lower bound of the form $\Omega\left(\frac{\sqrt{d}}{\epsilon n}\right)$. We give a detailed argument for the lower bound in the full version [BFTT19]. In this work, we address the question:

What is the optimal excess loss for private SCO? Is the rate of $O\left(\sqrt{\frac{1}{n}} + \frac{\sqrt{d}}{\epsilon n}\right)$ achievable?

1.1 Our contribution

We show that the optimal rate of $O\left(\sqrt{\frac{1}{n}} + \frac{\sqrt{d}}{\epsilon n}\right)$ is achievable. In particular, we obtain the statistically optimal rate of $O(1/\sqrt{n})$ whenever $d = O(n)$. This is in contrast to the situation for private ERM, where the cost of privacy grows with the dimension for all $n$.
In fact, under relatively mild smoothness assumptions, this rate is achieved by a variant of the standard noisy mini-batch SGD. The parameters of the scheme need to be tuned carefully to satisfy a delicate balance. The classical analyses for non-private SCO depend crucially on making only one pass over the dataset. However, a single-pass noisy SGD is not sufficiently accurate, as we need a non-trivial amount of noise in each step to carry out the privacy analysis. 
We rely instead on a different approach to generalization, known as uniform stability [BE02]. The stability parameter degrades with the number of passes over the dataset [HRS15, FV19], while the empirical accuracy improves as we make more passes. In addition, the batch size needs to be sufficiently large to ensure that the noise added for privacy is small. To satisfy all these constraints, the parameters of the scheme need to be tuned carefully. Specifically, we show that $\approx \min(n, n^2\epsilon^2/d)$ steps of SGD with a batch size of $\approx \max(\sqrt{\epsilon n}, 1)$ are sufficient to get all the desired properties.
Our second contribution is to show that the smoothness assumptions can be relaxed at essentially no additional loss. We use a general smoothing technique based on the Moreau-Yosida envelope operator that allows us to derive the same asymptotic bounds as in the smooth case. This operator cannot be implemented efficiently in general, but for algorithms based on gradient steps we exploit the well-known connection between the gradient step on the smoothed function and the proximal step on the original function. Thus our algorithm is equivalent to (stochastic, noisy, mini-batch) proximal descent on the unsmoothed function. We show that our analysis in the smooth case is robust to inaccuracies in the computation of the gradient. This allows us to show that a sufficient approximation to the proximal steps can be implemented in polynomial time given access to the gradients of the $\ell(w, z_i)$'s.

²In this Introduction, we are concerned with the dependence on $d$ and $n$, for $(\epsilon, \delta)$-DP. We suppress the dependence on $\delta$ and on parameters of the loss function such as the Lipschitz constant and the constraint set radius.

Finally, we show that Objective Perturbation [CMS11, KST12] also achieves optimal bounds for private SCO. 
However, objective perturbation is only known to satisfy privacy under some additional assumptions (most notably, the Hessian having rank 1 at all points in the domain). The generalization analysis in this case is based on the uniform stability of the solution to strongly convex ERM. Aside from extending the analysis of this approach to the population loss, we show that it can lead to algorithms for private SCO that use only a near-linear number of gradient evaluations (whenever these assumptions hold). In particular, we give a variant of objective perturbation in conjunction with stochastic variance reduced gradient descent (SVRG) that uses only $O(n \log n)$ gradient evaluations. We remark that the known lower bounds for uniform convergence [Fel16] hold even under those additional assumptions invoked in objective perturbation. Finding algorithms with near-linear running time in the general setting of SCO is a natural avenue for future research.
Our work highlights the importance of uniform stability as a tool for the analysis of this important class of problems. We believe it should have applications to other differentially private statistical analyses.
Related work: Differentially private empirical risk minimization (ERM) is a well-studied area spanning over a decade [CM08, CMS11, JKT12, KST12, ST13, SCS13, DJW13, Ull15, JT14, BST14, TTZ15, STU17, WLK+17, WYX17, INS+19]. Aside from [BST14] and work in the local model of DP [DJW13], these works focus on achieving optimal empirical risk bounds under privacy. Our work builds heavily on algorithms and analyses developed in this line of work while contributing additional insights. 
Optimal bounds for private SCO are known for some simple subclasses of convex functions such as Generalized Linear Models [JT14, BST14], where uniform convergence bounds on the order of $1/\sqrt{n}$ are known [KST08].

2 Preliminaries

Notation: We use $\mathcal{W} \subset \mathbb{R}^d$ to denote the parameter space, which is assumed to be a convex, compact set. We denote by $M = \max_{w \in \mathcal{W}} \|w\|$ the $L_2$ radius of $\mathcal{W}$. We use $\mathcal{Z}$ to denote an arbitrary data universe and $\mathcal{D}$ to denote an arbitrary distribution over $\mathcal{Z}$. We let $\ell : \mathbb{R}^d \times \mathcal{Z} \rightarrow \mathbb{R}$ be a loss function that takes a parameter vector $w \in \mathcal{W}$ and a data point $z \in \mathcal{Z}$ as inputs and outputs a real value.
The empirical loss of $w \in \mathcal{W}$ w.r.t. loss $\ell$ and dataset $S = (z_1, \ldots, z_n)$ is defined as $\hat{L}(w; S) \triangleq \frac{1}{n}\sum_{i=1}^n \ell(w, z_i)$. The excess empirical loss of $w$ is defined as $\hat{L}(w; S) - \min_{\tilde{w} \in \mathcal{W}} \hat{L}(\tilde{w}; S)$.
The population loss of $w \in \mathcal{W}$ with respect to a loss $\ell$ and a distribution $\mathcal{D}$ over $\mathcal{Z}$ is defined as $L(w; \mathcal{D}) \triangleq \mathbb{E}_{z \sim \mathcal{D}}[\ell(w, z)]$. The excess population loss of $w$ is defined as $L(w; \mathcal{D}) - \min_{\tilde{w} \in \mathcal{W}} L(\tilde{w}; \mathcal{D})$.
Definition 2.1 (Uniform stability). Let $\alpha > 0$. A (randomized) algorithm $A : \mathcal{Z}^n \rightarrow \mathcal{W}$ is $\alpha$-uniformly stable (w.r.t. loss $\ell : \mathcal{W} \times \mathcal{Z} \rightarrow \mathbb{R}$) if for any pair $S, S' \in \mathcal{Z}^n$ differing in at most one data point, we have

$$\sup_{z \in \mathcal{Z}}\; \mathbb{E}_{A}\left[\ell(A(S), z) - \ell(A(S'), z)\right] \le \alpha,$$

where the expectation is taken only over the internal randomness of $A$.
The following is a useful implication of uniform stability.
Lemma 2.2 (See, e.g., [SSBD14]). Let $A : \mathcal{Z}^n \rightarrow \mathcal{W}$ be an $\alpha$-uniformly stable algorithm w.r.t. loss $\ell : \mathcal{W} \times \mathcal{Z} \rightarrow \mathbb{R}$. Let $\mathcal{D}$ be any distribution over $\mathcal{Z}$, and let $S \sim \mathcal{D}^n$. 
Then,

$$\mathbb{E}_{S \sim \mathcal{D}^n,\, A}\left[L(A(S); \mathcal{D}) - \hat{L}(A(S); S)\right] \le \alpha.$$

Definition 2.3 (Smooth function). Let $\beta > 0$. A differentiable function $f : \mathbb{R}^d \rightarrow \mathbb{R}$ is $\beta$-smooth if for every $w, v \in \mathbb{R}^d$, we have

$$f(v) \le f(w) + \langle \nabla f(w), v - w \rangle + \frac{\beta}{2}\|w - v\|^2.$$

In the sequel, whenever we attribute a property (e.g., convexity, Lipschitz property, smoothness, etc.) to a loss function $\ell$, we mean that for every data point $z \in \mathcal{Z}$, the loss $\ell(\cdot, z)$ possesses that property.
Stochastic Convex Optimization (SCO): Let $\mathcal{D}$ be an arbitrary (unknown) distribution over $\mathcal{Z}$, and let $S = \{z_1, \ldots, z_n\}$ be a sample of i.i.d. draws from $\mathcal{D}$. Let $\ell : \mathcal{W} \times \mathcal{Z} \rightarrow \mathbb{R}$ be a convex loss function. A (possibly randomized) algorithm for SCO uses the sample $S$ to generate an (approximate) minimizer $\hat{w}_S$ for $L(\cdot; \mathcal{D})$. We measure the accuracy of $A$ by the expected excess population loss of its output parameter $\hat{w}_S$, defined as

$$\Delta L(A; \mathcal{D}) \triangleq \mathbb{E}\left[L(\hat{w}_S; \mathcal{D}) - \min_{w \in \mathcal{W}} L(w; \mathcal{D})\right],$$

where the expectation is taken over the choice of $S \sim \mathcal{D}^n$ and any internal randomness in $A$.
Differential privacy [DMNS06, DKM+06]: A randomized algorithm $A$ is $(\epsilon, \delta)$-differentially private if, for any pair of datasets $S$ and $S'$ that differ in exactly one data point, and for all events $O$ in the output range of $A$, we have

$$\mathbb{P}[A(S) \in O] \le e^{\epsilon} \cdot \mathbb{P}[A(S') \in O] + \delta,$$

where the probability is taken over the random coins of $A$. 
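As a concrete illustration of this definition, consider the standard Gaussian mechanism [DKM+06] applied to releasing the mean of a dataset of bounded vectors. The sketch below is our own (the function name and parameter choices are illustrative, not from the paper); it uses the well-known calibration $\sigma = \sqrt{2\ln(1.25/\delta)}\cdot\Delta/\epsilon$ for $\epsilon < 1$, where $\Delta$ is the $L_2$ sensitivity:

```python
import numpy as np

def gaussian_mechanism_mean(S, eps, delta, clip=1.0, rng=None):
    """(eps, delta)-DP release of the mean of the rows of S.

    Each row is clipped to L2 norm `clip`, so replacing one data point
    changes the mean by at most 2 * clip / n (the L2 sensitivity).
    Noise scale sqrt(2 ln(1.25/delta)) * sensitivity / eps is the
    standard Gaussian-mechanism calibration for eps < 1.
    """
    rng = rng or np.random.default_rng()
    S = np.atleast_2d(np.asarray(S, dtype=float))
    norms = np.maximum(np.linalg.norm(S, axis=1, keepdims=True), 1e-12)
    S = S * np.minimum(1.0, clip / norms)          # per-point clipping
    sensitivity = 2.0 * clip / len(S)
    sigma = np.sqrt(2.0 * np.log(1.25 / delta)) * sensitivity / eps
    return S.mean(axis=0) + rng.normal(0.0, sigma, size=S.shape[1])
```

Note that the noise scale shrinks as $1/n$, so for large datasets the private mean is close to the true mean; the same calibration idea underlies the per-iteration noise in the algorithms below.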
For meaningful privacy guarantees, the typical settings of the privacy parameters are $\epsilon < 1$ and $\delta \ll 1/n$.
Differentially Private Stochastic Convex Optimization (DP-SCO): An $(\epsilon, \delta)$-DP-SCO algorithm is an SCO algorithm that satisfies $(\epsilon, \delta)$-differential privacy.

3 Private SCO via Mini-batch Noisy SGD

In this section, we consider the setting where the loss $\ell$ is convex, Lipschitz, and smooth. We give a technique that is based on a mini-batch variant of the Noisy Stochastic Gradient Descent (NSGD) algorithm [BST14, ACG+16].

Algorithm 1 A_NSGD: Mini-batch noisy SGD for convex, smooth losses
Input: Private dataset $S = (z_1, \ldots, z_n) \in \mathcal{Z}^n$; $L$-Lipschitz, $\beta$-smooth, convex loss function $\ell$; convex set $\mathcal{W} \subseteq \mathbb{R}^d$; step size $\eta$; mini-batch size $m$; number of iterations $T$; privacy parameters $\epsilon \le 1$, $\delta \le 1/n^2$.
1: Set noise variance $\sigma^2 := \frac{8 T L^2 \log(1/\delta)}{n^2 \epsilon^2}$.
2: Set mini-batch size $m := \max\left(n\sqrt{\frac{\epsilon}{4T}},\; 1\right)$.
3: Choose an arbitrary initial point $w_0 \in \mathcal{W}$.
4: for $t = 0$ to $T-1$ do
5:   Sample a batch $B_t = \{z_{i(t,1)}, \ldots, z_{i(t,m)}\} \leftarrow S$ uniformly with replacement.
6:   $w_{t+1} := \mathrm{Proj}_{\mathcal{W}}\left(w_t - \eta\left(\frac{1}{m}\sum_{j=1}^m \nabla\ell(w_t, z_{i(t,j)}) + G_t\right)\right)$, where $\mathrm{Proj}_{\mathcal{W}}$ denotes the Euclidean projection onto $\mathcal{W}$, and $G_t \sim \mathcal{N}(0, \sigma^2 \mathbb{I}_d)$ is drawn independently each iteration.
7: return $\bar{w}_T = \frac{1}{T}\sum_{t=1}^T w_t$

Theorem 3.1 (Privacy guarantee of A_NSGD). Algorithm 1 is $(\epsilon, \delta)$-differentially private.

Proof. The proof follows from [ACG+16, Theorem 1].

The population loss attained by A_NSGD is given by the next theorem.
Theorem 3.2 (Excess population loss of A_NSGD). Let $\mathcal{D}$ be any distribution over $\mathcal{Z}$, and let $S \sim \mathcal{D}^n$. Suppose $\beta \le \frac{L}{M} \cdot \min\left(\sqrt{\frac{n}{2}},\; \frac{\epsilon n}{2\sqrt{2 d \log(1/\delta)}}\right)$. 
Let $T = \min\left(\frac{n}{8},\; \frac{\epsilon^2 n^2}{32\, d \log(1/\delta)}\right)$ and $\eta = \frac{M}{L\sqrt{T}}$. Then,

$$\Delta L(\mathrm{A}_{\mathrm{NSGD}}; \mathcal{D}) \le 10\, M L \cdot \max\left(\frac{\sqrt{d \log(1/\delta)}}{\epsilon n},\; \frac{1}{\sqrt{n}}\right).$$

Before proving the above theorem, we first state and prove the following useful lemmas.
Lemma 3.3. Let $S \in \mathcal{Z}^n$. Suppose the parameter set $\mathcal{W}$ is convex and $M$-bounded. For any $\eta > 0$, the excess empirical loss of A_NSGD satisfies

$$\mathbb{E}\left[\hat{L}(\bar{w}_T; S)\right] - \min_{w \in \mathcal{W}} \hat{L}(w; S) \le \frac{M^2}{2\eta T} + \frac{\eta L^2}{2}\left(\frac{16\, T d \log(1/\delta)}{n^2 \epsilon^2} + 1\right),$$

where the expectation is taken with respect to the choice of the mini-batch (step 5) and the independent Gaussian noise vectors $G_1, \ldots, G_T$.

Proof. The proof follows from the classical analysis of the stochastic oracle model (see, e.g., [SSBD14]). In particular, we can show that

$$\mathbb{E}\left[\hat{L}(\bar{w}_T; S)\right] - \min_{w \in \mathcal{W}} \hat{L}(w; S) \le \frac{M^2}{2\eta T} + \frac{\eta L^2}{2} + \eta \sigma^2 d,$$

where the last term captures the extra empirical error due to privacy. The statement now follows from the setting of $\sigma^2$ in Algorithm 1.

The following lemma is a simple extension of the results on uniform stability of GD methods that appeared in [HRS15] and [FV19, Lemma 4.3] to the case of mini-batch noisy SGD. We provide a proof for this lemma in the full version [BFTT19].
Lemma 3.4. 
In A_NSGD, suppose $\eta \le 2/\beta$, where $\beta$ is the smoothness parameter of $\ell$. Then, A_NSGD is $\alpha$-uniformly stable with $\alpha = \frac{L^2 T \eta}{n}$.

Proof of Theorem 3.2

By Lemma 2.2, $\alpha$-uniform stability implies that the expected generalization error is bounded by $\alpha$. Hence, by combining Lemma 3.3 with Lemma 3.4, we have

$$\mathbb{E}_{S \sim \mathcal{D}^n,\, \mathrm{A}_{\mathrm{NSGD}}}[L(\bar{w}_T; \mathcal{D})] - \min_{w \in \mathcal{W}} L(w; \mathcal{D}) \le \mathbb{E}\left[\hat{L}(\bar{w}_T; S)\right] - \min_{w \in \mathcal{W}} L(w; \mathcal{D}) + \frac{L^2 \eta T}{n}$$
$$\le \mathbb{E}\left[\hat{L}(\bar{w}_T; S) - \min_{w \in \mathcal{W}} \hat{L}(w; S)\right] + \frac{L^2 \eta T}{n} \quad (1)$$
$$\le \frac{M^2}{2\eta T} + \frac{\eta L^2}{2}\left(\frac{16\, T d \log(1/\delta)}{n^2 \epsilon^2} + 1\right) + \frac{L^2 \eta T}{n},$$

where (1) follows from the fact that $\mathbb{E}_{S \sim \mathcal{D}^n}\left[\min_{w \in \mathcal{W}} \hat{L}(w; S)\right] \le \min_{w \in \mathcal{W}} L(w; \mathcal{D})$. Optimizing the above bound in $\eta$ and $T$ yields the values in the theorem statement for these parameters, as well as the stated bound on the excess population loss.

4 Private SCO for Non-smooth Losses

In this section, we consider the setting where the convex loss is non-smooth. First, we show a generic reduction to the smooth case by employing the smoothing technique known as Moreau-Yosida regularization (a.k.a. Moreau envelope smoothing) [Nes05]. Given an appropriately smoothed version of the loss, we obtain the optimal population loss w.r.t. the original non-smooth loss function. Computing the smoothed loss via this technique is generally computationally inefficient. Hence, we move on to describe a computationally efficient algorithm for the non-smooth case with essentially optimal population loss. 
Our construction is based on an adaptation of our noisy SGD algorithm A_NSGD (Algorithm 1) that exploits some useful properties of the Moreau-Yosida smoothing technique that stem from its connection to proximal operations.
Definition 4.1 (Moreau envelope). Let $f : \mathcal{W} \rightarrow \mathbb{R}$ be a convex function, and let $\beta > 0$. The $\beta$-Moreau envelope of $f$ is a function $f_\beta : \mathcal{W} \rightarrow \mathbb{R}$ defined as

$$f_\beta(w) = \min_{v \in \mathcal{W}}\left(f(v) + \frac{\beta}{2}\|w - v\|^2\right), \quad w \in \mathcal{W}.$$

The Moreau envelope has a direct connection with the proximal operator of a function, defined below.
Definition 4.2 (Proximal operator). The prox operator of $f : \mathcal{W} \rightarrow \mathbb{R}$ is defined as

$$\mathrm{prox}_f(w) = \arg\min_{v \in \mathcal{W}}\left(f(v) + \frac{1}{2}\|w - v\|^2\right), \quad w \in \mathcal{W}.$$

It follows that the Moreau envelope $f_\beta$ can be written as

$$f_\beta(w) = f\left(\mathrm{prox}_{f/\beta}(w)\right) + \frac{\beta}{2}\left\|w - \mathrm{prox}_{f/\beta}(w)\right\|^2.$$

The following lemma states some useful, known properties of the Moreau envelope.
Lemma 4.3 (See [Nes05, Can11]). Let $f : \mathcal{W} \rightarrow \mathbb{R}$ be a convex, $L$-Lipschitz function, and let $\beta > 0$. The $\beta$-Moreau envelope $f_\beta$ satisfies the following:
1. $f_\beta$ is convex, $2L$-Lipschitz, and $\beta$-smooth.
2. $\forall w \in \mathcal{W}$: $f_\beta(w) \le f(w) \le f_\beta(w) + \frac{L^2}{2\beta}$.
3. $\forall w \in \mathcal{W}$: $\nabla f_\beta(w) = \beta\left(w - \mathrm{prox}_{f/\beta}(w)\right)$.

Let $\ell : \mathcal{W} \times \mathcal{Z} \rightarrow \mathbb{R}$ be a convex, $L$-Lipschitz loss. For any $z \in \mathcal{Z}$, let $\ell_\beta(\cdot, z)$ denote the $\beta$-Moreau envelope of $\ell(\cdot, z)$. For a dataset $S = (z_1, \ldots, z_n) \in \mathcal{Z}^n$, let $\hat{L}_\beta(\cdot; S) \triangleq \frac{1}{n}\sum_{i=1}^n \ell_\beta(\cdot, z_i)$ be the empirical risk w.r.t. the $\beta$-smoothed loss. 
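Property 3 of Lemma 4.3 gives a practical recipe: the gradient of the Moreau envelope can be obtained by (approximately) solving the strongly convex prox subproblem. The sketch below is our own illustration, not the paper's construction (function names and step counts are hypothetical); it approximates $\mathrm{prox}_{f/\beta}$ by the subgradient method and checks the result against the closed forms for $f(v) = |v|$, whose prox is soft-thresholding and whose envelope is the Huber function:

```python
import numpy as np

def approx_prox(subgrad_f, w, beta, steps=300):
    """Approximate prox_{f/beta}(w) = argmin_v f(v) + (beta/2)||v - w||^2
    via the subgradient method for a beta-strongly convex objective
    (step size 2 / (beta * (t + 2))). A sketch: no projection onto the
    constraint set W, and the step count is illustrative.
    """
    v = np.array(w, dtype=float)
    for t in range(steps):
        g = subgrad_f(v) + beta * (v - w)   # subgradient of the prox objective
        v = v - (2.0 / (beta * (t + 2))) * g
    return v

def moreau_grad(subgrad_f, w, beta, steps=300):
    """Gradient of the beta-Moreau envelope via property 3 of Lemma 4.3:
    grad f_beta(w) = beta * (w - prox_{f/beta}(w))."""
    return beta * (np.array(w, dtype=float) - approx_prox(subgrad_f, w, beta, steps))
```

For $f(v) = |v|$ and $\beta = 2$, `approx_prox` at $w = 1.5$ approaches the soft-threshold value $1.5 - 1/\beta = 1.0$, and `moreau_grad` at $w = 0.2$ approaches the Huber gradient $\min(\beta|w|, 1)\,\mathrm{sign}(w) = 0.4$.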
For any distribution $\mathcal{D}$, let $L_\beta(\cdot; \mathcal{D}) \triangleq \mathbb{E}_{z \sim \mathcal{D}}[\ell_\beta(\cdot, z)]$ denote the corresponding population loss. The following theorem asserts that, with an appropriate setting for $\beta$, running A_NSGD over the $\beta$-smoothed losses $\ell_\beta(\cdot, z_i)$, $i \in [n]$, yields the optimal population loss w.r.t. the original non-smooth loss $\ell$.
Theorem 4.4 (Excess population loss for non-smooth losses via smoothing). Let $\mathcal{D}$ be any distribution over $\mathcal{Z}$. Let $S = (z_1, \ldots, z_n) \sim \mathcal{D}^n$. Let $\beta = \frac{L}{M} \cdot \min\left(\frac{\sqrt{n}}{4},\; \frac{\epsilon n}{8\sqrt{d \log(1/\delta)}}\right)$. Suppose we run A_NSGD (Algorithm 1) over the $\beta$-smoothed version of $\ell$ associated with the points in $S$: $\{\ell_\beta(\cdot, z_i),\, i \in [n]\}$. Let $\eta$ and $T$ be set as in Theorem 3.2. Then, the excess population loss of the output of A_NSGD w.r.t. $\ell$ satisfies

$$\Delta L(\mathrm{A}_{\mathrm{NSGD}}; \mathcal{D}) \le 24\, M L \cdot \max\left(\frac{\sqrt{d \log(1/\delta)}}{\epsilon n},\; \frac{1}{\sqrt{n}}\right).$$

Proof. Let $\bar{w}_T$ be the output of A_NSGD. Using property 1 of Lemma 4.3 together with Theorem 3.2, we have

$$\mathbb{E}_{S \sim \mathcal{D}^n,\, \mathrm{A}_{\mathrm{NSGD}}}[L_\beta(\bar{w}_T; \mathcal{D})] - \min_{w \in \mathcal{W}} L_\beta(w; \mathcal{D}) \le 20\, M L \cdot \max\left(\frac{\sqrt{d \log(1/\delta)}}{\epsilon n},\; \frac{1}{\sqrt{n}}\right).$$

By property 2 of Lemma 4.3 and the setting of $\beta$ in the theorem statement, for every $w \in \mathcal{W}$, we have

$$L_\beta(w; \mathcal{D}) \le L(w; \mathcal{D}) \le L_\beta(w; \mathcal{D}) + 2\, M L \cdot \max\left(\frac{2\sqrt{d \log(1/\delta)}}{\epsilon n},\; \frac{1}{\sqrt{n}}\right).$$

Putting these together gives the stated result.

Computationally efficient algorithm A_ProxGD (NSGD + Prox)

Computing the Moreau envelope of a function is computationally inefficient in general. 
However, by property 3 of Lemma 4.3, we note that the gradient of the Moreau envelope at any point can be obtained by evaluating the proximal operator of the function at that point. Evaluating the proximal operator is equivalent to minimizing a strongly convex function (see Definition 4.2). This can be approximated efficiently, e.g., via gradient descent. Since our A_NSGD algorithm (Algorithm 1) requires only sufficiently accurate gradient evaluations, we can hence use an efficient, approximate proximal operator to approximate the gradients of the smoothed losses. The gradient evaluations in A_NSGD will thus be replaced with such approximate gradients evaluated via the approximate proximal operator. The resulting algorithm, referred to as A_ProxGD, will approximately minimize the smoothed empirical loss without actually computing the smoothed losses.
Our construction of A_ProxGD involves $\approx n^2 \cdot T^2 \cdot m$ gradient evaluations (of individual losses), where $T$ is the number of iterations of A_NSGD reported in Theorem 3.2, and $m$ is its mini-batch size.
We argue that the approximate proximal operation will have essentially no impact on the guarantees of A_ProxGD as compared to those of A_NSGD. In particular, in terms of privacy, the sensitivity of the approximate gradients (evaluated via the approximate prox operator) will remain basically the same as that of the exact gradients. In terms of empirical error, since the approximation error in the prox operations can be made sufficiently small (while maintaining computational efficiency), the impact of the approximation error on the empirical loss guarantee of A_ProxGD will be negligible. Finally, in terms of uniform stability, again since the approximation error is sufficiently small, the error accumulated across iterations will have no pronounced impact on the uniform stability of A_NSGD (established in Lemma 3.4). 
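The plug-in structure shared by A_NSGD and A_ProxGD — a noisy, projected, mini-batch loop over an interchangeable gradient oracle — can be sketched as follows. This is our own simplified rendering, not the paper's exact algorithm: the constants in the tuning of $T$, $m$, $\sigma$, and $\eta$ are illustrative, and the constraint set is taken to be an $L_2$ ball of radius $M$. For A_ProxGD the oracle would return the approximate Moreau gradient; for A_NSGD it is simply $\nabla\ell$:

```python
import numpy as np

def noisy_minibatch_loop(Z, grad_oracle, M, L, eps, delta, seed=0):
    """Noisy projected mini-batch loop with a pluggable gradient oracle.

    grad_oracle(w, z) returns (an approximation of) the gradient of the
    (possibly smoothed) L-Lipschitz loss at w for data point z. T, m,
    sigma and eta follow the asymptotic tuning in the text up to
    constants (the constants here are illustrative, not the paper's).
    """
    n, d = Z.shape
    rng = np.random.default_rng(seed)
    T = max(1, min(n, int(n**2 * eps**2 / d)))   # ~ min(n, n^2 eps^2 / d) steps
    m = max(1, int(np.sqrt(eps * n)))            # ~ max(sqrt(eps n), 1) batch size
    sigma = np.sqrt(8 * T * np.log(1 / delta)) * L / (n * eps)
    eta = M / (L * np.sqrt(T))
    w, avg = np.zeros(d), np.zeros(d)
    for _ in range(T):
        batch = Z[rng.integers(0, n, size=m)]    # sample with replacement
        g = np.mean([grad_oracle(w, z) for z in batch], axis=0)
        w = w - eta * (g + rng.normal(0.0, sigma, size=d))
        norm = np.linalg.norm(w)
        if norm > M:                             # Euclidean projection
            w *= M / norm
        avg += w / T                             # averaged iterate
    return avg
```

On a toy mean-estimation task with the (non-smooth, 1-Lipschitz) loss $\ell(w, z) = \|w - z\|$, the averaged iterate lands near the data cloud while each step's gradient is perturbed by the privacy noise.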
Putting these together shows that A_ProxGD achieves the optimal population loss bound in Theorem 4.4.
A more detailed description of A_ProxGD and its guarantees is given in the full version [BFTT19].

5 Private SCO via Objective Perturbation

In this section, we show that the technique known as objective perturbation [CMS11, KST12] can be used to attain the optimal population loss under additional assumptions on the loss. These assumptions are invoked to ensure differential privacy. The excess empirical loss of this technique for smooth convex losses was originally analyzed in the aforementioned works, and was shown to be optimal by the lower bound in [BST14]. We revisit this technique and show that the regularization term added for privacy can be used to attain the optimal excess population loss by exploiting the stability-inducing property of regularization. The objective perturbation algorithm A_ObjP is described in Algorithm 2. In addition to smoothness and convexity, we make the following assumption on the loss.
Assumption 5.1. For all $z \in \mathcal{Z}$, $\ell(\cdot, z)$ is twice-differentiable, and the rank of its Hessian $\nabla^2 \ell(w, z)$ at any $w \in \mathcal{W}$ is at most 1.

Algorithm 2 A_ObjP: Objective Perturbation for convex, smooth losses
Input: Private dataset $S = (z_1, \ldots, z_n) \in \mathcal{Z}^n$; $L$-Lipschitz, $\beta$-smooth, convex loss function $\ell$; convex set $\mathcal{W} \subseteq \mathbb{R}^d$; privacy parameters $\epsilon \le 1$, $\delta \le 1/n^2$; regularization parameter $\lambda$.
1: Sample $G \sim \mathcal{N}(0, \sigma^2 \mathbb{I}_d)$, where $\sigma^2 = \frac{10 L^2 \log(1/\delta)}{\epsilon^2}$.
2: return $\hat{w} = \arg\min_{w \in \mathcal{W}} \hat{L}(w; S) + \frac{\langle G, w \rangle}{n} + \lambda\|w\|^2$, where $\hat{L}(w; S) \triangleq \frac{1}{n}\sum_{i=1}^n \ell(w, z_i)$.

Note: Unlike in [KST12], the regularization term as it appears in A_ObjP is not normalized by $n$. Hence, whenever the results from [KST12] are used here, the regularization parameter in their statements should be replaced with $n\lambda$. This presentation choice is more consistent with the literature on regularization.
The privacy guarantee of A_ObjP follows directly from [KST12]:
Theorem 5.2 (Privacy guarantee of A_ObjP, restatement of Theorem 2 in [KST12]). Suppose that Assumption 5.1 holds and that the smoothness parameter satisfies $\beta \le \epsilon n \lambda$. Then, A_ObjP is $(\epsilon, \delta)$-differentially private.
We now state our main result for this section showing that, with an appropriate setting for $\lambda$, A_ObjP yields the optimal population loss.
Theorem 5.3 (Excess population loss of A_ObjP). Let $\mathcal{D}$ be any distribution over $\mathcal{Z}$, and let $S \sim \mathcal{D}^n$. Suppose that Assumption 5.1 holds. Suppose that $\mathcal{W}$ is $M$-bounded. In A_ObjP, set $\lambda = \frac{2L}{M}\sqrt{\frac{2}{n} + \frac{4 d \log(1/\delta)}{\epsilon^2 n^2}}$. 
Then, we have

$$\Delta L(\mathrm{A}_{\mathrm{ObjP}}; \mathcal{D}) \le 2\, M L \sqrt{\frac{2}{n} + \frac{4 d \log(1/\delta)}{\epsilon^2 n^2}} = O\left(M L \cdot \max\left(\frac{1}{\sqrt{n}},\; \frac{\sqrt{d \log(1/\delta)}}{\epsilon n}\right)\right).$$

Note: According to Theorem 5.2, the privacy of A_ObjP entails the assumption that $\beta \le \epsilon n \lambda$. With the setting of $\lambda$ in Theorem 5.3, it would suffice to assume that $\beta \le \frac{2 \epsilon L}{M}\sqrt{2n + 4 d \log(1/\delta)}$.
To prove the above theorem, we use the following lemmas.
Lemma 5.4 (Excess empirical loss of A_ObjP, restatement of Theorem 26 in [KST12]). Let $S \in \mathcal{Z}^n$. Under Assumption 5.1, the excess empirical loss of A_ObjP satisfies

$$\mathbb{E}\left[\hat{L}(\hat{w}; S)\right] - \min_{w \in \mathcal{W}} \hat{L}(w; S) \le \frac{16\, L^2 d \log(1/\delta)}{n^2 \epsilon^2 \lambda} + \lambda M^2,$$

where the expectation is taken over the Gaussian noise in A_ObjP.
Lemma 5.5 ([SSBD14]). Let $f : \mathcal{W} \times \mathcal{Z} \rightarrow \mathbb{R}$ be a convex, $\rho$-Lipschitz loss, and let $\lambda > 0$. Let $S = (z_1, \ldots, z_n) \in \mathcal{Z}^n$. Let $A$ be an algorithm that outputs $\tilde{w} = \arg\min_{w \in \mathcal{W}}\left(\hat{F}(w; S) + \lambda\|w\|^2\right)$, where $\hat{F}(w; S) = \frac{1}{n}\sum_{i=1}^n f(w, z_i)$. Then, $A$ is $\frac{2\rho^2}{\lambda n}$-uniformly stable.

Proof of Theorem 5.3

Fix any realization of the noise vector $G$. For every $w, z$, define $f_G(w, z) \triangleq \ell(w, z) + \frac{\langle G, w \rangle}{n}$. Note that $f_G$ is $\left(L + \frac{\|G\|}{n}\right)$-Lipschitz. For any $S = (z_1, \ldots, z_n) \in \mathcal{Z}^n$, let $\hat{F}_G(w; S) \triangleq \frac{1}{n}\sum_{i=1}^n f_G(w, z_i)$. Hence, the output of A_ObjP can be written as $\hat{w} = \arg\min_{w \in \mathcal{W}} \hat{F}_G(w; S) + \lambda\|w\|^2$. Define $F_G(w; \mathcal{D}) \triangleq \mathbb{E}_{z \sim \mathcal{D}}[f_G(w, z)]$. Thus, by combining Lemmas 5.5 and 2.2, we have

$$\mathbb{E}_{S \sim \mathcal{D}^n}\left[F_G(\hat{w}; \mathcal{D}) - \hat{F}_G(\hat{w}; S)\right] \le \frac{2\left(L + \frac{\|G\|}{n}\right)^2}{\lambda n}.$$

On the other hand, note that $F_G(\hat{w}; \mathcal{D}) - \hat{F}_G(\hat{w}; S) = L(\hat{w}; \mathcal{D}) - \hat{L}(\hat{w}; S)$ since the linear term cancels out. Hence,

$$\mathbb{E}_{S \sim \mathcal{D}^n}\left[L(\hat{w}; \mathcal{D}) - \hat{L}(\hat{w}; S)\right] \le \frac{2\left(L + \frac{\|G\|}{n}\right)^2}{\lambda n}.$$

By taking expectation over $G \sim \mathcal{N}(0, \sigma^2 \mathbb{I}_d)$ as well, we get $\mathbb{E}\left[L(\hat{w}; \mathcal{D}) - \hat{L}(\hat{w}; S)\right] \le \frac{8 L^2}{\lambda n}$. Now, observe that:

$$\Delta L(\mathrm{A}_{\mathrm{ObjP}}; \mathcal{D}) \le \mathbb{E}_{S \sim \mathcal{D}^n}\left[\hat{L}(\hat{w}; S) - \min_{w \in \mathcal{W}} \hat{L}(w; S)\right] + \mathbb{E}\left[L(\hat{w}; \mathcal{D}) - \hat{L}(\hat{w}; S)\right] \le \frac{8}{\lambda}\left(\frac{2 L^2 d \log(1/\delta)}{\epsilon^2 n^2} + \frac{L^2}{n}\right) + \lambda M^2,$$

where we use Lemma 5.4 in the last bound. Optimizing this bound in $\lambda$ yields the result.

A note on the rank assumption: The assumption on the rank of $\nabla^2 \ell(w, z)$ can actually be relaxed (using a similar argument to that in [INS+19]) to a rank of $\tilde{O}\left(\frac{L\sqrt{n+d}}{\beta M}\right)$ without affecting the asymptotic population loss guarantees (see the full version [BFTT19] for a discussion).

Efficient Objective Perturbation: The privacy guarantee of the standard objective perturbation technique is given only when the output is the exact minimizer [CMS11, KST12]. Exact minimization is not usually attainable in practice. We give a practical version of algorithm A_ObjP that attains the same guarantees of privacy and optimal population loss as A_ObjP, and in addition, makes only $O(n \log n)$ gradient evaluations. 
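The recipe of Algorithm 2 can be sketched in a few lines for a loss satisfying Assumption 5.1, such as the logistic loss $\ell(w, (x, y)) = \log(1 + e^{-y\langle w, x\rangle})$ with $\|x\| \le 1$ (so $L = 1$ and the Hessian is rank one). This is our own illustrative sketch, not the paper's implementation: plain gradient descent stands in for the exact minimization, and all dataset and parameter choices are hypothetical.

```python
import numpy as np

def objective_perturbation(X, y, eps, delta, lam, iters=500, rng=None):
    """Sketch of objective perturbation for the logistic loss with
    ||x|| <= 1, so L = 1 and the per-example Hessian has rank one.
    Minimizes  hat L(w; S) + <G, w>/n + lam * ||w||^2  by gradient
    descent, with sigma^2 = 10 L^2 log(1/delta) / eps^2 as in Algorithm 2.
    """
    rng = rng or np.random.default_rng()
    n, d = X.shape
    sigma = np.sqrt(10.0 * np.log(1 / delta)) / eps
    G = rng.normal(0.0, sigma, size=d)
    w = np.zeros(d)
    step = 1.0 / (0.25 + 2 * lam)        # objective is (1/4 + 2 lam)-smooth
    for _ in range(iters):
        z = -y * (X @ w)
        grad = X.T @ (-y / (1 + np.exp(-z))) / n   # empirical logistic gradient
        grad += G / n + 2 * lam * w                # perturbation + regularizer
        w -= step * grad
    return w
```

Because the perturbation enters through a single linear term, the analysis above goes through with $f_G$ in place of $\ell$. In practice the arg min is only ever approximated, which is exactly the issue the efficient variant described next is designed to handle.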
The main idea is to first obtain an approximate minimizer $\tilde{w}$ that is sufficiently close to the true minimizer, and then perturb $\tilde{w}$ with a small amount of Gaussian noise to ensure privacy. The extra error due to this final noise addition has only a negligible impact on the population loss, so the algorithm achieves the same guarantees as $\mathcal{A}_{\mathrm{ObjP}}$ while attaining the optimal population loss efficiently. In particular, we use Stochastic Variance Reduced Gradient Descent (SVRG) [JZ13, XZ14] to perform the optimization step, which leads to a construction with $O(n \log n)$ gradient evaluations. A detailed discussion can be found in the full version [BFTT19].

Acknowledgements

We thank Adam Smith, Thomas Steinke, and Jon Ullman for insightful discussions of the problem at the early stages of this project. We are also grateful to Tomer Koren for bringing the Moreau-Yosida smoothing technique to our attention. R. Bassily's research is supported by NSF Awards AF-1908281 and SHF-1907715, a Google Faculty Research Award, and OSU faculty start-up support. A. Thakurta's research is supported by NSF Awards TRIPODS+X-1839317, AF-1908281, and TRIPODS-1740850, and a Google Faculty Research Award.

References

[ACG+16] Martin Abadi, Andy Chu, Ian Goodfellow, H. Brendan McMahan, Ilya Mironov, Kunal Talwar, and Li Zhang. Deep learning with differential privacy. In Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security, pages 308–318. ACM, 2016.

[BE02] Olivier Bousquet and André Elisseeff. Stability and generalization. Journal of Machine Learning Research, 2(Mar):499–526, 2002.

[BFTT19] Raef Bassily, Vitaly Feldman, Kunal Talwar, and Abhradeep Thakurta. Private stochastic convex optimization with optimal rates.
arXiv preprint arXiv:1908.09970, 2019.

[BNS+16] Raef Bassily, Kobbi Nissim, Adam Smith, Thomas Steinke, Uri Stemmer, and Jonathan Ullman. Algorithmic stability for adaptive data analysis. In Proceedings of the Forty-Eighth Annual ACM Symposium on Theory of Computing, pages 1046–1059. ACM, 2016.

[BST14] Raef Bassily, Adam Smith, and Abhradeep Thakurta. Differentially private empirical risk minimization: Efficient algorithms and tight error bounds. arXiv preprint arXiv:1405.7085, 2014.

[Can11] Emmanuel Candes. Mathematical optimization. Lecture notes, MATH 301, Stanford University, 2011.

[CM08] Kamalika Chaudhuri and Claire Monteleoni. Privacy-preserving logistic regression. In Daphne Koller, Dale Schuurmans, Yoshua Bengio, and Léon Bottou, editors, NIPS. MIT Press, 2008.

[CMS11] Kamalika Chaudhuri, Claire Monteleoni, and Anand D. Sarwate. Differentially private empirical risk minimization. Journal of Machine Learning Research, 12(Mar):1069–1109, 2011.

[DFH+15] Cynthia Dwork, Vitaly Feldman, Moritz Hardt, Toniann Pitassi, Omer Reingold, and Aaron Leon Roth. Preserving statistical validity in adaptive data analysis. In Proceedings of the Forty-Seventh Annual ACM Symposium on Theory of Computing, pages 117–126. ACM, 2015.

[DJW13] John C. Duchi, Michael I. Jordan, and Martin J. Wainwright. Local privacy and statistical minimax rates. In IEEE 54th Annual Symposium on Foundations of Computer Science (FOCS), pages 429–438, 2013.

[DKM+06] Cynthia Dwork, Krishnaram Kenthapadi, Frank McSherry, Ilya Mironov, and Moni Naor. Our data, ourselves: Privacy via distributed noise generation. In EUROCRYPT, 2006.

[DMNS06] Cynthia Dwork, Frank McSherry, Kobbi Nissim, and Adam Smith. Calibrating noise to sensitivity in private data analysis. In Theory of Cryptography Conference, pages 265–284. Springer, 2006.

[Fel16] Vitaly Feldman.
Generalization of ERM in stochastic convex optimization: The dimension strikes back. In Advances in Neural Information Processing Systems, pages 3576–3584, 2016.

[FV19] Vitaly Feldman and Jan Vondrak. High probability generalization bounds for uniformly stable algorithms with nearly optimal rate. arXiv preprint arXiv:1902.10710, 2019.

[HRS15] Moritz Hardt, Benjamin Recht, and Yoram Singer. Train faster, generalize better: Stability of stochastic gradient descent. arXiv preprint arXiv:1509.01240, 2015.

[INS+19] Roger Iyengar, Joseph P. Near, Dawn Song, Om Thakkar, Abhradeep Thakurta, and Lun Wang. Towards practical differentially private convex optimization. In IEEE Symposium on Security and Privacy (Oakland), 2019.

[JKT12] Prateek Jain, Pravesh Kothari, and Abhradeep Thakurta. Differentially private online learning. In 25th Annual Conference on Learning Theory (COLT), pages 24.1–24.34, 2012.

[JT14] Prateek Jain and Abhradeep Thakurta. (Near) dimension independent risk bounds for differentially private learning. In ICML, 2014.

[JZ13] Rie Johnson and Tong Zhang. Accelerating stochastic gradient descent using predictive variance reduction. In Advances in Neural Information Processing Systems, pages 315–323, 2013.

[KST08] S. Kakade, K. Sridharan, and A. Tewari. On the complexity of linear prediction: Risk bounds, margin bounds, and regularization. In NIPS, pages 793–800, 2008.

[KST12] Daniel Kifer, Adam Smith, and Abhradeep Thakurta. Private convex empirical risk minimization and high-dimensional regression. In Conference on Learning Theory, pages 25.1–25.40, 2012.

[Nes05] Yu. Nesterov. Smooth minimization of non-smooth functions. Mathematical Programming, 103(1):127–152, 2005.

[SCS13] Shuang Song, Kamalika Chaudhuri, and Anand D. Sarwate. Stochastic gradient descent with differentially private updates.
In IEEE Global Conference on Signal and Information Processing, 2013.

[SSBD14] Shai Shalev-Shwartz and Shai Ben-David. Understanding Machine Learning: From Theory to Algorithms. Cambridge University Press, 2014.

[ST13] Adam Smith and Abhradeep Thakurta. Differentially private feature selection via stability arguments, and the robustness of the LASSO. In Conference on Learning Theory (COLT), pages 819–850, 2013.

[STU17] Adam Smith, Abhradeep Thakurta, and Jalaj Upadhyay. Is interaction necessary for distributed private learning? In IEEE Symposium on Security and Privacy, pages 58–77, 2017.

[TTZ15] Kunal Talwar, Abhradeep Thakurta, and Li Zhang. Nearly optimal private LASSO. In Proceedings of the 28th International Conference on Neural Information Processing Systems, volume 2, pages 3025–3033, 2015.

[Ull15] Jonathan Ullman. Private multiplicative weights beyond linear queries. In Proceedings of the 34th ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems, pages 303–312. ACM, 2015.

[WLK+17] Xi Wu, Fengan Li, Arun Kumar, Kamalika Chaudhuri, Somesh Jha, and Jeffrey Naughton. Bolt-on differential privacy for scalable stochastic gradient descent-based analytics. In SIGMOD. ACM, 2017.

[WYX17] Di Wang, Minwei Ye, and Jinhui Xu. Differentially private empirical risk minimization revisited: Faster and more general. In Advances in Neural Information Processing Systems, pages 2722–2731, 2017.

[XZ14] Lin Xiao and Tong Zhang. A proximal stochastic gradient method with progressive variance reduction. SIAM Journal on Optimization, 24(4):2057–2075, 2014.