Sun Dec 8th through Sat the 14th, 2019 at Vancouver Convention Center
Strengths All the contributions mentioned above. Very clearly written complete work which proves the optimal convergence rate for various different settings. Weaknesses/comments In the abstract it says that "... in the parameter regime most common in practice." Yet, in Algorithm 1 there are the restrictions epsilon \leq 1 and delta \leq 1/n^2. Especially the restriction on delta seems crude. I cannot see where these restrictions come from. The DP result is based here on the reference Abadi et al., and I do not see such restrictions in the results of the reference article. The privacy analysis (and therefor also the convergence bound) of Algorithm 1 seems to be ultimately based on Lemma 3 of [ACG+16] and on the correct asymptotics that its accurate bounds provide. There seems to be an assumption \sigma \geq 1 in Lemma 3 of [ACG+16], however I do not see this appearing in the article. Could you comment on this? Some comment about how the coefficients c_1 and c_2 of Theorem 1 of the reference [ACG+ 16] are determined would be in order. I assume the authors have worked out the result of that Thm. 1 of [ACG+16] in detail and that is where the choices of variance and mini-batch size in Algorithm 1 come from. To be precise, the results of [ACG+16] hold in the case, where the datasets S and S' are such neighbours that one of them is obtained from the other one by removing one element (see e.g. the proof of Lemma 3 of [ACG+16]). Therefore, is it not entirely correct to say (as in Definition 1) "...for any pair of datasets S and S′ differ in exactly one data point..". This gives the impression that one could replace one element in S by another one to obtain S'. Then the results of [ACG+16] would not hold. Perhaps you could comment on the the neighbouring relation of S and S'. Originality Regarding the breadth of the analysis, this is highly original. Quality and clarity Very clearly structured and easy to read, of highest quality. EDIT: Thank you for the response. The authors have answered my comments. I still slightly disagree with one point: I think that from (eps,delta)-DP with "remove"-relation it does not follow (2*eps,2*delta)-DP for the "replace"-relation DP. Using a triangle-like argument, one gets (2*eps,(1+exp(eps))*delta)-DP. Of course that can be bounded with your bound for eps. I suggest you either comment the relation or change the constants in step 1 of Alg. 1 and step 1 of Alg. 2. Moreover, I am still slightly puzzled why specifically \delta \leq 1/n^2, why not e.g. \delta << 1/n. Anyhow that is ok as it is I think. I will keep my score and I vote for accept.
I really like the first contribution of the paper. However, I still think this work lacks the experimental part. As I know, most of the recent work on the central (\epsilon, \delta) DP-ERM has experimental study such as [1-6]. So I think in order to say that their method of DP-batch SGD, they should provide some experimental study in order to say the improvement.  Towards Practical Differentially Private Convex Optimization.  Differentially Private Empirical Risk Minimization Revisited: Faster and More General.  Privacy-Preserving ERM by Laplacian Smoothing Stochastic Gradient Descent  Renyi Differentially Private ERM for Smooth Objectives  Distributed Learning without Distress: Privacy-Preserving Empirical Risk Minimization  Efficient Private ERM for Smooth Objectives. ------------------------------------ ------------------------------------------------------------------------------------------------- After the rebuttal I change my rate to accept. However, I still want to see some empirical performance of the algorithms.
The optimal rate for ERM is not the same as the optimal rate for stochastic convex optimization. Similarly the optimal rates for private erm are not the same as the optimal rates for private SCO. This paper resolves the important question about what the optimal private rate for SCO is, and gives algorithms for matching the upper bound. The main technique in showing the generalization bounds is uniform stability. They extend the uniform stability analysis of SGD to the noisy SGD case to prove that noisy SGD achieves the optimal SCO rate. The non-smooth case is handled by noisy SGD on the smoothed version of the function via the M-Y envelope. Significance: Significant new results -- worthy of publication Originality: Technically the results seem to follow from standard analysis techniques adapted to the private setting Clarity: Very clearly written