Submitted by Assigned_Reviewer_1
Q1: Comments to author(s). First provide a summary of the paper, and then address the following criteria: Quality, clarity, originality and significance. (For detailed reviewing guidelines, see http://nips.cc/PaperInformation/ReviewerInstructions)
This paper proposes a differentially private algorithm for the Lasso with a significantly cleaner construction and guarantees than previously known results. The idea is to adapt the Frank-Wolfe algorithm, which at each iteration finds the one feature most correlated with the current gradient and adds it into the estimate with a fixed weight. The adaptation is simply to add Laplace noise to the calculation of the correlation of feature X_j with the current gradient, before selecting which feature to add at that iteration.
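To make the described adaptation concrete, here is a minimal numpy sketch of what the noisy selection step could look like. It is illustrative only: the squared loss, the Laplace noise scale, the step-size schedule, and all names are assumptions, not the paper's exact algorithm or its privacy calibration.

```python
import numpy as np

def noisy_frank_wolfe_lasso(X, y, T=50, lam=1.0, noise_scale=0.1, rng=None):
    """Illustrative sketch of the noisy Frank-Wolfe step described above.

    The Laplace noise_scale is a placeholder, not the calibration that
    yields the paper's (eps, delta) guarantee.
    """
    rng = np.random.default_rng() if rng is None else rng
    n, p = X.shape
    w = np.zeros(p)
    for t in range(T):
        grad = X.T @ (X @ w - y) / n            # gradient of the squared loss
        # <grad, s> for the 2p vertices s = +lam*e_j and s = -lam*e_j
        scores = lam * np.concatenate([grad, -grad])
        noisy = scores + rng.laplace(scale=noise_scale, size=2 * p)
        j = int(np.argmin(noisy))               # report-noisy-min over vertices
        s = np.zeros(p)
        s[j % p] = lam if j < p else -lam
        mu = 2.0 / (t + 2)                      # standard FW schedule; the review
        w = (1 - mu) * w + mu * s               # mentions a fixed weight instead
    return w
```

The noisy minimization over the 2p vertices of the L1 ball is what turns the otherwise non-private feature selection into a report-noisy-max style private selection.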
The resulting guarantee shows an excess empirical risk of order O(1/n^(2/3)), up to log factors. This power of n is also proven to be tight. In contrast, in a setting with a true sparse model, restricted eigenvalue type conditions on X, and so on, the non-private error is O(1/n), and existing (more complicated) private mechanisms also introduce error at a rate O(1/n). The result given here is far more general, as it does not rely on properties such as restricted eigenvalues, which in practice are not verifiable.
The proof of the upper bound is a simple corollary of an existing result on noisy Frank-Wolfe, while the lower bound involves some computation. The novelty here is in the idea of the algorithm itself, which is elegant.
Overall this is a nice contribution that largely resolves the question of privacy for the Lasso.
Page 2 typo: "logartihmic"
Theorems 1.2, 3.1, and 3.2 are not clear on the definition of (eps, delta). Theorem 1.2 uses eps = 0.1 and delta = 1/n. Theorem 3.1 uses eps = 0.1 and delta = o(1/n^2), which is a narrower class of algorithms. Theorem 3.2 seems to give a result that is independent of (eps, delta), since they are not specified, but it must be the case that there are assumptions on (eps, delta).
Q2: Please summarize your review in 1-2 sentences
Overall this is a nice contribution that largely resolves the question of privacy for the Lasso. The main results for the upper bound are straightforward applications of existing bounds, but the idea of the algorithm is very clean and novel.
Submitted by Assigned_Reviewer_2
Q1: Comments to author(s). First provide a summary of the paper, and then address the following criteria: Quality, clarity, originality and significance. (For detailed reviewing guidelines, see http://nips.cc/PaperInformation/ReviewerInstructions)
This paper introduces differentially private Frank-Wolfe algorithms, which allow one to perform LASSO under differential privacy constraints. The proposed method is designed by taking into account the fact that the constraint set induced by LASSO forms a polytope. As a result, the excess empirical risk of the proposed learning algorithm is bounded above by nearly O(1/n^{2/3}) under an L1-Lipschitz continuity assumption on the loss functions. This bound is better than the O(polylog(n)/n^(1/3)) bound of [31]. The bound on the excess empirical risk is worse than the O(polylog(n)/n) bound of [32, 37]; however, considering that it is derived without strong assumptions (e.g., restricted strong convexity and mutual incoherence, as required in [32, 37]), this can be a new contribution. Furthermore, the authors show that the bound on the excess empirical risk is near optimal under epsilon <= 0.1 and delta = o(1/n^2). The authors also derive a differentially private Frank-Wolfe algorithm that can work with any convex constraint region.
The derivation of the near-optimality of the proposed algorithm is an original contribution, while the derivations of the differential privacy and excess empirical risk guarantees seem to be applications of existing results.
In terms of privacy preservation, the assumptions on the parameters, epsilon <= 0.1 and delta = o(1/n^2), are adequate. On the other hand, in terms of utility, a derivation of the constant term of the excess risk is needed. The noise added by the mechanism can occupy a large part of the domain of the prediction y in realistic situations if the constant term of the bound on the excess empirical risk is not small.
Say, suppose the constant term of the bound on the excess empirical risk is 1 (i.e., the excess empirical risk <= log(np/delta)/(n epsilon)^(2/3)). Suppose the privacy parameters are set to epsilon = 0.1, delta = 1/n^2, and p = 2. Then, by Corollary 2.7, the bound on the excess empirical risk is about 0.5 at n = 4000 (and larger for smaller n). Since the domain of the prediction y is [-1, 1], the error can cover half of the domain with non-negligible probability. This improves with a larger number of samples, but n = 4000 is not a small sample size in practice. Consequently, the assumptions on the privacy parameters, epsilon = 0.1 and delta = 1/n^2, might not be practical in some situations.
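To make this arithmetic explicit, here is a small plug-in evaluation of the bound C * log(np/delta) / (n epsilon)^(2/3) with the assumed constant C = 1; the use of the natural logarithm is also an assumption, since the base is not specified.

```python
import math

def excess_risk_bound(n, p, eps, delta, C=1.0):
    # Plug-in evaluation of C * log(n*p/delta) / (n*eps)^(2/3)
    return C * math.log(n * p / delta) / (n * eps) ** (2.0 / 3.0)

# Setting above: C = 1, eps = 0.1, delta = 1/n^2, p = 2
for n in (1000, 4000, 10000, 100000):
    print(n, round(excess_risk_bound(n, p=2, eps=0.1, delta=1.0 / n ** 2), 3))
# Approximately: 0.99 at n=1000, 0.47 at n=4000, 0.28 at n=10000, 0.08 at n=100000
```

Under these assumptions the bound is indeed close to 0.5 around n = 4000 and only shrinks quickly for larger sample sizes.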
Q2: Please summarize your review in 1-2 sentences
This paper introduces differentially private Frank-Wolfe algorithms, which allow one to perform LASSO under differential privacy constraints. The study contains some novel theoretical contributions.
Q1: Author rebuttal: Please respond to any concerns raised in the reviews. There are no constraints on how you want to argue your case, except for the fact that your text should be limited to a maximum of 5000 characters. Note, however, that reviewers and area chairs are busy and may not read long vague rebuttals. It is in your own interest to be concise and to the point.
We would first like to thank all the reviewers for their careful reviews. Below are our responses to some specific comments from the reviewers.
Assigned_reviewer_1: Theorem 1.2, 3.1, 3.2 are not clear on the definition of (eps,delta). Theorem 1.2 uses eps=0.1 and delta=1/n. Theorem 3.1 uses eps=0.1 and delta=o(1/n^2) which is a narrower class of algorithms. Theorem 3.2 seems to give a result that is independent of (eps,delta) since they are not specified but it must be the case that there are assumptions on (eps,delta).
Response: Thanks for pointing out the inconsistency between Theorem 1.2 and Theorem 3.1. It should have been \delta = o(1/n^2) in both cases, and \epsilon <= 0.1. Our lower bounds hide the exact dependence on \delta, assuming it is o(1/n^2) and \epsilon <= 0.1. We will correct these in the final version. Note that proving a lower bound with \delta = 1/n^2 and \epsilon = 0.1 suffices for the regime \delta = o(1/n^2) and \epsilon <= 0.1, as the privacy guarantees are only stronger in the latter.
Assigned_reviewer_3: In terms of privacy preservation, assumptions on the parameter, epsilon<=0.1 and delta=o(1/n^2), are adequate. On the other hand, in terms of utility, derivation of the constant term of the excess risk is needed. The noise added by the mechanism can occupy the domain of the prediction y in realistic situations if the constant term of the bound on the excess empirical risk is not small. Say, suppose the constant term of the bound of the excess empirical risk is 1 (i.e., the excess empirical risk <= log(np/delta)/(n epsilon)^(2/3)). Suppose privacy parameters are set to epsilon=0.1, delta=1/n^2, and p=2. Then, we have excess empirical risk <= 0.5 if n<=4000 by Corollary 2.7. Since the domain of the prediction y is [-1,1], the half of the domain can be covered with not a small probability. This can be improved with a larger size of samples, but n=4000 is not a small sample size in practice. Consequently, the assumptions on privacy parameters, epsilon=0.1 and delta=1/n^2, might not be practical in some situations.
Response: We will work out the leading constant in the next version. (A rough estimate gives an upper bound of 5.) We believe that our results are most applicable in the regime where $p$ is large. For example, if the constant term is 5 and p = 10^5 (which is reasonable in modern applications), then n = 10^5 suffices to get excess empirical risk <= 0.5 for eps = 0.1 and delta = 1/n^2. (Table 1 of https://web.stanford.edu/~boyd/papers/pdf/l1_logistic_reg_aaai.pdf gives an example of such a choice of parameters at a similar scale.) We also want to mention that our result is an upper bound based on a worst-case analysis.
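For concreteness, the same plug-in arithmetic under the parameters mentioned above (constant 5, p = 10^5, n = 10^5, eps = 0.1, delta = 1/n^2); the natural logarithm is an assumption, since the base is not specified:

```python
import math

# Parameters from the response above: C = 5, p = 1e5, n = 1e5, eps = 0.1,
# delta = 1/n^2 (natural log assumed)
C, p, n, eps = 5.0, 1e5, 1e5, 0.1
delta = 1.0 / n ** 2
bound = C * math.log(n * p / delta) / (n * eps) ** (2.0 / 3.0)
print(round(bound, 3))  # roughly 0.5
```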
Assigned_reviewer_8: Moreover, it seems that the lower bound is only valid for large n, which is not the standard setting for the Lasso. Could the authors discuss that point in more detail?
Response: Our lower bounds hold as long as p = Omega(n^{2/3}). Our algorithm achieves the upper bound of O(log p / n^{2/3}). So the gap is O(\log p), which is nearly tight for the usual case of p = n^{O(1)}. Hence the L1 constraint used in LASSO allows us to achieve a DP algorithm whose excess risk has only a weak dependence on p, a phenomenon that also appears in other quality analyses of LASSO (this is also why LASSO scales well with p). We will add a discussion of this point in the next version, and also add a conclusion section.