Part of Advances in Neural Information Processing Systems 33 (NeurIPS 2020)
Raef Bassily, Shay Moran, Anupama Nandi
We initiate the study of a new model of supervised learning under privacy constraints. Imagine a medical study where a dataset is sampled from a population of both healthy and unhealthy individuals. Suppose healthy individuals have no privacy concerns (in which case we call their data ``public''), while unhealthy individuals desire stringent privacy protection for their data. In this example, the population (data distribution) is a mixture of private (unhealthy) and public (healthy) sub-populations that could be very different. Inspired by this example, we consider a model in which the population $\mathcal{D}$ is a mixture of two possibly distinct sub-populations: a private sub-population $\mathcal{D}_{\mathsf{priv}}$ of private and sensitive data, and a public sub-population $\mathcal{D}_{\mathsf{pub}}$ of data with no privacy concerns. Each example drawn from $\mathcal{D}$ is assumed to contain a privacy-status bit that indicates whether the example is private or public. The goal is to design a learning algorithm that satisfies differential privacy only with respect to the private examples. Prior work in this context assumed a homogeneous population, where private and public data arise from the same distribution, and designed solutions that exploit this assumption. We demonstrate how to circumvent this assumption by considering, as a case study, the problem of learning linear classifiers in $\mathbb{R}^d$. We show that when the privacy status is correlated with the target label (as in the example above), linear classifiers in $\mathbb{R}^d$ can be learned, in both the agnostic and the realizable settings, with sample complexity comparable to that of classical (non-private) PAC learning. By contrast, this task is known to be impossible if all the data is considered private.
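The data model described above can be illustrated with a small sampling sketch. This is only a toy illustration and not the paper's algorithm: the function names, the mixture weight `p_private`, the Gaussian sub-populations, and the choice to make the label coincide exactly with the privacy status are all assumptions introduced here for concreteness.

```python
import random


def sample_example(rng, d=2, p_private=0.5):
    """Draw one (x, y, is_private) triple from a toy mixture population.

    Hypothetical setup: the privacy-status bit picks the sub-population,
    and the two sub-populations have different feature distributions.
    """
    is_private = rng.random() < p_private
    # Assumption: private points are centred at +1, public points at -1.
    center = 1.0 if is_private else -1.0
    x = [rng.gauss(center, 1.0) for _ in range(d)]
    # Assumption: the label is perfectly correlated with the privacy status
    # (e.g. "unhealthy" = private = label 1, "healthy" = public = label 0).
    y = 1 if is_private else 0
    return x, y, is_private


def split_by_status(sample):
    """Separate a sample into its private and public parts using the status bit."""
    private = [(x, y) for x, y, s in sample if s]
    public = [(x, y) for x, y, s in sample if not s]
    return private, public


rng = random.Random(0)
sample = [sample_example(rng) for _ in range(1000)]
private, public = split_by_status(sample)
```

In this model, a learner would need to satisfy differential privacy only with respect to the examples in `private`, while the examples in `public` may be used without restriction.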