{"title": "Differentially Private Distributed Data Summarization under Covariate Shift", "book": "Advances in Neural Information Processing Systems", "page_first": 14459, "page_last": 14469, "abstract": "We envision Artificial Intelligence marketplaces to be platforms where consumers, with very less data for a target task, can obtain a relevant model by accessing many private data sources with vast number of data samples.  One of the key challenges is to construct a training dataset that matches a target task without compromising on privacy of the data sources. To this end, we consider the following distributed data summarizataion problem. Given K private source datasets denoted by $[D_i]_{i\\in [K]}$ and a small target validation set $D_v$, which may involve a considerable covariate shift with respect to the sources, compute a summary dataset $D_s\\subseteq \\bigcup_{i\\in [K]} D_i$ such that its statistical distance from the validation dataset $D_v$ is minimized. We use the popular Maximum Mean Discrepancy as the measure of statistical distance. The non-private problem has received considerable attention in prior art, for example in prototype selection (Kim et al., NIPS 2016). Our work is the first to obtain strong differential privacy guarantees while ensuring the quality guarantees of the non-private version. We study this problem in a Parsimonious Curator Privacy Model, where a trusted curator coordinates the summarization process while minimizing the amount of private information accessed. Our central result is a novel protocol that (a) ensures the curator does not access more than $O(K^{\\frac{1}{3}}|D_s| + |D_v|)$ points (b) has formal privacy guarantees on the leakage of information between the data owners and (c) closely matches the  best known non-private greedy algorithm. Our protocol uses two hash functions, one inspired by the Rahimi-Recht random features method and the second leverages state of the art differential privacy mechanisms. 
We introduce a novel ``noiseless'' differentially private auctioning protocol, which may be of independent interest. Apart from theoretical guarantees, we demonstrate the efficacy of our protocol using real-world datasets.", "full_text": "Differentially Private Distributed Data Summarization under Covariate Shift*

Kanthi K. Sarpatwar¹, IBM Research, sarpatwa@us.ibm.com
Karthikeyan Shanmugam¹, IBM Research AI, karthikeyan.shanmugam2@ibm.com
Venkata Sitaramagiridharganesh Ganapavarapu, IBM Research, giridhar.ganapavarapu@ibm.com
Ashish Jagmohan, IBM Research, ashishja@us.ibm.com
Roman Vaculin, IBM Research, vaculin@us.ibm.com

Abstract

We envision Artificial Intelligence marketplaces to be platforms where consumers, with very little data for a target task, can obtain a relevant model by accessing many private data sources with vast numbers of data samples. One of the key challenges is to construct a training dataset that matches a target task without compromising the privacy of the data sources. To this end, we consider the following distributed data summarization problem. Given K private source datasets denoted by [Dᵢ]_{i∈[K]} and a small target validation set Dv, which may involve a considerable covariate shift with respect to the sources, compute a summary dataset Ds ⊆ ∪_{i∈[K]} Dᵢ such that its statistical distance from the validation dataset Dv is minimized. We use the popular Maximum Mean Discrepancy as the measure of statistical distance. The non-private problem has received considerable attention in prior art, for example in prototype selection (Kim et al., NIPS 2016). Our work is the first to obtain strong differential privacy guarantees while ensuring the quality guarantees of the non-private version. We study this problem in a Parsimonious Curator Privacy Model, where a trusted curator coordinates the summarization process while minimizing the amount of private information accessed. Our central result is a novel protocol that (a) ensures the curator accesses at most O(K^{1/3}|Ds| + |Dv|) points, (b) has formal privacy guarantees on the leakage of information between the data owners, and (c) closely matches the best known non-private greedy algorithm. Our protocol uses two hash functions, one inspired by the Rahimi-Recht random features method; the second leverages state-of-the-art differential privacy mechanisms. Further, we introduce a novel "noiseless" differentially private auctioning protocol for winner notification, which may be of independent interest. Apart from theoretical guarantees, we demonstrate the efficacy of our protocol using real-world datasets.

*1 Equal contribution by these authors.

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

1 Introduction

Integrating new types of data to drive analytics-based decision-making can contribute significant economic impact across a broad spectrum of industries including healthcare, banking, insurance, travel, and urban planning. This has led to the emergence of complex data ecosystems consisting of heterogeneous (and overlapping) data generators, aggregators, and analytics providers. In general, participants in these ecosystems are looking to monetize a class of assets that we term AI assets; such assets include raw and aggregated data, as well as models trained on such data.
A recent McKinsey global survey found, for example, that more than half of the respondents in sectors including basic materials and energy, financial services, and high tech stated that their companies had begun to monetize their data assets [Gottlieb & Khaled (2017)].

In light of the above, in this work we consider the basic setting of an AI Marketplace: a consumer arrives with a small dataset, referred to as a "validation" dataset, and wants to build a prediction model that performs well on this dataset. However, the model training process requires a huge amount of data, which must be acquired from multiple private sources. Fundamentally, the AI Marketplace must address a transfer learning problem, where the distribution of data differs considerably from source to source and even from the validation dataset. The Marketplace must facilitate transactions of data points from multiple sources towards the consumer's task by forming a training dataset that is close in some distance measure to the validation dataset. In the process, it must preserve data ownership and privacy as much as possible.

Consider the following scenario in the health care domain as an example. Suppose the consumer is a newly established cancer hospital and the data sources are cancer institutions from different geographical locations across the globe. The goal of the new hospital is to construct ML models that can, say, predict the early onset of some form of cancer. The quality of the model depends on the demography of its patients, and therefore it is crucial to collect data that matches a small validation set representative of that demography. The individual sources clearly have widely different demographic data. The goal of an AI Marketplace is to enable private collection of a dataset, sampled from these sources, that matches the demography of the new institute. From the privacy perspective, there are two desirable properties: (a) the multiple data owners are typically "competitors," and therefore individual data must be protected (e.g., in a differentially private manner) from each other; (b) the platform (we use the term curator) must be "parsimonious" in handling data, i.e., it should access information on a "need to know" basis.

Motivated by the above, we consider the following problem. We consider K data owners with private datasets D₁, . . . , D_K, and a data consumer who wishes to build a model for a specific task. The specific task is embodied by the consumer possessing a validation dataset Dv. The data consumer would like to procure a subset of data from each private dataset that is well-matched to its task. A parsimonious trusted curator does the collection of points; we call this the parsimonious curator model. Although the curator is trusted, we wish to minimize the number of points it accesses to construct the final summary. In turn, the K data owners would like to ensure that their data is private with respect to the other data owners. The curator exchanges messages and data points with the data owners, and we seek to make the curator's exchanges with the data owners differentially private.

Our Contributions: We propose a novel protocol, based on an iterative hash-exchange mechanism, that enables the curator to construct a summary dataset Ds from the K owner datasets. A central result of the paper shows that the proposed protocol simultaneously satisfies the following desired properties: (i) the constructed dataset Ds is well-matched to the validation dataset Dv, in terms of having a small Maximum Mean Discrepancy (MMD) (Gretton et al.
(2008)); (ii) the protocol's exchanges with any data owner i are (ε, δ)-differentially private with respect to the other owner datasets ∪_{j≠i} Dⱼ; and (iii) the parsimonious curator accesses at most O(K^{1/3}|Ds| + |Dv|) data points. Qualitatively, we expect the protocol to produce data summaries that are useful for model building while maintaining differential privacy. We show through empirical evaluation that this is indeed the case; by examining generalization error on two example tasks, we show that the protocol pays only a small price for its differential privacy guarantees.

Prior Work: Privacy-Preserving Learning Algorithms: There is a long line of work on private empirical risk minimization that seeks to optimize the trade-off between the accuracy of a trained classifier and differential privacy guarantees with respect to the training set [Kasiviswanathan et al. (2011); Chaudhuri et al. (2011); Song et al. (2013); Kifer et al. (2012); Bassily et al. (2014); Shokri & Shmatikov (2015); Hamm et al. (2016); Wu et al. (2016); Abadi et al. (2016); Pathak et al. (2010); Thakurta (2013); Rubinstein et al. (2012); Talwar et al. (2015); Dwork et al. (2014a)]. One of the most notable ideas in this line of work is adding noise to stochastic gradient iterations to preserve privacy [Song et al. (2013); Abadi et al. (2016); Shokri & Shmatikov (2015)]. We do not consider the problem of learning a classifier directly. Our goal is to summarize diverse data sources in a distributed, private manner to match a given validation set in a transfer learning setting. All of the above privacy-preserving learning algorithms can be applied after our summarization step. In Rubinstein et al. (2012); Chaudhuri et al. (2011), the authors use noisy Rahimi-Recht Fourier features to release a representation of the support vectors for differentially private SVM classifier release. Our purpose in using Rahimi-Recht Fourier features is different: we use them to expose the partial MMD objective at every round, in conjunction with novel private auctioning mechanisms.

Privately Aggregating Teacher Ensembles: Several works have considered the following setting: an ensemble of teacher classifiers, each trained on a private data source, noisily predicts labels on an unlabelled public dataset that is further used to train a student model [Papernot et al. (2016)]. Again, this differs from our transfer learning setting, where the various distributions are first matched to a target task, handling covariate shift.

Another related line of work is differentially private submodular optimization [Mitrovic et al. (2017)]. While they consider a single private source, we handle multiple private data sources in optimizing a specific statistical distance (MMD). Our techniques leverage state-of-the-art privacy-preserving mechanisms found in Dwork et al. (2014b); Hardt et al. (2012); Hardt & Rothblum (2010).

Domain Adaptation Methods: For the transfer learning problem, the existing domain adaptation methods [Ganin & Lempitsky (2014); Tzeng et al. (2014)] ensure the following: they learn a representation φ(x) such that the φ(·) of the source and the target are close in distance (the MMD metric has been used to regularize the distance penalty) and such that classifying based on φ(·) on the source has very high accuracy. However, most existing approaches use differentiable models like deep learning to learn φ(·). In our method, we first match the distributions in the ambient space by sub-selecting points and then train any suitable classifier. One advantage is that we can train any classifier after the moment matching step (XGBoost, Decision Trees, SVMs, etc.).
If one wants to make an existing domain adaptation algorithm private with respect to any pair of participants, one has to add noise to the gradients computed at every step. The state of the art in differential privacy for deep learning [Abadi et al. (2016)] (in the non-transfer-learning setting) adds Gaussian noise whose variance is linear in the number of iterations per step, which significantly degrades performance. In our method, we gain on this aspect, as we add noise per point acquisition.

Federated Learning: We also note the distinction between our transfer learning setting and that of federated learning [McMahan et al. (2016)]. The validation set distribution is distinct from each of the individual data source distributions, and there are significant covariate shifts between them. Federated learning would assume a training distribution obtained by sampling from the different data sources uniformly at random or with a specific mixture distribution. In fact, in our experiments we compare against training done on uniform samples, which is a proxy for federated learning.

2 Problem Setting

The setting has K data owners with private datasets denoted by D₁, D₂, . . . , D_K. Here, Dᵢ ∈ R^{mᵢ×n}, where mᵢ denotes the number of points and n denotes their dimension. Further, there exists a "consumer" entity that wants to form a summary dataset Ds ⊆ ∪ᵢ Dᵢ with |Ds| = p (which can be used for downstream training goals). The quality of the summary set is measured by its closeness to a target validation dataset Dv ∈ R^{m×n}, which is private to the consumer.
We measure the closeness of Ds to Dv using the MMD (Maximum Mean Discrepancy) statistical distance, defined below.

Definition: The sample MMD distance for finite datasets D ∈ R^{m₁×n} and D′ ∈ R^{m₂×n} is given by:

MMD²(D, D′) = (1/m₁²) Σ_{x,x′∈D} k(x, x′) − (2/(m₁m₂)) Σ_{x∈D, y∈D′} k(x, y) + (1/m₂²) Σ_{y,y′∈D′} k(y, y′),   (1)

where k(·, ·) is a kernel function underlying an RKHS (Reproducing Kernel Hilbert Space) function space such that k(x, y) = k(y, x) and k(·, ·) is positive definite.

Differential Privacy: We adopt the following definition of differential privacy [Dwork et al. (2006)] in our work. At a high level, it means that two datasets that differ in at most one point should not cause a differentially private algorithm to produce outputs that are very different statistically. Formally,

Definition: The output of a randomized algorithm A(D) is (ε, δ) differentially private with respect to the input dataset D if, for any two neighboring datasets D, D′ that differ in one data point,

P(A(D) ∈ E) ≤ e^ε P(A(D′) ∈ E) + δ,   (2)

for all events E that can be defined on the output space.

Parsimonious Curator Privacy Model: We assume that there exists a trusted curator, called the aggregator, that collects the summary data points Ds. The participants holding data Dᵢ wish to preserve the privacy of their individual data points. The model satisfies the following constraints: (a) During the protocol run, the curator must not have access to more than ρ(|Ds| + |Dv|) points. We refer to such a protocol as a ρ-parsimonious protocol. The aggregator needs to collect points that closely match Dv in MMD distance; therefore, the aggregator sees at least |Ds| points in this framework. This forms a natural |Ds| + |Dv| lower bound on how many points the aggregator has to access.
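The sample MMD in Equation (1) can be computed directly from a kernel. The following is a minimal sketch (the RBF kernel and the parameter gamma here are illustrative choices, and the brute-force double loop is for clarity, not the paper's implementation):

```python
import numpy as np

def rbf_kernel(x, y, gamma=0.1):
    # k(x, y) = exp(-gamma * ||x - y||^2): symmetric and positive definite
    return np.exp(-gamma * np.sum((np.asarray(x) - np.asarray(y)) ** 2))

def mmd_squared(D, Dprime, gamma=0.1):
    """Sample MMD^2 per Equation (1): within-D average kernel
    plus within-D' average kernel minus twice the cross average kernel."""
    m1, m2 = len(D), len(Dprime)
    within_d = sum(rbf_kernel(x, xp, gamma) for x in D for xp in D) / m1**2
    within_dp = sum(rbf_kernel(y, yp, gamma) for y in Dprime for yp in Dprime) / m2**2
    cross = sum(rbf_kernel(x, y, gamma) for x in D for y in Dprime) / (m1 * m2)
    return within_d + within_dp - 2 * cross
```

As expected of a statistical distance, MMD²(D, D) = 0, and the value grows as the two samples separate.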
Therefore, we define a ρ-parsimonious aggregator as one who sees ρ times the minimum required. (b) Communication to a non-trusted participant i is differentially private with respect to all other datasets, i.e., ∪_{j≠i} Dⱼ ∪ Dv. This setting can be viewed as an intermediate regime between the "centralized setting" and the "localized setting" [Nissim & Stemmer (2017)] considered in prior works. In other words, the source Dᵢ, knowing all but one point in the union of the other datasets as side information, must not learn much (in the differential privacy sense) about the missing point given all the communication to it during the protocol (the standard informed adversary model with respect to the union of the other datasets ∪_{j≠i} Dⱼ). Preservation of differential privacy across data sources constrains the aggregator to collect more points than necessary (i.e., more than |Ds| + |Dv|).

Main Problem: Is there an (ε, δ) differentially private protocol in the parsimonious curator model that outputs a subset Ds ⊆ ∪ᵢ Dᵢ with |Ds| = p that (approximately) minimizes E[MMD²(Ds, Dv)]?

Incentives: The aggregator needs to train a downstream task on a test distribution that is similar to Dv. To this end, |Ds| points (a set much larger in size than Dv) are collected for training. In fact, one could think of the aggregator paying for the points. Our protocol is approximately the best way to obtain such points. There is no incentive for the aggregator to cheat, since it has to pay for the collected points. The data providers are happy to provide a set as long as they are compensated and the other data sources do not learn about their data (in a differential privacy sense). Every data source would be able to monetize its contribution in proportion to the value it provides to the summary. After the protocol ends, the value of a data source's contribution could be deemed proportional to the sum of winning marginal bids from the source.
Value attribution based on this would be an incentive for data holders to participate. We address the problem of value attribution to data sources in a companion paper (Sarpatwar et al. (2019)); here, we focus only on the privacy and parsimoniousness constraints.

Our Approach: We briefly summarize the greedy approach to solving the moment matching problem without privacy constraints. Our fundamental contribution is to make it differentially private in the parsimonious curator model.

Greedy Algorithm Without Privacy: Our objective is to form a summary Ds of size p by collecting points from all the data owners. We maximize the following normalized MMD objective [Kim et al. (2016)]. For a fixed validation set Dv with |Dv| = m and summary set Ds, the objective J(Ds) is:

J(Ds) = (1/m²) Σ_{i,j∈Dv} k(yᵢ, yⱼ) − MMD²(Dv, Ds) = (2/(m|Ds|)) Σ_{i∈Dv, j∈Ds} k(yᵢ, xⱼ) − (1/|Ds|²) Σ_{i,j∈Ds} k(xᵢ, xⱼ).   (3)

Note that our objective here is different from the one used in Kim et al. (2016), in that we do not have the property S ⊆ V; submodularity of this function does not follow from their work directly. In Section E of the appendix, we show that the function is submodular under some condition on the kernel function. This condition is satisfied if the distance between any two points is Ω(√(log N)) and the RBF kernel k(x, y) = exp(−γ‖x − y‖₂²) is used with some constant γ > 0.

Theorem 1. Let N be the total number of points in the system. Given a diagonally dominant kernel matrix K ∈ R^{N×N} satisfying kᵢ,ᵢ = k* for any i ∈ [N] and kᵢ,ⱼ ≤ k*/(N³ + 3N² + N) for any i ≠ j, J(S) is a non-negative, monotone and submodular function.

It has been proven that the following iterative greedy approach yields a constant factor approximation guarantee, given that the objective is a non-negative monotone submodular function [Nemhauser et al. (1978)]. Iteratively, until the required summary size is achieved: (a) each participant computes its marginally best point y, i.e., the one that maximizes J(Ds + y) − J(Ds); and (b) the curator collects the marginally best points from the various participants and adds the best among them to the summary.

Our Private Algorithm: The focus of this paper is to adapt this greedy approach with privacy guarantees in the parsimonious curator model. In our private protocol, the curator collects the data points in Ds in a greedy fashion as above. However, there is a key challenge on the privacy front.

Challenges: During the implementation of the greedy algorithm, the curator maintains a set of points Ds = {x₁, . . . , x_k}. To calculate the marginal gain with respect to Equation (3), we observe that the curator needs to expose a function of the form Σᵢ αᵢ k(xᵢ, ·) to every participant, for some constants αᵢ (this will become clear later). However, sharing the points in raw form would violate the privacy constraints at the participants. Further, over the course of multiple releases, no participant should be able to acquire any information about previous data points of other participants. Therefore, the key issue is that the releases of the curator must be differentially private while enabling the computation of the (non-linear) marginal gain over all the iterations of the protocol. Beyond enabling the computation of "best" points, privacy concerns also arise in the actual collection of data points. Indeed, even a private declaration of "winners" to data providers would leak information about the quality of other data providers.

Our Solution: To solve these issues, we use two hash functions:

(a) h1(·), based on the random Fourier features method of Rahimi-Recht, to hash every data point at the curator. This hash function is common to all the entities (i.e., the curator and the participants) and satisfies the property that h1(x)ᵀh1(y) ≈ k(x, y) w.h.p., which is useful for converting the non-linear kernel computation (Equation (3)) into a linear one. This enables approximate kernel computation by an entity external to the curator: any entity can compute the marginal gain of a new point y via Σᵢ αᵢ k(xᵢ, y) ≈ Σᵢ αᵢ h1(xᵢ)ᵀh1(y). Thus, the curator needs to share only Σᵢ αᵢ h1(xᵢ).

(b) A second hash function h2(·), whose randomness is private to the curator, such that h2(Σᵢ αᵢ h1(xᵢ))ᵀ h1(y) ≈ Σᵢ αᵢ k(xᵢ, y) and h2 is differentially private with respect to the h1(xᵢ). A specific participant could observe multiple releases of Σᵢ αᵢ h1(xᵢ) and potentially find out the last point that was added; therefore, the releases of the sum vector Σᵢ αᵢ h1(xᵢ) need to be protected. Our h2(·) is a novel adaptation of the well-known MWEM method [Hardt et al. (2012)]. The key technical challenge is to match the performance of the greedy algorithm while ensuring the privacy properties of h2(·) in order to protect data releases from the curator. Further, to address the privacy concerns in parsimonious data collection, we obtain a novel private auction mechanism that is O(K^{1/3})-parsimonious and (ε, δ)-differentially private, with no further loss in optimality. Aside from theory, we provide insights that make our protocol well-suited for practice, and we demonstrate its efficacy on real-world datasets.

3 The Protocol

Our protocol uses two different hash functions that we refer to as h1(·) and h2(·). The hash function h1(·) is shared between the various data owners and the aggregator. The hash function h2(·) is used by the aggregator to hash the current summary dataset before it is broadcast to the various participating entities (owners).
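As a concrete illustration of the linearization in (a), the sketch below builds an h1 from random Fourier features and uses it to approximate a kernel sum Σᵢ αᵢ k(xᵢ, y) from the single shared vector g = Σᵢ αᵢ h1(xᵢ). The parameter choices are hypothetical; the paper's Algorithm 1 is the authoritative construction:

```python
import numpy as np

def make_h1(n, d, gamma, rng):
    """Random Fourier feature map (Rahimi-Recht): h1(x)^T h1(y) ~= k(x, y)
    for the RBF kernel k(x, y) = exp(-gamma * ||x - y||^2)."""
    omega = rng.normal(0.0, np.sqrt(2 * gamma), size=(d, n))  # p(w) = N(0, 2*gamma*I_n)
    b = rng.uniform(0.0, 2 * np.pi, size=d)                   # b_i ~ Unif[0, 2*pi]
    return lambda x: np.sqrt(2.0 / d) * np.cos(omega @ x + b)

def marginal_gain_score(alphas, xs, y, h1):
    """Linearized kernel sum: sum_i alpha_i * k(x_i, y) ~= g^T h1(y),
    where g = sum_i alpha_i * h1(x_i) is the only vector that must be shared."""
    g = sum(a * h1(x) for a, x in zip(alphas, xs))
    return g @ h1(y)
```

The approximation error decays as O(1/sqrt(d)), so a participant holding only g and the shared h1 can score candidate points without ever seeing the raw xᵢ.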
We now describe both hash functions, h1(·) and h2(·).

The Hash Function h1(·): Our first hash function, which is shared and used by the various data owners and the aggregator, is based on a well-known distance-preserving hash function formulated by Rahimi & Recht (2008). Formally, the hash function is defined in Algorithm 1. The main purpose of this hash function is to ensure that h1(x)ᵀh1(y) ≈ k(‖x − y‖). We assume an RBF kernel function throughout the paper, given by k(δ) = exp(−γδ²). In Algorithm 1, p(ω) is the distribution defined by the Fourier transform of the kernel k(δ), i.e., p(ω) = (1/2π) ∫ e^{−jωᵀδ} k(δ) dδ. Due to the RBF kernel, p(ω) = N(0, 2γIₙ). The randomness in the hash function is due to the d random points drawn from this distribution as in Algorithm 1.

1: Input: Point x ∈ Rⁿ, parameter γ, dimension parameter d
2: Output: h1(x)
3: Draw {ωᵢ}ᵢ₌₁ᵈ i.i.d. from the distribution p(ω) = N(0, 2γIₙ), only once at the beginning of the protocol, and reuse them over subsequent calls to h1(·).
4: Draw samples {bᵢ}ᵢ∈[d] i.i.d. uniformly from [0, 2π], only once at the beginning of the protocol.
5: return h1(x) = √(2/d) [cos(ω₁ᵀx + b₁), cos(ω₂ᵀx + b₂), . . . , cos(ω_dᵀx + b_d)]ᵀ
Algorithm 1: Computing the hash function h1(·).

The Hash Function h2(·): Consider a dataset D ∈ R^{q×d} consisting of vectors {v₁, v₂, . . . , v_q} such that vᵢ ∈ R^{1×d} and −√(2/d) ≤ v_{ij} ≤ √(2/d) for 1 ≤ i ≤ q, 1 ≤ j ≤ d. The hash function h2(D) approximately computes the vector sum w(D) = Σᵢ vᵢ in a differentially private manner. Let w(D, j) = Σᵢ v_{ij}. We provide the description of h2(·) in Algorithm 2. The algorithm has two components: (a) it first quantizes the q vectors in D to obtain D_Q, such that the quantized coordinate values come from a grid S of points S = {−1, −1+η, −1+2η, . . . , 1−η, 1} for a parameter η (refer to Line 10 in Algorithm 2); (b) then a random distribution P_avg over the space of all possible quantized vectors S^{1×d} is found such that the expected vector under this distribution is close to the sum of the quantized vectors in D_Q. Further, the releases are also differentially private. This second part relies on the MWEM mechanism of Hardt et al. (2012).

Full Algorithmic Description of h2(·): Let ṽ₁, ṽ₂, . . . , ṽ_q ∈ S^{1×d} be the quantized vectors in D_Q and w(D_Q, i) = Σⱼ₌₁^q ṽ_{ji}. We now define probability mass functions P_t(s ∈ S^{1×d}) for every time t over the finite set S^{1×d}, whose cardinality is |S|ᵈ. P_t depends only on P_{t−1}, and the distribution is defined iteratively over t ≤ T iterations. Define w(P, i) = q(Σ_{s∈S} s Pᵢ(s)) with respect to a probability mass function P on S^{1×d}, where Pᵢ(s) is the marginal pmf on the i-th coordinate. The way P_t is computed is given in Algorithm 2 (Steps 5-6).

1: Input: Dataset D, parameters ε, η and T
2: Output: h2(D, T, ε)
3: Obtain D_Q ← QUANTIZATION(D, η). Let P₀ be the uniform distribution over the set S^{1×d}, where S = {−1, −1+η, −1+2η, . . . , 1−η, 1}.
4: for all t ∈ [T] do
5: Sample a coordinate i ∈ [d] with probability proportional to exp(ε Δᵢ(D_Q)), where the score function is Δᵢ(D_Q) = |w(P_{t−1}, i) − w(D_Q, i)|. Let the sampled coordinate be i(t).
6: Let μ_{i(t)} ← w(D_Q, i(t)) + Lap(1/ε). Compute the distribution satisfying P_t(s) ∝ P_{t−1}(s) exp[s_{i(t)}(μ_{i(t)} − w(P_{t−1}, i(t)))/2q].
7: end for
8: P_avg = (1/T) Σ_{t∈[T]} P_t.
9: return (1/q)[w(P_avg, 1) . . . w(P_avg, d)] = h2(D, ε) = h2(D_Q, ε)
10: procedure QUANTIZATION(D, η)
11: Define Q(x) = −1 + (k+1)η w.p. (x + 1 − kη)/η and Q(x) = −1 + kη w.p. ((k+1)η − 1 − x)/η, where k = ⌊(x+1)/η⌋. Let Q(v = (v₁, v₂, . . . , v_d)) = (Q(v₁), Q(v₂), . . . , Q(v_d)).
12: return D_Q = {√(2/d) · Q(√(d/2) · vᵢ)}ᵢ₌₁^q
13: end procedure
Algorithm 2: Computing the hash function h2(·).

Description of the Protocol: We now describe our protocol in Algorithm 3 and the protocol parameters ε_v, ε_{ℓ,T} used. The protocol ensures two properties at the data owner:

Approximate Marginal Gain Computation: The trusted aggregator at the beginning (Step 3) shares g̃ = h2(h1(Dv)). We show that ‖h2(h1(Dv)) − Σ_{x∈Dv} h1(x)‖₁ is very small. Therefore, g̃ᵀh1(y), when computed at a data owner with a new point y, approximates Σ_{x∈Dv} h1(x)ᵀh1(y). Similarly, over any other iteration ℓ (Step 5), the hashed vector g_ℓ is such that g_ℓᵀh1(y) ≈ Σ_{x∈Ds} h1(x)ᵀh1(y). Since h1 has the property that h1(x)ᵀh1(y) ≈ k(‖x − y‖), we can ensure that the maximization in Step 6 approximates the marginal gain computation J(Ds + y) − J(Ds).

Differential Privacy: We also show that, due to the application of h2, all the releases seen by any data owner i are differentially private with respect to the current summary, which also implies that they are differentially private with respect to ∪ⱼ Dⱼ \ Dᵢ. Another key ingredient of our proof is showing that our novel scheme makes the bid collection and winner notification process differentially private while preserving the parsimonious nature of the aggregator. Consider Step 7 in Algorithm 3. Upon making a decision on the winning bid, the aggregator needs to acquire the winning point from the winning data source. Consider the following two naive ways of doing this: (a) the aggregator notifies the winner alone about the decision and acquires the data point; (b) the aggregator acquires data points from all the data sources and, keeping only the winner's point, discards the rest. An important observation here is that the first alternative is not differentially private.
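The QUANTIZATION procedure admits a compact per-coordinate sketch: an unbiased randomized rounding onto the grid S (the √(d/2) rescaling is omitted for clarity, and η is assumed to divide 2 evenly so the grid ends exactly at 1):

```python
import math
import random

def quantize_coord(x, eta, rng=random):
    """Unbiased randomized rounding of x in [-1, 1] onto the grid
    S = {-1, -1 + eta, ..., 1 - eta, 1}: round up to -1 + (k+1)*eta with
    probability (x + 1 - k*eta)/eta, else down to -1 + k*eta, where
    k = floor((x + 1)/eta). Since E[Q(x)] = x, quantization adds no bias
    to the sums released by h2."""
    if x >= 1.0:
        return 1.0  # already at the top grid point
    k = math.floor((x + 1.0) / eta)
    lo = -1.0 + k * eta
    hi = lo + eta
    return hi if rng.random() < (x - lo) / eta else lo
```

The unbiasedness is immediate: E[Q(x)] = lo + (hi − lo)·(x − lo)/η = x, so only the Laplace noise and the MWEM updates, not the rounding, affect the accuracy of the released queries in expectation.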
Indeed, it leaks information about the data points of the participating data sources. The second way is differentially private: each data source learns nothing new about the other data sources. However, it is highly wasteful and contradicts the parsimonious nature of the aggregator; in forming a summary of size p, it collects Kp data points. Our novel private auction (Steps 11-16 of Algorithm 3) obtains the best of both scenarios, i.e., it is differentially private and accesses at most O(pK^{1/3}) data points in total.

1: Input: Dᵢ, i ∈ [K]; validation dataset Dv; seed set Dinit; params {ε_auc, ε_v, {ε_{ℓ,T}}ᵖ_{ℓ=1}, τ}.
2: Output: Summary Ds: Ds ⊆ ∪_{i∈[K]} Dᵢ such that |Ds| = p.
3: Aggregator initializes the summary Ds ← Dinit and broadcasts g̃ = h2(h1(Dv), ε_v).
4: for ℓ = 1 . . . p do
5: Aggregator broadcasts g_ℓ = h2(h1(Ds), ε_{ℓ,T}).
6: Each data owner i ∈ [K] computes its "bid": bᵢ = max_{x∈Dᵢ} g̃ᵀh1(x) − (ℓ/(ℓ+1)) g_ℓᵀh1(x).
7: Aggregator chooses the best point through a private auction: x_{i*} ← PRIVAUCTION(bᵢ : i ∈ [K]).
8: Aggregator verifies the data point against the bid value and updates Ds ← Ds ∪ {x_{i*}}.
9: end for
10: return Summary Ds \ Dinit.
11: procedure PRIVAUCTION(bᵢ : i ∈ [K])
12: Aggregator orders the data owners, as D′₁, D′₂, . . . , D′_K, by decreasing bid values.
13: Independently, with probability P[xᵢ] = e^{−ε_auc(i−1)}, the Aggregator asks for the point xᵢ.
14: If a certain data point x was chosen τ times by a data source (Step 6), the Aggregator asks for it.
15: Aggregator chooses a point x*, with maximum bid value b*, from the pool of all the points obtained so far and not yet included in the summary.
16: Data owners exclude all the points already sent to the Aggregator from future iterations.
17: end procedure
Algorithm 3: Description of the protocol.

In Algorithm 3, setting η ≤ ε_v/d, T = d², ε_v = ε/(16T), ε_{ℓ,T} = ε/(√(16Tℓ log(1/δ̃)) log⁵ p), d ≥ 16(log 2N)(log p)², |Dinit| ≥ 121 · 8d² log² d log(1/δ̃) log p, and |Dv| ≥ 44p²√d log d log² p / a², we obtain the following guarantees:

Theorem 2. Let a ∈ (0, 1) and δ̃ ∈ (0, 1/e) be any fixed constants.
(Differential Privacy) The releases of the aggregator to any data owner i are (ε, δ̃)-differentially private over all the iterations/epochs with respect to the datasets ∪_{j≠i} Dⱼ. Similarly, we have (ε, δ̃)-differential privacy over all the iterations w.r.t. the validation set Dv.
(Approximation Guarantee) Let OPT denote an optimal summary set and Ds the set of points obtained by Algorithm 3. We have J(Ds) ≥ (1 − 1/e) J(OPT) − Δ, where Δ ≤ O((log p √(ln d))/√d) + a + (log p)/ε < 1. Barring the Δ additive error, the guarantees are close to those of the non-private greedy algorithm.
(Parsimoniousness Guarantee) Algorithm 3 is O((log(1/δ̃)/ε) K^{1/3})-parsimonious, i.e., in computing a summary of size p, it needs to access at most O(pK^{1/3}) data points.

Differences between PRIVAUCTION and the Exponential Mechanism: There may be a superficial resemblance between Step 13 in the PRIVAUCTION procedure of Algorithm 3 and the exponential mechanism. In fact, our private auction is significantly different.
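The rank-based sampling in Step 13 can be sketched as follows (a simplified illustration only: the τ-repeat requests and bid verification of Steps 14-15 are omitted):

```python
import math
import random

def priv_auction_requests(bids, eps_auc, rng):
    """Sketch of Step 13 of PRIVAUCTION: sort owners by decreasing bid and,
    independently, ask the owner at (1-indexed) rank i for its point with
    probability exp(-eps_auc * (i - 1)). The top bidder is asked with
    probability 1, and the expected number of requests is at most
    1 / (1 - exp(-eps_auc)), a constant independent of the number of owners K."""
    order = sorted(range(len(bids)), key=lambda i: -bids[i])
    asked = []
    for rank, owner in enumerate(order):  # rank = i - 1
        if rng.random() < math.exp(-eps_auc * rank):
            asked.append(owner)
    return asked
```

Note that, unlike the exponential mechanism, the probabilities depend only on the rank of a bid, not on its value, and several owners (not just one) may be polled in a single round.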
First note that the probability of choosing the best bid is 1, which is not the case with the exponential mechanism. Secondly, while the exponential mechanism selects one approximately "best" point, we flip a coin for every bid, with a bias that decreases exponentially in the bid's position in the sorted order. We then choose multiple points (instead of one), and a key step in the proof is to show that the total number of points chosen can be bounded. Finally, the bias probabilities do not even depend on the bid value (i.e., the "score"), as they would for the exponential mechanism.

Extension to a Less Trusted Curator. In our parsimonious curator model, the final summary dataset needs to be revealed to the trusted aggregator in order to train diverse models downstream. In Section F of the Appendix, we show that Algorithm 3 can be adapted to share just the hashes $h_1(x)$ of the data points. We show that this approach has some interesting privacy guarantees, specifically that the aggregator can only learn the pairwise Euclidean distances between the points and nothing more. These hashes would be useful to train kernel-based models such as Support Vector Machines.

4 Experimental Evaluation

We make an important observation that is crucial to obtain good performance in practice. According to Theorem 3, in order to control the additive error in approximating the query $w(D, i)/|D|$, Algorithm 2 needs: (a) $T$ (the number of iterations in Algorithm 2) to be larger than $d^2$, to match the distribution $P_{avg}$ to the empirical distribution of coordinate $i$ in the current summary $D_s$; (b) $D_{init}$, the initial seed summary, to be large enough because of this (see Theorem 2). Over the epochs of Algorithm 3 (Step 4), we make the following changes to deal with these issues.
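Before turning to the epoch-by-epoch parameter choices, a minimal sketch of a random-features map in the spirit of Rahimi and Recht (2008) clarifies the pairwise-distance property claimed above for the less trusted curator extension (our own illustration; the paper's exact $h_1$ may differ):

```python
import numpy as np

def make_h1(input_dim, d, gamma, seed=0):
    """Random Fourier features for the RBF kernel
    k(x, y) = exp(-gamma * ||x - y||^2): returns a map h1 with
    <h1(x), h1(y)> ~= k(x, y). Because k depends only on ||x - y||,
    shared hashes reveal (approximately) pairwise Euclidean distances
    and nothing more about the raw coordinates."""
    rng = np.random.default_rng(seed)
    W = rng.normal(scale=np.sqrt(2.0 * gamma), size=(d, input_dim))
    b = rng.uniform(0.0, 2.0 * np.pi, size=d)
    return lambda x: np.sqrt(2.0 / d) * np.cos(W @ x + b)
```

For example, with a large feature dimension $d$, the inner product `h1(x) @ h1(y)` concentrates around $e^{-\gamma \|x-y\|^2}$.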
First Epoch ($\ell = 1$): In practice, we "seed" the protocol with a small initial seed set $D_{init}$ to satisfy (b), and set $T = T_{init}$ to be large enough ($d^{1.5}$) to satisfy (a). Subsequent Epochs ($\ell > 1$): Clearly, the summary $D_s$ grows, and hence (b) is satisfied. We set $T = T_{subs}$ to a constant for subsequent iterations. This may seem to contradict requirement (a). However, we observe that $h_2(\cdot)$ operates on a summary that differs in only one point from the previous iteration. Intuitively, a single point addition results in a small shift of the empirical distribution, and small incremental changes to the empirical distribution need only be matched incrementally. Thus, a significantly smaller number of iterations than that in Theorem 4 is sufficient, and $T_{subs}$ is set to be small. We set the parameters of our algorithm as follows: the RBF kernel parameter is 0.1 and the dimension of the Rahimi-Recht hash function $h_1(\cdot)$ is $d = 140$. We use two different $T$ parameters for the different epochs, given by $T_{init} (= T, \ell = 1) = d^{1.5} = 1656$ and $T_{subs} (= T, \ell > 1) = 5$. $\epsilon_v = 0.01$ is the $\epsilon$ parameter of $h_2(\cdot)$ for the validation set, and $\epsilon_{\ell,T}$ for $h_2(\cdot)$ on the summaries $D_s$ over epochs $\ell$ is set to 0.05 for $\ell = 1$ and $0.01/\sqrt{p \, T_{subs}}$ for $\ell > 1$.

Differential Privacy: An important observation here is that we do not need to preserve the privacy of the seed set, since it can be completely random. We now bound the differential privacy of our parameters with respect to both the consumer data and the summary data points. Consumer Dataset ($D_v$): We compute $h_2(h_1(D_v))$ only once, i.e., in the first epoch. This involves $T_{init} = 1656$ iterations in Algorithm 2, with $\epsilon_v = 0.01$. Applying Theorem 7 (in the Appendix), we see that the total differential privacy measure is $\epsilon = 1.4$ (setting $\tilde{\delta} = 0.01$). Summary Dataset ($D_s$): Over the $p$ epochs of Algorithm 3, we have 5 iterations each with differential privacy parameter $0.01/\sqrt{p \, T_{subs}}$.
Thus, again by applying Theorem 7 (in the Appendix), we obtain a total differential privacy of 0.043 (with $\tilde{\delta} = 0.0001$).

Experiments on Real World Datasets: We now back our theoretical results with empirical experiments. We compare three algorithms: (a) Non-Private Greedy, where the aggregator broadcasts the exact averages of the hashed summary set (i.e., $W(D_s, i)/|D_s|$) and the hashed validation set (i.e., $W(D_v, i)/m$). This is equivalent to the approach of Kim et al. (2016). (b) Private Greedy, which is Algorithm 3 with the parameters set as above. (c) Uniform Sampling, where we draw an equal number $p/K$ of the required samples from each data provider to construct a summary of size $p$. We empirically show that Private Greedy closely matches the performance of Non-Private Greedy even under the strong differential privacy constraints. For comparison, we also show that our algorithm outperforms Uniform Sampling. The motivation for choosing the latter as a candidate comes from the typical manner of using stochastic gradient descent approaches, such as Federated Learning [McMahan et al. (2016)], that perform uniform sampling. We experiment with two real-world datasets. Here we discuss one of them, based on an Allstate insurance dataset from a Kaggle (2014) competition; we show similar results for the MNIST dataset, which contains image data for recognizing handwritten digits, in Appendix G.

Figure 1: All State Insurance Data: (Top): Comparison of the percentage increase in $MMD^2$ of both the private and uniform sampling algorithms with respect to the baseline greedy algorithm. Lower values indicate better performance. The private algorithm performs consistently better than uniform sampling. (Bottom): Comparison of the classification accuracy of the three algorithms using a linear SVM classifier. Higher numbers indicate better performance.
Our private algorithm outperforms uniform sampling by 6-10% and closely matches the performance of the baseline greedy algorithm.

All State Insurance Data: The dataset contains insurance data of customers belonging to different states of the U.S. The objective is to predict the labels of one of the Allstate products. In our setup, we use data corresponding to two states, Florida and Connecticut. We have four data owner participants and an aggregator. The data is split up as follows. Training data: The training data comprises all of the Florida data and 70% of the Connecticut data. The Florida data is split uniformly among the four data owners, and the Connecticut data is given to one of them. This allows us to create a skew in the data quality across different participants. Validation data: From the remaining 30% of the Connecticut data, we choose 25% as the validation dataset. Note that we remove the labels from this validation set before giving it to the consumer. Testing data: The remaining Connecticut data is set aside as testing data; thus the testing data is comprised solely of Connecticut data. Further, we use around 150 points of random seed data belonging to a different state (Ohio). In our experiments, we vary the number of samples that need to be collected and compute the $MMD^2$ objective in each of these cases. In Figure 1, we compare the increase in $MMD^2$ with respect to greedy, i.e., $\frac{MMD^2(ALG_M) - MMD^2(GREEDY)}{MMD^2(GREEDY)} \times 100$, where $ALG_M$ is either our private greedy algorithm or the uniform sampling algorithm. Our results show that we consistently beat the uniform sampling algorithm while preserving differential privacy. In Figure 1, we also compare the performance of these algorithms using a linear SVM.
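For concreteness, the $MMD^2$ objective and the percentage-increase metric just described can be computed as in the following sketch (our own illustration, using a biased $MMD^2$ estimator with an RBF kernel; the bandwidth value is an assumption):

```python
import numpy as np

def mmd2(X, Y, gamma=0.1):
    """Biased estimator of squared Maximum Mean Discrepancy between sample
    sets X (n x dim) and Y (m x dim) under the RBF kernel
    k(x, y) = exp(-gamma * ||x - y||^2)."""
    def k(A, B):
        sq = (np.sum(A ** 2, axis=1)[:, None]
              + np.sum(B ** 2, axis=1)[None, :] - 2.0 * A @ B.T)
        return np.exp(-gamma * sq)
    return k(X, X).mean() + k(Y, Y).mean() - 2.0 * k(X, Y).mean()

def pct_increase(mmd2_alg, mmd2_greedy):
    """Percentage increase over the greedy baseline, as reported in Figure 1."""
    return (mmd2_alg - mmd2_greedy) / mmd2_greedy * 100.0
```

A summary that matches the validation distribution drives `mmd2(summary, validation)` toward zero, so lower `pct_increase` values indicate a summary closer to the greedy baseline.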
We \ufb01nd that the private algorithm while closely matching greedy\nbeats uniform sampling by 6% to 10%.\n\nMM D2(GREEDY )\n\n5 Discussion\n\nWe consider a distributed data summarization problem in a transfer learning setting with privacy\nconstraints. Different data owners have privacy constraints and a subset of points matching a target\ndataset needs to be formed. We provide a differentially private algorithm for this problem in the\nparsimonious curator setting, where the data owners do not wish to reveal information to other data\nowners and a curator entity can only access limited number of points.\n\n9\n\n\fAcknowledgement\n\nWe thank Naoki Abe and Michele Franceshini for helpful discussions in the initial stages of this\nwork. We also thank anonymous reviewers for their thoughtful suggestions that helped improve our\npresentation of the paper.\n\nReferences\nAbadi, M., Chu, A., Goodfellow, I., McMahan, H. B., Mironov, I., Talwar, K., and Zhang, L. Deep\nIn Proceedings of the 2016 ACM SIGSAC Conference on\n\nlearning with differential privacy.\nComputer and Communications Security, pp. 308\u2013318. ACM, 2016.\n\nBassily, R., Smith, A., and Thakurta, A. Private empirical risk minimization: Ef\ufb01cient algorithms\nand tight error bounds. In Foundations of Computer Science (FOCS), 2014 IEEE 55th Annual\nSymposium on, pp. 464\u2013473. IEEE, 2014.\n\nChaudhuri, K., Monteleoni, C., and Sarwate, A. D. Differentially private empirical risk minimization.\n\nJournal of Machine Learning Research, 12(Mar):1069\u20131109, 2011.\n\nDwork, C., Kenthapadi, K., McSherry, F., Mironov, I., and Naor, M. Our data, ourselves: Privacy via\ndistributed noise generation. In Annual International Conference on the Theory and Applications\nof Cryptographic Techniques, pp. 486\u2013503. Springer, 2006.\n\nDwork, C., Nikolov, A., and Talwar, K. Using convex relaxations for ef\ufb01ciently and privately\nreleasing marginals. 
In Proceedings of the thirtieth annual symposium on Computational geometry, pp. 261. ACM, 2014a.

Dwork, C., Roth, A., et al. The algorithmic foundations of differential privacy. Foundations and Trends in Theoretical Computer Science, 9(3-4):211-407, 2014b.

Ganin, Y. and Lempitsky, V. Unsupervised domain adaptation by backpropagation. arXiv preprint arXiv:1409.7495, 2014.

Giraud, B. G. and Peschanski, R. From "Dirac combs" to Fourier-positivity. arXiv preprint arXiv:1509.02373, 2015.

Gottlieb, J. and Khaled, R. Fueling growth through data monetization. https://www.mckinsey.com/business-functions/mckinsey-analytics/our-insights/fueling-growth-through-data-monetization, December 2017.

Gretton, A., Borgwardt, K. M., Rasch, M. J., Schölkopf, B., and Smola, A. J. A kernel method for the two-sample problem. CoRR, abs/0805.2368, 2008.

Hamm, J., Cao, Y., and Belkin, M. Learning privately from multiparty data. In International Conference on Machine Learning, pp. 555-563, 2016.

Hardt, M. and Rothblum, G. N. A multiplicative weights mechanism for privacy-preserving data analysis. In Foundations of Computer Science (FOCS), 2010 51st Annual IEEE Symposium on, pp. 61-70. IEEE, 2010.

Hardt, M., Ligett, K., and McSherry, F. A simple and practical algorithm for differentially private data release. In Advances in Neural Information Processing Systems, pp. 2339-2347, 2012.

Jukna, S. Extremal combinatorics: with applications in computer science. Springer Science & Business Media, 2011.

Kaggle. Allstate purchase prediction challenge. https://www.kaggle.com/c/allstate-purchase-prediction-challenge, 2014.

Kairouz, P., Oh, S., and Viswanath, P. The composition theorem for differential privacy. IEEE Transactions on Information Theory, 63(6):4037-4049, 2017.

Kasiviswanathan, S. P., Lee, H. K., Nissim, K., Raskhodnikova, S., and Smith, A. What can we learn privately?
SIAM Journal on Computing, 40(3):793-826, 2011.

Kifer, D., Smith, A., and Thakurta, A. Private convex empirical risk minimization and high-dimensional regression. In Conference on Learning Theory, pp. 25-1, 2012.

Kim, B., Khanna, R., and Koyejo, O. O. Examples are not enough, learn to criticize! Criticism for interpretability. In Advances in Neural Information Processing Systems, pp. 2280-2288, 2016.

Mardia, K. V. and Jupp, P. E. Directional statistics, volume 494. John Wiley & Sons, 2009.

McMahan, H. B., Moore, E., Ramage, D., Hampson, S., et al. Communication-efficient learning of deep networks from decentralized data. arXiv preprint arXiv:1602.05629, 2016.

Mitrovic, M., Bun, M., Krause, A., and Karbasi, A. Differentially private submodular maximization: Data summarization in disguise. In International Conference on Machine Learning, pp. 2478-2487, 2017.

Nemhauser, G. L., Wolsey, L. A., and Fisher, M. L. An analysis of approximations for maximizing submodular set functions-I. Mathematical Programming, 14(1):265-294, 1978.

Nissim, K. and Stemmer, U. Clustering algorithms for the centralized and local models. arXiv preprint arXiv:1707.04766, 2017.

Papernot, N., Abadi, M., Erlingsson, U., Goodfellow, I., and Talwar, K. Semi-supervised knowledge transfer for deep learning from private training data. arXiv preprint arXiv:1610.05755, 2016.

Pathak, M., Rane, S., and Raj, B. Multiparty differential privacy via aggregation of locally trained classifiers. In Advances in Neural Information Processing Systems, pp. 1876-1884, 2010.

Rahimi, A. and Recht, B. Random features for large-scale kernel machines. In Advances in Neural Information Processing Systems, pp. 1177-1184, 2008.

Rubinstein, B. I., Bartlett, P. L., Huang, L., and Taft, N. Learning in a large function space: Privacy-preserving mechanisms for SVM learning.
Journal of Privacy and Confidentiality, 4(1):65-100, 2012.

Sarpatwar, K. K., Ganapavarapu, V. S., Shanmugam, K., Rahman, A., and Vaculín, R. Blockchain enabled AI marketplace: The price you pay for trust. In IEEE Conference on Computer Vision and Pattern Recognition Workshops, CVPR Workshops 2019, Long Beach, CA, USA, June 16-20, 2019.

Shokri, R. and Shmatikov, V. Privacy-preserving deep learning. In Proceedings of the 22nd ACM SIGSAC Conference on Computer and Communications Security, pp. 1310-1321. ACM, 2015.

Song, S., Chaudhuri, K., and Sarwate, A. D. Stochastic gradient descent with differentially private updates. In Global Conference on Signal and Information Processing (GlobalSIP), 2013 IEEE, pp. 245-248. IEEE, 2013.

Talwar, K., Thakurta, A. G., and Zhang, L. Nearly optimal private LASSO. In Advances in Neural Information Processing Systems, pp. 3025-3033, 2015.

Thakurta, A. G. Differentially private convex optimization for empirical risk minimization and high-dimensional regression. The Pennsylvania State University, 2013.

Tzeng, E., Hoffman, J., Zhang, N., Saenko, K., and Darrell, T. Deep domain confusion: Maximizing for domain invariance. arXiv preprint arXiv:1412.3474, 2014.

Wu, X., Kumar, A., Chaudhuri, K., Jha, S., and Naughton, J. F. Differentially private stochastic gradient descent for in-RDBMS analytics. arXiv preprint arXiv:1606.04722, 2016.