{"title": "Oblivious Sampling Algorithms for Private Data Analysis", "book": "Advances in Neural Information Processing Systems", "page_first": 6495, "page_last": 6506, "abstract": "We study secure and privacy-preserving data analysis\r\nbased on queries executed on samples from a dataset.\r\nTrusted execution environments (TEEs) can be used to\r\nprotect the content of the data during query computation,\r\nwhile supporting differentially private (DP) queries in TEEs\r\nprovides record privacy when query output is revealed.\r\nSupport for sample-based queries is attractive\r\ndue to \\emph{privacy amplification},\r\nsince only a small subset of the dataset, rather than the whole dataset, is used to answer a query.\r\nHowever, extracting data samples with TEEs\r\nwhile proving strong DP guarantees is not\r\ntrivial, as the secrecy of sample indices has to be preserved.\r\nTo this end, we design efficient secure variants of common sampling algorithms.\r\nExperimentally we show that the accuracy of models\r\ntrained with shuffling and sampling is the same for\r\ndifferentially private models on MNIST and CIFAR-10,\r\nwhile sampling provides stronger privacy guarantees than shuffling.", "full_text": "Oblivious Sampling Algorithms for\n\nPrivate Data Analysis\n\nSajin Sasy\u2217\n\nUniversity of Waterloo\n\nOlga Ohrimenko\nMicrosoft Research\n\nAbstract\n\nWe study secure and privacy-preserving data analysis based on queries executed\non samples from a dataset. Trusted execution environments (TEEs) can be used\nto protect the content of the data during query computation, while supporting\ndifferentially private (DP) queries in TEEs provides record privacy when query\noutput is revealed. Support for sample-based queries is attractive due to privacy\namplification, since only a small subset of the dataset, rather than the whole\ndataset, is used to answer a query. However, extracting data samples with TEEs\nwhile proving strong DP guarantees is not trivial, as the secrecy of sample indices\nhas to be preserved. 
To this end, we\ndesign ef\ufb01cient secure variants of common sampling algorithms. Experimentally\nwe show that accuracy of models trained with shuf\ufb02ing and sampling is the same for\ndifferentially private models for MNIST and CIFAR-10, while sampling provides\nstronger privacy guarantees than shuf\ufb02ing.\n\n1\n\nIntroduction\n\nSensitive and proprietary datasets (e.g., health, personal and \ufb01nancial records, laboratory experiments,\nemails, and other personal digital communication) often come with strong privacy and access control\nrequirements and regulations that are hard to maintain and guarantee end-to-end. The fears of\ndata leakage may block datasets from being used by data scientists and prevent collaboration and\ninformation sharing between multiple parties towards a common good (e.g., training a disease\ndetection model across data from multiple hospitals). For example, the authors of [11, 14, 37] show\nthat machine learning models can memorize individual data records, while information not required\nfor the agreed upon learning task may be leaked in collaborative learning [28]. To this end, we are\ninterested in designing the following secure data query framework:\n\nstrong security privacy guarantees on the usage of their data;\n\n\u2022 A single or multiple data owners contribute their datasets to the platform while expecting\n\u2022 The framework acts as a gatekeeper of the data and a computing resource of the data scientist:\nit can compute queries on her behalf while ensuring that data is protected from third parties;\n\u2022 Data scientist queries the data via the framework via a range of queries varying from\n\napproximating sample statistics to training complex machine learning models.\n\nThe goal of the framework is to allow data scientist to query the data while providing strong privacy\nguarantees to data owners on their data. 
The framework aims to protect against two classes of\nattackers: the owner of the computing infrastructure of the framework and the data scientist.\nThe data scientist may try to infer more information about the dataset than what is available through\na (restricted) class of queries supported by the framework. We consider the following two collusion\nscenarios. As the framework may be hosted in the cloud or on premise of the data scientist\u2019s\n\n\u2217Work done during internship at Microsoft Research.\n\n33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.\n\n\forganization, the infrastructure is not trusted as one can access the data without using the query\ninterface. The second collusion may occur in a multi-data-owner scenario where the data scientist\ncould combine the answer of a query and data of one of the parties to infer information about other\nparties\u2019 data. Hence, the attacker may have auxiliary information about the data.\nIn the view of the above requirements and threat model we propose Private Sampling-based Query\nFramework. It relies on secure hardware to protect data content and restrict data access. Additionally,\nit supports sample-based differentially private queries for ef\ufb01ciency and privacy. However, naive\ncombination of these components does not lead to an end-to-end secure system for the following\nreason. Differential privacy guarantees for sampling algorithms (including machine learning model\ntraining that build on them [3, 26, 45]) are satis\ufb01ed only if the sample is hidden. Unfortunately as we\nwill see this is not the case with secure hardware due to leakage of memory access patterns. To this\nend, we design novel algorithms for producing data samples using two common sampling techniques,\nSampling without replacement and Poisson, with the guarantee that whoever observes data access\npatterns cannot identify the indices of the dataset used in the samples. 
We also argue that if privacy of\ndata during model training is a requirement then sampling should be used instead of the default use of\nshuf\ufb02ing since it incurs smaller privacy loss in return to similar accuracy as we show experimentally.\nWe now describe components of our Private Sampling-based Query Framework.\nFramework security: In order to protect data content and computation from the framework host,\nwe rely on encryption and trusted execution environments (TEE). TEEs can be enabled using secure\nhardware capabilities such as Intel SGX [20] which provides a set of CPU instructions that gives\naccess to special memory regions (enclaves) where encrypted data is loaded, decrypted and computed\non. Importantly access to this region is restricted and data is always encrypted in memory. One can\nalso verify the code and data that is loaded in TEEs via attestation. Hence, data owners can provide\ndata encrypted under the secret keys that are available only to TEEs running speci\ufb01c code (e.g.,\ndifferentially private algorithms). Some of the limitations of TEEs include resource sharing with the\nrest of the system (e.g., caches, memory, network), which may lead to side-channels [10, 19, 33].\nAnother limitation of existing TEEs is the amount of available enclave memory (e.g., Intel Skylake\nCPUs restrict the enclave page cache to 128MB). 
Though one can use system memory, the resulting\nmemory paging does not only produce performance overhead but also introduces more memory\nside-channels [44].\nSample-based data analysis: Data sampling has many applications in data analysis from returning\nan approximate query result to training a model using mini-batch stochastic gradient descent (SGD).\nSampling can be used for approximating results when performing the computation on the whole\ndataset is expensive (e.g., graph analysis or frequent itemsets [35, 36]) 2 or not needed (e.g., audit\nof a \ufb01nancial institution by a regulator based on a sample of the records). We consider various uses\nof sampling, including queries that require a single sample, multiple samples such as bootstrapping\nstatistics, or large number of samples such as training of a neural network.\nSampling-based queries provide: Ef\ufb01ciency: computing on a sample is faster than on the whole\ndataset, which \ufb01ts the TEE setting, and can be extended to process dataset samples in parallel with\nmultiple TEEs. Expressiveness: a large class of queries can be answered approximately using\nsamples, furthermore sampling (or mini-batching) is at the core of training modern machine learning\nmodels. Privacy: a query result from a sample reveals information only about the sample and not\nthe whole dataset. Though intuitively privacy may come with sampling, it is not always true. If a\ndata scientist knows indices of the records in the sample used for a query, then given the query result\nthey learn more about records in that sample than about other records. However if sample indices are\nhidden then there is plausible deniability. Luckily, differential privacy takes advantage of privacy\nfrom sampling and formally captures it with privacy ampli\ufb01cation [8, 21, 25].\nDifferential privacy: Differential privacy (DP) is a rigorous de\ufb01nition of individual privacy when\na result of a query on the dataset is revealed. 
Informally, it states that a single record does not\nsignificantly change the result of the query. Strong privacy can be guaranteed in return for a drop in\naccuracy for simple statistical queries [13] and complex machine learning models [3, 7, 26, 43, 45].\nDP mechanisms come with a parameter \u03b5, where a higher \u03b5 signifies a higher privacy loss.\n\n2We note that we use sampling differently from statistical approaches that treat the dataset D as a sample\n\nfrom a population and use all records in D to estimate parameters of the underlying population.\n\n2\n\n\fAmplification by sampling is a well-known result in differential privacy. Informally, it says that\nwhen an \u03b5-DP mechanism is applied on a sample of size \u03b3n from a dataset D of size n, \u03b3 < 1, then\nthe overall mechanism is O(\u03b3\u03b5)-DP w.r.t. D. Small \u03b5 parameters reported from training of neural\nnetworks using DP SGD [3, 26, 45] make extensive use of privacy amplification in their analysis.\nImportantly, for this to hold they all require the sample identity to be hidden.\nDP algorithms mentioned above are set in the trusted curator model, where hiding the sample is not\na problem as algorithm execution is not visible to an attacker (i.e., the data scientist who obtains\nthe result in our setting). TEEs can be used only as an approximation of this model due to the\nlimitations listed above: revealing memory access patterns of a differentially private algorithm can be\nenough to violate or weaken its privacy guarantees. Sampling-based DP algorithms fall in this\ncategory as they make an explicit assumption that the identity of the sample is hidden [42, 24]. If not,\namplification-based results cannot be applied. 
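The amplification statement can be made concrete with a small sketch (our illustration, not code from the paper) that evaluates the exact amplified parameter ε′ = log(1 + γ(eᵋ − 1)), which behaves as γε for small γ:

```python
import math

def amplified_epsilon(eps, gamma):
    """Privacy amplification by subsampling: an eps-DP mechanism applied to
    a gamma-fraction sample is log(1 + gamma * (e^eps - 1))-DP overall."""
    return math.log(1.0 + gamma * math.expm1(eps))

# With eps = 1 and a 1% sample, the overall loss drops to roughly 0.017.
print(amplified_epsilon(1.0, 0.01))
```

With γ = 1 the bound degenerates to ε itself, as expected when the whole dataset is used.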
If one desires the same level of privacy, higher level of\nnoise will need to be added which would in turn reduce the utility of the results.\nDifferential privacy is attractive since it can keep track of the privacy loss over multiple queries.\nHence, reducing privacy loss of individual queries and supporting more queries as a result, is an\nimportant requirement. Sacri\ufb01cing on privacy ampli\ufb01cation by revealing sample identity is wasteful.\nData-oblivious sampling algorithms Query computation can be supported in a TEE since samples\nare small compared to the dataset and can \ufb01t into private memory of a TEE. However, naive\nimplementation of data sampling algorithms is inef\ufb01cient (due to random access to memory outside\nof TEE) and insecure in our threat model (since sample indices are trivially revealed). Naively hiding\nsample identity would be to read a whole dataset and only keep elements whose indices happen to\nbe in the sample. This would require reading the entire dataset for each sample (training of models\nusually requires small samples, e.g., 0.01% of the dataset). This will also not be competitive in\nperformance with shuf\ufb02ing-based approaches used today.\nTo this end, we propose novel algorithms for producing data samples for two popular sampling\napproaches: sampling without replacement and Poisson. Samples produced by shuf\ufb02ing-based\nsampling contain distinct elements, however elements may repeat between the samples. Our algo-\nrithms are called data-oblivious [15] since the memory accesses they produce are independent of\nthe sampled indices. Our algorithms are ef\ufb01cient as they require only two data oblivious shuf\ufb02es\nand one scan to produce n/m samples of size m that is suf\ufb01cient for one epoch of training. 
An\noblivious sampling algorithm would be used as follows: n/m samples are generated at once, stored\nindividually encrypted, and then loaded in a TEE on a per-query request.\nContributions: (i) We propose a Private Sampling-based Query Framework for querying sensitive\ndata; (ii) We use differential privacy to show that sampling algorithms are an important building block\nin privacy-preserving frameworks; (iii) We develop ef\ufb01cient and secure (data-oblivious) algorithms\nfor two common sampling techniques; (iv) We empirically show that for MNIST and CIFAR-10\nusing sampling algorithms for generating mini-batches during differentially-private training achieves\nthe same accuracy as shuf\ufb02ing, even though sampling incurs smaller privacy loss than shuf\ufb02ing.\n\n2 Notation and Background\nA dataset D contains n elements; each element e has a key and a value; keys are distinct in [1, n]. If a\ndataset does not have keys, we use its element index in the array representation of D as a key.\nTrusted Execution Environment TEE provides strong protection guarantees to data in its private\nmemory: it is not visible to an adversary who can control everything outside of the CPU, e.g., even\nif it controls the operating system (OS) or the VM. The private memory of TEEs (depending on\nthe side-channel threat model) is restricted to CPU registers (few kilobytes) or caches (32MB) or\nenclave page cache (128MB). Since these sizes will be signi\ufb01cantly smaller than usual datasets, an\nalgorithm is required to store the data in the external memory. Since external memory is controlled\nby an adversary (e.g., an OS), it can observe its content and the memory addresses requested from\na TEE. 
Probabilistic encryption can be used to protect the content of data in external memory: an\nadversary seeing two ciphertexts cannot tell if they are encryptions of the same element or a dummy\nof the same size as a real element.\nThough the size of primary memory is not suf\ufb01cient to process a dataset, it can be leveraged for\nsample-based data analysis queries as follows. When a query requires a sample, it loads an encrypted\n\n3\n\n\fsample from the external memory into the TEE, decrypts it, performs a computation (for example,\nSGD), discards the sample, and either updates a local state (for example, parameters of the ML model\nmaintained in a TEE) and proceeds to the next sample, or encrypts the result of the computation\nunder data scientist\u2019s secret key and returns it.\nAddresses (or memory access sequence) requested by a TEE can leak information about data. Leaked\ninformation depends on adversary\u2019s background knowledge (attacks based on memory accesses have\nbeen shown for image and text processing [44]). In general, many (non-differentially-private and\ndifferentially-private [4]) algorithms leak their access pattern including sampling (see \u00a74.1).\nData-oblivious algorithms access memory in a manner that appears to be independent of the sensi-\ntive data. For example, sorting networks are data-oblivious as compare-and-swap operators access\nthe same array indices independent of the array content, in contrast to quick sort. Data-oblivious algo-\nrithms have been designed for array access [15, 16, 39], sorting [18], machine learning algorithms [32]\nand several data structures [41]; while this work is the \ufb01rst to consider sampling algorithms. The\nperformance goal of oblivious algorithms is to reduce the number of additional accesses to external\nmemory needed to hide real accesses.\nOur sampling algorithms in \u00a74 rely on an oblivious shuf\ufb02e oblshu\ufb04e(D) [31]. 
A shuffle rearranges\nelements according to a permutation \u03c0 s.t. the element at index i is placed at location \u03c0[i] after the shuffle.\nAn oblivious shuffle does the same except that the adversary observing its memory accesses does not\nlearn \u03c0. The Melbourne shuffle [31] makes O(cn) accesses to external memory with private memory\nof size O(n^{1/c}). This overhead is constant since a non-oblivious shuffle needs to make n accesses.\nAn oblivious shuffle can use smaller private memory at the expense of more accesses (see [34]). It is\nimportant to note that while loading data into private memory, the algorithm re-encrypts the elements\nto avoid trivial comparison of elements before and after the shuffle.\nDifferential privacy A randomized mechanism M : D \u2192 R is (\u03b5, \u03b4) differentially private [13]\nif for any two neighbouring datasets D0, D1 \u2208 D and for any subset of outputs R \u2286 R it holds\nthat Pr[M(D0) \u2208 R] \u2264 e^\u03b5 Pr[M(D1) \u2208 R] + \u03b4. We use the substitute-one neighbouring relationship\nwhere |D0| = |D1| and D0, D1 differ in one element. This relationship is natural for sampling\nwithout replacement and the data-oblivious setting where an adversary knows |D|. As we see in \u00a74.2,\nhiding the sizes of Poisson samples in our setting is non-trivial and we choose to hide the number of\nsamples instead.\nGaussian mechanism [13] is a common way of obtaining a differentially private variant of a real-valued function f : D \u2192 R. Let \u2206f be the L2-sensitivity of f, that is, the maximum distance\n\u2016f(D0) \u2212 f(D1)\u2016_2 between any D0 and D1. Then, the Gaussian noise mechanism is defined by\nM(D) = f(D) + N(0, \u03c3^2\u2206f^2), where N(0, \u03c3^2\u2206f^2) is a Gaussian distribution with mean 0 and standard\ndeviation \u03c3\u2206f. 
The resulting mechanism is (\u03b5, \u03b4)-DP if \u03c3 = \u221a(2 log(1.25/\u03b4))/\u03b5 for \u03b5, \u03b4 \u2208 (0, 1).\n\nSampling methods Algorithms that operate on data samples often require more than one sample.\nFor example, machine learning model training proceeds in epochs where each epoch processes\nmultiple batches (or samples) of data. The number of samples k and the sample size m are usually\nchosen such that n \u2248 km so that every data element has a non-zero probability of being processed\nduring an epoch. To this end, we define samplesA(D, q, k) that produces samples s1, s2, . . . , sk using\na sampling algorithm A and parameter q, where si is a set of keys from [1, n]. For simplicity we\nassume that m divides n and k = n/m. We omit stating the randomness used in samplesA but\nassume that every call uses a new seed. We will now describe three sampling methods that vary based\non element distribution within each sample and between the samples.\nSampling without replacement (SWO) produces a sample by drawing m distinct elements uniformly at\nrandom from the set [1, n]; hence the probability of a sample s is 1/n \u00b7 1/(n\u22121) \u00b7\u00b7\u00b7 1/(n\u2212m+1). Let F^{n,m}_SWO be the set of\nall SWO samples of size m from domain [1, n]; samplesSWO(D, m, k) draws k samples from F^{n,m}_SWO\nwith replacement: elements cannot repeat within the same sample but can repeat between the samples.\nPoisson sampling (Poisson) constructs s by independently adding each element from [1, n] with\nprobability \u03b3, that is, Pr(j \u2208 s) = \u03b3. Hence, the probability of a sample s is Pr_\u03b3(s) = \u03b3^{|s|}(1 \u2212 \u03b3)^{n\u2212|s|}.\nLet F^{n,\u03b3}_Poisson be the set of all Poisson samples from domain [1, n]. Then, samplesPoisson(D, \u03b3, k)\ndraws k elements with replacement from F^{n,\u03b3}_Poisson. The size of a Poisson sample is a random variable,\n\u03b3n on average. 
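For reference, the two distributions just defined can be mimicked non-obliviously in a few lines (our sketch; the function names mirror samplesA but are illustrative):

```python
import random

def samples_swo(n, m, k, rng=random):
    # k draws from F^{n,m}_SWO with replacement: keys are distinct within
    # a sample but may repeat across the k samples
    return [rng.sample(range(1, n + 1), m) for _ in range(k)]

def samples_poisson(n, gamma, k, rng=random):
    # each key enters a sample independently with probability gamma, so
    # the sample size is Binomial(n, gamma), i.e., gamma * n on average
    return [[j for j in range(1, n + 1) if rng.random() < gamma]
            for _ in range(k)]
```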
Sampling via Shuffle is common for obtaining mini-batches for SGD in practice.\nIt shuffles D and splits it into batches of size m. If more than k samples are required, the procedure is\n\n4\n\n\fTable 1: Parameters (\u03b5\u2032, \u03b4\u2032) of mechanisms that use an (\u03b5, \u03b4)-DP mechanism M with one of the three\nsampling techniques with a sample of size m from a dataset of size n and \u03b3 = m/n for Poisson\nsampling, where \u03b5\u2032 < 1, \u03b4\u2032\u2032 > 0, T is the number of samples in an epoch, E is the number of epochs.\n\nSampling mechanism | # analyzed samples of size m: T \u2264 n/m | # analyzed samples of size m: T = En/m, E \u2265 1\nShuffling | (\u03b5, \u03b4) | (O(\u03b5\u221a(E log(1/\u03b4\u2032\u2032))), E\u03b4 + \u03b4\u2032\u2032)\nPoisson, SWO | (O(\u03b5\u03b3\u221a(T log(1/\u03b4\u2032\u2032))), T\u03b3\u03b4 + \u03b4\u2032\u2032) for any T\nPoisson & Gaussian distribution [3] | (O(\u03b3\u03b5\u221aT), \u03b4) for any T\n\nrepeated. Similar to SWO or Poisson, each sample contains distinct elements; however, in contrast to\nthem, a sequence of k samples contains distinct elements between the samples.\n\n3 Privacy via Sampling and Differential privacy\n\nPrivacy amplification of differential privacy captures the relationship of performing analysis over\na sample vs. the whole dataset. Let M be a randomized mechanism that is (\u03b5, \u03b4)-DP and let sample\nbe a random sample of size \u03b3n from dataset D, where \u03b3 < 1 is a sampling parameter. Let M\u2032 =\nM \u25e6 sample be a mechanism that applies M on a sample of D. Then, informally, M\u2032 is (O(\u03b3\u03b5), \u03b3\u03b4)-\nDP [8, 25].\nSampling For Poisson and sampling without replacement, \u03b5\u2032 of M\u2032 is log(1 + \u03b3(e^\u03b5 \u2212 1)) [25] and\nlog(1 + m/n(e^\u03b5 \u2212 1)) [6], respectively. We refer the reader to Balle et al. 
[6] who provide a unified\nframework for studying amplification of these sampling mechanisms. Crucially, all amplification\nresults assume that the sample is hidden during the analysis, as otherwise the amplification results cannot\nhold. That is, if the keys of the elements of a sample are revealed, M\u2032 has the same (\u03b5, \u03b4) as M.\nPrivacy loss of executing a sequence of DP mechanisms can be analyzed using several approaches. The strong composition theorem [13] states that running T (\u03b5, \u03b4)-mechanisms would be\n(\u03b5\u221a(2T log(1/\u03b4\u2032\u2032)) + T\u03b5(e^\u03b5 \u2212 1), T\u03b4 + \u03b4\u2032\u2032)-DP, \u03b4\u2032\u2032 \u2265 0. Better bounds can be obtained if one takes\nadvantage of the underlying DP mechanism. Abadi et al. [3] introduce a moments accountant that\nleverages the fact that M\u2032 uses Poisson sampling and applies Gaussian noise to the output. They\nobtain \u03b5\u2032 = O(\u03b3\u03b5\u221aT), \u03b4\u2032 = \u03b4.\nShuffling Analysis of the differentially private parameters of M\u2032 that operates on samples obtained from\nshuffling is different. Parallel composition by McSherry [27] can be seen as the privacy \u201camplification\u201d\nresult for shuffling. It states that running T algorithms in parallel on disjoint samples of the dataset\nhas \u03b5\u2032 = max_{i\u2208[1,T]} \u03b5_i, where \u03b5_i is the parameter of the ith mechanism. It is a significantly better\nresult than what one would expect from using the DP composition theorem, since it relies on the fact that\nthe samples are disjoint. If one requires multiple passes over a dataset (as is the case with multi-epoch\ntraining), the strong composition theorem can be used with parallel composition.\nSampling vs. Shuffling DP Guarantees We bring the above results together in Table 1 to compare\nthe parameters of several sampling approaches. 
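The strong composition bound can be evaluated numerically, e.g. with the following sketch (ours, for illustration):

```python
import math

def strong_composition(eps, delta, T, delta_pp):
    """Strong composition: T runs of an (eps, delta)-DP mechanism are
    (eps * sqrt(2 * T * log(1/delta'')) + T * eps * (e^eps - 1),
     T * delta + delta'')-DP for any delta'' > 0."""
    eps_total = (eps * math.sqrt(2.0 * T * math.log(1.0 / delta_pp))
                 + T * eps * math.expm1(eps))
    return eps_total, T * delta + delta_pp
```

For example, 100 runs of a (0.1, 10⁻⁶)-DP mechanism compose to roughly (5.9, 1.1 · 10⁻⁴)-DP with δ′′ = 10⁻⁵.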
As we can see, sampling-based approaches for general\nDP mechanisms give an O(\u221a(m/n))-factor smaller epsilon than shuffling-based approaches. It is\nimportant to note that sampling-based approaches assume that the indices (or keys) of the dataset\nelements used by the mechanism remain secret. In \u00a74 we develop algorithms with this property.\nDifferentially private SGD We now turn our attention to a differentially private mechanism for\nmini-batch stochastic gradient descent computation. The mechanism is called NoisySGD [7, 38]\nand, when applied instead of non-private mini-batch SGD, allows for a release of a machine learning\nmodel with differential privacy guarantees on the training data. For example, it has been applied in\nBayesian learning [43] and to train deep learning [3, 26, 45] and logistic regression [38] models.\nIt proceeds as follows. Given a mini-batch (or sample), the gradient of every element in the batch is\ncomputed and the L2 norm of the gradient is clipped according to a clipping parameter C. Then\nnoise is added to the sum of the (clipped) gradients of all the elements and the result is averaged over\nthe sample size. The noise added to the result is from a Gaussian distribution parameterized with C and\na noise scale parameter \u03c3: N(0, \u03c3^2C^2). The noise is proportional to the sensitivity of the sum of\ngradients to the value of each element in the sample. The amount of privacy budget that processing a single\n\n5\n\n\fbatch, also called the subsampled Gaussian mechanism, incurs depends on the parameters of\nthe noise distribution and how the batch is sampled. The model parameters are iteratively updated\nafter every NoisySGD processing. The number of iterations and the composition mechanism used to\nkeep track of the privacy loss determine the DP parameters of the overall training process.\nAbadi et al. [3] report analytical results assuming Poisson sampling but use shuffling to obtain the\nsamples in the evaluation. 
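The per-batch computation described above can be sketched in plaintext as follows (our illustration; real implementations clip per-example gradients of a model, and the parameter names are ours):

```python
import math
import random

def noisy_sgd_step(per_example_grads, C, sigma, rng=random):
    """One NoisySGD batch update: clip each per-example gradient to L2 norm
    at most C, sum the clipped gradients, add N(0, (sigma*C)^2) noise to
    each coordinate, and average over the batch size."""
    dim = len(per_example_grads[0])
    total = [0.0] * dim
    for g in per_example_grads:
        norm = math.sqrt(sum(x * x for x in g))
        scale = min(1.0, C / norm) if norm > 0 else 1.0
        for t in range(dim):
            total[t] += g[t] * scale
    batch = len(per_example_grads)
    return [(total[t] + rng.gauss(0.0, sigma * C)) / batch for t in range(dim)]
```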
Yu et al. [45] point out the discrepancy between analysis and experimental\nresults in [3], that is, the reported privacy loss is underestimated due to the use of shuffling. Yu et al.\nproceed to analyze shuffling and sampling but also use shuffling in their experiments. Hence, though\nanalytically Poisson and SWO sampling provide better privacy parameters than shuffling, there is no\nevidence that the accuracy is the same between the approaches in practice. We fill in this gap in \u00a75\nand show that for the benchmarks we have tried it is indeed the case.\n\n4 Oblivious Sampling Algorithms\n\nIn this section, we develop data-oblivious algorithms for generating a sequence of samples from a\ndataset D such that the total number of samples is sufficient for a single epoch of a training algorithm.\nMoreover, our algorithms will access the original dataset at indices that appear to be independent\nof how elements are distributed across the samples. As a result, anyone observing their memory\naccesses cannot identify how many and which samples each element of D appears in.\n\n4.1 Oblivious sampling without replacement (SWO)\nWe introduce a definition of an oblivious sampling algorithm: oblivious samplesSWO(D, m) is a\nrandomized algorithm that returns k SWO samples from D and produces memory accesses that are\nindistinguishable between invocations for all datasets of size n = |D| and generated samples.\nAs a warm-up, consider the following naive way of generating a single SWO sample of size m from\na dataset D stored in the external memory of a TEE: generate m distinct random keys from [1, n] and load\nfrom external memory the elements of D stored at those indices. This trivially reveals the sample\nto an observer of memory accesses. A secure but inefficient way would be to load D[l] for all l \u2208 [1, n]\nand, if l matches one of the m random keys, keep D[l] in private memory. This incurs n accesses to\ngenerate a sample of size m. 
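The secure-but-inefficient baseline just described can be sketched as (our illustration):

```python
import random

def naive_hidden_swo_sample(D, m, rng=random):
    """Read every index of D and keep only the elements whose index falls
    into a secret m-subset: the access pattern is a full linear scan, so
    it reveals nothing about the sample, at the cost of n accesses."""
    secret_keys = set(rng.sample(range(len(D)), m))
    sample = []
    for l in range(len(D)):      # always touch every index
        e = D[l]
        if l in secret_keys:     # the selection happens in private memory
            sample.append(e)
    return sample
```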
Though our algorithm will also make a linear number of accesses to D,\nit will amortize this cost by producing n/m samples.\nThe high level description of our secure and ef\ufb01cient algorithm for producing k is as follows. Choose k\nsamples from F n,m\nSWO, numbering each sample with an identi\ufb01er 1 to k; the keys within the samples\n(up to a mapping) will represent the keys of elements used in the samples of the output. Then, while\nscanning D, replicate elements depending on how many samples they should appear in and associate\neach replica with its sample id. Finally, group elements according to sample ids.\nPreliminaries Our algorithm relies on a primitive that can ef\ufb01ciently draw k samples from F n,m\nSWO\n(denoted via SWO.initialize(n, m)).\nIt also provides a function SWO.samplemember(i, j) that\nreturns True if key j is in the ith sample and False otherwise. This primitive can be instantiated\nusing k pseudo-random permutations \u03c1i over [1, n]. Then sample i is de\ufb01ned by the \ufb01rst m indices of\nthe permutation, i.e., element with key j is in the sample i if \u03c1i(j) \u2264 m. This procedure is described\nin more detail in Appendix \u00a7A.\nWe will use rj to denote the number of samples where key j appears in,\nthat is rj =\n|{i | samplemember(i, j),\u2200i \u2208 [1, k],\u2200j \u2208 [1, n]}|. It is important to note that samples drawn\nabove are used as a template for a valid SWO sampling (i.e., to preserve replication of elements across\nthe samples). However, the \ufb01nal samples s1, s2, . . . , sk returned by the algorithm will be instantiated\nwith keys that are determined using function \u03c0(cid:48) (which will be de\ufb01ned later). In particular, for all\nsamples, if samplemember(i, j) is true then \u03c0(cid:48)(j) \u2208 si.\nDescription The pseudo-code in Algorithm 1 provides the details of the method. It starts with\ndataset D obliviously shuf\ufb02ed according to a random secret permutation \u03c0 (Line 1). 
Hence, element e is stored (re-encrypted) in D at index \u03c0(e.key). The next phase replicates elements\nsuch that for every index j \u2208 [1, n] there is an element (not necessarily with key j) that is\nreplicated rj times (Lines 4-14). The algorithm maintains a counter l which keeps the current index of the scan in the array and enext which stores the element read from the lth index.\n\n6\n\n\fAlgorithm 1 Oblivious samplesSWO(D, m): takes an encrypted dataset D and returns k = n/m SWO samples of size m, n = |D|.\n1: D \u2190 oblshuffle(D)\n2: SWO.initialize(n, m)\n3: S \u2190 [], j \u2190 1, l \u2190 1, e \u2190 D[1], enext \u2190 D[1]\n4: while l \u2264 n do\n5:   for i \u2208 [1, k] do\n6:     if SWO.samplemember(i, j) then\n7:       S.append(re-enc(e), enc(i))\n8:       l \u2190 l + 1\n9:       enext \u2190 D[l]\n10:     end if\n11:   end for\n12:   e \u2190 enext\n13:   j \u2190 j + 1\n14: end while\n15: S \u2190 oblshuffle(S)\n16: \u2200i \u2208 [1, k] : si \u2190 []\n17: for p \u2208 S do\n18:   (ce, ci) \u2190 p, i \u2190 dec(ci)\n19:   si \u2190 si.append(ce)\n20: end for\n21: Return s1, s2, . . . , sk\n\nAdditionally, the algorithm maintains an element e, which is the element currently being replicated.\nIt is updated to enext as soon as a sufficient number of\nreplicas is reached. The number of times e is replicated depends on the number of samples the element\nwith key j appears in. Counter j starts at 1 and is\nincremented after element e is replicated rj times.\nAt any given time, counter j is an indicator of the\nnumber of distinct elements written out so far. Hence,\nj can reach n only if every element appears in exactly\none sample. 
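The control flow of Algorithm 1 can be mirrored by a plaintext simulation (our illustration: encryption, re-encryption and the two oblivious shuffles are replaced by ordinary operations, so it reproduces the functional behaviour only, with no security):

```python
import random

def swo_samples_simulation(D, m, seed=0):
    """Plaintext simulation of the replicate-then-group structure of
    Algorithm 1 (0-indexed). Returns k = n/m samples of size m."""
    rng = random.Random(seed)
    n, k = len(D), len(D) // m
    # template samples: key j is in sample i iff its position in the
    # i-th random permutation is below m (stand-in for samplemember)
    pos = []
    for _ in range(k):
        p = list(range(n))
        rng.shuffle(p)
        pos.append({j: q for q, j in enumerate(p)})

    D = list(D)
    rng.shuffle(D)                      # stands in for oblshuffle(D)
    S, l, j = [], 0, 0
    e = enext = D[0]
    while l < n:                        # replication scan (Lines 4-14)
        for i in range(k):
            if pos[i][j] < m:           # samplemember(i, j)
                S.append((e, i))        # replica of e tagged with sample id
                l += 1
                enext = D[l] if l < n else None
        e = enext
        j += 1
    rng.shuffle(S)                      # stands in for the second oblshuffle
    samples = [[] for _ in range(k)]
    for elem, i in S:                   # group by (now revealed) sample id
        samples[i].append(elem)
    return samples
```

Since every template sample holds exactly m keys and the total number of memberships is n, the scan always writes exactly n tagged replicas.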
On the other hand, the smallest j can be\nis m; this happens when all k samples are identical.\nGiven the above state, the algorithm reads an element\ninto enext and loops internally through i \u2208 [1..k]: if the current key j is in the ith sample, it writes out an encrypted\ntuple (e, i) and reads the next element from D into\nenext. Note that e is re-encrypted every time it is written out in order to hide which one of the elements\nread so far is being written out. After the scan, the\ntuples are obliviously shuffled. At this point, the\nsample id i of each tuple is decrypted and used to\n(non-obliviously) group elements that belong to the\nsame sample together, creating the sample output\ns1..sk (Lines 16-20).\nWe are left to derive the mapping m between keys\nused in samples drawn in Line 2 and elements returned in samples s1..sk. We note that m is not explicitly used during the algorithm and is used only in the analysis. From the algorithm we see that\nm(l) = \u03c0^{\u22121}(1 + \u03a3_{j=1}^{l\u22121} r_j), that is, m is derived from \u03c0 with shifts due to replications of preceding\nkeys. (Observe that if every element appears only in one sample, m(l) = \u03c0^{\u22121}(l).) We show that m is\ninjective and random (Lemma 1) and, hence, s1..sk are valid SWO samples.\nExample Let D = {(1, A), (2, B), (3, C), (4, D), (5, E), (6, F)}, where (4, D) denotes element\nD at index 4 (used also as a key), m = 2, and the randomly drawn samples in SWO.initialize are {1, 4},\n{1, 2}, {1, 5}. Suppose D after the shuffle is {(4, D), (1, A), (5, E), (3, C), (6, F), (2, B)}. Then,\nafter the replication S = {((4, D), 1), ((4, D), 2), ((4, D), 3), ((3, C), 2), ((6, F), 1), ((2, B), 3)},\nwhere the first tuple ((4, D), 1) indicates that (4, D) appears in the first sample.\nCorrectness We show that samples returned by the algorithm correspond to samples drawn randomly\nfrom F^{n,m}_SWO. 
We argue that the samples returned by oblivious samples_SWO are identical to those drawn truly at random from F^{m,n}_SWO up to the key mapping m, and then show that m is injective and random in Appendix A. For every key j present in the drawn samples there is an element with key m(j) that is replicated r_j times and is associated with the sample ids of j. Hence, the returned samples, after being grouped, are exactly the drawn samples where every key j is substituted with the element with key m(j).

Security and performance  The adversary observes an oblivious shuffle, a scan where an element is read and an encrypted pair is written, another oblivious shuffle, and then a scan that reveals the sample identifiers. All patterns except for the revealing of the sample identifiers are independent of the data and the sampled keys. We argue security further in §A. The performance of oblivious SWO sampling is dominated by the two oblivious shuffles and the non-oblivious grouping; the replication scan has linear cost. Hence, our algorithm produces k samples in time O(cn) with private memory of size O(c√n). Since a non-oblivious version would require n accesses, our algorithm has a constant overhead for small c.

Observations  We note that if more than k samples of size m = n/k need to be produced, one can invoke the algorithm multiple times using different randomness. Furthermore, Algorithm 1 can produce samples of varying sizes m_1, m_2, .., m_k (n = Σ_i m_i) given as an input. The algorithm itself remains the same; however, in order to determine whether j is in sample i or not, samplemember(i, j) will check if ρ_i(j) ≤ m_i instead of ρ_i(j) ≤ m.

4.2 Oblivious Poisson sampling
Performing Poisson sampling obliviously requires hiding not only the access pattern but also the sizes of the samples. Since in the worst case a sample can be of size n, each sample would need to be padded to size n with dummy elements. Unfortunately, generating k samples each padded to size n is impractical. Though samples of size n are unlikely, revealing some upper bound on the sample size would affect the security of the algorithms relying on Poisson sampling.

Instead of padding to the worst case, we choose to hide the number of samples that are contained within an n-sized block of data (e.g., an epoch). In particular, our oblivious Poisson sampling returns S that consists of samples s_1, s_2, . . . , s_k′ where k′ ≤ k such that Σ_{i∈[1,k′]} |s_i| ≤ n. The security of sampling relies on hiding k′ and the boundaries between the samples, as otherwise an adversary could estimate the sample sizes.

The algorithm (presented in Appendix B) proceeds similarly to SWO except that every element, in addition to being associated with a sample id, also stores its position in the final S. The element and the sample id are kept private while the position is used to order the elements. It is then up to the queries that operate on the samples inside of a TEE (e.g., SGD computation) to use the sample id while scanning S to determine the sample boundaries. The use of samples_Poisson by the queries has to be done carefully, without revealing when a sample is actually used, as this would reveal the boundary (e.g., while reading the elements during an epoch, one needs to hide after which element the model is updated).

We assume that samples from F^{n,γ}_Poisson can be drawn efficiently and describe how in Appendix B. The algorithm relies on two functions that have access to the samples: getsamplesize(i) and getsamplepos(i, l), which return the size of the ith sample and the position of element l in the ith sample.
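This layout can be sketched in plaintext: each of the n keys enters a candidate sample independently with probability γ, and samples are kept while their total size fits in the n-sized block. The sketch below is ours (the name draw_poisson_layout is hypothetical; getsamplesize and getsamplepos follow the interface named above but operate on explicit key lists rather than oblivious state, and in the real algorithm k′ itself stays hidden):

```python
import random

def draw_poisson_layout(n, gamma, k, seed=0):
    """Draw k candidate Poisson samples over keys 1..n (each key kept
    independently with probability gamma), then keep the first k' of
    them whose combined size fits in the n-sized block."""
    rng = random.Random(seed)
    kept, total = [], 0
    for _ in range(k):
        s = [j for j in range(1, n + 1) if rng.random() < gamma]
        if total + len(s) > n:
            break                     # k' reached; hidden in the real algorithm
        kept.append(s)
        total += len(s)
    return kept

def getsamplesize(samples, i):
    """Size of the ith sample (1-based)."""
    return len(samples[i - 1])

def getsamplepos(samples, i, l):
    """Position of key l within sample i (1-based), or -1 if absent."""
    s = samples[i - 1]
    return s.index(l) + 1 if l in s else -1
```

With all kept sample sizes summing to at most n, a single n-sized scan can emit every replica at its assigned position without revealing where one sample ends and the next begins.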
The algorithm uses the former to compute k′ and creates replicas for samples with identifiers from 1 to k′. The other changes to Algorithm 1 are that S.append(enc(e), enc(i)) is substituted with i′