{"title": "Collecting Telemetry Data Privately", "book": "Advances in Neural Information Processing Systems", "page_first": 3571, "page_last": 3580, "abstract": "The collection and analysis of telemetry data from user's devices is routinely performed by many software companies. Telemetry collection leads to improved user experience but poses significant risks to users' privacy. Locally differentially private (LDP) algorithms have recently emerged as the main tool that allows data collectors to estimate various population statistics, while preserving privacy. The guarantees provided by such algorithms are typically very strong for a single round of telemetry collection, but degrade rapidly when telemetry is collected regularly. In particular, existing LDP algorithms are not suitable for repeated collection of counter data such as daily app usage statistics. In this paper, we develop new LDP mechanisms geared towards repeated collection of counter data, with formal privacy guarantees even after being executed for an arbitrarily long period of time. For two basic analytical tasks, mean estimation and histogram estimation, our LDP mechanisms for repeated data collection provide estimates with comparable or even the same accuracy as existing single-round LDP collection mechanisms. We conduct empirical evaluation on real-world counter datasets to verify our theoretical results. Our mechanisms have been deployed by Microsoft to collect telemetry across millions of devices.", "full_text": "Collecting Telemetry Data Privately\n\nBolin Ding, Janardhan Kulkarni, Sergey Yekhanin\n\nMicrosoft Research\n\n{bolind, jakul, yekhanin}@microsoft.com\n\nAbstract\n\nThe collection and analysis of telemetry data from user\u2019s devices is routinely\nperformed by many software companies. Telemetry collection leads to improved\nuser experience but poses signi\ufb01cant risks to users\u2019 privacy. 
Locally differentially private (LDP) algorithms have recently emerged as the main tool that allows data collectors to estimate various population statistics, while preserving privacy. The guarantees provided by such algorithms are typically very strong for a single round of telemetry collection, but degrade rapidly when telemetry is collected regularly. In particular, existing LDP algorithms are not suitable for repeated collection of counter data such as daily app usage statistics. In this paper, we develop new LDP mechanisms geared towards repeated collection of counter data, with formal privacy guarantees even after being executed for an arbitrarily long period of time. For two basic analytical tasks, mean estimation and histogram estimation, our LDP mechanisms for repeated data collection provide estimates with comparable or even the same accuracy as existing single-round LDP collection mechanisms. We conduct an empirical evaluation on real-world counter datasets to verify our theoretical results. Our mechanisms have been deployed by Microsoft to collect telemetry across millions of devices.\n\n1 Introduction\n\nCollecting telemetry data to make more informed decisions is commonplace. In order to meet users' privacy expectations, and in view of tightening privacy regulations (e.g., the European GDPR law), the ability to collect telemetry data privately is paramount. Counter data, e.g., daily app or system usage statistics reported in seconds, is a common form of telemetry. In this paper we are interested in algorithms that preserve users' privacy in the face of continuous collection of counter data, are accurate, and scale to populations of millions of users.\nRecently, differential privacy [10] (DP) has emerged as the de facto standard for privacy guarantees. 
In the context of telemetry collection, one typically considers algorithms that exhibit differential privacy in the local model [12, 14, 7, 5, 3, 18], also called the randomized response model [19], γ-amplification [13], or FRAPP [1]. These are randomized algorithms that are invoked on each user's device to turn the user's private value into a response that is communicated to the data collector, and that have the property that the likelihood of any specific output varies little with the input, thus providing users with plausible deniability. Guarantees offered by locally differentially private algorithms, although very strong in a single round of telemetry collection, quickly degrade when data is collected over time. This is a very challenging problem that limits the applicability of DP in many contexts.\nIn telemetry applications, privacy guarantees need to hold in the face of continuous data collection. An influential paper [12] proposed a framework based on memoization to tackle this issue. Their techniques allow one to extend single-round DP algorithms to continual data collection and protect users whose values stay constant or change very rarely. The key limitation of the work of [12] is that their approach cannot protect users' private numeric values with very small but frequent changes, making it inappropriate for collecting telemetry counters. In this paper, we address this limitation.\n\n31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.\n\nWe design mechanisms with formal privacy guarantees in the face of continuous collection of counter data. These guarantees are particularly strong when a user's behavior remains approximately the same, varies slowly, or varies around a small number of values over the course of data collection.\nOur results. 
Our contributions are threefold.\n1) We give simple 1-bit response mechanisms in the local model of DP for single-round collection of counter data for mean and histogram estimation. Our mechanisms are inspired by those in [19, 8, 7, 4], but allow for considerably simpler descriptions and implementations. Our experiments also demonstrate their performance in concrete settings.\n2) Our main technical contribution is a rounding technique called α-point rounding that borrows ideas from the approximation-algorithms literature [15, 2], and allows memoization to be applied in the context of private collection of counters. Our memoization scheme avoids substantial losses in accuracy or privacy, as well as unaffordable storage overhead. We give a rigorous definition of the privacy guarantees provided by our algorithms when the data is collected continuously for an arbitrarily long period of time. We also present empirical findings related to our privacy guarantees.\n3) Finally, our mechanisms have been deployed by Microsoft across millions of devices, starting with Windows Insiders in the Windows 10 Fall Creators Update, to protect users' privacy while collecting application usage statistics.\n\n1.1 Preliminaries and problem formulation\n\nIn our setup, there are n users, and each user at time t has a private (integer or real-valued) counter with value xi(t) ∈ [0, m]. A data collector wants to collect these counter values {xi(t)}i∈[n] at each time stamp t to do statistical analysis. For example, in telemetry analysis, understanding the mean and the distribution of counter values (e.g., app usage) is very important to IT companies.\nLocal model of differential privacy (LDP). 
Users do not need to trust the data collector and require formal privacy guarantees before they are willing to communicate their values to the data collector. Hence, the more widely studied centralized DP model [10, 11], which first collects all users' data and then injects noise in the analysis step, is not applicable in our setup. In this work, we adopt the local model of differential privacy, where each user randomizes private data using a randomized algorithm (mechanism) A locally before sending it to the data collector.\nDefinition 1 ([13, 8, 4]). A randomized algorithm A : V → Z is ε-locally differentially private (ε-LDP) if for any pair of values v, v′ ∈ V and any subset of outputs S ⊆ Z, we have that\n\nPr[A(v) ∈ S] ≤ e^ε · Pr[A(v′) ∈ S].\n\nLDP formalizes a type of plausible deniability: no matter what output is released, it is approximately equally likely to have come from one point v ∈ V as from any other. For alternate interpretations of differential privacy within the framework of hypothesis testing, we refer the reader to [20, 7].\nStatistical estimation problems. We focus on two estimation problems in this paper.\nMean estimation: For each time stamp t, the data collector wants to obtain an estimate σ̂(t) of the mean of xt = ⟨xi(t)⟩i∈[n], i.e., σ(xt) = (1/n) · Σi∈[n] xi(t). We do worst-case analysis and aim to bound the absolute error |σ̂(t) − σ(xt)| for any input xt ∈ [0, m]^n. 
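Definition 1 can be checked mechanically for any mechanism with a finite output domain, since it suffices to verify the inequality on singleton sets S = {z} (probabilities of larger sets are sums of singletons). A minimal sketch (our own illustration, not part of the paper), using classical binary randomized response:

```python
import math

def is_eps_ldp(probs, eps):
    """Check Definition 1 for a discrete mechanism.

    probs[v][z] = Pr[A(v) = z]. For discrete outputs it suffices to check
    singleton sets S = {z}: if every singleton satisfies the inequality,
    every subset does, since subset probabilities are sums of singletons.
    """
    for pv in probs.values():
        for pw in probs.values():
            for z in pv:
                if pv[z] > math.exp(eps) * pw[z] + 1e-12:
                    return False
    return True

# Classical binary randomized response: report the true bit with
# probability e^eps / (e^eps + 1), and the flipped bit otherwise.
eps = math.log(3)
p = math.exp(eps) / (math.exp(eps) + 1)  # = 3/4 for eps = ln 3
rr = {0: {0: p, 1: 1 - p}, 1: {0: 1 - p, 1: p}}

print(is_eps_ldp(rr, eps))       # True: the worst-case ratio is exactly e^eps
print(is_eps_ldp(rr, eps / 2))   # False: a smaller budget is violated
```

Here the worst-case likelihood ratio is exactly e^ε, so the check passes at budget ε and fails at any smaller budget.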
In the rest of the paper, we abuse notation and write σ(t) for σ(xt) for a fixed input xt.\nHistogram estimation: Suppose the domain of counter values is partitioned into k buckets (e.g., with equal widths), and a counter value xi(t) ∈ [0, m] can be mapped to a bucket number vi(t) ∈ [k]. For each time stamp t, the data collector wants to estimate the frequency of each v ∈ [k], ht(v) = (1/n) · |{i : vi(t) = v}|, as ĥt(v). The error of a histogram estimate is measured by maxv∈[k] |ĥt(v) − ht(v)|. Again, we do worst-case analysis of our algorithm over all possible inputs vt = ⟨vi(t)⟩i∈[n] ∈ [k]^n.\n\n1.2 Repeated collection and overview of privacy framework\n\nPrivacy leakage in repeated data collection. Although LDP is a very strict notion of privacy, its effectiveness decreases if the data is collected repeatedly. If we collect the counter values of a user i for T time stamps by executing an ε-LDP mechanism A independently at each time stamp, the sequence xi(1) xi(2) . . . xi(T) can only be guaranteed indistinguishable from another sequence of counter values, x′i(1) x′i(2) . . . x′i(T), by a factor of up to e^(T·ε), which is too large to be reasonable as T increases. Hence, in applications such as telemetry, where data is collected continuously, privacy guarantees provided by an LDP mechanism for a single round of data collection are not sufficient. We formalize our privacy guarantee to enhance LDP for repeated data collection later in Section 3. However, intuitively, we ensure that every user blends with a large set of other users who have very different behaviors.\nOur Privacy Framework and Guarantees. Our framework for repeated private collection of counter data follows a similar outline to the framework used in [12]. 
Our framework for mean and histogram estimation has four main components:\n1) Important building blocks for our overall solution are 1-bit mechanisms that provide ε-LDP guarantees and good accuracy for a single round of data collection (Section 2).\n2) An α-point rounding scheme to randomly discretize users' private values prior to applying memoization (to conceal small changes) while keeping the expectation of the discretized values intact (Section 3).\n3) Memoization of discretized values using the 1-bit mechanisms to avoid privacy leakage from repeated data collection (Section 3). In particular, if the counter value of a user remains approximately consistent, then the user is guaranteed ε-differential privacy even after many rounds of data collection.\n4) Finally, output perturbation (instantaneous noise in [12]) to avoid exposing the transition points due to large changes in a user's behavior, and to mitigate attacks based on auxiliary information (Section 4).\nIn Sections 2, 3 and 4, we formalize these guarantees, focusing predominantly on the mean estimation problem. All the omitted proofs and additional experimental results are in the full version on the arXiv [6].\n\n2 Single-round LDP mechanisms for mean and histogram estimation\n\nWe first describe our 1-bit LDP mechanisms for mean and histogram estimation. Our mechanisms are inspired by the works of Duchi et al. [8, 7, 9] and Bassily and Smith [4]. However, our mechanisms are tuned for more efficient communication (by sending 1 bit for each counter each time) and stronger protection in repeated data collection (introduced later in Section 3). To the best of our knowledge, the exact form of the mechanisms presented in this section was not known. 
Our algorithms yield accuracy gains in concrete settings (see Section 5) and are easy to understand and implement.\n\n2.1 1-Bit mechanism for mean estimation\n\nCollection mechanism 1BitMean: When the collection of counter xi(t) at time t is requested by the data collector, each user i sends one bit bi(t), which is independently drawn from the distribution:\n\nbi(t) = 1 with probability 1/(e^ε + 1) + (xi(t)/m) · (e^ε − 1)/(e^ε + 1), and 0 otherwise.   (1)\n\nMean estimation. The data collector obtains the bits {bi(t)}i∈[n] from the n users and estimates σ(t) as\n\nσ̂(t) = (m/n) · Σ_{i=1}^{n} (bi(t) · (e^ε + 1) − 1)/(e^ε − 1).   (2)\n\nThe basic randomizer of [4] is equivalent to our 1-bit mechanism for the case when each user takes values either 0 or m. The above mechanism can also be seen as a simplification of the multidimensional mean-estimation mechanism given in [7]. For 1-dimensional mean estimation, Duchi et al. [7] show that the Laplace mechanism is asymptotically optimal for the minimax error. However, the communication cost per user in the Laplace mechanism is Ω(log m) bits, and our experiments show it also leads to a larger error compared to our 1-bit mechanism. We prove the following results for the above 1-bit mechanism.\nTheorem 1. For single-round data collection, the mechanism 1BitMean in (1) preserves ε-LDP for each user. Upon receiving the n bits {bi(t)}i∈[n], the data collector can then estimate the mean of the counters from the n users as σ̂(t) in (2). 
With probability at least 1 − δ, we have\n\n|σ̂(t) − σ(t)| ≤ (m/√(2n)) · ((e^ε + 1)/(e^ε − 1)) · √(log(2/δ)).\n\n2.2 d-Bit mechanism for histogram estimation\n\nNow we consider the problem of estimating histograms of counter values in a discretized domain with k buckets while guaranteeing LDP. This problem has an extensive literature both in computer science and statistics, and dates back to the seminal work of Warner [19]; we refer the reader to the following excellent papers [16, 8, 4, 17] for more information. Recently, Bassily and Smith [4] gave asymptotically tight results for the problem in the worst-case model, building on the works of [16]. On the other hand, Duchi et al. [8] introduce a mechanism by adapting Warner's classical randomized response mechanism in [19], which is shown to be optimal for the statistical minimax regret if one does not care about the cost of communication. Unfortunately, some ideas in Bassily and Smith [4], such as the Johnson-Lindenstrauss lemma, do not scale to population sizes of millions of users. Therefore, in order to have a smooth trade-off between accuracy and communication cost (as well as the ability to protect privacy in repeated data collection, which will be introduced in Section 3), we introduce a modified version of Duchi et al.'s mechanism [8] based on subsampling by buckets.\nCollection mechanism dBitFlip: Each user i randomly draws d bucket numbers without replacement from [k], denoted by j1, j2, . . . , jd. When the collection of the discretized bucket number vi(t) ∈ [k] at time t is requested by the data collector, each user i sends a vector\n\nbi(t) = [(j1, bi,j1(t)), (j2, bi,j2(t)), . . . , (jd, bi,jd(t))], where bi,jp(t) is a random 0-1 bit with\n\nPr[bi,jp(t) = 1] = e^(ε/2)/(e^(ε/2) + 1) if vi(t) = jp, and 1/(e^(ε/2) + 1) if vi(t) ≠ jp, for p = 1, 2, . . . , d.\n\nUnder the same public coin model as in [4], each user i only needs to send to the data collector the d bits bi,j1(t), bi,j2(t), . . ., bi,jd(t) in bi(t), as j1, j2, . . . , jd can be generated using public coins.\nHistogram estimation. The data collector estimates the histogram ht as: for v ∈ [k],\n\nĥt(v) = (k/(nd)) · Σ_{i : bi,v(t) is received} (bi,v(t) · (e^(ε/2) + 1) − 1)/(e^(ε/2) − 1).   (3)\n\nWhen d = k, dBitFlip is exactly the same as the mechanism in Duchi et al. [8]. The privacy guarantee is straightforward. In terms of accuracy, the intuition is that for each bucket v ∈ [k], there are roughly nd/k users responding with a 0-1 bit bi,v(t). We can prove the following result.\nTheorem 2. For single-round data collection, the mechanism dBitFlip preserves ε-LDP for each user. Upon receiving the d bits {bi,jp(t)}p∈[d] from each user i, the data collector can then estimate the histogram ht as ĥt in (3). With probability at least 1 − δ, we have\n\nmax_{v∈[k]} |ht(v) − ĥt(v)| ≤ √(5k/(nd)) · ((e^(ε/2) + 1)/(e^(ε/2) − 1)) · √(log(6k/δ)) ≤ O(√(k·log(k/δ)/(ε²·n·d))).\n\n3 Memoization for continual collection of counter data\n\nOne important concern regarding the use of ε-LDP algorithms (e.g., in Section 2.1) to collect counter data pertains to the privacy leakage that may occur if we collect a user's data repeatedly (say, daily) and the user's private value xi does not change or changes little. Depending on the value of ε, after a number of rounds, the data collector will have enough noisy reads to estimate xi with high accuracy.\nMemoization [12] is a simple rule: at the account setup phase, each user pre-computes and stores his responses to the data collector for all possible values of the private counter. 
At data collection time, users do not use fresh randomness, but respond with the pre-computed responses corresponding to their current counter values. Memoization (to a certain degree) takes care of situations where the private value xi stays constant. Note that the use of memoization violates differential privacy in continual collection. If memoization is employed, the data collector can easily distinguish a user whose value keeps changing from a user whose value is constant, no matter how small ε is. However, the privacy leakage is limited. When the data collector observes that a user's response has changed, this only indicates that the user's value has changed, but not what it was and not what it is.\nAs observed in [12, Section 1.3], using the memoization technique in the context of collecting counter data is problematic for the following reason. Often, from day to day, private values xi do not stay constant, but rather experience small changes (e.g., one can think of app usage statistics reported in seconds). Note that naively using memoization adds no additional protection for a user whose private value varies but stays approximately the same, as the data collector would observe many independent responses corresponding to it.\nOne naive way to fix the issue above is to use discretization: pick a large integer (segment size) s that divides m; consider the partition of all integers into segments [ℓs, (ℓ + 1)s]; and have each user report his value after rounding the true value xi to the mid-point of the segment that xi belongs to. This approach takes care of the leakage caused by small changes to xi, as users' values would now tend to stay within a single segment and thus trigger the same memoized response; however, the accuracy loss may be extremely large. 
For instance, in a population where all xi are ℓs + 1 for some ℓ, after rounding every user would be responding based on the value ℓs + s/2.\nIn the following subsection we present a better (randomized) rounding technique (termed α-point rounding) that has been previously used in the approximation-algorithms literature [15, 2] and rigorously addresses the issues discussed above. We first consider the mean estimation problem.\n\n3.1 α-point rounding for mean estimation\n\nThe key idea of rounding is to discretize the domain where users' counters take their values. Discretization reduces the domain size, so users that behave consistently take fewer distinct values, which allows us to apply memoization to get a strong privacy guarantee.\nAs we demonstrated above, discretization may be particularly detrimental to accuracy when users' private values are correlated. We propose addressing this issue by making the discretization rule independent across different users. This ensures that when (say) all users have the same value, some users round it up and some round it down, facilitating a smaller accuracy loss.\nWe are now ready to specify the algorithm that extends the basic algorithm 1BitMean and employs both α-point rounding and memoization. We assume that counter values range in [0, m].\n\n1. At the algorithm design phase, we specify an integer s (our discretization granularity). We assume that s divides m. We suggest setting s rather large compared to m, say s = m/20 or even s = m, depending on the particular application domain.\n\n2. At the setup phase, each user i ∈ [n] independently at random picks a value αi ∈ {0, . . . , s − 1}, which is used to specify the rounding rule.\n\n3. User i invokes the basic algorithm 1BitMean with range m to compute and memoize 1-bit responses to the data collector for all m/s + 1 values xi in the arithmetic progression\n\nA = {ℓs}_{0 ≤ ℓ ≤ m/s}.   (4)\n\n4. Consider a user i with private value xi who receives a data collection request. Let xi ∈ [L, R), where L, R are the two neighboring elements of the arithmetic progression {ℓs}_{0 ≤ ℓ ≤ m/s}. The user rounds the value to L if xi + αi < R; otherwise, the user rounds the value to R. Let yi denote the value of the user after rounding. In each round, the user responds with the memoized bit for the value yi. Note that the rounding is always uniquely defined.\n\nPerhaps a bit surprisingly, using α-point rounding does not lead to additional accuracy losses, independent of the choice of the discretization granularity s.\nTheorem 3. Independent of the value of the discretization granularity s, at any round of data collection, each output bit bi is still sampled according to the distribution given by formula (1). Therefore, the algorithm above provides the same accuracy guarantees as given in Theorem 1.\n\n3.2 Privacy definition using permanent memoization\n\nIn what follows we detail the privacy guarantees provided by an algorithm that employs α-point rounding and memoization in conjunction with the ε-LDP 1-bit mechanism of Section 2.1, against a data collector that receives a very long stream of a user's responses to data collection events.\nLet U be a user and x(1), . . . , x(T) be the sequence of U's private counter values. Given the user's private value αi, each x(j), j ∈ [T], gets rounded to a corresponding value y(j) in the set A (defined by (4)) according to the rule given in Section 3.1.\nDefinition 2. Let B be the space of all sequences {z(j)}j∈[T] ∈ A^T, considered up to an arbitrary permutation of the elements of A. 
We define the behavior pattern b(U) of the user U to be the element of B corresponding to {y(j)}j∈[T]. We refer to the number of distinct elements y(j) in the sequence {y(j)}j∈[T] as the width of b(U).\n\nWe now discuss our notion of behavior pattern, using counters that carry daily app usage statistics as an example. Intuitively, users map to the same behavior pattern if they have the same number of different modes (approximate counter values) of using the app, and switch between these modes on the same days. For instance, one user who uses an app for 30 minutes on weekdays, 2 hours on weekends, and 6 hours on holidays, and another user who uses the app for 4 hours on weekdays, 10 minutes on weekends, and does not use it on holidays will likely map to the same behavior pattern.\nObserve, however, that the mapping from actual private counter values {x(j)} to behavior patterns is randomized, so some users with identical private usage profiles may map to different behavior patterns. This is a positive feature of Definition 2 that increases entropy among users with the same behavior pattern.\nThe next theorem shows that the algorithm of Section 3.1 makes users with the same behavior pattern blend with each other from the viewpoint of the data collector (in the sense of differential privacy).\nTheorem 4. Consider users U and V with sequences of private counter values {xU(1), . . . , xU(T)} and {xV(1), . . . , xV(T)}. 
Assume that both U and V respond at T data-collection time stamps using the algorithm presented in Section 3.1, and that b(U) = b(V) with the width of b(U) equal to w. Let sU, sV ∈ {0, 1}^T be the random sequences of responses generated by users U and V; then for any binary string s ∈ {0, 1}^T in the response domain, we have\n\nPr[sU = s] ≤ e^(wε) · Pr[sV = s].   (5)\n\n3.2.1 Setting parameters\n\nThe ε-LDP guarantee provided by Theorem 4 ensures that each user is indistinguishable from other users with the same behavior pattern (in the sense of LDP). The exact shape of behavior patterns is governed by the choice of the parameter s. Setting s very large, say s = m or s = m/2, reduces the number of possible behavior patterns and thus increases the number of users that blend by mapping to a particular behavior pattern. It also yields a stronger guarantee for blending within a pattern, since for all users U the width of b(U) is necessarily at most m/s + 1, and thus by Theorem 4 the likelihood of distinguishing users within a pattern is trivially at most e^((m/s + 1)·ε). At the same time, there are cases where one can justify using smaller values of s. In fact, consistent users, i.e., users whose private counter always lands in the vicinity of one of a small number of fixed values, enjoy a strong LDP guarantee within their patterns irrespective of s (provided it is not too small), and a smaller s may be advantageous for avoiding certain attacks based on auxiliary information, as the set of all possible values of a private counter xi that lead to a specific output bit b is potentially more complex.\nFinally, it is important to stress that the ε-LDP guarantee established in Theorem 4 is not a panacea; in particular, it is a weaker guarantee, provided in a much more challenging setting, than the ε-LDP guarantee across all users that we provide for a single round of data collection (an easier setting). 
In particular, while LDP across the whole population of users is resilient to any attack based on auxiliary information, LDP across a subpopulation may be vulnerable to such attacks, and additional levels of protection may need to be applied. In particular, if the data collector observes that a user's response has changed, the data collector knows with certainty that the user's true counter value has changed. In the case of app usage telemetry, this implies that the app has been used on one of the days. This attack is partly mitigated by the output perturbation technique that is discussed in Section 4.\n\nFigure 1: Distribution of pattern supports for App A\n\n3.2.2 Experimental study\n\nWe use a real-world dataset of 3 million users with their daily usage of an app (App A), collected (in seconds) over a continuous period of 31 days, to demonstrate the mapping of users to behavior patterns in Figure 1. See the full version of the paper for the usage patterns of more apps. For each behavior pattern (Definition 2), we calculate its support as the number of users whose sequences are in this pattern. All the patterns' supports sup are plotted (y-axis) in decreasing order, and we can also calculate the percentage of users (x-axis) in patterns with supports at least sup. We vary the parameter s in permanent memoization from m (maximizing blending) to m/3 and report the corresponding distributions of pattern supports in Figure 1.\nIt is not hard to see that, theoretically, for every behavior pattern there is a very large set of sequences of private counter values {x(t)}t that may map to it (depending on αi). Real data (Figure 1) provides evidence that users tend to be approximately consistent, and therefore simpler patterns, i.e., patterns that mostly stick to a single rounded value y(t) = y, correspond to larger sets of sequences {xi(t)}t obtained from a real population. 
In particular, for each app there is always one pattern (corresponding to having one fixed y(t) = y across all 31 days) which blends the majority of users (> 2 million). More complex behavior patterns have fewer users mapping to them. In particular, there are always some lonely users (1%-5%, depending on s) who land in patterns that have a support size of one or two. From the viewpoint of a data collector, such users can only be identified as having a complex and irregular behavior; however, by Theorem 4, the actual nature of that behavior remains uncertain.\n\n3.3 Example\n\nOne specific example of a counter collection problem that was identified in [12, Section 1.3] as being non-suitable for the techniques presented in [12], but that can be easily solved using our methods, is to repeatedly collect age in days from a population of users. When we set s = m and apply the algorithm of Section 3.1, we can collect such data for T rounds with high accuracy. Each user necessarily responds with a sequence of bits of the form z^t ∘ z̄^(T−t), where 0 ≤ t ≤ T. Thus the data collector only gets to learn the transition point, i.e., the day when the user's age in days passes the value m − αi, which is safe from a privacy perspective as αi is picked uniformly at random by the user.\n\n3.4 Continual collection for histogram estimation using permanent memoization\n\nNaive memoization. α-point rounding is not suitable for histogram estimation, as counter values have already been mapped to k buckets. The single-round LDP mechanism in Duchi et al. [8] sends a 0-1 random response for each bucket: send 1 with probability e^(ε/2)/(e^(ε/2) + 1) if the value is in this bucket, and with probability 1/(e^(ε/2) + 1) if not. This mechanism is ε-LDP. Each user can then memoize a mapping fk : [k] → {0, 1}^k by running this mechanism once for each v ∈ [k], and always respond fk(v) if the user's value is in bucket v. 
However, this memoization scheme leads to serious privacy leakage: with some auxiliary information, one can infer with high confidence a user's value from the response produced by the mechanism. More concretely, if the data collector knows that the app usage value is in a bucket v and observes the output fk(v) = z on some day, then whenever the user sends z again in the future, the data collector can infer that the bucket number is v with almost 100% probability.\n\n(a) Mean (n = 0.3 × 10^6)   (b) Mean (n = 3 × 10^6)   (c) Histogram (n = 0.3 × 10^6)\nFigure 2: Comparison of mechanisms for mean and histogram estimations on real-world datasets\n\nd-bit memoization. To avoid such privacy leakage, we memoize based on our d-bit mechanism dBitFlip (Section 2.2). Each user runs dBitFlip for each v ∈ [k], with responses created on d buckets j1, j2, . . . , jd (randomly drawn and then fixed per user), and memoizes the responses in a mapping fd : [k] → {0, 1}^d. A user will always send fd(v) if the bucket number is v. This mechanism is denoted by dBitFlipPM, and the same estimator (3) can be used to estimate the histogram upon receiving the d-bit response from every user. This scheme avoids the privacy leakage that arises from the naive memoization, because multiple (Ω(k/2^d) w.h.p.) buckets are mapped to the same response. This protection is the strongest when d = 1. 
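As a concrete illustration, the memo table of the d-bit memoization just described can be sketched as follows (a minimal sketch of our own; names, 0-indexed buckets, and structure are ours, not the deployed implementation):

```python
import math
import random

def make_dbitflip_user(k, d, eps, rng):
    """Sketch of dBitFlipPM for one user.

    The user fixes d of the k buckets once, then memoizes one d-bit
    response per possible bucket value v and replays it on every request,
    so identical values always yield identical messages.
    """
    buckets = rng.sample(range(k), d)  # j_1..j_d, drawn once and then fixed
    p_hi = math.exp(eps / 2) / (math.exp(eps / 2) + 1)  # bit = 1 if v is this bucket
    p_lo = 1 / (math.exp(eps / 2) + 1)                  # bit = 1 otherwise
    table = {
        v: tuple(int(rng.random() < (p_hi if v == j else p_lo)) for j in buckets)
        for v in range(k)
    }
    return buckets, (lambda v: table[v])  # respond straight from the memo table

rng = random.Random(0)
buckets, respond = make_dbitflip_user(k=32, d=4, eps=1.0, rng=rng)
assert respond(7) == respond(7)  # repeated collection: always the same answer
assert len(respond(0)) == 4      # only d bits are sent (public-coin model)
```

Replaying the memoized d-bit response is what limits leakage across rounds: many buckets share each d-bit answer, so observing a repeat does not pin down the value.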
De\ufb01nition 2 about behavior patterns and Theorem 4 can\nbe generalized here to provide similar privacy guarantee in continual data collection.\n\n4 Output perturbation\n\nOne of the limitations of our memoization approach based on \u03b1-point rounding is that it does not\nprotect the points of time where user\u2019s behavior changes signi\ufb01cantly. Consider a user who never\nuses an app for a long time, and then starts using it. When this happens, suppose the output produced\nby our algorithm changes from 0 to 1. Then the data collector can learn with certainty that the user\u2019s\nbehavior changed, (but not what this behavior was or what it became). Output perturbation is one\npossible mechanism of protecting the exact location of the points of time where user\u2019s behavior\nhas changed. As mentioned earlier, output perturbation was introduced in [12] as a way to mitigate\nprivacy leakage that arises due to memoization. The main idea behind output perturbation is to \ufb02ip the\noutput of memoized responses with a small probability 0 \u2264 \u03b3 \u2264 0.5. This ensures that data collector\nwill not be able to learn with certainty that the behavior of a user changes at certain time stamps. In\nthe full version of the paper we formalize this notion, and prove accuracy and privacy guarantees\nwith output perturbation. Here we contain ourselves to mentioning that using output perturbation\nwith a positive \u03b3, in combination with the \u0001-LDP 1BitMean algorithm in Section 2 is equivalent to\ninvoking the 1BitMean algorithm with \u0001(cid:48) = ln\n\n(cid:18) (1\u22122\u03b3)( e\u0001\n\n(cid:19)\n\n.\n\n(1\u22122\u03b3)(\n\ne\u0001+1 )+\u03b3\ne\u0001+1 )+\u03b3\n\n1\n\n5 Empirical evaluation\n\nWe compare our mechanisms (with permanent memoization) for mean and histogram estimation with\nprevious mechanisms for one-time data collection. 
We would like to emphasize that the goal of these experiments is to show that our mechanisms, with such additional protection, are no worse than or comparable to the state-of-the-art LDP mechanisms in terms of estimation accuracy.
We first use the real-world dataset described in Section 3.2.2.
Mean estimation. We implement our 1-bit mechanism (Section 2.1) with α-point Randomized Rounding and Permanent Memoization for repeated collection (Section 3), denoted by 1BitRRPM, and output perturbation to enhance protection for usage changes (Section 4), denoted by 1BitRRPM+OP(γ). We compare it with the Laplace mechanism for LDP mean estimation in [8, 9], denoted by Laplace. We vary the value of ε (ε = 0.1-10) and the number of users (n = 0.3, 3 × 10^6, by randomly picking subsets of all users), and run all the mechanisms 3000 times on 31-day usage data with three counters. The domain size is m = 24 hours. The average absolute errors (in seconds), with one standard deviation (STD), are reported in Figures 2(a)-2(b). 1BitRRPM is consistently better than Laplace, with smaller errors and narrower STDs. Even with a perturbation probability γ = 1/10, they are comparable in accuracy. When γ = 1/3, output perturbation is equivalent to adding additional uniform noise from [0, 24 hours] independently on each day. Even in this case, 1BitRRPM+OP(1/3) gives us tolerable accuracy when the number of users is large.
Histogram estimation. We create k = 32 buckets on [0, 24 hours] with even widths to evaluate mechanisms for histogram estimation.
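As a concrete reference for this setup, here is a minimal simulation of one round of d-bit histogram collection with k = 32. The report probabilities follow the description in Section 3.4; the estimator is our own unbiased reconstruction of an estimator of the form (3), not the paper's code, and all names and parameter values are illustrative:

```python
import math
import random

def dbitflip_report(v, k, d, epsilon, rng):
    """One dBitFlip report: the user draws d of the k buckets and sends one
    noisy bit per drawn bucket (probabilities as described in Section 3.4)."""
    s = math.exp(epsilon / 2)
    buckets = rng.sample(range(k), d)
    bits = [int(rng.random() < (s / (s + 1) if j == v else 1 / (s + 1)))
            for j in buckets]
    return buckets, bits

def estimate_histogram(reports, k, d, epsilon, n):
    """Unbiased frequency estimates from the d-bit reports (our
    reconstruction; each per-bucket term has expectation 1 iff the user's
    value is in that bucket, and the k/(d*n) factor corrects for sampling)."""
    s = math.exp(epsilon / 2)
    h = [0.0] * k
    for buckets, bits in reports:
        for j, b in zip(buckets, bits):
            h[j] += (b * (s + 1) - 1) / (s - 1)
    return [x * k / (d * n) for x in h]

rng = random.Random(0)
k, d, eps, n = 32, 4, 2.0, 50000
true = [rng.randrange(k) for _ in range(n)]            # synthetic bucket values
reports = [dbitflip_report(v, k, d, eps, rng) for v in true]
est = estimate_histogram(reports, k, d, eps, n)        # est[v] ~ frequency of v
```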
We implement our d-bit mechanism (Section 2.2) with permanent memoization for repeated collection (Section 3.4), denoted by dBitFlipPM. In order to provide protection on usage change in repeated collection, we use d = 1, 2, 4 (strongest when d = 1). We compare it with state-of-the-art one-time mechanisms for histogram estimation: BinFlip [8, 9], KFlip (k-RR in [17]), and BinFlip+ (applying the generic protocol with 1-bit reports in [4] on BinFlip). When d = k, dBitFlipPM has the same accuracy as BinFlip. KFlip is sub-optimal for small ε [17] but has better performance when ε is Ω(ln k). In contrast, BinFlip+ has good performance when ε ≤ 2. We repeat the experiment 3000 times and report the average histogram error (i.e., the maximum error across all bars in a histogram) with one standard deviation for different algorithms in Figure 2(c), with ε = 0.1-10 and n = 0.3 × 10^6, to confirm the above theoretical results. BinFlip (equivalently, 32BitFlipPM) has the best accuracy overall. With enhanced privacy protection in repeated data collection, 4BitFlipPM is comparable to the one-time collection mechanism KFlip when ε is small (0.1-0.5), and 4BitFlipPM through 1BitFlipPM are better than BinFlip+ when ε is large (5-10).

Figure 3: Mechanisms for mean and histogram estimations on different distributions (n = 0.3 × 10^6): (a) Mean (constant distribution); (b) Mean (uniform distribution); (c) Histogram (normal distribution).

On different data distributions.
We have shown that errors in mean and histogram estimations can be bounded (Theorems 1-2) in terms of ε and the number of users n, together with the number of buckets k and the number of bits d (applicable only to histograms). We now conduct additional experiments on synthetic datasets to verify that the empirical errors do not change much across data distributions. Three types of distributions are considered: i) constant distribution, i.e., each user i has a counter x_i(t) = 12 (hours) all the time; ii) uniform distribution, i.e., x_i(t) ∼ U(0, 24); and iii) normal distribution, i.e., x_i(t) ∼ N(12, 2^2) (with mean equal to 12 and standard deviation equal to 2), truncated to [0, 24]. Three synthetic datasets are created by drawing samples of size n = 0.3 × 10^6 from these three distributions. Some results are plotted in Figure 3: the empirical errors on different distributions are almost the same as those in Figures 2(a) and 2(c). One can refer to the full version of the paper [6] for the complete set of charts.

6 Deployment

In earlier sections, we presented new LDP mechanisms geared towards repeated collection of counter data, with formal privacy guarantees even after being executed for a long period of time. Our mean estimation algorithm has been deployed by Microsoft starting with Windows Insiders in the Windows 10 Fall Creators Update. The algorithm is used to collect the number of seconds that a user has spent using a particular app. Data collection is performed every 6 hours, with ε = 1. Memoization is applied across days, and output perturbation uses γ = 0.2.
According to Section 4, this makes a single round of data collection satisfy ε′-DP with ε′ = 0.686.
One important feature of our deployment is that collecting usage data for multiple apps from a single user only leads to a minor additional privacy loss that is independent of the actual number of apps. Intuitively, this happens because we are collecting active usage data, and the total number of seconds that a user can spend across multiple apps in 6 hours is bounded by an absolute constant that is independent of the number of apps.
Theorem 5. Using the 1BitMean mechanism with a parameter ε′ to simultaneously collect t counters x_1, ..., x_t, where each x_i satisfies 0 ≤ x_i ≤ m and ∑_i x_i ≤ m, preserves ε″-DP, where

ε″ = ε′ + e^{ε′} − 1.

We defer the proof to the full version of the paper [6]. By Theorem 5, in deployment, a single round of data collection across an arbitrarily large number of apps satisfies ε″-DP, where ε″ = 1.672.

References

[1] S. Agrawal and J. R. Haritsa. A framework for high-accuracy privacy-preserving mining. In ICDE, pages 193-204, 2005.

[2] N. Bansal, D. Coppersmith, and M. Sviridenko. Improved approximation algorithms for broadcast scheduling. SIAM Journal on Computing, 38(3):1157-1174, 2008.

[3] R. Bassily, K. Nissim, U. Stemmer, and A. Thakurta. Practical locally private heavy hitters. In NIPS, 2017.

[4] R. Bassily and A. D. Smith. Local, private, efficient protocols for succinct histograms. In STOC, pages 127-135, 2015.

[5] R. Bassily, A. D.
Smith, and A. Thakurta. Private empirical risk minimization: Efficient algorithms and tight error bounds. In FOCS, pages 464-473, 2014.

[6] B. Ding, J. Kulkarni, and S. Yekhanin. Collecting telemetry data privately. arXiv, 2017.

[7] J. C. Duchi, M. I. Jordan, and M. J. Wainwright. Local privacy and statistical minimax rates. In FOCS, pages 429-438, 2013.

[8] J. C. Duchi, M. J. Wainwright, and M. I. Jordan. Local privacy and minimax bounds: Sharp rates for probability estimation. In NIPS, pages 1529-1537, 2013.

[9] J. C. Duchi, M. J. Wainwright, and M. I. Jordan. Minimax optimal procedures for locally private estimation. CoRR, abs/1604.02390, 2016.

[10] C. Dwork, F. McSherry, K. Nissim, and A. Smith. Calibrating noise to sensitivity in private data analysis. In TCC, pages 265-284, 2006.

[11] C. Dwork, A. Roth, et al. The algorithmic foundations of differential privacy. Foundations and Trends in Theoretical Computer Science, 9(3-4):211-407, 2014.

[12] Ú. Erlingsson, V. Pihur, and A. Korolova. RAPPOR: Randomized aggregatable privacy-preserving ordinal response. In CCS, pages 1054-1067, 2014.

[13] A. Evfimievski, J. Gehrke, and R. Srikant. Limiting privacy breaches in privacy preserving data mining. In PODS, pages 211-222, 2003.

[14] G. C. Fanti, V. Pihur, and Ú. Erlingsson. Building a RAPPOR with the unknown: Privacy-preserving learning of associations and data dictionaries. PoPETs, 2016(3):41-61, 2016.

[15] M. X. Goemans, M. Queyranne, A. S. Schulz, M. Skutella, and Y. Wang. Single machine scheduling with release dates. SIAM Journal on Discrete Mathematics, 15(2):165-192, 2002.

[16] J. Hsu, S. Khanna, and A. Roth. Distributed private heavy hitters. In ICALP, pages 461-472, 2012.

[17] P. Kairouz, K. Bonawitz, and D. Ramage. Discrete distribution estimation under local privacy. In ICML, 2016.

[18] J. Tang, A.
Korolova, X. Bai, X. Wang, and X. Wang. Privacy loss in Apple's implementation of differential privacy on MacOS 10.12. arXiv:1709.02753, 2017.

[19] S. L. Warner. Randomized response: A survey technique for eliminating evasive answer bias. Journal of the American Statistical Association, 60(309):63-69, 1965.

[20] L. Wasserman and S. Zhou. A statistical framework for differential privacy. Journal of the American Statistical Association, 105(489):375-389, 2010.