{"title": "Sampling Sketches for Concave Sublinear Functions of Frequencies", "book": "Advances in Neural Information Processing Systems", "page_first": 1363, "page_last": 1373, "abstract": "We consider massive distributed datasets that consist of elements modeled as key-value pairs and the task of computing statistics or aggregates where the contribution of each key is weighted by a function of its frequency (sum of values of its elements). This fundamental problem has a wealth of applications in data analytics and machine learning, in particular, with concave sublinear functions of the frequencies that mitigate the disproportionate effect of keys with high frequency. The family of concave sublinear functions includes low frequency moments ($p \\leq 1$), capping, logarithms, and their compositions. A common approach is to sample keys, ideally, proportionally to their contributions and estimate statistics from the sample. A simple but costly way to do this is by aggregating the data to produce a table of keys and their frequencies, apply our function to the frequency values, and then apply a weighted sampling scheme. Our main contribution is the design of composable sampling sketches that can be tailored to any concave sublinear function of the frequencies. Our sketch structure size is very close to the desired sample size and our samples provide statistical guarantees on the estimation quality that are very close to that of an ideal sample of the same size computed over aggregated data. 
Finally, we demonstrate experimentally the simplicity and effectiveness of our methods.", "full_text": "Sampling Sketches for\n\nConcave Sublinear Functions of Frequencies\n\nEdith Cohen\n\nGoogle Research, CA\n\nTel Aviv University, Israel\nedith@cohenwang.com\n\nO\ufb01r Geri\n\nStanford University, CA\n\nofirgeri@cs.stanford.edu\n\nAbstract\n\nWe consider massive distributed datasets that consist of elements modeled as key-\nvalue pairs and the task of computing statistics or aggregates where the contribution\nof each key is weighted by a function of its frequency (sum of values of its elements).\nThis fundamental problem has a wealth of applications in data analytics and\nmachine learning, in particular, with concave sublinear functions of the frequencies\nthat mitigate the disproportionate effect of keys with high frequency. The family\nof concave sublinear functions includes low frequency moments (\ud835\udc5d \u2264 1), capping,\nlogarithms, and their compositions. A common approach is to sample keys, ideally,\nproportionally to their contributions and estimate statistics from the sample. A\nsimple but costly way to do this is by aggregating the data to produce a table of keys\nand their frequencies, apply our function to the frequency values, and then apply\na weighted sampling scheme. Our main contribution is the design of composable\nsampling sketches that can be tailored to any concave sublinear function of the\nfrequencies. 
Our sketch structure size is very close to the desired sample size and our samples provide statistical guarantees on the estimation quality that are very close to that of an ideal sample of the same size computed over aggregated data. Finally, we demonstrate experimentally the simplicity and effectiveness of our methods.

1 Introduction

We consider massive distributed datasets that consist of elements that are key-value pairs e = (e.key, e.val) with e.val > 0. The elements are generated or stored on a large number of servers or devices. A key x may repeat in multiple elements, and we define its frequency ν_x to be the sum of values of the elements with that key, i.e., ν_x := Σ_{e | e.key = x} e.val. For example, the keys can be search queries, videos, terms, users, or tuples of entities (such as video co-watches or term co-occurrences) and each data element can correspond to an occurrence or an interaction involving this key: the search query was issued, the video was watched, or two terms co-occurred in a typed sentence. An instructive common special case is when all elements have the same value 1 and the frequency ν_x of each key x in the dataset is simply the number of elements with key x.
A common task is to compute statistics or aggregates, which are sums over key contributions. The contribution of each key x is weighted by a function of its frequency ν_x. One example of such sum aggregates are queries of domain statistics Σ_{x∈H} ν_x for some domain (subset of keys) H. The domains of interest are often overlapping and specified at query time. Sum aggregates also arise as components of a larger pipeline, such as the training of a machine learning model with parameters θ, labeled examples x ∈ X with frequencies ν_x, and a loss objective of the form ℓ(X; θ) = Σ_x f(ν_x) L(x; θ). The function f that is applied to the frequencies can be any concave sublinear function. Concave sublinear functions, which we discuss further below, are used in applications to mitigate the disproportionate effect of keys with very high frequencies.

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

The training of the model typically involves repeated evaluation of the loss function (or of its gradient, which also has a sum form) for different values of θ. We would like to compute these aggregates on demand, without needing to go over the data many times.
When the number of keys is very large, it is often helpful to compute a smaller random sample S ⊆ X of the keys from which aggregates can be efficiently estimated. In some applications, obtaining a sample can be the end goal. For example, when the aggregate is a gradient, we can use the sample itself as a stochastic gradient. To provide statistical guarantees on our estimate quality, the sampling needs to be weighted (importance sampling), with heavier keys sampled with higher probability, ideally proportional to their contribution f(ν_x).
When the weights of the keys are known, there are classic sampling schemes that provide estimators with tight worst-case variance bounds [28, 13, 6, 10, 32, 33].
The datasets we consider here are presented in an unaggregated form: each key can appear multiple times in different locations. The focus of this work is designing composable sketch structures (formally defined below) that allow us to compute a sample over unaggregated data with respect to the weights f(ν_x). One approach to compute a sample from unaggregated data is to first aggregate the data to produce a table of key-frequency pairs (x, ν_x), compute the weights f(ν_x), and apply a weighted sampling scheme. This aggregation can be performed using composable structures that are essentially a table with an entry for each distinct key that occurred in the data. The number of distinct keys, however, and hence the size of that sketch, can be huge. For our sampling application, we would hope to use sketches of size proportional to the desired sample size, which is generally much smaller than the number of unique keys, and still provide statistical guarantees on the estimate quality that are close to that of a weighted sample computed according to f(ν_x).

Concave Sublinear Functions. Typical datasets have a skewed frequency distribution, where a small fraction of the keys have very large frequencies, and we can get better results or learn a better model of the data by suppressing their effect. The practice is to apply a concave sublinear function f to the frequency, so that the importance weight of the key is f(ν_x) instead of simply its frequency ν_x. This family of functions includes the frequency moments ν_x^p for p ≤ 1, ln(1 + ν_x), cap_T(ν_x) = min{T, ν_x} for a fixed T ≥ 0, their compositions, and more. A formal definition appears in Section 2.3.
Two hugely popular methods for producing word embeddings from word co-occurrences use this form of mitigation: word2vec [25] uses f(ν) = ν^0.5 and f(ν) = ν^0.75 for positive and negative examples, respectively, and GloVe [30] uses f(ν) = min{T, ν^0.75} to mitigate co-occurrence frequencies. When the data is highly distributed, for example, when it originates or resides at millions of mobile devices (as in federated learning [24]), it is useful to estimate the loss or compute a stochastic gradient update efficiently via a weighted sample.
The suppression of higher frequencies may also arise directly in applications. One example is campaign planning for online advertising, where the value of showing an ad to a user diminishes with the number of views. Platforms allow an advertiser to specify a cap value T on the number of times the same ad can be presented to a user [19, 29]. In this case, the number of opportunities to display an ad to a user x is a cap function of the user's frequency, f(ν_x) = min{T, ν_x}, and the number for a segment of users H is the statistic Σ_{x∈H} f(ν_x).
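To make the setup concrete, the following minimal Python sketch (our illustration; all names and data are assumed, not taken from the paper) aggregates unaggregated key-value elements into frequencies, applies a GloVe-style mitigation f(ν) = min{T, ν^0.75}, and evaluates a domain statistic Σ_{x∈H} f(ν_x):

```python
from collections import defaultdict

def frequencies(elements):
    """Aggregate unaggregated key-value elements into frequencies nu_x = sum of e.val."""
    freq = defaultdict(float)
    for key, val in elements:
        freq[key] += val
    return dict(freq)

def cap(T):                   # cap_T(nu) = min{T, nu}
    return lambda v: min(T, v)

def glove_style(T, p=0.75):   # the composition min{T, nu^p} used by GloVe
    return lambda v: min(T, v ** p)

# Unaggregated elements; the key "the" dominates the raw counts.
elements = [("the", 1.0)] * 10000 + [("sketch", 1.0)] * 100 + [("ppswor", 1.0)] * 10
freq = frequencies(elements)
f = glove_style(T=100.0)
weights = {x: f(v) for x, v in freq.items()}
# Raw frequencies span 10 to 10000; the mitigated weights span only ~5.6 to 100.

# A domain statistic sum_{x in H} f(nu_x) for a query-time segment H:
H = {"sketch", "ppswor"}
stat = sum(f(freq[x]) for x in H)
```

This is exactly the costly aggregate-then-weight baseline described above; the sketches developed in this paper avoid materializing the full frequency table.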
When planning a campaign, we need to quickly estimate these statistics for different segments, and this can be done from a sample that ideally is weighted by f(ν_x).

Our Contribution. In this work, we design composable sketches that can be tailored to any concave sublinear function f, and allow us to compute a weighted sample over unaggregated data with respect to the weights f(ν_x). Using the sample, we will be able to compute unbiased estimators for the aggregates mentioned above. In order to compute the estimators, we need to make a second pass over the data: in the first pass, we compute the set of sampled keys, and in the second pass we compute their frequencies. Both passes can be done in a distributed manner.
A sketch S(D) is a data structure that summarizes a set D of data elements, so that the output of interest for D (in our case, a sample of keys) can be recovered from the sketch S(D). A sketch structure is composable if we can obtain a sketch S(D1 ∪ D2) of two sets of elements D1 and D2 from the sketches S(D1) and S(D2) of the sets. This property alone gives us full flexibility to parallelize or distribute the computation. The size of the sketch determines the communication and storage needs of the computation.
We provide theoretical guarantees on the quality (variance) of the estimators. The baseline for our analysis is the bounds on the variance that are guaranteed by PPSWOR on aggregated data. PPSWOR [32, 33] is a sampling scheme with tight worst-case variance bounds. The estimators provided by our sketch have variance at most 4/(1 − ε)² times the variance bound for PPSWOR. The parameter
The parameter\n\ud835\udf00 \u2264 1/2 mostly affects the run time of processing a data element, which grows near-linearly in 1/\ud835\udf00.\nThus, our sketch allows us to get approximately optimal guarantees on the variance while avoiding\nthe costly aggregation of the data. We remark that these guarantees are for soft concave sublinear\nfunctions and the extension to any concave sublinear function incurs a factor of (1 + 1/(\ud835\udc52 \u2212 1))2 in\nthe variance. The space required by our sketch signi\ufb01cantly improves upon the previous methods\n(which all require aggregating the data). In particular, if the desired sample size is \ud835\udc58, we show that\nthe space required by the sketch at any given time is \ud835\udc42(\ud835\udc58) in expectation and is well concentrated.\nWe complement our work with a small-scale experimental study. We use a simple implementation of\nour sampling sketch to study the actual performance in terms of estimate quality and sketch size. In\nparticular, we show that the estimate quality is even better than the (already adequate) guarantees\nprovided by our worst-case bounds. We additionally compare the estimate quality to that of two\npopular sampling schemes for aggregated data, PPSWOR [32, 33] and priority (sequential Poisson)\nsampling [28, 13]. In the experiments, we see that the estimate quality of our sketch is close to what\nachieved by PPSWOR and priority sampling, while our sketch uses much less space by eliminating\nthe need for aggregation. This paper presents our sketch structures and states our results. The full\nversion (including proofs and additional details) can be found in the supplementary material.\n\nRelated Work. Composable weighted sampling schemes with tight worst-case variance for aggre-\ngated datasets (where keys are unique to elements) include priority (sequential Poisson) sampling\n[28, 13], VarOpt sampling [6, 10], and PPSWOR [32, 33]. 
We use PPSWOR as our base scheme because it extends to unaggregated datasets, where multiple elements can additively contribute to the frequency/weight of each key. A prolific line of research has developed sketch structures for different tasks over streamed or distributed unaggregated data [26, 16, 1]. Composable sampling sketches for unaggregated datasets have the goal of meeting the quality of samples computed on aggregated frequencies while using a sketch structure that can only hold a final-sample-size number of distinct keys. Prior work includes a folklore sketch for distinct sampling (f(ν) = 1 when ν > 0) [22, 34], sum sampling (f(ν) = ν) [9, 18, 14, 11] based on PPSWOR, cap functions (f(ν) = min{T, ν}) [7], and universal (multi-objective) samples with a logarithmic overhead that simultaneously support all concave sublinear f. In the current work we propose sampling sketches that can be tailored to any concave sublinear function and have only a small constant overhead. An important line of work uses random linear projections to estimate frequency statistics and to sample. In particular, ℓp sampling sketches [17, 27, 2, 21, 20] sample (roughly) according to f(ν) = ν^p. These sketches have higher overhead than sample-based sketches and are more limited in their application. Their advantage is that they can be used with super-linear functions of the frequencies (e.g., moments with p ∈ (1, 2]) and can also support signed element values (the turnstile model). For the more basic problem of sketches that estimate frequency statistics over the full data, a characterization of sketchable frequency functions is provided in [5, 3]. Universal sketches for estimating ℓp norms of subsets were recently considered in [4].
A double logarithmic size sketch (extending [15] for distinct counting) that computes statistics over the entire dataset for all soft concave sublinear functions is provided in [8]. Our design builds on components of that sketch.

2 Preliminaries

Consider a set D of data elements of the form e = (e.key, e.val) where e.val > 0. We denote the set of possible keys by X. For a key z ∈ X, we let Max_D(z) := max_{e∈D | e.key = z} e.val and Sum_D(z) := Σ_{e∈D | e.key = z} e.val denote the maximum value of a data element in D with key z and the sum of values of data elements in D with key z, respectively. Each key z ∈ X that appears in D is called active. If there is no element e ∈ D with e.key = z, we say that z is inactive and define Max_D(z) := 0 and Sum_D(z) := 0. When D is clear from context, it is omitted. For a key z, we use the shorthand ν_z := Sum_D(z) and refer to it as the frequency of z. The sum and the max-distinct statistics of D are defined, respectively, as Sum_D := Σ_{e∈D} e.val and MxDistinct_D := Σ_{z∈X} Max_D(z). For a function f, f_D := Σ_{z∈X} f(Sum_D(z)) = Σ_{z∈X} f(ν_z) is the f-frequency statistics of D.

2.1 The Composable Bottom-k Structure

In this work, we will use composable sketch structures in order to efficiently summarize streamed or distributed data elements. A composable sketch structure is specified by three operations: the initialization of an empty sketch structure s, the processing of a data element e into a structure s, and the merging of two sketch structures s1 and s2. To sketch a stream of elements, we start with an empty structure and sequentially process data elements while storing only the sketch structure. The merge operation is useful with distributed or parallel computation and allows us to compute the sketch of a large set D = ∪_i D_i of data elements by merging the sketches of the parts D_i. In particular, one of the main building blocks that we use is the bottom-k structure [12], specified in Algorithm 1. The structure maintains up to k data elements: for each key, consider only the element with that key that has the minimum value; of these elements, the structure keeps the k elements that have the lowest values.

Algorithm 1: Bottom-k Sketch Structure
// Initialize structure
Input: the structure size k
s.set ← ∅ // set of ≤ k key-value pairs
// Process element
Input: element e = (e.key, e.val), a bottom-k structure s
if e.key ∈ s.set then
  replace the current value v of e.key in s.set with min{v, e.val}
else
  insert (e.key, e.val) to s.set
  if |s.set| = k + 1 then
    remove the element e′ with maximum value from s.set
// Merge two bottom-k structures
Input: s1, s2 // bottom-k structures
Output: s // bottom-k structure
P ← s1.set ∪ s2.set
s.set ← the (at most) k elements of P with lowest values (at most one element per key)

2.2 The PPSWOR Sampling Sketch

In this subsection, we describe a scheme to produce a sample of k keys, where at each step the probability that a key is selected is proportional to its weight. That is, the sample we produce will be equivalent to performing the following k steps. At each step we select one key and add it to the sample. At the first step, each key x ∈ X (with weight w_x) is selected with probability w_x / Σ_y w_y. At each subsequent step, we choose one of the remaining keys, again with probability proportional to its weight. This process is called probability proportional to size and without replacement (PPSWOR) sampling.
A classic method for PPSWOR sampling is the following scheme [32, 33]. For each key x with weight w_x, we independently draw seed(x) ∼ Exp(w_x). The output sample will include the k keys with smallest seed(x). This method together with a bottom-k structure can be used to implement PPSWOR sampling over a set of data elements D according to ν_x = Sum_D(x). The sampling sketch is presented here as Algorithm 2.

Algorithm 2: PPSWOR Sampling Sketch
// Initialize structure
Input: the sample size k
Initialize a bottom-k structure s.sample // Algorithm 1
// Process element
Input: element e = (e.key, e.val), PPSWOR sample structure s
v ∼ Exp[e.val]
Process the element (e.key, v) into the bottom-k structure s.sample
// Merge two structures s1, s2 to obtain s
s.sample ← merge the bottom-k structures s1.sample and s2.sample
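Algorithms 1 and 2 can be rendered compactly in Python. The following is a minimal sketch under assumed names (`BottomK` and `PPSWORSketch` are ours, not the paper's); it keeps a plain dict and scans for the maximum on eviction, whereas a practical implementation would use a heap:

```python
import random

class BottomK:
    """Bottom-k structure (Algorithm 1): keeps, per key, the minimum value
    seen for that key, and retains the k (key, value) pairs with lowest values."""
    def __init__(self, k):
        self.k = k
        self.entries = {}  # key -> minimum value seen for that key

    def process(self, key, val):
        if key in self.entries:
            self.entries[key] = min(self.entries[key], val)
        else:
            self.entries[key] = val
            if len(self.entries) == self.k + 1:
                # evict the entry with the maximum value
                worst = max(self.entries, key=self.entries.get)
                del self.entries[worst]

    @staticmethod
    def merge(s1, s2):
        merged = BottomK(s1.k)
        for key, val in list(s1.entries.items()) + list(s2.entries.items()):
            merged.process(key, val)
        return merged

class PPSWORSketch:
    """PPSWOR sampling sketch (Algorithm 2): each element draws an Exp(e.val)
    seed; the bottom-k structure keeps the keys with the smallest seeds."""
    def __init__(self, k, rng=random):
        self.sample = BottomK(k)
        self.rng = rng

    def process(self, key, val):
        # expovariate(val) draws Exp with rate parameter val
        self.sample.process(key, self.rng.expovariate(val))

rng = random.Random(7)
sk = PPSWORSketch(k=2, rng=rng)
for key, val in [("a", 5.0), ("b", 1.0), ("a", 5.0), ("c", 1.0)]:
    sk.process(key, val)
# sk.sample.entries now holds the 2 keys with the smallest seeds.
```

Note that composability holds because the per-key minimum of exponential seeds over D1 ∪ D2 equals the minimum of the per-part minima.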
Algorithm 2 is due to [9] (based on [18, 14, 11]).

2.3 Concave Sublinear Functions

A function f : [0, ∞) → [0, ∞) is soft concave sublinear if for some a(t) ≥ 0 it can be expressed as

f(ν) = ℒc[a](ν) := ∫_0^∞ a(t)(1 − e^{−νt}) dt .  (1)

ℒc[a](ν) is called the complement Laplace transform of a at ν. The sampling schemes we present in this work will be defined for soft concave sublinear functions of the frequencies. However, this will allow us to estimate well any function that is within a small multiplicative constant of a soft concave sublinear function. In particular, we can estimate concave sublinear functions. These functions can be expressed as

f(ν) = ∫_0^∞ a(t) min{1, νt} dt  (2)

for a(t) ≥ 0. The concave sublinear family includes all functions f such that f(0) = 0, f is monotonically non-decreasing, ∂⁺f(0) < ∞, and ∂²f ≤ 0.

Algorithm 3: Sampling Sketch Structure for f
// Initialize empty structure s
Input: k: sample size, ε, a(t) ≥ 0
Initialize s.SumMax // SumMax sketch of size k (Algorithm 5)
Initialize s.ppswor // PPSWOR sketch of size k (Algorithm 2)
Initialize s.sum ← 0 // a sum of all the elements seen so far
Initialize s.γ ← ∞ // threshold
Initialize s.Sideline // a composable max-heap/priority queue
// Process element
Input: element e = (e.key, e.val), structure s
Process e by s.ppswor
s.sum ← s.sum + e.val
s.γ ← 2ε / s.sum
foreach i ∈ [r] do // r = k/ε
  y ∼ Exp[e.val] // exponentially distributed with parameter e.val
  // Process in Sideline
  if the key (e.key, i) appears in s.Sideline then
    update the value of (e.key, i) to be the minimum of y and the current value
  else
    add the element ((e.key, i), y) to s.Sideline
while s.Sideline contains an element g = (g.key, g.val) with g.val ≥ s.γ do
  remove g from s.Sideline
  if ∫_{g.val}^∞ a(t) dt > 0 then
    process the element (g.key, ∫_{g.val}^∞ a(t) dt) by s.SumMax
// Merge two structures s1 and s2 to s (with the same k, ε, a, and the same h in the SumMax sub-structures)
s.sum ← s1.sum + s2.sum
s.γ ← 2ε / s.sum
s.Sideline ← merge s1.Sideline and s2.Sideline // merge priority queues
s.SumMax ← merge s1.SumMax and s2.SumMax // merge SumMax structures (Algorithm 5)
while s.Sideline contains an element g = (g.key, g.val) with g.val ≥ s.γ do
  remove g from s.Sideline
  if ∫_{g.val}^∞ a(t) dt > 0 then
    process the element (g.key, ∫_{g.val}^∞ a(t) dt) by s.SumMax

Any concave sublinear function f can be approximated by a soft concave sublinear function as follows.
Consider the corresponding soft concave sublinear function f̃ using the same coefficients a(t). The function f̃ closely approximates f pointwise [8]:

(1 − 1/e) f(ν) ≤ f̃(ν) ≤ f(ν) .

Our weighted sample for f̃ will respectively approximate a weighted sample for f.
Consider a soft concave sublinear f and a set of data elements D with the respective frequency function W : (0, ∞) → N ∪ {0} (for every ν > 0, W(ν) is the number of keys with frequency ν in D). The statistics f_D = Σ_x f(Sum_D(x)) = Σ_x f(ν_x) can then be expressed as f_D = ℒc[W][a]_0^∞ with the notation

ℒc[W][a]_γ^b := ∫_γ^b a(t) ℒc[W](t) dt .  (3)

3 Sketch Overview

Given a set D of elements e = (e.key, e.val), we wish to maintain a sample of k keys that will be close to PPSWOR according to a soft concave sublinear function of their frequencies f(ν_x).
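The transform in Eq. (1) can be sanity-checked numerically. The sketch below (our illustration, not from the paper) evaluates ℒc[a](ν) by trapezoid quadrature on a log-spaced grid; the coefficient choice a(t) = e^{−t}/t is a classic example for which a Frullani-integral identity gives the closed form ℒc[a](ν) = ln(1 + ν):

```python
import math

def Lc(a, nu, t_min=1e-8, t_max=60.0, n=4000):
    """Numerically evaluate the complement Laplace transform
    Lc[a](nu) = integral_0^inf a(t) (1 - e^{-nu t}) dt   (Eq. (1))
    by the trapezoid rule on a log-spaced grid (substituting u = ln t)."""
    us = [math.log(t_min) + i * (math.log(t_max) - math.log(t_min)) / (n - 1)
          for i in range(n)]
    # In u-coordinates the integrand picks up a Jacobian factor t = e^u.
    g = [a(math.exp(u)) * (1 - math.exp(-nu * math.exp(u))) * math.exp(u)
         for u in us]
    du = us[1] - us[0]
    return du * (sum(g) - 0.5 * (g[0] + g[-1]))

# With a(t) = e^{-t}/t (an assumed example choice), Lc[a](nu) = ln(1 + nu).
a = lambda t: math.exp(-t) / t
print(Lc(a, 3.0))  # close to ln(4) = 1.3863...
```

The truncation limits and grid size are chosen so that the neglected tails are far below the displayed accuracy; they would need adjusting for coefficient functions with heavier tails.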
At a high level, our sampling sketch is guided by the sketch for estimating the statistics f_D due to Cohen [8]. Recall that a soft concave sublinear function f can be represented as f(w) = ℒc[a](w)_0^∞ = ∫_0^∞ a(t)(1 − e^{−wt}) dt for a(t) ≥ 0. Using this representation, we express f(ν_x) as a sum of two contributions for each key x:

f(ν_x) = ℒc[a](ν_x)_0^γ + ℒc[a](ν_x)_γ^∞ ,

where γ is a value we will set adaptively while processing the elements. Our sampling sketch is described in Algorithm 3. It maintains a separate sampling sketch for each set of contributions. In order to produce a sample from the sketch, these separate sketches need to be combined. Algorithm 4 describes how to produce a final sample from the sketch.

Algorithm 4: Produce a Final Sample from a Sampling Sketch Structure (Algorithm 3)
Input: sampling sketch structure s for f
Output: sample of size k of key and seed pairs
if ∫_γ^∞ a(t) dt > 0 then
  foreach e ∈ s.Sideline do
    process the element (e.key, ∫_γ^∞ a(t) dt) by the sketch s.SumMax
foreach e ∈ s.SumMax.sample do
  e.val ← r · e.val // scale sample by r
if ∫_0^γ t a(t) dt > 0 then
  foreach e ∈ s.ppswor.sample do
    e.val ← e.val / ∫_0^γ t a(t) dt
  sample ← merge s.SumMax.sample and s.ppswor.sample // bottom-k merge (Algorithm 1)
else
  sample ← s.SumMax.sample
return sample

Running Algorithm 3 and then Algorithm 4 requires one pass over the data. In order to use the final sample to estimate statistics, we need to compute the Horvitz-Thompson inverse-probability estimator f̂(ν_x) for each of the sampled keys. Informally, the estimator for key x in the sample is f(ν_x)/Pr[x in sample] (and 0 for keys not in the sample). To compute the estimator, we need to know the values f(ν_x) for the keys in the sample, which we get from a second pass over the data, and the conditional inclusion probabilities (the denominator), which have a closed form and can be computed. The parameter ε trades off the running time of processing an element against the bound on the variance of the inverse-probability estimator.
We continue with an overview of the different components of the sketch. As mentioned above, we represent f(ν_x) = ℒc[a](ν_x)_0^γ + ℒc[a](ν_x)_γ^∞, and for each summand we maintain a separate sample of size k (which will later be merged). For ℒc[a](ν_x)_0^γ, we maintain a standard PPSWOR sketch.
For \u2112c[\ud835\udc4e](\ud835\udf08\ud835\udc65)\u221e\n\ud835\udefe , we build on a\nresult from [8], which shows a way to map each input\nelement into an temporary \u201coutput\u201d element with a ran-\ndom value, such that if we look at all the output elements,\nE[Max(\ud835\udc65)] = \u2112c[\ud835\udc4e](\ud835\udf08\ud835\udc65)\u221e\n\ud835\udefe . These components were used\nin [8] to estimate the \ud835\udc53-statistics of the data.\nHowever, in this work we need to produce a sample according to \u2112c[\ud835\udc4e](\ud835\udf08\ud835\udc65)\u221e\n\ud835\udefe (as opposed to\nestimating the sum of these quantities for all keys). In particular, when we look at the output elements,\nwe only see their random value, but we are interested in producing a weighted sample according to\ntheir expected value. For that, we introduce the analysis of PPSWOR with stochastic inputs, which\nappears in Section 4. In that analysis, we establish the conditions that are needed in order for the\nsample according to the random values to be close to a sample according to the expected values.\nThe conditions in the analysis of stochastic PPSWOR require creating \ud835\udc58/\ud835\udf00 independent output\nelements for each element we see, and subsequently, the sample we need for the range \u2112c[\ud835\udc4e](\ud835\udf08\ud835\udc65)\u221e\n\ud835\udefe\nis a PPSWOR sample of the output elements according to the weights SumMax(\ud835\udc65) (de\ufb01ned in\nSection 5). That is the purpose of the SumMax sketch structure, which is presented in Section 5.\nEach of the two samples we maintain (the PPSWOR and SumMax samples) has a \ufb01xed size and\nstores at most \ud835\udc58 keys at any time. The \ud835\udefe threshold is chosen to guarantee that we get the desired\napproximation ratio. The only structure that can use more space is the Sideline structure. 
As part of the analysis, we bound the size of the Sideline and show that in expectation it is O(k); we also provide worst-case bounds on its maximum size during the run of the algorithm. The output elements that are processed by the SumMax sketch have a value that depends on γ (which changes as we process the data), and the purpose of the Sideline structure is to store elements until γ decreases enough that their value is fixed (and then they are removed from the Sideline and processed by the SumMax sketch).

The analysis results in the following main theorem.

Theorem 3.1. Let k ≥ 3 and 0 < ε ≤ 1/2. Algorithms 3 and 4 produce a stochastic PPSWOR sample of size k − 1, where each key x has weight V_x that satisfies f(ν_x) ≤ E[V_x] ≤ (1/(1 − ε)) f(ν_x). The per-key inverse-probability estimator of f(ν_x) is unbiased and has variance

Var[f̂(ν_x)] ≤ (4 f(ν_x) Σ_{z∈𝒳} f(ν_z)) / ((1 − ε)²(k − 2)).

The space required by the sketch at any given time is O(k) in expectation.
Additionally, with probability at least 1 − δ, the space will not exceed

O(k + min{log m, log log(Sum(W)/Min(D))} + log(1/δ))

at any time while processing D, where m is the number of elements in D, Min(D) is the minimum value of an element in D, and Sum(W) is the sum of frequencies of all keys.

4 Stochastic PPSWOR Sampling

In the PPSWOR sampling scheme described in Section 2.2, the weights w_x of the keys were part of the deterministic input to the algorithm. In this section, we consider PPSWOR sampling when the weights are random variables. We will show that under certain assumptions, PPSWOR sampling according to randomized inputs is close to sampling according to the expected values of these random inputs.

Formally, let 𝒳 be a set of keys. Each key x ∈ 𝒳 is associated with r_x ≥ 0 independent random variables S_{x,1}, ..., S_{x,r_x} in the range [0, T] (for some constant T > 0). The weight of key x is the random variable S_x := Σ_{i=1}^{r_x} S_{x,i}. We additionally denote its expected weight by v_x := E[S_x], and the expected sum statistics by V := Σ_x v_x.

A stochastic PPSWOR sample is a PPSWOR sample computed for the key-value pairs (x, S_x).
That is, we draw the random variables S_x, then we draw for each x a random variable seed(x) ~ Exp[S_x], and take the k keys with lowest seed values.

The following result bounds the variance of estimating v_x using a stochastic PPSWOR sample. We consider the conditional inverse-probability estimator of v_x. Note that even though the PPSWOR sample was computed using the random weight S_x, the estimator v̂_x is computed using v_x and will be v_x / Pr[seed(x) < τ] for keys x in the sample. It suffices to bound the per-key variance and relate it to the per-key variance bound for a PPSWOR sample computed directly for v_x. We show that when V ≥ Tk, the overhead due to the stochastic sample is at most 4 (that is, the variance grows by a multiplicative factor of 4). The proof details also reveal that when V ≫ Tk, the worst-case bound on the overhead is actually closer to 2.

Theorem 4.1. Let k ≥ 3.
In a stochastic PPSWOR sample, if V ≥ Tk, then for every key x ∈ 𝒳, the variance Var[v̂_x] of the bottom-k inverse probability estimator of v_x is bounded by

Var[v̂_x] ≤ 4 v_x V / (k − 2).

5 SumMax Sampling Sketch

We present an auxiliary sketch that processes elements e = (e.key, e.val) with keys e.key = (e.key.p, e.key.s) that are structured to have a primary key e.key.p and a secondary key e.key.s. For each primary key x, we define

SumMax_D(x) := Σ_{z | z.p = x} Max_D(z),

where Max is as defined in Section 2. If there are no elements e ∈ D such that e.key.p = x, then by definition Max_D(z) = 0 for all z with z.p = x (as there are no elements in D with key z) and therefore SumMax_D(x) = 0. The SumMax sampling sketch (Algorithm 5) produces a PPSWOR sample of primary keys x according to weights SumMax_D(x). Note that while the key space of the input elements contains structured keys of the form e.key = (e.key.p, e.key.s), the key space for the output sample will be the space of primary keys only. The sketch structure consists of a bottom-k structure and a hash function h.
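Computed directly over aggregated data (which the sketch is designed to avoid), the SumMax weight of each primary key is simply the following (our own illustration):

```python
from collections import defaultdict

def summax(elements):
    # elements: iterable of ((primary, secondary), value) pairs.
    # Max_D(z): maximum value among elements with structured key z;
    # SumMax_D(x): sum of Max_D(z) over keys z whose primary key z.p is x.
    max_per_key = defaultdict(float)
    for (p, s), v in elements:
        max_per_key[(p, s)] = max(max_per_key[(p, s)], v)
    out = defaultdict(float)
    for (p, _s), m in max_per_key.items():
        out[p] += m
    return dict(out)

D = [(("x", 1), 2.0), (("x", 1), 5.0), (("x", 2), 1.0), (("y", 7), 3.0)]
print(summax(D))  # {'x': 6.0, 'y': 3.0}
```

The sketch achieves a PPSWOR sample by these weights without ever materializing this per-key table.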
We assume we have a perfectly random hash function h such that for every key z = (z.p, z.s), h(z) ~ Exp[1] independently (in practice, we assume that the hash function is provided by the platform on which we run). We process an input element e by generating a new data element with key e.key.p (the primary key of the input element's key) and value

ElementScore(e) := h(e.key)/e.val

and then processing that element by our bottom-k structure. The bottom-k structure holds our current sample of primary keys.

Algorithm 5: SumMax sampling sketch
// Initialize empty structure s
Input: Sample size k
s.h ← independent random hash with range Exp[1]
Initialize s.sample // A bottom-k structure (Algorithm 1)
// Process element e = (e.key, e.val) where e.key = (e.key.p, e.key.s)
Process element (e.key.p, s.h(e.key)/e.val) to structure s.sample // bottom-k process element
// Merge structures s1, s2 (with s1.h = s2.h) to get s
s.h ← s1.h // s1.h = s2.h
s.sample ← Merge s1.sample, s2.sample // bottom-k merge (Algorithm 1)

By definition, the bottom-k structure retains the k primary keys x with minimum

seed_D(x) := min_{e ∈ D | e.key.p = x} ElementScore(e).

To establish that this is a PPSWOR sample according to
SumMax_D(x), we study the distribution of seed_D(x).

Lemma 5.1. For all primary keys x that appear in elements of D, seed_D(x) ~ Exp[SumMax_D(x)]. The random variables seed_D(x) are independent.

Note that the distribution of seed_D(x), which is Exp[SumMax_D(x)], does not depend on the particular structure of D or the order in which elements are processed, but only on the parameter SumMax_D(x). The bottom-k sketch structure maintains the k primary keys with smallest seed_D(x) values. We therefore get the following corollary.

Corollary 5.2. Given a stream or distributed set of elements D, the sampling sketch Algorithm 5 produces a PPSWOR sample according to the weights SumMax_D(x).

6 Experiments

We implemented our sampling sketch in Python and report here the results of experiments on real and synthetic datasets. The implementation follows the pseudocode except that we incorporated two practical optimizations: removing redundant keys from the PPSWOR subsketch and removing redundant elements from Sideline. These optimizations do not affect the outcome of the computation or the worst-case analysis, but reduce the sketch size in practice. We used the following datasets:

∙ abcnews [23]: News headlines. For each word, we created an element with value 1.
∙ flickr [31]: Tags used by Flickr users to annotate images. The key of each element is a tag, and the value is the number of times it appeared in a certain folder.
∙ Three synthetic generated datasets that contain 2 × 10^6 data elements.
Each element has value 1, and the key was chosen according to the Zipf distribution (numpy.random.zipf), with Zipf parameter values α ∈ {1.1, 1.2, 1.5}. The Zipf family in this range is often a good model of real-world frequency distributions.

We applied our sampling sketch with sample size parameter values k ∈ {25, 50, 75, 100} and set the parameter ε = 0.5 in all experiments. We sampled according to two concave sublinear functions: the frequency moment f(ν) = ν^0.5 and f(ν) = ln(1 + ν). Table 1 reports aggregated results of 200 repetitions where we used the final sample to estimate the sum Σ_{x∈𝒳} f(ν_x). For error bounds, we list the worst-case bound on the CV (which depends only on k and ε and is ∝ 1/√k) and report the actual normalized root of the average squared error (NRMSE). In addition, we report the NRMSE that we got from 200 repetitions of estimating the same statistics using two common sampling schemes for aggregated data, PPSWOR and priority sampling, which we use as benchmarks. We also consider the size of the sketch after processing each element. Since the representation of each key can be explicit and require a lot of space, we separately consider the number of distinct keys and the number of elements stored in the sketch. We report the maximum number of distinct keys stored in the sketch at any point (the average and the maximum over the 200 repetitions) and the respective maximum number of elements stored in the sketch at any point during the computations (again, the average and the maximum over the 200 repetitions).
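For concreteness (our own illustration; the random seed and exact NumPy calls are our assumptions), a synthetic dataset of this form and the exact target statistic can be produced along these lines:

```python
import numpy as np

rng = np.random.default_rng(42)             # arbitrary seed for this sketch
alpha, n = 1.2, 2 * 10**6
elements = rng.zipf(alpha, size=n)          # one data element per draw, each with value 1
keys, freqs = np.unique(elements, return_counts=True)  # aggregate: frequency nu_x per key
stat = np.sqrt(freqs).sum()                 # exact statistic sum_x f(nu_x) for f(nu) = nu^0.5
print(keys.size, stat)                      # number of distinct keys and the target sum
```

The exact value `stat` is what the sketch estimates from its sample; comparing repeated estimates against it yields the NRMSE.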
[Table 1: Experimental results for f(ν) = ν^0.5 and f(ν) = ln(1 + ν), 200 repetitions. For each dataset and k ∈ {25, 50, 75, 100}, the table reports the worst-case CV bound (0.834, 0.577, 0.468, 0.404 for the four values of k), the actual NRMSE of our sketch, the NRMSE of the PPSWOR and priority-sampling benchmarks, and the average and maximum number of distinct keys and of stored elements. Datasets: abcnews (7.07 × 10^6 elements, 91.7 × 10^3 keys), flickr (7.64 × 10^6 elements, 572.4 × 10^3 keys), zipf1.1 (2.00 × 10^6 elements, 652.2 × 10^3 keys), zipf1.2 (2.00 × 10^6 elements, 237.3 × 10^3 keys), zipf1.5 (2.00 × 10^6 elements, 22.3 × 10^3 keys).]

We can see that the actual error reported is significantly lower than the worst-case bound. Furthermore, the error that our sketch gets is close to the error achieved by the two benchmark sampling schemes. We can also see that the maximum number of distinct keys stored in the sketch at any time is relatively close to the specified sample size of k and that the total sketch size in terms of elements rarely exceeded 3k, with the relative excess seeming to decrease with k.
In comparison, the benchmark schemes require space that is the number of distinct keys (for the aggregation), which is significantly higher than the space required by our sketch.

7 Conclusion

We presented composable sampling sketches for weighted sampling of unaggregated data tailored to a concave sublinear function of the frequencies of keys. We experimentally demonstrated the simplicity and efficacy of our design: our sketch size is nearly optimal in that it is not much larger than the final sample size, and the estimate quality is close to that provided by a weighted sample computed directly over the aggregated data.

Acknowledgments

Ofir Geri was supported by NSF grant CCF-1617577, a Simons Investigator Award for Moses Charikar, and the Google Graduate Fellowship in Computer Science in the School of Engineering at Stanford University. The computing for this project was performed on the Sherlock cluster. We would like to thank Stanford University and the Stanford Research Computing Center for providing computational resources and support that contributed to these research results.

References

[1] N. Alon, Y. Matias, and M. Szegedy. The space complexity of approximating the frequency moments. J. Comput. System Sci., 58:137–147, 1999.
[2] A. Andoni, R. Krauthgamer, and K. Onak. Streaming algorithms via precision sampling. In 2011 IEEE 52nd Annual Symposium on Foundations of Computer Science, pages 363–372, Oct 2011.
[3] V. Braverman, S. R. Chestnut, D. P. Woodruff, and L. F. Yang. Streaming space complexity of nearly all functions of one variable on frequency vectors. In PODS. ACM, 2016.
[4] V. Braverman, R. Krauthgamer, and L. F. Yang. Universal streaming of subset norms. CoRR, abs/1812.00241, 2018.
[5] V. Braverman and R. Ostrovsky. Zero-one frequency laws. In STOC. ACM, 2010.
[6] M. T. Chao. A general purpose unequal probability sampling plan.
Biometrika, 69(3):653–656, 1982.
[7] E. Cohen. Stream sampling for frequency cap statistics. In KDD. ACM, 2015. Full version: http://arxiv.org/abs/1502.05955.
[8] E. Cohen. Hyperloglog hyperextended: Sketches for concave sublinear frequency statistics. In KDD. ACM, 2017. Full version: https://arxiv.org/abs/1607.06517.
[9] E. Cohen, G. Cormode, and N. Duffield. Don't let the negatives bring you down: Sampling from streams of signed updates. In Proc. ACM SIGMETRICS/Performance, 2012.
[10] E. Cohen, N. Duffield, H. Kaplan, C. Lund, and M. Thorup. Efficient stream sampling for variance-optimal estimation of subset sums. SIAM J. Comput., 40(5), 2011.
[11] E. Cohen, N. Duffield, H. Kaplan, C. Lund, and M. Thorup. Algorithms and estimators for accurate summarization of unaggregated data streams. J. Comput. System Sci., 80, 2014.
[12] E. Cohen and H. Kaplan. Summarizing data using bottom-k sketches. In ACM PODC, 2007.
[13] N. Duffield, M. Thorup, and C. Lund. Priority sampling for estimating arbitrary subset sums. J. Assoc. Comput. Mach., 54(6), 2007.
[14] C. Estan and G. Varghese. New directions in traffic measurement and accounting. In SIGCOMM. ACM, 2002.
[15] P. Flajolet, E. Fusy, O. Gandouet, and F. Meunier. Hyperloglog: The analysis of a near-optimal cardinality estimation algorithm. In Analysis of Algorithms (AofA). DMTCS, 2007.
[16] P. Flajolet and G. N. Martin. Probabilistic counting algorithms for data base applications. J. Comput. System Sci., 31:182–209, 1985.
[17] G. Frahling, P. Indyk, and C. Sohler. Sampling in dynamic data streams and applications. International Journal of Computational Geometry & Applications, 18(01n02):3–28, 2008.
[18] P. Gibbons and Y. Matias. New sampling-based summary statistics for improving approximate query answers. In SIGMOD. ACM, 1998.
[19] Google. Frequency capping: AdWords help, December 2014. https://support.google.com/adwords/answer/117579.
[20] R. Jayaram and D. P. Woodruff. Perfect lp sampling in a data stream. In 2018 IEEE 59th Annual Symposium on Foundations of Computer Science (FOCS), pages 544–555, Oct 2018.
[21] H. Jowhari, M. Sağlam, and G. Tardos. Tight bounds for lp samplers, finding duplicates in streams, and related problems. In Proceedings of the Thirtieth ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, PODS '11, pages 49–58, 2011.
[22] D. E. Knuth. The Art of Computer Programming, Vol 2, Seminumerical Algorithms. Addison-Wesley, 1st edition, 1968.
[23] R. Kulkarni. A million news headlines [csv data file]. https://www.kaggle.com/therohk/million-headlines/home, 2017.
[24] B. McMahan, E. Moore, D. Ramage, S. Hampson, and B. Aguera y Arcas. Communication-efficient learning of deep networks from decentralized data. In Aarti Singh and Jerry Zhu, editors, Proceedings of the 20th International Conference on Artificial Intelligence and Statistics, volume 54 of Proceedings of Machine Learning Research, pages 1273–1282. PMLR, 2017.
[25] T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean. Distributed representations of words and phrases and their compositionality. In Proceedings of the 26th International Conference on Neural Information Processing Systems - Volume 2, NIPS '13, pages 3111–3119, 2013.
[26] J. Misra and D. Gries. Finding repeated elements. Technical report, Cornell University, 1982.
[27] M. Monemizadeh and D. P. Woodruff. 1-pass relative-error lp-sampling with applications. In Proc. 21st ACM-SIAM Symposium on Discrete Algorithms. ACM-SIAM, 2010.
[28] E. Ohlsson. Sequential poisson sampling. J. Official Statistics, 14(2):149–162, 1998.
[29] M. Osborne. Facebook Reach and Frequency Buying, October 2014. http://citizennet.com/blog/2014/10/01/facebook-reach-and-frequency-buying/.
[30] J. Pennington, R. Socher, and C. D. Manning. GloVe: Global vectors for word representation. In EMNLP, 2014.
[31] A. Plangprasopchok, K. Lerman, and L. Getoor. Growing a tree in the forest: Constructing folksonomies by integrating structured metadata. In Proceedings of the 16th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '10, pages 949–958, 2010.
[32] B. Rosén. Asymptotic theory for successive sampling with varying probabilities without replacement, I. The Annals of Mathematical Statistics, 43(2):373–397, 1972.
[33] B. Rosén. Asymptotic theory for order sampling. J. Statistical Planning and Inference, 62(2):135–158, 1997.
[34] J. S. Vitter. Random sampling with a reservoir. ACM Trans. Math. Softw., 11(1):37–57, 1985.