{"title": "A Model for Learned Bloom Filters and Optimizing by Sandwiching", "book": "Advances in Neural Information Processing Systems", "page_first": 464, "page_last": 473, "abstract": "Recent work has suggested enhancing Bloom filters by using a pre-filter, based on applying machine learning to determine a function that models the data set the Bloom filter is meant to represent. Here we model such learned Bloom filters, with the following outcomes: (1) we clarify what guarantees can and cannot be associated with such a structure; (2) we show how to estimate what size the learning function must obtain in order to obtain improved performance; (3) we provide a simple method, sandwiching, for optimizing learned Bloom filters; and (4) we propose a design and analysis approach for a learned Bloomier filter, based on our modeling approach.", "full_text": "A Model for Learned Bloom Filters,\n\nand Optimizing by Sandwiching\n\nMichael Mitzenmacher\n\nSchool of Engineering and Applied Sciences\n\nHarvard University\n\nmichaelm@eecs.harvard.edu\n\nAbstract\n\nRecent work has suggested enhancing Bloom \ufb01lters by using a pre-\ufb01lter, based\non applying machine learning to determine a function that models the data set the\nBloom \ufb01lter is meant to represent. 
Here we model such learned Bloom \ufb01lters,\nwith the following outcomes: (1) we clarify what guarantees can and cannot be\nassociated with such a structure; (2) we show how to estimate what size the learning\nfunction must obtain in order to obtain improved performance; (3) we provide a\nsimple method, sandwiching, for optimizing learned Bloom \ufb01lters; and (4) we\npropose a design and analysis approach for a learned Bloomier \ufb01lter, based on our\nmodeling approach.\n\n1\n\nIntroduction\n\nAn interesting recent paper, \u201cThe Case for Learned Index Structures\u201d [7], argues that standard index\nstructures and related structures, such as Bloom \ufb01lters, could be improved by using machine learning\nto develop what the authors dub learned index structures. However, this paper did not provide a\nsuitable mathematical model for judging the performance of such structures. Here we aim to provide\na more formal model for their variation of a Bloom \ufb01lter, which they call a learned Bloom \ufb01lter.\nTo describe our results, we \ufb01rst somewhat informally describe the learned Bloom \ufb01lter. Like a standard\nBloom \ufb01lter, it provides a compressed representation of a set of keys K that allows membership\nqueries. (We may sometimes also refer to the keys as elements.) Given a key y, a learned Bloom\n\ufb01lter always returns yes if y is in K, so there will be no false negatives, and generally returns no if y\nis not in K, but may provide false positives. What makes a learned Bloom \ufb01lter interesting is that it\nuses a function that can be obtained by \u201clearning\u201d the set K to help determine the appropriate answer;\nthe function acts as a pre-\ufb01lter that provides a probabilistic estimate that a query key y is in K. 
This\nlearned function can be used to make an initial decision as to whether the key is in K, and a smaller\nbackup Bloom \ufb01lter is used to prevent any false negatives.\nOur more formal model provides interesting insights into learned Bloom \ufb01lters, and how they might\nbe effective. In particular, here we: (1) clarify what guarantees can and cannot be associated with\nsuch a structure; (2) show how to estimate what size the learning function must obtain in order to\nobtain improved performance; (3) provide a simple method for optimizing learned Bloom \ufb01lters; and\n(4) demonstrate our approach may be useful for other similar structures.\nWe brie\ufb02y summarize the outcomes above. First, we explain how the types of guarantees offered by\nlearned Bloom \ufb01lters differ signi\ufb01cantly from those of standard Bloom \ufb01lters. We thereby clarify\nwhat application-level assumptions are required for a learned Bloom \ufb01lter to be effective. Second,\nwe provide formulae for modeling the false positive rate for a learned Bloom \ufb01lter, allowing for an\nestimate of how small the learned function needs to be in order to be effective. We then \ufb01nd, perhaps\nsurprisingly, that a better structure uses a Bloom \ufb01lter before as well as after the learned function.\n\n32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montr\u00e9al, Canada.\n\n\fBecause we optimize for two layers of Bloom \ufb01lters surrounding the learned function, we refer to\nthis as a sandwiched learned Bloom \ufb01lter. We show mathematically and intuitively why sandwiching\nimproves performance. 
We also discuss an approach to designing learned Bloomier \ufb01lters, where a\nBloomier \ufb01lter returns a value associated with a set element (instead of just returning whether the\nelement is in the set), and show it can be modeled similarly.\nWhile the contents of this paper may be seen as relatively simple, we feel it is important to provide\nsolid foundations in order for a wide community to understand the potential and pitfalls of data\nstructures using machine learning components. We therefore remark that the simplicity is purposeful,\nand suggest it is desirable in this context. Finally, we note that this work incorporates and extends\nanalysis that appeared in two prior working notes [8, 9].\n\n2 Review: Bloom Filters\n\nWe start by reviewing standard Bloom \ufb01lters and variants, following the framework provided by the\nreference [2].\n\n2.1 De\ufb01nition of the Data Structure\nA Bloom \ufb01lter for representing a set S = {x1, x2, . . . , xn} of n elements corresponds to an array\nof m bits, and uses k independent hash functions h1, . . . , hk with range {0, . . . , m \u2212 1}. Here we\nfollow the typical assumption that these hash functions are perfect; that is, each hash function maps\neach item in the universe independently and uniformly to a number in {0, . . . , m \u2212 1}. Initially all\narray bits are 0. For each element x \u2208 S, the array bits hi(x) are set to 1 for 1 \u2264 i \u2264 k; it does not\nmatter if some bit is set to 1 multiple times. To check if an item y is in S, we check whether all hi(y)\nare set to 1. If not, then clearly y is not a member of S. If all hi(y) are set to 1, we conclude that y is\nin S, although this may be a false positive. A Bloom \ufb01lter does not produce false negatives.\nThe primary standard theoretical guarantee associated with a Bloom \ufb01lter is the following. 
Let y be an element of the universe such that y ∉ S, where y is chosen independently of the hash functions used to create the filter. Let ρ be the fraction of bits set to 1 after the elements are hashed. Then

Pr(y yields a false positive) = ρ^k.

For a bit in the Bloom filter to be 0, it must not be the outcome of any of the kn hash values for the n items. It follows that

E[ρ] = 1 − (1 − 1/m)^{kn} ≈ 1 − e^{−kn/m},

and that via standard techniques using concentration bounds (see, e.g., [11])

Pr(|ρ − E[ρ]| ≥ γ) ≤ e^{−Θ(γ²m)}

in the typical regime where m/n and k are constant. That is, ρ is, with high probability, very close to its easily calculable expectation, and thus we know (up to very small random deviations) what the probability is that an element y will be a false positive. Because of this tight concentration around the expectation, it is usual to talk about the false positive probability of a Bloom filter; in particular, it is generally referred to as though it is a constant depending on the filter parameters, even though it is a random variable, because it is tightly concentrated around its expectation.
Moreover, given a set of distinct query elements Q = {y1, y2, . . . , yq} with Q ∩ S = ∅ chosen a priori before the Bloom filter is instantiated, the fraction of false positives over these queries will similarly be concentrated near ρ^k. Hence we may talk about the false positive rate of a Bloom filter over queries, which (when the query elements are distinct) is essentially the same as the false positive probability. (When the query elements are not distinct, the false positive rate may vary significantly, depending on the distribution of the number of appearances of elements and which ones yield false positives; we focus on the distinct item setting here.)
In particular, the false positive rate is a\npriori the same for any possible query set Q. Hence one approach to \ufb01nding the false positive rate of\na Bloom \ufb01lter empirically is simply to test a random set of query elements (that does not intersect S)\nand \ufb01nd the fraction of false positives. Indeed, it does not matter what set Q is chosen, as long as it is\nchosen independently of the hash functions.\n\n2\n\n\fWe emphasize that, as we discuss further below, the term false positive rate often has a different\nmeaning in the context of learning theory applications. Care must therefore be taken in understanding\nhow the term is being used.\n\n2.2 Additional Bloom Filter Bene\ufb01ts and Limitations\n\nFor completeness, we relate some of the other bene\ufb01ts and limitations of Bloom \ufb01lters. More details\ncan be found in [2].\nWe have assumed in the above analysis that the hash functions are fully random. As fully random\nhash functions are not practically implementable, there are often questions relating to how well the\nidealization above matches the real world for speci\ufb01c hash functions. In practice, however, the model\nof fully random hash functions appears reasonable in many cases; see [5] for further discussion on\nthis point.\nIf an adversary has access to the hash functions used, or to the \ufb01nal Bloom \ufb01lter, it can \ufb01nd elements\nthat lead to false positives. One must therefore \ufb01nd other structures for adversarial situations. A\ntheoretical framework for such settings is developed in [12]. 
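The construction and the empirical estimation procedure just described can be sketched as follows; a minimal illustration, with salted SHA-256 hashing standing in for the idealized fully random hash functions of the analysis:

```python
import hashlib
import random

class BloomFilter:
    """Standard Bloom filter: m bits, k hash functions, no false negatives."""

    def __init__(self, m, k):
        self.m, self.k = m, k
        self.bits = [0] * m

    def _hashes(self, item):
        # Derive k hash values by salting a single base hash; this stands in
        # for the k independent random hash functions assumed in Section 2.1.
        for i in range(self.k):
            h = hashlib.sha256(f"{i}:{item}".encode()).hexdigest()
            yield int(h, 16) % self.m

    def add(self, item):
        for pos in self._hashes(item):
            self.bits[pos] = 1

    def __contains__(self, item):
        return all(self.bits[pos] for pos in self._hashes(item))

# Empirically estimate the false positive rate with a random query set,
# as described above: any Q disjoint from S, chosen independently of the
# hash functions, gives essentially the same answer.
random.seed(0)
S = set(random.sample(range(10**6), 1000))
bf = BloomFilter(m=10**4, k=7)
for x in S:
    bf.add(x)

queries = [y for y in random.sample(range(10**6), 20000) if y not in S]
fp_rate = sum(y in bf for y in queries) / len(queries)
rho = sum(bf.bits) / bf.m
# fp_rate concentrates near rho**k, the quantity derived in Section 2.1.
```

With m/n ≈ 10 and k = 7 the measured rate lands close to ρ^k, illustrating why the false positive rate can be estimated from any independently chosen query set.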
Variations of Bloom filters, which adapt to false positives and prevent them in the future, are described in [1, 10]; while not meant for adversarial situations, they prevent repeated false positives with the same element.
One of the key advantages of a standard Bloom filter is that it is easy to insert an element (possibly slightly changing the false positive probability), although one cannot delete an element without using a more complex structure, such as a counting Bloom filter. However, there are more recent alternatives to the standard Bloom filter, such as the cuckoo filter [6], which can achieve the same or better space performance as a standard Bloom filter while allowing insertions and deletions. If the Bloom filter does not need to insert or delete elements, a well-known alternative is to develop a perfect hash function for the data set, and store a fingerprint of each element in each corresponding hash location (see, e.g., [2] for further explanation); this approach reduces the space required by approximately 30%.

3 Learned Bloom Filters

3.1 Definition of the Data Structure

We now consider the learned Bloom filter construction as described in [7]. We are given a set of positive keys K that corresponds to the set to be held in the Bloom filter – that is, K corresponds to the set S in the previous section. We are also given a set of negative keys U for training. We then train a neural network with D = {(xi, yi = 1) | xi ∈ K} ∪ {(xi, yi = 0) | xi ∈ U}; that is, they suggest using a neural network on this binary classification task to produce a probability, based on minimizing the log loss function

L = − Σ_{(x,y)∈D} ( y log f(x) + (1 − y) log(1 − f(x)) ),

where f is the learned model from the neural network. Then f(x) can be interpreted as a "probability" estimate that x is a key from the set.
Their suggested approach is to choose a threshold \u03c4 so that\nif f (x) \u2265 \u03c4 then the algorithm returns that x is in the set, and no otherwise. Since such a process\nmay provide false negatives for some keys in K that have f (x) < \u03c4, a secondary structure \u2013 such\nas a smaller standard Bloom \ufb01lter holding the keys from K that have f (x) < \u03c4 \u2013 can be used to\ncheck keys with f (x) < \u03c4 to ensure there are no false negatives, matching this feature of the standard\nBloom \ufb01lter.\nIn essence, [7] suggests using a pre-\ufb01lter ahead of the Bloom \ufb01lter, where the pre-\ufb01lter comes from a\nneural network and estimates the probability a key is in the set, allowing the use of a smaller Bloom\n\ufb01lter than if one just used a Bloom \ufb01lter alone. Performance improves if the size to represent the\nlearned function f and the size of the smaller backup \ufb01lter for false negatives is smaller than the size\nof a corresponding Bloom \ufb01lter with the same false positive rate. Of course the pre-\ufb01lter here need\nnot come from a neural network; any approach that would estimate the probability an input key is in\nthe set could be used.\nThis motivates the following formal de\ufb01nition:\n\n3\n\n\fDe\ufb01nition 1 A learned Bloom \ufb01lter on a set of positive keys K and negative keys U is a function\nf : U \u2192 [0, 1] and threshold \u03c4, where U is the universe of possible query keys, and an associated\nstandard Bloom \ufb01lter B, referred to as a backup \ufb01lter. The backup \ufb01lter holds the set of keys\n{z : z \u2208 K, f (z) < \u03c4}. For a query y, the learned Bloom \ufb01lter returns that y \u2208 K if f (y) \u2265 \u03c4, or if\nf (y) < \u03c4 and the backup \ufb01lter returns that y \u2208 K. 
The learned Bloom \ufb01lter returns y /\u2208 K otherwise.\n\n3.2 De\ufb01ning the False Positive Probability\n\nThe question remains how to determine or derive the false positive probability for a learned Bloom\n\ufb01lter, and how to choose an appropriate threshold \u03c4. The approach in [7] is to \ufb01nd the false positive\nrate over a test set. This approach is, as we have discussed, suitable for a standard Bloom \ufb01lter,\nwhere the false positive rate is guaranteed to be close to its expected value for any test set, with high\nprobability. However, this methodology requires additional assumptions in the learned Bloom \ufb01lter\nsetting.\nAs an example, suppose the universe of elements is the range [0, 1000000), and the set K of keys to\nstore in our Bloom \ufb01lter consists of a random subset of 500 elements from the range [1000, 2000], and\nof 500 other random elements from outside this range. Our learning algorithm might determine that a\nsuitable function f yields that f (y) is large (say f (y) \u2248 1/2) for elements in the range [1000, 2000]\nand close to zero elsewhere, and then a suitable threshold might be \u03c4 = 0.4. The resulting false\npositive rate depends substantially on what elements are queried. If Q consists of elements primarily\nin the range [1000, 2000], the false positive rate will be quite high, while if Q is chosen uniformly at\nrandom over the whole range, the false positive rate will be quite low. Unlike a standard Bloom \ufb01lter,\nthe false positive rate is highly dependent on the query set, and is not well-de\ufb01ned independently of\nthe queries.\nIndeed, it seems plausible that in many situations, the query set Q might indeed be similar to the set\nof keys K, so that f (y) for y \u2208 Q might often be above naturally chosen thresholds. 
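The query rule of Definition 1, and the query-set dependence illustrated by the range example above, can be demonstrated concretely. In this sketch the learned function f is a toy stand-in for a trained model, and the backup filter is idealized as an exact set (no false positives of its own), to isolate the contribution of f:

```python
import random

# Toy stand-in for a learned model on the example above: keys were drawn
# from [1000, 2000] plus random outliers, so f scores that range highly.
def f(y):
    return 0.5 if 1000 <= y <= 2000 else 0.01

TAU = 0.4

random.seed(1)
K = set(random.sample(range(1000, 2001), 500)) | set(random.sample(range(10**6), 500))
backup = {z for z in K if f(z) < TAU}   # idealized backup filter for f's false negatives

def learned_bloom_query(y):
    # Definition 1: report membership if f(y) >= tau; otherwise consult the backup filter.
    return f(y) >= TAU or y in backup

def fp_rate(queries):
    neg = [y for y in queries if y not in K]
    return sum(learned_bloom_query(y) for y in neg) / len(neg)

uniform_q = random.sample(range(10**6), 10000)      # queries spread over the universe
skewed_q = random.sample(range(1000, 2001), 500)    # queries concentrated near K

fp_uniform = fp_rate(uniform_q)
fp_skewed = fp_rate(skewed_q)
# fp_skewed vastly exceeds fp_uniform: the false positive rate is a property
# of the query distribution, not of the structure alone.
```

There are never false negatives, but the measured false positive rate swings from near zero to near one depending on the query distribution, exactly the phenomenon described above.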
For example, in\nsecurity settings, one might expect that queries for objects under consideration (URLs, network \ufb02ow\nfeatures) would be similar to the set of keys stored in the \ufb01lter. Unlike in the setting of a standard\nBloom \ufb01lter, the false positive probability for a query y can depend on y, even before the function f\nis instantiated.\nIt is worth noting, however, that the problem we point out here can possibly be a positive feature in\nother settings; it might be that the false positive rate is remarkably low if the query set is suitable.\nAgain, one can consider the range example above where queries are uniform over the entire space;\nthe query set is very unlikely to hit the range where the learned function f yields an above threshold\nvalue in that setting for a key outside of K. The data-dependent nature of the learned Bloom \ufb01lter\nmay allow it to circumvent lower bounds for standard Bloom \ufb01lter structures.\nWhile the false positive probability for learned Bloom \ufb01lters does not have the same properties as for\na standard Bloom \ufb01lter, we can de\ufb01ne the false positive rate of a learned Bloom \ufb01lter with respect to\na given query distribution.\nDe\ufb01nition 2 A false positive rate on a query distribution D over U \u2212 K for a learned Bloom \ufb01lter\n(f, \u03c4, B) is given by\n\ny\u223cD(f (y) \u2265 \u03c4 ) + (1 \u2212 Pr\n\ny\u223cD(f (y) \u2265 \u03c4 ))F (B),\n\nPr\n\nwhere F (B) is the false positive rate of the backup \ufb01lter B.\n\nWhile technically F (B) is itself a random variable, the false positive rate is well concentrated around\nits expectations, which depends only on the size of the \ufb01lter |B| and the number of false negatives\nfrom K that must be stored in the \ufb01lter, which depends on f. 
Hence where the meaning is clear we\nmay consider the false positive rate for a learned Bloom \ufb01lter with function f and threshold \u03c4 to be\n\ny\u223cD(f (y) \u2265 \u03c4 ) + (1 \u2212 Pr\n\ny\u223cD(f (y) \u2265 \u03c4 ))E[F (B)],\n\nPr\n\nwhere the expectation E[F (B)] is meant to over instantiations of the Bloom \ufb01lter with given size |B|.\nGiven suf\ufb01cient data, we can determine an empirical false positive rate on a test set, and use that\nto predict future behavior. Under the assumption that the test set has the same distribution as future\n\n4\n\n\fqueries, standard Chernoff bounds provide that the empirical false positive rate will be close to the\nfalse positive rate on future queries, as both will be concentrated around the expectation. In many\nlearning theory settings, this empirical false positive rate appears to be referred to as simply the false\npositive rate; we emphasize that false positive rate, as we have explained above, typically means\nsomething different in the Bloom \ufb01lter literature.\nDe\ufb01nition 3 The empirical false positive rate on a set T , where T \u2229 K = \u2205, for a learned Bloom\n\ufb01lter (f, \u03c4, B) is the number of false positives from T divided by |T |.\nTheorem 4 Consider a learned Bloom \ufb01lter (f, \u03c4, B), a test set T , and a query set Q, where T and\nQ are both determined from samples according to a distribution D. Let X be the empirical false\npositive rate on T , and Y be the empirical false positive rate on Q. Then\n\nPr(|X \u2212 Y | \u2265 \u0001) \u2264 e\u2212\u2126(\u00012 min(|T |,|Q|)).\n\nProof: Let \u03b1 = Pry\u223cD(f (y) \u2265 \u03c4 ), and \u03b2 be false positive rate for the backup \ufb01lter. 
We \ufb01rst show\nthat for T and X that\n\nPr(|X \u2212 (\u03b1 + (1 \u2212 \u03b1)\u03b2)| \u2265 \u0001) \u2264 2e\u22122\u00012|T |.\n\nThis follows from a direct Chernoff bound (e.g., [11][Exercise 4.13]), since each sample chosen\naccording to D is a false positive with probability \u03b1 + (1 \u2212 \u03b1)\u03b2. A similar bound holds for Q and Y .\nWe can therefore conclude\n\nPr(|X \u2212 Y | \u2265 \u0001) \u2264 Pr(|X \u2212 (\u03b1 + (1 \u2212 \u03b1)\u03b2)| \u2265 \u0001/2)\n\n+ Pr(|Y \u2212 (\u03b1 + (1 \u2212 \u03b1)\u03b2)| \u2265 \u0001/2)\n\n\u2264 2e\u2212\u00012|T |/2 + 2e\u2212\u00012|Q|/2,\n\ngiving the desired result.\n\nTheorem 4 also informs us that it is reasonable to \ufb01nd a suitable parameter \u03c4, given f, by trying\na suitable \ufb01nite discrete set of values for \u03c4, and choosing the best size-accuracy tradeoff for the\napplication. By a union bound, all choices of \u03c4 will perform close to their expectation with high\nprobability.\nWhile Theorem 4 requires the test set and query set to come from the same distribution D, the\nnegative examples U do not have to come from that distribution. Of course, if negative examples U\nare drawn from D, it may yield a better learning outcome f.\nIf the test set and query set distribution do not match, because for example the types of queries\nchange after the original gathering of test data T , Theorem 4 offers limited guidance. Suppose T is\nderived from samples from distribution D and Q from another distribution D(cid:48). If the two distributions\nare close (say in L1 distance), or, more speci\ufb01cally, if the changes do not signi\ufb01cantly change the\nprobability that a query y has f (y) \u2265 \u03c4, then the empirical false positive rate on the test set may still\nbe relatively accurate. However, in practice it may be hard to provide such guarantees on the nature\nof future queries. 
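The concentration behind Theorem 4 is easy to see in simulation. Following the proof, each negative query is treated as a false positive independently with probability α + (1 − α)β; the α and β values below are hypothetical, chosen only for illustration:

```python
import random

# Monte Carlo illustration of Theorem 4: when the test set T and query set Q
# are drawn from the same distribution, their empirical false positive rates
# X and Y stay close.
random.seed(2)
alpha, beta = 0.05, 0.02          # hypothetical Pr(f(y) >= tau) and backup FP rate
p = alpha + (1 - alpha) * beta    # per-query false positive probability, as in the proof
size_T, size_Q, trials = 5000, 5000, 200

max_gap = 0.0
for _ in range(trials):
    X = sum(random.random() < p for _ in range(size_T)) / size_T
    Y = sum(random.random() < p for _ in range(size_Q)) / size_Q
    max_gap = max(max_gap, abs(X - Y))

# With |T| = |Q| = 5000, gaps beyond a few multiples of
# sqrt(p * (1 - p) / 5000) (about 0.0036 here) are exponentially unlikely.
```

Even over many repetitions the largest observed gap stays within a few standard deviations, matching the exponential tail bound of the theorem.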
This explains our previous statement that learned Bloom filters appear most useful when the query stream can be modeled as coming from a fixed distribution, which can be sampled during the construction.
We can return to our previous example to understand these effects. Recall our set consists of 500 random elements from the range [1000, 2000] and 500 other random elements from the range [0, 1000000). Our learned Bloom filter has f(y) ≥ τ for all y in [1000, 2000] and f(y) < τ otherwise. Our backup filter will therefore store 500 elements. If our test set is uniform over [0, 1000000) (excluding elements stored in the Bloom filter), our false positive rate from elements with too large an f value would be approximately 0.0002; one could choose a backup filter with roughly the same false positive probability for a total empirical false positive probability of 0.0004. If, however, our queries are uniform over a restricted range [0, 100000), then the false positive probability would jump to 0.0022 for the learned Bloom filter, because the learned function would yield more false positives over the smaller query range.

5

3.3 Additional Learned Bloom Filter Benefits and Limitations
Learned Bloom filters can easily handle insertions into K by adding the key, if it does not already yield a (false) positive, to the backup filter. Such changes have a larger effect on the false positive probability than for a standard Bloom filter, since the backup filter is smaller. Keys cannot be deleted naturally from a learned Bloom filter. A deleted key would simply become a false positive, which (if needed) could possibly be handled by an additional structure.
As noted in [7], it may be possible to re-learn a new function f if the data set changes substantially via insertions and deletions of keys from K.
Of course, besides the time needed to re-learn a new function f, this requires storing the original set somewhere, which may not be necessary for alternative schemes. Similarly, if the false positive probability proves higher than desired, one can re-learn a new function f; again, doing so will require access to K, and maintaining a (larger) set U of negative examples.

4 Size of the Learned Function

We now consider how to model the performance of the learned Bloom filter with the goal of understanding how small the representation of the function f needs to be in order for the learned Bloom filter to be more effective than a standard Bloom filter. 1
Our model is as follows. We treat the function f associated with Definition 1 as an oracle for the keys K, where |K| = m, that works as follows. For keys not in K there is an associated false positive probability Fp, and there are Fnm false negatives for keys in K. (The value Fn is like a false negative probability, but given K this fraction is determined and known according to the oracle outcomes.) We note the oracle representing the function f is meant to be general, so it could potentially represent other sorts of filter structures as well. As we have described in Section 3.2, in the context of a learned Bloom filter the false positive rate is necessarily tied to the query stream, and is therefore generally an empirically determined quantity, but we take the value Fp here as a given. Here we show how to optimize over a single oracle, although in practice we may possibly choose from oracles with different values Fp and Fn, in which case we can optimize for each pair of values and choose the best suited to the application.
We assume a total budget of bm bits for the backup filter, and |f| = ζ bits for the learned function. If |K| = m, the backup Bloom filter only needs to hold mFn keys, and hence we take the number of bits per stored key to be b/Fn.
To model the false positive rate of a Bloom \ufb01lter that uses j bits per\nstored key, we assume the false positive rate falls as \u03b1j. This is the case for a standard Bloom \ufb01lter\n(where \u03b1 \u2248 0.6185 when using the optimal number of hash functions, as described in the survey\n[2]), as well as for a static Bloom \ufb01lter built using a perfect hash function (where \u03b1 = 1/2, again\ndescribed in [2]). The analysis can be modi\ufb01ed to handle other functions for false positives in terms\nof j in a straightforward manner. (For example, for a cuckoo \ufb01lter [6], a good approximation for the\nfalse positive rate is c\u03b1j for suitable constants c and \u03b1.)\nThe false positive rate of a learned Bloom \ufb01lter is Fp + (1 \u2212 Fp)\u03b1b/Fn . This is because, for y /\u2208 K, y\ncauses a false positive from the learned function f with probability Fp, or with remaining probability\n(1 \u2212 Fp) it yields a false positive on the backup Bloom \ufb01lter with probability \u03b1b/Fn.\nA comparable Bloom \ufb01lter using the same number of total bits, namely bm + \u03b6 bits, would have\na false positive probability of \u03b1b+\u03b6/m. Thus we \ufb01nd an improvement using a learned Bloom \ufb01lter\nwhenever\n\nFp + (1 \u2212 Fp)\u03b1b/Fn \u2264 \u03b1b+\u03b6/m,\n\nwhich simpli\ufb01es to\n\n\u03b6/m \u2264 log\u03b1\n\n(cid:16)\n\nFp + (1 \u2212 Fp)\u03b1b/Fn\n\n(cid:17) \u2212 b,\n\nwhere we have expressed the requirement in terms of a bound on \u03b6/m, the number of bits per key the\nfunction f is allowed.\n\n1We thank Alex Beutel for pointing out that our analysis in [9] could be used in this manner.\n\n6\n\n\fThis expression is somewhat unwieldy, but it provides some insight into what sort of compression is\nrequired for the learned function f, and how a practitioner can determine what is needed. First, one\ncan determine possible thresholds and the corresponding rate of false positive and false negatives\nfrom the learned function. 
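The bound on ζ/m is straightforward to evaluate numerically. A small sketch; the parameter values below are illustrative assumptions, not measurements from any particular learned model:

```python
import math

def max_learned_function_bits_per_key(Fp, Fn, b, alpha=0.6185):
    """Upper bound on zeta/m, the bits per key available to the learned
    function f, for a learned Bloom filter with a backup budget of b bits
    per key to beat a standard Bloom filter of the same total size.

    Implements zeta/m <= log_alpha(Fp + (1 - Fp) * alpha**(b / Fn)) - b.
    """
    fp_learned = Fp + (1 - Fp) * alpha ** (b / Fn)
    return math.log(fp_learned, alpha) - b

# Illustrative parameters: an oracle with a 1% false positive probability
# that misses half the keys, paired with a 5-bits-per-key backup filter.
budget = max_learned_function_bits_per_key(Fp=0.01, Fn=0.5, b=5)
# If f can be stored in fewer than `budget` bits per key, the learned
# Bloom filter improves on a standard filter of equal total size.
```

In practice one would sweep this function over the (Fp, Fn) pairs obtained from candidate thresholds τ and pick the most favorable tradeoff, as described above.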
For example, the paper [7] considers situations where Fp \u2248 0.01, and\nFn \u2248 0.5; let us consider Fp = 0.01 and Fn = 0.5 for clarity. If we have a target goal of one byte\nper item, a standard Bloom \ufb01lter achieves a false positive probability of approximately 0.0214. If\nour learned function uses 3 bits per item (or less), then the learned Bloom \ufb01lter can use 5m bits\nfor the backup Bloom \ufb01lter, and achieve a false positive rate of approximately 0.0181. The learned\nBloom \ufb01lter will therefore provide over a 10% reduction in false positives with the same or less space.\nMore generally, in practice one could determine or estimate different Fp and Fn values for different\nthresholds and different learned functions of various sizes, and use these equations to determine if\nbetter performance can be expected without in depth experiments.\nIndeed, an interesting question raised by this analysis is how learned functions scale in terms of\ntypical data sets. In extreme situations, such as when the set K being considered is a range of\nconsecutive integers, it can be represented by just two integers, which does not grow with K. If,\nin practice, as data sets grow larger the amount of information needed for a learned function f to\napproximate key sets K grows sublinearly with |K|, learned Bloom \ufb01lters may prove very effective.\n\n5 Sandwiched Learned Bloom Filters\n\n5.1 The Sandwich Structure\n\nGiven the formalization of the learned Bloom \ufb01lter, it seems natural to ask whether this structure can\nbe improved. Here we show that a better structure is to use a Bloom \ufb01lter before using the function f,\nin order to remove most queries for keys not in K. We emphasize that this initial Bloom \ufb01lter does\nnot declare that an input y is in K, but passes forward all matching keys to the learned function f,\nand it returns y /\u2208 K when the Bloom \ufb01lter shows the key is not in K. 
Then, as before, we use the\nfunction f to attempt to remove false positives from the initial Bloom \ufb01lter, and then use the backup\n\ufb01lter to allow back in keys from K that were false negatives for f. Because we have two layers of\nBloom \ufb01lters surrounding the learned function f, we refer to this as a sandwiched learned Bloom\n\ufb01lter. The sandwiched learned Bloom \ufb01lter is represented pictorially in Figure 1.\nIn hindsight, our result that sandwiching improves performance makes sense. The purpose of\nthe backup Bloom \ufb01lter is to remove the false negatives arising from the learned function. If we\ncan arrange to remove more false positives up front, then the backup Bloom \ufb01lter can be quite\nporous, allowing most everything that reaches it through, and therefore can be quite small. Indeed,\nsurprisingly, our analysis shows that the backup \ufb01lter should not grow beyond a \ufb01xed size.\n\n5.2 Analyzing Sandwiched Learned Bloom Filters\n\nWe model the sandwiched learned Bloom \ufb01lter as follows. As before, the learned function f in the\nmiddle of the sandwich we treat as an oracle for the keys K, where |K| = m. Also as before, for\nkeys not in K there is an associated false positive probability Fp, and there are Fnm false negatives\nfor keys in K.\nWe here assume a total budget of bm bits to be divided between an initial Bloom \ufb01lter of b1m bits\nand a backup Bloom \ufb01lter of b2m bits. As before, we model the false positive rate of a Bloom \ufb01lter\nthat uses j bits per stored key as \u03b1j for simplicity. The backup Bloom \ufb01lter only needs to hold mFn\nkeys, and hence we take the number of bits per stored key to be b2/Fn. If we \ufb01nd the best value of b2\nis b, then no initial Bloom \ufb01lter is needed, but otherwise, an initial Bloom \ufb01lter is helpful.\nThe false positive rate of a sandwiched learned Bloom \ufb01lter is then \u03b1b1(Fp + (1 \u2212 Fp)\u03b1b2/Fn ). 
To see this, note that for y ∉ K, y first has to pass through the initial Bloom filter, which occurs with probability α^{b1}. Then y either causes a false positive from the learned function f with probability Fp, or with remaining probability (1 − Fp) it yields a false positive on the backup Bloom filter, with probability α^{b2/Fn}.

7

Figure 1: The left side shows the original learned Bloom filter. The right side shows the sandwiched learned Bloom filter.

As α, Fp, Fn and b are all constants for the purpose of this analysis, we may optimize for b1 in the equivalent expression

Fp α^{b1} + (1 − Fp) α^{b/Fn} α^{b1(1−1/Fn)}.   (1)

The derivative with respect to b1 is

Fp (ln α) α^{b1} + (1 − Fp) (1 − 1/Fn) α^{b/Fn} (ln α) α^{b1(1−1/Fn)}.   (2)

This equals 0 when

Fp / ((1 − Fp)(1/Fn − 1)) = α^{(b−b1)/Fn} = α^{b2/Fn}.

This further yields that the false positive rate is minimized when b2 = b∗2, where

b∗2 = Fn log_α ( Fp / ((1 − Fp)(1/Fn − 1)) ).

This result may be somewhat surprising, as here we see that the optimal value b∗2 is a constant, independent of b. That is, the number of bits used for the backup filter is not a constant fraction of the total budgeted number of bits bm, but a fixed number of bits; if the number of budgeted bits increases, one should simply increase the size of the initial Bloom filter as long as the backup filter is appropriately sized.
In hindsight, returning to the expression for the false positive rate α^{b1}(Fp + (1 − Fp)α^{b2/Fn}) provides useful intuition.
If we think of sequentially distributing the bm bits among the two Bloom filters, the expression shows that bits assigned to the initial filter (the b1 bits) reduce false positives arising from the learned function (the Fp term) as well as false positives arising subsequent to the learned function (the (1 − Fp) term), while the backup filter only reduces false positives arising subsequent to the learned function. Initially we would provide bits to the backup filter to reduce the (1 − Fp) rate of false positives subsequent to the learned function. Indeed, bits in the backup filter drive down this (1 − Fp) term rapidly, because the backup filter holds fewer keys from the original set, leading to the b2/Fn (instead of just a b2) in the exponent in the expression α^{b2/Fn}. Once the false positives coming through the backup Bloom filter reach an appropriate level, which, by plugging in the determined optimal value for b2, we find is Fp/(1/Fn − 1), then the tradeoff changes. At that point the gains from reducing the false positives by increasing the bits for the backup Bloom filter become smaller than the gains obtained by increasing the bits for the initial Bloom filter.
Again, we can look at situations discussed in [7] for some insight. Suppose we have a learned function f where Fn = 0.5 and Fp = 0.01. We consider α = 0.6185 (which corresponds to a standard Bloom filter). We do not consider the size of f in the calculation below.
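This crossover in the tradeoff can be verified numerically: treating the false positive rate as a smooth function of b1 and b2 (an assumption of the model, not an implementation detail), the marginal gain per bit for the two filters should be equal exactly when the backup's contribution (1 − Fp)α^{b2/Fn} has fallen to Fp/(1/Fn − 1). A sketch with our own function names:

```python
import math

# Marginal-gain crossover: a bit given to the backup filter and a bit
# given to the initial filter reduce the false positive rate equally
# exactly when (1-Fp) * alpha^(b2/Fn) reaches Fp / (1/Fn - 1).
alpha, Fp, Fn = 0.6185, 0.01, 0.5
b1 = 3.0  # any fixed b1; the crossover condition does not involve b1

def fp_rate(b1, b2):
    return alpha**b1 * (Fp + (1 - Fp) * alpha**(b2 / Fn))

# Backup size at which the backup's false-positive term hits that level.
b2_cross = Fn * math.log(Fp / ((1 - Fp) * (1 / Fn - 1)), alpha)

# Compare numerical partial derivatives with respect to b1 and b2.
eps = 1e-6
d_b1 = (fp_rate(b1 + eps, b2_cross) - fp_rate(b1, b2_cross)) / eps
d_b2 = (fp_rate(b1, b2_cross + eps) - fp_rate(b1, b2_cross)) / eps
assert abs(d_b1 - d_b2) / abs(d_b1) < 1e-3  # equal marginal gains
```

Below b2_cross, backup-filter bits are the better investment; above it, initial-filter bits are.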
Then the optimal value for b2 is

    b∗2 = Fn log_α ( Fp / ((1 − Fp)(1/Fn − 1)) ) = (log_α 1/99)/2 ≈ 6.

Depending on our Bloom filter budget parameter b, we obtain different levels of performance improvement by using the initial Bloom filter. At b = 8 bits per key, the false positive rate drops from approximately 0.010045 to 0.005012, over a factor of 2. At b = 10 bits per key, the false positive rate drops from approximately 0.010066 to 0.001917, almost an order of magnitude.
We may also consider the implications for the oracle size. Again, if we let ζ represent the size of the oracle in bits, then a corresponding Bloom filter would have a false positive probability of α^{b + ζ/m}. Hence we have an improvement whenever

    α^{b1}(Fp + (1 − Fp)α^{b2/Fn}) ≤ α^{b + ζ/m}.

For b sufficiently large that b1 > 0, we can calculate the false positive probability of the optimized sandwiched Bloom filter. Let b∗2 be the optimal value for b2 from equation 2 and b∗1 be the corresponding value for b1.
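The sandwiched figures in this example are easy to reproduce with a short script. The sketch below (our own function names) fixes the backup filter at b2 = 6 bits per key as in the example above, gives the remaining b − b2 bits to the initial filter, and compares against a learned Bloom filter that spends its whole budget on the backup filter:

```python
# Example parameters from the text: alpha = 0.6185, Fp = 0.01, Fn = 0.5,
# backup filter fixed at b2 = 6 bits per key as in the example above.
alpha, Fp, Fn, b2 = 0.6185, 0.01, 0.5, 6.0

def learned_fp(b):
    # Learned Bloom filter without sandwiching: all b bits back up f.
    return Fp + (1 - Fp) * alpha**(b / Fn)

def sandwiched_fp(b):
    # Initial filter of b - b2 bits per key, backup filter of b2 bits per key.
    return alpha**(b - b2) * (Fp + (1 - Fp) * alpha**(b2 / Fn))

assert abs(sandwiched_fp(8) - 0.005012) < 1e-5   # b = 8 example
assert abs(learned_fp(10) - 0.010066) < 1e-5     # b = 10, no sandwich
assert abs(sandwiched_fp(10) - 0.001917) < 1e-5  # b = 10, sandwiched
```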
First using the relationship from equation 1, we have a gain whenever

    α^{b∗1} Fp/(1 − Fn) ≤ α^{b + ζ/m},

since by equation 1 the term Fp + (1 − Fp)α^{b∗2/Fn} simplifies to Fp/(1 − Fn). Using b∗1 = b − b∗2 and equation 2 gives

    ζ/m ≤ log_α ( Fp/(1 − Fn) ) − Fn log_α ( Fp / ((1 − Fp)(1/Fn − 1)) ).

Again, this expression is somewhat unwieldy, but one useful difference from the analysis of the original learned Bloom filter is that we see the improvement does not depend on the exact value of b (as long as b is large enough so that b1 > 0, and we use the optimal value for b2). For Fp = 0.01, Fn = 0.5, and α = 0.6185, we find a gain whenever ζ/m falls below approximately 3.36.
A possible further advantage of the sandwich approach is that it makes learned Bloom filters more robust. As discussed previously, if the queries given to a learned Bloom filter do not come from the same distribution as the queries from the test set used to estimate the learned Bloom filter's false positive probability, the actual false positive probability may be substantially larger than expected. The use of an initial Bloom filter mitigates this problem, as this issue then only affects the smaller number of keys that pass the initial Bloom filter.
We note that a potential disadvantage of the sandwich approach may be that it is more computationally complex than a learned Bloom filter without sandwiching, requiring possibly more hashing and memory accesses for the initial Bloom filter. The overall efficiency would be implementation dependent, but this remains a possible issue for further research.

6 Learned Bloomier Filters

In the supplemental material, we consider learned Bloomier filters. Bloomier filters are a variation of the Bloom filter idea where each key in the set K has an associated value.
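The stated threshold on the oracle size is straightforward to evaluate from the bound above; a minimal check, using the example parameters from the text:

```python
import math

# Oracle-size threshold from the text: the optimized sandwiched filter
# beats a plain Bloom filter of b + zeta/m bits per key whenever zeta/m
# is below log_alpha(Fp/(1-Fn)) - Fn * log_alpha(Fp/((1-Fp)(1/Fn - 1))).
alpha, Fp, Fn = 0.6185, 0.01, 0.5

threshold = (math.log(Fp / (1 - Fn), alpha)
             - Fn * math.log(Fp / ((1 - Fp) * (1 / Fn - 1)), alpha))
assert abs(threshold - 3.36) < 0.01  # matches the value quoted in the text
```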
The Bloomier filter returns the value for every key of K, and is supposed to return a null value for keys not in K, but in this context there can be false positives where the return for a key outside of K is a non-null value with some probability. We derive related formulae for the performance of learned Bloomier filters.

7 Conclusion

We have focused on providing a more formal analysis of the proposed learned Bloom filter. As part of this, we have attempted to clarify a particular issue in the Bloom filter setting, namely the dependence of what is referred to as the false positive rate in [7] on the query set, and how it might affect the applications this approach is suited for. We have also found that our modeling leads to a natural and interesting optimization, based on sandwiching, and allows for generalizations to related structures, such as Bloomier filters. Our discussion is meant to encourage users to take care to realize all of the implications of the learned Bloom filter approach before adopting it. However, for sets that can be accurately predicted by small learned functions, the learned Bloom filter may provide a novel means of obtaining significant performance improvements over standard Bloom filter variants.

Acknowledgments

The author thanks Suresh Venkatasubramanian for suggesting a closer look at [7], and thanks the authors of [7] for helpful discussions involving their work. This work was supported in part by NSF grants CCF-1563710, CCF-1535795, CCF-1320231, and CNS-1228598. Part of this work was done while visiting Microsoft Research New England.

References
[1] M. Bender, M. Farach-Colton, M. Goswami, R. Johnson, S. McCauley, and S. Singh. Bloom Filters, Adaptivity, and the Dictionary Problem. https://arxiv.org/abs/1711.01616, 2017.
[2] A. Broder and M. Mitzenmacher. Network Applications of Bloom Filters: A Survey. Internet Mathematics, 1(4):485-509, 2004.
[3] D.
Charles and K. Chellapilla. Bloomier Filters: A Second Look. In Proceedings of the European Symposium on Algorithms, pp. 259-270, 2008.
[4] B. Chazelle, J. Kilian, R. Rubinfeld, and A. Tal. The Bloomier Filter: An Efficient Data Structure for Static Support Lookup Tables. In Proceedings of the Fifteenth Annual ACM-SIAM Symposium on Discrete Algorithms, pp. 30-39, 2004.
[5] K. Chung, M. Mitzenmacher, and S. Vadhan. Why Simple Hash Functions Work: Exploiting the Entropy in a Data Stream. Theory of Computing, 9(30):897-945, 2013.
[6] B. Fan, D. Andersen, M. Kaminsky, and M. Mitzenmacher. Cuckoo Filter: Practically Better than Bloom. In Proceedings of the 10th ACM International Conference on Emerging Networking Experiments and Technologies, pp. 75-88, 2014.
[7] T. Kraska, A. Beutel, E. H. Chi, J. Dean, and N. Polyzotis. The Case for Learned Index Structures. https://arxiv.org/abs/1712.01208, 2017.
[8] M. Mitzenmacher. A Model for Learned Bloom Filters and Related Structures. https://arxiv.org/abs/1802.00884, 2018.
[9] M. Mitzenmacher. Optimizing Learned Bloom Filters by Sandwiching. https://arxiv.org/abs/1803.01474, 2018.
[10] M. Mitzenmacher, S. Pontarelli, and P. Reviriego. Adaptive Cuckoo Filters. In Proceedings of the Twentieth Workshop on Algorithm Engineering and Experiments (ALENEX), pp. 36-47, 2018.
[11] M. Mitzenmacher and E. Upfal. Probability and Computing: Randomization and Probabilistic Techniques in Algorithms and Data Analysis. Cambridge University Press, 2017.
[12] M. Naor and E. Yogev. Bloom Filters in Adversarial Environments. In Proceedings of the Annual Cryptography Conference, pp. 565-584, 2015.