{"title": "Near Neighbor: Who is the Fairest of Them All?", "book": "Advances in Neural Information Processing Systems", "page_first": 13176, "page_last": 13187, "abstract": "In this work we study a \"fair\" variant of the near neighbor problem. Namely, given a set of $n$ points $P$ and a parameter $r$, the goal is to preprocess the points, such that given a query point $q$, any point in the $r$-neighborhood of the query, i.e., $B(q,r)$, has the same probability of being reported as the near neighbor.\n\nWe show that LSH based algorithms can be made fair, without a significant loss in efficiency. Specifically, we show an algorithm that reports a point $p$ in the $r$-neighborhood of a query $q$ with almost uniform probability. The time to report such a point is proportional to $O(\\dns(q,r) Q(n,c))$, and its space is $O(S(n,c))$, where $Q(n,c)$ and $S(n,c)$ are the query time and space of an LSH algorithm for $c$-approximate near neighbor, and $\\dns(q,r)$ is a function of the local density around $q$.\n\nOur approach works more generally for sampling uniformly from a sub-collection of sets of a given collection and can be used in a few other applications. Finally, we run experiments to show performance of our approach on real data.", "full_text": "Near Neighbor: Who is the Fairest of Them All?

Sariel Har-Peled
University of Illinois at Urbana-Champaign
Champaign, IL 61801
sariel@illinois.edu

Sepideh Mahabadi
Toyota Technological Institute at Chicago
Chicago, IL 60637
mahabadi@ttic.edu

Abstract

In this work we study a fair variant of the near neighbor problem. Namely, given a set of n points P and a parameter r, the goal is to preprocess the points, such that given a query point q, any point in the r-neighborhood of the query, i.e., B(q, r), has the same probability of being reported as the near neighbor.
We show that LSH based algorithms can be made fair, without a significant loss in efficiency.
Specifically, we show an algorithm that reports a point in the r-neighborhood of a query q with almost uniform probability. The query time is proportional to O(dns(q, r) Q(n, c)), and its space is O(S(n, c)), where Q(n, c) and S(n, c) are the query time and space of an LSH algorithm for c-approximate near neighbor, and dns(q, r) is a function of the local density around q.
Our approach works more generally for sampling uniformly from a sub-collection of sets of a given collection and can be used in a few other applications. Finally, we run experiments to show the performance of our approach on real data.

1 Introduction

Nowadays, many important decisions, such as college admissions, offering home loans, or estimating the likelihood of recidivism, rely on machine learning algorithms. There is a growing concern about the fairness of these algorithms and the bias they may create toward a specific population or feature [HPS16, Cho17, MSP16, KLL17]. While algorithms are not inherently biased, they may nevertheless amplify the already existing biases in the data. Hence, this concern has led to the design of fair algorithms for many different applications, e.g., [DOBD18, ABD18, PRW17, CKLV19, EJJ19, OA18, CKLV17, BIO19, BCN19, KSAM19].
Bias in the data used for training machine learning algorithms is a monumental challenge in creating fair algorithms [HGB07, TE11, ZVGRG17, Cho17]. Here, we are interested in a somewhat different problem: handling the bias introduced by the data-structures used by such algorithms. Specifically, data-structures may introduce bias in the data stored in them, and in the way they answer queries, because of the way the data is stored and how it is being accessed. Such a defect leads to selection bias by the algorithms using such data-structures.
It is natural to want data-structures that do not introduce a selection bias into the data when handling queries. The target as such is to derive data-structures that are bias-neutral. To this end, imagine a data-structure that can return, as an answer to a query, an item out of a set of acceptable answers. The purpose is then to return uniformly a random item out of the set of acceptable outcomes, without explicitly computing the whole set of acceptable answers (which might be prohibitively expensive).
Several notions of fairness have been studied, including group fairness¹ (where the demographics of the population are preserved in the outcome) and individual fairness (where the goal is to treat individuals with similar conditions similarly) [DHP12]. In this work, we study the near neighbor problem from the perspective of individual fairness.

¹The concept is denoted as statistical fairness too, e.g., [Cho17].

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

Near Neighbor is a fundamental problem that has applications in many areas such as machine learning, databases, computer vision, information retrieval, and many others; see [SDI06, AI08] for an overview. The problem is formally defined as follows. Let (M, d) be a metric space. Given a set P ⊆ M of n points and a parameter r, the goal of the near neighbor problem is to preprocess P, such that for a query point q ∈ M, one can report a point p ∈ P such that d(p, q) ≤ r, if such a point exists. As all the existing algorithms for the exact variant of the problem have either space or query time that depends exponentially on the ambient dimension of M, people have considered the approximate variant of the problem.
In the c-approximate near neighbor (ANN) problem, the algorithm is allowed to report a point p whose distance to the query is at most cr if a point within distance r of the query exists, for some prespecified constant c > 1.
Perhaps the most prominent approach to get an ANN data structure is via Locality Sensitive Hashing (LSH) [IM98, HIM12], which leads to sub-linear query time and sub-quadratic space. In particular, for M = R^d, by using LSH one can get a query time of n^{ρ+o(1)} and space n^{1+ρ+o(1)}, where for the L1 distance metric ρ = 1/c [IM98, HIM12], and for the L2 distance metric ρ = 1/c^2 + o_c(1) [AI08]. The idea of the LSH method is to hash all the points using several hash functions that are chosen randomly, with the property that closer points have a higher probability of collision than far points. Therefore, the points closer to a query have a higher probability of falling into a bucket being probed than far points. Thus, reporting a random point from a random bucket computed for the query produces a distribution that is biased by the distance to the query: points closer to the query have a higher probability of being chosen.

When random nearby is better than nearest. The bias mentioned above towards nearer points is usually a good property, but it is not always desirable. Indeed, consider the following scenarios:

(I) The nearest neighbor might not be the best if the input is noisy, and the closest point might be viewed as an unrepresentative outlier. Any point in the neighborhood might then be considered to be equivalently beneficial. This is to some extent why k-NN classification [ELL09] is so effective in reducing the effect of noise.

(II) However, k-NN works better in many cases if k is large, but computing the k nearest-neighbors is quite expensive if k is large [HAAA14]. Computing quickly a random nearby neighbor can
Computing quickly a random nearby neighbor can\nsigni\ufb01cantly speed-up such classi\ufb01cation.\n\n(III) We are interested in annonymizing the query [Ada07], thus returning a random near-neighbor\nmight serve as \ufb01rst line of defense in trying to make it harder to recover the query. Simi-\nlarly, one might want to anonymize the nearest-neighbor [QA08], for applications were we are\ninterested in a \u201ctypical\u201d data item close to the query, without identifying the nearest item.\n\n(IV) If one wants to estimate the number of items with a desired property within the neighborhood,\nIn\n\nthen the easiest way to do it is via uniform random sampling from the neighborhood.\nparticular, this is useful for density estimation [KLK12].\n\n(V) Another natural application is simulating a random walk in the graph where two items are\nconnected if they are in distance at most r from each other. Such random walks are used by\nsome graph clustering algorithms [HK01].\n\n1.1 Results\n\nOur goal is to solve the near-neighbor problem, and yet be fair among \u201call the points\u201d in the neigh-\nborhood. We introduce and study the fair near neighbor problem \u2013 where the goal is to report any\npoint of N pq, rq with uniform distribution. That is, report a point within distance r of the query\npoint with probability of Ppq, rq \u0010 1{npq, rq, where npq, rq \u0010 |N pq, rq|. Naturally, we study the\napproximate fair near neighbor problem, where one can hope to get ef\ufb01cient data-structures. We\nhave the following results:\n\n(I) Exact neighborhood. We present a data structure for reporting a neighbor according to an\n\u201calmost uniform\u201d distribution with space Spn, cq, and query time rOQpn, cq \u0004 npq,crq\nnpq,rq \b, where\nSpn, cq and Qpn, cq are, respectively, the space and query time of the standard c-ANN data\nstructure. 
Note that, the query time of the algorithm might be high if the approximate neigh-\n\n2\n\n\fborhood of the query is much larger than the exact neighborhood.2 Guarantees of this data\nstructure hold with high probability. See Lemma 4.9 for the exact statement.\n\n(II) Approximate neighborhood. This formulation reports an almost uniform distribution from\nan approximate neighborhood S of the query. We can provide such a data structure that uses\n\nspace Spn, cq and whose query time is rOpQpn, cqq, albeit in expectation. See Lemma 4.3 for\n\nthe exact statement.\n\nMoreover, the algorithm produces the samples independently of past queries. In particular, one can\nassume that an adversary is producing the set of queries and has full knowledge of the data structure.\nEven then the generated samples have the same (almost) uniform guarantees. Furthermore, we\nremark that the new sampling strategy can be embedded in the existing LSH method to achieve\nunbiased query results. Finally, we remark that to get a distribution that is p1 \u03b5q-uniform (See\npreliminaries for the de\ufb01nition), the dependence of our algorithms on \u03b5 is only Oplogp1{\u03b5qq.\nVery recently, independent of our work, [APS19] also provides a similar de\ufb01nition for the fair near\nneighbor problem.\nExperiments. Finally, we compare the performance of our algorithm with the algorithm that uni-\nformly picks a bucket and reports a random point, on the MNIST, SIFT10K, and GloVe data sets.\nOur empirical results show that while the standard LSH algorithm fails to fairly sample a point\nin the neighborhood of the query, our algorithm produces an empirical distribution which is much\ncloser to the uniform distribution: it improves the statistical distance to the uniform distribution by\na signi\ufb01cant factor.\n\n2 Preliminaries\nNeighborhood, fair nearest-neighbor, and approximate neighborhood. Let pM, dq be a metric\nspace and let P \u0084 M be a set of n points. 
Let B(c, r) = {x ∈ M | d(c, x) ≤ r} be the (closed) ball of radius r around a point c ∈ M, and let N(c, r) = B(c, r) ∩ P be the r-neighborhood of c in P. The size of the r-neighborhood is n(c, r) = |N(c, r)|.

Definition 2.1 (FANN). Given a data set P ⊆ M of n points and a parameter r, the goal is to preprocess P such that for a given query q, one reports each point p ∈ N(q, r) with probability μ_p, where μ is an approximately uniform probability distribution: P(q, r)/(1+ε) ≤ μ_p ≤ (1+ε) P(q, r), where P(q, r) = 1/n(q, r).

Definition 2.2 (FANN with approximate neighborhood). Given a data set P ⊆ M of n points and a parameter r, the goal is to preprocess them such that for a given query q, one reports each point p ∈ S with probability μ_p, where ϕ/(1+ε) ≤ μ_p ≤ (1+ε)ϕ, where S is a point set such that N(q, r) ⊆ S ⊆ N(q, cr), and ϕ = 1/|S|.

Set representation. Let U be an underlying ground set of n objects (i.e., elements). In this paper, we deal with sets of objects. Assume that such a set X ⊆ U is stored in some reasonable data-structure, where one can insert, delete, or query an object in constant time. Querying for an object o ∈ U requires deciding if o ∈ X. Such a representation of a set is straightforward to implement using an array to store the objects, and a hash table. This representation allows random access to the elements in the set, or uniform sampling from the set.
If hashing is not feasible, one can just use a standard dictionary data-structure; this would slow down the operations by a logarithmic factor.

Subset size estimation. We need the following standard estimation tool, [BHR17, Lemma 2.8].

Lemma 2.3. Consider two sets B ⊆ U, where n = |U|. Let ξ, γ ∈ (0, 1) be parameters, such that γ < 1/log n. Assume that one is given access to a membership oracle that, given an element x ∈ U, returns whether or not x ∈ B. Then, one can compute an estimate s, such that (1−ξ)|B| ≤ s ≤ (1+ξ)|B|; computing this estimate requires O((n/|B|) ξ^{-2} log γ^{-1}) oracle queries. The returned estimate is correct with probability ≥ 1 − γ.

Weighted sampling. We need the following standard data-structure for weighted sampling.

²As we show, the term Q(n, r) · n(q,cr)/n(q,r) can also be replaced by Q(n, r) + |N(q, cr) \ N(q, r)|, which can potentially be smaller.

Lemma 2.4. Given a set of objects H = {o_1, . . . , o_t}, with associated weights w_1, . . . , w_t, one can preprocess them in O(t) time, such that one can sample an object out of H. The probability of an object o_i being sampled is w_i / ∑_{j=1}^t w_j. In addition, the data-structure supports updates to the weights. An update or sample operation takes O(log t) time. (Proof in Appendix A.1)

3 Data-structure: Sampling from the union of sets

The problem. Assume you are given a data-structure that contains a large collection F of sets of objects. The sets in F are not necessarily disjoint. The task is to preprocess the data-structure, such that given a sub-collection G ⊆ F of the sets, one can quickly pick uniformly at random an object from the set ∪G := ∪_{X∈G} X.

Naive solution. The naive solution is to take the sets under consideration (in G), compute their union, and sample directly from the union set ∪G. Our purpose is to do (much) better; in particular, the goal is to get a query time that depends logarithmically on the total size of all sets in G.

3.1 Preprocessing

For each set X ∈ F, we build the set representation mentioned in the preliminaries section.
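Such an array-plus-hash-table set representation can be sketched as follows (a minimal Python illustration of the idea, not code from the paper; all names are ours):

```python
import random

class SetRep:
    """Array + hash-table set representation: O(1) insert, delete,
    membership query, and uniform sampling, as described in Section 2."""

    def __init__(self):
        self.items = []  # array of stored objects
        self.pos = {}    # object -> its index in the array

    def insert(self, o):
        if o not in self.pos:
            self.pos[o] = len(self.items)
            self.items.append(o)

    def delete(self, o):
        i = self.pos.pop(o, None)
        if i is not None:
            last = self.items.pop()
            if i < len(self.items):  # move the last element into the hole
                self.items[i] = last
                self.pos[last] = i

    def __contains__(self, o):
        return o in self.pos

    def sample(self):
        return random.choice(self.items)  # uniform over the stored set
```

Swapping the last array element into the deleted slot is what keeps deletion constant-time while preserving the ability to sample uniformly by index.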
In addition, we assume that each set is stored in a data-structure that enables easy random access or uniform sampling on this set (for example, store each set in its own array). Thus, for each set X and an element, we can decide if the element is in X in constant time.

3.2 Uniform sampling via exact degree computation

The query is a family G ⊆ F, and define m = ‖G‖ := ∑_{X∈G} |X| (which should be distinguished from g = |G| and from n = |∪G|). The degree of an element x ∈ ∪G is the number of sets of G that contain it, that is, d_G(x) = |D_G(x)|, where D_G(x) = {X ∈ G | x ∈ X}. The algorithm repeatedly does the following:

(I) Picks one set from G with probability proportional to its size. That is, a set X ∈ G is picked with probability |X|/m.
(II) It picks an element x ∈ X uniformly at random.
(III) Computes the degree d = d_G(x).
(IV) Outputs x and stops with probability 1/d. Otherwise, continues to the next iteration.

Lemma 3.1. Let n = |∪G| and g = |G|. The above algorithm samples an element x ∈ ∪G according to the uniform distribution. The algorithm takes in expectation O(gm/n) = O(g^2) time. The query time is O(g^2 log n) with high probability. (Proof in Appendix A.2)

3.3 Almost uniform sampling via degree approximation

The bottleneck in the above algorithm is computing the degree of an element. We replace this by an approximation.

Definition 3.2. Given two positive real numbers x and y, and a parameter ε ∈ (0, 1), the numbers x and y are ε-approximations of each other, denoted by x ≈_ε y, if x/(1+ε) ≤ y ≤ x(1+ε) and y/(1+ε) ≤ x ≤ y(1+ε).

In the approximate version, given an item x ∈ ∪G, we can approximate its degree and get an improved runtime for the algorithm.

Lemma 3.3. The input is a family of sets F that one can preprocess in linear time. Let G ⊆ F be a sub-family and let n = |∪G|, g = |G|, and ε ∈ (0, 1) be a parameter. One can sample an element x ∈ ∪G with an almost uniform probability distribution. Specifically, the probability of an element being output is ≈_ε 1/n. After linear time preprocessing, the query time is O(g ε^{-2} log n), in expectation, and the query succeeds with high probability. (Proof in Appendix A.3)

Remark 3.4. The query time of Lemma 3.3 deteriorates to O(g ε^{-2} log^2 n) if one wants the bound to hold with high probability. This follows by restarting the query algorithm if the query time exceeds (say by a factor of two) the expected running time. A standard application of Markov's inequality implies that this process would have to be restarted at most O(log n) times, with high probability.

Remark 3.5. The sampling algorithm is independent of whether or not we fully know the underlying family F and the sub-family G. This means the past queries do not affect the sampled object reported for the query G. Therefore, the almost uniform distribution property holds in the presence of several queries and independently for each of them.

3.4 Further Improvement.

In Appendix B, we show how to further improve the dependence on ε, from ε^{-2} down to log(1/ε).

Remark 3.6. Similar to Remark 3.4, the query time of this algorithm (Lemma B.3) can be made to work with high probability with an additional logarithmic factor. Thus, with high probability, the query time is O(g log(g/ε) log n).

Finally, in Appendix C, we show further applications of this data structure.

3.5 Handling outliers

Imagine a situation where we have a marked set of outliers O. We are interested in sampling from ∪G \ O.
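Stepping back, the exact-degree rejection-sampling loop of Section 3.2 can be sketched in Python (an illustrative toy, not the paper's implementation: it recomputes each degree naively and represents each set of the family as a Python set):

```python
import random

def sample_union(G):
    """Uniform sampling from the union of the sets in G (Section 3.2).

    Per round: pick a set X with probability |X|/m (step I), pick a
    uniform element x of X (step II), compute its exact degree d
    (step III), and accept with probability 1/d (step IV).  The output
    probability of x per round is d * (1/m) * (1/d) = 1/m, i.e. uniform
    over the union after rejection."""
    sizes = [len(X) for X in G]
    while True:
        X = random.choices(G, weights=sizes)[0]  # step (I)
        x = random.choice(sorted(X))             # step (II)
        d = sum(1 for Y in G if x in Y)          # step (III): exact degree
        if random.random() < 1.0 / d:            # step (IV)
            return x
```

Replacing step (III) by the geometric degree estimator of Section 3.3 is what removes the O(g) per-round degree cost.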
We assume that the total degree of the outliers in the query is at most m_O for some pre-specified parameter m_O. More precisely, we have d_G(O) = ∑_{x∈O} d_G(x) ≤ m_O.

Lemma 3.7. The input is a family of sets F that one can preprocess in linear time. A query is a sub-family G ⊆ F, a set of outliers O, a parameter m_O, and a parameter ε ∈ (0, 1). One can either
(A) Sample an element x ∈ ∪G \ O with ε-approximate uniform distribution. Specifically, the probabilities of any two elements to be output are the same up to a factor of 1 ± ε.
(B) Alternatively, report that d_G(O) > m_O.
The expected query time is O(m_O + g log(N/ε)), and the query succeeds with high probability, where g = |G| and N = ‖F‖. (Proof in Appendix A.4)

4 In the search for a fair near neighbor

In this section, we employ our data structure of Section 3 to show the two results on uniformly reporting a neighbor of a query point mentioned in Section 1.1. First, let us briefly give some preliminaries on LSH. We refer the reader to [HIM12] for further details. Throughout the section, we assume that our metric space admits the LSH data structure.

4.1 Background on LSH

Locality Sensitive Hashing (LSH). Let D denote the data structure constructed by LSH, and let c denote the approximation parameter of LSH. The data-structure D consists of L hash functions g_1, . . . , g_L (e.g., L ≈ n^{1/c} for a c-approximate LSH), which are chosen via a random process, and each function hashes the points to a set of buckets. For a point p ∈ M, let H_i(p) be the bucket that the point p is hashed to using the hash function g_i. The following are standard guarantees provided by the LSH data structure [HIM12].

Lemma 4.1. For a given query point q, let S = ∪_i H_i(q). Then for any point p ∈ N(q, r), with probability at least 1 − 1/e − 1/3, we have (i) p ∈ S and (ii) |S \ B(q, cr)| ≤ 3L, i.e., the number of outliers is at most 3L. Moreover, the expected number of outliers in any single bucket H_i(q) is at most 1.

Therefore, if we take t = O(log n) different data structures D_1, . . . , D_t, with g_i^j denoting the ith hash function in the jth data structure, we have the following lemma.

Lemma 4.2. Let the query point be q, and let p be any point in N(q, r). Then, with high probability, there exists a data structure D_j, such that p ∈ S = ∪_i H_i^j(q) and |S \ B(q, cr)| ≤ 3L.

By the above, the space used by LSH is S(n, c) = Õ(n · L) and the query time is Q(n, c) = Õ(L).

4.2 Approximate Neighborhood

For t = O(log n), let D_1, . . . , D_t be data structures constructed by LSH. Let F be the set of all buckets in all data structures, i.e., F = {H_i^j(p) | i ≤ L, j ≤ t, p ∈ P}. For a query point q, consider the family G of all buckets containing the query, i.e., G = {H_i^j(q) | i ≤ L, j ≤ t}, and thus |G| = O(L log n). Moreover, we let O be the set of outliers, i.e., the points that are farther than cr from q. Note that, as mentioned in Lemma 4.1, the expected number of outliers in each bucket of LSH is at most 1. Therefore, by Lemma 3.7, we immediately get the following result.

Lemma 4.3. Given a set P of n points and a parameter r, we can preprocess it such that given a query q, one can report a point p ∈ S with probability μ_p, where ϕ/(1+ε) ≤ μ_p ≤ (1+ε)ϕ, where S is a point set such that N(q, r) ⊆ S ⊆ N(q, cr), and ϕ = 1/|S|. The algorithm uses space S(n, c) and its expected query time is Õ(Q(n, c) · log(1/ε)). (Proof in Appendix A.5)

Remark 4.4.
For the L1 distance, the runtime of our algorithm is Õ(n^{1/c + o(1)}), and for the L2 distance, the runtime of our algorithm is Õ(n^{1/c^2 + o(1)}). These match the runtime of the standard LSH-based near neighbor algorithms up to polylog factors.

4.3 Exact Neighborhood

As noted earlier, the result of the previous section only guarantees a query time which holds in expectation. Here, we provide an algorithm whose query time holds with high probability. Note that here we cannot apply Lemma 3.7 directly, as the total number of outliers in our data structure might be large with non-negligible probability (and thus we cannot bound m_O). However, as noted in Lemma 4.2, with high probability there exists a subset of these data structures J ⊆ [t] such that for each j ∈ J, the number of outliers in S_j = ∪_i H_i^j(q) is at most 3L, and moreover, we have that N(q, r) ⊆ ∪_{j∈J} S_j. Therefore, on a high level, we make a guess J′ of J, which we initialize to J′ = [t], and start by drawing samples from G; once we encounter more than 3L outliers from a certain data structure D_j, we infer that j ∉ J, update the value of J′ = J′ \ {j}, and set the weights of the buckets corresponding to D_j equal to 0, so that they will never participate in the sampling process. As such, at any iteration of the algorithm we are effectively sampling from G = {H_i^j(q) | i ≤ L, j ∈ J′}.

Preprocessing. We keep t = O(log n) LSH data structures, which we refer to as D_1, . . . , D_t, and we keep the points hashed by the ith hash function of the jth data structure in the array denoted by H_i^j. Moreover, for each bucket in H_i^j, we store its size |H_i^j|.

Query Processing. We maintain the variables z_i^j showing the weights of the buckets H_i^j(q), which are initialized to the values |H_i^j(q)| stored in the preprocessing stage. Moreover, we keep the set of outliers detected from H_i^j(q) in O_i^j, which is initially set to be empty. While running the algorithm, as we detect an outlier in H_i^j(q), we add it to O_i^j, and we further decrease z_i^j by one. Moreover, in order to keep track of J′, for any data structure D_j, whenever ∑_i |O_i^j| exceeds 3L, we will ignore all buckets in D_j by setting all the corresponding z_i^j to zero.
At each iteration, the algorithm proceeds by sampling a bucket H_i^j(q) proportional to its weight z_i^j, but only among the buckets of those data structures D_j for which fewer than 3L outliers have been detected so far, i.e., j ∈ J′. We then sample a point uniformly at random from the points in the chosen bucket that have not been detected as outliers, i.e., H_i^j(q) \ O_i^j. If the sampled point is an outlier, we update our data structure accordingly. Otherwise, we proceed as in Lemma ??.

Definition 4.5 (Active data structures and active buckets). Consider an iteration k of the algorithm. Let us define the set of active data structures to be the data structures from which we have seen less than 3L outliers so far, and let us denote their indices by J′_k ⊆ [t], i.e., J′_k = {j | ∑_i |O_i^j| < 3L}. Moreover, let us define the active buckets to be all buckets containing the query in these active data structures, i.e., G_k = {H_i^j(q) | i ≤ L, j ∈ J′_k}.

Observation 4.6. Lemma 4.2 implies that, with high probability, at any iteration k of the algorithm, N(q, r) ⊆ ∪G_k.

Definition 4.7 (active size). For an active bucket H_i^j(q), we define its active size to be z_i^j, which is the total number of points in the bucket that have not yet been detected as outliers, i.e., |H_i^j(q) \ O_i^j|.

Lemma 4.8.
Given a set P of n points and a parameter r, we can preprocess it such that given a query q, one can report a point p ∈ P with probability μ_p, so that there exists a value ρ ∈ [0, 1] where

- For p ∈ N(q, r), we have ρ/(1+O(ε)) ≤ μ_p ≤ (1+O(ε))ρ.
- For p ∈ N(q, cr) \ N(q, r), we have μ_p ≤ (1+O(ε))ρ.
- For p ∉ N(q, cr), we have μ_p = 0.

The space used is Õ(S(n, c)) and the query time is Õ(Q(n, c) · log(1/ε)) with high probability. (Proof in Appendix A.6)

Lemma 4.9. Given a set P of n points and a parameter r, we can preprocess it such that given a query q, one can report a point p ∈ S with probability μ_p, where μ is an approximately uniform probability distribution: ϕ/(1+ε) ≤ μ_p ≤ ϕ(1+ε), where ϕ = 1/|N(q, r)|. The algorithm uses space S(n, c) and has query time of Õ(Q(n, c) · |N(q,cr)|/|N(q,r)| · log(1/ε)) with high probability. (Proof in Appendix A.7)

5 Experiments

In this section, we consider the task of retrieving a random point from the neighborhood of a given query point, and evaluate the effectiveness of our proposed algorithm empirically on real data sets.

Data set and Queries. We run our experiments on three datasets that are standard benchmarks in the context of Nearest Neighbor algorithms (see [ABF17]):

(I) Our first data set contains a random subset of 10K points in the MNIST training data set [LBBH98]³. The full data set contains 60K images of hand-written digits, where each image is of size 28 by 28. For the query, we use a random subset of 100 (out of 10K) images of the MNIST test data set. Therefore, each of our points lies in a 784-dimensional Euclidean space and each coordinate is in [0, 255].

(II) Second, we take the SIFT10K image descriptors, which contain 10K 128-dimensional points, as our data set, and 100 points as queries⁴.

(III) Finally, we take a random subset of 10K words from the GloVe data set [PSM14] and a random subset of 100 words as our query. GloVe is a data set of 1.2M word embeddings in 100-dimensional space, and we further normalize them to unit norm.

We use the L2 Euclidean distance to measure the distance between the points.

LSH data structure and parameters. We use the locality sensitive hashing data structure for the L2 Euclidean distance [AI08]. That is, each of the L hash functions g_i is a concatenation of k unit hash functions h_i^1, . . . , h_i^k. Each of the unit hash functions h_i^j is chosen by selecting a random direction (choosing every coordinate from a Gaussian distribution with parameters (0, 1)). All the points are projected onto this one-dimensional direction, and then we put a randomly shifted one-dimensional grid of length w along this direction. The cells of this grid are considered as the buckets of the unit hash function. For tuning the parameters of LSH, we follow the method described in [DIIM04] and the manual of the E2LSH library [And05], as follows.
For MNIST, the average distance of a query to its nearest neighbor in our data set is around 4.5. Thus we choose the near neighbor radius r = 5. Consequently, as we observe, the r-neighborhood of at least half of the queries is non-empty. Following the suggestion in [DIIM04] to set w ≈ 4, we tune it between 3 and 5 and set its value to w = 3.1. We tune k and L so that the false negative rate (the near points that are not retrieved by LSH) is less than 10%, and moreover the cost of hashing (proportional to L) balances out the cost of scanning. We thus get k = 15 and L = 100.
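The unit-hash construction described above can be sketched in Python (a minimal illustration under our own naming, using NumPy; not the E2LSH code itself):

```python
import numpy as np

def make_unit_hash(dim, w, rng):
    """One unit hash h(p) = floor((a.p + b) / w): project onto a random
    Gaussian direction a, then bucket with a randomly shifted grid of
    width w, as in the [DIIM04] scheme described above."""
    a = rng.normal(0.0, 1.0, size=dim)  # random direction
    b = rng.uniform(0.0, w)             # random shift of the grid
    return lambda p: int(np.floor((np.dot(a, p) + b) / w))

def make_lsh_function(dim, k, w, rng):
    """g(p): concatenation of k unit hashes; the resulting tuple of k
    grid cells indexes a bucket of g."""
    hs = [make_unit_hash(dim, w, rng) for _ in range(k)]
    return lambda p: tuple(h(p) for h in hs)
```

Two points collide under g only if they agree on all k unit hashes, which is why increasing k reduces collisions while increasing L (the number of independent g's) recovers the near points that a single g misses.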
This also agrees with the fact that L should be roughly the square root of the total number of points. Note that we use a single LSH data structure, as opposed to taking t = O(log n) instances. We use the same method for the other two data sets. For SIFT, we use R = 255, w = 4, k = 15, L = 100, and for GloVe we use R = 0.9, w = 3.3, k = 15, and L = 100.

³The dataset is available here: http://yann.lecun.com/exdb/mnist/
⁴The dataset is available here: http://corpus-texmex.irisa.fr/

Algorithms. Given a query point q, we retrieve all L buckets corresponding to the query. We then implement the following algorithms and compare their performance in returning a neighbor of the query point.

- Uniform/Uniform: Picks a bucket uniformly at random, and picks a random point in the bucket.
- Weighted/Uniform: Picks a bucket according to its size, and picks a uniformly random point inside the bucket.
- Optimal: Picks a bucket according to its size, and then picks a uniformly random point p inside the bucket. Then it computes p's degree exactly and rejects p with probability 1 − 1/deg(p).
- Degree approximation: Picks a bucket according to its size, and picks a uniformly random point p inside the bucket. It approximates p's degree and rejects p with probability 1 − 1/deg′(p).

Degree approximation method. We use the algorithm of Appendix B for the degree approximation: we implement a variant of the sampling algorithm which repeatedly samples a bucket uniformly at random and checks whether p belongs to the bucket. If the first time this happens is at iteration i, then it outputs the estimate deg′(p) = L/i.

Experiment Setup. In order to compare the performance of the different algorithms, for each query q, we compute M(q): the set of neighbors of q which fall into the same bucket as q under at least one of the L hash functions.
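The degree-approximation method just described can be sketched as follows (an illustrative toy under our own naming; it assumes p belongs to at least one of the L buckets, otherwise the loop would not terminate):

```python
import random

def approx_degree(p, buckets):
    """Estimate the degree of point p among the L query buckets.

    Repeatedly sample a bucket uniformly at random; if the first bucket
    containing p is found at iteration i, return deg'(p) = L / i.
    Since a uniformly chosen bucket contains p with probability
    deg(p)/L, the hitting time i concentrates around L/deg(p)."""
    L = len(buckets)
    i = 0
    while True:
        i += 1
        if p in random.choice(buckets):
            return L / i
```

This is the estimator plugged into the "Degree approximation" algorithm above, which then rejects p with probability 1 − 1/deg′(p).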
Then, for 100|M(q)| iterations, we draw a sample from the neighborhood of the query using each of the four algorithms. We compare the empirical distribution of the reported points on M(q) with the uniform distribution on it. More specifically, we compute the total variation distance (statistical distance)5 to the uniform distribution. We repeat each experiment 10 times and report the average result of all 10 experiments over all 100 query points.

Results. Figure 1 shows the comparison between all four algorithms. To compare their performance, we compute the total variation distance of each algorithm's empirical distribution to the uniform distribution. For the tuned parameters (k = 15, L = 100), our results are as follows. For MNIST, our proposed degree-approximation algorithm performs only 2.4 times worse than the optimal algorithm, while the other two standard sampling methods perform 6.6 and 10 times worse. For SIFT, our algorithm performs only 1.4 times worse than the optimal, while the other two perform 6.1 and 9.7 times worse. For GloVe, our algorithm performs only 2.7 times worse, while the other two perform 6.5 and 13.1 times worse than the optimal algorithm.

Moreover, in order to get a different range of degrees and show that our algorithm works well in those cases too, we further vary the parameters k and L of LSH. More precisely, to obtain higher degrees, we first decrease k (the number of unit hash functions used in each of the L hash functions), which results in more collisions; second, we increase L (the total number of hash functions). Both changes increase the degrees of points. For example, for the MNIST data set, this procedure increases the degree range from [1, 33] to [1, 99].

Query time discussion.
As stated in the experiment setup, in order to have a meaningful comparison between distributions, our code retrieves a random neighbor of each query 100m times, where m is the size of its neighborhood (which itself can be as large as 1000). We further repeat each experiment 10 times, so every query might be asked up to 10^6 times. This is costly for the optimal algorithm, which computes the degree exactly; we therefore exploit the fact that the same query is asked many times and precompute the exact degrees for the optimal solution. Consequently, it is not meaningful to compare these runtimes directly. Instead, we run the experiments on a smaller data set to compare the runtimes of all four approaches: our sampling approach is twice as fast as the optimal algorithm, and almost five times slower than the other two approaches. However, when the number of buckets (L) increases from 100 to 300, our algorithm is 4.3 times faster than the optimal algorithm, and almost 15 times slower than the other two approaches.

Trade-off of time and accuracy. We can show a trade-off between our proposed sampling approach and the optimal one. For the MNIST data set with the tuned parameters (k = 15 and L = 100), by asking twice as many queries (for the degree approximation), the solution of our approach improves from 2.4 to 1.6 times the optimal; with three times as many, it improves to 1.2; and with four times as many, to 1.05.
For the SIFT data set (using the same parameters), using twice as many queries, the solution improves from 1.4 to 1.16; with three times as many, it improves to 1.04; and with four times as many, to 1.05. For GloVe, using twice as many queries, the solution improves from 2.7 to 1.47; with three times as many, to 1.14; and with four times as many, to 1.01.

5 For two discrete distributions µ and ν on a finite set X, the total variation distance is (1/2) Σ_{x∈X} |µ(x) − ν(x)|.

Figure 1: Comparison of the performance of the four algorithms, measured by computing the statistical distance of their empirical distributions to the uniform distribution. (a) MNIST, varying the parameter k of LSH; (b) MNIST, varying the parameter L of LSH; (c) SIFT, varying the parameter k of LSH; (d) SIFT, varying the parameter L of LSH; (e) GloVe, varying the parameter k of LSH; (f) GloVe, varying the parameter L of LSH.

6 Acknowledgement

The authors would like to thank Piotr Indyk for helpful discussions about the modeling and experimental sections of the paper.

References

[ABD18] Alekh Agarwal, Alina Beygelzimer, Miroslav Dudík, John Langford, and Hanna M. Wallach. A reductions approach to fair classification. In Jennifer G.
Dy and Andreas Krause, editors, Proc. 35th Int. Conf. Mach. Learning (ICML), volume 80 of Proceedings of Machine Learning Research, pages 60–69. PMLR, 2018.

[ABF17] M. Aumüller, E. Bernhardsson, and A. Faithfull. ANN-benchmarks: A benchmarking tool for approximate nearest neighbor algorithms. In International Conference on Similarity Search and Applications, 2017.

[Ada07] Eytan Adar. User 4xxxxx9: Anonymizing query logs. In the workshop Query Log Analysis: Social and Technological Challenges, in association with WWW 2007, January 2007.

[AI08] A. Andoni and P. Indyk. Near-optimal hashing algorithms for approximate nearest neighbor in high dimensions. Commun. ACM, 51(1):117–122, 2008.

[And05] Alexandr Andoni. E2LSH 0.1 user manual. https://www.mit.edu/~andoni/LSH/manual.pdf, 2005.

[APS19] Martin Aumüller, Rasmus Pagh, and Francesco Silvestri. Fair near neighbor search: Independent range sampling in high dimensions. arXiv preprint arXiv:1906.01859, 2019.

[BCN19] Suman K. Bera, Deeparnab Chakrabarty, and Maryam Negahbani. Fair algorithms for clustering.
arXiv preprint arXiv:1901.02393, 2019.

[BHR17] Paul Beame, Sariel Har-Peled, Sivaramakrishnan Natarajan Ramamoorthy, Cyrus Rashtchian, and Makrand Sinha. Edge estimation with independent set oracles. CoRR, abs/1711.07567, 2017.

[BIO19] Arturs Backurs, Piotr Indyk, Krzysztof Onak, Baruch Schieber, Ali Vakilian, and Tal Wagner. Scalable fair clustering. arXiv preprint arXiv:1902.03519, 2019.

[Cho17] Alexandra Chouldechova. Fair prediction with disparate impact: A study of bias in recidivism prediction instruments. Big Data, 5(2):153–163, 2017.

[CKLV17] Flavio Chierichetti, Ravi Kumar, Silvio Lattanzi, and Sergei Vassilvitskii. Fair clustering through fairlets. In Advances in Neural Information Processing Systems, pages 5029–5037, 2017.

[CKLV19] Flavio Chierichetti, Ravi Kumar, Silvio Lattanzi, and Sergei Vassilvitskii. Matroids, matchings, and fairness. In Proceedings of Machine Learning Research, volume 89, pages 2212–2220. PMLR, 2019.

[DHP12] Cynthia Dwork, Moritz Hardt, Toniann Pitassi, Omer Reingold, and Richard Zemel. Fairness through awareness. In Proceedings of the 3rd Innovations in Theoretical Computer Science Conference, pages 214–226. ACM, 2012.

[DIIM04] M. Datar, N. Immorlica, P. Indyk, and V. S. Mirrokni. Locality-sensitive hashing scheme based on p-stable distributions. In Proc. 20th Annu. Sympos. Comput. Geom. (SoCG), pages 253–262, 2004.

[DOBD18] Michele Donini, Luca Oneto, Shai Ben-David, John S. Shawe-Taylor, and Massimiliano Pontil. Empirical risk minimization under fairness constraints. In Advances in Neural Information Processing Systems, pages 2791–2801, 2018.

[EJJ19] Hadi Elzayn, Shahin Jabbari, Christopher Jung, Michael Kearns, Seth Neel, Aaron Roth, and Zachary Schutzman. Fair algorithms for learning in allocation problems.
In Proceedings of the Conference on Fairness, Accountability, and Transparency, pages 170–179. ACM, 2019.

[ELL09] Brian S. Everitt, Sabine Landau, and Morven Leese. Cluster Analysis. Wiley Publishing, 4th edition, 2009.

[HAAA14] Ahmad Basheer Hassanat, Mohammad Ali Abbadi, Ghada Awad Altarawneh, and Ahmad Ali Alhasanat. Solving the problem of the K parameter in the KNN classifier using an ensemble learning approach. CoRR, abs/1409.0919, 2014.

[HGB07] Jiayuan Huang, Arthur Gretton, Karsten Borgwardt, Bernhard Schölkopf, and Alex J. Smola. Correcting sample selection bias by unlabeled data. In Advances in Neural Information Processing Systems, pages 601–608, 2007.

[HIM12] S. Har-Peled, P. Indyk, and R. Motwani. Approximate nearest neighbors: Towards removing the curse of dimensionality. Theory Comput., 8:321–350, 2012. Special issue in honor of Rajeev Motwani.

[HK01] David Harel and Yehuda Koren. On clustering using random walks. In Ramesh Hariharan, Madhavan Mukund, and V. Vinay, editors, FST TCS 2001: Foundations of Software Technology and Theoretical Computer Science, 21st Conference, Bangalore, India, December 13-15, 2001, Proceedings, volume 2245 of Lecture Notes in Computer Science, pages 18–41. Springer, 2001.

[HPS16] Moritz Hardt, Eric Price, and Nati Srebro. Equality of opportunity in supervised learning. In Daniel D. Lee, Masashi Sugiyama, Ulrike von Luxburg, Isabelle Guyon, and Roman Garnett, editors, Neural Info. Proc. Sys. (NIPS), pages 3315–3323, 2016.

[IM98] P. Indyk and R. Motwani. Approximate nearest neighbors: Towards removing the curse of dimensionality. In Proc. 30th Annu. ACM Sympos. Theory Comput. (STOC), pages 604–613, 1998.

[KLK12] Yi-Hung Kung, Pei-Sheng Lin, and Cheng-Hsiung Kao. An optimal k-nearest neighbor for density estimation.
Statistics & Probability Letters, 82(10):1786–1791, 2012.

[KLL17] Jon Kleinberg, Himabindu Lakkaraju, Jure Leskovec, Jens Ludwig, and Sendhil Mullainathan. Human decisions and machine predictions. The Quarterly Journal of Economics, 133(1):237–293, 2017.

[KSAM19] Matthäus Kleindessner, Samira Samadi, Pranjal Awasthi, and Jamie Morgenstern. Guarantees for spectral clustering with fairness constraints. arXiv preprint arXiv:1901.08668, 2019.

[LBBH98] Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.

[MSP16] Cecilia Munoz, Megan Smith, and DJ Patil. Big Data: A Report on Algorithmic Systems, Opportunity, and Civil Rights. Executive Office of the President and Penny Hill Press, 2016.

[OA18] Matt Olfat and Anil Aswani. Convex formulations for fair principal component analysis. arXiv preprint arXiv:1802.03765, 2018.

[PRW17] Geoff Pleiss, Manish Raghavan, Felix Wu, Jon Kleinberg, and Kilian Q. Weinberger. On fairness and calibration. In Advances in Neural Information Processing Systems, pages 5680–5689, 2017.

[PSM14] Jeffrey Pennington, Richard Socher, and Christopher Manning. GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1532–1543, 2014.

[QA08] Yinian Qi and Mikhail J. Atallah. Efficient privacy-preserving k-nearest neighbor search. In 28th IEEE International Conference on Distributed Computing Systems (ICDCS 2008), 17-20 June 2008, Beijing, China, pages 311–319. IEEE Computer Society, 2008.

[SDI06] Gregory Shakhnarovich, Trevor Darrell, and Piotr Indyk. Nearest-Neighbor Methods in Learning and Vision: Theory and Practice. The MIT Press, 2006.

[TE11] A. Torralba and A. A. Efros.
Unbiased look at dataset bias. In CVPR 2011, pages 1521–1528, 2011.

[ZVGRG17] Muhammad Bilal Zafar, Isabel Valera, Manuel Gomez Rodriguez, and Krishna P. Gummadi. Fairness beyond disparate treatment & disparate impact: Learning classification without disparate mistreatment. In Proceedings of the 26th International Conference on World Wide Web, pages 1171–1180, 2017.