{"title": "Practical Data-Dependent Metric Compression with Provable Guarantees", "book": "Advances in Neural Information Processing Systems", "page_first": 2617, "page_last": 2626, "abstract": "We introduce a new distance-preserving compact representation of multi-dimensional point-sets. Given n points in a d-dimensional space where each coordinate is represented using B bits (i.e., dB bits per point), it produces a representation of size O(d log(dB/epsilon) + log n) bits per point from which one can approximate the distances up to a factor of 1 + epsilon. Our algorithm almost matches the recent bound of (Indyk et al., 2017) while being much simpler. We compare our algorithm to Product Quantization (PQ) (Jegou et al., 2011), a state-of-the-art heuristic metric compression method. We evaluate both algorithms on several data sets: SIFT, MNIST, New York City taxi time series and a synthetic one-dimensional data set embedded in a high-dimensional space. Our algorithm produces representations that are comparable to or better than those produced by PQ, while having provable guarantees on its performance.", "full_text": "Practical Data-Dependent Metric Compression with\n\nProvable Guarantees\n\nPiotr Indyk\u2217\n\nMIT\n\nIlya Razenshteyn\u2217\n\nMIT\n\nTal Wagner\u2217\n\nMIT\n\nAbstract\n\nWe introduce a new distance-preserving compact representation of multi-\ndimensional point-sets. Given n points in a d-dimensional space where each\ncoordinate is represented using B bits (i.e., dB bits per point), it produces a rep-\nresentation of size O(d log(dB/\u0001) + log n) bits per point from which one can\napproximate the distances up to a factor of 1 \u00b1 \u0001. Our algorithm almost matches\nthe recent bound of [6] while being much simpler. We compare our algorithm\nto Product Quantization (PQ) [7], a state-of-the-art heuristic metric compression\nmethod.
We evaluate both algorithms on several data sets: SIFT (used in [7]),\nMNIST [11], New York City taxi time series [4] and a synthetic one-dimensional\ndata set embedded in a high-dimensional space. With appropriately tuned parame-\nters, our algorithm produces representations that are comparable to or better than\nthose produced by PQ, while having provable guarantees on its performance.\n\n1 Introduction\n\nCompact distance-preserving representations of high-dimensional objects are very useful tools in\ndata analysis and machine learning. They compress each data point in a data set using a small number\nof bits while preserving the distances between the points up to a controllable accuracy. This makes it\npossible to run data analysis algorithms, such as similarity search, machine learning classi\ufb01ers, etc., on\ndata sets of reduced size. The bene\ufb01ts of this approach include: (a) reduced running time, (b) reduced\nstorage, and (c) reduced communication cost (between machines, between CPU and RAM, between\nCPU and GPU, etc.). These three factors make the computation more ef\ufb01cient overall, especially on\nmodern architectures where the communication cost is often the dominant factor in the running time,\nso \ufb01tting the data in a single processing unit is highly bene\ufb01cial. Because of these bene\ufb01ts, various\ncompact representations have been extensively studied over the last decade, for applications such\nas: speeding up similarity search [3, 5, 10, 19, 22, 7, 15, 18], scalable learning algorithms [21, 12],\nstreaming algorithms [13] and other tasks. For example, a recent paper [8] describes a similarity\nsearch software package based on one such method (Product Quantization (PQ)) that has been used\nto solve very large similarity search problems over billions of points on GPUs at Facebook.\nThe methods for designing such representations can be classi\ufb01ed into data-dependent and data-\noblivious.
The former analyze the whole data set in order to construct the point-set representation,\nwhile the latter apply a \ufb01xed procedure individually to each data point. A classic example of the\ndata-oblivious approach is based on randomized dimensionality reduction [9], which states that\nany set of n points in the Euclidean space of arbitrary dimension D can be mapped into a space of\ndimension d = O(\u0001\u22122 log n), such that the distances between all pairs of points are preserved up to a\nfactor of 1 \u00b1 \u0001. This allows representing each point using d(B + log D) bits, where B is the number\n\n\u2217Authors ordered alphabetically.\n\n31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.\n\n\fof bits of precision in the coordinates of the original pointset. 2 More ef\ufb01cient representations are\npossible if the goal is to preserve only the distances in a certain range. In particular, O(\u0001\u22122 log n) bits\nare suf\ufb01cient to distinguish between distances smaller than 1 and greater than 1 + \u0001, independently of\nthe precision parameter [10] (see also [16] for kernel generalizations). Even more ef\ufb01cient methods\nare known if the coordinates are binary [3, 12, 18].\nData-dependent methods compute the bit representations of points \u201cholistically\", typically by solving\na global optimization problem. Examples of this approach include Semantic Hashing [17], Spectral\nHashing [22] or Product Quantization [7] (see also the survey [20]). Although successful, most of\nthe results in this line of research are empirical in nature, and we are not aware of any worst-case\naccuracy vs. compression tradeoff bounds for those methods along the lines of the aforementioned\ndata oblivious approaches.\nA recent work [6] shows that it is possible to combine the two approaches and obtain algorithms that\nadapt to the data while providing worst-case accuracy/compression tradeoffs. 
In particular, the latter\npaper shows how to construct representations of d-dimensional pointsets that preserve all distances up\nto a factor of 1\u00b1\u0001 while using only O((d+log n) log(1/\u0001)+log(Bn)) bits per point. Their algorithm\nuses hierarchical clustering in order to group close points together, and represents each point by a\ndisplacement vector from a nearby point that has already been stored. The displacement vector is\nthen appropriately rounded to reduce the representation size. Although theoretically interesting, that\nalgorithm is rather complex and (to the best of our knowledge) has not been implemented.\n\nOur results. The main contribution of this paper is QuadSketch (QS), a simple data-adaptive\nalgorithm, which is both provable and practical. It represents each point using O(d log(dB/\u0001)+log n)\nbits, where (as before) we can set d = O(\u0001\u22122 log n) using the Johnson-Lindenstrauss lemma. Our\nbound signi\ufb01cantly improves over the \u201cvanilla\u201d O(dB) bound (obtained by storing all d coordinates\nto full precision), and comes close to the bound of [6]. At the same time, the algorithm is quite simple\nand intuitive: it computes a d-dimensional quadtree3 and appropriately prunes its edges and nodes.4\nWe evaluate QuadSketch experimentally on both real and synthetic data sets: a SIFT feature data\nset from [7], MNIST [11], time series data re\ufb02ecting taxi ridership in New York City [4] and a\nsynthetic data set (Diagonal) containing random points from a one-dimensional subspace (i.e., a line)\nembedded in a high-dimensional space. The data sets are quite diverse: SIFT and MNIST data sets are\nde facto \u201cstandard\u201d test cases for nearest neighbor search and distance preserving sketches, NYC taxi\ndata was designed to contain anomalies and \u201cirrelevant\u201d dimensions, while Diagonal has extremely\nlow intrinsic dimension.
We compare our algorithm to Product Quantization (PQ) [7], a state-of-the-art\nmethod for computing distance-preserving sketches, as well as a simple baseline uniform\nquantization method (Grid). The sketch length/accuracy tradeoffs for QS and PQ are comparable on\nSIFT and MNIST data, with PQ having higher accuracy for shorter sketches and QS having better\naccuracy for longer sketches. On NYC taxi data, the accuracy of QS is higher over the whole range\nof sketch lengths. Finally, Diagonal exempli\ufb01es a situation where the low dimensionality of the data\nset hinders the performance of PQ, while QS naturally adapts to this data set. Overall, QS performs\nwell on \u201ctypical\u201d data sets, while its provable guarantees ensure robust performance in a wide range\nof scenarios. Both algorithms improve over the baseline quantization method.\n\n2 Formal Statement of Results\nPreliminaries. Let X = {x1, . . . , xn} \u2282 Rd be a pointset in Euclidean space. A compression\nscheme constructs from X a bit representation referred to as a sketch. Given the sketch, and\nwithout access to the original pointset, one can decompress the sketch into an approximate pointset\n\n2The bounds can be stated more generally in terms of the aspect ratio \u03a6 of the point-set. See Section 2 for\n\nthe discussion.\n\n3Traditionally, the term \u201cquadtree\u201d is used for the case of d = 2, while its higher-dimensional variants are\ncalled \u201chyperoctrees\u201d [23]. However, for the sake of simplicity, in this paper we use the same term \u201cquadtree\u201d\nfor any value of d.\n\n4We note that a similar idea (using kd-trees instead of quadtrees) has been proposed earlier in [1]. However,\n\nwe are not aware of any provable space/distortion tradeoffs for the latter algorithm.\n\n2\n\n\f\u02dcX = {\u02dcx1, . . . , \u02dcxn} \u2282 Rd.
The goal is to minimize the size of the sketch, while approximately\npreserving the geometric properties of the pointset, in particular the distances and near neighbors.\nIn the previous section we parameterized the sketch size in terms of the number of points n, the\ndimension d, and the bits per coordinate B. In fact, our results are more general, and can be stated in\nterms of the aspect ratio of the pointset, denoted by \u03a6 and de\ufb01ned as the ratio between the largest\nand smallest interpoint distances,\n\n\u03a6 = max1\u2264i<j\u2264n \u2016xi \u2212 xj\u2016 / min1\u2264i<j\u2264n \u2016xi \u2212 xj\u2016.\n\nTheorem 1. Given \u0001, \u03b4 > 0, let \u039b = O(log(d log \u03a6/\u0001\u03b4)) and L = log \u03a6 + \u039b. QuadSketch runs in\ntime \u02dcO(ndL) and produces a sketch of size O(nd\u039b + n log n) bits, with the following guarantee:\nFor every i \u2208 [n],\n\nPr[\u2200j\u2208[n] \u2016\u02dcxi \u2212 \u02dcxj\u2016 = (1 \u00b1 \u0001)\u2016xi \u2212 xj\u2016] \u2265 1 \u2212 \u03b4.\n\nIn particular, with probability 1 \u2212 \u03b4, if \u02dcxi\u2217 is the nearest neighbor of \u02dcxi in \u02dcX, then xi\u2217 is a (1 + \u0001)-\napproximate nearest neighbor of xi in X.\n\nNote that the theorem allows us to compress the input point-set into a sketch and then decompress it\nback into a point-set which can be fed to a black-box similarity search algorithm. Alternatively, one\ncan decompress only speci\ufb01c points and approximate the distance between them.\nFor example, if d = O(\u0001\u22122 log n) and \u03a6 is polynomially bounded in n, then Theorem 1 uses\n\u039b = O(log log n + log(1/\u0001)) bits per coordinate to preserve (1 + \u0001)-approximate nearest neighbors.\nThe full version of QuadSketch, described in Section 3, allows extra \ufb01ne-tuning by exposing additional\nparameters of the algorithm. The guarantees for the full version are summarized by Theorem 3 in\nSection 3.\n\nMaximum distortion.
We also show that a recursive application of QuadSketch makes it possible\nto approximately preserve the distances between all pairs of points. This is the setting considered\nin [6]. (In contrast, Theorem 1 preserves the distances from any single point.)\nTheorem 2. Given \u0001 > 0, let \u039b = O(log(d log \u03a6/\u0001)) and L = log \u03a6 + \u039b. There is a randomized\nalgorithm that runs in time \u02dcO(ndL) and produces a sketch of size O(nd\u039b + n log n) bits, such that\nwith high probability, every distance \u2016xi \u2212 xj\u2016 can be recovered from the sketch up to distortion\n1 \u00b1 \u0001.\n\nTheorem 2 has smaller sketch size than that provided by the \u201cvanilla\u201d bound, and only slightly\nlarger than that in [6]. For example, for d = O(\u0001\u22122 log n) and \u03a6 = poly(n), it improves over the\n\u201cvanilla\u201d bound by a factor of O(log n/ log log n) and is lossier than the bound of [6] by a factor\nof O(log log n). However, compared to the latter, our construction time is nearly linear in n. The\ncomparison is summarized in Table 1.\n\n3\n\n\fTable 1: Comparison of Euclidean metric sketches with maximum distortion 1 \u00b1 \u0001, for d =\nO(\u0001\u22122 log n) and log \u03a6 = O(log n).\n\nREFERENCE | BITS PER POINT | CONSTRUCTION TIME\n\u201cVanilla\u201d bound | O(\u0001\u22122 log2 n) | \u2013\nAlgorithm of [6] | O(\u0001\u22122 log n log(1/\u0001)) | \u02dcO(n1+\u03b1 + \u0001\u22122n) for \u03b1 \u2208 (0, 1]\nTheorem 2 | O(\u0001\u22122 log n (log log n + log(1/\u0001))) | \u02dcO(\u0001\u22122n)\n\nWe remark that Theorem 2 does not let us recover an approximate embedding of the pointset,\n\u02dcx1, . . . , \u02dcxn, as Theorem 1 does.
Instead, the sketch functions as an oracle that accepts queries of the\nform (i, j) and returns an approximation for the distance \u2016xi \u2212 xj\u2016.\n\n3 The Compression Scheme\n\nThe sketching algorithm takes as input the pointset X, and two parameters L and \u039b that control the\namount of compression.\n\nStep 1: Randomly shifted grid. The algorithm starts by imposing a randomly shifted axis-parallel\ngrid on the points. We \ufb01rst enclose the whole pointset in an axis-parallel hypercube H. Let\n\u2206\u2032 = maxi\u2208[n] \u2016x1 \u2212 xi\u2016, and \u2206 = 2\u2308log \u2206\u2032\u2309. Set up H to be centered at x1 with side length 4\u2206.\nNow choose \u03c31, . . . , \u03c3d \u2208 [\u2212\u2206, \u2206] independently and uniformly at random, and shift H in each\ncoordinate j by \u03c3j. By the choice of side length 4\u2206, one can see that H after the shift still contains\nthe whole pointset. For every integer \u2113 such that \u2212\u221e < \u2113 \u2264 log(4\u2206), let G\u2113 denote the axis-parallel\ngrid with cell side 2\u2113 which is aligned with H.\nNote that this step can often be eliminated in practice without affecting the empirical performance of\nthe algorithm, but it is necessary in order to achieve guarantees for arbitrary pointsets.\n\nStep 2: Quadtree construction. The 2d-ary quadtree on the nested grids G\u2113 is naturally de\ufb01ned\nby associating every grid cell c in G\u2113 with the tree node at level \u2113, such that its children are the 2d\ngrid cells in G\u2113\u22121 which are contained in c.
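This nesting of cells across consecutive levels can be checked numerically. The following is an illustrative Python sketch (all variable names are ours, not from the paper), assuming the randomly shifted grid of Step 1:

```python
import numpy as np

def cell_of(x, level, shift):
    """Index (one integer per coordinate) of the cell of the level-`level` grid
    that contains the shifted point x; cells have side length 2**level."""
    return np.floor((x + shift) / 2.0 ** level).astype(np.int64)

rng = np.random.default_rng(0)
d = 3
x = rng.uniform(0.0, 8.0, size=d)       # a point of a toy pointset
shift = rng.uniform(-8.0, 8.0, size=d)  # the random shifts sigma_1, ..., sigma_d

# A level-l cell is split into 2**d level-(l-1) cells by halving it along every
# coordinate, so integer-halving a child's cell index recovers its parent's index:
parent = cell_of(x, 3, shift)
child = cell_of(x, 2, shift)
assert np.all(child // 2 == parent)
```

The assertion holds for any point and shift, since floor division nests: a point's cell at a coarser level is determined by its cell at the finer level.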
The edge connecting a node v to a child v\u2032 is labeled\nwith a bitstring of length d de\ufb01ned as follows: the jth bit is 0 if v\u2032 coincides with the bottom half of\nv along coordinate j, and 1 if v\u2032 coincides with the upper half along that coordinate.\nIn order to construct the tree, we start with H as the root, and bucket the points contained in it into\nthe 2d children cells. We only add child nodes for cells that contain at least one point of X. Then we\ncontinue by recursing on the child nodes. The quadtree construction is \ufb01nished after L levels. We\ndenote the resulting edge-labeled tree by T \u2217. A construction for L = 2 is illustrated in Figure 1.\n\nFigure 1: Quadtree construction for points x, y, z. The x and y coordinates are written as binary\nnumbers.\n\n4\n\n\fWe de\ufb01ne the level of a tree node with side length 2\u2113 to be \u2113 (note that \u2113 can be negative). The degree\nof a node in T \u2217 is its number of children. Since all leaves are located at the bottom level, each point\nxi \u2208 X is contained in exactly one leaf, which we henceforth denote by vi.\n\nStep 3: Pruning. Consider a downward path u0, u1, . . . , uk in T \u2217, such that u1, . . . , uk\u22121 are\nnodes with degree 1, and u0, uk are nodes with degree other than 1 (uk may be a leaf). For every\nsuch path in T \u2217, if k > \u039b + 1, we remove the nodes u\u039b+1, . . . , uk\u22121 from T \u2217 with all their adjacent\nedges (and edge labels). Instead, we connect uk directly to u\u039b as its child. We refer to that edge as\nthe long edge, and label it with the length of the path it replaces (k \u2212 \u039b). The original edges from T \u2217\nare called short edges. At the end of the pruning step, we denote the resulting tree by T.\n\nThe sketch. For each point xi \u2208 X the sketch stores the index of the leaf vi that contains it. In\naddition it stores the structure of the tree T, encoded using the Eulerian Tour Technique5.
Speci\ufb01cally,\nstarting at the root, we traverse T in the Depth First Search (DFS) order. In each step, DFS either\nexplores the child of the current node (downward step), or returns to the parent node (upward step).\nWe encode a downward step by 0 and an upward step by 1. With each downward step we also store\nthe label of the traversed edge (a length-d bitstring for a short edge or the edge length for a long edge,\nand an additional bit marking if the edge is short or long).\n\nDecompression. Recovering \u02dcxi from the sketch is done simply by following the downward path\nfrom the root of T to the associated leaf vi, collecting the edge labels of the short edges, and placing\nzeros instead of the missing bits of the long edges. The collected bits then correspond to the binary\nexpansion of the coordinates of \u02dcxi.\nMore formally, for every node u (not necessarily a leaf) we de\ufb01ne c(u) \u2208 Rd as follows: For\nj \u2208 {1, . . . , d}, concatenate the jth bit of every short edge label traversed along the downward path\nfrom the root to u. When traversing a long edge labeled with length k, concatenate k zeros.6 Then,\nplace a binary \ufb02oating point in the resulting bitstring, after the bit corresponding to level 0. (Recall\nthat the levels in T are de\ufb01ned by the grid cell side lengths, and T might not have any nodes in level\n0; in this case we need to pad with 0\u2019s either on the right or on the left until we have a 0 bit in the\nlocation corresponding to level 0.) The resulting binary string is the binary expansion of the jth\ncoordinate of c(u). Now \u02dcxi is de\ufb01ned to be c(vi).\n\nBlock QuadSketch. We can further modify QuadSketch in a manner similar to Product Quantiza-\ntion [7]. Speci\ufb01cally, we partition the d dimensions into m blocks B1 . . . Bm of size d/m each, and\napply QuadSketch separately to each block. More formally, for each Bi, we apply QuadSketch to the\npointset (x1)Bi . . . 
(xn)Bi, where xB denotes the d/m-dimensional vector obtained by projecting x\non the dimensions in B.\nThe following statement is an immediate corollary of Theorem 1.\nTheorem 3. Given \u0001, \u03b4 > 0, and m dividing d, set the pruning parameter \u039b to O(log(d log \u03a6/\u0001\u03b4))\nand the number of levels L to log \u03a6 + \u039b. The m-block variant of QuadSketch runs in time \u02dcO(ndL)\nand produces a sketch of size O(nd\u039b + nm log n) bits, with the following guarantee: For every\ni \u2208 [n],\n\nPr[\u2200j\u2208[n] \u2016\u02dcxi \u2212 \u02dcxj\u2016 = (1 \u00b1 \u0001)\u2016xi \u2212 xj\u2016] \u2265 1 \u2212 m\u03b4.\n\nIt can be seen that increasing the number of blocks m up to a certain threshold (d\u039b/ log n) does\nnot affect the asymptotic bound on the sketch size. Although we cannot prove that varying m allows\nus to improve the accuracy of the sketch, this seems to be the case empirically, as demonstrated in the\nexperimental section.\n\n5See e.g., https://en.wikipedia.org/wiki/Euler_tour_technique.\n6This is the \u201clossy\u201d step in our sketching method: the original bits could be arbitrary, but they are replaced\n\nwith zeros.\n\n5\n\n\fTable 2: Datasets used in our empirical evaluation. The aspect ratio of SIFT and MNIST is estimated\non a random sample.\n\nDataset | Points | Dimension | Aspect ratio (\u03a6)\nSIFT | 1,000,000 | 128 | \u2265 83.2\nMNIST | 60,000 | 784 | \u2265 9.2\nNYC Taxi | 8,874 | 48 | 49.5\nDiagonal (synthetic) | 10,000 | 128 | 20,478,740.2\n\n4 Experiments\n\nWe evaluate QuadSketch experimentally and compare its performance to Product Quantization\n(PQ) [7], a state-of-the-art compression scheme for approximate nearest neighbors, and to a baseline\nof uniform scalar quantization, which we refer to as Grid.
For each dimension of the dataset, Grid\nplaces k equally spaced landmark scalars on the interval between the minimum and the maximum\nvalues along that dimension, and rounds each coordinate to the nearest landmark.\nAll three algorithms work by partitioning the data dimensions into blocks, and performing a quanti-\nzation step in each block independently of the other ones. QuadSketch and PQ take the number of\nblocks as a parameter, and Grid uses blocks of size 1. The quantization step is the basic algorithm\ndescribed in Section 3 for QuadSketch, k-means for PQ, and uniform scalar quantization for Grid.\nWe test the algorithms on four datasets: The SIFT data used in [7], MNIST [11] (with all vectors\nnormalized to 1), NYC Taxi ridership data [4], and a synthetic dataset called Diagonal, consisting of\nrandom points on a line embedded in a high-dimensional space. The properties of the datasets are\nsummarized in Table 2. Note that we were not able to compute the exact diameters for MNIST and\nSIFT, hence we only report estimates for \u03a6 for these data sets, obtained via random sampling.\nThe Diagonal dataset consists of 10, 000 points of the form (x, x, . . . , x), where x is chosen inde-\npendently and uniformly at random from the interval [0..40000]. This yields a dataset with a very\nlarge aspect ratio \u03a6, and on which partitioning into blocks is not expected to be bene\ufb01cial since all\ncoordinates are maximally correlated.\nFor SIFT and MNIST we use the standard query set provided with each dataset. For Taxi and\nDiagonal we use 500 queries chosen at random from each dataset. For the sake of consistency, for all\ndata sets, we apply the same quantization process jointly to both the point set and the query set, for\nboth PQ and QS. We note, however, that both algorithms can be run on \u201cout of sample\u201d queries.\nFor each dataset, we enumerate the number of blocks over all divisors of the dimension d. For\nQuadSketch, L ranges in 2, . . . 
, 20, and \u039b ranges in 1, . . . , L \u2212 1. For PQ, the number of k-means\nlandmarks per block ranges in 2^5, 2^6, . . . , 2^12. For both algorithms we include the results for all\ncombinations of the parameters, and plot the envelope of the best performing combinations.\nWe report two measures of performance for each dataset: (a) the accuracy, de\ufb01ned as the fraction of\nqueries for which the sketch returns the true nearest neighbor, and (b) the average distortion, de\ufb01ned\nas the ratio between the (true) distances from the query to the reported near neighbor and to the true\nnearest neighbor. The sketch size is measured in bits per coordinate. The results appear in Figures 2\nto 5. Note that the vertical coordinate in the distortion plots corresponds to the value of \u0001, not 1 + \u0001.\nFor SIFT, we also include a comparison with Cartesian k-Means (CKM) [14], in Figure 6.\n\n4.1 QuadSketch Parameter Setting\n\nWe plot how the different parameters of QuadSketch affect its performance. Recall that L determines\nthe number of levels in the quadtree prior to the pruning step, and \u039b controls the amount of pruning.\nBy construction, the higher we set these parameters, the larger the sketch will be and the better its\naccuracy. The empirical tradeoff for the SIFT dataset is plotted in Figure 7.\n\n6\n\n\fFigure 2: Results for the SIFT dataset.\n\nFigure 3: Results for the MNIST dataset.\n\nFigure 4: Results for the Taxi dataset.\n\nFigure 5: Results for the Diagonal dataset.\n\n7\n\n\fFigure 6: Additional results for the SIFT dataset.\n\nFigure 7: On the left, L varies from 2 to 11 for a \ufb01xed setting of 16 blocks and \u039b = L \u2212 1 (no\npruning). On the right, \u039b varies from 1 to 9 for a \ufb01xed setting of 16 blocks and L = 10. Increasing \u039b\nbeyond 6 does not have further effect on the resulting sketch.\n\nThe optimal setting for the number of blocks is not monotone, and generally depends on the speci\ufb01c\ndataset.
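For completeness, the accuracy and average distortion measures defined above can be computed as in the following sketch; this is our own illustrative code (names and structure are ours), not the evaluation harness used in the paper:

```python
import numpy as np

def evaluate(queries, points, dec_points, dec_queries):
    """Accuracy: fraction of queries whose nearest neighbor among the
    decompressed points is the true nearest neighbor. Average distortion:
    mean ratio of the true distance to the reported neighbor over the true
    distance to the true nearest neighbor (so it is always >= 1)."""
    hits, ratios = 0, []
    for q, dq in zip(queries, dec_queries):
        true_d = np.linalg.norm(points - q, axis=1)
        true_nn = int(np.argmin(true_d))
        reported = int(np.argmin(np.linalg.norm(dec_points - dq, axis=1)))
        hits += int(reported == true_nn)
        ratios.append(true_d[reported] / true_d[true_nn])
    return hits / len(queries), float(np.mean(ratios))

# Sanity check: with a lossless "sketch", both measures are perfect.
pts = np.random.default_rng(1).normal(size=(50, 8))
qs = pts[:5] + 1e-6
acc, dist = evaluate(qs, pts, pts, qs)
assert acc == 1.0 and dist == 1.0
```

Here dec_queries holds the queries after the same quantization as the points, matching the joint quantization of point set and query set used for both PQ and QS above.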
It was noted in [7] that on SIFT data an intermediate number of blocks gives the best results,\nand this is con\ufb01rmed by our experiments. Figure 8 lists the performance on the SIFT dataset for a\nvarying number of blocks, for a \ufb01xed setting of L = 6 and \u039b = 5. It shows that the sketch quality\nremains essentially the same, while the size varies signi\ufb01cantly, with the optimal size attained at 16\nblocks.\n\n# Blocks | Bits per coordinate | Accuracy | Average distortion\n1 | 5.17 | 0.719 | 1.0077\n2 | 4.523 | 0.717 | 1.0076\n4 | 4.02 | 0.722 | 1.0079\n8 | 3.272 | 0.712 | 1.0079\n16 | 2.795 | 0.712 | 1.008\n32 | 3.474 | 0.712 | 1.0082\n64 | 4.032 | 0.713 | 1.0081\n128 | 4.079 | 0.72 | 1.0078\n\nFigure 8: QuadSketch accuracy on SIFT data by number of blocks, with L = 6 and \u039b = 5.\n\n8\n\n\fReferences\n[1] R. Arandjelovi\u0107 and A. Zisserman. Extremely low bit-rate nearest neighbor search using a set\ncompression tree. IEEE transactions on pattern analysis and machine intelligence, 36(12):2396\u2013\n2406, 2014.\n\n[2] Y. Bartal. Probabilistic approximation of metric spaces and its algorithmic applications. In\nFoundations of Computer Science, 1996. Proceedings., 37th Annual Symposium on, pages\n184\u2013193. IEEE, 1996.\n\n[3] A. Z. Broder. On the resemblance and containment of documents. In Compression and\nComplexity of Sequences 1997. Proceedings, pages 21\u201329. IEEE, 1997.\n\n[4] S. Guha, N. Mishra, G. Roy, and O. Schrijvers. Robust random cut forest based anomaly\ndetection on streams. In International Conference on Machine Learning, pages 2712\u20132721,\n2016.\n\n[5] P. Indyk and R. Motwani. Approximate nearest neighbors: towards removing the curse of\ndimensionality. In Proceedings of the thirtieth annual ACM symposium on Theory of computing,\npages 604\u2013613. ACM, 1998.\n\n[6] P. Indyk and T. Wagner. Near-optimal (Euclidean) metric compression.
In Proceedings of the\nTwenty-Eighth Annual ACM-SIAM Symposium on Discrete Algorithms, pages 710\u2013723. SIAM,\n2017.\n\n[7] H. Jegou, M. Douze, and C. Schmid. Product quantization for nearest neighbor search. IEEE\n\ntransactions on pattern analysis and machine intelligence, 33(1):117\u2013128, 2011.\n\n[8] J. Johnson, M. Douze, and H. J\u00e9gou. Billion-scale similarity search with GPUs. CoRR,\n\nabs/1702.08734, 2017.\n\n[9] W. B. Johnson and J. Lindenstrauss. Extensions of Lipschitz mappings into a Hilbert space.\n\nContemporary mathematics, 26(189-206):1, 1984.\n\n[10] E. Kushilevitz, R. Ostrovsky, and Y. Rabani. Ef\ufb01cient search for approximate nearest neighbor\n\nin high dimensional spaces. SIAM Journal on Computing, 30(2):457\u2013474, 2000.\n\n[11] Y. LeCun and C. Cortes. The MNIST database of handwritten digits, 1998.\n\n[12] P. Li, A. Shrivastava, J. L. Moore, and A. C. K\u00f6nig. Hashing algorithms for large-scale learning.\n\nIn Advances in neural information processing systems, pages 2672\u20132680, 2011.\n\n[13] S. Muthukrishnan et al. Data streams: Algorithms and applications. Foundations and Trends\u00ae\n\nin Theoretical Computer Science, 1(2):117\u2013236, 2005.\n\n[14] M. Norouzi and D. J. Fleet. Cartesian k-means. In Proceedings of the IEEE Conference on\n\nComputer Vision and Pattern Recognition, pages 3017\u20133024, 2013.\n\n[15] M. Norouzi, D. J. Fleet, and R. R. Salakhutdinov. Hamming distance metric learning. In\n\nAdvances in neural information processing systems, pages 1061\u20131069, 2012.\n\n[16] M. Raginsky and S. Lazebnik. Locality-sensitive binary codes from shift-invariant kernels. In\n\nAdvances in neural information processing systems, pages 1509\u20131517, 2009.\n\n[17] R. Salakhutdinov and G. Hinton. Semantic hashing. International Journal of Approximate\n\nReasoning, 50(7):969\u2013978, 2009.\n\n[18] A. Shrivastava and P. Li. Densifying one permutation hashing via rotation for fast near neighbor\n\nsearch.
In ICML, pages 557\u2013565, 2014.\n\n[19] A. Torralba, R. Fergus, and Y. Weiss. Small codes and large image databases for recognition. In\nComputer Vision and Pattern Recognition, 2008. CVPR 2008. IEEE Conference on, pages 1\u20138.\nIEEE, 2008.\n\n9\n\n\f[20] J. Wang, W. Liu, S. Kumar, and S.-F. Chang. Learning to hash for indexing big data: a survey.\n\nProceedings of the IEEE, 104(1):34\u201357, 2016.\n\n[21] K. Weinberger, A. Dasgupta, J. Langford, A. Smola, and J. Attenberg. Feature hashing for\nlarge scale multitask learning. In Proceedings of the 26th Annual International Conference on\nMachine Learning, pages 1113\u20131120. ACM, 2009.\n\n[22] Y. Weiss, A. Torralba, and R. Fergus. Spectral hashing. In Advances in neural information\n\nprocessing systems, pages 1753\u20131760, 2009.\n\n[23] M.-M. Yau and S. N. Srihari. A hierarchical data structure for multidimensional digital images.\n\nCommunications of the ACM, 26(7):504\u2013515, 1983.\n\n10\n\n\f", "award": [], "sourceid": 1504, "authors": [{"given_name": "Piotr", "family_name": "Indyk", "institution": "MIT"}, {"given_name": "Ilya", "family_name": "Razenshteyn", "institution": "Columbia University"}, {"given_name": "Tal", "family_name": "Wagner", "institution": "MIT"}]}