Möbius Transformation for Fast Inner Product Search on Graph

Zhixin Zhou, Shulong Tan, Zhaozhuo Xu and Ping Li
Cognitive Computing Lab, Baidu Research
10900 NE 8th St. Bellevue, WA 98004, USA
1195 Bordeaux Dr. Sunnyvale, CA 94089, USA
{zhixin0825,zhaozhuoxu}@gmail.com, {shulongtan,liping11}@baidu.com

Advances in Neural Information Processing Systems (NeurIPS 2019), pages 8218-8229.

Abstract

We present a fast search-on-graph algorithm for Maximum Inner Product Search (MIPS). This optimization problem is challenging since traditional Approximate Nearest Neighbor (ANN) search methods may not perform efficiently under a non-metric similarity measure. Our proposed method is based on the property that the Möbius transformation introduces an isomorphism between a subgraph of the ℓ2-Delaunay graph and the Delaunay graph for inner product. Under this observation, we propose a simple but novel graph indexing and searching algorithm to find the optimal solution with the largest inner product with the query.
Experiments show our approach leads to significant improvements compared to existing methods.

1 Introduction

This paper focuses on a discrete optimization problem. Given a large dataset S of high dimensional vectors and a query point q in Euclidean space, we aim to search for the x ∈ S that maximizes the inner product x⊤q. Rigorously speaking, we will develop an efficient algorithm for computing

    p = arg max_{x∈S} x⊤q.    (1)

This so-called Maximum Inner Product Search (MIPS) problem has wide applicability in machine learning models, such as recommender systems [35, 16], natural language processing [5, 33], multi-class or multi-label classifiers [38, 34], computational advertising for search engines [9], etc. Because of its importance and popularity, there has been substantial research on effective and efficient MIPS algorithms. The early work of [27] proposed tree-based methods to solve the MIPS problem. More recently, a line of work in the literature has tried to transform MIPS into traditional Approximate Nearest Neighbor (ANN) search [11, 12, 18] by lifting the base data vectors and query vectors asymmetrically to a higher dimensional space [2, 28, 30, 26, 29, 36]. After the transformation, the well-developed ANN search methods can then be applied to solve the MIPS problem. There are other proposals designed for the MIPS task, including quantization based methods [15] and graph based methods [25].

In this paper, we will introduce a new graph based MIPS algorithm. Graph based methods have been well developed for ANN search in metric space and show significant superiority [20, 4, 24, 13]. The recent work [25], namely ip-NSW, attempts to extend the graph based methods for ANN search to MIPS. The authors introduce the concept of the IP-Delaunay graph, which is the smallest graph that can guarantee the return of exact solutions for MIPS by greedy search.
Practically, ip-NSW tries to approximate the IP-Delaunay graph via Navigable Small World (NSW) [23] and Hierarchical Navigable Small World (HNSW) [24]. To improve upon existing methods, we propose a better graph based method for MIPS, which preserves the advantages of similarity graphs in metric space.

Our method is based on a Möbius transformation of the dataset, which connects graph based indices for MIPS and ANN search. We find that under the Möbius transformation, there is an isomorphism between two graphs: (a) the IP-Delaunay graph before the transformation; (b) a subgraph of the Delaunay triangulation w.r.t. the ℓ2-norm (the ℓ2-Delaunay graph) after the transformation. Based on this observation, we approximate the IP-Delaunay graph in two steps: (i) map the data points via the Möbius transformation; (ii) approximate the ℓ2-Delaunay graph on the transformed data points plus one additional point for the origin. Afterward, given a query point, we perform a greedy search on the obtained graph by comparing the inner product of the query with data points (nodes in the graph) in their original form. The superiority of our method is two-fold: (a) the ℓ2-distance based graph construction can preserve all advantageous features of similarity graphs in metric space; (b) the additional point (i.e., the origin) will be connected to diverse high-norm points (usually solutions for MIPS), which naturally provides good starting points for the greedy search.

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.
The empirical experiments demonstrate that these features significantly improve the efficiency.

2 Graph Based Search Methods and Our Approach

A graph based search method typically first constructs a well-designed similarity graph, e.g., a kNN graph in Approximate Nearest Neighbor (ANN) search, and then performs greedy search on the graph. Simple greedy search, e.g., for the Maximum Inner Product Search (MIPS) task, can be described as follows. Given a graph and a query, the algorithm randomly selects a vertex from the graph, then evaluates the inner product of the query with the randomly seeded vertex and the vertex's neighbors. If one of its neighbors has a larger inner product with the query than the vertex itself, we take that neighbor as the newly seeded vertex and repeat the searching step. This procedure stops when it finds a vertex that has a larger inner product with the query than all of the vertex's neighbors. Greedy search has a generalized version, which will be introduced in Algorithm 1 with more details.

It was pointed out in [1, 23] that, in order to discover the exact solution of nearest neighbor search or MIPS by the greedy search strategy, the graph must contain the Delaunay graph (see Definition 2) with respect to (w.r.t.) the search measure as a subgraph. For common ANN search cases, searching w.r.t. ℓ2-distance, the index graph should contain the Delaunay graph w.r.t. ℓ2-distance (referred to as the ℓ2-Delaunay graph) as a subgraph. In practice, approximate ℓ2-Delaunay graphs are usually constructed due to the difficulty of building exact Delaunay graphs, as in VoroNet [4] and Navigable Small World (NSW) [23].
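The simple greedy search just described can be sketched in a few lines of Python (a minimal illustration over a plain adjacency-list graph; all names here are ours, not taken from any released implementation):

```python
def greedy_mips(graph, vectors, query, start):
    """Simple greedy search (the k = 1 case): hill-climb on inner product.

    graph:   dict mapping vertex id -> list of neighbor ids
    vectors: dict mapping vertex id -> vector (list of floats)
    query:   query vector
    start:   randomly seeded vertex id
    """
    dot = lambda u, v: sum(a * b for a, b in zip(u, v))
    current = start
    while True:
        # best-scoring neighbor of the current vertex
        best = max(graph[current], key=lambda v: dot(vectors[v], query),
                   default=current)
        if dot(vectors[best], query) <= dot(vectors[current], query):
            return current  # local maximum: no neighbor scores higher
        current = best
```

On a graph that contains the IP-Delaunay graph, this hill-climbing returns the exact MIPS answer; on an arbitrary sparse graph the result depends on the seed vertex, which is why both the graph being built and the starting points matter.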
Based on NSW, the Hierarchical-NSW (HNSW) network [24] exploits a hierarchical graph structure and a heuristic edge selection criterion (see Algorithm 3 for details), and often obtains performance improvements in ANN search tasks.

The idea of the Delaunay graph can be extended to inner product. The best graph for exact MIPS by simple greedy search is the Delaunay graph w.r.t. inner product (referred to as the IP-Delaunay graph). The recent work [25], namely ip-NSW, attempts to extend HNSW for metric spaces to MIPS. It is worth noting that the authors of [25] show some important properties of the Delaunay graph. However, their HNSW based graph construction algorithm for inner product has some disadvantages:

1. Since the edge selection criterion of HNSW does not apply to inner product, the incident edges of a vertex can have very similar directions, which reduces efficiency.

2. The hierarchical graph structure of HNSW is helpful in ANN search for metric measures, but it has little effect on the MIPS problem.

We validate these claims by experiments comparing different versions of ip-NSW. The effect of edge selection can be positive or negative on different datasets. The hierarchical structure does not change the efficiency of inner product search. To resolve the edge selection issue, we previously proposed a proper edge selection method, IPDG, specifically for inner product [31]. IPDG improves top-1 MIPS significantly but shows performance limitations for top-n (n > 1) results.

Figure 1: Experimental results for (top-1) Recall vs. Queries Per Second on the Netflix, Amovie, Yelp, and Music-100 datasets. The curve on the top shows superiority of the corresponding method. Möbius-Graph, ip-NSW, ip-NSW-no-hie, and ip-NSW-no-sel stand for our proposed method, ip-NSW with both hierarchical structure and edge selection, ip-NSW without hierarchical structure, and ip-NSW without edge selection, respectively.

In this paper, we propose a better approximation of the IP-Delaunay graph (referred to as Möbius-Graph) for MIPS, which provides a state-of-the-art MIPS method for various top-n MIPS results. The intuition behind this is that if we find a transformation that maps the IP-Delaunay graph in the original space to a certain subgraph of the ℓ2-Delaunay graph in the transformed space, we can make full use of the successful ℓ2-Delaunay graph approximation methods to build an IP-Delaunay graph. Given each data point xi, we perform the Möbius transformation yi := xi/‖xi‖², from which we have a new data collection: S̃ = {0, y1, y2, . . . , yn}. After the Möbius transformation, we apply an existing graph construction method (e.g., HNSW, or "SONG", a recent variant [39]) and obtain an approximate ℓ2-Delaunay graph on the transformed data (i.e., S̃). We found that the IP-Delaunay graph w.r.t. S is isomorphic to the neighborhood of 0 in the ℓ2-Delaunay graph w.r.t. S̃. Details about this statement can be found in Section 3. In short, our approach can be summarized in the following steps:

1. Let S̃ := {yi = xi/‖xi‖² | xi ∈ S} ∪ {0} be the transformed dataset.
2. Construct an approximate ℓ2-Delaunay graph (e.g., via HNSW) w.r.t. S̃.
3. Let N denote the neighbors of 0 on the graph from the previous step. Then remove 0 and its incident edges from the graph, and replace the vertices yi by the original data vectors xi.
4. 
Let N be the initial vertices, then perform greedy inner product search on the graph.

Note that our greedy search algorithm starts from a set of initial points instead of the data point 0, since 0 is not in S. Multiple initial points are possible in the generalized greedy search described in Algorithm 1. An equivalent description is starting from 0 but never returning it. Compared with the existing graph based search method for MIPS (i.e., ip-NSW), our approach builds the index graph by ℓ2-distance (on the transformed data), which largely preserves the advantageous features of metric similarity graphs. Besides, our approach starts searching from the well-chosen, diverse top-norm points N (their usage is similar to that of the hierarchical structure of HNSW), which leads to more efficient performance. Therefore, our approach to a large extent overcomes the weaknesses of the existing graph based search method, and it is not surprising that our method performs empirically better.

3 Möbius Transformation and Delaunay Graph Isomorphism

As pointed out in [1], in order to find the exact nearest neighbor by simple greedy search, the graph must contain the Delaunay graph as a subgraph. This statement extends to the MIPS problem [25]. For generality, we will introduce the Voronoi cell and Delaunay graph for an arbitrary continuous binary function f : X × X → R; however, we are typically interested in the cases of inner product f(x, y) = x⊤y and negative ℓ2-distance f(x, y) = −‖x − y‖ in this paper.

Definition 1. For fixed xi ∈ S ⊂ X and a given function f, the Voronoi cell Ri is defined as

    Ri := Ri(f, S) := {q ∈ X | ∀x ∈ S, f(xi, q) ≥ f(x, q)}.

Voronoi cells determine the solution of the MIPS problem. One can observe from the definition above that, when f(x, y) = x⊤y, xj ∈ arg max_{xi∈S} xi⊤q if and only if q ∈ Rj.
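Step 1 of the summary above, building the transformed dataset S̃, is a one-liner. The sketch below (our own code and naming) also checks a standard property of this inversion that is worth keeping in mind: it is an involution, i.e., applying it twice recovers the input, since ‖g(x)‖ = 1/‖x‖:

```python
def mobius(x):
    """Möbius transformation y = x / ||x||^2 (inversion in the unit sphere)."""
    n2 = sum(c * c for c in x)  # squared Euclidean norm ||x||^2
    return [c / n2 for c in x]

def transform_dataset(S):
    """Step 1: the transformed dataset S~ = {x / ||x||^2 : x in S} plus the origin."""
    d = len(S[0])
    return [[0.0] * d] + [mobius(x) for x in S]

# sanity check: mobius is an involution
x = [3.0, 4.0]
assert all(abs(a - b) < 1e-9 for a, b in zip(mobius(mobius(x)), x))
```

Because ‖g(x)‖ = 1/‖x‖, high-norm points of S land close to the origin in S̃, which is exactly why the origin's ℓ2-neighbors in the transformed graph are the diverse high-norm points that tend to solve MIPS.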
Since recording Voronoi cells is expensive, we instead record their dual diagram, namely the Delaunay graph, defined as follows.

Definition 2. For a fixed function f and dataset S ⊂ X, and given the Voronoi cells Ri, i = 1, 2, . . . , n w.r.t. f and S, the Delaunay graph is an undirected graph with vertex set S, in which the edge {xi, xj} exists if and only if Ri ∩ Rj ≠ ∅.

The Delaunay graph records the adjacency of Voronoi cells. If cells Ri and Rj are adjacent to each other, then there exists an edge between their corresponding nodes xi and xj. If f(x, y) = −‖x − y‖, the graph is called the ℓ2-Delaunay graph. If f(x, y) = x⊤y, the graph is called the IP-Delaunay graph.

We now narrow the scope to the MIPS problem. Let f(x, y) = x⊤y and X = Rd\{0}, and we aim to solve the optimization problem (1). We remove 0 from Rd for two reasons. Firstly, 0 has the same inner product (namely zero) with every point. Secondly, if 0 were not removed, every Voronoi cell w.r.t. the inner product would contain 0 as a common element, so the Delaunay graph would be fully connected and not interesting. We also require the following mild assumptions on the dataset to simplify the analysis.

Assumption 1. The dataset S satisfies that its conical hull is the whole space. More precisely,

    coni(S) := { Σ_{i=1}^n αi xi | xi ∈ S, αi ≥ 0 } = Rd.    (A1)

Assumption 2 (General position). For k = 2, 3, . . . , d + 1, there do not exist k points of the dataset S lying on a (k − 2)-dimensional affine hyperplane, or k + 1 points of S on any (k − 2)-dimensional sphere. If so, then we say the dataset S is in general position.

Assumptions 1 and 2 are often mild for real data, e.g., when the data points are embedded vectors of users and items (in recommender systems), or of entities and sentences (in natural language processing).
In these scenarios, the entries of the data vectors are distributed over the whole real line. With high probability, each hyperoctant contains at least one data point, so that the convex hull of the dataset contains 0 as an interior point. Assumption 2 holds with probability one if the data vectors in S are drawn independently and identically from any continuous distribution on Rd. For such a dataset S, the corresponding ℓ2-Delaunay graph and IP-Delaunay graph are unique. See [10] for details. Now we are ready to introduce two important criteria for these Delaunay graphs.

Proposition 1 (Empty half-space criterion). For a fixed dataset S ⊂ Rd, suppose there exists an open half-space H of Rd satisfying: (a) xi and xj are on the boundary of H, and (b) H contains no data points; then there exists an edge connecting xi and xj in the IP-Delaunay graph. Conversely, if such an edge exists, then the open half-space H must exist.

In other words, the empty half-space criterion says that, in the IP-Delaunay graph, the edge {xi, xj} exists if and only if there is a (d − 1)-dimensional hyperplane, passing through xi and xj, such that one of its corresponding open half-spaces is empty and the other one contains all data points except xi and xj. The empty half-space criterion of the IP-Delaunay graph is closely related to the empty sphere criterion of the ℓ2-Delaunay graph, as follows.

Proposition 2 (Empty sphere criterion). For a fixed dataset S ⊂ Rd, a subset of d + 1 points of S is fully connected in the ℓ2-Delaunay graph corresponding to S if and only if the circumsphere of these points does not contain any other point of the dataset S inside the sphere.

Once this criterion is satisfied, we call the subgraph on these d + 1 vertices a d-simplex. The proof of the empty sphere criterion is not provided here; we refer readers to [14] for details.
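In two dimensions, Proposition 1 can be checked by brute force. The sketch below (our own code, not from the paper) recovers the IP-Delaunay edges of a toy dataset directly from the empty half-space criterion; the interior point ends up isolated, as Remark 1 below will explain:

```python
def ip_delaunay_edges(S):
    """Brute-force 2-D IP-Delaunay edges via the empty half-space criterion:
    {xi, xj} is an edge iff one open half-space bounded by the line through
    xi and xj contains no data points (general position assumed)."""
    edges = set()
    for i in range(len(S)):
        for j in range(i + 1, len(S)):
            (x1, y1), (x2, y2) = S[i], S[j]
            n = (y1 - y2, x2 - x1)              # normal of the line through xi, xj
            c = n[0] * x1 + n[1] * y1           # line: n . x = c
            side = [n[0] * x + n[1] * y - c
                    for k, (x, y) in enumerate(S) if k not in (i, j)]
            # edge iff all remaining points lie on one closed side
            if all(s <= 0 for s in side) or all(s >= 0 for s in side):
                edges.add((i, j))
    return edges
```

For S = {(2,0), (0,2), (−2,0), (0,−2), (0.1,0.1)} this returns exactly the four edges between consecutive boundary points, and the interior point (0.1, 0.1) is isolated.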
The connection between these criteria can be demonstrated by the transformation

    g : Rd\{0} → Rd\{0},    g(x) = x/‖x‖².    (2)

Under this transformation, every hyperplane is mapped to a sphere passing through the origin. This is due to the fact that transforms on Rd of the form

    g(x) = b + A(x − a)/‖x − a‖^ε    (3)

for an orthogonal matrix A and ε = 0 or 2 are Möbius transformations. Indeed, by Liouville's conformal mapping theorem (a generalized version can be found in [21]), for d > 2, (3) characterizes all Möbius transformations. An important and useful property of Möbius transformations is that if a hyperplane does not pass through the origin, then its image under any Möbius transformation is a sphere passing through the origin.

Figure 2: (a) Empty half-space criterion for the IP-Delaunay graph. (b) The IP-Delaunay graph. (c) Empty sphere criterion for the ℓ2-Delaunay graph after transformation. (d) The ℓ2-Delaunay graph after transformation. The red edges form the subgraph that is isomorphic to the IP-Delaunay graph.

Figure 2 shows an example for d = 2. The line AB in Figure 2(a) divides the plane into two open half-spaces. One of the half-spaces contains no data points, so A and B are connected in the IP-Delaunay graph by Proposition 1. Let A′ and B′ be the images of A and B under transformation (2). According to the property of Möbius transformations, the image of the line AB is the circumcircle of the points 0, A′ and B′ in Figure 2(c). The empty half-space criterion for A and B implies that this circumcircle contains no data points inside, so there is a simplex with vertices 0, A′ and B′ in the ℓ2-Delaunay graph by the empty sphere criterion.
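The hyperplane-to-sphere property of transformation (2) is easy to verify numerically. For a hyperplane a⊤x = c with c ≠ 0, its image under g is the sphere with center a/(2c) and radius ‖a‖/(2|c|), which passes through the origin. The following self-contained check (our own, not from the paper) samples points on a random plane and confirms this:

```python
import math, random

def mobius(x):
    """g(x) = x / ||x||^2, i.e., transformation (2)."""
    n2 = sum(c * c for c in x)
    return [c / n2 for c in x]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

random.seed(0)
d = 3
a = [random.gauss(0, 1) for _ in range(d)]     # hyperplane normal: a^T x = c
c = 1.5                                        # c != 0, so 0 is not on the plane
center = [ai / (2 * c) for ai in a]            # predicted sphere center a/(2c)
radius = math.sqrt(dot(a, a)) / (2 * abs(c))   # predicted radius ||a||/(2|c|)

for _ in range(200):
    x = [random.gauss(0, 3) for _ in range(d)]
    # project the sample onto the hyperplane a^T x = c
    x = [xi + (c - dot(a, x)) / dot(a, a) * ai for xi, ai in zip(x, a)]
    y = mobius(x)
    dist = math.sqrt(sum((yi - ci) ** 2 for yi, ci in zip(y, center)))
    assert abs(dist - radius) < 1e-9           # image lies on the predicted sphere

# and the sphere passes through the origin:
assert abs(math.sqrt(dot(center, center)) - radius) < 1e-12
```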
This observation is formalized as follows.

Theorem 1. Let X = Rd\{0}. We assume S satisfies Assumptions 1 and 2. For i ∈ [n], let yi := xi/‖xi‖², S′ := {y1, . . . , yn} and S̃ := S′ ∪ {0}. Then the following are equivalent:

(a) The IP-Delaunay graph w.r.t. S contains an edge {xi, xj}.
(b) There exists a ∈ Rd\{0} such that xi⊤a = xj⊤a ≥ max_{x∈S} x⊤a > 0.
(c) There exists c ∈ X such that ‖yi − c‖ = ‖yj − c‖ = ‖c‖ ≤ min_{y∈S′} ‖y − c‖.
(d) There exists a d-simplex in the ℓ2-Delaunay graph w.r.t. S̃ containing the vertices {0, yi, yj}.

The equivalence between (a) and (d) in the theorem implies an isomorphism between the IP-Delaunay graph and a subgraph of the ℓ2-Delaunay graph. Hence we immediately have the next corollary.

Corollary 1. The following graphs are isomorphic after removing their isolated vertices:

(a) the IP-Delaunay graph on S,
(b) the subgraph of the ℓ2-Delaunay graph on S̃ consisting of every edge {yi, yj} satisfying the following condition: there exists a d-simplex in the ℓ2-Delaunay graph containing the vertices {0, yi, yj},

where the isomorphism is xi ↦ yi for the xi that are not isolated in the IP-Delaunay graph.

Considering the example in Figure 2, Corollary 1 says the IP-Delaunay graph in Figure 2(b) is isomorphic to the red subgraph in Figure 2(d). Thus, a good approximation of the ℓ2-Delaunay graph also yields an approximation of the IP-Delaunay graph. See the next section for implementation details.

Remark 1 (Convex hull and extreme points). If a vertex is not isolated in the IP-Delaunay graph, we say it is an extreme point. The concept of the extreme point is introduced in [3]. Under Assumption 1, a point is extreme if and only if it lies on the boundary of the convex hull of S.
In this case, building the IP-Delaunay graph is equivalent to finding the convex hull. In Corollary 1, we derive an equivalent way to find the convex hull of a finite set. For the purpose of convex hull construction, Assumption 1 is not required, since it always holds after some translation. We note that there exist algorithms for finding the convex hull [3]. That approach is not computationally feasible for high dimensional data, and no convex hull approximation exists in previous work, so we propose an IP-Delaunay graph approximation via the graph isomorphism in this paper.

4 Implementation for Large High Dimensional Data

For large high dimensional data, finding the exact IP-Delaunay graph of the data points is not computationally feasible. Therefore, practical and efficient graph construction and search algorithms for large scale, high dimensional data are in demand. In this work, we provide the algorithm (summarized in Algorithm 4) for building the Möbius-Graph and greedily searching on it for massive high dimensional data. We first introduce a generalized greedy search algorithm, because it is used repeatedly during both graph construction and inner product search.

We recall that the goal of greedy search is to find the x ∈ S maximizing f(x, q) for any query q. Here, we consider either f(x, y) = −‖x − y‖ or f(x, y) = x⊤y. For simplicity, we say the nearest neighbor of x is y when y attains the largest value of f(x, · ). We first initialize a priority queue C (it can hold random or well-chosen data points), then evaluate f(x, q) for all x ∈ C and all out-neighbors of these x's. Among the vectors we have evaluated, we replace C by the top-k vectors in descending order of f( · , q). We consider the top-k elements in C as the new priority queue, and update C until it does not change anymore.
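This priority-queue procedure can be rendered in Python roughly as follows — a simplified sketch with brute-force data structures and our own naming:

```python
def greedy_search(query, enter_points, graph, vectors, k, f):
    """Generalized greedy search: keep a pool C of the best k candidates seen,
    expand unchecked pool members through their out-neighbors, and stop when
    the pool no longer changes.  f is the similarity measure: the inner
    product for MIPS, or negative Euclidean distance for ANN search."""
    C = list(enter_points)
    checked = set(C)
    while True:
        # keep the top-k candidates in descending order of f( . , query)
        C = sorted(C, key=lambda v: f(vectors[v], query), reverse=True)[:k]
        frontier = []
        for x in C:
            for y in graph[x]:
                if y not in checked and y not in frontier:
                    frontier.append(y)
        if not frontier:          # the pool stopped changing: done
            return C
        checked.update(frontier)
        C = C + frontier
```

With k = 1 this degenerates to the simple hill-climbing of Section 2; larger k trades extra similarity evaluations for approximate top-k results.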
Algorithm 1 summarizes this procedure. If k = 1, then the generalized greedy search is equivalent to the simple version described in Section 2. The generalized greedy search allows the algorithm to return approximate top-k items, which are valuable for query search and recommender systems.

Now we are ready to present the graph construction algorithm (summarized in Algorithm 2). By Theorem 1 and Corollary 1, the best graph we want to use is the IP-Delaunay graph on S, which is

Algorithm 1: GREEDY-SEARCH(q, P, G, k, f)
1: Input: query element q, a set of enter points P, graph G = (S, E), number of candidates to return k, measurement function f.
2: Initialize the priority queue, C ← P.
3: Mark the elements of P as checked and the rest of the vertices as unchecked.
4: if |C| > k then
5:    C ← top-k elements x ∈ C in descending order of f(x, q).
6: while ∃x ∈ S unchecked and C keeps updating do
7:    C ← C ∪ {y ∈ S : x ∈ C, y unchecked, (x, y) ∈ E}.
8:    Mark the elements in C as checked.
9:    if |C| > k then
10:      C ← top-k candidates x ∈ C in descending order of f(x, q).
11: Output: C.

Algorithm 2: GRAPH-CONSTRUCTION(S, k, d)
1: Input: dataset S, the size of the priority queue k, maximum outgoing degree of the graph d.
2: n ← |S|. For i ∈ [n], let yi = xi/‖xi‖².
3: S̃ ← {0, y1, . . . , yn}. Define y0 = 0 ∈ S̃.
4: G ← a fully connected graph with vertices {y0, . . .
, yd−1}.
5: for i = d to n do
6:    C ← GREEDY-SEARCH(yi, {0}, G, k, ℓ2-distance).
7:    N ← SELECT-NEIGHBORS(yi, C, d).
8:    Add edges (yi, z) to G for every z ∈ N.
9:    for z ∈ N do
10:       C ← {w ∈ S̃ : (z, w) is an edge of G} ∪ {yi}.
11:       N ← SELECT-NEIGHBORS(z, C, d).
12:       Set N as the out-neighbors of z in G.
13: P′ ← out-neighbors of 0 in graph G.
14: P ← {xi ∈ S : yi ∈ P′}.
15: Remove 0 and its incident edges from G, and replace the vertices of G by the ones before transformation.
16: Output: (G, P).

Algorithm 3: SELECT-NEIGHBORS(x, C, d)
1: Input: element x, the set C of k-nearest neighbors of x, maximum outdegree d.
2: Initialize the out-neighbor set N of x, i.e., N ← ∅.
3: Order the yi ∈ C in ascending order of ‖x − yi‖.
4: i ← 1.
5: while |N| ≤ d and i ≤ |C| do
6:    if ‖x − yi‖ ≤ min_{z∈N} ‖z − yi‖ then
7:       N ← N ∪ {yi}.
8:    i ← i + 1.
9: Output: the set of elements N.

isomorphic to a subgraph of the ℓ2-Delaunay graph on S̃ after the transformation. We consider HNSW as an ℓ2-Delaunay graph approximation, as proposed in [24]. The authors suggest that the hierarchy of the Delaunay graph can be approximated by edge discrimination. Furthermore, we consider a directed graph as the approximation, to reduce the total degree. Given the dataset S̃, one builds the directed graph on S̃ iteratively, starting from a random graph. In every iteration, for the current directed graph G with vertices S̃, we consider an isolated vertex x and apply greedy search (Algorithm 1) to find the k-nearest neighbors of x, say Cx. Then x is connected to its nearest element, say y1, in the candidate set Cx.
Now the neighbor set is initialized to be N(x) = {y1}. For the next nearest neighbor y, we add it to the neighbor set N(x) if it satisfies the edge selection criterion: ‖x − y‖ ≤ min_{z∈N(x)} ‖z − y‖. The iterative process stops when d valid neighbors are found. Algorithm 3 represents an embodiment of this procedure. This edge selection can improve the diversity of the directions of incident edges. We repeat this step and stop when either all elements in Cx have been checked or the maximum outdegree d is achieved. The edges (x, y) for y ∈ N(x) are added to the graph. Moreover, for y ∈ N(x), we will add x to N(y). If |N(y)| > d, then we update N(y) according to the edge selection criterion. This final step can reduce the effect caused by the random order of vertices. Corollary 1 suggests that the IP-Delaunay graph is the neighborhood (in the graph sense) of 0 in the ℓ2-Delaunay graph. So for any query q, we apply greedy search starting from the out-neighbors of 0 (i.e., P in Algorithm 2). The algorithm then searches for the optimal object w.r.t. inner product by greedy search. See Algorithm 4.

Algorithm 4: MIPS(Q, S, K, k, l, d)
1: Input: a set of queries Q, dataset S, the number of elements to be returned K, the size of the candidate set k for graph construction and l for greedy search, maximum outgoing degree of the graph d.
2: (G, P) ← GRAPH-CONSTRUCTION(S, k, d).
3: for q ∈ Q do
4:    Cq ← GREEDY-SEARCH(q, P, G, l, inner product).
5: Output: the set of top-K objects Cq ⊂ S in descending order of inner product with q, for each q ∈ Q.

5 Experiments

In this section, we compare our method with state-of-the-art MIPS methods on four common datasets (see Table 1): Netflix, Amazon Movie (Amovie) (http://jmcauley.ucsd.edu/data/amazon), Yelp (https://www.yelp.com/dataset/challenge) and Music-100.
The first three are popular recommendation datasets. For Netflix, we use its 50-dimensional user and item vectors from [37]. For Amovie and Yelp, we utilize the matrix factorization method in [17] to get 100-dimensional latent vectors for users and items. Music-100 is introduced in [25] for the MIPS problem.

Table 1: Statistics of the datasets.

Datasets    # Base Data   # Query Data   # Dimension   # Extreme   % Extreme
Netflix     17770         1000           50            8017        45.12%
Amovie      104708        7748           100           3169        3.03%
Yelp        25815         25677          100           722         2.80%
Music-100   1000000       1000           100           304431      30.44%

The ground truth for each query vector is its top-1, top-10, and top-100 items measured by inner product. Only a fraction of the data points can be the top-1 solution of (1), i.e., the extreme points of Remark 1, and their percentage is an important feature of a dataset in the MIPS problem. We estimate the percentage of extreme points for each dataset as follows: for each vector x in the base, we calculate its inner product x⊤y with every vector y in the base (including x itself). Then we count the number of unique top-1 vectors y (i.e., extreme points) and compute the percentage of extreme points (the last column of Table 1). This is not an exact estimate, but it is a tight lower bound.

5.1 Experimental Settings

We refer to the newly proposed algorithm as Möbius-Graph, and compare it with three previous state-of-the-art MIPS methods: Greedy-MIPS [37], ip-NSW [25], and Range-LSH [36], which are the most representative for MIPS. In Range-LSH, the dataset is first partitioned into small subsets according to the ℓ2-norm rank, and the data are then normalized using the local maximum ℓ2-norm in each sub-dataset. This overcomes the limited performance due to the long-tail distribution of data norms [36].
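The extreme-point statistics in the last column of Table 1 can be reproduced by the brute-force pass described above; a quadratic-time sketch (for illustration only, with our own naming):

```python
def extreme_fraction(X):
    """Lower-bound estimate of the fraction of extreme points: use every base
    vector as a query and count the distinct top-1 inner-product answers
    (self-matches included, as in the text)."""
    dot = lambda u, v: sum(a * b for a, b in zip(u, v))
    winners = set()
    for q in X:
        scores = [dot(x, q) for x in X]
        winners.add(scores.index(max(scores)))
    return len(winners) / len(X)
```

Probing only base vectors as queries can miss extreme points that win for other queries, which is why the text calls this estimate a lower bound.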
The authors of [37] used an upper bound of the inner product as the approximation for MIPS and designed a greedy search algorithm, called Greedy-MIPS, to find this approximation. We use their original implementations. The open source code of ip-NSW adopts HNSW instead of NSW for graph construction. We found that the hierarchical structure and heuristic edge selection in HNSW do not significantly improve the performance of ip-NSW; see Figure 1. To provide a comprehensive evaluation, we implement Möbius-Graph with both HNSW and SONG [39]. All compared methods have tunable parameters. To get a fair comparison, we vary all parameters over a fine grid.

As evaluation measures, we choose the trade-offs Recall vs. Queries Per Second (QPS) and Recall vs. Percentage of Computations. Recall vs. Queries Per Second reports the number of queries an algorithm can process per second at each recall level. Ideally, one wishes that even at high recall levels the method can process as many queries as possible (i.e., be more efficient). Recall vs. Percentage of Computations reports the fraction of pairwise computations performed at each recall level; the fewer the better. For each algorithm, tuning the parameters yields multiple points scattered on the plane. To plot curves, we first find the best result, max_x, along the x-axis (i.e., Recall). Then 100 buckets are produced by splitting the range from 0 to max_x evenly. For each bucket, the best result along the y-axis (e.g., the largest number of queries per second) is chosen. If there are no data points in a bucket, it is ignored. In this way, we have at most 100 pairs of data for drawing curves. All experiments were performed on a 2× 3.00 GHz 8-core i7-5960X CPU server with 32GB of memory.

5.2 Experimental Results

Experimental results for Recall vs. Queries Per Second (QPS) are shown in Figure 3.
Each column corresponds to one dataset, and the figures in each row are the results for the top-1, top-10, and top-100 labels, respectively. As can be seen, the proposed Möbius-Graph performs much better than previous state-of-the-art methods in most cases on all datasets.

An interesting observation concerns the effect of the extreme-point percentage across datasets. Möbius-Graph is motivated by the setting where the percentage of extreme points is low, in which case the constructed approximate Delaunay graph is efficient for maximum inner product retrieval. Nevertheless, the proposed method also works very well on datasets with a high percentage of extreme points, such as Netflix (45% extreme points) and Music-100 (more than 30%). We also show results for different ground-truth label sets, which show that the proposed method performs well in various cases: not only for the top-1 label but also for the top-10 and top-100 labels. These results demonstrate the robustness of the proposed Möbius-Graph in MIPS.

Figure 3: Experimental results for Recall vs. Queries Per Second on different datasets. We focus on top-1, top-10, and top-100 ground-truth labels. Here the best results are in the upper right corners.

In contrast, it is difficult to tell which baseline works better than the others across all datasets. Range-LSH works relatively well on Netflix but much worse than the other methods on the other three datasets. The baseline ip-NSW works well on datasets with high extreme-point percentages (e.g., Netflix and Music-100) but becomes worse on the other datasets. Greedy-MIPS shows advantages over ip-NSW on datasets with low extreme-point percentages (e.g., Amovie and Yelp) at some recall levels.

Results for Recall vs. Percentage of Computations are shown in Figure 4. Only top-10 results are shown due to limited space; top-1 and top-100 results can be found in the Appendix.
Note that this measurement is not meaningful for Greedy-MIPS. In this view, the proposed Möbius-Graph works best in all cases.

Figure 4: Experimental results for Recall vs. Percentage of Computations on different datasets. Best results are in the lower right corners.

Range-LSH works comparably with the other methods on the smaller datasets (i.e., the first three) in this view. However, Recall vs. Percentage of Computations does not account for the cost of different index structures: although Range-LSH works well in this view, its overall time cost is much higher than the others, as shown in Figure 3. A likely reason is that the table-based index used in Range-LSH is less efficient in searching.
Moreover, Range-LSH performs poorly on Music-100, which is much larger; its curve falls outside the plotted range for Music-100.

We also report the graph construction time of ip-NSW and Möbius-Graph in Table 2. As can be seen, Möbius-Graph consumes 13.7% to 65.5% less time in index construction than ip-NSW, which brings great benefits for real applications. The reason is that metric (i.e., ℓ2) based searching, used during graph construction, is more efficient than inner product based searching.

Table 2: Graph Construction Time in Seconds.

                Netflix          Amovie           Yelp             Music-100
ip-NSW          2.19             36.95            6.78             396.82
Möbius-Graph    1.89 (-13.7%)    24.35 (-34.1%)   2.34 (-65.5%)    162.24 (-59.1%)

5.3 Implementation by SONG

To exclude bias from the implementation, we also implement Möbius-Graph and ip-NSW on another search-on-graph platform, SONG [39]. The results are shown in Figure 5. As can be seen, the SONG implementation is more efficient than the HNSW one, for both Möbius-Graph and ip-NSW, but their relative order remains the same: Möbius-Graph works better than ip-NSW under both implementations.

Figure 5: Comparison of two implementations, HNSW and SONG, on Möbius-Graph and ip-NSW.

6 Conclusion and Future Work

Maximum Inner Product Search (MIPS) is a challenging problem with wide applications in search and machine learning. In this work, we develop a novel search-on-graph method for MIPS. From the viewpoint of computational geometry, we show that under the Möbius transformation there exists an isomorphism between the Delaunay graph for inner product and the Delaunay graph for the ℓ2-norm. Based on this observation, we present a graph indexing algorithm that converts a subgraph of the ℓ2-Delaunay graph into the IP-Delaunay graph. Then, we perform MIPS via greedy search on the transformed graph.
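A minimal sketch of this pipeline is given below (our own illustrative code, not the released implementation: the graph is abstracted as an adjacency dict, whereas the paper builds an approximate ℓ2-Delaunay graph with HNSW/SONG over the transformed points, with the origin serving as the entry point; the transform assumes nonzero base vectors):

```python
import numpy as np

def mobius_transform(X):
    # y_i = x_i / ||x_i||^2: after this map, l2-proximity among the y_i
    # reflects the inner-product neighborhood structure of the original x_i.
    norms_sq = (X ** 2).sum(axis=1, keepdims=True)
    return X / norms_sq

def greedy_mips(X, graph, q, start):
    # Greedy walk on a graph built over the transformed points; candidates
    # are scored by inner product with the ORIGINAL vectors X.
    cur = start
    while True:
        best = max(graph[cur], key=lambda j: X[j] @ q)
        if X[best] @ q <= X[cur] @ q:
            return cur  # local maximum of the inner product on the graph
        cur = best
```

On an exact IP-Delaunay graph this greedy walk returns the exact MIPS solution; on the approximate graphs used in practice it returns an approximate answer.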
We demonstrate that our approach provides an effective and efficient solution for MIPS.

This paper focuses on fast search under a non-metric measure, the inner product. Beyond the inner product, more complicated measures have been studied, such as Bregman divergence [6], max-kernel [8, 7], and even more generic measures [32]. It would be interesting to extend the method proposed in this paper to these measures. Another promising direction is to adopt a GPU-based system for fast ANN search and MIPS, which has been shown to be highly effective for generic ANN tasks [22, 19, 39]. Developing GPU-based algorithms for MIPS (and related applications) is a topic that can be further explored.

References

[1] Franz Aurenhammer. Voronoi diagrams—a survey of a fundamental geometric data structure. ACM Computing Surveys (CSUR), 23(3):345–405, 1991.

[2] Yoram Bachrach, Yehuda Finkelstein, Ran Gilad-Bachrach, Liran Katzir, Noam Koenigstein, Nir Nice, and Ulrich Paquet. Speeding up the Xbox recommender system using a Euclidean transformation for inner-product spaces.
In Eighth ACM Conference on Recommender Systems (RecSys), pages 257–264, Foster City, CA, 2014.

[3] C. Bradford Barber, David P. Dobkin, and Hannu Huhdanpaa. The quickhull algorithm for convex hulls. ACM Transactions on Mathematical Software (TOMS), 22(4):469–483, 1996.

[4] Olivier Beaumont, Anne-Marie Kermarrec, Loris Marchal, and Étienne Rivière. VoroNet: A scalable object network based on Voronoi tessellations. In 21st International Parallel and Distributed Processing Symposium (IPDPS), pages 1–10, Long Beach, CA, 2007.

[5] Yoshua Bengio, Réjean Ducharme, Pascal Vincent, and Christian Jauvin. A neural probabilistic language model. Journal of Machine Learning Research, 3:1137–1155, 2003.

[6] Lawrence Cayton. Fast nearest neighbor retrieval for Bregman divergences. In Proceedings of the Twenty-Fifth International Conference on Machine Learning (ICML), pages 112–119, Helsinki, Finland, 2008.

[7] Ryan R. Curtin and Parikshit Ram. Dual-tree fast exact max-kernel search. Statistical Analysis and Data Mining: The ASA Data Science Journal, 7(4):229–253, 2014.

[8] Ryan R. Curtin, Parikshit Ram, and Alexander G. Gray. Fast exact max-kernel search. In Proceedings of the 13th SIAM International Conference on Data Mining (SDM), pages 1–9, Austin, TX, 2013.

[9] Miao Fan, Jiacheng Guo, Shuai Zhu, Shuo Miao, Mingming Sun, and Ping Li. MOBIUS: towards the next generation of query-ad matching in Baidu's sponsored search. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining (KDD), pages 2509–2517, Anchorage, AK, 2019.

[10] Steven Fortune. Voronoi diagrams and Delaunay triangulations. In Handbook of Discrete and Computational Geometry, Second Edition, pages 513–528, 2004.

[11] Jerome H. Friedman, F. Baskett, and L. Shustek.
An algorithm for finding nearest neighbors. IEEE Transactions on Computers, 24:1000–1006, 1975.

[12] Jerome H. Friedman, J. Bentley, and R. Finkel. An algorithm for finding best matches in logarithmic expected time. ACM Transactions on Mathematical Software, 3:209–226, 1977.

[13] Cong Fu, Chao Xiang, Changxu Wang, and Deng Cai. Fast approximate nearest neighbor search with the navigating spreading-out graphs. PVLDB, 12(5):461–474, 2019.

[14] Paul-Louis George and Houman Borouchaki. Delaunay Triangulation and Meshing. 1998.

[15] Ruiqi Guo, Sanjiv Kumar, Krzysztof Choromanski, and David Simcha. Quantization based fast inner product search. In Proceedings of the 19th International Conference on Artificial Intelligence and Statistics (AISTATS), pages 482–490, Cadiz, Spain, 2016.

[16] Xiangnan He, Lizi Liao, Hanwang Zhang, Liqiang Nie, Xia Hu, and Tat-Seng Chua. Neural collaborative filtering. In Proceedings of the 26th International Conference on World Wide Web (WWW), pages 173–182, Perth, Australia, 2017.

[17] Yifan Hu, Yehuda Koren, and Chris Volinsky. Collaborative filtering for implicit feedback datasets. In Proceedings of the 8th IEEE International Conference on Data Mining (ICDM), pages 263–272, Pisa, Italy, 2008.

[18] Piotr Indyk and Rajeev Motwani. Approximate nearest neighbors: Towards removing the curse of dimensionality. In Proceedings of the Thirtieth Annual ACM Symposium on the Theory of Computing (STOC), pages 604–613, Dallas, TX, 1998.

[19] Jeff Johnson, Matthijs Douze, and Hervé Jégou. Billion-scale similarity search with GPUs. arXiv:1702.08734, 2017.

[20] Jon M. Kleinberg. The small-world phenomenon: an algorithmic perspective. In Proceedings of the Thirty-Second Annual ACM Symposium on Theory of Computing (STOC), pages 163–170, Portland, OR, 2000.

[21] Wolfgang Kühnel and Hans-Bert Rademacher.
Liouville's theorem in conformal geometry. Journal de mathématiques pures et appliquées, 88(3):251–260, 2007.

[22] Ping Li, Anshumali Shrivastava, and Christian A. Konig. GPU-based minwise hashing. In Proceedings of the 21st World Wide Web Conference (WWW), pages 565–566, Lyon, France, 2012.

[23] Yury Malkov, Alexander Ponomarenko, Andrey Logvinov, and Vladimir Krylov. Approximate nearest neighbor algorithm based on navigable small world graphs. Information Systems, 45:61–68, 2014.

[24] Yury A. Malkov and Dmitry A. Yashunin. Efficient and robust approximate nearest neighbor search using hierarchical navigable small world graphs. IEEE Transactions on Pattern Analysis and Machine Intelligence, Early Access.

[25] Stanislav Morozov and Artem Babenko. Non-metric similarity graphs for maximum inner product search. In Advances in Neural Information Processing Systems (NeurIPS), pages 4726–4735, Montreal, Canada, 2018.

[26] Behnam Neyshabur and Nathan Srebro. On symmetric and asymmetric LSHs for inner product search. In Proceedings of the 32nd International Conference on Machine Learning (ICML), pages 1926–1934, Lille, France, 2015.

[27] Parikshit Ram and Alexander G. Gray. Maximum inner-product search using cone trees. In The 18th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD), pages 931–939, Beijing, China, 2012.

[28] Anshumali Shrivastava and Ping Li. Asymmetric LSH (ALSH) for sublinear time maximum inner product search (MIPS). In Advances in Neural Information Processing Systems (NIPS), pages 2321–2329, Montreal, Canada, 2014.

[29] Anshumali Shrivastava and Ping Li. Asymmetric minwise hashing for indexing binary inner products and set containment. In Proceedings of the 24th International Conference on World Wide Web (WWW), pages 981–991, Florence, Italy, 2015.

[30] Anshumali Shrivastava and Ping Li.
Improved asymmetric locality sensitive hashing (ALSH) for maximum inner product search (MIPS). In Proceedings of the Thirty-First Conference on Uncertainty in Artificial Intelligence (UAI), pages 812–821, Amsterdam, The Netherlands, 2015.

[31] Shulong Tan, Zhixin Zhou, Zhaozhuo Xu, and Ping Li. On efficient retrieval of top similarity vectors. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 5235–5245, Hong Kong, China, 2019.

[32] Shulong Tan, Zhixin Zhou, Zhaozhuo Xu, and Ping Li. Fast item ranking under neural network based measures. In Proceedings of the 13th ACM International Conference on Web Search and Data Mining (WSDM), Houston, TX, 2020.

[33] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in Neural Information Processing Systems (NIPS), pages 5998–6008, Long Beach, CA, 2017.

[34] Jason Weston, Samy Bengio, and Nicolas Usunier. Large scale image annotation: learning to rank with joint word-image embeddings. Machine Learning, 81(1):21–35, 2010.

[35] Hong-Jian Xue, Xin-Yu Dai, Jianbing Zhang, Shujian Huang, and Jiajun Chen. Deep matrix factorization models for recommender systems. In Proceedings of the Twenty-Sixth International Joint Conference on Artificial Intelligence (IJCAI), pages 3203–3209, Melbourne, Australia, 2017.

[36] Xiao Yan, Jinfeng Li, Xinyan Dai, Hongzhi Chen, and James Cheng. Norm-ranging LSH for maximum inner product search. In Advances in Neural Information Processing Systems (NeurIPS), pages 2956–2965, Montreal, Canada, 2018.

[37] Hsiang-Fu Yu, Cho-Jui Hsieh, Qi Lei, and Inderjit S. Dhillon. A greedy approach for budgeted maximum inner product search.
In Advances in Neural Information Processing Systems (NIPS), pages 5453–5462, Long Beach, CA, 2017.

[38] Hsiang-Fu Yu, Prateek Jain, Purushottam Kar, and Inderjit Dhillon. Large-scale multi-label learning with missing labels. In Proceedings of the 31st International Conference on Machine Learning (ICML), pages 593–601, Beijing, China, 2014.

[39] Weijie Zhao, Shulong Tan, and Ping Li. SONG: Approximate nearest neighbor search on GPU. In 35th IEEE International Conference on Data Engineering (ICDE), Dallas, TX, 2020.