{"title": "Randomized Algorithms for Comparison-based Search", "book": "Advances in Neural Information Processing Systems", "page_first": 2231, "page_last": 2239, "abstract": "This paper addresses the problem of finding the nearest neighbor (or one of the $R$-nearest neighbors) of a query object $q$ in a database of $n$ objects, when we can only use a comparison oracle. The comparison oracle, given two reference objects and a query object, returns the reference object most similar to the query object. The main problem we study is how to search the database for the nearest neighbor (NN) of a query, while minimizing the questions. The difficulty of this problem depends on properties of the underlying database. We show the importance of a characterization: \\emph{combinatorial disorder} $D$ which defines approximate triangle inequalities on ranks. We present a lower bound of $\\Omega(D\\log \\frac{n}{D}+D^2)$ average number of questions in the search phase for any randomized algorithm, which demonstrates the fundamental role of $D$ for worst case behavior. We develop a randomized scheme for NN retrieval in $O(D^3\\log^2 n+ D\\log^2 n \\log\\log n^{D^3})$ questions. The learning requires asking $O(n D^3\\log^2 n+ D \\log^2 n \\log\\log n^{D^3})$ questions and $O(n\\log^2n/\\log(2D))$ bits to store.", "full_text": "Randomized Algorithms for Comparison-based\n\nSearch\n\nDominique Tschopp\n\nAWK Group\n\nBern, Switzerland\n\nSuhas Diggavi\n\nUniversity of California Los Angeles (UCLA)\n\nLos Angeles, CA 90095\n\ndominique.tschopp@gmail.com\n\nsuhasdiggavi@ucla.edu\n\nPayam Delgosha\n\nSharif University of Technology\n\nTehran, Iran\n\nSoheil Mohajer\n\nPrinceton University\nPrinceton, NJ 08544\n\npdelgosha@ee.sharif.ir\n\nsmohajer@princeton.edu\n\nAbstract\n\nThis paper addresses the problem of \ufb01nding the nearest neighbor (or one of the\nR-nearest neighbors) of a query object q in a database of n objects, when we can\nonly use a comparison oracle. 
The comparison oracle, given two reference objects and a query object, returns the reference object most similar to the query object.\nThe main problem we study is how to search the database for the nearest neighbor (NN) of a query, while minimizing the number of questions. The difficulty of this problem depends on properties of the underlying database. We show the importance of a characterization: combinatorial disorder D, which defines approximate triangle inequalities on ranks. We present a lower bound of \u2126(D log(n/D) + D^2) on the average number of questions in the search phase for any randomized algorithm, which demonstrates the fundamental role of D for worst-case behavior. We develop a randomized scheme for NN retrieval in O(D^3 log^2 n + D log^2 n log log n^{D^3}) questions. The learning phase requires asking O(nD^3 log^2 n + D log^2 n log log n^{D^3}) questions and O(n log^2 n / log(2D)) bits to store.\n\n1 Introduction\n\nConsider the situation where we want to search and navigate a database, but the underlying relationships between the objects are unknown and are accessible only through a comparison oracle. The comparison oracle, given two reference objects and a query object, returns the reference object most similar to the query object. Such an oracle attempts to model the behavior of human users, who are capable of making statements about similarity, but not of assigning meaningful numerical values to distances between objects. Such situations arise in many tasks, such as recommendation systems for movies or restaurants, or human-assisted search systems for image databases, among other applications. Using such an oracle, the best we can hope for is to obtain, for every object u in the database, a ranking of the other objects according to their similarity to u. 
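The oracle abstraction just described can be made concrete in a few lines. This is an illustrative sketch only: the `ComparisonOracle` class and the hidden numeric distance are our own assumptions, standing in for a human answering similarity questions.

```python
class ComparisonOracle:
    """Answers O(q, u, v): which of u, v is more similar to q?

    The underlying distance is hidden from the caller; only the
    comparison outcome is exposed, as in equation (1) of the text.
    """

    def __init__(self, distance):
        self.distance = distance  # hidden similarity model (assumption)
        self.questions = 0        # questions asked so far

    def ask(self, q, u, v):
        self.questions += 1
        return u if self.distance(q, u) <= self.distance(q, v) else v

# Example: the hidden space is the real line with |x - y| as distance.
oracle = ComparisonOracle(lambda x, y: abs(x - y))
print(oracle.ask(5, 3, 9))   # 3 is closer to 5 than 9 is, so prints 3
```

Every algorithm below is charged one question per call to `ask`, which is the cost model the paper's bounds count.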
However, using the oracle to obtain complete ranking information could be costly, since each oracle invocation represents human input to the task (preferences in movies, comparisons of images, etc.). We can pre-process the database by asking questions during a learning phase, and use the resulting answers to facilitate the search process. Therefore, the main question we ask in this paper is how to design an (approximate) nearest neighbor retrieval algorithm while minimizing the number of questions to such an oracle.\n\nClearly, the difficulty of searching using such an oracle depends critically on the properties of the set of objects. We demonstrate the importance of a characterization which determines the performance of comparison-based search algorithms. Combinatorial disorder (introduced by Goyal et al. [1]) defines approximate triangle inequalities on ranks. Roughly speaking, it defines a multiplicative factor D by which the triangle inequality on ranks can be violated. We show the first lower bound of \u2126(D log(n/D) + D^2) on the average number of questions in the search phase for any randomized algorithm, and thereby demonstrate the fundamental importance of D for worst-case behavior. When the disorder is known, we can use partial rank information to estimate, or infer, the other ranks. This allows us to design a novel hierarchical scheme which considerably improves the existing bounds for nearest neighbor search based on a similarity oracle, and performs provably close to the lower bound. If no characterization of the hidden space is available as an input, we develop algorithms that decompose the space such that dissimilar objects are likely to be separated, and similar objects tend to stay together, generalizing the notion of randomized k-d trees [2]. This is developed in more detail in [3]. 
Due to space constraints, we give statements of the results along with an outline of proof ideas in the main text. Additionally, we provide proof details in the appendix [4] as extra material allowed by NIPS.\nRelationship to published works: The nearest neighbor (NN) search problem has been very well studied for metric spaces (see [5]). However, in all these works, it is assumed that one can compute distances between points in the data set. In [6, 7, 8, 9, 10, 11], various approaches to measuring similarities between images are presented, which could be used as comparison oracles in our setup. The algorithmic aspects of searching with a comparison oracle were first studied in [1], where a random walk algorithm is presented. The main limitation of this algorithm is the fact that all rank relationships need to be known in advance, which amounts to asking the oracle O(n^2 log n) questions in a database of size n. In [12], a data structure similar in spirit to the \u01eb-nets of [13] is introduced. It is shown that a learning phase with a complexity of O(D^7 n log^2 n) questions and a space complexity of O(D^5 n + Dn log n) allows one to retrieve the NN in O(D^4 log n) questions. The learning phase builds a hierarchical structure based on coverings of exponentially decreasing radii. In this paper, we present what we believe is the first lower bound for search through comparisons. This gives a more fundamental meaning to D as a parameter determining worst-case behavior. Based on the insights gained from this worst-case analysis, we then improve (see Section 3) the existing upper bounds by a poly(D) factor, if we are willing to accept a negligible (less than 1/n) probability of failure. Our algorithm is based on random sampling, and can be seen as a form of metric skip list (as introduced in [14]), but applied to a combinatorial (non-metric) framework. 
However, the fact that we do not have access to distances forces us to use new techniques in order to minimize the number of questions (or ranks we need to compute). In particular, we sample the database at different densities, and infer the ranks from the density of the sampling, which we believe is a new technique. We also need to relate samples to each other when building the data structure top-down.\n\nA natural question to ask is whether one can develop data structures for NN search when a characterization of the underlying space is unknown. In [2], when one has access to metric distances, a binary tree decomposition of a dataset that adapts to its \u201cintrinsic dimension\u201d [13] has been designed. We extend the result of [2] to our setup, where we have a comparison oracle but do not have access to metric distances. This can be used in a manner similar to [2] to find (approximate) NN (see [3] for more details).\n\nTo the best of our knowledge, the notion of randomized NN search using a similarity oracle is studied for the first time in this paper. Moreover, the hierarchical search scheme proposed is more efficient than earlier schemes. The lower bound presented appears to be new and demonstrates that our schemes are (almost) optimal.\n\n2 Definitions and Problem Statement\n\nWe consider a hidden space K, and a database of objects T \u2282 K, with |T | = n. We can only access this space through a similarity oracle which, for any point q \u2208 K and objects u, v \u2208 T , returns\n\nO(q, u, v) = u if u is more similar to q than v, and v otherwise.\n\n(1)\n\nThe goal is to develop and analyse algorithms which, for any given q \u2208 K, can find an object in the database a \u2208 T which is the nearest neighbor (NN) to q, using the smallest number of questions of type (1). We also relax this goal to find the approximate NN with \u201chigh probability\u201d. 
The algorithm may have a learning phase, in which it explores the structure of the database and stores it using a certain amount of memory. Note that this phase has to be completed prior to knowing the query q \u2208 K. Then, once the query is given, the search phase of the algorithm asks a certain number of questions of type (1) and finds the closest object in the database.\n\nThe performance of the algorithm is measured by three components, among which there could be a trade-off: the number of questions asked in the learning phase, the number of questions asked in the search phase, and the total memory to be stored. The main goal of this work is to design algorithms for NN search and characterize their performance in terms of these parameters. We now present some definitions which are required to state the results of this paper.\nDefinition 1. The rank of u in a set S with respect to v, rv(u, S), is equal to c if u is the c-th nearest object to v in S, i.e., |{w \u2208 S : d(w, v) < d(u, v)}| = c \u2212 1, where d(w, v) < d(u, v) can be interpreted via a distance function. Also, the rank ball \u03b2x(r) is defined to be {y : rx(y, S) \u2264 r}.\nNote that we do not need the existence of a distance function in Definition 1. We could replace d(w, v) < d(u, v) with \u201cv is more similar to w than u\u201d by using the oracle in (1).\nTo simplify the notation, we only indicate the set if it is unclear from the context, i.e., we write rv(u) instead of rv(u, S) unless there is an ambiguity. Note that rank need not be a symmetric relationship between objects, i.e., ru(v) \u2260 rv(u) in general. Further, note that we can rank m objects w.r.t. an object o by asking the oracle O(m log m) questions, using standard sorting algorithms [15].\nOur characterization of the space of objects is through a form of approximate triangle inequalities introduced in [1] and [12]. 
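The remark above, that m objects can be ranked w.r.t. an object o with O(m log m) oracle questions via a standard sort, can be sketched as follows. This is illustrative only; the numeric stand-in oracle and the helper name `rank_objects` are ours.

```python
import functools

def rank_objects(oracle, o, objects):
    """Order `objects` by similarity to o using only oracle questions.

    A comparison sort asks O(m log m) questions of the form O(o, u, v);
    the returned dict maps each object u to its rank r_o(u), with the
    nearest object getting rank 1.
    """
    def cmp(u, v):
        # u precedes v iff the oracle says u is more similar to o
        return -1 if oracle(o, u, v) == u else 1
    ordered = sorted(objects, key=functools.cmp_to_key(cmp))
    return {u: i + 1 for i, u in enumerate(ordered)}

# Numeric stand-in for the human oracle: the hidden space is the real line.
oracle = lambda o, u, v: u if abs(o - u) <= abs(o - v) else v
ranks = rank_objects(oracle, 10, [2, 9, 14, 30])
print(ranks)  # {9: 1, 14: 2, 2: 3, 30: 4}
```

Note that the sort only ever compares two candidates against the same reference o, which is exactly the question type the oracle supports.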
Instead of defining inequalities between distances, these triangle inequalities are defined over ranks, and depend on a property of the space called the disorder constant.\nDefinition 2. The combinatorial disorder of a set of objects S is the smallest D such that \u2200x, y, z \u2208 S, we have the following approximate triangle inequalities:\n\n(i) rx(y, S) \u2264 D(rz(x, S) + rz(y, S))\n(ii) rx(y, S) \u2264 D(rx(z, S) + ry(z, S))\n(iii) rx(y, S) \u2264 D(rx(z, S) + rz(y, S))\n(iv) rx(y, S) \u2264 D(rz(x, S) + ry(z, S))\n\nIn particular, rx(x, S) = 0 and rx(y, S) \u2264 D ry(x, S).\n\n3 Contributions\n\nOur contributions are the following: (i) we design a randomized hierarchical data structure with which we can perform NN search using the comparison oracle; (ii) we develop the first lower bound for the search complexity in the combinatorial framework of [1, 12], and thereby demonstrate the importance of combinatorial disorder. The performance of the randomized algorithm (see (i)) is shown to be close to this lower bound. We also develop a binary tree decomposition that adapts to the data set in a manner analogous to [2].\n\nMore precisely, we prove a lower bound on the average search time to retrieve the nearest neighbor of a query point for randomized algorithms in the combinatorial framework.\nTheorem 1. 
There exist a space and a configuration of a database of n objects in that space such that, for the uniform distribution over placements of the query point q, no randomized search algorithm, even if O(n^3) questions can be asked in the learning phase, can find q\u2019s nearest neighbor in the database for sure (with a probability of error of 0) by asking fewer than an expected \u2126(D^2 + D log(n/D)) questions in the worst case, when D < \u221an.\n\nAs a consequence of this theorem, there must exist at least one query point in this configuration which requires asking at least \u2126(D log(n/D) + D^2) questions, hence setting a lower bound on the search complexity. Based on the insights gained from this worst-case analysis, we introduce a conceptually simple randomized hierarchical scheme that allows us to reduce the learning cost compared to the existing algorithm (see [12, 1]) by a factor D^4, the memory consumption by a factor D^5/log^2 n, and the search cost by a factor D/(log n log log n^{D^3}).\nTheorem 2. We design a randomized algorithm which, for a given query point q, can retrieve its nearest neighbor with high probability in O(D^3 log^2 n + D log^2 n log log n^{D^3}) questions. The learning requires asking O(nD^3 log^2 n + D log^2 n log log n^{D^3}) questions, and we need to store O(n log^2 n / log(2D)) bits.\n\nConsequently, our schemes are asymptotically (in n) within D polylog(n) questions of the optimal search algorithm.\n\n4 Lower Bounds for NNS\n\nA natural question to ask is whether there are databases and query points for which we need to ask a minimal number of questions, independent of the algorithm used. 
In this section, we construct a database T of n objects, a universe of queries K\\T, and similarity relationships for which no search algorithm can find the NN of a query point in fewer than an expected \u2126(D log(n/D) + D^2) questions. We show this even when all possible questions O(u, v, w) related to the n database objects (i.e., u, v, w \u2208 T ) can be asked during the learning phase. The query is chosen uniformly from the universe of queries and is unknown during the learning phase.\nDatabase Structure: Consider the weighted graph shown in Fig. 1. It consists of a star with \u03b1 branches \u03c61, \u03c62, . . . , \u03c6\u03b1, each composed of n/\u03b1^2 supernodes (SN). Each of the supernodes in turn contains \u03b1 database objects (i.e., objects in T ). Clearly, in total there are \u03b1 \u00b7 \u03b1 \u00b7 n/\u03b1^2 = n objects. Note that the database T only includes the set of objects inside the supernodes; the supernodes themselves are not elements of T . We index the objects in each branch by numbers from 1 to n/\u03b1.\nWe define the set of queries, M, as follows: every query point q is attached to one object from T on each branch of the star with an edge; this object is called a direct node (DN) on the corresponding branch. Moreover, we assume that the weights of all query edges, the \u03b1 edges connecting the query to its DNs, are different. Therefore, the set of all queries M can be restricted to \u03b1!(n/\u03b1)^\u03b1 elements, since there are n/\u03b1 choices for the direct node in each branch (i.e., (n/\u03b1)^\u03b1 choices for \u03b1 branches), and the weights of the query edges can be ordered in \u03b1! different ways.\nIn this example, the distance between two nodes is given by the weighted graph distance, and the oracle answers queries based on this distance. 
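For intuition about the quantity this construction targets, the disorder constant of Definition 2 can be computed by brute force on any small example with a hidden distance. This is our own illustrative helper (an O(m^3) triple scan, using the 0-based rank convention r_x(x) = 0 from the text), not part of the paper's algorithms.

```python
import itertools

def disorder_constant(points, dist):
    """Brute-force the combinatorial disorder D of Definition 2.

    Ranks are 0-based counts (r_x(x) = 0, as in the text); D is the
    smallest value for which all four rank triangle inequalities hold
    over every triple of points.
    """
    # r[x][y]: number of points strictly closer to x than y is
    r = {x: {y: sum(dist(x, w) < dist(x, y) for w in points) for y in points}
         for x in points}
    D = 1.0
    for x, y, z in itertools.product(points, repeat=3):
        for denom in (r[z][x] + r[z][y], r[x][z] + r[y][z],
                      r[x][z] + r[z][y], r[z][x] + r[y][z]):
            if denom > 0:
                D = max(D, r[x][y] / denom)
    return D

# Points on the real line: ranks nearly obey the triangle inequality,
# so the disorder constant stays small.
print(disorder_constant([0, 1, 2, 4, 8], lambda a, b: abs(a - b)))
```

On the star database above, the analogous computation with the weighted graph distance would exhibit the D = \u0398(\u03b1) behavior claimed in Lemma 1.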
All edges connecting the SNs to each other have weight 1, except the \u03b1 edges emanating from the center of the star and ending at the first SNs, which have weight n/\u03b1^2. Edges connecting the objects in a supernode to its root are called object edges. We assume that all n/\u03b1 object edges in branch \u03c6i have weight i/(4\u03b1). It remains to fix the weights of the query edges, which we define in the following.\nDefinition 3. For a query q \u2208 M, define the \u03b1-tuple \u03b4q \u2208 {1, 2, . . . , n/\u03b1}^\u03b1 to be the sequence of DNs of q on the \u03b1 branches, i.e., \u03b4q(i) denotes the index of the object on \u03c6i which is connected to q via a query edge. We also represent the ranks of the DNs w.r.t. q by an \u03b1-tuple \u03a8q \u2208 {1, . . . , \u03b1}^\u03b1, i.e., \u03a8q(i) denotes the rank of the DN on branch \u03c6i among all the other DNs w.r.t. q.\nNow we can define the weights of the query edges. For a query q \u2208 M, the weight of the query edge which connects q to \u03b4q(i) is defined to be 1 + (\u03a8q(i)/\u03b1)\u01eb, where \u01eb \u226a 1/(4\u03b1) is a constant.\nAs mentioned before, the disorder constant plays an important role in the performance of the algorithm. The following lemma gives the disorder constant for the database introduced above. The proof of this lemma is presented in the appendix [4].\nLemma 1. The star-shaped graph introduced above has disorder constant D = \u0398(\u03b1).\n\nThe Lower Bound: In the proof of Theorem 1, we will use Yao\u2019s minimax principle (see [16]), which states that, for any distribution on the inputs, the expected cost of the best deterministic algorithm provides a lower bound on the worst-case running time of any randomized algorithm. In the following, we state two lower bounds on the number of questions in the search phase of any deterministic algorithm for the database illustrated in Fig. 1.\nProposition 1. 
The number of questions asked by a deterministic algorithm A, on average w.r.t. the uniform distribution, to solve the NNS problem in the star graph is lower bounded by \u2126(\u03b1 log(n/\u03b1)).\n\nTo outline the proof of this claim: each question asked by the algorithm involves two database nodes. Note that the weights of the edges emanating from the center of the graph are chosen so that the branches become independent, in the sense that questioning nodes on one branch will not reveal\n\nFigure 1: The star database: a weighted star graph with \u03b1 branches, each composed of n/\u03b1^2 \u201csupernodes\u201d. Each supernode further includes \u03b1 database objects. Finally, each query point is randomly connected to one object on each branch of the star via a weighted edge. The weights of the edges are chosen so that the disorder constant is D = \u0398(\u03b1).\n\nany information about other branches. Therefore, in order to find the nearest node to q, the algorithm has to find the direct node on each branch, and then compare them to find the NN. For any branch \u03c6i, there are n/\u03b1 candidates which can be the DN of q with equal probability. Hence, roughly speaking, the algorithm needs to ask \u2126(log(n/\u03b1)) questions for each branch. This yields a minimum total of \u2126(\u03b1 log(n/\u03b1)) questions for the \u03b1 independent branches in the graph.\nProposition 2. Any deterministic algorithm A solving the nearest neighbor search problem on the input query set M with the uniform distribution must ask on average \u2126(\u03b1^2) questions of the oracle.\nTo outline the proof of this claim: consider an arbitrary branch \u03c6i and assume a genie tells us which supernode on \u03c6i contains the DN for q. However, we do not know which of p1, p2, . . . , p\u03b1, the nodes inside the revealed supernode, is the DN of q on \u03c6i. 
Since all the edges connecting the supernode to its children have the same weight, questioning just some of them is not sufficient to find the direct node, and effectively all of them must be queried on average. Since each question involves at most two such nodes, \u2126(\u03b1) questions are required to find the DN on \u03c6i. Summing the same number over all \u03b1 branches, we obtain the \u2126(\u03b1^2) lower bound on the number of questions.\nTheorem 1 is a direct consequence of the above-mentioned propositions.\n\nProof of Theorem 1. Let A be an arbitrary deterministic algorithm which solves the NNS problem in the star-shaped graph with the uniform distribution. If QA denotes the average number of questions A asks, then according to Proposition 1 and Proposition 2 we have\n\nQA \u2265 max{\u2126(\u03b1 log(n/\u03b1)), \u2126(\u03b1^2)} \u2265 (1/2)(\u2126(\u03b1 log(n/\u03b1)) + \u2126(\u03b1^2)) = \u2126(\u03b1^2 + \u03b1 log(n/\u03b1)).\n\n(2)\n\nBy using Yao\u2019s minimax principle, we can conclude Theorem 1.\n\nWe can show that this bound is the best one can find for this dataset. Indeed, we present an algorithm in the appendix [4] which finds the query\u2019s nearest neighbor by asking \u0398(\u03b1^2 + \u03b1 log(n/\u03b1)) questions.\n\n5 Hierarchical Data Structure For Nearest-Neighbor Search\n\nIn this section we develop the search algorithm that guarantees the performance stated in Theorem 2. The learning phase is described in Algorithm 1. The algorithm builds a hierarchical decomposition level by level, top-down. At each level, we sample objects from the database. The set of samples at level i is denoted by Si, and we have |Si| = mi = a(2D)^i log n, where a is a constant independent1 of n and D. At each level i, every object in T is put in the \u201cbin\u201d of the sample in Si closest to it. 
To\n\ufb01nd this sample at level i, for every object p we rank the samples in Si w.r.t. p (by using the oracle\nto make pairwise comparisons). However, we show that given that we know D, we only need to\nrank those samples that fell in the bin of one of the at most 4aD log n nearest samples to p at level\ni \u2212 1. This is a consequence of the fact that we carefully chose the density of the samples at each\nlevel. Further, the fact that we build the hierarchy top-down, allows us to use the answers to the\nquestions asked at level i, to reduce the number of questions we need to ask at level i + 1. This way,\nthe number of questions per object does not increase as we go down in the hierarchy, even though\nthe number of samples increases.\nFor object p, \u03bdp(i) denotes the nearest neighbor to object p in Si. We want to keep the \u03bbi =\nn/(2D)i\u22121 closest objects in Si to p in the set \u0393p(i), i.e., all objects o \u2208 Si so that rp(o, Si) \u2264 \u03bbi.\nIt could be shown that for an object o to be in \u0393p(i) it is necessary that \u03bdo(i \u2212 1) be in \u0393p(i \u2212 1).\nTherefore by taking \u039bp(i) = {o \u2208 Si|\u03bdo(i \u2212 1) \u2208 \u0393p(i \u2212 1)} we have \u0393p(i) \u2286 \u039bp(i). It could\nbe veri\ufb01ed that |\u0393p(i)| \u2264 4aD log n, therefore \u0393p(i) can be constructed by \ufb01nding the 4aD log n\nclosest objects in \u039bp(i) to p. De\ufb01nitely the \ufb01rst object in \u0393p(i) is \u03bdp(i). Therefore we can recursively\nbuild \u0393p(i), \u039bp(i) and \u03bdp(i) for 1 \u2264 i \u2264 log n/ log 2D for any object p, as it is done in the algorithm.\nThe role of macros BuildHeap and ExtractMin is to build a heap from unordered data, and ex-\ntract the minimum element from the heap, respectively. Although they are well-known and standard\nalgorithms, we will present them in the appendix [4] for completeness.\n\nThe search process is described in Algorithm 2. 
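The learning phase just outlined, and the search phase that replays the same per-object update for a single query, can be sketched in simplified form. This is not the paper's exact Algorithm 1: the constants, the fallback on an empty candidate list (Algorithm 1 reports failure there), and the helper names `closest` and `learn` are our simplifications.

```python
import functools, math, random

def closest(oracle, p, candidates, k):
    """Sort candidates by similarity to p via oracle questions; keep top k."""
    key = functools.cmp_to_key(lambda u, v: -1 if oracle(p, u, v) == u else 1)
    return sorted(candidates, key=key)[:k]

def learn(objects, oracle, D, a=1.0):
    """Toy version of the top-down learning phase.

    Level i draws about a*(2D)^i*log n samples; an object p only ranks
    samples whose nearest level-(i-1) sample survived in p's candidate
    list, mirroring the Gamma/Lambda pruning of the text.
    """
    n = len(objects)
    levels = max(1, round(math.log(n) / math.log(2 * D)))
    keep = max(1, round(4 * a * D * math.log(n)))   # |Gamma_p(i)| bound
    gamma = {p: list(objects) for p in objects}     # candidates, best first
    nearest = [{} for _ in range(levels + 1)]       # nearest[i][p] = nu_p(i)
    for i in range(1, levels + 1):
        S = random.sample(objects, min(n, round(a * (2 * D) ** i * math.log(n))))
        for p in objects:
            allowed = set(gamma[p])
            cand = [s for s in S if nearest[i - 1].get(s, s) in allowed] or S
            gamma[p] = closest(oracle, p, cand, keep)
            nearest[i][p] = gamma[p][0]
        # searching for a query q runs this same per-object update once per level
    return nearest

random.seed(0)
objs = list(range(50))
oracle = lambda q, u, v: u if abs(q - u) <= abs(q - v) else v
nearest = learn(objs, oracle, D=2)
print(nearest[-1][7])  # at the last level the closest sample to 7 is 7 itself
```

The point of the pruning is visible in the candidate-list construction: the number of samples grows with the level, but the number each object actually ranks stays bounded, which is what keeps the question count per object from growing down the hierarchy.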
The key idea is that the sample closest to the query point at the lowest level will be its NN. Hence, by repeating the same process used for inserting objects into the database, we can retrieve the NN w.h.p. We first bound the number of questions asked by Algorithm 1 (w.h.p.) in Theorem 3. Given this result, the proof of Theorem 2 is then immediate.\nTheorem 3. Algorithm 1 succeeds with probability higher than 1 \u2212 1/n, and it requires asking no more than O(nD^3 log^2 n + D log^2 n log log n^{D^3}) questions w.h.p.\n\nWe first state a technical lemma that we will need to prove Theorem 3. The proof can be found in the appendix [4].\nLemma 2. Take a constant a and let \u03bbi = n/(2D)^{i-1}. For every object p \u2208 T \u222a {q}, where q is the query point, the following five properties of the data structure are true w.h.p.:\n1. |Si \u2229 \u03b2p(\u03bbi+1)| \u2265 1\n2. |Si \u2229 \u03b2p(\u03bbi)| \u2264 4aD log n\n3. |Si+1 \u2229 \u03b2p(\u03bbi\u22121)| \u2264 16aD^3 log n\n4. |Si \u2229 \u03b2p(4\u03bbi)| \u2265 4aD log n\n5. |Si+1 \u2229 \u03b2p(4\u03bbi\u22121)| \u2264 64aD^3 log n\nProof of Theorem 3. Let mi = a(2D)^i log n denote the number of objects we sample at level i, and let Si be the set of samples at level i, i.e., |Si| = mi. Here, a is an appropriately chosen constant, independent of D and n. Further, let \u03bbi = n/(2D)^{i-1}.\n\nFrom now on, we assume that we are in the situation where Properties (1) to (5) in Lemma 2 are true for all objects (which is the case w.h.p.). Again, fix an object p. 
For each object p, we need\nto \ufb01nd \u03bdp(i), which is the nearest neighbor in Si with respect to p.\nIn order for being able to\ncontinue this procedure in every level, we keep a wider range of objects: those objects in Si that\nhave ranks less than \u03bbi+1 with respect to p in level i; we store them in \u0393p(i) (property 1 tells us that\nsuch objects exist), in this way the \ufb01rst object in \u0393p(i) would be \u03bdp(i). In practice our algorithm\nstores some redundant objects in \u0393p(i), but we claim that totally no more than 4aD log n objects\nare stored in \u0393p(i + 1). To summarize, the properties we want to maintain in each level are: 1-\n\u2200p \u2208 T and 1 \u2264 i \u2264 log n/ log 2D, Si \u2229 \u03b2p(\u03bbi) \u2286 \u0393p(i) and 2- |\u0393p(i)| \u2264 4aD log n.\n\n1in fact the value of a is dependent on the value of error we expect, the more accurate we want to be, the\n\nmore sample points we need in each level and a would be larger.\n\n6\n\n\finput : A database with n objects p1, ..., pn, and disorder constant D\noutput: For each object u, a vector \u03bdu of length log n/ log(2D). The list of all samples \u222aiSi\nDef.: Si: The set of a(2D)i log n random samples at level i, i = 1, . . . , log n/ log(2D);\n\u03bdo(i) =nearest neighbor to object o in Si; o \u2208 T , i = 1, . . . 
, log n/ log(2D);\ncontains the \u03bbi closest objects to p in Si, possibly with redundant objects;\n\n\u03bdo:\n\u0393o(i):\n\u039bo(i): The set of p \u2208 Si, for which \u03bdp(i \u2212 1) \u2208 \u0393o(i \u2212 1);\n\nlog 2D do\n\nfor i \u2190 1 to L = log n\nfor p \u2190 1 to n do\nif i = 1 then\n\u039bp(1) \u2190 S1\n\u039bp(i) = {o \u2208 Si|\u03bdo(i \u2212 1) \u2208 \u0393p(i \u2212 1)};\nif |\u039bp(i)| = 0 then\nReport Failure\nelse\n\nelse\n\nH \u2190 BuildHeap(\u039bp(i)) ;\nfor k \u2190 1 to 4aD log n do\nm \u2190 ExtractMin(H) ;\nadd m to \u0393p(i)\n\nend\n\nend\n\u03bdp(i) \u2190 \ufb01rst object in \u0393p(i);\n\nend\n\nend\n\nend\n\nAlgorithm 1: Learning Algorithm\n\ninput : A database with n objects and disorder D, the list of samples, the vectors \u03bdu for\n\nu \u2208 T , a query point q\n\noutput: The nearest neighbor of q in the database\n\u0393q(1) = S1;\nfor i \u2190 2 to L = log n\n\nlog 2D do\n\u039bq(i) \u2190 {p \u2208 Si|\u03bdp(i \u2212 1) \u2208 \u0393q(i \u2212 1)};\nH \u2190 BuildHeap(\u039bq(i)) ;\nfor k \u2190 1 to 4aD log n do\nm \u2190 ExtractMin(H) ;\nadd m to \u0393q(i)\n\nend\n\nend\nreturn \ufb01rst object in \u0393q( log n\n\nlog 2D )\n\nAlgorithm 2: Search Algorithm\n\nIn the \ufb01rst step, for all p, \u039bp(1) = S1, and since |S1| = 2aD log n < 4aD log n, all the objects in\nS1 are extracted from the heap and therefore \u0393p(i) is S1 ordered with respect to p, as a result both\nthe properties hold when i = 1. The argument for the maintenance of this property is as follows:\nAssume the property holds up to level i; we analyze level i + 1. In fact we want an object s \u2208 Si+1\nto be in \u0393p(i + 1) if rp(s) \u2264 \u03bbi+1 (note that Property 1 guarantees that there is a least one such\nsample). Further, let s\u2032 \u2208 Si be the sample at level i closest to s i.e., s\u2032 = minx\u2208Si rs(s\u2032). Again,\nby Property 1, we know that rs(s\u2032) \u2264 \u03bbi+1. 
Hence, by the approximate triangle inequality 3 (see\nSection 2), we have:\n\nrp(s,T ) \u2264 \u03bbi+1 and rs(s\u2032,T ) \u2264 \u03bbi+1 \u21d2 rp(s\u2032,T ) \u2264 2D\u03bbi+1 = \u03bbi\n\nhence s\u2032 = \u03bds(i) \u2208 Si \u2229 \u03b2p(\u03bbi) \u2286 \u0393p(i) using the \ufb01rst property for step i. Therefore \u03bds(i) \u2208 \u0393p(i)\nand therefore s \u2208 \u039bp(i + 1). Property 2 tells us that |Si+1 \u2229 \u03b2p(\u03bbi+1)| \u2264 4aD log n. Hence by\n\n7\n\n\ftaking the \ufb01rst 4aD log n closest objects to p in \u039bp(i + 1) and storing them in \u0393p(i + 1), we can\nmake sure than both s \u2208 \u0393p(i + 1) for s \u2208 Si+1, s \u2208 \u03b2p(\u03bbi+1) and |\u0393p(i + 1)| \u2264 4aD log n.\nNote that in the last loop of the algorithm when i = log n/ log 2D, according to Property 1, |Si \u2229\n\u03b2p(\u03bbi+1)| \u2265 1. But \u03bbi+1 in the last step is 1, therefore the closest object to p in the database is\nin Slog n/ log 2D, which means that \u03bdp(log n/ log 2D) is the nearest neighbor of p in the database.\nRepeating this argument for the query point in the Search algorithm shows that after the termination,\nthe algorithm \ufb01nds the nearest neighbor.\nTo analyze the complexity of the algorithm, we should show that |\u039bp(i + 1)| is not big. Property 4\ntells us that all of the 4aD log n closest samples to p at level i have rank less than 8\u03bbi,so all objects in\n\u039bp(i) have ranks less than 8\u03bbi with respect to p. Consider a sample s \u2208 Si such that rp(s,T ) \u2264 8\u03bbi\nand a sample s\u2032\u2032 \u2208 Si+1 that falls in the bin of s.\nIf an object s\u2032\u2032 is in \u039bp(i + 1), it means that it falls in the bin of an object s in \u0393p(i), i.e. \u03bds\u2032\u2032 (i) \u2208\n\u0393p(i). Since s \u2208 \u0393p(i), we have rp(s, T ) \u2264 8\u03bbi.\nBy property 1, we must have rs\u2032\u2032 (s,T ) \u2264 \u03bbi+1. 
Thus, by inequality 2, we have:\n\nrs\u2032\u2032 (s,T ) \u2264 \u03bbi+1 and rp(s,T ) \u2264 8\u03bbi \u21d2 rp(s\u2032\u2032,T ) < D(8\u03bbi + \u03bbi+1) \u2264 4\u03bbi\u22121\n\nBy property 5, there are at most O(D3 log n) such samples at level i + 1, i.e. \u039bp(i + 1) =\nO(D3 log n).\nTo summarize, at each level for each object, we build a heap out of O(D3 log n) objects and ap-\nply O(aD log n) ExtractMin procedures to \ufb01nd the \ufb01rst 4aD log n objects in the heap. Each\nExtractMin requires O(log(D3 log n)) = O(log log nD3\n). Hence the complexity for each level\nand for each object is O(D3 log n + D log n log log nD3\n). There are O(log n) levels and n objects,\nso the overall complexity is O(nD3 log n + nD log2 n log log nD3\n\n).\n\nProof of Theorem 2. The upper bound on the number of questions to be asked in the learning phase\nis immediate from Theorem 3. For each object, we need to store one identi\ufb01er (the identi\ufb01er of the\nclosest object) at every level i in the hierarchy, and one bit to mark it as a member of Si or not;\nalso one bit if it is in \u0393q(i \u2212 1) and one bit for being in \u039bq(i) (we can reuse this memory in the\nnext level) (note that a heap with size N needs O(N log n) memory, where log n is for storing each\nobject). Hence, the total memory requirement2 do not exceed O(n log2 n/ log(2D)) bits. Finally,\nthe properties 1-5 shown in the proof of Theorem 3 are also true for an external query object q.\nHence, to \ufb01nd the closest object to q on every level, we build the same heap structure, the only\ndifference is that instead of repeating this procedure n times in each level, since there is just one\nquery point, we need to ask at most O(D3 log2 n + D log2 n log log nD3\nIn\nparticular, the closest object at level L = log2D(n) will be q\u2019s nearest neighbor w.h.p.\n\n) questions totally.\n\nNote that this scheme can be easily modi\ufb01ed for R-nearest neighbor search. 
For $R$-nearest neighbor search, note that at the $i$-th level of the hierarchy, the sample closest to $q$ will, w.h.p., be one of its $n/(2D)^i$ nearest neighbors. If we are only interested in a given level of precision, we can stop the hierarchy construction at the desired level.

6 Discussion

The use of a comparison oracle is motivated by a human user who can make comparisons between objects but cannot assign meaningful numerical values to similarities between objects. There are many interesting questions raised by studying such a model, including fundamental characterizations of the complexity of search in terms of the number of oracle questions. We also believe that the idea of searching through comparisons forms a bridge between many well-known search techniques in metric spaces and perceptually important (non-metric) situations, and could lead to innovative practical applications. Analogous to locality-sensitive hashing, one can develop notions of rank-sensitive hashing, where objects that are "similar" in terms of ranks are given the same hash value. Some preliminary ideas along these lines were given in [3], and we believe this is an interesting line of inquiry. Also in [3], we have implemented comparison-based search heuristics to navigate an image database.

²Making the assumption that every object can be uniquely identified with $\log n$ bits.

References

[1] N. Goyal, Y. Lifshits, and H. Schütze, "Disorder inequality: A combinatorial approach to nearest neighbor search," in WSDM, 2008, pp. 25-32.

[2] S. Dasgupta and Y. Freund, "Random projection trees and low dimensional manifolds," in STOC, 2008, pp. 537-546.

[3] D. Tschopp, "Routing and search on large scale networks," Ph.D. dissertation, École Polytechnique Fédérale de Lausanne (EPFL), 2010.

[4] D. Tschopp, S. Diggavi, P. Delgosha, and S.
Mohajer, "Randomized algorithms for comparison-based search: Supplementary material," 2011, submitted to NIPS as supplementary material.

[5] K. Clarkson, "Nearest-neighbor searching and metric space dimensions," in Nearest-Neighbor Methods for Learning and Vision: Theory and Practice, G. Shakhnarovich, T. Darrell, and P. Indyk, Eds. MIT Press, 2006, pp. 15-59.

[6] Y. Rubner, C. Tomasi, and L. J. Guibas, "The earth mover's distance as a metric for image retrieval," International Journal of Computer Vision, vol. 40, no. 2, pp. 99-121, 2000.

[7] E. Demidenko, "Kolmogorov-Smirnov test for image comparison," in Computational Science and Its Applications - ICCSA, 2004, pp. 933-939.

[8] M. Nachtegael, S. Schulte, V. De Witte, T. Mélange, and E. Kerre, "Image similarity, from fuzzy sets to color image applications," in Advances in Visual Information Systems, 2007, pp. 26-37.

[9] S. Santini and R. Jain, "Similarity measures," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 21, no. 9, pp. 871-883, 1999.

[10] G. Chechik, V. Sharma, U. Shalit, and S. Bengio, "Large scale online learning of image similarity through ranking," Journal of Machine Learning Research, vol. 11, pp. 1109-1135, 2010.

[11] A. Frome, Y. Singer, F. Sha, and J. Malik, "Learning globally-consistent local distance functions for shape-based image retrieval and classification," in ICCV, 2007, pp. 1-8.

[12] Y. Lifshits and S. Zhang, "Combinatorial algorithms for nearest neighbors, near-duplicates and small-world design," in SODA, 2009, pp. 318-326.

[13] R. Krauthgamer and J. R. Lee, "Navigating nets: simple algorithms for proximity search," in SODA, 2004, pp. 798-807.

[14] D. R. Karger and M.
Ruhl, "Finding nearest neighbors in growth-restricted metrics," in STOC, 2002, pp. 741-750.

[15] T. Cormen, C. Leiserson, R. Rivest, and C. Stein, Introduction to Algorithms. MIT Press and McGraw-Hill, 2001.

[16] R. Motwani and P. Raghavan, Randomized Algorithms. Cambridge University Press, 1995.