{"title": "Exploiting spatial overlap to efficiently compute appearance distances between image windows", "book": "Advances in Neural Information Processing Systems", "page_first": 2735, "page_last": 2743, "abstract": "We present a computationally efficient technique to compute the distance of high-dimensional appearance descriptor vectors between image windows. The method exploits the relation between appearance distance and spatial overlap. We derive an upper bound on appearance distance given the spatial overlap of two windows in an image, and use it to bound the distances of many pairs between two images. We propose algorithms that build on these basic operations to efficiently solve tasks relevant to many computer vision applications, such as finding all pairs of windows between two images with distance smaller than a threshold, or finding the single pair with the smallest distance. In experiments on the PASCAL VOC 07 dataset, our algorithms accurately solve these problems while greatly reducing the number of appearance distances computed, and achieve larger speedups than approximate nearest neighbour algorithms based on trees [18]and on hashing [21]. For example, our algorithm finds the most similar pair of windows between two images while computing only 1% of all distances on average.", "full_text": "Exploiting spatial overlap to ef\ufb01ciently compute\nappearance distances between image windows\n\nBogdan Alexe\nETH Zurich\n\nViviana Petrescu\n\nETH Zurich\n\nVittorio Ferrari\n\nETH Zurich\n\nAbstract\n\nWe present a computationally ef\ufb01cient technique to compute the distance of high-\ndimensional appearance descriptor vectors between image windows. The method\nexploits the relation between appearance distance and spatial overlap. 
We derive\nan upper bound on appearance distance given the spatial overlap of two windows\nin an image, and use it to bound the distances of many pairs between two images.\nWe propose algorithms that build on these basic operations to ef\ufb01ciently solve\ntasks relevant to many computer vision applications, such as \ufb01nding all pairs of\nwindows between two images with distance smaller than a threshold, or \ufb01nding\nthe single pair with the smallest distance. In experiments on the PASCAL VOC 07\ndataset, our algorithms accurately solve these problems while greatly reducing the\nnumber of appearance distances computed, and achieve larger speedups than ap-\nproximate nearest neighbour algorithms based on trees [18] and on hashing [21].\nFor example, our algorithm \ufb01nds the most similar pair of windows between two\nimages while computing only 1% of all distances on average.\n\nIntroduction\n\n1\nComputing the appearance distance between two windows is a fundamental operation in a wide\nvariety of computer vision techniques. Algorithms for weakly supervised learning of object\nclasses [7, 11, 16] typically compare large sets of windows between images trying to \ufb01nd recurring\npatterns of appearance. Sliding-window object detectors based on kernel SVMs [13, 24] compute\nappearance distances between the support vectors and a large number of windows in the test image.\nIn human pose estimation, [22] computes the color histogram dissimilarity between many candidate\nwindows for lower and upper arms. In image retrieval the user can search a large image database for\na query object speci\ufb01ed by an image window [20]. 
Finally, many tracking algorithms [4, 5] compare a window around the target object in the current frame to all windows in a surrounding region of the next frame.
In most cases one is not interested in computing the distance between all pairs of windows from two sets, but in a small subset of low distances, such as all pairs below a given threshold, or the single best pair. Because of this, computer vision researchers often rely on efficient nearest neighbour algorithms [2, 6, 10, 17, 18, 21]. Exact nearest neighbour algorithms organize the appearance descriptors into trees which can be efficiently searched [17]. However, these methods work well only for descriptors of small dimensionality n (typically n < 20), and their speedup vanishes for larger n (e.g. the popular GIST descriptor [19] has n = 960). Locality sensitive hashing (LSH [2, 10, 21]) techniques hash the descriptors into bins, so that similar descriptors are mapped to the same bins with high probability. LSH is typically used for efficiently finding approximate nearest neighbours in high dimensions [2, 6].
All the above methods consider windows only as points in appearance space. However, windows exist also as points in the geometric space defined by their 4D coordinates in the image they lie in. In this geometric space, a natural distance between two windows is their spatial overlap (fig. 1). In this paper we propose to take advantage of an important relation between the geometric and appearance spaces: the appearance distance between two windows decreases as their spatial overlap increases. We derive an upper bound on the appearance distance between two windows in the same image, given their spatial overlap (sec. 2). We then use this bound in conjunction with the triangle inequality to bound the appearance distances of many pairs of windows between two images, given the distance of just one pair. Building on these basic operations, we design algorithms to efficiently find all pairs with distance smaller than a threshold (sec. 3) and to find the single pair with the smallest distance (sec. 4).

Fig. 1: Relation between spatial overlap and appearance distance. Windows w1, w2 in an image I are embedded in geometric space and in appearance space. All windows overlapping more than r with w1 are at most at distance B(r) in appearance space. The bound B(r) decreases as overlap increases (i.e. as r increases).

The techniques we propose reduce computation by minimizing the number of times appearance distances are computed. They are complementary to methods for reducing the cost of computing one distance, such as dimensionality reduction [15] or Hamming embeddings [14, 23].
We experimentally demonstrate in sec. 5 that the proposed algorithms accurately solve the above problems while greatly reducing the number of appearance distances computed. We compare to approximate nearest neighbour algorithms based on trees [18], as well as on the recent LSH technique [21]. The results show our techniques outperform them in the setting we consider, where the datapoints are embedded in a space with additional overlap structure.

2 Relation between spatial overlap and appearance distance
Windows w in an image I are embedded in two spaces at the same time (fig. 1). In geometric space, w is represented by its 4 spatial coordinates (e.g. x, y center, width, height). The distance between two windows is defined based on their spatial overlap o(w1, w2) = |w1 ∩ w2| / |w1 ∪ w2| ∈ [0, 1], where |w1 ∩ w2| denotes the area of the intersection and |w1 ∪ w2| the area of the union. In appearance space, w is represented by a high dimensional vector describing the pixel pattern inside it, as computed by a function fapp(w) : I → R^n (e.g. the GIST descriptor has n = 960 dimensions). 
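The spatial overlap measure is only a few lines of code; a minimal sketch (the `(x, y, width, height)` top-left-corner layout is our assumption, not the paper's, but centre-based coordinates give the same overlap value):

```python
def overlap(w1, w2):
    """Spatial overlap o(w1, w2) = |w1 ∩ w2| / |w1 ∪ w2| in [0, 1].

    Windows are assumed to be (x, y, width, height) tuples with (x, y)
    the top-left corner.
    """
    x1, y1, a1, b1 = w1
    x2, y2, a2, b2 = w2
    # Intersection rectangle; zero if the windows do not overlap.
    iw = max(0.0, min(x1 + a1, x2 + a2) - max(x1, x2))
    ih = max(0.0, min(y1 + b1, y2 + b2) - max(y1, y2))
    inter = iw * ih
    union = a1 * b1 + a2 * b2 - inter
    return inter / union
```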
In appearance space, two windows are compared using a distance d(fapp(w1), fapp(w2)).
Two overlapping windows w1, w2 in an image I share the pixels contained in their intersection (fig. 1). The spatial overlap of the two windows correlates with the proportion of common pixels input to fapp when computing the descriptor for each window. In general, fapp varies smoothly with the geometry of w, so that windows of similar geometry are close in appearance space. Consequently, the spatial overlap o and appearance distance d are related. In this paper we exploit this relation to derive an upper bound B(o(w1, w2)) on the appearance distance between two overlapping windows.
We present here the general form of the bound B, its main properties and explain why it is useful. In subsections 2.1 and 2.2 we derive the actual bound itself. To simplify the notation we use d(w1, w2) to denote the appearance distance d(fapp(w1), fapp(w2)). We refer to it simply as distance and we say overlap for spatial overlap. The upper bound B is a function of the overlap o(w1, w2), and has the following property

d(w1, w2) ≤ B(o(w1, w2))   ∀ w1, w2   (1)

Moreover, B is a monotonically decreasing function

B(o1) ≤ B(o2)   ∀ o1 ≥ o2   (2)

Fig. 2: Triangle inequality in appearance space. The triangle inequality (4) holds for any three points fapp(w1), fapp(w2) and fapp(w3) in appearance space. (a) General case; (b) Lower bound case: |d(w1, w2) - d(w2, w3)| = d(w1, w3); (c) Upper bound case: d(w1, w3) = d(w1, w2) + d(w2, w3).

This property means B continuously decreases as overlap increases. Therefore, all pairs of windows within an overlap radius r (i.e. o(w1, w2) ≥ r) have distance below B(r) (fig. 1)

d(w1, w2) ≤ B(o(w1, w2)) ≤ B(r)   ∀ w1, w2 : o(w1, w2) ≥ r   (3)

As defined above, B bounds the appearance distance between two windows in the same image. Now we show how it can be used to derive a bound on the distances between windows in two different images I1, I2. Given two windows w1, w2 in I1 and a window w3 in I2, we use the triangle inequality to derive (fig. 2)

|d(w1, w2) - d(w2, w3)| ≤ d(w1, w3) ≤ d(w1, w2) + d(w2, w3)   (4)

Using the bound B in eq. (4) we obtain

max(0, d(w2, w3) - B(o(w1, w2))) ≤ d(w1, w3) ≤ B(o(w1, w2)) + d(w2, w3)   (5)

Eq. (5) delivers lower and upper bounds for d(w1, w3) without explicitly computing it (given that d(w2, w3) and o(w1, w2) are known). These bounds will form the basis of our algorithms for reducing the number of times the appearance distance is computed when solving two classic tasks (secs. 3 and 4).
In the next subsection we estimate B for arbitrary window descriptors (e.g. color histograms, bag of visual words, GIST [19], HOG [8]) from a set of images (no human annotation required). In subsection 2.2 we derive exact bounds in closed form for histogram descriptors (e.g. color histograms, bag of visual words [25]).

2.1 Statistical bounds for arbitrary window descriptors
We estimate Bα from training data so that eq. (1) holds with probability α

P( d(w1, w2) ≤ Bα(o(w1, w2)) ) = α   ∀ w1, w2   (6)

Bα is estimated from a set of M training images I = {I^m}. For each image I^m we sample N windows {w^m_i}, and then compute for every window pair its overlap o^m_ij = o(w^m_i, w^m_j) and distance d^m_ij = d(w^m_i, w^m_j). The overall training dataset D is composed of the (o^m_ij, d^m_ij) pairs

D = { (o^m_ij, d^m_ij) | m ∈ {1, ..., M}, i, j ∈ {1, ..., N} }   (7)

We now quantize the overlap values into 100 bins and estimate Bα(o) for each bin o separately. For a bin o, we consider the set D_o of all distances d^m_ij for which o^m_ij is in the bin. We choose Bα(o) as the α-quantile of D_o (fig. 3a)

Bα(o) = q_α(D_o)   (8)

B_1(o) is the largest distance d^m_ij for which o^m_ij is in bin o. Fig. 3a shows the binned distance-overlap pairs and the bound B_0.95 for GIST descriptors [19]. The data comes from 100 windows sampled from more than 1000 images (details in sec. 5). Each column of this matrix is roughly Gaussian distributed, and its mean continuously decreases with increasing overlap, confirming our assumptions about the relation between overlap and distance (sec. 2). In particular, note how the mean distance decreases fastest for 50% to 80% overlap.

Fig. 3: Estimating B_0.95(o) and o_min(ε). (a) The estimated B_0.95(o) (white line) for the GIST [19] appearance descriptor. (b) Using B_0.95(o) we derive o_min(ε).

Given a window w1 and a distance ε we can use Bα to find windows w2 overlapping with w1 that are at most distance ε from w1. This will be used extensively by our algorithms presented in secs. 3 and 4. From Bα we can derive the smallest overlap o_min(ε) so that all pairs of windows overlapping more than o_min(ε) have distance smaller than ε (with probability more than α). Formally

P( d(w1, w2) ≤ ε ) ≥ α   ∀ w1, w2 : o(w1, w2) ≥ o_min(ε)   (9)

and o_min(ε) is defined as the smallest overlap o for which the bound is smaller than ε (fig. 3b)

o_min(ε) = min{ o | Bα(o) ≤ ε }   (10)

2.2 Exact bounds for histogram descriptors
The statistical bounds of the previous subsection can be estimated from images for any appearance descriptor. In contrast, in this subsection we derive exact bounds in closed form for histogram descriptors (e.g. color histograms, bag of visual words [25]). Our derivation applies to L1-normalized histograms and the χ² distance. For simplicity of presentation, we assume every pixel contributes one feature to the histogram of the window (as in color histograms). The derivation is very similar for features computed on another regular grid (e.g. dense SURF bag-of-words [11]). We present here the main idea behind the bound and give the full derivation in the supplementary material [1].
The upper bound B for two windows w1 and w2 corresponds to the limit case where the three regions w1 \ w2, w1 ∩ w2 and w2 \ w1 contain three disjoint sets of colors (or visual words in general). Therefore, the upper bound B is

B(w1, w2) = |w1 \ w2| / |w1| + |w2 \ w1| / |w2| + |w1 ∩ w2| · (1/|w1| - 1/|w2|)² / (1/|w1| + 1/|w2|)   (11)

Expressing the terms in (11) based on the windows' overlap o = o(w1, w2) = |w1 ∩ w2| / |w1 ∪ w2|, we obtain a closed form for the upper bound B that depends only on o

B(w1, w2) = B(o(w1, w2)) = B(o) = 2 - 4 · o / (o + 1)   (12)

In practice, this exact bound is typically much looser than its corresponding statistical bound learned from data (sec. 2.1). Therefore, we use the statistical bound for the experiments in sec. 5.

3 Efficiently computing all window pairs with distance smaller than ε
In this section we present an algorithm to efficiently find all pairs of windows with distance smaller than a threshold ε between two images I1, I2. 
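The closed-form bound of eq. (12) can be checked numerically. A sketch, assuming the χ² distance Σ_b (h1(b) - h2(b))² / (h1(b) + h2(b)) used in the derivation and a toy three-bin histogram layout: for two equal-size windows whose three regions hold disjoint colours, the χ² distance attains the bound exactly.

```python
def chi2(h1, h2):
    # χ² distance between L1-normalized histograms:
    # sum over non-empty bins of (p - q)^2 / (p + q).
    return sum((p - q) ** 2 / (p + q) for p, q in zip(h1, h2) if p + q > 0)

def bound(o):
    # Closed-form upper bound of eq. (12): B(o) = 2 - 4*o/(o + 1).
    return 2 - 4 * o / (o + 1)

# Worst case for two equal-size windows (area s) sharing k pixels:
# w1\w2, w1∩w2 and w2\w1 contain three disjoint colours (bins A, B, C).
s = 100
for k in range(s + 1):
    h1 = [(s - k) / s, k / s, 0.0]
    h2 = [0.0, k / s, (s - k) / s]
    o = k / (2 * s - k)            # overlap |w1 ∩ w2| / |w1 ∪ w2|
    # The disjoint-colour limit case attains the bound exactly.
    assert abs(chi2(h1, h2) - bound(o)) < 1e-9
```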
Formally, given an input set of windows W1 = {w1_i} in image I1 and a set W2 = {w2_j} in image I2, the algorithm should return the set of pairs P_ε = { (w1_i, w2_j) | d(w1_i, w2_j) ≤ ε }.
Algorithm overview. Algorithm 1 summarizes our technique. Block 1 randomly samples a small set of seed pairs, for which it explicitly computes distances. The core of the algorithm (Block 3) explores pairs overlapping with a seed, looking for all appearance distances smaller than ε. When exploring a seed, the algorithm can decide to discard many pairs overlapping with it, as the bound predicts that their distance cannot be lower than ε. This causes the computational saving (step 3.c). Before starting Block 3, Block 2 establishes the sequence in which to explore the seeds, i.e. in order of decreasing distance. The remaining pairs are appended in random order afterwards.

Algorithm 1 Efficiently computing all distances smaller than ε
Input: windows W^m = {w^m_i}, threshold ε, lookup table o_min, number of initial samples F
Output: set P_ε of all pairs p with d(p) ≤ ε
1. Compute seed pairs P_F
   (a) sample F random pairs p_ij = (w1_i, w2_j) from P = W1 × W2, giving P_F
   (b) compute d_ij = d(w1_i, w2_j), ∀ p_ij ∈ P_F
2. Determine a sequence S of all pairs from P (gives schedule of Block 3 below)
   (a) sort the seed pairs in P_F in order of decreasing distance
   (b) set S(1 : F) = P_F
   (c) fill S((F + 1) : end) with random pairs from P \ P_F
3. For p_c = S(1 : end) (explore the pairs in the S order)
   (a) compute d(p_c)
   (b) if d(p_c) ≤ ε
      i. let r = o_min(ε - d(p_c))
      ii. let N = overlap_neighborhood(p_c, r)
      iii. for all pairs p ∈ N: compute d(p)
      iv. update P_ε ← P_ε ∪ { p ∈ N | d(p) ≤ ε }
   (c) else
      i. let r = o_min(d(p_c) - ε)
      ii. let N = overlap_neighborhood(p_c, r)
      iii. discard all pairs in N from S: S ← S \ N

overlap_neighborhood
Input: pair p_ij = (w1_i, w2_j), overlap radius r
Output: overlap neighborhood N of p_ij
N = { (w1_i, w2_v) | o(w2_j, w2_v) ≥ r } ∪ { (w1_u, w2_j) | o(w1_i, w1_u) ≥ r }

compute
Input: pair p_ij
Output: If d(w1_i, w2_j) was never computed before, then compute it and store it in a table D. If d(w1_i, w2_j) is already in D, then directly return it.

Algorithm core. Block 3 takes one of two actions based on the distance of the pair p_c currently being explored. If d(p_c) ≤ ε, then all pairs in the overlap neighborhood N of p_c have distance smaller than ε. This overlap neighborhood has a radius r = o_min(ε - d(p_c)) predicted by the bound lookup table o_min (fig. 4a). Therefore, Block 3 computes the distance of all pairs in N (step 3.b). Instead, if d(p_c) > ε, Block 3 determines the radius r = o_min(d(p_c) - ε) of the overlap neighborhood containing pairs with distance greater than ε, and then discards all pairs in it (step 3.c).
Overlap neighborhood. The overlap neighborhood of a pair p_ij = (w1_i, w2_j) with radius r contains all pairs (w1_i, w2_v) such that o(w2_j, w2_v) ≥ r, and all pairs (w1_u, w2_j) such that o(w1_i, w1_u) ≥ r (fig. 4a).

4 Efficiently computing the single window pair with the smallest distance
We give an algorithm to efficiently find the single pair of windows with the smallest appearance distance between two images. 
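Algorithm 1 can be condensed into executable form. The following is a sketch, not the authors' implementation: the windows, the distance d, the overlap o and the o_min lookup are all passed in as functions, so any bound table can be plugged in. (For equal-width 1D intervals, overlap determines distance exactly, so an exact o_min can be written down and the sketch becomes exact.)

```python
import random

def all_pairs_below(W1, W2, d, o, omin, eps, F=10):
    """Sketch of Algorithm 1: all index pairs (i, j) with d(W1[i], W2[j]) <= eps.

    d: appearance distance, o: spatial overlap, omin(e): smallest overlap
    whose appearance-distance bound is <= e (the paper's lookup table).
    """
    cache = {}
    def dist(p):                                  # the memoised "compute" routine
        if p not in cache:
            cache[p] = d(W1[p[0]], W2[p[1]])
        return cache[p]

    def neighborhood(p, r):                       # overlap_neighborhood(p, r)
        i, j = p
        return ({(i, v) for v in range(len(W2)) if o(W2[j], W2[v]) >= r} |
                {(u, j) for u in range(len(W1)) if o(W1[i], W1[u]) >= r})

    P = [(i, j) for i in range(len(W1)) for j in range(len(W2))]
    seeds = random.sample(P, min(F, len(P)))      # Block 1: seed pairs
    seeds.sort(key=dist, reverse=True)            # Block 2: decreasing distance
    seed_set = set(seeds)
    S = seeds + [p for p in P if p not in seed_set]
    discarded, result = set(), set()
    for pc in S:                                  # Block 3
        if pc in discarded:
            continue
        if dist(pc) <= eps:                       # step 3.b: explore neighbours
            for p in neighborhood(pc, omin(eps - dist(pc))):
                if dist(p) <= eps:
                    result.add(p)
        else:                                     # step 3.c: discard neighbours
            discarded |= neighborhood(pc, omin(dist(pc) - eps))
    return result
```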
Given as input the two sets of windows W1, W2, the algorithm should return the pair p* = (w1_i*, w2_j*) with the smallest distance: d(w1_i*, w2_j*) = min_ij d(w1_i, w2_j).

Fig. 4: Overlap neighborhoods. (a) The overlap neighborhood of radius r of a pair (w1_i, w2_j) contains all blue pairs. (b) The joint overlap neighborhood of radius s of a pair (w1_i, w2_j) contains all blue and green pairs.

Algorithm overview. Algorithm 2 is analogous to Algorithm 1. Block 1 computes distances for the seed pairs and selects the pair with the smallest distance as the initial approximation to p*. Block 3 explores pairs overlapping with a seed, looking for a distance smaller than d(p*). When exploring a seed, the algorithm can decide to discard many pairs overlapping with it, as the bound predicts they cannot be better than p*. Block 2 organizes the seeds in order of increasing distance. In this way, the algorithm can rapidly refine p* towards smaller and smaller values. This is useful because in step 3.c, the number of discarded pairs grows as d(p*) gets smaller. Therefore, this seed ordering maximises the number of discarded pairs (i.e. minimizes the number of distances computed).
Algorithm core. Block 3 takes one of two actions based on d(p_c). If d(p_c) ≤ d(p*) + Bα(s), then there might be a better pair than p* within radius s in the joint overlap neighborhood of p_c. Therefore, the algorithm computes the distance of all pairs in this neighborhood (step 3.b). The radius s is an input parameter. Instead, if d(p_c) > d(p*) + Bα(s), the algorithm determines the radius r = o_min(d(p_c) - d(p*)) of the overlap neighborhood that contains only pairs with distance greater than d(p*), and then discards all pairs in it (step 3.c).
Joint overlap neighborhood. 
The joint overlap neighborhood of a pair p_ij = (w1_i, w2_j) with radius s contains all pairs (w1_u, w2_v) such that o(w1_i, w1_u) ≥ s and o(w2_j, w2_v) ≥ s.

5 Experiments and conclusions
We present experiments on a test set composed of 1000 image pairs from the PASCAL VOC 07 dataset [12], randomly sampled under the constraint that the two images in a pair contain at least one object of the same class (out of 6 classes: aeroplane, bicycle, bus, boat, horse, motorbike). This setting is relevant for various applications, such as object detection [13, 24], and ensures a balanced distribution of appearance distances in each image pair (some pairs of windows will have low distances while others high distances). We experiment with three appearance descriptors: GIST [19] (960D), color histograms (CHIST, 4000D), and bag-of-words [11, 25] on the dense SURF descriptor [3] (BOW, 2000D). As appearance distances we use the Euclidean distance for GIST, and the χ² distance for CHIST and SURF BOW. The bound tables Bα for each descriptor were estimated beforehand from a separate set of 1300 images of other classes (sec. 2.1).
Task 1: all pairs of windows with distance smaller than ε. The task is to find all pairs of windows with distance smaller than a user-defined threshold ε between two images I1, I2 (sec. 3). This task occurs in weakly supervised learning of object classes [7, 11, 16], where algorithms search for recurring patterns over training images containing thousands of overlapping windows, and in human pose estimation [22], which compares many overlapping candidate body part locations.
We randomly sample 3000 windows in each image (|W1| = |W2| = 3000) and set ε so that 10% of all distances are below it. This makes the task meaningful for any image pair, regardless of the range of distances it contains. 
For each image pair we quantify performance with two measures: (i) cost: the number of computed distances divided by the total number of window pairs (9 million); (ii) accuracy: Σ_{p ∈ P_ε} (ε - d(p)) / Σ_{p ∈ W1×W2 : d(p) ≤ ε} (ε - d(p)), where P_ε is the set of window pairs returned by the algorithm; the numerator sums over the returned pairs and the denominator over all distances truly below ε. The lowest possible cost while still achieving 100% accuracy is 10%.
We compare to LSH [2, 6, 10] using [21] as a hash function. It maps descriptors to binary strings, such that the Hamming distance between two strings is related to the value of a Gaussian kernel between the original descriptors [21]. As recommended in [6, 10], we generate T separate (random) encodings and build T hash tables, each with 2^C bins, where C is the number of bits in the encoding.

Algorithm 2 Efficiently computing the smallest distance
Input: windows W^m = {w^m_i}, lookup table o_min, search radius s, number of initial samples F
Output: pair p* with the smallest distance
1. Compute seed pairs P_F (as Block 1 of Algorithm 1) and estimate the current best pair: p* = argmin_{p_ij ∈ P_F} d_ij
2. Determine a sequence S of all pairs (as Block 2 of Algorithm 1)
3. For p_c = S(1 : end) (explore the pairs in the S order)
   (a) compute d(p_c)
   (b) if d(p_c) ≤ d(p*) + Bα(s)
      i. let N = joint_overlap_neighborhood(p_c, s)
      ii. for all pairs p ∈ N: compute d(p)
      iii. update p* ← argmin { {d(p*)} ∪ { d(p) | p ∈ N } }
   (c) else
      i. let r = o_min(d(p_c) - d(p*))
      ii. let N = overlap_neighborhood(p_c, r)
      iii. 
discard all pairs in N from S: S ← S \ N

joint_overlap_neighborhood
Input: pair p_ij = (w1_i, w2_j), overlap radius s
Output: joint overlap neighborhood N of p_ij
N = { (w1_u, w2_v) | o(w1_i, w1_u) ≥ s, o(w2_j, w2_v) ≥ s }

To perform Task 1, we loop over each table t and do: (H1) hash all w2_j ∈ W2 into table t; (H2) for each w1_i ∈ W1 do: (H2.1) hash w1_i into its bin b1_t,i; (H2.2) compute all distances d in the original space between w1_i and all windows w2_j ∈ b1_t,i (unless already computed when inspecting a previous table); (H3) return all computed d(w1_i, w2_j) ≤ ε.
We also compare to approximate nearest neighbours based on kd-trees, using the ANN library [18]. To perform Task 1, we do: (A1) for each w1_i ∈ W1 do: (A1.1) compute the ε-NN between w1_i and all windows w2_j ∈ W2 and return them all. The notion of cost above is not defined for ANN methods based on trees, so we measure wall clock runtime instead, and report as cost the ratio of the runtime of approximate NN over the runtime of exact NN (also computed using the ANN library [18]). This gives a meaningful indication of speedup, which can be compared to the cost we report for our method and LSH. As the ANN library supports only the Euclidean distance, we report results only for GIST.
The results table reports cost and accuracy averaged over the test set. Our method from sec. 3 performs very well for all three descriptors. On average it achieves 98% accuracy at 16% cost. This is a considerable speedup over exhaustive search, as it means only 7% of the 90% of distances greater than ε have been computed. The behavior of LSH depends on T and C. The higher the T, the higher the accuracy, but also the cost (because there are more collisions; the same holds for lower C). To compare fairly, we evaluate LSH over T ∈ {1, ..., 20} and C ∈ {2, ..., 30} and report results for the T, C that deliver the closest accuracy to our method. 
Results (our method vs. LSH and ANN; '-' where a method does not apply):

GIST + Euclidean distance
        Task 1 cost/accuracy   Task 2 cost/accuracy   Task 3 cost/ratio/rank
our     18.0% / 97.3%          30.2% / 87.1%          2.3% / 1.02 / 1.39
LSH     86.2% / 95.4%          73.4% / 83.5%          16.4% / 1.03 / 2.72
ANN     71.8% / 91.9%          72.6% / 87.7%          58.6% / 1.01 / 1.48

CHIST + χ² distance
our     15.7% / 97.7%          30.3% / 96.2%          0.4% / 1.01 / 1.12
LSH     93.7% / 97.2%          96.9% / 95.1%          37.5% / 1.02 / 33.5
ANN     - / -                  - / -                  - / - / -

SURF BOW + χ² distance
our     15.2% / 98.5%          28.6% / 94.0%          0.7% / 1.01 / 1.19
LSH     96.8% / 98.5%          88.7% / 92.1%          46.5% / 1.01 / 9.62
ANN     - / -                  - / -                  - / - / -

As the table shows, on average over the three descriptors, for the same accuracy LSH has cost 92%, substantially worse than our method. The behavior of ANN depends on the degree of approximation, which we set so as to get the accuracy closest to our method. At 92% accuracy, ANN has 72% of the runtime of exact NN. This shows that, if high accuracy is desired, ANN offers only a modest speedup (compared to our 18% cost for GIST).
Task 2: all windows closer than ε to a query. This is a special case of Task 1, where W1 contains just one window. Hence, this becomes an ε-nearest-neighbours task where W1 acts as a query and W2 as the retrieval database. This task occurs in many applications, e.g. object detectors based on kernel SVMs compare a support vector (query) to a large set of overlapping windows in the test image [13, 24]. As this is expensive, many detectors resort to linear kernels [9]. Our algorithms offer the option to use more complex kernels while retaining a practical speed. 
Other applications include tracking in video [4, 5] and image retrieval [20] (see the beginning of sec. 1).
As the table shows, our method is somewhat less efficient than on Task 1. This makes sense, as it can only exploit overlap structure in one of the two input sets. Yet, for a similar accuracy it offers greater speedup than LSH and ANN.
Task 3: single pair of windows with smallest distance. The task is to find the single pair of windows with the smallest distance between I1 and I2, out of 3000 windows in each image (sec. 4), and has similar applications as Task 1.
We quantify performance with three measures: (i) cost: as in all other tasks; (ii) distance ratio: the ratio between the smallest distance returned by the algorithm and the true smallest distance. The best possible value is 1, and higher values are worse; (iii) rank: the rank of the returned distance among all 9 million.
To perform Task 3 with LSH, we simply modify step (H3) of the procedure given for Task 1 to: return the smallest distance among all those computed. To perform Task 3 with ANN we replace step (A1.1) with: compute the NN of w1_i in W2. At the end of loop (A1) return the smallest distance among all those computed.
As the table shows, on average over the three descriptors, our method from sec. 4 achieves a distance ratio of 1.01 at 1.1% cost, which is almost 100× faster than exhaustive search. The average rank of the returned distance is 1.25 out of 9 million, which is almost a perfect result. When compared at a similar distance ratio, our method is considerably more efficient than LSH and ANN. LSH computes 33.3% of all distances, while ANN brings only a speedup of factor 2 over exact NN.
Runtime considerations. While we have measured only the number of computed appearance distances, our algorithms also compute spatial overlaps. 
Crucially, spatial overlaps are computed in the 4D geometric space, compared to 1000+ dimensions for the appearance space. Therefore, computing spatial overlaps has negligible impact on the total runtime of the algorithms. In practice, when using 5000 windows per image with 4000D dense SURF BOW descriptors, the total runtime of our algorithms is 71s for Task 1 and 16s for Task 3, compared to 335s for exhaustive search. Importantly, the cost of computing the descriptors is small compared to the cost of evaluating distances, as it is roughly linear in the number of windows and can be implemented very rapidly. In practice, computing dense SURF BOW for 5000 windows in two images takes 5 seconds.
Conclusions. We have proposed efficient algorithms for computing distances of appearance descriptors between two sets of image windows, by taking advantage of the overlap structure in the sets. Our experiments demonstrate that these algorithms greatly reduce the number of appearance distances computed when solving several tasks relevant to computer vision, and that they outperform LSH and ANN for these tasks. Our algorithms could be useful in various applications: for example, improving the spatial accuracy of weakly supervised learners [7, 11] by using thousands of windows per image, using more complex kernels and detecting more classes in kernel SVM object detectors [13, 24], and enabling image retrieval systems to search at the window level with any descriptor, rather than returning entire images or being constrained to bag-of-words descriptors [20]. To encourage these applications, we release our source code at http://www.vision.ee.ethz.ch/~calvin.

References
[1] B. Alexe, V. Petrescu, and V. Ferrari. Exploiting spatial overlap to efficiently compute appearance distances between image windows - supplementary material. In NIPS, 2011. Also available at http://www.vision.ee.ethz.ch/~calvin/publications.html.
[2] A. Andoni and P. 
Indyk. Near-optimal hashing algorithms for approximate nearest neighbor in high dimensions. Communications of the ACM, 2008.
[3] H. Bay, A. Ess, T. Tuytelaars, and L. van Gool. SURF: Speeded up robust features. CVIU, 110(3):346-359, 2008.
[4] C. Bibby and I. Reid. Robust real-time visual tracking using pixel-wise posteriors. In ECCV, 2008.
[5] S. Birchfield. Elliptical head tracking using intensity gradients and color histograms. In CVPR, 1998.
[6] O. Chum, J. Philbin, M. Isard, and A. Zisserman. Scalable near identical image and shot detection. In CIVR, 2007.
[7] O. Chum and A. Zisserman. An exemplar model for learning object classes. In CVPR, 2007.
[8] N. Dalal and B. Triggs. Histograms of oriented gradients for human detection. In CVPR, volume 2, pages 886-893, 2005.
[9] N. Dalal and B. Triggs. Histograms of oriented gradients for human detection. In CVPR, 2005.
[10] M. Datar, N. Immorlica, P. Indyk, and V. Mirrokni. Locality-sensitive hashing scheme based on p-stable distributions. In SCG, 2004.
[11] T. Deselaers, B. Alexe, and V. Ferrari. Localizing objects while learning their appearance. In ECCV, 2010.
[12] M. Everingham, L. Van Gool, C. Williams, J. Winn, and A. Zisserman. The PASCAL Visual Object Classes Challenge 2007 Results, 2007.
[13] H. Harzallah, F. Jurie, and C. Schmid. Combining efficient object localization and image classification. In ICCV, 2009.
[14] H. Jegou, M. Douze, and C. Schmid. Hamming embedding and weak geometric consistency for large-scale image search. In ECCV, 2008.
[15] Y. Ke and R. Sukthankar. PCA-SIFT: A more distinctive representation for local image descriptors. In CVPR, 2004.
[16] G. Kim and A. Torralba. Unsupervised detection of regions of interest using iterative link analysis. In NIPS, 2009.
[17] N. Kumar, L. Zhang, and S. Nayar. What is a good nearest neighbors algorithm for finding similar patches in images? 
In ECCV, 2008.
[18] D. M. Mount and S. Arya. ANN: A library for approximate nearest neighbor searching, August 2006.
[19] A. Oliva and A. Torralba. Modeling the shape of the scene: a holistic representation of the spatial envelope. IJCV, 42(3):145-175, 2001.
[20] J. Philbin, O. Chum, M. Isard, J. Sivic, and A. Zisserman. Object retrieval with large vocabularies and fast spatial matching. In CVPR, 2007.
[21] M. Raginsky and S. Lazebnik. Locality sensitive binary codes from shift-invariant kernels. In NIPS, 2009.
[22] B. Sapp, A. Toshev, and B. Taskar. Cascaded models for articulated pose estimation. In ECCV, 2010.
[23] A. Torralba, R. Fergus, and Y. Weiss. Small codes and large image databases for recognition. In CVPR, 2008.
[24] A. Vedaldi, V. Gulshan, M. Varma, and A. Zisserman. Multiple kernels for object detection. In ICCV, 2009.
[25] J. Zhang, M. Marszalek, S. Lazebnik, and C. Schmid. Local features and kernels for classification of texture and object categories: a comprehensive study. IJCV, 2007.
", "award": [], "sourceid": 1484, "authors": [{"given_name": "Bogdan", "family_name": "Alexe", "institution": null}, {"given_name": "Viviana", "family_name": "Petrescu", "institution": null}, {"given_name": "Vittorio", "family_name": "Ferrari", "institution": null}]}