{"title": "Hashing Hyperplane Queries to Near Points with Applications to Large-Scale Active Learning", "book": "Advances in Neural Information Processing Systems", "page_first": 928, "page_last": 936, "abstract": "We consider the problem of retrieving the database points nearest to a given {\\em hyperplane} query without exhaustively scanning the database. We propose two hashing-based solutions. Our first approach maps the data to two-bit binary keys that are locality-sensitive for the angle between the hyperplane normal and a database point. Our second approach embeds the data into a vector space where the Euclidean norm reflects the desired distance between the original points and hyperplane query. Both use hashing to retrieve near points in sub-linear time. Our first method's preprocessing stage is more efficient, while the second has stronger accuracy guarantees. We apply both to pool-based active learning: taking the current hyperplane classifier as a query, our algorithm identifies those points (approximately) satisfying the well-known minimal distance-to-hyperplane selection criterion. We empirically demonstrate our methods' tradeoffs, and show that they make it practical to perform active selection with millions of unlabeled points.", "full_text": "Hashing Hyperplane Queries to Near Points\n\nwith Applications to Large-Scale Active Learning\n\nPrateek Jain\n\nAlgorithms Research Group\n\nMicrosoft Research, Bangalore, India\n\nprajain@microsoft.com\n\nSudheendra Vijayanarasimhan\nDepartment of Computer Science\n\nUniversity of Texas at Austin\nsvnaras@cs.utexas.edu\n\nKristen Grauman\n\nDepartment of Computer Science\n\nUniversity of Texas at Austin\ngrauman@cs.utexas.edu\n\nAbstract\n\nWe consider the problem of retrieving the database points nearest to a given hyper-\nplane query without exhaustively scanning the database. We propose two hashing-\nbased solutions. 
Our \ufb01rst approach maps the data to two-bit binary keys that\nare locality-sensitive for the angle between the hyperplane normal and a database\npoint. Our second approach embeds the data into a vector space where the Eu-\nclidean norm re\ufb02ects the desired distance between the original points and hyper-\nplane query. Both use hashing to retrieve near points in sub-linear time. Our\n\ufb01rst method\u2019s preprocessing stage is more ef\ufb01cient, while the second has stronger\naccuracy guarantees. We apply both to pool-based active learning:\ntaking the\ncurrent hyperplane classi\ufb01er as a query, our algorithm identi\ufb01es those points (ap-\nproximately) satisfying the well-known minimal distance-to-hyperplane selection\ncriterion. We empirically demonstrate our methods\u2019 tradeoffs, and show that they\nmake it practical to perform active selection with millions of unlabeled points.\n\n1\n\nIntroduction\n\nEf\ufb01cient similarity search with large databases is central to many applications of interest, such as\nexample-based learning algorithms, content-based image or audio retrieval, and quantization-based\ndata compression. Often the search problem is considered in the domain of point data: given a\ndatabase of vectors listing some attributes of the data objects, which points are nearest to a novel\nquery vector? Existing algorithms provide ef\ufb01cient data structures for point-to-point retrieval tasks\nwith various useful distance functions, producing either exact or approximate near neighbors while\nforgoing a brute force scan through all database items, e.g., [1, 2, 3, 4, 5, 6, 7].\n\nBy comparison, much less work considers how to ef\ufb01ciently handle instances more complex than\npoints. In particular, little previous work addresses the hyperplane-to-point search problem: given\na database of points, which are nearest to a novel hyperplane query? 
This problem is critical to\npool-based active learning, where the goal is to request labels for those points that appear most\ninformative. The widely used margin-based selection criterion of [8, 9, 10] seeks those points that are\nnearest to the current support vector machine\u2019s hyperplane decision boundary, and can substantially\nreduce total human annotation effort. However, for large-scale active learning, it is impractical to\nexhaustively apply the classi\ufb01er to all unlabeled points at each round of learning; to exploit massive\nunlabeled pools, a fast (sub-linear time) hyperplane search method is needed.\n\n1\n\n\fTo this end, we propose two solutions for approximate hyperplane-to-point search. For each, we\nintroduce randomized hash functions that offer query times sub-linear in the size of the database, and\nprovide bounds for the approximation error of the neighbors retrieved. Our \ufb01rst approach devises\na two-bit hash function that is locality-sensitive for the angle between the hyperplane normal and a\ndatabase point. Our second approach embeds the inputs such that the Euclidean distance re\ufb02ects the\nhyperplane distance, thereby making them searchable with existing approximate nearest neighbor\nalgorithms for vector data. While the preprocessing in our \ufb01rst method is more ef\ufb01cient, our second\nmethod has stronger accuracy guarantees.\n\nWe demonstrate our algorithms\u2019 signi\ufb01cant practical impact for large-scale active learning with\nSVM classi\ufb01ers. Our results show that our method helps scale-up active learning for realistic prob-\nlems with massive unlabeled pools on the order of millions of examples.\n\n2 Related Work\n\nWe brie\ufb02y review related work on approximate similarity search, subspace search methods, and\npool-based active learning.\nApproximate near-neighbor search. 
For low-dimensional points, spatial decomposition and tree-\nbased search algorithms can provide the exact neighbors in sub-linear time [1, 2]. While such\nmethods break down for high-dimensional data, a number of approximate near neighbor methods\nhave been proposed that work well with high-dimensional inputs. Locality-sensitive hashing (LSH)\nmethods devise randomized hash functions that map similar points to the same hash buckets, so that\nonly a subset of the database must be searched after hashing a novel query [3, 4, 5]. A related family\nof methods design Hamming space embeddings that can be indexed ef\ufb01ciently (e.g., [11, 12, 6]).\nHowever, in contrast to our approach, all such techniques are intended for vector/point data.\n\nA few researchers have recently examined approximate search tasks involving subspaces. In [13], a\nEuclidean embedding is developed such that the norm in the embedding space directly re\ufb02ects the\nprincipal angle-based distance between the original subspaces. After this mapping, one can apply\nexisting approximate near-neighbor methods designed for points (e.g., LSH). We provide a related\nembedding to \ufb01nd the points nearest to the hyperplane; however, in contrast to [13], we provide LSH\nbounds, and our embedding is more compact due to our proposed sampling strategy. Another method\nto \ufb01nd the nearest subspace for a point query is given in [14], though it is limited to relatively low-\ndimensional data due to its preprocessing time/space requirement of O(N d2 log N ) and query time\nof O(d10 log N ), where N is the number of database points and d is the dimensionality of the data.\nFurther, unlike [13], that approach is restricted to point queries. Finally, a sub-linear time method to\nmap a line query to its nearest points is derived in [15]. 
In contrast to all the above work, we propose\nspecialized methods for the hyperplane search problem, and show that they handle high-dimensional\ndata and large databases very ef\ufb01ciently.\nMargin-based active learning. Existing active classi\ufb01er learning methods for pool-based selection\ngenerally scan all database instances before selecting which to have labeled next.1 One well-known\nand effective active selection criterion for support vector machines (SVMs) is to choose points that\nare nearest to the current separating hyperplane [8, 9, 10]. While simple, this criterion is intuitive,\nhas theoretical basis in terms of rapidly reducing the version space [8], and thus is widely used\nin practice (e.g., [17, 18, 19]). Unfortunately, even for inexpensive selection functions, very large\nunlabeled datasets make the cost of exhaustively searching the pool impractical. Researchers have\npreviously attempted to cope with this issue by clustering or randomly downsampling the pool [19,\n20, 21, 22]; however, such strategies provide no guarantees as to the potential loss in active selection\nquality. In contrast, when applying our approach for this task, we can consider orders of magnitude\nfewer points when making the next active label request, yet guarantee selections within a known\nerror of the traditional exhaustive pool-based technique.\nOther forms of approximate SVM training. To avoid potential confusion, we note that our prob-\nlem setting differs from both that considered in [23], where computational geometry insights are\ncombined with the QP formulation for more ef\ufb01cient \u201ccore vector\u201d SVM training, as well as that\nconsidered in [19], where a subset of labeled data points are selected for online LASVM training.\n\n1We consider only a speci\ufb01c hyperplane criterion in this paper; see [16] for an active learning survey.\n\n2\n\n\f3 Approach\n\nWe consider the following retrieval problem. Given a database D = [x1, . . . 
, xN ] of N points in R^d, the goal is to retrieve the points from the database that are closest to a given hyperplane query whose normal is given by w ∈ R^d. We call this the nearest neighbor to a query hyperplane (NNQH) problem. Without loss of generality, we assume that the hyperplane passes through the origin, and that each xi and w is unit norm. We see in later sections that these assumptions do not affect our solution. The Euclidean distance of a point x to a given hyperplane h_w parameterized by normal w is:

d(h_w, x) = ‖(x^T w) w‖ = |x^T w|.   (1)

Thus, the goal for the NNQH problem is to identify those points xi ∈ D that minimize |xi^T w|. Note that this is in contrast to traditional proximity problems, e.g., nearest or farthest neighbor retrieval, where the goal is to maximize x^T w or −x^T w, respectively. Hence, existing approaches are not directly applicable to this problem.

We formulate two algorithms for NNQH. Our first approach maps the data to binary keys that are locality-sensitive for the angle between the hyperplane normal and a database point, thereby permitting sub-linear time retrieval with hashing. Our second approach computes a sparse Euclidean embedding for the query hyperplane that maps the desired search task to one handled well by existing approximate nearest-point methods.

In the following, we first provide necessary background on locality-sensitive hashing (LSH). The subsequent two sections describe each approach in turn, and Sec. 3.4 reviews their trade-offs. Finally, in Sec. 3.5, we explain how either method can be applied to large-scale active learning.

3.1 Background: Locality-Sensitive Hashing (LSH)

Informally, LSH [3] requires randomized hash functions guaranteeing that the probability of collision of two vectors is inversely proportional to their "distance", where "distance" is defined according to the task at hand. 
Since similar points are assured (w.h.p.) to fall into the same hash bucket, one need only search those database items with which a novel query collides in the hash table. Formally, let d(·, ·) be a distance function over items from a set S, and for any item p ∈ S, let B(p, r) denote the set of examples from S within radius r from p.

Definition 3.1. [3] Let h_H denote a random choice of a hash function from the family H. The family H is called (r, r(1 + ε), p1, p2)-sensitive for d(·, ·) when, for any q, p ∈ S,

• if p ∈ B(q, r) then Pr[h_H(q) = h_H(p)] ≥ p1,
• if p ∉ B(q, r(1 + ε)) then Pr[h_H(q) = h_H(p)] ≤ p2.

For a family of functions to be useful, it must satisfy p1 > p2. A k-bit LSH function computes a hash "key" by concatenating the bits returned by a random sampling of H: g(p) = [h_H^(1)(p), h_H^(2)(p), . . . , h_H^(k)(p)]. Note that the probability of collision for close points is thus at least p1^k, while for dissimilar points it is at most p2^k. During a preprocessing stage, all database points are mapped to a series of l hash tables indexed by independently constructed g1, . . . , gl, where each gi is a k-bit function. Then, given a query q, an exhaustive search is carried out only on those examples in the union of the l buckets to which q hashes. These candidates contain the (r, ε)-nearest neighbors (NN) for q, meaning if q has a neighbor within radius r, then with high probability some example within radius r(1 + ε) is found.

In [3] an LSH scheme using projections onto single coordinates is shown to be locality-sensitive for the Hamming distance over vectors. For that hash function, ρ = log p1 / log p2 ≤ 1/(1 + ε), and using l = N^ρ hash tables, a (1 + ε)-approximate solution can be retrieved in time O(N^(1/(1+ε))). Related formulations and LSH functions for other distances have been explored (e.g., [5, 4, 24]). Our contribution is to define two locality-sensitive hash functions for the NNQH problem.

3.2 Hyperplane Hashing based on Angle Distance (H-Hash)

Recall that we want to retrieve the database vector(s) x for which |w^T x| is minimized. If the vectors are unit norm, then this means that for the "good" (close) database vectors, w and x are almost perpendicular. Let θ_{x,w} denote the angle between x and w. We define the distance d(·, ·) in Definition 3.1 to reflect how far from perpendicular w and x are:

d_θ(x, w) = (θ_{x,w} − π/2)².   (2)

Consider the following two-bit function that maps two input vectors a, b ∈ R^d to {0, 1}²:

h_{u,v}(a, b) = [h_u(a), h_v(b)] = [sign(u^T a), sign(v^T b)],   (3)

where h_u(a) = sign(u^T a) returns 1 if u^T a ≥ 0, and 0 otherwise, and u and v are sampled independently from a standard d-dimensional Gaussian, i.e., u, v ∼ N(0, I). We define our hyperplane hash (H-Hash) function family H as:

h_H(z) = h_{u,v}(z, z) if z is a database point vector, and h_H(z) = h_{u,v}(z, −z) if z is a query hyperplane vector.

Next, we prove that this family of hash functions is locality-sensitive (Definition 3.1).

Claim 3.2. The family H is (r, r(1 + ε), 1/4 − r/π², 1/4 − r(1 + ε)/π²)-sensitive for the distance d_θ(·, ·), where r, ε > 0.

Proof. Since the vectors u, v used by hash function h_{u,v} are sampled independently, then for a query hyperplane vector w and a database point vector x,

Pr[h_H(w) = h_H(x)] = Pr[h_u(w) = h_u(x) and h_v(−w) = h_v(x)]
                    = Pr[h_u(w) = h_u(x)] Pr[h_v(−w) = h_v(x)].   (4)

Next, we use the following fact proven in [25]:

Pr[sign(u^T a) = sign(u^T c)] = 1 − θ_{a,c}/π,   (5)

where u is sampled as defined above, and θ_{a,c} denotes the angle between the two vectors a and c. Using (4) and (5), and noting that the angle between −w and x is π − θ_{x,w}, we get:

Pr[h_H(w) = h_H(x)] = (1 − θ_{x,w}/π)(θ_{x,w}/π) = 1/4 − (1/π²)(θ_{x,w} − π/2)².

Hence, when (θ_{x,w} − π/2)² ≤ r, Pr[h_H(w) = h_H(x)] ≥ 1/4 − r/π² = p1. Similarly, for any ε > 0 such that (θ_{x,w} − π/2)² ≥ r(1 + ε), Pr[h_H(w) = h_H(x)] ≤ 1/4 − r(1 + ε)/π² = p2.

We note that unlike traditional LSH functions, ours are asymmetric. That is, to hash a database point x we use h_{u,v}(x, x), whereas to hash a query hyperplane w, we use h_{u,v}(w, −w). The purpose of the two-bit hash is to constrain the angle with respect to both w and −w, so that we do not simply retrieve examples for which we know only that x is π/2 or less away from w.

With these functions in hand, we can now form hash keys by concatenating k two-bit pairs from k hash functions from H, store the database points in the hash tables, and query with a novel hyperplane to retrieve its closest points (see Sec. 3.1).

The approximation guarantees and correctness of this scheme can be obtained by adapting the proof of Theorem 1 in [3] (see supplementary file). 
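As a concrete illustration of the family H, the following toy Python sketch (our own naming, not the authors' code; a real index would use l independent tables of k concatenated two-bit keys, as described in Sec. 3.1) builds a single hash table over unit-norm points and probes it with a hyperplane normal:

```python
import numpy as np

rng = np.random.default_rng(0)

def h_hash_bits(z, u, v, is_query):
    # Two-bit H-Hash (Eq. 3): [sign(u.a), sign(v.b)] with the asymmetric
    # choice (a, b) = (z, -z) for a query normal and (z, z) for a point.
    a, b = (z, -z) if is_query else (z, z)
    return (bool(u @ a >= 0), bool(v @ b >= 0))

def make_key(z, Us, Vs, is_query):
    # Concatenate k two-bit hashes into one hashable 2k-bit key.
    return tuple(bit for u, v in zip(Us, Vs)
                 for bit in h_hash_bits(z, u, v, is_query))

d, N, k = 16, 1000, 2
X = rng.normal(size=(N, d))
X /= np.linalg.norm(X, axis=1, keepdims=True)   # unit-norm database points
Us = rng.normal(size=(k, d))                    # u, v ~ N(0, I), independent
Vs = rng.normal(size=(k, d))

# Preprocessing: hash every database point into one table.
table = {}
for i, x in enumerate(X):
    table.setdefault(make_key(x, Us, Vs, is_query=False), []).append(i)

# Query: hash the hyperplane normal w; search only the colliding bucket.
w = rng.normal(size=d)
w /= np.linalg.norm(w)
candidates = table.get(make_key(w, Us, Vs, is_query=True), [])
pool = candidates if candidates else range(N)   # toy-only fallback
best = min(pool, key=lambda i: abs(float(w @ X[i])))
```

The linear-scan fallback for an empty bucket is only for this toy example; the method's sub-linear guarantees come from using l = N^ρ such tables.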
In particular, we can show that with high probability, our LSH scheme will return a point within a distance (1 + ε)r, where r = min_i d_θ(x_i, w), in time O(N^ρ), where ρ = log p1 / log p2. As p1 > p2, we have ρ < 1, i.e., the approach takes sub-linear time for all values of r, ε. Furthermore, as p1 = 1/4 − r/π² and p2 = 1/4 − r(1 + ε)/π², ρ can also be bounded as ρ ≤ (log 4 − log(1 − 4r/π²)) / (log 4 − (1 + ε) log(1 − 4r/π²)). Note that this bound for ρ is dependent on r, and is more efficient for larger values of r. See the supplementary material for more discussion on the bound.

3.3 Embedded Hyperplane Hashing based on Euclidean Distance (EH-Hash)

Our second approach for the NNQH problem relies on a Euclidean embedding for the hyperplane and points. It offers stronger bounds than the above, but at the expense of more preprocessing.

Given a d-dimensional vector a, we compute an embedding inspired by [13] that yields a d²-dimensional vector by vectorizing the corresponding rank-1 matrix aa^T:

V(a) = vec(aa^T) = [a1², a1 a2, . . . , a1 ad, a2², a2 a3, . . . , ad²],   (6)

where ai denotes the i-th element of a. Assuming a and b to be unit vectors, the Euclidean distance between the embeddings V(a) and −V(b) is given by ‖V(a) − (−V(b))‖² = 2 + 2(a^T b)². Hence, minimizing the distance between the two embeddings is equivalent to minimizing |a^T b|, our intended function. Given this, we define our embedding-hyperplane hash (EH-Hash) function family E as:

h_E(z) = h_u(V(z)) if z is a database point vector, and h_E(z) = h_u(−V(z)) if z is a query hyperplane vector,

where h_u(z) = sign(u^T z) is a one-bit hash function parameterized by u ∼ N(0, I).

Claim 3.3. The family of functions E defined above is (r, r(1 + ε), (1/π) cos⁻¹(sin²(√r)), (1/π) cos⁻¹(sin²(√(r(1 + ε)))))-sensitive for d_θ(·, ·), where r, ε > 0.

Proof. Using the result of [25], for any vectors w, x ∈ R^d,

Pr[sign(u^T(−V(w))) = sign(u^T V(x))] = 1 − (1/π) cos⁻¹( −V(w)^T V(x) / (‖V(w)‖ ‖V(x)‖) ),   (7)

where u ∈ R^{d²} is sampled from a standard d²-variate Gaussian distribution, u ∼ N(0, I). Note that for any unit vectors a, b ∈ R^d, V(a)^T V(b) = Tr(aa^T bb^T) = (a^T b)² = cos² θ_{a,b}.

Using (7) together with the definition of h_E above, given a hyperplane query w and database point x we have:

Pr[h_E(w) = h_E(x)] = 1 − (1/π) cos⁻¹(−cos²(θ_{x,w})) = (1/π) cos⁻¹(cos²(θ_{x,w})).   (8)

Hence, when (θ_{x,w} − π/2)² ≤ r,

Pr[h_E(w) = h_E(x)] ≥ (1/π) cos⁻¹(sin²(√r)) = p1,   (9)

and p2 is obtained similarly.

We observe that this p1 behaves similarly to 2(1/4 − r/π²). That is, as r varies, EH-Hash's p1 returns values close to twice those returned by H-Hash's p1 (see the plot illustrating this in the supplementary file). Hence, the factor ρ = log p1 / log p2 improves upon that of the previous section, remaining lower for lower values of ε, and leading to better approximation guarantees. See the supplementary material for a more detailed comparison of the two bounds.

On the other hand, EH-Hash's hash functions are significantly more expensive to compute. 
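To make the embedding concrete, here is a minimal Python sketch (function names are ours) of V(·), the distance identity it induces, and the one-bit EH-Hash; it omits the sampling speedup discussed below:

```python
import numpy as np

rng = np.random.default_rng(1)

def embed(a):
    # Eq. (6): vectorize the rank-1 matrix a a^T (a d^2-dimensional vector).
    return np.outer(a, a).ravel()

def eh_hash(z, u, is_query):
    # One-bit EH-Hash: sign of u . (-V(z)) for a query hyperplane normal,
    # and of u . V(z) for a database point (the asymmetric scheme above).
    e = -embed(z) if is_query else embed(z)
    return bool(u @ e >= 0)

d = 8
x = rng.normal(size=d); x /= np.linalg.norm(x)   # unit-norm database point
w = rng.normal(size=d); w /= np.linalg.norm(w)   # unit-norm query normal

# For unit vectors, ||V(x) - (-V(w))||^2 = 2 + 2 (x.w)^2, so a small
# embedded Euclidean distance corresponds to a small |x.w|, the
# point-to-hyperplane distance we want to minimize.
dist_sq = float(np.linalg.norm(embed(x) + embed(w)) ** 2)

u = rng.normal(size=d * d)                        # u ~ N(0, I) over R^{d^2}
bit = eh_hash(x, u, is_query=False)
```

After this embedding, the points can be indexed with any existing Euclidean approximate nearest-neighbor structure, e.g., standard LSH over the d²-dimensional vectors.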
Specifically, it requires O(d²) time, whereas H-Hash requires only O(d). To alleviate this problem, we use a form of randomized sampling when computing the hash bits for a query that reduces the time to O(1/ε′²), for ε′ > 0. Our method relies on the following lemma, which states that sampling a vector v according to the weights of each element leads to a good approximation to v^T y for any vector y (with constant probability). Similar sampling schemes have been used for a variety of matrix approximation problems (see [26]).

Lemma 3.4. Let v ∈ R^d and define p_i = v_i²/‖v‖². Construct ṽ ∈ R^d such that the i-th element is v_i with probability p_i and is 0 otherwise. Select t such elements using sampling with replacement. Then, for any y ∈ R^d, ε′ > 0, c ≥ 1, t ≥ c/ε′²,

Pr[ |ṽ^T y − v^T y| ≤ ε′ ‖v‖₂ ‖y‖₂ ] > 1 − 1/c.   (10)

We defer the proof to the supplementary material. The lemma implies that at query time our hash function h_E(w) can be computed while incurring a small additive error in time O(1/ε′²), by sampling its embedding V(w) accordingly, and then cycling through only the non-zero indices of V(w) to compute u^T(−V(w)). Note that we can substantially reduce the error in the hash function computation by sampling O(1/ε′²) elements of the vector w and then using vec(w̃ w̃^T) as the embedding for w. However, in this case, the computational requirements increase to O(d/ε′²).

While one could alternatively use the Johnson-Lindenstrauss (JL) lemma to reduce the dimensionality of the embedding with random projections, doing so has two major difficulties: first, the d − 1 dimensionality of a subspace represented by a hyperplane implies the random projection dimensionality must still be large for the JL lemma to hold, and second, the projection dimension is dependent on the sum of the number of database points and query hyperplanes. The latter is problematic when fielding an arbitrary number of queries over time or storing a growing database of points—both properties that are intrinsic to our target active learning application. In contrast, our sampling method is instance-dependent and incurs very little overhead for computing the hash function.

Comparison to [13]. Basri et al. define embeddings for finding nearest subspaces [13]. In particular, they define Euclidean embeddings for affine subspace queries and database points which could be used for NNQH, although they do not specifically apply it to hyperplane-to-point search in their work. Also, their embedding is not tied to LSH bounds in terms of the distance function (2), as we have shown above. Finally, our proposed instance-specific sampling strategy offers a more compact representation with the advantages discussed above.

3.4 Recap of the Hashing Approaches

To summarize, we presented two locality-sensitive hashing approaches for the NNQH problem. Our first H-Hash approach defines locality-sensitivity in the context of NNQH, and then provides suitable two-bit hash functions together with a bound on retrieval time. 
Our second EH-Hash approach consists of a d²-dimensional Euclidean embedding for vectors of dimension d that in turn reduces NNQH to the Euclidean-space nearest neighbor problem, for which efficient search structures (including LSH) are available. While EH-Hash has better bounds than H-Hash, its hash functions are more expensive. To mitigate the expense for high-dimensional data, we use a well-justified heuristic where we randomly sample the given query embedding, reducing the query time to linear in d. Note that both of our approaches attempt to minimize d_θ(w, x) between the retrieved x and the hyperplane w. Since that distance depends only on the angle between x and w, any scaling of the vectors does not affect our methods, and we can safely treat the provided vectors as unit norm.

3.5 Application to Large-Scale Active Learning

The search algorithms introduced above can be applied to any task fitting their query/database specifications. We are especially interested in their relevance for making active learning scalable.

A practical paradox with pool-based active learning algorithms is that their intended value—to reduce learning time by choosing informative examples to label first—conflicts with the real expense of applying them to very large "unprepared" unlabeled datasets. Generally, methods today are tested in somewhat canned scenarios: the implementor has a moderately sized labeled dataset, and simply withholds the labels from the learner until a given point is selected, at which point the "oracle" reveals the label. In reality, one would like to deploy an active learner on a massive truly unlabeled data pool (e.g., all documents on the Web) and let it crawl for the instances that appear most valuable for the target classification task. 
The problem is that exhaustively scanning millions of points is expensive, and thus defeats the purpose of improving overall learning efficiency.

Our algorithms make it possible to benefit from both massive unlabeled collections as well as actively chosen label requests. We consider the "simple margin" selection criterion for linear SVM classifiers [8, 9, 10]. Given a hyperplane classifier and an unlabeled pool of vector data U = {x1, . . . , xN}, the point that minimizes the distance to the current decision boundary is selected for labeling: x* = argmin_{xi ∈ U} |w^T xi|. Our two NNQH solutions supply exactly the hash functions needed to rapidly identify the next point to label: first we hash the unlabeled database into tables, and then at each active learning loop, we hash the current classifier w as a query.²

²The SVM bias term is handled by appending points with a 1. Note, our approach assumes linear kernels.

Figure 1: Newsgroups results. (a) Improvements in prediction accuracy relative to the initial classifier, averaged across all 20 categories and runs. (b) Time required to perform selection. (c) Value of |w^T x| for the selected examples. Lower is better. Both of our approximate methods (H-Hash and EH-Hash) significantly outperform the passive baseline; they are nearly as accurate as ideal exhaustive active selection, yet require 1-2 orders of magnitude less time to select an example. (Best viewed in color.)

Figure 2: CIFAR-10 results. (a)-(c) Plotted as in the figure above. Our methods compare very well with the significantly more expensive exhaustive baseline. Our EH-Hash provides more accurate selection than our H-Hash (see (c)), though requires noticeably more query time (see (b)).

4 Results

We demonstrate our approach applied to large-scale active learning tasks. We compare our methods (H-Hash in Sec. 3.2 and EH-Hash in Sec. 3.3) to two baselines: 1) passive learning, where the next label request is randomly selected, and 2) exhaustive active selection, where the margin criterion in (1) is computed over all unlabeled examples in order to find the true minimum. The main goal is to show our algorithms can retrieve examples nearly as well as the exhaustive approach, but with substantially greater efficiency.

Datasets and implementation details. We use three publicly available datasets. 20 Newsgroups consists of 20,000 documents from 20 newsgroup categories. We use the provided 61,118-d bag-of-words features, and a test set of 7,505. 
CIFAR-10 [27] consists of 60,000 images from 10 categories. It is a manually labeled subset of the 80 Million Tiny Images dataset [28], which was formed by searching the Web for all English nouns and lacks ground truth labels. We use the provided train and test splits of 50K and 10K images, respectively. Tiny-1M consists of the first 1,000,000 (unlabeled) images from [28]. For both CIFAR-10 and Tiny-1M, we use the provided 384-d GIST descriptors as features. For all datasets, we train a linear SVM in the one-vs-all setting using a randomly selected labeled set (5 examples per class), and then run active selection for 300 iterations. We average results across five such runs. We fix k = 300, N^ρ = 500, ε′ = 0.01.

Newsgroups documents results. Figure 1 shows the results on 20 Newsgroups, starting with the learning curves for all four approaches (a). The active learners (exact and approximate) have the steepest curves, indicating that they are learning more effectively from the chosen labels compared to the random baseline. Both of our hashing methods perform similarly to the exhaustive selection, yet require scanning an order of magnitude fewer examples (b). Note, Random requires ∼0 time. Fig. 1(c) shows the actual values of |w^T x| for the selected examples over all iterations, categories, and runs; in line with our methods' guarantees, they select points close to those found with exhaustive search. We also observe the expected trade-off: H-Hash is more efficient, while EH-Hash provides better results (only slightly better for this smaller dataset).

CIFAR-10 tiny image results. Figure 2 shows the same set of results on CIFAR-10. 
The trends are mostly similar to the above, although the learning task is more difficult on this data, narrowing the margin between active and random. Averaged over all classes, we happen to outperform exhaustive selection (Fig. 2(a)); this can happen since there is no guarantee that the best active choice will help test accuracy, and it also reflects the wider variation across per-class results. The boxplots in (c) more directly show the hashing methods are behaving as expected. 

Figure 3: (a) First seven examples selected per method when learning the CIFAR-10 Airplane class. (b) Improvements in prediction accuracy as a function of the total time taken, including both selection and labeling time. By minimizing both selection and labeling time, our methods provide the best accuracy per unit time.

Figure 4: Tiny-1M results. (a) Error of examples selected. (b) Time required. (c) Examples selected by EH-Hash among 1M candidates in the first nine iterations when learning the Airplane and Automobile classes.
Both (b) and (c) illustrate their trade-offs: EH-Hash has stronger guarantees than H-Hash (and thus retrieves lower |w^T x| values), but is more expensive. Figure 3(a) shows example image selection results; both exhaustive search and our hashing methods manage to choose images useful for learning about airplanes/non-airplanes.

Figure 3(b) shows the prediction accuracy plotted against the total time taken per iteration, which includes both selection and labeling time, for both datasets. We set the labeling time per instance to 1 and 5 seconds for the Newsgroups and Tiny Image datasets, respectively. (Note, however, that these could vary in practice depending on the difficulty of the instance.) These results best show the advantage of our approximate methods: accounting for both types of cost inherent to training the classifier, they outperform both exhaustive and random selection in terms of accuracy gains per unit time. While exhaustive active selection suffers because of its large selection time, random selection suffers because it wastes expensive labeling time on irrelevant examples. Our algorithms provide the best accuracy gains by minimizing both selection and labeling time.

Tiny-1M results. Finally, to demonstrate the practical capability of our hyperplane hashing approach, we perform active selection on the one-million tiny image set. We initialize the classifier with 50 examples from CIFAR-10. The 1M set lacks any labels, making this a "live" test of active learning (we ourselves annotated whatever the methods selected). We use our EH-Hash method, since it offers stronger performance.

Even on this massive collection, our method's selections are very similar in quality to those of the exhaustive method (see Fig. 4(a)), yet require orders of magnitude less time (b).
The images (c) show the selections made from this large pool during the "live" labeling test; among all one million unlabeled examples (nearly all of which likely belong to one of the other thousands of classes), our method retrieves seemingly relevant instances. To our knowledge, this experiment exceeds any previous active selection result in the literature in terms of the scale of the unlabeled pool.

Conclusions. We introduced two methods for the NNQH search problem. Both permit efficient large-scale search for points near to a hyperplane, and experiments with three datasets clearly demonstrate their practical value for active learning with massive unlabeled pools. For future work, we plan to explore more accurate hash functions for our H-Hash scheme, and to investigate sublinear-time methods for active learning with non-linear kernels.

This work is supported in part by DARPA CSSG, NSF EIA-0303609, and the Luce Foundation.

References

[1] J. Friedman, J. Bentley, and R. Finkel. An Algorithm for Finding Best Matches in Logarithmic Expected Time. ACM Transactions on Mathematical Software, 3(3):209–226, September 1977.
[2] J. Uhlmann. Satisfying General Proximity/Similarity Queries with Metric Trees. Information Processing Letters, 40:175–179, 1991.
[3] A. Gionis, P. Indyk, and R. Motwani. Similarity Search in High Dimensions via Hashing. In Proceedings of the 25th Intl. Conf. on Very Large Data Bases, 1999.
[4] A. Andoni and P. Indyk. Near-Optimal Hashing Algorithms for Near Neighbor Problem in High Dimensions. In FOCS, 2006.
[5] M. Charikar. Similarity Estimation Techniques from Rounding Algorithms. In STOC, 2002.
[6] Y. Weiss, A. Torralba, and R. Fergus. Spectral Hashing. In NIPS, 2008.
[7] B. Kulis and K. Grauman. Kernelized Locality-Sensitive Hashing for Scalable Image Search. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), 2009.
[8] S.
Tong and D. Koller. Support Vector Machine Active Learning with Applications to Text Classification. In Proceedings of the International Conference on Machine Learning, 2000.
[9] G. Schohn and D. Cohn. Less is More: Active Learning with Support Vector Machines. In Proceedings of the International Conference on Machine Learning, 2000.
[10] C. Campbell, N. Cristianini, and A. Smola. Query Learning with Large Margin Classifiers. In Proceedings of the International Conference on Machine Learning, 2000.
[11] G. Shakhnarovich, P. Viola, and T. Darrell. Fast Pose Estimation with Parameter-Sensitive Hashing. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), 2003.
[12] R. Salakhutdinov and G. Hinton. Semantic Hashing. In Proceedings of the SIGIR Workshop on Information Retrieval and Applications of Graphical Models, 2007.
[13] R. Basri, T. Hassner, and L. Zelnik-Manor. Approximate Nearest Subspace Search. PAMI, 2010.
[14] A. Magen. Dimensionality Reductions that Preserve Volumes and Distance to Affine Spaces, and their Algorithmic Applications. In Randomization and Approximation Techniques in Computer Science, 2002.
[15] A. Andoni, P. Indyk, R. Krauthgamer, and H. L. Nguyen. Approximate Line Nearest Neighbor in High Dimensions. In SODA, 2009.
[16] B. Settles. Active Learning Literature Survey. TR 1648, University of Wisconsin, 2009.
[17] E. Chang, S. Tong, K. Goh, and C. Chang. Support Vector Machine Concept-Dependent Active Learning for Image Retrieval. IEEE Transactions on Multimedia, 2005.
[18] M. K. Warmuth, J. Liao, G. Ratsch, M. Mathieson, S. Putta, and C. Lemmen. Active Learning with Support Vector Machines in the Drug Discovery Process. J. Chem. Inf. Comput. Sci., 43:667–673, 2003.
[19] A. Bordes, S. Ertekin, J. Weston, and L. Bottou.
Fast Kernel Classifiers with Online and Active Learning. Journal of Machine Learning Research (JMLR), 6:1579–1619, September 2005.
[20] N. Panda, K. Goh, and E. Chang. Active Learning in Very Large Image Databases. Journal of Multimedia Tools and Applications: Special Issue on Computer Vision Meets Databases, 31(3), December 2006.
[21] W. Zhao, J. Long, E. Zhu, and Y. Liu. A Scalable Algorithm for Graph-Based Active Learning. In Frontiers in Algorithmics, 2008.
[22] R. Segal, T. Markowitz, and W. Arnold. Fast Uncertainty Sampling for Labeling Large E-mail Corpora. In Conference on Email and Anti-Spam, 2006.
[23] I. Tsang, J. Kwok, and P.-M. Cheung. Core Vector Machines: Fast SVM Training on Very Large Data Sets. Journal of Machine Learning Research, 6:363–392, 2005.
[24] P. Indyk and N. Thaper. Fast Image Retrieval via Embeddings. In Intl. Wkshp. on Stat. and Comp. Theories of Vision, 2003.
[25] M. Goemans and D. Williamson. Improved Approximation Algorithms for Maximum Cut and Satisfiability Problems Using Semidefinite Programming. JACM, 42(6):1115–1145, 1995.
[26] R. Kannan and S. Vempala. Spectral Algorithms. Foundations and Trends in Theoretical Computer Science, 4(3-4):157–288, 2009.
[27] A. Krizhevsky. Learning Multiple Layers of Features from Tiny Images. Technical report, University of Toronto, 2009.
[28] A. Torralba, R. Fergus, and W. T. Freeman. 80 Million Tiny Images: a Large Dataset for Non-Parametric Object and Scene Recognition. PAMI, 30(11):1958–1970, 2008.