{"title": "Discrete Graph Hashing", "book": "Advances in Neural Information Processing Systems", "page_first": 3419, "page_last": 3427, "abstract": "Hashing has emerged as a popular technique for fast nearest neighbor search in gigantic databases. In particular, learning based hashing has received considerable attention due to its appealing storage and search efficiency. However, the performance of most unsupervised learning based hashing methods deteriorates rapidly as the hash code length increases. We argue that the degraded performance is due to inferior optimization procedures used to achieve discrete binary codes. This paper presents a graph-based unsupervised hashing model to preserve the neighborhood structure of massive data in a discrete code space. We cast the graph hashing problem into a discrete optimization framework which directly learns the binary codes. A tractable alternating maximization algorithm is then proposed to explicitly deal with the discrete constraints, yielding high-quality codes to well capture the local neighborhoods. Extensive experiments performed on four large datasets with up to one million samples show that our discrete optimization based graph hashing method obtains superior search accuracy over state-of-the-art unsupervised hashing methods, especially for longer codes.", "full_text": "Discrete Graph Hashing\n\nWei Liu\u2020 Cun Mu\u2021\n\n\u2020IBM T. J. Watson Research Center\n\nSanjiv Kumar(cid:2)\n\n\u2021Columbia University\n\nShih-Fu Chang\u2021\n\n(cid:2)Google Research\n\nweiliu@us.ibm.com\n\ncm3052@columbia.edu\n\nsfchang@ee.columbia.edu\n\nsanjivk@google.com\n\nAbstract\n\nHashing has emerged as a popular technique for fast nearest neighbor search in gi-\ngantic databases. In particular, learning based hashing has received considerable\nattention due to its appealing storage and search ef\ufb01ciency. 
However, the performance of most unsupervised learning based hashing methods deteriorates rapidly as the hash code length increases. We argue that the degraded performance is due to inferior optimization procedures used to achieve discrete binary codes. This paper presents a graph-based unsupervised hashing model to preserve the neighborhood structure of massive data in a discrete code space. We cast the graph hashing problem into a discrete optimization framework which directly learns the binary codes. A tractable alternating maximization algorithm is then proposed to explicitly deal with the discrete constraints, yielding high-quality codes that capture the local neighborhoods well. Extensive experiments performed on four large datasets with up to one million samples show that our discrete optimization based graph hashing method obtains superior search accuracy over state-of-the-art unsupervised hashing methods, especially for longer codes.

1 Introduction

During the past few years, hashing has become a popular tool for tackling a variety of large-scale computer vision and machine learning problems, including object detection [6], object recognition [35], image retrieval [22], linear classifier training [19], active learning [24], kernel matrix approximation [34], multi-task learning [36], etc. In these problems, hashing is exploited to map similar data points to adjacent binary hash codes, thereby accelerating similarity search via highly efficient Hamming distances in the code space. In practice, hashing with short codes, say about one hundred bits per sample, can lead to significant gains in both storage and computation. This scenario is called Compact Hashing in the literature, and it is the focus of this paper.

Early endeavors in hashing concentrated on using random permutations or projections to construct randomized hash functions. 
The well-known representatives include Min-wise Hashing (MinHash) [3] and Locality-Sensitive Hashing (LSH) [2]. MinHash estimates the Jaccard set similarity and is improved by b-bit MinHash [18]. LSH can accommodate a variety of distance or similarity metrics, such as ℓp distances for p ∈ (0, 2], cosine similarity [4], and kernel similarity [17]. Due to randomized hashing, one needs more bits per hash table to achieve high precision. This typically reduces recall, and multiple hash tables are thus required to achieve satisfactory accuracy of retrieved nearest neighbors. The overall number of hash bits used in an application can easily run into thousands.

Beyond the data-independent randomized hashing schemes, a recent trend in machine learning is to develop data-dependent hashing techniques that learn a set of compact hash codes using a training set. Binary codes have been popular in this scenario for their simplicity and efficiency in computation. The compact hashing scheme can accomplish almost constant-time nearest neighbor search, after encoding the whole dataset into short binary codes and then aggregating them into a hash table. Additionally, compact hashing is particularly beneficial for storing massive-scale data. For example, saving one hundred million samples, each with 100 binary bits, costs less than 1.5 GB, which can easily fit in memory. 
To create effective compact codes, several methods have been proposed. These include unsupervised methods, e.g., Iterative Quantization [9], Isotropic Hashing [14], Spectral Hashing [38, 37], and Anchor Graph Hashing [23]; semi-supervised methods, e.g., Weakly-Supervised Hashing [25]; and supervised methods, e.g., Semantic Hashing [30], Binary Reconstruction Embeddings [16], Minimal Loss Hashing [27], Kernel-based Supervised Hashing [22], Hamming Distance Metric Learning [28], and Column Generation Hashing [20].

This paper focuses on the problem of unsupervised learning of compact hash codes. Here we argue that most unsupervised hashing methods suffer from inadequate search performance, particularly low recall, when applied to learn relatively longer codes (say around 100 bits) in order to achieve higher precision. The main reason is that the discrete (binary) constraints which should be imposed on the codes during learning have not been treated adequately. Most existing methods either neglect the discrete constraints, like PCA Hashing and Isotropic Hashing, or discard the constraints to solve relaxed optimizations and afterwards round the continuous solutions to obtain the binary codes, like Spectral Hashing and Anchor Graph Hashing. Crucially, we find that the hashing performance of the codes obtained by such relaxation + rounding schemes deteriorates rapidly as the code length increases (see Fig. 2). To date, very few approaches work directly in the discrete code space. Parameter-Sensitive Hashing [31] and Binary Reconstruction Embeddings (BRE) learn the parameters of predefined hash functions by progressively tuning the codes generated by such functions; Iterative Quantization (ITQ) iteratively learns the codes by explicitly imposing the binary constraints. While ITQ and BRE work in the discrete space to generate the hash codes, they do not capture the local neighborhoods of the raw data well in the code space. 
ITQ aims to minimize the quantization error between the codes and the PCA-reduced data. BRE trains the Hamming distances to mimic the ℓ2 distances among a limited number of sampled data points, but cannot incorporate the entire dataset into training due to its expensive optimization procedure.

In this paper, we leverage the concept of Anchor Graphs [21] to capture the neighborhood structure inherent in a given massive dataset, and then formulate a graph-based hashing model over the whole dataset. This model hinges on a novel discrete optimization procedure to achieve nearly balanced and uncorrelated hash bits, where the binary constraints are explicitly imposed and handled. To tackle the discrete optimization in a computationally tractable manner, we propose an alternating maximization algorithm which consists of solving two interesting subproblems. For brevity, we refer to the proposed discrete optimization based graph hashing method as Discrete Graph Hashing (DGH). Through extensive experiments carried out on four benchmark datasets with sizes up to one million, we show that DGH consistently obtains higher search accuracy than state-of-the-art unsupervised hashing methods, especially when relatively longer codes are learned.

2 Discrete Graph Hashing

First we define a few main notations used throughout this paper: sgn(x) denotes the sign function, which returns 1 for x > 0 and −1 otherwise; I_n denotes the n×n identity matrix; 1 denotes a vector of all 1s; 0 denotes a vector or matrix of all 0s; diag(c) represents a diagonal matrix with the elements of vector c as its diagonal entries; tr(·), ∥·∥_F, ∥·∥_1, and ⟨·,·⟩ denote the matrix trace, the matrix Frobenius norm, the ℓ1 norm, and the inner product, respectively.

Anchor Graphs. 
In the discrete graph hashing model, we need to choose a neighborhood graph that can easily scale to massive data points. For simplicity and efficiency, we choose Anchor Graphs [21], which involve no special indexing scheme but still have linear construction time in the number of data points. An anchor graph uses a small set of m points (called anchors), U = {u_j ∈ R^d}_{j=1}^m, to approximate the neighborhood structure underlying the input dataset X = {x_i ∈ R^d}_{i=1}^n. Affinities (or similarities) of all n data points are computed with respect to these m anchors in linear time O(dmn), where m ≪ n. The true affinity matrix A^o ∈ R^{n×n} is then approximated by using these affinities.

Specifically, an anchor graph leverages a nonlinear data-to-anchor mapping (R^d ↦ R^m)

z(x) = [δ_1 exp(−D²(x, u_1)/t), ···, δ_m exp(−D²(x, u_m)/t)]ᵀ / M,

where δ_j ∈ {1, 0} and δ_j = 1 if and only if anchor u_j is one of the s ≪ m closest anchors of x in U according to some distance function D(·,·) (e.g., ℓ2 distance), t > 0 is the bandwidth parameter, and M = Σ_{j=1}^m δ_j exp(−D²(x, u_j)/t), leading to ∥z(x)∥_1 = 1. Then, the anchor graph builds a data-to-anchor affinity matrix Z = [z(x_1), ···, z(x_n)]ᵀ ∈ R^{n×m} that is highly sparse. Finally, the anchor graph gives a data-to-data affinity matrix as A = ZΛ⁻¹Zᵀ ∈ R^{n×n}, where Λ = diag(Zᵀ1) ∈ R^{m×m}. Such an affinity matrix empirically approximates the true affinity matrix A^o, and has two nice characteristics: 1) A is a low-rank positive semidefinite (PSD) matrix with rank at most m, so the anchor graph does not need to compute it explicitly but instead keeps its low-rank form and only saves Z and Λ in memory; 2) A has unit row and column sums, so the resulting graph Laplacian is L = I_n − A. These two characteristics permit convenient and efficient matrix manipulations on A, as shown later on. We also define an anchor graph affinity function as A(x, x′) = zᵀ(x)Λ⁻¹z(x′), in which (x, x′) is any pair of points in R^d.

Learning Model. The purpose of unsupervised hashing is to learn to map each data point x_i to an r-bit binary hash code b(x_i) ∈ {1,−1}^r given a training dataset X = {x_i}_{i=1}^n. For simplicity, let us denote b(x_i) as b_i, and the corresponding code matrix as B = [b_1, ···, b_n]ᵀ ∈ {1,−1}^{n×r}. The standard graph-based hashing framework, proposed by [38], aims to learn the hash codes such that neighbors in the input space have small Hamming distances in the code space. This is formulated as:

min_B (1/2) Σ_{i,j=1}^n ∥b_i − b_j∥² A^o_{ij} = tr(Bᵀ L^o B), s.t. B ∈ {±1}^{n×r}, 1ᵀB = 0, BᵀB = nI_r,  (1)

where L^o is the graph Laplacian based on the true affinity matrix A^o (see footnote 1). The constraint 1ᵀB = 0 is imposed to maximize the information from each hash bit, which occurs when each bit leads to a balanced partitioning of the dataset X. The other constraint BᵀB = nI_r makes the r bits mutually uncorrelated to minimize the redundancy among these bits. 
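As a concrete illustration, the anchor-graph construction above can be sketched in a few lines of NumPy. This is a simplified sketch, not the authors' implementation: Z is kept dense here for clarity (the paper stores it as a sparse n×m matrix), and the helper name `anchor_graph_affinity` and the toy data are our own.

```python
import numpy as np

def anchor_graph_affinity(X, U, s=3, t=1.0):
    """Build the data-to-anchor affinity matrix Z (rows sum to 1) and Lambda^{-1}.
    X: (n, d) data points, U: (m, d) anchors, s: number of closest anchors kept."""
    # squared Euclidean distances from every data point to every anchor
    D2 = ((X[:, None, :] - U[None, :, :]) ** 2).sum(-1)   # (n, m)
    Z = np.zeros_like(D2)
    for i in range(len(X)):
        idx = np.argsort(D2[i])[:s]                       # s closest anchors of x_i
        w = np.exp(-D2[i, idx] / t)
        Z[i, idx] = w / w.sum()                           # normalize: ||z(x)||_1 = 1
    Lam_inv = np.diag(1.0 / Z.sum(axis=0))                # Lambda = diag(Z^T 1)
    return Z, Lam_inv                                     # A = Z Lam_inv Z^T (low rank)

# toy usage; in practice A is never formed explicitly, only Z and Lambda are kept
X = np.random.RandomState(0).randn(100, 8)
U = X[:10]                                                # pretend anchors
Z, Lam_inv = anchor_graph_affinity(X, U)
A = Z @ Lam_inv @ Z.T
```

By construction A is symmetric PSD with unit row and column sums, which is exactly what makes the graph Laplacian L = I_n − A and the later O(n) matrix manipulations cheap.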
Problem (1) is NP-hard, and Weiss et al. [38] therefore solved a relaxed problem by dropping the discrete (binary) constraint B ∈ {±1}^{n×r} and making a simplifying assumption that the data are distributed uniformly. We leverage the anchor graph to replace L^o by the anchor graph Laplacian L = I_n − A. Hence, the objective in Eq. (1) can be rewritten as a maximization problem:

max_B tr(BᵀAB), s.t. B ∈ {1,−1}^{n×r}, 1ᵀB = 0, BᵀB = nI_r.  (2)

In [23], the solution to this problem is obtained via spectral relaxation [33], in which B is relaxed to a matrix of reals, followed by a thresholding step (threshold 0) that yields the final discrete B. Unfortunately, this procedure may result in poor codes due to amplification of the relaxation error as the code length r increases. To this end, we propose to directly solve for the binary codes B without resorting to such error-prone relaxations.

Let us define the set Ω = {Y ∈ R^{n×r} | 1ᵀY = 0, YᵀY = nI_r}. Then we formulate a more general graph hashing framework which softens the last two hard constraints in Eq. (2) as:

max_B tr(BᵀAB) − (ρ/2) dist²(B, Ω), s.t. B ∈ {1,−1}^{n×r},  (3)

where dist(B, Ω) = min_{Y∈Ω} ∥B − Y∥_F measures the distance from any matrix B to the set Ω, and ρ ≥ 0 is a tuning parameter. If problem (2) is feasible, we can enforce dist(B, Ω) = 0 in Eq. (3) by imposing a very large ρ, thereby turning problem (3) into problem (2). However, in Eq. 
(3) we allow a certain discrepancy between B and Ω (controlled by ρ), which makes problem (3) more flexible. Since tr(BᵀB) = tr(YᵀY) = nr, problem (3) can be equivalently transformed into the following problem:

max_{B,Y} Q(B, Y) := tr(BᵀAB) + ρ tr(BᵀY), s.t. B ∈ {1,−1}^{n×r}, Y ∈ R^{n×r}, 1ᵀY = 0, YᵀY = nI_r.  (4)

We call the code learning model formulated in Eq. (4) Discrete Graph Hashing (DGH). Because concurrently imposing B ∈ {±1}^{n×r} and B ∈ Ω would make graph hashing computationally intractable, DGH does not pursue the latter constraint but penalizes the distance from the target code matrix B to Ω. Unlike previous graph hashing methods, which discard the discrete constraint B ∈ {±1}^{n×r} to obtain a continuously relaxed B, our DGH model enforces this constraint to directly achieve a discrete B. As a result, DGH yields nearly balanced and uncorrelated binary bits. In Section 3, we will propose a computationally tractable optimization algorithm to solve this discrete programming problem in Eq. (4).

¹The spectral hashing method in [38] did not compute the true affinity matrix A^o because of the scalability issue, but instead used a complete graph built over 1D PCA embeddings.

Algorithm 1 Signed Gradient Method (SGM) for B-Subproblem
Input: B(0) ∈ {1,−1}^{n×r} and Y ∈ Ω.
j := 0; repeat B(j+1) := sgn(C(2AB(j) + ρY, B(j))), j := j + 1, until B(j) converges.
Output: B = B(j).

Out-of-Sample Hashing. 
Since a hashing scheme should be able to generate the hash code for any data point q ∈ R^d beyond the points in the training set X, here we address the out-of-sample extension of the DGH model. Similar to the objective in Eq. (1), we minimize the Hamming distances between a novel data point q and its neighbors (revealed by the affinity function A) in X as

b(q) ∈ arg min_{b(q)∈{±1}^r} (1/2) Σ_{i=1}^n ∥b(q) − b*_i∥² A(q, x_i) = arg max_{b(q)∈{±1}^r} ⟨b(q), (B*)ᵀZΛ⁻¹z(q)⟩,

where B* = [b*_1, ···, b*_n]ᵀ is the solution of problem (4). After pre-computing the matrix W = (B*)ᵀZΛ⁻¹ ∈ R^{r×m} in the training phase, one can compute the hash code b*(q) = sgn(Wz(q)) for any novel data point q very efficiently.

3 Alternating Maximization

The graph hashing problem in Eq. (4) is essentially a nonlinear mixed-integer program involving both discrete variables in B and continuous variables in Y. It turns out that problem (4) is generally NP-hard and also difficult to approximate. Specifically, since the Max-Cut problem is a special case of problem (4) with ρ = 0 and r = 1, no polynomial-time algorithm can achieve the global optimum, or even an approximate solution whose objective value exceeds 16/17 of the global maximum, unless P = NP [11]. 
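Returning to the out-of-sample rule b*(q) = sgn(Wz(q)) above, the following minimal sketch shows how cheap query encoding is once W is cached. The toy shapes and the helper name `hash_query` are our own; z_q stands for the query's data-to-anchor affinity vector z(q).

```python
import numpy as np

# hypothetical toy shapes: n training codes B (n, r), anchor weights Z (n, m)
rng = np.random.RandomState(1)
n, m, r = 50, 6, 8
B = np.sign(rng.randn(n, r))                 # stand-in for the solved codes B*
Z = rng.rand(n, m)
Z /= Z.sum(axis=1, keepdims=True)            # rows of Z sum to 1, as in the paper
Lam_inv = np.diag(1.0 / Z.sum(axis=0))       # Lambda^{-1}

W = B.T @ Z @ Lam_inv                        # (r, m), precomputed once at training time

def hash_query(z_q):
    """Out-of-sample code b*(q) = sgn(W z(q)); z_q is the (m,) anchor-affinity
    vector of the query. Encoding costs only O(dm + sr) per query."""
    return np.where(W @ z_q >= 0, 1, -1)

code = hash_query(Z[0])                      # encode a point from its own z-vector
```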
To this end, we propose a tractable alternating maximization algorithm to optimize problem (4), leading to good hash codes which are demonstrated to exhibit superior search performance through extensive experiments conducted in Section 5. The proposed algorithm proceeds by alternately solving the B-subproblem

max_{B∈{±1}^{n×r}} f(B) := tr(BᵀAB) + ρ tr(YᵀB),  (5)

and the Y-subproblem

max_{Y∈R^{n×r}} tr(BᵀY), s.t. 1ᵀY = 0, YᵀY = nI_r.  (6)

In what follows, we propose an iterative ascent procedure called the Signed Gradient Method for subproblem (5) and derive a closed-form optimal solution to subproblem (6). As we will show, our alternating algorithm is provably convergent. Schemes for choosing good initializations are also discussed. Due to the space limit, all the proofs of lemmas, theorems and propositions presented in this section are placed in the supplemental material.

3.1 B-Subproblem

We tackle subproblem (5) with a simple iterative ascent procedure described in Algorithm 1. In the j-th iteration, we define a local function f̂_j(B) that linearizes f(B) at the point B(j), and employ f̂_j(B) as a surrogate of f(B) for discrete optimization. Given B(j), the next discrete point is derived as B(j+1) ∈ arg max_{B∈{±1}^{n×r}} f̂_j(B) := f(B(j)) + ⟨∇f(B(j)), B − B(j)⟩. Note that since ∇f(B(j)) may include zero entries, multiple solutions for B(j+1) could exist. 
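To make the ascent step concrete, here is a minimal NumPy sketch of this procedure. It is our own simplified rendering, not the reference implementation: the function name `sgm`, the iteration cap, and the toy PSD matrix standing in for the anchor-graph affinity are assumptions; wherever the gradient 2AB + ρY vanishes, the previous bit is kept, matching the element-wise tie-breaking rule C used in Algorithm 1.

```python
import numpy as np

def sgm(A, Y, rho, B0, iters=50):
    """Signed Gradient Method sketch for subproblem (5):
    B(j+1) = sgn(C(2*A*B(j) + rho*Y, B(j))), keeping old bits where the
    gradient is exactly zero."""
    B = B0.copy()
    for _ in range(iters):
        G = 2 * A @ B + rho * Y
        B_new = np.where(G > 0, 1.0, np.where(G < 0, -1.0, B))
        if np.array_equal(B_new, B):
            break                            # fixed point: the ascent has converged
        B = B_new
    return B

# toy usage: any PSD matrix A makes f(B) = tr(B'AB) + rho*tr(Y'B) non-decreasing
rng = np.random.RandomState(0)
n, r, rho = 30, 4, 1.0
M = rng.randn(n, n)
A = M @ M.T / n                              # PSD stand-in for the affinity matrix
Y = np.linalg.qr(rng.randn(n, r))[0] * np.sqrt(n)
B0 = np.sign(rng.randn(n, r))
B0[B0 == 0] = 1
f = lambda B: np.trace(B.T @ A @ B) + rho * np.trace(Y.T @ B)
B = sgm(A, Y, rho, B0)
```

The PSD property of A is what makes f convex, so each linearized step can only increase f, mirroring the monotone-ascent argument in the text.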
To avoid this ambiguity, we introduce the function C(x, y), which returns x if x ≠ 0 and y if x = 0, to specify the following update:

B(j+1) := sgn(C(∇f(B(j)), B(j))) = sgn(C(2AB(j) + ρY, B(j))),  (7)

in which C is applied in an element-wise manner, so no update is carried out on the entries where ∇f(B(j)) vanishes. Due to the PSD property of the matrix A, f is a convex function and thus f(B) ≥ f̂_j(B) for any B. Taking advantage of the fact f(B(j+1)) ≥ f̂_j(B(j+1)) ≥ f̂_j(B(j)) ≡ f(B(j)), Lemma 1 ensures that both the sequence of cost values {f(B(j))} and the sequence of iterates {B(j)} converge.

Algorithm 2 Discrete Graph Hashing (DGH)
Input: B0 ∈ {1,−1}^{n×r} and Y0 ∈ Ω.
k := 0; repeat Bk+1 := SGM(Bk, Yk), Yk+1 ∈ Φ(JBk+1), k := k + 1, until Q(Bk, Yk) converges.
Output: B* = Bk, Y* = Yk.

Lemma 1. If {B(j)} is the sequence of iterates produced by Algorithm 1, then f(B(j+1)) ≥ f(B(j)) holds for any integer j ≥ 0, and both {f(B(j))} and {B(j)} converge.

Our idea of optimizing a proxy function f̂_j(B) can be considered a special case of the majorization methodology exploited in the field of optimization. The majorization method typically deals with a generic constrained optimization problem: min g(x), s.t. 
x ∈ F, where g : R^n ↦ R is a continuous function and F ⊆ R^n is a compact set. The majorization method starts with a feasible point x_0 ∈ F, and then proceeds by setting x_{j+1} as a minimizer of ĝ_j(x) over F, where ĝ_j, satisfying ĝ_j(x_j) = g(x_j) and ĝ_j(x) ≥ g(x) ∀x ∈ F, is called a majorization function of g at x_j. Specifically, in our scenario, problem (5) is equivalent to min_{B∈{±1}^{n×r}} −f(B), and the linear surrogate −f̂_j is a majorization function of −f at the point B(j). The majorization method was first systematically introduced by [5] to deal with multidimensional scaling problems, although the EM algorithm [7], proposed at the same time, also falls into the framework of majorization methodology. Since then, the majorization method has played an important role in various statistics problems such as multidimensional data analysis [12], hyperparameter learning [8], conditional random fields and latent likelihoods [13], and so on.

3.2 Y-Subproblem

An analytical solution to subproblem (6) can be obtained with the aid of the centering matrix J = I_n − (1/n)11ᵀ. Write the singular value decomposition (SVD) of JB as JB = UΣVᵀ = Σ_{k=1}^{r′} σ_k u_k v_kᵀ, where r′ ≤ r is the rank of JB, σ_1, ···, σ_{r′} are the positive singular values, and U = [u_1, ···, u_{r′}] and V = [v_1, ···, v_{r′}] contain the left- and right-singular vectors, respectively. Then, by employing a Gram-Schmidt process, one can easily construct matrices Ū ∈ R^{n×(r−r′)} and V̄ ∈ R^{r×(r−r′)} such that Ūᵀ Ū = I_{r−r′}, [U 1]ᵀ Ū = 0, and V̄ᵀ V̄ = I_{r−r′}, Vᵀ V̄ = 0 (see footnote 2). Now we are ready to characterize a closed-form solution of the Y-subproblem by Lemma 2.

Lemma 2. 
Y* = √n [U Ū][V V̄]ᵀ is an optimal solution to the Y-subproblem in Eq. (6).

For notational convenience, we define the set of all matrices of the form √n [U Ū][V V̄]ᵀ as Φ(JB). Lemma 2 reveals that any matrix in Φ(JB) is an optimal solution to subproblem (6). In practice, to compute such an optimal Y*, we perform the eigendecomposition of the small r × r matrix BᵀJB to obtain BᵀJB = [V V̄] diag(Σ², 0) [V V̄]ᵀ, which gives V, V̄, and Σ, and immediately leads to U = JBVΣ⁻¹. The matrix Ū is initially set to a random matrix, followed by the aforementioned Gram-Schmidt orthogonalization. It can be seen that Y* is uniquely optimal when r′ = r (i.e., JB has full column rank).

3.3 DGH Algorithm

The proposed alternating maximization algorithm, also referred to as Discrete Graph Hashing (DGH), for solving the raw problem in Eq. (4) is summarized in Algorithm 2, in which we introduce SGM(·,·) to represent the functionality of Algorithm 1. The convergence of Algorithm 2 is guaranteed by Theorem 1, whose proof is based on the nature of the proposed alternating maximization procedure, which always generates a monotonically non-decreasing and bounded sequence.

Theorem 1. If {(Bk, Yk)} is the sequence generated by Algorithm 2, then Q(Bk+1, Yk+1) ≥ Q(Bk, Yk) holds for any integer k ≥ 0, and {Q(Bk, Yk)} converges starting from any feasible initial point (B0, Y0).

Initialization. Since the DGH algorithm deals with discrete and non-convex optimization, a good choice of the initial point (B0, Y0) is vital. Here we suggest two different initial points, both feasible for problem (4).

²Note that when r′ = r, Ū and V̄ are empty (they have r − r′ = 0 columns).

Let us perform the eigendecomposition of A to obtain A = PΘPᵀ = Σ_{k=1}^m θ_k p_k p_kᵀ, where θ_1, ···, θ_m are the eigenvalues arranged in non-increasing order, and p_1, ···, p_m are the corresponding normalized eigenvectors. We write Θ = diag(θ_1, ···, θ_m) and P = [p_1, ···, p_m]. Note that θ_1 = 1 and p_1 = 1/√n. The first initialization used is Y0 = √n H, B0 = sgn(H), where H = [p_2, ···, p_{r+1}] ∈ R^{n×r}. The initial codes B0 were used as the final codes by [23]. Alternatively, Y0 can be allowed to consist of orthonormal columns within the column space of H, i.e., Y0 = √n HR for some orthogonal matrix R ∈ R^{r×r}. We can obtain R along with B0 by solving a new discrete optimization problem:

max_{R,B0} tr(RᵀHᵀAB0), s.t. R ∈ R^{r×r}, RRᵀ = I_r, B0 ∈ {1,−1}^{n×r},  (8)

which is motivated by the proposition below.

Proposition 1. 
For any orthogonal matrix R ∈ R^{r×r} and any binary matrix B ∈ {1,−1}^{n×r}, we have tr(BᵀAB) ≥ (1/r) tr²(RᵀHᵀAB).

Proposition 1 implies that the optimization in Eq. (8) can be interpreted as maximizing a lower bound of tr(BᵀAB), which is the first term of the objective Q(B, Y) in the original problem (4). We again exploit an alternating maximization procedure to solve problem (8). Noticing that AH = HΘ̂, where Θ̂ = diag(θ_2, ···, θ_{r+1}), the objective in Eq. (8) is equal to tr(RᵀΘ̂HᵀB0). The alternating procedure starts with R0 = I_r, and then makes the simple updates B0^j := sgn(HΘ̂R_j), R_{j+1} := Ũ_j Ṽ_jᵀ for j = 0, 1, 2, ···, where Ũ_j, Ṽ_j ∈ R^{r×r} stem from the full SVD Ũ_j Σ̃_j Ṽ_jᵀ of the matrix Θ̂HᵀB0^j. When convergence is reached, we obtain the optimized rotation R that yields the second initialization Y0 = √n HR, B0 = sgn(HΘ̂R).

Empirically, we find that the second initialization typically gives a better objective value Q(B0, Y0) at the start than the first one, as it aims to maximize the lower bound of the first term in the objective Q. We also observe that the second initialization often results in a higher objective value Q(B*, Y*) at convergence (Figs. 1-2 in the supplemental material show convergence curves of Q starting from the two initial points). We call DGH using the first and second initializations DGH-I and DGH-R, respectively. Regarding the convergence property, we would like to point out that since the DGH algorithm (Algorithm 2) works on a mixed-integer objective, it is hard to quantify the convergence to a local optimum of the objective function Q. Nevertheless, this does not affect the performance of our algorithm in practice. 
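The DGH-R initialization alternation just described can be sketched as follows. This is a hedged illustration, assuming H holds orthonormal graph eigenvectors orthogonal to the all-ones vector; the helper name `dgh_r_init`, the iteration cap, and the toy H and eigenvalues are our own choices.

```python
import numpy as np

def dgh_r_init(H, theta, iters=30):
    """Sketch of the Eq. (8) alternation: B0 = sgn(H @ diag(theta) @ R), and R
    is refreshed from the full SVD of diag(theta) @ H.T @ B0.
    H: (n, r) eigenvectors p_2..p_{r+1} of A; theta: (r,) their eigenvalues."""
    n, r = H.shape
    Theta = np.diag(theta)
    R = np.eye(r)
    for _ in range(iters):
        B0 = np.sign(H @ Theta @ R)
        B0[B0 == 0] = 1                               # arbitrary but fixed sgn(0)
        Uj, _, Vtj = np.linalg.svd(Theta @ H.T @ B0)  # SVD of a small r x r matrix
        R = Uj @ Vtj                                  # best rotation for current B0
    B0 = np.sign(H @ Theta @ R)
    B0[B0 == 0] = 1
    return B0, np.sqrt(n) * H @ R                     # (B0, Y0), with Y0 feasible

# toy usage with a hypothetical H: orthonormal columns orthogonal to 1
rng = np.random.RandomState(3)
n, r = 60, 4
G = rng.randn(n, r)
G -= G.mean(axis=0)                                   # columns orthogonal to 1
H, _ = np.linalg.qr(G)
theta = np.linspace(0.9, 0.6, r)                      # pretend eigenvalues of A
B0, Y0 = dgh_r_init(H, theta)
```

By construction Y0 = √n HR satisfies 1ᵀY0 = 0 and Y0ᵀY0 = nI_r, so (B0, Y0) is a feasible starting point for Algorithm 2.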
In our experiments in Section 5, we consistently find a convergent sequence {(Bk, Yk)} arriving at a good objective value when started from the suggested initializations.

4 Discussions

Here we analyze the space and time complexities of DGH-I/DGH-R. The space complexity is O((d + s + r)n) in the training stage and O(rn) for storing the hash codes in the test stage. Let T_B and T_G be the budget iteration numbers for optimizing the B-subproblem and the whole DGH problem, respectively. Then, the training time complexity of DGH-I is O(dmn + m²n + (mT_B + sT_B + r)rT_Gn), and that of DGH-R is O(dmn + m²n + (mT_B + sT_B + r)rT_Gn + r²T_Rn), where T_R is the budget iteration number for seeking the initial point via Eq. (8). Note that the time for finding the anchors and building the anchor graph is O(dmn), which is included in the above training times. The test time (referring to encoding a query into an r-bit code) is O(dm + sr) for both. In our experiments, we fix m, s, T_B, T_G, T_R to constants independent of the dataset size n, and make r ≤ 128. Thus, DGH-I/DGH-R enjoy linear training time and constant test time. It is worth mentioning again that the low-rank PSD property of the anchor graph affinity matrix A is advantageous for training DGH, permitting efficient matrix computations in O(n) time, such as the eigendecomposition of A (encountered in the initializations) and multiplying A with B (encountered in solving the B-subproblem with Algorithm 1).

It is interesting to point out that DGH falls into the asymmetric hashing category [26], in the sense that hash codes are generated differently for samples within the dataset and queries outside the dataset. Unlike most existing hashing techniques, DGH directly solves the hash codes B* of the training samples via the proposed discrete optimization in Eq. (4) without relying on any explicit or predefined hash functions. 
On the other hand, the hash code for any query q is induced from the solved codes B*, leading to a hash function b*(q) = sgn(Wz(q)) parameterized by the matrix W, which was computed using B*. While the hashing mechanisms for producing B* and b*(q) are distinct, they are tightly coupled and prone to be adaptive to specific datasets. The flexibility of the asymmetric hashing nature of DGH is validated through the experiments shown in the next section.

Figure 1: Hash lookup success rates for different hashing techniques (panels: CIFAR-10, SUN397, YouTube Faces, Tiny-1M; x-axis: number of bits). DGH tends to achieve nearly 100% success rates even for longer code lengths.

Figure 2: Mean F-measures of hash lookup within Hamming radius 2 for different techniques (same four datasets; compared methods: LSH, KLSH, ITQ, IsoH, SH, IMH, 1-AGH, 2-AGH, BRE, DGH-I, DGH-R). DGH tends to retain good recall even for longer codes, leading to much higher F-measures than the others.

5 Experiments

We conduct large-scale similarity search experiments on four benchmark datasets: CIFAR-10 [15], SUN397 [40], YouTube Faces [39], and Tiny-1M. CIFAR-10 is a labeled subset of the 80 Million Tiny Images dataset [35], which consists of 60K images from ten object categories, with each image represented by a 512-dimensional GIST feature vector [29]. SUN397 contains about 108K images from 397 scene categories, where each image is represented by a 1,600-dimensional feature vector extracted by PCA from 12,288-dimensional Deep Convolutional Activation Features [10]. The raw YouTube Faces dataset contains 1,595 different people, from which we choose the 340 people who each have at least 500 images, forming a subset of 370,319 face images; each face image is represented by a 1,770-dimensional LBP feature vector [1]. Tiny-1M is a one-million subset of the 80M tiny images, where each image is represented by a 384-dimensional GIST vector. 
In CIFAR-10, 100 images are sampled uniformly at random from each object category to form a separate test (query) set of 1K images; in SUN397, 100 images are sampled uniformly at random from each of the 18 largest scene categories to form a test set of 1.8K images; in YouTube Faces, the test set includes 3.8K face images evenly sampled from the 38 people who each have more than 2K faces; in Tiny-1M, a separate subset of 5K images randomly sampled from the 80M images is used as the test set. In the first three datasets, groundtruth neighbors are defined based on whether two samples share the same class label; in Tiny-1M, which does not have full annotations, we define the groundtruth neighbors of a given query as the samples among the top 2% of ℓ2 distances from the query in the 1M training set, so each query has 20K groundtruth neighbors.
We evaluate twelve unsupervised hashing methods: two randomized methods, LSH [2] and Kernelized LSH (KLSH) [17]; two linear projection based methods, Iterative Quantization (ITQ) [9] and Isotropic Hashing (IsoH) [14]; two spectral methods, Spectral Hashing (SH) [38] and its weighted version MDSH [37]; one manifold based method, Inductive Manifold Hashing (IMH) [32]; two existing graph-based methods, One-Layer Anchor Graph Hashing (1-AGH) and Two-Layer Anchor Graph Hashing (2-AGH) [23]; one distance preservation method, Binary Reconstructive Embeddings (BRE) [16] (unsupervised version); and our proposed discrete optimization based methods, DGH-I and DGH-R. We use the publicly available code of the competing methods and follow the conventional parameter settings therein. In particular, we use the Gaussian kernel and 300 randomly sampled exemplars (anchors) to run KLSH; IMH, 1-AGH, 2-AGH, DGH-I and DGH-R also use m = 300 anchors (obtained by K-means clustering with 5 iterations) for fair comparison. This choice of m gives a good trade-off between hashing speed and performance.
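To make the anchor-based machinery concrete, here is a minimal sketch of the out-of-sample encoding b*(q) = sgn(Wz(q)) from Section 4. The random data, the anchor set (which the paper obtains by K-means clustering rather than randomly), the bandwidth t, and the projection matrix W (which DGH derives from the solved codes B*) are all illustrative placeholders, not the paper's actual settings:

```python
import numpy as np

def anchor_embedding(q, anchors, s=3, t=1.0):
    """Sparse anchor-based embedding z(q): Gaussian kernel weights on the
    s nearest anchors (normalized to sum to one), zeros elsewhere."""
    d2 = np.sum((anchors - q) ** 2, axis=1)   # squared l2 distance to each anchor
    nearest = np.argsort(d2)[:s]              # keep only the s closest anchors
    z = np.zeros(len(anchors))
    z[nearest] = np.exp(-d2[nearest] / t)
    z[nearest] /= z[nearest].sum()
    return z

def hash_query(q, anchors, W, s=3, t=1.0):
    """Asymmetric out-of-sample extension: b(q) = sgn(W z(q))."""
    b = np.sign(W @ anchor_embedding(q, anchors, s, t))
    b[b == 0] = 1                             # map sgn(0) to +1 for definiteness
    return b.astype(np.int8)

rng = np.random.default_rng(0)
anchors = rng.normal(size=(300, 8))           # m = 300 anchors (toy 8-dim features)
W = rng.normal(size=(16, 300))                # r = 16 bits; W would come from B* in DGH
q = rng.normal(size=8)
code = hash_query(q, anchors, W)              # a 16-dimensional +/-1 code
```

Only the query side is shown here; the database codes B* come directly from the discrete optimization and never pass through this function, which is exactly the asymmetry discussed in Section 4.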
For 1-AGH, 2-AGH, DGH-I and DGH-R, which all use anchor graphs, we adopt the same construction parameters s, t on each dataset (s = 3, and t is tuned following AGH), and use the ℓ2 distance as D(·). For BRE, we uniformly randomly sample 1K and 2K training samples to train the distance preservations on CIFAR-10 & SUN397 and on YouTube Faces & Tiny-1M, respectively. For DGH-I and DGH-R, we set the penalty parameter ρ to the same value in [0.1, 5] on each dataset, and fix T_R = 100, T_B = 300, T_G = 20.

Table 1: Hamming ranking performance on YouTube Faces and Tiny-1M. r denotes the number of hash bits used in the hashing methods. All training and test times are in seconds and reported at r = 128.

        |   YouTube Faces (Mean Precision / Top-2K)   |      Tiny-1M (Mean Precision / Top-20K)
Method  |  r=48    r=96    r=128   TrainTime TestTime |  r=48    r=96    r=128   TrainTime TestTime
--------+---------------------------------------------+---------------------------------------------
ℓ2 Scan |  0.7591  0.7591  0.7591      –        –     |  1       1       1           –        –
LSH     |  0.0830  0.1005  0.1061      6.1   1.8e-5   |  0.1155  0.1324  0.1766      6.4   1.0e-5
KLSH    |  0.3982  0.5210  0.5871     20.7   4.8e-5   |  0.3054  0.4105  0.4705     16.1   4.6e-5
ITQ     |  0.7017  0.7493  0.7562    297.3   1.8e-5   |  0.3925  0.4726  0.5052    169.0   1.0e-5
IsoH    |  0.6093  0.6962  0.7058     13.5   1.8e-5   |  0.3896  0.4816  0.5161     73.6   1.0e-5
SH      |  0.5897  0.6655  0.6736     61.4   2.0e-4   |  0.1857  0.1923  0.2079    108.9   1.6e-4
MDSH    |  0.6110  0.6752  0.6795    193.6   4.9e-5   |  0.3312  0.3878  0.3955    118.8   2.8e-5
IMH     |  0.3150  0.3641  0.3889    139.3   2.3e-5   |  0.2257  0.2497  0.2557     92.1   2.7e-5
1-AGH   |  0.7138  0.7571  0.7646    141.4   2.1e-5   |  0.4061  0.4117  0.4107     84.1   3.4e-5
2-AGH   |  0.6727  0.7377  0.7521    272.5   3.5e-5   |  0.3925  0.4099  0.4152     94.7   4.7e-5
BRE     |  0.5564  0.6238  0.6483   8419.0   9.0e-5   |  0.3943  0.4836  0.5218  10372.0   8.8e-5
DGH-I   |  0.7086  0.7644  0.7750   1769.4   2.1e-5   |  0.4045  0.4865  0.5178    402.6   3.3e-5
DGH-R   |  0.7245  0.7672  0.7805   2793.4   2.1e-5   |  0.4208  0.5006  0.5358    408.9   3.3e-5

We employ two widely used search procedures, hash lookup and Hamming ranking, with 8 to 128 hash bits for evaluation. The Hamming ranking procedure ranks the dataset samples according to their Hamming distances to a given query, while the hash lookup procedure finds all the points within a certain Hamming radius of the query. Since hash lookup can be achieved in constant time by using a single hash table, it is the main focus of this work. We carry out hash lookup within a Hamming ball of radius 2 centered on each query, and report the search recall and F-measure averaged over all queries for each dataset. Note that if hash lookup fails to find any neighbors within the given radius for a query, we call it a failed query and assign it zero recall and F-measure. To quantify the failed queries, we report the hash lookup success rate, which gives the proportion of the queries for which at least one neighbor is retrieved. For Hamming ranking, mean average precision (MAP) and mean precision of top-retrieved samples are computed.
The hash lookup results are shown in Figs. 1-2. DGH-I/DGH-R achieve the highest (close to 100%) hash lookup success rates, and DGH-I is slightly better than DGH-R. The reason is that the asymmetric hashing scheme exploited by DGH-I/DGH-R poses a tight linkage between queries and database samples, providing a more adaptive out-of-sample extension than the traditional symmetric hashing schemes used by the competing methods. Also, DGH-R achieves the highest F-measure except on CIFAR-10, where DGH-I is highest and DGH-R is second.
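The evaluation protocol above can be sketched as follows. This is a simplified illustration (not the authors' evaluation code), assuming ±1 code matrices and a boolean ground-truth matrix; all variable names are illustrative:

```python
import numpy as np

def lookup_metrics(query_codes, db_codes, gt, radius=2):
    """Hash lookup within a Hamming ball: mean F-measure, mean recall, and
    success rate (fraction of queries retrieving at least one point).
    query_codes and db_codes are +/-1 matrices; gt[i, j] says whether
    database item j is a true neighbor of query i. Failed queries score
    zero recall and F-measure, as in the protocol above."""
    r = db_codes.shape[1]
    f1s, recs, hits = [], [], 0
    for q, labels in zip(query_codes, gt):
        ham = (r - db_codes @ q) / 2          # Hamming distance via +/-1 inner products
        retrieved = ham <= radius
        if not retrieved.any():               # failed query
            f1s.append(0.0)
            recs.append(0.0)
            continue
        hits += 1
        tp = np.logical_and(retrieved, labels).sum()
        prec = tp / retrieved.sum()
        rec = tp / max(labels.sum(), 1)
        f1s.append(2 * prec * rec / (prec + rec) if tp else 0.0)
        recs.append(rec)
    return np.mean(f1s), np.mean(recs), hits / len(query_codes)

def average_precision(ranked_labels):
    """Average precision of one Hamming-ranked boolean label list;
    averaging this over all queries gives MAP."""
    ranked_labels = np.asarray(ranked_labels, dtype=bool)
    if not ranked_labels.any():
        return 0.0
    cum_hits = np.cumsum(ranked_labels)
    precisions = cum_hits / (np.arange(len(ranked_labels)) + 1)
    return precisions[ranked_labels].mean()

queries = np.array([[1, 1, 1, 1]])
db = np.array([[1, 1, 1, 1], [1, 1, -1, -1], [-1, -1, -1, -1]])
gt = np.array([[True, False, False]])
f1, recall, success = lookup_metrics(queries, db, gt, radius=1)
# -> f1 = recall = success = 1.0: the radius-1 ball retrieves exactly the true neighbor
```

The identity used for `ham` holds because for r-bit ±1 codes the inner product equals r minus twice the number of disagreeing bits.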
The F-measures of KLSH, IsoH, SH and BRE deteriorate quickly, dropping to very poor values (< 0.05) when r ≥ 48 due to poor recall³. Although IMH achieves good hash lookup success rates, its F-measures are much lower than those of DGH-I/DGH-R due to lower precision. MDSH produces the same hash bits as SH, and is therefore not included in the hash lookup experiments. DGH-I/DGH-R employ the proposed discrete optimization to yield high-quality codes that preserve the local neighborhood of each data point within a small Hamming ball, and thus obtain much higher search accuracy in F-measure and recall than SH, 1-AGH and 2-AGH, which rely on relaxed optimizations and degrade drastically when r ≥ 48.
Finally, we report the Hamming ranking results in Table 1 and in the table in the supplementary material, which clearly show the superiority of DGH-R over the competing methods in MAP and mean precision; on the first three datasets, DGH-R even outperforms exhaustive ℓ2 scan. The training time of DGH-I/DGH-R is acceptable and faster than BRE's, and their test time (i.e., coding time, since hash lookup time is small enough to be ignored) is comparable with 1-AGH's.
6 Conclusion
This paper investigated a pervasive problem with most existing hashing methods: the discrete constraints are not enforced during optimization. Instead of resorting to error-prone continuous relaxations, we introduced a novel discrete optimization technique that learns the binary hash codes directly. To achieve this, we proposed a tractable alternating maximization algorithm which solves two interesting subproblems and provably converges. When working with a neighborhood graph, the proposed method yields high-quality codes that well preserve the neighborhood structure inherent in the data. Extensive experimental results on four large datasets with up to one million samples showed that our discrete optimization based graph hashing technique is highly competitive.

³The recall results are shown in Fig.
3 of the supplemental material; they indicate that DGH-I achieves the highest recall except on YouTube Faces, where DGH-R is highest and DGH-I is second.

References
[1] T. Ahonen, A. Hadid, and M. Pietikainen. Face description with local binary patterns: Application to face recognition. TPAMI, 28(12):2037-2041, 2006.
[2] A. Andoni and P. Indyk. Near-optimal hashing algorithms for approximate nearest neighbor in high dimensions. Communications of the ACM, 51(1):117-122, 2008.
[3] A. Z. Broder, M. Charikar, A. M. Frieze, and M. Mitzenmacher. Min-wise independent permutations. In Proc. STOC, 1998.
[4] M. Charikar. Similarity estimation techniques from rounding algorithms. In Proc. STOC, 2002.
[5] J. de Leeuw. Applications of convex analysis to multidimensional scaling. Recent Developments in Statistics, pages 133-146, 1977.
[6] T. Dean, M. A. Ruzon, M. Segal, J. Shlens, S. Vijayanarasimhan, and J. Yagnik. Fast, accurate detection of 100,000 object classes on a single machine. In Proc. CVPR, 2013.
[7] A. P. Dempster, N. M. Laird, and D. B. Rubin. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B, 39(1):1-38, 1977.
[8] C.-S. Foo, C. B. Do, and A. Y. Ng. A majorization-minimization algorithm for (multiple) hyperparameter learning. In Proc. ICML, 2009.
[9] Y. Gong, S. Lazebnik, A. Gordo, and F. Perronnin. Iterative quantization: A procrustean approach to learning binary codes for large-scale image retrieval. TPAMI, 35(12):2916-2929, 2013.
[10] Y. Gong, L. Wang, R. Guo, and S. Lazebnik. Multi-scale orderless pooling of deep convolutional activation features. In Proc. ECCV, 2014.
[11] J. Hastad. Some optimal inapproximability results. Journal of the ACM, 48(4):798-859, 2001.
[12] W. J. Heiser. Convergent computation by iterative majorization: theory and applications in multidimensional data analysis. Recent Advances in Descriptive Multivariate Analysis, pages 157-189, 1995.
[13] T. Jebara and A. Choromanska. Majorization for CRFs and latent likelihoods. In NIPS 25, 2012.
[14] W. Kong and W.-J. Li. Isotropic hashing. In NIPS 25, 2012.
[15] A. Krizhevsky. Learning multiple layers of features from tiny images. Technical report, 2009.
[16] B. Kulis and T. Darrell. Learning to hash with binary reconstructive embeddings. In NIPS 22, 2009.
[17] B. Kulis and K. Grauman. Kernelized locality-sensitive hashing. TPAMI, 34(6):1092-1104, 2012.
[18] P. Li and A. C. Konig. Theory and applications of b-bit minwise hashing. Communications of the ACM, 54(8):101-109, 2011.
[19] P. Li, A. Shrivastava, J. Moore, and A. C. Konig. Hashing algorithms for large-scale learning. In NIPS 24, 2011.
[20] X. Li, G. Lin, C. Shen, A. van den Hengel, and A. R. Dick. Learning hash functions using column generation. In Proc. ICML, 2013.
[21] W. Liu, J. He, and S.-F. Chang. Large graph construction for scalable semi-supervised learning. In Proc. ICML, 2010.
[22] W. Liu, J. Wang, R. Ji, Y.-G. Jiang, and S.-F. Chang. Supervised hashing with kernels. In Proc. CVPR, 2012.
[23] W. Liu, J. Wang, S. Kumar, and S.-F. Chang. Hashing with graphs. In Proc. ICML, 2011.
[24] W. Liu, J. Wang, Y. Mu, S. Kumar, and S.-F. Chang. Compact hyperplane hashing with bilinear functions. In Proc. ICML, 2012.
[25] Y. Mu, J. Shen, and S. Yan. Weakly-supervised hashing in kernel space. In Proc. CVPR, 2010.
[26] B. Neyshabur, P. Yadollahpour, Y. Makarychev, R. Salakhutdinov, and N. Srebro. The power of asymmetry in binary hashing. In NIPS 26, 2013.
[27] M. Norouzi and D. J. Fleet. Minimal loss hashing for compact binary codes. In Proc. ICML, 2011.
[28] M. Norouzi, D. J. Fleet, and R. Salakhutdinov. Hamming distance metric learning. In NIPS 25, 2012.
[29] A. Oliva and A. Torralba. Modeling the shape of the scene: a holistic representation of the spatial envelope. IJCV, 42(3):145-175, 2001.
[30] R. Salakhutdinov and G. Hinton. Semantic hashing. International Journal of Approximate Reasoning, 50(7):969-978, 2009.
[31] G. Shakhnarovich, P. Viola, and T. Darrell. Fast pose estimation with parameter-sensitive hashing. In Proc. ICCV, 2003.
[32] F. Shen, C. Shen, Q. Shi, A. van den Hengel, and Z. Tang. Inductive hashing on manifolds. In Proc. CVPR, 2013.
[33] J. Shi and J. Malik. Normalized cuts and image segmentation. TPAMI, 22(8):888-905, 2000.
[34] Q. Shi, J. Petterson, G. Dror, J. Langford, A. Smola, and S. V. N. Vishwanathan. Hash kernels for structured data. JMLR, 10:2615-2637, 2009.
[35] A. Torralba, R. Fergus, and W. T. Freeman. 80 million tiny images: a large dataset for non-parametric object and scene recognition. TPAMI, 30(11):1958-1970, 2008.
[36] K. Q. Weinberger, A. Dasgupta, J. Langford, A. J. Smola, and J. Attenberg. Feature hashing for large scale multitask learning. In Proc. ICML, 2009.
[37] Y. Weiss, R. Fergus, and A. Torralba. Multidimensional spectral hashing. In Proc. ECCV, 2012.
[38] Y. Weiss, A. Torralba, and R. Fergus. Spectral hashing. In NIPS 21, 2008.
[39] L. Wolf, T. Hassner, and I. Maoz. Face recognition in unconstrained videos with matched background similarity. In Proc. CVPR, 2011.
[40] J. Xiao, J. Hays, K. A. Ehinger, A. Oliva, and A. Torralba. SUN database: Large-scale scene recognition from abbey to zoo. In Proc. CVPR, 2010.