{"title": "Greedy Hash: Towards Fast Optimization for Accurate Hash Coding in CNN", "book": "Advances in Neural Information Processing Systems", "page_first": 798, "page_last": 807, "abstract": "To convert the input into binary code, hashing algorithm has been widely used for approximate nearest neighbor search on large-scale image sets due to its computation and storage efficiency. Deep hashing further improves the retrieval quality by combining the hash coding with deep neural network. However, a major difficulty in deep hashing lies in the discrete constraints imposed on the network output, which generally makes the optimization NP hard. In this work, we adopt the greedy principle to tackle this NP hard problem by iteratively updating the network toward the probable optimal discrete solution in each iteration. A hash coding layer is designed to implement our approach which strictly uses the sign function in forward propagation to maintain the discrete constraints, while in back propagation the gradients are transmitted intactly to the front layer to avoid the vanishing gradients. In addition to the theoretical derivation, we provide a new perspective to visualize and understand the effectiveness and efficiency of our algorithm. 
Experiments on benchmark datasets show that our scheme outperforms state-of-the-art hashing methods in both supervised and unsupervised tasks.", "full_text": "Greedy Hash: Towards Fast Optimization for\n\nAccurate Hash Coding in CNN\n\nShupeng Su1\n\nChao Zhang1\u2217\n\nKai Han1,3\n\nYonghong Tian1,2\n\n1Key Laboratory of Machine Perception (MOE), School of EECS, Peking University\n\n2National Engineering Laboratory for Video Technology, School of EECS, Peking University\n\n{sushupeng, c.zhang, hankai, yhtian}@pku.edu.cn\n\n3Huawei Noah\u2019s Ark Lab\n\nAbstract\n\nTo convert the input into binary code, hashing algorithm has been widely used for\napproximate nearest neighbor search on large-scale image sets due to its computa-\ntion and storage ef\ufb01ciency. Deep hashing further improves the retrieval quality by\ncombining the hash coding with deep neural network. However, a major dif\ufb01culty\nin deep hashing lies in the discrete constraints imposed on the network output,\nwhich generally makes the optimization NP hard. In this work, we adopt the greedy\nprinciple to tackle this NP hard problem by iteratively updating the network toward\nthe probable optimal discrete solution in each iteration. A hash coding layer is de-\nsigned to implement our approach which strictly uses the sign function in forward\npropagation to maintain the discrete constraints, while in back propagation the\ngradients are transmitted intactly to the front layer to avoid the vanishing gradients.\nIn addition to the theoretical derivation, we provide a new perspective to visualize\nand understand the effectiveness and ef\ufb01ciency of our algorithm. Experiments on\nbenchmark datasets show that our scheme outperforms state-of-the-art hashing\nmethods in both supervised and unsupervised tasks.\n\n1\n\nIntroduction\n\nIn the era of big data, searching for the desired information has become an important topic in such a\nvast ocean of data. 
Hashing for large-scale image set retrieval [7, 8, 21, 22] has attracted extensive interest in Approximate Nearest Neighbor (ANN) search due to the computation and storage efficiency of the generated binary representations. Deep hashing further improves the performance by learning the image representation and the hash coding simultaneously in the same network [30, 15, 18, 6]. Not only the common pairwise-label based methods [17, 30, 4], but also triplet [34, 35] and point-wise [19, 31] schemes have been exploited extensively.\nDespite the considerable progress, it is still difficult to realize truly end-to-end training of deep hashing, owing to the vanishing gradient problem of the sign function, which is appended after the output of the network to obtain binary codes. To be specific, the gradient of the sign function is zero for all nonzero inputs, which is fatal to a neural network trained by gradient descent. Most previous works choose to first solve a relaxed problem that discards the discrete constraints (e.g., [34, 19, 31] replace the sign function with tanh or sigmoid, and [20, 36, 17] add a penalty term to the loss function to make the features as discrete as possible), and then apply the sign function in the test phase to obtain real binary codes. Although capable of training the network, these relaxation schemes introduce quantization error, which generally leads to suboptimal hash codes. Later, HashNet [4] and Deep Supervised Discrete Hashing (DSDH) [16] made progress on this difficulty. HashNet starts training with a smoothed activation function y = tanh(βx) and becomes more non-smooth by increasing β until it eventually behaves almost like the original sign function. \n\n∗Corresponding author\n\n32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montréal, Canada.
DSDH solves the discrete hashing optimization with the discrete cyclic coordinate descent (DCC) [26] algorithm, which keeps the discrete constraint during the whole optimization process.\nAlthough these two papers achieved breakthroughs, some problems remain worthy of attention. On the one hand, they need many training iterations, since DSDH updates the hash code bit by bit while HashNet has to increase β iteratively. On the other hand, DCC, which DSDH uses to solve the discrete optimization, can only be applied to the standard binary quadratic program (BQP) problem and thus has limited applications, while HashNet still suffers from quantization error, as β cannot increase infinitely. Therefore this paper proposes a faster and more accurate algorithm to integrate hash coding with a neural network.\nOur main contributions are as follows. (1) We propose to adopt a greedy algorithm for fast processing of the discrete hashing optimization, and a new coding layer is designed in which the sign function is strictly used in forward propagation to guard against quantization error, while the gradients are transmitted intact to the front layer, which effectively prevents vanishing gradients and updates all bits together. Benefitting from this highly efficient integration of hashing with the neural network, the proposed hash layer can be applied to various occasions that need binary coding. (2) We not only provide a theoretical derivation, but also propose a visual perspective for understanding the rationality and validity of our method based on the aggregation effect of the sign function. (3) Extensive experiments show that our scheme outperforms state-of-the-art hashing methods for image retrieval in both supervised and unsupervised tasks.\n\n2 Greedy hash in CNN\n\nIn this section we introduce our method in detail; first of all, we fix some notation that will be used later. 
We denote by H the output of the last hidden layer of the original neural network, which also serves as the input to our hash coding layer. B denotes the hash code, which is exactly the output of the hash layer. In addition, sgn() is the sign function, which outputs +1 for positive numbers and -1 otherwise.\n\n2.1 Optimizing discrete hashing with the greedy principle\n\nFirst we put the neural network aside and focus on the discrete optimization problem itself, defined as follows:\n\nmin_B L(B),  s.t. B ∈ {−1, +1}^{N×K},   (1)\n\nwhere N is the number of inputs and K is the number of bits used to encode each input. L(B) can be any loss function, e.g., the Mean Square Error loss or the Cross Entropy loss.\nIf we drop the discrete constraint B ∈ {−1, +1}^{N×K} and aim for the optimal continuous B, we can compute the gradient and use gradient descent to update B iteratively until convergence:\n\nB^{t+1} = B^t − lr ∗ ∂L/∂B^t,   (2)\n\nwhere t indexes the iteration and lr denotes the learning rate.\nHowever, B computed by Equation (2) will almost never satisfy B ∈ {−1, +1}^{N×K}, and once the discrete constraint is imposed, optimization (1) becomes NP hard. One fast and effective way to tackle NP hard problems is the greedy algorithm, which selects the best available option in each iteration and ultimately reaches a point sufficiently close to the global optimum. If B^{t+1} computed by Equation (2) is the optimal continuous solution without the discrete requirement, then by the greedy principle the discrete point closest to the continuous B^{t+1}, namely sgn(B^{t+1}), is the most probable optimal discrete solution in each iteration, so we greedily update B toward it. 
Concretely, we use the following update to solve optimization (1) iteratively:\n\nB^{t+1} = sgn( B^t − lr ∗ ∂L/∂B^t ).   (3)\n\nThe conceptual convergence of Equation (3) is shown in Figure 1, from which we can see that each update of our method reaches a lower-loss point, (−1, 1) → (−1, −1) → (1, −1), and finally arrives at the optimal discrete solution (1, −1) of the loss contour map.\n\nFigure 1: Suppose we use only two bits, b1 and b2, to encode the input. The circles represent the contour map of the loss function and the red line represents the update trajectory of the hash code; the solid line denotes Equation (2) while the dotted line denotes Equation (3).\n\nIt is worth noting that our update (3) is consistent with the conclusion of [27], which gives a rigorous mathematical proof of the convergence of (3) (using theory from non-convex and non-smooth optimization [1, 3]). Different from that paper, we pay more attention to the distinct meaning behind Equation (3) (the greedy choice), as well as the notable properties we demonstrate below when combining (3) with a neural network (note that [27] does not use any deep learning method).\nWe believe that although (3) may not be the most effective method for solving discrete optimization problems in general, it is one of the best choices for handling the discrete constraint together with a neural network. 
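As a minimal numerical illustration of the greedy update in Equation (3), consider the following sketch (a toy quadratic loss with hand-picked values, not the paper's training code; `T` and `loss_grad` are illustrative):

```python
import numpy as np

# Toy loss: L(B) = 0.5 * ||B - T||^2, whose gradient is dL/dB = B - T,
# so the unconstrained minimizer has the signs of T.
T = np.array([[2.0, -1.0],
              [-3.0, 0.5]])

def loss_grad(B):
    return B - T

B = np.ones((2, 2))  # initial codes, all +1
lr = 1.0
for _ in range(5):
    # Equation (3): take a gradient step on B, then greedily project
    # back onto {-1, +1} with the sign function.
    B = np.sign(B - lr * loss_grad(B))

# B is now sign(T) = [[1, -1], [-1, 1]] and stays there.
```

With lr = 1 and this quadratic loss, a single update already lands on sign(T) and remains fixed, mirroring how the dotted trajectory in Figure 1 jumps directly between discrete corner points.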
Reasons are listed as follows:\n\n1) As is widely known, a neural network is updated by gradient descent, which is itself a greedy strategy: it moves the network along the steepest descent direction of the loss function in each iteration. This demonstrates the high feasibility of using the greedy principle to handle optimization problems in neural networks.\n\n2) The shared update mode of Equation (3) and neural network training (compute the gradient, then update the parameters) lays a solid foundation for combining them into an end-to-end training framework (see Section 2.2).\n\n3) As pointed out by [9], stochastic gradient descent (SGD) is equivalent to adding noise to the global gradient (computed over all training samples), and appropriate gradient noise not only brings a regularization effect but also helps the network escape some local minima and saddle points during optimization. From Figure 1 we can clearly see that Equation (3) introduces exactly such \"noise\" into the original Equation (2) via the sgn() function; thus Equation (3) not only lets the network handle the discrete constraint but also, to some extent, aids the optimization process itself.\n\nTherefore, Equation (3) is a reasonable and effective way to solve the discrete hashing optimization in a neural network, which will be further demonstrated with experiments later.\n\n2.2 Back propagating the coding message with a new hash layer\n\nIn Section 2.1 we discussed why we choose (3) to deal with the discrete optimization in a neural network; in this section we show how we implement (3) in the training procedure of the network with a newly designed hash layer.\nFirst, the variable H is introduced to split Equation (3) into:\n\nB^{t+1} = sgn( H^{t+1} ),   (4a)\nH^{t+1} = B^t − lr ∗ ∂L/∂B^t.   (4b)\n\nJust recall that H denotes the output of the neural network 
while B denotes the hash code; what we are going to do here is design a new layer connecting H and B that satisfies Equations (4a) and (4b).\nWith this in mind, we can immediately see that implementing Equation (4a) simply requires using the sign function in the forward propagation of the new hash layer, i.e., applying B = sgn(H) forward.\nAs for Equation (4b), if we add a penalty term ‖ H − sgn(H) ‖_p^p (entrywise matrix norm) to the objective function to keep it as close to zero as possible, then with Equation (4a), B^t = sgn(H^t), we have:\n\nH^{t+1} = H^t − lr ∗ ∂L/∂H^t\n        = (H^t − B^t) + B^t − lr ∗ ∂L/∂H^t\n        = (H^t − sgn(H^t)) + B^t − lr ∗ ∂L/∂H^t\n        ≈ B^t − lr ∗ ∂L/∂H^t.   (5)\n\nComparing (4b) with (5), we can finally implement Equation (4b) by setting\n\n∂L/∂H^t = ∂L/∂B^t   (6)\n\nin the backward propagation of our new hash layer, which means the gradient of B is transmitted back to H intact. 
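The two rules of the hash layer can be sketched in isolation as follows (a minimal NumPy mock-up for illustration, not the released PyTorch implementation; `hash_forward` and `hash_backward` are hypothetical names):

```python
import numpy as np

def hash_forward(H):
    # Forward rule (Equation (4a)): B = sgn(H).
    # sgn outputs +1 for positive entries and -1 otherwise.
    return np.where(H > 0, 1.0, -1.0)

def hash_backward(grad_B):
    # Backward rule (Equation (6)): the gradient of B is passed
    # back to H intact, i.e. dL/dH = dL/dB.
    return grad_B

H = np.array([[0.3, -1.2],
              [-0.1, 2.0]])      # network output (input to the hash layer)
B = hash_forward(H)              # [[ 1., -1.], [-1.,  1.]]

grad_B = np.array([[0.5, -0.5],
                   [0.1, 0.2]])  # gradient arriving from the loss
grad_H = hash_backward(grad_B)   # identical to grad_B
```

In a real framework this pair would be registered as one differentiable operation (e.g., a custom autograd function), so that the sign is applied in the forward pass while the backward pass ignores its zero derivative.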
Our method is summarized in Algorithm 1.\n\nAlgorithm 1 Greedy Hash\n\nPrepare the training set X and the neural network F_Θ, in which Θ denotes the parameters of the network.\nrepeat\n- H = F_Θ(X).\n- B = sgn(H) [forward propagation of our hash layer].\n- Calculate the loss function: Loss = L(B) + α ‖ H − sgn(H) ‖_p^p, where L can be any learning objective such as the Cross Entropy loss.\n- Calculate ∂Loss/∂B.\n- Set ∂L/∂H = ∂L/∂B [backward propagation of our hash layer].\n- Calculate ∂Loss/∂H = ∂L/∂H + α ∂‖ H − sgn(H) ‖_p^p / ∂H = ∂L/∂B + α p ‖ H − sgn(H) ‖_{p−1}^{p−1} (understood elementwise).\n- Calculate ∂Loss/∂Θ = ∂Loss/∂H × ∂H/∂Θ.\n- Update the whole network's parameters.\nuntil convergence.\n\n2.3 Analyzing our algorithm's validity from a visual perspective\n\nIn this section we provide a new perspective to visualize and understand the two most critical parts of our algorithm:\n\nForward: B = sgn( H ),   (7)\nBackward: ∂L/∂H = ∂L/∂B.   (8)\n\nFirst, suppose there are two categories of input images. We set H = (h1, h2) and B = (b1, b2) (i.e., we use only two bits to encode each input image). 
As shown in the forward part of Figure 2(a), the sign function aggregates the data of each quadrant in the H coordinate system into a single point in the B coordinate system, and learning to move the misclassified samples to the correct quadrant in the H coordinate system is obviously our ultimate training goal.\n\nFigure 2: A visual perspective to observe (a) the aggregation effect of sgn() and the back propagation from B to H in our algorithm, and (b) the quantization error generated by relaxation methods.\n\nAs introduced earlier, most previous hashing methods relax sgn() with a tanh function or a penalty term, but these relaxation schemes produce a mass of continuous values violating the discrete constraint. The samples are not forced to aggregate strictly in the training stage (the left part of Figure 2(b)), while in the test phase the sign function is used to generate real binary codes, resulting in a distribution different from the training phase (the right part of Figure 2(b)). This naturally produces what is often called the quantization error. In our proposed hash layer, the sign function (Equation (7)) is applied directly in the training stage without any relaxation. The training samples are strictly aggregated in each quadrant, which lets our loss function foresee the error before the test phase and act on moving the samples in time.\nEquation (8), used in the back propagation of our hash layer, is the most significant part for settling the vanishing gradient problem of the sign function. In order to propagate the moving information (the update direction of each sample) from ∂L/∂B to ∂L/∂H, we directly transmit the gradient of B to H intact, based on the principle that the aggregation effect of the sign function does not change the quadrant of an input sample from H to B. 
As a consequence, the direction that the misclassified samples need to move toward in the H coordinate system is exactly the direction learned by B in the B coordinate system (e.g., the red part in Figure 2(a)). Therefore Equation (8) lets H promptly obtain the direction that the loss function expects B to move toward, and it is (8) that helps our network realize fast and effective convergence.\nIncidentally, it is noteworthy that even though earlier research on stochastic neurons [2, 24] roughly mentioned the straight-through strategy (Equation (8)), our paper carefully studies its derivation and demonstrates its performance in the deep hash coding domain. Moreover, we use (8) with the assistance of the penalty term ‖ H − sgn(H) ‖_p^p (not present in [2, 24]), which is non-negligible for reducing the gradient bias by making H closer to B, improving the optimization behavior.\n\n3 Experiments\n\nWe evaluate the efficacy of our proposed Greedy Hash in this section; the source code is available at: https://github.com/ssppp/GreedyHash.\n\n3.1 Datasets\nCIFAR-10 The CIFAR-10 dataset [14] consists of 60,000 32×32 color images in 10 classes. Following most deep hashing methods such as [16] and [17], we conduct two experiment settings for CIFAR-10. In the first one (denoted CIFAR-10 (I)), 1,000 images (100 images per class) are selected as the query set, and the remaining 59,000 images are used as the database. In addition, 5,000 images (500 images per class) are randomly sampled from the database as the training set. 
As for the second setting (denoted CIFAR-10 (II)), 1,000 images per class (10,000 images in total) are selected as the test query set, and the remaining 50,000 images are used as the training set.\n\nImageNet\nImageNet [25], which consists of 1,000 classes, is a benchmark image set for object category classification and detection in the Large Scale Visual Recognition Challenge (ILSVRC). It contains over 1.2M images in the training set and 50K images in the validation set. Following the experiment setting in [4], we randomly select 100 categories, use all the images of these categories in the training set as the database, and use all the images in the validation set as the queries. Furthermore, 130 images per category are randomly selected from the database as the training points.\n\n3.2 Implementation details\n\nBasic setting Our model is implemented with the PyTorch [23] framework. We set the batch size to 32 and use SGD as the optimizer with a weight decay of 0.0005 and a momentum of 0.9. For supervised experiments the initial learning rate is 0.001, while for unsupervised experiments it is 0.0001; both are divided by 10 when the loss stops decreasing. In addition, we cross-validate the hyper-parameters α and p in the penalty term α ‖ H − sgn(H) ‖_p^p, which are finally fixed at p = 3 and α = 0.1 × 1/(N·K) for CIFAR-10 (the 1/(N·K) factor removes the impact of varying encoding length and input size), while for ImageNet α = 1 × 1/(N·K).\n\nSupervised setting We choose the Cross Entropy loss as our supervised loss function L, which means we simply apply softmax to classify the hash code B ∈ {−1, +1}^{N×K} without adding any retrieval loss (e.g., contrastive or triplet loss); later we show its outstanding retrieval performance despite using only a single softmax. 
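Putting these settings together, the supervised objective can be sketched numerically as follows (a toy NumPy computation with made-up batch values; the classifier weights `W` stand in for the softmax layer and are illustrative, not the paper's released code, and N is read as the batch size here):

```python
import numpy as np

N, K, C = 4, 8, 10           # batch size, code length, number of classes
alpha, p = 0.1 / (N * K), 3  # penalty weight and norm as cross-validated
                             # for CIFAR-10 (N taken as the batch size here)

rng = np.random.default_rng(0)
H = rng.standard_normal((N, K))        # network output fed to the hash layer
B = np.where(H > 0, 1.0, -1.0)         # forward: B = sgn(H)
y = np.array([0, 3, 3, 7])             # class labels of the batch
W = 0.1 * rng.standard_normal((K, C))  # illustrative softmax classifier weights

# Supervised loss L(B): cross entropy of the softmax classifier on the codes.
logits = B @ W
log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
ce_loss = -log_probs[np.arange(N), y].mean()

# Penalty keeping H close to its own sign: alpha * ||H - sgn(H)||_p^p.
penalty = alpha * (np.abs(H - B) ** p).sum()

loss = ce_loss + penalty
```

Note that the cross entropy is computed on the binary codes B themselves, so the classifier must separate the classes directly in Hamming space, which is what makes the single softmax sufficient for retrieval.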
Moreover, for fair comparison with previous works, we use the pre-trained AlexNet [13] as our neural network, in which we append a new fc layer after fc7 to generate a feature of the desired length and then append our hash layer to produce the binary code. We compare our method with DSDH [16], HashNet [4], DPSH [17], DTSH [29], DHN [36], NINH [15], CNNH [30], VDSH [33], DRSCH [32], DSCH [32], DSRH [34], DPLM [27], SDH [26], and KSH [21] under this supervised setup. It is worth noting that some of the aforementioned methods such as DSDH use the VGG-F [5] convolutional neural network, which consists of five convolutional layers and two fully connected layers, the same as the AlexNet selected by the other methods including ours; thus we consider the comparison fair, even though VGG-F performs slightly better on classification of the original 1,000-class ImageNet.\n\nUnsupervised setting Inspired by [12], we choose to minimize the change in cosine-distance relationships when the features are encoded from Euclidean space into Hamming space. Concretely, we use L = ‖ cos(h1, h2) − cos(b1, b2) ‖_2^2, in which cos denotes the cosine distance, h denotes a feature in Euclidean space, and b denotes a binary code in Hamming space. We use the pre-trained VGG16 [28] network following the settings in [6, 18], and we append a new fc layer as well as our hash layer to generate the binary code. We compare with SAH [6], DeepBit [18], ITQ [8], KMH [10] and SPH [11] under this unsupervised setting.\n\n3.3 Comparison on fast optimization\n\nFirst we compare our method with DSDH, which, like ours, keeps the discrete constraint during the whole optimization process. For fair comparison, we rerun the code released by the DSDH authors and follow the experiment setting in their program: using a pre-trained CNN and encoding the images with a maximum of 48 bits and a minimum of 12 bits on supervised CIFAR-10 (I). 
The Mean Average Precision (MAP) during the training stage is shown in Figure 3(a).\nWe can see from Figure 3(a) that our method achieves faster and better MAP improvement with both short and long codes (especially the shorter one). In DSDH, the coding message from B is back-propagated to the front network only through the loss term ‖ B − H ‖_2^2, which is deficient and unstable when this value is small, while our method uses ∂L/∂H = ∂L/∂B and thus receives the coding message rapidly and accurately, as analyzed in Section 2.3.\n\nFigure 3: Fast optimization comparison with (a) DSDH and (b)(c) relaxation methods.\n\nNext we compare our algorithm with the relaxation methods. We use 16 bits to encode the images and train AlexNet from scratch here to further demonstrate the learning ability. The classification loss and MAP over training are shown in Figures 3(b) and 3(c), in which the label tanh means using the tanh function as relaxation, penalty denotes the method adding a penalty term to make the features as discrete as possible, and original represents training without hash coding (the same length of 16 but no longer restricted to binary values). We use two schemes to retrieve these original features: Euclidean distance and cosine distance.\nAs shown in Figures 3(b) and 3(c), our algorithm is among the fastest at decreasing the classification loss and is simultaneously the fastest and best at improving MAP, owing to its better protection against quantization error. Among the results, the tanh method is the slowest, probably due to saturation of the tanh activation function. 
Moreover, it is interesting to discover that retrieval with binary hash codes outperforms retrieval with the original unrestricted features, probably because softmax alone generally needs an additional retrieval loss to further constrain the image features (for smaller intra-class distance and larger inter-class distance); this is exactly the merit of hashing, which naturally aggregates the inputs in each quadrant, as demonstrated in Section 2.3.\nThus our Greedy Hash realizes faster and better optimization of hash coding with a CNN; next we further demonstrate the accuracy of the generated hash codes compared with the state of the art.\n\n3.4 Comparison on accurate coding\n\nSupervised experiments Table 1 shows the MAP results on the supervised CIFAR-10 (I) dataset, in which we can clearly see the better performance of our algorithm compared with both state-of-the-art deep hashing methods and traditional hashing with deep-learned features as input (\"SDH+CNN\" denotes the SDH method with deep features). It is widely known that a deep neural network needs more training samples to perform better, so we conduct a second experiment on CIFAR-10 (i.e., CIFAR-10 (II)), whose training set is ten times the size of the first setting. The MAP results are shown in Table 2. 
All the deep methods make considerable progress thanks to the larger training set, and even though the best result obtained by previous work is as high as 93%, our method further improves it to 94%.\nNotice that we specifically compare our method with DPLM [27] in Table 2, which clearly shows that using deep-learned features as input improves the performance of DPLM, while our method boosts the results further, as we better integrate hash coding with the CNN and simultaneously realize efficient end-to-end training of the network.\n\nTable 1: MAP on supervised CIFAR-10 (I), where \"method+CNN\" means traditional hashing methods with deep features as input.\n\nMethod | 12bits | 24bits | 32bits | 48bits\nOurs | 0.774 | 0.795 | 0.810 | 0.822\nDSDH | 0.740 | 0.786 | 0.801 | 0.820\nDTSH | 0.710 | 0.750 | 0.765 | 0.774\nDPSH | 0.713 | 0.727 | 0.744 | 0.757\nNINH | 0.552 | 0.566 | 0.558 | 0.581\nCNNH | 0.439 | 0.511 | 0.509 | 0.522\nSDH+CNN | 0.478 | 0.557 | 0.584 | 0.592\nKSH+CNN | 0.488 | 0.539 | 0.548 | 0.563\n\nTable 2: MAP on supervised CIFAR-10 (II).\n\nMethod | 16bits | 24bits | 32bits | 48bits\nOurs | 0.942 | 0.943 | 0.943 | 0.944\nDSDH | 0.935 | 0.940 | 0.939 | 0.939\nDTSH | 0.915 | 0.923 | 0.925 | 0.926\nDPSH | 0.763 | 0.781 | 0.795 | 0.807\nVDSH | 0.845 | 0.848 | 0.844 | 0.845\nDRSCH | 0.615 | 0.622 | 0.629 | 0.631\nDSCH | 0.609 | 0.613 | 0.617 | 0.620\nDSRH | 0.608 | 0.611 | 0.617 | 0.618\nDPLM+CNN | 0.562 | 0.830 | 0.837 | 0.843\nDPLM | 0.465 | 0.614 | 0.643 | 0.671\n\nTable 3 displays the retrieval results when training on supervised ImageNet, and our algorithm still shows superior performance on this larger dataset. We should point out that the pairwise- or triplet-label based methods generally need high storage and computation cost to construct the input image groups, which is infeasible for large-scale datasets. 
Our scheme learns hash codes in a point-wise manner; consequently, it is simpler and more promising to apply our method in a practical retrieval system.\n\nTable 3: MAP@1000 on supervised ImageNet.\n\nMethod | 16bits | 32bits | 48bits | 64bits\nOurs | 0.625 | 0.662 | 0.682 | 0.688\nHashNet | 0.506 | 0.631 | 0.663 | 0.684\nDHN | 0.311 | 0.472 | 0.542 | 0.573\nNINH | 0.290 | 0.461 | 0.530 | 0.565\nCNNH | 0.281 | 0.450 | 0.525 | 0.554\nSDH+CNN | 0.298 | 0.455 | 0.554 | 0.585\nKSH+CNN | 0.160 | 0.298 | 0.342 | 0.394\n\nTable 4: MAP@1000 on unsupervised CIFAR-10 (II).\n\nMethod | 16bits | 32bits | 64bits\nOurs | 0.448 | 0.472 | 0.501\nSAH | 0.418 | 0.456 | 0.474\nDeepBit | 0.194 | 0.249 | 0.277\nITQ+CNN | 0.385 | 0.414 | 0.442\nKMH+CNN | 0.360 | 0.382 | 0.401\nSPH+CNN | 0.302 | 0.356 | 0.392\n\nUnsupervised experiment The MAP@1000 results on unsupervised CIFAR-10 (II) are shown in Table 4 (again, \"ITQ+CNN\" denotes the ITQ method with deep features). Our method still improves the performance as before, in spite of our rough unsupervised objective function. In view of the improvement brought by our algorithm in both supervised and unsupervised tasks, we demonstrate that with small modifications to the original network, our method easily transfers to various occasions that need hash coding.\n\n3.5 Coding with shorter length\n\nFrom all the retrieval results above, it is interesting to observe that the shorter the encoding length we use, the larger the margin we obtain over the other methods. It seems that our method can maintain decent retrieval performance even with a restricted encoding length. 
Thus we conduct the following experiment on supervised CIFAR-10 (I) using encoding lengths shorter than the common minimum of 12 bits (but larger than 4 bits, as there are 10 classes in CIFAR-10) to further explore this property of our method.\n\nFigure 4: Retrieval experiments with shorter encoding length (4-12 bits).\n\nThe results are shown in Figure 4, from which it is impressive to find that our proposed scheme indeed produces good results with shorter hash codes. With this property, our method can process large-scale image sets with higher storage efficiency and faster retrieval speed, which we believe is extremely useful in practical applications.\n\n4 Conclusion\n\nIn this paper we propose to adopt a greedy algorithm to tackle the discrete hashing optimization, and we design a new neural layer to implement our approach, which strictly uses the sign function in forward propagation without any relaxation to avoid quantization error, while in backward propagation the gradients are transmitted intact to the front layer, preventing vanishing gradients and helping to achieve fast convergence. The superiority of our method is shown from a novel visual perspective as well as through extensive experiments on different retrieval tasks.\n\nAcknowledgments\n\nThis work was supported in part by the National Key R&D Program of China under Grant 2017YFB1002400, the National Natural Science Foundation of China under Grants 61671027 and U1611461, and the National Key Basic Research Program of China under Grant 2015CB352303.\n\nReferences\n[1] H. Attouch, J. Bolte, and B. F. Svaiter. Convergence of descent methods for semi-algebraic and tame problems: proximal algorithms, forward-backward splitting, and regularized Gauss-Seidel methods. Mathematical Programming, 137(1-2):91-129, 2013.\n\n[2] Y. Bengio, N. 
L\u00e9onard, and A. Courville. Estimating or propagating gradients through stochastic neurons\n\nfor conditional computation. arXiv preprint arXiv:1308.3432, 2013.\n\n[3] J. Bolte, S. Sabach, and M. Teboulle. Proximal alternating linearized minimization or nonconvex and\n\nnonsmooth problems. Mathematical Programming, 146(1-2):459\u2013494, 2014.\n\n[4] Z. Cao, M. Long, J. Wang, and P. S. Yu. Hashnet: Deep learning to hash by continuation. In The IEEE\n\nInternational Conference on Computer Vision (ICCV), Oct 2017.\n\n[5] K. Chat\ufb01eld, K. Simonyan, A. Vedaldi, and A. Zisserman. Return of the devil in the details: Delving deep\n\ninto convolutional nets. arXiv preprint arXiv:1405.3531, 2014.\n\n[6] T.-T. Do, D.-K. Le Tan, T. T. Pham, and N.-M. Cheung. Simultaneous feature aggregating and hashing\nfor large-scale image search. In Proceedings of the IEEE Conference on Computer Vision and Pattern\nRecognition, pages 6618\u20136627, 2017.\n\n[7] A. Gionis, P. Indyk, R. Motwani, et al. Similarity search in high dimensions via hashing.\n\nvolume 99, pages 518\u2013529, 1999.\n\nIn Vldb,\n\n[8] Y. Gong, S. Lazebnik, A. Gordo, and F. Perronnin. Iterative quantization: A procrustean approach to\nlearning binary codes for large-scale image retrieval. IEEE Transactions on Pattern Analysis and Machine\nIntelligence, 35(12):2916\u20132929, 2013.\n\n[9] I. Goodfellow, Y. Bengio, A. Courville, and Y. Bengio. Deep learning, volume 1. MIT press Cambridge,\n\n2016.\n\n[10] K. He, F. Wen, and J. Sun. K-means hashing: An af\ufb01nity-preserving quantization method for learning\nbinary compact codes. In Computer Vision and Pattern Recognition (CVPR), 2013 IEEE Conference on,\npages 2938\u20132945. IEEE, 2013.\n\n[11] J.-P. Heo, Y. Lee, J. He, S.-F. Chang, and S.-E. Yoon. Spherical hashing. In Computer Vision and Pattern\n\nRecognition (CVPR), 2012 IEEE Conference on, pages 2957\u20132964. IEEE, 2012.\n\n[12] M. Hu, Y. Yang, F. Shen, N. Xie, and H. T. Shen. 
Hashing with angular reconstructive embeddings. IEEE\n\nTransactions on Image Processing, 27(2):545\u2013555, 2018.\n\n[13] A. Krizhevsky. One weird trick for parallelizing convolutional neural networks. arXiv preprint arX-\n\niv:1404.5997, 2014.\n\n[14] A. Krizhevsky and G. Hinton. Learning multiple layers of features from tiny images. 2009.\n[15] H. Lai, Y. Pan, Y. Liu, and S. Yan. Simultaneous feature learning and hash coding with deep neural\n\nnetworks. In 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).\n\n[16] Q. Li, Z. Sun, R. He, and T. Tan. Deep supervised discrete hashing. In Advances in Neural Information\n\nProcessing Systems, pages 2479\u20132488, 2017.\n\n[17] W.-J. Li, S. Wang, and W.-C. Kang. Feature learning based deep supervised hashing with pairwise labels. In\nProceedings of the Twenty-Fifth International Joint Conference on Arti\ufb01cial Intelligence, pages 1711\u20131717.\nAAAI Press, 2016.\n\n[18] K. Lin, J. Lu, C.-S. Chen, and J. Zhou. Learning compact binary descriptors with unsupervised deep neural\nnetworks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages\n1183\u20131192, 2016.\n\n[19] K. Lin, H.-F. Yang, J.-H. Hsiao, and C.-S. Chen. Deep learning of binary hash codes for fast image retrieval.\nIn Computer Vision and Pattern Recognition Workshops (CVPRW), 2015 IEEE Conference on, pages\n27\u201335. IEEE, 2015.\n\n[20] H. Liu, R. Wang, S. Shan, and X. Chen. Deep supervised hashing for fast image retrieval. In Proceedings\n\nof the IEEE conference on computer vision and pattern recognition, pages 2064\u20132072, 2016.\n\n[21] W. Liu, J. Wang, R. Ji, Y.-G. Jiang, and S.-F. Chang. Supervised hashing with kernels. In Computer Vision\n\nand Pattern Recognition (CVPR), 2012 IEEE Conference on, pages 2074\u20132081. IEEE, 2012.\n\n[22] J. Masci, M. M. Bronstein, A. M. Bronstein, and J. Schmidhuber. Multimodal similarity-preserving\n\nhashing. 
IEEE transactions on pattern analysis and machine intelligence, 36(4):824\u2013830, 2014.\n\n[23] A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. Lin, A. Desmaison, L. Antiga, and\n\nA. Lerer. Automatic differentiation in pytorch. 2017.\n\n[24] T. Raiko, M. Berglund, G. Alain, and L. Dinh. Techniques for learning binary stochastic feedforward\n\nneural networks. arXiv preprint arXiv:1406.2989, 2014.\n\n9\n\n\f[25] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla,\nM. Bernstein, et al. Imagenet large scale visual recognition challenge. International Journal of Computer\nVision, 115(3):211\u2013252, 2015.\n\n[26] F. Shen, C. Shen, W. Liu, and H. T. Shen. Supervised discrete hashing. In CVPR, volume 2, page 5, 2015.\n[27] F. Shen, X. Zhou, Y. Yang, J. Song, H. T. Shen, and D. Tao. A fast optimization method for general binary\n\ncode learning. IEEE Transactions on Image Processing, 25(12):5610\u20135621, 2016.\n\n[28] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition.\n\narXiv preprint arXiv:1409.1556, 2014.\n\n[29] X. Wang, Y. Shi, and K. M. Kitani. Deep supervised hashing with triplet labels. In Asian Conference on\n\nComputer Vision, pages 70\u201384. Springer, 2016.\n\n[30] R. Xia, Y. Pan, H. Lai, C. Liu, and S. Yan. Supervised hashing for image retrieval via image representation\n\nlearning. In AAAI, volume 1, page 2, 2014.\n\n[31] H.-F. Yang, K. Lin, and C.-S. Chen. Supervised learning of semantics-preserving hashing via deep neural\n\nnetworks for large-scale image search. arXiv preprint arXiv:1507.00101, 2015.\n\n[32] R. Zhang, L. Lin, R. Zhang, W. Zuo, and L. Zhang. Bit-scalable deep hashing with regularized similar-\nity learning for image retrieval and person re-identi\ufb01cation. IEEE Transactions on Image Processing,\n24(12):4766\u20134779, 2015.\n\n[33] Z. Zhang, Y. Chen, and V. Saligrama. 
Ef\ufb01cient training of very deep neural networks for supervised\nhashing. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages\n1487\u20131495, 2016.\n\n[34] F. Zhao, Y. Huang, L. Wang, and T. Tan. Deep semantic ranking based hashing for multi-label image\nretrieval. In Computer Vision and Pattern Recognition (CVPR), 2015 IEEE Conference on, pages 1556\u2013\n1564. IEEE, 2015.\n\n[35] Y. Zhou, S. Huang, Y. Zhang, and Y. Wang. Deep hashing with triplet quantization loss. arXiv preprint\n\narXiv:1710.11445, 2017.\n\n[36] H. Zhu, M. Long, J. Wang, and Y. Cao. Deep hashing network for ef\ufb01cient similarity retrieval. In AAAI,\n\npages 2415\u20132421, 2016.\n\n10\n\n\f", "award": [], "sourceid": 443, "authors": [{"given_name": "Shupeng", "family_name": "Su", "institution": "Peking University"}, {"given_name": "Chao", "family_name": "Zhang", "institution": "Peking University"}, {"given_name": "Kai", "family_name": "Han", "institution": "Noah's Ark Laboratory, Huawei"}, {"given_name": "Yonghong", "family_name": "Tian", "institution": "Peking University"}]}
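The coding layer summarized in the conclusion — a strict sign in the forward pass and identity (straight-through) gradients in the backward pass — can be sketched in a few lines. The snippet below is our own minimal NumPy illustration of that idea, not the authors' released implementation; the function names are ours.

```python
import numpy as np

# Minimal sketch of the hash coding layer idea (names are illustrative):
# forward uses a strict sign, backward passes gradients through unchanged.

def hash_forward(h):
    """Binarize real-valued features with a strict sign: output is exactly +/-1,
    so the discrete constraint holds with no relaxation (no quantization error)."""
    return np.where(h >= 0, 1.0, -1.0)

def hash_backward(grad_b):
    """Straight-through backward: the true derivative of sign is zero almost
    everywhere, so the incoming gradient is handed to the front layer intact."""
    return grad_b

h = np.array([0.7, -0.2, 1.5, -0.9])    # toy network outputs
b = hash_forward(h)                      # binary codes: [ 1., -1.,  1., -1.]
grad_h = hash_backward(np.ones_like(b))  # gradient reaches h unchanged
```

In an autograd framework this would be written as a custom function whose backward returns its input gradient unmodified, which is what lets the layer train end to end despite the non-differentiable sign.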