{"title": "BinGAN: Learning Compact Binary Descriptors with a Regularized GAN", "book": "Advances in Neural Information Processing Systems", "page_first": 3608, "page_last": 3618, "abstract": "In this paper, we propose a novel regularization method for Generative Adversarial Networks that allows the model to learn discriminative yet compact binary representations of image patches (image descriptors). We exploit the dimensionality reduction that takes place in the intermediate layers of the discriminator network and train the binarized penultimate layer's low-dimensional representation to mimic the distribution of the higher-dimensional preceding layers. To achieve this, we introduce two loss terms that aim at: (i) reducing the correlation between the dimensions of the binarized penultimate layer's low-dimensional representation (i.e. maximizing joint entropy) and (ii) propagating the relations between the dimensions in the high-dimensional space to the low-dimensional space. We evaluate the resulting binary image descriptors on two challenging applications, image matching and retrieval, where they achieve state-of-the-art results.", "full_text": "BinGAN: Learning Compact Binary Descriptors\n\nwith a Regularized GAN\n\nMaciej Zieba\n\nWroclaw University of\n\nScience and Technology, Tooploox\n\nmaciej.zieba@pwr.edu.pl\n\nPiotr Semberecki\n\nWroclaw University of\n\nScience and Technology, Tooploox\npiotr.semberecki@pwr.edu.pl\n\nTarek El-Gaaly\n\nVoyage\n\ntarek@voyage.auto\n\nTomasz Trzcinski\n\nWarsaw University of Technology,\n\nTooploox\n\nt.trzcinski@ii.pw.edu.pl\n\nAbstract\n\nIn this paper, we propose a novel regularization method for Generative Adversarial\nNetworks, which allows the model to learn discriminative yet compact binary rep-\nresentations of image patches (image descriptors). 
We employ the dimensionality reduction that takes place in the intermediate layers of the discriminator network and train the binarized low-dimensional representation of the penultimate layer to mimic the distribution of the higher-dimensional preceding layers. To achieve this, we introduce two loss terms that aim at: (i) reducing the correlation between the dimensions of the binarized low-dimensional representation of the penultimate layer (i.e. maximizing joint entropy) and (ii) propagating the relations between the dimensions in the high-dimensional space to the low-dimensional space. We evaluate the resulting binary image descriptors on two challenging applications, image matching and retrieval, and achieve state-of-the-art results.\n\n1 Introduction\n\nCompact binary representations of images are instrumental for a multitude of computer vision applications, including image retrieval, simultaneous localization and mapping, and large-scale 3D reconstruction. Typical approaches to the problem of learning discriminative yet concise representations include supervised machine learning methods such as boosting [27], hashing [8] and, more recently, deep learning [23]. Although unsupervised methods have also been proposed [14, 16, 6], their performance is typically significantly lower than that of the competing supervised approaches.\nThe goal of this work is to bridge this performance gap by using an intermediate layer representation of a Generative Adversarial Network (GAN) [9] discriminator as a compact binary image descriptor. Recent studies show the powerful discriminative capabilities of features extracted from the discriminator networks of GANs [19, 21]. With a growing number of hidden units in intermediate layers, the quality of the vector representations increases when applied to both image matching and retrieval. 
This is why typical approaches make use of high-dimensional intermediate representations to generate image descriptors, leading to a high memory footprint and computationally expensive matching. We address this shortcoming and build low-dimensional compact descriptors by training a GAN with a novel Distance Matching Regularizer (DMR). This regularizer is responsible for propagating the Hamming distances between binary vectors in the high-dimensional feature spaces of intermediate discriminator layers to the compact feature space of the low-dimensional deeper layers in the same network. More precisely, our proposed method allows us to regularize the output of an intermediate layer (with a low number of units) of the discriminator with the help of the output of previous layers (with a high number of units). This is achieved by propagating the correlations between sample pairs of representations between the layers. Moreover, to better allocate the capacity of the low-dimensional feature representation we extend our model with an adjusted version of the Binarization Representation Entropy (BRE) Regularizer [5]. This regularization technique was initially applied to increase the diversity of intermediate layers of the discriminator by maximizing the joint entropy of the binarized outputs of the layers. We adjust this regularization term so that it concentrates on increasing the entropy of the particular pairs of binary vectors that are not correlated in the high-dimensional space. As a consequence, we keep a balance between propagating the Hamming distances between the layers for correlated vectors and increasing the diversity of the binary vectors in the low-dimensional feature space.\n\n32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montréal, Canada.\n\nThe main contributions of this paper are two-fold. Firstly, we build a powerful yet compact binary image descriptor using a GAN architecture. 
Secondly, we introduce a novel regularization method\nthat propagates the Hamming distances between correlated pairs of vectors in the high-dimensional\nfeatures of earlier layers to the low-dimensional binary representation of deeper layers during\ndiscriminator training. A binary image descriptor resulting from our approach, dubbed BinGAN,\nsigni\ufb01cantly outperforms state-of-the-art methods in two challenging tasks: image matching and\nimage retrieval. Last but not least, we release the code of the method along with the evaluation scripts\nto enable reproducible research1.\n\n2 Related Work\n\n2.1 Binary Descriptors\n\nBinary local feature descriptors have gained a signi\ufb01cant amount of attention from the research\ncommunity, mainly due to their compact nature, ef\ufb01ciency and multitude of applications in computer\nvision [4, 13, 1, 25, 28, 27, 7]. BRIEF [4], the \ufb01rst widely adopted binary feature descriptor, sparked a\nnew domain of research on feature descriptors that rely on a set of hand-crafted intensity comparisons\nthat are used to generate binary strings. Several follow-up works proposed different sampling\nstrategies, e.g. BRISK [13] and FREAK [1]. Although these approaches offer unprecedented\ncomputational ef\ufb01ciency, their performance is highly sensitive to standard image transformation,\nsuch as rotation or scaling, as well as other viewpoint changes. To address those limitations, several\nsupervised approaches to learning binary local feature descriptors from the data have been proposed.\nLDAHash [25] proposed to train discriminative projections of SIFT [17] descriptors and binarize\nthem afterwards to obtain a highly robust patch descriptor. D-Brief [28] extends this approach by\nincreasing the ef\ufb01ciency of the descriptor with banks of simple \ufb01ltering elements used to approximate\nthe projections. 
To further boost the performance of learned binary descriptors, BinBoost [27] proposes to use a greedy boosting algorithm for training consecutive bits, while RFD [7] uses an alternative greedy algorithm to select the most distinctive receptive fields used to construct the dimensions of the descriptor. With this kind of approach, it is possible to obtain more powerful descriptors than by applying hand-crafted methods. However, these binary descriptors are trained using pair-wise learning methods, which substantially limits their applicability to new tasks.\n\n2.2 Hashing Methods\n\nOn the other hand, binary descriptors can be learned with hashing algorithms that aim at preserving the original distances between images in binary spaces, such as in [2, 20, 30, 8].\nLocality Sensitive Hashing (LSH) [2] binarizes the input by thresholding a low-dimensional representation generated with random projections. Semantic Hashing (SH) [20] achieves the same goal with a multi-layer Restricted Boltzmann Machine. Spectral Hashing (SpeH) [30] exploits spectral graph partitioning to create efficient binary codes. Iterative Quantization (ITQ) [8] uses an iterative approach to find a set of projections that minimize the binarization loss. Unlike most recent deep learning approaches (discussed next), these hashing algorithms typically operate on hand-crafted image representations, e.g. SIFT descriptors [25], dramatically reducing their effectiveness and limiting their performance, as can be seen in the results of our experiments.\n\n1 The code is available at: github.com/maciejzieba/binGAN\n\n2.3 Deep Learning Approaches\n\nInspired by the spectacular success of deep neural networks, several methods have been proposed that generate binary image descriptors using deep neural networks [23, 26, 16, 14, 6]. 
Supervised methods, such as [23, 26], achieve impressive results by exploiting data labeling and training advancements such as the Siamese architecture [23] or progressive sampling [26]. Nevertheless, their outstanding performance is often limited to the original task and it is challenging to apply them to other domains. Unsupervised deep learning methods [16, 14, 6], on the other hand, are typically less domain-specific and do not require any data labeling, which becomes especially important in domains where such labeling is hard or impossible to obtain, e.g. medical imaging. Deep Hashing (DH) [16] uses neural networks to find a binary representation that reduces the binarization loss while balancing bit values to maximize entropy. As an input, however, it takes an intermediate image representation, such as the GIST descriptor. DeepBit [14] addresses this shortcoming by using a convolutional neural network and further improves the results with data augmentation. However, DeepBit relies on a rigid sign function with a threshold at zero to binarize the floating-point output values, which may lead to significant quantization losses. DBD-MQ [6] overcomes this limitation by casting this problem as a multi-quantization task and using a K-AutoEncoders network to solve it. In this paper, we follow this line of research and employ a different binarization technique, as in [4], to generate binary descriptors. Furthermore, we also rely on recent generative models, namely Generative Adversarial Networks [9], to build image descriptors. In this regard, our work is also related to [18] and [24], where GANs are used to address the image retrieval problem. [18] learns binary representations by training an end-to-end network to distinguish synthetic and real images. [24] proposes to employ GANs to enhance the intermediate representation of the generator. 
Contrary to our approach, however, both of those methods use a tanh-like activation for binarization and optimize their performance toward the image retrieval task, while our approach is agnostic to the final application and can be equally successful when applied to local feature descriptor learning, image matching or image retrieval.\n\n3 BinGAN\n\nWe propose a novel approach for learning compact binary descriptors that exploits the capability of GAN models to learn discriminative features. In order to extract binary features we make use of intermediate layers of a GAN's discriminator [19]. To enforce a good binary representation we incorporate two additional losses in training the discriminator: a distance matching regularizer that forces the propagation of distances from high-dimensional spaces to the low-dimensional compact space, and an adjusted binarization representation entropy (BRE) regularizer [5] with weighted correlation.\n\n3.1 GAN\n\nThe main idea of GANs [9] is based on game theory and assumes training of two competing networks: a generator G(z) and a discriminator D(x). The goal of GANs is to train the generator G to sample from the data distribution pdata(x) by transforming a vector of noise z (whose prior is denoted as pz(z)). The discriminator D is trained to distinguish the samples generated by G from the samples from pdata(x). The training problem is formulated as follows:\n\nmin_G max_D V(D, G) = E_x∼pdata(x)[log(D(x))] + E_z∼pz(z)[log(1 − D(G(z)))].   (1)\n\nThe model is usually trained with gradient-based approaches by taking a minibatch of fake images, generated by transforming random vectors sampled from pz(z) with the generator, and a minibatch of data samples from pdata(x). 
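For concreteness, the minibatch estimate of the value V(D, G) in equation (1) can be sketched as follows. This is a minimal NumPy illustration of the objective only, not the authors' training code; all names are ours, and the discriminator outputs are stand-in probabilities.

```python
import numpy as np

def gan_value(d_real, d_fake):
    """Minibatch estimate of V(D, G) from Eq. (1):
    mean of log D(x) over real samples plus mean of log(1 - D(G(z)))
    over generated samples. D maximizes this; G minimizes it."""
    return np.mean(np.log(d_real)) + np.mean(np.log(1.0 - d_fake))

# Toy discriminator outputs in (0, 1) for a batch of real and generated samples.
d_real = np.array([0.9, 0.8, 0.95])   # D is fairly sure these are real
d_fake = np.array([0.1, 0.2, 0.05])   # D is fairly sure these are fake

v = gan_value(d_real, d_fake)  # high when D separates real from fake well
```

Swapping the two batches (a confused discriminator) yields a much lower value, which is what the generator's updates push towards.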
These minibatches are used to maximize V(D, G) with respect to the parameters of D, assuming a constant G, and then to minimize V(D, G) with respect to the parameters of G, assuming a constant D.\nHowever, to obtain more discriminative features in the intermediate layers of the discriminator and to stabilize the training process, the authors of [21] recommend that the generator G be trained using a feature matching procedure. The objective to train the generator G is:\n\nL_G = || E_x∼pdata(x) f(x) − E_z∼pz(z) f(G(z)) ||_2^2,   (2)\n\nwhere f(x) denotes an intermediate layer of the discriminator. In practical implementations it is usually the layer just before classification (the penultimate layer).\nDespite the fact that GANs are used for generating artificial examples from the data distribution, they can also be used as feature embeddings. This was initially discussed in [19] and further extended in [21], where the authors confirm that by incorporating a discriminator network in a semi-supervised setting they were able to obtain competitive results. There are a couple of benefits in using adversarial training for feature embeddings. First, during the adversarial training, the generator produces fake images of increasing quality and the discriminator is trained to distinguish between these and data examples. During this discriminative procedure the discriminator is forced to learn more specific features that are characteristic of regions of the feature space strongly associated with particular classes. Second, the adversarial training is done in an unsupervised setting without the need for tedious data annotation. 
Third, the feature matching approach (as in [21]) that is applied to train the generator results in generating fake images with similar feature characteristics, which forces the discriminator to extract more diverse features.\nThe most recent approaches for generating binary image descriptors aim at constructing binary vectors of low dimensionality. However, it was shown in [19] that the best performing representations in GANs can be obtained from high-dimensional intermediate layers of the discriminator. Therefore, in this work, we aim at transferring the Hamming distances from the high-dimensional space of intermediate layers to their binarized representations of low dimensionality to build our binary image descriptors. To that end, we propose a regularization technique that enforces this transfer, effectively leading to the construction of a compact yet discriminative binary descriptor. In Sec. 4.1 we define the layers used as high- and low-dimensional representations for a given network architecture.\n\n3.2 Distance Matching Regularizer\n\nIn this section we introduce a regularization loss function that aims at propagating the correlations between pairs of examples from the high-dimensional space to the low-dimensional representation, which is equivalent to propagating Hamming distances between two layers in the discriminator. We achieve this goal by taking a pair of vectors from two intermediate layers of the same network (the discriminator) corresponding to two examples from a data batch and enforcing their binarized outputs to have similar normalized dot products.\nLet f(x) and h(x) denote the low- and high-dimensional intermediate layers of the discriminator with the numbers of hidden units equal to K and M, respectively. We assume that the number of hidden units for h(x) is significantly higher than the number of units for f(x), M ≫ K. 
The corresponding binary vectors b_f ∈ {−1, 1}^K and b_h ∈ {−1, 1}^M can be obtained using a sign function: sign(a) = a/|a|. The main problem with sign activations is that they are not able to propagate the gradient backwards. In order to overcome this limitation we use the following quantization technique, as in [5]: softsign(a) = a/(|a| + γ), where γ is a hyperparameter that is responsible for smoothing the sign(·) function. We define the vector s_f that is created by applying the softsign(·) function to each element of f(x): s_{f,k} = softsign(f_k(x)).\nThe Hamming distance between two binary vectors b_1 and b_2 can be expressed using a dot product: d_H(b_1, b_2) = −0.5 · (b_1^T b_2 − M). As a consequence, distant vectors are characterized by low-valued dot products and close vectors are characterized by high values. Considering this property, we introduce the Distance Matching Regularizer (DMR) that aims at propagating the good coding properties of the vectors b_h in the high-dimensional space to the compact space of binary vectors b_f (represented by their soft proxy vectors s_f). We define the DMR in the following manner:\n\nL_DMR = 1/(N(N−1)) · Σ_{k,j=1, k≠j}^N | b_{h,k}^T b_{h,j} / M − s_{f,k}^T s_{f,j} / K |,   (3)\n\nIn terms of the optimization procedure we assume constant values of the high-dimensional vectors b_h and optimize the parameters of the discriminator with respect to s_f. 
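The softsign quantization and the L_DMR term can be sketched as follows. This is an illustrative NumPy implementation under our own naming, not the released BinGAN code; as in the text, b_h is treated as a constant target while only the low-dimensional side is differentiable in training.

```python
import numpy as np

def softsign(a, gamma=0.001):
    """Smooth proxy for sign(a) = a/|a| that keeps gradients usable."""
    return a / (np.abs(a) + gamma)

def dmr_loss(h, f, gamma=0.001):
    """Distance Matching Regularizer, Eq. (3).
    h: (N, M) high-dimensional activations; f: (N, K) low-dimensional ones.
    Matches normalized pairwise dot products, which is equivalent to matching
    normalized Hamming distances, since d_H(b1, b2) = 0.5 * (M - b1.T @ b2)."""
    n, m = h.shape
    k = f.shape[1]
    b_h = np.sign(h)             # constant binary targets from the wide layer
    s_f = softsign(f, gamma)     # differentiable proxy codes in the narrow layer
    g_h = (b_h @ b_h.T) / m      # normalized pairwise dot products, wide layer
    g_f = (s_f @ s_f.T) / k      # same for the narrow layer
    off_diag = ~np.eye(n, dtype=bool)
    return np.abs(g_h - g_f)[off_diag].sum() / (n * (n - 1))
```

For instance, a pair with identical wide-layer codes but orthogonal narrow-layer codes contributes |1 − 0| = 1 before averaging, while a pair whose normalized products agree contributes nearly nothing.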
To make Hamming distances comparable between the high-dimensional and the low-dimensional spaces we normalize them by dividing by the corresponding vector dimensions.\nThe L_DMR function can be interpreted as the empirical expected value of the loss function l(d_h, d_f) = 2 · |d_h − d_f|, where d_h is the normalized Hamming distance in the high-dimensional space that is assumed to be constant and d_f is the normalized distance in the low-dimensional space calculated on the quantized vectors.\nThe motivation behind using this kind of regularization procedure is as follows. A usual approach for learning informative and discriminative feature embeddings is to take intermediate layers of the network, concatenate them and obtain a high-dimensional representation that provides better benchmark results. However, practical applications such as image matching require binary, short and compact representations for the sake of efficiency. Therefore, the role of the L_DMR regularizer is to map the good embeddings from the high-dimensional to the compact binary space.\n\n3.3 Adjusted Binarization Representation Entropy Regularizer\n\nTo increase the diversity of binary vectors in the low-dimensional layer we utilize the BRE regularizer. It was initially applied in [5] to guide the discriminator D to better allocate its model capacity, by encouraging the binary activation patterns on selected intermediate layers of D to maximize the total entropy. To achieve this, floating-point features are binarized and the expected value of each of the binary dimensions is enforced to be equal to 0.² For that purpose the following regularizer is used:\n\nL_ME = (1/K) · Σ_{k=1}^K (s̄_{f,k})²,   (4)\n\nwhere s̄_{f,k} are the elements of s̄_f = (1/N) Σ_n s_{f,n}, the average of the N quantized binary vectors s_{f,n}. 
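The bit-balance term L_ME of equation (4) can be sketched as follows (again an illustrative NumPy snippet with our own naming, not the authors' implementation):

```python
import numpy as np

def me_loss(s_f):
    """L_ME, Eq. (4): mean of the squared per-bit batch averages.
    s_f: (N, K) softsign outputs in (-1, 1). The loss is zero exactly when
    every bit is balanced across the batch (half +1, half -1), i.e. each
    bit's marginal entropy is maximal."""
    s_bar = s_f.mean(axis=0)       # average quantized code over the batch
    return np.mean(s_bar ** 2)
```

A perfectly balanced batch of codes gives a loss of 0, while a batch in which some bit is always +1 (or always −1) is penalized with that bit contributing its squared mean of 1.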
To promote the independence between the binary variables, a loss term L_AC is proposed in [5]:\n\nL_AC = 1/(N(N−1)) · Σ_{k,j=1, k≠j}^N |s_{f,k}^T s_{f,j}| / K.   (5)\n\nThe BRE regularizer introduced in [5] is defined as the sum of the L_ME and L_AC losses. Effectively, we would like to increase the diversity of the binary vectors whose dot product is equal to zero, i.e. whose distance is close to the middle of the range, while for vectors with a dot product different from zero the importance of the diversity is lower, hence it can be downweighted. Therefore, we propose to amend the formulation of the BRE regularizer and replace L_AC with its weighted version as defined below:\n\nL_MAC = Σ_{k,j=1, k≠j}^N (α_{k,j} / Z) · |s_{f,k}^T s_{f,j}| / K,   (6)\n\nwhere the weights α_{k,j} are associated with the corresponding pairs s_{f,k} and s_{f,j}, and Z = Σ_{k,j=1, k≠j}^N α_{k,j} is a normalization constant.\nIt can be observed that p_{k,j} = α_{k,j}/Z (for α_{k,j} ≥ 0) constitutes the discrete distribution responsible for selecting pairs of vectors for regularization. Practically, it was shown in [5] that L_AC is the empirical estimation of E[b^T b'/K], where b and b' are zero-mean multivariate Bernoulli vectors that are independent. The L_MAC criterion can be seen as an empirical estimation of E_{p_{k,j}}[b^T b'/K], where the pairs b and b' are bound by the p_{k,j} distribution.\nWe propose to define α_{k,j} in the following manner:\n\nα_{k,j} = exp( −|b_{h,k}^T b_{h,j}| / (β · M) ) = exp( −|M − 2 · d_H(b_{h,k}, b_{h,j})| / (β · M) ),   (7)\n\nwhere b_{h,k} are binary vectors from the high-dimensional layer and β is a hyperparameter that controls the variance of the distances. 
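The weighted decorrelation term of equations (6)–(7) can be sketched as follows (an illustrative NumPy implementation with our own naming; the released code may organize this differently):

```python
import numpy as np

def mac_loss(s_f, b_h, beta=0.5):
    """Weighted decorrelation term L_MAC, Eqs. (6)-(7), sketched in NumPy.
    s_f: (N, K) softsign codes; b_h: (N, M) binary high-dimensional codes.
    Pairs whose high-dimensional codes are nearly uncorrelated
    (|b_h,k^T b_h,j| close to 0, Hamming distance near M/2) receive the
    largest weights alpha_{k,j}; strongly correlated pairs are downweighted."""
    n, k = s_f.shape
    m = b_h.shape[1]
    off_diag = ~np.eye(n, dtype=bool)
    alpha = np.exp(-np.abs(b_h @ b_h.T) / (beta * m))    # Eq. (7)
    z = alpha[off_diag].sum()                            # normalizer Z
    pair_corr = np.abs(s_f @ s_f.T) / k                  # |s_f,k^T s_f,j| / K
    return (alpha[off_diag] / z * pair_corr[off_diag]).sum()   # Eq. (6)
```

As in the text, b_h would be held constant during optimization, so the gradient only flows through the low-dimensional codes s_f.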
As we mentioned before, we would like to promote for regularization low-dimensional vectors that are not strongly correlated in the high-dimensional space h; therefore we propose the function exp(−|a|/β), which takes the highest values for a close to 0. As a consequence, we promote in the criterion L_MAC the pairs of vectors whose distances are around M/2 and put less weight on the pairs whose distances in the high-dimensional space are close to 0 or M. While optimizing L_MAC in each iteration of the gradient method we assume that the b_{h,k} are constant, calculated from the h(x) layer of the discriminator by application of the sign(·) function.\n\n² We assume {−1, 1} and independence between them is enforced by minimizing the correlation.\n\n3.4 Training BinGAN\n\nWe train our BinGAN model in a typical unsupervised GAN scheme, by an alternating procedure of updating the discriminator D and generator G. The discriminator D is trained using the following learning objective:\n\nL = L_D + λ_DMR · L_DMR + λ_BRE · (L_ME + L_MAC),   (8)\n\nwhere λ_DMR, λ_BRE are regularization parameters. λ_DMR defines the impact of the DMR regularization term and λ_BRE defines the impact of the two BRE terms, L_ME and L_MAC. L_D = −E_x∼pdata(x)[log(D(x))] − E_z∼pz(z)[log(1 − D(G(z)))] is the loss for training the discriminator.\nThe training procedure follows the standard methodology for this type of model, training the generator G and the discriminator D in alternating steps. The generator in the BinGAN model is trained by minimizing the feature matching criterion given by equation (2). The discriminator is updated to minimize the loss function defined by equation (8). The alternating procedure of
The alternating procedure of\nupdating generator and discriminator is repeated for each of the minibatches considered in the current\nepoch.\n\n4 Results\n\nWe conduct experiments on two benchmark datasets, Brown gray-scale patches [3] and CIFAR-10\ncolor images [12]. These benchmarks are used to evaluate the quality of our approach on image\nmatching and image retrieval tasks, respectively.\n\n4.1 Model Architecture and Parameter Settings\n\nFor both tasks, we use the same generator architecture and slight modi\ufb01cations of the discriminator.\nBelow we outline the main features of both models and their parameters.\nFor the image matching task the discriminator is composed of 7 convolutional layers (3x3 kernels,\n3 layers with 96 kernels and 4 layers with 128 kernels), two network-in-network (NiN) [15] layers\n(with 256 and 128 units respectively) and discriminative layer. For the low-dimensional feature space\nbf we take the average-pooled NiN layer composed of 256 units. For the high-dimensional space bh\nwe take the reshaped output of the last convolutional layer that is composed of 9216 units.\nFor image retrieval the discriminator is composed of: 7 convolutional layers (3x3 kernels, 3 layers\nwith 96 kernels and 4 layers with 192 kernels), two NiN layers with 192 units, one fully-connected\nlayer with three variants of (16, 32, 64 units) and discriminative layer. For the low-dimensional\nfeature space bf we take fully-connected layer, and for the high-dimensional space bh we take\naverage-pooled last NiN layer.\nThere are 4 hyperparameters in our method: \u03b3, \u03b2 and regularization parameters: \u03bbDM R, \u03bbBRE. In all\nour experiments, we \ufb01x the parameters to: \u03bbDM R = 0.05, \u03bbBRE = 0.01, \u03b3 = 0.001 and \u03b2 = 0.5.\nThe values of the hyperparameters were set according to the following motivations. 
The hy-\nperparameter \u03b3 controls the softness of the sign(\u00b7) function and the value was set according to\nsuggestions provided in [5] therefore additional tuning was not needed. The value of a scaling\nparameter \u03b2 was set according to prior assumptions based on the analysis of the impact of\nscaling factor for the Laplace distribution. We scale the distances by the number of the units\n(M), therefore the value of \u03b2 can be constant among various applications. The values of reg-\nularization terms \u03bbBRE and \u03bbDM R were \ufb01xed empirically following the methodology provided in [6].\n\n6\n\n\f16 bit\nMethod\n13.59\nKHM\n13.98\nSphH\n12.55\nSpeH\n12.95\nSH\n12.91\nPCAH\nLSH\n12.55\nPCA-ITQ 15.67\n16.17\nDH\nDeepBit\n19.43\nDBD-MQ 21.53\nBinGAN\n30.05\n\n32 bit\n13.93\n14.58\n12.42\n14.09\n12.60\n13.76\n16.20\n16.62\n24.86\n26.50\n34.65\n\n64 bit\n14.46\n15.38\n12.56\n13.89\n12.10\n15.07\n16.64\n16.96\n27.73\n31.85\n36.77\n\nFigure 1: (Left) Performance comparison (mAP, %) of different unsupervised hashing algorithms on\nthe CIFAR-10 dataset. This table shows the mean Average Precision (mAP) of top 1000 returned\nimages with respect to different number of hash bits. We report the results for all the methods except\nfor BinGAN after [6]. (Right) Top retrieved image matches from CIFAR-10 dataset for given query\nimages from test set - \ufb01rst column.\n\n4.2\n\nImage Retrieval\n\nIn this experiment we use CIFAR-10 dataset to evaluate the quality of our approach in image retrieval.\nCIFAR-10 dataset has 10 categories and each of them is composed of 6,000 pictures with a resolution\n32 \u00d7 32 color images. 
The whole dataset has 50,000 training and 10,000 testing images.\nTo compare the binary descriptor generated with our BinGAN model with the competing approaches, we evaluate several unsupervised state-of-the-art methods: KMH [10], Spherical Hashing (SphH) [11], PCAH [29], Spectral Hashing (SpeH) [30], Semantic Hashing (SH) [20], LSH [2], PCA-ITQ [8], Deep Hashing (DH) [16], DeepBit [14] and deep binary descriptor with multi-quantization (DBD-MQ) [6]. For all methods except DH, DeepBit, DBD-MQ and ours, we follow [16] and compute hashes on 512-d GIST descriptors. The table in Fig. 1 shows the CIFAR-10 retrieval results based on the mean Average Precision (mAP) of the top 1000 returned images with respect to different bit lengths. Fig. 1 also shows the top 10 images retrieved from the database for a given query image from our test data.\nOur method outperforms DBD-MQ, the unsupervised method previously reporting state-of-the-art results on this dataset, for 16, 32 and 64 bits. The performance improvement in terms of mean Average Precision reaches over 40%, 31% and 15%, respectively. The most significant performance boost can be observed for the shortest binary strings, since, thanks to the loss terms introduced in our method, we explicitly model the distribution of the information in a low-dimensional binary space.\n\n4.3 Image Matching\n\nTo evaluate the performance of our approach on the image matching task, we use the Brown dataset [3] and train binary local feature descriptors using our BinGAN method and competing previous methods, applying the methodology described in [14]. The Brown dataset is composed of three subsets of patches: Yosemite, Liberty and Notredame. The resolution of the patches is 64 × 64, although we subsample them to 32 × 32 to increase the processing efficiency when using the method to create binary descriptors in practice. 
The data is split into training and test sets according to the provided ground truth, with 50,000 training pairs (25,000 matched and 25,000 non-matched pairs) and 10,000 test pairs (5,000 matched and 5,000 non-matched pairs), respectively.\nTab. 1 shows the false positive rates at 95% true positives (FPR@95%) for binary descriptors generated with our BinGAN approach compared with several state-of-the-art descriptors. Among the compared approaches, Boosted SSC [22], BRISK [13], BRIEF [4], DeepBit [14] and DBD-MQ [6] are unsupervised binary descriptors while LDAHash [25], D-BRIEF [28], BinBoost [27] and RFD [7] are supervised. The real-valued SIFT [17] is provided for reference. Our BinGAN approach achieves the lowest FPR@95% value of all unsupervised binary descriptors. The improvement over the state-of-the-art competitor, DBD-MQ, is especially visible when testing on Yosemite.\n\n(a) Fake patches  (b) True patches\n\nFigure 2: We present the generative capabilities of the BinGAN model for the Liberty dataset. Fake patches generated by the model are shown in Fig. 2a and true patches from the data in Fig. 2b.\n\nTable 1: False positive rates at 95% true positives (FPR@95%, %) obtained for our BinGAN descriptor compared with the state-of-the-art binary descriptors on the Brown dataset. Real-valued SIFT features are provided for reference. We report all the results from [6], except for L2-Net and BinGAN.\n\nTrain:                         Yosemite    Yosemite  Notre Dame  Notre Dame  Liberty     Liberty   Average\nTest:                          Notre Dame  Liberty   Yosemite    Liberty     Notre Dame  Yosemite  FPR@95%\nSupervised\nLDAHash (16 bytes)             51.58       49.66     52.95       49.66       51.58       52.95     51.40\nD-BRIEF (4 bytes)              43.96       53.39     46.22       51.30       43.10       47.29     47.54\nBinBoost (8 bytes)             14.54       21.67     18.96       20.49       16.90       22.88     19.24\nRFD (50-70 bytes)              11.68       19.40     14.50       19.35       13.23       16.99     15.86\nBinary L2-Net [26] (32 bytes)  2.51        6.65      4.04        4.01        1.90        5.61      4.12\nUnsupervised\nSIFT (128 bytes)               28.09       36.27     29.15       36.27       28.09       29.15     31.17\nBRISK (64 bytes)               74.88       79.36     73.21       79.36       74.88       73.21     75.81\nBRIEF (32 bytes)               54.57       59.15     54.96       59.15       54.57       54.96     56.23\nDeepBit (32 bytes)             29.60       34.41     63.68       32.06       26.66       57.61     40.67\nDBD-MQ (32 bytes)              27.20       33.11     57.24       31.10       25.78       57.15     38.59\nBinGAN (32 bytes)              16.88       26.08     40.80       25.76       27.84       47.64     30.76\n\nTable 2: Ablation study. False positive rates at 95% true positives (FPR@95%) for four settings of the λ parameters when training BinGAN for image matching. Optimizing all three loss terms leads to the best performance on the Brown dataset.\n\nTrain:                        Yosemite    Yosemite  Notre Dame  Notre Dame  Liberty     Liberty   Average\nTest:                         Notre Dame  Liberty   Yosemite    Liberty     Notre Dame  Yosemite  FPR@95%\nλDMR = λBRE = 0               32.72       39.44     39.44       27.92       27.24       50.48     36.21\nλDMR = 0, λBRE = 0.01         30.12       36.28     44.20       24.28       26.44       51.88     35.53\nλDMR = 0.05, λBRE = 0         24.68       26.96     40.16       27.00       27.28       45.28     31.90\nλDMR = 0.05, λBRE = 0.01      16.88       26.08     40.80       25.76       27.84       47.64     30.76\n\nFurthermore, we examine the influence of BinGAN's regularization terms on the performance of the resulting binary descriptor. Tab. 2 shows the results of this ablation study. Using binarized features from a GAN trained without any additional loss terms already provides state-of-the-art results in terms of average FPR@95%. By adding the Distance Matching Regularizer (λDMR ≠ 0) we can observe
By adding the Distance Matching Regularizer (λDMR ≠ 0) we observe a significant improvement in almost all testing cases. An additional performance boost can be observed when adding the adjusted BRE regularizer. We can therefore conclude that the results of our BinGAN approach can be attributed to the combination of the two regularization terms proposed in this work.

Figure 3: The set of randomly selected patches from the original data (odd columns) and corresponding synthetically generated patches (even columns) that are at the closest Hamming distance to the true patch in the binary descriptor space: (a) Notre Dame, (b) Yosemite.

4.4 Generative Capabilities of BinGAN

Contrary to previous methods for learning binary image descriptors, our approach allows us to synthetically generate new image patches that can then be used for semi-supervised learning. Fig. 2 presents the images created by a generator trained on the Liberty dataset. Fake and true images are difficult to differentiate. Additionally, Fig. 3 presents patch pairs that consist of a true patch and a synthetically generated patch with the closest Hamming distance to the true patch in the binary descriptor space. The majority of generated patches are fairly similar to the original ones, which suggests that these patches could be used for semi-supervised training of more powerful binary descriptors; we leave this for future work.

5 Conclusions

In this work, we presented a novel approach for learning compact binary image descriptors that exploits regularized Generative Adversarial Networks. The proposed BinGAN architecture is trained with two regularization terms that weight the importance of dimensions using the correlation matrix and propagate the distances from the high-dimensional to the low-dimensional space of the discriminator.
The resulting binary descriptor is highly compact yet discriminative, providing state-of-the-art results on two benchmark datasets for image matching and image retrieval.

Acknowledgements

This research was partially supported by the Polish National Science Centre grant no. UMO-2016/21/D/ST6/01946 as well as a Google Sponsor Research Agreement under the project "Efficient visual localization on mobile devices". The research conducted by Maciej Zieba has been partially co-financed by the Ministry of Science and Higher Education, Republic of Poland. We would especially like to thank Karol Kurach, Jan Hosang, Adam Bielski and Aleksander Holynski for their valuable insights and discussion.

References

[1] A. Alahi, R. Ortiz, and P. Vandergheynst. FREAK: Fast retina keypoint. In CVPR, 2012.

[2] A. Andoni and P. Indyk. Near-optimal hashing algorithms for approximate nearest neighbor in high dimensions. In FOCS, 2006.

[3] M. Brown, G. Hua, and S. Winder. Discriminative learning of local image descriptors. TPAMI, 33(1):43–57, 2011.

[4] M. Calonder, V. Lepetit, C. Strecha, and P. Fua. BRIEF: Binary robust independent elementary features. In ECCV, 2010.

[5] Y. Cao, G. W. Ding, K. Y.-C. Lui, and R. Huang. Improving GAN training via binarized representation entropy (BRE) regularization. In ICLR, 2018.

[6] Y. Duan, J. Lu, Z. Wang, J. Feng, and J. Zhou. Learning deep binary descriptor with multi-quantization. In CVPR, 2017.

[7] B. Fan, Q. Kong, T. Trzcinski, Z. Wang, C. Pan, and P. Fua. Receptive fields selection for binary feature description. TIP, 23(6):2583–2595, 2014.

[8] Y. Gong, S. Lazebnik, A. Gordo, and F. Perronnin. Iterative quantization: A procrustean approach to learning binary codes for large-scale image retrieval. TPAMI, 35(12):2916–2929, 2013.

[9] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio.
Generative adversarial nets. In NIPS, 2014.

[10] K. He, F. Wen, and J. Sun. K-means hashing: An affinity-preserving quantization method for learning binary compact codes. In CVPR, 2013.

[11] J.-P. Heo, Y. Lee, J. He, S.-F. Chang, and S.-E. Yoon. Spherical hashing. In CVPR, 2012.

[12] A. Krizhevsky and G. Hinton. Learning multiple layers of features from tiny images. 2009.

[13] S. Leutenegger, M. Chli, and R. Y. Siegwart. BRISK: Binary robust invariant scalable keypoints. In ICCV, 2011.

[14] K. Lin, J. Lu, C.-S. Chen, and J. Zhou. Learning compact binary descriptors with unsupervised deep neural networks. In CVPR, 2016.

[15] M. Lin, Q. Chen, and S. Yan. Network in network. arXiv preprint arXiv:1312.4400, 2013.

[16] V. E. Liong, J. Lu, G. Wang, P. Moulin, J. Zhou, et al. Deep hashing for compact binary codes learning. In CVPR, 2015.

[17] D. G. Lowe. Distinctive image features from scale-invariant keypoints. IJCV, 60(2):91–110, 2004.

[18] Z. Qiu, Y. Pan, T. Yao, and T. Mei. Deep semantic hashing with generative adversarial networks. In SIGIR, 2017.

[19] A. Radford, L. Metz, and S. Chintala. Unsupervised representation learning with deep convolutional generative adversarial networks. In ICLR, 2016.

[20] R. Salakhutdinov and G. Hinton. Semantic hashing. International Journal of Approximate Reasoning, 50(7):969–978, 2009.

[21] T. Salimans, I. Goodfellow, W. Zaremba, V. Cheung, A. Radford, and X. Chen. Improved techniques for training GANs. In NIPS, 2016.

[22] G. Shakhnarovich. Learning task-specific similarity. PhD thesis, MIT, 2005.

[23] E. Simo-Serra, E. Trulls, L. Ferraz, I. Kokkinos, P. Fua, and F. Moreno-Noguer. Discriminative learning of deep convolutional feature point descriptors. In ICCV, 2015.

[24] J. Song, T. He, L. Gao, X. Xu, A. Hanjalic, and H. Shen. Binary generative adversarial networks for image retrieval. In AAAI, 2018.

[25] C. Strecha, A.
Bronstein, M. Bronstein, and P. Fua. LDAHash: Improved matching with smaller descriptors. TPAMI, 34(1):66–78, 2012.

[26] Y. Tian, B. Fan, and F. Wu. L2-Net: Deep learning of discriminative patch descriptor in Euclidean space. In CVPR, 2017.

[27] T. Trzcinski, M. Christoudias, P. Fua, and V. Lepetit. Boosting binary keypoint descriptors. In CVPR, 2013.

[28] T. Trzcinski and V. Lepetit. Efficient discriminative projections for compact binary descriptors. In ECCV, 2012.

[29] J. Wang, S. Kumar, and S.-F. Chang. Semi-supervised hashing for scalable image retrieval. In CVPR, 2010.

[30] Y. Weiss, A. Torralba, and R. Fergus. Spectral hashing. In NIPS, 2009.