{"title": "Co-Regularized Hashing for Multimodal Data", "book": "Advances in Neural Information Processing Systems", "page_first": 1376, "page_last": 1384, "abstract": "Hashing-based methods provide a very promising approach to large-scale similarity search. To obtain compact hash codes, a recent trend seeks to learn the hash functions from data automatically. In this paper, we study hash function learning in the context of multimodal data. We propose a novel multimodal hash function learning method, called Co-Regularized Hashing (CRH), based on a boosted co-regularization framework. The hash functions for each bit of the hash codes are learned by solving DC (difference of convex functions) programs, while the learning for multiple bits proceeds via a boosting procedure so that the bias introduced by the hash functions can be sequentially minimized. We empirically compare CRH with two state-of-the-art multimodal hash function learning methods on two publicly available data sets.", "full_text": "Co-Regularized Hashing for Multimodal Data\n\nYi Zhen and Dit-Yan Yeung\n\nDepartment of Computer Science and Engineering\nHong Kong University of Science and Technology\n\nClear Water Bay, Kowloon, Hong Kong\n{yzhen,dyyeung}@cse.ust.hk\n\nAbstract\n\nHashing-based methods provide a very promising approach to large-scale similar-\nity search. To obtain compact hash codes, a recent trend seeks to learn the hash\nfunctions from data automatically. In this paper, we study hash function learning\nin the context of multimodal data. We propose a novel multimodal hash function\nlearning method, called Co-Regularized Hashing (CRH), based on a boosted co-\nregularization framework. The hash functions for each bit of the hash codes are\nlearned by solving DC (difference of convex functions) programs, while the learn-\ning for multiple bits proceeds via a boosting procedure so that the bias introduced\nby the hash functions can be sequentially minimized. 
We empirically compare\nCRH with two state-of-the-art multimodal hash function learning methods on two\npublicly available data sets.\n\n1\n\nIntroduction\n\nNearest neighbor search, a.k.a. similarity search, plays a fundamental role in many important ap-\nplications, including document retrieval, object recognition, and near-duplicate detection. Among\nthe methods proposed thus far for nearest neighbor search [1], hashing-based methods [2, 3] have\nattracted considerable interest in recent years. The major advantage of hashing-based methods is\nthat they index data using binary hash codes which enjoy not only low storage requirements but\nalso high computational ef\ufb01ciency. To preserve similarity in the data, a family of algorithms called\nlocality sensitive hashing (LSH) [4, 5] has been developed over the past decade. The basic idea of\nLSH is to hash the data into bins so that the collision probability re\ufb02ects data similarity. LSH is very\nappealing in that it has theoretical guarantee and is also simple to implement. However, in practice\nLSH algorithms often generate long hash codes in order to achieve acceptable performance because\nthe theoretical guarantee only holds asymptotically. This shortcoming can be attributed largely to\ntheir data-independent nature which cannot capture the data characteristics very accurately in the\nhash codes. Besides, in many applications, neighbors cannot be de\ufb01ned easily using some generic\ndistance or similarity measures. As such, a new research trend has emerged over the past few years\nby learning the hash functions from data automatically. In the sequel, we refer to this new trend as\nhash function learning (HFL).\nBoosting, as one of the most popular machine learning approaches, was \ufb01rst applied to learning hash\nfunctions for pose estimation [6]. Later, impressive performance for HFL using restricted Boltz-\nmann machines was reported [7]. 
These two early HFL methods have been successfully applied to content-based image retrieval, in which large-scale data sets are commonly encountered [8]. A number of algorithms have been proposed since then. Spectral hashing (SH) [9] treats HFL as a special case of manifold learning and uses an efficient algorithm based on eigenfunctions. One shortcoming of spectral hashing is its assumption that the data be uniformly distributed. To overcome this limitation, several methods have been proposed, including binary reconstructive embeddings [10], shift-invariant kernel hashing [11], distribution matching [12], optimized kernel hashing [13], and minimal loss hashing [14]. Recently, some semi-supervised hashing models have been developed to combine both feature similarity and semantic similarity for HFL [15, 16, 17, 18]. To further improve the scalability of these methods, Liu et al. [19] presented a fast algorithm based on anchor graphs.
Existing HFL algorithms have enjoyed wide success in challenging applications. Nevertheless, they can only be applied to a single type of data, called unimodal data, which refers to data from a single modality such as image, text, or audio. Nowadays, it is common to find similarity search applications that involve multimodal data. For example, given an image of a tourist attraction as query, one would like to retrieve some textual documents that provide more detailed information about the place of interest. Because data from different modalities reside in different feature spaces, performing multimodal similarity search will be made much easier and faster if the multimodal data can be mapped into a common Hamming space. However, it is challenging to do so because data from different modalities generally have very different representations.
As far as we know, there exist only two multimodal HFL methods. Bronstein et al.
[20] made the\n\ufb01rst attempt to learn linear hash functions using eigendecomposition and boosting, while Kumar\net al. [21] extended spectral hashing to the multiview setting and proposed a cross-view hashing\nmodel. One major limitation of these two methods is that they both rely on eigendecomposition\noperations which are computationally very demanding when the data dimensionality is high. More-\nover, they consider applications for shape retrieval, image alignment, and people search which are\nquite different from the multimodal retrieval applications of interest here.\nIn this paper, we propose a novel multimodal HFL method, called Co-Regularized Hashing (CRH),\nbased on a boosted co-regularization framework. For each bit of the hash codes, CRH learns a group\nof hash functions, one for each modality, by minimizing a novel loss function. Although the loss\nfunction is non-convex, it is in a special form which can be expressed as a difference of convex\nfunctions. As a consequence, the Concave-Convex Procedure (CCCP) [22] can be applied to solve\nthe optimization problem iteratively. We use a stochastic sub-gradient method, which converges\nvery fast, in each CCCP iteration to \ufb01nd a local optimum. After learning the hash functions for one\nbit, CRH proceeds to learn more bits via a boosting procedure such that the bias introduced by the\nhash functions can be sequentially minimized.\nIn the next section, we present the CRH method in detail. Extensive empirical study using two data\nsets is reported in Section 3. 
Finally, Section 4 concludes the paper.

2 Co-Regularized Hashing

We use boldface lowercase letters and calligraphic letters to denote vectors and sets, respectively. For a vector $\mathbf{x}$, $\mathbf{x}^T$ denotes its transpose and $\|\mathbf{x}\|$ its $\ell_2$ norm.

2.1 Objective Function

Suppose that there are two sets of data points from two modalities,^1 e.g., $\{\mathbf{x}_i \in \mathcal{X}\}_{i=1}^{I}$ for a set of $I$ images from some feature space $\mathcal{X}$ and $\{\mathbf{y}_j \in \mathcal{Y}\}_{j=1}^{J}$ for a set of $J$ textual documents from another feature space $\mathcal{Y}$. We also have a set of $N$ inter-modality point pairs $\Theta = \{(\mathbf{x}_{a_1}, \mathbf{y}_{b_1}), (\mathbf{x}_{a_2}, \mathbf{y}_{b_2}), \ldots, (\mathbf{x}_{a_N}, \mathbf{y}_{b_N})\}$, where, for the $n$th pair, $a_n$ and $b_n$ are indices of the points in $\mathcal{X}$ and $\mathcal{Y}$, respectively. We further assume that each pair has a label $s_n = 1$ if $\mathbf{x}_{a_n}$ and $\mathbf{y}_{b_n}$ are similar and $s_n = 0$ otherwise. The notion of inter-modality similarity varies from application to application. For example, if an image includes a tiger and a textual document is a research paper on tigers, they should be labeled as similar. On the other hand, it is highly unlikely that the image would be labeled as similar to a textual document on basketball.
For each bit of the hash codes, we define two linear hash functions as follows:
$$f(\mathbf{x}) = \mathrm{sgn}(\mathbf{w}_x^T\mathbf{x}) \quad \text{and} \quad g(\mathbf{y}) = \mathrm{sgn}(\mathbf{w}_y^T\mathbf{y}),$$
where $\mathrm{sgn}(\cdot)$ denotes the sign function, and $\mathbf{w}_x$ and $\mathbf{w}_y$ are projection vectors which, ideally, should map similar points to the same hash bin and dissimilar points to different bins. Our goal is to achieve HFL by learning $\mathbf{w}_x$ and $\mathbf{w}_y$ from the multimodal data.

^1 For simplicity of presentation, we focus on the bimodal case here and leave the discussion of the extension to more than two modalities to Section 2.4.

To achieve this goal, we propose to minimize the following objective function w.r.t.
(with respect to) $\mathbf{w}_x$ and $\mathbf{w}_y$:
$$\mathcal{O} = \frac{1}{I}\sum_{i=1}^{I} \ell_i^x + \frac{1}{J}\sum_{j=1}^{J} \ell_j^y + \gamma \sum_{n=1}^{N} \omega_n \ell_n^* + \frac{\lambda_x}{2}\|\mathbf{w}_x\|^2 + \frac{\lambda_y}{2}\|\mathbf{w}_y\|^2, \quad (1)$$
where $\ell_i^x$ and $\ell_j^y$ are intra-modality loss terms for modalities $\mathcal{X}$ and $\mathcal{Y}$, respectively. In this work, we define them as:
$$\ell_i^x = \left[1 - f(\mathbf{x}_i)(\mathbf{w}_x^T\mathbf{x}_i)\right]_+ = \left[1 - |\mathbf{w}_x^T\mathbf{x}_i|\right]_+, \qquad \ell_j^y = \left[1 - g(\mathbf{y}_j)(\mathbf{w}_y^T\mathbf{y}_j)\right]_+ = \left[1 - |\mathbf{w}_y^T\mathbf{y}_j|\right]_+,$$
where $[a]_+$ equals $a$ if $a \ge 0$ and $0$ otherwise. We note that the intra-modality loss terms are similar to the hinge loss in the (linear) support vector machine but have a quite different meaning. Conceptually, we want the projected values to be far away from 0 and hence expect the hash functions learned to have good generalization ability [16]. For the inter-modality loss term $\ell_n^*$, we associate with each point pair a weight $\omega_n$, with $\sum_{n=1}^{N}\omega_n = 1$, to normalize the loss as well as to compute the bias of the hash functions. In this paper, we define $\ell_n^*$ as
$$\ell_n^* = s_n d_n^2 + (1 - s_n)\,\tau(d_n),$$
where $d_n = \mathbf{w}_x^T\mathbf{x}_{a_n} - \mathbf{w}_y^T\mathbf{y}_{b_n}$ and $\tau(d)$ is called the smoothly clipped inverted squared deviation (SCISD) function. The loss thus defined requires that similar inter-modality points, i.e., $s_n = 1$, have small distance after projection, and that dissimilar ones, i.e., $s_n = 0$, have large distance. With these two kinds of loss terms, we expect the learned hash functions to enjoy the large-margin property while effectively preserving the inter-modality similarity.
The SCISD function was first proposed in [23]. It can be defined as follows:
$$\tau(d) = \begin{cases} -\frac{1}{2}d^2 + \frac{a\lambda^2}{2} & \text{if } |d| \le \lambda \\[2pt] \frac{d^2 - 2a\lambda|d| + a^2\lambda^2}{2(a-1)} & \text{if } \lambda < |d| \le a\lambda \\[2pt] 0 & \text{if } a\lambda < |d|, \end{cases}$$
where $a$ and $\lambda$ are two user-specified parameters. The SCISD function penalizes projection vectors that result in small distance between dissimilar points after projection. A more important property is that it can be expressed as a difference of two convex functions. Specifically, we can express $\tau(d) = \tau_1(d) - \tau_2(d)$, where
$$\tau_1(d) = \begin{cases} 0 & \text{if } |d| \le \lambda \\[2pt] \frac{ad^2 - 2a\lambda|d| + a\lambda^2}{2(a-1)} & \text{if } \lambda < |d| \le a\lambda \\[2pt] \frac{1}{2}d^2 - \frac{a\lambda^2}{2} & \text{if } a\lambda < |d| \end{cases} \qquad \text{and} \qquad \tau_2(d) = \frac{1}{2}d^2 - \frac{a\lambda^2}{2}.$$

2.2 Optimization

Though the objective function (1) is nonconvex w.r.t. $\mathbf{w}_x$ and $\mathbf{w}_y$, we can optimize it w.r.t. $\mathbf{w}_x$ and $\mathbf{w}_y$ in an alternating manner. Taking $\mathbf{w}_x$ as an example, we remove the irrelevant terms and get the following objective:
$$\mathcal{O}_x = \frac{1}{I}\sum_{i=1}^{I} \ell_i^x + \frac{\lambda_x}{2}\|\mathbf{w}_x\|^2 + \gamma\sum_{n=1}^{N}\omega_n\,\ell_n^*, \quad (2)$$
where
$$\ell_i^x = \begin{cases} 0 & \text{if } |\mathbf{w}_x^T\mathbf{x}_i| \ge 1 \\ 1 - \mathbf{w}_x^T\mathbf{x}_i & \text{if } 0 \le \mathbf{w}_x^T\mathbf{x}_i < 1 \\ 1 + \mathbf{w}_x^T\mathbf{x}_i & \text{if } -1 < \mathbf{w}_x^T\mathbf{x}_i < 0. \end{cases}$$
It is easy to see that the objective function (2) can be expressed as a difference of two convex functions in each of these cases. As a consequence, we can use CCCP to solve the nonconvex optimization problem iteratively, with each iteration minimizing a convex upper bound of the original objective function.
Briefly speaking, given an objective function $f_0(\mathbf{x}) - g_0(\mathbf{x})$ where both $f_0$ and $g_0$ are convex, CCCP works iteratively as follows. The variable $\mathbf{x}$ is first randomly initialized to $\mathbf{x}^{(0)}$. At the $t$th iteration, CCCP minimizes the following convex upper bound of $f_0(\mathbf{x}) - g_0(\mathbf{x})$ at location $\mathbf{x}^{(t)}$:
$$f_0(\mathbf{x}) - \left(g_0(\mathbf{x}^{(t)}) + \partial_{\mathbf{x}} g_0(\mathbf{x}^{(t)})\,(\mathbf{x} - \mathbf{x}^{(t)})\right),$$
where $\partial_{\mathbf{x}} g_0(\mathbf{x}^{(t)})$ is the first derivative of $g_0(\mathbf{x})$ at $\mathbf{x}^{(t)}$. This optimization problem can be solved using any convex optimization solver to obtain $\mathbf{x}^{(t+1)}$. Given an initial value $\mathbf{x}^{(0)}$, the solution sequence $\{\mathbf{x}^{(t)}\}$ found by CCCP is guaranteed to reach a local minimum or a saddle point.
For our problem, the optimization problem at the $t$th iteration minimizes the following upper bound of Equation (2) w.r.t. $\mathbf{w}_x$:
$$\mathcal{O}_x = \frac{\lambda_x\|\mathbf{w}_x\|^2}{2} + \gamma\sum_{n=1}^{N}\omega_n\left(s_n d_n^2 + (1 - s_n)\,\zeta_n^x\right) + \frac{1}{I}\sum_{i=1}^{I}\ell_i^x, \quad (3)$$
where $\zeta_n^x = \tau_1(d_n) - \tau_2(d_n^{(t)}) - d_n^{(t)}\mathbf{x}_{a_n}^T(\mathbf{w}_x - \mathbf{w}_x^{(t)})$, $d_n^{(t)} = (\mathbf{w}_x^{(t)})^T\mathbf{x}_{a_n} - \mathbf{w}_y^T\mathbf{y}_{b_n}$, and $\mathbf{w}_x^{(t)}$ is the value of $\mathbf{w}_x$ at the $t$th iteration.
To find a locally optimal solution to problem (3), we can use any gradient-based method. In this work, we develop a stochastic sub-gradient solver based on Pegasos [24], which is known to be one of the fastest solvers for margin-based classifiers. Specifically, we randomly select $k$ points from each modality and $l$ point pairs to evaluate the sub-gradient at each iteration.
The key step of our method is to evaluate the sub-gradient of objective function (3) w.r.t. $\mathbf{w}_x$, which can be computed as
$$\frac{\partial\mathcal{O}_x}{\partial\mathbf{w}_x} = 2\gamma\sum_{n=1}^{N}\omega_n s_n d_n\mathbf{x}_{a_n} + \gamma\sum_{n=1}^{N}\omega_n\boldsymbol{\mu}_n^x + \lambda_x\mathbf{w}_x - \frac{1}{I}\sum_{i=1}^{I}\boldsymbol{\pi}_i^x, \quad (4)$$
where $\boldsymbol{\mu}_n^x = (1 - s_n)\left(\frac{\partial\tau_1}{\partial d_n} - d_n^{(t)}\right)\mathbf{x}_{a_n}$,
$$\frac{\partial\tau_1}{\partial d_n} = \begin{cases} 0 & \text{if } |d_n| \le \lambda \\[2pt] \frac{a d_n - 2a\lambda\,\mathrm{sgn}(d_n)}{a-1} & \text{if } \lambda < |d_n| \le a\lambda \\[2pt] d_n & \text{if } a\lambda < |d_n| \end{cases} \qquad \text{and} \qquad \boldsymbol{\pi}_i^x = \begin{cases} \mathbf{0} & \text{if } |\mathbf{w}_x^T\mathbf{x}_i| \ge 1 \\ \mathrm{sgn}(\mathbf{w}_x^T\mathbf{x}_i)\,\mathbf{x}_i & \text{if } |\mathbf{w}_x^T\mathbf{x}_i| < 1. \end{cases}$$
Similarly, the objective function for the optimization problem w.r.t. $\mathbf{w}_y$ at the $t$th CCCP iteration is:
$$\mathcal{O}_y = \frac{\lambda_y\|\mathbf{w}_y\|^2}{2} + \gamma\sum_{n=1}^{N}\omega_n\left(s_n d_n^2 + (1 - s_n)\,\zeta_n^y\right) + \frac{1}{J}\sum_{j=1}^{J}\ell_j^y, \quad (5)$$
where $\zeta_n^y = \tau_1(d_n) - \tau_2(d_n^{(t)}) + d_n^{(t)}\mathbf{y}_{b_n}^T(\mathbf{w}_y - \mathbf{w}_y^{(t)})$, $d_n^{(t)} = \mathbf{w}_x^T\mathbf{x}_{a_n} - (\mathbf{w}_y^{(t)})^T\mathbf{y}_{b_n}$, $\mathbf{w}_y^{(t)}$ is the value of $\mathbf{w}_y$ at the $t$th iteration, and
$$\ell_j^y = \begin{cases} 0 & \text{if } |\mathbf{w}_y^T\mathbf{y}_j| \ge 1 \\ 1 - \mathbf{w}_y^T\mathbf{y}_j & \text{if } 0 \le \mathbf{w}_y^T\mathbf{y}_j < 1 \\ 1 + \mathbf{w}_y^T\mathbf{y}_j & \text{if } -1 < \mathbf{w}_y^T\mathbf{y}_j < 0. \end{cases}$$
The corresponding sub-gradient is given by
$$\frac{\partial\mathcal{O}_y}{\partial\mathbf{w}_y} = -2\gamma\sum_{n=1}^{N}\omega_n s_n d_n\mathbf{y}_{b_n} - \gamma\sum_{n=1}^{N}\omega_n\boldsymbol{\mu}_n^y + \lambda_y\mathbf{w}_y - \frac{1}{J}\sum_{j=1}^{J}\boldsymbol{\pi}_j^y, \quad (6)$$
where $\boldsymbol{\mu}_n^y = (1 - s_n)\left(\frac{\partial\tau_1}{\partial d_n} - d_n^{(t)}\right)\mathbf{y}_{b_n}$ and
$$\boldsymbol{\pi}_j^y = \begin{cases} \mathbf{0} & \text{if } |\mathbf{w}_y^T\mathbf{y}_j| \ge 1 \\ \mathrm{sgn}(\mathbf{w}_y^T\mathbf{y}_j)\,\mathbf{y}_j & \text{if } |\mathbf{w}_y^T\mathbf{y}_j| < 1. \end{cases}$$

2.3 Algorithm

So far we have only discussed how to learn the hash functions for one bit of the hash codes.
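Before moving on to multiple bits, the DC decomposition of the SCISD function can be checked numerically. The following is an illustrative NumPy sketch of $\tau$, $\tau_1$, and $\tau_2$ as defined above (not the authors' code); it verifies that $\tau(d) = \tau_1(d) - \tau_2(d)$ holds on a grid with the parameter choice $a = 3.7$, $\lambda = 1/a$ used later in the experiments.

```python
import numpy as np

def scisd(d, a, lam):
    """SCISD penalty tau(d): quadratic bump near 0, clipped to 0 for |d| > a*lam."""
    ad = abs(d)
    if ad <= lam:
        return -0.5 * d**2 + a * lam**2 / 2.0
    if ad <= a * lam:
        return (d**2 - 2*a*lam*ad + a**2 * lam**2) / (2.0 * (a - 1))
    return 0.0

def tau1(d, a, lam):
    """Convex part of the DC decomposition tau = tau1 - tau2."""
    ad = abs(d)
    if ad <= lam:
        return 0.0
    if ad <= a * lam:
        return (a*d**2 - 2*a*lam*ad + a*lam**2) / (2.0 * (a - 1))
    return 0.5 * d**2 - a * lam**2 / 2.0

def tau2(d, a, lam):
    """Second convex part: a simple quadratic (independent of the branch)."""
    return 0.5 * d**2 - a * lam**2 / 2.0

# Sanity check of the decomposition on a grid.
a, lam = 3.7, 1 / 3.7
for d in np.linspace(-2.0, 2.0, 101):
    assert abs(scisd(d, a, lam) - (tau1(d, a, lam) - tau2(d, a, lam))) < 1e-9
```

For instance, at $d = 0$ both sides equal $a\lambda^2/2$, and for $|d| > a\lambda$ the two convex pieces cancel exactly, as the clipped penalty requires.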
To learn the hash functions for multiple bits, one could repeat the same procedure and treat the learning for each bit independently. However, as reported in previous studies [15, 19], it is very important to take into consideration the relationships between different bits in HFL. In other words, to learn compact hash codes, we should coordinate the learning of hash functions for different bits.
To this end, we take the standard AdaBoost [25] approach to learn multiple bits sequentially. Intuitively, this approach allows learning of the hash functions in later stages to be aware of the bias introduced by their antecedents. The overall algorithm of CRH is summarized in Algorithm 1.

Algorithm 1 Co-Regularized Hashing
Input:
  $\mathcal{X}, \mathcal{Y}$ -- multimodal data
  $\Theta$ -- inter-modality point pairs
  $K$ -- code length
  $\lambda_x, \lambda_y, \gamma$ -- regularization parameters
  $a, \lambda$ -- parameters for the SCISD function
Output:
  $\mathbf{w}_x^{(k)}, k = 1, \ldots, K$ -- projection vectors for $\mathcal{X}$
  $\mathbf{w}_y^{(k)}, k = 1, \ldots, K$ -- projection vectors for $\mathcal{Y}$
Procedure:
  Initialize $\omega_n^{(1)} = 1/N$, $\forall n \in \{1, 2, \ldots, N\}$.
  for $k = 1$ to $K$ do
    repeat
      Optimize Equation (3) to get $\mathbf{w}_x^{(k)}$;
      Optimize Equation (5) to get $\mathbf{w}_y^{(k)}$;
    until convergence.
    Compute the error of the current hash functions:
      $\epsilon_k = \sum_{n=1}^{N} \omega_n^{(k)}\, \mathbb{I}[s_n \neq h_n]$,
    where $\mathbb{I}[a] = 1$ if $a$ is true and $\mathbb{I}[a] = 0$ otherwise, and $h_n = 1$ if $f(\mathbf{x}_{a_n}) = g(\mathbf{y}_{b_n})$ and $h_n = 0$ if $f(\mathbf{x}_{a_n}) \neq g(\mathbf{y}_{b_n})$.
    Set $\beta_k = \epsilon_k / (1 - \epsilon_k)$.
    Update the weight for each point pair:
      $\omega_n^{(k+1)} = \omega_n^{(k)} \beta_k^{1 - \mathbb{I}[s_n \neq h_n]}$.
  end for

In the following, we briefly analyze the time complexity of Algorithm 1 for one bit. The first computationally expensive part of the algorithm is to evaluate the sub-gradients.
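As an illustration of this step, here is a simplified stochastic sub-gradient for Equation (4), restricted to the $\mathbf{w}_x$ update. It is a sketch under assumptions: function and variable names are ours, the sampling normalization is kept minimal, and it is not the authors' solver.

```python
import numpy as np

def dtau1(d, a, lam):
    """Derivative of the convex part tau_1 of the SCISD function."""
    ad = abs(d)
    if ad <= lam:
        return 0.0
    if ad <= a * lam:
        return (a * d - 2 * a * lam * np.sign(d)) / (a - 1)
    return d

def subgradient_wx(wx, wy, X, Y, pairs, s, omega, dt,
                   gamma=1000.0, lam_x=0.01, a=3.7, lam=1/3.7,
                   k=1, l=500, rng=None):
    """One stochastic sub-gradient of Eq. (4) w.r.t. w_x (illustrative).

    pairs : (N, 2) index pairs (a_n, b_n); s : 0/1 pair labels;
    omega : boosting pair weights; dt : d_n^{(t)} from the previous
    CCCP iterate.  Samples k points and l pairs, as in the paper.
    """
    if rng is None:
        rng = np.random.default_rng(0)
    g = lam_x * wx                       # regularizer term lambda_x * w_x
    # Inter-modality terms over l sampled pairs.
    for n in rng.choice(len(pairs), size=min(l, len(pairs)), replace=False):
        an, bn = pairs[n]
        dn = wx @ X[an] - wy @ Y[bn]
        g += 2 * gamma * omega[n] * s[n] * dn * X[an]
        g += gamma * omega[n] * (1 - s[n]) * (dtau1(dn, a, lam) - dt[n]) * X[an]
    # Intra-modality hinge terms over k sampled points (1/k replaces 1/I).
    for i in rng.choice(len(X), size=min(k, len(X)), replace=False):
        p = wx @ X[i]
        if abs(p) < 1:
            g -= (1.0 / k) * np.sign(p) * X[i]
    return g
```

A Pegasos-style update would then take a step $\mathbf{w}_x \leftarrow \mathbf{w}_x - \eta_t \, g$ with a decaying step size $\eta_t$.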
The time complexity is $O((k + l)d)$, where $d$ is the data dimensionality, and $k$ and $l$ are the numbers of random points and random pairs, respectively, for the stochastic sub-gradient solver. In our experiments, we set $k = 1$ and $l = 500$. We notice that further increasing these two numbers brings no significant performance improvement. We leave the theoretical study of the impact of $k$ and $l$ to our future work. Another major computational cost comes from updating the weights of the inter-modality point pairs. The time complexity is $O(dN)$, where $N$ is the number of inter-modality point pairs.
To summarize, our algorithm scales linearly with the number of inter-modality point pairs and the data dimensionality. In practice, the number of inter-modality point pairs is usually small, making our algorithm very efficient.

2.4 Extensions

We briefly discuss two possible extensions of CRH in this subsection. First, we note that it is easy to extend CRH to learn nonlinear hash functions via the kernel trick [26]. Specifically, according to the generalized representer theorem [27], we can represent the projection vectors $\mathbf{w}_x$ and $\mathbf{w}_y$ as
$$\mathbf{w}_x = \sum_{i=1}^{I}\alpha_i\,\phi_x(\mathbf{x}_i) \quad \text{and} \quad \mathbf{w}_y = \sum_{j=1}^{J}\beta_j\,\phi_y(\mathbf{y}_j),$$
where $\phi_x(\cdot)$ and $\phi_y(\cdot)$ are kernel-induced feature maps for modalities $\mathcal{X}$ and $\mathcal{Y}$, respectively. Then the objective function (1) can be expressed in kernel form and kernel-based hash functions can be learned by minimizing a new but very similar objective function.
Another possible extension is to make CRH support more than two modalities.
Taking a new modality $\mathcal{Z}$ as an example, we need to incorporate into Equation (1) the following terms: loss and regularization terms for $\mathcal{Z}$, and all pairwise loss terms involving $\mathcal{Z}$ and the other modalities, e.g., $\mathcal{X}$ and $\mathcal{Y}$.
For both extensions, it is straightforward to adapt the algorithm presented above to solve the new optimization problems.

2.5 Discussions

CRH is closely related to a recent multimodal metric learning method called Multiview Neighborhood Preserving Projections (Multi-NPP) [23], because CRH uses a loss function for inter-modality point pairs which is similar to that of Multi-NPP. However, CRH is a general framework and other loss functions for inter-modality point pairs can also be adopted. The two methods have at least three significant differences. First, our focus is on HFL while Multi-NPP is on metric learning through embedding. Second, in addition to the inter-modality loss term, the objective function in CRH includes two intra-modality loss terms for large-margin HFL, while Multi-NPP only has a loss term for the inter-modality point pairs. Third, CRH uses boosting to sequentially learn the hash functions but Multi-NPP does not take this aspect into consideration.
As discussed briefly in [23], one may first use Multi-NPP to map multimodal data into a common real space and then apply any unimodal HFL method for multimodal hashing. However, this naive two-stage approach has some limitations. First, both stages can introduce information loss, which impairs the quality of the hash functions learned. Second, a two-stage approach generally needs more computational resources.
These two limitations can be overcome by using a one-stage method such as CRH.

3 Experiments

3.1 Experimental Settings

In our experiments, we compare CRH with two state-of-the-art multimodal hashing methods, namely, Cross-Modal Similarity Sensitive Hashing (CMSSH) [20]^2 and Cross-View Hashing (CVH) [21],^3 for two cross-modal retrieval tasks: (1) image query vs. text database; (2) text query vs. image database. The goal of each retrieval task is to find from the text (image) database the nearest neighbors for the image (text) query.
We use two benchmark data sets which are, to the best of our knowledge, the largest fully paired and labeled multimodal data sets. We further divide each data set into a database set and a query set. To train the models, we randomly select a group of documents from the database set to form the training set. Moreover, we randomly select 0.1% of the point pairs from the training set. For fair comparison, all models are trained on the same training set and the experiments are repeated with 5 independent training sets.
The mean average precision (mAP) is used as the performance measure. To compute the mAP, we first evaluate the average precision (AP) of a set of $R$ retrieved documents by
$$\mathrm{AP} = \frac{1}{L}\sum_{r=1}^{R} P(r)\,\delta(r),$$
where $L$ is the number of true neighbors in the retrieved set, $P(r)$ denotes the precision of the top $r$ retrieved documents, and $\delta(r) = 1$ if the $r$th retrieved document is a true neighbor and $\delta(r) = 0$ otherwise. The mAP is then computed by averaging the AP values over all the queries in the query set. The larger the mAP, the better the performance. In the experiments, we set $R = 50$. Besides, we also report the precision and recall within a fixed Hamming radius.
We use cross-validation to choose the parameters for CRH and find that the model performance is only mildly sensitive to the parameters.
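For concreteness, the AP and mAP measures defined above can be computed with a short sketch like the following (illustrative code, not the evaluation scripts used for the paper):

```python
def average_precision(is_relevant):
    """AP over a ranked list of R retrieved items.

    is_relevant : list of 0/1 flags; is_relevant[r-1] = 1 if the item at
                  rank r is a true neighbor (delta(r) in the paper).
    AP = (1/L) * sum_r P(r) * delta(r), where P(r) is the precision of
    the top r items and L the number of true neighbors retrieved;
    returns 0.0 if no true neighbor is retrieved.
    """
    hits, score = 0, 0.0
    for r, rel in enumerate(is_relevant, start=1):
        if rel:
            hits += 1
            score += hits / r          # P(r) at each relevant rank
    return score / hits if hits else 0.0

def mean_average_precision(per_query_flags):
    """mAP: mean of AP over all queries in the query set."""
    return sum(average_precision(f) for f in per_query_flags) / len(per_query_flags)

# Example: true neighbors at ranks 1 and 3 -> AP = (1/2)(1/1 + 2/3) = 5/6.
print(average_precision([1, 0, 1, 0]))
```

Note that this variant normalizes by the number of true neighbors actually retrieved; when no relevant item is returned, the AP for that query is taken to be 0.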
As a result, in all experiments, we set $\lambda_x = 0.01$, $\lambda_y = 0.01$, $\gamma = 1000$, $a = 3.7$, and $\lambda = 1/a$. Besides, unless specified otherwise, we fix the training set size to 2,000 and the code length $K$ to 24.

3.2 Results on Wiki

The Wiki data set, generated from Wikipedia featured articles, consists of 2,866 image-text pairs.^4 In each pair, the text is an article describing some events or people and the image is closely related to the content of the article. The images are represented by 128-dimensional SIFT [28] feature vectors, while the text articles are represented by the probability distributions over 10 topics learned by a latent Dirichlet allocation (LDA) model [29]. Each pair is labeled with one of 10 semantic classes. We simply use these class labels to identify the neighbors. Moreover, we use 80% of the data as the database set and the remaining 20% to form the query set.

^2 We used the implementation generously provided by the authors.
^3 We implemented the method ourselves because the code is not publicly available.
^4 http://www.svcl.ucsd.edu/projects/crossmodal/

The mAP values of the three methods are reported in Table 1. We can see that CRH outperforms CVH and CMSSH under all settings and CVH performs slightly better than CMSSH. We note that CMSSH ignores the intra-modality relational information and CVH simply treats each bit independently.
Hence the performance difference is expected.

Table 1: mAP comparison on Wiki

Task                            Method   K = 24            K = 48            K = 64
Image Query vs. Text Database   CRH      0.2537 ± 0.0206   0.2399 ± 0.0185   0.2392 ± 0.0131
                                CVH      0.2043 ± 0.0150   0.1788 ± 0.0149   0.1732 ± 0.0072
                                CMSSH    0.1965 ± 0.0123   0.1780 ± 0.0080   0.1624 ± 0.0073
Text Query vs. Image Database   CRH      0.2896 ± 0.0214   0.2882 ± 0.0261   0.2989 ± 0.0293
                                CVH      0.2714 ± 0.0164   0.2304 ± 0.0104   0.2156 ± 0.0202
                                CMSSH    0.2179 ± 0.0161   0.2094 ± 0.0072   0.2040 ± 0.0135

We further compare the three methods on several aspects in Figure 1. We first vary the size of the training set in subfigures 1(a) and 1(d). Although CVH performs the best when the training set is small, its performance is gradually surpassed by CRH as the size increases. We then plot the precision-recall curves and recall curves for all three methods in the remaining subfigures. It is clear that CRH outperforms its two counterparts by a large margin.

Figure 1: Results on Wiki. (a) Varying training set size; (b) precision-recall curve; (c) recall curve; (d) varying training set size; (e) precision-recall curve; (f) recall curve. Panels (a)-(c): image query vs. text database; panels (d)-(f): text query vs. image database.

3.3 Results on Flickr

The Flickr data set consists of 186,577 image-tag pairs pruned from the NUS data set^5 [30] by keeping the pairs that belong to one of the 10 largest classes. The images are represented by 500-dimensional SIFT vectors. To obtain more compact representations of the tags, we perform PCA on the original tag occurrence features and obtain 1000-dimensional feature vectors. Each pair is annotated by at least one of 10 semantic labels, and two points are defined as neighbors if they share at least one label.
We use 99% of the data as the database set and the remaining 1% to form the query set.

^5 http://lms.comp.nus.edu.sg/research/NUS-WIDE.htm

The mAP values of the three methods are reported in Table 2. In the task of image query vs. text database, CRH performs comparably to CMSSH, which is better than CVH. However, in the other task, CRH achieves the best performance.

Table 2: mAP comparison on Flickr

Task                            Method   K = 24            K = 48            K = 64
Image Query vs. Text Database   CRH      0.5259 ± 0.0094   0.4990 ± 0.0075   0.4929 ± 0.0064
                                CVH      0.4717 ± 0.0035   0.4515 ± 0.0041   0.4471 ± 0.0023
                                CMSSH    0.5287 ± 0.0123   0.5098 ± 0.0141   0.4911 ± 0.0220
Text Query vs. Image Database   CRH      0.5364 ± 0.0021   0.5185 ± 0.0050   0.5064 ± 0.0055
                                CVH      0.4598 ± 0.0020   0.4519 ± 0.0029   0.4477 ± 0.0058
                                CMSSH    0.5029 ± 0.0321   0.4815 ± 0.0101   0.4660 ± 0.0298

Similar to the previous subsection, we have conducted a group of experiments to compare the three methods on several aspects and report the results in Figure 2. The results for varying the size of the training set are plotted in subfigures 2(a) and 2(d).
As more training data are used, CRH always performs better but the performance of CVH and CMSSH has high variance. The precision-recall curves and recall curves are shown in the remaining subfigures. Similar to the results on Wiki, CRH performs the best. However, the performance gap is smaller here.

Figure 2: Results on Flickr. (a) Varying training set size; (b) precision-recall curve; (c) recall curve; (d) varying training set size; (e) precision-recall curve; (f) recall curve. Panels (a)-(c): image query vs. text database; panels (d)-(f): text query vs. image database.

4 Conclusions

In this paper, we have presented a novel method for multimodal hash function learning based on a boosted co-regularization framework. Because the objective function of the optimization problem is in the form of a difference of convex functions, we can devise an efficient learning algorithm based on CCCP and a stochastic sub-gradient method. Comparative studies based on two benchmark data sets show that CRH outperforms two state-of-the-art multimodal hashing methods.
To take this work further, we would like to conduct theoretical analysis of CRH and apply it to some other tasks such as multimodal medical image alignment. Another possible research issue is to develop more efficient optimization algorithms to further improve the scalability of CRH.

Acknowledgement

This research has been supported by General Research Fund 621310 from the Research Grants Council of Hong Kong.
References
[1] Gregory Shakhnarovich, Trevor Darrell, and Piotr Indyk, editors. Nearest-Neighbor Methods in Learning and Vision: Theory and Practice. MIT Press, March 2006.
[2] Piotr Indyk and Rajeev Motwani. Approximate nearest neighbors: Towards removing the curse of dimensionality. In STOC, 1998.
[3] Moses Charikar. Similarity estimation techniques from rounding algorithms. In STOC, 2002.
[4] Alexandr Andoni and Piotr Indyk. Near-optimal hashing algorithms for approximate nearest neighbor in high dimensions. Communications of the ACM, 51(1):117-122, 2008.
[5] Brian Kulis and Kristen Grauman. Kernelized locality-sensitive hashing for scalable image search. In ICCV, 2009.
[6] Gregory Shakhnarovich, Paul Viola, and Trevor Darrell. Fast pose estimation with parameter-sensitive hashing. In ICCV, 2003.
[7] Ruslan Salakhutdinov and Geoffrey E. Hinton. Semantic hashing. In SIGIR Workshop on Information Retrieval and Applications of Graphical Models, 2007.
[8] Antonio Torralba, Rob Fergus, and Yair Weiss. Small codes and large image databases for recognition. In CVPR, 2008.
[9] Yair Weiss, Antonio Torralba, and Rob Fergus. Spectral hashing. In NIPS 21, 2008.
[10] Brian Kulis and Trevor Darrell. Learning to hash with binary reconstructive embeddings. In NIPS 22, 2009.
[11] Maxim Raginsky and Svetlana Lazebnik. Locality-sensitive binary codes from shift-invariant kernels. In NIPS 22, 2009.
[12] Ruei-Sung Lin, David A. Ross, and Jay Yagnik. SPEC hashing: Similarity preserving algorithm for entropy-based coding. In CVPR, 2010.
[13] Junfeng He, Wei Liu, and Shih-Fu Chang. Scalable similarity search with optimized kernel hashing. In KDD, 2010.
[14] Mohammad Norouzi and David J. Fleet. Minimal loss hashing for compact binary codes.
In ICML, 2011.\n[15] Jun Wang, Sanjiv Kumar, and Shih-Fu Chang. Semi-supervised hashing for scalable image retrieval. In\n\nCVPR, 2010.\n\n[16] Yadong Mu, Jialie Shen, and Shuicheng Yan. Weakly-supervised hashing in kernel space. In CVPR, 2010.\n[17] Dan Zhang, Fei Wang, and Luo Si. Composite hashing with multiple information sources. In SIGIR,\n\n2011.\n\n[18] Jingkuan Song, Yi Yang, Zi Huang, Heng Tao Shen, and Richang Hong. Multiple feature hashing for\n\nreal-time large scale near-duplicate video retrieval. In ACM MM, 2011.\n\n[19] Wei Liu, Jun Wang, Sanjiv Kumar, and Shih-Fu Chang. Hashing with graphs. In ICML, 2011.\n[20] Michael M. Bronstein, Alexander M. Bronstein, Fabrice Michel, and Nikos Paragios. Data fusion through\n\ncross-modality metric learning using similarity-sensitive hashing. In CVPR, 2010.\n\n[21] Shaishav Kumar and Raghavendra Udupa. Learning hash functions for cross-view similarity search. In\n\nIJCAI, 2011.\n\n[22] A. L. Yuille and Anand Rangarajan. The concave-convex procedure (CCCP). In NIPS 14, 2001.\n[23] Novi Quadrianto and Christoph H. Lampert. Learning multi-view neighborhood preserving projections.\n\nIn ICML, 2011.\n\n[24] Shai Shalev-Shwartz, Yoram Singer, and Nathan Srebro. Pegasos: Primal estimated sub-gradient solver\n\nfor SVM. In ICML, 2007.\n\n[25] Yoav Freund and Robert E. Schapire. A decision-theoretic generalization of on-line learning and an\n\napplication to boosting. Journal of Computer and System Sciences, 55(1):119\u2013139, 1997.\n\n[26] John Shawe-Taylor and Nello Cristianini. Kernel Methods for Pattern Analysis. Cambridge University\n\nPress, 2004.\n\n[27] Bernhard Sch\u00a8olkopf, Ralf Herbrich, and Alex J. Smola. A generalized representer theorem. In COLT,\n\n2001.\n\n[28] David G. Lowe. Distinctive image features from scale-invariant keypoints.\n\nComputer Vision, 60(2):91\u2013110, 2004.\n\nInternational Journal of\n\n[29] David M. Blei, Andrew Y. Ng, and Michael I. Jordan. Latent Dirichlet allocation. 
Journal of Machine\n\nLearning Research, 3:993\u20131022, 2003.\n\n[30] Tat-Seng Chua, Jinhui Tang, Richang Hong, Haojie Li, Zhiping Luo, and Yan-Tao Zheng. NUS-WIDE:\n\nA real-world web image database from National University of Singapore. In CIVR, 2009.\n\n9\n\n\f", "award": [], "sourceid": 669, "authors": [{"given_name": "Yi", "family_name": "Zhen", "institution": null}, {"given_name": "Dit-Yan", "family_name": "Yeung", "institution": null}]}