{"title": "A Probabilistic Framework for Multimodal Retrieval using Integrative Indian Buffet Process", "book": "Advances in Neural Information Processing Systems", "page_first": 2384, "page_last": 2392, "abstract": "We propose a multimodal retrieval procedure based on latent feature models. The procedure consists of a nonparametric Bayesian framework for learning underlying semantically meaningful abstract features in a multimodal dataset, a probabilistic retrieval model that allows cross-modal queries and an extension model for relevance feedback. Experiments on two multimodal datasets, PASCAL-Sentence and SUN-Attribute, demonstrate the effectiveness of the proposed retrieval procedure in comparison to the state-of-the-art algorithms for learning binary codes.", "full_text": "A Probabilistic Framework for Multimodal Retrieval\n\nusing Integrative Indian Buffet Process\n\nBahadir Ozdemir\n\nDepartment of Computer Science\n\nUniversity of Maryland\n\nCollege Park, MD 20742 USA\n\nozdemir@cs.umd.edu\n\nLarry S. Davis\n\nInstitute for Advanced Computer Studies\n\nUniversity of Maryland\n\nCollege Park, MD 20742 USA\n\nlsd@umiacs.umd.edu\n\nAbstract\n\nWe propose a multimodal retrieval procedure based on latent feature models. The\nprocedure consists of a Bayesian nonparametric framework for learning under-\nlying semantically meaningful abstract features in a multimodal dataset, a proba-\nbilistic retrieval model that allows cross-modal queries and an extension model for\nrelevance feedback. 
Experiments on two multimodal datasets, PASCAL-Sentence and SUN-Attribute, demonstrate the effectiveness of the proposed retrieval procedure in comparison to the state-of-the-art algorithms for learning binary codes.

1 Introduction

As the number of digital images available online is constantly increasing due to rapid advances in digital camera technology, image processing tools and photo sharing platforms, similarity-preserving binary codes have received significant attention for image search and retrieval in large-scale image collections [1, 2]. Encoding high-dimensional descriptors into compact binary strings has become a very popular representation for images because of its high efficiency in query processing and storage [3, 4, 5, 6].

The most widely adopted strategy for similarity-preserving binary codes is to find a projection of data points from the original feature space to Hamming space. A broad range of hashing techniques can be categorized into data-independent and data-dependent schemes. Locality sensitive hashing [3] is one of the most widely known data-independent hashing techniques; it has been extended to various hashing functions with kernels [4, 5]. Notable data-dependent hashing techniques include spectral hashing [1], iterative quantization [6] and spherical hashing [7]. Despite the increasing amount of multimodal data, especially in multimedia domains, e.g., images with tags, most existing hashing techniques unfortunately focus on unimodal data. Hence, they inevitably suffer from the semantic gap, which is defined in [8] as the lack of coincidence between low-level visual features and the high-level semantic interpretation of an image. On the other hand, joint analysis of multimodal data offers improved search and cross-view retrieval capabilities, e.g., text-to-image queries, by bridging the semantic gap. 
However, it also poses challenges associated with handling cross-view similarity.

Most recent studies have concentrated on multimodal hashing. Bronstein et al. proposed cross-modality similarity learning via a boosting procedure [9]. Kumar and Udupa presented a cross-view similarity search [10] by generalizing spectral hashing [1] to multi-view data objects. Zhen and Yeung described two recent methods: co-regularized hashing [11], based on a boosted co-regularization framework, and a probabilistic generative approach called multimodal latent binary embedding [12], based on binary latent factors. Srivastava and Salakhutdinov proposed a deep Boltzmann machine for multimodal data [13]. Recently, Rastegari et al. proposed a predictable dual-view hashing [14] that aims to minimize the Hamming distance between binary codes obtained from two different views by utilizing multiple SVMs. Most multimodal hashing techniques are computationally expensive, especially when dealing with large-scale data; high computational and storage complexity restricts their scalability.

Although many hashing approaches rely on supervised information like semantic class labels, class memberships are not available for many image datasets. In addition, some supervised approaches cannot be generalized to unseen classes that are not used during training [15], even though new classes emerge as new images are added to online image databases. Besides, every user's need is different and time varying [16]. Therefore, user judgments indicating the relevance of an image retrieved for a query are utilized to achieve better retrieval performance in the revised ranking of images [17]. Developing an efficient retrieval system that embeds information from multiple domains into short binary codes and takes relevance feedback into account is quite challenging.

In this paper, we propose a multimodal retrieval method based on latent features. 
A probabilistic\napproach is employed for learning binary codes, and also for modeling relevance and user prefer-\nences in image retrieval. Our model is built on the assumption that each image can be explained by\na set of semantically meaningful abstract features which have both visual and textual components.\nFor example, if an image in the dataset contains a side view of a car, the words \u201ccar\u201d, \u201cautomobile\u201d\nor \u201cvehicle\u201d will probably appear in the description; also an object detector trained for vehicles will\ndetect the car in the image. Therefore, each image can be represented as a binary vector, with entries\nindicating the presence or absence of each abstract feature.\nOur contributions can be summarized in three aspects:\n\n1. We propose a Bayesian nonparametric framework based on the Indian Buffet Process (IBP)\n[18] for integrating multimodal data in a latent space. Since the IBP is a nonparametric prior\nin an in\ufb01nite latent feature model, the proposed method offers a \ufb02exible way to determine\nthe number of underlying abstract features in a dataset.\n\n2. We develop a retrieval system that can respond to cross-modal queries by introducing new\nrandom variables indicating relevance to a query. We present a Markov chain Monte Carlo\n(MCMC) algorithm for inference of the relevance from data.\n\n3. We formulate relevance feedback as pseudo-images to alter the distribution of images in\nthe latent space so that the ranking of images for a query is in\ufb02uenced by user preferences.\n\nThe rest of the paper is organized as follows: Section 2 describes the proposed integrative procedure\nfor learning binary codes, retrieval model and processing relevance feedback in detail. Performance\nevaluation and comparison to state-of-the-art methods are presented in Section 3, and Section 4\nprovides conclusions.\n\n2 Our Approach\n\nIn our data model, each image has both textual and visual components. 
To facilitate the discussion, we assume that the dataset is composed of two full matrices; our approach can easily handle images with only one component, and it can be generalized to more than two modalities as well. We denote the data in the textual and visual spaces by Xτ and Xv, respectively. X∗ is an N × D∗ matrix whose rows correspond to images in either space, where ∗ is a placeholder for either v or τ. The values in each column of X∗ are centered by subtracting the sample mean of that column. The dimensionality of the textual space, Dτ, and the dimensionality of the visual space, Dv, can be different. We use X to represent the set {Xτ, Xv}.

2.1 Integrative Latent Feature Model

We focus on how the textual and visual values of an image are generated by a linear-Gaussian model and on its extension for retrieval systems. Given a multimodal image dataset, the textual and visual data matrices, Xτ and Xv, can be approximated by ZAτ and ZAv, respectively. Z is an N × K binary matrix where Znk equals one if abstract feature k is present in image n and zero otherwise. A∗ is a K × D∗ matrix where the textual and visual values for abstract feature k are stored in row k of Aτ and Av, respectively (see Figure 1 for an illustration). The set {Aτ, Av} is denoted by A.

Our initial goal is to learn the abstract features present in the dataset. Given X, we wish to compute the posterior distribution of Z and A using Bayes' rule

p(Z, A|X) ∝ p(Xτ|Z, Aτ) p(Aτ) p(Xv|Z, Av) p(Av) p(Z)   (1)

where Z, Aτ and Av are assumed to be a priori independent. In our model, the vectors of textual and visual properties of an image are generated from Gaussian distributions with covariance matrix (σ∗x)²I and expectation E[X∗] equal to ZA∗. Similarly, the prior on A∗ is defined to be Gaussian with zero mean vector and covariance matrix (σ∗a)²I. Since we do not know the exact number of abstract features present in the dataset, we employ the Indian Buffet Process (IBP) to generate Z, which provides a flexible prior that allows K to be determined at inference time (see [18] for details). The graphical model of our integrative approach is shown in Figure 2.

Figure 1: The latent abstract feature model proposes that the visual data Xv is a product of Z and Av with some noise; similarly, the textual data Xτ is a product of Z and Aτ with some noise.

Figure 2: Graphical model for the integrative IBP approach, where circles indicate random variables, shaded circles denote observed values, and the blue square boxes are hyperparameters.

The exchangeability property of the IBP leads directly to a Gibbs sampler which takes image n as the last customer to have entered the buffet. Then, we can sample Znk for all initialized features k via

p(Znk = 1|Z−nk, X) ∝ p(Znk = 1|Z−nk) p(X|Z)   (2)

where Z−nk denotes the entries of Z other than Znk. In the finite latent feature model (where K is fixed), the conditional distribution for any Znk is given by

p(Znk = 1|Z−nk) = (m−n,k + α/K) / (N + α/K)   (3)

where m−n,k is the number of images possessing abstract feature k apart from image n. 
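As a concrete illustration, the generative story above (Z drawn from an IBP prior, each modality X∗ equal to ZA∗ plus Gaussian noise) can be simulated directly. The Python sketch below uses the standard sequential customer scheme for the IBP from [18]; all sizes and hyperparameter values are illustrative only, not those used in the paper.

```python
import numpy as np

def sample_ibp(n_images, alpha, rng):
    """Simulate Z from the IBP prior with the sequential customer scheme:
    customer n takes existing dish k with probability m_k / n, then tries
    Poisson(alpha / n) brand-new dishes."""
    Z = np.zeros((n_images, 0), dtype=int)
    for n in range(1, n_images + 1):
        m = Z[:n - 1].sum(axis=0)                # popularity of existing dishes
        Z[n - 1, :] = rng.random(Z.shape[1]) < m / n
        new_cols = np.zeros((n_images, rng.poisson(alpha / n)), dtype=int)
        new_cols[n - 1, :] = 1                   # new dishes tried by customer n
        Z = np.hstack([Z, new_cols])
    return Z

rng = np.random.default_rng(0)
N, alpha = 100, 2.0
Z = sample_ibp(N, alpha, rng)                    # N x K; K is set by the prior
K = Z.shape[1]
D_tau, D_v = 20, 30                              # textual / visual dimensions
sigma_a, sigma_x = 1.0, 0.5

# Linear-Gaussian observations for two modalities sharing one Z.
A_tau = sigma_a * rng.standard_normal((K, D_tau))
A_v = sigma_a * rng.standard_normal((K, D_v))
X_tau = Z @ A_tau + sigma_x * rng.standard_normal((N, D_tau))
X_v = Z @ A_v + sigma_x * rng.standard_normal((N, D_v))
```

The key point of the sketch is that a single feature-assignment matrix Z drives both modalities, which is what makes the learned binary codes integrative.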
In the infinite case, like the IBP, we obtain p(Znk = 1|Z−nk) = m−n,k/N for any k such that m−n,k > 0. We also need to draw new features associated with image n from Poisson(α/N), and the likelihood term is then conditioned on Z with new additional columns set to one for image n.

For the linear-Gaussian model, the collapsed likelihood function p(X|Z) = p(Xτ|Z) p(Xv|Z) can be computed using

p(X∗|Z) = ∫ p(X∗|Z, A∗) p(A∗) dA∗ = exp{−tr(X∗ᵀ(I − ZMZᵀ)X∗) / (2(σ∗x)²)} / ((2π)^(ND∗/2) (σ∗x)^((N−K)D∗) (σ∗a)^(KD∗) |M|^(−D∗/2))   (4)

where M = (ZᵀZ + (σ∗x/σ∗a)²I)⁻¹ and tr(·) is the trace of a matrix [18]. To reduce the computational complexity, Doshi-Velez and Ghahramani proposed accelerated sampling in [19] by maintaining the posterior distribution of A∗ conditioned on partial X∗ and Z. We use this approach to learn binary codes, i.e., the feature-assignment matrix Z, for multimodal data. Unlike the hashing methods that learn optimal hyperplanes from training data [6, 7, 14], we only sample Z, without specifying the length of the binary codes in this process. Therefore, the binary codes can be updated efficiently if new images are added in a long run of the retrieval system.

2.2 Retrieval Model

We extend the integrative IBP model to image retrieval. Given a query, we need to sort the images in the dataset with respect to their relevance to the query. A query can be comprised of textual and visual data, or either component can be absent. 
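In log domain, the collapsed likelihood (4) reduces to a few matrix operations. The following Python sketch (the function name and NumPy dependency are ours, not part of the paper's implementation) evaluates log p(X∗|Z) for one modality; the multimodal value is simply the sum of the two modality terms.

```python
import numpy as np

def log_collapsed_likelihood(X, Z, sigma_x, sigma_a):
    """Log of p(X*|Z) from Eq. (4), with A* integrated out of the
    linear-Gaussian model. X is N x D (column-centered), Z is N x K."""
    N, D = X.shape
    K = Z.shape[1]
    M = np.linalg.inv(Z.T @ Z + (sigma_x / sigma_a) ** 2 * np.eye(K))
    quad = np.trace(X.T @ (np.eye(N) - Z @ M @ Z.T) @ X)
    _, logdet_M = np.linalg.slogdet(M)
    return (-0.5 * N * D * np.log(2 * np.pi)
            - (N - K) * D * np.log(sigma_x)
            - K * D * np.log(sigma_a)
            + 0.5 * D * logdet_M
            - quad / (2 * sigma_x ** 2))
```

Equivalently, (4) is the density of each column of X∗ under N(0, (σ∗a)²ZZᵀ + (σ∗x)²I), which gives a direct numerical check of the formula.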
Let qτ be a Dτ-dimensional vector for the textual values and qv be a Dv-dimensional vector for the visual values of the query. We write Q = {qτ, qv}. As for the images in X, we consider a query to be generated by the same model described in the previous section, with the exception of the prior on abstract features. In the retrieval part, we consider Z a known quantity and we fix the number of abstract features to K. Therefore, the feature assignments for the dataset are not affected by queries. In addition, queries are explained by known abstract features only.

We extend the Indian restaurant metaphor to construct the retrieval model. A query corresponds to the (N+1)th customer to enter the buffet. The previous customers are divided into two classes, friends and non-friends, based on their relevance to the new customer. The new customer samples from at most K dishes, in proportion to their popularity among friends and their unpopularity among non-friends. Consequently, the dishes sampled by the new customer are expected to be similar to those of friends and dissimilar to those of non-friends. Let r be an N-dimensional vector where rn equals one if customer n is a friend of the new customer and zero otherwise. For this finitely long buffet, the sampling probability of dish k by the new customer can be written as (m′k + α/K)/(N + 1 + α/K), where m′k = Σ_{n=1}^N (Znk)^rn (1 − Znk)^(1−rn), that is, the total number of friends who tried dish k plus the number of non-friends who did not sample dish k. Let z′ be a K-dimensional vector where z′k records whether the new customer (query) sampled dish k. We place a Bernoulli(θ) prior over rn. Then, we can sample z′k from

p(z′k = 1|z′−k, Q, Z, X) ∝ p(z′k = 1|Z) p(Q|z′, Z, X).   (5)

The probability p(z′k = 1|Z) can be computed efficiently for k = 1, . . . , K by marginalizing over r as below:

p(z′k = 1|Z) = Σ_{r∈{0,1}^N} p(z′k = 1|r, Z) p(r) = (θmk + (1 − θ)(N − mk) + α/K) / (N + 1 + α/K).   (6)

The collapsed likelihood of the query, p(Q|z′, Z, X), is given by the product of the textual and visual likelihood values, p(qτ|z′, Z, Xτ) p(qv|z′, Z, Xv). If either the textual or the visual component is missing, we can simply integrate out the missing one by omitting the corresponding term from the equation. The likelihood of each part can be calculated as follows:

p(q∗|z′, Z, X∗) = ∫ p(q∗|z′, A∗) p(A∗|Z, X∗) dA∗ = N(q∗; μ∗q, Σ∗q)   (7)

where the mean and covariance matrix of the normal distribution are given by μ∗q = z′MZᵀX∗ and Σ∗q = (σ∗x)²(z′Mz′ᵀ + 1)I, akin to the update equation in [19] (refer to (4) for M).

Finally, we use the conditional expectation of r to rank the images in the dataset with respect to their relevance to the given query. Calculating the expectation E[r|Q, Z, X] is computationally expensive; however, it can be empirically estimated using the Monte Carlo method as follows:

Ê[rn|Q, Z, X] = (1/I) Σ_{i=1}^I p(rn = 1|z′(i), Z) = (θ/I) Σ_{i=1}^I Π_{k=1}^K [ p(z′(i)k|rn = 1, Z) / p(z′(i)k|Z) ]   (8)

where z′(i) represents i.i.d. samples from (5) for i = 1, . . . , I. 
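The Gibbs step in (5), combining the marginalized prior (6) with the Gaussian likelihood (7), might be sketched as follows for a single-modality query. Function and variable names are ours; a multimodal query would simply add the second modality's log-likelihood inside each flip, and a missing modality drops its term, as described above.

```python
import numpy as np

def sample_query_features(q, X, Z, alpha, theta, sigma_x, sigma_a,
                          n_sweeps=50, rng=None):
    """Gibbs sampling of the query's feature vector z' via Eq. (5):
    IBP-style prior (6) times the likelihood (7) for one modality q.
    Assumes q and the columns of X are centered the same way."""
    rng = rng or np.random.default_rng()
    N, K = Z.shape
    M = np.linalg.inv(Z.T @ Z + (sigma_x / sigma_a) ** 2 * np.eye(K))
    m = Z.sum(axis=0)                    # feature popularity counts m_k
    # Eq. (6): prior on z'_k = 1, marginalized over the relevance vector r.
    log_p1 = np.log((theta * m + (1 - theta) * (N - m) + alpha / K)
                    / (N + 1 + alpha / K))
    log_p0 = np.log1p(-np.exp(log_p1))   # prior on z'_k = 0 (complement)

    def loglik(z):
        # Eq. (7): q | z', Z, X is Gaussian with the moments below.
        mu = z @ M @ Z.T @ X
        var = sigma_x ** 2 * (z @ M @ z + 1.0)
        D = q.shape[0]
        return (-0.5 * D * np.log(2 * np.pi * var)
                - 0.5 * np.sum((q - mu) ** 2) / var)

    z = rng.integers(0, 2, K).astype(float)
    for _ in range(n_sweeps):
        for k in range(K):               # flip each z'_k in turn
            z[k] = 1.0
            l1 = log_p1[k] + loglik(z)
            z[k] = 0.0
            l0 = log_p0[k] + loglik(z)
            p1 = 1.0 / (1.0 + np.exp(l0 - l1))
            z[k] = float(rng.random() < p1)
    return z

# Tiny synthetic demo (sizes are arbitrary).
rng = np.random.default_rng(3)
Z_demo = rng.integers(0, 2, (8, 3)).astype(float)
X_demo = rng.standard_normal((8, 4))
X_demo -= X_demo.mean(axis=0)
q_demo = rng.standard_normal(4)
z_query = sample_query_features(q_demo, X_demo, Z_demo, alpha=1.0, theta=0.7,
                                sigma_x=0.5, sigma_a=1.0, n_sweeps=20, rng=rng)
```

Repeated calls yield the i.i.d. samples z′(i) used by the Monte Carlo estimate (8).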
The last equation required for computing (8) is

p(z′k = 1|rn = 1, Z) = (Znk + θm−n,k + (1 − θ)(N − 1 − m−n,k) + α/K) / (N + 1 + α/K).   (9)

The retrieval system returns a set of top-ranked images to the user. Note that we compute the expectation of the relevance vector instead of sampling it directly, since binary values indicating the relevance are less stable and they hinder the ranking of images.

2.3 Relevance Feedback Model

In our data model, user preferences can be described over abstract features. For instance, if abstract feature k is present in most of the positive samples, i.e., images judged as relevant by the user, and absent in the irrelevant ones, then we can say that the user is more interested in the semantic subspace represented by abstract feature k. In the revised query, the images having abstract feature k are expected to be ranked in higher positions than in the initial query. We achieve this desirable property through query-specific alterations to the sampling probability in (5) for the corresponding abstract features. Our approach is to add pseudo-images to the feature-assignment matrix Z before the computations for the revised query. In the Indian restaurant analogy, pseudo-images correspond to additional friends of the new customer (query) who do not really exist in the restaurant. The distribution of dishes sampled by those imaginary customers reflects user relevance feedback. Thus, the updated expectation of the relevance vector has a bias towards user preferences.

Let Zu be an Nu × K feature-assignment matrix for pseudo-images only; then the number of pseudo-images, Nu, determines the influence of relevance feedback. 
Therefore, we set an upper limit on Nu as the number of real images, N, by placing a prior distribution Nu ~ Binomial(γ, N), where γ is a parameter that controls the weight of feedback. Let mu,k be the number of pseudo-images containing abstract feature k; this number has an upper bound Nu by definition. For abstract feature k, a prior distribution conditioned on Nu can be defined as mu,k|Nu ~ Binomial(φk, Nu), where φk is a parameter that can be tuned by relevance judgments.

Let z′′ be a K-dimensional feature-assignment vector for the revised query; then we can sample each z′′k via

p(z′′k = 1|z′′−k, Q, Z, X) ∝ p(z′′k = 1|Z) p(Q|z′′, Z, X)   (10)

where the computation of the collapsed likelihood is already shown in (7). Note that we do not actually generate all entries of Zu, but only the sum of its columns, mu, and the number of rows, Nu, for computing the sampling probability. We can write the first term as

p(z′′k = 1|Z) = Σ_{Nu=0}^N p(Nu) Σ_{mu,k=0}^{Nu} p(mu,k|Nu) Σ_{r∈{0,1}^N} p(z′′k = 1|r, Zu, Z) p(r)
= Σ_{j=0}^N (N choose j) γ^j (1 − γ)^(N−j) (θmk + (1 − θ)(N − mk) + α/K + φk j) / (N + 1 + α/K + j).   (11)

Unfortunately, this expression has no compact analytic form; however, it can be efficiently computed numerically by contemporary scientific computing software, even for large values of N. In this equation, one can alternatively fix rn to 1 if the user marks observation n as relevant, or to 0 if it is indicated to be irrelevant. Finally, the expectation of r is updated using (8) with new i.i.d. samples z′′(i) from (10), and the system constructs the revised set of images.

3 Experiments

The experiments were performed in two phases. We first compared the performance of our method in category retrieval with several state-of-the-art hashing techniques. Next, we evaluated the improvement in the performance of our method with relevance feedback. We used the same multimodal datasets as [14], namely the PASCAL-Sentence 2008 dataset [20] and the SUN-Attribute dataset [21]. In the quantitative analysis, we used the mean of the interpolated precision at standard recall levels for comparing retrieval performance. In the qualitative analysis, we present the images retrieved by our proposed method for a set of text-to-image and image-to-image queries. All experiments were performed in the Matlab environment.

3.1 Datasets

The PASCAL-Sentence 2008 dataset is formed from the PASCAL 2008 images by randomly selecting 50 images belonging to each of the 20 categories. In our experiments, we used the precomputed visual and textual features provided by Rastegari et al. [14]. Amazon Mechanical Turk workers annotated five sentences for each of the 1000 images. Each image is labelled by a triplet of