{"title": "A Bayesian Model for Simultaneous Image Clustering, Annotation and Object Segmentation", "book": "Advances in Neural Information Processing Systems", "page_first": 486, "page_last": 494, "abstract": "A non-parametric Bayesian model is proposed for processing multiple images. The analysis employs image features and, when present, the words associated with accompanying annotations. The model clusters the images into classes, and each image is segmented into a set of objects, also allowing the opportunity to assign a word to each object (localized labeling). Each object is assumed to be represented as a heterogeneous mix of components, with this realized via mixture models linking image features to object types. The number of image classes, number of object types, and the characteristics of the object-feature mixture models are inferred non-parametrically. To constitute spatially contiguous objects, a new logistic stick-breaking process is developed. Inference is performed efficiently via variational Bayesian analysis, with example results presented on two image databases.", "full_text": "A Bayesian Model for Simultaneous Image\n\nClustering, Annotation and Object Segmentation\n\nLan Du, Lu Ren, 1David B. Dunson and Lawrence Carin\n\nDepartment of Electrical and Computer Engineering\n\n1Statistics Department\n\nDuke University\n\n{ld53, lr, lcarin}@ee.duke.edu, dunson@stats.duke.edu\n\nDurham, NC 27708-0291, USA\n\nAbstract\n\nA non-parametric Bayesian model is proposed for processing multiple images.\nThe analysis employs image features and, when present, the words associated\nwith accompanying annotations. The model clusters the images into classes, and\neach image is segmented into a set of objects, also allowing the opportunity to\nassign a word to each object (localized labeling). Each object is assumed to be\nrepresented as a heterogeneous mix of components, with this realized via mixture\nmodels linking image features to object types. 
The number of image classes, number\nof object types, and the characteristics of the object-feature mixture models\nare inferred nonparametrically. To constitute spatially contiguous objects, a new\nlogistic stick-breaking process is developed. Inference is performed efficiently\nvia variational Bayesian analysis, with example results presented on two image\ndatabases.\n\n1 Introduction\nThere has recently been much interest in developing statistical models for analyzing and organizing\nimages, based on image features and, when available, auxiliary information, such as words\n(e.g., annotations). Three important aspects of this problem are: (i) sorting multiple images\ninto scene-level classes, (ii) image annotation, and (iii) segmenting and labeling localized objects\nwithin images. Probabilistic topic models, originally developed for text analysis [8, 12], have been\nadapted and extended successfully for many image-understanding problems [3, 6, 9\u201311, 16, 23, 24].\nMoreover, recent work has also used the Dirichlet process (DP) [5] or similar non-parametric priors\nto enhance the topic-model structure [2, 20, 26]. Using such statistical models, researchers\n[2, 3, 6, 10, 16, 20, 23, 24, 26] have addressed two or all three of the objectives simultaneously\nwithin a single setting. Such unified formalisms have realized marked improvements in overall\nalgorithm performance. A relatively complete summary of the literature may be found in [16, 23],\nwhere the advantages of the approaches in [16, 23] are described relative to previous related\napproaches [3, 6, 10, 11, 18, 24, 27]. The work in [16, 23] is based on the correspondence LDA\n(Corr-LDA) model [6]. The approach in [23] integrates the Corr-LDA model and the supervised\nLDA (sLDA) model [7] into a single framework. 
Although good classification performance was\nachieved using this approach, the model is employed in a supervised manner, utilizing scene-labeled\nimages for scene classification. A class label variable is introduced in [16] to cluster all images in\nan unsupervised manner, and a switching variable to address noisy annotations. Nevertheless, to\nimprove performance, in [16] some images are required for supervised learning, based on the segmented\nand labeled objects obtained via the method proposed in [10], with these used to initialize\nthe algorithm.\n\nThe research reported here seeks to build upon and extend recent research on unified image-analysis\nmodels. Specifically, motivated by [16, 23], we develop a novel non-parametric Bayesian model\nthat simultaneously addresses all three objectives discussed above. The four main contributions of\nthis paper are:\n\n\u2022 Each object in an image is represented as a mixture of image-feature model parameters, accounting\nfor the heterogeneous character of individual objects. This framework captures the idea that a\nparticular object may be composed as an aggregation of distinct parts. By contrast, each object is\nonly associated with one image-feature component/atom in the Corr-LDA-like models [6, 16, 23].\n\u2022 Multiple images are processed jointly; all, none or a subset of the images may be annotated. The\nmodel infers the linkage between image-feature parameters and object types, with this linkage used\nto yield localized labeling of objects within all images. The unsupervised framework is executed\nwithout the need for a human to constitute training data.\n\u2022 A novel logistic stick-breaking process (LSBP) is proposed, imposing the belief that proximate\nportions of an image are more likely to reside within the same segment (object). 
This spatially constrained\nprior yields contiguous objects with sharp boundaries, and via the aforementioned mixture\nmodels the segmented objects may be composed of heterogeneous building blocks.\n\u2022 The proposed model is nonparametric, based on use of stick-breaking constructions [13], which\ncan be easily implemented by fast variational Bayesian (VB) inference [14]. The number of image\nclasses, number of object types, number of image-feature mixture components per object, and the\nlinkage between words and image model parameters are inferred nonparametrically.\n2 The Hierarchical Generative Model\n2.1 Bag of image features\nWe jointly process data from M images, and each image is assumed to come from an associated\nclass type (e.g., city scene, beach scene, office scene, etc.). The class type associated with image m\nis denoted by z_m \u2208 {1, . . . , I}, and it is drawn from the mixture model\n\nz_m \u223c \u2211_{i=1}^{I} u_i \u03b4_i ,   u \u223c Stick_I(\u03b1_u)   (1)\n\nwhere Stick_I(\u03b1_u) is a stick-breaking process [13] that is truncated to I sticks, with hyper-parameter\n\u03b1_u > 0. The symbol \u03b4_i represents a unit measure at the integer i, and the parameter u_i denotes the\nprobability that image type i will be observed across the M images.\nThe observed data are image feature vectors, each tied to a local region in the image (for example,\nassociated with an over-segmented portion of the image). The L_m observed image feature vectors\nassociated with image m are {x_ml}_{l=1}^{L_m}, and the lth feature vector is assumed drawn x_ml \u223c F(\u03b8_ml).\nThe expression F(\u00b7) represents the feature model, and \u03b8_ml represents the model parameters.\nEach image is assumed to be composed of a set of latent objects. 
An indicator variable \u03b6_ml defines\nwhich object type the lth feature vector from image m is associated with, and it is drawn\n\n\u03b6_ml \u223c \u2211_{k=1}^{K} w_{z_m k} \u03b4_k ,   w_i \u223c Stick_K(\u03b1_w)   (2)\n\nwhere index k corresponds to the kth type of object that may reside within an image. The vector\nw_i defines the probability that each of the K object types will occur, conditioned on the image type\ni \u2208 {1, . . . , I}; the kth component of w_{z_m}, w_{z_m k}, denotes the probability of observing object type\nk in image m, when image m was drawn from class z_m \u2208 {1, . . . , I}.\n\nThe image class z_m and corresponding objects {\u03b6_ml}_{l=1}^{L_m} associated with image m are latent variables.\nThe generative process for the observed data, {x_ml}_{l=1}^{L_m}, is manifested via mixture models\nwith respect to model parameter \u03b8. Specifically, a separate such mixture model is manifested for\neach of the K object types, motivated by the idea that each object will in general be composed of a\ndifferent set of image-feature building blocks. The mixture model for object type k \u2208 {1, . . . , K}\nis represented as\n\nG_k = \u2211_{j=1}^{J} h_kj \u03b4_{\u03b8_j^\u2217} ,   h_k \u223c Stick_J(\u03b1_h) ,   \u03b8_j^\u2217 \u223c H   (3)\n\nwhere H is a base measure, usually selected to be conjugate to F(\u00b7).\n2.2 Bag of clustered image features\nWhile the model described above is straightforward to understand, it has been found to be ineffective.\nThis is because each of the \u03b6_ml is drawn i.i.d. from \u2211_{k=1}^{K} w_{z_m k} \u03b4_k, and therefore there is\nnothing in the model that encourages the image features, x_ml and x_{ml\u2032}, which are associated with\nthe same image-feature atom \u03b8_j^\u2217, to be assigned to the same object k.\n\nTo address this limitation, we add a clustering step within each of the images; this is similar to\nthe structure of the hierarchical Dirichlet process (HDP) [21]. 
Specifically, consider the following\naugmented model:\n\nx_ml \u223c F(\u03b8_ml) ,   \u03b8_ml \u223c G_{c_ml} ,   c_ml \u223c \u2211_{t=1}^{T} v_mt \u03b4_{\u03b6_mt} ,   \u03b6_mt \u223c \u2211_{k=1}^{K} w_{z_m k} \u03b4_k ,   z_m \u223c \u2211_{i=1}^{I} u_i \u03b4_i   (4)\n\nwhere v_m \u223c Stick_T(\u03b1_v), and G_k is as defined in (3). We set the truncation level T < K, to\nencourage a relatively small number of objects in a given image.\n2.3 Linking words with images\nIn the above discussion it was assumed that the only observed data are the image feature vectors\n{x_ml}_{l=1}^{L_m}. However, there are situations for which annotations (words) may be available for at\nleast a subset of the M images. In this setting we assume that we have a K-dimensional dictionary\nof words associated with objects in images, and a word is assigned to each of the objects k \u2208\n{1, . . . , K}. Of the collection of M images, some may be annotated and some not, and all will\nbe processed simultaneously by the joint model; in so doing, annotations will be inferred for the\noriginally non-annotated images.\n\nFor an image for which no annotation is given, the image is assumed generated via (4). When\nan annotation is available, the words associated with image m are represented as a vector y_m =\n[y_m1, \u00b7 \u00b7 \u00b7 , y_mK]^T, where y_mk denotes the number of times word k is present in the annotation to\nimage m (typically y_mk will either be one or zero), and y_m is assumed drawn from a multinomial\ndistribution associated with a parameter \u03d5_m: y_m \u223c Mult(\u03d5_m). If image m is in class z_m, then we\nsimply set\n\ny_m \u223c Mult(w_{z_m}) ,   w_i \u223c Stick_K(\u03b1_w)   (5)\n\nNamely, \u03d5_m = w_{z_m}, recalling that w_i defines the probability of observing each object type\nfor image class i. When a dictionary of K words is available, we generally use w_i \u223c\nDir(\u03b1_w/K, . . . 
, \u03b1_w/K), consistent with LDA [8].\n3 Encouraging Spatially Contiguous Objects\n3.1 Logistic stick-breaking process (LSBP)\nIn (5), note that once the image class z_m is drawn for image m, the order/location of the x_ml within\nthe image may be interchanged, and nothing in the generative process will change. This is because\nthe indicator variable c_ml, which defines the object class associated with feature vector l in image m,\nis drawn i.i.d. c_ml \u223c \u2211_{t=1}^{T} v_mt \u03b4_{\u03b6_mt}. It is therefore desirable to impose that if two feature vectors\nare proximate within the image, they are likely to be associated with the same object.\nWith each feature vector x_ml there is an associated spatial location, which we denote s_ml (this is a\ntwo-dimensional vector). We wish to draw\n\nc_ml \u223c \u2211_{t=1}^{T} v_mt(s_ml) \u03b4_{\u03b6_mt} ,   \u03b6_mt \u223c \u2211_{k=1}^{K} w_{z_m k} \u03b4_k   (6)\n\nwhere the cluster probabilities v_mt(s_ml) are now a function of position s_ml (the \u03b6_mt \u2208 {1, . . . , K}\ncorrespond to object types). The challenge, therefore, becomes development of a means of constructing\nv_mt(s) to encourage nearby feature vectors to come from the same object type. Toward this goal,\nlet \u03c3[g_mt(s)] represent a logistic link function, which is a function of s. For t = 1, . . . , T \u2212 1 we\nimpose\n\nv_mt(s) = \u03c3[g_mt(s)] \u220f_{\u03c4=1}^{t\u22121} {1 \u2212 \u03c3[g_m\u03c4(s)]}   (7)\n\nwhere v_mT(s) = 1 \u2212 \u2211_{t=1}^{T\u22121} v_mt(s). We define g_mt(s) = \u2211_{l=1}^{L_m} W_tl^(m) K(s, s_ml) + W_t0^(m),\nwhere K(s, s_ml) is a kernel, and here we utilize the radial basis function kernel K(s, s_ml) =\nexp[\u2212\u2016s \u2212 s_ml\u2016^2/\u03c6_mt]. The kernel-width parameter \u03c6_mt plays an important role in dictating the\nsize of segments associated with stick t, and therefore these parameters should be learned from the\ndata in the analysis. In practice we define a library of discrete kernel widths \u03c6^\u2217 = {\u03c6_d^\u2217}_{d=1}^{D}, and\ninfer each \u03c6_mt, placing a uniform prior on the elements of \u03c6^\u2217.\n\nWe desire that a given stick v_mt(s) has importance (at most) over a localized region, and therefore\nwe impose sparseness priors on parameters {W_tl^(m)}_{l=0}^{L_m}. Specifically, W_tl^(m) \u223c N(0, (\u03b7_tl^(m))^\u22121), and\n\u03b7_tl^(m) is drawn from a gamma prior, with hyper-parameters set to encourage most \u03b7_tl^(m) \u2192 \u221e. Such a\nStudent-t prior is also applied in [4]. The model described above is termed a logistic stick-breaking\nprocess (LSBP). For notational convenience, c_ml \u223c \u2211_{t=1}^{T} v_mt(s_ml) \u03b4_{\u03b6_mt} and \u03b6_mt \u223c \u2211_{k=1}^{K} w_{z_m k} \u03b4_k\nconstructed as above is represented as a draw from LSBP_T(w_{z_m}). Figure 1 depicts the detailed\ngenerative process of the proposed model with LSBP.\n\nFigure 1: Depiction of the generative process. (i) A scene-class indicator z_m \u2208 {1, . . . 
, I} is drawn to define\nthe image class; (ii) conditioned on z_m, and using the LSBP, contiguous segmented blocks are constituted,\nwith associated words defined by object indicator c_ml \u2208 {1, \u00b7 \u00b7 \u00b7 , K}, where w_i defines the probability of\nobserving each object type for image class i; (iii) conditioned on c_ml, image-feature atoms are drawn from\nappropriate mixture models G_{c_ml}, linked to over-segmented regions within each of the object clusters; (iv) the\nimage-feature model parameters are responsible for generating the image features, via the model F(\u03b8), where\n\u03b8 is the image-feature parameter.\n\n3.2 Discussion of LSBP properties and comparison with KSBP\nThere are two key components of the LSBP construction: (i) sparseness promotion on the W_tl^(m),\nand (ii) the use of a logistic link function to define spatial stick weights. A particular non-zero W_tl^(m)\nis (via the kernel) associated with the lth local spatial region, with spatial extent defined by \u03c6_mt. If\nW_tl^(m) is sufficiently large, the \u201cclipping\u201d property of the logistic link yields a spatially contiguous\nand extended region over which the tth LSBP layer will dominate. Specifically, c_ml^(t) will likely be\nthe same for data samples located near (as defined by \u03c6_mt) a location where a large W_tl^(m) resides, since in this\nregion \u03c3[g_mt(s)] \u2192 1. All locations s for which (roughly) g_mt(s) \u2265 4 will have, via the \u201cclipping\u201d\nmanifested via the logistic, nearly the same high probability of being associated with model layer\nt. 
Sharp segment boundaries are also encouraged by the steep slope of the logistic function.\nA related use of spatial information is constituted via the kernel stick-breaking process (KSBP) [2].\nWith the KSBP, rather than assuming exchangeable data, the v_mt(s) in (6) is defined as:\n\nv_mt(s) = V_mt K(s, \u0393_mt) \u220f_{\u03c4=1}^{t\u22121} [1 \u2212 V_m\u03c4 K(s, \u0393_m\u03c4; \u03c6)] ,   V_mt \u223c Beta(1, \u03b1_0)   (8)\n\nwhere K(s, \u0393_mt) represents a kernel distance between the feature-vector spatial coordinate s and a\nlocal basis location \u0393_mt associated with the tth stick. Although such a model also establishes spatial\ndependence within local regions, the form of the prior has not been found explicit enough to impose\nsmooth segments with sharp boundaries, as demonstrated in [2].\n4 Using the Proposed Model\n4.1 Inference\nBayesian inference seeks to estimate the posterior distribution of the latent variables \u03a8, given the\nobserved data D and hyper-parameters \u03a5. We employ variational Bayesian (VB) [14] inference as a\ncompromise between accuracy and efficiency. This method approximates an intractable joint posterior\np(\u03a8|D) of all the hidden variables by a product of marginal distributions q(\u03a8) = \u220f_f q_f(\u03a8_f),\neach over only a single hidden variable \u03a8_f. The optimal parameterization of q_f(\u03a8_f) for each\nvariable is obtained by minimizing the Kullback-Leibler divergence between the variational approximation\nq(\u03a8) and the true joint posterior p(\u03a8|D).\n\n4.2 Processing images with no words given\nIf one is given M images, all non-annotated, then the model may be employed on the data {x_ml}_{l=1}^{L_m},\nfor m = 1, . . . , M, from which a posterior distribution is inferred on the image model parameters\n{\u03b8_j^\u2217}_{j=1}^{J}, and on {G_k}_{k=1}^{K}. Note that properties of the image classes and of the objects within\nimages are inferred by processing all M images jointly. 
By placing all images within the context of\neach other, the model is able to infer which building blocks (classes and objects) are responsible for\nall of the data. In this sense the simultaneous processing of multiple images is critical: the learning\nof properties of objects in one image is aided by the properties being learned for objects in all other\nimages, through the inference of inter-relationships and commonalities.\n\nAfter the M images are analyzed in the absence of annotations, one may observe example portions\nof the M images, to infer the link between actual object characteristics within imagery and the\nassociated latent object indicator to which it was assigned. With this linkage made, one may assign\nwords to all or a subset of the K object types. After words are assigned to previously latent object\ntypes, the results of the analysis (with no additional processing) may be used to automatically label\nregions (objects) in all of the images. This is manifested because each of the cluster indicators cml\nis associated with a latent localized object type (to which a word may now be assigned).\n4.3 Joint processing of images and annotations\nWe may consider problems for which a subset of the images are provided with annotations (but not\nthe explicit location and segmented-out objects); the words are assumed to reside in a prescribed\ndictionary of object types. The generation of the annotations (and images) is constituted via the\nmodel in (5), with the LSBP employed as discussed. We do not require that all images are annotated\n(the non-annotated images help learn the properties of the image features, and are therefore useful\neven if they do not provide information about the words). It is desirable that the same word be\nannotated for multiple images. 
The presence of the same word within the annotations of multiple\nimages encourages the model to infer what objects (represented in terms of image features) are\ncommon to the associated images, aiding the learning. Hence, the presence of annotations serves as\na learning aid (encourages looking for commonalities between particular images, if words are shared\nin the associated annotations). Further, the annotations associated with images may disambiguate\nobjects that appear similar in image-feature space (because they will have different annotations).\n\nFrom the above discussion, the model performance will improve as more images are annotated\nwith each word, but presumably this annotation is much easier for the human than requiring one to\nsegment out and localize words within a scene.\n5 Experimental Results\nExperiments are performed on two real-world data sets: subsets of Microsoft Research (MSRC)\ndata (http://research.microsoft.com/en-us/projects/objectclassrecognition/) and UIUC-Sport data from\n[15, 16], the latter images originally obtained from the Flickr website and available online\n(http://vision.cs.princeton.edu/lijiali/event dataset/).\n\nFor the MSRC dataset, 10 categories of images with manual annotations are selected: \u201ctree\u201d, \u201cbuilding\u201d,\n\u201ccow\u201d, \u201cface\u201d, \u201ccar\u201d, \u201csheep\u201d, \u201cflower\u201d, \u201csign\u201d, \u201cbook\u201d and \u201cchair\u201d. The number of images\nin the \u201ccow\u201d class is 45, and in the \u201csheep\u201d class there are 35; there are 30 images in all other\nclasses. From each category, we randomly choose 10 images, and remove the annotations, treating\nthese as non-annotated images within the analysis (to allow quantification of inferred-annotation\nquality). Each image is of size 213 \u00d7 320 or 320 \u00d7 213. In addition, we remove all words that\noccur fewer than 8 times (approximately 1% of all words). 
There are 14 unique words: \u201cvoid\u201d, \u201cbuilding\u201d,\n\u201cgrass\u201d, \u201ctree\u201d, \u201ccow\u201d, \u201csheep\u201d, \u201csky\u201d, \u201cface\u201d, \u201ccar\u201d, \u201cflower\u201d, \u201csign\u201d, \u201cbook\u201d, \u201cchair\u201d and\n\u201croad\u201d. We assume that each word corresponds to a visual object in the image. Regarding the case\nin which multiple words may refer to the same object, one may use the method mentioned in [16] to\ngroup synonyms in the preprocessing phase (not necessary here). The following analysis, in which\nannotated and non-annotated images are processed jointly, is executed as discussed in Section 4.3.\n\nThe UIUC-Sport dataset [15, 16] contains 8 types of sports: \u201cbadminton\u201d, \u201cbocce\u201d, \u201ccroquet\u201d,\n\u201cpolo\u201d, \u201crock climbing\u201d, \u201crowing\u201d, \u201csailing\u201d and \u201csnowboarding\u201d. Here we randomly choose 25\nimages for each category, and each image is resized to a dimension of 240 \u00d7 320 or 320 \u00d7 240.\nSince the annotations are not available at the cited website, the analysis is initially performed with\nno words, as discussed in Section 4.2. After performing this analysis, and upon examining the\nproperties of segmented data associated with each (latent) object class on a small subset of the data,\nwe can infer words associated with some important G_k, and then label portions (objects) within each\nimage via the inferred words. This process is different than in [6, 16, 23], in which annotations were\nemployed.\n\nWhen investigating algorithm performance, we make comparisons to Corr-LDA [6]. Our objectives\nare related to those in [16, 23], but to the authors\u2019 knowledge the associated software is not currently\navailable. The Corr-LDA model [6] is relatively simple, and we implemented it ourselves. 
We also\nexamine our model with the proposed LSBP replaced with KSBP.\n\n5.1 Image preprocessing\nEach image is first segmented into 800 \u201csuperpixels\u201d, which are local, coherent and\npreserve most of the structure necessary for segmentation at the scale of interest [19].\nThe software used for over-segmentation is discussed in [17] and is available online\n(http://www.cs.sfu.ca/~mori/research/superpixels/). Each superpixel is represented by both color and\ntexture descriptors, based on the local RGB, hue [25] feature vectors and also the output of maximum\nresponse (MR) filter banks [22] (http://www.robots.ox.ac.uk/~vgg/research/texclass/filters.html).\nWe discretize these features using a codebook of size 64 (other codebook sizes gave similar\nperformance), and then calculate the distribution [1] for each feature within each superpixel as visual\nwords [3, 6, 10, 11, 20, 23, 24].\n\nSince each superpixel is represented by three visual words, the mixture atoms \u03b8_j^\u2217 are three multinomial\ndistributions {Mult(\u0398_1j^\u2217) \u2297 Mult(\u0398_2j^\u2217) \u2297 Mult(\u0398_3j^\u2217)} for j = 1, \u00b7 \u00b7 \u00b7 , J. Accordingly,\nthe variational distribution in the VB [14] analysis is q(\u03b8_j^\u2217) = Dir(\u0398_1j^\u2217|\u03c1\u0303_1j) \u2297 Dir(\u0398_2j^\u2217|\u03c1\u0303_2j) \u2297 Dir(\u0398_3j^\u2217|\u03c1\u0303_3j).\nThe center of each superpixel is recorded as the location coordinate s_ml. The set of discrete kernel\nwidths \u03c6^\u2217 is defined as 30, 35, \u00b7 \u00b7 \u00b7 , 160, and a uniform multinomial prior is placed on these\nparameters (the size of each kernel is inferred, for each of the T LSBP layers, and separately in\neach of the M images). To save computational resources, rather than centering a kernel at each of\nthe L_m points associated with the superpixels, the kernel spatial centers are placed once every 20\nsuperpixels.\n\nWe set truncation levels I = 20, J = 50 and T = 10 (similar results were found for larger truncations).\nFor analysis on the UIUC-Sport dataset, K = 40. All gamma priors for precision parameters\n\u03b1_w, \u03b1_v, {\u03b7_tl^(m)}_{t=1,l=0,m=1}^{T,L_m,M}, \u03b1_u and \u03b1_h are set as (10^\u22126, 10^\u22126). None of these hyper-parameters\nor truncation levels has been optimized or tuned. In the following comparisons, the number\nof topics is set to be the same as the atom number, J = 50, and the Dirichlet hyperparameters are\nset as (1/J, . . . , 1/J)^T for the Corr-LDA model; a gamma prior is also used for the KSBP precision\nparameter, \u03b1_0 in (8), also set as (10^\u22126, 10^\u22126).\n5.2 Scene clustering\nThe proposed model automatically learns a posterior distribution on mixture-weights u and in so\ndoing infers an estimate of the proper number of scene classes. As shown in Figure 2, although we\ninitialized the truncation level to I = 20, for the MSRC dataset only the first 10 clusters are selected\nas being important (the mixture weights for other clusters are very small); recall that \u201ctruth\u201d indicated\nthat there were 10 classes. In addition, based on the learned posterior word distribution w_i\nfor each image class i, we can further infer which words/objects are probable for each scene class.\nIn Figure 2, we show two example w_i for the MSRC \u201cbuilding\u201d and \u201ccow\u201d classes. Although not\nshown here for brevity, the analysis on UIUC features correctly inferred the 8 image classes associated\nwith that data (without using annotations). 
By examining the words and segmented objects\nextracted with high probability as represented by w_i, we may also assign names to each of the 18\nimage classes across both the MSRC and UIUC data, consistent with the associated class labels\nprovided with the data.\n\nFor each image m \u2208 {1, . . . , M} we also have a posterior distribution on the associated class\nindicator z_m. We approximate the membership for each image by assigning it to the mixture component with\nthe largest probability. This \u201chard\u201d decision is employed to provide a scene-level label for each image (the\nBayesian analysis can also yield a \u201csoft\u201d decision in terms of a full posterior distribution). Figure 3\npresents the confusion matrices for the proposed model with and without LSBP, on both the MSRC\nand UIUC datasets. Both forms of the model yield relatively good results, but the average accuracy\nindicates that the model with LSBP performs better than that without LSBP for both datasets. Note\nthat the results in Figure 3 for the UIUC-Sport data cannot be directly compared with those in [6, 16],\nsince our experiments were performed on non-annotated images.\n\nUsing the concepts discussed in Section 4.2, and employing results from the processed non-annotated\nUIUC-Sport data, we examined the properties of segmented data associated with each\n(latent) object type. We inferred the presence of 12 unique objects, and these objects were assigned\nthe following words: \u201chuman\u201d, \u201chorse\u201d, \u201cgrass\u201d, \u201csky\u201d, \u201ctree\u201d, \u201cground\u201d, \u201cwater\u201d, \u201crock\u201d, \u201ccourt\u201d,\n\u201cboat\u201d, \u201csailboat\u201d and \u201csnow\u201d. Using these words, we annotated each image and re-trained our\nmodel in the presence of annotations. After doing so, the average accuracies of scene-level clustering\nare improved to 72.0% and 69.0% with and without LSBP, respectively. 
The improvement\nin performance, relative to processing the images without annotations, is attributed to the ability of\nwords to disambiguate distinct objects that have similar properties in image-feature space (e.g., the\ndistinct use of \u201cboat\u201d and \u201csailboat\u201d, which helps distinguish rowing and sailing).\n\n[Figure 2 plots. Left panel (Microsoft Research Data): mixture weight versus cluster index. Middle and right panels (\u201cbuilding\u201d and \u201ccow\u201d classes): probability versus object index.]\n\nFigure 2: Example inferred latent properties associated with MSRC dataset. Left: Posterior distribution on\nthe mixture-weights u, quantifying the probability of scene classes (10 classes are inferred). Middle and Right:\nExample probability of objects for a given class, w_i (probability of object/words); here we only give the top 5\nwords for each class.\n\nMSRC, without LSBP:\ntree     .83 .13 .00 .03 .00 .00 .00 .00 .00 .00\nbuilding .10 .80 .00 .03 .00 .00 .00 .00 .07 .00\ncow      .04 .02 .87 .00 .00 .07 .00 .00 .00 .00\nface     .03 .10 .00 .73 .00 .00 .07 .07 .00 .00\ncar      .03 .10 .00 .00 .87 .00 .00 .00 .00 .00\nsheep    .03 .03 .09 .00 .00 .86 .00 .00 .00 .00\nflower   .00 .07 .00 .03 .00 .00 .83 .07 .00 .00\nsign     .03 .03 .00 .00 .00 .00 .10 .80 .03 .00\nbook     .00 .00 .00 .03 .00 .00 .00 .13 .83 .00\nchair    .10 .03 .00 .00 .00 .00 .00 .00 .00 .87\n\nMSRC, with LSBP:\ntree     .87 .10 .00 .03 .00 .00 .00 .00 .00 .00\nbuilding .13 .83 .00 .03 .00 .00 .00 .00 .00 .00\ncow      .04 .00 .89 .00 .00 .07 .00 .00 .00 .00\nface     .03 .07 .00 .80 .00 .00 .07 .03 .00 .00\ncar      .00 .17 .00 .00 .83 .00 .00 .00 .00 .00\nsheep    .00 .00 .11 .00 .00 .89 .00 .00 .00 .00\nflower   .00 .00 .00 .10 .00 .00 .87 .03 .00 .00\nsign     .00 .07 .00 .00 .00 .00 .03 .87 .03 .00\nbook     .00 .00 .00 .03 .00 .00 .00 .10 .87 .00\nchair    .07 .03 .00 .00 .00 .00 .00 .00 .00 .90\n\nUIUC-Sport, without LSBP:\nbadmi.   .76 .00 .08 .04 .00 .04 .04 .04\nbocce    .08 .44 .24 .04 .04 .08 .00 .08\ncroquet  .04 .08 .72 .08 .04 .04 .00 .00\npolo     .04 .04 .12 .64 .04 .04 .04 .04\nrockc    .00 .04 .04 .00 .76 .04 .04 .08\nsailing  .04 .04 .04 .04 .00 .44 .32 .08\nrowing   .04 .04 .04 .04 .04 .28 .44 .08\nsnowb.   .04 .08 .04 .04 .08 .04 .04 .64\n\nUIUC-Sport, with LSBP:\nbadmi.   .76 .04 .04 .04 .00 .04 .04 .04\nbocce    .04 .48 .24 .04 .04 .04 .04 .08\ncroquet  .04 .08 .72 .08 .04 .04 .00 .00\npolo     .04 .04 .12 .64 .04 .04 .04 .04\nrockc    .00 .08 .04 .00 .76 .04 .00 .08\nsailing  .04 .04 .00 .04 .00 .52 .28 .08\nrowing   .04 .04 .04 .04 .04 .24 .52 .04\nsnowb.   .04 .08 .00 .04 .08 .04 .04 .68\n\nFigure 3: Comparisons using confusion matrices for all images in each dataset (all of the annotated and non-annotated\nimages in MSRC; all the non-annotated images in UIUC-Sport). The left two results are for MSRC,\nand the right two for UIUC-Sport. In each pair, the left result is without LSBP, and the right is with LSBP. Average\nperformance, left to right: 82.90%, 86.80%, 60.50% and 63.50%.\n5.3 Image annotation\nThe proposed model infers a posterior distribution for the indicator variables c_ml (defining the object/word\nfor super-pixel l in image m). Similar to the \u201chard\u201d image-class assignment discussed\nabove, a \u201chard\u201d segmentation is employed here to provide object labels for each super-pixel. For the\nMSRC images for which annotations were held out, we evaluate whether the words associated with\nobjects in a given image were given in the associated annotation (thus, our annotation is defined by\nthe words we have assigned to objects in an image).\nTable 1: Comparison of precision and recall values for annotation and segmentation with Corr-LDA [6], our\nmodel without LSBP (Simp. Model) and the extended models with KSBP (Ext. 
with KSBP) and LSBP (Ext.\nwith LSBP) on the MSRC dataset. Annotation results are calculated on the non-annotated images only,\nwhile segmentation results are based on all images.\n\nAnnotation\nObject    Simp. Model        Ext. with LSBP     Corr-LDA\n          Prec  Rec   F      Prec  Rec   F      Prec  Rec   F\ncar       .70   .70   .70    .70   .70   .70    .13   .08   .10\ntree      .50   .60   .55    .55   .60   .57    .06   .03   .04\nsheep     .70   .70   .70    .70   .70   .70    .02   .02   .02\nsky       .66   .60   .63    .68   .60   .64    .39   .29   .33\nchair     .70   .70   .70    .70   .70   .70    .13   .16   .15\nMean      .65   .63   .64    .67   .65   .65    .17   .18   .16\n\nSegmentation\nObject    Simp. Model        Ext. with KSBP     Ext. with LSBP     Corr-LDA\n          Prec  Rec   F      Prec  Rec   F      Prec  Rec   F      Prec  Rec   F\ncar       .49   .38   .43    .56   .50   .53    .61   .58   .60    .18   .60   .28\ntree      .43   .38   .40    .48   .44   .46    .51   .48   .50    .30   .50   .38\nsheep     .53   .63   .58    .57   .63   .60    .60   .62   .61    .17   .60   .27\nsky       .40   .51   .45    .49   .54   .51    .55   .55   .55    .38   .65   .48\nchair     .57   .55   .56    .58   .55   .57    .59   .55   .57    .14   .60   .22\nMean      .49   .51   .50    .53   .53   .53    .56   .54   .54    .23   .63   .32\n\nWe use precision-recall and F-measures [16, 23] to quantitatively evaluate the annotation\nperformance. The annotation part of Table 1 lists detailed annotation results for five objects, as well as the overall\nscores from all object classes for the MSRC data. Our annotation results consistently and significantly\noutperform Corr-LDA, especially for the precision values.\n\n5.4 Object segmentation\nFigure 4 shows some detailed object-segmentation results of Corr-LDA and the proposed model\n(with and without LSBP). We observe that our models generally yield visibly better segmentation\nrelative to Corr-LDA. For example, for complicated objects the Corr-LDA segmentation results are\nvery sensitive to the feature variance, and an object is generally segmented into many small, detailed\nparts. 
By contrast, due to the imposed mixture structure on each object, our models cluster small\nparts into one aggregate object. Furthermore, LSBP encourages local contiguous regions to be\ngrouped in the same segment, and therefore it is less sensitive to localized variability. In addition,\ncompared with results shown in [2], which also used the MSRC dataset, one may observe that KSBP\ncannot do as well as LSBP in maintaining spatial contiguity, as discussed in Section 3.2. Due to\nspace limitations, detailed example comparisons between LSBP and KSBP will be shown elsewhere\nin a longer report; the quantitative comparison in Table 1 further demonstrates the advantages of\nLSBP over KSBP.\n\nFigure 4: Example segmentation and labeling results. First row: original images; second row: Corr-LDA [6];\nthird row: proposed model without LSBP; fourth row: proposed model with LSBP. Columns 1-3 are from the MSRC\ndataset; columns 4-6 are from the UIUC-Sport dataset. The names of the original images are inferred by scene-level\nclassification via our model. The UIUC-Sport results are based on the words inferred by our model.\n\nThe MSRC database provides manually defined segmentations, to which we quantitatively compare.\nThe segmentation part of Table 1 compares results of the proposed model with Corr-LDA. As indicated in\nTable 1, the proposed model (with and without LSBP) significantly outperforms Corr-LDA for all\nobjects. 
Moreover, due to imposed spatial contiguity, the models with KSBP and LSBP perform better\nthan the model without.\n\nThe experiments were performed in non-optimized software written in Matlab, on a Pentium\nPC with a 1.73 GHz CPU and 4 GB of RAM. One VB run of our model with LSBP, for 70 VB iterations,\nrequired nearly 7 hours for 320 images from the MSRC dataset. Typically 50 VB iterations are required\nto achieve convergence. The UIUC-Sport data required comparable CPU time. Our model without LSBP\ntypically took less than half the CPU time on the same dataset. All results are based on a\nsingle VB run, with random initialization.\n\n6 Conclusions\nA nonparametric Bayesian model has been developed for clustering M images into classes; the images\nare represented as an aggregation of distinct localized objects, to which words may be assigned.\nTo infer the relationships between image objects and words (labels), we only need to make the association\nbetween inferred model parameters and words. This may be done as a post-processing step if\nno words are provided, and it may be done in situ if all or a subset of the M images are annotated. Spatially\ncontiguous objects are realized via a new logistic stick-breaking process. Quantitative model\nperformance is highly competitive relative to competing approaches, with relatively fast inference\nrealized via variational Bayesian analysis. 
The authors acknowledge partial support from ARO,\nAFOSR, DOE, NGA and ONR.\n\nReferences\n[1] T. Ahonen and M. Pietik\u00e4inen. Image description using joint distribution of filter bank responses. Pattern Recognition Letters, 30:368\u2013376, 2009.\n\n[2] Q. An, C. Wang, I. Shterev, E. Wang, L. Carin, and D. B. Dunson. Hierarchical kernel stick-breaking process for multi-task image analysis. In ICML, 2008.\n\n[3] K. Barnard, P. Duygulu, N. de Freitas, D. Forsyth, D. M. Blei, and M. I. Jordan. Matching words and pictures. JMLR, 3:1107\u20131135, 2003.\n\n[4] C. M. Bishop and M. E. Tipping. Variational relevance vector machines. In UAI, 2000.\n\n[5] D. Blackwell and J. B. MacQueen. Ferguson distributions via Polya urn schemes. Ann. Statist., 1(2):353\u2013355, 1973.\n\n[6] D. M. Blei and M. Jordan. Modeling annotated data. In SIGIR, 2003.\n\n[7] D. M. Blei and J. D. McAuliffe. Supervised topic models. 
In NIPS, 2007.\n\n[8] D. M. Blei, A. Ng, and M. I. Jordan. Latent Dirichlet allocation. JMLR, 3:993\u20131022, 2003.\n\n[9] A. Bosch, A. Zisserman, and X. Munoz. Scene classification via pLSA. In ECCV, 2006.\n\n[10] L. Cao and L. Fei-Fei. Spatially coherent latent topic model for concurrent segmentation and classification of objects and scenes. In ICCV, 2007.\n\n[11] L. Fei-Fei and P. Perona. A Bayesian hierarchical model for learning natural scene categories. In CVPR, 2005.\n\n[12] T. Hofmann. Unsupervised learning by probabilistic latent semantic analysis. Mach. Learn., 42(1-2):177\u2013196, 2001.\n\n[13] H. Ishwaran and L. F. James. Gibbs sampling methods for stick-breaking priors. JASA, 96(453):161\u2013173, 2001.\n\n[14] M. I. Jordan, Z. Ghahramani, T. S. Jaakkola, and L. Saul. An introduction to variational methods for graphical models. Mach. Learn., 37(2):183\u2013233, 1999.\n\n[15] J. Li and L. Fei-Fei. What, where and who? Classifying events by scene and object recognition. In ICCV, 2007.\n\n[16] J. Li, R. Socher, and L. Fei-Fei. Towards total scene understanding: classification, annotation and segmentation in an automatic framework. In CVPR, 2009.\n\n[17] G. Mori. Guiding model search using segmentation. In ICCV, 2005.\n\n[18] A. Rabinovich, A. Vedaldi, C. Galleguillos, and E. Wiewiora. Objects in context. In ICCV, 2007.\n\n[19] X. Ren and J. Malik. Learning a classification model for segmentation. In ICCV, 2003.\n\n[20] E. B. Sudderth and M. I. Jordan. Shared segmentation of natural scenes using dependent Pitman-Yor processes. In NIPS, 2008.\n\n[21] Y. Teh, M. Jordan, M. Beal, and D. Blei. Hierarchical Dirichlet processes. JASA, 101:1566\u20131582, 2005.\n\n[22] M. Varma and A. Zisserman. Classifying images of materials: Achieving viewpoint and illumination independence. In ECCV, 2002.\n\n[23] C. Wang, D. M. Blei, and L. Fei-Fei. Simultaneous image classification and annotation. In CVPR, 2009.\n\n[24] X. 
Wang and E. Grimson. Spatial latent Dirichlet allocation. In NIPS, 2007.\n\n[25] J. V. D. Weijer and C. Schmid. Coloring local feature extraction. In ECCV, 2006.\n\n[26] O. Yakhnenko and V. Honavar. Multi-modal hierarchical Dirichlet process model for predicting image annotation and image-object label correspondence. In SIAM SDM, 2009.\n\n[27] Z.-H. Zhou and M.-L. Zhang. Multi-instance multi-label learning with application to scene classification. In NIPS, 2006.\n", "award": [], "sourceid": 422, "authors": [{"given_name": "Lan", "family_name": "Du", "institution": null}, {"given_name": "Lu", "family_name": "Ren", "institution": null}, {"given_name": "Lawrence", "family_name": "Carin", "institution": null}, {"given_name": "David", "family_name": "Dunson", "institution": null}]}