{"title": "A ``Shape Aware'' Model for semi-supervised Learning of Objects and its Context", "book": "Advances in Neural Information Processing Systems", "page_first": 577, "page_last": 584, "abstract": "Integrating semantic and syntactic analysis is essential for document analysis. Using analogous reasoning, we present an approach that combines bag-of-words and spatial models to perform semantic and syntactic analysis for recognition of an object based on its internal appearance and its context. We argue that while object recognition requires modeling relative spatial locations of image features within the object, a bag-of-words model is sufficient for representing context. Learning such a model from weakly labeled data involves labeling of features into two classes: foreground (object) or ''informative'' background (context). We present a ''shape-aware'' model which utilizes contour information for efficient and accurate labeling of features in the image. Our approach iterates between an MCMC-based labeling and contour based labeling of features to integrate co-occurrence of features and shape similarity.", "full_text": "A \u201cShape Aware\u201d Model for semi-supervised\n\nLearning of Objects and its Context\n\nAbhinav Gupta1, Jianbo Shi2 and Larry S. Davis1\n\n1 Dept. of Computer Science, Univ. of Maryland, College Park\n\n2 Dept. of Computer and Information Sciences, Univ. of Pennsylvania\n\nagupta@cs.umd.edu, jshi@cis.upenn.edu, lsd@cs.umd.edu\n\nAbstract\n\nWe present an approach that combines bag-of-words and spatial models to perform\nsemantic and syntactic analysis for recognition of an object based on its internal\nappearance and its context. We argue that while object recognition requires mod-\neling relative spatial locations of image features within the object, a bag-of-words model\nis suf\ufb01cient for representing context. 
Learning such a model from weakly labeled\ndata involves labeling of features into two classes: foreground (object) or \u201cinfor-\nmative\u201d background (context). We present a \u201cshape-aware\u201d model which utilizes\ncontour information for ef\ufb01cient and accurate labeling of features in the image.\nOur approach iterates between an MCMC-based labeling and contour based la-\nbeling of features to integrate co-occurrence of features and shape similarity.\n\n1 Introduction\n\nUnderstanding the meaning of a sentence involves both syntactic and semantic analysis. A bag-of-\nwords approach applied locally over a sentence would be insuf\ufb01cient to understand its meaning. For\nexample, \u201cJack hit the bar\u201d and \u201cThe bar hit Jack\u201d have different meanings even though the bag-of-\nwords representation is the same for both. In many cases, determining meaning also requires word\nsense disambiguation using contextual knowledge. For example, does \u201cbar\u201d represent a rod or a\nplace where drinks are served? While a combined semantic and syntactical model could be used\nfor representation and application of context as well, it would be expensive to apply. Syntactical\nrules are generally not required for extracting knowledge about context - a topic model is generally\nsuf\ufb01cient for contextual analysis in text [14, 15].\nWe use analogous reasoning to suggest a similar dichotomy in representing object structure and\ncontext in vision. Our approach combines bag-of-words and spatial models to capture semantics\nand syntactic rules, respectively, that are employed for recognizing an object using its appearance,\nstructure and context. We treat an object and a scene as analogous to a sentence and a document,\nrespectively. 
Similar to documents, object recognition in natural scenes requires modeling spatial\nrelationships of image features (words) within the object but for representing context in a scene, a\nbag-of-words approach suf\ufb01ces (See Figure 1 (a) and (b)).\nLearning such a model from weakly labeled data requires labeling the features in an image as be-\nlonging to an object or its context (informative background). Spatial models, such as constellation\nor star models, compute a sparse representation of objects (with a \ufb01xed number of parts) by se-\nlecting features which satisfy spatial constraints. Their sparse representation reduces their utility\nin the presence of occlusion. Approaches for learning a dense bag-of-features model with spatial\nconstraints from weakly labeled data have also been proposed. Such approaches (based on marginal-\nizing over possible locations of the object), however, lead to poor foreground segmentation if the\ntraining dataset is small, the images have signi\ufb01cant clutter 1 or if some other object in the back-\nground has a strong and consistent spatial relationship with the object to be learned throughout the\ntraining dataset. We overcome this problem by applying shape based constraints while constructing\nthe foreground model.\n\n1 A dataset of less cluttered images would fail to provide enough contextual information to be learned for a\nmodel that simultaneously learns an object model and its contextual relationships.\n\nFigure 1: (a) An example of the importance of spatial constraints locally. The red color shows the features on\nthe foreground car. A bag of words approach fails to capture spatial structure and thus combines the front and\nrear of different cars. (b) We use a spatial model of the object and a bag-of-words approach for context repre-\nsentation. (c) Importance of using contour information: Objects such as signs become part of the foreground\nsince they occur at consistent relative location to the car. If shape and contour information is combined with\nco-occurrence and spatial structure of image features, then such mis-labellings can be reduced. For example,\nin the above case since there are strong intervening contours between the features on the car (foreground) and\nthe features on signs, and there is a lack of strong contours between features on signs and features on trees\n(background), it is more likely that features on the signs should be labeled as background.\n\nProblem:\nLearn the parameters of object model given the images (I1, .., ID), object labels (O1, .., OD)\nand Object Model Shape (M).\n\nApproach:\nSimultaneous localization of the object in training images and estimation of model parameters. This\nis achieved by integrating cues from image features and contours. The criteria include the following terms:\n1. Feature Statistics: The image features satisfy the co-occurrence and spatial statistics of the model.\n2. Shape Similarity: The shape of the foreground object is similar to the shape of the sketch of the object.\n3. Separation: The object and background features should be separated by the object boundary contours.\n\nTable 1: Summary of \u201cShape Aware\u201d Model\n\nFigure 1(c) shows an example of how contours provide important information for\nforeground/background labeling. We add two constraints to the labeling problem using the contour\ninformation: (a) The \ufb01rst constraint requires the presence of strong intervening contours between\nforeground and background features. (b) The second constraint requires the shape of boundary con-\ntours be similar to the shape of the exemplar/sketch provided with the weakly labeled dataset. This\nallows us to learn object models from images where there is signi\ufb01cant clutter and in which the\nobject does not cover a signi\ufb01cant part of the image. 
We provide an iterative solution to integrate\nthese constraints. Our approach \ufb01rst labels the image features based on co-occurrence and spatial\nstatistics - the features that occur in positive images and exhibit strong spatial relationships are la-\nbeled as foreground features. Given the labels of image features, object boundaries are identi\ufb01ed\nbased on how well they separate foreground and background features. This is followed by a shape\nmatching step which identi\ufb01es the object boundary contours based on their expected shape. This\nstep prunes many contours and provides a better estimate of object boundaries. These boundaries\nare then used to relabel the features in the image. This provides an initialization point for the next\niteration of Gibbs sampling. Figure 2 shows the system \ufb02ow of our \u201cShape Aware\u201d approach.\n\nFigure 2: Shape-Aware Learning (Overview): We \ufb01rst compute feature labels using the Gibbs sampling ap-\nproach on the Spatial Author Topic model. The features labeled foreground and background are drawn in red\nand yellow respectively. This is followed by object boundary extraction. The object boundaries are identi\ufb01ed\nbased on how well they separate foreground and background features. Likely object boundary contours are then\nmatched to the sketch using a voting-based approach and the contours consistent with the shape of the sketch\nare identi\ufb01ed. These contours are then used to relabel the features using the same separation principle. The\nnew labels and topics from the previous time step are used as a new initialization point for the next iteration.\n\n1.1 Related Work\n\nMany graphical models for object recognition [11] have been inspired by models of text documents\nsuch as LDA [6] and pLSA [7]. These models are computationally ef\ufb01cient because they ignore\nthe spatial relationships amongst image features (or parts) and use a dense object representation.\nHowever, ignoring spatial relationships between features leads to problems (See Figure 1(a)). In\ncontrast, approaches that model spatial relationships [9, 5] between object parts/features are\ncomputationally expensive and therefore employ only sparse feature representations. These approaches\nfail under occlusion due to their sparse representation and their stringent requirement of a one-to-one\ncorrespondence between image and object features.\nThere has been recent work in applying spatial constraints to topic models which enforce neigh-\nboring features to belong to similar topics [10, 2] for the purpose of segmentation. Our work is\nmore related to classi\ufb01cation based approaches [8, 3] that model spatial locations of detected fea-\ntures based on a reference location in the image. Sudderth et al. [3] presented such a model that\ncan be learned in a supervised manner. Fergus et al. [8] proposed an approach to learn the model\nfrom weakly labeled data. This was achieved by marginalizing object locations and scale. Each\nobject location hypothesis provides a foreground segmentation which can be used for learning the\nmodel. Such an approach, however, is expensive unless the training images are not highly cluttered.\nAdditionally, it is subject to modeling errors if the object of interest is small in the training\nimages.\nOur goal is to simultaneously learn an object model and its context model from weakly labeled\nimages. To learn context we require real world scenes of objects and their natural surrounding en-\nvironment (high clutter and small objects). We present a \u201cshape aware\u201d feature based model for\nrecognizing objects. Our approach resolves the foreground/background labeling ambiguities by re-\nquiring that the shapes of the foreground object across the training images be similar to a sketch\nexemplar. Shape based models [1] have been used previously for object recognition. 
However,\ncontour matching is an expensive (exponential) problem due to the need to select the best subset of\ncontours from the set of all edges that match the shape model. Approximate approaches such as\nMCMC are not applicable since matching is very closely coupled with selection. We propose an\nef\ufb01cient approach that iterates between a co-occurrence based labeling and contour based labeling\nof features.\n\n2 Our Approach - Integrating feature and contour based cues\n\nWe assume the availability of a database of weakly labeled images which specify the presence of an\nobject, but not its location. Similar to previous approaches based on document models, we vector\nquantize the space of image features into visual words to generate a discrete image representation.\nEach visual word is analogous to a word and an image is treated as analogous to a document.\nEach word is associated with a topic and an author (the object). The topic distribution depends\non the associated author and the word distribution depends on the assigned topic (Section 2.1).\nWe start with random assignments of words to topics and authors. This is followed by a Gibbs\nsampling step which simultaneously estimates the hidden variables (topic and author) and also the\nparameters of the generative model that maximize the likelihood (Section 2.2). These assignments\nare then used to obtain a set of likely object boundary contours in each image. These contours are\nsubsequently analyzed to identify the object \u201ccenters\u201d and \ufb01nal object contours by matching with\nthe shape exemplar (Section 2.3). Using the new set of boundary contours, the authors corresponding\nto each word are reassigned and the model is retrained using the new assignment.\n\n2.1 Generative Model - Syntax and Semantics\n\nAuthor-Topic Model: Our model is motivated by the author-topic model [13] and the model pre-\nsented in [4]. 
We \ufb01rst provide a brief description of the author topic model, shown in Figure 3(a).\nThe author-topic model is used to model documents for which a set of authors is given. For each\nword in the document, an author (xi) is chosen uniformly at random from the set of authors (ad). A\ntopic (zi) is chosen from a distribution of topics speci\ufb01c to the selected author and a word (wi) is\ngenerated from that topic. The distribution of topics (\u03b8) for each author is chosen from a symmetric\nDirichlet(\u03b1) prior and the distribution of words (\u03c6) for a topic is chosen from a symmetric Dirichlet\n(\u03b2) prior.\n\nFigure 3: (a) Author-Topic Model (b) Our Model (Spatial Author-Topic Model). Our model extends\nthe author topic model by including the spatial (syntactical) relationship between features.\n\nSpatial-Author Topic Model: Our model is shown in Figure 3(b). Our goal is not only to model the\ndistribution of type of features but also to model the distribution of spatial locations of the subset of\nthese features that are associated with the foreground object. We model this as follows: A feature in\nthe image is described by its type wi and location li. Each feature (wi, li) is \u2018authored\u2019 by an author\nxi which is described by its type oi (see footnote 2) and its location ri. For each feature, the author xi is chosen\nfrom a distribution, \u03b7, which can be either uniform or generated using available priors from other\nsources. Topic zi for each word is chosen from a distribution of topics speci\ufb01c to the type of object\noi and a word wi is generated from that topic. The distribution of topics (\u03b8) for each object type is\nchosen from a symmetric Dirichlet (\u03b1) distribution (see footnote 3). 
The distribution of words for each topic is\nchosen from a symmetric Dirichlet (\u03b2) prior.\nThe location of each feature, li, is sampled from the following distribution:\n\n$$p(l_i \\mid o_i, z_i, r_i) = \\exp\\left(-\\frac{\\|l_i - r_i\\|^2}{\\sigma_s^2}\\right) \\zeta^{o_i,z_i}_{r_i}(l_i) \\tag{1}$$\n\n2 For an image with label car, the possible object types are car, and context of car. The differentiation\nbetween \u201cinformative\u201d and \u201cnon-informative\u201d background is captured by the probability distributions.\n\n3 The Dirichlet distribution is an attractive distribution - it belongs to the exponential family and is conjugate\nto the multinomial distribution.\n\nThe \ufb01rst term ensures that each feature has higher probability of being generated by nearby reference\nlocations. The second term enforces spatial constraints on the location of the feature that is generated\nby topic (zi). We enforce these spatial constraints by a binning approach. Each feature in the\nforeground can lie in B possible bins with respect to the reference location. The distribution of the\nspatial location of a feature is speci\ufb01c to the topic zi and the type of object oi. This distribution is\nchosen from a symmetric Dirichlet (\u03b3) prior. Since we do not want to enforce spatial constraints\non the locations of the features generated by topics from context, we set \u03b6 to a constant when oi\ncorresponds to the context of some object.\n\n2.2 Gibbs Sampling\n\nWe use Gibbs sampling to estimate zi and xi for each feature. 
Given the features (w, l), author\nassignments x, other topic assignments z\u2212i and other hyperparameters, each zi is drawn from:\n\n$$P(z_i \\mid w, l, x, z_{-i}) \\propto P(w_i \\mid w_{-i}, z)\\, P(z_i \\mid z_{-i}, o_i)\\, P(l_i \\mid x_i, l_{-i}, x_{-i}, z_i) \\propto \\frac{n^{z_i}_{w_i} + \\beta}{n^{z_i} + W\\beta} \\cdot \\frac{n^{o_i}_{z_i} + \\alpha}{n^{o_i} + T\\alpha} \\cdot \\frac{n^{o_i,z_i}_{B_i} + \\gamma}{n^{o_i,z_i} + B\\gamma} \\tag{2}$$\n\nwhere $n^{z_i}_{w_i}$ represents the number of features of type wi in the dataset assigned to topic zi and $n^{z_i}$\nrepresents the total number of features assigned to topic zi. $n^{o_i}_{z_i}$ represents the number of features\nthat are assigned to topic zi and author of type oi and $n^{o_i}$ represents the total number of features\nassigned to author oi. Bi represents the spatial bin in which feature i lies when the reference is ri,\n$n^{o_i,z_i}_{B_i}$ represents the number of features from object type oi and topic zi which lie in bin Bi, and $n^{o_i,z_i}$\nrepresents the total number of features from object type oi and topic zi. W is the number of word\ntypes and T is the number of topics.\nSimilarly, given the features (w, l), topic assignments z, other author assignments x\u2212i and other\nhyperparameters, each xi is drawn from:\n\n$$P(x_i \\mid w, l, z, x_{-i}) \\propto P(l_i \\mid x_i, l_{-i}, x_{-i}, z_i)\\, P(z_i \\mid o_i, z_{-i}, x_{-i})\\, P(r_i \\mid o_i, z_{-i}, x_{-i}) \\propto \\exp\\left(-\\frac{\\|l_i - r_i\\|^2}{\\sigma_s^2}\\right) \\cdot \\frac{n^{o_i,z_i}_{B_i} + \\gamma}{n^{o_i,z_i} + B\\gamma} \\cdot \\frac{n^{o_i}_{z_i} + \\alpha}{n^{o_i} + T\\alpha} \\cdot \\frac{n^{o_i}_{r_i} + \\delta}{n^{o_i} + R\\delta} \\tag{3}$$\n\nwhere $n^{o_i}_{r_i}$ represents the number of features from object type oi that have ri as the reference location\nand $n^{o_i}$ represents the total number of features from object oi. In case oi is of type context, the\nsecond term is replaced by a constant. R represents the number of possible reference locations.\n\n2.3 \u201cShape Aware\u201d Model\n\nThe generative model presented in section 2.1 can be learned using the Gibbs sampling approach\nexplained above. 
However, this approach has some shortcomings: (a) If there are features in the\nbackground that exhibit a strong spatial relationship with the object, they can be labeled as fore-\nground. (b) In clutter, the labeling performance diminishes as the discriminability of the object is\nlower. The labeling performance can, however, be improved if contour cues are utilized. We do\nthis by requiring that the shape of the object boundary contours extracted based on feature labeling\nshould be similar to a sketch of the object provided in the dataset. Thus, the labeling of features into\nforeground and background is not only governed by co-occurrence and structural information, but\nalso by shape similarity. We refer to this as a \u201cshape aware\u201d model.\nShape matching using contours has, in the worst case, exponential complexity since it requires\nselection of the subset of contours that best constitute the foreground boundary. We avoid this\ncomputationally expensive challenge by solving the selection problem based on the labels of features\nextracted using Gibbs sampling. The spatial author-topic model is used to attend to the contours\nwhich are likely to be object boundaries. Our shape matching module has three steps: (a) Extracting\nobject boundaries based on labels extracted from the spatial author topic model.\n(b) Extracting\nboundaries consistent with the shape model by matching. (c) Using new boundaries to determine\nnew labels for features.\n\n\fFigure 4: Extraction of object boundaries consistent with the shape of exemplar. The \ufb01rst step is extraction\nof contours which separate foreground and background features. This is followed by a voting process. Each\ncontour in the image is matched to every contour in the model to extract the center of the object. 
The votes are\nthen traced back to identify the contours consistent with the shape model.\n\nExtracting Object Boundary Contours from Feature Labels: We \ufb01rst determine the edges using\nand group them into contours using the approach presented in [16]. Each contour cj is a collection\nof 2D points (pj1, pj2....). Our goal is to extract boundary contours of the object using the feature\nlabels. Since the boundary contours separate foreground and background features, an estimate\nof the number of foreground and background features on each side of an image contour provides\nevidence as to whether that image contour is part of the object boundary. For each contour, we\nmeasure the number of foreground and background features that lie on each side of the contour\nwithin some \ufb01xed distance of the contour. The probability that a contour is a boundary contour\n(clj = 1) of the object with the side S1 being the interior of the object is given by:\n\n$$P_{S_1}(cl_j = 1 \\mid x) = \\frac{n^{S_1}_f + \\tau}{n^{S_1} + 2\\tau} \\cdot \\frac{n^{S_2}_b + \\tau}{n^{S_2} + 2\\tau} \\tag{4}$$\n\nwhere $n^{S_1}_f$ is the total number of features with foreground label on side S1 of the contour and $n^{S_1}$\nis the total number of features on side S1; $n^{S_2}_b$ and $n^{S_2}$ are de\ufb01ned analogously for background features on side S2.\nShape Matching: Given the probabilities of each contour being a part of the object boundary, we\nestimate the object center using a voting-based approach [18]. Each contour votes for the center of\nthe object where the weight of the vote is determined based on how well the contour matches the\nsketch. Non-maximal suppression is then used to estimate the candidate object locations. Once the\ncandidate location of the center of object is selected, we trace back the votes to estimate the new\nboundary of the object. Figure 4 shows an example of the voting process and boundary contours\nextracted using this approach.\nExtracting New Labels: These boundaries are then used to relabel the image features into fore-\nground and background. 
We use the same separation principle to label new features. Each boundary\ncontour votes as to whether a feature should be labeled foreground or background. If the feature lies\non the same side as the object center, then the contour votes for the feature as foreground. Votes are\nweighted based on the probability of a contour being an object boundary. Therefore, the probability\nthat the feature i is labeled as foreground is given by\n\n$$\\frac{\\sum_j \\omega_j \\nu_{ij}}{\\sum_j \\omega_j}$$\n\nwhere \u03c9j is the probability that the\ncontour j is on the object boundary and \u03bdij is a variable which is 1 if the object center and feature are on the\nsame side of contour cj, or 0 if they are on opposite sides. The new labels are then used as an\ninitialization point for the Gibbs sampling based learning of the feature model.\n\n3 Experimental Results\n\nFigure 5: Advantages of iterative approach. At each iteration, the author topic distribution changes, which\nrequires retraining the model using Gibbs sampling. This can help in two ways: (A) More Focused Attention:\nThe feature labeling gets re\ufb01ned. (B) Change of Focus: A new reference point gets chosen by new distribution.\n\nWe tested our \u201cshape-aware\u201d model on images of cars obtained from the LabelMe dataset [17].\nWe randomly selected 45 images for training the model from the LabelMe dataset. A potential\nconcern is the number of iterations required for convergence by our iterative approach. However, it was\nempirically observed that, in most cases the system stabilizes after only two iterations. It should also\nbe noted that each iteration between contour and feature labelings is performed after 200 iterations\nof Gibbs sampling. The advantages of having an iterative approach are shown in Figure 5. We\ncompared the performance of our system against the author-topic model and the author-topic model\nwith spatial constraints. 
We evaluated the performance of the algorithm by measuring the labeling\nperformance in training and test datasets. Better labeling in training is required for better model\nlearning. Figure 6 shows some of the cases where both author-topic and author-topic model with\nspatial constraints fail due to high clutter or the foreground object being too small in the training\ndataset. The \u201cshape aware\u201d model, however, shows better localization performance as compared to\nthe other two.\n\nFigure 6: Two examples of how the \u201cshape aware\u201d model provides better localization compared to spatial\nauthor topic models. The odd columns (t = 0) show the results of the author topic model (the initialization point of\nthe iterative approach). The even columns (t = 2) show the labeling provided by our algorithm after 2 iterations.\n\nFigure 7: (a) Labeling (Training) (b) Labeling (Test). Quantitative comparison (recall and precision) of the author-topic, spatial author-topic and \u201cshape aware\u201d models based on\n40 randomly selected images each from the training and test datasets (approximately 17000 features each). The\nvalues of the parameters used are T = 50, \u03b1 = 50/T, \u03b2 = 0.01, \u03b3 = 0.01, B = 8 and \u03c4 = 0.1.\n\nFigure 7 shows a quantitative comparison of the \u201cshape aware\u201d model to the author-topic and the\nspatial author-topic model. Recall ratio is de\ufb01ned as the ratio of features correctly labeled as foreground to the\ntotal number of foreground features. Precision is de\ufb01ned as the ratio of features correctly labeled as\nforeground to the total number of features labeled as foreground. 
In the case of labeling in training\ndata, our approach outperforms both author-topic and spatial author-topic models. In the case of the test\ndataset, the author-topic model has higher recall but very low precision. The low precision of author-\ntopic and spatial author-topic can be attributed to the fact that, in many cases the context is similar\nand at the same relative locations to each other. This leads to modeling errors - these features are\nlearned to be part of the object. In the case of the \u201cshape aware\u201d model, the shape of the objects helps\nin pruning these features and therefore leads to much higher precision. Low recall rates in our model\nand the spatial author-topic model arise because some foreground features do not satisfy the spatial\nconstraints and hence are falsely labeled as background features. Figure 9 shows some examples of\nperformance of the \u201cshape aware\u201d model on the test dataset.\n\nFigure 8: Example of performance of three models on a test image. \u201cShape Aware\u201d model shows high\nprecision in label prediction due to pruning provided by shape matching. Author Topic model shows high\nrecall rates because of high similarity in context across images.\n\nFigure 9: A few examples of labeling in the test dataset.\n\nAcknowledgements\n\nThis research was funded by US Government\u2019s VACE program and NSF-IIS-04-47953(CAREER) award. The\nauthors would also like to thank Qihui Zhu for providing the code for extracting contours.\n\nReferences\n[1] G. Elidan, G. Heitz and D. Koller, Learning Object Shape: From Drawings to Images, IEEE CVPR 2006.\n[2] X. Wang and E. Grimson, Spatial Latent Dirichlet Allocation, NIPS 2007.\n[3] E. Sudderth, A. 
Torralba, W.T Freeman and A.S Willsky, Learning Hierarchical Models of Scenes, Objects\nand Parts, ICCV 2005.\n[4] T.L Grif\ufb01ths, M Steyvers, D.M Blei and J.B Tenenbaum, Integrating Topics and Syntax, NIPS 2005.\n[5] D.J Crandall and D.P Huttenlocher, Weakly Supervised Learning of Part-Based Spatial Models for Visual\nObject Recognition, ECCV 2006.\n[6] D. Blei, A. Ng and M. Jordan, Latent Dirichlet Allocation, Journal of Machine Learning Research, 2003.\n[7] T. Hofmann, Unsupervised learning by probabilistic latent semantic analysis, Machine Learning 2001.\n[8] R. Fergus, L. Fei-Fei, P. Perona and A. Zisserman, Learning Object Categories from Google\u2019s Image\nSearch, ICCV 2005.\n[9] R. Fergus, P. Perona and A. Zisserman, Object Class Recognition by Unsupervised Scale-Invariant Learn-\ning, CVPR 2003.\n[10] L. Cao and L. Fei-Fei, Spatially coherent latent topic model for concurrent object segmentation and\nclassi\ufb01cation, ICCV 2007.\n[11] B. Russell, A. Efros, J. Sivic, W. Freeman and A. Zisserman, Using Multiple Segmentations to Discover\nObjects and their Extent in Image Collections, CVPR 2006.\n[12] T.L Grif\ufb01ths and M. Steyvers, Finding Scienti\ufb01c Topics, PNAS 2004.\n[13] M. Rosen-Zvi, T. Grif\ufb01ths, M. Steyvers and P. Smyth, The Author-Topic Model for Authors and Docu-\nments, UAI 2004.\n[14] M. Lesk, Automatic Sense Disambiguation Using Machine Readable Dictionaries: How to Tell a Pine\nCone from Ice Cream Cone, SIGDOC 1986.\n[15] D. Yarowsky, Word Sense Disambiguation Using Statistical Models of Roget\u2019s Categories trained on\nLarge Corpora, COLING 1992.\n[16] Q. Zhu, G. Song and J. Shi, Untangling Cycles for Contour Grouping, ICCV 2007.\n[17] B. C. Russell, A. Torralba, K. P. Murphy, W. T. Freeman, LabelMe: a Database and Web-based Tool for\nImage Annotation, IJCV 2008.\n[18] B. Leibe, A. Leonardis and B. 
Schiele, Combined Object Categorization and Segmentation with an Implicit\nShape Model, ECCV workshop on Statistical Learning in Vision, 2006.\n", "award": [], "sourceid": 681, "authors": [{"given_name": "Abhinav", "family_name": "Gupta", "institution": null}, {"given_name": "Jianbo", "family_name": "Shi", "institution": null}, {"given_name": "Larry", "family_name": "Davis", "institution": null}]}