{"title": "Efficient Unsupervised Learning for Localization and Detection in Object Categories", "book": "Advances in Neural Information Processing Systems", "page_first": 811, "page_last": 818, "abstract": "", "full_text": "Efficient Unsupervised Learning for Localization and Detection in Object Categories\n\nNicolas Loeff, Himanshu Arora\nECE Department\nUniversity of Illinois at Urbana-Champaign\n{loeff,harora1}@uiuc.edu\n\nAlexander Sorokin, David Forsyth\nComputer Science Department\nUniversity of Illinois at Urbana-Champaign\n{sorokin2,daf}@uiuc.edu\n\nAbstract\n\nWe describe a novel method for learning templates for recognition and localization of objects drawn from categories. A generative model represents the configuration of multiple object parts with respect to an object coordinate system; these parts in turn generate image features. The complexity of the model in the number of features is low, meaning our model is much more efficient to train than comparable methods. Moreover, a variational approximation is introduced that allows learning to be orders of magnitude faster than previous approaches while incorporating many more features. This results in both accuracy and localization improvements. Our model has been carefully tested on standard datasets; we compare with a number of recent template models. In particular, we demonstrate state-of-the-art results for detection and localization.\n\n1 Introduction\n\nBuilding appropriate object models is central to object recognition, which is a fundamental problem in computer vision. Desirable characteristics of a model include a good representation of objects and fast, efficient learning algorithms that require as little supervised information as possible. We believe an appropriate representation of an object should allow for both detection of its presence and localization ('where is it?'). 
So far the quality of object recognition in the literature has been measured by its detection performance only. Viola and Jones [1] present a fast object detection system boosting Haar filter responses. Another effective discriminative approach is that of a bag of keypoints [2, 3]. It is based on clustering image patches using appearance only, disregarding geometric information. The detection performance of this algorithm is among the state of the art. However, as no geometry cues are used during training, features that do not belong to the object can be incorporated into the object model. This is similar to classic overfitting and typically leads to problems in object localization.\n\nWeber et al. [4] represent an object as a constellation of parts. Fergus et al. [5] extend the model to account for variability in appearance. The model encodes a template as a set of feature-generating parts. Each part generates at most one feature. As a result the complexity is determined by the hardness of the part-feature assignment. Heuristic search is used to approximate the solution, but feasible problems are limited to 7 parts with 30 features.\n\nAgarwal and Roth [6] use SNoW to learn a classifier on a sparse representation of patches extracted around interest points in the image. In [7], Leibe and Schiele use a voting scheme to predict object configuration from locations of individual patches. Both approaches provide localization, but require manually localizing the objects in training images. Hillel et al. [8] independently proposed an approach similar to ours. Their model, however, has higher learning complexity and inferior detection performance despite being of a discriminative nature.\n\nIn this paper, we present a generative probabilistic model for detection and localization of objects that can be efficiently learnt with minimal supervision. 
The first crucial property of the model is that it represents the configuration of multiple object parts with respect to an unobserved, abstract object root (unlike [9, 10], where an "object root" is chosen as one of the visible parts of the object). This simplifies localization and allows our model to overcome occlusion and errors in feature extraction. The model also becomes symmetric with respect to visible parts. The second crucial assumption of the model is that a single part can generate multiple features in the image (or none). This may seem counterintuitive, but keypoint detectors generally detect several features around interesting areas. This hypothesis also makes an explicit model for part occlusion unnecessary: instead, occlusion of a part means implicitly that no feature in the image is produced by it.\n\nThese assumptions allow us to model all features in the image as being emitted independently conditioned on the object center. As a result the complexity of inference in our model is linear in the number of parts of the model and the number of features in the image, obviating the exponential complexity of combinatorial assignments in other approaches [4, 5, 11]. This means our model is much easier to train than constellation models using Expectation Maximization (EM), which enables the use of more features and more complex models, with resulting improvements in both accuracy and localization. Furthermore, we introduce a variational (mean-field) approximation during learning that allows it to be hundreds of times faster than previous approaches, with no substantial loss of accuracy.\n\n2 Model\n\nOur model of an object category is a template that generates features in the image. Each image is represented as a set {f_j} of F features extracted with the scale-saliency point detector [13]. Each feature is described by its location and appearance. 
Feature extraction and representation will be detailed in section 3. As described in the introduction, we hypothesize that given the object center all features are generated independently: p_{obj}(f_1, \ldots, f_F) = \sum_{o_c} P(o_c) \prod_j p(f_j|o_c). The abstract object center - which does not generate any features - is represented by a hidden random variable o_c. For simplicity it takes values in a discrete grid of size N_x \times N_y inside the image, and o_c is assumed to be a priori uniformly distributed in its domain.\n\nConditioned on the object center, each feature is generated by a mixture of P parts plus a background part. A set of hidden variables {\omega_{ij}} represents which part (i) produced feature f_j. These variables \omega_{ij} take values {0, 1} restricted to \sum_{i=1}^{P+1} \omega_{ij} = 1. In other words, \omega_{ij} = 1 means feature j was produced by part i; each part can produce multiple features, but each feature is produced by only one part. The distribution of a feature conditioned on the object center is then p(f_j|o_c) = \sum_i p(f_j, \omega_{ij} = 1|o_c) = \sum_i p(f_j|\omega_{ij} = 1, o_c) \pi_i, where \pi_i is the prior emission probability of part i, subject to \sum_{i=1}^{P+1} \pi_i = 1.\n\nEach part has a location distribution with respect to the object center corresponding to a two-dimensional full covariance Gaussian, p^i_L(x|o_c). The appearance (see section 3 for details) of a part does not depend on the configuration of the object; we consider two models:\n\nGaussian Model (G): Appearance p^i_A is modeled as a k-dimensional diagonal covariance Gaussian distribution.\n\nLocal Topic Model (LT): Appearance p^i_A is modeled as a multinomial distribution on a previously learnt k-word image patch dictionary. This can be considered as a local topic model.\n\nLet \theta denote the set of parameters. 
The complete data likelihood (joint distribution) for image n in the object model is then,\n\nP^{obj}_\theta({\omega_{ij}}, o_c, {f_j}) = \prod_{o'_c} \Big\{ P(o'_c) \prod_{j,i} \big[ p^i_L(f_j|o'_c) \, p^i_A(f_j) \, \pi_i \big]^{[\omega_{ij}=1]} \Big\}^{[o_c=o'_c]}    (1)\n\nwhere [expr] is one if expr is true and zero otherwise. Marginalizing, the probability of the observed image in the object model is then,\n\nP^{obj}_\theta({f_j}) = \sum_{o_c} P(o_c) \prod_{j'} \Big( \sum_i P(f_{j'}, \omega_{ij'} = 1|o_c) \Big)    (2)\n\nThe background model assumes all features are produced independently, with uniform location on the image. In the G model of appearance, the background appearance is modeled with a k-dimensional full covariance Gaussian distribution. In the LT model, we use a multinomial distribution on the k-word image patch dictionary to model the appearance.\n\n2.1 Learning\n\nThe maximum-likelihood solution for the parameters of the above model does not have a closed form. In order to train the model, the parameters are computed numerically using the approach of [14], minimizing a free energy F_e associated with the model that is an upper bound on the negative log-likelihood. Following [14], we denote v = {f_j} as the set of visible and h = {o_c, \omega_{ij}} as the set of hidden variables. Let D_{KL} be the K-L divergence:\n\nF_e(Q, \theta) = D_{KL}\big( Q(h) \,\|\, P_\theta(h|v) \big) - \log P_\theta(v) = \int_h Q(h) \log \frac{Q(h)}{P_\theta(h, v)} \, dh    (3)\n\nIn this bound, Q(h) can be a simpler approximation of the posterior probability P_\theta(h|v) that is used to compute estimates and update parameters. Minimizing eq. 3 with respect to Q and \theta under different restrictions produces a range of algorithms including exact EM, variational learning and others [14]. 
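As a concrete illustration, the marginal likelihood of eq. 2 can be evaluated in one pass over an array of per-(center, feature, part) log-probabilities. This is a minimal sketch under stated assumptions, not the authors' implementation; the array names and shapes, and the `logsumexp` helper, are illustrative.

```python
import numpy as np

def logsumexp(a, axis):
    # numerically stable log(sum(exp(a))) along the given axis
    m = a.max(axis=axis, keepdims=True)
    return (m + np.log(np.exp(a - m).sum(axis=axis, keepdims=True))).squeeze(axis)

def image_log_likelihood(log_p_feat, log_prior_oc):
    """log P_obj({f_j}) for one image (eq. 2 in log space).

    log_p_feat   : (C, F, P+1) array with log p(f_j, w_ij=1 | o_c), i.e.
                   log [p_L * p_A * pi_i], for each candidate center c on the
                   Nx*Ny grid (C = Nx*Ny), feature j, and part i (index P+1
                   playing the role of the background part).
    log_prior_oc : (C,) array with log P(o_c) (uniform over the grid).
    """
    # mixture over parts for each (center, feature): log sum_i p(f_j, w_ij=1|o_c)
    log_p_j_given_oc = logsumexp(log_p_feat, axis=2)      # (C, F)
    # features are independent given the object center: sum of logs over j
    log_p_img_given_oc = log_p_j_given_oc.sum(axis=1)     # (C,)
    # marginalize the hidden object center over the grid
    return logsumexp(log_prior_oc + log_p_img_given_oc, axis=0)
```

Filling the (C, F, P+1) array dominates the cost and is O(FP · N_x N_y), which matches the inference complexity of the exact model.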
Table 1 shows sample updates and the complexity of these algorithms, with a comparison to other relevant work.\n\nThe background model is learnt before the object model is trained. For the Gaussian appearance model, the background appearance is a single Gaussian whose mean and covariance are estimated as the sample mean and covariance. For the Local Topic model, the multinomial distribution is estimated as the sample histogram. The model for background feature location is uniform and does not have any parameters.\n\nEM Learning for the Object model: In the E-step, the set of parameters \theta is fixed and F_e is minimized with respect to Q(h) without restrictions. This is equivalent to computing the actual posteriors in EM [14, 15]. In this case the optimal solution factorizes as Q(h) = Q(o_c)Q(\omega_{ij}|o_c) = P(o_c|v)P(\omega_{ij}|o_c, v). In the M-step, F_e is minimized with respect to the parameters \theta using the current estimate of Q. Due to the conditional independence introduced in the model, inference is tractable and thus the E-step can be computed efficiently. The overall complexity of inference is O(F P \cdot N_x N_y).\n\n| Model | Update for \mu^i_L | Complexity | Time (F, P) |\n| Fergus et al. [5] | N/A | F^P | 36 hrs (30, 7) |\n| Ours (EM) | \mu^i_L \leftarrow \sum_n \sum_{o_c} Q(o_c) \sum_j Q(\omega_{ji}|o_c) \{x^j_L - o_c\} / \sum_n \sum_{o_c} Q(o_c) \sum_j Q(\omega_{ji}|o_c) | F P \cdot N_x N_y | 3 hrs (50, 30) |\n| Ours (Variational) | \mu^i_L \leftarrow \sum_n \{\sum_j Q(\omega_{ji}) x^j_L - \sum_{o_c} Q(o_c) o_c\} / \sum_n \sum_j Q(\omega_{ji}) | F P + N_x N_y | 3 mins (100, 30) |\n\nTable 1: An example of an update, overall complexity and convergence time for our models and [5], for different numbers of features per image (F) and parts in the object model (P). 
There is an increase in speed of several orders of magnitude with respect to [5] on similar hardware.\n\nVariational Learning: In this approach a mean field approximation of Q is considered; in the E-step the parameters \theta are fixed and F_e is minimized with respect to Q under the restriction that it factorizes as Q(h) = Q(o_c)Q(\omega_{ij}). This corresponds to a decoupling of location (o_c) and part-feature assignment (\omega_{ij}) in the approximation Q of the posterior P_\theta(h|v). In the M-step \theta is fixed and the free energy F_e is minimized with respect to this (mean field) version of Q. A comparison between the EM and variational updates of the mean location \mu^i_L of a part is shown in Table 1. The overall complexity of inference is now O(F P) + O(N_x N_y); this represents orders of magnitude of speedup with respect to the already efficient EM learning. The impact of the variational approximation on performance is discussed in section 4.\n\n2.2 Detection and localization\n\nFor detection of object presence, a natural decision rule is the likelihood ratio test. After the models are learnt, for each test image P^{obj}_\theta({f_j}) / P^{bg}({f_j}) is compared to a threshold to make the decision. Once the presence of the object is established, the most likely location is given by the MAP estimate of o_c. We assign parts in the model to the object if they exhibit consistent appearance and location. To remove model parts representing background we use a threshold on the entropy of the appearance distribution for the LT model (the determinant of the covariance in location for the G model). The MAP estimate of which features in the image are assigned to parts in the model (marginalizing over the object center) determines the support of the object. Bounding boxes include all keypoints assigned to the object and the means of all model parts belonging to the object, even if no keypoint is observed to be produced by such a part. 
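The likelihood-ratio detection rule and the MAP localization just described can be sketched in a few lines, given the per-center image log-likelihoods and the background log-likelihood. The function name, array layout and threshold convention are illustrative assumptions, not the authors' code.

```python
import numpy as np

def detect_and_localize(log_p_img_given_oc, log_prior_oc, log_p_bg, log_threshold):
    """Likelihood-ratio detection and MAP object-center estimate (Sec. 2.2).

    log_p_img_given_oc : (Nx*Ny,) array, log P_obj({f_j} | o_c) per grid cell
    log_prior_oc       : (Nx*Ny,) array, log P(o_c), uniform over the grid
    log_p_bg           : float, log P_bg({f_j}) under the background model
    """
    joint = log_prior_oc + log_p_img_given_oc           # log P(o_c, {f_j})
    m = joint.max()
    log_p_obj = m + np.log(np.exp(joint - m).sum())     # marginalize o_c
    present = (log_p_obj - log_p_bg) > log_threshold    # likelihood ratio test
    oc_map = int(joint.argmax())                        # MAP center (grid index)
    return present, oc_map
```

Note that detection marginalizes o_c while localization maximizes over it; both reuse the same per-center likelihoods, so the extra cost is O(N_x N_y).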
Predicting the means of parts with no observed keypoints explicitly handles occlusion (fig. 1).\n\n3 Experimental setup\n\nThe performance of the method depends on the feature detector making consistent extractions in different instances of objects of the same type. We use the scale-saliency interest point detector proposed in [13]. This method selects regions exhibiting unpredictable characteristics over both location and scale. The F regions with highest saliency over the image provide the features for learning and recognition. After the keypoints are detected, patches are extracted around these points and scale-normalized. A SIFT descriptor [16] (without orientation) is obtained from these patches. For model G, due to the high dimensionality of the resulting space, PCA is performed choosing k = 15 components to represent the appearance of a feature. For model LT, we instead cluster the appearance of features in the original SIFT space with a Gaussian mixture model with k = 250 components and use the most likely cluster as the feature appearance representation.\n\nFor all experiments we use P = 30 parts. The number of features is F = 50 for the G model and F = 100 for the LT model; N_x \times N_y = 238. We test our approach on the Caltech 5 dataset: faces, motorbikes, airplanes, spotted cats vs. Caltech background, and cars rear 2001 vs. cars background [5]. We initialize the appearance and location of the parts with P randomly chosen features from the training set. The stopping criterion is the change in F_e.\n\nFigure 1: Local Topic model for the faces, motorbikes and airplanes datasets [5]. In (a) the most likely location of the object center is plotted as a black circle. With respect to this reference, the spatial distribution (2D Gaussian) of each part associated with the object is plotted in green. In (b) the centers of all features extracted are depicted. Blue ones are assigned by the model to the object, and red ones to the background. The bounding box is plotted in blue. 
Image (c) shows how many features in the image are assigned to the same part (a property of our model, not shared by [5]): six parts are chosen, their spatial distribution is plotted (green), and the features assigned to them are depicted in blue. The eyes (4, 5), mouth (3) and left ear (6) have multiple assignments each. For each of these parts, image (d) shows the best matches among features extracted from the dataset. Note that the local topic model can learn parts uniform in appearance (e.g. eyes) but also more complex parts (e.g. the mouth part includes moustaches, beards and chins). The G appearance model and [5] do not have this property. The images in (e) show the robustness of the method in cases with occlusion, missed detections and one caricature of a face. Images (f) and (g) show plots for motorbikes, and (h) and (i) for airplanes.\n\n4 Results\n\nDetection: Although we believe that localization is an essential performance criterion, it is useless if the approach cannot detect objects. Figure 2 depicts equal error rate detection performance for our models and [5, 3, 8]. We show our range of performance over train/test splits on the plot, but cannot compare it, because this data is not available for other approaches. Our method is robust to initialization (the variance over starting points is negligible compared to the train/test split variance). The results show higher detection performance of all our algorithms compared to the generative model presented in [5]. The local topic (LT) model performs better than the model presented in [8]. The purely discriminative approach presented in [3] shows higher detection performance with different ("optimal combination") features, but performs worse with the features we are using. The LT model showed consistently higher detection performance than the Gaussian (G) model. 
For both the LT and G models the variational approximations showed discriminative power similar to that of the respective exact models. Unlike [5, 3], our model currently is not scale invariant. Nevertheless, the probabilistic nature of the model allows for some tolerance to scale changes.\n\nIn datasets of manageable size, it is inevitable that the background is correlated with the object. The result is that most modern methods that infer the template from partially supervised data can tend to model some background parts as lying on the object (see figure 4). Doing so tends to increase detection performance. It is reasonable to expect this increase will not persist in the face of a dramatic change in background. One symptom of this phenomenon (as in classical overfitting) is that methods that detect very well may be bad at localization, because they cannot separate the object from the background. We are able to avoid this difficulty by predicting object extent conditioned on detection using only a subset of parts known to have relatively low variance in location or appearance, given the object center. We do not yet have an estimate of the increase in detection rate resulting from overfitting; this is a topic of ongoing research. In our opinion, if a method can detect but performs poorly at localization, the reason may be overfitting.\n\nLocalization: Previous work on localization required aligned images (bounding boxes) or segmentation masks [7, 6]. A novel property of our model is that it learns to localize the object and determine its spatial extent without supervision. Figure 1 shows learned models and examples of localization. There is no standard measure to evaluate localization performance in an unsupervised setting. In such a case, the object center can be learnt at any position in the image, provided that this position is consistent across all images. 
We thus use as our performance measure the standard deviation of estimated object centers and bounding boxes (obtained as in §2.2), after normalizing the estimates of each image to a coordinate system in which the ground truth bounding box is a unit square (0, 0) - (1, 1). As a baseline we use the rectified center of the image. All objects of interest in both the airplane and motorbike datasets are centered in the image. As a result the baseline is a good predictor of the object center and is hard to beat. However, in the faces dataset there is much more variation in location; there the advantage of our approach becomes clear. Figure 3 shows the scatterplot of normalized object centers and bounding boxes. The table in figure 2 shows the localization performance results using the proposed metric.\n\nVariational approximation comparison: Unusually for a variational approximation, it is possible to compare it to the exact model; the results are excellent, especially for the G model. This is consistent with our observation that during learning the variational approximation is good in this case (the free energy bound appears tight). On the other hand, for the LT model the variational bound is loose during learning, and localization performance is equivalent but slightly lower than that of the exact LT model. This may be explained by the fact that the Gaussian appearance model is less flexible than the topic model, and thus the G model can better tolerate the decoupling of location and appearance.
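The rectified-center measure described above (standard deviation of predicted centers after mapping each ground-truth box to the unit square) might be computed as in the sketch below. The (x0, y0, x1, y1) box format and the function name are assumptions for illustration, not taken from the paper.

```python
import numpy as np

def rectified_center_std(pred_centers, gt_boxes):
    """Std of predicted object centers in ground-truth box coordinates.

    pred_centers : (N, 2) predicted object centers, (x, y) per image
    gt_boxes     : (N, 4) ground-truth boxes as (x0, y0, x1, y1)
    Returns (std_x, std_y), i.e. the horizontal/vertical spread after each
    ground-truth box is rectified to the unit square (0, 0) - (1, 1).
    """
    pred = np.asarray(pred_centers, float)
    boxes = np.asarray(gt_boxes, float)
    origin = boxes[:, :2]
    size = boxes[:, 2:] - boxes[:, :2]
    rect = (pred - origin) / size     # ground-truth box -> unit square
    return rect.std(axis=0)           # zero iff centers are perfectly consistent
```

A predictor that always lands at the same relative position inside the ground-truth box scores zero, regardless of where that position is, which is exactly the consistency notion the unsupervised setting requires.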
[Figure 2: equal error rate detection plots for the Airplanes, Motorbikes, Faces, Cars rear and Spotted Cats datasets, comparing models B, C, DL, DLc, G, GV, LT and LV; accompanying table of localization standard deviations, Bbox(%) and Obj. center(%) (vert/horz), for G, GV, LT, LV and BL.]\n\nFigure 2: Plots on the left show detection performance on the Caltech 5 datasets [5]. The equal error rate is reported. The original performance of the constellation model [5] is denoted by C. We denote by DLc the performance (best in the literature) reported by [3] using an optimal combination of feature types, and by DL the performance using our features. The performance of [8] is denoted by B. We show performance for our G model (G), our LT model (LT) and their variational approximations (GV) and (LV) respectively. We report median performance (×) over 20 runs and the performance range excluding the 10% best and 10% worst runs. On the right we show localization performance for all models on the Faces dataset and the performance of the best model (LT) on all datasets. Standard deviation is reported in percentage units with respect to the ground truth bounding box. For bounding boxes we average the standard deviation in each direction. BL denotes baseline performance.\n\nFigure 3: The airplane and motorbike datasets are aligned. Thus the image center baseline (b), (d) performs well there. Our localization performs similarly (a), (c). There is more variation in location in the faces dataset. 
Scatterplot (f) shows the baseline performance and (g) shows the performance of our model. (e) shows the bounding boxes computed by our approach (LT model). Object centers and bounding boxes are rectified using the ground truth bounding boxes (blue). No information about the location or spatial extent of the object is given to the algorithm.\n\nFigure 4: Approaches like [3] do not use geometric constraints during learning. Therefore, correlation between background and object in the dataset is incorporated into the object model. In this case the ellipses represent the features that are used by the algorithm in [3] to decide the presence of a face and motorbike (left images taken from [3]). On the other hand, our model (right images) can estimate the location and support of the object, even though no information about it is provided during learning. Blue circles represent the features assigned by the model to the face; the red points are centers of features assigned to the background (plot for the Local Topic Model).\n\n5 Conclusions and future work\n\nWe have presented a novel model for object categories. Our model allows efficient unsupervised learning, bringing the learning time to a few hours for full models and to minutes for variational approximations. The significant reduction in complexity allows us to handle many more parts and features than comparable algorithms. The detection performance of our approach compares favorably to the state of the art, even against purely discriminative approaches. Our model is also capable of learning the spatial extent of objects without supervision, with good results.\n\nThis combination of fast learning and the ability to localize is required to tackle challenging problems in computer vision. 
Among the most interesting applications we see unsupervised segmentation; learning, detection and localization of multiple object categories; deformable objects; and objects with varying aspects.\n\nReferences\n\n[1] P. Viola and M. Jones. Rapid object detection using a boosted cascade of simple features. In Proc. of CVPR, pages 511\u2013518, 2001.\n\n[2] G. Csurka, C. Dance, L. Fan, and C. Bray. Visual categorization with bags of keypoints. In Workshop on Stat. Learning in Comp. Vision, ECCV, pages 1\u201322, 2004.\n\n[3] G. Dorkó and C. Schmid. Object class recognition using discriminative local features. Submitted to IEEE Trans. on PAMI, 2004.\n\n[4] M. Weber, M. Welling, and P. Perona. Unsupervised learning of models for recognition. In Proc. of ECCV (1), pages 18\u201332, 2000.\n\n[5] R. Fergus, P. Perona, and A. Zisserman. Object class recognition by unsupervised scale-invariant learning. In Proc. of CVPR, pages 264\u2013271, 2003.\n\n[6] S. Agarwal and D. Roth. Learning a sparse representation for object detection. In Proc. of ECCV, volume 4, pages 113\u2013130, Copenhagen, Denmark, May 2002.\n\n[7] B. Leibe, A. Leonardis, and B. Schiele. Combined object categorization and segmentation with an implicit shape model. In Workshop on Stat. Learning in Comp. Vision, pages 17\u201332, May 2004.\n\n[8] A. B. Hillel, T. Hertz, and D. Weinshall. Efficient learning of relational object class models. In Proc. of ICCV, pages 1762\u20131769, October 2005.\n\n[9] R. Fergus, P. Perona, and A. Zisserman. A sparse object category model for efficient learning and exhaustive recognition. In Proc. of CVPR, pages 380\u2013387, June 2005.\n\n[10] D. Crandall, P. Felzenszwalb, and D. Huttenlocher. Spatial priors for part-based recognition using statistical models. In Proc. of CVPR, pages 10\u201317, 2005.\n\n[11] L. Fei-Fei, R. Fergus, and P. Perona. 
Learning generative visual models from few training examples: an incremental Bayesian approach tested on 101 object categories. In Workshop on Generative-Model Based Vision, Washington, DC, June 2004.\n\n[12] A. Opelt, M. Fussenegger, A. Pinz, and P. Auer. Generic object recognition with boosting. Technical Report TR-EMT-2004-01, EMT, TU Graz, Austria, 2004. Submitted to IEEE Trans. on PAMI.\n\n[13] T. Kadir and M. Brady. Saliency, scale and image description. IJCV, 45(2):83\u2013105, 2001.\n\n[14] B. Frey and N. Jojic. A comparison of algorithms for inference and learning in probabilistic graphical models. IEEE Trans. on PAMI, 27(9):1392\u20131416, 2005.\n\n[15] R. Neal and G. Hinton. A view of the EM algorithm that justifies incremental, sparse, and other variants. In M. I. Jordan, editor, Learning in Graphical Models, pages 355\u2013368. MIT Press, Cambridge, MA, USA, 1999.\n\n[16] D. Lowe. Distinctive image features from scale-invariant keypoints. IJCV, 60(2):91\u2013110, 2004.\n", "award": [], "sourceid": 2796, "authors": [{"given_name": "Nicolas", "family_name": "Loeff", "institution": null}, {"given_name": "Himanshu", "family_name": "Arora", "institution": null}, {"given_name": "Alexander", "family_name": "Sorokin", "institution": null}, {"given_name": "David", "family_name": "Forsyth", "institution": null}]}