{"title": "Common-Frame Model for Object Recognition", "book": "Advances in Neural Information Processing Systems", "page_first": 953, "page_last": 960, "abstract": null, "full_text": "     Common-Frame Model for Object Recognition\n\n\n                       Pierre Moreels                   Pietro Perona\n              California Institute of Technology - Pasadena CA 91125 - USA\n                    pmoreels,perona@vision.caltech.edu\n\n\n                                         Abstract\n\n         A generative probabilistic model for objects in images is presented. An\n         object consists of a constellation of features. Feature appearance and\n         pose are modeled probabilistically. Scene images are generated by draw-\n         ing a set of objects from a given database, with random clutter sprinkled\n         on the remaining image surface. Occlusion is allowed.\n         We study the case where features from the same object share a common\n         reference frame. Moreover, parameters for shape and appearance den-\n         sities are shared across features. This is to be contrasted with previous\n         work on probabilistic `constellation' models where features depend on\n         each other, and each feature and model have different pose and appear-\n         ance statistics [1, 2]. These two differences allow us to build models\n         containing hundreds of features, as well as to train each model from a\n         single example. Our model may also be thought of as a probabilistic\n         revisitation of Lowe's model [3, 4].\n         We propose an efficient entropy-minimization inference algorithm that\n         constructs the best interpretation of a scene as a collection of objects and\n         clutter. We test our ideas with experiments on two image databases. 
We\n         compare with Lowe's algorithm and demonstrate better performance, in\n         particular in the presence of large amounts of background clutter.\n\n\n1     Introduction\n\nThere is broad agreement in the machine vision literature that objects and object categories\nshould be represented as collections of features or parts with distinctive appearance and\nmutual position [1, 2, 4, 5, 6, 7, 8, 9]. A number of ideas for efficient detection algorithms\n(find instances of a given object category, e.g. faces) have been proposed by virtually all\nthe cited authors, far fewer for recognition (list all objects and their pose in a given image)\nwhere matching would ideally take a logarithmic time with respect to the number of avail-\nable models [3, 4]. Learning of parameters characterizing feature shape or appearance\nis still a difficult area, with most authors opting for heavy human intervention (typically\nsegmentation and alignment of the training examples, although [1, 2, 3] train without su-\npervision) and very large training sets for object categories (typically on the order of 10^3 -\n10^4, although [10] recently demonstrated learning categories from 1-10 examples).\n\nThis work is based on two complementary efforts: the deterministic recognition system\nproposed by Lowe [3, 4], and the probabilistic constellation models by Perona and col-\nlaborators [1, 2]. The first line of work has three attractive characteristics: objects are\nrepresented with hundreds of features, thus increasing robustness; models are learned from\na single training example; last but not least, recognition is efficient with databases of hun-\ndreds of objects. The drawback of Lowe's approach is that both modeling decisions and\nalgorithms rely on heuristics, whose design and performance may be far from optimal in\n\n\f\nFigure 1: Diagram of our recognition model showing database, test image and two competing hy-\npotheses. 
To avoid a cluttered diagram, only one partial hypothesis is displayed for each hypothesis.\nThe predicted positions of models according to the hypotheses are overlaid on the test image.\n\nsome circumstances. Conversely, the second line of work is based on principled proba-\nbilistic object models which yield principled and, in some respects, optimal algorithms\nfor learning and recognition/detection. Unfortunately, the large number of parameters em-\nployed in each model limits in practice the number of features being used and requires many\ntraining examples. By recasting Lowe's model and algorithms in probabilistic terms, we\nhope to combine the advantages of both methods. Besides, in this paper we choose to focus\non individual objects as in [3, 4] rather than on categories as in [1, 2].\n\nIn [11] we presented a model aimed at the same problem of individual object recogni-\ntion. A major difference with the work described here lies in the probabilistic treatment of\nhypotheses, which allows us to use hypothesis likelihood directly as a guide for the\nsearch, instead of the arbitrary admissible heuristic required by A*.\n\n2    Probabilistic framework and notations\n\nEach model object is represented as a collection of features. Features are informative parts\nextracted from images by an interest point operator. Each model is the set of features\nextracted from one training image of a given object - although this could be generalized to\nfeatures from many images of the same object. Models are indexed by k and denoted by\nm_k, while indices i and j are used respectively for features extracted from the test image\nand from model images: f_i denotes the i-th test feature, while f_j^k denotes the j-th\nfeature from the k-th model. The features extracted from model images (training set)\nform the database. 
A feature detected in a test image can be a consequence of the presence\nof a model object in the image, in which case it should be associated with a feature from the\ndatabase. Otherwise, this feature is attributed to a clutter - or background - detection.\n\nThe geometric information associated with each feature contains position information (x and\ny coordinates, denoted by the vector x), orientation (denoted by θ) and scale (denoted by\nσ). It is denoted by X_i = (x_i, θ_i, σ_i) for test feature f_i and X_j^k = (x_j^k, θ_j^k, σ_j^k) for model\nfeature f_j^k. This geometric information is measured relative to the standard reference\nframe of the image in which the feature has been detected. All features extracted from the\nsame image share the same reference frame.\n\nThe appearance information associated with a feature is a descriptor characterizing the local\nimage appearance near this feature. The measured appearance information is denoted by\n\n\f\nA_i for test feature f_i and A_j^k for model feature f_j^k. In our experiments, features are detected\nat multiple scales at the extrema of difference-of-Gaussians filtered versions of the image [4,\n12]. The SIFT descriptor [4] is then used to characterize the local texture about keypoints.\n\nA partial hypothesis h explains the observations made in a fraction of the test image. It\ncombines a model image m_h and a corresponding set of pose parameters X_h. X_h encodes\nposition, rotation, scale (this can easily be extended to affine transformations). We assume\nindependence between partial hypotheses. This requires in particular independence be-\ntween models. 
Although reasonable, this approximation is not always true (e.g. a keyboard\nis likely to be detected close to a computer screen). The independence assumption allows us to search in parallel for\nmultiple objects in a test image.\n\nA hypothesis H is the combination of several partial hypotheses, such that it explains com-\npletely the observations made in the test image. A special notation H_0 or h_0 denotes any\n(partial) hypothesis that states that no model object is present in a given fraction of the test\nimage, and that features that could have been detected there are due to clutter.\n\nOur objective is to find which model objects are present in the test scene, given the ob-\nservations made in the test scene and the information that is present in the database. In\nprobabilistic terms, we look for hypotheses H for which the likelihood ratio\nLR(H) = P(H | {f_i}, {f_j^k}) / P(H_0 | {f_i}, {f_j^k}) > 1. This ratio characterizes how well models and poses specified by H\nexplain the observations, as opposed to them being generated by clutter. Using Bayes' rule\nand after simplifications,\n\n   LR(H) = P(H | {f_i}, {f_j^k}) / P(H_0 | {f_i}, {f_j^k}) = [ P({f_i} | {f_j^k}, H) · P(H) ] / [ P({f_i} | {f_j^k}, H_0) · P(H_0) ]     (1)\n\nwhere we used P({f_j^k} | H) = P({f_j^k}) since the database observations do not depend on\nthe current hypothesis.\n\nA key assumption of this work is that once the pose parameters of the objects (and thus\ntheir reference frames) are known, the geometric configuration and appearance of the test\nfeatures are independent from each other. 
We also assume independence between features\nassociated with models and features associated with clutter detections, as well as independence\nbetween separate clutter detections. Therefore, P({f_i} | {f_j^k}, H) = Π_i P(f_i | {f_j^k}, H).\nThese assumptions of independence are also made in [13], and are implicit in [4].\n\nAssignment vectors v represent matches between features from the test scene, and model\nfeatures or clutter. The dimension of each assignment vector is the number of test features\nn_test. Its i-th component v(i) = (k, j) denotes that the test feature f_i is matched to\nf_{v(i)} = f_j^k, the j-th feature from model m_k. v(i) = (0, 0) denotes the case where f_i is\nattributed to clutter. The set V_H of assignment vectors compatible with a hypothesis H consists of\nthose that assign test features only to models present in H (and to clutter). In particular, the\nonly assignment vector compatible with h_0 is v_0 such that ∀i, v_0(i) = (0, 0). We obtain\n\n  LR(H) = P(H)/P(H_0) · Σ_{v ∈ V_H} Π_{h ∈ H} [ P(v | {f_j^k}, m_h, X_h) · Π_{i | f_i ∈ h} P(f_i | f_{v(i)}, m_h, X_h) / P(f_i | h_0) ]     (2)\n\nP(H) is a prior on hypotheses; we assume it is constant. 
The term P(v | {f_j^k}, m_h, X_h) is\ndiscussed in 3.1; we now explore the other terms.\n\nP(f_i | f_{v(i)}, m_h, X_h): f_i and f_{v(i)} are believed to be one and the same feature. Differences\nmeasured between them are noise due to the imaging system as well as distortions caused\nby changes in viewpoint or lighting conditions. This noise probability p_n encodes differences\nin appearance of the descriptors, but also in geometry, i.e. position, scale and orientation.\nAssuming independence between appearance information and geometry information,\n\n         p_n(f_i | f_{v(i)}, m_h, X_h) = p_{n,A}(A_i | A_{v(i)}, m_h, X_h) · p_{n,X}(X_i | X_{v(i)}, m_h, X_h)                           (3)\n\n\f\nFigure 2: Snapshots from the iterative matching process. Two competing hypotheses are displayed\n(top and bottom row). a) Each assignment vector contains one assignment, suggesting a transformation\n(red box). b) End of iterative process. The correct hypothesis is supported by numerous matches and\nhigh belief, while the wrong hypothesis has only weak support from few matches and low belief.\n\nThe error in geometry is measured by comparing the values observed in the test image\nwith the predicted values that would be observed if the model features were to be trans-\nformed according to the parameters X_h. 
Denoting by X_h(x_{v(i)}), X_h(σ_{v(i)}), X_h(θ_{v(i)})\nthose predicted values, the geometry part of the noise probability can be decomposed into\n\n     p_{n,X}(X_i | X_{v(i)}, h) = p_{n,x}(x_i, X_h(x_{v(i)})) · p_{n,σ}(σ_i, X_h(σ_{v(i)})) · p_{n,θ}(θ_i, X_h(θ_{v(i)})) (4)\n\nP(f_i | h_0) is a density on appearance and position of clutter detections, denoted by p_bg(f_i).\nWe can decompose this density as well into an appearance term and a geometry term:\n\n     p_bg(f_i) = p_{bg,A}(A_i) · p_{bg,X}(X_i) = p_{bg,A}(A_i) · p_{bg,x}(x_i) · p_{bg,σ}(σ_i) · p_{bg,θ}(θ_i)              (5)\n\np_{bg,A}, p_{bg,x}, p_{bg,σ}, p_{bg,θ} are densities that characterize, for clutter detections,\nappearance, position, scale and rotation respectively.\n\nFor lack of space, and since it is not the main focus of this paper, we will not go into the\ndetails of how the \"foreground density\" p_n and the \"background density\" p_bg are learned.\nThe main assumption is that those densities are shared across features, instead of having\none set of parameters for each feature as in [1, 2]. This results in an important decrease of\nthe number of parameters to be learned, at a slight cost in the model expressiveness.\n\n\n3      Search for the best interpretation of the test image\n\nThe building block of the recognition process is a question, comparing a feature from a\ndatabase model with a feature of the test image. A question selects a feature from the\ndatabase, and tries to identify if and where this feature appears in the test image.\n\n3.1     Assignment vectors compatible with hypotheses\n\nFor a given hypothesis H, the set of possible assignment vectors V_H is too large for explicit\nexploration. Indeed, each potential match can either be accepted or rejected, which creates\na combinatorial explosion. Hence, we approximate the summation in (2) by its largest\nterm. In particular, each assignment vector v and each model referenced in v implies a\nset of pose parameters X_v (extracted e.g. with least-squares fitting). 
Therefore, the term\nP(v | {f_j^k}, m_h, X_h) from (2) will be significant only when X_v ≈ X_h, i.e. when the pose\nimplied by the assignment vector agrees with the pose specified by the partial hypothesis.\nWe consider only the assignment vectors v for which X_v ≈ X_h. P(v_H | {f_j^k}, h) is assumed\nto be close to 1. Eq.(2) becomes\n\n   LR(H) ≈ P(H)/P(H_0) · Π_{h ∈ H} Π_{i | f_i ∈ h} P(f_i | f_{v_h(i)}, m_h, X_h) / P(f_i | h_0)                   (6)\n\n\f\nOur recognition system proceeds by asking questions sequentially and adding matches to\nassignment vectors. It is therefore natural to define, for a given hypothesis H and the\ncorresponding assignment vector v_H and t ≤ n_test, the belief in v_H by\n\n   B_0(v_H) = 1,   B_t(v_H) = [ p_n(f_t | f_{v(t)}, m_{h_t}, X_{h_t}) / p_bg(f_t | h_0) ] · B_{t-1}(v_H)              (7)\n\nThe geometric part of the belief (cf. (3)-(5)) characterizes how close the pose X_v implied by\nthe assignments is to the pose X_h specified by the hypothesis. The appearance component\nof the belief characterizes the quality of the appearance match for the pairs (f_i, f_{v(i)}).\n\n3.2    Entropy-based optimization\n\nOur goal is finding quickly the hypothesis that best explains the observations, i.e. the hy-\npothesis (models+poses) that has the highest likelihood ratio. We compute such a hypothesis\nincrementally by asking questions sequentially. 
Each time a question is asked we update\nthe beliefs. We stop the process and declare a detection (i.e. a given model is present in\nthe image) as soon as the belief of a corresponding hypothesis exceeds a given confidence\nthreshold. The speed with which we reach such a conclusion depends on choosing cleverly\nthe next question. A greedy strategy says that the best next question is the one that takes us\nclosest to a detection decision. We do so by considering the entropy of the vector of beliefs\n(the vector may be normalized to 1 so that each belief is in fact a probability): the lower the\nentropy the closer we are to a detection. Therefore we study the following heuristic: The\nmost informative next question is the one that minimizes the expectation of the entropy of\nour beliefs. We call this strategy `minimum expected entropy' (MEE). This idea is due to\nGeman et al. [14].\n\nComputing the MEE question is, unfortunately, complex and expensive in\nitself. In Monte-Carlo simulations of a simplified version of our problem we notice that\nthe MEE strategy tends to ask questions that relate to the maximum-belief hypothesis.\nTherefore we approximate the MEE strategy with a simple heuristic: The next question\nconsists of attempting to match one feature of the highest-belief model; specifically, the\nfeature with best appearance match to a feature in the test image.\n\n\n3.3    Search for the best hypotheses\n\nIn an initialization step, a geometric hash table [3, 6, 7] is created by discretizing the space\nof possible transformations. Note that we add partial hypotheses to a hypothesis one at\na time, which allows us to discretize only the space of partial hypotheses (models + poses),\ninstead of discretizing the space of combinations of partial hypotheses.\n\nQuestions to be examined are created by pairing database features to the test features clos-\nest in terms of appearance. 
Note that since features encode location, orientation and scale,\nany single assignment between a test feature and a model feature contains enough infor-\nmation to characterize a similarity transformation. It is therefore natural to restrict the set\nof possible transformations to similarities, and to insert each candidate assignment in the\ncorresponding geometric hash table entry. This forms a pool of candidate assignments. The\nset of hypotheses is initialized to the center of the hash table entries, and their belief is set\nto 1. The motivation for this initialization step is to examine, for each partial hypothesis,\nonly a small number of candidate matches. A partial hypothesis corresponds to a hash table\nentry; we consider only the candidate assignments that fall into this same entry.\n\nEach iteration proceeds as follows. The hypothesis H that currently has the highest likeli-\nhood ratio is selected. If the geometric hash table entry corresponding to the current partial\nhypothesis h contains candidate assignments that have not been examined yet, one of them,\n(f_i, f_j^{m_h}), is picked - currently, the best appearance match - and the probabilities p_bg(f_i)\nand p_n(f_i | f_j^{m_h}, m_h, X_h) are computed. As mentioned in 3.1, only the best assignment\n\n\f\nFigure 3: Results from our algorithm in various situations (viewpoint change can be seen in Fig.6).\nEach row shows the best hypothesis in terms of belief. a) Occlusion b) Change of scale.\n\n\n\n\n\nFigure 4: ROC curves for both experiments. The performance improvement from our probabilistic\nformulation is particularly significant when a low false alarm rate is desired. 
The threshold used is\nthe repeatability rate defined in [15].\n\nvector is explored: if p_n(f_i | f_j^{m_h}, m_h, X_h) > p_bg(f_i) the match is accepted and inserted in\nthe hypothesis. Otherwise, f_i is considered a clutter detection and f_j^{m_h} is a missed\ndetection. The belief B(v_H) and the likelihood ratio LR(H) are updated using (7).\n\nAfter adding an assignment to a hypothesis, frame parameters X_h are recomputed using\nleast-squares optimization, based on all assignments currently associated with this hypothe-\nsis. This parameter estimation step provides a progressive refinement of the model pose\nparameters as assignments are added. Fig.2 illustrates this process.\n\nThe exploration of a partial hypothesis ends when no more candidate matches are available in\nthe hash table entry. We proceed with the next best partial hypothesis. The search ends\nwhen all test scene features have been matched or assigned to clutter.\n\n\n4      Experimental results\n\n4.1    Experimental setting\n\nWe tested our algorithm on two sets of images, containing respectively 49 and 161 model\nimages, and 101 and 51 test images (sets PM-gadgets-03 and JP-3Dobjects-04\navailable from http://www.vision.caltech.edu/html-files/archive.html). Each\nmodel image contained a single object. Test images contained from zero (negative exam-\nples) to five objects, for a total of 178 objects in the first set, and 79 objects in the second\nset. A large fraction of each test image consists of background. The images were taken\nwith no precautions regarding lighting conditions or viewing angle.\n\nThe first set contains common kitchen items and objects of everyday use. The second\nset (Ponce Lab, UIUC) includes office pictures. 
The objects were always moved between\nmodel images and test images. The images of model objects used in the learning stage\nwere downsampled to fit in a 500 × 500 pixel box; the test images were downsampled to\n800 × 800 pixels. With these settings, the number of features generated by the feature\ndetector was of the order of 1000 per training image and 2000-4000 per test image.\n\n\f\nFigure 5: Behavior induced by clutter detections. A ground truth model was created by cutting a\nrectangle from the test image and adding noise. The recognition process is therefore expected to\nfind a perfect match. The two rows show the best and second best model found by each algorithm\n(estimated frame position shown by the red box, features that found a match are shown in yellow).\n\n4.2    Results\n\nOur probabilistic method was compared against Lowe's voting approach on both sets of\nimages. We implemented Lowe's algorithm following the details provided in [3, 4]. Direct\ncomparison of our approach to `constellation' models [1, 2] is not possible as those require\nmany training samples for each class in order to learn shape parameters, while our method\nlearns from single examples. Recognition time for our unoptimized implementations was\n10 seconds for Lowe's algorithm and 25 seconds for our probabilistic method on a 2.8GHz\nPC; both implementations used approximately 200MB of memory.\n\nBoth methods yielded similar detection rates for simple scenes. In challenging situations\nwith multiple objects or textured clutter, our method performs a more systematic check on\ngeometric consistency by updating likelihoods every time a match is added. Hypotheses\nstarting with wrong matches due to clutter don't find further supporting matches, and are\neasily discarded by a threshold based on the number of matches. Conversely, Lowe's algo-\nrithm checks geometric consistency as a last step of the recognition process, but needs to\nallow for a large slop in the transformation parameters. 
Spurious matches induced by clut-\nter detections may still be accepted, thus leading to the acceptance of incorrect hypotheses.\n\nAn example of this behavior is displayed in Fig.5: the test image consists of a picture\nof concrete. A rectangular patch was extracted from this image, noise was added to this\npatch, and it was inserted in the database as a new model. With our algorithm, the best\nhypothesis found the correct match with the patch of concrete, while its best contender does not\nsucceed in collecting more than one correspondence and is discarded. In Lowe's case,\nother models manage to accumulate a high number of correspondences induced by texture\nmatches among clutter detections. Although the first correspondence concerns the correct\nmodel, it contains wrong matches. Moreover, the model displayed in the second row leads\nto a false alarm supported by many matches.\n\nFig.4 displays receiver operating characteristic (ROC) curves for both test sets, obtained for our proba-\nbilistic system and Lowe's method. Both curves confirm that our probabilistic interpreta-\ntion leads to fewer false alarms than Lowe's method for the same detection rate.\n\n5      Conclusion\n\nWe have proposed an object recognition method that combines the benefits of a set of rich\nfeatures with those of a probabilistic model of feature positions and appearance. The use\nof a large number of features brings robustness with respect to occlusions and clutter. The\nprobabilistic model verifies the validity of candidate hypotheses in terms of appearance and\ngeometric configuration. Our system improves upon a state-of-the-art recognition method\nbased on strict feature matching. In particular, the rate of false alarms in the presence\n\n\f\nFigure 6: Sample scenes and training objects from the two sets of images. Recognized frame poses\nare overlaid in red.\n\nof textured backgrounds generating strong erroneous matches is lower. 
This is a strong\nadvantage in real-world situations, where a \"clean\" background is not always available.\n\nReferences\n\n[1] M. Weber, M. Welling and P. Perona, \"Unsupervised Learning of Models for Recognition\", Proc.\n\n    Europ. Conf. Comp. Vis., 2000.\n[2] R. Fergus, P. Perona, A. Zisserman, \"Object Class Recognition by Unsupervised Scale-invariant\n\n    Learning\", IEEE. Conf. on Comp. Vis. and Patt. Recog., 2003.\n[3] D.G. Lowe, \"Object Recognition from Local Scale-invariant Features\", ICCV, 1999.\n[4] D.G. Lowe, \"Distinctive Image Features from Scale-Invariant Keypoints\", Int. J. Comp. Vis.,\n\n    60(2), pp. 91-110, 2004.\n[5] G. Carneiro and A. Jepson, \"Flexible Spatial Models for Grouping Local Image Features\", IEEE.\n    Conf. on Comp. Vis. and Patt. Recog., 2004.\n[6] I. Rigoutsos and R. Hummel, \"A Bayesian Approach to Model Matching with Geometric Hash-\n\n    ing\", CVIU, 62(1), pp. 11-26, 1995.\n\n[7] W.E.L. Grimson and D.P. Huttenlocher, \"On the Sensitivity of Geometric Hashing\", ICCV, 1990.\n\n[8] H. Rowley, S. Baluja, T. Kanade, \"Neural Network-based Face Detection\", IEEE. Trans. Patt.\n\n    Anal. Mach. Int., 20(1), pp. 23-38, 1998.\n[9] P. Viola and M. Jones, \"Rapid Object Detection Using a Boosted Cascade of Simple Features\",\n\n    Proc. IEEE Conf. Comp. Vis. Patt. Recog., 2001.\n[10] L. Fei-Fei, R. Fergus, P. Perona, \"Learning Generative Visual Models from Few Training Ex-\n\n    amples: An Incremental Bayesian Approach Tested on 101 Object Categories\", CVPR, 2004.\n[11] P. Moreels, M. Maire, P. Perona, \"Recognition by Probabilistic Hypothesis Construction\", Proc.\n\n    8th Europ. Conf. Comp. Vision, Prague, Czech Republic, pp. 55-68, 2004.\n[12] T. Lindeberg, \"Scale-space Theory: a Basic Tool for Analysing Structures at Different Scales\",\n\n    J. Appl. Stat., 21(2), pp. 225-270, 1994.\n[13] A.R. Pope and D.G. Lowe, \"Probabilistic Models of Appearance for 3-D Object Recognition\",\n\n    Int. J. Comp. Vis., 40(2), pp. 
149-167, 2000.\n[14] D. Geman and B. Jedynak, \"An Active Testing Model for Tracking Roads in Satellite Images\",\n\n    IEEE. Trans. Patt. Anal. Mach. Int., 18(1), pp. 1-14, 1996.\n[15] C. Schmid, R. Mohr, C. Bauckhage, \"Comparing and Evaluating Interest Points\", Proc. of 6th\n\n    Int. Conf. Comp. Vis., Bombay, India, 1998.\n\n\f\n", "award": [], "sourceid": 2746, "authors": [{"given_name": "Pierre", "family_name": "Moreels", "institution": null}, {"given_name": "Pietro", "family_name": "Perona", "institution": null}]}