{"title": "Conditional Random Fields for Object Recognition", "book": "Advances in Neural Information Processing Systems", "page_first": 1097, "page_last": 1104, "abstract": null, "full_text": " Conditional Random Fields for Object\n Recognition\n\n\n\n Ariadna Quattoni Michael Collins Trevor Darrell\n MIT Computer Science and Artificial Intelligence Laboratory\n Cambridge, MA 02139\n {ariadna, mcollins, trevor}@csail.mit.edu\n\n\n Abstract\n\n We present a discriminative part-based approach for the recognition of\n object classes from unsegmented cluttered scenes. Objects are modeled\n as flexible constellations of parts conditioned on local observations found\n by an interest operator. For each object class the probability of a given\n assignment of parts to local features is modeled by a Conditional Ran-\n dom Field (CRF). We propose an extension of the CRF framework that\n incorporates hidden variables and combines class conditional CRFs into\n a unified framework for part-based object recognition. The parameters of\n the CRF are estimated in a maximum likelihood framework and recogni-\n tion proceeds by finding the most likely class under our model. The main\n advantage of the proposed CRF framework is that it allows us to relax the\n assumption of conditional independence of the observed data (i.e. local\n features) often used in generative approaches, an assumption that might\n be too restrictive for a considerable number of object classes.\n\n1 Introduction\n\nThe problem that we address in this paper is that of learning object categories from super-\nvised data. Given a training set of n pairs (xi, yi), where xi is the ith image and yi is the\ncategory of the object present in xi, we would like to learn a model that maps images to\nobject categories. 
In particular, we are interested in learning to recognize rigid objects such\nas cars, motorbikes, and faces from one or more fixed viewpoints.\n\nThe part-based models we consider represent images as sets of patches, or local features,\nwhich are detected by an interest operator such as that described in [4]. Thus an image\nxi can be considered to be a vector {xi,1, . . . , xi,m} of m patches. Each patch xi,j has\na feature-vector representation φ(xi,j) ∈ R^d; the feature vector might capture various\nfeatures of the appearance of a patch, as well as features of its relative location and scale.\nThis scenario presents an interesting challenge to conventional classification approaches in\nmachine learning, as the input space xi is naturally represented as a set of feature vectors\n{φ(xi,1), . . . , φ(xi,m)} rather than as a single feature vector. Moreover, the local patches\nunderlying the local feature vectors may have complex interdependencies: for example,\nthey may correspond to different parts of an object, whose spatial arrangement is important\nto the classification task.\n\nThe most widely used approach for part-based object recognition is the generative model\nproposed in [1]. This classification system models the appearance, spatial relations and\nco-occurrence of local parts. One limitation of this framework is that to make the model\ncomputationally tractable one has to assume the independence of the observed data (i.e.,\nlocal features) given their assignment to parts in the model. This assumption might be too\nrestrictive for a considerable number of object classes made of structured patterns.\n\nA second limitation of generative approaches is that they require a model P(xi,j | hi,j) of\npatches xi,j given underlying variables hi,j (e.g., hi,j may be a hidden variable in the\nmodel, or may simply be yi). 
Accurately specifying such a generative model may be challenging, in particular in cases where patches overlap one another, or where we wish to\nallow a hidden variable hi,j to depend on several surrounding patches. A more direct approach may be to use a feature-vector representation of patches, and to use a discriminative\nlearning approach. We follow an approach of this type in this paper.\n\nSimilar observations concerning the limitations of generative models have been made in\nthe context of natural language processing, in particular in sequence-labeling tasks such as\npart-of-speech tagging [7, 5, 3] and in previous work on conditional random fields (CRFs)\nfor vision [2]. In sequence-labeling problems for NLP each observation xi,j is typically\nthe jth word of some input sentence, and hi,j is a hidden state, for example representing\nthe part-of-speech of that word. Hidden Markov models (HMMs), a generative approach,\nrequire a model of P(xi,j | hi,j), and this can be a challenging task when features such as\nword prefixes or suffixes are included in the model, or where hi,j is required to depend\ndirectly on words other than xi,j. This has led to research on discriminative models for sequence labeling such as MEMMs [7, 5] and conditional random fields (CRFs) [3]. A strong\nargument for these models as opposed to HMMs concerns their flexibility in terms of representation, in that they can incorporate essentially arbitrary feature-vector representations\nφ(xi,j) of the observed data points.\n\nWe propose a new model for object recognition based on Conditional Random Fields. We\nmodel the conditional distribution p(y|x) directly. A key difference of our approach from\nprevious work on CRFs is that we make use of hidden variables in the model. In previous\nwork on CRFs (e.g., [2, 3]) each "label" yi is a sequence hi = {hi,1, hi,2, . . . , hi,m} of\nlabels hi,j for each observation xi,j. 
The label sequences are typically taken to be fully\nobserved on training examples. In our case the labels yi are unstructured labels from some\nfixed set of object categories, and the relationship between yi and each observation xi,j is\nnot clearly defined. Instead, we model intermediate part-labels hi,j as hidden variables in\nthe model. The model defines conditional probabilities P(y, h | x), and hence indirectly\nP(y | x) = Σ_h P(y, h | x), using a CRF. Dependencies between the hidden variables\nh are modeled by an undirected graph over these variables. The result is a model where\ninference and parameter estimation can be carried out using standard graphical model\nalgorithms such as belief propagation [6].\n\n2 The Model\n\n2.1 Conditional Random Fields with Hidden Variables\n\nOur task is to learn a mapping from images x to labels y. Each y is a member of a set Y\nof possible image labels, for example, Y = {background, car}. We take each image x\nto be a vector of m "patches" x = {x1, x2, . . . , xm}.1 Each patch xj is represented by a\nfeature vector φ(xj) ∈ R^d. For example, in our experiments each xj corresponds to a patch\nthat is detected by the feature detector in [4]; section 3 gives details of the feature-vector\nrepresentation φ(xj) for each patch. Our training set consists of labeled images (xi, yi) for\ni = 1 . . . n, where each yi ∈ Y, and each xi = {xi,1, xi,2, . . . , xi,m}. For any image x\nwe also assume a vector of "parts" variables h = {h1, h2, . . . , hm}. These variables are\nnot observed on training examples, and will therefore form a set of hidden variables in the\nmodel.\n\n 1Note that the number of patches m can vary across images, and did vary in our experiments. For\nconvenience we use notation where m is fixed across different images; in reality it will vary across\nimages but this leads to minor changes to the model.\n
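To make these definitions concrete, the objects involved (an image x as a vector of patch feature vectors, an image label y, and a vector h of hidden part assignments) can be sketched directly. All sizes and values below are hypothetical toy choices, not those used in the paper's experiments:

```python
# Toy illustration of the variables defined above; every size and value
# here is hypothetical. An image x is a list of m patch feature vectors
# phi(x_j) in R^d, y is an unstructured class label, and h assigns one
# part from H to each patch.
Y = ["background", "car"]        # possible image labels
H = [0, 1, 2]                    # possible part labels (hidden)
d = 4                            # patch feature dimension

x = [                            # one image with m = 3 patches
    [0.9, 0.1, 0.0, 0.3],        # phi(x_1)
    [0.2, 0.8, 0.1, 0.0],        # phi(x_2)
    [0.0, 0.2, 0.7, 0.5],        # phi(x_3)
]
y = "car"                        # observed label (training time only)
h = [1, 0, 2]                    # hidden part assignment, one per patch

assert len(h) == len(x)                    # one part label per patch
assert all(len(f) == d for f in x)         # each patch lives in R^d
assert y in Y and all(p in H for p in h)
```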
Each hj is a member of H, where H is a finite set of possible parts in the model.\nIntuitively, each hj corresponds to a labeling of xj with some member of H. Given these\ndefinitions of image-labels y, images x, and part-labels h, we will define a conditional\nprobabilistic model:\n\n P(y, h | x, θ) = exp{Ψ(y, h, x; θ)} / Σ_{y′,h} exp{Ψ(y′, h, x; θ)}. (1)\n\nHere θ are the parameters of the model, and Ψ(y, h, x; θ) ∈ R is a potential function\nparameterized by θ. We will discuss the choice of Ψ shortly. It follows that\n\n P(y | x, θ) = Σ_h P(y, h | x, θ) = Σ_h exp{Ψ(y, h, x; θ)} / Σ_{y′,h} exp{Ψ(y′, h, x; θ)}. (2)\n\nGiven a new test image x, and parameter values θ* induced from a training example, we\nwill take the label for the image to be arg max_{y∈Y} P(y | x, θ*). Following previous work\non CRFs [2, 3], we use the following objective function in training the parameters:\n\n L(θ) = Σ_i log P(yi | xi, θ) − (1/2σ²)‖θ‖² (3)\n\nThe first term in Eq. 3 is the log-likelihood of the data. The second term is the log of a\nGaussian prior with variance σ², i.e., P(θ) ∝ exp(−(1/2σ²)‖θ‖²). We will use gradient ascent\nto search for the optimal parameter values, θ* = arg max_θ L(θ), under this criterion.\n\nWe now turn to the definition of the potential function Ψ(y, h, x; θ). We assume an undirected graph structure, with the hidden variables {h1, . . . , hm} corresponding to vertices\nin the graph. We use E to denote the set of edges in the graph, and we will write (j, k) ∈ E\nto signify that there is an edge in the graph between variables hj and hk. In this paper we\nassume that E is a tree.2 We define Ψ to take the following form:\n\n Ψ(y, h, x; θ) = Σ_{j=1}^m Σ_l f¹_l(j, y, hj, x) θ¹_l + Σ_{(j,k)∈E} Σ_l f²_l(j, k, y, hj, hk, x) θ²_l (4)\n\nwhere f¹_l, f²_l are functions defining the features in the model, and θ¹_l, θ²_l are the components\nof θ. The f¹ features depend on single hidden variable values in the model; the f² features\ncan depend on pairs of values. Note that Ψ is linear in the parameters θ, and the model in\nEq. 
1 is a log-linear model. Moreover the features respect the structure of the graph, in that\nno feature depends on more than two hidden variables hj, hk, and if a feature does depend\non variables hj and hk there must be an edge (j, k) in the graph E.\n\nAssuming that the edges in E form a tree, and that Ψ takes the form in Eq. 4, then exact\nmethods exist for inference and parameter estimation in the model. This follows because\nbelief propagation [6] can be used to calculate the following quantities in O(|E||Y||H|²) time:\n\n ∀y ∈ Y, Z(y | x, θ) = Σ_h exp{Ψ(y, h, x; θ)}\n\n ∀y ∈ Y, j ∈ 1 . . . m, a ∈ H, P(hj = a | y, x, θ) = Σ_{h: hj = a} P(h | y, x, θ)\n\n ∀y ∈ Y, (j, k) ∈ E, a, b ∈ H, P(hj = a, hk = b | y, x, θ) = Σ_{h: hj = a, hk = b} P(h | y, x, θ)\n\n 2This will allow exact methods for inference and parameter estimation in the model, for example\nusing belief propagation. If E contains cycles then approximate methods, such as loopy belief\npropagation, may be necessary for inference and parameter estimation.\n\nThe first term Z(y | x, θ) is a partition function defined by a summation over\nthe h variables. Terms of this form can be used to calculate P(y | x, θ) =\nZ(y | x, θ) / Σ_{y′} Z(y′ | x, θ). Hence inference--calculation of arg max_y P(y | x, θ)--\ncan be performed efficiently in the model. The second and third terms are marginal distributions over individual variables hj or pairs of variables hj, hk corresponding to edges in\nthe graph. The next section shows that the gradient of L(θ) can be defined in terms of these\nmarginals, and hence can be calculated efficiently.\n\n2.2 Parameter Estimation Using Belief Propagation\n\nThis section considers estimation of the parameters θ* = arg max_θ L(θ) from a training\nsample, where L(θ) is defined in Eq. 3. 
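On a small model the quantities above can be verified by exhaustive enumeration over h. The sketch below does exactly that for a three-patch chain (a tree), with arbitrary toy potentials standing in for the f¹ and f² feature terms of Eq. 4; it is a check of the definitions, not the paper's implementation, which uses belief propagation to obtain the same values without enumerating all |H|^m assignments:

```python
import itertools
import math

# Tiny model: m = 3 patches on a chain (a tree), E = {(0,1), (1,2)},
# two image labels and two parts. unary[y][j][a] stands in for the f^1
# terms of Psi; pair[y][(a, b)] stands in for the f^2 terms. All numbers
# are arbitrary toy values.
Y, H, m = ["bg", "car"], [0, 1], 3
E = [(0, 1), (1, 2)]
unary = {"bg":  [[0.1, 0.4], [0.2, 0.3], [0.5, 0.1]],
         "car": [[0.7, 0.1], [0.1, 0.9], [0.3, 0.6]]}
pair = {"bg":  {(0, 0): 0.2, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.3},
        "car": {(0, 0): 0.1, (0, 1): 0.5, (1, 0): 0.4, (1, 1): 0.2}}

def psi(y, h):
    """Psi(y, h, x; theta) for one fixed image (Eq. 4, toy version)."""
    s = sum(unary[y][j][h[j]] for j in range(m))
    return s + sum(pair[y][(h[j], h[k])] for j, k in E)

def Z(y):
    # Z(y | x, theta) = sum_h exp{Psi(y, h, x; theta)}
    return sum(math.exp(psi(y, h)) for h in itertools.product(H, repeat=m))

def p_y_given_x(y):
    # Eq. 2: P(y | x, theta) = Z(y | x, theta) / sum_{y'} Z(y' | x, theta)
    return Z(y) / sum(Z(yp) for yp in Y)

def marginal(j, a, y):
    # P(h_j = a | y, x, theta) by direct summation over all h with h_j = a
    num = sum(math.exp(psi(y, h))
              for h in itertools.product(H, repeat=m) if h[j] == a)
    return num / Z(y)

# Sanity checks on the definitions.
assert abs(sum(p_y_given_x(y) for y in Y) - 1.0) < 1e-9
for y in Y:
    for j in range(m):
        assert abs(sum(marginal(j, a, y) for a in H) - 1.0) < 1e-9
```

On a tree, sum-product belief propagation computes these same marginals and partition functions in time linear in m rather than exponential.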
In our work we used a conjugate-gradient method\nto optimize L(θ) (note that due to the use of hidden variables, L(θ) has multiple local\noptima, and our method is therefore not guaranteed to reach the globally optimal point).\nIn this section we describe how the gradient of L(θ) can be calculated efficiently. Consider\nthe likelihood term that is contributed by the ith training example, defined as:\n\n Li(θ) = log P(yi | xi, θ) = log ( Σ_h exp{Ψ(yi, h, xi; θ)} / Σ_{y′,h} exp{Ψ(y′, h, xi; θ)} ). (5)\n\nWe first consider derivatives with respect to the parameters θ¹_l corresponding to features\nf¹_l(j, y, hj, x) that depend on single hidden variables. Taking derivatives gives\n\n ∂Li(θ)/∂θ¹_l = Σ_h P(h | yi, xi, θ) ∂Ψ(yi, h, xi; θ)/∂θ¹_l − Σ_{y′,h} P(y′, h | xi, θ) ∂Ψ(y′, h, xi; θ)/∂θ¹_l\n\n = Σ_h P(h | yi, xi, θ) Σ_{j=1}^m f¹_l(j, yi, hj, xi) − Σ_{y′,h} P(y′, h | xi, θ) Σ_{j=1}^m f¹_l(j, y′, hj, xi)\n\n = Σ_{j,a} P(hj = a | yi, xi, θ) f¹_l(j, yi, a, xi) − Σ_{y′,j,a} P(hj = a, y′ | xi, θ) f¹_l(j, y′, a, xi)\n\nIt follows that ∂Li(θ)/∂θ¹_l can be expressed in terms of components P(hj = a | y, xi, θ) and\nP(y | xi, θ), which can be calculated using belief propagation, provided that the graph E\nforms a tree structure. A similar calculation gives\n\n ∂Li(θ)/∂θ²_l = Σ_{(j,k)∈E, a,b} P(hj = a, hk = b | yi, xi, θ) f²_l(j, k, yi, a, b, xi)\n\n − Σ_{y′,(j,k)∈E, a,b} P(hj = a, hk = b, y′ | xi, θ) f²_l(j, k, y′, a, b, xi)\n\nhence ∂Li(θ)/∂θ²_l can also be expressed in terms of expressions that can be calculated\nusing belief propagation.\n\n2.3 The Specific Form of our Model\n\nWe now turn to the specific form for the model in this paper. We define\n\n Ψ(y, h, x; θ) = Σ_j φ(xj)·θ(hj) + Σ_j θ(y, hj) + Σ_{(j,k)∈E} θ(y, hj, hk) (6)\n\nHere θ(k) ∈ R^d for k ∈ H is a parameter vector corresponding to the kth part label. The\ninner-product φ(xj)·θ(hj) can be interpreted as a measure of the compatibility between\npatch xj and part-label hj. Each parameter θ(y, k) ∈ R for k ∈ H, y ∈ Y can be\ninterpreted as a measure of the compatibility between part k and label y. 
Finally, each\nparameter θ(y, k, l) ∈ R for y ∈ Y, and k, l ∈ H measures the compatibility between an\nedge with labels k and l and the label y. It is straightforward to verify that the definition in\nEq. 6 can be written in the same form as Eq. 4. Hence belief propagation can be used for\ninference and parameter estimation in the model.\n\nThe patches xi,j in each image are obtained using the SIFT detector [4]. Each patch xi,j\nis then represented by a feature vector φ(xi,j) that incorporates a combination of SIFT and\nrelative location and scale features.\n\nThe tree E is formed by running a minimum spanning tree algorithm over the parts hi,j,\nwhere the cost of an edge in the graph between hi,j and hi,k is taken to be the distance\nbetween xi,j and xi,k in the image. Note that the structure of E will vary across different\nimages. Our choice of E encodes our assumption that parts conditioned on features that are\nspatially close are more likely to be dependent. In the future we plan to experiment with\nthe minimum spanning tree approach under other definitions of edge-cost. We also plan to\ninvestigate more complex graph structures that involve cycles, which may require approximate methods such as loopy belief propagation for parameter estimation and inference.\n\n3 Experiments\n\nWe carried out three sets of experiments on a number of different data sets.3 The first\ntwo experiments consisted of training a two-class model (object vs. background) to distinguish between a category from a single viewpoint and background. The third experiment\nconsisted of training a multi-class model to distinguish between n classes.\n\nThe only parameter that was adjusted in the experiments was the scale of the images upon\nwhich the interest point detector was run. 
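As a side illustration of the tree construction described in section 2.3 (edge cost taken to be the image distance between patches), a minimum spanning tree over patch centers can be sketched with Prim's algorithm. The coordinates below are made up for illustration; a real run would use the detected interest points:

```python
import math

# Prim's algorithm over patch centers: a sketch of the tree E of
# section 2.3, with the cost of edge (j, k) taken to be the distance
# between patches x_j and x_k in the image. Coordinates are hypothetical.
patches = [(10.0, 12.0), (14.0, 13.0), (40.0, 41.0), (42.0, 39.0), (11.0, 30.0)]

def spanning_tree(points):
    """Return a list of edges (j, k) forming a minimum spanning tree."""
    n = len(points)
    in_tree = {0}          # start from an arbitrary patch
    edges = []
    while len(in_tree) < n:
        best = None        # cheapest edge leaving the current tree
        for j in in_tree:
            for k in range(n):
                if k in in_tree:
                    continue
                cost = math.dist(points[j], points[k])
                if best is None or cost < best[0]:
                    best = (cost, j, k)
        _, j, k = best
        in_tree.add(k)
        edges.append((j, k))
    return edges

E = spanning_tree(patches)
assert len(E) == len(patches) - 1               # a tree has n-1 edges
assert {v for e in E for v in e} == set(range(len(patches)))
```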
In particular, we adjusted the scale on the car\nside data set: in this data set the images were too small and without this adjustment the\ndetector would fail to find a significant number of features.\n\nFor the experiments we randomly split each data set into three separate data sets: training,\nvalidation and testing. We use the validation data set to set the variance parameter σ² of\nthe Gaussian prior.\n\n3.1 Results\n\nIn Figure 2(a) we show how the number of parts in the model affects performance. In the case\nof the car side data set, the ten-part model shows a significant improvement compared to\nthe five-part model, while for the car rear data set the performance improvement obtained\nby increasing the number of parts is not as significant. Figure 2(b) shows a performance\ncomparison with previous approaches [1] tested on the same data set (though on a different\npartition). We observe an improvement between 2 % and 5 % for all data sets.\n\nFigures 3 and 4 show results for the multi-class experiments. Notice that random performance for the animal data set would be 25 % across the diagonal. The model exhibits\nbest performance for the Leopard data set, for which the presence of part 1 alone is a clear\npredictor of the class. This shows again that our model can learn discriminative part distributions for each class. Figure 3 shows results for a multi-view experiment where the task\nis to distinguish between two different views of a car and background.\n\n 3The images were obtained from http://www.vision.caltech.edu/html-files/archive.html and the\ncar side images from http://l2r.cs.uiuc.edu/~cogcomp/Data/Car/. Notice that, since our algorithm\ndoes not currently allow for the recognition of multiple instances of an object, we test it on a partition\nof the training set in http://l2r.cs.uiuc.edu/~cogcomp/Data/Car/ and not on the testing set in that\nsite. 
The animals data set is a subset of Caltech's 101 categories data set.\n\nFigure 1: Examples of the most likely assignment of parts to features for the two class\nexperiments (car data set).\n\n (a) Data set 5 parts 10 parts (b) Data set Our Model Others [1]\n Car Side 94 % 99 % Car Side 99 % -\n Car Rear 91 % 91.7 % Car Rear 94.6 % 90.3 %\n Face 99 % 96.4 %\n Plane 96 % 90.2 %\n Motorbike 95 % 92.5 %\n\nFigure 2: (a) Equal Error Rates for the car side and car rear experiments with different\nnumbers of parts. (b) Comparative Equal Error Rates.\n\nFigure 1 displays the Viterbi labeling4 for a set of example images, showing the most likely\nassignment of local features to parts in the model. Figure 6 shows the mean and variance\nof each part's location for car side images and background images. The mean and variance\nof each part's location for the car side images were calculated in the following manner:\nfirst, we find for every image classified as class a the most likely part assignment under our\nmodel; second, we calculate the mean and variance of positions of all local features that\nwere assigned to the same part. Similarly, Figure 5 shows part counts among the Viterbi\npaths assigned to examples of a given class.\n\nAs can be seen in Figure 6, while the mean location of a given part in the background\nimages and the mean location of the same part in the car images are very similar, the parts\nin the car have a much tighter distribution, which seems to suggest that the model is learning\nthe shape of the object.\n\nAs shown in Figure 5, the model has also learnt discriminative part distributions for each\nclass; for example, the presence of part 1 seems to be a clear predictor for the car class. 
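The Viterbi labeling used in these figures is the assignment h* = arg max_h P(h | y, x, θ*); since the normalizer does not depend on h, this is equivalent to maximizing Ψ. A brute-force sketch on toy potentials follows (max-product belief propagation computes the same maximizer efficiently on trees; all values below are made up):

```python
import itertools

# h* = argmax_h P(h | y, x, theta): the normalizing constant does not
# depend on h, so this is argmax_h Psi(y, h, x; theta). Toy potentials.
H, m = [0, 1, 2], 3
E = [(0, 1), (1, 2)]                       # chain over three patches
unary = [[0.2, 1.0, 0.1],                  # unary[j][a]: score of part a
         [0.6, 0.1, 0.3],                  # at patch j (for a fixed y)
         [0.1, 0.2, 0.9]]
pair = {(a, b): (0.5 if a != b else 0.0)   # reward differing neighbors
        for a in H for b in H}

def psi(h):
    return (sum(unary[j][h[j]] for j in range(m))
            + sum(pair[(h[j], h[k])] for j, k in E))

# Exhaustive search over the |H|^m assignments (fine for a toy model).
h_star = max(itertools.product(H, repeat=m), key=psi)
assert len(h_star) == m
```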
In\ngeneral, part assignments seem to rely on a combination of appearance and relative location.\nPart 1, for example, is assigned to wheel-like patterns located on the left of the object.\n\n 4This is the labeling h* = arg max_h P(h | y*, x, θ*), where x is an image and y* is the label for\nthe image under the model.\n\n Data set Precision Recall\n Car Side 87.5 % 98 %\n Car Rear 87.4 % 86.5 %\n\nFigure 3: Precision and recall results for the 3-class experiment.\n\n Data set Leopards Llamas Rhinos Pigeons\n Leopards 91 % 2 % 0 % 7 %\n Llamas 0 % 50 % 27 % 23 %\n Rhinos 0 % 40 % 46 % 14 %\n Pigeons 0 % 30 % 20 % 50 %\n\nFigure 4: Confusion table for the 4-class experiment.\n\nHowever, the parts might not carry semantic meaning. It appears that the model has learnt\na vocabulary of very general parts with significant variability in appearance and learns to\ndiscriminate between classes by capturing the most likely arrangement of these parts for\neach class.\n\nIn some cases the model relies more heavily on relative location than appearance, because\nthe appearance information might not be very useful for discriminating between the two\nclasses. One of the reasons for this is that the detector produces a large number of false detections, making the appearance data too noisy for discrimination. The fact that the model\nis able to cope with this lack of discriminating appearance information illustrates its flexible,\ndata-driven nature. This can be a desirable property for a general object recognition\nsystem, because for some object classes appearance is the important discriminant (e.g., in\ntextured classes) while for others shape may be important (e.g., in geometrically constrained\nclasses).\n\nOne noticeable difference between our model and similar part-based models is that our\nmodel learns large parts composed of small local features. 
This is not surprising given how\nthe part dependencies were built (i.e., through their position in the minimum spanning tree):\nthe potential functions defined on pairs of hidden variables tend to smooth the allocation of\nparts to patches.\n\nFigure 5: Graph showing part counts for the background (left) and car side images (right).\n\n4 Conclusions and Further Work\n\nIn this work we have presented a novel approach that extends the CRF framework by incorporating hidden variables and combining class conditional CRFs into a unified framework for object recognition. Similarly to CRFs and other maximum entropy models, our\napproach allows us to combine arbitrary observation features for training discriminative\nclassifiers with hidden variables. Furthermore, by making some assumptions about the\njoint distribution of hidden variables one can derive efficient training algorithms based on\ndynamic programming.\n\nFigure 6: (a) Graph showing mean and variance of locations for the different parts for the\ncar side images; (b) mean and variance of part locations for the background images.\n\nThe main limitation of our model is that it is dependent on the feature detector picking up\ndiscriminative features of the object. Furthermore, our model might learn to discriminate\nbetween classes based on the statistics of the feature detector and not the true underlying\ndata, to which it has no access. This is not a desirable property since it assumes the feature\ndetector to be consistent. As future work we would like to incorporate the feature detection\nprocess into the model.\n\nReferences\n\n[1] R. Fergus, P. Perona, and A. Zisserman. 
Object class recognition by unsupervised scale-invariant\n learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition,\n volume 2, pages 264-271, June 2003.\n\n[2] S. Kumar and M. Hebert. Discriminative random fields: A framework for contextual interaction\n in classification. In IEEE International Conference on Computer Vision, volume 2, pages 1150-1157,\n June 2003.\n\n[3] J. Lafferty, A. McCallum, and F. Pereira. Conditional random fields: Probabilistic models for\n segmenting and labeling sequence data. In Proceedings of the International Conference on Machine\n Learning, 2001.\n\n[4] D. Lowe. Object recognition from local scale-invariant features. In IEEE International Conference\n on Computer Vision, 1999.\n\n[5] A. McCallum, D. Freitag, and F. Pereira. Maximum entropy Markov models for information\n extraction and segmentation. In ICML-2000, 2000.\n\n[6] J. Pearl. Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. Morgan\n Kaufmann, 1988.\n\n[7] A. Ratnaparkhi. A maximum entropy part-of-speech tagger. In EMNLP, 1996.\n", "award": [], "sourceid": 2652, "authors": [{"given_name": "Ariadna", "family_name": "Quattoni", "institution": null}, {"given_name": "Michael", "family_name": "Collins", "institution": null}, {"given_name": "Trevor", "family_name": "Darrell", "institution": null}]}