{"title": "Learning to Find Pictures of People", "book": "Advances in Neural Information Processing Systems", "page_first": 782, "page_last": 788, "abstract": null, "full_text": "Learning to Find Pictures of People \n\nSergey Ioffe \n\nComputer Science Division \n\nU.C. Berkeley \n\nBerkeley CA 94720 \niojJe(Cj)cs. be1\u00b7keley. edu \n\nDavid Forsyth \n\nComputer Sciencp Division \n\nU.C. Berkeley \n\nBerkeley CA 94720 \ndaf@cs.beTkeley. edv \n\nAbstract \n\nFinding articulated objects, like people, in pictures present.s a par(cid:173)\nticularly difficult object. recognition problem. We show how t.o \nfind people by finding putative body segments, and then construct.(cid:173)\ning assemblies of those segments that are consist.ent with the con(cid:173)\nstraints on the appearance of a person that result from kinematic \nproperties. Since a reasonable model of a person requires at. least \nnine segments, it is not possible to present every group to a classi(cid:173)\nfier. Instead, the search can be pruned by using projected versions \nof a classifier that accepts groups corresponding to people. We \ndescribe an efficient projection algorithm for one popular classi(cid:173)\nfier , and demonstrate that our approach can be used to determine \nwhether images of real scenes contain people. \n\n1 \n\nIntroduction \n\nSeveral t.ypical collpctions containing over ten million images are listed in [2]. Th ere \nis an extensiw literature on obtaining images from large collections using features \ncomputed from t.he whole image, including colour histograms, texture measures and \nshape measures ; a partial review appears in [5]. \n\nHowever, in the most comprehensive field study of usage pract.ices (a paper by \nEnser [2] surveying the use of the Hulton Deutsch collection), t.here is a clear user \npreference for searching these collections on image semantics. 
An ideal search tool would be a quite general object recognition system that could be adapted quickly and easily to the types of objects sought by a user. An important special case is finding people and determining what they are doing. This is hard, because people have many internal degrees of freedom. We follow the approach of [3], and represent people as collections of cylinders, each representing a body segment. Regions that could be the projections of cylinders are easily found using techniques similar to those of [1]. Once these regions are found, they must be assembled into collections that are consistent with the appearance of images of real people, which are constrained by the kinematics of human joints; consistency is tested with a classifier. Since there are many candidate segments, a brute-force search is impossible. We show how this search can be pruned using projections of the classifier. \n\n2 Learning to Build Segment Configurations \n\nSuppose that N segments have been found in an image, and there are m body parts. We will define a labeling as a set L = {(l1, s1), (l2, s2), ..., (lk, sk)} of pairs where each segment si ∈ {1 ... N} is labeled with the label li ∈ {1 ... m}. A labeling is complete if it represents a full m-segment configuration (Fig. 2(a,b)). \nAssume we have a classifier C that for any complete labeling L outputs C(L) > 0 if L corresponds to a person-like configuration, and C(L) < 0 otherwise. Finding all the possible body configurations in an image is equivalent to finding all the complete labelings L for which C(L) > 0. This cannot be done with brute-force search through the entire set. The search can be pruned if, for an (incomplete) labeling L' there is no complete L ⊇ L' such that C(L) > 0. For instance, if two segments cannot represent the upper and lower left 
arm, as in Figure 1a, then we do not consider any complete labelings where they are labeled as such. \n\nProjected classifiers make the search for body configurations efficient by pruning labelings using the properties of smaller sub-labelings (as in [7], who use manually determined bounds and do not learn the tests). Given a classifier C which is a function of a set of features whose values depend on segments with labels l1 ... lm, the projected classifier C_{l1...lk} is a function of all those features that depend only on the segments with labels l1 ... lk. In particular, C_{l1...lk}(L') > 0 if there is some extension L of L' such that C(L) > 0 (see Figure 1). The converse need not be true: the feature values required to bring a projected point inside the positive volume of C may not be realized with any labeling of the current set of segments 1, ..., N. For a projected classifier to be useful, it must be easy to compute the projection, and it must be effective in rejecting labelings at an early stage. These are strong requirements which are not satisfied by most good classifiers; for example, in our experience a support vector machine with a positive definite quadratic kernel projects easily but typically yields unrestrictive projected classifiers. \n\n2.1 Building Labelings Incrementally \n\nAssume we have a classifier C that accepts assemblies corresponding to people and that we can construct projected classifiers as we need them. We will now show how to use them to construct labelings, using a pyramid of classifiers. \n\nA pyramid of classifiers (Fig. 1(c)), determined by the classifier C and a permutation of labels (l1 ... lm), consists of nodes N_{li...lj} corresponding to each of the projected classifiers C_{li...lj}, i ≤ j. Each of the bottom-level nodes N_{li} receives the set of all segments in the image as the input. 
The top node N_{l1...lm} outputs the set of all complete labelings L = {(l1, s1) ... (lm, sm)} such that C(L) > 0, i.e. the set of all assemblies in the image classified as people. Further, each node N_{li...lj} outputs the set of all sub-labelings L = {(li, si) ... (lj, sj)} such that C_{li...lj}(L) > 0. \nThe nodes N_{li} at the bottom level work by selecting all segments si in the image for which C_{li}({(li, si)}) > 0. Each of the remaining nodes has two parts: merging and filtering. The merging stage of node N_{li...lj} merges the outputs of its children by computing the set of all labelings {(li, si) ... (lj, sj)} where {(li, si) ... (l_{j-1}, s_{j-1})} and {(l_{i+1}, s_{i+1}) ... (lj, sj)} are in the outputs of N_{li...l_{j-1}} and N_{l_{i+1}...lj}, respectively. \n\n[Figure 1 appears here; only the caption is recoverable.] \n\nFigure 1: (a) Two segments that cannot correspond to the left upper and lower arm. Any configuration where they do can be rejected using a projected classifier regardless of the other segments that might appear in the configuration. (b) Projecting a classifier C for the labeling {(l1, s1), (l2, s2)}. The shaded area is the volume classified as positive, for the feature set {x(s1), y(s1, s2)}. Finding the projection C_{l1} amounts to projecting off the features that cannot be computed from s1 only, i.e., y(s1, s2). (c) A pyramid of classifiers. Each node outputs sub-assemblies accepted by the corresponding projected classifier. Each node except those in the bottom row works by forming labelings from the outputs of its two children, and filtering the result using the corresponding projected classifier. The top node outputs the set of all complete labelings that correspond to body configurations. 
\nThe filtering stage then selects, from the resulting set of labelings, those for which C_{li...lj}(·) > 0, and the resulting set is the output of N_{li...lj}. It is clear, from the definition of projected classifiers, that the output of the pyramid is, in fact, the set of all complete L for which C(L) > 0 (note that C_{l1...lm} = C). \nThe only constraint on the order in which the outputs of nodes are computed is that children nodes have to be applied before parents. In our implementation, we use nodes N_{li...lj} where j changes from 1 to m, and, for each j, i changes from j down to 1. This is equivalent to computing sets of labelings of the form {(l1, s1) ... (lj, sj)} in order, where getting (j + 1)-segment labelings from j-segment ones is itself an incremental process, whereby we check labels against l_{j+1} in the order lj, l_{j-1}, ..., l1. In practice, we choose the latter order on the fly for each increment step using a greedy algorithm, to minimize the size of labeling sets that are constructed (note that in this case the classifiers no longer form a pyramid). The order (l1 ... lm) in which labels are added to an assembly needs to be fixed. We determine this order with a greedy algorithm by running a large segment set through the labeling builder and choosing the next label to add so as to minimize the number of labelings that result. \n\n2.2 Classifiers that Project \n\nIn our problem, each segment from the set {1 ... N} is a rectangle in some position and orientation. Given a complete labeling L = {(1, s1), ..., (m, sm)}, we want to have C(L) > 0 iff the segment arrangement produced by L looks like a person. 
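The incremental construction described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the helper `projected_accepts` is a hypothetical stand-in for evaluating the projected classifier C_{l1...lj} on a partial labeling, and the fixed label order stands in for the greedy ordering the paper actually uses.

```python
def build_labelings(segments, label_order, projected_accepts):
    """Grow labelings one label at a time, pruning with projected classifiers.

    segments          -- candidate segment ids found in the image
    label_order       -- the permutation (l1 ... lm) in which labels are added
    projected_accepts -- callable(labels_so_far, labeling) -> bool; a stand-in
                         for "the classifier projected onto labels_so_far
                         accepts this sub-labeling"
    """
    labelings = [[]]          # start from the single empty labeling
    used_labels = []
    for label in label_order:
        used_labels = used_labels + [label]
        extended = []
        for labeling in labelings:
            taken = {s for (_, s) in labeling}
            for seg in segments:
                if seg in taken:
                    continue  # each segment is used at most once
                candidate = labeling + [(label, seg)]
                # filtering stage: keep only sub-labelings the projected
                # classifier accepts; this is what prunes the search
                if projected_accepts(used_labels, candidate):
                    extended.append(candidate)
        labelings = extended
    return labelings          # complete labelings accepted by C
```

Because every extension of a rejected sub-labeling is also rejected, discarding it early never loses a valid complete labeling, which is exactly the property the projected classifiers guarantee.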
\n\n[Figure 2 appears here; only the caption is recoverable.] \n\nFigure 2: (a) All segments extracted for an image. (b) A labeled segment configuration corresponding to a person, where T=torso, LUA=left upper arm, etc. The head is not marked because we are not looking for it with our method. The single left leg segment in (a) has been broken in (b) to generate the upper and lower leg segments. (c) (top) A combination of a bounding box (the dashed line) and a boosted classifier, for two features x and y. Each plane in the boosted classifier is a thick line with the positive half-space indicated by an arrow; the associated weight β is shown next to the arrow. The shaded area is the positive volume of the classifier, which are the points P where Σ_f w_f(P(f)) > 1/2. The weights w_x(·) and w_y(·) are shown along the x- and y-axes, respectively, and the total weight w_x(P(x)) + w_y(P(y)) is shown for each region of the bounding box. (bottom) The projected classifier, given by w_x(P(x)) > 1/2 - δ = 0.1, where δ = max_{P(y)} w_y(P(y)) = max{0.25, 0.4, 0.15} = 0.4. \n\nEach feature will depend on a few segments (1 to 3 in our experiments). Our kinematic features are invariant to translation, uniform scaling or rotation of the segment set, and include angles between segments and ratios of lengths, widths and distances. We expect the features that correspond to human configurations to lie within small fractions of their possible value ranges. 
This suggests using an axis-aligned bounding box, with bounds learned from a collection of positive labelings, for a good first separation, and then using a boosted version of a weak classifier that splits the feature space on a single feature value (as in [6]). This classifier projects particularly well, using a simple algorithm described in section 2.3. \nEach weak classifier (Fig. 2(c)) is defined by the feature f_j on which the split is made, the position p_j of the splitting hyperplane, and the direction d_j ∈ {1, -1} that determines which half-space is positive. A point P is classified as positive iff d_j(P(f_j) - p_j) > 0, where P(f_j) is the value of feature f_j. The boosting algorithm will associate a weight β_j with each plane (so that Σ_j β_j = 1), and the resulting classifier will classify a point as positive iff Σ_{j : d_j(P(f_j) - p_j) > 0} β_j > 1/2, that is, iff the total weight of the weak classifiers that classify the point as positive is at least half of the total weight of the classifiers. The set {f_j} may have repeating features (which may have different p_j, d_j and β_j values), and does not need to span the entire feature set. \n\nBy grouping together the weights corresponding to planes splitting on the same feature, we finally rewrite the classifier as Σ_f w_f(P(f)) > 1/2, where w_f(P(f)) = Σ_{j : f_j = f, d_j(P(f) - p_j) > 0} β_j is the weight associated with the particular value of feature f; it is a piecewise constant function, and depends on which of the intervals given by {p_j | f_j = f} this value falls in. \n\n2.3 Projecting a Boosted Classifier \n\nGiven a classifier constructed as above, we need to construct classifiers that depend on some identified subset of the features. The geometry of our classifiers - whose positive regions consist of unions of axis-aligned bounding boxes - makes this easy to do. 
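The grouped form of the boosted classifier, Σ_f w_f(P(f)) > 1/2, can be sketched directly. The feature names and the weak classifiers below are illustrative values chosen so the β_j sum to 1, not numbers from the paper.

```python
def weight_function(planes):
    """Build the piecewise-constant weight w_f for one feature.

    planes -- list of (p_j, d_j, beta_j): split position, direction
              (+1 or -1) and boosting weight. w_f(v) sums beta_j over
              planes whose positive half-space contains v, i.e. those
              with d_j * (v - p_j) > 0.
    """
    def w(value):
        return sum(beta for p, d, beta in planes if d * (value - p) > 0)
    return w

def classify(weight_fns, point):
    """Positive iff the total weight of agreeing weak classifiers > 1/2."""
    return sum(weight_fns[f](point[f]) for f in weight_fns) > 0.5

# Illustrative boosted classifier over two features; the betas over
# all planes sum to 1, as boosting produces.
weights = {
    "x": weight_function([(0.0, +1, 0.25), (1.0, +1, 0.15)]),
    "y": weight_function([(0.5, +1, 0.6)]),
}
```

Evaluating `classify(weights, point)` touches each feature once, which is what makes the per-feature grouping convenient for the projection step that follows.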
\n\nLet g be the feature to be projected away - perhaps because its value depends on a label that is not available. The projection of the classifier should classify a point P' in the (lower-dimensional) feature space as positive iff max_P Σ_f w_f(P(f)) > 1/2, where P is a point which projects into P' but can have any value for P(g). We can rewrite this expression as Σ_{f ≠ g} w_f(P'(f)) + max_{P(g)} w_g(P(g)) > 1/2. The value of δ = max w_g(P(g)) is readily available and independent of P'. We can see that, with the feature projected away, we obtain Σ_{f ≠ g} w_f(P'(f)) > 1/2 - δ. Any number of features can be projected away in a sequence in this fashion. An example of the projected classifier is shown in Figure 2(c). \nThe classifier C we are using allows for an efficient building of labelings, in that the features do not need to be recomputed when we move from C_{l1...lk} to C_{l1...l_{k+1}}. We achieve this efficiency by carrying along with a labeling L = {(l1, s1) ... (lk, sk)} the sum
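The projection step above amounts to dropping the term for g and lowering the threshold by δ. A minimal sketch, assuming the weight functions are supplied as callables and the finitely many interval weights of g are listed explicitly (the interval weights 0.25, 0.4, 0.15 in the usage below mirror the Figure 2(c) example; everything else is illustrative):

```python
def project_away(weight_fns, threshold, g, g_interval_weights):
    """Project feature g out of a grouped boosted classifier.

    weight_fns         -- dict: feature name -> piecewise-constant w_f
    threshold          -- the original threshold (1/2 for the full classifier)
    g_interval_weights -- the possible values of w_g; since w_g is
                          piecewise constant there are finitely many,
                          and delta is their maximum
    """
    delta = max(g_interval_weights)
    remaining = {f: w for f, w in weight_fns.items() if f != g}
    def projected(point):
        # positive iff some completion of `point` with a value for g
        # could exceed the original threshold
        return sum(remaining[f](point[f]) for f in remaining) > threshold - delta
    return projected

# Usage: project y away; the projected threshold becomes 1/2 - 0.4 = 0.1.
wx = lambda v: 0.25 if v < 1.0 else 0.05          # illustrative w_x
wy = lambda v: 0.4                                 # illustrative w_y
proj = project_away({"x": wx, "y": wy}, 0.5, "y", [0.25, 0.4, 0.15])
```

Applying `project_away` repeatedly, once per unavailable feature, gives the sequential projection the text describes.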