{"title": "Maximal Cliques that Satisfy Hard Constraints with Application to Deformable Object Model Learning", "book": "Advances in Neural Information Processing Systems", "page_first": 864, "page_last": 872, "abstract": "We propose a novel inference framework for finding maximal cliques in a weighted graph that satisfy hard constraints. The constraints specify the graph nodes that must belong to the solution as well as mutual exclusions of graph nodes, i.e., sets of nodes that cannot belong to the same solution. The proposed inference is based on a novel particle filter algorithm with state permeations. We apply the inference framework to a challenging problem of learning part-based, deformable object models. Two core problems in the learning framework, matching of image patches and finding salient parts, are formulated as two instances of the problem of finding maximal cliques with hard constraints. Our learning framework yields discriminative part based object models that achieve very good detection rate, and outperform other methods on object classes with large deformation.", "full_text": "Maximal Cliques that Satisfy Hard Constraints with\nApplication to Deformable Object Model Learning\n\nXinggang Wang1\u2217 Xiang Bai1 Xingwei Yang2\u2020 Wenyu Liu1 Longin Jan Latecki3\n1 Dept. of Electronics and Information Engineering, Huazhong Univ. of Science and Technology, China\n\n2 Image Analytics Lab, GE Research, One Research Circle, Niskayuna, NY 12309, USA\n\n3 Dept. of Computer and Information Sciences, Temple Univ., USA\n\n{wxghust,xiang.bai}@gmail.com,yang@ge.com,liuwy@hust.edu.cn,latecki@temple.edu\n\nAbstract\n\nWe propose a novel inference framework for \ufb01nding maximal cliques in a weight-\ned graph that satisfy hard constraints. The constraints specify the graph nodes\nthat must belong to the solution as well as mutual exclusions of graph nodes, i.e.,\nsets of nodes that cannot belong to the same solution. 
The proposed inference is\nbased on a novel particle \ufb01lter algorithm with state permeations. We apply the\ninference framework to a challenging problem of learning part-based, deformable\nobject models. Two core problems in the learning framework, matching of image\npatches and \ufb01nding salient parts, are formulated as two instances of the problem\nof \ufb01nding maximal cliques with hard constraints. Our learning framework yields\ndiscriminative part based object models that achieve very good detection rate, and\noutperform other methods on object classes with large deformation.\n\nIntroduction\n\n1\nThe problem of \ufb01nding maximal cliques in a weighted graph is faced in many applications from\ncomputer vision to social networks. Related work on \ufb01nding dense subgraph in weighted graph\ninclude [16, 12, 14]. However, these approaches relax the discrete problem of subgraph selection\nto a continuous problem. The main drawback of such relaxation is the fact that it is impossible to\nenforce that the constraints are satis\ufb01ed for solutions of the relaxed problem. Therefore, we aim\nat solving the discrete subgraph selection problem by employing the recently proposed extension\nof particle \ufb01lter inference to problems with state permeations [20]. There are at least two main\ncontributions of this paper: (1) We propose an inference framework for solving a maximal clique\nproblem that cannot be solved with typical clustering methods nor with recent relaxation based\nmethods [16, 12, 14]. (2) We utilize the inference framework for solving a challenging problem of\nlearning a part model for deformable object detection.\n\nObject detection is one of the key challenges in computer vision, due to the large intra-class ap-\npearance variation of an object class. 
The appearance variation arises not only from changes in\nillumination, viewpoint, color, and other visual properties, but also from nonrigid deformations.\nObjects under deformation often exhibit large global variation. However, their local structures\nare somewhat more invariant to the deformations. Based on this observation, we propose a learning\nby matching framework that matches all local image patches from the training images. By matching, object\nparts with similar local structure in different training images can be found.\n\n\u2217This work was done while the author was visiting Temple University.\n\u2020This work was done when the author was a graduate student at Temple University.\n\nFigure 1: (a) example training images; (b) patches extracted from the training images; (c) object\nparts as collections of patches obtained as maximal cliques of the patch similarity graph; (d) the learned\nsalient parts for giraffe; patches that belong to the same salient part are shown in the same color. The salient\nparts are obtained as maximal cliques in a second graph whose vertices represent the object parts.\n\nGiven a set of training images that contain objects of the same class, e.g., Fig. 1(a), our first problem\nis to select a set of image patches that depict the same visual part of these objects. Thus, an object\npart is regarded as a collection of image patches, e.g., Fig. 1(c). To solve the problem, we divide\neach training image into a set of overlapping patches, like the ones shown in Fig. 1(b), and construct\na graph whose nodes represent the patches. The edge weights represent the appearance similarity of\npairs of patches. Since nearby patches in the same image tend to be very similar, we must impose\na hard constraint that a patch set representing the same object part does not contain two patches\nfrom the same image. This constraint is very important, since otherwise very similar patches from\nthe same image would dominate the graph. 
In order to obtain meaningful object parts, we define an\nobject part as a maximal clique in the weighted graph that satisfies the above constraint. By solving\nthe maximal clique problem, we obtain a set of object parts like the ones shown in Fig. 1(c). We\nuse this set as vertices of a second graph. Finally, we obtain a small set of salient visual parts, e.g.,\nFig. 1(d), by solving a different instance of the maximal clique problem on the second graph.\n\nFor each salient visual part, we train a discriminative classifier. By combining these classifiers\nwith the spatial distribution of the salient object parts, a detector for deformable objects is built. As\nillustrated in the experimental results, this detector achieves very good object detection performance,\nand outperforms other methods on object classes with large deformation.\n\nThe computer vision literature has approached learning of part based object models in different\nways. In [8] objects are modeled as flexible constellations of parts, and parts are constrained to a sparse\nset of locations determined by an entropy-based feature detector. Other part models based on feature\ndetectors include [15, 17]. Our model is similar to the discriminatively trained part based model in [6] in\nthat we train SVM classifiers for each part of the object and the geometric arrangement of parts is captured\nby a set of \u201csprings\u201d. However, our learning method is quite different from [6]. In [6] the learning\nproblem is formalized as latent SVM, where positions of parts are considered as latent values. The\nlearning process is an iterative algorithm that alternates between fixing latent values and optimizing\nthe latent SVM objective function. In contrast, we cast part learning as finding maximal cliques in\na weighted graph of image patches. 
The edge weights represent appearance similarities of patches.\nIn [4, 13] multiple instance learning is used to search for the positions of object parts in training images, and\na boosting algorithm is used to select salient parts to represent the object.\n\n2 Maximal Cliques that Satisfy Hard Constraints\nA weighted graph G is defined as G = (V, E, e), where V = {v1, . . . , vn} is the vertex set, n is the\nnumber of vertices, E \u2286 V \u00d7 V , and e : E \u2192 R\u22650 is the weight function. Vertices in G correspond\nto data points, edge weights between different vertices represent the strength of their relationships,\nand the self-edge weight reflects the importance of a vertex. As is customary, we represent the graph G\nwith the corresponding weighted adjacency matrix, more specifically, an n \u00d7 n symmetric matrix\nA = (aij), where aij = e(vi, vj) if (vi, vj) \u2208 E, and aij = 0 otherwise.\nLet S = {1, ..., n} be the index set of vertex set V . For any subset T \u2286 S, GT denotes a subgraph\nof G with vertex set VT = {vi, i \u2208 T } and edge set ET = {(vi, vj) | (vi, vj) \u2208 E, i \u2208 T, j \u2208 T }.\nThe total weight of subgraph GT is defined as f (GT ) = \u03a3_{i\u2208T, j\u2208T} A(i, j). We can express T by\nan indicator vector x = (x1, . . . , xn) \u2208 {0, 1}^n such that xi = 1 if i \u2208 T and xi = 0 otherwise.\nThen f (GT ) can be represented in a quadratic form f (x) = x^T A x.\n\nWe consider mutex relationships between vertices in the graph. Given a subset of vertices M \u2286 S,\nwe call M a mutex (short for mutual exclusion) if i \u2208 M and j \u2208 M with i \u2260 j implies that vertices vi\nand vj cannot belong to the same maximal clique. Formally, M is a constraint on the indicator\nvector x \u2208 {0, 1}^n, i.e., if i \u2208 M and j \u2208 M with i \u2260 j, then xi + xj \u2264 1. A mutex set of graph G is\nM = {M1, . . . , Mm | Mi \u2286 S, i = 1, . . . , m} such that each Mi is a mutex for i = 1, . . . 
, m.\nGiven a set T \u2286 S, we define mutex(T ) as the set of indices of vertices of G that are incompatible\nwith T according to M: mutex(T ) = {j \u2208 S | \u2203 Mi \u2208 M \u2203 k \u2208 T : j, k \u2208 Mi}. We consider the following\nmaximization problem\n\nmaximize over x: f (x) = x^T A x\nsubject to\n(C1) x = (x1, . . . , xn) \u2208 {0, 1}^n and\n(C2) xi = 1 for all i \u2208 U and\n(C3) xi + xj \u2264 1 if i \u2260 j and \u2203 Mk \u2208 M such that i, j \u2208 Mk and\n(C4) \u03a3_i xi \u2264 K          (1)\n\nThe constraint (C2) specifies a set of vertices U \u2286 S that must be selected as part of the solution,\n(C3) ensures that all mutex constraints are satisfied, and (C4) requires that the number of vertices in the solution\nis smaller than or equal to K. Of course, we assume the problem (1) is well-defined in that there exists x\nthat satisfies the four constraints (C1)-(C4).\n\nThe goal of (1) is to select a subset of vertices of graph G such that f is maximized and the constraints\n(C1)-(C4) are satisfied. Since f is the sum of pairwise affinities of the elements of the\nselected subset, the larger the subset, the larger the value of f. However, the size of the subset\nis limited by the mutex constraints (C3) and the maximal size constraint (C4).\n\nA global maximum of (1) is called a U \u2212 M maximal clique of graph G. When both sets U and M\nare clear from the context, we simply call the solution a maximal clique.\n\nProblem (1) is a combinatorial optimization problem and is NP-hard [2]. As is the case\nfor similar problems of finding dense subgraphs, the constraint (C1) is usually relaxed to x \u2208 [0, 1]^n,\ni.e., each coordinate of x is relaxed to a continuous variable in the interval [0, 1], e.g., [16, 12, 14].\nHowever, it is difficult if not impossible to ensure that constraints (C2), (C3) and (C4) are then\nsatisfied. 
Another dif\ufb01culty is related to discretization of the relaxed solution in order to obtain a solution\nthat satis\ufb01es (C1). For these reasons, and since for our application, it is very important that the\nconstraints are satis\ufb01ed, we treat (C1)-(C4) as hard constraints that cannot be violated. We propose\nan ef\ufb01cient method for directly solving (1) in Section 4. We \ufb01rst present two instances of problem\n(1) in Section 3, where we describe the proposed application to learning salient object parts.\n\n3 Learning by Matching\n\nIn this section, we present a novel framework to learn part based object model based on matching.\nThe core problems of learning part based object model are how to search right locations of an object\npart in all training images and how to select salient parts for representing object. In our framework,\nthe two problems are formulated as \ufb01nding maximal cliques with hard constraints.\n\n3.1 Matching Image Patches\nGiven a batch of training images I = {I1, . . . , IK} showing objects from a given class, e.g., Fig. 1\n(a), where K is the total number of training images. For every training image, we densely extract\nimage patches with overlap. We denote the set of patches extracted from all images as {P1, . . . , Pn},\nwhere n is total number of patches. Each patch is described as Pi = {Fi, Li, Xi, Yi} for i \u2208\n[1, . . . n], where Fi is the appearance descriptor of Pi (we use the descriptor from [19]), Li is the\nimage label of Pi, (e.g., if Pi is extracted from the 5th training image, Li = 5), Xi and Yi indicate\nthe position of Pi in its image. All the training images are normalized to the same size.\nWe treat all the patches as the set of vertices of graph G, i.e., V = {P1, . . . , Pn}. 
The affinity\nrelation between the patches, i.e., the graph edge weights, is defined as aij = Fi \u00b7 Fj if i \u2260 j, and\naij = 0 otherwise, where Fi \u00b7 Fj is the dot product of two feature vectors, which are normalized. It\nmeasures the appearance similarity of patches Pi and Pj. In addition, if the distance between patch\npositions (Xi, Yi) and (Xj, Yj) is larger than 0.2 of the mean of all bounding box heights, we set\naij = 0. This ensures that matrix A is sparse.\n\nWe have exactly K mutex constraints M = {M1, . . . , MK}, where Mj contains all patches from\nimage Ij, i.e., Mj = {Pi \u2208 V | Li = j}, j \u2208 [1, . . . , K]. This means that we do not want two\npatches from the same image to belong to the same maximal clique.\nSuppose that the first r patches P1, . . . , Pr are in the 1st training image, i.e., Li = 1 if and only if\ni = 1, . . . , r. The part learning algorithm by finding maximal cliques is given in Alg. 1.\n\nAlgorithm 1 Part learning by finding maximal cliques with hard constraints\n\nInput: A, M, K, and r.\nfor i = 1 \u2192 r do\n  1. Set U = {i}.\n  2. Solve problem (1), get the solution x\u2217, and its value W (i) = f (x\u2217) = x\u2217^T A x\u2217.\n  3. Set the solution patches as Q(i) = {Pj | x\u2217_j = 1}.\nend for\nOutput: Parts Q = {Q(1), . . . , Q(r)} and their matching weights W = {W (1), . . . , W (r)}.\n\nWe recall that each learned part Q(i) is defined as a set of K patches, e.g., Fig. 1 (c). Due to our\nmutex constraint, each Q(i) contains exactly one patch from each of the K training images. We treat\nthe learned parts as candidate object parts, because there are non-object areas inside the bounding\nbox images. 
Each value W (i) represents a matching score of Q(i).\n\n3.2 Selecting Salient Parts for Part Based Object Representation\nIn order to select a set of object parts that best represent the object class, our strategy is to find a\nsubset of Q that maximizes the sum of the matching scores. We formulate this problem as finding\na maximal clique with hard constraints again. We define a new graph H with vertices V = Q and\nadjacency matrix B = (bij), where bij = W (i) if i = j, and bij = 0 otherwise. Thus, the matrix of\ngraph H has nonzero entries only on the diagonal. It may appear that the problem is trivial, since there\nare no edges between different vertices of H, but this is not the case due to the mutex relations.\nThe mutex set M^H = {M^H_1, . . . , M^H_r} is defined as M^H_i = {j | D(i, j) \u2264 \u03c4 } for i, j \u2208 [1, . . . , r],\nwhere \u03c4 is a distance threshold and D(i, j) is the average distance between patches in Q(i) and Q(j)\nthat belong to the same image. If Q(i) is selected as a salient part, the mutex M^H_i ensures that the\npatches of other salient parts are not too close to the patches of Q(i). For example, Q(1) and Q(2)\nin Fig. 1(c) both have good matching weights, but the average distance between Q(1) and Q(2) is\nsmaller than \u03c4 , so they cannot be selected as salient parts at the same time.\nAs initialization (C2), we set U^H to a one element set containing arg max_i W (i), so the part with\nmaximal matching score is always selected as a salient part. We set K in (C4) to K^H, where K^H is\nthe maximal number of salient parts. K^H = 6 in all our experiments.\nBy solving the second instance of problem (1) for B, U^H, M^H, K^H, we obtain the set of salient\nparts as the solution x\u2217. We denote it as SP = {Q(j) | x\u2217(j) = 1}.\n\n4 Particle Filter Inference for U \u2212 M Maximal Clique\nBy associating a random variable (RV) Xi with each vertex i \u2208 S of graph G, we introduce a Gibbs\nrandom field (GRF) with the neighborhood structure of graph G. 
Each RV can be assigned either 1\nor 0, where Xi = 1 means that the vertex vi is selected as part of the solution. The probability of\nthe assignment of values to all RVs is defined as\n\nP (X1 = x1, . . . , Xn = xn) = p(x) \u221d exp(f (x)/\u03b3) = exp(x^T A x/\u03b3),          (2)\n\nwhere we recall that x = (x1, . . . , xn) \u2208 {0, 1}^n and \u03b3 > 0. We observe that the definition in (2)\nalso applies to a subset of RVs, i.e., we can use it to compute P (X_{i1} = x_{i1}, . . . , X_{ik} = x_{ik}) =\np(x_{i1}, . . . , x_{ik}) \u221d exp(f (x_{i1}, . . . , x_{ik})/\u03b3) for k < n. This is equivalent to setting the other coordinates in the\nindicator vector x to zero.\nSince exp is a monotonically increasing function, the maximum of (2) is obtained at the same point\nas the maximum of f in (1). We propose to utilize the Particle Filter (PF) framework to maximize (2)\nsubject to the constraints in (1). The goal of PF is to approximate p(x) with a set of weighted\nsamples {x^(i), w(x^(i))}, i = 1, . . . , N, drawn from some proposal distribution q. Under reasonable assumptions\non p(x) this approximation is possible with any precision if N is sufficiently large [3].\nSince it is still computationally intractable to draw samples from q due to the high dimensionality of x,\nPF utilizes Sequential Importance Sampling (SIS). In the classical PF approaches, samples are generated\nrecursively following the order of the RVs according to x^(i)_t \u223c q(x_t | x_{1:t\u22121}) for t = 1, . . . , n,\nand the particles are built sequentially as x^(i)_{1:t} = (x^(i)_{1:t\u22121}, x^(i)_t) for i = 1, . . . , N. The subscript t in x_t\nin q(x_t | x_{1:t\u22121}) indicates from which RV the samples are generated. We use x^(i)_{1:t} as a shorthand notation\nfor (x^(i)_1, . . . , x^(i)_t). When t = m we obtain that x^(i)_{1:m} \u223c q(x_{1:m}). 
In other words, by sampling\nrecursively x^(i)_t from the proposal distribution q(x_t | x_{1:t\u22121}) of the RV with index t, we obtain a sample from\nq(x_{1:m}) at t = m. As is common in PF applications, we set q(x_t | x_{1:t\u22121}) = p(x_t | x_{1:t\u22121}), i.e., the\nproposal distribution is set to the conditional distribution of p.\nWe observe that the order of sampling follows the indexing of RVs with the index set S. However,\nthere is no natural order of RVs in a GRF, and the order of RV indices in S does not have any\nparticular meaning in that this order is not related in any way to our objective function f. The\nclassical PF framework has been developed for sequential state estimation like tracking or robot\nlocalization [5], where observations arrive sequentially, and consequently, determine a natural order\nof RVs representing the states like locations. In a recent work [20], the PF framework has been extended\nto work with an unordered set of RVs for solving image jigsaw puzzles. Inspired by this work, we\nextend the PF framework to solve the U \u2212 M maximal clique problem in the weighted graph. Unlike\ntracking a moving object, in our problem, the observations are known from the beginning and are\ngiven by the affinity matrix A.\nThe key idea of [20] is to explore different orders of the states (x_{i1}, . . . , x_{in}) as opposed to utilizing\nthe fixed order of the states x = (x1, . . . , xn) determined by the index of RVs as in the standard PF.\n(States are assigned values of RVs.) To achieve this, the first step of the PF algorithm is modified so\nthat the importance sampling is performed for every RV not yet represented by the current particle.\nTo formally define the sampling rule, we need to explicitly represent different orders of states with\nan index selection function \u03c3 : {1, . . . , t} \u2192 {1, . . . , n} for 1 < t \u2264 n, which is one-to-one.\nIn particular, when t = n, \u03c3 is a permutation. 
We use the shorthand notation \u03c3(1 : t) to denote\n(\u03c3(1), \u03c3(2), . . . , \u03c3(t)) for t \u2264 n, and similarly, x_{\u03c3(1:t)} = (x_{\u03c3(1)}, x_{\u03c3(2)}, . . . , x_{\u03c3(t)}). Each particle\nx^(i)_{\u03c3(1:t)} can now have a different permutation \u03c3^(i) representing the indices of RVs with assigned\nvalues. Thus, a sequence of RVs visited before time t is described by a subsequence (i1, . . . , it) of\nt different numbers in S = {1, . . . , n}.\nWe define the set of indices of graph vertices that are compatible with the selected vertices in\n\u03c3^(i)(1 : t) as \u03ba(\u03c3^(i)(1 : t)) = S \\ (\u03c3^(i)(1 : t) \u222a mutex(\u03c3^(i)(1 : t))). Hence \u03ba(\u03c3^(i)(1 : t)) contains\nthe indices from S that are neither present in \u03c3^(i)(1 : t) nor in a mutex relation with the\nmembers of \u03c3^(i)(1 : t).\nWe are now ready to formulate the proposed importance sampling. At each iteration t \u2264 n, for each\nparticle (i) and for each s \u2208 \u03ba(\u03c3^(i)(1 : t \u2212 1)), we sample x^(i)_s \u223c p(x_s | x^(i)_{\u03c3(1:t\u22121)}). The subscript s at\nthe conditional pdf p indicates that we sample values for the RV with index s. We generate at least one\nsample for each s \u2208 \u03ba(\u03c3^(i)(1 : t \u2212 1)). This means that the single particle x^(i)_{\u03c3(1:t\u22121)} is multiplied\nand extended to several follower particles x^(i)_{\u03c3(1:t\u22121),s}.\n\nBased on (2), it is easy to derive a formula for the proposal function:\n\np(x_s | x_{\u03c3(1:t\u22121)}) = p(x_{\u03c3(1:t\u22121)}, x_s) / p(x_{\u03c3(1:t\u22121)}) = exp(f (x_{\u03c3(1:t\u22121)}, x_s)/\u03b3) / exp(f (x_{\u03c3(1:t\u22121)})/\u03b3) = exp((f (x_{\u03c3(1:t\u22121)}, x_s) \u2212 f (x_{\u03c3(1:t\u22121)}))/\u03b3)          (3)\n\nWe observe that f (x_s, x_{\u03c3(1:t\u22121)}) \u2212 f (x_{\u03c3(1:t\u22121)}) = x_s^T A x_s + 2 x_s^T A x_{\u03c3(1:t\u22121)} is the gain in the\ntarget function f obtained after assigning the value to RV X_s. 
Since we are interested in making\nthis gain as large as possible, and assigning x_s = 0 leads to zero gain, we focus only on assigning\nx_s = 1. Consequently, the pdf in (3) can be treated as a probability mass function (pmf) over\ns \u2208 \u03ba(\u03c3^(i)(1 : t \u2212 1)) and sampling from it becomes equivalent to sampling\n\ns^(i) \u223c p(s | \u03c3^(i)(1 : t \u2212 1)) = p(x_s = 1 | x^(i)_{\u03c3(1:t\u22121)}).          (4)\n\nHence, we can interpret a particle x^(i)_{\u03c3(1:t\u22121)} as a sequence of indices of selected graph vertices\n\u03c3^(i)(1 : t \u2212 1), since x^(i)_{\u03c3(1:t\u22121)} is a vector of ones assigned to RVs with indices in \u03c3^(i)(1 : t \u2212 1).\nIn other words, it holds that ind(x^(i)_{\u03c3(1:t\u22121)}) = \u03c3^(i)(1 : t \u2212 1), where ind : {0, 1}^n \u2192 2^S is a function\nthat assigns to x the set of indices of coordinates of x that are equal to one. For example, if x =\n(0, 1, 1, 0, 0) \u2208 {0, 1}^5, then ind(x) = {2, 3}, which means that graph vertices with indices 2 and 3\nare selected by x.\nIn order to construct the pmf in (4), we only need to assign the probabilities to all indices s \u2208\n\u03ba(\u03c3^(i)(1 : t \u2212 1)) according to the definition in (3). Then s^(i) is sampled from the discrete pmf\nconstructed this way. Now we are ready to summarize the proposed PF framework in Algorithm 2.\n\nAlgorithm 2 Particle Filter Algorithm for U \u2212 M Maximal Clique\n\nInput: A, U, M, K, N, \u03b3.\nInitialize: t = 1, initialize every particle (i) with \u03c3^(i)_1 = U for i = 1, . . . , N.\nwhile \u03ba(\u03c3^(1)(1 : t \u2212 1)) \u222a . . . \u222a \u03ba(\u03c3^(N)(1 : t \u2212 1)) \u2260 \u2205 and t \u2264 K do\n  for i = 1 \u2192 N do\n    if \u03ba(\u03c3^(i)(1 : t \u2212 1)) \u2260 \u2205 then\n      1. 
Importance sampling / proposal: Sample followers x^(i)_s of particle (i) from\n         x^(i)_s \u223c p(x_s | x^(i)_{\u03c3(1:t\u22121)}) = exp((f (x_s, x^(i)_{\u03c3(1:t\u22121)}) \u2212 f (x^(i)_{\u03c3(1:t\u22121)}))/\u03b3)\n         and set x^(i,s)_{\u03c3(1:t)} = (x^(i)_{\u03c3(1:t\u22121)}, x^(i)_s) and \u03c3^(i,s)(t) = s, i.e., \u03c3^(i,s)(1 : t) = (\u03c3^(i)(1 : t \u2212 1), s).\n      2. Importance weighting / evaluation: An individual importance weight is assigned to\n         each follower particle according to\n         w(x^(i,s)_{\u03c3(1:t)}) = exp(f (x^(i)_s, x^(i)_{\u03c3(1:t\u22121)})/\u03b3)\n    else\n      we carry over the particle: x^(i,s)_{\u03c3(1:t)} = x^(i)_{\u03c3(1:t\u22121)} and w(x^(i,s)_{\u03c3(1:t)}) = w(x^(i)_{\u03c3(1:t\u22121)}).\n    end if\n  end for\n  3. Resampling: Sample with replacement N new particles from {x^(1,s)_{\u03c3(1:t)}, . . . , x^(N,s)_{\u03c3(1:t)}}\n     according to weights, and assign the sampled set to {x^(1)_{\u03c3(1:t)}, . . . , x^(N)_{\u03c3(1:t)}}; set t \u2190 t + 1.\nend while\nOutput: {x^(1)_{\u03c3(1:t)}, . . . , x^(N)_{\u03c3(1:t)}}\n\nWe take the particle with maximal value of f as solution of (2), or equivalently, as solution of (1):\nx\u2217 = x^(k)_{\u03c3(1:t)}, where k = arg max_i f (x^(i)_{\u03c3(1:t)}). As proven in [20], x\u2217 approximates arg max_x p(x) with\nany precision for a sufficiently large number of particles N.\n\n5 Object Detection with the Deformable Part Model\nIn Section 3.2, we find K^H salient parts denoted as SP = {Qi | i = 1, . . . , K^H } to represent an\nobject class; each part Qi contains K image patches, one patch from each training image. Now we\ndescribe the object model constructed from SP.\nWe train a linear SVM classifier for each part Qi, which we denote as SV M (Qi). To train the linear\nSVM classifier SV M (Qi), positive examples are the patches of Qi. The negative examples are\nobtained by an iterative procedure described in [10]. 
The initial training set consists of randomly\nchosen background windows and objects from other classes. The resulting classifier is used to scan\nimages and select the top false positives as hard examples. These hard examples are added to the\nnegative set and a new classifier is learned. This procedure is repeated several times to obtain the\nfinal classifier.\n\nAs in [6], we capture the spatial distribution of salient parts in SP with a star model, where the\nlocation of each part is expressed as an offset vector with respect to the model center. The offset is\nlearned from the offsets of the patches in Qi to the centers of the training images (bounding boxes)\ncontaining them.\n\nIn order to be able to directly compare to Latent SVM [6], we use the same object detection framework.\nThus, the detection is performed in the sliding window fashion followed by non-maxima\nsuppression. However, we do not use the root filter, which is an appearance classifier of the whole\ndetection window. Thus, our detection is purely part based.\n\n6 Experimental Evaluation\nWe validate our method on two datasets with deformable objects: ETHZ Giraffes dataset [9] and\nTUD-Pedestrians dataset [1]. For ETHZ Giraffes dataset, we follow the train/test split described in\n[18]: the first 43 giraffe images are positive training examples. The remaining 44 giraffe images in\nETHZ dataset are used for testing as positive images. We also select 43 images from other categories\nas negative training images. As negative test images we take all remaining images from the other\ncategories. Thus, we have a total of 86 training images, and a total of 169 test images. For\nlearning the salient parts, the giraffe bounding boxes are normalized to the area of 3000 pixels with\naspect ratio kept.\n\nFor TUD-Pedestrians dataset, we use the provided 400 images for training and 250 images for\ntesting. 
The background of training images is used to extract negative examples. The training\npedestrian bounding boxes are normalized to the height of 200 pixels with aspect ratio kept.\n\nFor both datasets, the size of each patch is 61 \u00d7 61 pixels, and the number of patches per image is about\n1000. We set K^H in (C4) to 6, meaning that our goal is to learn 6 salient parts for each object class.\nThe number of salient parts was determined experimentally. The minimal distance \u03c4 between salient\nparts is 60 pixels for the giraffe class and 45 pixels for the pedestrian class. In Algorithm 2, the\nnormalization parameter \u03b3 is set to the median value in A times the size of the expected maximal clique\ntimes 2, the number of particles is N = 500, and for each particle we sample 10 followers. In order\nto compare to [6], we used the released latent SVM code [7] on the same training and testing images\nas for our approach.\n\n6.1 Detection Performance\nWe plot the precision/recall (PR) curves to show the detection performance of the latent SVM\nmethod [6] and our method on both test datasets in Fig. 2. On the ETHZ giraffe class our average\nprecision (AP) is 0.841, which is much better than the AP of the latent SVM, which is 0.610. Our\nresult significantly outperforms the currently best reported result in [18], which has AP of 0.787.\nOn the TUD-Pedestrian dataset, our AP of 0.862 is comparable to the latent SVM, whose AP is\n0.875. These results show that our method can learn object models that yield very good detection\nperformance. Our method is particularly suitable for learning part models of objects with large deformation\nlike giraffes. The significant nonrigid deformation of giraffes leads to a large variation in\nthe position of patches representing the same object part. Since latent SVM learning is based on incremental\nimprovement in the position of parts, it seems to be unable to deal with large variations of\npart positions. 
In contrast, this does not influence the performance of our method, since it is matching\nbased. Because the variance in the part positions in TUD-Pedestrian dataset is smaller than in\ngiraffes, the performance of both methods becomes comparable. Some of our detection results are\nshown in Fig. 3. They demonstrate that our learned part model leads to detection performance that is\nrobust to the scale changes, appearance variance, part location variance, and substantial occlusion.\n\nFigure 2: Precision/recall curves (precision vs. recall) for Latent SVM method (red) and our method (blue) on ETHZ\nGiraffe dataset (left; our method AP=0.841, Latent SVM AP=0.610) and TUD-Pedestrian dataset (right; our method AP=0.862, Latent SVM AP=0.875).\n\nFigure 3: Some of our detection results for giraffe class and pedestrian dataset. The detected patches\nwith the same color belong to the same salient part. The part colors are the same as in Fig. 4.\nDetected bounding boxes are shown in blue.\n\n6.2 Tree Structure of Salient Parts\nIn our framework, it is also possible to learn a tree structure of the salient parts. Given a set of\nlearned salient parts SP = {Qi | i = 1, . . . , K^H } as vertices, we construct a new graph, called\nSalient Part Graph (SPG). The edge weights of SPG are given by the average distance between pairs\nof salient parts Qi and Qj given by D(i, j) for i, j = 1, . . . 
, K H.\n\nFigure 4: The learned salient parts and graph structures for the giraffe class and the pedestrian dataset.\nThe patches that belong to the same salient part are in the same color.\n\nWe obtain a minimum spanning tree of the SPG using Kruskal\u2019s algorithm [11]. The learned trees\nfor the two object classes, giraffes and pedestrians, are illustrated in Fig. 4. Their connections yield\na salient part structure in accord with our intuition. We did not utilize this tree structure for object\ndetection. Instead, we used the star model in our detection results in order to have a fair comparison\nto [6].\n7 Conclusions\nAn object part is defined as a set of image patches. Learning object parts is formulated as two\ninstances of the problem of finding maximal cliques in weighted graphs that satisfy hard constraints,\nand solved with the proposed Particle Filter inference framework. By utilizing the spatial relation of\nthe obtained salient parts, we are also able to learn a tree structure of the deformable object model.\nThe application of the proposed inference framework is not limited to learning object part models.\nThere exist many other applications where it is important to enforce hard constraints, such as common\npattern discovery and solving constrained matching problems.\nAcknowledgement: The work was supported by the NSF under Grants IIS-0812118, BCS-0924164,\nOIA-1027897, by the AFOSR Grant FA9550-09-1-0207, and by the National Natural Science Foundation\nof China (NSFC) Grants 60903096, 61173120 and 60873127.\n\nReferences\n\n[1] M. Andriluka, S. Roth, and B. Schiele. People-tracking-by-detection and people-detection-by-tracking. IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), 2008.\n\n[2] Y. Asahiro, R. Hassin, and K. Iwama. Complexity of finding dense subgraphs. Discrete Applied Mathematics, 121:15\u201326, 2002.\n\n[3] D. Crisan and A. Doucet. 
A survey of convergence results on particle filtering methods for practitioners. IEEE Transactions on Signal Processing, 50(3):736\u2013746, 2002.\n\n[4] P. Dollar, B. Babenko, S. Belongie, P. Perona, and Z. Tu. Multiple component learning for object detection. ECCV, 2008.\n\n[5] A. Eliazar and P. Ronald. Hierarchical linear/constant time SLAM using particle filters for dense maps. In Advances in Neural Information Processing Systems 18, pages 339\u2013346. 2006.\n\n[6] P. Felzenszwalb, R. Girshick, D. McAllester, and D. Ramanan. Object detection with discriminatively trained part based models. IEEE Transactions on Pattern Analysis and Machine Intelligence, 32(9), 2010.\n\n[7] P. F. Felzenszwalb, R. B. Girshick, and D. McAllester. Discriminatively trained deformable part models, release 4. http://people.cs.uchicago.edu/~pff/latent-release4/.\n\n[8] R. Fergus, P. Perona, and A. Zisserman. Object class recognition by unsupervised scale-invariant learning. Proc. of the IEEE Conf. on Computer Vision and Pattern Recognition, 2003.\n\n[9] V. Ferrari, T. Tuytelaars, and L. V. Gool. Object detection by contour segment networks. ECCV, 2006.\n\n[10] H. Harzallah, F. Jurie, and C. Schmid. Combining efficient object localization and image classification. In International Conference on Computer Vision, 2009.\n\n[11] J. B. Kruskal. On the shortest spanning subtree of a graph and the traveling salesman problem. In Proceedings of the American Mathematical Society, 1956.\n\n[12] M. Leordeanu, M. Hebert, and R. Sukthankar. An integer projected fixed point method for graph matching and map inference. In Neural Info. Proc. Systems (NIPS), 2009.\n\n[13] Z. Lin, G. Hua, and L. S. Davis. Multiple instance feature for robust part-based object detection. IEEE Conference on Computer Vision and Pattern Recognition, 2009.\n\n[14] H. Liu, L. J. Latecki, and S. Yan. Robust clustering as ensemble of affinity relations. 
In Neural Info. Proc. Systems (NIPS), 2010.\n\n[15] N. Loeff, H. Arora, A. Sorokin, and D. Forsyth. Efficient unsupervised learning for localization and detection in object categories. In Advances in Neural Information Processing Systems 18, pages 811\u2013818. 2006.\n\n[16] M. Pavan and M. Pelillo. Dominant sets and pairwise clustering. IEEE Transactions on Pattern Analysis and Machine Intelligence, 29:167\u2013172, 2007.\n\n[17] A. Quattoni, M. Collins, and T. Darrell. Conditional random fields for object recognition. In Advances in Neural Information Processing Systems 17, pages 1097\u20131104. 2005.\n\n[18] P. Srinivasan, Q. Zhu, and J. Shi. Many-to-one contour matching for describing and discriminating object shape. IEEE Conference on Computer Vision and Pattern Recognition, 2010.\n\n[19] X. Wang, X. Bai, W. Liu, and L. J. Latecki. Feature context for image classification and object detection. IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), 2011.\n\n[20] X. Yang, N. Adluru, and L. J. Latecki. Particle filter with state permutations for solving image jigsaw puzzles. In IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), 2011.", "award": [], "sourceid": 572, "authors": [{"given_name": "Xinggang", "family_name": "Wang", "institution": null}, {"given_name": "Xiang", "family_name": "Bai", "institution": null}, {"given_name": "Xingwei", "family_name": "Yang", "institution": null}, {"given_name": "Wenyu", "family_name": "Liu", "institution": null}, {"given_name": "Longin", "family_name": "Latecki", "institution": null}]}