{"title": "Sidestepping Intractable Inference with Structured Ensemble Cascades", "book": "Advances in Neural Information Processing Systems", "page_first": 2415, "page_last": 2423, "abstract": "For many structured prediction problems, complex models often require adopting approximate inference techniques such as variational methods or sampling, which generally provide no satisfactory accuracy guarantees. In this work, we propose sidestepping intractable inference altogether by learning ensembles of tractable sub-models as part of a structured prediction cascade. We focus in particular on problems with high-treewidth and large state-spaces, which occur in many computer vision tasks. Unlike other variational methods, our ensembles do not enforce agreement between sub-models, but filter the space of possible outputs by simply adding and thresholding the max-marginals of each constituent model. Our framework jointly estimates parameters for all models in the ensemble for each level of the cascade by minimizing a novel, convex loss function, yet requires only a linear increase in computation over learning or inference in a single tractable sub-model. We provide a generalization bound on the filtering loss of the ensemble as a theoretical justification of our approach, and we evaluate our method on both synthetic data and the task of estimating articulated human pose from challenging videos. 
We find that our approach significantly outperforms loopy belief propagation on the synthetic data and a state-of-the-art model on the pose estimation/tracking problem.", "full_text": "Sidestepping Intractable Inference\nwith Structured Ensemble Cascades\n\nDavid Weiss\u2217\n\nBenjamin Sapp\u2217\n\nBen Taskar\n\nComputer and Information Science\n\nUniversity of Pennsylvania\nPhiladelphia, PA 19104, USA\n\n{djweiss,bensapp,taskar}@cis.upenn.edu\n\nAbstract\n\nFor many structured prediction problems, complex models often require adopting\napproximate inference techniques such as variational methods or sampling, which\ngenerally provide no satisfactory accuracy guarantees. In this work, we propose\nsidestepping intractable inference altogether by learning ensembles of tractable\nsub-models as part of a structured prediction cascade. We focus in particular on\nproblems with high-treewidth and large state-spaces, which occur in many com-\nputer vision tasks. Unlike other variational methods, our ensembles do not enforce\nagreement between sub-models, but \ufb01lter the space of possible outputs by simply\nadding and thresholding the max-marginals of each constituent model. 
Our frame-\nwork jointly estimates parameters for all models in the ensemble for each level of\nthe cascade by minimizing a novel, convex loss function, yet requires only a linear\nincrease in computation over learning or inference in a single tractable sub-model.\nWe provide a generalization bound on the \ufb01ltering loss of the ensemble as a theo-\nretical justi\ufb01cation of our approach, and we evaluate our method on both synthetic\ndata and the task of estimating articulated human pose from challenging videos.\nWe \ufb01nd that our approach signi\ufb01cantly outperforms loopy belief propagation on\nthe synthetic data and a state-of-the-art model on the pose estimation/tracking\nproblem.\n\n1\n\nIntroduction\n\nWe address the problem of prediction in graphical models that are computationally challenging be-\ncause of both high-treewidth and large state-spaces. A primary example where intractable, large\nstate-space models typically arise is in dynamic state estimation problems, including tracking ar-\nticulated objects or multiple targets [1, 2]. The complexity stems from interactions of multiple\ndegrees-of-freedom (state variables) and \ufb01ne-level resolution at which states need to be estimated.\nAnother typical example arises in pixel-labeling problems where the model topology is typically a\n2D grid and the number of classes is large [3]. In this work, we propose a novel, principled frame-\nwork called Structured Ensemble Cascades for handling state complexity while learning complex\nmodels, extending our previous work on structured cascades for low-treewidth models [4].\nThe basic idea of structured cascades is to learn a sequence of coarse-to-\ufb01ne models that are op-\ntimized to safely \ufb01lter and re\ufb01ne the structured output state space, speeding up both learning and\ninference. 
While we previously assumed (sparse) exact inference is possible throughout the cascade [4], in this work, we apply and extend the structured cascade framework to intractable high-treewidth models. To avoid intractable inference, we decompose the desired model into an ensemble of tractable sub-models for each level of the cascade. For example, in the problem of tracking articulated human pose, each sub-model includes temporal dependency for a single body joint only.

*These authors have contributed equally.

Figure 1: (a) Schematic overview of structured ensemble cascades. The m'th level of the cascade takes as input a sparse set of states $Y^m$ for each variable $y_j$. The full model is decomposed into constituent sub-models (above, the three tree models used in the pose tracking experiment) and sparse inference is run. Next, the max marginals of the sub-models are summed to produce a single max marginal for each variable assignment: $\theta^\star(x, y_j) = \sum_p \theta_p^\star(x, y_j)$. Note that each level and each constituent model will have different parameters as a result of the learning process. Finally, the state spaces are thresholded based on the max-marginal scores and low-scoring states are filtered. Each state is then refined according to a state hierarchy (e.g., spatial resolution, or semantic categories) and passed to the next level of the cascade. This process can be repeated as many times as desired. In (b), we illustrate two consecutive levels of the ensemble cascade on real data, showing the filtered hypotheses left for a single video example.

To maintain efficiency, inference in the sub-models of the ensemble is uncoupled (unlike in dual decomposition [5]), but the decision to filter states depends on the sum of the max-marginals of the constituent models (see Figure 1). 
We derive a convex loss function for joint estimation of sub-models in each ensemble, which provably balances accuracy and efficiency, and we propose a simple stochastic subgradient algorithm for training.
The novel contributions of this work are as follows. First, we provide a principled and practical generalization of structured cascades to intractable models. Second, we present generalization bounds on the performance of the ensemble. Third, we introduce a challenging VideoPose dataset, culled from TV videos, for evaluating pose estimation and tracking. Finally, we present an evaluation of our approach on synthetic data and the VideoPose dataset. We find that our joint training of an ensemble method outperforms several competing baselines on this difficult tracking problem.

2 Structured Cascades

Given an input space $\mathcal{X}$, output space $\mathcal{Y}$, and a training set $\{\langle x^1, y^1 \rangle, \ldots, \langle x^n, y^n \rangle\}$ of $n$ samples from a joint distribution $D(X, Y)$, the standard supervised learning task is to learn a hypothesis $h : \mathcal{X} \mapsto \mathcal{Y}$ that minimizes the expected loss $\mathbb{E}_D[L(h(x), y)]$ for some non-negative loss function $L : \mathcal{Y} \times \mathcal{Y} \rightarrow \mathbb{R}^+$. In structured prediction problems, $Y$ is an $\ell$-vector of variables and $\mathcal{Y} = \mathcal{Y}_1 \times \cdots \times \mathcal{Y}_\ell$, with $\mathcal{Y}_i = \{1, \ldots, K\}$. In many settings, the number of random variables, $\ell$, differs depending on input $X$, but for simplicity of notation, we assume a fixed $\ell$ here. The linear hypothesis class we consider is of the form $h(x) = \operatorname{argmax}_{y \in \mathcal{Y}} \theta(x, y)$, where the scoring function $\theta(x, y) \triangleq \theta^\top f(x, y)$ is the inner product of a vector of parameters $\theta$ and a feature function $f : \mathcal{X} \times \mathcal{Y} \mapsto \mathbb{R}^d$ mapping $(x, y)$ pairs to a set of $d$ real-valued features. We further assume that $f$ decomposes over a set of cliques $C$ over inputs and outputs, so that $\theta(x, y) = \sum_{c \in C} \theta^\top f_c(x, y_c)$. 
Above, $y_c$ is an assignment to the subset of $Y$ variables in the clique $c$, and we will use $\mathcal{Y}_c$ to refer to the set of all assignments to the clique. By considering different cliques over $X$ and $Y$, $f$ can represent arbitrary interactions between the components of $x$ and $y$. Evaluating $h(x)$ is tractable for low-treewidth (hyper)graphs but is NP-hard in general, and typically, approximate inference is used when features are not low-treewidth.
In our prior work [4], we introduced the framework of Structured Prediction Cascades (SPC) to handle problems with low treewidth $T$ but large node state-space $K$, which makes complexity of $O(K^T)$ prohibitive. For example, for a 5th-order linear chain model for handwriting recognition or part-of-speech tagging, $K$ is about 50, and exact inference is on the order of $50^6 \approx 15$ billion times the length of the sequence. In tree-structured models we have used for human pose estimation [6], typical $K$ for each part includes image location and orientation and is on the order of 250,000, so even $K^2$ in pairwise potentials is prohibitive. Rather than learning a single monolithic model, a structured cascade is a coarse-to-fine sequence of increasingly complex models, where model complexity scales with Markov order in sequence models or spatial/angular resolution in pose models, for example. The goal of each model is to filter out a large subset of assignments without eliminating the correct one, so that the next level only has to consider a much reduced state-space. 
The filtering process is feed-forward, and each stage uses inference to compute max-marginals which are used to eliminate low-scoring node or clique assignments. The parameters of each model in the cascade are learned using a loss function which balances accuracy (not eliminating the correct assignment) and efficiency (eliminating as many other assignments as possible).
More precisely, for each clique assignment $y_c$, there is a max marginal $\theta^\star(x, y_c)$, defined as the maximum score of any output $y$ that contains the clique assignment $y_c$:

$\theta^\star(x, y_c) \triangleq \max_{y' \in \mathcal{Y}} \{\theta(x, y') : y'_c = y_c\}.$ (1)

For simplicity, we will examine the case where the cliques that we filter are defined only over single variables: $y_c = y_j$ (although the model may also contain larger cliques). Clique assignments are filtered by discarding any $y_j$ for which $\theta^\star(x, y_j) \le t(x)$ for a threshold $t(x)$. We define $\mathcal{Y}_j$ to be the set of possible states for the $j$'th variable. The threshold proposed in [4] is a "max mean-max" function,

$t(x, \alpha) = \alpha\, \theta^\star(x) + (1 - \alpha)\, \frac{1}{\sum_{j=1}^{\ell} |\mathcal{Y}_j|} \sum_{j=1}^{\ell} \sum_{y_j \in \mathcal{Y}_j} \theta^\star(x, y_j).$ (2)

Filtering max marginals in this fashion can be learned because of the "safe filtering" property: ensuring that $\theta(x^i, y^i) > t(x^i, \alpha)$ is sufficient (although not necessary) to guarantee that no marginal consistent with the true answer $y^i$ will be filtered. Thus, for fixed $\alpha$, [4] proposed learning parameters $\theta$ to maximize the margin $\theta(x^i, y^i) - t(x^i, \alpha)$ and therefore minimize filtering errors:

$\inf_{\theta, \xi \ge 0} \; \frac{\lambda}{2}\|\theta\|^2 + \frac{1}{n}\sum_i \xi_i \quad \text{s.t.} \quad \theta(x^i, y^i) \ge t(x^i, \alpha) + \ell_i - \xi_i, \quad \forall i = 1, \ldots, n$ (3)

Above, $\xi_i$ are slack variables for the margin constraints, and $\ell_i$ is the size of the $i$'th example.

3 Structured Ensemble Cascades

In this work, we tackle the problem of learning a structured cascade for problems in which inference is intractable, but in which the large node state-space has a natural hierarchy that can be exploited. For example, such hierarchies arise in pose estimation by discretizing the articulation of joints at multiple resolutions, or in image segmentation due to the semantic relationship between class labels (e.g., "grass" and "tree" can be grouped as "plants," "horse" and "cow" can be grouped as "animal").
Although the methods discussed in this section can be applied to more general intractable settings, and our prior work considered more general cascades that operate on graph cliques, we will assume for simplicity that the structured cascades operate in a "node-centric" coarse-to-fine manner as follows. For each variable $y_j$ in the model, each level of the cascade filters a current set of possible states $\mathcal{Y}_j$, and any surviving states are passed forward to the next level of the cascade by substituting each state with its set of descendants in the hierarchy. Thus, in the pose estimation problem, surviving states are subdivided into multiple finer-resolution states; in the image segmentation problem, broader object classes are split into their constituent classes for the next level.

We propose a novel method for learning structured cascades when inference is intractable due to loops in the graphical structure. 
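Before turning to ensembles, the single-model filtering machinery of Section 2 can be illustrated for a chain model. The following is a hypothetical sketch, not the authors' implementation; `chain_max_marginals` and `filter_states` are illustrative names, and the mean term of the threshold assumes every variable currently keeps its full state set:

```python
# Hypothetical sketch of single-model structured-cascade filtering on a chain:
# max-marginals via forward/backward max-sum passes (equation (1)), then the
# "max mean-max" threshold of equation (2).
import numpy as np

def chain_max_marginals(unary, pairwise):
    """unary: (L, K) scores; pairwise: (L-1, K, K) scores.
    Returns an (L, K) array of max-marginals theta*(x, y_j = k)."""
    L, K = unary.shape
    fwd = np.zeros((L, K))   # best score of any prefix ending with y_j = k
    fwd[0] = unary[0]
    for j in range(1, L):
        fwd[j] = unary[j] + np.max(fwd[j - 1][:, None] + pairwise[j - 1], axis=0)
    bwd = np.zeros((L, K))   # best score of any suffix given y_j = k
    for j in range(L - 2, -1, -1):
        bwd[j] = np.max(pairwise[j] + unary[j + 1][None, :] + bwd[j + 1][None, :], axis=1)
    return fwd + bwd         # unary[j] is counted exactly once (in fwd)

def filter_states(max_marg, alpha):
    """Keep states whose max-marginal exceeds t(x, alpha) of equation (2)."""
    best = max_marg.max()    # theta*(x): the global max score
    mean = max_marg.mean()   # mean of theta*(x, y_j) over all (j, y_j)
    return max_marg > alpha * best + (1 - alpha) * mean
```

By the safe-filtering property, the states of the highest-scoring output always survive here: their max-marginals all equal $\theta^\star(x)$, which can only exceed the threshold.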
The key idea of our approach is to decompose the loopy model into an equivalent collection of tractable sub-models. What distinguishes our approach from other decomposition-based methods (e.g., [5, 7]) is that, because the cascade's objective is filtering and not decoding, our approach does not require enforcing the constraint that the sub-models agree on which output has maximum score. We call our approach structured ensemble cascades.

3.1 Decomposition without agreement constraints

Given a loopy (intractable) graphical model, it is always possible to express the score of a given output $\theta(x, y)$ as the sum of $P$ scores $\theta_p(x, y)$ under sub-models that collectively cover every edge in the loopy model: $\theta(x, y) = \sum_p \theta_p(x, y)$. (See Figures 2 & 3 for illustrations specific to the experiments presented in this paper.) For example, in the method of dual decomposition [5], it is possible to solve a relaxed MAP problem in the (intractable) full model by running inference in the (tractable) sub-models under the constraint that all sub-models agree on the argmax solution. Enforcing this constraint requires iteratively re-weighting unary potentials of the sub-models and repeatedly re-running inference until each sub-model converges to the same argmax solution. However, for the purposes of a structured cascade, we are only interested in computing the max marginals $\theta^\star(x, y_j)$. In other words, we are only interested in knowing whether or not a configuration $y$ consistent with $y_j$ that scores highly in each sub-model $\theta_p(x, y)$ exists. We show in the remainder of this section that the requirement that a single $y$ consistent with $y_j$ optimizes the score of each sub-model (i.e., that all sub-models agree) is not necessary for the purposes of filtering. 
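Concretely, the uncoupled computation just described can be sketched as follows. This is a hypothetical illustration, not the paper's code; `ensemble_filter` is an assumed name, and any tractable routine can supply each sub-model's max-marginal array:

```python
# Hypothetical sketch of one level of an ensemble cascade: inference runs
# independently in each tractable sub-model (no agreement constraint), the
# resulting max-marginals are summed, and states are filtered against the
# summed per-model thresholds.
import numpy as np

def ensemble_filter(max_marginals_per_model, alpha):
    """max_marginals_per_model: list of P arrays, each of shape (L, K),
    where entry [j, k] is theta*_p(x, y_j = k) for sub-model p.
    Returns a boolean (L, K) mask of states that survive filtering."""
    summed_mm = np.zeros_like(max_marginals_per_model[0])
    summed_t = 0.0
    for mm in max_marginals_per_model:            # inference stays uncoupled
        summed_mm += mm                           # sum_p theta*_p(x, y_j)
        summed_t += alpha * mm.max() + (1 - alpha) * mm.mean()  # sum_p t_p
    return summed_mm > summed_t
```

Any full output whose summed score exceeds the summed thresholds has every one of its states survive this test, which is exactly the safe-filtering guarantee formalized below.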
Thus, because we do not have to enforce agreement between sub-models, we can learn a structured cascade for intractable models, but pay only a linear (factor of $P$) increase in inference time over the tractable sub-models.
Formally, we define a single level of the ensemble cascade as a set of $P$ models such that $\theta(x, y) = \sum_p \theta_p(x, y)$. We let $\theta_p(x, \cdot)$, $\theta_p^\star(x, \cdot)$, $\theta_p^\star(x)$ and $t_p(x, \alpha)$ be the score, max marginal, max score, and threshold of the $p$'th model, respectively. We define the argmax marginal or witness $y_p^\star(x, y_j)$ to be the maximizing complete assignment of the corresponding max marginal $\theta_p^\star(x, y_j)$. Then, if $y = y_p^\star(x, y_j)$ is the same for each of the $P$ sub-models, we have that

$\theta^\star(x, y_j) = \sum_p \theta_p^\star(x, y_j).$ (4)

Note that if we do not require the sub-models to agree, then $\theta^\star(x, y_j)$ is strictly less than $\sum_p \theta_p^\star(x, y_j)$. Nonetheless, as we show next, the approximation $\theta^\star(x, y_j) \approx \sum_p \theta_p^\star(x, y_j)$ is still useful and sufficient for filtering in a structured cascade.

3.2 Safe filtering and generalization error

We first show that if a given label $y$ has a high score in the full model, it must also have a large ensemble max marginal score, even if the sub-models do not agree on the argmax. This results in a "safe filtering" lemma similar to that given in [4], as follows:

Lemma 1 (Joint Safe Filtering). If $\sum_p \theta_p(x, y) > t$, then $\sum_p \theta_p^\star(x, y_j) > t$ for all $y_j \subseteq y$.

Proof. In English, this lemma states that if the global score is above a given threshold, then the sum of sub-model max-marginals is also above threshold (with no agreement constraint). The proof is straightforward. For any $y_j$ consistent with $y$, we have $\theta_p^\star(x, y_j) \ge \theta_p(x, y)$. Therefore $\sum_p \theta_p^\star(x, y_j) \ge \sum_p \theta_p(x, y) > t$.

Therefore, we see that an agreement constraint is not necessary in order to filter safely: if we ensure that the combined score $\sum_p \theta_p(x, y)$ of the true label $y$ is above threshold, then we can filter without making a mistake if we compute max marginals by running inference separately for each sub-model.
However, there is still potentially a price to pay for disagreement. If the sub-models do not agree, and the truth is not above threshold, then the threshold may filter all of the states for a given variable $y_j$ and therefore "break" the cascade. This results from the fact that without agreement, there is no single argmax output $y^\star$ that is always above threshold for any $\alpha$; therefore, it is not guaranteed that there exists an output $y$ to satisfy the Joint Safe Filtering Lemma. However, we note that in our experiments, we never experienced such breakdown of the cascades due to overly aggressive filtering.
In order to learn parameters that are useful for filtering, Lemma 1 suggests a natural ensemble filtering loss, which we define for any fixed $\alpha$ as follows,

$L_{\text{joint}}(\theta, \langle x, y \rangle) = \mathbf{1}\!\left[\sum_p \theta_p(x, y) \le \sum_p t_p(x, \alpha)\right],$ (5)

where $\theta = \{\theta_1, \ldots, \theta_P\}$ is the set of all parameters of the ensemble. (Note that this loss function is somewhat conservative because it measures whether or not a sufficient but not necessary condition for a filtering error has occurred.)
To conclude this section, we provide a generalization bound on the ensemble filtering loss, equivalent to the bounds in [4] for the single-model cascades. 
To do so, we first eliminate the dependence on $x$ and $\theta$ by rewriting $L_{\text{joint}}$ in terms of the scores of every possible state assignment, $\theta \cdot f(x, y_j)$, according to each sub-model. Let the vector $\theta_x \in \mathbb{R}^{mP}$ denote these scores, where $m$ is the number of possible state assignments in the sub-models.

Theorem 1. For any fixed $\alpha \in [0, 1)$, define the dominating cost function $\phi(y, \theta_x) = r_\gamma\!\left(\frac{1}{P}\sum_p \theta_p(x, y) - t_p(x, \alpha)\right)$, where $r_\gamma(\cdot)$ is the ramp function with slope $\gamma$. Let $\|\theta_p\|_2 \le F$ for all $p$, and $\|f(x, y_j)\|_2 \le 1$ for all $x$ and $y_j$. Then there exists a constant $C$ such that for any integer $n$ and any $0 < \delta < 1$, with probability $1 - \delta$ over samples of size $n$, every $\theta = \{\theta_1, \ldots, \theta_P\}$ satisfies:

$\mathbb{E}[L_{\text{joint}}(Y, \theta_x)] \le \hat{\mathbb{E}}[\phi(Y, \theta_x)] + \frac{C m \sqrt{\ell} F P}{\gamma \sqrt{n}} + \sqrt{\frac{8 \ln(2/\delta)}{n}},$ (6)

where $\hat{\mathbb{E}}$ is the empirical expectation with respect to training data. The proof is given in the supplemental materials.

3.3 Parameter estimation with gradient descent

In this section we discuss how to minimize the loss (5) given a dataset. We rephrase the SC optimization problem (3) using the ensemble max-marginals to form the ensemble cascade learning problem,

$\inf_{\theta_1, \ldots, \theta_P, \xi \ge 0} \; \frac{\lambda}{2}\sum_p \|\theta_p\|^2 + \frac{1}{n}\sum_i \xi_i \quad \text{s.t.} \quad \sum_p \theta_p(x^i, y^i) \ge \sum_p t_p(x^i, \alpha) + \ell_i - \xi_i.$ (7)

Seeing that the constraints can be reordered to show $\xi_i \ge \sum_p t_p(x^i, \alpha) - \sum_p \theta_p(x^i, y^i) + \ell_i$, we can form an equivalent unconstrained minimization problem and take the subgradient of (7) with respect to each parameter $\theta_p$. This yields the following update rule for the $p$'th model:

$\theta_p \leftarrow (1 - \lambda)\theta_p + \begin{cases} 0 & \text{if } \sum_p \theta_p(x^i, y^i) \ge \sum_p t_p(x^i, \alpha) + \ell_i, \\ \nabla \theta_p(x^i, y^i) - \nabla t_p(x^i, \alpha) & \text{otherwise.} \end{cases}$ (8)

This update is identical to the original SC update with the exception that we update each model individually only when the ensemble has made a mistake jointly. Thus, learning to filter with the ensemble requires only $P$ times as many resources as learning to filter with any of the models individually.

4 Experiments

We evaluated structured ensemble cascades in two experiments. First, we analyzed the "best-case" filtering performance of the summed max-marginal approximation to the true marginals on a synthetic image segmentation task, assuming the true scoring function $\theta(x, y)$ is available for inference. Second, we evaluated the real-world accuracy of our approach on a difficult human pose dataset (VideoPose). In both experiments, the max-marginal ensemble outperforms state-of-the-art baselines.

Figure 2: (a) Example decomposition of a $3 \times 3$ fully connected grid into all six constituent "comb" trees, $\theta(x, y) = \theta_1(x, y) + \cdots + \theta_6(x, y)$. In general, an $n \times n$ grid yields $2n$ such trees. (b) Improvement over Loopy BP and constituent tree-models on the synthetic segmentation task. Error bars show standard error.

4.1 Asymptotic Filtering Accuracy

We first evaluated the filtering accuracy of the max-marginal ensemble on a synthetic 8-class segmentation task. 
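As a concrete companion to Figure 2(a), the $2n$ "comb" trees of an $n \times n$ grid can be enumerated by taking each row or column in turn as a spine and attaching all perpendicular edges as teeth, so that every grid edge is covered by at least one tree. The construction below is a hypothetical sketch (not the paper's code), with `grid_edges` and `comb_trees` as illustrative names:

```python
# Hypothetical sketch of the comb-tree decomposition of Figure 2(a): an
# n x n grid graph covered by 2n spanning trees, one spine per row/column.
def grid_edges(n):
    """Undirected edges of an n x n grid, nodes indexed (row, col)."""
    edges = set()
    for r in range(n):
        for c in range(n):
            if c + 1 < n:
                edges.add(((r, c), (r, c + 1)))   # horizontal edge
            if r + 1 < n:
                edges.add(((r, c), (r + 1, c)))   # vertical edge
    return edges

def comb_trees(n):
    """Return 2n edge sets, each a spanning tree of the n x n grid."""
    trees = []
    for c in range(n):  # vertical spine at column c, all horizontal teeth
        spine = {((r, c), (r + 1, c)) for r in range(n - 1)}
        teeth = {((r, k), (r, k + 1)) for r in range(n) for k in range(n - 1)}
        trees.append(spine | teeth)
    for r in range(n):  # horizontal spine at row r, all vertical teeth
        spine = {((r, c), (r, c + 1)) for c in range(n - 1)}
        teeth = {((k, c), (k + 1, c)) for k in range(n - 1) for c in range(n)}
        trees.append(spine | teeth)
    return trees
```

Because each edge appears in several trees, potentials shared across sub-models would need to be normalized by their coverage counts, as the synthetic experiment below notes.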
For this experiment, we removed variability due to parameter estimation and focused our analysis on the accuracy of inference. We compared our approach to Loopy Belief Propagation (Loopy BP) [8], a state-of-the-art method for approximate inference, on an $11 \times 11$ two-dimensional grid MRF.* For the ensemble, we used 22 unique "comb" tree structures to approximate the full grid model (i.e., Figure 2(a)). To generate a synthetic instance, we generated unary potentials $\omega_i(k)$ uniformly on $[0, 1]$ and pairwise potentials log-uniformly: $\omega_{ij}(k, k') = e^{-v}$, where $v \sim U[-25, 25]$ was sampled independently for every edge and every pair of classes. (Note that for the ensemble, we normalized unary and edge potentials by dividing by the number of times that each potential was included in any model.) It is well known that inference for such grid MRFs is extremely difficult [8], and we observed that Loopy BP failed to converge for at least a few variables on most examples we generated.

Ensemble outperforms Loopy BP. We evaluated our approach on 100 synthetic grid MRF instances. For each instance, we computed the accuracy of filtering using marginals from Loopy BP, the ensemble, and each individual sub-model. We determined error rates by counting the number of times "ground truth" was incorrectly filtered if the top $K$ states were kept for each variable, where we sampled 1000 "ground truth" examples from the true joint distribution using Gibbs sampling. To obtain a good estimate of the true marginals, we restarted the chain for each sample and allowed 1000 iterations of mixing time. The result is presented in Figure 2(b) for all possible values of $K$ (filter aggressiveness). We found that the ensemble outperformed Loopy BP and the individual sub-models by a significant margin for all $K$.

Effect of sub-model agreement. 
We next investigated the question of whether or not the ensembles were most accurate on variables for which the sub-models tended to agree. For each variable $y_{ij}$ in each instance, we computed the mean pairwise Spearman correlation between the rankings of the 8 classes induced by the max marginals of each of the 22 sub-models. We found that complete agreement between all sub-models never occurred (the median correlation was 0.38). We found that sub-model agreement was significantly correlated ($p < 10^{-15}$) with the error of the ensemble for all values of $K$, peaking at $\rho = -0.143$ at $K = 5$. Thus, increased agreement predicted a decrease in error of the ensemble. We then asked the question: Does the effect of model agreement explain the improvement of the ensemble over Loopy BP? In fact, the improvement in error compared to Loopy BP was not correlated with sub-model agreement for any $K$ (maximum $\rho = 0.0185$, $p < 0.05$). Thus, sub-model agreement does not explain the improvement over Loopy BP, indicating that sub-model disagreement is not related to the difficulty in inference problems that causes Loopy BP to underperform relative to the ensembles (e.g., due to convergence failure).

*We used the UGM Matlab Toolbox by Mark Schmidt for the Loopy BP and Gibbs MCMC sections of this experiment. Publicly available at: http://people.cs.ubc.ca/~schmidtm/Software/UGM.html

Level | State Dimensions | PCP0.25 in top K=4 | Efficiency (%)
0 | 10 × 10 × 24 | -- | --
2 | 20 × 20 × 24 | 98.8 | 87.5
4 | 40 × 40 × 24 | 93.8 | 96.9
6 | 80 × 80 × 24 | 84.6 | 99.2

(a) Decoding Error. (b) Top K = 4 Error. (c) Ensemble efficiency.

Figure 3: (a), (b): Prediction error for the VideoPose dataset. 
Reported errors are the average distance from a predicted joint location to the true joint for frames that lie in the [25, 75] inter-quartile range (IQR) of errors. Error bars show standard errors computed with respect to clips. All SC models outperform [9]; the "torso only" persistence cascade introduces additional error compared to a single-frame cascade, but adding arm dependencies in the ensemble yields the best performance. (c): Summary of test set filtering efficiency and accuracy for the ensemble cascade. PCP0.25 measures the oracle % of correctly matched limb locations given unfiltered states; see [6] for more details.

4.2 The VideoPose Dataset

Our dataset consists of 34 video clips of approximately 50 frames each. The clips were harvested from three popular TV shows: 3 from Buffy the Vampire Slayer, 27 from Friends, and 4 from LOST. Clips were chosen to highlight a variety of situations and movements when the camera is largely focused on a single actor. In our experiments, we use the Buffy and half of the Friends clips as training (17 clips), and the remaining Friends and LOST clips for testing. In total we test on 901 individual frames. The Friends clips are split so that no clips from the same episode are used for both training and testing. We further set aside 4 of the Friends test clips to use as a development set. Each frame of each clip is hand-annotated with locations of joints of a full pose model: torso, upper/lower arms for both right and left, and top and bottom of head. For each joint, a binary tag indicating whether or not the joint is occluded is also included, to be used in future research.† For simplicity, we use only the torso and upper arm annotations in this work, as these have the strongest continuity across frames and strong geometric relationships.

Articulated pose model. 
All of the models we evaluated on this dataset share the same basic structure: a variable for each limb's $(x, y)$ location and angle of rotation (torso, left arm, and right arm), with edges between torso and arms to model pose geometry. We refer to this basic model, evaluated independently on each frame, as the "Single Frame" approach. For the VideoPose dataset, we augmented this model by adding edges between limb states in adjacent frames (Figure 1), forming an intractable, loopy model. Features: Our features in a single frame are the same as in the beginning levels of the pictorial structure cascade from [6]: unary features are discretized Histogram of Gradients part detector scores, and pairwise terms measure relative displacement in location and angle between neighboring parts. Pairwise features connecting limbs across time also express geometric displacement, allowing our model to capture the fact that human limbs move smoothly over time.

Coarse-to-Fine Ensemble Cascade. We learned a coarse-to-fine structured cascade with six levels for tracking as follows. The six levels use increasingly finer state spaces for joint locations, discretized into bins of resolution $10 \times 10$ up to $80 \times 80$, with each stage doubling one of the state space dimensions in the refinement step. All levels use an angular discretization of 24 bins. For the ensemble cascade, we learned three sub-models simultaneously (Figure 1), with each sub-model accounting for temporal consistency for a different limb by adding edges connecting the same limb in consecutive frames.

Experimental Comparison. A summary of results is presented in Figure 3. We compared the single-frame cascade and the ensemble cascade to a state-of-the-art single-frame pose detector (Ferrari et al. [9]) and to one of the individual sub-models, modeling torso consistency only ("Torso Only"). We evaluated the method from [9] on only the first half of the test data due to computation time (taking approximately 7 minutes/frame). We found that the ensemble cascade was the most accurate for every joint in the model, that all cascades outperformed the state-of-the-art baseline, and, interestingly, that the single-frame cascade outperformed the torso-only cascade. We suspect that the poor performance of the torso-only model may arise because propagating only torso states through time leads to an over-reliance on the relatively weak torso signal to determine the location of all the limbs. Sample qualitative output from the ensemble is presented in Figure 4.

†The VideoPose dataset is available online at http://vision.grasp.upenn.edu/video/.

Figure 4: Qualitative test results. Points shown are the positions of left/right shoulders and torsos at the last level of the ensemble SC (blue square, green dot, white circle, respectively). Also shown (green line segments) are the best-fitting hypotheses to ground-truth joints, selected from within the top 4 max-marginal values. Shown as dotted gray lines is the best-guess pose returned by [9].

5 Discussion

Related Work. Tracking with articulated body parts is challenging for two main reasons. First, body parts are hard to detect in unconstrained environments due to the enormous variability in appearance (from lighting, clothing and articulation) and occlusion. Second, the huge number of degrees of freedom makes exact modeling of the problem computationally prohibitive. In light of these two issues, many works focus on fixed-camera environments (e.g., [10, 11, 12]), some even assuming silhouettes can be obtained (e.g., [2]), or 3D information from multiple sensors [13]. 
In choices of modeling, past works reduce the degrees of freedom of the large state space by modeling only location and scale, by resorting to sampling methods ([1, 14]), or by embedding into low-dimensional latent spaces ([10]). In contrast, in this work we learn to efficiently navigate an unconstrained state space in the challenging setting of a single, non-fixed camera.

We adopt the same basic modeling structure as [15, 9, 16] in our work, but also model dependencies through time. We also take a discriminative approach to training rather than a generative one. Ferrari et al. [9] use loopy belief propagation to incorporate temporal consistency of parts, but to our knowledge we are the first to quantitatively evaluate on movie/TV show sequences.

In the method of dual decomposition [5], efficient optimization of an LP relaxation of MAP inference in an intractable model is achieved by coupling the inference of a collection of tractable sub-models. This coupling is achieved by repeatedly performing inference and updating a set of dual parameters until convergence. In contrast, we perform inference independently in each sub-model only once, and reason about individual variables using the sums of max-marginals.

Future Research. Several key questions remain as future directions of research. Although we presented generalization bounds for the error of the cascade, such bounds are purely "post-hoc." We are currently investigating a priori properties of, or assumptions about, the data and cascade that will provably lead to efficient cascaded learning and inference. In the future, our approach on the VideoPose dataset could easily be extended to model more limbs, more complex features in time and geometry (e.g., [6]), and additional states such as occlusions.
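To make the contrast with dual decomposition concrete, the single-pass filtering step can be sketched in code. This is a minimal illustration and not our implementation: it assumes each tractable sub-model is a chain over T frames with K states per frame, computes max-marginals by max-product message passing, and filters states by thresholding the sum of the sub-models' max-marginals. The function names, the NumPy array layout, and the convex-combination threshold with parameter alpha are illustrative assumptions.

```python
import numpy as np

def chain_max_marginals(unary, pairwise):
    """Max-marginals for a chain-structured sub-model.

    unary:    (T, K) node scores; pairwise: (T-1, K, K) edge scores.
    Returns m with m[t, k] = score of the best full assignment that
    is constrained to pass through state k at node t.
    """
    T, K = unary.shape
    fwd = np.zeros((T, K))  # best score of a prefix ending in state k
    bwd = np.zeros((T, K))  # best score of a suffix following state k
    fwd[0] = unary[0]
    for t in range(1, T):
        fwd[t] = unary[t] + (fwd[t - 1][:, None] + pairwise[t - 1]).max(axis=0)
    for t in range(T - 2, -1, -1):
        bwd[t] = (pairwise[t] + unary[t + 1] + bwd[t + 1]).max(axis=1)
    return fwd + bwd

def ensemble_filter(max_marginals, alpha):
    """Keep states whose summed max-marginal clears the threshold.

    max_marginals: list of (T, K) arrays, one per sub-model.
    alpha in [0, 1] interpolates between the mean max-marginal
    (alpha = 0, safe) and the best score (alpha = 1, aggressive).
    Returns a (T, K) boolean mask of states surviving to the next level.
    """
    m = sum(max_marginals)  # add max-marginals across the ensemble
    threshold = alpha * m.max() + (1 - alpha) * m.mean()
    return m >= threshold
```

Note that each sub-model is touched exactly once: there is no iterative coupling or dual update, only one inference pass per sub-model followed by a sum and a threshold. With alpha = 1, only states lying on a best-scoring joint assignment of the summed scores survive.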
Successfully solving this problem is necessary in order to understand the context and consequences of interactions between actors in video; e.g., to be able to follow a pointing arm or to observe the transfer of an important object from one person to another.

Acknowledgements

The authors were partially supported by NSF Grant 0803256 and ARL Cooperative Agreement W911NF-10-2-0016. David Weiss was also supported by an NSF Graduate Research Fellowship.

References

[1] L. Sigal, S. Bhatia, S. Roth, M. J. Black, and M. Isard. Tracking loose-limbed people. In Proc. CVPR, 2004.
[2] B. Wu and R. Nevatia. Detection and tracking of multiple, partially occluded humans by Bayesian combination of edgelet based part detectors. IJCV, 75(2):247–266, 2007.
[3] J. D. J. Shotton, J. Winn, C. Rother, and A. Criminisi. TextonBoost for image understanding: Multi-class object recognition and segmentation by jointly modeling texture, layout, and context. IJCV, 81(1), January 2009.
[4] D. Weiss and B. Taskar. Structured prediction cascades. In Proc. AISTATS, 2010.
[5] N. Komodakis, N. Paragios, and G. Tziritas. MRF optimization via dual decomposition: Message-passing revisited. In Proc. ICCV, 2007.
[6] B. Sapp, A. Toshev, and B. Taskar. Cascaded models for articulated pose estimation. In Proc. ECCV, 2010.
[7] D. P. Bertsekas. Nonlinear Programming. Athena Scientific, second edition, 1999.
[8] D. Koller and N. Friedman. Probabilistic Graphical Models: Principles and Techniques. The MIT Press, 2009.
[9] V. Ferrari, M. Marin-Jimenez, and A. Zisserman. Progressive search space reduction for human pose estimation. In Proc. CVPR, 2008.
[10] M. Andriluka, S. Roth, and B. Schiele. People-tracking-by-detection and people-detection-by-tracking. In Proc. CVPR, 2008.
[11] S. Pellegrini, A. Ess, K. Schindler, and L. Van Gool. You'll Never Walk Alone: Modeling Social Behavior for Multi-target Tracking. In Proc.
ICCV, 2009.
[12] L. Kratz and K. Nishino. Tracking with Local Spatio-Temporal Motion Patterns in Extremely Crowded Scenes. In Proc. CVPR, 2010.
[13] R. Muñoz-Salinas, E. Aguirre, and M. García-Silvente. People detection and tracking using stereo vision and color. Image and Vision Computing, 25(6):995–1007, 2007.
[14] J. S. Kwon and K. M. Lee. Tracking of a non-rigid object via patch-based dynamic appearance modeling and adaptive basin hopping Monte Carlo sampling. In Proc. CVPR, 2009.
[15] B. Sapp, C. Jordan, and B. Taskar. Adaptive pose priors for pictorial structures. In Proc. CVPR, 2010.
[16] M. Andriluka, S. Roth, and B. Schiele. Pictorial structures revisited: People detection and articulated pose estimation. In Proc. CVPR, 2009.