{"title": "Learning a discriminative hidden part model for human action recognition", "book": "Advances in Neural Information Processing Systems", "page_first": 1721, "page_last": 1728, "abstract": "We present a discriminative part-based approach for human action recognition from video sequences using motion features. Our model is based on the recently proposed hidden conditional random field~(hCRF) for object recognition. Similar to hCRF for object recognition, we model a human action by a flexible constellation of parts conditioned on image observations. Different from object recognition, our model combines both large-scale global features and local patch features to distinguish various actions. Our experimental results show that our model is comparable to other state-of-the-art approaches in action recognition. In particular, our experimental results demonstrate that combining large-scale global features and local patch features performs significantly better than directly applying hCRF on local patches alone.", "full_text": "Learning a Discriminative Hidden Part Model for\n\nHuman Action Recognition\n\nYang Wang\n\nSchool of Computing Science\n\nSimon Fraser University\n\nBurnaby, BC, Canada, V5A 1S6\n\nGreg Mori\n\nSchool of Computing Science\n\nSimon Fraser University\n\nBurnaby, BC, Canada, V5A 1S6\n\nywang12@cs.sfu.ca\n\nmori@cs.sfu.ca\n\nAbstract\n\nWe present a discriminative part-based approach for human action recognition\nfrom video sequences using motion features. Our model is based on the recently\nproposed hidden conditional random \ufb01eld (hCRF) for object recognition. Similar\nto hCRF for object recognition, we model a human action by a \ufb02exible constel-\nlation of parts conditioned on image observations. Different from object recogni-\ntion, our model combines both large-scale global features and local patch features\nto distinguish various actions. 
Our experimental results show that our model is\ncomparable to other state-of-the-art approaches in action recognition. In partic-\nular, our experimental results demonstrate that combining large-scale global fea-\ntures and local patch features performs signi\ufb01cantly better than directly applying\nhCRF on local patches alone.\n\n1 Introduction\n\nRecognizing human actions from videos is a task of obvious scienti\ufb01c and practical importance.\nIn this paper, we consider the problem of recognizing human actions from video sequences on a\nframe-by-frame basis. We develop a discriminatively trained hidden part model to represent human\nactions. Our model is inspired by the hidden conditional random \ufb01eld (hCRF) model [16] in object\nrecognition.\n\nIn object recognition, there are three major representations: global template (rigid, e.g. [3], or de-\nformable, e.g. [1]), bag-of-words [18], and part-based [7, 6]. All three representations have been\nshown to be effective on certain object recognition tasks. In particular, recent work [6] has shown\nthat part-based models outperform global templates and bag-of-words on challenging object recog-\nnition tasks.\n\nA lot of the ideas used in object recognition can also be found in action recognition. For example,\nthere is work [2] that treats actions as space-time shapes and reduces the problem of action recog-\nnition to 3D object recognition. In action recognition, both global template [5] and bag-of-words\nmodels [14, 4, 15] have been shown to be effective on certain tasks. Although conceptually ap-\npealing and promising, the merit of part-based models has not yet been widely recognized in action\nrecognition. The goal of this work is to address this gap.\n\nOur work is partly inspired by a recent work in part-based event detection [10].\nIn that work,\ntemplate matching is combined with a pictorial structure model to detect and localize actions in\ncrowded videos. 
One limitation of that work is that one has to manually specify the parts. Unlike Ke et al. [10], the parts in our model are initialized automatically.

Figure 1: Construction of the motion descriptor. (a) original image; (b) optical flow; (c) x and y components of the optical flow vectors F_x, F_y; (d) half-wave rectification of the x and y components to obtain four separate channels F_x^+, F_x^-, F_y^+, F_y^-; (e) final blurry motion descriptors F_x^{b+}, F_x^{b-}, F_y^{b+}, F_y^{b-}.

The major contribution of this work is that we combine the flexibility of part-based approaches with the global perspective of large-scale template features in a discriminative model. We show that the combination of part-based and large-scale template features improves the final results.

2 Our Model

The hidden conditional random field model [16] was originally proposed for object recognition and has also been applied to sequence labeling [19]. Objects are modeled as flexible constellations of parts conditioned on the appearances of local patches found by interest point operators. The probability of the assignment of parts to local features is modeled by a conditional random field (CRF) [11]. The advantage of the hCRF is that it relaxes the conditional independence assumption commonly used in bag-of-words approaches to object recognition.

Similarly, local patches can also be used to distinguish actions. Figure 4(a) shows some examples of human motion and the local patches that can be used to distinguish them. A bag-of-words representation can be used to model these local patches for action recognition. However, it suffers from the same conditional independence assumption, which ignores the spatial structure of the parts.
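As a toy illustration of this conditional-independence restriction (purely illustrative, not from the paper): a bag-of-words score is a sum of independent per-patch terms, so rearranging the patches spatially leaves the score unchanged.

```python
import numpy as np

rng = np.random.default_rng(0)
patches = rng.normal(size=(10, 4))   # appearance features of 10 local patches
w = rng.normal(size=4)               # per-class appearance weights (one class)

def bow_score(patch_feats, w):
    """Bag-of-words style score: a sum of independent per-patch scores.
    It never looks at where a patch is located."""
    return sum(f @ w for f in patch_feats)

# Scrambling the patch arrangement does not change the score at all.
assert np.isclose(bow_score(patches, w), bow_score(patches[::-1], w))
```

A model with pairwise terms over a part graph, as used below, breaks exactly this invariance.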
In this work, we use a variant of the hCRF to model the constellation of these local patches in order to alleviate this restriction.

There are also some important differences between objects and actions. For objects, local patches may carry enough information for recognition; for actions, we believe local patches are not sufficiently informative. In our approach, we modify the hCRF model to combine local patches and large-scale global features. The large-scale global features are represented by a root model that takes the frame as a whole. Another important difference with [16] is that we use the learned root model to find discriminative local patches, rather than using a generic interest-point operator.

2.1 Motion features

Our model is built upon the optical flow features of [5]. This motion descriptor has been shown to perform reliably on noisy image sequences, and has been applied to various tasks, such as action classification and motion synthesis.

To calculate the motion descriptor, we first need to track and stabilize the persons in a video sequence. Any reasonable tracking or human detection algorithm can be used, since the motion descriptor we use is very robust to jitter introduced by the tracking. Given a stabilized video sequence in which the person of interest appears in the center of the field of view, we compute the optical flow at each frame using the Lucas-Kanade [12] algorithm. The optical flow vector field F is then split into two scalar fields F_x and F_y, corresponding to the x and y components of F. F_x and F_y are further half-wave rectified into four non-negative channels F_x^+, F_x^-, F_y^+, F_y^-, so that F_x = F_x^+ - F_x^- and F_y = F_y^+ - F_y^-. These four non-negative channels are then blurred with a Gaussian kernel and normalized to obtain the final four channels F_x^{b+}, F_x^{b-}, F_y^{b+}, F_y^{b-} (see Fig. 1).

2.2 Hidden conditional random field (hCRF)

Now we describe how we model a frame I in a video sequence. Let x be the motion feature of this frame, and y be the corresponding class label of this frame, ranging over a finite label alphabet Y. Our task is to learn a mapping from x to y. We assume each image I contains a set of salient patches {I_1, I_2, ..., I_m}; we will describe how to find these salient patches in Sec. 3. Our training set consists of labeled images (x^t, y^t) for t = 1, 2, ..., n (as a notation convention, we use superscripts to index training images and subscripts to index patches), where y^t ∈ Y and x^t = (x^t_1, x^t_2, ..., x^t_m). Here x^t_i = x^t(I^t_i) is the feature vector extracted from the global motion feature x^t at the location of the patch I^t_i. For each image I = {I_1, I_2, ..., I_m}, we assume there exists a vector of hidden "part" variables h = {h_1, h_2, ..., h_m}, where each h_i takes values from a finite set H of possible parts. Intuitively, each h_i assigns a part label to the patch I_i, where i = 1, 2, ..., m. For example, for the action "waving-two-hands", these parts may be used to characterize the movement patterns of the left and right arms. The values of h are not observed in the training set, and become the hidden variables of the model.

We assume there are certain constraints between some pairs (h_j, h_k). For example, in the case of "waving-two-hands", two patches h_j and h_k at the left hand might have the constraint that they tend to have the same part label, since both of them are characterized by the movement of the left hand.
If we consider the h_i (i = 1, 2, ..., m) to be vertices in a graph G = (V, E), the constraint between h_j and h_k is denoted by an edge (j, k) ∈ E. See Fig. 2 for an illustration of our model. Note that the graph structure can be different for different images. We will describe how to find the graph structure E in Sec. 3.

Figure 2: Illustration of the model. Each circle corresponds to a variable (the class label y, the hidden parts h_i, and the image features x_i), and each square corresponds to a factor (φ(·), ϕ(·), ψ(·), ω(·)) in the model.

Given the motion feature x of an image I, its corresponding class label y, and part labels h, a hidden conditional random field is defined as

p(y, h|x; θ) = exp(Ψ(y, h, x; θ)) / Σ_{ŷ∈Y} Σ_{ĥ∈H^m} exp(Ψ(ŷ, ĥ, x; θ)),

where θ is the model parameter and Ψ(y, h, x; θ) ∈ R is a potential function parameterized by θ. It follows that

p(y|x; θ) = Σ_{h∈H^m} p(y, h|x; θ) = Σ_{h∈H^m} exp(Ψ(y, h, x; θ)) / Σ_{ŷ∈Y} Σ_{h∈H^m} exp(Ψ(ŷ, h, x; θ))    (1)

We assume Ψ(y, h, x) is linear in the parameters θ = {α, β, γ, η}:

Ψ(y, h, x; θ) = Σ_{j∈V} α⊤·φ(x_j, h_j) + Σ_{j∈V} β⊤·ϕ(y, h_j) + Σ_{(j,k)∈E} γ⊤·ψ(y, h_j, h_k) + η⊤·ω(y, x)    (2)

where φ(·) and ϕ(·) are feature vectors depending on single h_j's, ψ(·) is a feature vector depending on pairs (h_j, h_k), and ω(·) is a feature vector that does not depend on the values of the hidden variables. The details of these feature vectors are described in the following.

Unary potential α⊤·φ(x_j, h_j): This potential function models the compatibility between x_j and the part label h_j,
i.e., how likely the patch x_j is to be labeled as part h_j. It is parameterized as

α⊤·φ(x_j, h_j) = Σ_{c∈H} α_c⊤ · 1{h_j = c} · [f^a(x_j) f^s(x_j)]    (3)

where we use [f^a(x_j) f^s(x_j)] to denote the concatenation of the two vectors f^a(x_j) and f^s(x_j). f^a(x_j) is a feature vector describing the appearance of the patch x_j; in our case, f^a(x_j) is simply the concatenation of the four channels of the motion features at patch x_j, i.e., f^a(x_j) = [F_x^{b+}(x_j) F_x^{b-}(x_j) F_y^{b+}(x_j) F_y^{b-}(x_j)]. f^s(x_j) is a feature vector describing the spatial location of the patch x_j. We discretize the image locations into l bins, and f^s(x_j) is a length-l vector of all zeros with a single one at the bin occupied by x_j. The parameter α_c can be interpreted as a measurement of the compatibility between the feature vector [f^a(x_j) f^s(x_j)] and the part label h_j = c. The parameter α is simply the concatenation of α_c for all c ∈ H.

Unary potential β⊤·ϕ(y, h_j): This potential function models the compatibility between class label y and part label h_j, i.e., how likely an image with class label y is to contain a patch with part label h_j. It is parameterized as

β⊤·ϕ(y, h_j) = Σ_{a∈Y} Σ_{b∈H} β_{a,b} · 1{y = a} · 1{h_j = b}    (4)

where β_{a,b} indicates the compatibility between y = a and h_j = b.

Pairwise potential γ⊤·ψ(y, h_j, h_k): This pairwise potential function models the compatibility between class label y and a pair of part labels (h_j, h_k), i.e., how likely an image with class label y is to contain a pair of patches with part labels h_j and h_k, where (j, k) ∈ E corresponds to an edge in the graph.
It is parameterized as

γ⊤·ψ(y, h_j, h_k) = Σ_{a∈Y} Σ_{b∈H} Σ_{c∈H} γ_{a,b,c} · 1{y = a} · 1{h_j = b} · 1{h_k = c}    (5)

where γ_{a,b,c} indicates the compatibility of y = a, h_j = b and h_k = c for the edge (j, k) ∈ E.

Root model η⊤·ω(y, x): The root model is a potential function that models the compatibility of class label y and the large-scale global feature of the whole image. It is parameterized as

η⊤·ω(y, x) = Σ_{a∈Y} η_a⊤ · 1{y = a} · g(x)    (6)

where g(x) is a feature vector describing the appearance of the whole image. In our case, g(x) is the concatenation of all four channels of the motion features in the image, i.e., g(x) = [F_x^{b+} F_x^{b-} F_y^{b+} F_y^{b-}]. η_a can be interpreted as a root filter that measures the compatibility between the appearance of an image g(x) and a class label y = a, and η is simply the concatenation of η_a for all a ∈ Y.

The parameterization of Ψ(y, h, x) is similar to that used in object recognition [16], but there are two important differences. First, our definition of the unary potential function φ(·) encodes both appearance and spatial information of the patches. Second, we have a potential function ω(·) describing the large-scale appearance of the whole image. The representation in Quattoni et al. [16] models only local patches extracted from the image. This may be appropriate for object recognition, but for human action recognition it is not clear that local patches are sufficiently informative. We will demonstrate this experimentally in Sec.
4.

3 Learning and Inference

The model parameters θ are learned by maximizing the conditional log-likelihood on the training images:

θ* = arg max_θ L(θ) = arg max_θ Σ_t log p(y^t|x^t; θ) = arg max_θ Σ_t log ( Σ_h p(y^t, h|x^t; θ) )    (7)

The objective function L(θ) in Quattoni et al. [16] also has a regularization term -||θ||^2/(2σ^2). In our experiments, we find that the regularization does not seem to have much effect on the final results, so we use the unregularized version. Unlike that of a conditional random field (CRF) [11], the objective function L(θ) of the hCRF is not concave, due to the hidden variables h, but we can still use gradient ascent to find a θ that is locally optimal. The gradient of the log-likelihood with respect to the t-th training image (x^t, y^t) can be calculated as:

∂L^t(θ)/∂α = Σ_{j∈V} [ E_{p(h_j|y^t,x^t;θ)} φ(x^t_j, h_j) - E_{p(h_j,y|x^t;θ)} φ(x^t_j, h_j) ]
∂L^t(θ)/∂β = Σ_{j∈V} [ E_{p(h_j|y^t,x^t;θ)} ϕ(h_j, y^t) - E_{p(h_j,y|x^t;θ)} ϕ(h_j, y) ]
∂L^t(θ)/∂γ = Σ_{(j,k)∈E} [ E_{p(h_j,h_k|y^t,x^t;θ)} ψ(y^t, h_j, h_k) - E_{p(h_j,h_k,y|x^t;θ)} ψ(y, h_j, h_k) ]
∂L^t(θ)/∂η = ω(y^t, x^t) - E_{p(y|x^t;θ)} ω(y, x^t)    (8)

Assuming the edges E form a tree, the expectations in Eq. 8 can be calculated in O(|Y||E||H|^2) time using belief propagation.

Now we describe several details of how the above ideas are implemented.

Learning the root filter η: Given a set of training images (x^t, y^t), we first learn the root filter η by solving the following optimization problem:

η* = arg max_η Σ_t log L(y^t|x^t; η) = arg max_η Σ_t log [ exp(η⊤·ω(y^t, x^t)) / Σ_y exp(η⊤·ω(y, x^t)) ]    (9)

In other words, η* is learned by considering only the feature vector ω(·). We then use η* as the starting point for η in the gradient ascent (Eq. 8). The other parameters α, β, γ are initialized randomly.

Patch initialization: We use a simple heuristic similar to that used in [6] to initialize ten salient patches on every training image from the root filter η* trained above. For each training image I with class label a, we apply the root filter η_a on I, then select a rectangular region of size 5 × 5 in the image that has the most positive energy. We zero out the weights in this region and repeat until ten patches are selected. Figure 4(a) shows examples of the patches found in some images. The tree G = (V, E) is formed by running a minimum spanning tree algorithm over the ten patches.

Inference: During testing, we do not know the class label of a given test image, so we cannot use the patch initialization described above, since we do not know which root filter to use. Instead, we run the root filters from all the classes on a test image, then calculate the probabilities of all possible instantiations of patches under our learned model, and classify the image by picking the class label that gives the maximum of these probabilities.
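A sketch of this test-time procedure (illustrative only: `init_patches` and `posterior` are hypothetical stand-ins for the root-filter patch initialization above and the hCRF posterior p(y|x; θ)):

```python
def classify(x, classes, root_filters, init_patches, posterior):
    """For each class k, initialize patches on x with root filter k,
    then pick the class label whose posterior is largest over all
    of the |Y| patch instantiations."""
    best_label, best_prob = None, float("-inf")
    for k in classes:
        x_k = init_patches(x, root_filters[k])  # patches from filter k
        for y in classes:
            p = posterior(y, x_k)               # p(y | x_k; theta)
            if p > best_prob:
                best_label, best_prob = y, p
    return best_label
```

In the real model, `posterior` would sum over all part assignments h via belief propagation on the tree; here it is just a black box.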
In other words, for a test image with motion descriptor x, we first obtain |Y| instances {x^(1), x^(2), ..., x^(|Y|)}, where each x^(k) is obtained by initializing the patches on x using the root filter η_k. The final class label y* of x is obtained as y* = arg max_y [ max{ p(y|x^(1); θ), p(y|x^(2); θ), ..., p(y|x^(|Y|); θ) } ].

4 Experiments

We test our algorithm on two publicly available datasets that have been widely used in action recognition: the Weizmann human action dataset [2] and the KTH human motion dataset [17]. Performance on these benchmarks is saturating: state-of-the-art approaches achieve near-perfect results. We show that our method achieves results comparable to the state of the art, and, more importantly, that our extended hCRF model significantly outperforms a direct application of the original hCRF model [16].

Weizmann dataset: The Weizmann human action dataset contains 83 video sequences showing nine different people, each performing nine different actions: running, walking, jumping-jack, jumping-forward-on-two-legs, jumping-in-place-on-two-legs, galloping-sideways, waving-two-hands, waving-one-hand, and bending. We track and stabilize the figures using the background subtraction masks that come with this dataset.

We randomly choose the videos of five subjects as the training set, and the videos of the remaining four subjects as the test set. We learn three hCRF models with different numbers of possible part labels, |H| = 6, 10, 20.
Our model classifies every frame in a video sequence (i.e., per-frame classification), but we can also obtain the class label for the whole video sequence by majority voting over the labels of its frames (i.e., per-video classification).

Figure 3: Confusion matrices of classification results on the Weizmann dataset (left: frame-by-frame classification; right: video classification). Horizontal rows are ground truths, and vertical columns are predictions.

method       root model          local hCRF                       our approach
                         |H| = 6   |H| = 10   |H| = 20   |H| = 6   |H| = 10   |H| = 20
per-frame    0.7470      0.5722    0.6656     0.6383     0.8682    0.9029     0.8557
per-video    0.8889      0.5556    0.6944     0.6111     0.9167    0.9722     0.9444

Table 1: Comparison of two baseline systems with our approach on the Weizmann dataset.
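The per-video label is obtained by simple majority voting over the per-frame predictions; as a minimal sketch:

```python
from collections import Counter

def video_label(frame_labels):
    """Per-video classification: majority vote over per-frame labels."""
    return Counter(frame_labels).most_common(1)[0][0]

# e.g. a clip whose frames are mostly classified as "walk"
assert video_label(["walk", "walk", "run", "walk", "side"]) == "walk"
```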
We show the confusion matrices with |H| = 10 for both per-frame and per-video classification in Fig. 3.

We compare our system to two baseline methods. The first baseline (root model) uses only the root filter η⊤·ω(y, x), and is simply a discriminative version of Efros et al. [5]. The second baseline (local hCRF) is a direct application of the original hCRF model [16]. It is similar to our model but without the root filter η⊤·ω(y, x); i.e., local hCRF uses the root filter only to initialize the salient patches, not in the final model. The comparative results are shown in Table 1. Our approach significantly outperforms the two baseline methods. We also compare our results (with |H| = 10) with previous work in Table 2. Note that [2] classifies space-time cubes, so it is not clear how it can be compared with methods that classify frames or videos. Our result is significantly better than [13] and comparable to [8], although we acknowledge that the comparison is not completely fair, since [13] does not use any tracking or background subtraction.

We visualize the learned parts in Fig. 4(a). Each patch is represented by a color that corresponds to the most likely part label of that patch. We also visualize the root filters applied on these images in Fig. 4(b).

KTH dataset: The KTH human motion dataset contains six types of human actions (walking, jogging, running, boxing, hand waving and hand clapping) performed several times by 25 subjects in four different scenarios: outdoors, outdoors with scale variation, outdoors with different clothes, and indoors. We first run an automatic preprocessing step to track and stabilize the video sequences, so that all the figures appear in the center of the field of view.

We split the videos roughly equally into training/test sets and randomly sample 10 frames from each video.
The confusion matrices (with |H| = 10) for both per-frame and per-video classification are shown in Fig. 5.

method                    per-frame (%)   per-video (%)   per-cube (%)
Our method                90.3            97.2            N/A
Jhuang et al. [8]         N/A             98.8            N/A
Niebles & Fei-Fei [13]    55              72.8            N/A
Blank et al. [2]          N/A             N/A             99.64

Table 2: Comparison of classification accuracy with previous work on the Weizmann dataset.

Figure 4: (a) Visualization of the learned parts. Patches are colored according to their most likely part labels; each color corresponds to a part label. Some interesting observations can be made: for example, the part label represented by red seems to correspond to the "moving down" patterns mostly observed in the "bending" action, while the part label represented by green seems to correspond to the motion patterns distinctive of "hand-waving" actions. (b) Visualization of the root filters applied on these images. For each image with class label c, we apply the root filter η_c. The results show the filter responses aggregated over the four motion descriptor channels.
Bright areas correspond to positive energies, i.e., areas that are discriminative for this class.

Figure 5: Confusion matrices of classification results on the KTH dataset (left: frame-by-frame classification; right: video classification). Horizontal rows are ground truths, and vertical columns are predictions.

The comparison with the two baseline algorithms is summarized in Table 3. Again, our approach outperforms the two baseline systems.

The comparison with other approaches is summarized in Table 4. We emphasize that we do not attempt a direct comparison, since the methods listed in Table 4 differ in many aspects of their experimental setups (e.g., the split of training/test data, whether temporal smoothing is used, whether per-frame classification can be performed, whether tracking/background subtraction is used, whether the whole dataset is used, etc.), which makes it impossible to compare them directly.
We provide the results only to show that our approach is comparable to the state of the art.

method       root model          local hCRF                       our approach
                         |H| = 6   |H| = 10   |H| = 20   |H| = 6   |H| = 10   |H| = 20
per-frame    0.5377      0.4749    0.4452     0.4282     0.6633    0.6698     0.6444
per-video    0.7339      0.5607    0.5814     0.5504     0.7855    0.8760     0.7512

Table 3: Comparison of two baseline systems with our approach on the KTH dataset.

methods                accuracy (%)
Our method             87.60
Jhuang et al. [8]      91.70
Nowozin et al. [15]    87.04
Niebles et al. [14]    81.50
Dollár et al. [4]      81.17
Schuldt et al. [17]    71.72
Ke et al. [9]          62.96

Table 4: Comparison of per-video classification accuracy with previous approaches on the KTH dataset.

5 Conclusion

We have presented a discriminatively learned part model for human action recognition. Unlike previous work [10], our model does not require manual specification of the parts; instead, the parts are initialized by a learned root filter. Our model combines both the large-scale features used in global templates and the local patch features used in bag-of-words models. Our experimental results show that our model is quite effective in recognizing actions, with results comparable to state-of-the-art approaches. In particular, we show that the combination of large-scale features and local patch features performs significantly better than using either of them alone.

References

[1] A. C. Berg, T. L. Berg, and J. Malik. Shape matching and object recognition using low distortion correspondence. In IEEE CVPR, 2005.
[2] M. Blank, L. Gorelick, E. Shechtman, M. Irani, and R. Basri. Actions as space-time shapes. In IEEE ICCV, 2005.
[3] N. Dalal and B. Triggs. Histograms of oriented gradients for human detection. In IEEE CVPR, 2005.
[4] P. Dollár, V. Rabaud, G. Cottrell, and S. Belongie. Behavior recognition via sparse spatio-temporal features.
In VS-PETS Workshop, 2005.
[5] A. A. Efros, A. C. Berg, G. Mori, and J. Malik. Recognizing action at a distance. In IEEE ICCV, 2003.
[6] P. Felzenszwalb, D. McAllester, and D. Ramanan. A discriminatively trained, multiscale, deformable part model. In IEEE CVPR, 2008.
[7] P. F. Felzenszwalb and D. P. Huttenlocher. Pictorial structures for object recognition. IJCV, 61(1):55-79, January 2005.
[8] H. Jhuang, T. Serre, L. Wolf, and T. Poggio. A biologically inspired system for action recognition. In IEEE ICCV, 2007.
[9] Y. Ke, R. Sukthankar, and M. Hebert. Efficient visual event detection using volumetric features. In IEEE ICCV, 2005.
[10] Y. Ke, R. Sukthankar, and M. Hebert. Event detection in crowded videos. In IEEE ICCV, 2007.
[11] J. Lafferty, A. McCallum, and F. Pereira. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In ICML, 2001.
[12] B. D. Lucas and T. Kanade. An iterative image registration technique with an application to stereo vision. In Proc. DARPA Image Understanding Workshop, 1981.
[13] J. C. Niebles and L. Fei-Fei. A hierarchical model of shape and appearance for human action classification. In IEEE CVPR, 2007.
[14] J. C. Niebles, H. Wang, and L. Fei-Fei. Unsupervised learning of human action categories using spatial-temporal words. In BMVC, 2006.
[15] S. Nowozin, G. Bakir, and K. Tsuda. Discriminative subsequence mining for action classification. In IEEE ICCV, 2007.
[16] A. Quattoni, M. Collins, and T. Darrell. Conditional random fields for object recognition. In NIPS 17, 2005.
[17] C. Schuldt, I. Laptev, and B. Caputo. Recognizing human actions: a local SVM approach. In IEEE ICPR, 2004.
[18] J. Sivic, B. C. Russell, A. A. Efros, A. Zisserman, and W. T. Freeman. Discovering objects and their location in images. In IEEE ICCV, 2005.
[19] S. B. Wang, A. Quattoni, L.-P. Morency, D. Demirdjian, and T. Darrell.
Hidden conditional random \ufb01elds\n\nfor gesture recognition. In IEEE CVPR, 2006.\n\n\f", "award": [], "sourceid": 309, "authors": [{"given_name": "Yang", "family_name": "Wang", "institution": null}, {"given_name": "Greg", "family_name": "Mori", "institution": null}]}