{"title": "Beyond Actions: Discriminative Models for Contextual Group Activities", "book": "Advances in Neural Information Processing Systems", "page_first": 1216, "page_last": 1224, "abstract": "We propose a discriminative model for recognizing group activities. Our model jointly captures the group activity, the individual person actions, and the interactions among them. Two new types of contextual information, group-person interaction and person-person interaction, are explored in a latent variable framework. Different from most of the previous latent structured models which assume a predefined structure for the hidden layer, e.g. a tree structure, we treat the structure of the hidden layer as a latent variable and implicitly infer it during learning and inference. Our experimental results demonstrate that by inferring this contextual information together with adaptive structures, the proposed model can significantly improve activity recognition performance.", "full_text": "Beyond Actions: Discriminative Models for\n\nContextual Group Activities\n\nTian Lan\n\nSchool of Computing Science\n\nSimon Fraser University\n\ntla58@sfu.ca\n\nYang Wang\n\nDepartment of Computer Science\n\nUniversity of Illinois at Urbana-Champaign\n\nyangwang@uiuc.edu\n\nWeilong Yang\n\nSchool of Computing Science\n\nSimon Fraser University\n\nwya16@sfu.ca\n\nGreg Mori\n\nSchool of Computing Science\n\nSimon Fraser University\n\nmori@cs.sfu.ca\n\nAbstract\n\nWe propose a discriminative model for recognizing group activities. Our model\njointly captures the group activity, the individual person actions, and the interac-\ntions among them. Two new types of contextual information, group-person inter-\naction and person-person interaction, are explored in a latent variable framework.\nDifferent from most of the previous latent structured models which assume a pre-\nde\ufb01ned structure for the hidden layer, e.g. 
a tree structure, we treat the structure of the hidden layer as a latent variable and implicitly infer it during learning and inference. Our experimental results demonstrate that by inferring this contextual information together with adaptive structures, the proposed model can significantly improve activity recognition performance.\n\n1 Introduction\n\nLook at the two persons in Fig. 1(a): can you tell that they are performing two different actions? Once the entire contexts of these two images are revealed (Fig. 1(b)) and we observe the interaction of each person with the other persons in the group, it is immediately clear that the first person is queuing, while the second person is talking. In this paper, we argue that the actions of individual humans often cannot be inferred in isolation. We instead focus on developing methods for recognizing group activities by modeling the collective behaviors of the individuals in the group.\nBefore we proceed, we first clarify some terminology used throughout the rest of the paper. We use action to denote a simple, atomic movement performed by a single person. We use activity to refer to a more complex scenario that involves a group of people. Consider the examples in Fig. 1(b): each frame depicts a group activity (queuing or talking), while each person in a frame performs a lower-level action (talking and facing right, talking and facing left, etc.).\nOur proposed approach is based on exploiting two types of contextual information in group activities. First, the activity of a group and the collective actions of all the individuals serve as context for each other (we call this the group-person interaction), hence they should be modeled jointly in a unified framework. As shown in Fig. 1, knowing the group activity (queuing or talking) helps disambiguate individual human actions which are otherwise hard to recognize. 
Similarly, knowing that most of the persons in the scene are talking (whether facing right or left) allows us to infer the overall group activity (i.e. talking). Second, the action of an individual can also benefit from knowing the actions of the other surrounding persons (which we call the person-person interaction). For example, consider Fig. 1(c). The fact that the first two persons are facing the same direction provides a strong cue that\n\nFigure 1: Role of context in group activities. It is often hard to distinguish actions from each individual person alone (a). However, if we look at the whole scene (b), we can easily recognize the activity of the group and the action of each individual. In this paper, we operationalize this intuition and introduce a model for recognizing group activities by jointly considering the group activity, the action of each individual, and the interaction among certain pairs of individual actions (c).\n\nboth of them are queuing. Similarly, the fact that the last two persons are facing each other indicates that they are more likely to be talking.\nRelated work: Using context to aid visual recognition has received much attention recently. Most of the work on context is in scene and object recognition. For example, work has been done on exploiting contextual information between scenes and objects [13], objects and objects [5, 16], and objects and so-called “stuff” (amorphous spatial extent, e.g. trees, sky) [11].\nMost of the previous work in human action recognition focuses on recognizing actions performed by a single person in a video (e.g. [2, 17]). In this setting, there has been work on exploiting the context provided by scenes [12] or objects [10] to help action recognition. In still-image action recognition, object-action context [6, 9, 23, 24] is a popular type of context used for human-object interaction. The work in [3] is the closest to ours. 
In that work, person-person context is exploited by a new feature descriptor extracted from a person and its surrounding area.\nOur model is directly inspired by some recent work on learning discriminative models that allow the use of latent variables [1, 6, 15, 19, 25], particularly when the latent variables have complex structures. These models have been successfully applied in many applications in computer vision, e.g. object detection [8, 18], action recognition [14, 19], human-object interaction [6], objects and attributes [21], human poses and actions [22], image region and tag correspondence [20], etc. So far, only applications where the structures of the latent variables are fixed have been considered, e.g. a tree structure in [8, 19]. In our application, however, the structures of the latent variables are not fixed and have to be inferred automatically.\nOur contributions: In this paper, we develop a discriminative model for recognizing group activities. We highlight the main contributions of our model. (1) Group activity: most of the work in human activity understanding focuses on single-person action recognition. Instead, we present a model for group activities that dynamically decides on the interactions among group members. (2) Group-person and person-person interaction: although contextual information has been exploited for visual recognition problems, ours introduces two new types of contextual information that have not been explored before. (3) Adaptive structures: the person-person interaction poses a challenging problem for both learning and inference. If we naively consider the interaction between every pair of persons, the model might try to force two persons to take certain pairs of labels even though these two persons have nothing to do with each other. In addition, selecting a subset of connections allows one to remove “clutter” in the form of people performing irrelevant actions. 
Ideally, we would like to consider only those person-person interactions that are strong. To this end, we propose to use adaptive structures that automatically decide whether the interaction of two persons should be considered. Our experimental results show that our adaptive structures significantly outperform the alternatives.\n\n2 Contextual Representation of Group Activities\n\nOur goal is to learn a model that jointly captures the group activity, the individual person actions, and the interactions among them. We introduce two new types of contextual information: group-person interaction and person-person interaction. Group-person interaction represents the co-occurrence between the activity of a group and the actions of all the individuals. Person-person interaction indicates that the action of an individual can benefit from knowing the actions of other people in the same scene. We present a graphical model representing all the information in a unified framework. One important difference between our model and previous work is that in addition to learning the parameters of the graphical model, we also automatically infer the graph structure (see Sec. 3).\n\nFigure 2: Graphical illustration of the model in (a). The edges represented by dashed lines indicate that the connections are latent. The different types of potentials are denoted by lines with different colors in the example shown in (b).\n\nWe assume an image has been pre-processed (i.e. by running a person detector) so that the persons in the image have been found. On the training data, each image is associated with a group activity label, and each person in the image is associated with an action label.\n\n2.1 Model Formulation\n\nA graphical representation of the model is shown in Fig. 2. We now describe how we model an image I. Let I1, I2, . . . , Im be the set of persons found in the image I. We extract features x from the image I in the form of x = (x0, x1, . . . , xm), where x0 is the aggregation of the feature descriptors of all the persons in the image (we call it the root feature vector), and xi (i = 1, 2, . . . , m) is the feature vector extracted from the person Ii. We denote the collective actions of all the persons in the image as h = (h1, h2, . . . , hm), where hi ∈ H is the action label of the person Ii and H is the set of all possible action labels. The image I is associated with a group activity label y ∈ Y, where Y is the set of all possible activity labels.\nWe assume there are connections between some pairs of action labels (hj, hk). Intuitively speaking, this allows the model to capture important correlations between action labels. We use an undirected graph G = (V, E) to represent (h1, h2, . . . , hm), where a vertex vi ∈ V corresponds to the action label hi, and an edge (vj, vk) ∈ E corresponds to the interaction between hj and hk.\nWe use fw(x, h, y; G) to denote the compatibility of the image feature x, the collective action labels h, the group activity label y, and the graph G = (V, E). We assume fw(x, h, y; G) is parameterized by w and is defined as follows:\n\nfw(x, h, y; G) = w⊤Ψ(y, h, x; G) (1a)\n= w0⊤φ0(y, x0) + Σ_{j∈V} w1⊤φ1(xj, hj) + Σ_{j∈V} w2⊤φ2(y, hj) + Σ_{(j,k)∈E} w3⊤φ3(y, hj, hk) (1b)\n\nThe model parameters w are simply the combination of four parts, w = {w0, w1, w2, w3}. The details of the potential functions in Eq. 1 are described in the following:\nImage-Action Potential w1⊤φ1(xj, hj): This potential function models the compatibility between the j-th person’s action label hj and its image feature xj. It is parameterized as:\n\nw1⊤φ1(xj, hj) = Σ_{b∈H} w1b⊤ · 1(hj = b) · xj (2)\n\nwhere xj is the feature vector extracted from the j-th person and we use 1(·) to denote the indicator function. The parameter w1 is simply the concatenation of w1b for all b ∈ H.\nAction-Activity Potential w2⊤φ2(y, hj): This potential function models the compatibility between the group activity label y and the j-th person’s action label hj. It is parameterized as:\n\nw2⊤φ2(y, hj) = Σ_{a∈Y} Σ_{b∈H} w2ab · 1(y = a) · 1(hj = b) (3)\n\nAction-Action Potential w3⊤φ3(y, hj, hk): This potential function models the compatibility between a pair of individuals’ action labels (hj, hk) under the group activity label y, where (j, k) ∈ E corresponds to an edge in the graph. It is parameterized as:\n\nw3⊤φ3(y, hj, hk) = Σ_{a∈Y} Σ_{b∈H} Σ_{c∈H} w3abc · 1(y = a) · 1(hj = b) · 1(hk = c) (4)\n\nImage-Activity Potential w0⊤φ0(y, x0): This potential function is a root model which measures the compatibility between the activity label y and the root feature vector x0 of the whole image. It is parameterized as:\n\nw0⊤φ0(y, x0) = Σ_{a∈Y} w0a⊤ · 1(y = a) · x0 (5)\n\nThe parameter w0a can be interpreted as a root filter that measures the compatibility of the class label a and the root feature vector x0.\n\n3 Learning and Inference\n\nWe now describe how to infer the label given the model parameters (Sec. 3.1), and how to learn the model parameters from a set of training data (Sec. 3.2). 
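To make the four potentials of Eq. 1 concrete, here is a minimal scoring sketch (our illustration, not the authors' code; the dense parameter tables, array shapes, and names are assumptions):

```python
import numpy as np

def model_score(w0, w1, w2, w3, x0, x, h, y, edges):
    # f_w(x, h, y; G): sum of the four potentials in Eq. 1.
    # w0: (|Y|, D) root filters, w1: (|H|, D) per-action filters,
    # w2: (|Y|, |H|) action-activity weights, w3: (|Y|, |H|, |H|) pairwise weights.
    # x0: (D,) root feature, x: (m, D) per-person features,
    # h: length-m action labels, y: activity label, edges: list of (j, k) pairs.
    score = float(w0[y] @ x0)              # image-activity potential
    for j, hj in enumerate(h):
        score += float(w1[hj] @ x[j])      # image-action potential
        score += float(w2[y, hj])          # action-activity potential
    for j, k in edges:
        score += float(w3[y, h[j], h[k]])  # action-action potential
    return score
```

Because the potentials are indicator-based, selecting row hj of w1 is exactly the sum over b ∈ H of 1(hj = b) · w1b⊤xj.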
If the graph structure G is known and fixed, we can apply standard learning and inference techniques for latent SVMs. For our application, a good graph structure turns out to be crucial, since it determines which person interacts with (i.e. provides action context for) which other person. The interaction of individuals turns out to be important for group activity recognition, and fixing the interaction (i.e. the graph structure) using heuristics does not work well. We will demonstrate this experimentally in Sec. 4. We instead develop our own inference and learning algorithms that automatically infer the best graph structure from a restricted set of candidate structures.\n\n3.1 Inference\n\nGiven the model parameters w, the inference problem is to find the best group activity label y∗ for a new image x. Inspired by the latent SVM [8], we define the following function to score an image x and a group activity label y:\n\nFw(x, y) = max_{Gy} max_{hy} fw(x, hy, y; Gy) = max_{Gy} max_{hy} w⊤Ψ(x, hy, y; Gy) (6)\n\nWe use the subscript y in the notations hy and Gy to emphasize that we are fixing a particular activity label y. The group activity label of the image x can be inferred as y∗ = arg max_y Fw(x, y). Since we can enumerate all the possible y ∈ Y and predict the activity label y∗ of x, the main difficulty in solving the inference problem is the maximization over Gy and hy in Eq. 6. Note that in Eq. 6, we explicitly maximize over the graph G. This is very different from previous work, which typically assumes the graph structure is fixed.\nThe optimization problem in Eq. 6 is in general NP-hard since it involves a combinatorial search. We instead use a coordinate-ascent style algorithm to approximately solve Eq. 6 by iterating the following two steps:\n1. Holding the graph structure Gy fixed, optimize the action labels hy for the ⟨x, y⟩ pair:\n\nhy = arg max_{h′} w⊤Ψ(x, h′, y; Gy) (7)\n\n2. Holding hy fixed, optimize the graph structure Gy for the ⟨x, y⟩ pair:\n\nGy = arg max_{G′} w⊤Ψ(x, hy, y; G′) (8)\n\nThe problem in Eq. 7 is a standard max-inference problem in an undirected graphical model. Here we use loopy belief propagation to approximately solve it. The problem in Eq. 8 is still NP-hard since it involves enumerating all the possible graph structures. Even if we could enumerate all the graph structures, we might want to restrict ourselves to a subset of graph structures that lead to efficient inference (e.g. when using loopy BP in Eq. 7). One obvious choice is to restrict G′ to be a tree-structured graph, since belief propagation is exact and tractable on tree-structured models. However, as we will demonstrate in Sec. 4, a tree-structured graph built from a simple heuristic (e.g. a minimum spanning tree) does not work that well. Another choice is to use graph structures that are “sparse”, since sparse graphs tend to have fewer cycles, and loopy BP tends to be efficient on graphs with fewer cycles. In this paper, we enforce graph sparsity by setting a threshold d on the maximum degree of any vertex in the graph. When hy is fixed, we can formulate an integer linear program (ILP) to find the optimal graph structure (Eq. 8) under the additional constraint that the maximum vertex degree is at most d. Let zjk = 1 indicate that the edge (j, k) is included in the graph, and 0 otherwise. 
The ILP can be written as:\n\nmax_z Σ_{j∈V} Σ_{k∈V} zjk ψjk, s.t. Σ_{k∈V} zjk ≤ d ∀j, Σ_{j∈V} zjk ≤ d ∀k, zjk = zkj, zjk ∈ {0, 1}, ∀j, k (9)\n\nwhere we use ψjk to collectively represent the sum of all the pairwise potential functions in Eq. 1 for the pair of vertices (j, k). Of course, the optimization problem in Eq. 9 is still hard due to the integrality constraint zjk ∈ {0, 1}. But we can relax the value of zjk to a real value in the range [0, 1]. The solution of the LP relaxation may contain fractional values. To get integral solutions, we simply round them to the closest integers.\n\n3.2 Learning\n\nGiven a set of N training examples ⟨xn, hn, yn⟩ (n = 1, 2, . . . , N), we would like to train the model parameters w to produce the correct group activity y for a new test image x. Note that the action labels h are observed on the training data, but the graph structure G (or equivalently the variables z) is unobserved and will be automatically inferred. A natural way of learning the model is to adopt the latent SVM formulation [8, 25] as follows:\n\nmin_{w, ξ≥0} (1/2)||w||² + C Σ_{n=1}^{N} ξn (10a)\ns.t. max_{Gyn} fw(xn, hn, yn; Gyn) − max_{Gy} max_{hy} fw(xn, hy, y; Gy) ≥ Δ(y, yn) − ξn, ∀n, ∀y (10b)\n\nwhere Δ(y, yn) is a loss function measuring the cost incurred by predicting y when the ground-truth label is yn. In standard multi-class classification problems, we typically use the 0-1 loss Δ0/1 defined as:\n\nΔ0/1(y, yn) = 1 if y ≠ yn, and 0 otherwise (11)\n\nThe constrained optimization problem in Eq. 10 can be equivalently written as an unconstrained problem:\n\nmin_w (1/2)||w||² + C Σ_{n=1}^{N} (Ln − Rn) (12a)\nwhere Ln = max_y max_{Gy} max_{hy} (Δ(y, yn) + fw(xn, hy, y; Gy)), Rn = max_{Gyn} fw(xn, hn, yn; Gyn) (12b)\n\nWe use the non-convex bundle optimization of [7] to solve Eq. 12. In a nutshell, the algorithm iteratively builds an increasingly accurate piecewise quadratic approximation to the objective function. During each iteration, a new linear cutting plane is found via a subgradient of the objective function and added to the piecewise quadratic approximation. The key issue is then to compute the two subgradients ∂wLn and ∂wRn for a particular w, which we describe in detail below.\nFirst we describe how to compute ∂wLn. Let (y∗, h∗, G∗) be the solution to the following optimization problem:\n\nmax_y max_h max_G Δ(y, yn) + fw(xn, h, y; G) (13)\n\nFigure 3: Different structures of person-person interaction. Each node here represents a person in a frame. Solid lines represent connections that can be obtained from heuristics. Dashed lines represent latent connections that will be inferred by our algorithm. (a) No connection between any pair of nodes; (b) nodes are connected by a minimum spanning tree; (c) any two nodes within a Euclidean distance ε are connected (which we call the ε-neighborhood graph); (d) connections are obtained by adaptive structures. Note that (d) is the structure of person-person interaction used in the proposed model.\n\nThen it is easy to show that the subgradient ∂wLn can be calculated as ∂wLn = Ψ(xn, y∗, h∗; G∗). The inference problem in Eq. 13 is similar to the inference problem in Eq. 6, except for the additional term Δ(y, yn). 
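The paper solves the structure step (Eqs. 8-9) with an LP relaxation followed by rounding. Purely to illustrate the degree-d constraint, a greedy stand-in (our simplification, not the authors' method) can be sketched as:

```python
def select_edges(psi, d):
    # Greedy stand-in for the degree-constrained selection in Eq. 9:
    # add edges (j, k) in decreasing order of the pairwise score psi[j][k],
    # never letting a vertex exceed degree d. Edges with non-positive score
    # are skipped, since setting z_jk = 0 is then at least as good.
    # psi is a symmetric m x m table of edge scores.
    m = len(psi)
    degree = [0] * m
    chosen = []
    candidates = sorted(
        ((psi[j][k], j, k) for j in range(m) for k in range(j + 1, m)),
        reverse=True,
    )
    for score, j, k in candidates:
        if score <= 0:
            break  # only harmful edges remain
        if degree[j] < d and degree[k] < d:
            chosen.append((j, k))
            degree[j] += 1
            degree[k] += 1
    return chosen
```

For example, with psi = [[0, 5, 1], [5, 0, 3], [1, 3, 0]] and d = 1, only the strongest edge (0, 1) survives; the LP relaxation in the paper plays the same edge-pruning role but optimizes the objective jointly.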
Since the number of possible choices of y is small (e.g. |Y| = 5 in our case), we can enumerate all possible y ∈ Y and solve the inference problem in Eq. 6 for each fixed y.\nNow we describe how to compute ∂wRn. Let Ĝ be the solution to the following optimization problem:\n\nmax_{G′} fw(xn, hn, yn; G′) (14)\n\nThen we can show that the subgradient ∂wRn can be calculated as ∂wRn = Ψ(xn, yn, hn; Ĝ). The problem in Eq. 14 can be approximately solved using the LP relaxation of Eq. 9. Using the two subgradients ∂wLn and ∂wRn, we can optimize Eq. 10 using the algorithm in [7].\n\n4 Experiments\n\nWe demonstrate our model on the collective activity dataset introduced in [3]. This dataset contains 44 video clips acquired with low-resolution hand-held cameras. In the original dataset, every person in every tenth frame of the videos is assigned one of the following five categories: crossing, waiting, queuing, walking and talking, and one of the following eight pose categories: right, front-right, front, front-left, left, back-left, back and back-right. Based on the original dataset, we define five activity categories: crossing, waiting, queuing, walking and talking. We define forty action labels by combining the pose and activity information, i.e. the action labels include crossing and facing right, crossing and facing front-right, etc. We assign each frame to one of the five activity categories by taking the majority of the actions of the persons (ignoring their pose categories) in that frame. We select one fourth of the video clips from each activity category to form the test set, and the rest of the video clips are used for training.\nRather than directly using certain raw features (e.g. 
the HOG descriptor [4]) as the feature vector xi in our framework, we train a 40-class SVM classifier based on the HOG descriptor of each individual and the associated action labels. In the end, each feature vector xi is represented as a 40-dimensional vector, where the k-th entry of this vector is the score of classifying this instance to the k-th class, as returned by the SVM classifier. The root feature vector x0 of an image is also represented as a 40-dimensional vector, which is obtained by averaging all the feature vectors xi (i = 1, 2, . . . , m) in the same image.\nResults and Analysis: In order to comprehensively evaluate the performance of the proposed model, we compare it with several baseline methods. The first baseline (which we call global bag-of-words) is an SVM model with a linear kernel based on the global feature vector x0 with a bag-of-words style representation. The other baselines are within our proposed framework, with various ways of setting the structure of the person-person interaction. The structures we have considered are illustrated in Fig. 3(a)-(c), including (a) no pairwise connection; (b) minimum spanning tree; (c) the graph obtained by connecting any two vertices within a Euclidean distance ε (ε-neighborhood graph) with ε = 100, 200, 300. Note that in our proposed model the person-person interactions are latent (shown in Fig. 3(d)) and learned automatically. The performance of the different structures of person-person interaction is evaluated and compared.\n\nFigure 4: Confusion matrices for activity classification: (a) global bag-of-words, (b) our approach. Rows are ground-truths, and columns are predictions. Each row is normalized to sum to 1.\n\nTable 1: Comparison of the activity classification accuracies of different methods (Overall / Mean per-class):\nglobal bag-of-words: 70.9 / 68.6\nno connection: 75.9 / 73.7\nminimum spanning tree: 73.6 / 70.0\nε-neighborhood graph, ε = 100: 74.3 / 72.9\nε-neighborhood graph, ε = 200: 70.4 / 66.2\nε-neighborhood graph, ε = 300: 62.2 / 62.5\nOur Approach: 79.1 / 77.5\nWe report both the overall and mean per-class accuracies due to the class imbalance. The first result (global bag-of-words) is tested in the multi-class SVM framework, while the other results are in the framework of our proposed model but with different structures of person-person interaction. The structures are visualized in Fig. 3.\n\nWe summarize the comparison in Table 1. Since the test set is imbalanced, e.g. the number of crossing examples is more than twice that of the queuing or talking examples, we report both overall and mean per-class accuracies. As we can see, for both overall and mean per-class accuracies, our method achieves the best performance. The proposed model significantly outperforms global bag-of-words. The confusion matrices of our method and the baseline global bag-of-words are shown in Fig. 4. There are several important conclusions we can draw from these experimental results:\nImportance of group-person interaction: The best baseline result comes from the model with no connection between any pair of nodes, which clearly outperforms global bag-of-words. This demonstrates the effectiveness of modeling the group-person interaction, i.e. the connections between y and h in our model.\nImportance of adaptive structures of person-person interaction: In Table 1, the pre-defined structures such as the minimum spanning tree and the ε-neighborhood graph do not perform as well as the one without person-person interaction. 
We believe this is because those pre-defined structures are all based on heuristics and are not properly integrated with the learning algorithm. As a result, they can create interactions that do not help (and sometimes even hurt) the performance. However, if we consider the graph structure as part of our model and directly infer it using our learning algorithm, we can make sure that the obtained structures are those useful for differentiating the various activities. Evidence for this is provided by the large performance gain of our approach.\nWe visualize the classification results and the learned structure of person-person interaction of our model in Fig. 6.\n\n5 Conclusion\n\nWe have presented a discriminative model for group activity recognition which jointly captures the group activity, the individual person actions, and the interactions among them. We have exploited two new types of contextual information: group-person interaction and person-person interaction. We have also introduced an adaptive structures algorithm that automatically infers the optimal structure of person-person interaction in a latent SVM framework. Our experimental results demonstrate that our proposed model outperforms the baseline methods.\n\nFigure 5: Visualization of the weights across pairs of action classes for each of the five activity classes. Light cells indicate large weight values. Consider example (a): under the activity label crossing, the model favors seeing actions of crossing with different poses together (indicated by the area bounded by the red box). We can also take a closer look at the weights within the actions of crossing, as shown in (f). We can see that within the crossing category, the model favors seeing the same pose together, indicated by the light regions along the diagonal. It also favors some opposite poses, e.g. back-right with front-left. 
These make sense since people always cross the street in either the same or opposite directions.\n\nCrossing | Waiting | Queuing | Walking | Talking\n\nFigure 6: (Best viewed in color) Visualization of the classification results and the learned structure of person-person interaction. The top row shows correct classification examples and the bottom row shows incorrect examples. The labels C, S, Q, W, T indicate crossing, waiting, queuing, walking and talking respectively. The labels R, FR, F, FL, L, BL, B, BR indicate right, front-right, front, front-left, left, back-left, back and back-right respectively. The yellow lines represent the learned structure of person-person interaction, from which some important interactions for each activity can be read off, e.g. a chain structure connecting persons facing the same direction is “important” for the queuing activity.\n\nReferences\n\n[1] S. Andrews, I. Tsochantaridis, and T. Hofmann. Support vector machines for multiple-instance learning. In Advances in Neural Information Processing Systems, 2003.\n[2] M. Blank, L. Gorelick, E. Shechtman, M. Irani, and R. Basri. Actions as space-time shapes. In IEEE International Conference on Computer Vision, 2005.\n[3] W. Choi, K. Shahid, and S. Savarese. What are they doing?: Collective activity classification using spatio-temporal relationship among people. In 9th International Workshop on Visual Surveillance, 2009.\n[4] N. Dalal and B. Triggs. Histograms of oriented gradients for human detection. In Proc. IEEE Comput. Soc. Conf. Comput. Vision and Pattern Recogn., 2005.\n[5] C. Desai, D. Ramanan, and C. Fowlkes. Discriminative models for multi-class object layout. In IEEE International Conference on Computer Vision, 2009.\n[6] C. Desai, D. Ramanan, and C. Fowlkes. Discriminative models for static human-object interactions. In Workshop on Structured Models in Computer Vision, 2010.\n[7] T.-M.-T. Do and T. 
Artieres. Large margin training for hidden Markov models with partially observed states. In International Conference on Machine Learning, 2009.\n[8] P. Felzenszwalb, D. McAllester, and D. Ramanan. A discriminatively trained, multiscale, deformable part model. In IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2008.\n[9] A. Gupta, A. Kembhavi, and L. S. Davis. Observing human-object interactions: Using spatial and functional compatibility for recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 31(10):1775–1789, 2009.\n[10] D. Han, L. Bo, and C. Sminchisescu. Selection and context for action recognition. In IEEE International Conference on Computer Vision, 2009.\n[11] G. Heitz and D. Koller. Learning spatial context: Using stuff to find things. In European Conference on Computer Vision, 2008.\n[12] M. Marszalek, I. Laptev, and C. Schmid. Actions in context. In IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2009.\n[13] K. P. Murphy, A. Torralba, and W. T. Freeman. Using the forest to see the trees: A graphical model relating features, objects, and scenes. In Advances in Neural Information Processing Systems, volume 16. MIT Press, 2004.\n[14] J. C. Niebles, C.-W. Chen, and L. Fei-Fei. Modeling temporal structure of decomposable motion segments for activity classification. In European Conference on Computer Vision, 2010.\n[15] A. Quattoni, S. Wang, L.-P. Morency, M. Collins, and T. Darrell. Hidden conditional random fields. IEEE Transactions on Pattern Analysis and Machine Intelligence, 29(10):1848–1852, June 2007.\n[16] A. Rabinovich, A. Vedaldi, C. Galleguillos, E. Wiewiora, and S. Belongie. Objects in context. In IEEE International Conference on Computer Vision, 2007.\n[17] C. Schuldt, I. Laptev, and B. Caputo. Recognizing human actions: A local SVM approach. In 17th International Conference on Pattern Recognition, 2004.\n[18] A. Vedaldi and A. Zisserman. Structured output regression for detection with partial truncation. In Advances in Neural Information Processing Systems. MIT Press, 2009.\n[19] Y. Wang and G. Mori. Max-margin hidden conditional random fields for human action recognition. In Proc. IEEE Comput. Soc. Conf. Comput. Vision and Pattern Recogn., 2009.\n[20] Y. Wang and G. Mori. A discriminative latent model of image region and object tag correspondence. In Advances in Neural Information Processing Systems (NIPS), 2010.\n[21] Y. Wang and G. Mori. A discriminative latent model of object classes and attributes. In European Conference on Computer Vision, 2010.\n[22] W. Yang, Y. Wang, and G. Mori. Recognizing human actions from still images with latent poses. In CVPR, 2010.\n[23] B. Yao and L. Fei-Fei. Grouplet: A structured image representation for recognizing human and object interactions. In The Twenty-Third IEEE Conference on Computer Vision and Pattern Recognition, San Francisco, CA, June 2010.\n[24] B. Yao and L. Fei-Fei. Modeling mutual context of object and human pose in human-object interaction activities. In The Twenty-Third IEEE Conference on Computer Vision and Pattern Recognition, San Francisco, CA, June 2010.\n[25] C.-N. Yu and T. Joachims. Learning structural SVMs with latent variables. In International Conference on Machine Learning, 2009.\n", "award": [], "sourceid": 115, "authors": [{"given_name": "Tian", "family_name": "Lan", "institution": null}, {"given_name": "Yang", "family_name": "Wang", "institution": null}, {"given_name": "Weilong", "family_name": "Yang", "institution": null}, {"given_name": "Greg", "family_name": "Mori", "institution": null}]}