{"title": "Submodular Attribute Selection for Action Recognition in Video", "book": "Advances in Neural Information Processing Systems", "page_first": 1341, "page_last": 1349, "abstract": "In real-world action recognition problems, low-level features cannot adequately characterize the rich spatial-temporal structures in action videos. In this work, we encode actions based on attributes that describe actions as high-level concepts: \\textit{e.g.}, jump forward and motion in the air. We base our analysis on two types of action attributes. One type of action attributes is generated by humans. The second type is data-driven attributes, which are learned from data using dictionary learning methods. Attribute-based representation may exhibit high variance due to noisy and redundant attributes. We propose a discriminative and compact attribute-based representation by selecting a subset of discriminative attributes from a large attribute set. Three attribute selection criteria are proposed and formulated as a submodular optimization problem. A greedy optimization algorithm is presented and guaranteed to be at least a (1-1/e)-approximation to the optimum. Experimental results on the Olympic Sports and UCF101 datasets demonstrate that the proposed attribute-based representation can significantly boost the performance of action recognition algorithms and outperform most recently proposed recognition approaches.", "full_text": "Submodular Attribute Selection for Action Recognition in Video\n\nJinging Zheng\nUMIACS, University of Maryland\nCollege Park, MD, USA\nzjngjng@umiacs.umd.edu\n\nZhuolin Jiang\nNoah\u2019s Ark Lab\nHuawei Technologies\nzhuolin.jiang@huawei.com\n\nRama Chellappa\nUMIACS, University of Maryland\nCollege Park, MD, USA\nrama@umiacs.umd.edu\n\nP. Jonathon Phillips\nNational Institute of Standards and Technology\nGaithersburg, MD, USA\njonathon.phillips@nist.gov\n\nAbstract\n\nIn real-world action recognition problems, low-level features cannot adequately characterize the rich spatial-temporal structures in action videos. In this work, we encode actions based on attributes that describe actions as high-level concepts, e.g., jump forward or motion in the air. We base our analysis on two types of action attributes. One type of action attributes is generated by humans. The second type is data-driven attributes, which are learned from data using dictionary learning methods. Attribute-based representation may exhibit high variance due to noisy and redundant attributes. We propose a discriminative and compact attribute-based representation by selecting a subset of discriminative attributes from a large attribute set. Three attribute selection criteria are proposed and formulated as a submodular optimization problem. A greedy optimization algorithm is presented and guaranteed to be at least a (1-1/e)-approximation to the optimum. Experimental results on the Olympic Sports and UCF101 datasets demonstrate that the proposed attribute-based representation can significantly boost the performance of action recognition algorithms and outperform most recently proposed recognition approaches.\n\n1 Introduction\nAction recognition in real-world videos has many potential applications in multimedia retrieval, video surveillance and human computer interaction. In order to accurately recognize human actions from videos, most existing approaches developed various discriminative low-level features, including spatio-temporal interest point (STIP) based features [8, 15], shape and optical flow-based features [19, 5], and trajectory-based representations [28, 33]. 
Because of large variations in viewpoints, complicated backgrounds, and people performing the actions differently, videos of an action vary greatly. A result of this variability is that conventional low-level features are not able to characterize the rich spatio-temporal structures in real-world action videos. Inspired by recent progress on object recognition [6, 14], multiple high-level semantic concepts called action attributes were introduced in [20, 17] to describe the spatio-temporal evolution of the action, object shapes and human poses, and contextual scenes. Since these action attributes are relatively robust to changes in viewpoints and scenes, they bridge the gap between low-level features and class labels. In this work, we focus on improving action recognition performance of attribute-based representations.\n\nFigure 1: Key frames from two actions \u201cApplyEyeMakeup\u201d and \u201cApplyLipStick\u201d and the associated attribute set that the two actions share. Panels: (a) ApplyEyeMakeup, (b) ApplyLipStick, (c) Attribute set.\n\nEven though attribute-based representations appear effective for action recognition, they require humans to generate a list of attributes that may adequately describe a set of actions. From this list, humans then need to assign the action attributes to each class. Previous approaches [20, 17] simply used all the given attributes and ignored the difference in discriminative capability among attributes. This caused two major problems. First, a set of human-labeled attributes may not be able to represent and distinguish a set of action classes. This is because humans subjectively annotate action videos with arbitrary attributes. For example, consider the two classes \u201cApplyEyeMakeup\u201d and \u201cApplyLipStick\u201d in the UCF101 action dataset [30] shown in Figure 1. They have the same human-labeled attribute set and cannot be distinguished from one another. 
Second, some manually labeled attributes may be noisy or redundant, which leads to degradation in action recognition performance. In addition, their inclusion increases the feature extraction time. Thus, it would be beneficial to use a smaller subset of attributes while achieving comparable or even improved performance.\nTo overcome the first problem, we propose another type of attributes that we call data-driven attributes. We show that data-driven attributes are complementary to human-labeled attributes. Instead of using clustering-based algorithms to discover data-driven attributes as in [20], we propose a dictionary-based sparse representation method to discover a large data-driven attribute set. Our learned attributes are more suited to represent all the input data points because our method avoids the problem of hard assignment of data points to clusters. To address the attribute selection problem, we propose to select a compact and discriminative set of attributes from a large set of attributes. Three attribute selection criteria are proposed and then combined to form a submodular objective function. Our method encourages the selected attributes to have strong and similar discrimination capability for all pairs of actions. Furthermore, our method maximizes the sum of maximum coverage that each pair of classes can obtain from the selected attributes.\n\n2 Related Work\nAttribute-based representation for action recognition: Recently, several attribute-based representations have been proposed for improving action recognition performance. Liu et al. [20] modeled attributes as latent variables and searched for the best configuration of attributes for each action using latent SVMs. However, the performance may drop drastically when some attributes are too noisy or redundant. This is because pretrained attribute classifiers from these noisy attributes perform poorly. Li et al. 
[17] decomposed a video sequence into short-term segments and characterized segments by the dynamics of their attributes. However, since attributes are defined over the entire action video instead of short-term segments, different decompositions of a video into segments may yield different attribute dynamics.\nAnother line of work similar to attribute-based methods is based on learning different types of mid-level representations. These mid-level representations usually identify the occurrence of semantic concepts of interest, such as scene types, actions and objects. Fathi et al. [7] proposed to construct mid-level motion features from low-level optical flow features using AdaBoost. Wang et al. [35] modeled a human action as a global root template and a constellation of several parts. Raptis et al. [27] used trajectory clusters as candidates for the parts of an action and assembled these clusters into an action class by graphical modeling. Jain et al. [10] presented a new mid-level representation for videos based on discriminative spatio-temporal patches, which are automatically mined from videos using an exemplar-based clustering approach.\nSubmodularity: Submodular functions are a class of set functions that have the property of diminishing returns [24]. Given a set E, a set function F : 2^E \u2192 R is submodular if F(A \u222a {v}) \u2212 F(A) \u2265 F(B \u222a {v}) \u2212 F(B) holds for all A \u2286 B \u2286 E and v \u2208 E \\ B. Diminishing returns mean that the marginal value of the element v decreases if it is added at a later stage. Recently, submodular functions have been widely exploited in various applications, such as sensor placements [13], superpixel segmentation [22], document summarization [18], and feature selection [3, 23]. Liu et al. [23] presented a submodular feature selection method for acoustic score spaces based on existing facility location and saturated coverage functions. Krause et al. 
[12] developed a submodular method for selecting dictionary columns from multiple candidates for sparse representation. Iyer et al. [9] designed a new framework for both unconstrained and constrained submodular function optimization. Streeter et al. [31] proposed an online algorithm for maximizing submodular functions. Different from these approaches, we define a novel submodular objective function for attribute selection. Although we only evaluate our approach for action recognition, it can be applied to other recognition tasks that use attribute descriptions.\n\nFigure 1(c), attribute set shared by the two actions: Indoor = Yes, One_hand_visible = Yes, Stick_like = Yes, Sharp_like = Yes, One_arm_bent = Yes, Facing_front = Yes.\n\n3 Submodular Attribute Selection\nIn this section, we first propose three attribute selection criteria. In order to satisfy these criteria, we define a submodular function based on the entropy rate of a random walk and a weighted maximum coverage function. Then we introduce algorithms for the detection of human-labeled attributes and extraction of data-driven attributes.\n3.1 Attribute Selection Criteria\nAssume that we have C classes and a large attribute set P = {a_1, a_2, ..., a_M} which contains M attributes. The set that includes all pairwise combinations of classes is represented by U = {u_1(1, 2), u_2(1, 3), ..., u_l(i, j), ..., u_L(C \u2212 1, C)} where u_l(i, j), i < j denotes the pairwise combination of classes i and j, l is the index of this combination in U, and L = C \u00d7 (C \u2212 1)/2 is the total number of class pairs. Here we propose to use the Fisher score to construct an attribute contribution matrix A \u2208 R^{M\u00d7L}, where an entry A_{d,l} represents the discrimination capability of attribute a_d for differentiating the class pair (i, j) indexed by u_l(i, j). Specifically, given the attribute a_d and class pair (i, j), let \\mu_k^d and \\sigma_k^d be the mean and standard deviation of the k-th class and \\mu^d be the mean of samples from both classes i and j corresponding to the d-th attribute. The Fisher score of attribute a_d for differentiating the class pair (i, j) is computed as follows:\n\nA_{d,l(i,j)} = \\frac{\\sum_{k=i,j} n_k (\\mu_k^d - \\mu^d)^2}{\\sum_{k=i,j} n_k (\\sigma_k^d)^2}\n\nwhere l is the index of the class pair (i, j) in U, and n_k is the number of points from class k. Note that different methods can be used to measure the discrimination capability of a_d, such as mutual information and the t-test.\nGiven A, we can obtain a row vector r by summing, within each column, the entries in the rows corresponding to the selected attributes S. An example of vector r is shown in Figure 2a. We would like to have r satisfy two selection criteria: (1) each entry of r should be as large as possible; and (2) the variance of all entries of r should be small. The first criterion encourages S to provide as much discrimination capability as possible for each pair of classes. The second criterion makes S have similar discrimination capability for each pair of classes. These two criteria can be satisfied by maximizing the entropy rate of a random walk on the proposed graphs. Meanwhile, since some attributes may differentiate the same collection of class pairs well, it would be redundant to select all these attributes. In other words, one class pair may be repeatedly \u201ccovered\u201d (differentiated) by multiple attributes. It is better to select other attributes which can differentiate \u201cuncovered\u201d class pairs. 
Therefore, we propose the third criterion: the sum of maximum discrimination capability that each pair of classes can obtain from the selected attributes should be maximized. We will model it as a weighted maximum coverage problem and encourage S to have a maximum coverage of all class pairs.\n3.2 Entropy Rate-based Attribute Selection\nIn order to achieve the first two criteria, we need to construct an undirected graph and maximize the entropy rate of a random walk on this graph. We aim to obtain a subset S so that the attribute-based representation has good discrimination power.\nGraph Construction: We use G = (V, E) to denote an undirected graph where V is the vertex set, and E is the edge set. The vertex v_i represents class i and the edge e_{i,j} connecting class i and j represents that class i and j can be differentiated by the selected attribute subset S to some extent. The edge weight for e_{i,j} is defined as w_{i,j} = \\sum_{d \\in S} A_{d,l}, which represents the discrimination capability of S for differentiating class i from class j. The edge weights are symmetric, i.e. w_{i,j} = w_{j,i}. In addition, we add a self-loop e_{i,i} for each vertex v_i of G. The weight for self-loop e_{i,i} is defined as w_{i,i} = \\sum_{d \\in P \\setminus S} A_{d,l}. The total incident weight for each vertex is kept constant so that it produces a stationary distribution for the later proposed random walk on this graph. Note that the addition of these self-loops does not affect the selection of attributes and the graph will change with the selected subset S. Figure 2 gives an example to illustrate the benefits of the entropy rate.\n\nSubset c1/c2 c1/c3 c1/c4 c2/c3 c2/c4 c3/c4\nS1 1 1 1 1 1 1\nS2 2 2 2 2 2 2\nS3 1 3 1 2 3 2\n(a) Vector r corresponding to different subsets.\n\n(b) S1\n\n(c) S2\n\n(d) S3\n\nFigure 2: The summations of different rows in the contribution matrix corresponding to three different selected subsets are provided in the left table and the corresponding undirected graphs are in the right figure. We show the role of the entropy rate in selecting attributes which have large and similar discrimination capability for each pair of classes. The circles with numbers denote the corresponding class vertices and the numbers next to the edges denote the edge weights, which measure the discrimination capability of the selected attribute subset. The self-loops are not displayed. The entropy rate of the graph with large edge weights in (c) has a higher objective value than that of the graph with smaller edge weights in (b). The entropy rate of the graph with equal edge weights in (c) has a higher objective value than that of the graph with different edge weights in (d).\n\nEntropy Rate: Let X = {X_t | t \u2208 T, X_t \u2208 V} be a random walk on the graph G = (V, E) with nonnegative discrimination measure w. We use the random walk model from [2] with a transition probability defined as below:\n\np_{i,j}(S) = \\begin{cases} \\frac{w_{i,j}}{w_i} = \\frac{\\sum_{d \\in S} A_{d,l}}{w_i} & \\text{if } i \\neq j \\\\ 1 - \\frac{\\sum_{k: k \\neq i} w_{i,k}}{w_i} = \\frac{\\sum_{d \\in P \\setminus S} A_{d,l}}{w_i} & \\text{if } i = j \\end{cases} (1)\n\nwhere S is the selected attribute subset and w_i = \\sum_{m: e_{i,m} \\in E} w_{i,m} is the sum of incident weights of the vertex v_i including the self-loop. The stationary distribution for this random walk is given by\n\n\\mu = (\\mu_1, \\mu_2, ..., \\mu_C)^T = (\\frac{w_1}{w_0}, \\frac{w_2}{w_0}, ..., \\frac{w_C}{w_0})^T\n\nwhere w_0 = \\sum_{i=1}^{C} w_i is the sum of the total weights incident on all vertices. For a stationary 1st-order Markov chain, the entropy rate, which measures the uncertainty of the stochastic process X, is given by: H(X) = \\lim_{t \\to \\infty} H(X_t | X_{t-1}, X_{t-2}, ..., X_1) = \\lim_{t \\to \\infty} H(X_t | X_{t-1}) = H(X_2 | X_1). More details can be found in [2]. Consequently, the entropy rate of the random walk X on our proposed graph G = (V, E) can be written as a set function:\n\nH(S) = \\sum_i \\mu_i H(X_2 | X_1 = v_i) = -\\sum_i \\mu_i \\sum_j p_{i,j}(S) \\log(p_{i,j}(S)) (2)\n\nIntuitively, the maximization of the entropy rate has two properties. First, it encourages the maximization of p_{i,j}(S) where i = 1, ..., C and i \u2260 j. This can make the edge weights w_{i,j}, i \u2260 j as large as possible, so class i can be easily differentiated from other classes j (i.e., satisfying the first criterion). Second, it makes all class vertices have transition probabilities similar to other connected class vertices, so the discrimination capabilities of class i from other classes are very similar (i.e., satisfying the second criterion). Maximizing the entropy rate of the random walk on the proposed graph can select a subset of attributes that are compact and discriminative for differentiating all class pairs.\nProposition 3.1. The entropy rate of the random walk H : 2^M \u2192 R is a submodular function under the proposed graph construction.\n\nThe observation that adding an attribute in a later stage yields a lower increase in the uncertainty establishes the submodularity of the entropy rate. This is because at a later stage, the increased edge weights from the added attribute will be shared with attributes which contribute to the differentiation of the same pair of classes. 
A detailed proof based on [22] is given in the supplementary section.\n3.3 Weighted Maximum Coverage-based Attribute Selection\nWe consider a weighted maximum coverage function to achieve the last criterion that the selected subset S should maximize the coverage of all class pairs. For each attribute a_d, we define a coverage set U_d \u2286 U which covers all the class pairs that attribute a_d can differentiate. Meanwhile, for each element (class pair) u_l \u2208 U that is covered by U_d, we define a coverage weight w(U_d, u_l) = A_{d,l}. Given the universe set U and these coverage sets U_d, d = 1, ..., M, the weighted maximum coverage problem is to select at most K coverage sets, such that the sum of the maximum coverage weights that each element can obtain from S is maximized. The weighted maximum coverage function is defined as follows:\n\nQ(S) = \\sum_{u_l \\in U} \\max_{d \\in S} w(U_d, u_l) = \\sum_{u_l \\in U} \\max_{d \\in S} A_{d,l}, s.t. N_S \\leq K (3)\n\nwhere N_S is the number of attributes in S. Note that the weighted maximum coverage problem reduces to the well-studied set-cover problem when all the coverage weights are equal to one.\n\nAttrs. c1/c2 c1/c3 c1/c4 c2/c3 c2/c4 c3/c4\na1 2 1 2 0 1 0\na2 1 0 1 0 0 0\na3 0 0 0 1 0 2\na4 0 2 0 0 2 0\n(a) Attribute contribution matrix A.\n\n(b) Coverage graph\n\nFigure 3: An example of an attribute contribution matrix is given in the left table and the corresponding coverage graph is in the right figure. We show the role of the weighted maximum coverage term in selecting attributes which have large coverage weights. Two numbers separated by a slash in the top circles denote a pair of classes, while the bottom circles denote different attributes. The number next to an edge is the coverage weight associated with the class pair when covered by the corresponding attribute. The edge which provides the maximum coverage weight for each class pair is in red. We consider three attribute subsets S1 = {a1, a2}, S2 = {a1, a3}, S3 = {a1, a4}. S2 has a higher objective value than S1 and S3 because the sum of maximum coverage weights for all class pairs obtained using attributes from subset S2 is largest.\n\nProposition 3.2. The weighted maximum coverage function Q : 2^M \u2192 R is a monotonically increasing submodular function under the proposed set representation.\n\nFor the weighted maximum coverage term, monotonicity holds because adding an attribute never decreases the maximum coverage weight of any element in U. Submodularity results from the observation that an attribute added at a later stage yields a smaller gain in coverage weight, because some elements may already be covered by previously selected attributes. The proof is given in the supplementary section.\n3.4 Objective Function and Optimization\nCombining the entropy rate term and the weighted maximum coverage term, the overall objective function for attribute selection is formulated as follows:\n\n\\max_S F(S) = \\max_S H(S) + \\lambda Q(S), s.t. N_S \\leq K (4)\n\nwhere \u03bb controls the relative contribution between the entropy rate and the weighted maximum coverage term. The objective function is submodular because a linear combination of two submodular functions with nonnegative coefficients preserves submodularity [24].\n\nAlgorithm 1 Submodular Attribute Selection\n1: Input: G = (V, E), A, \u03bb and K\n2: Output: S\n3: Initialization: S \u2190 \u2205\n4: while N_S < K and max_{a \u2208 P \\ S} F(S \u222a {a}) \u2212 F(S) \u2265 0 do\n5: a_m = argmax_{a \u2208 P \\ S} F(S \u222a {a}) \u2212 F(S)\n6: S \u2190 S \u222a {a_m}\n7: end while\n\nDirect maximization of a submodular function is an NP-hard problem. However, a greedy algorithm from [24] gives a near-optimal solution with a (1 \u2212 1/e)-approximation bound. 
The greedy algorithm starts from an empty attribute set S = \u2205 and iteratively adds the attribute that provides the largest gain for F at each iteration. The iteration stops when the maximum number of selected attributes is reached or F(S) decreases. Algorithm 1 presents the pseudo code of our algorithm. A naive implementation of this algorithm has a complexity of O(M^2), because it needs to loop O(M) times to add a new attribute and scan through the whole attribute list in each loop. By exploiting the submodularity of the objective function, we use the lazy greedy approach presented in [16] to speed up the optimization process.\n3.5 Human-labeled Attribute and Data-driven Attribute Extraction\nAction videos can be characterized by a collection of human-labeled attributes [20]. For example, the action \u201clong-jump\u201d in Olympic Sports Dataset [25] is associated with either the motion attributes (jump forward, motion in the air), or with the scene attributes (e.g., outdoor, track). Given an action video x, an attribute classifier f_a : x \u2192 {0, 1} predicts the confidence score of the presence of attribute a in the video. This classifier f_a is learned using the training samples of all action classes which have this attribute as positive and the rest as negative. Given a set of attribute classifiers S = {f_{a_i}(x)}_{i=1}^m, an action video x \u2208 R^d is mapped to the semantic space O: h : R^d \u2192 O = [0, 1]^m where h(x) = (h_1(x), ..., h_m(x))^T is an m-dimensional attribute score vector.\nPrevious works [21, 20] on data-driven attribute discovery used k-means or information theoretic clustering algorithms to obtain the clusters as the learned attributes. In this paper, we propose to discover a large initial set of data-driven attributes using a dictionary learning method. 
Specifically, assume that we have a set of N videos in an n-dimensional feature space X = [x_1, ..., x_N], x_i \u2208 R^n, then a data-driven dictionary is learned by solving the following problem:\n\n\\arg\\min_{D,Z} ||X - DZ||_2^2 s.t. \\forall i, ||z_i||_0 \\leq T (5)\n\nwhere D = [d_1 ... d_K], d_i \u2208 R^n is the learned attribute dictionary of size K, Z = [z_1 ... z_N], z_i \u2208 R^K are the sparse codes of X, and T specifies the sparsity, i.e., each video has at most T items in its decomposition. Compared to k-means clustering, this dictionary-based learning scheme avoids the hard assignment of data points to cluster centers. Meanwhile, it does not require the estimation of the probability density function of clusters, as information theoretic clustering does. Note that our attribute selection framework is very general and different initial attribute extraction methods can be used here.\n4 Experiments\nIn this section, we validate our method for action recognition on two public datasets: the Olympic Sports dataset [25] and the UCF101 dataset [30]. Specifically, we consider three sets of attributes: human-labeled attribute set (HLA set), data-driven attribute set (DDA set) and the set mixing both types of attributes (Mixed set). To demonstrate the effectiveness of our selection framework, we compare the result using the selected subset with the result based on the initial set.\nWe also compare our method with two other submodular approaches based on the facility location function (FL) and saturated coverage function (SC) respectively in [23]. These objective functions are defined as follows: F_{fa}(S) = \\sum_{i \\in V} \\max_{j \\in S} w_{i,j}, F_{sa}(S) = \\sum_{i \\in V} \\min\\{C_i(S), \\alpha C_i(V)\\}, where w_{i,j} is a similarity between attribute i and j, C_i(S) = \\sum_{j \\in S} w_{i,j} measures the degree that attribute i is \u201ccovered\u201d by S and \u03b1 is a hyperparameter that determines a global saturation threshold. For the two approaches compared against, we consider an undirected k-nearest neighbor graph and use a Gaussian kernel to compute pairwise similarities w_{i,j} = \\exp(-\\beta d_{i,j}^2) where d_{i,j} is the distance between attribute i and j, \\beta = (2 \\langle d_{i,j}^2 \\rangle)^{-1} and \\langle \\cdot \\rangle denotes expectation over all pairwise distances. Finally, we compare the performance of attribute-based representation with several state-of-the-art approaches on the two datasets.\n4.1 Olympic Sports Dataset\nThe Olympic Sports dataset contains 783 YouTube video clips of 16 sports activities. We followed the protocol in [20] to extract STIP features [4]. Each action video is finally represented by a 2000-dimensional histogram. We use 40 human-labeled attributes provided by [20]. Three attribute-based representations are constructed as follows: (1) HLA set: For each human-labeled attribute, we train a binary SVM with a histogram intersection kernel. We concatenate confidence scores from all these attribute classifiers into a 40-dimensional vector to represent this video. (2) DDA set: For data-driven attributes, we learn a dictionary of size 457 from all video features using KSVD [1] and each video is represented by a 457-dimensional sparse coefficient vector. (3) Mixed set: This attribute set is obtained by combining HLA set and DDA set.\nWe compare the performance of features based on selected attributes with those based on the initial attribute set. For all the different attribute-based features, we use an SVM with Gaussian kernel for classification. 
Table 1 shows classification accuracies of different attribute-based representations. Compared with the initial attribute set, the selected attributes have greatly improved the classification accuracy, which demonstrates the effectiveness of our method for selecting a subset of discriminative attributes. Moreover, features based on the Mixed set outperform features based on either HLA set or DDA set. This shows that data-driven attributes are complementary to human-labeled attributes and together they offer a better description of actions. Table 2 shows the per-category average precision (AP) and mean AP of different approaches. It can be seen that our method achieves the best performance. This illustrates the benefits of selecting discriminative attributes and removing noisy and redundant attributes. Note that our method outperforms the method that is most similar to ours [20], which uses complex latent SVMs to combine low-level features, human-labeled attributes and data-driven attributes. Moreover, compared with other dynamic classifiers [25, 17] which account for the dynamics of bag-of-features or action attributes, our method still obtains comparable results. This is because the provided human-labeled attributes are very noisy and they can greatly affect the training of latent SVM and representation of the attribute dynamics.\n\ndataset HLA (All) HLA (Subset) DDA (All) DDA (Subset) Mixed (All) Mixed (Subset)\nOlympic 61.8 64.1 49.0 53.8 63.1 66.7\nUCF101 81.7 83.4 79.0 81.6 82.3 85.2\n\nTable 1: Recognition results of different attribute-based representations. \u201cAll\u201d denotes the original attribute sets and \u201cSubset\u201d denotes the selected subsets.\n\nFigure 4: Recognition results by different submodular methods on the Olympic Sports dataset. Panels: (a) HLA set, (b) DDA set, (c) Mixed set, (d) Effect of \u03bb in Mixed set.\n\nActivity [15] [25] [32] [20]\nhigh-jump 52.4 68.9 18.4 93.2\nlong-jump 66.8 74.8 81.8 82.6\ntriple-jump 36.1 52.3 16.1 48.3\npole-vault 47.8 82.0 84.9 74.4\ngym. vault 88.6 86.1 85.7 86.7\nshot-put 56.2 62.1 43.3 76.2\nsnatch 41.8 69.2 88.6 71.6\nclean-jerk 83.2 84.1 78.2 79.4\njavelin throw 61.1 74.6 79.5 62.1\nhammer throw 65.1 77.5 70.5 65.5\ndiscus throw 37.4 58.5 48.9 68.9\ndiving-plat. 91.5 87.2 93.7 77.5\ndiving-sp. bd. 80.7 77.2 79.3 65.2\nbask. layup 75.8 77.9 85.5 66.7\nbowling 66.7 72.7 64.3 72.0\ntennis-serve 39.6 49.1 49.6 55.2\nmean-AP 62.0 72.1 66.8 71.6\n\n[17] HLA DDA Mixed\n82.2\n83.1\n93.9\n92.5\n73.6\n52.1\n56.8\n79.4\n98.4\n83.4\n72.2\n70.3\n79.8\n72.7\n82.6\n85.1\n87.5\n36.5\n80.4\n74.0\n56.0\n57.0\n99.2\n86.0\n90.4\n78.3\n90.7\n78.1\n55.4\n52.5\n83.7\n38.7\n77.0\n73.2\n\n80.4\n88.8\n61.4\n55.1\n98.2\n63.7\n74.5\n73.8\n36.0\n76.9\n53.9\n94.8\n79.7\n88.7\n43.0\n78.8\n72.1\n\n66.4\n85.3\n60.7\n45.5\n84.2\n39.5\n34.2\n57.9\n26.4\n77.2\n45.6\n55.3\n59.7\n89.7\n55.3\n35.3\n57.2\n\nTable 2: Average precisions for activity recognition on the Olympic Sports dataset.\n\nFigures 4a, 4b and 4c show classification accuracies of attribute subsets selected by different submodular selection methods. It can be seen that our method outperforms the other two submodular selection methods for the three different attribute sets. 
This is because our method prefers attributes with large and similar discrimination capability for differentiating class pairs, while the other two methods prefer attributes with large similarity to other attributes (i.e. representative), without explicitly considering the discrimination capabilities of selected attributes. Figure 4d shows the performance curves for a range of \u03bb. We observe that the combination of the entropy rate term and the maximum coverage term obtains a higher classification accuracy than when only one of them is used. In addition, our approach is insensitive to the selection of \u03bb. Hence we use \u03bb = 0.1 throughout the experiments.\n\nFigure 4 plots: accuracy versus attribute subset size. Curves in (a)-(c): Our method, FL [23], SC [23]. Curves in (d): Entropy rate, Maximum Coverage, \u03bb = 0.01, \u03bb = 0.1, \u03bb = 1.\n\n4.2 UCF101 Dataset\n\nsplits [34] [36] [37] [11] [29] HLA DDA Mixed\n1 83.03 83.11 79.41 65.22 63.41 82.45 80.35 84.19\n2 84.22 84.60 81.25 65.39 65.37 83.27 82.16 85.51\n3 84.80 84.23 82.03 67.24 64.12 84.60 82.42 86.30\nAvg 84.02 83.98 80.90 65.95 64.30 83.44 81.64 85.24\n\nTable 3: Recognition results of different approaches on UCF101 dataset.\n\nFigure 5: Recognition results by different submodular methods on UCF101 dataset. Panels: (a) HLA set, (b) DDA set, (c) Mixed set.\n\nThe UCF101 dataset contains over 10,000 video clips from 101 different human action categories. We compute the improved version of dense trajectories in [34] and extract three types of descriptors: histogram of oriented gradients (HOG), histogram of optical flow (HOF) and motion boundary histogram (MBH). We use Fisher vector encoding [26] and obtain a 101,376-dimensional histogram to represent each action video. 
Three different attribute sets and corresponding attribute-based representations\nare constructed as follows: (1) HLA set: Due to the high dimensionality of features\nand large number of samples, a linear SVM is trained for the detection of each human-labeled attribute.\nWe concatenate confidence scores from all these attribute classifiers into a 115-dimensional\nvector to represent a video. (2) DDA set: For data-driven attributes, we first apply PCA to reduce\nthe dimension of histogram descriptors to 3300 and then learn a dictionary of size 3030. The\nfeatures based on data-driven attributes are 3030-dimensional sparse coefficient vectors. (3) Mixed\nset: HLA set plus DDA set.\nFollowing the training and testing dataset partitions proposed in [30], we train a linear SVM and\nreport classification accuracies of different attribute-based representations in Table 1. The selected\nattribute subset again outperforms the initial attribute set, which demonstrates the effectiveness of\nour proposed attribute selection method. Figure 5 shows the results of attribute subsets selected\nby different submodular selection methods. Note that this dataset is highly challenging because\nthe training and test videos of the same action have different backgrounds and actors. It can be seen\nthat our method still substantially outperforms the other two submodular methods. This is because\nsome redundant attributes dominated the selection process and the attributes selected by the compared\napproaches had very unbalanced discrimination capability for different classes. In contrast, the\nattributes selected by our method have strong and similar discrimination capability for each class.\nTable 3 presents the classification accuracies of several state-of-the-art approaches on this dataset.\nOur method achieves results comparable to the best result of 85.9% from [34], which uses complex\nspatio-temporal pyramids to embed structure information in features. 
Note that our method also\noutperforms other methods which make use of complicated and advanced feature extraction and\nencoding techniques.\n\n[Figure 5 plots: accuracy versus attribute subset size. Legend in each panel: Our method, FL [23], SC [23].]\n\n5 Conclusion\n\nWe exploited human-labeled attributes and data-driven attributes for improving the performance of\naction recognition algorithms. We first presented three attribute selection criteria for the selection of\ndiscriminative and compact attributes. Then we formulated the selection procedure as one of optimizing\na submodular function based on the entropy rate of a random walk and a weighted maximum\ncoverage function. Our selected attributes not only have strong and similar discrimination capability\nfor all pairwise classes, but also maximize the sum of the largest discrimination capability that each\npair of classes can obtain from the selected attributes. Experimental results on two challenging\ndatasets show that the proposed method significantly outperforms many state-of-the-art approaches.\n\n6 Acknowledgements\n\nThe identification of any commercial product or trade name does not imply endorsement or recommendation\nby NIST. This research was partially supported by a MURI from the Office of Naval\nResearch under the Grant 1141221258513.\n\nReferences\n[1] M. Aharon, M. Elad, and A. Bruckstein. K-SVD: An algorithm for designing overcomplete dictionaries for sparse representation. In IEEE Transactions on Signal Processing, 2006.\n[2] T. M. Cover and J. A. Thomas. Elements of Information Theory. Wiley-Interscience, 2006.\n[3] A. Das, A. Dasgupta, and R. Kumar. Selecting diverse features via spectral regularization. In NIPS, 2012.\n[4] P. Dollar, V. Rabaud, G. Cottrell, and S. Belongie. Behavior recognition via sparse spatio-temporal features. 
In VS-PETS, 2005.\n[5] A. A. Efros, A. C. Berg, E. C. Berg, G. Mori, and J. Malik. Recognizing action at a distance. In ICCV, 2003.\n[6] A. Farhadi, I. Endres, D. Hoiem, and D. Forsyth. Describing objects by their attributes. In CVPR, 2009.\n[7] A. Fathi and G. Mori. Action recognition by learning mid-level motion features. In CVPR, 2008.\n[8] L. Gorelick, M. Blank, E. Shechtman, M. Irani, and R. Basri. Actions as space-time shapes. In ICCV, 2005.\n[9] R. Iyer, S. Jegelka, and J. Bilmes. Fast semidifferential-based submodular function optimization. In ICML, 2013.\n[10] A. Jain, A. Gupta, M. Rodriguez, and L. Davis. Representing videos using mid-level discriminative patches. In CVPR, 2013.\n[11] Z. Jiang, Z. Lin, and L. S. Davis. Label consistent K-SVD: Learning a discriminative dictionary for recognition. In PAMI, 2013.\n[12] A. Krause and V. Cevher. Submodular dictionary selection for sparse representation. In ICML, 2010.\n[13] A. Krause, A. Singh, C. Guestrin, and C. Williams. Near-optimal sensor placements in Gaussian processes. In ICML, 2005.\n[14] C. H. Lampert, H. Nickisch, and S. Harmeling. Learning to detect unseen object classes by between-class attribute transfer. In CVPR, 2009.\n[15] I. Laptev and T. Lindeberg. Space-time interest points. In ICCV, 2003.\n[16] J. Leskovec, A. Krause, C. Guestrin, C. Faloutsos, J. VanBriesen, and N. Glance. Cost-effective outbreak detection in networks. In KDD, 2007.\n[17] W. Li and N. Vasconcelos. Recognizing activities by attribute dynamics. In NIPS, 2012.\n[18] H. Lin and J. Bilmes. A class of submodular functions for document summarization. In Proceedings of ACL, 2011.\n[19] Z. Lin, Z. Jiang, and L. S. Davis. Recognizing actions by shape-motion prototype trees. In CVPR, 2009.\n[20] J. Liu, B. Kuipers, and S. Savarese. Recognizing human actions by attributes. In CVPR, 2011.\n[21] J. Liu, Y. Yang, and M. Shah. 
Learning semantic visual vocabularies using diffusion distance. In CVPR, 2009.\n[22] M.-Y. Liu, O. Tuzel, S. Ramalingam, and R. Chellappa. Entropy-rate clustering: Cluster analysis via maximizing a submodular function subject to a matroid constraint. In PAMI, 2014.\n[23] Y. Liu, K. Wei, K. Kirchhoff, Y. Song, and J. Bilmes. Submodular feature selection for high-dimensional acoustic score spaces. In ICASSP, 2013.\n[24] G. L. Nemhauser, L. A. Wolsey, and M. L. Fisher. An analysis of approximations for maximizing submodular set functions-I. Mathematical Programming, 1978.\n[25] J. C. Niebles, C.-W. Chen, and L. Fei-Fei. Modeling temporal structure of decomposable motion segments for activity classification. In ECCV, 2010.\n[26] F. Perronnin, J. S\u00e1nchez, and T. Mensink. Improving the Fisher kernel for large-scale image classification. In ECCV, 2010.\n[27] M. Raptis, I. Kokkinos, and S. Soatto. Discovering discriminative action parts from mid-level video representations. In CVPR, 2012.\n[28] M. Raptis and S. Soatto. Tracklet descriptors for action modeling and video analysis. In ECCV, 2010.\n[29] S. Sadanand and J. Corso. Action bank: A high-level representation of activity in video. In CVPR, 2012.\n[30] K. Soomro, A. R. Zamir, and M. Shah. UCF101: A dataset of 101 human action classes from videos in the wild. In CRCV-TR-12-01, 2012.\n[31] M. Streeter and D. Golovin. An online algorithm for maximizing submodular functions. In NIPS, 2008.\n[32] K. D. Tang, L. Fei-Fei, and D. Koller. Learning latent temporal structure for complex event detection. In CVPR, 2012.\n[33] H. Wang, A. Kl\u00e4ser, C. Schmid, and C.-L. Liu. Dense trajectories and motion boundary descriptors for action recognition. International Journal of Computer Vision, 2013.\n[34] H. Wang and C. Schmid. Action recognition with improved trajectories. In ICCV, 2013.\n[35] Y. Wang and G. Mori. 
Max-margin hidden conditional random fields for human action recognition. In CVPR, 2009.\n[36] J. Wu, Y. Zhang, and W. Lin. Towards good practices for action video encoding. In ICCV, 2013.\n[37] J. Zhu, B. Wang, X. Yang, W. Zhang, and Z. Tu. Action recognition with actons. In ICCV, 2013.\n", "award": [], "sourceid": 749, "authors": [{"given_name": "Jingjing", "family_name": "Zheng", "institution": "University of Maryland"}, {"given_name": "Zhuolin", "family_name": "Jiang", "institution": "Noah's Ark Lab, Huawei Technologies"}, {"given_name": "Rama", "family_name": "Chellappa", "institution": "University of Maryland College Park"}, {"given_name": "Jonathon", "family_name": "Phillips", "institution": "National Institute of Standards and  Technology"}]}