{"title": "Relevance Topic Model for Unstructured Social Group Activity Recognition", "book": "Advances in Neural Information Processing Systems", "page_first": 2580, "page_last": 2588, "abstract": "Unstructured social group activity recognition in web videos is a challenging task due to 1) the semantic gap between class labels and low-level visual features and 2) the lack of labeled training data. To tackle this problem, we propose a \u201crelevance topic model\u201d for jointly learning meaningful mid-level representations upon bag-of-words (BoW) video representations and a classifier with sparse weights. In our approach, sparse Bayesian learning is incorporated into an undirected topic model (i.e., Replicated Softmax) to discover topics which are relevant to video classes and suitable for prediction. Rectified linear units are utilized to increase the expressive power of topics so as to better explain video data containing complex content and to make variational inference tractable for the proposed model. An efficient variational EM algorithm is presented for model parameter estimation and inference. Experimental results on the Unstructured Social Activity Attribute dataset show that our model achieves state-of-the-art performance and outperforms other supervised topic models in terms of classification accuracy, particularly in the case of a very small number of labeled training videos.", "full_text": "Relevance Topic Model for Unstructured Social Group Activity Recognition\n\nFang Zhao\n\nYongzhen Huang\n\nLiang Wang\n\nTieniu Tan\n\nCenter for Research on Intelligent Perception and Computing\n\nInstitute of Automation, Chinese Academy of Sciences\n\n{fang.zhao,yzhuang,wangliang,tnt}@nlpr.ia.ac.cn\n\nAbstract\n\nUnstructured social group activity recognition in web videos is a challenging task due to 1) the semantic gap between class labels and low-level visual features and 2) the lack of labeled training data. 
To tackle this problem, we propose a \u201crelevance topic model\u201d for jointly learning meaningful mid-level representations upon bag-of-words (BoW) video representations and a classifier with sparse weights. In our approach, sparse Bayesian learning is incorporated into an undirected topic model (i.e., Replicated Softmax) to discover topics which are relevant to video classes and suitable for prediction. Rectified linear units are utilized to increase the expressive power of topics so as to better explain video data containing complex content and to make variational inference tractable for the proposed model. An efficient variational EM algorithm is presented for model parameter estimation and inference. Experimental results on the Unstructured Social Activity Attribute dataset show that our model achieves state-of-the-art performance and outperforms other supervised topic models in terms of classification accuracy, particularly in the case of a very small number of labeled training videos.\n\n1 Introduction\n\nThe explosive growth of web videos makes automatic video classification important for online video search and indexing. Classifying short video clips which contain simple motions and actions has been well solved on standard datasets (such as KTH [1], UCF-Sports [2] and UCF50 [3]). However, detecting complex activities, especially social group activities [4], in web videos is a more difficult task because of unstructured activity context and complex multi-object interaction.\nIn this paper, we focus on the task of automatic classification of unstructured social group activities (e.g., wedding dance, birthday party and graduation ceremony in Figure 1), where the low-level features have innate limitations in semantic description of the underlying video data and only a few labeled training videos are available. 
Thus, a common method is to learn human-defined (or semi-human-defined) semantic concepts as mid-level representations to help video classification [4]. However, such human-defined concepts hardly generalize to larger or new datasets. To discover more powerful representations for classification, we propose a novel supervised topic model called the \u201crelevance topic model\u201d to automatically extract latent \u201crelevance\u201d topics from bag-of-words (BoW) video representations and simultaneously learn a classifier with sparse weights.\nOur model is built on Replicated Softmax [5], an undirected topic model which can be viewed as a family of different-sized Restricted Boltzmann Machines that share parameters. Sparse Bayesian learning [6] is incorporated to guide the topic model towards learning more predictive topics which are associated with sparse classifier weights. We refer to those topics corresponding to non-zero weights as \u201crelevance\u201d topics. Meanwhile, binary stochastic units in Replicated Softmax are replaced by rectified linear units [7], which allows each unit to convey more information, better explaining video data with complex content, and also makes variational inference tractable for the proposed model. Furthermore, by using a simple quadratic bound on the log-sum-exp function [8], an efficient variational EM algorithm is developed for parameter estimation and inference. \n\n1\n\n\f(a) Wedding Dance\n\n(b) Birthday Party\n\n(c) Graduation Ceremony\n\nFigure 1: Example videos of the \u201cWedding Dance\u201d, \u201cBirthday Party\u201d and \u201cGraduation Ceremony\u201d classes taken from the USAA dataset [4].\n\n
Our model can be naturally extended to deal with multi-modal data without changing the learning and inference procedures, which is beneficial for video classification tasks.\n\n2 Related work\n\nThe problems of activity analysis and recognition have been widely studied. However, most of the existing works [9, 10] were done on constrained videos with limited content (e.g., clean backgrounds and little camera motion). Complex activity recognition in web videos, such as social group activity, remains largely unexplored. Most relevant to our work is a recent approach that learns video attributes to analyze social group activity [4]. In [4], a semi-latent attribute space is introduced, which consists of user-defined attributes, class-conditional and background latent attributes, and an extended Latent Dirichlet Allocation (LDA) [11] is used to model those attributes as topics. Different from that, our work discovers a set of discriminative latent topics without extra human annotations on videos.\nFrom the view of graphical models, most similar to our model are the maximum entropy discrimination LDA (MedLDA) [12] and the generative Classification Restricted Boltzmann Machine (gClassRBM) [13], both of which have been successfully applied to document semantic analysis. MedLDA integrates max-margin learning and hierarchical directed topic models by optimizing a single objective function with a set of expected margin constraints. MedLDA tries to estimate parameters and find latent topics in a max-margin sense, which is different from our model relying on the principle of automatic relevance determination [14]. The gClassRBM, used to model word count data, is actually a supervised Replicated Softmax. 
Different from the gClassRBM, instead of point\nestimation of classi\ufb01er parameters, our proposed model learns a sparse posterior distribution over\nparameters within a Bayesian paradigm.\n\n3 Models and algorithms\n\nWe start with the description of Replicated Softmax, and then by integrating it with sparse Bayesian\nlearning, propose the relevance topic model for videos. Finally, we develop an ef\ufb01cient variational\nalgorithm for inference and parameter estimation.\n\n3.1 Replicated Softmax\n\nThe Replicated Softmax model is a two-layer undirected graphical model, which can be used to\nmodel sparse word count data and extract latent semantic topics from document collections. Repli-\ncated Softmax allows for very ef\ufb01cient inference and learning, and outperforms LDA in terms of\nboth the generalization performance and the retrieval accuracy on text datasets.\nAs shown in Figure 2 (left), this model is a generalization of the restricted Boltzmann machine\n(RBM). The bottom layer represents a multinomial visible unit sampled K times (K is the total\nnumber of words in a document) and the top layer represents binary stochastic hidden units.\n\n2\n\n\fFigure 2: Left: Replicated Softmax model: an undirected graphical model. Right: Relevance topic\nmodel: a mixed graphical model. The undirected part models the marginal distribution of video\nBoW vectors v and the directed part models the conditional distribution of video classes y given\nlatent topics tr by using a hierarchical prior on weights \u03b7\u03b7\u03b7.\n\nLet a word count vector v \u2208 NN be the visible unit (N is the size of the vocabulary), and a binary\ntopic vector h \u2208 {0, 1}F be the hidden units. 
Then the energy function of the state {v, h} is defined as follows:\n\nE(v, h; θ) = −∑_{i=1}^{N} ∑_{j=1}^{F} W_ij v_i h_j − ∑_{i=1}^{N} a_i v_i − K ∑_{j=1}^{F} b_j h_j,   (1)\n\nwhere θ = {W, a, b}, W_ij is the weight connecting v_i and h_j, and a_i and b_j are the bias terms of the visible and hidden units respectively. The joint distribution over the visible and hidden units is defined by:\n\nP(v, h; θ) = (1/Z(θ)) exp(−E(v, h; θ)),   Z(θ) = ∑_v ∑_h exp(−E(v, h; θ)),   (2)\n\nwhere Z(θ) is the partition function. Since exact maximum likelihood learning is intractable, the contrastive divergence [15] approximation is often used to estimate model parameters in practice.\n\n3.2 Relevance topic model\n\nThe relevance topic model (RTM) is an integration of sparse Bayesian learning and Replicated Softmax, the main idea of which is to jointly learn discriminative topics as mid-level video representations and a sparse discriminant function as a video classifier.\nWe represent the video dataset with class labels y ∈ {1, ..., C} as D = {(v_m, y_m)}_{m=1}^{M}, where each video is represented as a BoW vector v ∈ N^N. Consider modeling video BoW vectors using the Replicated Softmax. Let t^r = [t^r_1, ..., t^r_F] denote an F-dimensional topic vector of one video. According to Equation 2, the marginal distribution over the BoW vector v is given by:\n\nP(v; θ) = (1/Z(θ)) ∑_{t^r} exp(−E(v, t^r; θ)).   (3)\n\nSince videos contain more complex and diverse content than documents, binary topics are far from ideal for explaining video data. 
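As a concrete illustration, the energy of Equation 1 and the conditional over hidden topics it induces can be sketched numerically. All dimensions and parameter values below are invented for the example; the check exploits the fact that, for a fixed word-count vector v, P(h | v) factorizes over units with P(h_j = 1 | v) = sigmoid(K b_j + (Wᵀv)_j).

```python
import numpy as np
from itertools import product

rng = np.random.default_rng(0)
N, F = 10, 4                            # toy vocabulary size and number of topics
W = rng.normal(scale=0.1, size=(N, F))  # visible-hidden weights
a = rng.normal(scale=0.1, size=N)       # visible biases
b = rng.normal(scale=0.1, size=F)       # hidden biases

v = rng.integers(0, 5, size=N)          # a word-count (BoW) vector
K = v.sum()                             # total number of words, scales the hidden bias

def energy(v, h):
    # E(v, h) = -sum_ij W_ij v_i h_j - sum_i a_i v_i - K sum_j b_j h_j  (Equation 1)
    return -(v @ W @ h) - a @ v - K * (b @ h)

# Factorized conditional: P(h_j = 1 | v) = sigmoid(K b_j + (W^T v)_j)
sigmoid = 1.0 / (1.0 + np.exp(-(K * b + v @ W)))

# Brute-force check over all 2^F hidden configurations (feasible for this toy F)
weights = {h: np.exp(-energy(v, np.array(h))) for h in product((0, 1), repeat=F)}
Z_v = sum(weights.values())
brute = np.array([sum(w for h, w in weights.items() if h[j] == 1) / Z_v
                  for j in range(F)])
assert np.allclose(sigmoid, brute)
```

The exact enumeration is only possible because F is tiny here; in practice the intractable partition function is the reason contrastive divergence is used for learning.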
We replace binary hidden units in the original Replicated Softmax with rectified linear units, which are given by:\n\nt^r_j = max(0, t_j),   P(t_j|v; θ) = N(t_j | K b_j + ∑_{i=1}^{N} W_ij v_i, 1),   (4)\n\nwhere N(·|µ, τ) denotes a Gaussian distribution with mean µ and variance τ. The rectified linear units, taking nonnegative real values, can preserve information about the relative importance of topics. Meanwhile, the rectified Gaussian distribution is semi-conjugate to the Gaussian likelihood. This facilitates the development of variational algorithms for posterior inference and parameter estimation, which we describe in Section 3.3.\n\n3\n\n\fLet η = {η_y}_{y=1}^{C} denote a set of class-specific weight vectors. We define the discriminant function as a linear combination of topics: F(y, t^r, η) = η_yᵀ t^r. The conditional distribution of classes is defined as follows:\n\nP(y|t^r, η) = exp(F(y, t^r, η)) / ∑_{y'=1}^{C} exp(F(y', t^r, η)),   (5)\n\nand the classifier is given by:\n\nŷ = argmax_{y∈C} E[F(y, t^r, η)|v].   (6)\n\nThe weights η are given a zero-mean Gaussian prior:\n\nP(η|α) = ∏_{y=1}^{C} ∏_{j=1}^{F} P(η_yj|α_yj) = ∏_{y=1}^{C} ∏_{j=1}^{F} N(η_yj|0, α_yj^{−1}),   (7)\n\nwhere α = {α_y}_{y=1}^{C} is a set of hyperparameter vectors, and each hyperparameter α_yj is assigned independently to each weight η_yj. The hyperpriors over α are given by Gamma distributions:\n\nP(α) = ∏_{y=1}^{C} ∏_{j=1}^{F} P(α_yj) = ∏_{y=1}^{C} ∏_{j=1}^{F} Γ(c)^{−1} d^c α_yj^{c−1} e^{−d α_yj},   (8)\n\nwhere Γ(c) is the Gamma function. To obtain broad hyperpriors, we set c and d to small values, e.g., c = d = 10^{−4}. This hierarchical prior, which is a type of automatic relevance determination prior [14], enables the posterior probability of the weights η to be concentrated at zero and thus effectively switches off the corresponding topics that are considered to be irrelevant to classification.\nFinally, given the parameters θ, RTM defines the joint distribution:\n\nP(v, y, t^r, η, α; θ) = P(v; θ) P(y|t^r, η) (∏_{j=1}^{F} P(t_j|v; θ)) (∏_{y=1}^{C} ∏_{j=1}^{F} P(η_yj|α_yj) P(α_yj)).   (9)\n\nFigure 2 (right) illustrates RTM as a mixed graphical model with undirected and directed edges. The undirected part models the marginal distribution of video data and the directed part models the conditional distribution of classes given latent topics. We can naturally extend RTM to Multimodal RTM by using the undirected part to model the multimodal data v = {v_modl}_{l=1}^{L}. Accordingly, P(v; θ) in Equation 9 is replaced with ∏_{l=1}^{L} P(v_modl; θ_modl). As we can see in Section 3.3, this does not change the learning and inference rules.\n\n3.3 Parameter estimation and inference\n\nFor RTM, we wish to find parameters θ = {W, a, b} that maximize the log likelihood on D:\n\nlog P(D; θ) = log ∫ P({v_m, y_m, t^r_m}_{m=1}^{M}, η, α; θ) d{t_m}_{m=1}^{M} dη dα,   (10)\n\nand learn the posterior distribution P(η, α|D; θ) = P(η, α, D; θ)/P(D; θ). Since exactly computing P(D; θ) is intractable, we employ variational methods to optimize a lower bound L on the log likelihood by introducing a variational distribution to approximate P({t_m}_{m=1}^{M}, η, α|D; θ):\n\nQ({t_m}_{m=1}^{M}, η, α) = (∏_{m=1}^{M} ∏_{j=1}^{F} q(t_mj)) q(η) q(α).   (11)\n\nUsing Jensen's inequality, we have:\n\nlog P(D; θ) ≥ ∫ Q({t_m}_{m=1}^{M}, η, α) log [ (∏_{m=1}^{M} P(v_m; θ) P(y_m|t^r_m, η) P(t_m|v_m; θ)) P(η|α) P(α) / Q({t_m}_{m=1}^{M}, η, α) ] d{t_m}_{m=1}^{M} dη dα.   (12)\n\nNote that P(y_m|t^r_m, η) is not conjugate to the Gaussian prior, which makes it intractable to compute the variational factors q(η) and q(t_mj). Here we use a quadratic bound on the log-sum-exp (LSE) function [8] to derive a further bound. 
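Before continuing the derivation, the sparsifying effect of the automatic relevance determination prior of Equations 7–8 can be previewed on a simplified linear-Gaussian analogue. This is not the RTM update itself: the fixed features T, the fixed noise precision beta, and all sizes are invented for illustration; only the Gamma-posterior updates of the form ĉ = c + 1/2, d̂ = d + ⟨η_j²⟩/2 mirror the ARD mechanism used in the model.

```python
import numpy as np

rng = np.random.default_rng(3)
M, F = 200, 5                           # toy: 200 "videos", 5 "topic" features
T = rng.normal(size=(M, F))             # fixed feature matrix (stand-in for topics)
w_true = np.array([2.0, 0.0, 0.0, 0.0, 0.0])   # only feature 0 is relevant
y = T @ w_true + 0.1 * rng.normal(size=M)

c = d = 1e-4                            # broad Gamma hyperprior, as in Equation 8
alpha = np.ones(F)                      # per-weight precisions <alpha_j>
beta = 100.0                            # fixed noise precision (toy assumption)

for _ in range(50):
    # Gaussian posterior over the weights given the current precisions
    V = np.linalg.inv(beta * T.T @ T + np.diag(alpha))
    E = beta * V @ T.T @ y
    # ARD update: c_hat = c + 1/2, d_hat = d + <eta_j^2>/2, <alpha_j> = c_hat/d_hat
    alpha = (c + 0.5) / (d + 0.5 * (E**2 + np.diag(V)))

assert abs(E[0] - 2.0) < 0.1            # the relevant weight is recovered
assert np.all(np.abs(E[1:]) < 0.05)     # irrelevant weights are driven toward zero
assert np.all(alpha[1:] > alpha[0])     # large precisions switch off irrelevant features
```

The large learned precisions on features 1–4 are exactly the "switching off" behaviour described above: their posterior mass concentrates at zero, leaving a sparse set of relevance weights.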
We rewrite P(y_m|t^r_m, η) as follows:\n\nP(y_m|t^r_m, η) = exp(y_mᵀ T^r_m η − lse(T^r_m η)),   (13)\n\n4\n\n\fwhere T^r_m η = [(t^r_m)ᵀη_1, ..., (t^r_m)ᵀη_{C−1}], y_m = I(y_m = c) is the one-of-C encoding of class label y_m, and lse(x) ≜ log(1 + ∑_{y'=1}^{C−1} exp(x_{y'})) (we set η_C = 0 to ensure identifiability). In [8], the LSE function is expanded as a second-order Taylor series around a point φ, and an upper bound is found by replacing the Hessian matrix H(φ) with a fixed matrix A = ½[I_{C*} − 1_{C*}1_{C*}ᵀ/(C* + 1)] such that A ⪰ H(φ), where C* = C − 1, I_{C*} is the identity matrix of size C* × C* and 1_{C*} is a C*-vector of ones. Thus, similar to [16], we have:\n\nlog P(y_m|t^r_m, η) ≥ J(y_m, t^r_m, η, φ_m) = y_mᵀ T^r_m η − ½(T^r_m η)ᵀ A T^r_m η + s_mᵀ T^r_m η − κ_m,   (14)\n\ns_m = Aφ_m − exp(φ_m − lse(φ_m)),   (15)\n\nκ_m = ½φ_mᵀ A φ_m − φ_mᵀ exp(φ_m − lse(φ_m)) + lse(φ_m),   (16)\n\nwhere φ_m ∈ R^{C*} is a vector of variational parameters. 
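The quadratic bound of Equations 14–16 can be checked numerically: since A dominates the Hessian of lse everywhere, the quadratic expanded around any φ upper-bounds lse globally and is tight at φ. The class count and sample values below are invented for the check.

```python
import numpy as np

rng = np.random.default_rng(2)
C = 4                       # toy number of classes
Cs = C - 1                  # C* = C - 1 free logits (eta_C = 0 for identifiability)

def lse(x):
    # lse(x) = log(1 + sum_y' exp(x_y'))
    return np.log1p(np.exp(x).sum())

# Bohning's fixed curvature matrix A = 1/2 [I - 1 1^T / (C* + 1)]
A = 0.5 * (np.eye(Cs) - np.ones((Cs, Cs)) / (Cs + 1))

phi = rng.normal(size=Cs)                 # expansion point (variational parameter)
g = np.exp(phi - lse(phi))                # gradient of lse at phi
s = A @ phi - g                           # Equation 15
kappa = 0.5 * phi @ A @ phi - phi @ g + lse(phi)   # Equation 16

# Global upper bound: lse(x) <= 1/2 x^T A x - s^T x + kappa, with equality at x = phi
for _ in range(100):
    x = rng.normal(scale=3.0, size=Cs)
    assert lse(x) <= 0.5 * x @ A @ x - s @ x + kappa + 1e-9
assert np.isclose(lse(phi), 0.5 * phi @ A @ phi - s @ phi + kappa)
```

Plugging this quadratic in place of lse is what turns the non-conjugate softmax likelihood into a Gaussian-friendly term, which is why the free-form posteriors below come out in closed form.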
Substituting J(y_m, t^r_m, η, φ_m) into Equation 12, we can obtain a further lower bound:\n\nlog P(D; θ) ≥ L(θ, φ) = ∑_{m=1}^{M} log P(v_m; θ) + E_Q[∑_{m=1}^{M} J(y_m, t^r_m, η, φ_m) + ∑_{m=1}^{M} log P(t_m|v_m; θ) + log P(η|α) + log P(α) − log Q({t_m}_{m=1}^{M}, η, α)].   (17)\n\nNow we convert the problem of model training into maximizing the lower bound L(θ, φ) with respect to the variational posteriors q(η), q(α) and q(t) = {q(t_mj)}, as well as the parameters θ and φ = {φ_m}. We can give some insight into the objective function L(θ, φ): the first term is exactly the marginal log likelihood of the video data and the second term is a variational bound of the conditional log likelihood of the classes; thus maximizing L(θ, φ) is equivalent to finding a set of model parameters and latent topics which fit the video data well and simultaneously make good predictions for video classes.\nDue to the conjugacy properties of the chosen distributions, we can directly calculate free-form variational posteriors q(η), q(α) and parameters φ:\n\nq(η) = N(η|E_η, V_η),   (18)\n\nq(α) = ∏_{y=1}^{C} ∏_{j=1}^{F} Gamma(α_yj|ĉ, d̂_yj),   (19)\n\nφ_m = ⟨T^r_m⟩_{q(t)} E_η,   (20)\n\nwhere ⟨·⟩_q denotes an expectation with respect to the distribution q and\n\nV_η = (∑_{m=1}^{M} ⟨(T^r_m)ᵀ A T^r_m⟩_{q(t)} + diag⟨α_yj⟩_{q(α)})^{−1},   E_η = V_η ∑_{m=1}^{M} ⟨(T^r_m)ᵀ⟩_{q(t)} (y_m + s_m),   (21)\n\nĉ = c + ½,   d̂_yj = d + ½⟨η_yj²⟩_{q(η)}.   (22)\n\nFor q(t), the calculation is not immediate because of the rectification. Inspired by [17], we have the following free-form solution:\n\nq(t_mj) = (ω_pos/Z) N(t_mj|µ_pos, σ²_pos) u(t_mj) + (ω_neg/Z) N(t_mj|µ_neg, σ²_neg) u(−t_mj),   (23)\n\nwhere u(·) is the unit step function. See Appendix A for the parameters of q(t_mj).\nGiven θ, by repeating the updates of Equations 18–20 and 23 to maximize L(θ, φ), we can obtain the variational posteriors q(η), q(α) and q(t). Then, given q(η), q(α) and q(t), we estimate θ by using stochastic gradient descent to maximize L(θ, φ).\n\n5\n\n\fThe derivatives of L(θ, φ) with respect to θ are given by:\n\n∂L(θ, φ)/∂W_ij = ⟨v_i t^r_j⟩_data − ⟨v_i t^r_j⟩_model + (1/M) ∑_{m=1}^{M} v_mi (⟨t_mj⟩_{q(t)} − ∑_{i=1}^{N} W_ij v_mi − K b_j),   (24)\n\n∂L(θ, φ)/∂a_i = ⟨v_i⟩_data − ⟨v_i⟩_model,   (25)\n\n∂L(θ, φ)/∂b_j = ⟨t^r_j⟩_data − ⟨t^r_j⟩_model + (K/M) ∑_{m=1}^{M} (⟨t_mj⟩_{q(t)} − ∑_{i=1}^{N} W_ij v_mi − K b_j),   (26)\n\nwhere the derivatives of ∑_{m=1}^{M} log P(v_m; θ) are the same as those in [5].\nThis leads to the following variational EM algorithm:\n\nE-step: 
Calculate the variational posteriors q(η), q(α) and q(t).\nM-step: Estimate the parameters θ = {W, a, b} by maximizing L(θ, φ).\n\nThese two steps are repeated until L(θ, φ) converges. For Multimodal RTM learning, we additionally calculate the gradients of θ_modl for each modality l in the M-step, while the updating rules are unchanged.\nAfter learning is completed, according to Equation 6 the prediction for new videos can be easily obtained:\n\nŷ = argmax_{y∈C} ⟨η_yᵀ⟩_{q(η)} ⟨t^r⟩_{p(t|v;θ)}.   (27)\n\n4 Experiments\n\nWe test our models on the Unstructured Social Activity Attribute (USAA) dataset¹ for social group activity recognition. Firstly, we present quantitative evaluations of RTM in the case of different modalities and comparisons with other supervised topic models (namely MedLDA and gClassRBM). Secondly, we compare Multimodal RTM with several baselines in the cases of plentiful and sparse training data respectively. In all experiments, contrastive divergence is used to efficiently approximate the derivatives of the marginal log likelihood, and unsupervised training of Replicated Softmax is used to initialize θ.\n\n4.1 Dataset and video representation\n\nThe USAA dataset consists of 8 semantic classes of social activity videos collected from the Internet. The eight classes are: birthday party, graduation party, music performance, non-music performance, parade, wedding ceremony, wedding dance and wedding reception. The dataset contains a total of 1466 videos, with approximately 100 videos per class for training and testing respectively. 
These videos range from 20 seconds to 8 minutes, with an average length of 3 minutes, and contain very complex and diverse content, which brings significant challenges for content analysis.\nEach video is represented using three modalities, i.e., static appearance, motion and audio. Specifically, three visual and audio local keypoint features are extracted for each video: scale-invariant feature transform (SIFT) [18], spatial-temporal interest points (STIP) [19] and mel-frequency cepstral coefficients (MFCC) [20]. The three features are then collected into BoW vectors (5000 dimensions for SIFT and STIP, and 4000 dimensions for MFCC) using a soft-weighting clustering algorithm, respectively.\n\n4.2 Model comparisons\n\nTo evaluate the discriminative power of video topics learned by RTM, we present quantitative classification results compared with other supervised topic models (MedLDA and gClassRBM) in the case of different modalities. We have tried our best to tune these compared models and report the best results.\n\n1 Available at http://www.eecs.qmul.ac.uk/~yf300/USAA/download/.\n\n6\n\n\fTable 1: Classification accuracy (%) of different supervised topic models for single-modal features.\n\n           |         SIFT          |         STIP          |         MFCC\n Topics    | MedLDA gClassRBM RTM  | MedLDA gClassRBM RTM  | MedLDA gClassRBM RTM\n 20 topics | 44.72  45.40   51.99  | 37.28  42.39   48.29  | 34.71  41.70   45.35\n 30 topics | 44.17  46.11   53.09  | 38.93  42.25   49.11  | 38.55  43.62   46.67\n 40 topics | 43.07  47.08   55.83  | 40.85  42.39   50.62  | 41.15  45.00   48.15\n 50 topics | 42.80  46.81   54.17  | 39.75  41.70   51.71  | 41.98  44.31   47.46\n 60 topics | 40.74  49.72   54.03  | 41.54  43.35   51.17  | 38.27  43.48   47.33\n\nTable 2: Classification accuracy (%) of different methods for multimodal features.\n\n Setting  | Topics     | Multimodal RTM | RS+SVM | Direct | SVM-UD+LR | SLAS+LR\n 100 Inst | 60 topics  | 60.22          | 54.60  | 66.0   | 65.0      | 65.0\n          | 90 topics  | 62.69          | 56.10  |        |           |\n          | 120 topics | 63.79          | 57.34  |        |           |\n          | 150 topics | 64.06          | 59.26  |        |           |\n          | 180 topics | 64.72          | 60.63  |        |           |\n 10 Inst  | 60 topics  | 38.68          | 23.73  | 29.0   | 37.0      | 40.0\n          | 90 topics  | 41.29          | 28.53  |        |           |\n          | 120 topics | 43.48          | 30.59  |        |           |\n          | 150 topics | 43.72          | 33.47  |        |           |\n          | 180 topics | 44.99          | 35.94  |        |           |\n\nTable 1 shows the classification accuracy of different models for three single-modal features: SIFT, STIP and MFCC. We can see that RTM achieves higher classification accuracy than MedLDA and gClassRBM in all cases, which demonstrates that, by leveraging sparse Bayesian learning to incorporate class label information into topic modeling, RTM can find more discriminative topical representations for complex video data.\n\n4.3 Baseline comparisons\n\nWe compare Multimodal RTM with the baselines in [4], which are the best results on the USAA dataset:\nDirect: Direct SVM or KNN classification on raw video BoW vectors (14000 dimensions), where SVM is used for experiments with more than 10 instances and KNN otherwise.\nSVM-UD+LR: SVM attribute classifiers learn 69 user-defined attributes, and then a logistic regression (LR) classifier is applied to the attribute classifier outputs.\nSLAS+LR: A semi-latent attribute space is learned, and then an LR classifier is applied to the 69 user-defined, 8 class-conditional and 8 latent topics.\nBesides, we also compare with another baseline in which the different modal topics extracted by Replicated Softmax are concatenated as video representations, and a multi-class SVM classifier [21] is learned from those representations. This baseline is denoted by RS+SVM.\nThe results are illustrated in Table 2. Here the number of topics of each modality is assumed to be the same. When the labeled training data is plentiful (100 instances per class), the classification performance of Multimodal RTM is similar to the baselines in [4]. 
However, we argue that our model learns a lower-dimensional latent semantic space which provides efficient video representations and can generalize better to larger or new datasets, because extra human-defined concepts are not required in our model. In the classification scenario where only a very small number of training instances are available (10 instances per class), Multimodal RTM achieves better performance with an appropriate number (e.g., ≥ 90) of topics, because the sparsity of relevance topics learned by RTM can effectively prevent overfitting to specific training instances. In addition, our model outperforms RS+SVM in both cases, which demonstrates the advantage of jointly learning topics and classifier weights through sparse Bayesian learning.\nIt is also interesting to examine the sparsity of the relevance topics. Figure 3 illustrates the degree of correlation between topics and two different classes. We can see that the learned relevance topics are very sparse, which leads to good generalization for new instances and robustness on small datasets.\n\n7\n\n\fFigure 3: Relevance topics discovered by RTM for two different classes. The vertical axis indicates the degree of positive and negative correlation.\n\n5 Conclusion\n\nThis paper has proposed a supervised topic model, the relevance topic model (RTM), to jointly learn discriminative latent topical representations and a sparse classifier for recognizing unstructured social group activity. In RTM, sparse Bayesian learning is integrated with an undirected topic model to discover sparse relevance topics. Rectified linear units are employed to better fit complex video data and facilitate the learning of the model. Efficient variational methods are developed for parameter estimation and inference. To further improve video classification performance, RTM is also extended to deal with multimodal data. 
Experimental results demonstrate that RTM can find more predictive video topics than other supervised topic models and achieve state-of-the-art classification performance, particularly in the scenario of lacking labeled training videos.\n\nAppendix A. Parameters of the free-form variational posterior q(t_mj)\n\nThe parameters of q(t_mj) (Equation 23) are as follows:\n\nω_pos = N(α|β, γ + 1),   σ²_pos = (γ^{−1} + 1)^{−1},   µ_pos = σ²_pos(α/γ + β),   (28)\n\nω_neg = N(α|0, γ),   σ²_neg = 1,   µ_neg = β,   (29)\n\nZ = ½ ω_pos erfc(−µ_pos/√(2σ²_pos)) + ½ ω_neg erfc(µ_neg/√(2σ²_neg)),   (30)\n\nwhere erfc(·) is the complementary error function and\n\nγ = ⟨η_·jᵀ A η_·j⟩_{q(η)}^{−1},   α = γ ⟨η_·jᵀ(y_m + s_m − ∑_{j'≠j} A η_·j' t^r_{mj'})⟩_{q(η)q(t)},   (31)\n\nβ = ∑_{i=1}^{N} W_ij v_mi + K b_j.   (32)\n\nWe can see that q(t_mj) depends on expectations over η and {t_mj'}_{j'≠j}, which is consistent with the graphical model representation of RTM in Figure 2.\n\nAcknowledgments\n\nThis work was supported by the National Basic Research Program of China (2012CB316300), the Hundred Talents Program of CAS, the National Natural Science Foundation of China (61175003, 61135002, 61203252), and the Tsinghua National Laboratory for Information Science and Technology Cross-discipline Foundation.\n\nReferences\n\n[1] C. Schuldt, I. Laptev, and B. Caputo. Recognizing human actions: a local svm approach. 
In\n\nICPR, 2004.\n\n8\n\n\f[2] M. Rodriguez, J. Ahmed, and M. Shah. Action mach a spatio-temporal maximum average\n\ncorrelation height \ufb01lter for action recognition. In CVPR, 2008.\n\n[3] Ucf50 action dataset. \u201chttp://vision.eecs.ucf.edu/data/ucf50.rar\u201d.\n[4] Y.W. Fu, T.M. Hospedales, T. Xiang, and S.G. Gong. Attribute learning for understanding\n\nunstructured social activity. In ECCV, 2012.\n\n[5] R. Salakhutdinov and G.E. Hinton. Replicated softmax: an undirected topic model. In NIPS,\n\n2009.\n\n[6] M.E. Tipping. Sparse bayesian learning and the relevance vector machine. JMLR, 2001.\n[7] V. Nair and G.E. Hinton. Recti\ufb01ed linear units improve restricted boltzmann machines. In\n\nICML, 2010.\n\n[8] D. Bohning. Multinomial logistic regression algorithm. AISM, 1992.\n[9] P. Turaga, R. Chellappa, V.S. Subrahmanian, and O. Udrea. Machine recognition of human\n\nactivities: a survey. TCSVT, 2008.\n\n[10] J. Varadarajan, R. Emonet, and J.-M. Odobez. A sequential topic model for mining recurrent\n\nactivities from long term video logs. IJCV, 2013.\n\n[11] D. Blei, A.Y. Ng, and M.I. Jordan. Latent dirichlet allocation. JMLR, 2003.\n[12] J. Zhu, A. Ahmed, and E.P. Xing. Medlda: Maximum margin supervised topic models. JMLR,\n\n2012.\n\n[13] H. Larochelle, M. Mandel, R. Pascanu, and Y. Bengio. Learning algorithms for the classi\ufb01ca-\n\ntion restricted boltzmann machine. JMLR, 2012.\n\n[14] R.M. Neal. Bayesian learning for neural networks. PhD thesis, University of Toronto, 1995.\n[15] G.E. Hinton. Training products of experts by minimizing contrastive divergence. Neural\n\nComputation, 2002.\n\n[16] K.P. Murphy. Machine learning: a probabilistic perspective. MIT Press, 2012.\n[17] M. Harva and A. Kaban. Variational learning for recti\ufb01ed factor analysis. Signal Processing,\n\n2007.\n\n[18] D.G. Lowe. Distinctive image features from scale-invariant keypoints. IJCV, 2004.\n[19] I. Laptev. On space-time interest points. IJCV, 2005.\n[20] B. Logan. 
Mel frequency cepstral coef\ufb01cients for music modeling. ISMIR, 2000.\n[21] I. Tsochantaridis, T. Hofmann, T. Joachims, and Y. Altun. Support vector learning for interde-\n\npendent and structured output spaces. In ICML, 2004.\n\n9\n\n\f", "award": [], "sourceid": 1227, "authors": [{"given_name": "Fang", "family_name": "Zhao", "institution": "Chinese Academy of Sciences"}, {"given_name": "Yongzhen", "family_name": "Huang", "institution": "Chinese Academy of Sciences"}, {"given_name": "Liang", "family_name": "Wang", "institution": "Chinese Academy of Sciences"}, {"given_name": "Tieniu", "family_name": "Tan", "institution": "Chinese Academy of Sciences"}]}