{"title": "Bilinear classifiers for visual recognition", "book": "Advances in Neural Information Processing Systems", "page_first": 1482, "page_last": 1490, "abstract": "We describe an algorithm for learning bilinear SVMs. Bilinear classifiers are a discriminative variant of bilinear models, which capture the dependence of data on multiple factors. Such models are particularly appropriate for visual data that is better represented as a matrix or tensor, rather than a vector. Matrix encodings allow for more natural regularization through rank restriction. For example, a rank-one scanning-window classifier yields a separable filter. Low-rank models have fewer parameters and so are easier to regularize and faster to score at run-time. We learn low-rank models with bilinear classifiers. We also use bilinear classifiers for transfer learning by sharing linear factors between different classification tasks. Bilinear classifiers are trained with biconvex programs. Such programs are optimized with coordinate descent, where each coordinate step requires solving a convex program - in our case, we use a standard off-the-shelf SVM solver. We demonstrate bilinear SVMs on difficult problems of people detection in video sequences and action classification of video sequences, achieving state-of-the-art results in both.", "full_text": "Bilinear classi\ufb01ers for visual recognition\n\nHamed Pirsiavash\n\nDeva Ramanan\n\nCharless Fowlkes\n\nDepartment of Computer Science\nUniversity of California at Irvine\n\n{hpirsiav,dramanan,fowlkes}@ics.uci.edu\n\nAbstract\n\nWe describe an algorithm for learning bilinear SVMs. Bilinear classi\ufb01ers are a\ndiscriminative variant of bilinear models, which capture the dependence of data\non multiple factors. Such models are particularly appropriate for visual data that\nis better represented as a matrix or tensor, rather than a vector. Matrix encod-\nings allow for more natural regularization through rank restriction. 
For example, a rank-one scanning-window classifier yields a separable filter. Low-rank models have fewer parameters and so are easier to regularize and faster to score at run-time. We learn low-rank models with bilinear classifiers. We also use bilinear classifiers for transfer learning by sharing linear factors between different classification tasks. Bilinear classifiers are trained with biconvex programs. Such programs are optimized with coordinate descent, where each coordinate step requires solving a convex program - in our case, we use a standard off-the-shelf SVM solver. We demonstrate bilinear SVMs on difficult problems of people detection in video sequences and action classification of video sequences, achieving state-of-the-art results in both.

1 Introduction

Linear classifiers (i.e., w^T x > 0) are the basic building block of statistical prediction. Though quite standard, they underlie many competitive approaches to various prediction tasks. We focus here on the task of visual recognition in video - "does this spatiotemporal window contain an object?" In this domain, scanning-window templates trained with linear classification yield state-of-the-art performance on many benchmark datasets [6, 10, 7].

Bilinear models, introduced into the vision community by [23], provide an interesting generalization of linear models. Here, data points are modelled as the confluence of a pair of factors. Typical examples include digits affected by style and content factors, or faces affected by pose and illumination factors. Conditioned on one factor, the model is linear in the other. More generally, one can define multilinear models [25] that are linear in one factor conditioned on the others.

Inspired by the success of bilinear models in data modeling, we introduce discriminative bilinear models for classification.
We describe a method for training bilinear (multilinear) SVMs with biconvex (multiconvex) programs. A function f : X \times Y \to R is called biconvex if f(x, y) is convex in y for fixed x \in X and is convex in x for fixed y \in Y. Such functions are well-studied in the optimization literature [1, 14]. While not convex, they admit efficient coordinate descent algorithms that solve a convex program at each step. We show that bilinear SVM classifiers can be optimized with an off-the-shelf linear SVM solver. This is advantageous because we can leverage large-scale, highly-tuned solvers (we use [13]) to learn bilinear classifiers with tens of thousands of features and hundreds of millions of examples.

Figure 1: Many approaches for visual recognition employ linear classifiers on scanned windows. Here we illustrate windows processed into gradient-based features [6, 12]. We show an image window (left) and a visualization of the extracted HOG descriptor (middle), which itself is better represented as gradient features extracted from different orientation channels (right). Most learning formulations ignore this natural representation of visual data as matrices or tensors. Wolf et al. [26] show that one can produce more meaningful schemes for regularization and parameter reduction through low-rank approximations of a tensor model. Our contribution involves casting the resulting learning problem as a biconvex optimization. Such formulations can leverage off-the-shelf solvers in an efficient two-stage optimization. We also demonstrate that bilinear models have additional advantages for transfer learning and run-time efficiency.

While bilinear models are often motivated from the perspective of increasing the flexibility of a linear model, our motivation is reversed - we use them to reduce the number of parameters of a weight vector that is naturally represented as a matrix or tensor W.
We reduce parameters by factorizing W into a product of low-rank factors. This parameter reduction can reduce over-fitting and improve run-time efficiency because fewer operations are needed to score an example. These are important considerations when training large-scale spatial or spatiotemporal template-classifiers. In our case, the state-of-the-art features we use to detect pedestrians are based on histograms of gradient (HOG) features [6] or spatio-temporal generalizations [7], as shown in Fig. 1. The combined set of gradient and optical-flow histogram features is quite large, motivating the need for dimensionality reduction.

Finally, by sharing factors across different classification problems, we introduce a novel formulation of transfer learning. We believe that transfer through shared factors is an important benefit of multilinear classifiers which can help ameliorate overfitting.

We begin with a discussion of related work in Sec. 2. We then explicitly define our bilinear classifier in Sec. 3. We illustrate several applications and motivations for the bilinear framework in Sec. 4. In Sec. 5, we describe extensions to our model for the multilinear and multiclass case. We provide several experiments on visual recognition in the video domain in Sec. 6, significantly improving on the state-of-the-art system for finding people in video sequences [7] both in performance and speed. We also illustrate our approach on the task of action recognition, showing that transfer learning can ameliorate the small-sample problem that plagues current benchmark datasets [18, 19].

2 Related Work

Tenenbaum and Freeman [23] introduced bilinear models into the vision community to model data generated from multiple linear factors. Such methods have been extended to the multilinear setting, e.g. by [25], but such models were generally used as a factor analysis or density estimation technique.
Recent work has explored extensions of tensor models to discriminant analysis [22, 27], while our work focuses on an efficient max-margin formulation of multilinear models.

There is also a body of related work on learning low-rank matrices from the collaborative filtering literature [21, 17, 16]. Such approaches typically define a convex objective by replacing the Tr(W^T W) regularization term in our objective (6) with the trace norm Tr(\sqrt{W^T W}). This can be seen as an alternate "soft" rank restriction on W that retains convexity, because the trace norm of a matrix is the sum of its singular values rather than the number of its nonzero singular values (the rank) [3]. Such a formulation would be interesting to pursue in our scenario, but as [17, 16] note, the resulting SDP is difficult to solve. Our approach, though non-convex, leverages existing SVM solvers in the inner loop of a coordinate descent optimization that enforces a hard low-rank condition.

Our bilinear-SVM formulation is closely related to the low-rank SVM formulation of [26]. Wolf et al. convincingly argue that many forms of visual data are better modeled as matrices rather than vectors - an important motivation for our work (see Fig. 1). They analyze the VC dimension of rank-constrained linear classifiers and demonstrate an iterative weighting algorithm for approximately solving an SVM problem in which the rank of W acts as a regularizer. They also outline an algorithm similar to the one we propose here which has a hard constraint on the rank, but they include an additional orthogonality constraint on the columns of the factors that compose W. This requires cycling through each column separately during the optimization, which is presumably slower and may introduce additional local minima.
This in turn may explain why they did not present experimental results for their hard-rank formulation.

Our work also stands apart from Wolf et al. in our focus on multi-task learning, which dates back at least to the work of Caruana [4]. Our formulation is most similar to that of Ando and Zhang [2]. They describe a procedure for learning linear prediction models for multiple tasks under the assumption that all models share a component living in a common low-dimensional subspace. While this formulation allows for sharing, it does not reduce the number of model parameters as does our approach of sharing factors.

3 Model definition

Linear predictors are of the form

f_w(x) = w^T x.   (1)

Existing formulations of linear classification typically treat x as a vector. We argue that for many problems, particularly in visual recognition, x is more naturally represented as a matrix or tensor. For example, many state-of-the-art window-scanning approaches train a classifier defined over local feature vectors extracted over a spatial neighborhood. The Dalal and Triggs detector [6] is a particularly popular pedestrian detector where x is naturally represented as a concatenation of histogram of gradient (HOG) feature vectors extracted from a spatial grid of n_y \times n_x cells, where each local HOG descriptor is itself composed of n_f features. In this case, it is natural to represent an example x as a tensor X \in R^{n_y \times n_x \times n_f}. For ease of exposition, we develop the mathematics for a simpler matrix representation, fixing n_f = 1. This holds, for example, when learning templates defined on grayscale pixel values.

We generalize (1) for a matrix X with

f_W(X) = Tr(W^T X),   (2)

where both X and W are n_y \times n_x matrices. One advantage of the matrix representation is that it is more natural to regularize W and restrict the number of parameters.
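The equivalence of the matrix form (2) and the vector form (1) under vectorization can be checked directly; a minimal numpy sketch with made-up dimensions (not from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
ny, nx = 14, 6                      # hypothetical template size
W = rng.standard_normal((ny, nx))   # weight matrix
X = rng.standard_normal((ny, nx))   # feature matrix for one window

# Matrix form (2): f_W(X) = Tr(W^T X)
score_matrix = np.trace(W.T @ X)

# Vector form (1): f_w(x) = w^T x with w = vec(W), x = vec(X)
score_vector = W.ravel() @ X.ravel()

assert np.isclose(score_matrix, score_vector)
```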
For example, one natural mechanism for reducing the degrees of freedom in a matrix is to reduce its rank. We show that one can obtain a biconvex objective function by enforcing a hard restriction on the rank. Specifically, we enforce the rank of W to be at most d \le \min(n_y, n_x). This restriction can be implemented by writing

W = W_y W_x^T   where   W_y \in R^{n_y \times d}, W_x \in R^{n_x \times d}.   (3)

This allows us to write the final predictor explicitly as the following bilinear function:

f_{W_y,W_x}(X) = Tr(W_x W_y^T X) = Tr(W_y^T X W_x).   (4)

3.1 Learning

Assume we are given a set of training data and label pairs {x_n, y_n}. We would like to learn a model with low error on the training data. One successful approach is a support vector machine (SVM). We can rewrite the linear SVM formulation for w and x_n with matrices W and X_n using the trace operator:

L(w) = \frac{1}{2} w^T w + C \sum_n \max(0, 1 - y_n w^T x_n).   (5)

L(W) = \frac{1}{2} Tr(W^T W) + C \sum_n \max(0, 1 - y_n Tr(W^T X_n)).   (6)

The above formulations are identical when w and x_n are the vectorized elements of matrices W and X_n. Note that (6) is convex. We wish to restrict the rank of W to be d. Plugging in W = W_y W_x^T, we obtain our biconvex objective function:

L(W_y, W_x) = \frac{1}{2} Tr(W_x W_y^T W_y W_x^T) + C \sum_n \max(0, 1 - y_n Tr(W_x W_y^T X_n)).   (7)

In the next section, we show that optimizing (7) over one matrix holding the other fixed is a convex program - specifically, a QP equivalent to a standard SVM. This makes (7) biconvex.

3.2 Coordinate descent

We can optimize (7) with a coordinate descent algorithm that solves for one set of parameters holding the other fixed. Each step in this descent is a convex optimization that can be solved with a standard SVM solver.
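As a quick sanity check that (7) is just (6) evaluated at W = W_y W_x^T, one can compare the two objectives numerically; an illustrative sketch with hypothetical dimensions (not the authors' code):

```python
import numpy as np

rng = np.random.default_rng(1)
ny, nx, d, C, N = 8, 5, 2, 1.0, 20
Wy = rng.standard_normal((ny, d))
Wx = rng.standard_normal((nx, d))
Xs = rng.standard_normal((N, ny, nx))   # toy training examples
ys = rng.choice([-1.0, 1.0], size=N)    # toy labels

W = Wy @ Wx.T                           # rank-d weight matrix

def hinge(z):
    return np.maximum(0.0, 1.0 - z)

# Objective (6) evaluated at W = Wy Wx^T
obj_unfactored = 0.5 * np.trace(W.T @ W) + C * sum(
    hinge(y * np.trace(W.T @ X)) for X, y in zip(Xs, ys))

# Factored objective (7)
obj_factored = 0.5 * np.trace(Wx @ Wy.T @ Wy @ Wx.T) + C * sum(
    hinge(y * np.trace(Wx @ Wy.T @ X)) for X, y in zip(Xs, ys))

assert np.isclose(obj_unfactored, obj_factored)
```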
Specifically, consider

\min_{W_y} L(W_y, W_x) = \frac{1}{2} Tr(W_y A W_y^T) + C \sum_n \max(0, 1 - y_n Tr(W_y^T X_n W_x)).   (8)

The above optimization is convex in W_y but does not directly translate into the trace-based SVM formulation from (6). To do so, let us reparametrize W_y as \tilde{W}_y:

\min_{\tilde{W}_y} L(\tilde{W}_y, W_x) = \frac{1}{2} Tr(\tilde{W}_y^T \tilde{W}_y) + C \sum_n \max(0, 1 - y_n Tr(\tilde{W}_y^T \tilde{X}_n)),   (9)
where \tilde{W}_y = W_y A^{1/2}, \tilde{X}_n = X_n W_x A^{-1/2}, and A = W_x^T W_x.

One can see that (9) is structurally equivalent to (6) and hence (5). Hence it can be solved with a standard off-the-shelf SVM solver. Given a solution, we can recover the original parameters by W_y = \tilde{W}_y A^{-1/2}. Recall that A = W_x^T W_x is a matrix of size d \times d that is in general invertible for small d. Using a similar derivation, one can show that \min_{W_x} L(W_y, W_x) is also equivalent to a standard convex SVM formulation.

4 Motivation

We outline here a number of motivations for the biconvex objective function defined above.

4.1 Regularization

Bilinear models allow a natural way of restricting the number of parameters in a linear model. From this perspective, they are similar to approaches that apply PCA for dimensionality reduction prior to learning. Felzenszwalb et al. [11] find that PCA can reduce the size of HOG features by a factor of 4 without a loss in performance. Image windows are naturally represented as a 3D tensor X \in R^{n_y \times n_x \times n_f}, where n_f is the dimensionality of a HOG feature. Let us "reshape" X into a 2D matrix X \in R^{n_{xy} \times n_f} where n_{xy} = n_x n_y. We can restrict the rank of the corresponding model to d by defining W = W_{xy} W_f^T. W_{xy} \in R^{n_{xy} \times d} is equivalent to a vectorized spatial template defined over d features at each spatial location, while W_f \in R^{n_f \times d} defines a set of d basis vectors spanning R^{n_f}. This basis can be loosely interpreted as the PCA-basis estimated in [11]. In our biconvex formulation, the basis vectors are not constrained to be orthogonal, but they are learned discriminatively and jointly with the template W_{xy}. We show in Sec. 6 that this often significantly outperforms PCA-based dimensionality reduction.

4.2 Efficiency

Scanning-window classifiers are often implemented using convolutions [6, 12]. For example, the product Tr(W^T X) can be computed for all image windows X with n_f convolutions. By restricting W to be W_{xy} W_f^T, we project features into a d-dimensional subspace spanned by W_f, and compute the final score with d convolutions. One can further improve efficiency by using the same d-dimensional feature space for a large number of different object templates - this is precisely the basis of our transfer approach in Sec. 4.3. This can result in significant savings in computation. For example, spatio-temporal templates for finding objects in video tend to have large n_f since multiple features are extracted from each time-slice.

Consider a rank-1 restriction of W_x and W_y. This corresponds to a separable filter W_{xy}. Hence, our formulation can be used to learn separable scanning-window classifiers. Separable filters can be evaluated efficiently with two one-dimensional convolutions. This can result in significant savings because computing the score at a window is now O(n_x + n_y) rather than O(n_x n_y).

4.3 Transfer

Assume we wish to train M predictors and are given {x_n^m, y_n^m} training data pairs for each prediction problem 1 \le m \le M. One can write all M learning problems with a single optimization:

L(W^1, \ldots, W^M) = \sum_m \frac{1}{2} Tr(W^{mT} W^m) + \sum_m C^m \sum_n \max(0, 1 - y_n^m Tr(W^{mT} X_n^m)).   (10)

As written, the problem above can be optimized over each W^m independently. We can introduce a rank constraint on W^m that induces a low-dimensional subspace projection of X_n^m. To transfer knowledge between the classification tasks, we require all tasks to use the same low-dimensional subspace projection by sharing the same feature matrix:

W^m = W_{xy}^m W_f^T.

Note that the leading dimension of W_{xy}^m can depend on m. This fact allows for X_n^m from different tasks to be of varying sizes. In our motivating application, we can learn a family of HOG templates of varying spatial dimension that share a common HOG feature subspace. The coordinate descent algorithm from Sec. 3.2 naturally applies to the multi-task setting. Given a fixed W_f, it is straightforward to independently optimize W_{xy}^m by defining A = W_f^T W_f. Given a fixed set of W_{xy}^m, a single matrix W_f is learned for all classes by computing:

\min_{\tilde{W}_f} L(\tilde{W}_f, W_{xy}^1, \ldots, W_{xy}^M) = \frac{1}{2} Tr(\tilde{W}_f^T \tilde{W}_f) + \sum_m C^m \sum_n \max(0, 1 - y_n^m Tr(\tilde{W}_f^T \tilde{X}_n^m)),
where \tilde{W}_f = W_f A^{1/2}, \tilde{X}_n^m = X_n^{mT} W_{xy}^m A^{-1/2}, and A = \sum_m W_{xy}^{mT} W_{xy}^m.

If all problems share the same slack penalty (C^m = C), the above can be optimized with an off-the-shelf SVM solver.
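The reparametrization trick behind (9) - and its analogue in the shared-basis step above - can be verified numerically. A sketch with illustrative dimensions, computing A^{1/2} by symmetric eigendecomposition (not the authors' code):

```python
import numpy as np

rng = np.random.default_rng(2)
ny, nx, d = 8, 5, 2
Wy = rng.standard_normal((ny, d))
Wx = rng.standard_normal((nx, d))
X = rng.standard_normal((ny, nx))

A = Wx.T @ Wx                        # d x d, invertible for small d
evals, V = np.linalg.eigh(A)
A_half = V @ np.diag(np.sqrt(evals)) @ V.T        # A^{1/2}
A_nhalf = V @ np.diag(1.0 / np.sqrt(evals)) @ V.T  # A^{-1/2}

Wy_t = Wy @ A_half                   # tilde(Wy) = Wy A^{1/2}
X_t = X @ Wx @ A_nhalf               # tilde(Xn) = Xn Wx A^{-1/2}

# Regularizer and margin terms of (9) agree with those of (7)/(8)
assert np.isclose(np.trace(Wy_t.T @ Wy_t), np.trace(Wy @ A @ Wy.T))
assert np.isclose(np.trace(Wy_t.T @ X_t), np.trace(Wy.T @ X @ Wx))

# Original parameters are recovered via Wy = tilde(Wy) A^{-1/2}
assert np.allclose(Wy_t @ A_nhalf, Wy)
```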
In the general case, a minor modification is needed to allow for slack-rescaling [24]. In practice, n_f can be large for spatio-temporal features extracted from multiple temporal windows. The above formulation is convenient in that we can use data examples from many classification tasks to learn a good subspace for spatiotemporal features.

5 Extensions

5.1 Multilinear

In many cases, a data point x is more naturally represented as a multidimensional matrix or a high-order tensor. For example, spatio-temporal templates are naturally represented as a 4th-order tensor capturing the width, height, temporal extent, and the feature dimension of a spatio-temporal window. For ease of exposition let us assume the feature dimension is 1 and so we write a feature vector x as X \in R^{n_x \times n_y \times n_t}. We denote the elements of a tensor X as x_{ijk}. Following [15], we define a scalar product of two tensors W and X as the sum of their element-wise products:

\langle W, X \rangle = \sum_{ijk} w_{ijk} x_{ijk}.   (11)

With the above definition, we can generalize our trace-based objective function (6) to higher-order tensors:

L(W) = \frac{1}{2} \langle W, W \rangle + C \sum_n \max(0, 1 - y_n \langle W, X_n \rangle).   (12)

We wish to impose a rank restriction on the tensor W. The notion of rank for tensors of order greater than two is subtle - for example, there are alternate approaches for defining a high-order SVD [25, 15]. For our purposes, we follow [20] and define W as a rank-d tensor by writing it as a product of matrices W^y \in R^{n_y \times d}, W^x \in R^{n_x \times d}, W^t \in R^{n_t \times d}:

w_{ijk} = \sum_{s=1}^d w^y_{is} w^x_{js} w^t_{ks}.   (13)

Combining (11) - (13), it is straightforward to show that L(W^y, W^x, W^t) is convex in one matrix given the others. This means our coordinate descent algorithm from Sec. 3.2 still applies. As an example, consider the case when d = 1.
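(As an aside, the rank-d construction (13) and the inner product (11) are straightforward to express with einsum; a toy-sized sketch with made-up dimensions:)

```python
import numpy as np

rng = np.random.default_rng(3)
ny, nx, nt, d = 4, 5, 3, 2
Wy = rng.standard_normal((ny, d))
Wx = rng.standard_normal((nx, d))
Wt = rng.standard_normal((nt, d))
X = rng.standard_normal((ny, nx, nt))   # toy spatio-temporal window

# Rank-d tensor (13): w_ijk = sum_s Wy[i,s] Wx[j,s] Wt[k,s]
W = np.einsum('is,js,ks->ijk', Wy, Wx, Wt)

# Inner product (11), evaluated on the full tensor and factor by factor
score_full = np.einsum('ijk,ijk->', W, X)
score_factored = sum(
    np.einsum('i,j,k,ijk->', Wy[:, s], Wx[:, s], Wt[:, s], X) for s in range(d))

assert np.isclose(score_full, score_factored)
```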
This rank restriction forces the spatio-temporal template W to be separable along the x, y, and t axes, allowing for window-scan scoring by three one-dimensional convolutions. This greatly increases run-time efficiency for spatio-temporal templates.

5.2 Bilinear structural SVMs

We outline here an extension of our formalism to structural SVMs [24]. Structural SVMs learn models that predict a structured label y_n given a data point x_n. Given training data of the form {x_n, y_n}, the learning problem is:

L(w) = \frac{1}{2} w^T w + C \sum_n \max_y (l(y_n, y) - w^T \Delta\phi(x_n, y_n, y)),   (14)
where \Delta\phi(x_n, y_n, y) = \phi(x_n, y_n) - \phi(x_n, y),

and where l(y_n, y) is the loss of assigning example n the label y given that its true label is y_n. The above optimization problem is convex in w. As a concrete example, consider the task of learning a multiclass SVM for n_c classes using the formalism of Crammer and Singer [5]. Here,

w = [w_1^T \ldots w_{n_c}^T]^T,

where each w_i \in R^{n_x} can be interpreted as a classifier for class i. The corresponding \phi(x, y) will be a sparse vector with n_x nonzero values at those indices associated with the yth class. It is natural to model the relevant vectors as matrices W, X_n, \Delta\Phi that lie in R^{n_c \times n_x}. We can enforce W to be of rank d < \min(n_c, n_x) by defining W = W_c W_x^T where W_c \in R^{n_c \times d} and W_x \in R^{n_x \times d}. For example, one may expect template classifiers that classify n_c different human actions to reside in a d-dimensional subspace.
The resulting biconvex objective function is

L(W_c, W_x) = \frac{1}{2} Tr(W_x W_c^T W_c W_x^T) + C \sum_n \max_y (l(y_n, y) - Tr(W_x W_c^T \Delta\Phi(X_n, y_n, y))).   (15)

Using our previous arguments, it is straightforward to show that the above objective is biconvex and that each step of the coordinate descent algorithm reduces to a standard structural SVM problem.

6 Experiments

We focus our experiments on the task of visual recognition using spatio-temporal templates. This problem domain has large feature sets obtained from histograms of gradients and histograms of optical flow computed from a frame pair. We illustrate our method on two challenging tasks using two benchmark datasets - detecting pedestrians in video sequences from the INRIA-Motion database [7] and classifying human actions in the UCF-Sports dataset [18].

We model features computed from frame pairs as matrices X \in R^{n_{xy} \times n_f}, where n_{xy} = n_x n_y is the size of the vectorized spatial template and n_f is the dimensionality of our combined gradient and flow feature space. We use the histogram of gradient and flow feature set from [7]. Our bilinear model learns a classifier of the form W_{xy} W_f^T where W_{xy} \in R^{n_{xy} \times d} and W_f \in R^{n_f \times d}. Typical values include n_y = 14, n_x = 6, n_f = 84, and d = 5 or 10.

6.1 Spatiotemporal pedestrian detection

Scoring a detector: Template classifiers are often scored using missed detections versus false-positives-per-window statistics. However, recent analysis suggests such measurements can be misleading [9]. We opt for the scoring criteria outlined by the widely-acknowledged PASCAL competition [10], which looks at average precision (AP) results obtained after running the detector on cluttered video sequences and suppressing overlapping detections.

Baseline: We compare with the linear spatiotemporal-template classifier from [7].
The static-image detector counterpart is a well-known state-of-the-art system for finding pedestrians [6]. Surprisingly, when scoring AP for person detection on the INRIA-motion dataset, we find that the spatiotemporal model performs worse than the static-image model. This is corroborated by personal communication with the authors as well as by Dalal's thesis [8]. We found that aggressive SVM cutting-plane optimization algorithms [13] were needed for the spatiotemporal model to outperform the spatial model. This suggests our linear baseline is the true state-of-the-art system for finding people in video sequences. We also compare results with an additional rank-reduced baseline obtained by setting W_f to the basis returned by a PCA projection of the feature space from n_f to d dimensions. We use this PCA basis to initialize our coordinate descent algorithm when training our bilinear models. We show precision-recall curves in Fig. 2. We refer the reader to the caption for a detailed analysis, but our bilinear optimization seems to produce the state-of-the-art system for finding people in video sequences, while being an order of magnitude faster than previous approaches.

6.2 Human action classification

Action classification requires labeling a video sequence with one of n_c action labels. We do this by training n_c 1-vs-all action templates. Template detections from a video sequence are pooled together to output a final action label. We experimented with different voting schemes and found that a second-layer SVM classifier defined over the maximum score (over the entire video) for each template performed well. Our future plan is to integrate the video class directly into the training procedure using our bilinear structural SVM formulation.

Action recognition datasets tend to be quite small and limited.
For example, up until recently, the norm consisted of scripted activities performed against controlled, simplistic backgrounds. We focus our results on the relatively new UCF Sports Action dataset, consisting of non-scripted sequences of cluttered sports videos. Unfortunately, there have been few published results on this dataset, and the initial work [18] uses a slightly different set of classes than those which are available online. The published average class confusion is 69.2%, obtained with leave-one-out cross validation. Using 2-fold cross validation (and hence significantly less training data), our bilinear template achieves a score of 64.8% (Fig. 3). Again, we see a large improvement over linear and PCA-based approaches. While not directly comparable, these results suggest our model is competitive with the state of the art.

Transfer: We use the UCF dataset to evaluate transfer learning in Fig. 4. We consider a small-sample scenario in which one has only two example video sequences of each action class. Under this scenario, we train one bilinear model in which the feature basis W_f is optimized independently for each action class, and another in which the basis is shared across all classes. The independently-trained model tends to overfit to the training data for multiple values of C, the slack penalty from (6). The joint model clearly outperforms the independently-trained models.

7 Conclusion

We have introduced a generic framework for multilinear classifiers that are efficient to train with existing linear solvers. Multilinear classifiers exploit the natural matrix and/or tensor representation of spatiotemporal data. For example, this allows one to learn separable spatio-temporal templates for finding objects in video. Multilinear classifiers also allow for factors to be shared across classification tasks, providing a novel form of transfer learning.
In our future experiments, we wish to demonstrate transfer between domains such as pedestrian detection and action classification.

Figure 2: Our results on the INRIA-motion database [7]. We evaluate results using average precision, using the well-established protocol outlined in [10]. The baseline curve is our implementation of the HOG+flow template from [7]. The size of the feature vector is over 7,000 dimensions. Using PCA to reduce the dimensionality by 10X results in a significant performance hit. Using our bilinear formulation with the same low-dimensional restriction, we obtain better performance than the original detector while being 10X faster. We show example detections on video clips on the right.

Figure 3: Our results on the UCF Sports Action dataset [18]. We show classification results obtained from 2-fold cross validation. Our bilinear model provides a strong improvement over both the linear and PCA baselines. We show class confusion matrices, where light values correspond to correct classification. We label each matrix with the average classification rate over all classes.

Figure 4: We show results for transfer learning on the UCF action recognition dataset with limited training data - 2 training videos for each of 12 action classes. In the top table row, we show results for independently learning a subspace for each action class. In the bottom table row, we show results for jointly learning a single subspace that is transferred across classes. In both cases, the regularization parameter C was set on held-out data. The jointly-trained model is able to leverage training data from across all classes to learn the feature space W_f, resulting in overall better performance. On the right, we show low-rank models W = W_{xy} W_f^T during iterations of the coordinate descent.
Note that the head and shoulders of the model are blurred out in iteration 1, which uses PCA, but after the biconvex training procedure discriminatively updates the basis, the final model is sharper at the head and shoulders.

[Figure data: Fig. 2 precision-recall - Bilinear AP = 0.795, Baseline AP = 0.765, PCA AP = 0.698; Fig. 3 average classification - Bilinear (.648), Linear (.518), PCA (.444); Fig. 4 table - Ind (C=.01): .222, .289 vs. Joint (C=.1): .267, .356 over iterations 1-2.]

References

[1] F.A. Al-Khayyal and J.E. Falk. Jointly constrained biconvex programming. Mathematics of Operations Research, pages 273-286, 1983.

[2] R.K. Ando and T. Zhang. A framework for learning predictive structures from multiple tasks and unlabeled data. The Journal of Machine Learning Research, 6:1817-1853, 2005.

[3] S.P. Boyd and L. Vandenberghe.
Convex optimization. Cambridge University Press, 2004.

[4] R. Caruana. Multitask learning. Machine Learning, 28(1):41–75, 1997.

[5] K. Crammer and Y. Singer. On the algorithmic implementation of multiclass kernel-based vector machines. The Journal of Machine Learning Research, 2:265–292, 2002.

[6] N. Dalal and B. Triggs. Histograms of oriented gradients for human detection. In IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR), volume 1, 2005.

[7] N. Dalal, B. Triggs, and C. Schmid. Human detection using oriented histograms of flow and appearance. Lecture Notes in Computer Science, 3952:428, 2006.

[8] Navneet Dalal. Finding People in Images and Video. PhD thesis, Institut National Polytechnique de Grenoble / INRIA Grenoble, July 2006.

[9] P. Dollár, C. Wojek, B. Schiele, and P. Perona. Pedestrian detection: A benchmark. In CVPR, June 2009.

[10] M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman. The PASCAL Visual Object Classes Challenge 2008 (VOC2008) Results. http://www.pascal-network.org/challenges/VOC/voc2008/workshop/index.html.

[11] P. Felzenszwalb, R. Girshick, D. McAllester, and D. Ramanan. Object detection with discriminatively trained part based models. PAMI, in submission.

[12] P. Felzenszwalb, D. McAllester, and D. Ramanan. A discriminatively trained, multiscale, deformable part model. In Computer Vision and Pattern Recognition (CVPR), Anchorage, USA, June 2008.

[13] V. Franc and S. Sonnenburg. Optimized cutting plane algorithm for support vector machines. In Proceedings of the 25th International Conference on Machine Learning, pages 320–327. ACM, New York, NY, USA, 2008.

[14] J. Gorski, F. Pfeuffer, and K. Klamroth. Biconvex sets and optimization with biconvex functions: a survey and extensions. Mathematical Methods of Operations Research, 66(3):373–407, 2007.

[15] L.D. Lathauwer, B.D. Moor, and J.
Vandewalle. A multilinear singular value decomposition. SIAM J. Matrix Anal. Appl., 1995.

[16] N. Loeff and A. Farhadi. Scene discovery by matrix factorization. In Proceedings of the 10th European Conference on Computer Vision: Part IV, pages 451–464. Springer-Verlag, Berlin, Heidelberg, 2008.

[17] J.D.M. Rennie and N. Srebro. Fast maximum margin matrix factorization for collaborative prediction. In International Conference on Machine Learning, volume 22, page 713, 2005.

[18] M.D. Rodriguez, J. Ahmed, and M. Shah. Action MACH: a spatio-temporal Maximum Average Correlation Height filter for action recognition. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1–8, 2008.

[19] C. Schuldt, I. Laptev, and B. Caputo. Recognizing human actions: A local SVM approach. In Proceedings of the 17th International Conference on Pattern Recognition (ICPR), volume 3, 2004.

[20] A. Shashua and T. Hazan. Non-negative tensor factorization with applications to statistics and computer vision. In International Conference on Machine Learning, volume 22, page 793, 2005.

[21] N. Srebro, J.D.M. Rennie, and T.S. Jaakkola. Maximum-margin matrix factorization. Advances in Neural Information Processing Systems, 17:1329–1336, 2005.

[22] D. Tao, X. Li, X. Wu, and S.J. Maybank. General tensor discriminant analysis and Gabor features for gait recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 29(10):1700, 2007.

[23] J.B. Tenenbaum and W.T. Freeman. Separating style and content with bilinear models. Neural Computation, 12(6):1247–1283, 2000.

[24] I. Tsochantaridis, T. Joachims, T. Hofmann, and Y. Altun. Large margin methods for structured and interdependent output variables. Journal of Machine Learning Research, 6(2):1453, 2006.

[25] M.A.O. Vasilescu and D. Terzopoulos. Multilinear analysis of image ensembles: TensorFaces.
Lecture Notes in Computer Science, pages 447–460, 2002.

[26] L. Wolf, H. Jhuang, and T. Hazan. Modeling appearances with low-rank SVM. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1–6, 2007.

[27] S. Yan, D. Xu, Q. Yang, L. Zhang, X. Tang, and H.J. Zhang. Discriminant analysis with tensor representation. In IEEE Computer Society Conference on Computer Vision and Pattern Recognition, volume 1, page 526, 2005.
", "award": [], "sourceid": 1202, "authors": [{"given_name": "Hamed", "family_name": "Pirsiavash", "institution": null}, {"given_name": "Deva", "family_name": "Ramanan", "institution": null}, {"given_name": "Charless", "family_name": "Fowlkes", "institution": null}]}