{"title": "Fast Training of Pose Detectors in the Fourier Domain", "book": "Advances in Neural Information Processing Systems", "page_first": 3050, "page_last": 3058, "abstract": "In many datasets, the samples are related by a known image transformation, such as rotation, or a repeatable non-rigid deformation. This applies to both datasets with the same objects under different viewpoints, and datasets augmented with virtual samples. Such datasets possess a high degree of redundancy, because geometrically-induced transformations should preserve intrinsic properties of the objects. Likewise, ensembles of classifiers used for pose estimation should also share many characteristics, since they are related by a geometric transformation. By assuming that this transformation is norm-preserving and cyclic, we propose a closed-form solution in the Fourier domain that can eliminate most redundancies. It can leverage off-the-shelf solvers with no modification (e.g. libsvm), and train several pose classifiers simultaneously at no extra cost. Our experiments show that training a sliding-window object detector and pose estimator can be sped up by orders of magnitude, for transformations as diverse as planar rotation, the walking motion of pedestrians, and out-of-plane rotations of cars.", "full_text": "Fast Training of Pose Detectors in the Fourier Domain\n\nJo\u02dcao F. Henriques\n\n{henriques,pedromartins,ruicaseiro,batista}@isr.uc.pt\n\nUniversity of Coimbra\n\nPedro Martins\nInstitute of Systems and Robotics\n\nRui Caseiro\n\nJorge Batista\n\nAbstract\n\nIn many datasets, the samples are related by a known image transformation, such\nas rotation, or a repeatable non-rigid deformation. This applies to both datasets\nwith the same objects under different viewpoints, and datasets augmented with\nvirtual samples. Such datasets possess a high degree of redundancy, because\ngeometrically-induced transformations should preserve intrinsic properties of the\nobjects. 
Likewise, ensembles of classifiers used for pose estimation should also share many characteristics, since they are related by a geometric transformation. By assuming that this transformation is norm-preserving and cyclic, we propose a closed-form solution in the Fourier domain that can eliminate most redundancies. It can leverage off-the-shelf solvers with no modification (e.g. libsvm), and train several pose classifiers simultaneously at no extra cost. Our experiments show that training a sliding-window object detector and pose estimator can be sped up by orders of magnitude, for transformations as diverse as planar rotation, the walking motion of pedestrians, and out-of-plane rotations of cars.

1 Introduction

To cope with the rich variety of transformations in natural images, recognition systems require a representative sample of possible variations. Some of those variations must be learned from data (e.g. non-rigid deformations), while others can be virtually generated (e.g. translation or rotation). Recently, there has been a renewed interest in augmenting datasets with virtual samples, both in the context of supervised [23, 17] and unsupervised learning [6]. This augmentation has the benefits of regularizing high-capacity classifiers [6], while learning the natural invariances of the visual world. Some kinds of virtual samples can actually make learning easier – for example, with horizontally-flipped virtual samples [7, 4, 17], half of the weights of the template in the Dalal-Triggs detector [4] become redundant by horizontal symmetry. A number of very recent works [14, 13, 8, 1] have shown that cyclically translated virtual samples also constrain learning problems, which allows impressive gains in computational efficiency. 
The core of this technique relies on approximately diagonalizing the data matrix with the Discrete Fourier Transform (DFT).
In this work, we show that the “Fourier trick” is not unique to cyclic translation, but can be generalized to other cyclic transformations. Our model captures a wide range of useful image transformations, yet retains the ability to accelerate training with the DFT. As it is only implicit, we can accelerate training in both datasets of virtual samples and natural datasets with pose annotations. Also due to the geometrically-induced structure of the training data, our algorithm can obtain several transformed pose classifiers simultaneously. Some of the best object detection and pose estimation systems currently learn classifiers for different poses independently [10, 7, 19], and we show how joint learning of these classifiers can dramatically reduce training times.

Figure 1: (a) The horizontal translation of a 6 × 6 image, by 1 pixel, can be achieved by a 36 × 36 permutation matrix P that reorders elements appropriately (depicted is the reordering of 2 pixels). (b) Rotation by a fixed angle, with linearly-interpolated pixels, requires a more general matrix Q. By studying its influence on a dataset of rotated samples, we show how to accelerate learning in the Fourier domain. Our model can also deal with other transformations, including non-rigid. (c) Example HOG template (a car from the Google Earth dataset) at 4 rotations learned by our model. 
Positive weights are on the first and third columns; the others are negative.

1.1 Contributions

Our contributions are as follows: 1) we generalize a previous successful model for translation [14, 13] to other transformations, and analyze the properties of datasets with many transformed images; 2) we present closed-form solutions that fully exploit the known structure of these datasets, for Ridge Regression and Support Vector Regression, based on the DFT and off-the-shelf solvers; 3) with the same computational cost, we show how to train multiple classifiers for different poses simultaneously; 4) since our formulas do not require explicitly estimating or knowing the transformation, we demonstrate applicability to both datasets of virtual samples and structured datasets with pose annotations. We achieve performance comparable to naive algorithms on 3 widely different tasks, while being several orders of magnitude faster.

1.2 Related work

There is a vast body of work on image transformations and invariances, of which we can only mention a few examples. Much of the earlier computer vision literature focused on finding viewpoint-invariant patterns [22]. These methods were based on image or scene-space coordinates, on which geometric transformations can be applied directly; however, they do not apply to modern appearance-based representations. To relate complex transformations with appearance descriptors, a classic approach is to use tangent vectors [3, 26, 16], which represent a first-order approximation. However, the desire for more expressiveness has motivated the search for more general models.
Recent works have begun to approximate transformations as matrix-vector products, and try to estimate the transformation matrix explicitly. Tamaki et al. [27] do so for blur and affine transformations in the context of LDA, while Miao et al. 
[21] approximate affine transformations with an E-M algorithm, based on a Lie group formulation. They estimate a basis for the transformation operator or the transformed images, which is a hard analytical/inference problem in itself. The involved matrices are extremely large for moderately-sized images, necessitating dimensionality reduction techniques such as PCA, which may be suboptimal.
Several works focus on rotation alone [25, 18, 28, 2], most of them speeding up computations using Fourier analysis, but they all explicitly estimate a reduced basis on which to project the data. Another approach is to learn a transformation from data, using more parsimonious factored or deep models [20]. In contrast, our method generalizes to other transformations and avoids a potentially costly transformation model or basis estimation.

2 The cyclic orthogonal model for image transformations

Consider the m × 1 vector x, obtained by vectorizing an image, i.e. stacking its elements into a vector. The particular order does not matter, as long as it is consistent. The image may be a 3-dimensional array that contains multiple channels, such as RGB, or the values of a densely-sampled image descriptor.
We wish to quickly train a classifier or regressor with transformed versions of sample images, to make it robust to those transformations. The model we will use is an m × m orthogonal matrix Q, which will represent an incremental transformation of an image as Qx (for example, a small translation or rotation; see Fig. 1-a and 1-b). We can traverse different poses w.r.t. that transformation, p ∈ Z, by repeated application of Q with a matrix power, Q^p x.
In order for the number of poses to be finite, we must require the transformation to be cyclic, Q^s = Q^0 = I, with some period s. 
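To make the model concrete, the NumPy sketch below (our own illustrative example; the paper never forms Q explicitly) builds one valid cyclic orthogonal transformation, a block-diagonal matrix of 2 × 2 rotations by 2π/s, and checks the three properties the model requires:

```python
import numpy as np

s, m = 8, 6  # period (number of poses) and feature dimension (m even)

# Block-diagonal 2x2 rotations by 2*pi/s: an orthogonal matrix with period s.
theta = 2 * np.pi / s
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])
Q = np.kron(np.eye(m // 2), R)

x = np.arange(1.0, m + 1)  # a toy "image" vector

assert np.allclose(Q.T @ Q, np.eye(m))                       # orthogonal: Q^T = Q^{-1}
assert np.allclose(np.linalg.matrix_power(Q, s), np.eye(m))  # cyclic: Q^s = Q^0 = I
assert np.isclose(np.linalg.norm(Q @ x), np.linalg.norm(x))  # norm-preserving
```

Any orthogonal matrix whose eigenvalues are s-th roots of unity would serve equally well; the cyclic shift P of Section 2.1 is another instance.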
This allows us to store all versions of x transformed to different poses as the rows of an s × m matrix,

C_Q(x) = [ (Q^0 x)^T
           (Q^1 x)^T
               ⋮
           (Q^{s-1} x)^T ].   (1)

Due to Q being cyclic, any pose p ∈ Z can be found in row (p mod s) + 1. Note that the first row of C_Q(x) contains the untransformed image x, since Q^0 is the identity I. For the purposes of training a classifier, C_Q(x) can be seen as a data matrix, with one sample per row.
Although conceptually simple, we will show through experiments that this model can accurately capture a variety of natural transformations (Section 5.2). More importantly, we will show that Q never has to be created explicitly. The algorithms we develop will be entirely data-driven, using an implicit description of Q from a structured dataset, either composed of virtual samples (e.g. by image rotation) or natural samples (e.g. using pose annotations).

2.1 Image translation as a special case

A particular case of Q, and indeed what inspired the generalization that we propose, is the s × s cyclic shift matrix

P = [ 0_{s-1}^T   1
      I_{s-1}     0_{s-1} ],   (2)

where 0_{s-1} is an (s-1) × 1 vector of zeros. This matrix cyclically permutes the elements of a vector x as (x_1, x_2, x_3, ..., x_s) → (x_s, x_1, x_2, ..., x_{s-1}). If x is a one-dimensional horizontal image, with a single channel, then it is translated to the right by one pixel. An illustration is shown in Fig. 
1-a.
By exploiting its relationship with the Discrete Fourier Transform (DFT), the cyclic shift model has been used to accelerate a variety of learning algorithms in computer vision [14, 13, 15, 8, 1], with suitable extensions to 2D and multiple channels.

2.2 Circulant matrices and the Discrete Fourier Transform

The basis for this optimization is the fact that the data matrix C_P(x), or C(x) for short, formed by all cyclic shifts of a sample image x, is circulant [5]. All circulant matrices are diagonalized by the DFT, which can be expressed as the eigendecomposition

C(x) = U diag(F(x)) U^H,   (3)

where ·^H is the Hermitian transpose (i.e., transposition and complex-conjugation), F(x) denotes the DFT of a vector x, and U is the unitary DFT basis. The constant matrix U can be used to compute the DFT of any vector, since it satisfies U x = (1/√s) F(x). This is possible due to the linearity of the DFT, though in practice the Fast Fourier Transform (FFT) algorithm is used instead. Note that U is symmetric, U^T = U, and unitary, U^H = U^{-1}. When working in Fourier-space, Eq. 3 shows that circulant matrices in a learning problem become diagonal, which drastically reduces the needed computations. For multiple channels or more images, they may become block-diagonal, but the principles remain the same [13].
An important open question was whether the same diagonalization trick can be applied to image transformations other than translation. We will show that this is true, using the model from Eq. 
1.

3 Fast training with transformations of a single image

We will now focus on the main derivations of our paper, which allow us to quickly train a classifier with virtual samples generated from an image x by repeated application of the transformation Q. This section assumes only a single image x is given for training, which makes the presentation simpler and, we hope, will give valuable insight into the core of the technique. Section 4 will expand it to full generality, with training sets of an arbitrary number of images, all transformed by Q.
The first step is to show that some aspect of the data is diagonalizable by the DFT, which we do in the following theorem.

Theorem 1. Given an orthogonal cyclic matrix Q, i.e. satisfying Q^T = Q^{-1} and Q^s = Q^0, the s × m matrix X = C_Q(x) (from Eq. 1) verifies the following:

• The data matrix X and the uncentered covariance matrix X^H X are not circulant in general, unless Q = P (from Eq. 2).

• The Gram matrix G = XX^H is always circulant.

Proof. See Appendix A.1.

Theorem 1 implies that the learning problem in its original form is not diagonalizable by the DFT basis. However, the same diagonalization is possible for the dual problem, defined by the Gram matrix G.
Because G is circulant, it has only s degrees of freedom and is fully specified by its first row g [11], G = C(g). By direct computation from Eq. 1, we can verify that the elements of the first row g are given by g_p = x^T Q^{p-1} x. One interpretation is that g contains the auto-correlation of x through pose-space, i.e., the inner-product of x with itself as the transformation Q is applied repeatedly.

3.1 Dual Ridge Regression

For now we will restrict our attention to Ridge Regression (RR), since it has the appealing property of having a solution in closed form, which we can easily manipulate. Section 4.1 will show how to extend these results to Support Vector Regression. 
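Theorem 1's second claim is easy to check numerically. The sketch below (an illustrative example of ours, not the paper's code) builds C_Q(x) for an orthogonal cyclic Q made of 2 × 2 rotation blocks, and verifies that every row of G = XX^T is a cyclic shift of the auto-correlation vector g:

```python
import numpy as np

rng = np.random.default_rng(0)
s, m = 8, 6  # number of poses, feature dimension (m even)

# Orthogonal cyclic Q: block-diagonal 2x2 rotations by 2*pi/s, so Q^s = I.
theta = 2 * np.pi / s
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])
Q = np.kron(np.eye(m // 2), R)

x = rng.standard_normal(m)
# Rows of X are Q^p x, p = 0, ..., s-1: the data matrix C_Q(x) of Eq. 1.
X = np.stack([np.linalg.matrix_power(Q, p) @ x for p in range(s)])

G = X @ X.T  # Gram matrix
for i in range(s):
    # Circulant: row i is the first row (the vector g) cyclically shifted by i.
    assert np.allclose(G[i], np.roll(G[0], i))

# The covariance X.T @ X, by contrast, is generally NOT circulant unless Q = P.
```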
The goal of RR is to find the linear function f(x) = w^T x that minimizes a regularized squared error: Σ_i (f(x_i) − y_i)^2 + λ‖w‖^2.
Since we have s samples in the data matrix under consideration (Eq. 1), there are s dual variables, stored in a vector α. The RR solution is given by α = (G + λI)^{-1} y [24], where G = XX^H is the s × s Gram matrix, y is the vector of s labels (one per pose), and λ is the regularization parameter. The dual form of RR is usually associated with non-linear kernels [24], but since that is not our case we can compute the explicit primal solution with w = X^T α, yielding

w = X^T (G + λI)^{-1} y.   (4)

Applying the circulant eigendecomposition (Eq. 3) to G, and substituting it in Eq. 4,

w = X^T (U diag(ĝ) U^H + λUU^H)^{-1} y = X^T U (diag(ĝ + λ))^{-1} U^H y,   (5)

where we introduce the shorthand ĝ = F(g), and similarly ŷ = F(y). Since inversion of a diagonal matrix can be done element-wise, and its multiplication by the vector U^H y amounts to an element-wise product, we obtain

w = X^T F^{-1}( ŷ / (ĝ + λ) ),   (6)

where F^{-1} denotes the inverse DFT, and the division is taken element-wise. This formula allows us to replace a costly matrix inversion with fast DFT and element-wise operations. We also do not need to compute and store the full G, as the auto-correlation vector g suffices. As we will see in the next section, there is a simple modification to Eq. 
6 that turns out to be very useful for pose estimation.

3.2 Training several components simultaneously

A relatively straightforward way to estimate the object pose in an input image x is to train a classifier for each pose (which we call components), evaluate all of them, and take the maximum, i.e.

f_pose(x) = arg max_p w_p^T x.   (7)

This can also be used as the basis for a pose-invariant classifier, by replacing argmax with max [10]. Of course, training one component per pose can quickly become expensive. However, we can exploit the fact that these training problems become tightly related when the training set contains transformed images.
Recall that y specifies the labels for a training set of s transformed images, one label per pose. Without any loss of generality, suppose that the label is 1 for a given pose t and 0 for all others, i.e. y contains a single peak at element t. Then by shifting the peak with P^p y, we will train a classifier for pose t + p. In this manner we can train classifiers for all poses simply by varying the labels P^p y, with p = 0, ..., s − 1.
Based on Eq. 6, we can concatenate the solutions for all s components into a single m × s matrix,

W = [ w_0 ⋯ w_{s−1} ] = X^T (G + λI)^{-1} [ P^0 y ⋯ P^{s−1} y ]   (8)
  = X^T (G + λI)^{-1} C^T(y).   (9)

Diagonalization yields

W^T = F^{-1}( diag( ŷ* / (ĝ + λ) ) F(X) ),   (10)

where ·* denotes complex-conjugation. Since their arguments are matrices, the DFT/IDFT operations here work along each column. The product of F(X) by the diagonal matrix simply amounts to multiplying each of its rows by a scalar factor, which is inexpensive. Eq. 10 has nearly the same computational cost as Eq. 
6, which trains a single classifier.

4 Transformation of multiple images

The training method described in the previous section would find little applicability for modern recognition tasks if it remained limited to transformations of a single image. Naturally, we would like to use n images x_i. We now have a dataset of ns samples, which can be divided into n sample groups {Q^{p−1} x_i | p = 1, ..., s}, each containing the transformed versions of one image.
This case becomes somewhat complicated by the fact that the data matrix X now has three dimensions – the m features, the n sample groups, and the s poses of each sample group. In this m × n × s array, each column vector (along the first dimension) is defined as

X_{•ip} = Q^{p−1} x_i,   i = 1, ..., n; p = 1, ..., s,   (11)

where we have used • to denote a one-dimensional slice of the three-dimensional array X.¹ A two-dimensional slice will be denoted by X_{••p}, which yields an m × n matrix, one for each p = 1, ..., s.
(¹For reference, our slice notation • works the same way as the slice notation : in Matlab or NumPy.)
Through a series of block-diagonalizations and reorderings, we can show (Appendix A.2-A.5) that the solution W, of size m × s, describing all s components (similarly to Eq. 10), is obtained with

Ŵ_{•p} = X̂_{••p} (ĝ_{••p} + λI)^{-1} Ŷ*_{•p},   p = 1, ..., s,   (12)

where a hat ˆ over an array denotes the DFT along the dimension that has size s (e.g. X̂ is the DFT of X along the third dimension), Y_{ip} specifies the label of the sample with pose p in group i, and g is the n × n × s array with elements

g_{ijp} = x_i^T Q^{p−1} x_j = X_{•i1}^T X_{•jp},   i, j = 1, . . .
, s.   (13)

It may come as a surprise that, after all these changes, Eq. 12 still essentially looks like a dual Ridge Regression (RR) problem (compare it to Eq. 4). Eq. 12 can be interpreted as splitting the original problem into s smaller problems, one for each Fourier frequency, which are independent and can be solved in parallel. A Matlab implementation is given in Appendix B.²

4.1 Support Vector Regression

Given that we can decompose such a large RR problem into s smaller RR problems, by applying the DFT and slicing operators (Eq. 12), it is natural to ask whether the same can be done with other algorithms. Leveraging a recent result [13], where this was done for image translation, the same steps can be repeated for the dual formulation of other algorithms, such as Support Vector Regression (SVR). Although RR can deal with complex data, SVR requires an extension to the complex domain, which we show in Appendix A.6. We give a Matlab implementation in Appendix B, which can use any off-the-shelf SVR solver without modification.

4.2 Efficiency

Naively training one detector per pose would require solving s large ns × ns systems (either with RR or SVR). In contrast, our method learns all detectors jointly using s much smaller n × n subproblems. The computational savings can be several orders of magnitude for large s. Our experiments seem to validate this conclusion, even in relatively large recognition tasks (Section 6).

5 Orthogonal transformations in practice

Until now, we avoided the question of how to compute a transformation model Q. This may seem like a computational burden, not to mention a hard estimation problem – for example, what is the cyclic orthogonal matrix Q that models planar rotations with period s? Inspecting Eq. 12-13, however, reveals that we do not need to form Q explicitly, but can work with just a data matrix X of transformed images. 
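As a concrete illustration of Eq. 12, the NumPy sketch below (our own code, not the paper's Matlab implementation; sizes and labels are made-up test values) trains all s components jointly, one small n × n system per Fourier frequency, and checks the result against naively solving one large ns × ns dual RR problem per component:

```python
import numpy as np

rng = np.random.default_rng(1)
n, s, m, lam = 3, 6, 8, 0.1  # sample groups, poses, features, regularization

# A cyclic orthogonal Q (illustrative choice): 2x2 rotation blocks by 2*pi/s.
theta = 2 * np.pi / s
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])
Q = np.kron(np.eye(m // 2), R)

xs = rng.standard_normal((n, m))
X = np.empty((m, n, s))          # X[:, i, p] = Q^p x_i  (Eq. 11, 0-indexed poses)
for i in range(n):
    for p in range(s):
        X[:, i, p] = np.linalg.matrix_power(Q, p) @ xs[i]

Y = rng.standard_normal((n, s))  # Y[i, p]: label of the sample with pose p in group i

# g[i, j, p] = x_i^T Q^p x_j  (Eq. 13)
g = np.einsum('ki,kjp->ijp', X[:, :, 0], X)

# Eq. 12: one small n x n system per Fourier frequency p.
Xh, gh, Yh = (np.fft.fft(A, axis=-1) for A in (X, g, Y))
Wh = np.stack([Xh[:, :, p] @ np.linalg.solve(gh[:, :, p] + lam * np.eye(n),
                                             np.conj(Yh[:, p]))
               for p in range(s)], axis=1)
W = np.fft.ifft(Wh, axis=1).real  # m x s: one classifier per pose

# Naive baseline: one large ns x ns dual RR problem per component t,
# with the labels cyclically shifted by t along the pose dimension.
Xf = X.transpose(1, 2, 0).reshape(n * s, m)   # rows ordered (group i, pose p)
Gf = Xf @ Xf.T
for t in range(s):
    y_t = Y[:, (np.arange(s) - t) % s].ravel()
    w_t = Xf.T @ np.linalg.solve(Gf + lam * np.eye(n * s), y_t)
    assert np.allclose(W[:, t], w_t)
```

The fast path only factors s matrices of size n × n, in line with the efficiency argument of Section 4.2.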
From there on, we exploit the knowledge that this data was obtained from some matrix Q, and that is enough to allow fast training in the Fourier domain. This allows a great deal of flexibility in implementation.

5.1 Virtual transformations

One way to obtain a structured data matrix X is with virtual samples. From the original dataset of n samples, we can generate ns virtual samples using a standard image operator (e.g. planar rotation). However, we should keep in mind that the accuracy of the proposed method will be affected by how closely the image operator resembles a pure cyclic orthogonal transformation.
Linearity. Many common image transformations, such as rotation or scale, are implemented by nearest-neighbor or bilinear interpolation. For a fixed amount of rotation or scale, these are linear functions of the input pixels, i.e. each output pixel is a fixed linear combination of some of the input pixels. As such, they fulfill the linearity requirement.
Orthogonality. For an operator to be orthogonal, it must preserve the L2 norm of its inputs. At the expense of introducing some non-linearity, we simply renormalize each virtual sample to have the same norm as the original sample, which seems to work well in practice (Section 6).
Cyclicity. We conducted some experiments with planar rotation on satellite imagery (Section 6.1) – rotation by 360/s degrees is cyclic with period s. In the future, we plan to experiment with non-cyclic operators (similar to how cyclic translation is used to approximate image translation [14]).
(²The supplemental material is available at: www.isr.uc.pt/~henriques/transformations/)

Figure 2: Example detections and estimated poses in 3 different settings. We can accelerate training with (a) planar rotations (Google Earth), (b) non-rigid deformations in walking pedestrians (TUD-Campus/TUD-Crossing), and (c) out-of-plane rotations (KITTI). 
Best viewed in color.

5.2 Natural transformations

Another interesting possibility is to use pose annotations to create a structured data matrix. This data-driven approach allows us to consider more complicated transformations than those associated with virtual samples. Given s views of n objects under different poses, we can build the m × n × s data matrix X and use the same methodology as before. In Section 6 we describe experiments with the walk cycle of pedestrians, and out-of-plane rotations of cars in street scenes. These transformations are cyclic, though highly non-linear, and we use the same renormalization as in Section 5.1.

5.3 Negative samples

One subtle aspect is how to obtain a structured data matrix from negative samples. This is simple for virtual transformations, but not for natural ones. For example, with planar rotation we can easily generate rotated negative samples with arbitrary poses. However, the same operation is not defined for walk cycles: how do we advance the walk cycle of a non-pedestrian? As a pragmatic solution, we consider that negative samples are unaffected by natural transformations, so a negative sample is constant for all s poses. Because the DFT of a constant signal is 0, except for the DC value (the first frequency), we can ignore untransformed negative samples in all subproblems for p ≠ 1 (Eq. 12). This simple observation can result in significant computational savings.

6 Experiments

To demonstrate the generality of the proposed model, we conducted object detection and pose estimation experiments in 3 widely different settings, which will be described shortly. We implemented a detector based on Histogram of Oriented Gradients (HOG) templates [4] with multiple components [7]. This framework forms the basis on which several recent advances in object detection are built [19, 10, 7]. 
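The observation in Section 5.3 is just the statement that the DFT of a constant signal has all its energy in the DC bin. A minimal check (illustrative values only):

```python
import numpy as np

s = 10                    # poses in a walk cycle
neg = np.full(s, 2.5)     # a negative sample held constant across all s poses
negh = np.fft.fft(neg)    # its DFT along the pose dimension

assert np.isclose(negh[0], s * 2.5)  # all energy in the DC (first) frequency
assert np.allclose(negh[1:], 0.0)    # so subproblems p != 1 can drop the negatives
```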
The baseline algorithm independently trains s classifiers (components), one per pose, enabling pose-invariant object detection and pose prediction (Eq. 7). Components are then calibrated, as usual for detectors with multiple components [7, 19]. The proposed method does not require any ad-hoc calibration, since the components are jointly trained and related by the orthogonal matrix Q, which preserves their L2 norm.
For the performance evaluation, ground truth objects are assigned to hypotheses by the widely used Pascal criterion of bounding box overlap [7]. We then measure average precision (AP) and pose error (as e_pose/s, where e_pose is the discretized pose difference, taking wrap-around into account). We tested two variants of each method, trained with both RR and SVR. Although parallelization is trivial, we report timings for single-core implementations, which more accurately reflect the total CPU load. As noted in previous work [13], detectors trained with SVR have very similar performance to those trained with Support Vector Machines.

6.1 Planar rotation in satellite images (Google Earth)

Our first test is on a car detection task in satellite imagery [12], which has been used in several works that deal with planar rotation [25, 18]. We annotated the orientations of 697 objects over half of the 30 images in the dataset. The first 7 annotated images were used for training, and the remaining 8 for validation. 
We created a structured data matrix X by augmenting each sample with 30 virtual samples, using 12° rotations.

Table 1: Results for pose detectors trained with Support Vector Regression (SVR) and Ridge Regression (RR). We report training time, Average Precision (AP) and pose error (both in percentage).

                        Google Earth            TUD Campus/Crossing     KITTI
                        Time (s)  AP    Pose    Time (s)  AP    Pose    Time (s)  AP    Pose
Fourier training  SVR   4.5       73.0  9.4     0.1       81.5  9.3     15.0      53.5  14.9
                  RR    3.7       71.4  10.0    0.08      82.2  8.9     15.5      53.4  15.0
Standard          SVR   130.7     73.2  9.8     40.5      80.2  9.5     454.2     56.5  13.8
                  RR    399.3     72.7  10.3    45.8      81.6  9.4     229.6     54.5  14.0

A visualization of trained weights is shown in Fig. 1-c and Appendix B. Experimental results are presented in Table 1. Recall that our primary goal is to demonstrate faster training, not to improve detection performance, which is reflected in the results. Nevertheless, the two proposed fast Fourier algorithms are 29 to 107× faster than the baseline algorithms.

6.2 Walk cycle of pedestrians (TUD-Campus and TUD-Crossing)

We can consider a walking pedestrian to undergo a cyclic non-rigid deformation, with each period corresponding to one step. Because this transformation is time-dependent, we can learn it from video data. We used TUD-Campus for training and TUD-Crossing for testing (see Fig. 2). We annotated a key pose in all 272 frames, so that the images of a pedestrian between two key poses represent a whole walk cycle. Sampling 10 images per walk cycle (corresponding to 10 poses), we obtained 10 sample groups for training, for a total of 100 samples.
From Table 1, the proposed algorithms seem to slightly outperform the baseline, showing that these non-rigid deformations can be accurately accounted for. However, they are over 2 orders of magnitude faster. 
In addition to the speed benefits observed in Section 6.1, another factor at play is that for natural transformations we can ignore the negative samples in s − 1 of the subproblems (Section 5.3), whereas the baseline algorithms must consider them when training each of the s components.

6.3 Out-of-plane rotations of cars in street scenes (KITTI)

For our final experiment, we will attempt to demonstrate that the speed advantage of our method still holds for difficult out-of-plane rotations. We chose the very recent KITTI benchmark [9], which includes an object detection set of 7481 images of street scenes. The facing angle of cars (along the vertical axis) is provided, which we bin into 15 discrete poses. We performed an 80-20% train-test split of the images, considering cars of “moderate” difficulty [9], and obtained 73 sample groups for training with 15 poses each (for a total of 1095 samples).
Table 1 shows that the proposed method achieves competitive performance, but at a dramatically lower computational cost. The results agree with the intuition that out-of-plane rotations strain the assumptions of linearity and orthogonality, since they result in large deformations of the object. Nevertheless, the ability to learn a useful model under such adverse conditions shows great promise.

7 Conclusions and future work

In this work, we derived new closed-form formulas to quickly train several pose classifiers at once, and to take advantage of the structure in datasets with pose annotations or virtual samples. Our implicit transformation model seems to be surprisingly expressive, and in future work we would like to experiment with other transformations, including non-cyclic ones. Other interesting directions include larger-scale variants and the composition of multiple transformations.

Acknowledgements. The authors would like to thank João Carreira for valuable discussions. 
They also acknowledge support by the FCT project PTDC/EEA-CRO/122812/2010, and grants SFRH/BD75459/2010, SFRH/BD74152/2010, and SFRH/BPD/90200/2012.

References

[1] V. N. Boddeti, T. Kanade, and B. V. K. Kumar. Correlation filters for object alignment. In CVPR, 2013.
[2] C.-Y. Chang, A. A. Maciejewski, and V. Balakrishnan. Fast eigenspace decomposition of correlated images. IEEE Transactions on Image Processing, 9(11):1937–1949, 2000.
[3] O. Chapelle and B. Scholkopf. Incorporating invariances in non-linear support vector machines. In Advances in Neural Information Processing Systems, 2002.
[4] N. Dalal and B. Triggs. Histograms of oriented gradients for human detection. In CVPR, 2005.
[5] P. J. Davis. Circulant matrices. American Mathematical Soc., 1994.
[6] A. Dosovitskiy, J. T. Springenberg, and T. Brox. Unsupervised feature learning by augmenting single images. In International Conference on Learning Representations, 2014.
[7] P. F. Felzenszwalb, R. B. Girshick, D. McAllester, and D. Ramanan. Object detection with discriminatively trained part-based models. TPAMI, 2010.
[8] H. K. Galoogahi, T. Sim, and S. Lucey. Multi-channel correlation filters. In ICCV, 2013.
[9] A. Geiger, P. Lenz, and R. Urtasun. Are we ready for autonomous driving? The KITTI Vision Benchmark Suite. In CVPR, 2012.
[10] A. Geiger, C. Wojek, and R. Urtasun. Joint 3d estimation of objects and scene layout. In NIPS, 2011.
[11] R. M. Gray. Toeplitz and Circulant Matrices: A Review. Now Publishers, 2006.
[12] G. Heitz and D. Koller. Learning spatial context: Using stuff to find things. In ECCV, 2008.
[13] J. F. Henriques, J. Carreira, R. Caseiro, and J. Batista. Beyond hard negative mining: Efficient detector learning via block-circulant decomposition. In ICCV, 2013.
[14] J. F. Henriques, R. Caseiro, P. 
Martins, and J. Batista. Exploiting the circulant structure of tracking-by-detection with kernels. In ECCV, 2012.
[15] J. F. Henriques, R. Caseiro, P. Martins, and J. Batista. High-speed tracking with kernelized correlation filters. TPAMI, 2015.
[16] N. Jojic, P. Simard, B. J. Frey, and D. Heckerman. Separating appearance from deformation. In ICCV, 2001.
[17] A. Krizhevsky, I. Sutskever, and G. Hinton. Imagenet classification with deep convolutional neural networks. In NIPS, 2012.
[18] K. Liu, H. Skibbe, T. Schmidt, T. Blein, K. Palme, T. Brox, and O. Ronneberger. Rotation-invariant HOG descriptors using Fourier analysis in polar and spherical coordinates. International Journal of Computer Vision, 106(3):342–364, February 2014.
[19] T. Malisiewicz, A. Gupta, and A. A. Efros. Ensemble of exemplar-SVMs for object detection and beyond. In ICCV, 2011.
[20] R. Memisevic and G. E. Hinton. Learning to represent spatial transformations with factored higher-order boltzmann machines. Neural Computation, 22(6):1473–1492, 2010.
[21] X. Miao and R. Rao. Learning the lie groups of visual invariance. Neural Computation, 19(10):2665–2693, 2007.
[22] J. L. Mundy. Object recognition in the geometric era: A retrospective. Lecture Notes in Computer Science, pages 3–28, 2006.
[23] M. Paulin, J. Revaud, Z. Harchaoui, F. Perronnin, and C. Schmid. Transformation pursuit for image classification. In CVPR, 2014.
[24] R. Rifkin, G. Yeo, and T. Poggio. Regularized least-squares classification. Nato Science Series Sub Series III: Computer and Systems Sciences, 190:131–154, 2003.
[25] U. Schmidt and S. Roth. Learning rotation-aware features: From invariant priors to equivariant descriptors. In CVPR, 2012.
[26] P. Simard, Y. LeCun, J. Denker, and B. Victorri. 
Transformation invariance in pattern recognition – tangent distance and tangent propagation. In LNCS. Springer, 1998.
[27] T. Tamaki, B. Yuan, K. Harada, B. Raytchev, and K. Kaneda. Linear discriminative image processing operator analysis. In CVPR, 2012.
[28] M. Uenohara and T. Kanade. Optimal approximation of uniformly rotated images. IEEE Transactions on Image Processing, 7(1):116–119, 1998.
", "award": [], "sourceid": 1582, "authors": [{"given_name": "João F.", "family_name": "Henriques", "institution": "University of Coimbra"}, {"given_name": "Pedro", "family_name": "Martins", "institution": "University of Coimbra"}, {"given_name": "Rui", "family_name": "Caseiro", "institution": "University of Coimbra"}, {"given_name": "Jorge", "family_name": "Batista", "institution": "University of Coimbra"}]}