{"title": "Learning with Recursive Perceptual Representations", "book": "Advances in Neural Information Processing Systems", "page_first": 2825, "page_last": 2833, "abstract": "Linear Support Vector Machines (SVMs) have become very popular in vision as part of state-of-the-art object recognition and other classification tasks but require high dimensional feature spaces for good performance. Deep learning methods can find more compact representations but current methods employ multilayer perceptrons that require solving a difficult, non-convex optimization problem. We propose a deep non-linear classifier whose layers are SVMs and which incorporates random projection as its core stacking element. Our method learns layers of linear SVMs recursively transforming the original data manifold through a random projection of the weak prediction computed from each layer. Our method scales as linear SVMs, does not rely on any kernel computations or nonconvex optimization, and exhibits better generalization ability than kernel-based SVMs. This is especially true when the number of training samples is smaller than the dimensionality of data, a common scenario in many real-world applications. The use of random projections is key to our method, as we show in the experiments section, in which we observe a consistent improvement over previous --often more complicated-- methods on several vision and speech benchmarks.", "full_text": "Learning with Recursive Perceptual Representations\n\nOriol Vinyals\nUC Berkeley\nBerkeley, CA\n\nYangqing Jia\nUC Berkeley\nBerkeley, CA\n\nLi Deng\n\nMicrosoft Research\n\nRedmond, WA\n\nTrevor Darrell\nUC Berkeley\nBerkeley, CA\n\nAbstract\n\nLinear Support Vector Machines (SVMs) have become very popular in vision as\npart of state-of-the-art object recognition and other classi\ufb01cation tasks but require\nhigh dimensional feature spaces for good performance. 
Deep learning methods\ncan \ufb01nd more compact representations but current methods employ multilayer\nperceptrons that require solving a dif\ufb01cult, non-convex optimization problem. We\npropose a deep non-linear classi\ufb01er whose layers are SVMs and which incorpo-\nrates random projection as its core stacking element. Our method learns layers of\nlinear SVMs recursively transforming the original data manifold through a ran-\ndom projection of the weak prediction computed from each layer. Our method\nscales as linear SVMs, does not rely on any kernel computations or nonconvex\noptimization, and exhibits better generalization ability than kernel-based SVMs.\nThis is especially true when the number of training samples is smaller than the\ndimensionality of data, a common scenario in many real-world applications. The\nuse of random projections is key to our method, as we show in the experiments\nsection, in which we observe a consistent improvement over previous \u2013often more\ncomplicated\u2013 methods on several vision and speech benchmarks.\n\n1\n\nIntroduction\n\nIn this paper, we focus on the learning of a general-purpose non-linear classi\ufb01er applied to perceptual\nsignals such as vision and speech. The Support Vector Machine (SVM) has been a popular method\nfor multimodal classi\ufb01cation tasks since its introduction, and one of its main advantages is the sim-\nplicity of training a linear model. Linear SVMs often fail to solve complex problems however, and\nwith non-linear kernels, SVMs usually suffer from speed and memory issues when faced with very\nlarge-scale data, although techniques such as non-convex optimization [6] or spline approximations\n[19] exist for speed-ups. 
In addition, \ufb01nding the \u201coracle\u201d kernel for a speci\ufb01c task remains an open\nproblem, especially in applications such as vision and speech.\nOur aim is to design a classi\ufb01er that combines the simplicity of the linear Support Vector Machine\n(SVM) with the power derived from deep architectures. The new technique we propose follows\nthe philosophy of \u201cstacked generalization\u201d [23], i.e. the framework of building layer-by-layer ar-\nchitectures, and is motivated by the recent success of a convex stacking architecture which uses a\nsimpli\ufb01ed form of neural network with closed-form, convex learning [10]. Speci\ufb01cally, we propose\na new stacking technique for building a deep architecture, using a linear SVM as the base building\nblock, and a random projection as its core stacking element.\nThe proposed model, which we call the Random Recursive SVM (R2SVM), involves an ef\ufb01cient,\nfeed-forward convex learning procedure. The key element in our convex learning of each layer is to\nrandomly project the predictions of the previous layer SVM back to the original feature space. As we\nwill show in the paper, this could be seen as recursively transforming the original data manifold so\nthat data from different classes are moved apart, leading to better linear separability in the subsequent\nlayers. In particular, we show that randomly generating projection parameters, instead of \ufb01ne-tuning\nthem using backpropagation, suf\ufb01ces to achieve a signi\ufb01cant performance gain. As a result, our\n\n1\n\n\fFigure 1: A conceptual example of Random Recursive SVM separating edges from cross-bars. Start-\ning from data manifolds that are not linearly separable, our method transforms the data manifolds\nin a stacked way to \ufb01nd a linear separating hyperplane in the high layers, which corresponds to\nnon-linear separating hyperplanes in the lower layers. 
Non-linear classi\ufb01cation is achieved without\nkernelization, using a recursive architecture.\n\nmodel does not require any complex learning techniques other than training linear SVMs, while\ncanonical deep architectures usually require carefully designed pre-training and \ufb01ne-tuning steps,\nwhich often depend on speci\ufb01c applications.\nUsing linear SVMs as building blocks our model scales in the same way as the linear SVM does,\nenabling fast computation during both training and testing time. While linear SVM fails to solve\nnon-linearly separable problems, the simple non-linearity in our algorithm, introduced with sigmoid\nfunctions, is shown to adapt to a wide range of real-world data with the same learning structure.\nFrom a kernel based perspective, our method could be viewed as a special non-linear SVM, with\nthe bene\ufb01t that the non-linear kernel naturally emerges from the stacked structure instead of be-\ning de\ufb01ned as in conventional algorithms. This brings additional \ufb02exibility to the applications, as\ntask-dependent kernel designs usually require detailed domain-speci\ufb01c knowledge, and may not\ngeneralize well due to suboptimal choices of non-linearity. Additionally, kernel SVMs usually suf-\nfer from speed and memory issues when faced with large-scale data, although techniques such as\nnon-convex optimization [6] exist for speed-ups.\nOur \ufb01ndings suggest that the proposed model, while keeping the simplicity and ef\ufb01ciency of training\na linear SVM, can exploit non-linear dependencies with the proposed deep architecture, as suggested\nby the results on two well known vision and speech datasets. In addition, our model performs better\nthan other non-linear models under small training set sizes (i.e. 
it exhibits a smaller generalization gap), which is a desirable property inherited from the linear model used in the architecture presented in this paper.

2 Previous Work

There has been a trend in object, acoustic and image classification to move the complexity from the classifier to the feature extraction step. The main focus of many state-of-the-art systems has been to build rich feature descriptors (e.g. SIFT [18], HOG [7] or MFCC [8]) and to use sophisticated non-linear classifiers, usually based on kernel functions and SVMs or mixture models. Thus, the complexity of the overall system (feature extractor followed by the non-linear classifier) is shared between the two blocks. Vector Quantization [12] and Sparse Coding [21, 24, 26] have theoretically and empirically been shown to work well with linear classifiers. In [4], the authors note that the choice of codebook does not seem to impact performance significantly, and that encoding via an inner product plus a non-linearity can effectively replace sparse coding, making testing significantly simpler and faster.

A disturbing issue with sparse coding followed by linear classification is that, with a limited codebook size, linear separability might be an overly strong assumption, undermining the use of a single linear classifier. This has been empirically verified: as the codebook size increases, performance keeps improving [4], indicating that such representations may not be able to fully exploit the complexity of the data [2]. In fact, recent success on PASCAL VOC could partially be attributed to a huge codebook [25]. 
While this is theoretically valid, the practical advantage of linear models diminishes quickly, as the computational cost of feature generation, as well as of training a high-dimensional (though linear) classifier, can make the approach as expensive as classical non-linear classifiers.

Despite this trend to rely on linear classifiers and overcomplete feature representations, sparse coding is still a flat model, and efforts have been made to add flexibility to the features. In particular, Deep Coding Networks [17] proposed an extension where a higher-order Taylor approximation of the non-linear classification function is used, which shows improvements over single-layer coding. Our approach can be seen as an extension of sparse coding used in a stacked architecture.

Stacking is a general philosophy that promotes generalization in learning complex functions and improves classification performance. The method presented in this paper is a new stacking technique that has close connections to several stacking methods developed in the literature, which are briefly surveyed in this section. In [23], the concept of stacking was proposed, where simple modules of functions or classifiers are "stacked" on top of each other in order to learn complex functions or classifiers. Since then, various ways of implementing stacking operations have been developed, and they can be divided into two general categories. In the first category, stacking is performed in a layer-by-layer fashion and typically involves no supervised information. This gives rise to multiple layers in unsupervised feature learning, as exemplified in Deep Belief Networks [14, 13, 9], layered Convolutional Neural Networks [15], Deep Auto-encoders [14, 9], etc. 
Applications of such stacking methods include object recognition [15, 26, 4], speech recognition [20], etc.

In the second category of techniques, stacking is carried out using supervised information. The modules of the stacking architectures are typically simple classifiers. The new features for the stacked classifier at a higher level of the hierarchy come from the concatenation of the classifier outputs of lower modules and the raw input features. Cohen and de Carvalho [5] developed a stacking architecture where the simple module is a Conditional Random Field. Another successful stacking architecture, reported in [10, 11], uses supervised information for stacking, where the basic module is a simplified form of multilayer perceptron in which the output units are linear and the hidden units are sigmoidal (nonlinear). The linearity of the output units permits highly efficient, closed-form estimation (the result of convex optimization) of the output network weights given the hidden units' outputs. Stacked context has also been used in [3], where a set of classifier scores is stacked to produce a more reliable detection. Our proposed method builds a stacked architecture where each layer is an SVM, which has proven to be a very successful classifier for computer vision applications.

3 The Random Recursive SVM

In this section we formally introduce the Random Recursive SVM model, and discuss the motivation and justification behind it. Specifically, we consider a training set that contains N pairs of tuples (d(i), y(i)), where d(i) ∈ R^D is the feature vector and y(i) ∈ {1, . . . , C} is the class label corresponding to the i-th sample.

As depicted in Figure 2(a), the model is built from multiple layers of blocks, which we call Random SVMs, each of which learns a linear SVM classifier and transforms the data based on a random projection of the previous layers' SVM outputs. 
The linear SVM classifiers are learned in a one-vs-all fashion. For convenience, let θ ∈ R^{D×C} be the classification matrix obtained by stacking each parameter vector column-wise, so that o(i) = θ^T d(i) is the vector of scores for each class corresponding to the sample d(i), and ŷ(i) = arg max_c θ_c^T d(i) is the prediction for the i-th sample if we want to make final predictions. From this point onward, we drop the index ·(i) of the i-th sample for notational convenience.

3.1 Recursive Transform of Input Features

Figure 2: The pipeline of the proposed Random Recursive SVM model. (a) Layered structure of R2SVM: the model is built with layers of Random SVM blocks, which are based on simple linear SVMs. Speech and image signals are provided as input to the first level. (b) Details of an RSVM layer: for each random SVM layer, we train a linear SVM using the transformed data manifold by combining the original features and random projections of previous layers' predictions.

Figure 2(b) visualizes one typical layer in the pipeline of our algorithm. Each layer takes the output of the previous layer (starting from x1 = d for the first layer as our initial input) and feeds it to a standard linear SVM that gives the output o1. In general, o1 would not be a perfect prediction, but would be better than a random guess. We then use a random projection matrix W2,1 ∈ R^{D×C}, whose elements are sampled from N(0, 1), to project the output o1 into the original feature space, in order to use this noisy prediction to modify the original features. 
Mathematically, the additively modified feature space after applying the linear SVM to obtain o1 is

x2 = σ(d + β W2,1 o1),

where β is a weight parameter that controls the degree to which we move the original data sample x1, and σ(·) is the sigmoid function, which introduces non-linearity in a similar way as in multilayer perceptron models and prevents the recursive structure from degenerating to a trivial linear model. In addition, such non-linearity, akin to that of neural networks, has desirable properties in terms of Gaussian complexity and generalization bounds [1].

Intuitively, the random projection aims to push data from different classes towards different directions, so that the resulting features are more likely to be linearly separable. The sigmoid function controls the scale of the resulting features, and at the same time prevents the random projection from being "too confident" on some data points, as the prediction of the lower layer is still imperfect. An important note is that, when the dimension D of the feature space is relatively large, the column vectors of Wl are very likely to be approximately orthogonal, known as the quasi-orthogonality property of high-dimensional spaces [16]. At the same time, the column vectors correspond to the per-class bias applied to the original sample d if the output were close to ideal (i.e. ol = ec, where ec is the one-hot encoding representing class c), so the fact that they are approximately orthogonal means that (with high probability) they push the per-class manifolds apart.

The training of the R2SVM is then carried out in a purely feed-forward way. 
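As a concrete illustration, the single-layer transform above can be sketched in a few lines of NumPy. This is an illustrative sketch, not the authors' code: the score vector `o` is assumed to come from any trained linear SVM, and the function and parameter names are hypothetical.

```python
import numpy as np

def rsvm_layer(d, o, beta=0.1, seed=None):
    """Sketch of one RSVM layer: x_next = sigmoid(d + beta * W @ o).

    d: original feature vector of shape (D,); o: C-dimensional vector of
    linear SVM scores from the previous layer. W is the random projection
    mapping class scores back into the feature space."""
    rng = np.random.default_rng(seed)
    W = rng.standard_normal((d.shape[0], o.shape[0]))  # entries ~ N(0, 1)
    x = d + beta * (W @ o)            # shift the sample along per-class directions
    return 1.0 / (1.0 + np.exp(-x))   # sigmoid bounds the feature scale

# Quasi-orthogonality check: in high dimension, random Gaussian columns of W
# are nearly orthogonal, so the per-class shifts point in distinct directions.
rng = np.random.default_rng(0)
W = rng.standard_normal((10000, 2))
cosine = W[:, 0] @ W[:, 1] / (np.linalg.norm(W[:, 0]) * np.linalg.norm(W[:, 1]))
```

With D = 10000 the measured cosine is close to zero, matching the quasi-orthogonality argument above; `beta` plays the same role as the weight parameter β.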
Specifically, we train a linear SVM for the l-th layer, and then compute the input of the next layer as the sum of the original feature space and a random projection of the previous layers' outputs, passed through a simple sigmoid function:

ol = θl^T xl,
xl+1 = σ(d + β Wl+1 [o1^T, o2^T, . . . , ol^T]^T),

where θl are the linear SVM parameters trained with xl, and Wl+1 is the concatenation of l random projection matrices [Wl+1,1, Wl+1,2, . . . , Wl+1,l], one for each previous layer, each sampled from N(0, 1).

Following [10], for each layer we use the outputs from all lower modules, instead of only the immediately lower module. A chief difference of our proposed method from previous approaches is that, instead of concatenating predictions with the raw input data to form a new, expanded input, we use the predictions to modify the features in the original space with a non-linear transformation. As will be shown in the next section, experimental results demonstrate that this approach is superior to simple concatenation in terms of classification performance.

3.2 On the Randomness in R2SVM

The motivation behind our method is that projections of previous predictions help to move apart the manifolds that belong to each class in a recursive fashion, in order to achieve better linear separability (Figure 1 shows a vision example separating different image patches). Specifically, consider a two-class problem which is not linearly separable. 
The following Lemma illustrates the fact that, if we are given an oracle prediction of the labels, it is possible to add an offset to each class to "pull" the manifolds apart with this new architecture, and to guarantee an improvement on the training set if we assume perfect labels.

Lemma 3.1 Let T be a set of N tuples (d(i), y(i)), where d(i) ∈ R^D is the feature vector and y(i) ∈ {1, . . . , C} is the class label corresponding to the i-th sample. Let θ ∈ R^{D×C} be the corresponding linear SVM solution with objective function value f_{T,θ}. Then there exist wi ∈ R^D for i ∈ {1, . . . , C} such that the translated set T' defined by the tuples (d(i) + w_{y(i)}, y(i)) has a linear SVM solution θ' which achieves a better optimum f_{T',θ'} < f_{T,θ}.

Proof Let θi be the i-th column of θ (which corresponds to the one-vs-all classifier for class i). Define wi = θi / ||θi||^2. Then we have

max(0, 1 − θ_{y(i)}^T (d(i) + w_{y(i)})) = max(0, 1 − (θ_{y(i)}^T d(i) + 1)) ≤ max(0, 1 − θ_{y(i)}^T d(i)),

which leads to f_{T',θ} ≤ f_{T,θ}. Since θ' is defined to be the optimum for the set T', f_{T',θ'} ≤ f_{T',θ}, which concludes the proof. □

Lemma 3.1 holds for any monotonically decreasing loss function (in particular, for the hinge loss of the SVM), and motivates our search for a transform of the original features that achieves linear separability, under the guidance of SVM predictions. Note that we would achieve perfect classification under the assumption of oracle labels, whereas we only have noisy predictions for each class ŷ(i) at testing time. 
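The translation used in the proof can be checked numerically. The snippet below uses hypothetical toy data and a fixed random θ (no SVM is actually trained; the hinge loss on the true class is computed directly), so it only verifies the per-sample inequality from the proof:

```python
import numpy as np

rng = np.random.default_rng(1)
D, C, N = 20, 3, 200
theta = rng.standard_normal((D, C))   # fixed one-vs-all weight columns theta_i
X = rng.standard_normal((N, D))       # toy samples d(i) (illustrative only)
y = rng.integers(0, C, size=N)        # toy labels y(i)

def true_class_hinge(theta, X, y):
    # sum over samples of max(0, 1 - theta_{y(i)}^T d(i)), as in the proof
    margins = X @ theta               # (N, C) class scores
    return np.maximum(0.0, 1.0 - margins[np.arange(len(y)), y]).sum()

# Lemma 3.1 translation: w_i = theta_i / ||theta_i||^2, applied per true class
w = theta / (np.linalg.norm(theta, axis=0) ** 2)  # column i is w_i
X_shifted = X + w[:, y].T             # d(i) + w_{y(i)}

loss_before = true_class_hinge(theta, X, y)
loss_after = true_class_hinge(theta, X_shifted, y)
```

Since θ_{y(i)}^T (d(i) + w_{y(i)}) = θ_{y(i)}^T d(i) + 1, each hinge term can only shrink, so `loss_after` never exceeds `loss_before`; retraining θ on the shifted set can then only improve the optimum further.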
Under such noisy predictions, a deterministic choice of wi, especially linear combinations of the data as in the proof of Lemma 3.1, suffers from over-confidence in the labels and may add little benefit to the learned linear SVMs.

A first choice to avoid degenerate results is to take random weights. This enables us to use the label-relevant information in the predictions, while at the same time decorrelating it from the original input d. Surprisingly, as shown in Figure 4(a), randomness achieves a significant performance gain in contrast to the "optimal" direction given by Lemma 3.1 (which degenerates due to imperfect predictions), or to alternative stacking strategies such as concatenation as in [10]. We also note that, beyond sampling projection matrices from a zero-mean Gaussian distribution, a biased sampling that favors directions near the "optimal" direction may also work, but the degree of bias would be empirically difficult to determine and may be data-dependent. In general, we aim to avoid supervision in the projection parameters, as trying to optimize the weights jointly would defeat the purpose of having a computationally efficient method, and would perhaps increase training accuracy at the expense of over-fitting. The risk of over-fitting is also lower in this way, as we do not increase the dimensionality of the input space, and we do not learn the matrices Wl, which means we pass a weak signal from layer to layer. Also, training the Random Recursive SVM is carried out in a feed-forward way, where each step involves a convex optimization problem that can be efficiently solved.

3.3 Synthetic examples

To visually show the effectiveness of our approach in learning non-linear SVM classifiers without kernels, we apply our algorithm to two synthetic examples, neither of which can be linearly separated. 
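To make the feed-forward recursion of Section 3.1 concrete on this kind of data, here is a minimal end-to-end sketch. Everything in it is illustrative: the two-moon generator is a generic one (not the authors' exact data), and ridge regression on ±1 targets stands in for the linear SVM at each layer.

```python
import numpy as np

def two_moons(n=200, noise=0.1, seed=0):
    # generic two interleaving half-circles, one per class
    rng = np.random.default_rng(seed)
    t = rng.uniform(0.0, np.pi, n)
    X = np.vstack([np.c_[np.cos(t), np.sin(t)],
                   np.c_[1.0 - np.cos(t), 0.5 - np.sin(t)]])
    X += noise * rng.standard_normal(X.shape)
    y = np.r_[np.zeros(n, dtype=int), np.ones(n, dtype=int)]
    return X, y

def linear_scores(X, y):
    # ridge regression on +/-1 targets: a simple stand-in for a linear SVM
    Xb = np.c_[X, np.ones(len(X))]
    t = np.where(y == 1, 1.0, -1.0)
    w = np.linalg.solve(Xb.T @ Xb + 1e-3 * np.eye(Xb.shape[1]), Xb.T @ t)
    return Xb @ w

def r2svm_train_accuracies(d, y, layers=15, beta=1.0, seed=0):
    # feed-forward recursion: x_{l+1} = sigmoid(d + beta * [o_1..o_l] @ W^T)
    rng = np.random.default_rng(seed)
    x, scores, accs = d, [], []
    for _ in range(layers):
        s = linear_scores(x, y)                  # o_l from the l-th linear model
        scores.append(s)
        accs.append(float(np.mean((s > 0).astype(int) == y)))
        W = rng.standard_normal((d.shape[1], len(scores)))  # fresh random W_{l+1}
        x = 1.0 / (1.0 + np.exp(-(d + beta * (np.column_stack(scores) @ W.T))))
    return accs

X, y = two_moons()
accs = r2svm_train_accuracies(X, y)
```

The first entry of `accs` is the plain linear classifier of the first layer; later entries trace how training accuracy evolves as layers are stacked.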
The first example contains two classes distributed in a two-moon shape, and the second example contains data distributed as two more complex spirals. Figure 3 visualizes the classification hyperplane at different stages of our algorithm. The first layer of our approach is identical to the linear SVM, which is not able to separate the data well. However, when classifiers are recursively stacked in our approach, the classification hyperplane is able to adapt to the nonlinear characteristics of the two classes.

4 Experiments

In this section we empirically evaluate our method, and support our claims: (1) for low-dimensional features, linear SVMs suffer from their limited representation power, while R2SVMs significantly improve performance; (2) for high-dimensional features, and especially when faced with a limited amount of training data, R2SVMs exhibit better generalization power than conventional kernelized non-linear SVMs; and (3) the random, feed-forward learning scheme is able to achieve state-of-the-art performance without complex fine-tuning.

Figure 3: Classification hyperplane from different stages of our algorithm: first layer, second layer, and final layer outputs. (a)-(c) show the two-moon data and (d)-(f) show the spiral data.

Figure 4: Results on CIFAR-10. (a) Accuracy versus number of layers on CIFAR-10 for the Random Recursive SVM with all the training data and a codebook size of 50, for a baseline where the output of a classifier is concatenated with the input feature space, and for a deterministic version of recursive SVM where the projections are as in the proof of Lemma 3.1. (b) Accuracy versus codebook size on CIFAR-10 for linear SVM, RBF SVM, and our proposed method.

We describe the experimental results on two well-known classification benchmarks: CIFAR-10 and TIMIT. 
The CIFAR-10 dataset contains a large amount of training/testing data focusing on object classification. TIMIT is a speech database that contains two orders of magnitude more training samples than the other datasets, and has the largest output label space.

Recall that our method relies on two parameters: β, the factor that controls how much to shift the original feature space, and C, the regularization parameter of the linear SVM trained at each layer. β is set to 1/10 for all the experiments, which was experimentally found to work well for one of the CIFAR-10 configurations. C controls the regularization of each layer and is an important parameter: setting it too high will yield overfitting as the number of layers is increased. As a result, we learned this parameter via cross validation for each configuration, which is the usual practice of other approaches. Lastly, for each layer we sample a new random matrix Wl, so even if the training and testing sets are fixed, randomness still exists in our algorithm. Although one may expect the performance to fluctuate from run to run, in practice we never observe a standard deviation larger than 0.25 (and typically less than 0.1) in classification accuracy over multiple runs of each experiment.

CIFAR-10
The CIFAR-10 dataset contains 10 object classes with a fair number of training examples per class (5000) and images of small size (32x32 pixels). For this dataset, we follow the standard pipeline defined in [4]: dense 6x6 local patches with ZCA whitening are extracted with stride 1, and thresholding coding with α = 0.25 is adopted for encoding. The codebook is trained with OMP-1. The features are then average-pooled on a 2 × 2 grid to form the global image representation. 
We tested three classifiers: linear SVM, RBF kernel SVM, and the Random Recursive SVM model introduced in Section 3.

Table 1: Results on CIFAR-10, with different codebook sizes (hence feature dimensions).

Method       Tr. Size   Code. Size   Acc.
Linear SVM   All        50           64.7%
RBF SVM      All        50           74.4%
R2SVM        All        50           69.3%
DCN          All        50           67.2%
Linear SVM   All        1600         79.5%
RBF SVM      All        1600         79.0%
R2SVM        All        1600         79.7%
DCN          All        1600         78.1%

Table 2: Results on CIFAR-10, with 25 training data per class.

Method       Tr. Size   Code. Size   Acc.
Linear SVM   25/class   50           41.3%
RBF SVM      25/class   50           42.2%
R2SVM        25/class   50           42.8%
DCN          25/class   50           40.7%
Linear SVM   25/class   1600         44.1%
RBF SVM      25/class   1600         41.6%
R2SVM        25/class   1600         45.1%
DCN          25/class   1600         42.7%

As shown in Figure 4(a), the performance increases almost monotonically as we stack more layers in R2SVM. Also, stacking SVMs by concatenating the output with the input feature space does not yield much gain beyond 1 layer (which is a linear SVM), and neither does a deterministic version of recursive SVM where a projection matrix as in the proof of Lemma 3.1 is used. For the R2SVM, in most cases the performance asymptotically converges within 30 layers. Note that training each layer involves training a linear SVM, so the computational complexity is simply linear in the depth of our model. 
In contrast, training deep learning models based on many hidden layers may be significantly harder, partially due to the lack of supervised information for the hidden layers.

Figure 4(b) shows the effect that the feature dimensionality (controlled by the codebook size of OMP-1) has on the performance of the linear and non-linear classifiers, and Table 1 provides representative numerical results. In particular, when the codebook size is low, the assumption that we can approximate the non-linear function f with a globally linear classifier fails, and in those cases the R2SVM and RBF SVM clearly outperform the linear SVM. Moreover, as the codebook size grows, non-linear classifiers, represented by the RBF SVM in our experiments, suffer from the curse of dimensionality, partially due to the large dimensionality of the over-complete feature representation. In fact, as the dimensionality of the over-complete representation becomes too large, the RBF SVM starts performing worse than the linear SVM. For the linear SVM, increasing the codebook size makes it perform better relative to non-linear classifiers, but additional gains can still be consistently obtained by the Random Recursive SVM method. Also note how our model outperforms DCN, another stacking architecture, proposed in [10].

As with the codebook size, it is interesting to vary the number of training examples per class. When we use fewer training examples per class, little gain is obtained by classical RBF SVMs, and performance even drops when the feature dimension is too high (Table 2), while our Random Recursive SVM remains competitive and does not overfit more than any baseline. 
This again suggests that our proposed method may generalize better than the RBF SVM, a desirable property when the number of training examples is small with respect to the dimensionality of the feature space, a common case in many computer vision applications.

In general, our method is able to combine the advantages of both linear and non-linear SVMs: it has higher representation power than the linear SVM, providing consistent performance gains, and at the same time is more robust against overfitting. It is also worth pointing out that R2SVM is highly efficient, since each layer is a simple linear SVM whose evaluation can be carried out by simple matrix multiplication. On the other hand, non-linear SVMs like the RBF SVM may take much longer to run, especially for large-scale data, unless special care is taken [6].

TIMIT
Finally, we report our experiments on the popular speech database TIMIT. The speech data is analyzed using a 25-ms Hamming window with a 10-ms fixed frame rate. We represent the speech using first- to 12th-order Mel frequency cepstral coefficients (MFCCs) and energy, along with their first and second temporal derivatives. The training set consists of 462 speakers, with a total of 1.1 million frames in the training data, making classical kernel SVMs virtually impossible to train. The development set contains 50 speakers, with a total of 120K frames, and is used for cross validation. Results are reported on the standard 24-speaker core test set consisting of 192 sentences with 7333 phone tokens and 57920 frames.

The data is normalized to have zero mean and unit variance. All experiments used a context window of 11 frames. This gives a total of 39 × 11 = 429 elements in each feature vector. 
We used 183 target class labels (i.e., three states for each of the 61 phones), typically called "phone states", with a one-hot encoding.

Table 3: Performance comparison on TIMIT.

Method                     Phone state accuracy
Linear SVM                 50.1% (2000 codes)   53.5% (8000 codes)
R2SVM                      53.5% (2000 codes)   55.1% (8000 codes)
DCN, learned per-layer     48.5%
DCN, jointly fine-tuned    54.3%

The pipeline adopted is otherwise unchanged from the previous dataset. However, we did not apply pooling, and instead coded the whole 429-dimensional vector with dictionaries of 2000 and 8000 elements found with OMP-1, with the same parameter α as in the vision tasks. The competitive results of a framework known in vision adapted to speech [22], as shown in Table 3, are interesting in their own right, as the optimization framework for the linear SVM is well understood, and the dictionary learning and encoding steps are almost trivial and scale well with the amounts of data available in typical speech tasks. On top of this, our R2SVM boosts performance quite significantly, similar to what we observed on the other datasets.

In Table 3 we also report recent work on this dataset [10], which uses a multilayer perceptron with one hidden layer and a linear output, and stacks such blocks on top of each other. In their experiments, the representation of the speech signal is not sparse, and instead uses a Restricted Boltzmann Machine, which is more time-consuming to learn. In addition, only when the network weights are jointly optimized (fine-tuned), which requires solving a non-convex problem, does the accuracy reach the state-of-the-art performance of 54.3%. Our method does not include this step, which could be added in future work; we thus think the fairest comparison of our result is to the per-layer DCN performance.

In all the experiments above we have observed two advantages of R2SVM. 
First, it provides a consistent improvement over linear SVMs. Second, it can offer better generalization ability than non-linear SVMs, especially when the ratio of dimensionality to the number of training samples is large. These advantages, combined with the fact that R2SVM is efficient in both training and testing, suggest that it could be adopted as an improvement over existing classification pipelines in general.
We also note that in the current work we have not employed fine-tuning techniques similar to the one used in the architecture of [10]. Fine tuning of the latter architecture has accounted for between 10% and 20% error reduction, and reduces the depth needed to achieve a fixed level of recognition accuracy. Developing fine-tuning for our method is expected to improve recognition accuracy further, and is of interest for future research. However, even without fine tuning, the recognition accuracy is shown to consistently improve until convergence, demonstrating the robustness of the proposed method.

5 Conclusions and Future Work

In this paper, we investigated low-level vision and audio representations. We combined the simplicity of linear SVMs with the power derived from deep architectures, and proposed a new stacking technique for building a better classifier, using linear SVMs as the base building blocks and employing a random non-linear projection to add flexibility to the model. Our work is partially motivated by the recent trend of using coding techniques as feature representations with relatively large dictionaries. The chief advantage of our method lies in the fact that it learns non-linear classifiers without the need for kernel design, while keeping the efficiency of linear SVMs. Experimental results on vision and speech datasets showed that the method provides consistent improvements over linear baselines, even with no learning of the model parameters.
The convexity of our model could lead to better theoretical analysis of such deep structures in terms of the generalization gap, opens interesting opportunities for learning on large computer clusters, and may help in understanding the nature of other deep learning approaches; these directions are the main interest of our future research.

References

[1] P L Bartlett and S Mendelson. Rademacher and Gaussian complexities: Risk bounds and structural results. The Journal of Machine Learning Research, 3:463–482, 2003.
[2] O Boiman, E Shechtman, and M Irani. In defense of nearest-neighbor based image classification. In CVPR, 2008.
[3] L Bourdev, S Maji, T Brox, and J Malik. Detecting people using mutually consistent poselet activations. In ECCV, 2010.
[4] A Coates and A Ng. The importance of encoding versus training with sparse coding and vector quantization. In ICML, 2011.
[5] W Cohen and V R de Carvalho. Stacked sequential learning. In IJCAI, 2005.
[6] R Collobert, F Sinz, J Weston, and L Bottou. Trading convexity for scalability. In ICML, 2006.
[7] N Dalal and B Triggs. Histograms of oriented gradients for human detection. In CVPR, 2005.
[8] S Davis and P Mermelstein. Comparison of parametric representations for monosyllabic word recognition in continuously spoken sentences. IEEE Transactions on Acoustics, Speech and Signal Processing, 28(4):357–366, 1980.
[9] L Deng, M L Seltzer, D Yu, A Acero, A Mohamed, and G Hinton. Binary coding of speech spectrograms using a deep auto-encoder. In Interspeech, 2010.
[10] L Deng and D Yu. Deep convex network: A scalable architecture for deep learning. In Interspeech, 2011.
[11] L Deng, D Yu, and J Platt. Scalable stacking and learning for building deep architectures. In ICASSP, 2012.
[12] L Fei-Fei and P Perona.
A Bayesian hierarchical model for learning natural scene categories. In CVPR, 2005.
[13] G Hinton, L Deng, D Yu, G Dahl, A Mohamed, N Jaitly, A Senior, V Vanhoucke, P Nguyen, T Sainath, and B Kingsbury. Deep neural networks for acoustic modeling in speech recognition. IEEE Signal Processing Magazine, 28:82–97, 2012.
[14] G Hinton and R Salakhutdinov. Reducing the dimensionality of data with neural networks. Science, 313(5786):504, 2006.
[15] K Jarrett, K Kavukcuoglu, M A Ranzato, and Y LeCun. What is the best multi-stage architecture for object recognition? In ICCV, 2009.
[16] T Kohonen. Self-Organizing Maps. Springer-Verlag, 2001.
[17] Y Lin, T Zhang, S Zhu, and K Yu. Deep coding network. In NIPS, 2010.
[18] D Lowe. Distinctive image features from scale-invariant keypoints. IJCV, 2004.
[19] S Maji, A C Berg, and J Malik. Classification using intersection kernel support vector machines is efficient. In CVPR, 2008.
[20] A Mohamed, D Yu, and L Deng. Investigation of full-sequence training of deep belief networks for speech recognition. In Interspeech, 2010.
[21] B Olshausen and D J Field. Sparse coding with an overcomplete basis set: a strategy employed by V1? Vision Research, 37(23):3311–3325, 1997.
[22] O Vinyals and L Deng. Are sparse representations rich enough for acoustic modeling? In Interspeech, 2012.
[23] D H Wolpert. Stacked generalization. Neural Networks, 5(2):241–259, 1992.
[24] J Yang, K Yu, and Y Gong. Linear spatial pyramid matching using sparse coding for image classification. In CVPR, 2009.
[25] J Yang, K Yu, and T Huang. Efficient highly over-complete sparse coding using a mixture model. In ECCV, 2010.
[26] K Yu and T Zhang. Improved local coordinate coding using local tangents.
In ICML, 2010.