{"title": "Do Deep Nets Really Need to be Deep?", "book": "Advances in Neural Information Processing Systems", "page_first": 2654, "page_last": 2662, "abstract": "Currently, deep neural networks are the state of the art on problems such as speech recognition and computer vision. In this paper we empirically demonstrate that shallow feed-forward nets can learn the complex functions previously learned by deep nets and achieve accuracies previously only achievable with deep models. Moreover, in some cases the shallow nets can learn these deep functions using the same number of parameters as the original deep models. On the TIMIT phoneme recognition and CIFAR-10 image recognition tasks, shallow nets can be trained that perform similarly to complex, well-engineered, deeper convolutional models.", "full_text": "Do Deep Nets Really Need to be Deep?\n\nLei Jimmy Ba\n\nUniversity of Toronto\n\njimmy@psi.utoronto.ca\n\nRich Caruana\n\nMicrosoft Research\n\nrcaruana@microsoft.com\n\nAbstract\n\nCurrently, deep neural networks are the state of the art on problems such as speech\nrecognition and computer vision. In this paper we empirically demonstrate that\nshallow feed-forward nets can learn the complex functions previously learned by\ndeep nets and achieve accuracies previously only achievable with deep models.\nMoreover, in some cases the shallow nets can learn these deep functions using the\nsame number of parameters as the original deep models. On the TIMIT phoneme\nrecognition and CIFAR-10 image recognition tasks, shallow nets can be trained\nthat perform similarly to complex, well-engineered, deeper convolutional models.\n\n1\n\nIntroduction\n\nYou are given a training set with 1M labeled points. When you train a shallow neural net with one\nfully connected feed-forward hidden layer on this data you obtain 86% accuracy on test data. 
When you train a deeper neural net as in [1], consisting of a convolutional layer, a pooling layer, and three fully connected feed-forward layers, on the same data you obtain 91% accuracy on the same test set. What is the source of this improvement? Is the 5% increase in accuracy of the deep net over the shallow net because: a) the deep net has more parameters; b) the deep net can learn more complex functions given the same number of parameters; c) the deep net has better inductive bias and thus learns more interesting/useful functions (e.g., because the deep net is deeper it learns hierarchical representations [5]); d) nets without convolution can't easily learn what nets with convolution can learn; e) current learning algorithms and regularization methods work better with deep architectures than with shallow architectures [8]; f) all or some of the above; g) none of the above?

There have been attempts to answer this question. It has been shown that deep nets coupled with unsupervised layer-by-layer pre-training [10, 19] work well. In [8], the authors show that depth combined with pre-training provides a good prior for model weights, thus improving generalization. There is well-known early theoretical work on the representational capacity of neural nets. For example, it was proved that a network with a large enough single hidden layer of sigmoid units can approximate any decision boundary [4]. Empirical work, however, shows that it is difficult to train shallow nets to be as accurate as deep nets. For vision tasks, a recent study on deep convolutional nets suggests that deeper models are preferred under a parameter budget [7]. In [5], the authors trained shallow nets on SIFT features to classify a large-scale ImageNet dataset and found that it was difficult to train large, high-accuracy, shallow nets. 
And in [17], the authors show that deeper models are more accurate than shallow models in speech acoustic modeling.

In this paper we provide empirical evidence that shallow nets are capable of learning the same function as deep nets, and in some cases with the same number of parameters as the deep nets. We do this by first training a state-of-the-art deep model, and then training a shallow model to mimic the deep model. The mimic model is trained using the model compression method described in the next section. Remarkably, with model compression we are able to train shallow nets to be as accurate as some deep models, even though we are not able to train these shallow nets to be as accurate as the deep nets when the shallow nets are trained directly on the original labeled training data. If a shallow net with the same number of parameters as a deep net can learn to mimic a deep net with high fidelity, then it is clear that the function learned by that deep net does not really have to be deep.

2 Training Shallow Nets to Mimic Deep Nets

2.1 Model Compression

The main idea behind model compression [3] is to train a compact model to approximate the function learned by a larger, more complex model. For example, in [3], a single neural net of modest size could be trained to mimic a much larger ensemble of models: although the small neural nets contained 1000 times fewer parameters, often they were just as accurate as the ensembles they were trained to mimic. Model compression works by passing unlabeled data through the large, accurate model and collecting the scores it produces. This synthetically labeled data is then used to train the smaller mimic model. The mimic model is not trained on the original labels; it is trained to learn the function that was learned by the larger model. 
If the compressed model learns to mimic the large model perfectly it makes exactly the same predictions and mistakes as the complex model. Surprisingly, often it is not (yet) possible to train a small neural net on the original training data to be as accurate as the complex model, nor as accurate as the mimic model. Compression demonstrates that a small neural net could, in principle, learn the more accurate function, but current learning algorithms are unable to train a model with that accuracy from the original training data; instead, we must train the complex intermediate model first and then train the neural net to mimic it. Clearly, when it is possible to mimic the function learned by a complex model with a small net, the function learned by the complex model wasn't truly too complex to be learned by a small net. This suggests to us that the complexity of a learned model, and the size and architecture of the representation best used to learn that model, are different things.

2.2 Mimic Learning via Regressing Logits with L2 Loss

On both TIMIT and CIFAR-10 we use model compression to train shallow mimic nets using data labeled by either a deep net, or an ensemble of deep nets, trained on the original TIMIT or CIFAR-10 training data. The deep models are trained in the usual way using a softmax output and cross-entropy cost function. The shallow mimic models, however, instead of being trained with cross-entropy on the 183 p values where p_k = e^{z_k} / Σ_j e^{z_j} output by the softmax layer from the deep model, are trained directly on the 183 log probability values z, also called logits, before the softmax activation. Training on logits, which are logarithms of predicted probabilities, makes learning easier for the student model by placing equal emphasis on the relationships learned by the teacher model across all of the targets. 
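A small numerical illustration (ours, not from the paper) of why logit targets spread emphasis across outputs: with cross-entropy against the teacher's probabilities, each target's contribution to the loss is weighted by the teacher's probability for it, so near-zero-probability targets contribute almost nothing, while squared error on logits penalizes a mismatch on every target at the same scale.

```python
import numpy as np

teacher_logits = np.array([-10.0, 0.0, 10.0])
teacher_p = np.exp(teacher_logits) / np.exp(teacher_logits).sum()
# teacher_p is roughly [2e-9, 4.5e-5, 0.99995]: nearly all mass on target 3

# Suppose the student currently predicts uniformly (all logits zero).
student_logits = np.zeros(3)
student_p = np.ones(3) / 3.0

# Per-target contribution to cross-entropy vs. to L2-on-logits.
ce_terms = -teacher_p * np.log(student_p)        # weighted by teacher_p
l2_terms = 0.5 * (student_logits - teacher_logits) ** 2

print(ce_terms)   # the first two terms vanish: almost no signal there
print(l2_terms)   # a mismatch on every target is penalized at full scale
```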
For example, if the teacher predicts three targets with probability [2×10^-9, 4×10^-5, 0.9999] and those probabilities are used as prediction targets and cross-entropy is minimized, the student will focus on the third target and tend to ignore the first and second targets. A student, however, trained on the logits for these targets, [10, 20, 30], will better learn to mimic the detailed behaviour of the teacher model. Moreover, consider a second training case where the teacher predicts logits [-10, 0, 10]. After softmax, these logits yield the same predicted probabilities as [10, 20, 30], yet clearly the teacher models the two cases very differently. By training the student model directly on the logits, the student is better able to learn the internal model learned by the teacher, without suffering from the information loss that occurs from passing through logits to probability space.

We formulate the SNN-MIMIC learning objective function as a regression problem given training data {(x^(1), z^(1)), ..., (x^(T), z^(T))}:

    L(W, β) = 1/(2T) Σ_t || g(x^(t); W, β) - z^(t) ||²_2,    (1)

where W is the weight matrix between the input features x and the hidden layer, β is the weight matrix from the hidden to the output units, g(x^(t); W, β) = β f(W x^(t)) is the model prediction on the t-th training data point, and f(·) is the non-linear activation of the hidden units. The parameters W and β are updated using the standard error back-propagation algorithm and stochastic gradient descent with momentum. We have also experimented with other mimic loss functions, such as minimizing the KL divergence KL(p_teacher || p_student) and L2 loss on probabilities. Regression on logits outperforms all the other loss functions and is one of the key techniques for obtaining the results in the rest of this
We found that normalizing the logits from the teacher model by subtracting the mean and\ndividing the standard deviation of each target across the training set can improve L2 loss slightly\nduring training, but normalization is not crucial for obtaining good student mimic models.\n\n2.3 Speeding-up Mimic Learning by Introducing a Linear Layer\n\nTo match the number of parameters in a deep net, a shallow net has to have more non-linear hidden\nunits in a single layer to produce a large weight matrix W . When training a large shallow neural\nnetwork with many hidden units, we \ufb01nd it is very slow to learn the large number of parameters in the\nweight matrix between input and hidden layers of size O(HD), where D is input feature dimension\nand H is the number of hidden units. Because there are many highly correlated parameters in this\nlarge weight matrix, gradient descent converges slowly. We also notice that during learning, shallow\nnets spend most of the computation in the costly matrix multiplication of the input data vectors and\nlarge weight matrix. The shallow nets eventually learn accurate mimic functions, but training to\nconvergence is very slow (multiple weeks) even with a GPU.\nWe found that introducing a bottleneck linear layer with k linear hidden units between the input\nand the non-linear hidden layer sped up learning dramatically: we can factorize the weight matrix\nW \u2208 RH\u00d7D into the product of two low-rank matrices, U \u2208 RH\u00d7k and V \u2208 Rk\u00d7D, where\nk << D, H. The new cost function can be written as:\n\n(cid:88)\n\nt\n\nL(U, V, \u03b2) =\n\n1\n2T\n\n||\u03b2f (U V x(t)) \u2212 z(t)||2\n\n2\n\n(2)\n\nThe weights U and V can be learnt by back-propagating through the linear layer. This re-\nparameterization of weight matrix W not only increases the convergence rate of the shallow mimic\nnets, but also reduces memory space from O(HD) to O(k(H + D)).\nFactorizing weight matrices has been previously explored in [16] and [20]. 
While these prior works focus on using matrix factorization in the last output layer, our method is applied between the input and hidden layer to improve the convergence speed during training. The reduced memory usage enables us to train large shallow models that were previously infeasible due to excessive memory usage. Note that the linear bottleneck can only reduce the representational power of the network, and it can always be absorbed into a single weight matrix W.

3 TIMIT Phoneme Recognition

The TIMIT speech corpus has 462 speakers in the training set, a separate development set for cross-validation that includes 50 speakers, and a final test set with 24 speakers. The raw waveform audio data were pre-processed using a 25ms Hamming window shifting by 10ms to extract Fourier-transform-based filter banks with 40 coefficients (plus energy) distributed on a mel-scale, together with their first and second temporal derivatives. We included +/- 7 nearby frames to formulate the final 1845-dimensional input vector. The input features were normalized by subtracting the mean and dividing by the standard deviation on each dimension. All 61 phoneme labels are represented in tri-state, i.e., three states for each of the 61 phonemes, yielding target label vectors with 183 dimensions for training. At decoding time these are mapped to 39 classes as in [13] for scoring.

3.1 Deep Learning on TIMIT

Deep learning was first successfully applied to speech recognition in [14]. Following their framework, we train two deep models on TIMIT, DNN and CNN. DNN is a deep neural net with three fully connected feed-forward hidden layers of 2000 rectified linear units (ReLU) [15] each. CNN is a deep neural net consisting of a convolutional layer and max-pooling layer followed by three hidden layers containing 2000 ReLU units [2]. The CNN was trained using the same convolutional architecture as in [6]. 
We also formed an ensemble of nine CNN models, ECNN. The accuracies of the DNN, CNN, and ECNN on the final test set are shown in Table 1. The error rate of the convolutional deep net (CNN) is about 2.1% better than that of the deep net (DNN). The table also shows the accuracy of shallow neural nets with 8000, 50,000, and 400,000 hidden units (SNN-8k, SNN-50k, and SNN-400k) trained on the original training data. Despite having up to 10X as many parameters as the DNN, CNN, and ECNN, the shallow models are 1.4% to 2% less accurate than the DNN, 3.5% to 4.1% less accurate than the CNN, and 4.5% to 5.1% less accurate than the ECNN.

3.2 Learning to Mimic an Ensemble of Deep Convolutional TIMIT Models

The most accurate single model that we trained on TIMIT is the deep convolutional architecture in [6]. Because we have no unlabeled data from the TIMIT distribution, we use the same 1.1M points in the train set as unlabeled data for compression by throwing away the labels.1 Re-using the 1.1M-point train set reduces the accuracy of the student mimic models, increasing the gap between the teacher and mimic models on test data: model compression works best when the unlabeled set is very large, and when the unlabeled samples do not fall on train points where the teacher model is likely to have overfit. To reduce the impact of the gap caused by performing compression with the original train set, we train the student model to mimic a more accurate ensemble of deep convolutional models. We are able to train a more accurate model on TIMIT by forming an ensemble of nine deep convolutional neural nets, each trained with somewhat different train sets, and with architectures of different kernel sizes in the convolutional layers. We used this very accurate model, ECNN, as the teacher model to label the data used to train the shallow mimic nets. 
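The regression targets for the mimic nets are built by averaging the logits of the nine CNNs in the ECNN and, optionally, standardizing each of the 183 target dimensions as in Section 2.2. A sketch with synthetic arrays standing in for the CNN logit outputs (ours, not the paper's code):

```python
import numpy as np

rng = np.random.default_rng(0)
N_MODELS, N_EXAMPLES, N_TARGETS = 9, 1000, 183  # nine CNNs, 183 tri-state targets

# Synthetic stand-ins for the logits produced by each CNN in the ensemble.
member_logits = rng.normal(loc=2.0, scale=3.0,
                           size=(N_MODELS, N_EXAMPLES, N_TARGETS))

# Average the members' logits to obtain the ECNN regression targets.
z = member_logits.mean(axis=0)

# Optional per-target normalization: zero mean, unit std across the transfer set.
z_norm = (z - z.mean(axis=0)) / z.std(axis=0)
print(z.shape, z_norm.mean(), z_norm.std())
```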
As described in Section 2.2, the logits (log probabilities of the predicted values) from each CNN in the ECNN model are averaged, and the average logits are used as the final regression targets to train the mimic SNNs. We trained shallow mimic nets with 8k (SNN-MIMIC-8k) and 400k (SNN-MIMIC-400k) hidden units on the re-labeled 1.1M training points. As described in Section 2.3, to speed up learning both mimic models have 250 linear units between the input and the non-linear hidden layer; preliminary experiments suggest that for TIMIT there is little benefit from using more than 250 linear units.

3.3 Compression Results For TIMIT

| Model          | Architecture                                     | # Param. | # Hidden units | PER   |
| SNN-8k         | 8k + dropout, trained on original data           | ~12M     | ~8k            | 23.1% |
| SNN-50k        | 50k + dropout, trained on original data          | ~100M    | ~50k           | 23.0% |
| SNN-400k       | 250L-400k + dropout, trained on original data    | ~180M    | ~400k          | 23.6% |
| DNN            | 2k-2k-2k + dropout, trained on original data     | ~12M     | ~6k            | 21.9% |
| CNN            | c-p-2k-2k-2k + dropout, trained on original data | ~13M     | ~10k           | 19.5% |
| ECNN           | ensemble of 9 CNNs                               | ~125M    | ~90k           | 18.5% |
| SNN-MIMIC-8k   | 250L-8k, no convolution or pooling layers        | ~12M     | ~8k            | 21.6% |
| SNN-MIMIC-400k | 250L-400k, no convolution or pooling layers      | ~180M    | ~400k          | 20.0% |

Table 1: Comparison of shallow and deep models: phone error rate (PER) on the TIMIT core test set.

The bottom of Table 1 shows the accuracy of shallow mimic nets with 8000 ReLUs and 400,000 ReLUs (SNN-MIMIC-8k and -400k) trained with model compression to mimic the ECNN. Surprisingly, shallow nets are able to perform as well as their deep counterparts when trained with model compression to mimic a more accurate model. A neural net with one hidden layer (SNN-MIMIC-8k) can be trained to perform as well as a DNN with a similar number of parameters. 
Furthermore, if we increase the number of hidden units in the shallow net from 8k to 400k (the largest we could train), we see that a neural net with one hidden layer (SNN-MIMIC-400k) can be trained to perform comparably to a CNN, even though the SNN-MIMIC-400k net has no convolutional or pooling layers. This is interesting because it suggests that a large single hidden layer without a topology custom designed for the problem is able to reach the performance of a deep convolutional neural net that was carefully engineered with prior structure and weight sharing, without any increase in the number of training examples, even though the same architecture trained on the original data could not.

1 That SNNs can be trained to be as accurate as DNNs using only the original training data highlights that it should be possible to train accurate SNNs on the original training data given better learning algorithms.

Figure 1: Accuracy of SNNs, DNNs, and mimic SNNs vs. number of parameters on the TIMIT Dev (left) and Test (right) sets. Accuracy of the CNN and target ECNN are shown as horizontal lines for reference.

Figure 1 shows the accuracy of shallow nets and deep nets trained on the original TIMIT 1.1M data, and shallow mimic nets trained on the ECNN targets, as a function of the number of parameters in the models. The accuracy of the CNN and the teacher ECNN are shown as horizontal lines at the top of the figures. When the number of parameters is small (about 1 million), the SNN, DNN, and SNN-MIMIC models all have similar accuracy. As the size of the hidden layers increases and the number of parameters increases, the accuracy of a shallow model trained on the original data begins to lag behind. The accuracy of the shallow mimic model, however, matches the accuracy of the DNN until about 4 million parameters, when the DNN begins to fall behind the mimic. 
The DNN asymptotes at around 10M parameters, while the shallow mimic continues to increase in accuracy. Eventually the mimic asymptotes at around 100M parameters to an accuracy comparable to that of the CNN. The shallow mimic never achieves the accuracy of the ECNN it is trying to mimic (because there is not enough unlabeled data), but it is able to match or exceed the accuracy of deep nets (DNNs) having the same number of parameters trained on the original data.

4 Object Recognition: CIFAR-10

To verify that the results on TIMIT generalize to other learning problems and task domains, we ran similar experiments on the CIFAR-10 object recognition task [12]. CIFAR-10 consists of a set of natural images from 10 different object classes: airplane, automobile, bird, cat, deer, dog, frog, horse, ship, truck. The dataset is a labeled subset of the 80 million tiny images dataset [18] and is divided into 50,000 train and 10,000 test images. Each image is 32x32 pixels in 3 color channels, yielding input vectors with 3072 dimensions. We prepared the data by subtracting the mean and dividing by the standard deviation of each image vector to perform global contrast normalization. We then applied ZCA whitening to the normalized images. This pre-processing is the same as used in [9].

4.1 Learning to Mimic an Ensemble of Deep Convolutional CIFAR-10 Models

We follow the same approach as with TIMIT: an ensemble of deep CNN models is used to label CIFAR-10 images for model compression. The logit predictions from this teacher model are used as regression targets to train a mimic shallow neural net (SNN). CIFAR-10 images have a higher dimension than TIMIT (3072 vs. 1845), but the size of the CIFAR-10 training set is only 50,000, compared to 1.1 million examples for TIMIT. Fortunately, unlike TIMIT, in CIFAR-10 we have access to unlabeled data from a similar distribution by using the superset of CIFAR-10: the 80 million tiny images dataset. 
We add the first one million images from the 80 million set to the original 50,000 CIFAR-10 training images to create a 1.05M-example mimic training (transfer) set.

| Model | Architecture | # Param. | # Hidden units | Err. |
| DNN | 2000-2000 + dropout | ~10M | 4k | 57.8% |
| SNN-30k | 128c-p-1200L-30k + dropout input&hidden | ~70M | ~190k | 21.8% |
| single-layer, feature extraction | 4000c-p followed by SVM | ~125M | ~3.7B | 18.4% |
| CNN [11] (no augmentation) | 64c-p-64c-p-64c-p-16lc + dropout on lc | ~10k | ~110k | 15.6% |
| CNN [21] (no augmentation) | 64c-p-64c-p-128c-p-fc + dropout on fc and stochastic pooling | ~56k | ~120k | 15.13% |
| teacher CNN (no augmentation) | 128c-p-128c-p-128c-p-1kfc + dropout on fc and stochastic pooling | ~35k | ~210k | 12.0% |
| ECNN (no augmentation) | ensemble of 4 CNNs | ~140k | ~840k | 11.0% |
| SNN-CNN-MIMIC-30k (trained on a single CNN) | 64c-p-1200L-30k with no regularization | ~54M | ~110k | 15.4% |
| SNN-CNN-MIMIC-30k (trained on a single CNN) | 128c-p-1200L-30k with no regularization | ~70M | ~190k | 15.1% |
| SNN-ECNN-MIMIC-30k (trained on ensemble) | 128c-p-1200L-30k with no regularization | ~70M | ~190k | 14.2% |

Table 2: Comparison of shallow and deep models: classification error rate on CIFAR-10. Key: c, convolution layer; p, pooling layer; lc, locally connected layer; fc, fully connected layer.

CIFAR-10 images are raw pixels for objects viewed from many different angles and positions, whereas TIMIT features are human-designed filter-bank features. 
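The CIFAR-10 preprocessing described in Section 4 (global contrast normalization per image, then ZCA whitening fit on the training set) can be sketched as follows. The data here are random vectors standing in for the 3072-dimensional image vectors, and the epsilon regularizer is a common implementation detail, not a value given in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
N, d = 2000, 16                      # stand-ins for 50,000 images x 3072 dims
X = rng.normal(size=(N, d))

# Global contrast normalization: per image, subtract mean, divide by std.
X = (X - X.mean(axis=1, keepdims=True)) / X.std(axis=1, keepdims=True)

# ZCA whitening: rotate to the PCA basis, rescale, rotate back.
cov = X.T @ X / N
s, U = np.linalg.eigh(cov)
eps = 1e-5                           # small regularizer (implementation detail)
W_zca = U @ np.diag(1.0 / np.sqrt(s + eps)) @ U.T
X_white = X @ W_zca

# After whitening, the covariance is near-identity on non-degenerate directions;
# per-image mean subtraction leaves one direction with (near-)zero variance.
evals = np.linalg.eigvalsh(X_white.T @ X_white / N)
print(evals.min(), evals.max())
```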
In preliminary experiments we observed that non-convolutional nets do not perform well on CIFAR-10, no matter what their depth. Instead of raw pixels, the authors in [5] trained their shallow models on SIFT features. Similarly, [7] used a base convolution and pooling layer to study different deep architectures. We follow the approach in [7] to allow our shallow models to benefit from convolution while keeping the models as shallow as possible, and introduce a single layer of convolution and pooling in our shallow mimic models to act as a feature extractor that creates invariance to small translations in the pixel domain. The SNN-MIMIC models for CIFAR-10 thus consist of a convolution and max-pooling layer followed by 1200 fully connected linear units and 30k non-linear units. As before, the linear units are there only to speed up learning; they do not increase the model's representational power and can be absorbed into the weights of the non-linear layer after learning.

Results on CIFAR-10 are consistent with those from TIMIT. Table 2 shows results for the shallow mimic models and for much deeper convolutional nets. The shallow mimic net trained to mimic the teacher CNN (SNN-CNN-MIMIC-30k) achieves accuracy comparable to CNNs with multiple convolutional and pooling layers. And by training the shallow model to mimic the ensemble of CNNs (SNN-ECNN-MIMIC-30k), accuracy is improved by an additional 0.9%. The mimic models are able to achieve accuracies previously unseen on CIFAR-10 for models with so few layers. Although the deep convolutional nets have more hidden units than the shallow mimic models, because of weight sharing the deeper nets with multiple convolution layers have fewer parameters than the shallow fully connected mimic models. 
Still, it is surprising to see how accurate the shallow mimic models are, and that their performance continues to improve as the performance of the teacher model improves (see further discussion of this in Section 5.2).

5 Discussion

5.1 Why Mimic Models Can Be More Accurate than Training on Original Labels

It may be surprising that models trained on targets predicted by other models can be more accurate than models trained on the original labels. There are a variety of reasons why this can happen:

• If some labels have errors, the teacher model may eliminate some of these errors (i.e., censor the data), thus making learning easier for the student.
• Similarly, if there are complex regions in p(y|X) that are difficult to learn given the features and sample density, the teacher may provide simpler, soft labels to the student. Complexity can be washed away by filtering targets through the teacher model.
• Learning from the original hard 0/1 labels can be more difficult than learning from a teacher's conditional probabilities: on TIMIT only one of the 183 outputs is non-zero on each training case, but the mimic model sees non-zero targets for most outputs on most training cases, and the teacher can spread uncertainty over multiple outputs for difficult cases. The uncertainty from the teacher model is more informative to the student model than the original 0/1 labels. This benefit is further enhanced by training on logits.
• The original targets may depend in part on features not available as inputs for learning, but the student model sees targets that depend only on the input features; the targets from the teacher model are a function only of the available inputs; the dependence on unavailable features has been eliminated by filtering targets through the teacher model.

The mechanisms above can be seen as forms of regularization that help prevent overfitting in the student model. 
Typically, shallow models trained on the original targets are more prone to overfitting than deep models: they begin to overfit before learning the accurate functions learned by deeper models, even with dropout (see Figure 2). If we had more effective regularization methods for shallow models, some of the performance gap between shallow and deep models might disappear. Model compression appears to be a form of regularization that is effective at reducing this gap.

Figure 2: Shallow mimic tends not to overfit.

5.2 The Capacity and Representational Power of Shallow Models

Figure 3 shows the results of an experiment on TIMIT where we trained shallow mimic models of two sizes (SNN-MIMIC-8k and SNN-MIMIC-160k) on teacher models of different accuracies. The two shallow mimic models are trained on the same number of data points. The only difference between them is the size of the hidden layer. The x-axis shows the accuracy of the teacher model, and the y-axis is the accuracy of the mimic models. Lines parallel to the diagonal suggest that increases in the accuracy of the teacher models yield similar increases in the accuracy of the mimic models. Although the data do not fall perfectly on a diagonal, there is strong evidence that the accuracy of the mimic models continues to increase as the accuracy of the teacher model improves, suggesting that the mimic models are not (yet) running out of capacity. When trained on the same targets, SNN-MIMIC-8k always performs worse than SNN-MIMIC-160k, which has 10 times more parameters. Although there is a consistent performance gap between the two models due to the difference in size, the smaller shallow model was eventually able to achieve a performance comparable to that of the larger shallow net by learning from a better teacher, and the accuracy of both models continues to increase as teacher accuracy increases. 
This suggests that shallow models with a number of parameters comparable to deep models probably are capable of learning even more accurate functions if a more accurate teacher and/or more unlabeled data become available. Similarly, on CIFAR-10 we saw that increasing the accuracy of the teacher model by forming an ensemble of deep CNNs yielded a commensurate increase in the accuracy of the student model. We see little evidence that shallow models have limited capacity or representational power. Instead, the main limitation appears to be the learning and regularization procedures used to train the shallow models.

Figure 3: Accuracy of student models continues to improve as accuracy of teacher models improves.

5.3 Parallel Distributed Processing vs. Deep Sequential Processing

Our results show that shallow nets can be competitive with deep models on speech and vision tasks. In our experiments the deep models usually required 8–12 hours to train on Nvidia GTX 580 GPUs to reach state-of-the-art performance on the TIMIT and CIFAR-10 datasets. Interestingly, although some of the shallow mimic models have more parameters than the deep models, the shallow models train much faster and reach similar accuracies in only 1–2 hours.

Also, given parallel computational resources, at run-time shallow models can finish computation in 2 or 3 cycles for a given input, whereas a deep architecture has to make sequential inference through each of its layers, expending a number of cycles proportional to the depth of the model. 
This benefit can be important in on-line inference settings where data parallelization is not as easy to achieve as it is in the batch inference setting. For real-time applications such as surveillance or real-time speech translation, a model that responds in fewer cycles can be beneficial.

6 Future Work

The tiny images dataset contains 80 million images. We are currently investigating whether, by labeling these 80M images with a teacher, it is possible to train shallow models with no convolutional or pooling layers to mimic deep convolutional models.

This paper focused on training the shallowest-possible models to mimic deep models in order to better understand the importance of model depth in learning. As suggested in Section 5.3, there are practical applications of this work as well: student models of small-to-medium size and depth can be trained to mimic very large, high-accuracy deep models, and ensembles of deep models, thus yielding better accuracy with reduced runtime cost than is currently achievable without model compression. This approach allows one to flexibly adjust the trade-off between accuracy and computational cost.

In this paper we are able to demonstrate empirically that shallow models can, at least in principle, learn more accurate functions without a large increase in the number of parameters. The algorithm we use to do this, training the shallow model to mimic a more accurate deep model, is awkward, however. It depends on the availability of either a large unlabeled dataset (to reduce the gap between teacher and mimic model), or a teacher model of very high accuracy, or both. 
Developing algorithms to train shallow models of high accuracy directly from the original data, without going through an intermediate teacher model, would, if possible, be a significant contribution.

7 Conclusions

We demonstrate empirically that shallow neural nets can be trained to achieve performance previously achievable only by deep models on the TIMIT phoneme recognition and CIFAR-10 image recognition tasks. Single-layer fully connected feed-forward nets trained to mimic deep models can perform similarly to well-engineered, complex deep convolutional architectures. The results suggest that the strength of deep learning may arise in part from a good match between deep architectures and current training procedures, and that it may be possible to devise better learning algorithms to train more accurate shallow feed-forward nets. For a given number of parameters, depth may make learning easier, but may not always be essential.

Acknowledgements We thank Li Deng for generous help with TIMIT, Li Deng and Ossama Abdel-Hamid for the code for their deep convolutional TIMIT model, Chris Burges, Li Deng, Ran Gilad-Bachrach, Tapas Kanungo, and John Platt for discussions that significantly improved this work, David Johnson for help with the speech model, and Mike Aultman for help with the GPU cluster.

References

[1] Ossama Abdel-Hamid, Abdel-rahman Mohamed, Hui Jiang, and Gerald Penn. Applying convolutional neural networks concepts to hybrid NN-HMM model for speech recognition. In Acoustics, Speech and Signal Processing (ICASSP), 2012 IEEE International Conference on, pages 4277–4280. IEEE, 2012.

[2] Ossama Abdel-Hamid, Li Deng, and Dong Yu. Exploring convolutional neural network structures and optimization techniques for speech recognition. Interspeech 2013, 2013.

[3] Cristian Buciluǎ, Rich Caruana, and Alexandru Niculescu-Mizil.
Model compression. In Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 535–541. ACM, 2006.

[4] George Cybenko. Approximation by superpositions of a sigmoidal function. Mathematics of Control, Signals and Systems, 2(4):303–314, 1989.

[5] Yann N. Dauphin and Yoshua Bengio. Big neural networks waste capacity. arXiv preprint arXiv:1301.3583, 2013.

[6] Li Deng, Jinyu Li, Jui-Ting Huang, Kaisheng Yao, Dong Yu, Frank Seide, Michael Seltzer, Geoff Zweig, Xiaodong He, Jason Williams, et al. Recent advances in deep learning for speech research at Microsoft. ICASSP 2013, 2013.

[7] David Eigen, Jason Rolfe, Rob Fergus, and Yann LeCun. Understanding deep architectures using a recursive convolutional network. arXiv preprint arXiv:1312.1847, 2013.

[8] Dumitru Erhan, Yoshua Bengio, Aaron Courville, Pierre-Antoine Manzagol, Pascal Vincent, and Samy Bengio. Why does unsupervised pre-training help deep learning? The Journal of Machine Learning Research, 11:625–660, 2010.

[9] Ian Goodfellow, David Warde-Farley, Mehdi Mirza, Aaron Courville, and Yoshua Bengio. Maxout networks. In Proceedings of The 30th International Conference on Machine Learning, pages 1319–1327, 2013.

[10] G.E. Hinton and R.R. Salakhutdinov. Reducing the dimensionality of data with neural networks. Science, 313(5786):504–507, 2006.

[11] G.E. Hinton, N. Srivastava, A. Krizhevsky, I. Sutskever, and R.R. Salakhutdinov. Improving neural networks by preventing co-adaptation of feature detectors. arXiv preprint arXiv:1207.0580, 2012.

[12] Alex Krizhevsky and Geoffrey Hinton. Learning multiple layers of features from tiny images. Computer Science Department, University of Toronto, Tech. Rep., 2009.

[13] K.F. Lee and H.W. Hon. Speaker-independent phone recognition using hidden Markov models.
Acoustics, Speech and Signal Processing, IEEE Transactions on, 37(11):1641–1648, 1989.

[14] Abdel-rahman Mohamed, George E. Dahl, and Geoffrey Hinton. Acoustic modeling using deep belief networks. Audio, Speech, and Language Processing, IEEE Transactions on, 20(1):14–22, 2012.

[15] V. Nair and G.E. Hinton. Rectified linear units improve restricted Boltzmann machines. In Proc. 27th International Conference on Machine Learning, pages 807–814. Omnipress, Madison, WI, 2010.

[16] Tara N. Sainath, Brian Kingsbury, Vikas Sindhwani, Ebru Arisoy, and Bhuvana Ramabhadran. Low-rank matrix factorization for deep neural network training with high-dimensional output targets. In Acoustics, Speech and Signal Processing (ICASSP), 2013 IEEE International Conference on, pages 6655–6659. IEEE, 2013.

[17] Frank Seide, Gang Li, and Dong Yu. Conversational speech transcription using context-dependent deep neural networks. In Interspeech, pages 437–440, 2011.

[18] Antonio Torralba, Robert Fergus, and William T. Freeman. 80 million tiny images: A large data set for nonparametric object and scene recognition. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 30(11):1958–1970, 2008.

[19] P. Vincent, H. Larochelle, I. Lajoie, Y. Bengio, and P.A. Manzagol. Stacked denoising autoencoders: Learning useful representations in a deep network with a local denoising criterion. The Journal of Machine Learning Research, 11:3371–3408, 2010.

[20] Jian Xue, Jinyu Li, and Yifan Gong. Restructuring of deep neural network acoustic models with singular value decomposition. Proc. Interspeech, Lyon, France, 2013.

[21] Matthew D. Zeiler and Rob Fergus. Stochastic pooling for regularization of deep convolutional neural networks.
arXiv preprint arXiv:1301.3557, 2013.