{"title": "Predicting Parameters in Deep Learning", "book": "Advances in Neural Information Processing Systems", "page_first": 2148, "page_last": 2156, "abstract": "We demonstrate that there is significant redundancy in the parameterization of several deep learning models. Given only a few weight values for each feature it is possible to accurately predict the remaining values. Moreover, we show that not only can the parameter values be predicted, but many of them need not be learned at all. We train several different architectures by learning only a small number of weights and predicting the rest. In the best case we are able to predict more than 95% of the weights of a network without any drop in accuracy.", "full_text": "Predicting Parameters in Deep Learning

Misha Denil¹ Babak Shakibi² Laurent Dinh³ Marc'Aurelio Ranzato⁴ Nando de Freitas¹,²
{misha.denil,nando.de.freitas}@cs.ox.ac.uk laurent.dinh@umontreal.ca ranzato@fb.com
¹University of Oxford, United Kingdom
²University of British Columbia, Canada
³Université de Montréal, Canada
⁴Facebook Inc., USA

Abstract

We demonstrate that there is significant redundancy in the parameterization of several deep learning models. Given only a few weight values for each feature it is possible to accurately predict the remaining values. Moreover, we show that not only can the parameter values be predicted, but many of them need not be learned at all. We train several different architectures by learning only a small number of weights and predicting the rest. In the best case we are able to predict more than 95% of the weights of a network without any drop in accuracy.

1 Introduction

Recent work on scaling deep networks has led to the construction of the largest artificial neural networks to date. It is now possible to train networks with tens of millions [13] or even over a billion parameters [7, 16].

The largest networks (i.e. those of Dean et al. [7]) are trained using asynchronous SGD. In this framework many copies of the model parameters are distributed over many machines and updated independently. An additional synchronization mechanism coordinates between the machines to ensure that different copies of the same set of parameters do not drift far from each other.

A major drawback of this technique is that training is very inefficient in how it makes use of parallel resources [1]. In the largest networks of Dean et al. [7], where the gains from distribution are largest, distributing the model over 81 machines reduces the training time per mini-batch by a factor of 12, and increasing to 128 machines achieves a speedup factor of roughly 14. While these speedups are very significant, there is a clear trend of diminishing returns as the overhead of coordinating between the machines grows. Other approaches to distributed learning of neural networks involve training in batch mode [8], but these methods have not been scaled nearly as far as their online counterparts.

It seems clear that distributed architectures will always be required for extremely large networks; however, as efficiency decreases with greater distribution, it also makes sense to study techniques for learning larger networks on a single machine. If we can reduce the number of parameters which must be learned and communicated for a network of fixed size, then we can reduce the number of machines required to train it, and hence also reduce the overhead of coordination in a distributed framework.

In this work we study techniques for reducing the number of free parameters in neural networks by exploiting the fact that the weights in learned networks tend to be structured. The technique we present is extremely general and can be applied to a broad range of models.
Our technique is also completely orthogonal to the choice of activation function as well as other learning optimizations; it can work alongside other recent advances in neural network training such as dropout [12], rectified units [20] and maxout [9] without modification.

Figure 1: The first column in each block shows four learned features (parameters of a deep model). The second column shows a few parameters chosen at random from the original set in the first column. The third column shows that this random set can be used to predict the remaining parameters. From left to right the blocks are: (1) a convnet trained on STL-10, (2) an MLP trained on MNIST, (3) a convnet trained on CIFAR-10, (4) Reconstruction ICA trained on Hyvärinen's natural image dataset, (5) Reconstruction ICA trained on STL-10.

The intuition motivating the techniques in this paper is the well known observation that the first-layer features of a neural network trained on natural image patches tend to be globally smooth with local edge features, similar to local Gabor features [6, 13]. Given this structure, representing the value of each pixel in the feature separately is redundant, since it is highly likely that the value of a pixel will be equal to a weighted average of its neighbours. Taking advantage of this type of structure means we do not need to store weights for every input in each feature. This intuition is illustrated in Figures 1 and 2.

The remainder of this paper is dedicated to elaborating on this observation. We describe a general purpose technique for reducing the number of free parameters in neural networks.
The core of the technique is based on representing the weight matrix as a low rank product of two smaller matrices. By factoring the weight matrix we are able to directly control the size of the parameterization by controlling the rank of the weight matrix.

Naïve application of this technique is straightforward but tends to reduce the performance of the networks. We show that by carefully constructing one of the factors, while learning only the other factor, we can train networks with vastly fewer parameters which achieve the same performance as full networks with the same structure.

The key to constructing a good first factor is exploiting smoothness in the structure of the inputs. When we have prior knowledge of the smoothness structure we expect to see (e.g. in natural images), we can impose this structure directly through the choice of factor. When no such prior knowledge is available we show that it is still possible to make a good data driven choice.

We demonstrate experimentally that our parameter prediction technique is extremely effective. In the best cases we are able to predict more than 95% of the parameters of a network without any drop in predictive accuracy.

Throughout this paper we make a distinction between dynamic and static parameters. Dynamic parameters are updated frequently during learning, potentially after each observation or mini-batch. This is in contrast to static parameters, whose values are computed once and not altered. Although the values of these parameters may depend on the data and may be expensive to compute, the computation need only be done once during the entire learning process.

The reason for this distinction is that static parameters are much easier to handle in a distributed system, even if their values must be shared between machines. Since the values of static parameters do not change, access to them does not need to be synchronized. Copies of these parameters can be safely distributed across machines without any of the synchronization overhead incurred by distributing dynamic parameters.

Figure 2: RICA with different amounts of parameter prediction. In the leftmost column 100% of the parameters are learned with L-BFGS. In the rightmost column, only 10% of the parameters are learned, while the remaining values are predicted at each iteration. The intermediate columns interpolate between these extremes in increments of 10%.

2 Low rank weight matrices

Deep networks are composed of several layers of transformations of the form h = g(vW), where v is an n_v-dimensional input, h is an n_h-dimensional output, and W is an n_v × n_h matrix of parameters. A column of W contains the weights connecting each unit in the visible layer to a single unit in the hidden layer. We can reduce the number of free parameters by representing W as the product of two matrices W = UV, where U has size n_v × n_α and V has size n_α × n_h. By making n_α much smaller than n_v and n_h we achieve a substantial reduction in the number of parameters.

In principle, learning the factored weight matrices is straightforward. We simply replace W with UV in the objective function and compute derivatives with respect to U and V instead of W. In practice this naïve approach does not perform as well as learning a full rank weight matrix directly. Moreover, the factored representation is redundant: if Q is any invertible matrix of size n_α × n_α we have W = UV = (UQ)(Q⁻¹V) = ŨṼ. One way to remove this redundancy is to fix the value of U and learn only V. The question remains: what is a reasonable choice for U? The following section provides an answer.

3 Feature prediction

We can exploit the structure in the features of a deep network to represent the features in a much lower dimensional space.
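Before choosing U, it helps to see the factored parameterization W = UV from Section 2 concretely. A minimal NumPy sketch (the sizes here are illustrative, not taken from the experiments):

```python
import numpy as np

rng = np.random.default_rng(0)

n_v, n_h, n_alpha = 784, 500, 50  # illustrative sizes; n_alpha << n_v, n_h

# Factored parameterization W = U V: U is static (fixed), V is dynamic (learned).
U = rng.standard_normal((n_v, n_alpha))
V = rng.standard_normal((n_alpha, n_h))
W = U @ V  # full weight matrix, rank at most n_alpha

dynamic_full = n_v * n_h        # parameters learned without factoring
dynamic_factored = n_alpha * n_h  # only V is learned when U is fixed
print(dynamic_factored / dynamic_full)  # ~0.064: only ~6% of the weights are dynamic
```

Fixing U removes the UV = (UQ)(Q⁻¹V) redundancy noted above, since only V remains free.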
To do this we consider the weights connected to a single hidden unit as a function w : W → ℝ mapping weight space to real numbers, and estimate the values of this function using regression. In the case of p × p image patches, W is the set of coordinates of each pixel, but other structures for W are possible.

A simple regression model which is appropriate here is a linear combination of basis functions. In this view the columns of U form a dictionary of basis functions, and the features of the network are linear combinations of these basis functions parameterized by V. The problem thus becomes one of choosing a good base dictionary for representing network features.

3.1 Choice of dictionary

The base dictionary for feature prediction can be constructed in several ways. An obvious choice is to train a single layer unsupervised model and use the features from that model as a dictionary. This approach has the advantage of being extremely flexible—no assumptions about the structure of feature space are required—but has the drawback of requiring an additional training phase.

When we have prior knowledge about the structure of feature space we can exploit it to construct an appropriate dictionary. For example, when learning features for images we could choose U to be a selection of Fourier or wavelet bases to encode our expectation of smoothness.

We can also build U using kernels that encode prior knowledge. One way to achieve this is via kernel ridge regression [25]. Let w_α denote the observed values of the weight vector w on a restricted subset of its domain α ⊂ W. We introduce a kernel matrix K_α, with entries (K_α)_ij = k(i, j), to model the covariance between locations i, j ∈ α. The parameters at these locations are (w_α)_i and (w_α)_j.
The kernel enables us to make smooth predictions of the parameter vector over the entire domain W using the standard kernel ridge predictor:

w = k_αᵀ (K_α + λI)⁻¹ w_α ,

where k_α is a matrix whose elements are given by (k_α)_ij = k(i, j) for i ∈ α and j ∈ W, and λ is a ridge regularization coefficient. In this case we have U = k_αᵀ (K_α + λI)⁻¹ and V = w_α.

3.2 A concrete example

In this section we describe the feature prediction process as it applies to features derived from image patches using kernel ridge regression, since the intuition is strongest in this case. We defer a discussion of how to select a kernel for deep layers, as well as for non-image data in the visible layer, to a later section. In those settings the prediction process is formally identical, but the intuition is less clear.

If v is a vectorized image patch corresponding to the visible layer of a standard neural network then the hidden activity induced by this patch is given by h = g(vW), where g is the network nonlinearity and W = [w₁, . . . , w_{n_h}] is a weight matrix whose columns each correspond to features which are to be matched to the visible layer.

We consider a single column of the weight matrix, w, whose elements are indexed by i ∈ W. In the case of an image patch these indices are multidimensional, i = (i_x, i_y, i_c), indicating the spatial location and colour channel of the index i. We select locations α ⊂ W at which to represent the filter explicitly and use w_α to denote the vector of weights at these locations.

There are a wide variety of options for how α can be selected. We have found that choosing α uniformly at random from W (but tied across channels) works well; however, it is possible that performance could be improved by carefully designing a process for selecting α.

We can use the values in w_α to predict the full feature as w = k_αᵀ (K_α + λI)⁻¹ w_α. Notice that we can predict the entire feature matrix in parallel using W = k_αᵀ (K_α + λI)⁻¹ W_α, where W_α = [(w₁)_α, . . . , (w_{n_h})_α].

For image patches, where we expect smoothness in pixel space, an appropriate kernel is the squared exponential kernel

k(i, j) = exp( −((i_x − j_x)² + (i_y − j_y)²) / (2σ²) ) ,

where σ is a length scale parameter which controls the degree of smoothness.

Here α has a convenient interpretation as the set of pixel locations in the image, each corresponding to a basis function in the dictionary defined by the kernel. More generically we will use α to index a collection of dictionary elements in the remainder of the paper, even when a dictionary element may not correspond directly to a pixel location as in this example.

3.3 Interpretation as pooling

So far we have motivated our technique as a method for predicting features in a neural network; however, the same approach can also be interpreted as a linear pooling process. Recall that the hidden activations in a standard neural network before applying the nonlinearity are given by g⁻¹(h) = vW. Our motivation has proceeded along the lines of replacing W with U_α W_α and discussing the relationship between W and its predicted counterpart.

Alternatively we can write g⁻¹(h) = v_α W_α, where v_α = vU_α is a linear transformation of the data. Under this interpretation we can think of a predicted layer as being composed of two layers internally. The first is a linear layer which applies a fixed pooling operator given by U_α, and the second is an ordinary fully connected layer with |α| visible units.

3.4 Columnar architecture

The prediction process we have described so far assumes that U_α is the same for all features; however, this can be too restrictive.
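The prediction recipe of Sections 3.1 and 3.2 can be sketched end to end: observe a smooth feature at a random subset of pixel locations α and fill in the rest with the squared exponential kernel ridge predictor. The "feature" below is a synthetic Gaussian bump standing in for a learned smooth filter; all sizes and the length scale are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

# Pixel coordinates of an 8x8 patch: these index the weight space W.
p = 8
coords = np.array([(ix, iy) for ix in range(p) for iy in range(p)], dtype=float)

# A synthetic smooth feature w : W -> R (a Gaussian bump).
w_true = np.exp(-((coords[:, 0] - 3.5) ** 2 + (coords[:, 1] - 3.5) ** 2) / 8.0)

def se_kernel(a, b, length_scale=1.0):
    """Squared exponential kernel k(i, j) between coordinate arrays."""
    d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2.0 * length_scale ** 2))

# Observe w at 16 of the 64 locations (alpha), chosen uniformly at random.
alpha = rng.choice(p * p, size=16, replace=False)
K_alpha = se_kernel(coords[alpha], coords[alpha])   # |alpha| x |alpha|
k_alpha_T = se_kernel(coords, coords[alpha])        # |W| x |alpha|
lam = 1e-3                                          # ridge coefficient lambda

# Kernel ridge predictor: w = k_alpha^T (K_alpha + lam*I)^{-1} w_alpha.
w_pred = k_alpha_T @ np.linalg.solve(K_alpha + lam * np.eye(len(alpha)),
                                     w_true[alpha])

rel_err = np.linalg.norm(w_pred - w_true) / np.linalg.norm(w_true)
```

In the notation above, k_alpha_T @ (K_alpha + lam*I)⁻¹ plays the role of the fixed factor U and the observed values w_true[alpha] play the role of V.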
Continuing with the intuition that filters should be smooth local edge detectors, we might want to choose α to give high resolution in a local area of pixel space while using a sparser representation in the remainder of the space. Naturally, in this case we would want to choose several different α's, each of which concentrates high resolution information in a different region.

It is straightforward to extend feature prediction to this setting. Suppose we have several different index sets α₁, . . . , α_J corresponding to elements from a dictionary U. For each α_j we can form the sub-dictionary U_{α_j} and predict the feature matrix W_j = U_{α_j} W_{α_j}. The full predicted feature matrix is formed by concatenating these matrices blockwise: W = [W₁, . . . , W_J]. Each block of the full predicted feature matrix can be treated completely independently. Blocks W_i and W_j share no parameters—even their corresponding dictionaries are different.

Each α_j can be thought of as defining a column of representation inside the layer. The input to each column is shared, but the representations computed in each column are independent. The output of the layer is obtained by concatenating the output of each column. This is represented graphically in Figure 3.

Figure 3: Left: Columnar architecture in a fully connected network, with the path through one column highlighted. Each column corresponds to a different α_j. Right: Columnar architecture in a convolutional network. In this setting the w_α's take linear combinations of the feature maps obtained by convolving the input with the dictionary. We make the same abuse of notation here as in the main text—the vectorized filter banks must be reshaped before the convolution takes place.

Introducing additional columns into the network increases the number of static parameters but the number of dynamic parameters remains fixed. The increase in static parameters comes from the fact that each column has its own dictionary. The reason that there is not a corresponding increase in the number of dynamic parameters is that for a fixed size hidden layer the hidden units are divided between the columns. The number of dynamic parameters depends only on the number of hidden units and the size of each dictionary.

In a convolutional network the interpretation is similar. In this setting we have g⁻¹(h) = v ∗ W*, where W* is an appropriately sized filter bank. Using W to denote the result of vectorizing the filters of W* (as is done in non-convolutional models) we can again write W = U_α w_α, and using a slight abuse of notation¹ we can write g⁻¹(h) = v ∗ U_α w_α. As above, we re-order the operations to obtain g⁻¹(h) = v_α w_α, resulting in a structure similar to a layer in an ordinary MLP. This structure is illustrated in Figure 3.

Note that v is first convolved with U_α to produce v_α. That is, preprocessing in each column comes from a convolution with a fixed set of filters, defined by the dictionary. Next, we form linear combinations of these fixed convolutions, with coefficients given by w_α.
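A fully connected columnar layer from Section 3.4 can be sketched as follows. The sizes are illustrative and the random matrices stand in for real dictionaries (which would be constructed as in Section 3.5); the sketch also checks the re-ordering of operations used above, v(U_j W_j) = (vU_j)W_j:

```python
import numpy as np

rng = np.random.default_rng(0)

n_v, n_h, J = 64, 30, 3        # illustrative: input size, hidden size, columns
n_alpha = 8                    # dictionary atoms per column
units_per_col = n_h // J       # hidden units are divided between the columns

v = rng.standard_normal((1, n_v))  # one input row vector

cols = []
for j in range(J):
    U_j = rng.standard_normal((n_v, n_alpha))            # static dictionary
    W_j = rng.standard_normal((n_alpha, units_per_col))  # dynamic weights
    # Two equivalent views of the same column:
    h_pred = v @ (U_j @ W_j)   # predict the full weight block, then apply it
    h_pool = (v @ U_j) @ W_j   # fixed linear pooling, then a small dense layer
    assert np.allclose(h_pred, h_pool)
    cols.append(h_pool)

h = np.concatenate(cols, axis=1)   # layer output: concatenation over columns
```

Adding a column adds a new static U_j but leaves the dynamic parameter count (n_alpha × units per column, summed over a fixed n_h) unchanged, matching the accounting in the text.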
This particular order of operations may result in computational improvements if the number of hidden channels is larger than n_α, or if the elements of U_α are separable [22].

3.5 Constructing dictionaries

We now turn our attention to selecting an appropriate dictionary for different layers of the network. The appropriate choice of dictionary inevitably depends on the structure of the weight space. When the weight space has a topological structure where we expect smoothness, for example when the weights correspond to pixels in an image patch, we can choose a kernel-based dictionary to enforce the type of smoothness we expect.

When there is no topological structure to exploit, we propose to use data driven dictionaries. An obvious choice here is to use a shallow unsupervised feature learner, such as an autoencoder, to build a dictionary for the layer.

Another option is to construct data-driven kernels for ridge regression. Easy choices here are the empirical covariance or empirical squared covariance of the hidden units, averaged over the data. Since the correlations in hidden activities depend on the weights in lower layers, we cannot initialize kernels in deep layers in this way without training the previous layers. We handle this by pre-training each layer as an autoencoder. We construct the kernel using the empirical covariance of the hidden units over the data using the pre-trained weights. Once each layer has been pre-trained in this way, we fine-tune the entire network with backpropagation, but in this phase the kernel parameters are fixed.

We also experiment with other choices for the dictionary, such as random projections (iid Gaussian dictionary) and random connections (dictionary composed of random columns of the identity).

¹The vectorized filter bank W = U_α w_α must be reshaped before the convolution takes place.

Figure 4: Left: Comparing the performance of different dictionaries when predicting the weights in the first two layers of an MLP network on MNIST. The legend shows the dictionary type in layer1–layer2 (see main text for details). Right: Performance on the TIMIT core test set using an MLP with two hidden layers.

4 Experiments

4.1 Multilayer perceptron

We perform some initial experiments using MLPs [24] in order to demonstrate the effectiveness of our technique. We train several MLP models on MNIST using different strategies for constructing the dictionary, different numbers of columns and different degrees of reduction in the number of dynamic parameters used in each feature. We chose to explore these permutations on MNIST since it is small enough to allow us broad coverage.

The networks in this experiment all have two hidden layers with a 784–500–500–10 architecture and use a sigmoid activation function. The final layer is a softmax classifier. In all cases we perform parameter prediction in the first and second layers only; the final softmax layer is never predicted. This layer contains approximately 1% of the total network parameters, so a substantial savings is possible even if features in this layer are not predicted.

Figure 4 (left) shows performance using several different strategies for constructing the dictionary, each using 10 columns in the first and second layers. We divide the hidden units in each layer equally between columns (so each column connects to 50 units in the layer above).

The different dictionaries are as follows: nokernel is an ordinary model with no feature prediction (shown as a horizontal line). LowRank is when both U and V are optimized. RandCon is random connections (the dictionary is random columns of the identity).
RandFixU is random projections using a matrix of iid Gaussian entries. SE is ridge regression with the squared exponential kernel with length scale 1.0. Emp is ridge regression with the covariance kernel. Emp2 is ridge regression with the squared covariance kernel. AE is a dictionary pre-trained as an autoencoder. The SE–Emp and SE–Emp2 architectures perform substantially better than the alternatives, especially with few dynamic parameters.

For consistency we pre-trained all of the models, except for LowRank, as autoencoders. We did not pretrain the LowRank model because we found the autoencoder pretraining to be extremely unstable for this model.

Figure 4 (right) shows the results of a similar experiment on TIMIT. The raw speech data was analyzed using a 25-ms Hamming window with a 10-ms fixed frame rate. In all the experiments, we represented the speech using 12th-order Mel frequency cepstral coefficients (MFCCs) and energy, along with their first and second temporal derivatives. The networks used in this experiment have two hidden layers with 1024 units. Phone error rate was measured by performing Viterbi decoding of the phones in each utterance using a bigram language model; confusions between certain sets of phones were ignored as described in [19].

Figure 5: Performance of a convnet on CIFAR-10. Learning only 25% of the parameters has a negligible effect on predictive accuracy.

4.2 Convolutional network

Figure 5 shows the performance of a convnet [17] on CIFAR-10.
The first convolutional layer filters the 32 × 32 × 3 input image using 48 filters of size 8 × 8 × 3. The second convolutional layer applies 64 filters of size 8 × 8 × 48 to the output of the first layer. The third convolutional layer further transforms the output of the second layer by applying 64 filters of size 5 × 5 × 64. The output of the third layer is input to a fully connected layer with 500 hidden units and finally into a softmax layer with 10 outputs. Again we do not reduce the parameters in the final softmax layer. The convolutional layers each have one column and the fully connected layer has five columns.

Convolutional layers have a natural topological structure to exploit, so we use a dictionary constructed with the squared exponential kernel in each convolutional layer. The input to the fully connected layer at the top of the network comes from a convolutional layer, so we use ridge regression with the squared exponential kernel to predict parameters in this layer as well.

4.3 Reconstruction ICA

Reconstruction ICA [15] is a method for learning overcomplete ICA models which is similar to a linear autoencoder network. We demonstrate that we can effectively predict parameters in RICA on both CIFAR-10 and STL-10. In order to use RICA as a classifier we follow the procedure of Coates et al. [6].

Figure 6 (left) shows the results of parameter prediction with RICA on CIFAR-10 and STL-10. RICA is a single layer architecture, and we predict parameters with a squared exponential kernel dictionary with a length scale of 1.0. The nokernel line shows the performance of RICA with no feature prediction on the same task. In both cases we are able to predict more than half of the dynamic parameters without a substantial drop in accuracy.

Figure 6 (right) compares the performance of two RICA models with the same number of dynamic parameters.
One of the models is ordinary RICA with no parameter prediction, and the other has 50% of the parameters in each feature predicted using a squared exponential kernel dictionary with a length scale of 1.0. Since 50% of the parameters in each feature are predicted, the second model has twice as many features with the same number of dynamic parameters.

Figure 6: Left: Comparison of the performance of RICA with and without parameter prediction on CIFAR-10 and STL-10. Right: Comparison of RICA, and RICA with 50% parameter prediction, using the same number of dynamic parameters (i.e. RICA-50% has twice as many features). There is a substantial gain in accuracy with the same number of dynamic parameters using our technique. Error bars for STL-10 show 90% confidence intervals from the recommended testing protocol.

5 Related work and future directions

Several other methods for limiting the number of parameters in a neural network have been explored in the literature. An early approach is the technique of "Optimal Brain Damage" [18], which uses approximate second derivative information to remove parameters from an already trained network. This technique does not apply in our setting, since we aim to limit the number of parameters before training, rather than after.

The most common approach to limiting the number of parameters is to use locally connected features [6]. The size of the parameterization of locally connected networks can be further reduced by using tiled convolutional networks [10], in which groups of feature weights which tile the input space are tied. Convolutional neural networks [13] are even more restrictive and force a feature to have tied weights for all receptive fields.

Techniques similar to the one in this paper have appeared for shallow models in the computer vision literature. The double sparsity method of Rubinstein et al. [23] involves approximating linear dictionaries with other dictionaries, in a similar manner to how we approximate network features. Rigamonti et al. [22] study approximating convolutional filter banks with linear combinations of separable filters. Both of these works focus on shallow single layer models, in contrast to our focus on deep networks.

The techniques described in this paper are orthogonal to the parameter reduction achieved by tying weights in a tiled or convolutional pattern. Tying weights effectively reduces the number of feature maps by constraining features at different locations to share parameters. Our approach reduces the number of parameters required to represent each feature, and it is straightforward to incorporate into a tiled or convolutional network.

Cireşan et al. [3] control the number of parameters by removing connections between layers in a convolutional network at random. They achieve state-of-the-art results using these randomly connected layers as part of their network.
Our technique subsumes the idea of random connections, as described in Section 3.5.

The idea of regularizing networks through prior knowledge of smoothness is not new, but it is a delicate process. Lang and Hinton [14] tried imposing explicit smoothness constraints through regularization but found it to universally reduce performance. Naïvely factoring the weight matrix and learning both factors tends to reduce performance as well. Although the idea is conceptually simple, execution is difficult. Gülçehre et al. [11] have demonstrated that prior knowledge is extremely important during learning, which highlights the importance of introducing it effectively.

Recent work has shown that state of the art results on several benchmark tasks in computer vision can be achieved by training neural networks with several columns of representation [2, 13]. The use of different preprocessing for different columns of representation is of particular relevance [2]. Our approach has a similar interpretation, as described in Section 3.4. Unlike the work of [2], we do not consider deep columns in this paper; however, columnar structure is an attractive way of increasing parallelism within a network, as the columns operate completely independently. There is no reason we could not incorporate deeper columns into our networks, and this would make for a potentially interesting avenue of future work.

Our approach is superficially similar to the factored RBM [21, 26], whose parameters form a 3-tensor. Since the total number of parameters in this model is prohibitively large, the tensor is represented as an outer product of three matrices. Major differences between our technique and the factored RBM include the fact that the factored RBM is a specific model, whereas our technique can be applied more broadly—even to factored RBMs.
In addition, in a factored RBM all factors are learned, whereas in our approach the dictionary is fixed judiciously.

In this paper we always choose the set α of indices uniformly at random. There are a wide variety of other options which could be considered here. Other works have focused on learning receptive fields directly [5], and it would be interesting to incorporate these approaches with our technique.

In a similar vein, more careful attention to the selection of kernel functions is appropriate. We have considered some simple examples and shown that they perform well, but our study is hardly exhaustive. Using different types of kernels to encode different types of prior knowledge on the weight space, or even learning the kernel functions directly as part of the optimization procedure as in [27], are possibilities that deserve exploration.

When no natural topology on the weight space is available we infer a topology for the dictionary from empirical statistics; however, it may be possible to instead construct the dictionary to induce a desired topology on the weight space directly. This has parallels to other work on inducing topology in representations [10] as well as work on learning pooling structures in deep networks [4].

6 Conclusion

We have shown how to achieve significant reductions in the number of dynamic parameters in deep models. The idea is orthogonal but complementary to recent advances in deep learning, such as dropout, rectified units and maxout. It creates many avenues for future work, such as improving large scale industrial implementations of deep networks, but also brings into question whether we have the right parameterizations in deep learning.

References

[1] Y. Bengio. Deep learning of representations: Looking forward. Technical Report arXiv:1305.0445, Université de Montréal, 2013.

[2] D. Cireşan, U. Meier, and J. Schmidhuber.
Multi-column deep neural networks for image classification. In IEEE Computer Vision and Pattern Recognition, pages 3642–3649, 2012.

[3] D. Cireşan, U. Meier, and J. Masci. High-performance neural networks for visual object classification. arXiv:1102.0183, 2011.

[4] A. Coates, A. Karpathy, and A. Ng. Emergence of object-selective features in unsupervised feature learning. In Advances in Neural Information Processing Systems, pages 2690–2698, 2012.

[5] A. Coates and A. Y. Ng. Selecting receptive fields in deep networks. In Advances in Neural Information Processing Systems, pages 2528–2536, 2011.

[6] A. Coates, A. Y. Ng, and H. Lee. An analysis of single-layer networks in unsupervised feature learning. In Artificial Intelligence and Statistics, 2011.

[7] J. Dean, G. Corrado, R. Monga, K. Chen, M. Devin, Q. Le, M. Mao, M. Ranzato, A. Senior, P. Tucker, K. Yang, and A. Ng. Large scale distributed deep networks. In Advances in Neural Information Processing Systems, pages 1232–1240, 2012.

[8] L. Deng, D. Yu, and J. Platt. Scalable stacking and learning for building deep architectures. In International Conference on Acoustics, Speech, and Signal Processing, pages 2133–2136, 2012.

[9] I. J. Goodfellow, D. Warde-Farley, M. Mirza, A. Courville, and Y. Bengio. Maxout networks. In International Conference on Machine Learning, 2013.

[10] K. Gregor and Y. LeCun. Emergence of complex-like cells in a temporal product network with local receptive fields. arXiv preprint arXiv:1006.0448, 2010.

[11] C. Gülçehre and Y. Bengio. Knowledge matters: Importance of prior information for optimization. In International Conference on Learning Representations, 2013.

[12] G. E. Hinton, N. Srivastava, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov. Improving neural networks by preventing co-adaptation of feature detectors. CoRR, abs/1207.0580, 2012.

[13] A.
Krizhevsky, I. Sutskever, and G. Hinton. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pages 1106–1114, 2012.

[14] K. Lang and G. Hinton. Dimensionality reduction and prior knowledge in e-set recognition. In Advances in Neural Information Processing Systems, 1990.

[15] Q. V. Le, A. Karpenko, J. Ngiam, and A. Y. Ng. ICA with reconstruction cost for efficient overcomplete feature learning. Advances in Neural Information Processing Systems, 24:1017–1025, 2011.

[16] Q. V. Le, M. Ranzato, R. Monga, M. Devin, K. Chen, G. Corrado, J. Dean, and A. Ng. Building high-level features using large scale unsupervised learning. In International Conference on Machine Learning, 2012.

[17] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.

[18] Y. LeCun, J. S. Denker, S. Solla, R. E. Howard, and L. D. Jackel. Optimal brain damage. In Advances in Neural Information Processing Systems, pages 598–605, 1990.

[19] K.-F. Lee and H.-W. Hon. Speaker-independent phone recognition using hidden Markov models. IEEE Transactions on Acoustics, Speech and Signal Processing, 37(11):1641–1648, 1989.

[20] V. Nair and G. E. Hinton. Rectified linear units improve restricted Boltzmann machines. In Proc. 27th International Conference on Machine Learning, pages 807–814. Omnipress, Madison, WI, 2010.

[21] M. Ranzato, A. Krizhevsky, and G. E. Hinton. Factored 3-way restricted Boltzmann machines for modeling natural images. In Artificial Intelligence and Statistics, 2010.

[22] R. Rigamonti, A. Sironi, V. Lepetit, and P. Fua. Learning separable filters. In IEEE Computer Vision and Pattern Recognition, 2013.

[23] R. Rubinstein, M. Zibulevsky, and M. Elad.
Double sparsity: learning sparse dictionaries for sparse signal approximation. IEEE Transactions on Signal Processing, 58:1553–1564, 2010.

[24] D. E. Rumelhart, G. E. Hinton, and R. J. Williams. Learning representations by back-propagating errors. Nature, 323(6088):533–536, 1986.

[25] J. Shawe-Taylor and N. Cristianini. Kernel Methods for Pattern Analysis. Cambridge University Press, New York, NY, USA, 2004.

[26] K. Swersky, M. Ranzato, D. Buchman, B. Marlin, and N. de Freitas. On autoencoders and score matching for energy based models. In International Conference on Machine Learning, pages 1201–1208, 2011.

[27] P. Vincent and Y. Bengio. A neural support vector network architecture with adaptive kernels. In International Joint Conference on Neural Networks, pages 187–192, 2000.