{"title": "Fast Pruning Using Principal Components", "book": "Advances in Neural Information Processing Systems", "page_first": 35, "page_last": 42, "abstract": null, "full_text": "Fast Pruning Using Principal \n\nComponents \n\nAsriel U. Levin, Todd K. Leen and John E. Moody \n\nDepartment of Computer Science and Engineering \n\nOregon Graduate Institute \n\nP.O. Box 91000 \n\nPortland, OR 97291-1000 \n\nAbstract \n\nWe present a new algorithm for eliminating excess parameters and \nimproving network generalization after supervised training. The \nmethod, \"Principal Components Pruning (PCP)\", is based on prin(cid:173)\ncipal component analysis of the node activations of successive layers \nof the network. It is simple, cheap to implement, and effective. It \nrequires no network retraining, and does not involve calculating \nthe full Hessian of the cost function. Only the weight and the node \nactivity correlation matrices for each layer of nodes are required. \nWe demonstrate the efficacy of the method on a regression problem \nusing polynomial basis functions, and on an economic time series \nprediction problem using a two-layer, feedforward network. \n\n1 \n\nIntroduction \n\nIn supervised learning, a network is presented with a set of training exemplars \n[u(k), y(k)), k = 1 ... N where u(k) is the kth input and y(k) is the correspond(cid:173)\ning output. The assumption is that there exists an underlying (possibly noisy) \nfunctional relationship relating the outputs to the inputs \n\nwhere e denotes the noise. The aim of the learning process is to approximate this \nrelationship based on the the training set. The success of the learned approximation \n\ny=/(u,e) \n\n35 \n\n\f36 \n\nLevin, Leen, and Moody \n\nis judged by the ability of the network to approximate the outputs corresponding \nto inputs it was not trained on. \n\nLarge networks have more functional flexibility than small networks, so are better \nable to fit the training data. 
However, large networks can have higher parameter variance than small networks, resulting in poor generalization. The number of parameters in a network is a crucial factor in its ability to generalize. \n\nNo practical method exists for determining, a priori, the proper network size and connectivity. A promising approach is to start with a large, fully-connected network and, through pruning or regularization, increase model bias in order to reduce model variance and improve generalization. \n\nReview of existing algorithms \n\nIn recent years, several methods have been proposed. Skeletonization (Mozer and Smolensky, 1989) removes the neurons that have the least effect on the output error. This is costly and does not take into account correlations between the neuron activities. Eliminating small weights does not properly account for a weight's effect on the output error. Optimal Brain Damage (OBD) (Le Cun et al., 1990) removes those weights that least affect the training error based on a diagonal approximation of the Hessian. The diagonal assumption is inaccurate and can lead to the removal of the wrong weights. The method also requires retraining the pruned network, which is computationally expensive. Optimal Brain Surgeon (OBS) (Hassibi et al., 1992) removes the \"diagonal\" assumption but is impractical for large nets. Early stopping monitors the error on a validation set and halts learning when this error starts to increase. There is no guarantee that the learning curve passes through the optimal point, and the final weights are sensitive to the learning dynamics. Weight decay (ridge regression) adds a term to the objective function that penalizes large weights. The proper coefficient for this term is not known a priori, so one must perform several optimizations with different values, a cumbersome process. \n\nWe propose a new method for eliminating excess parameters and improving network generalization. 
The method, \"Principal Components Pruning (PCP)\", is based on \nprincipal component analysis (PCA) and is simple, cheap and effective. \n\n2 Background and Motivation \n\nPCA (Jolliffe, 1986) is a basic tool to reduce dimension by eliminating redundant \nvariables. In this procedure one transforms variables to a basis in which the covari(cid:173)\nance is diagonal and then projects out the low variance directions. \n\nWhile application of PCA to remove input variables is useful in some cases (Leen \net al., 1990), there is no guarantee that low variance variables have little effect on \nerror. We propose a saliency measure, based on PCA, that identifies those variables \nthat have the least effect on error. Our proposed Principal Components Pruning \nalgorithm applies this measure to obtain a simple and cheap pruning technique in \nthe context of supervised learning. \n\n\fFast Pruning Using Principal Components \n\n37 \n\nSpecial Case: PCP in Linear Regression \n\nIn unbiased linear models, one can bound the bias introduced from pruning the \nprincipal degrees of freedom in the model. We assume that the observed system \nis described by a signal-plus-noise model with the signal generated by a function \nlinear in the weights: \n\ny = Wou + e \n\nwhere u E ~P, Y E ~m, W E ~mxp, and e is a zero mean additive noise. The \nregression model is \n\nY=Wu. \n\nThe input correlation matrix is ~ = ~ L:k u(k)uT(k). \nIt is convenient to define coordinates in which ~ is diagonal A = C T ~ C where C is \nthe matrix whose columns are the orthonormal eigenvectors of~. The transformed \ninput variables and weights are u = CT u and W = W C respectively, and the \nmodel output can be rewritten as Y = W u . \nIt is straightforward to bound the increase in training set error resulting from re(cid:173)\nmoving subsets of the transformed input variable. 
The sum squared error is \n\nI = (1/N) Σ_k [y(k) - ŷ(k)]^T [y(k) - ŷ(k)] . \n\nLet ŷ_l(k) denote the model's output when the last p - l components of ũ(k) are set to zero. By the triangle inequality \n\nI_l = (1/N) Σ_k [y(k) - ŷ_l(k)]^T [y(k) - ŷ_l(k)] ≤ I + (1/N) Σ_k [ŷ(k) - ŷ_l(k)]^T [ŷ(k) - ŷ_l(k)]   (1) \n\nThe second term in (1) bounds the increase in the training set error.1 This term can be rewritten as \n\n(1/N) Σ_k [ŷ(k) - ŷ_l(k)]^T [ŷ(k) - ŷ_l(k)] = Σ_{i=l+1}^{p} w̃_i^T w̃_i λ_i \n\nwhere w̃_i denotes the ith column of W̃ and λ_i is the ith eigenvalue. The quantity w̃_i^T w̃_i λ_i measures the effect of the ith eigen-coordinate on the output error; it serves as our saliency measure for the weight w̃_i. \n\nRelying on Akaike's Final Prediction Error (FPE) (Akaike, 1970), the average test set error for the original model is given by \n\nJ[W] = ((N + pm)/(N - pm)) I(W) \n\nwhere pm is the number of parameters in the model. If p - l principal components are removed, then the expected test set error is given by \n\nJ_l[W̃] = ((N + lm)/(N - lm)) I_l(W̃) . \n\nIf we assume that N » l m, the last equation implies that optimal generalization will be achieved if all principal components for which \n\nw̃_i^T w̃_i λ_i < 2 m I / N \n\nare removed. For these eigen-coordinates the reduction in model variance will more than compensate for the increase in training error, leaving a lower expected test set error. \n\n1 For y ∈ R^1, the inequality is replaced by an equality. \n\n3 Proposed Algorithm \n\nThe pruning algorithm for linear regression described in the previous section can be extended to multilayer neural networks. A complete analysis of the effects on generalization performance of removing eigen-nodes in a nonlinear network is beyond the scope of this short paper. However, it can be shown that removing eigen-nodes with low saliency reduces the effective number of parameters (Moody, 1992) and should usually improve generalization. 
Also, as will be discussed in the next section, our PCP algorithm is related to the OBD and OBS pruning methods. As with all pruning techniques and analyses of generalization, one must assume that the data are drawn from a stationary distribution, so that the training set fairly represents the distribution of data one can expect in the future. \n\nConsider now a feedforward neural network, where each layer is of the form \n\ny^i = Γ[W^i u^i] = Γ[x^i] . \n\nHere, u^i is the input, x^i is the weighted sum of the input, Γ is a diagonal operator consisting of the activation function of the neurons at the layer, and y^i is the output of the layer. \n\n1. A network is trained using a supervised (e.g. backpropagation) training procedure. \n\n2. Starting at the first layer, the correlation matrix Σ for the input vector to the layer is calculated. \n\n3. Principal components are ranked by their effect on the linear output of the layer.2 \n\n4. The effect of removing an eigen-node is evaluated using a validation set. Those that do not increase the validation error are deleted. \n\n5. The weights of the layer are projected onto the l-dimensional subspace spanned by the significant eigenvectors \n\nW → W C_l C_l^T \n\nwhere the columns of C_l are the l significant eigenvectors of the correlation matrix. \n\n6. The procedure continues until all layers are pruned. \n\n2 If we assume that Γ is the sigmoidal operator then, relying on its contraction property, the resulting output error is bounded by ||e|| ≤ ||W|| ||e_{x^i}||, where e_{x^i} is the error observed at x^i and ||W|| is the norm of the matrices connecting it to the output. \n\nAs seen, the algorithm proposed is easy and fast to implement. The matrix dimensions are determined by the number of neurons in a layer and hence are manageable even for very large networks. 
No retraining is required after pruning, and the speed of running the network after pruning is not affected. \n\nNote: A finer scale approach to pruning should be used if there is a large variation between w̃_{ij} for different j. In this case, rather than examine w̃_i^T w̃_i λ_i in one piece, the contribution of each w̃_{ij}^2 λ_i could be examined individually, and those weights for which the contribution is small can be deleted. \n\n4 Relation to Hessian-Based Methods \n\nThe effect of our PCP method is to reduce the rank of each layer of weights in a network by the removal of the least salient eigen-nodes, which reduces the effective number of parameters (Moody, 1992). This is in contrast to the OBD and OBS methods, which reduce the rank by eliminating actual weights. PCP differs further from OBD and OBS in that it does not require that the network be trained to a local minimum of the error. \n\nIn spite of these basic differences, the PCP method can be viewed as intermediate between OBD and OBS in terms of how it approximates the Hessian of the error function. OBD uses a diagonal approximation, while OBS uses a linearized approximation of the full Hessian. In contrast, PCP effectively prunes based upon a block-diagonal approximation of the Hessian. A brief discussion follows. \n\nIn the special case of linear regression, the correlation matrix Σ is the full Hessian of the squared error.3 For a multilayer network with Q layers, let us denote the numbers of units per layer as {p_q : q = 0 ... Q}.4 The number of weights (including biases) in each layer is b_q = p_q(p_{q-1} + 1), and the total number of weights in the network is B = Σ_{q=1}^{Q} b_q. The Hessian of the error function is a B x B matrix, while the input correlation matrix for each of the units in layer q is a much simpler (p_{q-1} + 1) x (p_{q-1} + 1) matrix. Each layer has associated with it p_q identical correlation matrices. 
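The parameter and matrix-size bookkeeping above can be made concrete with a short sketch. It counts, for a given list of unit counts per layer, the diagonal elements (what OBD keeps), the per-layer correlation-matrix blocks (what PCP uses), and the full Hessian (what OBS uses); the 45-10-6 architecture is borrowed from the IP network of section 5, purely as an illustration.

```python
def hessian_elements(p):
    """Hessian element counts for unit counts p = [p_0, ..., p_Q].
    p_0 is the number of network inputs; the +1 counts each unit's bias."""
    B = sum(p[q] * (p[q - 1] + 1) for q in range(1, len(p)))  # total weights
    obd = B                                 # diagonal elements only (OBD)
    obs = B * B                             # full B x B Hessian (OBS)
    # PCP: p_q identical (p_{q-1}+1) x (p_{q-1}+1) blocks in layer q
    pcp = sum(p[q] * (p[q - 1] + 1) ** 2 for q in range(1, len(p)))
    return obd, pcp, obs

# The two-layer network of section 5: 45 inputs, 10 hidden, 6 outputs.
print(hessian_elements([45, 10, 6]))  # (526, 21886, 276676)
```

Even for this small network, the block-diagonal approximation stores roughly an order of magnitude fewer elements than the full Hessian while retaining far more structure than the diagonal alone.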
\n\nThe combined set of these correlation matrices for all units in layers q = 1 ... Q of the network serves as a linear, block-diagonal approximation to the full Hessian of the nonlinear network.5 This block-diagonal approximation has Σ_{q=1}^{Q} p_q(p_{q-1} + 1)^2 non-zero elements, compared to the [Σ_{q=1}^{Q} p_q(p_{q-1} + 1)]^2 elements of the full Hessian (used by OBS) and the Σ_{q=1}^{Q} p_q(p_{q-1} + 1) diagonal elements (used by OBD). Due to its greater richness in approximating the Hessian, we expect that PCP is likely to yield better generalization performance than OBD. \n\n3 The correlation matrix and Hessian may differ by a numerical factor depending on the normalization of the squared error. If the error function is defined as one half the average squared error (ASE), then the equality holds. \n\n4 The inputs to the network constitute layer 0. \n\n5 The derivation of this approximation will be presented elsewhere. However, the correspondence can be understood in analogy with the special case of linear regression. \n\nFigure 1: a) Underlying function (solid), training data (points), and 10th order polynomial fit (dashed). b) Underlying function, training data, and pruned regression fit (dotted). \n\nThe computational complexities of the OBS, OBD, and PCP methods are \n\nrespectively, where we assume that N ≥ B. The computational cost of PCP is therefore significantly less than that of OBS and is similar to that of OBD. \n\n5 Simulation Results \n\nRegression With Polynomial Basis Functions \n\nThe analysis in section 2 is directly applicable to regression using a linear combination of basis functions ŷ = W f(u). One simply replaces u with the vector of basis functions f(u). 
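The linear-regression special case of PCP can be sketched in a few lines: fit the weights by least squares, diagonalize the input correlation matrix, compute the saliencies w̃_i^2 λ_i, and project the weights onto the leading eigenvectors (W → W C_l C_l^T). This is a minimal sketch, not the authors' code; the noisy cubic target and the choice to keep 6 components are illustrative assumptions (the paper's target is a sum of four sigmoids).

```python
import numpy as np

def pcp_prune(F, y, keep):
    """PCP for a linear model y ~ w . f(u), single output (m = 1).
    F: (N, p) design matrix of basis-function values; keep: number l
    of leading eigen-coordinates to retain."""
    N = F.shape[0]
    w, *_ = np.linalg.lstsq(F, y, rcond=None)   # least-squares weights, (p,)
    Sigma = F.T @ F / N                         # input correlation matrix
    lam, C = np.linalg.eigh(Sigma)              # eigenvalues ascending
    order = np.argsort(lam)[::-1]               # reorder descending
    lam, C = lam[order], C[:, order]
    w_t = C.T @ w                               # transformed weights w~
    saliency = w_t**2 * lam                     # saliency w~_i^2 * lambda_i
    C_l = C[:, :keep]                           # significant eigenvectors
    return C_l @ C_l.T @ w, saliency            # projected weights, saliencies

# Illustrative data: noisy cubic at 20 points, degree-10 monomial basis.
rng = np.random.default_rng(0)
u = np.linspace(-1.0, 1.0, 20)
y = u**3 - u + rng.normal(scale=0.05, size=20)
F = np.vander(u, 11, increasing=True)           # (1, u, u^2, ..., u^10)
w_pruned, sal = pcp_prune(F, y, keep=6)
```

Note that pruning here changes only the weight vector, not the basis, so the pruned model evaluates at exactly the original speed, matching the paper's observation that running speed is unaffected.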
\n\nWe exercised our pruning technique on a univariate regression problem using monomial basis functions f(u) = (1, u, u^2, ..., u^n)^T with n = 10. The underlying function was a sum of four sigmoids. Training and test data were generated by evaluating the underlying function at 20 uniformly spaced points in the range -1 ≤ u ≤ +1 and adding gaussian noise. The underlying function, training data and the polynomial fit are shown in figure 1a. \n\nThe mean squared error on the training set was 0.00648. The test set mean squared error, averaged over 9 test sets, was 0.0183 for the unpruned model. We removed the eigenfunctions with the smallest saliencies w̃^2 λ. The lowest average test set error of 0.0126 was reached when the trailing four eigenfunctions were removed.6 Figure 1b shows the pruned regression fit. \n\n6 The FPE criterion suggested pruning the trailing three eigenfunctions. We note that our example does not satisfy the assumption of an unbiased model, nor are the sample sizes large enough for the FPE to be completely reliable. \n\nFigure 2: Prediction of the IP index 1980 - 1990 (x-axis: prediction horizon in months). The solid line shows the performance before pruning and the dotted line the performance after the application of the PCP algorithm. The results shown represent averages over 11 runs, with the error bars representing the standard deviation of the spread. 
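The FPE criterion mentioned in footnote 6 reduces to a simple threshold test from section 2: prune the eigen-coordinates whose saliency falls below 2 m I / N. A minimal sketch follows; the saliency values are hypothetical, while N = 20 and I = 0.00648 echo the polynomial example with m = 1 output.

```python
def fpe_prune_mask(saliencies, N, m, I):
    """Section 2's FPE rule: mark for pruning each eigen-coordinate whose
    saliency w~_i^T w~_i lambda_i is below 2*m*I/N (assumes N >> l*m)."""
    threshold = 2.0 * m * I / N
    return [s < threshold for s in saliencies]

# Hypothetical saliencies for four eigen-coordinates.
mask = fpe_prune_mask([0.5, 1e-2, 1e-4, 2e-5], N=20, m=1, I=0.00648)
print(mask)  # [False, False, True, True]
```

In practice (step 4 of the algorithm) the paper validates each removal on a held-out set rather than trusting the FPE threshold alone, which footnote 6 notes can be unreliable at small sample sizes.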
\n\nTime Series Prediction with a Sigmoidal Network \n\nWe have applied the proposed algorithm to the task of predicting the Index of In(cid:173)\ndustrial Production (IP), which is one of the main gauges of U.S. economic activity. \nWe predict the rate of change in IP over a set of future horizons based on lagged \nmonthly observations of various macroeconomic and financial indicators (altogether \n45 inputs). 7 \n\nOur standard benchmark is the rate of change in IP for January 1980 to January \n1990 for models trained on January 1960 to December 1979. In all runs, we used two \nlayer networks with 10 tanh hidden nodes and 6 linear output nodes corresponding \nto the various prediction horizons (1, 2, 3, 6, 9, and 12 months). The networks were \ntrained using stochastic backprop (which with this very noisy data set outperformed \nmore sophisticated gradient descent techniques). The test set results with and \nwithout the PCP algorithm are shown in Figure 2. \n\nDue to the significant noise and nonstationarity in the data, we found it beneficial \nto employ both weight decay and early stopping during training. In the above runs, \nthe PCP algorithm was applied on top of these other regularization methods. \n\n6 Conclusions and Extensions \n\nOur \"Principal Components Pruning (PCP)\" algorithm is an efficient tool for re(cid:173)\nducing the effective number of parameters of a network. It is likely to be useful when \nthere are correlations of signal activities. The method is substantially cheaper to \nimplement than OBS and is likely to yield better network performance than OBD.8 \n\n7Preliminary results on this problem have been described briefly in (Moody et al., \n\n1993), and a detailed account of this work will be presented elsewhere. \n\n8See section 4 for a discussion of the block-diagonal Hessian interpretation of our \n\nmethod. 
A systematic empirical comparison of computational cost and resulting network performance of PCP to other methods like OBD and OBS would be a worthwhile undertaking. \n\nFurthermore, PCP can be used on top of any other regularization method, including early stopping or weight decay.9 Unlike OBD and OBS, PCP does not require that the network be trained to a local minimum. \n\nWe are currently exploring nonlinear extensions of our linearized approach. These involve computing a block-diagonal Hessian in which the block corresponding to each unit differs from the correlation matrix for that layer by a nonlinear factor. The analysis makes use of GPE (Moody, 1992) rather than FPE. \n\nAcknowledgements \n\nOne of us (TKL) thanks Andreas Weigend for stimulating discussions that provided some of the motivation for this work. AUL and JEM gratefully acknowledge the support of the Advanced Research Projects Agency and the Office of Naval Research under grant ONR N00014-92-J-4062. TKL acknowledges the support of the Electric Power Research Institute under grant RP8015-2 and the Air Force Office of Scientific Research under grant F49620-93-1-0253. \n\nReferences \n\nAkaike, H. (1970). Statistical predictor identification. Ann. Inst. Stat. Math., 22:203. \n\nHassibi, B., Stork, D., and Wolff, G. (1992). Optimal brain surgeon and general network pruning. Technical Report 9235, RICOH California Research Center, Menlo Park, CA. \n\nJolliffe, I. T. (1986). Principal Component Analysis. Springer-Verlag. \n\nLe Cun, Y., Denker, J., and Solla, S. (1990). Optimal brain damage. In Touretzky, D., editor, Advances in Neural Information Processing Systems, volume 2, pages 598-605, Denver 1989. Morgan Kaufmann, San Mateo. \n\nLeen, T. K., Rudnick, M., and Hammerstrom, D. (1990). Hebbian feature discovery improves classifier efficiency. 
In Proceedings of the IEEE/INNS International Joint Conference on Neural Networks, pages I-51 to I-56. \n\nMoody, J. (1992). The effective number of parameters: An analysis of generalization and regularization in nonlinear learning systems. In Moody, J., Hanson, S., and Lippman, R., editors, Advances in Neural Information Processing Systems, volume 4, pages 847-854. Morgan Kaufmann. \n\nMoody, J., Levin, A., and Rehfuss, S. (1993). Predicting the U.S. index of industrial production. Neural Network World, 3:791-794. In Proceedings of Parallel Applications in Statistics and Economics '93. \n\nMozer, M. and Smolensky, P. (1989). Skeletonization: A technique for trimming the fat from a network via relevance assessment. In Touretzky, D., editor, Advances in Neural Information Processing Systems, volume 1, pages 107-115. Morgan Kaufmann. \n\nWeigend, A. S. and Rumelhart, D. E. (1991). Generalization through minimal networks with application to forecasting. In Keramidas, E. M., editor, INTERFACE'91 - 23rd Symposium on the Interface: Computing Science and Statistics, pages 362-370. \n\n9 (Weigend and Rumelhart, 1991) called the rank of the covariance matrix of the node activities the \"effective dimension of hidden units\" and discussed it in the context of early stopping. ", "award": [], "sourceid": 754, "authors": [{"given_name": "Asriel", "family_name": "Levin", "institution": null}, {"given_name": "Todd", "family_name": "Leen", "institution": null}, {"given_name": "John", "family_name": "Moody", "institution": null}]}