{"title": "Fast Pruning Using Principal Components", "book": "Advances in Neural Information Processing Systems", "page_first": 35, "page_last": 42, "abstract": null, "full_text": "Fast  Pruning Using Principal \n\nComponents \n\nAsriel U.  Levin,  Todd K.  Leen and John E.  Moody \n\nDepartment of Computer Science and Engineering \n\nOregon Graduate Institute \n\nP.O.  Box 91000 \n\nPortland, OR 97291-1000 \n\nAbstract \n\nWe  present a new  algorithm for  eliminating excess parameters and \nimproving  network  generalization  after  supervised  training.  The \nmethod,  \"Principal Components Pruning (PCP)\", is based on prin(cid:173)\ncipal component analysis of the node activations of successive layers \nof the network.  It is  simple,  cheap to implement,  and effective.  It \nrequires  no  network  retraining,  and  does  not  involve  calculating \nthe full  Hessian of the cost function.  Only the weight and the node \nactivity  correlation  matrices for  each  layer of nodes  are  required. \nWe demonstrate the efficacy of the method on a regression problem \nusing  polynomial  basis functions,  and on an  economic time  series \nprediction problem using a  two-layer, feedforward  network. \n\n1 \n\nIntroduction \n\nIn  supervised  learning,  a  network  is  presented  with  a  set  of training  exemplars \n[u(k), y(k)),  k  = 1 ... N  where  u(k)  is  the  kth  input  and  y(k)  is  the  correspond(cid:173)\ning  output.  The  assumption  is  that  there  exists  an  underlying  (possibly  noisy) \nfunctional  relationship relating the outputs to the inputs \n\nwhere  e denotes the  noise.  The aim of the learning process is  to approximate this \nrelationship based on the the training set.  The success of the learned approximation \n\ny=/(u,e) \n\n35 \n\n\f36 \n\nLevin, Leen, and Moody \n\nis  judged  by  the ability of the network to approximate the  outputs  corresponding \nto inputs  it  was not trained on. \n\nLarge networks  have more functional  flexibility  than small  networks,  so  are better \nable  to  fit  the  training  data.  However  large  networks  can  have  higher  parameter \nvariance  than  small  networks,  resulting  in  poor  generalization.  The  number  of \nparameters in  a  network is  a  crucial factor in it's ability to generalize. \n\nNo  practical  method  exists  for  determining,  a  priori,  the proper network size  and \nconnectivity.  A promising approach is to start with a large, fully-connected network \nand through pruning or regularization, increase model bias in order to reduce model \nvariance and improve generalization. \n\nReview of existing algorithms \n\nIn recent  years,  several methods  have  been  proposed.  Skeletonization  (Mozer  and \nSmolensky,  1989)  removes  the  neurons  that  have  the  least  effect  on  the  output \nerror.  This is costly and does not take into account correlations between the neuron \nactivities.  Eliminating small weights does not properly account for  a  weight's effect \non the output error.  Optimal  Brain  Damage  (OBD)  (Le  Cun et al.,  1990)  removes \nthose weights that least affect  the training error based on a diagonal approximation \nof the Hessian.  The diagonal assumption is inaccurate and can lead to the removal \nof the  wrong  weights.  The  method  also  requires  retraining  the  pruned  network, \nwhich is  computationally expensive.  Optimal  Brain  Surgeon  (OBS)  (Hassibi et  al., \n1992)  removes  the  \"diagonal\"  assumption  but  is  impractical  for  large  nets.  
Early stopping monitors the error on a validation set and halts learning when this error starts to increase. There is no guarantee that the learning curve passes through the optimal point, and the final weight is sensitive to the learning dynamics. Weight decay (ridge regression) adds a term to the objective function that penalizes large weights. The proper coefficient for this term is not known a priori, so one must perform several optimizations with different values, a cumbersome process.\n\nWe propose a new method for eliminating excess parameters and improving network generalization. The method, \"Principal Components Pruning (PCP)\", is based on principal component analysis (PCA) and is simple, cheap and effective.\n\n2 Background and Motivation\n\nPCA (Jolliffe, 1986) is a basic tool for reducing dimension by eliminating redundant variables. In this procedure one transforms variables to a basis in which the covariance is diagonal and then projects out the low variance directions.\n\nWhile application of PCA to remove input variables is useful in some cases (Leen et al., 1990), there is no guarantee that low variance variables have little effect on error. We propose a saliency measure, based on PCA, that identifies those variables that have the least effect on error. Our proposed Principal Components Pruning algorithm applies this measure to obtain a simple and cheap pruning technique in the context of supervised learning.\n\nSpecial Case: PCP in Linear Regression\n\nIn unbiased linear models, one can bound the bias introduced from pruning the principal degrees of freedom in the model. We assume that the observed system is described by a signal-plus-noise model with the signal generated by a function linear in the weights:\n\ny = W_0 u + e,\n\nwhere u \in R^p, y \in R^m, W_0 \in R^{m x p}, and e is a zero mean additive noise. The regression model is\n\n\hat{y} = W u.\n\nThe input correlation matrix is \Sigma = (1/N) \sum_k u(k) u(k)^T.\n\nIt is convenient to define coordinates in which \Sigma is diagonal, \Lambda = C^T \Sigma C, where C is the matrix whose columns are the orthonormal eigenvectors of \Sigma. The transformed input variables and weights are \bar{u} = C^T u and \bar{W} = W C respectively, and the model output can be rewritten as \hat{y} = \bar{W} \bar{u}.\n\nIt is straightforward to bound the increase in training set error resulting from removing subsets of the transformed input variables. The sum squared error is\n\nI = (1/N) \sum_k [y(k) - \hat{y}(k)]^T [y(k) - \hat{y}(k)].\n\nLet \hat{y}_l(k) denote the model's output when the last p - l components of \bar{u}(k) are set to zero. By the triangle inequality,\n\nI_l = (1/N) \sum_k [y(k) - \hat{y}_l(k)]^T [y(k) - \hat{y}_l(k)] \le I + (1/N) \sum_k [\hat{y}(k) - \hat{y}_l(k)]^T [\hat{y}(k) - \hat{y}_l(k)].   (1)\n\nThe second term in (1) bounds the increase in the training set error.[1] This term can be rewritten as\n\n(1/N) \sum_k [\hat{y}(k) - \hat{y}_l(k)]^T [\hat{y}(k) - \hat{y}_l(k)] = \sum_{i=l+1}^{p} \bar{w}_i^T \bar{w}_i \lambda_i,\n\nwhere \bar{w}_i denotes the ith column of \bar{W} and \lambda_i is the ith eigenvalue. The quantity \bar{w}_i^T \bar{w}_i \lambda_i measures the effect of the ith eigen-coordinate on the output error; it serves as our saliency measure for the weight \bar{w}_i.\n\n[1] For y \in R^1, the inequality is replaced by an equality.
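\n\nIn code, the saliency computation amounts to a few lines of linear algebra. The following numpy sketch is ours, not the authors'; it assumes the training inputs are stacked as the rows of a matrix U:\n\nimport numpy as np\n\ndef eigen_saliencies(W, U):\n    # W: (m, p) weights of the linear model; U: (N, p) inputs, one exemplar per row\n    N = U.shape[0]\n    Sigma = U.T @ U / N                      # input correlation matrix\n    lam, C = np.linalg.eigh(Sigma)           # Lambda = C^T Sigma C\n    order = np.argsort(lam)[::-1]            # descending eigenvalue order\n    lam, C = lam[order], C[:, order]\n    W_bar = W @ C                            # transformed weights\n    sal = np.sum(W_bar ** 2, axis=0) * lam   # saliency of eigen-coordinate i\n    return lam, C, sal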
\n\nRelying on Akaike's Final Prediction Error (FPE) (Akaike, 1970), the average test set error for the original model is given by\n\nJ[W] = ((N + pm) / (N - pm)) I(W),\n\nwhere pm is the number of parameters in the model. If p - l principal components are removed, then the expected test set error is given by\n\nJ_l[\bar{W}] = ((N + lm) / (N - lm)) I_l(\bar{W}).\n\nIf we assume that N >> lm, the last equation implies that the optimal generalization will be achieved if all principal components for which\n\n\bar{w}_i^T \bar{w}_i \lambda_i < 2mI/N\n\nare removed. For these eigen-coordinates the reduction in model variance will more than compensate for the increase in training error, leaving a lower expected test set error.\n\n3 Proposed algorithm\n\nThe pruning algorithm for linear regression described in the previous section can be extended to multilayer neural networks. A complete analysis of the effects on generalization performance of removing eigen-nodes in a nonlinear network is beyond the scope of this short paper. However, it can be shown that removing eigen-nodes with low saliency reduces the effective number of parameters (Moody, 1992) and should usually improve generalization. Also, as will be discussed in the next section, our PCP algorithm is related to the OBD and OBS pruning methods. As with all pruning techniques and analyses of generalization, one must assume that the data are drawn from a stationary distribution, so that the training set fairly represents the distribution of data one can expect in the future.\n\nConsider now a feedforward neural network, where each layer is of the form\n\ny^i = \Gamma[W^i u^i] = \Gamma[x^i].\n\nHere, u^i is the input, x^i is the weighted sum of the input, \Gamma is a diagonal operator consisting of the activation function of the neurons at the layer, and y^i is the output of the layer. The algorithm proceeds as follows (a code sketch is given at the end of this section):\n\n1. A network is trained using a supervised (e.g. backpropagation) training procedure.\n2. Starting at the first layer, the correlation matrix \Sigma for the input vector to the layer is calculated.\n3. Principal components are ranked by their effect on the linear output of the layer.[2]\n4. The effect of removing an eigen-node is evaluated using a validation set. Those that do not increase the validation error are deleted.\n5. The weights of the layer are projected onto the l-dimensional subspace spanned by the significant eigenvectors, W \to W C_l C_l^T, where the columns of C_l are the retained eigenvectors of the correlation matrix.\n6. The procedure continues until all layers are pruned.\n\n[2] If we assume that \Gamma is the sigmoidal operator then, relying on its contraction property, the resulting output error is bounded by ||e|| \le ||W|| ||e_{x^i}||, where e_{x^i} is the error observed at x^i and ||W|| is the norm of the matrices connecting it to the output.\n\nAs seen, the proposed algorithm is easy and fast to implement. The matrix dimensions are determined by the number of neurons in a layer and hence are manageable even for very large networks. No retraining is required after pruning, and the speed of running the network after pruning is not affected.
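\n\nThe per-layer step can be sketched as follows. This is a minimal illustration under our own conventions rather than the authors' implementation: val_err is assumed to be a user-supplied function that returns the validation error of the network for a candidate weight matrix, and eigen-nodes are tested greedily in order of increasing saliency, which is one way to realize steps 2-5:\n\nimport numpy as np\n\ndef pcp_prune_layer(W, U, val_err):\n    # W: (m, p) layer weights; U: (N, p) layer inputs\n    N = U.shape[0]\n    Sigma = U.T @ U / N\n    lam, C = np.linalg.eigh(Sigma)\n    sal = np.sum((W @ C) ** 2, axis=0) * lam     # saliency of each eigen-node\n    keep = np.ones(len(lam), dtype=bool)\n    best = val_err(W)\n    for i in np.argsort(sal):                    # least salient first\n        keep[i] = False                          # tentatively delete eigen-node i\n        trial = W @ (C[:, keep] @ C[:, keep].T)  # project onto kept eigenvectors\n        err = val_err(trial)\n        if err <= best:\n            best = err                           # no increase: leave it deleted\n        else:\n            keep[i] = True                       # validation error rose: restore it\n    return W @ (C[:, keep] @ C[:, keep].T)       # W -> W C_l C_l^T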
\n\nNote: A finer scale approach to pruning should be used if there is a large variation between the \bar{w}_{ij} for different j. In this case, rather than examine \bar{w}_i^T \bar{w}_i \lambda_i in one piece, the contribution of each \bar{w}_{ij}^2 \lambda_i could be examined individually, and those weights for which the contribution is small can be deleted.\n\n4 Relation to Hessian-Based Methods\n\nThe effect of our PCP method is to reduce the rank of each layer of weights in a network by the removal of the least salient eigen-nodes, which reduces the effective number of parameters (Moody, 1992). This is in contrast to the OBD and OBS methods, which reduce the rank by eliminating actual weights. PCP differs further from OBD and OBS in that it does not require that the network be trained to a local minimum of the error.\n\nIn spite of these basic differences, the PCP method can be viewed as intermediate between OBD and OBS in terms of how it approximates the Hessian of the error function. OBD uses a diagonal approximation, while OBS uses a linearized approximation of the full Hessian. In contrast, PCP effectively prunes based upon a block-diagonal approximation of the Hessian. A brief discussion follows.\n\nIn the special case of linear regression, the correlation matrix \Sigma is the full Hessian of the squared error.[3] For a multilayer network with Q layers, let us denote the numbers of units per layer as {p_q : q = 0 ... Q}.[4] The number of weights (including biases) in each layer is b_q = p_q (p_{q-1} + 1), and the total number of weights in the network is B = \sum_{q=1}^{Q} b_q. The Hessian of the error function is a B x B matrix, while the input correlation matrix for each of the units in layer q is a much simpler (p_{q-1} + 1) x (p_{q-1} + 1) matrix. Each layer has associated with it p_q identical correlation matrices.\n\n[3] The correlation matrix and Hessian may differ by a numerical factor depending on the normalization of the squared error. If the error function is defined as one half the average squared error (ASE), then the equality holds.\n[4] The inputs to the network constitute layer 0.\n\nThe combined set of these correlation matrices for all units in layers q = 1 ... Q of the network serves as a linear, block-diagonal approximation to the full Hessian of the nonlinear network.[5] This block-diagonal approximation has \sum_{q=1}^{Q} p_q (p_{q-1} + 1)^2 non-zero elements, compared to the [\sum_{q=1}^{Q} p_q (p_{q-1} + 1)]^2 elements of the full Hessian (used by OBS) and the \sum_{q=1}^{Q} p_q (p_{q-1} + 1) diagonal elements (used by OBD). Due to its greater richness in approximating the Hessian, we expect that PCP is likely to yield better generalization performance than OBD.\n\n[5] The derivation of this approximation will be presented elsewhere. However, the correspondence can be understood in analogy with the special case of linear regression.
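\n\nTo make the comparison concrete, the following lines evaluate the three element counts for a hypothetical architecture (the 45-10-6 shape of the time series network in Section 5); the variable names are ours:\n\n# Element counts of the three Hessian approximations, for layer widths\n# p = [p_0, ..., p_Q] with layer 0 the inputs; biases give the + 1 terms.\np = [45, 10, 6]\n\nblock = sum(p[q] * (p[q - 1] + 1) ** 2 for q in range(1, len(p)))   # PCP\nfull = sum(p[q] * (p[q - 1] + 1) for q in range(1, len(p))) ** 2    # OBS\ndiag = sum(p[q] * (p[q - 1] + 1) for q in range(1, len(p)))         # OBD\n\nprint(block, full, diag)   # 21886 276676 526: PCP sits between OBD and OBS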
\n\nThe computational complexities of the OBS, OBD, and PCP methods differ in step with these element counts (we assume that N \ge B): the computational cost of PCP is significantly less than that of OBS and is similar to that of OBD.\n\n5 Simulation Results\n\nRegression With Polynomial Basis Functions\n\nThe analysis in Section 2 is directly applicable to regression using a linear combination of basis functions, \hat{y} = W f(u). One simply replaces u with the vector of basis functions f(u).\n\nWe exercised our pruning technique on a univariate regression problem using monomial basis functions f(u) = (1, u, u^2, ..., u^n)^T with n = 10. The underlying function was a sum of four sigmoids. Training and test data were generated by evaluating the underlying function at 20 uniformly spaced points in the range -1 \le u \le +1 and adding gaussian noise. The underlying function, training data and the polynomial fit are shown in Figure 1a.\n\nFigure 1: a) Underlying function (solid), training data (points), and 10th order polynomial fit (dashed). b) Underlying function, training data, and pruned regression fit (dotted).\n\nThe mean squared error on the training set was 0.00648. The test set mean squared error, averaged over 9 test sets, was 0.0183 for the unpruned model. We removed the eigenfunctions with the smallest saliencies \bar{w}_i^2 \lambda_i. The lowest average test set error of 0.0126 was reached when the trailing four eigenfunctions were removed.[6] Figure 1b shows the pruned regression fit.\n\n[6] The FPE criterion suggested pruning the trailing three eigenfunctions. We note that our example does not satisfy the assumption of an unbiased model, nor are the sample sizes large enough for the FPE to be completely reliable.
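\n\nThe experiment is straightforward to reproduce in outline. The sketch below fills in details the text leaves open: the particular four sigmoids, the noise amplitude, and the random seed are our assumptions. It applies the FPE rule of Section 2 (with m = 1) in place of the validation-based deletion of Section 3:\n\nimport numpy as np\n\nrng = np.random.default_rng(0)\n\ndef target(u):\n    # hypothetical stand-in: the paper does not specify its four sigmoids\n    return sum(np.tanh(a * u + b) for a, b in [(5, -2), (3, 1), (7, 0), (2, 2)])\n\nn, N = 10, 20\nu = np.linspace(-1, 1, N)\ny = target(u) + 0.05 * rng.standard_normal(N)  # noise amplitude assumed\n\nF = np.vander(u, n + 1, increasing=True)       # monomial basis (1, u, ..., u^n)\nw = np.linalg.lstsq(F, y, rcond=None)[0]       # linear-in-the-weights fit\n\nSigma = F.T @ F / N                            # correlation matrix of the basis vector\nlam, C = np.linalg.eigh(Sigma)\nw_bar = C.T @ w                                # weights in the eigen-basis\nsal = w_bar ** 2 * lam                         # saliency of each eigenfunction\n\nI = np.mean((F @ w - y) ** 2)                  # training error, m = 1 output\nprune = sal < 2 * I / N                        # FPE rule: saliency below 2mI/N\nw_pruned = C @ (w_bar * ~prune)                # project out the pruned directions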
\n\nTime Series Prediction with a Sigmoidal Network\n\nWe have applied the proposed algorithm to the task of predicting the Index of Industrial Production (IP), which is one of the main gauges of U.S. economic activity. We predict the rate of change in IP over a set of future horizons based on lagged monthly observations of various macroeconomic and financial indicators (altogether 45 inputs).[7]\n\n[7] Preliminary results on this problem have been described briefly in (Moody et al., 1993), and a detailed account of this work will be presented elsewhere.\n\nOur standard benchmark is the rate of change in IP for January 1980 to January 1990 for models trained on January 1960 to December 1979. In all runs, we used two layer networks with 10 tanh hidden nodes and 6 linear output nodes corresponding to the various prediction horizons (1, 2, 3, 6, 9, and 12 months). The networks were trained using stochastic backprop (which with this very noisy data set outperformed more sophisticated gradient descent techniques). The test set results with and without the PCP algorithm are shown in Figure 2.\n\nFigure 2: Prediction of the IP index 1980-1990 (normalized error versus prediction horizon in months). The solid line shows the performance before pruning and the dotted line the performance after the application of the PCP algorithm. The results shown represent averages over 11 runs, with the error bars representing the standard deviation of the spread.\n\nDue to the significant noise and nonstationarity in the data, we found it beneficial to employ both weight decay and early stopping during training. In the above runs, the PCP algorithm was applied on top of these other regularization methods.\n\n6 Conclusions and Extensions\n\nOur \"Principal Components Pruning (PCP)\" algorithm is an efficient tool for reducing the effective number of parameters of a network. It is likely to be useful when there are correlations of signal activities. The method is substantially cheaper to implement than OBS and is likely to yield better network performance than OBD.[8]\n\n[8] See Section 4 for a discussion of the block-diagonal Hessian interpretation of our method. A systematic empirical comparison of the computational cost and resulting network performance of PCP to other methods like OBD and OBS would be a worthwhile undertaking.\n\nFurthermore, PCP can be used on top of any other regularization method, including early stopping or weight decay.[9] Unlike OBD and OBS, PCP does not require that the network be trained to a local minimum.\n\n[9] (Weigend and Rumelhart, 1991) called the rank of the covariance matrix of the node activities the \"effective dimension of hidden units\" and discussed it in the context of early stopping.\n\nWe are currently exploring nonlinear extensions of our linearized approach. These involve computing a block-diagonal Hessian in which the block corresponding to each unit differs from the correlation matrix for that layer by a nonlinear factor. The analysis makes use of the GPE (Moody, 1992) rather than the FPE.\n\nAcknowledgements\n\nOne of us (TKL) thanks Andreas Weigend for stimulating discussions that provided some of the motivation for this work. AUL and JEM gratefully acknowledge the support of the Advanced Research Projects Agency and the Office of Naval Research under grant ONR N00014-92-J-4062. TKL acknowledges the support of the Electric Power Research Institute under grant RP8015-2 and the Air Force Office of Scientific Research under grant F49620-93-1-0253.\n\nReferences\n\nAkaike, H. (1970). Statistical predictor identification. Ann. Inst. Stat. Math., 22:203.\n\nHassibi, B., Stork, D., and Wolff, G. (1992). Optimal brain surgeon and general network pruning. Technical Report 9235, RICOH California Research Center, Menlo Park, CA.\n\nJolliffe, I. T. (1986). Principal Component Analysis. Springer-Verlag.\n\nLe Cun, Y., Denker, J., and Solla, S. (1990). Optimal brain damage. In Touretzky, D., editor, Advances in Neural Information Processing Systems, volume 2, pages 598-605, Denver 1989. Morgan Kaufmann, San Mateo.\n\nLeen, T. K., Rudnick, M., and Hammerstrom, D. (1990). Hebbian feature discovery improves classifier efficiency. In Proceedings of the IEEE/INNS International Joint Conference on Neural Networks, pages I-51 to I-56.\n\nMoody, J. (1992). The effective number of parameters: An analysis of generalization and regularization in nonlinear learning systems. In Moody, J., Hanson, S., and Lippmann, R., editors, Advances in Neural Information Processing Systems, volume 4, pages 847-854. Morgan Kaufmann.
\n\nMoody, J., Levin, A., and Rehfuss, S. (1993). Predicting the U.S. index of industrial production. Neural Network World, 3:791-794. In Proceedings of Parallel Applications in Statistics and Economics '93.\n\nMozer, M. and Smolensky, P. (1989). Skeletonization: A technique for trimming the fat from a network via relevance assessment. In Touretzky, D., editor, Advances in Neural Information Processing Systems, volume 1, pages 107-115. Morgan Kaufmann.\n\nWeigend, A. S. and Rumelhart, D. E. (1991). Generalization through minimal networks with application to forecasting. In Keramidas, E. M., editor, INTERFACE'91 - 23rd Symposium on the Interface: Computing Science and Statistics, pages 362-370.\n", "award": [], "sourceid": 754, "authors": [{"given_name": "Asriel", "family_name": "Levin", "institution": null}, {"given_name": "Todd", "family_name": "Leen", "institution": null}, {"given_name": "John", "family_name": "Moody", "institution": null}]}