{"title": "Optimal Brain Surgeon: Extensions and performance comparisons", "book": "Advances in Neural Information Processing Systems", "page_first": 263, "page_last": 270, "abstract": "", "full_text": "Optimal Brain Surgeon: \n\nExtensions and performance comparisons \n\nBabak Hassibi* \n\nDavid G. Stork \n\nGregory Wolff \n\nTakahiro Watanabe \n\nRicoh California Research Center \n\n2882 Sand Hill Road Suite 115 \nMenlo Park, CA 94025-7022 \n\n105B Durand Hall \nStanford University \n\nStanford, CA 94305-4055 \n\n* Department of Electrical Engineering \n\nand \n\nAbstract \n\na second-order \n\nWe extend Optimal Brain Surgeon (OBS) -\nto allow for general error mea(cid:173)\nmethod for pruning networks -\nsures, and explore a reduced computational and storage implemen(cid:173)\ntation via a dominant eigenspace decomposition. Simulations on \nnonlinear, noisy pattern classification problems reveal that OBS \ndoes lead to improved generalization, and performs favorably in \ncomparison with Optimal Brain Damage (OBD). We find that the \nrequired retraining steps in OBD may lead to inferior generaliza(cid:173)\ntion, a result that can be interpreted as due to injecting noise back \ninto the system. A common technique is to stop training of a large \nnetwork at the minimum validation error. We found that the test \nerror could be reduced even further by means of OBS (but not \nOBD) pruning. Our results justify the t ~ 0 approximation used \nin OBS and indicate why retraining in a highly pruned network \nmay lead to inferior performance. \n\n263 \n\n\f264 \n\nHassibi, Stork, Wolff, and Watanabe \n\n1 \n\nINTRODUCTION \n\nThe fundamental theory of generalization favors simplicity. For a given level of \nperformance on observed data, models with fewer parameters can be expected to \nperform better on test data. In practice, we find that neural networks with fewer \nweights typically generalize better than large networks with the same training error. 
\nTo this end, LeCun, Denker and Solla's (1990) Optimal Brain Damage method (OBD) sought to delete weights while keeping the training error as small as possible. Hassibi and Stork (1993) extended OBD to include the off-diagonal terms in the network's Hessian, which were shown to be significant and important for pruning in classical and benchmark problems. \n\nOBD and Optimal Brain Surgeon (OBS) share the same basic approach of training a network to a (local) minimum in error at weight w*, and then pruning a weight that leads to the smallest increase in the training error. The predicted functional increase in the error for a change in the full weight vector δw is: \n\nδE = (∂E/∂w)^T · δw + (1/2) δw^T · H · δw + O(||δw||³), (1) \n\nwhere H ≡ ∂²E/∂w² is the Hessian matrix. The first term vanishes because we are at a local minimum in error; we ignore third- and higher-order terms (Gorodkin et al., 1993). Hassibi and Stork (1993) first showed that the general solution for minimizing this function given the constraint of deleting one weight was: \n\nδw = - (w_q / [H^{-1}]_qq) H^{-1} · e_q and L_q = w_q² / (2 [H^{-1}]_qq). (2) \n\nHere, e_q is the unit vector along the qth direction in weight space and L_q is the saliency of weight q - an estimate of the increase in training error if weight q is pruned and the other weights updated by the left equation in Eq. 2. \n\n2 GENERAL ERROR MEASURES AND FISHER'S METHOD OF SCORING \n\nIn this section we show that the recursive procedure for computing the inverse Hessian for sum squared errors presented in Hassibi and Stork (1993) generalizes to any twice differentiable distance norm and that the key approximation based on Fisher's method of scoring is still valid. \nConsider an arbitrary twice differentiable distance norm d(t, o) where t is the desired output (teaching vector) and o = F(w, in) the actual output. 
Given a weight vector w, F maps the input vector in to the output; the total error over P patterns is E = (1/P) Σ_{k=1}^P d(t^[k], o^[k]). It is straightforward to show that for a single output unit network the Hessian is: \n\nH = (1/P) Σ_{k=1}^P (∂F(w, in^[k])/∂w) · (∂²d(t^[k], o^[k])/∂o²) · (∂F^T(w, in^[k])/∂w) + (1/P) Σ_{k=1}^P (∂d(t^[k], o^[k])/∂o) · (∂²F(w, in^[k])/∂w²). (3) \n\nThe second term is of order O(||t - o||); using Fisher's method of scoring (Seber & Wild, 1989), we set this term to zero. Thus our Hessian reduces to: \n\nH = (1/P) Σ_{k=1}^P (∂F(w, in^[k])/∂w) · (∂²d(t^[k], o^[k])/∂o²) · (∂F^T(w, in^[k])/∂w). (4) \n\nWe define X_k ≡ ∂F(w, in^[k])/∂w and a_k ≡ ∂²d(t^[k], o^[k])/∂o², and following the logic of Hassibi and Stork (1993) we can easily show that the recursion for computing the inverse Hessian becomes: \n\nH_{k+1}^{-1} = H_k^{-1} - (H_k^{-1} · X_{k+1} · X_{k+1}^T · H_k^{-1}) / (P/a_{k+1} + X_{k+1}^T · H_k^{-1} · X_{k+1}), with H_0^{-1} = α^{-1} I and H_P^{-1} = H^{-1}, (5) \n\nwhere α is a small parameter - effectively a weight decay constant. Note how different error measures d(t, o) scale the gradient vectors X_k forming the Hessian (Eq. 4). For the squared error d(t, o) = (1/2)(t - o)², we have a_k = 1, and all gradient vectors are weighted equally. The cross entropy or Kullback-Leibler distance, \n\nd(t, o) = t log(t/o) + (1 - t) log((1 - t)/(1 - o)), 0 ≤ o, t ≤ 1, (6) \n\nyields a_k = 1/(o^[k](1 - o^[k])). Hence if o^[k] is close to zero or one, X_k is given a large weight in the Hessian; conversely, the smallest value of a_k occurs when o^[k] = 1/2. 
\nThis is desirable and intuitively sensible, since in the cross entropy norm the value of o^[k] is interpreted as the probability that the kth input pattern belongs to a particular class; we therefore give large weight to those X_k whose class we are most certain of, and small weight to those of which we are least certain. \n\n3 EIGENSPACE DECOMPOSITION \n\nAlthough OBS has been shown to be a powerful method for small and intermediate sized networks - Hassibi, Stork and Wolff (1993) applied OBS successfully to NETtalk - its use in larger problems is difficult because of large storage and computation requirements. For a network of n weights, simply storing the Hessian requires O(n²/2) elements, and O(Pn²) computations are needed for each pruning step. Reducing this computational burden requires some type of approximation. Since OBS uses the inverse of the Hessian, any approximation to OBS will at some level reduce to an approximation of H. For instance, OBD uses a diagonal approximation; magnitude-based methods use an isotropic approximation; and dividing the network into subsets (e.g., hidden-to-output and input-to-hidden) corresponds to the less-restrictive block diagonal approximation. In what follows we explore the dominant eigenspace decomposition of the inverse Hessian as our approximation. It should be remembered that all of these are subsets of the full OBS approach. \n\n3.1 Theory \n\nThe dominant eigendecomposition is the best low-rank approximation of a matrix (in an induced 2-norm sense). Since the largest eigenvalues of H^{-1} are the smallest eigenvalues of H, this method will, roughly speaking, prune weights in the approximate nullspace of H. Dealing with a low rank approximation of H^{-1} will drastically reduce the storage and computational requirements. 
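The machinery above - the saliency and weight update of Eq. 2 and the inverse-Hessian recursion of Eq. 5 - can be sketched in a few lines of numpy. This is a minimal sketch under our own assumptions (function names and array shapes are ours, not the paper's):

```python
import numpy as np

def inverse_hessian(X, a, alpha=1e-6):
    """Recursive inverse-Hessian estimate of Eq. 5.

    X     : (P, n) array whose row k is the gradient X_k = dF(w, in[k])/dw.
    a     : (P,) array of curvature weights a_k = d^2 d(t[k], o[k])/do^2.
    alpha : small weight-decay-like constant; H_0^{-1} = I / alpha.
    """
    P, n = X.shape
    Hinv = np.eye(n) / alpha
    for k in range(P):
        HX = Hinv @ X[k]
        # Sherman-Morrison update for H_{k+1} = H_k + (a_k / P) X_k X_k^T
        Hinv -= np.outer(HX, HX) / (P / a[k] + X[k] @ HX)
    return Hinv

def saliency_and_update(w, Hinv, q):
    """Saliency L_q and full-weight change of Eq. 2 for deleting weight q."""
    Lq = w[q] ** 2 / (2.0 * Hinv[q, q])
    dw = -(w[q] / Hinv[q, q]) * Hinv[:, q]  # zeroes w[q], adjusts the rest
    return Lq, dw
```

For the squared error all a_k are equal and this reduces to the recursion of Hassibi and Stork (1993); for the cross entropy one would pass a_k = 1/(o^[k](1 - o^[k])).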
\n\nConsider the eigendecomposition of H: \n\nH = U_S Σ_S U_S^* + U_N Σ_N U_N^*, (7) \n\nwhere Σ_S contains the largest eigenvalues of H and Σ_N the smallest ones. (We use the subscripts S and N to loosely connote signal and noise.) The dimension of the noise subspace is typically m << n. U_S and U_N are n x (n - m) and n x m matrices that span the dominant eigenspaces of H and H^{-1}, and * denotes matrix transpose and complex conjugation. If, as suggested above, we restrict the weight prunings to lie in U_N, we obtain the following saliency and full weight change when removing the qth weight: \n\nL̄_q = w_q² / (2 e_q^T · U_N · Σ_N^{-1} · U_N^* · e_q), (8) \n\nδw̄ = - (w_q / (e_q^T · U_N · Σ_N^{-1} · U_N^* · e_q)) · U_N · Σ_N^{-1} · U_N^* · e_q, (9) \n\nwhere we have used 'bars' to indicate that these are approximations to Eq. 2. Note now that we need only store Σ_N and U_N, which have roughly nm elements. Likewise the computation required to estimate Σ_N and U_N is O(Pnm). \nThe bound on the approximate saliency is: \n\nL_q ≤ L̄_q ≤ L_q + w_q²/(2σ(S)), (10) \n\nwhere σ(S) is the smallest eigenvalue of Σ_S. Moreover, if σ(S) is large enough that σ(S) > ([H̄^{-1}]_qq)^{-1}, we have the following simpler form: \n\nL_q ≤ L̄_q ≤ 2 L_q. (11) \n\nIn either case Eqs. 10 and 11 indicate that the larger σ(S) is, the tighter the bounds are. Thus if the subspace dimension m is such that the eigenvalues in Σ_S are large, then we will have a good approximation. \n\nLeCun, Simard and Pearlmutter (1993) have suggested a method that can be used to estimate the smallest eigenvectors of the Hessian. However, for OBS (as we shall see) it is best to use the Hessian with the t → o approximation, and their method is not appropriate. 
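The restricted saliencies of Eq. 8 follow from the m smallest eigenpairs of H; a small numpy sketch (our own naming; a real implementation would estimate U_N and Σ_N iteratively rather than by a full eigendecomposition):

```python
import numpy as np

def eigenspace_saliencies(H, w, m):
    """Approximate saliencies of Eq. 8 from the m-dimensional 'noise'
    eigenspace of H (its m smallest eigenvalues)."""
    lam, U = np.linalg.eigh(H)        # eigenvalues in ascending order
    lam_N, U_N = lam[:m], U[:, :m]    # noise subspace Sigma_N, U_N
    # e_q^T U_N Sigma_N^{-1} U_N^T e_q = sum_j U_N[q, j]^2 / lam_N[j]
    d_N = (U_N ** 2 / lam_N).sum(axis=1)
    return w ** 2 / (2.0 * d_N)
```

With m = n this reduces to the exact saliency w_q²/(2[H^{-1}]_qq) of Eq. 2; for m < n only Σ_N and U_N (roughly nm numbers) need be kept, and the approximate saliency can only overestimate the exact one, consistent with the bound of Eq. 10.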
\n\n3.2 Simulations \n\nWe pruned networks trained on the three Monk's problems (Thrun et al., 1991) using the full OBS and a 5-dimensional eigenspace version of OBS, using the validation error rate as the stopping criterion. (We chose a 5-dimensional subspace because this reduced the computational complexity by an order of magnitude.) The Table shows the number of weights obtained. It is clear that this eigenspace decomposition was not particularly successful. It appears as though the off-diagonal terms in H beyond those in the eigenspace are important, and their omission leads to bad pruning. However, this warrants further study. \n\n        unpruned   OBS   5-d eigenspace \nMonk1      58       14         28 \nMonk2      39       16         27 \nMonk3      39        4         11 \n\n4 OBS/OBD COMPARISON \n\nGeneral criteria for comparing pruning methods do not exist. Since such methods amount to assuming a particular prior distribution over the parameters, the empirical results usually tell us more about the problem space than about the methods themselves. However, for two methods such as OBS and OBD, which utilize the same cost function and differ only in their approximations, empirical comparisons can be informative. Hence, we have applied both OBS and OBD to several problems, including an artificially generated statistical classification task and a real-world copier voltage control problem. As we show below, the OBS algorithm usually results in better generalization performance. \n\n4.1 MULTIPLE GAUSSIAN PRIORS \n\nWe created a two-category classification problem with a five-dimensional input space. Category A consisted of two Gaussian distributions with mean vectors μ_A1 = (1, 1, 0, 1, .5) and μ_A2 = (0, 0, 1, 0, .5) and covariances Σ_A1 
= Diag[0.99, 1.0, 0.88, 0.70, 0.95] and Σ_A2 = Diag[1.28, 0.60, 0.52, 0.93, 0.93], while category B had means μ_B1 = (0, 1, 0, 0, .5) and μ_B2 = (1, 0, 1, 1, .5) and covariances Σ_B1 = Diag[0.84, 0.68, 1.28, 1.02, 0.89] and Σ_B2 = Diag[0.52, 1.25, 1.09, 0.64, 1.13]. \nThe networks were feedforward with 5 input units, 9 hidden units, and a single output unit (64 weights total). The training and the test sets consisted of 1000 patterns each, randomly chosen from the equi-probable categories. The problem was a difficult one: even with the somewhat large number of weights it was not possible to obtain less than 0.15 squared error per training pattern. We trained the networks to a local error minimum and then applied OBD (with retraining after each pruning step using backpropagation) as well as OBS. \n\nFigure 1 (left) shows the training errors for the network as a function of the number of remaining weights during pruning by OBS and by OBD. As more weights are pruned, the training errors for both OBS and OBD typically increase. Comparing the two graphs for the first pruned weights, the training errors for OBD and OBS are roughly equal, after which the training error of OBS is less until the 24th weight is removed. \n\nFigure 1: OBS and OBD training error (left) and test error (right) on a sum of Gaussians prior pattern classification task as a function of the number of weights in the network. (Pruning proceeds right to left.) OBS pruning employed α = 10^{-6} (cf. Eq. 5); OBD employed 60 retraining epochs after each pruning. 
The reason OBD training is initially slightly better is that the network was not at an exact local minimum; indeed, in the first few stages the training error for OBD actually becomes less than its original value. (Training exhaustively to the true local minimum took prohibitively long.) In contrast, due to the t → o approximation, OBS tries to keep the network response close to where it was, even if that isn't the minimum w*. We think it plausible that if the network were at an exact local minimum, OBS would have had virtually identical performance. \n\nSince OBD uses retraining, the only reason OBS can outperform it after the first steps is that OBD has removed an incorrect weight, due to its diagonal approximation. (The reason OBS behaves poorly after removing 24 weights - a radically pruned net - may be that the second-order approximation breaks down at this point.) We can see that the minimum of the test error occurs before this breakdown, meaning that the failed approximation (Fig. 2) does not affect our choice of the optimal network, at least for this problem. \n\nThe most important and interesting result is the test error for these pruned networks (Figure 1, right). The test error for OBD does not show any consistent behavior, other than the fact that on average it generally goes up. This is contrary to what one would expect of a pruning algorithm. It seems that the retraining phase works against the pruning process, by tending to reinforce overfitting and to reinject the training set noise. For OBS, however, the test error consistently decreases until a minimum is reached after removing 22 weights, because the t → o approximation avoids reinjecting the training set noise. \n\n4.2 OBS/OBD PRUNING AND \"STOPPED\" NETWORKS \n\nA popular method of avoiding overfitting is to stop training a large net when the validation error reaches a minimum. 
In order to explore whether pruning could improve the performance of such a \"stopped\" network (i.e., one not at w*), we monitored the test error for the above problem and recorded the weights for which a minimum on the test set occurred. We then applied OBS and OBD to this network. \n\nFigure 2: A 64-weight network was trained to minimum validation error on the Gaussian problem - not to w* - and then pruned by OBD and by OBS. The test error on the resulting network is shown. (Pruning proceeds from right to left.) Note especially that even though the network is far from w*, OBS leads to lower test error over a wide range of prunings, even though OBD employs retraining. \n\nThe results shown in Figure 2 indicate that with OBS we were able to reduce the test error, which reached a minimum after removing 17 weights. OBD was not able to consistently reduce the test error. \n\nThis last result and those from Fig. 2 have important consequences. There are no universal stopping criteria based on theory (for the reasons mentioned above), but it is typical practice to use validation error as such a criterion. As can be seen in Figure 2, the test error (which we here consider a validation error) consistently decreases to a unique minimum for pruning by OBS. For the network pruned (and continuously retrained) by OBD, there is no such structure in the validation curves. There seems to be no reliable clue that would permit the user to know when to stop pruning. 
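The procedure used in this section - prune greedily by saliency while monitoring a validation error, and keep the network seen at its minimum - can be sketched as follows. This is a simplified sketch under our own assumptions: the gradient vectors X_k are held fixed, pruned weights are simply clamped to zero rather than removed from the Hessian, and `val_error` is a caller-supplied scoring function (all names hypothetical):

```python
import numpy as np

def prune_with_validation(w, X, a, val_error, n_prune, alpha=1e-6):
    """Greedy OBS-style pruning with validation-based stopping.

    Deletes the minimum-saliency surviving weight (Eq. 2) n_prune times,
    scoring the network after each deletion; returns the weight vector
    that achieved the lowest validation error."""
    w = w.copy()
    alive = np.ones(w.size, dtype=bool)
    best_w, best_err = w.copy(), val_error(w)
    for _ in range(n_prune):
        # Hessian of Eq. 4 plus the alpha regularizer of Eq. 5.
        H = alpha * np.eye(w.size) + (X.T * (a / len(X))) @ X
        Hinv = np.linalg.inv(H)
        idx = np.flatnonzero(alive)
        q = idx[np.argmin(w[idx] ** 2 / (2.0 * np.diag(Hinv)[idx]))]
        w += -(w[q] / Hinv[q, q]) * Hinv[:, q]  # Eq. 2 update; zeroes w[q]
        alive[q] = False
        w[~alive] = 0.0  # simplification: keep pruned weights clamped at zero
        err = val_error(w)
        if err < best_err:
            best_err, best_w = err, w.copy()
    return best_w, best_err
```

In the full algorithm the pruned weight would be removed from the Hessian and the gradients re-evaluated after each update; the sketch only illustrates the stopping rule.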
\n\n4.3 COPIER CONTROL APPLICATION \n\nThe quality of an image produced by a copier depends upon a wide variety of factors: time since last copy, time since last toner cartridge installed, temperature, humidity, overall graylevel of the source document, etc. These factors interact in a highly non-linear fashion, so that mathematical modeling of their interrelationships is difficult. Morita et al. (1992) used backpropagation to train an 8-4-8 network (65 weights) on real-world data, and managed to achieve an RMS voltage error of 0.0124 on a critical control plate. We pruned this network both by OBD with retraining and by OBS. When the network was pruned by OBD with retraining, the test error continually increased (erratically), such that at 34 remaining weights the RMS error was 0.023. When we instead pruned the original net by OBS, the test error gradually decreased, such that at the same number of weights the test error was 0.012 - significantly lower than that of the net pruned by OBD. \n\n5 CONCLUSIONS \n\nWe compared pruning by OBS and by OBD with retraining on a difficult non-linear statistical pattern recognition problem and found that OBS led to lower generalization error. We also considered the widely used technique of training large nets to minimum validation error. To our surprise, we found that subsequent pruning by OBS lowered generalization error, thereby demonstrating that such networks still have overfitting problems. We have found that the dominant eigenspace approach to OBS leads to poor performance. Our simulations support the claim that the t → o approximation used in OBS avoids reinjecting training set noise into the network. In contrast, including such t - o terms in OBS reinjects training set noise and degrades generalization performance, as does retraining in OBD. \n\nAcknowledgements \n\nThanks to T. 
Kailath for support of B.H. through grants AFOSR 91-0060 and DAAL03-91-C-0010. Address reprint requests to Dr. Stork: stork@crc.ricoh.com. \n\nReferences \n\nJ. Gorodkin, L. K. Hansen, A. Krogh, C. Svarer and O. Winther. (1993) A quantitative study of pruning by Optimal Brain Damage. International Journal of Neural Systems 4(2), 159-169. \n\nB. Hassibi & D. G. Stork. (1993) Second order derivatives for network pruning: Optimal Brain Surgeon. In S. J. Hanson, J. D. Cowan and C. L. Giles (eds.), Advances in Neural Information Processing Systems 5, 164-171. San Mateo, CA: Morgan Kaufmann. \n\nB. Hassibi, D. G. Stork & G. Wolff. (1993) Optimal Brain Surgeon and general network pruning. Proceedings of ICNN 93, San Francisco, vol. 1, 293-299. IEEE Press. \n\nY. LeCun, J. Denker & S. Solla. (1990) Optimal Brain Damage. In D. Touretzky (ed.), Advances in Neural Information Processing Systems 2, 598-605. San Mateo, CA: Morgan Kaufmann. \n\nY. LeCun, P. Simard & B. Pearlmutter. (1993) Automatic learning rate maximization by on-line estimation of the Hessian's eigenvectors. In S. J. Hanson, J. D. Cowan & C. L. Giles (eds.), Advances in Neural Information Processing Systems 5, 156-163. San Mateo, CA: Morgan Kaufmann. \n\nT. Morita, M. Kanaya, T. Inagaki, H. Murayama & S. Kato. (1992) Photo-copier image density control using neural network and fuzzy theory. Second International Workshop on Industrial Fuzzy Control & Intelligent Systems, December 2-4, College Station, TX, 10. \n\nS. Thrun and 23 co-authors. (1991) The Monk's Problems - A performance comparison of different learning algorithms. Technical Report CMU-CS-91-197, Carnegie Mellon University Dept. of Computer Science. 
", "award": [], "sourceid": 749, "authors": [{"given_name": "Babak", "family_name": "Hassibi", "institution": null}, {"given_name": "David", "family_name": "Stork", "institution": null}, {"given_name": "Gregory", "family_name": "Wolff", "institution": null}]}