{"title": "Second order derivatives for network pruning: Optimal Brain Surgeon", "book": "Advances in Neural Information Processing Systems", "page_first": 164, "page_last": 171, "abstract": null, "full_text": "Second order derivatives for network \n\npruning: Optimal Brain Surgeon \n\nBabak Hassibi* and David G. Stork \n\nRicoh California Research Center \n2882 Sand Hill Road, Suite 115 \nMenlo Park, CA 94025-7022 \n\nstork@crc.ricoh.com \n\n* Department of Electrical Engineering \n\nand \n\nStanford University \nStanford, CA 94305 \n\nAbstract \n\nWe investigate the use of information from all second order derivatives of the error \nfunction to perfonn network pruning (i.e., removing unimportant weights from a trained \nnetwork) in order to improve generalization, simplify networks, reduce hardware or \nstorage requirements, increase the speed of further training, and in some cases enable rule \nextraction. Our method, Optimal Brain Surgeon (OBS), is Significantly better than \nmagnitude-based methods and Optimal Brain Damage [Le Cun, Denker and Sol1a, 1990], \nwhich often remove the wrong weights. OBS permits the pruning of more weights than \nother methods (for the same error on the training set), and thus yields better \ngeneralization on test data. Crucial to OBS is a recursion relation for calculating the \ninverse Hessian matrix H-I from training data and structural information of the net. OBS \npermits a 90%, a 76%, and a 62% reduction in weights over backpropagation with weighL \ndecay on three benchmark MONK's problems [Thrun et aI., 1991]. Of OBS, Optimal \nBrain Damage, and magnitude-based methods, only OBS deletes the correct weights from \na trained XOR network in every case. Finally, whereas Sejnowski and Rosenberg [1987J \nused 18,000 weights in their NETtalk network, we used OBS to prune a network to just \n1560 weights, yielding better generalization. 
\n\n1 Introduction \n\nA central problem in machine learning and pattern recognition is to minimize the system complexity (description length, VC-dimension, etc.) consistent with the training data. In neural networks this regularization problem is often cast as minimizing the number of connection weights. Without such weight elimination, overfitting problems and thus poor generalization will result. Conversely, if there are too few weights, the network might not be able to learn the training data. \nIf we begin with a trained network having too many weights, the questions then become: Which weights should be eliminated? How should the remaining weights be adjusted for best performance? How can such network pruning be done in a computationally efficient way? \n\nMagnitude-based methods [Hertz, Krogh and Palmer, 1991] eliminate weights that have the smallest magnitude. This simple and naively plausible idea unfortunately often leads to the elimination of the wrong weights - small weights can be necessary for low error. Optimal Brain Damage [Le Cun, Denker and Solla, 1990] uses the criterion of minimal increase in training error for weight elimination. For computational simplicity, OBD assumes that the Hessian matrix is diagonal; in fact, however, Hessians for every problem we have considered are strongly non-diagonal, and this leads OBD to eliminate the wrong weights. The superiority of the method described here - Optimal Brain Surgeon - lies in great part in the fact that it makes no restrictive assumptions about the form of the network's Hessian, and thereby eliminates the correct weights. Moreover, unlike other methods, OBS does not demand (typically slow) retraining after the pruning of a weight. 
\n\n2 Optimal Brain Surgeon \n\nIn deriving our method we begin, as do Le Cun, Denker and Solla [1990], by considering a network trained to a local minimum in error. The functional Taylor series of the error with respect to weights (or parameters, see below) is: \n\ndelta E = (dE/dw)^T . delta w + (1/2) delta w^T . H . delta w + O(||delta w||^3)   (1) \n\nwhere H = d^2 E/dw^2 is the Hessian matrix (containing all second order derivatives) and the superscript T denotes vector transpose. For a network trained to a local minimum in error, the first (linear) term vanishes; we also ignore the third and all higher order terms. Our goal is then to set one of the weights to zero (which we call w_q) to minimize the increase in error given by Eq. 1. Eliminating w_q is expressed as: \n\ndelta w_q + w_q = 0,   or more generally   e_q^T . delta w + w_q = 0   (2) \n\nwhere e_q is the unit vector in weight space corresponding to (scalar) weight w_q. Our goal is then to solve: \n\nmin_q { min_{delta w} (1/2) delta w^T . H . delta w   such that   e_q^T . delta w + w_q = 0 }   (3) \n\nTo solve Eq. 3 we form a Lagrangian from Eqs. 1 and 2: \n\nL = (1/2) delta w^T . H . delta w + lambda (e_q^T . delta w + w_q)   (4) \n\nwhere lambda is a Lagrange undetermined multiplier. We take functional derivatives, employ the constraints of Eq. 2, and use matrix inversion to find that the optimal weight change and resulting change in error are: \n\ndelta w = -(w_q / [H^-1]_qq) H^-1 . e_q   and   L_q = (1/2) w_q^2 / [H^-1]_qq   (5) \n\nNote that neither H nor H^-1 need be diagonal (as is assumed by Le Cun et al.); moreover, our method recalculates the magnitude of all the weights in the network, by the left equation of Eq. 5. We call L_q the \"saliency\" of weight q - the increase in error that results when the weight is eliminated - a definition more general than Le Cun et al.'s, and which includes theirs in the special case of diagonal H. \nThus we have the following algorithm: \n\nOptimal Brain Surgeon procedure \n\n1. Train a \"reasonably large\" network to minimum error. \n2. Compute H^-1. \n3. Find the q that gives the smallest saliency L_q = w_q^2 / (2 [H^-1]_qq). If this candidate error increase is much smaller than E, then the qth weight should be deleted, and we proceed to step 4; otherwise go to step 5. (Other stopping criteria can be used too.) \n4. Use the q from step 3 to update all weights (Eq. 5). Go to step 2. \n5. No more weights can be deleted without large increase in E. (At this point it may be desirable to retrain the network.) \n\nFigure 1 illustrates the basic idea. The relative magnitudes of the error after pruning (before retraining, if any) depend upon the particular problem, but to second order obey: E(mag) >= E(OBD) >= E(OBS), which is the key to the superiority of OBS. In this example OBS and OBD lead to the elimination of the same weight (weight 1). In many cases, however, OBS will eliminate different weights than those eliminated by OBD (cf. Sect. 6). We call our method Optimal Brain Surgeon because in addition to deleting weights, it calculates and changes the strengths of other weights without the need for gradient descent or other incremental retraining. \n\nFigure 1: Error as a function of two weights in a network. The (local) minimum occurs at weight w*, found by gradient descent or other learning method. In this illustration, a magnitude-based pruning technique (mag) then removes the smallest weight, weight 2; Optimal Brain Damage before retraining (OBD) removes weight 1. In contrast, our Optimal Brain Surgeon method (OBS) not only removes weight 1, but also automatically adjusts the value of weight 2 to minimize the error, without retraining. The error surface here is general in that it has different curvatures (second derivatives) along different directions, a minimum at a non-special weight value, and a non-diagonal Hessian (i.e., principal axes are not parallel to the weight axes). 
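The procedure above is essentially a loop around Eq. 5. The following Python/NumPy sketch (ours, not the paper's; the function name, arguments, and the tolerance-based stopping test are illustrative assumptions) shows one OBS step, given a weight vector and a precomputed inverse Hessian:

```python
import numpy as np

def obs_prune_step(w, H_inv, tol):
    """One Optimal Brain Surgeon step (sketch of steps 3-4, Eq. 5).

    w     : current weight vector (1-D array)
    H_inv : inverse Hessian of the training error at w
    tol   : stop if even the smallest saliency exceeds this threshold
    Returns (new_w, q) with weight q pruned, or (w, None) to stop.
    """
    # Saliency of each weight q: L_q = w_q^2 / (2 [H^-1]_qq)   (Eq. 5)
    saliencies = w ** 2 / (2.0 * np.diag(H_inv))
    q = int(np.argmin(saliencies))
    if saliencies[q] > tol:
        return w, None  # step 5: no weight can go without a large error increase
    # Optimal update of ALL weights: delta_w = -(w_q / [H^-1]_qq) H^-1 e_q
    new_w = w - (w[q] / H_inv[q, q]) * H_inv[:, q]
    # The constraint e_q^T delta_w + w_q = 0 drives the pruned weight to zero.
    assert abs(new_w[q]) < 1e-8
    return new_w, q
```

Note that the update touches every weight, not just w_q: this is the "surgery" that removes the need for retraining after each deletion.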
We have found (to our surprise) that every problem we have investigated has strongly non-diagonal Hessians - thereby explaining the improvement of our method over that of Le Cun et al. \n\n3 Computing the inverse Hessian \n\nThe difficulty appears to be step 2 in the OBS procedure, since inverting a matrix of thousands or millions of terms seems computationally intractable. In what follows we shall give a general derivation of the inverse Hessian for a fully trained neural network. It makes no difference whether it was trained by backpropagation, competitive learning, the Boltzmann algorithm, or any other method, so long as derivatives can be taken (see below). We shall show that the Hessian can be reduced to the sample covariance matrix associated with certain gradient vectors. Furthermore, the gradient vectors necessary for OBS are normally available at small computational cost; the covariance form of the Hessian yields a recursive formula for computing the inverse. \nConsider a general non-linear neural network that maps an input vector in of dimension n_i into an output vector o of dimension n_o, according to the following: \n\no = F(w, in)   (6) \n\nwhere w is an n-dimensional vector representing the neural network's weights or other parameters. We shall refer to w as a weight vector below for simplicity and definiteness, but it must be stressed that w could represent any continuous parameters, such as those describing neural transfer function, weight sharing, and so on. The mean square error corresponding to the training set is defined as: \n\nE = (1/2P) sum_{k=1}^{P} (t^[k] - o^[k])^T . (t^[k] - o^[k])   (7) \n\nwhere P is the number of training patterns, and t^[k] and o^[k] are the desired response and network response for the kth training pattern. The first derivative with respect to w is: \n\ndE/dw = -(1/P) sum_{k=1}^{P} dF(w, in^[k])/dw . (t^[k] - o^[k])   (8) \n\nand the second derivative or Hessian is: \n\nH = d^2 E/dw^2 = (1/P) sum_{k=1}^{P} [ dF(w, in^[k])/dw . (dF(w, in^[k])/dw)^T - d^2 F(w, in^[k])/dw^2 . (t^[k] - o^[k]) ]   (9) \n\nNext we consider a network fully trained to a local minimum in error at w*. Under this condition the network response o^[k] will be close to the desired response t^[k], and hence we neglect the term involving (t^[k] - o^[k]). Even late in pruning, when this error is not small for a single pattern, this approximation can be justified (see next Section). This simplification yields: \n\nH = (1/P) sum_{k=1}^{P} dF(w, in^[k])/dw . (dF(w, in^[k])/dw)^T   (10) \n\nIf our network has just a single output, we may define the n-dimensional data vector X^[k] of derivatives as: \n\nX^[k] = dF(w, in^[k])/dw   (11) \n\nThus Eq. 10 can be written as: \n\nH = (1/P) sum_{k=1}^{P} X^[k] . X^[k]T   (12) \n\nIf instead our network has multiple output units, then X will be an n x n_o matrix of the form: \n\nX^[k] = dF(w, in^[k])/dw = (dF_1(w, in^[k])/dw, ..., dF_{n_o}(w, in^[k])/dw) = (X_1^[k], ..., X_{n_o}^[k])   (13) \n\nwhere F_i is the ith component of F. Hence in this multiple output unit case Eq. 10 generalizes to: \n\nH = (1/P) sum_{k=1}^{P} sum_{l=1}^{n_o} X_l^[k] . X_l^[k]T   (14) \n\nEquations 12 and 14 show that H is the sample covariance matrix associated with the gradient variable X. Equation 12 also shows that for the single output case we can calculate the full Hessian by sequentially adding in successive \"component\" Hessians as: \n\nH_{m+1} = H_m + (1/P) X^[m+1] . X^[m+1]T,   with H_0 = alpha I and H_P = H   (15) \n\nBut Optimal Brain Surgeon requires the inverse of H (Eq. 5). This inverse can be calculated using a standard matrix inversion formula [Kailath, 1980]: \n\n(A + B . C . D)^-1 = A^-1 - A^-1 . B . (C^-1 + D . A^-1 . B)^-1 . D . A^-1   (16) \n\napplied to each term in the sequence of Eq. 15: \n\nH^-1_{m+1} = H^-1_m - (H^-1_m . X^[m+1] . X^[m+1]T . H^-1_m) / (P + X^[m+1]T . H^-1_m . X^[m+1]),   with H^-1_0 = alpha^-1 I and H^-1_P = H^-1   (17) \n\nwhere alpha (10^-8 <= alpha <= 10^-4) is a small constant needed to make H^-1_0 meaningful, and to which our method is insensitive [Hassibi, Stork and Wolff, 1993b]. Actually, Eq. 17 leads to the calculation of the inverse of (H + alpha I), and this corresponds to the introduction of a penalty term alpha ||delta w||^2 in Eq. 4. This has the benefit of penalizing large candidate jumps in weight space, and thus helping to insure that the neglecting of higher order terms in Eq. 1 is valid. \nEquation 17 permits the calculation of H^-1 using a single sequential pass through the training data, 1 <= m <= P. It is also straightforward to generalize Eq. 17 to the multiple output case of Eq. 14: in this case Eq. 15 will have recursions on both the indices m and l, giving: \n\nH_{m,l+1} = H_{m,l} + (1/P) X_{l+1}^[m] . X_{l+1}^[m]T   and   H_{m+1,1} = H_{m,n_o} + (1/P) X_1^[m+1] . X_1^[m+1]T   (18) \n\nTo sequentially calculate H^-1 for the multiple output case, we use Eq. 16, as before. \n\n4 The (t - o) -> 0 approximation \n\nThe approximation used for Eq. 10 can be justified on computational and functional grounds, even late in pruning when the training error is not negligible. From the computational view, we note first that normally H is degenerate - especially before significant pruning has been done - and its inverse not well defined. The approximation guarantees that there are no singularities in the calculation of H^-1. It also keeps the computational complexity of calculating H^-1 the same as that for calculating H - O(P n^2). 
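The recursion of Eq. 17 is a sequence of rank-one (Sherman-Morrison) updates, one per training pattern. A minimal Python/NumPy sketch (the function name and test values are ours, not the paper's):

```python
import numpy as np

def inverse_hessian(X_list, P, alpha=1e-6):
    """Sketch of the recursion of Eq. 17 (single-output case).

    X_list : the P gradient vectors X^[k] = dF(w, in^[k])/dw  (Eq. 11)
    alpha  : small constant; starting from H_0^-1 = (1/alpha) I means we
             in fact invert H + alpha*I, as noted in the text.
    """
    n = X_list[0].shape[0]
    H_inv = np.eye(n) / alpha
    for X in X_list:
        Hx = H_inv @ X  # H_m^-1 . X^[m+1]  (H_inv stays symmetric)
        # Rank-one update for H_{m+1} = H_m + (1/P) X X^T:
        H_inv = H_inv - np.outer(Hx, Hx) / (P + X @ Hx)
    return H_inv
```

As a sanity check, the result can be compared against a direct inverse, e.g. `np.linalg.inv(alpha * np.eye(n) + sum(np.outer(X, X) for X in X_list) / P)`; each update costs O(n^2), so the full pass is O(P n^2), as stated above.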
In statistics the approximation is the basis of Fisher's method of scoring; its goal is to replace the true Hessian with its expected value and guarantee that H is positive definite (thereby avoiding stability problems that can plague Gauss-Newton methods) [Seber and Wild, 1989]. \nEqually important are the functional justifications of the approximation. Consider a high capacity network trained to small training error. We can consider the network structure as involving both signal and noise. As we prune, we hope to eliminate those weights that lead to \"overfitting,\" i.e., learning the noise. If our pruning method did not employ the (t - o) -> 0 approximation, every pruning step (Eqs. 9 and 5) would inject the noise back into the system, by penalizing for noise terms. A different way to think of the approximation is the following. After some pruning by OBS we have reached a new weight vector that is a local minimum of the error (cf. Fig. 1). Even if this error is not negligible, we want to stay as close to that value of the error as we can. Thus we imagine a new, effective teaching signal t*, that would keep the network near this new error minimum. It is then (t* - o) that we in effect set to zero when using Eq. 10 instead of Eq. 9. \n\n5 OBS and backpropagation \n\nUsing the standard terminology from backpropagation [Rumelhart, Hinton and Williams, 1986] and the single output network of Fig. 2, it is straightforward to show from Eq. 11 that the derivative vectors are: \n\nX^[k] = ( X_v^[k] ; X_u^[k] )   (19) \n\nwhere X_v^[k] refers to derivatives with respect to the hidden-to-output weights v_j: \n\n[X_v^[k]]^T = ( f'(net^[k]) o_1^[k], ..., f'(net^[k]) o_{n_j}^[k] )   (20) \n\nand X_u^[k] refers to derivatives with respect to the input-to-hidden weights u_ji: \n\n[X_u^[k]]^T = ( f'(net^[k]) f'(net_1^[k]) v_1^[k] in_1^[k], ..., f'(net^[k]) f'(net_1^[k]) v_1^[k] in_{n_i}^[k], ..., f'(net^[k]) f'(net_{n_j}^[k]) v_{n_j}^[k] in_1^[k], ..., f'(net^[k]) f'(net_{n_j}^[k]) v_{n_j}^[k] in_{n_i}^[k] )   (21) \n\nwhere lexicographical ordering has been used. 
The neuron nonlinearity is f(.). \n\nFigure 2: Backpropagation net with n_i inputs and n_j hidden units. The input-to-hidden weights are u_ji and the hidden-to-output weights are v_j. The derivative (\"data\") vectors are X_v and X_u (Eqs. 20 and 21). \n\n6 Simulation results \n\nWe applied OBS, Optimal Brain Damage, and a magnitude-based pruning method to the 2-2-1 network with bias unit of Fig. 3, trained on all patterns of the XOR problem. The network was first trained to a local minimum, which had zero error, and then the three methods were used to prune one weight. As shown, the methods deleted different weights. We then trained the original XOR network from different initial conditions, thereby leading to different local minima. Whereas there were some cases in which OBD or magnitude methods deleted the correct weight, only OBS deleted the correct weight in every case. Moreover, OBS changed the values of the remaining weights (Eq. 5) to achieve perfect performance without any retraining by the backpropagation algorithm. Figure 4 shows the Hessian of the trained but unpruned XOR network. \n\nFigure 3: A nine-weight XOR network trained to a local minimum. The thickness of the lines indicates the weight magnitudes, and inhibitory weights are shown dashed. Subsequent pruning using a magnitude-based method (Mag) would delete weight v3; using Optimal Brain Damage (OBD) would delete u22. Even with retraining, the network pruned by those methods cannot learn the XOR problem. In contrast, Optimal Brain Surgeon (OBS) deletes u23 and furthermore changes all other weights (cf. Eq. 5) to achieve zero error on the XOR problem. \n\nFigure 4: The Hessian of the trained but unpruned XOR network, calculated by means of Eq. 12. 
White represents large values and black small magnitudes. The rows and columns are labeled by the weights shown in Fig. 3 (v1, v2, v3, u11, u12, u13, u21, u22, u23). As is to be expected, the hidden-to-output weights have significant Hessian components. Note especially that the Hessian is far from being diagonal. The Hessians for all problems we have investigated, including the MONK's problems (below), are far from being diagonal. \n\nFigure 5 shows two-dimensional \"slices\" of the nine-dimensional error surface in the neighborhood of a local minimum at w* for the XOR network. The cuts compare the weight elimination of magnitude methods (left) and OBD (right) with the elimination and weight adjustment given by OBS. \n\nFigure 5: (Left) The XOR error surface as a function of weights v3 and u23 (cf. Fig. 4). A magnitude-based pruning method would delete weight v3 whereas OBS deletes u23. (Right) The XOR error surface as a function of weights u22 and u23. Optimal Brain Damage would delete u22 whereas OBS deletes u23. For this minimum, only deleting u23 will allow the pruned network to solve the XOR problem. \n\nAfter all network weights are updated by Eq. 5 the system is at zero error (not shown). It is especially noteworthy that in neither the case of pruning by magnitude methods nor that of Optimal Brain Damage will further retraining by gradient descent reduce the training error to zero. In short, magnitude methods and Optimal Brain Damage delete the wrong weights, and their mistake cannot be overcome by further network training. Only Optimal Brain Surgeon deletes the correct weight. \nWe also applied OBS to larger problems, three MONK's problems, and compared our results to those of Thrun et al. 
[1991], whose backpropagation network outperformed all other approaches (network and rule-based) on these benchmark problems in an extensive machine learning competition. \n\nTable 1: The accuracy and number of weights found by backpropagation with weight decay (BPWD) by Thrun et al. [1991], and by OBS, on three MONK's problems. \n\nProblem | Method | Training accuracy | Testing accuracy | # weights \nMONK 1  | BPWD   | 100               | 100              | 58 \nMONK 1  | OBS    | 100               | 100              | 14 \nMONK 2  | BPWD   | 100               | 100              | 39 \nMONK 2  | OBS    | 100               | 100              | 15 \nMONK 3  | BPWD   | 93.4              | 97.2             | 39 \nMONK 3  | OBS    | 93.4              | 97.2             | 4 \n\nTable 1 shows that for the same performance, OBS (without retraining) required only 24%, 38% and 10% of the weights of the backpropagation network, which was already regularized with weight decay (Fig. 6). The error increase L_q (Eq. 5) accompanying pruning by OBS negligibly affected accuracy. \n\nFigure 6: Optimal networks found by Thrun using backpropagation with weight decay (left) and by OBS (right) on MONK 1, which is based on logical rules. Solid (dashed) lines denote excitatory (inhibitory) connections; bias units are at left. \n\nThe dramatic reduction in weights achieved by OBS yields a network that is simple enough that the logical rules that generated the data can be recovered from the pruned network, for instance by the methods of Towell and Shavlik [1992]. Hence OBS may help to address a criticism often levied at neural networks: the fact that they may be unintelligible. \nWe applied OBS to a three-layer NETtalk network. While Sejnowski and Rosenberg [1987] used 18,000 weights, we began with just 5546 weights, which after backpropagation training had a test error of 5259. 
\nAfter pruning this net with OBS to 2438 weights, and then retraining and pruning again, we achieved a net with only 1560 weights and a test error of only 4701 - a significant improvement over the original, more complex network [Hassibi, Stork and Wolff, 1993a]. Thus OBS can be applied to real-world pattern recognition problems such as speech recognition and optical character recognition, which typically have several thousand parameters. \n\n7 Analysis and conclusions \n\nWhy is Optimal Brain Surgeon so successful at reducing excess degrees of freedom? Conversely, given this new standard in weight elimination, we can ask: Why are magnitude-based methods so poor? Consider again Fig. 1. Starting from the local minimum at w*, a magnitude-based method deletes the wrong weight, weight 2, and through retraining, weight 1 will increase. The final \"solution\" is weight 1 -> large, weight 2 = 0. This is precisely the opposite of the solution found by OBS: weight 1 = 0, weight 2 -> large. Although the actual difference in error shown in Fig. 1 may be small, in large networks, differences from many incorrect weight elimination decisions can add up to a significant increase in error. But most importantly, it is simply wishful thinking to believe that after the elimination of many incorrect weights by magnitude methods the net can \"sort it all out\" through further training and reach a global optimum, especially if the network has already been pruned significantly (cf. XOR discussion, above). \nWe have also seen how the approximation employed by Optimal Brain Damage - that the diagonals of the Hessian are dominant - does not hold for the problems we have investigated. There are typically many off-diagonal terms that are comparable to their diagonal counterparts. This explains why OBD often deletes the wrong weight, while OBS deletes the correct one. 
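The disagreement among the three criteria can be seen even on a toy quadratic error surface. In this Python/NumPy illustration (the Hessian and weight values are invented numbers, chosen only to mimic the geometry of Fig. 1), the magnitude criterion picks the smaller weight while the OBD and OBS saliencies both pick the other:

```python
import numpy as np

# Hypothetical local minimum: weight 2 is smaller in magnitude, but the
# strongly non-diagonal, anisotropic Hessian makes it the costlier deletion.
H = np.array([[1.00, 0.95],
              [0.95, 2.00]])
w = np.array([1.0, -0.9])
H_inv = np.linalg.inv(H)

mag = np.abs(w)                      # magnitude criterion: |w_q|
obd = w**2 * np.diag(H) / 2.0        # OBD saliency (diagonal-H assumption)
obs = w**2 / (2.0 * np.diag(H_inv))  # OBS saliency (Eq. 5)

print("magnitude deletes weight", np.argmin(mag) + 1)  # weight 2
print("OBD deletes weight", np.argmin(obd) + 1)        # weight 1
print("OBS deletes weight", np.argmin(obs) + 1)        # weight 1
```

This mirrors the situation of Fig. 1: a magnitude-based method removes weight 2, while OBD and OBS remove weight 1 (and OBS additionally adjusts the surviving weight via Eq. 5).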
\nWe note too that our method is quite general, and subsumes previous methods for weight elimination. In our terminology, magnitude-based methods assume an isotropic Hessian (H proportional to I); OBD assumes diagonal H; FARM [Kung and Hu, 1991] assumes linear f(net) and only updates the hidden-to-output weights. We have shown that none of those assumptions is valid or sufficient for optimal weight elimination. \nWe should also point out that our method is even more general than presented here [Hassibi, Stork and Wolff, 1993b]. For instance, rather than pruning a weight (parameter) by setting it to zero, one can instead reduce a degree of freedom by projecting onto an arbitrary plane, e.g., w_q = a constant, though such networks typically have a large description length [Rissanen, 1978]. The pruning constraint w_q = 0 discussed throughout this paper makes retraining (if desired) particularly simple. Several weights can be deleted simultaneously; bias weights can be exempt from pruning, and so forth. A slight generalization of OBS employs cross-entropy or the Kullback-Leibler error measure, leading to the Fisher information matrix rather than the Hessian [Hassibi, Stork and Wolff, 1993b]. We note too that OBS does not by itself give a criterion for when to stop pruning, and thus OBS can be utilized with a wide variety of such criteria. Moreover, gradual methods such as weight decay during learning can be used in conjunction with OBS. \n\nAcknowledgements \n\nThe first author was supported in part by grants AFOSR 91-0060 and DAAL03-91-C-0010 to T. Kailath, who in turn provided constant encouragement. Deep thanks go to Greg Wolff (Ricoh) for assistance with simulations and analysis, and to Jerome Friedman (Stanford) for pointers to relevant statistics literature. \n\nREFERENCES \n\nHassibi, B., Stork, D. G. and Wolff, G. (1993a). Optimal Brain Surgeon and general network pruning (submitted to ICNN, San Francisco). \n\nHassibi, B., Stork, D. G. and Wolff, G. (1993b). 
Optimal Brain Surgeon, information theory and network capacity control (in preparation). \n\nHertz, J., Krogh, A. and Palmer, R. G. (1991). Introduction to the Theory of Neural Computation. Addison-Wesley. \n\nKailath, T. (1980). Linear Systems. Prentice-Hall. \n\nKung, S. Y. and Hu, Y. H. (1991). A Frobenius approximation reduction method (FARM) for determining the optimal number of hidden units, Proceedings of the IJCNN-91, Seattle, Washington. \n\nLe Cun, Y., Denker, J. S. and Solla, S. A. (1990). Optimal Brain Damage, in Advances in Neural Information Processing Systems 2, D. S. Touretzky (ed.), 598-605, Morgan Kaufmann. \n\nRissanen, J. (1978). Modeling by shortest data description, Automatica 14, 465-471. \n\nRumelhart, D. E., Hinton, G. E. and Williams, R. J. (1986). Learning internal representations by error propagation, Chapter 8 (318-362) in Parallel Distributed Processing, Vol. 1, D. E. Rumelhart and J. L. McClelland (eds.), MIT Press. \n\nSeber, G. A. F. and Wild, C. J. (1989). Nonlinear Regression, 35-36, Wiley. \n\nSejnowski, T. J. and Rosenberg, C. R. (1987). Parallel networks that learn to pronounce English text, Complex Systems 1, 145-168. \n\nThrun, S. B. and 23 co-authors (1991). The MONK's Problems - A performance comparison of different learning algorithms, CMU-CS-91-197, Carnegie Mellon University Department of Computer Science Tech Report. \n\nTowell, G. and Shavlik, J. W. (1992). Interpretation of artificial neural networks: Mapping knowledge-based neural networks into rules, in Advances in Neural Information Processing Systems 4, J. E. Moody, D. S. Touretzky and R. P. Lippmann (eds.), 977-984, Morgan Kaufmann. \n\f", "award": [], "sourceid": 647, "authors": [{"given_name": "Babak", "family_name": "Hassibi", "institution": null}, {"given_name": "David", "family_name": "Stork", "institution": null}]}