{"title": "Fast Network Pruning and Feature Extraction by using the Unit-OBS Algorithm", "book": "Advances in Neural Information Processing Systems", "page_first": 655, "page_last": 661, "abstract": null, "full_text": "Fast Network Pruning and Feature \n\nExtraction Using the Unit-OBS Algorithm \n\nAchim Stahlberger and Martin Riedmiller \n\nInstitut fur Logik , Komplexitiit und Deduktionssysteme \n\nUniversitiit Karlsruhe, 76128 Karlsruhe, Germany \n\nemail: stahlb@ira.uka.de. riedml@ira.uka.de \n\nAbstract \n\nThe algorithm described in this article is based on the OBS algo(cid:173)\nrithm by Hassibi, Stork and Wolff ([1] and [2]). The main disad(cid:173)\nvantage of OBS is its high complexity. OBS needs to calculate the \ninverse Hessian to delete only one weight (thus needing much time \nto prune a big net) . A better algorithm should use this matrix to \nremove more than only one weight , because calculating the inverse \nHessian takes the most time in the OBS algorithm. \nThe algorithm, called Unit- OBS, described in this article is a \nmethod to overcome this disadvantage. This algorithm only needs \nto calculate the inverse Hessian once to remove one whole unit thus \ndrastically reducing the time to prune big nets. \nA further advantage of Unit- OBS is that it can be used to do a \nfeature extraction on the input data. This can be helpful on the \nunderstanding of unknown problems. \n\n1 \n\nIntroduction \n\nThis article is based on the technical report [3] about speeding up the OBS algo(cid:173)\nrithm. The main target of this work was to reduce the high complexity O(n 2p) of \nthe OBS algorithm in order to use it for big nets in a reasonable time. Two \"ex(cid:173)\nact\" algorithms were developed which lead to exactly the same results as OBS but \nusing less time. The first with time O( n1. 8p) makes use of Strassens' fast matrix \nmultiplication algorithm. 
The second algorithm uses algebraic transformations to speed up the calculation and needs time O(n p^2); it is faster than OBS in the special case p < n. \n\nTo get a much higher speedup than these exact algorithms can provide, an improved OBS algorithm was developed which drastically reduces the runtime needed to prune a big network. The basic idea is to use the inverse Hessian to remove a group of weights instead of only one, because the calculation of this matrix takes most of the time in the OBS algorithm. This idea leads to an algorithm called Unit-OBS that is able to remove whole units. \n\nUnit-OBS has two main advantages: First, it is a fast algorithm to prune big nets, because whole units are removed in every step instead of slowly pruning weight by weight. Second, it can be used to do feature extraction on the input data by removing unimportant input units. This is helpful for the understanding of unknown problems. \n\n2 Optimal Brain Surgeon \n\nThis section gives a short summary of the OBS algorithm described by Hassibi, Stork and Wolff in [1] and [2]. As they showed, the increase in error when changing the weights by ∆w is \n\n∆E = (1/2) ∆w^T H ∆w   (1) \n\nwhere H is the Hessian matrix. The goal is to eliminate weight w_q while minimizing the increase in error given by Eq. 1. Eliminating w_q can be expressed by w_q + ∆w_q = 0, which is equivalent to (w + ∆w)^T e_q = 0, where e_q is the unit vector corresponding to weight w_q (w^T e_q = w_q). Solving this extremum problem with side condition using Lagrange's method leads to the solution \n\n∆E = w_q^2 / (2 [H^-1]_qq)   (2) \n\n∆w = -(w_q / [H^-1]_qq) H^-1 e_q   (3) \n\n[H^-1]_qq denotes the element (q, q) of the matrix H^-1. For every weight w_q the minimal increase in error ∆E(w_q) is calculated; the weight which leads to the overall minimum is removed and all other weights are adapted according to Eq. 3. 
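As a concrete illustration, one OBS pruning step (Eqs. 2 and 3) can be sketched in a few lines of numpy. This is a minimal sketch, not the authors' implementation: it assumes the inverse Hessian H_inv has already been computed, and the function name and toy values below are chosen for illustration only.

```python
import numpy as np

def obs_prune_step(w, H_inv):
    """One OBS step: pick the weight whose removal minimally
    increases the error (Eq. 2) and compute the update (Eq. 3).
    Assumes the inverse Hessian H_inv is given."""
    diag = np.diag(H_inv)
    saliency = w ** 2 / (2.0 * diag)          # Eq. 2 for every q
    q = int(np.argmin(saliency))              # weight with minimal dE
    dw = -(w[q] / H_inv[q, q]) * H_inv[:, q]  # Eq. 3
    return q, saliency[q], w + dw

# toy example: 3 weights, identity inverse Hessian (illustrative only)
w = np.array([0.5, 2.0, -0.1])
H_inv = np.eye(3)
q, dE, w_new = obs_prune_step(w, H_inv)
```

Note that the update in Eq. 3 drives the selected weight exactly to zero while adjusting all remaining weights to compensate.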
Hassibi, Stork and Wolff also showed how to calculate H^-1 in time O(n^2 p), where n is the number of weights and p the number of patterns. \n\nThe main disadvantage of the OBS algorithm is that it needs time O(n^2 p) to remove only one weight and therefore much time to prune big nets. The basic idea to soften this disadvantage is to use H^-1 to remove more than one weight. This generalized OBS algorithm is described in the next section. \n\n3 Generalized OBS (G-OBS) \n\nThis section shows a generalized OBS algorithm (G-OBS) which can be used to delete m weights in one step with minimal increase in error. As in the OBS algorithm, the increase in error is given by ∆E = (1/2) ∆w^T H ∆w. But the condition w_q + ∆w_q = 0 is replaced by the generalized condition \n\nM^T (w + ∆w) = 0   (4) \n\nwhere M = (e_q1 e_q2 ... e_qm) is the selection matrix (selecting the weights to be removed) and q1, q2, ..., qm are the indices of the weights that will be removed. Solving this extremum problem with side condition using Lagrange's method leads to the solution \n\n∆E = (1/2) w^T M (M^T H^-1 M)^-1 M^T w   (5) \n\n∆w = -H^-1 M (M^T H^-1 M)^-1 M^T w   (6) \n\nChoosing M = e_q, Eqs. 5 and 6 reduce to Eqs. 2 and 3. This shows that OBS is (as expected) a special case of G-OBS. The problem of calculating H^-1 was already solved by Hassibi, Stork and Wolff ([1] and [2]). \n\n4 Analysis of G-OBS \n\nHassibi, Stork and Wolff ([1] and [2]) showed that the time to calculate H^-1 is in O(n^2 p). The calculation of ∆E according to Eq. 5 needs time O(m^3)*, where m is the number of weights to be removed. The calculation of ∆w (Eq. 6) needs time O(nm + m^3). \nThe problem with this solution is that it is not known in advance which weights should be deleted, and thus ∆E has to be calculated for all possible combinations to find the global minimum in error increase. 
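For a fixed choice of indices q1, ..., qm, the G-OBS update of Eqs. 5 and 6 can be sketched as follows. Because the columns of M are unit vectors, M^T H^-1 M is simply the sub-matrix of H^-1 restricted to the selected rows and columns. This is a sketch under the assumption that H_inv is given; the function name and toy values are illustrative, not from the paper.

```python
import numpy as np

def g_obs_delete(w, H_inv, idx):
    """Delete the weights listed in idx in one step (Eqs. 5 and 6).
    M^T H^-1 M is just H_inv restricted to rows/cols idx, since the
    columns of M are unit vectors."""
    idx = np.asarray(idx)
    S = H_inv[np.ix_(idx, idx)]           # M^T H^-1 M  (m x m)
    S_inv_w = np.linalg.solve(S, w[idx])  # (M^T H^-1 M)^-1 M^T w
    dE = 0.5 * w[idx] @ S_inv_w           # Eq. 5
    dw = -H_inv[:, idx] @ S_inv_w         # Eq. 6
    return dE, w + dw

# toy example: delete weights 2 and 3 of a 4-weight net
w = np.array([1.0, -2.0, 0.5, 0.25])
H_inv = np.eye(4)
dE, w_new = g_obs_delete(w, H_inv, [2, 3])
```

The update zeroes exactly the selected weights (dw restricted to idx equals -w[idx]) while adapting the remaining ones, which is the multi-weight analogue of the OBS step.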
Choosing m weights out of n can be done in (n choose m) possible combinations, so finding the minimum takes time (n choose m) O(m^3). Therefore the total runtime of the generalized OBS algorithm to remove m weights (with minimal increase in error) is \n\nT_G-OBS = O(n^2 p + (n choose m) m^3) \n\nThe problem is that for m > 3 the term (n choose m) m^3 dominates, so that T_G-OBS is in O(n^4) or worse. In other words, G-OBS can be used to remove at most three weights in one step, which is little advantage over OBS. \n\nTo overcome this problem, the set of possible combinations has to be restricted to a small subset of combinations that seem to be \"good\" combinations. This reduces the term (n choose m) m^3 to a reasonable amount. One way to do this is to let a combination consist of all outgoing connections of one unit, which reduces the number of combinations to the number of units. The basic idea behind this subset is: if all outgoing connections of a unit can be removed, then the whole unit can be deleted, because it cannot influence the net output anymore. Choosing this subset leads to an algorithm called Unit-OBS that is able to remove whole units without the need to recalculate H^-1. \n\n* M is a matrix of special type (its columns are unit vectors) and thus the calculation of (M^T H^-1 M) needs only O(m^2) operations. \n\n5 Special Case of G-OBS: Unit-OBS \n\nWith the results of the last sections we can now describe an algorithm called Unit-OBS to remove whole units. \n\n1. Train a network to minimum error. \n\n2. Compute H^-1. \n\n3. For each unit u: \n\n(a) Compute the indices q1, q2, ..., qm(u) of the outgoing connections of unit u, where m(u) is the number of outgoing connections of unit u. \n\n(b) M := (e_q1 e_q2 ... e_qm(u)) \n\n(c) ∆E(u) := (1/2) w^T M (M^T H^-1 M)^-1 M^T w \n\n4. Find the unit u0 that gives the smallest increase in error ∆E(u0). \n\n5. M := M(u0) (refer to steps 3.(a) and 3.(b)) \n\n6. ∆w := -H^-1 M (M^T H^-1 M)^-1 M^T w \n\n7. 
Remove unit u0 and use ∆w to update all weights. \n\n8. Repeat steps 2 to 7 until a break criterion is reached. \n\nFollowing the analysis of G-OBS, the time to remove one unit is \n\nT_Unit-OBS = O(n^2 p + u m^3)   (7) \n\nwhere u is the number of units in the network and m is the maximum number of outgoing connections. If m is much smaller than n we can neglect the term u m^3, and the main cost is the calculation of H^-1. Therefore, if m is small, Unit-OBS needs the same time to remove a whole unit as OBS needs to remove a single weight. The speedup when removing units with an average of s outgoing connections should then be approximately s. \n\n6 Simulation results \n\n6.1 The Monk-1 benchmark \n\nUnit-OBS was applied to the MONK's problems because the underlying logical rules are well known and it is easy to say which input units are important to the problem and which input units can be removed. The simulations showed that in no case did Unit-OBS remove a wrong unit and that it is able to remove all unimportant input units. \n\nFigure 1 shows a MONK-1 net pruned with Unit-OBS. This net is the minimal network that can be found by Unit-OBS. Table 1 shows the speedup of Unit-OBS compared to OBS to find an equal-size network for the MONK-1 problem. \n\nThe network shown in Fig. 1 is only minimal in the number of units, but not minimal with respect to the number of weights. Hassibi, Stork and Wolff ([1] and [2]) found a network with only 14 weights by applying OBS (Fig. 3). In the framework of Unit-OBS, OBS can be used to do further pruning on the network after all possible units have been pruned. The advantage lies in the fact that the time-consuming OBS algorithm is then applied to a much smaller network (22 weights instead of 58). The result of this combination of Unit-OBS and OBS is a network with only 14 weights (Fig. 2) which also has 100% accuracy, like the minimal net found by OBS (see Table 1). 
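The unit-selection loop of the Unit-OBS algorithm above (steps 3 to 6) can be sketched as follows. The dictionary mapping each unit to the indices of its outgoing weights is an assumed representation, H_inv is taken as given, and the toy network is illustrative only.

```python
import numpy as np

def unit_obs_step(w, H_inv, outgoing):
    """One Unit-OBS step: evaluate dE(u) for every unit u (step 3)
    and remove the unit with the smallest increase in error
    (steps 4-6). `outgoing` maps unit name -> indices of its
    outgoing weights."""
    best_u, best_dE, best_dw = None, np.inf, None
    for u, idx in outgoing.items():
        idx = np.asarray(idx)
        S = H_inv[np.ix_(idx, idx)]           # M^T H^-1 M
        S_inv_w = np.linalg.solve(S, w[idx])
        dE = 0.5 * w[idx] @ S_inv_w           # step 3(c)
        if dE < best_dE:
            dw = -H_inv[:, idx] @ S_inv_w     # step 6
            best_u, best_dE, best_dw = u, dE, dw
    return best_u, best_dE, w + best_dw       # step 7: updated weights

# toy example: two units with two outgoing weights each
w = np.array([1.0, 1.0, 0.1, 0.2])
H_inv = np.eye(4)
outgoing = {"A": [0, 1], "B": [2, 3]}
u, dE, w_new = unit_obs_step(w, H_inv, outgoing)
```

Step 8 of the algorithm would then recompute H^-1 on the reduced network and repeat this selection until the break criterion is reached.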
\n\nFigure 1: MONK-1 net pruned with Unit-OBS, 22 weights (inputs grouped by Attributes 1-6). All unimportant units are removed, and this net needs fewer units than the minimal network found by OBS. \n\nFigure 2: Minimal network (14 weights) for the MONK-1 problem found by the combination of Unit-OBS with OBS (inputs grouped by Attributes 1-6). The logical rule for the MONK-1 problem is more evident in this network than in the minimal network found by OBS (comp. Fig. 3). \n\nFigure 3: Minimal network (14 weights) for the MONK-1 problem found by OBS (see [1] and [2]). \n\nalgorithm        # weights  topology  speedup†  perf. train  perf. test \nno pruning       58         17-3-1    -         100%         100% \nOBS              14         6-3-1     1.0       100%         100% \nUnit-OBS         22         5-3-1     2.8       100%         100% \nUnit-OBS + OBS   14         5-3-1     2.6       100%         100% \n\nTable 1: The Monk-1 problem \n\nFor the initial Monk-1 network the maximum number of outgoing connections (m in Eq. 7) is 3, which is much smaller than the number of weights. The average number of outgoing connections of the removed units is 3, and therefore we expect a speedup by a factor of 3 (compare Table 1). \n\nBy comparing the two minimal nets found by Unit-OBS/OBS (Fig. 2) and OBS (Fig. 3) it can be seen that the underlying logical rule (out = 1 <=> Attribute_1 = Attribute_2 or Attribute_5 = 1) is more evident in the network found by Unit-OBS/OBS. The other advantage of Unit-OBS is that it needs only 38% of the time OBS needs to find this minimal network. 
This advantage makes it possible to apply Unit-OBS to big nets for which OBS is not useful because of its long computation time. \n\n6.2 The Thyroid Benchmark \n\nThe following describes the application of pruning to a medical classification problem. The task is to classify measured data values of patients into three categories. The output of the three-layered feedforward network therefore consists of three neurons indicating the corresponding class. The input consists of 21 continuous and binary signals. \n\nThe task was first described in [4]. The results obtained there are shown in the first row of Table 2. The initially used network has 21 input neurons, 10 hidden and 3 output neurons, which are fully connected using shortcut connections. \n\nWhen applying OBS to prune the network weights, more than 90% of the weights can be pruned. However, over 8 hours of cpu-time on a Sparc workstation are used to do so (row 2 in Table 2). The solution finally found by OBS uses only 8 of the originally 21 input features. The pruned network shows a slightly improved classification rate on the test set. \n\nUnit-OBS finds a solution with 41 weights in only 76 minutes of cpu-time. In comparison to the original OBS algorithm, Unit-OBS is about 8 times as fast when deleting the same number of weights. Another important fact can be seen from the result: the Unit-OBS network considers only 7 of the originally 21 inputs, 1 less than the weight-focused OBS algorithm. The number of hidden units is reduced to 2 units, 5 units less than the OBS network uses. \n\nWhen further looking for an absolute minimum in the number of used weights, the Unit-OBS network can be additionally pruned using OBS. This finally leads to an optimized network with only 24 weights. \n\n† Compared to OBS deleting the same number of weights. \n\nThe classification performance of this very 
small network is 98.5%, which is even slightly better than obtained by the much bigger initial net. \n\nalgorithm        # weights  topology  speedup†  cpu-time  perf. test \nno pruning       316        21-10-3   -         -         98.4% \nOBS              28         8-7-3     1.0       511 min.  98.5% \nUnit-OBS         41         7-2-3     7.8       76 min.   98.4% \nUnit-OBS + OBS   24         7-2-3     -         137 min.  98.5% \n\nTable 2: The thyroid benchmark \n\n7 Conclusion \n\nThe article describes an improvement of the OBS algorithm introduced in [1], called Generalized OBS (G-OBS). The underlying idea is to exploit second-order information to delete multiple weights at once. The aim to reduce the number of different weight groups leads to the formulation of the Unit-OBS algorithm, which considers the outgoing weights of one unit as a group of candidate weights: when all the weights of a unit can be deleted, the unit itself can be pruned. The new Unit-OBS algorithm has two major advantages: First, it considerably accelerates pruning by a speedup factor which lies in the range of the average number of outgoing weights of each unit. Second, deleting complete units is especially interesting to determine the input features which really contribute to the computation of the output. This information can be used to get more insight into the underlying problem structure, e.g. to facilitate the process of rule extraction. \n\nReferences \n\n[1] B. Hassibi, D. G. Stork: Second Order Derivatives for Network Pruning: Optimal Brain Surgeon. Advances in Neural Information Processing Systems 5, Morgan Kaufmann, 1993, pages 164-171. \n\n[2] B. Hassibi, D. G. Stork, G. J. Wolff: Optimal Brain Surgeon and General Network Pruning. IEEE International Conference on Neural Networks, 1993, Volume 1, pages 293-299. \n\n[3] A. Stahlberger: OBS - Verbesserungen und neue Ansätze (OBS: improvements and new approaches). 
Diplomarbeit (diploma thesis), Universität Karlsruhe, Institut für Logik, Komplexität und Deduktionssysteme, 1996. \n\n[4] W. Schiffmann, M. Joost, R. Werner: Optimization of the Backpropagation Algorithm for Training Multilayer Perceptrons. Technical Report, University of Koblenz, Institute of Physics, 1993. \n", "award": [], "sourceid": 1233, "authors": [{"given_name": "Achim", "family_name": "Stahlberger", "institution": null}, {"given_name": "Martin", "family_name": "Riedmiller", "institution": null}]}