{"title": "Discriminability-Based Transfer between Neural Networks", "book": "Advances in Neural Information Processing Systems", "page_first": 204, "page_last": 211, "abstract": null, "full_text": "Discriminability-Based Transfer between Neural Networks

L. Y. Pratt
Department of Mathematical and Computer Sciences
Colorado School of Mines
Golden, CO 80401
lpratt@mines.colorado.edu

Abstract

Previously, we have introduced the idea of neural network transfer, where learning on a target problem is sped up by using the weights obtained from a network trained for a related source task. Here, we present a new algorithm, called Discriminability-Based Transfer (DBT), which uses an information measure to estimate the utility of hyperplanes defined by source weights in the target network, and rescales transferred weight magnitudes accordingly. Several experiments demonstrate that target networks initialized via DBT learn significantly faster than networks initialized randomly.

1 INTRODUCTION

Neural networks are usually trained from scratch, relying only on the training data for guidance. However, as more and more networks are trained for various tasks, it becomes reasonable to seek out methods that avoid "reinventing the wheel", and instead are able to build on previously trained networks' results. For example, consider a speech recognition network that was only trained on American English speakers. However, for a new application, speakers might have a British accent. Since these tasks are sub-distributions of the same larger distribution (English speakers), they may be related in a way that can be exploited to speed up learning on the British network, compared to when weights are randomly initialized.
\n\nWe have previously introduced the question of how trained neural networks can be \n\n204 \n\n\fDiscriminability-Based Transfer between Neural Networks \n\n205 \n\n\"recycled' in this way [Pratt et al., 1991]; we've called this the transfer problem. \nThe idea of transfer has strong roots in psychology (as discussed in [Sharkey and \nSharkey, 1992]), and is a standard paradigm in neurobiology, where synapses almost \nalways come \"pre-wired\". \n\nThere are many ways to formulate the transfer problem. Retaining performance on \nthe source task mayor may not be important. When it is, the problem has been \ncalled sequential learning, and has been explored by several authors (cf. [McCloskey \nand Cohen, 1989]). Our paradigm assumes that source task performance is not \nimportant, though when thl~ source task training data is a subset of the target \ntraining data, our method may be viewed as addressing sequential learning as well. \nTransfer knowledge can also be inserted into several different entry points in a \nback-propagation network (see [Pratt, 1993al). We focus on changing a network's \ninitial weights; other studies change other aspects, such as the objective function \n(cf. [Thrun and Mitchell, 1993, Naik et al., 1992]). \n\nTransfer methods mayor may not use back-propagation for target task training. \nOur formulation does, because this allows it to degrade, in the worst case of no \nsource task relevance, to back-propagation training on the tar~et task with randomly \ninitialized weights. An alternative approach is described by lAgarwal et al., 1992]. \n\nSeveral studies have explored literal transfer in back-propagation networks, where \nthe final weights from training on a source task are used as the initial conditions for \ntarget training (cf. [Martin, 1988]). However, these studies have shown that often \nnetworks will demonstrate worse performance after literal transfer than if they had \nbeen randomly initialized. 
\n\nThis paper describes the Discriminability-Based Transfer (DBT) algorithm, which \novercomes problems with literal transfer. DBT achieves the same asymptotic ac(cid:173)\ncuracy as randomly initialized networks, and requires substantially fewer training \nupdates. It is also superior to literal transfer, and to just using the source network \non the target task. \n\n2 ANALYSIS OF LITERAL TRANSFER \n\nAs mentioned above, several studies have shown that networks initialized via literal \ntransfer give worse asymptotic perlormance than randomly initialized networks. To \nunderstand why. consider the situation when only a subset. of the source network \ninput-to-hidden (IH) layer hyperplanes are relevant to the target problem. as il(cid:173)\nlustrated in Figure 1. We've observed that some hyperplanes initialized by source \nnetwork training don't shift out of their initial positions, despite the fact that they \ndon't help to separate the target training data. The weights defining such hyper(cid:173)\nplanes often have high magnitudes [Dewan and Sontag, 1990]. Figure 2 (a) shows \na simulation of such a situation, where a hyperplane that has a high magnitude, as \nif it came from a source network, causes learning to be slowed down. 1 \n\nAnalysis of the back-propagation weight update equations reveals that high source \nweight magnitudes retard back-propagation learning on the target task because this \n\n1 Neural network visualization will be explored more thoroughly in an upcoming pape ... \nAn X-based animator is available from the author via anonymous ftp. Type \"archie ha\". \n\n\f206 \n\nPratt \n\n0.9 \n\n0.1 \n\nSource training data \n\nTarget training data \n\n....\u2022....\u2022.... .Q~ \u2022... ; .\u2022.....\u2022...\u2022...\u2022.\u2022....\u2022.......\u2022.\u2022...\u2022.\u2022\u2022.......\u2022.\u2022..\u2022........ Q ....\u2022\u2022\u2022\u2022...\u2022\u2022.\u2022..\u2022..\u2022\u2022.. \n\n: 0 ! \no f 1./ 0 \no i 1 ./ 0 \n! 1! 
\n;...j.~ _ Hyperplanes need to move \n\n~ Hyperplanes \nshould be \nf retained \n\n0 \n1 0 \n0 \n\n0 \n0 1 \n1 \n\n\u00b7\u00b7\u00b7\u00b7\u00b7\u00b7\u00b7\u00b7\u00b7\u00b7\u00b7\u00b7\u00b7if\u00b7\u00b7\u00b7f\u00b7\u00b7'O\u00b7\u00b7\u00b7\u00b7\u00b7\u00b7\u00b7\u00b7\u00b7\u00b7\u00b7\u00b7\u00b7\u00b7\u00b7\u00b7\u00b7\u00b7\u00b7\u00b7\u00b7\u00b7\u00b7\u00b7\u00b7\u00b7\u00b7\u00b7\u00b7\u00b7\u00b7\u00b7\u00b7 ......................... 0-\u2022...... 0 ............ . \n\n0 \n\n0.1 \n\n0.5 \n\nFeature 1 \n\n0.9 \n\nFigure 1: Problem Illustrating the need for DBT. The source and target tasks are \nidentical, except that the target task has been shifted along one axis, as represented \nby the training data shown. Because of this shift, two of the source hyperplanes are \nhelpful in separating class-O from class-l data in the target task, and two are not. \n\nequation is not scaled relative to weight magnitudes. Also, the weight update equa(cid:173)\ntion contains the factor y( 1 - y) (where y is a unit's activation). which is small \nfor large weights. Considering this analysis, it might at first appear that a simple \nsolution to the problem with literal transfer is to uniformly lower all weight magni(cid:173)\ntudes. However, we have also observed that hyperplanes in separating positions will \nmove unless they are given high weight magnitudes. To address both of these prob(cid:173)\nlems, we must rescale hyperplanes so that useful ones are defined by high-magnitude \nweights and less useful hyperplanes receive low magnitudes. To implement such a \nmethod, we need a metric for evaluating hyperplane utility. \n\n3 EVALUATING CLASSIFIER COMPONENTS \n\nWe borrow the 1M metric for evaluating hyperplanes from decision tree induction \n[Quinlan, 1983]. 
Given a set of training data and a hyperplane that crosses through it, the IM function returns a value between 0 and 1, indicating the amount that the hyperplane helps to separate the data into different classes.

The formula for IM, for a decision surface in a multi-class problem, is:

IM = (1/N) (Σ_i Σ_j x_ij log x_ij - Σ_i x_i. log x_i. - Σ_j x_.j log x_.j + N log N)

[Mingers, 1989]. Here, N is the number of patterns, i is either 0 or 1, depending on the side of a hyperplane on which a pattern falls, j indexes over all classes, x_ij is the count of class j patterns on side i of the hyperplane, x_i. is the count of all patterns on side i, and x_.j is the total number of patterns in class j.

4 THE DBT ALGORITHM

The DBT algorithm is shown in Figure 3. It inputs the target training data and weights from the source network, along with two parameters C and S (see below). DBT outputs a modified set of weights, for initializing training on the target task. Figure 2 (b) shows how the problem of Figure 2 (a) was repaired via DBT.

DBT modifies the weights defining each source hyperplane to be proportional to the

[Figure 2 appears here: panels (a) Literal and (b) DBT, showing the hyperplanes over Feature 1 and Feature 2 at successive training epochs; see caption below.]
Figure 2: Hyperplane movement speed in literal transfer, compared to DBT. Each image in this figure shows the hyperplanes implemented by IH weights at a different epoch of training. Hidden unit 1's hyperplane is a solid line; HU2's is a dotted line, and HU3's hyperplane is shown as a dashed line. In (a) note how HU1 seems fixed in place. Its high magnitude causes learning to be slow (taking about 3100 epochs to converge). In (b) note how DBT has given HU1 a small magnitude, allowing it to be flexible, so that the training data is separated by epoch 390. A randomly initialized network on this problem takes about 600 epochs.
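As a concrete reading of the IM formula of Section 3, here is a small sketch. The vectorized form and the choice of base-2 logarithms (so that a hyperplane perfectly separating two equally sized classes scores exactly 1) are my assumptions, not from [Mingers, 1989]:

```python
import numpy as np

def im_metric(side, labels):
    """IM of a hyperplane over a training set.
    side[k] in {0, 1}: the side of the hyperplane pattern k falls on.
    labels[k]: the class of pattern k.
    Returns (1/N) * (sum_ij x_ij log x_ij - sum_i x_i. log x_i.
                     - sum_j x_.j log x_.j + N log N)."""
    side, labels = np.asarray(side), np.asarray(labels)
    N = len(labels)

    def xlogx(x):
        # convention: 0 log 0 = 0
        return x * np.log2(x) if x > 0 else 0.0

    classes = np.unique(labels)
    total = sum(xlogx(np.sum((side == i) & (labels == j)))
                for i in (0, 1) for j in classes)       # x_ij terms
    total -= sum(xlogx(np.sum(side == i)) for i in (0, 1))      # x_i. terms
    total -= sum(xlogx(np.sum(labels == j)) for j in classes)   # x_.j terms
    total += xlogx(N)
    return total / N
```

A hyperplane that puts each class entirely on its own side scores 1; one that splits every class evenly between its two sides scores 0, matching the "amount the hyperplane helps to separate the data" reading above.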
\n\n\f208 \n\nPratt \n\nInput: \n\nOutput: \n\nMethod: \n\nSource network weights \nTarget training data \nParameters: C (cutoff (actor), S (scaleup (actor) \n\nInitial weigbts (or target network, assuming same topology as source network \n\nFor eacb source network bidden unit i \n\ncalculating [Mt ; (f [0,1]) \n\nCompare tbe byperplane defined by incoming weights to i to tbe target training data, \n\nRescale [Mfa values so that largest has value S. Put. result in s, . \nFor [Mt ; 's tbat are less tban C \n\nI( higbest magnitude ratio between weights defining hyperplane i is > 100.0, \nreset weights for that hyperplane randomly \n\nElse uniformly scale down byperplane to bave low-valued weigbts (maximum \n\nmagnitude of 0.5), but to be in the same position. \n\nFor eacb remaining IH hidden unit i \n\nFor eacb weight wj; defining hyperplane i in target network \n\net Wj. = source weJg t Wj; X Si \nL t \u00b7 h \n\u2022 \n\nSet hidden-to-output target network weights randomly in [-0.5,0.5] \n\nFigure 3: The Discriminability-Based Transfer (DBT) Algorithm. \n\n1M value, according to an input parameter, S. DBT is based on the idea that the \nbest initial magnitude Aft for a target hyperplane is M t = S X M. x I M t , where S \n(\"scaleup\") is a constant of proportionality, At. is the magnitude of a source network \nhyperplane, and I M t is the discriminability of the source hyperplane on the target \ntraining data. We assume that this simple relationship holds over some range of I M t \nvalues. A second parameter, C, determines a cut-off in this relationship - source \nhyperplanes with I M t < C receive very low magnitudes, so that the hyperplanes \nare effectively equivalent to those in a randomly initialized network. The use of \nthe C parameter was motivated by empirical experiments that indicated that the \nmultiplicative scaling via S was not adequate. 
\nTo determine S and C for a particular source and target task, we ran DBT several \ntimes for a small number of epochs with different Sand C values. We chose the S \nand C values that yielded the best average TSS (total sum of squared errors) after \na few epochs. We used local hill climbing in average TSS space to decide how to \nmove in S, C space. \n\nDBT randomizes the weights in the network's hidden-to-output (HO) layer. See \n[Sharkey and Sharkey, 1992) for an extension to this work showing that literal \ntransfer of HO weights might also be effective. \n\n5 EMPIRICAL RESULTS \n\nDBT was evaluated on seven tasks: female-to-male speaker transfer on a lO-vowel \nrecognition task (PB), a 3-class subset of the PB task (PB123). transfer from all \nfemales to a single male in the PB task (Onemale), transfer for a heart disease \ndiagnosis problem from Hungarian to Swiss patients (Heart-HS). transfer for the \nsame task from patients in California to Swiss patients (Heart-VAS). transfer from \na subset of DNA pattern recognition exanlples to a superset (DNA). and transfer \n\n\fDiscriminability-Based Transfer between Neural Networks \n\n209 \n\nfrom a subset of chess endgame problems to a superset (Chess). Note that the \nDNA and chess tasks effectively address the sequential learning problem; as long as \nthe source data is a subset of the target data, the target network can build on the \nprevious results. \n\nDBT was compared to randomly initialized networks on the target task. We \nmeasured generalization performance in both conditions by using 10-way cross(cid:173)\nvalidation on 10 different initial conditions for each t.arget task, resulting in 100 \ndifferent runs for each of the two conditions, and for each of the seven tasks. Our \nempirical methodology controlled carefully for initial conditions, hidden unit count, \nback-propagation paranleter:> '1 (learning rate) and Q\" (momentum), and DBT pa(cid:173)\nrameters S and C. 
\n\n5.1 SCENARIOS FOR EVALUATION \n\nThere are at least two different practical situations in which we may want to speed \nup learning. First, we may have a limited amount of computer time. all of which \nwill be used because we have no way of detecting when a network's performance has \nreached some criterion. In this case. if our speed-up method (i.e. DBT) is signifi(cid:173)\ncantly superior to a baseline for a large proportion of epochs during training. then \nthe probability that we'll have to stop during that period of significant superiority \nis high If we do stop at an epoch when our method is significantly better, then this \njustifies it over the baseline, because the resulting network has better petfonnance. \n\nA second situation is when we have some way of detecting when petformance is \n\"good enough\" for an application. In contrast to the above situation. here a DBT \nnetwork may be run for a shorter time than a baseline network, because it reaches \nthis criterion faster. In this case, the number of epochs of DBT significant superi(cid:173)\nority is less important than the speed with which it achieves the criterion. \n\n5.2 RESULTS \n\nTo evaluate networks according to the first scenario, we tested for statistical signif(cid:173)\nicance at the 99.0% level between the 100 DBT and the 100 randomly initialized \nnetworks at each training epoch. We found (1) that asymptotic DBT petformance \nscores were the same as for random networks and (2), that DBT was superior for \nmuch of the training period. Figure 4 (a) shows the number of weight updates for \nwhich a significant difference was found for the seven tasks. \n\nFor the second scenario, we also found (3) that DBT networks required many fewer \nepochs to reach a criterion performance score. For this test, we found the last \nsignificantly different epoch between the two methods. Then we measured the \nnumber of epochs required to reach 98%, 95o/c\" and 66%, of that level. 
The number of weight updates required for DBT and randomly initialized networks to reach the 98% criterion are shown in Figure 4 (b). Note that the y axis is logarithmic, so, for example, over 30 million weight updates were saved by using DBT instead of random initialization in the PB123 problem. Results for the 95% and 66% criteria also showed DBT to be at least as fast as random initialization for every task.

Using the same tests described for DBT above, we also tested literal networks on the seven transfer tasks. We found that, unlike DBT, literal networks reached significantly worse asymptotic performance scores than randomly initialized networks. Literal networks also learned slower for some tasks. These results justify the use of the more complicated DBT method over literal transfer.

[Figure 4 appears here: Summary of DBT Empirical Results. (a) Time for significant epoch difference, DBT vs. random; (b) time required to train to the 98% criterion; both plotted per task (PB, PB123, Onemale, Heart-HS, Heart-VAS, DNA, Chess).]

We also evaluated the source networks directly on the target tasks, without any back-propagation training on the target training data. Scores were significantly and substantially worse than random networks. This result indicates that the transfer scenarios we chose for evaluation were nontrivial.

6 CONCLUSION

We have described the DBT algorithm for transfer between neural networks.2 DBT demonstrated substantial and significant learning speed improvement over randomly initialized networks in 6 out of 7 tasks studied (and the same learning speed in the other task). DBT never displayed worse asymptotic performance than a randomly initialized network. We have also shown that DBT is superior to literal transfer, and to simply using the source network on the target task.
\n\nAcknowledgements \n\nThe author is indebted to John Smith. Gale MartinI and Anshu Agarwal for their \nvaluable comments 011 this paper, and to Jack Mostow and Haym Hirsh for their \ncontribution to this research program. \n\n2See [Pratt, 1993b] for more details. \n\n\fDiscriminability-Based Transfer between Neural Networks \n\n211 \n\nReferences \n\n[Agarwal et al., 19921 A. Agarwal, R. J. Mammone, and D. K. Naik. An on-line \n\ntraining algorithm to overcome catastrophic forgetting. In Intelligence Engineer(cid:173)\ning Systems through Artificial Neural Networks. volume 2, pages 239-244. The \nAmerican Society of Mechanical Engineers, AS~IE Press, 1992. \n\n[Dewan and Sontag, 1990) Hasanat M. Dewan and Eduardo Sontag. Using extrap(cid:173)\nolation to speed up the backpropagation algorithm. In Proceedings oj the In(cid:173)\nternational Joint Conjerence on Neural Networks, Washington, DC, volume 1, \npages 613-616. IEEE Pub:ications, Inc., January 1990. \n\n[Martin, 1988] Gale Martin. The effects of old learning on new in Hopfield and Back(cid:173)\n\npropagation nets. Technical Report ACA-HI-0l9. Microelectronics and Computer \nTechnology Corporation (MCC), 1988. \n\n[McCloskey and Cohen, 1989J Michael McCloskey and Neal J. Cohen. Catastrophic \nthe sequential learning problem. The \n\ninterference in connectionist networks: \npsychology oj learning and motivation, 24, 1989. \n\n[Mingers. 1989J John Mingers. An empirical comparison of selection measures for \n\ndecision- tree induction. Machine Learning, 3( 4):319-342, 1989. \n\n[Naik et al., 1992] D. K. Naik, R. J. Mammone. and A. Agarwal. Meta-neural \nnetwork approach to learning by learning. In Intelligence Engineering Systems \nthrough Artificial Neural Networks, volume 2. pages 245-252. The American So(cid:173)\nciety of Mechanical Engineers, AS ME Press. 1992. \n\n[Pratt et al., 19911 Lorien Y. Pratt, Jack Mostow. and Candace A. Kamm. Direct \ntransfer of learned information among neural networks. 
In Proceedings of the Ninth National Conference on Artificial Intelligence (AAAI-91), pages 584-589, Anaheim, CA, 1991.

[Pratt, 1993a] Lorien Y. Pratt. Experiments in the transfer of knowledge between neural networks. In S. Hanson, G. Drastal, and R. Rivest, editors, Computational Learning Theory and Natural Learning Systems: Constraints and Prospects, chapter 4.1. MIT Press, 1993. To appear.

[Pratt, 1993b] Lorien Y. Pratt. Non-literal transfer of information among inductive learners. In R. J. Mammone and Y. Y. Zeevi, editors, Neural Networks: Theory and Applications II. Academic Press, 1993. To appear.

[Quinlan, 1983] J. R. Quinlan. Learning efficient classification procedures and their application to chess end games. In Machine Learning, pages 463-482. Palo Alto, CA: Tioga Publishing Company, 1983.

[Sharkey and Sharkey, 1992] Noel E. Sharkey and Amanda J. C. Sharkey. Adaptive generalisation and the transfer of knowledge. Working paper, Center for Connection Science, University of Exeter, 1992.

[Thrun and Mitchell, 1993] Sebastian B. Thrun and Tom M. Mitchell. Integrating inductive neural network learning and explanation-based learning. In C. L. Giles, S. J. Hanson, and J. D. Cowan, editors, Advances in Neural Information Processing Systems 5. Morgan Kaufmann Publishers, San Mateo, CA, 1993.
", "award": [], "sourceid": 641, "authors": [{"given_name": "L. Y.", "family_name": "Pratt", "institution": null}]}