{"title": "Adaptive Soft Weight Tying using Gaussian Mixtures", "book": "Advances in Neural Information Processing Systems", "page_first": 993, "page_last": 1000, "abstract": null, "full_text": "Adaptive Soft Weight Tying using Gaussian Mixtures \n\nSteven J. Nowlan \n\nComputational Neuroscience Laboratory \n\nThe Salk Institute, P.O. Box 5800 \n\nSan Diego, CA 92186-5800 \n\nGeoffrey E. Hinton \n\nDepartment of Computer Science \n\nUniversity of Toronto \n\nToronto, Canada M5S 1A4 \n\nAbstract \n\nOne way of simplifying neural networks so they generalize better is to add an extra term to the error function that will penalize complexity. We propose a new penalty term in which the distribution of weight values is modelled as a mixture of multiple gaussians. Under this model, a set of weights is simple if the weights can be clustered into subsets so that the weights in each cluster have similar values. We allow the parameters of the mixture model to adapt at the same time as the network learns. Simulations demonstrate that this complexity term is more effective than previous complexity terms. \n\n1 Introduction \n\nA major problem in training artificial neural networks is to ensure that they will generalize well to cases that they have not been trained on. Some recent theoretical results (Baum and Haussler, 1989) have suggested that in order to guarantee good generalization the amount of information required to directly specify the output vectors of all the training cases must be considerably larger than the number of independent weights in the network. In many practical problems there is only a small amount of labelled data available for training and this creates problems for any approach that uses a large, homogeneous network with many independent weights. As a result, there has been much recent interest in techniques that can train large networks with relatively small amounts of labelled data and still provide good generalization performance. 
\n\nIn order to improve generalization, the number of free parameters in the network must be reduced. One of the oldest and simplest approaches to removing excess degrees of freedom from a network is to add an extra term to the error function that penalizes complexity: \n\n$$\\text{cost} = \\text{data-misfit} + \\lambda \\, \\text{complexity} \\qquad (1)$$ \n\nDuring learning, the network is trying to find a locally optimal trade-off between the data-misfit (the usual error term) and the complexity of the net. The relative importance of these two terms can be estimated by finding the value of $\\lambda$ that optimizes generalization to a validation set. Probably the simplest approximation to complexity is the sum of the squares of the weights, $\\sum_i w_i^2$. Differentiating this complexity measure leads to simple weight decay (Plaut, Nowlan and Hinton, 1986) in which each weight decays towards zero at a rate that is proportional to its magnitude. This decay is countered by the gradient of the error term, so weights which are not critical to network performance, and hence always have small error gradients, decay away leaving only the weights necessary to solve the problem. \n\nThe use of a $\\sum_i w_i^2$ penalty term can also be interpreted from a Bayesian perspective.\u00b9 The \"complexity\" of a set of weights, $\\lambda \\sum_i w_i^2$, may be described as its negative log probability density under a radially symmetric gaussian prior distribution on the weights. The distribution is centered at the origin and has variance $1/\\lambda$. For multilayer networks, it is hard to find a good theoretical justification for this prior, but Hinton (1987) justifies it empirically by showing that it greatly improves generalization on a very difficult task. More recently, MacKay (1991) has shown that even better generalization can be achieved by using different values of $\\lambda$ for the weights in different layers. 
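The decay behaviour described above can be illustrated with a minimal sketch. It assumes plain gradient descent with a hypothetical learning rate `lr`; the function name `weight_decay_step` and the numbers are illustrative only and are not taken from the paper. Because the derivative of lambda times the sum of squared weights with respect to one weight is 2 * lambda * w, a weight whose error gradient stays near zero shrinks geometrically toward zero.

```python
def weight_decay_step(weights, error_grads, lam, lr):
    '''One gradient step on: data-misfit + lam * sum(w ** 2).

    The penalty gradient for each weight is 2 * lam * w, so the pull
    toward zero is proportional to the magnitude of the weight.
    '''
    return [w - lr * (g + 2.0 * lam * w) for w, g in zip(weights, error_grads)]


# Hypothetical values: error gradients of zero mimic weights that are
# not critical to performance, so only the decay term acts on them.
weights = [1.0, -2.0, 0.05]
for _ in range(10):
    weights = weight_decay_step(weights, [0.0, 0.0, 0.0], lam=0.5, lr=0.1)
print(weights)  # every weight has shrunk by the same factor, 0.9 ** 10
```

In a real network the error gradients are nonzero for the weights that matter, which is what counters the decay for those weights.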
\n\n\u00b9 R. Szeliski, personal communication, 1985. \n\n2 A more complex measure of network complexity \n\nIf we wish to eliminate small weights without forcing large weights away from the values they need to model the data, we can use a prior which is a mixture of a narrow (n) and a broad (b) gaussian, both centered at zero: \n\n$$p(w) = \\pi_n \\frac{1}{\\sqrt{2\\pi}\\,\\sigma_n} e^{-w^2/(2\\sigma_n^2)} + \\pi_b \\frac{1}{\\sqrt{2\\pi}\\,\\sigma_b} e^{-w^2/(2\\sigma_b^2)} \\qquad (2)$$ \n\nwhere $\\pi_n$ and $\\pi_b$ are the mixing proportions of the two gaussians and are therefore constrained to sum to 1. \n\nAssuming that the weight values were generated from a gaussian mixture, the conditional probability that a particular weight, $w_i$, was generated by a particular gaussian, $j$, is called the responsibility of that gaussian for the weight and is: \n\n$$r_j(w_i) = \\frac{\\pi_j p_j(w_i)}{\\sum_k \\pi_k p_k(w_i)} \\qquad (3)$$ \n\nwhere $p_j(w_i)$ is the probability density of $w_i$ under gaussian $j$. \n\nWhen the mixing proportions of the two gaussians are comparable, the narrow gaussian gets most of the responsibility for a small weight. Adopting the Bayesian perspective, the cost of a weight under the narrow gaussian is proportional to $w^2/2\\sigma_n^2$. As long as $\\sigma_n$ is quite small there will be strong pressure to reduce the magnitude of small weights even further. Conversely, the broad gaussian takes most of the responsibility for large weight values, so there is much less pressure to reduce them. In the limiting case when the broad gaussian becomes a uniform distribution, there is almost no pressure to reduce very large weights because they are almost certainly generated by the uniform distribution. A complexity term very similar to this limiting case is used in the \"weight elimination\" technique of (Weigend, Huberman and Rumelhart, 1990) to improve generalization for a time series prediction task.
\u00b2 \n\n3 Adaptive Gaussian Mixtures and Soft Weight-Sharing \n\nA mixture of a narrow, zero-mean gaussian with a broad gaussian or a uniform allows us to favor networks with many near-zero weights, and this improves generalization on many tasks. But practical experience with hand-coded weight constraints has also shown that great improvements can be achieved by constraining particular subsets of the weights to share the same value (Lang, Waibel and Hinton, 1990; Le Cun, 1989). Mixtures of zero-mean gaussians and uniforms cannot implement this type of symmetry constraint. If however, we use multiple gaussians and allow their means and variances to adapt as the network learns, we can implement a \"soft\" version of weight-sharing in which the learning algorithm decides for itself which weights should be tied together. (We may also allow the mixing proportions to adapt so that we are not assuming all sets of tied weights are the same size.) \n\nThe basic idea is that a gaussian which takes responsibility for a subset of the weights will squeeze those weights together since it can then have a lower variance and assign a higher probability density to each weight. If the gaussians all start with high variance, the initial division of weights into subsets will be very soft. As the variances shrink and the network learns, the decisions about how to group the weights into subsets are influenced by the task the network is learning to perform. \n\nTo make these intuitive ideas a bit more concrete, we may define a cost function of the general form given in (1): \n\n$$C = \\frac{K}{\\sigma_y^2} \\sum_c \\frac{1}{2} (y_c - d_c)^2 - \\sum_i \\log \\Big[ \\sum_j \\pi_j p_j(w_i) \\Big] \\qquad (4)$$ \n\nwhere $y_c$ and $d_c$ are the actual and desired outputs for training case $c$, $\\sigma_y^2$ is the variance of the squared error and each $p_j(w_i)$ is a gaussian density with mean $\\mu_j$ and standard deviation $\\sigma_j$. 
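As a rough illustration of the complexity term, the following sketch evaluates the negative log of the mixture density summed over the weights, together with the responsibility of each gaussian for a weight and the resulting derivative for a single weight. It is written in plain Python for readability; the function names and the example mixture are assumptions made for this sketch, and the data-misfit term is omitted.

```python
import math


def gaussian_pdf(w, mu, sigma):
    '''Density of w under a gaussian with mean mu and std. deviation sigma.'''
    z = (w - mu) / sigma
    return math.exp(-0.5 * z * z) / (math.sqrt(2.0 * math.pi) * sigma)


def responsibilities(w, pis, mus, sigmas):
    '''Posterior probability that each gaussian generated the weight w.'''
    joint = [p * gaussian_pdf(w, m, s) for p, m, s in zip(pis, mus, sigmas)]
    total = sum(joint)
    return [j / total for j in joint]


def complexity_cost(weights, pis, mus, sigmas):
    '''Negative log-likelihood of the weights under the mixture.'''
    cost = 0.0
    for w in weights:
        density = sum(p * gaussian_pdf(w, m, s)
                      for p, m, s in zip(pis, mus, sigmas))
        cost -= math.log(density)
    return cost


def complexity_grad(w, pis, mus, sigmas):
    '''Derivative of the complexity cost with respect to one weight.

    Each gaussian pulls the weight toward its mean with a strength
    scaled by its responsibility and divided by its variance.
    '''
    r = responsibilities(w, pis, mus, sigmas)
    return sum(rj * (w - m) / (s * s) for rj, m, s in zip(r, mus, sigmas))


# Hypothetical two-component mixture: one cluster near 0, one near 1.
pis, mus, sigmas = [0.5, 0.5], [0.0, 1.0], [0.1, 0.1]
print(responsibilities(0.9, pis, mus, sigmas))  # second gaussian dominates
print(complexity_grad(0.9, pis, mus, sigmas))   # negative: descent moves w toward 1.0
```

A weight at 0.9 is claimed almost entirely by the gaussian centered at 1.0, so a gradient descent step on this term alone would move it toward that center.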
We optimize this function by adjusting the $w_i$ and the mixture parameters $\\pi_j$, $\\mu_j$, $\\sigma_j$, and $\\sigma_y$.\u00b3 \n\nThe partial derivative of $C$ with respect to each weight is the sum of the usual squared error derivative and a term due to the complexity cost for the weight: \n\n$$\\frac{\\partial C}{\\partial w_i} = \\frac{K}{\\sigma_y^2} \\sum_c (y_c - d_c) \\frac{\\partial y_c}{\\partial w_i} - \\sum_j r_j(w_i) \\frac{\\mu_j - w_i}{\\sigma_j^2} \\qquad (5)$$ \n\n\u00b2 See (Nowlan, 1991) for a precise description of the relationship between mixture models and the model used by (Weigend, Huberman and Rumelhart, 1990). \n\n\u00b3 $1/\\sigma_y^2$ may be thought of as playing the same role as $\\lambda$ in equation 1 in determining a trade-off between the misfit and complexity costs. $K$ is a normalizing factor based on a gaussian error model. \n\nMethod | Train % Correct | Test % Correct \nVanilla Back Prop. | 100.0 \u00b1 0.0 | 67.3 \u00b1 5.7 \nCross Valid. | 98.8 \u00b1 1.1 | 83.5 \u00b1 5.1 \nWeight Elimination | 100.0 \u00b1 0.0 | 89.8 \u00b1 3.0 \nSoft-share - 5 Comp. | 100.0 \u00b1 0.0 | 95.6 \u00b1 2.7 \nSoft-share - 10 Comp. | 100.0 \u00b1 0.0 | 97.1 \u00b1 2.1 \n\nTable 1: Summary of generalization performance of 5 different training techniques on the shift detection problem. \n\nThe derivative of the complexity cost term is simply a weighted sum of the difference between the weight value and the center of each of the gaussians. The weighting factors are the responsibility measures defined in equation 3, and if over time a single gaussian claims most of the responsibility for a particular weight the effect of the complexity cost term is simply to pull the weight towards the center of the responsible gaussian. The strength of this force is inversely proportional to the variance of the gaussian. \n\nIn the simulations described below, all of the parameters ($w_i$, $\\mu_j$, $\\sigma_j$, $\\pi_j$) are updated simultaneously using a conjugate gradient descent procedure. To prevent variances shrinking too fast or going negative we optimize $\\log \\sigma_j$ rather than $\\sigma_j$. 
To ensure that the mixing proportions sum to 1 and are positive, we optimize $x_j$ where $\\pi_j = \\exp(x_j) / \\sum_k \\exp(x_k)$. For further details see (Nowlan and Hinton, 1992). \n\n4 Simulation Results \n\nWe compared the generalization performance of soft weight-tying to other techniques on two different problems. The first problem, a 20 input, one output shift detection network, was chosen because it was a binary problem for which solutions which generalize well exhibit a lot of repeated weight structure. The generalization performance of networks trained using the cost criterion given in equation 4 was compared to networks trained in three other ways: no cost term to penalize complexity; no explicit complexity cost term, but use of a validation set to terminate learning; and weight elimination (Weigend, Huberman and Rumelhart, 1990).\u2074 The simulation results are summarized in Table 1. \n\n\u2074 With a fixed value of $\\lambda$ chosen by cross-validation. \n\nThe network had 20 input units, 10 hidden units, and a single output unit and contained 101 weights. The first 10 input units in this network were given a random binary pattern, and the second group of 10 input units were given the same pattern circularly shifted by 1 bit left or right. The desired output of the network was +1 for a left shift and -1 for a right shift. A data set of 2400 patterns was created by randomly generating a 10 bit string, and choosing with equal probability to shift the string left or right. The data set was divided into 100 training cases, 1000 validation cases, and 1300 test cases. The training set was deliberately chosen to be very small (< 5% of possible patterns) to explore the region in which complexity penalties should have the largest impact. Ten simulations were performed with each method, starting from ten different initial weight sets (i.e. each method used the same ten initial weight configurations). \n\nFigure 1: Final mixture probability density for a typical solution to the shift detection problem. Five of the components in the mixture can be seen as distinct bumps in the probability density. Of the remaining five components, two have been eliminated by having their mixing proportions go to zero and the other three are very broad and form the baseline offset of the density function. \n\nThe final weight distributions discovered by the soft weight-tying technique are shown in Figure 1. There is no significant component with mean 0. The classical assumption that the network contains a large number of inessential weights which can be eliminated to improve generalization is not appropriate for this problem and network architecture. This may explain why the weight elimination model used by Weigend et al (Weigend, Huberman and Rumelhart, 1990) performs relatively poorly in this situation. \n\nThe second task chosen to evaluate the effectiveness of our complexity penalty was the prediction of the yearly sunspot average from the averages of previous years. This task has been well studied as a time-series prediction benchmark in the statistics literature (Priestley, 1991b; Priestley, 1991a) and has also been investigated by (Weigend, Huberman and Rumelhart, 1990) using a complexity penalty similar to the one discussed in section 2. 
\n\nThe network architecture used was identical to the one used in the study by Weigend et al: the network had 12 input units which represented the yearly average from the preceding 12 years, 8 hidden units, and a single linear output unit which represented the prediction for the average number of sunspots in the current year. Yearly sunspot data from 1700 to 1920 was used to train the network to perform this one-step prediction task, and the evaluation of the network was based on data from 1921 to 1955.\u2075 The evaluation of prediction performance used the average relative variance (arv) measure discussed in (Weigend, Huberman and Rumelhart, 1990). \n\nMethod | Test arv \nTAR | 0.097 \nRBF | 0.092 \nWRH | 0.086 \nSoft-share - 3 Comp. | 0.077 \u00b1 0.0029 \nSoft-share - 8 Comp. | 0.072 \u00b1 0.0022 \n\nTable 2: Summary of average relative variance of 5 different models on the one-step sunspot prediction problem. \n\nSimulations were performed using the same conjugate gradient method used for the first problem. Complexity measures based on gaussian mixtures with 3 and 8 components were used and ten simulations were performed with each (using the same training data but different initial weight configurations). The results of these simulations are summarized in Table 2 along with the best result obtained by Weigend et al (Weigend, Huberman and Rumelhart, 1990) (WRH), the threshold auto-regression model of Tong and Lim (Tong and Lim, 1980) (TAR)\u2076, and the multi-layer RBF network of He and Lapedes (He and Lapedes, 1991) (RBF). All figures represent the arv on the test set. For the mixture complexity models, this is the average over the ten simulations, plus or minus one standard deviation. 
\n\n\u2075 The authors thank Andreas Weigend for providing his version of this data. \n\n\u2076 This was the model favored by Priestley (Priestley, 1991a) in a recent evaluation of classical statistical approaches to this task. \n\nSince the results for the models other than the mixture complexity trained networks are based on a single simulation it is difficult to assign statistical significance to the differences shown in Table 2. We may note however, that the difference between the 3 and 8 component mixture complexity models is significant (p > 0.95) and the differences between the 8 component model and the other models are much larger. \n\nFigure 2 shows an 8 component mixture model of the final weight distribution. It is quite unlike the distribution in Figure 1 and is actually quite close to a mixture of two zero-mean gaussians, one broad and one narrow. This may explain why weight elimination works quite well for this task. \n\nFigure 2: Typical final mixture probability density for the sunspot prediction problem with a model containing 8 mixture components. \n\nWeigend et al point out that for time series prediction tasks such as the sunspot task a much more interesting measure of performance is the ability of the model to predict more than one time step into the future. One way to approach the multi-step prediction problem is to use iterated single-step prediction. In this method, the predicted output is fed back as input for the next prediction and all other input units have their values shifted back one unit. Thus the input typically consists of a combination of actual and predicted values. When predicting more than one step into the future, the prediction error depends both on how many steps into the future one is predicting (I) and on what point in the time series the prediction began. An appropriate error measure for iterated prediction is the average relative I-times iterated prediction variance (Weigend, Huberman and Rumelhart, 1990) which averages predictions I steps into the future over all possible starting points. Using this measure, the performance of various models is shown in Figure 3. \n\nFigure 3: Average relative I-times iterated prediction variance versus number of prediction iterations for the sunspot time series from 1921 to 1955. Closed circles represent the TAR model, open circles the WRH model, closed squares the 3 component complexity model, and open squares the 8 component complexity model. Ten different sets of initial weights were used for the 3 and 8 component complexity models and one standard deviation error bars are shown. \n\n5 Summary \n\nThe simulations we have described provide evidence that the use of a more flexible model for the distribution of weights in a network can lead to better generalization performance than weight decay, weight elimination, or techniques that control the learning time. The flexibility of our model is clearly demonstrated in the very different final weight distributions discovered for the two different problems investigated in this paper. The ability to automatically adapt to individual problems suggests that the method should have broad applicability. \n\nAcknowledgements \n\nThis research was funded by the Ontario ITRC, the Canadian NSERC and the Howard Hughes Medical Institute. Hinton is the Noranda fellow of the Canadian Institute for Advanced Research. \n\nReferences \n\nBaum, E. B. 
and Haussler, D. (1989). What size net gives valid generalization? Neural Computation, 1:151-160. \n\nHe, X. and Lapedes, A. (1991). Nonlinear modelling and prediction by successive approximation using Radial Basis Functions. Technical Report LA-UR-91-1375, Los Alamos National Laboratory. \n\nHinton, G. E. (1987). Learning translation invariant recognition in a massively parallel network. In Proc. Conf. Parallel Architectures and Languages Europe, Eindhoven. \n\nLang, K. J., Waibel, A. H., and Hinton, G. E. (1990). A time-delay neural network architecture for isolated word recognition. Neural Networks, 3:23-43. \n\nLe Cun, Y. (1989). Generalization and network design strategies. Technical Report CRG-TR-89-4, University of Toronto. \n\nMacKay, D. J. C. (1991). Bayesian Modelling and Neural Networks. PhD thesis, Computation and Neural Systems, California Institute of Technology, Pasadena, CA. \n\nNowlan, S. J. (1991). Soft Competitive Adaptation: Neural Network Learning Algorithms based on Fitting Statistical Mixtures. PhD thesis, School of Computer Science, Carnegie Mellon University, Pittsburgh, PA. \n\nNowlan, S. J. and Hinton, G. E. (1992). Simplifying neural networks by soft weight-sharing. Neural Computation. In press. \n\nPlaut, D. C., Nowlan, S. J., and Hinton, G. E. (1986). Experiments on learning by back-propagation. Technical Report CMU-CS-86-126, Carnegie-Mellon University, Pittsburgh PA 15213. \n\nPriestley, M. B. (1991a). Non-linear and Non-stationary Time Series Analysis. Academic Press. \n\nPriestley, M. B. (1991b). Spectral Analysis and Time Series. Academic Press. \n\nTong, H. and Lim, K. S. (1980). Threshold autoregression, limit cycles, and cyclical data. Journal Royal Statistical Society B, 42. \n\nWeigend, A. S., Huberman, B. A., and Rumelhart, D. E. 
(1990). Predicting the future: A connectionist approach. International Journal of Neural Systems, 1. \n", "award": [], "sourceid": 498, "authors": [{"given_name": "Steven", "family_name": "Nowlan", "institution": null}, {"given_name": "Geoffrey", "family_name": "Hinton", "institution": null}]}