Bayesian Backpropagation Over I-O Functions Rather Than Weights

Advances in Neural Information Processing Systems, pp. 200-207

David H. Wolpert
The Santa Fe Institute
1660 Old Pecos Trail
Santa Fe, NM 87501

Abstract

The conventional Bayesian justification of backprop is that it finds the MAP weight vector. As this paper shows, to find the MAP i-o function instead one must add a correction term to backprop. That term biases one towards i-o functions with small description lengths, and in particular favors (some kinds of) feature-selection, pruning, and weight-sharing.

1 INTRODUCTION

In the conventional Bayesian view of backpropagation (BP) (Buntine and Weigend, 1991; Nowlan and Hinton, 1994; MacKay, 1992; Wolpert, 1993), one starts with the "likelihood" conditional distribution P(training set = t | weight vector w) and the "prior" distribution P(w). As an example, in regression one might have a "Gaussian likelihood", P(t | w) ∝ exp[-χ²(w, t)] ≡ Π_i exp[-(net(w, t_x(i)) - t_y(i))² / 2σ²] for some constant σ. (t_x(i) and t_y(i) are the successive input and output values in the training set respectively, and net(w, ·) is the function, induced by w, taking input neuron values to output neuron values.) As another example, the "weight decay" (Gaussian) prior is P(w) ∝ exp(-α w²) for some constant α. Bayes' theorem tells us that P(w | t) ∝ P(t | w) P(w). Accordingly, the most probable weight vector given the data - the "maximum a posteriori" (MAP) w - is the mode over w of P(t | w) P(w), which equals the minimizer over w of the "cost function" L(w, t) ≡ -ln[P(t | w)] - ln[P(w)]. So for example with the Gaussian likelihood and weight decay prior, the most probable w given the data is the w minimizing χ²(w, t) + α w².
Accordingly BP with weight decay can be viewed as a scheme for trying to find the function from input neuron values to output neuron values (i-o function) induced by the MAP w.

One peculiar aspect of this justification of weight-decay BP is the fact that rather than the i-o function induced by the most probable weight vector, in practice one would usually prefer to know the most probable i-o function. (In few situations would one care more about a weight vector than about what that weight vector parameterizes.) Unfortunately, the difference between these two i-o functions can be large; in general it is not true that "the most probable output corresponds to the most probable parameter" (Denker and LeCun, 1991). This paper shows that to find the MAP i-o function rather than the MAP w one adds a "correction term" to conventional BP. That term biases one towards i-o functions with small description lengths, and in particular favors feature-selection, pruning and weight-sharing. In this sense that term constitutes a theoretical justification for those techniques. Although cast in terms of neural nets, this paper's analysis applies to any case where convention is to use the MAP value of a parameter encoding Z to estimate the value of Z.

2 BACKPROP OVER I-O FUNCTIONS

Assume the net's architecture is fixed, and that weight vectors w live in a Euclidean vector space W of dimension |W|. Let X be the set of vectors x which can be loaded on the input neurons, and O the set of vectors o which can be read off the output neurons. Assume that the number of elements in X (|X|) is finite. This is always the case in the real world, where measuring devices have finite accuracy, and where the computers used to emulate neural nets are finite state machines. For similar reasons O is also finite in practice.
However for now assume that O is very large and "fine-grained", and approximate it as a Euclidean vector space of dimension |O|. (This assumption usually holds with neural nets, where output values are treated as real-valued vectors.) This assumption will be relaxed later.

Indicate the set of functions taking X to O by Φ. (net(w, ·) is an element of Φ.) Any φ ∈ Φ is an (|X| × |O|)-dimensional Euclidean vector. Accordingly, densities over W are related to densities over Φ by the usual rules for transforming densities between |W|-dimensional and (|X| × |O|)-dimensional Euclidean vector spaces. There are three cases to consider:

1) |W| < |X||O|. In general, as one varies over all w's the corresponding i-o functions net(w, ·) map out a sub-manifold of Φ having lower dimension than Φ.
2) |W| > |X||O|. There are an infinite number of w's corresponding to each φ.
3) |W| = |X||O|. This is the easiest case to analyze in detail. Accordingly I will deal with it first, deferring discussion of cases one and two until later.

With some abuse of notation, let capital letters indicate random variables and lower case letters indicate values of random variables. So for example w is a value of the weight vector random variable W. Use 'p' to indicate probability densities. So for example p_{Φ|T}(φ | t) is the density of the i-o function random variable Φ, conditioned on the training set random variable T, and evaluated at the values Φ = φ and T = t.

In general, any i-o function not expressible as net(w, ·) for some w has zero probability. For the other i-o functions, with δ(·) being the multivariable Dirac delta function,

p_Φ(net(w, ·)) = ∫ dw' p_W(w') δ(net(w', ·) - net(w, ·)).     (1)

When the mapping Φ = net(W, ·) is one-to-one, we can evaluate equation (1) to get

p_{Φ|T}(net(w, ·) | t) = p_{W|T}(w | t) / J_{Φ,W}(w),     (2)

where J_{Φ,W}(w) is the Jacobian of the W → Φ mapping:

J_{Φ,W}(w) ≡ | det[∂φ_i / ∂w_j](w) | = | det[∂net(w, ·)_i / ∂w_j] |.     (3)

"net(w, ·)_i" means the i'th component of the i-o function net(w, ·). "net(w, x)" means the vector o mapped by net(w, ·) from the input x, and "net(w, x)_k" is the k'th component of o. So the "i" in "net(w, ·)_i" refers to a pair of values {x, k}. Each matrix value ∂φ_i / ∂w_j is the partial derivative of net(w, x)_k with respect to some weight, for some x and k. J_{Φ,W}(w) can be rewritten as det^{1/2}[g_ij(w)], where g_ij(w) ≡ Σ_k [(∂φ_k / ∂w_i)(∂φ_k / ∂w_j)] is the metric of the W → Φ mapping. This form of J_{Φ,W}(w) is usually more difficult to evaluate though.

Unfortunately, φ = net(w, ·) is not one-to-one; where J_{Φ,W}(w) ≠ 0 the mapping is locally one-to-one, but there are global symmetries which ensure that more than one w corresponds to each φ. (Such symmetries arise from things like permuting the hidden neurons or changing the sign of all weights leading into a hidden neuron - see (Fefferman, 1993) and references therein.) To circumvent this difficulty we must make a pair of assumptions. To begin, restrict attention to W_inj, those values w of the variable W for which the Jacobian is non-zero. This ensures local injectivity of the map between W and Φ. Given a particular w ∈ W_inj, let k be the number of w' ∈ W_inj such that net(w, ·) = net(w', ·). (Since net(w, ·) = net(w, ·), k ≥ 1.) Such a set of k vectors forms an equivalence class, {w}.

The first assumption is that for all w ∈ W_inj the size of {w} (i.e., k) is the same. This will be the case if we exclude degenerate w (e.g., w's with all first layer weights set to 0). The second assumption is that for all w' and w in the same equivalence class, p_{W|D}(w | d) = p_{W|D}(w' | d).
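Equation (3) can be checked numerically on a toy case. The following sketch uses a hypothetical two-weight "net" of my own (not an architecture from the paper) and compares a central-difference estimate of the Jacobian determinant against the analytic value:

```python
import numpy as np

# Hypothetical two-weight "net" (not from the paper): net(w, x) = w2*tanh(w1*x),
# with two possible inputs, so |W| = |X| x |O| = 2 and the Jacobian is square.
X = np.array([0.5, 2.0])

def net(w):
    # the i-o function as a vector in Phi: one output value per x in X
    return w[1] * np.tanh(w[0] * X)

def jacobian_det(w, eps=1e-6):
    # |det[d net(w,.)_i / d w_j]| of equation (3), by central differences
    J = np.zeros((len(X), len(w)))
    for j in range(len(w)):
        d = np.zeros(len(w)); d[j] = eps
        J[:, j] = (net(w + d) - net(w - d)) / (2 * eps)
    return abs(np.linalg.det(J))

w = np.array([0.7, 1.3])
# analytic Jacobian: d net/d w1 = w2*x*sech^2(w1*x), d net/d w2 = tanh(w1*x)
J_exact = np.column_stack([w[1] * X / np.cosh(w[0] * X) ** 2,
                           np.tanh(w[0] * X)])
print(jacobian_det(w), abs(np.linalg.det(J_exact)))
```

The same construction gives J_{Φ,W}(w) for any net with |W| = |X||O|, at the cost of O(|W|) forward passes plus an O(|W|³) determinant, so it quickly becomes expensive for large nets.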
This is usually the case. (For example, start with w' and relabel hidden neurons to get a new w ∈ {w'}. If we assume the Gaussian likelihood and prior, then since neither differs for the two w's the weight-posterior is also the same for the two w's.) Given these assumptions, p_{Φ|T}(net(w, ·) | t) = k p_{W|T}(w | t) / J_{Φ,W}(w). So rather than minimize the usual cost function, L(w, t), to find the MAP φ BP should minimize L'(w, t) ≡ L(w, t) + ln[J_{Φ,W}(w)]. Intuitively, if the p_{Φ|T}(φ | t) one encounters in the real world is independent of how one chooses to parameterize Φ, then the probability density of our parameter must depend on how it gets mapped to Φ. This is the basis of the correction term. As this suggests, the correction term won't arise if we use non-p_{Φ|T}(φ | t)-based estimators, like maximum-likelihood estimators. (This is a basic difference between such estimators and MAP estimators with a uniform prior.) The correction term is also irrelevant if we use an MAP estimate but J_{Φ,W}(w) is independent of w (as when net(w, ·) depends linearly on w). And for non-linear net(w, ·), the correction term has no effect for some non-MAP-based ways to apply Bayesianism to neural nets, like guessing the posterior average φ (Neal, 1993):

E(Φ | t) ≡ ∫ dφ p_{Φ|T}(φ | t) φ = ∫ dw p_{W|T}(w | t) net(w, ·),     (4)

so one can calculate E(Φ | t) by working in W, without any concern for a correction term. (Loosely speaking, the Jacobian associated with changing integration variables cancels the Jacobian associated with changing the argument of the probability density. A formal derivation - applicable even when |W| ≠ |X| × |O| - is in the appendix of (Wolpert, 1994).)

One might think that since it's independent of t, the correction term can be absorbed into p_W(w).
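Both points - that the mode of the density moves under the W → Φ reparameterization while the posterior average of equation (4) does not - can be seen in a one-dimensional sketch. The densities and the map φ = tanh(w) below are hypothetical, chosen only so that everything can be computed on a grid:

```python
import numpy as np

# 1-d sketch (hypothetical densities, not the paper's): reparameterizing by
# phi = tanh(w) moves the mode of the density but not the posterior average.
def p_w(w):
    # a made-up weight "posterior" p_{W|T}(w | t): Gaussian with mode at w = 1
    return np.exp(-(w - 1.0) ** 2 / 2) / np.sqrt(2 * np.pi)

w = np.linspace(-12, 12, 400001)
dw = w[1] - w[0]
log_J = np.log(1.0 - np.tanh(w) ** 2)            # ln J_{Phi,W}(w) = ln|d tanh/dw|

map_w = w[np.argmax(np.log(p_w(w)))]             # MAP weight
map_phi_w = w[np.argmax(np.log(p_w(w)) - log_J)] # w inducing the MAP phi

# E(Phi | t) computed in W-space, as on the right-hand side of equation (4)...
E_in_w = np.sum(p_w(w) * np.tanh(w)) * dw
# ...and in Phi-space, using p_Phi = p_W / J evaluated at w = arctanh(phi)
phi = np.linspace(-1 + 1e-9, 1 - 1e-9, 400001)
dphi = phi[1] - phi[0]
p_phi = p_w(np.arctanh(phi)) / (1.0 - phi ** 2)
E_in_phi = np.sum(p_phi * phi) * dphi

print(map_w, map_phi_w)   # the two modes disagree
print(E_in_w, E_in_phi)   # the two averages agree
```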
Ironically, it is precisely because quantities like E(Φ | t) aren't affected by the correction term that this is impossible: Absorb the correction term into the prior, giving a new prior p*_W(w) ≡ c × p_W(w) × J_{Φ,W}(w) (c a normalization constant), and now the most probable φ corresponds to the w minimizing L(w, t). However such a redefinition changes E(Φ | t) (amongst other things): ∫ dφ p*_{Φ|T}(φ | t) φ = ∫ dw p*_{W|T}(w | t) net(w, ·) ≠ ∫ dw p_{W|T}(w | t) net(w, ·) = ∫ dφ p_{Φ|T}(φ | t) φ. So one can either modify BP (by adding in the correction term) and leave E(Φ | t) alone, or leave BP alone but change E(Φ | t); one can not leave both unchanged.

Moreover, some procedures involve both prior-based modes and prior-based integrals, and therefore are affected by the correction term no matter how p_W(w) is redefined. For example, in the evidence procedure (Wolpert, 1993; MacKay, 1992) one fixes the value of a hyperparameter γ (e.g., α from the introduction) to the value γ' maximizing p_{Γ|T}(γ | t). Next one finds the value s' maximizing p_{S|T,Γ}(s' | t, γ') for some variable S. Finally, one guesses the φ associated with s'. Now it's hard to see why one should use this procedure with S = W (as is conventional) rather than with S = Φ. But with S = Φ rather than W, one must factor in the correction term when calculating p_{S|T,Γ}(s | t, γ'), and therefore the guessed φ is different from when S = W. If one tries to avoid this change in the guessed φ by absorbing the correction term into the prior p_{W|Γ}(w | γ), then p_{Γ|T}(γ | t) - which is given by an integral involving that prior - changes. This in turn changes γ', and therefore the guessed φ again is different. So presuming one is more directly interested in Φ rather than W, one can't avoid having the correction term affect the evidence procedure.

It should be noted that calculating the correction term can be laborious in large nets.
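For a case where the correction term has a closed form, consider (as a sketch, with hypothetical sizes and weight values) a net with a single tanh output neuron and unary input vectors; the Jacobian is then diagonal and ln J_{Φ,W} decomposes into a sum of per-weight penalties:

```python
import numpy as np

# Single tanh output neuron, N input neurons, unary inputs (|X| = N = |W|).
# Sizes and weight values here are hypothetical.
N = 4
w = np.array([0.3, -1.2, 2.0, 0.8])
X = np.eye(N)                        # the N unary input vectors

def net(w):
    return np.tanh(X @ w)            # one output value per input vector

def log_correction(w, eps=1e-6):
    # ln J_{Phi,W}(w) = ln|det[d net_i / d w_j]|, by central differences
    J = np.zeros((N, N))
    for j in range(N):
        d = np.zeros(N); d[j] = eps
        J[:, j] = (net(w + d) - net(w - d)) / (2 * eps)
    return np.log(abs(np.linalg.det(J)))

# For unary X the Jacobian is diag(tanh'(w_i)) = diag(sech^2(w_i)), so
# ln J = sum_i ln[sech^2(w_i)]: a per-weight penalty added to the usual cost.
closed_form = np.sum(np.log(1.0 / np.cosh(w) ** 2))
print(log_correction(w), closed_form)
```

General multi-layer nets admit no such per-weight decomposition, which is what makes the term laborious there.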
One should bear in mind the determinant-evaluation tricks mentioned in (Buntine and Weigend, 1991), as well as others like the identity ln[det(A)] = Tr[ln(A)]. As an example of the correction term's effects, consider a net with no hidden layer, a single tanh output neuron, N input neurons, and unary X (so that |W| = |X| × |O| = N). For such a net ∂φ_i / ∂w_j is an N × N diagonal matrix, and ln[J_{Φ,W}(w)] = Σ_i ln[tanh'(w_i)]. With the Gaussian likelihood and weight decay prior, the φ found by modified BP has larger (in magnitude) values of o than does the corresponding φ found by unmodified BP. In addition, unlike unmodified BP, modified BP has multiple extrema over certain regimes. All of this is illustrated in figures (1) through (3), which graph the value of o resulting from using modified BP with a particular training set t and input value x vs. the value of o resulting from using unmodified BP with t and x. Figure (4) depicts the w_i-dependences of the weight decay term and of the weight-decay term plus the correction term. (When there's no data, BP searches for minima of those curves.)

Now consider multi-layer nets, possibly with non-unary X. Denote a vector of the components of w which lead from the input layer into hidden neuron K by w_[K]. Let x* be the input vector consisting of all 0's. Then ∂tanh(w_[K] · x*) / ∂w_j = 0 for any j, w, and K, and for any w, there is a row of ∂φ_i / ∂w_j which is all zeroes. This in turn means that J_{Φ,W}(w) = 0 for any w, which means that W_inj is empty, and p_{W|T}(· | t) is independent of the data t. (Intuitively, this problem arises since the o corresponding to x* can't vary with w, and therefore the dimension of Φ is less than |W|.) So we must forbid such an all-zeroes x*. The easiest way to do this is to require that one input neuron always be on, i.e., introduce a bias unit. An alternative is to redefine Φ to be the functions from the set X - {(0, 0, ..., 0)} to O rather than from the set X to O.
Another alternative, appropriate when the original X is the set of all input neuron vectors consisting of 0's and 1's, is to instead have input neuron values ∈ {z ≠ 0, 1}. (In general z ≠ -1 though; due to the symmetries of the tanh, for many architectures z = -1 means that two rows of ∂φ_i / ∂w_j are identical up to an overall sign, which means that J_{Φ,W}(w) = 0.) This is the solution implicitly assumed from now on.

J_{Φ,W}(w) will be small - and therefore p_Φ(net(w, ·)) will be large - whenever one can make large changes to w without affecting φ = net(w, ·) much. In other words, p_Φ(net(w, ·)) will be large whenever we don't need to specify w very accurately. So the correction factor favors those w which can be expressed with few bits. In other words, the correction factor enforces a sort of automatic MDL (Rissanen, 1986; Nowlan and Hinton, 1994).

More generally, for any multi-layer architecture there are many "singular weights" w_sin ∉ W_inj such that J_{Φ,W}(w_sin) is not just small but equals zero exactly. p_W(w) must compensate for these singularities, or the peaks of p_{Φ|T}(φ | t) won't depend on t. So we need to have p_W(w) → 0 as w → w_sin. Sometimes this happens automatically. For example often w_sin includes infinite-valued w's, since tanh'(∞) = 0. Because p_W(∞) = 0 for the weight-decay prior, that prior compensates for the infinite-w singularities in the correction term.

For other w_sin there is no such automatic compensation, and we have to explicitly modify p_W(w) to avoid singularities.
In doing so though it seems reasonable to maintain a "bias" towards the w_sin: that is, to have p_W(w) go to zero slowly enough that the values p_Φ(net(w, ·)) are "enhanced" for w near w_sin. Although a full characterization of such enhanced w is not in hand, it's easy to see that they include certain kinds of pruned nets (Hassibi and Stork, 1992), weight-shared nets (Nowlan and Hinton, 1994), and feature-selected nets.

To see that (some kinds of) pruned nets have singular weights, let w* be a weight vector with a zero-valued weight coming out of hidden neuron K. By (1), p_Φ(net(w*, ·)) = ∫ dw' p_W(w') δ(net(w', ·) - net(w*, ·)). Since we can vary the value of each weight w*_i leading into neuron K without affecting net(w*, ·), the integral diverges. So w* is singular; removing a hidden neuron results in an enhanced probability. This constitutes an a priori argument in favor of trying to remove hidden neurons during training.

This argument does not apply to weights leading into a hidden neuron; J_{Φ,W}(w) treats weights in different layers differently. This fact suggests that however p_W(w) compensates for the singularities in J_{Φ,W}(w), weights in different layers should be treated differently by p_W(w). This is in accord with the advice given in (MacKay, 1992).

To see that some kinds of weight-shared nets have singular weights, let w' be a weight vector such that for any two hidden neurons K and K' the weight from input neuron i to K equals the weight from i to K', for all input neurons i. In other words, w' is such that all hidden neurons compute identical functions of x. (For some architectures we'll actually only need a single pair of hidden neurons to be identical.) Usually for such a situation there is a pair of columns of the matrix ∂φ_i / ∂w_j which are exactly proportional to one another.
(For example, in a 3-2-1 architecture, with X = {z, 1}³, |W| = |X| × |O| = 8, and there are four such pairs of columns.) This means that J_{Φ,W}(w') = 0; w' has an enhanced probability, and we have an a priori argument in favor of trying to equate hidden neurons during training.

The argument that feature-selected nets have singular weights is architecture-dependent, and there might be reasonable architectures for which it fails. To illustrate the argument, consider the 3-2-1 architecture. Let x1(k) and x2(k) with k ∈ {1, 2, 3} designate three distinct pairs of input vectors. For each k have x1(k) and x2(k) be identical for all input neurons except neuron A, for which they differ. (Note there are four pairs of input vectors with this property, one for each of the four possible patterns over input neurons B and C.) Let w' be a weight vector such that both weights leaving A equal zero. For this situation net(w', x1(k)) = net(w', x2(k)) for all k. In addition ∂net(w, x1(k)) / ∂w_j = ∂net(w, x2(k)) / ∂w_j for all weights w_j except the two which lead out of A. So k = 1 gives us a pair of rows of the matrix ∂φ_i / ∂w_j which are identical in all but two entries (one row for x1(k) and one for x2(k)). We get another such pair of rows, differing from each other in the exact same two entries, for k = 2, and yet another pair for k = 3. So there is a linear combination of these six rows which is all zeroes. This means that J_{Φ,W}(w') = 0. This constitutes an a priori argument in favor of trying to remove input neurons during training.

Since it doesn't favor any p_W(w), the analysis of this paper doesn't favor any p_Φ(φ). However when combined with empirical knowledge it suggests certain p_Φ(φ). For example, there are functions g(w) which empirically are known to be good choices for p_Φ(net(w, ·)) (e.g., g(w) ∝ exp(-α w²)).
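The three singularity arguments above can be checked numerically on the 3-2-1 architecture (a sketch assuming tanh hidden units, a linear output neuron, no biases, and inputs in {z, 1} with z = 0.5, so |W| = |X| × |O| = 8):

```python
import numpy as np
from itertools import product

# 3-2-1 tanh net, no biases, inputs in {0.5, 1}, so |X| = 8 = |W| = |X| x |O|.
Xs = np.array(list(product([0.5, 1.0], repeat=3)))

def net(w):
    W1, v = w[:6].reshape(2, 3), w[6:]   # hidden weights, output weights
    return np.tanh(Xs @ W1.T) @ v        # one output per input vector

def jac_det(w, eps=1e-6):
    J = np.zeros((8, 8))
    for j in range(8):
        d = np.zeros(8); d[j] = eps
        J[:, j] = (net(w + d) - net(w - d)) / (2 * eps)
    return abs(np.linalg.det(J))

rng = np.random.default_rng(0)
w0 = rng.normal(size=8)                          # generic weights
w_pruned = w0.copy(); w_pruned[6] = 0.0          # weight out of hidden 0 zeroed
w_shared = w0.copy(); w_shared[3:6] = w0[:3]     # both hidden units identical
w_feat = w0.copy(); w_feat[0] = w_feat[3] = 0.0  # both weights leaving input A

for w in (w0, w_pruned, w_shared, w_feat):
    print(jac_det(w))   # generic: nonzero; pruned/shared/feature-selected: ~0
```

The pruned net's Jacobian columns for the weights into the disconnected hidden unit are exactly zero; the weight-shared and feature-selected determinants vanish only up to finite-difference noise.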
There are usually problems with such choices of p_Φ(φ) though. For example, these g(w) usually make more sense as a prior over W than as a prior over Φ, which would imply p_Φ(net(w, ·)) = g(w) / J_{Φ,W}(w). Moreover it's empirically true that enhanced w should be favored over other w, as advised by the correction term. So it makes sense to choose a compromise between g(w) and g(w) / J_{Φ,W}(w). An example is p_Φ(φ) ∝ g(w) / [λ₁ + tanh(λ₂ × J_{Φ,W}(w))] for two hyperparameters λ₁ > 0 and λ₂ > 0.

4 BEYOND THE CASE OF BACKPROP WITH |W| = |X||O|

When O does not approximate a Euclidean vector space, elements of Φ have probabilities rather than probability densities, and P(φ | t) = ∫ dw p_{W|T}(w | t) δ(net(w, ·), φ), δ(·, ·) being a Kronecker delta function. Moreover, if O is a Euclidean vector space but |W| > |X||O|, then again one must evaluate a difficult integral; Φ = net(W, ·) is not one-to-one so one must use equation (1) rather than (2). Fortunately these two situations are relatively rare.

The final case to consider is |W| < |X||O| (see section two). Let S(W) be the surface in Φ which is the image (under net(W, ·)) of W. For all φ, p_Φ(φ) is either zero (when φ ∉ S(W)) or infinite (when φ ∈ S(W)). So as conventionally defined, "MAP φ" is not meaningful. One way to deal with this case is to embed the net in a larger net, where that larger net's output is relatively insensitive to the values of the newly added weights. An alternative that is applicable when |W| / |O| is an integer is to reduce X by removing "uninteresting" x's. A third alternative is to consider surface densities over S(W), p_{S(W)}(φ), instead of volume densities over Φ, p_Φ(φ). Such surface densities are given by equation (2), if one uses the metric form of J_{Φ,W}(w). (Buntine has emphasized that the Jacobian form is not even defined for |W| < |X||O|, since ∂φ_i / ∂w_j is not square then (personal communication).)

As an aside, note that restricting p_Φ(φ) to S(W) is an example of the common theoretical assumption that "target functions" come from a pre-chosen "concept class". In practice such an assumption is usually ludicrous - whenever it is made there is an implicit hope that it constitutes a valid approximation to a more reasonable p_Φ(φ).

When decision theory is incorporated into Bayesian analysis, only rarely does it advise us to evaluate an MAP quantity (i.e., use BP). Instead Bayesian decision theory usually advises us to evaluate quantities like E(Φ | t) (Wolpert, 1994). Just as it does for the use of MAP estimators, the analysis of this paper has implications for the use of such E(Φ | t) estimators. In particular, one way to evaluate E(Φ | t) = ∫ dw p_{W|T}(w | t) net(w, ·) is to expand net(w, ·) to low order and then approximate p_{W|T}(w | t) as a sum of Gaussians (Buntine and Weigend, 1991). Equation (4) suggests that instead we write E(Φ | t) as ∫ dφ p_{Φ|T}(φ | t) φ and approximate p_{Φ|T}(φ | t) as a sum of Gaussians. Since fewer approximations are used (no low order expansion of net(w, ·)), this might be more accurate.

Acknowledgements

Thanks to David Rosen and Wray Buntine for stimulating discussion, and to TXN and the SFI for funding. This paper is a condensed version of (Wolpert, 1994).

References

Buntine, W., Weigend, A. (1991). Bayesian back-propagation. Complex Systems, 5, p. 603.

Denker, J., LeCun, Y. (1991). Transforming neural-net output levels to probability distributions. In Neural Information Processing Systems 3, R. Lippmann et al. (Eds).

Fefferman, C. (1993). Reconstructing a neural net from its output.
Sarnoff Research Center TR 93-01.

Hassibi, B., and Stork, D. (1992). Second order derivatives for network pruning: optimal brain surgeon. Ricoh Tech Report CRC-TR-9214.

MacKay, D. (1992). Bayesian Interpolation, and A Practical Framework for Backpropagation Networks. Neural Computation, 4, pp. 415 and 448.

Neal, R. (1993). Bayesian learning via stochastic dynamics. In Neural Information Processing Systems 5, S. Hanson et al. (Eds). Morgan Kaufmann.

Nowlan, S., and Hinton, G. (1994). Simplifying Neural Networks by Soft Weight-Sharing. In Theories of Induction: Proceedings of the SFI/CNLS Workshop on Formal Approaches to Supervised Learning, D. Wolpert (Ed.). Addison-Wesley, to appear.

Rissanen, J. (1986). Stochastic complexity and modeling. Annals of Statistics, 14, p. 1080.

Wolpert, D. (1993). On the use of evidence in neural networks. In Neural Information Processing Systems 5, S. Hanson et al. (Eds). Morgan Kaufmann.

Wolpert, D. (1994). Bayesian back-propagation over i-o functions rather than weights. SFI tech. report, ftp'able from archive.cis.ohio-state.edu, as pub/neuroprose/wolpert.nips.93.Z.