{"title": "On the Use of Evidence in Neural Networks", "book": "Advances in Neural Information Processing Systems", "page_first": 539, "page_last": 546, "abstract": null, "full_text": "On the Use of Evidence in Neural Networks

David H. Wolpert

The Santa Fe Institute
1660 Old Pecos Trail
Santa Fe, NM 87501

Abstract

The Bayesian \"evidence\" approximation has recently been employed to determine the noise and weight-penalty terms used in back-propagation. This paper shows that for neural nets it is far easier to use the exact result than it is to use the evidence approximation. Moreover, unlike the evidence approximation, the exact result neither has to be re-calculated for every new data set, nor requires the running of computer code (the exact result is closed form). In addition, it turns out that the evidence procedure's MAP estimate for neural nets is, in toto, approximation error. Another advantage of the exact analysis is that it does not lead one to incorrect intuition, like the claim that using evidence one can \"evaluate different priors in light of the data\". This paper also discusses sufficiency conditions for the evidence approximation to hold, why it can sometimes give \"reasonable\" results, etc.

1 THE EVIDENCE APPROXIMATION

It has recently become popular to consider the problem of training neural nets from a Bayesian viewpoint (Buntine and Weigend 1991, MacKay 1992). The usual way of doing this starts by assuming that there is some underlying target function f from R^n to R, parameterized by an N-dimensional weight vector w. We are provided with a training set L of noise-corrupted samples of f. Our goal is to make a guess for w, basing that guess only on L. Now assume we have i.i.d.
additive gaussian noise, resulting in P(L | w, β) ∝ exp(-β χ²(w, L)), where χ²(w, L) is the usual sum-squared training set error, and β reflects the noise level. Assume further that P(w | α) ∝ exp(-α W(w)), where W(w) is the sum of the squares of the weights. If the values of α and β are known and fixed, to the values α_t and β_t respectively, then P(w) = P(w | α_t) and P(L | w) = P(L | w, β_t). Bayes' theorem then tells us that the posterior is proportional to the product of the likelihood and the prior, i.e., P(w | L) ∝ P(L | w) P(w). Consequently, finding the maximum a posteriori (MAP) w - the w which maximizes P(w | L) - is equivalent to finding the w minimizing χ²(w, L) + (α_t / β_t) W(w). This can be viewed as a justification for performing gradient descent with weight-decay.

One of the difficulties with the foregoing is that we almost never know α_t and β_t in real-world problems. One way to deal with this is to estimate α_t and β_t, for example via a technique like cross-validation. In contrast, a Bayesian approach to this problem would be to set priors over α and β, and then examine the consequences for the posterior of w.

This Bayesian approach is the starting point for the \"evidence\" approximation created by Gull (Gull 1989). One makes three assumptions, for P(w | γ), P(L | w, γ), and P(γ). (For simplicity of the exposition, from now on the two hyperparameters α and β will be expressed as the two components of the single vector γ.) The quantity of interest is the posterior:

P(w | L) = ∫ dγ P(w, γ | L) = ∫ dγ [{P(w, γ | L) / P(γ | L)} × P(γ | L)]     (1)

The evidence approximation suggests that if P(γ | L) is sharply peaked about γ = γ̂, while the term in curly brackets is smooth about γ = γ̂, then one can approximate the w-dependence of P(w | L) as P(w, γ̂ | L) / P(γ̂ | L) ∝ P(L | w, γ̂) P(w | γ̂). In other words, with the evidence approximation,
one sets the posterior by taking P(w) = P(w | γ̂) and P(L | w) = P(L | w, γ̂), where γ̂ is the MAP γ. P(L | γ) = ∫ dw [P(L | w, γ) P(w | γ)] is known as the \"evidence\" for L given γ. For relatively smooth P(γ), the peak of P(γ | L) is the same as the peak of the evidence (hence the name \"evidence approximation\"). Although the current discussion will only explicitly consider using evidence to set hyperparameters like α and β, most of what will be said also applies to the use of evidence to set other characteristics of the learner, like its architecture.

MacKay has applied the evidence approximation to finding the posterior for the neural net P(w | α) and P(L | w, β) recounted above combined with a P(γ) = P(α, β) which is uniform over all α and β from 0 to +∞ (MacKay 1992). In addition to the error introduced by the evidence approximation, additional error is introduced by his need to approximate γ̂. MacKay states that although he expects his approximation for γ̂ to be valid, \"it is a matter of further research to establish [conditions for] this approximation to be reliable\".

2 THE EXACT CALCULATION

It is always true that the exact posterior is given by

P(w) = ∫ dγ P(w | γ) P(γ);
P(L | w) = ∫ dγ {P(L | w, γ) × P(w | γ) × P(γ)} / P(w);
P(w | L) ∝ ∫ dγ {P(L | w, γ) × P(w | γ) × P(γ)}     (2)

where the proportionality constant, being independent of w, is irrelevant.

Using the neural net P(w | α) and P(L | w, β) recounted above, and MacKay's P(γ), it is trivial to use equation 2 to calculate that P(w) ∝ [W(w)]^-(N/2 + 1), where N is the number of weights. Similarly, with m the number of pairs in L, P(L | w) ∝ [χ²(w, L)]^-(m/2 + 1). (See (Wolpert 1992) and (Buntine and Weigend 1991), and allow the output values in L to range from -∞ to +∞.) These two results give us the exact expression for the posterior P(w | L).
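As a concrete illustration of this closed form, the exact MAP w can be located by a simple grid search. The following sketch assumes a one-weight linear model, illustrative data, and a grid that avoids the w = 0 singularity induced by the un-normalizable prior; none of these specifics are from the paper:

```python
import numpy as np

def chi2(w, X, y):
    # Sum-squared training error of the toy model f(x) = w * x.
    return np.sum((y - w * X) ** 2)

def exact_log_post(w, X, y, N=1):
    # log P(w | L) = -(m/2 + 1) ln chi2(w, L) - (N/2 + 1) ln W(w) + const,
    # with W(w) = w^2 for a single weight (N = 1).
    m = len(X)
    return -(m / 2 + 1) * np.log(chi2(w, X, y)) - (N / 2 + 1) * np.log(w ** 2)

# Illustrative noisy samples of f(x) = 2x.
X = np.array([0.5, 1.0, 1.5, 2.0])
y = 2.0 * X + np.array([0.1, -0.2, 0.05, 0.15])

ws = np.linspace(0.1, 4.0, 2001)   # grid avoiding the w = 0 prior singularity
logp = np.array([exact_log_post(w, X, y) for w in ws])
w_map = ws[np.argmax(logp)]
```

Evaluating exp[-α' W(w) - β' χ²(w, L)] on the same grid for any fixed α', β' makes the difference in w-dependence between the two posteriors directly visible.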
In contrast, the evidence-approximated posterior ∝ exp[-α'(L) W(w) - β'(L) χ²(w, L)].

It is illuminating to compare this exact calculation to the calculation based on the evidence approximation. A good deal of relatively complicated mathematics followed by some computer-based numerical estimation is necessary to arrive at the answer given by the evidence approximation. (This is due to the need to approximate γ̂.) In contrast, to perform the exact calculation one only need evaluate a simple gaussian integral, which can be done in closed form, and in particular one doesn't need to perform any computer-based numerical estimation. In addition, with the evidence procedure γ̂ must be re-evaluated for each new data set, which means that the formula giving the posterior must be re-derived every time one uses a new data set. In contrast, the exact calculation's formula for the posterior holds for any data set; no re-calculations are required. So as a practical tool, the exact calculation is both far simpler and quicker to use than the calculation based on the evidence approximation.

Another advantage of the exact calculation, of course, is that it is exact. Indeed, consider the simple case where the noise is fixed, i.e., P(γ) = P(γ₁) δ(γ₂ - β_t), so that the only term we must \"deal with\" is γ₁ = α. Set all other distributions as in (MacKay 1992). For this case, the w-dependence of the exact posterior can be quite different from that of the evidence-approximated posterior. In particular, note that the MAP estimate based on the exact calculation is w = 0. This is, of course, a silly answer, and reflects the poor choice of distributions made in (MacKay 1992). In particular, it directly reflects the un-normalizability of MacKay's P(α). However the important point is that this is the exactly correct answer for those distributions.
On the other hand, the evidence procedure will result in an MAP estimate of argmin_w [χ²(w, L) + (α' / β') W(w)], where α' and β' are derived from L. This answer will often be far from the correct answer of w = 0. Note also that the evidence approximation's answer will vary, perhaps greatly, with L, whereas the correct answer is L-independent. Finally, since the correct answer is w = 0, the difference between the evidence procedure's answer and the correct answer is equal to the evidence procedure's answer. In other words, although there exist scenarios for which the evidence approximation is valid, neural nets with flat P(γ₁) is not one of them; for this scenario, the evidence procedure's answer is in toto approximation error. (A possible reason for this is presented in section 4.)

If one were to use a more reasonable P(α), uniform only from 0 to an upper cut-off α_max, the results would be essentially the same, for large enough α_max. The effect on the exact posterior, to first order, is to introduce a small region around w = 0 in which P(w) behaves like a decaying exponential in W(w) (the exponent being set by α_max) rather than like [W(w)]^-(N/2 + 1) (T. Wallstrom, private communication). For large enough α_max, the region is small enough so that the exact posterior still has a peak very close to 0. On the other hand, for large enough α_max, there is no change in the evidence procedure's answer. (Generically, the major effect on the evidence procedure of modifying P(γ) is not to change its guess for P(w | L), but rather to change the associated error, i.e., change whether sufficiency conditions for the validity of the approximation are met. See below.) Even with a normalizable prior, the evidence procedure's answer is still essentially all approximation error.

Consider again the case where the prior over both α and β is uniform.
With the evidence approximation, the log of the posterior is -{χ²(w, L) + (α' / β') W(w)}, where α' and β' are set by the data. On the other hand, the exact calculation shows that the log of the posterior is really given by -{ln[χ²(w, L)] + ((N+2) / (m+2)) ln[W(w)]}. What's interesting about this is not simply the logarithms, absent from the evidence approximation's answer, but also the factor multiplying the term involving the \"weight penalty\" quantity W(w). In the evidence approximation, this factor is data-dependent, whereas in the exact calculation it only depends on the number of data. Moreover, the value of this factor in the exact calculation tells us that if the number of weights increases, or alternatively the number of training examples decreases, the \"weight penalty\" term becomes more important, and fitting the training examples becomes less important. (It is not at all clear that this trade-off between N and m is reflected in (α' / β'), the corresponding factor from the evidence approximation.) As before, if we have upper cut-offs on P(γ), so that the MAP estimate may be reasonable, things don't change much. For such a scenario, the N vs. m trade-off governing the relative importance of W(w) and χ²(w, L) still holds, but only to lowest order, and only in the region sufficiently far from the α-singularities (like w = 0) so that P(w | L) behaves like [W(w)]^-(N/2 + 1) × [χ²(w, L)]^-(m/2 + 1).

All of this notwithstanding, the evidence approximation has been reported to give good results in practice. This should not be all that surprising. There are many procedures which are formally illegal but which still give reasonable advice. (Some might classify all of non-Bayesian statistics that way.) The evidence procedure fixes γ to a single value, essentially by maximum likelihood.
That's not unreasonable, just usually illegal (as well as far more laborious than the correct Bayesian procedure).

In addition, the tests of the evidence approximation reported in (MacKay 1992) are not all that convincing. For paper 1, the evidence approximation gives α' = 2.5. For any other α in an interval extending three orders of magnitude about this α', test set error is essentially unchanged (see figure 5 of (MacKay 1992)). Since such error is what we're ultimately interested in, this is hardly a difficult test of the evidence approximation. In paper 2 of (MacKay 1992) the initial use of the evidence approximation is \"a failure of Bayesian prediction\"; P(γ | L) doesn't correlate with test set error (see figure 7). MacKay addresses this by arguing that poor Bayesian results are never wrong, but only \"an opportunity to learn\" (in contrast to poor non-Bayesian results?). Accordingly, he modifies the system while looking at the test set, to get his desired correlation on the test set. To do this legally, he should have instead modified his system while looking at a validation set, separate from the test set. However if he had done that, it would have raised the question of why one should use evidence at all; since one is already assuming that behavior on a validation set corresponds to behavior on a test set, why not just set α and β via cross-validation?

3 EVIDENCE AND THE PRIOR

Consider again equation 1. Since γ̂ depends on the data L, it would appear that when the evidence approximation is valid, the data determines the prior, or as MacKay puts it, \"the modern Bayesian ... does not assign the priors - many different priors can be ... compared in the light of the data by evaluating the evidence\" (MacKay 1992).
If this were true, it would remove perhaps the most major objection which has been raised concerning Bayesian analysis - the need to choose priors in a subjective manner, independent of the data. However the exact P(w) given by equation 2 is data-independent. So one has chosen the prior, in a subjective way. The evidence procedure is simply providing a data-dependent approximation to a data-independent quantity. In no sense does the evidence procedure allow one to side-step the need to make subjective assumptions which fix P(w).

Since the true P(w) doesn't vary with L whereas the evidence approximation's P(w) does, one might suspect that that approximation to P(w) can be quite poor, even when the evidence approximation to the posterior is good. Indeed, if P(w | γ₁) is exponential, there is no non-pathological scenario for which the evidence approximation to P(w) is correct:

Theorem 1: Assume that P(w | γ₁) ∝ e^(-γ₁ U(w)). Then the only way that one can have P(w) ∝ e^(-a U(w)) for some constant a is if P(γ₁) = 0 for all γ₁ ≠ a.

Proof: Our proposed equality is exp(-a × U) = ∫ dγ₁ {P(γ₁) × exp(-γ₁ × U)} (the normalization factors having all been absorbed into P(γ₁)). We must find an a and a normalizable P(γ₁) such that this equality holds for all allowed U. Let u be such an allowed value of U. Take the derivative with respect to U of both sides of the proposed equality t times, and evaluate for U = u. The result is a^t = ∫ dγ₁ ((γ₁)^t × R(γ₁)) for any integer t ≥ 0, where R(γ₁) ≡ P(γ₁) exp(u(a - γ₁)). Using this, we see that ∫ dγ₁ ((γ₁ - a)² × R(γ₁)) = 0. Since both R(γ₁) and (γ₁ - a)² are nowhere negative, this means that for all γ₁ for which (γ₁ - a)² ≠ 0, R(γ₁) must equal zero. Therefore R(γ₁) must equal zero for all γ₁ ≠ a. QED.
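Theorem 1 can also be illustrated numerically. In the sketch below, the uniform hyperprior on (0, 5] and the normalization P(w | γ₁) ∝ γ₁ exp(-γ₁ U) (appropriate, e.g., for one-dimensional w with U(w) = |w|) are illustrative assumptions, not from the paper; the point is that the resulting marginal P(w) is not exponential in U, since its log is not linear in U:

```python
import numpy as np

g = np.linspace(1e-3, 5.0, 20000)   # grid over gamma_1; uniform P(gamma_1) on (0, 5]
dg = g[1] - g[0]

def marginal(U):
    # P(w) up to a constant: integral over gamma_1 of
    # P(gamma_1) * gamma_1 * exp(-gamma_1 * U), by Riemann sum.
    return np.sum(g * np.exp(-g * U)) * dg

U_vals = np.array([1.0, 2.0, 3.0])
logp = np.log([marginal(u) for u in U_vals])

# If P(w) were proportional to exp(-a * U) for some constant a, the
# log-marginal would be linear in U and this second difference would vanish.
second_diff = logp[0] - 2 * logp[1] + logp[2]
```

Here the marginal is in fact a power law in U (to within the cut-off corrections), in line with the [W(w)]^-(N/2 + 1) result of section 2.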
Since the evidence approximation for the prior is always wrong, how can its approximation for the posterior ever be good? To answer this, write P(w | L) = P(L | w) × [P'(w) + E(w)] / P(L), where P'(w) is the evidence approximation to P(w). (It is assumed that we know the likelihood exactly.) This means that P(w | L) - {P(L | w) × P'(w) / P(L)}, the error in the evidence procedure's estimate for the posterior, equals P(L | w) × E(w) / P(L). So we can have arbitrarily large E(w) and not introduce sizable error into the posterior of w, but only for those w for which P(L | w) is small. As L varies, the w with non-negligible likelihood vary, and the γ such that for those w P(w | γ) is a good approximation to P(w) varies. When it works, the γ̂ given by the evidence approximation reflects this changing of γ with L.

4 SUFFICIENCY CONDITIONS FOR EVIDENCE TO WORK

Note that regardless of how peaked the evidence is, -{χ²(w, L) + (α' / β') W(w)} ≠ -{ln[χ²(w, L)] + ((N+2) / (m+2)) ln[W(w)]}; the evidence approximation always has non-negligible error for neural nets used with flat P(γ). To understand this, one must carefully elucidate a set of sufficiency conditions necessary for the evidence approximation to be valid. (Unfortunately, this has never been done before. A direct consequence is that no one has ever checked, formally, that a particular use of the evidence approximation is justified.)

One such set of sufficiency conditions, the one implicit in all attempts to date to justify the evidence approximation (i.e., the one implicit in the logic of equation 1), is the following:

(i) P(γ | L) is sharply peaked about a particular γ, γ̂.
(ii) P(w, γ | L) / P(γ | L) varies slowly around γ = γ̂.
(iii) P(w, γ | L) is infinitesimal for all γ sufficiently far from γ̂.

Formally, condition (iii) can be taken to mean that there exists a not too large positive constant k, and a small positive constant δ, such that |P(w | L) - k ∫_{γ̂-δ}^{γ̂+δ} dγ P(w, γ | L)| is bounded by a small constant ε for all w. (As stated, (iii) has k = 1. This will almost always be the case in practice and will usually be assumed, but it is not needed to prove theorem 2.) Condition (ii) can be taken to mean that across [γ̂ - δ, γ̂ + δ], |P(w | γ, L) - P(w | γ̂, L)| < τ, for some small positive constant τ, for all w. (Here and throughout this paper, when γ is multi-dimensional, \"δ\" is taken to be a small positive vector.)

Theorem 2: When conditions (i), (ii), and (iii) hold, P(w | L) ≈ P(L | w, γ̂) × P(w | γ̂), up to an (irrelevant) overall proportionality constant.

Proof: Condition (iii) gives |P(w | L) - k ∫_{γ̂-δ}^{γ̂+δ} dγ [P(w | γ, L) × P(γ | L)]| < ε for all w. However |k ∫_{γ̂-δ}^{γ̂+δ} dγ [P(w | γ, L) × P(γ | L)] - k P(w | γ̂, L) ∫_{γ̂-δ}^{γ̂+δ} dγ P(γ | L)| < τk × ∫_{γ̂-δ}^{γ̂+δ} dγ P(γ | L), by condition (ii). If we now combine these two results, we see that |P(w | L) - k P(w | γ̂, L) ∫_{γ̂-δ}^{γ̂+δ} dγ P(γ | L)| < ε + τk × ∫_{γ̂-δ}^{γ̂+δ} dγ P(γ | L). Since the integral is bounded by 1, |P(w | L) - k P(w | γ̂, L) ∫_{γ̂-δ}^{γ̂+δ} dγ P(γ | L)| < ε + τk. Since the integral is independent of w, up to an overall proportionality constant (that integral times k) the w-dependence of P(w | L) can be approximated by that of P(w | γ̂, L) ∝ P(L | w, γ̂) × P(w | γ̂), incurring error less than ε + τk. Take k not too large and both ε and τ small. QED.

Note that the proof would go through even if P(γ | L) were not peaked about γ̂, or if P(γ | L) were peaked about some point far from the γ̂ for which (ii) and (iii) hold; nowhere in the proof is the definition of γ̂ from condition (i) used.
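Theorem 2 can be checked on a small discrete example (all distributions below are illustrative assumptions, not from the paper): when P(γ | L) is sharply peaked about γ̂ and P(w | γ, L) varies only slowly with γ, the exact mixture posterior is close to P(w | γ̂, L):

```python
import numpy as np

ws = np.arange(5)
gammas = np.linspace(0.0, 1.0, 101)

# Sharply peaked evidence P(gamma | L) about gamma_hat = 0.5 (condition (i)).
pg = np.exp(-0.5 * ((gammas - 0.5) / 0.01) ** 2)
pg /= pg.sum()

# P(w | gamma, L), varying only slowly with gamma (condition (ii)).
pw_given_g = np.empty((len(gammas), len(ws)))
for i, gmm in enumerate(gammas):
    unnorm = np.exp(-(ws - 2.0 - 0.1 * gmm) ** 2)
    pw_given_g[i] = unnorm / unnorm.sum()

p_w_exact = pg @ pw_given_g        # P(w | L) = sum over gamma of P(w|g,L) P(g|L)
i_hat = np.argmax(pg)
p_w_evidence = pw_given_g[i_hat]   # the evidence approximation P(w | gamma_hat, L)
err = np.max(np.abs(p_w_exact - p_w_evidence))
```

Widening the dependence of P(w | γ, L) on γ, or flattening P(γ | L), makes `err` grow, which is exactly the failure mode the counterexample below exhibits.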
However in practice, when condition (iii) is met, k = 1, P(γ | L) falls to 0 outside of [γ̂ - δ, γ̂ + δ], and P(w | γ, L) stays reasonably bounded for all such γ. (If this weren't the case, then P(w | γ, L) would have to fall to 0 outside of [γ̂ - δ, γ̂ + δ], something which is rarely true.) So we see that we could either just give conditions (ii) and (iii), or we could give (i), (ii), and the extra condition that P(w | γ, L) is bounded small enough so that condition (iii) is met. (In addition, one can prove that if the evidence approximation is valid, then conditions (i) and (ii) give condition (iii).)

In any case, it should be noted that conditions (i) and (ii) by themselves are not sufficient for the evidence approximation to be valid. To see this, have w be one-dimensional, and let P(w, γ | L) = 0 both for {|γ - γ̂| < δ, |w - w*| < ν} and for {|γ - γ̂| > δ, |w - w*| > ν}. Let it be constant everywhere else (within certain bounds of allowed γ and w). Then for both δ and ν small, conditions (i) and (ii) hold: the evidence is peaked about γ̂, and τ = 0. Yet for the true MAP w, w*, the evidence approximation fails badly. (Generically, this scenario will also result in a big error if rather than using the evidence-approximated posterior to guess the MAP w, one instead uses it to evaluate the posterior-averaged f, ∫ df f P(f | L).)

Gull mentions only condition (i). MacKay also mentions condition (ii), but not condition (iii). Neither author plugs in for ε and τ, or in any other way uses their distributions to infer bounds on the error accompanying their use of the evidence approximation.

Since by (i) P(γ | L) is sharply peaked about γ̂, one would expect that for (ii) to hold P(w, γ | L) must also be sharply peaked about γ̂.
Although this line of reasoning can be formalized, it turns out to be easier to prove the result using sufficiency condition (iii):

Theorem 3: If condition (iii) holds, then for all w such that P(w | L) > c > ε, for each component i of γ, P(w, γᵢ | L) must have a γᵢ-peak somewhere within δᵢ[1 + 2ε / (c - ε)] of (γ̂)ᵢ.

Proof: Condition (iii) with k = 1 tells us that P(w | L) - ∫_{γ̂-δ}^{γ̂+δ} dγ P(w, γ | L) < ε. Extending the integrals over the γ_{j≠i} gives P(w | L) - ∫_{(γ̂-δ)ᵢ}^{(γ̂+δ)ᵢ} dγᵢ P(w, γᵢ | L) < ε. From now on the i subscript on γ and δ will be implicit. We have ε > ∫_{γ̂+δ}^{γ̂+δ+r} dγ P(w, γ | L) for any scalar r > 0. Assume that P(w, γ | L) doesn't have a peak anywhere in [γ̂ - δ, γ̂ + δ + r]. Without loss of generality, assume also that P(w, γ̂ + δ | L) ≥ P(w, γ̂ - δ | L). These two assumptions mean that for any γ ∈ [γ̂ + δ, γ̂ + δ + r], the value of P(w, γ | L) exceeds the maximal value it takes on in the interval [γ̂ - δ, γ̂ + δ]. Therefore ∫_{γ̂+δ}^{γ̂+δ+r} dγ P(w, γ | L) ≥ (r / 2δ) × ∫_{γ̂-δ}^{γ̂+δ} dγ P(w, γ | L). This means that ∫_{γ̂-δ}^{γ̂+δ} dγ P(w, γ | L) < 2δε / r. But since P(w | L) < ε + ∫_{γ̂-δ}^{γ̂+δ} dγ P(w, γ | L), this means that P(w | L) < ε(1 + 2δ / r). So if P(w | L) > c > ε, r < 2δε / (c - ε), and there must be a peak of P(w, γ | L) within δ(1 + 2ε / (c - ε)) of γ̂. QED.

So for those w with non-negligible posterior, for ε small, the γ-peak of P(w, γ | L) ∝ P(L | w, γ) × P(w | γ) × P(γ) must lie essentially within the peak of P(γ | L). Therefore:

Theorem 4: Assume that P(w | γ₁) = exp(-γ₁ U(w)) / Z₁(γ₁) for some function U(.), P(L | w, γ₂) = exp(-γ₂ V(w, L)) / Z₂(γ₂, w) for some function V(., .), and P(γ) = P(γ₁) P(γ₂). (The Zᵢ act as normalization constants.)
Then if condition (iii) holds, for all w with non-negligible posterior the γ-solution to the equations

-U(w) + ∂_γ₁ [ln(P(γ₁)) - ln(Z₁(γ₁))] = 0
-V(w, L) + ∂_γ₂ [ln(P(γ₂)) - ln(Z₂(γ₂, w))] = 0

must lie within the γ-peak of P(γ | L).

Proof: P(w, γ | L) ∝ {P(γ₁) × P(γ₂) × exp[-γ₁ U(w) - γ₂ V(w, L)]} / {Z₁(γ₁) × Z₂(γ₂, w)}. For both i = 1 and i = 2, evaluate ∂_γᵢ {∫ dγ_{j≠i} P(w, γ | L)}, and set it equal to zero. This gives the two equations. Now define \"the γ-peak of P(γ | L)\" to mean a cube with i-component width δᵢ[1 + 2ε / (c - ε)], centered on γ̂, where having a \"non-negligible posterior\" means P(w | L) > c. Applying theorem 3, we get the result claimed. QED.

In particular, in MacKay's scenario, P(γ) is uniform, U(w) = W(w) = Σᵢ₌₁^N (wᵢ)², and V(w, L) = χ²(w, L). Therefore Z₁ and Z₂ are proportional to (γ₁)^(-N/2) and (γ₂)^(-m/2) respectively. This means that if the vector {γ₁, γ₂} = {N / [2W(w)], m / [2χ²(w, L)]} does not lie within the peak of the evidence for the MAP w, condition (iii) does not hold. That γ₁ / γ₂ must approximately equal [N χ²(w, L)] / [m W(w)] should not be too surprising. If we set the w-gradient of both the evidence-approximated and exact P(w | L) to zero, and demand that the same w, w', solves both equations, we get γ₁ / γ₂ = [(N + 2) χ²(w', L)] / [(m + 2) W(w')]. (Unfortunately, if one continues and evaluates ∂_wᵢ ∂_wⱼ P(w | L) at w', often one finds that it has opposite signs for the two posteriors - a graphic failure of the evidence approximation.)

It is not clear from the provided neural net data whether this condition is met in (MacKay 1992). However it appears that the corresponding condition is not met, for γ₁ at least, for the scenario in (Gull 1992) in which the evidence approximation is used with U(.) being the entropy. (See (Strauss et al. 1993, Wolpert et al. 1993).)
Since conditions (i) through (iii) are sufficient conditions, not necessary ones, this does not prove that Gull's use of evidence is invalid. (It is still an open problem to delineate the full iff for when the evidence approximation is valid, though it appears that matching of peaks as in theorem 3 is necessary. See (Wolpert et al. 1993).) However this does mean that the justification offered by Gull for his use of evidence is apparently invalid. It might also help explain why Gull's results were \"visually disappointing and ... clearly ... 'over-fitted'\", to use his terms.

The first equation in theorem 4 can be used to set restrictions on the set of w which both have non-negligible posterior and for which condition (iii) holds. Consider for example MacKay's scenario, where that equation says that N / 2W(w) must lie within the width of the evidence peak. If the evidence peak is sharp, this means that unless all w with non-negligible posterior have essentially the same W(w), condition (iii) can not hold for all of them.

Finally, if for some reason one wishes to know γ̂, theorem 4 can sometimes be used to circumvent the common difficulty of evaluating P(γ | L). To do this, one assumes that conditions (i) through (iii) hold. Then one finds any w with a non-negligible posterior (say by use of the evidence approximation coupled with approximations to P(γ | L)) and uses it in theorem 4 to find a γ which must lie within the peak of P(γ | L), and therefore must lie close to the correct value of γ̂.

To summarize, there might be scenarios in which the exact calculation of the quantity of interest is intractable, so that some approximation like evidence is necessary.
Alternatively, if one's choice of P(w | γ), P(γ), and P(L | w, γ) is poor, the evidence approximation would be useful if the error in that approximation somehow \"cancels\" error in the choice of distributions. However if one believes one's choice of distributions, and if the quantity of interest is P(w | L), then at a minimum one should check conditions (i) through (iii) before using the evidence approximation. When one is dealing with neural nets, one needn't even do that; the exact calculation is quicker and simpler than using the evidence approximation.

Acknowledgments

This work was done at the SFI and was supported in part by NLM grant F37 LM00011. I would like to thank Charlie Strauss and Tim Wallstrom for stimulating discussion.

References

Buntine, W., Weigend, A. (1991). Bayesian back-propagation. Complex Systems, 5, 603.

Gull, S.F. (1989). Developments in maximum entropy data analysis. In \"Maximum-entropy and Bayesian methods\", J. Skilling (Ed.). Kluwer Academic Publishers.

MacKay, D.J.C. (1992). Bayesian Interpolation. A Practical Framework for Backpropagation Networks. Neural Computation, 4, 415 and 448.

Strauss, C.E.M., Wolpert, D.H., Wolf, D.R. (1993). Alpha, evidence, and the entropic prior. In \"Maximum-entropy and Bayesian methods\", A. Mohammad-Djafari (Ed.). Kluwer Academic Publishers. In press.

Wolpert, D.H. (1992). A Rigorous Investigation of \"Evidence\" and \"Occam Factors\" in Bayesian Reasoning. SFI TR 92-03-13. Submitted.

Wolpert, D.H., Strauss, C.E.M., Wolf, D.R. (1993). On evidence and the marginalization of alpha in the entropic prior. In preparation.", "award": [], "sourceid": 716, "authors": [{"given_name": "David", "family_name": "Wolpert", "institution": null}]}