{"title": "Hyperparameters Evidence and Generalisation for an Unrealisable Rule", "book": "Advances in Neural Information Processing Systems", "page_first": 255, "page_last": 262, "abstract": null, "full_text": "Hyperparameters, Evidence and \n\nGeneralisation for an Unrealisable Rule \n\nGlenn Marion and David Saad \nglennyGed.ac.uk, D.SaadGed.ac.uk \n\nDepartment of Physics, University of Edinburgh, \n\nEdinburgh, EH9 3JZ, U.K. \n\nAbstract \n\nUsing a statistical mechanical formalism we calculate the evidence, \ngeneralisation error and consistency measure for a linear percep(cid:173)\ntron trained and tested on a set of examples generated by a non \nlinear teacher. The teacher is said to be unrealisable because the \nstudent can never model it without error. Our model allows us to \ninterpolate between the known case of a linear teacher, and an un(cid:173)\nrealisable, nonlinear teacher. A comparison of the hyperparameters \nwhich maximise the evidence with those that optimise the perfor(cid:173)\nmance measures reveals that, in the non-linear case, the evidence \nprocedure is a misleading guide to optimising performance. Finally, \nwe explore the extent to which the evidence procedure is unreliable \nand find that, despite being sub-optimal, in some circumstances it \nmight be a useful method for fixing the hyperparameters. \n\n1 \n\nINTRODUCTION \n\nThe analysis of supervised learning or learning from examples is a major field of \nresearch within neural networks. In general, we have a probabilistic1 teacher, which \nmaps an N dimensional input vector x to output Yt(x) according to some distri(cid:173)\nbution P(Yt I x). We are supplied with a data set v= ({Yt(xlJ), xlJ} : J.' = l..p) \ngenerated from P(Yt I x) by independently sampling the input distribution, P(x), \np times. One attempts to optimise a model mapping (a student), parameterised by \n\nlThis accommodates teachers with deterministic output corrupted by noise. 
\n\n\f256 \n\nGlenn Marion, David Saad \n\nsome vector w, with respect to the underlying teacher. The training error Ew (V) \nis some measure of the difference between the student and the teacher outputs over \nthe set V. Simply minimising the training error leads to the problem of over-fitting. \nIn order to make successful predictions out-with the set V it is essential to have \nsome prior preference for particular rules. Occams razor is an expression of our \npreference for the simplest rules which account for the data. Clearly Ew(V) is an \nunsatisfactory performance measure since it is limited to the training examples. \nVery often we are interested in the students ability to model a random example \ndrawn from P(Yt I x)P(x), but not necessarily in the training set, one measure \nof this performance is the generalisation error. It is also desirable to predict, or \nestimate, the level of this error. The teacher is said to be an unrealisable rule, for \nthe student in question, if the minimum generalisation error is non-zero. \nOne can consider the Supervised Learning Paradigm within the context of Bayesian \nInference. In particular MacKay [MacKay 92(a)] advocates the evidence procedure \nas a 'principled' method which, in some situations, does seem to improve perfor(cid:173)\nmance [Thodberg 93]. However, in others, as MacKay points out the evidence \nprocedure can be misleading [MacKay 92(b )]. \nIn this paper we do not seek to comment on the validity of of the evidence procedure \nas an approximation to Hierarchical Bayes (see for example [Wolpert and Strauss \n94]). Rather, we ask which performance measures do we seek to optimise and under \nwhat conditions will the evidence procedure optimise them? Theoretical results \nhave been obtained for a linear percept ron trained on data produced by a linear \nperceptron [Bruce and Saad 94]. They suggest that the evidence procedure is a \nuseful guide to optimising the learning algorithm's performance. 
\nIn what follows we examine the evidence procedure for the case of a linear perceptron \nlearning a non linear teacher. In the next section we review the Bayesian scheme \nand define the evidence and the relevant performance measures. In section 3 we \nintroduce our student and teacher and discuss the calculation. Finally, in section 4 \nwe examine the extent to which the evidence procedure optimises performance. \n\n2 BAYESIAN FORMALISM \n\n2.1 THE EVIDENCE \n\nIf we take Ew(V) to be the usual sum squared error and assume that our data is \ncorrupted by Gaussian noise with variance 1/2/3 then the probability, or likelihood, \nofthe data(V) being produced given the model wand /3 is P(D 1/3, w) ex: e-~Ew(1)). \nIn order to incorporate Occams Razor we also assume a prior distribution on the \nteacher rules, that is, we believe a priori in some rules more strongly than others. \nSpecifically we believe that pew I ,) ex: e-\"'(C(w). MUltiplying the likelihood by \nthe prior we obtain the post training or student distribution2 P( w I V\", /3) ex: \ne-~Ew(1))-''YC(w). It is clear that the most probable model w\u00b7 is given by minimising \nthe composite cost function /3Ew(V)+,C(w) with respect to the weights (w). This \nformalises the trade off between fitting the data and minimising student complexity. \nIn this sense the Bayesian viewpoint coincides with the usual backprop standpoint. \n\n2Integrating this over f3 and 'Y gives us the posterior P(w I 1\u00bb. \n\n\fHyperparameters, Evidence and Generalisation for an Unrealisable Rule \n\n257 \n\nIn fact, it should be noted that stochastic minimisation can also give rise to the \nsame post training distribution [Seung et aI92). The parameters (3 and, are known \nas the hyperparameters. Here we consider C(w) = wtw in which case, is termed \nthe weight decay. \nThe evidence is the normalisation constant in the above expression for the post \ntraining distribution. 
\n\nP(D | \gamma, \beta) = \int \prod_j dw_j P(D | \beta, w) P(w | \gamma) \n\nThat is, the probability of the data set D given the hyperparameters. The evidence procedure fixes the hyperparameters to the values that maximise this probability. \n\n2.2 THE PERFORMANCE MEASURES \n\nMany performance measures have been introduced in the literature (see e.g., [Krogh and Hertz 92] and [Seung et al 92]). Here, we consider the squared difference between the average (over the post-training distribution) of the student output \langle y_s(x) \rangle_w and that of the teacher, y_t(x), averaged over all possible test questions and teacher outputs, P(y_t, x), and finally over all possible sets of data, D: \n\n\epsilon_g = \langle (y_t(x) - \langle y_s(x) \rangle_w)^2 \rangle_{P(x,y_t),D} \n\nThis is equivalent to the generalisation error given by Krogh and Hertz. \nAnother factor we can consider is the variance of the output over the student distribution, \langle (y_s(x) - \langle y_s(x) \rangle_w)^2 \rangle_{w,P(x)}. This gives us a measure of the confidence we should have in our post-training distribution and could possibly be calculated if we could estimate the input distribution P(x). Here we extend Bruce and Saad's definition [Bruce and Saad 94] of the consistency measure \delta_c to include unrealisable rules by adding the asymptotic error \epsilon_g^\infty = \lim_{p \to \infty} \epsilon_g: \n\n\delta_c = \langle (y_s(x) - \langle y_s(x) \rangle_w)^2 \rangle_{w,P(x),D} - \epsilon_g + \epsilon_g^\infty \n\nWe regard \delta_c = 0 as optimal since then the variance over our student distribution is an accurate prediction of the decaying part of the generalisation error. \nWe can consider both these performance measures as objective functions measuring the student's ability to mimic the underlying teacher. Clearly, they can only be calculated in theory and, perhaps, estimated in practice. In contrast, the evidence is only a function of our assumptions and the data, and the evidence procedure is, therefore, a practical method of setting the hyperparameters. \n\n3 THE MODEL \n\nIn our model the student is simply a linear perceptron. 
The output for an input vector x^\mu is given by y_s^\mu = w \cdot x^\mu / \sqrt{N}. The examples, against which the student is trained and tested, are produced by sampling the input distribution, P(x), and then generating outputs from the distribution \n\nP(y_t | x) = \sum_n P(y_t | x, n) P(x | n) P_n / \sum_{n'} P(n') P(x | n') \n\n\f \n\nFigure 1: A 2-teacher in 1D: the average output \langle y_t \rangle_{P(y_t|x)} (i) for D_w = 0, (ii) for D_w > 0 (\sigma_{x_1} = \sigma_{x_2}) and (iii) with D_w > 0 (\sigma_{x_1} \neq \sigma_{x_2}). \n\nwhere P(y_t | x, n) = N(w^n \cdot x / \sqrt{N}, \sigma_n^2)[3]. The evidence can then be written \ln P(D | \gamma, \beta) = -N f(D), where the f is analogous to a free energy in statistical physics, \n\n-f(D) = (1/2) \ln(\lambda/\pi) + (\alpha/2) \ln(\beta/\pi) + (1/2N) \ln \det g + (1/2) \ln 2\pi + (1/N) \rho_j g_{jk} \rho_k - e \n\nand \n\ng_{jk}^{-1} = \sum_{\mu=1}^{p} A_{jk}^{\mu} + \lambda \delta_{jk}, \qquad \alpha = p/N. \n\nHere we are using the convention that summations are implied where repeated indices occur. \n\n[3] Where N(x, \sigma^2) denotes a normal distribution with mean x and variance \sigma^2. \n\n\f \n\nThe performance measures for this model are \n\n\epsilon_g = \langle \sigma_{x_n}^2 P_n \{ w_j^n w_j^n - 2 w_j^n \langle w_j \rangle_w + \langle w_j \rangle_w \langle w_j \rangle_w \} \rangle_D \n\n\delta_c = (1/N) \langle \mathrm{tr}\, g \rangle_D - \epsilon_g + \epsilon_g^\infty \n\nwhere \langle w_j \rangle_w = \rho_k g_{kj} and \sigma_{x,eff}^2 = P_n \sigma_{x_n}^2. \n\nIn order to pursue the calculation we consider the average of f(D) over all possible data sets, just as, earlier, we defined our performance measures as averages over all data sets. This is somewhat artificial, as we would normally be able to calculate f(D) and be interested in the generalisation error for our learning algorithm given a particular instance of the data. However, here we consider the thermodynamic limit (i.e., N, p \to \infty s.t. \alpha = p/N = const.) in which, due to our sampling assumptions, the behaviour for typical examples of D coincides with that of the average. 
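For a Gaussian student of this kind the evidence can also be evaluated directly in weight space. The sketch below is our illustration, not the paper's calculation: it assumes numpy, the normalisations P(D | \beta, w) = (\beta/\pi)^{p/2} e^{-\beta E_w(D)} and P(w | \gamma) = (\gamma/\pi)^{N/2} e^{-\gamma w \cdot w}, and a tanh teacher as a hypothetical stand-in for the 2-teacher above. It computes log P(D | \gamma, \beta) in closed form and maximises it over a grid, which is the evidence procedure in its simplest form.

```python
import numpy as np

def log_evidence(X, y, beta, gamma):
    """Closed-form log P(D | gamma, beta) for the linear student y_s = w.x/sqrt(N).

    Assumed conventions (ours, not necessarily the paper's exact ones):
    P(D | beta, w) = (beta/pi)^(p/2) exp(-beta * E_w), E_w = ||y - X w / sqrt(N)||^2,
    P(w | gamma)   = (gamma/pi)^(N/2) exp(-gamma * w.w).
    """
    p, N = X.shape
    Phi = X / np.sqrt(N)
    A = beta * Phi.T @ Phi + gamma * np.eye(N)   # quadratic form of the posterior
    b = beta * Phi.T @ y
    _, logdetA = np.linalg.slogdet(A)
    return (0.5 * p * np.log(beta / np.pi) + 0.5 * N * np.log(gamma)
            - 0.5 * logdetA + b @ np.linalg.solve(A, b) - beta * y @ y)

# Toy unrealisable rule: a tanh teacher (a stand-in for the mixture teacher).
rng = np.random.default_rng(1)
N, p, noise = 20, 60, 0.3
w0 = rng.standard_normal(N)
w0 /= np.linalg.norm(w0)                          # normalise the teacher, |w0| = 1
X = rng.standard_normal((p, N))
y = np.tanh(X @ w0 / np.sqrt(N)) + noise * rng.standard_normal(p)

# The evidence procedure, crudely: maximise log P(D | gamma, beta) over a grid.
grid = [(b, g) for b in np.logspace(-1, 3, 25) for g in np.logspace(-3, 1, 25)]
beta_ev, gamma_ev = max(grid, key=lambda bg: log_evidence(X, y, *bg))
print(beta_ev, gamma_ev)
```

Exponentiating the closed form recovers the Gaussian integral over w analytically; a free-energy representation packages the same quantity per weight as an exponent scaling with N.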
Details of the calculation will be published elsewhere [Marion and Saad 95]. \n\n4 RESULTS AND DISCUSSION \n\nWe can now examine the evidence and the performance measures for our unlearnable problem. We note that in two limits we recover the learnable, linear-teacher case: specifically, if the probability of picking one of the component teachers is zero, or if both component teacher vectors are aligned. In what follows we set P_1 = P_2 and normalise the components of the teacher such that |w^n| = 1. \nFirstly let us consider the performance measures. The asymptotic value of both \epsilon_g and |\delta_c| for large \alpha is P_1 P_2 \sigma_{x_1}^2 \sigma_{x_2}^2 D_w / \sigma_{x,eff}^2. This is the minimum generalisation error attainable and reflects the effective noise level due to the mismatch between student and teacher. \nWe note here that the generalisation error is a function of \lambda = \gamma/\beta rather than of \beta and \gamma independently. Figure 2a shows the generalisation error plotted against \alpha. The addition of unlearnability (D_w > 0) has a similar effect to the addition of noise on the examples. The appearance of the hump can be easily understood: if there is no noise, or \lambda is large enough, then there is a steady reduction in \epsilon_g. However, if this is not so, then for small \alpha the student learns this effective noise and the generalisation error increases with \alpha. As the student gets more examples the effects of the noise begin to average out and the student starts to learn the rule. The point at which the generalisation error starts to decrease is influenced by the effective noise level and the prior constraint. Figure 2b shows the absolute value of the consistency measure vs \alpha for non-optimal \beta. Again we see that unlearnability acts as an effective noise. For a few examples, with \lambda small or with large effective noise, the student distribution is narrowed until \delta_c is zero. 
However, the generalisation error is still increasing (as described above), and |\delta_c| increases to a local maximum; it then asymptotically tends to \epsilon_g^\infty. If there is no noise, or \lambda is large enough, then |\delta_c| steadily reduces as the number of examples increases. \nWe now examine the evidence procedure. Firstly we define \beta_{ev}(\gamma) and \gamma_{ev}(\beta) to be the hyperparameters which maximise the evidence. The evidence procedure \n\n\f \n\nFigure 2: The performance measures. Graph (a) shows \epsilon_g for finite \lambda: a(i) and a(ii) are the learnable case, with noise in the latter case; a(iii) shows that the effect of adding unlearnability is qualitatively the same as adding noise. Graph (b) shows the modulus of the consistency error vs \alpha: curves b(i) and b(ii) are the learnable case without and with noise respectively; curve b(iii) is an unlearnable case with the same noise level. \n\npicks the point in hyperparameter space where these curves coincide. We denote the asymptotic values of \beta_{ev}(\gamma) and \gamma_{ev}(\beta) in the limit of large \alpha by \beta_\infty and \gamma_\infty respectively. \nIn the linear case (D_w = 0) the evidence procedure assignments of the hyperparameters (for finite \alpha) coincide with \beta_\infty and \gamma_\infty, and also optimise \epsilon_g and \delta_c, in agreement with [Bruce and Saad 94]. This is shown in Figure 3a, where we plot the \beta which optimises the evidence (\beta_{ev}), the consistency measure (\beta_{\delta_c}) and the generalisation error (\beta_{\epsilon_g}) versus \gamma. 
The point at which the three curves coincide is the point in the \beta-\gamma plane identified by the evidence procedure. However, we note here that if one of the hyperparameters is poorly determined, then maximising the evidence with respect to the other is a misleading guide to optimising performance, even in the linear case. \nThe results for an unrealisable rule in the linear regime (D_w > 0, \sigma_{x_1} = \sigma_{x_2}) are similar to the learnable case, but with an increased noise due to the unlearnability. The evidence procedure still optimises performance. \nIn the non-linear regime (D_w > 0, \sigma_{x_1} \neq \sigma_{x_2}) the evidence procedure fails to minimise either performance measure. This is shown in Figure 3b, where the evidence procedure point does not lie on \beta_{\epsilon_g}(\gamma) or \beta_{\delta_c}(\gamma). Indeed, its hyperparameter assignments do not coincide with \beta_\infty and \gamma_\infty but are \alpha-dependent. \nHow badly does the evidence procedure fail? We define the percentage degradation in generalisation performance as \kappa = 100 (\epsilon_g(\lambda_{ev}) - \epsilon_g^{opt}) / \epsilon_g^{opt}, where \lambda_{ev} is the evidence procedure assignment and \epsilon_g^{opt} is the optimal generalisation error with respect to \lambda. This is plotted in Figure 4a. We also define \kappa_\delta = 100 |\delta_c(\lambda_{ev})| / \epsilon_g(\lambda_{ev}). This measures the error in using the variance of the \n\n\f \n\nFigure 3: The evidence procedure: optimal \beta vs \gamma. 
In both graphs, (i) is the evidence (\beta_{ev}), (ii) the generalisation error (\beta_{\epsilon_g}) and (iii) the consistency measure (\beta_{\delta_c}). The point which the evidence procedure picks in the linear case is that where all three curves coincide, whereas in the non-linear case it coincides only with \beta_{ev}. \n\npost-training distribution to estimate the generalisation error, as a percentage of the generalisation error itself. Examples of this quantity are plotted in Figure 4b. \nThere are three important points to note concerning \kappa and \kappa_\delta. Firstly, the larger the deviation from a linear rule, the greater is the error. Secondly, it is the magnitude of the effective noise due to unlearnability, relative to the real noise, which determines this error. In other words, if the real noise is large enough to swamp the non-linearity of the rule then the evidence procedure will not be very misleading. Finally, the magnitude of the error for relatively large deviations from linearity is only a few percent, and thus the evidence procedure might well be a reasonable, if not optimal, method for setting the hyperparameters. However, clearly it would be preferable to improve our student space to enable it to model the teacher. \n\n5 CONCLUSION \n\nWe have examined the generalisation error, the consistency measure and the evidence procedure within a model which allows us to interpolate between a learnable and an unlearnable scenario. We have seen that the unlearnability acts like an effective noise on the examples. Furthermore, we have seen that for a linear student the evidence procedure breaks down, in that it fails to optimise performance, when the teacher output is non-linear. However, even for relatively large deviations of the teacher from linearity the evidence procedure is close to optimal. 
\nBayesian methods, such as the evidence procedure, are based on the assumption that the student or hypothesis space contains the teacher generating the data. In our case, in the non-linear regime, this is clearly not true, and so it is perhaps not surprising that the evidence procedure is sub-optimal. Whether or not such a breakdown of the evidence procedure is a generic feature of a mismatch between the hypothesis space and the teacher is a matter for further study. \n\n\f \n\nFigure 4: The relative degradation in performance compared to the optimal when using the evidence procedure to set the hyperparameters. Graph (a) shows the percentage degradation in generalisation performance \kappa: a(i) has D_w = 1 with the real noise level \sigma = 1; a(ii) has this noise level reduced to \sigma = 0.1; and a(iii) has increased non-linearity, D_w = 3, and \sigma = 1. Graph (b) shows the error made in predicting the generalisation error from the variance of the post-training distribution, as a percentage of the generalisation error itself, \kappa_\delta: b(i) and b(ii) have the same parameter values as a(i) and a(ii), whilst b(iii) has D_w = 3 and \sigma = 0.1. \n\nAcknowledgments \n\nWe are very grateful to Alastair Bruce and Peter Sollich for useful discussions. GM is supported by an E.P.S.R.C. studentship. \n\nReferences \nBruce, A.D. and Saad, D. (1994) Statistical mechanics of hypothesis evaluation. J.
of Phys. A: Math. Gen. 27:3355-3363 \nKrogh, A. and Hertz, J. (1992) Generalisation in a linear perceptron in the presence of noise. J. of Phys. A: Math. Gen. 25:1135-1147 \nMacKay, D.J.C. (1992a) Bayesian interpolation. Neural Comp. 4:415-447 \nMacKay, D.J.C. (1992b) A practical Bayesian framework for backprop networks. Neural Comp. 4:448-472 \nMarion, G. and Saad, D. (1995) A statistical mechanical analysis of a Bayesian inference scheme for an unrealisable rule. To appear in J. of Phys. A: Math. Gen. \nSeung, H.S., Sompolinsky, H. and Tishby, N. (1992) Statistical mechanics of learning from examples. Phys. Rev. A 45:6056-6091 \nThodberg, H.H. (1994) Bayesian backprop in action: pruning, ensembles, error bars and application to spectroscopy. Advances in Neural Information Processing Systems 6:208-215. Cowan et al. (Eds.), Morgan Kaufmann, San Mateo, CA \nWolpert, D.H. and Strauss, C.E.M. (1994) What Bayes has to say about the evidence procedure. To appear in Maximum Entropy and Bayesian Methods. G. Heidbreder (Ed.), Kluwer. \n\n\f", "award": [], "sourceid": 952, "authors": [{"given_name": "Glenn", "family_name": "Marion", "institution": null}, {"given_name": "David", "family_name": "Saad", "institution": null}]}