{"title": "Experiences with Bayesian Learning in a Real World Application", "book": "Advances in Neural Information Processing Systems", "page_first": 964, "page_last": 970, "abstract": "", "full_text": "Experiences with Bayesian Learning \n\n\u2022 In a \n\nReal World Application \n\nPeter Sykacek, Georg Dorffner \n\nAustrian Research Institute for Artificial Intelligence \n\nSchottengasse 3, A-10ID Vienna Austria \n\npeter, georg@ai.univie.ac.at \n\nInstitute for Neurophysiology at the University Vienna \n\nPeter Rappelsberger \n\nWahringer StraBe 17, A-lOgO Wien \nPeter.Rappelsberger@univie.ac.at \n\nJosef Zeitlhofer \n\nDepartment of Neurology at the AKH Vienna \n\nWahringer Giirtel 18-20, A-lOgO Wien \n\nJosef.Zeitlhofer@univie.ac.at \n\nAbstract \n\nThis paper reports about an application of Bayes' inferred neu(cid:173)\nral network classifiers in the field of automatic sleep staging. The \nreason for using Bayesian learning for this task is two-fold. First, \nBayesian inference is known to embody regularization automati(cid:173)\ncally. Second, a side effect of Bayesian learning leads to larger \nvariance of network outputs in regions without training data. This \nresults in well known moderation effects, which can be used to \ndetect outliers. \nIn a 5 fold cross-validation experiment the full \nBayesian solution found with R. Neals hybrid Monte Carlo algo(cid:173)\nrithm, was not better than a single maximum a-posteriori (MAP) \nsolution found with D.J. MacKay's evidence approximation. In a \nsecond experiment we studied the properties of both solutions in \nrejecting classification of movement artefacts. \n\n\fExperiences with Bayesian Learning in a Real World Application \n\n965 \n\n1 \n\nIntroduction \n\nSleep staging is usually based on rules defined by Rechtschaffen and Kales (see [8]). \nRechtschaffen and Kales rules define 4 sleep stages, stage one to four, as well as rapid \neye movement (REM) and wakefulness. In [1] J. Bentrup and S. 
Ray report that every year nearly one million US citizens consult their physicians concerning their sleep. Since sleep staging is a tedious task (one all-night recording on average takes about 3 hours to score manually), much effort has been spent on designing automatic sleep stagers. \n\nSleep staging is a classification problem which has been solved using classical statistical techniques or techniques that emerged from the field of artificial intelligence (AI). Among the classical techniques, especially the k nearest neighbor technique was used. In [1] J. Bentrup and S. Ray report that the classical technique outperformed their AI approaches. Among techniques from the field of AI, researchers used inductive learning to build tree-based classifiers (e.g. ID3, C4.5), as reported by M. Kubat et al. in [4]. Neural networks have also been used to build a classifier from training examples. Among those who used multi-layer perceptron networks to build the classifier, the work of N. Schaltenbrand et al. seems most interesting. In [10] they use a separate network to refuse classification of too distant input vectors. The performance usually reported is in the range of 75 to 85 percent. \n\nWhich enhancements to these approaches can be made to get a reliable system with hopefully better performance? According to S. Roberts et al. in [9], outlier detection is important to get reliable results in a critical (e.g. medical) environment. To get reliable results one must refuse classification of dubious inputs. Those inputs are marked separately for further inspection by a human expert. To be able to detect such dubious inputs, we use Bayesian inference to calculate a distribution over the neural network weights. This approach automatically incorporates the calculation of confidence for each network estimate. Bayesian inference has the further advantage that regularization is part of the learning algorithm.
Additional methods like a weight decay penalty and cross-validation for decay parameter tuning are no longer needed. Bayesian inference for neural networks was investigated by, among others, D.J. MacKay (see [5]), Thodberg (see [11]) and Buntine and Weigend (see [3]). \n\nThe aim of this paper is to study how Bayesian inference leads to probabilities for classes, which together with doubt levels allow one to refuse classification of outliers. As we are interested in evaluating the resulting performance, we use a comparative method on the same data set and a significance test, such that the effect of the method can easily be evaluated. \n\n2 Methods \n\nIn this section we give a short description of the inference techniques used to perform the experiments. We have used two approaches using neural networks as classifiers and an instance-based approach in order to make the performance estimates comparable to other methods. \n\n2.1 Architecture for polychotomous classification \n\nFor polychotomous classification problems usually a 1-of-c target coding scheme is used. Usually it is sufficient to use a network architecture with one hidden layer. In [2], pp. 237-240, C. Bishop gives a general motivation for the softmax data model, which should be used if one wants the network outputs to be probabilities for classes. If we assume that the class conditional densities, p(z | Ck), of the hidden unit activation vector, z, are from the general family of exponential distributions, then using the transformation in (1) allows one to interpret the network outputs as probabilities for classes. This transformation is known as the normalized exponential or softmax activation function. \n\np(Ck | z) = exp(ak) / Σ_{k'} exp(ak')    (1) \n\nIn (1) the value ak is the value at output node k before applying the softmax activation.
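As a concrete illustration (ours, not the paper's code), the normalized exponential in (1) can be sketched in a few lines of Python; the activation values below are hypothetical:

```python
import math

def softmax(activations):
    """Normalized exponential (softmax): maps output activations a_k
    to class probabilities p(C_k | z) = exp(a_k) / sum_k' exp(a_k')."""
    # Subtracting the maximum activation avoids overflow in exp();
    # the resulting probabilities are unchanged.
    m = max(activations)
    exps = [math.exp(a - m) for a in activations]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical output activations for a 5-class (sleep stage) problem.
probs = softmax([2.0, 1.0, 0.5, 0.1, -1.0])
assert abs(sum(probs) - 1.0) < 1e-12  # valid probabilities sum to one
```

Because the transformation only rescales the exponentiated activations, the largest activation always receives the largest probability.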
Softmax transformation of the activations in the output layer is used for both network approaches in this paper. \n\n2.2 Bayesian Inference \n\nIn [6] D.J. MacKay uses Bayesian inference and marginalization to get moderated probabilities for classes in regions where the network is uncertain about the class label. In conjunction with doubt levels this allows one to suppress classification of such patterns. A closer investigation of this approach showed that marginalization leads to moderated probabilities, but the degree of moderation heavily depends on the direction in which we move away from the region with sufficient training data. Therefore one has to be careful about whether the moderation effect should be used for outlier detection. \n\nA Bayesian solution for neural networks is a posterior distribution over weight space, calculated via Bayes' theorem using a prior over weights: \n\np(w | D) = p(D | w) p(w) / p(D)    (2) \n\nIn (2), w is the weight vector of the network and D represents the training data. Two different possibilities are known to calculate the posterior in (2). In [5] D.J. MacKay derives an analytical expression assuming a Gaussian distribution. In [7] R. Neal uses a hybrid Monte Carlo method to sample from the posterior. For one input pattern, the posterior over weight space will lead to a distribution of network outputs. \n\nFor a classification problem, following MacKay [6], the network estimate is calculated by marginalization over the output distribution: \n\nP(C1 | x, D) = ∫ P(C1 | x, w) p(w | D) dw = ∫ y(x, w) p(w | D) dw    (3) \n\nIn general, the distribution over output activations will have small variance in regions well represented in the training data and large variance everywhere else.
The reason for that is the influence of the likelihood term p(D | w), which forces the network mapping to lie close to the desired one in regions with training data, but which has no influence on the network mapping in regions without training data. At least for generalized linear models applied to regression, this property is quantifiable. In [12] C. Williams et al. showed that the error bar is proportional to the inverse input data density, p(x)^-1. A similar relation is also plausible for the output activation in classification problems. \n\nDue to the nonlinearity of the softmax transformation, marginalization will moderate probabilities for classes. Moderation will be larger in regions with large variance of the output activation. Compared to a decision made with the most probable weight, the network guess for the class label will be less certain. This moderation effect allows one to reject classification of outlying patterns. \n\nSince the above integral cannot be solved analytically for classification problems, there are two possibilities to solve it. In [6] D.J. MacKay uses an approximation. Using hybrid Monte Carlo sampling as an implementation of Bayesian inference (see R. Neal in [7]), there is no need to perform the integration analytically. The hybrid Monte Carlo algorithm samples from the posterior, and the integral is calculated as a finite sum: \n\nP(C1 | x, D) ≈ (1/L) Σ_{i=1}^{L} y(x, wi)    (4) \n\nAssuming that the posterior over weights is represented exactly by the sampled weights, there is no need to limit the number of hidden units if a correct (scaled) prior is used. Consequently, in the experiments the network size was chosen to be large. We used 25 hidden units. Implementation details of the hybrid Monte Carlo algorithm may be found in [7].
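A minimal sketch of the finite-sum marginalization in (4) combined with a doubt level: outputs are averaged over posterior weight samples, and a pattern is rejected when the moderated winning probability falls below a threshold. This is our illustration, not the paper's implementation; the toy linear "network", the synthetic weight samples and the 0.6 doubt threshold are all hypothetical (the real model is a 10-25-5 MLP with weights sampled by hybrid Monte Carlo).

```python
import math
import random

def softmax(a):
    m = max(a)
    e = [math.exp(v - m) for v in a]
    s = sum(e)
    return [v / s for v in e]

def net_output(x, w):
    """Toy stand-in for y(x, w): a linear map followed by softmax."""
    acts = [sum(wi * xi for wi, xi in zip(row, x)) for row in w]
    return softmax(acts)

def moderated_probs(x, weight_samples):
    """Equation (4): average the outputs over L posterior weight samples."""
    L = len(weight_samples)
    k = len(weight_samples[0])  # number of classes
    p = [0.0] * k
    for w in weight_samples:
        for j, yj in enumerate(net_output(x, w)):
            p[j] += yj / L
    return p

def classify_with_doubt(x, weight_samples, doubt=0.6):
    """Reject patterns whose moderated winning probability is below
    the doubt threshold; otherwise return the winning class index."""
    p = moderated_probs(x, weight_samples)
    best = max(range(len(p)), key=p.__getitem__)
    return best if p[best] >= doubt else None  # None = rejected

random.seed(0)
# Hypothetical posterior sample: 20 weight matrices for a 3-class,
# 2-input toy problem, centred on a mapping that favours class 0.
samples = [[[2.0 + random.gauss(0, 0.3), 0.0],
            [0.0, 1.0 + random.gauss(0, 0.3)],
            [-1.0, -1.0]] for _ in range(20)]
print(classify_with_doubt([3.0, 0.1], samples))  # prints 0 (confident)
print(classify_with_doubt([0.0, 0.0], samples))  # prints None (uniform
                                                 # probabilities, rejected)
```

Because the softmax is applied before averaging, sampled weights that disagree produce a moderated (flatter) mean, which is exactly what drives the rejection of outlying inputs.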
\n\n2.3 The Competitor \n\nThe classifier, used to give performance estimates to compare to, is built as a two \nlayer perceptron network with softmax transformation applied to the outputs. As \nan error function we use the cross entropy error including a consistent weight decay \npenalty, as it is e.g. proposed by C. Bishop in [2], pp. 338. The decay parameters \nare estimated with D.J. MacKay's evidence approximation ( see [5] for details). \nNote that the restriction of D.J. MacKay's implementation of Bayesian learning, \nwhich has no solution to arrive at moderated probabilities in l-of-c classification \nproblems, do not apply here since we use only one MAP value. The key problem \nwith this approach is the Gaussian approximation of the posterior over weights, \nwhich is used to derive the most probable decay parameters. This approximation is \ncertainly only valid if the number of network parameters is small compared to the \nnumber of training samples. One consequence is, that the size of the network has \nto be restricted . Our model uses 6 hidden units. \n\nTo make the performance of the Bayes inferred classifier also comparable to other \nmethods, we decided to include performance estimates of a k nearest neighbor \nalgorithm. This algorithm is easy to implement and from [1] we have some evidence \nthat its performance is good. \n\n3 Experiments and Results \n\nIn this sect.ion we discuss the results of a sleep staging experiment based on the \nt.echniques described in the \"Methods\" section. \n\n3.1 Data \n\nAll experiments are performed with spectral features calculated from a database of 5 \ndifferent healthy subjects. All recordings were scored according to the Rechtschaffen \n& Kales rules. The data pool consisted from data calculated for all electrodes \n\n\f968 \n\nP. Sykacek, G. Doif.fner, P. Rappelsberger and J. 
Zeitlhofer \n\navailable, which were horizontal eye movement, vertical eye movement and 18 EEG electrodes placed with respect to the international 10-20 system. \n\nThe data were transformed into the frequency domain. We used power density values as well as coherency between different electrodes, which is a correlation coefficient expressed as a function of frequency, as input features. All data were transformed to zero mean and unit variance. From the resulting feature space we selected 10 features, which were used as inputs for classification. Feature selection was done with a suboptimal search algorithm which used the performance of a k nearest neighbor classifier for evaluation. We used more than 2300 samples during training and about 580 for testing. \n\n3.2 Analysis of Both Classifiers \n\nThe analysis of both classifiers described in the \"Methods\" section should reveal whether, besides good classification performance, the Bayes-inferred classifier is also capable of refusing outlying test patterns. Increasing the doubt level should lead to better results for the classifier trained by Bayesian inference if the test data contain outlying patterns. We performed two experiments. During the first experiment we calculated results from a 5-fold cross-validation, where training is done with 4 subjects and tests are performed with one independent test person. In a second test we examine the differences of both algorithms on patterns which are definitely outliers. We used the same classifiers as in the first experiment. Test patterns for this experiment were classified movement artefacts, which should not be classified as one of the sleep stages. \n\nThe classifier used in conjunction with Bayesian inference was a 2-layer neural network with 10 inputs, 25 hidden units with sigmoid activation and five output units with softmax activation. The large number of hidden units is motivated by the results reported by R.
Neal in [7]. R. Neal studied the properties of neural networks in a Bayesian framework when using Gaussian priors over weights. He concluded that there is no need to limit the complexity of the network when using a correct Bayesian approach. The standard deviation of the Gaussian prior is scaled by the number of hidden units. \n\nFor the comparative approach we used a neural network with 10 inputs, 6 hidden units and 5 outputs with softmax activation. Optimization was done via the BFGS algorithm (see C. Bishop in [2]) with automatic weight decay parameter tuning (D.J. MacKay's evidence approximation). As described in the \"Methods\" section, the smaller network used here is motivated by the Gaussian approximation of the posterior over weights, which is used in the expression for the most probable decay parameters. \n\nThe third result is achieved with a k nearest neighbor classifier with k set to three. \n\nAll results are summarized in table 1. Each column summarizes the results achieved with one of the algorithms and a certain doubt level during the cross-validation run. As the k nearest neighbor classifier gives only coarse probability estimates, we give only the performance estimate when all test patterns are classified. \n\nAn examination of table 1 shows that the differences between the MAP solution and the Bayesian solution are extremely small. Consequently, using a t-test, the null hypothesis could not be rejected at any reasonable significance level. On the other hand, compared to the Bayesian solution, the performance of the k nearest neighbor classifier is significantly lower (the significance level is 0.001). \n\nTable 1: Classification Performance \n\nMAP \nDoubt Cases:  0      5%     10%    15% \nMean Perf.:   78.6%  80.4%  81.6%  83.2% \nStd. Dev.:    9.1%   9.4%   9.4%   9.1% \n\nBayes \nDoubt Cases:  0      5%     10%    15% \nMean Perf.:   78.4%  80.2%  82.2%  83.6% \nStd. Dev.:    8.6%   9.0%   9.4%   9.1% \n\nk nearest neighbor \nDoubt Cases:  0      5%     10%    15% \nMean Perf.:   74.6%  -      -      - \nStd. Dev.:    8.4%   -      -      - \n\nTable 2: Rejection of Movement Periods \n\nMethod  recognized outliers (No.)  % \nMAP     0                          0% \nMAP     1                          7.7% \nMAP     2                          15.4% \nMAP     0                          0% \nMAP     1                          7.7% \nBayes   1                          7.7% \nBayes   6                          46.2% \nBayes   5                          38.5% \nBayes   5                          38.5% \nBayes   3                          23.1% \n\nThe last experiment revealed that both training algorithms lead to comparable performance estimates when clean data is used. When using the classifier in practice there is no guarantee that the data are clean. One common problem of all-night recordings are the so-called movement periods, which are periods with muscle activity due to movements of the sleeping subject. During a second experiment we tried to assess the robustness of both neural classifiers against such inputs. During this experiment we used a fixed doubt level, for which approximately 5% of the clean test data from the last experiment were rejected. With this doubt level we classified 13 movement periods, which should not be assigned to any of the other stages. The numbers of correctly refused outlying patterns are shown in table 2. Analysis of the results with a t-test showed a significantly higher rate of removed outliers for the full Bayesian approach. Nevertheless, as the number of misclassified outliers is large, one has to be careful in using this side effect of Bayesian inference. \n\n4 Conclusion \n\nUsing Bayesian inference for neural network training is an approach which leads to better classification results compared with simpler training procedures. Compared with the \"one MAP\" solution, we observed significantly larger reliability in detecting dubious patterns.
The large amount of remaining misclassified patterns, which were obviously outlying, shows that we should not rely blindly on the moderating effect of marginalization. Despite the large amount of time required to calculate the solution, Bayesian inference has relevance for practical applications. On one hand, the Bayesian solution shows good performance. But the main reason is the ability to encode a validity region of the model into the solution. Compared to all methods which do not aim at a predictive distribution, this is a clear advantage of Bayesian inference. \n\nAcknowledgements \n\nWe want to acknowledge the work of R. Neal from the Departments of Statistics and Computer Science at the University of Toronto, who made his implementation of hybrid Monte Carlo sampling for Bayesian inference available electronically. His software was used to calculate the full Bayes-inferred classification results. We also want to express gratitude to S. Roberts from Imperial College London, one of the partners in the ANNDEE project. His work and his persistence in insisting on confidence measures for network decisions had a large positive impact on our work. \n\nThis work was sponsored by the Austrian Federal Ministry of Science and Transport. It was done in the framework of the BIOMED 1 concerted action ANNDEE, financed by the European Commission, DG XII. \n\nReferences \n\n[1] J.A. Bentrup and S.R. Ray. An examination of inductive learning algorithms for the classification of sleep signals. Technical Report UIUCDCS-R-93-1792, Dept. of Computer Science, University of Illinois, Urbana-Champaign, 1993. \n\n[2] C. M. Bishop. Neural Networks for Pattern Recognition. Clarendon Press, Oxford, 1995. \n\n[3] W. L. Buntine and A. S. Weigend. Bayesian back-propagation. Complex Systems, 5:603-643, 1991. \n\n[4] M. Kubat, G. Pfurtscheller, and D. Flotzinger.
Discrimination and classification using both binary and continuous variables. Biological Cybernetics, 70:443-448, 1994. \n\n[5] D. J. C. MacKay. Bayesian interpolation. Neural Computation, 4:415-447, 1992. \n\n[6] D. J. C. MacKay. The evidence framework applied to classification networks. Neural Computation, 4:720-736, 1992. \n\n[7] R. M. Neal. Bayesian Learning for Neural Networks. Springer, New York, 1996. \n\n[8] A. Rechtschaffen and A. Kales. A manual of standardized terminology, techniques and scoring system for sleep stages of human subjects. NIH Publication No. 204, US Government Printing Office, Washington, DC, 1968. \n\n[9] S. Roberts, L. Tarassenko, J. Pardey, and D. Siegwart. A confidence measure for artificial neural networks. In International Conference on Neural Networks and Expert Systems in Medicine and Healthcare, pages 23-30, Plymouth, UK, 1994. \n\n[10] N. Schaltenbrand, R. Lengelle, and J.P. Macher. Neural network model: application to automatic analysis of human sleep. Computers and Biomedical Research, 26:157-171, 1993. \n\n[11] H. H. Thodberg. A review of Bayesian neural networks with an application to near infrared spectroscopy. IEEE Transactions on Neural Networks, 7(1):56-72, January 1996. \n\n[12] C. K. I. Williams, C. Qazaz, C. M. Bishop, and H. Zhu. On the relationship between Bayesian error bars and the input data density. In Fourth International Conference on Artificial Neural Networks, Churchill College, University of Cambridge, UK. IEE Conference Publication No. 409, pages 160-165, 1995. \n", "award": [], "sourceid": 1425, "authors": [{"given_name": "Peter", "family_name": "Sykacek", "institution": null}, {"given_name": "Georg", "family_name": "Dorffner", "institution": null}, {"given_name": "Peter", "family_name": "Rappelsberger", "institution": null}, {"given_name": "Josef", "family_name": "Zeitlhofer", "institution": null}]}