{"title": "SIMPLIFYING NEURAL NETS BY DISCOVERING FLAT MINIMA", "book": "Advances in Neural Information Processing Systems", "page_first": 529, "page_last": 536, "abstract": null, "full_text": "SIMPLIFYING NEURAL NETS BY \n\nDISCOVERING FLAT MINIMA \n\nSepp Hochreiter\" \n\nJiirgen Schmidhubert \n\nFakultat fiir Informatik, H2 \n\nTechnische Universitat Miinchen \n\n80290 Miinchen, Germany \n\nAbstract \n\nWe present a new algorithm for finding low complexity networks \nwith high generalization capability. The algorithm searches for \nlarge connected regions of so-called ''fiat'' minima of the error func(cid:173)\ntion. In the weight-space environment of a \"flat\" minimum, the \nerror remains approximately constant. Using an MDL-based ar(cid:173)\ngument, flat minima can be shown to correspond to low expected \noverfitting. Although our algorithm requires the computation of \nsecond order derivatives, it has backprop's order of complexity. \nExperiments with feedforward and recurrent nets are described. In \nan application to stock market prediction, the method outperforms \nconventional backprop, weight decay, and \"optimal brain surgeon\" . \n\n1 \n\nINTRODUCTION \n\nPrevious algorithms for finding low complexity networks with high generalization \ncapability are based on significant prior assumptions. They can be broadly classified \nas follows: (1) Assumptions about the prior weight distribution. Hinton and van \nCamp [3] and Williams [17] assume that pushing the posterior distribution (after \nlearning) close to the prior leads to \"good\" generalization. Weight decay can be \nderived e.g. from Gaussian priors. Nowlan and Hinton [10] assume that networks \nwith many similar weights generated by Gaussian mixtures are \"better\" a priori. \nMacKay's priors [6] are implicit in additional penalty terms, which embody the \n\n\"hochreit@informatik. tu-muenchen .de \nt schmidhu@informatik.tu-muenchen.de \n\n\f530 \n\nSepp Hochreiter. 
Jurgen Schmidhuber \n\nassumptions made. (2) Prior assumptions about how theoretical results on early \nstopping and network complexity carryover to practical applications. Examples are \nmethods based on validation sets (see [8]), Vapnik's \"structural risk minimization\" \n[1] [14], and the methods of Holden [5] and Wang et al. [15]. Our approach requires \nless prior assumptions than most other approaches (see appendix A.l). \nBasic idea of flat minima search. Our algorithm finds a large region in weight \nspace with the property that each weight vector from that region has similar small \nerror. Such regions are called \"flat minima\". To get an intuitive feeling for why \n''flat'' minima are interesting, consider this (see also Wolpert [18]): a \"sharp\" mini(cid:173)\nmum corresponds to weights which have to be specified with high precision. A ''flat'' \nminimum corresponds to weights many of which can be given with low precision. In \nthe terminology of the theory of minimum description length (MDL), fewer bits of \ninformation are required to pick a ''flat'' minimum (corresponding to a \"simple\" or \nlow complexity-network). The MDL principle suggests that low network complex(cid:173)\nity corresponds to high generalization performance (see e.g. [4, 13]). Unlike Hinton \nand van Camp's method [3] (see appendix A.3), our approach does not depend on \nexplicitly choosing a \"good\" prior. \n\nOur algorithm finds \"flat\" minima by searching for weights that minimize both \ntraining error and weight precision. This requires the computation of the Hessian. \nHowever, by using Pearlmutter's and M~ner's efficient second order method [11, 7], \nwe obtain the same order of complexity as with conventional backprop. A utomat(cid:173)\nically, the method effectively reduces numbers of units, weigths, and input lines, \nas well as the sensitivity of outputs with respect to remaining weights and units. 
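The MDL intuition above can be made concrete with a tiny sketch (our own illustration, with hypothetical helper names; it is not part of the paper's algorithm): specifying each weight w_ij only up to a tolerable perturbation Δw_ij costs about −log Δw_ij bits, so flatter minima (larger tolerances) need fewer bits.

```python
import math

def description_bits(precisions):
    # Bits needed to pick a weight vector when each weight w_ij only has
    # to be specified up to its tolerable perturbation delta_w_ij:
    # B = sum_ij -log2(delta_w_ij). Larger tolerances -> fewer bits.
    return sum(-math.log2(d) for d in precisions)

# A sharp minimum forces high-precision weights ...
sharp = description_bits([1e-4] * 4)
# ... while a flat minimum tolerates coarse ones.
flat = description_bits([1e-1] * 4)
assert flat < sharp
```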
\nExcellent experimental generalization results will be reported in section 4. \n\n2 TASK / ARCHITECTURE / BOXES \n\nGeneralization task. The task is to approximate an unknown relation D̄ ⊂ X × Z between a set of inputs X ⊂ R^N and a set of outputs Z ⊂ R^K. D̄ is taken to be a function. A relation D is obtained from D̄ by adding noise to the outputs. All training information is given by a finite relation D0 ⊂ D. D0 is called the training set. The pth element of D0 is denoted by an input/target pair (x_p, d_p). \n\nArchitecture. For simplicity, we will focus on a standard feedforward net (but in the experiments, we will use recurrent nets as well). The net has N input units, K output units, W weights, and differentiable activation functions. It maps input vectors x_p ∈ R^N to output vectors o_p ∈ R^K. The weight from unit j to i is denoted by w_ij. The W-dimensional weight vector is denoted by w. \n\nTraining error. Mean squared error E_q(w, D0) := (1/|D0|) Σ_{(x_p,d_p)∈D0} || d_p − o_p ||² is used, where || · || denotes the Euclidean norm, and | · | denotes the cardinality of a set. To define regions in weight space with the property that each weight vector from that region has \"similar small error\", we introduce the tolerable error E_tol, a positive constant. \"Small\" error is defined as being smaller than E_tol. E_q(w, D0) > E_tol implies \"underfitting\". \n\nBoxes. Each weight vector w satisfying E_q(w, D0) ≤ E_tol defines an \"acceptable minimum\". We are interested in large regions of connected acceptable minima. Such regions are called flat minima. They are associated with low expected generalization error (see [4]). 
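The training error and the notion of an acceptable minimum can be sketched as follows (a minimal illustration; the function names are ours, not the paper's):

```python
import numpy as np

def mean_squared_error(outputs, targets):
    # E_q(w, D0): average over (x_p, d_p) in D0 of ||d_p - o_p||^2,
    # where o_p is the net output for input x_p.
    outputs = np.asarray(outputs, dtype=float)
    targets = np.asarray(targets, dtype=float)
    return float(np.mean(np.sum((targets - outputs) ** 2, axis=1)))

def is_acceptable_minimum(outputs, targets, e_tol):
    # A weight vector is an acceptable minimum iff E_q(w, D0) <= E_tol.
    return mean_squared_error(outputs, targets) <= e_tol
```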
To simplify the algorithm for finding large connected regions (see below), we do not consider maximal connected regions but focus on so-called \"boxes\" within regions: for each acceptable minimum w, its box M_w in weight space is a W-dimensional hypercuboid with center w. For simplicity, each edge of the box is taken to be parallel to one weight axis. Half the length of the box edge in direction of the axis corresponding to weight w_ij is denoted by Δw_ij, which is the maximal (positive) value such that for all i, j, all positive κ_ij ≤ Δw_ij can be added to or subtracted from the corresponding component of w simultaneously without violating E_q(·, D0) ≤ E_tol (Δw_ij gives the precision of w_ij). M_w's box volume is defined by Δw := 2^W Π_{i,j} Δw_ij. \n\n3 THE ALGORITHM \n\nThe algorithm is designed to find a w defining a box M_w with maximal box volume Δw. This is equivalent to finding a box M_w with minimal B(w, D0) := −log(Δw/2^W) = Σ_{i,j} −log Δw_ij. Note the relationship to MDL (B is the number of bits required to describe the weights). In appendix A.2, we derive the following algorithm. It minimizes E(w, D0) = E_q(w, D0) + λB(w, D0), where \n\nB = (1/2) ( −W log ε + Σ_{i,j} log Σ_k (∂o^k/∂w_ij)² + W log Σ_k ( Σ_{i,j} |∂o^k/∂w_ij| / √(Σ_k (∂o^k/∂w_ij)²) )² ) .   (1) \n\nHere o^k is the activation of the kth output unit, ε is a constant, and λ is a positive variable ensuring either E_q(w, D0) ≤ E_tol, or ensuring an expected decrease of E_q(·, D0) during learning (see [16] for adjusting λ). \n\nE(w, D0) is minimized by gradient descent. To minimize B(w, D0), we compute \n\n∂B(w, D0)/∂w_uv = Σ_{i,j} Σ_k ( ∂B(w, D0)/∂(∂o^k/∂w_ij) ) ( ∂²o^k/(∂w_ij ∂w_uv) ) for all u, v.   (2) \n\nIt can be shown (see [4]) that by using Pearlmutter's and Møller's efficient second order method [11, 7], the gradient of B(w, D0) can be computed in O(W) time (see details in [4]). 
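For illustration, the penalty (1) can be evaluated directly from the Jacobian of the outputs with respect to the weights. This is our own naive sketch with hypothetical names, assuming the full K×W Jacobian is available; the paper's algorithm instead computes only the gradient of B, in O(W) time, without ever forming this Jacobian.

```python
import numpy as np

def fms_penalty(jac, eps):
    # jac[k, ij] = do^k/dw_ij: Jacobian of the K outputs w.r.t. the W weights.
    # Implements equation (1):
    # B = 1/2 ( -W log eps + sum_ij log sum_k jac^2
    #           + W log sum_k ( sum_ij |jac| / sqrt(sum_k jac^2) )^2 )
    K, W = jac.shape
    col_sq = np.sum(jac ** 2, axis=0)        # sum_k (do^k/dw_ij)^2, one per weight
    term1 = -W * np.log(eps)
    term2 = np.sum(np.log(col_sq))
    ratio = np.abs(jac) / np.sqrt(col_sq)    # per-weight normalized sensitivities
    term3 = W * np.log(np.sum(np.sum(ratio, axis=1) ** 2))
    return 0.5 * (term1 + term2 + term3)
```

Scaling all output sensitivities up increases B, so gradient descent on E_q + λB pushes toward weights whose perturbation changes the outputs as little as possible.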
Therefore, our algorithm has the same order of complexity as standard backprop. \n\n4 EXPERIMENTAL RESULTS (see [4] for details) \n\nEXPERIMENT 1 - noisy classification. The first experiment is taken from Pearlmutter and Rosenfeld [12]. The task is to decide whether the x-coordinate of a point in 2-dimensional space exceeds zero (class 1) or does not (class 2). Noisy training examples are generated as follows: data points are obtained from a Gaussian with zero mean and stdev 1.0, bounded in the interval [-3.0, 3.0]. The data points are misclassified with a probability of 0.05. Final input data is obtained by adding a zero mean Gaussian with stdev 0.15 to the data points. In a test with 2,000,000 data points, it was found that the procedure above leads to 9.27 per cent misclassified data. No method will misclassify less than 9.27 per cent, due to the inherent noise in the data. The training set is based on 200 fixed data points. The test set is based on 120,000 data points. \n\n      Backprop        New approach            Backprop        New approach \n      MSE    dto      MSE    dto              MSE    dto      MSE    dto \n 1    0.220  1.35     0.193  0.00        6    0.219  1.24     0.187  0.04 \n 2    0.223  1.16     0.189  0.09        7    0.215  1.14     0.187  0.07 \n 3    0.222  1.37     0.186  0.13        8    0.214  1.10     0.185  0.01 \n 4    0.213  1.18     0.181  0.01        9    0.218  1.21     0.190  0.09 \n 5    0.222  1.24     0.195  0.25        10   0.214  1.21     0.188  0.07 \n\nTable 1: 10 comparisons of conventional backprop (BP) and our new method (FMS). The \"MSE\" columns show mean squared error on the test set. The \"dto\" columns show the difference between the fraction (in per cent) of misclassifications and the optimal fraction (9.27). The new approach clearly outperforms backprop. \n\nResults. 
10 conventional backprop (BP) nets were tested against 10 equally initialized networks based on our new method (\"flat minima search\", FMS). After 1,000 epochs, the weights of our nets essentially stopped changing (automatic \"early stopping\"), while backprop kept changing weights to learn the outliers in the data set and overfit. In the end, our approach left a single hidden unit h with a maximal weight of 30.0 or -30.0 from the x-axis input. Unlike with backprop, the other hidden units were effectively pruned away (outputs near zero). So was the y-axis input (zero weight to h). It can be shown that this corresponds to an \"optimal\" net with minimal numbers of units and weights. Table 1 illustrates the superior performance of our approach. \n\nEXPERIMENT 2 - recurrent nets. The method works for continually running fully recurrent nets as well. At every time step, a recurrent net with sigmoid activations in [0, 1] sees an input vector from a stream of randomly chosen input vectors from the set {(0,0), (0,1), (1,0), (1,1)}. The task is to switch on the first output unit whenever an input (1,0) occurred two time steps ago, and to switch on the second output unit without delay in response to any input (0,1). The task can be solved by a single hidden unit. \n\nResults. With conventional recurrent net algorithms, after training, both hidden units were used to store the input vector. Not so with our new approach. We trained 20 networks. All of them learned perfect solutions. Like with weight decay, most weights to the output decayed to zero. But unlike with weight decay, strong inhibitory connections (-30.0) switched off one of the hidden units, effectively pruning it away. \n\nEXPERIMENT 3 - stock market prediction. We predict the DAX (German stock market index) based on fundamental (experiments 3.1 and 3.2) and technical (experiment 3.3) indicators. 
We use strictly layered feedforward nets with sigmoid units active in [-1, 1], and the following performance measures: \nConfidence: output o > α indicates a positive tendency, o < -α a negative tendency. \nPerformance: The sum of confidently but incorrectly predicted DAX changes is subtracted from the sum of confidently and correctly predicted ones. The result is divided by the sum of absolute changes. \n\nEXPERIMENT 3.1: Fundamental inputs: (a) German interest rate (\"Umlaufsrendite\"), (b) industrial production divided by money supply, (c) business sentiments (\"IFO Geschäftsklimaindex\"). 24 training examples, 68 test examples, quarterly prediction, confidence: α = 0.0/0.6/0.9, architecture: (3-8-1). \nEXPERIMENT 3.2: Fundamental inputs: (a), (b), (c) as in exp. 3.1, (d) dividend rate, (e) foreign orders in manufacturing industry. 228 training examples, 100 test examples, monthly prediction, confidence: α = 0.0/0.6/0.8, architecture: (5-8-1). \nEXPERIMENT 3.3: Technical inputs: (a) 8 most recent DAX changes, (b) DAX, (c) change of 24-week relative strength index (\"RSI\"), (d) difference of \"5 week statistic\", (e) \"MACD\" (difference of exponentially weighted 6 week and 24 week DAX). 320 training examples, 100 test examples, weekly predictions, confidence: α = 0.0/0.2/0.4, architecture: (12-9-1). \n\nThe following methods are tested: (1) Conventional backprop (BP), (2) optimal brain surgeon (OBS [2]), (3) weight decay (WD [16]), (4) flat minima search (FMS). \n\nResults. Our method clearly outperforms the other methods. FMS is up to 63 per cent better than the best competitor (see [4] for details). \n\nAPPENDIX - THEORETICAL JUSTIFICATION \n\nA.1. OVERFITTING ERROR \n\nIn analogy to [15] and [1], we decompose the generalization error into an \"overfitting\" error and an \"underfitting\" error. 
There is no significant underfitting error (corresponding to Vapnik's empirical risk) if E_q(w, D0) ≤ E_tol. Some thought is required, however, to define the \"overfitting\" error. We do this in a novel way. Since we do not know the relation D, we cannot know p(α | D), the \"optimal\" posterior weight distribution we would obtain by training the net on D (\"sure thing hypothesis\"). But, for theoretical purposes, suppose we did know p(α | D). Then we could use p(α | D) to initialize weights before learning the training set D0. Using the Kullback-Leibler distance, we measure the information (due to noise) conveyed by D0, but not by D. In conjunction with the initialization above, this provides the conceptual setting for defining an overfitting error measure. But the initialization does not really matter, because it does not heavily influence the posterior (see [4]). The overfitting error is the Kullback-Leibler distance of the posteriors: E_o(D, D0) = ∫ p(α | D0) log(p(α | D0)/p(α | D)) dα. E_o(D, D0) is the expectation of log(p(α | D0)/p(α | D)) (the expected difference of the minimal description of α with respect to D and D0, after learning D0). Now we measure the expected overfitting error relative to M_w (see section 2) by computing the expectation of log(p(α | D0)/p(α | D)) in the range M_w: \n\nE_ro(w) = β ( ∫_{M_w} p_{M_w}(α | D0) E_q(α, D) dα − E_q(D0, M_w) ) .   (3) \n\nHere p_{M_w}(α | D0) := p(α | D0) / ∫_{M_w} p(α | D0) dα is the posterior of D0 scaled to obtain a distribution within M_w, and E_q(D0, M_w) := ∫_{M_w} p_{M_w}(α | D0) E_q(α, D0) dα is the mean error in M_w with respect to D0. \n\nClearly, we would like to pick w such that E_ro(w) is minimized. 
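As a toy illustration of the overfitting error E_o (ours, not the paper's), consider the Kullback-Leibler distance between two discrete posteriors over a finite set of weight vectors:

```python
import math

def kl_divergence(p, q):
    # Discrete analogue of E_o(D, D0) = integral p(a|D0) log(p(a|D0)/p(a|D)) da:
    # the extra information (in nats) carried by the posterior after the noisy
    # training set D0, relative to the optimal posterior after D.
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)

# Identical posteriors: no overfitting error.
assert kl_divergence([0.5, 0.5], [0.5, 0.5]) == 0.0
```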
Towards this purpose, we need two additional prior assumptions, which are actually implicit in most previous approaches (which make additional stronger assumptions, see section 1): (1) \"Closeness assumption\": Every minimum of E_q(·, D0) is \"close\" to a maximum of p(α | D) (see formal definition in [4]). Intuitively, \"closeness\" ensures that D0 can indeed tell us something about D, such that training on D0 may indeed reduce the error on D. (2) \"Flatness assumption\": The peaks of p(α | D)'s maxima are not sharp. This MDL-like assumption holds if not all weights have to be known exactly to model D. It ensures that there are regions with low error on D. \n\nA.2. HOW TO FLATTEN THE NETWORK OUTPUT \n\nTo find nets with flat outputs, two conditions will be defined to specify B(w, D0) (see section 3). The first condition ensures flatness. The second condition enforces \"equal flatness\" in all weight space directions. In both cases, linear approximations will be made (to be justified in [4]). We are looking for weights (causing tolerable error) that can be perturbed without causing significant output changes. Perturbing the weights w by δw (with components δw_ij), we obtain E_D(w, δw) := Σ_k (o^k(w + δw) − o^k(w))², where o^k(w) expresses o^k's dependence on w (in what follows, however, w often will be suppressed for convenience). Linear approximation (justified in [4]) gives us \"Flatness Condition 1\": \n\nE_D(w, δw) ≈ Σ_k ( Σ_{i,j} (∂o^k/∂w_ij) δw_ij )² ≤ Σ_k ( Σ_{i,j} |∂o^k/∂w_ij| |δw_ij| )² ≤ ε ,   (4) \n\nwhere ε > 0 defines tolerable output changes within a box and is small enough to allow for linear approximation (it does not appear in B(w, D0)'s gradient, see section 3). Many M_w satisfy flatness condition 1. 
To select a particular, very flat M_w, the following \"Flatness Condition 2\" uses up degrees of freedom left by (4): \n\n∀ i, j, u, v:  (δw_ij)² Σ_k (∂o^k/∂w_ij)² = (δw_uv)² Σ_k (∂o^k/∂w_uv)² .   (5) \n\nFlatness Condition 2 enforces equal \"directed errors\" E_Dij(w, δw_ij) = Σ_k (o^k(w_ij + δw_ij) − o^k(w_ij))² ≈ Σ_k ( (∂o^k/∂w_ij) δw_ij )², where o^k(w_ij) has the obvious meaning. It can be shown (see [4]) that with given box volume, we need flatness condition 2 to minimize the expected description length of the box center. Flatness condition 2 influences the algorithm as follows: (1) The algorithm prefers to increase the δw_ij's of weights which currently are not important to generate the target output. (2) The algorithm enforces equal sensitivity of all output units with respect to the weights. Hence, the algorithm tends to group hidden units according to their relevance for groups of output units. Flatness condition 2 is essential: flatness condition 1 by itself corresponds to nothing more but first order derivative reduction (ordinary sensitivity reduction, e.g. [9]). Linear approximation is justified by the choice of ε in equation (4). \n\nWe first solve equation (5) for |δw_ij| = |δw_uv| √( Σ_k (∂o^k/∂w_uv)² / Σ_k (∂o^k/∂w_ij)² ) (fixing u, v for all i, j). Then we insert |δw_ij| into equation (4) (replacing the second \"≤\" in (4) by \"=\"). This gives us an equation for the |δw_ij| (which depend on w, but this is notationally suppressed): \n\n|δw_ij| = √ε / ( √(Σ_k (∂o^k/∂w_ij)²) √( Σ_k ( Σ_{u,v} |∂o^k/∂w_uv| / √(Σ_k (∂o^k/∂w_uv)²) )² ) ) .   (6) \n\nThe |δw_ij| approximate the Δw_ij from section 2. Thus, B(w, D0) (see section 3) can be approximated by B(w, D0) ≈ Σ_{i,j} −log |δw_ij|. This immediately leads to the algorithm given by equation (1). \n\nHow can this approximation be justified? 
The learning process itself enforces its validity (see justification in [4]). Initially, the conditions above are valid only in a very small environment of an \"initial\" acceptable minimum. But during search for new acceptable minima with more associated box volume, the corresponding environments are enlarged, which implies that the absolute values of the entries in the Hessian decrease. It can be shown (see [4]) that the algorithm tends to suppress the following values: (1) unit activations, (2) first order activation derivatives, (3) the sum of all contributions of an arbitrary unit activation to the net output. Since weights, inputs, activation functions, and their first and second order derivatives are bounded, it can be shown (see [4]) that the entries in the Hessian decrease where the corresponding |δw_ij| increase. \n\nA.3. RELATION TO HINTON AND VAN CAMP \n\nHinton and van Camp [3] minimize the sum of two terms: the first is conventional error plus variance, the other is the distance ∫ p(α | D0) log(p(α | D0)/p(α)) dα between posterior p(α | D0) and prior p(α). The problem is to choose a \"good\" prior. In contrast to their approach, our approach does not require a \"good\" prior given in advance. Furthermore, Hinton and van Camp have to compute variances of weights and units, which (in general) cannot be done using linear approximation. Intuitively speaking, their weight variances are related to our Δw_ij. Our approach, however, does justify linear approximation. \n\nReferences \n\n[1] I. Guyon, V. Vapnik, B. Boser, L. Bottou, and S. A. Solla. Structural risk minimization for character recognition. In J. E. Moody, S. J. Hanson, and R. P. Lippmann, editors, Advances in Neural Information Processing Systems 4, pages 471-479. San Mateo, CA: Morgan Kaufmann, 1992. \n\n[2] B. Hassibi and D. G. Stork. Second order derivatives for network pruning: Optimal brain surgeon. In J. D. Cowan, S. J. 
Hanson, and C. L. Giles, editors, Advances in Neural Information Processing Systems 5, pages 164-171. San Mateo, CA: Morgan Kaufmann, 1993. \n\n[3] G. E. Hinton and D. van Camp. Keeping neural networks simple. In Proceedings of the International Conference on Artificial Neural Networks, Amsterdam, pages 11-18. Springer, 1993. \n\n[4] S. Hochreiter and J. Schmidhuber. Flat minima search for discovering simple nets. Technical Report FKI-200-94, Fakultät für Informatik, Technische Universität München, 1994. \n\n[5] S. B. Holden. On the Theory of Generalization and Self-Structuring in Linearly Weighted Connectionist Networks. PhD thesis, Cambridge University, Engineering Department, 1994. \n\n[6] D. J. C. MacKay. A practical Bayesian framework for backprop networks. Neural Computation, 4:448-472, 1992. \n\n[7] M. F. Møller. Exact calculation of the product of the Hessian matrix of feed-forward network error functions and a vector in O(N) time. Technical Report PB-432, Computer Science Department, Aarhus University, Denmark, 1993. \n\n[8] J. E. Moody and J. Utans. Architecture selection strategies for neural networks: Application to corporate bond rating prediction. In A. N. Refenes, editor, Neural Networks in the Capital Markets. John Wiley & Sons, 1994. \n\n[9] A. F. Murray and P. J. Edwards. Synaptic weight noise during MLP learning enhances fault-tolerance, generalisation and learning trajectory. In S. J. Hanson, J. D. Cowan, and C. L. Giles, editors, Advances in Neural Information Processing Systems 5, pages 491-498. San Mateo, CA: Morgan Kaufmann, 1993. \n\n[10] S. J. Nowlan and G. E. Hinton. Simplifying neural networks by soft weight sharing. Neural Computation, 4:173-193, 1992. \n\n[11] B. A. Pearlmutter. Fast exact multiplication by the Hessian. Neural Computation, 1994. \n\n[12] B. A. Pearlmutter and R. Rosenfeld. 
Chaitin-Kolmogorov complexity and generalization in neural networks. In R. P. Lippmann, J. E. Moody, and D. S. Touretzky, editors, Advances in Neural Information Processing Systems 3, pages 925-931. San Mateo, CA: Morgan Kaufmann, 1991. \n\n[13] J. H. Schmidhuber. Discovering problem solutions with low Kolmogorov complexity and high generalization capability. Technical Report FKI-194-94, Fakultät für Informatik, Technische Universität München, 1994. \n\n[14] V. Vapnik. Principles of risk minimization for learning theory. In J. E. Moody, S. J. Hanson, and R. P. Lippmann, editors, Advances in Neural Information Processing Systems 4, pages 831-838. San Mateo, CA: Morgan Kaufmann, 1992. \n\n[15] C. Wang, S. S. Venkatesh, and J. S. Judd. Optimal stopping and effective machine complexity in learning. In J. D. Cowan, G. Tesauro, and J. Alspector, editors, Advances in Neural Information Processing Systems 6, pages 303-310. San Mateo, CA: Morgan Kaufmann, 1994. \n\n[16] A. S. Weigend, D. E. Rumelhart, and B. A. Huberman. Generalization by weight-elimination with application to forecasting. In R. P. Lippmann, J. E. Moody, and D. S. Touretzky, editors, Advances in Neural Information Processing Systems 3, pages 875-882. San Mateo, CA: Morgan Kaufmann, 1991. \n\n[17] P. M. Williams. Bayesian regularisation and pruning using a Laplace prior. Technical report, School of Cognitive and Computing Sciences, University of Sussex, Falmer, Brighton, 1994. \n\n[18] D. H. Wolpert. Bayesian backpropagation over i-o functions rather than weights. In J. D. Cowan, G. Tesauro, and J. Alspector, editors, Advances in Neural Information Processing Systems 6, pages 200-207. San Mateo, CA: Morgan Kaufmann, 1994. 
\n\n\f", "award": [], "sourceid": 899, "authors": [{"given_name": "Sepp", "family_name": "Hochreiter", "institution": null}, {"given_name": "J\u00fcrgen", "family_name": "Schmidhuber", "institution": null}]}