{"title": "Bayesian Learning via Stochastic Dynamics", "book": "Advances in Neural Information Processing Systems", "page_first": 475, "page_last": 482, "abstract": null, "full_text": "Bayesian Learning \n\nvia  Stochastic Dynamics \n\nRadford M.  Neal \n\nToronto,  Ontario, Canada  M5S  lA4 \n\nDepartment of Computer Science \n\nUniversity of Toronto \n\nAbstract \n\nThe  attempt  to  find  a  single  \"optimal\"  weight  vector  in  conven(cid:173)\ntional network training can lead to overfitting and poor generaliza(cid:173)\ntion.  Bayesian methods avoid  this,  without the  need  for  a  valida(cid:173)\ntion set,  by  averaging the outputs of many  networks with  weights \nsampled  from  the  posterior  distribution  given  the  training  data. \nThis sample can be obtained by simulating a  stochastic dynamical \nsystem that has the posterior as  its stationary distribution. \n\n1  CONVENTIONAL AND  BAYESIAN  LEARNING \n\nI view neural networks as probabilistic models, and learning as statistical inference. \nConventional  network  learning  finds  a  single  \"optimal\"  set  of network  parameter \nvalues,  corresponding  to maximum likelihood or maximum penalized likelihood in(cid:173)\nference.  Bayesian  inference  instead  integrates  the  predictions of the  network  over \nall  possible  values  of the  network  parameters,  weighting each  parameter set  by  its \nposterior probability in light of the training data. \n\n1.1  NEURAL  NETWORKS  AS  PROBABILISTIC MODELS \n\nConsider a network taking a vector of real-valued inputs, x, and producing a  vector \nof real-valued  outputs,  y,  perhaps  computed  using  hidden  units.  Such  a  network \narchitecture corresponds  to a function,  I,  with y = I(x, w),  where w  is  a  vector of \nconnection weights.  
If we assume the observed outputs, y, are equal to f(x, w) plus Gaussian noise of standard deviation σ, the network defines the conditional probability for an observed output vector given an input vector as follows:

    P(y | x, σ)  ∝  exp(−|y − f(x, w)|² / 2σ²)    (1)

The probability of the outputs in a training set (x₁, y₁), ..., (xₙ, yₙ) given this fixed noise level is therefore

    P(y₁, ..., yₙ | x₁, ..., xₙ, σ)  ∝  exp(−Σ_c |y_c − f(x_c, w)|² / 2σ²)    (2)

Often σ is unknown. A Bayesian approach to handling this is to assign σ a vague prior distribution and then integrate it away, giving the following probability for the training set (see (Buntine and Weigend, 1991) or (Neal, 1992) for details):

    P(y₁, ..., yₙ | x₁, ..., xₙ)  ∝  (s₀ + Σ_c |y_c − f(x_c, w)|²)^(−(m₀+nD)/2)    (3)

where s₀ and m₀ are parameters of the prior for σ, and D is the dimensionality of the outputs.

1.2  CONVENTIONAL LEARNING

Conventional backpropagation learning tries to find the weight vector that assigns the highest probability to the training data, or equivalently, that minimizes minus the log probability of the training data. When σ is assumed known, we can use (2) to obtain the following objective function to minimize:

    M(w)  =  Σ_c |y_c − f(x_c, w)|² / 2σ²    (4)

When σ is unknown, we can instead minimize the following, derived from (3):

    M(w)  =  ((m₀+nD)/2) log(s₀ + Σ_c |y_c − f(x_c, w)|²)    (5)

Conventional learning often leads to the network overfitting the training data - modeling the noise, rather than the true regularities. This can be alleviated by stopping learning when the performance of the network on a separate validation set begins to worsen, rather than improve.
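The two objective functions above can be written down directly. The following is a minimal sketch in plain Python; the function names and the idea of passing in precomputed residuals |y_c − f(x_c, w)| are illustrative assumptions, not part of the paper:

```python
import math

def m_known_sigma(residuals, sigma):
    # Eq. (4): sum of squared errors, scaled by the known noise level sigma
    return sum(r * r for r in residuals) / (2.0 * sigma ** 2)

def m_unknown_sigma(residuals, s0, m0, n_cases, dim):
    # Eq. (5): minus the log of (3), up to an additive constant;
    # sigma has been integrated away under a vague prior with
    # parameters s0 and m0 (dim is the output dimensionality D)
    total = s0 + sum(r * r for r in residuals)
    return ((m0 + n_cases * dim) / 2.0) * math.log(total)
```

Note that (5) grows only logarithmically in the total squared error, reflecting the extra uncertainty from not knowing σ.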
Another way to avoid overfitting is to include a weight decay term in the objective function, as follows:

    M′(w)  =  λ|w|²  +  M(w)    (6)

Here, the data fit term, M(w), may come from either (4) or (5). We must somehow find an appropriate value for λ, perhaps, again, using a separate validation set.

1.3  BAYESIAN LEARNING AND PREDICTION

Unlike conventional training, Bayesian learning does not look for a single \"optimal\" set of network weights. Instead, the training data is used to find the posterior probability distribution over weight vectors. Predictions for future cases are made by averaging the outputs obtained with all possible weight vectors, with each contributing in proportion to its posterior probability.

To obtain the posterior, we must first define a prior distribution for weight vectors. We might, for example, give each weight a Gaussian prior of standard deviation ω:

    P(w)  ∝  exp(−|w|² / 2ω²)    (7)

We can then obtain the posterior distribution over weight vectors given the training cases (x₁, y₁), ..., (xₙ, yₙ) using Bayes' Theorem:

    P(w | (x₁, y₁), ..., (xₙ, yₙ))  ∝  P(w) P(y₁, ..., yₙ | x₁, ..., xₙ, w)    (8)

Based on the training data, the best prediction for the output vector in a test case with input vector x*, assuming squared-error loss, is

    ŷ*  =  ∫ f(x*, w) P(w | (x₁, y₁), ..., (xₙ, yₙ)) dw    (9)

A full predictive distribution for the outputs in the test case can also be obtained, quantifying the uncertainty in the above prediction.

2  INTEGRATION BY MONTE CARLO METHODS

Integrals such as that of (9) are difficult to evaluate. Buntine and Weigend (1991) and MacKay (1992) approach this problem by approximating the posterior distribution by a Gaussian. Instead, I evaluate such integrals using Monte Carlo methods.
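In practice, the posterior of (8) is only ever needed up to its normalizing constant, so one can work with its negative log, which becomes the potential energy of Section 2.1. A minimal sketch, assuming a scalar-output network function `f`, the Gaussian prior of (7) with width ω, and a known noise level σ as in (2); all names here are illustrative:

```python
def potential_energy(w, data, f, omega, sigma):
    # E(w) = -log P(w) - log P(targets | inputs, w) + const,
    # combining the prior (7) with the likelihood (2)
    prior_term = sum(wi * wi for wi in w) / (2.0 * omega ** 2)
    fit_term = sum((y - f(x, w)) ** 2 for x, y in data) / (2.0 * sigma ** 2)
    return prior_term + fit_term
```

Minimizing this energy recovers the penalized-likelihood objective (6); sampling from exp(−E) instead is what the rest of the paper is about.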
If we randomly select weight vectors, w₀, ..., w_{N−1}, each distributed according to the posterior, the prediction for a test case can be found by approximating the integral of (9) by the average output of networks with these weights:

    ŷ*  ≈  (1/N) Σ_t f(x*, w_t)    (10)

This formula is valid even if the w_t are dependent, though a larger sample may then be needed to achieve a given error bound. Such a sample can be obtained by simulating an ergodic Markov chain that has the posterior as its stationary distribution. The early part of the chain, before the stationary distribution has been reached, is discarded. Subsequent vectors are used to estimate the integral.

2.1  FORMULATING THE PROBLEM IN TERMS OF ENERGY

Consider the general problem of obtaining a sample of (dependent) vectors, q_t, with probabilities given by P(q). For Bayesian network learning, q will be the weight vector, or other parameters from which the weights can be obtained, and the distribution of interest will be the posterior.

It will be convenient to express this probability distribution in terms of a potential energy function, E(q), chosen so that

    P(q)  ∝  exp(−E(q))    (11)

A momentum vector, p, of the same dimensions as q, is also introduced, and defined to have a kinetic energy of ½|p|². The sum of the potential and kinetic energies is the Hamiltonian:

    H(q, p)  =  E(q)  +  ½|p|²    (12)

From the Hamiltonian, we define a joint probability distribution over q and p (phase space) as follows:

    P(q, p)  ∝  exp(−H(q, p))    (13)

The marginal distribution for q in (13) is that of (11), from which we wish to sample. We can therefore proceed by sampling from this joint distribution for q and p, and then just ignoring the values obtained for p.
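The Monte Carlo estimate (10) is then just an average of network outputs over the sampled weight vectors. A minimal sketch (the function and argument names are illustrative):

```python
def predict(x_star, weight_samples, f):
    # Eq. (10): approximate the Bayesian prediction (9) by averaging
    # the network output over weight vectors sampled from the posterior
    outputs = [f(x_star, w) for w in weight_samples]
    return sum(outputs) / len(outputs)
```

The samples may come from any ergodic chain with the posterior as its stationary distribution; dependence among them only slows convergence of the average.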
2.2  HAMILTONIAN DYNAMICS

Sampling from the distribution (13) can be split into two subproblems - first, to sample uniformly from a surface where H, and hence the probability, is constant, and second, to visit points of differing H with the correct probabilities. The solutions to these subproblems can then be interleaved to give an overall solution.

The first subproblem can be solved by simulating the Hamiltonian dynamics of the system, in which q and p evolve through a fictitious time, τ, according to the following equations:

    dq/dτ  =  ∂H/∂p  =  p,        dp/dτ  =  −∂H/∂q  =  −∇E(q)    (14)

This dynamics leaves H constant, and preserves the volumes of regions of phase space. It therefore visits points on a surface of constant H with uniform probability.

When simulating this dynamics, some discrete approximation must be used. The leapfrog method exactly maintains the preservation of phase space volume. Given a size for the time step, ε, an iteration of the leapfrog method goes as follows:

    p(τ + ε/2)  =  p(τ) − (ε/2) ∇E(q(τ))
    q(τ + ε)    =  q(τ) + ε p(τ + ε/2)    (15)
    p(τ + ε)    =  p(τ + ε/2) − (ε/2) ∇E(q(τ + ε))

2.3  THE STOCHASTIC DYNAMICS METHOD

To create a Markov chain that converges to the distribution of (13), we must interleave leapfrog iterations, which keep H (approximately) constant, with steps that can change H. It is convenient for the latter to affect only p, since it enters into H in a simple way. This general approach is due to Andersen (1980).

I use stochastic steps of the following form to change H:

    p′  =  α p  +  (1 − α²)^(1/2) n    (16)

where 0 < α < 1, and n is a random vector with components picked independently from Gaussian distributions of mean zero and standard deviation one.
One can show that these steps leave the distribution of (13) invariant. Alternating these stochastic steps with dynamical leapfrog steps will therefore sample values for q and p with close to the desired probabilities. In so far as the discretized dynamics does not keep H exactly constant, however, there will be some degree of bias, which will be eliminated only in the limit as ε goes to zero.

It is best to use a value of α close to one, as this reduces the random walk aspect of the dynamics. If the random term in (16) is omitted, the procedure is equivalent to ordinary batch mode backpropagation learning with momentum.

2.4  THE HYBRID MONTE CARLO METHOD

The bias introduced into the stochastic dynamics method by using an approximation to the dynamics is eliminated in the Hybrid Monte Carlo method of Duane, Kennedy, Pendleton, and Roweth (1987).

This method is a variation on the algorithm of Metropolis, et al (1953), which generates a Markov chain by considering randomly-selected changes to the state. A change is always accepted if it lowers the energy (H), or leaves it unchanged. If it increases the energy, it is accepted with probability exp(−ΔH), and is rejected otherwise, with the old state then being repeated.

In the Hybrid Monte Carlo method, candidate changes are produced by picking a random value for p from its distribution given by (13) and then performing some predetermined number of leapfrog steps. If the leapfrog method were exact, H would be unchanged, and these changes would always be accepted. Since the method is actually only approximate, H sometimes increases, and changes are sometimes rejected, exactly cancelling the bias introduced by the approximation.
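A single Hybrid Monte Carlo update, combining the leapfrog iterations of (15) with the Metropolis accept/reject test just described, can be sketched as follows for a one-dimensional state (plain Python; the step size and trajectory length passed in are illustrative, not the values used in Section 3):

```python
import math
import random

def hmc_step(q, grad_e, energy, eps, n_leapfrog, rng):
    # One Hybrid Monte Carlo update (Duane et al., 1987): draw a fresh
    # momentum from its Gaussian distribution in (13), follow the leapfrog
    # dynamics of (15), then accept or reject based on the change in H.
    p = rng.gauss(0.0, 1.0)
    h_old = energy(q) + 0.5 * p * p
    q_new, p_new = q, p
    p_new -= 0.5 * eps * grad_e(q_new)            # initial half step for p
    for i in range(n_leapfrog):
        q_new += eps * p_new                      # full step for q
        if i < n_leapfrog - 1:
            p_new -= eps * grad_e(q_new)          # full step for p
    p_new -= 0.5 * eps * grad_e(q_new)            # final half step for p
    h_new = energy(q_new) + 0.5 * p_new * p_new
    # Metropolis test: always accept if H decreased, else with prob exp(-dH)
    if rng.random() < math.exp(min(0.0, h_old - h_new)):
        return q_new
    return q
```

Run on a simple energy such as E(q) = q²/2, repeated calls produce samples whose mean and variance approach those of a standard Gaussian, with the leapfrog error exactly corrected by the rejection step.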
Of course, if the errors are very large, the acceptance probability will be very low, and it will take a long time to reach and explore the stationary distribution. To avoid this, we need to choose a step size (ε) that is small enough.

3  RESULTS ON A TEST PROBLEM

I use the \"robot arm\" problem of MacKay (1992) for testing. The task is to learn the mapping from two real-valued inputs, x₁ and x₂, to two real-valued outputs, ŷ₁ and ŷ₂, given by

    ŷ₁  =  2.0 cos(x₁)  +  1.3 cos(x₁ + x₂)    (17)
    ŷ₂  =  2.0 sin(x₁)  +  1.3 sin(x₁ + x₂)    (18)

Gaussian noise of mean zero and standard deviation 0.05 is added to (ŷ₁, ŷ₂) to give the observed position, (y₁, y₂). The training and test sets each consist of 200 cases, with x₁ picked randomly from the ranges [−1.932, −0.453] and [+0.453, +1.932], and x₂ from the range [0.534, 3.142].

A network with 16 sigmoidal hidden units was used. The output units were linear. Like MacKay, I group weights into three categories - input to hidden, bias to hidden, and hidden/bias to output. MacKay gives separate priors to weights in each category, finding an appropriate value of ω for each. I fix ω to one, but multiply each weight by a scale factor associated with its category before using it, giving an equivalent effect. For conventional training with weight decay, I use an analogous scheme with three weight decay constants (λ in (6)).

In all cases, I assume that the true value of σ is not known. I therefore use (3) for the training set probability, and (5) for the data fit term in conventional training. I set s₀ = m₀ = 0.1, which corresponds to a very vague prior for σ.
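Data for this problem is easy to regenerate. A sketch following (17)-(18) and the sampling ranges above (the function names and the `noise_sd` parameter are illustrative):

```python
import math
import random

def robot_arm_case(x1, x2, rng, noise_sd=0.05):
    # Eqs. (17)-(18), the robot arm mapping of MacKay (1992),
    # plus Gaussian observation noise on each output
    y1 = 2.0 * math.cos(x1) + 1.3 * math.cos(x1 + x2)
    y2 = 2.0 * math.sin(x1) + 1.3 * math.sin(x1 + x2)
    return (y1 + rng.gauss(0.0, noise_sd), y2 + rng.gauss(0.0, noise_sd))

def training_set(n, rng):
    # Inputs drawn uniformly: x1 from one of two symmetric ranges,
    # x2 from [0.534, 3.142], as described in the text
    cases = []
    for _ in range(n):
        lo, hi = rng.choice([(-1.932, -0.453), (0.453, 1.932)])
        x1 = rng.uniform(lo, hi)
        x2 = rng.uniform(0.534, 3.142)
        cases.append(((x1, x2), robot_arm_case(x1, x2, rng)))
    return cases
```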
3.1  PERFORMANCE OF CONVENTIONAL LEARNING

Conventional backpropagation learning was tested on the robot arm problem to gauge how difficult it is to obtain good generalization with standard methods.

Figure 1: Conventional backpropagation learning - (a) with no weight decay, (b) with carefully-chosen weight decay constants. The solid lines give the squared error on the training data, the dotted lines the squared error on the test data.

Fig. 1(a) shows results obtained without using weight decay. Error on the test set declined initially, but then increased with further training. To achieve good results, the point where the test error reaches its minimum would have to be identified using a separate validation set.

Fig. 1(b) shows results using good weight decay constants, one for each category of weights, taken from the Bayesian runs described below. In this case there is no need to stop learning early, but finding the proper weight decay constants by non-Bayesian methods would be a problem. Again, a validation set seems necessary, as well as considerable computation.
Use of a validation set is wasteful, since data that could otherwise be included in the training set must be excluded. Standard techniques for avoiding this, such as \"N-fold\" cross-validation, are difficult to apply to neural networks.

3.2  PERFORMANCE OF BAYESIAN LEARNING

Bayesian learning was first tested using the unbiased Hybrid Monte Carlo method. The parameter vector in the simulations (q) consisted of the unscaled network weights together with the scale factors for the three weight categories. The actual weight vector (w) was obtained by multiplying each unscaled weight by the scale factor for its category.

Each Hybrid Monte Carlo run consisted of 500 Metropolis steps. For each step, a trajectory consisting of 1000 leapfrog iterations with ε = 0.00012 was computed, and accepted or rejected based on the change in H at its end-point. Each run therefore required 500,000 batch gradient evaluations, and took approximately four hours on a machine rated at about 25 MIPS.

Fig. 2(a) shows the training and test error for the early portion of one Hybrid Monte Carlo run. After initially declining, these values fluctuate about an average. Though not apparent in the figure, some quantities (notably the scale factors) require a hundred or more steps to reach their final distribution. The first 250 steps of each run were therefore discarded as not being from the stationary distribution.

Fig. 2(b) shows the training and test set errors produced by networks with weight vectors taken from the last 250 steps of the same run. Also shown is the error on the test set using the average of the outputs of all these networks - that is, the estimate given by (10) for the Bayesian prediction of (9).
Figure 2: Bayesian learning using Hybrid Monte Carlo - (a) early portion of run, (b) last 250 iterations. The solid lines give the squared error on the training set, the dotted lines the squared error on the test set, for individual networks. The dashed line in (b) is the test error when using the average of the outputs of all 250 networks.

Figure 3: Predictive distribution for outputs. The two regions from which training data was drawn are outlined. Circles indicate the true, noise-free outputs for a grid of cases in the input space. The dots in the vicinity of each circle (often piled on top of it) are the outputs of every fifth network from the last 250 iterations of a Hybrid Monte Carlo run.

For the run shown, this
test set error using averaged outputs is 0.00559, which is (slightly) better than any results obtained using conventional training. Note that with Bayesian training no validation set is necessary. The analogues of the weight decay constants - the weight scale factors - are found during the course of the simulation.

Another advantage of the Bayesian approach is that it can provide an indication of how uncertain the predictions for test cases are. Fig. 3 demonstrates this. As one would expect, the uncertainty is greater for test cases with inputs outside the region where training data was supplied.

3.3  STOCHASTIC DYNAMICS VS. HYBRID MONTE CARLO

The uncorrected stochastic dynamics method will have some degree of systematic bias, due to inexact simulation of the dynamics. Is the amount of bias introduced of any practical importance, however?

Figure 4: Bayesian learning using uncorrected stochastic dynamics - (a) training and test error for the last 250 iterations of a run with ε = 0.00012, (b) potential energy (E) for a run with ε = 0.00030. Note the two peaks where the dynamics became unstable.

To help answer this question, the stochastic dynamics method was run with parameters analogous to those used in the Hybrid Monte Carlo runs.
The step size of ε = 0.00012 used in those runs was chosen to be as large as possible while keeping the number of trajectories rejected low (about 10%). A smaller step size would not give competitive results, so this value was used for the stochastic dynamics runs as well. A value of 0.999 for α in (16) was chosen as being (loosely) equivalent to the use of trajectories 1000 iterations long in the Hybrid Monte Carlo runs.

The results shown in Fig. 4(a) are comparable to those obtained using Hybrid Monte Carlo in Fig. 2(b). Fig. 4(b) shows that with a larger step size the uncorrected stochastic dynamics method becomes unstable. Large step sizes also cause problems for the Hybrid Monte Carlo method, however, as they lead to high rejection rates.

The Hybrid Monte Carlo method may be the more robust choice in some circumstances, but uncorrected stochastic dynamics can also give good results. As it is simpler, the stochastic dynamics method may be better for hardware implementation, and is a more plausible starting point for any attempt to relate Bayesian methods to biology. Numerous other variations on these methods are possible as well, some of which are discussed in (Neal, 1992).

References

Andersen, H. C. (1980) \"Molecular dynamics simulations at constant pressure and/or temperature\", Journal of Chemical Physics, vol. 72, pp. 2384-2393.

Buntine, W. L. and Weigend, A. S. (1991) \"Bayesian back-propagation\", Complex Systems, vol. 5, pp. 603-643.

Duane, S., Kennedy, A. D., Pendleton, B. J., and Roweth, D. (1987) \"Hybrid Monte Carlo\", Physics Letters B, vol. 195, pp. 216-222.

MacKay, D. J. C. (1992) \"A practical Bayesian framework for backpropagation networks\", Neural Computation, vol. 4, pp. 448-472.
Metropolis, N., Rosenbluth, A. W., Rosenbluth, M. N., Teller, A. H., and Teller, E. (1953) \"Equation of state calculations by fast computing machines\", Journal of Chemical Physics, vol. 21, pp. 1087-1092.

Neal, R. M. (1992) \"Bayesian training of backpropagation networks by the hybrid Monte Carlo method\", Technical Report CRG-TR-92-1, Dept. of Computer Science, University of Toronto.
", "award": [], "sourceid": 613, "authors": [{"given_name": "Radford", "family_name": "Neal", "institution": null}]}