{"title": "Robust Full Bayesian Methods for Neural Networks", "book": "Advances in Neural Information Processing Systems", "page_first": 379, "page_last": 385, "abstract": null, "full_text": "Robust  Full  Bayesian Methods for  Neural \n\nNetworks \n\nChristophe Andrieu* \nCambridge University \n\nEngineering Department \n\nCambridge  CB2  1PZ \n\nEngland \n\nca226@eng.cam.ac.uk \n\nJ oao FG de  Freitas \n\nUC  Berkeley \n\nComputer Science \n\n387 Soda Hall,  Berkeley \n\nCA  94720-1776 USA \njfgf@cs.berkeley.edu \n\nArnaud Doucet \n\nCambridge University \n\nEngineering Department \n\nCambridge CB2  1PZ \n\nEngland \n\nad2@eng.cam.ac.uk \n\nAbstract \n\nIn this paper, we propose a full Bayesian model for neural networks. \nThis model treats the model dimension (number of neurons), model \nparameters, regularisation parameters and noise parameters as ran(cid:173)\ndom  variables  that  need  to be estimated.  We  then  propose  a  re(cid:173)\nversible jump Markov chain Monte Carlo (MCMC)  method to per(cid:173)\nform  the necessary computations.  We  find  that the results are not \nonly  better than the previously  reported ones,  but  also  appear to \nbe  robust  with  respect  to  the  prior  specification.  Moreover,  we \npresent a  geometric convergence theorem for  the algorithm. \n\n1 \n\nIntroduction \n\nIn the early nineties, Buntine and Weigend (1991)  and Mackay (1992) showed that a \nprincipled Bayesian learning approach to neural networks can lead to many improve(cid:173)\nments  [1 ,2] .  In particular, Mackay showed that by approximating the distributions \nof the weights with Gaussians and adopting smoothing priors, it is possible to obtain \nestimates of the weights and output variances and to automatically set the regular(cid:173)\nisation coefficients.  Neal (1996)  cast the net much further by introducing advanced \nBayesian simulation methods, specifically the hybrid Monte Carlo method, into the \nanalysis of neural networks [3].  Bayesian sequential Monte Carlo methods have also \nbeen  shown  to  provide  good  training results,  especially  in  time-varying  scenarios \n[4].  More  recently,  Rios  Insua and  Muller  (1998)  and  Holmes  and  Mallick  (1998) \nhave  addressed  the issue  of selecting the  number of hidden  neurons  with  growing \nand pruning algorithms from a Bayesian perspective [5,6].  In particular, they apply \nthe  reversible  jump  Markov  Chain  Monte  Carlo  (MCMC)  algorithm  of  Green  [7] \nto  feed-forward  sigmoidal  networks  and  radial  basis  function  (RBF)  networks  to \nobtain joint estimates of the number of neurons and weights. \nWe also apply the reversible jump MCMC simulation algorithm to RBF networks so \nas to compute the joint posterior distribution of the radial basis parameters and the \nnumber of basis functions.  However, we  advance this area of research in two impor(cid:173)\ntant directions.  Firstly, we propose a full  hierarchical prior for RBF networks.  That \n\n* Authorship based on alphabetical order. \n\n\f380 \n\nC.  Andrieu, J.  F  G. d.  Freitas and A.  Doucet \n\nis,  we  adopt a full  Bayesian model, which accounts for model order uncertainty and \nregularisation,  and show  that the results  appear to  be  robust  with  respect  to  the \nprior specification.  Secondly,  we  present a  geometric  convergence theorem for  the \nalgorithm.  The complexity of the problem does  not allow for  a  comprehensive dis(cid:173)\ncussion in this short paper.  
We have, therefore, focused on describing our objectives, the Bayesian model, the convergence theorem and the results. Readers are encouraged to consult our technical report for further results and implementation details [8].¹

2 Problem statement

Many physical processes may be described by the following nonlinear, multivariate input-output mapping:

$$y_t = f(x_t) + n_t \quad (1)$$

where $x_t \in \mathbb{R}^d$ corresponds to a group of input variables, $y_t \in \mathbb{R}^c$ to the target variables, $n_t \in \mathbb{R}^c$ to an unknown noise process and $t = \{1, 2, \ldots\}$ is an index variable over the data. In this context, the learning problem involves computing an approximation to the function $f$ and estimating the characteristics of the noise process given a set of $N$ input-output observations $O = \{x_1, x_2, \ldots, x_N, y_1, y_2, \ldots, y_N\}$. Typical examples include regression, where $y_{1:N,1:c}$² is continuous; classification, where $y$ corresponds to a group of classes; and nonlinear dynamical system identification, where the inputs and targets correspond to several delayed versions of the signals under consideration.

We adopt the approximation scheme of Holmes and Mallick (1998), consisting of a mixture of $k$ RBFs and a linear regression term. Yet, the work can be easily extended to other regression models. More precisely, our model $\mathcal{M}$ is:

$$\mathcal{M}_0: \; y_t = b + \beta' x_t + n_t, \quad k = 0$$
$$\mathcal{M}_k: \; y_t = \sum_{j=1}^{k} a_j \phi(\|x_t - \mu_j\|) + b + \beta' x_t + n_t, \quad k \geq 1 \quad (2)$$

where $\|\cdot\|$ denotes a distance metric (usually Euclidean or Mahalanobis), $\mu_j \in \mathbb{R}^d$ denotes the $j$-th RBF centre for a model with $k$ RBFs, $a_j \in \mathbb{R}^c$ the $j$-th RBF amplitude, and $b \in \mathbb{R}^c$ and $\beta \in \mathbb{R}^{d \times c}$ the linear regression parameters. The noise sequence $n_t \in \mathbb{R}^c$ is assumed to be zero-mean white Gaussian. It is important to mention that although we have not explicitly indicated the dependency of $b$, $\beta$ and $n_t$ on $k$, these parameters are indeed affected by the value of $k$. For convenience, we express our approximation model in vector-matrix form:

$$\begin{bmatrix} y_{1,1} & \cdots & y_{1,c} \\ y_{2,1} & \cdots & y_{2,c} \\ \vdots & & \vdots \\ y_{N,1} & \cdots & y_{N,c} \end{bmatrix} = \begin{bmatrix} 1 & x_{1,1} & \cdots & x_{1,d} & \phi(x_1,\mu_1) & \cdots & \phi(x_1,\mu_k) \\ 1 & x_{2,1} & \cdots & x_{2,d} & \phi(x_2,\mu_1) & \cdots & \phi(x_2,\mu_k) \\ \vdots & & & & & & \vdots \\ 1 & x_{N,1} & \cdots & x_{N,d} & \phi(x_N,\mu_1) & \cdots & \phi(x_N,\mu_k) \end{bmatrix} \begin{bmatrix} b_1 & \cdots & b_c \\ \beta_{1,1} & \cdots & \beta_{1,c} \\ \vdots & & \vdots \\ \beta_{d,1} & \cdots & \beta_{d,c} \\ a_{1,1} & \cdots & a_{1,c} \\ \vdots & & \vdots \\ a_{k,1} & \cdots & a_{k,c} \end{bmatrix} + n_{1:N}$$

where the noise process is assumed to be normally distributed, $n_{t,i} \sim \mathcal{N}(0, \sigma_i^2)$ for $i = 1, \ldots, c$. In shorter notation, we have:

$$y = D(\mu_{1:k,1:d}, x_{1:N,1:d})\,\alpha_{1:1+d+k,1:c} + n_{1:N} \quad (3)$$

We assume here that the number $k$ of RBFs and their parameters $\theta \triangleq \{\alpha_{1:m,1:c}, \mu_{1:k,1:d}, \sigma_{1:c}^2\}$, with $m = 1 + d + k$, are unknown. Given the data set $\{x, y\}$, our objective is to estimate $k$ and $\theta \in \Theta_k$.

¹ The software is available at http://www.cs.berkeley.edu/~jfgf.
² $y_{1:N,1:c}$ is an $N$ by $c$ matrix, where $N$ is the number of data and $c$ the number of outputs. We adopt the notation $y_{1:N,j} \triangleq (y_{1,j}, y_{2,j}, \ldots, y_{N,j})'$ to denote all the observations corresponding to the $j$-th output ($j$-th column of $y$). To simplify the notation, $y_t$ is equivalent to $y_{t,1:c}$. That is, if one index does not appear, it is implied that we are referring to all of its possible values. Similarly, $y$ is equivalent to $y_{1:N,1:c}$. We will favour the shorter notation and only adopt the longer notation to avoid ambiguities and emphasise certain dependencies.
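To make this construction concrete, the following sketch (our own illustration, not code from the paper) assembles the design matrix $D(\mu_{1:k}, x_{1:N})$ of the vector-matrix form above and simulates data from $\mathcal{M}_k$. It assumes NumPy, takes "cubic basis function" (the choice used in Section 5) to mean $\phi(r) = r^3$, and all names and toy dimensions are ours.

```python
import numpy as np

def design_matrix(x, mu, phi=lambda r: r ** 3):
    """Assemble D(mu, x): a column of ones, the raw inputs (the linear
    regression term) and one RBF column per centre.
    x is (N, d); mu is (k, d); returns an (N, 1 + d + k) matrix."""
    ones = np.ones((x.shape[0], 1))
    if len(mu) == 0:                          # k = 0: pure linear model M_0
        return np.hstack([ones, x])
    # Euclidean distance from every input to every centre: shape (N, k)
    dists = np.linalg.norm(x[:, None, :] - mu[None, :, :], axis=2)
    return np.hstack([ones, x, phi(dists)])

# Simulate y = D(mu, x) alpha + n for a toy configuration (k = 3 RBFs)
rng = np.random.default_rng(0)
N, d, c, k = 200, 2, 2, 3
x = rng.uniform(-2.0, 2.0, size=(N, d))
mu = rng.uniform(-2.0, 2.0, size=(k, d))      # RBF centres
alpha = rng.normal(size=(1 + d + k, c))       # stacked [b; beta; a]
sigma = 0.05                                  # noise standard deviation
y = design_matrix(x, mu) @ alpha + sigma * rng.normal(size=(N, c))
```

For $k = 0$ the RBF columns disappear and the sketch reduces to the linear regression model $\mathcal{M}_0$.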
3 Bayesian model and aims

We follow a Bayesian approach where the unknowns $k$ and $\theta$ are regarded as being drawn from appropriate prior distributions. These priors reflect our degree of belief in the relevant values of these quantities [9]. Furthermore, we adopt a hierarchical prior structure that enables us to treat the priors' parameters (hyper-parameters) as random variables drawn from suitable distributions (hyper-priors). That is, instead of fixing the hyper-parameters arbitrarily, we acknowledge that there is an inherent uncertainty in what we think their values should be. By devising probabilistic models that deal with this uncertainty, we are able to implement estimation techniques that are robust to the specification of the hyper-priors.

The overall parameter space $\Theta \times \Psi$ can be written as a finite union of subspaces $\Theta \times \Psi = \left(\bigcup_{k=0}^{k_{\max}} \{k\} \times \Theta_k\right) \times \Psi$, where $\Theta_0 \triangleq (\mathbb{R}^{d+1})^c \times (\mathbb{R}^+)^c$ and $\Theta_k \triangleq (\mathbb{R}^{d+1+k})^c \times (\mathbb{R}^+)^c \times \Omega_k$ for $k \in \{1, \ldots, k_{\max}\}$. That is, $\alpha \in (\mathbb{R}^{d+1+k})^c$, $\sigma^2 \in (\mathbb{R}^+)^c$ and $\mu \in \Omega_k$. The hyper-parameter space $\Psi \triangleq (\mathbb{R}^+)^{c+1}$, with elements $\psi \triangleq \{\Lambda, \delta^2\}$, will be discussed at the end of this section. The space of the radial basis centres $\Omega_k$ is defined as a compact set including the input data: $\Omega_k \triangleq \{\mu : \mu_{1:k,i} \in [\min(x_{1:N,i}) - \iota s_i, \max(x_{1:N,i}) + \iota s_i]^k$ for $i = 1, \ldots, d$, with $\mu_{j,i} \neq \mu_{l,i}$ for $j \neq l\}$. Here $s_i = \|\max(x_{1:N,i}) - \min(x_{1:N,i})\|$ denotes the Euclidean distance for the $i$-th dimension of the input, and $\iota$ is a user-specified parameter that we only need to consider if we wish to place basis functions outside the region where the input data lie. That is, we allow $\Omega_k$ to include the space of the input data and extend it by a factor proportional to the spread of the input data. The hyper-volume of this space is $\varsigma_k \triangleq \left(\prod_{i=1}^{d} (1 + 2\iota)s_i\right)^k$.

The maximum number of basis functions is defined as $k_{\max} \triangleq N - (d + 1)$. We also define $\Omega \triangleq \bigcup_{k=0}^{k_{\max}} \{k\} \times \Omega_k$ with $\Omega_0 \triangleq \emptyset$. Under the assumption of independent outputs given $(k, \theta)$, the likelihood $p(y|k, \theta, \psi, x)$ for the approximation model described in the previous section is:

$$\prod_{i=1}^{c} (2\pi\sigma_i^2)^{-N/2} \exp\left(-\frac{1}{2\sigma_i^2} \left(y_{1:N,i} - D(\mu_{1:k}, x)\alpha_{1:m,i}\right)'\left(y_{1:N,i} - D(\mu_{1:k}, x)\alpha_{1:m,i}\right)\right)$$

We assume the following structure for the prior distribution:

$$p(k, \theta, \psi) = p(\alpha_{1:m}|k, \sigma^2, \delta^2)\, p(\mu_{1:k}|k)\, p(k|\Lambda)\, p(\sigma^2)\, p(\Lambda)\, p(\delta^2)$$

where the scale parameters $\sigma_i^2$ are assumed to be independent of the hyper-parameters (i.e. $p(\sigma^2|\Lambda, \delta^2) = p(\sigma^2)$), independent of each other ($p(\sigma^2) = \prod_{i=1}^{c} p(\sigma_i^2)$) and distributed according to conjugate inverse-Gamma prior distributions: $\sigma_i^2 \sim \mathcal{IG}\left(\frac{\nu_0}{2}, \frac{\gamma_0}{2}\right)$. When $\nu_0 = 0$ and $\gamma_0 = 0$, we obtain Jeffreys' uninformative prior [9].
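Before moving on, note that the likelihood displayed above is straightforward to evaluate numerically. A minimal sketch (again our own illustration, reusing design_matrix and NumPy from the earlier snippet):

```python
def log_likelihood(y, x, mu, alpha, sigma2):
    """Log of p(y | k, theta, x) for independent Gaussian outputs.
    y is (N, c); alpha is (m, c); sigma2 is a length-c array of
    per-output noise variances."""
    N, c = y.shape
    resid = y - design_matrix(x, mu) @ alpha          # (N, c) residuals
    out = 0.0
    for i in range(c):
        out -= 0.5 * N * np.log(2.0 * np.pi * sigma2[i])
        out -= 0.5 * resid[:, i] @ resid[:, i] / sigma2[i]
    return out
```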
For a given $\sigma^2$, the prior distribution $p(k, \alpha_{1:m}, \mu_{1:k}|\sigma^2, \Lambda, \delta^2)$ is:

$$\left[\prod_{i=1}^{c} \left|2\pi\sigma_i^2\delta_i^2 I_m\right|^{-1/2} \exp\left(-\frac{1}{2\sigma_i^2\delta_i^2}\,\alpha_{1:m,i}'\alpha_{1:m,i}\right)\right] \left[\frac{\mathbb{I}_{\Omega}(k, \mu_{1:k})}{\varsigma_k}\right] \left[\frac{\Lambda^k/k!}{\sum_{j=0}^{k_{\max}} \Lambda^j/j!}\right]$$

where $I_m$ denotes the identity matrix of size $m \times m$ and $\mathbb{I}_{\Omega}(k, \mu_{1:k})$ is the indicator function of the set $\Omega$ (1 if $(k, \mu_{1:k}) \in \Omega$, 0 otherwise). The prior model order distribution $p(k|\Lambda)$ is a truncated Poisson distribution. Conditional upon $k$, the RBF centres are uniformly distributed. Finally, conditional upon $(k, \mu_{1:k})$, the coefficients $\alpha_{1:m,i}$ are assumed to be zero-mean Gaussian with variance $\delta_i^2\sigma_i^2$. The hyper-parameters $\delta^2 \in (\mathbb{R}^+)^c$ and $\Lambda \in \mathbb{R}^+$ can be respectively interpreted as the expected signal-to-noise ratios and the expected number of radial basis functions. We assume that they are independent of each other, i.e. $p(\Lambda, \delta^2) = p(\Lambda)p(\delta^2)$. Moreover, $p(\delta^2) = \prod_{i=1}^{c} p(\delta_i^2)$. As $\delta^2$ is a scale parameter, we ascribe a vague conjugate prior density to it: $\delta_i^2 \sim \mathcal{IG}(\alpha_{\delta^2}, \beta_{\delta^2})$ for $i = 1, \ldots, c$, with $\alpha_{\delta^2} = 2$ and $\beta_{\delta^2} > 0$. The variance of this hyper-prior with $\alpha_{\delta^2} = 2$ is infinite. We apply the same method to $\Lambda$ by setting an uninformative conjugate prior [9]: $\Lambda \sim \mathcal{G}a\left(\frac{1}{2} + c_1, c_2\right)$ ($c_i \ll 1$, $i = 1, 2$).

3.1 Estimation and inference aims

The Bayesian inference of $k$, $\theta$ and $\psi$ is based on the joint posterior distribution $p(k, \theta, \psi|x, y)$ obtained from Bayes' theorem. Our aim is to estimate this joint distribution from which, by standard probability marginalisation and transformation techniques, one can "theoretically" obtain all posterior features of interest. We propose here to use the reversible jump MCMC method to perform the necessary computations; see [8] for details. MCMC techniques were introduced in the mid 1950s in statistical physics and started appearing in the fields of applied statistics, signal processing and neural networks in the 1980s and 1990s [3,5,6,10,11]. The key idea is to build an ergodic Markov chain $(k^{(i)}, \theta^{(i)}, \psi^{(i)})_{i \in \mathbb{N}}$ whose equilibrium distribution is the desired posterior distribution. Under weak additional assumptions, the $P \gg 1$ samples generated by the Markov chain are asymptotically distributed according to the posterior distribution and thus allow easy evaluation of all posterior features of interest. For example:

$$\widehat{p}(k = j|x, y) = \frac{1}{P}\sum_{i=1}^{P} \mathbb{I}_{\{j\}}(k^{(i)}) \quad\text{and}\quad \widehat{\mathbb{E}}(\theta|k = j, x, y) = \frac{\sum_{i=1}^{P} \theta^{(i)}\,\mathbb{I}_{\{j\}}(k^{(i)})}{\sum_{i=1}^{P} \mathbb{I}_{\{j\}}(k^{(i)})} \quad (4)$$

In addition, we can obtain predictions, such as:

$$\widehat{\mathbb{E}}(y_{N+1}|x_{1:N+1}, y_{1:N}) = \frac{1}{P}\sum_{i=1}^{P} D(\mu_{1:k}^{(i)}, x_{N+1})\,\alpha_{1:m}^{(i)} \quad (5)$$
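In code, once the chain has been run and its $P$ samples stored, estimates such as (4) and (5) are plain averages. A sketch under the assumption that the chain output is kept as Python lists (the names ks, mus and alphas are hypothetical; design_matrix is from the earlier snippet):

```python
def posterior_summaries(ks, mus, alphas, x_new, j):
    """Monte Carlo estimates from P stored samples.  ks[i] is k^(i);
    mus[i] and alphas[i] hold the centres and coefficients of sample i.
    Returns the estimate of p(k = j | x, y) from (4) and the predictive
    mean of (5) at the new inputs x_new (an (n, d) array)."""
    P = len(ks)
    p_k = sum(1 for k in ks if k == j) / P            # equation (4), left
    # Equation (5): average the noise-free model output over all samples
    y_mean = sum(design_matrix(x_new, mus[i]) @ alphas[i]
                 for i in range(P)) / P
    return p_k, y_mean
```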
3.2 Integration of the nuisance parameters

According to Bayes' theorem, we can obtain the posterior distribution as follows:

$$p(k, \theta, \psi|x, y) \propto p(y|k, \theta, \psi, x)\, p(k, \theta, \psi)$$

In our case, we can integrate with respect to $\alpha_{1:m}$ (Gaussian distribution) and with respect to $\sigma_i^2$ (inverse-Gamma distribution) to obtain the following expression for the posterior:

$$p(k, \mu_{1:k}, \Lambda, \delta^2|x, y) \propto \left[\frac{\mathbb{I}_{\Omega}(k, \mu_{1:k})}{\varsigma_k}\right] \left[\prod_{i=1}^{c} (\delta_i^2)^{-m/2}\,|M_{i,k}|^{1/2} \left(\gamma_0 + y_{1:N,i}'\, P_{i,k}\, y_{1:N,i}\right)^{-\frac{N+\nu_0}{2}}\right] \left[\frac{\Lambda^k/k!}{\sum_{j=0}^{k_{\max}} \Lambda^j/j!}\right] \left[\prod_{i=1}^{c} (\delta_i^2)^{-(\alpha_{\delta^2}+1)} \exp\left(-\frac{\beta_{\delta^2}}{\delta_i^2}\right)\right] \left[\Lambda^{(c_1 - 1/2)} \exp(-c_2\Lambda)\right] \quad (6)$$

where $M_{i,k}$ and $P_{i,k}$ are the matrices arising from the Gaussian and inverse-Gamma integrations (see [8] for their expressions). It is worth noticing that the posterior distribution is highly non-linear in the RBF centres $\mu_{1:k}$ and that an expression for $p(k|x, y)$ cannot be obtained in closed form.

4 Geometric convergence theorem

It is easy to prove that the reversible jump MCMC algorithm applied to our model converges, that is, that the Markov chain $(k^{(i)}, \mu_{1:k}^{(i)}, \Lambda^{(i)}, \delta^{2(i)})_{i \in \mathbb{N}}$ is ergodic. We present here a stronger result, namely that $(k^{(i)}, \mu_{1:k}^{(i)}, \Lambda^{(i)}, \delta^{2(i)})_{i \in \mathbb{N}}$ converges to the required posterior distribution at a geometric rate:

Theorem 1. Let $(k^{(i)}, \mu_{1:k}^{(i)}, \Lambda^{(i)}, \delta^{2(i)})_{i \in \mathbb{N}}$ be the Markov chain whose transition kernel has been described in Section 3. This Markov chain converges to the probability distribution $p(k, \mu_{1:k}, \Lambda, \delta^2|x, y)$. Furthermore, this convergence occurs at a geometric rate, that is, for almost every initial point $(k^{(0)}, \mu_{1:k}^{(0)}, \Lambda^{(0)}, \delta^{2(0)}) \in \Omega \times \Psi$ there exist a function of the initial states $C_0 > 0$ and a constant $\rho \in [0, 1)$ such that:

$$\left\|p^{(i)}(k, \mu_{1:k}, \Lambda, \delta^2) - p(k, \mu_{1:k}, \Lambda, \delta^2|x, y)\right\|_{TV} \leq C_0\, \rho^{\lfloor i/k_{\max} \rfloor} \quad (7)$$

where $p^{(i)}(k, \mu_{1:k}, \Lambda, \delta^2)$ is the distribution of $(k^{(i)}, \mu_{1:k}^{(i)}, \Lambda^{(i)}, \delta^{2(i)})$ and $\|\cdot\|_{TV}$ is the total variation norm [11]. Proof: see [8].

Corollary 1. If for each iteration $i$ one samples the nuisance parameters $(\alpha_{1:m}, \sigma^2)$, then the distribution of the series $(k^{(i)}, \alpha_{1:m}^{(i)}, \mu_{1:k}^{(i)}, \sigma^{2(i)}, \Lambda^{(i)}, \delta^{2(i)})_{i \in \mathbb{N}}$ converges geometrically towards $p(k, \alpha_{1:m}, \mu_{1:k}, \sigma^2, \Lambda, \delta^2|x, y)$ at the same rate $\rho$.

5 Demonstration: robot arm data

This data set is often used as a benchmark to compare learning algorithms.³ It involves implementing a model to map the joint angles of a robot arm $(x_1, x_2)$ to the position of the end of the arm $(y_1, y_2)$. The data were generated from the following model:

$$y_1 = 2.0\cos(x_1) + 1.3\cos(x_1 + x_2) + \epsilon_1$$
$$y_2 = 2.0\sin(x_1) + 1.3\sin(x_1 + x_2) + \epsilon_2$$

where $\epsilon_i \sim \mathcal{N}(0, \sigma^2)$ with $\sigma = 0.05$.

³ The robot arm data set can be found on David Mackay's home page: http://wol.ra.phy.cam.ac.uk/mackay/
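For experimentation, data with the same statistics can be regenerated from these equations. The sketch below is our own; the input-angle ranges follow Mackay's description of the benchmark and are an assumption here, since the experiments in this section use the fixed data set from the URL in the footnote. It continues the earlier snippets (NumPy as np).

```python
def robot_arm_data(n=400, sigma=0.05, seed=1):
    """Simulate (x1, x2) -> (y1, y2) pairs from the stated model.  The
    benchmark itself is a fixed file; the input ranges below are an
    assumption taken from Mackay's description of the data set."""
    rng = np.random.default_rng(seed)
    x1 = rng.uniform(-1.932, 1.932, size=n)
    x2 = rng.uniform(0.534, 3.142, size=n)
    eps = sigma * rng.normal(size=(n, 2))
    y1 = 2.0 * np.cos(x1) + 1.3 * np.cos(x1 + x2) + eps[:, 0]
    y2 = 2.0 * np.sin(x1) + 1.3 * np.sin(x1 + x2) + eps[:, 1]
    x = np.stack([x1, x2], axis=1)
    y = np.stack([y1, y2], axis=1)
    # First 200 observations for training, last 200 for testing (Section 5)
    return x[:200], y[:200], x[200:], y[200:]

x_train, y_train, x_test, y_test = robot_arm_data()
```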
We use the first 200 observations of the data set to train our models and the last 200 observations to test them. In the simulations, we chose to use cubic basis functions. Figure 1 shows the 3D plots of the training data and the contours of the training and test data. The contour plots also include the typical approximations that were obtained using the algorithm. We chose uninformative priors for all the parameters and hyper-parameters (Table 1). To demonstrate the robustness of our algorithm, we chose different values for $\beta_{\delta^2}$ (the only critical hyper-parameter, as it quantifies the mean of the spread $\delta^2$ of the coefficients $\alpha$). The obtained mean square errors and the posterior probabilities for $\delta_1^2$, $\delta_2^2$, $\sigma_{1,k}^2$, $\sigma_{2,k}^2$ and $k$, shown in Figure 2, clearly indicate that our algorithm is robust with respect to the prior specification. Our mean square errors are of the same magnitude as the ones reported by other researchers [2,3,5,6], and slightly better (by not more than 10%). Moreover, our algorithm leads to more parsimonious models than the ones previously reported.

Figure 1: The top plots show the training data surfaces corresponding to each coordinate of the robot arm's position. The middle and bottom plots show the training and validation data [- -] and the respective RBF network mappings [-].

Table 1: Simulation parameters and mean square test errors.

$\alpha_{\delta^2}$   $\beta_{\delta^2}$   $\nu_0$   $\gamma_0$   $c_1$     $c_2$     MS error
2                     0.1                  0         0            0.0001    0.0001    0.00505
2                     10                   0         0            0.0001    0.0001    0.00503
2                     100                  0         0            0.0001    0.0001    0.00502

Figure 2: Histograms of the smoothness constraints ($\delta_1^2$ and $\delta_2^2$), noise variances ($\sigma_{1,k}^2$ and $\sigma_{2,k}^2$) and model order ($k$) for the robot arm data using 3 different values for $\beta_{\delta^2}$. The plots confirm that the algorithm is robust to the setting of $\beta_{\delta^2}$.
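As a point of reference for Table 1, our reading (the paper does not spell out the error definition) is that the mean square test error is the mean squared residual of the predictive mean (5) over the 200 test points, averaged over both output coordinates. A sketch reusing the earlier snippets:

```python
def ms_test_error(ks, mus, alphas, x_test, y_test):
    """Mean square test error of the posterior predictive mean (5),
    averaged over all test points and both output coordinates."""
    P = len(ks)
    y_pred = sum(design_matrix(x_test, mus[i]) @ alphas[i]
                 for i in range(P)) / P
    return float(np.mean((y_test - y_pred) ** 2))
```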
6 Conclusions

We presented a general methodology for estimating, jointly, the noise variance, parameters and number of parameters of an RBF model. In adopting a Bayesian model and the reversible jump MCMC algorithm to perform the necessary integrations, we demonstrated that the method is very accurate. Contrary to previously reported results, our experiments indicate that our model is robust with respect to the specification of the prior. In addition, we obtained more parsimonious RBF networks and better approximation errors than the ones previously reported in the literature. There are many avenues for further research. These include estimating the type of basis functions, performing input variable selection, considering other noise models and extending the framework to sequential scenarios. A possible solution to the first problem can be formulated using the reversible jump MCMC framework. Variable selection schemes can also be implemented via the reversible jump MCMC algorithm. We are presently working on a sequential version of the algorithm that allows us to perform model selection in non-stationary environments.

References

[1] Buntine, W.L. & Weigend, A.S. (1991) Bayesian back-propagation. Complex Systems 5:603-643.
[2] Mackay, D.J.C. (1992) A practical Bayesian framework for backpropagation networks. Neural Computation 4:448-472.
[3] Neal, R.M. (1996) Bayesian Learning for Neural Networks. New York: Lecture Notes in Statistics No. 118, Springer-Verlag.
[4] de Freitas, J.F.G., Niranjan, M., Gee, A.H. & Doucet, A. (1999) Sequential Monte Carlo methods to train neural network models. To appear in Neural Computation.
[5] Rios Insua, D. & Müller, P. (1998) Feedforward neural networks for nonparametric regression. Technical report 98-02, Institute of Statistics and Decision Sciences, Duke University, http://www.stat.duke.edu.
[6] Holmes, C.C. & Mallick, B.K. (1998) Bayesian radial basis functions of variable dimension. Neural Computation 10:1217-1233.
[7] Green, P.J. (1995) Reversible jump Markov chain Monte Carlo computation and Bayesian model determination. Biometrika 82:711-732.
[8] Andrieu, C., de Freitas, J.F.G. & Doucet, A. (1999) Robust full Bayesian learning for neural networks. Technical report CUED/F-INFENG/TR 343, Cambridge University, http://svr-www.eng.cam.ac.uk/.
[9] Bernardo, J.M. & Smith, A.F.M. (1994) Bayesian Theory. Chichester: Wiley Series in Applied Probability and Statistics.
[10] Besag, J., Green, P.J., Higdon, D. & Mengersen, K. (1995) Bayesian computation and stochastic systems. Statistical Science 10:3-66.
[11] Tierney, L. (1994) Markov chains for exploring posterior distributions. The Annals of Statistics 22(4):1701-1762.
", "award": [], "sourceid": 1741, "authors": [{"given_name": "Christophe", "family_name": "Andrieu", "institution": null}, {"given_name": "Jo\u00e3o", "family_name": "de Freitas", "institution": null}, {"given_name": "Arnaud", "family_name": "Doucet", "institution": null}]}