{"title": "On Input Selection with Reversible Jump Markov Chain Monte Carlo Sampling", "book": "Advances in Neural Information Processing Systems", "page_first": 638, "page_last": 644, "abstract": null, "full_text": "On input  selection with reversible jump \n\nMarkov chain Monte  Carlo sampling \n\nPeter Sykacek \n\nAustrian  Research  Institute for  Artificial Intelligence  (OFAI) \n\nSchottengasse  3,  A-10lO Vienna, Austria \n\npeter@ai. univie. ac. at \n\nAbstract \n\nIn this paper we will treat input selection for a radial basis function \n(RBF) like classifier within a Bayesian framework.  We approximate \nthe a-posteriori distribution over both model coefficients  and input \nsubsets  by samples drawn with Gibbs updates and reversible jump \nmoves.  Using some  public  datasets,  we  compare the  classification \naccuracy  of the  method with  a  conventional  ARD  scheme.  These \ndatasets  are  also used  to  infer the  a-posteriori  probabilities of dif(cid:173)\nferent  input subsets. \n\n1 \n\nIntroduction \n\nMethods  that  aim  to  determine  relevance  of  inputs  have  always  interested  re(cid:173)\nsearchers  in  various  communities.  Classical feature  subset  selection  techniques,  as \nreviewed in [1],  use search algorithms and evaluation criteria to determine  one opti(cid:173)\nmal subset.  Although these approaches can improve classification accuracy,  they do \nnot explore  different  equally probable subsets.  Automatic relevance  determination \n(ARD) is another approach which determines relevance of inputs.  ARD is due to [6] \nwho uses  Bayesian  techniques,  where hierarchical  priors penalize irrelevant inputs. \n\nOur approach is  also  \"Bayesian\":  Relevance of inputs is  measured by a  probability \ndistribution over all possible feature subsets.  This probability measure is determined \nby the Bayesian evidence of the corresponding models.  
The general idea was already used in [7] for variable selection in linear regression models, though our interest is different, as we select inputs for a nonlinear classification model. We want an approximation of the true distribution over all different subsets. As the number of subsets grows exponentially with the total number of inputs, we cannot calculate the Bayesian model evidence directly. We need a method that samples efficiently across parameter spaces of different dimension. The most general method that can do this is the reversible jump Markov chain Monte Carlo sampler (reversible jump MC) recently proposed in [4]. The approach was successfully applied by [8] to determine a probability distribution in a mixture density model with a variable number of kernels and in [5] to sample from the posterior of RBF regression networks with a variable number of kernels. A Markov chain that switches between different input subsets is useful for two tasks: counting how often a particular subset was visited gives us a relevance measure of the corresponding inputs; for classification, we approximate the integral over input sets and coefficients by summation over samples from the Markov chain.

The next sections show how to implement such a reversible jump MC and apply the proposed algorithm to classification and input evaluation using some public datasets. Though the approach could not improve on the MLP-ARD scheme from [6] in terms of classification accuracy, we still think that it is interesting: we can assess the importance of different feature subsets, which is different from the importance of single features as estimated by ARD.

2 Methods

The classifier used in this paper is an RBF like model.
Inference is performed within a Bayesian framework. When conditioning on one set of inputs, the posterior over model parameters is already multimodal. Therefore we resort to Markov chain Monte Carlo (MCMC) sampling techniques to approximate the desired posterior over both model coefficients and feature subsets. In the next subsections we propose an appropriate architecture for the classifier and a hybrid sampler for model inference. This hybrid sampler consists of two parts: we use Gibbs updates ([2]) to sample when conditioning on a particular set of inputs, and reversible jump moves that carry out dimension switching updates.

2.1 The classifier

In order to allow input relevance determination by Bayesian model selection, the classifier needs at least one coefficient that is associated with each input: roughly speaking, the probability of each model is proportional to the likelihood of the most probable coefficients, weighted by their posterior width divided by their prior width. The first factor always increases when using more coefficients (or input features). The second will decrease the more inputs we use, and together this gives a peak for the most probable model. A classifier that satisfies these constraints is so-called classification in the sampling paradigm. We model class conditional densities and, together with class priors, express posterior probabilities for classes. In the neural network literature this approach was first proposed in [10]. We use a model that allows for overlapping class conditional densities:

p(x|k) = Σ_{d=1}^{D} w_{kd} p(x|Φ_d),   p(x) = Σ_{k=1}^{K} P_k p(x|k)   (1)

Using P_k for the K class priors and p(x|k) for the class conditional densities, (1) expresses posterior probabilities for classes as P(k|x) = P_k p(x|k) / p(x).
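As a concrete illustration of (1), the following minimal pure-Python sketch computes the posterior class probabilities for a toy two-class, two-kernel, one-input model; all names and numerical values are illustrative and not taken from the paper's experiments.

```python
import math

def gauss_diag(x, mean, var):
    """Diagonal-covariance Gaussian kernel density p(x|Phi_d)."""
    p = 1.0
    for x_i, m_i, v_i in zip(x, mean, var):
        p *= math.exp(-0.5 * (x_i - m_i) ** 2 / v_i) / math.sqrt(2 * math.pi * v_i)
    return p

def class_posteriors(x, priors, w, means, variances):
    """P(k|x) = P_k p(x|k) / p(x), with p(x|k) = sum_d w_kd p(x|Phi_d), cf. (1)."""
    dens = [gauss_diag(x, m, v) for m, v in zip(means, variances)]
    joint = [pk * sum(wkd * pd for wkd, pd in zip(wk, dens))
             for pk, wk in zip(priors, w)]
    z = sum(joint)  # p(x)
    return [j / z for j in joint]

# toy model: two classes, two shared kernels, one input
post = class_posteriors([0.1],
                        priors=[0.5, 0.5],
                        w=[[0.9, 0.1], [0.1, 0.9]],  # allocation probabilities w_kd
                        means=[[0.0], [2.0]],        # kernel means
                        variances=[[1.0], [1.0]])    # diagonal variances
```

Here x = [0.1] lies close to the first kernel, which the first class weights heavily, so the posterior for the first class dominates.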
We choose the component densities, p(x|Φ_d), to be Gaussian with restricted parametrisation: each kernel is a multivariate normal distribution with a mean and a diagonal covariance matrix. For all Gaussian kernels together, we get 2·D·I parameters, with I denoting the current input dimension and D denoting the number of kernels. Apart from the kernel coefficients, Φ_d, (1) has D coefficients per class, w_kd, indicating the prior kernel allocation probabilities, and K class priors. Model (1) allows us to treat the labels of patterns as missing data and to use labeled as well as unlabeled data for model inference. In this case training is carried out using the likelihood of observing inputs and targets:

p(T, X|Θ) = ∏_{k=1}^{K} ∏_{n=1}^{n_k} P_k p(x_{nk}|Θ_k) ∏_m p(x_m|Θ),   (2)

where T denotes labeled and X unlabeled training data. In (2), Θ_k are all coefficients the k-th class conditional density depends on. We further use Θ for all model coefficients together, n_k as the number of samples belonging to class k, and m as the index for unlabeled samples. To make Gibbs updates possible, we further introduce two latent allocation variables. The first one, d, indicates the kernel number each sample was generated from; the second one is the unobserved class label c, introduced for unlabeled data. Typical approaches for training models like (1), e.g. [3] and [9], use the EM algorithm, which is closely related to the Gibbs sampler introduced in the next subsection.

2.2 Fixed dimension sampling

In this subsection we formulate Gibbs updates for sampling from the posterior when conditioning on a fixed set of inputs.
In order to allow sampling from the full conditional, we have to choose priors over coefficients from their conjugate family:

• Each component mean, μ_d, is given a Gaussian prior: μ_d ~ N(ξ, κ⁻¹).
• The inverse variance of input i and kernel d gets a Gamma prior: σ_{di}⁻² ~ Γ(α, β_i).
• All D variances of input i have a common hyperparameter, β_i, that has itself a Gamma hyperprior: β_i ~ Γ(g, h_i).
• The mixing coefficients, w_k, get a Dirichlet prior: w_k ~ D(δ_w, ..., δ_w).
• Class priors, P, also get a Dirichlet prior: P ~ D(δ_p, ..., δ_p).

The quantitative settings are similar to those used in [8]: values for α are between 1 and 2, g is usually between 0.2 and 1, and h_i is typically between 1/R_i² and 10/R_i², with R_i denoting the i-th input range. The mean gets a Gaussian prior centered at the midpoint, ξ, with diagonal inverse covariance matrix κ, with κ_ii = 1/R_i². The prior counts δ_w and δ_p are set to 1 to give the corresponding probabilities non-informative proper Dirichlet priors.

The Gibbs sampler uses updates from the full conditional distributions in (3). For notational convenience we use Θ_k for the parameters that determine the class conditional densities. We use m as the index over unlabeled data and c_m as the latent class label. The index for all data is n, d_n are the latent kernel allocations and n_d is the number of samples allocated by the d-th component. One distribution does not occur in the prior specification: Mn(1, ...), which is a multinomial-one distribution. Finally we need some counters: m_1 ... m_K are the counts per class and m_{1k} ... m_{Dk} count the kernel allocations of class-k patterns.
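As an illustration of one of these updates, the draw of the d-th kernel mean from its Gaussian full conditional in (3) can be sketched in pure Python (diagonal covariances, so the input dimensions factorize; function and variable names are ours, not from any implementation of the paper):

```python
import math
import random

def sample_kernel_mean(xs, xi, kappa, inv_var):
    """One Gibbs draw of mu_d from its Gaussian full conditional, cf. (3).

    xs      : samples currently allocated to kernel d (lists of length I)
    xi      : prior mean (midpoints of the input ranges)
    kappa   : prior precisions kappa_ii = 1/R_i^2
    inv_var : current kernel precisions sigma_di^-2
    """
    n_d = len(xs)
    mu = []
    for i in range(len(xi)):
        prec = n_d * inv_var[i] + kappa[i]  # posterior precision n_d V_d^-1 + kappa
        xbar = sum(x[i] for x in xs) / n_d if n_d else 0.0
        mean = (n_d * inv_var[i] * xbar + kappa[i] * xi[i]) / prec
        mu.append(random.gauss(mean, 1.0 / math.sqrt(prec)))
    return mu
```

With many allocated samples the data term dominates the prior, so the draw concentrates near the mean of the allocated samples.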
The full conditionals of the d-th kernel variances and the hyperparameter β_i contain i as the index of the input dimension; there we express each σ_{di}⁻² separately. In the expression for the d-th kernel mean, μ_d, we use V_d to denote the entire covariance matrix.

p(c_m|...) = Mn(1, { P_k p(x_m|Θ_k) / Σ_{k'} P_{k'} p(x_m|Θ_{k'}), k = 1..K })
p(d_n|...) = Mn(1, { w_{c_n d} p(x_n|Φ_d) / Σ_{d'} w_{c_n d'} p(x_n|Φ_{d'}), d = 1..D })
p(β_i|...) = Γ(g + Dα, h_i + Σ_d σ_{di}⁻²)
p(w_k|...) = D(δ_w + m_{1k}, ..., δ_w + m_{Dk})
p(P|...) = D(δ_p + m_1, ..., δ_p + m_K)
p(μ_d|...) = N((n_d V_d⁻¹ + κ)⁻¹ (n_d V_d⁻¹ x̄_d + κ ξ), (n_d V_d⁻¹ + κ)⁻¹)
p(σ_{di}⁻²|...) = Γ(α + n_d/2, β_i + (1/2) Σ_{n: d_n = d} (x_{n,i} − μ_{d,i})²)   (3)

2.3 Moving between different input subsets

The core part of this sampler are the reversible jump updates, where we move between different feature subsets. The probability of a feature subset is determined by the corresponding Bayesian model evidence and by an additional prior over the number of inputs. In accordance with [7], we use the truncated Poisson prior:

p(I) = (1 / C(I_max, I)) · c · λ^I / I!, where c is a constant, C(I_max, I) a binomial coefficient and I_max the total number of inputs.

Reversible jump updates are generalizations of conventional Metropolis-Hastings updates, where moves are bijections (x, u) ↔ (x', u'). For a thorough treatment we refer to [4]. In order to switch subsets efficiently, we use two different types of moves. The first consists of a step where we add one input chosen at random and a matching step that removes one randomly chosen input. A second move exchanges two inputs, which allows "tunneling" through low likelihood areas.

Adding an input, we have to increase the dimension of all kernel means and diagonal covariances. These coefficients are drawn from their priors.
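A sketch of this prior draw for one added input dimension (pure Python; the function names and the choice of h_i = 1/R_i², the lower end of the interval suggested in Section 2.2, are our assumptions):

```python
import random

def draw_new_input_components(D, R_new, xi_new, alpha=2.0, g=0.2):
    """Draw the coefficients a birth move adds for one new input:
    D kernel-mean components and D inverse variances, from the priors
    of Section 2.2."""
    h = 1.0 / R_new ** 2                    # hyperprior rate h_i, taken at 1/R_i^2
    beta = random.gammavariate(g, 1.0 / h)  # beta_i ~ Gamma(g, h_i); rate -> scale
    # mu_{d,new} ~ N(xi_new, 1/kappa), kappa = 1/R_new^2, i.e. std dev R_new
    mu = [random.gauss(xi_new, R_new) for _ in range(D)]
    # sigma_{d,new}^-2 ~ Gamma(alpha, beta_i)
    inv_var = [random.gammavariate(alpha, 1.0 / beta) for _ in range(D)]
    return mu, inv_var
```

Note that `random.gammavariate` is parametrised by shape and scale, so the Gamma rate parameters above enter as reciprocals.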
In addition the move proposes new allocation probabilities in a semi-deterministic way. Assuming the ordering w_{k,d} ≥ w_{k,d+1}:

u ~ Beta(b_a, b_b + 1),
∀ d ≤ D/2:  w'_{k,D+1−d} = w_{k,D+1−d} + w_{k,d} u,   w'_{k,d} = w_{k,d} (1 − u)   (4)

The matching step proposes removing a randomly chosen input. Removing the corresponding kernel coefficients is again combined with a semi-deterministic proposal of new allocation probabilities, which is exactly symmetric to the proposal in (4).

Table 1: Summary of experiments

Data        avg(#)  max(#)  RBF (%, n_a)  MLP (%, n_b)
Ionosphere  4.3     9       (91.5, 11)    (95.5, 4)
Pima        4       7       (78.9, 11)    (79.8, 8)
Wine        4.4     8       (100, 0)      (96.8, 2)

We accept births with probability:

a_b = min(1, lh. ratio × p(I+1)/p(I)
      × ∏_{d=1}^{D} √(κ/2π) exp(−0.5 κ (μ'_{d,I+1} − ξ_{I+1})²)
      × ∏_{d=1}^{D} (β_{I+1}^α / Γ(α)) (σ'⁻²_{d,I+1})^{α−1} exp(−β_{I+1} σ'⁻²_{d,I+1})
      × [d_m/(I+1)] / [b_m/(I_max − I)]
      × 1 / [∏_{d=1}^{D} √(κ/2π) exp(−0.5 κ (μ'_{d,I+1} − ξ_{I+1})²) ∏_{d=1}^{D} (β_{I+1}^α / Γ(α)) (σ'⁻²_{d,I+1})^{α−1} exp(−β_{I+1} σ'⁻²_{d,I+1})])   (5)

The first line in (5) contains the likelihood and prior ratio. The prior ratio results from the difference in input dimension, which affects the kernel means and the prior over the number of inputs. The first term of the proposal ratio comes from proposing to add or remove one input. The second term is the proposal density of the additional kernel components, which cancels with the corresponding term in the prior ratio. Due to the symmetry of the proposal (4) and its reverse in a death move, there is no contribution from changing the allocation probabilities. Death moves are accepted with probability a_d = 1/a_b.

The second type of move is an exchange move. We select a new input and one of the model inputs and propose new mean coefficients.
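Because the added components in a birth are drawn from their priors, the prior and proposal densities in (5) cancel, and with the truncated Poisson prior on I the birth acceptance probability reduces to a few factors. A sketch under these assumptions (the move probabilities p_birth, p_death and λ = 1 are illustrative values, not from the paper; the likelihood ratio is assumed given):

```python
import math

def birth_acceptance(lik_ratio, I, I_max, lam=1.0, p_birth=0.5, p_death=0.5):
    """Birth acceptance probability after the prior/proposal cancellation
    in (5). lik_ratio is the likelihood ratio of the proposed (I+1)-input
    model to the current I-input model."""
    def prior(i):
        # truncated Poisson prior p(I), up to the constant c
        return (lam ** i / math.factorial(i)) / math.comb(I_max, i)
    prior_ratio = prior(I + 1) / prior(I)
    # propose death of one of I+1 inputs / propose birth of one of I_max - I
    proposal_ratio = (p_death / (I + 1)) / (p_birth / (I_max - I))
    return min(1.0, lik_ratio * prior_ratio * proposal_ratio)
```

The matching death move would be accepted with the reciprocal ratio, a_d = 1/a_b, again capped at 1.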
This gives the following acceptance probability:

a_e = min(1, lh. ratio × [c_m/I] / [c_m/(I_max − I)]
      × ∏_{d=1}^{D} √(κ/2π) exp(−0.5 κ (μ'_d − ξ)²) / ∏_{d=1}^{D} √(κ/2π) exp(−0.5 κ (μ_d − ξ)²)
      × ∏_{d=1}^{D} N(μ_d|···) / ∏_{d=1}^{D} N(μ'_d|···))   (6)

The first line of (6) again contains the likelihood and prior ratio. For exchange moves, the prior ratio is just the ratio of the different values in the kernel means. The first term in the proposal ratio comes from proposing to exchange an input. The second term is the proposal density of the new kernel mean components. The last part is from proposing new allocation probabilities.

3 Experiments

Although the method can be used with labeled and unlabeled data, the following experiments were performed using only labeled data. For all experiments we set α = 2 and g = 0.2. The first two datasets are from the UCI repository¹. We use the Ionosphere data, which has 33 inputs, 175 training and 176 test samples. For this experiment we use 6 kernels and set h = 0.5. The second dataset is the wine recognition data, which provides 13 inputs, 62 training and 63 test samples. For this data, we use 3 kernels and set h = 0.28. The third experiment is performed with the Pima data provided by B. D. Ripley². For this one we use 3 kernels and set h = 0.16.

For all experiments we draw 15000 samples from the posterior over coefficients and input subsets. We discard the first 5000 samples as burn-in and use the rest for predictions. Classification accuracy is compared with an MLP classifier using R. Neal's hybrid Monte Carlo sampling with ARD priors on inputs.

¹ Available at http://www.ics.uci.edu/~mlearn/MLRepository.html.
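The classifier comparison amounts to a sign test: under the null hypothesis of equal performance, the error counts (n_a, n_b) that each classifier makes and the other does not are an observation from a binomial Bn(n_a + n_b, 0.5) distribution (Table 1). A pure-Python sketch; the two-sided p-value convention is our assumption, as the paper does not state one:

```python
import math

def sign_test_p(na, nb):
    """Two-sided p-value for (na, nb) under X ~ Binomial(na + nb, 0.5)."""
    n = na + nb
    k = min(na, nb)
    tail = sum(math.comb(n, j) for j in range(k + 1)) / 2 ** n
    return min(1.0, 2.0 * tail)

# Ionosphere row of Table 1: the RBF sampler made 11 unique errors, the MLP 4
p = sign_test_p(11, 4)
```

For the Ionosphere counts this gives p ≈ 0.118, i.e. not significant at the 5% level, consistent with the conclusion drawn in the text.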
These experiments use 25 hidden units. Table 1 contains further details: avg(#) is the average and max(#) the maximal number of inputs used by the hybrid sampler; RBF (%, n_a) is the classification accuracy of the hybrid sampler and the number of errors it made that were not made by the ARD-MLP; MLP (%, n_b) is the same for the ARD-MLP. We compare classifiers by testing (n_a, n_b) against the null hypothesis that this is an observation from a binomial Bn(n_a + n_b, 0.5) distribution. This reveals that neither difference is significant. Although we could not improve classification accuracy on these data, this does not really matter, because ARD methods usually lead to high generalization accuracy and we can compete.

The real benefit of using the hybrid sampler is that we can infer probabilities telling us how much different subsets contribute to an explanation of the target variables. Figure 1 shows the occurrence probabilities of feature subsets and features. Note that Table 1 also has details about how many features were used in these problems. Especially the results from the Ionosphere data are interesting, as on average we use only 4.3 out of 33 input features. For the Ionosphere and wine data the Markov chain visits about 500 different input subsets within 10000 samples. For the Pima data the number is about 60, an order of magnitude smaller.

4 Discussion

In this paper we have discussed a hybrid sampler that uses Gibbs updates and reversible jump moves to approximate the a-posteriori distribution over parameters and input subsets in nonlinear classification problems. The classification accuracy of the method could compete with R. Neal's MLP-ARD implementation. However, the real advantage of the method is that it provides us with a relevance measure of feature subsets.
This allows us to infer the optimal number of inputs and how many different explanations the data provides.

Acknowledgements

I want to thank several people for resources they provide: I have used R. Neal's hybrid Markov chain sampler for the MLP experiments; the data used for the experiments were obtained from the University of California at Irvine repository and from B. D. Ripley. Furthermore I want to express gratitude to the anonymous reviewers for their comments and to J. F. G. de Freitas for useful discussions during the conference. This work was done in the framework of the research project GZ 607.519/2-V/B/9/98 "Verbesserung der Biosignalverarbeitung durch Beruecksichtigung von Unsicherheit und Konfidenz" ("Improving biosignal processing by accounting for uncertainty and confidence"), funded by the Austrian federal ministry of science and transport (BMWV).

² Available at http://www.stats.ox.ac.uk

[Figure 1 comprises six bar plots: occurrence probabilities of input subsets (left column) and of single inputs (right column) for the Ionosphere, Pima and Wine data. The most frequently visited subsets are: Ionosphere 9.1% [7,14], 8.6% [3,4], 7.9% [4,7], 4.6% [4,15]; Pima 17.3% [2,7], 9.8% [2,5], 8.6% [2,5,7]; Wine 9.5% [1,13], 5.2% [7,13], 3.1% [1,12,13].]

Figure 1: Probabilities of inputs and input subsets measuring their relevance.
\n\nReferences \n\n[1]  P. A. Devijver and J. V. Kittler.  Pattern Recognition. A  Statistical Approach. Prentice(cid:173)\n\nHall,  Englewood  Cliffs,  NJ,  1982. \n\n[2]  S.  Geman and D.  Geman.  Stochastic  relaxation,  gibbs distributions  and  the bayesian \n\nrestoration of images.  IEEE  Trans.  Pattn.  Anal.  Mach.  Intel.,  6:721-741,  1984. \n\n[3]  Z.  Ghahramani,  M.1.  Jordan  Supervised  Learning  from  Incomplete  Data via  an  EM \nApproach  In  Cowan  J.D.,  et  al.(eds.),  Advances  in  Neural  Information  Processing \nSystems 6,  Morgan Kaufmann,  Los  Altos/Palo  Alto/San Francisco,  pp.120-127,  1994. \n[4]  P.  J.  Green.  Reversible  jump  markov  chain  monte  carlo  computation  and bayesian \n\nmodel determination.  Biometrika, 82:711-732,  1995. \n\n[5]  C. C. Holmes and B. K. Mallick.  Bayesian radial basis functions of variable  dimension. \n\nNeural  Computation, 10:1217-1234,  1998. \n\n[6]  R.  M.  Neal.  Bayesian  Learning for  Neural Networks.  Springer,  New  York,  1986. \n[7]  D.  B.  Phillips  and A.  F.  M.  Smith.  Bayesian  model comparison  via jump diffusioons. \nIn  W.R.  Gilks,  S.  Richardson,  and  D.J.  Spiegelhalter,  editors,  Markov  Chain  Monte \nCarlo  in  Practice, pages 215-239,  London,  1996.  Chapman &  Hall. \n\n[8]  S.  Richardson  and  P.J.  Green  On  Bayesian  Analysis  of Mixtures  with  an  unknown \n\nnumber of components  Journal Royal Stat.  Soc.  B,  59:731-792,  1997. \n\n[9]  M.  Stensmo, T.J . Sejnowski  A Mixture Model System for Medical  and Machine  Diag(cid:173)\n\nnosis  In  Tesauro  G.,  et al.(eds.),  Advances  in  Neural  Information  Processing  System \n7,  MIT Press,  Cambridge/Boston/London,  pp.1077-1084,  1995. \n\n[10]  H.  G.  C.  Traven  A  neural  network  approach  to statistical  pattern classification  by \n\"semi parametric\"  estimation of probability  density functions  IEEE Trans.  Neur.  Net., \n2:366-377,  1991. 
\n\n\f", "award": [], "sourceid": 1645, "authors": [{"given_name": "Peter", "family_name": "Sykacek", "institution": null}]}