{"title": "On Input Selection with Reversible Jump Markov Chain Monte Carlo Sampling", "book": "Advances in Neural Information Processing Systems", "page_first": 638, "page_last": 644, "abstract": null, "full_text": "On input selection with reversible jump \n\nMarkov chain Monte Carlo sampling \n\nPeter Sykacek \n\nAustrian Research Institute for Artificial Intelligence (OFAI) \n\nSchottengasse 3, A-10lO Vienna, Austria \n\npeter@ai. univie. ac. at \n\nAbstract \n\nIn this paper we will treat input selection for a radial basis function \n(RBF) like classifier within a Bayesian framework. We approximate \nthe a-posteriori distribution over both model coefficients and input \nsubsets by samples drawn with Gibbs updates and reversible jump \nmoves. Using some public datasets, we compare the classification \naccuracy of the method with a conventional ARD scheme. These \ndatasets are also used to infer the a-posteriori probabilities of dif(cid:173)\nferent input subsets. \n\n1 \n\nIntroduction \n\nMethods that aim to determine relevance of inputs have always interested re(cid:173)\nsearchers in various communities. Classical feature subset selection techniques, as \nreviewed in [1], use search algorithms and evaluation criteria to determine one opti(cid:173)\nmal subset. Although these approaches can improve classification accuracy, they do \nnot explore different equally probable subsets. Automatic relevance determination \n(ARD) is another approach which determines relevance of inputs. ARD is due to [6] \nwho uses Bayesian techniques, where hierarchical priors penalize irrelevant inputs. \n\nOur approach is also \"Bayesian\": Relevance of inputs is measured by a probability \ndistribution over all possible feature subsets. This probability measure is determined \nby the Bayesian evidence of the corresponding models. The general idea was already \nused in [7] for variable selection in linear regression mo.dels. 
Our interest differs, though, as we select inputs for a nonlinear classification model. We want an approximation of the true distribution over all different subsets. As the number of subsets grows exponentially with the total number of inputs, we cannot calculate the Bayesian model evidence directly. We need a method that samples efficiently across parameter spaces of different dimensions. The most general method that can do this is the reversible jump Markov chain Monte Carlo sampler (reversible jump MC), recently proposed in [4]. The approach was successfully applied in [8] to determine a probability distribution in a mixture density model with a variable number of kernels, and in [5] to sample from the posterior of RBF regression networks with a variable number of kernels. A Markov chain that switches between different input subsets is useful for two tasks: counting how often a particular subset was visited gives us a relevance measure of the corresponding inputs; for classification, we approximate the integral over input sets and coefficients by summation over samples from the Markov chain.

The next sections show how to implement such a reversible jump MC and apply the proposed algorithm to classification and input evaluation using some public datasets. Though the approach could not improve on the MLP-ARD scheme from [6] in terms of classification accuracy, we still think it is interesting: we can assess the importance of different feature subsets, which is different from the importance of single features as estimated by ARD.

2 Methods

The classifier used in this paper is an RBF-like model. Inference is performed within a Bayesian framework. When conditioning on one set of inputs, the posterior over model parameters is already multimodal.
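Given samples from the chain, both tasks from the introduction (subset relevance and prediction) reduce to simple averages over the sample set. The sketch below illustrates this under assumed data structures: each posterior draw is a hypothetical pair of an input-index tuple and that model's coefficients, and `class_posterior` stands in for evaluating one sampled classifier.

```python
import numpy as np
from collections import Counter

# Hypothetical sample format: each draw from the chain is a pair
# (subset, params), where `subset` is the tuple of selected input indices
# and `params` holds the coefficients of that model.

def subset_relevance(samples):
    """Relevance measure: relative visit frequency of each input subset."""
    counts = Counter(frozenset(subset) for subset, _ in samples)
    total = len(samples)
    return {s: c / total for s, c in counts.items()}

def predictive(x, samples, class_posterior):
    """Approximate the predictive class probabilities by averaging
    P(k | x, subset, params) over the posterior samples."""
    probs = [class_posterior(x[list(subset)], params)
             for subset, params in samples]
    return np.mean(probs, axis=0)
```

Producing such samples is the hard part, since the posterior is multimodal and its dimension varies with the subset.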
Therefore we resort to Markov chain Monte Carlo (MCMC) sampling techniques to approximate the desired posterior over both model coefficients and feature subsets. In the next subsections we propose an appropriate architecture for the classifier and a hybrid sampler for model inference. This hybrid sampler consists of two parts: we use Gibbs updates ([2]) to sample when conditioning on a particular set of inputs, and reversible jump moves that carry out dimension switching updates.

2.1 The classifier

In order to allow input relevance determination by Bayesian model selection, the classifier needs at least one coefficient associated with each input: roughly speaking, the probability of each model is proportional to the likelihood of the most probable coefficients, weighted by their posterior width divided by their prior width. The first factor always increases when using more coefficients (or input features). The second decreases the more inputs we use, and together this gives a peak at the most probable model. A classifier that satisfies these constraints is so-called classification in the sampling paradigm. We model class conditional densities and, together with class priors, express posterior probabilities for classes. In the neural network literature this approach was first proposed in [10]. We use a model that allows for overlapping class conditional densities:

p(x|k) = \sum_{d=1}^{D} w_{kd} \, p(x|\Phi_d), \qquad p(x) = \sum_{k=1}^{K} P_k \, p(x|k)    (1)

Using P_k for the K class priors and p(x|k) for the class conditional densities, (1) expresses posterior probabilities for classes as P(k|x) = P_k \, p(x|k) / p(x). We choose the component densities, p(x|\Phi_d), to be Gaussian with restricted parametrisation: each kernel is a multivariate normal distribution with a mean and a diagonal covariance matrix.
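The posterior class probabilities defined by (1) can be sketched directly; the snippet below is a minimal illustration with hypothetical function and parameter names, using diagonal-covariance Gaussian kernels as described above.

```python
import numpy as np

def gauss_diag(x, mean, var):
    """Density of a multivariate normal with diagonal covariance at x."""
    return np.prod(np.exp(-0.5 * (x - mean) ** 2 / var)
                   / np.sqrt(2.0 * np.pi * var))

def class_posteriors(x, P, w, means, variances):
    """P(k|x) for the mixture classifier in (1).

    P: class priors, shape (K,); w: kernel allocation probabilities,
    shape (K, D); means, variances: kernel parameters, shape (D, I)."""
    kernels = np.array([gauss_diag(x, m, v)
                        for m, v in zip(means, variances)])  # p(x|Phi_d)
    cond = w @ kernels          # p(x|k) = sum_d w_kd p(x|Phi_d)
    joint = P * cond            # P_k p(x|k)
    return joint / joint.sum()  # Bayes: P(k|x) = P_k p(x|k) / p(x)
```

Note that the kernel densities are shared across classes; only the allocation probabilities w_kd differ per class, which is what allows overlapping class conditional densities.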
For all Gaussian kernels together, we get 2 * D * I parameters, with I denoting the current input dimension and D denoting the number of kernels. Apart from the kernel coefficients, \Phi_d, (1) has D coefficients per class, w_{kd}, indicating the prior kernel allocation probabilities, and K class priors. Model (1) allows us to treat the labels of patterns as missing data and to use labeled as well as unlabeled data for model inference. In this case training is carried out using the likelihood of observing inputs and targets:

p(T, X|\Theta) = \prod_{k=1}^{K} \prod_{n=1}^{n_k} P_k \, p(x_{n,k}|\Theta_k) \prod_{m} p(x_m|\Theta),    (2)

where T denotes labeled and X unlabeled training data. In (2), \Theta_k are all coefficients the k-th class conditional density depends on. We further use \Theta for all model coefficients together, n_k as the number of samples belonging to class k, and m as the index over unlabeled samples. To make Gibbs updates possible, we further introduce two latent allocation variables. The first one, d, indicates the kernel number each sample was generated from; the second one is the unobserved class label, c, introduced for unlabeled data. Typical approaches for training models like (1), e.g. [3] and [9], use the EM algorithm, which is closely related to the Gibbs sampler introduced in the next subsection.

2.2 Fixed dimension sampling

In this subsection we formulate Gibbs updates for sampling from the posterior when conditioning on a fixed set of inputs. In order to allow sampling from the full conditionals, we choose priors over coefficients from their conjugate families:

• Each component mean, \mu_d, is given a Gaussian prior: \mu_d \sim N(\xi, \kappa^{-1}).

• The inverse variance of input i and kernel d gets a Gamma prior: \sigma_{d,i}^{-2} \sim \Gamma(\alpha, \beta_i).

• All D variances of input i have a common hyperparameter, \beta_i, which itself has a Gamma hyperprior: \beta_i \sim \Gamma(g, h_i).
\n\n\u2022 The mixing coefficients, ~, get a Dirichlet prior: ~ '\" 1J (6w , ... , 6w ). \n\n\u2022 Class priors, P, also get a Dirichlet prior: P '\" 1J(6p , ... ,6p). \n\nThe quantitative settings are similar to those used in [8]: Values for a are between \n\n1 and 2, g is usually between 0.2 and 1 and hi is typically between 1/ Rr and 10/ Rr, \nat the midpoint, e, with diagonal inverse covariance matrix ~, with \"'ii = 1/ Rr. \n\nwith Ri denoting the i'th input range. The mean gets a Gaussian prior centered \n\nThe prior counts d w and 6p are set to 1 to give the corresponding probabilities \nnon-informative proper Dirichlet priors. \n\nThe Gibbs sampler uses updates from the full conditional distributions in (3). For \nnotational convenience we use ~ for the parameters that determine class condi(cid:173)\ntional densities. We use m as index over unlabeled data and Cm as latent class label. \nThe index for all data is n, dn are the latent kernel allocations and nd the number \nof samples allocated by the d-th component. One distribution does not occur in \nthe prior specification. That is Mn(l, ... ) which is a multinomial-one distribution. \nFinally we need some counters: ml ... mK are the counts per class and mlk .. mDk \ncount kernel allocations of class-k-patterns. The full conditional of the d-th kernel \nvariances and the hyper parameter ,Bi contain i as index of the input dimension. \nThere we express each u;J separately. In the expression of the d-th kernel mean, \n\nI \n\n\f641 \n\n(3) \n\nOn Input Selection with Reversible Jump MCMC \n\nilld, we use .lGt to denote the entire covariance matrix. \n\n( { PkP(~mlfu) \n\nI:k PkP(~mlfu)' k = l..K \n\nMn 1, \n\n}) \n\nMn (1, { WtndP(~nl~) ,d= l..D}) \nI:, Wt,.dP(~nl~) \nr (9 + Da. hi + ;; ud,! ) \n1) (ow + mlk, ... ,ow + mDk) \n1) (op + ml, ... , op + mK) \nN ((nd~l + ~)-l(ndVdl~ + ~S), (ndVd 1 + ~)-l) \nr (a + ~d, f3i + ~ L \n\n(~n,i -llid,i)2) \n\ni\u00a3,. Vnld,.=d \n\np(~J .. 
) \n\np(~I\u00b7\u00b7\u00b7) \np(PI\u00b7\u00b7\u00b7) \np(illdl\u00b7\u00b7\u00b7) \n\n2.3 Moving between different input subsets \n\nThe core part of this sampler are reversible jump updates, where we move between \ndifferent feature subsets. The probability of a feature subset will be determined by \nthe corresponding Bayesian model evidence and by an additional prior over number \nof inputs. In accordance with [7J, we use the truncated Poisson prior: \n\np(I) = 1/ ( I jax ) c ~~ , where c is a constant and Imax the total nr. of inputs. \n\nReversible jump updates are generalizations of conventional Metropolis-Hastings \nupdates, where moves are bijections (x, u) H (x', u'). For a thorough treatment we \nrefer to [4J. In order to switch subsets efficiently, we will use two different types of \nmoves. The first consist of a step where we add one input chosen at random and a \nmatching step that removes one randomly chosen input. A second move exchanges \ntwo inputs which allows \"tunneling\" through low likelihood areas. \n\nAdding an input, we have to increase the dimension of all kernel means and diagonal \ncovariances. These coefficients are drawn from their priors. In addition the move \nproposes new allocation probabilities in a semi deterministic way. Assuming the \nordering, Wk,d ~ Wk,d+1: \n\nop \n\nVd ~ D/2 \n\nBeta(ba , bb + 1) \n{ W~'D+l-d = Wk,D+l-d + Wk ,dOp \nw~ ,d = wk ,d(1 - op) \n\n(4) \n\nThe matching step proposes removing a randomly chosen input. Removing corre(cid:173)\nsponding kernel coefficients is again combined with a semi deterministic proposal \nof new allocation probabilities, which is exactly symmetric to the proposal in (4). \n\n\f642 \n\nP. Sykacek \n\nTable 1: Summary of experiments \n\navg(#) max(#) RBF (%,n a ) MLP (%,nb) \n\nData \n\nIonosphere \n\nPima \nWine \n\n4.3 \n4 \n4.4 \n\n9 \n7 \n8 \n\n(91.5,11) \n(78.9,111 \n(100, 01 \n\n95.5,4 \n79.8,8 \n96.8,2 \n\nWe accept births with probability: \n\nmin( 1, lh. 
rt x p(;(;/) G, J2,; r g exp ( - 05 ;\" (I'd -
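The add/remove move pair can be sketched as a standard accept/reject step. The snippet below is a simplified illustration, not the paper's full acceptance ratio: all function names and the rate lam are assumptions, the allocation-probability proposal and the move probabilities are ignored, and we use the common simplification that when new coefficients are drawn from their priors, the prior and proposal densities cancel.

```python
import numpy as np
from math import comb, factorial

def log_p_inputs(I, I_max, lam=3.0):
    """Log of the truncated Poisson prior over the number of inputs,
    p(I) proportional to (1 / C(I_max, I)) * lam**I / I!
    (the normalising constant is dropped; lam is an assumed rate)."""
    return -np.log(comb(I_max, I)) + I * np.log(lam) - np.log(factorial(I))

def accept_birth(loglik_new, loglik_old, I, I_max, rng):
    """Accept/reject an input-birth move.  With new kernel coefficients
    drawn from their priors, the acceptance ratio reduces here to the
    likelihood ratio times the prior ratio over the number of inputs."""
    log_ratio = (loglik_new - loglik_old
                 + log_p_inputs(I + 1, I_max) - log_p_inputs(I, I_max))
    return np.log(rng.uniform()) < min(0.0, log_ratio)
```

The matching death move uses the reciprocal ratio, so the pair leaves the target distribution over subsets invariant.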