{"title": "Sparse Multinomial Logistic Regression via Bayesian L1 Regularisation", "book": "Advances in Neural Information Processing Systems", "page_first": 209, "page_last": 216, "abstract": null, "full_text": "Sparse Multinomial Logistic Regression via Bayesian L1 Regularisation\n\nGavin C. Cawley\nSchool of Computing Sciences\nUniversity of East Anglia\nNorwich, Norfolk, NR4 7TJ, U.K.\ngcc@cmp.uea.ac.uk\n\nNicola L. C. Talbot\nSchool of Computing Sciences\nUniversity of East Anglia\nNorwich, Norfolk, NR4 7TJ, U.K.\nnlct@cmp.uea.ac.uk\n\nMark Girolami\nDepartment of Computing Science\nUniversity of Glasgow\nGlasgow, Scotland, G12 8QQ, U.K.\ngirolami@dcs.gla.ac.uk\n\nAbstract\n\nMultinomial logistic regression provides the standard penalised maximum-likelihood solution to multi-class pattern recognition problems. More recently, the development of sparse multinomial logistic regression models has found application in text processing and microarray classification, where explicit identification of the most informative features is of value. In this paper, we propose a sparse multinomial logistic regression method, in which the sparsity arises from the use of a Laplace prior, but where the usual regularisation parameter is integrated out analytically. Evaluation over a range of benchmark datasets reveals that this approach results in generalisation performance similar to that obtained using cross-validation, but at greatly reduced computational expense.\n\n1 Introduction\n\nMultinomial logistic and probit regression are perhaps the classic statistical methods for multi-class pattern recognition problems (for a detailed introduction, see e.g. [1, 2]). The output of a multinomial logistic regression model can be interpreted as an a-posteriori estimate of the probability that a pattern belongs to each of c disjoint classes. 
The probabilistic nature of the multinomial logistic regression model affords many practical advantages, such as the ability to set rejection thresholds [3], to accommodate unequal relative class frequencies in the training set and in operation [4], or to apply an appropriate loss matrix in making predictions that minimise the expected risk [5]. As a result, these models have been adopted in a diverse range of applications, including cancer classification [6, 7], text categorisation [8], analysis of DNA binding sites [9] and call routing. More recently, the focus of research has been on methods for inducing sparsity in (multinomial) logistic or probit regression models. In some applications, the identification of salient input features is of itself a valuable activity; for instance, in cancer classification from micro-array gene expression data, the identification of biomarker genes, the pattern of expression of which is diagnostic of a particular form of cancer, may provide insight into the ætiology of the condition. In other applications, these methods are used to select a small number of basis functions to form a compact non-parametric classifier, from a set that may contain many thousands of candidate functions. In this case the sparsity is desirable for the purposes of computational expediency, rather than as an aid to understanding the data.\n\nA variety of methods have been explored that aim to introduce sparsity in non-parametric regression models through the incorporation of a penalty or regularisation term within the training criterion. In the context of least-squares regression using Radial Basis Function (RBF) networks, Orr [10] proposes the use of local regularisation, in which a weight-decay regularisation term is used with distinct regularisation parameters for each weight. 
The optimisation of the Generalised Cross-Validation (GCV) score typically leads to the regularisation parameters for redundant basis functions achieving very high values, allowing them to be identified and pruned from the network (c.f. [11, 12]). The computational efficiency of this approach can be further improved via the use of Recursive Orthogonal Least Squares (ROLS). The relevance vector machine (RVM) [13] implements a form of Bayesian automatic relevance determination (ARD), using a separable Gaussian prior. In this case, the regularisation parameter for each weight is adjusted so as to maximise the marginal likelihood, also known as the Bayesian evidence for the model. An efficient component-wise training algorithm is given in [14]. An alternative approach, known as the LASSO [15], seeks to minimise the negative log-likelihood of the sample, subject to an upper bound on the sum of the absolute values of the weights (see also [16] for a practical training procedure). This strategy is equivalent to the use of a Laplace prior over the model parameters [17], which has been demonstrated to control over-fitting and induce sparsity in the weights of multi-layer perceptron networks [18]. The equivalence of the Laplace prior and a separable Gaussian prior (with an appropriate choice of regularisation parameters) has been established by Grandvalet [11, 12], unifying these strands of research.\n\nIn this paper, we demonstrate that, in the case of the Laplace prior, the regularisation parameters can be integrated out analytically, obviating the need for a lengthy cross-validation based model selection stage. The resulting sparse multinomial logistic regression algorithm with Bayesian regularisation (SBMLR) is then fully automated and, having storage requirements that scale only linearly with the number of model parameters, is well suited to relatively large-scale applications. 
The remainder of this paper is set out as follows: the sparse multinomial logistic regression procedure with Bayesian regularisation is presented in Section 2. The proposed algorithm is then evaluated against competing approaches over a range of benchmark learning problems in Section 3, and the relationship to existing work is discussed in Section 4. Finally, the work is summarised and conclusions are drawn in Section 5.\n\n2 Method\n\nLet D = {(x^n, t^n)}_{n=1}^{ℓ} represent the training sample, where x^n ∈ X ⊂ R^d is the vector of input features for the nth example, and t^n ∈ T = {t | t ∈ {0, 1}^c, ‖t‖_1 = 1} is the corresponding vector of desired outputs, using the usual 1-of-c coding scheme. Multinomial logistic regression constructs a generalised linear model [1] with a softmax inverse link function [19], allowing the outputs to be interpreted as a-posteriori estimates of the probabilities of class membership,\n\np(t_i^n | x^n) = y_i^n = exp{a_i^n} / Σ_{j=1}^{c} exp{a_j^n}, where a_i^n = Σ_{j=1}^{d} w_{ij} x_j^n.    (1)\n\nAssuming that D represents an i.i.d. sample from a conditional multinomial distribution, the negative log-likelihood, used as a measure of the data misfit, can be written as\n\nE_D = Σ_{n=1}^{ℓ} E_D^n = − Σ_{n=1}^{ℓ} Σ_{i=1}^{c} t_i^n log{y_i^n}.\n\nThe parameters, w, of the multinomial logistic regression model are given by the minimiser of a penalised maximum-likelihood training criterion,\n\nL = E_D + α E_W, where E_W = Σ_{i=1}^{c} Σ_{j=1}^{d} |w_{ij}|,    (2)\n\nand α is a regularisation parameter [20] controlling the bias-variance trade-off [21]. 
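To make the notation concrete, the model (1) and the penalised criterion (2) can be sketched in a few lines of NumPy. This is a minimal illustrative sketch, not the authors' implementation; the function names are our own.

```python
import numpy as np

def softmax(A):
    """Row-wise softmax; subtracting the row maximum avoids overflow."""
    A = A - A.max(axis=1, keepdims=True)
    E = np.exp(A)
    return E / E.sum(axis=1, keepdims=True)

def penalised_criterion(W, X, T, alpha):
    """L = E_D + alpha * E_W for the model of equations (1)-(2).

    W : (c, d) weight matrix, X : (n, d) inputs, T : (n, c) 1-of-c targets.
    """
    Y = softmax(X @ W.T)            # y_i^n, with a_i^n = sum_j w_ij x_j^n
    E_D = -np.sum(T * np.log(Y))    # negative log-likelihood, E_D
    E_W = np.abs(W).sum()           # Laplace (L1) penalty, E_W
    return E_D + alpha * E_W
```

With all weights at zero the criterion reduces to ℓ log c, which gives a quick sanity check.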
At a minimum of L, the partial derivatives of L with respect to the model parameters will be uniformly zero, giving\n\n|∂E_D/∂w_{ij}| = α if |w_{ij}| > 0, and |∂E_D/∂w_{ij}| < α if |w_{ij}| = 0.\n\nThis implies that if the sensitivity of the negative log-likelihood with respect to a model parameter, w_{ij}, falls below α, then the value of that parameter will be set exactly to zero and the corresponding input feature can be pruned from the model.\n\n2.1 Eliminating the Regularisation Parameters\n\nMinimisation of (2) has a straightforward Bayesian interpretation; the posterior distribution for w, the parameters of the model given by (1), can be written as\n\np(w|D) ∝ P(D|w)P(w).\n\nL is then, up to an additive constant, the negative logarithm of the posterior density. The prior over the model parameters, w, is then given by a separable Laplace distribution,\n\nP(w) = (α/2)^W exp{−α E_W} = Π_{i=1}^{W} (α/2) exp{−α |w_i|},    (3)\n\nwhere W is the number of active (non-zero) model parameters. A good value for the regularisation parameter α can be estimated, within a Bayesian framework, by maximising the evidence [22], or alternatively it may be integrated out analytically [17, 23]. Here we take the latter approach, where the prior distribution over model parameters is given by marginalising over α,\n\np(w) = ∫ p(w|α) p(α) dα.\n\nAs α is a scale parameter, an appropriate ignorance prior is given by the improper Jeffreys prior, p(α) ∝ 1/α, corresponding to a uniform prior over log α. 
Substituting equation (3), and noting that α is strictly positive,\n\np(w) = (1/2^W) ∫_0^∞ α^{W−1} exp{−α E_W} dα.\n\nUsing the Gamma integral, ∫_0^∞ x^{ν−1} e^{−μx} dx = Γ(ν)/μ^ν [24, equation 3.384], we obtain\n\np(w) = (1/2^W) Γ(W)/E_W^W  ⟹  −log p(w) ∝ W log E_W,\n\ngiving a revised optimisation criterion for sparse logistic regression with Bayesian regularisation,\n\nM = E_D + W log E_W,    (4)\n\nin which the regularisation parameter has been eliminated; for further details and theoretical justification, see [17]. Note that we integrate out the regularisation parameter and optimise the model parameters, which is unusual in that most Bayesian approaches, such as the relevance vector machine [13], optimise the regularisation parameters and integrate over the weights.\n\n2.1.1 Practical Implementation\n\nThe training criterion incorporating a fully Bayesian regularisation term can be minimised via a simple modification of existing cyclic co-ordinate descent algorithms for sparse regression using a Laplace prior (e.g. [25, 26]). Differentiating the original and modified training criteria, (2) and (4) respectively, we have\n\n∇L = ∇E_D + α ∇E_W and ∇M = ∇E_D + α̃ ∇E_W, where 1/α̃ = (1/W) Σ_{i=1}^{W} |w_i|.    (5)\n\nFrom a gradient descent perspective, minimising M effectively becomes equivalent to minimising L, assuming that the regularisation parameter, α, is continuously updated according to (5) following every change in the vector of model parameters, w [17]. 
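In code, the update (5) amounts to recomputing a single scalar after every weight change. A minimal sketch (the helper name `effective_alpha` is ours, not from the paper):

```python
import numpy as np

def effective_alpha(w):
    """Effective regularisation parameter of equation (5):
    1 / alpha~ = (1/W) * sum_i |w_i|, where W counts the active
    (non-zero) weights.  An illustrative sketch, not the authors' code."""
    active = np.abs(w) > 0.0
    W = int(active.sum())
    # If every weight is pruned the prior is infinitely concentrated.
    return W / np.abs(w[active]).sum() if W > 0 else np.inf
```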
This requires only a very minor modification of the existing training algorithm, whilst eliminating the only training parameter and hence the need for a model selection procedure in fitting the model.\n\n2.1.2 Equivalence of Marginalisation and Optimisation under the Evidence Framework\n\nWilliams [17] notes that, at least in the case of the Laplace prior, integrating out the regularisation parameter analytically is equivalent to its optimisation under the evidence framework of MacKay [22]. The argument provided by Williams can be summarised as follows: the evidence framework sets the value of the regularisation parameter so as to optimise the marginal likelihood,\n\nP(D) = ∫ P(D|w) P(w) dw,\n\nalso known as the evidence for the model. The Bayesian interpretation of the regularised objective function gives\n\nP(D) = (1/Z_W) ∫ exp{−L} dw,\n\nwhere Z_W is the normalising constant for the prior over the model parameters; for the Laplace prior, Z_W = (2/α)^W. In the case of multinomial logistic regression, E_D represents the negative logarithm of a normalised distribution, and so the corresponding normalising constant for the data misfit term is redundant. Unfortunately this integral is analytically intractable, and so we adopt the Laplace approximation, corresponding to a Gaussian posterior distribution for the model parameters, centred on their most probable value, w_MP,\n\nL(w) ≈ L(w_MP) + (1/2)(w − w_MP)^T A (w − w_MP),\n\nwhere A = ∇∇L is the Hessian of the regularised objective function. The regulariser corresponding to the Laplace prior is locally a hyper-plane, and so does not contribute to the Hessian, giving A = ∇∇E_D. 
The negative logarithm of the evidence can then be written as\n\n−log P(D) = E_D + α E_W + (1/2) log |A| + log Z_W + constant.\n\nSetting the derivative of the evidence with respect to α to zero gives rise to a simple update rule for the regularisation parameter,\n\n1/α̃ = (1/W) Σ_{j=1}^{W} |w_j|,\n\nwhich is equivalent to the update rule obtained using the integrate-out approach. Maximising the evidence for the model also provides a convenient means for model selection. Using the Laplace approximation, the evidence for a multinomial logistic regression model under the proposed Bayesian regularisation scheme is given by\n\n−log P(D) = E_D + W log E_W − log{Γ(W)/2^W} + (1/2) log |A| + constant,\n\nwhere A = ∇∇L.\n\n2.2 A Simple but Efficient Training Algorithm\n\nIn this study, we adopt a simplified version of the efficient component-wise training algorithm of Shevade and Keerthi [25], adapted for multinomial, rather than binomial, logistic regression. The principal advantage of a component-wise optimisation algorithm is that the Hessian matrix is not required, but only the first and second partial derivatives of the regularised training criterion. The first partial derivatives of the data misfit term are given by\n\n∂E_D^n/∂a_j^n = Σ_{i=1}^{c} (∂E_D^n/∂y_i^n)(∂y_i^n/∂a_j^n), where ∂E_D^n/∂y_i^n = −t_i^n/y_i^n,\n\nand δ_{ij} is the Kronecker delta (δ_{ij} = 1 if i = j and δ_{ij} = 0 otherwise). 
The derivative of the softmax function is ∂y_i^n/∂a_j^n = y_i^n δ_{ij} − y_i^n y_j^n. Substituting, we obtain\n\n∂E_D/∂a_i = Σ_{n=1}^{ℓ} [y_i^n − t_i^n]  ⟹  ∂E_D/∂w_{ij} = Σ_{n=1}^{ℓ} [y_i^n − t_i^n] x_j^n = Σ_{n=1}^{ℓ} y_i^n x_j^n − Σ_{n=1}^{ℓ} t_i^n x_j^n.\n\nSimilarly, the second partial derivatives are given by\n\n∂²E_D/∂w_{ij}² = Σ_{n=1}^{ℓ} x_j^n (∂y_i^n/∂w_{ij}) = Σ_{n=1}^{ℓ} y_i^n (1 − y_i^n) [x_j^n]².\n\nThe Laplace regulariser is locally a hyperplane, with the magnitude of the gradient given by the regularisation parameter, α,\n\n∂(α E_W)/∂w_{ij} = sign{w_{ij}} α and ∂²(α E_W)/∂w_{ij}² = 0.\n\nThe partial derivatives of the regularisation term are not defined at the origin, and so we define the effective gradient of the regularised loss function as follows:\n\n∂L/∂w_{ij} = ∂E_D/∂w_{ij} + α if w_{ij} > 0;\n∂L/∂w_{ij} = ∂E_D/∂w_{ij} − α if w_{ij} < 0;\n∂L/∂w_{ij} = ∂E_D/∂w_{ij} + α if w_{ij} = 0 and ∂E_D/∂w_{ij} + α < 0;\n∂L/∂w_{ij} = ∂E_D/∂w_{ij} − α if w_{ij} = 0 and ∂E_D/∂w_{ij} − α > 0;\n∂L/∂w_{ij} = 0 otherwise.\n\nNote that the value of a weight may be stable at zero if the derivative of the regularisation term dominates the derivative of the data misfit. The parameters of the model may then be optimised using Newton's method, i.e.\n\nw_{ij} ← w_{ij} − (∂E_D/∂w_{ij}) [∂²E_D/∂w_{ij}²]^{−1}.\n\nAny step that causes a change of sign in a model parameter is truncated, and that parameter is set to zero. All that remains is to decide on a heuristic used to select the parameter to be optimised in each step. 
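One component-wise update of the kind described above can be sketched as follows. This is an illustrative sketch, not the authors' code, and it takes the Newton step along the effective gradient of the regularised loss rather than the data misfit alone:

```python
def coordinate_step(w, g1, g2, alpha):
    """One component-wise Newton update of a single weight w_ij.

    g1, g2 : first and second partial derivatives of E_D w.r.t. w_ij;
    alpha  : Laplace regularisation parameter.
    Returns the updated weight, truncated to zero on a sign change."""
    # Effective gradient of the regularised loss at this weight.
    if w > 0.0:
        g = g1 + alpha
    elif w < 0.0:
        g = g1 - alpha
    elif g1 + alpha < 0.0:
        g = g1 + alpha
    elif g1 - alpha > 0.0:
        g = g1 - alpha
    else:
        return 0.0                 # weight is stable at zero
    w_new = w - g / g2             # Newton step
    # A step that changes the sign of an active weight is truncated.
    if w != 0.0 and w_new * w <= 0.0:
        return 0.0
    return w_new
```

A zero weight becomes active only when the magnitude of the data-misfit gradient exceeds alpha, which is exactly the pruning condition stated earlier.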
In this study, we adopt the heuristic chosen by Shevade and Keerthi, in which the parameter having the steepest gradient is selected in each iteration. The optimisation proceeds using two nested loops; in the inner loop, only active parameters are considered. If no further progress can be made by optimising active parameters, the search is extended to parameters that are currently set to zero. An optimisation strategy based on scaled conjugate gradient descent [27] has also been found to be effective.\n\n3 Results\n\nThe proposed sparse multinomial logistic regression method incorporating Bayesian regularisation using a Laplace prior (SBMLR) was evaluated over a suite of well-known benchmark datasets, against sparse multinomial logistic regression with five-fold cross-validation based optimisation of the regularisation parameter using a simple line search (SMLR). Table 1 shows the test error rate and cross-entropy statistics for the SMLR and SBMLR methods over these datasets. Clearly, there is little reason to prefer either model over the other in terms of generalisation performance, as neither consistently dominates the other, either in terms of error rate or cross-entropy. Table 1 also shows that the Bayesian regularisation scheme results in models with a slightly higher degree of sparsity (i.e. the proportion of weights pruned from the model). However, the most striking aspect of the comparison is that the Bayesian regularisation scheme is typically around two orders of magnitude faster than the cross-validation based approach, with SBMLR being approximately five times faster in the worst case (COVTYPE).\n\n3.1 The Value of Probabilistic Classification\n\nProbabilistic classifiers, i.e. those that provide an a-posteriori estimate of the probability of class membership, can be used in minimum risk classification, using an appropriate loss matrix to account for the relative costs of different types of error. 
Probabilistic classifiers allow rejection thresholds to be set in a straightforward manner. This is particularly useful in a medical setting, where it may be prudent to refer a patient for further tests if the diagnosis is uncertain. Finally, the output of a probabilistic classifier can be adjusted after training to compensate for a difference between the relative class frequencies in the training set and those observed in operation.\n\nTable 1: Evaluation of linear sparse multinomial logistic regression methods over a set of nine benchmark datasets. The final column shows the logarithm of the ratio of the training times for SMLR and SBMLR, such that a value of 2 would indicate that SBMLR is 100 times faster than SMLR for a given benchmark dataset.\n\nBenchmark | Error Rate (SBMLR / SMLR) | Cross Entropy (SBMLR / SMLR) | Sparsity (SBMLR / SMLR) | log10(T_SMLR/T_SBMLR)\nCovtype   | 0.4051 / 0.4041 | 0.9590 / 0.9733 | 0.4312 / 0.3069 | 0.6965\nCrabs     | 0.0635 / 0.0500 | 0.1075 / 0.0891 | 0.2708 / 0.0350 | 2.7949\nGlass     | 0.3318 / 0.3224 | 0.9398 / 0.9912 | 0.4400 / 0.4700 | 1.9445\nIris      | 0.0267 / 0.0267 | 0.0792 / 0.0867 | 0.4067 / 0.4067 | 1.9802\nIsolet    | 0.0475 / 0.0513 | 0.1858 / 0.2641 | 0.9311 / 0.8598 | 1.3110\nSatimage  | 0.1610 / 0.1600 | 0.3717 / 0.3708 | 0.3694 / 0.2747 | 1.3083\nViruses   | 0.0328 / 0.0328 | 0.1670 / 0.1168 | 0.8118 / 0.7632 | 2.1118\nWaveform  | 0.1290 / 0.1302 | 0.3124 / 0.3131 | 0.3712 / 0.3939 | 1.8133\nWine      | 0.0225 / 0.0281 | 0.0827 / 0.0825 | 0.6071 / 0.5524 | 2.5541\n\nSaerens [4] provides a simple expectation-maximisation (EM) based procedure for estimating unknown operational a-priori probabilities from the output of a probabilistic classifier (c.f. [28]). Let p_t(C_i) represent the a-priori probability of class C_i in the training set and p_t(C_i|x^n) represent the raw output of the classifier for the nth pattern of the test data (representing operational conditions). 
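The EM update given below as equation (6) is straightforward to implement. A minimal NumPy sketch (the function and argument names are our own, not from the paper):

```python
import numpy as np

def correct_priors(P, train_priors, n_iter=100):
    """EM re-estimation of operational class priors from the raw outputs
    of a probabilistic classifier, following Saerens et al. [4].

    P : (n, c) raw a-posteriori outputs p_t(C_i | x^n) on operational data.
    train_priors : (c,) training-set priors p_t(C_i).
    Returns (estimated operational priors, corrected posteriors)."""
    po = np.asarray(train_priors, dtype=float)
    for _ in range(n_iter):
        R = P * (po / train_priors)           # reweight by the prior ratio
        R /= R.sum(axis=1, keepdims=True)     # renormalise the posteriors
        po = R.mean(axis=0)                   # re-estimate the priors
    return po, R
```

Note that, as in the paper, the test labels are never used: the priors are re-estimated purely from the classifier outputs.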
The operational a-priori probabilities, p_o(C_i), can then be updated iteratively via\n\np_o^{(s)}(C_i|x^n) = [p_o^{(s)}(C_i)/p_t(C_i)] p_t(C_i|x^n) / Σ_{j=1}^{c} [p_o^{(s)}(C_j)/p_t(C_j)] p_t(C_j|x^n) and p_o^{(s+1)}(C_i) = (1/ℓ) Σ_{n=1}^{ℓ} p_o^{(s)}(C_i|x^n),    (6)\n\nbeginning with p_o^{(0)}(C_i) = p_t(C_i). Note that the labels of the test examples are not required for this procedure. The adjusted estimates of the a-posteriori probability are then given by the first part of equation (6). The training and validation sets of the COVTYPE benchmark have been artificially balanced, by random sampling, so that each class is represented by the same number of examples. The test set consists of the unused patterns, and so the test set a-priori probabilities are both highly disparate and very different from the training set a-priori probabilities. Figure 1 and Table 2 summarise the results obtained using the raw and corrected outputs of a linear SBMLR model on this dataset, clearly demonstrating a key advantage of probabilistic classifiers over purely discriminative methods, for example the support vector machine (note that the same procedure could be applied to the SMLR model with similar results).\n\nTable 2: Error rate and average cross-entropy score for linear SBMLR models of the COVTYPE benchmark, using the raw and corrected outputs.\n\nStatistic     | Raw    | Corrected\nError Rate    | 40.51% | 28.57%\nCross-Entropy | 0.9590 | 0.6567\n\n[Figure 1: Training set, test set and estimated a-priori probabilities for the COVTYPE benchmark; bar chart of a-priori probability against class (1-7).]\n\n4 Relationship to Existing Work\n\n∫ N_w(0, τ) E_τ(γ) dτ = (α/2) exp{−α|w|}\n\nThe sparsity inducing Laplace density has been utilized 
previously in [15, 25, 26, 29–31], and emerges as the marginal of a scale-mixture-of-Gaussians in which the mixing prior is an Exponential, as in the integral displayed above, where E_τ(γ) is an Exponential distribution over τ with parameter γ and α = √γ. In [29] this hierarchical representation of the Laplace prior is utilized to develop an EM style sparse binomial probit regression algorithm. The hyper-parameter α is selected via cross-validation, but in an attempt to circumvent this requirement a Jeffreys prior is placed on τ and is used to replace the Exponential distribution in the above integral. This yields an improper, parameter-free prior distribution over w which removes the explicit requirement to perform any cross-validation. However, the method developed in [29] is restricted to binary classification and has compute scaling O(d³), which prohibits its use on moderately high-dimensional problems.\n\nLikewise, in [13] the RVM employs a similar scale mixture for the prior, where now the Exponential distribution is replaced by a Gamma distribution whose marginal yields a Student prior distribution. No attempt is made to estimate the associated hyper-parameters and these are typically set to zero, producing, as in [29], a sparsity inducing improper prior. As with [29], the original scaling of [13] is, at worst, O(d³), though more efficient methods have been developed in [14]. 
However, the analysis holds only for a binary classifier and it would be non-trivial to extend it to the multi-class domain.\n\nA similar multinomial logistic regression model to the one proposed in this paper is employed in [26], where the algorithm is applied to large scale classification problems, and yet they, as with [25], have to resort to cross-validation in obtaining a value for the hyper-parameters of the Laplace prior.\n\n5 Summary\n\nIn this paper we have demonstrated that the regularisation parameter used in sparse multinomial logistic regression with a Laplace prior can be integrated out analytically, giving generalisation performance similar to that obtained using extensive cross-validation based model selection, but at a greatly reduced computational expense. It is interesting to note that SBMLR implements a strategy that is exactly the opposite of the relevance vector machine (RVM) [13], in that it integrates over the hyper-parameter and optimises the weights, rather than marginalising the model parameters and optimising the hyper-parameters. It seems reasonable to suggest that this approach is feasible in the case of the Laplace prior because the pruning action of this prior ensures that the values of all of the weights are strongly determined by the data misfit term. A similar strategy has already proved effective in cancer classification based on gene expression microarray data in a binomial setting [32], and we plan to extend this work to multi-class cancer classification in the near future.\n\nAcknowledgements\n\nThe authors thank the anonymous reviewers for their helpful and constructive comments. MG is supported by EPSRC grant EP/C010620/1.\n\nReferences\n\n[1] P. McCullagh and J. A. Nelder. Generalized linear models, volume 37 of Monographs on Statistics and Applied Probability. Chapman & Hall/CRC, second edition, 1989.\n\n[2] D. W. Hosmer and S. Lemeshow. Applied logistic regression. 
Wiley, second edition, 2000.\n\n[3] C. K. Chow. On optimum recognition error and reject tradeoff. IEEE Transactions on Information Theory, 16(1):41–46, January 1970.\n\n[4] M. Saerens, P. Latinne, and C. Decaestecker. Adjusting the outputs of a classifier to new a priori probabilities: A simple procedure. Neural Computation, 14(1):21–41, 2001.\n\n[5] J. O. Berger. Statistical decision theory and Bayesian analysis. Springer Series in Statistics. Springer, second edition, 1985.\n\n[6] J. Zhu and T. Hastie. Classification of gene microarrays by penalized logistic regression. Biostatistics, 5(3):427–443, 2004.\n\n[7] X. Zhou, X. Wang, and E. R. Dougherty. Multi-class cancer classification using multinomial probit regression with Bayesian gene selection. IEE Proceedings - Systems Biology, 153(2):70–76, March 2006.\n\n[8] T. Zhang and F. J. Oles. Text categorization based on regularised linear classification methods. Information Retrieval, 4(1):5–31, April 2001.\n\n[9] L. Narlikar and A. J. Hartemink. Sequence features of DNA binding sites reveal structural class of associated transcription factor. Bioinformatics, 22(2):157–163, 2006.\n\n[10] M. J. L. Orr. Regularisation in the selection of radial basis function centres. Neural Computation, 7(3):606–623, 1995.\n\n[11] Y. Grandvalet. Least absolute shrinkage is equivalent to quadratic penalisation. In L. Niklasson, M. Bodén, and T. Ziemske, editors, Proceedings of the International Conference on Artificial Neural Networks, Perspectives in Neural Computing, pages 201–206, Skövde, Sweden, September 2–4 1998. Springer.\n\n[12] Y. Grandvalet and S. Canu. Outcomes of the equivalence of adaptive ridge with least absolute shrinkage. In Advances in Neural Information Processing Systems, volume 11. MIT Press, 1999.\n\n[13] M. E. Tipping. Sparse Bayesian learning and the Relevance Vector Machine. 
Journal of Machine Learning Research, 1:211–244, 2001.\n\n[14] A. C. Faul and M. E. Tipping. Fast marginal likelihood maximisation for sparse Bayesian models. In C. M. Bishop and B. J. Frey, editors, Proceedings of the Ninth International Workshop on Artificial Intelligence and Statistics, Key West, FL, USA, 3–6 January 2003.\n\n[15] R. Tibshirani. Regression shrinkage and selection via the LASSO. Journal of the Royal Statistical Society - Series B, 58:267–288, 1996.\n\n[16] B. Efron, T. Hastie, I. Johnstone, and R. Tibshirani. Least angle regression. Annals of Statistics, 32(2):407–499, 2004.\n\n[17] P. M. Williams. Bayesian regularization and pruning using a Laplace prior. Neural Computation, 7(1):117–143, 1995.\n\n[18] C. M. Bishop. Neural networks for pattern recognition. Oxford University Press, 1995.\n\n[19] J. S. Bridle. Probabilistic interpretation of feedforward classification network outputs, with relationships to statistical pattern recognition. In F. Fogelman Soulié and J. Hérault, editors, Neurocomputing: Algorithms, architectures and applications, pages 227–236. Springer-Verlag, New York, 1990.\n\n[20] A. N. Tikhonov and V. Y. Arsenin. Solutions of ill-posed problems. John Wiley, New York, 1977.\n\n[21] S. Geman, E. Bienenstock, and R. Doursat. Neural networks and the bias/variance dilemma. Neural Computation, 4(1):1–58, 1992.\n\n[22] D. J. C. MacKay. The evidence framework applied to classification networks. Neural Computation, 4(5):720–736, 1992.\n\n[23] W. L. Buntine and A. S. Weigend. Bayesian back-propagation. Complex Systems, 5:603–643, 1991.\n\n[24] I. S. Gradshteyn and I. M. Ryzhik. Table of Integrals, Series and Products. Academic Press, fifth edition, 1994.\n\n[25] S. K. Shevade and S. S. Keerthi. A simple and efficient algorithm for gene selection using sparse logistic regression. Bioinformatics, 19(17):2246–2253, 2003.\n\n[26] D. 
Madigan, A. Genkin, D. D. Lewis, and D. Fradkin. Bayesian multinomial logistic regression for author identification. In AIP Conference Proceedings, volume 803, pages 509–516, 2005.\n\n[27] P. M. Williams. A Marquardt algorithm for choosing the step size in backpropagation learning with conjugate gradients. Technical Report CSRP-229, University of Sussex, February 1991.\n\n[28] G. J. McLachlan. Discriminant analysis and statistical pattern recognition. Wiley, 1992.\n\n[29] M. Figueiredo. Adaptive sparseness for supervised learning. IEEE Transactions on Pattern Analysis and Machine Intelligence, 25(9):1150–1159, September 2003.\n\n[30] B. Krishnapuram, L. Carin, M. A. T. Figueiredo, and A. J. Hartemink. Sparse multinomial logistic regression: Fast algorithms and generalisation bounds. IEEE Transactions on Pattern Analysis and Machine Intelligence, 27(6):957–968, June 2005.\n\n[31] J. M. Bioucas-Dias, M. A. T. Figueiredo, and J. P. Oliveira. Adaptive total variation image deconvolution: A majorization-minimization approach. In Proceedings of the European Signal Processing Conference (EUSIPCO'2006), Florence, Italy, September 2006.\n\n[32] G. C. Cawley and N. L. C. Talbot. Gene selection in cancer classification using sparse logistic regression with Bayesian regularisation. Bioinformatics, 22(19):2348–2355, October 2006.", "award": [], "sourceid": 3155, "authors": [{"given_name": "Gavin", "family_name": "Cawley", "institution": null}, {"given_name": "Nicola", "family_name": "Talbot", "institution": null}, {"given_name": "Mark", "family_name": "Girolami", "institution": null}]}