{"title": "Adaptive Sparseness Using Jeffreys Prior", "book": "Advances in Neural Information Processing Systems", "page_first": 697, "page_last": 704, "abstract": null, "full_text": "Adaptive Sparseness Using Jeffreys Prior\n\nM\u00e1rio A. T. Figueiredo\n\nInstitute of Telecommunications,\nand Department of Electrical and Computer Engineering,\nInstituto Superior T\u00e9cnico\n1049-001 Lisboa, Portugal\n\nmtf@lx.it.pt\n\nAbstract\n\nIn this paper we introduce a new sparseness-inducing prior which does not involve any (hyper)parameters that need to be adjusted or estimated. Although other applications are possible, we focus here on supervised learning problems: regression and classification. Experiments with several publicly available benchmark data sets show that the proposed approach yields state-of-the-art performance. In particular, our method outperforms support vector machines and performs competitively with the best alternative techniques, both in terms of error rates and sparseness, although it involves no tuning or adjusting of sparseness-controlling hyper-parameters.\n\n1 Introduction\n\nThe goal of supervised learning is to infer a functional relation $y = f(x)$, based on a set of (maybe noisy) training examples $D = \\{(x_1, y_1), \\ldots, (x_n, y_n)\\}$. Usually, the inputs are vectors, $x_i = [x_{i,1}, \\ldots, x_{i,d}]^T \\in \\mathbb{R}^d$. When $y$ is continuous (typically $y \\in \\mathbb{R}$), we are in the context of regression, whereas in classification $y$ is of categorical nature (e.g., $y \\in \\{-1, 1\\}$).
Usually, the structure of $f(\\cdot)$ is assumed fixed and the objective is to estimate a vector of parameters $\\beta$ defining it; accordingly we write $y = f(x, \\beta)$.\n\nTo achieve good generalization (i.e., to perform well on yet unseen data) it is necessary to control the complexity of the learned function (see [1] - [4], and the many references therein). In Bayesian approaches, complexity is controlled by placing a prior on the function to be learned, i.e., on $\\beta$. This should not be confused with a generative (informative) Bayesian approach, since it involves no explicit modelling of the joint probability $p(x, y)$. A common choice is a zero-mean Gaussian prior, which appears under different names, like ridge regression [5], or weight decay in the neural learning literature [6]. Gaussian priors are also used in non-parametric contexts, like the Gaussian processes (GP) approach [2], [7], [8], [9], which has roots in earlier spline models [10] and regularized radial basis functions [11]. Very good performance has been reported for methods based on Gaussian priors [8], [9]. Their main disadvantage is that they do not control the structural complexity of the resulting functions. That is, if one of the components of $\\beta$ (say, a weight in a neural network) happens to be irrelevant, a Gaussian prior will not set it exactly to zero (thus pruning that parameter), but only to some small value.\n\nThis work was partially supported by the Portuguese Foundation for Science and Technology (FCT), Ministry of Science and Technology, under project POSI/33143/SRI/2000.\n\nSparse estimates (i.e., in which irrelevant parameters are set exactly to zero) are desirable because (in addition to other learning-theoretic reasons [4]) they correspond to a structural simplification of the estimated function.
Using Laplacian priors (equivalently, $\\ell_1$-penalized regularization) is known to promote sparseness [12] - [15]. Support vector machines (SVM) take a non-Bayesian approach to the goal of sparseness [2], [4]. Interestingly, however, it can be shown that the SVM and $\\ell_1$-penalized regression are closely related [13].\n\nBoth in approaches based on Laplacian priors and in SVMs, there are hyper-parameters which control the degree of sparseness of the obtained estimates. These are commonly adjusted using cross-validation methods, which do not optimally utilize the available data and are time consuming. We propose an alternative approach which involves no hyper-parameters. The key steps of our proposal are: (i) a hierarchical Bayes interpretation of the Laplacian prior as a normal/independent distribution (as used in robust regression [16]); (ii) a Jeffreys\u2019 non-informative second-level hyper-prior (in the same spirit as [17]) which expresses scale-invariance and, more importantly, is parameter-free [18]; (iii) a simple expectation-maximization (EM) algorithm which yields a maximum a posteriori (MAP) estimate of $\\beta$ (and of the observation noise variance, in the case of regression).\n\nOur method is related to the automatic relevance determination (ARD) concept [7], [19], which underlies the recently proposed relevance vector machine (RVM) [20], [21]. The RVM exhibits state-of-the-art performance, beating SVMs both in terms of accuracy and sparseness [20], [21].
However, we do not resort to a type-II maximum likelihood approximation [18] (as in ARD and RVM); rather, our modelling assumptions lead to a marginal a posteriori probability function on $\\beta$ whose mode is located by a very simple EM algorithm. Like the RVM, but unlike the SVM, our classifier produces probabilistic outputs.\n\nExperimental evaluation of the proposed method, both with synthetic and real data, shows that it performs competitively with (often better than) GP-based methods, RVM, and SVM.\n\n2 Regression\n\nWe consider functions of the type $f(x, \\beta) = \\beta^T h(x)$, i.e., that are linear with respect to $\\beta$ (whose dimensionality we will denote by $k$). This includes: (i) classical linear regression, $h(x) = [1, x_1, \\ldots, x_d]^T$; (ii) nonlinear regression via a set of $k$ basis functions, $h(x) = [\\phi_1(x), \\ldots, \\phi_k(x)]^T$; (iii) kernel regression, $h(x) = [1, K(x, x_1), \\ldots, K(x, x_n)]^T$, where $K(x, x')$ is some (symmetric) kernel function [2] (as in SVM and RVM regression), not necessarily verifying Mercer\u2019s condition.\n\nWe follow the standard assumption that $y_i = f(x_i, \\beta) + w_i$, for $i = 1, \\ldots, n$, where $[w_1, \\ldots, w_n]$ is a set of independent zero-mean Gaussian variables with variance $\\sigma^2$. With $y = [y_1, \\ldots, y_n]^T$, the likelihood function is then $p(y | \\beta) = N(y | H\\beta, \\sigma^2 I)$, where $N(v | \\mu, C)$ denotes a Gaussian density of mean $\\mu$ and covariance $C$, evaluated at $v$, and $H$ is the $(n \\times k)$ design matrix which depends on the $x_i$s and on the adopted function representation. With a zero-mean Gaussian prior with covariance $A$, $p(\\beta | A) = N(\\beta | 0, A)$, the posterior $p(\\beta | y)$ is still Gaussian with mean and mode at\n\n$$\\hat{\\beta} = (\\sigma^2 A^{-1} + H^T H)^{-1} H^T y.$$\n\nWhen $A$ is proportional to the identity, this is called ridge regression [5].\n\nWith a Laplacian prior for $\\beta$, $p(\\beta | \\alpha) = \\prod_i p(\\beta_i | \\alpha)$, with $p(\\beta_i | \\alpha) = (\\alpha/2) \\exp\\{-\\alpha |\\beta_i|\\}$, the posterior $p(\\beta | y)$ is not Gaussian. The maximum a posteriori (MAP) estimate is given by\n\n$$\\hat{\\beta} = \\arg\\min_{\\beta} \\{ \\|H\\beta - y\\|_2^2 + 2 \\sigma^2 \\alpha \\|\\beta\\|_1 \\}, \\qquad (1)$$\n\nwhere $\\|v\\|_2$ is the Euclidean ($\\ell_2$) norm, and $\\|v\\|_1 = \\sum_i |v_i|$ is the $\\ell_1$ norm. In linear regression this is called the LASSO (least absolute shrinkage and selection operator) [14]. The main effect of the $\\ell_1$ penalty is that some of the components of $\\hat{\\beta}$ may be exactly zero. If $H$ is an orthogonal matrix, (1) can be solved separately for each $\\beta_i$, leading to the soft threshold estimation rule, widely used in wavelet-based signal/image denoising [22].\n\nLet us consider an alternative model: let each $\\beta_i$ have a zero-mean Gaussian prior $p(\\beta_i | \\tau_i) = N(\\beta_i | 0, \\tau_i)$, with its own variance $\\tau_i$ (like in ARD and RVM). Now, rather than adopting a type-II maximum likelihood criterion (as in ARD and RVM), let us consider hyper-priors for the $\\tau_i$s and integrate them out.
Assuming exponential hyper-priors $p(\\tau_i | \\gamma) = (\\gamma/2) \\exp\\{-\\gamma \\tau_i / 2\\}$ (for $\\tau_i \\geq 0$, because these are variances) we obtain\n\n$$p(\\beta_i | \\gamma) = \\int_0^{\\infty} p(\\beta_i | \\tau_i) \\, p(\\tau_i | \\gamma) \\, d\\tau_i = \\frac{\\sqrt{\\gamma}}{2} \\exp\\{-\\sqrt{\\gamma} \\, |\\beta_i|\\}.$$\n\nThis shows that the Laplacian prior is equivalent to a 2-level hierarchical-Bayes model: zero-mean Gaussian priors with independent exponentially distributed variances. This decomposition has been exploited in robust least absolute deviation (LAD) regression [16].\n\nThe hierarchical decomposition of the Laplacian prior allows using the EM algorithm to implement the LASSO criterion in (1) by simply regarding $\\tau = [\\tau_1, \\ldots, \\tau_k]$ as hidden/missing data. In fact, the complete log-posterior (with a flat prior for $\\sigma^2$, and where $\\Upsilon(\\tau) \\equiv \\mathrm{diag}(\\tau_1^{-1}, \\ldots, \\tau_k^{-1})$),\n\n$$\\log p(\\beta, \\sigma^2 | y, \\tau) \\propto -n \\log \\sigma^2 - \\frac{\\|y - H\\beta\\|_2^2}{\\sigma^2} - \\beta^T \\Upsilon(\\tau) \\beta, \\qquad (2)$$\n\nis easy to maximize with respect to $\\beta$ and $\\sigma^2$. The E-step reduces to the computation of the conditional expectation of $\\Upsilon(\\tau)$, given $y$ and the current (at iteration $t$) estimates $\\hat{\\sigma}^2_{(t)}$ and $\\hat{\\beta}_{(t)}$. This leads to\n\n$$V_{(t)} \\equiv E[\\Upsilon(\\tau) | y, \\hat{\\sigma}^2_{(t)}, \\hat{\\beta}_{(t)}] = \\sqrt{\\gamma} \\, \\mathrm{diag}(|\\hat{\\beta}_{1,(t)}|^{-1}, \\ldots, |\\hat{\\beta}_{k,(t)}|^{-1}). \\qquad (3)$$\n\nThe M-step is then defined by the two following update equations:\n\n$$\\hat{\\sigma}^2_{(t+1)} = \\frac{\\|y - H \\hat{\\beta}_{(t)}\\|_2^2}{n} \\quad \\text{and} \\quad \\hat{\\beta}_{(t+1)} = (\\hat{\\sigma}^2_{(t+1)} V_{(t)} + H^T H)^{-1} H^T y. \\qquad (4)$$\n\nThis EM algorithm is not the most efficient way to solve (1); see, e.g., the methods proposed in [23], [14]. Our main goal is to open the way to the adoption of different hyper-priors.\n\nOne question remains: how to adjust $\\gamma$, which controls the degree of sparseness of the estimates? Our proposal is to remove $\\gamma$ from the model, by replacing the exponential hyper-prior by a non-informative Jeffreys hyper-prior: $p(\\tau_i) \\propto 1/\\tau_i$. This prior expresses ignorance with respect to scale (see [17], [18]) and, most importantly, it is parameter-free. Of course this is no longer equivalent to a Laplacian prior on $\\beta$, but to some other prior. As will be shown experimentally, this prior strongly induces sparseness and yields state-of-the-art performance. Computationally, this choice leads to a minor modification of the EM algorithm described above: matrix $V_{(t)}$ is now given by\n\n$$V_{(t)} = \\mathrm{diag}(|\\hat{\\beta}_{1,(t)}|^{-2}, \\ldots, |\\hat{\\beta}_{k,(t)}|^{-2}).$$\n\nSince several of the $\\hat{\\beta}_i$s may go to zero, it is not convenient to deal with $V_{(t)}$. However, we can re-write the M-step as\n\n$$\\hat{\\beta}_{(t+1)} = U_{(t)} (\\hat{\\sigma}^2_{(t+1)} I + U_{(t)} H^T H U_{(t)})^{-1} U_{(t)} H^T y,$$\n\nwhere $U_{(t)} \\equiv \\mathrm{diag}(|\\hat{\\beta}_{1,(t)}|, \\ldots, |\\hat{\\beta}_{k,(t)}|)$, thus avoiding the inversion of the elements of $\\hat{\\beta}_{(t)}$. Moreover, it is not necessary to invert the matrix, but simply to solve the corresponding linear system, whose dimension is only the number of non-zero elements in $U_{(t)}$.\n\n
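The EM iteration of Section 2 with the Jeffreys hyper-prior can be sketched in a few lines of NumPy. This is only an illustrative sketch under stated assumptions, not the implementation used for the experiments: the function name `jeffreys_em_regression`, the least-squares initialisation, the iteration cap `n_iter`, the stopping tolerance `tol`, and the pruning threshold `prune` are all hypothetical choices that do not appear in the text.

```python
import numpy as np


def jeffreys_em_regression(H, y, n_iter=200, tol=1e-8, prune=1e-10):
    """Illustrative EM loop for sparse regression under the Jeffreys hyper-prior.

    H is the (n x k) design matrix and y the length-n response vector.
    n_iter, tol and prune are hypothetical controls, not taken from the paper.
    """
    n, k = H.shape
    # Assumed initialisation: a very lightly regularised least-squares fit.
    beta = np.linalg.solve(H.T @ H + 1e-6 * np.eye(k), H.T @ y)
    sigma2 = float(np.sum((y - H @ beta) ** 2) / n)
    for _ in range(n_iter):
        # Noise variance update from the current residual.
        sigma2 = float(np.sum((y - H @ beta) ** 2) / n)
        # U = diag(|beta_i|); entries at (numerical) zero are dropped for good.
        u = np.abs(beta)
        active = u > prune
        if not active.any():
            beta = np.zeros(k)
            break
        Hu = H[:, active] * u[active]  # columns of H U for the active set
        A = sigma2 * np.eye(int(active.sum())) + Hu.T @ Hu
        beta_new = np.zeros(k)
        # beta = U (sigma2 I + U H^T H U)^{-1} U H^T y, solved as a linear system.
        beta_new[active] = u[active] * np.linalg.solve(A, Hu.T @ y)
        step = np.linalg.norm(beta_new - beta)
        beta = beta_new
        if tol * (np.linalg.norm(beta) + 1e-12) > step:
            break
    return beta, sigma2
```

Only the active (non-zero) components enter the linear system, so its dimension shrinks as components are pruned, which mirrors the remark above about solving a system whose size is the number of non-zero elements.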
3 Regression experiments\n\nOur first example illustrates the use of the proposed method for variable selection in standard linear regression. Consider a sequence of 20 true $\\beta$s, having from 1 to 20 non-zero components (out of 20). For each $\\beta$, we obtain 100 random design matrices, following the procedure in [14], and for each of these, we obtain data points with unit noise variance. Fig. 1 (a) shows the mean number of estimated non-zero components, as a function of the true number. Our method exhibits a very good ability to find the correct number of nonzero components in $\\beta$, in an adaptive manner.\n\nFigure 1: (a) Mean number of nonzero components in $\\hat{\\beta}$ versus the number of nonzero components in $\\beta$ (the dotted line is the identity). (b) Kernel regression. Dotted line: true function $y = \\sin(x)/x$. Dots: 50 noisy observations ($\\sigma = 0.1$). Solid line: the estimated function. Circles: data points corresponding to the non-zero parameters.\n\nWe now consider two of the experimental setups of [14]: $\\beta_a = [3, 1.5, 0, 0, 2, 0, 0, 0]$, with $\\sigma = 3$, and $\\beta_b = [5, 0, 0, 0, 0, 0, 0, 0]$, with $\\sigma = 2$. In both cases, $n = 20$, and the design matrices are generated as in [14]. In Table 1, we compare the relative modelling error ($ME = E[\\|H\\hat{\\beta} - H\\beta\\|^2]$) improvement (with respect to the least squares solution) of our method and of several methods studied in [14]. Our method performs comparably with the best method for each case, although it involves no tuning or adjustment of parameters, and is computationally faster.\n\nTable 1: Relative (%) improvement in modeling error of several methods. Columns: Method, $\\beta_a$, $\\beta_b$. Rows: Proposed method; LASSO (CV); LASSO (GCV); Subset selection. [Numerical entries are illegible in the source.]\n\nWe now study the performance of our method in kernel regression, using Gaussian kernels, i.e., $K(x, x_i) = \\exp\\{-\\|x - x_i\\|^2 / (2h^2)\\}$. We begin by considering the synthetic example studied in [20] and [21], where the true function is $y = \\sin(x)/x$ (see Fig. 1 (b)). To compare our results to the RVM and the variational RVM (VRVM), we ran the algorithm on 25 generations of the noisy data. The results are summarized in Table 2 (which also includes the SVM results from [20]). Finally, we have also applied our method to the well-known Boston housing data-set (20 random partitions of the full data-set into 481 training samples and 25 test samples); Table 2 shows the results, again versus SVM, RVM, and VRVM regression (as reported in [20]).
In these tests, our method performs better than RVM, VRVM, and SVM regression, although it doesn\u2019t require any tuning.\n\nTable 2: Mean root squared errors and mean number of kernels for the \u201c$\\sin(x)/x$\u201d function and the Boston housing examples. Columns (for each of the two examples): Method, MSE, No. kernels. Rows: New method; SVM; RVM; VRVM. [Numerical entries are illegible in the source.]\n\n4 Classification\n\nIn classification the formulation is somewhat more complicated, with the standard approach being generalized linear models [24]. For a two-class problem ($y \\in \\{-1, 1\\}$), the probability that an observation $x$ belongs to, say, class 1 is given by a nonlinear function $\\psi: \\mathbb{R} \\to [0, 1]$ (called the link), $P(y = 1 | x) = \\psi(\\beta^T h(x))$, where $h(x)$ can have one of the forms referred to in the first paragraph of Section 2 (linear, nonlinear, kernel). Although the most common choice for $\\psi$ is the logistic function, $\\psi(z) = (1 + \\exp(-z))^{-1}$, in this paper we adopt the probit model $\\psi(z) = \\Phi(z)$, where\n\n$$\\Phi(z) \\equiv \\int_{-\\infty}^{z} N(x | 0, 1) \\, dx \\qquad (5)$$\n\nis the standard Gaussian cumulative distribution function (cdf). The probit model has a simple interpretation in terms of hidden variables [25], which we will exploit. Consider a hidden variable $z = \\beta^T h(x) + w$, where $p(w) = N(w | 0, 1)$. Then, if the classification rule is $y = 1$ if $z \\geq 0$, and $y = -1$ otherwise, we obtain the probit model:\n\n$$P(y = 1 | x) = P(\\beta^T h(x) + w \\geq 0) = \\Phi(\\beta^T h(x)).$$\n\nGiven training data $D = \\{(x_1, y_1), \\ldots, (x_n, y_n)\\}$, consider the corresponding vector of hidden/missing variables $z = [z_1, \\ldots, z_n]^T$. If we had $z$, we would have a simple linear regression likelihood $p(z | \\beta) = N(z | H\\beta, I)$. This fact suggests using the EM algorithm to estimate $\\beta$, by treating $z$ as missing data.\n\nTo promote sparseness, we will adopt the same hierarchical prior on $\\beta$ that we have used for regression: $p(\\beta_i | \\tau_i) = N(\\beta_i | 0, \\tau_i)$ and $p(\\tau_i) \\propto 1/\\tau_i$ (the Jeffreys prior). The complete log posterior (with the hidden vectors $\\tau$ and $z$) is\n\n$$\\log p(\\beta | y, \\tau, z) \\propto -\\beta^T H^T H \\beta + 2 \\beta^T H^T z - \\beta^T \\Upsilon(\\tau) \\beta, \\qquad (6)$$\n\nwhich is similar to (2), except for the noise variance which is not needed here, and for the fact that now $z$ is missing. The expected value of $\\Upsilon(\\tau)$ is similar to the regression case; accordingly we define the same diagonal matrix $U_{(t)} = \\mathrm{diag}(|\\hat{\\beta}_{1,(t)}|, \\ldots, |\\hat{\\beta}_{k,(t)}|)$. In addition, we also need $E[z | \\hat{\\beta}_{(t)}, y]$ (notice that the complete log-posterior is linear with respect to $z$), which can be expressed in closed form, for each $z_i$, as\n\n$$s_{i,(t)} \\equiv E[z_i | \\hat{\\beta}_{(t)}, y_i] = \\begin{cases} \\hat{\\beta}_{(t)}^T h(x_i) + \\dfrac{N(\\hat{\\beta}_{(t)}^T h(x_i) | 0, 1)}{1 - \\Phi(-\\hat{\\beta}_{(t)}^T h(x_i))} & \\text{if } y_i = 1, \\\\ \\hat{\\beta}_{(t)}^T h(x_i) - \\dfrac{N(\\hat{\\beta}_{(t)}^T h(x_i) | 0, 1)}{\\Phi(-\\hat{\\beta}_{(t)}^T h(x_i))} & \\text{if } y_i = -1. \\end{cases} \\qquad (7)$$\n\nThese expressions are easily derived after noticing that $z_i$ is (conditionally) Gaussian with mean $\\hat{\\beta}_{(t)}^T h(x_i)$, but left-truncated at zero if $y_i = 1$, and right-truncated at zero if $y_i = -1$. With $s_{(t)} = [s_{1,(t)}, \\ldots, s_{n,(t)}]^T$, the M-step is similar to the regression case,\n\n$$\\hat{\\beta}_{(t+1)} = U_{(t)} (I + U_{(t)} H^T H U_{(t)})^{-1} U_{(t)} H^T s_{(t)},$$\n\nwith $s_{(t)}$ playing the role of observed data.\n\n5 Classification experiments\n\n
In all the experiments we use kernel classifiers, with Gaussian kernels, i.e., $K(x, x_i) = \\exp\\{-\\|x - x_i\\|^2 / (2h^2)\\}$, where $h$ is a parameter that controls the kernel width.\n\nOur first experiment is mainly illustrative and uses Ripley\u2019s synthetic data$^1$; the optimal error rate for this problem is about 0.08 [3]. Table 3 shows the average test set error (on 1000 test samples) and the final number of kernels, for 20 classifiers learned from 20 random subsets of size 100 from the original 250 training samples. For comparison, we also include results (from [20]) for RVM, variational RVM (VRVM), and SVM classifiers. On this data set, our method performs competitively with RVM and VRVM and much better than SVM (especially in terms of sparseness). To allow the comparisons, we chose the same kernel width $h$ as in [20]. Table 3 also reports the numbers of errors achieved by the proposed method and by several state-of-the-art techniques on three well-known benchmark problems: the Pima Indians diabetes$^2$, the Leptograpsus crabs$^2$, and the Wisconsin breast cancer$^3$ (WBC). For the WBC, we report average results over 30 random partitions (300/269 training/testing, as in [26]). All the inputs are normalized to zero mean and unit variance, and the kernel width $h$ was set to one value for the Pima and crabs problems and to another for the WBC. On the Pima and crabs data sets, our algorithm outperforms all the other techniques. On the WBC data set, our method performs nearly as well as the best available alternative. The running time of our learning algorithm (in MATLAB, on a PIII-800MHz) is less than 1 second for crabs, and about 2 seconds for the Pima and WBC problems.
Finally, notice that the classifiers obtained with our algorithm are much sparser than the SVM classifiers.\n\nTable 3: Numbers of test set errors for the four data sets studied (see text for details). The numbers in square brackets in the \u201cmethod\u201d column indicate the bibliographic reference from which the results are quoted. The numbers in parentheses indicate the (mean) number of kernels used by the classifiers (when available).\n\nMethod                       Ripley\u2019s    Pima       Crabs   WBC\nProposed method              94 (4.8)    61 (6)     0 (5)   8.5 (5)\nSVM [20]                     106 (38)    64 (110)   N/A     N/A\nRVM [20]                     93 (4)      65 (4)     N/A     N/A\nVRVM [20]                    92 (4)      65 (4)     N/A     N/A\nSVM [26]                     N/A         64         4       9\nNeural network [9]           N/A         75         3       N/A\nLogistic regression [9]      N/A         66         4       N/A\nLinear discriminant [26]     N/A         67         3       19\nGaussian process [9], [26]   N/A         68, 67     3       8\n\n$^1$ Available (divided into training/test sets) at: http://www.stats.ox.ac.uk/pub/PRNN/\n$^2$ Available at: http://www.stats.ox.ac.uk/pub/PRNN/\n$^3$ Available at: http://www.ics.uci.edu/~mlearn/MLSummary.html\n\n6 Concluding remarks\n\nWe have introduced a new sparseness inducing prior related to the Laplacian prior. Its main feature is the absence of any hyper-parameters to be adjusted or estimated. Experiments with several publicly available benchmark data sets, both for regression and classification, have shown state-of-the-art performance.
In particular, our approach outperforms support vector machines and Gaussian process classifiers both in terms of error rate and sparseness, although it involves no tuning or adjusting of sparseness-controlling hyper-parameters.\n\nFuture research includes testing on large-scale problems, like handwritten digit classification. One of the weak points of our approach, when used with kernel-based methods, is the need to solve a linear system in the M-step (of dimension equal to the number of training points) whose computational requirements make it impractical to use with very large training data sets. This issue is of current interest to researchers in kernel-based methods (e.g., [27]), and we also intend to focus on it.\n\nReferences\n\n[1] V. Cherkassky and F. Mulier, Learning from Data: Concepts, Theory, and Methods. New York: Wiley, 1998.\n\n[2] N. Cristianini and J. Shawe-Taylor, Support Vector Machines and Other Kernel-Based Learning Methods. Cambridge University Press, 2000.\n\n[3] B. Ripley, Pattern Recognition and Neural Networks. Cambridge University Press, 1996.\n\n[4] V. Vapnik, Statistical Learning Theory. New York: John Wiley, 1998.\n\n[5] A. Hoerl and R. Kennard, \u201cRidge regression: Biased estimation for nonorthogonal problems,\u201d Technometrics, vol. 12, pp. 55\u201367, 1970.\n\n[6] C. Bishop, Neural Networks for Pattern Recognition. Oxford University Press, 1995.\n\n[7] R. Neal, Bayesian Learning for Neural Networks. New York: Springer Verlag, 1996.\n\n[8] C. Williams, \u201cPrediction with Gaussian processes: from linear regression to linear prediction and beyond,\u201d in Learning and Inference in Graphical Models, Kluwer, 1998.\n\n[9] C. Williams and D. Barber, \u201cBayesian classification with Gaussian priors,\u201d IEEE Trans. on Pattern Analysis and Machine Intelligence, vol. 20, no. 12, pp. 1342\u20131351, 1998.\n\n[10] G. Kimeldorf and G. Wahba, \u201cA correspondence between Bayesian estimation of stochastic processes and smoothing by splines,\u201d Annals of Mathematical Statistics, vol. 41, pp. 495\u2013502, 1970.\n\n[11] T. Poggio and F. Girosi, \u201cNetworks for approximation and learning,\u201d Proceedings of the IEEE, vol. 78, pp. 1481\u20131497, 1990.\n\n[12] S. Chen, D. Donoho, and M. Saunders, \u201cAtomic decomposition by basis pursuit,\u201d SIAM Journal on Scientific Computing, vol. 20, no. 1, pp. 33\u201361, 1998.\n\n[13] F. Girosi, \u201cAn equivalence between sparse approximation and support vector machines,\u201d Neural Computation, vol. 10, pp. 1445\u20131480, 1998.\n\n[14] R. Tibshirani, \u201cRegression shrinkage and selection via the lasso,\u201d Journal of the Royal Statistical Society (B), vol. 58, 1996.\n\n[15] P. Williams, \u201cBayesian regularization and pruning using a Laplace prior,\u201d Neural Computation, vol. 7, pp. 117\u2013143, 1995.\n\n[16] K. Lange and J. Sinsheimer, \u201cNormal/independent distributions and their applications in robust regression,\u201d Journal of Computational and Graphical Statistics, vol. 2, pp. 175\u2013198, 1993.\n\n[17] M. Figueiredo and R. Nowak, \u201cWavelet-based image estimation: an empirical Bayes approach using Jeffreys\u2019 noninformative prior,\u201d IEEE Transactions on Image Processing, vol. 10, pp. 1322\u20131331, 2001.\n\n[18] J. Berger, Statistical Decision Theory and Bayesian Analysis. Springer-Verlag, 1980.\n\n[19] D. MacKay, \u201cBayesian non-linear modelling for the 1993 energy prediction competition,\u201d in Maximum Entropy and Bayesian Methods, G. Heidbreder, ed., pp. 221\u2013234, Kluwer, 1996.\n\n[20] C. Bishop and M. Tipping, \u201cVariational relevance vector machines,\u201d in Proceedings of the 16th Conference in Uncertainty in Artificial Intelligence, pp. 46\u201353, Morgan Kaufmann, 2000.\n\n[21] M. Tipping, \u201cThe relevance vector machine,\u201d in Advances in Neural Information Processing Systems \u2013 NIPS 12 (S. Solla, T. Leen, and K.-R. M\u00fcller, eds.), pp. 652\u2013658, MIT Press, 2000.\n\n[22] D. L. Donoho and I. M. Johnstone, \u201cIdeal adaptation via wavelet shrinkage,\u201d Biometrika, vol. 81, pp. 425\u2013455, 1994.\n\n[23] M. Osborne, B. Presnell, and B. Turlach, \u201cA new approach to variable selection in least squares problems,\u201d IMA Journal of Numerical Analysis, vol. 20, pp. 389\u2013404, 2000.\n\n[24] P. McCullagh and J. Nelder, Generalized Linear Models. London: Chapman and Hall, 1989.\n\n[25] J. Albert and S. Chib, \u201cBayesian analysis of binary and polychotomous response data,\u201d Journal of the American Statistical Association, vol. 88, pp. 669\u2013679, 1993.\n\n[26] M. Seeger, \u201cBayesian model selection for support vector machines, Gaussian processes and other kernel classifiers,\u201d in Advances in Neural Information Processing \u2013 NIPS 12 (S. Solla, T. Leen, and K.-R. M\u00fcller, eds.), pp. 603\u2013609, MIT Press, 2000.\n\n[27] C. Williams and M. Seeger, \u201cUsing the Nystr\u00f6m method to speed up kernel machines,\u201d in NIPS 13, MIT Press, 2001.", "award": [], "sourceid": 1976, "authors": [{"given_name": "M\u00e1rio", "family_name": "Figueiredo", "institution": null}]}