{"title": "Analysis of Sparse Bayesian Learning", "book": "Advances in Neural Information Processing Systems", "page_first": 383, "page_last": 389, "abstract": null, "full_text": "Analysis of Sparse Bayesian Learning \n\nAnita C. Fanl Michael E. Tipping \n\nMicrosoft Research \n\nSt George House, 1 Guildhall St \n\nCambridge CB2 3NH, U.K. \n\nAbstract \n\nThe recent introduction of the 'relevance vector machine' has effec(cid:173)\ntively demonstrated how sparsity may be obtained in generalised \nlinear models within a Bayesian framework. Using a particular \nform of Gaussian parameter prior, 'learning' is the maximisation, \nwith respect to hyperparameters, of the marginal likelihood of the \ndata. This paper studies the properties of that objective func(cid:173)\ntion, and demonstrates that conditioned on an individual hyper(cid:173)\nparameter, the marginal likelihood has a unique maximum which \nis computable in closed form. It is further shown that if a derived \n'sparsity criterion' is satisfied, this maximum is exactly equivalent \nto 'pruning' the corresponding parameter from the model. \n\n1 \n\nIntroduction \n\nWe consider the approximation, from a training sample, of real-valued functions, \na task variously referred to as prediction, regression, interpolation or function ap(cid:173)\nproximation. Given a set of data {xn' tn};;=l the 'target' samples tn = f(xn) + En \nare conventionally considered to be realisations of a deterministic function f, po(cid:173)\ntentially corrupted by some additive noise process. This function f will be modelled \nby a linearly-weighted sum of M fixed basis functions {4>m (X)}~= l: \n\nM \n\nf(x) = L wm\u00a2>m(x), \n\nm=l \n\n(1) \n\nand the objective is to infer values of the parameters/weights {Wm}~=l such that \nf is a 'good' approximation of f. 
\nWhile accuracy in function approximation is almost universally valued, there has been significant recent interest [2, 9, 3, 5] in the notion of sparsity, a consequence of learning algorithms which set significant numbers of the parameters w_m to zero. A methodology which effectively combines both these measures of merit is that of 'sparse Bayesian learning', briefly reviewed in Section 2, and which was the basis for the recent introduction of the relevance vector machine (RVM) and related models [6, 1, 7]. This model exhibits some very compelling properties, in particular a dramatic degree of sparseness even in the case of highly overcomplete basis sets (M >> N). The sparse Bayesian learning algorithm essentially involves the maximisation of a marginalised likelihood function with respect to hyperparameters in the model prior. In the RVM, this was achieved through re-estimation equations, the behaviour of which was not fully understood. In this paper we present further relevant theoretical analysis of the properties of the marginal likelihood which gives a much fuller picture of the nature of the model and its associated learning procedure. This is detailed in Section 3, and we close with a summary of our findings and discussion of their implications in Section 4 (which, to avoid repetition here, the reader may wish to preview at this point). \n\n2 Sparse Bayesian Learning \n\nWe now very briefly review the methodology of sparse Bayesian learning, more comprehensively described elsewhere [6]. To simplify and generalise the exposition, we omit to notate any functional dependence on the inputs x and combine quantities defined over the training set and basis set within N- and M-vectors respectively. Using this representation, we first write the generative model as: \n\nt = f + ε,    (2) \n\nwhere t = (t_1, ..., t_N)^T, f = (f_1, ..., f_N)^T and ε = (ε_1, ..., ε_N)^T. The approximator is then written as: \n\nf = Φw,    (3) \n\nwhere Φ = [φ_1 ... φ_M] is a general N x M design matrix with column vectors φ_m and w = (w_1, ..., w_M)^T. Recall that in the context of (1), Φ_nm = φ_m(x_n) and f = (f(x_1), ..., f(x_N))^T. \n\nThe sparse Bayesian framework assumes an independent zero-mean Gaussian noise model, with variance σ², giving a multivariate Gaussian likelihood of the target vector t: \n\np(t|w, σ²) = (2π)^{-N/2} σ^{-N} exp{ -||t - f||² / (2σ²) }.    (4) \n\nThe prior over the parameters is mean-zero Gaussian: \n\np(w|α) = (2π)^{-M/2} Π_{m=1}^M α_m^{1/2} exp( -α_m w_m² / 2 ),    (5) \n\nwhere the key to the model sparsity is the use of M independent hyperparameters α = (α_1, ..., α_M)^T, one per weight (or basis vector), which moderate the strength of the prior. Given α, the posterior parameter distribution is Gaussian and given via Bayes' rule as p(w|t, α) = N(w|μ, Σ) with \n\nΣ = (A + σ^{-2} Φ^T Φ)^{-1},  μ = σ^{-2} Σ Φ^T t,    (6) \n\nand A defined as diag(α_1, ..., α_M). Sparse Bayesian learning can then be formulated as a type-II maximum likelihood procedure, in that the objective is to maximise the marginal likelihood, or equivalently its logarithm L(α), with respect to the hyperparameters α: \n\nL(α) = log p(t|α, σ²) = log ∫ p(t|w, σ²) p(w|α) dw,    (7) \n\n     = -(1/2) [ N log 2π + log |C| + t^T C^{-1} t ],    (8) \n\nwith C = σ²I + Φ A^{-1} Φ^T. \n\nOnce most-probable values α_MP have been found¹, in practice they can be plugged into (6) to give a posterior mean (most probable) point estimate for the parameters μ_MP and from that a mean final approximator: f_MP = Φ μ_MP. \n\nEmpirically, the local maximisation of the marginal likelihood (8) with respect to α has been seen to work highly effectively [6, 1, 7]. 
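The quantities in (6) and (8) are direct to compute. Below is a minimal NumPy sketch of both; it is our illustration rather than the authors' code, and the function names and toy data are invented:

```python
import numpy as np

def posterior(Phi, t, alpha, sigma2):
    """Posterior over the weights, eq. (6): Sigma = (A + Phi'Phi/sigma2)^-1, mu = Sigma Phi' t / sigma2."""
    A = np.diag(alpha)
    Sigma = np.linalg.inv(A + Phi.T @ Phi / sigma2)
    mu = Sigma @ Phi.T @ t / sigma2
    return mu, Sigma

def log_marginal(Phi, t, alpha, sigma2):
    """Log marginal likelihood, eq. (8), with C = sigma2*I + Phi A^-1 Phi'."""
    N = t.size
    C = sigma2 * np.eye(N) + Phi @ np.diag(1.0 / alpha) @ Phi.T
    _, logdet = np.linalg.slogdet(C)
    return -0.5 * (N * np.log(2.0 * np.pi) + logdet + t @ np.linalg.solve(C, t))

# Toy problem: 10 targets, 4 basis functions, unit hyperparameters
rng = np.random.default_rng(0)
Phi = rng.normal(size=(10, 4))
t = rng.normal(size=10)
alpha = np.ones(4)
mu, Sigma = posterior(Phi, t, alpha, 0.1)
L = log_marginal(Phi, t, alpha, 0.1)
```

Note that `log_marginal` works with the N x N matrix C directly; for N much larger than M one would in practice use the determinant and inverse identities of Section 3.1 instead.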
Accurate predictors may be realised, which are typically highly sparse as a result of the maximising values of many hyperparameters being infinite. From (6) this leads to a parameter posterior infinitely peaked at zero for many weights w_m, with the consequence that μ_MP correspondingly comprises very few non-zero elements. \n\nHowever, the learning procedure in [6] relied upon heuristic re-estimation equations for the hyperparameters, the behaviour of which was not well characterised. Also, little was known regarding the properties of (8), the validity of the local maximisation thereof and, importantly and perhaps most interestingly, the conditions under which α-values would become infinite. We now give, through a judicious re-writing of (8), a more detailed analysis of the sparse Bayesian learning procedure. \n\n3 Properties of the Marginal Likelihood L(α) \n\n3.1 A convenient re-writing \n\nWe re-write C from (8) in a convenient form to analyse the dependence on a single hyperparameter α_i: \n\nC = σ²I + Σ_m α_m^{-1} φ_m φ_m^T = σ²I + Σ_{m≠i} α_m^{-1} φ_m φ_m^T + α_i^{-1} φ_i φ_i^T \n  = C_{-i} + α_i^{-1} φ_i φ_i^T,    (9) \n\nwhere we have defined C_{-i} = σ²I + Σ_{m≠i} α_m^{-1} φ_m φ_m^T as the covariance matrix with the influence of basis vector φ_i removed, equivalent also to α_i = ∞. Using established matrix determinant and inverse identities, (9) allows us to write the terms of interest in L(α) as: \n\n|C| = |C_{-i}| (1 + α_i^{-1} φ_i^T C_{-i}^{-1} φ_i),    (10) \n\nC^{-1} = C_{-i}^{-1} - (C_{-i}^{-1} φ_i φ_i^T C_{-i}^{-1}) / (α_i + φ_i^T C_{-i}^{-1} φ_i),    (11) \n\nwhich gives \n\nL(α) = -(1/2) [ N log(2π) + log |C_{-i}| + t^T C_{-i}^{-1} t - log α_i + log(α_i + φ_i^T C_{-i}^{-1} φ_i) - (φ_i^T C_{-i}^{-1} t)² / (α_i + φ_i^T C_{-i}^{-1} φ_i) ] \n  = L(α_{-i}) + (1/2) [ log α_i - log(α_i + φ_i^T C_{-i}^{-1} φ_i) + (φ_i^T C_{-i}^{-1} t)² / (α_i + φ_i^T C_{-i}^{-1} φ_i) ] \n  = L(α_{-i}) + l(α_i),    (12) \n\nwhere L(α_{-i}) is the log marginal likelihood with α_i (and thus w_i and φ_i) removed from the model and we have now isolated the terms in α_i in the function l(α_i). \n\n¹The most-probable noise variance σ²_MP can also be directly and successfully estimated from the data [6], but for clarity in this paper, we assume without prejudice to our results that its value is fixed. \n\n3.2 First derivatives of L(α) \n\nPrevious results. In [6], based on earlier results from [4], the gradient of the marginal likelihood was computed as: \n\n∂L(α)/∂α_i = (1/2) [ α_i^{-1} - μ_i² - Σ_ii ],    (13) \n\nwith μ_i the i-th element of μ and Σ_ii the i-th diagonal element of Σ. This then leads to re-estimation updates for α_i in terms of μ_i and Σ_ii where, disadvantageously, these latter terms are themselves functions of α_i. \n\nA new, simplified, expression. In fact, by instead differentiating (12) directly, (13) can be seen to be equivalent to: \n\n∂L(α)/∂α_i = (1/2) [ 1/α_i - 1/(α_i + φ_i^T C_{-i}^{-1} φ_i) - (φ_i^T C_{-i}^{-1} t)² / (α_i + φ_i^T C_{-i}^{-1} φ_i)² ],    (14) \n\nwhere, advantageously, α_i now occurs only explicitly since C_{-i} is independent of α_i. For convenience, we combine terms and re-write (14) as: \n\n∂L(α)/∂α_i = [ α_i^{-1} S_i² - (Q_i² - S_i) ] / [ 2 (α_i + S_i)² ],    (15) \n\nwhere, for simplification of this and forthcoming expressions, we have defined: \n\nS_i = φ_i^T C_{-i}^{-1} φ_i  and  Q_i = φ_i^T C_{-i}^{-1} t.    (16) \n\nThe term Q_i can be interpreted as a 'quality' factor: a measure of how well φ_i increases L(α) by helping to explain the data, while S_i is a 'sparsity' factor which measures how much the inclusion of φ_i serves to decrease L(α) through 'inflating' C (i.e. adding to the normalising factor). 
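The decomposition (12) and the gradient (15) are straightforward to check numerically. The following sketch (our illustration, with invented toy data) builds C_{-i}, computes S_i and Q_i as in (16), and verifies both identities by brute force:

```python
import numpy as np

rng = np.random.default_rng(1)
N, M, sigma2, i = 20, 5, 0.1, 2
Phi = rng.normal(size=(N, M))
t = rng.normal(size=N)
alpha = rng.uniform(0.5, 2.0, size=M)

def log_marginal(alpha_vec, cols):
    """Eq. (8), restricted to the basis columns listed in `cols`."""
    C = sigma2 * np.eye(N) + Phi[:, cols] @ np.diag(1.0 / alpha_vec[cols]) @ Phi[:, cols].T
    _, logdet = np.linalg.slogdet(C)
    return -0.5 * (N * np.log(2.0 * np.pi) + logdet + t @ np.linalg.solve(C, t))

# Sparsity and quality factors, eq. (16), built from C with phi_i's influence removed
others = [m for m in range(M) if m != i]
C_mi = sigma2 * np.eye(N) + Phi[:, others] @ np.diag(1.0 / alpha[others]) @ Phi[:, others].T
phi_i = Phi[:, i]
S = phi_i @ np.linalg.solve(C_mi, phi_i)
Q = phi_i @ np.linalg.solve(C_mi, t)

# Decomposition (12): L(alpha) = L(alpha_{-i}) + l(alpha_i)
a = alpha[i]
l_i = 0.5 * (np.log(a) - np.log(a + S) + Q ** 2 / (a + S))
full_cols = list(range(M))
assert np.isclose(log_marginal(alpha, full_cols), log_marginal(alpha, others) + l_i)

# Gradient (15) against a central finite difference of eq. (8)
grad = (S ** 2 / a - (Q ** 2 - S)) / (2.0 * (a + S) ** 2)
eps = 1e-6
up, dn = alpha.copy(), alpha.copy()
up[i] += eps
dn[i] -= eps
num = (log_marginal(up, full_cols) - log_marginal(dn, full_cols)) / (2.0 * eps)
assert np.isclose(grad, num, rtol=1e-4)
```

Because C_{-i} does not depend on α_i, S_i and Q_i need be computed only once per basis function when analysing the dependence on that hyperparameter.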
\n\n3.3 Stationary points of L(α) \n\nEquating (15) to zero indicates that stationary points of the marginal likelihood occur both at α_i = +∞ (note that, being an inverse variance, α_i must be positive) and for: \n\nα_i = S_i² / (Q_i² - S_i),    (17) \n\nsubject to Q_i² > S_i as a consequence again of α_i > 0. Since the right-hand side of (17) is independent of α_i, we may find the stationary points of l(α_i) analytically without iterative re-estimation. To find the nature of those stationary points, we consider the second derivatives. \n\n3.4 Second derivatives of L(α) \n\n3.4.1 With respect to α_i \n\nDifferentiating (15) a second time with respect to α_i gives: \n\n∂²L(α)/∂α_i² = [ -α_i^{-2} S_i² (α_i + S_i)² - 2(α_i + S_i) [ α_i^{-1} S_i² - (Q_i² - S_i) ] ] / [ 2 (α_i + S_i)⁴ ],    (18) \n\nand we now consider (18) for both finite- and infinite-α_i stationary points. \n\nFinite α_i. In this case, for stationary points given by (17), we note that the second term in the numerator in (18) is zero, giving: \n\n∂²L(α)/∂α_i² = -α_i^{-2} S_i² / [ 2 (α_i + S_i)² ].    (19) \n\nWe see that (19) is always negative, and therefore l(α_i) has a maximum, which must be unique, for Q_i² - S_i > 0 and α_i given by (17). \n\nInfinite α_i. For this case, (18) and indeed all further derivatives are uninformatively zero at α_i = ∞, but from (15) we can see that as α_i → ∞, the sign of the gradient is given by the sign of -(Q_i² - S_i). If Q_i² - S_i > 0, then the gradient at α_i = ∞ is negative, so as α_i decreases l(α_i) must increase to its unique maximum given by (17). It follows that α_i = ∞ is thus a minimum. Conversely, if Q_i² - S_i < 0, α_i = ∞ is the unique maximum of l(α_i). If Q_i² - S_i = 0, then this maximum and that given by (17) coincide. \n\nWe now have a full characterisation of the marginal likelihood as a function of a single hyperparameter, which is illustrated in Figure 1. 
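The closed-form maximiser (17) and the pruning condition can be exercised directly. A small sketch (ours; the numerical values of S_i and Q_i are chosen arbitrarily), using the α_i-dependent term l(α_i) from (12):

```python
import numpy as np

def alpha_star(S, Q):
    """Eq. (17): the maximising alpha_i is finite iff Q^2 - S > 0; otherwise prune (alpha_i = inf)."""
    return S ** 2 / (Q ** 2 - S) if Q ** 2 > S else np.inf

def ell(alpha, S, Q):
    """The alpha_i-dependent part of the log marginal likelihood, from eq. (12)."""
    return 0.5 * (np.log(alpha) - np.log(alpha + S) + Q ** 2 / (alpha + S))

# Q^2 > S: the finite stationary point is the unique maximum of l(alpha_i)
S, Q = 2.0, 3.0
a_best = alpha_star(S, Q)          # finite, since Q**2 > S
grid = np.logspace(-3.0, 3.0, 200)
assert all(ell(a_best, S, Q) >= ell(a, S, Q) for a in grid)

# Q^2 < S: no finite maximum; the basis function is pruned
assert alpha_star(2.0, 1.0) == np.inf
```

The grid comparison reflects the analysis above: for Q_i² > S_i the single interior maximum dominates every other candidate value of α_i, while for Q_i² < S_i the supremum is attained only in the limit α_i → ∞.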
\nFigure 1: Example plots of l(α_i) against α_i (on a log scale) for Q_i² > S_i (left), showing the single maximum at finite α_i, and Q_i² < S_i (right), showing the maximum at α_i = ∞. \n\n3.4.2 With respect to α_j, j ≠ i \n\nTo obtain the off-diagonal terms of the second-derivative (Hessian) matrix, it is convenient to manipulate (15) to express it in terms of C. From (11) we see that \n\nφ_i^T C^{-1} φ_i = α_i S_i / (α_i + S_i)  and  φ_i^T C^{-1} t = α_i Q_i / (α_i + S_i).    (20) \n\nUtilising these identities in (15) gives: \n\n∂L(α)/∂α_i = [ φ_i^T C^{-1} φ_i - (φ_i^T C^{-1} t)² ] / (2 α_i²).    (21) \n\nWe now write: \n\n∂²L(α)/∂α_i ∂α_j = (1/(2α_i²)) ∂/∂α_j [ φ_i^T C^{-1} φ_i - (φ_i^T C^{-1} t)² ] - δ_ij α_i^{-3} [ φ_i^T C^{-1} φ_i - (φ_i^T C^{-1} t)² ],    (22) \n\nwhere δ_ij is the Kronecker 'delta' function, allowing us to separate out the additional (diagonal) term that appears only when i = j. Writing, similarly to (9) earlier, C = C_{-j} + α_j^{-1} φ_j φ_j^T, substituting into (21) and differentiating with respect to α_j gives: \n\n(1/(2α_i²)) ∂/∂α_j [ φ_i^T C^{-1} φ_i - (φ_i^T C^{-1} t)² ] = [ φ_i^T C^{-1} φ_j / (2 α_i² α_j²) ] [ φ_i^T C^{-1} φ_j - 2 (φ_i^T C^{-1} t)(φ_j^T C^{-1} t) ],    (23) \n\nwhile we have \n\n-δ_ij α_i^{-3} [ φ_i^T C^{-1} φ_i - (φ_i^T C^{-1} t)² ] = -δ_ij (2/α_i) ∂L(α)/∂α_i.    (24) \n\nIf all hyperparameters α_i are individually set to their maximising values, i.e. α = α_MP such that all ∂L(α)/∂α_i = 0, then even if all ∂²L(α)/∂α_i² are negative, there may still be a non-axial direction in which the likelihood could be increasing. We now rule out this possibility by showing that the Hessian is negative semi-definite. First, we note from (24) that if ∂L(α)/∂α_i = 0, the additional diagonal term vanishes. Then, if v is a generic nonzero direction vector, combining (22)-(24) at such a point gives \n\nv^T [∇²L] v = Σ_{i,j} [ v_i v_j φ_i^T C^{-1} φ_j / (2 α_i² α_j²) ] [ φ_i^T C^{-1} φ_j - 2 (φ_i^T C^{-1} t)(φ_j^T C^{-1} t) ] ≤ 0,    (25) \n\nwhere the inequality follows from the Cauchy-Schwarz inequality. If the gradient vanishes, then for all i = 1, ..., M either α_i = ∞, or from (21), φ_i^T C^{-1} φ_i = (φ_i^T C^{-1} t)². It follows directly from (25) that the Hessian is negative semi-definite, with (25) only zero where v is orthogonal to all finite-α directions. \n\n4 Summary \n\nSparse Bayesian learning proposes the iterative maximisation of the marginal likelihood function L(α) with respect to the hyperparameters α. Our analysis has shown the following: \n\nI. As a function of an individual hyperparameter α_i, L(α) has a unique maximum computable in closed form. 
(This maximum is, of course, dependent on the values of all other hyperparameters.) \n\nII. If the criterion Q_i² - S_i (defined in Section 3.2) is negative, this maximum occurs at α_i = ∞, equivalent to the removal of basis function i from the model. \n\nIII. The point where all individual marginal likelihood functions l(α_i) are maximised is a joint maximum (not necessarily unique) over all α_i. \n\nThese results imply the following consequences. \n\n• From I, we see that if we update, in any arbitrary order, the α_i parameters using (17), we are guaranteed to increase the marginal likelihood at each step, unless already at a maximum. Furthermore, we would expect these updates to be more efficient than those given in [6], which individually only increase, not maximise, l(α_i). \n\n• Result III indicates that sequential optimisation of individual α_i cannot lead to a stationary point from which a joint maximisation over all α may have escaped (i.e. the stationary point is not a saddle point). \n\n• Result II confirms the qualitative argument and empirical observation that many α_i → ∞ as a result of the optimisation procedure in [6]. The inevitable implication of finite numerical precision prevented the genuine sparsity of the model being verified in those earlier simulations. \n\n• We conclude by noting that the maximising hyperparameter solution (17) remains valid if α_i is already infinite. This means that basis functions not even in the model can be assessed and their corresponding hyperparameters updated if desired. So as well as the facility to increase L(α) through the 'pruning' of basis functions if Q_i² - S_i ≤ 0, new basis functions can be introduced if α_i = ∞ but Q_i² - S_i > 0. This has highly desirable computational consequences which we are exploiting to obtain a powerful 'constructive' approximation algorithm [8]. \n\nReferences \n\n[1] C. M. Bishop and M. E. Tipping. Variational relevance vector machines. In C. Boutilier and M. Goldszmidt, editors, Proceedings of the 16th Conference on Uncertainty in Artificial Intelligence, pages 46-53. Morgan Kaufmann, 2000. \n\n[2] S. Chen, D. L. Donoho, and M. A. Saunders. Atomic decomposition by basis pursuit. Technical Report 479, Department of Statistics, Stanford University, 1995. \n\n[3] Y. Grandvalet. Least absolute shrinkage is equivalent to quadratic penalisation. In L. Niklasson, M. Boden, and T. Ziemske, editors, Proceedings of the Eighth International Conference on Artificial Neural Networks (ICANN98), pages 201-206. Springer, 1998. \n\n[4] D. J. C. MacKay. Bayesian interpolation. Neural Computation, 4(3):415-447, 1992. \n\n[5] A. J. Smola, B. Scholkopf, and G. Ratsch. Linear programs for automatic accuracy control in regression. In Proceedings of the Ninth International Conference on Artificial Neural Networks, pages 575-580, 1999. \n\n[6] M. E. Tipping. The relevance vector machine. In S. A. Solla, T. K. Leen, and K.-R. Muller, editors, Advances in Neural Information Processing Systems 12, pages 652-658. MIT Press, 2000. \n\n[7] M. E. Tipping. Sparse kernel principal component analysis. In Advances in Neural Information Processing Systems 13. MIT Press, 2001. \n\n[8] M. E. Tipping and A. C. Faul. Bayesian pursuit. Submitted to NIPS*01. \n\n[9] V. N. Vapnik. Statistical Learning Theory. Wiley, 1998. \n", "award": [], "sourceid": 2121, "authors": [{"given_name": "Anita", "family_name": "Faul", "institution": null}, {"given_name": "Michael", "family_name": "Tipping", "institution": null}]}