{"title": "Covariance Kernels from Bayesian Generative Models", "book": "Advances in Neural Information Processing Systems", "page_first": 905, "page_last": 912, "abstract": null, "full_text": "Covariance Kernels from Bayesian \n\nGenerative Models \n\nMatthias Seeger \n\nInstitute for Adaptive and Neural Computation \n\nUniversity of Edinburgh \n\n5 Forrest Hill, Edinburgh EH1 2QL \n\nseeger@dai.ed.ac.uk \n\nAbstract \n\nWe propose the framework of mutual information kernels for \nlearning covariance kernels, as used in Support Vector machines \nand Gaussian process classifiers, from unlabeled task data using \nBayesian techniques. We describe an implementation of this frame(cid:173)\nwork which uses variational Bayesian mixtures of factor analyzers \nin order to attack classification problems in high-dimensional spaces \nwhere labeled data is sparse, but unlabeled data is abundant. \n\n1 \n\nIntroduction \n\nKernel machines, such as Support Vector machines or Gaussian processes, are pow(cid:173)\nerful and frequently used tools for solving statistical learning problems. They are \nbased on the use of a kernel function which encodes task prior knowledge in a \nBayesian manner. In this paper, we propose the framework of mutual informa(cid:173)\ntion (MI) kernels for learning covariance kernels from unlabeled task data using \nBayesian techniques. This section introduces terms and concepts. We also discuss \nsome general ideas for discriminative semi-supervised learning and kernel design \nin this context. In section 2, we define the general framework and give examples. \nWe note that the Fisher kernel [4] is a special case of a MI kernel. MI kernels for \nmixture models are discussed in detail. In section 3, we describe an implementation \nfor a MI kernel for variational Bayesian mixtures of factor analyzers models and \nshow results of preliminary experiments. \n\na \n\nlabeled dataset Dl \n\nthe semi-supervised classification problem, \n\nIn \n{(Xl,tl), ... 
,(Xm ,tm)} as well as an unlabeled set Du = {xm+1 ,\"\"Xm+n} are \ngiven for training, both i.i.d. drawn from the same unknown distribution, but the \nlabels for Du cannot be observed. Here, Xi E I~.P and ti E {-1, +1}.1 Typically, \nm = IDll is rather small, and n = IDul \u00bbm. Our aim is to fit models to Du in a \nBayesian way, thereby extracting (posterior) information, then use this information \nto build a covariance kernel K. Afterwards, K will be plugged into a supervised \nkernel machine, which is trained on the labeled data Dl to perform the classification \ntask. \n\n1 For simplicity, we only discuss binary labels here. \n\n\fIt is important to distinguish very clearly between these two learning scenarios. \nFor fitting Du, we use Bayesian density estimation. After having chosen a model \nfamily {p(xIOn and a prior distribution P(O) over parameters 0, the posterior \ndistribution P(OIDu) ex P(DuIO)P(O), where P(DuIO) = rr::~'~l P(xiIO), encodes \nall information that Du contains about the latent (i.e. unobserved) parameters 0. 2 \nThe other learning scenario is supervised classification, using a kernel machine. \nSuch architectures model a smooth latent function y (x) E ~ as a random process, \ntogether with a classification noise model P(tly).3 The covariance kernel K specifies \nthe prior distribution for this process: namely, a-priori, y(x) is assumed to be a \nGaussian process with zero mean and covariance function K , i.e. K(x(1) , X(2 )) = \nE[y(x(1))Y(X(2))]; see e.g. [10] for details. In the following, we use the notation \na = (ai)i = (al' ... ,aI)' for vectors, and A = (ai ,j )i,j for matrices respectively. The \nprime denotes transposition. diag a is the matrix with diagonal a and 0 elsewhere. \nN(xlJ.t,~) denotes the Gaussian density with mean J.t and covariance matrix ~. \nWithin the standard discriminative Bayesian classification scenario, unlabeled data \ncannot be used. 
However, it is rather straightforward to modify this scenario by introducing the concept of conditional priors (see [6]). If we have a discriminant model family {P(t|x; w)}, a conditional prior P(w|θ) allows us to encode prior knowledge and assumptions about how information about P(x) (i.e. about θ) influences our assumptions about a-priori probabilities over discriminants w. For example, the P(w|θ) could be Occam priors, expressing the intuitive fact that for many problems, the notion of \"simplicity\" of a discriminant function depends strongly on what is known about the input distribution P(x). For a given problem, it is in general not easy to come up with a useful conditional prior. However, once such a prior is specified, we can in principle use the same powerful techniques for approximate Bayesian inference that have been developed for supervised discriminative settings. Semi-supervised techniques that can be seen as employing conditional priors include co-training [1], feature selection based on clustering [7] and the Fisher kernel [4]. For a probabilistic kernel technique, P(w|θ) is fully specified by a covariance function K(x(1), x(2)|θ) depending on θ. The problem is therefore to find covariance kernels which (as GP priors) favour discriminants in some sense compatible with what we have learned about the input distribution P(x). \n\nKernel techniques can be seen as nonparametric smoothers, based on the (prior) assumption that if two input points are \"similar\" (e.g. \"close\" under some distance), their labels (and latent outputs y) should be highly correlated. Thus, one generic way of learning kernels from unlabeled data is to learn a distance between input points from the information about P(x). 
A frequently used assumption about how classification labels may depend on P(x) is the cluster hypothesis: we assume discriminants whose decision boundaries lie between clusters in P(x) to be a-priori more likely than ones that label clusters inconsistently. A general way of encoding this hypothesis is to learn a distance from P(x) which is consistent with clusters in P(x), i.e. points within the same cluster are closer under this distance than points from different clusters. We can then try to embed the learned distance d(x(1), x(2)) approximately in a Euclidean space, i.e. learn a mapping φ : X → φ(x) ∈ R^l such that d(x(1), x(2)) ≈ ||φ(x(1)) − φ(x(2))|| for all pairs from Du. Then, a natural kernel function would be K(x(1), x(2)) = exp(−β ||φ(x(1)) − φ(x(2))||^2). In this paper, however, we follow a simpler approach, by considering a similarity measure \n\n2 In practice, computation of P(θ|Du) is hardly ever feasible, but powerful approximation techniques can be used. \n\n3 A natural choice for binary classification is to represent the log odds log(P(t = +1|x)/P(t = -1|x)) by y(x). \n\nwhich immediately gives rise to a covariance kernel, without having to compute an approximate Euclidean embedding. \n\nRemark: Our main aim in this paper is to construct kernels that can be learned from unlabeled data only. In contrast to this, the task of learning a kernel from labeled data is somewhat simpler and can be approached in the following generic way: start with a parametric model family {y(x; w)}, with the interpretation that y(x; w) models the log odds log(P(t = +1|x)/P(t = -1|x)). Fitting these models to labeled data Dl, we obtain a posterior P(w|Dl). Now, a natural covariance kernel for our problem is simply K(x(1), x(2)) = ∫ y(x(1); w) y(x(2); w) Q(w) dw, where (say) Q(w) = P(w|Dl) (see e.g. [3]). 
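The embedding route described above (learn a distance d, embed it approximately in a Euclidean space, then exponentiate) can be sketched with classical MDS as the embedding step. This is an illustrative assumption, not the paper's method (which avoids the embedding altogether); all names here are hypothetical.

```python
import numpy as np

def mds_embed(D, l):
    # classical MDS: embed an (n x n) distance matrix D into R^l
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n      # centering matrix
    B = -0.5 * J @ (D ** 2) @ J              # double-centered Gram matrix
    vals, vecs = np.linalg.eigh(B)           # eigh returns ascending order
    idx = np.argsort(vals)[::-1][:l]         # keep the top-l eigenpairs
    return vecs[:, idx] * np.sqrt(np.maximum(vals[idx], 0.0))

def rbf_from_embedding(Phi, beta):
    # K(x1, x2) = exp(-beta ||phi(x1) - phi(x2)||^2) on the embedded points
    sq = ((Phi[:, None, :] - Phi[None, :, :]) ** 2).sum(-1)
    return np.exp(-beta * sq)

# Sanity check: Euclidean distances embed exactly, so the construction
# reduces to the plain RBF kernel on the original points.
X = np.random.default_rng(1).normal(size=(6, 2))
D = np.linalg.norm(X[:, None] - X[None, :], axis=-1)
K = rbf_from_embedding(mds_embed(D, 2), beta=0.5)
```

When d is only approximately Euclidean, the embedded distances (and hence K) reproduce d approximately rather than exactly.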
\n\n6 One can show that if K̃ is itself a kernel, and K̃ → K under EE, then K (3) is also a kernel. \n\nor the weighted inner product x(1)' V x(2) into the squared-exponential kernel (e.g. [10]). It is easy to show that K in (3) is a valid covariance kernel7, and we refer to it as mutual information (MI) kernel. \n\nExample: Let P(x|θ) = N(x|θ, (p/2)I) (spherical Gaussian with mean θ), P_med(θ) = N(θ|0, aI). Then, the MI kernel K is the RBF kernel (4) with β = 4/(p(4 + p/a)). Thus, the RBF kernel is a special case of a MI kernel. \n\n2.1 Mediator distribution. Model-trust scaling. \n\nThe mediator distribution P_med(θ), motivated earlier in this section, should ideally encode information about the x generation process, just as the Bayesian posterior P(θ|Du). On the other hand, we need to be able to control the influence that information from sources such as unlabeled data Du can have on the kernel (relying too much on such sources results in lack of robustness, see e.g. [6] for details). Here, we propose model-trust scaling (MTS), by setting \n\nP_med(θ) ∝ P(Du|θ)^(λ/n) P(θ), λ ∈ [0, n]. (5) \n\nP_med varies with λ from the (usually vague) prior P(θ) (λ = 0) towards the sharp posterior P(θ|Du) (λ = n), giving the Du information (via the model) more and more influence upon the kernel K. The concrete effect of MTS on the kernel depends on the model family. \n\nExample (continued): Again, P(x|θ) = N(x|θ, (p/2)I), with a flat prior P(θ) = 1 on the mean. Then, P(θ|Du) = N(θ|x̄, (p/2n)I), where x̄ = n^(-1) Σ_{i=m+1}^{m+n} x_i, and P_med(θ) = N(θ|x̄, (p/2λ)I) (after (5)). Thus, the MI kernel is again the RBF kernel (4) with β = 2/(p(2 + λ)). For the more flexible model P(x|θ) = N(x|μ, Σ), θ = (μ, Σ), and the conjugate Jeffreys prior, the MI kernel is computed in [5]. \n\nIf the Bayesian analysis is done with conjugate prior-model pairs, the corresponding MI kernel can be computed easily, and for many of these cases, MTS has a very simple, analytic form (see [5]). 
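The spherical-Gaussian example above can be checked numerically. The sketch below assumes the normalization K(x(1), x(2)) = Q(x(1), x(2))^2 / (Q(x(1), x(1)) Q(x(2), x(2))) for the MI kernel of eq. (3), which is not shown in this excerpt but reproduces the quoted β in both worked examples; Q is evaluated in closed form.

```python
import numpy as np

def log_Q(x1, x2, p, a):
    # log Q(x1,x2) = log of the integral N(x1|th,(p/2)I) N(x2|th,(p/2)I) N(th|0,aI) dth;
    # the Gaussian integral factorizes over the difference and the midpoint
    s2 = p / 2.0
    v = a + s2 / 2.0
    d, m = x1 - x2, (x1 + x2) / 2.0
    return (-0.25 * (d @ d) / s2 - 0.5 * (m @ m) / v
            - 0.5 * p * np.log(4 * np.pi * s2) - 0.5 * p * np.log(2 * np.pi * v))

def mi_kernel(x1, x2, p, a):
    # assumed normalization for eq. (3): K = Q(x1,x2)^2 / (Q(x1,x1) Q(x2,x2))
    return np.exp(2 * log_Q(x1, x2, p, a)
                  - log_Q(x1, x1, p, a) - log_Q(x2, x2, p, a))

rng = np.random.default_rng(0)
p, a = 4, 1.5
x1, x2 = rng.normal(size=p), rng.normal(size=p)
beta = 4.0 / (p * (4.0 + p / a))              # the value quoted in the text
rbf = np.exp(-beta * np.sum((x1 - x2) ** 2))  # RBF kernel (4)
mi = mi_kernel(x1, x2, p, a)                  # agrees with rbf
```

With this normalization K(x, x) = 1, so the kernel is automatically scaled like the RBF kernel.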
In general, approximation techniques developed for Bayesian analysis have to be applied. For example, applying the Laplace approximation to the computations on a model with flat prior P(θ) = 1 results in the Fisher kernel [4]8, see e.g. [5]. However, in this paper we favour another approximation technique (see section 3). \n\n2.2 Mutual Information Kernels for Mixture Models \n\nIf we apply the MI kernel framework to mixture models P(x|θ, π) = Σ_s π_s P(x|θ_s), we run into a problem. As mentioned in section 1, we would like our kernel at least partly to encode the cluster hypothesis, i.e. K(x(1), x(2)) should be small if x(1), x(2) come from different clusters in P(x),9 but the opposite is true (for not too small \n\n7 Q(x(1), x(2)) is an inner product (therefore a kernel); for the rest of the argument see [3], section 5. \n\n8 This was essentially observed by the authors of [4] in workshop talks, but has not been published to our knowledge. The fascinating idea of the Fisher kernel has indeed been the main motivation and inspiration for this paper. \n\n9 This does not mean that we (a-priori) believe they should have different labels, but only that the label (or better: the latent y(·)) at one of them should not depend strongly on y(·) at the other. \n\nλ). To overcome this problem, we generalize Q(x(1), x(2)): \n\nQ(x(1), x(2)) = Σ_{s1,s2=1}^{S} w_{s1,s2} ∫ P(x(1)|θ_{s1}) P(x(2)|θ_{s2}) P_med(θ) dθ, (6) \n\nwhere W = (w_{s1,s2})_{s1,s2} is symmetric with nonnegative entries and positive elements on the diagonal. The MI kernel K is defined as before by (3), based on the new Q. If P_med(θ, π) = Π_s P_med(θ_s) P_med(π) (which is true for the cases we will be interested in), we see that the original MI kernel arises as special case w_{s1,s2} = E_{P_med}[π_{s1} π_{s2}]. Now, by choosing W = diag(E_{P_med}[π_s])_s, we arrive at a MI kernel K which (typically) behaves as expected w.r.t. 
cluster separation (see figure 1), but does not exhibit long-range correlations between joined components. In the present work, we restrict ourselves to this diagonal mixture kernel. Note that this kernel can be seen as a (normalized) mixture of MI kernels over the component models. \n\nFigure 1: Kernel contours on 2-cluster dataset (λ = 5, 100, 30) \n\nFigure 1 shows contour plots10 of the diagonal mixture kernel for VB-MoFA (see section 3), learned on a 500 cases dataset sampled from two Gaussians with equal covariance (see subsection 3.1). We plot K(a, x) for fixed a (marked by a cross) against all x; the height between contour lines is 0.1. The left and middle plot have the lower cluster's centre as a, with λ = 5, λ = 100 respectively; the right plot's a lies between the cluster centres, λ = 30. The effect of MTS can be seen by comparing left and middle; note the different sharpness of the slopes towards the other cluster and the different sizes and shapes of the \"high correlation\" regions. As seen on the right, points between clusters have highest correlation with other such inter-cluster points, a feature that may be very useful for successful discrimination. \n\n3 Experiments with Mixtures of Factor Analyzers \n\nIn this section, we describe an implementation of a MI kernel, using variational Bayesian mixtures of factor analyzers (VB-MoFA) [2] as density models. These are able to combine local dimensionality reduction (using noisy linear transformations u → x from low-dimensional latent spaces) with good global data fit using mixtures. VB-MoFA is a variational approximation to Bayesian analysis on these models, able to deliver the posterior approximations we require for an MI kernel. \n\nWe employ the diagonal mixture kernel (see subsection 2.2). Instead of implementing MTS analytically, we compute the VB approximation to the true posterior (i.e. λ = n), then simply apply the scaling to this distribution. 
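For intuition, the diagonal mixture kernel of eq. (6) can be written down in closed form when the components are spherical Gaussians with Gaussian mediators over their means (in the VB-MoFA implementation these integrals are not tractable). The component locations, weights and the normalization of K below are illustrative assumptions, not values from the paper.

```python
import numpy as np

def log_gauss(x, mu, var):
    # log density of N(x|mu, var*I)
    d = x - mu
    return -0.5 * (d @ d / var + len(x) * np.log(2 * np.pi * var))

def Q_comp(x1, x2, mu, v, s2):
    # closed form of the integral of N(x1|t, s2*I) N(x2|t, s2*I) N(t|mu, v*I) dt
    return np.exp(log_gauss(x1 - x2, 0 * x1, 2 * s2)
                  + log_gauss((x1 + x2) / 2, mu, v + s2 / 2))

def K_mix(x1, x2, mus, w, v, s2):
    # diagonal mixture kernel: Q = sum_s w_s Q_s(x1, x2), then normalized;
    # the normalization is assumed to match the single-model MI kernel
    # (eq. (3) is not part of this excerpt)
    Q = lambda u1, u2: sum(ws * Q_comp(u1, u2, mu, v, s2)
                           for ws, mu in zip(w, mus))
    return Q(x1, x2) ** 2 / (Q(x1, x1) * Q(x2, x2))

mus = [np.array([-3.0, 0.0]), np.array([3.0, 0.0])]  # two well-separated clusters
w, v, s2 = [0.5, 0.5], 0.1, 0.5                      # hypothetical values
a, b = np.array([-3.0, 0.1]), np.array([3.0, -0.1])
same = K_mix(a, mus[0], mus, w, v, s2)   # high: same cluster
cross = K_mix(a, b, mus, w, v, s2)       # tiny: different clusters
```

Because W is diagonal, no cross-component term links the two clusters, so `cross` is driven to zero by the component likelihoods alone, which is the behaviour figure 1 illustrates for the learned VB-MoFA kernel.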
P_med(θ, π) factorizes as required in subsection 2.2. The integrals ∫ P(x(1)|θ_s) P(x(2)|θ_s) P_med(θ_s) dθ_s in (6) are not analytically tractable. Our first idea was to approximate them by applying the VB technique once more, ending up with what we called one-step variational approximations. Unfortunately, the MI kernel approximation based on these terms cannot be shown to be positive definite anymore11! Thus, at the moment we use a less elegant and, we feel, less accurate approximation (details can be found in [5]) based on first-order Taylor expansions. \n\n10 Produced using the first-order approximation (see section 3) to the MI kernel. Plots using the one-step variational approximation (see section 3) have a somewhat richer structure. \n\nIn the remainder of this section we compare the VB-MoFA kernel with the RBF kernel (4) on two datasets, using a Laplace GP classifier (see [10]). In each case we sample a training pool, a kernel dataset Du and a test set (mutually exclusive). The VB-MoFA diagonal mixture kernel is learned on Du. For a given training set size m, a run consists of sampling a training set Dl and a holdout set Dh (both of size m) from the training pool, tuning kernel parameters by validation on Dh, then testing on the test set. We use the same Dl, Du for both kernels. For each training set size, we do L = 30 runs. Results are presented by plotting means and 95% t-test confidence intervals of test errors over runs. \n\n3.1 Two Gaussian clusters \n\nThe dataset is sampled from two 2-d Gaussians with the same non-spherical covariance (see figure 1), one for each class (the Bayes error is 2.64%). We use n = 500 points for Du, a training pool of 100 and a test set of 500 points. The learning curves in figure 2 show that on this simple toy problem, on which the fitted VB-MoFA model represents the cluster structure in P(x) almost perfectly, the VB-MoFA MI kernel outperforms the RBF kernel for sample sizes n ≤ 40. 
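The results-presentation step (means and 95% t-test confidence intervals of test errors over L = 30 runs) amounts to the following sketch; the error values here are synthetic placeholders, not the paper's measurements.

```python
import numpy as np

# Mean test error and 95% t-based confidence interval over L = 30 runs,
# as plotted in the learning curves; the errors below are synthetic.
errors = np.random.default_rng(0).uniform(0.10, 0.20, size=30)
L = len(errors)
t975 = 2.045                       # two-sided 95% t quantile for df = L - 1 = 29
mean = errors.mean()
half = t975 * errors.std(ddof=1) / np.sqrt(L)
lo, hi = mean - half, mean + half  # plotted as mean with error bars
```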
\n\nFigure 2: Learning curves on 2-cluster dataset (test error against training set size n). Left: RBF kernel; right: MI kernel \n\n3.2 Handwritten Digits (MNIST): Twos against threes \n\nWe report results of preliminary experiments using the subset of twos and threes of the MNIST Handwritten Digits database12. Here, n = |Du| = 2000, the training pool contains 8089, the test set 2000 cases. We employ a VB-MoFA model with 20 components, fitted to Du. We use a very simple baseline (BL) algorithm (see [6], section 2.3) based on the component densities from the VB-MoFA model13, which allows us to assess the \"purity\" of the component clusters w.r.t. the labels14; this algorithm is the only one not based on a kernel. Furthermore, we show results for the one-step variational approximation to the MI kernel15 (MIOLD). The learning curves are shown in figure 3. \n\n11 Thanks to an anonymous reviewer for pointing out this flaw. \n\n12 The 28 x 28 images were downsampled to size 8 x 8. \n\n13 The estimates P(x|s) are obtained by integrating out the parameters θ_s using the variational posterior approximation. The integral is not analytic, and we use a one-step variational approximation to it. \n\nFigure 3: Learning curves on MNIST twos/threes. Upper left: RBF kernel; upper middle: Baseline method; upper right: VB-MoFA MI kernel (first-order approx.); lower left: VB-MoFA MI \"kernel\" (one-step var. approx.) \n\nThe results are disappointing. 
The fact that the first-order approximation to the MI kernel performs worse than the one-step variational approximation (although the latter may fail to be positive definite) indicates that the former is a poorer approximation. The latter renders results close to the baseline method, while the smoothing RBF kernel makes much better use of a growing number of labeled examples.16 This indicates that the conditional prior, as represented by the VB-MoFA MI kernel, behaves nonsmoothly and overrides label information in regions where it should not. We suspect this problem to be related to the high dimensionality of the input space, in which case probability densities tend to have a large dynamic range, and mixture component responsibility estimates tend to behave very nonsmoothly. Thus, it seems to be necessary to extend the basic MI kernel framework by new scaling mechanisms in order to produce a smoother encoding of the prior assumptions. \n\n14 The baseline algorithm is based on the assumption that, given the component index s, the input point x and the label t are independent. Only the conditional probabilities P(t|s) are learned, while P(x|s) and P(s) are obtained from the VB-MoFA model fitted to unlabeled data only. Thus, success/failure of this method should be closely related to the degree of purity of the component clusters w.r.t. the labels. \n\n15 This is somewhat inconsistent, since we use a kernel function which might not be positive definite in a context (GP classification) which requires a covariance function. \n\n16 Note also that RBF kernel matrices can be evaluated significantly faster than ones using the VB-MoFA MI kernel. \n\n4 Related work. Discussion \n\nThe present work is probably most closely related to the Fisher kernel (see subsection 2.1). The arguments concerning mixture models (see subsection 2.2) apply there as well. 
Haussler [3] contains a wealth of material about kernel design for discrete objects x. Watkins [9] mentions that expressions like Q in (1) are valid kernels for discrete x and countable parameter spaces. Very recently we came across [11], which essentially describes a special case of the diagonal mixture kernel (see subsection 2.2) for Gaussian components with diagonal covariances17. The author calls Q a stochastic equivalence predicate. He is interested in distance learning, does not apply his method to kernel machines and does not give a Bayesian interpretation. \n\nWe have presented a general framework for kernel learning from unlabeled data and described an approximate implementation using VB-MoFA models. A straightforward application of this technique to high-dimensional real-world data did not prove successful, and in future work we will explore new ideas for extending the basic MI kernel framework in order to be able to deal with high-dimensional input spaces. \n\nAcknowledgments \n\nWe thank Chris Williams for many inspiring discussions, furthermore Ralf Herbrich, Amos Storkey, Hugo Zaragoza and Neil Lawrence. Matt Beal helped us a lot with VB-MoFA. The author gratefully acknowledges support through a research studentship from Microsoft Research Ltd. \n\nReferences \n\n[1] Avrim Blum and Tom Mitchell. Combining labeled and unlabeled data with co-training. In Proceedings of COLT, 1998. \n\n[2] Z. Ghahramani and M. Beal. Variational inference for Bayesian mixtures of factor analysers. In Advances in NIPS 12. MIT Press, 1999. \n\n[3] David Haussler. Convolution kernels on discrete structures. Technical Report UCSC-CRL-99-10, University of California, Santa Cruz, July 1999. \n\n[4] Tommi S. Jaakkola and David Haussler. Exploiting generative models in discriminative classifiers. In Advances in Neural Information Processing Systems 11, 1998. 
\n\n[5] Matthias Seeger. Covariance kernels from Bayesian generative models. Technical report, 2000. Available at http://www.dai.ed.ac.uk/~seeger/papers.html. \n\n[6] Matthias Seeger. Learning with labeled and unlabeled data. Technical report, 2000. Available at http://www.dai.ed.ac.uk/~seeger/papers.html. \n\n[7] Martin Szummer and Tommi Jaakkola. Partially labeled classification with Markov random walks. In Advances in NIPS 14. MIT Press, 2001. \n\n[8] Koji Tsuda, Motoaki Kawanabe, Gunnar Rätsch, Sören Sonnenburg, and Klaus-Robert Müller. A new discriminative kernel from probabilistic models. In Advances in NIPS 14. MIT Press, 2001. \n\n[9] Chris Watkins. Dynamic alignment kernels. Technical Report CSD-TR-98-11, Royal Holloway, University of London, 1999. \n\n[10] Christopher K. I. Williams and David Barber. Bayesian classification with Gaussian processes. IEEE Trans. PAMI, 20(12):1342-1351, 1998. \n\n[11] Peter Yianilos. Metric learning via normal mixtures. Technical report, NEC Research, Princeton, 1995. \n\n17 The a parameter in this work is related to MTS in this case.", "award": [], "sourceid": 2133, "authors": [{"given_name": "Matthias", "family_name": "Seeger", "institution": null}]}