{"title": "Covariance Kernels from Bayesian Generative Models", "book": "Advances in Neural Information Processing Systems", "page_first": 905, "page_last": 912, "abstract": null, "full_text": "Covariance Kernels from Bayesian Generative Models\n\nMatthias Seeger\n\nInstitute for Adaptive and Neural Computation\nUniversity of Edinburgh\n5 Forrest Hill, Edinburgh EH1 2QL\nseeger@dai.ed.ac.uk\n\nAbstract\n\nWe propose the framework of mutual information kernels for learning covariance kernels, as used in Support Vector machines and Gaussian process classifiers, from unlabeled task data using Bayesian techniques. We describe an implementation of this framework which uses variational Bayesian mixtures of factor analyzers in order to attack classification problems in high-dimensional spaces where labeled data is sparse, but unlabeled data is abundant.\n\n1 Introduction\n\nKernel machines, such as Support Vector machines or Gaussian processes, are powerful and frequently used tools for solving statistical learning problems. They are based on the use of a kernel function which encodes task prior knowledge in a Bayesian manner. In this paper, we propose the framework of mutual information (MI) kernels for learning covariance kernels from unlabeled task data using Bayesian techniques. This section introduces terms and concepts. We also discuss some general ideas for discriminative semi-supervised learning and kernel design in this context. In section 2, we define the general framework and give examples. We note that the Fisher kernel [4] is a special case of a MI kernel. MI kernels for mixture models are discussed in detail. 
In section 3, we describe an implementation for a MI kernel for variational Bayesian mixtures of factor analyzers models and show results of preliminary experiments.\n\nIn the semi-supervised classification problem, a labeled dataset D_l = {(x_1, t_1), ..., (x_m, t_m)} as well as an unlabeled set D_u = {x_{m+1}, ..., x_{m+n}} are given for training, both i.i.d. drawn from the same unknown distribution, but the labels for D_u cannot be observed. Here, x_i ∈ ℝ^p and t_i ∈ {-1, +1}.^1 Typically, m = |D_l| is rather small, and n = |D_u| ≫ m. Our aim is to fit models to D_u in a Bayesian way, thereby extracting (posterior) information, then use this information to build a covariance kernel K. Afterwards, K will be plugged into a supervised kernel machine, which is trained on the labeled data D_l to perform the classification task.\n\n^1 For simplicity, we only discuss binary labels here.\n\nIt is important to distinguish very clearly between these two learning scenarios. For fitting D_u, we use Bayesian density estimation. After having chosen a model family {P(x|θ)} and a prior distribution P(θ) over parameters θ, the posterior distribution P(θ|D_u) ∝ P(D_u|θ)P(θ), where P(D_u|θ) = ∏_{i=m+1}^{m+n} P(x_i|θ), encodes all information that D_u contains about the latent (i.e. unobserved) parameters θ.^2 The other learning scenario is supervised classification, using a kernel machine. Such architectures model a smooth latent function y(x) ∈ ℝ as a random process, together with a classification noise model P(t|y).^3 The covariance kernel K specifies the prior distribution for this process: namely, a-priori, y(x) is assumed to be a Gaussian process with zero mean and covariance function K, i.e. K(x^(1), x^(2)) = E[y(x^(1)) y(x^(2))]; see e.g. [10] for details. 
In the following, we use the notation a = (a_i)_i = (a_1, ..., a_I)' for vectors, and A = (a_{i,j})_{i,j} for matrices respectively. The prime denotes transposition. diag a is the matrix with diagonal a and 0 elsewhere. N(x|μ, Σ) denotes the Gaussian density with mean μ and covariance matrix Σ.\n\nWithin the standard discriminative Bayesian classification scenario, unlabeled data cannot be used. However, it is rather straightforward to modify this scenario by introducing the concept of conditional priors (see [6]). If we have a discriminant model family {P(t|x; w)}, a conditional prior P(w|θ) allows us to encode prior knowledge and assumptions about how information about P(x) (i.e. about θ) influences our assumptions about a-priori probabilities over discriminants w. For example, the P(w|θ) could be Occam priors, expressing the intuitive fact that for many problems, the notion of \"simplicity\" of a discriminant function depends strongly on what is known about the input distribution P(x). For a given problem, it is in general not easy to come up with a useful conditional prior. However, once such a prior is specified, we can in principle use the same powerful techniques for approximate Bayesian inference that have been developed for supervised discriminative settings. Semi-supervised techniques that can be seen as employing conditional priors include co-training [1], feature selection based on clustering [7] and the Fisher kernel [4]. For a probabilistic kernel technique, P(w|θ) is fully specified by a covariance function K(x^(1), x^(2)|θ) depending on θ. The problem is therefore to find covariance kernels which (as GP priors) favour discriminants in some sense compatible with what we have learned about the input distribution P(x). 
\n\nKernel techniques can be seen as nonparametric smoothers, based on the (prior) assumption that if two input points are \"similar\" (e.g. \"close\" under some distance), their labels (and latent outputs y) should be highly correlated. Thus, one generic way of learning kernels from unlabeled data is to learn a distance between input points from the information about P(x). A frequently used assumption about how classification labels may depend on P(x) is the cluster hypothesis: we assume discriminants whose decision boundaries lie between clusters in P(x) to be a-priori more likely than those that label clusters inconsistently. A general way of encoding this hypothesis is to learn a distance from P(x) which is consistent with clusters in P(x), i.e. points within the same cluster are closer under this distance than points from different clusters. We can then try to embed the learned distance d(x^(1), x^(2)) approximately in a Euclidean space, i.e. learn a mapping φ : x ↦ φ(x) ∈ ℝ^l such that d(x^(1), x^(2)) ≈ ||φ(x^(1)) - φ(x^(2))|| for all pairs from D_u. Then, a natural kernel function would be K(x^(1), x^(2)) = exp(-β ||φ(x^(1)) - φ(x^(2))||^2). In this paper, however, we follow a simpler approach, by considering a similarity measure which immediately gives rise to a covariance kernel, without having to compute an approximate Euclidean embedding.\n\n^2 In practice, computation of P(θ|D_u) is hardly ever feasible, but powerful approximation techniques can be used.\n\n^3 A natural choice for binary classification is to represent the log odds log(P(t = +1|x)/P(t = -1|x)) by y(x).\n\nRemark: Our main aim in this paper is to construct kernels that can be learned from unlabeled data only. 
In contrast to this, the task of learning a kernel from labeled data is somewhat simpler and can be approached in the following generic way: start with a parametric model family {y(x; w)}, with the interpretation that y(x; w) models the log odds log(P(t = +1|x)/P(t = -1|x)). Fitting these models to labeled data D_l, we obtain a posterior P(w|D_l). Now, a natural covariance kernel for our problem is simply K(x^(1), x^(2)) = ∫ y(x^(1); w) y(x^(2); w) Q(w) dw, where (say) Q(w) ∝ P(w|D_l)^λ P(w)^(1-λ) (or an approximation thereof). For λ = 0, we obtain the prior covariance kernel for our model, while for larger λ the kernel incorporates more and more posterior information. The kernel proposed in [8] can be seen as an approximation to this approach.\n\n2 Mutual Information Kernels\n\nIn this section, we begin by introducing the framework of mutual information kernels. Given a mediator distribution P_med(θ) over parameters θ, we define the joint distribution Q(x^(1), x^(2)) mediated by P_med(θ) as\n\nQ(x^(1), x^(2)) = ∫ P_med(θ) P(x^(1)|θ) P(x^(2)|θ) dθ.    (1)\n\nThe sample mutual information between x^(1) and x^(2) under this distribution is\n\nI(x^(1), x^(2)) = log [ Q(x^(1), x^(2)) / (Q(x^(1)) Q(x^(2))) ],    (2)\n\nwhere Q(x) = ∫ Q(x, x̃) dx̃ is the marginal of Q. I(x^(1), x^(2)) is called the mutual information (MI) score. In a very concrete sense, it measures the similarity between x^(1) and x^(2) with respect to the generative process represented by the mediator distribution P_med(θ): it is the amount of information they share via the mediator variable θ ~ P_med(θ). Note that Q(x^(1), x^(2)) can be seen as an inner product in a space of functions θ ↦ ℝ, the features of x^(k) being (P(x^(k)|θ))_θ, weighted by the distribution P_med.^4 x^(k) is represented by its likelihood under all possible models. 
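The definitions (1) and (2) can be checked numerically in a toy setting. The following is a minimal sketch, not the implementation used in this paper: it assumes a one-dimensional Gaussian likelihood P(x|θ) = N(x|θ, LIK_VAR) and a Gaussian mediator P_med(θ) = N(θ|0, MED_VAR), and it evaluates the integral over θ with a plain trapezoid rule. The names normal_pdf, integrate_over_theta, joint_q, marginal_q and mi_score, the two variance constants and the integration range are all hypothetical choices made for illustration.

```python
import math

# Assumed toy setting (not from the paper): 1-d Gaussian likelihood and mediator.
LIK_VAR = 0.5   # variance of P(x | theta) = N(x | theta, LIK_VAR)
MED_VAR = 2.0   # variance of P_med(theta) = N(theta | 0, MED_VAR)


def normal_pdf(x, mean, var):
    # Density of a 1-d Gaussian with the given mean and variance.
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


def integrate_over_theta(f, lo=-12.0, hi=12.0, steps=4000):
    # Composite trapezoid rule on [lo, hi]; the range is wide enough for the
    # assumed variances because the integrand is negligible at the ends.
    h = (hi - lo) / steps
    total = 0.5 * (f(lo) + f(hi))
    for k in range(1, steps):
        total += f(lo + k * h)
    return total * h


def joint_q(x1, x2):
    # Equation (1): Q(x1, x2) = integral of P_med(th) P(x1|th) P(x2|th) d th.
    return integrate_over_theta(
        lambda th: normal_pdf(th, 0.0, MED_VAR)
        * normal_pdf(x1, th, LIK_VAR)
        * normal_pdf(x2, th, LIK_VAR)
    )


def marginal_q(x):
    # Marginal of Q: integrating the second argument out of (1) leaves
    # the integral of P_med(th) P(x|th) d th.
    return integrate_over_theta(
        lambda th: normal_pdf(th, 0.0, MED_VAR) * normal_pdf(x, th, LIK_VAR)
    )


def mi_score(x1, x2):
    # Equation (2): log of Q(x1, x2) / (Q(x1) Q(x2)).
    return math.log(joint_q(x1, x2)) - math.log(marginal_q(x1)) - math.log(marginal_q(x2))
```

In this toy setting, mi_score(1.0, 1.1) is positive while mi_score(1.0, -1.0) is negative: nearby points share information via the mediator variable, whereas points on opposite sides of the mediator mean do not.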
\n\nCovariance kernels have to satisfy the property of positive definiteness,^5 and the MI score I does not. However, applying a standard transformation (called exponential embedding (EE) here), we arrive at\n\nK(x^(1), x^(2)) = exp( -(I(x^(1), x^(1)) + I(x^(2), x^(2)))/2 + I(x^(1), x^(2)) ) = Q(x^(1), x^(2)) / √( Q(x^(1), x^(1)) Q(x^(2), x^(2)) ).    (3)\n\nEE becomes familiar if we note that it transforms the standard inner product x^(1)' x^(2) into the well-known Radial Basis Function (RBF) kernel^6\n\nK(x^(1), x^(2)) = exp( -(β/2) ||x^(1) - x^(2)||^2 ),    (4)\n\nor the weighted inner product x^(1)' V x^(2) into the squared-exponential kernel (e.g. [10]). It is easy to show that K in (3) is a valid covariance kernel,^7 and we refer to it as the mutual information (MI) kernel.\n\n^4 When comparing x^(1), x^(2) via the inner product, we are not interested in correlating their features uniformly, but rather focus on regions of high volume under P_med.\n\n^5 K is positive definite if the matrix (K(x^(k1), x^(k2)))_{k1,k2} is positive definite for every set {x^(1), ..., x^(K)} of distinct points.\n\n^6 One can show that if K̃ is itself a kernel, and K̃ → K under EE, then K^β is also a kernel for all β > 0 (see e.g. [3]).\n\nExample: Let P(x|θ) = N(x|θ, (ρ/2)I) (spherical Gaussian with mean θ), P_med(θ) = N(θ|0, αI). Then, the MI kernel K is the RBF kernel (4) with β = 4/(ρ(4 + ρ/α)). Thus, the RBF kernel is a special case of a MI kernel.\n\n2.1 Mediator distribution. Model-trust scaling.\n\nThe mediator distribution P_med(θ), motivated earlier in this section, should ideally encode information about the x generation process, just as the Bayesian posterior P(θ|D_u). 
On the other hand, we need to be able to control the influence that information from sources such as unlabeled data D_u can have on the kernel (relying too much on such sources results in lack of robustness, see e.g. [6] for details). Here, we propose model-trust scaling (MTS), by setting\n\nP_med(θ) ∝ P(θ) P(D_u|θ)^(λ/n).    (5)\n\nP_med varies with λ from the (usually vague) prior P(θ) (λ = 0) towards the sharp posterior P(θ|D_u) (λ = n), giving the D_u information (via the model) more and more influence upon the kernel K. The concrete effect of MTS on the kernel depends on the model family.\n\nExample (continued): Again, P(x|θ) = N(x|θ, (ρ/2)I), with a flat prior P(θ) = 1 on the mean. Then, P(θ|D_u) = N(θ|x̄, (ρ/2n)I), where x̄ = n^(-1) Σ_{i=m+1}^{m+n} x_i, and P_med(θ) = N(θ|x̄, (ρ/2λ)I) (after (5)). Thus, the MI kernel is again the RBF kernel (4) with β = 2/(ρ(2 + λ)). For the more flexible model P(x|θ) = N(x|μ, Σ), θ = (μ, Σ) and the conjugate Jeffreys prior, the MI kernel is computed in [5].\n\nIf the Bayesian analysis is done with conjugate prior-model pairs, the corresponding MI kernel can be computed easily, and for many of these cases, MTS has a very simple, analytic form (see [5]). In general, approximation techniques developed for Bayesian analysis have to be applied. For example, applying the Laplace approximation to the computations on a model with flat prior P(θ) = 1 results in the Fisher kernel [4],^8 see e.g. [5]. However, in this paper we favour another approximation technique (see section 3).\n\n2.2 Mutual Information Kernels for Mixture Models\n\nIf we apply the MI kernel framework to mixture models P(x|θ, π) = Σ_s π_s P(x|θ_s), we run into a problem. As mentioned in section 1, we would like our kernel at least partly to encode the cluster hypothesis, i.e. 
K(x^(1), x^(2)) should be small if x^(1), x^(2) come from different clusters in P(x),^9 but the opposite is true (for not too small λ).\n\n^7 Q(x^(1), x^(2)) is an inner product (therefore a kernel); for the rest of the argument see [3], section 5.\n\n^8 This was essentially observed by the authors of [4] in workshop talks, but has not been published to our knowledge. The fascinating idea of the Fisher kernel has indeed been the main motivation and inspiration for this paper.\n\n^9 This does not mean that we (a-priori) believe they should have different labels, but only that the label (or better: the latent y(·)) at one of them should not depend strongly on y(·) at the other.\n\nTo overcome this problem, we generalize Q(x^(1), x^(2)):\n\nQ(x^(1), x^(2)) = Σ_{s1,s2=1}^{S} w_{s1,s2} ∫ P(x^(1)|θ_{s1}) P(x^(2)|θ_{s2}) P_med(θ) dθ,    (6)\n\nwhere W = (w_{s1,s2})_{s1,s2} is symmetric with nonnegative entries and positive elements on the diagonal. The MI kernel K is defined as before by (3), based on the new Q. If P_med(θ, π) = ∏_s P_med(θ_s) P_med(π) (which is true for the cases we will be interested in), we see that the original MI kernel arises as the special case w_{s1,s2} = E_{P_med}[π_{s1} π_{s2}]. Now, by choosing W = diag (E_{P_med}[π_s])_s, we arrive at a MI kernel K which (typically) behaves as expected w.r.t. cluster separation (see figure 1), but does not exhibit long-range correlations between joined components. In the present work, we restrict ourselves to this diagonal mixture kernel. Note that this kernel can be seen as a (normalized) mixture of MI kernels over the component models. 
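As an illustration of the diagonal mixture kernel, the following minimal sketch evaluates (6) with a diagonal W and then applies (3). It is not the VB-MoFA kernel of section 3: it assumes one-dimensional Gaussian components N(x|θ_s, lik_var) with a shared, known variance, and independent Gaussian mediators N(θ_s|mean_s, med_var) on the component means, for which each component integral has a closed form (a bivariate Gaussian density). The function names component_q and diagonal_mixture_kernel, and the argument weights (standing in for the expected mixing proportions), are hypothetical.

```python
import math


def component_q(x1, x2, mean, lik_var, med_var):
    # Closed form of the integral over th of
    #   N(x1 | th, lik_var) N(x2 | th, lik_var) N(th | mean, med_var),
    # i.e. a bivariate Gaussian density with variances lik_var + med_var
    # and covariance med_var (assumed toy component model).
    d1, d2 = x1 - mean, x2 - mean
    det = lik_var * (lik_var + 2.0 * med_var)
    quad = ((lik_var + med_var) * (d1 * d1 + d2 * d2) - 2.0 * med_var * d1 * d2) / det
    return math.exp(-0.5 * quad) / (2.0 * math.pi * math.sqrt(det))


def diagonal_mixture_kernel(x1, x2, weights, means, lik_var, med_var):
    # Equation (6) with diagonal W: only terms with s1 = s2 contribute.
    def q(a, b):
        return sum(w * component_q(a, b, m, lik_var, med_var)
                   for w, m in zip(weights, means))

    # Equation (3): normalize Q so that K(x, x) = 1.
    return q(x1, x2) / math.sqrt(q(x1, x1) * q(x2, x2))


# Two well separated toy clusters with a fairly sharp mediator.
weights, means = [0.5, 0.5], [-3.0, 3.0]
k_within = diagonal_mixture_kernel(-3.2, -2.8, weights, means, 0.5, 0.05)
k_across = diagonal_mixture_kernel(-3.0, 3.0, weights, means, 0.5, 0.05)
```

In this setting k_within is close to 1 and k_across is close to 0, in line with the cluster hypothesis. With a single component and lik_var = ρ/2, med_var = α, the sketch reduces to the Gaussian example of section 2: writing the RBF kernel as exp(-(β/2)(x1 - x2)^2), it gives β = 4/(ρ(4 + ρ/α)).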
\n\nFigure 1: Kernel contours on 2-cluster dataset (λ = 5, 100, 30)\n\nFigure 1 shows contour plots^10 of the diagonal mixture kernel for VB-MoFA (see section 3), learned on a dataset of 500 cases sampled from two Gaussians with equal covariance (see subsection 3.1). We plot K(a, x) for fixed a (marked by a cross) against all x; the height between contour lines is 0.1. The left and middle plots have the lower cluster's centre as a, with λ = 5 and λ = 100 respectively; the right plot's a lies between the cluster centres, with λ = 30. The effect of MTS can be seen by comparing left and middle: note the different sharpness of the slopes towards the other cluster and the different sizes and shapes of the \"high correlation\" regions. As seen on the right, points between clusters have highest correlation with other such inter-cluster points, a feature that may be very useful for successful discrimination.\n\n3 Experiments with Mixtures of Factor Analyzers\n\nIn this section, we describe an implementation of a MI kernel, using variational Bayesian mixtures of factor analyzers (VB-MoFA) [2] as density models. These are able to combine local dimensionality reduction (using noisy linear transformations u → x from low-dimensional latent spaces) with good global data fit using mixtures. VB-MoFA is a variational approximation to Bayesian analysis on these models, able to deliver the posterior approximations we require for an MI kernel.\n\nWe employ the diagonal mixture kernel (see subsection 2.2). Instead of implementing MTS analytically, we compute the VB approximation to the true posterior (i.e. λ = n), then simply apply the scaling to this distribution. P_med(θ, π) factorizes as required in subsection 2.2. 
The integrals ∫ P(x^(1)|θ_s) P(x^(2)|θ_s) P_med(θ_s) dθ_s in (6) are not analytically tractable. Our first idea was to approximate them by applying the VB technique once more, ending up with what we called one-step variational approximations. Unfortunately, the MI kernel approximation based on these terms can no longer be shown to be positive definite!^11 Thus, at the moment we use a less elegant and, we feel, less accurate approximation (details can be found in [5]) based on first-order Taylor expansions.\n\n^10 Produced using the first-order approximation (see section 3) to the MI kernel. Plots using the one-step variational approximation (see section 3) have a somewhat richer structure.\n\nIn the remainder of this section we compare the VB-MoFA kernel with the RBF kernel (4) on two datasets, using a Laplace GP classifier (see [10]). In each case we sample a training pool, a kernel dataset D_u and a test set (mutually exclusive). The VB-MoFA diagonal mixture kernel is learned on D_u. For a given training set size m, a run consists of sampling a training set D_l and a holdout set D_h (both of size m) from the training pool, tuning kernel parameters by validation on D_h, then testing on the test set. We use the same D_l, D_u for both kernels. For each training set size, we do L = 30 runs. Results are presented by plotting means and 95% t-test confidence intervals of test errors over runs.\n\n3.1 Two Gaussian clusters\n\nThe dataset is sampled from two 2-d Gaussians with the same non-spherical covariance (see figure 1), one for each class (the Bayes error is 2.64%). We use n = 500 points for D_u, a training pool of 100 and a test set of 500 points. 
The learning curves in figure 2 show that on this simple toy problem, on which the fitted VB-MoFA model represents the cluster structure in P(x) almost perfectly, the VB-MoFA MI kernel outperforms the RBF kernel for sample sizes n ≤ 40.\n\n[Figure 2 plots not reproduced; horizontal axis: training set size n.]\n\nFigure 2: Learning curves on 2-cluster dataset. Left: RBF kernel; right: MI kernel\n\n3.2 Handwritten Digits (MNIST): Twos against threes\n\nWe report results of preliminary experiments using the subset of twos and threes of the MNIST Handwritten Digits database.^12 Here, n = |D_u| = 2000, the training pool contains 8089, the test set 2000 cases. We employ a VB-MoFA model with 20 components, fitted to D_u. We use a very simple baseline (BL) algorithm (see [6], section 2.3) based on the component densities from the VB-MoFA model,^13 which allows us to assess the \"purity\" of the component clusters w.r.t. the labels;^14 this algorithm is the only one not based on a kernel. Furthermore, we show results for the one-step variational approximation to the MI kernel^15 (MIOLD). The learning curves are shown in figure 3.\n\n^11 Thanks to an anonymous reviewer for pointing out this flaw.\n\n^12 The 28 x 28 images were downsampled to size 8 x 8.\n\n^13 The estimates P(x|s) are obtained by integrating out the parameters θ_s using the variational posterior approximation. The integral is not analytic, and we use a one-step variational approximation to it. 
\n\nFigure 3: Learning curves on MNIST twos/threes. Upper left: RBF kernel; upper middle: Baseline method; upper right: VB-MoFA MI kernel (first-order approx.); lower left: VB-MoFA MI \"kernel\" (one-step var. approx.)\n\nThe results are disappointing. The fact that the first-order approximation to the MI kernel performs worse than the one-step variational approximation (although the latter may fail to be positive definite) indicates that the former is a poorer approximation. The latter gives results close to the baseline method, while the smoothing RBF kernel makes much better use of a growing number of labeled examples.^16 This indicates that the conditional prior, as represented by the VB-MoFA MI kernel, behaves nonsmoothly and overrides label information in regions where it should not. We suspect this problem to be related to the high dimensionality of the input space, in which case probability densities tend to have a large dynamic range, and mixture component responsibility estimates tend to behave very nonsmoothly. Thus, it seems to be necessary to extend the basic MI kernel framework by new scaling mechanisms in order to produce a smoother encoding of the prior assumptions.\n\n^14 The baseline algorithm is based on the assumption that, given the component index s, the input point x and the label t are independent. Only the conditional probabilities P(t|s) are learned, while P(x|s) and P(s) are obtained from the VB-MoFA model fitted to unlabeled data only. Thus, success/failure of this method should be closely related to the degree of purity of the component clusters w.r.t. the labels. 
\n\n^15 This is somewhat inconsistent, since we use a kernel function which might not be positive definite in a context (GP classification) which requires a covariance function.\n\n^16 Note also that RBF kernel matrices can be evaluated significantly faster than those using the VB-MoFA MI kernel.\n\n4 Related work. Discussion\n\nThe present work is probably most closely related to the Fisher kernel (see subsection 2.1). The arguments concerning mixture models (see subsection 2.2) apply there as well. Haussler [3] contains a wealth of material about kernel design for discrete objects x. Watkins [9] mentions that expressions like Q in (1) are valid kernels for discrete x and countable parameter spaces. Very recently we came across [11], which essentially describes a special case of the diagonal mixture kernel (see subsection 2.2) for Gaussian components with diagonal covariances.^17 The author calls Q a stochastic equivalence predicate. He is interested in distance learning, does not apply his method to kernel machines and does not give a Bayesian interpretation.\n\nWe have presented a general framework for kernel learning from unlabeled data and described an approximate implementation using VB-MoFA models. A straightforward application of this technique to high-dimensional real-world data did not prove successful, and in future work we will explore new ideas for extending the basic MI kernel framework in order to be able to deal with high-dimensional input spaces.\n\nAcknowledgments\n\nWe thank Chris Williams for many inspiring discussions, furthermore Ralf Herbrich, Amos Storkey, Hugo Zaragoza and Neil Lawrence. Matt Beal helped us a lot with VB-MoFA. The author gratefully acknowledges support through a research studentship from Microsoft Research Ltd. 
\n\nReferences\n\n[1] Avrim Blum and Tom Mitchell. Combining labeled and unlabeled data with Co-Training. In Proceedings of COLT, 1998.\n\n[2] Z. Ghahramani and M. Beal. Variational inference for Bayesian mixtures of factor analysers. In Advances in NIPS 12. MIT Press, 1999.\n\n[3] David Haussler. Convolution kernels on discrete structures. Technical Report UCSC-CRL-99-10, University of California, Santa Cruz, July 1999.\n\n[4] Tommi S. Jaakkola and David Haussler. Exploiting generative models in discriminative classifiers. In Advances in Neural Information Processing Systems 11, 1998.\n\n[5] Matthias Seeger. Covariance kernels from Bayesian generative models. Technical report, 2000. Available at http://www.dai.ed.ac.uk/~seeger/papers.html.\n\n[6] Matthias Seeger. Learning with labeled and unlabeled data. Technical report, 2000. Available at http://www.dai.ed.ac.uk/~seeger/papers.html.\n\n[7] Martin Szummer and Tommi Jaakkola. Partially labeled classification with Markov random walks. In Advances in NIPS 14. MIT Press, 2001.\n\n[8] Koji Tsuda, Motoaki Kawanabe, Gunnar Rätsch, Soeren Sonnenburg, and Klaus-Robert Müller. A new discriminative kernel from probabilistic models. In Advances in NIPS 14. MIT Press, 2001.\n\n[9] Chris Watkins. Dynamic alignment kernels. Technical Report CSD-TR-98-11, Royal Holloway, University of London, 1999.\n\n[10] Christopher K. I. Williams and David Barber. Bayesian classification with Gaussian processes. IEEE Trans. PAMI, 20(12):1342-1351, 1998.\n\n[11] Peter Yianilos. Metric learning via normal mixtures. Technical report, NEC Research, Princeton, 1995.\n\n^17 The a parameter in this work is related to MTS in this case. 
\n\n\f", "award": [], "sourceid": 2133, "authors": [{"given_name": "Matthias", "family_name": "Seeger", "institution": null}]}