{"title": "Learning with Multiple Labels", "book": "Advances in Neural Information Processing Systems", "page_first": 921, "page_last": 928, "abstract": null, "full_text": "Learning with  Multiple  Labels \n\nRong Jin* \n\n*School of Computer Science \nCarnegie Mellon University \nPittsburgh, PA 15213, USA \n\nrong@es.emu.edu \n\nZoubin Ghahramanit* \n\ntGatsby Computational Neuroscience Unit \n\nUniversity College London \nLondon WCIN 3AR, UK \nzoubin@gatsby.ucl.ae.uk \n\nAbstract \n\nIn this paper,  we  study a  special kind of learning problem in which \neach  training  instance  is  given  a  set  of  (or  distribution  over) \ncandidate  class  labels  and  only  one  of the  candidate  labels  is  the \ncorrect  one.  Such  a  problem  can  occur,  e.g.,  in  an  information \nretrieval  setting  where  a  set  of words  is  associated  with  an  image, \nor  if  classes  labels  are  organized  hierarchically.  We  propose  a \nnovel  discriminative  approach  for  handling  the  ambiguity  of class \nlabels  in the  training  examples.  The  experiments with the  proposed \napproach over five  different UCI datasets  show that our approach is \nable  to  find  the  correct label  among the  set of candidate  labels  and \nactually  achieve  performance  close  to  the  case  when  each  training \ninstance  is  given  a  single  correct  label.  In  contrast,  naIve  methods \ndegrade rapidly as  more ambiguity is introduced into the labels. \n\n1  Introduction \n\nSupervised and unsupervised learning problems have been extensively studied in the \nmachine  learning  literature.  In  supervised  classification  each  training  instance  is \nassociated  with  a  single  class  label,  while  in  unsupervised  classification  (i.e. \nclustering)  the  class  labels  are  not  known.  There  has  recently  been  a  great  deal  of \ninterest in partially- or semi-supervised learning problems, where the training data is \na  mixture  of both  labeled  and  unlabelled  cases.  Here  we  study  a new  type  of semi(cid:173)\nsupervised learning problem. \n\nWe  generalize  the  notion  of supervision  by  thinking  of learning  problems  where \nmultiple  candidate  class  labels  are  associated  with  each  training  instance,  and  it  is \nassumed  that  only  one  of  the  candidates  is  the  correct  label.  For  a  supervised \nclassification  problem,  the  set  of candidate  class  labels  for  every  training  instance \ncontains  only  one  label,  while  for  an  unsupervised  learning  problem,  the  set  of \ncandidate  class  labels  for  each  training  instance  counts  in  all  the  possible  class \nlabels.  For  a  learning  problem  with  the  mixture  of labeled  and  unlabelled  training \ndata,  the  number  of candidate  class  labels  for  every  training  instance  can  be  either \none or the total number of different classes. \n\nHere we study the general setup,  i.e.  a learning problem when each training instance \nis  assigned  to  a  subset  of all  the  class  labels  (later,  we  further  generalize  this  to \n\n\finclude  arbitrary  distributions  over the  class  labels).  For example,  there  may be  10 \ndifferent  classes  and  each  training  instance  is  given two  candidate  class  labels  and \none  of the  two  given  labels  is  correct.  
In practice, many real problems can be formalized as a 'multiple-label' problem. For example, several different class labels for a single training example can arise from disagreement between several assessors.¹ Consider the scenario where two assessors are hired to label the training data and sometimes give different class labels to the same training example. In this case, we will have two class labels for a single training instance and don't know which, if any, is actually correct. Another scenario that can cause multiple class labels to be assigned to a single training example is when there is a hierarchical structure over the class labels and some of the training data are given the labels of internal nodes in the hierarchy (i.e. superclasses) instead of the labels of the leaf nodes (subclasses). Such hierarchies occur, for example, in bioinformatics, where proteins are regularly classified into superfamilies and families. For such hierarchical labels, we can treat the label of an internal node as the set of labels on its leaf nodes.

¹ Observer disagreement has been modeled using the EM algorithm [1]. Our multiple-label framework differs in that we don't know which observer assigned which label to each case. This would be an interesting direction in which to extend our framework.

2 Related Work

First of all, we need to distinguish the 'multiple-label' problem from the problem where the classes are not mutually exclusive and therefore each training example is allowed several class labels [4]. There, even though each training example can have multiple class labels, all the assigned class labels are actually correct, while in 'multiple-label' problems only one of the assigned labels is the target label for the training instance.

The essential difficulty of 'multiple-label' problems comes from the ambiguity in the class labels of the training data: among the several labels assigned to each training instance only one is presumed to be correct, and unfortunately we are not told which one. A similar difficulty appears in the problem of classification from labeled and unlabeled training data. The difference between the 'multiple-label' problem and the labeled/unlabeled classification problem is that in the former only a subset of the class labels can be candidates for the target label, while in the latter any class label can be a candidate. As will be shown later, this constraint makes it possible for us to build a purely discriminative approach, whereas for learning problems using unlabeled data people usually take a generative approach and model properties of the input distribution.

In contrast to the 'multiple-label' problem, there is a set of problems named 'multiple-instance' problems [3], where instances are organized into 'bags' of several instances and a class label is attached to every bag.
In the 'multiple-instance' problem, at least one of the instances within each bag corresponds to the label of the bag, and all other instances within the bag are just noise. The difference between 'multiple-label' problems and 'multiple-instance' problems is that for 'multiple-label' problems the ambiguity lies on the side of the class labels, while for 'multiple-instance' problems the ambiguity comes from the instances within the bag.

The work most closely related to this paper is [6], where a similar problem is studied using the logistic regression method. Our framework is completely general for any discriminative model and incorporates a non-uniform 'prior' on the labels.

3 Formal Description of the 'Multiple-label' Problem

As described in the introduction, in a 'multiple-label' problem each training instance is associated with a set of candidate class labels, only one of which is the target label for that instance. Let x_i be the input for the i-th training example, and S_i be the set of candidate class labels for the i-th training example. Our goal is to find the model parameters θ ∈ Θ in some class of models M, i.e. a parameterized classifier which maps inputs to labels, so that the predicted class label y for the i-th training example has a high probability of being a member of the set S_i. More formally, using the maximum likelihood criterion and the assumption of i.i.d. assignments, this goal can be stated as

    θ* = argmax_θ Σ_i log p(y ∈ S_i | x_i, θ) = argmax_θ Σ_i log Σ_{y∈S_i} p(y | x_i, θ)    (1)

4 Description of the Discriminative Model for the 'Multiple-label' Problem

Before discussing the discriminative model for the 'multiple-label' problem, let us look at the standard discriminative model for supervised classification. Let p̂(y | x_i) stand for some given conditional distribution of class labels for the training instance x_i, and let p(y | x_i, θ) be the model-based conditional distribution for the training instance x_i to have the class label y. A common and sensible criterion for finding the model parameters θ* is to minimize the KL divergence between the given conditional distributions and the model-based distributions, i.e.

    θ* = argmin_θ { Σ_i Σ_y p̂(y | x_i) log [ p̂(y | x_i) / p(y | x_i, θ) ] }    (2)

For supervised learning problems, the class label for every training instance is known. Therefore, the given conditional distribution of the class label for every training instance is a delta function, p̂(y | x_i) = δ(y, y_i), where y_i is the given class label for the i-th instance. With this, it can easily be shown that Eqn. (2) simplifies to the maximum likelihood criterion. For the 'multiple-label' problem, each training instance x_i is assigned a set of candidate class labels S_i, and therefore Eqn. (2) can be rewritten as

    θ* = argmin_θ { Σ_i Σ_{y∈S_i} p̂(y | x_i) log [ p̂(y | x_i) / p(y | x_i, θ) ] }    (3)

with the constraints

    ∀i: Σ_{y∈S_i} p̂(y | x_i) = 1.    (4)
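To make Eqns (1), (3) and (4) concrete, here is a minimal numpy sketch of both objectives. This is our own illustration under assumptions the paper does not specify: probs is an (n, k) array of model conditionals p(y | x_i, θ), candidate_sets is a list of label-index lists S_i, and q holds the given distributions p̂(y | x_i), assumed strictly positive on S_i and zero elsewhere.

    import numpy as np

    def log_likelihood(probs, candidate_sets):
        # Eqn (1): sum_i log sum_{y in S_i} p(y | x_i, theta)
        return sum(np.log(probs[i, S].sum()) for i, S in enumerate(candidate_sets))

    def kl_objective(q, probs, candidate_sets):
        # Eqn (3): sum_i sum_{y in S_i} q(y|x_i) * log( q(y|x_i) / p(y|x_i, theta) )
        total = 0.0
        for i, S in enumerate(candidate_sets):
            total += np.sum(q[i, S] * np.log(q[i, S] / probs[i, S]))
        return total

Minimizing kl_objective over q under the constraints (4) is exactly what the E-step derived below does in closed form.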
In the 'multiple-label' problem the distribution of class labels p̂(y | x_i) is unknown, except for the constraint that the target class label for every training example is a member of the corresponding set of candidate class labels. A simple solution to the problem of the unknown label distribution is to assume it is uniform, i.e. p̂(y | x_i) = p̂(y' | x_i) for any y, y' ∈ S_i. Then Eqn. (3) simplifies to

    θ* = argmin_θ { Σ_i (1/|S_i|) Σ_{y∈S_i} log [ (1/|S_i|) / p(y | x_i, θ) ] } = argmax_θ { Σ_i (1/|S_i|) Σ_{y∈S_i} log p(y | x_i, θ) },    (5)

which corresponds to minimizing the KL divergence (2) to a uniform distribution over S_i. For the case of multiple assessors giving differing labels to the data, discussed in the introduction, this corresponds to concatenating the labeled data sets. Standard learning algorithms can be applied to learn the conditional model p(y | x, θ). For later reference, we call this simple idea the 'Naive Model'.

A better solution than the 'Naive Model' is to disambiguate the label association, i.e. to find which label among the given set is more appropriate than the others and use that label for training. It turns out that the EM algorithm [2] can be applied to accomplish this goal, resulting in a procedure which iterates between disambiguating and classifying. Starting with the assumption that every class label within the set is equally likely, we train a conditional model p(y | x, θ). Then, with the help of this conditional model, we estimate the label distribution p̂(y | x_i) for each data point. With these label distributions, we refit the conditional model p(y | x, θ), and so on. More formally, this idea can be expressed as follows.

First, we estimate the conditional model based on the assumed or estimated label distribution according to Eqn. (3). This step corresponds to the M-step of the EM algorithm. Then, in the E-step, new label distributions are estimated by optimizing Eqn. (3) w.r.t. p̂(y | x_i) under the constraints (4), resulting in

    p̂(y | x_i) = p(y | x_i, θ) / Σ_{y'∈S_i} p(y' | x_i, θ)  for y ∈ S_i, and 0 otherwise.    (6)

Importantly, this procedure optimizes the objective function in Eqn. (1), by the usual EM proof: the negative of the KL divergence in Eqn. (3) is a lower bound on the log likelihood (1) by Jensen's inequality, and substituting Eqn. (6) for p̂(y | x_i) into (3) we obtain equality. For easy reference, we call this model the 'EM Model'.
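The alternation just described can be sketched as follows. This is our own minimal implementation, not the authors' code: fit_weighted stands for any routine that fits the discriminative model p(y | x, θ) to soft label targets (the M-step under Eqn. (3)), and predict_proba for its conditional outputs; both interfaces are assumptions.

    import numpy as np

    def em_multiple_label(X, candidate_sets, n_classes, fit_weighted, n_iters=20):
        n = len(candidate_sets)
        q = np.zeros((n, n_classes))
        for i, S in enumerate(candidate_sets):
            q[i, S] = 1.0 / len(S)           # start uniform over S_i (the 'Naive Model')
        for _ in range(n_iters):
            model = fit_weighted(X, q)       # M-step: fit p(y | x, theta) to soft labels q
            probs = model.predict_proba(X)   # (n, n_classes) conditionals p(y | x_i, theta)
            for i, S in enumerate(candidate_sets):
                q[i, :] = 0.0                # E-step, Eqn (6): renormalize model probs over S_i
                q[i, S] = probs[i, S] / probs[i, S].sum()
        return model, q

In practice one would iterate to convergence of the objective rather than for a fixed number of rounds; the fixed n_iters here only keeps the sketch short.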
In some 'multiple-label' problems, information on which class label within the set S_i is more likely to be the correct one can be obtained. For example, if three assessors manually label the training data, in some cases two assessors will agree on the class label while the third disagrees. We should give more weight to labels agreed on by two assessors and less weight to labels chosen by only one. To accommodate prior information on the class labels, we generalize the previous framework so that the estimated label distribution p̂(y | x_i) has low relative entropy with respect to the prior on the class labels. Therefore, the objective function (1) and its EM bound can be modified to

    θ* = argmin_θ { Σ_i Σ_{y∈S_i} p̂(y | x_i) log [ p̂(y | x_i) / π_{i,y} ] − Σ_i Σ_{y∈S_i} p̂(y | x_i) log p(y | x_i, θ) }    (7)

where π_{i,y} is the prior probability for the i-th training example to have class label y. The first term in the objective function (7) encourages the estimated label distribution to be consistent with the prior distribution of class labels, and the second term encourages the prediction of the model to be consistent with the estimated label distribution. The objective (7) is an upper bound on −Σ_i log Σ_{y∈S_i} π_{i,y} p(y | x_i, θ).

When there is no prior information about which class label within the given set is preferable, we can set π_{i,y} = 1/|S_i| and Eqn. (7) becomes

    θ* = argmin_θ { Σ_i Σ_{y∈S_i} p̂(y | x_i) log [ p̂(y | x_i) / (1/|S_i|) ] − Σ_i Σ_{y∈S_i} p̂(y | x_i) log p(y | x_i, θ) }
       = argmin_θ { Σ_i Σ_{y∈S_i} p̂(y | x_i) log [ p̂(y | x_i) / p(y | x_i, θ) ] + Σ_i log |S_i| }
       = argmin_θ { Σ_i Σ_{y∈S_i} p̂(y | x_i) log [ p̂(y | x_i) / p(y | x_i, θ) ] }    (7')

Eqn. (7') is identical to Eqn. (3), which shows that when there is no prior knowledge on the class label distribution we revert to the 'EM Model'.

Again we can optimize Eqn. (7) using the EM algorithm, estimating the label distribution p̂(y | x_i) in the E-step and fitting any standard discriminative model for p(y | x, θ) in the M-step. The label distribution that optimizes (7) in the E-step is

    p̂(y | x_i) = π_{i,y} p(y | x_i, θ) / Σ_{y'∈S_i} π_{i,y'} p(y' | x_i, θ)  for y ∈ S_i, and 0 otherwise.

As we would expect, this label distribution trades off the prior π_{i,y} against the model-based prediction p(y | x_i, θ). We will call this model the 'EM+Prior Model'.

The 'EM+Prior Model' can also be interpreted from the viewpoint of a graphical model. The basic idea is illustrated in Figure 1, where the random variable t_i represents the event that the true label y_i belongs to the label set S_i. For the 'EM+Prior' model, π_{i,y} plays the role of a likelihood or noise model, where p(y ∈ S_i | x_i, θ) in Eqn. (1) is replaced as in Eqn. (8). From this point of view, generalizing to Bayesian learning and to regression is easy.

[Figure 1: Diagram for the graphical model interpretation of the 'EM+Prior' model.]

    p(t_i = 1 | x_i, θ) = Σ_{y∈S_i} p(t_i = 1 | y) p(y | x_i, θ) = Σ_{y∈S_i} π_{i,y} p(y | x_i, θ)    (8)
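Relative to the EM sketch given earlier, the 'EM+Prior Model' changes only the E-step. A hedged sketch of ours, reusing numpy from that sketch, where prior is an assumed (n, k) array holding the values π_{i,y}:

    def e_step_with_prior(probs, prior, candidate_sets):
        # q(y | x_i) proportional to pi_{i,y} * p(y | x_i, theta) for y in S_i, 0 otherwise
        q = np.zeros_like(probs)
        for i, S in enumerate(candidate_sets):
            w = prior[i, S] * probs[i, S]
            q[i, S] = w / w.sum()
        return q

With prior constant on each S_i (π_{i,y} = 1/|S_i|) this reduces to the E-step of Eqn. (6), mirroring the reduction of Eqn. (7) to Eqn. (3).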
5 Experiments

The goal of our experiments is to answer the following questions:

1. Is the 'EM Model' better than the 'Naive Model'? The difference between the 'EM Model' and the 'Naive Model' for 'multiple-label' problems is that the 'Naive Model' makes no effort to find the correct label within the given label set, while the 'EM Model' applies the EM algorithm to resolve the ambiguity in the class labels. Therefore, in this experiment we need to verify empirically whether the effort spent disambiguating class labels is effective.

2. Will prior knowledge help the model? The difference between the 'EM Model' and the 'EM+Prior Model' is that the 'EM+Prior Model' takes advantage of prior knowledge on the distribution of class labels for instances. However, since the prior knowledge on the class labels can sometimes be misleading, we need to test the robustness of the 'EM+Prior Model' to such noisy prior knowledge.

5.1 Experimental Data

Since there are no standard data sets with training instances assigned multiple class labels, we created several data sets with multiple class labels from the UCI classification datasets. To make our experiments more realistic, we tried two different methods of creating datasets with multiple class labels (a sketch of the first method appears at the end of this subsection):

• Random Distractors. For every training instance, in addition to the originally assigned label, several randomly selected labels are added to the label candidate set. We varied the number of added classes to test the reliability of our algorithm.

• Naive Bayes Distractors. In the previous method, the added class labels are randomly selected and therefore independent of the original class label. However, we usually expect that distractors in the candidate set will be correlated with the original label. To simulate this more realistic situation, we use the output of a Naive Bayes (NB) classifier as an additional member of the class label candidate set.² First, a Naive Bayes classifier using Gaussian generative models is trained on the dataset. Then, the trained NB classifier is asked to predict the class labels of the training data. When the output of the NB classifier differs from the original label, it is added as a candidate label. Otherwise, a randomly selected label is added to the candidate set. Since the NB classifier's errors are not completely random, they should have some correlation with the originally assigned labels.

² The Naive Bayes distractor should not be confused with the multiple-label 'Naive Model'.

In these experiments we chose a simple maximum entropy (ME) model [5] as the basic discriminative model, which expresses the conditional probability p(y | x, θ) in exponential form, i.e. p(y | x, θ) = exp(θ_y · x) / Z(x), where x is the input feature vector and Z(x) is the normalization constant which ensures that the conditional probabilities over all the different classes y sum to 1.

Five different UCI datasets were selected as the testbed for the experiments. Information about these datasets is listed in Table 1, including, for each dataset, the 10-fold cross validation error rate of the ME model on the clean (single-label) data and the percentage of instances on which the NB output differs from the originally assigned label.

Table 1: Information about the five UCI datasets used in the experiments.

    Dataset    Instances  Classes  Features  % NB output ≠ assigned label  ME error on clean data (10-fold CV)
    ecoli      327        5        7         15%                           12.6%
    wine       178        3        13        8%                            3.7%
    pendigit   2000       10       16        22.3%                         9%
    iris       154        3        14        13.3%                         5.7%
    glass      204        5        10        16.6%                         9.7%
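The 'Random Distractors' construction above can be sketched as follows. This is our own illustration, not the authors' code: the function name and interfaces are ours, rng is a numpy random Generator, and n_extra is assumed to be smaller than n_classes.

    def add_random_distractors(y, n_classes, n_extra, rng):
        # For each true label y_i, add n_extra distinct, randomly chosen wrong labels.
        candidate_sets = []
        for yi in y:
            wrong = [c for c in range(n_classes) if c != yi]
            extra = rng.choice(wrong, size=n_extra, replace=False)
            candidate_sets.append(sorted({int(yi)} | {int(e) for e in extra}))
        return candidate_sets

    # e.g. rng = np.random.default_rng(0); each candidate set then has n_extra + 1 labels.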
5.2 Experiment Results (I): 'Naive Model' vs. 'EM Model'

Table 2 lists the results for the 'Naive Model' and the 'EM Model' over a varying number of additional class labels created by the random distractor and the Naive Bayes distractor. Since the 'wine' and 'iris' datasets have only 3 different classes, the maximum number of additional class labels for these two datasets is 1. Therefore, there are no experimental results for the case of 2 or 3 distractor class labels for 'wine' and 'iris' (marked '-' below).

Table 2: Average 10-fold cross validation error rates for both the 'Naive Model' and the 'EM Model'.

                1 random distractor    2 random distractors   3 random distractors   1 NB distractor
    Dataset     Naive     EM           Naive     EM            Naive     EM            Naive     EM
    ecoli       17.3%     13.6%        20.7%     14.9%         25.8%     18.3%         22.4%     14.6%
    wine        10%       4.4%         -         -             -         -             15.7%     6.8%
    pendigit    14.2%     8.9%         15.4%     9.4%          17.6%     11.7%         17.2%     15.4%
    iris        18.5%     5.2%         -         -             -         -             18.5%     6.7%
    glass       24.9%     12.9%        44.9%     12%           34.6%     33.5%         27.7%     20.6%

As shown in Table 2, for the random distractor the 'EM Model' substantially outperforms the 'Naive Model' in all cases. In particular, for the 'wine' and 'iris' datasets, introducing one additional class label to every training instance leaves only one class label out of the candidate set, and yet the performance of the 'EM Model' is still close to the case when there are no additional class labels. Meanwhile, the 'Naive Model' degrades significantly in both cases, from 3.7% to 10.0% for 'wine' and from 5.7% to 18.5% for 'iris'. Therefore, we can conclude that the 'EM Model' is able to reduce the noise caused by randomly added class labels.

Secondly, we compare the performance of the two models in a more realistic setup for the 'multiple-label' problem, where the distractor identity is correlated with the true label (simulated by using the NB distractor). Table 1 gives the percentage of instances on which the trained Naive Bayes classifier disagreed with the 'true' labels, which is also the percentage of additional class labels created by the Naive Bayes distractor. The last pair of columns in Table 2 shows the performance of the two models when the additional class labels are introduced by the NB distractor. Again, the 'EM Model' is significantly better than the 'Naive Model'. For the datasets 'ecoli', 'wine' and 'iris', the average error rates of the 'EM Model' are very close to the cases when there are no distractor class labels. Therefore, we can conclude that the 'EM Model' is able to reduce not only the noise caused by random label ambiguity but also some systematic label ambiguity.
5.3 Experiment Results (II): 'EM Model' vs. 'EM+Prior Model'

In this subsection, we focus on whether information from a prior distribution on class labels can improve performance. We study two cases:

• 'Perfect Case'. Here the guidance of the prior distribution on class labels is always correct. In our experiments, for every training instance x_i we set the prior probability π_{i,y_i} of the correct label y_i to be twice as large as that of any other candidate label π_{i,y≠y_i}.

• 'Noisy Case'. Here we only allow the guidance of the prior distribution on the class labels to be correct 70% of the time. With this setup, we are able to see whether the 'EM+Prior Model' is robust to noise in the prior distribution.

Table 3: Average 10-fold cross validation error rates for the 'EM+Prior Model' over five UCI datasets.

                1 random distractor    2 random distractors   3 random distractors   1 NB distractor
    Dataset     Perfect   Noisy        Perfect   Noisy         Perfect   Noisy         Perfect   Noisy
    ecoli       13.3%     13.3%        13.6%     13.9%         12.6%     13.9%         13.9%     15.3%
    wine        3.7%      3.2%         -         -             -         -             5.0%      6.2%
    pendigit    8.7%      9.0%         9.0%      9.4%          10.0%     11.0%         13.4%     14.2%
    iris        5.2%      18.5%        -         -             -         -             5.2%      6.7%
    glass       12.4%     12.9%        12.5%     13.6%         12.4%     16.8%         16.7%     19.0%

Table 3 lists the results for the 'EM+Prior Model' under both the 'Perfect' and 'Noisy' conditions over the five collections. In the 'Perfect' case, the average error rates of the 'EM+Prior Model' are quite close to the case when there is no label ambiguity at all (see Table 1). Moreover, the performance in the 'Noisy' case is also close to that of the 'Perfect' case for most of the datasets listed in Table 3. Therefore, we can conclude that the 'EM+Prior Model' is able to take advantage of the prior distribution on class labels even when some of the 'guidance' is not correct.

6 Conclusions and Future Work

We introduced the 'multiple-label' problem and proposed a discriminative framework that is able to resolve the ambiguity between labels. Although it is discriminative, this framework is firmly grounded in the EM algorithm for maximum likelihood estimation. The framework was generalized to take advantage of prior knowledge on which class label is more likely to be the target label. Our experiments clearly indicate that the proposed discriminative model is robust to the addition of noisy class labels and to errors in the prior distribution over class labels.

The idea of this framework, allowing the target distribution p̂(y | x_i) to be inferred from the classifier itself, can be extended in many different ways, and we outline several promising directions which we hope to explore. (1) It should be possible to extend this framework to function approximation, where y ∈ ℝ and ranges or distributions are given for the target; in this case, it may be useful to parameterize p̂(y | x_i) to simplify the resulting variational optimization problem. (2) We have focused on maximum likelihood; however, Bayesian generalizations, in which the goal is to compute a posterior distribution over θ given ambiguously labeled data, would be interesting.
(3) It is possible to use these ideas as a framework for combining multiple models. Each model is trained on a small labeled data set and predicts labels on a large unlabeled data set. These predicted labels can be combined with the small set to form a larger multiply-labeled data set (since not all models will agree), and this larger data set can then be used to train a more complex model. (4) It is possible to extend this framework to handle the presence of label noise and to combine it with the multiple-instance problem [3].

References

[1] A. P. Dawid and A. M. Skene (1979) Maximum likelihood estimation of observer error-rates using the EM algorithm. Applied Statistics 28:20-28.

[2] A. Dempster, N. Laird and D. Rubin (1977) Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, 39 (Series B), 1-38.

[3] T. G. Dietterich, R. H. Lathrop and T. Lozano-Pérez (1997) Solving the multiple-instance problem with axis-parallel rectangles. Artificial Intelligence, 89(1-2), pp. 31-71.

[4] A. McCallum (1999) Multi-label text classification with a mixture model trained by EM. AAAI'99 Workshop on Text Learning.

[5] S. Della Pietra, V. Della Pietra and J. Lafferty (1997) Inducing features of random fields. IEEE Transactions on Pattern Analysis and Machine Intelligence, 19(4):380-393.

[6] Y. Grandvalet (2002) Logistic regression for partial labels. 9th Information Processing and Management of Uncertainty in Knowledge-based Systems (IPMU'02), pp. 1935-1941.
", "award": [], "sourceid": 2234, "authors": [{"given_name": "Rong", "family_name": "Jin", "institution": null}, {"given_name": "Zoubin", "family_name": "Ghahramani", "institution": null}]}