{"title": "Mutual Boosting for Contextual Inference", "book": "Advances in Neural Information Processing Systems", "page_first": 1515, "page_last": 1522, "abstract": "", "full_text": " \n\n \n \n \n \n \n \n\n \n\n Mutual Boosting for\n \n\n  \nContextual Inference \n\n \n\nMichael Fink \n\nPietro Perona \n\nCenter for Neural Computation  Electrical Engineering Department  \nHebrew University of Jerusalem  California Institute of Technology  \n\nJerusalem, Israel 91904 \n\nfink@huji.ac.il \n\nPasadena, CA 91125 \n\nperona@vision.caltech.edu \n\nAbstract \n\nMutual  Boosting  is  a  method  aimed  at  incorporating  contextual \ninformation  to  augment  object  detection.  When  multiple  detectors \nof  objects  and  parts  are  trained  in  parallel  using  AdaBoost  [1], \nobject detectors might use the remaining intermediate detectors to \nenrich  the  weak  learner  set.  This  method  generalizes  the  efficient \nfeatures  suggested  by  Viola  and  Jones \nthus  enabling \ninformation inference between parts and objects in a compositional \nhierarchy.  In  our  experiments  eye-,  nose-,  mouth-  and  face \ndetectors are trained using the Mutual Boosting framework. Results \nshow \nthe  method  outperforms  applications  overlooking \ncontextual  information.  We  suggest  that  achieving  contextual \nintegration is a step toward human-like detection capabilities. \n\nthat \n\n[2] \n\n1  Introduction \n\nClassification  of  multiple  objects  in  complex  scenes  is  one  of  the  next  challenges \nfacing the machine learning and computer vision communities. Although, real-time \ndetection  of  single  object  classes  has  been  recently  demonstrated  [2],  na\u00efve \nduplication of these detectors to the multiclass case would be unfeasible. Our goal is \nto propose an efficient method for detection of multiple objects in natural scenes.  \nHand-in-hand  with  the  challenges  entailing  multiclass  detection,  some  distinct \nadvantages  emerge  as  well.  Knowledge  on  position  of  several  objects  might  shed \nlight  on  the  entire  scene  (Figure  1).  Detection  systems  that  do  not  exploit  the \ninformation provided by objects on the neighboring scene will be suboptimal.  \n\nA \n\nB \n\nFigure 1: Contextual spatial relationships assist detection A. in absence of facial \n\ncomponents (whitened blocking box) faces can be detected by context (alignment of \nneighboring faces). B. keyboards can be detected when they appear under monitors.  \n\n\f \n\nfeatures \n\nfollows  a  compositional  hierarchy.  Grounded \n\nMany  human  and  computer  vision  models  postulate  explicitly  or  implicitly  that \nvision \n(that  are \ninnate/hardwired and are available prior to learning) are used to detect salient parts, \nthese parts in turn enable detection of complex objects [3, 4], and finally objects are \nused to recognize the semantics of the entire scene. Yet, a more accurate assessment \nof  human  performance  reveals  that  the  visual  system  often  violates  this  strictly \nhierarchical  structure  in  two  ways.  First,  part  and  whole  detection  are  often \nevidently interacting [5, 6]. Second, several layers of the hierarchy are occasionally \nbypassed to enable swift direct detection. This phenomenon is demonstrated by gist \nrecognition  experiments  where  the  semantic  classification  of  an  entire  scene  is \nperformed using only minimal low level feature information [7]. 
The insights emerging from observing human perception have been adopted by the object detection community, and many object detection algorithms bypass stages of a strict compositional hierarchy. The Viola & Jones (VJ) detector [2] performs robust online face detection by directly agglomerating very low-level features (rectangle contrasts), without explicitly referring to facial parts. Gist detection from low-level spatial frequencies was demonstrated by Oliva and Torralba [8]. Recurrent optimization of part and object constellations is also common in modern detection schemes [9]. Although Latent Semantic Analysis (which exploits object co-occurrence information) has been adapted to images [10], existing object detection methods are still far from unifying all the sources of visual contextual information integrated by the human perceptual system. Tackling the context integration problem and achieving robust multiclass object detection is a vital step for applications like image-content database indexing and autonomous robot navigation.

We propose a method termed Mutual Boosting that incorporates contextual information for object detection. Section 2 poses the multiclass detection problem from labeled images. In Section 3 we characterize the feature sets implemented by Mutual Boosting and define an object's contextual neighborhood. Section 4 presents the Mutual Boosting framework, aimed at integrating contextual information and inspired by the recurrent inferences dominating the human perceptual system. An application of the Mutual Boosting framework to facial component detection is presented in Section 5. We conclude with a discussion of the scope and limitations of the proposed framework.

2 Problem setting and basic notation

Suppose we wish to detect multiple objects in natural scenes, and that these scenes are characterized by certain mutual positions between the composing objects. Could we make use of these objects' contextual relations to improve detection? Perceptual context may include multiple sources of information: information originating from the presence of existing parts, information derived from other objects in the perceptual vicinity, and finally general visual knowledge of the scene. In order to incorporate these various sources of visual contextual information, Mutual Boosting treats parts, objects and scenes identically. We therefore use the term object as a general term referring to any entity in the compositional hierarchy.

Let M denote the cardinality of the object set we wish to detect in natural scenes. Our goal is to optimize detection by exploiting contextual information while maintaining detection time comparable to that of M individual detectors trained without such information. We define the goal of the multiclass detection algorithm as generating M intensity maps Hm (m=1,...,M) indicating the likelihood of object m appearing at different positions in a target image.

We will use the following notation (Figure 2):

• H0+/H0-: raw image input with/without the trained objects (A1 & A2)
• Cm[i]: labeled position of instance i of object m in image H0+
• Hm: intensity map output indicating the likelihood of object m appearing at different positions in the image H0 (B)

Figure 2: A1 & A2. Input: positions of positive and negative examples of eyes in natural images. B. Output: eye intensity ("eyeness") detection map of image H0+.
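For concreteness, the notation above can be laid out as data structures; the following is a minimal sketch of the layout assumed by the later code examples (Python/NumPy, with names of our own choosing rather than the paper's):

import numpy as np

# H0_pos / H0_neg: raw grayscale (float) images with / without the objects
H0_pos = np.zeros((240, 320))
H0_neg = np.zeros((240, 320))

# C[m][i]: labeled (y, x) position of instance i of object m in H0_pos;
# here, a single object class with two labeled instances
C = [[(120, 100), (118, 210)]]

# H[m]: intensity map the size of the image, scoring object m per position
M = len(C)
H = [np.zeros_like(H0_pos) for _ in range(M)]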
3 Feature set and contextual window generalizations

The VJ method for real-time object detection included three basic innovations. First, it introduced the rectangle contrast features, which are evaluated efficiently using an integral image. Second, VJ introduced AdaBoost [1] to object detection, using rectangle features as weak learners. Finally, a cascade method was developed to chain a sequence of increasingly complex AdaBoost learners, enabling rapid filtering of non-relevant sections of the target image. The resulting cascade of AdaBoost face detectors achieves a detection speed of 15 frames per second, with a 90% detection rate and a false alarm rate of 2x10^-6; this detection speed is currently unmatched. In order to maintain efficient detection, and in order to benchmark the performance of Mutual Boosting, we adopt the rectangle contrast feature framework suggested by VJ.

It should be noted that the grayscale rectangle features extend naturally to any image channel that preserves the semantics of summation. A diversified feature set (including color features, texture features, etc.) might saturate later than a homogeneous channel feature set; by using features that capture the object regularities well, one can improve performance or reduce detection time.

VJ extract training windows that capture the exact area of the training faces. We term this the local window approach. A second approach, in line with our attempt to incorporate information from neighboring parts or objects, is to use training windows that capture wide regions around the object (Figure 3)¹.

Figure 3: A local window (VJ) and a contextual window that captures relative position information from objects or parts around and within the detected object.

¹ Contextual neighborhoods emerge by downscaling larger regions of the original image to a PxP resolution window.

The contextual neighborhood approach contributes to detection when the applied channels require a wide contextual range, as will be demonstrated in the Mutual Boosting scheme presented in the following section².

² The most effective size of the contextual neighborhood might vary, from the immediate surroundings to the entire image, and should therefore be learned empirically.
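The integral image and the rectangle contrast features are standard constructions; the sketch below shows both (NumPy; the two-rectangle split is one of several VJ feature types, and the function names are ours):

import numpy as np

def integral_image(img):
    # ii is zero-padded so that ii[y, x] = sum of img[:y, :x]
    ii = np.zeros((img.shape[0] + 1, img.shape[1] + 1))
    ii[1:, 1:] = img.cumsum(axis=0).cumsum(axis=1)
    return ii

def rect_sum(ii, y, x, h, w):
    # sum of any h-by-w rectangle in 4 array accesses
    return ii[y + h, x + w] - ii[y, x + w] - ii[y + h, x] + ii[y, x]

def two_rect_contrast(ii, y, x, h, w):
    # contrast of the left half of a window against its right half;
    # thresholding such values yields the rectangle weak learners
    half = w // 2
    return rect_sum(ii, y, x, h, half) - rect_sum(ii, y, x + half, h, w - half)

Because only summation is involved, the same evaluation applies unchanged to the detection-map channels introduced in the next section.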
4 Mutual Boosting

The AdaBoost algorithm maintains a clear distinction between the boosting level and the weak-learner training level. The basic insight guiding the Mutual Boosting method reexamines this distinction, stipulating that when multiple objects and parts are trained simultaneously using AdaBoost, any object detector may combine the previously evolving intermediate detectors to generate new weak learners.

To elaborate this insight, first note that while training a strong learner using 100 iterations of AdaBoost (abbreviated AB100), one could calculate an intermediate strong learner at each step along the way (AB2-AB99). To apply this observation to our multiclass detection problem, we simultaneously train M object detectors. At each boosting iteration t, the M detectors (ABm_t-1) emerging from the previous stage t-1 are used to filter the positive and negative³ training images, producing intermediate m-detection maps Hm_t-1 (the likelihood of object m across the images⁴). Next, the Mutual Boosting stage takes place: all the existing Hm_t-1 maps are used as additional channels from which new contrast features are selected. This process gradually enriches the initial grounded features with composite contextual features. The composite features are searched over a PxP-wide contextual neighborhood region rather than the PxP local window (Figure 3).

Following a dynamic programming approach in training and detection, the detection maps Hm (m=1,...,M) are constantly maintained and updated, so that recalculating Hm_t only requires evaluating the last chosen weak learner WLmn* on channel Hn*_t-1 of the training image (Figure 4). This evaluation produces a binary detection layer that is weighted by the AdaBoost weak-learner weighting scheme and added to the previous stage's map⁵.

Although Mutual Boosting examines a larger feature set during training, an iteration of Mutual Boosting detection of M objects is as time-consuming as performing an AdaBoost detection iteration for each of M individual objects. The advantage of Mutual Boosting emerges from introducing highly informative feature sets that can enhance detection or require fewer boosting iterations. While most object detection applications extract a local window containing the object information and discard the remaining image (including the object positional information), Mutual Boosting processes the entire image during training and detection, making constant use of the information characterizing the objects' relative positions in the training images.

As previously stated, the detected objects may occupy various levels of a compositional hierarchy (e.g., complex objects or parts of other objects). Nevertheless, Mutual Boosting treats objects, parts and scenes alike, enabling any compositional structure of the data to emerge naturally. We term any contextual reference that is not directly grounded in the basic features a cross referencing of objects⁶.

³ Images without target objects (see the experimental section below).
⁴ Unlike the weak learners, the intermediate strong learners do not apply a threshold.
⁵ To reduce the number of detection map integral image recalculations, these maps might be updated every k (e.g., 50) iterations rather than at each iteration.
⁶ Scenes can be cross referenced as well if scene labels are available (office/lab, etc.).

Pseudocode (Figure 4):

Input:
    H0+/H0-: positive / negative raw images
    Cm[i]: position of instance i of object m=1,...,M in image H0+
Initialization:
    initialize the boosting weights of all instances i of each object m to 1
    initialize the detection maps Hm+_0 / Hm-_0 to 0
For t=1,...,T:
    For m=1,...,M and n=0,...,M:
        (A) cut out & downscale local (n=0) or contextual (n>0) windows (WINm)
            of instances i of object m (at Cm[i]) from all existing images Hn_t-1
    For m=1,...,M:
        normalize the boosting weights of object m's instances [1]
        (B1&2) select the map Hn*_t-1 and weak learner WLmn* that minimize error on WINm
        decrease the boosting weights of instances that WLmn* labeled correctly [1]
        (C) DetectionLayermn* ← WLmn*(Hn*_t-1)
        calculate αm_t, the weak learner contribution factor, from the empirical error [1]
        (D) update the m-detection map: Hm_t ← Hm_t-1 + αm_t · DetectionLayermn*
Return strong learners ABm_T comprising WLmn*_1,...,T and αm_1,...,T (m=1,...,M)

Figure 4: Mutual Boosting diagram & pseudocode. Each raw image H0 is analyzed by M object detection maps Hm (m=1,...,M), updated by iterating through four steps: (A) cut out & downscale from the existing maps Hn_t-1 (n=0,...,M) a local (n=0) or contextual (n>0) PxP window containing a neighborhood of object m; (B1&2) select the best performing map Hn* and weak learner WLmn* that optimize object m detection; (C) run WLmn* on the Hn* map to generate a new binary m-detection layer; (D) add the m-detection layer to the existing detection map Hm. (Steps marked [1] are standard AdaBoost stages and are not elaborated.)
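To make steps (A)-(D) concrete, here is a self-contained toy rendition of the training loop. To stay short it assumes one labeled instance per positive image, replaces the rectangle-feature search with a single decision stump per channel (on the window mean), and updates the maps by pointwise thresholding (a box filter over the channel would be the faithful analogue); all names and simplifications are ours, not the paper's:

import numpy as np

def window_mean(img, center, half):
    # mean of a (2*half+1)^2 window, clamped at the image borders
    y, x = center
    return img[max(y - half, 0):y + half + 1,
               max(x - half, 0):x + half + 1].mean()

def best_stump(vals, labels, w):
    # weighted-error-minimizing threshold and polarity on 1-D values
    best = (np.inf, 0.0, 1)
    for thr in np.unique(vals):
        for pol in (1, -1):
            pred = (pol * (vals - thr) > 0).astype(int)
            err = w[pred != labels].sum()
            if err < best[0]:
                best = (err, thr, pol)
    return best

def mutual_boost(imgs_pos, imgs_neg, centers, T, half=12):
    # centers[m][i]: (y, x) of the instance of object m in positive image i
    M, n = len(centers), len(imgs_pos)
    imgs = imgs_pos + imgs_neg            # negatives paired with positives
    labels = np.array([1] * n + [0] * n)
    maps = [[np.zeros_like(im) for _ in range(M)] for im in imgs]
    w = [np.full(2 * n, 1.0 / (2 * n)) for _ in range(M)]
    strong = [[] for _ in range(M)]
    for t in range(T):                    # boosting iteration t
        for m in range(M):
            w[m] /= w[m].sum()            # normalize boosting weights
            best = None
            for ch in range(M + 1):       # channel 0: raw image; 1..M: maps Hn
                vals = np.array([window_mean(
                    imgs[j] if ch == 0 else maps[j][ch - 1],
                    centers[m][j % n], half) for j in range(2 * n)])
                err, thr, pol = best_stump(vals, labels, w[m])
                if best is None or err < best[0]:
                    best = (err, thr, pol, ch, vals)
            err, thr, pol, ch, vals = best
            alpha = 0.5 * np.log((1 - err) / max(err, 1e-9))
            pred = (pol * (vals - thr) > 0).astype(int)
            # decrease the weights of correctly labeled instances
            w[m] *= np.exp(-alpha * np.where(pred == labels, 1, -1))
            strong[m].append((ch, thr, pol, alpha))
            # step (D): add the weighted binary layer to every Hm map, so
            # only the newly chosen weak learner is ever re-evaluated
            for j in range(2 * n):
                src = imgs[j] if ch == 0 else maps[j][ch - 1]
                maps[j][m] += alpha * (pol * (src - thr) > 0)
    return strong

The returned (channel, threshold, polarity, weight) lists are the toy analogue of ABm_T; note how the channels a detector draws on, and hence any cross referencing, are selected rather than prescribed.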
To maintain local and global natural scene statistics, negative training examples are generated by pairing each image with an image of equal size that does not contain the target objects, and by centering the local and contextual windows of the positive and negative examples on the object positions in the positive images (see Figure 2 and the sketch below).
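A minimal sketch of this pairing, with the coordinates taken from the positive image and reused in its target-free partner (names are illustrative):

def paired_windows(img_pos, img_neg, center, half):
    # img_neg is an equal-size image containing no target objects;
    # both windows are cut at the object position labeled in img_pos
    y, x = center
    box = (slice(y - half, y + half + 1), slice(x - half, x + half + 1))
    return img_pos[box], img_neg[box]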
By using parallel boosting and efficient rectangle contrast features, Mutual Boosting is capable of incorporating many information inferences (references in Figure 5):

• Features can be used to directly detect parts and objects (A & B)
• Objects can be used to detect other (or identical) objects in the image (C)
• Parts can be used to detect other (or identical) nearby parts (D & E)
• Parts can be used to detect objects (F)
• Objects can be used to detect parts

Figure 5: A-F. Emerging features of eyes, mouths and faces (presented on windows of raw images for legibility): A. eye feature from raw image; B. face feature from raw image; C. face feature from face detection image; D. eye feature from eye detection image; E. mouth feature from eye detection image; F. face feature from mouth detection image. The windows' scale is defined by the detected object size and by the map mode (local or contextual). In C, faces are detected using face detection maps HFace, exploiting the fact that faces tend to be horizontally aligned.

5 Experiments

In order to test the contribution of the Mutual Boosting process, we focused on detection of objects in what we term a face-scene (right eye, left eye, nose, mouth and face). We chose to perform contextual detection in the face-scene for two main reasons. First, as detailed in Figure 5, face-scenes exhibit a range of potential part and object cross references. Second, faces have been the focus of object detection research for many years, enabling a systematic comparison of results. Experiment 1 compared the performance of Mutual Boosting to that of naïve, independently trained object detectors using local windows.

Figure 6: A. Two examples from the CMU/MIT face database. B. Mutual Boosting and AdaBoost ROC curves (detection rate Pd vs. false alarm rate Pfa) on the CMU/MIT face database.

Face-scene images were downloaded from the web and manually labeled⁷. Training relied on 450 positive and negative examples (~4% of the number of images used by VJ). 400 iterations of local window AdaBoost and contextual window Mutual Boosting were performed on the same image set. Contextual windows encompassed a region five times larger in width and height than the local windows⁸ (see Figure 3).

⁷ Following CMU database conventions (R-eye, L-eye, nose & mouth positions), we derive both the local window position and the relative positions of objects in the image.
⁸ Local windows were created by downscaling objects to 25x25 grids.

Test image detection maps emerge from iteratively summing T m-detection layers (Mutual Boosting stages C & D). ROC performance on the CMU/MIT face database (see sample images in Figure 6A) was assessed using a threshold on position Cm[i] that best discriminated the final positive and negative detection maps Hm+/-_T. Figure 6B demonstrates the superiority of Mutual Boosting over grounded-feature AdaBoost.

Our second experiment assessed the performance of Mutual Boosting as the variance of the detected configurations changes. Assuming a normal distribution of face configurations, we estimated (from our labeled set) the spatial covariance between four facial components (nose, mouth and both eyes). We then modified the covariance matrix, multiplying it by 0.25, 1 or 4, and generated 100 artificial configurations by positioning four contrasting rectangles at the estimated positions of the facial components (a sampling sketch appears after Figure 7). Although the performance of both Mutual Boosting and AdaBoost degraded as the configuration variance increased, the advantage of Mutual Boosting persists in rigid as well as varying configurations⁹ (Figure 7).

⁹ In this experiment the resolution of the MB windows (and the number of training features) was decreased, so that information derived from the higher resolution of the parts would be ruled out as an explanation for the Mutual Boosting advantage. This procedure explains the superior AdaBoost performance in the first boosting iteration.

Figure 7: A. Artificial face configurations with increasing covariance (0.25, 1.00, 4.00). B. MB and AB equal error rate performance, as a function of boosting iterations, on configurations with covariance scaled by 0.25, 1.00 and 4.00.
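The artificial configurations could be generated along the following lines; a sketch, assuming the labeled component positions are stacked as rows (the 0.25/1/4 factors are the paper's, everything else is ours):

import numpy as np

def sample_configurations(positions, scale, n=100, seed=0):
    # positions: (N, 8) array of (y, x) pairs for R-eye, L-eye, nose, mouth
    rng = np.random.default_rng(seed)
    mu = positions.mean(axis=0)
    cov = np.cov(positions, rowvar=False) * scale   # scale in {0.25, 1, 4}
    return rng.multivariate_normal(mu, cov, size=n)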
6 Discussion

In evaluating the performance of Mutual Boosting, it should be emphasized that we did not implement the VJ cascade approach; we therefore only attempt to demonstrate that the power of a single AdaBoost learner can be augmented by Mutual Boosting.

The VJ detector is rescaled in order to detect objects efficiently at multiple scales. For simplicity, the scale of neighboring objects and parts was assumed to be fixed, so that a similar detector-rescaling approach could be followed. This assumption holds well for face-scenes, but if neighboring objects may vary in scale, a single m-detection map will not suffice. However, by transforming each m-detection image into an m-detection cube (with scale as the third dimension), multi-scale context detection can be achieved¹⁰. The dynamic programming characteristic of Mutual Boosting (simply reusing the multiple position and scale detections already performed by VJ) ensures that the running time with varying-scale context is only doubled.

¹⁰ Using an integral cube, calculating the sum of a cube feature (of any size) requires 8 access operations (only double the 4 operations required in the integral image case).
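A sketch of the integral cube of footnote 10: a 3-D cumulative sum, with box sums obtained by inclusion-exclusion over the 8 corners (function names are ours):

import numpy as np

def integral_cube(vol):
    # ic[z, y, x] = sum of vol[:z, :y, :x], zero-padded on every axis
    ic = np.zeros(tuple(s + 1 for s in vol.shape))
    ic[1:, 1:, 1:] = vol.cumsum(0).cumsum(1).cumsum(2)
    return ic

def cube_sum(ic, z, y, x, d, h, w):
    # sum of vol[z:z+d, y:y+h, x:x+w] in 8 access operations
    return (ic[z+d, y+h, x+w]
            - ic[z, y+h, x+w] - ic[z+d, y, x+w] - ic[z+d, y+h, x]
            + ic[z, y, x+w] + ic[z, y+h, x] + ic[z+d, y, x]
            - ic[z, y, x])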
It should be noted that the face-scene is highly structured and is therefore a good candidate for demonstrating Mutual Boosting; however, as suggested by Figure 7B, Mutual Boosting can handle highly varying configurations, and the proposed method needs no modification when applied to other scenes, like the office scene in Figure 1¹¹. Notice that Mutual Boosting does not require a priori knowledge of the compositional structure, but rather permits structure to emerge naturally in the cross referencing pattern (see examples in Figure 5).

Mutual Boosting could be enhanced by unifying the selection of weak learners, rather than selecting an individual weak learner for each object detector. Unified selection aims to choose weak learners that maximize the detection rate over the entire object set, thus maximizing feature reuse [11]. This approach is optimal when many objects with common characteristics are trained.

Is Mutual Boosting specific to image object detection? It does require labeled input of multiple objects in a scene, supplying a local description of the objects as well as information on their mutual contextual positioning. But these criteria are shared by other complex "scenes": DNA sequences include multiple objects (genes) in mutual positions, and might therefore be handled by a variant of Mutual Boosting. The remarkable success of the VJ method stems from abandoning highly custom-tailored complex features in favor of numerous simple ones. Mutual Boosting combines parallel boosting with a similar feature approach to incorporate contextual information efficiently. We suggest that achieving wide contextual integration is one step toward human-like object detection capabilities.

¹¹ MB is currently aimed at detecting objects in office-scenes (Caltech 360° office DB).

References

[1] Freund, Y. and Schapire, R. E. (1997) A decision-theoretic generalization of on-line learning and an application to boosting. JCSS 55(1): 119-139.
[2] Viola, P. and Jones, M. (2001) Robust real-time object detection. IEEE ICCV Workshop on Statistical and Computational Theories of Vision, Vancouver, Canada, July 13, 2001.
[3] Tanaka, K., Saito, H., Fukada, Y. and Moriya, M. (1991) Coding visual images of objects in the inferotemporal cortex of the macaque monkey. J. Neurophys. 66: 170-189.
[4] Biederman, I. (1987) Recognition-by-components: A theory of human image understanding. Psychological Review, 94, 115-147.
[5] Navon, D. (1977) Forest before trees: The precedence of global features in visual perception. Cog. Psych. 9, 353-383.
[6] Biederman, I., Mezzanotte, R. J., & Rabinowitz, J. C. (1982) Scene perception: Detecting and judging objects undergoing relational violations. Cog. Psych. 14, 143-177.
[7] Biederman, I. (1981) On the semantics of a glance at a scene. In M. Kubovy & J. R. Pomerantz (Eds.), Perceptual organization (pp. 213-253). Hillsdale, NJ: Erlbaum.
[8] Oliva, A. and Torralba, A. B. (2002) Scene-centered description from spatial envelope properties. Biologically Motivated Computer Vision 2002: 263-272.
[9] Weber, M., Welling, M., & Perona, P. (2000) Unsupervised learning of models for recognition. ECCV (1) 2000: 18-32.
[10] Barnard, K. and Forsyth, D. (2001) Learning the semantics of words and pictures. In IEEE ICCV, volume 2, pages 408-415, Vancouver, Canada, July 2001.
[11] Schapire, R. E. and Singer, Y. (2000) BoosTexter: A boosting-based system for text categorization. Machine Learning, 39(2-3): 135-168, May/June 2000.
", "award": [], "sourceid": 2520, "authors": [{"given_name": "Michael", "family_name": "Fink", "institution": null}, {"given_name": "Pietro", "family_name": "Perona", "institution": null}]}