{"title": "Promoting Poor Features to Supervisors: Some Inputs Work Better as Outputs", "book": "Advances in Neural Information Processing Systems", "page_first": 389, "page_last": 395, "abstract": null, "full_text": "Promoting Poor Features to Supervisors: Some Inputs Work Better as Outputs\n\nRich Caruana\nJPRC and Carnegie Mellon University\nPittsburgh, PA 15213\ncaruana@cs.cmu.edu\n\nVirginia R. de Sa\nSloan Center for Theoretical Neurobiology and W. M. Keck Center for Integrative Neuroscience\nUniversity of California, San Francisco CA 94143\ndesa@phy.ucsf.edu\n\nAbstract\n\nIn supervised learning there is usually a clear distinction between inputs and outputs: inputs are what you will measure, outputs are what you will predict from those measurements. This paper shows that the distinction between inputs and outputs is not this simple. Some features are more useful as extra outputs than as inputs. By using a feature as an output we get more than just the case values; we can learn a mapping from the other inputs to that feature. For many features this mapping may be more useful than the feature value itself. We present two regression problems and one classification problem where performance improves if features that could have been used as inputs are used as extra outputs instead. This result is surprising since a feature used as an output is not used during testing.\n\n1 Introduction\n\nThe goal in supervised learning is to learn functions that map inputs to outputs with high predictive accuracy. The standard practice in neural nets is to use all features that will be available for the test cases as inputs, and use as outputs only the features to be predicted.
\n\nExtra features available for training cases that won't be available during testing can be used as extra outputs that often benefit the original output [2][5]. Other ways of adding information to supervised learning through outputs include hints [1], tangent-prop [7], and EBNN [8]. In unsupervised learning it has been shown that inputs arising from different modalities can provide supervisory signals (outputs for the other modality) to each other and thus aid learning [3][6].\n\nIf outputs are so useful, and since any input could be used as an output, would some inputs be more useful as outputs? Yes. In this paper we show that in supervised backpropagation learning, some features are more useful as outputs than as inputs. This is surprising since using a feature as an output only extracts information from it during training; during testing it is not used.\n\nThis paper uses the following terms: The Main Task is the output to be learned. The goal is to improve performance on the Main Task. Regular Inputs are the features provided as inputs in all experiments. Extra Inputs (Extra Outputs) are the extra features when used as inputs (outputs). STD is standard backpropagation using the Regular Inputs as inputs and the Main Task as outputs. STD+IN uses the Extra Features as Extra Inputs to learn the Main Task. STD+OUT uses the Extra Features, but as Extra Outputs learned in parallel with the Main Task, using just the Regular Inputs as inputs.\n\n2 Poorly Correlated Features\n\nThis section presents a simple synthetic problem where it is easy to see why using a feature as an extra output is better than using that same feature as an extra input.
\n\nConsider the following function:\n\nF1(A,B) = SIGMOID(A+B), where SIGMOID(x) = 1/(1 + e^(-x))\n\nThe STD net in Figure 1a has 20 inputs, 16 hidden units, and one output. We use backpropagation on this net to learn F1(). A and B are uniformly sampled from the interval [-5,5]. The network's input is binary codes for A and B. The range [-5,5] is discretized into 2^10 bins and the binary code of the resulting bin number is used as the input coding. The first 10 input units receive the code for A and the second 10 that for B. The target output is the unary real (unencoded) value F1(A,B).\n\n[Figure 1 panels: (a) STD: Regular Inputs (binary inputs coding for A and B), fully connected hidden layer, Main Output. (b) STD+IN: Regular Inputs plus an Extra Input, fully connected hidden layer, Main Output. (c) STD+OUT: Regular Inputs, fully connected hidden layer, Main Output and Extra Output.]\n\nFigure 1: Three Neural Net Architectures for Learning F1\n\nBackpropagation is done with per-epoch updating and early stopping. Each trial uses new random training, halt, and test sets. Training sets contain 50 patterns. This is enough data to get good performance, but not so much that there is no room for improvement. We use large halt and test sets (1000 cases each) to minimize the effect of sampling error in the measured performances. Larger halt and test sets yield similar results. We use this methodology for all the experiments in this paper.\n\nTable 1 shows the mean performance of 50 trials of STD Net 1a with backpropagation and early stopping.\n\nNow consider a similar function:\n\nF2(A,B) = SIGMOID(A-B)\n\nSuppose, in addition to the 10-bit codings for A and B, you are given the unencoded unary value F2(A,B) as an extra input feature. Will this extra input help you learn F1(A,B) better? Probably not. A+B and A-B do not correlate for random A and B. The correlation coefficient for our training sets is typically within \u00b10.01. Because of this, knowing the value of F2(A,B) does not tell you much about the target value F1(A,B) (and vice-versa).\n\nTable 1: Mean Test Set Root-Mean-Squared-Error on F1\n\nNetwork | Trials | Mean RMSE | Significance\nSTD | 50 | 0.0648 | -\nSTD+IN | 50 | 0.0647 | ns\nSTD+OUT | 50 | 0.0631 | 0.013*\n\nF1(A,B)'s poor correlation with F2(A,B) hurts backprop's ability to learn to use F2(A,B) to predict F1(A,B). The STD+IN net in Figure 1b has 21 inputs: 20 for the binary codes for A and B, and an extra input for F2(A,B). The 2nd line in Table 1 shows the performance of STD+IN for the same training, halting, and test sets used by STD; the only difference is that there is an extra input feature in the data sets for STD+IN. Note that the performance of STD+IN is not significantly different from that of STD: the extra information contained in the feature F2(A,B) does not help backpropagation learn F1(A,B) when used as an extra input.\n\nIf F2(A,B) does not help backpropagation learn F1(A,B) when used as an input, should we ignore it altogether? No. F1(A,B) and F2(A,B) are strongly related. They both benefit from decoding the binary input encoding to compute the subfeatures A and B. 
If, instead of being used as an extra input, F2(A,B) is used as an extra output trained with backpropagation, it will bias the shared hidden layer to learn A and B better, and this will help the net better learn to predict F1(A,B).\n\nFigure 1c shows a net with 20 inputs for A and B, and 2 outputs, one for F1(A,B) and one for F2(A,B). Error is back-propagated from both outputs, but the performance of this net is evaluated only on the output F1(A,B) and early stopping is done using only the performance of this output. The 3rd line in Table 1 shows the mean performance of 50 trials of this multitask net on F1(A,B). Using F2(A,B) as an extra output significantly improves performance on F1(A,B). Using the extra feature as an extra output is better than using it as an extra input. By using F2(A,B) as an output we make use of more than just the individual output values F2(A,B): we learn to extract information about the function mapping the inputs to F2(A,B). This is a key difference between using features as inputs and outputs.\n\nThe increased performance of STD+OUT over STD and STD+IN is not due to STD+OUT reducing the capacity available for the main task F1(). All three nets (STD, STD+IN, STD+OUT) perform better with more hidden units. (Because larger capacity favors STD+OUT over STD and STD+IN, we report results for the moderate sized 16 hidden unit nets to be fair to STD and STD+IN.)\n\n3 Noisy Features\n\nThis section presents two problems where extra features are more useful as inputs if they have low noise, but which become more useful as outputs as their noise increases. 
Because the extra features are ideal features for these problems, this demonstrates that what we observed in the previous section does not depend on the extra features being contrived so that their correlation with the main task is low: features with high correlation can still be more useful as outputs.\n\nOnce again, consider the main task from the previous section:\n\nF1(A,B) = SIGMOID(A+B)\n\nNow consider these extra features:\n\nEF(A) = A + NOISE_SCALE * Noise1\nEF(B) = B + NOISE_SCALE * Noise2\n\nNoise1 and Noise2 are uniformly sampled on [-1,1]. If NOISE_SCALE is not too large, EF(A) and EF(B) are excellent input features for learning F1(A,B) because the net can avoid learning to decode the binary input representations. However, as NOISE_SCALE increases, EF(A) and EF(B) become less useful and it is better for the net to learn F1(A,B) from the binary inputs for A and B.\n\nAs before, we try using the extra features as either extra inputs or as extra outputs. Again, the training sets have 50 patterns, and the halt and test sets have 1000 patterns. Unlike before, however, we ran preliminary tests to find the best net size. The results showed 256 hidden units to be about optimal for the STD nets with early stopping on this problem.\n\n[Figure 2 plots: RMSE (y-axis) against Feature Noise Scale (x-axis, 0.0 to 10.0) for STD+IN, STD+OUT, and STD; two panels.]\n\nFigure 2: STD, STD+IN, and STD+OUT on F1 (left) and F3 (right)\n\nFigure 2a plots the average performance of 50 trials of STD+IN and STD+OUT as NOISE_SCALE varies from 0.0 to 10.0. The performance of STD, which does not use EF(A) and EF(B), is shown as a horizontal line; it is independent of NOISE_SCALE. Let's first examine the results of STD+IN which uses EF(A) and EF(B) as extra inputs. As expected, when the noise is small, using EF(A) and EF(B) as extra inputs improves performance considerably. As the noise increases, however, this improvement decreases. Eventually there is so much noise in EF(A) and EF(B) that they no longer help the net if used as inputs. And, if the noise increases further, using EF(A) and EF(B) as extra inputs actually hurts. Finally, as the noise gets very large, performance asymptotes back towards the baseline.\n\nUsing EF(A) and EF(B) as extra outputs yields quite different results. When the noise is low, they do not help as much as they did as extra inputs. As the noise increases, however, at some point they help more as extra outputs than as extra inputs, and never hurt performance the way the noisy extra inputs did.\n\nWhy does noise cause STD+IN to perform worse than STD? With a finite training sample, correlations between noisy inputs and the main task cause the network to use the noisy inputs. 
To the extent that the net computes the main task as a function of the noisy inputs, it must pass the noise to the output, causing the output to be noisy. Also, as the net comes to depend on the noisy inputs, it depends less on the noise-free binary inputs. The noisy inputs explain away some of the training signal, so less is available to encourage learning to decode the binary inputs.\n\nWhy does noise not hurt STD+OUT as much as it hurts STD+IN? When EF(A) and EF(B) are outputs, the net is learning the mapping from the regular inputs to them. Early in training, the net learns to interpolate through the noise and thus learns smooth functions for EF(A) and EF(B) that have reasonable fidelity to the true mapping. This makes learning less sensitive to the noise added to these features.\n\n3.1 Another Problem\n\nF1(A,B) is only mildly nonlinear because A and B do not go far into the tails of the SIGMOID. Do the results depend on this smoothness? To check, we modified F1(A,B) to make it more nonlinear. Consider this function:\n\nF3(A,B) = SIGMOID(EXPAND(SIGMOID(A)-SIGMOID(B)))\n\nwhere EXPAND scales the inputs from (SIGMOID(A)-SIGMOID(B)) to the range [-12.5,12.5], and A and B are drawn from [-12.5,12.5]. F3(A,B) is significantly more nonlinear than F1(A,B) because the expanded scales of A and B, and expanding the difference to [-12.5,12.5] before passing it through another sigmoid, cause much of the data to fall in the tails of either the inner or outer sigmoids.\n\nConsider these extra features:\n\nEF(A) = SIGMOID(A) + NOISE_SCALE * Noise1\nEF(B) = SIGMOID(B) + NOISE_SCALE * Noise2\n\nwhere Noise1 and Noise2 are sampled as before. Figure 2b shows the results of using extra features EF(A) and EF(B) as extra inputs or as extra outputs. 
The trend is similar to that in Figure 2a but the benefit of STD+OUT is even larger at low noise. The data for 2a and 2b are generated using different seeds; 2a used steepest descent and Mitre's Aspirin simulator, while 2b used conjugate gradient and Toronto's Xerion simulator; and F1 and F3 do not behave as similarly as their definitions might suggest. The similarity between the two graphs is therefore due to the ubiquity of the phenomenon, not to some small detail of the test functions or how the experiments were run.\n\n4 A Classification Problem\n\nThis section presents a problem that combines feature correlation (Section 2) and feature noise (Section 3) into one problem. Consider the 1-D classification problem, shown in Figure 3, of separating two Gaussian distributions with means 0 and 1, and standard deviations of 1. This problem is simple to learn if the 1-D input is coded as a single, continuous input but can be made harder by embedding it non-linearly in a higher dimensional space. Consider encoding input values defined on [0.0, 15.0] using an interpolated 4-D Gray code (GC); integer values are mapped to a 4-D binary Gray code and intervening non-integers are mapped linearly to intervening 4-D vectors between the binary Gray codes for the bounding integers. As the Gray code flips only one bit between neighboring integers, this involves simply interpolating along the 1 dimension in the 4-D unit cube that changes. Thus 3.4 is encoded as 0.4(GC(4) - GC(3)) + GC(3).\n\nThe extra feature is a 1-D value correlated (with correlation p) with the original unencoded regular input, X. The extra feature is drawn from a Gaussian distribution with mean p * (X - 0.5) + 0.5 and standard deviation sqrt(1 - p^2). 
Examples of the distributions of the unencoded original dimension and the extra feature for various correlations are shown in Figure 3. This problem has been carefully constructed so that the optimal classification boundary does not change as p varies.\n\n[Figure 3 panels: the two overlapped class distributions over X (left), and plots of the extra feature (y) against X for p=0, p=0.5, and p=1.]\n\nFigure 3: Two Overlapped Gaussian Classes (left), and An Extra Feature (y-axis) Correlated Different Amounts (p = 0: no correlation, p = 1: perfect correlation) With the unencoded version of the Regular Input (x-axis)\n\nConsider the extreme cases. At p = 1, the extra feature is exactly an unencoded version of the regular input. An STD+IN net using this feature as an extra input could ignore the encoded inputs and solve the problem using this feature alone. An STD+OUT net using this extra feature as an extra output would have its hidden layer biased towards representations that decode the Gray code, which is useful to the main classification task. At the other extreme (p = 0), we expect nets using the extra feature to learn no better than one using just the regular inputs because there is no useful information provided by the uncorrelated extra feature. The interesting case is between the two extremes. We can imagine a situation where, as an output, the extra feature is still able to help STD+OUT by guiding it to decode the Gray code but does not help STD+IN because of the high level of noise.\n\n[Figure 4 plot: y-axis ticks from 0.60 to 0.65, against Correlation of Extra Feature (x-axis, 0.00 to 1.00), for STD+IN, STD+OUT, and STD.]\n\nFigure 4: STD, STD+IN, and STD+OUT vs. p on the Classification Problem\n\nThe class output unit uses a sigmoid transfer function and cross-entropy error measure. The output unit for the correlated extra feature uses a linear transfer function and squared error measure. Figure 4 shows the average performance of 50 trials of STD, STD+IN, and STD+OUT as a function of p using networks with 20 hidden units, 70 training patterns, and halt and test sets of 1000 patterns each. As in the previous section, STD+IN is much more sensitive to changes in the extra feature than STD+OUT, so that by p = 0.75 the curves cross and for p less than 0.75, the dimension is actually more useful as an output dimension than an extra input.\n\n5 Discussion\n\nAre the benefits of using some features as extra outputs instead of as inputs large enough to be interesting? Yes. Using only 1 or 2 features as extra outputs instead of as inputs reduced error 2.5% on the problem in Section 2, more than 5% in regions of the graphs in Section 3, and more than 2.5% in regions of the graph in Section 4. In domains where many features might be moved, the net effect may be larger.\n\nAre some features more useful as outputs than as inputs only on contrived problems? No. In this paper we used the simplest problems we could devise where a few features worked better as outputs than as inputs. But our findings explain a result we noted previously, but did not understand, when applying multitask learning to pneumonia risk prediction [4]. There, we had the choice of using lab tests that would be unavailable on future patients as extra outputs, or using poor (i.e., noisy) predictions of them as extra inputs. Using the lab tests as extra outputs worked better. If one compares the zero noise points for STD+OUT (there's no noise in a feature when used as an output because we use the values in the training set, not predicted values) with the high noise points for STD+IN in the graphs in Section 3, it is easy to see why STD+OUT could perform much better.\n\nThis paper shows that the benefit of using a feature as an extra output is different from the benefit of using that feature as an input. When a feature is an input, the net has access to its values on the training and test cases to use for prediction. When it is an output, however, the net is instead biased to learn a mapping from the other inputs in the training set to that output. From the graphs it is clear that some features help when used either as an input, or as an output. Given that the benefit of using a feature as an extra output is different from that of using it as an input, can we get both benefits? Our early results with techniques that reap both benefits by allowing some features to be used simultaneously as both inputs and outputs while preventing learning direct feedthrough identity mappings are promising.\n\nAcknowledgements\n\nR. Caruana was supported in part by ARPA grant F33615-93-1-1330, NSF grant BES-9315428, and Agency for Health Care Policy and Research grant HS06468. V. de Sa was supported by postdoctoral fellowships from NSERC (Canada) and the Sloan Foundation. We thank Mitre Group for the Aspirin/Migraines Simulator and The University of Toronto for the Xerion Simulator.\n\nReferences\n\n[1] Y.S. Abu-Mostafa, \"Learning From Hints in Neural Networks,\" Journal of Complexity 6:2, pp. 192-198, 1989.\n\n[2] S. Baluja and D.A. 
Pomerleau, \"Using the Representation in a Neural Network's Hidden Layer for Task-Specific Focus of Attention\". In C. Mellish (ed.) The International Joint Conference on Artificial Intelligence 1995 (IJCAI-95): Montreal, Canada. IJCAII & Morgan Kaufmann. San Mateo, CA. pp 133-139, 1995.\n\n[3] S. Becker and G. E. Hinton, \"A self-organizing neural network that discovers surfaces in random-dot stereograms,\" Nature 355 pp. 161-163, 1992.\n\n[4] R. Caruana, S. Baluja, and T. Mitchell, \"Using the Future to Sort Out the Present: Rankprop and Multitask Learning for Pneumonia Risk Prediction,\" Advances in Neural Information Processing Systems 8, 1996.\n\n[5] R. Caruana, \"Learning Many Related Tasks at the Same Time With Backpropagation,\" Advances in Neural Information Processing Systems 7, 1995.\n\n[6] V. R. de Sa, \"Learning classification with unlabeled data,\" Advances in Neural Information Processing Systems 6, pp. 112-119, Morgan Kaufmann, 1994.\n\n[7] P. Simard, B. Victorri, Y. Le Cun, and J. Denker, \"Tangent prop - a formalism for specifying selected invariances in an adaptive network,\" Advances in Neural Information Processing Systems 4, pp. 895-903, Morgan Kaufmann, 1992.\n\n[8] S. Thrun and T. Mitchell, \"Learning One More Thing,\" CMU TR: CS-94-184, 1994.\n", "award": [], "sourceid": 1231, "authors": [{"given_name": "Rich", "family_name": "Caruana", "institution": null}, {"given_name": "Virginia", "family_name": "de", "institution": null}]}