{"title": "Monotonicity Hints", "book": "Advances in Neural Information Processing Systems", "page_first": 634, "page_last": 640, "abstract": null, "full_text": "Monotonicity Hints\n\nJoseph Sill\nComputation and Neural Systems program\nCalifornia Institute of Technology\nemail: joe@cs.caltech.edu\n\nYaser S. Abu-Mostafa\nEE and CS Departments\nCalifornia Institute of Technology\nemail: yaser@cs.caltech.edu\n\nAbstract\n\nA hint is any piece of side information about the target function to be learned. We consider the monotonicity hint, which states that the function to be learned is monotonic in some or all of the input variables. The application of monotonicity hints is demonstrated on two real-world problems: a credit card application task and a problem in medical diagnosis. A measure of the monotonicity error of a candidate function is defined, and an objective function for the enforcement of monotonicity is derived from Bayesian principles. We report experimental results which show that using monotonicity hints leads to a statistically significant improvement in performance on both problems.\n\n1 Introduction\n\nResearchers in pattern recognition, statistics, and machine learning often draw a contrast between linear models and nonlinear models such as neural networks. Linear models make very strong assumptions about the function to be modelled, whereas neural networks are said to make no such assumptions and can in principle approximate any smooth function given enough hidden units. Between these two extremes, there exists a frequently neglected middle ground of nonlinear models which incorporate strong prior information and obey powerful constraints.\n\nA monotonic model is one example which might occupy this middle area. Monotonic models would be more flexible than linear models but still highly constrained. 
Many applications arise in which there is good reason to believe the target function is monotonic in some or all input variables. In screening credit card applicants, for instance, one would expect that the probability of default decreases monotonically with the applicant's salary. It would be very useful, therefore, to be able to constrain a nonlinear model to obey monotonicity.\n\nThe general framework for incorporating prior information into learning is well established and is known as learning from hints [1]. A hint is any piece of information about the target function beyond the available input-output examples. Hints can improve the performance of learning models by reducing capacity without sacrificing approximation ability [2]. Invariances in character recognition [3] and symmetries in financial-market forecasting [4] are some of the hints which have proven beneficial in real-world learning applications. This paper describes the first practical applications of monotonicity hints. The method is tested on two noisy real-world problems: a classification task concerned with credit card applications and a regression problem in medical diagnosis.\n\nSection 2 derives, from Bayesian principles, an appropriate objective function for simultaneously enforcing monotonicity and fitting the data. Section 3 describes the details and results of the experiments. Section 4 analyzes the results and discusses possible future work.\n\n2 Bayesian Interpretation of Objective Function\n\nLet $x$ be a vector drawn from the input distribution and $x'$ be such that\n\n$$\\forall j \\neq i, \\quad x'_j = x_j \\qquad (1)$$\n\n$$x'_i > x_i \\qquad (2)$$\n\nThe statement that $f$ 
is monotonically increasing in input variable $x_i$ means that for all such $x$, $x'$ defined as above\n\n$$f(x') \\geq f(x) \\qquad (3)$$\n\nDecreasing monotonicity is defined similarly.\n\nWe wish to define a single scalar measure of the degree to which a particular candidate function $y$ obeys monotonicity in a set of input variables.\n\nOne such natural measure, the one used in the experiments in Section 3, is defined in the following way: Let $x$ be an input vector drawn from the input distribution. Let $i$ be the index of an input variable randomly chosen from a uniform distribution over those variables for which monotonicity holds. Define a perturbation distribution, e.g., $U[0,1]$, and draw $\\delta x_i$ from this distribution. Define $x'$ such that\n\n$$\\forall j \\neq i, \\quad x'_j = x_j \\qquad (4)$$\n\n$$x'_i = x_i + \\mathrm{sgn}(i)\\,\\delta x_i \\qquad (5)$$\n\nwhere $\\mathrm{sgn}(i) = 1$ or $-1$ depending on whether $f$ is monotonically increasing or decreasing in variable $i$. We will call $E_h$ the monotonicity error of $y$ on the input pair $(x, x')$:\n\n$$E_h = \\begin{cases} 0 & y(x') \\geq y(x) \\\\ (y(x) - y(x'))^2 & y(x') < y(x) \\end{cases} \\qquad (6)$$\n\nOur measure of $y$'s violation of monotonicity is $\\mathcal{E}[E_h]$, where the expectation is taken with respect to the random variables $x$, $i$, and $\\delta x_i$.\n\nWe believe that the best possible approximation to $f$ given the architecture used is probably approximately monotonic. This belief may be quantified in a prior distribution over the candidate functions implementable by the architecture:\n\n$$P(\\text{model}) \\propto e^{-\\lambda \\mathcal{E}[E_h]} \\qquad (7)$$\n\nThis distribution represents the a priori probability density, or likelihood, assigned to a candidate function with a given level of monotonicity error. 
The probability that a function is the best possible approximation to $f$ decreases exponentially with the increase in monotonicity error. $\\lambda$ is a positive constant which indicates how strong our bias is towards monotonic functions.\n\nIn addition to obeying prior information, the model should fit the data well. For classification problems, we take the network output $y$ to represent the probability of class $c = 1$ conditioned on the observation of the input vector (the two possible classes are denoted by 0 and 1). We wish to pick the most probable model given the data. Equivalently, we may choose to maximize $\\log P(\\text{model} \\mid \\text{data})$. Using Bayes' Theorem,\n\n$$\\log P(\\text{model} \\mid \\text{data}) \\propto \\log P(\\text{data} \\mid \\text{model}) + \\log P(\\text{model}) \\qquad (8)$$\n\n$$= \\sum_{m=1}^{M} \\left[ c_m \\log(y_m) + (1 - c_m) \\log(1 - y_m) \\right] - \\lambda \\mathcal{E}[E_h] \\qquad (9)$$\n\nFor continuous-output regression problems, we interpret $y$ as the conditional mean of the observed output $t$ given the observation of $x$. If we assume constant-variance Gaussian noise, then by the same reasoning as in the classification case, the objective function to be maximized is:\n\n$$-\\sum_{m=1}^{M} (y_m - t_m)^2 - \\lambda \\mathcal{E}[E_h] \\qquad (10)$$\n\nThe Bayesian prior leads to a familiar form of objective function, with the first term reflecting the desire to fit the data and a second term penalizing deviation from monotonicity.\n\n3 Experimental Results\n\nBoth databases were obtained via FTP from the machine learning database repository maintained by UC-Irvine.\u00b9\n\nThe credit card task is to predict whether or not an applicant will default. For each of 690 applicant case histories, the database contains 15 features describing the applicant plus the class label indicating whether or not a default ultimately occurred. 
The meaning of the features is confidential for proprietary reasons. Only the 6 continuous features were used in the experiments reported here. 24 of the case histories had at least one feature missing. These examples were omitted, leaving 666 which were used in the experiments. The two classes occur with almost equal frequency; the split is 55%-45%.\n\nIntuition suggests that the classification should be monotonic in the features. Although the specific meanings of the continuous features are not known, we assume here that they represent various quantities such as salary, assets, debt, number of years at current job, etc. Common sense dictates that the higher the salary or the lower the debt, the less likely a default is, all else being equal. Monotonicity in all features was therefore asserted.\n\nThe motivation in the medical diagnosis problem is to determine the extent to which various blood tests are sensitive to disorders related to excessive drinking. Specifically, the task is to predict the number of drinks a particular patient consumes per day given the results of 5 blood tests. 345 patient histories were collected, each consisting of the 5 test results and the daily number of drinks. The \"number of drinks\" variable was normalized to have variance 1. This normalization makes the results easier to interpret, since a trivial mean-squared-error performance of 1.0 may be obtained by simply predicting the mean number of drinks for each patient, irrespective of the blood tests.\n\nThe justification for monotonicity in this case is based on the idea that an abnormal result for each test is indicative of excessive drinking, where abnormal means either abnormally high or abnormally low. 
\n\nIn all experiments, batch-mode backpropagation with a simple adaptive learning rate scheme was used.\u00b2 Several methods were tested. The performance of a linear perceptron was observed for benchmark purposes. For the experiments using nonlinear methods, a single hidden layer neural network with 6 hidden units and direct input-output connections was used on the credit data; 3 hidden units and direct input-output connections were used for the liver task. The most basic method tested was simply to train the network on all the training data and optimize the objective function as much as possible. Another technique tried was to use a validation set to avoid overfitting. Training for all of the above models was performed by maximizing only the first term in the objective function, i.e., by maximizing the log-likelihood of the data (minimizing training error). Finally, training the networks with the monotonicity constraints was performed, using an approximation to (9) and (10).\n\n\u00b9 They may be obtained as follows: ftp ics.uci.edu; cd pub/machine-learning-databases. The credit data is in the subdirectory /credit-screening, while the liver data is in the subdirectory /liver-disorders.\n\n\u00b2 If the previous iteration resulted in an increase in likelihood, the learning rate was increased by 3%. If the likelihood decreased, the learning rate was cut in half.\n\nA leave-k-out procedure was used in order to get statistically significant comparisons of the difference in performance. For each method, the data was randomly partitioned 200 different ways (the split was 550 training and 116 test for the credit data; 270 training and 75 test for the liver data). 
The results shown in Tables 1 and 2 are averages over the 200 different partitions.\n\nIn the early stopping experiments, the training set was further subdivided into a set (450 for the credit data, 200 for the liver data) used for direct training and a second validation set (100 for the credit data, 70 for the liver data). The classification error on the validation set was monitored over the entire course of training, and the values of the network weights at the point of lowest validation error were chosen as the final values.\n\nThe process of training the networks with the monotonicity hints was divided into two stages. Since the meanings of the features were inaccessible, the directions of monotonicity were not known a priori. These directions were determined by training a linear perceptron on the training data for 300 iterations and observing the resulting weights. A positive weight was taken to imply increasing monotonicity, while a negative weight meant decreasing monotonicity.\n\nOnce the directions of monotonicity were determined, the networks were trained with the monotonicity hints. For the credit problem, an approximation to the theoretical objective function (9) was maximized:\n\n$$\\sum_{m=1}^{M} \\left[ c_m \\log(y_m) + (1 - c_m) \\log(1 - y_m) \\right] - \\frac{\\lambda}{N} \\sum_{n=1}^{N} E_{h,n} \\qquad (13)$$\n\nFor the liver problem, objective function (10) was approximated by\n\n$$-\\sum_{m=1}^{M} (y_m - t_m)^2 - \\frac{\\lambda}{N} \\sum_{n=1}^{N} E_{h,n} \\qquad (14)$$\n\n$E_{h,n}$ represents the network's monotonicity error on a particular pair of input vectors $x$, $x'$. Each pair was generated according to the method described in Section 2. The input distribution was modelled as a joint Gaussian with a covariance matrix estimated from the training data.\n\nFor each input variable, 500 pairs of vectors representing monotonicity in that variable were generated. This yielded a total of N=3000 hint example pairs for the credit problem and N=2500 pairs for the liver problem. 
$\\lambda$ was chosen to be 5000. No optimization of $\\lambda$ was attempted; 5000 was chosen somewhat arbitrarily as simply a high value which would greatly penalize non-monotonicity. Hint generalization, i.e., monotonicity test error, was measured by using 100 pairs of vectors for each variable which were not trained on but whose monotonicity error was calculated. For contrast, monotonicity test error was also monitored for the two-layer networks trained only on the input-output examples. Figure 1 shows test error and monotonicity error vs. training time for the credit data for the networks trained only on the training data (i.e., no hints), averaged over the 200 different data splits.\n\n[Plot: Test Error and Monotonicity Error vs. Iteration Number; x-axis: Iteration Number]\n\nFigure 1: The violation of monotonicity tracks the overfitting occurring during training\n\nThe monotonicity error is multiplied by a factor of 10 in the figure to make it more easily visible. The figure indicates a substantial correlation between overfitting and monotonicity error during the course of training. The curves for the liver data look similar but are omitted due to space considerations.\n\nMethod             training error  test error     hint test error\nLinear             22.7% \u00b1 0.1%    23.7% \u00b1 0.2%   -\n6-6-1 net          15.2% \u00b1 0.1%    24.6% \u00b1 0.3%   .005115\n6-6-1 net, w/val.  18.8% \u00b1 0.2%    23.4% \u00b1 0.3%   -\n6-6-1 net, w/hint  18.7% \u00b1 0.1%    21.8% \u00b1 0.2%   .000020\n\nTable 1: Performance of methods on credit problem\n\nMethod             training error  test error     hint test error\nLinear             .802 \u00b1 .005     .873 \u00b1 .013    -\n5-3-1 net          .640 \u00b1 .003     .920 \u00b1 .014    .004967\n5-3-1 net, w/val.  .758 \u00b1 .008     .871 \u00b1 .013    -\n5-3-1 net, w/hint  .758 \u00b1 .003     .830 \u00b1 .013    .000002\n\nTable 2: Performance of methods on liver problem\n\nThe performance of each method is shown in Tables 1 and 2. Without early stopping, the two-layer network overfits and performs worse than a linear model. Even with early stopping, the performance of the linear model and that of the two-layer network are almost the same; the difference is not statistically significant. This similarity in performance is consistent with the thesis of a monotonic target function. A monotonic classifier may be thought of as a mildly nonlinear generalization of a linear classifier. The two-layer network does have the advantage of being able to implement some of this nonlinearity. However, this advantage is cancelled out (and in other cases could be outweighed) by the overfitting resulting from excessive and unnecessary degrees of freedom. When monotonicity hints are introduced, much of this unnecessary freedom is eliminated, although the network is still allowed to implement monotonic nonlinearities. Accordingly, a modest but clearly statistically significant improvement on the credit problem (nearly 2%) results from the introduction of monotonicity hints. Such an improvement could translate into a substantial increase in profit for a bank. 
Monotonicity hints also significantly improve test error on the liver problem; 4% more of the target variance is explained.\n\n4 Conclusion\n\nThis paper has shown that monotonicity hints can significantly improve the performance of a neural network on two noisy real-world tasks. It is worthwhile to note that the beneficial effect of imposing monotonicity does not necessarily imply that the target function is entirely monotonic. If there exist some non-monotonicities in the target function, then monotonicity hints may result in some decrease in the model's ability to implement this function. It may be, though, that this penalty is outweighed by the improved estimation of model parameters due to the decrease in model complexity. Therefore, the use of monotonicity hints probably should be considered in cases where the target function is thought to be at least roughly monotonic and the training examples are limited in number and noisy.\n\nFuture work may include the application of monotonicity hints to other real-world problems and further investigations into techniques for enforcing the hints.\n\nAcknowledgements\n\nThe authors thank Eric Bax, Zehra Cataltepe, Malik Magdon-Ismail, and Xubo Song for many useful discussions.\n\nReferences\n\n[1] Y. Abu-Mostafa (1990). Learning from Hints in Neural Networks. Journal of Complexity 6, 192-198.\n\n[2] Y. Abu-Mostafa (1993). Hints and the VC Dimension. Neural Computation 4, 278-288.\n\n[3] P. Simard, Y. LeCun & J. Denker (1993). Efficient Pattern Recognition Using a New Transformation Distance. NIPS 5, 50-58.\n\n[4] Y. Abu-Mostafa (1995). Financial Market Applications of Learning from Hints. Neural Networks in the Capital Markets, A. Refenes, ed., 221-232. Wiley, London, UK. 
", "award": [], "sourceid": 1270, "authors": [{"given_name": "Joseph", "family_name": "Sill", "institution": null}, {"given_name": "Yaser", "family_name": "Abu-Mostafa", "institution": null}]}