{"title": "Monotonicity Hints", "book": "Advances in Neural Information Processing Systems", "page_first": 634, "page_last": 640, "abstract": null, "full_text": "Monotonicity Hints \n\nJoseph Sill \n\nComputation and Neural Systems program \n\nYaser S. Abu-Mostafa \nEE and CS Deptartments \n\nCalifornia Institute of Technology \n\nCalifornia Institute of Technology \n\nemail: joe@cs.caltech.edu \n\nemail: yaser@cs.caltech.edu \n\nAbstract \n\nA hint is any piece of side information about the target function to \nbe learned. We consider the monotonicity hint, which states that \nthe function to be learned is monotonic in some or all of the input \nvariables. The application of mono tonicity hints is demonstrated \non two real-world problems- a credit card application task, and a \nproblem in medical diagnosis. A measure of the monotonicity error \nof a candidate function is defined and an objective function for the \nenforcement of monotonicity is derived from Bayesian principles. \nWe report experimental results which show that using monotonicity \nhints leads to a statistically significant improvement in performance \non both problems. \n\n1 \n\nIntroduction \n\nResearchers in pattern recognition, statistics, and machine learning often draw \na contrast between linear models and nonlinear models such as neural networks. \nLinear models make very strong assumptions about the function to be modelled, \nwhereas neural networks are said to make no such assumptions and can in principle \napproximate any smooth function given enough hidden units. Between these two \nextremes, there exists a frequently neglected middle ground of nonlinear models \nwhich incorporate strong prior information and obey powerful constraints. \n\nA monotonic model is one example which might occupy this middle area. Monotonic \nmodels would be more flexible than linear models but still highly constrained. 
Many applications arise in which there is good reason to believe the target function is monotonic in some or all input variables. In screening credit card applicants, for instance, one would expect that the probability of default decreases monotonically with the applicant's salary. It would be very useful, therefore, to be able to constrain a nonlinear model to obey monotonicity. \n\nThe general framework for incorporating prior information into learning is well established and is known as learning from hints [1]. A hint is any piece of information about the target function beyond the available input-output examples. Hints can improve the performance of learning models by reducing capacity without sacrificing approximation ability [2]. Invariances in character recognition [3] and symmetries in financial-market forecasting [4] are some of the hints which have proven beneficial in real-world learning applications. This paper describes the first practical applications of monotonicity hints. The method is tested on two noisy real-world problems: a classification task concerned with credit card applications and a regression problem in medical diagnosis. \n\nSection 2 derives, from Bayesian principles, an appropriate objective function for simultaneously enforcing monotonicity and fitting the data. Section 3 describes the details and results of the experiments. Section 4 analyzes the results and discusses possible future work. \n\n2 Bayesian Interpretation of Objective Function \n\nLet x be a vector drawn from the input distribution and x' be such that \n\n∀ j ≠ i, x'_j = x_j (1) \n\nx'_i > x_i (2) \n\nThe statement that f is monotonically increasing in input variable x_i means that for all such x, x' defined as above \n\nf(x') ≥ f(x) (3) \n\nDecreasing monotonicity is defined similarly. 
\n\nWe wish to define a single scalar measure of the degree to which a particular candidate function y obeys monotonicity in a set of input variables. \n\nOne such natural measure, the one used in the experiments in Section 3, is defined in the following way: Let x be an input vector drawn from the input distribution. Let i be the index of an input variable randomly chosen from a uniform distribution over those variables for which monotonicity holds. Define a perturbation distribution, e.g., U[0,1], and draw δx_i from this distribution. Define x' such that \n\n∀ j ≠ i, x'_j = x_j (4) \n\nx'_i = x_i + sgn(i) δx_i (5) \n\nwhere sgn(i) = 1 or -1 depending on whether f is monotonically increasing or decreasing in variable i. We will call E_h the monotonicity error of y on the input pair (x, x'). \n\nE_h = 0 if y(x') ≥ y(x); E_h = (y(x) - y(x'))^2 if y(x') < y(x) (6) \n\nOur measure of y's violation of monotonicity is E[E_h], where the expectation is taken with respect to the random variables x, i and δx_i. \n\nWe believe that the best possible approximation to f given the architecture used is probably approximately monotonic. This belief may be quantified in a prior distribution over the candidate functions implementable by the architecture: \n\nP(y) ∝ exp(-λ E[E_h]) (7) \n\nThis distribution represents the a priori probability density assigned to a candidate function with a given level of monotonicity error. The probability that a function is the best possible approximation to f decreases exponentially with the increase in monotonicity error. λ is a positive constant which indicates how strong our bias is towards monotonic functions. \n\nIn addition to obeying prior information, the model should fit the data well. 
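The sampling procedure just described lends itself to a direct Monte Carlo estimate of E[E_h]. The following sketch (in Python/NumPy; the function names, the dict of monotonicity directions, and the number of sampled pairs are our illustrative choices, not from the paper) draws random input rows, perturbs one hinted variable in its asserted direction by a U[0,1] amount, and averages the resulting monotonicity errors:

```python
import numpy as np

def monotonicity_error(y, x, x_prime):
    """Monotonicity error E_h of model y on one ordered pair (x, x'):
    zero if the output does not decrease, squared drop otherwise."""
    diff = y(x_prime) - y(x)
    return 0.0 if diff >= 0 else diff ** 2

def estimate_hint_error(y, X, mono_dirs, n_pairs=1000, rng=None):
    """Monte Carlo estimate of E[E_h]: draw an input row, a hinted
    variable i, and a U[0,1] perturbation, then move x_i in the
    asserted direction sgn(i) given by mono_dirs[i] (+1 or -1)."""
    if rng is None:
        rng = np.random.default_rng(0)
    idx_vars = list(mono_dirs)          # indices carrying a monotonicity hint
    total = 0.0
    for _ in range(n_pairs):
        x = X[rng.integers(len(X))]
        i = idx_vars[rng.integers(len(idx_vars))]
        x_prime = x.copy()
        x_prime[i] += mono_dirs[i] * rng.uniform(0.0, 1.0)
        total += monotonicity_error(y, x, x_prime)
    return total / n_pairs
```

For a function that is genuinely increasing in every hinted variable, the estimate is exactly zero, since every perturbed pair satisfies y(x') ≥ y(x).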
For classification problems, we take the network output y to represent the probability of class c = 1 conditioned on the observation of the input vector (the two possible classes are denoted by 0 and 1). We wish to pick the most probable model given the data. Equivalently, we may choose to maximize log(P(model|data)). Using Bayes' Theorem, \n\nlog(P(model|data)) ∝ log(P(data|model)) + log(P(model)) (8) \n\n= Σ_{m=1}^{M} [c_m log(y_m) + (1 - c_m) log(1 - y_m)] - λ E[E_h] (9) \n\nFor continuous-output regression problems, we interpret y as the conditional mean of the observed output t given the observation of x. If we assume constant-variance gaussian noise, then by the same reasoning as in the classification case, the objective function to be maximized is: \n\n- Σ_{m=1}^{M} (y_m - t_m)^2 - λ E[E_h] (10) \n\nThe Bayesian prior leads to a familiar form of objective function, with the first term reflecting the desire to fit the data and a second term penalizing deviation from monotonicity. \n\n3 Experimental Results \n\nBoth databases were obtained via FTP from the machine learning database repository maintained by UC-Irvine¹. \n\nThe credit card task is to predict whether or not an applicant will default. For each of 690 applicant case histories, the database contains 15 features describing the applicant plus the class label indicating whether or not a default ultimately occurred. The meanings of the features are confidential for proprietary reasons. Only the 6 continuous features were used in the experiments reported here. 24 of the case histories had at least one feature missing. These examples were omitted, leaving 666 which were used in the experiments. The two classes occur with almost equal frequency; the split is 55%-45%. \n\nIntuition suggests that the classification should be monotonic in the features. 
Although the specific meanings of the continuous features are not known, we assume here that they represent various quantities such as salary, assets, debt, number of years at current job, etc. Common sense dictates that the higher the salary or the lower the debt, the less likely a default is, all else being equal. Monotonicity in all features was therefore asserted. \n\nThe motivation in the medical diagnosis problem is to determine the extent to which various blood tests are sensitive to disorders related to excessive drinking. Specifically, the task is to predict the number of drinks a particular patient consumes per day given the results of 5 blood tests. 345 patient histories were collected, each consisting of the 5 test results and the daily number of drinks. The \"number of drinks\" variable was normalized to have variance 1. This normalization makes the results easier to interpret, since a trivial mean-squared-error performance of 1.0 may be obtained by simply predicting the mean number of drinks for each patient, irrespective of the blood tests. \n\nThe justification for monotonicity in this case is based on the idea that an abnormal result for each test is indicative of excessive drinking, where abnormal means either abnormally high or abnormally low. \n\nIn all experiments, batch-mode backpropagation with a simple adaptive learning rate scheme was used². Several methods were tested. The performance of a linear perceptron was observed for benchmark purposes. For the experiments using nonlinear methods, a single hidden layer neural network with 6 hidden units and direct input-output connections was used on the credit data; 3 hidden units and direct input-output connections were used for the liver task. The most basic method tested was simply to train the network on all the training data and optimize the objective function as much as possible. 
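With hints, the objective being optimized is the penalized form of (9): the data-fit log-likelihood minus λ times the average monotonicity error over the hint pairs. A minimal sketch of computing that penalized classification objective (the variable names and the eps guard against log(0) are our illustrative choices):

```python
import numpy as np

def credit_objective(y_pred, c, y_hint, y_hint_prime, lam=5000.0):
    """Penalized objective for the classification task (to be maximized):
    cross-entropy log-likelihood of the labels minus lambda times the
    average monotonicity error over the hint pairs."""
    eps = 1e-12                                  # guard against log(0)
    log_lik = np.sum(c * np.log(y_pred + eps)
                     + (1 - c) * np.log(1 - y_pred + eps))
    drop = y_hint - y_hint_prime                 # positive where y(x') < y(x)
    e_h = np.where(drop > 0, drop ** 2, 0.0)     # per-pair monotonicity error
    return log_lik - lam * e_h.mean()
```

Maximizing this quantity trades off the fit to the training labels against violations of monotonicity on the sampled hint pairs, with λ setting the strength of the monotonicity bias.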
Another technique tried was to use a validation set to avoid overfitting. Training for all of the above models was performed by maximizing only the first term in the objective function, i.e., by maximizing the log-likelihood of the data (minimizing training error). Finally, the networks were trained with the monotonicity constraints, using approximations to (9) and (10). \n\n¹They may be obtained as follows: ftp ics.uci.edu, cd pub/machine-learning-databases. The credit data is in the subdirectory /credit-screening, while the liver data is in the subdirectory /liver-disorders. \n\n²If the previous iteration resulted in an increase in likelihood, the learning rate was increased by 3%. If the likelihood decreased, the learning rate was cut in half. \n\nA leave-k-out procedure was used in order to get statistically significant comparisons of the difference in performance. For each method, the data was randomly partitioned 200 different ways (the split was 550 training, 116 test for the credit data; 270 training and 75 test for the liver data). The results shown in Table 1 are averages over the 200 different partitions. \n\nIn the early stopping experiments, the training set was further subdivided into a set (450 for the credit data, 200 for the liver data) used for direct training and a second validation set (100 for the credit data, 70 for the liver data). The classification error on the validation set was monitored over the entire course of training, and the values of the network weights at the point of lowest validation error were chosen as the final values. \n\nThe process of training the networks with the monotonicity hints was divided into two stages. Since the meanings of the features were inaccessible, the directions of monotonicity were not known a priori. 
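The adaptive learning-rate rule described in footnote 2 can be sketched as a small update function (a sketch; the behavior when the likelihood is exactly unchanged is our choice, as the footnote leaves it unspecified):

```python
def adapt_learning_rate(lr, likelihood, prev_likelihood):
    """Footnote-2 rule: grow the learning rate by 3% after an iteration
    that increased the likelihood; halve it otherwise (the unchanged
    case is not specified in the paper and is treated as a decrease)."""
    if likelihood > prev_likelihood:
        return lr * 1.03
    return lr * 0.5
```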
These directions were determined by training a linear perceptron on the training data for 300 iterations and observing the resulting weights. A positive weight was taken to imply increasing monotonicity, while a negative weight meant decreasing monotonicity. \n\nOnce the directions of monotonicity were determined, the networks were trained with the monotonicity hints. For the credit problem, an approximation to the theoretical objective function (9) was maximized: \n\nΣ_{m=1}^{M} [c_m log(y_m) + (1 - c_m) log(1 - y_m)] - (λ/N) Σ_{n=1}^{N} E_{h,n} (13) \n\nFor the liver problem, objective function (10) was approximated by \n\n- Σ_{m=1}^{M} (y_m - t_m)^2 - (λ/N) Σ_{n=1}^{N} E_{h,n} (14) \n\nE_{h,n} represents the network's monotonicity error on a particular pair of input vectors x, x'. Each pair was generated according to the method described in Section 2. The input distribution was modelled as a joint gaussian with a covariance matrix estimated from the training data. \n\nFor each input variable, 500 pairs of vectors representing monotonicity in that variable were generated. This yielded a total of N=3000 hint example pairs for the credit problem and N=2500 pairs for the liver problem. λ was chosen to be 5000. No optimization of λ was attempted; 5000 was chosen somewhat arbitrarily as simply a high value which would greatly penalize non-monotonicity. Hint generalization, i.e., monotonicity test error, was measured by using 100 pairs of vectors for each variable which were not trained on but whose monotonicity error was calculated. For contrast, monotonicity test error was also monitored for the two-layer networks trained only on the input-output examples. Figure 1 shows test error and monotonicity error vs. training time for the credit data, for the networks trained only on the training data (i.e., no hints), averaged over the 200 different data splits. \n\n[Figure: test error and monotonicity error vs. iteration number] \n\nFigure 1: The violation of monotonicity tracks the overfitting occurring during training \n\nThe monotonicity error is multiplied by a factor of 10 in the figure to make it more easily visible. The figure indicates a substantial correlation between overfitting and monotonicity error during the course of training. The curves for the liver data look similar but are omitted due to space considerations. \n\nMethod | training error | test error | hint test error \nLinear | 22.7% ± 0.1% | 23.7% ± 0.2% | - \n6-6-1 net | 15.2% ± 0.1% | 24.6% ± 0.3% | .005115 \n6-6-1 net, w/val. | 18.8% ± 0.2% | 23.4% ± 0.3% | - \n6-6-1 net, w/hint | 18.7% ± 0.1% | 21.8% ± 0.2% | .000020 \n\nTable 1: Performance of methods on credit problem \n\nThe performance of each method is shown in Tables 1 and 2. Without early stopping, the two-layer network overfits and performs worse than a linear model. Even with early stopping, the performance of the linear model and the two-layer network are almost the same; the difference is not statistically significant. This similarity in performance is consistent with the thesis of a monotonic target function. A monotonic classifier may be thought of as a mildly nonlinear generalization of a linear classifier. The two-layer network does have the advantage of being able to implement some of this nonlinearity. However, this advantage is cancelled out (and in other cases could be outweighed) by the overfitting resulting from excessive and unnecessary degrees of freedom. 
When monotonicity hints are introduced, much of this unnecessary freedom is eliminated, although the network is still allowed to implement monotonic nonlinearities. Accordingly, a modest but clearly statistically significant improvement on the credit problem (nearly 2%) results from the introduction of monotonicity hints. Such an improvement could translate into a substantial increase in profit for a bank. Monotonicity hints also significantly improve test error on the liver problem; 4% more of the target variance is explained. \n\nMethod | training error | test error | hint test error \nLinear | .802 ± .005 | .873 ± .013 | - \n5-3-1 net | .640 ± .003 | .920 ± .014 | .004967 \n5-3-1 net, w/val. | .758 ± .008 | .871 ± .013 | - \n5-3-1 net, w/hint | .758 ± .003 | .830 ± .013 | .000002 \n\nTable 2: Performance of methods on liver problem \n\n4 Conclusion \n\nThis paper has shown that monotonicity hints can significantly improve the performance of a neural network on two noisy real-world tasks. It is worthwhile to note that the beneficial effect of imposing monotonicity does not necessarily imply that the target function is entirely monotonic. If there exist some non-monotonicities in the target function, then monotonicity hints may result in some decrease in the model's ability to implement this function. It may be, though, that this penalty is outweighed by the improved estimation of model parameters due to the decrease in model complexity. Therefore, the use of monotonicity hints probably should be considered in cases where the target function is thought to be at least roughly monotonic and the training examples are limited in number and noisy. \n\nFuture work may include the application of monotonicity hints to other real-world problems and further investigations into techniques for enforcing the hints. 
\n\nAcknowledgements \n\nThe authors thank Eric Bax, Zehra Cataltepe, Malik Magdon-Ismail, and Xubo Song for many useful discussions. \n\nReferences \n\n[1] Y. Abu-Mostafa (1990). Learning from Hints in Neural Networks. Journal of Complexity 6, 192-198. \n\n[2] Y. Abu-Mostafa (1993). Hints and the VC Dimension. Neural Computation 4, 278-288. \n\n[3] P. Simard, Y. LeCun & J. Denker (1993). Efficient Pattern Recognition Using a New Transformation Distance. NIPS 5, 50-58. \n\n[4] Y. Abu-Mostafa (1995). Financial Market Applications of Learning from Hints. Neural Networks in the Capital Markets, A. Refenes, ed., 221-232. Wiley, London, UK. \n", "award": [], "sourceid": 1270, "authors": [{"given_name": "Joseph", "family_name": "Sill", "institution": null}, {"given_name": "Yaser", "family_name": "Abu-Mostafa", "institution": null}]}