{"title": "Monotonic Networks", "book": "Advances in Neural Information Processing Systems", "page_first": 661, "page_last": 667, "abstract": null, "full_text": "Monotonic Networks \n\nJoseph Sill \n\nComputation and Neural Systems program \nCalifornia Institute of Technology \nMC 136-93, Pasadena, CA 91125 \n\nemail: joe@cs.caltech.edu \n\nAbstract \n\nMonotonicity is a constraint which arises in many application domains. \nWe present a machine learning model, the monotonic network, \nfor which monotonicity can be enforced exactly, i.e., by virtue \nof functional form. A straightforward method for implementing and \ntraining a monotonic network is described. Monotonic networks \nare proven to be universal approximators of continuous, differentiable \nmonotonic functions. We apply monotonic networks to a \nreal-world task in corporate bond rating prediction and compare \nthem to other approaches. \n\n1 Introduction \n\nSeveral recent papers in machine learning have emphasized the importance of priors \nand domain-specific knowledge. In their well-known presentation of the bias-variance \ntradeoff (Geman and Bienenstock, 1992), Geman and Bienenstock conclude \nby arguing that the crucial issue in learning is the determination of the \"right biases\" \nwhich constrain the model in the appropriate way given the task at hand. \nThe No-Free-Lunch theorem of Wolpert (Wolpert, 1996) shows, under the 0-1 error \nmeasure, that if all target functions are equally likely a priori, then all possible \nlearning methods do equally well in terms of average performance over all targets. \nOne is led to the conclusion that consistently good performance is possible only \nwith some agreement between the modeler's biases and the true (non-flat) prior. 
\nFinally, the work of Abu-Mostafa on learning from hints (Abu-Mostafa, 1990) has \nshown both theoretically (Abu-Mostafa, 1993) and experimentally (Abu-Mostafa, \n1995) that the use of prior knowledge can be highly beneficial to learning systems. \n\nOne piece of prior information that arises in many applications is the monotonicity \nconstraint, which asserts that an increase in a particular input cannot result in a \ndecrease in the output. A method was presented in (Sill and Abu-Mostafa, 1996) \nwhich enforces monotonicity approximately by adding a second term measuring \n\"monotonicity error\" to the usual error measure. This technique was shown to \nyield improved error rates on real-world applications. Unfortunately, the method \ncan be quite expensive computationally. It would be useful to have a model which \nobeys monotonicity exactly, i.e., by virtue of functional form. \n\nWe present here such a model, which we will refer to as a monotonic network. \nA monotonic network implements a piecewise-linear surface by taking maximum \nand minimum operations on groups of hyperplanes. Monotonicity constraints are \nenforced by constraining the signs of the hyperplane weights. Monotonic networks \ncan be trained using the usual gradient-based optimization methods typically used \nwith other models such as feedforward neural networks. Armstrong (Armstrong et \nal., 1996) has developed a model called the adaptive logic network which is capable \nof enforcing monotonicity and appears to have some similarities to the approach \npresented here. The adaptive logic network, however, is available only through a \ncommercial software package. The training algorithms are proprietary and have \nnot been fully disclosed in academic journals. The monotonic network therefore \nrepresents (to the best of our knowledge) the first model to be presented in an \nacademic setting which has the ability to enforce monotonicity. 
\nSection 2 describes the architecture and training procedure for monotonic networks. \nSection 3 presents a proof that monotonic networks can uniformly approximate \nany continuous monotonic function with bounded partial derivatives to an arbitrary \nlevel of accuracy. Monotonic networks are applied to a real-world problem in bond \nrating prediction in Section 4. In Section 5, we discuss the results and consider \nfuture directions. \n\n2 Architecture and Training Procedure \n\nA monotonic network has a feedforward, three-layer (two hidden-layer) architecture \n(Fig. 1). The first layer of units computes different linear combinations of the input \nvector. If increasing monotonicity is desired for a particular input, then all the \nweights connected to that input are constrained to be positive. Similarly, weights \nconnected to an input where decreasing monotonicity is required are constrained to \nbe negative. The first layer units are partitioned into several groups (the number \nof units in each group is not necessarily the same). Corresponding to each group is \na second layer unit, which computes the maximum over all first-layer units within \nthe group. The final output unit computes the minimum over all groups. \n\nMore formally, if we have $K$ groups with outputs $g_1, g_2, \\ldots, g_K$, and if group $k$ \nconsists of $h_k$ hyperplanes $w^{(k,1)}, w^{(k,2)}, \\ldots, w^{(k,h_k)}$, then \n\n$$g_k(x) = \\max_j \\left( w^{(k,j)} \\cdot x - t^{(k,j)} \\right), \\quad 1 \\le j \\le h_k$$ \n\nLet $y$ be the final output of the network. Then \n\n$$y = \\min_k g_k(x)$$ \n\nor, for classification problems, \n\n$$y = \\sigma\\left( \\min_k g_k(x) \\right)$$ \n\nwhere, e.g., $\\sigma(u) = \\frac{1}{1 + e^{-u}}$. \n\nFigure 1: This monotonic network obeys increasing monotonicity in all 3 inputs \nbecause all weights in the first layer are constrained to be positive. \n\nIn the discussions which follow, it will be useful to define the term active. 
We will \ncall a group $l$ active at $x$ if \n\n$$g_l(x) = \\min_k g_k(x),$$ \n\ni.e., if the group determines the output of the network at that point. Similarly, we \nwill say that a hyperplane is active at $x$ if its group is active at $x$ and the hyperplane \nis the maximum over all hyperplanes in the group. \n\nAs will be shown in the following section, the three-layer architecture allows a monotonic \nnetwork to approximate any continuous, differentiable monotonic function \narbitrarily well, given sufficiently many groups and sufficiently many hyperplanes \nwithin each group. The maximum operation within each group allows the network \nto approximate convex (positive second derivative) surfaces, while the minimum operation \nover groups enables the network to implement the concave (negative second \nderivative) areas of the target function (Figure 2). \n\nFigure 2: This surface is implemented by a monotonic network consisting of three \ngroups. The first and third groups consist of three hyperplanes, while the second \ngroup has only two. (Plot of network output against input, with regions marked where \ngroup 1, group 2, and group 3 are active.) \n\nMonotonic networks can be trained using many of the standard gradient-based \noptimization techniques commonly used in machine learning. The gradient for \neach hyperplane is found by computing the error over all examples for which the \nhyperplane is active. After the parameter update is made according to the rule of \nthe optimization technique, each training example is reassigned to the hyperplane \nthat is now active at that point. The set of examples for which a hyperplane is \nactive can therefore change during the course of training. \n\nThe constraints on the signs of the weights are enforced using an exponential \ntransformation. If increasing monotonicity is desired in input variable $i$, then 
\n$\\forall j, k$ the weights corresponding to the input are represented as $w_i^{(j,k)} = e^{z_i^{(j,k)}}$. \nThe optimization algorithm can modify $z_i^{(j,k)}$ freely during training while maintaining \nthe constraint. If decreasing monotonicity is required, then $\\forall j, k$ we take \n$w_i^{(j,k)} = -e^{z_i^{(j,k)}}$. \n\n3 Universal Approximation Capability \n\nIn this section, we demonstrate that monotonic networks have the capacity to approximate \nuniformly to an arbitrary degree of accuracy any continuous, bounded, \ndifferentiable function on the unit hypercube $[0,1]^D$ which is monotonic in all variables \nand has bounded partial derivatives. We will say that $x'$ dominates $x$ if \n$\\forall\\, 1 \\le d \\le D$, $x'_d \\ge x_d$. A function $m$ is monotonic in all variables if it satisfies the \nconstraint that $\\forall x, x'$, if $x'$ dominates $x$ then $m(x') \\ge m(x)$. \n\nTheorem 3.1 Let $m(x)$ be any continuous, bounded monotonic function with \nbounded partial derivatives, mapping $[0,1]^D$ to $\\mathbb{R}$. Then, for any $\\epsilon > 0$, there exists a function \n$m_{net}(x)$ which can be implemented by a monotonic network and is such that, for \nany $x \\in [0,1]^D$, $|m(x) - m_{net}(x)| < \\epsilon$. \n\nProof: \nLet $b$ be the maximum value and $a$ be the minimum value which $m$ takes on $[0,1]^D$. \nLet $\\alpha$ bound the magnitude of all partial first derivatives of $m$ on $[0,1]^D$. Define \nan equispaced grid of points on $[0,1]^D$, where $\\delta = \\frac{1}{n}$ is the spacing between grid \npoints along each dimension. I.e., the grid is the set $S$ of points $(i_1\\delta, i_2\\delta, \\ldots, i_D\\delta)$ \nwhere $1 \\le i_1 \\le n, 1 \\le i_2 \\le n, \\ldots, 1 \\le i_D \\le n$. Corresponding to each grid point \n$x' = (x'_1, x'_2, \\ldots, x'_D)$, assign a group consisting of $D+1$ hyperplanes. One hyperplane \nin the group is the constant output plane $y = m(x')$. In addition, for each dimension \n$d$, place a hyperplane $y = \\gamma(x_d - x'_d) + m(x')$, where $\\gamma > \\frac{b-a}{\\delta}$. 
This construction \nensures that the group associated with $x'$ cannot be active at any point $x^*$ where \nthere exists a $d$ such that $x^*_d - x'_d \\ge \\delta$, since the group's output at such a point \nmust be greater than $b$ and hence greater than the output of a group associated \nwith another grid point. \n\nNow consider any point $x \\in [0,1]^D$. Let $s^{(1)}$ be the unique grid point in $S$ such that \n$\\forall d$, $0 \\le x_d - s^{(1)}_d < \\delta$, i.e., $s^{(1)}$ is the closest grid point to $x$ which $x$ dominates. \nThen we can show that $m_{net}(x) \\ge m(s^{(1)})$. Consider an arbitrary grid point $s' \\ne s^{(1)}$. \nBy the monotonicity of $m$, if $s'$ dominates $s^{(1)}$, then $m(s') \\ge m(s^{(1)})$, and \nhence, the group associated with $s'$ has a constant output hyperplane $y = m(s') \\ge m(s^{(1)})$ \nand therefore outputs a value $\\ge m(s^{(1)})$ at $x$. If $s'$ does not dominate $s^{(1)}$, \nthen there exists a $d$ such that $s^{(1)}_d > s'_d$. Therefore, $x_d - s'_d \\ge \\delta$, meaning that \nthe output of the group associated with $s'$ is at least $b \\ge m(s^{(1)})$. All groups have \noutput at least as large as $m(s^{(1)})$, so we have indeed shown that $m_{net}(x) \\ge m(s^{(1)})$. \nNow consider the grid point $s^{(2)}$ that is obtained by adding $\\delta$ to each coordinate of \n$s^{(1)}$. The group associated with $s^{(2)}$ outputs $m(s^{(2)})$ at $x$, so $m_{net}(x) \\le m(s^{(2)})$. \nTherefore, we have $m(s^{(1)}) \\le m_{net}(x) \\le m(s^{(2)})$. Since $x$ dominates $s^{(1)}$ and \nis dominated by $s^{(2)}$, by monotonicity we also have $m(s^{(1)}) \\le m(x) \\le m(s^{(2)})$. \n$|m(x) - m_{net}(x)|$ is therefore bounded by $|m(s^{(2)}) - m(s^{(1)})|$. By Taylor's theorem \nfor multivariate functions, we know that \n\n$$m(s^{(2)}) = m(s^{(1)}) + \\delta \\sum_{d=1}^{D} \\frac{\\partial m}{\\partial x_d}(c)$$ \n\nfor some point $c$ on the line segment between $s^{(1)}$ and $s^{(2)}$. Given the assumptions \nmade at the outset, $|m(s^{(2)}) - m(s^{(1)})|$, and hence $|m(x) - m_{net}(x)|$, can be bounded \nby $D\\delta\\alpha$. We take $\\delta < \\frac{\\epsilon}{D\\alpha}$ to complete the proof. $\\blacksquare$ \n\n4 Experimental Results \n\nWe tested monotonic networks on a real-world problem concerning the prediction \nof corporate bond ratings. 
Rating agencies such as Standard & Poor's (S & P) issue \nbond ratings intended to assess the level of risk of default associated with the bond. \nS & P ratings can range from AAA down to B- or lower. \n\nA model which accurately predicts the S & P rating of a bond given publicly available \nfinancial information about the issuer has considerable value. Rating agencies \ndo not rate all bonds, so an investor could use the model to assess the risk associated \nwith a bond which S & P has not rated. The model can also be used to anticipate \nrating changes before they are announced by the agency. \n\nThe dataset, which was donated by a Wall Street firm, is made up of 196 examples. \nEach training example consists of 10 financial ratios reflecting the fundamental \ncharacteristics of the issuing firm, along with an associated rating. The meaning of \nthe financial ratios was not disclosed by the firm for proprietary reasons. The rating \nlabels were converted into integers ranging from 1 to 16. The task was treated as a \nsingle-output regression problem rather than a 16-class classification problem. \n\nMonotonicity constraints suggest themselves naturally in this context. Although \nthe meanings of the features are not revealed, it is reasonable to assume that they \nconsist of quantities such as profitability, debt, etc. It seems intuitive that, for \ninstance, the higher the profitability of the firm is, the stronger the firm is, and \nhence, the higher the bond rating should be. Monotonicity was therefore enforced \nin all input variables. \n\nThree different types of models (all trained on squared error) were compared: a \nlinear model, standard two-layer feedforward sigmoidal neural networks, and monotonic \nnetworks. The 196 examples were split into 150 training examples and 46 \ntest examples. 
In order to get a statistically significant evaluation of performance, \na leave-k-out procedure was implemented in which the 196 examples were split 200 \ndifferent ways and each model was trained on the training set and tested on the \ntest set for each split. The results shown are averages over the 200 splits. \n\nTwo different approaches were used with the standard neural networks. In both \ncases, the networks were trained for 2000 batch-mode iterations of gradient descent \nwith momentum and an adaptive learning rate, which sufficed to allow the networks \nto approach minima of the training error. The first method used all 150 examples \nfor direct training and minimized the training error as much as possible. The \nsecond technique split the 150 examples into 110 for direct training and 40 used for \nvalidation, i.e., to determine when to stop training. Specifically, the mean-squared \nerror on the 40 examples was monitored over the course of the 2000 iterations, \nand the state of the network at the iteration where the lowest validation error was \nobtained was taken as the final network to be tested on the test set. In both \ncases, the networks were initialized with small random weights. The networks had \ndirect input-output connections in addition to hidden units in order to facilitate the \nimplementation of the linear aspects of the target function. \n\nThe monotonic networks were trained for 1000 batch-mode iterations of gradient \ndescent with momentum and an adaptive learning rate. The parameters of each \nhyperplane in the network were initialized to be the parameters of the linear model \nobtained from the training set, plus a small random perturbation. This procedure \nensured that the network was able to find a reasonably good fit to the data. Since \nthe meanings of the features were not known, it was not known a priori whether \nincreasing or decreasing monotonicity should hold for each feature. 
The directions \nof monotonicity were determined by observing the signs of the weights of the linear \nmodel obtained from the training data. \n\nModel         training error    test error \nLinear        3.45 \u00b1 .02        4.09 \u00b1 .06 \n10-2-1 net    1.83 \u00b1 .01        4.22 \u00b1 .14 \n10-4-1 net    1.22 \u00b1 .01        4.86 \u00b1 .16 \n10-6-1 net    0.87 \u00b1 .01        5.57 \u00b1 .20 \n10-8-1 net    0.65 \u00b1 .01        5.56 \u00b1 .16 \n\nTable 1: Performance of linear model and standard networks on bond rating problem \n\nThe results support the hypothesis of a monotonic (or at least roughly monotonic) \ntarget function. As Table 1 shows, standard neural networks have sufficient flexibility \nto fit the training data quite accurately (n-k-1 network means a 2-layer \nnetwork with n inputs, k hidden units, and 1 output). However, their excessive, \nnon-monotonic degrees of freedom lead to overfitting, and their out-of-sample performance \nis even worse than that of a linear model. The use of early stopping \nalleviates the overfitting and enables the networks to outperform the linear model (Table 2). \nWithout the monotonicity constraint, however, standard neural networks still do \nnot perform as well as the monotonic networks (Table 3). The results seem to be quite robust \nwith respect to the choice of number of hidden units for the standard networks and \nnumber and size of groups for the monotonic networks. \n\nModel         training error    test error \n10-2-1 net    2.46 \u00b1 .04        3.83 \u00b1 .09 \n10-4-1 net    2.19 \u00b1 .05        3.82 \u00b1 .08 \n10-6-1 net    2.14 \u00b1 .05        3.77 \u00b1 .07 \n10-8-1 net    2.13 \u00b1 .06        3.86 \u00b1 .09 \n\nTable 2: Performance of standard networks using early stopping on bond rating \nproblem \n\n5 Conclusion \n\nWe presented a model, the monotonic network, in which monotonicity constraints \ncan be enforced exactly, without adding a second term to the usual objective function. 
A straightforward method for implementing and training such models was \ndemonstrated, and the method was shown to outperform other methods on a real-world \nproblem. \n\nModel                           training error    test error \n2 groups, 2 planes per group    2.78 \u00b1 .05        3.71 \u00b1 .07 \n3 groups, 3 planes per group    2.64 \u00b1 .04        3.56 \u00b1 .06 \n4 groups, 4 planes per group    2.50 \u00b1 .04        3.48 \u00b1 .06 \n5 groups, 5 planes per group    2.44 \u00b1 .03        3.43 \u00b1 .06 \n\nTable 3: Performance of monotonic networks on bond rating problem \n\nSeveral areas of research regarding monotonic networks need to be addressed in \nthe future. One issue concerns the choice of the number of groups and number of \nplanes in each group. In general, the usual bias-variance tradeoff that holds for \nother models will apply here, and the optimal number of groups and planes will be \nquite difficult to determine a priori. There may be instances where additional prior \ninformation regarding the convexity or concavity of the target function can guide \nthe decision, however. Another interesting observation is that a monotonic network \ncould also be implemented by reversing the maximum and minimum operations, \ni.e., by taking the maximum over groups where each group outputs the minimum \nover all of its hyperplanes. It will be worthwhile to try to understand when one \napproach or the other is most appropriate. \n\nAcknowledgments \n\nThe author is very grateful to Yaser Abu-Mostafa for considerable guidance. I also \nthank John Moody for supplying the data. Amir Atiya, Eric Bax, Zehra Cataltepe, \nMalik Magdon-Ismail, Alexander Nicholson, and Xubo Song supplied many useful \ncomments. \n\nReferences \n\n[1] S. Geman and E. Bienenstock (1992). Neural Networks and the Bias-Variance \nDilemma. Neural Computation 4, pp 1-58. \n\n[2] D. Wolpert (1996). The Lack of A Priori Distinctions Between Learning Algorithms. 
Neural Computation 8, pp 1341-1390. \n\n[3] Y. Abu-Mostafa (1990). Learning from Hints in Neural Networks. Journal of \nComplexity 6, pp 192-198. \n\n[4] Y. Abu-Mostafa (1993). Hints and the VC Dimension. Neural Computation 4, \npp 278-288. \n\n[5] Y. Abu-Mostafa (1995). Financial Market Applications of Learning from Hints. \nNeural Networks in the Capital Markets, A. Refenes, ed., pp 221-232. Wiley, London, \nUK. \n\n[6] J. Sill and Y. Abu-Mostafa (1996). Monotonicity Hints. To appear in Advances \nin Neural Information Processing Systems 9. \n\n[7] W.W. Armstrong, C. Chu, M. M. Thomas (1996). Feasibility of using Adaptive \nLogic Networks to Predict Compressor Unit Failure. Applications of Neural Networks \nin Environment, Energy, and Health, Chapter 12. P. Keller, S. Hashem, L. Kangas, \nR. Kouzes, eds, World Scientific Publishing Company, Ltd., London. \n", "award": [], "sourceid": 1358, "authors": [{"given_name": "Joseph", "family_name": "Sill", "institution": null}]}