{"title": "On Stochastic Complexity and Admissible Models for Neural Network Classifiers", "book": "Advances in Neural Information Processing Systems", "page_first": 818, "page_last": 824, "abstract": null, "full_text": "On Stochastic Complexity and Admissible Models for Neural Network Classifiers \n\nPadhraic Smyth \n\nCommunications Systems Research \nJet Propulsion Laboratory \nCalifornia Institute of Technology \nPasadena, CA 91109 \n\nAbstract \n\nGiven some training data, how should we choose a particular network classifier from a family of networks of different complexities? In this paper we discuss how the application of stochastic complexity theory to classifier design problems can provide some insight into this question. In particular, we introduce the notion of admissible models, whereby the set of models under consideration is constrained by (among other factors) the class entropy, the amount of training data, and our prior beliefs. We discuss the implications of these results for neural architectures and demonstrate the approach on real data from a medical diagnosis task. \n\n1 Introduction and Motivation \n\nIn this paper we examine, in a general sense, the application of Minimum Description Length (MDL) techniques to the problem of selecting a good classifier from a large set of candidate models or hypotheses. Pattern recognition algorithms differ from more conventional statistical modeling techniques in that they typically choose from a very large number of candidate models to describe the available data. Hence, the problem of searching through this set of candidate models is frequently a formidable one, often approached in practice by the use of greedy algorithms. In this context, techniques which allow us to eliminate portions of the hypothesis space are of considerable interest. 
We will show in this paper that it is possible to use the intrinsic structure of the MDL formalism to eliminate large numbers of candidate models given only minimal information about the data. Our results depend on the very simple notion that models which are obviously too complex for the problem (e.g., models whose complexity exceeds that of the data itself) can be discarded from further consideration in the search for the most parsimonious model. \n\n2 Background on Stochastic Complexity Theory \n\n2.1 General Principles \n\nStochastic complexity prescribes a general theory of inductive inference from data which, unlike more traditional inference techniques, takes into account the complexity of the proposed model in addition to the standard goodness-of-fit of the model to the data. For a detailed rationale the reader is referred to the work of Rissanen (1984) or Wallace and Freeman (1987) and the references therein. Note that the Minimum Description Length (MDL) technique (as Rissanen's approach has become known) is implicitly related to Maximum A Posteriori (MAP) Bayesian estimation techniques if cast in the appropriate framework. \n\n2.2 Minimum Description Length and Stochastic Complexity \n\nFollowing the notation of Barron and Cover (1991), we have N data points, described as a sequence of tuples of observations {x_i^1, ..., x_i^K, y_i}, 1 ≤ i ≤ N, referred to as {x_i, y_i} for short. The x_i^k correspond to values taken on by the K random variables X_k (which may be continuous or discrete), while, for the purposes of this paper, the y_i are elements of the finite alphabet of the discrete m-ary class variable Y. Let Γ_N = {M_1, ..., M_|Γ_N|} be the family of candidate models under consideration. Note that by defining Γ_N as a function of N, the number of data points, we allow the possibility of considering more complicated models as more data arrives. 
For each M_j ∈ Γ_N let C(M_j) be non-negative numbers such that \n\nΣ_j 2^{-C(M_j)} ≤ 1. \n\nThe C(M_j) can be interpreted as the cost in bits of specifying model M_j; in turn, 2^{-C(M_j)} is the prior probability assigned to model M_j (suitably normalized). Let us use C = {C(M_1), ..., C(M_|Γ_N|)} to refer to a particular coding scheme for Γ_N. Hence the total description length of the data plus a model M_j is defined as \n\nL(M_j, {x_i, y_i}) = C(M_j) - log p({y_i} | M_j({x_i})), \n\ni.e., we first describe the model and then the class data relative to the given model (as a function of {x_i}, the feature data). The stochastic complexity of the data {x_i, y_i} relative to C and Γ_N is the minimum description length \n\nI({x_i, y_i}) = min_{M_j ∈ Γ_N} L(M_j, {x_i, y_i}). \n\nThe problem of finding the model of shortest description length is intractable in the general case; nonetheless, the idea of finding the best model we can is well motivated, works well in practice, and is far preferable to the alternative of ignoring the complexity issue entirely. \n\n3 Admissible Stochastic Complexity Models \n\n3.1 Definition of Admissibility \n\nWe will find it useful to define the notion of an admissible model for the classification problem: the set of admissible models Ω_N (⊆ Γ_N) is defined as all models whose complexity is such that there exists no other model whose description length is known to be smaller. In other words, inadmissible models are those whose complexity in bits is greater than some known description length: clearly they cannot be better than the best known model in terms of description length and can be eliminated from consideration. Hence, Ω_N is defined dynamically and is a function of how many description lengths we have already calculated in our search. 
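As a concrete illustration of the two description-length components, the following sketch scores a small hypothetical family of models (the costs C(M_j) and predicted class probabilities are invented for illustration) and selects the one of minimum total description length:

```python
import math

# Two-part description length (Section 2.2): the cost of the model in
# bits plus the class data encoded relative to the model's predictions.
# 'probs' holds the model's predicted probability of the true class y_i
# for each case x_i; all names here are illustrative assumptions.
def description_length(model_cost_bits, probs):
    fit_bits = -sum(math.log2(p) for p in probs)
    return model_cost_bits + fit_bits

# Stochastic complexity: the minimum description length over the family.
def stochastic_complexity(family):
    return min(description_length(c, probs) for c, probs in family)

family = [
    (2.0, [0.9, 0.8, 0.9, 0.7]),       # cheap model, decent fit
    (10.0, [0.99, 0.99, 0.99, 0.99]),  # complex model, near-perfect fit
    (1.0, [0.5, 0.5, 0.5, 0.5]),       # trivial model, chance-level fit
]
best = stochastic_complexity(family)
```

The best total length found so far also drives the admissibility test of Section 3.1: once any total length is known, a model whose cost term alone exceeds it can be discarded unseen.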
Typically Γ_N may be pre-defined, such as the class of all 3-layer feed-forward neural networks with particular activation functions. We would like to restrict our search for a good model to the set Ω_N ⊆ Γ_N as far as possible (since non-admissible models are of no practical use). In practice it may be difficult to determine the exact boundaries of Ω_N, particularly when |Γ_N| is large (with decision trees or neural networks, for example). Note that the notion of admissibility described here is particularly useful when we seek a minimal description length, or equivalently a model of maximal a posteriori probability; in situations where one's goal is to average over a number of possible models (in a Bayesian manner) a modification of the admissibility criterion would be necessary. \n\n3.2 Results for Admissible Models \n\nSimple techniques for eliminating obviously non-admissible models are of interest: for the classification problem, a necessary condition that a model M_j be admissible is that \n\nC(M_j) ≤ N · H(Y) ≤ N log(m) \n\nwhere H(Y) is the entropy of the m-ary class variable Y. The obvious interpretation in words is that any admissible model must have complexity less than that of the data itself. In addition, since the costs satisfy Σ_j 2^{-C(M_j)} ≤ 1, each admissible model contributes at least 2^{-N·H(Y)} to this sum, so the size of the space of admissible models can also be bounded: \n\n|Ω_N| ≤ 2^{N·H(Y)} ≤ m^N. \n\nOur approach suggests that, for classification at least, once we know N and the number of classes m there are strict limitations on how many admissible models we can consider. Of course the theory does not state that considering a larger subset will necessarily result in a less optimal model being found; however, it is difficult to argue the case for including large numbers of models which are clearly too complex for the problem. 
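The necessary condition above is straightforward to apply in code. In this sketch (with invented candidate costs), the bound of N times the class entropy eliminates candidates before any fitting is done:

```python
import math
from collections import Counter

# Empirical class entropy H(Y) in bits.
def class_entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

# Necessary admissibility condition (Section 3.2): a model whose cost in
# bits exceeds N * H(Y) cannot be part of a minimal description.
# The candidate costs below are illustrative assumptions.
def admissible_costs(costs_bits, labels):
    bound = len(labels) * class_entropy(labels)
    return [c for c in costs_bits if c <= bound]

labels = [0, 0, 0, 0, 1, 1, 1, 1]   # N = 8, H(Y) = 1 bit, bound = 8 bits
kept = admissible_costs([3.0, 7.5, 9.0, 40.0], labels)  # -> [3.0, 7.5]
```

Note that this filter needs only the labels, not the features: it prunes the model space given minimal information about the data, as claimed in the introduction.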
At best, such an approach will lead to an inefficient search; at worst, a very poor model will be chosen, perhaps as a result of the use of a poor coding scheme for the unnecessarily large hypothesis space. \n\n3.3 Admissible Models and Bayes Risk \n\nThe notion of minimal compression (the minimum achievable goodness-of-fit) is intimately related in the classification problem to the minimal Bayes risk for the problem (Kovalevsky, 1980). Let M_B be any model (not necessarily unique) which achieves the optimal Bayes risk (i.e., minimizes the classifier error) for the classification problem. In particular, the goodness-of-fit term -log p({y_i} | M_B({x_i})) is not necessarily zero; indeed, in most practical problems of interest it is non-zero, due to the ambiguity in the mapping from the feature space to the class variable. In addition, M_B may not be a member of the set Γ_N, and hence M_B need not even be admissible. If, in the limit as N → ∞, M_B ∉ Γ_∞, then there is a fundamental approximation error in the representation being used, i.e., the family of models under consideration is not flexible enough to optimally represent the mapping from {x_i} to {y_i}. Smyth (1991) has shown how information about the Bayes error rate for the problem (if available) can be used to further tighten the bounds on admissibility. \n\n4 Applying Minimum Description Length Principles to Neural Network Design \n\nIn principle the admissibility results can be applied to a variety of classifier design problems; applications to Markov model selection and decision tree design are described elsewhere (Smyth, 1991). In this paper we limit our attention to the problem of automatically selecting a feedforward multi-layer network architecture. 
\n\n4.1 Calculation of the Goodness-of-Fit \n\nAs is clear from the preceding discussion, application of the MDL principle to classifier selection requires that the classifier produce a posterior probability estimate of the class labels. In the context of a network model this is not a problem, provided the network is trained to provide such estimates. This requires a simple modification of the objective function to a log-likelihood function, -Σ_{i=1}^{N} log p̂(y_i | x_i), where y_i is the class label of the ith training datum and p̂(·) is the network's estimate of the true posterior p(·). This function has been proposed in the literature in the past under the guise of a cross-entropy measure (for the special case of binary classes) and more recently it has been derived from the more basic arguments of Minimum Mutual Information (MMI) (Bridle, 1990) and Maximum Likelihood (ML) estimation (Gish, 1990). The cross-entropy function for network training is nothing more than the goodness-of-fit component of the description length criterion. Hence, both MMI and ML (since they are equivalent in this case) are special cases of the MDL procedure wherein the complexity term is a constant and is left out of the optimization (all models are assumed to be equally likely and likelihood alone is used as the decision criterion). \n\n4.2 Complexity Penalization for Multi-layer Perceptron Models \n\nIt has been proposed in the past (Barron, 1989) to use a penalty term of (k/2) log N, where k is the number of parameters (weights and biases) in the network. The origins of this complexity measure lie in general arguments originally proposed by Rissanen (1984). However, this penalty term is too large: Cybenko (1990) has pointed out that existing successful applications of networks have far more parameters than could possibly be justified by a statistical analysis, given the amount of training data used to construct the network. 
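To make the connection explicit, here is a minimal sketch of the goodness-of-fit term in bits for two invented networks of equal complexity; choosing between them by likelihood alone is then exactly the MDL choice with a constant complexity term:

```python
import math

# Goodness-of-fit term of the description length (Section 4.1): the
# negative log-likelihood, in bits, of the class labels under the
# network's posterior estimates. Each prediction is a list of per-class
# probabilities; the two toy networks below are invented for illustration.
def cross_entropy_bits(true_labels, predicted_probs):
    return -sum(math.log2(p[y]) for y, p in zip(true_labels, predicted_probs))

y = [0, 1, 1, 0]
net_a = [[0.9, 0.1], [0.2, 0.8], [0.3, 0.7], [0.8, 0.2]]  # sharper posteriors
net_b = [[0.6, 0.4], [0.5, 0.5], [0.5, 0.5], [0.6, 0.4]]  # vaguer posteriors

# With equal complexity terms, MDL reduces to picking the maximum
# likelihood (minimum cross-entropy) network.
better = 'net_a' if cross_entropy_bits(y, net_a) < cross_entropy_bits(y, net_b) else 'net_b'
```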
The critical factor lies in the precision to which these parameters are stated in the final model. In essence, the principle of MDL (and Bayesian techniques) dictates that the data only justifies stating any parameter in the model to some finite precision, inversely proportional to the inherent variance of the estimate. Approximate techniques for the calculation of the complexity terms in this manner have been proposed (Weigend, Huberman and Rumelhart, this volume) but a complete description length analysis has not yet appeared in the literature. \n\n4.3 Complexity Penalization for a Discrete Network Model \n\nIt turns out that there are alternatives to multi-layer perceptrons whose complexity is much easier to calculate. We will look in particular at the rule-based network of Goodman et al. (1990). In this model the hidden units correspond to Boolean combinations of discrete input variables. The link weights from hidden to output (class) nodes are proportional to log conditional probabilities of the class given the activation of a hidden node. The output nodes form estimates of the posterior class probabilities by a simple summation followed by a normalization. The implicit assumption of conditional independence is ameliorated in practice by the fact that the hidden units are chosen in a manner which ensures that the assumption is violated as little as possible. \n\nThe complexity penalty for the network is calculated as (1/2) log N per link from the hidden to output layers, plus an appropriate coding term for the specification of the hidden units. Hence, the description length of a network with k hidden units is \n\nL = -Σ_{i=1}^{N} log p̂(y_i | x_i) + (k/2) log N - Σ_{i=1}^{k} log π(o_i) \n\nwhere o_i is the order of the ith hidden node and π(o_i) is a prior probability on the orders. 
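This description length is simple enough to evaluate directly. The following sketch does so for an invented configuration; the geometric-style prior on hidden-unit orders is assumed here purely for illustration and is not the one used in the paper:

```python
import math

# Description length of the rule-based discrete network (Section 4.3):
# goodness-of-fit plus (1/2) log N bits per hidden-to-output link plus
# the cost of stating each hidden unit's Boolean order o_i under a
# prior pi(o).
def rule_network_dl(probs, hidden_orders, prior):
    n, k = len(probs), len(hidden_orders)
    fit = -sum(math.log2(p) for p in probs)       # class data given the model
    links = (k / 2) * math.log2(n)                # (1/2) log N per link
    units = -sum(math.log2(prior[o]) for o in hidden_orders)  # unit orders
    return fit + links + units

prior = {1: 0.5, 2: 0.25, 3: 0.125}  # assumed geometric-style prior
# 100 cases each fit with probability 0.9 by k = 3 units of orders 1, 2, 2:
dl = rule_network_dl([0.9] * 100, [1, 2, 2], prior)
```

Adding a hidden unit is then worthwhile exactly when its improvement to the fit term outweighs its (1/2) log N link cost plus its order cost.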
Using this definition of description length, it follows from our earlier results on admissible models that the number of hidden units in the architecture is upper bounded by \n\nk ≤ N·H(Y) / ((1/2) log N + log K + 1) \n\nwhere K is the number of binary input attributes. \n\n4.4 Application to a Medical Diagnosis Problem \n\nWe consider the application of our techniques to the discovery of a parsimonious network for breast cancer diagnosis, using the discrete network model. A common technique in breast cancer diagnosis is to obtain a fine needle aspirate (FNA) from the patient. The FNA sample is then evaluated under a microscope by a physician who makes a diagnosis. Ground truth in the form of binary class labels (\"benign\" or \"malignant\") is obtained by re-examination or biopsy at a later stage. Wolberg and Mangasarian (1991) described the collection of a database of such information. The feature information consisted of subjective evaluations of nine FNA sample characteristics such as uniformity of cell size, marginal adhesion and mitoses. The training data consists of 439 such FNA samples obtained from real patients which were later assigned class labels. Given that the prior class entropy is almost 1 bit, one can immediately state from our bounds that networks with more than 51 hidden units are inadmissible. Furthermore, as we evaluate different models we can narrow the region of admissibility using the results stated earlier. Figure 1 gives a graphical interpretation of this procedure. \n\n[Figure 1 plots the number of hidden units (y-axis, 0 to 40) against description length in bits (x-axis, 100 to 350), showing the description length of each greedily grown network together with the corresponding upper bound on admissible complexity; the region above the bound is inadmissible.] \n\nFigure 1. Inadmissible region as a function of description length \n\nThe algorithm effectively moves up the left-hand axis, adding hidden units in a greedy manner. Initially the description length (the lower curve) decreases rapidly as we capture the gross structure in the data. For each model for which we calculate a description length, we can in turn calculate an upper bound on admissibility (the upper curve); this bound is linear in description length. Hence, for example, by the time we have 5 hidden units we know that any model with more than 21 hidden units is inadmissible. Finally, a local minimum of the description length function is reached at 12 units, at which point we know that the optimal solution can have at most 16 hidden units. As a matter of interest, the resulting network with 12 hidden units correctly classified 94 of 96 independent test cases. \n\n5 Conclusion \n\nThere are a variety of related issues which arise in this context which we can only briefly mention due to space constraints. For example, how does the prior \"model entropy\", H(Ω_N) = -Σ_i p(M_i) log p(M_i), affect the complexity of the search problem? Questions also naturally arise as to how Ω_N should grow as a function of N in an incremental learning scenario. \n\nIn conclusion, it should not be construed from this paper that consideration of admissible models is the major factor in inductive inference; certainly the choice of description lengths for the various models and the use of efficient optimization techniques for seeking the parameters of each model remain the cornerstones of success. Nonetheless, our results provide useful theoretical insight and are practical to the extent that they provide a \"sanity check\" for model selection in MDL. 
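As a numerical check on the hidden-unit bound of Section 4.3 as applied to the diagnosis task of Section 4.4, the sketch below evaluates it for N = 439 samples, a class entropy of 1 bit, and K = 9 input attributes:

```python
import math

# Upper bound on admissible hidden units for the discrete network:
# k <= N*H(Y) / ((1/2) log N + log K + 1), everything in bits.
# N, H(Y) and K below are taken from the diagnosis task in the text.
def max_hidden_units(n, class_entropy_bits, num_inputs):
    denom = 0.5 * math.log2(n) + math.log2(num_inputs) + 1
    return math.floor(n * class_entropy_bits / denom)

limit = max_hidden_units(439, 1.0, 9)  # -> 51
```

Rounding the raw value of about 51.3 down reproduces the limit of 51 hidden units quoted in Section 4.4.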
\n\nAcknowledgments \n\nThe research described in this paper was performed at the Jet Propulsion Laboratory, California Institute of Technology, under a contract with the National Aeronautics and Space Administration. In addition, this work was supported in part by the Air Force Office of Scientific Research under grant number AFOSR-90-0199. \n\nReferences \n\nA. R. Barron (1989), 'Statistical properties of artificial neural networks,' in Proceedings of the 1989 IEEE Conference on Decision and Control. \n\nA. R. Barron and T. M. Cover (1991), 'Minimum complexity density estimation,' to appear in IEEE Trans. Inform. Theory. \n\nJ. Bridle (1990), 'Training stochastic model recognition algorithms as networks can lead to maximum mutual information estimation of parameters,' in D. S. Touretzky (ed.), Advances in Neural Information Processing Systems 1, pp. 211-217, San Mateo, CA: Morgan Kaufmann. \n\nG. Cybenko (1990), 'Complexity theory of neural networks and classification problems,' preprint. \n\nH. Gish (1991), 'Maximum likelihood training of neural networks,' to appear in Proceedings of the Third International Workshop on AI and Statistics, (D. Hand, ed.), Chapman and Hall: London. \n\nR. M. Goodman, C. Higgins, J. W. Miller, and P. Smyth (1990), 'A rule-based approach to neural network classifiers,' in Proceedings of the 1990 International Neural Network Conference, Paris, France. \n\nV. A. Kovalevsky (1980), Image Pattern Recognition, translated from Russian by A. Brown, New York: Springer-Verlag, p. 79. \n\nJ. Rissanen (1984), 'Universal coding, information, prediction, and estimation,' IEEE Trans. Inform. Theory, vol. 30, pp. 629-636. \n\nP. Smyth (1991), 'Admissible stochastic complexity models for classification problems,' to appear in Proceedings of the Third International Workshop on AI and Statistics, (D. Hand, ed.), Chapman and Hall: London. \n\nC. S. Wallace and P. R. Freeman (1987), 'Estimation and inference by compact coding,' J. Royal Stat. Soc. B, vol. 49, no. 3, pp. 240-251. \n\nW. H. Wolberg and O. L. Mangasarian (1991), 'Multi-surface method of pattern separation applied to breast cytology diagnosis,' Proceedings of the National Academy of Sciences, in press. \n", "award": [], "sourceid": 435, "authors": [{"given_name": "Padhraic", "family_name": "Smyth", "institution": null}]}