{"title": "An Information-theoretic Learning Algorithm for Neural Network Classification", "book": "Advances in Neural Information Processing Systems", "page_first": 591, "page_last": 597, "abstract": null, "full_text": "An Information-theoretic Learning Algorithm for Neural Network Classification \n\nDavid J. Miller \n\nDepartment of Electrical Engineering \nThe Pennsylvania State University \nState College, PA 16802 \n\nAjit Rao, Kenneth Rose, and Allen Gersho \n\nDepartment of Electrical and Computer Engineering \nUniversity of California \nSanta Barbara, CA 93106 \n\nAbstract \n\nA new learning algorithm is developed for the design of statistical \nclassifiers minimizing the rate of misclassification. The method, \nwhich is based on ideas from information theory and analogies to \nstatistical physics, assigns data to classes in probability. The distributions \nare chosen to minimize the expected classification error \nwhile simultaneously enforcing the classifier's structure and a level \nof \"randomness\" measured by Shannon's entropy. Achievement of \nthe classifier structure is quantified by an associated cost. The constrained \noptimization problem is equivalent to the minimization of \na Helmholtz free energy, and the resulting optimization method \nis a basic extension of the deterministic annealing algorithm that \nexplicitly enforces structural constraints on assignments while reducing \nthe entropy and expected cost with temperature. In the \nlimit of low temperature, the error rate is minimized directly and a \nhard classifier with the requisite structure is obtained. This learning \nalgorithm can be used to design a variety of classifier structures. 
\nThe approach is compared with standard methods for radial basis \nfunction design and is demonstrated to substantially outperform \nother design methods on several benchmark examples, while often \nretaining design complexity comparable to, or only moderately \ngreater than that of strict descent-based methods. \n\n1 Introduction \n\nThe problem of designing a statistical classifier to minimize the probability of misclassification \nor a more general risk measure has been a topic of continuing interest \nsince the 1950s. Recently, with the increase in power of serial and parallel computing \nresources, a number of complex neural network classifier structures have been proposed, \nalong with associated learning algorithms to design them. While these structures \noffer great potential for classification, this potential cannot be fully realized \nwithout effective learning procedures well-matched to the minimum classification-error \nobjective. Methods such as back propagation which approximate class targets \nin a squared error sense do not directly minimize the probability of error. Rather, \nit has been shown that these approaches design networks to approximate the class \na posteriori probabilities. The probability estimates can then be used to form a decision \nrule. While large networks can in principle accurately approximate the Bayes \ndiscriminant, in practice the network size must be constrained to avoid overfitting \nthe (finite) training set. Thus, discriminative learning techniques, e.g. (Juang and \nKatagiri, 1992), which seek to directly minimize classification error may achieve \nbetter results. However, these methods may still be susceptible to finding shallow \nlocal minima far from the global minimum. 
\n\nAs an alternative to strict descent-based procedures, we propose a new deterministic \nlearning algorithm for statistical classifier design with a demonstrated potential \nfor avoiding local optima of the cost. Several deterministic, annealing-based techniques \nhave been proposed for avoiding nonglobal optima in computer vision and \nimage processing (Yuille, 1990), (Geiger and Girosi, 1991), in combinatorial optimization, \nand elsewhere. Our approach is derived based on ideas from information \ntheory and statistical physics, and builds on the probabilistic framework of the deterministic \nannealing (DA) approach to clustering and related problems (Rose et \nal., 1990, 1992, 1993). In the DA approach for data clustering, the probability distributions \nare chosen to minimize the expected clustering cost, given a constraint \non the level of randomness, as measured by Shannon's entropy (footnote 1). \n\nIn this work, the DA approach is extended in a novel way, most significantly to \nincorporate structural constraints on data assignments, but also to minimize the \nprobability of error as the cost. While the general approach we suggest is likely \napplicable to problems of structured vector quantization and regression as well, we \nfocus on the classification problem here. Most design methods have been developed \nfor specific classifier structures. In this work, we will develop a general approach but \nonly demonstrate results for RBF classifiers. The design of nearest prototype and \nMLP classifiers is considered in (Miller et al., 1995a,b). Our method provides substantial \nperformance gains over conventional designs for all of these structures, while \nretaining design complexity in many cases comparable to the strict descent methods. \nOur approach often designs small networks to achieve training set performance \nthat can only be obtained by a much larger network designed in a conventional way. 
\nThe design of smaller networks may translate to superior performance outside the \ntraining set. \n\nFootnote 1: Note that in (Rose et al., 1990, 1992, 1993), the DA method was formally derived using \nthe maximum entropy principle. Here we emphasize the alternative, but mathematically \nequivalent description that the chosen distributions minimize the expected cost given constrained \nentropy. This formulation may have more intuitive appeal for the optimization \nproblem at hand. \n\n2 Classifier Design Formulation \n\n2.1 Problem Statement \n\nLet $\\mathcal{T} = \\{(x,c)\\}$ be a training set of $N$ labelled vectors, where $x \\in \\mathcal{R}^n$ is a feature \nvector and $c \\in \\mathcal{I}$ is its class label from an index set $\\mathcal{I}$. A classifier is a mapping \n$C : \\mathcal{R}^n \\to \\mathcal{I}$, which assigns a class label in $\\mathcal{I}$ to each vector in $\\mathcal{R}^n$. Typically, the \nclassifier is represented by a set of model parameters $\\Lambda$. The classifier specifies a \npartitioning of the feature space into regions $R_j = \\{x \\in \\mathcal{R}^n : C(x) = j\\}$, where \n$\\bigcup_j R_j = \\mathcal{R}^n$ and $\\bigcap_j R_j = \\emptyset$. It also induces a partitioning of the training set into \nsets $\\mathcal{T}_j \\subset \\mathcal{T}$, where $\\mathcal{T}_j = \\{(x,c) : x \\in R_j, (x,c) \\in \\mathcal{T}\\}$. A training pair $(x,c) \\in \\mathcal{T}$ \nis misclassified if $C(x) \\neq c$. The performance measure of primary interest is the \nempirical error fraction $P_e$ of the classifier, i.e. the fraction of the training set (for \ngeneralization purposes, the fraction of the test set) which is misclassified: \n\n$$P_e = \\frac{1}{N} \\sum_{(x,c) \\in \\mathcal{T}} \\delta(c, C(x)) = \\frac{1}{N} \\sum_{j \\in \\mathcal{I}} \\sum_{(x,c) \\in \\mathcal{T}_j} \\delta(c, j),$$ \n\n(1) \n\nwhere $\\delta(c,j) = 1$ if $c \\neq j$ and 0 otherwise. In this work, we will assume that the \nclassifier produces an output $F_j(x)$ associated with each class, and uses a \"winner-take-all\" \nclassification rule, which is consistent with MLP and RBF-based classification: \n\n$$R_j \\equiv \\{x \\in \\mathcal{R}^n : F_j(x) \\geq F_k(x) \\quad \\forall k \\in \\mathcal{I}\\}.$$ 
\n\n(2) \n\n2.2 Randomized Classifier Partition \n\nAs in the original DA approach for clustering (Rose et al., 1990, 1992), we cast \nthe optimization problem in a framework in which data are assigned to classes \nin probability. Accordingly, we define the probabilities of association between a \nfeature $x$ and the class regions, i.e. $\\{P[x \\in R_j]\\}$. As our design method, which \noptimizes over these probabilities, must ultimately form a classifier that makes \n\"hard\" decisions based on a specified network model, the distributions must be \nchosen to be consistent with the decision rule of the model. In other words, we \nneed to introduce randomness into the classifier's partition. Clearly, there are many \nways one could define probability distributions which are consistent with the hard \npartition at some limit. We use an information-theoretic approach. We measure the \nrandomness or uncertainty by Shannon's entropy, and determine the distribution \nfor a given level of entropy. At the limit of zero entropy we should recover a hard \npartition. For now, suppose that the values of the model parameters $\\Lambda$ have been \nfixed. We can then write an objective function whose maximization determines the \nhard partition for a given $\\Lambda$: \n\n$$F_h = \\frac{1}{N} \\sum_{j \\in \\mathcal{I}} \\sum_{(x,c) \\in \\mathcal{T}_j} F_j(x).$$ \n\n(3) \n\nNote specifically that maximizing (3) over all possible partitions captures the decision \nrule of (2). The probabilistic generalization of (3) is \n\n$$F = \\frac{1}{N} \\sum_{(x,c) \\in \\mathcal{T}} \\sum_j P[x \\in R_j] F_j(x),$$ \n\n(4) \n\nwhere the (randomized) partition is now represented by association probabilities, \nand the corresponding entropy is \n\n$$H = -\\frac{1}{N} \\sum_{(x,c) \\in \\mathcal{T}} \\sum_j P[x \\in R_j] \\log P[x \\in R_j].$$ \n\n(5) \n\nWe determine the distribution at a given level of randomness as the one which \nmaximizes $F$ while maintaining $H$ at a prescribed level $\\hat{H}$: \n\n$$\\max_{\\{P[x \\in R_j]\\}} F \\quad \\text{subject to} \\quad H = \\hat{H}.$$ 
\n\n(6) \n\nThe result is the best probabilistic partition, in the sense of $F$, at the specified level \nof randomness. For $\\hat{H} = 0$ we get back the hard partition maximizing (3). At any \n$\\hat{H}$, the solution of (6) is the Gibbs distribution \n\n$$P[x \\in R_j] \\equiv P_{j|x}(\\Lambda) = \\frac{e^{\\gamma F_j(x)}}{\\sum_k e^{\\gamma F_k(x)}},$$ \n\n(7) \n\nwhere $\\gamma$ is the Lagrange multiplier. For $\\gamma \\to 0$, the associations become increasingly \nuniform, while for $\\gamma \\to \\infty$, they revert to hard classifications, equivalent to \napplication of the rule in (2). Note that the probabilities depend on $\\Lambda$ through the \nnetwork outputs. Here we have emphasized this dependence through our choice of \nconcise notation. \n\n2.3 Information-Theoretic Classifier Design \n\nUntil now we have formulated a controlled way of introducing randomness into \nthe classifier's partition while enforcing its structural constraint. However, the \nderivation assumed that the model parameters were given, and thus produced only \nthe form of the distribution $P_{j|x}(\\Lambda)$, without actually prescribing how to choose the \nvalues of its parameter set. Moreover, the derivation did not consider the ultimate \ngoal of minimizing the probability of error. Here we remedy both shortcomings. \n\nThe method we suggest gradually enforces formation of a hard classifier minimizing \nthe probability of error. We start with a highly random classifier and a high expected \nmisclassification cost. We then gradually reduce both the randomness and the cost \nin a deterministic learning process which enforces formation of a hard classifier \nwith the requisite structure. As before, we need to introduce randomness into the \npartition while enforcing the classifier's structure, only now we are also interested \nin minimizing the expected misclassification cost. 
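
As a concrete illustration of the quantities defined so far, the following Python sketch evaluates the Gibbs association probabilities of (7), the entropy of (5), and an expected misclassification cost for fixed network outputs. The output values, labels, and function names are hypothetical, and the expected cost is assumed here to be the average probability mass placed on incorrect classes; this is a sketch under those assumptions, not the authors' implementation.

```python
import math

def gibbs_probabilities(outputs, gamma):
    # Association probabilities of eq. (7): a softmax of gamma * F_j(x).
    scaled = [gamma * f for f in outputs]
    shift = max(scaled)  # subtract the maximum for numerical stability
    weights = [math.exp(s - shift) for s in scaled]
    total = sum(weights)
    return [w / total for w in weights]

def entropy(prob_rows):
    # Average Shannon entropy of eq. (5), in nats; 0 * log 0 is treated as 0.
    n = len(prob_rows)
    return -sum(p * math.log(p) for row in prob_rows for p in row if p > 0.0) / n

def expected_error(prob_rows, labels):
    # Assumed form of the expected misclassification cost: the average
    # probability mass assigned to classes other than the true label.
    n = len(prob_rows)
    return sum(1.0 - row[c] for row, c in zip(prob_rows, labels)) / n

# Hypothetical outputs F_j(x) for three training vectors and two classes.
outputs = [[2.0, 1.0], [0.5, 1.5], [1.2, 1.0]]
labels = [0, 1, 1]

for gamma in (0.0, 1.0, 100.0):
    rows = [gibbs_probabilities(f, gamma) for f in outputs]
    print(gamma, round(entropy(rows), 4), round(expected_error(rows, labels), 4))
```

Raising gamma from zero moves the associations from uniform toward the winner-take-all rule of (2), so the entropy falls toward zero and the expected cost approaches the hard error fraction of (1).
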
While satisfying these multiple \nobjectives may appear to be a formidable task, the problem is greatly simplified by \nrestricting the choice of random classifiers to the set of distributions $\\{P_{j|x}(\\Lambda)\\}$ as \ngiven in (7) - these random classifiers naturally enforce the structural constraint \nthrough $\\gamma$. Thus, from the parametrized set $\\{P_{j|x}(\\Lambda)\\}$, we seek that distribution \nwhich minimizes the average misclassification cost while constraining the entropy: \n\n$$\\min_{\\Lambda,\\gamma} \\langle P_e \\rangle \\quad \\text{subject to} \\quad H = \\hat{H}.$$ \n\n(8) \n\nThe solution yields the best random classifier in the sense of minimum $\\langle P_e \\rangle$ for a \ngiven $\\hat{H}$. At the limit of zero entropy, we should get the best hard classifier in the \nsense of $P_e$ with the desired structure, i.e. satisfying (2). \n\nThe constrained minimization (8) is equivalent to the unconstrained minimization \nof the Lagrangian: \n\n$$\\min_{\\Lambda,\\gamma} L \\equiv \\min_{\\Lambda,\\gamma} \\beta \\langle P_e \\rangle - H,$$ \n\n(9) \n\nwhere $\\beta$ is the Lagrange multiplier associated with (8). For $\\beta = 0$, the sole objective \nis entropy maximization, which is achieved by the uniform distribution. This \nsolution, which is the global minimum for $L$ at $\\beta = 0$, can be obtained by choosing \n$\\gamma = 0$. At the other end of the spectrum, for $\\beta \\to \\infty$, the sole objective is \nto minimize $\\langle P_e \\rangle$, and is achieved by choosing a non-random (hard) classifier \n(hence minimizing $P_e$). The hard solution satisfies the classification rule (2) and is \nobtained for $\\gamma \\to \\infty$. \n\nMotivation for minimizing the Lagrangian can be obtained from a physical perspective \nby noting that $L$ is the Helmholtz free energy of a simulated system, with \n$\\langle P_e \\rangle$ the \"energy\", $H$ the system entropy, and $1/\\beta$ the \"temperature\". 
Thus, from \nthis physical view we can suggest a deterministic annealing (DA) process which \ninvolves minimizing $L$ starting at the global minimum for $\\beta = 0$ (high temperature) \nand tracking the solution while increasing $\\beta$ towards infinity (zero temperature). \nIn this way, we obtain a sequence of solutions of decreasing entropy and average \nmisclassification cost. Each such solution is the best random classifier in the sense \nof $\\langle P_e \\rangle$ for a given level of randomness. The annealing process is useful for \navoiding local optima of the cost $\\langle P_e \\rangle$, and minimizes $\\langle P_e \\rangle$ directly at low \ntemperature. While this annealing process ostensibly involves the quantities $H$ and \n$\\langle P_e \\rangle$, the restriction to $\\{P_{j|x}(\\Lambda)\\}$ from (7) ensures that the process also enforces \nthe structural constraint on the classifier in a controlled way. Note in particular \nthat $\\gamma$ has not lost its interpretation as a Lagrange multiplier determining $F$. Thus, \n$\\gamma = 0$ means that $F$ is unconstrained - we are free to choose the uniform distribution. \nSimilarly, sending $\\gamma \\to \\infty$ requires maximizing $F$ - hence the hard solution. \nSince $\\gamma$ is chosen to minimize $L$, this parameter effectively determines the level of \n$F$ - the level of structural constraint - consistent with $H$ and $\\langle P_e \\rangle$ for a given \n$\\beta$. As $\\beta$ is increased, the entropy constraint is relaxed, allowing greater satisfaction \nof both the minimum $\\langle P_e \\rangle$ and maximum $F$ objectives. Thus, annealing in $\\beta$ \ngradually enforces both the structural constraint (via $\\gamma$) and the minimum $\\langle P_e \\rangle$ \nobjective (footnote 2). \n\nOur formulation clearly identifies what distinguishes the annealing approach from \ndirect descent procedures. Note that a descent method could be obtained by simply \nneglecting the constraint on the entropy, instead choosing to directly minimize \n$\\langle P_e \\rangle$ over the parameter set. 
This minimization will directly lead to a hard classifier, \nand is akin to the method described in (Juang and Katagiri, 1992) as well as other \nrelated approaches which attempt to directly minimize a smoothed probability of \nerror cost. However, as we will experimentally verify through simulations, our \nannealing approach outperforms design based on directly minimizing $\\langle P_e \\rangle$. \n\nFor conciseness, we will not derive necessary optimality conditions for minimizing \nthe Lagrangian at a given temperature, nor will we specialize the formulation for \nindividual classification structures here. The reader is referred to (Miller et al., \n1995a) for these details. \n\nFootnote 2: While not shown here, the method does converge directly for $\\beta \\to \\infty$, and at this limit \nenforces the classifier's structure. \n\n3 Experimental Comparisons \n\nMethod | DA | DA | TR-RBF | TR-RBF | TR-RBF | TR-RBF | MD-RBF | MD-RBF | G-RBF \nM | 4 | 30 | 4 | 10 | 30 | 50 | 10 | 50 | 10 \n$P_e$ (train) | 0.11 | 0.028 | 0.33 | 0.162 | 0.145 | 0.129 | 0.3 | 0.18 | 0.18 \n$P_e$ (test) | 0.13 | 0.167 | 0.35 | 0.165 | 0.168 | 0.179 | 0.37 | 0.19 | 0.20 \n\nTable 1: A comparison of DA with known design techniques for RBF classification \non the 40-dimensional noisy waveform data from (Breiman et al., 1980). \n\nWe demonstrate the performance of our design approach in comparison with other \nmethods for the normalized RBF structure (Moody and Darken, 1989). For the DA \nmethod, steepest descent was used to minimize $L$ at a sequence of exponentially \nincreasing $\\beta$, given by $\\beta(n+1) = \\alpha \\beta(n)$, for $\\alpha$ between 1.05 and 1.1. We have \nfound that much of the optimization occurs at or near a critical temperature in the \nsolution process. Beyond this critical temperature, the annealing process can often \nbe \"quenched\" to zero temperature by sending $\\gamma \\to \\infty$ without incurring significant \nperformance loss. 
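
The exponentially increasing schedule described in this section can be sketched in Python as follows. This is only an illustration: the function name, the starting value beta_start, and the stopping value beta_stop are assumptions, since the text specifies only a growth factor alpha between 1.05 and 1.1. A strictly positive beta_start is also assumed, because a multiplicative schedule cannot move away from beta = 0.

```python
def annealing_schedule(beta_start, beta_stop, alpha):
    # Multiply beta by alpha at each step: beta(n + 1) = alpha * beta(n).
    # beta_start and beta_stop are hypothetical bounds chosen by the user.
    if beta_start <= 0.0 or alpha <= 1.0:
        raise ValueError('beta_start must be positive and alpha must exceed 1')
    betas = []
    beta = beta_start
    while beta <= beta_stop:
        betas.append(beta)
        beta *= alpha
    return betas

schedule = annealing_schedule(beta_start=0.1, beta_stop=10.0, alpha=1.1)
print(len(schedule), schedule[0], schedule[-1])
```

In a full design loop, each value in the schedule would be followed by a descent on the Lagrangian at that beta, starting from the previous solution; that inner step is omitted here.
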
Quenching the process often makes the design complexity of our \nmethod comparable to that of descent-based methods such as back propagation or \ngradient descent on $\\langle P_e \\rangle$. \n\nWe have compared our RBF design approach with the method in (Moody \nand Darken, 1989) (MD-RBF), with a method described in (Tarassenko and \nRoberts, 1994) (TR-RBF), with the approach in (Musavi et al., 1992), and with \nsteepest descent on $\\langle P_e \\rangle$ (G-RBF). MD-RBF combines unsupervised learning \nof receptive field parameters with supervised learning of the weights from the receptive \nfields so as to minimize the squared distance to target class outputs. The \nprimary advantage of this approach is its modest design complexity. However, the \nreceptive fields are not optimized in a supervised fashion, which can cause performance \ndegradation. TR-RBF optimizes all of the RBF parameters to approximate \ntarget class outputs. This design is more complex than MD-RBF and achieves better \nperformance for a given model size. However, as aforementioned, the TR-RBF \ndesign objective is not equivalent to minimizing $P_e$, but rather to approximating \nthe Bayes-optimal discriminant. While direct descent on $\\langle P_e \\rangle$ may minimize \nthe \"right\" objective, problems of local optima may be quite severe. In fact, we \nhave found that the performance of all of these methods can be quite poor without \na judicious initialization. For all of these methods, we have employed the unsupervised \nlearning phase described in (Moody and Darken, 1989) (based on Isodata \nclustering and variance estimation) as model initialization. Then, steepest descent \nwas performed on the respective cost surface. We have found that the complexity \nof our design is typically 1-5 times that of TR-RBF or G-RBF (though occasionally \nour design is actually faster than G-RBF). 
Accordingly, we have chosen the best \nresults based on five random initializations for these techniques, and compared with \nthe single DA design run. \n\nOne example reported here is the 40D \"noisy\" waveform data used in (Breiman et \nal., 1980) (obtained from the UC-Irvine machine learning database repository). We \nsplit the 5000 vectors into equal size training and test sets. Our results in Table \n1 demonstrate quite substantial performance gains over all the other methods, and \nperformance quite close to the estimated Bayes rate of 14%. Note in particular \nthat the other methods perform quite poorly for a small number of receptive fields \n($M$), and need to increase $M$ to achieve training set performance comparable to \nour approach. However, performance on the test set does not necessarily improve, \nand may degrade for increasing $M$. \n\nTo further justify this claim, we compared our design with results reported in \n(Musavi et al., 1992), for the two and eight dimensional mixture examples. For \nthe 2D example, our method achieved $P_{e,\\text{train}} = 6.0\\%$ for a 400 point training set \nand $P_{e,\\text{test}} = 6.1\\%$ on a 20,000 point test set, using $M = 3$ units (these results \nare near-optimal, based on the Bayes rate). By contrast, the method of Musavi et \nal. used 86 receptive fields and achieved $P_{e,\\text{test}} = 9.26\\%$. For the 8D example and \n$M = 5$, our method achieved $P_{e,\\text{train}} = 8\\%$ and $P_{e,\\text{test}} = 9.4\\%$ (again near-optimal), \nwhile the method in (Musavi et al., 1992) achieved $P_{e,\\text{test}} = 12.0\\%$ using $M = 128$. \n\nIn summary, we have proposed a new, information-theoretic learning algorithm for \nclassifier design, demonstrated to outperform other design methods, and with general \napplicability to a variety of structures. Future work may investigate important \napplications, such as recognition problems for speech and images. 
Moreover, our \nextension of DA to incorporate structure is likely applicable to structured vector \nquantizer design and to regression modelling. These problems will be considered in \nfuture work. \n\nAcknowledgements \n\nThis work was supported in part by the National Science Foundation under grant \nno. NCR-9314335, the University of California MICRO program, DSP Group, \nInc., Echo Speech Corporation, Moseley Associates, National Semiconductor Corp., \nQualcomm, Inc., Rockwell International Corporation, Speech Technology Labs, and \nTexas Instruments, Inc. \n\nReferences \n\nL. Breiman, J. H. Friedman, R. A. Olshen, and C. J. Stone. Classification and Regression \nTrees. The Wadsworth Statistics/Probability Series, Belmont, CA, 1980. \n\nD. Geiger and F. Girosi. Parallel and deterministic algorithms from MRFs: Surface \nreconstruction. IEEE Trans. on Patt. Anal. and Mach. Intell., 13:401-412, 1991. \n\nB.-H. Juang and S. Katagiri. Discriminative learning for minimum error classification. \nIEEE Trans. on Sig. Proc., 40:3043-3054, 1992. \n\nD. Miller, A. Rao, K. Rose, and A. Gersho. A global optimization technique for \nstatistical classifier design. (Submitted for publication), 1995. \n\nD. Miller, A. Rao, K. Rose, and A. Gersho. A maximum entropy framework for \noptimal statistical classification. In IEEE Workshop on Neural Networks for Signal \nProcessing, 1995. \n\nJ. Moody and C. J. Darken. Fast learning in locally-tuned processing units. Neural \nComp., 1:281-294, 1989. \n\nM. T. Musavi, W. Ahmed, K. H. Chan, K. B. Faris, and D. M. Hummels. On the \ntraining of radial basis function classifiers. Neural Networks, 5:595-604, 1992. \n\nK. Rose, E. Gurewitz, and G. C. Fox. Statistical mechanics and phase transitions \nin clustering. Phys. Rev. Lett., 65:945-948, 1990. \n\nK. Rose, E. Gurewitz, and G. C. Fox. Vector quantization by deterministic annealing. \nIEEE Trans. on Inform. Theory, 38:1249-1258, 1992. \n\nK. Rose, E. 
Gurewitz, and G. C. Fox. Constrained clustering as an optimization \nmethod. IEEE Trans. on Patt. Anal. and Mach. Intell., 15:785-794, 1993. \n\nL. Tarassenko and S. Roberts. Supervised and unsupervised learning in radial basis \nfunction classifiers. IEE Proc.-Vis. Image Sig. Proc., 141:210-216, 1994. \n\nA. L. Yuille. Generalized deformable models, statistical physics, and matching \nproblems. Neural Comp., 2:1-24, 1990. \n", "award": [], "sourceid": 1161, "authors": [{"given_name": "David", "family_name": "Miller", "institution": null}, {"given_name": "Ajit", "family_name": "Rao", "institution": null}, {"given_name": "Kenneth", "family_name": "Rose", "institution": null}, {"given_name": "Allen", "family_name": "Gersho", "institution": null}]}