{"title": "Unification of Information Maximization and Minimization", "book": "Advances in Neural Information Processing Systems", "page_first": 508, "page_last": 514, "abstract": null, "full_text": "Unification of Information Maximization and Minimization \n\nRyotaro Kamimura \n\nInformation Science Laboratory, Tokai University \n\n1117 Kitakaname Hiratsuka Kanagawa 259-12, Japan \n\nE-mail: ryo@cc.u-tokai.ac.jp \n\nAbstract \n\nIn this paper, we propose a method to unify information maximization and minimization in hidden units. Information maximization and minimization are performed on two different levels: the collective and the individual level. Accordingly, two kinds of information are defined: collective information and individual information. By maximizing collective information and minimizing individual information, simple networks can be generated in terms of the number of connections and the number of hidden units. The obtained networks are expected to give better generalization and an improved interpretation of internal representations. The method was applied to the inference of the maximum onset principle of an artificial language. For this problem, it was shown that individual information minimization does not contradict collective information maximization. In addition, experimental results confirmed improved generalization performance, because over-training can be significantly suppressed. \n\n1 Introduction \n\nThere have been many attempts to interpret neural networks from an information-theoretic point of view [2], [4], [5]. Applied to supervised learning, information has been maximized or minimized, depending on the problem. In these methods, information is defined by the outputs of hidden units. Thus, the methods aim to control hidden unit activity patterns in an optimal manner. 
Information maximization methods have been used to interpret internal representations explicitly and, at the same time, to reduce the number of necessary hidden units [5]. On the other hand, information minimization methods have mainly been used to improve generalization performance [2], [4] and to speed up learning. Thus, if it is possible to maximize and minimize information simultaneously, information-theoretic methods can be expected to apply to a wide range of problems. \n\nIn this paper, we unify the two methods mentioned above, namely information maximization and information minimization, into one framework, in order to improve generalization performance and to interpret internal representations explicitly. However, it is apparently impossible to simultaneously maximize and minimize the information defined by the hidden unit activity. Our goal is therefore to maximize and minimize information on two different levels, namely the collective and the individual level. This means that information is maximized collectively, while information is minimized for individual input-hidden connections. The seemingly contradictory proposition of simultaneous information maximization and minimization can be overcome by assuming the existence of these two levels of information control. \n\nInformation is supposed to be controlled by an information controller that is located outside the neural network and used exclusively to control information. By assuming the information controller, we can clearly see how appropriately defined information can be maximized or minimized. In addition, the actual implementation of the information methods becomes much easier by introducing the concept of the information controller. \n\n2 Concept of Information \n\nIn this section, we explain the concept of information in the general framework of information theory. 
Let Y take on a finite number of possible values y_1, y_2, ..., y_M with probabilities p(y_1), p(y_2), ..., p(y_M), respectively. Then, the initial uncertainty H(Y) of a random variable Y is defined by \n\nH(Y) = -\\sum_{j=1}^{M} p(y_j) \\log p(y_j). (1) \n\nNow, consider the conditional uncertainty after the observation of another random variable X, taking possible values x_1, x_2, ..., x_S with probabilities p(x_1), p(x_2), ..., p(x_S), respectively. The conditional uncertainty H(Y | X) can be defined as \n\nH(Y | X) = -\\sum_{s=1}^{S} p(x_s) \\sum_{j=1}^{M} p(y_j | x_s) \\log p(y_j | x_s). (2) \n\nWe can easily verify that the conditional uncertainty is always less than or equal to the initial uncertainty. Information is usually defined as the decrease of this uncertainty [1]: \n\nI(Y | X) = H(Y) - H(Y | X) = -\\sum_{j=1}^{M} p(y_j) \\log p(y_j) + \\sum_{s=1}^{S} p(x_s) \\sum_{j=1}^{M} p(y_j | x_s) \\log p(y_j | x_s) = \\sum_{s=1}^{S} \\sum_{j=1}^{M} p(x_s) p(y_j | x_s) \\log \\frac{p(y_j | x_s)}{p(y_j)} = \\sum_{s=1}^{S} p(x_s) I(Y | x_s), (3) \n\nwhere \n\nI(Y | x_s) = \\sum_{j=1}^{M} p(y_j | x_s) \\log \\frac{p(y_j | x_s)}{p(y_j)}, (4) \n\nwhich is referred to as conditional information. In particular, when the prior uncertainty is maximum, that is, when the prior probability is equi-probable (1/M), the information is \n\nI(Y | X) = \\log M + \\sum_{s=1}^{S} p(x_s) \\sum_{j=1}^{M} p(y_j | x_s) \\log p(y_j | x_s), (5) \n\nwhere \\log M is the maximum uncertainty concerning Y. \n\n3 Formulation of Information Controller \n\nIn this section, we apply the concept of information to actual network architectures and define collective information and individual information. The notation of the previous section is changed into the ordinary notation used in neural networks. \n\n3.1 Unification by Information Controller \n\nTwo kinds of information, collective information and individual information, are controlled by using an information controller. 
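The information quantities defined in Section 2 can be checked numerically. The following sketch (an illustration added here; the small joint distribution is made up, and natural logarithms are used) computes H(Y), H(Y | X) and I(Y | X) for S = 2 and M = 3:

```python
import numpy as np

# Made-up example: S = 2 input values, M = 3 output values.
# p_x[s] = p(x_s); p_y_given_x[s, j] = p(y_j | x_s).
p_x = np.array([0.5, 0.5])
p_y_given_x = np.array([[0.7, 0.2, 0.1],
                        [0.1, 0.2, 0.7]])

# Marginal p(y_j) = sum_s p(x_s) p(y_j | x_s).
p_y = p_x @ p_y_given_x

# Initial uncertainty H(Y), equation (1).
H_Y = -np.sum(p_y * np.log(p_y))

# Conditional uncertainty H(Y | X), equation (2).
H_Y_given_X = -np.sum(p_x[:, None] * p_y_given_x * np.log(p_y_given_x))

# Information as the decrease of uncertainty, equation (3).
I = H_Y - H_Y_given_X

# Conditioning never increases uncertainty, so I is non-negative.
assert 0.0 <= H_Y_given_X <= H_Y
```

The decomposition in equations (3) and (4) can be confirmed by also computing the conditional information I(Y | x_s) for each input value and averaging it with the weights p(x_s); both routes give the same number.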
The information controller is devised to interpret the mechanism of information maximization and minimization more explicitly. As shown in Figure 1, the information controller is composed of two subcomponents, namely, an individual information minimizer and a collective information maximizer. The collective information maximizer is used to increase collective information as much as possible. The individual information minimizer is used to decrease individual information. By this minimization, the majority of connections are pushed toward zero and, eventually, all the hidden units tend to be intermediately activated. Thus, when the collective information maximizer and the individual information minimizer are simultaneously applied, the hidden unit activity pattern is a pattern of maximum information, in which only one hidden unit is on while all the other hidden units are off. However, the multiple strongly negative connections needed to produce a maximum-information state are replaced by extremely weak input-hidden connections, because strongly negative connections are inhibited by the individual information minimization. This means that, by the information controller, information can be maximized while, at the same time, one of the most important properties of information minimization, namely weight decay or weight elimination, can approximately be realized. Consequently, the information controller can generate much simplified networks in terms of hidden units and in terms of input-hidden connections. \n\n3.2 Collective Information Maximizer \n\nThe neural network to be controlled is composed of input, hidden and output units with bias, as shown in Figure 1. 
The jth hidden unit receives a net input from the input units and, at the same time, from the collective information maximizer: \n\nu_j^s = x_j + \\sum_{k=0}^{L} w_{jk} \\xi_k^s, (6) \n\nwhere x_j is the information maximizer sent from the collective information maximizer to the jth hidden unit, L is the number of input units, w_{jk} is a connection from the kth input unit to the jth hidden unit, and \\xi_k^s is the kth element of the sth input pattern. \n\n[Figure 1: A network architecture realizing the information controller: bias-hidden and bias-output connections, input-hidden connections controlled by the individual information minimizer, and information maximizers x_j supplied by the collective information maximizer.] \n\nThe jth hidden unit produces an activity, or an output, by a sigmoidal activation function: \n\nv_j^s = f(u_j^s) = \\frac{1}{1 + \\exp(-u_j^s)}. (7) \n\nThe collective information maximizer is used to maximize the information contained in the hidden units. For this purpose, we should define collective information. Now, suppose that, in the previous formulation of information, the symbols X and Y represent a set of input patterns and a set of hidden units, respectively. Then, let us approximate a probability p(y_j | x_s) by the normalized output p_j^s of the jth hidden unit, computed by \n\np_j^s = \\frac{v_j^s}{\\sum_{m=1}^{M} v_m^s}, (8) \n\nwhere the summation is over all the hidden units. Then, it is reasonable to suppose that at an initial stage all the hidden units are activated randomly or uniformly and that all the input patterns are also given randomly to the network. Thus, the probability p(y_j) of the activation of the hidden units at the initial stage is equi-probable, that is, 1/M. The probability p(x_s) of the input patterns is also supposed to be equi-probable, namely, 1/S. 
Thus, the information in equation (3) is rewritten as \n\nI(Y | X) \\approx -\\sum_{j=1}^{M} \\frac{1}{M} \\log \\frac{1}{M} + \\frac{1}{S} \\sum_{s=1}^{S} \\sum_{j=1}^{M} p_j^s \\log p_j^s = \\log M + \\frac{1}{S} \\sum_{s=1}^{S} \\sum_{j=1}^{M} p_j^s \\log p_j^s, (9) \n\nwhere \\log M is the maximum uncertainty. This information is considered to be the information acquired in the course of learning. The information maximizers are updated to increase the collective information. To obtain the update rules, we differentiate the information function with respect to the information maximizers x_j: \n\n\\Delta x_j = \\beta \\frac{\\partial I(Y | X)}{\\partial x_j} = \\frac{\\beta}{S} \\sum_{s=1}^{S} \\left( \\log p_j^s - \\sum_{m=1}^{M} p_m^s \\log p_m^s \\right) p_j^s (1 - v_j^s), (10) \n\nwhere \\beta is a parameter. \n\n3.3 Individual Information Minimization \n\nTo represent individual information by the concept of information discussed in the previous section, we consider the output p_{jk} of the jth hidden unit obtained only with the connection from the kth input unit to the jth hidden unit: \n\np_{jk} = f(w_{jk}), (11) \n\nwhich is supposed to be a probability of the firing of the jth hidden unit, given the firing of the kth input unit, as shown in Figure 2. \n\n[Figure 2: An interpretation of an input-hidden connection, from the kth input unit to the jth hidden unit, for defining the individual information.] \n\nSince this probability is a probability given the firing of the kth input unit, conditional information is appropriate for measuring the information. In addition, it is reasonable to suppose that the probability of the firing of the jth hidden unit is 1/2 at an initial stage of learning, because we have no knowledge of the hidden units. 
Thus, the conditional information for a pair of the kth input unit and the jth hidden unit is formulated as \n\nI_{jk}(D | fires) \\approx -p_{jk} \\log \\frac{1}{2} - (1 - p_{jk}) \\log \\left( 1 - \\frac{1}{2} \\right) + p_{jk} \\log p_{jk} + (1 - p_{jk}) \\log (1 - p_{jk}) = \\log 2 + p_{jk} \\log p_{jk} + (1 - p_{jk}) \\log (1 - p_{jk}). (12) \n\nIf a connection is close to zero, this function is close to the minimum information, meaning that it is impossible to estimate the firing of the jth hidden unit. If connections are larger, the information is larger, and the correlation between input and hidden units is larger. \n\nTable 1: An example of input-hidden connections w_{jk}, bias connections w_{j0} and information maximizers x_j obtained by the information controller. The parameters \\beta, \\mu and \\eta were 0.015, 0.0008 and 0.01. \n\nHidden unit j | Bias w_{j0} | Maximizer x_j | Input unit 1 | 2 | 3 | 4 \n1 | 22.07 | -60.88 | 3.09 | 10.77 | 26.48 | 13.82 \n2 | -0.95 | 1.63 | -3.35 | 0.11 | 0.33 | -3.08 \n3 | 0.00 | -10.93 | -0.01 | 0.00 | 0.00 | -0.01 \n4 | 0.00 | -10.94 | 0.00 | 0.00 | 0.00 | 0.00 \n5 | 0.00 | -10.97 | 0.00 | 0.00 | -0.01 | 0.00 \n6 | 0.06 | -12.01 | 0.02 | 0.01 | -0.04 | 0.01 \n7 | 0.00 | -11.01 | 0.00 | 0.00 | -0.01 | 0.00 \n8 | 0.00 | -11.00 | 0.00 | 0.00 | -0.01 | 0.00 \n9 | 0.03 | -11.61 | 0.02 | 0.01 | -0.03 | 0.01 \n10 | 0.07 | -11.67 | 0.01 | 0.00 | -0.02 | 0.00 \n\nThe total individual information is the sum of all the individual information terms, namely, \n\nI(D | fires) = \\sum_{j=1}^{M} \\sum_{k=0}^{L} I_{jk}(D | fires), (13) \n\nbecause each connection is treated separately, or independently. The individual information minimization directly controls the input-hidden connections. 
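As an added numerical sketch (the 3-by-4 weight matrix below is made up; natural logarithms are used), the individual information of equations (11)-(13) can be computed directly from the input-hidden connections:

```python
import numpy as np

def individual_information(W):
    # p_jk = f(w_jk), the firing probability of hidden unit j
    # given the firing of input unit k, equation (11).
    p = 1.0 / (1.0 + np.exp(-W))
    # I_jk = log 2 + p log p + (1 - p) log(1 - p), equation (12).
    I = np.log(2.0) + p * np.log(p) + (1.0 - p) * np.log(1.0 - p)
    # Total individual information is the sum over all connections, equation (13).
    return I, I.sum()

# Made-up weight matrix: M = 3 hidden units, L + 1 = 4 inputs (including bias).
W = np.array([[ 3.0, -2.0, 0.0, 0.1],
              [ 0.0,  0.0, 0.0, 0.0],
              [-5.0,  0.2, 0.0, 0.0]])
I, total = individual_information(W)

# A zero connection carries zero individual information;
# a larger |w_jk| carries more, regardless of sign.
assert abs(I[1].sum()) < 1e-12
assert I[2, 0] > I[0, 0] > I[0, 3] > 0.0
```

For the sigmoid, dI_jk/dw_jk = w_jk p_jk (1 - p_jk), so gradient descent on this quantity yields exactly the weight-decay-like term -mu w_jk p_jk (1 - p_jk) that appears in the update rule of equation (14).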
By differentiating the individual information function and a cross-entropy cost function G with respect to the input-hidden connections w_{jk}, we obtain the rule for updating the input-hidden connections: \n\n\\Delta w_{jk} = -\\mu \\frac{\\partial I(D | fires)}{\\partial w_{jk}} - \\eta \\frac{\\partial G}{\\partial w_{jk}} = -\\mu w_{jk} p_{jk} (1 - p_{jk}) + \\eta \\sum_{s=1}^{S} \\delta_j^s \\xi_k^s, (14) \n\nwhere \\delta_j^s is the ordinary delta for the cross-entropy function, and \\eta and \\mu are parameters. Thus, the rule for updating the input-hidden connections is closely related to the weight decay method. Clearly, the individual information minimization corresponds to diminishing the strength of the input-hidden connections. \n\n4 Results and Discussion \n\nThe information controller was applied to the segmentation of strings of an artificial language into appropriate minimal elements, that is, syllables. Table 1 shows the input-hidden connections together with the bias connections and the information maximizers. Hidden units were ordered by the magnitude of the relevance of each hidden unit [6]. The collective information and the individual information could sufficiently be maximized and minimized: the relative collective and individual information were 0.94 and 0.13. In this state, all the input-hidden connections, except the connections into the first two hidden units, are almost zero. The information maximizers x_j are strongly negative for these units. These negative information maximizers make the eight hidden units (from the third to the tenth hidden unit) inactive, that is, close to zero. By carefully examining the first two hidden units, we could see that the first hidden unit and the second hidden unit are concerned with the rules for syllabification and an exceptional case, respectively. \n\nThen, networks were trained to infer the well-formedness of strings, in addition to the segmentation, to examine generalization performance. Table 2 shows generalization errors for 200 and 300 training patterns. As clearly shown in the table, the best generalization performance, in terms of both RMS errors and error rates, is obtained by the information controller. Thus, the experimental results confirmed that in all cases the generalization performance of the information controller is well over that of the other methods. In addition, the experimental results explicitly confirmed that the better generalization performance is due to the suppression of over-training by the information controller. \n\nTable 2: Generalization performance comparison for 200 and 300 training patterns. Averages in the table are generalization errors averaged over seven of ten trials with ten different initial values. \n\n(a) 200 patterns \n\nMethods | RMS Avg. | RMS Std. Dev. | Error rate Avg. | Error rate Std. Dev. \nStandard | 0.188 | 0.010 | 0.087 | 0.015 \nWeight Decay | 0.183 | 0.004 | 0.082 | 0.009 \nWeight Elimination | 0.172 | 0.014 | 0.064 | 0.015 \nInformation Controller | 0.167 | 0.011 | 0.052 | 0.008 \n\n(b) 300 patterns \n\nMethods | RMS Avg. | RMS Std. Dev. | Error rate Avg. | Error rate Std. Dev. \nStandard | 0.108 | 0.009 | 0.024 | 0.009 \nWeight Decay | 0.110 | 0.003 | 0.012 | 0.004 \nWeight Elimination | 0.083 | 0.005 | 0.009 | 0.006 \nInformation Controller | 0.072 | 0.006 | 0.008 | 0.004 \n\nReferences \n\n[1] R. Ash, Information Theory, John Wiley & Sons: New York, 1965. \n\n[2] G. Deco, W. Finnoff and H. G. Zimmermann, \"Unsupervised mutual information criterion for elimination of overtraining in supervised multilayer networks,\" Neural Computation, Vol. 7, pp. 86-107, 1995. \n\n[3] R. 
Kamimura, \"Entropy minimization to increase the selectivity: selection and competition in neural networks,\" Intelligent Engineering Systems through Artificial Neural Networks, ASME Press, pp. 227-232, 1992. \n\n[4] R. Kamimura, T. Takagi and S. Nakanishi, \"Improving generalization performance by information minimization,\" IEICE Transactions on Information and Systems, Vol. E78-D, No. 2, pp. 163-173, 1995. \n\n[5] R. Kamimura and S. Nakanishi, \"Hidden information maximization for feature detection and rule discovery,\" Network: Computation in Neural Systems, Vol. 6, pp. 577-602, 1995. \n\n[6] M. C. Mozer and P. Smolensky, \"Using relevance to reduce network size automatically,\" Connection Science, Vol. 1, No. 1, pp. 3-16, 1989. \n", "award": [], "sourceid": 1282, "authors": [{"given_name": "Ryotaro", "family_name": "Kamimura", "institution": null}]}