{"title": "The Boltzmann Perceptron Network: A Multi-Layered Feed-Forward Network Equivalent to the Boltzmann Machine", "book": "Advances in Neural Information Processing Systems", "page_first": 116, "page_last": 123, "abstract": null, "full_text": "116 \n\nTHE BOLTZMANN PERCEPTRON NETWORK: \n\nA MULTI-LAYERED FEED-FORWARD NETWORK \nEQUIVALENT TO THE BOLTZMANN MACHINE \n\nEyal Yair and Allen Gersho \n\nCenter for Information Processing Research \nDepartment of Electrical & Computer Engineering \nUniversity of California, Santa Barbara, CA 93106 \n\nABSTRACT \n\nThe concept of the stochastic Boltzmann machine (BM) is attractive for decision making and pattern classification purposes since the probability of attaining the network states is a function of the network energy. Hence, the probability of attaining particular energy minima may be associated with the probabilities of making certain decisions (or classifications). However, because of its stochastic nature, the complexity of the BM is fairly high and therefore such networks are not very likely to be used in practice. In this paper we suggest a way to alleviate this drawback by converting the stochastic BM into a deterministic network which we call the Boltzmann Perceptron Network (BPN). The BPN is functionally equivalent to the BM but has a feed-forward structure and low complexity. No annealing is required. The conditions under which such a conversion is feasible are given. A learning algorithm for the BPN based on the conjugate gradient method is also provided which is somewhat akin to the backpropagation algorithm. \n\nINTRODUCTION \n\nIn decision-making applications, it is desirable to have a network which computes the probabilities of deciding upon each of M possible propositions for any given input pattern. In principle, the Boltzmann machine (BM) (Hinton, Sejnowski and Ackley, 1984) can provide such a capability. 
The network is composed of a set of binary units connected through symmetric connection links. The units randomly and asynchronously change their values in {0,1} according to a stochastic transition rule. The transition rule used by Hinton et al. defines the probability of a unit being in the 'on' state as the logistic function of the energy change that results from changing the value of that unit. The BM can be described by an ergodic Markov chain in which the thermal equilibrium probability of attaining each state obeys the Boltzmann distribution, which is a function of only the energy. By associating the set of possible propositions with subsets of network states, the probability of deciding upon each of these propositions can be measured by the probability of attaining the corresponding set of states. This probability is also affected by the temperature. As the temperature increases, the Boltzmann probability distribution becomes more uniform and thus the decision made is 'vague'. The lower the temperature, the greater is the probability of attaining states with lower energy, thereby leading to more 'distinctive' decisions. \n\nThis work was supported by the Weizmann Foundation for scientific research, by the University of California MICRO program, and by Bell Communications Research, Inc. \n\nThis approach, while very attractive in principle, has two major drawbacks which make the complexity of the computations non-feasible for nontrivial problems. The first is the need for thermal equilibrium in order to obtain the Boltzmann distribution. To make distinctive decisions a low temperature is required, which implies slower convergence towards thermal equilibrium. Generally, the method used to reach thermal equilibrium is simulated annealing (SA) (Kirkpatrick et al., 1983), in which the temperature starts from a high value and is gradually reduced to the desired final value. In order to avoid 'freezing' of the network, the cooling schedule should be fairly slow. SA is thus time consuming and computationally expensive. The second drawback is due to the stochastic nature of the computation. Since the network state is a random vector, the desired probabilities have to be estimated by accumulating statistics of the network behavior for only a finite period of time. Hence, a trade-off between speed and accuracy is unavoidable. \n\nIn this paper, we propose a mechanism to alleviate the above computational drawbacks by converting the stochastic BM into a functionally equivalent deterministic network, which we call the Boltzmann Perceptron Network (BPN). The BPN circumvents the need for a Monte Carlo type of computation and instead evaluates the desired probabilities using a multilayer perceptron-like network. The very time consuming learning process for the BM is similarly replaced by a deterministic learning scheme, somewhat akin to the backpropagation algorithm, which is computationally affordable. The similarity between the learning algorithm of a BM having a layered structure and that of a two-layer perceptron has recently been pointed out by Hopfield (1987). In this paper we further elaborate on such an equivalence between the BM and the new perceptron-like network, and give the conditions under which the conversion of the stochastic BM into the deterministic BPN is possible. Unlike the original BM, the BPN is virtually always in thermal equilibrium and thus SA is no longer required. Nevertheless, the temperature still plays the same role, and varying it may be beneficial to control the 'softness' of the decisions made by the BPN. Using the BPN as a soft classifier is described in detail in (Yair and Gersho, 1989). 
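To make the role of the temperature concrete, the following sketch evaluates the Boltzmann distribution over three hypothetical states at a high and a low temperature. The energy values are made up for illustration and are not from the paper:

```python
import math

def boltzmann(energies, T):
    """Boltzmann distribution over states: P_u proportional to exp(-E_u / T)."""
    weights = [math.exp(-e / T) for e in energies]
    Z = sum(weights)                    # partition function (normalization)
    return [w / Z for w in weights]

# Made-up energies of three competing states (illustrative only).
E = [1.0, 2.0, 4.0]

hot = boltzmann(E, T=100.0)   # high temperature: nearly uniform, 'vague' decision
cold = boltzmann(E, T=0.1)    # low temperature: mass concentrates on min energy

print(hot)
print(cold)
```

At T = 100 the three states are nearly equiprobable (a 'vague' decision), while at T = 0.1 virtually all of the probability mass falls on the lowest-energy state (a 'distinctive' decision).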
\n\nTHE BOLTZMANN PERCEPTRON NETWORK \n\nSuppose we have a network of K units connected through symmetric connection links with no self-feedback, so that the connection matrix Γ is symmetric and zero-diagonal. Let us categorize the units into three different types: input, output and hidden units. The input pattern will be supplied to the network by clamping the input units, denoted by x = (x_1, ..., x_i, ..., x_I)^T, with this pattern. x is a real-valued vector in R^I. The output of the network will be observed on the output units y = (y_1, ..., y_m, ..., y_M)^T, which is a binary vector. The remaining units, denoted v = (v_1, ..., v_j, ..., v_J)^T, are the hidden units, which are also binary-valued. The hidden and output units asynchronously and randomly change their binary values in {0,1} according to the inputs they receive from other units. \n\nThe state of the network will be denoted by the vector u, which is partitioned as follows: u^T = (x^T, v^T, y^T). The energy associated with state u is denoted by E_u and is given by: \n\n-E_u = (1/2) u^T Γ u + b^T u   (1) \n\nwhere b is a vector of bias values, partitioned to comply with the partition of u as follows: b^T = (b_x^T, b_v^T, b_y^T). \n\nThe transition from one state to another is performed by selecting one unit at a time, say unit k, at random and determining its output value according to the following stochastic rule: set the output of the unit to 1 with probability p_k, and to 0 with probability 1-p_k. The parameter p_k is determined locally by the k-th unit as a function of the energy change ΔE_k in the following fashion: \n\np_k = g(β ΔE_k) ,   g(x) = 1 / (1 + e^(-x))   (2) \n\nwhere ΔE_k = ( E_u(unit k is off) - E_u(unit k is on) ), and β = 1/T is a control parameter. T is called the temperature and g(.) is the logistic function. With this transition rule the thermal equilibrium probability P_u of attaining a state u 
obeys the Boltzmann distribution: \n\nP_u = (1/Z_x) e^(-β E_u)   (3) \n\nwhere Z_x, called the partition function, is a normalization factor (independent of v and y) such that the sum of P_u over all the 2^(J+M) possible states is unity. \n\nIn order to use the network in a deterministic fashion, rather than accumulating statistics while observing its random behavior, we should be able to explicitly compute the probability of attaining a certain vector y on the output units while x is clamped on the input units. This probability, denoted by P_y|x, can be expressed as: \n\nP_y|x = Σ_{v∈B_J} P_v,y|x = (1/Z_x) Σ_{v∈B_J} e^(-β E_v,y|x)   (4) \n\nwhere B_J is the set of all binary vectors of length J, and v,y|x denotes a state u in which a specific input vector x is clamped. The explicit evaluation of the desired probabilities therefore involves the computation of the partition function, for which the number of operations grows exponentially with the number of units. That is, the complexity is O(2^(J+M)). Obviously, this is computationally unacceptable. Nevertheless, we shall see that under a certain restriction on the connection matrix Γ the explicit computation of the desired probabilities becomes possible with a complexity of O(JM), which is computationally feasible. \n\nLet us assume that for each input pattern we have M possible propositions which are associated with the M output vectors: Y_M = (y_1, ..., y_m, ..., y_M), where y_m is the m-th column of the M×M identity matrix. Any state of the network having output vector y = y_m (for any m) will be denoted by v,m|x and will be called a feasible state. All other state vectors v,y|x for y ∉ Y_M will be considered as intermediate steps between two feasible states. 
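As an illustration of the transition rule (2) and of the Monte Carlo estimation it entails, the sketch below simulates a tiny network of three binary units with made-up symmetric weights (for simplicity no units are clamped, so the probability of a full state plays the role of P_y|x in (4)), and compares a long-run visit frequency against the exact Boltzmann probability obtained by exhaustive enumeration:

```python
import math
import random

def g(x):
    """Logistic function of Eq. (2)."""
    return 1.0 / (1.0 + math.exp(-x))

# Made-up symmetric, zero-diagonal connection weights for K = 3 binary units.
W = [[0.0, 0.5, -0.3],
     [0.5, 0.0, 0.8],
     [-0.3, 0.8, 0.0]]
b = [0.1, -0.2, 0.3]   # bias values
beta = 1.0             # beta = 1/T

def neg_energy(u):
    """-E_u = (1/2) u^T W u + b^T u, as in Eq. (1)."""
    s = sum(b[k] * u[k] for k in range(3))
    s += 0.5 * sum(W[j][k] * u[j] * u[k] for j in range(3) for k in range(3))
    return s

def gibbs_step(u):
    """One application of the stochastic transition rule of Eq. (2)."""
    k = random.randrange(3)                  # pick a unit at random
    on, off = u[:], u[:]
    on[k], off[k] = 1, 0
    dE = neg_energy(on) - neg_energy(off)    # = E(unit k off) - E(unit k on)
    u[k] = 1 if random.random() < g(beta * dE) else 0

# Exact Boltzmann probability of state (1,1,1) by exhaustive enumeration, O(2^K).
states = [[a, b2, c] for a in (0, 1) for b2 in (0, 1) for c in (0, 1)]
Z = sum(math.exp(beta * neg_energy(s)) for s in states)
p_exact = math.exp(beta * neg_energy([1, 1, 1])) / Z

# Monte Carlo estimate: accumulate visit statistics over a long run.
random.seed(0)
u, hits, n = [0, 0, 0], 0, 20000
for _ in range(n):
    gibbs_step(u)
    hits += (u == [1, 1, 1])
p_est = hits / n
print(p_exact, p_est)
```

Even for this three-unit toy, the exact computation enumerates 2^K states and the sampled estimate converges only slowly toward it; this is precisely the speed/accuracy trade-off that the deterministic BPN is designed to remove.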
This redefinition of the network states is equivalent to redefining the states of the underlying Markov model of the network and thus conserves the equilibrium Boltzmann distribution. The probability of proposition m for a given input x, denoted by P_m|x, will be taken as the probability of obtaining output vector y = y_m given that the output is one of the feasible values. That is, \n\nP_m|x = Pr( y = y_m | x , y∈Y_M )   (5) \n\nwhich can be computed from (4) by restricting the state space to the 2^J M feasible state vectors and by setting y = y_m. The partition function, conditioned on restricting y to lie in the set of feasible outputs Y_M, is denoted by Z̃_x and is given by: \n\nZ̃_x = Σ_{n=1..M} Σ_{v∈B_J} e^(-β E_v,n|x)   (6) \n\nLet us now partition the connection matrix Γ to comply with the partition of the state vector and rewrite the energy for the feasible state v,m|x as: \n\n-E_v,m|x = v^T( Rx + Q y_m + (1/2) D_2 v + b_v ) + y_m^T( Wx + (1/2) D_3 y_m + b_y ) + x^T( (1/2) D_1 x + b_x )   (7) \n\nSince x is clamped on the input units, the last term in the energy expression serves only as a bias term for the energy which is independent of the binary units v and y. Therefore, without any loss of generality it may be assumed that D_1 = 0 and b_x = 0. The second term, denoted by T_m|x, can be simplified since D_3 has a zero diagonal. Hence, \n\nT_m|x = Σ_{i=1..I} w_mi x_i + s_m   (8) \n\nwhere w_mi is the (m,i) entry of W and s_m is the m-th component of b_y. The absence of the matrix D_3 in the energy expression means that interconnections between output units have no effect on the probability of attaining output vectors y∈Y_M, and may be assumed absent without any loss of generality. \n\nDefining L_m(x) to be: \n\nL_m(x) = ln Σ_{v∈B_J} e^( β [ v^T( Rx + q_m + b_v ) + T_m|x ] )   (9) \n\nin which q_m is the m-th column of Q, the desired probabilities P_m|x for m = 1, ..., M are obtained using (4) and (7) as a function of these quantities as follows: \n\nP_m|x = (1/Z̃_x) e^(L_m(x))   with:   Z̃_x = Σ_{n=1..M} e^(L_n(x))   (10) \n\nThe complexity of evaluating the desired probabilities P_m|x is still exponential in the number of hidden units J due to the sum in (9). However, if we impose the restriction that D_2 = 0, namely, the hidden units are not directly interconnected, then this sum becomes separable in the components v_j and can thus be decomposed into the product of only J terms. This restricted connectivity of course imposes some restrictions on the capability of the network compared to that of a fully connected network. On the other hand, it allows the computation of the desired probabilities in a deterministic way with the attractive complexity of only O(JM) operations. The tedious estimation of conditional expectations commonly required by the learning algorithm for a BM, and the annealing procedure, are avoided, and an accurate and computationally affordable scheme becomes possible. We thus suggest a trade-off in which the operation and learning of the BM are tremendously simplified and the exact decision probabilities are computed (rather than their statistical estimates) at the expense of a restricted connectivity, namely, no interconnections are allowed between the hidden units. Hence, in our scheme, the connection matrix Γ becomes zero block-diagonal, meaning that the network has connections only between units of different categories. This structure is shown schematically in Figure 1. \n\nFigure 1. Schematic architecture of the stochastic BM. \nFigure 2. Block diagram of the corresponding deterministic BPN. \n\nBy applying the property D_2 = 0 to (9), the sum over the space of hidden units, 
which can be explicitly written as the sum over all the J components of v, can be decomposed, using the separability of the different v_j components, into a sum of only J terms as follows: \n\nL_m(x) = β T_m|x + Σ_{j=1..J} f( u_jm|x )   (11a) \n\nwhere: \n\nf(u) = ln( 1 + e^(βu) )   and   u_jm|x = the j-th component of ( Rx + q_m + b_v )   (11b) \n\nf(.) is called the activation function. Note that as β is increased, f(.) approaches the linear threshold function, in which a zero response is obtained for a negative input and a linear one (with slope β) for a positive input. \n\nFinally, the desired probabilities P_m|x can be expressed as a function of the L_m(x) (m = 1, ..., M) in an expression which can be regarded as the generalization of the logistic function to M inputs: \n\nP_m|x = [ 1 + Σ_{n≠m} e^(-L_m,n(x)) ]^(-1)   where:   L_m,n(x) = L_m(x) - L_n(x)   (12) \n\nEqs. (8) and (11) describe a two-layer feed-forward perceptron-like subnetwork which uses the nonlinearity f(.). It evaluates the quantity L_m(x), which we call the score of the m-th proposition. Eq. (12) describes a competition between the scores L_m(x) generated by the M subnetworks (m = 1, ..., M), which we call a soft competition with lateral inhibition. That is, if several scores receive relatively high values compared to the others, they will share, according to their relative strengths, the total amount (unity) of probability, while inhibiting the remaining probabilities so that they approach zero. For example, if one of the scores, say L_1(x), is large compared to all the other scores, then the exponentiation of the pairwise score differences will result in P_1|x ≈ 1 while the remaining probabilities approach zero. Specifically, for any n≠1, P_n|x ≈ exp( -L_1,n(x) ), which is essentially zero if L_1(x) is sufficiently high. In other words, 
by being large compared to the others, L_1(x) won the competition, so that the corresponding probability P_1|x approaches unity, while all the remaining probabilities have been attenuated by the high value of L_1(x) to approach zero. \n\nLet us examine the effect of the gain β on this competition. When β is increased, the slope of the activation function f(.) is increased, thereby accentuating the differences between the M contenders. In the limit, when β→∞, one of the L_m(x) will always be sufficiently large compared to the others, and thus only one proposition will win. The competition then becomes a winner-take-all competition. In this case, the network becomes a maximum a posteriori (MAP) decision scheme in which the L_m(x) play the role of nonlinear discriminant functions and the most probable proposition for the given input pattern is chosen: \n\nP_k|x = 1  for  k = argmax_m { L_m(x) }   and   P_n|x = 0  for  n≠k.   (13) \n\nThis result coincides with our earlier observation that the temperature controls the 'softness' of the decision. The lower the temperature, the 'harder' is the competition and the more distinctive are the decisions. However, in contrast to the stochastic network, there is no need to gradually 'cool' the network to achieve a desired (low) temperature. Any desired value of β is directly applicable in the BPN scheme. The above notion of soft competition has its merits in a wider scope of applications apart from its role in the BPN classifier. In many competitive schemes a soft competition between a set of contenders has a substantial benefit in comparison to the winner-take-all paradigm. The above competition scheme, which can be implemented by a two-layer feed-forward network, thus offers a valuable scheme for such purposes. \n\nThe block diagram of the BPN is depicted in Figure 2. 
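The complete forward pass can be sketched as follows, following Eqs. (8), (11) and (12). All weights are hypothetical and chosen only for illustration; this is a minimal sketch, not the authors' implementation:

```python
import math

def bpn_forward(x, R, Q, b_v, W, b_y, beta=1.0):
    """Deterministic BPN forward pass: scores L_m(x), then soft competition.

    R is J x I (hidden-from-input), Q is J x M (q_m is its m-th column),
    W is M x I (output-from-input); b_v, b_y are bias vectors.
    Cost is O(JM) given the projections -- no enumeration of the 2^J
    hidden configurations is needed.
    """
    J, M, I = len(R), len(Q[0]), len(x)
    Rx = [sum(R[j][i] * x[i] for i in range(I)) for j in range(J)]
    L = []
    for m in range(M):
        T_mx = sum(W[m][i] * x[i] for i in range(I)) + b_y[m]        # Eq. (8)
        u = [Rx[j] + Q[j][m] + b_v[j] for j in range(J)]
        # Score: L_m(x) = beta*T_m|x + sum_j ln(1 + exp(beta*u_jm|x)), Eq. (11)
        L.append(beta * T_mx + sum(math.log1p(math.exp(beta * uj)) for uj in u))
    # Soft competition with lateral inhibition, Eq. (12) (a softmax over scores).
    zmax = max(L)                      # subtract the max for numerical stability
    e = [math.exp(li - zmax) for li in L]
    Z = sum(e)
    return [ei / Z for ei in e]

# Hypothetical weights: I = 2 inputs, J = 3 hidden units, M = 2 propositions.
R = [[0.4, -0.6], [0.2, 0.9], [-0.5, 0.3]]
Q = [[0.7, -0.7], [-0.2, 0.4], [0.1, 0.6]]
b_v = [0.0, 0.1, -0.1]
W = [[0.3, -0.2], [-0.3, 0.2]]
b_y = [0.05, -0.05]

x = [1.0, 0.5]
soft = bpn_forward(x, R, Q, b_v, W, b_y, beta=1.0)    # graded decision
hard = bpn_forward(x, R, Q, b_v, W, b_y, beta=20.0)   # near winner-take-all
print(soft, hard)
```

With β = 1 the two propositions share the probability mass (a soft decision); raising β to 20 drives the same network toward the winner-take-all regime of Eq. (13) without any annealing.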
The BPN is thus a four-layer feed-forward deterministic network. It is comprised of a two-layer perceptron-like network followed by a two-layer competition network. The competition can be 'hard' (winner-take-all) or 'soft' (graded decision) and is governed by a single gain parameter β. \n\nTHE LEARNING ALGORITHM \n\nLet us denote the BPN output by the M-dimensional probability vector P_x, where: P_x = (P_1|x, ..., P_m|x, ..., P_M|x)^T. For any given set of weights Γ, the BPN realizes some deterministic mapping Ψ: R^I → [0,1]^M so that P_x = Ψ(x). The objective of learning is to determine the mapping Ψ by estimating the set of parameters Γ which 'best' explains a finite set of examples given from the desired mapping, called the training set. The training set is specified by a set of N patterns (x_1, ..., x_n, ..., x_N) (denoted for simplicity by {x}), the a priori probability for each training pattern x, Q(x), and the desired mapping for each pattern x: Q_x = (Q_1|x, ..., Q_m|x, ..., Q_M|x)^T, where Q_m|x = Pr( proposition m | x ) is the desired probability. For each input pattern presented to the BPN, the actual output probability vector P_x is, in general, different from the desired one, Q_x. We denote by G_x the distortion between the actual P_x and the desired Q_x for an input pattern x. Thus, our task is to determine the network weights (and the bias values) Γ so that, on the average, the distortion over the whole set of training patterns will be minimized. Adopting the original distortion measure suggested for Boltzmann machines, the average distortion G(Γ) is given by: \n\nG(Γ) = Σ_x Q(x) G_x ,   G_x = Σ_{m=1..M} Q_m|x ln[ Q_m|x / P_m|x ] 
\n\nFigure 5: Classification for Gaussian sources. (5.a) The two sources. (5.b) 'Soft' (β = 1) and 'hard' (β = 10) classifications versus Q_1|x. J indicates the number of hidden units used. \nFigure 6: Classification for disconnected decision regions. (6.a) The sources used: dashed lines indicate class 0 and solid lines class 1. (6.b) Soft (β = 1) and hard (β = 10) classifications versus Q_1|x. \nFigure 7: Classification in a 2D space. (7.a) The two classes and the true boundary indicated by Q_1|x = 0.5. (7.b) The boundary found by the BPN, marked by P_1|x = 0.5, versus the true one. \n\nReferences \n\nFletcher, R., & Reeves, C.M. (1964). Function minimization by conjugate gradients. Computer Journal, 7, 149-154. \nHinton, G.E., Sejnowski, T.R., & Ackley, D.H. (1984). Boltzmann machines: constraint satisfaction networks that learn. Carnegie-Mellon Technical Report, CMU-CS-84-119. \nHopfield, J.J. (1987). Learning algorithms and probability distributions in feed-forward and feed-back networks. Proc. Natl. Acad. Sci. USA, 84, 8429-8433. \nKirkpatrick, S., Gelatt, C.D., & Vecchi, M.P. (1983). Optimization by simulated annealing. Science, 220, 671-680. \nLuenberger, D.G. (1984). Linear and Nonlinear Programming. Addison-Wesley, Reading, Mass. \nRumelhart, D.E., Hinton, G.E., & Williams, R.J. (1986). Learning internal representations by error propagation. In D.E. Rumelhart & J.L. McClelland (Eds.), Parallel Distributed Processing. MIT Press/Bradford Books. \nYair, E., & Gersho, A. (1989). The Boltzmann perceptron network: a soft classifier. Submitted to the Journal of Neural Networks, December 1988. \n", "award": [], "sourceid": 180, "authors": [{"given_name": "Eyal", "family_name": "Yair", "institution": null}, {"given_name": "Allen", "family_name": "Gersho", "institution": null}]}