{"title": "Discovering High Order Features with Mean Field Modules", "book": "Advances in Neural Information Processing Systems", "page_first": 509, "page_last": 515, "abstract": null, "full_text": "Discovering High Order Features with Mean Field Modules \n\n509 \n\nDiscovering high order features with mean field \n\nmodules \n\nConrad C. Galland and Geoffrey E. Hinton \n\nPhysics Dept. and Computer Science Dept. \n\nUniversity of Toronto \n\nToronto, Canada \n\nM5S lA4 \n\nABSTRACT \n\nA new form of the deterministic Boltzmann machine (DBM) learn(cid:173)\ning procedure is presented which can efficiently train network mod(cid:173)\nules to discriminate between input vectors according to some cri(cid:173)\nterion. The new technique directly utilizes the free energy of these \n\"mean field modules\" to represent the probability that the criterion \nis met, the free energy being readily manipulated by the learning \nprocedure. Although conventional deterministic Boltzmann learn(cid:173)\ning fails to extract the higher order feature of shift at a network \nbottleneck, combining the new mean field modules with the mu(cid:173)\ntual information objective function rapidly produces modules that \nperfectly extract this important higher order feature without direct \nexternal supervision. \n\nINTRODUCTION \n\n1 \nThe Boltzmann machine learning procedure (Hinton and Sejnowski, 1986) can be \nmade much more efficient by using a mean field approximation in which stochastic \nbinary units are replaced by deterministic real-valued units (Peterson and Anderson, \n1987). Deterministic Boltzmann learning can be used for \"multicompletion\" tasks \nin which the subsets of the units that are treated as input or output are varied \nfrom trial to trial (Peterson and Hartman, 1988). In this respect it resembles other \nlearning procedures that also involve settling to a stable state (Pineda, 1987). 
Using the multicompletion paradigm, it should be possible to force a network to explicitly extract important higher order features of an ensemble of training vectors by forcing the network to pass the information required for correct completions through a narrow bottleneck. In back-propagation networks with two or three hidden layers, the use of bottlenecks sometimes allows the learning to explicitly discover important underlying features (Hinton, 1986), and our original aim was to demonstrate that the same idea could be used effectively in a DBM with three hidden layers. The initial simulations using conventional techniques were not successful, but when we combined a new type of DBM learning with a new objective function, the resulting network extracted the crucial higher order features rapidly and perfectly. \n\n2 THE MULTI-COMPLETION TASK \n\nFigure 1 shows a network in which the input vector is divided into 4 parts. A1 is a random binary vector. A2 is generated by shifting A1 either to the right or to the left by one \"pixel\", using wraparound. B1 is also a random binary vector, and B2 is generated from B1 by using the same shift as was used to generate A2 from A1. This means that any three of A1, A2, B1, B2 uniquely specify the fourth (we filter out the ambiguous cases where this is not true). To perform correct completion, the network must explicitly represent the shift in the single unit that connects its two halves. Shift is a second order property that cannot be extracted without hidden units. \n\nFigure 1. [Figure: the four input groups A1, A2, B1, B2 with their hidden layers, connected through a single central unit.] \n\n3 SIMULATIONS USING STANDARD DETERMINISTIC BOLTZMANN LEARNING \n\nThe following discussion assumes familiarity with the deterministic Boltzmann learning procedure, details of which can be obtained from Hinton (1989). 
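The training ensemble just described can be enumerated directly. The following sketch is illustrative only (the helper names, 0/1 bit encoding, and filtering rule are our assumptions, not the original simulation code); it keeps a quadruple only when the shift direction is recoverable from each pair, which is exactly the condition under which any three parts determine the fourth.

```python
import itertools

def shift(v, direction):
    # One-pixel circular shift of a bit tuple; direction is +1 (right) or -1 (left).
    n = len(v)
    return tuple(v[(i - direction) % n] for i in range(n))

def unambiguous(v):
    # The shift direction is recoverable from the pair (v, shifted v) only
    # when shifting v left and right give different results, i.e. when v is
    # not invariant under a shift by two pixels.
    return shift(v, +1) != shift(v, -1)

def make_cases(n_bits=4):
    # Enumerate all (A1, A2, B1, B2) quadruples in which A2 and B2 are the
    # same one-pixel wraparound shift of A1 and B1, keeping only the
    # unambiguous cases where any three parts determine the fourth.
    cases = []
    for a1 in itertools.product([0, 1], repeat=n_bits):
        if not unambiguous(a1):
            continue
        for b1 in itertools.product([0, 1], repeat=n_bits):
            if not unambiguous(b1):
                continue
            for d in (+1, -1):
                cases.append((a1, shift(a1, d), b1, shift(b1, d)))
    return cases

cases = make_cases()
```

With four-bit vectors, 12 of the 16 patterns are unambiguous, so this enumeration yields 12 x 12 x 2 = 288 training cases, the number used in the simulations.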
During the positive phase of learning, each of the 288 possible sets of shift-matched four-bit vectors was clamped onto inputs A1, A2 and B1, B2, while in the negative phase, one of the four was allowed to settle unclamped. The weights were changed after each training case using the on-line version of the DBM learning procedure. The choice of which input not to clamp changed systematically throughout the learning process so that each was left unclamped equally often. This technique, although successful in problems with only one hidden layer, could not train the network to correctly perform the multicompletion task, where any of the four input layers should settle to the correct state when the other three are clamped. As a result, the single central unit failed to extract shift. In general, the DBM learning procedure, like its stochastic predecessor, seems to have difficulty learning tasks in multi-hidden layer nets. This failure led to the development of the new procedure which, in one form, manages to correctly extract shift without the need for many hidden layers or direct external supervision. \n\n4 A NEW LEARNING PROCEDURE FOR MEAN FIELD MODULES \n\nA DBM with unit states in the range [-1,1] has free energy \n\nF = - \\sum_{i<j} y_i y_j w_{ij} + T \\sum_i \\left[ \\frac{1+y_i}{2} \\log \\frac{1+y_i}{2} + \\frac{1-y_i}{2} \\log \\frac{1-y_i}{2} \\right]   (1) \n\nThe DBM settles to a free energy minimum, F*, at a non-zero temperature, where the states of the units are given by \n\ny_i = \\tanh\\left( \\frac{1}{T} \\sum_j y_j w_{ij} \\right)   (2) \n\nAt the minimum, the derivative of F* with respect to a particular weight (assuming T = 1) is given by (Hinton, 1989) \n\n\\frac{\\partial F^*}{\\partial w_{ij}} = -y_i y_j   (3) \n\nSuppose that we want a network module to discriminate between input vectors that \"fit\" some criterion and input vectors that don't. 
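Equations (1) and (2) can be illustrated with a short numerical sketch (our own code, assuming a small symmetric weight matrix with zero diagonal; it is not the original simulation code). The unclamped units are repeatedly updated by equation (2) until they reach the mean field fixed point, and equation (1) then gives the free energy of the settled state.

```python
import numpy as np

def settle(W, y, clamped, T=1.0, n_iters=50):
    # Synchronously update the unclamped units toward the mean field
    # fixed point y_i = tanh((1/T) * sum_j y_j w_ij) of equation (2).
    y = y.copy()
    free = ~clamped
    for _ in range(n_iters):
        y[free] = np.tanh((W @ y)[free] / T)
    return y

def free_energy(W, y, T=1.0, eps=1e-12):
    # Free energy of equation (1) for unit states in [-1, 1]:
    # F = -sum_{i<j} w_ij y_i y_j + T * sum_i [p log p + q log q],
    # with p = (1 + y_i)/2 and q = (1 - y_i)/2.
    energy = -0.5 * y @ W @ y          # counts each pair once (zero diagonal)
    p = (1.0 + y) / 2.0
    q = (1.0 - y) / 2.0
    entropy_term = np.sum(p * np.log(p + eps) + q * np.log(q + eps))
    return energy + T * entropy_term
```

For modest weights the synchronous updates contract, so the settled state satisfies the fixed point condition of equation (2) to numerical precision after a few dozen iterations.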
Instead of using a net with an output unit that indicates the degree of fit, we could view the negative of the mean field free energy of the whole module as a measure of how happy it is with the clamped input vector. From this standpoint, we can define the probability that input vector \\alpha fits the criterion as \n\np_\\alpha = \\frac{1}{1 + e^{F^*_\\alpha}}   (4) \n\nwhere F^*_\\alpha is the equilibrium free energy of the module with vector \\alpha clamped on the inputs. \n\nSupervised training can be performed by using the cross-entropy error function (Hinton, 1987): \n\nC = - \\sum_{\\alpha=1}^{N_+} \\log(p_\\alpha) - \\sum_{\\beta=1}^{N_-} \\log(1 - p_\\beta)   (5) \n\nwhere the first sum is over the N_+ input cases that fit the criterion, and the second is over the N_- cases that don't. The cross-entropy expression is used to specify error derivatives for p_\\alpha and hence for F^*_\\alpha. Error derivatives for each weight can then be obtained by using equation (3), and the module is trained by gradient descent to have high free energy for the \"negative\" training cases and low free energy for the \"positive\" cases. \n\nThus, for each positive case \n\n\\frac{\\partial \\log(p_\\alpha)}{\\partial w_{ij}} = -\\frac{1}{1 + e^{-F^*_\\alpha}} \\frac{\\partial F^*_\\alpha}{\\partial w_{ij}} = \\frac{1}{1 + e^{-F^*_\\alpha}} \\, y_i y_j \n\nFor each negative case, \n\n\\frac{\\partial \\log(1 - p_\\beta)}{\\partial w_{ij}} = \\frac{1}{1 + e^{F^*_\\beta}} \\frac{\\partial F^*_\\beta}{\\partial w_{ij}} = -\\frac{1}{1 + e^{F^*_\\beta}} \\, y_i y_j \n\nTo test the new procedure, we trained a shift detecting module, composed of the input units A1 and A2 and the hidden units HA from figure 1, to have low free energy for all and only the right shifts. Each weight was changed in an on-line fashion according to \n\n\\Delta w_{ij} = \\epsilon \\, \\frac{1}{1 + e^{-F^*_\\alpha}} \\, y_i y_j \n\nfor each right shifted case, and \n\n\\Delta w_{ij} = -\\epsilon \\, \\frac{1}{1 + e^{F^*_\\beta}} \\, y_i y_j \n\nfor each left shifted case. Only 10 sweeps through the 24 possible training cases were required to successfully train the module to detect shift. 
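A single on-line weight update of the kind just described can be sketched as follows (illustrative code only; the layer layout, learning rate, and settling schedule are our assumptions, not the original implementation). Because the inputs are clamped, settling reduces to iterating equation (2) on the hidden units; the sign and sigmoid factor of the weight change then follow the positive-case and negative-case derivatives above, pushing free energy down for positive cases and up for negative ones.

```python
import numpy as np

def settle_and_free_energy(W, clamped_vals, n_hidden, T=1.0, n_iters=30):
    # Inputs are always clamped here, so settling just iterates equation (2)
    # on the hidden units; equation (1) then gives F* at the settled state.
    n_vis = len(clamped_vals)
    y = np.concatenate([clamped_vals, np.zeros(n_hidden)])
    for _ in range(n_iters):
        net = W @ y
        y[n_vis:] = np.tanh(net[n_vis:] / T)
    p = (1.0 + y) / 2.0
    q = (1.0 - y) / 2.0
    F = -0.5 * y @ W @ y + T * np.sum(p * np.log(p + 1e-12) + q * np.log(q + 1e-12))
    return y, F

def train_step(W, v, positive, lr=0.1):
    # One on-line update: for a positive case the weight change is
    # lr * sigmoid(F*) * y_i y_j, for a negative case -lr * sigmoid(-F*) * y_i y_j,
    # matching the cross-entropy derivatives in the text.
    n_hidden = W.shape[0] - len(v)
    y, F = settle_and_free_energy(W, v, n_hidden)
    if positive:
        g = lr / (1.0 + np.exp(-F))
    else:
        g = -lr / (1.0 + np.exp(F))
    dW = g * np.outer(y, y)            # outer(y, y) is already symmetric
    np.fill_diagonal(dW, 0.0)          # no self-connections
    W += dW
```

Repeating `train_step` over the right-shifted cases (positive) and left-shifted cases (negative) drives the module toward low free energy for one shift direction and high free energy for the other.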
The training was particularly easy because the hidden units only receive connections from the input units, which are always clamped, so the network settles to a free energy minimum in one iteration. Details of the simulations are given in Galland and Hinton (1990). \n\n5 MAXIMIZING MUTUAL INFORMATION BETWEEN MEAN FIELD MODULES \n\nAt first sight, the new learning procedure is inherently supervised, so how can it be used to discover that shift is an important underlying feature? One method is to use two modules that each supervise the other. The most obvious way of implementing this idea quickly creates modules that always agree because they are always \"on\". If, however, we try to maximize the mutual information between the stochastic binary variables represented by the free energies of the modules, there is a strong pressure for each binary variable to have high entropy across cases because the mutual information between binary variables A and B is: \n\nI(A; B) = H_A + H_B - H_{AB}   (6) \n\nwhere H_{AB} is the entropy of the joint distribution of A and B over the training cases, and H_A and H_B are the entropies of the individual distributions. \n\nConsider two mean field modules with associated stochastic binary variables A, B \\in \\{0, 1\\}. For a given case \\alpha, \n\np(A^\\alpha = 1) = \\frac{1}{1 + e^{F^\\alpha_A}}   (7) \n\nwhere F^\\alpha_A is the free energy of the A module with the training case \\alpha clamped on the input. \n\nWe can compute the probability that the A module is on or off by averaging over the input sample distribution, with p^\\alpha being the prior probability of an input case \\alpha: \n\np(A = 1) = \\sum_\\alpha p^\\alpha \\, p(A^\\alpha = 1), \\qquad p(A = 0) = 1 - p(A = 1) \n\nSimilarly, we can compute the four possible values in the joint probability distribution of A and B: \n\np(A = 1, B = 1) = \\sum_\\alpha p^\\alpha \\, p(A^\\alpha = 1) \\, p(B^\\alpha = 1)   (8) \n\np(A = 0, B = 1) = p(B = 1) - p(A = 1, B = 1) \n\np(A = 1, B = 0) = p(A = 1) - p(A = 1, B = 1) \n\np(A = 0, B = 0) = 1 - p(B = 1) - p(A = 1) + p(A = 1, B = 1) \n\nUsing equation (3), the partial derivatives of the various individual and joint probability functions with respect to a weight w_{ik} in the A module are readily calculated. For example, \n\n\\frac{\\partial p(A = 1, B = 1)}{\\partial w_{ik}} = \\sum_\\alpha p^\\alpha \\, \\frac{\\partial p(A^\\alpha = 1)}{\\partial w_{ik}} \\, p(B^\\alpha = 1)   (9) \n\nThe entropy of the stochastic binary variable A is \n\nH_A = - \\sum_{a=0,1} p(A = a) \\log p(A = a) \n\nThe entropy of the joint distribution is given by \n\nH_{AB} = - \\sum_{a,b} p(A = a, B = b) \\log p(A = a, B = b) \n\nThe partial derivative of I(A; B) with respect to a single weight w_{ik} in the A module can now be computed; since H_B does not depend on w_{ik}, we need only differentiate H_A and H_{AB}. As shown in Galland and Hinton (1990), the derivative is given by \n\n\\frac{\\partial I(A; B)}{\\partial w_{ik}} = \\sum_\\alpha p^\\alpha \\, p(A^\\alpha = 1)(1 - p(A^\\alpha = 1)) \\, y_i y_k \\left[ \\log \\frac{p(A = 0)}{p(A = 1)} + p(B^\\alpha = 1) \\log \\frac{p(A = 1, B = 1)}{p(A = 0, B = 1)} + p(B^\\alpha = 0) \\log \\frac{p(A = 1, B = 0)}{p(A = 0, B = 0)} \\right] \n\nThe above derivation is drawn from Becker and Hinton (1989), who show that mutual information can be used as a learning signal in back-propagation nets. We can now perform gradient ascent in I(A; B) for each weight in both modules using a two-pass procedure, the probabilities across cases being accumulated in the first pass. 
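The first pass of this two-pass procedure, which accumulates the individual and joint distributions of equations (7) and (8) and evaluates equation (6), can be sketched as follows (our own illustrative code; the argument names and the uniform default prior are assumptions).

```python
import numpy as np

def mutual_information(p_a, p_b, prior=None, eps=1e-12):
    # p_a[k], p_b[k]: probabilities that modules A and B are 'on' for case k,
    # obtained from their free energies via p = 1 / (1 + exp(F*)), equation (7).
    # Returns I(A; B) = H_A + H_B - H_AB, equation (6), with the marginal and
    # joint distributions accumulated across the training cases.
    p_a = np.asarray(p_a, dtype=float)
    p_b = np.asarray(p_b, dtype=float)
    if prior is None:
        prior = np.full(len(p_a), 1.0 / len(p_a))
    pA1 = np.sum(prior * p_a)               # p(A = 1)
    pB1 = np.sum(prior * p_b)               # p(B = 1)
    p11 = np.sum(prior * p_a * p_b)         # p(A = 1, B = 1), equation (8)
    joint = np.array([[1.0 - pA1 - pB1 + p11, pB1 - p11],
                      [pA1 - p11, p11]])
    def H(p):
        p = np.asarray(p).ravel()
        return -np.sum(p * np.log(p + eps))
    return H([pA1, 1.0 - pA1]) + H([pB1, 1.0 - pB1]) - H(joint)
```

For two perfectly correlated binary variables this returns log 2, and for independent ones it returns zero, which is the pressure that forces each module to be 'on' for one shift direction and 'off' for the other.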
\n\nThis approach was applied to a system of two mean field modules (the left and right halves of figure 1 without the connecting central unit) to detect shift. As in the multi-completion task, random binary vectors were clamped onto inputs A1, A2 and B1, B2, related only by shift. Hence, the only way the two modules can provide mutual information to each other is by representing the shift. Maximizing the mutual information between them created perfect shift detecting modules in only 10 two-pass sweeps through the 288 training cases. That is, after training, each module was found to have low free energy for either left or right shifts, and high free energy for the other. Details of the simulations are again given in Galland and Hinton (1990). \n\n6 SUMMARY \n\nStandard deterministic Boltzmann learning failed to extract high order features at a network bottleneck. We then explored a variant of DBM learning in which the free energy of a module represents a stochastic binary variable. This variant can efficiently discover that shift is an important feature without using external supervision, provided we use an architecture and an objective function that are designed to extract higher order features which are invariant across space. \n\nAcknowledgements \n\nWe would like to thank Sue Becker for many helpful comments. This research was supported by grants from the Ontario Information Technology Research Center and the Natural Sciences and Engineering Research Council of Canada. Geoffrey Hinton is a fellow of the Canadian Institute for Advanced Research. \n\nReferences \n\nBecker, S. and Hinton, G. E. (1989). Spatial coherence as an internal teacher for a neural network. Technical Report CRG-TR-89-7, University of Toronto. \n\nGalland, C. C. and Hinton, G. E. (1990). Experiments on discovering high order features with mean field modules. 
University of Toronto Connectionist Research Group Technical Report, forthcoming. \n\nHinton, G. E. (1986). Learning distributed representations of concepts. Proceedings of the Eighth Annual Conference of the Cognitive Science Society, Amherst, Mass. \n\nHinton, G. E. (1987). Connectionist learning procedures. Technical Report CMU-CS-87-115, Carnegie Mellon University. \n\nHinton, G. E. (1989). Deterministic Boltzmann learning performs steepest descent in weight-space. Neural Computation, 1. \n\nHinton, G. E. and Sejnowski, T. J. (1986). Learning and relearning in Boltzmann machines. In Rumelhart, D. E., McClelland, J. L., and the PDP group, Parallel Distributed Processing: Explorations in the Microstructure of Cognition. Volume 1: Foundations, MIT Press, Cambridge, MA. \n\nHopfield, J. J. (1984). Neurons with graded response have collective computational properties like those of two-state neurons. Proceedings of the National Academy of Sciences U.S.A., 81, 3088-3092. \n\nPeterson, C. and Anderson, J. R. (1987). A mean field theory learning algorithm for neural networks. Complex Systems, 1, 995-1019. \n\nPeterson, C. and Hartman, E. (1988). Explorations of the mean field theory learning algorithm. Technical Report ACA-ST/HI-065-88, Microelectronics and Computer Technology Corporation, Austin, TX. \n\nPineda, F. J. (1987). Generalization of backpropagation to recurrent neural networks. Phys. Rev. Lett., 59, 2229-2232. \n", "award": [], "sourceid": 260, "authors": [{"given_name": "Conrad", "family_name": "Galland", "institution": null}, {"given_name": "Geoffrey", "family_name": "Hinton", "institution": null}]}