{"title": "Learning Representations by Recirculation", "book": "Neural Information Processing Systems", "page_first": 358, "page_last": 366, "abstract": null, "full_text": "LEARNING REPRESENTATIONS BY RECIRCULATION

Geoffrey E. Hinton
Computer Science and Psychology Departments, University of Toronto,
Toronto M5S 1A4, Canada

James L. McClelland
Psychology and Computer Science Departments, Carnegie-Mellon University,
Pittsburgh, PA 15213

ABSTRACT

We describe a new learning procedure for networks that contain groups of non-linear units arranged in a closed loop. The aim of the learning is to discover codes that allow the activity vectors in a \"visible\" group to be represented by activity vectors in a \"hidden\" group. One way to test whether a code is an accurate representation is to try to reconstruct the visible vector from the hidden vector. The difference between the original and the reconstructed visible vectors is called the reconstruction error, and the learning procedure aims to minimize this error. The learning procedure has two passes. On the first pass, the original visible vector is passed around the loop, and on the second pass an average of the original vector and the reconstructed vector is passed around the loop. The learning procedure changes each weight by an amount proportional to the product of the \"presynaptic\" activity and the difference in the post-synaptic activity on the two passes. This procedure is much simpler to implement than methods like back-propagation. Simulations in simple networks show that it usually converges rapidly on a good set of codes, and analysis shows that in certain restricted cases it performs gradient descent in the squared reconstruction error.
INTRODUCTION

Supervised gradient-descent learning procedures such as back-propagation1 have been shown to construct interesting internal representations in \"hidden\" units that are not part of the input or output of a connectionist network. One criticism of back-propagation is that it requires a teacher to specify the desired output vectors. It is possible to dispense with the teacher in the case of \"encoder\" networks2 in which the desired output vector is identical with the input vector (see Fig. 1). The purpose of an encoder network is to learn good \"codes\" in the intermediate, hidden units. If, for example, there are fewer hidden units than input units, an encoder network will perform data-compression3. It is also possible to introduce other kinds of constraints on the hidden units, so we can view an encoder network as a way of ensuring that the input can be reconstructed from the activity in the hidden units whilst also making the hidden units satisfy some other constraint.

This research was supported by contract N00014-86-K-00167 from the Office of Naval Research and a grant from the Canadian National Science and Engineering Research Council. Geoffrey Hinton is a fellow of the Canadian Institute for Advanced Research. We thank Mike Franzini, Conrad Galland and Geoffrey Goodhill for helpful discussions and help with the simulations.

© American Institute of Physics 1988

A second criticism of back-propagation is that it is neurally implausible (and hard to implement in hardware) because it requires all the connections to be used backwards and it requires the units to use different input-output functions for the forward and backward passes. Recirculation is designed to overcome this second criticism in the special case of encoder networks.

[Figure: three layers of units, with input units at the bottom, hidden units in the middle and output units at the top.]

Fig. 1.
A diagram of a three-layer encoder network that learns good codes using back-propagation. On the forward pass, activity flows from the input units in the bottom layer to the output units in the top layer. On the backward pass, error-derivatives flow from the top layer to the bottom layer.

Instead of using a separate group of units for the input and output we use the very same group of \"visible\" units, so the input vector is the initial state of this group and the output vector is the state after information has passed around the loop. The difference between the activity of a visible unit before and after sending activity around the loop is the derivative of the squared reconstruction error. So, if the visible units are linear, we can perform gradient descent in the squared error by changing each of a visible unit's incoming weights by an amount proportional to the product of this difference and the activity of the hidden unit from which the connection emanates. So learning the weights from the hidden units to the output units is simple. The harder problem is to learn the weights on connections coming into hidden units because there is no direct specification of the desired states of these units. Back-propagation solves this problem by back-propagating error-derivatives from the output units to generate error-derivatives for the hidden units. Recirculation solves the problem in a quite different way that is easier to implement but much harder to analyse.

THE RECIRCULATION PROCEDURE

We introduce the recirculation procedure by considering a very simple architecture in which there is just one group of hidden units. Each visible unit has a directed connection to every hidden unit, and each hidden unit has a directed connection to every visible unit.
The total input received by a unit is

    x_j = Σ_i y_i w_ji - θ_j        (1)

where y_i is the state of the ith unit, w_ji is the weight on the connection from the ith to the jth unit and θ_j is the threshold of the jth unit. The threshold term can be eliminated by giving every unit an extra input connection whose activity level is fixed at 1. The weight on this special connection is the negative of the threshold, and it can be learned in just the same way as the other weights. This method of implementing thresholds will be assumed throughout the paper.

The functions relating inputs to outputs of visible and hidden units are smooth monotonic functions with bounded derivatives. For hidden units we use the logistic function:

    y_j = σ(x_j) = 1 / (1 + e^(-x_j))        (2)

Other smooth monotonic functions would serve as well. For visible units, our mathematical analysis focuses on the linear case in which the output equals the total input, though in simulations we use the logistic function.

We have already given a verbal description of the learning rule for the hidden-to-visible connections. The weight, w_ij, from the jth hidden unit to the ith visible unit is changed as follows:

    Δw_ij = ε y_j(1) [y_i(0) - y_i(2)]        (3)

where y_i(0) is the state of the ith visible unit at time 0 and y_i(2) is its state at time 2 after activity has passed around the loop once. The rule for the visible-to-hidden connections is identical:

    Δw_ji = ε y_i(2) [y_j(1) - y_j(3)]        (4)

where y_j(1) is the state of the jth hidden unit at time 1 (on the first pass around the loop) and y_j(3) is its state at time 3 (on the second pass around the loop). Fig. 2 shows the network exploded in time.

In general, this rule for changing the visible-to-hidden connections does not perform steepest descent in the squared reconstruction error, so it behaves differently from back-propagation.
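The cycle and the two weight updates can be sketched in a few lines of NumPy. This is our own minimal illustration, not the authors' code: thresholds are omitted, the visible units are taken to be linear, and regression (introduced with Eq 5) is left out for clarity.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def recirculation_updates(v0, W_vh, W_hv, eps):
    # First pass around the loop: visible (t=0) -> hidden (t=1) -> visible (t=2).
    h1 = sigmoid(v0 @ W_vh)   # Eqs 1-2: logistic hidden units
    v2 = h1 @ W_hv            # linear visible units
    # Second pass: the reconstructed vector goes around the loop again.
    h3 = sigmoid(v2 @ W_vh)
    # Eq 3: presynaptic activity times the difference in postsynaptic
    # activity on the two passes.
    dW_hv = eps * np.outer(h1, v0 - v2)
    # Eq 4: the same rule, one time step later.
    dW_vh = eps * np.outer(v2, h1 - h3)
    return dW_vh, dW_hv, v2
```

For a small ε the Eq 3 update is a gradient step on the squared reconstruction error when the visible units are linear; the Eq 4 update is the quantity whose relation to the gradient the paper goes on to analyse.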
This raises two issues: Under what conditions does it work, and under what conditions does it approximate steepest descent?

[Figure: the states of the network at time = 0, 1, 2 and 3.]

Fig. 2. A diagram showing the states of the visible and hidden units exploded in time. The visible units are at the bottom and the hidden units are at the top. Time goes from left to right.

CONDITIONS UNDER WHICH RECIRCULATION APPROXIMATES GRADIENT DESCENT

For the simple architecture shown in Fig. 2, the recirculation learning procedure changes the visible-to-hidden weights in the direction of steepest descent in the squared reconstruction error provided the following conditions hold:

1. The visible units are linear.
2. The weights are symmetrical (i.e. w_ji = w_ij for all i, j).
3. The visible units have high regression.

\"Regression\" means that, after one pass around the loop, instead of setting the activity of a visible unit, i, to be equal to its current total input, x_i(2), as determined by Eq 1, we set its activity to be

    y_i(2) = λ y_i(0) + (1 - λ) x_i(2)        (5)

where the regression, λ, is close to 1. Using high regression ensures that the visible units only change state slightly so that when the new visible vector is sent around the loop again on the second pass, it has very similar effects to the first pass. In order to make the learning rule for the hidden units as similar as possible to the rule for the visible units, we also use regression in computing the activity of the hidden units on the second pass

    y_j(3) = λ y_j(1) + (1 - λ) σ(x_j(3))        (6)

For a given input vector, the squared reconstruction error, E, is

    E = (1/2) Σ_k [y_k(2) - y_k(0)]^2

For a hidden unit, j,

    dE/dy_j(1) = Σ_k [y_k(2) - y_k(0)] y_k'(2) w_kj        (7)

where y_k'(2) is the slope of the input-output function of unit k at time 2.

For a visible-to-hidden weight w_ji,

    dE/dw_ji = y_j'(1) y_i(0) dE/dy_j(1)

So, using Eq 7 and the assumption that w_kj = w_jk for all k, j,

    dE/dw_ji = y_j'(1) y_i(0) [Σ_k y_k(2) y_k'(2) w_jk - Σ_k y_k(0) y_k'(2) w_jk]

The assumption that the visible units are linear (with a gradient of 1) means that, for all k, y_k'(2) = 1. So using Eq 1 we have

    dE/dw_ji = y_j'(1) y_i(0) [x_j(3) - x_j(1)]        (8)

Now, with sufficiently high regression, we can assume that the states of units only change slightly with time, so that

    y_i(0) ≈ y_i(2)

and

    y_j(3) - y_j(1) ≈ (1 - λ) y_j'(1) [x_j(3) - x_j(1)]

So by substituting in Eq 8 we get

    dE/dw_ji ≈ (1 / (1 - λ)) y_i(2) [y_j(3) - y_j(1)]        (9)

An interesting property of Eq 9 is that it does not contain a term for the gradient of the input-output function of unit j, so recirculation learning can be applied even when unit j uses an unknown non-linearity. To do back-propagation it is necessary to know the gradient of the non-linearity, but recirculation measures the gradient by measuring the effect of a small difference in input, so the term y_j(3) - y_j(1) implicitly contains the gradient.

A SIMULATION OF RECIRCULATION

From a biological standpoint, the symmetry requirement that w_ij = w_ji is unrealistic unless it can be shown that this symmetry of the weights can be learned. To investigate what would happen if symmetry was not enforced (and if the visible units used the same non-linearity as the hidden units), we applied the recirculation learning procedure to a network with 4 visible units and 2 hidden units. The visible vectors were 1000, 0100, 0010 and 0001, so the 2 hidden units had to learn 4 different codes to represent these four visible vectors. All the weights and biases in the network were started at small random values uniformly distributed in the range -0.5 to +0.5.
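The approximation in Eq 9 can be checked numerically. The sketch below is ours, not from the paper: it builds a 4-2 loop satisfying the three conditions (linear visible units, symmetrical weights, λ close to 1) and compares the Eq 4 update direction against a finite-difference gradient of the reconstruction error.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def recon_error(v0, W_vh, W_hv):
    # Squared reconstruction error with linear visible units, no regression.
    return 0.5 * np.sum((v0 - sigmoid(v0 @ W_vh) @ W_hv) ** 2)

def recirc_update_vh(v0, W_vh, W_hv, lam):
    # Visible-to-hidden update direction of Eq 4 (the factor eps is omitted).
    h1 = sigmoid(v0 @ W_vh)
    v2 = lam * v0 + (1 - lam) * (h1 @ W_hv)          # Eq 5, linear visible units
    h3 = lam * h1 + (1 - lam) * sigmoid(v2 @ W_vh)   # Eq 6
    return np.outer(v2, h1 - h3)

rng = np.random.default_rng(1)
W_vh = rng.uniform(-0.5, 0.5, (4, 2))
W_hv = W_vh.T.copy()                 # condition 2: symmetrical weights
v0 = np.array([1.0, 0.0, 0.0, 0.0])

# Finite-difference gradient of E with respect to the visible-to-hidden weights.
grad = np.zeros_like(W_vh)
h = 1e-6
for i in range(W_vh.shape[0]):
    for j in range(W_vh.shape[1]):
        Wp, Wm = W_vh.copy(), W_vh.copy()
        Wp[i, j] += h
        Wm[i, j] -= h
        grad[i, j] = (recon_error(v0, Wp, W_hv) - recon_error(v0, Wm, W_hv)) / (2 * h)

# Condition 3: high regression. The update should point down the gradient,
# so its cosine with the negative gradient should be close to 1.
upd = recirc_update_vh(v0, W_vh, W_hv, lam=0.999)
cos = (upd.ravel() @ -grad.ravel()) / (np.linalg.norm(upd) * np.linalg.norm(grad))
```

Lowering λ or breaking the weight symmetry makes the cosine drift away from 1, which is the behaviour the analysis predicts.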
We used regression in the hidden units, even though this is not strictly necessary, but we ignored the term 1/(1 - λ) in Eq 9.

Using an ε of 20 and a λ of 0.75 for both the visible and the hidden units, the network learned to produce a reconstruction error of less than 0.1 on every unit in an average of 48 weight updates (with a maximum of 202 in 100 simulations). Each weight update was performed after trying all four training cases and the change was the sum of the four changes prescribed by Eq 3 or 4 as appropriate. The final reconstruction error was measured using a regression of 0, even though high regression was used during the learning. The learning speed is comparable with back-propagation, though a precise comparison is hard because the optimal values of ε are different in the two cases. Also, the fact that we ignored the term 1/(1 - λ) when modifying the visible-to-hidden weights means that recirculation tends to change the visible-to-hidden weights more slowly than the hidden-to-visible weights, and this would also help back-propagation.

It is not immediately obvious why the recirculation learning procedure works when the weights are not constrained to be symmetrical, so we compared the weight changes prescribed by the recirculation procedure with the weight changes that would cause steepest descent in the sum squared reconstruction error (i.e. the weight changes prescribed by back-propagation). As expected, recirculation and back-propagation agree on the weight changes for the hidden-to-visible connections, even though the gradient of the logistic function is not taken into account in weight adjustments under recirculation.
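The simulation just described is easy to reproduce. The sketch below is our own reconstruction, not the original code: logistic units throughout, biases implemented as weights from a fixed input of 1, ε = 20 and λ = 0.75, one batch update over the four one-hot training cases per epoch, and the final error measured with regression 0.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def with_bias(v):
    # Append the fixed input of 1 that implements thresholds as weights.
    return np.concatenate([v, [1.0]])

def train_encoder(epochs=500, eps=20.0, lam=0.75, seed=0):
    # 4-2-4 recirculation encoder: logistic visible and hidden units,
    # batch updates over the four training cases (Eqs 3 and 4), and the
    # 1/(1 - lam) term of Eq 9 ignored, as described above.
    rng = np.random.default_rng(seed)
    W_vh = rng.uniform(-0.5, 0.5, (5, 2))  # 4 visible units + bias -> 2 hidden
    W_hv = rng.uniform(-0.5, 0.5, (3, 4))  # 2 hidden units + bias -> 4 visible
    data = np.eye(4)
    errors = []  # worst per-unit reconstruction error, measured with regression 0
    for _ in range(epochs):
        dW_vh, dW_hv = np.zeros_like(W_vh), np.zeros_like(W_hv)
        for v0 in data:
            h1 = sigmoid(with_bias(v0) @ W_vh)
            v2 = lam * v0 + (1 - lam) * sigmoid(with_bias(h1) @ W_hv)
            h3 = lam * h1 + (1 - lam) * sigmoid(with_bias(v2) @ W_vh)
            dW_hv += eps * np.outer(with_bias(h1), v0 - v2)  # Eq 3
            dW_vh += eps * np.outer(with_bias(v2), h1 - h3)  # Eq 4
        W_hv += dW_hv
        W_vh += dW_vh
        recon = [sigmoid(with_bias(sigmoid(with_bias(v) @ W_vh)) @ W_hv) for v in data]
        errors.append(max(np.max(np.abs(v - r)) for v, r in zip(data, recon)))
    return W_vh, W_hv, errors

W_vh, W_hv, errors = train_encoder()
```

The exact epoch count at which the error drops below 0.1 depends on the random seed, matching the spread over 100 simulations reported above.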
(Conrad Galland has observed that this agreement is only slightly affected by using visible units that have the non-linear input-output function shown in Eq 2 because at any stage of the learning, all the visible units tend to have similar slopes for their input-output functions, so the non-linearity scales all the weight changes by approximately the same amount.)

For the visible-to-hidden connections, recirculation initially prescribes weight changes that are only randomly related to the direction of steepest descent, so these changes do not help to improve the performance of the system. As the learning proceeds, however, these changes come to agree with the direction of steepest descent. The crucial observation is that this agreement occurs after the hidden-to-visible weights have changed in such a way that they are approximately aligned (symmetrical up to a constant factor) with the visible-to-hidden weights. So it appears that changing the hidden-to-visible weights in the direction of steepest descent creates the conditions that are necessary for the recirculation procedure to cause changes in the visible-to-hidden weights that follow the direction of steepest descent.

It is not hard to see why this happens if we start with random, zero-mean visible-to-hidden weights. If the visible-to-hidden weight w_ji is positive, hidden unit j will tend to have a higher than average activity level when the ith visible unit has a higher than average activity. So y_j will tend to be higher than average when the reconstructed value of y_i should be higher than average -- i.e. when the term [y_i(0) - y_i(2)] in Eq 3 is positive. It will also be lower than average when this term is negative. These relationships will be reversed if w_ji is negative, so w_ij will grow faster when w_ji is positive than it will when w_ji is negative.
Smolensky4 presents a mathematical analysis that shows why a similar learning procedure creates symmetrical weights in a purely linear system. Williams5 also analyses a related learning rule for linear systems which he calls the \"symmetric error correction\" procedure and he shows that it performs principal components analysis. In our simulations of recirculation, the visible-to-hidden weights become aligned with the corresponding hidden-to-visible weights, though the hidden-to-visible weights are generally of larger magnitude.

A PICTURE OF RECIRCULATION

To gain more insight into the conditions under which recirculation learning produces the appropriate changes in the visible-to-hidden weights, we introduce the pictorial representation shown in Fig. 3. The initial visible vector, A, is mapped into the reconstructed vector, C, so the error vector is AC. Using high regression, the visible vector that is sent around the loop on the second pass is P, where the difference vector AP is a small fraction of the error vector AC. If the regression is sufficiently high and all the non-linearities in the system have bounded derivatives and the weights have bounded magnitudes, the difference vectors AP, BQ, and CR will be very small and we can assume that, to first order, the system behaves linearly in these difference vectors. If, for example, we moved P so as to double the length of AP we would also double the length of BQ and CR.

Fig. 3. A diagram showing some vectors (A, P) over the visible units, their \"hidden\" images (B, Q) over the hidden units, and their \"visible\" images (C, R) over the visible units. The vectors B' and C' are the hidden and visible images of A after the visible-to-hidden weights have been changed by the learning procedure.

Suppose we change the visible-to-hidden weights in the manner prescribed by Eq 4, using a very small value of ε. Let Q' be the hidden image of P (i.e.
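The alignment observed in these simulations can be tracked directly by measuring the cosine between the visible-to-hidden weight matrix and the transpose of the hidden-to-visible weight matrix during training. The sketch below is our own measurement code (biases omitted, linear visible units, and a smaller ε than the simulation above; all of these are our assumptions, not the paper's settings):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def alignment(W_vh, W_hv):
    # Cosine between the visible-to-hidden weights and the transpose of the
    # hidden-to-visible weights; 1.0 means perfectly aligned up to a scale.
    a, b = W_vh.ravel(), W_hv.T.ravel()
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

rng = np.random.default_rng(2)
W_vh = rng.uniform(-0.5, 0.5, (4, 2))   # visible -> hidden
W_hv = rng.uniform(-0.5, 0.5, (2, 4))   # hidden -> visible
data = np.eye(4)                        # the four one-hot training vectors
eps, lam = 0.5, 0.75
align_before = alignment(W_vh, W_hv)
for _ in range(500):
    dW_vh, dW_hv = np.zeros_like(W_vh), np.zeros_like(W_hv)
    for v0 in data:
        h1 = sigmoid(v0 @ W_vh)
        v2 = lam * v0 + (1 - lam) * (h1 @ W_hv)   # linear visible units
        h3 = lam * h1 + (1 - lam) * sigmoid(v2 @ W_vh)
        dW_hv += eps * np.outer(h1, v0 - v2)      # Eq 3
        dW_vh += eps * np.outer(v2, h1 - h3)      # Eq 4
    W_hv += dW_hv
    W_vh += dW_vh
align_after = alignment(W_vh, W_hv)
```

With random initial weights the cosine starts near zero; as training proceeds it should grow, reflecting the alignment (symmetry up to a constant factor) described above.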
the image of P in the hidden units) after the weight changes. To first order, Q' will lie between B and Q on the line BQ. This follows from the observation that Eq 4 has the effect of moving each y_j(3) towards y_j(1) by an amount proportional to their difference. Since B is close to Q, a weight change that moves the hidden image of P from Q to Q' will move the hidden image of A from B to B', where B' lies on the extension of the line BQ as shown in Fig. 3. If the hidden-to-visible weights are not changed, the visible image of A will move from C to C', where C' lies on the extension of the line CR as shown in Fig. 3. So the visible-to-hidden weight changes will reduce the squared reconstruction error provided the vector CR is approximately parallel to the vector AP.

But why should we expect the vector CR to be aligned with the vector AP? In general we should not, except when the visible-to-hidden and hidden-to-visible weights are approximately aligned. The learning in the hidden-to-visible connections has a tendency to cause this alignment. In addition, it is easy to modify the recirculation learning procedure so as to increase the tendency for the learning in the hidden-to-visible connections to cause alignment. Eq 3 has the effect of moving the visible image of A closer to A by an amount proportional to the magnitude of the error vector AC. If we apply the same rule on the next pass around the loop, we move the visible image of P closer to P by an amount proportional to the magnitude of PR. If the vector CR is anti-aligned with the vector AP, the magnitude of AC will exceed the magnitude of PR, so the result of these two movements will be to improve the alignment between AP and CR. We have not yet tested this modified procedure through simulations, however.
This is only an informal argument and much work remains to be done in establishing the precise conditions under which the recirculation learning procedure approximates steepest descent. The informal argument applies equally well to systems that contain longer loops which have several groups of hidden units arranged in series. At each stage in the loop, the same learning procedure can be applied, and the weight changes will approximate gradient descent provided the difference of the two visible vectors that are sent around the loop aligns with the difference of their images. We have not yet done enough simulations to develop a clear picture of the conditions under which the changes in the hidden-to-visible weights produce the required alignment.

USING A HIERARCHY OF CLOSED LOOPS

Instead of using a single loop that contains many hidden layers in series, it is possible to use a more modular system. Each module consists of one \"visible\" group and one \"hidden\" group connected in a closed loop, but the visible group for one module is actually composed of the hidden groups of several lower level modules, as shown in Fig. 4. Since the same learning rule is used for both visible and hidden units, there is no problem in applying it to systems in which some units are the visible units of one module and the hidden units of another. Ballard6 has experimented with back-propagation in this kind of system, and we have run some simulations of recirculation using the architecture shown in Fig. 4. The network learned to encode a set of vectors specified over the bottom layer. After learning, each of the vectors became an attractor and the network was capable of completing a partial vector, even though this involved passing information through several layers.

Fig 4. A network in which the hidden units of the bottom two modules are the visible units of the top module.
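The modular scheme of Fig. 4 can be sketched as follows. This is our own illustrative code, not the paper's simulation: the group sizes, the absence of biases and the single training cycle are all assumptions. Each module applies exactly the same recirculation rule, and the first-pass hidden codes of the two bottom modules form the visible vector of the top module.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class Module:
    # One closed loop: a visible group and a hidden group (no biases here).
    def __init__(self, n_vis, n_hid, rng):
        self.W_vh = rng.uniform(-0.5, 0.5, (n_vis, n_hid))
        self.W_hv = rng.uniform(-0.5, 0.5, (n_hid, n_vis))

    def step(self, v0, lam=0.75, eps=1.0):
        # One recirculation cycle: apply Eqs 3 and 4, then return the
        # first-pass hidden code, which serves as input to a parent module.
        h1 = sigmoid(v0 @ self.W_vh)
        v2 = lam * v0 + (1 - lam) * sigmoid(h1 @ self.W_hv)
        h3 = lam * h1 + (1 - lam) * sigmoid(v2 @ self.W_vh)
        self.W_hv += eps * np.outer(h1, v0 - v2)
        self.W_vh += eps * np.outer(v2, h1 - h3)
        return h1

rng = np.random.default_rng(3)
left, right = Module(4, 2, rng), Module(4, 2, rng)
top = Module(4, 2, rng)  # its 4 visible units are the two hidden groups below

x_left, x_right = np.eye(4)[0], np.eye(4)[2]
code = top.step(np.concatenate([left.step(x_left), right.step(x_right)]))
```

Because the same rule runs in every module, no unit needs to know whether it is serving as a visible unit, a hidden unit, or both at once.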
CONCLUSION

We have described a simple learning procedure that is capable of forming representations in non-linear hidden units whose input-output functions have bounded derivatives. The procedure is easy to implement in hardware, even if the non-linearity is unknown. Given some strong assumptions, the procedure performs gradient descent in the reconstruction error. If the symmetry assumption is violated, the learning procedure still works because the changes in the hidden-to-visible weights produce symmetry. If the assumption about the linearity of the visible units is violated, the procedure still works in the cases we have simulated. For the general case of a loop with many non-linear stages, we have an informal picture of a condition that must hold for the procedure to approximate gradient descent, but we do not have a formal analysis, and we do not have sufficient experience with simulations to give an empirical description of the general conditions under which the learning procedure works.

REFERENCES

1. D. E. Rumelhart, G. E. Hinton and R. J. Williams, Nature 323, 533-536 (1986).
2. D. H. Ackley, G. E. Hinton and T. J. Sejnowski, Cognitive Science 9, 147-169 (1985).
3. G. Cottrell, J. L. Elman and D. Zipser, Proc. Cognitive Science Society, Seattle, WA (1987).
4. P. Smolensky, Technical Report CU-CS-355-87, University of Colorado at Boulder (1986).
5. R. J. Williams, Technical Report 8501, Institute of Cognitive Science, University of California, San Diego (1985).
6. D. H. Ballard, Proc. American Association for Artificial Intelligence, Seattle, WA (1987).
", "award": [], "sourceid": 78, "authors": [{"given_name": "Geoffrey", "family_name": "Hinton", "institution": null}, {"given_name": "James", "family_name": "McClelland", "institution": null}]}