{"title": "Holographic Recurrent Networks", "book": "Advances in Neural Information Processing Systems", "page_first": 34, "page_last": 41, "abstract": null, "full_text": "Holographic Recurrent Networks \n\nTony A. Plate \n\nDepartment of Computer Science \n\nUniversity of Toronto \n\nToronto, M5S lA4 Canada \n\nAbstract \n\nHolographic Recurrent Networks (HRNs) are recurrent networks \nwhich incorporate associative memory techniques for storing se(cid:173)\nquential structure. HRNs can be easily and quickly trained using \ngradient descent techniques to generate sequences of discrete out(cid:173)\nputs and trajectories through continuous spaee. The performance \nof HRNs is found to be superior to that of ordinary recurrent net(cid:173)\nworks on these sequence generation tasks. \n\n1 \n\nINTRODUCTION \n\nThe representation and processing of data with complex structure in neural networks \nremains a challenge. In a previous paper [Plate, 1991b] I described Holographic Re(cid:173)\nduced Representations (HRRs) which use circular-convolution associative-memory \nto embody sequential and recursive structure in fixed-width distributed represen(cid:173)\ntations. This paper introduces Holographic Recurrent Networks (HRNs), which \nare recurrent nets that incorporate these techniques for generating sequences of \nsymbols or trajectories through continuous space. The recurrent component of \nthese networks uses convolution operations rather than the logistic-of-matrix-vector(cid:173)\nproduct traditionally used in simple recurrent networks (SRNs) [Elman, 1991, \nCleeremans et a/., 1991]. \n\nThe goals ofthis work are threefold: (1) to investigate the use of circular-convolution \nassociative memory techniques in networks trained by gradient descent; (2) to see \nwhether adapting representations can improve the capacity of HRRs; and (3) to \ncompare performance of HRNs with SRNs. 
\n\n34 \n\n\fHolographic Recurrent Networks \n\n35 \n\n1.1 RECURRENT NETWORKS & SEQUENTIAL PROCESSING \n\nSRNs have been used successfully to process sequential input and induce finite \nstate grammars [Elman, 1991, Cleeremans et a/., 1991]. However, training times \nwere extremely long, even for very simple grammars. This appeared to be due to the \ndifficulty of findin& a recurrent operation that preserved sufficient context [Maskara \nand Noetzel, 1992J. In the work reported in this paper the task is reversed to be \none of generating sequential output. Furthermore, in order to focus on the context \nretention aspect, no grammar induction is required. \n\n1.2 CIRCULAR CONVOLUTION \n\nCircular convolution is an associative memory operator. The role of convolution \nin holographic memories is analogous to the role of the outer product operation in \nmatrix style associative memories (e.g., Hopfield nets). Circular convolution can be \nviewed as a vector multiplication operator which maps pairs of vectors to a vector \n(just as matrix multiplication maps pairs of matrices to a matrix). It is defined as \nz = x@y : Zj = I:~:~ YkXj-k, where @ denotes circular eonvolution, x, y, and z \nare vectors of dimension n , Xi etc. are their elements, and subscripts are modulo-n \n(so that X-2 = X n -2). Circular convolution can be computed in O(nlogn) using \nFast Fourier Transforms (FFTs). Algebraically, convolution behaves like scalar \nmultiplication: it is commutative, associative, and distributes over addition. The \nidentity vector for convolution (I) is the \"impulse\" vector: its zero'th element is 1 \nand all other elements are zero. Most vectors have an inverse under convolution, \ni.e., for most vectors x there exists a unique vector y (=x- 1 ) such that x@y = I. \nFor vectors with identically and independently distributed zero mean elements and \nan expected Euclidean length of 1 there is a numerically stable and simply derived \napproximate inverse. 
The approximate inverse of x is denoted by x* and is defined by the relation x*_j = x_{n-j}. \n\nVector pairs can be associated by circular convolution. Multiple associations can be summed. The result can be decoded by convolving with the exact inverse or the approximate inverse, though the latter generally gives more stable results. \n\nHolographic Reduced Representations [Plate, 1991a, Plate, 1991b] use circular convolution for associating elements of a structure in a way that can embody hierarchical structure. The key property of circular convolution that makes it useful for representing hierarchical structure is that the circular convolution of two vectors is another vector of the same dimension, which can be used in further associations. \n\nAmong associative memories, holographic memories have been regarded as inferior because they produce very noisy results and have poor error correcting properties. However, when used in Holographic Reduced Representations the noisy results can be cleaned up with conventional error correcting associative memories. This gives the best of both worlds - the ability to represent sequential and recursive structure and clean output vectors. \n\n2 TRAJECTORY-ASSOCIATION \n\nA simple method for storing sequences using circular convolution is to associate elements of the sequence with points along a predetermined trajectory. This is akin to the memory aid called the method of loci, which instructs us to remember a list of items by associating each term with a distinctive location along a familiar path. \n\n2.1 STORING SEQUENCES BY TRAJECTORY-ASSOCIATION \n\nElements of the sequence and loci (points) on the trajectory are all represented by n-dimensional vectors. The loci are derived from a single vector k - they are its successive convolutive powers: k^0, k^1, k^2, etc. 
The convolutive power is defined in the obvious way: k^0 is the identity vector and k^{i+1} = k^i@k. \n\nThe vector k must be chosen so that it does not blow up or disappear when raised to high powers, i.e., so that ||k^p|| = 1 for all p. The class of vectors which satisfy this constraint is easily identified in the frequency domain (the range of the discrete Fourier transform). They are the vectors for which the magnitude of each frequency component is equal to one. This class of vectors is identical to the class for which the approximate inverse is equal to the exact inverse. \n\nThus, the trajectory-association representation for the sequence \"abc\" is \n\nS_abc = a + b@k + c@k^2. \n\n2.2 DECODING TRAJECTORY-ASSOCIATED SEQUENCES \n\nTrajectory-associated sequences can be decoded by repeatedly convolving with the inverse of the vector that generated the encoding loci. The results of decoding summed convolution products are very noisy. Consequently, to decode trajectory-associated sequences, we must have all the possible sequence elements stored in an error correcting associative memory. I call this memory the \"clean up\" memory. \n\nFor example, to retrieve the third element of the sequence S_abc we convolve twice with k^{-1}, which expands to a@k^{-2} + b@k^{-1} + c. The two terms involving powers of k are unlikely to be correlated with anything in the clean up memory. The most similar item in clean up memory will probably be c. The clean up memory should recognize this and output the clean version of c. \n\n2.3 CAPACITY OF TRAJECTORY-ASSOCIATION \n\nIn [Plate, 1991a] the capacity of circular-convolution based associative memory was calculated. It was assumed that the elements of all vectors (dimension n) were chosen randomly from a gaussian distribution with mean zero and variance 1/n (giving an expected Euclidean length of 1.0). 
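The whole encode/decode cycle of Sections 2.1-2.2 can be sketched in a few lines. The unitary-key construction (unit-magnitude frequency components, so the approximate inverse is exact) and the nearest-neighbor clean-up step follow the text; the symbol vectors, seed, and dimension are illustrative assumptions:

```python
import numpy as np

def cconv(x, y):
    # circular convolution via FFTs
    return np.real(np.fft.ifft(np.fft.fft(x) * np.fft.fft(y)))

def random_unitary(n, rng):
    # random vector whose frequency components all have magnitude one,
    # so ||k^p|| = 1 for all p and the approximate inverse equals the exact inverse
    spec = np.exp(1j * rng.uniform(-np.pi, np.pi, n // 2 + 1))
    spec[0] = 1.0
    spec[-1] = 1.0  # keep the time-domain vector real (n even)
    return np.fft.irfft(spec, n)

n, rng = 512, np.random.default_rng(1)
symbols = {s: rng.normal(0.0, 1.0 / np.sqrt(n), n) for s in "abc"}
k = random_unitary(n, rng)

# S_abc = a + b@k + c@k^2
S = symbols["a"] + cconv(symbols["b"], k) + cconv(symbols["c"], cconv(k, k))

k_inv = np.concatenate(([k[0]], k[:0:-1]))  # involution k*_j = k_{n-j}: exact inverse here
probe = cconv(cconv(S, k_inv), k_inv)       # convolve twice with k^{-1}: a noisy c

# clean-up memory: output the stored symbol most similar to the probe
best = max(symbols, key=lambda s: float(np.dot(probe, symbols[s])))
print(best)
```

Convolving twice with k^{-1} expands S to a@k^{-2} + b@k^{-1} + c; only the c term correlates with anything stored, so the clean-up step recovers c.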
Quite high dimensional vectors were required to ensure a low probability of error in decoding. For example, with 512 element vectors and 1000 items in the clean up memory, 5 pairs can be stored with a 1% chance of an error in decoding. The scaling is nearly linear in n: with 1024 element vectors 10 pairs can be stored with about a 1% chance of error. This works out to an information capacity of about 0.1 bits per element. The elements are real numbers, but high precision is not required. \n\nThese capacity calculations are roughly applicable to the trajectory-association method. They slightly underestimate its capacity, because the restriction that the encoding loci have unity power in all frequencies results in lower decoding noise. Nonetheless this figure provides a useful benchmark against which to compare the capacity of HRNs, which adapt vectors using gradient descent. \n\n3 TRAJECTORY ASSOCIATION & RECURRENT NETS \n\nHRNs incorporate the trajectory-association scheme in recurrent networks. HRNs are very similar to SRNs, such as those used by [Elman, 1991] and [Cleeremans et al., 1991]. However, the task used in this paper is different: the generation of target sequences at the output units, with inputs that do not vary in time. \n\nIn order to understand the relationship between HRNs and SRNs both were tested on the sequence generation task. Several different unit activation functions were tried for the SRN: symmetric (tanh) and non-symmetric sigmoid (1/(1 + e^{-x})) for the hidden units, and softmax and normalized RBF for the output units. The best combination was symmetric sigmoid with softmax outputs. \n\n3.1 ARCHITECTURE \n\nThe HRN and the SRN used in the experiments described here are shown in Figure 1. In the HRN the key layer y contains the generator for the inverse loci (corresponding to k^{-1} in Section 2). 
The hidden to output nodes implement the clean-up memory: the output representation is local and the weights on the links to an output unit form the vector that represents the symbol corresponding to that unit. The softmax function serves to give maximum activation to the output unit whose weights are most similar to the activation at the hidden layer. \n\nThe input representation is also local, and input activations do not change during the generation of one sequence. Thus the weights from a single input unit determine the activations at the code layer. Nets are reset at the beginning of each sequence. \n\nThe HRN computes the following functions. Time superscripts are omitted where all are the same. See Figure 1 for symbols. The parameter g is an adaptable input gain shared by all output units. \n\nCode units: c_j = sum_k w^c_{jk} i_k \nHidden units: h = c (first time step); h = p@y (subsequent steps) \nContext units: p^t = h^{t-1} \nOutput units: x_j = g sum_k w^o_{jk} h_k (total input); o_j = e^{x_j} / sum_k e^{x_k} (softmax output) \n\nIn the SRN the only difference is in the recurrence operation, i.e., the computation of the activations of the hidden units, which is (where b_j is a bias): \n\nh_j = tanh(c_j + sum_k w_{jk} p_k + b_j). \n\nThe objective function of the network is the asymmetric divergence between the activations of the output units (o_j^{st}) and the targets (t_j^{st}), summed over cases s and timesteps t, plus two weight penalty terms (n is the number of hidden units): \n\nE = - sum_{s,t,j} t_j^{st} log(o_j^{st} / t_j^{st}) + (0.0001/n) (sum_{jk} (w^r_{jk})^2 + sum_{jk} (w^c_{jk})^2) + sum_j (1 - sum_k (w^o_{jk})^2)^2 \n\nThe first weight penalty term is a standard weight cost designed to penalize large \n\nFigure 1: Holographic Recurrent Network (HRN) and Simple Recurrent Network (SRN). The backwards curved arrows denote a copy of activations to the next time step. 
In the HRN the code layer is active only at the first time step, and the context layer is active only after the first time step. The hidden, code, context, and key layers all have the same number of units. Some input units are used only during training, others only during testing. \n\nweights. The second weight penalty term was designed to force the Euclidean length of the weight vector on each output unit to be one. This penalty term helped the HRN considerably but did not noticeably improve the performance of the SRN. \n\nThe partial derivatives for the activations were computed by the unfolding-in-time method [Rumelhart et al., 1986]. The partial derivatives for the activations of the context units in the HRN are: \n\ndE/dp_j = sum_k (dE/dh_k) y_{k-j} (i.e., grad_p E = (grad_h E)@y*) \n\nWhen there are a large number of hidden units it is more efficient to compute this derivative via FFTs as the convolution expression on the right. \n\nOn all sequences the net was cycled for as many time steps as required to produce the target sequence. The outputs did not indicate when the net had reached the end of the sequence; however, other experiments have shown that it is a simple matter to add an output to indicate this. \n\n3.2 TRAINING AND GENERATIVE CAPACITY RESULTS \n\nOne of the motivations for this work was to find recurrent networks with high generative capacity, i.e., networks which after training on just a few sequences could generate many other sequences without further modification of recurrent or output weights. The only thing in the network that changes to produce a different sequence is the activation on the code units. To have high generative capacity the function of the output weights and recurrent weights (if they exist) must generalize to the production of novel sequences. At each step the recurrent operation must update and retain information about the current position in the sequence. 
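The convolution form of this gradient, grad_p E = (grad_h E)@y*, can be verified against finite differences. The quadratic loss on h = p@y below is a stand-in assumption (the paper uses the divergence objective), chosen only to keep the check short:

```python
import numpy as np

def cconv(x, y):
    # circular convolution via FFTs
    return np.real(np.fft.ifft(np.fft.fft(x) * np.fft.fft(y)))

def involution(x):
    # approximate inverse x*_j = x_{n-j}
    return np.concatenate(([x[0]], x[:0:-1]))

n, rng = 64, np.random.default_rng(2)
p = rng.normal(0.0, 1.0 / np.sqrt(n), n)  # context activations
y = rng.normal(0.0, 1.0 / np.sqrt(n), n)  # key vector
t = rng.normal(0.0, 1.0 / np.sqrt(n), n)  # arbitrary target for h = p@y

def E(p):
    return 0.5 * np.sum((cconv(p, y) - t) ** 2)

dE_dh = cconv(p, y) - t
grad = cconv(dE_dh, involution(y))  # dE/dp_j = sum_k (dE/dh_k) y_{k-j} = (grad_h E)@y*

# central finite differences, one coordinate at a time
eps, num = 1e-6, np.zeros(n)
for j in range(n):
    d = np.zeros(n)
    d[j] = eps
    num[j] = (E(p + d) - E(p - d)) / (2 * eps)

assert np.allclose(grad, num, atol=1e-8)
```

The identity used is sum_k g_k y_{k-j} = (g@y*)_j, which is why the O(n log n) FFT route pays off once the hidden layer is large.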
It was expected that this would be a difficult task for SRNs, given the reported difficulties with getting SRNs to retain context, and Simard and LeCun's [1992] report of being unable to train a type of recurrent network to generate more than one trajectory through continuous space. However, it turned out that HRNs, and to a lesser extent SRNs, could be easily trained to perform the sequence generation task well. \n\nThe generative capacity of HRNs and SRNs was tested using randomly chosen sequences over 3 symbols (a, b, and c). The training data was (in all but one case) 12 sequences of length 4, e.g., \"abac\" and \"bacb\". Networks were trained on this data using the conjugate gradient method until all sequences were correctly generated. A symbol was judged to be correct when the activation of the correct output unit exceeded 0.5 and exceeded twice any other output unit activation. \n\nAfter the network had been trained, all the weights and parameters were frozen, except for the weights on the input to code links. Then the network was trained on a test set of novel sequences of lengths 3 to 16 (32 sequences of each length). This training could be done one sequence at a time since the generation of each sequence involved an exclusive set of modifiable weights, as only one input unit was active for any sequence. The search for code weights for the test sequences was a conjugate gradient search limited to 100 iterations. \n\n[Figure 2 plots omitted; the right-hand legend lists HRNs with 64, 32, 16, 8, and 4 hidden units.] \n\nFigure 2: Percentage of novel sequences that can be generated versus length. \n\nThe graph on the left in Figure 2 shows how the performance varies with sequence length for various networks with 16 hidden units. 
The points on this graph are the average of 5 runs; each run began with a randomization of all weights. The worst performance was produced by the SRN. The HRN gave the best performance: it was able to produce around 90% of all sequences up to length 12. Interestingly, an SRN (SRNZ in Figure 2) with frozen random recurrent weights from a suitable distribution performed significantly better than the unconstrained SRN. \n\nTo some extent, the poor performance of the SRN was due to overtraining. This was verified by training an SRN on 48 sequences of length 8 (8 times as much data). The performance improved greatly (SRN+ in Figure 2), but was still not as good as that of the HRN trained on the lesser amount of data. This suggests that the extra parameters provided by the recurrent links in the SRN serve little useful purpose: the net does well with fixed random values for those parameters, and an HRN does better without modifying any parameters in this operation. It appears that all that is required in the recurrent operation is some stable random map. \n\nThe scaling performance of the HRN with respect to the number of hidden units is good. The graph on the right in Figure 2 shows the performance of HRNs with 8 output units and varying numbers of hidden units (averages of 5 runs). As the number of hidden units increases from 4 to 64 the generative capacity increases steadily. The scaling of sequence length with number of outputs (not shown) is also good: it is over 1 bit per hidden unit. This compares very well with the 0.1 bits per element achieved by random vector circular convolution (Section 2.3). \n\nThe training times for both the HRNs and the SRNs were very short. Both required around 30 passes through the training data to train the output and recurrent weights. Finding a code for a test sequence of length 8 took the HRN an average of 14 passes. 
The SRN took an average of 57 passes (44 with frozen weights). The SRN trained on more data took much longer for the initial training (average 281 passes) but the code search was shorter (average 31 passes). \n\n4 TRAJECTORIES IN CONTINUOUS SPACE \n\nHRNs can also be used to generate trajectories through continuous space. Only two modifications need be made: (a) change the function on the output units to sigmoid and add biases, and (b) use a fractional power for the key vector. A fractional power vector f can be generated by taking a random unity-power vector k and multiplying the phase angle of each frequency component by some fraction alpha, i.e., f = k^alpha. The result is that f^i is similar to f^j when the difference between i and j is less than 1/alpha, and the similarity is greater for closer i and j. The output at the hidden layer will be similar at successive time steps. If desired, the speed at which the trajectory is traversed can be altered by changing alpha. \n\nFigure 3: Targets and outputs of an HRN trained to generate trajectories through continuous space. X and Y are plotted against time. \n\nA trajectory generating HRN with 16 hidden units and a key vector k^{0.06} was trained to produce pen trajectories (100 steps) for 20 instances of handwritten digits (two of each). This is the same task that Simard and LeCun [1992] used. The target trajectories and the output of the network for one instance are shown in Figure 3. \n\n5 DISCUSSION \n\nOne issue in processing sequential data with neural networks is how to present the inputs to the network. One approach has been to use a fixed window on the sequence, e.g., as in NETtalk [Sejnowski and Rosenberg, 1986]. A disadvantage of this is that any fixed size of window may not be large enough in some situations. Another approach is to use a recurrent net to retain information about previous inputs. 
A disadvantage of this is the difficulty that recurrent nets have in retaining information over many time steps. Generative networks offer another approach: use the codes that generate a sequence as input rather than the raw sequence. This would allow a fixed size network to take sequences of variable length as inputs (as long as they were finite), without having to use multiple input blocks or windows. \n\nThe main attraction of circular convolution as an associative memory operator is its affordance of the representation of hierarchical structure. A hierarchical HRN, which takes advantage of this to represent sequences in chunks, has been built. However, it remains to be seen if it can be trained by gradient descent. \n\n6 CONCLUSION \n\nThe circular convolution operation can be effectively incorporated into recurrent nets, and the resulting nets (HRNs) can be easily trained using gradient descent to generate sequences and trajectories. HRNs appear to be more suited to this task than SRNs, though SRNs did surprisingly well. The relatively high generative capacity of HRNs shows that the capacity of circular convolution associative memory [Plate, 1991a] can be greatly improved by adapting representations of vectors. \n\nReferences \n\n[Cleeremans et al., 1991] A. Cleeremans, D. Servan-Schreiber, and J. L. McClelland. Graded state machines: The representation of temporal contingencies in simple recurrent networks. Machine Learning, 7(2/3):161-194, 1991. \n\n[Elman, 1991] J. Elman. Distributed representations, simple recurrent networks and grammatical structure. Machine Learning, 7(2/3):195-226, 1991. \n\n[Maskara and Noetzel, 1992] Arun Maskara and Andrew Noetzel. Forcing simple recurrent neural networks to encode context. In Proceedings of the 1992 Long Island Conference on Artificial Intelligence and Computer Graphics, 1992. \n\n[Plate, 1991a] T. A. Plate. Holographic Reduced Representations. 
Technical Report CRG-TR-91-1, Department of Computer Science, University of Toronto, 1991. \n\n[Plate, 1991b] T. A. Plate. Holographic Reduced Representations: Convolution algebra for compositional distributed representations. In Proceedings of the 12th International Joint Conference on Artificial Intelligence, pages 30-35, Sydney, Australia, 1991. \n\n[Rumelhart et al., 1986] D. E. Rumelhart, G. E. Hinton, and R. J. Williams. Learning internal representations by error propagation. In Parallel distributed processing: Explorations in the microstructure of cognition, volume 1, chapter 8, pages 318-362. Bradford Books, Cambridge, MA, 1986. \n\n[Sejnowski and Rosenberg, 1986] T. J. Sejnowski and C. R. Rosenberg. NETtalk: A parallel network that learns to read aloud. Technical Report 86-01, Department of Electrical Engineering and Computer Science, Johns Hopkins University, Baltimore, MD, 1986. \n\n[Simard and LeCun, 1992] P. Simard and Y. LeCun. Reverse TDNN: an architecture for trajectory generation. In J. M. Moody, S. J. Hanson, and R. P. Lippmann, editors, Advances in Neural Information Processing Systems 4 (NIPS*91), Denver, CO, 1992. Morgan Kaufmann. \n", "award": [], "sourceid": 687, "authors": [{"given_name": "Tony", "family_name": "Plate", "institution": null}]}