{"title": "A Local Algorithm to Learn Trajectories with Stochastic Neural Networks", "book": "Advances in Neural Information Processing Systems", "page_first": 83, "page_last": 87, "abstract": null, "full_text": "A Local Algorithm to Learn Trajectories \n\nwith Stochastic Neural Networks \n\nJavier R. Movellan\u00b7 \n\nDepartment of Cognitive Science \nUniversity of California San Diego \n\nLa Jolla, CA 92093-0515 \n\nAbstract \n\nThis paper presents a simple algorithm to learn trajectories with a \ncontinuous time, continuous activation version of the Boltzmann \nmachine. The algorithm takes advantage of intrinsic Brownian \nnoise in the network to easily compute gradients using entirely local \ncomputations. The algorithm may be ideal for parallel hardware \nimplementations. \n\nThis paper presents a learning algorithm to train continuous stochastic networks \nto respond with desired trajectories in the output units to environmental input \ntrajectories. This is a task, with potential applications to a variety of problems such \nas stochastic modeling of neural processes, artificial motor control, and continuous \nspeech recognition . For example, in a continuous speech recognition problem, the \ninput trajectory may be a sequence of fast Fourier transform coefficients, and the \noutput a likely trajectory of phonemic patterns corresponding to the input. This \npaper was based on recent work on diffusion networks by Movellan and McClelland \n(in press) and by recent papers by Apolloni and de Falco (1991) and Neal (1992) \non asymmetric Boltzmann machines. The learning algorithm can be seen as a \ngeneralization of their work to the stochastic diffusion case and to the problem of \nlearning continuous stochastic trajectories. \n\nDiffusion networks are governed by the standard connectionist differential equations \nplus an independent additive noise component. 
The resulting process is governed by a set of Langevin stochastic differential equations\n\n$$da_i(t) = \lambda_i \, drift_i(t) \, dt + \sigma \, dB_i(t); \quad i \in \{1, \ldots, n\} \quad (1)$$\n\nwhere \lambda_i is the processing rate of the ith unit, \sigma is the diffusion constant, which controls the flow of entropy throughout the network, and dB_i(t) is a Brownian motion differential (Soong, 1973). The drift function is the deterministic part of the process. For consistency I use the same drift function as in Movellan and McClelland (1992), but many other options are possible: drift_i(t) = \sum_{j=1}^{n} w_{ij} a_j(t) - f^{-1}(a_i(t)), where w_{ij} is the weight from the jth to the ith unit and f^{-1} is the inverse of a logistic function scaled in the (min, max) interval: f^{-1}(a) = \log \frac{a - min}{max - a}.\n\n\u00b7Part of this work was done while at Carnegie Mellon University.\n\nIn practice DNs are simulated in digital computers with a system of stochastic difference equations\n\n$$a_i(t + \Delta t) = a_i(t) + \lambda_i \, drift_i(t) \, \Delta t + \sigma z_i(t) \sqrt{\Delta t}; \quad i \in \{1, \ldots, n\} \quad (2)$$\n\nwhere z_i(t) is a standard Gaussian random variable. I start the derivations of the learning algorithm for the trajectory learning task using the discrete time process (equation 2) and then take limits to obtain the continuous diffusion expression. To simplify the derivations I adopt the following notation: a trajectory of states (input, hidden and output units) is represented as a = [a(1) ... a(t_m)] = [a_1(1) ... a_n(1) ... a_1(t_m) ... a_n(t_m)]. The trajectory vector can be partitioned into 3 consecutive row vectors representing the trajectories of the input, hidden and output units: a = [x h y].\n\nThe key to the learning algorithm is obtaining the gradient of the probability of specific trajectories. Once we know this gradient we have all the information needed to increase the probability of desired trajectories and decrease the probability of unwanted trajectories. 
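For concreteness, the stochastic difference equations in (2) might be simulated as in the following minimal NumPy sketch. It assumes min = 0 and max = 1 for the logistic range; the clipping step is an implementation convenience to keep activations inside the open interval, not part of the paper.

```python
import numpy as np

def inv_logistic(a, lo=0.0, hi=1.0):
    """f^{-1}(a) = log((a - min) / (max - a)), the scaled inverse logistic."""
    return np.log((a - lo) / (hi - a))

def simulate(W, lam, sigma, a0, dt=0.01, steps=300, rng=None):
    """Simulate the diffusion network difference equations (equation 2).

    W     : (n, n) weight matrix, W[i, j] = weight from unit j to unit i
    lam   : (n,) processing rates lambda_i
    sigma : diffusion constant
    Returns the trajectory (steps + 1, n) and the Gaussian draws z (steps, n),
    which are needed later to form the gradient (delta) signals.
    """
    rng = np.random.default_rng() if rng is None else rng
    n = len(a0)
    a = np.empty((steps + 1, n))
    z = rng.standard_normal((steps, n))
    a[0] = a0
    for t in range(steps):
        drift = W @ a[t] - inv_logistic(a[t])
        a[t + 1] = a[t] + lam * drift * dt + sigma * z[t] * np.sqrt(dt)
        a[t + 1] = np.clip(a[t + 1], 1e-6, 1 - 1e-6)  # stay inside (min, max)
    return a, z
```

Returning the noise draws z alongside the trajectory is a design convenience: the learning rule derived below the derivations reuses exactly these Brownian increments.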
To obtain this gradient we first need to do some derivations on the transition probability densities. Using the discrete time approximation to the diffusion process, it follows that the conditional transition probability density functions are multivariate Gaussian,\n\n$$p(a(t + \Delta t) \mid a(t)) = \prod_{i=1}^{n} \frac{1}{\sigma \sqrt{2 \pi \Delta t}} \exp\left( - \frac{(a_i(t + \Delta t) - a_i(t) - \lambda_i \, drift_i(t) \, \Delta t)^2}{2 \sigma^2 \Delta t} \right) \quad (3)$$\n\nFrom equations 2 and 3 it follows that\n\n$$\frac{\partial}{\partial w_{ij}} \log p(a(t + \Delta t) \mid a(t)) = \frac{\lambda_i}{\sigma} z_i(t) \sqrt{\Delta t} \, a_j(t) \quad (4)$$\n\nSince the network is Markovian, the probability of an entire trajectory can be computed from the product of the transition probabilities\n\n$$p(a) = p(a(t_0)) \prod_{t=t_0}^{t_m - 1} p(a(t + \Delta t) \mid a(t)) \quad (5)$$\n\nThe derivative of the probability of a specific trajectory follows:\n\n$$\frac{\partial p(a)}{\partial w_{ij}} = p(a) \frac{\lambda_i}{\sigma} \sum_{t=t_0}^{t_m - 1} z_i(t) \sqrt{\Delta t} \, a_j(t) \quad (6)$$\n\nIn practice, the above rule is all that is needed for discrete time computer simulations. We can obtain the continuous time form by taking limits as \Delta t \to 0, in which case the sum becomes Ito's stochastic integral of a_j(t) with respect to the Brownian motion differential over the [t_0, T] interval:\n\n$$\frac{\partial p(a)}{\partial w_{ij}} = p(a) \frac{\lambda_i}{\sigma} \int_{t_0}^{T} a_j(t) \, dB_i(t) \quad (7)$$\n\nA similar equation may be obtained for the \lambda_i parameters:\n\n$$\frac{\partial p(a)}{\partial \lambda_i} = p(a) \frac{1}{\sigma} \int_{t_0}^{T} drift_i(t) \, dB_i(t) \quad (8)$$\n\nFor notational convenience I define the following random variables and refer to them as the delta signals:\n\n$$\delta_{w_{ij}}(a) = \frac{\partial \log p(a)}{\partial w_{ij}} = \frac{\lambda_i}{\sigma} \int_{t_0}^{T} a_j(t) \, dB_i(t) \quad (9)$$\n\nand\n\n$$\delta_{\lambda_i}(a) = \frac{\partial \log p(a)}{\partial \lambda_i} = \frac{1}{\sigma} \int_{t_0}^{T} drift_i(t) \, dB_i(t) \quad (10)$$\n\n[Figure 1 here: two panels of activation (0 to 1) versus time steps (0 to 300).]\n\nFigure 1: A) A sample trajectory. B) The average trajectory. 
As time progresses, sample trajectories become statistically independent, dampening the average.\n\nThe approach taken in this paper is to minimize the expected value of the error assigned to spontaneously generated trajectories, C = E(\rho(a)), where \rho(a) is a signal indicating the overall error of a particular trajectory and usually depends only on the output unit trajectory. The necessary gradients follow:\n\n$$\frac{\partial C}{\partial w_{ij}} = E(\rho(a) \, \delta_{w_{ij}}(a)) \quad (11)$$\n\n$$\frac{\partial C}{\partial \lambda_i} = E(\rho(a) \, \delta_{\lambda_i}(a)) \quad (12)$$\n\nSince the above learning rule does not require calculating derivatives of the \rho function, it provides great flexibility, making it applicable to a wide variety of situations. For example, \rho(a) can be the total sum of squares (TSS) between the desired and obtained output unit trajectories, or it could be a reinforcement signal indicating whether or not the trajectory is desirable. Figure 1a shows a typical output of a network trained with TSS as the \rho signal to follow a sinusoidal trajectory. The network consisted of 1 input unit, 3 hidden units, and 1 output unit. The input was constant through time and the network was trained only with the first period of the sinusoid. The expected values in equations 11 and 12 were estimated using 400 spontaneously generated trajectories at each learning epoch. It is interesting to note that although the network was trained for a single period, it continued oscillating without dampening. However, the expected value of the activations dampened, as Figure 1b shows. The dampening of the average activation is due to the fact that as time progresses, the effects of noise accumulate and the initially phase locked trajectories become independent oscillators.\n\n[Figure 2a here, the hidden Markov emitter: two hidden states, 0 and 1, with p(response 1) = 0.1 in state 0 and p(response 1) = 0.8 in state 1, and transition probabilities 0.2 and 0.05 between the states.]
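The Monte Carlo training procedure implied by equations 11 and 12 (averaging \rho(a) times the delta signals over spontaneously generated trajectories) can be sketched as follows. This is a hypothetical NumPy sketch, not the paper's code: `delta_signals` discretizes the stochastic integrals in equations 9 and 10 using the stored noise draws, assuming min = 0 and max = 1 in the drift function.

```python
import numpy as np

def delta_signals(traj, z, lam, sigma, dt, W):
    """Discrete-time delta signals (equations 9 and 10):
    delta_w[i, j]   = (lam_i / sigma) * sum_t z_i(t) sqrt(dt) a_j(t)
    delta_lam[i]    = (1 / sigma) * sum_t drift_i(t) z_i(t) sqrt(dt)
    traj is (T + 1, n); z is the (T, n) Gaussian draws from the simulation."""
    a = traj[:-1]                  # a(t) for t = t_0 .. t_m - 1
    incr = z * np.sqrt(dt)         # the Brownian increments dB_i(t)
    delta_w = (lam[:, None] / sigma) * (incr.T @ a)
    drift = a @ W.T - np.log(a / (1 - a))   # drift with min = 0, max = 1
    delta_lam = (drift * incr).sum(axis=0) / sigma
    return delta_w, delta_lam

def gradient_estimates(trajs, zs, rho, lam, sigma, dt, W):
    """Monte Carlo estimate of equations 11 and 12:
    dC/dw = E[rho(a) delta_w(a)],  dC/dlam = E[rho(a) delta_lam(a)]."""
    gw, gl = np.zeros_like(W), np.zeros_like(lam)
    for a, z in zip(trajs, zs):
        dw, dl = delta_signals(a, z, lam, sigma, dt, W)
        gw += rho(a) * dw
        gl += rho(a) * dl
    return gw / len(trajs), gl / len(trajs)
```

Note that everything here is local: the update for w_{ij} uses only the noise at unit i and the activation of unit j, which is what makes the rule attractive for parallel hardware.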
[Figure 2b here: average error (roughly 0 to 20) versus learning epoch (0 to about 2900), with a horizontal line marking the best possible performance.]\n\nFigure 2: A) The hidden Markov emitter. B) Average error throughout training. The Bayesian limit is achieved at about 2000 epochs.\n\nThe learning rule is also applicable in reinforcement situations where we have only an overall measure of fitness of the obtained trajectories but do not know what the desired trajectory looks like. For example, in a motor control problem we could use as a fitness signal (-\rho) the distance walked by a robot controlled by a DN network. Equations 11 and 12 could then be used to gradually improve the average distance walked by the robot. In trajectory recognition problems we could use an overall judgment of the likelihood of the obtained trajectories. I tried this last approach with a toy version of a continuous speech recognition problem. The "emitter" was a hidden Markov model (see Figure 2) that produced sequences of outputs (the equivalent of fast Fourier transform coefficients) fed as input to the receiver. The receiver was a DN network which received as input sequences of 10 outputs from the emitter Markov model. The network's task was to guess the sequence of hidden states of the emitter given the sequence of outputs from the emitter. The DN outputs were interpreted as the inferred state of the emitter. Output unit activations greater than 0.5 were evaluated as indicating that the emitter was in state 1 at that particular time. Outputs smaller than 0.5 were evaluated as state 0. 
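The toy emitter just described might be sketched as below. The figure gives the probabilities but the extraction leaves the transition directions ambiguous, so assigning 0.2 to the 0-to-1 transition and 0.05 to the 1-to-0 transition, and starting in state 0, are assumptions.

```python
import numpy as np

def emit_sequence(length, rng=None):
    """Two-state hidden Markov emitter (Figure 2a).
    Assumed: state 0 switches to 1 with p = 0.2, state 1 to 0 with p = 0.05;
    p(response 1) is 0.1 in state 0 and 0.8 in state 1; initial state 0."""
    rng = np.random.default_rng() if rng is None else rng
    p_switch = {0: 0.2, 1: 0.05}
    p_resp1 = {0: 0.1, 1: 0.8}
    state, states, responses = 0, [], []
    for _ in range(length):
        states.append(state)
        responses.append(int(rng.random() < p_resp1[state]))
        if rng.random() < p_switch[state]:
            state = 1 - state
    return states, responses
```

The responses would be fed to the receiver network in windows of 10, with the hidden state sequence used only to score the receiver's guesses, never to train it.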
To achieve optimal performance in this task the network had to combine two sources of information: top-down information about typical state transitions of the emitter, and bottom-up information about the likelihood of the hidden states of the emitter given its responses.\n\nThe network was trained with rules 11 and 12, using the negative log joint probability of the DN input trajectory and the DN output trajectory as the error signal. This signal was calculated using the transition probabilities of the emitter hidden Markov model and did not require knowledge of its actual state trajectories. The necessary gradients for equations 11 and 12 were estimated using 1000 spontaneous trajectories at each learning epoch. As Figure 2b shows, the network started by producing unlikely trajectories but continuously improved. The figure also shows the performance expected from an optimal classifier. As training progressed, the network approached optimal performance.\n\nAcknowledgements\n\nThis work was funded through the NIMH grant MH47566 and a grant from the Pittsburgh Supercomputer Center.\n\nReferences\n\nB. Apolloni & D. de Falco. (1991) Learning by asymmetric parallel Boltzmann machines. Neural Computation, 3, 402-408.\n\nR. Neal. (1992) Asymmetric parallel Boltzmann machines are belief networks. Neural Computation, 4, 832-834.\n\nJ. Movellan & J. McClelland. (1992) Learning continuous probability distributions with symmetric diffusion networks. To appear in Cognitive Science.\n\nT. T. Soong. (1973) Random Differential Equations in Science and Engineering. Academic Press, New York.\n", "award": [], "sourceid": 746, "authors": [{"given_name": "Javier", "family_name": "Movellan", "institution": null}]}