{"title": "Learning Path Distributions Using Nonequilibrium Diffusion Networks", "book": "Advances in Neural Information Processing Systems", "page_first": 598, "page_last": 604, "abstract": "", "full_text": "Learning Path Distributions using Nonequilibrium Diffusion Networks\n\nPaul Mineiro*\npmineiro@cogsci.ucsd.edu\nDepartment of Cognitive Science\nUniversity of California, San Diego\nLa Jolla, CA 92093-0515\n\nJavier Movellan\nmovellan@cogsci.ucsd.edu\nDepartment of Cognitive Science\nUniversity of California, San Diego\nLa Jolla, CA 92093-0515\n\nRuth J. Williams\nwilliams@math.ucsd.edu\nDepartment of Mathematics\nUniversity of California, San Diego\nLa Jolla, CA 92093-0112\n\nAbstract\n\nWe propose diffusion networks, a type of recurrent neural network with probabilistic dynamics, as models for learning natural signals that are continuous in time and space. We give a formula for the gradient of the log-likelihood of a path with respect to the drift parameters for a diffusion network. This gradient can be used to optimize diffusion networks in the nonequilibrium regime for a wide variety of problems, paralleling techniques which have succeeded in engineering fields such as system identification, state estimation and signal filtering. An aspect of this work which is of particular interest to computational neuroscience and hardware design is that with a suitable choice of activation function, e.g., quasi-linear sigmoidal, the gradient formula is local in space and time.\n\n1 Introduction\n\nMany natural signals, like pixel gray-levels, line orientations, object position, velocity and shape parameters, are well described as continuous-time continuous-valued stochastic processes; however, the neural network literature has seldom explored the continuous stochastic case. 
Since the solutions to many decision theoretic problems of interest are naturally formulated using probability distributions, it is desirable to have a flexible framework for approximating probability distributions on continuous path spaces. Such a framework could prove as useful for problems involving continuous-time continuous-valued processes as conventional hidden Markov models have proven for problems involving discrete-time sequences.\n\nDiffusion networks are similar to recurrent neural networks, but have probabilistic dynamics. Instead of a set of ordinary differential equations (ODEs), diffusion networks are described by a set of stochastic differential equations (SDEs). SDEs provide a rich language for expressing stochastic temporal dynamics and have proven useful in formulating continuous-time statistical inference problems, resulting in such solutions as the continuous Kalman filter and generalizations of it like the condensation algorithm (Isard & Blake, 1996).\n\n* To whom correspondence should be addressed.\n\nFigure 1: An example where the average of desirable paths yields an undesirable path, namely one that collides with the tree.\n\nA formula is given here for the gradient of the log-likelihood of a path with respect to the drift parameters for a diffusion network. Using this gradient we can potentially optimize the model to approximate an entire probability distribution of continuous paths, not just average paths or equilibrium points. Figure 1 illustrates the importance of this kind of learning by showing a case in which learning average paths would have undesirable results, namely collision with a tree. 
Experience has shown that learning distributions of paths, not just averages, is crucial for dynamic perceptual tasks in realistic environments, e.g., visual contour tracking (Isard & Blake, 1996). Interestingly, with a suitable choice of activation function, e.g., quasi-linear sigmoidal, the gradient formula depends only upon local computations, i.e., no time unfolding or explicit backpropagation of error is needed. The fact that noise localizes the gradient is of potential interest for domains such as theoretical neuroscience, cognitive modeling and hardware design.\n\n2 Diffusion Networks\n\nHereafter $C_n$ refers to the space of continuous $R^n$-valued functions over the time interval $[0, T]$, with $T \in R$, $T > 0$ fixed throughout this discussion.\n\nA diffusion network with parameter $\lambda \in R^p$ is a random process defined via an Ito SDE of the form\n\n$dX(t) = \mu(t, X(t), \lambda)\,dt + \sigma\,dB(t), \quad X(0) \sim \nu,$  (1)\n\nwhere X is a $C_n$-valued process that represents the temporal dynamics of the n nodes in the network; $\mu: [0, T] \times R^n \times R^p \to R^n$ is a deterministic function called the drift; $\lambda \in R^p$ is the vector of drift parameters, e.g., synaptic weights, which are to be optimized; B is a Brownian motion process which provides the random driving term for the dynamics; $\nu$ is the initial distribution of the solution; and $\sigma \in R$, $\sigma > 0$, is a fixed constant called the dispersion coefficient, which determines the strength of the noise term. In this paper we do not address the problem of optimizing the dispersion or the initial distribution of X. For the existence and uniqueness in law of the solution to (1), $\mu(\cdot, \cdot, \lambda)$ must satisfy some conditions. For example, it is sufficient that it is Borel measurable and satisfies a linear growth condition: $|\mu(t, x, \lambda)| \le K_\lambda(1 + |x|)$ for some $K_\lambda > 0$ and all $t \in [0, T]$, $x \in R^n$; see 
(Karatzas & Shreve, 1991, page 303) for details.\n\nIt is typically the case that the n-dimensional diffusion network will be used to model d-dimensional observations with n > d. In this case we divide X into hidden and observable¹ components, denoted H and O respectively, so that X = (H, O). Note that with $\sigma = 0$ in equation (1), the model becomes equivalent to a continuous-time deterministic recurrent neural network. Diffusion networks can therefore be thought of as neural networks with \"synaptic noise\" represented by a Brownian motion process. In addition, diffusion networks have Markovian dynamics, and hidden states if n > d; therefore, they are also continuous-time continuous-state hidden Markov models. As with conventional hidden Markov models, the probability density of an observable state sequence plays an important role in the optimization of diffusion networks. However, because X is a continuous-time process, care must be taken in defining a probability density.\n\n2.1 Density of a continuous observable path\n\nLet $(X^\lambda, B^\lambda)$ defined on some filtered probability space $(\Omega, \mathcal{F}, \{\mathcal{F}_t\}, P)$ be a (weak) solution of (1) with fixed parameter $\lambda$. Here $X^\lambda = (H^\lambda, O^\lambda)$ represents the states of the network and is adapted to the filtration $\{\mathcal{F}_t\}$, $B^\lambda$ is an n-dimensional $\{\mathcal{F}_t\}$-martingale Brownian motion and the filtration $\{\mathcal{F}_t\}$ satisfies the usual conditions (Karatzas and Shreve, 1991, page 300). Let $Q^\lambda$ be the unique probability law generated by any weak solution of (1) with fixed parameter $\lambda$:\n\n$Q^\lambda(A) = P(X^\lambda \in A)$ for all $A \in \mathcal{F}$,  (2)\n\nwhere $\mathcal{F}$ is the Borel sigma algebra generated by the open sets of $C_n$. 
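The dynamics in equation (1) are straightforward to simulate with the Euler-Maruyama scheme. The following is a minimal sketch; the function names, the tanh drift, and all parameter values are illustrative choices, not taken from the paper:

```python
import numpy as np

def simulate(drift, x0, sigma, T=1.0, dt=0.01, rng=None):
    # Euler-Maruyama integration of dX = mu(t, X) dt + sigma dB
    rng = np.random.default_rng() if rng is None else rng
    steps = int(round(T / dt))
    x = np.array(x0, dtype=float)
    path = [x.copy()]
    for k in range(steps):
        t = k * dt
        # drift step plus a Brownian increment of variance dt per node
        x = x + drift(t, x) * dt + sigma * np.sqrt(dt) * rng.standard_normal(x.shape)
        path.append(x.copy())
    return np.array(path)

# illustrative two-node network with a tanh activation drift
W = np.array([[0.0, 1.0], [-1.0, 0.0]])
path = simulate(lambda t, x: W @ np.tanh(x), x0=[1.0, -1.0], sigma=0.1)
```

With sigma = 0 the scheme reduces to Euler integration of the corresponding deterministic recurrent network, consistent with the observation above.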
Setting $\Omega = C_n$, $\Omega_h = C_{n-d}$, and $\Omega_o = C_d$ with associated Borel $\sigma$-algebras $\mathcal{F}$, $\mathcal{F}_h$ and $\mathcal{F}_o$, respectively, we have $\Omega = \Omega_h \times \Omega_o$, $\mathcal{F} = \mathcal{F}_h \otimes \mathcal{F}_o$, and we can define the marginal laws for the hidden and observable components of the network by\n\n$Q_h^\lambda(A_h) = Q^\lambda(A_h \times C_d) \triangleq P(H^\lambda \in A_h)$ for all $A_h \in \mathcal{F}_h$,  (3)\n$Q_o^\lambda(A_o) = Q^\lambda(C_{n-d} \times A_o) \triangleq P(O^\lambda \in A_o)$ for all $A_o \in \mathcal{F}_o$.  (4)\n\nFor our purposes the appropriate generalization of the notion of a probability density on $R^m$ to the general probability spaces considered here is the Radon-Nikodym derivative with respect to a reference measure that dominates all members of the family $\{Q^\lambda\}_{\lambda \in R^p}$ (Poor, 1994, p.264ff). A suitable reference measure $P$ is the law of the solution to (1) with zero drift ($\mu = 0$). The measures induced by this reference measure over $\mathcal{F}_h$ and $\mathcal{F}_o$ are denoted by $P_h$ and $P_o$, respectively. Since in the reference model there are no couplings between any of the nodes in the network, the hidden and observable processes are independent and it follows that\n\n$P(A_h \times A_o) = P_h(A_h)P_o(A_o)$ for all $A_h \in \mathcal{F}_h, A_o \in \mathcal{F}_o$.  (5)\n\nThe conditions on $\mu$ mentioned above are sufficient to ensure a Radon-Nikodym derivative for each $Q^\lambda$ with respect to the reference measure. Using Girsanov's Theorem (Karatzas & Shreve, 1991, p.190ff) its form can be shown to be\n\n$Z^\lambda(\omega) = \frac{dQ^\lambda}{dP}(\omega) = \exp\left\{ \frac{1}{\sigma^2} \int_0^T \mu(t, \omega(t), \lambda) \cdot d\omega(t) - \frac{1}{2\sigma^2} \int_0^T |\mu(t, \omega(t), \lambda)|^2\,dt \right\}, \quad \omega \in \Omega,$  (6)\n\nwhere the first integral is an Ito stochastic integral.\n\n¹In our treatment we make no distinction between observables which are inputs and those which are outputs. Inputs can be conceptualized as observables under \"environmental control,\" i.e., whose drifts are independent of both $\lambda$ and the hidden and output processes.\n\nThe random variable $Z^\lambda$ can be interpreted as a likelihood or probability density with respect to the reference model². 
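On a discretized path, the log of the density (6) can be approximated by replacing the Ito integral with a left-endpoint sum over path increments and the time integral with a Riemann sum. A minimal sketch; the function and argument names are illustrative:

```python
import numpy as np

def log_girsanov_density(path, drift, sigma, dt):
    # log Z = (1/sigma^2) int mu . dw - (1/(2 sigma^2)) int |mu|^2 dt,
    # with the Ito integral approximated by a left-endpoint sum
    ts = np.arange(len(path) - 1) * dt
    mus = np.array([drift(t, w) for t, w in zip(ts, path[:-1])])
    dw = np.diff(path, axis=0)                 # path increments dw(t)
    ito = np.sum(mus * dw) / sigma**2          # discretized Ito integral term
    quad = 0.5 * np.sum(mus**2) * dt / sigma**2
    return ito - quad
```

For the zero-drift reference model the density is identically 1, so the log-density evaluates to 0, matching the role of $P$ as the reference measure.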
However equation (6) defines the density of $R^n$-valued paths of the entire network, whereas our real concern is the density of $R^d$-valued observable paths. Denoting $\omega \in \Omega$ as $\omega = (\omega_h, \omega_o)$ where $\omega_h \in \Omega_h$ and $\omega_o \in \Omega_o$, note that\n\n$Q_o^\lambda(A_o) = Q^\lambda(C_{n-d} \times A_o) = \int_{C_{n-d} \times A_o} Z^\lambda\,dP$  (7)\n$= \int_{A_o} E^{P_h}[Z^\lambda(\cdot, \omega_o)]\,dP_o(\omega_o),$  (8)\n\nand therefore the Radon-Nikodym derivative of $Q_o^\lambda$ with respect to $P_o$, the density of interest, is given by\n\n$Z_o^\lambda(\omega_o) = \frac{dQ_o^\lambda}{dP_o}(\omega_o) = E^{P_h}[Z^\lambda(\cdot, \omega_o)], \quad \omega_o \in \Omega_o.$  (9)\n\n2.2 Gradient of the density of an observable path\n\nThe gradient of $Z_o^\lambda$ with respect to $\lambda$ is an important quantity for iterative optimization of cost functionals corresponding to a variety of problems of interest, e.g., maximum likelihood estimation of diffusion parameters for continuous path density estimation. Formal differentiation³ of (9) yields\n\n$\nabla_\lambda \log Z_o^\lambda(\omega_o) = E^{P_h}[Z_{h|o}^\lambda(\cdot, \omega_o) \nabla_\lambda \log Z^\lambda(\cdot, \omega_o)],$  (10)\n\nwhere\n\n$Z_{h|o}^\lambda(\omega) \triangleq \frac{Z^\lambda(\omega)}{Z_o^\lambda(\omega_o)},$  (11)\n$\nabla_\lambda \log Z^\lambda(\omega) = \frac{1}{\sigma^2} \int_0^T J(t, \omega(t), \lambda)'\,dI(\omega, t),$  (12)\n$J_{jk}(t, x, \lambda) \triangleq \frac{\partial \mu_k(t, x, \lambda)}{\partial \lambda_j},$  (13)\n$I(\omega, t) \triangleq \omega(t) - \omega(0) - \int_0^t \mu(s, \omega(s), \lambda)\,ds.$  (14)\n\nEquation (10) states that the gradient of the density of an observable path can be found by clamping the observable nodes to that path and performing an average of $Z_{h|o}^\lambda \nabla_\lambda \log Z^\lambda$ with respect to $P_h$, i.e., an average with respect to the hidden paths distributed as a scaled Brownian motion. This makes intuitive sense: the output gradient of the log density is a weighted average of the total gradient of the log density, where each hidden path contributes according to its likelihood $Z_{h|o}^\lambda$ given the output.\n\nIn practice to evaluate the gradient, equation (10) must be approximated. Here we use Monte Carlo techniques, the efficiency of which can be improved by sampling according to a density which reduces the variance of the integrand. Such a density\n
Such a density \n\n2To ease interpretation of (6) consider the simpler case of a one-dimensional Gaussian \nrandom variable with mean JL and variance (J\"2. The ratio of the density of such a model \nwith respect to an equivalent model with zero mean is exp(-f')dt + adBh(t), \ndO(t) = J-Lo(t, H(t), O(t), >')dt + adBo(t). \n\n(15) \n(16) \nThe drift for the hidden variables does not depend on the observables, and Gir(cid:173)\nsanov's theorem gives us an explicit formula for the density of the hidden process. \n\nZ~(Wh) = dP: (Wh) = exp a2 0 J-Lh(t, Wh(t), >.) . dwh(t) \n\ndQ>' \n\n{ 1 iT \n\n- 2~2 /,T Il'h(t,wh(t), ,\\)I'dt} . \n\nEquations (9) and (10) can then be written in the form \nZ;(wo) = EQ~ [Zolh(', wo)], \n\nv\\ log Zo (wo) = E h \n\n>. \n\nQ). [Z;lh(\"WO) \n\n>. 1 \nZ;(wo) yo >.log Z (', wo) , \n\nwhere \n\nZ;lh(W) = Z~(Wh) = exp a2 \n\nII Z>'(w) \n\n0 J-Lo(t,w(t), >.) . dwo(t) \n\n{ 1 iT \n\n- 2~' /,T lI'o( t, w( t), ,\\) I' dt } . \n\n(17) \n\n(18) \n\n(19) \n\n(20) \n\nNote the expectations are now performed using the measure Q~. We can easily \ngenerate samples according to Q~ by numerically integrating equation (15), and in \npractice this leads to more efficient Monte Carlo approximations of the likelihood \nand gradient. \n3 Example: Noisy Sinusoidal Detection \nThis problem is a simple example of using diffusion networks for signal detection. \nThe task was to detect a sinusoid in the presence of additive Gaussian noise. Stimuli \nwere generated according to the following process \n\nY(t,w) = 1A(w).!.sin(47l't) + B(t,w), \n\n7l' \n\n(21) \n\nwhere t E [0,1/2]. Here Y is assumed anchored in a probability space (0, F, P) \nlarge enough to accommodate the event A which indicates a signal or noise trial. \nNote that under P, B is a Brownian motion on Cd independent of A. \nA model was optimized using 100 samples of equation (21) given W E A, i.e., 100 \nstimuli containing a signal. 
The model had four hidden units and one observable unit (n = 5, d = 1). The drift of the model was given by\n\n$\mu(t, x, \lambda) = \theta + W \cdot g(x), \quad g_j(x) = \frac{1}{1 + e^{-x_j}}, \quad j \in \{1, 2, 3, 4, 5\},$  (22)\n\nwhere $\theta \in R^5$ and W is a 5x5 real-valued connection matrix. In this case $\lambda = \{\{\theta_i\}, \{W_{ij}\}, i, j = 1, \ldots, 5\}$. The connections from output to hidden units were set to zero, allowing use of the more efficient techniques for factorial networks described above. The initial distribution for the model was a $\delta$-function at (1, -1, 1, -1, 0). The model was numerically simulated with $\Delta t = 0.01$, and 100 hidden samples were used to approximate the likelihood and gradient of the log-likelihood, according to equations (18) and (19). The conjugate gradient algorithm was used for training, with the log-likelihood of the data as the cost function.\n\nFigure 2: Receiver operating characteristic (ROC) curve for a diffusion network performing a signal detection task involving noisy sinusoids. Dotted line: detection performance estimated numerically using 10000 novel stimuli. Solid line: best fit curve corresponding to d' = 1.89. This value of d' corresponds to performance within 1.5% of the Bayesian limit.\n\nOnce training was complete, the trained model was tested using 10000 novel stimuli and the following procedure. Given a new stimulus Y we used the model to estimate the likelihood $Z_o(Y|A) \approx Z_o^{\hat\lambda}(Y)$, where $\hat\lambda$ is the parameter vector at the end of training. 
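The likelihood estimate used in testing can be sketched as a Monte Carlo average in the spirit of equations (15), (18) and (20): sample hidden paths by integrating the hidden SDE, score each with the conditional density of the observed path, and average. The names, the zero initial hidden state, and the vector-observable interface are illustrative assumptions, not taken from the paper:

```python
import numpy as np

def mc_log_likelihood(obs, mu_h, mu_o, sigma, dt, n_hidden, n_samples=100, rng=None):
    # log Z_o(obs) ~= log mean_s Z_{o|h}(h_s, obs), with hidden paths h_s
    # drawn from Q_h by Euler-Maruyama integration of equation (15)
    rng = np.random.default_rng() if rng is None else rng
    steps = len(obs) - 1
    logw = np.empty(n_samples)
    for s in range(n_samples):
        h = np.zeros(n_hidden)          # illustrative initial hidden state
        acc = 0.0
        for k in range(steps):
            t = k * dt
            m = mu_o(t, h, obs[k])      # observable drift given hidden state
            do = obs[k + 1] - obs[k]    # observed increment
            # running discretization of log Z_{o|h}, equation (20)
            acc += np.dot(m, do) / sigma**2 - 0.5 * np.dot(m, m) * dt / sigma**2
            h = h + mu_h(t, h) * dt + sigma * np.sqrt(dt) * rng.standard_normal(n_hidden)
        logw[s] = acc
    # numerically stable log of the Monte Carlo average of exp(logw)
    mx = logw.max()
    return mx + np.log(np.mean(np.exp(logw - mx)))
```

The same sampled hidden paths can be reused to approximate the gradient (19), weighting each path's gradient contribution by its normalized conditional density.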
The decision rule employed was\n\n$D(Y) = \begin{cases} \text{signal} & \text{if } Z_o^{\hat\lambda}(Y) > b, \\ \text{noise} & \text{otherwise}, \end{cases}$  (23)\n\nwhere $b \in R$ is a bias term representing assumptions about the a priori probability of a signal trial. By sweeping across different values of b the receiver operating characteristic (ROC) curve is generated. This curve shows how the probability of a hit, $P(D = \text{signal} \mid A)$, and the probability of a false alarm, $P(D = \text{signal} \mid A^c)$, are related. From this curve the parameter d', a measure of sensitivity independent of a priori assumptions, can be estimated. Figure 2 shows the ROC curve as found by numerical simulation, and the curve obtained by the best fit value d' = 1.89. This value of d' corresponds to an 82.7% correct detection rate for equal prior signal probabilities.\n\nThe theoretically ideal observer can be derived for this problem, since the profile of the unperturbed signal is known exactly (Poor, 1994, p. 278ff). For this problem the optimal observer achieves $d'_{max} = 2$, which implies that at equal probabilities for signal and noise trials, the Bayesian limit corresponds to an 84.1% correct detection rate. The detection system based upon the diffusion network is therefore operating close to the Bayesian limit, but was designed using only implicit information, i.e., 100 training examples, about the structure of the signal to be detected, in contrast to the explicit information required to design the optimal Bayesian classifier.\n", "award": [], "sourceid": 1438, "authors": [{"given_name": "Paul", "family_name": "Mineiro", "institution": null}, {"given_name": "Javier", "family_name": "Movellan", "institution": null}, {"given_name": "Ruth", "family_name": "Williams", "institution": null}]}