{"title": "A neurally plausible model for online recognition and postdiction in a dynamical environment", "book": "Advances in Neural Information Processing Systems", "page_first": 9644, "page_last": 9655, "abstract": "Humans and other animals are frequently near-optimal in their ability to integrate noisy and ambiguous sensory data to form robust percepts---which are informed both by sensory evidence and by prior expectations about the structure of the environment. It is suggested that the brain does so using the statistical structure provided by an internal model of how latent, causal factors produce the observed patterns. In dynamic environments, such integration often takes the form of \\emph{postdiction}, wherein later sensory evidence affects inferences about earlier percepts.  As the brain must operate in current time, without the luxury of acausal propagation of information, how does such postdictive inference come about? Here, we propose a general framework for neural probabilistic inference in dynamic models based on the distributed distributional code (DDC) representation of uncertainty, naturally extending the underlying encoding to incorporate implicit probabilistic beliefs about both present and past. We show that, as in other uses of the DDC, an inferential model can be learnt efficiently using samples from an internal model of the world. 
Applied to stimuli used in the context of psychophysics experiments, the framework provides an online and plausible mechanism for inference, including postdictive effects.", "full_text": "A neurally plausible model\n\nfor online recognition and postdiction\n\nLi Kevin Wenliang\nManeesh Sahani\nGatsby Computational Neuroscience Unit\n\n{kevinli,maneesh}@gatsby.ucl.ac.uk\n\nUniversity College London\n\nLondon, W1T 4JG\n\nAbstract\n\nHumans and other animals are frequently near-optimal in their ability to integrate\nnoisy and ambiguous sensory data to form robust percepts, which are informed\nboth by sensory evidence and by prior experience about the causal structure of the\nenvironment. It is hypothesized that the brain establishes these structures using\nan internal model of how the observed patterns can be generated from relevant\nbut unobserved causes. In dynamic environments, such integration often takes the\nform of postdiction, wherein later sensory evidence affects inferences about earlier\npercepts. As the brain must operate in current time, without the luxury of acausal\npropagation of information, how does such postdictive inference come about? Here,\nwe propose a general framework for neural probabilistic inference in dynamic mod-\nels based on the distributed distributional code (DDC) representation of uncertainty,\nnaturally extending the underlying encoding to incorporate implicit probabilistic\nbeliefs about both present and past. We show that, as in other uses of the DDC, an\ninferential model can be learned ef\ufb01ciently using samples from an internal model\nof the world. Applied to stimuli used in the context of psychophysics experiments,\nthe framework provides an online and plausible mechanism for inference, including\npostdictive effects.\n\n1\n\nIntroduction\n\nThe brain must process a constant stream of noisy and ambiguous sensory signals from the envi-\nronment, making accurate and robust real-time perceptual inferences crucial for survival. 
Despite\nthe difficult and sometimes ill-posed nature of the problem, many behavioral experiments suggest\nthat humans and other animals achieve nearly Bayes-optimal performance across a range of contexts\ninvolving noise and uncertainty: e.g., when combining noisy signals across sensory modalities [1, 14,\n34], making sensory decisions with consequences of unequal value [48], or inferring causal structure\nin the sensory environment [23].\nReal-time perception in dynamical environments, referred to as filtering, is even more challenging.\nBeliefs about dynamical quantities must be continuously and rapidly updated on the basis of new\nsensory input, and very often informative sensory inputs will arrive after the time of the relevant state.\nThus, perception in dynamical environments requires a combination of prediction (to ensure actions\nare not delayed relative to the external world) and postdiction (to ensure that perceptual beliefs\nabout the past are correctly updated by subsequent sensory evidence) [6, 12, 17, 20, 32, 41].\n\n33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.\n\nBehavioral [3, 5, 24, 31, 50] and physiological [8, 9, 15] findings suggest that the brain acquires\nan internal model of how relevant states of the world evolve in time, and how they give rise to the\nstream of sensory evidence. Recognition is then formally a process of statistical inference to form\nperceptual beliefs about the trajectory of latent causes given observations in time. While this type of\nstatistical computation over probability distributions is well understood mathematically and accounts\nfor nearly optimal perception in experiments, it remains largely unknown how the brain carries out\nthese computations in non-trivial but biologically relevant situations. Three key questions need to be\nanswered: How does the brain represent probabilistic beliefs about dynamical variables? 
How does\nthe representation facilitate computations such as \ufb01ltering and postdiction? And how does the brain\nlearn to perform these computations?\nIn this work, we introduce a neurally plausible online recognition scheme that addresses these\nthree questions. We \ufb01rst review the distributed distributional code (DDC) [40, 45]: a hypothesized\nrepresentation of uncertainty in the brain, which has been shown to facilitate ef\ufb01cient and accurate\ncomputation of probabilistic beliefs over latent causes in internal models without temporal structure.\nOur main contribution is to show how to extend the DDC representation, along with the associated\nmechanisms for computation and learning, to achieve online inference within a dynamical state\nmodel. In the proposed approach, each new observation is used to update beliefs about the latent state\nboth at the present time and in the recent history\u2014thus implementing a form of online postdiction.\nThis form of recognition accounts for perceptual illusions across different modalities [41]. We\ndemonstrate in experiments that the proposed scheme reproduces known perceptual phenomena,\nincluding the auditory continuity illusion [6, 30], and positional smoothing associated with the\n\ufb02ash-lag effect in vision [28, 32]. We also evaluate its performance at tracking the hidden state of a\nnonlinear dynamical system when receiving noisy and occluded observations.\n\n2 Background: neural inference in static environments\n\nBuilding on previous work [19, 40, 52], V\u00e9rtes and Sahani [45] introduced the DDC Helmholtz\nMachine for inference in hierarchical probabilistic generative models, providing a potential substrate\nfor feedforward recognition in static environments with noiseless rate neurons. We review this\napproach here. 
See Appendix E for discussion and experiments on the robustness of DDC-based\ninference in the presence of neuronal noise.\n\n2.1 The distributed distributional code for uncertainty\n\nThe DDC representation of the probability distribution q(z) of a random variable Z is given by a\npopulation of K\u03b3 neurons whose firing rates rZ are equal to the expected values of their \u201cencoding\u201d\n(or tuning) functions {\u03b3k(z)}_{k=1}^{K\u03b3} under q(z):\n\nrZ,k := Eq[\u03b3k(Z)], k \u2208 {1, 2, ..., K\u03b3}.    (1)\n\nAs reviewed in Appendix A.2, if q(z) belongs to a minimal exponential family (Z discrete or\ncontinuous) with sufficient statistics \u03b3(z), then the DDC rZ is the mean parameter that uniquely\nspecifies a distribution within the family. With a rich set of \u03b3(z), q(z) can describe a large variety of\ndistributions, and rZ is then a very flexible representation of uncertainty.\nMany computations that depend on encoded uncertainty, in fact, require the evaluation of expected\nvalues. The DDC rZ can be used to approximate expectations with respect to Z by projecting a target\nfunction into the span of the encoding functions \u03b3(z) and exploiting the linearity of expectations [44,\n45, 47]. That is, for a target function l(z):\n\nl(z) \u2248 \u2211_{k=1}^{K\u03b3} \u03b1k \u03b3k(z) = \u03b1 \u00b7 \u03b3(z) \u21d2 Eq[l(z)] \u2248 \u2211_{k=1}^{K\u03b3} \u03b1k rZ,k = \u03b1 \u00b7 rZ.    (2)\n\nThe coefficients \u03b1 can be learned by fitting the left-hand equation in (2) at a set of points {z(s)}.\nThis set need not follow any particular distribution, but should \u201ccover\u201d the region where q(z)l(z) has\nsignificant mass.\n\n2.2 Amortised inference with the DDC\n\nLet the internal generative model of a static environment be given by the distribution p(z, x) =\np(z)p(x|z), where z is latent and x is observed. 
Inference or recognition with a DDC involves\nfinding the expectations that correspond to the posterior distribution p(z|x) for a given x:\n\nr\u2217_{Z|x} := Ep(z|x)[\u03b3(z)].    (3)\n\nThis is a deterministic quantity given x. Similar to other amortized inference schemes such as\nthose in the Helmholtz machine [10] and variational auto-encoder [22, 38], the posterior DDC\nmay be approximated using a recognition model, with the key difference that here, the output of\nthe recognition model takes the form of (the mean parameters of) a flexible exponential family\ndistribution defined by rich sufficient statistics \u03b3(z), rather than the natural parameters or moments\nof a simple parametric distribution, such as a Gaussian.\nLet the recognition model be h(x). A natural cost function for h would be\n\nL(h) := Ep(x) \u2016Ep(z|x)[\u03b3(z)] \u2212 h(x)\u2016_2^2 = Ep(x) \u2016r\u2217_{Z|x} \u2212 h(x)\u2016_2^2.    (4)\n\nHowever, we do not have access to r\u2217_{Z|x} for a generic internal model. Nonetheless, Proposition 1 in\nAppendix A.1 shows that minimizing the following expected mean squared error (EMSE)\n\nL^s(h) := Ep(x) Ep(z|x) \u2016\u03b3(z) \u2212 h(x)\u2016_2^2 = Ep(z,x) \u2016\u03b3(z) \u2212 h(x)\u2016_2^2    (5)\n\nalso minimizes (4), and they share the same optimal solution. Thus, we define the DDC representation\nof the approximate posterior by\n\nrZ|x := h\u2217(x), h\u2217 = arg min L^s(h) = arg min L(h).    (6)\n\nThus, minimizing (5) provides a way to train h even though the true posterior DDCs are not available.\n\n2.3 Learning to infer\nSensory neurons encode features of an observation from the world x(\u2217) by tuning functions \u03c3(x).\nThe mean firing rates \u03c3(x(\u2217)) = \u222b \u03b4(x \u2212 x(\u2217)) \u03c3(x) dx can be seen as encoding a deterministic\nbelief by DDC with basis \u03c3(x). 
The brain then needs to learn the mapping from \u03c3(x(\u2217)) to rZ|x\u2217.\nFor biological plausibility, we restrict the recognition model to have the form h(x) = W\u03c3(x) where\nW is a weight matrix. The EMSE in (5) can thus be minimized using the delta rule, given samples\nfrom the internal model p:\n\nW\u0302 \u2190 W\u0302 + \u03b5 [\u03b3(z(s)) \u2212 W\u0302\u03c3(x(s))] \u03c3(x(s))^\u22a4, (z(s), x(s)) \u223c p(z, x),    (7)\n\nwhere \u03b5 is a learning rate.1 The approximation error between rZ|x computed this way and the DDC of\nthe exact posterior in (3) can be reduced by adapting the number and form of the tuning curves \u03c3(x).\nFurthermore, as shown in Theorem 1 in Appendix A.2, minimizing (5) with h(x) = W\u03c3(x) also\nminimizes the expected (under p(x)) Kullback-Leibler (KL) divergence KL[p(z|x) \u2016 q(z|x)], where\nq(z|x) is in the exponential family with sufficient statistics \u03b3(z) and mean parameters W\u03c3(x). The\nminimum of the KL divergence with respect to W depends on \u03b3(z), and can be further lowered by\nusing a richer set of \u03b3(z).\nThus, the quality of approximation provided by the distribution implied by rZ|x to the true posterior\np(z|x) depends on three factors: (i) the divergence between p(z|x) and the optimal member of the\nexponential family with sufficient statistic functions \u03b3(z); (ii) the difference between the optimal\nmean parameters r\u2217_{Z|x} and the value of W\u2217\u03c3(x), where W\u2217 minimizes (5); and (iii) the difference\nbetween W\u2217 and W\u0302 estimated from a finite number of internal samples. 
Indeed, it is possible for\ngeneralization error in the recognition model to yield values of rZ|x that are infeasible as means\nof \u03b3(z), although even in this case their values may be used to approximate expectations of other\nfunctions.\n\n1Throughout this paper we shall denote by x(\u2217) an observation from the external world, and by x(s) a sample\nfrom the internal model of the world. Superscript * without parentheses indicates optimal function/parameter.\n\n3 Online inference in dynamic environments\n3.1 A generic internal model of the dynamic world\n\nWe now turn to a dynamic environment, the main focus of this paper. Similar to the static setting in\nSection 2, an internal model of the dynamic world forms the foundation for online perception and\nrecognition. We assume that this internal model is stationary (time-invariant), Markovian and easy to\nsimulate or sample, and that the latent dynamics and observation emission take a generic form as\n\nzt = f (zt-1, \u03b6z,t)    (8a)\nxt = g(zt, \u03b6x,t),    (8b)\n\nwhere f and g are arbitrary functions that transform the conditioning variables and noise terms \u03b6\u00b7,t.\nThe expressions (8) imply conditional distributions p(zt|zt-1) and p(xt|zt), but in this form they\navoid narrow parametric assumptions while retaining ease of simulation. Next, we develop online\ninference using DDC for the internal model described by (8), thereby extending the inference from\nthe static hierarchical setting of [45].\n\n3.2 Dynamical encoding functions\nModels of neural online inference usually seek to obtain the marginal p(zt|x1:t) [11, 42] or, in\naddition, the pairwise joint p(zt-1, zt|x1:t) [29]. However, postdiction requires updating all the latent\nvariables z1:t given each new observation xt. 
To represent such distributions by DDC, we introduce\nneurons with dynamical encoding functions \u03c8t, a function of z1:t defined by a recurrence relationship\nencapsulated in a function k: \u03c8t = k(\u03c8t\u22121, zt). In particular, we choose\n\n\u03c8t = k(\u03c8t\u22121, zt) = U\u03c8t\u22121 + [\u03b3(zt); 0], \u2016U\u2016_2 < 1,    (9)\n\nwhere \u03b3(zt) \u2208 \u211d^{K\u03b3} is a static feature of zt as in (1), and U is a K\u03c8 \u00d7 K\u03c8 random\nprojection matrix (with K\u03c8 > K\u03b3) that has maximum singular value less than 1.0 to ensure stability. \u03b3(zt) only\nfeeds into a subset of \u03c8t. The set of encoding functions \u03c8t is then capable of encoding a posterior\ndistribution of the history of latent states up to time t through a DDC rt := Eq(z1:t|x1:t)[\u03c8t]. If\n\u03c8t depends only on zt (U = 0), then the corresponding DDC represents the conventional filtering\ndistribution. With a finite population size, the dependence of \u03c8t on past states decays with duration,\nlimited to about K\u03c8/K\u03b3 time steps for a simple delay line structure. This limit can be extended with\ncareful choices of U and \u03b3(\u00b7) [7, 16].\n\n3.3 Learning to infer in dynamical models\n\nThe goal of recognition in this framework is to compute rt recursively and online, combining rt-1 and\nxt. Extending the ideas of amortized inference and EMSE training introduced in Section 2, we use\nsamples from the internal model to train a recursive recognition network to compute this posterior\nmean. In principle the recognition function ht should depend on the time step, to minimize:\n\nL^s_t(ht; x1:t-1) = Ep(z1:t,xt|x1:t-1) \u2016ht(xt; x1:t-1) \u2212 \u03c8t\u2016_2^2.    (10)\n\nUnlike in (5), the expectation here is taken over a distribution conditioned on the history, which\nmay be difficult to obtain from samples. Furthermore, the optimal h\u2217_t depends on x1:t-1. 
Restricting ht(xt; x1:t-1) = Wt\u03c3(xt) as in Section 2.3, the optimal W\u2217_t could be computed from rt-1\n(which summarizes x1:t-1), albeit not straightforwardly (see Appendix B). An alternative is to explicitly\nparameterize the dependence of ht on both rt-1 and xt, giving a time-invariant function h^s_\u03c6(rt-1, xt),\nand train \u03c6 using a different loss\n\nL^s_t(\u03c6) = Eq(z1:t,xt,x1:t-1) \u2016h^s_\u03c6(rt-1, xt) \u2212 \u03c8t\u2016_2^2,    (11)\n\nwhere rt-1 depends on x1:t-1 through recursive filtering. After training, if h^s_{\u03c6\u2217}(rt-1, \u00b7) learns the\nexact dependence on rt-1 so that it is the same as h\u2217_t(\u00b7), then the loss in (11) is the expectation of the\nloss in (10) over all possible observation histories. Therefore, (11) bounds the expected loss of (10)\nfrom above; minimizing (11) ensures that (10) is minimized for any given history, and the output of\nh^s_{\u03c6\u2217}(rt-1, xt) approximates the desired DDC. Whereas technically \u03c6\u2217 should depend on t, for the\nstationary processes we consider here the distribution of inputs rt, x and outputs \u03c6t is time-invariant\nas t \u2192 \u221e; and so \u03c6\u2217 is approximately time-independent for sufficiently long sequences.\nWe consider two biologically plausible forms of h^s_\u03c6:\n\nbilinear: h^bil_W(rt-1, xt) = W(rt-1 \u2297 \u03c3(xt)),    (12)\nlinear: h^lin_W(rt-1, xt) = W[rt-1; \u03c3(xt)],    (13)\n\nwhere \u2297 indicates the Kronecker product. That is, h^bil_W maps to rt from the outer product of rt-1 and\n\u03c3(xt), and h^lin_W does so from the concatenation of the two (the bilinear update is discussed further\nin Appendix C). Both choices allow W to be trained by the biologically plausible delta rule, using\nsamples {(r_{t-1}^(s), z_t^(s), x_t^(s))}. 
The triplets can be obtained by simulating the internal model; training\nsamples of r_{t-1}^(s) are bootstrapped by applying h_\u03c6 to the simulated x_{1:t}^(s).\nOnce we infer rt, postdictive posterior expectations (with lag \u03c4) can be found in the same way as (2):\n\nEq(zt-\u03c4|x1:t)[l(zt-\u03c4)] \u2248 \u03b1 \u00b7 rt where \u03b1 \u00b7 \u03c8t \u2248 l(zt-\u03c4).    (14)\n\nThis approach to online learning for inference and postdiction in the DDC framework is summarized\nin Algorithm 1. The complexity of learning the recognition process scales linearly with the number\nof internal samples from p and with K\u03c8^2 K\u03c3 for the bilinear form (12), and with K\u03c8(K\u03c8 + K\u03c3) for\nthe linear form (13).\n\nAlgorithm 1: Learning to infer and postdict with temporal DDC\ninput: internal model f, g and noise source \u03b6(\u00b7),t, as in (8);\n  recognition model h^s_\u03c6(rt-1, xt);\n  target function l on which postdictive posterior expectations are to be computed, (14);\n  fixed random basis \u03c3(\u00b7) for xt, \u03b3(\u00b7) for zt and k(\u00b7,\u00b7), e.g. (9);\n  observations from the external world x_t^(\u2217) arriving at time t;\nInitialize internal DDCs {r_0^(s)}_{s=1}^S and latent samples {z_0^(s)}_{s=1}^S from prior p0(z0);\nInitialize r_0^(\u2217) for external observations, e.g. empirical mean of \u03c8(z0);\nInitialize recognition parameters \u03c6 and readout weights \u03b1;\nCompute recurrent feature \u03c8_0^(s) = [\u03b3(z_0^(s)); 0], \u2200s \u2208 {1, 2, . . . , S};\nwhile online observations come in at time t \u2208 {1, 2, . . .} do\n  Update \u03c6 and \u03b1:\n  for s \u2208 {1, 2, . . . , S} do\n    Simulate z_t^(s) = f(z_{t-1}^(s), \u03b6_{z,t}^(s)) and x_t^(s) = g(z_t^(s), \u03b6_{x,t}^(s)), (8);\n    Compute \u03c8_t^(s) = k(\u03c8_{t-1}^(s), z_t^(s)), (9); r_t^(s) = h_\u03c6(r_{t-1}^(s), x_t^(s)), e.g. 
(12) or (13);\n  end\n  Update \u03c6 to minimize sample version of L^s, (11):\n    bilinear (12): \u0394W_{ijk} \u221d (1/S) \u2211_s (r_{t,i}^(s) \u2212 \u03c8_{t,i}^(s)) r_{t-1,j}^(s) \u03c3_k(x_t^(s));\n    linear (13): \u0394W_{ij} \u221d (1/S) \u2211_s (r_{t,i}^(s) \u2212 \u03c8_{t,i}^(s)) [r_{t-1}^(s); \u03c3(x_t^(s))]_j;\n  Update \u03b1 to better approximate l(z_{t-\u03c4:t}) with \u03c8t, e.g. by delta rule;\n  Compute posterior DDC and expectation of target function:\n  r_t^(\u2217) = h_\u03c6(r_{t-1}^(\u2217), x_t^(\u2217));\n  Eq(z_{t-\u03c4:t}|x1:t)[l(z_{t-\u03c4:t})] \u2248 \u03b1^\u22a4 r_t^(\u2217);\nend\nreturn: r_t^(\u2217) and Eq(z_{t-\u03c4:t}|x1:t)[l(z_{t-\u03c4:t})] at time t \u2208 {1, 2, . . .}.\n\n4 Experiments\n\nWe demonstrate the effectiveness of the proposed recognition method on biologically relevant\nsimulations.2 For each experiment, we trained the DDC filter offline until it learned the internal\nmodel, and ran inference using fixed \u03c6 and \u03b1. Details of the experiments are described in Appendix D.\nAdditional results incorporating neuronal noise are shown in Appendix E.\n\n2Code available at https://github.com/kevin-w-li/ddc_ssm\n\n4.1 Auditory continuity illusions\nIn the auditory continuity illusion, the percept of a complex sound may be altered by subsequent\nacoustic signals. Two tone pulses separated by a silent gap are perceived to be discontinuous;\nhowever, when the gap is filled by sufficiently loud wide-band noise, listeners often report an illusory\n\nFigure 1: Modelling the auditory continuity illusion. We demonstrate postdictive DDC inference for\nsix different acoustic stimuli (experiments A-F). 
In each experiment, the top panel shows the true\namplitudes of the tone and noise; the middle panel shows the spectrogram observation; and the lower\npanel shows the real-time posterior marginal probabilities of the tone q(zt-\u03c4|x1:t), \u03c4 \u2208 {0, . . . , t-1}\nat each time t and lag \u03c4. Each vertical stack of three small rectangles shows the estimated marginal\nprobability that the tone level was zero (bottom), medium (middle) or high (top) (see scale at bottom\nright). Each row of stacks collects the marginal beliefs based on sensory evidence to time t (left\nlabels). The position of the stack in the row indicates the absolute time t-\u03c4 to which the belief pertains\n(bottom left labels). For example, the highlighted stack in A shows the marginal probability, given evidence\nup to time step 7 (t = 7), over the tone level at time step 6 (t-\u03c4 = 6); in this example, the medium\nlevel has most of the probability as expected.\n\n[Figure 1 image omitted. Panels A-F, each showing tone and noise amplitude (amp.), the spectrogram (freq.), and posterior beliefs arranged by stimulus time (t\u2212\u03c4, 1 to 10) and perception time (t, 1 to 10); color scale q(z) from 0.0 to 1.0.]\n\nFigure 2: Modelling localization in the flash-lag effect. Black dashed line shows the true trajectory\nof the moving object. Red line shows the prediction of the extrapolation model. Black solid line\nwith error bar shows the perceived trajectory reported by a human subject (mean \u00b1 2sem) or models\n(mean \u00b1 std from 100 runs). A, human data from [49]. B, the observation used in our simulation. C,\nDDC recognition using \u03c4 = 3 additional observations to postdict position at t0 = 3 time steps after\nthe time of the flash. D, DDC recognition without postdiction.\n\ncontinuation of the tone through the noise. 
This illusion is reduced if the second tone begins after a\nslight delay, even though the acoustic stimulus in the two cases is identical until noise offset [6, 30].\nTo model the essential elements of this phenomenon, we built a simple internal model for tone and\nnoise stimuli described in Appendix D.1, with a binary Markov chain describing the onsets and\noffsets of tone and wide-band noise, and noisy observations of power in three frequency bands. We\nran six different experiments once the recognition model had learned to perform inference based on\nthe internal model. Figure 1 shows the marginal posterior distributions of the perceived tone level at\npast times t-\u03c4 given the stimulus up to time t, decoded from the DDC values rt. In Figure 1A, when a\nclear mid-level tone is presented, the model correctly identifies the level and duration of the tone,\nand retains this information following tone offset. Figure 1B and C show postdictive inference. As\nthe noise turns on, the real-time estimate of the probability that the tone has turned off increases.\nHowever, when the noise turns off, an immediately subsequent tone restores the belief that the tone\ncontinued throughout the noise. By contrast, a gap between the noise and the second tone increased\nthe inferred belief that the noise had turned off to near certainty.\nWe tested the model on three additional sound configurations. In Figure 1D, the tone has a higher\nlevel than in Figure 1A-C. If the noise has lower spectral density than the tone, the model believes\nthat the tone might have been interrupted, but retains some mild uncertainty. If this noise level is\nmuch lower (Figure 1E), no illusory tone is perceived. These effects of tone and noise amplitude on\nhow likely the illusion is to arise are qualitatively consistent with findings in [39]. 
In the final experiment\n(Figure 1F), the model predicts that no continuity is perceived if the first tone is softer than the noise\nbut the second tone is louder, having learned from the internal model that tone level does not, in fact,\nchange between non-zero levels.\n\n4.2 The flash-lag effect with direction reversal\nIn the previous experiment, the internal model correctly describes the statistics of the stimuli. It is\nknown that a mismatch of the internal model to the real world, such as when a slowness/smoothness prior\nmeets an observation that actually moves fast [41], can induce perceptual illusions. Here, we use\nDDC recognition to model the flash-lag effect, although the same principle can also be used directly\nfor the cutaneous rabbit effect in somatosensation [17].\nIn the flash-lag effect, a brief flash of light is generated adjacent to the current position of an object\nthat has been moving steadily in the visual field. Subjects report the flash to appear behind the object\n[28, 32]. One early explanation for this finding is the extrapolation model [32]: viewers extrapolate\nthe movement of the object and report its predicted position at the time of the flash. An alternative is\nthe latency difference model [36] according to which the perception of a sudden flash is delayed by\nt0 relative to the object, and so subjects report the object at time t0 after the flash.\nHowever, neither explanation can account for another related finding: if the moving object suddenly\nswitches direction and the timing of the flash is chosen at different offsets around the reversal position\n(still aligned with the object), the reported object locations at the time of the flashes form a smooth\n\n[Figure 2 image omitted. Panels: A, human data; B, observations (pixel against time); C, DDC: t0=3, \u03c4=3; D, DDC: t0=3, \u03c4=0. Axes: time and z; legend: true location, extrapolation, perceived location.]\n\nFigure 3: Tracking in a nonlinear noisy system. 
A, 1-D image observation through time. B, posterior\nmean and marginals estimated using a particle \ufb01lter. C-G, posterior marginals decoded from DDC for\nthe location at time t-\u03c4 perceived at time t.\n\ntrajectory (Figure 2A), instead of the broken line predicted by the extrapolation model, or the simple\nshift in time predicted by the latency difference model [49].\nRao et al. [37] suggested that the lag might arise from signal propagation delays as in the latency\ndifference model, but the smoothing could be caused by incorporating observations during an\nadditional processing delay. That is, after perceiving the \ufb02ash at t0, the brain takes time \u03c4 to estimate\nthe object location. Importantly, subjects process more observations from the visible object trajectory\nin this period in order to postdict its position at t0. The authors used Kalman smoothing in a linear\nGaussian internal model favoring slow movements to reproduce the behavioral results.\nHere, we apply this idea of postdiction from [37] to a more realistic internal model described in\nAppendix D.2. Brie\ufb02y, the unobserved true object dynamics is linear Gaussian with additive Gaussian\nnoise, and the observation emission is a 1-D image showing the position at each time step with\nPoisson noise (Figure 2B). After establishing a preference for slow and smooth movements, the\nperceived locations derived by dynamical DDC inference trace out a curve that resembles the human\ndata, by taking into account observations after the perception of \ufb02ash (Figure 2C). Without postdiction\n(Figure 2D), the reported location tends to overshoot, as also noted in [37].\n\n4.3 Noisy and occluded tracking\n\nWhen tracking a target (such as a prey) using noisy and occasionally occluded observations, it\nis possible to improve estimates of the trajectory followed during the occlusion by using later\nobservations. 
Knowledge of the particular path followed by the target may be important for planning\nand control [2]. To explore the potential for dynamic DDC inference in this setting, we instantiated\na system of stochastic oscillatory dynamics observed through a 1-D image with additive Gaussian\nnoise and occlusion (details in Appendix D.3). An example set of observations is shown in Figure 3A.\nWe ran a simple bootstrap particle filter (PF) as a benchmark (Figure 3B).\n\n[Figure 3 image omitted. Panels: A, observations; B, particle filter, R2=0.546; C, DDC: \u03c4=0, R2=0.535; D, DDC: \u03c4=1, R2=0.617; E, DDC: \u03c4=2, R2=0.642; F, DDC: \u03c4=5, R2=0.672; G, DDC: \u03c4=8, R2=0.683. Axes: time (0 to 75) and z (\u22122 to 2); legend: truth, posterior mean; color scale q(z) from 0.0 to 0.3.]\n\nThe results of DDC recognition for these observations are shown in Figure 3C-G. The marginal\nposterior histograms were obtained by projecting rt onto a set of bin functions using (14) (maximum\nentropy decoding is less smooth; see Figure 5 in Appendix D.3). We computed the R2 of the\nprediction of true latent locations by posterior means. The purely forward (\u03c4 = 0) posterior mean\nis comparable to that of the particle filter. As the postdictive window (and so number of future\nobservations) \u03c4 increases, we see not only an increase in R2, but also a reduction in uncertainty. In\nthe occluded regions, the posterior mass becomes more concentrated as the number of additional\nobservations \u03c4 increases, particularly towards the end of occlusions. 
In addition, bimodality is\nobserved during some occluded intervals, reflecting the nonlinearity in the latent process.\n\n5 Related work and discussion\n\nThe DDC [45] stems from earlier proposals for neural representations of uncertainty [40, 51, 52].\nNotably, the DDC for a marginal distribution (1) is identical to the encoding scheme in [40], in\nwhich moments of a set of tuning functions \u03b3(z) encode multivariate random variables or intensity\nfunctions. 
The DDC may also be seen as a mean embedding within a finite-dimensional Hilbert\nspace, approaching the full kernel mean embedding [43] as the size of the population grows. Recent\ndevelopments [44, 47] focus on conditional DDCs with applications in learning hierarchical generative\nmodels, with a relationship to the conditional mean embedding [18].\nThe work in this paper extends the DDC framework in two ways. First, the dynamic encoding\nfunction introduced in Section 3.2 condenses information about variables at different times, and\nthus facilitates online postdictive inference for a generic internal model. Second, Algorithm 1 in\nSection 3.3 is a neurally plausible method for learning to infer. It allows a recognition model to\nbe trained using samples and DDC messages, and could be extended to other graph structures.\nAlthough the psychophysical experiments modeled in Section 4 have been explained as smoothing\non a computational level, we provide a plausible mechanism for how neural populations could\nimplement and learn to perform this computation in an online manner.\nOther schemes besides the DDC have been proposed for the neural representation of uncertainty.\nThese include: sample-based representations [21, 25, 33]; probabilistic population codes (PPCs) [4,\n27] which in their most common form have neuronal activity represent the natural parameters of\nan exponential family distribution [4]; linear density codes [13]; and further proposals adapted to\nspecific inferential problems, such as filtering [11, 26]. The generative process of a realistic dynamical\nenvironment is usually nonlinear, making postdiction or even ordinary filtering challenging. 
If beliefs\nabout latent states were represented by samples [25, 29], then postdiction would depend on\nsamples being maintained in a \u201cbuffer\u201d to be modified by later inputs and accessed by downstream\nprocessing; or would require an exponentially large number of neurons to provide samples from latent\nhistories; or would require a complex distributed encoding of samples that might resemble the\ndynamic DDC we propose. Natural parameters (as in the PPC) might be associated with dynamic\nencoding functions as described here, but the derivation and neural implementation for the update\nrule would not be straightforward. In contrast, the DDC (mean parameters) can be updated using simple\noperations as in (12) and (13). Unlike sample-based representation hypotheses, in which posterior\nsamples must be drawn in real time, sampling within the DDC learning framework is used to train\nthe recognition model using the unconditioned joint distribution.\nAlthough several approximate inference methods may seem plausible, learning the appropriate\nnetworks to implement them poses yet another challenge for the brain. In most of the frameworks\nmentioned above, special neural circuits need to be wired for specific problems. Learning to infer\nusing the DDC requires training samples from the internal model, on which the delta rule is used to\nupdate the recognition model. This can be done off-line and does not require true posteriors as targets.\nOne aspect we did not address in this paper is how the brain acquires an appropriate internal model,\nand thus adapts to new problems. If an EM- or wake-sleep-like algorithm is used for adaptation,\nparameters in the internal model may be updated using the posterior representations [45] learned\nfrom the previous internal model. 
We expect that the postdictive (smoothed) DDC proposed here\nmay help to fit a more accurate model to dynamical observations, as these posteriors better capture\nthe correlations in the latent dynamics than a filtered posterior.\n\nAcknowledgments\n\nThis work is supported by the Gatsby Charitable Foundation.\n\nReferences\n\n[1] D. Alais and D. Burr. \u201cThe ventriloquist effect results from near-optimal bimodal integration\u201d.\n\nIn: Current Biology (2004).\n\n[2] F. Amigoni and M. Somalvico. \u201cMultiagent systems for environmental perception\u201d. In: AMS\n\nConference on Artificial Intelligence Applications to Environmental Science. 2003.\n\n[3] P. W. Battaglia, R. A. Jacobs, and R. N. Aslin. \u201cBayesian integration of visual and auditory\n\nsignals for spatial localization\u201d. In: J. Opt. Soc. Am. A (2003).\n\n[4] J. Beck, W. Ma, P. Latham, and A. Pouget. \u201cProbabilistic population codes and the exponential\nfamily of distributions\u201d. In: Progress in brain research (2007).\n\n[5] U. Beierholm, L. Shams, W. J. Ma, and K. Koerding. \u201cComparing Bayesian models for\n\nmultisensory cue combination without mandatory integration\u201d. In: NeurIPS. 2008.\n\n[6] A. S. Bregman. Auditory scene analysis: The perceptual organization of sound. 1994.\n[7] A. S. Charles, D. Yin, and C. J. Rozell. \u201cDistributed Sequence Memory of Multidimensional\n\nInputs in Recurrent Networks\u201d. In: JMLR (2017).\n\n[8] A. K. Churchland, R. Kiani, R. Chaudhuri, X.-J. Wang, A. Pouget, and M. N. Shadlen.\n\u201cVariance as a signature of neural computations during decision making\u201d. In: Neuron (2011).\n[9] M. M. Churchland, B. M. Yu, J. P. Cunningham, L. P. Sugrue, et al. \u201cStimulus onset quenches\n\nneural variability: a widespread cortical phenomenon\u201d. In: Nature neuroscience (2010).\n\n[10] P. Dayan, G. E. Hinton, R. M. Neal, and R. S. Zemel. \u201cThe Helmholtz machine\u201d. 
In: Neural\n\ncomputation (1995).\n\n[11] S. 
Deneve, J.-R. Duhamel, and A. Pouget. \u201cOptimal sensorimotor integration in recurrent\ncortical networks: a neural implementation of Kalman filters\u201d. In: Journal of neuroscience\n(2007).\n\n[12] D. M. Eagleman and T. J. Sejnowski. \u201cMotion integration and postdiction in visual awareness\u201d.\n\nIn: Science (2000).\n\n[13] C. Eliasmith and C. H. Anderson. Neural engineering: Computation, representation, and\n\ndynamics in neurobiological systems. 2004.\n\n[14] M. O. Ernst and M. S. Banks. \u201cHumans integrate visual and haptic information in a statistically\n\noptimal fashion\u201d. In: Nature (2002).\n\n[15] A. Funamizu, B. Kuhn, and K. Doya. \u201cNeural substrate of dynamic Bayesian inference in the\n\ncerebral cortex\u201d. In: Nature neuroscience (2016).\n\n[16] S. Ganguli, D. Huh, and H. Sompolinsky. \u201cMemory traces in dynamical systems\u201d. In: PNAS\n\n(2008).\n\n[17] F. A. Geldard and C. E. Sherrick. \u201cThe cutaneous \"rabbit\": a perceptual illusion\u201d. In: Science\n\n(1972).\n\n[18] S. Gr\u00fcnew\u00e4lder, G. Lever, A. Gretton, L. Baldassarre, S. Patterson, and M. Pontil. \u201cConditional\n\nmean embeddings as regressors\u201d. In: ICML. 2012.\n\n[19] G. E. Hinton, P. Dayan, B. J. Frey, and R. M. Neal. \u201cThe \"wake-sleep\" algorithm for\n\nunsupervised neural networks\u201d. In: Science (1995).\n\n[20] H. Choi and B. J. Scholl. \u201cPerceiving causality after the fact: postdiction in the\n\ntemporal dynamics of causal perception\u201d. In: Perception (2006).\n\n[21] P. O. Hoyer and A. Hyv\u00e4rinen. \u201cInterpreting Neural Response Variability as Monte Carlo\n\nSampling of the Posterior\u201d. In: NeurIPS. 2003.\n\n[22] D. P. Kingma and M. Welling. \u201cAuto-Encoding Variational Bayes\u201d. In: ICLR. 2014.\n[23] K. P. K\u00f6rding, U. Beierholm, W. J. Ma, S. Quartz, J. B. Tenenbaum, and L. Shams. \u201cCausal\n\nInference in Multisensory Perception\u201d. In: PLoS ONE (2007).\n\n[24] K. P. 
K\u00f6rding, S.-p. Ku, and D. M. Wolpert. \u201cBayesian Integration in Force Estimation\u201d. In:\n\nJournal of neurophysiology (2004).\n\n[25] A. Kutschireiter, S. C. Surace, H. Sprekeler, and J. P. Pfister. \u201cNonlinear Bayesian filtering\n\nand learning: A neuronal dynamics for perception\u201d. In: Scientific Reports (2017).\n\n[26] R. Legenstein and W. Maass. \u201cEnsembles of Spiking Neurons with Noise Support Optimal\nProbabilistic Inference in a Dynamically Changing Environment\u201d. In: PLoS Computational\nBiology (2014).\n\n[27] W. J. Ma, J. M. Beck, P. E. Latham, and A. Pouget. \u201cBayesian inference with probabilistic\n\npopulation codes\u201d. In: Nature neuroscience (2006).\n\n[28] D. M. Mackay. \u201cPerceptual stability of a stroboscopically lit visual field containing\n\nself-luminous objects\u201d. In: Nature (1958).\n\n[29] J. G. Makin, B. K. Dichter, and P. N. Sabes. \u201cLearning to estimate dynamical state with\nprobabilistic population codes\u201d. In: PLoS computational biology (2015).\n\n[30] G. A. Miller and J. C. Licklider. \u201cThe intelligibility of interrupted speech\u201d. In: Journal of the\n\nacoustical society of america (1950).\n\n[31] Y. Mohsenzadeh, S. Dash, and J. D. Crawford. \u201cA state space model for spatial updating\nof remembered visual targets during eye movements\u201d. In: Frontiers in systems neuroscience\n(2016).\n\n[32] R. Nijhawan. \u201cMotion extrapolation in catching\u201d. In: Nature (1994).\n[33] G. Orb\u00e1n, P. Berkes, J. Fiser, and M. Lengyel. \u201cNeural variability and sampling-based\n\nprobabilistic representations in the visual cortex\u201d. In: Neuron (2016).\n\n[34] G. Orb\u00e1n and D. M. Wolpert. \u201cRepresentations of uncertainty in sensorimotor control\u201d. In:\n\nCurrent opinion in neurobiology (2011).\n\n[35] I. V. Oseledets. \u201cTensor-train decomposition\u201d. In: SIAM Journal on Scientific Computing\n(2011).\n\n[36] G. Purushothaman, S. 
S. Patel, H. E. Bedell, and H. Ogmen. \u201cMoving ahead through differential\n\nvisual latency\u201d. In: Nature (1998).\n\n[37] R. P. Rao, D. M. Eagleman, and T. J. Sejnowski. \u201cOptimal smoothing in visual motion\n\nperception\u201d. In: Neural computation (2001).\n\n[38] D. J. Rezende, S. Mohamed, and D. Wierstra. \u201cStochastic Backpropagation and Approximate\n\nInference in Deep Generative Models\u201d. In: ICML. 2014.\n\n[39] L. Riecke, A. J. van Opstal, and E. Formisano. \u201cThe auditory continuity illusion: A parametric\n\ninvestigation and filter model\u201d. In: Perception & Psychophysics (2008).\n\n[40] M. Sahani and P. Dayan. \u201cDoubly distributional population codes: simultaneous representation\n\nof uncertainty and multiplicity\u201d. In: Neural Computation (2003).\n\n[41] S. Shimojo. \u201cPostdiction: its implications on visual awareness, hindsight, and sense of agency\u201d.\n\nIn: Frontiers in psychology (2014).\n\n[42] S. Sokoloski. \u201cImplementing a Bayes filter in a neural circuit: The case of unknown stimulus\n\ndynamics\u201d. In: Neural computation (2017).\n\n[43] L. Song, K. Fukumizu, and A. Gretton. \u201cKernel embeddings of conditional distributions: A\nunified kernel framework for nonparametric inference in graphical models\u201d. In: IEEE Signal\nProcessing Magazine (2013).\n\n[44] E. V\u00e9rtes and M. Sahani. \u201cA neurally plausible model learns successor representations in\n\npartially observable environments\u201d. In: NeurIPS. 2019.\n\n[45] E. V\u00e9rtes and M. Sahani. \u201cFlexible and accurate inference and learning for deep generative\n\nmodels\u201d. In: NeurIPS. 2018.\n\n[46] M. J. Wainwright and M. I. Jordan. \u201cGraphical models, exponential families, and variational\n\ninference\u201d. In: Foundations and trends in Machine Learning (2008).\n\n[47] L. Wenliang, E. V\u00e9rtes, and M. Sahani. \u201cAccurate and adaptive neural recognition in dynamical\n\nenvironment\u201d. 
In: COSYNE Abstracts. 2019.\n\n[48] L. Whiteley and M. Sahani. \u201cImplicit knowledge of visual uncertainty guides decisions with\n\nasymmetric outcomes\u201d. In: Journal of Vision (2008).\n\n[49] D. Whitney and I. Murakami. \u201cLatency difference, not spatial extrapolation\u201d. In: Nature\n\nneuroscience (1998).\n\n[50] J.-J. O. de Xivry, S. Coppe, G. Blohm, and P. Lefevre. \u201cKalman filtering naturally accounts for\nvisually guided and predictive smooth pursuit dynamics\u201d. In: Journal of neuroscience (2013).\n\n[51] R. S. Zemel and P. Dayan. \u201cDistributional population codes and multiple motion models\u201d. In:\n\nNeurIPS. 1999.\n\n[52] R. S. Zemel, P. Dayan, and A. Pouget. \u201cProbabilistic Interpretation of Population Codes\u201d. In:\n\nNeural Computation (1998).\n", "award": [], "sourceid": 5112, "authors": [{"given_name": "Li Kevin", "family_name": "Wenliang", "institution": "Gatsby Unit, UCL"}, {"given_name": "Maneesh", "family_name": "Sahani", "institution": "Gatsby Unit, UCL"}]}