{"title": "Inference by Reparameterization in Neural Population Codes", "book": "Advances in Neural Information Processing Systems", "page_first": 2029, "page_last": 2037, "abstract": "Behavioral experiments on humans and animals suggest that the brain performs probabilistic inference to interpret its environment. Here we present a new general-purpose, biologically-plausible neural implementation of approximate inference. The neural network represents uncertainty using Probabilistic Population Codes (PPCs), which are distributed neural representations that naturally encode probability distributions, and support marginalization and evidence integration in a biologically-plausible manner. By connecting multiple PPCs together as a probabilistic graphical model, we represent multivariate probability distributions. Approximate inference in graphical models can be accomplished by message-passing algorithms that disseminate local information throughout the graph. An attractive and often accurate example of such an algorithm is Loopy Belief Propagation (LBP), which uses local marginalization and evidence integration operations to perform approximate inference efficiently even for complex models. Unfortunately, a subtle feature of LBP renders it neurally implausible. However, LBP can be elegantly reformulated as a sequence of Tree-based Reparameterizations (TRP) of the graphical model. We re-express the TRP updates as a nonlinear dynamical system with both fast and slow timescales, and show that this produces a neurally plausible solution. By combining all of these ideas, we show that a network of PPCs can represent multivariate probability distributions and implement the TRP updates to perform probabilistic inference. 
Simulations with Gaussian graphical models demonstrate that the neural network inference quality is comparable to the direct evaluation of LBP and robust to noise, and thus provides a promising mechanism for general probabilistic inference in the population codes of the brain.", "full_text": "Inference by Reparameterization in Neural Population Codes

Rajkumar V. Raju
Department of ECE
Rice University
Houston, TX 77005
rv12@rice.edu

Xaq Pitkow
Dept. of Neuroscience, Dept. of ECE
Baylor College of Medicine, Rice University
Houston, TX 77005
xaq@rice.edu

Abstract

Behavioral experiments on humans and animals suggest that the brain performs probabilistic inference to interpret its environment. Here we present a new general-purpose, biologically-plausible neural implementation of approximate inference. The neural network represents uncertainty using Probabilistic Population Codes (PPCs), which are distributed neural representations that naturally encode probability distributions, and support marginalization and evidence integration in a biologically-plausible manner. By connecting multiple PPCs together as a probabilistic graphical model, we represent multivariate probability distributions. Approximate inference in graphical models can be accomplished by message-passing algorithms that disseminate local information throughout the graph. An attractive and often accurate example of such an algorithm is Loopy Belief Propagation (LBP), which uses local marginalization and evidence integration operations to perform approximate inference efficiently even for complex models. Unfortunately, a subtle feature of LBP renders it neurally implausible. However, LBP can be elegantly reformulated as a sequence of Tree-based Reparameterizations (TRP) of the graphical model.
We re-express the TRP updates as a nonlinear dynamical system with both fast and slow timescales, and show that this produces a neurally plausible solution. By combining all of these ideas, we show that a network of PPCs can represent multivariate probability distributions and implement the TRP updates to perform probabilistic inference. Simulations with Gaussian graphical models demonstrate that the neural network inference quality is comparable to the direct evaluation of LBP and robust to noise, and thus provides a promising mechanism for general probabilistic inference in the population codes of the brain.

1 Introduction

In everyday life we constantly face tasks we must perform in the presence of sensory uncertainty. A natural and efficient strategy is then to use probabilistic computation. Behavioral experiments have established that humans and animals do in fact use probabilistic rules in sensory, motor and cognitive domains [1, 2, 3]. However, the implementation of such computations at the level of neural circuits is not well understood.

In this work, we ask how distributed neural computations can consolidate incoming sensory information and reformat it so it is accessible for many tasks. More precisely, how can the brain simultaneously infer marginal probabilities in a probabilistic model of the world? Previous efforts to model marginalization in neural networks using distributed codes invoked limiting assumptions, either treating only a small number of variables [4], allowing only binary variables [5, 6, 7], or restricting interactions [8, 9]. Real-life tasks are more complicated and involve a large number of variables that need to be marginalized out, requiring a more general inference architecture.

Here we present a distributed, nonlinear, recurrent network of neurons that performs inference about many interacting variables.

30th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain.
There are two crucial parts to this model: the representation and the inference algorithm. We assume that brains represent probabilities over individual variables using Probabilistic Population Codes (PPCs) [10], which were derived by using Bayes' Rule on experimentally measured neural responses to sensory stimuli. Here for the first time we link multiple PPCs together to construct a large-scale graphical model. For the inference algorithm, many researchers have considered Loopy Belief Propagation (LBP) to be a simple and efficient candidate algorithm for the brain [11, 12, 13, 14, 8, 5, 7, 6]. However, we will discuss one particular feature of LBP that makes it neurally implausible. Instead, we propose that an alternative formulation of LBP known as Tree-based Reparameterization (TRP) [15], with some modifications for continuous-time operation at two timescales, is well-suited for neural implementation in population codes.

We describe this network mathematically below, but the main conceptual ideas are fairly straightforward: multiplexed patterns of activity encode statistical information about subsets of variables, and neural interactions disseminate these statistics to all other relevant encoded variables.

In Section 2 we review key properties of our model of how neurons can represent probabilistic information through PPCs. Section 3 reviews graphical models, Loopy Belief Propagation and Tree-based Reparameterization. In Section 4, we merge these ingredients to model how populations of neurons can represent and perform inference on large multivariate distributions. Section 5 describes experiments to test the performance of the network. We summarize and discuss our results in Section 6.

2 Probabilistic Population Codes

Neural responses r vary from trial to trial, even to repeated presentations of the same stimulus x. This variability can be expressed as the likelihood function p(r|x).
Experimental data from several brain areas responding to simple stimuli suggests that this variability often belongs to the exponential family of distributions with linear sufficient statistics [10, 16, 17, 4, 18]:

$$p(\mathbf{r}|x) = \phi(\mathbf{r})\exp(\mathbf{h}(x)\cdot\mathbf{r}) \qquad (1)$$

where h(x) depends on the stimulus-dependent mean and fluctuations of the neuronal response and φ(r) is independent of the stimulus. For a conjugate prior p(x), the posterior distribution will also have this general form, p(x|r) ∝ exp(h(x)·r). This neural code is known as a linear PPC: it is a Probabilistic Population Code because the population activity collectively encodes the stimulus probability, and it is linear because the log-likelihood is linear in r. In this paper, we assume responses are drawn from this family, although incorporation of more general PPCs with nonlinear sufficient statistics T(r) is possible: p(r|x) ∝ exp(h(x)·T(r)).

An important property of linear PPCs, central to this work, is that different projections of the population activity encode the natural parameters of the underlying posterior distribution. For example, if the posterior distribution is Gaussian (Figure 1), then p(x|r) ∝ exp(−½x² a·r + x b·r), with a·r and b·r encoding the quadratic and linear natural parameters of the posterior. These projections are related to the expectation parameters, the mean and variance, by μ = (b·r)/(a·r) and σ² = 1/(a·r).

A second important property of linear PPCs is that the variance of the encoded distribution is inversely proportional to the overall amplitude of the neural activity. Intuitively, this means that more spikes means more certainty (Figure 1).

The most fundamental probabilistic operations are the product rule and the sum rule. Linear PPCs can perform both of these operations while maintaining a consistent representation [4], a useful feature for constructing a model of canonical computation.
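As a concrete sketch of these decoding relations, the following toy example constructs a noiseless activity vector whose projections a·r and b·r encode a chosen Gaussian posterior, then decodes it. The tuning and readout vectors here are illustrative assumptions, not taken from the paper:

```python
import numpy as np

# Illustrative readout vectors for a linear PPC (assumptions for this sketch).
rng = np.random.default_rng(0)
N = 50
a = rng.random(N) + 0.1       # projection encoding the quadratic parameter
b = rng.standard_normal(N)    # projection encoding the linear parameter

# Construct a noiseless activity r whose projections a.r and b.r equal the
# natural parameters 1/sigma2 and mu/sigma2 (minimum-norm solution).
mu, sigma2 = 2.0, 0.25
M = np.vstack([a, b])
r = np.linalg.pinv(M) @ np.array([1 / sigma2, mu / sigma2])

# Decode expectation parameters from two linear projections of the activity.
mu_hat = (b @ r) / (a @ r)
sigma2_hat = 1.0 / (a @ r)
print(mu_hat, sigma2_hat)     # ≈ 2.0 and 0.25

# Gain scaling: doubling the activity halves the encoded variance.
print(1.0 / (a @ (2 * r)))    # ≈ 0.125
```

The last line illustrates the gain property described above: scaling the whole population response up scales the encoded precision up by the same factor.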
For a log-linear probability code like linear PPCs, the product rule corresponds to weighted summation of neural activities: p(x|r1, r2) ∝ p(x|r1)p(x|r2) ⟺ r3 = A1r1 + A2r2. In contrast, to use the sum rule to marginalize out variables, linear PPCs require nonlinear transformations of population activity. Specifically, a quadratic nonlinearity with divisive normalization performs near-optimal marginalization in linear PPCs [4]. Quadratic interactions arise naturally through coincidence detection, and divisive normalization is a nonlinear inhibitory effect widely observed in neural circuits [19, 20, 21]. Alternatively, near-optimal marginalizations on PPCs can also be performed by more general nonlinear transformations [22]. In sum, PPCs provide a biologically compatible representation of probabilistic information.

Figure 1: Key properties of linear PPCs. (A) Two single trial population responses for a particular stimulus, with low and high amplitudes (blue and red). The two projections a·r and b·r encode the natural parameters of the posterior. (B) Corresponding posteriors over stimulus variables determined by the responses in panel A. The gain or overall amplitude of the population code is inversely proportional to the variance of the posterior distribution.

3 Inference by Tree-based Reparameterization

3.1 Graphical Models

To generalize PPCs, we need to represent the joint probability distribution of many variables. A natural way to represent multivariate distributions is with probabilistic graphical models. In this work, we use the formalism of factor graphs, a type of bipartite graph in which nodes representing variables are connected to other nodes called factors representing interactions between 'cliques' or sets of variables (Figure 2A).
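The PPC product rule described above admits a quick numerical check: adding the activities of two linear PPCs multiplies the Gaussian posteriors they encode. A minimal sketch, assuming identity combination weights (A1 = A2 = I) and illustrative readout vectors:

```python
import numpy as np

rng = np.random.default_rng(1)
N = 40
a = rng.random(N) + 0.1       # readout for the quadratic natural parameter
b = rng.standard_normal(N)    # readout for the linear natural parameter

def encode(mu, sigma2):
    """Minimum-norm activity whose projections encode (1/sigma2, mu/sigma2)."""
    M = np.vstack([a, b])
    return np.linalg.pinv(M) @ np.array([1 / sigma2, mu / sigma2])

def decode(r):
    return (b @ r) / (a @ r), 1.0 / (a @ r)

r1 = encode(1.0, 0.5)   # cue 1 encodes N(1.0, 0.5)
r2 = encode(3.0, 1.0)   # cue 2 encodes N(3.0, 1.0)
r3 = r1 + r2            # product rule: summed activity (A1 = A2 = I)

mu3, s3 = decode(r3)
# Product of Gaussians: precisions add; mean is the precision-weighted average.
print(mu3, s3)  # ≈ 1.667 and 0.333
```

The decoded posterior matches the analytic product of the two Gaussians (precision 2 + 1 = 3, mean 5/3), with no nonlinearity needed, exactly as the log-linear argument predicts.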
The joint probability over all variables can then be represented as a product over cliques, $p(\mathbf{x}) = \frac{1}{Z}\prod_{c\in C}\psi_c(\mathbf{x}_c)$, where ψc(xc) are nonnegative compatibility functions on the set of variables xc in clique c, and Z is a normalization constant. The distribution of interest will be a posterior distribution p(x|r) that depends on neural responses r. Since the inference algorithm we present is unchanged with this conditioning, for notational convenience we suppress this dependence on r.

In this paper, we focus on pairwise interactions, although our main framework generalizes naturally to richer, higher-order interactions. In a pairwise model, we allow singleton factors ψs for variable nodes s in a set of vertices V, and pairwise interaction factors ψst for pairs (s, t) in the set of edges E that connect those vertices. The joint distribution is then $p(\mathbf{x}) = \frac{1}{Z}\prod_{s\in V}\psi_s(x_s)\prod_{(s,t)\in E}\psi_{st}(x_s, x_t)$.

3.2 Belief Propagation and its neural plausibility

The inference problem of interest in this work is to compute the marginal distribution for each variable, $p_s(x_s) = \int p(\mathbf{x})\, d(\mathbf{x}\backslash x_s)$. This task is generally intractable. However, the factorization structure of the distribution can be used to perform inference efficiently, either exactly in the case of tree graphs, or approximately for graphs with cycles. One such inference algorithm is called Belief Propagation (BP) [11]. BP iteratively passes information along the graph in the form of messages mst(xt) from node s to t, using only local computations that summarize the relevant aspects of other messages upstream in the graph:

$$m^{n+1}_{st}(x_t) = \int dx_s\, \psi_s(x_s)\,\psi_{st}(x_s, x_t) \prod_{u\in N(s)\backslash t} m^n_{us}(x_s), \qquad b_s(x_s) \propto \psi_s(x_s)\prod_{u\in N(s)} m_{us}(x_s) \qquad (2)$$

where n is the time or iteration number, and N(s) is the set of neighbors of node s on the graph.
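A minimal sketch of the message updates (2) on a small binary chain (a tree), where BP recovers the brute-force marginals exactly; the factors are illustrative random tables:

```python
import numpy as np
from itertools import product

# Binary pairwise model on a chain 0 - 1 - 2, with illustrative factors.
rng = np.random.default_rng(2)
psi = {s: rng.random(2) + 0.1 for s in range(3)}             # singleton factors
psi_e = {(0, 1): rng.random((2, 2)) + 0.1,
         (1, 2): rng.random((2, 2)) + 0.1}                   # pairwise factors
edges = list(psi_e)
nbrs = {0: [1], 1: [0, 2], 2: [1]}

def pairwise(s, t, xs, xt):
    return psi_e[(s, t)][xs, xt] if (s, t) in psi_e else psi_e[(t, s)][xt, xs]

# Brute-force marginals from the joint p(x) = (1/Z) prod psi_s prod psi_st.
joint = np.zeros((2, 2, 2))
for x in product(range(2), repeat=3):
    joint[x] = np.prod([psi[s][x[s]] for s in range(3)]) * \
               np.prod([pairwise(s, t, x[s], x[t]) for s, t in edges])
joint /= joint.sum()
exact = [joint.sum(axis=tuple(j for j in range(3) if j != s)) for s in range(3)]

# Belief propagation, equation (2): messages m_{st}(x_t), synchronous updates.
msg = {(s, t): np.ones(2) for s in nbrs for t in nbrs[s]}
for _ in range(10):
    new = {}
    for s, t in msg:
        m = np.zeros(2)
        for xt in range(2):
            for xs in range(2):
                incoming = np.prod([msg[(u, s)][xs] for u in nbrs[s] if u != t])
                m[xt] += psi[s][xs] * pairwise(s, t, xs, xt) * incoming
        new[(s, t)] = m / m.sum()
    msg = new

# Beliefs: local evidence times all incoming messages.
beliefs = []
for s in nbrs:
    bel = psi[s] * np.prod([msg[(u, s)] for u in nbrs[s]], axis=0)
    beliefs.append(bel / bel.sum())
```

On this tree, the beliefs equal the exact marginals after only a couple of iterations; on a loopy graph the same code would only approximate them.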
The estimated marginal, called the 'belief' bs(xs) at a node s, is proportional to the local evidence at that node ψs(xs) and all the messages coming into node s. Similarly, the messages themselves are determined self-consistently by combining incoming messages, except for the previous message from the target node t.

This message exclusion is critical because it prevents evidence previously passed by the target node from being counted as if it were new evidence. This exclusion only prevents overcounting on a tree graph, however, and is unable to prevent overcounting of evidence passed around loops. For this reason, BP is exact for trees, but only approximate for general, loopy graphs. If we use this algorithm anyway, it is called 'Loopy' Belief Propagation (LBP), and it often has quite good performance [12].

Multiple researchers have been intrigued by the possibility that the brain may perform LBP [13, 14, 5, 8, 7, 6], since it gives "a principled framework for propagating, in parallel, information and uncertainty between nodes in a network" [12]. Despite the conceptual appeal of LBP, it is important to get certain details correct: in an inference algorithm described by nonlinear dynamics, deviations from ideal behavior could in principle lead to very different outcomes.

One critically important detail is that each node must send different messages to different targets to prevent overcounting. This exclusion can render LBP neurally implausible, because neurons cannot readily send different output signals to many different target neurons. Some past work simply ignores the problem [5, 7]; the resultant overcounting destroys much of the inferential power of LBP, often performing worse than more naïve algorithms like mean-field inference.
One better option is to use different readouts of population activity for different targets [6], but this approach is inefficient because it requires many readout populations for messages that differ only slightly, and requires separate optimization for each possible target. Other efforts have avoided the problem entirely by performing only unidirectional inference of low-dimensional variables that evolve over time [14]. Appealingly, one can circumvent all of these difficulties by using an alternative formulation of LBP known as Tree-based Reparameterization (TRP).

3.3 Tree-based Reparameterization

Insightful work by Wainwright, Jaakkola, and Willsky [15] revealed that belief propagation can be understood as a convenient way of refactorizing a joint probability distribution, according to approximations of local marginal probabilities. For pairwise interactions, this can be written as

$$p(\mathbf{x}) = \frac{1}{Z}\prod_{s\in V}\psi_s(x_s)\prod_{(s,t)\in E}\psi_{st}(x_s,x_t) = \prod_{s\in V}T_s(x_s)\prod_{(s,t)\in E}\frac{T_{st}(x_s,x_t)}{T_s(x_s)T_t(x_t)} \qquad (3)$$

where Ts(xs) is a so-called 'pseudomarginal' distribution of xs and Tst(xs, xt) is a joint pseudomarginal over xs and xt (Figure 2A–B), where Ts and Tst are the outcome of Loopy Belief Propagation. The name pseudomarginal comes from the fact that these quantities are always locally consistent with being marginal distributions, but they are only globally consistent with the true marginals when the graphical model is tree-structured.

These pseudomarginals can be constructed iteratively as the true marginals of a different joint distribution pτ(x) on an isolated tree-structured subgraph τ. Compatibility functions from factors remaining outside of the subgraph are collected in a residual term rτ(x).
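On a tree, the reparameterization (3) is exact when Ts and Tst are the true marginals. A small numerical check on a binary chain with illustrative random factors:

```python
import numpy as np
from itertools import product

rng = np.random.default_rng(3)
# Binary chain 0 - 1 - 2 with illustrative positive factors.
psi = [rng.random(2) + 0.1 for _ in range(3)]
psi_e = {(0, 1): rng.random((2, 2)) + 0.1, (1, 2): rng.random((2, 2)) + 0.1}

# Joint distribution p(x) = (1/Z) prod psi_s prod psi_st.
p = np.zeros((2, 2, 2))
for x in product(range(2), repeat=3):
    p[x] = psi[0][x[0]] * psi[1][x[1]] * psi[2][x[2]] * \
           psi_e[(0, 1)][x[0], x[1]] * psi_e[(1, 2)][x[1], x[2]]
p /= p.sum()

# True singleton and pairwise marginals.
T = [p.sum(axis=(1, 2)), p.sum(axis=(0, 2)), p.sum(axis=(0, 1))]
T01 = p.sum(axis=2)
T12 = p.sum(axis=0)

# Reparameterized form: prod_s T_s * prod_(s,t) T_st / (T_s T_t).
q = np.zeros((2, 2, 2))
for x in product(range(2), repeat=3):
    q[x] = T[0][x[0]] * T[1][x[1]] * T[2][x[2]] \
         * T01[x[0], x[1]] / (T[0][x[0]] * T[1][x[1]]) \
         * T12[x[1], x[2]] / (T[1][x[1]] * T[2][x[2]])

print(np.max(np.abs(p - q)))  # ~0: the two parameterizations match
```

The check simply verifies the identity: on this chain the right-hand side of (3) collapses to p(x0, x1) p(x2 | x1), the familiar exact tree factorization.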
This regrouping leaves the joint distribution unchanged: p(x) = pτ(x)rτ(x). The factors of pτ are then rearranged by computing the true marginals on its subgraph τ, again preserving the joint distribution. In subsequent updates, we iteratively refactorize using the marginals of pτ along different tree subgraphs τ (Figure 2C).

Figure 2: Visualization of tree reparameterization. (A) A probability distribution is specified by factors {ψs, ψst} on a tree graph. (B) An alternative parameterization of the same distribution in terms of the marginals {Ts, Tst}. (C) Two TRP updates, p(x) = pi(x)ri(x) at iteration i and p(x) = pj(x)rj(x) at iteration j, for a 3×3 nearest-neighbor grid of variables.

Typical LBP can be interpreted as a sequence of local reparameterizations over just two neighboring nodes and their corresponding edge [15]. Pseudomarginals are initialized at time n = 0 using the original factors: $T^0_s(x_s) \propto \psi_s(x_s)$ and $T^0_{st}(x_s, x_t) \propto \psi_s(x_s)\psi_t(x_t)\psi_{st}(x_s, x_t)$. At iteration n + 1, the node and edge pseudomarginals are computed by exactly marginalizing the distribution built from previous pseudomarginals at iteration n:

$$T^{n+1}_s \propto T^n_s \prod_{u\in N(s)} \frac{\int T^n_{su}\, dx_u}{T^n_s}, \qquad T^{n+1}_{st} \propto \frac{T^n_{st}}{\int T^n_{st}\, dx_t \int T^n_{st}\, dx_s}\, T^{n+1}_s\, T^{n+1}_t \qquad (4)$$

Notice that, unlike the original form of LBP, operations on graph neighborhoods $\prod_{u\in N(s)}$ do not differentiate between targets.

4 Neural implementation of TRP updates

4.1 Updating natural parameters

TRP's operation only requires updating pseudomarginals, in place, using local information. These are appealing properties for a candidate brain algorithm. This representation is also nicely compatible with the structure of PPCs: different projections of the neural activity encode the natural parameters of an exponential family distribution.
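A sketch of the updates (4) for a discrete model, where the integrals become sums; on a tree-structured graph the pseudomarginals converge to the exact marginals within a few iterations. The chain and its factors are illustrative:

```python
import numpy as np
from itertools import product

rng = np.random.default_rng(4)
# Binary chain 0 - 1 - 2 with illustrative random factors.
psi = [rng.random(2) + 0.1 for _ in range(3)]
psi_e = {(0, 1): rng.random((2, 2)) + 0.1, (1, 2): rng.random((2, 2)) + 0.1}
nbrs = {0: [1], 1: [0, 2], 2: [1]}

def edge_T(s, u, T_e):
    """Edge pseudomarginal as an (x_s, x_u) table regardless of storage order."""
    return T_e[(s, u)] if (s, u) in T_e else T_e[(u, s)].T

# Initialization: T0_s prop psi_s,  T0_st prop psi_s psi_t psi_st.
T = [psi[s] / psi[s].sum() for s in range(3)]
T_e = {}
for (s, t) in psi_e:
    M = np.outer(psi[s], psi[t]) * psi_e[(s, t)]
    T_e[(s, t)] = M / M.sum()

for n in range(20):
    # Node update: T_s <- T_s * prod_u [ sum_{x_u} T_su / T_s ].
    newT = []
    for s in range(3):
        Ts = T[s].copy()
        for u in nbrs[s]:
            Ts = Ts * edge_T(s, u, T_e).sum(axis=1) / T[s]
        newT.append(Ts / Ts.sum())
    # Edge update: T_st <- T_st / (marginal_s x marginal_t) * T_s' T_t'.
    for (s, t) in T_e:
        M = T_e[(s, t)]
        M = M / np.outer(M.sum(axis=1), M.sum(axis=0))
        M = M * np.outer(newT[s], newT[t])
        T_e[(s, t)] = M / M.sum()
    T = newT

# Compare against brute-force marginals of the joint.
p = np.zeros((2, 2, 2))
for x in product(range(2), repeat=3):
    p[x] = np.prod([psi[i][x[i]] for i in range(3)]) * \
           psi_e[(0, 1)][x[0], x[1]] * psi_e[(1, 2)][x[1], x[2]]
p /= p.sum()
exact = [p.sum(axis=(1, 2)), p.sum(axis=(0, 2)), p.sum(axis=(0, 1))]
```

Note that each node update uses all neighbors identically, the point emphasized in the text: no target-specific messages are needed.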
It is thus useful to express the pseudomarginals and the TRP inference algorithm using vectors of sufficient statistics φc(xc) and natural parameters θnc for each clique: $T^n_c(\mathbf{x}_c) = \exp(\boldsymbol{\theta}^n_c\cdot\boldsymbol{\phi}_c(\mathbf{x}_c))$. For a model with at most pairwise interactions, the TRP updates (4) can be expressed in terms of these natural parameters as

$$\theta^{n+1}_s = (1-d_s)\,\theta^n_s + \sum_{u\in N(s)} g_V(\theta^n_{su}), \qquad \theta^{n+1}_{st} = \theta^n_{st} + Q_s\theta^{n+1}_s + Q_t\theta^{n+1}_t + g_E(\theta^n_{st}) \qquad (5)$$

where ds is the number of neighbors of node s, the matrices Qs, Qt embed the node parameters into the space of the pairwise parameters, and gV and gE are nonlinear functions (for vertices V and edges E) that are determined by the particular graphical model. Since the natural parameters reflect log-probabilities, the product rule for probabilities becomes a linear sum in θ, while the sum rule for probabilities must be implemented by nonlinear operations g on θ.

In the concrete case of a Gaussian graphical model, the joint distribution is given by $p(\mathbf{x}) \propto \exp(-\tfrac{1}{2}\mathbf{x}^\top A\mathbf{x} + \mathbf{b}^\top\mathbf{x})$, where A and b are the natural parameters, and the linear and quadratic functions x and xx⊤ are the sufficient statistics. When we reparameterize this distribution by pseudomarginals, we again have linear and quadratic sufficient statistics: two for each node, $\boldsymbol{\phi}_s = (-\tfrac{1}{2}x_s^2,\ x_s)^\top$, and five for each edge, $\boldsymbol{\phi}_{st} = (-\tfrac{1}{2}x_s^2,\ -x_sx_t,\ -\tfrac{1}{2}x_t^2,\ x_s,\ x_t)^\top$. Each of these vectors of sufficient statistics has its own vector of natural parameters, θs and θst.

To approximate the marginal probabilities, the TRP algorithm initializes the pseudomarginals to $\boldsymbol{\theta}^0_s = (A_{ss},\ b_s)^\top$ and $\boldsymbol{\theta}^0_{st} = (A_{ss},\ A_{st},\ A_{tt},\ b_s,\ b_t)^\top$. To update θ, we must specify the nonlinear functions g that recover the univariate marginal distribution of a bivariate Gaussian Tst. For $T_{st}(x_s,x_t) \propto \exp(-\tfrac{1}{2}\theta_{1;st}x_s^2 - \theta_{2;st}x_sx_t - \tfrac{1}{2}\theta_{3;st}x_t^2 + \theta_{4;st}x_s + \theta_{5;st}x_t)$, this marginal is

$$T_s(x_s) = \int dx_t\, T_{st}(x_s, x_t) \propto \exp\!\left(-\frac{\theta_{1;st}\theta_{3;st}-\theta_{2;st}^2}{\theta_{3;st}}\,\frac{x_s^2}{2} + \frac{\theta_{4;st}\theta_{3;st}-\theta_{2;st}\theta_{5;st}}{\theta_{3;st}}\,x_s\right) \qquad (6)$$

Using this, we can now specify the embedding matrices and the nonlinear functions in the TRP updates (5):

$$Q_s = \begin{pmatrix} 1 & 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 1 & 0 \end{pmatrix}^{\!\top}, \qquad Q_t = \begin{pmatrix} 0 & 0 & 1 & 0 & 0 \\ 0 & 0 & 0 & 0 & 1 \end{pmatrix}^{\!\top}$$

$$g_V(\boldsymbol{\theta}^n_{su}) = \left(\frac{\theta^n_{1;su}\theta^n_{3;su}-(\theta^n_{2;su})^2}{\theta^n_{3;su}},\ \ \theta^n_{4;su}-\frac{\theta^n_{2;su}\theta^n_{5;su}}{\theta^n_{3;su}}\right)^{\!\top}$$

$$g_E(\boldsymbol{\theta}^n_{st}) = -\left(\frac{\theta^n_{1;st}\theta^n_{3;st}-(\theta^n_{2;st})^2}{\theta^n_{3;st}},\ \ 0,\ \ \frac{\theta^n_{1;st}\theta^n_{3;st}-(\theta^n_{2;st})^2}{\theta^n_{1;st}},\ \ \theta^n_{4;st}-\frac{\theta^n_{2;st}\theta^n_{5;st}}{\theta^n_{3;st}},\ \ \theta^n_{5;st}-\frac{\theta^n_{2;st}\theta^n_{4;st}}{\theta^n_{1;st}}\right)^{\!\top} \qquad (7)$$

where θi;st is the ith element of θst. Notice that these nonlinearities are all quadratic functions with a linear divisive normalization.

4.2 Separation of Time Scales for TRP Updates

An important feature of the TRP updates is that they circumvent the 'message exclusion' problem of LBP. The TRP update for the singleton terms, (4) and (5), includes contributions from all the neighbors of a given node. There is no free lunch, however, and the price is that the updates at time n + 1 depend on previous pseudomarginals at two different times, n and n + 1. The latter update therefore requires instantaneous information transmission, which is not biologically feasible.

To overcome this limitation, we observe that the brain can use fast and slow timescales τfast ≪ τslow instead of instant and delayed signals.
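The vertex nonlinearity for the Gaussian case can be checked against standard Gaussian marginalization; this sketch assumes the (θ1, ..., θ5) edge parameterization above with an illustrative choice of parameters:

```python
import numpy as np

def g_V(th):
    """Natural parameters of the marginal over x_s of a bivariate Gaussian
    T(x_s, x_t) prop exp(-t1 x_s^2/2 - t2 x_s x_t - t3 x_t^2/2 + t4 x_s + t5 x_t).
    A quadratic function of the parameters with a linear divisive normalization."""
    t1, t2, t3, t4, t5 = th
    return np.array([(t1 * t3 - t2**2) / t3, t4 - t2 * t5 / t3])

# Illustrative edge parameters with a positive-definite precision matrix.
th = np.array([2.0, 0.5, 1.5, 0.3, -0.7])

# Direct check via the joint's precision matrix A and linear term b.
A = np.array([[th[0], th[1]], [th[1], th[2]]])
b = np.array([th[3], th[4]])
Sigma = np.linalg.inv(A)
mu = Sigma @ b

prec_s, lin_s = g_V(th)
# Marginal of x_s computed two ways: variance Sigma[0,0], mean mu[0].
print(1 / prec_s, Sigma[0, 0])   # marginal variance
print(lin_s / prec_s, mu[0])     # marginal mean
```

Both routes give the same marginal, confirming that this divisive-normalization form is exactly Gaussian marginalization expressed in natural parameters.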
The fast timescale would most naturally correspond to the membrane time constant of the neurons, whereas the slow timescale would emerge from network interactions. We convert the update equations to continuous time, and introduce auxiliary variables θ̃ which are lowpass-filtered versions of θ on a slow timescale: $\tau_{\rm slow}\,\dot{\tilde\theta} = -\tilde\theta + \theta$. The nonlinear dynamics of (5) are then updated on a faster timescale τfast according to

$$\tau_{\rm fast}\,\dot\theta_s = -d_s\tilde\theta_s + \sum_{u\in N(s)} g_V(\tilde\theta_{su}), \qquad \tau_{\rm fast}\,\dot\theta_{st} = Q_s\theta_s + Q_t\theta_t + g_E(\tilde\theta_{st}) \qquad (8)$$

where the nonlinear terms g depend only on the slower, delayed activity θ̃. By concatenating these two sets of parameters, Θ = (θ, θ̃), we obtain a coupled multidimensional dynamical system which represents the approximation to the TRP iterations:

$$\dot\Theta = W\Theta + G(\Theta) \qquad (9)$$

Here the weight matrix W and the nonlinear function G inherit their structure from the discrete-time updates and the lowpass filtering at the fast and slow timescales.

4.3 Network Architecture

To complete our neural inference network, we now embed the nonlinear dynamics (9) into the population activity r. Since different projections of the neural activity in a linear PPC encode natural parameters of the underlying distribution, we map neural activity onto Θ by r = UΘ, where U is a rectangular Nr × NΘ embedding matrix that projects the natural parameters and their low-pass versions into the neural response space. These parameters can be decoded from the neural activity as Θ = U⁺r, where U⁺ is the pseudoinverse of U.

Applying this basis transformation to (9), we have ṙ = UΘ̇ = U(WΘ + G(Θ)) = UWU⁺r + UG(U⁺r).
We then obtain the general form of the updates for the neural activity

$$\dot{\mathbf{r}} = W_L\mathbf{r} + G_{NL}(\mathbf{r}) \qquad (10)$$

where $W_L\mathbf{r} = UWU^+\mathbf{r}$ and $G_{NL}(\mathbf{r}) = UG(U^+\mathbf{r})$ correspond to the linear and nonlinear computational components that integrate and marginalize evidence, respectively. The nonlinear function on r inherits the structure needed for the natural parameters, such as a quadratic polynomial with a divisive normalization used in low-dimensional Gaussian marginalization problems [4], but now expanded to high-dimensional graphical models. Figure 3 depicts the network architecture for the simple graphical model from Figure 2A, both when there are distinct neural subpopulations for each factor (Figure 3A), and when the variables are fully multiplexed across the entire neural population (Figure 3B). These simple, biologically-plausible neural dynamics (10) represent a powerful, nonlinear, fully-recurrent network of PPCs which implements the TRP update equations on an underlying graphical model.

Figure 3: Distributed, nonlinear, recurrent network of neurons that performs probabilistic inference on a graphical model. (A) This simple case uses distinct subpopulations of neurons to represent different factors in the example model in Figure 2A. (B) A cartoon shows how the same distribution can be represented as distinct projections of the distributed neural activity, instead of as distinct populations. In both cases, since the neural activities encode log-probabilities, linear connections are responsible for integrating evidence while nonlinear connections perform marginalization.

5 Experiments

We evaluate the performance of our neural network on a set of small Gaussian graphical models with up to 400 interacting variables.
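As a toy illustration of this fast-slow scheme (not the paper's full network), the sketch below turns a scalar fixed-point iteration θ ← f(θ) into coupled fast and slow dynamics as in (8) and integrates them with Euler steps; f and the constants are illustrative:

```python
import numpy as np

# Toy two-timescale system (illustrative, not the paper's network):
#   tau_fast * dtheta/dt       = -theta + f(theta_tilde)
#   tau_slow * dtheta_tilde/dt = -theta_tilde + theta
# For tau_slow >> tau_fast, theta settles at the fixed point of f,
# with the lowpass variable theta_tilde playing the role of the
# delayed iterate in the discrete update.
f = np.cos                       # illustrative update; fixed point ~ 0.7391
tau_fast, tau_slow = 1.0, 20.0   # same 20:1 ratio used in the experiments
dt = 0.01
theta, theta_tilde = 0.0, 0.0
for _ in range(200_000):
    dtheta = (-theta + f(theta_tilde)) / tau_fast
    dtilde = (-theta_tilde + theta) / tau_slow
    theta += dt * dtheta
    theta_tilde += dt * dtilde
print(theta)  # ≈ 0.7391, the fixed point of cos
```

The fast variable converges to the same answer the discrete iteration would reach, without any instantaneous transmission: the slow lowpass copy supplies the "previous" value.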
The network's time constants were set to have a ratio of τslow/τfast = 20. Figure 4A shows the neural population dynamics as the network performs inference, along with the temporal evolution of the corresponding node and pairwise means and covariances. The neural activity exhibits a complicated timecourse, and reflects a combination of many natural parameters changing simultaneously during inference. This type of behavior is seen in neural activity recorded from behaving animals [23, 24, 25]. Figure 4C shows how the performance of the network improves with the ratio of time-scales, γ = τslow/τfast. The performance is quantified by the mean squared error in the inferred parameters for a given γ divided by the error for a reference γ0 = 10.

Figure 4: Dynamics of neural population activity (A) and the expectation parameters (means, covariances) of the posterior distribution that the population encodes (B) for one trial of the tree model in Figure 2A. (C) Multiple simulations show that relative error decreases as a function of the ratio of fast and slow timescales γ.

Figure 5 shows that our recurrent neural network accurately infers the marginal probabilities, and reaches almost the same conclusions as loopy belief propagation. The data points are obtained from multiple simulations with different graph topologies, including graphs with many loops.
Figure 6 verifies that the network is robust to noise even when there are few neurons per inferred parameter; adding more neurons improves performance since the noise can be averaged away.

Figure 5: Inference performance of our neural network (blue) and standard loopy belief propagation (red) for a variety of graph topologies: chains, single loops, square grids up to 20 × 20 and densely connected graphs with up to 25 variables. The expectation parameters (means, covariances) of the pseudomarginals closely match the corresponding parameters for the true marginals.

Figure 6: Network performance is robust to noise, and improves with more neurons. (A) Neural activity performing inference on a 5 × 5 square grid, in the presence of independent spatiotemporal Gaussian noise of standard deviation 0.1 times the standard deviation of each signal. (B) Expectation parameters (means, variances) of the node pseudomarginals closely match the corresponding parameters for the true marginals, despite the noise. Results are shown for one or five neurons per parameter in the graphical model, and for no noise (i.e. infinitely many neurons).

6 Conclusion

We have shown how a biologically-plausible nonlinear recurrent network of neurons can represent a multivariate probability distribution using population codes, and can perform inference by reparameterizing the joint distribution to obtain approximate marginal probabilities.

Our network model has desirable properties beyond those lauded features of belief propagation. First, it allows for a thoroughly distributed population code, with many neurons encoding each variable and many variables encoded by each neuron.
This is consistent with neural recordings in which many task-relevant features are multiplexed across a neural population [23, 24, 25], as well as with models where information is embedded in a higher-dimensional state space [26, 27].

Second, the network performs inference in place, without using a distinct neural representation for messages, and avoids the biological implausibility associated with sending different messages about every variable to different targets. This virtue comes from exchanging multiple messages for multiple timescales. It is noteworthy that allowing two timescales prevents overcounting of evidence on loops of length two (target to source to target). This suggests a novel role of memory in static inference problems: a longer memory could be used to discount past information sent at more distant times, thus avoiding the overcounting of evidence that arises from loops of length three and greater. It may therefore be possible to develop reparameterization algorithms with all the convenient properties of LBP but with improved performance on loopy graphs.

Previous results show that the quadratic nonlinearity with divisive normalization is convenient and biologically plausible, but this precise form is not necessary: other pointwise neuronal nonlinearities can also produce high-quality marginalizations in PPCs [22]. In a distributed code, the precise nonlinear form at the neuronal scale is not important as long as the effect on the parameters is the same.

More generally, however, different nonlinear computations on the parameters implement different approximate inference algorithms. The distinct behaviors of such algorithms as variational inference [28], generalized belief propagation, and others arise from differences in their nonlinear transformations.
Even Gibbs sampling can be described as a noisy nonlinear message-passing algorithm. Although LBP and its generalizations have strong appeal, we doubt the brain will use this algorithm exactly. The real nonlinear functions in the brain may implement even smarter algorithms.

To identify the brain's algorithm, it may be more revealing to measure how information is represented and transformed in a low-dimensional latent space embedded in the high-dimensional neural responses than to examine each neuronal nonlinearity in isolation. The present work is directed toward this challenge of understanding computation in this latent space. It provides a concrete example showing how distributed nonlinear computation can be distinct from localized neural computations. Learning this computation from data will be a key challenge for neuroscience. In future work we aim to recover the latent computations of our network from artificial neural recordings generated by the model. Successful model recovery would encourage us to apply these methods to large-scale neural recordings to uncover key properties of the brain's distributed nonlinear computations.

Author contributions

XP conceived the study. RR and XP derived the equations. RR implemented the computer simulations. RR and XP analyzed the results and wrote the paper.

Acknowledgments

XP and RR were supported in part by a grant from the McNair Foundation, NSF CAREER Award IOS-1552868, and by the Intelligence Advanced Research Projects Activity (IARPA) via Department of Interior/Interior Business Center (DoI/IBC) contract number D16PC00003.

The U.S. Government is authorized to reproduce and distribute reprints for Governmental purposes notwithstanding any copyright annotation thereon.
Disclaimer: The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of IARPA, DoI/IBC, or the U.S. Government.

References

[1] Knill DC, Richards W (1996) Perception as Bayesian inference. Cambridge University Press.

[2] Doya K (2007) Bayesian brain: Probabilistic approaches to neural coding. MIT Press.

[3] Pouget A, Beck JM, Ma WJ, Latham PE (2013) Probabilistic brains: knowns and unknowns. Nature Neuroscience 16: 1170–1178.

[4] Beck JM, Latham PE, Pouget A (2011) Marginalization in neural circuits with divisive normalization. The Journal of Neuroscience 31: 15310–15319.

[5] Ott T, Stoop R (2006) The neurodynamics of belief propagation on binary Markov random fields. In: Advances in Neural Information Processing Systems. pp. 1057–1064.

[6] Steimer A, Maass W, Douglas R (2009) Belief propagation in networks of spiking neurons. Neural Computation 21: 2502–2523.

[7] Litvak S, Ullman S (2009) Cortical circuitry implementing graphical models. Neural Computation 21: 3010–3056.

[8] George D, Hawkins J (2009) Towards a mathematical theory of cortical micro-circuits. PLoS Computational Biology 5: e1000532.

[9] Grabska-Barwinska A, Beck J, Pouget A, Latham P (2013) Demixing odors: fast inference in olfaction. In: Advances in Neural Information Processing Systems. pp. 1968–1976.

[10] Ma WJ, Beck JM, Latham PE, Pouget A (2006) Bayesian inference with probabilistic population codes. Nature Neuroscience 9: 1432–1438.

[11] Pearl J (1988) Probabilistic reasoning in intelligent systems: networks of plausible inference.
Morgan Kaufmann.

[12] Yedidia JS, Freeman WT, Weiss Y (2003) Understanding belief propagation and its generalizations. Exploring Artificial Intelligence in the New Millennium 8: 236–239.

[13] Lee TS, Mumford D (2003) Hierarchical Bayesian inference in the visual cortex. JOSA A 20: 1434–1448.

[14] Rao RP (2004) Hierarchical Bayesian inference in networks of spiking neurons. In: Advances in Neural Information Processing Systems. pp. 1113–1120.

[15] Wainwright MJ, Jaakkola TS, Willsky AS (2003) Tree-based reparameterization framework for analysis of sum-product and related algorithms. IEEE Transactions on Information Theory 49: 1120–1146.

[16] Jazayeri M, Movshon JA (2006) Optimal representation of sensory information by neural populations. Nature Neuroscience 9: 690–696.

[17] Beck JM, Ma WJ, Kiani R, Hanks T, Churchland AK, Roitman J, Shadlen MN, et al. (2008) Probabilistic population codes for Bayesian decision making. Neuron 60: 1142–1152.

[18] Graf AB, Kohn A, Jazayeri M, Movshon JA (2011) Decoding the activity of neuronal populations in macaque primary visual cortex. Nature Neuroscience 14: 239–245.

[19] Heeger DJ (1992) Normalization of cell responses in cat striate cortex. Visual Neuroscience 9: 181–197.

[20] Carandini M, Heeger DJ (2012) Normalization as a canonical neural computation. Nature Reviews Neuroscience 13: 51–62.

[21] Rubin DB, Van Hooser SD, Miller KD (2015) The stabilized supralinear network: A unifying circuit motif underlying multi-input integration in sensory cortex. Neuron 85: 402–417.

[22] Vasudeva Raju R, Pitkow X (2015) Marginalization in random nonlinear neural networks. In: COSYNE.

[23] Hayden BY, Platt ML (2010) Neurons in anterior cingulate cortex multiplex information about reward and action.
The Journal of Neuroscience 30: 3339–3346.

[24] Rigotti M, Barak O, Warden MR, Wang XJ, Daw ND, Miller EK, Fusi S (2013) The importance of mixed selectivity in complex cognitive tasks. Nature 497: 585–590.

[25] Raposo D, Kaufman MT, Churchland AK (2014) A category-free neural population supports evolving demands during decision-making. Nature Neuroscience 17: 1784–1792.

[26] Savin C, Deneve S (2014) Spatio-temporal representations of uncertainty in spiking neural networks. In: Advances in Neural Information Processing Systems. pp. 2024–2032.

[27] Archer E, Park I, Buesing L, Cunningham J, Paninski L (2015) Black box variational inference for state space models. arXiv:1511.07367 [stat.ML].

[28] Beck J, Pouget A, Heller KA (2012) Complex inference in neural circuits with probabilistic population codes and topic models. In: Advances in Neural Information Processing Systems. pp. 3059–3067.