{"title": "Sensory Integration and Density Estimation", "book": "Advances in Neural Information Processing Systems", "page_first": 478, "page_last": 486, "abstract": "The integration of partially redundant information from multiple sensors is a standard computational problem for agents interacting with the world. In man and other primates, integration has been shown psychophysically to be nearly optimal in the sense of error minimization. An influential generalization of this notion of optimality is that populations of multisensory neurons should retain all the information from their unisensory afferents about the underlying, common stimulus [1]. More recently, it was shown empirically that a neural network trained to perform latent-variable density estimation, with the activities of the unisensory neurons as observed data, satisfies the information-preservation criterion, even though the model architecture was not designed to match the true generative process for the data [2]. We prove here an analytical connection between these seemingly different tasks, density estimation and sensory integration; that the former implies the latter for the model used in [2]; but that this does not appear to be true for all models.", "full_text": "Sensory Integration and Density Estimation\n\nJoseph G. Makin and Philip N. Sabes\n\nCenter for Integrative Neuroscience/Department of Physiology\n\nUniversity of California, San Francisco\n\nSan Francisco, CA 94143-0444 USA\nmakin, sabes @phy.ucsf.edu\n\nAbstract\n\nThe integration of partially redundant information from multiple sensors is a stan-\ndard computational problem for agents interacting with the world. In man and\nother primates, integration has been shown psychophysically to be nearly optimal\nin the sense of error minimization. 
An in\ufb02uential generalization of this notion\nof optimality is that populations of multisensory neurons should retain all the in-\nformation from their unisensory afferents about the underlying, common stimu-\nlus [1]. More recently, it was shown empirically that a neural network trained\nto perform latent-variable density estimation, with the activities of the unisensory\nneurons as observed data, satis\ufb01es the information-preservation criterion, even\nthough the model architecture was not designed to match the true generative pro-\ncess for the data [2]. We prove here an analytical connection between these seem-\ningly different tasks, density estimation and sensory integration; that the former\nimplies the latter for the model used in [2]; but that this does not appear to be true\nfor all models.\n\n1\n\nIntroduction\n\nA sensible criterion for integration of partially redundant information from multiple senses is that\nno information about the underlying cause be lost. That is, the multisensory representation should\ncontain all of the information about the stimulus as the unisensory representations together did. A\nvariant on this criterion was \ufb01rst proposed in [1]. When satis\ufb01ed, and given sensory cues that have\nbeen corrupted with Gaussian noise, the most probable multisensory estimate of the underlying\nstimulus property (height, location, etc.) will be a convex combination of the estimates derived inde-\npendently from the unisensory cues, with the weights determined by the variances of the corrupting\nnoise\u2014as observed psychophysically in monkey and man, e.g., [3, 4].\nThe task of plastic organisms placed in novel environments is to learn, from scratch, how to perform\nthis task. One recent proposal [2] is that primates treat the activities of the unisensory populations\nof neurons as observed data for a latent-variable density-estimation problem. 
Thus the activities\nof a population of multisensory neurons play the role of latent variables, and the model is trained\nto generate the same distribution of unisensory activities when they are driven by the multisensory\nneurons as when they are driven by their true causes in the world. The idea is that the latent variables\nin the model will therefore come to correspond (in some way) to the latent variables that truly\nunderlie the observed distribution of unisensory activities, including the structure of correlations\nacross populations. Then it is plausible to suppose that, for any particular value of the stimulus,\ninference to the latent variables of the model is \u201cas good as\u201d inference to the true latent cause,\nand that therefore the information criterion will be satis\ufb01ed. Makin et alia showed precisely this,\nempirically, using an exponential-family harmonium (a generalization of the restricted Boltzmann\nmachine [5]) as the density estimator [2].\nHere we prove analytically that successful density estimation in certain models, including that of [2],\nwill necessarily satisfy the information-retention criterion. In variant architectures, the guarantee\ndoes not hold.\n\n2 Theoretical background\n\n2.1 Multisensory integration and information retention\n\nPsychophysical studies have shown that, when presented with cues of varying reliability in two\ndifferent sense modalities but about a common stimulus property (e.g., location or height),\nprimates (including humans) estimate the property as a convex combination of the estimates derived\nindependently from the unisensory cues, where the weight on each estimate is proportional to its\nreliability [3, 4]. Cue reliability is measured as the inverse variance in performance across repeated\ninstances of the unisensory cue, and will itself vary with the amount of corrupting noise (e.g.,\nvisual blur) added to the cue. 
This integration rule is optimal in that it minimizes error variance across\ntrials, at least for Gaussian corrupting noise.\nAlternatively, it can be seen as a special case of a more general scheme [6]. Assuming a uniform\nprior distribution of stimuli, the optimal combination just described is equal to the peak of the\nposterior distribution over the stimulus, conditioned on the noisy cues (y1, y2):\n\nx\u2217 = argmax_x Pr[X = x|y1, y2].\n\nFor Gaussian corrupting noise, this posterior distribution will itself be Gaussian; but even for\nintegration problems that yield non-Gaussian posteriors, humans have been shown to estimate the\nstimulus with the peak of that posterior [7].\nThis can be seen as a consequence of a scheme more general still, namely, encoding not merely\nthe peak of the posterior, but the entire distribution [1, 8]. Suppose again, for simplicity, that\nPr[X|Y 1, Y 2] is Gaussian. Then if x\u2217 is itself to be combined with some third cue (y3),\noptimality requires keeping the variance of this posterior, as well as its mean, since it (along with the\nreliability of y3) determines the weight given to x\u2217 in this new combination. This scheme is\nespecially relevant when y1 and y2 are not \u201ccues\u201d but the activities of populations of neurons, e.g. visual\nand auditory, respectively. Since sensory information is more likely to be integrated in the brain in\na staged, hierarchical fashion than in a single common pool [9], optimality requires encoding at\nleast the \ufb01rst two cumulants of the posterior distribution. For more general, non-Gaussian\nposteriors, the entire posterior should be encoded [1, 6]. 
This amounts [1] to requiring, for downstream,\n\u201cmultisensory\u201d neurons with activities Z, that:\n\nPr[X|Z] = Pr[X|Y 1, Y 2].\n\nWhen information about X reaches Z only via Y = [Y 1, Y 2] (i.e., X \u2192 Y \u2192 Z forms a Markov\nchain), this is equivalent (see Appendix) to requiring that no information about the stimulus be lost\nin transforming the unisensory representations into a multisensory representation; that is,\n\nI(X; Z) = I(X; Y),\n\nwhere I(A; B) is the mutual information between A and B.\nOf course, if there is any noise in the transition from unisensory to multisensory neurons, this\nequation cannot be satis\ufb01ed exactly. A sensible modi\ufb01cation is to require that this noise be the only\nsource of information loss. This amounts to requiring that the information equality hold, not for Z,\nbut for any set of suf\ufb01cient statistics for Z as a function of Y, Tz (Y); that is,\n\nI(X; Tz (Y)) = I(X; Y).    (1)\n\n2.2 Information retention and density estimation\n\nFigure 1: Probabilistic graphical models. (A) The world\u2019s generative process. (B) The model\u2019s generative\nprocess. Observed nodes are shaded. After training the model (q), the marginals match: p(y) = q(y).\n\nA rather general statement of the role of neural sensory processing, sometimes credited to\nHelmholtz, is to make inferences about states of affairs in the world, given only the data supplied\nby the sense organs. Inference is hard because the mapping from the world\u2019s states to sense data is\nnot invertible, due both to noise and to the non-injectivity of physical processes (as in occlusion). A\npowerful approach to this problem used in machine learning, and arguably by the brain [10, 11], is\nto build a generative model for the data (Y), including the in\ufb02uence of unobserved (latent) variables\n(Z). 
The latent variables at the top of a hierarchy of such models would presumably be proxies for\nthe true causes, states of affairs in the world (X).\nIn density estimation, however, the objective function for learning the parameters of the model is\nthat:\n\n\u222b_x p(y|x)p(x)dx = \u222b_z q(y|z)q(z)dz    (2)\n\n(Fig. 1), i.e., that the \u201cdata distribution\u201d of Y match the \u201cmodel distribution\u201d of Y; and this is\nconsistent with models that throw away information about the world in the transformation from\nobserved to latent variables, or even to their suf\ufb01cient statistics. For example, suppose that the\nworld\u2019s generative process looked like this:\nExample 2.1. The prior p(x) is the \ufb02ip of an unbiased coin; and the emission p(y|x) draws from\na standard normal distribution, takes the absolute value of the result, and then multiplies by \u22121 for\ntails and +1 for heads. Information about the state of X is therefore perfectly represented in Y. But\na trained density-estimation model with, say, a Gaussian emission model, q(y|z), would not bother\nto encode any information in Z, since the emission model alone can represent all the data (which\njust look like samples from a standard normal distribution). Thus Y and Z would be independent,\nand Eq. 1 would not be satis\ufb01ed, even though Eq. 2 would.\n\nThis case is arguably pathological, but similar considerations apply for more subtle variants. In\naddition to Eq. 2, then, we shall assume something more: namely, that the \u201cnoise models\u201d for the\nworld and model match; i.e., that q(y|z) has the same functional form as p(y|x). 
More precisely,\nwe assume:\n\n\u2203 functions f (y; \u03bb), \u03c6(x), \u03c8(z) such that\n\np(y|x) = f (y; \u03c6(x)),\nq(y|z) = f (y; \u03c8(z)).    (3)\n\nIn [2], for example, f (y; \u03bb) was assumed to be a product of Poisson distributions, so the \u201cproximate\ncauses\u201d \u039b were a vector of means. Note that the functions \u03c6(x) and \u03c8(z) induce distributions over\n\u039b which we shall call p(\u03bb) and q(\u03bb), respectively; and that:\n\nE_{p(\u03bb)}[f (y; \u03bb)] = E_{p(x)}[f (y; \u03c6(x))] = E_{q(z)}[f (y; \u03c8(z))] = E_{q(\u03bb)}[f (y; \u03bb)],    (4)\n\nwhere the \ufb01rst and last equalities follow from the \u201claw of the unconscious statistician,\u201d and the\nsecond follows from Eqs. 2 and 3.\n\n3 Latent-variable density estimation for multisensory integration\n\nIn its most general form, the aim is to show that Eq. 4 implies, perhaps with some other constraints,\nEq. 1. More concretely, suppose the random variables Y 1, Y 2, provided by sense modalities 1 and\n2, correspond to noisy observations of an underlying stimulus. These could be noisy cues, but they\ncould also be the activities of populations of neurons (visual and proprioceptive, say, for\nconcreteness). Then suppose a latent-variable density estimator is trained on these data, until it assigns the\nsame probability, q(y1, y2), to realizations of the observations, [y1, y2], as that with which they\nappear, p(y1, y2). Then we should like to know that inference to the latent variables in the model,\ni.e., computation of the suf\ufb01cient statistics Tz(Y 1, Y 2), throws away no information about the\nstimulus. In [2], where this was shown empirically, the density estimator was a neural network, and\nits latent variables were interpreted as the activities of downstream, multisensory neurons. 
Thus the\ntransformation from unisensory to multisensory representation was shown, after training, to obey\nthis information-retention criterion.\nIt might seem that we have already assembled suf\ufb01cient conditions.\nIn particular, knowing that\nthe \u201cnoise models match,\u201d Eq. 3, might seem to guarantee that the data distribution and model\ndistribution have the same suf\ufb01cient statistics, since suf\ufb01cient statistics depend only on the form of\nthe conditional distribution. Then Tz (Y) would be suf\ufb01cient for X as well as for Z, and the proof\ncomplete. But this sense of \u201cform of the conditional distribution\u201d is stronger than Eq. 4. If, for\nexample, the image of z under \u03c8(\u00b7) is lower-dimensional than the image of x under \u03c6(\u00b7), then the\nconditionals in Eq. 3 will have different forms as far as their suf\ufb01cient statistics go. An example will\nclarify the point.\nExample 3.1. Let p(y) be a two-component mixture of a (univariate) Bernoulli distribution. In\nparticular, let \u03c6(x) and \u03c8(z) be the identity maps, \u039b provide the mean of the Bernoulli, and p(X =\n0.4) = 1/2, p(X = 0.6) = 1/2. The mixture marginal is therefore another Bernoulli random\nvariable, with equal probability of being 1 or 0. Now consider the \u201cmixture\u201d model q that has the\nsame noise model, i.e., a univariate Bernoulli distribution, but a prior with all its mass at a single\nmixing weight. If q(Z = 0.5) = 1, this model will satisfy Eq. 4. 
But a (minimal) suf\ufb01cient statistic\nfor the latent variables under p is simply the single sample, y, whereas the minimal suf\ufb01cient statistic\nfor the latent variable under q is the nullset: the observation tells us nothing about Z because it is\nalways the same value.\n\nTo rule out such cases, we propose (below) further constraints.\n\n3.1 Proof strategy\n\nWe start by noting that any suf\ufb01cient statistics Tz(Y) for Z are also suf\ufb01cient statistics for any\nfunction of Z, since all the information about the output of that function must pass through Z \ufb01rst\n(Fig. 2A). In particular, then, Tz(Y) are suf\ufb01cient statistics for the proximate causes, \u039b = \u03c8(Z).\nThat is, for any \u03bb generated by the model, Fig. 1B, tz (y) for the corresponding y drawn from\nf (y; \u03bb) are suf\ufb01cient statistics. What about the \u03bb generated by the world, Fig. 1A? We should like\nto show that tz(y) are suf\ufb01cient for them as well. This will be the case if, for every \u03bb produced by\nthe world, there exists a vector z such that \u03c8(z) = \u03bb.\nThis minimal condition is hard to prove. Instead we might show a slightly stronger condition, that\n(q(\u03bb) = 0) =\u21d2 (p(\u03bb) = 0), i.e., to any \u03bb that can be generated by the world, the model\nassigns nonzero probability. This is suf\ufb01cient (although unnecessary) for the existence of a vector\nz for every \u03bb produced by the world. Or we might pursue a stronger condition still, that to any\n\u03bb that can be generated by the world, the model and data assign the same probability: q(\u03bb) =\np(\u03bb). If one considers the marginals p(y) = q(y) to be mixture models, then this last condition\nis called the \u201cidenti\ufb01ability\u201d of the mixture [12], and for many conditional distributions f (y; \u03bb),\nidenti\ufb01ability conditions have been proven. 
Note that mixture identi\ufb01ability is taken to be a property\nof the conditional distribution, f (y; \u03bb), not the marginal, p(y); so, e.g., without further restriction,\na mixture model is not identi\ufb01able even if there exist just two prior distributions, p1(\u03bb), p2(\u03bb), that\nproduce identical marginal distributions.\nTo see that identi\ufb01ability, although suf\ufb01cient (see below), is unnecessary, consider again the two-\ncomponent mixture of a (univariate) Bernoulli distribution:\nExample 3.2. Let p(X = 0.4) = 1/2, p(X = 0.6) = 1/2. If the model, q(y|z)q(z), has the\nsame form, but mixing weights q(Z = 0.3) = 1/2, q(Z = 0.7) = 1/2, its mixture marginal will\nmatch the data distribution; yet p(\u03bb) \u2260 q(\u03bb), so the model is clearly unidenti\ufb01able. Nevertheless,\nthe sample itself, y, is a (minimal) suf\ufb01cient statistic for both the model and the data distribution, so\nthe information-retention criterion will be satis\ufb01ed.\n\nIn what follows we shall assume that the mixtures are \ufb01nite. This is the case when the model is an\nexponential-family harmonium (EFH)1, as in [2]: there are at most K := 2^{|hiddens|} mixture\ncomponents. It is not true for real-valued stimuli X. For simplicity, we shall nevertheless assume that there\nare at most 2^{|hiddens|} con\ufb01gurations of X since: (1) the stimulus must be discretized immediately\nupon transduction by the nervous system, the brain (presumably) having only \ufb01nite representational\ncapacity; and (2) if there were an in\ufb01nite number of con\ufb01gurations, Eq. 2 could not be satis\ufb01ed\nexactly anyway. Eq. 4 can therefore be expressed as:\n\n\u2211_{i}^{I} f (y; \u03bb_i) p(\u03bb_i) = \u2211_{j}^{J} f (y; \u03bb_j) q(\u03bb_j),    (5)\n\nwhere I \u2264 K, J \u2264 K.\n\n1An EFH is a two layer Markov random \ufb01eld, with full interlayer connectivity and no intralayer connectivity,\nand in which the conditional distributions of the visible layer given the hiddens and vice versa belong to\nexponential families of probability distributions [5]. The restricted Boltzmann machine is therefore the special\ncase of Bernoulli hiddens and Bernoulli visibles.\n\nFigure 2: Venn diagrams for information. (A) \u03c8(Z) being a deterministic function of Z, its entropy (dark\nturquoise) is a subset of the latter\u2019s (turquoise). The same is true for the entropies of Tz (Y) (dark orange) and\nY (orange), but additionally their intersections with H[Z] are identical because Tz is a suf\ufb01cient statistic for\nZ. The mutual information values I(\u03c8(Z); Y) and I(\u03c8(Z); Tz (Y)) (i.e., the intersections of the entropies)\nare clearly identical (outlined patch). This corresponds to the derivation of Eq. 6. (B) When \u03c8(Z) is a suf\ufb01cient\nstatistic for Y, as guaranteed by Eq. 3, the intersection of its entropy with H[Y] is the same as the intersection\nof H[Z] with H[Y]; likewise for H[\u03c6(X)] and H[X] with H[Y]. Since all information about X reaches Z\nvia Y, the entropies of X and Z overlap only on H[Y]. Finally, if p(\u03c6(x)) = q(\u03c8(z)), and Pr[Y|\u03c6(X)] =\nPr[Y|\u03c8(Z)] (Eq. 3), then the entropies of \u03c6(X) and \u03c8(Z) have the same-sized overlaps (but not the same\noverlaps) with H[Y] and H[Tz (Y)]. This guarantees that I(X; Tz (Y)) = I(X; Y) (see Eq. 7).\n\n3.2 Formal description of the model, assumptions, and result\n\n\u2022 The general probabilistic model. This is given by the graphical models in Fig. 1. \u201cThe\nworld\u201d generates data according to Fig. 1A (\u201cdata distribution\u201d), and \u201cthe brain\u201d uses Fig.\n1B. None of the distributions labeled in the diagram need be equal to each other, and in fact\nX and Z may have different support.\n\n\u2022 The assumptions.\n\n1. The noise models \u201cmatch\u201d: Eq. 3.\n2. The number of hidden-variable states is \ufb01nite, but otherwise arbitrarily large.\n3. Density estimation has been successful; i.e., the data and model marginals over Y\nmatch: Eq. 2.\n4. The noise model/conditional distribution f (y; \u03bb) is identi\ufb01able: if p(y) = q(y), then\np(\u03bb) = q(\u03bb). This condition holds for a very broad class of distributions.\n\n\u2022 The main result. Information about the stimulus is retained in inferring the latent variables\nof the model, i.e. in the \u201cfeedforward\u201d (Y \u2192 Z) pass of the model. More precisely,\ninformation loss is due only to noise in the hidden layer (which is unavoidable), not to a\nfailure of the inference procedure: Eq. 1.\n\nMore succinctly: for identi\ufb01able mixture models, Eq. 5 and Eq. 3 together imply Eq. 1.\n\n3.3 Proof\n\nFirst, for any set of suf\ufb01cient statistics Tz (Y) for Z:\n\nI(Y; \u03c8(Z)|Tz(Y)) \u2264 I(Y; Z|Tz(Y))    [data-proc. inequality [13]]\n= 0    [Tz(Y) are suf\ufb01cient for Z]\n\u21d2 0 = I(Y; \u03c8(Z)|Tz(Y))    [Gibbs\u2019s inequality]\n= H[\u03c8(Z)|Tz(Y)] \u2212 H[\u03c8(Z)|Y, Tz(Y)]    [def\u2019n cond. mutual info.]\n= H[\u03c8(Z)|Tz(Y)] \u2212 H[\u03c8(Z)|Y] \u2212 H[\u03c8(Z)] + H[\u03c8(Z)]    [Tz(Y) deterministic; the added terms sum to 0]\n\u21d2 I(\u03c8(Z); Tz(Y)) = I(\u03c8(Z); Y).    [def\u2019n mutual info.]    (6)\n\nSo Tz are suf\ufb01cient statistics for \u03c8(Z).\nNow if \ufb01nite mixtures of f (y; \u03bb) are identi\ufb01able, then Eq. 5 implies that p(\u03bb) = q(\u03bb). Therefore:\n\nI(X; Tz(Y)) \u2264 I(X; Y)    [data-processing inequality]\n\u2264 I(\u03c6(X); Y)    [X \u2192 \u03c6(X) \u2192 Y, D.P.I.]\n= I(\u03c8(Z); Y)    [p(\u03bb) = q(\u03bb), Eq. 3]\n= I(\u03c8(Z); Tz(Y))    [Eq. 6]\n= I(\u03c6(X); Tz(Y))    [p(\u03bb) = q(\u03bb), Eq. 3]\n\u2264 I(X; Tz(Y))    [data-processing inequality]\n\u21d2 I(X; Tz(Y)) = I(X; Y),    (7)\n\nwhich is what we set out to prove. 
(This last progression is illustrated in Fig. 2B.)\n\n4 Relationship to empirical \ufb01ndings\n\nThe use of density-estimation algorithms for multisensory integration appears in [2, 15, 16], and in\n[2], the connection between generic latent-variable density estimation and multisensory integration\nwas made, although the result was shown only empirically. We therefore relate those results to the\nforegoing proof.\n\n4.1 A density estimator for multisensory integration\n\nIn [2], an exponential-family harmonium (model distribution, q, Fig. 3B) with Poisson visible units\n(Y) and Bernoulli hidden units (Z) was trained on simulated populations of neurons encoding\narm con\ufb01guration in two-dimensional space (Fig. 3). An EFH is parameterized by the matrix of\nconnection strengths between units (weights, W ) and the unit biases, bi. The nonlinearities for\nBernoulli and Poisson units are logistic and exponential, respectively, corresponding to their inverse\n\u201ccanonical links\u201d [17].\nThe data for these populations were created by (data distribution, p, Fig. 3A):\n\n1. drawing a pair of joint angles (\u03b8 1 = shoulder, \u03b8 2 = elbow) from a uniform distribution\nbetween the joint limits; drawing two population gains (g p, g v, the \u201creliabilities\u201d of the two\npopulations) from uniform distributions over spike counts\u2014hence x = [\u03b8 1, \u03b8 2, g p, g v];\n\n2. encoding the joint angles in a set of 2D, Gaussian tuning curves (with maximum height g p)\nthat smoothly tile joint space (\u201cproprioceptive neurons,\u201d \u03bbp), and encoding the corresponding\nend-effector position in a set of 2D, Gaussian tuning curves (with maximum height g v)\nthat smoothly tile the reachable workspace (\u201cvisual neurons,\u201d \u03bbv);\n\n3. 
drawing spike counts, [yp, yv], from independent Poisson distributions whose means were\ngiven by [\u03bbp, \u03bbv].\n\nFigure 3: Two probabilistic graphical models for the same data\u2014a speci\ufb01c instance of Fig. 1. Colors are as in\nFig. 2. Here the orange nodes are observed. (A) Hand position (\u0398) elicits a response from populations of\nvisual (Yv) and proprioceptive (Yp) neurons. The reliability of each population\u2019s encoding of hand position\nvaries with their respective gains, Gv, Gp. (B) The exponential family harmonium (EFH; see text). After\ntraining, an up-pass through the model yields a vector of multisensory (mean) activities (z, hidden units) that\ncontains all the information about \u03b8, g v, and g p that was in the unisensory populations, Yv and Yp.\n\nThus the distribution of the unisensory spike counts, Y = [Yp, Yv], conditioned on hand position,\np(y|x) = \u220f_i p(y_i|x), was a \u201cprobabilistic population code,\u201d a biologically plausible proposal for\nhow the cortex encodes probability distributions over stimuli [1]. The model was trained using one-\nstep contrastive divergence, a learning procedure that changes weights and biases by descending the\napproximate gradient of a function that has q(y) = p(y) as its minimum [18, 19].\nIt was then shown empirically that the criterion for \u201coptimal multisensory integration\u201d proposed\nin [1],\n\nPr[X|\u00afZ] = Pr[X|yp, yv],    (8)\n\nheld approximately for an average, \u00afZ, of vectors sampled from q(z|y), and that the match improves\nas the number of samples grows\u2014i.e., as the sample average \u00afZ approaches the expected value\nEq(z|y)[Z|y]. 
Since the weight matrix W was \u201cfat,\u201d the randomly initialized network was highly\nunlikely to satisfy Eq. 8 by chance.\n\n4.2 Formulating the empirical result in terms of the proof of Section 3\n\nTo show that Eq. 8 must hold, we \ufb01rst demonstrate its equivalence to Eq. 1. It then suf\ufb01ces, under\nour proof, to show that the model obeys Eqs. 3 and 5 and that the \u201cmixture model\u201d de\ufb01ned by the\ntrue generative process is identi\ufb01able.\nFor suf\ufb01ciently many samples, \u00afZ \u2248 Eq(z|y)[Z|Y], which is a suf\ufb01cient statistic for a vector of\nBernoulli random variables: Eq(z|y)[Z|Y] = Tz(Y). This also corresponds to a noiseless \u201cup-\npass\u201d through the model, Tz(Y) = \u03c3{W Y + bz}2. And the information about the stimulus\nreaches the multisensory population, Z, only via the two unisensory populations, Y. Together these\nimply that Eq. 8 is equivalent to Eq. 1 (see Appendix for proof).\nFor both the \u201cworld\u201d and the model, the function f (y; \u03bb) is a product of independent Poissons,\nwhose means \u039b are given respectively by the embedding of hand position into the tuning curves\nof the two populations, \u03c6(X), and by the noiseless \u201cdown-pass\u201d through the model, exp{W^T Z +\nby} =: \u03c8(Z). So Eq. 3 is satis\ufb01ed. Eq. 5 holds because the EFH was trained as a density estimator,\nand because the mixture may be treated as \ufb01nite. (Although hand positions were drawn from a\ncontinuous uniform distribution, the number of mixing components in the data distribution is limited\nto the number of training samples. For the model in [2], this was less than a million. For comparison,\nthe harmonium had 2^900 mixture weights at its disposal.) Finally, the noise model is factorial:\nf (y; \u03bb) = \u220f_i f (y_i; \u03bb_i). The class of mixtures of factorial distributions, f (y; \u03bb), is identi\ufb01able just\nin case the class of mixtures of f (y_i; \u03bb_i) is identi\ufb01able [14]; and mixtures of (univariate) Poisson\nconditionals are themselves identi\ufb01able [12]. Thus the mixture used in [2] is indeed identi\ufb01able.\n\n2That the vector of means alone and not higher-order cumulants suf\ufb01ces re\ufb02ects the fact that the suf\ufb01cient\nstatistics can be written as linear functions of Y\u2014in this case, W Y, with W the weight matrix\u2014which is\narguably a generically desirable property for neurons [20].\n\n5 Conclusions\n\nWe have traced an analytical connection from psychophysical results in monkey and man to a broad\nclass of machine-learning algorithms, namely, density estimation in latent-variable models. In\nparticular, behavioral studies of multisensory integration have shown that primates estimate stimulus\nproperties with the peak of the posterior distribution over the stimulus, conditioned on the two\nunisensory cues [3, 4]. This can be seen as a special case of a more general \u201coptimal\u201d\ncomputation, viz., computing and representing the entire posterior distribution [1, 6]; or, put differently,\n\ufb01nding transformations of multiple unisensory representations into a multisensory representation\nthat retains all the original information about the underlying stimulus. It has been shown that this\ncomputation can be learned with algorithms that implement forms of latent-variable density\nestimation [15, 16]; and, indeed, argued that generic latent-variable density estimators will satisfy the\ninformation-retention criterion [2]. We have provided an analytical proof that this is the case, at least\nfor certain classes of models (including the ones in [2]).\nWhat about distributions f (y; \u03bb) other than products of Poissons? 
Identi\ufb01ability results, which\nwe have relied on here, appear to be the norm for \ufb01nite mixtures; [12] summarizes the \u201coverall\npicture\u201d thus: \u201c[A]part from special cases with \ufb01nite samples spaces [like binomials] or very special\nsimple density functions [like the continuous uniform distribution], identi\ufb01ability of classes of \ufb01nite\nmixtures is generally assured.\u201d Thus the results apply to a broad set of density-estimation models\nand their equivalent neural networks.\nInterestingly, this excludes Bernoulli random variables, and therefore the mixture model de\ufb01ned by\nrestricted Boltzmann machines (RBMs). Such mixtures are not strictly identi\ufb01able [12], meaning\nthere is more than one set of mixture weights that will produce the observed marginal distribution.\nHence the guarantee proved in Section 3 does not hold. On the other hand, the proof provides only\nsuf\ufb01cient, not necessary conditions, so some guarantee of information retention is not ruled out.\nAnd indeed, a relaxation of the identi\ufb01ability criterion to exclude sets of measure zero has recently\nbeen shown to apply to certain classes of mixtures of Bernoullis [21].\nThe information-retention criterion applies more broadly than multisensory integration; it is gen-\nerally desirable. It is not, presumably, suf\ufb01cient: the task of the cortex is not merely to pass in-\nformation on unmolested from one point to another. On the other hand, the task of integrating data\nfrom multiple sources without losing information about the underlying cause of those data has broad\napplication: it applies, for example, to the data provided by spatially distant photoreceptors that are\nreporting the edge of a single underlying object. 
Whether the criterion can be satis\ufb01ed in this and\nother cases depends both on the brain\u2019s generative model and on the true generative process by\nwhich the stimulus is encoded in neurons.\nThe proof was derived for suf\ufb01cient statistics rather than the neural responses themselves, but this\nlimitation can be overcome at the cost of time (by collecting or averaging repeated samples of neural\nresponses) or of space (by having a hidden vector long enough to contain most of the information\neven in the presence of noise).\nFinally, the result was derived for \u201ccompleted\u201d density estimation, q(y) = p(y). This is a strong\nlimitation; one would prefer to know how approximate completion of learning, q(y) \u2248 p(y), affects\nthe guarantee, i.e., how robust it is. In [2], for example, Eq. 2 was never directly veri\ufb01ed, and in\nfact one-step contrastive divergence (the training rule used) has suboptimal properties for building a\ngood generative model [22]. And although the suf\ufb01cient conditions supplied by the proof apply to a\nbroad class of models, it would also be useful to know necessary conditions.\n\nAcknowledgments\n\nJGM thanks Matthew Fellows, Maria Dadarlat, Clay Campaigne, and Ben Dichter for useful\nconversations.\n\nReferences\n[1] Wei Ji Ma, Jeffrey M. Beck, Peter E. Latham, and Alexandre Pouget. Bayesian inference with probabilistic\npopulation codes. Nature Neuroscience, 9:1423\u20131438, 2006.\n\n[2] Joseph G. Makin, Matthew R. Fellows, and Philip N. Sabes. Learning Multisensory Integration and\nCoordinate Transformation via Density Estimation. PLoS Computational Biology, 9(4):1\u201317, 2013.\n\n[3] Marc O. Ernst and Martin S. Banks. Humans integrate visual and haptic information in a statistically\noptimal fashion. Nature, 415(January):429\u2013433, 2002.\n\n[4] David Alais and David Burr. 
The ventriloquist effect results from near-optimal bimodal integration. Current Biology, 14(3):257\u201362, February 2004.\n\n[5] Max Welling, Michal Rosen-Zvi, and Geoffrey E. Hinton. Exponential Family Harmoniums with an Application to Information Retrieval. In Advances in Neural Information Processing Systems 17: Proceedings of the 2004 Conference, pages 1481\u20131488, 2005.\n\n[6] David C. Knill and Alexandre Pouget. The Bayesian brain: the role of uncertainty in neural coding and computation. Trends in Neurosciences, 27(12), 2004.\n\n[7] J.A. Saunders and David C. Knill. Perception of 3D surface orientation from skew symmetry. Vision Research, 41(24):3163\u201383, November 2001.\n\n[8] Robert J. van Beers, A.C. Sittig, and Jan J. Denier van Der Gon. Integration of proprioceptive and visual position-information: An experimentally supported model. Journal of Neurophysiology, 81:1355\u20131364, 1999.\n\n[9] Philip N. Sabes. Sensory integration for reaching: Models of optimality in the context of behavior and the underlying neural circuits. Progress in Brain Research, 191:195\u2013209, January 2011.\n\n[10] Bruno A. Olshausen. Sparse codes and spikes. In R.P.N. Rao, Bruno A. Olshausen, and Michael S. Lewicki, editors, Probabilistic Models of the Brain: Perception and Neural Function, chapter 13. MIT Press, 2002.\n\n[11] Anthony J. Bell. Towards a Cross-Level Theory of Neural Learning. AIP Conference Proceedings, 954:56\u201373, 2007.\n\n[12] D.M. Titterington, A.F.M. Smith, and U.E. Makov. Statistical Analysis of Finite Mixture Distributions. Wiley, 1985.\n\n[13] Thomas M. Cover and Joy A. Thomas. Elements of Information Theory. Wiley, 2006.\n\n[14] Henry Teicher. Identi\ufb01ability of Mixtures of Product Measures. The Annals of Mathematical Statistics, 38(4):1300\u20131302, 1967.\n\n[15] Ilker Yildirim and Robert A. Jacobs. 
A rational analysis of the acquisition of multisensory representations. Cognitive Science, 36(2):305\u201332, March 2012.\n\n[16] Jeffrey M. Beck, Katherine Heller, and Alexandre Pouget. Complex Inference in Neural Circuits with Probabilistic Population Codes and Topic Models. Advances in Neural Information Processing Systems 25: Proceedings of the 2012 Conference, pages 1\u20139, 2013.\n\n[17] Peter McCullagh and John A. Nelder. Generalized Linear Models. Chapman and Hall/CRC, second edition, 1989.\n\n[18] Geoffrey E. Hinton, Simon Osindero, and Yee Whye Teh. A fast learning algorithm for deep belief nets. Neural Computation, 18:1527\u20131554, 2006.\n\n[19] Geoffrey E. Hinton. Training Products of Experts by Minimizing Contrastive Divergence. Neural Computation, 14:1771\u20131800, 2002.\n\n[20] Jeffrey M. Beck, Vikranth R. Bejjanki, and Alexandre Pouget. Insights from a Simple Expression for Linear Fisher Information in a Recurrently Connected Population of Spiking Neurons. Neural Computation, 23(6):1484\u20131502, June 2011.\n\n[21] Elizabeth S. Allman, Catherine Matias, and John A. Rhodes. Identi\ufb01ability of parameters in latent structure models with many observed variables. The Annals of Statistics, 37(6A):3099\u20133132, December 2009.\n\n[22] Geoffrey E. Hinton. A Practical Guide to Training Restricted Boltzmann Machines. Technical report, University of Toronto, Toronto, 2010.\n", "award": [], "sourceid": 311, "authors": [{"given_name": "Joseph", "family_name": "Makin", "institution": "UCSF"}, {"given_name": "Philip", "family_name": "Sabes", "institution": "UCSF"}]}