{"title": "Compressive neural representation of sparse, high-dimensional probabilities", "book": "Advances in Neural Information Processing Systems", "page_first": 1349, "page_last": 1357, "abstract": "This paper shows how sparse, high-dimensional probability distributions could be represented by neurons with exponential compression. The representation is a novel application of compressive sensing to sparse probability distributions rather than to the usual sparse signals. The compressive measurements correspond to expected values of nonlinear functions of the probabilistically distributed variables. When these expected values are estimated by sampling, the quality of the compressed representation is limited only by the quality of sampling. Since the compression preserves the geometric structure of the space of sparse probability distributions, probabilistic computation can be performed in the compressed domain. Interestingly, functions satisfying the requirements of compressive sensing can be implemented as simple perceptrons. If we use perceptrons as a simple model of feedforward computation by neurons, these results show that the mean activity of a relatively small number of neurons can accurately represent a high-dimensional joint distribution implicitly, even without accounting for any noise correlations. This comprises a novel hypothesis for how neurons could encode probabilities in the brain.", "full_text": "Compressive neural representation of sparse,\n\nhigh-dimensional probabilities\n\nxaq pitkow\n\nDepartment of Brain and Cognitive Sciences\n\nUniversity of Rochester\nRochester, NY 14607\n\nxpitkow@bcs.rochester.edu\n\nAbstract\n\nThis paper shows how sparse, high-dimensional probability distributions could\nbe represented by neurons with exponential compression. The representation is a\nnovel application of compressive sensing to sparse probability distributions rather\nthan to the usual sparse signals. 
The compressive measurements correspond to\nexpected values of nonlinear functions of the probabilistically distributed vari-\nables. When these expected values are estimated by sampling, the quality of the\ncompressed representation is limited only by the quality of sampling. Since the\ncompression preserves the geometric structure of the space of sparse probability\ndistributions, probabilistic computation can be performed in the compressed do-\nmain. Interestingly, functions satisfying the requirements of compressive sensing\ncan be implemented as simple perceptrons.\nIf we use perceptrons as a simple\nmodel of feedforward computation by neurons, these results show that the mean\nactivity of a relatively small number of neurons can accurately represent a high-\ndimensional joint distribution implicitly, even without accounting for any noise\ncorrelations. This comprises a novel hypothesis for how neurons could encode\nprobabilities in the brain.\n\n1 Introduction\n\nBehavioral evidence shows that animal behaviors are often influenced not only by the content of\nsensory information but also by its uncertainty. Different theories have been proposed about how\nneuronal populations could represent this probabilistic information [1, 2]. Here we propose a new\ntheory of how neurons could represent probability distributions, based on the burgeoning field of\n\u2018compressive sensing.\u2019\nAn arbitrary probability distribution over multiple variables has a parameter count that is exponential\nin the number of variables. Representing these probabilities can therefore be prohibitively costly.\nOne common approach is to use graphical models to parameterize the distribution in terms of a\nsmaller number of interactions. Here I consider an alternative approach. In many cases of interest,\nonly a few unknown states have high probabilities while the rest have negligible ones; such a distribu-\ntion is called \u2018sparse\u2019. 
I will show that sufficiently sparse distributions can be described by a number\nof parameters that is merely linear in the number of variables.\nUntil recently, it was generally thought that encoding of sparse signals required dense sampling at a\nrate greater than or equal to signal bandwidth. However, recent findings prove that it is possible to\nfully characterize a signal at a rate limited not by its bandwidth but by its information content [3, 4,\n5, 6], which can be much smaller. Here I apply such compression to sparse probability distributions\nover binary variables, which are, after all, just signals with some particular properties.\nIn most applications of compressive sensing, the ultimate goal is to reconstruct the original signal\nefficiently. Here, we do not wish to reconstruct the signal at all. Instead, we use the guarantees that\nthe signal could be reconstructed to ensure that the signal is accurately represented by its compressed\nversion. Below, when we do reconstruct, it is only to show that our method actually works in practice.\nWe don\u2019t expect that the brain needs to explicitly reconstruct a probability distribution in some\ncanonical mathematical representation in order to gain the advantages of probabilistic reasoning.\nTraditional compressive sensing considers signals that live in an N-dimensional space but have\nonly S nonzero coordinates in some basis. We say that such a signal is S-sparse. If we were told\nthe locations of the nonzero entries, then we would need only S measurements to characterize their\ncoefficients and thus the entire signal. But even if we don\u2019t know where those entries are, it still\ntakes little more than S linear measurements to perfectly reconstruct the signal. Furthermore, those\nmeasurements can be fixed in advance without any knowledge of the structure of the signal. 
Under\ncertain conditions, these excellent properties can be guaranteed [3, 4, 5].\nThe basic mathematical setup of compressive sensing is as follows. Assume that an N-dimensional\nsignal s has S nonzero coefficients. We make M linear measurements y of this signal by applying\nthe M \u00d7 N matrix A:\n\n$$y = As \\qquad (1)$$\n\nWe would then like to recover the original signal s from these measurements. Under conditions on\nthe measurement matrix A described below, the original can be found perfectly by computing the\nvector with minimal $\\ell_1$ norm that reproduces the measurements,\n\n$$\\hat{s} = \\operatorname*{argmin}_{s'} \\|s'\\|_{\\ell_1} \\quad \\text{such that} \\quad As' = y = As \\qquad (2)$$\n\nThe $\\ell_1$ norm is usually used instead of $\\ell_0$ because (2) can be solved far more efficiently [3, 4, 5, 7].\nCompressive sensing is generally robust to two deviations from this ideal setup. First, target signals\nmay not be strictly S-sparse. However, they may be \u2018compressible\u2019 in the sense that they are well\napproximated by an S-sparse signal. Signals whose rank-ordered coefficients fall off at least as fast\nas $\\mathrm{rank}^{-1}$ satisfy this property [4]. Second, measurements may be corrupted by noise with bounded\namplitude $\\epsilon$. Under these conditions, the error of the $\\ell_1$-reconstructed signal $\\hat{s}$ is bounded by the\nerror of the best S-sparse approximation $s_S$ plus a term proportional to the measurement noise:\n\n$$\\|\\hat{s} - s\\|_{\\ell_2} \\le C_0 \\|s_S - s\\|_{\\ell_2} / \\sqrt{S} + C_1 \\epsilon \\qquad (3)$$\n\nfor some constants $C_0$ and $C_1$ [8].\nSeveral conditions on A have been used in compressive sensing to guarantee good performance\n[4, 6, 9, 10, 11]. 
Modulo various nuances, they all essentially ensure that most or all relevant sparse\nsignals lie sufficiently far from the null space of A: It would be impossible to recover signals in the\nnull space since their measurements are all zero and cannot therefore be distinguished. The most\ncommonly used condition is the Restricted Isometry Property (RIP), which says that A preserves $\\ell_2$\nnorms of all S-sparse vectors within a factor of $1 \\pm \\delta_S$ that depends on the sparsity,\n\n$$(1 - \\delta_S) \\|s\\|_{\\ell_2} \\le \\|As\\|_{\\ell_2} \\le (1 + \\delta_S) \\|s\\|_{\\ell_2} \\qquad (4)$$\n\nIf A satisfies the RIP with small enough $\\delta_S$, then $\\ell_1$ recovery is guaranteed to succeed. For random\nmatrices whose elements are independent and identically distributed Gaussian or Bernoulli variates,\nthe RIP holds as long as the number of measurements M satisfies\n\n$$M \\ge C S \\log(N/S) \\qquad (5)$$\n\nfor some constant C that depends on $\\delta_S$ [8]. No other recovery method, however intractable, can\nperform substantially better than this [8].\n\n2 Compressing sparse probability distributions\n\nCompressive sensing allows us to use far fewer resources to accurately represent high-dimensional\nobjects if they are sufficiently sparse. Even if we don\u2019t ultimately intend to reconstruct the signal, the\nreconstruction theorem described above (3) ensures that we have implicitly represented all the rel-\nevant information. This compression proves to be extremely useful when representing multivariate\njoint probability distributions, whose size is exponentially large even for the simplest binary states.\nConsider the signal to be a probability distribution over an n-dimensional binary vector $x \\in \\{-1, +1\\}^n$,\nwhich I will write sometimes as a function p(x) and sometimes as a vector p indexed\nby the binary state x. 
I assume p is sparse in the canonical basis of delta-functions on each state,\n$\\delta_{x,x'}$. The dimensionality of this signal is $N = 2^n$, which for even modest n can be so large it\ncannot be represented explicitly.\nThe measurement matrix A for probability vectors has size $M \\times 2^n$. Each row corresponds to a\ndifferent measurement, indexed by i. Each column corresponds to a different binary state x. This\ncolumn index x ranges over all possible binary vectors of length n, in some conventional sequence.\nFor example, if n = 3 then the column index would take the 8 values\n\nx \u2208 {\u2212\u2212\u2212 ; \u2212\u2212+ ; \u2212+\u2212 ; \u2212++ ; +\u2212\u2212 ; +\u2212+ ; ++\u2212 ; +++}\n\nEach element of the measurement matrix, $A_i(x)$, can be viewed as a function applied to the binary\nstate. When this matrix operates on a probability distribution p(x), the result y is a vector of M\nexpectation values of those functions, with elements\n\n$$y_i = A_i p = \\sum_x A_i(x) p(x) = \\langle A_i(x) \\rangle_{p(x)} \\qquad (6)$$\n\nFor example, if $A_i(x) = x_i$ then $y_i = \\langle x_i \\rangle_{p(x)}$ measures the mean of $x_i$ drawn from p(x).\nFor suitable measurement matrices A, we are guaranteed accurate reconstruction of S-sparse prob-\nability distributions as long as the number of measurements is\n\n$$M \\ge O(S \\log(N/S)) = O(Sn - S \\log S) \\qquad (7)$$\n\nThe exponential size of the probability vector, $N = 2^n$, is cancelled by the logarithm. For distri-\nbutions with a fixed sparseness S, the required number of measurements per variable, M/n, is then\nindependent of the number of variables (see footnote 1).\nIn many cases of interest it is impractical to calculate these expectation values directly: Recall that\nthe probabilities may be too expensive to represent explicitly in the first place. One remedy is to\ndraw T samples $x_t$ from the distribution p(x), and use a sum over these samples to approximate the\nexpectation values,\n\n$$y_i \\approx \\frac{1}{T} \\sum_t A_i(x_t), \\qquad x_t \\sim p(x) \\qquad (8)$$\n\nThe probability $\\hat{p}(x)$ estimated from T samples has errors with variance p(x)(1 \u2212 p(x))/T, which\nis bounded by 1/(4T). This allows us to use the performance limits from robust compressive sensing,\nwhich according to (3) creates an error in the reconstructed probabilities that is bounded by\n\n$$\\|\\hat{p} - p\\|_{\\ell_2} \\le C_0 \\|p_S - p\\|_{\\ell_2} + \\frac{C_1}{\\sqrt{T}} \\qquad (9)$$\n\nwhere $p_S$ is a vector with the top S probabilities preserved and the rest set to zero. Strictly speaking,\n(3) applies to bounded errors, whereas here we have a bounded variance but possibly large errors.\nTo ensure accurate reconstruction, we can choose the constant $C_1$ large enough that errors larger\nthan some threshold (say, 10 standard deviations) have a negligible probability.\n\n2.1 Measurements by random perceptrons\n\nIn compressive sensing it is common to use a matrix with independent Bernoulli-distributed random\nvalues, $A_i(x) \\sim B(\\tfrac{1}{2})$, which guarantees A satisfies the RIP [12]. Each row of this matrix represents\nall possible outputs of an arbitrarily complicated Boolean function of the n binary variables x.\nBiological neural networks would have great difficulty computing such arbitrary functions in a sim-\nple manner. However, neurons can easily compute a large class of simpler Boolean functions, the\nperceptrons. These are simple threshold functions of a weighted average of the input\n\n$$A_i(x) = \\operatorname{sgn}\\Big( \\sum_j W_{ij} x_j - \\theta_j \\Big) \\qquad (10)$$\n\nFootnote 1: Depending on the problem, the number of significant nonzero entries S may grow with the number of\nvariables. This growth may be fast (e.g. the number of possible patterns grows as $e^n$) or slow (e.g. 
the number\nof possible translations of a given pattern grows only as n).\n\nwhere W is an M \u00d7 n matrix. Here I take W to have elements drawn randomly from a standard\nnormal distribution, $W_{ij} \\sim N(0, 1)$, and call the resultant functions \u2018random perceptrons\u2019. An\nexample measurement matrix for random perceptrons is shown in Figure 1. These functions are\nreadily implemented by individual neurons, where $x_j$ is the instantaneous activity of neuron j,\n$W_{ij}$ is the synaptic weight between neurons i and j, and the sgn function approximates a spiking\nthreshold at $\\theta$.\n\nFigure 1: Example measurement matrix $A_i(x)$ for M = 100 random perceptrons applied to all $2^9$ possible\nbinary vectors of length n = 9.\n\nThe step nonlinearity sgn is not essential, but some type of nonlinearity is: Using a purely linear\nfunction of the states, A = W x, would result in measurements $y = Ap = W \\langle x \\rangle$. This provides\nat most n linearly independent measurements of p(x), even when M > n. In most cases this is\nnot enough to adequately capture the full distribution. Nonlinear $A_i(x)$ allow a greater number of\nlinearly independent measurements of p(x). Although the dimensionality of W is merely M \u00d7 n,\nwhich is much smaller than the $2^n$-dimensional space of probabilities, (10) can generate $O(2^{n^2})$\ndistinct perceptrons [13]. By including an appropriate threshold, a perceptron can assign any indi-\nvidual state x a positive response and assign a negative response to every other state. This shows\nthat random perceptrons generate the canonical basis and can thus span the space of possible p(x).\nIn what follows, I assume that $\\theta = 0$ for simplicity.\nIn the Appendix I prove that random perceptrons with zero threshold satisfy the requirements for\ncompressive sensing in the limit of large n. 
Present research is directed toward deriving the condition\nnumber of these measurement matrices for finite n, in order to provide rigorous bounds on the\nnumber of measurements required in practice. Below I present empirical evidence that even a small\nnumber of random perceptrons largely preserves the information about sparse distributions.\n\n3 Experiments\n\n3.1 Fidelity of compressed sparse distributions\n\nTo test random perceptrons in compressive sensing of probabilities, I generated sparse distributions\nusing small Boltzmann machines [14], and compressed them using random perceptrons driven by\nsamples from the Boltzmann machine. Performance was then judged by comparing $\\ell_1$ reconstruc-\ntions to the true distributions, which are exactly calculable for modest n.\nIn a Boltzmann machine, binary states x occur with probabilities given by the Boltzmann distribu-\ntion with energy function E(x),\n\n$$p(x) \\propto e^{-E(x)}, \\qquad E(x) = -b^\\top x - x^\\top J x \\qquad (11)$$\n\ndetermined by biases b and pairwise couplings J. Sampling from this distribution can be accom-\nplished by running Glauber dynamics [15], at each time step turning a unit on with probability\n$p(x_i = +1 \\mid x_{\\setminus i}) = 1/(1 + e^{-\\Delta E})$, where $\\Delta E = E(x_i = -1, x_{\\setminus i}) - E(x_i = +1, x_{\\setminus i})$. Here $x_{\\setminus i}$\nis the vector of all components of x except the ith.\nFor simulations I distinguished between two types of units, hidden and visible, x = (h, v). On\neach trial I first generated a sample of all units according to (11).\nI then fixed only the visible\nunits and allowed the hidden units to fluctuate according to the conditional probability p(h|v) to be\nrepresented. 
This probability is given again by the Boltzmann distribution, now with energy function\n\n$$E(h|v) = -(b_h - J_{hv} v)^\\top h - h^\\top J_{hh} h \\qquad (12)$$\n\nAll bias terms b were set to zero, and all pairwise couplings J were random draws from a zero-\nmean normal distribution, $J_{ij} \\sim N(0, \\tfrac{1}{3})$. Experiments used n hidden and n visible units, with\nn \u2208 {8, 10, 12}. This distribution of couplings produced sparse posterior distributions whose rank-\nordered probabilities fell faster than $\\mathrm{rank}^{-1}$ and were thus compressible [4].\nThe compression was accomplished by passing the hidden unit activities h through random per-\nceptrons a with weights W, according to a = sgn(W h). These perceptron activities fluctuate\nalong with their inputs. The mean activity of these perceptron units compressively senses the prob-\nability distribution according to (8). This process of sampling and then compressing a Boltzmann\ndistribution can be implemented by the simple neural network shown in Figure 2.\n\nFigure 2: Compressive sensing of a probability distribution by model neurons. Left: a neural architecture for\ngenerating and then encoding a sparse, high-dimensional probability distribution. Right: activity of each popu-\nlation of neurons as a function of time. Sparse posterior probability distributions are generated by a Boltzmann\nmachine with visible units v (Inputs), hidden units h (Samplers), feedforward couplings $J_{vh}$ from visible to\nhidden units, and recurrent connections between hidden units $J_{hh}$. The visible units\u2019 activities are fixed by\nan input. The hidden units are stochastic, and sample from a probability distribution p(h|v). The samples\nare recoded by feedforward weights W to random perceptrons a. 
The mean activity y of the time-dependent\nperceptron responses captures the sparse joint distribution of the hidden units.\n\nWe are not ultimately interested in reconstruction of the large, sparse distribution, but rather the\ndistribution\u2019s compressed representation. Nonetheless, reconstruction is useful to show that the\ninformation has been preserved. I reconstruct sparse probabilities using nonnegative $\\ell_1$ minimization\nwith measurement constraints [16, 17], minimizing\n\n$$\\|p\\|_{\\ell_1} + \\lambda \\|Ap - y\\|_{\\ell_2}^2 \\qquad (13)$$\n\nwhere $\\lambda$ is a regularization parameter that was set to 2T in all simulations. Reconstructions were\nquite good, as shown in Figure 3. Even with far fewer measurements than signal dimensions, recon-\nstruction accuracy is limited only by the sampling of the posterior. With enough random perceptrons,\nno available information is lost.\nIn the context of probability distributions, $\\ell_1$ reconstruction has a serious flaw: All distributions have\nthe same $\\ell_1$ norm: $\\|p\\|_{\\ell_1} = \\sum_x p(x) = 1$! To minimize the $\\ell_1$ norm, therefore, the estimate will\nnot be a probability distribution. Nonetheless, the individual probabilities of the most significant\nstates are accurately reconstructed, and only the highly improbable states are set to zero. Figure 3B\nshows that the shortfall is small: $\\ell_1$ reconstruction recovers over 90% of the total probability mass.\n\n3.2 Preserving computationally important relationships\n\nThere is value in being able to compactly represent these high-dimensional objects. However, it\nwould be especially useful to perform probabilistic computations using these representations, such\nas marginalization and evidence integration. Since marginalization is a linear operation on the prob-\nability distribution, this is readily implementable in the linearly compressed domain. 
In contrast,\nevidence integration is a multiplicative process acting in the canonical basis, so this operation will\nbe more complicated after the linear distortions of compressive measurement A. Nonetheless, such\ncomputations should be feasible as long as the informative relationships are preserved in the com-\npressed space: Similar distributions should have similar compressive representations, and dissimilar\ndistributions should have dissimilar compressive representations. In fact, that is precisely the guar-\nantee of compressive sensing: topological properties of the underlying space are preserved in the\ncompressive domain [18]. Figure 4 illustrates how not only are individual sparse distributions re-\ncoverable despite significant compression, but the topology of the set of all such distributions is\nretained.\n\nFigure 3: Reconstruction of sparse posteriors from random perceptron measurements. (A) A sparse posterior\ndistribution over 10 nodes in a Boltzmann machine is sampled 1000 times, fed to 50 random perceptrons,\nand reconstructed by nonnegative $\\ell_1$ minimization. (B) A histogram of the sum of reconstructed probabilities\nreveals the small shortfall from a proper normalization of 1. (C) Scatter plots show reconstructions versus\ntrue probabilities. Each box uses different numbers of compressive measurements M and numbers of samples\nT. (D) With increasing numbers of compressive measurements, the mean squared reconstruction error falls to\n$1/T = 10^{-3}$, the limit imposed by finite sampling.\n\nFor this experiment, an input x is drawn from a dictionary of input patterns $X \\subset \\{+1, -1\\}^n$.\nEach pattern in X is a translation of a single binary template $x^0$ whose elements are generated by\nthresholding a noisy sinusoid (Figure 4A): $x^0_j = \\operatorname{sgn}[4 \\sin(2\\pi j/n) + \\eta_j]$ with $\\eta_j \\sim N(0, 1)$. 
On\neach trial, one of these possible patterns is drawn randomly with equal probability 1/|X|, and then\nis measured by a noisy process that randomly flips bits with a probability $\\eta = 0.35$ to give a noisy\npattern r. This process induces a posterior distribution over the possible input patterns\n\n$$p(x|r) = \\frac{1}{Z} p(x) \\prod_i p(r_i|x_i) = \\frac{1}{Z} p(x) \\, \\eta^{h(x,r)} (1 - \\eta)^{n - h(x,r)} \\qquad (14)$$\n\nwhere h(x, r) is the Hamming distance between x and r. This posterior is nonzero for all patterns in\nthe dictionary. The noise level and the similarities between the dictionary elements together control\nthe sparseness.\nOne thousand trials of this process generate samples from the set of all possible posterior distributions.\nJust as the underlying set of inputs has a translation symmetry, the set of all possible posterior\ndistributions has a cyclic permutation symmetry. This symmetry can be revealed by a nonlinear\nembedding [19] of the set of posteriors into two dimensions (Figure 4B).\nCompressive sensing of these posteriors by 10 random perceptrons produces a much lower-\ndimensional embedding that preserves this symmetry. Figure 4C shows the same nonlinear em-\nbedding algorithm applied to the reduced representation, and one sees the same topological pattern.\nIn compressive sensing, similarity is measured by Euclidean distance. When applied to probability\ndistributions it will be interesting to examine instead how well information-geometric measures like\nthe Kullback-Leibler divergence are preserved under this dimensionality reduction [20].\n\n4 Discussion\n\nProbabilistic inference appears to be essential for both animals and machines to perform well on\ncomplex tasks with natural levels of ambiguity, but it remains unclear how the brain represents and\nmanipulates probability. 
Present population models of neural inference either struggle with high-\ndimensional distributions [1] or encode them by hard-to-measure high-order correlations [2]. Here I\nhave proposed an alternative mechanism by which the brain could efficiently represent probabilities:\nrandom perceptrons. In this model, information about probabilities is compressed and distributed\nin neural population activity.\n\nFigure 4: Nonlinear embeddings of a family of probability distributions with a translation symmetry. (A)\nThe process of generating posterior distributions: (i) A set of 100 possible patterns is generated as cyclic\ntranslations of a binary pattern (only 9 shown). With uniform probability, one of these patterns is selected (ii),\nand a noisy version is obtained by randomly flipping bits with probability 0.35 (iii). From such noisy patterns,\nan observer can infer posterior probability distributions over possible inputs (iv). (B) The set of posteriors from\n1000 iterations of this process is nonlinearly mapped [19] from 100 dimensions to 2 dimensions. Each point\nrepresents one posterior and is colored according to the actual pattern from which the noisy observations were\nmade. The permutation symmetry of this process is revealed as a circle in this mapping. (C) This circular\nstructure is retained even after each posterior is compressed into the mean output of 10 random perceptrons.\n\nAmazingly, the brain need not measure any correlations between the\nperceptron outputs to capture the joint statistics of the sparse input distribution. Only the mean\nactivities are required. 
Figure 2 illustrates one network that implements this new representation, and\nmany variations on this circuit are possible.\nSuccessful encoding in this compressed representation requires that the input distribution be sparse.\nPosterior distributions over sensory stimuli like natural images are indeed expected to be highly\nsparse: the features are sparse [21], the prior over images is sparse [22], and the likelihood pro-\nduced by sensory evidence is usually restrictive, so the posteriors should be even sparser. Still, it\nwill be important to quantify just how sparse the relevant posteriors are under different conditions.\nThis would permit us to predict how neural representations in a fixed population should degrade as\nsensory evidence becomes weaker.\nBrains appear to have a mix of structure and randomness. The results presented here show that\npurely random connections are sufficient to ensure that a sparse probability distribution is properly\nencoded. Surprisingly, more structured connections cannot allow a network with the same computa-\ntional elements to encode distributions with substantially fewer neurons, since compressive sensing\nis already nearly optimal [8]. On the other hand, some representational structure may make it easier\nto perform computations later. Note that unknown randomness is not an impediment to further pro-\ncessing, as reconstruction can be performed even without explicit knowledge of the random perceptron\nmeasurement matrix [23].\nEven in the most convenient representations, inference is generally intractable and requires approx-\nimation. Since compressive sensing preserves the essential geometric relationships of the signal\nspace, learning and inference based on these relationships may be no harder after the compression,\nand could even be more efficient due to the reduced dimensionality. 
Developing biologically plausible mech-\nanisms for implementing probabilistic computations in the compressed representation is important\nwork for the future.\n\nAppendix: Asymptotic orthogonality of random perceptron matrix\n\nTo evaluate the quality of the compressive sensing matrix A, we need to ensure that S-sparse vectors\nare not projected to zero by the action of A. Here I show that the random perceptrons are asymp-\ntotically well-conditioned: $\\hat{A}^\\top \\hat{A} \\to I$ for large n and M, where $\\hat{A} = A/\\sqrt{M}$. This ensures that\ndistinct inputs yield distinct measurements.\nFirst I compute the mean $\\langle C_{xx'} \\rangle_W$ and variance of the inner product $C_{xx'}$ between columns of $\\hat{A}$\nfor a given pair of states $x \\ne x'$. For compactness I will write $w_i$ for the ith row of the perceptron\nweight matrix W. Angle brackets $\\langle \\cdot \\rangle_W$ indicate averages over random perceptron weights $W_{ij} \\sim N(0, 1)$. We find\n\n$$\\langle C_{xx'} \\rangle_W = \\Big\\langle \\sum_i \\hat{A}_i(x) \\hat{A}_i(x') \\Big\\rangle_W = \\frac{1}{M} \\sum_i \\langle \\operatorname{sgn}(w_i \\cdot x) \\operatorname{sgn}(w_i \\cdot x') \\rangle_W \\qquad (15)$$\n\nand since the different $w_i$ are independent, this implies that\n\n$$\\langle C_{xx'} \\rangle_W = \\langle \\operatorname{sgn}(w_i \\cdot x) \\operatorname{sgn}(w_i \\cdot x') \\rangle_W \\qquad (16)$$\n\nThe n-dimensional half-space in W where $\\operatorname{sgn}(w_i \\cdot x) = +1$ intersects with the corresponding\nhalf-space for $x'$ in a wedge-shaped region with an angle of $\\theta = \\cos^{-1}(x \\cdot x' / \\|x\\|_{\\ell_2} \\|x'\\|_{\\ell_2})$. This\nangle is related to the Hamming distance h = h(x, x'):\n\n$$\\theta(h) = \\cos^{-1}(x \\cdot x'/n) = \\cos^{-1}(1 - 2h/n) \\qquad (17)$$\n\nThe signs of $w_i \\cdot x$ and $w_i \\cdot x'$ agree within this wedge region and its reflection about W = 0, and\ndisagree in the supplementary wedges. The mean inner product is therefore\n\n$$\\langle C_{xx'} \\rangle_W = P[\\operatorname{sgn}(w_i \\cdot x) = \\operatorname{sgn}(w_i \\cdot x')] - P[\\operatorname{sgn}(w_i \\cdot x) \\ne \\operatorname{sgn}(w_i \\cdot x')] \\qquad (18)$$\n$$= 1 - \\frac{2}{\\pi} \\theta(h) \\qquad (19)$$\n\nThe variance of $C_{xx'}$ caused by variability in W is given by\n\n$$V_{xx'} = \\langle C_{xx'}^2 \\rangle_W - \\langle C_{xx'} \\rangle_W^2 \\qquad (20)$$\n$$= \\Big\\langle \\sum_{i=j} \\hat{A}_i^2(x) \\hat{A}_i^2(x') + \\sum_{i \\ne j} \\hat{A}_i(x) \\hat{A}_i(x') \\hat{A}_j(x) \\hat{A}_j(x') \\Big\\rangle_W - \\langle C_{xx'} \\rangle_W^2 \\qquad (21)$$\n$$= \\Big\\langle \\sum_i \\frac{\\operatorname{sgn}(w_i \\cdot x)^2}{M} \\frac{\\operatorname{sgn}(w_i \\cdot x')^2}{M} \\Big\\rangle_W + \\sum_{i \\ne j} \\Big\\langle \\frac{\\operatorname{sgn}(w_i \\cdot x)}{\\sqrt{M}} \\frac{\\operatorname{sgn}(w_i \\cdot x')}{\\sqrt{M}} \\Big\\rangle_W^2 - \\langle C_{xx'} \\rangle_W^2 \\qquad (22)$$\n$$= \\frac{M}{M^2} + \\frac{M^2 - M}{M^2} (1 - 2\\theta(h)/\\pi)^2 - \\langle C_{xx'} \\rangle_W^2 \\qquad (23)$$\n$$= \\frac{1}{M} \\Big( 1 - \\big( 1 - \\tfrac{2}{\\pi} \\theta(h(x, x')) \\big)^2 \\Big) \\qquad (24)$$\n\nThis variance falls with M, so for large numbers of measurements M the inner products between\ncolumns concentrate around the various state-dependent mean values (19).\nNext I consider the diversity of inner products for different pairs (x, x') of binary state vectors. I\ntake the limit of large M so that the diversity is dominated by variations over the particular pairs,\nrather than by variations over measurements. The mean inner product depends only on the Hamming\ndistance h between x and x', which for sparse signals with random support has a binomial distribu-\ntion, $p(h) = \\binom{n}{h} 2^{-n}$ with mean n/2 and variance n/4. Designating by an overbar the average over\nrandomly chosen states x and x', the mean $\\bar{C}$ and variance $\\overline{\\delta C^2}$ of the inner product are\n\n$$\\bar{C} = \\overline{\\langle C_{xx'} \\rangle_W} = 1 - \\frac{2}{\\pi} \\cos^{-1}\\Big( 1 - \\frac{2\\bar{h}}{n} \\Big) = 0 \\qquad (25)$$\n$$\\overline{\\delta C^2} = \\overline{\\delta h^2} \\Big( \\frac{\\partial C}{\\partial h} \\Big)^2 = \\frac{n}{4} \\cdot \\frac{16}{\\pi^2 n^2} = \\frac{4}{\\pi^2 n} \\qquad (26)$$\n\nThis proves that in the limit of large n and M, different columns of the random perceptron mea-\nsurement matrix have inner products that concentrate around 0. The matrix of inner products is\nthus orthonormal almost surely, $\\hat{A}^\\top \\hat{A} \\to I$. Consequently, with enough measurements the random\nperceptrons asymptotically provide an isometry. Future work will investigate how the measurement\nmatrix behaves for finite n and M, which will determine the number of measurements required in\npractice to capture a signal of a given sparseness.\n\nAcknowledgments\n\nThanks to Alex Pouget, Jeff Beck, Shannon Starr, and Carmelita Navasca for helpful conversations.\n\nReferences\n[1] Ma W, Beck J, Latham P, Pouget A (2006) Bayesian inference with probabilistic population codes. Nat\n\nNeurosci 9: 1432\u20138.\n\n[2] Berkes P, Orb\u00e1n G, Lengyel M, Fiser J (2011) Spontaneous cortical activity reveals hallmarks of an\n\noptimal internal model of the environment. Science 331: 83\u20137.\n\n[3] Cand\u00e8s E, Romberg J, Tao T (2006) Robust uncertainty principles: Exact signal reconstruction from\n\nhighly incomplete frequency information. 
IEEE Transactions on Information Theory 52: 489\u2013509.\n\n[4] Cand\u00e8s E, Tao T (2006) Near-optimal signal recovery from random projections: Universal encoding strategies? IEEE Transactions on Information Theory 52: 5406\u20135425.\n\n[5] Donoho D (2006) Compressed sensing. IEEE Transactions on Information Theory 52: 1289\u20131306.\n\n[6] Cand\u00e8s E, Plan Y (2011) A probabilistic and RIPless theory of compressed sensing. IEEE Transactions on Information Theory 57: 7235\u20137254.\n\n[7] Donoho DL, Maleki A, Montanari A (2009) Message-passing algorithms for compressed sensing. Proc Natl Acad Sci USA 106: 18914\u20139.\n\n[8] Cand\u00e8s E, Wakin M (2008) An introduction to compressive sampling. Signal Processing Magazine 25: 21\u201330.\n\n[9] Kueng R, Gross D (2012) RIPless compressed sensing from anisotropic measurements. arXiv preprint arXiv:1205.1423.\n\n[10] Calderbank R, Howard S, Jafarpour S (2010) Construction of a large class of deterministic sensing matrices that satisfy a statistical isometry property. Selected Topics in Signal Processing 4: 358\u2013374.\n\n[11] Gurevich S, Hadani R (2009) Statistical RIP and semi-circle distribution of incoherent dictionaries. arXiv cs.IT.\n\n[12] Mendelson S, Pajor A, Tomczak-Jaegermann N (2006) Uniform uncertainty principle for Bernoulli and subgaussian ensembles. arXiv math.ST.\n\n[13] Irmatov A (2009) Bounds for the number of threshold functions. Discrete Mathematics and Applications 6: 569\u2013583.\n\n[14] Ackley D, Hinton G, Sejnowski T (1985) A learning algorithm for Boltzmann machines. Cognitive Science 9: 147\u2013169.\n\n[15] Glauber RJ (1963) Time-dependent statistics of the Ising model. Journal of Mathematical Physics 4: 294\u2013307.\n\n[16] Yang J, Zhang Y (2011) Alternating direction algorithms for L1 problems in compressive sensing. SIAM Journal on Scientific Computing 33: 250\u2013278.\n\n[17] Zhang Y, Yang J, Yin W (2010) YALL1: Your ALgorithms for L1. 
CAAM Technical Report TR09-17.\n\n[18] Baraniuk R, Cevher V, Wakin MB (2010) Low-dimensional models for dimensionality reduction and signal recovery: A geometric perspective. Proceedings of the IEEE 98: 959\u2013971.\n\n[19] van der Maaten L, Hinton G (2008) Visualizing high-dimensional data using t-SNE. Journal of Machine Learning Research 9: 2579\u20132605.\n\n[20] Carter KM, Raich R, Finn WG, Hero AO (2011) Information-geometric dimensionality reduction. IEEE Signal Process Mag 28: 89\u201399.\n\n[21] Olshausen BA, Field DJ (1996) Emergence of simple-cell receptive field properties by learning a sparse code for natural images. Nature 381: 607\u20139.\n\n[22] Stephens GJ, Mora T, Tkacik G, Bialek W (2008) Thermodynamics of natural images. arXiv q-bio.NC.\n\n[23] Isely G, Hillar CJ, Sommer FT (2010) Deciphering subsampled data: adaptive compressive sampling as a principle of brain communication. arXiv q-bio.NC.", "award": [], "sourceid": 656, "authors": [{"given_name": "Zachary", "family_name": "Pitkow", "institution": null}]}