{"title": "Near-Maximum Entropy Models for Binary Neural Representations of Natural Images", "book": "Advances in Neural Information Processing Systems", "page_first": 97, "page_last": 104, "abstract": null, "full_text": "Near-Maximum Entropy Models for Binary Neural Representations of Natural Images\n\nMatthias Bethge and Philipp Berens\n\nMax Planck Institute for Biological Cybernetics\nSpemannstrasse 41, 72076 T\u00fcbingen, Germany\nmbethge,berens@tuebingen.mpg.de\n\nAbstract\n\nMaximum entropy analysis of binary variables provides an elegant way of studying the role of pairwise correlations in neural populations. Unfortunately, these approaches suffer from poor scalability to high dimensions. In sensory coding, however, high-dimensional data is ubiquitous. Here, we introduce a new approach using a near-maximum entropy model that makes this type of analysis feasible for very high-dimensional data: the model parameters can be derived in closed form and sampling is easy. Our NearMaxEnt approach can therefore serve as a tool for testing predictions from a pairwise maximum entropy model not only for low-dimensional marginals, but also for high-dimensional measurements of more than a thousand units. We demonstrate its usefulness by studying natural images with dichotomized pixel intensities.
Our results indicate that the statistics of such higher-dimensional measurements exhibit additional structure that is not predicted by pairwise correlations, even though pairwise correlations explain the lower-dimensional marginal statistics surprisingly well up to the dimensionality at which estimation of the full joint distribution is still feasible.\n\n1 Introduction\n\nA core issue in sensory coding is to seek out and model statistical regularities in high-dimensional data. In particular, motivated by developments in information theory, it has been hypothesized that modeling these regularities by means of redundancy reduction constitutes an important goal of early visual processing [2]. Recent studies conjectured that the binary spike responses of retinal ganglion cells may be characterized completely in terms of second-order correlations when using a maximum entropy approach [13, 12]. In light of what we know about the statistics of the visual input, however, this would be very surprising: natural images are known to exhibit complex higher-order correlations which are extremely difficult to model, yet perceptually relevant.
Thus, if we assume that retinal ganglion cells do not discard the information underlying these higher-order correlations altogether, it would be a very difficult signal processing task to remove all of them already within the retinal network.\n\nOftentimes, neurons involved in early visual processing are modeled as rather simple computational units akin to generalized linear models, where a linear filter is followed by a point-wise nonlinearity. For such simple neuron models, the possibility of removing higher-order correlations present in the input is very limited [3].\n\nHere, we study the role of second-order correlations in the multivariate binary output statistics of such linear-nonlinear model neurons with a threshold nonlinearity responding to natural images. That is, each unit can be described by an affine transformation z_k = w_k^T x + \u03d1 followed by a point-wise signum function s_k = sgn(z_k). Our interest in this model is twofold: (A) It can be regarded as a parsimonious model for the analysis of population codes of natural images for which the\n\nFigure 1: Similarity between the Ising and the DG model. A+C: Entropy difference \u2206H between the Ising model and the dichotomized Gaussian distribution as a function of dimensionality. A: Up to 10 dimensions we can compute H_DG directly by evaluating Eq. 6. Gray dots correspond to different sets of parameters. For m \u2265 4, the relatively large scatter and the occurrence of negative values are due to the limited numerical precision of the Monte-Carlo integration. Error bars show the standard error of the mean. B: JS-divergence D_JS between P_I and P_DG. C: \u2206H as above, for higher dimensions. Up to 20 dimensions \u2206H remains very small. The increase for m \u2192 20 is most likely due to undersampling of the distributions. D: \u2206H as a function of the sample size used to estimate H_DG, at seven (black) and ten (grey) dimensions (note log scale on both axes).
\u2206H decreases with a power law with increasing sample size.\n\ncomputational power and the bandwidth of each unit are limited. (B) The same model can also be used more generally to fit multivariate binary data with given pairwise correlations, if x is drawn from a Gaussian distribution. In particular, we will show that the resulting distribution closely resembles the binary maximum entropy models known as Ising models or Boltzmann machines, which have recently become popular for the analysis of spike train recordings from retinal ganglion cell responses [13, 12].\n\nMotivated by the analysis in [12, 13] and the discussion in [10], we are interested, at a more general level, in the following questions: are pairwise interactions enough for understanding the statistical regularities in high-dimensional natural data (given that they provide a good fit in the low-dimensional case)? If we suppose that pairwise interactions are enough, what can we say about the amount of redundancy in high-dimensional data? In comparison with neural spike data, natural images provide two advantages for studying these questions: 1) It is much easier to obtain large amounts of data with millions of samples which are less prone to nonstationarities. 2) Often, differences in the higher-order statistics, such as between pink noise and natural images, can be recognized by eye.\n\n2 Second order models for binary variables\n\nIn order to study whether pairwise interactions are enough to determine the statistical regularities in high-dimensional data, it is necessary to be able to compute the maximum entropy distribution for a large number of dimensions N.
Given a set of measured statistics, maximum entropy models yield a full probability distribution that is consistent with these constraints but does not impose any\n\nFigure 2: Examples of covariance matrices (A+B) and their learned approximations (C+D) at m = 10 for clarity. \u03b1 is the parameter controlling the steepness of the correlation decrease. E+F: Eigenvalue spectra of both matrices. G: Entropy difference \u2206H and H: JS-divergence between the distributions of samples obtained from the two models at m = 7.\n\nadditional structure on the distribution [7]. For binary data with given mean activations \u00b5_i = \u27e8s_i\u27e9 and correlations between neurons \u03a3_ij = \u27e8s_i s_j\u27e9 - \u27e8s_i\u27e9\u27e8s_j\u27e9, one obtains a quadratic exponential probability mass function known as the Ising model in physics or as the Boltzmann machine in machine learning.\n\nCurrently, all methods used to determine the parameters of such binary maximum entropy models suffer from the same drawback: since the parameters do not correspond directly to any of the measured statistics, they have to be inferred (or 'learned') from data. In high dimensions, though, this poses a difficult computational problem. Therefore the characterization of complete neural circuits with possibly hundreds of neurons is still out of reach, even though the analysis was recently extended to up to forty neurons [14].\n\nTo make the maximum entropy approach feasible in high dimensions, we propose a new strategy: sampling from a 'near-maximum' entropy model that does not require any complicated learning of parameters.
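For reference, the Boltzmann machine learning that this strategy avoids can be written out exactly for small m, where all 2^m states can be enumerated and the moment-matching gradient evaluated in closed form. The following is our illustrative sketch, not code from the paper; it assumes \u00b11 spins and a symmetric coupling matrix J with zero diagonal, matching the quadratic exponential form above:

```python
import numpy as np
from itertools import product

def ising_pmf(h, J):
    # enumerate all 2^m states s in {-1, +1}^m (feasible only for small m)
    m = len(h)
    states = np.array(list(product([-1, 1], repeat=m)), dtype=float)
    energy = states @ h + 0.5 * np.einsum('ki,ij,kj->k', states, J, states)
    w = np.exp(energy)
    return states, w / w.sum()

def model_moments(h, J):
    # first moments <s_i> and covariances <s_i s_j> - <s_i><s_j> under the model
    states, p = ising_pmf(h, J)
    mu = p @ states
    s2 = np.einsum('k,ki,kj->ij', p, states, states)
    return mu, s2 - np.outer(mu, mu)

def boltzmann_fit(mu_data, c_data, lr=0.2, steps=20000):
    # gradient ascent on the log-likelihood: the update is the difference
    # between data moments and model moments (moment matching)
    mu_data = np.asarray(mu_data)
    s2_data = np.asarray(c_data) + np.outer(mu_data, mu_data)
    m = len(mu_data)
    h, J = np.zeros(m), np.zeros((m, m))
    for _ in range(steps):
        states, p = ising_pmf(h, J)
        mu = p @ states
        s2 = np.einsum('k,ki,kj->ij', p, states, states)
        h += lr * (mu_data - mu)
        dJ = lr * (s2_data - s2)
        np.fill_diagonal(dJ, 0.0)  # keep J_ii = 0, since s_i^2 = 1 is constant
        J += dJ
    return h, J
```

The per-step cost grows as 2^m, which is why both this exact route and its Gibbs-sampling replacement become impractical in high dimensions.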
In order to justify this approach, we verify empirically that the entropies of the full probability distributions obtained with the near-maximum entropy model are indistinguishable from those obtained with classical methods such as Gibbs sampling for up to 20 dimensions.\n\n2.1 Boltzmann machine learning\n\nFor a binary vector of neural activities s \u2208 {-1, 1}^m and specified \u00b5_i and \u03a3_ij, the Ising model takes the form\n\nP_I(s) = (1/Z) exp( \u2211_{i=1}^m h_i s_i + (1/2) \u2211_{i \u2260 j} J_ij s_i s_j ),   (1)\n\nwhere the local fields h_i and the couplings J_ij have to be chosen such that \u27e8s_i\u27e9 = \u00b5_i and \u27e8s_i s_j\u27e9 - \u27e8s_i\u27e9\u27e8s_j\u27e9 = \u03a3_ij. Unfortunately, finding the correct parameters turns out to be a difficult problem which cannot be solved in closed form.\n\nTherefore, one has to resort to an optimization approach to learn the model parameters h_i and J_ij from data. This problem is called Boltzmann machine learning and is based on maximization of the log-likelihood L = ln P_I({s_i}_{i=1}^N | h, J) [1], where N is the number of samples. The gradient of the likelihood can be computed in terms of the empirical covariance and the covariance of s_i and s_j as produced by the current model:\n\n\u2202L/\u2202J_ij = \u27e8s_i s_j\u27e9_Data - \u27e8s_i s_j\u27e9_Model   (2)\n\nThe second term on the right-hand side is difficult to compute, as it requires sampling from the model. Since the partition function Z in Eq. (1) is not available in closed form, Monte-Carlo methods such\n\nFigure 3: Random samples of dichotomized 4x4 patches from the van Hateren image data base (left) and from the corresponding dichotomized Gaussian distribution with equal covariance matrix (middle). It is not possible to see any systematic difference between the samples from the two distributions.
For comparison, this is not so for the sample from the independent model (right).\n\nas Gibbs sampling are employed [9] in order to approximate the required model average. This is computationally demanding, as sampling is necessary for each individual update. While efficient sampling algorithms exist for special cases [6], it still remains a hard and time-consuming problem in the general case. Additionally, most sampling algorithms do not come with guarantees for the quality of the approximation of the required average. In conclusion, parameter fitting of the Ising model is slow and oftentimes painstaking, especially in high dimensions.\n\n2.2 Modeling with the dichotomized Gaussian\n\nHere we explore an intriguing alternative to the Monte-Carlo approach: we replace the Ising model by a 'near-maximum' entropy model for which both parameter computation and sampling are easy. A very convenient, but in this context rarely recognized, candidate model is the dichotomized Gaussian distribution (DG) [11, 5, 4]. It is obtained by supposing that the observed binary vector s is generated from a hidden Gaussian variable\n\nz \u223c N(\u03b3, \u039b),   s_i = sgn(z_i).   (3)\n\nSince we can assume unit variances for the Gaussian without loss of generality, i.e. \u039b_ii = 1, the mean \u00b5 and the covariance matrix \u03a3 of s are given by\n\n\u00b5_i = 2\u03a6(\u03b3_i) - 1,   \u03a3_ii = 4\u03a6(\u03b3_i)\u03a6(-\u03b3_i),   \u03a3_ij = 4\u03a8(\u03b3_i, \u03b3_j, \u039b_ij) for i \u2260 j,   (4)\n\nwhere \u03a8(x, y, \u03bb) = \u03a6_2(x, y, \u03bb) - \u03a6(x)\u03a6(y). Here \u03a6 is the univariate standardized cumulative Gaussian distribution and \u03a6_2 its bivariate counterpart.
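Eq. 4 can be evaluated directly with standard numerical libraries. A sketch (our illustration in Python with SciPy's univariate and bivariate Gaussian CDFs; the paper itself used Matlab's mvncdf, and the function name is ours):

```python
import numpy as np
from scipy.stats import norm, multivariate_normal

def dg_moments(gamma, Lam):
    # Eq. 4: mean and covariance of s = sgn(z), z ~ N(gamma, Lam), Lam_ii = 1
    gamma = np.asarray(gamma, dtype=float)
    m = len(gamma)
    mu = 2.0 * norm.cdf(gamma) - 1.0
    Sigma = np.empty((m, m))
    for i in range(m):
        Sigma[i, i] = 4.0 * norm.cdf(gamma[i]) * norm.cdf(-gamma[i])
        for j in range(i + 1, m):
            lam = Lam[i][j]
            # Psi(x, y, lam) = Phi2(x, y, lam) - Phi(x) Phi(y)
            phi2 = multivariate_normal.cdf(
                [gamma[i], gamma[j]], mean=[0.0, 0.0],
                cov=[[1.0, lam], [lam, 1.0]])
            psi = phi2 - norm.cdf(gamma[i]) * norm.cdf(gamma[j])
            Sigma[i, j] = Sigma[j, i] = 4.0 * psi
    return mu, Sigma
```

For gamma = 0 the closed form below applies, so e.g. a latent correlation of 0.5 must give a binary covariance of 2 arcsin(0.5)/pi = 1/3, which makes a convenient sanity check.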
While the computation of the model parameters was hard for the Ising model, these equations can easily be inverted to find the parameters of the hidden Gaussian distribution:\n\n\u03b3_i = \u03a6^{-1}( (\u00b5_i + 1) / 2 )   (5)\n\nDetermining \u039b_ij generally requires finding a suitable value such that \u03a3_ij - 4\u03a8(\u03b3_i, \u03b3_j, \u039b_ij) = 0. This can be efficiently solved numerically, since the function is monotonic in \u039b_ij and has a unique zero crossing. We obtain an especially easy case when \u03b3_i = \u03b3_j = 0, as then \u039b_ij = sin( (\u03c0/2) \u03a3_ij ).\n\nIt is also possible to evaluate the probability mass function of the DG model by numerical integration,\n\nP_DG(s) = (2\u03c0)^{-m/2} |\u039b|^{-1/2} \u222b_{a_1}^{b_1} ... \u222b_{a_m}^{b_m} exp( -(1/2) (z - \u03b3)^T \u039b^{-1} (z - \u03b3) ) dz,   (6)\n\nwhere the integration limits are chosen as a_i = 0 and b_i = \u221e if s_i = 1, and a_i = -\u221e and b_i = 0 otherwise.\n\nIn summary, the proposed model has two advantages over the traditional Ising model: (1) sampling is easy, and (2) finding the model parameters is easy too.\n\n3 Near-maximum entropy behavior of the dichotomized Gaussian distribution\n\nIn the previous section we introduced the dichotomized Gaussian distribution. Our conjecture is that in many cases it can serve as a convenient approximation to the Ising model. Now, we investigate how good this approximation is. For a wide range of interaction terms and mean activations we verify that the DG model closely resembles the Ising model.
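Both advantages fit in a few lines of code. A sketch (our illustration, not the authors' released implementation; function names are ours, and Brent's method stands in for the one-dimensional root finding described above):

```python
import numpy as np
from scipy.stats import norm, multivariate_normal
from scipy.optimize import brentq

def fit_dg(mu, Sigma):
    # Eq. 5: gamma_i = Phi^{-1}((mu_i + 1) / 2)
    gamma = norm.ppf((np.asarray(mu, dtype=float) + 1.0) / 2.0)
    m = len(gamma)
    Lam = np.eye(m)
    for i in range(m):
        for j in range(i + 1, m):
            # solve Sigma_ij - 4 Psi(gamma_i, gamma_j, Lam_ij) = 0;
            # the left-hand side is monotonic with a unique zero crossing
            def f(lam):
                phi2 = multivariate_normal.cdf(
                    [gamma[i], gamma[j]], mean=[0.0, 0.0],
                    cov=[[1.0, lam], [lam, 1.0]])
                psi = phi2 - norm.cdf(gamma[i]) * norm.cdf(gamma[j])
                return Sigma[i][j] - 4.0 * psi
            Lam[i, j] = Lam[j, i] = brentq(f, -0.999, 0.999)
    return gamma, Lam

def sample_dg(gamma, Lam, n, rng):
    # draw the latent Gaussian and dichotomize at zero
    z = rng.multivariate_normal(gamma, Lam, size=n)
    return np.where(z > 0, 1, -1)
```

In the zero-mean case the numerical root must agree with the closed form Lam_ij = sin(pi Sigma_ij / 2), and the empirical moments of the samples should reproduce the targets up to sampling noise.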
In particular, we show that the entropy of the DG distribution is not much smaller than the entropy of the Ising model even at rather high dimensions.\n\n3.1 Random connectivity\n\nWe created randomly connected networks of varying size m, where local fields h_i and interaction terms J_ij were drawn from N(0, 0.4). First, we compared the entropy H_I = -\u2211_s P_I(s) log_2 P_I(s) of the thus specified Ising model, obtained by evaluating Eq. 1, with the entropy H_DG of the DG distribution, computed by numerical integration (footnote 1) from Eq. 6 (twenty parameter sets). The entropy difference \u2206H = H_I - H_DG was smaller than 0.002 percent of H_I (Fig. 1 A, note scale) and probably within the range of the numerical integration accuracy. In addition, we computed the Jensen-Shannon divergence D_JS[P_I \u2016 P_DG] = (1/2)(D_KL[P_I \u2016 M] + D_KL[P_DG \u2016 M]), where M = (1/2)(P_I + P_DG) [8]. We find that D_JS[P_I \u2016 P_DG] is extremely small up to 10 dimensions (Fig. 1 B). Therefore, the distributions seem not only to be close in their respective entropies, but also to have a very similar structure.\n\nNext, we extended this analysis to networks of larger size and repeated the same analysis for up to twenty dimensions. Since the integration in Eq. 6 becomes too time-consuming for m \u2192 20 due to the large number of states, we used a histogram-based estimate of P_DG (using 3 \u00b7 10^6 samples for m < 15 and 15 \u00b7 10^6 samples for m \u2265 15). The estimate of \u2206H is still very small at high dimensions (Fig. 1 C, below 0.5%). We also computed D_JS, which scaled similarly to \u2206H (data not shown).\n\nIn Fig. 1 C, \u2206H seems to increase with dimensionality. Therefore, we investigated how the estimate of \u2206H is influenced by the number of samples used. We computed both quantities for varying numbers of samples from the DG distribution (for m = 7, 10). As \u2206H decreases according to a power law with increasing sample size, the rise of \u2206H observed in Fig.
1 C is most likely due to undersampling of the distribution.\n\n3.2 Specified covariance structure\n\nTo explore the relationship between the two techniques more systematically, we generated covariance matrices with varying eigenvalue spectra. We used a parametric Toeplitz form, where the nth diagonal is set to a constant value exp(-\u03b1 \u00b7 n) (Fig. 2 A and B, m = 7, 10). We varied the decay parameter \u03b1, which led to a widely varying covariance structure (for eigenvalue spectra, see Fig. 2 E and F). We fit the Ising models using the Boltzmann machine gradient descent procedure. The covariance matrix of the samples drawn from the Ising model resembles the original very closely (Fig. 2 C and D). We also computed the entropy of the DG model using the desired covariance structure. We estimated \u2206H and D_JS[P_I \u2016 P_DG], averaged over 10 trials with 10^5 samples obtained by Gibbs sampling from the Ising model. \u2206H is very close to zero (Fig. 2 G, m = 7) except for small \u03b1 and never exceeded 0.05%. Moreover, the structure of both distributions seems to be very similar as well (Fig. 2 H, m = 7). At m = 10, both quantities scaled qualitatively similarly (data not shown). We also repeated this analysis using equations 1 and 6 as before, which led to similar results (data not shown).\n\nOur experiments demonstrate clearly that the dichotomized Gaussian distribution constitutes a good approximation to the quadratic exponential distribution over a large parameter range. In the following section, we will exploit the similarity between the two models to study how the role of second-order correlations may change between low-dimensional and high-dimensional statistics in the case of natural images.\n\nFootnote 1: For integration, we used the mvncdf function of Matlab. For m \u2265 4 this function employs Monte-Carlo integration.\n\nFigure 4: A: Negative log probabilities of the DG model are plotted against ground truth (red dots).
Identical distributions fall on the diagonal. Data points outside the area enclosed by the dashed lines indicate significant differences between the model and ground truth. The DG model matches the true distribution very well. For comparison, the independent model is shown as well (blue crosses). B: The multi-information of the true distribution (blue dots) accurately agrees with the multi-information of the DG model (red line). Similar to the analysis in [12], we observe a power law behavior of the entropy of the independent model (black solid line) and the multi-information. Linear extrapolation (in the log-log plot) to higher dimensions is indicated by dashed lines. C: The same data as in B presented differently: the joint entropy H = H_indep - I (blue dots) is plotted instead of I, and the axes are in linear scale. The dashed red line represents the same extrapolation as in B.\n\n4 Natural images: Second order and beyond\n\nWe now investigate to what extent the statistics of natural images with dichotomized pixel intensities can be characterized by pairwise correlations only. In particular, we would like to know how the role of pairwise correlations as opposed to higher-order correlations changes depending on the dimensionality. Thanks to the DG model introduced above, we are in a position to study the effect of pairwise correlations for high-dimensional binary random variables (N \u2248 1000 or even larger). We use the van Hateren image database in log-intensity scale, from which we sample small image patches at random positions. The threshold for the dichotomization is set to the median of pixel intensities. That is, each binary variable encodes whether the corresponding pixel intensity is above or below the median over the ensemble. Up to patch sizes of 4 \u00d7 4 pixels, the true joint statistics can be assessed using nonparametric histogram methods.
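The dichotomization step itself is straightforward. A minimal sketch (our illustration; the van Hateren database is not bundled here, so any 2-D array of log intensities can stand in for an image, and the median is taken over the sampled ensemble as described above):

```python
import numpy as np

def dichotomize_patches(img, k, n, rng):
    # sample n random k x k patches and binarize each pixel against the
    # median of the sampled ensemble of pixel intensities
    H, W = img.shape
    ys = rng.integers(0, H - k + 1, size=n)
    xs = rng.integers(0, W - k + 1, size=n)
    patches = np.stack([img[y:y + k, x:x + k].ravel()
                        for y, x in zip(ys, xs)])
    thresh = np.median(patches)
    return np.where(patches > thresh, 1, -1)
```

Because the threshold is the ensemble median, roughly half of all binary entries are +1 and the mean activation is close to zero by construction.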
Before we present quantitative comparisons, it is instructive to look at random samples from the true distribution (Fig. 3, left), from the DG model with same mean and covariance (Fig. 3, middle), and from the corresponding independent model (Fig. 3, right). By visual inspection, it seems that the DG model fits the true distribution well.\n\nIn order to quantify how well the DG model matches the true distribution, we draw two independent sets of samples from each (N = 2 \u00b7 10^6 for each set) and generate a scatter plot as shown in Fig. 4 A for 4 \u00d7 4 image patches. Each dot corresponds to one of the 2^16 = 65536 possible different binary patterns. The relative frequencies of these patterns according to the DG model (red dots) and according to the independent model (blue crosses) are plotted against the relative frequencies obtained from the natural image patches. The solid diagonal line corresponds to a perfect match between model and ground truth. The dashed lines enclose the region within which deviations are to be expected due to the finite sampling size. Since most of the red dots fall within this region, the DG model fits the data distribution very well.\n\nWe also systematically evaluated the JS-divergence and the multi-information I[S] = \u2211_k H[S_k] - H[S] as a function of dimensionality. That is, we started with the bivariate marginal distribution of two randomly selected pixels. Then we incrementally added more pixels at random locations until the random vector contained all 16 pixels of the 4 \u00d7 4 image patches. Independent of the dimension, the JS-divergence between the DG model and the true distribution is smaller than 0.015 bits. For comparison, the JS-divergence between the independent model and the true distribution increases with dimensionality from roughly 0.2 bits in the case of two pixels up to 0.839 bits in the case of 16 pixels.
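Both quantities reduce to simple histogram arithmetic over the 2^m binary patterns. A sketch (our illustration; probabilities are estimated by counting, so this is only practical up to m of roughly 20, exactly as discussed above):

```python
import numpy as np

def pattern_counts(S):
    # S: (n, m) array of +/-1 patterns -> empirical pmf over all 2^m states
    m = S.shape[1]
    idx = ((S > 0).astype(int) * (1 << np.arange(m))).sum(axis=1)
    p = np.bincount(idx, minlength=1 << m).astype(float)
    return p / p.sum()

def entropy(p):
    # Shannon entropy in bits, ignoring zero-probability states
    p = p[p > 0]
    return -(p * np.log2(p)).sum()

def js_divergence(p, q):
    # D_JS = H(M) - (H(p) + H(q)) / 2 with mixture M = (p + q) / 2
    mix = 0.5 * (p + q)
    return entropy(mix) - 0.5 * (entropy(p) + entropy(q))

def multi_information(S):
    # I[S] = sum_k H[S_k] - H[S]
    h_joint = entropy(pattern_counts(S))
    h_marg = 0.0
    for k in range(S.shape[1]):
        pk = (S[:, k] > 0).mean()
        h_marg += entropy(np.array([pk, 1.0 - pk]))
    return h_marg - h_joint
```

As a sanity check, two disjoint distributions have a JS-divergence of exactly one bit, and a perfectly correlated pair of fair binary units carries one bit of multi-information.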
For two independent sets of samples both drawn from the natural image data, the JS-divergence ranges between 0.006 and 0.007 bits for 4 \u00d7 4 patches, setting the gold standard for the minimal JS-divergence one could achieve with any model given the finite sampling size. Carrying out the same type of analysis as in [12], we make qualitatively the same observations as were reported there: as shown above, we find a quite accurate match between the two distributions.\n\nFigure 5: Random samples of dichotomized 32x32 patches from the van Hateren image data base (left) and from the corresponding dichotomized Gaussian distribution with equal covariance matrix (right). For the latter, the percept of typical objects is missing because higher-order correlations are ignored. This striking difference is not obvious, however, at the level of 4x4 patches, for which we found an excellent match of the dichotomized Gaussian to the ensemble of natural images.\n\nFurthermore, the multi-information of the DG model (red solid line) and of the true distribution (blue dots) increases linearly on a log-log scale with the number of dimensions (Fig. 4 B). Both findings can be verified only up to a rather limited number of dimensions (less than 20). Nevertheless, in [12], two claims about the higher-dimensional statistics have been based on these two observations: first, that pairwise correlations may be sufficient to determine the full statistics of binary responses, and second, that the convergent scaling behavior in the log-log plot may indicate a transition towards strong order.\n\nUsing natural images instead of retinal ganglion cell data, we would like to verify to what extent the low-dimensional observations can be used to support these claims about the high-dimensional statistics [10]. To this end we study the same kind of extrapolation (Fig. 4 B) to higher dimensions (dashed lines) as in [12].
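The extrapolation itself is a one-line fit. A sketch (our illustration with made-up numbers, not the measured values behind Fig. 4): fit a line to log2 I versus log2 m and solve for the dimension at which the extrapolated multi-information meets the independent entropy, here taken as one bit per unit:

```python
import numpy as np

def loglog_fit(ms, infos):
    # least-squares line in log2-log2 space: log2 I = a + b * log2 m
    b, a = np.polyfit(np.log2(ms), np.log2(infos), 1)
    return a, b

def crossing_dimension(a, b, bits_per_unit=1.0):
    # extrapolated I(m) = 2**a * m**b meets H_indep(m) = bits_per_unit * m
    # at m**(b - 1) = bits_per_unit / 2**a
    return (bits_per_unit / 2.0 ** a) ** (1.0 / (b - 1.0))
```

For a synthetic power law I(m) = 0.1 m^2 the crossing lands at m = 10; the paper's point is precisely that trusting such a crossing contradicts the monotonicity of the joint entropy, as argued next.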
The difference between the entropy of the independent model and the multi-information yields the joint entropy of the respective distribution. If the extrapolation is taken seriously, this difference seems to vanish at the order of 50 dimensions, suggesting that the joint entropy of the neural responses approaches zero at this size, say for 7 \u00d7 7 image patches (Fig. 4 C). Though it was not taken literally, this point of 'freezing' has been pointed out in [12] as a critical network size at which a transition to strong order is to be expected. The meaning of this assertion, however, is not clear. First of all, the joint entropy of a distribution can never be smaller than the joint entropy of any of its marginals. Therefore, the joint entropy cannot decrease with an increasing number of dimensions as the extrapolation would suggest (Fig. 4 C). Instead, it would be necessary to ask more precisely how the growth rate of the joint entropy can be characterized and whether there is a critical number of dimensions at which the growth rate suddenly drops. In our study with natural images, visual inspection does not indicate anything special happening at the 'critical patch size' of 7 \u00d7 7 pixels. Rather, for all patch sizes, the DG model yields dichotomized pink noise. In Fig. 5 (right) we show a sample from the DG model for 32 \u00d7 32 image patches (i.e. 1024 dimensions), which provides no indication of a particularly interesting change in the statistics towards strong order. The exact law according to which the multi-information grows with the number of dimensions for large m, however, is not easily assessed and remains to be explored.\n\nFinally, we point out that the sufficiency of pairwise correlations at the level of m = 16 dimensions no longer holds in the case of large m: the samples from the true distribution on the left-hand side of Fig. 5 clearly show much more structure than the samples from the DG model (Fig.
5, right), indicating that pairwise correlations do not suffice to determine the full statistics of large image patches. Even if the match between the DG model and the Ising model may turn out to be less accurate in high dimensions, this would not affect our conclusion: any mismatch would only introduce more order in the DG model than is justified by pairwise correlations alone.\n\n5 Conclusion and Outlook\n\nWe proposed a new approach to maximum entropy modeling of binary variables, extending maximum entropy analysis to previously infeasible high dimensions: as both sampling and finding parameters are easy for the dichotomized Gaussian model, it overcomes the computational drawbacks of Monte-Carlo methods. We verified numerically that the empirical entropy of the DG model is comparable to that obtained with Gibbs sampling at least up to 20 dimensions. For practical purposes, the DG distribution can even be superior to the Gibbs sampler in terms of entropy maximization due to the lack of independence between consecutive samples in the Gibbs sampler.\n\nAlthough the Ising model and the DG model are in principle different, the match between the two turns out to be surprisingly good for a large region of the parameter space. Currently, we are trying to determine where the close similarity between the Ising model and the DG model breaks down. In addition, we are exploring the possibility of using the dichotomized Gaussian distribution as a proposal density for Monte-Carlo methods such as importance sampling.
As it is a very close approximation to the Ising model, we expect this combination to yield highly efficient sampling behaviour. In summary, by linking the DG model to the Ising model, we believe that maximum entropy modeling of multivariate binary random variables will become much more practical in the future.\n\nWe used the DG model to investigate the role of second-order correlations in the context of sensory coding of natural images. While for small image patches the DG model provided an excellent fit to the true distribution, we were able to show that this agreement breaks down in the case of larger image patches. Thus caution is required when extrapolating from low-dimensional measurements to higher-dimensional distributions, because higher-order correlations may be invisible in low-dimensional marginal distributions. Nevertheless, the maximum entropy approach seems to be a promising tool for the analysis of correlated neural activities, and the DG model can facilitate its use significantly in practice.\n\nAcknowledgments\n\nWe thank Jakob Macke, Pierre Garrigues, and Greg Stephens for helpful comments and stimulating discussions, as well as Alexander Ecker and Andreas Hoenselaar for last minute advice. An implementation of the DG model in Matlab and R will be available at our website http://www.kyb.tuebingen.mpg.de/bethgegroup/code/DGsampling.\n\nReferences\n\n[1] D.H. Ackley, G.E. Hinton, and T.J. Sejnowski. A learning algorithm for Boltzmann machines. Cognitive Science, 9:147\u2013169, 1985.\n\n[2] H.B. Barlow. Sensory mechanisms, the reduction of redundancy, and intelligence. In The Mechanisation of Thought Processes, pages 535\u2013539, London: Her Majesty's Stationery Office, 1959.\n\n[3] M. Bethge. Factorial coding of natural images: How effective are linear models in removing higher-order dependencies? J. Opt. Soc. Am. A, 23(6):1253\u20131268, June 2006.\n\n[4] D.R. Cox and N. Wermuth.
On some models for multivariate binary variables parallel in complexity with the multivariate Gaussian distribution. Biometrika, 89:462\u2013469, 2002.\n\n[5] L.J. Emrich and M.R. Piedmonte. A method for generating high-dimensional multivariate binary variates. The American Statistician, 45(4):302\u2013304, 1991.\n\n[6] M. Huber. A bounding chain for Swendsen-Wang. Random Structures & Algorithms, 22:53\u201359, 2002.\n\n[7] E.T. Jaynes. Where do we stand on maximum entropy inference. In R.D. Levine and M. Tribus, editors, The Maximum Entropy Formalism. MIT Press, Cambridge, MA, 1978.\n\n[8] J. Lin. Divergence measures based on the Shannon entropy. IEEE Trans Inf Theory, 37:145\u2013151, 1991.\n\n[9] D.J.C. MacKay. Information Theory, Inference and Learning Algorithms. Cambridge University Press, 2003.\n\n[10] Sheila H. Nirenberg and Jonathan D. Victor. Analyzing the activity of large populations of neurons: how tractable is the problem? Current Opinion in Neurobiology, 17:397\u2013400, August 2007.\n\n[11] Karl Pearson. On a new method of determining correlation between a measured character a, and a character b, of which only the percentage of cases wherein b exceeds (or falls short of) a given intensity is recorded for each grade of a. Biometrika, 7:96\u2013105, 1909.\n\n[12] Elad Schneidman, Michael J. Berry, Ronen Segev, and William Bialek. Weak pairwise correlations imply strongly correlated network states in a neural population. Nature, 440(7087):1007\u20131012, April 2006.\n\n[13] J. Shlens, G.D. Field, J.L. Gauthier, M.I. Grivich, D. Petrusca, A. Sher, A.M. Litke, and E.J. Chichilnisky. The structure of multi-neuron firing patterns in primate retina. J Neurosci, 26(32):8254\u20138266, August 2006.\n\n[14] G. Tkacik, E. Schneidman, M.J. Berry, and W. Bialek. Ising models for networks of real neurons.
arXiv:q-bio.NC/0611072, 1:1\u20134, 2006.", "award": [], "sourceid": 296, "authors": [{"given_name": "Matthias", "family_name": "Bethge", "institution": null}, {"given_name": "Philipp", "family_name": "Berens", "institution": null}]}