{"title": "Why MCA? Nonlinear sparse coding with spike-and-slab prior for neurally plausible image encoding", "book": "Advances in Neural Information Processing Systems", "page_first": 2276, "page_last": 2284, "abstract": "", "full_text": "Why MCA? Nonlinear sparse coding with spike-and-\n\nslab prior for neurally plausible image encoding\n\nJacquelyn A. Shelton, Philip Sterne,\n\nJ\u00a8org Bornschein, Abdul-Saboor Sheikh,\n\nFrankfurt Institute for Advanced Studies\nGoethe-University Frankfurt, Germany\n\n{shelton, sterne, bornschein, sheikh}@fias.uni-frankfurt.de\n\nJ\u00a8org L\u00a8ucke\n\nFrankfurt Institute for Advanced Studies\n\nPhysics Dept., Goethe-University Frankfurt, Germany\n\nluecke@fias.uni-frankfurt.de\n\nAbstract\n\nModelling natural images with sparse coding (SC) has faced two main challenges:\n\ufb02exibly representing varying pixel intensities and realistically representing low-\nlevel image components. This paper proposes a novel multiple-cause generative\nmodel of low-level image statistics that generalizes the standard SC model in two\ncrucial points: (1) it uses a spike-and-slab prior distribution for a more realistic\nrepresentation of component absence/intensity, and (2) the model uses the highly\nnonlinear combination rule of maximal causes analysis (MCA) instead of a lin-\near combination. The major challenge is parameter optimization because a model\nwith either (1) or (2) results in strongly multimodal posteriors. We show for the\n\ufb01rst time that a model combining both improvements can be trained ef\ufb01ciently\nwhile retaining the rich structure of the posteriors. We design an exact piece-\nwise Gibbs sampling method and combine this with a variational method based\non preselection of latent dimensions. This combined training scheme tackles both\nanalytical and computational intractability and enables application of the model\nto a large number of observed and hidden dimensions. 
Applying the model to image patches we study the optimal encoding of images by simple cells in V1 and compare the model's predictions with in vivo neural recordings. In contrast to standard SC, we find that the optimal prior favors asymmetric and bimodal activity of simple cells. Testing our model for consistency we find that the average posterior is approximately equal to the prior. Furthermore, we find that the model predicts a high percentage of globular receptive fields alongside Gabor-like fields. Similarly high percentages are observed in vivo. Our results thus argue in favor of improvements of the standard sparse coding model for simple cells by using flexible priors and nonlinear combinations.

1 Introduction

Sparse Coding (SC) is one of the most popular algorithms for feature learning and has become a standard approach in Machine Learning, Computational Neuroscience, Computer Vision, and other related fields. It was first introduced as a model for the encoding of visual data in the primary visual cortex of mammals [1] and became the standard model to describe coding in simple cells. Following early recording studies [2] on simple cells, they were defined to be cells responding to localized, oriented and bandpass visual stimuli – sparse coding offered an optimal encoding explanation of such responses by assuming that visual components are (a) independent, (b) linearly superimposed, and (c) mostly inactive, with only a small subset of active components for a given image patch. More formally, sparse coding assumes that each observation y = (y_1, . . . , y_D) is associated with a (continuous or discrete) sparse latent variable s = (s_1, . . . , s_H), where sparsity implies that most of the components s_h in s are zero or close to zero. 
Each data point is generated according to the data model

p(y | Θ) = ∫_s p(y | s, Θ) p(s | Θ) ds,    (1)

with ∫_s integrating (or summing) over all hidden states and Θ denoting the model parameters. Typically, p(y | s, Θ) is modelled as a Gaussian with mean μ defined as μ = Σ_h s_h W_h, i.e. as a linear superposition of basis vectors W_h ∈ R^D. The most typical choice of prior p(s | Θ) is a Laplace distribution (which corresponds to L1 regularization).
The sparse coding generative model has remained essentially the same since its introduction, with most work focusing on efficient inference of optimal model parameters Θ (e.g., [3, 4]), usually exploiting unimodality of the resulting posterior probabilities. The standard form of the model offers many mathematically convenient advantages, but the inherent assumptions may not be appropriate if the goal is to accurately model realistic images. First, it has been pointed out that visual components – such as edges – are either present or absent, and this is poorly modelled with a Laplace prior because it lacks exact zeros. Recently, spike-and-slab distributions have been a favored alternative (e.g. [5, 6, 7, 8]) as they enable the modelling of visual component absence/presence (the spike) as well as the component's intensity distribution (the slab). Second, it has been pointed out that image components do not linearly superimpose to generate images, contrary to the standard sparse coding assumption. Alternatively, various nonlinear combinations of visual components have been investigated [9, 10, 11, 12]. Either modification (spike-and-slab prior or nonlinearities) leads to multimodal posteriors, making parameter optimization difficult. As a result these modifications have so far only been investigated separately. 
For linear sparse coding with a spike-and-slab prior the challenge for learning has been overcome by applying factored variational EM approaches [13, 5] or sampling [6]. Similarly, models with nonlinear superposition of components could be efficiently trained by applying a truncated variational EM approach [14, 12], although only by avoiding the analytical intractability that a continuous prior distribution introduces.
In this work we propose a sparse coding model that for the first time combines both of these improvements – a spike-and-slab distribution and nonlinear combination of components – in order to form a more realistic model of images. We address the optimization of our model by using a combined approximate inference approach with preselection of latents (for truncated variational EM [14]) in combination with Gibbs sampling [15]. First, we show on artificial data that the method efficiently and accurately infers all model parameters, including data noise and sparsity. Second, using natural image patches we show the model yields results consistent with in vivo recordings and that the model passes a consistency check which standard SC does not. Third, we show our model performs on par with other models on the functional benchmark tasks of denoising and inpainting.

2 The Generative Model: Nonlinear Spike-and-Slab Sparse Coding

We formulate the multi-causal data generation process as the probabilistic generative model:

p(y_d | s, Θ) = N(y_d; max_h{s_h W_dh}, σ²),    (2)

where max_h considers all H latent components and takes the h yielding the maximum value of s_h W_dh, and where s_h has a spike-and-slab distribution given by s_h = b_h z_h and parameterized by Eqs. 3 and 4 below: the spike (Eq. 3) is parameterized by π and the slab (Eq. 4) by μ_pr and σ_pr. The observation noise has a single parameter, σ. 
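Ancestral sampling from this generative model can be sketched in a few lines (a minimal numpy illustration with toy dimensions; the function and variable names are ours, not from the paper):

```python
import numpy as np

def sample_patches(W, pi, mu_pr, sigma_pr, sigma, N, rng):
    """Ancestral sampling from the nonlinear spike-and-slab model (Eqs. 2-4).

    W: (D, H) generative fields; pi: spike (activation) probability;
    mu_pr, sigma_pr: slab mean/std; sigma: observation noise std.
    """
    D, H = W.shape
    b = rng.random((N, H)) < pi                    # spike: b_h ~ Bern(pi)
    z = rng.normal(mu_pr, sigma_pr, size=(N, H))   # slab:  z_h ~ N(mu_pr, sigma_pr^2)
    s = b * z                                      # s_h = b_h * z_h
    # nonlinear combination rule: mean_d = max_h { s_h W_dh }  (not a sum)
    mean = np.max(s[:, None, :] * W[None, :, :], axis=2)       # (N, D)
    return mean + rng.normal(0.0, sigma, size=(N, D))          # Gaussian pixel noise

rng = np.random.default_rng(0)
W = rng.random((25, 10))                 # D = 25 pixels, H = 10 latent causes
Y = sample_patches(W, pi=0.2, mu_pr=2.0, sigma_pr=0.5, sigma=0.1, N=1000, rng=rng)
```

Note that inactive latents (b_h = 0) contribute exactly zero before the max is taken, which is what distinguishes this generation process from the linear superposition of standard SC.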
p(b_h | Θ) = B(b_h; π) = π^(b_h) (1 − π)^(1−b_h),    (3)

p(z_h | Θ) = N(z_h; μ_pr, σ²_pr).    (4)

The columns of the matrix W = (W_dh) are the generative fields, W_h, one associated with each latent variable s_h. We denote the set of all parameters with Θ. We will be interested in working with the posterior over the latents, given by

p(s | y, Θ) = p(y | s, Θ) p(s | Θ) / ∫_s' p(y | s', Θ) p(s' | Θ) ds'.    (5)

Figure 1: Generation according to different sparse coding generative models using the same generative fields. A 20 generative fields in the form of random straight lines. B Examples of patches generated according to three generative models, all using the fields in A. Top row: standard linear sparse coding with Laplace prior. Middle row: linear sparse coding with spike-and-slab prior. Bottom row: spike-and-slab sparse coding with max superposition. The latter two use the same prior parameterization (with positive slab). Generated patches are individually scaled to fill the color space (with zero fixed to green). C A natural image with two patches highlighted (magnifications show their preprocessed form). D Linear and nonlinear superposition of two single components for comparison with the actual superposition in C.

As in standard sparse coding, the model assumes independent latents and, given the latent variables, the observations are distributed according to a Gaussian distribution. Unlike standard sparse coding, the latent variables are not distributed according to a Laplace prior and the generative fields (or basis functions) are not combined linearly. Fig. 
1 illustrates the model differences between a Laplace prior and a spike-and-slab prior, and the differences between linear and nonlinear superposition. As can be observed, standard sparse coding results in strong interference when basis functions overlap. For spike-and-slab sparse coding most components are exactly zero, but interference between them remains strong because of their linear superposition. Combining a spike-and-slab prior with nonlinear composition allows minimal interference between the bases and ensures that latents can be exactly zero, which creates strongly multimodal posteriors since data must be explained by either one cause or another. For comparison, the combination of two real image components is highlighted in Fig. 1C (lower patch). Linear and nonlinear superposition of two basis functions resembling single components is shown in Fig. 1D. This suggests that superposition defined by max represents a better model of occluding components (compare [11, 12]).
In this paper we use expectation maximization (EM) to estimate the model parameters Θ, and we use sampling after latent preselection [15] to represent the posterior distribution over the latent space. Optimization in the EM framework entails setting the derivatives of the free energy to zero and solving for the model parameters (M-step equations) (e.g., [16]). As an example we obtain the following formula for the estimate of image noise:

σ̂² = (1 / NDK) Σ_n Σ_d Σ_k ( max_h{W_dh s_h^(n,k)} − y_d^(n) )²,    (6)

where we average over all N observed data points, D observed dimensions, and K Gibbs samples. However, this notation is rather unwieldy for a simple underlying idea. As such we will use the following notation:

σ̂² = ⟨( W_dh s_h − y_d^(n) )²⟩_*,    (7)

where we maximize for h and average over n and d. 
That is, we denote the expectation values ⟨ . ⟩_* to mean the following:

⟨f(s)⟩_* = Σ_n [ ∫_s p(s | y^(n), Θ) f(s) δ(h is max) ds / ∫_s p(s | y^(n), Θ) δ(h is max) ds ],    (8)

where δ is the indicator function denoting the domain to integrate over, namely where h is the maximum. This allows a condensed expression of the rest of the update equations:

Ŵ_dh = ⟨s_h y_d⟩_* / ⟨s_h²⟩_*,    π̂ = ⟨δ(s_h ≠ 0)⟩,    (9)

μ̂_pr = ⟨s_h⟩_*,    σ̂²_pr = ⟨(s_h − μ̂_pr)²⟩_*,    (10)

where we observe that in order to optimize the parameters we need to calculate several expectation values with respect to the posterior distribution. As discussed, however, the posterior distribution of a model with a spike-and-slab prior in both the linear and nonlinear cases is strongly multimodal, and such posteriors are difficult to infer and represent. This is illustrated in Fig. 2. Calculating expectations of this posterior is analytically intractable, thus we use Gibbs sampling to approximate the expectations.

Figure 2: Illustration of an H=2-dimensional spike-and-slab prior over latents and the multimodal posterior distribution induced by this prior for both linear and nonlinear data likelihoods. [Panels: spike-and-slab prior p(s); data likelihood p(y_d | s) and posterior p(s | y_d) under a linear combination and under the point-wise maximum.]

3 Inference: Exact Gibbs Sampling with Preselection of Latents

In order to efficiently handle the intractabilities posed by our model and the complex posterior (multimodal, high dimensional) illustrated in Fig. 2, we take a combined approximate inference approach. 
Specifically, we do exact Gibbs sampling from the posterior after we have preselected the most relevant set of latent states using a truncated variational form of EM. Preselection is not strictly necessary, but significantly helps with the computational intractability faced in high dimensions. As such, we will first describe the sampling step and preselection only later.
Gibbs Sampling. Our main technical contribution towards efficient inference in this model is an exact Gibbs sampler for the multimodal posterior. Previous work has used Gibbs sampling in combination with spike-and-slab models [17], and for increased efficiency in sparse Bayesian inference [18]. Our aim is to construct a Markov chain with the target density given by the conditional posterior distribution:

p(s_h | s_{H\h}, y, Θ) ∝ p(s_h | Θ) Π_{d=1}^{D} p(y_d | s_h, s_{H\h}, Θ).    (11)

We see from Eq. 11 that the distribution factorizes into D + 1 factors: a single factor for the prior and D factors, one for each likelihood term. 
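For intuition, this factorization can be evaluated numerically on a grid of candidate s_h values (a minimal numpy sketch with illustrative names, covering only the Gaussian slab factor of the prior — the Bernoulli spike contributes a separate point mass at s_h = 0; the exact piecewise sampler derived next needs no such discretization):

```python
import numpy as np

def log_conditional(s_h_grid, h, s, W, y, mu_pr, sigma_pr, sigma):
    """Unnormalized log of the conditional posterior of Eq. 11 on a grid of
    candidate values for s_h (slab part only, i.e. conditioned on s_h != 0):
    one prior factor plus D likelihood factors."""
    D, H = W.shape
    others = np.arange(H) != h
    # prior (slab) factor: log N(s_h; mu_pr, sigma_pr^2), constants dropped
    logp = -0.5 * ((s_h_grid - mu_pr) / sigma_pr) ** 2
    for d in range(D):                                  # D likelihood factors
        background = np.max(s[others] * W[d, others])   # max over causes h' != h
        mean = np.maximum(background, s_h_grid * W[d, h])
        logp += -0.5 * ((y[d] - mean) / sigma) ** 2     # log N(y_d; mean, sigma^2)
    return logp
```

Each likelihood factor in this sum is constant for s_h below its transition point (background / W_dh) and Gaussian above it — exactly the piecewise structure that the exact sampler exploits without any grid.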
For the point-wise maximum nonlinear case we are considering, the likelihood of a single y_d is a piecewise function defined as follows:

p(y_d | s_h, s_{H\h}, Θ) = N(y_d; max_h'{W_dh' s_h'}, σ²)    (12)

  = N(y_d; max_{h'\h}{W_dh' s_h'}, σ²) = exp(l_d(s_h))  (constant in s_h),  if s_h < P_d
  = N(y_d; W_dh s_h, σ²) = exp(r_d(s_h)),  if s_h ≥ P_d,    (13)

where the transition point is defined as the point where s_h W_dh becomes the maximal cause:

P_d = max_{h'\h}{W_dh' s_h'} / W_dh.    (14)

Figure 3: Illustration of the Gibbs sampler for an MCA-induced posterior. Left column: three contributing factors for the posterior ∝ p(s_h | s_\h, y, Θ) in log space. A and B: Log likelihood functions, each defined by a transition point P_d and left and right pieces l_d(s_h) and r_d(s_h). C Log prior, which consists of an overall Gaussian and the Dirac peak at s_h = 0. D Log posterior: the sum of functions A, B, and C consists of D + 1 pieces plus the Dirac peak at s_h = 0. E Exponentiation of the log posterior in D. F CDF for s_h from which we do inverse transform sampling.

We refer to the two pieces of y_d in Eq. 13 as the left piece of the function when s_h < P_d and the right piece when s_h ≥ P_d. The left piece is constant w.r.t. s_h because the data is explained by another cause when s_h < P_d, and the right piece is a truncated Gaussian when considered as a PDF of s_h (see Fig. 
3A). We take the logarithm of p(y_d | s_h, s_{H\h}, Θ), which transforms the equation into a left-piece constant and right-piece quadratic function that we can easily sum together. The sum of these D functions results in one function with D + 1 segments m_i(s_h), with transition points P_d given by the individual functions. We first sort the functions according to their transition points, denoted here by δ = argsort_d(P_d), such that we can efficiently calculate the summation over these functions:

m_i(s_h) = Σ_{j=1}^{i−1} r_δ(j)(s_h) + Σ_{u=i}^{D} l_δ(u)(s_h),   for 1 ≤ i ≤ D + 1,    (15)

where the left and right pieces are referred to as l_i(s_h) and r_i(s_h) (as in Eq. 13), respectively. Since all pieces l_i(s_h) and r_i(s_h) are polynomials of 2nd degree, the result is still a 2nd degree polynomial.
We incorporate the prior in two steps. The Gaussian slab of the prior is taken into account by adding its 2nd degree polynomial to all the pieces m_i(s_h), which also ensures that every piece is a Gaussian.
To construct the piecewise cumulative distribution function (CDF), we relate each segment m_i(s_h) to the Gaussian ∝ exp(m_i(s_h)) it defines. Next, the Bernoulli component of the prior is accounted for by introducing the appropriate step into the CDF at s_h = 0 (see Fig. 3F). Once the CDF is constructed, we simulate each s_h from the exact conditional distribution (s_h ∼ p(s_h | s_\h, y, Θ)) by inverse transform sampling. Fig. 3 illustrates the entire process.
Preselection. To reduce the computational load of inference in our model, we can optionally preselect the most relevant latent variables before doing Gibbs sampling. 
This can be formulated as a variational approximation to exact inference [14] where the posterior distribution p(s | y^(n), Θ) is approximated by a distribution q_n(s; Θ) which only has support on a subset K_n of the latent state space:

p(s | y^(n), Θ) ≈ q_n(s; Θ) = [ p(s | y^(n), Θ) / Σ_{s' ∈ K_n} p(s' | y^(n), Θ) ] δ(s ∈ K_n),    (16)

where δ(s ∈ K_n) = 1 if s ∈ K_n and zero otherwise. The subsets K_n are chosen in a data-driven way using a deterministic selection function; they vary per data point y^(n), and should contain most of the probability mass p(s | y) while also being significantly smaller than the entire latent space. Using such subsets K_n, Eqn. 16 results in good approximations to the posteriors. We define K_n as K_n = {s | for all h ∉ I : s_h = 0}, where I contains the indices of the latents estimated to be most relevant for y^(n). To obtain these latent indices we use a selection function of the form:

S_h(y^(n)) = −‖W_h − y^(n)‖²₂ / ‖W_h‖₂,    (17)

to select the H' < H highest scoring latents for I. This boils down to selecting the H' dictionary elements that are most similar to each datapoint, and hence most likely to have generated the datapoint. We then sample from this reduced latent set.

Figure 4: Results of 10 experimental runs with 30 EM iterations on the same artificial ground-truth data generated according to the model. We accurately recover the ground-truth parameters, which are plotted with dotted lines. 
A Random selection of data y^(n). B The set of learned generative fields W_h. C Data noise σ. D Sparsity H × π. E Prior standard deviation σ_pr. F Prior mean μ_pr.

4 Experiments

We first investigate the performance of our algorithm on ground-truth artificial data. Second, we apply our model to natural image patches and compare with in vivo recordings from various sources. Third, we investigate the applicability of our algorithm on functional benchmark tasks. All experiments were performed using a parallel implementation of the EM algorithm [19]. Small scale experiments were run on a single multicore machine, while larger scale experiments were typically run on a cluster with 320 CPU cores in parallel.
For all experiments described, 1/3 of the samples drawn are used for burn-in, and the remaining 2/3 are used for computing the expectations. We initialize the parameters by setting σ_pr and σ equal to the standard deviation observed in the data; the prior mean μ_pr is initialized to the observed data mean. W is initialized at the observed data mean with additive Gaussian noise of the σ observed in the data, but we enforce the constraint that ⟨W_dh⟩ = 1, i.e. Σ_{d=1}^{D} W_dh = D for all h, so that each W_dh is approximately equal to one.
Artificial Data. The goal of the first set of experiments is to verify that our model and inference method produce an algorithm that can (1) recover ground-truth parameters from data that is generated according to the model and (2) reliably converge to locally optimal solutions. We generate ground-truth data with N = 2,000 consisting of D = 5 × 5 = 25 observed and H = 10 hidden dimensions according to our model: N images with overlapping 'bars' of varying intensities and with Gaussian observation noise of variance σ_gt = 2 (Fig. 4A). On average, each data point contains two bars, i.e. π = 2/H. Results (Fig. 
4B–F) show that our algorithm converges to globally optimal solutions and recovers the generating ground-truth parameters. Here we drew 30 samples from the posterior and set H' = H, but investigating a range of sample numbers and H' values yields the same results, suggesting that our approximation parameters do not have an effect on our results (see Supp. Material for more experiments on this dataset).
Natural Image Patches. We applied our model to N = 50,000 image patches of 16 × 16 pixels. The patches were extracted from the van Hateren natural image database [20] and subsequently preprocessed using pseudo-whitening [1]. We split the image patches into a positive and a negative channel to ensure y_d ≥ 0: each image patch ỹ of size D̃ = 16 × 16 is converted into a datapoint of size D = 2D̃ by assigning y_d = [ỹ_d]+ and y_{D̃+d} = [−ỹ_d]+, where [x]+ = x for x > 0 and [x]+ = 0 otherwise. This can be motivated by the transfer of visual information by center-on and center-off cells of the mammalian lateral geniculate nucleus (LGN). In a final step, as a form of local contrast normalization, we scaled each image patch so that 0 ≤ y_d ≤ 10.
After 50 EM iterations with 100 samples per datapoint the model parameters had converged and the learned dictionary elements W_h represent a variety of Gabor wavelet and Difference of Gaussians (DoG) like shapes (see Fig. 5A). We observe a mean activation of μ_pr = 0.47 with standard deviation σ_pr = 0.13, i.e., we infer a strongly bimodal prior (Fig. 5D). The final sparseness was πH = 6.2, which means that an average of roughly six latent variables were active in every image patch. The inferred observation noise was σ = 1.4. 
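The positive/negative channel splitting described above can be sketched as follows (a minimal numpy illustration; the patch extraction and pseudo-whitening steps are omitted, and the function name is ours):

```python
import numpy as np

def split_channels(patch):
    """Split a preprocessed image patch into ON/OFF channels so that all
    observed values are non-negative: y = ([y~_d]_+, [-y~_d]_+), doubling D."""
    flat = patch.ravel()                            # size D~ (e.g. 16*16)
    return np.concatenate([np.maximum(flat, 0.0),   # y_d        = [ y~_d]_+
                           np.maximum(-flat, 0.0)]) # y_{D~ + d} = [-y~_d]_+

patch = np.random.default_rng(0).normal(size=(16, 16))  # stand-in for a whitened patch
y = split_channels(patch)
```

No information is lost in this step: the original patch is recovered as the difference of the two channels, while each channel taken alone is non-negative, as required by the model.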
To quantitatively interpret the learned fields, we

Figure 5: Results after training our model on N = 50,000 image patches of size 16 × 16 using H = 500 latent units. A Selection of inferred dictionary elements W_h. The full set of elements is shown in the supplementary material. B After fitting with Gabor wavelets and DoGs, 135 fields (27%) are classified as being globular. The fractions of globular fields measured in vivo are shown for comparison. C n_x/n_y Gabor statistics plot of estimated receptive fields (blue circles, see Supp. D), overlayed with the distribution reported by Ringach (in vivo macaque, red triangles). We intentionally exclude fields best fit by DoGs, removing the typical cluster observed at (0,0) (see Supp. D). D Visualization of the prior inferred by our model: on average πH = 6.2 dictionary elements are active per datapoint. E Histogram of the actual activation of three selected dictionary elements: a Gabor wavelet, a DoG, and a field encoding low-frequency input. The bimodal pattern
The bimodal pattern\nclosely resembles the prior activation inferred in D.\n\nperform reverse correlation on the learned generative \ufb01elds and \ufb01t the resulting estimated receptive\n\ufb01elds with Gabor wavelets and DoGs (see Supp. D for details). Next, we classify the \ufb01elds as either\norientation-sensitive Gabor wavelets or \u2018globular\u2019 \ufb01elds best matched by DoGs. In Fig. 5C) we then\nplot only the \ufb01elds classi\ufb01ed as Gabors, leaving out all DoG \ufb01elds.\nNotably, the proportion of globular \ufb01elds predicted by the model (Fig. 5B) is similarly high as those\nfound in different species [21, 22, 23] (see next section for a discussion). Fig. 5D-E compares the\noptimal prior distribution with the average posterior distribution for several latent variables (with\ntheir associated generative \ufb01elds shown in insets). It is a necessary condition of the correct model of\nthe data that the posterior averaged over the datapoints y(n) matches the prior, since the following\nholds (compare, e.g., [24]):\n\nlim\nN\u2192\u221e\n\n(18)\n\n1\n\nN\ufffdn p(s| y(n), \u0398) = p(s| \u0398).\n\nOur model satis\ufb01es this condition; the average posterior over these \ufb01elds closely resembles the\noptimal prior, which is a test standard sparse coding fails (see [17] for a discussion).\nFunctional Tasks. We also apply our model to the task of image inpainting and image denoising.\nGiven that we propose our model to be able to realistically model low-level image statistics, we\nexpect it to perform well on these tasks. Results show that our algorithm performs on par with the\nlatest benchmarks obtained by other algorithms. See Supp. Material for details and examples.\n\n5 Discussion\n\nIn this work, we de\ufb01ned and studied a sparse coding model that, for the \ufb01rst time, combines a spike-\nand-slab prior with a nonlinear combination of dictionary elements. 
To address the optimization of our model, we designed an exact piecewise Gibbs sampling method combined with a variational method based on preselection of latent dimensions. This combined training scheme tackles both analytical and computational intractability and enables application of the model to a large number of observed and hidden dimensions. The learning algorithm derived for the model enables the efficient inference of all model parameters, including sparsity and prior parameters.
The spike-and-slab prior used in this study can parameterize prior distributions which are symmetric and unimodal (spike on top of the Gaussian) as well as strongly bimodal distributions with the Gaussian mean being significantly different from zero. However, inferring the correct prior distribution requires sophisticated inference and learning schemes. Standard sparse coding with MAP-based approximation only optimizes the basis functions [25, 4]. Namely, the prior shape remains fixed except for its weighting factor (the regularization parameter), which is typically only inferred indirectly (if at all) using cross-validation. Very few sparse coding approaches infer prior parameters directly. One example is an approach using a mixture-of-Gaussians (MoG) prior [17], which applies Gibbs sampling for inference. The MoG prior can model multimodality, but in numerical experiments on image patches the mixture components were observed to converge to a unimodal prior – which may be caused by the assumed linear superposition or by the Gibbs sampler not mixing sufficiently. When the MoG prior was fixed to be trimodal, no instructive generative fields were observed [17]. Another example of sparse coding with prior inference is a more recent approach which uses a parameterized Student-t distribution as prior and applies sampling to infer the sparsity [26]. A Student-t distribution cannot model multimodality, however. 
The work in [27] uses a trimodal prior for image patches, but shape and sparsity remain fixed, i.e. the study does not answer how optimal such a prior may be. In contrast, we have shown in this study that the prior shape and sparsity level can be inferred from image data. The resulting prior is strongly bimodal, and control experiments confirm a high consistency of the prior with the average posterior (Fig. 5D–E). Standard sparse coding approaches typically fail in such controls, which may be taken as early evidence for bimodal or multimodal priors being more optimal (see [17]).
Together with a bimodal prior, our model infers Gabor and difference-of-Gaussians (DoG) functions as the optimal basis functions for the used image patches. While Gabors are the standard outcome of sparse coding, DoGs had not been predicted by sparse coding until very recently. Indeed, DoG or 'globular' fields were identified as the main qualitative difference between experimental measurements of V1 simple cells and theoretical predictions [21]. A number of studies have since shown that globular fields can emerge in applications of computational models to image patches [28, 27, 29, 30, 31, 12, 32]. One study [29] has shown that globular fields can be obtained with standard sparse coding by choosing specific values for overcompleteness and sparsity (i.e. prior shape and sparsity are not inferred from data). The studies [27, 31, 32] assume a restricted set of values for latent variables and yield relatively high proportions of globular fields, suggesting that the emergence of globular fields is due to hard constraints on the latents. On the other hand, the studies [28, 30, 12] suggest that globular fields are a consequence of occlusion nonlinearities. 
Our study argues in favor of the occlusion interpretation for the emergence of globular fields, because the model studied here shows that high percentages of globular fields emerge with a prior that is (a) inferred from data and (b) allows for a continuous distribution of latent values.
In summary, the main results obtained by applying the novel model to preprocessed images are: (1) the observation that a bimodal prior is preferred over a unimodal one for optimal image coding, and (2) that high percentages of globular fields are predicted. The sparse bimodal prior is consistent with sparse and positive neural activity for the encoding of image components in V1, and the high percentage of globular fields is consistent with recent in vivo recordings of simple cells. Our model therefore links improvements on optimal image encoding to a high consistency with neural data.

Acknowledgements. We acknowledge support by the German Research Foundation (DFG) in the project LU 1196/4-2, by the German Federal Ministry of Education and Research (BMBF), project 01GQ0840, and by the LOEWE Neuronale Koordination Forschungsschwerpunkt Frankfurt (NeFF). Furthermore, we acknowledge support by the Frankfurt Center for Scientific Computing (CSC).

References
[1] B. Olshausen and D. Field. Emergence of simple-cell receptive field properties by learning a sparse code for natural images. Nature, 381:607–9, 1996.
[2] D. H. Hubel and T. N. Wiesel. Receptive fields of single neurones in the cat's striate cortex. The Journal of Physiology, 1959.
[3] M. Seeger. Bayesian inference and optimal design for the sparse linear model. Journal of Machine Learning Research, 9:759–813, June 2008.
[4] H. Lee, A. Battle, R. Raina, and A. Ng. Efficient sparse coding algorithms. In Advances in Neural Information Processing Systems, volume 20, pages 801–08, 2007.
[5] M. Titsias and M. Lázaro-Gredilla. 
Spike and slab variational inference for multi-task and multiple kernel learning. In Advances in Neural Information Processing Systems, 2011.
[6] S. Mohamed, K. Heller, and Z. Ghahramani. Evaluating Bayesian and L1 approaches for sparse unsupervised learning. In ICML, 2012.
[7] I. Goodfellow, A. Courville, and Y. Bengio. Large-scale feature learning with spike-and-slab sparse coding. In ICML, 2012.
[8] Jörg Lücke and Abdul-Saboor Sheikh. Closed-form EM for sparse coding and its application to source separation. In LVA/ICA, LNCS, pages 213–221. Springer, 2012.
[9] E. Saund. A multiple cause mixture model for unsupervised learning. Neural Computation, 1995.
[10] P. Dayan and R. S. Zemel. Competition and multiple cause models. Neural Computation, 1995.
[11] J. Lücke and M. Sahani. Maximal causes for non-linear component extraction. Journal of Machine Learning Research, 9:1227–67, 2008.
[12] G. Puertas, J. Bornschein, and J. Lücke. The maximal causes of natural scenes are edge filters. In Advances in Neural Information Processing Systems, volume 23, pages 1939–47. 2010.
[13] I. Goodfellow, A. Courville, and Y. Bengio. Spike-and-slab sparse coding for unsupervised feature discovery. In NIPS Workshop on Challenges in Learning Hierarchical Models. 2011.
[14] Jörg Lücke and Julian Eggert. Expectation truncation and the benefits of preselection in training generative models. Journal of Machine Learning Research, 11:2855–900, 2010.
[15] J. Shelton, J. Bornschein, A.-S. Sheikh, P. Berkes, and J. Lücke. Select and sample - a model of efficient neural inference and learning. Advances in Neural Information Processing Systems, 24, 2011.
[16] R. Neal and G. Hinton. A view of the EM algorithm that justifies incremental, sparse, and other variants. In M. I. Jordan, editor, Learning in Graphical Models. Kluwer, 1998.
[17] B. 
Olshausen and K. Millman. Learning sparse codes with a mixture-of-Gaussians prior. Advances in Neural Information Processing Systems, 12:841–847, 2000.
[18] X. Tan, J. Li, and P. Stoica. Efficient sparse Bayesian learning via Gibbs sampling. In ICASSP, pages 3634–3637, 2010.
[19] J. Bornschein, Z. Dai, and J. Lücke. Approximate EM learning on large computer clusters. In NIPS Workshop: LCCC. 2010.
[20] J. H. van Hateren and A. van der Schaaf. Independent component filters of natural images compared with simple cells in primary visual cortex. Proceedings of the Royal Society of London B, 265:359–66, 1998.
[21] D. Ringach. Spatial structure and symmetry of simple-cell receptive fields in macaque primary visual cortex. Journal of Neurophysiology, 88:455–63, 2002.
[22] W. M. Usrey, M. P. Sceniak, and B. Chapman. Receptive fields and response properties of neurons in layer 4 of ferret visual cortex. Journal of Neurophysiology, 89:1003–1015, 2003.
[23] C. Niell and M. Stryker. Highly selective receptive fields in mouse visual cortex. The Journal of Neuroscience, 28(30):7520–7536, 2008.
[24] P. Berkes, G. Orban, M. Lengyel, and J. Fiser. Spontaneous cortical activity reveals hallmarks of an optimal internal model of the environment. Science, 331(6013):83–87, January 2011.
[25] B. Olshausen and D. Field. Sparse coding with an overcomplete basis set: A strategy employed by V1? Vision Research, 37(23):3311–3325, December 1997.
[26] P. Berkes, R. Turner, and M. Sahani. On sparsity and overcompleteness in image models. Advances in Neural Information Processing Systems, 21, 2008.
[27] M. Rehn and F. Sommer. A network that uses few active neurones to code visual input predicts the diverse shapes of cortical receptive fields. Journal of Computational Neuroscience, 22(2):135–46, 2007.
[28] J. Lücke. 
A dynamical model for receptive field self-organization in V1 cortical columns. In Proc. International Conference on Artificial Neural Networks, LNCS 4669, pages 389–398. Springer, 2007.
[29] B. A. Olshausen, C. Cadieu, and D. K. Warland. Learning real and complex overcomplete representations from the statistics of natural images. Proc. SPIE, (7446), 2009.
[30] J. Lücke. Receptive field self-organization in a model of the fine-structure in V1 cortical columns. Neural Computation, 21(10):2805–45, 2009.
[31] M. Henniges, G. Puertas, J. Bornschein, J. Eggert, and J. Lücke. Binary sparse coding. In Proceedings LVA/ICA, LNCS 6365, pages 450–57. Springer, 2010.
[32] J. Zylberberg, J. Murphy, and M. Deweese. A sparse coding model with synaptically local plasticity and spiking neurons can account for the diverse shapes of V1 simple cell receptive fields. PLoS Computational Biology, 7(10):e1002250, 2011.
[33] M. Zhou, H. Chen, J. Paisley, L. Ren, G. Sapiro, and L. Carin. Non-parametric Bayesian dictionary learning for sparse image representations. In NIPS Workshop. 2009.