{"title": "Select and Sample - A Model of Efficient Neural Inference and Learning", "book": "Advances in Neural Information Processing Systems", "page_first": 2618, "page_last": 2626, "abstract": "An increasing number of experimental studies indicate that perception encodes a posterior probability distribution over possible causes of sensory stimuli, which is used to act close to optimally in the environment. One outstanding difficulty with this hypothesis is that the exact posterior will in general be too complex to be represented directly, and thus neurons will have to represent an approximation of this distribution. Two influential proposals of efficient posterior representation by neural populations are: 1) neural activity represents samples of the underlying distribution, or 2) they represent a parametric representation of a variational approximation of the posterior. We show that these approaches can be combined for an inference scheme that retains the advantages of both: it is able to represent multiple modes and arbitrary correlations, a feature of sampling methods, and it reduces the represented space to regions of high probability mass, a strength of variational approximations. Neurally, the combined method can be interpreted as a feed-forward preselection of the relevant state space, followed by a neural dynamics implementation of Markov Chain Monte Carlo (MCMC) to approximate the posterior over the relevant states. We demonstrate the effectiveness and efficiency of this approach on a sparse coding model. In numerical experiments on artificial data and image patches, we compare the performance of the algorithms to that of exact EM, variational state space selection alone, MCMC alone, and the combined select and sample approach. The select and sample approach integrates the advantages of the sampling and variational approximations, and forms a robust, neurally plausible, and very efficient model of processing and learning in cortical networks. 
For sparse coding we show applications easily exceeding a thousand observed and a thousand hidden dimensions.", "full_text": "Select and Sample — A Model of Efficient Neural Inference and Learning

Jacquelyn A. Shelton, Jörg Bornschein, Abdul-Saboor Sheikh
Frankfurt Institute for Advanced Studies, Goethe-University Frankfurt, Germany
{shelton,bornschein,sheikh}@fias.uni-frankfurt.de

Pietro Berkes
Volen Center for Complex Systems, Brandeis University, Boston, USA
berkes@brandeis.edu

Jörg Lücke
Frankfurt Institute for Advanced Studies, Goethe-University Frankfurt, Germany
luecke@fias.uni-frankfurt.de

Abstract

An increasing number of experimental studies indicate that perception encodes a posterior probability distribution over possible causes of sensory stimuli, which is used to act close to optimally in the environment. One outstanding difficulty with this hypothesis is that the exact posterior will in general be too complex to be represented directly, and thus neurons will have to represent an approximation of this distribution. Two influential proposals of efficient posterior representation by neural populations are: 1) neural activity represents samples of the underlying distribution, or 2) it represents a parametric variational approximation of the posterior. We show that these approaches can be combined in an inference scheme that retains the advantages of both: it is able to represent multiple modes and arbitrary correlations, a feature of sampling methods, and it reduces the represented space to regions of high probability mass, a strength of variational approximations. Neurally, the combined method can be interpreted as a feed-forward preselection of the relevant state space, followed by a neural-dynamics implementation of Markov Chain Monte Carlo (MCMC) that approximates the posterior over the relevant states. 
We demonstrate the effectiveness and efficiency of this approach on a sparse coding model. In numerical experiments on artificial data and image patches, we compare the performance of exact EM, variational state space selection alone, MCMC alone, and the combined select and sample approach. The select and sample approach integrates the advantages of the sampling and variational approximations, and forms a robust, neurally plausible, and very efficient model of processing and learning in cortical networks. For sparse coding we show applications easily exceeding a thousand observed and a thousand hidden dimensions.

1 Introduction

According to the recently quite influential statistical approach to perception, our brain represents not only the most likely interpretation of a stimulus, but also its corresponding uncertainty. In other words, ideally the brain would represent the full posterior distribution over all possible interpretations of the stimulus, which is statistically optimal for inference and learning [1, 2, 3] – a hypothesis supported by an increasing number of psychophysical and electrophysiological results [4, 5, 6, 7, 8, 9].

Although it is generally accepted that humans indeed maintain a complex posterior representation, one outstanding difficulty with this approach is that the full posterior distribution is in general very complex, as it may be highly correlated (due to explaining-away effects), multimodal (multiple possible interpretations), and very high-dimensional. One approach to address this problem in neural circuits is to let neuronal activity represent the parameters of a variational approximation of the real posterior [10, 11]. 
Although this approach can approximate the full posterior, the number of neurons explodes with the number of variables – for example, approximation via a Gaussian distribution requires N^2 parameters to represent the covariance matrix over N variables. Another approach is to identify neurons with variables and interpret neural activity as samples from their posterior [12, 13, 3]. This interpretation is consistent with a range of experimental observations, including neural variability (which would result from the uncertainty in the posterior) and spontaneous activity (corresponding to samples from the prior in the absence of a stimulus) [3, 9]. The advantage of sampling is that the number of neurons scales linearly with the number of variables, and it can represent arbitrarily complex posterior distributions given enough samples. The latter point is the issue: collecting a sufficient number of samples to form such a complex, high-dimensional representation is quite time-costly. Modeling studies have shown that a small number of samples is sufficient to perform well on low-dimensional tasks (intuitively, because taking a low-dimensional marginal of the posterior accumulates samples over all dimensions) [14, 15]. However, most sensory data is inherently very high-dimensional. To faithfully represent visual scenes containing potentially many objects and object parts, one requires a high-dimensional latent space to represent the high number of potential causes, which returns to the problem sampling approaches face in high dimensions.
The goal of the line of research pursued here is to address the following questions: 1) can we find a sophisticated representation of the posterior for very high-dimensional hidden spaces? 
2) as this goal is believed to be shared by the brain, can we find a biologically plausible solution reaching it?
In this paper we propose a novel approach to approximate inference and learning that addresses the drawbacks of sampling as a neural processing model, yet maintains its beneficial posterior representation and neural plausibility. We show that sampling can be combined with a preselection of candidate units. Such a selection connects sampling to the influential models of neural processing that emphasize feed-forward processing ([16, 17] and many more), and is consistent with the popular view of neural processing and learning as an interplay between feed-forward and recurrent stages of processing [18, 19, 20, 21, 12]. Our combined approach emerges naturally by interpreting feed-forward selection and sampling as approximations to exact inference in a probabilistic framework for perception.

2 A Select and Sample Approach to Approximate Inference

Inference and learning in neural circuits can be regarded as the task of inferring the true hidden causes of a stimulus. An example is inferring the objects in a visual scene based on the image projected on the retina. We will refer to the sensory stimulus (the image) as a data point, y = (y_1, ..., y_D), and to the hidden causes (the objects) as s = (s_1, ..., s_H), with s_h denoting hidden variable or hidden unit h. The data distribution can then be modeled by a generative data model, p(\vec{y} \,|\, \Theta) = \sum_{\vec{s}} p(\vec{y} \,|\, \vec{s}, \Theta)\, p(\vec{s} \,|\, \Theta), with Θ denoting the parameters of the model¹. If we assume that the data distribution can be optimally modeled by the generative distribution for optimal parameters Θ*, then the posterior probability p(s | y, Θ*) represents optimal inference given a data point y. The parameters Θ* given a set of N data points Y = {y^(1), ... 
, y^(N)} are given by the maximum likelihood parameters \Theta^* = \mathrm{argmax}_{\Theta}\, p(Y \,|\, \Theta).
A standard procedure to find the maximum likelihood solution is expectation maximization (EM). EM iteratively optimizes a lower bound of the data likelihood by inferring the posterior distribution over hidden variables given the current parameters (the E-step), and then adjusting the parameters to maximize the likelihood of the data averaged over this posterior (the M-step). The M-step updates typically depend only on a small number of expectation values of the posterior, as given by

\langle g(\vec{s}) \rangle_{p(\vec{s} \,|\, \vec{y}^{(n)}, \Theta)} = \sum_{\vec{s}} p(\vec{s} \,|\, \vec{y}^{(n)}, \Theta)\, g(\vec{s}),   (1)

where g(s) is usually an elementary function of the hidden variables (e.g., g(\vec{s}) = \vec{s} or g(\vec{s}) = \vec{s}\vec{s}^T in the case of standard sparse coding). For any non-trivial generative model, the computation of the expectation values (1) is the computationally demanding part of EM optimization. Their exact computation is often intractable, and many well-known algorithms (e.g., [22, 23]) rely on estimations. The EM iterations can be associated with neural processing by assuming that neural activity represents the posterior over hidden variables (E-step), and that synaptic plasticity implements changes to model parameters (M-step). Here we will consider two prominent models of neural processing as approximations to the expectation values (1) and show how they can be combined.

¹In the case of continuous variables the sum is replaced by an integral. For a hierarchical model, the prior distribution p(s | Θ) may be subdivided hierarchically into different sets of variables.

Selection. 
Feed-forward processing has frequently been discussed as an important component of neural processing [16, 24, 17, 25]. One perspective on this early component of neural activity is as a preselection of candidate units or hypotheses for a given sensory stimulus ([18, 21, 26, 19] and many more), with the goal of reducing the computational demand of an otherwise too complex computation. In the context of probabilistic approaches, it has recently been shown that preselection can be formulated as a variational approximation to exact inference [27]. The variational distribution in this case is given by a truncated sum over possible hidden states:

p(\vec{s} \,|\, \vec{y}^{(n)}, \Theta) \approx q_n(\vec{s}; \Theta) = \frac{p(\vec{s} \,|\, \vec{y}^{(n)}, \Theta)}{\sum_{\vec{s}\,' \in K_n} p(\vec{s}\,' \,|\, \vec{y}^{(n)}, \Theta)}\, \delta(\vec{s} \in K_n) = \frac{p(\vec{s}, \vec{y}^{(n)} \,|\, \Theta)}{\sum_{\vec{s}\,' \in K_n} p(\vec{s}\,', \vec{y}^{(n)} \,|\, \Theta)}\, \delta(\vec{s} \in K_n),   (2)

where δ(s ∈ K_n) = 1 if s ∈ K_n and zero otherwise. The subset K_n represents the preselected latent states. Given a data point y^(n), Eqn. 2 results in good approximations to the posterior if K_n contains most of the posterior mass. Since for many applications the posterior mass is concentrated in small volumes of the state space, the approximation quality can stay high even for relatively small sets K_n. This approximation can be used to efficiently compute the expectation values needed in the M-step (1):

\langle g(\vec{s}) \rangle_{p(\vec{s} \,|\, \vec{y}^{(n)}, \Theta)} \approx \langle g(\vec{s}) \rangle_{q_n(\vec{s}; \Theta)} = \frac{\sum_{\vec{s} \in K_n} p(\vec{s}, \vec{y}^{(n)} \,|\, \Theta)\, g(\vec{s})}{\sum_{\vec{s}\,' \in K_n} p(\vec{s}\,', \vec{y}^{(n)} \,|\, \Theta)}.   (3)

Eqn. 
3 represents a reduction in required computational resources, as it involves only summations (or integrations) over the smaller state space K_n. The requirement is that the set K_n be selected prior to the computation of the expectation values, and the final gain in efficiency relies on such selections being efficiently computable. As such, a selection function S_h(y, Θ) needs to be carefully chosen in order to define K_n; S_h(y, Θ) efficiently selects the candidate units s_h that are most likely to have contributed to a data point y^(n). K_n can then be defined by:

K_n = \{\vec{s} \,|\, \text{for all } h \notin I : s_h = 0\},   (4)

where I contains the H′ indices h with the highest values of S_h(y, Θ) (compare Fig. 1). For sparse coding models, for instance, we can exploit the fact that the posterior mass lies close to low-dimensional subspaces to define the sets K_n [27, 28], and appropriate S_h(y, Θ) can be found by deriving efficiently computable upper bounds for the probabilities p(s_h = 1 | y^(n), Θ) [27, 28] or by derivations based on taking the limit of no data noise [27, 29]. For more complex models, see [27] (Sec. 5.3-4) for a discussion of suitable selection functions. Often the precise form of S_h(y, Θ) has limited influence on the final approximation accuracy because a) its values are not used for the approximation (3) itself and b) the size of the sets K_n can often be chosen generously enough to easily contain the regions with large posterior mass. The larger K_n, the less precise the selection has to be. For K_n equal to the entire state space, no selection is required and the approximations (2) and (3) fall back to exact inference.

Sampling. An alternative way to approximate the expectation values in eq. 
1 is by sampling from the posterior distribution and using the samples to compute the average:

\langle g(\vec{s}) \rangle_{p(\vec{s} \,|\, \vec{y}^{(n)}, \Theta)} \approx \frac{1}{M} \sum_{m=1}^{M} g(\vec{s}^{(m)}) \quad \text{with} \quad \vec{s}^{(m)} \sim p(\vec{s} \,|\, \vec{y}, \Theta).   (5)

The challenging aspect of this approach is to efficiently draw samples from the posterior. In a high-dimensional sample space, this is mostly done by Markov Chain Monte Carlo (MCMC). This class of methods draws samples from the posterior distribution such that each subsequent sample is drawn relative to the current state, and the resulting sequence of samples forms a Markov chain. In the limit of a large number of samples, Monte Carlo methods are theoretically able to represent any probability distribution. However, the number of samples required in high-dimensional spaces can be very large (Fig. 1A, sampling).

Figure 1: A Simplified illustration of the posterior mass and the respective regions each approximation approach (MAP estimate, exact EM, preselection, sampling, select and sample) uses to compute the expectation values. B Graphical model showing each connection W_dh between the observed variables y and hidden variables s, and how H′ = 2 hidden variables/units are selected to form a set K_n. 
C Graphical model resulting from the selection of hidden variables and associated weights W_dh (black).

Select and Sample. Although preselection is a deterministic approach, very different from the stochastic nature of sampling, its formulation as an approximation to expectation values (3) allows for a straightforward combination of both approaches: given a data point y^(n), we first approximate the expectation value (3) using the variational distribution q_n(s; Θ) as defined by preselection (2). Second, we approximate the expectations w.r.t. q_n(s; Θ) using sampling. The combined approach is thus given by:

\langle g(\vec{s}) \rangle_{p(\vec{s} \,|\, \vec{y}^{(n)}, \Theta)} \approx \langle g(\vec{s}) \rangle_{q_n(\vec{s}; \Theta)} \approx \frac{1}{M} \sum_{m=1}^{M} g(\vec{s}^{(m)}) \quad \text{with} \quad \vec{s}^{(m)} \sim q_n(\vec{s}; \Theta),   (6)

where s^(m) denote samples from the truncated distribution q_n. Instead of drawing from a distribution over the entire state space, approximation (6) requires only samples from a potentially very small subspace K_n (Fig. 1). In the subspace K_n, most of the original probability mass is concentrated in a smaller volume, so MCMC algorithms perform more efficiently: there is a smaller space to explore, burn-in times are shorter, and fewer samples are required. Compared to selection alone, the select and sample approach represents an increase in efficiency as soon as the number of samples required for a good approximation is less than the number of states in K_n.

3 Sparse Coding: An Example Application

We systematically investigate the computational efficiency, performance, and biological plausibility of the select and sample approach in comparison with selection and sampling alone, using a sparse coding model of images. The choice of a sparse coding model has numerous advantages. 
First, it is a non-trivial model that has been extremely well studied in machine learning research, and for which efficient algorithms exist (e.g., [23, 30]). Second, it has become a standard (albeit somewhat simplistic) model of the organization of receptive fields in primary visual cortex [22, 31, 32]. Here we consider a discrete variant of this model known as Binary Sparse Coding (BSC; [29, 27], also compare [33]), which has binary hidden variables but otherwise the same features as standard sparse coding versions. The generative model for BSC is expressed by

p(\vec{s} \,|\, \pi) = \prod_{h=1}^{H} \pi^{s_h} (1 - \pi)^{1 - s_h}, \qquad p(\vec{y} \,|\, \vec{s}, W, \sigma) = \mathcal{N}(\vec{y};\, W\vec{s},\, \sigma^2 \mathbb{1}),   (7)

where W ∈ R^{D×H} denotes the basis vectors and π parameterizes the sparsity (s and y as above). The M-step updates of the BSC learning algorithm (see e.g. [27]) are given by:

W^{\mathrm{new}} = \Big( \sum_{n=1}^{N} \vec{y}^{(n)} \langle \vec{s} \rangle_{q_n}^{T} \Big) \Big( \sum_{n=1}^{N} \langle \vec{s}\vec{s}^{T} \rangle_{q_n} \Big)^{-1}, \qquad (\sigma^2)^{\mathrm{new}} = \frac{1}{ND} \sum_{n} \big\langle ||\vec{y}^{(n)} - W\vec{s}||^2 \big\rangle_{q_n},   (8)

\pi^{\mathrm{new}} = \frac{1}{N} \sum_{n} \big| \langle \vec{s} \rangle_{q_n} \big|, \quad \text{where} \quad |\vec{x}| = \frac{1}{H} \sum_{h} x_h.   (9)

The only expectation values needed for the M-step are thus ⟨s⟩_{q_n} and ⟨s s^T⟩_{q_n}. We will compare learning and inference between the following algorithms:

BSCexact. An EM algorithm without approximations is obtained if we use the exact posterior for the expectations: q_n = p(s | y^(n), Θ). We will refer to this exact algorithm as BSCexact. 
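As a concrete illustration, the BSC generative process (7) and the exact expectations used by BSCexact can be sketched as follows. This is our own minimal sketch, not the authors' implementation; all function and variable names are ours:

```python
import itertools
import numpy as np

def bsc_generate(W, pi, sigma, rng):
    """Draw one data point from the BSC model: s ~ Bernoulli(pi), y ~ N(Ws, sigma^2 I)."""
    D, H = W.shape
    s = (rng.random(H) < pi).astype(float)
    y = W @ s + sigma * rng.normal(size=D)
    return y, s

def exact_posterior_expectations(y, W, pi, sigma):
    """<s> and <s s^T> under p(s|y) by summing over all 2^H binary states (BSCexact)."""
    D, H = W.shape
    states = np.array(list(itertools.product([0.0, 1.0], repeat=H)))  # (2^H, H)
    # log p(s, y) up to a constant: Gaussian likelihood term + Bernoulli prior term
    resid = y[None, :] - states @ W.T
    log_joint = (-0.5 * np.sum(resid ** 2, axis=1) / sigma ** 2
                 + np.sum(states * np.log(pi) + (1 - states) * np.log(1 - pi), axis=1))
    post = np.exp(log_joint - log_joint.max())   # unnormalized posterior, stabilized
    post /= post.sum()
    mean_s = post @ states                       # <s>
    corr_ss = (states.T * post) @ states         # <s s^T>
    return mean_s, corr_ss
```

Since s is binary, the diagonal of ⟨s s^T⟩ equals ⟨s⟩, which is a useful sanity check for any implementation.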
Although directly computable, the expectation values for BSCexact require sums over the entire state space, i.e., over 2^H terms. For large numbers of latent dimensions, BSCexact is thus intractable.

BSCselect. An algorithm that scales more efficiently with the number of hidden dimensions is obtained by applying preselection. For the BSC model we use q_n as given in (3) and K_n = {s | (for all h ∉ I : s_h = 0) or Σ_h s_h = 1}. Note that in addition to states as in (4) we include all states with one non-zero unit (all singletons). Including them avoids EM iterations in the initial phases of learning that leave some basis functions unmodified (see [27]). As the selection function S_h(y^(n)) to define K_n we use:

S_h(\vec{y}^{(n)}) = \big( \vec{W}_h^{T} / ||\vec{W}_h|| \big)\, \vec{y}^{(n)}, \quad \text{with} \quad ||\vec{W}_h|| = \sqrt{\textstyle\sum_{d=1}^{D} (W_{dh})^2}.   (10)

A large value of S_h(y^(n)) strongly indicates that y^(n) contains the basis function W_h as a component (see Fig. 1C). Note that (10) can be related to a deterministic ICA-like selection of a hidden state s^(n) in the limit case of no noise (compare [27]). Further restrictions of the state space are possible but require modified M-step equations (see [27, 29]), which will not be considered here.

BSCsample. An alternative, non-deterministic approach can be derived using Gibbs sampling. Gibbs sampling is an MCMC algorithm which systematically explores the sample space by repeatedly drawing samples from the conditional distributions of the individual hidden dimensions. In other words, the transition probability from the current sample to a new candidate sample is given by p(s_h^new | s_\h^current). In our case of a binary sample space, this equates to selecting one random axis h ∈ {1, ..., H} and toggling its bit value (thereby changing the binary state in that dimension), leaving the remaining axes unchanged. 
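A minimal sketch of the selection scores of Eqn. 10 together with such a bit-flip Gibbs move, restricted to the preselected dimensions, might look as follows (our own illustrative code, with names of our choosing; the smoothing exponent beta is explained next):

```python
import numpy as np

def selection_scores(y, W):
    """S_h(y) = (W_h / ||W_h||)^T y : feed-forward score for each hidden unit (Eqn. 10)."""
    norms = np.sqrt((W ** 2).sum(axis=0))
    return (W / norms).T @ y

def log_joint(s, y, W, pi, sigma):
    """log p(s, y) up to a constant for the BSC model."""
    resid = y - W @ s
    return (-0.5 * resid @ resid / sigma ** 2
            + np.sum(s * np.log(pi) + (1 - s) * np.log(1 - pi)))

def gibbs_bitflip(y, W, pi, sigma, idx, n_steps, rng, beta=1.0):
    """Bit-flip Gibbs sampling restricted to the preselected dimensions `idx`;
    all non-selected units stay clamped to zero."""
    s = np.zeros(W.shape[1])
    samples = []
    for _ in range(n_steps):
        h = rng.choice(idx)                # pick one selected axis at random
        lp = np.empty(2)
        for v in (0.0, 1.0):               # evaluate both values of the chosen bit
            s[h] = v
            lp[int(v)] = beta * log_joint(s, y, W, pi, sigma)
        # p(s_h = 1 | s_\h, y), with the joint raised to the power beta
        p1 = 1.0 / (1.0 + np.exp(np.clip(lp[0] - lp[1], -500.0, 500.0)))
        s[h] = float(rng.random() < p1)
        samples.append(s.copy())
    return np.array(samples)
```

With `idx` covering all H dimensions this is plain BSCsample; with `idx` chosen as the top-scoring units it corresponds to the restricted chain used below.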
Specifically, the posterior probability computed for each candidate sample is expressed by:

p(s_h = 1 \,|\, \vec{s}_{\setminus h}, \vec{y}) = \frac{p(s_h = 1, \vec{s}_{\setminus h}, \vec{y})^{\beta}}{p(s_h = 0, \vec{s}_{\setminus h}, \vec{y})^{\beta} + p(s_h = 1, \vec{s}_{\setminus h}, \vec{y})^{\beta}},   (11)

where we have introduced a parameter β that allows for smoothing of the posterior distribution. To ensure an appropriate mixing behavior of the MCMC chains over a wide range of σ (note that σ is a model parameter that changes with learning), we define β = T / σ², where T is a temperature parameter that is set manually and selected such that good mixing is achieved. The samples drawn in this manner can then be used to approximate the expectation values in (8) to (9) using (5).

BSCs+s. The EM learning algorithm given by combining selection and sampling is obtained by applying (6). First note that inserting the BSC generative model into (2) results in:

q_n(\vec{s}; \Theta) = \frac{\mathcal{N}(\vec{y};\, W\vec{s},\, \sigma^2 \mathbb{1})\, \mathrm{Bernoulli}_{K_n}(\vec{s}; \pi)}{\sum_{\vec{s}\,' \in K_n} \mathcal{N}(\vec{y};\, W\vec{s}\,',\, \sigma^2 \mathbb{1})\, \mathrm{Bernoulli}_{K_n}(\vec{s}\,'; \pi)}\, \delta(\vec{s} \in K_n),   (12)

where Bernoulli_{K_n}(s; π) = Π_{h∈I} π^{s_h} (1 − π)^{1−s_h}. The remainder of the Bernoulli distribution cancels out. If we define s̃ to be the binary vector consisting of all entries of s in the selected dimensions, and if W̃ ∈ R^{D×H′} contains all basis functions of those selected, we observe that the distribution is equal to the posterior w.r.t. 
a BSC model with H′ instead of H hidden dimensions:

p(\tilde{\vec{s}} \,|\, \vec{y}, \Theta) = \frac{\mathcal{N}(\vec{y};\, \tilde{W}\tilde{\vec{s}},\, \sigma^2 \mathbb{1}_{H'})\, \mathrm{Bernoulli}(\tilde{\vec{s}}; \pi)}{\sum_{\tilde{\vec{s}}\,'} \mathcal{N}(\vec{y};\, \tilde{W}\tilde{\vec{s}}\,',\, \sigma^2 \mathbb{1}_{H'})\, \mathrm{Bernoulli}(\tilde{\vec{s}}\,'; \pi)}.

Instead of drawing samples from q_n(s; Θ) we can thus draw samples from the exact posterior w.r.t. the BSC generative model with H′ dimensions. The sampling procedure for BSCsample can thus be applied simply by ignoring the non-selected dimensions and their associated parameters. For different data points, different latent dimensions will be selected, such that averaging over data points can update all model parameters. For selection we again use S_h(y, Θ) (10), defining K_n as in (4), where I now contains the H′ − 2 indices h with the highest values of S_h(y, Θ) and two randomly selected dimensions (drawn from a uniform distribution over all non-selected dimensions). The two randomly selected dimensions fulfill the same purpose as the inclusion of singleton states for BSCselect. Preselection and Gibbs sampling on the selected dimensions define an approximation to the required expectation values (3) and result in an EM algorithm referred to as BSCs+s.

Complexity. 
Collecting the number of operations necessary to compute the expectation values for all four BSC cases, we arrive at

\mathcal{O}\big( N\, S\, (D + 1 + H) \big),   (13)

where the terms D, 1, and H account for the evaluation of p(s, y), ⟨s⟩, and ⟨s s^T⟩, respectively, and S denotes the number of hidden states that contribute to the calculation of the expectation values. For the approaches with preselection (BSCselect, BSCs+s), all calculations of the expectation values can be performed on the reduced latent space; therefore H is replaced by H′. For BSCexact this number scales exponentially in H: S_exact = 2^H, and in the BSCselect case it scales exponentially in the number of preselected hidden variables: S_select = 2^{H′}. However, for the sampling-based approaches (BSCsample and BSCs+s), the number S directly corresponds to the number of samples to be evaluated and is obtained empirically. As we will show later, S_s+s = 200 × H′ is a reasonable choice for the interval of H′ that we investigate in this paper (1 ≤ H′ ≤ 40).

4 Numerical Experiments

We compare the select and sample approach with selection and sampling applied individually on different data sets: artificial images and natural image patches. For all experiments using the two sampling approaches, we draw 20 independent chains initialized at random states in order to increase the mixing of the samples. Of the samples drawn per chain, 1/3 were used as burn-in samples and 2/3 were retained.

Artificial data. Our first set of experiments investigates the select and sample approach's convergence properties on artificial data sets where ground truth is available. 
As the following experiments were run on a small-scale problem, we can compute the exact data likelihood at each EM step for all four algorithms (BSCexact, BSCselect, BSCsample and BSCs+s) to compare convergence against the ground-truth likelihood.

Figure 2: Experiments using artificial bars data with H = 12, D = 6 × 6. The dotted line indicates the ground-truth log-likelihood value. A Random selection of the N = 2000 training data points y^(n). B Learned basis functions W_dh after a successful training run. C Development of the log-likelihood over a period of 50 EM steps for all 4 investigated algorithms.

Data for these experiments consisted of images generated by creating H = 12 basis functions W_h^gt in the form of horizontal and vertical bars on a D = 6 × 6 = 36 pixel grid. Each bar was randomly assigned to be either positive (W_dh^gt ∈ {0.0, 10.0}) or negative (W_dh^gt ∈ {−10.0, 0.0}). N = 2000 data points y^(n) were generated by linearly combining these basis functions (see e.g., [34]). Using a sparseness value of π^gt = 2/H resulted in, on average, two active bars per data point. According to the model, we added Gaussian noise (σ^gt = 2.0) to the data (Fig. 2A).

We applied all algorithms to the same dataset and monitored the exact likelihood over a period of 50 EM steps (Fig. 2C). Although the calculation of the exact likelihood requires O(N 2^H (D + H)) operations, this is feasible for such a small-scale problem. For models using preselection (BSCselect and BSCs+s), we set H′ to 6, effectively halving the number of hidden variables participating in the calculation of the expectation values. 
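The bars data just described can be generated with a short sketch like the following (our own code and parameter names, following the description above):

```python
import numpy as np

rng = np.random.default_rng(2)

def make_bars_data(N=2000, grid=6, sigma=2.0, amp=10.0, rng=rng):
    """Artificial bars data: H = 2*grid basis functions (horizontal + vertical bars),
    each randomly assigned a positive (+amp) or negative (-amp) sign; on average two
    active bars per data point (pi = 2/H); additive Gaussian noise with std sigma."""
    H, D = 2 * grid, grid * grid
    W = np.zeros((H, grid, grid))
    for i in range(grid):
        W[i, i, :] = amp          # horizontal bars
        W[grid + i, :, i] = amp   # vertical bars
    W *= rng.choice([-1.0, 1.0], size=H)[:, None, None]   # random sign per bar
    W = W.reshape(H, D)
    pi = 2.0 / H
    S = (rng.random((N, H)) < pi).astype(float)           # ground-truth hidden states
    Y = S @ W + sigma * rng.normal(size=(N, D))           # linear combination + noise
    return Y, W.T, S   # W.T has shape D x H, as in the text
```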
For BSCsample and BSCs+s we drew 200 samples from the posterior p(s | y^(n)) of each data point, such that the number of states evaluated totaled S_sample = 200 × H = 2400 and S_s+s = 200 × H′ = 1200, respectively. To ensure an appropriate mixing behavior, the annealing temperature was set to T = 50. In each experiment the basis functions were initialized at the data mean plus Gaussian noise, the prior probability to π^init = 1/H, and the data noise to the variance of the data. All algorithms recover the correct set of basis functions in > 50% of the trials, and the sparseness prior π and the data noise σ with high accuracy. Comparing the computational costs of the algorithms shows the benefits of preselection already for this small-scale problem: while BSCexact evaluates the expectation values using the full set of 2^H = 4096 hidden states, BSCselect only considers 2^{H′} + (H − H′) = 70 states. The pure sampling approach performs 2400 evaluations, while BSCs+s requires 1200 evaluations.

Image patches. We test the select and sample approach on natural image data at a more challenging scale, to include biological plausibility in the demonstration of its applicability to larger-scale problems. We extracted N = 40,000 patches of size D = 26 × 26 = 676 pixels from the van Hateren image database [31]², and preprocessed them using a Difference of Gaussians (DoG) filter, which approximates the sensitivity of center-on and center-off neurons found in the early stages of mammalian visual processing. Filter parameters were chosen as in [35, 28]. For the following experiments we ran 100 EM iterations to ensure proper convergence. 
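A Difference of Gaussians preprocessing step of this kind can be sketched as follows. This is our own illustration: the kernel size and widths are placeholders, not the parameters from [35, 28]:

```python
import numpy as np

def dog_kernel(size=9, sigma_c=1.0, sigma_s=3.0):
    """2-D difference-of-Gaussians kernel (center minus surround). Each Gaussian is
    normalized to sum 1, so the kernel sums to ~0 and removes the DC component."""
    ax = np.arange(size) - size // 2
    xx, yy = np.meshgrid(ax, ax)
    def g(sig):
        k = np.exp(-(xx ** 2 + yy ** 2) / (2.0 * sig ** 2))
        return k / k.sum()
    return g(sigma_c) - g(sigma_s)

def filter_patch(patch, kernel):
    """'Same'-size correlation via zero padding (a plain numpy sketch)."""
    k = kernel.shape[0] // 2
    padded = np.pad(patch, k)
    out = np.zeros_like(patch, dtype=float)
    for i in range(patch.shape[0]):
        for j in range(patch.shape[1]):
            out[i, j] = np.sum(padded[i:i + kernel.shape[0], j:j + kernel.shape[1]] * kernel)
    return out
```

Because the kernel sums to roughly zero, a constant (DC) patch is mapped to near-zero responses away from the border, mimicking the mean-removal property of center-surround cells.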
The annealing temperature was set to T = 20.

Figure 3: Experiments on image patches with D = 26 × 26, H = 800 and H′ = 20. A Random selection of used patches (after DoG preprocessing). B Random selection of learned basis functions (number of samples set to 200). C Final approximate log-likelihood after 100 EM steps vs. number of samples per data point. D Number of states that had to be evaluated for the different approaches.

The first series of experiments investigates the effect of the number of drawn samples on the performance of the algorithm (as measured by the approximate data likelihood) across the entire range of H′ values between 12 and 36. We observe with BSCs+s that 200 samples per hidden dimension (total states = 200 × H′) are sufficient: the final value of the likelihood after 100 EM steps begins to saturate, and increasing the number of samples further does not increase the likelihood by more than 1%. In Fig. 3C we report the curve for H′ = 20, but the same trend is observed for all other values of H′. In another set of experiments, we used this number of samples (200 × H) in the pure sampling case (BSCsample) in order to monitor the likelihood behavior. We observed two consistent trends: 1) the algorithm was never observed to converge to a high-likelihood solution, and 2) even when initialized at solutions with high likelihood, the likelihood always decreases. 
These experiments demonstrate the gains of select and sample over pure sampling: while BSC_s+s needs only 200 × 20 = 4,000 samples to robustly reach a high-likelihood solution, BSC_sample under the same regime not only converged poorly but required 200 × 800 = 160,000 samples to do so (Fig. 3D).

Large-scale experiment on image patches. Comparison of the above results shows that the most efficient algorithm is obtained by a combination of preselection and sampling, our select and sample approach (BSC_s+s), with no or only minimal effect on the performance of the algorithm (Figs. 2 and 3). This efficiency allows for applications to much larger-scale problems than would be possible with either approximation alone. To demonstrate the efficiency of the combined approach we applied BSC_s+s to the same image dataset, but with a very high number of observed and hidden dimensions. We extracted from the database N = 500,000 patches of size D = 40 × 40 = 1,600 pixels. BSC_s+s was applied with the number of hidden units set to H = 1,600 and with H' = 34. Using the same conditions as in the previous experiments (notably S = 200 × H' = 6,800 samples and 100 EM iterations), we again obtain a set of Gabor-like basis functions (see Fig. 4A) while evaluating comparatively few states (Fig. 4B). To our knowledge, these results constitute the largest application of sparse coding with a reasonably complete representation of the posterior.

5 Discussion

We have introduced a novel and efficient method for unsupervised learning in probabilistic models – one which maintains a complex representation of the posterior for problems consistent with

²We restricted the set of images to 900 images without man-made structures (see Fig. 3A).
The brightest 2% of the pixels were clamped to the maximum value of the remaining 98% (reducing the influence of light reflections).

Figure 4: A Large-scale application of BSC_s+s with H' = 34 to image patches (D = 40 × 40 = 1600 pixels and H = 1600 hidden dimensions). A random selection of the inferred basis functions is shown (see Suppl. for all basis functions and model parameters). B Comparison of the computational complexity: BSC_select scales exponentially with H' whereas BSC_s+s scales linearly. Note the large difference at H' = 34 as used in A.

real-world scales. Furthermore, our approach is biologically plausible and models how the brain can make sense of its environment for large-scale sensory inputs. Specifically, the method could be implemented in neural networks using two mechanisms, both of which have been independently suggested in the context of a statistical framework for perception: feed-forward preselection [27], and sampling [12, 13, 3]. We showed that the two seemingly contrasting approaches can be combined based on their interpretation as approximate inference methods, resulting in a considerable increase in computational efficiency (e.g., Figs. 3-4).

We used a sparse coding model of natural images – a standard model for neural response properties in V1 [22, 31] – in order to investigate, both numerically and analytically, the applicability and efficiency of the method. Comparisons of our approach with exact inference, selection alone, and sampling alone showed a very favorable scaling with the number of observed and hidden dimensions.
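The scaling gap shown in Fig. 4B follows directly from the state counts stated in the text; a small arithmetic sketch for the large-scale setting (H = 1600, H' = 34):

```python
# State counts per data point, as given in the text: preselection
# (BSC_select) enumerates the 2^H' states of the selected subspace plus
# the H - H' remaining singleton states, while select-and-sample
# (BSC_s+s) draws 200 x H' samples.
H, H_prime = 1600, 34

states_select = 2 ** H_prime + (H - H_prime)  # exponential in H'
states_s_plus_s = 200 * H_prime               # linear in H'

print(states_select)    # 17179870750  (~1.7e10)
print(states_s_plus_s)  # 6800
```

The roughly six orders of magnitude between the two counts is exactly the gap plotted at H' = 34 in Fig. 4B.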
To the best of our knowledge, the only other sparse coding implementation that reached a comparable problem size (D = 20 × 20, H = 2,000) assumed a Laplace prior and used a MAP estimate of the posterior [23]. However, with MAP estimates, basis functions have to be rescaled (compare [22]), and data noise or prior parameters cannot be inferred (instead a regularizer is hand-set). Our method does not require any of these artificial mechanisms because of its rich posterior representation. Such representations are, furthermore, crucial for inferring all parameters such as data noise and sparsity (learned in all of our experiments), and for acting correctly when faced with uncertain input [2, 8, 3].

Concretely, we used a sparse coding model with binary latent variables. This allowed for a systematic comparison with exact EM for low-dimensional problems, but an extension to the continuous case should be straightforward. In the model, the selection step results in a simple, local and neurally plausible integration of input data, given by (10). We used this in combination with Gibbs sampling, which is also neurally plausible because neurons can individually sample their next state based on the current state of the other neurons, as transmitted through recurrent connections [15]. The idea of combining sampling with feed-forward mechanisms has previously been explored, but in other contexts and with different goals. Work by Beal [36] used variational approximations as proposal distributions within importance sampling, and Tu and Zhu [37] guided a Metropolis-Hastings algorithm by a data-driven proposal distribution.
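The feed-forward selection step can be sketched as a purely local scoring of the hidden units followed by a top-H' cut. The normalized-overlap score below is an illustrative stand-in of our own, not the paper's selection function (10):

```python
import numpy as np

def preselect(y, W, H_prime):
    """Score every hidden unit by the normalized overlap of its basis
    function with the input, and keep the H' best-scoring units.
    (The score is a hypothetical stand-in for selection function (10).)"""
    scores = (W.T @ y) / np.linalg.norm(W, axis=0)
    return np.sort(np.argsort(scores)[-H_prime:])

W = np.eye(4)              # toy dictionary with 4 hidden units
y = W[:, 2]                # input matching unit 2
print(preselect(y, W, 1))  # [2]
```

Inference (sampling or exact summation) is then restricted to the returned subspace, with the remaining units clamped to zero.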
Both approaches [36, 37] differ from selecting subspaces prior to sampling and are more difficult to link to neural feed-forward sweeps [18, 21].

We expect the select and sample strategy to be widely applicable to machine learning models whenever the posterior probability mass can be expected to be concentrated in a small subspace of the whole latent space. Using more sophisticated preselection mechanisms and sampling schemes could lead to a further reduction in computational effort, although the details will in general depend on the particular model and input data.

Acknowledgements. We acknowledge funding by the German Research Foundation (DFG) in the project LU 1196/4-1 (JL), by the German Federal Ministry of Education and Research (BMBF), project 01GQ0840 (JAS, JB, ASS), and by the Swartz Foundation and the Swiss National Science Foundation (PB). Furthermore, support by the Physics Dept. and the Center for Scientific Computing (CSC) in Frankfurt is acknowledged.

References
[1] P. Dayan and L. F. Abbott. Theoretical Neuroscience. MIT Press, Cambridge, 2001.
[2] R. P. N. Rao, B. A. Olshausen, and M. S. Lewicki. Probabilistic Models of the Brain: Perception and Neural Function. MIT Press, 2002.
[3] J. Fiser, P. Berkes, G. Orban, and M. Lengyel. Statistically optimal perception and learning: from behavior to neural representations. Trends in Cognitive Sciences, 14:119–130, 2010.
[4] M. O. Ernst and M. S. Banks. Humans integrate visual and haptic information in a statistically optimal fashion. Nature, 415:419–433, 2002.
[5] Y. Weiss, E. P. Simoncelli, and E. H. Adelson. Motion illusions as optimal percepts. Nature Neuroscience, 5:598–604, 2002.
[6] K. P. Kording and D. M. Wolpert. Bayesian integration in sensorimotor learning. Nature, 427:244–247, 2004.
[7] J. M. Beck, W. J. Ma, R. Kiani, T. Hanks, A. K. Churchland, J. Roitman, M. N. Shadlen, P. E. Latham, and A. Pouget.
Probabilistic population codes for Bayesian decision making. Neuron, 60(6), 2008.
[8] J. Trommershäuser, L. T. Maloney, and M. S. Landy. Decision making, movement planning and statistical decision theory. Trends in Cognitive Sciences, 12:291–297, 2008.
[9] P. Berkes, G. Orban, M. Lengyel, and J. Fiser. Spontaneous cortical activity reveals hallmarks of an optimal internal model of the environment. Science, 331(6013):83–87, 2011.
[10] W. J. Ma, J. M. Beck, P. E. Latham, and A. Pouget. Bayesian inference with probabilistic population codes. Nature Neuroscience, 9:1432–1438, 2006.
[11] R. Turner, P. Berkes, and J. Fiser. Learning complex tasks with probabilistic population codes. In Frontiers in Neuroscience, 2011. Comp. and Systems Neuroscience 2011.
[12] T. S. Lee and D. Mumford. Hierarchical Bayesian inference in the visual cortex. Journal of the Optical Society of America A, 20(7):1434–1448, 2003.
[13] P. O. Hoyer and A. Hyvärinen. Interpreting neural response variability as Monte Carlo sampling from the posterior. In Adv. Neur. Inf. Proc. Syst. 16, pages 293–300. MIT Press, 2003.
[14] E. Vul, N. D. Goodman, T. L. Griffiths, and J. B. Tenenbaum. One and done? Optimal decisions from very few samples. In 31st Annual Meeting of the Cognitive Science Society, 2009.
[15] P. Berkes, R. Turner, and J. Fiser. The army of one (sample): the characteristics of sampling-based probabilistic neural representations. In Frontiers in Neuroscience, 2011. Comp. and Systems Neuroscience 2011.
[16] F. Rosenblatt. The perceptron: A probabilistic model for information storage and organization in the brain. Psychological Review, 65(6), 1958.
[17] M. Riesenhuber and T. Poggio. Hierarchical models of object recognition in cortex. Nature Neuroscience, 2(11):1019–1025, 1999.
[18] V. A. F. Lamme and P. R. Roelfsema.
The distinct modes of vision offered by feedforward and recurrent processing. Trends in Neurosciences, 23(11):571–579, 2000.
[19] A. Yuille and D. Kersten. Vision as Bayesian inference: analysis by synthesis? Trends in Cognitive Sciences, 10(7):301–308, 2006.
[20] G. E. Hinton, P. Dayan, B. J. Frey, and R. M. Neal. The 'wake-sleep' algorithm for unsupervised neural networks. Science, 268:1158–1161, 1995.
[21] E. Körner, M. O. Gewaltig, U. Körner, A. Richter, and T. Rodemann. A model of computation in neocortical architecture. Neural Networks, 12:989–1005, 1999.
[22] B. A. Olshausen and D. J. Field. Emergence of simple-cell receptive field properties by learning a sparse code for natural images. Nature, 381:607–609, 1996.
[23] H. Lee, A. Battle, R. Raina, and A. Ng. Efficient sparse coding algorithms. NIPS, 20:801–808, 2007.
[24] Y. LeCun. Backpropagation applied to handwritten zip code recognition.
[25] M. Riesenhuber and T. Poggio. How visual cortex recognizes objects: The tale of the standard model. 2002.
[26] T. S. Lee and D. Mumford. Hierarchical Bayesian inference in the visual cortex. J Opt Soc Am A Opt Image Sci Vis, 20(7):1434–1448, July 2003.
[27] J. Lücke and J. Eggert. Expectation Truncation and the Benefits of Preselection in Training Generative Models. Journal of Machine Learning Research, 2010.
[28] G. Puertas, J. Bornschein, and J. Lücke. The maximal causes of natural scenes are edge filters. NIPS, 23, 2010.
[29] M. Henniges, G. Puertas, J. Bornschein, J. Eggert, and J. Lücke. Binary sparse coding. Latent Variable Analysis and Signal Separation, 2010.
[30] J. Mairal, F. Bach, J. Ponce, and G. Sapiro. Online learning for matrix factorization and sparse coding. The Journal of Machine Learning Research, 11, 2010.
[31] J. H. van Hateren and A. van der Schaaf.
Independent Component Filters of Natural Images Compared with Simple Cells in Primary Visual Cortex. Proc Biol Sci, 265(1394):359–366, 1998.
[32] D. L. Ringach. Spatial Structure and Symmetry of Simple-Cell Receptive Fields in Macaque Primary Visual Cortex. J Neurophysiol, 88:455–463, 2002.
[33] M. Haft, R. Hofman, and V. Tresp. Generative binary codes. Pattern Anal Appl, 6(4):269–284, 2004.
[34] P. O. Hoyer. Non-negative sparse coding. Neural Networks for Signal Processing XII: Proceedings of the IEEE Workshop, pages 557–565, 2002.
[35] J. Lücke. Receptive Field Self-Organization in a Model of the Fine Structure in V1 Cortical Columns. Neural Computation, 2009.
[36] M. J. Beal. Variational Algorithms for Approximate Bayesian Inference. PhD thesis, Gatsby Computational Neuroscience Unit, University College London, 2003.
[37] Z. Tu and S. C. Zhu. Image Segmentation by Data-Driven Markov Chain Monte Carlo. PAMI, 24(5):657–673, 2002.