{"title": "The Maximal Causes of Natural Scenes are Edge Filters", "book": "Advances in Neural Information Processing Systems", "page_first": 1939, "page_last": 1947, "abstract": "We study the application of a strongly non-linear generative model to image patches. As in standard approaches such as Sparse Coding or Independent Component Analysis, the model assumes a sparse prior with independent hidden variables. However, in the place where standard approaches use the sum to combine basis functions we use the maximum. To derive tractable approximations for parameter estimation we apply a novel approach based on variational Expectation Maximization. The derived learning algorithm can be applied to large-scale problems with hundreds of observed and hidden variables. Furthermore, we can infer all model parameters including observation noise and the degree of sparseness. In applications to image patches we find that Gabor-like basis functions are obtained. Gabor-like functions are thus not a feature exclusive to approaches assuming linear superposition. Quantitatively, the inferred basis functions show a large diversity of shapes with many strongly elongated and many circular symmetric functions. The distribution of basis function shapes reflects properties of simple cell receptive fields that are not reproduced by standard linear approaches. In the study of natural image statistics, the implications of using different superposition assumptions have so far not been investigated systematically because models with strong non-linearities have been found analytically and computationally challenging. 
The presented algorithm represents the first large-scale application of such an approach.", "full_text": "The Maximal Causes of Natural Scenes are Edge Filters\n\nGervasio Puertas\u2217 and Jörg Bornschein\u2217\nFrankfurt Institute for Advanced Studies, Goethe-University Frankfurt, Germany\npuertas@fias.uni-frankfurt.de, bornschein@fias.uni-frankfurt.de\n\nJörg Lücke\nFrankfurt Institute for Advanced Studies, Goethe-University Frankfurt, Germany\nluecke@fias.uni-frankfurt.de\n\nAbstract\n\nWe study the application of a strongly non-linear generative model to image patches. As in standard approaches such as Sparse Coding or Independent Component Analysis, the model assumes a sparse prior with independent hidden variables. However, in the place where standard approaches use the sum to combine basis functions we use the maximum. To derive tractable approximations for parameter estimation we apply a novel approach based on variational Expectation Maximization. The derived learning algorithm can be applied to large-scale problems with hundreds of observed and hidden variables. Furthermore, we can infer all model parameters including observation noise and the degree of sparseness. In applications to image patches we find that Gabor-like basis functions are obtained. Gabor-like functions are thus not a feature exclusive to approaches assuming linear superposition. Quantitatively, the inferred basis functions show a large diversity of shapes with many strongly elongated and many circular symmetric functions. The distribution of basis function shapes reflects properties of simple cell receptive fields that are not reproduced by standard linear approaches. 
In the study of natural image statistics, the implications of using different superposition assumptions have so far not been investigated systematically because models with strong non-linearities have been found analytically and computationally challenging. The presented algorithm represents the first large-scale application of such an approach.\n\n1 Introduction\n\nIf Sparse Coding (SC, [1]) or Independent Component Analysis (ICA; [2, 3]) are applied to image patches, basis functions are inferred that closely resemble Gabor wavelet functions. Because of the similarity of these functions to simple-cell receptive fields in primary visual cortex, SC and ICA became the standard models to explain simple-cell responses, and they are the primary choice in modelling the local statistics of natural images. Since they were first introduced, many different versions of SC and ICA have been investigated. While many studies focused on different ways to efficiently infer the model parameters (e.g. [4, 5, 6]), many others investigated the assumptions used in the underlying generative model itself. The modelling of observation noise can thus be regarded as the major difference between SC and ICA (see, e.g., [7]). Furthermore, different forms of independent sparse priors have been investigated by many modelers [8, 9, 10], while other approaches have gone a step further and studied a relaxation of the assumption of independence between hidden variables [11, 12, 13].\n\n\u2217authors contributed equally\n\nAn assumption that has, in the context of image statistics, been investigated relatively little is the assumption of linear superposition of basis functions. This assumption is not only a hallmark of SC and ICA but, indeed, is an essential part of many standard algorithms including Principal Component Analysis (PCA), Factor Analysis (FA; [14]), or Non-negative Matrix Factorization (NMF; [15]). 
For many types of data, linear superposition can be motivated by the actual combination rule of the data components (e.g., sound waveforms combine linearly). For other types of data, including visual data, linear superposition can represent a severe approximation, however. Models assuming linearity are, nevertheless, often used because they are easier to study analytically and many derived algorithms can be applied to large-scale problems. Furthermore, they perform well in many applications and may, to a certain extent, succeed in modelling the distribution, e.g., of local image structure. From the perspective of probabilistic generative models, a major aim is, however, to recover the actual data generating process, i.e., to recover the actual generating causes (see, e.g., [7]). To accomplish this, the crucial properties of the data generation should be modelled as realistically as possible. If the data components combine non-linearly, this should thus be reflected by the generative model. Unfortunately, inferring the parameters in probabilistic models assuming non-linear superpositions has been found to be much more challenging than in the linear case (e.g. [16, 17, 18, 19], also compare [20, 21]). To model image patches, for instance, large-scale applications of non-linear models, with the required large numbers of observed and hidden variables, have so far not been reported.\n\nIn this paper we study the application of a probabilistic generative model with strongly non-linear superposition to natural image patches. The basic model was first suggested in [19], where tractable learning algorithms for parameter optimization were derived for the case of a superposition based on a point-wise maximum. The model (which was termed Maximal Causes Analysis; MCA) used a sparse prior for independent and binary hidden variables. 
The derived algorithms compared favorably with state-of-the-art approaches on standard non-linear benchmarks and they were applied to realistic data. However, the still demanding computational costs limited the application domain to relatively small-scale problems. The unconstrained model, for instance, was used with at most H = 20 hidden units. Here we use a novel learning algorithm to infer the parameters of a variant of the MCA generative model. The approach allows for scaling the model up to several hundreds of observed and hidden variables. It enables large-scale applications to image patches and, thus, allows for studying the inferred basis functions as is commonly done for linear approaches.\n\n2 The Maximal Causes Generative Model\n\nConsider a set of N data points {y^(n)}_{n=1,...,N} sampled independently from an underlying distribution (y^(n) ∈ R^{D×1}, D is the number of observed variables). For these data we seek parameters Θ = (W, σ, π) that maximize the data likelihood L = ∏_{n=1}^{N} p(y^(n) | Θ) under a variant of the MCA generative model [19], which is given by:\n\np(s | Θ) = ∏_h π^{s_h} (1 − π)^{1−s_h} ,   (Bernoulli distribution)   (1)\n\np(y | s, Θ) = ∏_d N(y_d; W̄_d(s, W), σ²) , where W̄_d(s, W) = max_h {s_h W_dh} ,   (2)\n\nand where N(y_d; w, σ²) denotes a scalar Gaussian distribution. H denotes the number of hidden variables s_h, and W ∈ R^{D×H}. The model differs from the one previously introduced in [19] by the use of Gaussian noise instead of Poisson noise. Eqn. 2 results in the basis functions W_h = (W_{1h}, ..., W_{Dh})^T of the MCA model being combined non-linearly by a point-wise maximum. This becomes salient if we compare (2) with the linear case, using the vectorial notation max_h{W'_h} = (max_h{W'_{1h}}, ..., max_h{W'_{Dh}})^T for vectors W'_h ∈ R^{D×1}:\n\np(y | s, Θ) = N(y; max_h {s_h W_h}, σ² 1)   (non-linear superposition)   (3)\n\np(y | s, Θ) = N(y; Σ_h s_h W_h, σ² 1)   (linear superposition)   (4)\n\nwhere N(y; μ, Σ) denotes the multi-variate Gaussian distribution (note that Σ_h s_h W_h = W s).\n\nAs in linear approaches such as SC, the combined basis functions set the mean values of the observed variables y_d, which are independently and identically drawn from Gaussian distributions with variance σ² (Eqn. 2). The difference between linear and non-linear superposition is illustrated in Fig. 1. In general, the maximum superposition results in much weaker interferences. This is the case for diagonally overlapping basis functions (Fig. 1B) and, at closer inspection, also for overlapping collinear basis functions (Fig. 1C,D). Strong interferences as with linear combinations cannot be expected from combinations of image components.\n\nFigure 1: A Example patches extracted from an image and preprocessed using a Difference of Gaussians filter. B Two generated patches constructed from two Gabor basis functions with approximately orthogonal wave vectors. In the upper-right the basis functions were combined using linear superposition; in the lower-right they were combined using a point-wise maximum (note that the max was taken after channel-splitting, see Eqn. 15 and Fig. 2). C Superposition of two collinear Gabor functions using the sum (upper-right) or point-wise maximum (lower-right). D Cross-sections through basis functions (along maximum amplitude direction). Left: cross-sections through two different collinear Gabor functions (compare C). Right: cross-sections through their superpositions using sum (top) and max (bottom). 
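The generative model just defined is simple to simulate. The following NumPy sketch (illustrative sizes and parameters, not the authors' code) samples a data point according to Eqns. (1)-(2) and contrasts the max-superposition with the sum-superposition of Eqns. (3)-(4):

```python
import numpy as np

# Sample from the MCA generative model (Eqns. 1-2) and compare the
# max-superposition with the linear superposition of Eqn. (4).
rng = np.random.default_rng(0)
D, H = 16, 5                    # observed / hidden dimensions (made up)
W = rng.random((D, H))          # non-negative basis functions
pi, sigma = 2.0 / H, 0.1        # sparseness and noise level (made up)

s = (rng.random(H) < pi).astype(float)          # binary causes, Eqn. (1)
mean_max = np.max(W * s, axis=1)                # max_h { s_h W_dh }
mean_sum = W @ s                                # linear case, Eqn. (4)
y = mean_max + sigma * rng.standard_normal(D)   # Gaussian noise, Eqn. (2)

# With non-negative weights the max never exceeds the sum of the active
# components, i.e. the maximum produces much weaker interference:
assert np.all(mean_max <= mean_sum + 1e-12)
```

Because each observed dimension takes only the strongest active component as its mean, overlapping components do not build up the way they do under the sum.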
For preprocessed image patches (compare Fig. 4D), it could thus be argued that the maximum combination is closer to the actual combination rule of image causes. In any case, the maximum represents an alternative to study the implications of combination rules in the image domain.\n\nTo optimize the parameters Θ of the MCA model (1) and (2), we use a variational EM approach (see, e.g., [22]). That is, instead of maximizing the likelihood directly, we maximize the free-energy:\n\nF(q, Θ) = Σ_{n=1}^{N} Σ_s q^(n)(s; Θ') [ log p(y^(n) | s, W, σ) + log p(s | π) ] + H(q) ,   (5)\n\nwhere q^(n)(s; Θ') is an approximation to the exact posterior. In the variational EM scheme, F(q, Θ) is maximized alternately with respect to q in the E-step (while Θ is kept fixed) and with respect to Θ in the M-step (while q is kept fixed). As a multiple-cause model, an exact E-step is computationally intractable for MCA. Additionally, the M-step is analytically intractable because of the non-linearity in MCA. The computational intractability in the E-step takes the form of expectation values of functions g, ⟨g(s)⟩_{q^(n)}. These expectations are intractable if the optimal choice of q^(n) in (5) is used (i.e., if q^(n) is equal to the posterior: q^(n)(s; Θ') = p(s | y^(n), Θ')). To derive an efficient learning algorithm, our approach approximates the intractable expectations ⟨g(s)⟩_{q^(n)} by truncating the sums over the hidden space of s:\n\n⟨g(s)⟩_{q^(n)} = Σ_s p(s, y^(n) | Θ') g(s) / Σ_{s'} p(s', y^(n) | Θ') ≈ Σ_{s∈K^n} p(s, y^(n) | Θ') g(s) / Σ_{s'∈K^n} p(s', y^(n) | Θ') ,   (6)\n\nwhere K^n is a small subset of the hidden space. Eqn. 6 represents a good approximation if the set K^n contains most of the posterior probability mass. The approximation will be referred to as Expectation Truncation and can be derived as a variational EM approach (see Suppl. A). For other generative models, similar truncation approaches have successfully been used [19, 23]. For the learning algorithm, K^n in (6) is chosen to contain hidden states s with at most γ active causes, Σ_h s_h ≤ γ. Furthermore, we only consider the combinatorics of H' ≥ γ hidden variables. More formally we define:\n\nK^n = { s | (Σ_j s_j ≤ γ and ∀ i ∉ I : s_i = 0) or Σ_j s_j ≤ 1 } ,   (7)\n\nwhere the index set I contains those H' hidden variables that are the most likely to have generated data point y^(n) (the last term in Eqn. 7 assures that all states s with just one non-zero entry are also evaluated). To determine the H' hidden variables for I, we use those variables h with the H' largest values of a selection function S_h(y^(n)), which is given by:\n\nS_h(y^(n)) = π N(y^(n); W^eff_h, σ² 1) , with an effective weight W^eff_dh = max{y_d, W_dh} .   (8)\n\nSelecting hidden variables based on S_h(y^(n)) is equivalent to selecting them based on an upper bound of p(s_h=1 | y^(n), Θ). 
To see that S_h provides such a bound, note that p(y^(n) | Θ) is independent of h and that:\n\np(s_h=1 | y^(n), Θ) p(y^(n) | Θ) = Σ_{s: s_h=1} [ ∏_d p(y^(n)_d | W̄_d(s, W), σ) ] p(s | π) ≤ [ ∏_d p(y^(n)_d | W^eff_dh, σ) ] Σ_{s: s_h=1} p(s | π) ,\n\nwith the right-hand-side being equal to S_h(y^(n)) in Eqn. 8 (see Suppl. B for details). A low value of S_h(y^(n)) thus implies a low value of p(s_h=1 | y^(n), Θ) and hence a low likelihood that cause h has generated data point y^(n). In numerical experiments on ground-truth data we have verified that for most data points Eqn. 6 with Eqn. 7 indeed approximates the true expectation values with high accuracy.\n\nHaving derived tractable approximations for the expectation values (6) in the E-step, let us now derive parameter update equations in the M-step. An update rule for the weight matrix W of this model was derived in [19] and is given by:\n\nW^new_dh = Σ_{n∈M} ⟨A^ρ_dh(s, W)⟩_{q^(n)} y^(n)_d / Σ_{n∈M} ⟨A^ρ_dh(s, W)⟩_{q^(n)} ,   (9)\n\nA^ρ_dh(s, W) = ∂/∂W_dh W̄^ρ_d(s, W) , where W̄^ρ_d(s, W) = ( Σ_{h=1}^{H} (s_h W_dh)^ρ )^{1/ρ} ,   (10)\n\nwhere the parameter ρ is set to a large value (we used ρ = 20). 
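The ρ-approximation in Eqn. (10) replaces the non-differentiable max by a smooth power-mean. A small numerical check (made-up numbers) shows that for ρ = 20 the derivative already concentrates almost entirely on the maximizing hidden unit:

```python
import numpy as np

# The smooth max of Eqn. (10) and its derivative: for large rho,
# (sum_h (s_h W_dh)^rho)^(1/rho) approaches max_h {s_h W_dh}, and the
# derivative A_dh concentrates on the maximizing unit. Numbers are made up.
rho = 20
s = np.array([1.0, 1.0, 0.0])               # two active causes
W_d = np.array([0.2, 0.9, 0.5])             # weights at one pixel d

x = s * W_d
W_bar_rho = np.sum(x ** rho) ** (1.0 / rho)     # smooth version of max_h x_h
A = s * (x / W_bar_rho) ** (rho - 1)            # A_dh = dW_bar_rho / dW_dh

assert abs(W_bar_rho - x.max()) < 1e-6          # close to the true max
assert A.argmax() == 1 and A[1] > 0.99          # mass on the maximizing h
```

The derivative A thus acts (softly) as an indicator of which hidden unit is responsible for pixel d, which is what makes the weighted average in Eqn. (9) meaningful.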
The derivation of the update rule for σ (Gaussian noise has previously not been used) is straight-forward, and the update equation is given by:\n\nσ^new = sqrt( 1/(|M| D) Σ_{n∈M} ⟨ ‖y^(n) − max_h{s_h W_h}‖² ⟩_{q^(n)} ) .   (11)\n\nNote that in (9) to (11) we do not sum over all data points y^(n) but only over those in a subset M (|M| is the number of elements in M). The subset contains the data points for which (6) finally represents a good approximation. It is defined to contain the N^cut data points with the largest values Σ_{s∈K^n} p(s, y^(n) | Θ'), i.e., with the largest values for the denominator in (6). N^cut is hereby the expected number of data points that have been generated by states with less than or equal to γ non-zero entries (see Eqn. 12).\n\nThe selection of data points is an important difference to earlier truncation approaches (compare [19, 23]), and its necessity can be shown analytically (Suppl. A).\n\nUpdate equations (9), (10), and (11) have been derived by setting the derivatives of the free-energy (w.r.t. W and σ) to zero. Similarly, we can derive the update equation for the sparseness parameter π. However, as the approximation only considers states s with a maximum of γ non-zero entries, the update has to correct for an underestimation of π (compare Suppl. A). 
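The correction factor A(π)π/B(π) appearing in the sparseness update (Eqns. 13-14) can be checked numerically; a minimal sketch, with illustrative H and π:

```python
from math import comb

# Sparseness correction factor A(pi)*pi/B(pi) of Eqns. (13)-(14).
# In the untruncated limit gamma = H it equals 1/H, recovering the
# canonical EM update for pi. H and pi below are illustrative.
def correction(pi, H, gamma):
    A = sum(comb(H, g) * pi**g * (1 - pi)**(H - g) for g in range(gamma + 1))
    B = sum(g * comb(H, g) * pi**g * (1 - pi)**(H - g) for g in range(gamma + 1))
    return A * pi / B

H, pi = 20, 2.0 / 20                 # pi * H = 2, as in the experiments
full = correction(pi, H, gamma=H)    # untruncated: exactly 1/H
trunc = correction(pi, H, gamma=3)   # truncated: a larger factor

assert abs(full - 1.0 / H) < 1e-12
assert trunc > full                  # compensates the underestimation
```

For γ = H, A(π) = 1 and B(π) = Hπ, so the factor reduces to 1/H; for truncated γ the factor is larger, compensating for the probability mass of states excluded from K^n.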
If such a correction is taken into account, we obtain the update rule:\n\nπ^new = ( A(π) π / B(π) ) · (1/|M|) Σ_{n∈M} ⟨|s|⟩_{q^(n)} , with |s| = Σ_{h=1}^{H} s_h ,   (13)\n\nA(π) = Σ_{γ'=0}^{γ} (H choose γ') π^{γ'} (1 − π)^{H−γ'} and B(π) = Σ_{γ'=0}^{γ} γ' (H choose γ') π^{γ'} (1 − π)^{H−γ'} .   (14)\n\nN^cut = N Σ_{s: |s|≤γ} p(s | π) = N Σ_{γ'=0}^{γ} (H choose γ') π^{γ'} (1 − π)^{H−γ'} .   (12)\n\nNote that the correction factor A(π) π / B(π) in (13) is equal to one over H if we allow for all possible states (i.e., γ = H' = H). Also the set M becomes equal to the set of all data points in this case (because N^cut = N). For γ = H' = H, Eqn. 13 thus falls back to the exact EM update rule that can canonically be derived by setting the derivative of (5) w.r.t. π to zero (while using the exact posterior). Also the update equations (9), (10), and (11) fall back to their canonical form for γ = H' = H. By choosing a γ between one and H we can thus choose the accuracy of the used approximation. The higher the value of γ, the more accurate is the approximation, but the larger are also the computational costs. For intermediate values of γ we can obtain very good approximations with small computational costs. Crucial for the scalability to large-scale problems is hereby the preselection of H' < H hidden variables using the selection function in Eqn. 8.\n\nFigure 2: Illustration of patch preprocessing and basis function visualization. 
The left-hand-side shows data points obtained from gray-value patches after DoG filtering. These patches are transformed to non-negative data by Eqn. 15. The algorithm maximizes the data likelihood under the MCA model (1) and (2), and infers basis functions (second from the right). For visualization, the basis functions are displayed after their parts have been recombined again.\n\n3 Numerical Experiments\n\nThe update equations (9), (10), (11), and (13) together with approximation (6) with (7) and (8) define a learning algorithm that optimizes the full set of parameters of the MCA generative model (1) and (2). We will apply the algorithm to visual data as received by the primary visual cortex of mammals. In mammals, visual information is transferred to the cortex via two types of neurons in the lateral geniculate nucleus (LGN): center-on and center-off cells. The sensitivity of center-on neurons can be modelled by a Difference of Gaussians (DoG) filter with positive central part, while the sensitivity of center-off cells can be modelled by an inverted such filter. A model for preprocessing of an image patch is thus given by a DoG filter and a successive splitting of the positive and the negative parts of the filtered image. More formally, we use a DoG filter to generate patches ỹ with D̃ = 26 × 26 pixels. Such a patch is then converted to a patch of size D = 2D̃ by assigning:\n\ny_d = [ỹ_d]_+ and y_{D̃+d} = [−ỹ_d]_+   (for d = 1, ..., D̃),   (15)\n\nwhere [x]_+ = x for x ≥ 0 and [x]_+ = 0 otherwise. This procedure has repeatedly been used in the context of visual data processing (see, e.g., [24]) and is, as discussed, closely aligned with mammalian visual preprocessing (see Fig. 2 for an illustration).\n\nBefore we applied the algorithm to natural image patches, it was first evaluated on artificial data with ground-truth. 
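The preprocessing chain (DoG filtering followed by the channel splitting of Eqn. 15) can be sketched as follows; the kernel size and sigma ratio are illustrative assumptions, not the exact values used in the paper:

```python
import numpy as np

# DoG kernel (center minus surround) and the channel splitting of Eqn. (15).
# Kernel size and the sigma ratio here are illustrative assumptions.
def dog_kernel(size=9, sigma_c=1.0, ratio=3.0):
    ax = np.arange(size) - size // 2
    xx, yy = np.meshgrid(ax, ax)
    gauss = lambda s: np.exp(-(xx**2 + yy**2) / (2 * s**2)) / (2 * np.pi * s**2)
    return gauss(sigma_c) - gauss(ratio * sigma_c)

def channel_split(filtered):
    """Map a filtered patch (D~ pixels) to a non-negative vector (2*D~)."""
    v = filtered.ravel()
    return np.concatenate([np.maximum(v, 0.0), np.maximum(-v, 0.0)])

rng = np.random.default_rng(2)
patch = rng.standard_normal((26, 26))       # stand-in for a filtered patch
y = channel_split(patch)
assert y.shape == (2 * 26 * 26,) and np.all(y >= 0)
# The two channels recombine to the original filtered patch:
assert np.allclose(y[:26 * 26] - y[26 * 26:], patch.ravel())
```

The split mirrors the on/off channel separation described above: positive filter responses populate the first D̃ entries, sign-flipped negative responses the second D̃.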
As inferred basis functions of images most commonly resemble Gabor wavelets, we used Gabor functions for the generation of artificial data. The Gabor basis functions were combined according to the MCA generative model (1) and (2). We used H^gen = 400 Gabor functions for generation. The variances of the Gaussian envelope of each Gabor were sampled from a distribution in n_x/n_y-space (Fig. 3C), with σ_x and σ_y denoting the standard deviations of the Gaussian envelope, and with f denoting the Gabor frequency. Angular phases and centers of the Gabors were sampled from uniform distributions. The wave vector's modulus was set to 1 (f = 1/(2π)) and the envelope amplitude was 10. The parameters were chosen to lie in the same range as the parameters inferred in preliminary runs of the algorithm on natural image patches.\n\nFor the generation of each artificial patch we drew a binary vector s according to (1) with π H^gen = 2. We then selected the |s| corresponding Gabor functions and used channel-splitting (15) to convert them into basis functions with only non-negative parts. To form an artificial patch, these basis functions were combined using the point-wise maximum according to (2). We generated N = 150 000 patches as data points in this way (Fig. 3A shows some examples).\n\nThe algorithm was applied with H = 300 hidden variables and approximation parameters γ = 3 and H' = 8. We generated the data with a larger number of basis functions to better match the continuous distribution of the real generating components of images. The basis functions W_h were initialized by setting them to the average over all the preprocessed input patches plus a small Gaussian white noise (≈ 0.5% of the corresponding mean). The initial noise parameter σ was set following Eqn. 11 by using all data points (setting |M| = N initially). Finally, the initial sparseness level was set to π H = 2. The model parameters were updated according to Eqns. 9 to 13 using 60 EM iterations. To help avoid local optima, a small amount of Gaussian white noise (≈ 0.5% of the average basis function value) was added during the first 20 iterations, was linearly decreased to zero between iterations 20 and 40, and was kept at zero for the last 20 iterations. During the first 20 iterations the updates considered all N data points (|M| = N). Between iterations 20 and 40 the number of used data points was linearly decreased to |M| = N^cut, where it was kept constant for the last 20 iterations. Considering all data points for the updates initially has proven beneficial because the selection of data points is based on very incomplete knowledge during the first iterations.\n\nFigure 3: A Artificial patches generated by combining artificial Gabors using a point-wise maximum. B Inferred basis functions if the MCA learning algorithm is applied. C Comparison between the shapes of generating (green) and inferred (blue) Gabors. The brighter the blue data points, the larger the error between the basis function and the matched Gabor (also for Fig. 5).\n\nFig. 3B displays some of the typical basis functions that were recovered in a run of the algorithm on artificial patches. As can be observed (and as could have been expected), they resemble Gabor functions. When we matched the obtained basis functions with Gabor functions (compare, e.g., [25, 26, 27] for details), the obtained Gabor parameters could be analyzed further. We thus plotted the values parameterizing the Gabor shapes in an n_x/n_y-plot. This also allowed us to investigate how well the generating distribution of artificial Gabors was recovered. Fig. 3C shows the generating (green) and the recovered (blue) distribution of Gabors. 
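The ground-truth generation described above can be sketched along these lines (a hypothetical re-implementation; the sampled parameter ranges are placeholders, not the exact generating distribution):

```python
import numpy as np

# Generate one artificial patch: sample Gabor basis functions, channel-split
# them (Eqn. 15), and combine by a point-wise maximum (Eqn. 2). The
# parameter ranges are placeholders, not the paper's exact distribution.
def gabor(size, sigma_x, sigma_y, theta, phase, f=1.0 / (2 * np.pi), amp=10.0):
    ax = np.arange(size) - size // 2
    gx, gy = np.meshgrid(ax, ax)
    xr = gx * np.cos(theta) + gy * np.sin(theta)     # rotated coordinates
    yr = -gx * np.sin(theta) + gy * np.cos(theta)
    env = np.exp(-xr**2 / (2 * sigma_x**2) - yr**2 / (2 * sigma_y**2))
    return amp * env * np.cos(2 * np.pi * f * xr + phase)

rng = np.random.default_rng(3)
gabors = [gabor(26, rng.uniform(2, 5), rng.uniform(2, 5),
                rng.uniform(0, np.pi), rng.uniform(0, 2 * np.pi))
          for _ in range(2)]                         # |s| = 2 active causes
split = [np.concatenate([np.maximum(g.ravel(), 0), np.maximum(-g.ravel(), 0)])
         for g in gabors]
patch = np.maximum.reduce(split)                     # max-combination
assert patch.shape == (2 * 26 * 26,) and patch.min() >= 0
```

Each active Gabor is first split into its non-negative on/off channels, and only then are the channels combined with the point-wise maximum, matching the generation procedure described above.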
Although a few recovered basis functions lie relatively far from the generating distribution, it is in general recovered well. The recovered sparseness level of π H = 2.62 was a bit larger than the generating level of π H^gen = 2. This is presumably due to the smaller number of basis functions in the model (H < H^gen). Also the finite inferred noise level of σ = 0.37 (despite a generation without noise) can be explained by this mismatch. Depending on the parameters of the controls, we can observe different amounts of outliers (usually not more than 5% − 10%). These outliers are usually basis functions that represent more than one Gabor or small Gabor parts. Importantly, however, we found that the large majority of inferred Gabors consistently recovered the generating Gabor functions in n_x/n_y-plots. In particular, when we changed the angle of the generating distribution in the n_x/n_y-plots (e.g., to 25° or 65°), the angle of the recovered distributions changed accordingly. Note that these controls are a quantitative version of the artificial Gabor and grating data used for controls in [1].\n\nApplication to Image Patches. The dataset used in the experiment on natural images was prepared by sampling N = 200 000 patches of D̃ = 26 × 26 pixels from the van Hateren image database [28] (while constraining random selection to patches of images without man-made structures). We preprocessed the patches as described above using a DoG filter (see footnote 1) with a ratio of 3:1 between positive and negative parts (see, e.g., [29]) before converting the patches using Eqn. 15.\n\nThe algorithm was applied with H = 400 hidden variables and approximation parameters γ = 4 and H' = 12. We used parameter initialization as described above and ran 120 EM iterations (also as described above). After learning, the inferred sparseness level was π H = 1.63 and the inferred noise level was σ = 1.59. 
We found the inferred basis functions to resemble Gabor-like functions at different locations, and with different orientations and frequencies. Additionally, we obtained many globular basis functions with no or very little orientation preference. Fig. 4 shows a selection of the H = 400 functions after a run of the algorithm (see suppl. Fig. C.1 for all functions).\n\nFootnote 1: Filter parameters were chosen as in [27]; beforehand, the brightest 2% of the pixels were clamped to the maximal value of the remaining 98% (the influence of light reflections was reduced in this way).\n\nFigure 4: Numerical experiment on image patches. A Random selection of 125 basis functions of the H=400 inferred. B Selection of most globular functions and C most elongated functions. D Selection of preprocessed patches extracted from natural images. E Selection of data points generated according to the model using the inferred basis functions and sparseness level (but no noise).\n\nThe patches in Fig. 4D,E were chosen to demonstrate the high similarity between preprocessed natural patches (in D) and generated ones (in E). To highlight the diversity of obtained basis functions, Figs. 4B,C display some of the most globular and elongated examples, respectively. The variety of Gabor shapes is currently actively discussed [30, 31, 10, 32, 27] since it became obvious that standard linear models (e.g., SC and ICA) could not explain this diversity [33]. To facilitate comparison with earlier approaches, we have applied Gabor matching (compare [25]) and analyzed the obtained parameters. Instead of matching the basis functions directly, we first computed estimates of their corresponding receptive fields (RFs). These estimates were obtained by convolving the basis functions with the same DoG filter as used for preprocessing (see, e.g., [27] and Suppl. C.1 for details). 
In controls we found that these convolved fields were closely matched by RFs estimated using reverse correlation as described, e.g., in [7].\n\nFigure 5: Analysis of Gabor parameters (H=400). A Angle-frequency plot of basis functions. B n_x/n_y distribution of basis functions. C Distribution measured in vivo [33] (red triangles) and corresponding distribution of MCA basis functions (blue).\n\nAfter matching the (convolved) fields with Gabor functions, we found a relatively homogeneous distribution of the fields' orientations, as is commonly observed (Fig. 5A). The frequencies are distributed around 0.1 cycles per pixel, which reflects the band-pass property of the DoG filter. To analyze the Gabor shapes, we plotted the parameters using an n_x/n_y-plot (as suggested in [33]). The broad distribution in n_x/n_y-space hereby reflects the high diversity of basis functions obtained by our algorithm (see Fig. 5B). The specific form of the obtained shape distribution is, hereby, similar to the distribution of macaque V1 simple cells as measured in in vivo recordings [33]. However, the MCA basis functions do not quantitatively match the measurements exactly (see Fig. 5C): the MCA distribution contains a higher percentage of strongly elongated basis functions, and many MCA functions are shifted slightly to the right relative to the measurements. If the basis functions are matched with Gabors directly, we actually do not observe the latter effect (see suppl. Fig. C.2). 
If simple-cell responses are associated with the posterior probabilities of multiple-cause models, the basis functions should, however, not be compared to measured RFs directly (although this is frequently done in the literature).\n\nTo investigate the implications of different numbers of hidden variables, we also ran the algorithm with H = 200 and H = 800. In both cases we observed qualitatively and quantitatively similar distributions of basis functions. Runs with H = 200 thus also contained many circular symmetric basis functions (see suppl. Fig. C.3 for the distribution of shapes). This observation is remarkable because it shows that such 'globular' fields are a very stable feature of the MCA approach, also for small numbers of hidden variables. Based on standard generative models with linear superposition, it has recently been argued [32] that such functions are only obtained in a regime with large numbers of hidden variables relative to the input dimensionality (see [34] for an early contribution).\n\n4 Discussion\n\nWe have studied the application of a strongly non-linear generative model to image patches. The model combines basis functions using a point-wise maximum as an alternative to the linear combination assumed by Sparse Coding, ICA, and most other approaches. Our results suggest that changing the component combination rule has a strong impact on the distribution of inferred basis functions. While we still obtain Gabor-like functions, we robustly observe a large variety of basis functions. Most notably, we obtain circular symmetric functions as well as many elongated functions that are closely associated with edges traversing the entire patch (compare Figs. 1 and 4). Approaches using linear component combination, e.g. ICA or SC, usually do not show these features. 
The differences in basis function shapes between non-linear and linear approaches are, in this\nrespect, consistent with the different types of interferences between basis functions. The maximum\nresults in basis function combinations with much less pronounced interferences, while the stronger\ninterferences of linear combinations might result in a repulsive effect fostering less elongated \ufb01elds\n(compare Fig. 1).\nFor linear approaches, a large diversity of Gabor shapes (including circular symmetric \ufb01elds) could\nonly be obtained in very over-complete settings [34], or speci\ufb01cally modelled priors with hand-set\nsparseness levels [10]. Such studies were motivated by a recently observed discrepancy of recep-\ntive \ufb01elds as predicted by SC or ICA, and receptive \ufb01elds as measured in vivo [33]. Compared to\nthese measurements, the MCA basis functions and their approximate receptive \ufb01elds show a similar\ndiversity of shapes. MCA functions and measured RFs both show circular symmetric \ufb01elds and\nin both cases there is a tendency towards \ufb01elds elongated orthogonal to the wave-vector direction\n(compare Fig. 4). Possible factors that can in\ufb02uence the distributions of basis functions, for MCA as\nwell as for other methods, are hereby different types of preprocessing, different prior distributions,\nand different noise models. Even if the prior type is \ufb01xed, differences for the basis functions have\nbeen reported for different settings of prior parameters (e.g., [10]). If possible, these parameters\nshould thus be learned along with the basis functions. All the different factors named above may\nresult in quantitative differences, and the shift of the MCA functions relative to the measurements\nmight have been caused by one of these factors. For the MCA model, possible effects of assuming\nbinary hidden variables remain to be investigated. 
Presumably, dependencies between hidden variables, as investigated in recent contributions [e.g. 13, 12, 11], also play an important role, e.g., if larger structures of specific arrangements of edges and textures are considered. As the components in such models are combined less randomly, the implications of their combination rule may be even more pronounced in these cases.
In conclusion, neither the linear nor the maximum combination rule is likely to represent the exact model for local visual component combinations. However, while linear component combinations have been studied extensively in the context of image statistics, the investigation of other combination rules has been limited to relatively small-scale applications [17, 16, 35, 19]. Applying a novel training scheme, we could overcome this limitation in the case of the MCA generative model. As with linear approaches, we found that Gabor-like basis functions are obtained. The statistics of their shapes, a subject that is currently and actively discussed [31, 10, 32, 26, 27], are markedly different, however. Future work should thus at least be aware that a linear combination of components is not the only possible choice. To recover the generating causes of image patches, a linear combination might, furthermore, not be the best choice. Given the results presented in this work, it can no longer be considered the only practical one either.
Acknowledgements. We gratefully acknowledge funding by the German Federal Ministry of Education and Research (BMBF) in the project 01GQ0840 (BFNT Frankfurt) and by the German Research Foundation (DFG) in the project LU 1196/4-1. Furthermore, we gratefully acknowledge support by the Frankfurt Center for Scientific Computing (CSC Frankfurt) and thank Marc Henniges for his help with Fig. 2.

References
[1] B. A. Olshausen, D. J. Field. Emergence of simple-cell receptive field properties by learning a sparse code for natural images.
Nature, 381:607–609, 1996.
[2] P. Comon. Independent component analysis, a new concept? Signal Proc, 36(3):287–314, 1994.
[3] A. J. Bell, T. J. Sejnowski. The "independent components" of natural scenes are edge filters. Vision Research, 37(23):3327–38, 1997.
[4] A. Hyvärinen, E. Oja. A fast fixed-point algorithm for independent component analysis. Neural Computation, 9(7):1483–1492, 1997.
[5] H. Lee, A. Battle, R. Raina, A. Ng. Efficient sparse coding algorithms. NIPS 22, 801–808, 2007.
[6] M. W. Seeger. Bayesian Inference and Optimal Design for the Sparse Linear Model. Journal of Machine Learning Research, 759–813, 2008.
[7] P. Dayan, L. F. Abbott. Theoretical Neuroscience. MIT Press, Cambridge, 2001.
[8] P. Berkes, R. Turner, M. Sahani. On sparsity and overcompleteness in image models. NIPS 20, 2008.
[9] B. A. Olshausen, K. J. Millman. Learning sparse codes with a mixture-of-Gaussians prior. NIPS 12, 841–847, 2000.
[10] M. Rehn, F. T. Sommer. A network that uses few active neurones to code visual input predicts the diverse shapes of cortical receptive fields. J Comp Neurosci, 22(2):135–146, 2007.
[11] A. Hyvärinen, P. Hoyer. Emergence of phase- and shift-invariant features by decomposition of natural images into independent feature subspaces. Neural Computation, 12(7):1705–1720, 2000.
[12] F. Sinz, E. P. Simoncelli, M. Bethge. Hierarchical modeling of local image features through Lp-nested symmetric distributions. NIPS 22, 1696–1704, 2009.
[13] D. Zoran, Y. Weiss. The "Tree-Dependent Components" of Natural Images are Edge Filters. NIPS 22, 2340–2348, 2009.
[14] B. S. Everitt. An Introduction to Latent Variable Models. Chapman and Hall, 1984.
[15] D. D. Lee, H. S. Seung. Learning the parts of objects by non-negative matrix factorization. Nature, 401(6755):788–91, 1999.
[16] P. Dayan, R. S.
Zemel. Competition and multiple cause models. Neural Computation, 7:565–579, 1995.
[17] E. Saund. A multiple cause mixture model for unsupervised learning. Neural Computation, 7:51–71, 1995.
[18] H. Lappalainen, X. Giannakopoulos, A. Honkela, J. Karhunen. Nonlinear independent component analysis using ensemble learning: Experiments and discussion. Proc. ICA, 2000.
[19] J. Lücke, M. Sahani. Maximal causes for non-linear component extraction. Journal of Machine Learning Research, 9:1227–1267, 2008.
[20] N. Jojic, B. Frey. Learning flexible sprites in video layers. CVPR, 199–206, 2001.
[21] N. Le Roux, N. Heess, J. Shotton, J. Winn. Learning a generative model of images by factoring appearance and shape. Technical Report, Microsoft Research, 2010.
[22] R. Neal, G. Hinton. A view of the EM algorithm that justifies incremental, sparse, and other variants. M. I. Jordan, editor, Learning in Graphical Models. Kluwer, 1998.
[23] J. Lücke, R. Turner, M. Sahani, M. Henniges. Occlusive Components Analysis. NIPS, 1069–1077, 2009.
[24] P. O. Hoyer. Non-negative matrix factorization with sparseness constraints. Journal of Machine Learning Research, 5:1457–1469, 2004.
[25] J. P. Jones, L. A. Palmer. An evaluation of the two-dimensional Gabor filter model of simple receptive fields in cat striate cortex. Journal of Neurophysiology, 58(6):1233–1258, 1987.
[26] P. Berkes, B. L. White, J. Fiser. No evidence for active sparsification in the visual cortex. NIPS 22, 2009.
[27] J. Lücke. Receptive field self-organization in a model of the fine-structure in V1 cortical columns. Neural Computation, 21(10):2805–2845, 2009.
[28] J. H. van Hateren, A. van der Schaaf. Independent component filters of natural images compared with simple cells in primary visual cortex. Proc Roy Soc London B, 265:359–366, 1998.
[29] D. C. Somers, S. B. Nelson, M. Sur.
An emergent model of orientation selectivity in cat visual cortical simple cells. The Journal of Neuroscience, 15:5448–5465, 1995.
[30] J. Lücke. Learning of representations in a canonical model of cortical columns. Cosyne 2006, 100, 2006.
[31] S. Osindero, M. Welling, G. E. Hinton. Topographic product models applied to natural scene statistics. Neural Computation, 18:381–414, 2006.
[32] D. Arathorn, B. Olshausen, J. DiCarlo. Functional requirements of a visual theory. Workshop Cosyne. www.cosyne.org/c/index.php?title=Functional requirements of a visual theory, 2007.
[33] D. L. Ringach. Spatial structure and symmetry of simple-cell receptive fields in macaque primary visual cortex. Journal of Neurophysiology, 88:455–463, 2002. Data retrieved 2006 from manuelita.psych.ucla.edu/∼dario.
[34] B. A. Olshausen, D. J. Field. Sparse coding with an overcomplete basis set: A strategy employed by V1? Vision Research, 37(23):3311–3325, 1997.
[35] S. Denève, T. Lochmann, U. Ernst. Spike based inference in a network with divisive inhibition. NeuralComp, Marseille, 2008.