{"title": "Learning Sparse Topographic Representations with Products of Student-t Distributions", "book": "Advances in Neural Information Processing Systems", "page_first": 1383, "page_last": 1390, "abstract": null, "full_text": "Learning Sparse Topographic Representations with Products of Student-t Distributions\n\nMax Welling and Geoffrey Hinton\nDepartment of Computer Science\nUniversity of Toronto\n10 King\u2019s College Road\nToronto, M5S 3G5 Canada\n{welling,hinton}@cs.toronto.edu\n\nSimon Osindero\nGatsby Unit\nUniversity College London\n17 Queen Square\nLondon WC1N 3AR, UK\nsimon@gatsby.ucl.ac.uk\n\nAbstract\n\nWe propose a model for natural images in which the probability of an image is proportional to the product of the probabilities of some filter outputs. We encourage the system to find sparse features by using a Student-t distribution to model each filter output. If the t-distribution is used to model the combined outputs of sets of neurally adjacent filters, the system learns a topographic map in which the orientation, spatial frequency and location of the filters change smoothly across the map. Even though maximum likelihood learning is intractable in our model, the product form allows a relatively efficient learning procedure that works well even for highly overcomplete sets of filters. Once the model has been learned it can be used as a prior to derive the \u201citerated Wiener filter\u201d for the purpose of denoising images.\n\n1 Introduction\n\nHistorically, two different classes of statistical model have been used for natural images. \u201cEnergy-based\u201d models assign to each image a global energy, $E$, that is the sum of a number of local contributions and they define the probability of an image to be proportional to $\\exp(-E)$. 
This class of models includes Markov Random Fields where combinations of nearby pixel values contribute local energies, Boltzmann Machines in which binary pixels are augmented with binary hidden variables that learn to model higher-order statistical interactions and Maximum Entropy methods which learn the appropriate magnitudes for the energy contributions of heuristically derived features [5] [9]. It is difficult to perform maximum likelihood fitting on most energy-based models because of the normalization term (the partition function) that is required to convert $\\exp(-E)$ to a probability. The normalization term is a sum over all possible images and its derivative w.r.t. the parameters is required for maximum likelihood fitting. The usual approach is to approximate this derivative by using Markov Chain Monte Carlo (MCMC) to sample from the model, but the large number of iterations required to reach equilibrium makes learning very slow.\n\nThe other class of model uses a \u201ccausal\u201d directed acyclic graph in which the lowest level nodes correspond to pixels and the probability distribution at a node (in the absence of any observations) depends only on its parents. When the graph is singly or very sparsely connected there are efficient algorithms for maximum likelihood fitting but if nodes have many parents, it is hard to perform maximum likelihood fitting because this requires the intractable posterior distribution over non-leaf nodes given the pixel values.\n\nThere is much debate about which class of model is the most appropriate for natural images. Is a particular image best characterized by the states of some hidden variables in a causal generative model? Or is it best characterized by its peculiarities i.e. 
by saying which of a very large set of normally satisfied constraints are violated? In this paper we treat violations of constraints as contributions to a global energy and we show how to learn a large set of constraints each of which is normally satisfied fairly accurately but occasionally violated by a lot. The ability to learn efficiently without ever having to generate equilibrium samples from the model and without having to confront the intractable partition function removes a major obstacle to the use of energy-based models.\n\n2 The Product of Student-t Model\n\nProducts of Experts (PoE) are a restricted class of energy-based model [1]. The distribution represented by a PoE is simply the normalized product of all the distributions represented by the individual \u201cexperts\u201d:\n\n$$p(\\mathbf{x}) = \\frac{1}{Z} \\prod_{i} p_i(\\mathbf{x}) \\tag{1}$$\n\nwhere $p_i(\\mathbf{x})$ are un-normalized experts and $Z$ denotes the overall normalization constant. In the product of Student-t (PoT) model, un-normalized experts have the following form,\n\n$$p_i(\\mathbf{x}) = \\frac{1}{\\left(1 + \\frac{1}{2}(\\mathbf{J}_i^T\\mathbf{x})^2\\right)^{\\alpha_i}} \\tag{2}$$\n\nwhere $\\mathbf{J}_i$ is called a filter and is the $i$-th column in the filter-matrix $\\mathbf{J}$. When properly normalized, this represents a Student-t distribution over the filtered random variable $\\mathbf{J}_i^T\\mathbf{x}$. 
An important feature of the Student-t distribution is its heavy tails, which makes it a suitable candidate for modelling constraints of the kind that are found in images.\n\nDefining $p(\\mathbf{x}) \\propto \\exp(-E(\\mathbf{x}))$, the energy of the PoT model becomes\n\n$$E(\\mathbf{x}) = \\sum_{i} \\alpha_i \\log\\left(1 + \\frac{1}{2}(\\mathbf{J}_i^T\\mathbf{x})^2\\right) \\tag{3}$$\n\nViewed this way, the model takes the form of a maximum entropy distribution with weights $\\alpha_i$ on real-valued \u201cfeatures\u201d of the image. Unlike previous maximum entropy models, however, we can fit both the weights and the features at the same time.\n\nWhen the number of input dimensions is equal to the number of experts, the normally intractable partition function becomes a determinant and the PoT model becomes equivalent to a noiseless ICA model with Student-t prior distributions [2]. In that case the rows of the inverse filters $\\mathbf{J}^{-1}$ will represent independent directions in input space. So noiseless ICA can be viewed as an energy-based model even though it is usually interpreted as a causal generative model in which the posterior over the hidden variables collapses to a point. However, when we consider more experts than input dimensions (i.e. an overcomplete representation), the energy-based view and the causal generative view lead to different generalizations of ICA. The natural causal generalization retains the independence of the hidden variables in the prior by assuming independent sources. 
In contrast, the PoT model simply multiplies together more experts than input dimensions and re-normalizes to get the total probability.\n\n3 Training the PoT Model with Contrastive Divergence\n\nWhen training energy-based models we need to shape the energy function so that observed images have low energy and empty regions in the space of all possible images have high energy. The maximum likelihood learning rule is given by,\n\n$$\\delta\\theta \\propto -\\left\\langle \\frac{\\partial E}{\\partial\\theta} \\right\\rangle_{\\text{data}} + \\left\\langle \\frac{\\partial E}{\\partial\\theta} \\right\\rangle_{\\text{equilibrium}} \\tag{4}$$\n\nIt is the second term which causes learning to be slow and noisy because it is usually necessary to use MCMC to compute the average over the equilibrium distribution. A much more efficient way to fit the model is to use the data distribution itself to initialize a Markov Chain which then starts moving towards the model\u2019s equilibrium distribution. After just a few steps, we observe how the chain is diverging from the data and adjust the parameters to counteract this divergence. This is done by lowering the energy of the data and raising the energy of the \u201cconfabulations\u201d produced by a few steps of MCMC.\n\n$$\\delta\\theta \\propto -\\left\\langle \\frac{\\partial E}{\\partial\\theta} \\right\\rangle_{\\text{data}} + \\left\\langle \\frac{\\partial E}{\\partial\\theta} \\right\\rangle_{k\\text{-step samples}} \\tag{5}$$\n\nIt can be shown that the above update rule approximately minimizes a new objective function called the contrastive divergence [1].\n\nAs it stands the learning rule will be inefficient if the Markov Chain mixes slowly because the two terms in equation 5 will almost cancel each other out. To speed up learning we need a Markov chain that mixes rapidly so that the confabulations will be some distance away from the data. Rapid mixing can be achieved by alternately Gibbs sampling a set of hidden variables given the random variables under consideration and vice versa. Fortunately, the PoT model can be equipped with a number of hidden random variables equal to the number of experts as follows,\n\n$$p(\\mathbf{x}, \\mathbf{u}) \\propto \\prod_{i} u_i^{\\alpha_i - 1} \\exp\\left[-u_i\\left(1 + \\frac{1}{2}(\\mathbf{J}_i^T\\mathbf{x})^2\\right)\\right] \\tag{6}$$\n\nIntegrating over the $\\mathbf{u}$ variables results in the density of the PoT model, i.e. eqns. (1) and (2). Moreover, the conditional distributions are easy to identify and sample from, namely\n\n$$p(\\mathbf{u}\\mid\\mathbf{x}) = \\prod_{i} \\mathcal{G}_{u_i}\\left[\\alpha_i;\\, 1 + \\frac{1}{2}(\\mathbf{J}_i^T\\mathbf{x})^2\\right] \\tag{7}$$\n\n$$p(\\mathbf{x}\\mid\\mathbf{u}) = \\mathcal{N}_{\\mathbf{x}}\\left[0;\\, (\\mathbf{J}\\mathbf{V}\\mathbf{J}^T)^{-1}\\right], \\qquad \\mathbf{V} = \\mathrm{Diag}[\\mathbf{u}] \\tag{8}$$\n\nwhere $\\mathcal{G}$ denotes a Gamma distribution (parameterized by shape and inverse scale) and $\\mathcal{N}$ a normal distribution. From (8) we see that the variables $\\mathbf{u}$ can be interpreted as precision variables in the transformed space $\\mathbf{J}^T\\mathbf{x}$. In this respect our model resembles a \u201cGaussian scale mixture\u201d (GSM) [8] which also multiplies a positive scaling variable with a normal variate. But GSM is a causal model while PoT is energy-based.\n\nThe (in)dependency relations between the variables in a PoT model are depicted graphically in figure (1a,b). The hidden variables are independent given $\\mathbf{x}$, which allows them to be Gibbs-sampled in parallel. 
This resembles the way in which brief Gibbs sampling is used to fit binary \u201cRestricted Boltzmann Machines\u201d [1].\n\nTo learn the parameters of the PoT model we thus propose to iterate the following steps:\n\n1) Sample the hidden variables $\\{\\mathbf{u}\\}$ given the data $\\{\\mathbf{x}\\}$ for every data-vector according to the Gamma-distribution (7).\n\n2) Sample reconstructions of the data $\\{\\mathbf{x}'\\}$ given the sampled values of $\\{\\mathbf{u}\\}$ for every data-vector according to the Normal distribution (8).\n\n3) Update the parameters according to (5) where the \u201ck-step samples\u201d are now given by the reconstructions $\\{\\mathbf{x}'\\}$, the energy is given by (3), and the parameters are given by $\\theta = \\{\\mathbf{J}, \\alpha\\}$.\n\nFigure 1: (a)-Undirected graph for the PoT model. (b)-Expanded graph where the deterministic relation (dashed lines) between the random variable $\\mathbf{x}$ and the activities of the filters $(\\mathbf{J}^T\\mathbf{x})^2$ is made explicit. (c)-Graph for the PoT model including weights $\\mathbf{W}$. (d)-Filters with large (decreasing from left to right) weights into a particular top level unit $u_i$. Top level units have learned to connect to filters similar in frequency, location and orientation.\n\n4 Overcomplete Representations\n\nThe above learning rules are still valid for overcomplete representations. However, step-2 of the learning algorithm is much more efficient when the inverse of the filter matrix $\\mathbf{J}$ exists. In that case we simply draw one vector of standard normal random numbers for each data-vector and multiply each of them with $\\mathbf{J}^{-T}\\mathbf{V}^{-1/2}$. This is efficient because the data dependent matrix $\\mathbf{V}$ is diagonal while the costly inverse $\\mathbf{J}^{-1}$ is data independent. In contrast, for the overcomplete case we ought to perform a Cholesky factorization on $\\mathbf{J}\\mathbf{V}\\mathbf{J}^T$ for each data-vector separately. We have, however, obtained good results by proceeding as in the complete case and replacing the inverse of the filter matrix with its pseudo-inverse.\n\nFrom experiments we have also found that in the overcomplete case we should fix the $L_2$-norm of the filters, $\\mathbf{J}_i$, in order to prevent some of them from decaying to zero. This operation is done after every step of learning. Since controlling the norm removes the ability of the experts to adapt to scale it is necessary to whiten the data first.\n\n4.1 Experiment: Overcomplete Representations for Natural Images\n\nWe randomly generated [...] patches of [...] pixels from images of natural scenes 1. The patches were centered and sphered using PCA and the DC component (eigen-vector with largest variance) was removed. The algorithm for overcomplete representations using the pseudo-inverse was used to train [...] experts, i.e. a representation that is more than [...] times overcomplete. We fixed the weights to have $\\alpha_i = $ [...] and the filters to have an $L_2$-norm of 1. A small weight decay term and a momentum term were included in the gradient updates of the filters. The learning rate was set so that initially the change in the filters was approximately [...]. In figure (2a) we show a small subset of the inverse-filters given by the pseudo-inverse of the product of $\\mathbf{J}^T$ and the matrix used for sphering the data.\n\n1 Collected from http://www.cis.hut.fi/projects/ica/data/images\n\n5 Topographically Ordered Features\n\nIn [6] it was shown that linear filtering of natural images is not enough to remove all higher order dependencies. In particular, it was argued that there are residual dependencies among the activities $(\\mathbf{J}_i^T\\mathbf{x})^2$ of the filtered inputs. It is therefore desirable to model those dependencies within the PoT model. By inspection of figure (1b) we note that these dependencies can be modelled through a non-negative weight matrix $\\mathbf{W} \\geq 0$, which connects the hidden variables $u_i$ with the activities $(\\mathbf{J}_j^T\\mathbf{x})^2$. The resultant model is depicted in figure (1c). Depending on how many nonzero weights $W_{ij}$ emanate from a hidden unit $u_i$, each expert now occupies that many input dimensions instead of just one. The expressions for these richer experts can be obtained from (2) by replacing $(\\mathbf{J}_i^T\\mathbf{x})^2 \\rightarrow \\sum_j W_{ij} (\\mathbf{J}_j^T\\mathbf{x})^2$. We have found that learning is assisted by fixing the $L_1$-norm of the weights ($\\sum_j W_{ij} = 1$). Moreover, we have found that the sparsity of the weights $\\mathbf{W}$ can be controlled by the following generalization of the experts,\n\n$$p_i(\\mathbf{x}) = \\frac{1}{\\left(1 + \\frac{1}{2}\\sum_j W_{ij}\\, |\\mathbf{J}_j^T\\mathbf{x}|^{\\beta}\\right)^{\\alpha_i}} \\tag{9}$$\n\nThe larger the value for $\\beta$ the sparser the distribution of $\\mathbf{W}$ values.\n\nJoint and conditional distributions over hidden variables are obtained through similar replacements in eqn. (6) and (7) respectively. Sampling the reconstructions given the states of the hidden variables proceeds by first sampling the filter outputs from independent generalized Laplace distributions with precision parameters $\\mathbf{W}^T\\mathbf{u}$, which are subsequently transformed into reconstructions $\\{\\mathbf{x}'\\}$. Learning in this model therefore proceeds with only minor modifications to the algorithm described in the previous section.\n\nWhen we learn the weight matrix $\\mathbf{W}$ from image data we find that a particular hidden variable $u_i$ develops weights to the activities of filters similar in frequency, location and orientation. The $\\mathbf{u}$ variables therefore integrate information from these filters and as a result develop certain invariances that resemble the behavior of complex cells. A similar approach was studied in [4] using a related causal model 2 in which a number of scale variables generate correlated variances for conditionally Gaussian experts. This results in topography when the scale-generating variables are non-adaptive and connect to a local neighborhood of filters only.\n\nWe will now argue that fixed local weights $W_{ij}$ also give rise to topography in the PoT model. 
The reason is that averaging the squares of randomly chosen filter outputs (eqn. 9) produces an approximately Gaussian distribution which is a poor fit to the heavy-tailed experts. However, this \u201csmoothing effect\u201d may be largely avoided by averaging squared filter outputs that are highly correlated (i.e. ones that are similar in location, frequency and orientation). Since the averaging is local, this results in a topographic layout of the filters.\n\n5.1 Experiment: Topographic Representations for Natural Images\n\nFor this experiment we collected [...] image patches of size [...] pixels in the same way as described in section (4.1). The image data were sphered and reduced to [...] dimensions by removing [...] low variance and 1 high variance (DC) direction. We learned an overcomplete representation with [...] experts which were organized on a square [...] grid. Each expert connects with a fixed weight of $W_{ij} = $ [...] to itself and all its [...] neighbors, where periodic boundary conditions were imposed for the experts on the boundary. We adapted the filters $\\mathbf{J}_i$ ($L_2$-norm 1) and used fixed values for $\\alpha = $ [...] and for the exponent of eqn. (9), $\\beta = $ [...]. The resulting inverse-filters are shown in figure (2b). We note that the weights $\\mathbf{W}$ have enforced a topographic ordering on the experts, where location, scale and frequency of the Gabor-like filters all change smoothly across the map.\n\nIn another experiment we used the same data to train a complete representation of [...] experts where we learned the weights $\\mathbf{W}$ ($L_1$-norm 1), $\\alpha$ (unconstrained), and the filters $\\mathbf{J}$, but with a fixed value of $\\beta = $ [...]. The weights $\\mathbf{W}$ and $\\alpha$ were kept positive by adapting their logarithm. Since the weights $\\mathbf{W}$ can now connect to any other expert we do not expect topography. To study whether the weights $\\mathbf{W}$ were modelling the dependencies between the energies of the filter outputs $(\\mathbf{J}^T\\mathbf{x})^2$ we ordered the filters for each complex cell $u_i$ according to the strength of the weights connecting to it. For a representative subset of the complex cells $\\mathbf{u}$, we show the [...] filters with the strongest connections to that cell in figure (1d). Since the cells connect to similar filters we may conclude that the weights $\\mathbf{W}$ are indeed learning the dependencies between the activities of the filter outputs.\n\n2 Interestingly, the update equations for the filters $\\mathbf{J}$ presented in [4], which minimize a bound on the log-likelihood of a directed model, reduce to the same equations as our learning rules when the representation is complete and the filters orthogonal.\n\nFigure 2: (a)-Small subset of the [...] learned filters from a [...] times overcomplete representation for natural image patches. (b)-Topographically ordered filters. The weights were fixed and connect to neighbors only, using periodic boundary conditions. Neighboring filters have learned to be similar in frequency, location and orientation. One can observe a pinwheel structure to the left of the low frequency cluster.\n\n6 Denoising Images: The Iterated Wiener Filter\n\nIf the PoT model provides an accurate description of the statistics of natural image data it ought to be a good prior for cleaning up noisy images. In the following we will apply this idea to denoise images contaminated with Gaussian pixel noise. We follow the standard Bayesian approach which states that the optimal estimate of the original image is given by the maximum a posteriori (MAP) estimate of $p(\\mathbf{x}\\mid\\mathbf{y})$, where $\\mathbf{y}$ denotes the noisy image. For the PoT model this reduces to,\n\n$$\\mathbf{x}_{\\text{MAP}} = \\arg\\min_{\\mathbf{x}} \\left\\{ \\frac{1}{2}(\\mathbf{y}-\\mathbf{x})^T \\Sigma^{-1} (\\mathbf{y}-\\mathbf{x}) + \\sum_i \\alpha_i \\log\\left(1 + \\frac{1}{2}\\sum_j W_{ij} (\\mathbf{J}_j^T\\mathbf{x})^2\\right) \\right\\} \\tag{10}$$\n\nFigure 3: (a)-Original \u201crock\u201d-image. (b)-Rock-image with noise added. (c)-Denoised image using Wiener filtering. (d)-Denoised image using IWF.\n\nTo minimize this we follow a variational procedure where we upper bound the logarithm using $\\log z \\leq \\lambda z - \\log\\lambda - 1$. The bound is saturated when $\\lambda = 1/z$. Applying this to every logarithm in the summation in eqn. (10) and iteratively minimizing this bound over $\\mathbf{x}$ and $\\lambda$ we find the following update equations,\n\n$$\\lambda_i \\leftarrow \\frac{1}{1 + \\frac{1}{2}\\sum_j W_{ij} (\\mathbf{J}_j^T\\mathbf{x})^2} \\tag{11}$$\n\n$$\\mathbf{x} \\leftarrow \\left(\\Sigma^{-1} + \\mathbf{J}\\mathbf{V}\\mathbf{J}^T\\right)^{-1} \\Sigma^{-1} \\mathbf{y}, \\qquad \\mathbf{V} = \\mathrm{Diag}[\\mathbf{W}^T(\\alpha \\circ \\lambda)] \\tag{12}$$\n\nwhere $\\circ$ denotes componentwise multiplication. Since the second equation is just a Wiener filter with noise covariance $\\Sigma$ and a Gaussian prior with covariance $(\\mathbf{J}\\mathbf{V}\\mathbf{J}^T)^{-1}$ we have named the above denoising equations the iterated Wiener filter (IWF).\n\nWhen the filters are orthonormal, the noise covariance isotropic and the weight matrix the identity, the minimization in (10) decouples into separate minimizations over the transformed variables $s_i = \\mathbf{J}_i^T\\mathbf{x}$. Defining $z_i = \\mathbf{J}_i^T\\mathbf{y}$ and writing $\\sigma^2$ for the noise variance we can easily derive that $s_i$ is the solution of the following cubic equation (for which analytic solutions exist),\n\n$$s_i^3 - z_i s_i^2 + 2(1 + \\sigma^2\\alpha_i)\\, s_i - 2 z_i = 0 \\tag{13}$$\n\nWe note however that constraining the filters to be orthogonal is a rather severe restriction if the data are not pre-whitened. On the other hand, if we decide to work with whitened data, the isotropic noise assumption seems unrealistic. Having said that, Hyvarinen\u2019s shrinkage method for ICA models [3] is based on precisely these assumptions and seems to give good results. The proposed method is also related to approaches based on the GSM [7].\n\n6.1 Experiment: Denoising\n\nTo test the iterated Wiener filter, we trained a complete set of [...] experts on the data described in section (4.1). The norm of the filters was unconstrained, the $\\alpha$ were free to adapt, but we did not include any weights $\\mathbf{W}$. The image shown in figure (3a) was corrupted with Gaussian noise with standard deviation [...], which resulted in a PSNR of [...] dB (figure (3b)). We applied the adaptive Wiener filter from matlab (Wiener2.m) with an optimal [...] neighborhood size and known noise-variance. The denoised image using adaptive Wiener filtering has a PSNR of [...] dB and is shown in figure (3c). IWF was run on every possible [...] patch in the image, after which the results were averaged. Because the filters $\\mathbf{J}$ were trained on sphered data without a DC component, the same transformations have to be applied to the test patches before IWF is applied. The denoised image using IWF is shown in (3d) and has a PSNR of [...] dB, which is a significant improvement of [...] dB over Wiener filtering. It is our hope that the use of overcomplete representations and weights $\\mathbf{W}$ will further improve those results.\n\n7 Discussion\n\nIt is well known that a wavelet transform de-correlates natural image data in good approximation. In [6] it was found that in the marginal distribution the wavelet coefficients are sparsely distributed but that there are significant residual dependencies among their energies. In this paper we have shown that the PoT model can learn highly overcomplete filters with sparsely distributed outputs. With a second hidden layer that is locally connected, it captures the dependencies between filter outputs by learning topographic representations.\n\nOur approach improves upon earlier attempts (e.g. [4],[8]) in a number of ways. 
In the PoT model the hidden variables are conditionally independent so perceptual inference is very easy and does not require iterative settling even when the model is overcomplete. There is a fairly simple and efficient procedure for learning all the parameters, including the weights connecting top-level units to filter outputs. Finally, the model leads to an elegant denoising algorithm which involves iterating a Wiener-filter.\n\nAcknowledgements\n\nThis research was funded by NSERC, the Gatsby Charitable Foundation, and the Wellcome Trust. We thank Yee-Whye Teh for first suggesting a related model and Peter Dayan for encouraging us to apply products of experts to topography.\n\nReferences\n\n[1] G.E. Hinton. Training products of experts by minimizing contrastive divergence. Neural Computation, 14:1771\u20131800, 2002.\n\n[2] G.E. Hinton, M. Welling, Y.W. Teh, and K. Osindero. A new view of ICA. In Int. Conf. on Independent Component Analysis and Blind Source Separation, 2001.\n\n[3] A. Hyvarinen. Sparse code shrinkage: Denoising of nongaussian data by maximum likelihood estimation. Neural Computation, 11(7):1739\u20131768, 1999.\n\n[4] A. Hyvarinen, P.O. Hoyer, and M. Inki. Topographic independent component analysis. Neural Computation, 13(7):1525\u20131558, 2001.\n\n[5] S. Della Pietra, V.J. Della Pietra, and J.D. Lafferty. Inducing features of random fields. IEEE Transactions on Pattern Analysis and Machine Intelligence, 19(4):380\u2013393, 1997.\n\n[6] E.P. Simoncelli. Modeling the joint statistics of images in the wavelet domain. In Proc. SPIE, 44th Annual Meeting, volume 3813, pages 188\u2013195, Denver, 1999.\n\n[7] V. Strela, J. Portilla, and E. Simoncelli. Image denoising using a local Gaussian scale mixture model in the wavelet domain. In Proc. SPIE, 45th Annual Meeting, San Diego, 2000.\n\n[8] M.J. Wainwright and E.P. Simoncelli. Scale mixtures of Gaussians and the statistics of natural images. 
In Advances in Neural Information Processing Systems, volume 12, pages 855\u2013861, 2000.\n\n[9] S.C. Zhu, Z.N. Wu, and D. Mumford. Minimax entropy principle and its application to texture modeling. Neural Computation, 9(8):1627\u20131660, 1997.", "award": [], "sourceid": 2177, "authors": [{"given_name": "Max", "family_name": "Welling", "institution": null}, {"given_name": "Simon", "family_name": "Osindero", "institution": null}, {"given_name": "Geoffrey", "family_name": "Hinton", "institution": null}]}