{"title": "The Nonnegative Boltzmann Machine", "book": "Advances in Neural Information Processing Systems", "page_first": 428, "page_last": 434, "abstract": null, "full_text": "The Nonnegative Boltzmann Machine \n\nOliver B. Downs \nHopfield Group \nSchultz Building \n\nPrinceton University \nPrinceton, NJ 08544 \n\nobdowns@princeton.edu \n\nDavid J.C. MacKay \nCavendish Laboratory \n\nMadingley Road \n\nCambridge, CB3 0HE \n\nUnited Kingdom \n\nmackay@mrao.cam.ac.uk \n\nDaniel D. Lee \n\nBell Laboratories \n\nLucent Technologies \n700 Mountain Ave. \n\nMurray Hill, NJ 07974 \n\nddlee@bell-labs.com \n\nAbstract \n\nThe nonnegative Boltzmann machine (NNBM) is a recurrent neural network model that can describe multimodal nonnegative data. Application of maximum likelihood estimation to this model gives a learning rule that is analogous to that of the binary Boltzmann machine. We examine the utility of the mean field approximation for the NNBM, and describe how Monte Carlo sampling techniques can be used to learn its parameters. Reflective slice sampling is particularly well-suited for this distribution, and can be implemented efficiently to sample it. We illustrate learning of the NNBM on a translationally invariant distribution, as well as on a generative model for images of human faces. \n\nIntroduction \n\nThe multivariate Gaussian is the most elementary distribution used to model generic data. It represents the maximum entropy distribution under the constraint that the mean and covariance matrix of the distribution match those of the data. For the case of binary data, the maximum entropy distribution that matches the first and second order statistics of the data is given by the Boltzmann machine [1]. The probability of a particular state in the Boltzmann machine is given by the exponential form: \n\n$$P(\\{s_i = \\pm 1\\}) = \\frac{1}{Z} \\exp\\left(-\\frac{1}{2} \\sum_{ij} s_i A_{ij} s_j + \\sum_i b_i s_i\\right).$$ (1) \n\nInterpreting Eq. 
1 as a neural network, the parameters $A_{ij}$ represent symmetric, recurrent weights between the different units in the network, and $b_i$ represent local biases. Unfortunately, these parameters are not simply related to the observed mean and covariance of the data as they are for the normal Gaussian. Instead, they need to be adapted using an iterative learning rule that involves difficult sampling from the binary distribution [2]. \n\nFigure 1: a) Probability density and b) shaded contour plot of a two-dimensional competitive NNBM distribution. The energy function $E(x)$ for this distribution contains a saddle point and two local minima, which generate the observed multimodal distribution. \n\nThe Boltzmann machine can also be generalized to continuous and nonnegative variables. In this case, the maximum entropy distribution for nonnegative data with known first and second order statistics is described by a distribution previously called the \"rectified Gaussian\" distribution [3]: \n\n$$P(x) = \\begin{cases} \\frac{1}{Z} \\exp[-E(x)] & \\text{if } x_i \\geq 0 \\; \\forall i, \\\\ 0 & \\text{if any } x_i < 0, \\end{cases}$$ (2) \n\nwhere the energy function $E(x)$ and normalization constant $Z$ are: \n\n$$E(x) = \\frac{1}{2} x^T A x - b^T x,$$ (3) \n\n$$Z = \\int_{x \\geq 0} dx \\, \\exp[-E(x)].$$ (4) \n\nThe properties of this nonnegative Boltzmann machine (NNBM) distribution differ quite substantially from those of the normal Gaussian. In particular, the presence of the nonnegativity constraints allows the distribution to have multiple modes. For example, Fig. 1 shows a two-dimensional NNBM distribution with two separate maxima located against the rectifying axes. Such a multimodal distribution would be poorly modelled by a single normal Gaussian. 
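As a rough illustration of the two-mode behaviour described above, the following Python sketch evaluates the unnormalized density exp[-E(x)] for a two-dimensional competitive case. It assumes the quadratic energy E(x) = 1/2 x^T A x - b^T x; the values of A and b are hypothetical, chosen only so that the energy has one interior saddle point and two constrained minima on the axes, and are not the parameters behind Fig. 1.

```python
import math

# Hypothetical parameters (not from the paper): an off-diagonal coupling larger
# than the diagonal makes the two units compete.
A = [[1.0, 2.0],
     [2.0, 1.0]]
b = [1.0, 1.0]


def energy(x):
    # Assumed quadratic energy E(x) = 1/2 x^T A x - b^T x
    quad = sum(x[i] * A[i][j] * x[j] for i in range(2) for j in range(2))
    return 0.5 * quad - sum(b[i] * x[i] for i in range(2))


def unnormalized_density(x):
    # exp[-E(x)] on the nonnegative orthant and zero elsewhere; Z is omitted
    if any(xi < 0 for xi in x):
        return 0.0
    return math.exp(-energy(x))


# Interior stationary point of E (solves A x = b); it is a saddle because this
# A has eigenvalues 3 and -1.
det = A[0][0] * A[1][1] - A[0][1] * A[1][0]
saddle = [(A[1][1] * b[0] - A[0][1] * b[1]) / det,
          (A[0][0] * b[1] - A[1][0] * b[0]) / det]

# Constrained minima of E on the two axes
modes = [[1.0, 0.0], [0.0, 1.0]]

for point in modes + [saddle]:
    print(point, unnormalized_density(point))
```

With these assumed values the density at (1, 0) and at (0, 1) exceeds the density at the saddle point (1/3, 1/3), so the distribution has one maximum against each axis.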
\n\nIn this submission, we discuss how a multimodal NNBM distribution can be learned from nonnegative data. We show the limitations of mean field approximations for this distribution, and illustrate how recent developments in efficient sampling techniques for continuous belief networks can be used to tune the weights of the network [4]. Specific examples of learning are demonstrated on a translationally invariant distribution, as well as on a generative model for face images. \n\nMaximum Likelihood \n\nThe learning rule for the NNBM can be derived by maximizing the log likelihood of the observed data under Eq. 2. Given a set of nonnegative vectors $\\{x^\\mu\\}$, where $\\mu = 1..M$ indexes the different examples, the log likelihood is: \n\n$$L = \\frac{1}{M} \\sum_{\\mu=1}^{M} \\log P(x^\\mu) = -\\frac{1}{M} \\sum_{\\mu=1}^{M} E(x^\\mu) - \\log Z.$$ (5) \n\nTaking the derivatives of Eq. 5 with respect to the parameters $A$ and $b$ gives: \n\n$$\\frac{\\partial L}{\\partial A_{ij}} = \\frac{1}{2} \\left[ \\langle x_i x_j \\rangle_f - \\langle x_i x_j \\rangle_c \\right],$$ (6) \n\n$$\\frac{\\partial L}{\\partial b_i} = \\langle x_i \\rangle_c - \\langle x_i \\rangle_f,$$ (7) \n\nwhere the subscript \"c\" denotes a \"clamped\" average over the data, and the subscript \"f\" denotes a \"free\" average over the NNBM distribution: \n\n$$\\langle f(x) \\rangle_c = \\frac{1}{M} \\sum_{\\mu=1}^{M} f(x^\\mu),$$ (8) \n\n$$\\langle f(x) \\rangle_f = \\int_{x \\geq 0} dx \\, P(x) f(x).$$ (9) \n\nThese derivatives are used to define a gradient ascent learning rule for the NNBM that is similar to that of the binary Boltzmann machine. The contrast between the clamped and free covariance matrices is used to update the interactions $A$, while the difference between the clamped and free means is used to update the local biases $b$. \n\nMean field approximation \n\nThe major difficulty with this learning algorithm lies in evaluating the averages $\\langle x_i x_j \\rangle_f$ and $\\langle x_i \\rangle_f$. Because it is analytically intractable to calculate these free averages exactly, approximations are necessary for learning. 
Mean field approximations have previously been proposed as a deterministic alternative for learning in the binary Boltzmann machine, although there have been contrasting views on their validity [5,6]. Here, we investigate the utility of mean field theory for approximating the NNBM distribution. \n\nThe mean field equations are derived by approximating the NNBM distribution in Eq. 2 with the factorized form: \n\n$$Q(x) = \\prod_i Q_{\\tau_i}(x_i) = \\prod_i \\frac{1}{\\gamma!} \\frac{1}{\\tau_i} \\left( \\frac{x_i}{\\tau_i} \\right)^{\\gamma} e^{-x_i/\\tau_i},$$ (10) \n\nwhere the different marginal densities $Q(x_i)$ are characterized by the means $\\tau_i$ with a fixed constant $\\gamma$. The product of $\\gamma$-distributions is the natural factorizable distribution for nonnegative random variables. \n\nThe optimal mean field parameters $\\tau_i$ are determined by minimizing the Kullback-Leibler divergence between the NNBM distribution and the factorized distribution: \n\n$$D_{KL}(Q \\| P) = \\int dx \\, Q(x) \\log \\left[ \\frac{Q(x)}{P(x)} \\right] = \\langle E(x) \\rangle_{Q(x)} + \\log Z - H(Q).$$ (11) \n\nFinding the minimum of Eq. 11 by setting its derivatives with respect to the mean field parameters $\\tau_i$ to zero gives the simple mean field equations: \n\n$$A_{ii} \\tau_i = (\\gamma + 1) \\left[ b_i - \\sum_j A_{ij} \\tau_j + \\frac{1}{\\tau_i} \\right].$$ (12) \n\nFigure 2: a) Slice sampling in one dimension. Given the current sample point, $x_i$, a height $y \\in [0, \\alpha P(x)]$ is randomly chosen. This defines a slice $\\{x \\in S \\mid \\alpha P(x) \\geq y\\}$ in which a new $x_{i+1}$ is chosen. b) For a multidimensional slice $S$, the new point $x_{i+1}$ is chosen using ballistic dynamics with specular reflections off the interior boundaries of the slice. \n\nThese equations can then be solved self-consistently for $\\tau_i$. The \"free\" statistics of the NNBM are then replaced by their statistics under the factorized distribution $Q(x)$: \n\n$$\\langle x_i \\rangle_f \\approx \\tau_i, \\qquad \\langle x_i x_j \\rangle_f \\approx \\left[ (\\gamma + 1)^2 + (\\gamma + 1) \\delta_{ij} \\right] \\tau_i \\tau_j.$$ 
(13) \n\nThe fidelity of this approximation is determined by how well the factorized distribution $Q(x)$ models the NNBM distribution. Unfortunately, for distributions such as the one shown in Fig. 3, the mean field approximation is quite different from the true multimodal NNBM distribution. This suggests that the naive mean field approximation is inadequate for learning in the NNBM, and in fact attempts to use this approximation fail to learn the examples given in the following sections. However, the mean field approximation can still be used to initialize the parameters to reasonable values before using the sampling techniques that are described below. \n\nMonte-Carlo sampling \n\nA more direct approach to calculating the \"free\" averages in Eq. 6-7 is to numerically approximate them. This can be accomplished by using Monte Carlo sampling to generate a representative set of points that sufficiently approximates the statistics of the continuous distribution. In particular, Markov chain Monte-Carlo methods employ an iterative stochastic dynamics whose equilibrium distribution converges to the desired distribution [4]. For the binary Boltzmann machine, such sampling dynamics involve random \"spin flips\" which change the value of a single binary component. Unfortunately, these single component dynamics are easily caught in local energy minima, and can converge very slowly for large systems. This makes sampling the binary distribution very difficult, and more specialized computational techniques such as simulated annealing, cluster updates, etc., have been developed to try to circumvent this problem. \n\nFor the NNBM, the use of continuous variables makes it possible to investigate different stochastic dynamics in order to more efficiently sample the distribution. 
We first experimented with Gibbs sampling with ordered overrelaxation [7], but found that the required inversion of the error function was too computationally expensive. Instead, the recently developed method of slice sampling [8] seems particularly well-suited for implementation in the NNBM. \n\nThe basic idea of the slice sampling algorithm is shown in Fig. 2. Given a sample point $x_i$, a random $y \\in [0, \\alpha P(x_i)]$ is first chosen uniformly. Then a slice $S$ is defined as the connected set of points $\\{x \\in S \\mid \\alpha P(x) \\geq y\\}$, and the new point $x_{i+1} \\in S$ is chosen randomly from this slice. The distribution of $x_n$ for large $n$ can be shown to converge to the desired density $P(x)$. Now, for the NNBM, finding the boundary points along a particular direction in a given slice is quite simple, since it only involves finding the roots of a quadratic equation. In order to efficiently choose a new point within a particular slice, reflective \"billiard ball\" dynamics are used. A random initial velocity is chosen, and the new point is evolved by travelling a certain distance from the current point while specularly reflecting from the boundaries of the slice. Intuitively, the reversibility of these reflections allows the dynamics to satisfy detailed balance. \n\nFigure 3: Contours of the two-dimensional competitive NNBM distribution overlaid by a) the $\\gamma = 1$ mean field approximation and b) 500 reflected slice samples. \n\nIn Fig. 3, the mean field approximation and reflective slice sampling are used to model the two-dimensional competitive NNBM distribution. The poor fit of the mean field approximation is apparent from the unimodality of the factorized density, while the sample points from the reflective slice sampling algorithm are more representative of the underlying NNBM distribution. 
For higher dimensional data, the mean field approximation becomes progressively worse. It is therefore necessary to implement the numerical slice sampling algorithm in order to accurately approximate the NNBM distribution. \n\nTranslationally invariant model \n\nBen-Yishai et al. have proposed a model for orientation tuning in primary visual cortex that can be interpreted as a cooperative NNBM distribution [9]. In the absence of visual input, the firing rates of $N$ cortical neurons are described as minimizing the energy function $E(x)$ with parameters: \n\n$$A_{ij} = \\delta_{ij} + \\frac{1}{N} - \\frac{\\epsilon}{N} \\cos\\left( \\frac{2\\pi}{N} |i - j| \\right), \\qquad b_i = 1.$$ (14) \n\nThis distribution was used to test the NNBM learning algorithm. First, a large set of $N = 25$ dimensional nonnegative training vectors was generated by sampling the distribution with $\\beta = 50$ and $\\epsilon = 4$. Using these samples as training data, the $A$ and $b$ parameters were learned from a unimodal initialization by evolving the training vectors using reflective slice sampling, and these evolved vectors were used to calculate the \"free\" averages in Eq. 6-7. The $A$ and $b$ estimates were then updated, and this procedure was iterated until the evolved averages matched those of the training data. The learned $A$ and $b$ parameters were then found to almost exactly match the original form in Eq. 14. Some representative samples from the learned NNBM distribution are shown in Fig. 4. \n\nFigure 4: Representative samples taken from an NNBM after training to learn a translationally invariant cooperative distribution with $\\beta = 50$ and $\\epsilon = 4$. \n\nFigure 5: a) Morphing of a face image by successive sampling from the learned NNBM distribution. b) Samples generated from a normal Gaussian. \n\nGenerative model for faces \n\nWe have also used the NNBM to learn a generative model for images of human faces. 
The NNBM is used to model the correlations in the coefficients of the nonnegative matrix factorization (NMF) of the face images [10]. NMF reduces the dimensionality of nonnegative data by decomposing the face images into parts corresponding to eyes, noses, ears, etc. Since the different parts are coactivated in reconstructing a face, the activations of these parts contain significant correlations that need to be captured by a generative model. Here we briefly demonstrate how the NNBM is able to learn these correlations. \n\nSampling from the NNBM stochastically generates coefficients which can be displayed graphically as face images. Fig. 5 shows some representative face images as the reflective slice sampling dynamics evolves the coefficients. Also displayed in the figure are the analogous images generated if a normal Gaussian is used to model the correlations instead. It is clear that the nonnegativity constraints and multimodal nature of the NNBM result in samples which are cleaner and more distinct as faces. \n\nDiscussion \n\nHere we have introduced the NNBM as a recurrent neural network model that is able to describe multimodal nonnegative data. Its application is made practical by the efficiency of the slice sampling Monte Carlo method. The learning algorithm incorporates numerical sampling from the NNBM distribution and is able to learn from observations of nonnegative data. We have demonstrated the application of NNBM learning to a cooperative, translationally invariant distribution, as well as to real data from images of human faces. \n\nExtensions to the present work include incorporating hidden units into the recurrent network. The addition of hidden units implies modelling certain higher order statistics in the data, and requires calculating averages over these hidden units. 
We anticipate the marginal distribution over these units to be most commonly unimodal, and hence mean field theory should be valid for approximating these averages. \n\nAnother possible extension involves generalizing the NNBM to model continuous data confined within a certain range, i.e. $0 \\leq x_i \\leq 1$. In this situation, slice sampling techniques would also be used to efficiently generate representative samples. In any case, we hope that this work stimulates more research into using these types of recurrent neural networks to model complex, multimodal data. \n\nAcknowledgements \n\nThe authors acknowledge useful discussion with John Hopfield, Sebastian Seung, Nicholas Socci, and Gayle Wittenberg, and are indebted to Haim Sompolinsky for pointing out the maximum entropy interpretation of the Boltzmann machine. This work was funded by Bell Laboratories, Lucent Technologies. \n\nO.B. Downs is grateful for the moral support, and open ears and minds of Beth Brittle, Gunther Lenz, and Sandra Scheitz. \n\nReferences \n\n[1] Hinton, GE & Sejnowski, TJ (1983). Optimal perceptual learning. IEEE Conference on Computer Vision and Pattern Recognition, Washington, DC, 448-453. \n\n[2] Ackley, DH, Hinton, GE, & Sejnowski, TJ (1985). A learning algorithm for Boltzmann machines. Cognitive Science 9, 147-169. \n\n[3] Socci, ND, Lee, DD, & Seung, HS (1998). The rectified Gaussian distribution. Advances in Neural Information Processing Systems 10, 350-356. \n\n[4] MacKay, DJC (1998). Introduction to Monte Carlo Methods. Learning in Graphical Models. Kluwer Academic Press, NATO Science Series, 175-204. \n\n[5] Galland, CC (1993). The limitations of deterministic Boltzmann machine learning. Network 4, 355-380. \n\n[6] Kappen, HJ & Rodriguez, FB (1997). Mean field approach to learning in Boltzmann machines. Pattern Recognition in Practice V, Amsterdam. \n\n[7] Neal, RM (1995). 
Suppressing random walks in Markov chain Monte Carlo using ordered overrelaxation. Technical Report 9508, Dept. of Statistics, University of Toronto. \n\n[8] Neal, RM (1997). Markov chain Monte Carlo methods based on \"slicing\" the density function. Technical Report 9722, Dept. of Statistics, University of Toronto. \n\n[9] Ben-Yishai, R, Bar-Or, RL, & Sompolinsky, H (1995). Theory of orientation tuning in visual cortex. Proc. Nat. Acad. Sci. USA 92, 3844-3848. \n\n[10] Lee, DD & Seung, HS (1999). Learning the parts of objects by non-negative matrix factorization. Nature 401, 788-791. \n\n", "award": [], "sourceid": 1743, "authors": [{"given_name": "Oliver", "family_name": "Downs", "institution": null}, {"given_name": "David", "family_name": "MacKay", "institution": null}, {"given_name": "Daniel", "family_name": "Lee", "institution": null}]}