{"title": "Learning a Continuous Hidden Variable Model for Binary Data", "book": "Advances in Neural Information Processing Systems", "page_first": 515, "page_last": 521, "abstract": null, "full_text": "Learning a Continuous Hidden Variable \n\nModel for Binary Data \n\nDaniel D. Lee \nBell Laboratories \n\nLucent Technologies \nMurray Hill, NJ 07974 \nddlee~bell-labs.com \n\nHaim Sompolinsky \n\nRacah Institute of Physics and \nCenter for Neural Computation \n\nHebrew University \n\nJerusalem, 91904, Israel \nhaim~fiz.huji . ac.il \n\nAbstract \n\nA directed generative model for binary data using a small number \nof hidden continuous units is investigated. A clipping nonlinear(cid:173)\nity distinguishes the model from conventional principal components \nanalysis. The relationships between the correlations of the underly(cid:173)\ning continuous Gaussian variables and the binary output variables \nare utilized to learn the appropriate weights of the network. The \nadvantages of this approach are illustrated on a translationally in(cid:173)\nvariant binary distribution and on handwritten digit images. \n\nIntroduction \n\nPrincipal Components Analysis (PCA) is a widely used statistical technique for rep(cid:173)\nresenting data with a large number of variables [1]. It is based upon the assumption \nthat although the data is embedded in a high dimensional vector space, most of \nthe variability in the data is captured by a much lower climensional manifold. In \nparticular for PCA, this manifold is described by a linear hyperplane whose char(cid:173)\nacteristic directions are given by the eigenvectors of the correlation matrix with \nthe largest eigenvalues. The success of PCA and closely related techniques such as \nFactor Analysis (FA) and PCA mixtures clearly indicate that much real world data \nexhibit the low dimensional manifold structure assumed by these models [2, 3]. 
However, the linear manifold structure of PCA is not appropriate for data with binary valued variables. Binary values commonly occur in data such as computer bit streams, black-and-white images, on-off outputs of feature detectors, and electrophysiological spike train data [4]. The Boltzmann machine is a neural network model that incorporates hidden binary spin variables, and in principle, it should be able to model binary data with arbitrary spin correlations [5]. Unfortunately, the computational time needed for training a Boltzmann machine renders it impractical for most applications.

Figure 1: Generative model for N-dimensional binary data using a small number P of continuous hidden variables.

In these proceedings, we present a model that uses a small number of continuous hidden variables rather than hidden binary variables to capture the variability of binary valued visible data. The generative model differs from conventional PCA because it incorporates a clipping nonlinearity. The resulting spin configurations have an entropy related to the number of hidden variables used, and the resulting states are connected by small numbers of spin flips. The learning algorithm is particularly simple, and is related to PCA by a scalar transformation of the correlation matrix.

Generative Model

Figure 1 shows a schematic diagram of the generative process. As in PCA, the model assumes that the data is generated by a small number P of continuous hidden variables y_i. Each of the hidden variables is assumed to be drawn independently from a normal distribution with unit variance:

P(y_i) = \exp(-y_i^2/2)/\sqrt{2\pi}.  (1)
The continuous hidden variables are combined using the feedforward weights W_{ij}, and the N binary output units are then calculated using the sign of the feedforward activations:

x_i = \sum_{j=1}^{P} W_{ij} y_j,  (2)

s_i = \mathrm{sgn}(x_i).  (3)

Since binary data is commonly obtained by thresholding, it seems reasonable that a proper generative model should incorporate such a clipping nonlinearity. The generative process is similar to that of a sigmoidal belief network with continuous hidden units at zero temperature. The nonlinearity will alter the relationship between the correlations of the binary variables and the weight matrix W as described below.

The real-valued Gaussian variables x_i are exactly analogous to the visible variables of conventional PCA. They lie on a linear hyperplane determined by the span of the matrix W, and their correlation matrix is given by:

C^{xx} = \langle x x^T \rangle = W W^T.  (4)

Figure 2: Binary spin configurations s_i in the vector space of continuous hidden variables y_j with P = 2 and N = 3.

By construction, the correlation matrix C^{xx} has rank P, which is much smaller than the number of components N. Now consider the binary output variables s_i = sgn(x_i). Their correlations can be calculated from the probability distribution of the Gaussian variables x_i:

(C^{ss})_{ij} = \langle s_i s_j \rangle = \int \prod_k dy_k \, P(y_k) \, \mathrm{sgn}(x_i) \, \mathrm{sgn}(x_j),  (5)

where the x_i of Equation 2 are jointly Gaussian, so that the average reduces to a two-dimensional Gaussian integral over (x_i, x_j) with covariances

\langle x_i x_j \rangle = C^{xx}_{ij}.  (6)

The integrals in Equation 5 can be done analytically, and yield the surprisingly simple result:

(C^{ss})_{ij} = \frac{2}{\pi} \sin^{-1}\left[\frac{C^{xx}_{ij}}{\sqrt{C^{xx}_{ii}\, C^{xx}_{jj}}}\right].  (7)
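The arcsine relation of Equation 7 is easy to verify by direct sampling. The following sketch, our illustration with an arbitrarily chosen correlation coefficient, clips a pair of correlated unit-variance Gaussians and compares the empirical binary correlation with the prediction:

```python
import numpy as np

# Draw many samples of two unit-variance Gaussians with correlation rho,
# clip them to +/-1, and compare with the arcsine prediction of Equation 7.
rng = np.random.default_rng(0)
rho = 0.6
cov = np.array([[1.0, rho], [rho, 1.0]])
x = rng.multivariate_normal(np.zeros(2), cov, size=200_000)
s = np.sign(x)                                   # clipped binary variables
c_ss_empirical = (s[:, 0] * s[:, 1]).mean()
c_ss_predicted = (2.0 / np.pi) * np.arcsin(rho)  # Equation 7
```

The two numbers agree to within sampling error, confirming that clipping shrinks the correlation from rho to (2/pi) arcsin(rho).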
Thus, the correlations of the clipped binary variables C^{ss} are related to the correlations of the corresponding Gaussian variables C^{xx} through the nonlinear arcsine function. The normalization in the denominator of the arcsine argument reflects the fact that the sign function is unchanged by a scale change in the Gaussian variables. Although the correlation matrix C^{ss} and the generating correlation matrix C^{xx} are easily related through Equation 7, they have qualitatively very different properties. In general, the correlation matrix C^{ss} will no longer have the low rank structure of C^{xx}. As illustrated by the translationally invariant example in the next section, the spectrum of C^{ss} may contain a whole continuum of eigenvalues even though C^{xx} has only a few nonzero eigenvalues.

PCA is typically used for dimensionality reduction of real variables; can this model be used for compressing the binary outputs s_i? Although the output correlations C^{ss} no longer display the low rank structure of the generating C^{xx}, a more appropriate measure of data compression is the entropy of the binary output states. Consider how many of the 2^N possible binary states will be generated by the clipping process. The equation x_i = \sum_j W_{ij} y_j = 0 defines a P-1 dimensional hyperplane in the P-dimensional state space of hidden variables y_j, which are shown as dashed lines in Figure 2. These hyperplanes partition the half-space where s_i = +1 from the region where s_i = -1.

Figure 3: Translationally invariant binary spin distribution with N = 256 units. Representative samples from the distribution are illustrated on the left, while the eigenvalue spectra of C^{ss} and C^{xx} are plotted on the right.

Each of the N spin variables will have such a dividing hyperplane in this P-dimensional state space, and all of these hyperplanes will generically be unique. Thus, the total number of spin configurations s_i is determined by the number of cells bounded by N dividing hyperplanes in P dimensions. The number of such cells is approximately N^P for N >> P, a well-known result from perceptrons [6]. To leading order for large N, the entropy of the binary states generated by this process is then given by S = P log N. Thus, the entropy of the spin configurations generated by this model is directly proportional to the number of hidden variables P.

How is the topology of the binary spin configurations s_i related to the PCA manifold structure of the continuous variables x_i? Each of the generated spin states is represented by a polytope cell in the P-dimensional vector space of hidden variables. Each polytope has at least P + 1 neighboring polytopes which are related to it by a single or small number of spin flips. Therefore, although the state space of binary spin configurations is discrete, the continuous manifold structure of the underlying Gaussian variables in this model is manifested as binary output configurations with low entropy that are connected with small Hamming distances.

Translationally Invariant Example

In principle, the weights W could be learned by applying maximum likelihood to this generative model; however, the resulting learning algorithm involves analytically intractable multi-dimensional integrals. Alternatively, approximations based upon mean field theory or importance sampling could be used to learn the appropriate parameters [7].
However, Equation 7 suggests a simple learning rule that is also approximate, but is much more computationally efficient [8]. First, the binary correlation matrix C^{ss} is computed from the data. Then the empirical C^{ss} is mapped into the appropriate Gaussian correlation matrix using the nonlinear transformation: C^{xx} = \sin(\pi C^{ss}/2). This results in a Gaussian correlation matrix where the variances of the individual x_i are fixed at unity. The weights W are then calculated using the conventional PCA algorithm. The correlation matrix C^{xx} is diagonalized, and the eigenvectors with the largest eigenvalues are used to form the columns of W to yield the best low rank approximation C^{xx} \approx W W^T. Scaling the variables x_i will result in a correlation matrix C^{xx} with slightly different eigenvalues but with the same rank.

The utility of this transformation is illustrated by the following simple example. Consider the distribution of N = 256 binary spins shown in Figure 3. Half of the spins are chosen to be positive, and the location of the positive bump is arbitrary under the periodic boundary conditions. Since the distribution is translationally invariant, the correlations C^{ss}_{ij} depend only on the relative distance between spins |i - j|. The eigenvectors are the Fourier modes, and their eigenvalues correspond to their overlap with a triangle wave. The eigenvalue spectrum of C^{ss} is plotted in Figure 3 as sorted by rank. In this particular case, the correlation matrix C^{ss} has N/2 positive eigenvalues with a corresponding range of values.

Now consider the matrix C^{xx} = \sin(\pi C^{ss}/2). The eigenvalues of C^{xx} are also shown in Figure 3. In contrast to the many different eigenvalues of C^{ss}, the spectrum of the Gaussian correlation matrix C^{xx} has only two positive eigenvalues, with all the rest exactly equal to zero.
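This two-eigenvalue structure can be reproduced numerically. The sketch below is our own construction, assuming the bump occupies exactly N/2 contiguous sites with all N cyclic shifts equally likely; it builds C^{ss} exactly, applies the sin(\pi C^{ss}/2) transformation, and inspects the two spectra:

```python
import numpy as np

N = 256
# One "bump" state: +1 on half the ring, -1 on the other half.
base = np.where(np.arange(N) < N // 2, 1.0, -1.0)
# All N translations of the bump (periodic boundary conditions).
S = np.stack([np.roll(base, k) for k in range(N)])
C_ss = S.T @ S / N                # exact binary correlations (triangle wave)
C_xx = np.sin(np.pi * C_ss / 2)   # mapped Gaussian correlations
evals_ss = np.sort(np.linalg.eigvalsh(C_ss))[::-1]
evals_xx = np.sort(np.linalg.eigvalsh(C_xx))[::-1]
```

Here C^{ss} has many sizable positive eigenvalues, while C^{xx} has exactly two nonzero eigenvalues (each equal to N/2), exposing the P = 2 structure of the generative process.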
The corresponding eigenvectors are a cosine and a sine function. The generative process can thus be understood as a linear combination of the two eigenmodes to yield a sine function with arbitrary phase. This function is then clipped to yield the positive bump seen in the original binary distribution. In comparison with the eigenvalues of C^{ss}, the eigenvalue spectrum of C^{xx} makes obvious the low rank structure of the generative process. In this case, the original binary distribution can be constructed using only P = 2 hidden variables, whereas it is not clear from the eigenvalues of C^{ss} what the appropriate number of modes is. This illustrates the utility of determining the principal components from the calculated Gaussian correlation matrix C^{xx} rather than working directly with the observable binary correlation matrix C^{ss}.

Handwritten Digits Example

This model was also applied to a more complex data set. A large set of 16 x 16 black and white images of handwritten twos was taken from the US Post Office digit database [9]. The pixel means and pixel correlations were directly computed from the images. The generative model needs to be slightly modified to account for the non-zero means in the binary outputs. This is accomplished by adding fixed biases \delta_i to the Gaussian variables x_i before clipping:

s_i = \mathrm{sgn}(\delta_i + x_i).  (8)

The biases \delta_i can be related to the means of the binary outputs through the expression:

\delta_i = \sqrt{2 C^{xx}_{ii}}\, \mathrm{erf}^{-1}(\langle s_i \rangle).  (9)

This allows the biases to be directly computed from the observed means of the binary variables. Unfortunately, with non-zero biases, the relationship between the Gaussian correlations C^{xx} and binary correlations C^{ss} is no longer the simple expression found in Equation 7.
Instead, the correlations are related by the following integral equation:

(C^{ss})_{ij} = \int dx_i\, dx_j\, P(x_i, x_j)\, \mathrm{sgn}(\delta_i + x_i)\, \mathrm{sgn}(\delta_j + x_j),  (10)

where P(x_i, x_j) is the bivariate Gaussian distribution with covariances given by C^{xx}. Given the empirical pixel correlations C^{ss} for the handwritten digits, the integral in Equation 10 is numerically solved for each pair of indices to yield the appropriate Gaussian correlation matrix C^{xx}.

Figure 4: Eigenvalue spectrum of C^{ss} and C^{xx} for handwritten images of twos. The inset shows the P = 16 most significant eigenvectors for C^{xx} arranged by rows. The right side of the figure shows a nonlinear morph between two different instances of a handwritten two using these eigenvectors.

The correlation matrices are diagonalized and the resulting eigenvalue spectra are shown in Figure 4. The eigenvalues for C^{xx} again exhibit a characteristic drop that is steeper than the falloff in the spectrum of the binary correlations C^{ss}. The corresponding eigenvectors of C^{xx} with the 16 largest positive eigenvalues are depicted in the inset of Figure 4. These eigenmodes represent common image distortions such as rotations and stretching, and appear qualitatively similar to those found by the standard PCA algorithm.

A generative model with weights W corresponding to the P = 16 eigenvectors shown in Figure 4 is used to fit the handwritten twos, and the utility of this nonlinear generative model is illustrated in the right side of Figure 4. The top and bottom images in the figure are two different examples of a handwritten two from the data set, and the generative model is used to morph between the two examples.
The hidden values y_i for the original images are first determined for the different examples, and the intermediate images in the morph are constructed by linearly interpolating in the vector space of the hidden units. Because of the clipping nonlinearity, this induces a nonlinear mapping in the outputs, with binary units being flipped in a particular order as determined by the generative model. In contrast, morphing using conventional PCA would result in a simple linear interpolation between the two images, and the intermediate images would not look anything like the original binary distribution [10].

The correlation matrix C^{xx} also happens to contain some small negative eigenvalues. Even though the binary correlation matrix C^{ss} is positive definite, the transformation in Equation 10 does not guarantee that the resulting matrix C^{xx} will also be positive definite. The presence of these negative eigenvalues indicates a shortcoming of the generative process for modelling this data. In particular, the clipped Gaussian model is unable to capture correlations induced by global constraints in the data. As a simple illustration of this shortcoming in the generative model, consider the binary distribution defined by the probability density: P(\{s\}) \propto \lim_{\beta \to \infty} \exp(-\beta \sum_{ij} s_i s_j). The states in this distribution are defined by the constraint that the sum of the binary variables is exactly zero: \sum_i s_i = 0. Now, for N \geq 4, it can be shown that it is impossible to find a Gaussian distribution whose visible binary variables match the negative correlations induced by this sum constraint.

These examples illustrate the value of using the clipped generative model to learn the correlation matrix of the underlying Gaussian variables rather than using the correlations of the outputs directly.
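The zero-mean version of the learning procedure discussed above fits in a few lines. The sketch below is our own code, with `learn_weights` a hypothetical name; it maps the empirical binary correlations through sin(\pi C^{ss}/2) and diagonalizes, as in the translationally invariant example. Handling non-zero means would additionally require the bias computation of Equation 9 and the numerical inversion of Equation 10:

```python
import numpy as np

def learn_weights(S, P):
    """Learn an N x P weight matrix W from zero-mean binary data.

    S is a (samples x N) array of +/-1 values. Following the learning rule
    described above: (1) compute the binary correlations C^ss, (2) map them
    to Gaussian correlations via C^xx = sin(pi * C^ss / 2), (3) keep the
    top-P eigenvectors, scaled so that W W^T approximates C^xx.
    """
    C_ss = S.T @ S / len(S)
    C_xx = np.sin(np.pi * C_ss / 2)
    evals, evecs = np.linalg.eigh(C_xx)
    order = np.argsort(evals)[::-1][:P]
    # Small negative eigenvalues can occur (see the discussion above);
    # clip them to zero before taking square roots.
    scale = np.sqrt(np.clip(evals[order], 0.0, None))
    return evecs[:, order] * scale
```

New samples are then generated by drawing y from a unit-variance normal distribution and emitting s = sgn(W y), as in Equations 1-3.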
The clipping nonlinearity is convenient because the relationship between the hidden variables and the output variables is particularly easy to understand. The learning algorithm differs from other nonlinear PCA models and autoencoders because the inverse mapping function need not be explicitly learned [11, 12]. Instead, the correlation matrix is directly transformed from the observable variables to the underlying Gaussian variables. The correlation matrix is then diagonalized to determine the appropriate feedforward weights. This results in an extremely efficient training procedure that is directly analogous to PCA for continuous variables.

We acknowledge the support of Bell Laboratories, Lucent Technologies, and the US-Israel Binational Science Foundation. We also thank H. S. Seung for helpful discussions.

References

[1] Jolliffe, IT (1986). Principal Component Analysis. New York: Springer-Verlag.

[2] Bartholomew, DJ (1987). Latent variable models and factor analysis. London: Charles Griffin & Co. Ltd.

[3] Hinton, GE, Dayan, P & Revow, M (1996). Modeling the manifolds of images of handwritten digits. IEEE Transactions on Neural Networks 8, 65-74.

[4] Van Vreeswijk, C, Sompolinsky, H, & Abeles, M (1999). Nonlinear statistics of spike trains. In preparation.

[5] Ackley, DH, Hinton, GE, & Sejnowski, TJ (1985). A learning algorithm for Boltzmann machines. Cognitive Science 9, 147-169.

[6] Cover, TM (1965). Geometrical and statistical properties of systems of linear inequalities with applications in pattern recognition. IEEE Trans. Electronic Comput. 14, 326-334.

[7] Tipping, ME (1999). Probabilistic visualisation of high-dimensional binary data. Advances in Neural Information Processing Systems 11.

[8] Christoffersson, A (1975). Factor analysis of dichotomized variables. Psychometrika 40, 5-32.

[9] LeCun, Y, et al. (1989).
Backpropagation applied to handwritten zip code recognition. Neural Computation 1, 541-551.

[10] Bregler, C, & Omohundro, SM (1995). Nonlinear image interpolation using manifold learning. Advances in Neural Information Processing Systems 7, 973-980.

[11] Hastie, T and Stuetzle, W (1989). Principal curves. Journal of the American Statistical Association 84, 502-516.

[12] Demers, D, & Cottrell, G (1993). Nonlinear dimensionality reduction. Advances in Neural Information Processing Systems 5, 580-587.
", "award": [], "sourceid": 1580, "authors": [{"given_name": "Daniel", "family_name": "Lee", "institution": null}, {"given_name": "Haim", "family_name": "Sompolinsky", "institution": null}]}