{"title": "Unsupervised learning of distributions on binary vectors using two layer networks", "book": "Advances in Neural Information Processing Systems", "page_first": 912, "page_last": 919, "abstract": "", "full_text": "Unsupervised learning of distributions on binary vectors using two layer networks

Yoav Freund*
Computer and Information Sciences
University of California Santa Cruz
Santa Cruz, CA 95064

David Haussler
Computer and Information Sciences
University of California Santa Cruz
Santa Cruz, CA 95064

Abstract

We study a particular type of Boltzmann machine with a bipartite graph structure called a harmonium. Our interest is in using such a machine to model a probability distribution on binary input vectors. We analyze the class of probability distributions that can be modeled by such machines, showing that for each n ≥ 1 this class includes arbitrarily good approximations to any distribution on the set of all n-vectors of binary inputs. We then present two learning algorithms for these machines. The first learning algorithm is the standard gradient ascent heuristic for computing maximum likelihood estimates for the parameters (i.e. weights and thresholds) of the model. Here we give a closed form for this gradient that is significantly easier to compute than the corresponding gradient for the general Boltzmann machine. The second learning algorithm is a greedy method that creates the hidden units and computes their weights one at a time. This method is a variant of the standard method for projection pursuit density estimation. We give experimental results for these learning methods on synthetic data and natural data from the domain of handwritten digits.

1 Introduction

Let us suppose that each example in our input data is a binary vector x = (x_1, ..., x_n) ∈ {±1}^n,
and that each such example is generated independently at random according to some unknown distribution on {±1}^n. This situation arises, for instance, when each example consists of (possibly noisy) measurements of n different binary attributes of a randomly selected object. In such a situation, unsupervised learning can be usefully defined as using the input data to find a good model of the unknown distribution on {±1}^n and thereby learning the structure in the data.

The process of learning an unknown distribution from examples is usually called density estimation or parameter estimation in statistics, depending on the nature of the class of distributions used as models. Connectionist models of this type include Bayes networks [14], mixture models [3,13], and Markov random fields [14,8]. Network models based on the notion of energy minimization such as Hopfield nets [9] and Boltzmann machines [1] can also be used as models of probability distributions.

* yoav@cis.ucsc.edu

912

Unsupervised learning of distributions on binary vectors using 2-layer networks

913

The models defined by Hopfield networks are a special case of the more general Markov random field models in which the local interactions are restricted to symmetric pairwise interactions between components of the input. Boltzmann machines also use only pairwise interactions, but in addition they include hidden units, which correspond to unobserved variables. These unobserved variables interact with the observed variables represented by components of the input vector. The overall distribution on the set of possible input vectors is defined as the marginal distribution induced on the components of the input vector by the Markov random field over all variables, both observed and hidden. While the Hopfield network is relatively well understood, it is limited in the types of distributions that it can model.
On the other hand, Boltzmann machines are universal in the sense that they are powerful enough to model any distribution (to any degree of approximation), but the mathematical analysis of their capabilities is often intractable. Moreover, the standard learning algorithm for the Boltzmann machine, a gradient ascent heuristic to compute the maximum likelihood estimates for the weights and thresholds, requires repeated stochastic approximation, which results in unacceptably slow learning.^1 In this work we attempt to narrow the gap between Hopfield networks and Boltzmann machines by finding a model that is powerful enough to be universal,^2 yet simple enough to be analyzable and computationally efficient.^3 We have found such a model in a minor variant of the special type of Boltzmann machine defined by Smolensky in his harmony theory [16, Ch. 6]. This special type of Boltzmann machine is defined by a network with a simple bipartite graph structure, which he called a harmonium.

The harmonium consists of two types of units: input units, each of which holds one component of the input vector, and hidden units, representing hidden variables. There is a weighted connection between each input unit and each hidden unit, and no connections between input units or between hidden units (see Figure 1). The presence of the hidden units induces dependencies, or correlations, between the variables modeled by input units. To illustrate the kind of model that results, consider the distribution of people that visit a specific coffee shop on Sunday. Let each of the n input variables represent the presence (+1) or absence (-1) of a particular person that Sunday. These random variables are clearly not independent; e.g.,
if Fred's wife and daughter are there, it is more likely that Fred is there; if you see three members of the golf club, you expect to see other members of the golf club; if Bill is there you are unlikely to see Brenda there; etc. This situation can be modeled by a harmonium model in which each hidden variable represents the presence or absence of a social group. The weights connecting a hidden unit and an input unit measure the tendency of the corresponding person to be associated with the corresponding group. In this coffee shop situation, several social groups may be present at one time, exerting a combined influence on the distribution of customers. This can be modeled easily with the harmonium, but is difficult to model using Bayes networks or mixture models.^4

2 The Model

Let us begin by formalizing the harmonium model. To model a distribution on {±1}^n we will use n input units and some number m ≥ 0 of hidden units. These units are connected in a bipartite graph as illustrated in Figure 1.

The random variables represented by the input units each take values in {+1, -1}, while the hidden variables, represented by the hidden units, take values in {0, 1}. The state of the machine is defined by the values of these random variables. Define x = (x_1, ..., x_n) ∈ {±1}^n to be the state of the input units, and h = (h_1, ..., h_m) ∈ {0,1}^m to be the state of the hidden units.

The connection weights between the input units and the ith hidden unit are denoted^5 by w^(i) ∈ R^n, and the bias of the ith hidden unit is denoted by θ^(i) ∈ R. The parameter vector φ = {(w^(1), θ^(1)), ..., (w^(m), θ^(m))}

^1 One possible solution to this is the mean-field approximation [15], discussed further in section 4 below.
^2 In [4] we show that any distribution over {±1}^n can be approximated to within any desired accuracy by a harmonium model using 2^n hidden units.
^3 See also other work relating Bayes nets and Boltzmann machines [12,1].
^4 Noisy-OR gates have been introduced in the framework of Bayes networks to allow for such combinations. However, using this in networks with hidden units has not been studied, to the best of our knowledge.
^5 In [16, Ch. 6], binary connection weights are used. Here we use real-valued weights.

Figure 1: The bipartite graph of the harmonium, with m = 3 hidden units connected to n = 5 input units x_1, ..., x_5.

defines the entire network, and thus also the probability model induced by the network. For a given φ, the energy of a state configuration of hidden and input units is defined to be

E(x, h | φ) = -Σ_{i=1}^m (w^(i) · x + θ^(i)) h_i    (1)

and the probability of a configuration is

Pr(x, h | φ) = (1/Z) e^{-E(x, h | φ)},  where  Z = Σ_{x, h} e^{-E(x, h | φ)}.

Summing over h, it is easy to show that in the general case the probability distribution over possible state vectors on the input units is given by

Pr(x | φ) = (1/Z) Σ_{h ∈ {0,1}^m} e^{-E(x, h | φ)} = (1/Z) Π_{i=1}^m (1 + e^{w^(i) · x + θ^(i)}).    (2)

This product form is particular to the harmonium structure, and does not hold for general Boltzmann machines. Product form distribution models have been used for density estimation in Projection Pursuit [10,6,5]. We shall look further into this relationship in section 5.

3 Discussion of the model

The right hand side of Equation (2) has a simple intuitive interpretation. The ith factor in the product corresponds to the hidden variable h_i and is an increasing function of the dot product between x and the weight vector of the ith hidden unit. Hence an input vector x will tend to have large probability when it is in the direction of one of the weight vectors w^(i) (i.e. when w^(i) · x is large), and small probability otherwise. This is the way that the hidden variables can be seen to exert their "influence"; each corresponds to a
preferred or "prototypical" direction in space.

The next-to-last formula in Equation (2) shows that the harmonium model can be written as a mixture of 2^m distributions of the form

(1/Z(h)) exp(Σ_{i=1}^m (w^(i) · x + θ^(i)) h_i),

where h ∈ {0,1}^m and Z(h) is the appropriate normalization factor. It is easily verified that each of these distributions is in fact a product of n Bernoulli distributions on {+1, -1}, one for each input variable x_j. Hence the harmonium model can be interpreted as a kind of mixture model. However, the number of components in the mixture represented by a harmonium is exponential in the number of hidden units.

It is interesting to compare the class of harmonium models to the standard class of models defined by a mixture of products of Bernoulli distributions. The same bipartite graph described in Figure 1 can be used to define a standard mixture model. Assign each of the m hidden units a weight vector w^(i) and a probability p_i such that Σ_{i=1}^m p_i = 1. To generate an example, choose one of the hidden units according to the distribution defined by the p_i's, and then choose the vector x according to P_i(x) = (1/Z_i) e^{w^(i) · x}, where Z_i is the appropriate normalization factor so that Σ_{x ∈ {±1}^n} P_i(x) = 1. We thus get the distribution

P(x) = Σ_{i=1}^m (p_i / Z_i) e^{w^(i) · x}.    (3)

This form for presenting the standard mixture model emphasizes the similarity between this model and the harmonium model. A vector x will have large probability if the dot product w^(i) · x is large for some 1 ≤ i ≤ m (so long as p_i is not too small). However, unlike the standard mixture model, the harmonium model allows more than one hidden variable to be +1 for any generated example.
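To make the comparison concrete, the harmonium marginal of Equation (2) and the Bernoulli-mixture distribution of Equation (3) can be sketched numerically for a toy model. This is our illustration, not code from the paper; the sizes, weights, and mixing probabilities are arbitrary assumptions, and normalization is done by brute-force enumeration over all 2^n inputs, which is only feasible for small n.

```python
import numpy as np
from itertools import product

rng = np.random.default_rng(0)
n, m = 4, 2                          # toy sizes (assumed for illustration)
W = rng.normal(size=(m, n))          # rows are the weight vectors w^(i)
theta = rng.normal(size=m)           # hidden-unit biases theta^(i)
p = np.array([0.5, 0.5])             # mixing weights for the mixture model

# All 2^n input vectors in {-1, +1}^n.
states = [np.array(s) for s in product([-1.0, 1.0], repeat=n)]

def harmonium_prob(x):
    """Eq. (2): Pr(x) proportional to prod_i (1 + exp(w^(i).x + theta^(i)))."""
    f = lambda v: np.prod(1.0 + np.exp(W @ v + theta))
    return f(x) / sum(f(s) for s in states)   # brute-force Z, O(2^n)

def mixture_prob(x):
    """Eq. (3): Pr(x) = sum_i p_i exp(w^(i).x) / Z_i."""
    Zi = np.array([sum(np.exp(W[i] @ s) for s in states) for i in range(m)])
    return float(np.sum(p * np.exp(W @ x) / Zi))
```

In the harmonium every hidden unit contributes a factor to every example's probability, whereas the mixture picks exactly one component per example; that is the structural difference the text describes.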
This means that several hidden influences can combine in the generation of a single example, because several hidden variables can be +1 at the same time. To see why this is useful, consider the coffee shop example given in the introduction. At any moment of time it is reasonable to find several social groups of people sitting in the shop. The harmonium model will have a natural representation for this situation, while in order for the standard mixture model to describe it accurately, a hidden variable has to be assigned to each combination of social groups that is likely to be found in the shop at the same time. In such cases the harmonium model is exponentially more succinct than the standard mixture model.

4 Learning by gradient ascent on the log-likelihood

We now suppose that we are given a sample consisting of a set S of vectors in {±1}^n drawn independently at random from some unknown distribution. Our goal is to use the sample S to find a good model for this unknown distribution using a harmonium with m hidden units, if possible. The method we investigate here is the method of maximum likelihood estimation using gradient ascent. The goal of learning is thus reduced to finding the set of parameters for the harmonium that maximize the (log of the) probability of the set of examples S. In fact, this gives the standard learning algorithm for general Boltzmann machines. For a general Boltzmann machine this would require stochastic estimation of the parameters. As stochastic estimation is very time-consuming, the result is that learning is very slow. In this section we show that stochastic estimation need not be used for the harmonium model.

From (2), the log likelihood of a sample of input vectors S = {x^(1), x^(2), ..., x^(N)}, given a particular setting φ = {(w^(1), θ^(1)), ..., (w^(m), θ^(m))} of the parameters of the model, is:
log-likelihood(φ) = Σ_{x ∈ S} ln Pr(x | φ) = Σ_{x ∈ S} Σ_{i=1}^m ln(1 + e^{w^(i) · x + θ^(i)}) - N ln Z.    (4)

Taking the gradient of the log-likelihood results in the following formula for the jth component of w^(i):

(∂/∂w_j^(i)) log-likelihood(φ) = Σ_{x ∈ S} x_j / (1 + e^{-(w^(i) · x + θ^(i))}) - N Σ_{x ∈ {±1}^n} Pr(x | φ) x_j / (1 + e^{-(w^(i) · x + θ^(i))}).    (5)

A similar formula holds for the derivative of the bias term.

The purpose of the clamped and unclamped phases in the Boltzmann machine learning algorithm is to approximate these two terms. In general, this requires stochastic methods. However, here the clamped term is easy to calculate: it requires summing a logistic type function over all training examples. The same term is obtained by making the mean field approximation for the clamped phase in the general algorithm [15], which is exact in this case. It is more difficult to compute the sleep phase term, as it is an explicit sum over the entire input space, and within each term of this sum there is an implicit sum over the entire space of configurations of hidden units in the factor Pr(x | φ). However, again taking advantage of the special structure of the harmonium, we can reduce this sleep phase gradient term to a sum only over the configurations of the hidden units, yielding for each component of w^(i)

(∂/∂w_j^(i)) log-likelihood(φ) = Σ_{x ∈ S} x_j / (1 + e^{-(w^(i) · x + θ^(i))}) - N Σ_{h ∈ {0,1}^m} Pr(h | φ) h_i tanh(Σ_{k=1}^m h_k w_j^(k)),    (6)

where

Pr(h | φ) = exp(Σ_{i=1}^m h_i θ^(i)) Π_{j=1}^n cosh(Σ_{i=1}^m h_i w_j^(i)) / [Σ_{h' ∈ {0,1}^m} exp(Σ_{i=1}^m h'_i θ^(i)) Π_{j=1}^n cosh(Σ_{i=1}^m h'_i w_j^(i))].

Direct computation of (6) is fast for small m, in contrast to the case for general Boltzmann machines (we have performed experiments with m ≤ 10). However, for large m it is not possible to compute all 2^m terms.
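As a concrete check on the closed form, the gradient of Equation (6) can be computed exactly for small m by enumerating the 2^m hidden configurations. The sketch below is our illustration (toy sizes, random data, and helper names are assumptions, not the authors' code); it also implements the log-likelihood of Equation (4) so the two can be compared by finite differences.

```python
import numpy as np
from itertools import product

rng = np.random.default_rng(1)
n, m, N = 6, 3, 20                            # toy sizes (assumed)
W = 0.1 * rng.normal(size=(m, n))             # rows are w^(i)
theta = np.zeros(m)
S = rng.choice([-1.0, 1.0], size=(N, n))      # toy sample

def exact_gradient_W(W, theta, S):
    """Gradient of the log-likelihood w.r.t. W via Eq. (6): a clamped
    sum over the sample minus N times an exact expectation over the
    2^m hidden configurations. No stochastic estimation."""
    N = len(S)
    sig = 1.0 / (1.0 + np.exp(-(S @ W.T + theta)))   # (N, m) logistic terms
    clamped = sig.T @ S                               # (m, n)
    hs = np.array(list(product([0, 1], repeat=W.shape[0])), float)  # (2^m, m)
    # Unnormalized Pr(h): exp(h.theta) * prod_j cosh((hW)_j); the 2^n
    # factor is constant in h and cancels on normalization.
    logp = hs @ theta + np.log(np.cosh(hs @ W)).sum(axis=1)
    p = np.exp(logp - logp.max())
    p /= p.sum()
    sleep = (hs * p[:, None]).T @ np.tanh(hs @ W)     # (m, n)
    return clamped - N * sleep

def log_likelihood(W, theta, S):
    """Eq. (4), with ln Z computed over the 2^m hidden configurations."""
    hs = np.array(list(product([0, 1], repeat=W.shape[0])), float)
    logZ = np.logaddexp.reduce(hs @ theta + np.log(2 * np.cosh(hs @ W)).sum(axis=1))
    return np.log1p(np.exp(S @ W.T + theta)).sum() - len(S) * logZ
```

A central-difference check on any weight confirms the closed form; the cost is O(2^m) per gradient evaluation, not O(2^n), which is what makes small-m models tractable.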
There is a way to avoid this exponential explosion if we can assume that a small number of terms dominate the sums. If, for instance, we assume that the probability that more than k hidden units are active (+1) at the same time is negligibly small, we can get a good approximation by computing only O(m^k) terms. Alternately, if we are not sure which states of the hidden units have non-negligible probability, we can dynamically search, as part of the learning process, for the significant terms in the sum. This way we get an algorithm that is always accurate, and is efficient when the number of significant terms is small. In the extreme case where we assume that only one hidden unit is active at a time (i.e. k = 1), the harmonium model essentially reduces to the standard mixture model as discussed in section 3. For larger k, this type of assumption provides a middle ground between the generality of the harmonium model and the simplicity of the mixture model.

5 Projection Pursuit methods

A statistical method that has a close relationship with the harmonium model is the Projection Pursuit (PP) technique [10,6,5]. The use of projection pursuit in the context of neural networks has been studied by several researchers (e.g. [11]). Most of the work is in exploratory projection pursuit and projection pursuit regression. In this paper we are interested in projection pursuit density estimation. Here PP avoids the exponential blowup of the standard gradient ascent technique, and also has the advantage that the number m of hidden units is estimated from the sample as well, rather than being specified in advance.

Projection pursuit density estimation [6] is based on several types of analysis, using the central limit theorem, that lead to the following general conclusion.
If x ∈ R^n is a random vector for which the different coordinates are independent, and w ∈ R^n is a vector from the n-dimensional unit sphere, then the distribution of the projection w · x is close to Gaussian for most w. Thus searching for those directions w for which the projection of a sample is most non-Gaussian is a way of detecting dependencies between the coordinates in high dimensional distributions. Several "projection indices" have been studied in the literature for measuring the "non-Gaussianity" of a projection, each enhancing different properties of the projected distribution. In order to find more than one projection direction, several methods of "structure elimination" have been devised. These methods transform the sample in such a way that the direction in which non-Gaussianity has been detected appears to be Gaussian, thus enabling the algorithm to detect non-Gaussian projections that would otherwise be obscured. The search for a description of the distribution of a sample in terms of its projections can be formalized in the context of maximum likelihood density estimation [6]. In order to create a formal relation between the harmonium model and projection pursuit, we define a variant of the model that defines a density over R^n instead of a distribution over {±1}^n. Based on this form we devise a projection index and a structure removal method that are the basis of the following learning algorithm (described fully in [4]).

• Initialization
Set S_0 to be the input sample.
Set P_0 to be the initial distribution (Gaussian).

• Iteration
Repeat the following steps for i = 1, 2, ... until no single-variable harmonium model has a significantly higher likelihood than the Gaussian distribution with respect to S_i.

1.
Perform an estimate-maximize (EM) [2] search on the log-likelihood of a single hidden variable model on the sample S_{i-1}. Denote by θ^(i) and w^(i) the parameters found by the search, and create a new hidden unit with associated binary r.v. h_i with these weights and bias.

2. Transform S_{i-1} into S_i using the following structure removal procedure. For each example x ∈ S_{i-1}, compute the probability that the hidden variable h_i found in the last step is 1 on this input:

P(h_i = 1) = (1 + e^{-(θ^(i) + w^(i) · x)})^{-1}.

Flip a coin that has probability of "head" equal to P(h_i = 1). If the coin turns out "head" then add x - w^(i) to S_i, else add x to S_i.

3. Set P_i(x) to be P_{i-1}(x) Z_i^{-1} (1 + e^{θ^(i) + w^(i) · x}).

6 Experimental work

We have carried out several experiments to test the performance of unsupervised learning using the harmonium model. These are not, at this stage, extensive experimental comparisons, but they do provide initial insights into the issues regarding our learning algorithms and the use of the harmonium model for learning real world tasks.

The first set of experiments studies two methods for learning the harmonium model. The first is the gradient ascent method, and the second is the projection pursuit method. The experiments in this set were performed on synthetically generated data. The input consisted of binary vectors of 64 bits that represent 8 x 8 binary images. The images are synthesized using a harmonium model with 10 hidden units whose weights were set as in Figure 2(c). The ultimate goal of the learning algorithms was to retrieve the model that generated the data. To measure the quality of the models generated by the algorithms we use three different measures.
The likelihood of the model,^6 the fraction of correct predictions the model makes when used to predict the value of a single input bit given all the other bits, and the performance of the model when used to reconstruct the input from the most probable state of the hidden units.^7 All experiments use a test set and a train set, each containing 1000 examples. The gradient ascent method used a standard momentum term, and typically needed about 1000 epochs to stabilize. In the projection pursuit algorithm, 4 iterations of EM per hidden unit proved sufficient to find a stable solution. The results are summarized in the following table and in Figure 2.

                                        likelihood       single bit prediction   input reconstruction
                                        train    test    train    test           train    test
gradient ascent for 1000 epochs         0.399    0.425   0.098    0.100          0.311    0.338
projection pursuit                      0.799    0.802   0.119    0.114          0.475    0.480
projection pursuit followed by
  gradient ascent for 100 epochs        0.411    0.430   0.091    0.089          0.315    0.334
projection pursuit followed by
  gradient ascent for 1000 epochs       0.377    0.405   0.071    0.082          0.261    0.287
true model                              0.372    0.404   0.062    0.071          0.252    0.283

Looking at the table and Figure 2, and taking into account execution times, it appears that gradient ascent is slow but eventually finds much of the underlying structure in the distribution, although several of the hidden units (see units 1, 2, 6, 7, counting from the left, in Figure 2(a)) have no obvious relation to the true model. In contrast, PP is fast and finds all of the features of the true model, albeit sometimes

^6 We present the negation of the log-likelihood, scaled so that the uniform distribution will have likelihood 1.0.
^7 More precisely, for each input unit j we compute the probability p_j that it has value +1. Then for example (x_1, ..., x_n), we measure -Σ_{j=1}^n log_2(1/2 + x_j(p_j - 1/2)).
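The single-bit measure in footnote 7 can be sketched directly. This is our toy re-implementation (the helper name and inputs are assumptions); the key observation is that 1/2 + x_j(p_j - 1/2) is simply the probability the model assigned to the bit value actually observed.

```python
import numpy as np

def single_bit_loss(x, p):
    """Footnote-7 measure: total -log2 of the probability the model
    assigns to each true bit.  x is an example in {-1, +1}^n;
    p[j] is the model's estimate of Pr(x_j = +1 | all other bits).
    The term 1/2 + x_j (p_j - 1/2) equals p_j when x_j = +1 and
    1 - p_j when x_j = -1."""
    return float(-np.log2(0.5 + x * (p - 0.5)).sum())
```

An uninformed model (p_j = 1/2 everywhere) loses exactly one bit per input unit, so lower values indicate better single-bit prediction.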
Figure 2: The weight vectors of the models in the synthetic data experiments. Each matrix represents the 64 weights of one hidden unit. The square above the matrix represents the unit's bias. Positive weights are displayed as full squares and negative weights as empty squares; the area of the square is proportional to the absolute value of the weight. (a) The weights in the model found by gradient ascent alone. (b) The weights in the model found by projection pursuit alone. (c) The weights in the model used for generating the data. (d) The weights in the model found by projection pursuit followed by gradient ascent.
For this last model we also show the histograms of the projection of the examples on the directions defined by those weight vectors; the bimodality expected from projection pursuit analysis is evident.

in combinations. However, the error measurements show that something is still missing from the models found by our implementation of PP. Following PP by a gradient ascent phase seems to give the best of both algorithms, finding a good approximation after only 140 epochs (40 PP + 100 gradient) and recovering the true model almost exactly after 1040 epochs.

In the second set of experiments we compare the performance of the harmonium model to that of the mixture model. The comparison uses real world data extracted from the NIST handwritten digits database.^8 Examples are 16 x 16 binary images (see Figure 3). We use 60 hidden units to model the distribution in both of the models. Because of the large number of hidden units we cannot use gradient ascent learning and instead use projection pursuit. For the same reason it was not possible to compute the likelihood of the harmonium model, and only the other two measures of error were used. Each test was run several times to get accuracy bounds on the measurements. The results are summarized in the following table.

                    single bit prediction               input reconstruction
                    train            test               train            test
Mixture model       0.185 ± 0.005    0.258 ± 0.005      0.518 ± 0.002    0.715 ± 0.002
Harmonium model     0.21 ± 0.01      0.20 ± 0.01        0.63 ± 0.05      0.66 ± 0.03

In Figure 4 we show some typical weight vectors found for the mixture model and for the harmonium model. It is clear that while the mixture model finds weights that are some kind of average prototypes of complete digits, the harmonium model finds weights that correspond to local features such as lines and contrasts.
There is a small but definite improvement in the errors of the harmonium model with respect to the errors of the mixture model. As the experiments on synthetic data have shown that PP does not reach optimal solutions by itself, we expect the advantage of the harmonium model over the mixture model to increase further by using improved learning methods. Of course, the harmonium model is a very general distribution model and is not specifically tuned to the domain of handwritten digit images, thus it cannot be compared to models specifically developed to capture structures in this domain. However, the experimental results support our claim that the harmonium model is a simple and tractable mathematical model for describing distributions in which several correlation patterns combine to generate each individual example.

^8 NIST Special Database 1, HWDB Rel 1-1.1, May 1990.

Figure 3: A few examples from the handwritten digits sample.

Figure 4: Typical weight vectors found by the mixture model (left) and the harmonium model (right).
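To illustrate how several hidden "correlation patterns" combine in a single generated example, here is a sketch (ours, with arbitrary toy parameters, not the authors' code) of exact two-stage sampling from a small harmonium: draw h from Pr(h | φ) by enumerating the 2^m hidden states, then draw each input bit independently given h.

```python
import numpy as np
from itertools import product

def sample_harmonium(W, theta, size, rng):
    """Exact ancestral sampling for a small harmonium (assumed helper).
    The hidden marginal is Pr(h) proportional to
    exp(h.theta) * prod_j 2 cosh((hW)_j), and given h each input bit is
    +1 independently with probability sigmoid(2 (hW)_j)."""
    m, n = W.shape
    hs = np.array(list(product([0, 1], repeat=m)), dtype=float)   # 2^m states
    logp = hs @ theta + np.log(2 * np.cosh(hs @ W)).sum(axis=1)
    p = np.exp(logp - logp.max())
    p /= p.sum()                                   # Pr(h | phi)
    idx = rng.choice(len(hs), size=size, p=p)      # which hidden "groups" are on
    a = hs[idx] @ W                                # combined field on each bit
    return np.where(rng.random((size, n)) < 1.0 / (1.0 + np.exp(-2 * a)), 1.0, -1.0)
```

With all weights zero the samples are uniform on {±1}^n; with several strongly weighted hidden units, a sample superimposes the patterns of every unit that switched on, which is exactly the combination-of-influences behavior discussed above.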
References

[1] D. H. Ackley, G. E. Hinton, and T. J. Sejnowski. A learning algorithm for Boltzmann machines. Cognitive Science, 9:147-169, 1985.
[2] A. Dempster, N. Laird, and D. Rubin. Maximum likelihood from incomplete data via the EM algorithm. J. Roy. Statist. Soc. B, 39:1-38, 1977.
[3] B. Everitt and D. Hand. Finite Mixture Distributions. Chapman and Hall, 1981.
[4] Y. Freund and D. Haussler. Unsupervised learning of distributions on binary vectors using two layer networks. Technical Report UCSC-CRL-91-20, Univ. of Calif. Computer Research Lab, Santa Cruz, CA, 1992 (to appear).
[5] J. H. Friedman. Exploratory projection pursuit. J. Amer. Stat. Assoc., 82(397):249-266, Mar. 1987.
[6] J. H. Friedman, W. Stuetzle, and A. Schroeder. Projection pursuit density estimation. J. Amer. Stat. Assoc., 79:599-608, 1984.
[7] H. Geffner and J. Pearl. On the probabilistic semantics of connectionist networks. Technical Report CSD-870033, UCLA Computer Science Department, July 1987.
[8] S. Geman and D. Geman. Stochastic relaxation, Gibbs distributions and the Bayesian restoration of images. IEEE Trans. on Pattern Analysis and Machine Intelligence, 6:721-742, 1984.
[9] J. Hopfield. Neural networks and physical systems with emergent collective computational abilities. Proc. Natl. Acad. Sci. USA, 79:2554-2558, Apr. 1982.
[10] P. Huber. Projection pursuit (with discussion). Ann. Stat., 13:435-525, 1985.
[11] N. Intrator. Feature extraction using an unsupervised neural network. In D. Touretzky, J. Elman, T. Sejnowski, and G. Hinton, editors, Proceedings of the 1990 Connectionist Models Summer School, pages 310-318. Morgan Kaufmann, San Mateo, CA, 1990.
[12] R. M. Neal. Learning stochastic feedforward networks. Technical report, Department of Computer Science, University of Toronto, Nov. 1990.
[13] S. Nowlan. Maximum likelihood competitive learning. In D.
Touretzky, editor, Advances in Neural Information Processing Systems, volume 2, pages 574-582. Morgan Kaufmann, 1990.
[14] J. Pearl. Probabilistic Reasoning in Intelligent Systems. Morgan Kaufmann, 1988.
[15] C. Peterson and J. R. Anderson. A mean field theory learning algorithm for neural networks. Complex Systems, 1:995-1019, 1987.
[16] D. E. Rumelhart and J. L. McClelland. Parallel Distributed Processing: Explorations in the Microstructure of Cognition. Volume 1: Foundations. MIT Press, Cambridge, Mass., 1986.
", "award": [], "sourceid": 535, "authors": [{"given_name": "Yoav", "family_name": "Freund", "institution": null}, {"given_name": "David", "family_name": "Haussler", "institution": null}]}