{"title": "Unsupervised Learning of Mixtures of Multiple Causes in Binary Data", "book": "Advances in Neural Information Processing Systems", "page_first": 27, "page_last": 34, "abstract": null, "full_text": "Unsupervised Learning of Mixtures of \n\nMultiple Causes in Binary Data \n\nEric Saund \n\nXerox Palo Alto Research Center \n\n3333 Coyote Hill Rd., Palo Alto, CA, 94304 \n\nAbstract \n\nThis paper presents a formulation for unsupervised learning of clus(cid:173)\nters reflecting multiple causal structure in binary data. Unlike the \nstandard mixture model, a multiple cause model accounts for ob(cid:173)\nserved data by combining assertions from many hidden causes, each \nof which can pertain to varying degree to any subset of the observ(cid:173)\nable dimensions. A crucial issue is the mixing-function for combin(cid:173)\ning beliefs from different cluster-centers in order to generate data \nreconstructions whose errors are minimized both during recognition \nand learning. We demonstrate a weakness inherent to the popular \nweighted sum followed by sigmoid squashing, and offer an alterna(cid:173)\ntive form of the nonlinearity. Results are presented demonstrating \nthe algorithm's ability successfully to discover coherent multiple \ncausal representat.ions of noisy test data and in images of printed \ncharacters. \n\n1 \n\nIntroduction \n\nThe objective of unsupervised learning is to identify patterns or features reflecting \nunderlying regularities in data. Single-cause techniques, including the k-means al(cid:173)\ngorithm and the standard mixture-model (Duda and Hart, 1973), represent clusters \nof data points sharing similar patterns of Is and Os under the assumption that each \ndata point belongs to, or was generated by, one and only one cluster-center; output \nactivity is constrained to sum to 1. In contrast, a multiple-cause model permits more \nthan one cluster-center to become fully active in accounting for an observed data \nvector. 
The advantage of a multiple cause model is that a relatively small number of hidden variables can be applied combinatorially to generate a large data set. Figure 1 illustrates with a test set of nine 121-dimensional data vectors. This data set reflects two independent processes, one of which controls the position of the black square on the left hand side, the other controlling the right. While a single cause model requires nine cluster-centers to account for this data, a perspicuous multiple cause formulation requires only six hidden units as shown in figure 4b. Grey levels indicate dimensions for which a cluster-center adopts a \"don't-know/don't-care\" assertion. \n\nFigure 1: Nine 121-dimensional test data samples exhibiting multiple cause structure. Independent processes control the position of the black rectangle on the left and right hand sides. \n\nWhile principal components analysis and its neural-network variants (Bourlard and Kamp, 1988; Sanger, 1989) as well as the Harmonium Boltzmann Machine (Freund and Haussler, 1992) are inherently multiple cause models, the hidden representations they arrive at are for many purposes intuitively unsatisfactory. Figure 2 illustrates the principal components representation for the test data set presented in figure 1. Principal components is able to reconstruct the data without error using only four hidden units (plus a fixed centroid), but these vectors obscure the compositional structure of the data in that they reveal nothing about the statistical independence of the left and right hand processes. Similar results obtain for multiple cause unsupervised learning using a Harmonium network and for a feedforward network using the sigmoid nonlinearity. 
We seek instead a multiple cause formulation which will deliver coherent representations exploiting \"don't-know/don't-care\" weights to make explicit the statistical dependencies and independencies present when clusters occur in lower-dimensional subspaces of the full J-dimensional data space. \n\nData domains differ in the ways that underlying causal processes interact. The present discussion focuses on data obeying a WRITE-WHITE-AND-BLACK model, under which hidden causes are responsible for both turning \"on\" and turning \"off\" the observed variables. \n\nFigure 2: Principal components representation for the test data from figure 1. (a) centroid (white: -1, black: 1). (b) four component vectors sufficient to encode the nine data points (lighter shadings: Cj,k < 0; grey: Cj,k = 0; darker shading: Cj,k > 0). \n\n2 Mixing Functions \n\nA large class of unsupervised learning models share the architecture shown in figure 3. A binary vector di = (di,1, di,2, ... di,j, ... di,J) is presented at the data layer, and a measurement, or response, vector mi = (mi,1, mi,2, ... mi,k, ... mi,K) is computed at the encoding layer using \"weights\" Cj,k associating activity at data dimension j with activity at hidden cluster-center k. Any activity pattern at the encoding layer can be turned around to compute a prediction vector ri = (ri,1, ri,2, ... ri,j, ... ri,J) at the data layer. Different models employ different functions for performing the measurement and prediction mappings, and give different interpretations to the weights. Common to most models is a learning procedure which attempts to optimize an objective function on errors between data vectors in a training set, and predictions of these data vectors under their respective responses at the encoding layer. 
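A minimal sketch of this generic two-way architecture may help fix the notation. This is our own illustration, not code from the paper: a single weight matrix c[j][k] is used both to measure responses m at the encoding layer and, \"turned around\", to predict values r back at the data layer. A plain linear mapping is shown here purely for concreteness; the choice of these mappings, the mixing function, is exactly what distinguishes different models.

```python
def measure(d, c):
    # response m_k of each of K cluster-centers to binary data vector d
    J, K = len(c), len(c[0])
    return [sum(d[j] * c[j][k] for j in range(J)) / J for k in range(K)]

def predict(m, c):
    # prediction r_j at each data dimension, reusing the same weights c[j][k]
    J, K = len(c), len(c[0])
    return [sum(m[k] * c[j][k] for k in range(K)) for j in range(J)]

# two data dimensions, two cluster-centers, each "owning" one dimension
c = [[1.0, 0.0],
     [0.0, 1.0]]
m = measure([1, -1], c)   # -> [0.5, -0.5]
r = predict(m, c)         # -> [0.5, -0.5]
```

Learning then amounts to adjusting c[j][k] so that predictions r match the training data under their own measured responses m.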
\n\nFigure 3: Architecture underlying a large class of unsupervised learning models. The data layer holds the observed data dj and the predictions rj; the encoding layer holds the cluster-centers. \n\nThe key issue is the mixing function which specifies how sometimes conflicting predictions from individual hidden units combine to predict values on the data dimensions. Most neural-network formulations, including principal components variants and the Boltzmann Machine, employ a linearly weighted sum of hidden unit activity followed by a squashing, bump, or other nonlinearity. This form of mixing function permits an error in prediction by one cluster-center to be cancelled out by correct predictions from others without consequence in terms of error in the net prediction. As a result, there is little global pressure for cluster-centers to adopt don't-know values when they are not quite confident in their predictions. \n\nInstead, a multiple cause formulation delivering coherent cluster-centers requires a form of nonlinearity in which active disagreement must result in a net \"uncertain\" or neutral prediction that results in nonzero error. \n\n3 Multiple Cause Mixture Model \n\nOur formulation employs a zero-based representation at the data layer to simplify the mathematical expression for a suitable mixing function. Data values are either 1 or -1; the sign of a weight Cj,k indicates whether activity in cluster-center k predicts a 1 or -1 at data dimension j, and its magnitude (|Cj,k| ≤ 1) indicates strength of belief; Cj,k = 0 corresponds to \"don't-know/don't-care\" (grey in figure 4b). \n\nThe mixing function takes the form \n\nri,j = [ Σ_{k: Cj,k>0} mi,k Cj,k (1 - Π_{k: Cj,k>0} (1 - mi,k Cj,k)) + Σ_{k: Cj,k<0} mi,k (-Cj,k) (Π_{k: Cj,k<0} (1 + mi,k Cj,k) - 1) ] / [ Σ_{k: Cj,k>0} mi,k Cj,k + Σ_{k: Cj,k<0} mi,k (-Cj,k) ]. \n\nThis formula is a computationally tractable approximation to an idealized mixing function created by linearly interpolating boundary values on the extremes of mi,k ∈ {0, 1} and Cj,k ∈ {-1, 0, 1}, rationally designed to meet the criteria outlined above. Both learning and measurement operate in the context of an objective function on predictions equivalent to log-likelihood. The weights Cj,k are found through gradient ascent in this objective function, and at each training step the encoding mi of an observed data vector is found by gradient ascent as well. \n\n4 Experimental Results \n\nFigure 4 shows that the model converges to the coherent multiple cause representation for the test data of figure 1 starting with random initial weights. The model is robust with respect to noisy training data as indicated in figure 5. \n\nIn figure 6 the model was trained on data consisting of 21 x 21 pixel images of registered lower case characters. Results for K = 14 are shown, indicating that the model has discovered statistical regularities associated with ascenders, descenders, circles, etc. \n\nFigure 4: Multiple Cause Mixture Model representation for the test data from figure 1. (a) Initial random cluster-centers. (b) Cluster-centers after seven training iterations (white: Cj,k = -1; grey: Cj,k = 0; black: Cj,k = 1). \n\n5 Conclusion \n\nAbility to compress data, and statistical independence of response activities (Barlow, 1989), are not the only criteria by which to judge the success of an encoder network paradigm for unsupervised learning. For many purposes, it is equally important that hidden units make explicit statistically salient structure arising from causally distinct processes. 
\n\nThe difficulty lies in getting the internal knowledge-bearing entities sensibly to divvy up responsibility for training data not just pointwise, but dimensionwise. Mixing functions based on a linear weighted sum of activities (possibly followed by a nonlinearity) fail to achieve this because they fail to pressure the hidden units into giving up responsibility (adopting \"don't-know\" values) for data dimensions on which they are prone to be incorrect. We have outlined criteria, and offered a specific functional form, for nonlinearly combining beliefs in a predictive mixing function such that statistically coherent hidden representations of multiple causal structure can indeed be discovered in binary data. \n\nReferences \n\nBarlow, H.; [1989], \"Unsupervised Learning,\" Neural Computation, 1: 295-311. \n\nBourlard, H., and Kamp, Y.; [1988], \"Auto-Association by Multilayer Perceptrons and Singular Value Decomposition,\" Biological Cybernetics, 59:4-5, 291-294. \n\nDuda, R., and Hart, P.; [1973], Pattern Classification and Scene Analysis, Wiley, New York. \n\nFoldiak, P.; [1990], \"Forming sparse representations by local anti-Hebbian learning,\" Biological Cybernetics, 64:2, 165-170. \n\nFreund, Y., and Haussler, D.; [1992], \"Unsupervised learning of distributions on binary vectors using two-layer networks,\" in Moody, J., Hanson, S., and Lippman, R., eds., Advances in Neural Information Processing Systems 4, Morgan Kaufmann, San Mateo, 912-919. \n\nNowlan, S.; [1990], \"Maximum Likelihood Competitive Learning,\" in Touretzky, D., ed., Advances in Neural Information Processing Systems 2, Morgan Kaufmann, San Mateo, 574-582. \n\nSanger, T.; [1989], \"An Optimality Principle for Unsupervised Learning,\" in Touretzky, D., ed., Advances in Neural Information Processing Systems, Morgan Kaufmann, San Mateo, 11-19. 
\n\nFigure 5: Multiple Cause Mixture Model results for noisy training data. (a) Five test data sample suites with 10% bit-flip noise. Twenty suites were used to train from random initial cluster-centers, resulting in the representation shown in (b). (c) Left: Five test data samples di; Middle: Numerical activities mi,k for the most active cluster-centers (the corresponding cluster-center is displayed above each mi,k value); Right: reconstructions (predictions) ri based on the activities. Note how these \"clean up\" the noisy samples from which they were computed. \n\nFigure 6: (a) Training set of twenty-six 441-dimensional binary vectors. (b) Multiple Cause Mixture Model representation at K = 14. (c) Left: Five test data samples di; Middle: Numerical activities mi,k for the most active cluster-centers (the corresponding cluster-center is displayed above each mi,k value); Right: reconstructions (predictions) ri based on the activities. ", "award": [], "sourceid": 735, "authors": [{"given_name": "Eric", "family_name": "Saund", "institution": null}]}