{"title": "Unsupervised Discrimination of Clustered Data via Optimization of Binary Information Gain", "book": "Advances in Neural Information Processing Systems", "page_first": 499, "page_last": 506, "abstract": null, "full_text": "Unsupervised Discrimination of Clustered Data \nvia Optimization of Binary Information Gain \n\nNicol N. Schraudolph \nComputer Science & Engr. Dept. \nUniversity of California, San Diego \nLa Jolla, CA 92093-0114 \nnici@cs.ucsd.edu \n\nTerrence J. Sejnowski \nComputational Neurobiology Laboratory \nThe Salk Institute for Biological Studies \nSan Diego, CA 92186-5800 \ntsejnowski@ucsd.edu \n\nAbstract \n\nWe present the information-theoretic derivation of a learning algorithm that clusters unlabelled data with linear discriminants. In contrast to methods that try to preserve information about the input patterns, we maximize the information gained from observing the output of robust binary discriminators implemented with sigmoid nodes. We derive a local weight adaptation rule via gradient ascent in this objective, demonstrate its dynamics on some simple data sets, relate our approach to previous work and suggest directions in which it may be extended. \n\n1 INTRODUCTION \n\nUnsupervised learning algorithms may perform useful preprocessing functions by preserving some aspects of their input while discarding others. This can be quantified as maximization of the information the network's output carries about those aspects of the input that are deemed important. \n\nLinsker (1988) suggests maximal preservation of information about all aspects of the input. This Infomax principle provides for optimal reconstruction of the input in the face of noise and resource limitations. The I-max algorithm (Becker and Hinton, 1992), by contrast, focusses on coherent aspects of the input, which are extracted by maximizing the mutual information between networks looking at different patches of input. 
\nOur work aims at recoding clustered data with adaptive discriminants that selectively emphasize gaps between clusters while collapsing patterns within a cluster onto near-identical output representations. We achieve this by maximizing information gain - the information gained through observation of the network's outputs under a probabilistic interpretation. \n\n2 STRATEGY \n\nConsider a node that performs a weighted summation on its inputs x and squashes the resulting net input y through a sigmoid function f: \n\nz = f(y), where f(y) = 1 / (1 + e^{-y}) and y = w \u00b7 x. (1) \n\nSuch a sigmoid node can be regarded as a \"soft\" discriminant: with a large enough weight vector, the output will essentially be binary, but smaller weights allow for the expression of varying degrees of confidence in the discrimination. \n\nTo make this notion more precise, consider y a random variable with bimodal distribution, namely an even mixture of two Gaussian distributions. Then if their means equal \u00b1 half their variance, z is the posterior probability for discriminating between the two source distributions (Anderson, 1972). \n\nThis probabilistic interpretation of z can be used to design a learning algorithm that seeks such bimodal projections of the input data. In particular, we search for highly informative discriminants by maximizing the information gained about the binary discrimination through observation of z. This binary information gain is given by \n\n\u0394H(z) = H(\u1e91) - H(z), (2) \n\nwhere H(z) is the entropy of z under the above interpretation, and \u1e91 is an estimate of z based on prior knowledge. \n\n3 RESULTS \n\n3.1 THE ALGORITHM \n\nIn the Appendix, we present the derivation of a learning algorithm that maximizes binary information gain by gradient ascent. 
The resulting weight update rule is \n\n\u0394w \u221d f'(y) x (y - \u0177), (3) \n\nwhere \u0177, the estimated net input, must meet certain conditions\u00b9 (see Appendix). The weight change dictated by (3) is thus proportional to the product of three factors: \n\n\u2022 the derivative of the sigmoid squashing function, \n\u2022 the presynaptic input x, and \n\u2022 the difference between actual and anticipated net input. \n\n\u00b9 In what follows, we have successfully used estimators that merely approximate these conditions. \n\nFigure 1: Phase plot of \u0394y against net input y for \u0177 \u2208 {-3, -2, ..., 3}. See text for details. \n\n3.2 SINGLE NODE DYNAMICS \n\nFor a single, isolated node, we use \u27e8y\u27e9, the average net input over a batch of input patterns, as the estimator \u0177. The behavior of our algorithm in this setting is best understood from a phase plot as shown in Figure 1, where the change in net input resulting from a weight change according to (3) is graphed against the net input that causes this weight change. Curves are plotted for seven different values of \u0177. The central curve (\u0177 = 0) is identical to that of the straightforward Hebb rule for sigmoid nodes: both positive and negative net inputs are equally amplified until they reach saturation. For non-zero values of \u0177, however, the curves become asymmetric: positive \u0177 favors negative changes \u0394y and vice versa. For \u0177 = \u27e8y\u27e9, it is easy to see that this will have the effect of centering net inputs around zero. 
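Update rule (3) with the batch-average estimator described above can be sketched in a few lines of NumPy. This is our illustration, not the authors' code: the function name, learning rate, iteration count, and synthetic two-cluster data are all assumptions made for the example.

```python
import numpy as np

def sigmoid(y):
    return 1.0 / (1.0 + np.exp(-y))

def bingo_batch_update(w, X, lr):
    """One batch step of rule (3), using the batch mean of y as estimator."""
    y = X @ w                     # net inputs, one per input pattern
    y_hat = y.mean()              # estimator: average net input over the batch
    z = sigmoid(y)
    # dw proportional to f'(y) x (y - y_hat), with f'(y) = z (1 - z),
    # averaged over the batch of input patterns
    dw = ((z * (1.0 - z) * (y - y_hat))[:, None] * X).mean(axis=0)
    return w + lr * dw

# Two well-separated 2-D Gaussian clusters (hypothetical test data).
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-2.0, 0.3, (50, 2)),
               rng.normal(+2.0, 0.3, (50, 2))])

w = rng.normal(0.0, 0.01, 2)      # small random initial weights
for _ in range(500):
    w = bingo_batch_update(w, X, lr=0.5)

z = sigmoid(X @ w)                # outputs approach 0 for one cluster, 1 for the other
```

On data like this, the weight vector grows until the sigmoid saturates on both clusters, at which point z(1 - z) vanishes and learning effectively stops, reproducing the centering and binarization behavior described in the text.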
\nThe node will therefore converge to a state where its output is one for half of the input patterns, and zero for the other half. Note that this can be achieved by any sufficiently large weight vector, regardless of its direction! However, since simple gradient ascent is both greedy and local in weight space, starting it from small random initial weights is equivalent to a bias towards discriminations that can be made confidently with smaller weight vectors. \n\nTo illustrate this effect, we have tested a single node running our algorithm on a set of vowel formant frequency data due to Peterson and Barney (1952). The most prominent feature of this data is a central gap that separates front from back vowels; however, this feature is near-orthogonal to the principal component of the data and thus escapes detection by standard Hebbian learning rules. \n\nFigure 2 shows the initial, intermediate and final phase of this experiment, using a visualization technique suggested by Munro (1992). Each plot shows the pre-image of zero net input superimposed on a scatter plot of the data set in input space. The two flanking lines delineate the \"active region\" where the sigmoid is not saturated, and thus provide an indication of weight vector size. \n\nAs demonstrated in this figure, our algorithm is capable of proceeding smoothly from a small initial weight vector that responds in principal component direction to a solution which uses a large weight vector in near-orthogonal direction to successfully discriminate between the two data clusters. \n\nFigure 2: Single node discovers distinction between front and back vowels in unlabelled data set of 1514 multi-speaker vowel utterances (Peterson and Barney, 1952). 
Superimposed on a scatter plot of the data are the pre-images of y = 0 (solid center line) and y = \u00b11.31696 (flanking lines) in input space. Discovered feature is far from principal component direction. \n\n3.3 EXTENSION TO A LAYER OF NODES \n\nA learning algorithm for a single sigmoid node has of course only limited utility. When extending it to a layer of such nodes, some form of lateral interaction is needed to ensure that each node makes a different binary discrimination. The common technique of introducing lateral competition for activity or weight changes would achieve this only at the cost of severely distorting the behavior of our algorithm. \n\nFortunately our framework is flexible enough to accommodate lateral differentiation in a less intrusive manner: by picking an estimator that uses the activity of every other node in the layer to make its prediction, we force each node to maximize its information gain with respect to the entire layer. To demonstrate this technique we use the linear second-order estimator \n\n\u0177_i = \u27e8y_i\u27e9 + \u03a3_{j \u2260 i} (y_j - \u27e8y_j\u27e9) \u03c1_ij (4) \n\nto predict the net input y_i of the ith node in the layer, where the \u27e8\u00b7\u27e9 operator denotes averaging over a batch of input patterns, and \u03c1_ij is the empirical correlation coefficient \n\n\u03c1_ij = \u27e8(y_i - \u27e8y_i\u27e9)(y_j - \u27e8y_j\u27e9)\u27e9 / [\u27e8(y_i - \u27e8y_i\u27e9)\u00b2\u27e9 \u27e8(y_j - \u27e8y_j\u27e9)\u00b2\u27e9]^{1/2}. (5) \n\nFigure 3 shows a layer of three such nodes adapting to a mixture of three Gaussian distributions, with each node initially picking a different Gaussian to separate from the other two. After some time, all three discriminants rotate in concert so as to further maximize information gain by splitting the input data evenly. Note that throughout this process, the nodes always remain well-differentiated from each other. \n\nFor most initial conditions, however, the course of this experiment is that depicted in Figure 4: two nodes discover a more efficient way to discriminate between the three input clusters, to the detriment of the third. 
The latecomer repeatedly tries to settle into one of the gaps in the data, but this would result in a high degree of predictability. Thus the node with the shortest weight vector and hence most volatile discriminant is weakened further, its weight vector all but eliminated in an effective demonstration of Occam's razor. \n\nFigure 3: Layer of three nodes adapts to a mixture of three Gaussian distributions. In the final state, each node splits the input data evenly. \n\nFigure 4: Most initial conditions, however, lead to a minimal solution involving only two nodes. The weakest node is \"crowded out\" by Occam's razor, its weight vector reduced to near-zero length. \n\n4 DISCUSSION \n\n4.1 RELATED WORK \n\nBy maximizing the difference of actual from anticipated response, our algorithm makes binary discriminations that are highly informative with respect to clusters in the input. The weight change in proportion to a difference in activity is reminiscent of the covariance rule (Sejnowski, 1977) but generalizes it in two important respects: \n\n\u2022 it explicitly incorporates a sigmoid nonlinearity, and \n\u2022 \u0177 need not necessarily be the average net input. \n\nBoth of these are critical improvements: the first allows the node to respond only to inputs in its non-saturated region, and hence to learn local features in projections other than along the principal component direction. The second provides a convenient mechanism for extending the algorithm by incorporating additional information in the estimator. \n\nWe share the goal of seeking highly informative, bimodal projections of the input with the Bienenstock-Cooper-Munro (BCM) algorithm (Bienenstock et al., 1982; Intrator, 1992). 
\nA critical difference, however, is that BCM uses a complex, asymmetric nonlinearity that increases the selectivity of nodes and hence produces a localized, 1-of-n recoding of the input, whereas our algorithm makes symmetric, robust and independent binary discriminations. \n\n4.2 FUTURE DIRECTIONS \n\nSince the learning algorithm described here has demonstrated flexibility and efficiency in our initial experiments, we plan to scale it up to address high-dimensional, real-world problems. The algorithm itself is likely to be further extended and improved as its applications grow more demanding. \n\nFor instance, although the size of the weight vector represents commitment to a discriminant in our framework, it is not explicitly controlled. The dynamics of weight adaptation happen to implement a reasonable bias in this case, but further refinements may be possible. Other priors implicit in our approach - such as the preference for splitting the data evenly - could be similarly relaxed or modified. \n\nAnother attractive generalization of this learning rule would be to implement nonlinear discriminants by backpropagating weight derivatives through hidden units. The dynamic stability of our algorithm is a significant asset for its expansion into an efficient unsupervised multi-layer network. \n\nIn such a network, linear estimators are no longer sufficient to fully remove redundancy between nodes. In his closely related predictability minimization architecture, Schmidhuber (1992) uses backpropagation networks as nonlinear estimators for this purpose with some success. \n\nSince the notion of estimator in our framework is completely general, it may combine evidence from multiple, disparate sources. 
Thus a network running our algorithm can be trained to complement a heterogeneous mix of pattern recognition methods by maximizing information gain relative to an estimator that utilizes all such available sources of information. This flexibility should greatly aid the integration of binary information gain optimization into existing techniques. \n\nAPPENDIX: MATHEMATICAL DERIVATION \n\nWe derive a straightforward batch learning algorithm that performs gradient ascent in the binary information gain objective. On-line approximations may be obtained by using exponential traces in place of the batch averages denoted by the \u27e8\u00b7\u27e9 operator. \n\nCONDITIONS ON THE ESTIMATOR \n\nTo eliminate the derivative term from (11d) below we require that the estimator \u1e91 be \n\n\u2022 unbiased: \u27e8\u1e91\u27e9 = \u27e8z\u27e9, and \n\u2022 honest: \u2202\u1e91/\u2202z = \u2202\u27e8\u1e91\u27e9/\u2202z. \n\nThe honesty condition ensures that the estimator has access to the estimated variable only on the slow timescale of batch averaging, thus eliminating trivial \"solutions\" such as \u1e91 = z. For an unbiased and honest estimator, \n\n\u2202\u1e91/\u2202z = \u2202\u27e8\u1e91\u27e9/\u2202z = \u2202\u27e8z\u27e9/\u2202z = \u2202z/\u2202z = 1. (6) \n\nBINARY ENTROPY AND ITS DERIVATIVE \n\nThe entropy of a binary random variable X as a function of z = Pr(X = 1) is given by \n\nH(z) = -z log z - (1 - z) log(1 - z); (7) \n\nits derivative with respect to z is \n\n\u2202H(z)/\u2202z = log(1 - z) - log z. (8) \n\nSince z in our case is produced by the sigmoid function f given in (1), this conveniently simplifies to \n\n\u2202H(z)/\u2202z = -y. (9) \n\nGRADIENT ASCENT IN INFORMATION GAIN \n\nThe information \u0394H gained from observing the output z of the discriminator is \n\n\u0394H(z) = H(\u1e91) - H(z), (10) \n\nwhere \u1e91 is an estimate of z based on prior knowledge. We maximize \u0394H(z) by batched gradient ascent in weight space: \n\n\u0394w \u221d \u27e8 \u2202\u0394H(z)/\u2202w \u27e9 (11a) \n= \u27e8 (\u2202z/\u2202w) \u00b7 \u2202/\u2202z [H(\u1e91) - H(z)] \u27e9 (11b) \n= \u27e8 z (1 - z) x [ (\u2202\u1e91/\u2202z) \u2202H(\u1e91)/\u2202\u1e91 - \u2202H(z)/\u2202z ] \u27e9 (11c) \n= \u27e8 z (1 - z) x (y - (\u2202\u1e91/\u2202z) \u0177) \u27e9, (11d) \n\nwhere estimation of the node's output z has been replaced by that of its net input y. Substitution of (6) into (11d) yields the binary information gain optimization rule \n\n\u0394w \u221d \u27e8 z (1 - z) x (y - \u0177) \u27e9. (12) \n\nAcknowledgements \n\nWe would like to thank Steve Nowlan, Peter Dayan and Rich Zemel for stimulating and helpful discussions. This work was supported by the Office of Naval Research and the McDonnell-Pew Center for Cognitive Neuroscience at San Diego. \n\nReferences \n\nAnderson, J. (1972). Logistic discrimination. Biometrika, 59:19-35. \n\nAnderson, J. and Rosenfeld, E., editors (1988). Neurocomputing: Foundations of Research. MIT Press, Cambridge. \n\nBecker, S. and Hinton, G. E. (1992). A self-organizing neural network that discovers surfaces in random-dot stereograms. Nature, 355:161-163. \n\nBienenstock, E., Cooper, L., and Munro, P. (1982). Theory for the development of neuron selectivity: Orientation specificity and binocular interaction in visual cortex. Journal of Neuroscience, 2. Reprinted in (Anderson and Rosenfeld, 1988). \n\nIntrator, N. (1992). Feature extraction using an unsupervised neural network. Neural Computation, 4:98-107. \n\nLinsker, R. (1988). Self-organization in a perceptual network. Computer, pages 105-117. \n\nMunro, P. W. (1992). Visualizations of 2-d hidden unit space. In International Joint Conference on Neural Networks, volume 3, pages 468-473, Baltimore 1992. IEEE. \n\nPeterson, G. E. and Barney, H. L. (1952). Control methods used in a study of the vowels. Journal of the Acoustical Society of America, 24:175-184. \n\nSchmidhuber, J. (1992). Learning factorial codes by predictability minimization. Neural Computation, 4:863-879. \n\nSejnowski, T. J. (1977). 
Storing covariance with nonlinearly interacting neurons. Journal of Mathematical Biology, 4:303-321. \n", "award": [], "sourceid": 628, "authors": [{"given_name": "Nicol", "family_name": "Schraudolph", "institution": null}, {"given_name": "Terrence", "family_name": "Sejnowski", "institution": null}]}