{"title": "An Optimality Principle for Unsupervised Learning", "book": "Advances in Neural Information Processing Systems", "page_first": 11, "page_last": 19, "abstract": null, "full_text": "11 \n\nAN OPTIMALITY PRINCIPLE FOR \n\nUNSUPERVISED LEARNING \n\nTerence D. Sanger \n\nMIT AI Laboratory, NE43-743 \n\nCambridge, MA 02139 \n\n(tds@wheaties.ai.mit.edu) \n\nABSTRACT \n\nWe propose an optimality principle for training an unsu(cid:173)\npervised feedforward neural network based upon maximal \nability to reconstruct the input data from the network out(cid:173)\nputs. We describe an algorithm which can be used to train \neither linear or nonlinear networks with certain types of \nnonlinearity. Examples of applications to the problems of \nimage coding, feature detection, and analysis of random(cid:173)\ndot stereograms are presented. \n\n1. INTRODUCTION \n\nThere are many algorithms for unsupervised training of neural networks, each of \nwhich has a particular optimality criterion as its goal. (For a partial review, see \n(Hinton, 1987, Lippmann, 1987).) We have presented a new algorithm for training \nsingle-layer linear networks which has been shown to have optimality properties \nassociated with the Karhunen-Loeve expansion (Sanger, 1988b). We now show \nthat a similar algorithm can be applied to certain types of nonlinear feedforward \nnetworks, and we give some examples of its behavior. \nThe optimality principle which we will use to describe the algorithm is based on the \nidea of maximizing information which was first proposed as a desirable property of \nneural networks by Linsker (1986, 1988). Unfortunately, measuring the information \nin network outputs can be difficult without precise knowledge of the distribution \non the input data, so we seek another measure which is related to information \nbut which is easier to compute. 
If instead of maximizing information, we try to maximize our ability to reconstruct the input (with minimum mean-squared error) given the output of the network, we are able to obtain some useful results. Note that this is not equivalent to maximizing information except in some special cases. However, it contains the intuitive notion that the input data is being represented by the network in such a way that very little of it has been \"lost\". \n\n2. LINEAR CASE \n\nWe now summarize some of the results in (Sanger, 1988b). A single-layer linear feedforward network is described by an M×N matrix C of weights such that if x is a vector of N inputs and y is a vector of M outputs with M < N, we have y = Cx. As mentioned above, we choose an optimality principle defined so that we can best reconstruct the inputs to the network given the outputs. We want to minimize the mean squared error E[(x - x̂)^2], where x is the actual input, which is zero-mean with correlation matrix Q = E[xx^T], and x̂ is a linear estimate of this input given the output y. The linear least squares estimate (LLSE) is given by \n\nx̂ = QC^T(CQC^T)^{-1}y \n\nand we will assume that x̂ is computed in this way for any matrix C of weights which we choose. The mean-squared error for the LLSE is given by \n\nE[(x - x̂)^2] = tr Q - tr[QC^T(CQC^T)^{-1}CQ] \n\nand it is well known that this is minimized if the rows of C are a linear combination of the first M eigenvectors of the correlation matrix Q. One optimal choice of C is given by the Singular Value Decomposition (SVD) of Q, for which the output correlation matrix E[yy^T] = CQC^T will be the diagonal matrix of eigenvalues of Q. In this case, the outputs are uncorrelated and the sum of their variances (trace E[yy^T]) is maximal for any set of M uncorrelated outputs. We can thus think of the eigenvectors as being obtained by any process which maximizes the output variance while maintaining the outputs uncorrelated. 
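The optimality of eigenvector weights for linear reconstruction can be checked numerically. The following is a minimal NumPy sketch of my own (the matrix sizes, variable names, and test data are illustrative, not from the paper): it computes the LLSE error tr Q - tr[QC^T(CQC^T)^{-1}CQ] and confirms that rows spanning the top M eigenvectors of Q do no worse than an arbitrary weight matrix.

```python
# Hypothetical NumPy sketch (not from the paper): for y = Cx with input
# correlation Q, the LLSE is x_hat = Q C^T (C Q C^T)^{-1} y, and rows of C
# spanning the first M eigenvectors of Q minimize the reconstruction error.
import numpy as np

rng = np.random.default_rng(0)

def llse_error(C, Q):
    """Mean-squared error tr Q - tr[Q C^T (C Q C^T)^{-1} C Q] of the LLSE."""
    G = C @ Q @ C.T
    return np.trace(Q) - np.trace(Q @ C.T @ np.linalg.solve(G, C @ Q))

# A 4x4 input correlation matrix with distinct eigenvalues.
A = rng.standard_normal((4, 4))
Q = A @ A.T + np.diag([3.0, 2.0, 1.0, 0.5])

eigvals, eigvecs = np.linalg.eigh(Q)      # eigenvalues in ascending order
C_opt = eigvecs[:, ::-1][:, :2].T         # rows = first M = 2 eigenvectors
C_rand = rng.standard_normal((2, 4))      # arbitrary 2x4 weight matrix

# The eigenvector choice is globally optimal: its error equals the sum of
# the N - M smallest eigenvalues of Q.
assert llse_error(C_opt, Q) <= llse_error(C_rand, Q) + 1e-9
```

The assertion reflects the text's claim: with M = 2 outputs the minimum achievable error is the sum of the two smallest eigenvalues of Q, attained exactly by the eigenvector rows.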
\nWe now define the optimal single-layer linear network as that network whose weights represent the first M eigenvectors of the input correlation matrix Q. The optimal network thus minimizes the mean-squared approximation error E[(x - x̂)^2] given the shape constraint that M < N. \n\n2.1 LINEAR ALGORITHM \n\nWe have previously proposed a weight-update rule called the \"Generalized Hebbian Algorithm\", and proven that this algorithm causes the rows of the weight matrix C to converge to the eigenvectors of the input correlation matrix Q (Sanger, 1988a,b). The algorithm is given by: \n\nC(t + 1) = C(t) + γ(y(t)x^T(t) - LT[y(t)y^T(t)]C(t)) \n\n(1) \n\nwhere γ is a rate constant which decreases as 1/t, x(t) is an input sample vector, y(t) = C(t)x(t), and LT[·] is an operator which makes its matrix argument lower triangular by setting all entries above the diagonal to zero. This algorithm can be implemented using only a local synaptic learning rule (Sanger, 1988b). Since the Generalized Hebbian Algorithm computes the eigenvectors of the input correlation matrix Q, it is related to the Singular Value Decomposition (SVD), Principal Components Analysis (PCA), and the Karhunen-Loeve Transform (KLT). (For a review of several related algorithms for performing the KLT, see (Oja, 1983).) \n\nFigure 1: (a) original image. (b) image coded at 0.36 bits per pixel. (c) masks learned by the network which were used for vector quantized coding of 8x8 blocks of the image. \n\n2.2 IMAGE CODING \n\nWe present one example of the behavior of a single-layer linear network. (This example appears in (Sanger, 1988b).) Figure 1a shows an original 256x256 8-bit image which was used for training a network. 8x8 blocks of the image were chosen by scanning over the image, and these were used as training inputs to a network with 64 inputs and 8 outputs. 
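The update rule of equation (1) translates directly into code. The following NumPy sketch is my own illustration (the toy input distribution, fixed learning rate, and iteration count are assumptions, not the paper's setup): it applies the update to samples from a zero-mean Gaussian whose correlation matrix has the standard basis vectors as eigenvectors, so the rows of C should converge to (signed) basis vectors in order of decreasing variance.

```python
# Minimal sketch of the Generalized Hebbian Algorithm of equation (1).
# The toy data, learning rate, and step count are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(1)

def gha_step(C, x, lr):
    """One update C += lr * (y x^T - LT[y y^T] C), with y = C x.
    np.tril implements the LT[.] lower-triangular operator."""
    y = C @ x
    return C + lr * (np.outer(y, x) - np.tril(np.outer(y, y)) @ C)

# Zero-mean inputs with correlation matrix diag(9, 1, 0.09): the
# eigenvectors are e1, e2, e3 with decreasing variance.
std = np.array([3.0, 1.0, 0.3])
C = 0.1 * rng.standard_normal((2, 3))     # M = 2 outputs, N = 3 inputs
for t in range(10000):
    x = std * rng.standard_normal(3)
    C = gha_step(C, x, lr=0.001)

# Row i converges to the i-th eigenvector, up to sign.
assert abs(C[0, 0]) > 0.9 and abs(C[1, 1]) > 0.9
```

A fixed small learning rate is used here for simplicity; the paper specifies a rate decreasing as 1/t, which is what the convergence proof assumes.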
After training, the set of weights for each output (figure 1c) represents a vector quantizing mask. Each 8x8 block of the input image is then coded using the outputs of the network. Each output is quantized with a number of bits related to the log of the variance, and the original figure is approximated from the quantized outputs. The reconstruction of figure 1b uses a total of 23 bits per 8x8 region, which gives a data rate of 0.36 bits per pixel. The fact that the image could be represented using such a low bit rate indicates that the masks that were found represent significant features which are useful for recognition. This image coding technique is equivalent to block-coded KLT methods common in the literature. \n\n3. NONLINEAR CASE \n\nIn general, training a nonlinear unsupervised network to approximate nonlinear functions is very difficult. Because of the large (infinite-dimensional) space of possible functions, it is important to have detailed knowledge of the class of functions which are useful in order to design an efficient network algorithm. (Several people pointed out to me that the talk implied such knowledge is not necessary, but unfortunately such an implication is false.) \n\nThe network structure we consider is a linear layer represented by a matrix C (which is perhaps an interior layer of a larger network) followed by node nonlinearities σ(y_i), where y_i is the ith linear output, followed by another linear layer (perhaps followed by more layers). We assume that the nonlinearities σ(·) are fixed, and that the only parameters susceptible to training are the linear weights C. \n\nIf z is the M-vector of outputs after the nonlinearity, then we can write each component z_i = σ(y_i) = σ(c_i x), where c_i is the ith row of the matrix C. 
Note that the level contours of each function z_i are determined entirely by the vector c_i, and that the effect of σ(·) is limited to modifying the output value. Intuitively, we thus expect that if y_i encodes a useful parameter of the input x, then z_i will encode the same parameter, although scaled by the nonlinearity σ(·). \n\nThis can be formalized, and if we choose our optimality principle to again be minimum mean-squared linear approximation of the original input x given the output z, the best solution remains when the rows of C are a linear combination of the first M eigenvectors of the input correlation matrix Q (Bourlard and Kamp, 1988). \n\nIn two of the simulations, the nonlinearity σ(·) which we use is a rectification nonlinearity, given by \n\nσ(y_i) = y_i if y_i ≥ 0, and σ(y_i) = 0 if y_i < 0. \n\nNote that at most one of {σ(y_i), σ(-y_i)} is nonzero at any time, so these two values are uncorrelated. Therefore, if we maximize the variance of y (before the nonlinearity) while maintaining the elements of z (after the nonlinearity) uncorrelated, we need 2M outputs in order to represent the data available from an M-vector y. Note that 2M may be greater than the number of inputs N, so that the \"hidden layer\" z can have more elements than the input. \n\n3.1 NONLINEAR ALGORITHM \n\nThe nonlinear Generalized Hebbian Algorithm has exactly the same form as for the linear case, except that we substitute the use of the output values after the nonlinearity for the linear values. The algorithm is thus given by: \n\nC(t + 1) = C(t) + γ(z(t)x^T(t) - LT[z(t)z^T(t)]C(t)) \n\n(2) \n\nwhere the elements of z are given by z_i(t) = σ(y_i(t)), with y(t) = C(t)x(t). 
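The nonlinear update of equation (2) differs from the linear rule only in using the rectified outputs. The following NumPy sketch is an illustration of my own (the 2-input toy distribution, fixed learning rate, and 2M = 4 output count are assumptions, not the paper's experiments):

```python
# Sketch of equation (2): the GHA update applied to rectified outputs
# z = max(y, 0).  With 2M outputs for an M-dimensional y, both polarities
# of each eigenvector can be represented, as described in the text.
import numpy as np

rng = np.random.default_rng(2)

def nonlinear_gha_step(C, x, lr):
    y = C @ x
    z = np.maximum(y, 0.0)                # rectification nonlinearity sigma(.)
    return C + lr * (np.outer(z, x) - np.tril(np.outer(z, z)) @ C)

# Zero-mean Gaussian inputs with correlation diag(4, 0.25).
C = 0.1 * rng.standard_normal((4, 2))     # 2M = 4 outputs for N = 2 inputs
for t in range(20000):
    x = np.array([2.0, 0.5]) * rng.standard_normal(2)
    C = nonlinear_gha_step(C, x, lr=0.002)

# The first row is driven by a rectified Oja-type rule and should align
# with the principal eigenvector e1 (up to sign).
assert abs(C[0, 0]) > 0.9
```

Since convergence of this rule is only supported heuristically in the text, the check above is limited to the first row, whose behavior for Gaussian input matches the linear analysis at half the effective rate.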
\nAlthough we have not proven that this algorithm converges, a heuristic analysis of its behavior (for a rectification nonlinearity and Gaussian input distribution) shows that stable points may exist for which each row of C is proportional to an eigenvector of Q, and pairs of rows are either the negative of each other or orthogonal. In practice, the rows of C are ordered by decreasing output variance, and occur in pairs for which one member is the negative of the other. This choice of C will maximize the sum of the output variances for uncorrelated outputs, so long as the input is Gaussian. It also allows optimal linear estimation of the input given the output, so long as both polarities of each of the eigenvectors are present. \n\n3.2 NONLINEAR EXAMPLES \n\n3.2.1 Encoder Problem \n\nWe compare the performance of two nonlinear networks which have learned to perform an identity mapping (the \"encoder\" problem). One is trained by backpropagation (Rumelhart et al., 1986), and the other has two hidden layers trained using the unsupervised Hebbian algorithm, while the output layer is trained using a supervised LMS algorithm (Widrow and Hoff, 1960). The network has 5 inputs, two hidden layers of 3 units each, and 5 outputs. There is a sigmoid nonlinearity at each hidden layer, but the thresholds are all kept at zero. The task is to minimize the mean-squared difference between the inputs and the outputs. The input is a zero-mean correlated Gaussian random 5-vector, and both algorithms are presented with the same sequence of inputs. The unsupervised-trained network converged to a steady state after 1600 examples, and the backpropagation network converged after 2400 (convergence determined by no further decrease in average error). 
The RMS error at steady state was 0.42 for both algorithms (this figure should be compared to the sum of the variances of the inputs, which was 5.0). Therefore, for this particular task, there is no significant difference in performance between backpropagation and the Generalized Hebbian Algorithm. This is an encouraging result, since if we can use an unsupervised algorithm to solve other problems, the training time will scale at most linearly with the number of layers. \n\n3.2.2 Nonlinear Receptive Fields \n\nSeveral investigators have shown that Hebbian algorithms can discover useful image features related to the receptive fields of cells in primate visual cortex (see for example (Bienenstock et al., 1982, Linsker, 1986, Barrow, 1987)). One of the more recent methods uses an algorithm very similar to the one proposed here to find the principal component of the input (Linsker, 1986). We performed an experiment to find out what types of nonlinear receptive fields could be learned by the Generalized Hebbian Algorithm if provided with similar input to that used by Linsker. \n\nWe used a single-layer nonlinear network with 4096 inputs arranged in a 64x64 grid, and 16 outputs with a rectification nonlinearity. The input data consisted of images of low-pass filtered white Gaussian noise multiplied by a Gaussian window. After 5000 samples, the 16 outputs learned the masks shown in figure 2. These masks possess qualitative similarity to the receptive fields of cells found in the visual cortex of cat and monkey (see for example (Andrews and Pollen, 1979)). They are equivalent to the masks learned by a purely linear network (Sanger, 1988b), except that both positive and negative polarities of most mask shapes are present here. \n\nFigure 2: Nonlinear receptive fields ordered from left-to-right and top-to-bottom. 
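The training stimulus for the receptive-field experiment can be sketched as below. This is a NumPy illustration of my own; the paper gives no filter cutoff or window width, so those parameters are guesses, and the sketch only shows the shape of the input (low-pass filtered white Gaussian noise times a Gaussian window on a 64x64 grid).

```python
# Sketch of one training sample as described above: low-pass filtered
# white Gaussian noise multiplied by a Gaussian window.  The cutoff
# frequency (0.1 cycles/pixel) and window width (12 pixels) are assumed
# values; the paper does not specify them.
import numpy as np

rng = np.random.default_rng(3)
N = 64

noise = rng.standard_normal((N, N))

# Low-pass filter by zeroing high spatial frequencies in the FFT domain.
f = np.fft.fftfreq(N)
fx, fy = np.meshgrid(f, f)
lowpass = (np.sqrt(fx**2 + fy**2) < 0.1).astype(float)
smooth = np.real(np.fft.ifft2(np.fft.fft2(noise) * lowpass))

# Gaussian window centered on the 64x64 grid.
i = np.arange(N) - (N - 1) / 2
gx, gy = np.meshgrid(i, i)
window = np.exp(-(gx**2 + gy**2) / (2 * 12.0**2))

stimulus = smooth * window
x = stimulus.ravel()          # one 4096-dimensional network input
assert x.shape == (4096,)
```

Samples of this form would be presented one at a time to the 4096-input, 16-output rectifying network described in the text.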
\n\n3.2.3 Stereo \n\nWe now show how the nonlinear Generalized Hebbian Algorithm can be used to train a two-layer network to detect disparity edges. The network has 128 inputs, 8 types of unit in the hidden layer with a rectification nonlinearity, and 4 types of output unit. A 128x128 pixel random-dot stereo pair was generated in which the left half had a disparity of two pixels, and the right half had zero disparity. The image was convolved with a vertically-oriented elliptical Gaussian mask to remove high-frequency vertical components. Corresponding 8x8 blocks of the left and right images (64 pixels from each image) were multiplied by a Gaussian window function and presented as input to the network, which was allowed to learn the first layer according to the unsupervised algorithm. After 4000 iterations, the first layer had converged to a set of 8 pairs of masks. These masks were convolved with the images (the left mask was convolved with the left image, and the right mask with the right image, and the two results were summed and rectified) to produce a pattern of activity at the hidden layer. (Although there were only 8 types of hidden unit, we now allow one of each type to be centered at every input image location to obtain a pattern of total activity.) Figure 3 shows this activity, and we can see that the last four masks are disparity-sensitive since they respond preferentially to either the 2 pixel disparity or the zero disparity region of the image. \n\nFigure 3: Hidden layer response for a two-layer nonlinear network trained on stereo images. The left half of the input random dot image has a 2 pixel disparity, and the right half has zero disparity. \n\nFigure 4: Output layer response for a two-layer nonlinear network trained on stereo images. \n\nSince we were interested in disparity, we trained the second layer using only the last four hidden unit types. 
The second layer had 1024 (= 4x16x16) inputs, organized as a 16x16 receptive field in each of the four hidden unit \"planes\". The outputs did not have any nonlinearity. Training was performed by scanning over the hidden unit activity pattern (successive examples overlapped by 8 pixels), and 6000 iterations were used to produce the second-layer weights. The masks that were learned were then convolved with the hidden unit activity pattern to produce an output unit activity pattern, shown in figure 4. \n\nThe third output is clearly sensitive to a change in disparity (a depth edge). If we generate several different random-dot stereograms and average the output results, we see that the other outputs are also sensitive (on average) to disparity changes, but not as much as the third. Figure 5 shows the averaged response to 10 stereograms with a central 2 pixel disparity square against a zero disparity background. Note that the ability to detect disparity edges requires the rectification nonlinearity at the hidden layer, since no linear function has this property. \n\nFigure 5: Output layer response averaged over ten stereograms with a central 2 pixel disparity square and zero disparity surround. \n\n4. CONCLUSION \n\nWe have shown that the unsupervised Generalized Hebbian Algorithm can produce useful networks. The algorithm has been proven to converge only for single-layer linear networks. However, when applied to nonlinear networks with certain types of nonlinearity, it appears to converge to good results. In certain cases, it operates by maintaining the outputs uncorrelated while maximizing their variance. We have not investigated its behavior on nonlinearities other than rectification or sigmoids, so we can make no predictions about its general utility. 
Nevertheless, the few examples presented for the nonlinear case are encouraging, and suggest that further investigation of this algorithm will yield interesting results. \n\nAcknowledgements \n\nI would like to express my gratitude to the many people at the NIPS conference and elsewhere whose comments, criticisms, and suggestions have increased my understanding of these results. In particular, thanks are due to Ralph Linsker for pointing out to me an important error in the presentation and for his comments on the manuscript, as well as to John Denker, Steve Nowlan, Rich Sutton, Tom Breuel, and my advisor Tomaso Poggio. \n\nThis report describes research done at the MIT Artificial Intelligence Laboratory, and sponsored by a grant from the Office of Naval Research (ONR), Cognitive and Neural Sciences Division; by the Alfred P. Sloan Foundation; by the National Science Foundation; by the Artificial Intelligence Center of Hughes Aircraft Corporation (SI-801534-2); and by the NATO Scientific Affairs Division (0403/87). Support for the A. I. Laboratory's artificial intelligence research is provided by the Advanced Research Projects Agency of the Department of Defense under Army contract DACA76-85-C-0010, and in part by ONR contract N00014-85-K-0124. The author was supported during part of this research by a National Science Foundation Graduate Fellowship, and later by a Medical Scientist Training Program grant. \n\nReferences \n\nAndrews B. W., Pollen D. A., 1979, Relationship between spatial frequency selectivity and receptive field profile of simple cells, J. Physiol., 287:163-176. \n\nBarrow H. G., 1987, Learning receptive fields, In Proc. IEEE 1st Ann. Conference on Neural Networks, volume 4, pages 115-121, San Diego, CA. \n\nBienenstock E. L., Cooper L. N., Munro P. 
W., 1982, Theory for the development of neuron selectivity: Orientation specificity and binocular interaction in visual cortex, J. Neuroscience, 2(1):32-48. \n\nBourlard H., Kamp Y., 1988, Auto-association by multilayer perceptrons and singular value decomposition, Biological Cybernetics, 59:291-294. \n\nHinton G. E., 1987, Connectionist learning procedures, CMU Tech. Report CS-87-115. \n\nLinsker R., 1986, From basic network principles to neural architecture, Proc. Natl. Acad. Sci. USA, 83:7508-7512. \n\nLinsker R., 1988, Self-organization in a perceptual network, Computer, 21(3):105-117. \n\nLippmann R. P., 1987, An introduction to computing with neural nets, IEEE ASSP Magazine, pages 4-22. \n\nOja E., 1983, Subspace Methods of Pattern Recognition, Research Studies Press, UK. \n\nRumelhart D. E., Hinton G. E., Williams R. J., 1986, Learning representations by back-propagating errors, Nature, 323(9):533-536. \n\nSanger T. D., 1988a, Optimal unsupervised learning, Neural Networks, 1(S1):127, Proc. 1st Ann. INNS meeting, Boston, MA. \n\nSanger T. D., 1988b, Optimal unsupervised learning in a single-layer linear feedforward neural network, submitted to Neural Networks. \n\nWidrow B., Hoff M. E., 1960, Adaptive switching circuits, In IRE WESCON Conv. Record, Part 4, pages 96-104. \n", "award": [], "sourceid": 139, "authors": [{"given_name": "Terence", "family_name": "Sanger", "institution": null}]}