{"title": "Unsupervised Parallel Feature Extraction from First Principles", "book": "Advances in Neural Information Processing Systems", "page_first": 136, "page_last": 143, "abstract": null, "full_text": "Unsupervised Parallel Feature Extraction \n\nfrom First Principles \n\n.. \n\nMats Osterberg \n\nImage Processing Laboratory \n\nDept. EE., Linkoping University \n\nS-58183 Linkoping Sweden \n\nReiner Lenz \n\nImage Processing Laboratory \n\nDept. EE., Linkoping University \n\nS-58183 Linkoping Sweden \n\nAbstract \n\nWe describe a number of learning rules that can be used to train un(cid:173)\nsupervised parallel feature extraction systems. The learning rules \nare derived using gradient ascent of a quality function. We con(cid:173)\nsider a number of quality functions that are rational functions of \nhigher order moments of the extracted feature values. We show \nthat one system learns the principle components of the correla(cid:173)\ntion matrix. Principal component analysis systems are usually not \noptimal feature extractors for classification. Therefore we design \nquality functions which produce feature vectors that support unsu(cid:173)\npervised classification. The properties of the different systems are \ncompared with the help of different artificially designed datasets \nand a database consisting of all Munsell color spectra. \n\n1 \n\nIntroduction \n\nThere are a number of unsupervised Hebbian learning algorithms (see Oja, 1992 \nand references therein) that perform some version of the Karhunen-Loeve expan(cid:173)\nsion. Our approach to unsupervised feature extraction is to identify some desirable \nproperties of the extracted feature vectors and to construct a quality functions that \nmeasures these properties. The filter functions are then learned from the input pat(cid:173)\nterns by optimizing this selected quality function. 
In comparison to conventional unsupervised Hebbian learning this approach reduces the amount of communication between the units needed to learn the weights in parallel, since the complexity now lies in the learning rule used. \n\nThe optimal (orthogonal) solutions to two of the proposed quality functions turn out to be related to the Karhunen-Loeve expansion: the first learns an arbitrary rotation of the eigenvectors whereas the latter learns the pure eigenvectors. A common problem with the Karhunen-Loeve expansion is the fact that the first eigenvector is normally the mean vector of the input patterns. In this case one filter function will have a more or less uniform response for a wide range of input patterns, which makes it rather useless for classification. We will show that one quality function leads to a system that tends to learn filter functions which have a large magnitude response for just one class of samples (different for each filter function) and a low magnitude response for samples from all other classes. Thus, it is possible to classify an incoming pattern by simply observing which filter function has the largest magnitude response. Similar to Intrator's Projection Pursuit related network (see Intrator & Cooper, 1992 and references therein) some quality functions use higher order (> 2) statistics of the input process, but in contrast to Intrator's network there is no need to specify the amount of lateral inhibition needed to learn different filter functions. \n\nAll systems considered in this paper are linear but at the end we will briefly discuss possible non-linear extensions. \n\n2 Quality functions \n\nIn the following we consider linear filter systems. 
These can be described by the equation: \n\nO(t) = W(t)P(t)  (1) \n\nwhere P(t) in R^{M x 1} is the input pattern at iteration t, W(t) in R^{N x M} is the filter coefficient matrix and O(t) = (o_1(t), ..., o_N(t))' in R^{N x 1} is the extracted feature vector. Usually M > N, i.e. the feature extraction process defines a reduction of the dimensionality. Furthermore, we assume that both the input patterns and the filter functions are normed; ||P(t)|| = 1 and ||W_n(t)|| = 1, for all t and n. This implies that |o_n(t)| <= 1, for all t and n. \n\nOur first decision is to measure the scatter of the extracted feature vectors around the origin by the determinant of the output correlation matrix: \n\nQ_MS(t) = det E_t{O(t)O'(t)}  (2) \n\nQ_MS(t) is the quality function used in the Maximum Scatter Filter System (MS-system). The use of the determinant is motivated by the following two observations: 1. The determinant is equal to the product of the eigenvalues and hence the product of the variances in the principal directions and thus a measure of the scattering volume in the feature space. 2. The determinant vanishes if some filter functions are linearly dependent. \n\nIn (Lenz & Osterberg, 1992) we have shown that the optimal filter functions for Q_MS(t) are given by an arbitrary rotation of the N eigenvectors corresponding to the N largest eigenvalues of the input correlation matrix: \n\nW_opt = R U_eig  (3) \n\nwhere U_eig contains the largest eigenvectors (or principal components) of the input correlation matrix E_t{P(t)P'(t)} and R is an arbitrary rotation matrix with det(R) = 1. To differentiate between these solutions we need a second criterion. \n\nOne attempt to define the best rotation is to require that the mean energy E_t{o_n^2(t)} should be concentrated in as few components o_n(t) of the extracted feature vector as possible. Thus, the mean energy E_t{o_n^2(t)} of each filter function should be either very high (i.e. 
near 1) or very low (i.e. near 0). This leads to the following second order concentration measure: \n\nQ_2(t) = sum_{n=1}^{N} E_t{o_n^2(t)} (1 - E_t{o_n^2(t)})  (4) \n\nwhich has a low non-negative value if the energies are concentrated. \n\nAnother idea is to find a system that produces feature vectors that have unsupervised discrimination power. In this case each learned filter function should respond selectively, i.e. have a large response for some input samples and a low response for others. One formulation of this goal is that each extracted feature vector should be (up to the sign) binary; o_i(t) = \u00b11 and o_n(t) = 0, n != i, for all t. This can be measured by the following fourth order expression: \n\nQ_4(t) = E_t{sum_{n=1}^{N} o_n^2(t) (1 - o_n^2(t))} = sum_{n=1}^{N} E_t{o_n^2(t)} - E_t{o_n^4(t)}  (5) \n\nwhich has a low non-negative value if the features are binary. Note that it is not sufficient to use o_n(t) instead of o_n^2(t) since Q_4(t) would then have a low value also for feature vectors with components equal in magnitude but with opposite sign. A third criterion can be found as follows: if the filter functions have a selective filter response then the responses to different input patterns differ in magnitude and thus the variance of the energy o_n^2(t) is large. The total variance is measured by: \n\nQ_Var(t) = sum_{n=1}^{N} Var{o_n^2(t)} = sum_{n=1}^{N} E_t{(o_n^2(t) - E_t{o_n^2(t)})^2} = sum_{n=1}^{N} E_t{o_n^4(t)} - (E_t{o_n^2(t)})^2  (6) \n\nFollowing (Darlington, 1970) it can be shown that the distribution of o_n^2 should be bimodal (modes below and above E_t{o_n^2}) to maximize Q_Var(t). The main difference between Q_Var(t) and the quality function used by Intrator is the use of a fourth order term E_t{o_n^4(t)} instead of a third order term E_t{o_n^3(t)}. With E_t{o_n^3(t)} the quality function is a measure of the skewness of the distribution of o(t) and it is maximized when one mode is at zero and one (or several) is above E_t{o_n^2(t)}. 
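The three measures Q_2, Q_4 and Q_Var above are plain moment expressions; as an aside, they can be sketched numerically (our illustration, not part of the paper) by replacing the expectations E_t{.} with sample means over a batch of extracted feature vectors:

```python
import numpy as np

def quality_measures(O):
    """O: (T, N) array whose rows are extracted feature vectors O(t).
    Returns (Q2, Q4, QVar) with the expectations E_t{.} replaced
    by sample means over the T patterns."""
    E2 = (O ** 2).mean(axis=0)      # E_t{o_n^2(t)} for each filter n
    E4 = (O ** 4).mean(axis=0)      # E_t{o_n^4(t)} for each filter n
    Q2 = np.sum(E2 * (1.0 - E2))    # eq. (4): low when energies are near 0 or 1
    Q4 = np.sum(E2 - E4)            # eq. (5): low when features are (signed) binary
    QVar = np.sum(E4 - E2 ** 2)     # eq. (6): large for selective, bimodal responses
    return Q2, Q4, QVar

# perfectly binary feature vectors give Q4 = 0
O = np.array([[1.0, 0.0], [-1.0, 0.0], [0.0, 1.0]])
print(quality_measures(O))
```

Note that Q4 vanishes for the binary batch while Q2 does not, since the energies 2/3 and 1/3 are not concentrated near 0 or 1.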
In this paper we will examine the following non-parametric combinations of the quality functions above: \n\nQ_MS(t) / Q_2(t)  (7) \n\nQ_MS(t) / Q_4(t)  (8) \n\nQ_Var(t) Q_MS(t)  (9) \n\nWe refer to the corresponding filter systems as the Karhunen-Loeve Filter System (KL-system), the Fourth Order Filter System (FO-system) and the Maximum Variance Filter System (MV-system). \n\nSince each quality function is a combination of two different functions it is hard to find the global optimal solution. Instead we use the following strategy to determine a local optimal solution. \n\nDefinition 1 The optimal orthogonal solution to each quality function is of the form: \n\nW_opt = R_opt U_eig  (10) \n\nwhere R_opt is the rotation of the largest eigenvectors which minimizes Q_2(t) or Q_4(t), or maximizes Q_Var(t). \n\nIn (Lenz & Osterberg, 1992 and Osterberg, 1993) we have shown that the optimal orthogonal solution to the KL-system consists of the N pure eigenvectors if the N largest eigenvalues are all distinct (i.e. R_opt = I). If some eigenvalues are equal then the solution is only determined up to an arbitrary rotation of the eigenvectors with equal eigenvalues. The fourth order term E_t{o_n^4(t)} in Q_4(t) and Q_Var(t) makes it difficult to derive a closed form solution. The best we can achieve is a numerical method (in the case of Q_4(t) see Osterberg, 1993) for the computation of the optimal orthogonal filter functions. \n\n3 Maximization of the quality function \n\nThe partial derivatives of Q_MS(t), Q_2(t), Q_4(t) and Q_Var(t) with respect to w_m^n(t) (the mth weight in the nth filter function at iteration t) are only functions of the input pattern P(t), the output values O(t) = (o_1(t), ..., o_N(t)) and the previous values of the weight coefficients (w_1^n(t-1), ..., w_M^n(t-1)) within the filter function (see Osterberg, 1993). 
In particular, they are not functions of the internal weights (w_1^i(t-1), ..., w_M^i(t-1)), i != n, of the other filter functions in the system. This implies that the filter coefficients can be learned in parallel using a system of the structure shown in Figure 1. \n\nFigure 1: The architecture of the filter system. \n\nIn (Osterberg, 1993) we used on-line optimization techniques based on gradient ascent. We tried two different methods to select the step length parameter: one rather heuristic, depending on the output o_n(t) of the filter function, and one inversely proportional to the second partial derivative of the quality function with respect to w_m^n(t). In each iteration the length of each filter function was explicitly normalized to one. Currently, we investigate standard unconstrained optimization methods (Dennis & Schnabel, 1983) based on batch learning. Now the step length parameter lambda is selected by line search in the search direction S(t): \n\nmax_lambda Q(W(t) + lambda S(t))  (11) \n\nTypical choices of S(t) include S(t) = I and S(t) = H^{-1}. With the identity matrix we get Steepest Ascent and with the inverse Hessian the quasi-Newton algorithm. Using sufficient synchronism the line search can be incorporated in the parallel structure (Figure 1). To incorporate the quasi-Newton algorithm we have to assume that the Hessian matrix is block diagonal, i.e. the second partial derivatives with respect to w_m^k(t) and w_m^l(t), k != l, for all m, are assumed to be zero. In general this is not the case and it is not clear if a block diagonal approximation is valid or not. The second partial derivatives can be approximated by secant methods (normally the BFGS method). 
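As an illustration of batch gradient ascent on one of these quality functions, the following numpy sketch (our own, not the paper's implementation; the data, step length and iteration count are arbitrary choices) maximizes Q_MS = det E_t{O O'} with an explicit renormalization of each filter function after every step, so the learned filters should come to span the subspace of the N largest eigenvectors:

```python
import numpy as np

rng = np.random.default_rng(0)
M, N, T = 8, 2, 2000

# synthetic normed input patterns with a clearly anisotropic spectrum
scales = np.array([3.0, 2.0] + [0.5] * (M - 2))
P = rng.normal(size=(T, M)) * scales
P /= np.linalg.norm(P, axis=1, keepdims=True)    # ||P(t)|| = 1
C = P.T @ P / T                                  # input correlation E_t{P P'}

W = rng.normal(size=(N, M))
W /= np.linalg.norm(W, axis=1, keepdims=True)    # ||W_n|| = 1

def qms(W):
    """Q_MS = det of the output correlation matrix W C W'."""
    return np.linalg.det(W @ C @ W.T)

q0 = qms(W)
for _ in range(500):
    Cout = W @ C @ W.T
    # gradient of det(W C W') with respect to W: 2 det(Cout) Cout^{-1} W C
    grad = 2.0 * np.linalg.det(Cout) * np.linalg.solve(Cout, W @ C)
    W += 0.05 * grad / np.linalg.norm(grad)        # normalized ascent step
    W /= np.linalg.norm(W, axis=1, keepdims=True)  # renormalize each filter

# the learned filters should (up to a rotation) span the top-N eigenvectors
_, V = np.linalg.eigh(C)
U = V[:, -N:]
print(qms(W) > q0, np.linalg.norm(W @ U, axis=1))
```

The projection norms of the learned filter rows onto the top-N eigenvector subspace should approach one, consistent with the optimum W_opt = R U_eig of equation (3).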
Furthermore the condition of normalized filter functions can be achieved by optimizing in hyperspherical polar coordinates. Preliminary experiments (mostly with Steepest Ascent) show that more advanced optimization techniques lead to a more robust convergence of the filter functions. \n\n4 Experiments \n\nIn (Osterberg, 1993) we describe a series of experiments in which we investigate systematically the following properties of the MS-system, the KL-system and the FO-system: convergence speed, dependence on the initial solution W(0), distance between the learned solution and the optimal (orthogonal) solution, supervised classification of the extracted feature vectors using linear regression and the degree of selective response of the learned filter functions. We use training sets with controlled scalar products between the cluster centers of three classes of input patterns embedded in a 32-D space. The results of the experiments can be summarized as follows. In contrast to the MS-system, we noticed that the KL- and FO-system had problems converging to the optimal orthogonal solutions for some initial solutions. All systems learned orthogonal solutions regardless of W(0). The supervised classification power was independent of the filter system used. Only the FO-system produced filter functions which mainly react to patterns from just one class, and only if the similarity (measured by the scalar product) between the classes in the training set was smaller than approximately 0.5. Thus, the FO-system extracts feature vectors which have unsupervised discrimination power. Furthermore, we showed that the FO-system can distinguish between data sets having identical correlation matrices (second order statistics) but different fourth order statistics. Recent experiments with more advanced optimization techniques (Steepest Ascent) show better convergence properties for the KL- and FO-system. Especially the distance between the learned filter functions and the optimal orthogonal ones becomes smaller. \n\nWe will describe some experiments which show that the MV-system is more suitable for tasks requiring unsupervised classification. We use two training sets Tset1 and Tset2. In the first set the mean scalar product between class one and two is 0.7, between class one and three 0.5 and between class two and three 0.3. In the second set the mean scalar products between all classes are 0.9, i.e. the angle between all cluster centers is arccos(0.9), approximately 26 degrees. \n\nTable 1: Typical filter response to patterns from (a)-(c) Tset1 and (d) Tset2 using the filter functions learned with (a) the KL-system, (b) the FO-system and (c)-(d) the MV-system. (e)-(f) Output covariance matrix using the filter functions learned with (e) the KL-system and (f) the MV-system. \n\n(a) [(-0.12, 0.92, -0.38)', (-0.46, 0.83, 0.32)', (0.73, 0.66, 0.14)'] \n\n(b) [(-0.71, 0.59, 0.28)', (-0.99, -0.80, 0.50)', (-0.22, -0.08, 0.01)'] \n\n(c) [(0.28, -0.91, 0.44)', (0.10, -0.39, 0.95)', (0.98, -0.23, 0.11)'] \n\n(d) [(-0.50, -0.50, 0.81)', (-0.49, -0.04, 0.97)', (-0.81, -0.49, 0.50)'] \n\n(e) (0.0340, 0.0001, 0.0005; 0.0001, 0.9300, 0.0000; 0.0005, 0.0000, 0.0353) \n\n(f) (0.3788, 0.3463, -0.3473; 0.3463, 0.3760, -0.3467; -0.3473, -0.3467, 0.3814) \n\nIn Table 1(a)-(c) we show the filter response of the filter functions learned with the KL-, FO- and MV-system to typical examples of the input patterns in the training set Tset1. 
For the KL-system we see that the second filter function gives the largest magnitude response for patterns from both class one and class two. For the FO-system the feature vectors are more binary, but still the first filter function has the largest magnitude response for patterns from class one and two. For the MV-system we see that each filter function has its largest magnitude response for only one class of input patterns and thus the extracted feature vectors support unsupervised discrimination. In Table 1(d) (computed from Tset2) we see that this is the case even when the scalar products between the cluster centers are as high as 0.9. The filter functions learned by the MV-system are approximately orthogonal. The system thus learns the rotation of the largest eigenvectors which maximizes Q_Var(t). Therefore it will not extract uncorrelated features (see Table 1(f)) but the variances (i.e. the diagonal elements of the covariance matrix) of the features are more or less equal. In Table 1(e) we see that the KL-system extracts uncorrelated features with largely different variances. This demonstrates that the KL-system tries to learn the pure eigenvectors. \n\nRecently, we have applied the MV-system to real world data. The training set consists of normalized reflectance spectra of the 1253 different color chips in the Munsell color atlas. Figure 2(a) shows one typical example each of a red, a green and a blue color chip and Figure 2(b) the three largest eigenvectors belonging to the correlation matrix of the training set. We see that the first eigenvector (the solid curve) has a more or less uniform response for all different colors. On the other hand, the MV-system (Figure 2(c)) learns one bluish, one greenish and one reddish filter function. Thus, the filter functions divide the color space according to the primary colors red, green and blue. We notice that the learned filter functions are orthogonal and tend to span the same space as the eigenvectors since ||W_sol - R_opt U_eig||_F = 0.0199 (the Frobenius norm), where R_opt maximizes Q_Var(t). Figure 2(d) shows one preliminary attempt to include the condition of non-negative filter functions in the optimization process (Steepest Ascent). We see that the learned filter functions are non-negative and divide the color space according to the primary colors. \n\nFigure 2: (a) Examples of normalized reflectance spectra of typical reddish (solid curve), greenish (dotted curve) and bluish (dashed curve) Munsell color chips. (b) The three largest eigenvectors belonging to the correlation matrix of the 1253 different reflectance spectra. (c) The learned filter functions with the MV-system. (d) The learned non-negative filter functions with the MV-system. In all figures the x-axes show the wavelength (nm). \n\n
One possible real-world application is optical color analysis, where non-negative filter functions are much easier to realize using optical components. Smoother filter functions can be obtained by incorporating additional constraints into the quality function. \n\n5 Non-linear extensions \n\nThe proposed strategy to extract feature vectors applies to non-linear filter systems as well. In this case the input-output relation O(t) = W(t)P(t) is replaced by O(t) = f(W(t)P(t)) where f describes the desired non-linearity. The corresponding learning rule can be derived using gradient based techniques as long as the non-linearity f(.) is differentiable. The exact form of f(.) will usually be application oriented. Node non-linearities of sigmoid type are one type of non-linearity which has received a lot of attention (see for example Oja & Karhunen, 1993). Typical applications include: robust Principal Component Analysis (PCA) (outlier protection, noise suppression and symmetry breaking), sinusoidal signal detection in colored noise and robust curve fitting. \n\nAcknowledgements \n\nThis work was done under TFR-contract TFR-93-00192. The visit of M. Osterberg at the Dept. of Info. Tech., Lappeenranta University of Technology was supported by a grant from the Nordic Research Network in Computer Vision. The Munsell color experiments were performed during this visit. \n\nReferences \n\nR. B. Darlington. (1970) Is kurtosis really peakedness? The American Statistician 24(2):19-20. \n\nJ. E. Dennis & Robert B. Schnabel. (1983) Numerical Methods for Unconstrained Optimization and Nonlinear Equations. Prentice-Hall. \n\nN. Intrator & L. N. Cooper. (1992) Objective Function Formulation of the BCM Theory of Visual Cortical Plasticity: Statistical Connections, Stability Conditions. Neural Networks 5:3-17. \n\nR. Lenz & M. Osterberg. (1992) Computing the Karhunen-Loeve expansion with a parallel, unsupervised filter system. 
Neural Computation 4(3):382-392. \n\nE. Oja. (1992) Principal Components, Minor Components, and Linear Neural Networks. Neural Networks 5:927-935. \n\nE. Oja & J. Karhunen. (1993) Nonlinear PCA: Algorithms and Applications. Technical Report AlB, Helsinki University of Technology, Laboratory of Computer and Information Sciences, SF-02150 Espoo, Finland. \n\nM. Osterberg. (1993) Unsupervised Feature Extraction using Parallel Linear Filters. Linkoping Studies in Science and Technology, Thesis No. 372. \n\n", "award": [], "sourceid": 721, "authors": [{"given_name": "Mats", "family_name": "\u00d6sterberg", "institution": null}, {"given_name": "Reiner", "family_name": "Lenz", "institution": null}]}