{"title": "Fast Non-Linear Dimension Reduction", "book": "Advances in Neural Information Processing Systems", "page_first": 152, "page_last": 159, "abstract": null, "full_text": "Fast Non-Linear Dimension Reduction \n\nNanda Kambhatla and Todd K. Leen \n\nDepartment of Computer Science and Engineering \nOregon Graduate Institute of Science & Technology \nP.O. Box 91000, Portland, OR 97291-1000 \n\nAbstract \n\nWe present a fast algorithm for non-linear dimension reduction. The algorithm builds a local linear model of the data by merging PCA with clustering based on a new distortion measure. Experiments with speech and image data indicate that the local linear algorithm produces encodings with lower distortion than those built by five layer auto-associative networks. The local linear algorithm is also more than an order of magnitude faster to train. \n\n1 Introduction \n\nFeature sets can be more compact than the data they represent. Dimension reduction provides compact representations for storage, transmission, and classification. Dimension reduction algorithms operate by identifying and eliminating statistical redundancies in the data. \nThe optimal linear technique for dimension reduction is principal component analysis (PCA). PCA performs dimension reduction by projecting the original n-dimensional data onto the m < n dimensional linear subspace spanned by the leading eigenvectors of the data's covariance matrix. Thus PCA builds a global linear model of the data (an m-dimensional hyperplane). Since PCA is sensitive only to correlations, it fails to detect higher-order statistical redundancies. One expects non-linear techniques to provide better performance, i.e. more compact representations with lower distortion. \n\nThis paper introduces a local linear technique for non-linear dimension reduction. 
\nWe demonstrate its superiority to a recently proposed global non-linear technique, and show that both non-linear algorithms provide better performance than PCA for speech and image data. \n\n2 Global Non-Linear Dimension Reduction \n\nSeveral researchers (e.g. Cottrell and Metcalfe 1991) have used layered feedforward auto-associative networks with a bottleneck middle layer to perform dimension reduction. It is well known that auto-associative nets with a single hidden layer cannot provide lower distortion than PCA (Bourlard and Kamp, 1988). Recent work (e.g. Oja 1991) shows that five layer auto-associative networks can improve on PCA. These networks have three hidden layers (see Figure 1(a)). The first and third hidden layers have non-linear response, and are referred to as the mapping layers. The m < n nodes of the middle or representation layer provide the encoded signal. \nThe first two layers of weights produce a projection from R^n to R^m. The last two layers of weights produce an immersion from R^m into R^n. If these two maps are well chosen, then the complete mapping from input to output will approximate the identity for the training data. If the data requires the projection and immersion to be non-linear to achieve a good fit, then the network can in principle find such functions. \n\n[Figure 1 appears here.] \n\nFigure 1: (a) A five layer feedforward auto-associative network. This network can perform a non-linear dimension reduction from n to m dimensions. (b) Global curvilinear coordinates built by a five layer network for data distributed on the surface of a hemisphere. When the activations of the representation layer are swept, the outputs trace out the curvilinear coordinates shown by the solid lines. 
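For concreteness, the global linear baseline (PCA) described above can be sketched in a few lines of NumPy. This is an illustrative sketch, not the authors' code; the function name, toy data, and array shapes are our own:

```python
import numpy as np

def pca_reduce(X, m):
    """Project data onto the m leading eigenvectors of its covariance matrix."""
    mu = X.mean(axis=0)
    C = np.cov(X - mu, rowvar=False)      # n x n sample covariance
    vals, vecs = np.linalg.eigh(C)        # eigenvalues in ascending order
    E = vecs[:, ::-1][:, :m]              # n x m leading eigenvectors
    Z = (X - mu) @ E                      # m-dimensional encodings
    X_hat = Z @ E.T + mu                  # linear reconstruction
    return Z, X_hat

rng = np.random.default_rng(0)
# Toy data lying near a 2-D plane embedded in 5 dimensions
X = rng.normal(size=(500, 2)) @ rng.normal(size=(2, 5)) + 0.01 * rng.normal(size=(500, 5))
Z, X_hat = pca_reduce(X, 2)
# Normalized reconstruction error; small because a global plane fits this data
e_norm = np.mean(np.sum((X - X_hat) ** 2, axis=1)) / np.mean(np.sum(X ** 2, axis=1))
```

When the data instead lie on a curved manifold, as in Figure 1(b), no choice of m-dimensional plane drives this error to zero, which is the motivation for the non-linear methods compared in this paper.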
\n\nThe activities of the nodes in the representation layer form global curvilinear coordinates on a submanifold of the input space (see Figure 1(b)). We thus refer to five layer auto-associative networks as a global, non-linear dimension reduction technique. \n\n3 Locally Linear Dimension Reduction \n\nFive layer networks have drawbacks: they can be very slow to train, and they are prone to becoming trapped in poor local optima. Furthermore, it may not be possible to accurately fit global, low dimensional, curvilinear coordinates to the data. We propose an alternative that does not suffer from these problems. \nOur algorithm pieces together local linear coordinate patches. The local regions are defined by the partition of the input space induced by a vector quantizer (VQ). The orientation of the local coordinates is determined by PCA (see Figure 2). In this section, we present two ways to obtain the partition. First we describe an approach that uses Euclidean distance, then we describe a new distortion measure which is optimal for our task (local PCA). \n\n[Figure 2 appears here.] \n\nFigure 2: Local coordinates built by our algorithm (dubbed VQPCA) for data distributed on the surface of a hemisphere. The solid lines represent the two principal eigen-directions in each Voronoi cell. The region covered by one Voronoi cell is shown shaded. \n\n3.1 Euclidean partitioning \n\nHere, we do a clustering (with Euclidean distance) followed by PCA in each of the local regions. The hybrid algorithm, dubbed VQPCA, proceeds in three steps: \n\n1. Using competitive learning, train a VQ (with Euclidean distance) with Q reference vectors (weights) (r_1, r_2, ..., r_Q). \n\n2. Perform a local PCA within each Voronoi cell of the VQ. 
For each cell, compute the local covariance matrix for the data with respect to the corresponding reference vector (centroid) r_c. Next compute the eigenvectors (e_1, ..., e_n) of each covariance matrix. \n\n3. Choose a target dimension m and project each data vector x onto the leading m eigenvectors to obtain the local linear coordinates z = (e_1 \u00b7 (x - r_c), ..., e_m \u00b7 (x - r_c)). \n\nThe encoding of x consists of the index c of the reference cell closest (Euclidean distance) to x, together with the m < n component vector z. The decoding is given by \n\nx_hat = r_c + sum_{i=1}^{m} z_i e_i ,   (1) \n\nwhere r_c is the reference vector (centroid) for the cell c, and e_i are the leading eigenvectors of the covariance matrix of the cell c. The mean squared reconstruction error incurred by VQPCA is \n\nE_recon = E[ ||x - x_hat||^2 ] ,   (2) \n\nwhere E[\u00b7] denotes an expectation with respect to x, and x_hat is defined in (1). \nTraining the VQ and performing the local PCA are very fast relative to training a five layer network. The training time is dominated by the distance computations for the competitive learning. This computation can be sped up significantly by using a multi-stage architecture for the VQ (Gray 1984). \n\n3.2 Projection partitioning \n\nThe VQPCA algorithm as described above is not optimal because the clustering is done independently of the PCA projection. The goal is to minimize the expected reconstruction error (2). We can realize this by using the expected reconstruction error as the distortion measure for the design of the VQ. \nThe reconstruction error for VQPCA (E_recon defined in (2)) can be written in matrix form as \n\nE_recon = E[ (x - r_c)^T P_c^T P_c (x - r_c) ] ,   (3) \n\nwhere P_c is an (n - m) x n matrix whose rows are the orthonormal trailing eigenvectors of the covariance matrix for the cell c. This is the mean squared Euclidean distance between the data and the local hyperplane. 
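The three VQPCA steps above, together with the decoding in (1), can be sketched in NumPy. This is an illustrative sketch under our own assumptions, not the authors' implementation: for brevity the VQ is trained with batch Lloyd iterations rather than on-line competitive learning, and the function names and toy data are hypothetical:

```python
import numpy as np

def train_vqpca(X, Q, m, iters=20, seed=0):
    """Euclidean VQPCA: VQ partition (batch Lloyd), then a local PCA per cell."""
    rng = np.random.default_rng(seed)
    r = X[rng.choice(len(X), Q, replace=False)].copy()   # reference vectors
    for _ in range(iters):
        cells = np.argmin(((X[:, None, :] - r) ** 2).sum(-1), axis=1)
        for c in range(Q):
            if np.any(cells == c):
                r[c] = X[cells == c].mean(axis=0)
    E = []                                                # leading eigenvectors per cell
    for c in range(Q):
        d = X[cells == c] - r[c]
        if len(d) > m:
            C = d.T @ d / len(d)                          # covariance about r_c
            _, vecs = np.linalg.eigh(C)                   # ascending eigenvalues
            E.append(vecs[:, ::-1][:, :m])                # n x m leading eigenvectors
        else:
            E.append(np.eye(X.shape[1], m))               # degenerate-cell fallback
    return r, E

def encode(x, r, E):
    c = int(np.argmin(((x - r) ** 2).sum(-1)))            # nearest cell (Euclidean)
    return c, E[c].T @ (x - r[c])                         # z_i = e_i . (x - r_c)

def decode(c, z, r, E):
    return r[c] + E[c] @ z                                # eq. (1)

# Toy data on a curved 2-D surface embedded in 3-D
rng = np.random.default_rng(0)
uv = rng.uniform(-1, 1, size=(600, 2))
X = np.column_stack([uv[:, 0], uv[:, 1], uv[:, 0] ** 2 + uv[:, 1] ** 2])
r, E = train_vqpca(X, Q=10, m=2)
X_hat = np.array([decode(*encode(x, r, E), r, E) for x in X])
e_norm = np.mean(np.sum((X - X_hat) ** 2, axis=1)) / np.mean(np.sum(X ** 2, axis=1))
```

On this toy surface the ten local planes together track the curvature, giving a small normalized reconstruction error, in the spirit of the local patches shown in Figure 2.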
\n\nThe expression for the VQPCA error in (3) suggests the distortion measure \n\nd(x, r_c) = (x - r_c)^T P_c^T P_c (x - r_c) .   (4) \n\nWe call this the reconstruction distance. The reconstruction distance is the error incurred in approximating x using only m local PCA coefficients. It is the squared projection of the difference vector x - r_c on the trailing eigenvectors of the covariance matrix for the cell c. Clustering with respect to the reconstruction distance directly minimizes the expected reconstruction error E_recon. \n\nThe modified VQPCA algorithm is: \n\n1. Partition the input space using a VQ with the reconstruction distance measure in (4).[1] \n\n2. Perform a local PCA (same as in steps 2 and 3 of the algorithm as described in section 3.1). \n\n[1] The VQ is trained using the (batch mode) generalized Lloyd's algorithm (Gersho and Gray, 1992) rather than on-line competitive learning. This avoids recomputing the matrix P_c (which depends on r_c) for each input vector. \n\n4 Experimental Results \n\nWe apply PCA, five layer networks (5LNs), and VQPCA to dimension reduction of speech and images. We compare the algorithms using two performance criteria: training time and the distortion in the reconstructed signal. The distortion measure is the normalized reconstruction error: \n\nE_norm = E_recon / E[ ||x||^2 ] = E[ ||x - x_hat||^2 ] / E[ ||x||^2 ] . \n\n4.1 Model Construction \n\nThe 5LNs were trained using three optimization techniques: conjugate gradient descent (CGD), the BFGS algorithm (a quasi-Newton method (Press et al. 1987)), and stochastic gradient descent (SGD). In order to limit the space of architectures, the 5LNs have the same number of nodes in both of the mapping (second and fourth) layers. \nFor the VQPCA with Euclidean distance, clustering was implemented using standard VQ (VQPCA-Eucl) and multistage quantization (VQPCA-MS-E). 
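The claim behind the reconstruction distance, that the quadratic form in (4) equals the error actually incurred by an m-coefficient local PCA encoding, can be checked numerically. The sketch below is our own illustration (the names and sample data are hypothetical), building P_c from the trailing eigenvectors of a sample covariance:

```python
import numpy as np

rng = np.random.default_rng(1)
n, m = 6, 2
X = rng.normal(size=(200, n))
r_c = X.mean(axis=0)                        # a reference vector (centroid)
d0 = X - r_c
C = d0.T @ d0 / len(X)                      # covariance about r_c
vals, vecs = np.linalg.eigh(C)              # eigenvalues in ascending order
lead = vecs[:, ::-1][:, :m]                 # n x m leading eigenvectors
P_c = vecs[:, : n - m].T                    # (n-m) x n trailing eigenvectors

x = rng.normal(size=n)
# Reconstruction distance of eq. (4): squared projection onto trailing eigenvectors
d_recon = (x - r_c) @ P_c.T @ P_c @ (x - r_c)
# Error incurred by keeping only m local PCA coefficients, as in eq. (1)
x_hat = r_c + lead @ (lead.T @ (x - r_c))
err = np.sum((x - x_hat) ** 2)              # agrees with d_recon up to rounding
```

Because the leading and trailing eigenvectors together form an orthonormal basis, the residual x - x_hat lies entirely in the span of the trailing eigenvectors, so the two quantities coincide.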
The multistage architecture reduces the number of distance calculations and hence the training time for VQPCA (Gray 1984). \n\n4.2 Dimension Reduction of Speech \n\nWe used examples of the twelve monophthongal vowels extracted from continuous speech drawn from the TIMIT database (Fisher and Doddington 1986). Each input vector consists of 32 DFT coefficients (spanning the frequency range 0-4 kHz), time-averaged over the central third of the utterance. We divided the data set into a training set containing 1200 vectors, a validation set containing 408 vectors, and a test set containing 408 vectors. The validation set was used for architecture selection (e.g. the number of nodes in the mapping layers for the five layer nets). The test set utterances are from speakers not represented in the training set or the validation set. Motivated by the desire to capture formant structure in the vowel encodings, we reduced the data from 32 to 2 dimensions. (Experiments on reduction to 3 dimensions gave similar results to those reported here (Kambhatla and Leen 1993).) \nTable 1 gives the test set reconstruction errors and the training times. The VQPCA encodings have significantly lower reconstruction error than the global PCA or five layer nets. The best 5LNs have slightly lower reconstruction error than PCA, but are very slow to train. Using the multistage search, VQPCA trains more than two orders of magnitude faster than the best 5LN, and achieves an error about 0.7 times as great. The modified VQPCA algorithm (with the reconstruction distance measure used for clustering) provides the least reconstruction error among all the architectures tried. \n\nTable 1: Speech data test set reconstruction errors and training times. Architectures represented here are from experiments with the lowest validation set error over the parameter ranges explored. 
The numbers in the parentheses are the values of the free parameters for the algorithm represented (e.g. 5LN-CGD (5) indicates a network with 5 nodes in both the mapping (2nd and 4th) layers, while VQPCA-Eucl (50) indicates a clustering into 50 Voronoi cells). \n\nALGORITHM          E_norm   TRAINING TIME (in seconds) \nPCA                0.0060   11 \n5LN-CGD (5)        0.0069   956 \n5LN-BFGS (30)      0.0057   28,391 \n5LN-SGD (25)       0.0055   94,903 \nVQPCA-Eucl (50)    0.0037   1,454 \nVQPCA-MS-E (9x9)   0.0036   142 \nVQPCA-Recon (45)   0.0031   931 \n\nTable 2: Reconstruction errors and training times for a 50 to 5 dimension reduction of images. Architectures represented here are from experiments with the lowest validation set error over the parameter ranges explored. \n\nALGORITHM          E_norm   TRAINING TIME (in seconds) \nPCA                0.458    5 \n5LN-CGD (40)       0.298    3,141 \n5LN-BFGS (20)      0.052    10,389 \n5LN-SGD (25)       0.350    15,486 \nVQPCA-Eucl (20)    0.140    163 \nVQPCA-MS-E (8x8)   0.176    118 \nVQPCA-Recon (25)   0.099    108 \n\n4.3 Dimension Reduction of Images \n\nThe data consists of 160 images of the faces of 20 people. Each is a 64x64, 8-bit/pixel grayscale image. We extracted the first 50 principal components of each image and use these as our experimental data. This is the same data and preparation that DeMers and Cottrell used in their study of dimension reduction with five layer auto-associative nets (DeMers and Cottrell 1993). They trained auto-associators to reduce the 50 principal components to 5 dimensions. \n\nWe divided the data into a training set containing 120 images, a validation set (for architecture selection) containing 20 images, and a test set containing 20 images. We reduced the images to 5 dimensions using PCA, 5LNs,[2] and VQPCA. Table 2 \n\n[2] We used 5LNs with a configuration of 50-n-5-n-50, n varying from 10 to 40 in increments of 5. 
The BFGS algorithm posed prohibitive memory and time requirements for n > 20 for this task. \n\nTable 3: Reconstruction errors and training times for a 50 to 5 dimension reduction of images (training with all the data). Architectures represented here are from experiments with the lowest error over the parameter ranges explored. \n\nALGORITHM          E_norm   TRAINING TIME (in seconds) \nPCA                0.4054   7 \n5LN-SGD (30)       0.1034   25,306 \n5LN-SGD (40)       0.0729   31,980 \nVQPCA-Eucl (50)    0.0009   905 \nVQPCA-Recon (50)   0.0017   216 \n\nsummarizes the results. We notice that a five layer net obtains the encoding with the least error for this data, but it takes a long time to train. Presumably more training data would improve the best VQPCA results. \n\n[Figure 3 appears here.] \n\nFigure 3: Two representative images. Left to right: original 50-PC image, reconstructions from 5-D encodings: PCA, 5LN-SGD(40), VQPCA(10), and VQPCA(50). \n\nFor comparison with DeMers and Cottrell's (DeMers and Cottrell 1993) work, we also conducted experiments training with all the data. The results are summarized[3] in Table 3 and Figure 3 shows two sample faces. Both non-linear techniques produce encodings with lower error than PCA, indicating significant non-linear structure in the data. With the same data, and with a 5LN with 30 nodes in each mapping layer, DeMers (DeMers and Cottrell 1993) obtains a reconstruction error E_norm = 0.1317.[4] We note that the VQPCA algorithms achieve an order of magnitude improvement over five layer nets both in terms of speed of training and the accuracy of encodings. \n\n[3] For 5LNs, we only show results with SGD in order to compare with the experimental results of DeMers. 
For this data, 5LN-CGD gave encodings with a higher error and 5LN-BFGS posed prohibitive memory and computational requirements. \n[4] DeMers reports half the MSE per output node, E = (1/2) * (1/50) * MSE = 0.001. This corresponds to E_norm = 0.1317. \n\n5 Summary \n\nWe have presented a local linear algorithm for dimension reduction. We propose a new distance measure which is optimal for the task of local PCA. Our results with speech and image data indicate that the non-linear techniques provide more accurate encodings than PCA. Our local linear algorithm produces more accurate encodings (except for one simulation with image data), and trains much faster than five layer auto-associative networks. \n\nAcknowledgments \n\nThis work was supported by grants from the Air Force Office of Scientific Research (F49620-93-1-0253) and the Electric Power Research Institute (RP8015-2). The authors are grateful to Gary Cottrell and David DeMers for providing their image database and clarifying their experimental results. We also thank our colleagues in the Center for Spoken Language Understanding at OGI for providing speech data. \n\nReferences \n\nH. Bourlard and Y. Kamp. (1988) Auto-association by multilayer perceptrons and singular value decomposition. Biological Cybernetics, 59:291-294. \nG. Cottrell and J. Metcalfe. (1991) EMPATH: Face, emotion, and gender recognition using holons. In R. Lippmann, J. Moody, and D. Touretzky, editors, Advances in Neural Information Processing Systems 3, pages 564-571. Morgan Kaufmann. \nD. DeMers and G. Cottrell. (1993) Non-linear dimensionality reduction. In Giles, Hanson, and Cowan, editors, Advances in Neural Information Processing Systems 5. San Mateo, CA: Morgan Kaufmann. \nW. M. Fisher and G. R. Doddington. (1986) The DARPA speech recognition research database: specification and status. 
In Proceedings of the DARPA Speech Recognition Workshop, pages 93-99, Palo Alto, CA. \nA. Gersho and R. M. Gray. (1992) Vector Quantization and Signal Compression. Kluwer Academic Publishers. \nR. M. Gray. (1984) Vector quantization. IEEE ASSP Magazine, pages 4-29. \nN. Kambhatla and T. K. Leen. (1993) Fast non-linear dimension reduction. In IEEE International Conference on Neural Networks, Vol. 3, pages 1213-1218. IEEE. \nE. Oja. (1991) Data compression, feature extraction, and autoassociation in feedforward neural networks. In Artificial Neural Networks, pages 737-745. Elsevier Science Publishers B.V. (North-Holland). \nW. H. Press, B. P. Flannery, S. A. Teukolsky, and W. T. Vetterling. (1987) Numerical Recipes: The Art of Scientific Computing. Cambridge University Press, Cambridge/New York. \n", "award": [], "sourceid": 825, "authors": [{"given_name": "Nanda", "family_name": "Kambhatla", "institution": null}, {"given_name": "Todd", "family_name": "Leen", "institution": null}]}