{"title": "A Note on Learning Vector Quantization", "book": "Advances in Neural Information Processing Systems", "page_first": 220, "page_last": 227, "abstract": null, "full_text": "A Note on Learning Vector Quantization \n\nVirginia R. de Sa \n\nDepartment of Computer Science \n\nUniversity of Rochester \nRochester, NY 14627 \n\nDana H. Ballard \n\nDepartment of Computer Science \n\nUniversity of Rochester \nRochester, NY 14627 \n\nAbstract \n\nVector Quantization is useful for data compression. Competitive Learning, \nwhich minimizes reconstruction error, is an appropriate algorithm for \nvector quantization of unlabelled data. Vector quantization of labelled \ndata for classification has a different objective, to minimize the number \nof misclassifications, and a different algorithm is appropriate. We show \nthat a variant of Kohonen's LVQ2.1 algorithm can be seen as a multi-class \nextension of an algorithm which in a restricted 2 class case can \nbe proven to converge to the Bayes optimal classification boundary. We \ncompare the performance of the LVQ2.1 algorithm to that of a modified \nversion having a decreasing window and normalized step size, on a ten \nclass vowel classification problem. \n\n1 Introduction \n\nVector quantization is a form of data compression that represents data vectors by a smaller \nset of codebook vectors. Each data vector is then represented by its nearest codebook \nvector. The goal of vector quantization is to represent the data with the fewest codebook \nvectors while losing as little information as possible. \n\nVector quantization of unlabelled data seeks to minimize the reconstruction error. This can \nbe accomplished with Competitive Learning [Grossberg, 1976; Kohonen, 1982], an iterative \nlearning algorithm for vector quantization that has been shown to perform gradient descent \non the following energy function [Kohonen, 1991] \n\n$$\\int \\|x - w_{s^*(x)}\\|^2 \\, p(x) \\, dx,$$ 
\n\nwhere $p(x)$ is the probability distribution of the input patterns, $w_s$ are the reference or \ncodebook vectors, and $s^*(x)$ is defined by $\\|x - w_{s^*(x)}\\| \\le \\|x - w_i\\|$ (for all $i$). This minimizes \nthe square reconstruction error of unlabelled data and may work reasonably well for \nclassification tasks if the patterns in the different classes are segregated. \n\nIn many classification tasks, however, the different member patterns may not be segregated \ninto separate clusters for each class. In these cases it is more important that members of the \nsame class be represented by the same codebook vector than that the reconstruction error \nis minimized. To do this, the quantizer can make use of the labelled data to encourage \nappropriate quantization. \n\n2 Previous approaches to Supervised Vector Quantization \n\nThe first use of labelled data (or a teaching signal) with Competitive Learning by Rumelhart \nand Zipser [Rumelhart and Zipser, 1986] can be thought of as assigning a class to each \ncodebook vector and only allowing patterns from the appropriate class to influence each \nreference vector. \n\nThis simple approach is far from optimal, though, as it fails to take into account interactions \nbetween the classes. Kohonen addressed this in his LVQ(1) algorithm [Kohonen, 1986]. He \nargues that the reference vectors resulting from LVQ(1) tend to approximate, for a particular \nclass $r$, \n\n$$P(x|C_r)P(C_r) - \\sum_{s \\ne r} P(x|C_s)P(C_s),$$ \n\nwhere $P(C_i)$ is the a priori probability of Class $i$ and $P(x|C_i)$ is the conditional density of \nClass $i$. \n\nThis approach is also not optimal for classification, as it addresses optimal places to put \nthe codebook vectors instead of optimal placement of the borders of the vector quantizer \nwhich arise from the Voronoi tessellation induced by the codebook vectors. 
\n\nFootnote 1: Kohonen [1986] showed this by demonstrating that the use of a \"weighted\" Voronoi tessellation (where \nthe relative distances of the borders from the reference vectors were changed) worked better. However, \nno principled way to calculate the relative weights was given and the application to real data used \nthe unweighted tessellation. \n\n3 Minimizing Misclassifications \n\nIn classification tasks the goal is to minimize the number of misclassifications of the \nresultant quantizer. That is, we want to minimize: \n\n(1) \n\nwhere $P(\\text{Class}_j)$ is the a priori probability of Class $j$, $P(x|\\text{Class}_j)$ is the conditional \ndensity of Class $j$, and $D.R_j$ is the decision region for class $j$ (which in this case is all $x$ such \nthat $\\|x - w_k\\| < \\|x - w_i\\|$ (for all $i$) and $w_k$ is a codebook vector for class $j$). \n\nConsider a one-dimensional problem of two classes and two codebook vectors $w_1$ and $w_2$ \ndefining a class boundary $b = (w_1 + w_2)/2$ as shown in Figure 1. In this case Equation 1 \nreduces to: \n\n(2) \n\n[Figure 1 plot: vertical axis $P(\\text{Class}_i)P(x|\\text{Class}_i)$; the horizontal axis marks $w_2$, $b^*$, $b$ and $w_1$.] \n\nFigure 1: Codebook vectors $w_1$ and $w_2$ define a border $b$. The optimal place for the border \nis at $b^*$ where $P(C_1)P(x|C_1) = P(C_2)P(x|C_2)$. The extra misclassification errors incurred by \nplacing the border at $b$ are shown by the shaded region. \n\nThe derivative of Equation 2 with respect to $b$ is \n\nThat is, the minimum number of misclassifications occurs at $b^*$ where \n$P(\\text{Class}_1)P(b^*|\\text{Class}_1) = P(\\text{Class}_2)P(b^*|\\text{Class}_2)$. 
\n\nIf $f(x) = P(\\text{Class}_1)P(x|\\text{Class}_1) - P(\\text{Class}_2)P(x|\\text{Class}_2)$ were a regression function then we \ncould use stochastic approximation [Robbins and Monro, 1951] to estimate $b^*$ iteratively \nas \n\n$$b(n+1) = b(n) + a(n) Z_n$$ \n\nwhere $Z_n$ is a sample of the random variable $Z$ whose expected value is \n$P(\\text{Class}_1)P(b(n)|\\text{Class}_1) - P(\\text{Class}_2)P(b(n)|\\text{Class}_2)$ and \n\n$$\\lim_{n \\to \\infty} a(n) = 0, \\qquad \\sum_n a(n) = \\infty, \\qquad \\sum_n a^2(n) < \\infty.$$ \n\nHowever, we do not have immediate access to an appropriate random variable $Z$ but can \nexpress $P(\\text{Class}_1)P(x|\\text{Class}_1) - P(\\text{Class}_2)P(x|\\text{Class}_2)$ as the limit of a sequence of regression \nfunctions using the Parzen window technique. In the Parzen window technique, probability \ndensity functions are estimated as the sum of appropriately normalized pulses centered at \nthe observed values. More formally, we can estimate $P(x|\\text{Class}_i)$ as [Sklansky and Wassel, \n1981] \n\n$$\\hat{P}_i^n(x) = \\frac{1}{n} \\sum_{j=1}^{n} \\Psi_n(x - X_j, c(n))$$ \n\nwhere $X_j$ is the sample data point at time $j$, and $\\Psi_n(x - z, c(n))$ is a Parzen window function \ncentred at $z$ with width parameter $c(n)$ that satisfies the following conditions \n\n$$\\Psi_n(x - z, c(n)) \\ge 0, \\quad \\forall x, z$$ \n\n$$\\int_{-\\infty}^{\\infty} \\Psi_n(x - z, c(n)) \\, dx = 1$$ \n\n$$\\lim_{n \\to \\infty} \\frac{1}{n} \\int_{-\\infty}^{\\infty} \\Psi_n^2(x - z, c(n)) \\, dx = 0$$ \n\n$$\\lim_{n \\to \\infty} \\Psi_n(x - z, c(n)) = \\delta(x - z)$$ \n\nWe can estimate $f(x) = P(\\text{Class}_1)P(x|\\text{Class}_1) - P(\\text{Class}_2)P(x|\\text{Class}_2)$ as \n\n$$\\hat{f}^n(x) = \\frac{1}{n} \\sum_{j=1}^{n} S(X_j) \\Psi_n(x - X_j, c(n))$$ \n\nwhere $S(X_j)$ is $+1$ if $X_j$ is from Class 1 and $-1$ if $X_j$ is from Class 2. 
\n\nThen \n\n$$\\lim_{n \\to \\infty} \\hat{f}^n(x) = P(\\text{Class}_1)P(x|\\text{Class}_1) - P(\\text{Class}_2)P(x|\\text{Class}_2)$$ \n\nand \n\n$$\\lim_{n \\to \\infty} E[S(X)\\Psi_n(x - X, c(n))] = P(\\text{Class}_1)P(x|\\text{Class}_1) - P(\\text{Class}_2)P(x|\\text{Class}_2).$$ \n\nWassel and Sklansky [1972] have extended the stochastic approximation method of Robbins \nand Monro [1951] to find the zero of a function that is the limit of a sequence of \nregression functions and show rigorously that for the above case (where the distribution \nof Class 1 is to the left of that of Class 2 and there is only one crossing point) the stochastic \napproximation procedure \n\n$$b(n+1) = b(n) + a(n) Z_n(X_n, \\text{Class}(n), b(n), c(n)) \\qquad (3)$$ \n\nusing \n\n$$Z_n = \\begin{cases} 2c(n)\\Psi(X_n - b(n), c(n)) & \\text{for } X_n \\in \\text{Class}_1 \\\\ -2c(n)\\Psi(X_n - b(n), c(n)) & \\text{for } X_n \\in \\text{Class}_2 \\end{cases}$$ \n\nconverges to the Bayes optimal border with probability one, where $\\Psi(x - b, c)$ is a Parzen \nwindow function. The following standard conditions for stochastic approximation convergence \nare needed in their proof \n\n$$a(n), c(n) > 0, \\qquad \\lim_{n \\to \\infty} c(n) = 0, \\qquad \\lim_{n \\to \\infty} a(n) = 0, \\qquad \\sum_n a(n)c(n) = \\infty,$$ \n\nas well as a condition that for rectangular Parzen functions reduces to a requirement that \n$P(\\text{Class}_1)P(x|\\text{Class}_1) - P(\\text{Class}_2)P(x|\\text{Class}_2)$ be strictly positive to the left of $b^*$ and strictly \nnegative to the right of $b^*$ (for full details of the proof and conditions see [Wassel and \nSklansky, 1972]). \n\nThe above argument has only addressed the motion of the border. But $b$ is defined as \n$b = (w_1 + w_2)/2$, thus we can move the codebook vectors according to \n\n$$\\partial E/\\partial w_1 = \\partial E/\\partial w_2 = 0.5 \\, \\partial E/\\partial b.$$ \n\nWe could now write Equation 3 as \n\n$$w_j(n+1) = w_j(n) + a_2(n) \\frac{X_n - w_j(n-1)}{|X_n - w_j(n-1)|}$$ \n\nif $X_n$ lies in the window of width $2c(n)$ centred at $b(n)$, otherwise \n\n$$w_i(n+1) = w_i(n),$$ \n\nwhere we have used rectangular Parzen window functions and $X_n$ is from Class $j$. 
This \nholds if Class 1 is to the right or left of Class 2 as long as $w_1$ and $w_2$ are relatively ordered \nappropriately. \n\nExpanding the problem to more dimensions, and more classes with more codebook vectors \nper class, complicates the analysis as a change in two codebook vectors to better adjust \ntheir border affects more than just the border between the two codebook vectors. However, \nignoring these effects for a first order approximation suggests the following update \nprocedure: \n\n$$w_i^*(n) = w_i^*(n-1) + a(n) \\frac{X_n - w_i^*(n-1)}{\\|X_n - w_i^*(n-1)\\|}$$ \n\n$$w_j^*(n) = w_j^*(n-1) - a(n) \\frac{X_n - w_j^*(n-1)}{\\|X_n - w_j^*(n-1)\\|}$$ \n\nwhere $a(n)$ obeys the constraints above, $X_n$ is from Class $i$, $w_i^*$ and $w_j^*$ are the two nearest \ncodebook vectors, one each from class $i$ and $j$ ($j \\ne i$), and $X_n$ lies within $c(n)$ of the border \nbetween them. (No changes are made unless all the above conditions hold.) As above, \nthis algorithm assumes that the initial positions of the codebook vectors are such that they \nwill not have to cross during the algorithm. \n\nThe above algorithm is similar to Kohonen's LVQ2.1 algorithm (which is performed after \nappropriate initialization of the codebook vectors) except for the normalization of the step \nsize, the decreasing size of the window width $c(n)$ and constraints on the learning rate $a$. \n\n4 Simulations \n\nMotivated by the theory above, we decided to modify Kohonen's LVQ2.1 algorithm to \nadd normalization of the step size and a decreasing window. In order to allow closer \ncomparison with LVQ2.1, all other parts of the algorithm were kept the same. Thus the learning rate $a$ \ndecreased linearly. We used a linear decrease on the window size and defined it as in \nLVQ2.1 for easier parameter matching. 
For a window size of $w$, all input vectors satisfying \n$d_i/d_j > \\frac{1-w}{1+w}$, where $d_i$ is the distance to the closest codebook vector and $d_j$ is the distance \nto the next closest codebook vector, fall into the window between those two vectors (note, \nhowever, that updates only occur if the two closest codebook vectors belong to different \nclasses). \n\nThe data used is a version of the Peterson and Barney vowel formant data (footnote 2). The dataset \nconsists of the first and second formants for ten vowels in an /hVd/ context from 75 speakers \n(32 males, 28 females, 15 children) who repeated each vowel twice (footnote 3). As we were not \ntesting generalization, the training set was used as the test set. \n\n[Figure 2 plot: accuracy (percent correct, roughly 60 to 75) versus window size (0 to 0.8), with one curve for each of alpha = 0.002, 0.030, 0.080, 0.150 and 0.500.] \n\nFigure 2: The effect of different window sizes on the accuracy for different values of initial \n$a$. \n\nWe ran three sets of experiments varying the number of codebook vectors and the number \nof pattern presentations. For the first set of experiments there were 20 codebook vectors \nand the algorithms ran for 40000 steps. Figure 2 shows the effect of varying the window \nsize for different initial learning rates $a(1)$ in the LVQ2.1 algorithm. The values plotted are \naveraged over three runs (the order of presentation of patterns is different for the different \nruns). 
The sensitivity of the algorithm to the window size as mentioned in [Kohonen, 1990] \nis evident. In general we found that as the learning rate is increased the peak accuracy is \nimproved at the expense of the accuracy for other window widths. After a certain value \nthe accuracy declines for further increases in learning rate. \n\nFootnote 2: Obtained from Steven Nowlan. \n\nFootnote 3: 3 speakers were missing one vowel and the raw data was linearly transformed to have zero mean \nand fall within the range [-3,3] in both components. \n\n[Figure 3 plot: accuracy (percent correct, roughly 65 to 85) versus window size (0 to 0.5), with curves for the original (orig) and modified (mod) algorithms under each of the three conditions.] \n\nFigure 3: The performance of LVQ2.1 with and without the modifications (normalized step \nsize and decreasing window) for 3 different conditions. The legend gives, in order, [the algorithm \ntype / the number of codebook vectors / the number of pattern presentations]. \n\nFigure 3 shows the improvement achieved with normalization and a linearly decreasing \nwindow size for three sets of experiments: (20 codebook vectors/40000 pattern \npresentations), (20 codebook vectors/4000 pattern presentations) and (100 codebook \nvectors/40000 pattern presentations). 
For the decreasing window algorithm, the x-axis represents \nthe window size in the middle of the run. As above, the values plotted were averaged \nover three runs. The values of $a(1)$ were the same within each algorithm over all three \nconditions. A graph using the best $a$ found for each condition separately is almost identical. \nThe graph shows that the modifications provide a modest but consistent improvement \nin accuracy across the conditions. \n\nIn summary, the preliminary experiments indicate that a decreasing window and normalized \nstep size can be worthwhile additions to the LVQ2.1 algorithm and further experiments on \nthe generalization properties of the algorithm and with other data sets may be warranted. \nFor these tests we used a linear decrease of the window size and learning rate to allow for \neasier comparison with the LVQ2.1 algorithm. Further modifications on the algorithm that \nexperiment with different functions (that obey the theoretical constraints) for the learning \nrate and window size decrease may result in even better performance. \n\n5 Summary \n\nWe have shown that Kohonen's LVQ2.1 algorithm can be considered as a variant on a \ngeneralization of an algorithm which is optimal for a 1-dimensional/2 codebook vector \nproblem. We added a decreasing window and normalized step size, suggested by the \none-dimensional algorithm, to the LVQ2.1 algorithm and found a small but consistent \nimprovement in accuracy. \n\nAcknowledgements \n\nWe would like to thank Steven Nowlan for his many helpful suggestions on an earlier draft \nand for making the vowel formant data available to us. We are also grateful to Leonidas \nKontothanassis for his help in coding and discussion. 
This work was supported by a grant \nfrom the Human Frontier Science Program and a Canadian NSERC 1967 Science and \nEngineering Scholarship to the first author, who also received a NIPS travel grant to attend \nthe conference. \n\nReferences \n\n[Grossberg, 1976] Stephen Grossberg, \"Adaptive Pattern Classification and Universal Recoding: \nI. Parallel Development and Coding of Neural Feature Detectors,\" Biological \nCybernetics, 23:121-134, 1976. \n\n[Kohonen, 1982] Teuvo Kohonen, \"Self-Organized Formation of Topologically Correct \nFeature Maps,\" Biological Cybernetics, 43:59-69, 1982. \n\n[Kohonen, 1986] Teuvo Kohonen, \"Learning Vector Quantization for Pattern Recognition,\" \nTechnical Report TKK-F-A601, Helsinki University of Technology, Department \nof Technical Physics, Laboratory of Computer and Information Science, November \n1986. \n\n[Kohonen, 1990] Teuvo Kohonen, \"Statistical Pattern Recognition Revisited,\" In R. Eckmiller, \neditor, Advanced Neural Computers, pages 137-144. Elsevier Science Publishers, \n1990. \n\n[Kohonen, 1991] Teuvo Kohonen, \"Self-Organizing Maps: Optimization Approaches,\" In \nT. Kohonen, K. Makisara, O. Simula, and J. Kangas, editors, Artificial Neural Networks, \npages 981-990. Elsevier Science Publishers, 1991. \n\n[Robbins and Monro, 1951] Herbert Robbins and Sutton Monro, \"A Stochastic Approximation \nMethod,\" Annals of Math. Stat., 22:400-407, 1951. \n\n[Rumelhart and Zipser, 1986] D. E. Rumelhart and D. Zipser, \"Feature Discovery by \nCompetitive Learning,\" In David E. Rumelhart, James L. McClelland, and the PDP Research \nGroup, editors, Parallel Distributed Processing: Explorations in the Microstructure \nof Cognition, volume 2, pages 151-193. MIT Press, 1986. \n\n[Sklansky and Wassel, 1981] Jack Sklansky and Gustav N. Wassel, Pattern Classifiers \nand Trainable Machines, Springer-Verlag, 1981. \n\n[Wassel and Sklansky, 1972] Gustav N. 
Wassel and Jack Sklansky, \"Training a One-Dimensional \nClassifier to Minimize the Probability of Error,\" IEEE Transactions on \nSystems, Man, and Cybernetics, SMC-2(4):533-541, 1972. \n\n", "award": [], "sourceid": 663, "authors": [{"given_name": "Virginia", "family_name": "de", "institution": null}, {"given_name": "Dana", "family_name": "Ballard", "institution": null}]}