{"title": "S-Map: A Network with a Simple Self-Organization Algorithm for Generative Topographic Mappings", "book": "Advances in Neural Information Processing Systems", "page_first": 549, "page_last": 555, "abstract": null, "full_text": "S-Map: A network with a simple \n\nself-organization algorithm for generative \n\ntopographic mappings \n\nKimmo Kiviluoto \n\nLaboratory of Computer and \n\nInformation Science \n\nErkki Oja \n\nLaboratory of Computer and \n\nInformation Science \n\nHelsinki University of Technology \n\nHelsinki University of Technology \n\nP.O. Box 2200 \n\nP.O. Box 2200 \n\nFIN-02015 HUT, Espoo, Finland \n\nKimmo.KiviluotoChut.fi \n\nFIN-02015 HUT, Espoo, Finland \n\nErkki.OjaChut.fi \n\nAbstract \n\nThe S-Map is a network with a simple learning algorithm that com(cid:173)\nbines the self-organization capability of the Self-Organizing Map \n(SOM) and the probabilistic interpretability of the Generative To(cid:173)\npographic Mapping (GTM). The simulations suggest that the S(cid:173)\nMap algorithm has a stronger tendency to self-organize from ran(cid:173)\ndom initial configuration than the GTM. The S-Map algorithm \ncan be further simplified to employ pure Hebbian learning, with(cid:173)\nout changing the qualitative behaviour of the network. \n\n1 \n\nIntroduction \n\nThe self-organizing map (SOM; for a review, see [1]) forms a topographic mapping \nfrom the data space onto a (usually two-dimensional) output space. The SOM has \nbeen succesfully used in a large number of applications [2]; nevertheless, there are \nsome open theoretical questions, as discussed in [1, 3]. Most of these questions \narise because of the following two facts: the SOM is not a generative model, i.e. it \ndoes not generate a density in the data space, and it does not have a well-defined \nobjective function that the training process would strictly minimize. \nBishop et al. [3] introduced the generative topographic mapping (GTM) as a solution \nto these problems. 
However, it seems that the GTM requires a careful initialization to self-organize. Although this can be done in many practical applications, from a theoretical point of view the GTM does not yet offer a fully satisfactory model for natural or artificial self-organizing systems. \n\nIn this paper, we first briefly review the SOM and GTM algorithms (section 2); then we introduce the S-Map, which may be regarded as a crossbreed of SOM and GTM (section 3); finally, we present some simulation results with the three algorithms (section 4), showing that the S-Map manages to combine the computational simplicity and the ability to self-organize of the SOM with the probabilistic framework of the GTM. \n\n2 SOM and GTM \n\n2.1 The SOM algorithm \n\nThe self-organizing map associates each data vector \\xi^t with the map unit whose weight vector is closest to the data vector. The activations \\eta_i^t of the map units are given by \n\n\\eta_i^t = 1 when \\|\\mu_i - \\xi^t\\| < \\|\\mu_j - \\xi^t\\| \\forall j \\neq i, and \\eta_i^t = 0 otherwise \\quad (1) \n\nwhere \\mu_i is the weight vector of the ith map unit \\zeta_i, i = 1, ..., K. Using these activations, the SOM weight vector update rule can be written as \n\n\\mu_j := \\mu_j + \\delta^t \\sum_{i=1}^{K} \\eta_i^t h(\\zeta_i, \\zeta_j; \\beta^t) (\\xi^t - \\mu_j) \\quad (2) \n\nHere \\delta^t is a learning-rate parameter that decreases with time. The neighborhood function h(\\zeta_i, \\zeta_j; \\beta^t) is a decreasing function of the distance between map units \\zeta_i and \\zeta_j; \\beta^t is a width parameter that makes the neighborhood function get narrower as learning proceeds. One popular choice for the neighborhood function is a Gaussian with inverse variance \\beta^t. \n\n2.2 The GTM algorithm \n\nIn the GTM algorithm, the map is considered as a latent space, from which a nonlinear mapping to the data space is first defined. 
Specifically, a point \\zeta in the latent space is mapped to the point v in the data space according to the formula \n\nv(\\zeta; M) = M\\phi(\\zeta) = \\sum_{j=1}^{L} \\phi_j(\\zeta) \\mu_j \\quad (3) \n\nwhere \\phi is a vector consisting of L Gaussian basis functions, and M is a D x L matrix that has the vectors \\mu_j as its columns, D being the dimension of the data space. \n\nThe probability density p(\\zeta) in the latent space generates a density on the manifold that lies in the data space and is defined by (3). If the latent space is of lower dimension than the data space, the manifold would be singular, so a Gaussian noise model is added. A single point in the latent space thus generates the following density in the data space: \n\np(\\xi|\\zeta; M, \\beta) = (\\beta/2\\pi)^{D/2} \\exp[-(\\beta/2) \\|v(\\zeta; M) - \\xi\\|^2] \\quad (4) \n\nwhere \\beta is the inverse of the variance of the noise. \n\nThe key point of the GTM is to approximate the density in the data space by assuming the latent space prior p(\\zeta) to consist of equiprobable delta functions that form a regular lattice in the latent space. The centers \\zeta_i of the delta functions are called the latent vectors of the GTM, and they are the GTM equivalent of the SOM map units. The approximation of the density generated in the data space is thus given by \n\np(\\xi|M, \\beta) = (1/K) \\sum_{i=1}^{K} p(\\xi|\\zeta_i; M, \\beta) \\quad (5) \n\nThe parameters of the GTM are determined by minimizing the negative log likelihood error \n\n\\ell(M, \\beta) = -\\sum_{t=1}^{T} \\ln[(1/K) \\sum_{i=1}^{K} p(\\xi^t|\\zeta_i; M, \\beta)] \\quad (6) \n\nover the set of sample vectors \\{\\xi^t\\}. The batch version of the GTM uses the EM algorithm [4]; for details, see [3]. One may also resort to an on-line gradient descent procedure that yields the GTM update steps \n\n\\mu_j^{t+1} := \\mu_j^t + \\delta^t \\beta^t \\sum_{i=1}^{K} \\eta_i(M^t, \\beta^t) \\phi_j(\\zeta_i) [\\xi^t - v(\\zeta_i; M^t)] \\quad (7) \n\n\\beta^{t+1} := \\beta^t + \\delta^t \\sum_{i=1}^{K} \\eta_i(M^t, \\beta^t) [D/(2\\beta^t) - \\|v(\\zeta_i; M^t) - \\xi^t\\|^2/2] \\quad (8) \n\nwhere the activation \\eta_i(M, \\beta) of the ith unit is the posterior probability that the data vector \\xi^t was generated by that unit, \n\n\\eta_i(M, \\beta) = \\exp[-(\\beta/2) \\|v(\\zeta_i; M) - \\xi^t\\|^2] / \\sum_{l=1}^{K} \\exp[-(\\beta/2) \\|v(\\zeta_l; M) - \\xi^t\\|^2] \\quad (9) \n\nConsider now a GTM with an equal number of latent vectors and basis functions. (Note that this choice serves the purpose of illustration only; to use the GTM properly, one should choose many more latent vectors than basis functions.) In the zero-noise limit \\beta \\to \\infty, only the winner unit \\zeta(t), found with the winner search rule \n\n\\zeta(t) = \\arg\\min_{\\zeta_i} \\|v(\\zeta_i; M^t) - \\xi^t\\| \\quad (10) \n\nhas a nonzero activation. The GTM weight update step (7) then becomes \n\n\\mu_j^{t+1} := \\mu_j^t + \\delta^t \\phi_j(\\zeta(t)) [\\xi^t - v(\\zeta(t); M^t)] \\quad (11) \n\nThis resembles the variant of the SOM in which the winner is searched with the rule (10) and the weights are updated as \n\n\\mu_j^{t+1} := \\mu_j^t + \\delta^t \\phi_j(\\zeta(t)) (\\xi^t - \\mu_j^t) \\quad (12) \n\nUnlike the original SOM rules (1) and (2), the modified SOM with rules (10) and (12) does minimize a well-defined objective function: the SOM distortion measure [5, 6, 7, 1]. However, there is a difference between the GTM and SOM learning rules (11) and (12). With the SOM, each individual weight vector moves towards the data vector, but with the GTM, the image of the winner latent vector v(\\zeta(t); M) moves towards the data vector, and all weight vectors \\mu_j move in the same direction. For nonzero noise, when 0 < \\beta < \\infty, there is a further difference between GTM and SOM: with the GTM, not only the winner unit but also the activations of the other units contribute to the weight update. \n\n3 S-Map \n\nCombining the softmax activations of the GTM and the learning rule of the SOM, we arrive at a new algorithm: the S-Map. \n\n3.1 The S-Map algorithm \n\nThe S-Map resembles a GTM with an equal number of latent vectors and basis functions. The position of the ith unit on the map is given by the latent vector \\zeta_i; the connection strength of the unit to another unit j is \\phi_j(\\zeta_i), and a weight vector \\mu_i is associated with the unit. The activation of the unit is obtained using rule (9). The S-Map weights learn proportionally to the activation of the unit that the weight is associated with, and the activations of the neighboring units: \n\n\\mu_j^{t+1} := \\mu_j^t + \\delta^t (\\sum_{i=1}^{K} \\phi_j(\\zeta_i) \\eta_i^t) (\\xi^t - \\mu_j^t) \\quad (13) \n\nwhich can be further simplified to a fully Hebbian rule, updating each weight proportionally to the activation of the corresponding unit only, so that \n\n\\mu_j^{t+1} := \\mu_j^t + \\delta^t \\eta_j^t (\\xi^t - \\mu_j^t) \\quad (14) \n\nThe value of the parameter \\beta may be adjusted in the following way: start with a small value, slowly increase it so that the map unfolds and spreads out, and then keep increasing the value as long as the error (6) decreases. The parameter adjustment scheme could also be connected with the topographic error of the mapping, as proposed in [9] for the SOM. \n\nAssuming normalized input and weight vectors, the \"dot-product metric\" form of the learning rules (13) and (14) may be written as \n\n\\mu_j^{t+1} := \\mu_j^t + \\delta^t (\\sum_{i=1}^{K} \\phi_j(\\zeta_i) \\eta_i^t) (I - \\mu_j^t \\mu_j^{tT}) \\xi^t \\quad (15) \n\nand \n\n\\mu_j^{t+1} := \\mu_j^t + \\delta^t \\eta_j^t (I - \\mu_j^t \\mu_j^{tT}) \\xi^t \\quad (16) \n\nrespectively; the matrix in the second parenthesis keeps the weight vectors normalized to unit length, assuming a small value for the learning rate parameter \\delta^t [8]. The dot-product metric form of a unit activity is \n\n\\eta_i^t = \\exp[\\beta (\\sum_{j=1}^{K} \\phi_j(\\zeta_i) \\mu_j)^T \\xi^t] / \\sum_{l=1}^{K} \\exp[\\beta (\\sum_{j=1}^{K} \\phi_j(\\zeta_l) \\mu_j)^T \\xi^t] \\quad (17) \n\nwhich approximates the posterior probability p(\\zeta_i|\\xi^t; M, \\beta) that the data vector was generated by that specific unit. This is based on the observation that if the data vectors \\{\\xi^t\\} are normalized to unit length, the density generated in the data space (the unit sphere in R^D) becomes \n\np(\\xi|\\zeta_i; M, \\beta) = (normalizing constant)^{-1} \\exp[\\beta (\\sum_{j=1}^{K} \\phi_j(\\zeta_i) \\mu_j)^T \\xi] \\quad (18) \n\n3.2 The S-Map algorithm minimizes the GTM error function in dot-product metric \n\nThe GTM error function is the negative log likelihood, which is given by (6) and is reproduced here: \n\n\\ell(M, \\beta) = -\\sum_{t=1}^{T} \\ln[(1/K) \\sum_{i=1}^{K} p(\\xi^t|\\zeta_i; M, \\beta)] \\quad (19) \n\nWhen the weights are updated using a batch version of (15), accumulating the updates for one epoch, the expected value of the error [4] for the unit \\zeta_i is \n\nE(\\ell_i^{new}) = -\\sum_{t=1}^{T} p^{old}(i|\\xi^t; M, \\beta) \\ln[p^{new}(\\zeta_i) p^{new}(\\xi^t|\\zeta_i; M, \\beta)] = -\\sum_{t=1}^{T} \\eta_i^{old,t} \\beta (\\sum_{j=1}^{K} \\phi_j(\\zeta_i) \\mu_j^{new})^T \\xi^t + terms not involving the weight vectors \\quad (20) \n\nwhere p^{old}(i|\\xi^t; M, \\beta) = \\eta_i^{old,t} and p^{new}(\\zeta_i) = 1/K. The change of the error for the whole map after one epoch is thus \n\nE(\\ell^{new} - \\ell^{old}) = -\\sum_{i=1}^{K} \\sum_{t=1}^{T} \\sum_{j=1}^{K} \\eta_i^{old,t} \\beta \\phi_j(\\zeta_i) (\\mu_j^{new} - \\mu_j^{old})^T \\xi^t = -\\beta\\delta \\sum_{j=1}^{K} u_j^T (I - \\mu_j^{old} \\mu_j^{old\\,T}) u_j = -\\beta\\delta \\sum_{j=1}^{K} [u_j^T u_j - (u_j^T \\mu_j^{old})^2] \\leq 0 \\quad (21) \n\nwhere u_j = \\sum_{t=1}^{T} \\sum_{i=1}^{K} \\eta_i^{old,t} \\phi_j(\\zeta_i) \\xi^t, with equality only when the weights are already in the error minimum. \n\n4 Experimental results \n\nThe self-organization ability of the SOM, the GTM, and the S-Map was tested on an artificial data set: 500 points from a uniform random distribution in the unit square. The initial weight vectors for all models were set to random values, and the final configuration of the map was plotted on top of the data (figure 1). For all the algorithms, the batch version was used. The SOM was trained as recommended in [1], in two phases: the first starting with a wide neighborhood function, the second with a narrow one. The GTM was trained using the Matlab implementation by Svensen, following the recommendations given in [10]. The S-Map was trained in two ways: using the \"full\" rule (13), and the simplified rule (14). 
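As a rough illustration, the two S-Map training rules can be sketched in Python. This is a toy re-implementation, not the authors' code: the map size, learning rate, \beta and neighborhood-width schedules, and the row normalization of the Gaussian couplings are all our own illustrative choices, and the activations follow the Euclidean-metric softmax of rule (9) with v(\zeta_i) = \sum_j \phi_j(\zeta_i) \mu_j.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.random((500, 2))          # 500 points uniform in the unit square, as in section 4

side = 5                          # hypothetical 5x5 map, so K = L = 25 units
K = side * side
zeta = np.array([(i, j) for i in range(side) for j in range(side)], dtype=float)
mu0 = rng.random((K, 2))          # random initial weight vectors

def phi_matrix(width):
    """Gaussian couplings phi_j(zeta_i) between units; rows are normalized so
    that v(zeta_i) = sum_j phi_j(zeta_i) mu_j stays inside the data range
    (a convenience choice for this sketch, not prescribed by the paper)."""
    d2 = ((zeta[:, None, :] - zeta[None, :, :]) ** 2).sum(-1)
    p = np.exp(-d2 / (2.0 * width ** 2))
    return p / p.sum(axis=1, keepdims=True)

def activations(x, mu, phi, beta):
    """Softmax activations eta_i, cf. rule (9)."""
    v = phi @ mu                                   # images of the latent vectors
    d2 = ((v - x) ** 2).sum(-1)
    e = np.exp(-0.5 * beta * (d2 - d2.min()))      # shift exponent for stability
    return e / e.sum()

def train(mu, full_rule=True, epochs=30, delta=0.1, beta0=2.0, width0=1.5):
    mu, beta, width = mu.copy(), beta0, width0
    for _ in range(epochs):
        phi = phi_matrix(width)
        for x in rng.permutation(X):
            eta = activations(x, mu, phi, beta)
            # rule (13): sum_i phi_j(zeta_i) eta_i;  rule (14): eta_j alone
            g = phi.T @ eta if full_rule else eta
            mu += delta * g[:, None] * (x - mu)
        beta *= 1.1                                # slowly increase beta each epoch
        width = max(0.5, width * 0.93)             # let the neighborhood shrink
    return mu

mu_full = train(mu0, full_rule=True)               # "full" rule (13)
mu_hebb = train(mu0, full_rule=False)              # simplified Hebbian rule (14)
```

Since each update is a convex step from \mu_j towards the data point, the weights stay inside the unit square; increasing \beta while the error decreases mirrors the adjustment scheme described above.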
In both cases, the value of the parameter \\beta was slowly increased every epoch; by monitoring the error (6) of the S-Map (see the error plot in the figure), a suitable value for \\beta can be found. In the GTM simulations, we experimented with many different choices for the width and the number of the basis functions, both with normalized and unnormalized basis functions. It turned out that the GTM is somewhat sensitive to these choices: it had difficulty unfolding after a random initialization, unless the basis functions were set so wide (with respect to the weight matrix prior) that the map was well-organized already in its initial configuration. On the other hand, using very wide basis functions with the GTM resulted in a map that was too rigid to adapt well to the data. We also tried to update the parameter \\beta according to an annealing schedule, as with the S-Map, but this did not seem to solve the problem. \n\nFigure 1: Random initialization (top left), SOM (top middle), GTM (top right), \"full\" S-Map (bottom left), simplified S-Map (bottom middle). On the bottom right, the S-Map error as a function of epochs is displayed; the parameter \\beta was slightly increased every epoch, which causes the error to increase in the early (unfolding) phase of the learning, as the weight update only minimizes the error for a given \\beta. \n\n5 Conclusions \n\nThe S-Map and the SOM seem to have a stronger tendency to self-organize from random initialization than the GTM. 
In data analysis applications, when the GTM can be properly initialized, the SOM, the S-Map, and the GTM yield comparable results; those obtained using the latter two algorithms are also straightforward to interpret in probabilistic terms. In Euclidean metric, the GTM has the additional advantage of guaranteed convergence to some error minimum; the convergence of the S-Map in Euclidean metric is still an open question. On the other hand, the batch GTM is computationally clearly heavier per epoch than the S-Map, while the S-Map is somewhat heavier than the SOM. \n\nThe SOM has an impressive record of proven applications in a variety of different tasks, and much more experimenting is needed for any alternative method to reach the same level of practicality. The SOM is also the basic bottom-up procedure of self-organization in the sense that it starts from a minimum of functional principles realizable in parallel neural networks. This makes it hard to analyze, however. A probabilistic approach like the GTM stems from the opposite point of view by emphasizing the statistical model, but as a trade-off, the resulting algorithm may not share all the desirable properties of the SOM. Our new approach, the S-Map, seems to have succeeded in inheriting the strong self-organization capability of the SOM, while offering a sound probabilistic interpretation like the GTM. \n\nReferences \n\n[1] T. Kohonen, Self-Organizing Maps. Springer Series in Information Sciences 30, Berlin Heidelberg New York: Springer, 1995. \n[2] T. Kohonen, E. Oja, O. Simula, A. Visa, and J. Kangas, \"Engineering applications of the self-organizing map,\" Proceedings of the IEEE, vol. 84, pp. 1358-1384, Oct. 1996. \n[3] C. M. Bishop, M. Svensen, and C. K. I. Williams, \"GTM: A principled alternative to the self-organizing map,\" in Advances in Neural Information Processing Systems (M. C. Mozer, M. I. Jordan, and T. Petsche, eds.), vol. 9, MIT Press, 1997. \n[4] A. P. Dempster, N. M. Laird, and D. B. Rubin, \"Maximum likelihood from incomplete data via the EM algorithm,\" Journal of the Royal Statistical Society, vol. B 39, no. 1, pp. 1-38, 1977. \n[5] S. P. Luttrell, \"Code vector density in topographic mappings,\" Memorandum 4669, Defence Research Agency, Malvern, UK, 1992. \n[6] T. M. Heskes and B. Kappen, \"Error potentials for self-organization,\" in Proceedings of the International Conference on Neural Networks (ICNN'93), vol. 3, (Piscataway, New Jersey, USA), pp. 1219-1223, IEEE Neural Networks Council, Apr. 1993. \n[7] S. P. Luttrell, \"A Bayesian analysis of self-organising maps,\" Neural Computation, vol. 6, pp. 767-794, 1994. \n[8] E. Oja, \"A simplified neuron model as a principal component analyzer,\" Journal of Mathematical Biology, vol. 15, pp. 267-273, 1982. \n[9] K. Kiviluoto, \"Topology preservation in self-organizing maps,\" in Proceedings of the International Conference on Neural Networks (ICNN'96), vol. 1, (Piscataway, New Jersey, USA), pp. 294-299, IEEE Neural Networks Council, June 1996. \n[10] M. Svensen, The GTM toolbox - user's guide. Neural Computing Research Group, Aston University, Birmingham, UK, 1.0 ed., Oct. 1996. Available at URL http://neural-server.aston.ac.uk/GTM/MATLAB_Impl.html. \n", "award": [], "sourceid": 1417, "authors": [{"given_name": "Kimmo", "family_name": "Kiviluoto", "institution": null}, {"given_name": "Erkki", "family_name": "Oja", "institution": null}]}