{"title": "NeuroScale: Novel Topographic Feature Extraction using RBF Networks", "book": "Advances in Neural Information Processing Systems", "page_first": 543, "page_last": 549, "abstract": null, "full_text": "NeuroScale: Novel Topographic Feature Extraction using RBF Networks \n\nDavid Lowe \nD.Lowe@aston.ac.uk \n\nMichael E. Tipping \nM.E.Tipping@aston.ac.uk \n\nNeural Computing Research Group \nAston University, Aston Triangle, Birmingham B4 7ET, UK \nhttp://www.ncrg.aston.ac.uk/ \n\nAbstract \n\nDimension-reducing feature extraction neural network techniques which also preserve neighbourhood relationships in data have traditionally been the exclusive domain of Kohonen self-organising maps. Recently, we introduced a novel dimension-reducing feature extraction process, which is also topographic, based upon a Radial Basis Function architecture. It has been observed that the generalisation performance of the system is broadly insensitive to model order complexity and other smoothing factors such as the kernel widths, contrary to intuition derived from supervised neural network models. In this paper we provide an effective demonstration of this property and give a theoretical justification for the apparent 'self-regularising' behaviour of the 'NEUROSCALE' architecture. \n\n1 'NeuroScale': A Feed-forward Neural Network Topographic Transformation \n\nRecently an important class of topographic neural network based feature extraction approaches, which can be related to the traditional statistical methods of Sammon Mappings (Sammon, 1969) and Multidimensional Scaling (Kruskal, 1964), has been introduced (Mao and Jain, 1995; Lowe, 1993; Webb, 1995; Lowe and Tipping, 1996). These novel alternatives to Kohonen-like approaches for topographic feature extraction possess several interesting properties. 
For instance, the NEUROSCALE architecture has the empirically observed property that the generalisation performance does not seem to depend critically on model order complexity, contrary to intuition based upon knowledge of its supervised counterparts. This paper presents evidence for this 'self-regularising' behaviour and provides an explanation in terms of the curvature of the trained models. \n\nWe now provide a brief introduction to the NEUROSCALE philosophy of nonlinear topographic feature extraction. Further details may be found in (Lowe, 1993; Lowe and Tipping, 1996). We seek a dimension-reducing, topographic transformation of data for the purposes of visualisation and analysis. By 'topographic', we imply that the geometric structure of the data be optimally preserved in the transformation, and the embodiment of this constraint is that the inter-point distances in the feature space should correspond as closely as possible to those distances in the data space. \n\nThe implementation of this principle by a neural network is very simple. A Radial Basis Function (RBF) neural network is utilised to predict the coordinates of the data point in the transformed feature space. The locations of the feature points are indirectly determined by adjusting the weights of the network. The transformation is determined by optimising the network parameters in order to minimise a suitable error measure that embodies the topographic principle. \n\nThe specific details of this alternative approach are as follows. 
Given an m-dimensional input space of N data points x_q, an n-dimensional feature space of points y_q is generated such that the relative positions of the feature space points minimise the error, or 'STRESS', term: \n\nE = \sum_{p=1}^{N} \sum_{q>p} (d^*_{qp} - d_{qp})^2,   (1) \n\nwhere the d^*_{qp} are the inter-point Euclidean distances in the data space: d^*_{qp} = \sqrt{(x_q - x_p)^T (x_q - x_p)}, and the d_{qp} are the corresponding distances in the feature space: d_{qp} = \sqrt{(y_q - y_p)^T (y_q - y_p)}. \n\nThe points y are generated by the RBF, given the data points as input. That is, y_q = f(x_q; W), where f is the nonlinear transformation effected by the RBF with parameters (weights and any kernel smoothing factors) W. The distances in the feature space may thus be given by d_{qp} = ||f(x_q) - f(x_p)||, and so more explicitly by \n\nd_{qp} = \sqrt{ \sum_{l=1}^{n} \Big( \sum_{k} w_{lk} [\phi_k(x_q) - \phi_k(x_p)] \Big)^2 },   (2) \n\nwhere the \phi_k(\cdot) are the basis functions, \mu_k are the centres of those functions, which are fixed, and the w_{lk} are the weights from the basis functions to the outputs. \n\nThe topographic nature of the transformation is imposed by the STRESS term, which attempts to match the inter-point Euclidean distances in the feature space with those in the input space. This mapping is only relatively supervised, because there is no specific target for each y_q; only a relative measure of target separation between each y_q, y_p pair is provided. In this form it does not take account of any additional information (for example, class labels) that might be associated with the data points, but is determined strictly by their spatial distribution. However, the approach may be extended to incorporate extra 'subjective' information which may be used to influence the transformation and permits the extraction of 'enhanced', more informative, feature spaces (Lowe and Tipping, 1996). 
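As a concrete illustration of equations (1) and (2), the following minimal NumPy sketch computes the STRESS for an RBF mapping and reduces it by simple gradient descent on the output-layer weights. This is our illustration only, not the authors' implementation: the data, network sizes, basis width and step size are all arbitrary assumptions.

```python
import numpy as np

def pairwise_dist(Z):
    # Matrix of Euclidean distances between all rows of Z.
    diff = Z[:, None, :] - Z[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))

def stress(X, Y):
    # STRESS of equation (1): sum over pairs q > p of (d*_qp - d_qp)^2.
    Dx, Dy = pairwise_dist(X), pairwise_dist(Y)
    iu = np.triu_indices(len(X), k=1)
    return float(((Dx[iu] - Dy[iu]) ** 2).sum())

def rbf_design(X, centres, width):
    # Gaussian basis activations phi_k(x) = exp(-||x - mu_k||^2 / (2 width^2)).
    d2 = ((X[:, None, :] - centres[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-d2 / (2.0 * width ** 2))

# Toy problem: 30 points in 4-D mapped to 2-D; centres fixed on a data subset.
rng = np.random.default_rng(0)
X = rng.normal(size=(30, 4))
Phi = rbf_design(X, X[:10], width=2.0)   # fixed centres mu_k
W = rng.normal(size=(10, 2))             # output-layer weights w_lk
Dx = pairwise_dist(X)

e0 = stress(X, Phi @ W)
for _ in range(500):
    Y = Phi @ W
    Dy = pairwise_dist(Y)
    np.fill_diagonal(Dy, 1.0)            # dummy value; diagonal is masked below
    R = (Dx - Dy) / Dy
    np.fill_diagonal(R, 0.0)
    # dE/dy_q = -2 sum_p R_qp (y_q - y_p), accumulated over all ordered pairs
    grad_Y = -2.0 * (R.sum(axis=1, keepdims=True) * Y - R @ Y)
    W -= 1e-4 * Phi.T @ grad_Y           # chain rule through Y = Phi W
```

Summing over ordered rather than unordered pairs simply doubles the gradient, which is absorbed into the step size; a practical implementation would use a more sophisticated optimiser, since, as noted below, the objective is not quadratic in the weights.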
\n\nCombining equations (1) and (2) and differentiating with respect to the weights in the network allows the partial derivatives of the STRESS, \partial E/\partial w_{lk}, to be derived for each pattern pair. These may be accumulated over the entire pattern set and the weights adjusted by an iterative procedure to minimise the STRESS term E. Note that the objective function for the RBF is no longer quadratic, and so a standard analytic matrix-inversion method for fixing the final-layer weights cannot be employed. \n\nWe refer to this overall procedure as 'NEUROSCALE'. Although any universal approximator may be exploited within NEUROSCALE, using a Radial Basis Function network allows more theoretical analysis of the resulting behaviour, despite the fact that we have lost the usual linearity advantages of the RBF because of the STRESS measure. A schematic of the NEUROSCALE model is given in Figure 1, which illustrates the role of the RBF in transforming the data space to the feature space. \n\n[Figure: an RBF network maps the data space to the feature space, with an error measure comparing inter-point distances in the two spaces.] \n\nFigure 1: The NEUROSCALE architecture. \n\n2 Generalisation \n\nIn a supervised learning context, generalisation performance deteriorates for over-complex networks as 'overfitting' occurs. By contrast, it is an interesting empirical observation that the generalisation performance of NEUROSCALE, and related models, is largely insensitive to excessive model complexity. This applies both to the number of centres used in the RBF and to the kernel smoothing factors, which themselves may be viewed as regularising hyperparameters in a feed-forward supervised situation. \n\nThis insensitivity is illustrated by Figure 2, which shows the training and test set performances on the IRIS data (for 5-45 basis functions, trained and tested on 45 separate samples). To within acceptable deviations, the training and test set STRESS values are approximately constant. This behaviour is counter-intuitive when compared with research on feed-forward networks trained according to supervised approaches. We have observed this general trend on a variety of diverse real-world problems, and it is not peculiar to the IRIS data. \n\n[Figure: training and test STRESS plotted against the number of basis functions, 5 to 45; both curves are approximately flat.] \n\nFigure 2: Training and test errors for NEUROSCALE Radial Basis Functions with various numbers of basis functions. Training errors are on the left, test errors are on the right. \n\nThere are two fundamental causes of this observed behaviour. Firstly, we may derive significant insight into the necessary form of the functional transformation independent of the data. Secondly, given this prior functional knowledge, there is an appropriate regularising component implicitly incorporated in the training algorithm outlined in the previous section. \n\n2.1 Smoothness and Topographic Transformations \n\nFor a supervised problem, in the absence of any explicit prior information, the smoothness of the network function must be determined by the data, typically necessitating the setting of regularising hyperparameters to counter overfitting behaviour. In the case of the distance-preserving transformation effected by NEUROSCALE, an understanding of the necessary smoothness may be deduced a priori. \n\nConsider a point x_q in input space and a nearby test point x_p = x_q + \epsilon_{pq}, where \epsilon_{pq} is an arbitrary displacement vector. Optimum generalisation demands that the distance between the corresponding image points y_q and y_p should thus be ||\epsilon_{pq}||. 
\nConsidering the Taylor expansion of the mapping around the point x_q we find \n\n||y_p - y_q||^2 = \sum_{l=1}^{n} (\epsilon_{pq}^T g_{ql})^2 + O(\epsilon^4) \n              = \epsilon_{pq}^T \Big( \sum_{l=1}^{n} g_{ql} g_{ql}^T \Big) \epsilon_{pq} + O(\epsilon^4) \n              = \epsilon_{pq}^T G_q \epsilon_{pq} + O(\epsilon^4),   (3) \n\nwhere the matrix G_q = \sum_{l=1}^{n} g_{ql} g_{ql}^T and g_{ql} is the gradient vector (\partial y_l(q)/\partial x_1, ..., \partial y_l(q)/\partial x_m)^T evaluated at x = x_q. For structure preservation the corresponding distances in input and output spaces need to be retained for all values of \epsilon_{pq}: ||y_p - y_q||^2 = \epsilon_{pq}^T \epsilon_{pq}, and so G_q = I, with the requirement that second- and higher-order terms must vanish. In particular, note that measures of curvature proportional to (\partial^2 y_l(q)/\partial x_i^2)^2 should vanish. In general, for dimension reduction, we cannot ensure that exact structure preservation is obtained, since the rank of G_q is at most n, which is less than m, and hence G_q can never equate to the identity matrix. However, when minimising STRESS we are locally attempting to minimise the residual ||I - G_q||, which is achieved when all the vectors \epsilon_{pq} of interest lie within the range of G_q. \n\n2.2 The Training Mechanism \n\nAn important feature of this class of topographic transformations is that the STRESS measure is invariant under arbitrary rotations and translations of the output configuration. The algorithm outlined previously tends towards those configurations that generally reduce the sum-of-squared weight values (Tipping, 1996). This is achieved without any explicit addition of regularisation; rather, it is a feature of the relative supervision algorithm. \n\nThe effect of this reduction in weight magnitudes on the smoothness of the network transformation may be observed by monitoring an explicit quantitative measure of total curvature: \n\nC = \sum_{q} \sum_{i} \sum_{l} \Big( \frac{\partial^2 y_l(q)}{\partial x_i^2} \Big)^2,   (4) \n\nwhere q ranges over the patterns, i over the input dimensions and l over the output dimensions. 
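A measure of the form of equation (4) is straightforward to monitor numerically for any mapping. The sketch below is our illustration, not the authors' code: it estimates the total curvature by central finite differences, and the example functions, data points and tolerances are arbitrary assumptions.

```python
import numpy as np

def total_curvature(f, X, h=1e-4):
    # Equation (4): sum over patterns q, input dimensions i and output
    # dimensions l of (d^2 y_l(q) / d x_i^2)^2, estimated here by
    # central finite differences with step h.
    total = 0.0
    for x in X:
        fx = f(x)
        for i in range(x.size):
            e = np.zeros_like(x)
            e[i] = h
            second = (f(x + e) - 2.0 * fx + f(x - e)) / h ** 2
            total += float((second ** 2).sum())
    return total

# A linear map is maximally smooth (zero curvature); a quadratic map is not.
A = np.array([[1.0, 2.0], [3.0, 4.0]])
X = np.array([[0.0, 0.0], [1.0, -1.0], [0.5, 2.0]])
flat = total_curvature(lambda x: A @ x, X)               # ~ 0
bent = total_curvature(lambda x: np.array([x @ x]), X)   # 3 patterns * 2 dims * 2^2 = 24
```

For a trained RBF with Gaussian bases the second derivatives are also available in closed form, but the finite-difference version applies to any differentiable mapping.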
\n\nFigure 3 depicts the total curvature of NEUROSCALE as a function of the training iterations on the IRIS subset data for a variety of model complexities. As predicted, curvature generally decreases during the training process, with the final value independent of the model complexity. Theoretical insight into this phenomenon is given in (Tipping, 1996). \n\nThis behaviour is highly relevant, given the analysis of the previous subsection. That the training algorithm implicitly reduces the sum-of-squared weight values implies that there is a weight decay process occurring with an associated smoothing effect. While there is no control over the magnitude of this element, it was shown that for good generalisation the optimal transformation should be maximally smooth. This self-regularisation operates differently to the regularisers normally introduced to stabilise the ill-posed problems of supervised neural network models. In the latter case the regulariser acts to oppose the effect of reducing the error on the training set. In NEUROSCALE the implicit weight decay operates with the minimisation of STRESS, since the aim is to 'fit' the relative input positions exactly. \n\n[Figure: curvature plotted against epoch number for networks with 15, 30 and 45 basis functions; all three curves decrease during training.] \n\nFigure 3: Curvature against time during the training of a NEUROSCALE mapping on the Iris data, for networks with 15, 30 and 45 basis functions. \n\nThat there are many RBF networks which satisfy a given STRESS level may be seen by training a network a posteriori on a predetermined Sammon mapping of a data set by a supervised approach (since then the targets are known explicitly). In general, such a posteriori trained networks do not have a low curvature and hence do not show as good a generalisation behaviour as networks trained according to the relative supervision approach. The method by which NEUROSCALE reduces curvature is to select, automatically, RBF networks with minimum-norm weights. This is an inherent property of the training algorithm's reduction of the STRESS criterion. \n\n2.3 An example \n\nAn effective example of the ease of production of good generalising transformations is given by the following experiment. A synthetic data set comprised four Gaussian clusters, each with spherical variance of 0.5, located in four dimensions with centres at (x_c, 0, 0, 0) : x_c \in {1, 2, 3, 4}. A NEUROSCALE transformation to two dimensions was trained using the relative supervision approach, using the three clusters at x_c = 1, 3 and 4. The network was then tested on the entire dataset, with the fourth cluster included, and the projections are given in Figure 4 below. \n\nThe apparently excellent generalisation to test data not sampled from the same distribution as the training data is a function of the inherent smoothing within the training process, and also reflects the fact that the test data lay approximately within the range of the matrices G_q determined during training. \n\n3 Conclusion \n\nWe have described NEUROSCALE, a parameterised RBF Sammon mapping approach for topographic feature extraction. The NEUROSCALE method may be viewed as a technique closely related to Sammon mappings and nonlinear metric MDS, with the added flexibility of producing a generalising transformation. 
\n\nA theoretical justification has been provided for the empirical observation that the generalisation performance is not affected by model order complexity issues. This counter-intuitive result is based on arguments of necessary transformation smoothness coupled with the apparent self-regularising aspects of NEUROSCALE. The relative supervision training algorithm implicitly minimises a measure of curvature by incorporating an automatic 'weight decay' effect which favours solutions generated by networks with small overall weights. \n\n[Figure: two 2-D scatter plots, 'NeuroScale trained on 3 linear clusters' and 'NeuroScale tested on 4 linear clusters', showing the projected cluster configurations.] \n\nFigure 4: Training and test projections of the four clusters. Training STRESS was 0.00515 and test STRESS 0.00532. \n\nAcknowledgements \n\nThis work was supported in part under the EPSRC contract GR/J75425, \"Novel Developments in Learning Theory for Neural Networks\". \n\nReferences \n\nKruskal, J. B. (1964). Multidimensional scaling by optimising goodness of fit to a nonmetric hypothesis. Psychometrika, 29(1):1-27. \n\nLowe, D. (1993). Novel 'topographic' nonlinear feature extraction using radial basis functions for concentration coding in the 'artificial nose'. In 3rd IEE International Conference on Artificial Neural Networks. London: IEE. \n\nLowe, D. and Tipping, M. E. (1996). Feed-forward neural networks and topographic mappings for exploratory data analysis. Neural Computing and Applications, 4:83-95. \n\nMao, J. and Jain, A. K. (1995). Artificial neural networks for feature extraction and multivariate data projection. IEEE Transactions on Neural Networks, 6(2):296-317. \n\nSammon, J. W. (1969). A nonlinear mapping for data structure analysis. IEEE Transactions on Computers, C-18(5):401-409. \n\nTipping, M. E. (1996). Topographic Mappings and Feed-Forward Neural Networks. PhD thesis, Aston University, Aston Street, Birmingham B4 7ET, UK. Available from http://www.ncrg.aston.ac.uk/. \n\nWebb, A. R. (1995). Multidimensional scaling by iterative majorisation using radial basis functions. Pattern Recognition, 28(5):753-759. \n", "award": [], "sourceid": 1323, "authors": [{"given_name": "David", "family_name": "Lowe", "institution": null}, {"given_name": "Michael", "family_name": "Tipping", "institution": null}]}