{"title": "Minkowski-r Back-Propagation: Learning in Connectionist Models with Non-Euclidian Error Signals", "book": "Neural Information Processing Systems", "page_first": 348, "page_last": 357, "abstract": null, "full_text": "Minkowski-r Back-Propagation: Learning in Connectionist Models with Non-Euclidian Error Signals \n\nStephen Jose Hanson and David J. Burr \n\nBell Communications Research \nMorristown, New Jersey 07960 \n\nAbstract \n\nMany connectionist learning models are implemented using gradient descent in a least squares error function of the output and teacher signal. The present model generalizes, in particular, back-propagation [1] by using Minkowski-r power metrics. For small r's a \"city-block\" error metric is approximated and for large r's the \"maximum\" or \"supremum\" metric is approached, while for r=2 the standard back-propagation model results. An implementation of Minkowski-r back-propagation is described, and several experiments are done which show that different values of r may be desirable for various purposes. Different r values may be appropriate for the reduction of the effects of outliers (noise), for modeling the input space with more compact clusters, or for modeling the statistics of a particular domain more naturally or in a way that may be more perceptually or psychologically meaningful (e.g. speech or vision). \n\n1. Introduction \n\nThe recent resurgence of connectionist models can be traced to their ability to do complex modeling of an input domain. It can be shown that neural-like networks containing a single hidden layer of non-linear activation units can learn to do a piece-wise linear partitioning of a feature space [2]. One result of such a partitioning is a complex gradient surface on which decisions about new input stimuli will be made. 
The generalization, categorization and clustering properties of the network are therefore determined by this mapping of input stimuli to this gradient surface in the output space. This gradient surface is a function of the conditional probability distributions of the output vectors given the input feature vectors, as well as a function of the error relating the teacher signal and output. \n\nPresently many of the models have been implemented using least squares error. In this paper we describe a new model of gradient descent back-propagation [1] using Minkowski-r power error metrics. For small r's a \"city-block\" error measure (r=1) is approximated and for larger r's a \"maximum\" or supremum error measure is approached, while the standard case of Euclidian back-propagation is a special case with r=2. First we derive the general case and then discuss some of the implications of varying the power in the general metric. \n\n2. Derivation of Minkowski-r Back-propagation \n\nThe standard back-propagation is derived by minimizing least squares error as a function of connection weights within a completely connected layered network. The error for the Euclidian case is (for a single input-output pair), \n\nE = (1/2) Σ_j (ŷ_j - y_j)²     (1) \n\nwhere y is the activation of a unit and ŷ represents an independent teacher signal. The activation of a unit (y) is typically computed by normalizing the input from other units (x) over the interval (0,1) while compressing the high and low end of this range. A common function used for this normalization is the logistic, \n\ny_j = 1 / (1 + e^(-x_j))     (2) \n\nThe input to a unit (x) is found by summing products of the weights and corresponding activations from other units, \n\nx_i = Σ_h w_hi y_h     (3) \n\nwhere y_h represents units in the fan-in of unit i and w_hi represents the strength of the connection between unit h and unit i. 
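In code, equations 2 and 3 amount to only a few lines. The sketch below is illustrative rather than the authors' implementation; the function and variable names are our own:

```python
import math

def logistic(x):
    # Equation 2: squash the net input into the interval (0, 1)
    return 1.0 / (1.0 + math.exp(-x))

def unit_activation(weights, fan_in_activations):
    # Equation 3: x_i = sum_h w_hi * y_h, then the logistic of equation 2
    x = sum(w * y for w, y in zip(weights, fan_in_activations))
    return logistic(x)

# A unit whose net input is zero sits at the midpoint of its range:
print(unit_activation([1.0, -1.0], [0.5, 0.5]))  # 0.5
```
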
\n\nA gradient for the Euclidian or standard back-propagation case could be found by taking the partial of the error with respect to each weight, and can be expressed in this three-term differential, \n\n∂E/∂w_hi = (∂E/∂y_i)(∂y_i/∂x_i)(∂x_i/∂w_hi)     (4) \n\nwhich from the equations before turns out to be, \n\n∂E/∂w_hi = -(ŷ_i - y_i) y_i (1 - y_i) y_h     (5) \n\nGeneralizing the error for Minkowski-r power metrics (see Figure 1 for the family of curves), \n\nE = (1/r) Σ_i |ŷ_i - y_i|^r     (6) \n\nFigure 1: Minkowski-r Family \n\nUsing equations 2-4 above with equation 6 we can easily find an expression for the gradient in the general Minkowski-r case, \n\n∂E/∂w_hi = -|ŷ_i - y_i|^(r-1) y_i (1 - y_i) sgn(ŷ_i - y_i) y_h     (7) \n\nThis gradient is used in the weight update rule proposed by Rumelhart, Hinton and Williams [1], \n\nw_hi(n+1) = w_hi(n) - α ∂E/∂w_hi     (8) \n\nSince the gradient computed for the hidden layer is a function of the gradient for the output, the hidden layer weight updating proceeds in the same way as in the Euclidian case [1], simply substituting this new Minkowski-r gradient. \n\nIt is also possible to define a gradient over r such that a minimum in error would be sought. Such a gradient was suggested by White [3, see also 4] for maximum likelihood estimation of r, and can be shown to be, \n\n∂(log E)/∂r = (1 - 1/r)(1/r) + (1/r)² log(r) + (1/r)² ψ(1/r) + (1/r)² |ŷ_i - y_i| - (1/r) |ŷ_i - y_i|^r log(|ŷ_i - y_i|)     (9) \n\nAn approximation of this gradient (using the last term of equation 9) has been implemented and investigated for simple problems and shown to be fairly robust in recovering similar r values. However, it is important that the r update rule change more slowly than the weight update rule. 
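For a single output unit, equations 6-8 can be sketched as follows. This is an illustrative sketch with our own naming, not the authors' implementation; note that at r=2 the gradient reduces to the Euclidian case of equation 5:

```python
def sgn(v):
    # Sign function used in the Minkowski-r gradient
    return (v > 0) - (v < 0)

def minkowski_error(y, yhat, r):
    # Equation 6: E = (1/r) * sum_i |yhat_i - y_i|^r
    return sum(abs(t - o) ** r for o, t in zip(y, yhat)) / r

def output_gradient(y_i, yhat_i, y_h, r):
    # Equation 7: dE/dw_hi = -|yhat_i - y_i|^(r-1) * y_i * (1 - y_i)
    #                        * sgn(yhat_i - y_i) * y_h
    d = yhat_i - y_i
    return -(abs(d) ** (r - 1)) * y_i * (1.0 - y_i) * sgn(d) * y_h

def update_weight(w, grad, alpha=0.5):
    # Equation 8: gradient-descent step on w_hi with learning rate alpha
    return w - alpha * grad

# At r=2 the Minkowski-r gradient equals the Euclidian gradient of equation 5
euclidian = -(0.9 - 0.5) * 0.5 * (1.0 - 0.5) * 1.0
assert abs(output_gradient(0.5, 0.9, 1.0, 2) - euclidian) < 1e-12
```

The hidden-layer updates follow by back-propagating this output gradient exactly as in the Euclidian case, as noted above.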
In the simulations we ran, r was changed once for every 10 times the weight values were changed. This rate might be expected to vary with the problem and rate of convergence. Local minima may be expected in larger problems while seeking an optimal r. It may be more informative for the moment to examine different classes of problems with fixed r and consider the specific rationale for those classes of problems. \n\n3. Variations in r \n\nVarious r values may be useful for various aspects of representing information in the feature domain. Changing r basically results in a reweighting of errors from output bits[1]. Small r's give less weight to large deviations and tend to reduce the influence of outlier points in the feature space during learning. In fact, it can be shown that if the distributions of feature vectors are non-gaussian, then the r=2 case will not in general be optimal in a maximum likelihood sense. \n\n1. It is possible to entertain r values that are negative, which would give largest weight to small errors close to zero and smallest weight to very large errors. Values of r less than 1 generally are non-metric, i.e. they violate at least one of the metric axioms. \n\nLarge r's tend to weight large deviations. When noise is not possible in the feature space (as in an arbitrary boolean problem), or where the token clusters are compact and isolated, then simpler (in the sense of the number and placement of partition planes) generalization surfaces may be created with larger r values. For example, in the simple XOR problem, the main effect of increasing r is to pull the decision boundaries closer into the non-zero targets (compare high activation regions in Figures 4a and 4b). \n\nIn this particular problem clearly such compression of the target regions does not constitute simpler decision surfaces. 
However, if more hidden units are used than are needed for pattern class separation, then increasing r during training will tend to reduce the number of cuts in the space to the minimum needed. This seems to be primarily due to the sensitivity of the hyper-plane placement in the feature space to the geometry of the targets. \n\nA more complex case illustrating the same idea comes from an example suggested by Minsky & Papert [7] called \"the mesh\". This type of pattern recognition problem is also, like XOR, a non-linearly separable problem. An optimal solution involves only three cuts in feature space to separate the two \"meshed\" clusters (see Figure 5a). \n\nFigure 4: XOR solved with r=2 (4a) and r=4 (4b) \n\nFigure 5: Mesh problem with minimum cut solution (5a) and performance surface (5b) \n\nTypical solutions for r=2 in this case tend to use a large number of hidden units to separate the two sets of exemplars (see Figure 5b for a performance surface). For example, in Figure 6a notice that a typical (based on several runs) Euclidian back-prop starting with 16 hidden units has found a solution involving five decision boundaries (lines shown in the plane also representing hidden units), while the r=3 case used primarily three decision boundaries and placed a number of other boundaries redundantly near the center of the meshed region (see Figure 6b), where there is maximum uncertainty about the cluster identification. \n\nFigure 6: Mesh solved with r=2 (6a) and r=3 (6b) \n\nSpeech Recognition. 
A final case in which large r's may be appropriate is data that has been previously processed with a transformation that produced compact regions requiring separation in the feature space. One example we have looked at involves spoken digit recognition. The first 10 cepstral coefficients of spoken digits (\"one\" through \"ten\") were used for input to a network. In this case an advantage is shown for larger r's with smaller training set sizes. Shown in Figure 7 are transfer data for 50 spoken digits replicated in ten different runs per point (bars show standard error of the mean). Transfer shows a training set size effect for both r=2 and r=3; however, for the larger r value at smaller training set sizes (10 and 20) note that transfer is enhanced. \n\nWe speculate that this may be due to the larger-r backprop creating discrimination regions that are better able to capture the compactness of the clusters inherent in a small number of training points. \n\n4. Convergence Properties \n\nIt should be generally noted that as r increases, convergence time tends to grow roughly linearly (although this may be problem dependent). Consequently, decreasing r can significantly improve convergence, without much change to the nature of the solution. Further, if noise is present, decreasing r may reduce it dramatically. Note finally that the gradient for Minkowski-r back-propagation is nonlinear and therefore more complex for implementing learning procedures. 
\nFigure 7: Digit Recognition Set Size Effect (transfer performance vs. training set size, 10 replications of 50 transfer points) \n\n5. Summary and Conclusion \n\nA new procedure which is a variation on the back-propagation algorithm is derived and simulated in a number of different problem domains. Noise in the target domain may be reduced by using power values less than 2, and the sensitivity of partition planes to the geometry of the problem may be increased with increasing power values. Other types of objective functions should be explored for their potential consequences on network resources and ensuing pattern recognition capabilities. \n\nReferences \n\n1. Rumelhart, D. E., Hinton, G. E., & Williams, R., Learning Internal Representations by Error Propagation, Nature, 1986. \n\n2. Burr, D. J. & Hanson, S. J., Knowledge Representation in Connectionist Networks, Bellcore Technical Report. \n\n3. White, H., Personal Communication, 1987. \n\n4. White, H., Some Asymptotic Results for Learning in Single Hidden Layer Feedforward Network Models, Unpublished Manuscript, 1987. \n\n5. Mosteller, F. & Tukey, J., Robust Estimation Procedures, Addison Wesley, 1980. \n\n6. Tukey, J., Personal Communication, 1987. \n\n7. Minsky, M. & Papert, S., Perceptrons: An Introduction to Computational Geometry, MIT Press, 1969. \n", "award": [], "sourceid": 65, "authors": [{"given_name": "Stephen", "family_name": "Hanson", "institution": null}, {"given_name": "David", "family_name": "Burr", "institution": null}]}