{"title": "A Network of Localized Linear Discriminants", "book": "Advances in Neural Information Processing Systems", "page_first": 1102, "page_last": 1109, "abstract": null, "full_text": "A Network of Localized Linear Discriminants \n\nMartin S. Glassman \n\nSiemens Corporate Research \n755 College Road East \nPrinceton, NJ 08540 \nmsg@siemens.siemens.com \n\nAbstract \n\nThe localized linear discriminant network (LLDN) has been designed to address classification problems containing relatively closely spaced data from different classes (encounter zones [1], the accuracy problem [2]). Locally trained hyperplane segments are an effective way to define the decision boundaries for these regions [3]. The LLD uses a modified perceptron training algorithm for effective discovery of separating hyperplane/sigmoid units within narrow boundaries. The basic unit of the network is the discriminant receptive field (DRF), which combines the LLD function with Gaussians representing the dispersion of the local training data with respect to the hyperplane. The DRF implements a local distance measure [4], and obtains the benefits of networks of localized units [5]. A constructive algorithm for the two-class case is described which incorporates DRF's into the hidden layer to solve local discrimination problems. The output unit produces a smoothed, piecewise linear decision boundary. Preliminary results indicate the ability of the LLDN to efficiently achieve separation when boundaries are narrow and complex, in cases where both the \"standard\" multilayer perceptron (MLP) and k-nearest neighbor (KNN) yield high error rates on training data. \n\n1 The LLD Training Algorithm and DRF Generation \n\nThe LLD is defined by the hyperplane normal vector V and its \"midpoint\" M (a translated origin [1] near the center of gravity of the training data in feature space). 
Incremental corrections to V and M accrue for each training token feature vector Y_j in the training set, as illustrated in figure 1 (exaggerated magnitudes). The surface of the hyperplane is appropriately moved either towards or away from Y_j by rotating V and by shifting M along the axis defined by V; M is always shifted towards Y_j in the \"radial\" direction R_j (which is the component of D_j orthogonal to V, where D_j = Y_j - M). \n\nFigure 1: LLD incremental correction vectors associated with training token Y_j (shown for a token on the correct side and a token on the wrong side of the hyperplane), with the corresponding LLD update rules: \n\nΔV = μ(n) Σ_j ΔV_j = μ(n) Σ_j (-S_c W_c δ_j) D_j/‖D_j‖ \nΔM_V = ν(n) Σ_j ΔM_{V,j} = ν(n) Σ_j (-S_c W_c δ_j) V \nΔM_R = β(n) Σ_j ΔM_{R,j} = β(n) Σ_j (W_c δ_j) R_j \n\nThe batch mode summation is over tokens in the local training set, and n is the iteration index. The polarity of ΔV_j and ΔM_{V,j} is set by S_c (c = the class of Y_j), where S_c = 1 if Y_j is classified correctly, and S_c = -1 if not. Corrections for each token are scaled by a sigmoidal error term δ_j = 1/(1 + exp((S_c η/λ) |V^T D_j|)), a function of the distance of the token to the plane, the sign of S_c, and a data-dependent scaling parameter λ = |V^T [B_0 - B_1]|, where η is a fixed (experimental) scaling parameter. The scaling of the sigmoid is proportional to an estimate of the boundary region width along the axis of V. 
B_c is a weighted average of the class c token vectors: B_c(n+1) = (1 - α) B_c(n) + α W_c Σ_{j∈c} ε_{j,c}(n) Y_j(n), where ε_{j,c} is a sigmoid with the same scaling as δ_j, except that it is centered on B_c instead of M, emphasizing tokens of class c nearest the hyperplane surface. For small η's, B_c will settle near the cluster center of gravity, and for large η's, B_c will approach the tokens closest to the hyperplane surface. (The rate of movement of B_c is limited by the value of α, which is not critical.) The inverse of the number of tokens in class c, W_c, balances the weight of the corrections from each class. If a more Bayesian-like solution is required, the slope of δ can be made class dependent (for example, replacing η with η_c ∝ W_c). Since the slope of the sigmoid error term is limited and distribution dependent, the use of W_c, along with the nonlinear weighting of tokens near the hyperplane surface, is important for the development of separating planes in relatively narrow boundaries (the assumption is that the distributions near these boundaries are non-Gaussian). The setting of η simultaneously (for convenience) controls the focus on the \"inner edges\" of the class clusters and the slope of the sigmoid relative to the distance between the inner edges, with some resultant control over generalization performance. This local scaling of the error also aids the convergence rate. The range of good values for η has been found to be reasonably wide, and identical values have been used successfully with speech, ECG, and synthetic data; it could also be set/optimized using cross-validation. Separate adaptive learning rates (μ(n), ν(n), and β(n)) are used in order to take advantage of the distinct nature of the geometric function of each component. 
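For concreteness, one batch iteration of the training rules above can be sketched in NumPy as follows. The sign convention (class 1 assumed to lie on the +V side), the learning-rate values, and η are illustrative assumptions, not the paper's settings:

```python
import numpy as np

def lld_update(V, M, Y, cls, B, eta=1.0, mu=0.05, nu=0.05, beta=0.05):
    """One batch update of a localized linear discriminant (sketch).

    V: unit hyperplane normal; M: midpoint; Y: (n, d) local tokens;
    cls: 0/1 labels; B: {0: B0, 1: B1} class inner-edge estimates.
    The +V-side convention and rate values are assumptions.
    """
    lam = abs(V @ (B[0] - B[1]))                  # boundary-width estimate lambda
    n_c = {c: max(int((cls == c).sum()), 1) for c in (0, 1)}
    dV = np.zeros_like(V)
    dMv = np.zeros_like(M)
    dMr = np.zeros_like(M)
    for Yj, c in zip(Y, cls):
        Dj = Yj - M
        proj = V @ Dj
        S = 1.0 if (proj > 0) == (c == 1) else -1.0   # S_c: +1 if classified correctly
        W = 1.0 / n_c[c]                              # W_c balances class weight
        delta = 1.0 / (1.0 + np.exp(S * (eta / lam) * abs(proj)))  # sigmoid error
        Rj = Dj - proj * V                            # radial component of Dj
        dV += -S * W * delta * Dj / np.linalg.norm(Dj)
        dMv += -S * W * delta * V
        dMr += W * delta * Rj
    V_new = V + mu * dV
    V_new = V_new / np.linalg.norm(V_new)             # keep V at unit magnitude
    M_new = M + nu * dMv + beta * dMr
    return V_new, M_new
```

The loop mirrors the three update rules: rotation of V, the shift of M along V, and the unsigned radial shift of M towards each token.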
Convergence is also improved by maintaining M within the local region; this controls the rate at which the hyperplane can sweep through the boundary region, making the effect of ΔV more predictable. The LLD normal vector update is simply V(n+1) = (V(n) + ΔV)/‖V(n) + ΔV‖, so that V is always normalized to unit magnitude. The midpoint is just shifted: M(n+1) = M(n) + ΔM_R + ΔM_V. \n\nFigure 2: Vectors and parameters associated with the DRF for class c, for LLD k. (λ: estimate of the boundary region width; σ_V: dispersion of the training data in the discriminant direction; σ_R: dispersion of the training data in all directions orthogonal to V.) \n\nDRF's are used to localize the response of the LLD to the region of feature space in which it was trained, and are constructed after completion of LLD training. Each DRF represents one class, and the localizing component of the DRF is a Gaussian function based on simple statistics of the training data for that class. Two measures of the dispersion of the data are used: σ_V (\"normal\" dispersion), obtained using the mean average deviation of the lengths of the P_{j,k,c}, and σ_R (\"radial\" dispersion), obtained correspondingly using the O_{j,k,c}'s. (As shown in figure 2, P_{j,k,c} is the normal component, and O_{j,k,c} the radial component, of Y_j - B_{k,c}.) The output in response to an input vector Y_j from the class c DRF associated with LLD k is φ_{j,k,c}: \n\nφ_{j,k,c} = Θ_{k,c} (s_{j,k} - 0.5) / exp(d²_{V,j,k,c} + d²_{R,j,k,c}) \n\nTwo components of the DRF incorporate the LLD discriminant; one is the sigmoid error function used in training the LLD, but shifted down to a value of zero at the hyperplane surface. 
The other is Θ_{k,c}, which is 1 if Y_j is on the class c side of LLD k, and zero if not. (In retrospect, for generalization performance, it may not be desirable to introduce this discontinuity to the discriminant component.) The contribution of the Gaussian is based on the normal and radial dispersion weighted distances of the input vector to B_{k,c}: d_{V,j,k,c} = ‖P_{j,k,c}‖/σ_{V,k,c} and d_{R,j,k,c} = ‖O_{j,k,c}‖/σ_{R,k,c}. \n\n2 Network Construction \n\nSegmentation of the boundary between classes is accomplished by \"growing\" LLD's within the boundary region. An LLD is initialized using a closely spaced pair of tokens, one from each class. The LLD is grown by adding nearby tokens to the training set, using the k-nearest neighbors to the LLD midpoint at each growth stage as candidates for permanent inclusion. Candidate DRF's are generated after incremental training of the LLD to accommodate each new candidate token. Two error measures are used to assess the effect of each candidate: the peak value of δ_j over the local training set, and ω, which is a measure of misclassification error due to the receptive fields of the candidate DRF's extending over the entire training set. The candidate token with the lowest average ω is permanently added, as long as both its δ_j and ω are below fixed thresholds. Growth of the LLD is halted if no candidate has both error measures below threshold. The δ_j and ω thresholds directly affect the granularity of the DRF representation of the data; they need to be set to minimize the number of DRF's generated, while allowing sufficient resolution of local discrimination problems. They should perhaps be adaptive so as to encourage coarse grained solutions to develop before fine grain structure. \n\nFigure 3: Four \"snapshots\" in the growth of an LLD/DRF pair. 
The upper two are \"close-ups.\" The initial LLD/DRF pair is shown in the upper left, along with the seed pair. Filled rectangles and ellipses represent the tokens from each class in the permanent local training set at each stage. The large markers are the B points, and the cross is the LLD midpoint. The amplitude of the DRF outputs is coded in grey scale. \n\nAt this point the DRF's are fixed and added to the network; this represents the addition of two new localized features available for use by the network's output layer in solving the global discrimination problem. In this implementation, the output \"layer\" is a single LLD used to generate a two-class decision. \n\nFigure 4: LLDN architecture for a two-dimensional, two-class problem. (Input data feeds the hidden-layer LLD's; each LLD, with its σ_{(V,R)} dispersion pair, produces the localized DRF features; the output discriminant function is an LLD with sigmoid, whose error measure on the training tokens is used to seed new LLD's or halt training.) \n\nThe output unit is completely retrained after addition of a new DRF pair, using the entire training set. The output of the network for the input Y_j is ψ_j = 1/(1 + exp((γ/λ_o) V^T [Φ_j - M])), where λ_o = |V^T [B_0 - B_1]|, and Φ_j = [φ_{j,1}, ..., φ_{j,p}] is the p-dimensional vector of DRF outputs presented to the output unit. Here V is the output LLD normal vector, M its midpoint, and the B_c's the cluster edge points in the internal feature space. The output error for each token is then used to select a new seed pair for development of the next LLD/DRF pair. If all tokens are classified with sufficient confidence, of course, construction of the LLDN is complete. 
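The DRF response and the output-unit discriminant can be sketched in NumPy as below. The side convention for Θ_{k,c}, the sign inside the sigmoids, and all parameter values are assumptions for illustration rather than the paper's exact formulation:

```python
import numpy as np

def drf_response(Yj, V, M, B_kc, sigma_v, sigma_r, eta, lam, side=1.0):
    """Sketch of the class-c DRF output for LLD k.

    side = +1 if class c is assumed to lie on the +V side of the plane;
    eta/lam give the sigmoid scaling carried over from LLD training.
    """
    proj = V @ (Yj - M)                        # signed distance along V
    theta = 1.0 if side * proj > 0 else 0.0    # Theta_{k,c}: class-c side test
    s = 1.0 / (1.0 + np.exp(-side * (eta / lam) * proj))  # equals 0.5 at the plane
    diff = Yj - B_kc
    p = (V @ diff) * V                         # normal component P_{j,k,c}
    o = diff - p                               # radial component O_{j,k,c}
    d_v = np.linalg.norm(p) / sigma_v          # dispersion-weighted distances
    d_r = np.linalg.norm(o) / sigma_r
    return theta * (s - 0.5) / np.exp(d_v**2 + d_r**2)

def lldn_output(phis, V_out, M_out, gamma, lam_o):
    """Sketch of the output unit acting on the DRF feature vector Phi_j."""
    return 1.0 / (1.0 + np.exp((gamma / lam_o) * (V_out @ (phis - M_out))))
```

A token at B_{k,c} on its own class's side yields a positive response bounded by 0.5, which decays with the Gaussian of the dispersion-weighted distances; a token on the wrong side yields zero.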
There are three possibilities for insufficient confidence: a token is covered by a DRF of the wrong class, it is not yet covered sufficiently by any DRF's, or it is in a region of \"conflict\" between DRF's of different classes. A heuristic is used to prevent the repeated selection of the same seed pair tokens, since there is no guarantee that a given DRF will significantly reduce the error for the data it covers after output unit retraining. This heuristic alternates between the types of error and the class for selection of the primary seed token. Redundancy in DRF shapes is also minimized by error-weighting the dispersion computations so that the resultant Gaussian focuses more on the higher error regions of the local training data. A simple but reasonably effective pruning algorithm was incorporated to further eliminate unnecessary DRF's. \n\nFigure 5: Network response plots illustrating network development. The upper two sequences begin with the first LLD/DRF pair, and the bottom two plots show final network responses for these two problems. A solution to a harder version of the nested squares problem is on the lower left. \n\n3 Experimental Results \n\nThe first experiment demonstrates comparative convergence properties of the LLD and a single hyperplane trained by the standard generalized delta rule (GDR) method (a \"network\" with no hidden units and a single output unit) on 14 linearly separable, minimal consonant pair data sets. The data is 256 dimensional (time/frequency matrix, described in [6]), with 80 exemplars per consonant. The results compare the best performance obtainable from each technique. The LLD converges roughly 12 times faster in iteration counts. The GDR often fails to completely separate f/th, f/v, and s/sh; in the results in figure 6 it fails on the f/th data set at a plateau of 25% error. 
In both experiments described in this paper, networks were run for relatively long times to ensure confidence in declaring failure to solve the problem. \n\nFigure 6: Training a single hyperplane: completed iterations for each minimal pair, GDR vs. LLD (the GDR does not separate f/th). \n\nFigure 7: Error rates vs. geometries: percent error for each combination of boundary width and offset. \n\nThe second experiment involves complete networks on synthetic two-dimensional problems. Two examples of the nested squares problem (random distributions of tokens near the surfaces of squares of alternating class, 400 tokens total) are shown in figure 5. Two parameters controlling data set generation are explored: the relative boundary region width, and the relative offset from the origin of the data set center of gravity (while keeping the upper right corner of the outside square near the (1,1) coordinate); all data is kept within the unit square (except for geometry number 2). Relative boundary widths of 29%, 4.4%, and 1% are used with offsets of 0%, 76%, and 94%. The best results over parameter settings are reported for each network for each geometry. Four MLP architectures were used: 2:16:1, 2:32:1, 2:64:1, and 2:16:16:1; all of these converge to a solution for the easiest problem (wide boundaries, no offset), but all eventually fail as the boundaries narrow and/or the offset increases. The worst performing net (2:64:1) fails on 7/8 problems (maximum error rate of 49%); the best net (2:16:16:1) fails on 3/8 (maximum of 24% error). 
The LLDN is 1 to 3 orders of magnitude faster in cpu time when the MLP does converge, even though it does not use adaptive learning rates in this experiment. (The average running time for the LLDN was 34 minutes; for the MLP's it was 3481 minutes [Stardent 3040, single cpu], which includes non-converging runs. The 2:16:16:1 net did, however, take 4740 minutes to solve problem 6, which was solved in 7 minutes by the LLDN.) The best LLDN's converge to zero errors over the problem set (fig. 7), and are not too sensitive to parameter variation, which primarily affects convergence time and the number of DRF's generated. In contrast, finding good values for learning rate and momentum for the MLP's for each problem was a time-consuming process. The effect of random weight initialization in the MLP is not known because of the long running times required. The KNN error rate was estimated using the leave-one-out method, and yields error rates of 0%, 10.5%, and 38.75% (for the best k's) respectively for the three values of boundary width. The LLDN is insensitive to offset and scale (like the KNN) because of the use of the local origin (M) and error scaling (λ). While global offset and scaling problems for the MLP can be ameliorated through normalization and origin translation, this method cannot guarantee elimination of local offset and scaling problems. The LLDN's utilization of DRF's was reasonably efficient, with the smallest networks (after pruning) using 20, 32, and 54 DRF's for the three boundary widths. A simple pruning algorithm, which starts up after convergence, iteratively removes the DRF's with the lowest connection weights to the output unit (which is retrained after each link is removed). A range of roughly 20% to 40% of the DRF's were removed before misclassification errors developed on the training sets. 
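The pruning loop just described can be sketched as follows; `retrain` and `misclassifies` are hypothetical placeholders standing in for the paper's output-unit retraining and training-set error check:

```python
def prune_drfs(drfs, weights, retrain, misclassifies):
    """Iteratively remove the DRF with the smallest output-layer weight,
    retraining the output unit after each removal, and stop as soon as a
    removal would introduce training-set misclassifications (sketch)."""
    while len(drfs) > 1:
        i = min(range(len(drfs)), key=lambda k: abs(weights[k]))
        trial = drfs[:i] + drfs[i + 1:]        # drop the weakest DRF
        new_weights = retrain(trial)           # retrain the output unit
        if misclassifies(trial, new_weights):
            break                              # keep the current network
        drfs, weights = trial, new_weights
    return drfs, weights
```

Greedy weight-based pruning keeps the retraining cost low because only the single output LLD, not the hidden layer, is refit after each removal.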
\nThe LLDN was also tested on the \"two-spirals\" problem, which is known to be difficult for standard MLP methods. Because of the boundary segmentation process, solution of the two-spirals problem was straightforward for the LLDN, and could be tuned to converge in as little as 2.5 minutes on an Apollo DN10000. The solution shown in fig. 5 uses 50 DRF's (not pruned). The generalization pattern is relatively \"nice\" (for training on the sparse version of the data set), and perhaps demonstrates the practical nature of the smoothed piecewise linear boundary for nonlinear problems. \n\n4 Discussion \n\nThe effect of LLDN parameters on generalization performance needs to be studied. In the nested squares problem it is clear that the MLP's will have better generalization when they converge; this illustrates the potential utility of a multi-scale approach to developing localized discriminants. A number of extensions are possible. Localized feature selection can be implemented by simply zeroing components of V. The DRF Gaussians could model the radial dispersion of the data more effectively (in greater than two dimensions) by generating principal component axes which are orthogonal to V. Extension to the multiclass case can be based on DRF sets developed for discrimination between each class and all other classes, using the DRF's as features for a multi-output classifier. The use of multiple hidden layers offers the prospect of more complex localized receptive fields. Improvement in generalization might be gained by including a procedure for merging neighboring DRF's. While it is felt that the LLD parameters should remain fixed, it may be advantageous to allow adjustment of the DRF Gaussian dispersions as part of the output layer training. A stopping rule for LLD training needs to be developed so that adaptive learning rates can be utilized effectively. 
This rule may also be useful in identifying poor token candidates early in the incremental LLD training. \n\nReferences \n\n[1] J. Sklansky and G.N. Wassel. Pattern Classifiers and Trainable Machines. Springer-Verlag, New York, 1981. \n\n[2] S. Makram-Ebeid, J.A. Sirat, and J.R. Viala. A rationalized error backpropagation learning algorithm. Proc. IJCNN, 373-380, 1988. \n\n[3] J. Sklansky and Y. Park. Automated design of multiple-class piecewise linear classifiers. Journal of Classification, 6:195-222, 1989. \n\n[4] R.D. Short and K. Fukunaga. A new nearest neighbor distance measure. Proc. Fifth Intl. Conf. on Pattern Recognition, 81-88. \n\n[5] R. Lippmann. A critical overview of neural network pattern classifiers. Neural Networks for Signal Processing (IEEE), 267-275, 1991. \n\n[6] M.S. Glassman and M.B. Starkey. Minimal consonant pair discrimination for speech therapy. Proc. European Conf. on Speech Comm. and Tech., 273-276, 1989. \n", "award": [], "sourceid": 525, "authors": [{"given_name": "Martin", "family_name": "Glassman", "institution": null}]}