{"title": "Multi-Layer Perceptrons with B-Spline Receptive Field Functions", "book": "Advances in Neural Information Processing Systems", "page_first": 684, "page_last": 692, "abstract": null, "full_text": "Multi-Layer Perceptrons \n\nwith B-Spline Receptive Field Functions \n\nStephen H. Lane, Marshall G. Flax, David A. Handelman and Jack J. Gelfand \n\nHuman Information Processing Group \n\nDepartment of Psychology \n\nPrinceton University \n\nPrinceton, New Jersey 08544 \n\nABSTRACT \n\nMulti-layer perceptrons are often slow to learn nonlinear functions \nwith complex local structure due to the global nature of their function \napproximations. It is shown that standard multi-layer perceptrons are \nactually a special case of a more general network formulation that \nincorporates B-splines into the node computations. This allows novel \nspline network architectures to be developed that can combine the \ngeneralization capabilities and scaling properties of global multi-layer \nfeedforward networks with the computational efficiency and learning \nspeed of local computational paradigms. Simulation results are \npresented for the well known spiral problem of Weiland and of Lang \nand Witbrock to show the effectiveness of the Spline Net approach. \n\n1. INTRODUCTION \n\nRecently, it has been shown that multi-layer feedforward neural networks, such as \nMulti-Layer Perceptrons (MLPs), are theoretically capable of representing arbitrary \nmappings, provided that a sufficient number of units are included in the hidden layers \n(Hornik et al., 1989). Since all network weights are updated with each training \nexemplar, these networks construct global approximations to multi-input/multi-output \nfunction data in a manner analogous to fitting a low-order polynomial through a set of \ndata points. This is illustrated by the cubic polynomial \"Global Fit\" of the data points \nin Fig. 
1. \n\nFigure 1. Global vs. Local Function Approximation (curves labeled \"Local Fit\" and \"Global Fit\") \n\nConsequently, multi-layer perceptrons are capable of generalizing (extrapolating/interpolating) \ntheir response to regions of the input space where little or no training data \nis present, using a quantity of connection weights that typically scales quadratically \nwith the number of hidden nodes. The global nature of the weight updating, however, \ntends to blur the details of local structures, slows the rate of learning, and makes the \naccuracy of the resulting function approximation sensitive to the order of presentation \nof the training data. \n\nIt is well known that many sensorimotor structures in the brain are organized using \nneurons that possess locally-tuned overlapping receptive fields (Hubel and Wiesel, \n1962). Several neural network computational paradigms such as CMACs (Cerebellar \nModel Articulation Controllers) (Albus, 1975) and Radial Basis Functions (RBFs) \n(Moody and Darken, 1988) have been quite successful at representing complex nonlinear \nfunctions using this same organizing principle. These networks construct local \napproximations to multi-input/multi-output function data that are analogous to fitting a \nleast-squares spline through a set of data points using piecewise polynomials or other \nbasis functions. This is illustrated as the cubic spline \"Local Fit\" in Fig. 1. The main \nbenefits of using local approximation techniques to represent complex nonlinear \nfunctions include fast learning and reduced sensitivity to the order of presentation of \ntraining data. In many cases, however, in order to represent the function to the desired \ndegree of smoothness, the number of basis functions required to adequately span the \ninput space can scale exponentially with the number of inputs (Lane et al., 1991a,b). 
\n\nThe work presented in this paper is part of a larger effort (Lane et al., 1991a) to develop \na general neural network formulation that can combine the generalization capabilities \nand scaling properties of global multi-layer feedforward networks with the \ncomputational efficiency and learning speed of local network paradigms. It is shown in \nthe sequel that this can be accomplished by incorporating B-Spline receptive fields into \nthe node connection functions of Multi-Layer Perceptrons. \n\n2. MULTI-LAYER PERCEPTRONS \n\nWITH B-SPLINE RECEPTIVE FIELD FUNCTIONS \n\nStandard Multi-Layer Perceptrons (MLPs) can be represented using node equations of \nthe form, \n\n    y_i^L = σ( Σ_{j=0}^{n_{L-1}} c_{ij}^L(y_j^{L-1}) )    (1) \n\nwhere n_L is the number of nodes in layer L and the c_{ij}^L are linear connection functions \nbetween nodes in layers L and (L-1) such that, \n\n    c_{ij}^L(y_j^{L-1}) = w_{ij}^L y_j^{L-1}    (2) \n\nσ(·) is the standard sigmoidal nonlinearity, y_j^{L-1} is the output of a node in layer L-1, \ny_0^{L-1} = 1, and the w_{ij}^L are adjustable network weights. Some typical linear connection \nfunctions are shown in Fig. 2. c_{i0}^L corresponds to a threshold input. \n\nFigure 2. Typical MLP Node Connection Functions \n\nIncorporating B-Spline receptive field functions (Lane et al., 1991a) into the node \ncomputations of eq. (1) allows more general connection functions (e.g. piecewise linear, \nquadratic, cubic, etc.) to be formulated. The corresponding B-Spline MLP (Spline Net) \nis derived by redefining the connection functions of eq. 
(2) such that, \n\n    c_{ij}^L(y_j^{L-1}) = Σ_k w_{ijk}^L B_{nk}^G(y_j^{L-1})    (3) \n\nThis enables the construction of a more general neural network architecture that has \nnode equations of the form, \n\n    y_i^L = σ( Σ_j Σ_k w_{ijk}^L B_{nk}^G(y_j^{L-1}) )    (4) \n\nThe B_{nk}^G(y_j^{L-1}) are B-spline receptive field functions (Lane et al., 1989, 1991a) of order \nn and support G, while the w_{ijk}^L are the spline network weights. The order, n, \ncorresponds to the number of coefficients in the polynomial pieces. For example, linear \nsplines are of order n=2, whereas cubic splines are of order n=4. The advantage of the \nmore general B-Spline connection functions of eq. (3) is that they allow varying degrees \nof \"locality\" to be added to the network computations, since network weights are now \nactivated based on the value of y_j^{L-1}. The w_{ijk}^L are modified by backpropagating the \noutput error only to the G weights in each connection function associated with active \n(i.e. nonzero) receptive field functions. The Lth-layer weights are updated using the \nmethod of steepest descent learning such that, \n\n    w_{ijk}^L ← w_{ijk}^L + β e_i^L y_i^L (1 - y_i^L) B_{nk}^G(y_j^{L-1})    (5) \n\nwhere e_i^L is the output error back-propagated to the ith node in layer L and β is the \nlearning rate (Lane et al., 1991a). In the more general Spline Net formulation of eqs. \n(3-5), each node input has P+G-1 receptive fields and P+G-1 weights associated with it, \nbut only G are active at any one time. P determines the number of partitions in the \ninput space of the connection functions. Standard MLP networks are a degenerate case \nof the Spline Net architecture, as they can be realized with B-Spline receptive field \nfunctions of order n=2, with P=1 and G=2. 
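As an illustration of eqs. (3)-(5), the following is a minimal sketch, assuming linear B-splines (n=2, G=2) on a uniform knot grid over [0,1]. The function names are hypothetical and are not taken from the paper; the sketch is one possible reading of the equations, not the authors' implementation.

```python
import math

def linear_receptive_fields(y, P):
    # Activations of the P+G-1 = P+1 linear B-spline receptive fields
    # (assumed uniform knots on [0, 1]); at most G = 2 are nonzero.
    return [max(0.0, 1.0 - abs(y * P - k)) for k in range(P + 1)]

def connection_function(weights, y):
    # Eq. (3): weighted sum of the receptive field activations.
    P = len(weights) - 1
    fields = linear_receptive_fields(y, P)
    return sum(w * b for w, b in zip(weights, fields))

def node_output(weight_rows, inputs):
    # Eq. (4): sigmoid of the summed connection functions of one node.
    total = sum(connection_function(w, y) for w, y in zip(weight_rows, inputs))
    return 1.0 / (1.0 + math.exp(-total))

def update_connection(weights, y_in, y_out, err, beta):
    # Eq. (5): steepest-descent step for one connection function.
    # Inactive fields have zero activation, so their weights do not change.
    P = len(weights) - 1
    fields = linear_receptive_fields(y_in, P)
    step = beta * err * y_out * (1.0 - y_out)
    return [w + step * b for w, b in zip(weights, fields)]
```

With P=1 the two receptive fields are 1-y and y, so connection_function([w0, w1], y) equals w0 + (w1 - w0) y, which is the linear-plus-threshold form of a standard MLP connection.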
Due to the connectivity of the B-Spline \nreceptive field functions, for the case when P > 1, the resulting network architecture \ncorresponds to multiply-connected MLPs, where any given MLP is active within only \none hypercube in the input space, but has weights that are shared with MLPs on the \nneighboring hypercubes. The amount of computation required in each layer of a Spline \nNet during both learning and function approximation is proportional to G, and \nindependent of P. \n\nFormulating the connection functions of eq. (3) with linear (n=2) B-Splines allows \nconnection functions such as those shown in Fig. 3 to be learned. \n\nFigure 3. Spline Net Connection Functions Using Linear B-Splines (n=2) \n\nThe connection functions shown in Fig. 3 have P=4 partitions (5 knots) on the interval \ny_j^{L-1} ∈ [0,1]. The number of input partitions, P, determines the degree of locality of \nthe resulting function approximation since the local shape of the connection function is \ndetermined from the current node input activation interval. \n\nNetworks constructed using the Spline Net formulation are reminiscent of the form and \nfunction of Kolmogorov-Lorenz networks (Barron and Barron, 1988). A neurobiological \ninterpretation of a Spline Net is that it is composed of neurons that have dendritic \nbranches with synapses that operate as a function of the level of activation at a given \nnode or network input. This is shown in the network architecture of Fig. 4b where the \nstandard three-layer MLP network of Fig. 4a has been redrawn using B-Spline receptive \nfield functions with n=2, P=4 and G=2. \n\nFigure 4. Three-Layer Spline Net Architecture, n=2, P=4, G=2 \n\nThe horizontal arrows projecting from the right of each network node in Fig. 4b \nrepresent the node outputs. The overlapping triangles on the node output represent the \nreceptive field functions of neurons in the next layer. 
These receptive field functions \nare summed with weighted connections in the dendritic branches to form the inputs to \nthe next network layer. In the architecture shown in Fig. 4b, only two receptive fields \nare active for any given value of a node output. Therefore for this single hidden-layer \nnetwork architecture, given any value for the inputs (x_1, x_2), at most N_w = 30 weights \nwill be active where, \n\n    N_w = (s + 1) η G    (6) \n\ns is the number of network inputs and η is the number of nodes in the hidden layer, \nwhich for this case is 2s+1 = 5. \n\n3. SIMULATION RESULTS \n\nIn order to evaluate the impact of local computation on MLP performance, the well \nknown spiral problem of Weiland and of Lang and Witbrock (1988) was chosen as a \nbenchmark. Simulations were conducted using a Spline Net architecture having one \nhidden layer with 5 hidden nodes and linear B-Splines with support G=2 (Fig. 4). All \ntrials used the \"vanilla\" back-prop learning rule of eq. (5) with β = 1/(2P). The \nconnection function weights were initialized in each node such that the resulting \nconnection functions were continuous linear functions with arbitrary slope. From \nprevious experience (Lane et al., 1989), it was known that the number of receptive field \npartitions can drastically affect network learning and performance. Therefore, the \nconnection function partitions were bifurcated during training to see the effect on \nnetwork generalization capability and learning speed. The bifurcation consisted of \nsplitting every receptive field in half after increments of 100K (100,000) training \npoints, each time doubling the number of connection function partitions and weights in \nthe network nodes. A more adaptive approach would monitor the slope of the learning \ncurve to determine when to split the partitions. 
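The bifurcation step can be sketched as follows. This is a minimal illustration, assuming linear B-splines (n=2) on a uniform knot grid over [0,1], for which each weight equals the value of the connection function at its knot. Under that assumption, giving each new midpoint knot the mean of its two neighbors leaves the connection function unchanged. The function names are hypothetical and not taken from the paper.

```python
def evaluate(weights, y):
    # Linear B-spline connection function with P = len(weights) - 1
    # partitions on [0, 1] (assumed uniform knots).
    P = len(weights) - 1
    return sum(w * max(0.0, 1.0 - abs(y * P - k)) for k, w in enumerate(weights))

def bifurcate(weights):
    # Split every partition in half: P+1 weights become 2P+1 weights.
    # Old knot weights are kept; each new midpoint weight is the mean of
    # its neighbors, so the piecewise-linear shape is preserved.
    refined = []
    for left, right in zip(weights, weights[1:]):
        refined.append(left)
        refined.append(0.5 * (left + right))
    refined.append(weights[-1])
    return refined
```

Calling bifurcate repeatedly takes a connection function from P=1 to P=2, 4, 8 and so on, with the function values unchanged at each step until further training adjusts the new weights.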
New weights were initialized such that \nthe connection functions before and after the bifurcation retained the same shape. All \nsimulation results presented in Figs. 5-12 were generated using 800K training points. \n\nThe left-most column of Fig. 5 represents the two learned connection functions that \nlead to each hidden node depicted in Fig. 4. The elements in the second column are the \nhidden node responses to excitation over the unit square, while the plots in the third \ncolumn are the connection functions from the hidden layer to the output node. The \nfourth column shows the hidden node outputs after being passed through their \nrespective connection functions. The network output shown in the fifth column is the \nalgebraic sum of the hidden node responses shown in the fourth column. The Spline \nNet was initialized as a standard MLP with P=1. Figure 6 shows the evolution of the \ntwo connection functions to the third hidden node in Fig. 4 after every 100K training \npoints. Around 400K (P=8) the connection functions start to take on a characteristic \nshape. For P>8, the creation of additional partitions has little effect on the shape of the \nconnection functions. Figure 7 shows the associated learning curve, while Fig. 8 is an \nenlarged version of the network output. These results indicate that the bifurcation \nschedule introduces additional degrees of freedom (weights) to the network in such a \nway as to carve out coarse global features first, then incrementally capture finer and \nfiner localized details later. This is in contrast to the results shown in Figs. 9 and 10 \nwhere the training (using the same 800K points as in Figs. 7 and 8) was begun on a \nnetwork having P=128 initial partitions. Figures 11 and 12 show the learning curve and the Spline Net output after \n800K training iterations using 112 discrete points located on the two spirals. 
Lang and \nWitbrock (1988) state that similar spiral results could only be obtained using an MLP \nnetwork with 3 hidden layers (including jump connections) and 50,000,000 training \niterations. The use of a Spline Net with a bifurcation schedule enabled the learning to \nbe sped up by almost two orders of magnitude, indicating there is a significant \nperformance advantage in trading off number of hidden layers for node complexity. \n\nFigure 5. Spiral Learning with Bifurcation Schedule (columns: Hidden Node Connection Functions; Hidden Node Response; Output Node Connection Functions; Hidden Node Outputs After Connection Functions; Output Node Response) \n\nFigure 6. Evolution of Connection Functions to Third Hidden Node (panels: P=1, P=2, P=4, P=8, P=16, P=32, P=128) \n\nFigure 7. Learning Curve with Bifurcation Schedule (Mean Square Error vs. Training Iteration) \n\nFigure 8. Output Node Response with Bifurcation \n\nFigure 9. Learning Curve without Bifurcation Schedule (P=128) (Mean Square Error vs. Training Iteration) \n\nFigure 10. Output Node Response without Bifurcation \n\nFigure 11. Learning Curve with Bifurcation Schedule (112 Discrete Points) (Mean Square Error vs. Training Iteration) \n\nFigure 12. Output Node Response with Bifurcation (112 Discrete Points) \n\n4. CONCLUSIONS \n\nIt was shown that the introduction of B-Splines into the node connection functions of \nMulti-Layer Perceptrons allows more general neural network architectures to be \ndeveloped. The resulting Spline Net architecture combines the fast learning and \ncomputational efficiency of strictly local neural network approaches with the scaling \nand generalization properties of the more established global MLP approach. Similarity \nto Kolmogorov-Lorenz networks can be used to suggest an initial number of hidden \nlayer nodes. The number of node connection function partitions chosen affects both \nnetwork generalization capability and learning performance. It was shown that use of a \nbifurcation schedule to determine the number of node input partitions speeds learning \nand improves network generalization. Results indicate that Spline Nets solve difficult \nlearning problems by trading off number of hidden layers for node complexity. \n\nAcknowledgements \n\nStephen H. Lane and David A. Handelman are also employed by Robicon Systems Inc., \nPrinceton, NJ. This research has been supported through a grant from the James S. \nMcDonnell Foundation and a contract from the DARPA Neural Network Program. \n\nReferences \n\nAlbus, J. (1975) \"A New Approach to Manipulator Control: The Cerebellar Model \nArticulation Controller (CMAC),\" J. Dyn. Sys. Meas. Control, vol. 97, pp. 270-277. \nBarron, A.R. and Barron, R.L. (1988) \"Statistical Learning Networks: A Unifying \nView,\" Proc. 20th Symp. on the Interface - Computing and Statistics, pp. 192-203. \nHornik, K., Stinchcombe, M. and White, H. (1989) \"Multi-layer Feedforward Networks \nare Universal Approximators,\" Neural Networks, vol. 2, pp. 359-366. \nHubel, D. and Wiesel, T.N. 
(1962) \"Receptive Fields, Binocular Interaction and \nFunctional Architecture in Cat's Visual Cortex,\" J. Physiology, vol. 160, no. 106. \nLane, S.H., Handelman, D.A. and Gelfand, J.J. (1989) \"Development of Adaptive B-Splines \nUsing CMAC Neural Networks,\" 1989 IJCNN, Wash. DC., June 1989. \nLane, S.H., Flax, M.G., Handelman, D.A. and Gelfand, J.J. (1991a) \"Function \nApproximation in Multi-Layer Neural Networks with B-Spline Receptive Field \nFunctions,\" Princeton University Cognitive Science Lab Report No. 42, in prep for J. of \nInt'l Neural Network Society. \nLane, S.H., Handelman, D.A. and Gelfand, J.J. (1991b) \"Higher-Order CMAC Neural \nNetworks-Theory and Practice,\" to appear Amer. Contr. Conf., Boston, MA, June 1991. \nLang, K.J. and Witbrock, M.J. (1988) \"Learning to Tell Two Spirals Apart,\" Proc. 1988 \nConnectionist Model Summer School, D. Touretzky, G. Hinton, and T. Sejnowski, Eds. \nMoody, J. and Darken, C. (1988) \"Learning with Localized Receptive Fields,\" Proc. \n1988 Connectionist Model Summer School, D. Touretzky, G. Hinton, and T. Sejnowski, Eds. \n", "award": [], "sourceid": 300, "authors": [{"given_name": "Stephen", "family_name": "Lane", "institution": null}, {"given_name": "Marshall", "family_name": "Flax", "institution": null}, {"given_name": "David", "family_name": "Handelman", "institution": null}, {"given_name": "Jack", "family_name": "Gelfand", "institution": null}]}