{"title": "Training Multilayer Perceptrons with the Extended Kalman Algorithm", "book": "Advances in Neural Information Processing Systems", "page_first": 133, "page_last": 140, "abstract": "", "full_text": "TRAINING MULTILAYER PERCEPTRONS WITH THE EXTENDED KALMAN ALGORITHM\n\nSharad Singhal and Lance Wu\n\nBell Communications Research, Inc.\n\nMorristown, NJ 07960\n\nABSTRACT\n\nA large fraction of recent work in artificial neural nets uses multilayer perceptrons trained with the back-propagation algorithm described by Rumelhart et al. This algorithm converges slowly for large or complex problems such as speech recognition, where thousands of iterations may be needed for convergence even with small data sets. In this paper, we show that training multilayer perceptrons is an identification problem for a nonlinear dynamic system which can be solved using the Extended Kalman Algorithm. Although computationally complex, the Kalman algorithm usually converges in a few iterations. We describe the algorithm and compare it with back-propagation using two-dimensional examples.\n\nINTRODUCTION\n\nMultilayer perceptrons are one of the most popular artificial neural net structures being used today. In most applications, the \"back propagation\" algorithm [Rumelhart et al., 1986] is used to train these networks. Although this algorithm works well for small nets or simple problems, convergence is poor if the problem becomes complex or the number of nodes in the network becomes large [Waibel et al., 1987]. In problems such as speech recognition, tens of thousands of iterations may be required for convergence even with relatively small data sets. Thus there is much interest [Prager and Fallside, 1988; Irie and Miyake, 1988] in other \"training algorithms\" which can compute the parameters faster than back-propagation and/or can handle much more complex problems. 
\n\nIn this paper, we show that training multilayer perceptrons can be viewed as an identification problem for a nonlinear dynamic system. For linear dynamic systems with white input and observation noise, the Kalman algorithm [Kalman, 1960] is known to be an optimum algorithm. Extended versions of the Kalman algorithm can be applied to nonlinear dynamic systems by linearizing the system around the current estimate of the parameters. Although computationally complex, this algorithm updates parameters consistent with all previously seen data and usually converges in a few iterations. In the following sections, we describe how this algorithm can be applied to multilayer perceptrons and compare its performance with back-propagation using some two-dimensional examples.\n\nCopyright 1989, Bell Communications Research, Inc.\n\nTHE EXTENDED KALMAN FILTER\n\nIn this section we briefly outline the Extended Kalman filter. Mathematical derivations for the Extended Kalman filter are widely available in the literature [Anderson and Moore, 1979; Gelb, 1974] and are beyond the scope of this paper.\n\nConsider a nonlinear finite dimensional discrete time system of the form:\n\n$$\\begin{aligned} x(n+1) &= f_n(x(n)) + g_n(x(n))\\,w(n), \\\\ d(n) &= h_n(x(n)) + v(n). \\end{aligned} \\tag{1}$$\n\nHere the vector $x(n)$ is the state of the system at time $n$, $w(n)$ is the input, $d(n)$ is the observation, $v(n)$ is observation noise and $f_n(\\cdot)$, $g_n(\\cdot)$, and $h_n(\\cdot)$ are nonlinear vector functions of the state with the subscript denoting possible dependence on time. We assume that the initial state, $x(0)$, and the sequences $\\{v(n)\\}$ and $\\{w(n)\\}$ are independent and gaussian with\n\n$$\\begin{aligned} E[x(0)] &= \\bar{x}(0), & E\\{[x(0)-\\bar{x}(0)][x(0)-\\bar{x}(0)]^t\\} &= P(0), \\\\ E[w(n)] &= 0, & E[w(n)w^t(l)] &= Q(n)\\delta_{nl}, \\\\ E[v(n)] &= 0, & E[v(n)v^t(l)] &= R(n)\\delta_{nl}, \\end{aligned} \\tag{2}$$\n\nwhere $\\delta_{nl}$ is the Kronecker delta. Our problem is to find an estimate $\\hat{x}(n+1)$ of $x(n+1)$ given $d(j)$, $0 \\le j \\le n$. 
We denote this estimate by $\\hat{x}(n+1|n)$. If the nonlinearities in (1) are sufficiently smooth, we can expand them using Taylor series about the state estimates $\\hat{x}(n|n)$ and $\\hat{x}(n|n-1)$ to obtain\n\n$$\\begin{aligned} f_n(x(n)) &= f_n(\\hat{x}(n|n)) + F(n)[x(n)-\\hat{x}(n|n)] + \\dots \\\\ g_n(x(n)) &= g_n(\\hat{x}(n|n)) + \\dots = G(n) + \\dots \\\\ h_n(x(n)) &= h_n(\\hat{x}(n|n-1)) + H^t(n)[x(n)-\\hat{x}(n|n-1)] + \\dots \\end{aligned}$$\n\nwhere\n\n$$G(n) = g_n(\\hat{x}(n|n)), \\quad F(n) = \\left.\\frac{\\partial f_n(x)}{\\partial x}\\right|_{x=\\hat{x}(n|n)}, \\quad H^t(n) = \\left.\\frac{\\partial h_n(x)}{\\partial x}\\right|_{x=\\hat{x}(n|n-1)}, \\tag{3}$$\n\ni.e. $G(n)$ is the value of the function $g_n(\\cdot)$ at $\\hat{x}(n|n)$ and the $ij$th components of $F(n)$ and $H^t(n)$ are the partial derivatives of the $i$th components of $f_n(\\cdot)$ and $h_n(\\cdot)$ respectively with respect to the $j$th component of $x(n)$ at the points indicated. Neglecting higher order terms and assuming knowledge of $\\hat{x}(n|n)$ and $\\hat{x}(n|n-1)$, the system in (3) can be approximated as\n\n$$\\begin{aligned} x(n+1) &= F(n)x(n) + G(n)w(n) + u(n), \\quad n > 0 \\\\ z(n) &= H^t(n)x(n) + v(n) + y(n), \\end{aligned} \\tag{4}$$\n\nwhere\n\n$$\\begin{aligned} u(n) &= f_n(\\hat{x}(n|n)) - F(n)\\hat{x}(n|n) \\\\ y(n) &= h_n(\\hat{x}(n|n-1)) - H^t(n)\\hat{x}(n|n-1). \\end{aligned} \\tag{5}$$\n\nIt can be shown [Anderson and Moore, 1979] that the desired estimate $\\hat{x}(n+1|n)$ can be obtained by the recursion\n\n$$\\hat{x}(n+1|n) = f_n(\\hat{x}(n|n)) \\tag{6}$$\n$$\\hat{x}(n|n) = \\hat{x}(n|n-1) + K(n)[d(n) - h_n(\\hat{x}(n|n-1))] \\tag{7}$$\n$$K(n) = P(n|n-1)H(n)[R(n) + H^t(n)P(n|n-1)H(n)]^{-1} \\tag{8}$$\n$$P(n+1|n) = F(n)P(n|n)F^t(n) + G(n)Q(n)G^t(n) \\tag{9}$$\n$$P(n|n) = P(n|n-1) - K(n)H^t(n)P(n|n-1) \\tag{10}$$\n\nwith $P(1|0) = P(0)$. $K(n)$ is known as the Kalman gain. In case of a linear system, it can be shown that $P(n)$ is the conditional error covariance matrix associated with the state and the estimate $\\hat{x}(n+1|n)$ is optimal in the sense that it approaches the conditional mean $E[x(n+1) \\mid d(0) \\dots d(n)]$ for large $n$. 
However, for nonlinear systems, the filter is not optimal and the estimates can only loosely be termed conditional means.\n\nTRAINING MULTILAYER PERCEPTRONS\n\nThe network under consideration is an $L$ layer perceptron¹ with the $i$th input of the $k$th weight layer labeled as $z_i^{k-1}(n)$, the $j$th output being $z_j^k(n)$ and the weight connecting the $i$th input to the $j$th output being $\\theta^k_{i,j}$. We assume that the net has $m$ inputs and $l$ outputs. Thresholds are implemented as weights connected from input nodes² with fixed unit strength inputs. Thus, if there are $N(k)$ nodes in the $k$th node layer, the total number of weights in the system is\n\n$$M = \\sum_{k=1}^{L} N(k-1)[N(k)-1]. \\tag{11}$$\n\nAlthough the inputs and outputs are dependent on time $n$, for notational brevity, we will not show this dependence unless explicitly needed.\n\n1. We use the convention that the number of layers is equal to the number of weight layers. Thus we have $L$ layers of weights labeled $1 \\dots L$ and $L+1$ layers of nodes (including the input and output nodes) labeled $0 \\dots L$. We will refer to the $k$th weight layer or the $k$th node layer unless the context is clear.\n\n2. We adopt the convention that the 1st input node is the threshold, i.e. $\\theta^k_{1,j}$ is the threshold for the $j$th output node from the $k$th weight layer.\n\nIn order to cast the problem in a form for recursive estimation, we let the weights in the network constitute the state $x$ of the nonlinear system, i.e.\n\n$$x = [\\theta^1_{1,2}, \\theta^1_{1,3}, \\dots, \\theta^L_{N(L-1),N(L)}]^t. \\tag{12}$$\n\nThe vector $x$ thus consists of all weights arranged in a linear array with dimension equal to the total number of weights $M$ in the system. 
The system model thus is\n\n$$x(n+1) = x(n), \\quad n > 0, \\tag{13}$$\n$$d(n) = z^L(n) + v(n) = h_n(x(n), z^0(n)) + v(n), \\tag{14}$$\n\nwhere at time $n$, $z^0(n)$ is the input vector from the training set, $d(n)$ is the corresponding desired output vector, and $z^L(n)$ is the output vector produced by the net. The components of $h_n(\\cdot)$ define the nonlinear relationships between the inputs, weights and outputs of the net. If $r(\\cdot)$ is the nonlinearity used, then $z^L(n) = h_n(x(n), z^0(n))$ is given by\n\n$$z^L(n) = r\\{(\\theta^L)^t\\, r\\{(\\theta^{L-1})^t\\, r \\cdots r\\{(\\theta^1)^t z^0(n)\\} \\cdots \\}\\}, \\tag{15}$$\n\nwhere $r$ applies componentwise to vector arguments. Note that the input vectors appear only implicitly through the observation function $h_n(\\cdot)$ in (14). The initial state (before training) $x(0)$ of the network is defined by populating the net with gaussian random variables with an $N(\\bar{x}(0), P(0))$ distribution where $\\bar{x}(0)$ and $P(0)$ reflect any a priori knowledge about the weights. In the absence of any such knowledge, an $N(0, \\frac{1}{\\epsilon} I)$ distribution can be used, where $\\epsilon$ is a small number and $I$ is the identity matrix. For the system in (13) and (14), the extended Kalman filter recursion simplifies to\n\n$$\\hat{x}(n+1) = \\hat{x}(n) + K(n)[d(n) - h_n(\\hat{x}(n), z^0(n))] \\tag{16}$$\n$$K(n) = P(n)H(n)[R(n) + H^t(n)P(n)H(n)]^{-1} \\tag{17}$$\n$$P(n+1) = P(n) - K(n)H^t(n)P(n) \\tag{18}$$\n\nwhere $P(n)$ is the (approximate) conditional error covariance matrix.\n\nNote that (16) is similar to the weight update equation in back-propagation with the last term $[d - h_n(\\hat{x}, z^0)]$ being the error at the output layer. However, unlike the delta rule used in back-propagation, this error is propagated to the weights through the Kalman gain $K(n)$ which updates each weight through the entire gradient matrix $H(n)$ and the conditional error covariance matrix $P(n)$. In this sense, the Kalman algorithm is not a local training algorithm. 
However, the inversion required in (17) has dimension equal to the number of outputs $l$, not the number of weights $M$, and thus does not grow as weights are added to the problem.\n\nEXAMPLES AND RESULTS\n\nTo evaluate the output and the convergence properties of the extended Kalman algorithm, we constructed mappings using two-dimensional inputs with two or four outputs as shown in Fig. 1. Limiting the input vector to 2 dimensions allows us to visualize the decision regions obtained by the net and to examine the outputs of any node in the net in a meaningful way. The x- and y-axes in Fig. 1 represent the two inputs, with the origin located at the center of the figures. The numbers in the figures represent the different output classes.\n\nFigure 1. Output decision regions for two problems: (a) REGIONS, (b) XOR.\n\nThe training set for each example consisted of 1000 random vectors uniformly filling the region. The hyperbolic tangent nonlinearity was used as the nonlinear element in the networks. The output corresponding to a class was set to 0.9 when the input vector belonged to that class, and to -0.9 otherwise. During training, the weights were adjusted after each data vector was presented. Up to 2000 sweeps through the input data were used with the stopping criteria described below to examine the convergence properties. The order in which data vectors were presented was randomized for each sweep through the data. In case of back-propagation, a convergence constant of 0.1 was used with no \"momentum\" factor. In the Kalman algorithm $R$ was set to $I \\cdot e^{-k/50}$, where $k$ was the iteration number through the data. Within each iteration, $R$ was held constant.\n\nThe Stopping Criteria\n\nTraining was considered complete if any one of the following conditions was satisfied:\n\na. 
2000 sweeps through the input data were used,\n\nb. the RMS (root mean squared) error at the output averaged over all training data during a sweep fell below a threshold $t_1$, or\n\nc. the error reduction $\\delta$ after the $i$th sweep through the data fell below a threshold $t_2$, where $\\delta_i = \\beta\\delta_{i-1} + (1-\\beta)|e_i - e_{i-1}|$. Here $\\beta$ is some positive constant less than unity, and $e_i$ is the error defined in b.\n\nIn our simulations we set $\\beta = 0.97$, $t_1 = 10^{-2}$ and $t_2 = 10^{-5}$.\n\nExample 1 - Meshed, Disconnected Regions:\n\nFigure 1(a) shows the mapping with 2 disconnected, meshed regions surrounded by two regions that fill up the space. We used 3-layer perceptrons with 10 hidden nodes in each hidden layer. Figure 2 shows the RMS error obtained during training for the Kalman algorithm and back-propagation averaged over 10 different initial conditions. The number of sweeps through the data (x-axis) is plotted on a logarithmic scale to highlight the initial reduction for the Kalman algorithm. Typical solutions obtained by the algorithms at termination are shown in Fig. 3. It can be seen that the Kalman algorithm converges in fewer iterations than back-propagation and obtains better solutions.\n\nFigure 2. Average output error during training for Regions problem using the Kalman algorithm and backprop. (Axes: Average RMS Error versus No. of Iterations; curves: backprop, Kalman.)\n\nFigure 3. Typical solutions for Regions problem using (a) Kalman algorithm and (b) backprop.\n\nExample 2 - 2 Input XOR:\n\nFigure 1(b) shows a generalized 2-input XOR with the first and third quadrants forming region 1 and the second and fourth quadrants forming region 2. 
We attempted the problem with two layer networks containing 2-4 nodes in the hidden layer. Figure 4 shows the results of training averaged over 10 different randomly chosen initial conditions. As the number of nodes in the hidden layer is increased, the net converges to smaller error values. When we examined the output decision regions, we found that none of the nets attempted with back-propagation reached the desired solution. The Kalman algorithm was also unable to find the desired solution with 2 hidden nodes in the network. However, it reached the desired solution with 6 out of 10 initial conditions with 3 hidden nodes in the network and 9 out of 10 initial conditions with 4 hidden nodes. Typical solutions reached by the two algorithms are shown in Fig. 5. In all cases, the Kalman algorithm converged in fewer iterations and in all but one case, the final average output error was smaller with the Kalman algorithm.\n\nFigure 4. Average output error during training for XOR problem using the Kalman algorithm and backprop. (Axes: Average RMS Error versus No. of Iterations; labeled curves: Kalman 3 nodes, Kalman 4 nodes.)\n\nFigure 5. Typical solutions for XOR problem using (a) Kalman algorithm and (b) backprop.\n\nCONCLUSIONS\n\nIn this paper, we showed that training feed-forward nets can be viewed as a system identification problem for a nonlinear dynamic system. For linear dynamic systems, the Kalman filter is known to produce an optimal estimator. Extended versions of the Kalman algorithm can be used to train feed-forward networks. We examined the performance of the Kalman algorithm using artificially constructed examples with two inputs and found that the algorithm typically converges in a few iterations. We also used back-propagation on the same examples and found that invariably, the Kalman algorithm converged in fewer iterations. For the XOR problem, back-propagation failed to converge on any of the cases considered while the Kalman algorithm was able to find solutions with the same network configurations.\n\nReferences\n\n[1] B. D. O. Anderson and J. B. Moore, Optimal Filtering, Prentice Hall, 1979.\n\n[2] A. Gelb, Ed., Applied Optimal Estimation, MIT Press, 1974.\n\n[3] B. Irie and S. Miyake, \"Capabilities of Three-layered Perceptrons,\" Proceedings of the IEEE International Conference on Neural Networks, San Diego, June 1988, Vol. I, pp. 641-648.\n\n[4] R. E. Kalman, \"A New Approach to Linear Filtering and Prediction Problems,\" J. Basic Eng., Trans. ASME, Series D, Vol. 82, No. 1, 1960, pp. 35-45.\n\n[5] R. W. Prager and F. Fallside, \"The Modified Kanerva Model for Automatic Speech Recognition,\" in 1988 IEEE Workshop on Speech Recognition, Arden House, Harriman NY, May 31-June 3, 1988.\n\n[6] D. E. Rumelhart, G. E. Hinton and R. J. Williams, \"Learning Internal Representations by Error Propagation,\" in D. E. Rumelhart and J. L. McClelland (Eds.), Parallel Distributed Processing: Explorations in the Microstructure of Cognition. Vol 1: Foundations. MIT Press, 1986.\n\n[7] A. Waibel, T. Hanazawa, G. Hinton, K. Shikano and K. Lang, \"Phoneme Recognition Using Time-Delay Neural Networks,\" ATR internal Report TR-I-0006, October 30, 1987.\n", "award": [], "sourceid": 101, "authors": [{"given_name": "Sharad", "family_name": "Singhal", "institution": null}, {"given_name": "Lance", "family_name": "Wu", "institution": null}]}