{"title": "Neural Network Weight Matrix Synthesis Using Optimal Control Techniques", "book": "Advances in Neural Information Processing Systems", "page_first": 348, "page_last": 354, "abstract": null, "full_text": "348 \n\nFarotimi, Demho and Kailath \n\nNeural Network Weight Matrix Synthesis Using \n\nOptimal Control Techniques \n\nO. Farotimi \n\nA. Dembo \n\nInformation Systems Lab. \nElectrical Engineering Dept. \n\nStanford University, \nStanford, CA 94305 \n\nABSTRACT \n\nT. Kailath \n\nGiven a set of input-output training samples, we describe a proce(cid:173)\ndure for determining the time sequence of weights for a dynamic \nneural network to model an arbitrary input-output process. We \nformulate the input-output mapping problem as an optimal con(cid:173)\ntrol problem, defining a performance index to be minimized as a \nfunction of time-varying weights. We solve the resulting nonlin(cid:173)\near two-point-boundary-value problem, and this yields the training \nrule. For the performance index chosen, this rule turns out to be a \ncontinuous time generalization of the outer product rule earlier sug(cid:173)\ngested heuristically by Hopfield for designing associative memories. \nLearning curves for the new technique are presented. \n\nINTRODUCTION \n\n1 \nSuppose that we desire to model as best as possible some unknown map 4> : u -\nV, where U, V ~ nn. One way we might go about doing this is to collect as many \ninput-output samples {(9in, 90ud : 4>(9in ) = 9 0ud as possible and \"find\" some func-\ntion f : U - V such that a suitable distance metric d(f( z(t)), 4>(z(t)))I ZE{9 ... :4>c9 ... )=9 o .. d \nis minimized. \n\nIn the foregoing, we assume a system of ordinary differential equations motivated by \ndynamic neural network structures[l] [2]. In particular we set up an n-dimensional \n\n\fNeural Network Weight Matrix Synthesis \n\n349 \n\nneural network; call it N. 
Our goal is to synthesize a possibly time-varying weight matrix for N such that, for initial conditions z(t₀), the input-output transformation, or flow f : z(t₀) → f(z(t₁)), associated with N closely approximates the desired map φ. \n\nFor the purposes of synthesizing the weight program for N, we consider another system, say S, a formal nL-dimensional system of differential equations comprising L n-dimensional subsystems. The subsystems are identical and decoupled, except that all L of them are constrained to have the same weight matrix. We use this system to determine the optimal weight program given L input-output samples. The resulting time program of weights is then applied to the original n-dimensional system N during normal operation. We emphasize the difference between this scheme and a simple L-fold replication of N: the latter would yield a practically unwieldy nL × nL weight matrix sequence, and in fact would generally not discover the underlying map from U to V, discovering instead a different map for each input-output sample pair. By constraining the weight matrix sequence to be an identical n × n matrix for each subsystem during this synthesis phase, our scheme in essence forces the weight sequence to capture some underlying relationship between all the input-output pairs. This is arguably the best estimate of the map given the information we have. \n\nUsing formal optimal control techniques [3], we set up a performance index to maximize the correlation between the system S output and the desired output. This optimization leads in general to a nonlinear two-point-boundary-value problem that is not usually solvable analytically. For this particular performance index we are able to derive an analytical solution to the optimization problem. 
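The constrained replication that defines S can be made concrete in a few lines: the nL × nL weight matrix of S is block-diagonal with L identical n × n blocks. A minimal sketch, with the function name and toy matrix as our own illustrative choices:

```python
import numpy as np

def shared_block_diag(W, L):
    """Build the nL x nL weight matrix of the formal system S:
    L identical, decoupled copies of the n x n matrix W on the diagonal."""
    n = W.shape[0]
    Ws = np.zeros((n * L, n * L))
    for k in range(L):
        Ws[k * n:(k + 1) * n, k * n:(k + 1) * n] = W
    return Ws

# Illustration: a hypothetical 2-neuron network replicated for L = 3 samples.
W = np.array([[0.0, 1.0],
              [-1.0, 0.0]])
Ws = shared_block_diag(W, 3)
```

Because every block is the same W, the synthesis phase has only n × n free parameters per time step, not nL × nL, which is exactly the constraint that forces a single shared map across all L sample pairs.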
\nThe optimal interconnection matrix at each time is the sum, over all samples, of the outer products between each desired output n-vector and the corresponding subsystem output. At the end of this synthesis procedure, the weight matrix sequence represents an optimal time-varying program for the weights of the n-dimensional neural network N that will approximate φ : U → V. \n\nWe remark that in the ideal case, the weight matrix at the final time (i.e., one element of the time sequence) corresponds to the symmetric matrix suggested empirically by Hopfield for associative memory applications [4]. It becomes clear that the Hopfield matrix is suboptimal for associative memory, being just one point on the optimal weight trajectory; it is optimal only in the special case where the initial conditions coincide exactly with the desired output. \n\nIn Section 2 we outline the mathematical formulation and solution of the synthesis technique, and in Section 3 we present the learning curves. The learning curves also yield the system performance over the training samples, and we compare this performance to that of the outer product rule. In Section 4 we give concluding remarks and directions for our future work. \n\nAlthough the results here are derived for a specific case of the neuron state equation and a specific choice of performance index, in further work we have extended the results to very general state equations and performance indices. \n\n2 SYNTHESIS OF WEIGHT MATRIX TIME SEQUENCE \n\nSuppose we have a training set consisting of L pairs of n-dimensional vectors (θ̂_i^{(r)}, θ_i^{(r)}), r = 1, 2, …, L, i = 1, 2, …, n. For example, in an autoassociative system in which we desire to store θ_i^{(r)}, r = 1, 2, …, L, i = 1, 2, …, n, we can choose the θ̂_i^{(r)} to be sample points in the neighborhood of θ_i^{(r)} in n-dimensional space. The idea here is that by training the network to map samples in the neighborhood of an exemplar to the exemplar, it will have developed a map that can smoothly interpolate (or generalize) to other points around the exemplar that may not be in the training set. In this paper we deal with the issue of finding the weight matrix that transforms the neural network dynamics into such a map. We demonstrate through simulation results that such a map can be achieved. For autoassociation, using error vectors drawn from the training set, we show that the method here performs better (in an error-correcting sense) than the outer product rule. We are still investigating the performance of the network in generalizing to samples outside the training set. \n\nWe construct an n-dimensional neural network system N to model the underlying input-output map according to \n\nN: ż(t) = −z(t) + W(t)g(z(t)),    (1) \n\nWe interpret z(t) as the neuron activation, g(z(t)) as the neuron output, and W(t) as the neural network weight matrix. \n\nTo determine the appropriate W(t), we define an nL-dimensional formal system of differential equations, S, \n\nS: ż_s(t) = −z_s(t) + W_s(t)g(z_s(t)), g(z_s(t₀)) = θ̂,    (2) \n\nformed by concatenating the equations for N L times. W_s(t) is block-diagonal with identical blocks W(t), θ is the concatenated vector of sample desired outputs, and θ̂ is the concatenated vector of sample inputs. \n\nThe performance index for S is \n\nmin J = min { −z_sᵀ(t₁)θ + ½ ∫_{t₀}^{t₁} ( −2 z_sᵀ(t)θ + βQ + β⁻¹ Σ_{j=1}^{n} w_jᵀ(t)w_j(t) ) dt }    (3) \n\nThe performance index is chosen to minimize the negative of the correlation between the (concatenated) neuron activation and the (concatenated) desired output vector, or equivalently to maximize the correlation between the activation and the desired output at the final time t₁ (the term −z_sᵀ(t₁)θ). 
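The neuron dynamics of equation (1) can be integrated numerically to see how a given weight program drives the state. A minimal forward-Euler sketch, where the choice g = tanh is our illustrative assumption (the paper only requires g to be C¹ differentiable), and the step size and horizon are arbitrary:

```python
import numpy as np

def simulate_network(W, z0, g=np.tanh, dt=0.01, T=5.0):
    """Forward-Euler integration of z_dot = -z + W g(z) (equation (1))
    with a constant weight matrix W, from initial activation z0 to time T.
    g = tanh is an illustrative C1 nonlinearity, not fixed by the paper."""
    z = z0.astype(float).copy()
    for _ in range(int(T / dt)):
        z = z + dt * (-z + W @ g(z))
    return z

# Sanity check: with W = 0 the dynamics reduce to z_dot = -z,
# so the activation decays toward the origin.
z_final = simulate_network(np.zeros((2, 2)), np.array([1.0, -1.0]))
```

A time-varying weight program W(t), as produced by the synthesis procedure, would simply replace the constant W with the current element of the weight sequence at each Euler step.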
Along the way from initial time t₀ to final time t₁, the term −z_sᵀ(t)θ under the integral penalizes decorrelation of the neuron activation from the desired output. Here w_j(t), j = 1, 2, …, n, are the rows of W(t), and β is a positive constant. The term β⁻¹ Σ_{j=1}^{n} w_jᵀ(t)w_j(t) effects a bound on the magnitude of the weights. The remaining term is \n\nQ(g(z(t))) = Σ_{j=1}^{n} Σ_{r=1}^{L} Σ_{u=1}^{n} Σ_{v=1}^{L} θ_j^{(r)} θ_j^{(v)} g(z_u^{(v)}) g(z_u^{(r)}), \n\nand its meaning will become clear when we examine the optimal path later. g(·) is assumed C¹ differentiable. \n\nProceeding formally [3], we define the Hamiltonian: \n\nH = ½ ( −2 z_sᵀ(t)θ + βQ + β⁻¹ Σ_{j=1}^{n} wⱼᵀ(t)wⱼ(t) ) + λᵀ(t)( −z_s(t) + W_s(t)g(z_s(t)) ) \n= ½ ( −2 z_sᵀ(t)θ + βQ + β⁻¹ Σ_{j=1}^{n} wⱼᵀ(t)wⱼ(t) ) − λᵀ(t)z_s(t) + Σ_{r=1}^{L} Σ_{j=1}^{n} λ_j^{(r)}(t) wⱼᵀ