{"title": "Forward-backward retraining of recurrent neural networks", "book": "Advances in Neural Information Processing Systems", "page_first": 743, "page_last": 749, "abstract": null, "full_text": "Forward-backward retraining of recurrent \n\nneural networks \n\nAndrew Senior \u2022 \nTony Robinson \nCambridge University Engineering Department \n\nTrumpington Street, Cambridge, England \n\nAbstract \n\nThis paper describes the training of a recurrent neural network \nas the letter posterior probability estimator for a hidden Markov \nmodel, off-line handwriting recognition system. The network esti(cid:173)\nmates posterior distributions for each of a series of frames repre(cid:173)\nsenting sections of a handwritten word. The supervised training \nalgorithm, backpropagation through time, requires target outputs \nto be provided for each frame. Three methods for deriving these \ntargets are presented. A novel method based upon the forward(cid:173)\nbackward algorithm is found to result in the recognizer with the \nlowest error rate. \n\nIntroduction \n\n1 \nIn the field of off-line handwriting recognition, the goal is to read a handwritten \ndocument and produce a machine transcription. Such a system could be used \nfor a variety of purposes, from cheque processing and postal sorting to personal \ncorrespondence reading for the blind or historical document reading. In a previous \npublication (Senior 1994) we have described a system based on a recurrent neural \nnetwork (Robinson 1994) which can transcribe a handwritten document. \n\nThe recurrent neural network is used to estimate posterior probabilities for char(cid:173)\nacter classes, given frames of data which represent the handwritten word. These \nprobabilities are combined in a hidden Markov model framework, using the Viterbi \nalgorithm to find the most probable state sequence. \n\nTo train the network, a series of targets must be given. 
This paper describes three methods that have been used to derive these probabilities. The first is a naive bootstrap method, allocating equal lengths to all characters, used to start the training procedure. The second is a simple Viterbi-style segmentation method that assigns a single class label to each of the frames of data. Such a scheme has been used before in speech recognition using recurrent networks (Robinson 1994). This representation is found to inadequately represent some frames, which can represent two letters or the ligatures between letters. Thus, by analogy with the forward-backward algorithm (Rabiner and Juang 1986) for HMM speech recognizers, we have developed a forward-backward method for retraining the recurrent neural network. This assigns a probability distribution across the output classes for each frame of training data, and training on these 'soft labels' results in improved performance of the recognition system. \n\n*Now at IBM T. J. Watson Research Center, Yorktown Heights, NY 10598, USA. \n\nThis paper is organized in four sections. The following section outlines the system in which the neural network is used, then section 3 describes the recurrent network in more detail. Section 4 explains the different methods of target estimation and presents the results of experiments before conclusions are presented in the final section. \n\n2 System background \n\nThe recurrent network is the central part of the handwriting recognition system. The other parts are summarized here and described in more detail in another publication (Senior 1994). The first stage of processing converts the raw data into an invariant representation used as an input to the neural network. The network outputs are used to calculate word probabilities in a hidden Markov model. 
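\n\nAs a rough sketch of this last stage (an illustration only, not the authors' implementation: the function and variable names, the use of log probabilities, and the 0.5 transition probability are all assumptions), per-frame network outputs can be turned into a Viterbi word score like this:

```python
import numpy as np

def word_log_score(frame_logprobs, word, priors, self_lp=np.log(0.5)):
    """Viterbi log-score of one word model: one state per letter,
    with transitions only from a state to itself or to the next letter.

    frame_logprobs: (T, 26) log posteriors from the network, one row per frame.
    word:           the candidate word, e.g. 'butler'.
    priors:         (26,) letter prior probabilities.
    The posteriors are divided by the priors (subtracted in the log domain)
    to give scaled likelihoods; the 0.5 self-transition probability is an
    illustrative assumption, not a value from the paper.
    """
    states = [ord(c) - ord('a') for c in word]
    T, N = len(frame_logprobs), len(states)
    # Emission scores: log P(letter | x_t) - log P(letter), up to a constant.
    emis = frame_logprobs[:, states] - np.log(priors)[states]
    next_lp = np.log(1.0 - np.exp(self_lp))
    v = np.full(N, -np.inf)
    v[0] = emis[0, 0]                      # must start in the first letter
    for t in range(1, T):
        stay = v + self_lp                              # self-transition
        move = np.concatenate([[-np.inf], v[:-1] + next_lp])  # advance
        v = np.maximum(stay, move) + emis[t]
    return v[-1]                           # must end in the last letter
```

In recognition, every word model in the vocabulary would be scored this way and the highest-scoring word returned.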
\n\nFirst, the scanned page image is automatically segmented into words and then normalized. Normalization removes variations in the word appearance that do not affect its identity, such as rotation, scale, slant, slope and stroke thickness. The height of the letters forming the words is estimated, and magnification, shear and thinning transforms are applied, resulting in a more robust representation of the word. The normalized word is represented in a compact canonical form encoding both the shape and salient features. All those features falling within a narrow vertical strip across the word are termed a frame. The representation derived consists of around 80 values for each of the frames, denoted x_t. The T frames (x_1, ..., x_T) for a whole word are written x_1^T. Five frames would typically be enough to represent a single character. The recurrent network takes these frames sequentially and estimates the posterior character probability distribution given the data, P(λ_i | x_1^t), for each of the letters a, ..., z, denoted λ_0, ..., λ_25. These posterior probabilities are scaled by the prior class probabilities, and are treated as the emission probabilities in a hidden Markov model. \n\nA separate model is created for each word in the vocabulary, with one state per letter. Transitions are allowed only from a state to itself or to the next letter in the word. The set of states in the models is denoted Q = {q_1, ..., q_N} and the letter represented by q_i is given by L(q_i), L : Q -> {λ_0, ..., λ_25}. \n\nWord error rates are presented for experiments on a single-writer task tested with a 1330 word vocabulary(1). Statistical significance of the results is evaluated using Student's t-test, comparing word recognition rates taken from a number of networks trained under the same conditions but with different random initializations. 
The results of the t-test are written T(degrees of freedom), and the tabulated values t_significance(degrees of freedom). \n\n3 Recurrent networks \n\nThis section describes the recurrent error propagation network which has been used as the probability distribution estimator for the handwriting recognition system. Recurrent networks have been successfully applied to speech recognition (Robinson 1994) but have not previously been used for handwriting recognition, on-line or off-line. Here a left-to-right scanning process is adopted to map the frames of a word into a sequence, so adjacent frames are considered in consecutive instants. \n\n(1) The experimental data are available at ftp://svr-ftp.eng.cam.ac.uk/pub/data \n\nA recurrent network is well suited to the recognition of patterns occurring in a time-series because series of arbitrary length can be processed, with the same processing being performed on each section of the input stream. Thus a letter 'a' can be recognized by the same process, wherever it occurs in a word. In addition, internal 'state' units are available to encode multi-frame context information so letters spread over several frames can be recognized. \n\nFigure 1: A schematic of the recurrent error propagation network, showing input frames entering the input units, character probabilities read from the output units, and feedback units connected through a unit time delay. For clarity only a few of the units and links are shown. \n\nThe recurrent network architecture used here is a single layer of standard perceptrons with nonlinear activation functions. 
The output o_i of a unit i is a function of the inputs a_j and the network parameters, which are the weights of the links w_ij with a bias b_i: \n\nσ_i = b_i + Σ_k a_k w_ik,   (1) \no_i = f_i({σ_j}).   (2) \n\nThe network is fully connected - that is, each input is connected to every output. However, some of the input units receive no external input and are connected one-to-one to corresponding output units through a unit time-delay (figure 1). The remaining input units accept a single frame of parametrized input and the remaining 26 output units estimate letter probabilities for the 26 character classes. The feedback units have a standard sigmoid activation function (3), but the character outputs have a 'softmax' activation function (4): \n\nf_i(σ_i) = 1 / (1 + e^{-σ_i}),   (3) \nf_i({σ_j}) = e^{σ_i} / Σ_j e^{σ_j}.   (4) \n\nDuring recognition ('forward propagation'), the first frame is presented at the input and the feedback units are initialized to activations of 0.5. The outputs are calculated using (1) and (2) and read off for use in the Markov model. In the next iteration, the outputs of the feedback units are copied to the feedback inputs, and the next frame is presented to the inputs. Outputs are again calculated, and the cycle is repeated for each frame of input, with a probability distribution being generated for each frame. \n\nTo allow the network to assimilate context information, several frames of data are passed through the network before the probabilities for the first frame are read off, previous output probabilities being discarded. This input/output latency is maintained throughout the input sequence, with extra, empty frames of input being presented at the end to give probability distributions for the last frames of true inputs. A latency of two frames has been found to be most satisfactory in experiments to date. 
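\n\nAs an illustrative sketch of this forward propagation (not the authors' code: the array names and the feedback-layer size are assumptions, and the input/output latency mechanism is omitted for brevity):

```python
import numpy as np

def softmax(z):
    z = z - z.max()          # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def forward_pass(frames, W, b, n_feedback=64):
    """Single-layer recurrent forward pass in the style of eqs. (1)-(4).

    frames:     sequence of input frames, each ~80 values (x_t).
    W, b:       weights and biases of one fully connected layer mapping
                [frame; feedback] -> [26 character outputs; feedback].
    n_feedback: number of feedback ('state') units - an assumed size.
    Returns one 26-way probability distribution per frame.
    """
    feedback = np.full(n_feedback, 0.5)   # feedback units start at 0.5
    probs = []
    for x in frames:
        a = np.concatenate([x, feedback])            # all inputs a_j
        sigma = b + W @ a                            # eq. (1)
        char_out = softmax(sigma[:26])               # eq. (4): softmax outputs
        feedback = 1.0 / (1.0 + np.exp(-sigma[26:]))  # eq. (3): sigmoid feedback
        probs.append(char_out)                       # read off for the HMM
    return np.array(probs)
```

Each iteration copies the sigmoid feedback activations back to the inputs through the unit time delay, exactly as in the recognition cycle described above.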
\n\n3.1 Training \n\nTo be able to train the network, the target values d_j(t) desired for the outputs o_j(x_t), j = 0, ..., 25, for frame x_t must be specified. The target specification is dealt with in the next section. It is the discrepancy between the actual outputs and these targets which makes up the objective function to be maximized by adjusting the internal weights of the network. The usual objective function is the mean squared error, but here the relative entropy, G, of the target and output distributions is used: \n\nG = - Σ_t Σ_j d_j(t) log( d_j(t) / o_j(x_t) ).   (5) \n\nAt the end of a word, the errors between the network's outputs and the targets are propagated back using the generalized delta rule (Rumelhart et al. 1986) and changes to the network weights are calculated. The network at successive time steps is treated as adjacent layers of a multi-layer network. This process is generally known as 'back-propagation through time' (Werbos 1990). After processing T frames of data with an input/output latency, the network is equivalent to a (T + latency)-layer perceptron sharing weights between layers. For a detailed description of the training procedure, the reader is referred elsewhere (Rumelhart et al. 1986; Robinson 1994). \n\n4 Target re-estimation \n\nThe data used for training are only labelled by word. That is, each image represents a single word, whose identity is known, but the frames representing that word are not labelled to indicate which part of the word they represent. To train the network, a label for each frame's identity must be provided. Labels are indicated by the state s_t ∈ Q and the corresponding letter L(s_t) of which a frame x_t is part. \n\n4.1 A simple solution \n\nTo bootstrap the network, a naive method was used, which simply divided the word up into sections of equal length, one for each letter in the word. 
Thus, for an N-letter word of T frames, x_1^T, the first letter was assumed to be represented by frames x_1^{T/N}, the next by x_{T/N+1}^{2T/N}, and so on. The segmentation is mapped into a set of targets as follows: \n\nd_j(t) = 1 if L(s_t) = λ_j, and 0 otherwise.   (6) \n\nFigure 2a shows such a segmentation for a single word. Each line, representing d_j(t) for some j, has a broad peak for the frames representing letter λ_j. Such a segmentation is inaccurate, but can be improved by adding prior knowledge. It is clear that some letters are generally longer than others, and some shorter. By weighting letters according to their a priori lengths it is possible to give a better, but still very simple, segmentation. The letters 'i', 'l' are given a length of 1/2 and 'm', 'w' a length of 3/2 relative to other letters. Thus in the word 'wig', the first half of the frames would be assigned the label 'w', the next sixth the label 'i' and the last third the label 'g'. While this segmentation is constructed with no regard for the data being segmented, it is found to provide a good initial approximation from which it is possible to train the network to recognize words, albeit with high error rates. \n\n4.2 Viterbi re-estimation \n\nHaving trained the network to some accuracy, it can be used to calculate a good estimate of the probability of each frame belonging to any letter. The probability of any state sequence can then be calculated in the hidden Markov model, and the most likely state sequence through the correct word, S*, found using dynamic programming. This best state sequence S* represents a new segmentation giving a label for each frame. For a network which models the probability distributions well, this segmentation will be better than the automatic segmentation of section 4.1. \n\nFigure 2: Segmentations of the word 'butler'. 
Each line shows the target for one letter λ_i, and is high for frame t when L(s_t) = λ_i. (a) is the equal-length segmentation discussed in section 4.1, (b) is a segmentation of an untrained network, and (c) is the segmentation re-estimated with a trained network. \n\nThe re-estimated segmentation is better since it takes the data into account. Finding the most probable state sequence S* is termed a forced alignment. Since only the correct word model need be considered, such an alignment is faster than the search through the whole lexicon that is required for recognition. Training on this automatic segmentation gives a better recognition rate, but still avoids the necessity of manually segmenting any of the database. \n\nFigure 2 shows two Viterbi segmentations of the word 'butler'. First, figure 2b shows the segmentation arrived at by taking the most likely state sequence before training the network. Since the emission probability distributions are random, there is nothing to distinguish between the state sequences, except slight variations due to initial asymmetry in the network, so a poor segmentation results. After training the network (2c), the durations deviate from the prior assumed durations to match the observed data. This re-estimated segmentation represents the data more accurately, so gives better targets towards which to train. A further improvement in recognition accuracy can be obtained by using the targets determined by the re-estimated segmentation. This cycle can be repeated until the segmentations do not change and performance ceases to improve. For speed, the network is not trained to convergence at each iteration. \n\nIt can be shown (Santini and Del Bimbo 1995) that, assuming that the network has enough parameters, the network outputs after convergence will approximate the posterior probabilities P(λ_i | x_1^t). Further, the approximation P(λ_i | x_1^t) ≈ P(λ_i | x_t) is made. 
The posteriors are scaled by the class priors P(λ_i) (Bourlard and Morgan 1993), and these scaled posteriors are used in the hidden Markov model in place of data likelihoods since, by Bayes' rule, \n\nP(x_t | λ_i) ∝ P(λ_i | x_t) / P(λ_i).   (7) \n\nTable 1 shows word recognition error rates for three 80-unit networks trained towards fixed targets estimated by another network, and then retrained, re-estimating the targets at each iteration. The retraining improves the recognition performance (T(2) = 3.91, t_.95(2) = 2.92). \n\nTable 1: Error rates for 3 networks with 80 units trained with fixed alignments, and retrained with re-estimated alignments. \n\nTraining method    Error (%): μ     σ \nFixed targets      21.2     1.73 \nRetraining         17.0     0.68 \n\n4.3 Forward-backward re-estimation \n\nThe system described above performs well and is the method used in previous recurrent network systems, but examining the speech recognition literature, a potential method of improvement can be seen. Viterbi frame alignment has so far been used to determine targets for training. This assigns one class to each frame, based on the most likely state sequence. A better approach might be to allow a distribution across all the classes, indicating which are likely and which are not, avoiding a 'hard' classification at points where a frame may indeed represent more than one class (such as where slanting characters overlap), or none (as in a ligature). A 'soft' classification would give a more accurate portrayal of the frame identities. Such a distribution, γ_p(t) = P(s_t = q_p | x_1^T, W), can be calculated with the forward-backward algorithm (Rabiner and Juang 1986). 
To obtain γ_p(t), the forward probabilities α_p(t) = P(s_t = q_p, x_1^t) must be combined with the backward probabilities β_p(t) = P(x_{t+1}^T | s_t = q_p). The forward and backward probabilities are calculated recursively in the same manner: \n\nα_r(t+1) = Σ_p α_p(t) P(x_t | L(q_p)) a_{p,r},   (8) \nβ_p(t-1) = Σ_r a_{p,r} P(x_t | L(q_r)) β_r(t).   (9) \n\nSuitable initial distributions α_r(0) = π_r and β_r(T+1) = ρ_r are chosen; e.g. π and ρ are one for, respectively, the first and last character in the word, and zero for the others. The likelihood of observing the data x_1^T and being in state q_p at time t is then given by: \n\nξ_p(t) = α_p(t) β_p(t).   (10) \n\nThen the probabilities γ_p(t) of being in state q_p at time t are obtained by normalization and used as the targets d_j(t) for the recurrent network character probability outputs: \n\nγ_p(t) = ξ_p(t) / Σ_r ξ_r(t),   (11) \nd_j(t) = Σ_{p: L(q_p) = λ_j} γ_p(t).   (12) \n\nFigure 3a shows the initial estimate of the class probabilities for a sample of the word 'butler'. The probabilities shown are those estimated by the forward-backward algorithm when using an untrained network, for which the P(x_t | s_t = q_p) will be independent of class. Despite the lack of information, the probability distributions can be seen to take reasonable shapes. The first frame must belong to the first letter, and the last frame must belong to the last letter, of course, but it can also be seen that half way through the word, the most likely letters are those in the middle of the word. Several class probabilities are non-zero at a time, reflecting the uncertainty caused since the network is untrained. Nevertheless, this limited information is enough to train a recurrent network, because as the network begins to approximate these probabilities, the segmentations become more definite. 
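\n\nThe soft-target computation of equations (10)-(12) can be sketched in code as follows. This is an illustration under assumptions, not the authors' implementation: it uses the standard forward-backward indexing convention, and the self/advance transition probabilities (0.5 each) are invented for the example.

```python
import numpy as np

def soft_targets(emis, letters, n_letters=26, self_p=0.5):
    """Forward-backward soft targets for a left-to-right word model.

    emis:    (T, N) scaled likelihoods P(x_t | q_p), one column per
             letter-state of the word (e.g. 6 states for 'butler').
    letters: length-N list mapping each state p to its letter index L(q_p).
    Returns (T, n_letters) targets d_j(t) as in eqs. (10)-(12).
    """
    T, N = emis.shape
    A = np.zeros((N, N))                     # transitions: self or next only
    for p in range(N):
        A[p, p] = self_p                     # assumed value, not from paper
        if p + 1 < N:
            A[p, p + 1] = 1.0 - self_p
    alpha = np.zeros((T, N))
    beta = np.zeros((T, N))
    alpha[0, 0] = emis[0, 0]                 # pi: must start in first letter
    for t in range(1, T):
        alpha[t] = (alpha[t - 1] @ A) * emis[t]
    beta[T - 1, N - 1] = 1.0                 # rho: must end in last letter
    for t in range(T - 2, -1, -1):
        beta[t] = A @ (beta[t + 1] * emis[t + 1])
    xi = alpha * beta                        # eq. (10)
    gamma = xi / xi.sum(axis=1, keepdims=True)   # eq. (11): normalize
    d = np.zeros((T, n_letters))
    for p, j in enumerate(letters):
        d[:, j] += gamma[:, p]               # eq. (12): fold states to letters
    return d
```

The returned rows d_j(t) sum to one and play the role of the 'soft labels' used as network training targets.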
In contrast, using Viterbi segmentations from an untrained network, the most likely alignment can be very different from the true alignment (figure 2b). The segmentation is very definite though, and the network is trained towards the incorrect targets, reinforcing its error. Finally, a trained network gives a much more rigid segmentation (figure 3b), with most of the probabilities being zero or one, but with a boundary of uncertainty at the transitions between letters. This uncertainty, where a frame might truly represent parts of two letters, or a ligature between two, represents the data better. Just as with Viterbi training, the segmentations can be re-estimated after training, and retraining results in improved performance. The final probabilistic segmentation can be stored with the data and used when subsequent networks are trained on the same data. Training is then significantly quicker than when training towards the approximate bootstrap segmentations and re-estimating the targets. \n\nFigure 3: Forward-backward segmentations of the word 'butler'. (a) is the segmentation of an untrained network with a uniform class prior. (b) shows the segmentation after training. \n\nThe better models obtained with the forward-backward algorithm give improved recognition results over a network trained with Viterbi alignments. The improvement is shown in table 2. It can be seen that the error rates for the networks trained with forward-backward targets are lower than those trained on Viterbi targets (T(2) = 5.24, t_.975(2) = 4.30). \n\nTable 2: Error rates for networks with 80 units trained with Viterbi or Forward-Backward alignments. \n\nTraining method     Error (%): μ     σ \nViterbi             17.0     0.68 \nForward-Backward    15.4     0.74 \n\n5 Conclusions \n\nThis paper has reviewed the training methods used for a recurrent network, applied to the problem of off-line handwriting recognition. Three methods of deriving target probabilities for the network have been described, and experiments conducted using all three. The third method is that of the forward-backward procedure, which has not previously been applied to recurrent neural network training. This method is found to improve the performance of the network, leading to reduced word error rates. Other improvements not detailed here (including duration models and stochastic language modelling) allow the error rate for this task to be brought below 10%. \n\nAcknowledgments \n\nThe authors would like to thank Mike Hochberg for assistance in preparing this paper. \n\nReferences \n\nBOURLARD, H. and MORGAN, N. (1993) Connectionist Speech Recognition: A Hybrid Approach. Kluwer. \nRABINER, L. R. and JUANG, B. H. (1986) An introduction to hidden Markov models. IEEE ASSP Magazine 3 (1): 4-16. \nROBINSON, A. (1994) The application of recurrent nets to phone probability estimation. IEEE Transactions on Neural Networks. \nRUMELHART, D. E., HINTON, G. E. and WILLIAMS, R. J. (1986) Learning internal representations by error propagation. In Parallel Distributed Processing: Explorations in the Microstructure of Cognition, ed. by D. E. Rumelhart and J. L. McClelland, volume 1, chapter 8, pp. 318-362. Bradford Books. \nSANTINI, S. and DEL BIMBO, A. (1995) Recurrent neural networks can be trained to be maximum a posteriori probability classifiers. Neural Networks 8 (1): 25-29. \nSENIOR, A. W. (1994) Off-line Cursive Handwriting Recognition using Recurrent Neural Networks. Cambridge University Engineering Department Ph.D. thesis. URL: ftp://svr-ftp.eng.cam.ac.uk/pub/reports/senior_thesis.ps.gz. \nWERBOS, P. J. 
(1990) Backpropagation through time: What it does and how to do it. Proceedings of the IEEE 78: 1550-1560.", "award": [], "sourceid": 1056, "authors": [{"given_name": "Andrew", "family_name": "Senior", "institution": null}, {"given_name": "Anthony", "family_name": "Robinson", "institution": null}]}