{"title": "Handwritten Word Recognition using Contextual Hybrid Radial Basis Function Network/Hidden Markov Models", "book": "Advances in Neural Information Processing Systems", "page_first": 764, "page_last": 770, "abstract": null, "full_text": "Handwritten Word Recognition using Contextual Hybrid Radial Basis Function Network/Hidden Markov Models\n\nBernard Lemarie\nLa Poste/SRTP\n10, Rue de l'Ile-Mabon\nF-44063 Nantes Cedex France\nlemarie@srtp.srt-poste.fr\n\nMichel Gilloux\nLa Poste/SRTP\n10, Rue de l'Ile-Mabon\nF-44063 Nantes Cedex France\ngilloux@srtp.srt-poste.fr\n\nManuel Leroux\nLa Poste/SRTP\n10, Rue de l'Ile-Mabon\nF-44063 Nantes Cedex France\nleroux@srtp.srt-poste.fr\n\nAbstract\n\nA hybrid and contextual radial basis function network/hidden Markov model off-line handwritten word recognition system is presented. The task assigned to the radial basis function networks is the estimation of the emission probabilities associated to Markov states. The model is contextual because the estimation of emission probabilities takes into account the left context of the current image segment, as represented by its predecessor in the sequence. The new system does not outperform the previous system without context but acts differently.\n\n1 INTRODUCTION\n\nHidden Markov models (HMMs) are now commonly used in off-line recognition of handwritten words (Chen et al., 1994) (Gilloux et al., 1993) (Gilloux et al., 1995a). In some of these approaches (Gilloux et al., 1993), word images are transformed into sequences of image segments through an explicit segmentation procedure. These segments are passed on to a module which is in charge of estimating the probability for each segment to appear when the corresponding hidden state is some state s (state emission probabilities). Model probabilities are generally optimized for the Maximum Likelihood Estimation (MLE) criterion.
\n\nMLE training is known to be sub-optimal with respect to discrimination ability when the underlying model is not the true model for the data. Moreover, estimating the emission probabilities in regions where examples are sparse is difficult, and the estimates may not be accurate. To reduce the risk of over-training, image segments consisting of bitmaps are often replaced by feature vectors of reasonable length (Chen et al., 1994) or even discrete symbols (Gilloux et al., 1993).\n\nIn a previous paper (Gilloux et al., 1995b) we described a hybrid HMM/radial basis function system in which emission probabilities are computed from full-fledged bitmaps through the use of a radial basis function (RBF) neural network. This system demonstrated better recognition rates than a previous one based on symbolic features (Gilloux et al., 1995b). Yet, many misclassification examples showed that some of the simplifying assumptions made in HMMs were responsible for a significant part of these errors. In particular, we observed that considering each segment independently of its neighbours hurts the accuracy of the model. For example, figure 1 shows examples of the letter a when it is segmented in two parts. The two parts are obviously correlated.\n\nFigure 1: Examples of segmented a.\n\nWe propose a new variant of the hybrid HMM/RBF model in which emission probabilities are estimated by taking into account the context of the current segment. The context is represented by the preceding image segment in the sequence.\n\nThe RBF model was chosen because it has proven to be an efficient model for recognizing isolated digits or letters (Poggio & Girosi, 1990) (Lemarie, 1993).
Interestingly enough, RBFs bear close relationships with the Gaussian mixtures often used to model emission probabilities in Markovian contexts. Their advantage lies in the fact that they do not directly estimate emission probabilities, and are thus less prone to estimation errors in sparse regions. They are also trained with the Mean Square Error (MSE) criterion, which makes them more discriminant.\n\nThe idea of using a neural net, and in particular an RBF, in conjunction with an HMM is not new. In (Singer & Lippmann, 1992) it was applied to a speech recognition task. The use of context to improve emission probabilities was proposed in (Bourlard & Morgan, 1993) through a discrete set of context events; several neural networks are there used to estimate various relations between states, context events and the current segment. Our point is to propose a different method, without discrete context, based on an adapted decomposition of the HMM likelihood estimation. This model is then applied to off-line handwritten word recognition.\n\nThe organization of this paper is as follows. Section 2 is an overview of the architecture of our HMM. Section 3 describes the justification for using RBF outputs in a contextual hidden Markov model. Section 4 describes the radial basis function network recognizer. Section 5 reports on an experiment in which the contextual model is applied to the recognition of handwritten words found on French bank or postal cheques.\n\n2 OVERVIEW OF THE HIDDEN MARKOV MODEL\n\nIn an HMM (Bahl et al., 1983), the recognition scores associated to words w are likelihoods\n\nL(w, i_1 ... i_n) = P(i_1 ... i_n | w) × P(w)\n\nin which the first term of the product encodes the probability with which the model of each word w generates some image (some sequence of image segments) i_1 ... i_n. In the HMM paradigm, this term is decomposed into a sum over all paths (i.e. sequences of hidden states) of the probability of the hidden path times the probability that the path generated the image sequence:\n\nP(i_1 ... i_n | w) = Σ_{s_1 ... s_n} P(s_1 ... s_n | w) × P(i_1 ... i_n | s_1 ... s_n)\n\nIt is often assumed that only one path contributes significantly to this term, so that\n\nP(i_1 ... i_n | w) ≈ max_{s_1 ... s_n} P(s_1 ... s_n | w) × P(i_1 ... i_n | s_1 ... s_n)\n\nIn HMMs, each sequence element is assumed to depend only on its corresponding state:\n\nP(i_1 ... i_n | s_1 ... s_n) = Π_{j=1}^{n} p(i_j | s_j)\n\nMoreover, first-order Markov models assume that paths are generated by a first-order Markov chain, so that\n\nP(s_1 ... s_n) = Π_{j=1}^{n} p(s_j | s_{j-1})\n\nWe have reported in previous papers (Gilloux et al., 1993) (Gilloux et al., 1995a) on several handwriting recognition systems based on this assumption. The hidden Markov model architecture used in all these systems has been extensively presented in (Gilloux et al., 1995a). In that model, word letters are associated to three-state models which are designed to account for the situations where a letter is realized as 0, 1 or 2 segments. Word models are the result of assembling the corresponding letter models. This architecture is depicted in figure 2. We used here transition emission rather than state emission; however, this does not change the previous formulas if we replace states by transitions, i.e. pairs of states.\n\nFigure 2: Outline of the model for \"laval\".\n\nOne of these systems was a hybrid RBF/HMM model in which a radial basis function network was used to estimate the emission probabilities p(i_j | s_j). The RBF outputs are introduced by applying Bayes' rule in the expression of p(i_1 ... i_n | s_1 ... s_n):\n\nP(i_1 ... i_n | s_1 ... s_n) = Π_{j=1}^{n} [p(s_j | i_j) × p(i_j)] / p(s_j)\n\nSince the product of the a priori image segment probabilities p(i_j) does not depend on the word hypothesis w, we may write:\n\nP(i_1 ... i_n | s_1 ... s_n) ∝ Π_{j=1}^{n} p(s_j | i_j) / p(s_j)\n\nIn the above formulas, terms of the form p(s_j | s_{j-1}) are transition probabilities which may be estimated through the Baum-Welch re-estimation algorithm. Terms of the form p(s_j) are a priori probabilities of states. Note that for Bayes' rule to apply, these probabilities have and only have to be estimated consistently with the terms of the form p(s_j | i_j), since p(i_j | s_j) is independent of the statistical distribution of states.\n\nIt has been proven elsewhere (Richard & Lippmann, 1991) that systems trained with the MSE criterion tend to approximate Bayes probabilities, in the sense that Bayes probabilities are optimal for the MSE criterion. In practice, how close a given system comes to the Bayes optimum is not easily predictable, due to various biases of the trained system (initial parameters, local optima, architecture of the net, etc.). Thus real output scores are generally not equal to Bayes probabilities. However, there exist procedures which act as a post-processor for the outputs of a system trained with the MSE and make them closer to Bayes probabilities (Singer & Lippmann, 1992). Provided that such a post-processor is used, we will assume that the terms p(s_j | i_j) are well estimated by the post-processed outputs of the recognition system. Then, the p(s_j) are just the a priori probabilities of states on the set used to train the system or post-process the system outputs.
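As a concrete sketch of this scaled-likelihood scheme (a minimal illustration under assumed names and shapes, not the authors' implementation), the fragment below divides network posteriors p(s_j | i_j) by the state priors p(s_j) and scores one word model with a best-path Viterbi search, matching the one-dominant-path approximation above:

```python
import numpy as np

def scaled_emission_scores(posteriors, state_priors):
    # p(s_j | i_j) / p(s_j), proportional to the emission term p(i_j | s_j)
    return posteriors / state_priors

def viterbi_log_score(scaled, log_trans, log_init):
    # Best single-path log-score of a segment sequence under one word model.
    # scaled:    (n_segments, n_states) scaled likelihoods
    # log_trans: (n_states, n_states)   log p(s_j | s_{j-1})
    # log_init:  (n_states,)            log-probabilities of initial states
    log_emit = np.log(np.maximum(scaled, 1e-300))
    delta = log_init + log_emit[0]
    for t in range(1, len(log_emit)):
        delta = np.max(delta[:, None] + log_trans, axis=0) + log_emit[t]
    return float(delta.max())
```

Comparing this score across word models, together with the word prior P(w), reproduces the likelihood ranking described above.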
\n\nThis hybrid handwritten word recognition system demonstrated better performance than previous systems in which word images were represented through sequences of symbolic features instead of full-fledged bitmaps (Gilloux et al., 1995b). However, some recognition errors remained, many of which could be explained by the simplifying assumptions made in the model. In particular, the fact that emission probabilities depend only on the state corresponding to the current bitmap appeared to be a poor choice. For example, in figure 3 the third and fourth segments are classified as two halves of the letter i. For letters segmented in two parts, the second half is naturally correlated to the first (see figure 1). Yet, our Markov model architecture is designed so that both halves are assumed uncorrelated. This has two effects. Two consecutive bitmaps which cannot be the two parts of a unique letter are sometimes recognized as such, as in figure 3. Also, the emission probability of the second part of a segmented letter is lower than it would be if the first part had been considered when estimating this probability. The contextual model described in the next section is designed so as to make a different assumption on emission probabilities.\n\nFigure 3: An image of trente classified as mille.\n\n3 THE HYBRID CONTEXTUAL RBF/HMM MODEL\n\nThe exact decomposition of the emission part of word likelihoods is the following:\n\nP(i_1 ... i_n | s_1 ... s_n) = p(i_1 | s_1 ... s_n) × Π_{j=2}^{n} p(i_j | s_1 ... s_n, i_1 ... i_{j-1})\n\nWe now assume that bitmaps are conditioned by their state and the previous image in the sequence:\n\nP(i_1 ... i_n | s_1 ... s_n) ≈ p(i_1 | s_1) × Π_{j=2}^{n} p(i_j | s_j, i_{j-1})\n\nThe RBF is again introduced by applying Bayes' rule, in the following way:\n\nP(i_1 ... i_n | s_1 ... s_n) ≈ [p(s_1 | i_1) × p(i_1)] / p(s_1) × Π_{j=2}^{n} [p(s_j | i_j, i_{j-1}) × p(i_j | i_{j-1})] / p(s_j | i_{j-1})\n\nSince the terms of the form p(i_j | i_{j-1}) do not contribute to the discrimination of word hypotheses, we may write:\n\nP(i_1 ... i_n | s_1 ... s_n) ∝ p(s_1 | i_1) / p(s_1) × Π_{j=2}^{n} p(s_j | i_j, i_{j-1}) / p(s_j | i_{j-1})\n\nThe RBF now has to estimate not only terms of the form p(s_j | i_j, i_{j-1}) but also terms like p(s_j | i_{j-1}), which are no longer computed by mere counting. Two radial basis function networks are therefore used to estimate these probabilities. Their common architecture is described in the next section.\n\n4 THE RADIAL BASIS FUNCTION MODEL\n\nThe radial basis function model has been described in (Lemarie, 1993). RBF networks are inspired by the theory of regularization (Poggio & Girosi, 1990). This theory studies how multivariate real functions known on a finite set of points may be approximated at these points within a family of parametric functions under some bias of regularity. It has been shown that when this bias tends to select smooth functions, in the sense that some linear combination of their derivatives is minimized, there exists an analytical solution which is a linear combination of Gaussians centred on the points where the function is known (Poggio & Girosi, 1990).
It is straightforward to transpose this paradigm to the problem of learning probability distributions given a set of examples.\n\nIn practice, the theory is not tractable since it requires one Gaussian per example in the training set. Empirical methods (Lemarie, 1993) have been developed which reduce the number of Gaussian centres. Since the theory is no longer applicable when the number of centres is reduced, the parameters of the model (centres and covariance matrices for the Gaussians, weights for the linear combination) have to be trained by another method, in this case gradient descent with the MSE criterion. Finally, the resulting RBF model may be regarded as a particular neural network with three layers. The first is the input layer. The second layer is completely connected to the input layer through connections with unit weights. The transfer functions of the cells in the second layer are Gaussians applied to the weighted distance between the corresponding centres and the input to the cell; the weights of the distance are analogous to the parameters of a diagonal covariance matrix. Finally, the last layer is completely connected to the second one through weighted connections. Cells in this layer simply output the sum of their inputs.\n\nIn our experiments, the inputs to the RBF are feature vectors of length 138 computed from the bitmaps of a word segment (Lemarie, 1993). The RBF that estimates terms of the form p(s_j | i_j, i_{j-1}) uses two such vectors as input, whereas the second RBF (terms p(s_j | i_{j-1})) is only fed with the vector associated to i_{j-1}. These vectors are inspired by \"characteristic loci\" methods (Gluksman, 1967) and encode the proportion of white pixels from which a bitmap border can be reached without meeting any black pixel, in various directions.
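As an illustrative sketch (all names, shapes and values here are assumptions, not the authors' implementation), the three-layer RBF described above and the two-network ratio of Section 3 can be written as:

```python
import numpy as np

class RBFNetwork:
    # Three-layer RBF as described above: an input layer, a hidden layer of
    # Gaussian units with per-dimension distance weights (the analogue of a
    # diagonal covariance), and a linear output layer.
    def __init__(self, centres, inv_scales, out_weights):
        self.centres = np.asarray(centres)          # (n_centres, n_inputs)
        self.inv_scales = np.asarray(inv_scales)    # (n_centres, n_inputs)
        self.out_weights = np.asarray(out_weights)  # (n_centres, n_outputs)

    def forward(self, x):
        d = (np.asarray(x) - self.centres) * self.inv_scales
        h = np.exp(-0.5 * np.sum(d * d, axis=1))    # Gaussian activations
        return h @ self.out_weights                 # linear combination

def contextual_segment_scores(net_ctx, net_prev, seg, prev_seg):
    # Per-segment factor of the contextual word score:
    # p(s_j | i_j, i_{j-1}) / p(s_j | i_{j-1}), estimated by two networks.
    num = net_ctx.forward(np.concatenate([seg, prev_seg]))
    den = net_prev.forward(prev_seg)
    return num / np.maximum(den, 1e-12)  # guard against tiny denominators
```

In the paper's setting, seg and prev_seg would be the 138-dimensional characteristic-loci feature vectors, net_ctx the two-input network and net_prev the context-only network.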
\n\n5 EXPERIMENTS\n\nThe model has been assessed by applying it to the recognition of words appearing in the legal amounts of French postal or bank cheques. The size of the vocabulary is 30 and its perplexity is only 14.3 (Bahl et al., 1983). The training and test bases are made of images of amount words written by unknown writers on real cheques. We used 7 191 images during training and 2 879 different images for test. The image resolution was 300 dpi. The amounts were manually segmented into words, and an automatic procedure was used to separate the words from the preprinted lines of the cheque form.\n\nThe training was conducted by using the results of the former hybrid system. The segmentation module was kept unchanged. There are 48 140 segments in the training set and 19 577 in the test set. We assumed that the base system is almost always correct when aligning segments onto letter models. We thus used this alignment to label all the segments in the training set and took these labels as the desired outputs for the RBF. We used a set of 63 different labels, since 21 letters appear in the amount vocabulary and 3 types of segments are possible for each letter. The outputs of the RBF are directly interpreted as Bayes probabilities without further post-processing.\n\nFirst of all, we assessed the quality of the system by evaluating its ability to recognize the class of a segment through the value of p(s_j | i_j, i_{j-1}) and compared it with that of the previous hybrid system. The results of this experiment are reported in table 1 for the test set.
They demonstrate the importance of the context and thus its potential interest for a word recognition system.\n\nTable 1: Recognition and confusion rates for segment classifiers\n\n                              Recognition rate   Confusion rate   Mean square error\nRBF system without context         32.6%             67.4%             0.828\nRBF system with context            41.7%             58.3%             0.739\n\nWe next compared the word recognition performance on the data base of 2878 word images. Results are shown in table 2.\n\nTable 2: Recognition and confusion rates for the word recognition systems\n\n                              Recognition rate   Confusion rate   # Confusions\nRBF system without context         81.3%             16.7%             536\nRBF system with context            76.3%             23.7%             683\n\nThe first remark is that the system without context presents better results than the contextual system. Some of the differences between the systems with and without context are shown below in figures 4 and 5 and may explain why the contextual system remains at a lower level of performance. The words \"huit\" and \"deux\" of figure 4 are well recognized by the system without context but badly identified by the contextual system, as \"trente\" and \"franc\" respectively. The image of the word \"huit\", for example, is segmented into eight segments, and each of the four letters of the word is thus necessarily considered as separated in two parts. The fifth and sixth segments are thus recognized as two halves of the letter \"i\" by the standard system, while the contextual system avoids this decomposition of the letter \"i\". On the next image, the contextual system proposes \"ra\" for the second and third segments, mainly because of the absence of information on the relative position of these segments. On the other hand, figure 5 shows examples where the contextual system outperforms the system without context.
In the first case the latter proposed the class \"trois\", with two halves of the letter \"i\" on the fifth and sixth segments. In the second case the context is clearly useful for the recognition of the first letter of the word. Forthcoming experiments will try to combine the two systems so as to benefit from their respective characteristics.\n\nFigure 4: Some new confusions produced by the contextual system.\n\nExperiments have also revealed that the contextual system remains very sensitive to the numerical output values of the network which estimates p(s_j | i_{j-1}). Several approaches to solving this problem are currently under investigation. First results have already been obtained by approximating the network which estimates p(s_j | i_{j-1}) from the network which estimates p(s_j | i_j, i_{j-1}).\n\n6 CONCLUSION\n\nWe have described a new application of a hybrid radial basis function/hidden Markov model architecture to the recognition of off-line handwritten words. In this architecture, the estimation of emission probabilities is assigned to a discriminant classifier. The estimation of emission probabilities is enhanced by taking into account the context, as represented by the previous bitmap in the sequence to be classified. A formula has been derived introducing this context into the estimation of the likelihood of word scores. The ratio of the output values of two networks is now used to estimate the likelihood.\n\nThe reported experiments reveal that the use of context, while profitable at the segment recognition level, is not yet useful at the word recognition level. Nevertheless, the new system acts differently from the previous system without context, and future applications will try to exploit this difference.
The dynamics of the ratio of the network output values is also very unstable, and solutions to stabilize it will be thoroughly tested in forthcoming experiments.\n\nReferences\n\nBahl, L., Jelinek, F., Mercer, R., (1983). A maximum likelihood approach to speech recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence 5(2):179-190.\nBahl, L.R., Brown, P.F., de Souza, P.V., Mercer, R.L., (1986). Maximum mutual information estimation of hidden Markov model parameters for speech recognition. Proc. of the Int. Conf. on Acoustics, Speech, and Signal Processing (ICASSP'86):49-52.\nBourlard, H., Morgan, N., (1993). Continuous speech recognition by connectionist statistical methods. IEEE Trans. on Neural Networks, vol. 4, no. 6:893-909.\nChen, M.-Y., Kundu, A., Zhou, J., (1994). Off-line handwritten word recognition using a hidden Markov model type stochastic network. IEEE Trans. on Pattern Analysis and Machine Intelligence, vol. 16, no. 5:481-496.\nGilloux, M., Leroux, M., Bertille, J.-M., (1993). Strategies for handwritten words recognition using hidden Markov models. Proc. of the 2nd Int. Conf. on Document Analysis and Recognition:299-304.\nGilloux, M., Leroux, M., Bertille, J.-M., (1995a). Strategies for cursive script recognition using hidden Markov models. Machine Vision & Applications, special issue on handwriting recognition, R. Plamondon ed., accepted for publication.\nGilloux, M., Lemarie, B., Leroux, M., (1995b). A hybrid radial basis function network/hidden Markov model handwritten word recognition system. Proc. of the 3rd Int. Conf. on Document Analysis and Recognition:394-397.\nGluksman, H.A., (1967). Classification of mixed font alphabetics by characteristic loci. 1st Annual IEEE Computer Conf.:138-141.\nLemarie, B., (1993).
Practical implementation of a radial basis function network for handwritten digit recognition. Proc. of the 2nd Int. Conf. on Document Analysis and Recognition:412-415.\nPoggio, T., Girosi, F., (1990). Networks for approximation and learning. Proc. of the IEEE, vol. 78, no. 9.\nRichard, M.D., Lippmann, R.P., (1991). Neural network classifiers estimate Bayesian a posteriori probabilities. Neural Computation, 3:461-483.\nSinger, E., Lippmann, R.P., (1992). A speech recognizer using radial basis function networks in an HMM framework. Proc. of the Int. Conf. on Acoustics, Speech, and Signal Processing.\n", "award": [], "sourceid": 1106, "authors": [{"given_name": "Bernard", "family_name": "Lemari\u00e9", "institution": null}, {"given_name": "Michel", "family_name": "Gilloux", "institution": null}, {"given_name": "Manuel", "family_name": "Leroux", "institution": null}]}