{"title": "Interactive Parts Model: An Application to Recognition of On-line Cursive Script", "book": "Advances in Neural Information Processing Systems", "page_first": 974, "page_last": 980, "abstract": null, "full_text": "Interactive Parts Model: an Application \nto Recognition of On-line Cursive Script \n\nPredrag Neskovic, Philip C Davis' and Leon N Cooper \nPhysics Department and Institute for Brain and Neural Systems \n\nBrown University, Providence, RI 02912 \n\nAbstract \n\nIn this work, we introduce an Interactive Parts (IP) model as an \nalternative to Hidden Markov Models (HMMs). We tested both \nmodels on a database of on-line cursive script. We show that im(cid:173)\nplementations of HMMs and the IP model, in which all letters are \nassumed to have the same average width, give comparable results. \nHowever , in contrast to HMMs, the IP model can handle duration \nmodeling without an increase in computational complexity. \n\n1 \n\nIntroduction \n\nHidden Markov models [9] have been a dominant paradigm in speech and handwrit(cid:173)\ning recognition over the past several decades. The success of HMMs is primarily \ndue to their ability to model the statistical and sequential nature of speech and \nhandwriting data. However , HMMs have a number of weaknesses [2] . First , dis(cid:173)\ncriminative powers of HMMs are weak since the training algorithm is based on \na Maximum Likelihood Estimate (MLE) criterion, whereas the optimal training \nshould be based on a Maximum a Posteriori (MAP) criterion [2] . Second , in most \nHMMs, only first or second order dependencies are assumed. Although explicit du(cid:173)\nration HMMs model data more accurately, the computational cost of such modeling \nis high [5]. \n\nTo overcome the first problem, it has been suggested [1 , 11,2] that Neural Networks \n(NNs) should be used for estimating emission probabilities. 
Since NNs cannot deal well with sequential data, they are often used in combination with HMMs as hybrid NN/HMM systems [2, 11].\n\nIn this work, we introduce a new model that provides a possible solution to the second problem. In addition, this new objective function can be cast into a NN-based framework [7, 8] and can easily deal with the sequential nature of handwriting. In our approach, we model an object as a set of local parts arranged at specific spatial locations.\n\n'Now at MIT Lincoln Laboratory, Lexington, MA 02420-9108\n\nFigure 1: Effect of shape distortion and spatial distortions applied to the word \"act\".\n\nFigure 2: Some of the non-zero elements of the detection matrix associated with the word \"act\".\n\nParts-based representation has been used in face detection systems [3] and has recently been applied to spotting keywords in cursive handwriting data [4]. Although the model proposed in [4] presents a rigorous probabilistic approach, it only models the positions of key-points and, in order to learn the appropriate statistics, it requires many ground-truthed training examples.\n\nIn this work, we focus on modeling one dimensional objects. In our application, an object is a handwritten word and its parts are the letters. However, the method we propose is quite general and can easily be extended to two dimensional problems.\n\n2 The Objective Function\n\nIn our approach, we assume that a handwritten pattern is a distorted version of one of the dictionary words. 
Furthermore, we assume that any distortion of a word can be expressed as a combination of two types of local distortions [6]: a) shape distortions of one or more letters, and b) spatial distortions, also called domain warping, as illustrated in Figure 1. In the latter case, the shape of each letter is unchanged but the location of one or more letters is perturbed.\n\nShape distortions can be captured using \"letter detectors\". A number of different techniques can be used to construct letter detectors. In our implementation, we use a neural network-based approach. The output of a letter detector is in the range [0, 1], where 1 corresponds to the undistorted shape of the corresponding letter.\n\nSince it is not known, a priori, where the letters are located in the pattern, letter detectors, for each letter of the alphabet, are arranged over the pattern so that the pattern is completely covered by their (overlapping) receptive fields. The outputs of the letter detectors form a detection matrix, Figure 2. Each row of the detection matrix represents one letter and each column corresponds to the position of the letter within the pattern. An element of the detection matrix is labeled as d_k(x), where k denotes the class of the letter, k in [1, ..., 26], and x represents the column number. In general, the detection matrix contains a large number of \"false alarms\" due to the fact that local segments are often ambiguous. The recognition system segments a pattern by selecting one detection matrix element for each letter of a given dictionary word.^1\n\nTo measure spatial distortions, one must first choose a reference point from which distortions are measured. It is clear that for any choice of reference point, the location estimates for letters that are not close to the reference point might be very poor. 
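The detection matrix just described can be illustrated with a small sketch. This is a toy example, not the paper's trained network: the array sizes, planted columns, and the helper `letter_index` are all assumptions for illustration; in the actual system the entries would come from the letter-detector NN outputs.

```python
import numpy as np

# Toy detection matrix: 26 letter classes (rows) x 10 columns (positions).
# Entry d[k, x] is the detector output for letter class k at column x,
# in [0, 1]; 1 would mean an undistorted instance of that letter.
rng = np.random.default_rng(0)
d = rng.uniform(0.0, 0.3, size=(26, 10))   # background "false alarms"

def letter_index(ch):
    """Map 'a'..'z' to row indices 0..25 (illustrative helper)."""
    return ord(ch) - ord('a')

# Plant strong responses for the word "act" at columns 1, 4, 8.
for ch, x in zip("act", (1, 4, 8)):
    d[letter_index(ch), x] = 0.9

# A segmentation for the word "act" picks one column per letter.
segmentation = {'a': 1, 'c': 4, 't': 8}
scores = [float(d[letter_index(ch), x]) for ch, x in segmentation.items()]
print(scores)  # -> [0.9, 0.9, 0.9]
```

The many small background values stand in for the \"false alarms\" mentioned above; recognition must pick one element per letter despite them.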
For this reason, we chose a representation in which each letter serves as a reference point to estimate the position of every other letter. This representation allows translation invariant recognition, is very robust (since it does not depend on any single reference point) and very accurate (since it includes nearest neighbor reference points).\n\nTo evaluate the level of distortion of a handwritten pattern from a given dictionary word, we introduce an objective function. The value of this function represents the amount of distortion of the pattern from the dictionary word. We require that the objective function reach a minimal value if all the letters that constitute the dictionary word are detected with the highest confidence and are located at the locations with highest expectation values. Furthermore, we require that the dependence of the function on any single letter be smaller for longer words.\n\nOne function with similar properties to these is the energy function of a system of interacting particles, sum_{i,j} q_i U_{i,j}(x_i, x_j) q_j. If we assume that all the letters are of the same size, we can 1) map letter detection estimates into \"charges\" and 2) choose interaction terms (potentials) to reflect the expected relative positioning of the letters (detection matrix elements). The energy function of the n-th dictionary word is then\n\nE^n(x) = sum_{i,j=1, i != j}^{L_n} d^n_i(x_i) U^n_{i,j}(x_i, x_j) d^n_j(x_j),   (1)\n\nwhere L_n is the number of letters in the word, x_i is the location of the i-th letter of the n-th dictionary word, and x = (x_1, ..., x_{L_n}) is a particular configuration of detection matrix elements. Although this equation has a similar form as, say, the Coulomb energy, it is much more complicated. The interaction terms U_{i,j} are more complex than 1/r, and each \"charge\", d^n_i(x_i), does not have a fixed value, but depends on its location. 
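Eq. (1) can be sketched directly in code. This is a minimal, assumed implementation: the function name `word_energy`, the square-well potential, the average width of 3 columns, and all detector values are illustrative stand-ins, not the paper's fitted quantities.

```python
def word_energy(detections, positions, potential):
    """Energy of Eq. (1): sum over ordered pairs i != j of
    d_i(x_i) * U_{i,j}(x_i, x_j) * d_j(x_j)."""
    L = len(positions)
    E = 0.0
    for i in range(L):
        for j in range(L):
            if i != j:
                E += detections[i] * potential(i, j, positions[i], positions[j]) * detections[j]
    return E

def square_well(i, j, xi, xj, width=3.0):
    """Illustrative potential: depth -1 when letter j sits roughly
    (j - i) average widths away from letter i, zero otherwise."""
    expected = (j - i) * width
    return -1.0 if abs((xj - xi) - expected) <= 0.5 * width else 0.0

# Well-placed letters reach a lower (better) energy than a distorted placement.
good = word_energy([0.9, 0.9, 0.9], [0, 3, 6], square_well)
bad = word_energy([0.9, 0.9, 0.9], [0, 3, 20], square_well)
print(good, bad)
```

Note that both the detections and the potentials enter multiplicatively, so a weak detection or a badly placed letter each raise the energy toward zero.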
Note that this energy is a function of a specific choice of elements from the detection matrix, x, i.e., a specific segmentation of the word.\n\nInteraction terms can be calculated from training data in a number of different ways. One possibility is to use the EM algorithm [9] and do the training for each dictionary word. Another possibility is to propagate nearest neighbor estimates. Let us denote by the symbol p^n_{ij}(x_i, x_j) the (pairwise) probability of finding the j-th letter of the n-th dictionary word at distance x = x_j - x_i from the location of the i-th letter. A simple way to approximate pairwise probabilities is to find the probability distribution of letter widths for each letter and then from single letter distributions calculate nearest neighbor pairwise probabilities. Knowing the nearest neighbor probabilities, it is then easy to propagate them and find the pairwise probabilities between any two letters of any dictionary word [7]. Interaction potentials are related to pairwise probabilities (using the Boltzmann distribution and setting beta = 1/kT = 1) as U^n_{i,j}(x_i, x_j) = -ln p^n_{ij}(x_i, x_j) + C.\n\nSince the interaction potentials are defined up to a constant, we can selectively change the value of their minima by choosing different values for C, Fig. 3.\n\n^1 Note that this segmentation corresponds to finding the centers of the letters, as opposed to segmenting a word into letters by finding their boundaries.\n\nFigure 3: Solid line: an example of a pairwise probability distribution for neighboring letters. Dashed lines: a family of corresponding interaction potentials.\n\nFigure 4: Modified interaction potential. Regions x <= a and x >= b are the \"forbidden\" regions for letter locations. In the regions a < x < a' and b' < x < b the interaction term is zero. 
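The two steps above, propagating nearest neighbor distance distributions to farther pairs and mapping probabilities to potentials via U = -ln p + C, can be sketched as follows. The distributions, grid size, and p_min value are all illustrative assumptions; `potential_from_prob` is a name chosen here, not from the paper.

```python
import numpy as np

def potential_from_prob(p, p_min=1e-3):
    """Boltzmann mapping U = -ln p + C, with C = ln(p_min) so that
    probabilities at or below p_min give U = 0 (no influence)."""
    p = np.maximum(p, p_min)
    return -np.log(p) + np.log(p_min)

# Nearest-neighbor distance distribution over integer offsets 0..N-1
# (illustrative: adjacent letters are ~3 columns apart, +/- 1).
N = 16
p_nn = np.zeros(N)
p_nn[2:5] = [0.25, 0.5, 0.25]

# Propagate to second neighbors: the i -> i+2 distance is the sum of two
# nearest-neighbor distances, i.e. the convolution of their distributions.
p_2nd = np.convolve(p_nn, p_nn)[:N]

U_nn = potential_from_prob(p_nn)
print(p_2nd.argmax(), U_nn.min())
```

As expected, the second-neighbor distribution peaks at twice the nearest-neighbor offset and is broader, and the potential is negative only where the probability is informative.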
It is important to stress that the only valid domain for the interaction terms is the region for which U_{i,j} < 0, since for each pair of letters (i, j) we want to simultaneously minimize the interaction term U_{i,j} and maximize the term d_i * d_j.^2 We will assume that there is a value, p_min, for the pairwise probability below which the estimate of the letter location is not reliable. So, for every p_ij such that 0 < p_ij < p_min, we set p_ij = p_min. We choose the value of the constant such that U_{i,j} = -ln(p_min) + C = 0, Fig. 4. In practice, this means that there is an effective range of influence for each letter, and beyond that range the influence of the letter is zero. In the limiting case, one can get a nearest neighbor approximation by appropriately setting p_min.\n\nIt is clear that the interaction terms put constraints on the possible locations of the letters of a given dictionary word. They define \"allowed\" regions, where the letters can be found, unimportant regions, where the influence of a letter on other letters is zero, and not-allowed regions (U = infinity), which have zero probability of finding a letter in that region, Fig. 4.\n\nThe task of recognition can now be formulated as follows. For a given dictionary word, find the configuration of elements from the detection matrix (a specific segmentation of the pattern) such that the energy is minimal. Then, in order to find the best dictionary word, repeat the previous procedure for every dictionary word and associate with the pattern the dictionary word with the lowest energy. If we denote by X the space of all possible segmentations of the pattern, then the final segmentation of the pattern, x*, is given as\n\nx* = argmin_{x in X, n in N} E^n(x),   (2)\n\nwhere the index n runs through the dictionary words.\n\n3 Implementation and an Overview of the System\n\nAn overview of the system is illustrated in Fig. 5. 
A raw data file, representing a handwritten word, contains the x and y positions of the pen recorded every 10 milliseconds. This input signal is first transformed into strokes, which are defined as lines between points with zero velocity in the y direction. Each stroke is characterized by a set of features as suggested in [10].\n\n^2 For U_{i,j} > 0, increasing d_i * d_j would increase, rather than decrease, the energy function.\n\nFigure 5: An overview of the system.\n\nFigure 6: Comparison of recognition results on 10 writers using the IP model and HMMs.\n\nThe preprocessor extracts these features from each stroke and supplies them to the neural network. We have built a multi-layer feedforward network based on a weight sharing technique to detect letters. This particular architecture was proposed by Rumelhart [10]. Similar networks can also be found in the literature under the name Time Delay Neural Network (TDNN) [11]. In our implementation, the network has one hidden layer with thirty rows of hidden units. For details of the network architecture see [10, 7]. The output of the NN, the detection matrix, is then supplied to the HMM-based and IP model-based post-processors, Fig. 5. For both models, we assume that every letter has the same average width.\n\nInteraction Terms. The first approximation for interaction terms is to assume a \"square well\" shape. Each interaction term is then defined with only three parameters: the left boundary a, the right boundary b and the depth of the well, e_n, which are the same for all the nearest neighbor letters, Fig. 7. 
The lower and upper limits for the i-th and j-th non-adjacent interaction terms can then be approximated as a_ij = |j - i| * a and b_ij = |j - i| * b, respectively.\n\nNearest Neighbor Approximation. Since the exact minimization of the energy function in Eq. (2) is often computationally infeasible (the detection matrices can exceed 40 columns in width for long words), one has to use some approximation technique. One possible solution is suggested in [7], where contextual information is used to constrain the search space. Another possibility is to revise the energy function by considering only nearest neighbor terms and then solve it exactly using a Dynamic Programming (DP) algorithm. We have used DP to find the optimal segmentation for each word. We then use this \"optimal\" configuration of letters to calculate the energy given by Eq. (1). It is important to mention that we have introduced beginning (B) and end (E) \"letters\" to mark the beginning and end of the pattern, and their detection probabilities are set to some constant value.^3\n\nHidden Markov Models. The goal of the recognition system is to find the dictionary word with the maximum posterior probability, p(w|O) = p(O|w)p(w)/p(O), given the handwritten pattern, O. Since p(O) and p(w) are the same for all dictionary words, maximizing p(w|O) is equivalent to maximizing p(O|w).\n\n^3 This is necessary in order to define interaction potentials for single letter words.\n\nFigure 7: Square well approximation of the interaction potential. The allowed region is defined as a < x < b, and the forbidden regions are x < a and x > b.\n\nFigure 8: The probability of remaining in the same state for exactly d time steps: HMMs (dashed line) vs. expected probability (solid line). 
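The nearest neighbor DP segmentation described above can be sketched as follows. This is a toy version under stated assumptions: the function name `dp_segment`, the square-well pair energy, and the detector rows are illustrative; the actual system adds B/E boundary letters and richer potentials.

```python
def dp_segment(det_rows, pair_energy):
    """Pick one column per letter (strictly left to right) minimizing the
    nearest-neighbor energy sum d_i(x_i)*U(x_i, x_{i+1})*d_{i+1}(x_{i+1}).
    det_rows[i] is the detector-output row for the i-th letter of the word."""
    L, W = len(det_rows), len(det_rows[0])
    INF = float("inf")
    best = [[0.0] * W for _ in range(L)]   # best[0][x] = 0: no pair yet
    back = [[-1] * W for _ in range(L)]
    for i in range(1, L):
        for x in range(W):
            best[i][x] = INF
            for xp in range(x):             # letters must stay ordered
                e = best[i - 1][xp] + det_rows[i - 1][xp] * pair_energy(xp, x) * det_rows[i][x]
                if e < best[i][x]:
                    best[i][x], back[i][x] = e, xp
    # Trace back from the best final column.
    x = min(range(W), key=lambda c: best[L - 1][c])
    path = [x]
    for i in range(L - 1, 0, -1):
        x = back[i][x]
        path.append(x)
    return path[::-1], min(best[L - 1])

# Toy word of 3 letters over 8 columns; square-well pair energy of depth -1
# for nearest-neighbor offsets between 2 and 4 columns.
well = lambda xp, x: -1.0 if 2 <= x - xp <= 4 else 0.0
rows = [[0.1] * 8 for _ in range(3)]
rows[0][1], rows[1][4], rows[2][7] = 0.9, 0.9, 0.9
path, energy = dp_segment(rows, well)
print(path)  # -> [1, 4, 7]
```

The cost is O(L * W^2) per word, so it stays tractable even for wide detection matrices, which is the point of the nearest neighbor approximation.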
To find p(O|w), we constructed a left-right (or Bakis) HMM [9] for each dictionary word, lambda^n, where each letter was represented by one state. Given a dictionary word (a model lambda^n), we calculated the likelihood p(O|lambda^n) = sum_{all Q} p(O, Q|lambda^n) = sum_{all Q} p(O|Q, lambda^n) p(Q|lambda^n), where the summation is done over all possible state sequences. We used the forward-backward procedure [9] for calculating this sum. Emission probabilities were calculated from the detection probabilities using Bayes' rule, P(O_x|q_k) = d_k(x) P(O_x) / P(q_k), where P(q_k) denotes the frequency of the k-th letter in the dictionary and the term P(O_x) is the same for all words and can therefore be omitted. Transition probabilities were adjusted until the best recognition results were obtained. Recall that we assumed that all letter widths are the same, and therefore the transition probabilities are independent of letter pairs.\n\n4 Results and Discussion\n\nOur dataset (obtained from David Rumelhart [10]) consists of words written by 100 different writers, where each writer wrote 1000 words. The size of the dictionary is 1000 words. The neural network was trained on 70 writers (70,000 words) and an independent group of writers was used as a cross validation set. We have tested both the IP model and HMMs on a group of 10 writers (different from the training and cross-validation groups). The results for each model are depicted in Fig. 6. The IP model chose the correct word 79.89% of the time, while HMMs selected the correct word 79.44% of the time. Although the overall performance of the two models was almost identical, the results differ by several percent on individual writers. This suggests that our model could be used in combination with HMMs (e.g. with some averaging technique) to improve overall recognition. 
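The HMM likelihood computation above can be sketched with the forward procedure. This is a minimal assumed implementation: one state per letter, a single shared self-transition probability `stay` (matching the equal-letter-width assumption), uniform letter priors so emissions reduce to the detector outputs, and the last state also self-loops with probability `stay` (a simplification); none of these constants are the paper's tuned values.

```python
import numpy as np

def forward_likelihood(emissions, stay=0.5):
    """Forward procedure for a left-right (Bakis) HMM with one state per
    letter: at each step a state either stays (prob `stay`) or advances to
    the next state. emissions[t, s] is b_s(O_t). Returns p(O | model),
    requiring the path to start in the first and end in the last state."""
    T, S = emissions.shape
    alpha = np.zeros((T, S))
    alpha[0, 0] = emissions[0, 0]          # must start in the first state
    for t in range(1, T):
        alpha[t, 0] = alpha[t - 1, 0] * stay * emissions[t, 0]
        for s in range(1, S):
            alpha[t, s] = (alpha[t - 1, s] * stay
                           + alpha[t - 1, s - 1] * (1.0 - stay)) * emissions[t, s]
    return float(alpha[T - 1, S - 1])

# Emissions from detector outputs via Bayes' rule, b_k(O_x) ~ d_k(x) / P(q_k);
# with uniform letter priors they are just the detector outputs (toy values).
det = np.array([[0.9, 0.1, 0.1],
                [0.1, 0.9, 0.1],
                [0.1, 0.1, 0.9],
                [0.1, 0.1, 0.9]])
p = forward_likelihood(det)
print(p)
```

A word whose letters fire in the right order scores a much higher p(O|lambda) than one with uniformly weak detections, which is all the recognizer needs for the argmax over dictionary words.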
\n\nIt is important to mention that new words can be easily added to the dictionary, and the IP model does not require retraining on the new words (using the method of calculating interaction terms suggested in this paper). The only information about the new word that has to be supplied to the system is the ordering of its letters. Knowing the nearest neighbor pairwise probabilities, p^n_{ij}(x_i, x_j), it is easy to calculate the location estimates between any two letters of the new word. Furthermore, the IP model can easily recognize words where many of the letters are highly distorted or missing.\n\nIn standard first-order HMMs with time-independent transition probabilities, the probability of remaining in the i-th state for exactly d time steps is illustrated in Fig. 8. The real probability distribution of letter widths is actually similar to a Poisson distribution [11], Fig. 8. It has been shown that explicit duration HMMs can significantly improve recognition accuracy, but at the expense of a significant increase in computational complexity [5]. Our model, on the other hand, can easily model arbitrarily complex pairwise probabilities without increasing the computational complexity (using DP in a nearest neighbor approximation). We think that this is one of the biggest advantages of our approach over HMMs. We believe that including more precise interaction terms will yield significantly better results (as in HMMs), and this work is currently in progress.\n\nAcknowledgments\n\nSupported in part by the Office of Naval Research. The authors thank the members of the Institute for Brain and Neural Systems for helpful conversations.\n\nReferences\n\n[1] Y. Bengio, Y. LeCun, C. Nohl, and C. Burges. LeRec: A NN/HMM hybrid for on-line handwriting recognition. Neural Computation, 7:1289-1303, 1995.\n\n[2] H. Bourlard and C. Wellekens. Links between hidden Markov models and multilayer perceptrons. 
IEEE Transactions on PAMI, 12:1167-1178, 1990.\n\n[3] M. Burl, T. Leung, and P. Perona. Recognition of planar object classes. In Proc. IEEE Comput. Soc. Conf. Comput. Vision and Pattern Recogn., 1996.\n\n[4] M. Burl and P. Perona. Using hierarchical shape models to spot keywords in cursive handwriting data. In Proc. CVPR 98, 1998.\n\n[5] C. Mitchell and L. Jamieson. Modeling duration in a hidden Markov model with the exponential family. In Proc. ICASSP, pages 331-334, 1993.\n\n[6] D. Mumford. Neuronal architectures for pattern-theoretic problems. In C. Koch and J. L. Davis, editors, Large-Scale Neuronal Theories of the Brain, pages 125-152. MIT Press, Cambridge, MA, 1994.\n\n[7] P. Neskovic. Feedforward, Feedback Neural Networks With Context Driven Segmentation And Recognition. PhD thesis, Brown University, Physics Dept., May 1999.\n\n[8] P. Neskovic and L. Cooper. Neural network-based context driven recognition of on-line cursive script. In 7th IWFHR, 2000.\n\n[9] L. Rabiner and B. Juang. An introduction to hidden Markov models. IEEE ASSP Magazine, 3(1):4-16, 1986.\n\n[10] D. E. Rumelhart. Theory to practice: A case study - recognizing cursive handwriting. In E. B. Baum, editor, Computational Learning and Cognition: Proceedings of the Third NEC Research Symposium. SIAM, Philadelphia, 1993.\n\n[11] M. Schenkel, I. Guyon, and D. Henderson. On-line cursive script recognition using time delay neural networks and hidden Markov models. Machine Vision and Applications, 8:215-223, 1995.\n", "award": [], "sourceid": 1901, "authors": [{"given_name": "Predrag", "family_name": "Neskovic", "institution": null}, {"given_name": "Philip", "family_name": "Davis", "institution": null}, {"given_name": "Leon", "family_name": "Cooper", "institution": null}]}