{"title": "Recognition-based Segmentation of On-Line Hand-printed Words", "book": "Advances in Neural Information Processing Systems", "page_first": 723, "page_last": 730, "abstract": null, "full_text": "Recognition-based Segmentation \nof On-line Hand-printed Words \n\nM. Schenkel*, H. Weissman, I. Guyon, C. Nohl, D. Henderson \n\nAT&T Bell Laboratories, Holmdel, NJ 07733 \n\n* Swiss Federal Institute of Technology, CH-8092 Zurich \n\nAbstract \n\nThis paper reports on the performance of two methods for \nrecognition-based segmentation of strings of on-line hand-printed \ncapital Latin characters. The input strings consist of a time(cid:173)\nordered sequence of X-Y coordinates, punctuated by pen-lifts. The \nmethods were designed to work in \"run-on mode\" where there is no \nconstraint on the spacing between characters. While both methods \nuse a neural network recognition engine and a graph-algorithmic \npost-processor, their approaches to segmentation are quite differ(cid:173)\nent. The first method, which we call IN SEC (for input segmen(cid:173)\ntation), uses a combination of heuristics to identify particular pen(cid:173)\nlifts as tentative segmentation points. The second method, which \nwe call OUTSEC (for output segmentation), relies on the empiri(cid:173)\ncally trained recognition engine for both recognizing characters and \nidentifying relevant segmentation points. \n\n1 \n\nINTRODUCTION \n\nWe address the problem of writer independent recognition of hand-printed words \nfrom an 80,OOO-word English dictionary. Several levels of difficulty in the recognition \nof hand-printed words are illustrated in figure 1. The examples were extracted from \nour databases (table 1). Except in the cases of boxed or clearly spaced characters, \nsegmenting characters independently of the recognition process yields poor recogni(cid:173)\ntion performance. This has motivated us to explore recognition-based segmentation \ntechniques. 
\n\n723 \n\f724 \n\nSchenkel, Weissman, Guyon, Nohl, and Henderson \n\nTable 1: Databases used for training and testing. DB2 contains words one to five letters long, but only four and five letter words are constrained to be legal English words. DB3 contains legal English words of any length from an 80,000 word dictionary. \n\nuppercase database | data nature    | pad used | training set size | test set size | approx. # of donors \nDB1                | boxed letters  | AT&T     | 9000              | 1500          | 250 \nDB2                | short words    | Grid     | 8000              | 1000          | 400 \nDB3                | English words  | Wacom    | -                 | 600           | 25 \n\n[Figure 1 shows handwriting samples in four styles: (a) boxed; (b) spaced; (c) pen-lifts; (d) connected.] \n\nFigure 1: Examples of styles that can be found in our databases: (a) DB1; (b) DB2; (c), (d) DB2 and DB3. The line thickness or darkness is alternated at each pen-lift. \n\nThe basic principle of recognition-based segmentation is to present to the recognizer many \"tentative characters\". The recognition scores ultimately determine the string segmentation. We have investigated two different recognition-based segmentation methods which differ in their definition of the tentative characters, but have very similar recognition engines. \n\nThe data collection device provides pen trajectory information as a sequence of (x, y) coordinates at regular time intervals (10-15 ms). We use a preprocessing technique which preserves this information by keeping a finely sampled sequence of feature vectors along the pen trajectory (Guyon et al. 1991, Weissman et al. 1992). The recognizer is a Time Delay Neural Network (TDNN) (Lang and Hinton 1988, Waibel et al. 1989, Guyon et al. 1991). There is one output per class, in this case 26 outputs, providing a score for all the capital letters of the Latin alphabet. 
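The recognition-based segmentation principle described above can be sketched as a dynamic program over tentative characters: each candidate spans a pair of cut indices, carries per-letter recognition scores, and the best-scoring sequence of non-overlapping candidates is selected. The sketch below is illustrative only; the function name, the cut indices and the letter scores are hypothetical, and in the real system the scores would come from the TDNN rather than a hand-built table. \n\n```python
# Illustrative sketch (not the authors' implementation) of recognition-based
# segmentation: tentative characters span pairs of tentative cuts (i, j),
# each with per-letter scores; dynamic programming picks the sequence of
# non-overlapping characters with the best total log-score.
import math

def best_segmentation(scores, num_cuts):
    # scores: dict mapping (i, j) -> dict of letter -> score in (0, 1].
    # best[j] holds (best log-score for cuts 0..j, backpointer (i, j, letter)).
    best = {0: (0.0, None)}
    for j in range(1, num_cuts + 1):
        for (i, jj), letter_scores in scores.items():
            if jj != j or i not in best:
                continue
            letter, s = max(letter_scores.items(), key=lambda kv: kv[1])
            cand = best[i][0] + math.log(s)
            if j not in best or cand > best[j][0]:
                best[j] = (cand, (i, j, letter))
    # Trace the backpointers from the last cut to recover the best path.
    path, j = [], num_cuts
    while best[j][1] is not None:
        i, _, letter = best[j][1]
        path.append((i, j, letter))
        j = i
    return path[::-1]

# Hypothetical scores for an input with two tentative cuts, reading 'IT':
scores = {(0, 1): {'I': 0.9, 'T': 0.1},
          (1, 2): {'T': 0.8, 'I': 0.2},
          (0, 2): {'U': 0.3}}
print(best_segmentation(scores, 2))  # -> [(0, 1, 'I'), (1, 2, 'T')]
``` \n\nThe log-score sum corresponds to treating the per-character scores as independent probabilities; the paper's post-processor plays the same role with a Viterbi search over the interpretation graph. 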
\n\nThe critical step in the segmentation process is the postprocessing, which disentangles various word hypotheses using the character recognition scores provided by the TDNN. For this purpose, we use conventional dynamic programming algorithms. In addition, we use a dictionary that checks the solution and returns a list of similar legal words. The best word hypothesis, subject to this list, is then chosen by dynamic programming algorithms. \n\nRecognition-based segmentation relies on the recognizer to give low confidence scores for wrong tentative characters corresponding to a segmentation mistake. Recognizers trained only on valid characters usually perform poorly on such a task. \n\nWe use \"segmentation-driven training\" techniques which allow the training of wrong tentative characters, produced by the segmentation engine itself, as negative examples. This additional training has reduced our error rates by more than a factor of two. \n\nIn section 2 we describe the INSEG method, which uses tentative characters delineated by heuristic segmentation points. It is expected to be most appropriate for hand-printed capital letters since nearly all writers separate these letters by pen-lifts. This method was inspired by a similar technique used for Optical Character Recognition (OCR) (Burges et al. 1992). In section 3 we present an alternative method, OUTSEG, which expects the recognition engine to learn empirically (learning by examples) both to recognize characters and to identify relevant segmentation points. This second method bears similarities with the OCR methods proposed by Matan et al. (1991) or Keeler et al. (1991). In section 4 we compare the two methods and present experimental results. \n\n2 SEGMENTATION IN INPUT SPACE \n\nFigure 2 shows the different steps of the INSEG process. 
Module 1 is used to define \"tentative characters\" delineated by \"tentative cuts\" (spaces or pen-lifts). The tentative characters are then handed to module 2, which performs the preprocessing and the scoring of the characters with a TDNN. The recognition results are then gathered into an interpretation graph. In module 3 the best path through that graph is found with the Viterbi algorithm. \n\n[Figure 2 shows the processing pipeline: pen input -> stroke detector & grouper -> tentative characters -> preprocessor & TDNN -> interpretation graph -> best path search.] \n\nFigure 2: Processing steps of the INSEG method. \n\nIn figure 3 we show a simplified representation of an interpretation graph built by our system. Each tentative character (denoted {i, j}) has a double index: the tentative cut i at the character starting point and the tentative cut j at the character end point. We denote by X{i, j} the node associated with the score of letter X for the tentative character {i, j}. A path through the graph starts at a node X{0, .} and ends at a node Y{., m}, where 0 is the word starting point and m the last pen-lift. In between, only transitions of the kind X{., i} -> Y{i, .} are allowed, to prevent character overlapping. \n\nTo avoid searching through too complex a graph, we need to perform some pruning. The spatial relationship between strokes is used to discard unlikely tentative cuts. For instance, strokes with a large horizontal overlap are bundled. The remaining tentative characters are then grouped in different ways to form alternative tentative characters. Tentative characters separated by a large horizontal spatial interval are never considered for grouping. \n\nFigure 3: Graph obtained with the input segmentation method. 
\nThe grey shading in each box indicates the recognition scores (the darker, the stronger the recognition score and the higher the recognition confidence). \n\nIn table 2 we present the results obtained with the TDNN recognizer used by Guyon et al. (1991), with 4 convolutional layers and 6,252 weights. Characters are preprocessed individually, which provides the network with a fixed dimension input. \n\n3 SEGMENTATION IN OUTPUT SPACE \n\nIn contrast with INSEG, the OUTSEG method does not rely on human-designed segmentation hints: the neural network learns both recognition and segmentation features from examples. \n\nTentative characters are produced simply: a window is swept over the input sequence in small steps, and at each step the content of the window is taken to be a tentative character. Successive characters usually overlap considerably. \n\n[Figure 4 shows the matrix of TDNN output scores for the word \"LOOP\", with tentative character number i along the time axis and the interpretations X (including nil) along the vertical axis.] \n\nFigure 4: TDNN outputs of the OUTSEG system. The grey curve indicates the best path through the graph, using duration modeling. The word \"LOOP\" was correctly recognized in spite of the ligatures which prevent segmentation on the basis of pen-lifts. \n\nIn figure 4, we show the outputs of our TDNN recognizer when the word \"LOOP\" is processed. The main matrix is a simplified representation of our interpretation graph. Tentative character numbers i (i in {1, 2, ..., m}) run along the time direction. Each column contains the scores of all possible interpretations X (X in {A, B, C, ..., Z, nil}) of a given tentative character. The bottom line is the nil interpretation score, which approximates the probability that the present input is not a character (meaningless character): P(nil{i}|input) = 1 - (P(A{i}|input) + P(B{i}|input) + ... + P(Z{i}|input)). \n\nThe connections between nodes reflect a model of character durations. A simple way of enforcing duration is to allow only the following transitions: \n\nX{i} -> X{i+1}, \nnil{i} -> nil{i+1}, \nX{i} -> nil{i+1}, \nnil{i} -> X{i+1}, \n\nwhere X stands for a certain letter. A character interpretation can be followed by the same interpretation but cannot be followed immediately by another character interpretation: they must be separated by nil. This permits distinguishing between letter duration and letter repetition (such as the double \"O\" in our example). The best path in the graph is found by the Viterbi algorithm. \n\nIn fact, this simple pattern of connections corresponds to a Markov model of duration, with exponential decay. We implemented a slightly fancier model which allows the generation of any duration distribution (Weissman et al. 1992) to help prevent character omission or insertion. In our experiments, we selected two Poisson distributions to model the character and nil-class durations respectively. \n\nWe use a TDNN recognizer with 3 layers and 10,817 weights. The sequence of recognition scores is obtained by sweeping the neural network over the input. Because of the convolutional structure of the TDNN, there are many identical computations between two successive calls of the recognizer and only about one sixth of the network connections have to be reevaluated for each new tentative character. As a consequence, although the OUTSEG system processes many more tentative characters than the INSEG system does, the overall computation time is about the same. \n\n4 COMPARISON OF RESULTS AND CONCLUSIONS \n\nTable 2: Comparison of the performance of the two segmentation methods using a TDNN recognizer. \n\n          Error without dictionary   Error with dictionary \n          % char.   % word           % char.   % word \non DB2 \n  INSEG      9        18                8.5       15 \n  OUTSEG    10        21                8         17 \non DB3 \n  INSEG      8        33                5         13 \n  OUTSEG    11        48                7         21 \n\nWe summarize in table 2 the results obtained with our two segmentation methods. To complement the results obtained with database DB2, we used (without retraining) database DB3 as a control, containing words of any length from the English dictionary. In our current versions, INSEG performs better than OUTSEG. The OUTSEG method can handle connected letters (such as in the example of the word \"LOOP\" in figure 4), while the INSEG method, which relies on pen-lifts, cannot. However, we found that, in the data we collected, very few writers failed to separate their characters by pen-lifts. On the other hand, an advantage of the INSEG method is that it can easily be used with recognizers other than the TDNN, whereas the OUTSEG method relies heavily on the convolutional structure of the TDNN for computational efficiency. \n\nFor comparison, we substituted two other neural network recognizers for the TDNN. These networks use alternative input representations. The OCR-net was designed for Optical Character Recognition (Le Cun et al. 1989) and uses pixel map inputs. Its first layer performs local line orientation detection. The orientation-net has an architecture similar to that of the OCR-net, but its first layer is removed and local line orientation information, directly extracted from the pen trajectory, is transmitted to the second layer (Weissbuch and Le Cun 1992). Without a dictionary, the OCR-net has an error rate more than twice that of the TDNN, but the orientation-net performs similarly. With dictionary, the orientation-net has a 25% lower error rate than the TDNN. 
This improvement is attributed to better second and third best recognition choices, which facilitate dictionary use. \n\nOur best results to date (table 3) were obtained with the INSEG method, using two recognizers combined with a voting scheme: the TDNN and the orientation-net. For comparison purposes we mention the results obtained by a commercial recognizer on the same data. One should notice that our dictionary is the same as the one from which the data was drawn and is probably a larger dictionary than the one used by the commercial system. Our results are substantially better than those of the commercial system. On an absolute scale they are quite satisfactory if we take into account that the test data was not cleaned at all and that more than 20% of the errors have been identified as patterns written in cursive, misspelled or totally illegible. \n\nWe expect the OUTSEG method to work best for cursive handwriting, which does not exhibit trivial segmentation hints, but we do not have any direct evidence to support this expectation as yet. Rumelhart (1992) had success with a version of OUTSEG. Work is in progress to extend the capabilities of our systems to cursive writing. \n\nTable 3: Performance of our best system. For comparison, we mention in parenthesis the performances obtained by a commercial recognizer on the same data. The performance of the commercial system with dictionary (marked with a *) is penalized because DB2 and DB3 include words not contained in its dictionary. \n\n         Error without dictionary   Error with dictionary \nMethod   % char.   % word           % char.   % word \nDB2      7 (18)    13 (29)          7 (17*)   10 (32*) \nDB3      6 (20)    23 (61)          5 (18*)   11 (49*) \n\nAcknowledgments \n\nWe wish to thank the entire Neural Network group at Bell Labs Holmdel for their supportive discussions. Helpful suggestions with the editing of this paper by L. Jackel and B. 
Boser are gratefully acknowledged. We are grateful to Anne Weissbuch, Yann Le Cun and Jan Ben for giving us their neural networks to try on our INSEG method. We are indebted to Howard Page for providing comparison figures with the commercial recognizer. The experiments were performed with the neural network simulators of B. Boser, Y. Le Cun and L. Bottou, whom we thank for their help and advice. \n\nReferences \n\nI. Guyon, P. Albrecht, Y. Le Cun, J. Denker and W. Hubbard. Design of a neural network character recognizer for a touch terminal. Pattern Recognition, 24(2), 1991. \n\nH. Weissman, M. Schenkel, I. Guyon, C. Nohl and D. Henderson. Recognition-based segmentation of on-line run-on handprinted words: Input vs. output segmentation. Submitted to Pattern Recognition, October 1992. \n\nK. J. Lang and G. E. Hinton. A time delay neural network architecture for speech recognition. Technical Report CMU-CS-88-152, Carnegie Mellon University, Pittsburgh PA, 1988. \n\nA. Waibel, T. Hanazawa, G. Hinton, K. Shikano and K. Lang. Phoneme recognition using time-delay neural networks. IEEE Transactions on Acoustics, Speech and Signal Processing, 37:328-339, March 1989. \n\nC. J. C. Burges, O. Matan, Y. Le Cun, J. Denker, L. D. Jackel, C. E. Stenard, C. R. Nohl and J. I. Ben. Shortest path segmentation: A method for training neural networks to recognize character strings. In IJCNN'92, volume 3, Baltimore, 1992. IEEE. \n\nO. Matan, C. J. C. Burges, Y. Le Cun and J. Denker. Multi-digit recognition using a Space Displacement Neural Network. In J. E. Moody et al., editors, Advances in Neural Information Processing Systems 4, Denver, 1992. Morgan Kaufmann. \n\nJ. Keeler, D. E. Rumelhart and W-K. Leow. Integrated segmentation and recognition of hand-printed numerals. In R. 
Lippmann et al., editors, Advances in Neural Information Processing Systems 3, pages 557-563, Denver, 1991. Morgan Kaufmann. \n\nY. Le Cun, L. D. Jackel, B. Boser, J. S. Denker, H. P. Graf, I. Guyon, D. Henderson, R. E. Howard and W. Hubbard. Handwritten digit recognition: Application of neural network chips and automatic learning. IEEE Communications Magazine, pages 41-46, November 1989. \n\nA. Weissbuch and Y. Le Cun. Private communication, 1992. \n\nD. Rumelhart et al. Integrated segmentation and recognition of cursive handwriting. In Third NEC Symposium on Computational Learning and Cognition, Princeton, New Jersey, 1992 (to appear). \n", "award": [], "sourceid": 607, "authors": [{"given_name": "M.", "family_name": "Schenkel", "institution": null}, {"given_name": "H.", "family_name": "Weissman", "institution": null}, {"given_name": "I.", "family_name": "Guyon", "institution": null}, {"given_name": "C.", "family_name": "Nohl", "institution": null}, {"given_name": "D.", "family_name": "Henderson", "institution": null}]}