{"title": "A Probabilistic Model for Learning Concatenative Morphology", "book": "Advances in Neural Information Processing Systems", "page_first": 1537, "page_last": 1544, "abstract": null, "full_text": "A Probabilistic Model for Learning Concatenative Morphology\n\nMatthew G. Snover and Michael R. Brent\n\nDepartment of Computer Science\n\nWashington University\n\nSt Louis, MO, USA, 63130-4809\n\nms9@cs.wustl.edu, brent@cs.wustl.edu\n\nAbstract\n\nThis paper describes a system for the unsupervised learning of morphological suffixes and stems from word lists. The system is composed of a generative probability model and hill-climbing and directed search algorithms. By extracting and examining morphologically rich subsets of an input lexicon, the directed search identifies highly productive paradigms. The hill-climbing algorithm then further maximizes the probability of the hypothesis. Quantitative results are shown by measuring the accuracy of the morphological relations identified. Experiments in English and Polish, as well as comparisons with another recent unsupervised morphology learning algorithm, demonstrate the effectiveness of this technique.\n\n1 Introduction\n\nOne of the fundamental problems in computational linguistics is adaptation of language processing systems to new languages with minimal reliance on human expertise. A ubiquitous component of language processing systems is the morphological analyzer, which determines the properties of morphologically complex words like watches and gladly by inferring their derivation as watch+s and glad+ly. The derivation reveals much about the word, such as the fact that glad+ly shares syntactic properties with quick+ly and semantic properties with its stem glad. 
While morphological processes can take many forms, the most common are suffixation and prefixation (collectively, concatenative morphology).\n\nIn this paper, we present a system for unsupervised inference of morphological derivations of written words, with no prior knowledge of the language in question. Specifically, neither the stems nor the suffixes of the language are given in advance. This system is designed for concatenative morphology, and the experiments presented focus on suffixation. It is applicable to any language for which written word lists are available. In languages that have been a focus of research in computational linguistics the practical applications are limited, but in languages like Polish, automated analysis of unannotated text corpora has potential applications for information retrieval and other language processing systems. In addition, automated analysis might find application as a hypothesis-generating tool for linguists or as a cognitive model of language acquisition. In this paper, however, we focus on the problem of unsupervised morphological inference for its inherent interest.\n\nDuring the last decade several minimally supervised and unsupervised algorithms have been developed. Gaussier [1] describes an explicitly probabilistic system that is based primarily on spellings. It is an unsupervised algorithm, but requires the tweaking of parameters to tune it to the target language. Brent [2] and Brent et al. [3] describe Minimum Description Length (MDL) systems. Goldsmith [4] describes a similar MDL approach. Our motivation in developing a new system was to improve performance and to have a model cast in an explicitly probabilistic framework. 
We are particularly interested in developing automated morphological analysis as a first stage of a larger grammatical inference system, and hence we favor a conservative analysis that identifies primarily productive morphological processes (those that can be applied to new words).\n\nIn this paper, we present a probabilistic model and search algorithm for automated analysis of suffixation, along with experiments comparing our system to that of Goldsmith [4]. This system, which extends the system of Snover and Brent [5], is designed to detect the final stem and suffix break of each word given a list of words. It does not distinguish between derivational and inflectional suffixation or between the notion of a stem and a root. Further, it does not currently have a mechanism to deal with multiple interpretations of a word, or to deal with morphological ambiguity. Within its design limitations, however, it is both mathematically clean and effective.\n\n2 Probability Model\n\nThis section introduces a prior probability distribution over the space of all hypotheses, where a hypothesis is a set of words, each with a morphological split separating the stem and suffix. The distribution is based on a seven-step model for the generation of hypotheses, which is heavily based upon the probability model presented in [5]. The hypothesis is generated by choosing the number of stems and suffixes, the spellings of those stems and suffixes, and then the combination of the stems and suffixes.\n\nThe seven steps are presented below, along with their probability distributions and a running example of how a hypothesis could be generated by this process. By taking the product over the distributions from all of the steps of the generative process, one can calculate the prior probability for any given hypothesis. What is described in this section is a mathematical model and not an algorithm intended to be run.\n\n1. 
Choose the number of stems, M, according to the distribution:\n\nP(M) = 6/(π²M²)   (1)\n\nThe 6/π² term normalizes the inverse-squared distribution on the positive integers. The number of suffixes, X, is chosen according to the same probability distribution. The symbols M for steMs and X for suffiXes are used throughout this paper.\nExample: M = 5. X = 3.\n\n2. For each stem i, choose its length in letters, l_i, according to the inverse-squared distribution. Assuming that the lengths are chosen independently and multiplying together their probabilities, we have:\n\nP(l_1, ..., l_M | M) = ∏_{i=1..M} 6/(π² l_i²)   (2)\n\nThe distribution for the lengths of the suffixes is similar to (2), differing only in that suffixes of length 0 are allowed, by offsetting the length by one.\nExample: stem lengths = 4, 4, 4, 3, 3; suffix lengths = 2, 0, 1.\n\n3. Let Σ be the alphabet, and let ρ be a probability distribution on Σ. For each i from 1 to M, generate stem i by choosing l_i letters at random, according to the probabilities ρ. Call the resulting stem set STEM. The suffix set SUFF is generated in the same manner. The probability of any character, c, being chosen is obtained from a maximum likelihood estimate: ρ(c) = count(c)/N, where count(c) is the count of c among all the hypothesized stems and suffixes and N is the total count. The joint probability of the hypothesized stem and suffix sets is defined by the distribution:\n\nP(STEM, SUFF | lengths) = M! X! ∏_{w ∈ STEM ∪ SUFF} ∏_{c ∈ w} ρ(c)   (3)\n\nThe factorial terms reflect the fact that the stems and suffixes could be generated in any order.\nExample: STEM = {walk, look, door, far, cat}. SUFF = {ed, ∅, s}.\n\n4. We now choose the number of paradigms, K. A paradigm is a set of suffixes and the stems that attach to those suffixes and no others. Each stem is in exactly one paradigm, and each paradigm has at least one stem; thus K can range from 1 to M. We pick K according to the following uniform distribution:\n\nP(K | M) = 1/M   (4)\n\nExample: K = 3.\n\n5. We choose the number of suffixes in each paradigm, X_k, according to a uniform distribution. The distribution for picking the X_k suffixes of paradigm k is therefore:\n\nP(X_k | X) = 1/X   (5)\n\nThe joint probability over all paradigms is P(X_1, ..., X_K | X) = 1/X^K.\nExample: X_1, X_2, X_3 = 3, 1, 2.\n\n6. For each paradigm k, choose the set of X_k suffixes, PARA^X_k, that the paradigm will represent. The number of subsets of a given size is finite, so we can again use the uniform distribution. This implies that the probability of each individual subset of size X_k is the inverse of the total number of such subsets. Assuming that the choices for each paradigm are independent:\n\nP(PARA^X_1, ..., PARA^X_K | SUFF, X_1, ..., X_K) = ∏_{k=1..K} 1/C(X, X_k)   (6)\n\nExample: PARA^X_1 = {∅, s, ed}. PARA^X_2 = {∅}. PARA^X_3 = {∅, s}.\n\n7. For each stem, choose the paradigm that the stem will belong in, according to a distribution that favors paradigms with more stems. The probability of choosing a paradigm k for a stem is calculated using a maximum likelihood estimate: |PARA^M_k|/M, where PARA^M_k is the set of stems in paradigm k. Assuming that all these choices are made independently yields the following:\n\nP(PARA^M_1, ..., PARA^M_K | STEM, K) = ∏_{k=1..K} (|PARA^M_k|/M)^{|PARA^M_k|}   (7)\n\nExample: PARA^M_1 = {walk, look}. PARA^M_2 = {far}. PARA^M_3 = {door, cat}.\n\nCombining the results of stages 6 and 7, one can see that the running example would yield the hypothesis consisting of the set of words with suffix breaks: walk+∅, walk+s, walk+ed, look+∅, look+s, look+ed, far+∅, door+∅, door+s, cat+∅, cat+s. Removing the breaks in the words results in the set of input words. To find the probability for this hypothesis, just take the product of the probabilities from equations (1) to (7).\n\nUsing this generative model, we can assign a probability to any hypothesis. 
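The prior defined by equations (1) to (7) is straightforward to evaluate in log space. The sketch below is ours, not the paper's implementation: a hypothesis is encoded as a map from each paradigm's suffix tuple to the stems in that paradigm, and all names are illustrative.

```python
import math
from collections import Counter

def inv_sq(n):
    # Normalized inverse-squared distribution on positive integers: 6 / (pi^2 n^2).
    return 6.0 / (math.pi ** 2 * n ** 2)

def log_prior(paradigms):
    """Log prior of a hypothesis given as {suffix tuple: stems in that paradigm}."""
    stems = [t for ts in paradigms.values() for t in ts]
    suffixes = sorted({x for ss in paradigms for x in ss})
    M, X, K = len(stems), len(suffixes), len(paradigms)
    lp = math.log(inv_sq(M)) + math.log(inv_sq(X))             # step 1: counts
    lp += sum(math.log(inv_sq(len(t))) for t in stems)         # step 2: stem lengths
    lp += sum(math.log(inv_sq(len(x) + 1)) for x in suffixes)  # step 2: suffix lengths (0 allowed)
    counts = Counter(c for w in stems + suffixes for c in w)   # step 3: MLE spelling model
    total = sum(counts.values())
    lp += math.lgamma(M + 1) + math.lgamma(X + 1)              # the M! X! ordering terms
    lp += sum(k * math.log(k / total) for k in counts.values())
    lp += -math.log(M)                                         # step 4: K uniform on 1..M
    lp += -K * math.log(X)                                     # step 5: paradigm sizes uniform
    for ss in paradigms:                                       # step 6: uniform suffix subsets
        lp -= math.log(math.comb(X, len(ss)))
    for ts in paradigms.values():                              # step 7: MLE paradigm choice
        lp += len(ts) * math.log(len(ts) / M)
    return lp
```

On the running example (three paradigms over the stems walk, look, door, far, cat), this returns a finite negative log probability; comparing such values between competing segmentations is what drives the search.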
Typically one wishes to know the probability of the hypothesis given the data; however, in our case such a distribution is not required. Equation (8) shows how the probability of the hypothesis given the data could be derived from Bayes law.\n\nP(Hyp | Data) = P(Data | Hyp) P(Hyp) / P(Data)   (8)\n\nOur search only considers hypotheses consistent with the data. The probability of the data given the hypothesis, P(Data | Hyp), is always 1, since if you remove the breaks from any hypothesis, the input data is produced. This would not be the case if our search considered inconsistent hypotheses. The prior probability of the data is constant over all hypotheses, thus the probability of the hypothesis given the data reduces to P(Hyp). The prior probability of the hypothesis is given by the above generative process and, among all consistent hypotheses, the one with the greatest prior probability also has the greatest posterior probability.\n\n3 Search\n\nThis section details a novel search algorithm which is used to find a high probability segmentation of all the words in the input lexicon. The input lexicon is a list of words extracted from a corpus. The output of the search is a segmentation of each of the input words into a stem and suffix.\n\nThe search algorithm has two phases, which we call the directed search and the hill-climbing search. 
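Both phases ultimately compare hypothesis probabilities, and the hill-climbing phase in particular is an instance of greedy local search. A generic skeleton of that control flow (ours, with hypothetical names; the actual moves add or remove suffixes at nodes) looks like:

```python
def hill_climb(hypothesis, moves, score):
    # Greedy local search: repeatedly apply any move that raises the score,
    # stopping when no available move improves the current hypothesis.
    improved = True
    while improved:
        improved = False
        for move in moves:
            candidate = move(hypothesis)
            if score(candidate) > score(hypothesis):
                hypothesis, improved = candidate, True
    return hypothesis
```

With `score` set to the prior of Section 2 and `moves` set to suffix additions and removals, this is the shape of the search described below.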
The directed search builds up a consistent hypothesis about the segmentation of all words in the input out of consistent hypotheses about subsets of the words. The hill-climbing search further tunes the result of the directed search by trying out nearby hypotheses over all the input words.\n\n3.1 Directed Search\n\nThe directed search is accomplished in two steps. First, sub-hypotheses, each of which is a hypothesis about a subset of the lexicon, are examined and ranked. The B best sub-hypotheses are then incrementally combined until a single sub-hypothesis remains. The remainder of the input lexicon is added to this sub-hypothesis, at which point it becomes the final hypothesis.\n\nWe define the set of possible suffixes to be the set of terminal substrings, including the empty string ∅, of the words in the input lexicon. For each subset F of the possible suffixes, there is a maximal set of possible stems (initial substrings) S_F such that for each stem t in S_F and each suffix f in F, the concatenation t+f is a word in the lexicon, and each word that can be analyzed as consisting of a stem in S_F and a suffix in F is analyzed that way. This sub-hypothesis consists of all pairings of the stems in S_F and the suffixes in F, with the corresponding morphological breaks. One can think of each sub-hypothesis as initially corresponding to a maximally filled paradigm. We only consider sub-hypotheses which have at least two stems and two suffixes.\n\nFor each sub-hypothesis, H, there is a corresponding null hypothesis, N(H), which has the same set of words as H, but in which all the words are hypothesized to consist of the word as the stem and ∅ as the suffix. We give each sub-hypothesis a score as follows: score(H) = P(H)/P(N(H)). This reflects how much more probable H is, for those words, than the null hypothesis.\n\nOne can view all sub-hypotheses as nodes in a directed graph. Each node is connected to another node if and only if the second node represents a superset of the suffixes that the first represents, exactly one suffix greater in size. By beginning at the node representing no suffixes, one can apply standard graph search techniques, such as a beam search or a best first search, to find the B best scoring nodes without visiting all nodes. While one cannot guarantee that such approaches perform exactly the same as examining all sub-hypotheses, initial experiments using a beam search with a beam size equal to B, with a B of 100, show that the B best sub-hypotheses are found with a significant decrease in the number of nodes visited. The experiments presented in this paper do not use these pruning methods.\n\nThe B highest scoring sub-hypotheses are incrementally combined in order to create a hypothesis over the complete set of input words. Changing the value of B does not dramatically alter the results of the algorithm, though higher values of B give slightly better results. 
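The first step of the directed search, pairing a candidate suffix set with the maximal set of stems that combine with every suffix in it, can be sketched as follows. This is an illustrative reconstruction with our own names; the real system then scores each such sub-hypothesis against its null hypothesis as a ratio of two prior probabilities.

```python
from itertools import combinations

def stems_for(suffix_set, lexicon):
    # Maximal stem set for a suffix set: stems t such that t + f is an
    # observed word for EVERY suffix f in the set.
    lexicon = set(lexicon)
    candidates = {w[: len(w) - len(f)] for w in lexicon for f in suffix_set
                  if w.endswith(f) and len(w) > len(f)}
    return {t for t in candidates if all(t + f in lexicon for f in suffix_set)}

def sub_hypotheses(lexicon, max_suffixes=2):
    # Enumerate suffix subsets drawn from observed terminal substrings
    # (including the empty suffix), keeping those with at least two stems
    # and at least two suffixes, as the paper requires.
    suffixes = {w[i:] for w in lexicon for i in range(len(w) + 1)}
    for size in range(2, max_suffixes + 1):
        for subset in combinations(sorted(suffixes, key=lambda s: (len(s), s)), size):
            stems = stems_for(subset, lexicon)
            if len(stems) >= 2:
                yield subset, stems
```

Each yielded pair corresponds to one maximally filled paradigm; in the full system these nodes are what the graph search ranks.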
We let B be 100 in the experiments reported here.\n\nLet H be the highest scoring sub-hypothesis. We iteratively remove H from the set of sub-hypotheses. The words in H, and their null hypotheses, with their morphological breaks from H, are added to each of the remaining sub-hypotheses. If a word in a remaining sub-hypothesis was already in H, the morphological break from H overrides the one it had before. All of the sub-hypotheses are now rescored, as the words in them have changed. If, after rescoring, none of the sub-hypotheses have likelihood ratios greater than one, then we use H as our final hypothesis. Otherwise we iterate until either there is only one sub-hypothesis left or all sub-hypotheses have scores no greater than one.\n\nThe final sub-hypothesis is now converted into a full hypothesis over all the words. All words in the input lexicon that are not in the final sub-hypothesis are added to it with the empty suffix ∅.\n\n3.2 Hill Climbing Search\n\nThe hill climbing search further optimizes the probability of the hypothesis by moving stems to new nodes, where a node represents a set of suffixes and the stems attached to them. For each possible suffix f, and each node n, the search attempts to add f to n. This means that all stems in n that can take the suffix f are moved to a new node, which represents all the suffixes of n plus f. This is analogous to pushing stems to adjacent nodes in a directed graph. A stem t can only be moved into a node with the suffix f if the new word, t+f, is an observed word in the input lexicon. 
The move is only done if it increases the probability of the hypothesis.\n\nThere is an analogous suffix removal step which attempts to remove suffixes from nodes. The hill climbing search continues to add and remove suffixes to nodes until the probability of the hypothesis cannot be increased. A more detailed description of this portion of the search and its algorithmic invariants is given in [5].\n\n4 Experiment and Evaluation\n\n4.1 Experiment\n\nWe tested our unsupervised morphology learning system, which we refer to as Paramorph, and Goldsmith's MDL system, otherwise known as Linguistica (a demo version available on the web, http://humanities.uchicago.edu/faculty/goldsmith/, was used for these experiments; word-list corpus mode and the method A suffix detection were used, and all other parameters were left at their default values), on various sized word lists from English and Polish corpora. For English we used set A of the Hansard corpus, which is a parallel English and French corpus of proceedings of the Canadian Parliament. We were unable to find a standard corpus for Polish and developed one from online sources. The sources for the Polish corpus were older texts, and thus our results correspond to a slightly antiquated form of the language. The results were evaluated by measuring the accuracy of the stem relations identified.\n\nWe extracted input lexicons from each corpus, excluding words containing non-alphabetic characters. The 100 most common words in each corpus were also excluded, since these words tend to be function words and are not very informative for morphology. The systems were run on the 500, 1,000, 2,000, 4,000, and 8,000 most common remaining words. 
The experiments in English were also conducted on the 16,000 most common words from the Hansard corpus.\n\n4.1.1 Stem Relation\n\nIdeally, we would like to be able to specify the correct morphological break for each of the words in the input; however, morphology is laced with ambiguity, and we believe this to be an inappropriate method for this task. For example, it is unclear where the break in the word "location" should be placed. It seems that the stem "locate" is combined with the suffix "tion", but in terms of simple concatenation it is unclear if the break should be placed before or after the "t".\n\nIn an attempt to solve this problem we have developed a new measure of performance, which does not specify the exact morphological split of a word. We measure the accuracy of the stems predicted by examining whether two words which are morphologically related are predicted as having the same stem. The actual break point for the stems is not evaluated, only whether the words are predicted as having the same stem. We are working on a similar measure for suffix identification.\n\nTwo words are related if they share the same immediate stem. For example, the words "building", "build", and "builds" are related since they all have "build" as a stem, just as "building" and "buildings" are related as they both have "building" as a stem. The two words "buildings" and "build" are not directly related, since the former has "building" as a stem, while "build" is its own stem. 
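This relatedness test (same immediate stem, with a word also counting as a stem of itself, following the build/building/buildings examples) and the comparison of predicted against true relations can be sketched as follows; the encoding of an analysis as a word-to-stem map is ours.

```python
from itertools import combinations

def relation_pairs(stem_of):
    # Two words are related when they share an immediate stem, where a word
    # also counts as a stem of itself (so "building" relates to "buildings").
    pairs = set()
    for a, b in combinations(sorted(stem_of), 2):
        if stem_of[a] == stem_of[b] or stem_of[a] == b or stem_of[b] == a:
            pairs.add((a, b))
    return pairs

def stem_relation_scores(predicted, gold):
    # Precision: fraction of predicted relations that are correct.
    # Recall: fraction of true relations that were predicted.
    p, g = relation_pairs(predicted), relation_pairs(gold)
    precision = len(p & g) / len(p) if p else 0.0
    recall = len(p & g) / len(g) if g else 0.0
    fscore = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, fscore
```

Note that only the induced relations are compared, never the break points themselves, which is what makes the measure robust to the "location" ambiguity above.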
Irregular forms of words are also considered to be related, even though such relations would be very difficult to detect with a simple concatenation model.\n\nThe stem relation precision measures how many of the relations predicted by the system were correct, while the recall measures how many of the relations present in the data were found. The stem relation fscore is an unbiased combination of precision and recall that favors equal scores.\n\n4.2 Results\n\nThe results from the experiments are shown in Figures 1 and 2. All graphs are shown using a log scale for the corpus size. Due to software difficulties we were unable to get Linguistica to run on 500, 1,000, and 2,000 words in English. The software ran without difficulties on the larger English datasets and on the Polish data. As an additional note, Linguistica was dramatically faster than Paramorph, which is a development-oriented software package and not as optimized for efficient runtime as Linguistica appears to be.\n\nFigure 1 shows the number of different suffixes predicted by each of the algorithms in both English and Polish. 
Our Paramorph system found a relatively constant number of suffixes across lexicon sizes, and Linguistica found an increasingly large number of suffixes, predicting over 700 different suffixes in the 16,000 word English lexicon.\n\n[Figure 1: Number of Suffixes Predicted. Number of suffixes found by ParaMorph and Linguistica in English and Polish, plotted against lexicon size.]\n\n[Figure 2: Stem Relation Fscores. Stem relation fscores for ParaMorph and Linguistica in English and Polish, plotted against lexicon size.]\n\nFigure 2 shows the fscores using the stem relation metric for various sizes of English and Polish input lexicons. Paramorph maintains a very high precision across lexicon sizes in both languages, whereas the precision of Linguistica decreases considerably at larger lexicon sizes. However, Linguistica shows an increasing recall as the lexicon size increases, with Paramorph having a decreasing recall as lexicon size increases, though the recall of Linguistica in Polish is consistently lower than Paramorph's recall. 
The fscores for Paramorph and Linguistica in English are very close, and Paramorph appears to clearly outperform Linguistica in Polish.\n\nSuffixes: -a -e -ego -ej -ie -o -y | Stems: dziwn\nSuffixes: -a -ami -y -ę | Stems: chmur siekier\nSuffixes: -cie -li -m -ć | Stems: gada odda sprzeda\n\nTable 1: Sample Paradigms in Polish\n\nTable 1 shows several of the larger paradigms found by Paramorph when run on 8000 words of Polish. The first paradigm shown is for the single adjective stem meaning "strange", with numerous inflections for gender, number and case, as well as one derivational suffix, "-ie", which changes it into an adverb, "strangely". The second paradigm is for the nouns "cloud" and "ax", with various case inflections, and the third paradigm contains the verbs "talk", "return", and "sell". All suffixes in the third paradigm are inflectional, indicating tense and agreement.\n\nThe differences between the performance of Linguistica and Paramorph can most easily be seen in the number of suffixes predicted by each algorithm. The number of suffixes predicted by Linguistica grows linearly with the number of words, in general causing it to get much higher recall at the expense of precision. Paramorph maintains a fairly constant number of suffixes, causing it to generally have higher precision at the expense of recall. This is consistent with our goals to create a conservative system for morphological analysis, where the number of false positives is minimized.\n\nThe Polish language presents special difficulties for both Linguistica and Paramorph, due to the highly complex nature of its morphology. There are far fewer spelling change rules and a much higher frequency of suffixes in Polish than in English. 
In addition, phonology plays a much stronger role in Polish morphology, causing alterations in stems which are difficult to detect using a concatenative framework.\n\n5 Discussion\n\nMany of the stem relations predicted by Paramorph result from postulating stem and suffix breaks in words that are actually morphologically simple. This occurs when the endings of these words resemble other, correct, suffixes. In an attempt to deal with this problem we have investigated incorporating semantic information into the probability model, since morphologically related words also tend to be semantically related. A successful implementation of such information should eliminate errors such as capable breaking down as cap+able, since capable is not semantically related to cape or cap.\n\nThe goal of the Paramorph system was to produce a preliminary description, with very few false positives, of the final suffixation, both inflectional and derivational, in a language independent manner. Paramorph performed better for the most part with respect to fscore than Linguistica, but more importantly, the precision of Linguistica does not approach the precision of our algorithm, particularly on the larger corpus sizes. In summary, we feel our Paramorph system has attained the goal of producing an initial estimate of suffixation that could serve as a front end to aid other models in discovering higher level structure.\n\nReferences\n\n[1] Éric Gaussier. 1999. Unsupervised learning of derivational morphology from inflectional lexicons. In ACL '99 Workshop Proceedings: Unsupervised Learning in Natural Language Processing. ACL.\n\n[2] Michael R. Brent. 1993. Minimal generative models: A middle ground between neurons and triggers. In Proceedings of the 15th Annual Conference of the Cognitive Science Society, pages 28-36, Hillsdale, NJ. Erlbaum.\n\n[3] Michael R. Brent, Sreerama K. Murthy, and Andrew Lundberg. 1995. Discovering morphemic suffixes: A case study in minimum description length induction. In Proceedings of the Fifth International Workshop on Artificial Intelligence and Statistics, Ft. Lauderdale, FL.\n\n[4] John Goldsmith. 2001. Unsupervised learning of the morphology of a natural language. Computational Linguistics, 27:153-198.\n\n[5] Matthew G. Snover and Michael R. Brent. 2001. A Bayesian model for morpheme and paradigm identification. In Proceedings of the 39th Annual Meeting of the ACL, pages 482-490. ACL.", "award": [], "sourceid": 2285, "authors": [{"given_name": "Matthew", "family_name": "Snover", "institution": null}, {"given_name": "Michael", "family_name": "Brent", "institution": null}]}