Part of Advances in Neural Information Processing Systems 15 (NIPS 2002)

*Willem Zuidema*

Language acquisition is a special kind of learning problem because the outcome of learning of one generation is the input for the next. That makes it possible for languages to adapt to the particularities of the learner. In this paper, I show that this type of language change has important consequences for models of the evolution and acquisition of syntax.

1 The Language Acquisition Problem

For both artificial systems and non-human animals, learning the syntax of natural languages is a notoriously hard problem. All healthy human infants, in contrast, learn any of the approximately 6000 human languages rapidly, accurately and spon(cid:173) taneously. Any explanation of how they accomplish this difficult task must specify the (innate) inductive bias that human infants bring to bear, and the input data that is available to them. Traditionally, the inductive bias is termed - somewhat un(cid:173) fortunately - "Universal Grammar", and the input data "primary linguistic data".

Over the last 30 years or so, a view on the acquisition of the syntax of natural language has become popular that has put much emphasis on the innate machinery. In this view, that one can call the "Principles and Parameters" model, the Universal Grammar specifies most aspects of syntax in great detail [e.g. 1]. The role of experience is reduced to setting a limited number (30 or so) of parameters. The main argument for this view is the argument from the poverty of the stimulus [2]. This argument states that children have insufficient evidence in the primary linguistic data to induce the grammar of their native language.

Mark Gold [3] provides the most well-known formal basis to this argument. Gold introduced the criterion "identification in the limit" for evaluating the success of a learning algorithm: with an infinite number of training samples all hypotheses of the algorithm should be identical, and equivalent to the target. Gold showed that the class of context-free grammars is not learnable in this sense by any algorithm from positive samples alone (and neither are other super'-jinite classes). This proof is based on the fact that no matter how many samples from an infinite language a

learning algorithm has seen, the algorithm can not decide with certainty that the samples are drawn from the infinite language or from a finite language that con(cid:173) tains all samples. Because natural languages are thought to be at least as complex as context-free grammars, and negative feedback is assumed to be absent in the primary linguistic data, Gold's analysis, and subsequent work in learn ability theory [1] , is usually interpreted as strong support for the argument from the poverty of the stimulus, and, in the extreme, for the view that grammar induction is fundamentally impossible (a claim that Gold would not subscribe to).

Critics of this "nativist" approach [e.g. 4, 5] have argued for different assumptions on the appropriate grammar formalism (e.g. stochastic context-free grammars), the available primary data (e.g. semantic information) or the appropriate learnability criterion. In this paper I will take a different approach. I will present a model that induces context-free grammars without a-priori restrictions on the search space, se(cid:173) mantic information or negative evidence. Gold's negative results thus apply. Never(cid:173) theless, acquisition of grammar is successful in my model, because another process is taken into account as well: the cultural evolution of language.

2 The Language Evolution Problem

Whereas in language acquisition research the central question is how a child acquires an existing language, in language evolution research the central question is how this language and its properties have emerged in the first place. Within the nativist paradigm, some have suggested that the answer to this question is that Universal Grammar is the product of evolution under selection pressures for communication [e.g. 6]. Recently, several formal models have been presented to evaluate this view. For this paper, the most relevant of those is the model of Nowak et al. [7].

In that model it is assumed that there is a finite number of grammars, that new(cid:173) comers (infants) learn their grammar from the population, that more successful grammars have a higher probability of being learned and that mistakes are made in learning. The system can thus be described in terms of the changes in the relative frequencies Xi of each grammar type i in the population. The first result that Nowak et al. obtain is a "coherence threshold". This threshold is the necessary condition for grammatical coherence in a population, i.e. for a majority of individuals to use the same grammar. They show that this coherence depends on the chances that a child has to correctly acquire its parents' grammar. This probability is described with the parameter q. Nowak et al. show analytically that there is a minimum value for q to keep coherence in the population. If q is lower than this value, all possible grammar types are equally frequent in the population and the communicative suc(cid:173) cess in minimal. If q is higher than this value, one grammar type is dominant; the communicative success is much higher than before and reaches 100% if q = l. The second result relates this required fidelity (called qd to a lower bound (be) on the number of sample sentences that a child needs. Nowak et al. make the crucial assumption that all languages are equally expressive and equally different from each other. With that assumption they can show that be is proportional to the total number of possible grammars N. Of course, the actual number of sample sentences b is finite; Nowak et al. conclude that only if N is relatively small can a stable grammar emerge in a population. I.e. the population dynamics require a restrictive Universal Grammar.

The models of Gold and Nowak et al. have in common that they implicitly assume that every possible grammar is equally likely to become the target grammar for learning. If even the best possible learning algorithm cannot learn such a grammar,

the set of allowed grammars must be restricted. There is, however, reason to believe that this assumption is not the most useful for language learning. Language learning is a very particular type of learning problem, because the outcome of the learning process at one generation is the input for the next. The samples from which a child learns with its learning procedure, are therefore biased by the learning of previous generations that used the same procedure[8]. In [9] and other papers, Kirby, Hurford and students have developed a framework to study the consequences of that fact. In this framework, called the "Iterated Learning Model" (ILM), a population of individuals is modeled that can each pro(cid:173) duce and interpret sentences, and have a language acquisition procedure to learn grammar from each other. In the ILM one individual (the parent) presents a rela(cid:173) tively small number of examples of form-meaning pairs to the next individual (the child). The child then uses these examples to induce his own granunar. In the next iteration the child becomes the parent, and a new individual becomes the child. This process is repeated many times. Interestingly, Kirby and Hurford have found that in these iterated transmission steps the language becomes easier and easier to learn, because the language adapts to the learning algorithm by becoming more and more structured. The structure of language in these models thus emerges from the iteration of learning. The role of biological evolution, in this view, is to shape the learning algorithms, such that the complex results of the iterated learning is biologically adaptive [10]. In this paper I will show that if one adopts this view on the interactions between learning, cultural evolution and biological evolution, the models such as those of Gold [3] and Nowak et al. [7] can no longer be taken as evidence for an extensive, innate pr~specification of human language.

3 A Simple Model of Grammar Induction

To study the interactions between language adaptation and language acquisition, I have first designed a grammar induction algorithm that is simple, but can never(cid:173) theless deal with some non-trivial induction problems. The model uses context-free grammars to represent linguistic abilities. In particular, the representation is lim(cid:173) ited to grammars G where all rules are of one of the following forms: (1) A 1-+ t, (2) A 1-+ BC, (3) A 1-+ Bt. The nontenninals A, B, C are elements of the non-terminal alphabet Vnt , which includes the start symbol S. t is a string of tenninal sym(cid:173) bols from the terminal alphabet Vt 1• For determining the language L of a certain grammar G I use simple depth-first exhaustive search of the derivation tree. For computational reasons, the depth of the search is limited to a certain depth d, and the string length is limited to length l. The set of sentences (L' ~ L) used in train(cid:173) ing and in communication is therefore finite (and strictly speaking not context-free, but regular); in production, strings are drawn from a uniform distribution over L'. The grammar induction algorithm learns from a set of sample strings (sentences) that are provided by a teacher. The design of the learning algorithm is originally inspired by [11] and is similar to the algorithm in [12]. The algorithm fits within a tradition of algorithms that search for compact descriptions of the input data [e.g. 13, 14, 15]. It consists of three operations:

Incorporation: extend the language, such that it includes the encountered string; if string s is not already part of the language, add a rule S 1-+ s to the grammar.

INote that the restrictions on the rule-types above do not limit the scope of languages that can be represented (they are essentially equivalent to Chomsky Normal Form). They are, however, relevant for the language acquisition algorithm.

Compression: substitute frequent and long substrings with a nonterminal, such that the gmmmar becomes smaller and the language remains unchangedj for every valid substring z of the right-hand sides of all rules, calculate the compression effect v(z) of substituting z with a nonterminal Aj replace all valid occurrences of the substring z, = arymaxzv(z) with A if v(z') > 0, and add a rule A f-+ Zl to the grammar. "Valid substrings" are those substrings which can be replaced while keeping all rules of the forms 1- 3 described above. The compression effect is measured as the difference between the number of symbols in the grammar before and after the substitution. The compression step is repeated until the grammar does not change anymore. Generalization: equate two nonterminals, such that the grammar becomes smaller and the language laryerj for every combination of two nonterminals A and B (B :f S), calculate the compression effect v of equating A and B. Equate the combination (A',B') = arymaxABv(A,B) ifv(A',B') > OJ i.e. replace all occurrences of B with A. The compression effect is measured as the difference between the number of symbols before and after replacing and deleting redundant rules. The generalization step is repeated until the grammar does not change anymore.

4 Learnable and U nlearnable Classes

The algorithm described above is implemented in C++ and tested on a variety of target grammars2 • I will not present a detailed analysis of the learning behavior here, but limit myself to a simple example that shows that the algorithm can learn some (recursive) grammars, while it can not learn others. The induction algorithm receives three sentences (abed, abcabcd, abcabcabcd). The incorporation, com(cid:173) pression (repeated twice) and generalization steps yield subsequently the following grammars:

(a) Incorporation

Do not remove: This comment is monitored to verify that the site is working properly