{"title": "Learning and using language via recursive pragmatic reasoning about other agents", "book": "Advances in Neural Information Processing Systems", "page_first": 3039, "page_last": 3047, "abstract": "Language users are remarkably good at making inferences about speakers' intentions in context, and children learning their native language also display substantial skill in acquiring the meanings of unknown words. These two cases are deeply related: Language users invent new terms in conversation, and language learners learn the literal meanings of words based on their pragmatic inferences about how those words are used.  While pragmatic inference and word learning have both been independently characterized in probabilistic terms, no current work unifies these two. We describe a model in which language learners assume that they jointly approximate a shared, external lexicon and reason recursively about the goals of others in using this lexicon. This model captures phenomena in word learning and pragmatic inference; it additionally leads to insights about the emergence of communicative systems in conversation and the mechanisms by which pragmatic inferences become incorporated into word meanings.", "full_text": "Learning and using language via recursive pragmatic\n\nreasoning about other agents\n\nNathaniel J. Smith\u2217\nUniversity of Edinburgh\n\nNoah D. Goodman\nStanford University\n\nMichael C. Frank\nStanford University\n\nAbstract\n\nLanguage users are remarkably good at making inferences about speakers\u2019 inten-\ntions in context, and children learning their native language also display substan-\ntial skill in acquiring the meanings of unknown words. These two cases are deeply\nrelated: Language users invent new terms in conversation, and language learners\nlearn the literal meanings of words based on their pragmatic inferences about how\nthose words are used. 
While pragmatic inference and word learning have both\nbeen independently characterized in probabilistic terms, no current work uni\ufb01es\nthese two. We describe a model in which language learners assume that they\njointly approximate a shared, external lexicon and reason recursively about the\ngoals of others in using this lexicon. This model captures phenomena in word\nlearning and pragmatic inference; it additionally leads to insights about the emer-\ngence of communicative systems in conversation and the mechanisms by which\npragmatic inferences become incorporated into word meanings.\n\n1\n\nIntroduction\n\nTwo puzzles present themselves to language users: What do words mean in general, and what do\nthey mean in context? Consider the utterances \u201cit\u2019s raining,\u201d \u201cI ate some of the cookies,\u201d or \u201ccan\nyou close the window?\u201d In each, a listener must go beyond the literal meaning of the words to\n\ufb01ll in contextual details (\u201cit\u2019s raining here and now\u201d), infer that a stronger alternative is not true\n(\u201cI ate some but not all of the cookies\u201d), or more generally infer the speaker\u2019s communicative goal\n(\u201cI want you to close the window right now because I\u2019m cold\u201d), a process known as pragmatic\nreasoning. Theories of pragmatics frame the process of language comprehension as inference about\nthe generating goal of an utterance given a rational speaker [14, 8, 9]. For example, a listener might\nreason, \u201cif she had wanted me to think \u2018all\u2019 of the cookies, she would have said \u2018all\u2019\u2014but she didn\u2019t.\nHence \u2018all\u2019 must not be true and she must have eaten some but not all of the cookies.\u201d This kind of\nreasoning is core to language use.\nBut pragmatic reasoning about meaning-in-context relies on stable literal meanings that must them-\nselves be learned. 
In both adults and children, uncertainty about word meanings is common, and\noften considering speakers\u2019 pragmatic goals can help to resolve this uncertainty. For example, if a\nnovel word is used in a context containing both a novel and a familiar object, young children can\nmake the inference that the novel word refers to the novel object [22].1 For adults who are proficient\nlanguage users, there are also a variety of intriguing cases in which listeners seem to create\nsituation- and task-specific ways of referring to particular objects. For example, when asked to refer\nto idiosyncratic geometric shapes, over the course of an experimental session, participants create\nconventionalized descriptions that allow them to perform accurately even though they do not begin\nwith shared labels [19, 7]. In both of these examples, reasoning about another person\u2019s goals informs\nlanguage learners\u2019 estimates of what words are likely to mean.\n\n\u2217nathaniel.smith@ed.ac.uk\n1Very young children make inferences that are often labeled as \u201cpragmatic\u201d in that they involve reasoning\nabout context [6, 1], though in some cases they are systematically \u2018too literal\u2019 (e.g. failing to strengthen SOME\nto SOME-BUT-NOT-ALL [23]). Here we remain agnostic about the age at which children are able to make such\ninferences robustly, as it may vary depending on the linguistic materials being used in the inference [2].\n\nDespite this intersection, there is relatively little work that takes pragmatic reasoning into account\nwhen considering language learning in context. Recent work on grounded language learning has\nattempted to learn large sets of (sometimes relatively complex) word meanings from noisy and ambiguous\ninput (e.g. [10, 17, 20]). And a number of models have begun to formalize the consequences\nof pragmatic reasoning in situations where limited learning takes place [12, 9, 3, 13]. 
But as yet\nthese two strands of research have not been brought together so that the implications of pragmatics\nfor learning can be investigated directly.\nThe goal of our current work is to investigate the possibilities for integrating models of recursive\npragmatic reasoning with models of language learning, with the hope of capturing phenomena in\nboth domains. We begin by describing a proposal for bringing the two together, noting several\nissues in previous approaches based on recursive reasoning under uncertainty. We next simulate\n\ufb01ndings on pragmatic inference in one-shot games (replicating previous work). We then build on\nthese results to simulate the results of pragmatic learning in the language acquisition setting where\none communicator is uncertain about the lexicon and in iterated communication games where both\ncommunicators are uncertain about the lexicon.\n\n2 Model\n\nWe model a standard communication game [19, 7]: two participants each, separately, view identical\narrays of objects. On the Speaker\u2019s screen, one object is highlighted; their goal is to get the Listener\nto click on this item. To do this, they have available a \ufb01xed, \ufb01nite set of words; they must pick one.\nThe Listener then receives this word, and attempts to guess which object the Speaker meant by it.\nIn the psychology literature, as in real-world interactions, games are typically iterated; one view of\nour contribution here is as a generalization of one-shot models [9, 3] to the iterated context.\n2.1 Paradoxes in optimal models of pragmatic learning. Multi-agent interactions are dif\ufb01cult\nto model in a normative or optimal framework without falling prey to paradox. Consider a simple\nmodel of the agents in the above game. First we de\ufb01ne a literal listener L0. 
This agent has a\nlexicon of associations between words and meanings; specifically, it assigns each word w a vector\nof numbers in (0, 1) describing the extent to which this word provides evidence for each possible\nobject2. To interpret a word, the literal listener simply re-weights their prior expectation about what\nis referred to using their lexicon\u2019s entry for this word:\n\nPL0(object|word, lexicon) \u221d lexicon(word, object) \u00d7 Pprior(object).\n\n(1)\n\nBecause of the normalization in this equation, there is a systematic but unimportant symmetry among\nlexicons; we remove this by assuming the lexicon sums to 1 over objects for each word. Confronted\nwith such a listener, a speaker who chooses approximately optimal actions should attempt\nto choose a word which soft-maximizes the probability that the listener will assign to the target\nobject, modulated by the effort or cost associated with producing this word:\n\nPS1(word|object, lexicon) \u221d exp(\u03bb(log PL0(object|word, lexicon) \u2212 cost(word))).\n\n(2)\n\nBut given this speaker, the naive L0 strategy is no longer optimal. Instead, listeners should use Bayes\u2019\nrule to invert the speaker\u2019s decision procedure [9]:\n\nPL2(object|word, lexicon) \u221d PS1(word|object, lexicon) \u00d7 Pprior(object).\n\n(3)\n\nNow a difficulty becomes apparent. Given such a listener, it is no longer optimal for speakers\nto implement strategy S1; instead, they should implement strategy S3 which soft-maximizes PL2\ninstead of PL0. And then listeners ought to implement L4, and so on.\nOne option is to continue iterating such strategies until reaching a fixed point equilibrium. While this\nstrategy guarantees that each agent will behave normatively given the other agent\u2019s strategy, there\nis no guarantee that such strategies will be near the system\u2019s global optimum. 
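The L0, S1, L2 chain in Eqs. (1) to (3) can be sketched in a few lines of Python. This is only an illustrative sketch, not the authors' implementation: the two-word "some"/"all" lexicon, the flat prior, the zero costs and the function names are all assumptions chosen for the example, and lexicon entries are assumed to be strictly positive so that the logarithm is defined.

```python
import math


def normalize(scores):
    """Rescale non-negative scores so that they sum to 1."""
    total = sum(scores.values())
    return {key: value / total for key, value in scores.items()}


def literal_listener(word, lexicon, prior):
    """Eq. (1): reweight the object prior by the lexicon entry for `word`."""
    return normalize({obj: lexicon[word][obj] * prior[obj] for obj in prior})


def speaker_s1(obj, lexicon, prior, cost, lam):
    """Eq. (2): soft-max over words of lam * (log P_L0(obj | word) - cost(word))."""
    scores = {}
    for word in lexicon:
        p_obj = literal_listener(word, lexicon, prior)[obj]
        scores[word] = math.exp(lam * (math.log(p_obj) - cost[word]))
    return normalize(scores)


def listener_l2(word, lexicon, prior, cost, lam):
    """Eq. (3): Bayesian inversion of the S1 speaker."""
    return normalize({
        obj: speaker_s1(obj, lexicon, prior, cost, lam)[word] * prior[obj]
        for obj in prior
    })


# Hypothetical toy lexicon: each row sums to 1 over objects, entries in (0, 1).
lexicon = {
    "some": {"SOME_NOT_ALL": 0.5, "ALL": 0.5},
    "all": {"SOME_NOT_ALL": 0.01, "ALL": 0.99},
}
prior = {"SOME_NOT_ALL": 0.5, "ALL": 0.5}   # assumed flat prior over objects
cost = {"some": 0.0, "all": 0.0}            # assumed equal word costs

l0_some = literal_listener("some", lexicon, prior)
l2_some = listener_l2("some", lexicon, prior, cost, lam=3.0)
print("L0:", l0_some)
print("L2:", l2_some)
```

In this toy setting the literal listener is indifferent between the two objects on hearing "some", while L2 shifts most of its probability to SOME_NOT_ALL, because an S1 speaker who meant ALL would usually have said "all".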
More importantly,\nthere is a great deal of evidence that humans do not use such equilibrium strategies; their behavior in\nlanguage games (and in other games [5]) can be well-modeled as implementing Sk or Lk for some\nsmall k [9]. Following this work, we recurse a finite (small) number of times, n. The consequence\nis that one agent, implementing Sn, is fully optimal with respect to the game, while the other,\nimplementing Ln\u22121, is only nearly optimal, off by a single recursion.\n\n2We assume words refer directly to objects, rather than to abstract semantic features. Our simplification\nis without loss of generality, however, because we can interpret our model as marginalizing over such a\nrepresentation, with our literal Plexicon(object|word) = \u2211_features P(object|features) Plexicon(features|word).\n\nThis resolves one problem, but as soon as we attempt to add uncertainty about the meanings of words\nto such a model, a new paradox arises. Suppose the listener is a young child who is uncertain about\nthe lexicon their partner is using. The obvious solution is for them to place a prior on the lexicon;\nthey then update their posterior based on whatever utterances and contextual cues they observe,\nand in the meantime interpret each utterance by making their best guess, marginalizing out this\nuncertainty. This basic structure is captured in previous models of Bayesian word learning [10]. But\nwhen combined with the recursive pragmatic model, a new question arises: Given such a listener,\nwhat model should the speaker use? A rational speaker attempts to maximize the listener\u2019s likelihood\nof understanding, so if an uncertain listener interprets by marginalizing over some posterior, then a\nfully knowledgeable speaker should disregard their own lexical knowledge, and instead model and\nmarginalize over the listener\u2019s uncertainty. 
But if they do this, then their utterances will provide no\ndata about their lexicon, and there is nothing for the rational listener to learn from observing them.3\nOne \ufb01nal problem is that under this model, when agents switch roles between listener and speaker,\nthere is nothing constraining them to continue using the same language. Optimizing task perfor-\nmance requires my lexicon as a speaker to match your lexicon as a listener and vice-versa, but there\nis nothing that relates my lexicon as a speaker to my lexicon as a listener, because these never in-\nteract. This clearly represents a dramatic mismatch to typical human communication, which almost\nnever proceeds with distinct languages spoken by each participant.\n2.2 A conventionality-based model of pragmatic word learning. We resolve the problems de-\nscribed above by assuming that speakers and listeners deviate from normative behavior by assuming\na conventional lexicon. Speci\ufb01cally, our \ufb01nal convention-based agents assume: (a) There is some\nsingle, speci\ufb01c literal lexicon which everyone should be using, (b) and everyone else knows this\nlexicon, and believes that I know it as well, (c) but in fact I don\u2019t. These assumptions instantiate a\nkind of \u201csocial anxiety\u201d in which agents are all trying to learn the correct lexicon that they assume\neveryone else knows.\nAssumption (a) corresponds to the lexicographer\u2019s illusion: Naive language users will argue vocifer-\nously that words have speci\ufb01c meanings, even though these meanings are unobservable to everyone\nwho purportedly uses them. It also explains why learners speak the language they hear (rather than\nsome private language that they assume listeners will eventually learn): Under assumption (a), ob-\nserving other speakers\u2019 behavior provides data about not just that speaker\u2019s idiosyncratic lexicon,\nbut the consensus lexicon. 
Assumption (b) avoids the explosion of hyper^n-distributions described\nabove: If agent n knows the lexicon, they assume that all lower agents do as well, reducing to the\noriginal tractable model without uncertainty. And assumption (c) introduces a limited form of uncertainty\nat the top level, and thus the potential for learning. To the extent that a child\u2019s interlocutors\ndo use a stable lexicon and do not fully adapt their speech to accommodate the child\u2019s limitations,\nthese assumptions make a reasonable approximation for the child language learning case. In general,\nthough, in arbitrary multi-turn interactions in which both agents have non-trivial uncertainty,\nthese assumptions are incorrect, and thus induce complex and non-normative learning dynamics.\nFormally, let an unadorned L and S denote the listener and speaker who follow the above assumptions.\nIf the lexicon were known then the listener would draw inferences as in Ln\u22121 above; but by\nassumption (c), they have uncertainty, which they marginalize out:\n\nPL(object|word, L\u2019s data) = \u222b PLn\u22121(object|word, lexicon) P(lexicon|L\u2019s data) d(lexicon)\n\n(4)\n\n3Of course, in reality both parties will generally have some uncertainty, making the situation even worse. If\nwe start from an uncertain listener with a prior over lexicons, then a first-level uncertain speaker needs a prior\nover priors on lexicons, a second-level uncertain listener needs a prior over priors over priors, etc. The original\nL0 \u2192 S1 \u2192 . . . recursion was bad enough, but at least each step had a constant cost. This new recursion\nproduces hyper^n-distributions for which inference almost immediately becomes intractable even in principle,\nsince the dimensionality of the learning problem increases with each step. 
Yet, without this addition of new\nuncertainty at each level, the model would dissolve back into certainty as in the previous paragraph, making\nlearning impossible.\n\nPhenomenon | Ref. | Section\nInterpreting scalar implicature | [14] | 3.1\nInterpreting Horn implicature | [15] | 3.2\nLearning literal meanings despite scalar implicature | [21] | 4.1\nDisambiguating new words using old words | [22] | 4.2\nLearning new words using old words | [22] | 4.2\nDisambiguation without learning | [16] | 4.2\nEmergence of novel & efficient lexicons | [11] | 5.1\nLexicalization of Horn implicature | [15] | 5.2\nModel columns WL, PI, PI+U, PI+WL (marks as extracted): x x x x x x x x; x x; x x; x; x; x\n\nTable 1: Empirical results and references. WL refers to the word learning model of [10]; PI refers\nto the recursive pragmatic inference model of [9]; PI+U refers to the pragmatic inference model of\n[3] which includes lexical uncertainty, marginalizes it out, and then recurses. Our current model is\nreferred to here as PI+WL, and combines pragmatic inference with word learning.\n\nHere L\u2019s data consists of her previous experience with language. In particular, in the iterated games\nexplored here, it consists of S\u2019s previous utterances together with whatever other information L may\nhave about their intended referents (e.g. from contextual clues). By assumption (b), L treats these\nutterances as samples from the knowledgeable speaker Sn\u22122, not S, and thus as being informative\nabout the lexicon. 
For instance, when the data is a set of fully observed word-referent pairs {wi, oi}:\n\nP(lexicon|L\u2019s data) \u221d P(lexicon) \u220f_i PSn\u22122(wi|oi, lexicon)\n\n(5)\n\nThe top-level speaker S attempts to select the word which soft-maximizes their utility, with utility\nnow being defined in terms of the informativity of the expectation (over lexicons) that the listener\nwill have for the right referent4:\n\nPS(word|object, S\u2019s data) \u221d exp(\u03bb(log \u222b PLn\u22121(object|word, lexicon) P(lexicon|S\u2019s data) d(lexicon) \u2212 cost(word)))\n\n(6)\n\nHere P(lexicon|S\u2019s data) is defined similarly, when S observes L\u2019s interpretations of various utterances,\nand treats them as samples from Ln\u22121, not L. However, notice that if S and L have the\nsame subjective distributions over lexicons, then S is approximately optimal with respect to L in the\nsame sense that Sk is approximately optimal with respect to Lk\u22121. In one-shot games, this model\nis conceptually equivalent to that of [3] restricted to n = 3; our key innovations are that we allow\nlearning by replacing their P(lexicon) with P(lexicon|data), and provide a theoretical justification\nfor how this learning can occur.\nIn the remainder of the paper, we apply the model described above to a set of one-shot pragmatic\ninference games that have been well-studied in linguistics [14, 15] and are addressed by previous\none-shot models of pragmatic inference [9, 3]. These situations set the stage for simulations investigating\nhow learning proceeds in iterated versions of such games, described in the following section.\nResults captured by our model and previous models are summarized in Table 1. 
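One simple way to approximate the lexicon posterior of Eq. (5) and the marginal listener of Eq. (4) is likelihood weighting: draw lexicons from the prior and weight each by the speaker likelihood of the observed word-referent pairs. The sketch below assumes n = 3 (so that Ln−1 is L2 and Sn−2 is S1), a flat Dirichlet prior on each word's row, a two-word, two-object world with hypothetical names, and zero word costs; it is a minimal illustration under those assumptions rather than the simulation code behind the paper.

```python
import math
import random

WORDS = ["w1", "w2"]             # hypothetical vocabulary
OBJECTS = ["o1", "o2"]           # hypothetical referents
PRIOR = {"o1": 0.5, "o2": 0.5}   # assumed flat prior over objects
COST = {"w1": 0.0, "w2": 0.0}    # assumed equal word costs
LAM = 3.0                        # soft-max parameter


def normalize(scores):
    total = sum(scores.values())
    return {key: value / total for key, value in scores.items()}


def sample_lexicon(rng):
    """Draw each word's row from a flat Dirichlet (normalized gamma draws)."""
    return {
        word: normalize({obj: rng.gammavariate(1.0, 1.0) + 1e-9 for obj in OBJECTS})
        for word in WORDS
    }


def l0(word, lex):
    return normalize({obj: lex[word][obj] * PRIOR[obj] for obj in OBJECTS})


def s1(obj, lex):
    return normalize({
        word: math.exp(LAM * (math.log(l0(word, lex)[obj]) - COST[word]))
        for word in WORDS
    })


def l2(word, lex):
    return normalize({obj: s1(obj, lex)[word] * PRIOR[obj] for obj in OBJECTS})


def posterior_weights(lexicons, data):
    """Eq. (5): weight each prior sample by the S1 likelihood of the data."""
    weights = []
    for lex in lexicons:
        likelihood = 1.0
        for word, obj in data:
            likelihood *= s1(obj, lex)[word]
        weights.append(likelihood)
    total = sum(weights)
    return [weight / total for weight in weights]


def marginal_listener(word, lexicons, weights):
    """Eq. (4): average the L2 interpretation over weighted lexicon samples."""
    result = {obj: 0.0 for obj in OBJECTS}
    for lex, weight in zip(lexicons, weights):
        interpretation = l2(word, lex)
        for obj in OBJECTS:
            result[obj] += weight * interpretation[obj]
    return result


rng = random.Random(0)
lexicons = [sample_lexicon(rng) for _ in range(2000)]
uniform = [1.0 / len(lexicons)] * len(lexicons)
data = [("w1", "o1")] * 3 + [("w2", "o2")] * 3   # hypothetical observations

before = marginal_listener("w1", lexicons, uniform)
after = marginal_listener("w1", lexicons, posterior_weights(lexicons, data))
print("before:", before)
print("after: ", after)
```

Before any data the marginal listener is close to indifferent about "w1"; after a handful of consistent observations it favors "o1", which is the sense in which treating utterances as samples from a knowledgeable speaker makes them informative about the lexicon.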
In our simulations\nthroughout, we somewhat arbitrarily set the recursion depth n = 3 (the minimal value that produces\nall the qualitative phenomena), \u03bb = 3, and assume that all agents have shared priors on the lexicon\nand full knowledge of the cost function. Inference is via importance sampling from a Dirichlet prior\nover lexicons.\n\n4An alternative model would have the speaker take the expectation over informativity, instead of the informativity\nof the expectation, which would correspond to slightly different utility functions. We adopt the current\nformulation for consistency with [3].\n\n3 Pragmatic inference in one-shot games\n\n3.1 Scalar implicature. Many sets of words in natural language form scales in which each term\nmakes a successively stronger claim. \u201cSome\u201d and \u201call\u201d form a scale of this type. While \u201cI ate some\nof the cookies\u201d is compatible with the followup \u201cin fact, I ate all of the cookies,\u201d the reverse is not\ntrue. \u201cMight\u201d and \u201cmust\u201d are another example, as are \u201cOK,\u201d \u201cgood,\u201d and \u201cexcellent.\u201d All of these\nscales allow for scalar implicatures [14]: the use of a less specific term pragmatically implies that\nthe more specific term does not apply. So although \u201cI ate some of the cookies\u201d could in principle be\ncompatible with eating ALL of them, the listener is led to believe that SOME-BUT-NOT-ALL is the\nlikely state of affairs. The recursive pragmatic reasoning portions of our model capture findings on\nscalar implicature in the same manner as previous models [3, 13].\n3.2 Horn implicature. Consider a world which contains two words and two types of objects. One\nword is expensive to use, and one is cheap (call them \u201cexpensive\u201d and \u201ccheap\u201d for short). One object\ntype is common and one is rare; denote these COMMON and RARE. 
Intuitively, there are two possible\ncommunicative systems here: a good system where \u201ccheap\u201d refers to COMMON and \u201cexpensive\u201d\nrefers to RARE, and a bad system where the opposite holds. Obviously we would prefer to use the\ngood system, but it has historically proven very difficult to derive this conclusion in a game theoretic\nsetting, because both systems are stable equilibria: if our partner uses the bad system, then we would\nrather follow and communicate at some cost than switch to the good system and fail entirely [3].\nHumans, however, unlike traditional game theoretic models, do make the inference that given two\notherwise equivalent utterances, the costly utterance should have a rare or unusual meaning. We\ncall this pattern Horn implicature, after [15]. For instance, \u201cLee got the car to stop\u201d implies that\nLee used an unusual method (e.g. not the brakes) because, had he used the brakes, the speaker\nwould have chosen the simpler and shorter (less costly) expression, \u201cLee stopped the car\u201d [15].\nSurprisingly, Bergen et al. [3] show that the key to achieving this favorable result is ignorance. If\na listener assigns equal probability to her partner using the good system or the bad system, then\nher best bet is to estimate PS(word|object) as the average of PS(word|object, good system) and\nPS(word|object, bad system). These might seem to cancel out, but in fact they do not. In the good\nsystem, the utilities of the speaker\u2019s actions are relatively strongly separated compared to the bad\nsystem; therefore, a soft-max agent in the bad system has noisier behavior than in the good system,\nand the behavior in the good system dominates the average. Similar reasoning applies to an uncertain\nspeaker. 
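The averaging argument can be checked numerically on a toy case. The sketch below computes an S1-style speaker under a hypothetical "good" and a hypothetical "bad" lexicon and then averages the two with equal weight. The soft lexicon entries (0.9 and 0.1), the object prior, the word costs and the soft-max parameter are all illustrative assumptions, and the simplified averaging here is not the full model, so its numbers should not be expected to match any reported values.

```python
import math

LAM = 3.0                                  # soft-max parameter (assumed)
PRIOR = {"COMMON": 0.8, "RARE": 0.2}       # assumed object prior
COST = {"cheap": 0.5, "expensive": 1.0}    # assumed word costs

# Hypothetical soft lexicons; each row sums to 1 over objects.
GOOD = {
    "cheap": {"COMMON": 0.9, "RARE": 0.1},
    "expensive": {"COMMON": 0.1, "RARE": 0.9},
}
BAD = {
    "cheap": {"COMMON": 0.1, "RARE": 0.9},
    "expensive": {"COMMON": 0.9, "RARE": 0.1},
}


def normalize(scores):
    total = sum(scores.values())
    return {key: value / total for key, value in scores.items()}


def literal_listener(word, lexicon):
    """Reweight the object prior by the lexicon row for `word`."""
    return normalize({obj: lexicon[word][obj] * PRIOR[obj] for obj in PRIOR})


def speaker_s1(obj, lexicon):
    """Soft-max speaker: LAM * (log P_L0(obj | word) - cost(word))."""
    return normalize({
        word: math.exp(LAM * (math.log(literal_listener(word, lexicon)[obj]) - COST[word]))
        for word in COST
    })


def average_speaker(obj, lexicons):
    """Equal-weight average of the speaker distribution over candidate lexicons."""
    per_lexicon = [speaker_s1(obj, lexicon) for lexicon in lexicons]
    return {
        word: sum(dist[word] for dist in per_lexicon) / len(per_lexicon)
        for word in COST
    }


avg_common = average_speaker("COMMON", [GOOD, BAD])
avg_rare = average_speaker("RARE", [GOOD, BAD])
print("P(word | COMMON):", avg_common)
print("P(word | RARE):  ", avg_rare)
```

Under these assumptions the two systems do not cancel for the COMMON object: the good-system speaker is more decisive than the bad-system speaker, so the averaged speaker uses "cheap" for COMMON more often than for RARE.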
For example, in our model with a uniform prior over lexicons and Pprior(COMMON) =\n0.8, cost(\u201ccheap\u201d) = 0.5, cost(\u201cexpensive\u201d) = 1.0, the symmetry breaks in the appropriate way:\nDespite total ignorance about the conventional system, our modeled speakers prefer to use simple\nwords for common referents (PS(\u201ccheap\u201d|COMMON) = 0.88, PS(\u201ccheap\u201d|RARE) = 0.46), and\nlisteners show a similar bias (PL(COMMON|\u201ccheap\u201d) = 0.77, PL(COMMON|\u201cexpensive\u201d) = 0.65).\nThis preference is weak; the critical point is that it exists at all, given the unbiased priors. We return\nto this in \u00a75.2. [3] report a much stronger preference, which they accomplish by applying further\nlayers of pragmatic recursion on top of these marginal distributions. On the one hand, this allows\nthem to better fit their empirical data; on the other, it removes the possibility of learning the literal\nlexicon that underlies pragmatic inference \u2013 further recursion above the uncertainty means that it is\nonly hypothetical agents who are ignorant, while the actual speaker and listener have no uncertainty\nabout each other\u2019s generative process.\n\n4 Pragmatics in learning from a knowledgeable speaker\n\n4.1 Learning literal meanings despite scalar implicatures. The acquisition of quantifiers like\n\u201csome\u201d provides a puzzle for most models of word learning: given that in many contexts, the word\n\u201csome\u201d is used to mean SOME-BUT-NOT-ALL, how do children learn that SOME-BUT-NOT-ALL is\nnot in fact its literal meaning? 
Our model is able to take scalar implicatures into account when learn-\ning, and thus provide a potential solution, congruent with the observation that no known language\nin fact lexicalizes SOME-BUT-NOT-ALL [21].\nFollowing the details of \u00a73.1, we created a simulation in which the model\u2019s prior \ufb01xed the mean-\ning of \u201call\u201d to be a particular set ALL, but was ambiguous about whether \u201csome\u201d literally meant\nSOME-BUT-NOT-ALL (incorrect) or SOME-BUT-NOT-ALL OR ALL (correct). The model was then\nexposed to training situations in which \u201csome\u201d was used to refer to SOME-BUT-NOT-ALL. Despite\nthis training, the model maintained substantial posterior probability on the correct hypothesis about\nthe meaning of \u201csome.\u201d Essentially, the model reasoned that although it had unambiguous evidence\nfor \u201csome\u201d being used to refer to SOME-BUT-NOT-ALL, this was nonetheless consistent with a lit-\neral meaning of SOME-BUT-NOT-ALL OR ALL which had then been pragmatically strengthened.\n\n5\n\n\fFigure 1: Simulations of two pragmatic agents playing a naming game. Each panel shows two\nrepresentative simulation runs, with run 1 chosen to show strong convergence and run 2 chosen to\nshow relatively weaker convergence. At each stage, S and L have different, possibly contradictory\nposteriors over the conventional, consensus lexicon. From these posteriors we derive the probability\nP (L understands S) (marginalizing over target objects and word choices), and also depict graphi-\ncally S\u2019s model of the listener (top row), and L\u2019s actual model (bottom row).\n\nThus, a pragmatically-informed learner might be able to maintain the true meaning of SOME despite\nseemingly con\ufb02icting evidence.\n4.2 Disambiguation using known words. Children, when presented with both a novel and a\nfamiliar object (e.g. an eggbeater and a ball), will treat a novel label (e.g. 
\u201cdax\u201d) as referring to the\nnovel object, for example by supplying the eggbeater when asked to \u201cgive me the dax\u201d [22]. This\nphenomenon is sometimes referred to as \u201cmutual exclusivity.\u201d Simple probabilistic word learning\nmodels can produce a similar pattern of \ufb01ndings [10], but all such models assume that learners retain\nthe mapping between novel word and novel object demonstrated in the experimental situation. This\nobservation is contradicted, however, by evidence that children often do not retain the mappings that\nare demonstrated by their inferences in the moment [16].\nOur model provides an intriguing possible explanation of this \ufb01nding: when simulating a single\ndisambiguation situation, the model gives a substantial probability (e.g. 75%) that the speaker is\nreferring to the novel object. Nevertheless, this inference is not accompanied by an increased belief\nthat the novel word literally refers to this object. The learner\u2019s interpretation arises not from lexical\nmapping but instead from a variant of scalar implicature: the listener knows that the familiar word\ndoes not refer to the novel object\u2014hence the novel word will be the best way to refer to the novel\nobject, even if it literally could refer to either. Nevertheless, on repeated exposure to the same novel\nword, novel object situation, the learner does learn the mapping as part of the lexicon (congruent\nwith other data on repeated training on disambiguation situations [4]).\n\n5 Pragmatic reasoning in the absence of conventional meanings\n\n5.1 Emergence of ef\ufb01cient communicative conventions. Experimental results suggest that com-\nmunicators who start without a usable communication system are able to establish novel, consensus-\nbased systems. 
For example, adults playing a communication game using only novel symbols with\nno conventional meaning will typically converge on a set of new conventions which allow them to\naccomplish their task [11]. Or in a less extreme example, communicators asked to refer to novel\nobjects invent conventional names for them over the course of repeated interactions (e.g., \u201cthe ice\nskater\u201d for an abstract figure vaguely resembling an ice skater, [7]). From a pure learning perspective\nthis behavior is anomalous, however: Since both agents know perfectly well that there is no existing\nconvention to discover, there is nothing to learn from the other\u2019s behavior. Furthermore, even if only\none partner is producing the novel expressions, their behavior in these studies still becomes more\nregular (conventional) over time, which would seem to rule out a role for learning: even if there is\nsome pattern in the expressions the speaker chooses to use, there is certainly nothing for the speaker\nto learn by observing these patterns, and thus their behavior should not change over time.\n\n[Figure 1 plots. Panels: \u201c2 words, 2 objects\u201d (dialogue turns 1\u201310) and \u201c3 words, 3 objects\u201d (dialogue turns 1\u201320); x-axis: Dialogue turn; y-axis: P(L understands S); objects and words grids for Run 1 and Run 2.]\n\nFigure 2: Example simulations showing the lexicalization of Horn implicatures. Plotting conventions are as above. In the first run, speaker and listener converge on a sparse and efficient communicative equilibrium, in which \u201ccheap\u201d means COMMON and \u201cexpensive\u201d means RARE, while in the second they reach a sub-optimal equilibrium. As shown in Fig. 3, the former is more typical.\n\nFigure 3: Averaged behavior\nover 300 dialogues as in Figs. 1\nand 2. Left: Communicative\nsuccess by game type and dialogue\nturn. 
Right: Proportion of\ndyads in the Horn implicature\ngame (\u00a75.2) who have con-\nverged on the \u2018good\u2019 or \u2018bad\u2019\nlexicons and believe that these\nare literal meanings.\n\nTo model such phenomena, we imagine two agents playing the simple referential game introduced\nin \u00a7 2. On each turn the speaker is assigned a target object, utters some word referring to this object,\nthe listener makes a guess at the object, and then, critically, the speaker observes the listener\u2019s\nguess and the listener receives feedback indicating the correct answer (i.e., the speaker\u2019s intended\nreferent). Both agents then update their posterior over lexicons before proceeding to the next trial.\nAs in [19, 7], the speaker and listener remain \ufb01xed in the same role throughout.\nFig. 1 shows the result of simulating several such games when both parties begin with a uniform prior\nover lexicons. Notice that: (a) agents\u2019 performance begins at chance, but quickly rises \u2013 a commu-\nnicative system emerges where none previously existed; (b) they tend towards structured, sparse\nlexicons with a one-to-one correspondence between objects and words \u2013 these communicative sys-\ntems are biased towards being useful and ef\ufb01cient; and (c) as the speaker and listener have entirely\ndifferent data (the listener\u2019s interpretations and the speaker\u2019s intended referent, respectively), un-\nlucky early guesses can lead them to believe in entirely contradictory lexicons\u2014but they generally\nrecover and converge. Each agent effectively uses their partner\u2019s behavior as a basis for forming\nweak beliefs about the underlying lexicon that they assume must exist. Since they then each act on\nthese beliefs, and their partner uses the resulting actions to form new beliefs, they soon converge on\nusing similar lexicons, and what started as a \u201csuperstition\u201d becomes normatively correct. 
And unlike\nsome previous models of emergence across multiple generations of agents [18, 25], this occurs\nwithin individual agents in a single dialogue.\n5.2 Lexicalization and loss of Horn implicatures. A stronger example of how pragmatics can\ncreate biases in emerging lexicons can be observed by considering a version of this game played in\nthe \u201ccheap\u201d/\u201cexpensive\u201d/COMMON/RARE domain introduced in our discussion of Horn implicature\n(\u00a73.2). Here, a uniform prior over lexicons, combined with pragmatic reasoning, causes each agent\nto start out weakly biased towards the associations \u201ccheap\u201d \u2194 COMMON, \u201cexpensive\u201d \u2194 RARE. A\nfully rational listener who observed an uncertain speaker using words in this manner would therefore\ndiscount it as arising from this bias, and conclude that the speaker was, in fact, highly uncertain. Our\nconvention-based listener, however, believes that speakers do know which convention is in use, and\ntherefore tends to misinterpret this biased behavior as positive evidence that the \u2018good\u2019 system is in\nuse. Similarly, convention-based speakers will wager that since on average they will succeed more\noften if listeners are using the \u2018good\u2019 system, they might as well try it. When they succeed, they\ntake their success as evidence that the listener was in fact using the good system all along. As a\nresult, dyads in this game end up converging onto a stable system at a rate far above chance, and\npreferentially onto the \u2018good\u2019 system (Figs. 
2 and 3).\n\n[Figure 2 plots. x-axis: Dialogue turn (1\u201310); y-axis: P(L understands S); objects and words grids for Run 1 and Run 2.]\n[Figure 3 plots. Left: Mean P(L understands S) against Dialogue turn (0\u201340) for the 2x2 uniform prior, 3x3 uniform prior and Horn implicature games. Right: Horn lexicalization rate against Dialogue turn (0\u201340) for the Good lexicon and Bad lexicon.]\n\nIn the process, though, something interesting happens. 
In this model, Horn implicatures depend on\nuncertainty about literal meaning. As the agents gather more data, their uncertainty is reduced, and\nthus through the course of a dialogue, the implicature is replaced by a belief that \u201ccheap\u201d literally\nmeans COMMON (and did all along). To demonstrate this phenomenon, we queried each agent in\neach simulated dyad about how they would refer to or interpret each object and word if the two\nobjects were equally common, which cancels the Horn implicature. As shown in Fig. 3 (right), after\n30 turns, in nearly 70% of dyads both S and L used the \u2018good\u2019 mapping even in this implicature-free\ncase, while fewer than 20% used the \u2018bad\u2019 mapping (with the rest being inconsistent).\nThis points to a fundamental difference in how learning interacts with Horn versus scalar implicatures.\nDepending on the details of the input, it is possible for our convention-based agents to observe\npragmatically strengthened uses of scalar terms (e.g., \u201csome\u201d used to refer to SOME-BUT-NOT-ALL),\nwithout becoming confused into thinking that \u201csome\u201d literally means SOME-BUT-NOT-ALL (\u00a74.1).\nThis occurs because scalar implicature depends only on recursive pragmatic reasoning (\u00a72.1), which\nour convention-based agents\u2019 learning rules are able to model and correct for. But, while our agents\nare able to use Horn implicatures in their own behavior (\u00a73.2), this happens implicitly as a result\nof their uncertainty, and our agents do not model the uncertainty of other agents; thus, when they\nobserve other agents using Horn implicatures, they cannot interpret this behavior as arising from an\nimplicature. Instead, they take it as reflecting the actual literal meaning. 
And this result isn\u2019t just\na technical limitation of our implementation, but is intrinsic to our convention-based approach to\ncombining pragmatics and learning: in our system, the only thing that makes word learning possible\nat all is each agent\u2019s assumption that other agents are better informed; otherwise, other agents\u2019\nbehavior would not provide any useful data for learning. Our model therefore makes the interesting\nprediction that all else being equal, uncertainty-based implicatures should over time be more prone\nto lexicalizing and becoming part of literal meaning than recursion-based implicatures are.\n\n6 Conclusion\n\nLanguage learners and language users must consider word meanings both within and across contexts.\nA critical part of this process is reasoning pragmatically about agents\u2019 goals in individual\nsituations. In the current work we treat agents communicating with one another as assuming that\nthere is a shared conventional lexicon which they both rely on, but with differing degrees of knowledge.\nThey then reason recursively about how this lexicon should be used to convey particular\nmeanings in context. These assumptions allow us to create a model that unifies two previously separate\nstrands of modeling work on language usage and acquisition and accounts for a variety of new\nphenomena. 
In particular, we consider new explanations of disambiguation in early word learning\nand the acquisition of quantifiers, and demonstrate that our model is capable of developing novel and\nefficient communicative systems through iterated learning within the context of a single simulated\nconversation.\nOur assumptions produce a tractable model, but because they deviate from pure rationality, they\nmust introduce biases, of which we identify two: a tendency for pragmatic speakers and listeners to\naccentuate useful, sparse patterns in their communicative systems (\u00a75.1), and for short, \u2018low cost\u2019\nexpressions to be assigned to common objects (\u00a75.2). Strikingly, both of these biases systematically\ndrive the overall communicative system towards greater global efficiency. In the long term, these\nprocesses should leave their mark on the structure of the language itself, which may contribute to\nexplaining how languages become optimized for effective communication [26, 24].\nMore generally, understanding the interaction between pragmatics and learning is a precondition to\ndeveloping a unified understanding of human language. Our work here takes a first step towards\njoining disparate strands of research that have treated language acquisition and language use as\ndistinct.\n\nAcknowledgments\n\nThis work was supported in part by the European Commission through the EU Cognitive Systems\nProject Xperience (FP7-ICT-270273), the John S. McDonnell Foundation, and ONR grant\nN000141310287.\n\nReferences\n\n[1] D.A. Baldwin. Early referential understanding: Infants\u2019 ability to recognize referential acts for what they are. Developmental Psychology, 29(5):832\u2013843, 1993.\n[2] D. Barner, N. Brooks, and A. Bale. Accessing the unsaid: The role of scalar alternatives in children\u2019s pragmatic inference. Cognition, 118(1):84, 2011.\n[3] L. Bergen, N. D. Goodman, and R. Levy. 
That\u2019s what she (could have) said: How alternative utterances affect language use. In Proceedings of the 34th Annual Conference of the Cognitive Science Society, 2012.\n[4] R.A.H. Bion, A. Borovsky, and A. Fernald. Fast mapping, slow learning: Disambiguation of novel word\u2013object mappings in relation to vocabulary learning at 18, 24, and 30 months. Cognition, 2012.\n[5] C. F. Camerer, T.-H. Ho, and J.-K. Chong. A cognitive hierarchy model of games. The Quarterly Journal of Economics, 119(3):861\u2013898, 2004.\n[6] E.V. Clark. On the logic of contrast. Journal of Child Language, 15:317\u2013335, 1988.\n[7] H. H. Clark and D. Wilkes-Gibbs. Referring as a collaborative process. Cognition, 22(1):1\u201339, 1986.\n[8] R. Dale and E. Reiter. Computational interpretations of the Gricean maxims in the generation of referring expressions. Cognitive Science, 19(2):233\u2013263, 1995.\n[9] M. C. Frank and N. D. Goodman. Predicting pragmatic reasoning in language games. Science, 336(6084):998\u2013998, 2012.\n[10] M. C. Frank, N. D. Goodman, and J. B. Tenenbaum. Using speakers\u2019 referential intentions to model early cross-situational word learning. Psychological Science, 20:578\u2013585, 2009.\n[11] B. Galantucci. An experimental study of the emergence of human communication systems. Cognitive Science, 29(5):737\u2013767, 2005.\n[12] D. Golland, P. Liang, and D. Klein. A game-theoretic approach to generating spatial descriptions. In Proceedings of EMNLP 2010, pages 410\u2013419. Association for Computational Linguistics, 2010.\n[13] N. D. Goodman and A. Stuhlm\u00fcller. Knowledge and implicature: Modeling language understanding as social cognition. Topics in Cognitive Science, 5:173\u2013184, 2013.\n[14] H.P. Grice. Logic and conversation. Syntax and Semantics, 3:41\u201358, 1975.\n[15] L. Horn. Toward a new taxonomy for pragmatic inference: Q-based and R-based implicature. 
In Meaning, form, and use in context, volume 42. Washington: Georgetown University Press, 1984.\n[16] J. S. Horst and L. K. Samuelson. Fast mapping but poor retention by 24-month-old infants. Infancy, 13(2):128\u2013157, 2008.\n[17] G. Kachergis, C. Yu, and R. M. Shiffrin. An associative model of adaptive inference for learning word\u2013referent mappings. Psychonomic Bulletin & Review, 19(2):317\u2013324, April 2012.\n[18] S. Kirby, H. Cornish, and K. Smith. Cumulative cultural evolution in the laboratory: An experimental approach to the origins of structure in human language. Proceedings of the National Academy of Sciences, 105(31):10681\u201310686, 2008.\n[19] R. M. Krauss and S. Weinheimer. Changes in reference phrases as a function of frequency of usage in social interaction: A preliminary study. Psychonomic Science, 1964.\n[20] T. Kwiatkowski, S. Goldwater, L. Zettlemoyer, and M. Steedman. A probabilistic model of syntactic and semantic acquisition from child-directed utterances and their meanings. In Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics, pages 234\u2013244, 2012.\n[21] S.C. Levinson. Presumptive meanings: The theory of generalized conversational implicature. MIT Press, 2000.\n[22] E. M. Markman and G. F. Wachtel. Children\u2019s use of mutual exclusivity to constrain the meanings of words. Cognitive Psychology, 20:121\u2013157, 1988.\n[23] A. Papafragou and J. Musolino. Scalar implicatures: Experiments at the semantics-pragmatics interface. Cognition, 86(3):253\u2013282, 2003.\n[24] S. T. Piantadosi, H. Tily, and E. Gibson. Word lengths are optimized for efficient communication. Proceedings of the National Academy of Sciences, 108(9):3526\u20133529, 2011.\n[25] R. van Rooy. Evolution of conventional meaning and conversational principles. Synthese, 139(2):331\u2013366, 2004.\n[26] G. Zipf. The Psychobiology of Language. 
Routledge, London, 1936.\n", "award": [], "sourceid": 1388, "authors": [{"given_name": "Nathaniel", "family_name": "Smith", "institution": "University of Edinburgh"}, {"given_name": "Noah", "family_name": "Goodman", "institution": "Stanford University"}, {"given_name": "Michael", "family_name": "Frank", "institution": "Stanford University"}]}