{"title": "Anti-efficient encoding in emergent communication", "book": "Advances in Neural Information Processing Systems", "page_first": 6293, "page_last": 6303, "abstract": "Despite renewed interest in emergent language simulations with\n  neural networks, little is known about the basic properties of the\n  induced code, and how they compare to human language. One\n  fundamental characteristic of the latter, known as Zipf's Law of\n  Abbreviation (ZLA), is that more frequent words are efficiently\n  associated to shorter strings. We study whether the same pattern\n  emerges when two neural networks, a ``speaker'' and a ``listener'',\n  are trained to play a signaling game. Surprisingly, we find that\n  networks develop an \emph{anti-efficient} encoding scheme,\n  in which the most frequent inputs are associated to the longest messages,\n  and messages in general are skewed towards the maximum length threshold.\n  This anti-efficient code appears easier to discriminate for the listener,\n  and, unlike in human communication, the speaker does not impose a\n  contrasting least-effort pressure towards brevity. Indeed, when the\n  cost function includes a penalty for longer messages, the resulting\n  message distribution starts respecting ZLA. Our analysis stresses\n  the importance of studying the basic features of emergent\n  communication in a highly controlled setup, to ensure the latter\n  will not stray too far from human language. 
Moreover, we present a\n  concrete illustration of how different functional pressures can lead\n  to successful communication codes that lack basic properties of\n  human language, thus highlighting the role such pressures play in\n  the latter.", "full_text": "Anti-efficient encoding in emergent communication\n\nRahma Chaabouni1,2, Eugene Kharitonov1, Emmanuel Dupoux1,2 and Marco Baroni1,3\n\n1Facebook AI Research\n\n2Cognitive Machine Learning (ENS - EHESS - PSL Research University - CNRS - INRIA)\n\n3ICREA\n\n{rchaabouni,kharitonov,dpx,mbaroni}@fb.com\n\nAbstract\n\nDespite renewed interest in emergent language simulations with neural networks,\nlittle is known about the basic properties of the induced code, and how they\ncompare to human language. One fundamental characteristic of the latter, known\nas Zipf\u2019s Law of Abbreviation (ZLA), is that more frequent words are efficiently\nassociated to shorter strings. We study whether the same pattern emerges when two\nneural networks, a \u201cspeaker\u201d and a \u201clistener\u201d, are trained to play a signaling game.\nSurprisingly, we find that networks develop an anti-efficient encoding scheme,\nin which the most frequent inputs are associated to the longest messages, and\nmessages in general are skewed towards the maximum length threshold. This anti-efficient code appears easier to discriminate for the listener, and, unlike in human\ncommunication, the speaker does not impose a contrasting least-effort pressure\ntowards brevity. Indeed, when the cost function includes a penalty for longer\nmessages, the resulting message distribution starts respecting ZLA. Our analysis\nstresses the importance of studying the basic features of emergent communication\nin a highly controlled setup, to ensure the latter will not depart too far from human\nlanguage. 
Moreover, we present a concrete illustration of how different functional\npressures can lead to successful communication codes that lack basic properties of\nhuman language, thus highlighting the role such pressures play in the latter.\n\n1 Introduction\n\nThere is renewed interest in simulating language emergence among neural networks that interact\nto solve a task, motivated by the desire to develop automated agents that can communicate with\nhumans [e.g., Havrylov and Titov, 2017, Lazaridou et al., 2017, 2018, Lee et al., 2018]. As part\nof this trend, several recent studies analyze the properties of the emergent codes [e.g., Kottur et al.,\n2017, Bouchacourt and Baroni, 2018, Evtimova et al., 2018, Lowe et al., 2019, Graesser et al., 2019].\nHowever, these analyses generally consider relatively complex setups, while very basic characteristics\nof the emergent codes have yet to be understood. We focus here on one such characteristic, namely\nthe length distribution of the messages that two neural networks playing a simple signaling game\ncome to associate to their inputs, as a function of input frequency.\nIn his pioneering studies of lexical statistics, George Kingsley Zipf noticed a robust trend in human\nlanguage that came to be known as Zipf\u2019s Law of Abbreviation (ZLA): There is an inverse (non-linear) correlation between word frequency and length [Zipf, 1949, Teahan et al., 2000, Sigurd et al.,\n2004, Strauss et al., 2007]. Assuming that shorter words are easier to produce, this is an efficient\nencoding strategy, particularly effective given Zipf\u2019s other important discovery that word distributions\nare highly skewed, following a power-law distribution. Indeed, in this way language approaches\nan optimal code in information-theoretic terms [Cover and Thomas, 2006]. 
Zipf, and many after\nhim, have thus used ZLA as evidence that language is shaped by functional pressures toward effort minimization [e.g., Piantadosi et al., 2011, Mahowald et al., 2018, Gibson et al., 2019].\n\n33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.\n\nHowever,\nothers [e.g., Mandelbrot, 1954, Miller et al., 1957, Ferrer i Cancho and del Prado Mart\u00edn, 2011, del\nPrado Mart\u00edn, 2013] noted that some random-typing distributions also respect ZLA, casting doubts\non functional explanations of the observed pattern.\nWe study a Speaker network that gets one out of 1K distinct one-hot vectors as input, randomly\ndrawn from a power-law distribution (so that frequencies are extremely skewed, like in natural\nlanguage). Speaker transmits a variable-length message to a Listener network. Listener outputs\na one-hot vector, and the networks are rewarded if the latter is identical to the input. There is no\ndirect supervision on the message, so that the networks are free to create their own \u201clanguage\u201d.\nThe networks develop a successful communication system that does not exhibit ZLA, and is indeed\nanti-efficient, in the sense that all messages are long, and the most frequent inputs are associated to\nthe longest messages. Interestingly, a similar effect is observed in artificial human communication\nexperiments, in conditions in which longer messages do not demand extra effort from speakers, so that\nthey are preferred as they ease the listener discrimination task [Kanwal et al., 2017]. Our Speaker\nnetwork, unlike humans, has no physiological pressure towards brevity [Chaabouni et al., 2019],\nand our Listener network displays an a priori preference for longer messages. Indeed, when we\npenalize Speaker for producing longer strings, the emergent code starts obeying ZLA. 
We examine\nthe implications of our findings in the Discussion.\n\n2 Setup\n\n2.1 The game\n\nWe designed a variant of the Lewis signaling game [Lewis, 1969] in which the input distribution\nfollows a power-law distribution. We think of these inputs as a vocabulary of distinct abstract word\ntypes, to which the agents will assign specific word forms while learning to play the game. We leave\nit to further research to explore setups in which word type and form distributions co-evolve [Ferrer i\nCancho and D\u00edaz-Guilera, 2007]. Formally, the game proceeds as follows:\n\n1. The Speaker network receives one of 1K distinct one-hot vectors as input i. Inputs are not\ndrawn uniformly, but, like in natural language, from a power-law distribution. That is, the\nrth most frequent input ir has probability 1 / (r \u00d7 sum_{k=1}^{1000} 1/k) of being sampled, with r \u2208 {1, ..., 1000}.\nConsequently, the probability of sampling the 1st input is 0.13 while the probability of\nsampling the 1000th one is 1000 times lower.\n\n2. Speaker chooses a sequence of symbols from its alphabet A = {s1, s2, ..., sa-1, eos} of size\n|A| = a to construct a message m, terminated as soon as Speaker produces the \u2018end-of-sequence\u2019 token eos. If Speaker has not yet emitted eos at max_len - 1, it is stopped and\neos is appended at the end of its message (so that all messages are suffixed with eos and no\nmessage is longer than max_len).\n\n3. The Listener network consumes m and outputs \u02c6i.\n4. 
The agents are successful if i = \u02c6i, that is, Listener reconstructed Speaker\u2019s input.\n\nThe game is implemented using the EGG toolkit [Kharitonov et al., 2019], and the code can be found\nat https://github.com/facebookresearch/EGG/tree/master/egg/zoo/channel.\n\n2.2 Architectures\n\nAs standard in current emergent-language simulations [e.g., Lazaridou et al., 2018], both agents\nare implemented as single-layer LSTMs [Hochreiter and Schmidhuber, 1997]. Speaker\u2019s input is a\n1K-dimensional one-hot vector i, and the output is a sequence of symbols, defining message m. This\nsequence is generated as follows. A linear layer maps the input vector into the initial hidden state\nof Speaker\u2019s LSTM cell. Next, a special start-of-sequence symbol is fed to the cell. At each step of\nthe sequence, the output layer defines a Categorical distribution over the alphabet. At training time,\nwe sample from this distribution. During evaluation, we select the symbol greedily. Each selected\nsymbol is fed back to the LSTM cell. The dimensionalities of the hidden state vectors are part of the\nhyper-parameters we explore (Appendix A.1). Finally, we initialize the weight matrices of our agents\nwith a uniform distribution with support in [\u22121/\u221ainput_size, 1/\u221ainput_size], where input_size is the\ndimensionality of the matrix input (PyTorch default initialization).\nListener consumes the entire message m, including eos. After eos is received, Listener\u2019s hidden\nstate is passed through a fully-connected layer with softmax activation, determining a Categorical\ndistribution over 1K indices. This distribution is used to calculate the cross-entropy loss w.r.t. the\nground-truth input, i.\nThe joint Speaker-Listener architecture can be seen as a discrete auto-encoder [Liou et al., 2014].\n\n2.3 Optimization\n\nThe architecture is not directly differentiable, as messages are discrete-valued. 
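Before turning to optimization, the input distribution of the game (Section 2.1) can be made concrete with a short sketch (an illustrative Python paraphrase, not the EGG implementation; function names are ours):

```python
import random

# Probability of the r-th most frequent input under the game's power-law
# distribution: p(r) = 1 / (r * H), where H = sum_{k=1}^{1000} 1/k.
def input_probability(r, n_inputs=1000):
    harmonic = sum(1.0 / k for k in range(1, n_inputs + 1))
    return 1.0 / (r * harmonic)

def sample_input(n_inputs=1000):
    # Draw the rank of one input (1 = most frequent), weighted by 1/r.
    ranks = list(range(1, n_inputs + 1))
    weights = [1.0 / r for r in ranks]
    return random.choices(ranks, weights=weights, k=1)[0]
```

As stated in the text, `input_probability(1)` is about 0.13, and it is exactly 1000 times larger than `input_probability(1000)`.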
In language emergence,\ntwo approaches are dominantly used: Gumbel-Softmax relaxation [Maddison et al., 2016, Jang et al.,\n2016] and REINFORCE [Williams, 1992]. We also experimented with the approach of Schulman et al.\n[2015], combining REINFORCE and stochastic backpropagation to estimate gradients. Preliminary\nexperiments showed that the latter algorithm (reviewed next) results in the fastest and most\nstable convergence, and we used it in all the following experiments. However, the main results we\nreport were also observed with the other algorithms, when successful.\nWe denote by \u03b8s and \u03b8l the Speaker and Listener parameters, respectively. L is the cross-entropy loss,\nwhich takes as inputs the ground-truth one-hot vector i and Listener\u2019s output distribution L(m). We\nwant to minimize the expectation of the cross-entropy loss E L(i, L(m)), where the expectation is\ncalculated w.r.t. the joint distribution of inputs and message sequences. The gradient of the following\nsurrogate function is an unbiased estimate of the gradient \u2207_{\u03b8s \u222a \u03b8l} E L(i, L(m)):\n\nE [L(i, L(m; \u03b8l)) + ({L(i, L(m; \u03b8l))} \u2212 b) log Ps(m|\u03b8s)]    (1)\n\nwhere {\u00b7} is the stop-gradient operation, Ps(m|\u03b8s) is the probability of producing the sequence m\nwhen Speaker is parameterized with vector \u03b8s, and b is a running-mean baseline used to reduce the\nestimate variance without introducing a bias. To encourage exploration, we also apply an entropy\nregularization term [Williams and Peng, 1991] on the output distribution of the speaker agent.\nEffectively, under Eq. 1, the gradient of the loss w.r.t. the Listener parameters is found via conventional\nbackpropagation (the first term in Eq. 1), while Speaker\u2019s gradient is found with a REINFORCE-like\nprocedure (the second term). Once the gradient estimate is obtained, we feed it into the Adam [Kingma\nand Ba, 2014] optimizer. 
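The structure of the surrogate in Eq. 1 can be sketched numerically as follows (our paraphrase, not the EGG implementation; the stop-gradient is implicit in treating the loss value as a plain number):

```python
# Per-sample surrogate of Eq. 1. The first term trains the Listener by
# ordinary backpropagation; the second term, (L - b) * log P_s(m|theta_s),
# yields a REINFORCE-style gradient for the Speaker, with the loss value
# treated as a constant (stop-gradient) and b a running-mean baseline.
def surrogate(ce_loss, log_p_message, baseline):
    return ce_loss + (ce_loss - baseline) * log_p_message

class RunningMeanBaseline:
    """Running mean of past loss values; reduces variance without bias."""
    def __init__(self):
        self.mean = 0.0
        self.count = 0

    def update(self, value):
        self.count += 1
        self.mean += (value - self.mean) / self.count
        return self.mean
```

Note that when the baseline equals the current loss, the Speaker term vanishes, which is exactly the variance-reduction effect described above.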
We explore different learning rate and entropy regularization coefficient\nvalues (Appendix A.1).\nWe train agents for 2500 episodes, each consisting of 100 mini-batches, in turn including 5120 inputs\nsampled from the power-law distribution with replacement. After training, we present to the system\neach input once, to compute accuracy by giving equal weight to all inputs, independently of amount\nof training exposure.\n\n2.4 Reference distributions\n\nAs ZLA is typically only informally defined, we introduce 3 reference distributions that display\nefficient encoding and arguably respect ZLA.\n\n2.4.1 Optimal code\n\nBased on standard coding theory [Cover and Thomas, 2006], we design an optimal code (OC)\nguaranteeing the shortest average message length given a certain alphabet size and the constraint that\nall messages must end with eos. The shortest messages are deterministically associated to the most\nfrequent inputs, leaving longer ones for less frequent ones. The length of the message associated to\nan input is determined as follows. Let A = {s1, s2, ..., sa-1, eos} be the alphabet of size a and ir be\nthe rth input when ranked by frequency. Then ir is mapped to a message of length\n\nl_ir = min{ n : sum_{k=1}^{n} (a \u2212 1)^{k-1} \u2265 r }    (2)\n\nFor instance, if a = 3, then there is only one message of length 1 (associated to the most frequent\nreferent), 2 of length 2, 4 of length 3, etc.1 Section 2 of Ferrer i Cancho et al. [2013] presents a proof\nthat this encoding is the maximally efficient one.\n\n2.4.2 Monkey typing\n\nNatural languages respect ZLA without being as efficient as OC. It has been observed that Monkey\ntyping (MT) processes, whereby a monkey hits random typewriter keys including a space character,\nproduce word length distributions remarkably similar to those attested in natural languages [Simon,\n1955, Miller et al., 1957]. 
We thus adapt an MT process to our setup, as a less strict benchmark for\nnetwork efficiency.2\nWe first sample an input without replacement according to the power-law distribution, then generate\nthe message to be associated with it. We repeat the process until all inputs are assigned a unique\nmessage. The message is constructed by letting a monkey hit the a keys of a typewriter uniformly\nat random (p = 1/a), subject to these constraints: (i) The message ends when the monkey hits eos.\n(ii) A message cannot be longer than a specified length max_len. If the monkey has not yet emitted\neos at max_len - 1, it is stopped and eos is appended at the end of the message. (iii) If a generated\nmessage is identical to one already used, it is rejected and another is generated.\nFor a given length l, there are only (a \u2212 1)^{l-1} different messages. Moreover, for a random generator\nwith the max_len constraint, the probability of generating a message of length l is:\n\nP_l = p \u00d7 (1 \u2212 p)^{l-1} if l < max_len, and P_max_len = (1 \u2212 p)^{max_len-1}    (3)\n\nFrom these calculations, we derive two qualitative observations about MT. First, as we fix max_len\nand increase a (decreasing p = 1/a), more generated messages will reach max_len. Second, when\na is small and max_len is large (as in early MT studies, where max_len was infinite), a ZLA-like\ndistribution emerges, due to the finite number of different messages of length l. Indeed, for any l\nless than max_len, P_l strictly decreases as l grows. Then, for given inputs, the monkey is likely\nto start by generating messages of the most probable length (that is, 1). As we exhaust all unique\nmessages of this length, the process starts generating messages of the next most probable length (i.e., 2)\nand so on. 
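Constraints (i)-(iii) above can be sketched as follows (an illustrative Python paraphrase; symbol 0 stands in for eos, and function names are ours):

```python
import random

def monkey_message(a, max_len, rng=random):
    """One monkey-typing message: symbols 1..a-1 plus 0 as eos, each key
    hit with probability 1/a; stop at eos, or truncate at max_len with
    eos appended (constraints (i) and (ii))."""
    symbols = []
    while len(symbols) < max_len - 1:
        s = rng.randrange(a)          # uniform over the a keys
        if s == 0:                    # 0 plays the role of eos
            break
        symbols.append(s)
    return tuple(symbols) + (0,)      # all messages end with eos

def monkey_code(n_inputs, a, max_len, rng=random):
    """Assign a distinct MT message to each of n_inputs inputs, rejecting
    duplicates (constraint (iii))."""
    used = set()
    while len(used) < n_inputs:
        used.add(monkey_message(a, max_len, rng))
    return used
```

The rejection loop terminates as long as the message space holds at least n_inputs distinct messages, which is the capacity condition discussed in Section 3.1.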
Figure A.1 in Appendix A.2 confirms experimentally that our MT distribution respects\nZLA for a \u2264 10 and various max_len.\n\n2.4.3 Natural language\n\nWe finally consider word length distributions in natural language corpora. We used pre-compiled\nEnglish, Arabic, Russian and Spanish frequency lists from http://corpus.leeds.ac.uk/serge/,\nextracted from corpora of internet text containing between 200M (Russian) and 16M words (Arabic).\nFor direct comparability with input set cardinality in our simulations, we only looked at the distribution\nof the top 1000 most frequent words, after merging lower- and upper-cased forms, and removing\nwords containing non-alphabetical characters. The resulting word frequency distributions obeyed\npower laws with exponents between \u22120.81 and \u22120.92 (we used \u22121 to generate our inputs). Alphabet\nsizes are as follows: 30 (English), 31 (Spanish), 47 (Russian), 59 (Arabic). These are larger than\nnormative sizes, as unfiltered Internet text will occasionally include foreign characters (e.g., accented\nletters in English text). Unlike for the previous reference distributions, we cannot control max_len and\nalphabet size. We hence compare human and network distributions only in the appropriate settings. In\nthe main text, we present results for the languages with the smallest (English) and largest (Arabic)\nalphabets. The distributions of the other languages are comparable, and presented in Appendix A.3.\n\n3 Experiments\n\n3.1 Characterizing the emergent encoding\n\nWe experiment with alphabet sizes a \u2208 {3, 5, 10, 40, 1000}. We chose mainly small alphabet sizes to\nminimize a potential bias in favor of long messages: For high a, randomly generating long messages\nbecomes more likely, as the probability of outputting eos at random becomes lower. 
At the other\nextreme, we also consider a = 1000, where the Speaker could in principle successfully communicate\nusing at most 2-symbol messages (as Speaker needs to produce eos). Finally, a = 40 was chosen to\nbe close to the alphabet size of the natural languages we study (mean alphabet size: 41.75).\n\n1 There is always only one message of length 1 (that is, eos), irrespective of alphabet size.\n2 No actual monkey was harmed in the definition of the process.\n\nAfter fixing a, we choose max_len so that agents have enough capacity to describe the whole input\nspace (|I| = 1000). For a given a and max_len, Speaker cannot encode more inputs than the message\nspace size M_a^max_len = sum_{j=1}^{max_len} (a \u2212 1)^{j-1}. We experiment with max_len \u2208 {2, 6, 11, 30}. We\ncouldn\u2019t use higher values because of memory limitations. Furthermore, we studied the effect of\nD = M_a^max_len / |I|. While making sure that this ratio is at least 1, we experiment with low values, where\nSpeaker would have to use nearly the whole message space to successfully denote all inputs. We also\nconsidered settings with significantly larger D, where constructing 1K distinct messages might be an\neasier task.\nWe train models for each (max_len, a) setting and agent hyperparameter choice (4 seeds per choice).\nWe consider runs successful if, after training, they achieve an accuracy above 99% on the full input\nset (i.e., less than 10 misclassified inputs). As predicted, the higher D is, the more accurate the\nagents become. Indeed, agents need much larger D than strictly necessary in order to converge. We\nselect for further analysis only those (max_len, a) choices that resulted in more than 3 successful\nruns (mean number of successful runs across the reported configurations is 25 out of 48). 
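The counting expressions above, together with the optimal-code length of Eq. 2, can be sketched as follows (function names are ours):

```python
def message_space_size(a, max_len):
    """M_a^max_len = sum_{j=1}^{max_len} (a-1)^(j-1): the number of distinct
    messages with alphabet size a (eos included) and length at most max_len,
    since there are (a-1)^(j-1) messages of total length j ending in eos."""
    return sum((a - 1) ** (j - 1) for j in range(1, max_len + 1))

def capacity_ratio(a, max_len, n_inputs=1000):
    """D = M_a^max_len / |I|; must be at least 1 for a lossless code."""
    return message_space_size(a, max_len) / n_inputs

def oc_length(r, a):
    """Optimal-code length for the r-th most frequent input (Eq. 2):
    the smallest n such that sum_{k=1}^{n} (a-1)^(k-1) >= r."""
    total, n = 0, 0
    while total < r:
        n += 1
        total += (a - 1) ** (n - 1)
    return n
```

For instance, (max_len=11, a=3) gives M = 2^11 - 1 = 2047 and D = 2.047, while (max_len=2, a=1000) gives M = 1000 and D = 1; for a = 3, `oc_length` reproduces the counts in Section 2.4.1 (one message of length 1, two of length 2, four of length 3).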
Moreover,\nwe focus here on configurations with max_len = 30, as the most comparable to natural language.3\nWe present results for all selected configurations (confirming the same trends) in Appendix A.4.\nFigure 1 shows message length distribution (averaged across all successful runs) as a function of\ninput frequency rank, compared to our reference distributions. The MT results are averaged across\n25 different runs. We show the Arabic and English distributions in the plot containing the most\ncomparable simulation settings (30, 40).\nAcross configurations, we observe that Speaker messages greatly depart from ZLA. There is a clear\ngeneral preference for longer messages, which is strongest for the most frequent inputs, where Speaker\noutputs messages of length max_len. That is, in the emergent encoding, more frequent words are\nlonger, making the system obey a sort of \u201canti-ZLA\u201d (see Appendix A.6 for confirmation that this\nanti-efficient pattern is statistically significant). Consequently, the emergent language distributions\nare well above all reference distributions, except for MT with a = 1000, where the large alphabet size\nleads to uniformly long words, for reasons discussed in Section 2.4.2. Finally, the lack of efficiency in\nemergent language encodings is also observed when inputs are uniformly distributed (see Appendix\nA.5).\nAlthough some animal signing systems disobey ZLA, due to specific environmental constraints [e.g.,\nHeesen et al., 2019], a large survey of human and animal communication did not find any case of\nsignificantly anti-efficient systems [Ferrer i Cancho et al., 2013], making our finding particularly\nintriguing.\n\n(a) max_len=30, a=5    (b) max_len=30, a=10    (c) max_len=30, a=40    (d) max_len=30, a=1000\n\nFigure 1: Mean message length across successful runs as a function of input frequency rank, with\nreference distributions. 
For readability, we smooth natural language distributions by reporting the\nsliding average of 10 consecutive lengths.\n\n3 Natural languages have no rigid upper bound on length, and 30 is the highest max_len we were able to train\nmodels for. Qualitative inspection of the respective corpora suggests that 30 is anyway a reasonable \u201csoft\u201d upper\nbound on word length in the languages we studied (longer strings are mostly typographic detritus).\n\n3.2 Causes of anti-efficient encoding\n\nWe explore the roots of anti-efficiency by looking at the behavior of untrained Speakers and Listeners.\nEarlier work conjectured that ZLA emerges from the competing pressures to communicate in a\nperceptually distinct and articulatorily efficient manner [Zipf, 1949, Kanwal et al., 2017]. For our\nnetworks, there is a clear pressure from Listener in favour of ease of message discriminability, but\nSpeaker has no obvious reason to save on \u201carticulatory\u201d effort. We thus predict that the observed\npattern is driven by a Listener-side bias.\n\n3.2.1 Untrained Speaker behavior\n\nFor each i drawn from the power-law distribution without replacement, we get a message m from\n90 distinct untrained Speakers (30 speakers for each hidden size in {100, 250, 500}). We experiment\nwith 2 different association processes. In the first, we associate the first generated m to i, irrespective\nof whether it was already associated to another input. In the second, we keep generating a new m for i\nuntil we get a message that was not already associated to a distinct input. 
The second version is closer\nto the MT process (see Section 2.4.2). Moreover, message uniqueness is a reasonable constraint,\nsince, in order to succeed, Speakers need first of all to keep messages denoting different inputs apart.\nFigure 2 shows that untrained Speakers have no prior toward outputting long sequences of symbols.\nPrecisely, from Figure 2 we see that the untrained Speakers\u2019 average message length coincides with\nthe one produced by the random process defined in Eq. 3 with p = 1/a.4 In other words, untrained\nSpeakers are equivalent to a random generator with uniform probability over symbols.5 Consequently,\nwhen imposing message uniqueness, non-trained Speakers become identical to MT. Hence, Speakers\nfaced with the task of producing distinct messages for the inputs, if vocabulary size is not too large,\nwould naturally produce a ZLA-obeying distribution, that is radically altered in joint Speaker-Listener\ntraining.\n\n(a) max_len = 30, a = 3    (b) max_len = 30, a = 5    (c) max_len = 30, a = 40\n\nFigure 2: Average length of messages by input frequency rank for untrained Speakers, compared to\nMT. See Appendix A.7 for more settings.\n\n3.2.2 Untrained Listener behavior\n\nHaving shown that untrained Speakers do not favor long messages, we ask next if the emergent\nanti-efficient language is easier to discriminate by untrained Listeners than other encodings. To\nthis end, we compute the average pairwise L2 distance of the hidden representations produced by\nuntrained Listeners in response to messages associated to all inputs.6 Messages that are further apart\nin the representational space of the untrained Listener should be easier to discriminate. 
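This discriminability measure can be sketched as follows (a minimal paraphrase: the actual experiments apply it to Listener hidden states; the function name is ours):

```python
import math

def mean_pairwise_l2(vectors):
    """Average pairwise L2 distance between hidden representations of
    messages; larger values suggest messages are easier to tell apart."""
    n = len(vectors)
    total, pairs = 0.0, 0
    for i in range(n):
        for j in range(i + 1, n):
            total += math.dist(vectors[i], vectors[j])
            pairs += 1
    return total / pairs
```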
Thus, if\nSpeaker associates such messages to the inputs, it will be easier for Listener to distinguish them.\n\n4 Note that we did not use the uniqueness-of-messages constraint to define P_l.\n5 We verified that indeed untrained Speakers have uniform probability over the different symbols.\n6 Results are similar if looking at the softmax layer instead.\n\nSpecifically, we use 50 distinct untrained Listeners with 100-dimensional hidden size.7 We test\n4 different encodings: (1) emergent messages (produced by trained Speakers); (2) MT messages\n(25 runs); (3) OC messages; and (4) human languages. Note that MT is equivalent to untrained\nSpeaker, as their messages share the same length and alphabet distribution (see Section 3.2.1). We\nstudy Listeners\u2019 biases with max_len = 30 while varying a, as messages are more distinct from\nreference distributions in that case (see Figure A.3 in Appendix A.4). Results are reported in Figure\n3. Representations produced in response to the emergent messages have the highest average distance.\nMT only approximates the emergent language for a = 1000, where, as seen in Figure 1 above, MT is\nanti-efficient. The trained Speaker messages are hence a priori easier for non-trained Listeners. The\nlength of these messages could thus be explained by an intrinsic Listener\u2019s bias, as conjectured above.\nAlso, interestingly, natural languages are not easy to process by Listeners. 
This suggests that the\nemergence of \u201cnatural\u201d languages in LSTM agents is unlikely, without imposing ad-hoc pressures.\n\nFigure 3: Average pairwise distance between messages\u2019 representation in Listener\u2019s hidden space,\nacross all considered non-trained Listeners. Vertical lines mark standard deviations across Listeners.\n\n3.2.3 Adding a length minimization pressure\n\nWe next impose an artificial pressure on Speaker to produce short messages, to counterbalance\nListener\u2019s preference for longer ones. Specifically, we add a regularizer disfavoring longer messages\nto the original loss:\n\nL\u2032(i, L(m), m) = L(i, L(m)) + \u03b1 \u00d7 |m|    (4)\n\nwhere L(i, L(m)) is the cross-entropy loss used before, |.| denotes length, and \u03b1 is a hyperparameter.\nThe non-differentiable term \u03b1 \u00d7 |m| is handled seamlessly as it only depends on Speaker\u2019s parameters\n\u03b8s (which specify the distribution of the messages m), and the gradient of the loss w.r.t. \u03b8s is\nestimated via a REINFORCE-like term (Eq. 1). Figure 4 shows emergent message length distribution\nunder this objective, comparing it to other reference distributions in the most human-language-like\nsetting: (max_len=30, a=40). The same pattern is observed elsewhere (see Appendix A.8, which also\nevaluates the impact of the \u03b1 hyperparameter). The emergent messages clearly follow ZLA. Speaker\nnow assigns messages of ascending length to the 40 most frequent inputs. For the remaining ones, it\nchooses messages with relatively similar, but notably shorter, lengths (always much shorter than MT\nmessages). 
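The regularized objective of Eq. 4 amounts to a one-line change on top of the surrogate of Eq. 1; a sketch (our notation, not the EGG implementation):

```python
def length_penalized_loss(ce_loss, message, alpha):
    """Eq. 4: add alpha * |m| to the cross-entropy loss. The extra term
    depends only on the Speaker's sampled message, so its gradient reaches
    the Speaker through the same REINFORCE-style estimator as Eq. 1."""
    return ce_loss + alpha * len(message)
```

With alpha = 0 this reduces to the original loss; increasing alpha trades communicative accuracy against message length, which matches the slower convergence reported below.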
Still, the encoding is not as efficient as the one observed in natural language (and OC).\nAlso, when adding length regularization, we noted a slower convergence, with a smaller number of\nsuccessful runs, which further diminishes when \u03b1 increases.\n\nFigure 4: Mean length of messages across successful runs as a function of input frequency rank for\nmax_len = 30, a = 40, \u03b1 = 0.5. Natural language distributions are smoothed as in Fig. 1.\n\n7 We fix this value because, unlike for Speaker, it has considerable impact on performance, with 100 being\nthe preferred setting.\n\n3.3 Symbol distributions in the emergent code\n\nWe conclude with a high-level look at what the long emergent messages are made of. Specifically,\nwe inspect symbol unigram and bigram frequency distributions in the messages produced by trained\nSpeaker in response to the 1K inputs (the eos symbol is excluded from counts). For direct comparability with natural language, we report results in the (max_len=30, a=40) setting, but the patterns\nare general. We observe in Figure 5(a) that, even if at initialization Speaker starts with a uniform\ndistribution over its alphabet (not shown here), by the end of training it has converged to a very skewed\none. Natural languages follow a similar trend, but their distributions are not nearly as skewed (see\nFigure 8(a) in Appendix A.10 for entropy analysis). We then investigate message structure by looking\nat the symbol bigram distribution. To this end, we build 25 randomly generated control codes, constrained\nto have the same mean length and unigram symbol distribution as the emergent code. Intriguingly,\nwe observe in Figure 5(b) a significantly more skewed emergent bigram distribution, compared to the\ncontrols. 
This suggests that, despite the lack of phonetic pressures, Speaker is respecting \u201cphonotactic\u201d\nconstraints that are even sharper than those reflected in the natural language bigram distributions\n(see Figure 8(b) in Appendix A.10 for entropy analysis). In other words, the emergent messages are\nclearly not built out of random unigram combinations. Looking at the pattern more closely, we find\nthe skewed bigram distribution to be due to a strong tendency to repeat the same character over and\nover, well beyond what is expected given the unigram symbol skew (see typical message examples\nin Appendix A.9). More quantitatively, across all runs with max_len=30, if we denote the 10 most\nprobable symbols with s1, ..., s10, then we observe P(sr, sr) > P(sr)^2 for r \u2208 {1, ..., 10} in more\nthan 97.5% of runs. We leave a better understanding of the causes and implications of these distributions\nto future work.\n\n(a) Symbol unigram distributions    (b) Symbol bigram distributions\n\nFigure 5: Distribution of top symbol unigrams and bigrams (ordered by frequency) in different\ncodes. Emergent and control messages are averaged across successful runs and different simulations\nrespectively in the (max_len=30, a=40) setting.\n\n4 Discussion\n\nWe found that two neural networks faced with a simple communication task, in which they have to\nlearn to generate messages to refer to a set of distinct inputs that are sampled according to a power-law\ndistribution, produce an anti-efficient code where more frequent inputs are significantly associated to\nlonger messages, and all messages are close to the allowed maximum length threshold. 
The results are stable across network and task hyperparameters (although we leave it to further work to replicate the finding with different network architectures, such as Transformers or CNNs). Follow-up experiments suggest that the emergent pattern stems from an a priori preference of the listener network for longer, more discriminable messages, which is not counterbalanced by a need to minimize articulatory effort on the side of the speaker. Indeed, when an artificial penalty against longer messages is imposed on the latter, we see a ZLA distribution emerging in the networks' communication code.

From the point of view of AI, our results stress the importance of controlled analyses of language emergence. Specifically, if we want to develop artificial agents that naturally communicate with humans, we want to ensure that we are aware of, and counteract, their unnatural biases, such as the one we uncovered here in favor of anti-efficient encoding. We presented a proof-of-concept example of how to get rid of this specific bias by directly penalizing long messages in the cost function, but future work should look into less ad hoc ways of conditioning the networks' language. Getting the encoding right seems particularly important, as efficient encoding has been observed to interact in subtle ways with other important properties of human language, such as regularity and compositionality [Kirby, 2001].
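The power-law input distribution our agents are trained on can be sketched as follows (a minimal illustration; the function name and the exponent default are ours, not the paper's exact configuration):

```python
import random

def zipfian_sampler(n_inputs, exponent=1.0, seed=0):
    """Return a sampler over input indices 0..n_inputs-1 with
    P(i) proportional to 1 / (i + 1)**exponent, i.e. a power law
    in which low-rank inputs are sampled most often."""
    rng = random.Random(seed)
    weights = [1.0 / (i + 1) ** exponent for i in range(n_inputs)]
    total = sum(weights)
    probs = [w / total for w in weights]

    def sample(k=1):
        # random.choices draws k indices according to the given weights
        return rng.choices(range(n_inputs), weights=probs, k=k)

    return sample

sampler = zipfian_sampler(100)
draws = sampler(10_000)  # rank-0 input dominates the sample
```

Training on such a skewed distribution, rather than a uniform one, is what makes frequency-dependent properties such as ZLA observable at all.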
We also emphasize the importance of using power-law input distributions when studying language emergence, as the latter are a universal property of human language [Zipf, 1949, Baayen, 2001] that has been largely ignored in previous simulations, which assume uniform input distributions.

ZLA is observed in all studied human languages. As mentioned above, some animal communication systems violate it [Heesen et al., 2019], but such systems 1) are limited in their expressivity and 2) do not display a significantly anti-efficient pattern. We complemented this earlier comparative research with an investigation of emergent language among artificial agents that need to signal a large number of different inputs. We found that the agents develop a successful communication system that does not exhibit ZLA, and is actually significantly anti-efficient. We connected this to an asymmetry in speaker vs. listener biases. This in turn suggests that ZLA in communication in general does not emerge from trivial statistical properties, but from a delicate balance of speaker and listener pressures. Future work should investigate emergent distributions in a wider range of artificial agents and environments, trying to understand which factors determine them.

5 Acknowledgments

We would like to thank Fermín Moscoso del Prado Martín, Ramon Ferrer i Cancho, Serge Sharoff, the audience at REPL4NLP 2019 and the anonymous reviewers for helpful comments and suggestions.

References

Serhii Havrylov and Ivan Titov. Emergence of language with multi-agent games: Learning to communicate with sequences of symbols. In Proceedings of NIPS, pages 2149–2159, Long Beach, CA, 2017.

Angeliki Lazaridou, Alexander Peysakhovich, and Marco Baroni. Multi-agent cooperation and the emergence of (natural) language.
In Proceedings of ICLR Conference Track, Toulon, France, 2017. Published online: https://openreview.net/group?id=ICLR.cc/2017/conference.

Angeliki Lazaridou, Karl Moritz Hermann, Karl Tuyls, and Stephen Clark. Emergence of linguistic communication from referential games with symbolic and pixel input. In Proceedings of ICLR Conference Track, Vancouver, Canada, 2018. Published online: https://openreview.net/group?id=ICLR.cc/2018/Conference.

Jason Lee, Kyunghyun Cho, Jason Weston, and Douwe Kiela. Emergent translation in multi-agent communication. In Proceedings of ICLR Conference Track, Vancouver, Canada, 2018. Published online: https://openreview.net/group?id=ICLR.cc/2018/Conference.

Satwik Kottur, José Moura, Stefan Lee, and Dhruv Batra. Natural language does not emerge ‘naturally’ in multi-agent dialog. In Proceedings of EMNLP, pages 2962–2967, Copenhagen, Denmark, 2017.

Diane Bouchacourt and Marco Baroni. How agents see things: On visual representations in an emergent language game. In Proceedings of EMNLP, pages 981–985, Brussels, Belgium, 2018.

Katrina Evtimova, Andrew Drozdov, Douwe Kiela, and Kyunghyun Cho. Emergent communication in a multi-modal, multi-step referential game. In Proceedings of ICLR Conference Track, Vancouver, Canada, 2018. Published online: https://openreview.net/group?id=ICLR.cc/2018/Conference.

Ryan Lowe, Jakob Foerster, Y-Lan Boureau, Joelle Pineau, and Yann Dauphin. On the pitfalls of measuring emergent communication. In Proceedings of AAMAS, pages 693–701, Montreal, Canada, 2019.

Laura Graesser, Kyunghyun Cho, and Douwe Kiela. Emergent linguistic phenomena in multi-agent communication games. https://arxiv.org/abs/1901.08706, 2019.

George Zipf. Human Behavior and the Principle of Least Effort. Addison-Wesley, Boston, MA, 1949.

William J Teahan, Yingying Wen, Rodger McNab, and Ian H Witten. A compression-based algorithm for Chinese word segmentation.
Computational Linguistics, 26(3):375–393, 2000.

Bengt Sigurd, Mats Eeg-Olofsson, and Joost Van Weijer. Word length, sentence length and frequency: Zipf revisited. Studia Linguistica, 58(1):37–52, 2004.

Udo Strauss, Peter Grzybek, and Gabriel Altmann. Word length and word frequency. In Peter Grzybek, editor, Contributions to the Science of Text and Language, pages 277–294. Springer, Dordrecht, the Netherlands, 2007.

Thomas Cover and Joy Thomas. Elements of Information Theory, 2nd ed. Wiley, Hoboken, NJ, 2006.

Steven T Piantadosi, Harry Tily, and Edward Gibson. Word lengths are optimized for efficient communication. Proceedings of the National Academy of Sciences, 108(9):3526–3529, 2011.

Kyle Mahowald, Isabelle Dautriche, Edward Gibson, and Steven Piantadosi. Word forms are structured for efficient use. Cognitive Science, 42:3116–3134, 2018.

Edward Gibson, Richard Futrell, Steven Piantadosi, Isabelle Dautriche, Kyle Mahowald, Leon Bergen, and Roger Levy. How efficiency shapes human language. Trends in Cognitive Science, 2019. In press.

Benoit Mandelbrot. Simple games of strategy occurring in communication through natural languages. Transactions of the IRE Professional Group on Information Theory, 3(3):124–137, 1954.

George A Miller, E Newman, and E Friedman. Some effects of intermittent silence. American Journal of Psychology, 70(2):311–314, 1957.

Ramon Ferrer i Cancho and Fermín Moscoso del Prado Martín. Information content versus word length in random typing. Journal of Statistical Mechanics: Theory and Experiment, 2011(12):L12002, 2011.

Fermín Moscoso del Prado Martín. The missing baselines in arguments for the optimal efficiency of languages. In Proceedings of the Annual Meeting of the Cognitive Science Society, volume 35, 2013.

Jasmeen Kanwal, Kenny Smith, Jennifer Culbertson, and Simon Kirby.
Zipf\u2019s law of abbrevia-\ntion and the principle of least effort: Language users optimise a miniature lexicon for ef\ufb01cient\ncommunication. Cognition, 165:45\u201352, 2017.\n\nRahma Chaabouni, Eugene Kharitonov, Alessandro Lazaric, Emmanuel Dupoux, and Marco Baroni.\nWord-order biases in deep-agent emergent communication. In Proceedings of ACL, pages 5166\u2013\n5175, Florence, Italy, 2019.\n\nDavid Lewis. Convention: A philosophical study, 1969.\n\nRamon Ferrer i Cancho and Albert D\u00edaz-Guilera. The global minima of the communicative energy of\nnatural communication systems. Journal of Statistical Mechanics: Theory and Experiment, 2007\n(06):P06009, 2007.\n\n10\n\n\fEugene Kharitonov, Rahma Chaabouni, Diane Bouchacourt, and Marco Baroni. Egg: a toolkit for\n\nresearch on emergence of language in games. arXiv preprint arXiv:1907.00852, 2019.\n\nSepp Hochreiter and J\u00fcrgen Schmidhuber. Long short-term memory. Neural Computation, 9(8):\n\n1735\u20131780, 1997.\n\nCheng-Yuan Liou, Wei-Chen Cheng, Jiun-Wei Liou, and Daw-Ran Liou. Autoencoder for words.\n\nNeurocomputing, 139:84\u201396, 2014.\n\nChris J Maddison, Andriy Mnih, and Yee Whye Teh. The concrete distribution: A continuous\n\nrelaxation of discrete random variables. arXiv preprint arXiv:1611.00712, 2016.\n\nEric Jang, Shixiang Gu, and Ben Poole. Categorical reparameterization with Gumbel-softmax. arXiv\n\npreprint arXiv:1611.01144, 2016.\n\nRonald J Williams. Simple statistical gradient-following algorithms for connectionist reinforcement\n\nlearning. Machine learning, 8(3-4):229\u2013256, 1992.\n\nJohn Schulman, Nicolas Heess, Theophane Weber, and Pieter Abbeel. Gradient estimation using\nstochastic computation graphs. In Advances in Neural Information Processing Systems, pages\n3528\u20133536, 2015.\n\nRonald J Williams and Jing Peng. Function optimization using connectionist reinforcement learning\n\nalgorithms. Connection Science, 3(3):241\u2013268, 1991.\n\nDiederik P Kingma and Jimmy Ba. 
Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.

Ramon Ferrer i Cancho, Antoni Hernández-Fernández, David Lusseau, Govindasamy Agoramoorthy, Minna J Hsu, and Stuart Semple. Compression as a universal principle of animal behavior. Cognitive Science, 37(8):1565–1578, 2013.

Herbert A Simon. On a class of skew distribution functions. Biometrika, 42(3/4):425–440, 1955.

Raphaela Heesen, Catherine Hobaiter, Ramon Ferrer-i Cancho, and Stuart Semple. Linguistic laws in chimpanzee gestural communication. Proceedings of the Royal Society B, 286(1896):20182900, 2019.

Simon Kirby. Spontaneous evolution of linguistic structure: an iterated learning model of the emergence of regularity and irregularity. IEEE Transactions on Evolutionary Computation, 5(2):102–110, 2001.

Harald Baayen. Word Frequency Distributions. Kluwer, Dordrecht, The Netherlands, 2001.