{"title": "Emergence of Language with Multi-agent Games: Learning to Communicate with Sequences of Symbols", "book": "Advances in Neural Information Processing Systems", "page_first": 2149, "page_last": 2159, "abstract": "Learning to communicate through interaction, rather than relying on explicit supervision, is often considered a prerequisite for developing a general AI. We study a setting where two agents engage in playing a referential game and, from scratch, develop a communication protocol necessary to succeed in this game. Unlike previous work, we require that messages they exchange, both at train and test time, are in the form of a language (i.e. sequences of discrete symbols). We compare a reinforcement learning approach and one using a differentiable relaxation (straight-through Gumbel-softmax estimator) and observe that the latter is much faster to converge and it results in more effective protocols. Interestingly, we also observe that the protocol we induce by optimizing the communication success exhibits a degree of compositionality and variability (i.e. the same information can be phrased in different ways), both properties characteristic of natural languages. As the ultimate goal is to ensure that communication is accomplished in natural language, we also perform experiments where we inject prior information about natural language into our model and study properties of the resulting protocol.", "full_text": "Emergence of Language with Multi-agent Games:\n\nLearning to Communicate with Sequences of Symbols\n\nSerhii Havrylov\n\nILCC, School of Informatics\n\nUniversity of Edinburgh\n\ns.havrylov@inf.ed.ac.uk\n\nIvan Titov\n\nILCC, School of Informatics\n\nUniversity of Edinburgh\n\nILLC, University of Amsterdam\n\nititov@inf.ed.ac.uk\n\nAbstract\n\nLearning to communicate through interaction, rather than relying on explicit super-\nvision, is often considered a prerequisite for developing a general AI. 
We study a setting where two agents engage in playing a referential game and, from scratch, develop the communication protocol necessary to succeed in this game. Unlike previous work, we require that the messages they exchange, both at train and test time, are in the form of a language (i.e. sequences of discrete symbols). We compare a reinforcement learning approach and one using a differentiable relaxation (the straight-through Gumbel-softmax estimator (Jang et al., 2017)) and observe that the latter converges much faster and results in more effective protocols. Interestingly, we also observe that the protocol we induce by optimizing communication success exhibits a degree of compositionality and variability (i.e. the same information can be phrased in different ways), both properties characteristic of natural languages. As the ultimate goal is to ensure that communication is accomplished in natural language, we also perform experiments where we inject prior information about natural language into our model and study properties of the resulting protocol.\n\n1 Introduction\n\nWith the rapid advances in machine learning in recent years, the goal of enabling intelligent agents to communicate with each other and with humans is turning from a hot topic of philosophical debates into a practical engineering problem. It is believed that supervised learning alone is not going to provide a solution to this challenge (Mikolov et al., 2015). Moreover, even learning natural language from interaction between humans and an agent may not be the most efficient and scalable approach. 
These considerations, as well as the desire to achieve a better understanding of the principles guiding the evolution and emergence of natural languages (Nowak and Krakauer, 1999; Brighton, 2002), have motivated previous research into setups where agents invent a communication protocol that lets them succeed in a given collaborative task (Batali, 1998; Kirby, 2002; Steels, 2005; Baronchelli et al., 2006). For an extensive overview of earlier work in this area, we refer the reader to Kirby (2002) and Wagner et al. (2003).\nWe continue this line of research and specifically consider a setting where the collaborative task is a game. Neural network models have been shown to be able to successfully induce a communication protocol for this setting (Lazaridou et al., 2017; Jorge et al., 2016; Foerster et al., 2016; Sukhbaatar et al., 2016). One important difference from these previous approaches is that we assume that messages exchanged between the agents are variable-length strings of symbols rather than atomic categories. A protocol of this form has properties closer to natural language and, as such, offers advantages over atomic categories: for example, it can support compositionality (Werning et al., 2011) and provides an easy way to regulate the amount of information conveyed in a message.\n\n31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.\n\nInterestingly, in our experiments, we also find that agents develop a protocol faster when we allow them to use longer sequences of symbols. Somewhat surprisingly, we observe that the language derived by our method favours multiple encodings of the same information, reminiscent of synonyms or paraphrases in natural languages. Moreover, with messages being strings of symbols (i.e. 
words), it is now possible to inject supervision to ensure that the invented protocol is close enough to a natural language and, thus, potentially interpretable by humans.\nIn our experiments, we focus on a referential game (Lewis, 1969), where the goal for one agent is to explain which image the other agent should select. Our setting can be formulated as follows:\n\n1. There is a collection of images {i_n}_{n=1}^N from which a target image t is sampled, as well as K distracting images {d_k}_{k=1}^K.\n2. There are two agents: a sender S_φ and a receiver R_θ.\n3. After seeing the target image t, the sender has to come up with a message m_t, which is represented by a sequence of symbols from a vocabulary V of size |V|. The maximum possible length of a sequence is L.\n4. Given the message m_t and a set of images consisting of the distracting images and the target image, the goal of the receiver is to identify the target image correctly.\n\nThis setting is inspired by Lazaridou et al. (2017), but there are important differences: for example, we use sequences rather than single symbols, and our sender, unlike theirs, does not have access to the distracting images. This makes our setting both arguably more realistic and more challenging from the learning perspective.\nGenerating the message m_t requires sampling from categorical distributions over the vocabulary, which makes backpropagating the error through the message impossible. It is tempting to formulate this game as a reinforcement learning problem. However, the number of possible messages is proportional to |V|^L, so naïve Monte Carlo methods will give very high-variance estimates of the gradients, which makes the learning process harder. Moreover, because the receiver R_θ adapts to the produced messages, the sender S_φ effectively acts in a non-stationary environment, making the learning problem even more challenging. 
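To make the setup concrete, one round of the game described above can be sketched in numpy. Everything here is a toy stand-in: the sender's LSTM is collapsed into a single linear map (so tokens come out i.i.d.), the receiver's LSTM is replaced by mean-pooled token embeddings, and the dimensions are far smaller than the paper's |V| = 10000 and K = 127. What the sketch does show is the discrete sampling step that blocks backpropagation and the receiver's hinge loss over image scores.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions; hypothetical stand-ins for the paper's setting.
VOCAB, MAX_LEN, N_IMG, FEAT = 20, 5, 4, 8

def sender_sample(target_feat, W, b):
    """Sample a discrete message: each token ~ Cat(softmax(W h + b)).
    The sender LSTM is collapsed into one linear map for brevity, so
    tokens are i.i.d.; the point is that the sample is discrete."""
    logits = target_feat @ W + b
    p = np.exp(logits - logits.max())
    p /= p.sum()
    return [int(rng.choice(VOCAB, p=p)) for _ in range(MAX_LEN)]

def receiver_loss(msg, feats, target_idx, emb):
    """Hinge loss over image scores: sum_k max(0, 1 - f(t).g + f(d_k).g),
    where g is the receiver's message representation (here a mean of
    token embeddings instead of an LSTM state)."""
    g = emb[msg].mean(axis=0)
    scores = feats @ g
    t = scores[target_idx]
    return float(sum(max(0.0, 1.0 - t + s)
                     for i, s in enumerate(scores) if i != target_idx))

feats = rng.normal(size=(N_IMG, FEAT))   # image features: target + distractors
W = 0.1 * rng.normal(size=(FEAT, VOCAB))
b = np.zeros(VOCAB)
emb = rng.normal(size=(VOCAB, FEAT))     # receiver token embeddings
msg = sender_sample(feats[0], W, b)
loss = receiver_loss(msg, feats, target_idx=0, emb=emb)
```

Because `rng.choice` draws a hard sample, no gradient can flow from `loss` back to `W`; this is exactly the obstacle the learning methods of Section 2.3 address.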
Instead, we propose an effective approach based on straight-through Gumbel-softmax estimators (Jang et al., 2017; Bengio et al., 2013), which allow for end-to-end differentiation despite using only discrete messages in training. We demonstrate that this approach is much more effective than the reinforcement learning framework employed in previous approaches to referential games, both in terms of convergence times and the resulting communication success.\nOur main contributions can be summarized as follows:\n\n• we are the first to show that structured protocols (i.e. strings of symbols) can be induced from scratch by optimizing reward in collaborative tasks;\n• we demonstrate that relaxations based on straight-through estimators are more effective than reinforcement learning for our task;\n• we show that the induced protocol implements a hierarchical encoding scheme and that there exist multiple paraphrases that encode the same semantic content.\n\n2 Model\n\n2.1 Agents' architectures\n\nThe sender and the receiver are implemented as LSTM networks (Hochreiter and Schmidhuber, 1997). Figure 1 shows a sketch of the model architecture, where diamond-shaped, dashed and solid arrows represent sampling, copying and deterministic functions respectively. The inputs to the sender are the target image t and a special token that denotes the start of a message. Given these inputs, the sender generates the next token w_i in the sequence by sampling from the categorical distribution Cat(p_i^t), where p_i^t = softmax(W h_i^s + b). Here, h_i^s is the hidden state of the sender's LSTM and can be calculated as h_i^s = LSTM(h_{i-1}^s, w_{i-1}); we omit the cell state for brevity. In the first time step we have h_0^s = η(f(t)), where η(·) is an affine transformation of image features f(·) extracted from a convolutional neural network (CNN). The message m_t is obtained by sequentially sampling until the maximum possible length L is reached or the special end-of-message token is generated (in our experiments |V| = 10000 and L is up to 14).\n\nFigure 1: Architectures of sender and receiver.\n\nThe inputs to the receiver are the generated message m_t and a set of images that contains the target image t and the distracting images {d_k}_{k=1}^K. The receiver's interpretation of the message is given by an affine transformation g(·) of the last hidden state h_l^r of the LSTM network that reads the message. The loss function for the whole system can be written as:\n\nL_{φ,θ}(t) = E_{m_t ∼ p_φ(·|t)} [ Σ_{k=1}^K max(0, 1 − f(t)^T g(h_l^r) + f(d_k)^T g(h_l^r)) ]   (1)\n\nThe energy function E(v, m_t) = −f(v)^T g(h_l^r(m_t)) can be used to define the probability distribution over a set of images p(v|m_t) ∝ e^{−E(v, m_t)}. Communication between the two agents is successful if the target image has the highest probability according to this distribution.\n\n2.2 Grounding in Natural Language\n\nTo ensure that communication is accomplished with a language that is understandable by humans, we should favour protocols that resemble, in some respect, a natural language. Also, we would like to check whether using sequences with statistical properties similar to those of a natural language would be beneficial for communication. There are at least two ways to do this.\nIndirect supervision can be implemented by using the Kullback-Leibler (KL) divergence regularization D_KL(q_φ(m|t) || p_NL(m)), from the natural language to the learned protocol. As we do not have access to p_NL(m), we train a language model p_ω using available samples (i.e. 
texts) and approximate the original KL divergence with D_KL(q_φ(m|t) || p_ω(m)). We estimated the gradient of this divergence with respect to the parameters φ by applying the ST-GS estimator to a Monte Carlo approximation calculated with one message sampled from q_φ(m|t). This regularization provides indirect supervision by encouraging generated messages to have high probability under the natural language model while maintaining high entropy in the communication protocol. Note that this is a weak form of grounding, as it does not force agents to preserve 'meanings' of words: the same word can refer to very different concepts in the induced artificial language and in the natural language.\nThe described indirect grounding of the artificial language in a natural language can be interpreted as a particular instantiation of a variational autoencoder (VAE) (Kingma and Welling, 2014). There are no gold-standard messages for images, so a message can be treated as a variable-length sequence of discrete latent variables. On the other hand, image representations are always given; hence they are equivalent to the observed variable in the VAE framework. The trained language model p_ω(m) serves as a prior over the latent variables. The receiver agent is analogous to the generative part of the VAE, although it uses a slightly different loss for the reconstruction error (hinge loss instead of log-likelihood). The sender agent is equivalent to an inference network used to approximate the posteriors in VAEs.\n\nMinimizing the KL divergence from the natural language distribution to the learned protocol distribution can ensure that statistical properties of the messages are similar to those of natural language. However, words are not likely to preserve their original meaning (e.g. the word 'red' may not refer to 'red' in the protocol). 
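The single-sample Monte Carlo estimate of the KL term described above can be illustrated with toy per-token distributions, where q plays the role of the sender's message distribution and p_lm a hypothetical pretrained language-model prior (many single-sample estimates are averaged here only to show that the estimator is unbiased):

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy distributions over a 5-symbol vocabulary (illustrative values only):
# q is the sender's distribution, p_lm a hypothetical language-model prior.
V = 5
q = np.array([0.70, 0.10, 0.10, 0.05, 0.05])
p_lm = np.array([0.30, 0.30, 0.20, 0.10, 0.10])

def mc_kl(q, p, n_samples=20000):
    """Monte Carlo estimator of D_KL(q || p): draw w ~ q and average
    log q(w) - log p(w). A single draw is the one-sample estimate used
    during training; we average many draws to expose the expectation."""
    ws = rng.choice(V, size=n_samples, p=q)
    return float(np.mean(np.log(q[ws]) - np.log(p[ws])))

exact = float(np.sum(q * (np.log(q) - np.log(p_lm))))
approx = mc_kl(q, p_lm)
```

A positive KL pushes frequent protocol symbols toward symbols that are also probable under the language-model prior, which is the indirect grounding effect described in the text.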
To address this issue, a more direct form of supervision can be considered: for example, additionally training the sender on the image captioning task (Vinyals et al., 2015), assuming that there is a correct and most informative way to describe an image.\n\n2.3 Learning\n\nIt is relatively easy to learn the receiver agent: it is end-to-end differentiable, so gradients of the loss function with respect to its parameters can be estimated efficiently. A receiver-type model was investigated before by Chrupała et al. (2015), where it is known as Imaginet; it was used to learn visually grounded representations of language from coupled textual and visual input. The real challenge is to learn the sender agent: its computational graph contains sampling, which makes it non-differentiable. In the rest of this section, we discuss methods for estimating gradients of the loss function in Equation (1).\n\n2.3.1 REINFORCE\n\nREINFORCE is a likelihood-ratio method (Williams, 1992) that provides a simple way of estimating gradients of the loss function with respect to the parameters of a stochastic policy. We are interested in optimizing the loss function from Equation (1). The REINFORCE algorithm enables the use of gradient-based optimization methods by estimating gradients as:\n\n∂L_{φ,θ}/∂φ = E_{p_φ(·|t)} [ ∂ log p_φ(m_t|t)/∂φ · l(m_t) ]   (2)\n\nwhere l(m_t) is the learning signal, the inner part of the expectation in Equation (1). However, computing the gradient precisely may not be feasible due to the enormous number of message configurations; usually, a Monte Carlo approximation of the expectation is used. Training models with REINFORCE can be difficult due to the high variance of the estimator. We observed more reliable learning when using the stabilizing techniques proposed by Mnih and Gregor (2014). 
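The estimator in Equation (2), together with the baseline and variance normalization described next, can be sketched on a toy problem; a 3-armed categorical policy stands in for the sender, and all constants are illustrative:

```python
import numpy as np

rng = np.random.default_rng(2)

# REINFORCE on a 3-armed categorical policy: arm 2 has the highest reward.
# The moving-average baseline and reward-std scaling follow the
# stabilization tricks of Mnih and Gregor (2014) described in the text.
logits = np.zeros(3)
rewards = np.array([0.0, 0.2, 1.0])
baseline, var_run = 0.0, 1.0

for _ in range(3000):
    p = np.exp(logits - logits.max())
    p /= p.sum()
    a = rng.choice(3, p=p)
    r = rewards[a]
    baseline = 0.99 * baseline + 0.01 * r                 # moving-average baseline
    var_run = 0.99 * var_run + 0.01 * (r - baseline) ** 2
    signal = (r - baseline) / max(np.sqrt(var_run), 1e-4) # centred, ~unit variance
    grad_logp = -p
    grad_logp[a] += 1.0                                   # d log p(a) / d logits
    logits += 0.05 * signal * grad_logp                   # ascent on expected reward

p_final = np.exp(logits - logits.max())
p_final /= p_final.sum()
best = int(np.argmax(p_final))
```

The policy concentrates on the best arm; without the baseline and scaling, the raw reward multiplies the score function directly and the updates are far noisier.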
Namely, we use a baseline, defined as a moving average of the reward, to control the variance of the estimator; this results in centering the learning signal l(m_t). We also use a variance-based adaptation of the learning rate that consists of dividing the learning rate by a running estimate of the reward standard deviation. This trick ensures that the learning signal has approximately unit variance, making the learning process less sensitive to dramatic and non-monotonic changes in the centered learning signal. To take into account the varying difficulty of describing different images, we use an input-dependent baseline implemented as a neural network with two hidden layers.\n\n2.3.2 Gumbel-softmax estimator\n\nIn the typical RL task formulation, an acting agent does not have access to the complete environment specification, or, even if it does, the environment is non-differentiable. Thus, in our setup, an agent trained by any REINFORCE-like algorithm would underuse the available information about the environment. As a solution, we consider replacing one-hot encoded symbols w ∈ V sampled from a categorical distribution with a continuous relaxation w̃ obtained from the Gumbel-softmax distribution (Jang et al., 2017; Maddison et al., 2017).\nConsider a categorical distribution with event probabilities p_1, p_2, ..., p_K. The Gumbel-softmax trick proceeds as follows: obtain K samples {u_k}_{k=1}^K from a uniformly distributed variable u ∼ U(0, 1), transform each sample with the function g_k = −log(−log(u_k)) to get samples from the Gumbel distribution, then compute the continuous relaxation:\n\nw̃_k = exp((log p_k + g_k)/τ) / Σ_{i=1}^K exp((log p_i + g_i)/τ)   (3)\n\nwhere τ is the temperature that controls how accurately the softmax approximates arg max. 
As the temperature \u03c4 is approaching 0, samples from the Gumbel-softmax distribution\n\n4\n\n\fare becoming one-hot encoded, and the Gumbel-softmax distribution starts to be identical to the\ncategorical distribution (Jang et al., 2017).\nAs a result of this relaxation, the game becomes completely differentiable and can be trained using the\nbackpropagation algorithm. However, communicating with real values allows the sender to encode\nmuch more information into a message compared to using a discrete one and is unrealistic if our\nultimate goal is communication in natural language. Also, due to the recurrent nature of the receiver\nagent, using discrete tokens during test time can lead to completely different dynamics compared to\nthe training time which uses continuous tokens. This manifests itself in a large gap between training\nand testing performance (up to 20% drop in the communication success rate in our experiments).\n\n2.3.3 Straight-through Gumbel-softmax estimator\n\n\u2202w \u2248 \u2202L\n\nTo prevent the issues mentioned above, we discretize \u02dcw back with arg max in the forward pass that\nthen becomes an ordinary sample from the original categorical distribution. Nevertheless, we use\n\u2202 \u02dcw . This biased estimator is\ncontinuous relaxation in the backward pass, effectively assuming \u2202L\nknown as the straight-through Gumbel-softmax (ST-GS) estimator (Jang et al., 2017; Bengio et al.,\n2013). As a result of applying this trick, there is no difference in message usage during training and\ntesting stages, which contrasts with previous differentiable frameworks for learning communication\nprotocols (Foerster et al., 2016).\nBecause of using ST-GS, the forward pass does not depend on the temperature. However, it still\naffects the gradient values during the backward pass. As discussed before, low values for \u03c4 provide\nbetter approximations of arg max. 
Because the derivative of arg max is 0 everywhere except at the boundary of state changes, a more accurate approximation would lead to a severe vanishing gradient problem. Nonetheless, with ST-GS we can afford to use large values of τ, which usually leads to faster learning. To reduce the burden of performing an extensive hyperparameter search for the temperature, similarly to Gulcehre et al. (2017), we consider learning the inverse temperature with a multilayer perceptron:\n\n1/τ(h_i^s) = log(1 + exp(w_τ^T h_i^s)) + τ_0,   (4)\n\nwhere τ_0 controls the maximum possible value of the temperature. In our experiments, we found that the learning process is not very sensitive to this hyperparameter as long as τ_0 is less than 1.0.\nAlthough the ST-GS estimator is computationally efficient, it is biased. To understand how reliable the provided direction is, one can check whether it can be regarded as a pseudogradient (for the results see Section 3.1). A direction δ is a pseudogradient of J(u) if the condition δ^T ∇J(u) > 0 is satisfied. Polyak and Tsypkin (1973) have shown that, given certain assumptions about the learning rate, a very broad class of pseudogradient methods converges to a critical point of the function J. To examine whether the direction provided by ST-GS is a pseudogradient, we used a stochastic perturbation gradient estimator that can approximate the dot product between an arbitrary direction δ in the parameter space and the true gradient:\n\n(J(u + εδ) − J(u − εδ)) / (2ε) = δ^T ∇J(u) + O(ε²)   (5)\n\nIn our case J(u) is a Monte Carlo approximation of Equation (1). 
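On a toy objective, the check in Equation (5) is just a central finite difference along the candidate direction; for a quadratic J the O(ε²) term vanishes, so the estimate matches δ^T∇J(u) up to rounding (J here is a hypothetical stand-in for the Monte Carlo loss):

```python
import numpy as np

# Central-difference check of Eq. (5) on J(u) = ||u||^2 / 2, whose true
# gradient is u. A direction delta is a pseudogradient iff
# delta^T grad J(u) > 0, i.e. the estimated dot product is positive.
def J(u):
    return 0.5 * float(u @ u)

u = np.array([1.0, -2.0, 0.5])
delta = np.array([0.9, -1.8, 0.6])   # candidate direction (e.g. a biased estimate)
eps = 1e-4
dot_est = (J(u + eps * delta) - J(u - eps * delta)) / (2 * eps)
dot_true = float(delta @ u)          # delta^T grad J(u)
```

A positive `dot_est` means `delta` forms an acute angle with the true gradient, which is exactly the criterion checked for the ST-GS direction in Section 3.1.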
To reduce the variance of the dot product estimate (Bhatnagar et al., 2012), the same Gumbel noise samples can be used for evaluating the forward and backward perturbations of J(u).\n\n3 Experiments\n\n3.1 Tabula rasa communication\n\nWe used the Microsoft COCO dataset (Chen et al., 2015) as a source of images. Prior to training, we randomly selected 10% of the images from the MSCOCO 2014 training set as validation data and kept the rest as training data. As a result of this split, more than 74k images were used for training and more than 8k images for validation. To evaluate the learned communication protocol, we used the MSCOCO 2014 validation set, which consists of more than 40k images. In our experiments images are represented by outputs of the relu7 layer of the pretrained 16-layer VGG convolutional network (Simonyan and Zisserman, 2015).\n\nFigure 2: The performance and properties of learned protocols.\n\nWe set the following model configuration without tuning: the embedding dimensionality is 256, the dimensionality of the LSTM layers is 512, the vocabulary size is 10000, the number of distracting images is 127, and the batch size is 128. We used Adam (Kingma and Ba, 2014) as the optimizer, with default hyperparameters and a learning rate of 0.001 for the GS-ST method. For the REINFORCE estimator we tuned the learning rate by searching for the optimal value over the interval [10^-5; 0.1] with a multiplicative step size of 10^-1. We did not observe significant improvements when using the input-dependent baseline and disregarded it for the sake of simplicity. To investigate the benefits of learning the temperature, we first found the optimal fixed temperature, equal to 1.2, by performing a search over the interval [0.5; 2.0] with a step size of 0.1. As mentioned before, the learning process with the temperature defined by Equation (4) is not very sensitive to the τ_0 hyperparameter. 
Nevertheless, we conducted a hyperparameter search over the interval [0.0; 2.0] with a step size of 0.1 and found that the model with τ_0 = 0.2 has the best performance. The differences in performance were not significant unless τ_0 was bigger than 1.0.\nAfter training the models we tested two encoding strategies: plain sampling and greedy argmax, the latter meaning that the argmax of the corresponding categorical distribution is selected at each time step. Figure 2 shows the communication success rate as a function of the maximum message length L. Because results for models with a learned temperature are very similar to their counterparts with fixed (manually tuned) temperatures, we omitted them from the figure for clarity; on average, however, models with learned temperatures outperform the vanilla versions by 0.8%. As expected, argmax encoding slightly but consistently outperforms the sampling strategy. Surprisingly, REINFORCE beats GS-ST in the setup with L = 1. We may speculate that in this relatively easy setting being unbiased (as REINFORCE is) is more important than having low variance (as GS-ST does).\nInterestingly, the number of updates required to reach training convergence with the GS-ST estimator decreases when we let the sender use longer messages (i.e. for larger L). This behaviour is slightly surprising, as one could expect the protocol to be harder to learn when the space of messages is larger. In other words, using longer sequences helps to learn a communication protocol faster. This is not at all the case for the REINFORCE estimator: it usually takes five times more updates to converge than GS-ST, and there is no clear dependency between the number of updates needed to converge and the maximum possible length of a message.\nWe also plot the perplexity of the encoder. It is relatively high and increases with sentence length for GS-ST, whereas for REINFORCE the perplexity increase is not as rapid. 
This implies redundancy in the encodings: there exist multiple paraphrases that encode the same semantic content. A noteworthy feature of GS-ST with a learned temperature is that the perplexity values of all encoders, for every L, are always smaller than the corresponding values for vanilla GS-ST.\nLastly, we calculated an estimate of the dot product between the true gradient of the loss function and the direction provided by the GS-ST estimator using Equation (5). We found that after 400 parameter updates there is almost always (> 99%) an acute angle between the two. This suggests that the GS-ST gradient can be used as a pseudogradient for our referential game problem.\n\n3.2 Qualitative analysis of the learned language\n\nTo better understand the nature of the learned language, we inspected a small subset of sentences produced by the model with a maximum possible message length of 5. To avoid cherry-picking images, we used the following strategy in both the food and animal domains: first, we took a random photo of an object and generated a message; then we iterated over the dataset and randomly selected images with messages that share prefixes of 1, 2 and 3 symbols with the given message.\nFigure 3 shows some samples from the MSCOCO 2014 validation set that correspond to the (5747 * * * *) code, where * stands for any word from the vocabulary or end-of-sentence padding. Images in this subset depict animals. On the other hand, it seems that images for the (* * * 5747 *) code do not correspond to any predefined category. This suggests that word order is crucial in the developed language. In particular, the word 5747 in the first position encodes the presence of an animal in the image. The same figure shows that the message (5747 5747 7125 * *) corresponds to a particular type of bear. This suggests that the developed language implements some kind of hierarchical coding, which is interesting in itself because the model was not explicitly constrained to use any hierarchical encoding scheme. 
Presumably, this can help the model efficiently describe unseen images. Nevertheless, natural language uses other principles to ensure compositionality. The model shows similar behaviour for images in the food domain.\n\nFigure 3: Samples from MS COCO that correspond to particular codes.\n\n3.3 Indirect grounding of artificial language in natural language\n\nWe implemented the indirect grounding approach discussed in Section 2.2. We trained the language model p_ω(m) using an LSTM recurrent neural network; it was used as the prior distribution over messages. To acquire data for estimating the parameters of the language model, we took the image captions of a randomly selected 50% of the images from the previously created training set. These images were not used for training the sender and the receiver; the other half of the set was used for training the agents. We evaluated the learned communication protocol on the MSCOCO 2014 validation set.\nTo get an estimate of communication success when using natural language, we trained the receiver with pairs of images and captions. This model is similar to Imaginet (Chrupała et al., 2015). Also, inspired by their analysis, we report the omission score. The omission score of a word is the difference between the target image probability given the original message and the probability given the message with that word removed. The sentence omission score is the maximum over all word omission scores in the given sentence. The score quantifies the change in the target image probability after removing the most important word. Natural languages have content words that name objects (i.e. nouns) and encode their qualities (e.g. adjectives). One can expect that a protocol that distinguishes between content words and function words would have a higher omission score than a protocol that distributes information evenly across tokens. 
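The omission score just defined can be sketched as follows, with a hypothetical bag-of-embeddings receiver standing in for the trained model:

```python
import numpy as np

# Omission scores: for each word, the drop in the target image probability
# when that word is removed; the sentence score is the maximum drop.
# The scoring model below is a toy stand-in for the trained receiver.
rng = np.random.default_rng(4)
VOCAB, FEAT, N_IMG = 10, 6, 3
emb = rng.normal(size=(VOCAB, FEAT))
feats = rng.normal(size=(N_IMG, FEAT))

def p_target(msg, target_idx):
    g = emb[msg].mean(axis=0)              # message representation
    s = np.exp(feats @ g)                  # p(v | m) ∝ exp(-E(v, m))
    return float(s[target_idx] / s.sum())

def omission_scores(msg, target_idx):
    base = p_target(msg, target_idx)
    scores = [base - p_target(msg[:i] + msg[i + 1:], target_idx)
              for i in range(len(msg))]
    return scores, max(scores)

msg = [1, 4, 7, 2]
word_scores, sentence_score = omission_scores(msg, target_idx=0)
```

A protocol with word-like content symbols yields one dominant omission score per sentence; a character-like protocol spreads small scores across all positions.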
As Table 1 shows, the grounded language achieves a communication success rate similar to that of natural language, although it has a slightly lower omission score. The unregularized model has the lowest omission score, which probably means that symbols in the developed protocol are closer in nature to characters or syllables than to words.\n\nTable 1: Comparison of the grounded protocol with the natural language and the artificial language\n\nModel | Comm. success (%) | Number of updates | Omission score\nWith KL regularization | 52.51 | 11600 | 0.258\nWithout regularization | 95.65 | 27600 | 0.193\nImaginet | 52.51 | 16100 | 0.287\n\n3.4 Direct grounding of artificial language in natural language\n\nAs discussed in Section 2.2, minimizing the KL divergence ensures that statistical properties of the protocol are similar to those of natural language. However, words are not likely to preserve their original meaning (e.g. the word 'red' may refer to the concept of 'blue' in the protocol). To resolve this issue, we additionally trained the sender on the image captioning task. To understand whether the additional communication loss can help in a setting where the amount of data is limited, we considered the following setup for the image description generation task.\nTo simulate the semi-supervised setting, we divided the previously created training set into two parts. A randomly selected 25% of the dataset was used to train the sender on the image captioning task, with loss L_caption; the remaining 75% was used to train the sender and the receiver to solve the referential game, with loss L_game. The final loss is a weighted sum of the losses for the two tasks: L = L_caption + λ L_game. We did not perform any preprocessing of the gold-standard captions apart from lowercasing. 
It is important to mention that in this setup the communication loss is equivalent to a variational lower bound on the mutual information (Barber and Agakov, 2003) between the image features and the corresponding caption.\n\nTable 2: Metrics for image captioning models with and without communication loss\n\nModel | BLEU-2 | BLEU-3 | BLEU-4 | ROUGE-L | CIDEr | Avg. length\nw/ comm. loss | 0.435 | 0.290 | 0.195 | 0.492 | 0.590 | 13.93\nw/o comm. loss | 0.436 | 0.290 | 0.195 | 0.491 | 0.594 | 12.85\n\nWe used the greedy decoding strategy to sample image descriptions. As Table 2 shows, both systems have comparable performance across the different image captioning metrics. We believe that the model did not achieve better performance because discriminative captions are different in nature from reference captions. In fact, generating discriminative descriptions may be useful for certain applications (e.g. generating referring expressions in navigation instructions (Byron et al., 2009)), but it is hard to evaluate them intrinsically. Note that using the communication loss yields, on average, longer captions. This is not surprising: given the mutual information interpretation of the referential game, a longer sequence can retain more information about the image features.\n\n4 Related work\n\nThere is a long history of work on language emergence in multi-agent systems (Kirby, 2002; Wagner et al., 2003; Steels, 2005; Nolfi and Mirolli, 2009; Golland et al., 2010). The recent generation relies on deep learning techniques. More specifically, Foerster et al. (2016) proposed a differentiable inter-agent learning (DIAL) framework, which was used to solve puzzles in a multi-agent setting. The agents in their work were allowed to communicate by sending one-bit messages. Jorge et al. (2016) adopted DIAL to solve an interactive image search task with two agents participating in the task. 
These agents successfully developed a language consisting of one-hot encoded atomic symbols. By contrast, Lazaridou et al. (2017) applied the policy gradient method to learn agents that are involved in a referential game. Unlike us, they used atomic symbols rather than sequences of tokens. Learning dialogue systems for collaborative activities between machine and human was previously considered by Lemon et al. (2002). Usually, such systems are hybrid models that combine reinforcement learning with supervised learning (Henderson et al., 2008; Schatzmann et al., 2006).

The idea of using the Gumbel-softmax distribution for learning language in a multi-agent environment was concurrently considered by Mordatch and Abbeel (2017). They studied a simulated two-dimensional environment in continuous space and discrete time with several agents where, in addition to performing physical actions, agents can also utter verbal communication symbols at every timestep. Similarly to ours, the induced language exhibits compositional structure and is to a large degree interpretable. Das et al. (2017), also in concurrent work, investigated a cooperative 'image guessing' game with two agents communicating in natural language. They use the policy gradient method for learning, hence their framework can benefit from the approach proposed in this paper. One important difference with our approach is that they pretrain their model on an available dialogue dataset. By contrast, we induce the communication protocol from scratch.

VAE-based approaches that use sequences of discrete latent variables were studied recently by Miao and Blunsom (2016) and Kočiský et al. (2016) for text summarization and semantic parsing, respectively.
The variational lower bound for these models involves an expectation with respect to the distribution over sequences of symbols, so the learning strategy proposed here may be beneficial in their applications.

5 Conclusion

In this paper, we have shown that agents, modeled using neural networks, can successfully invent a language that consists of sequences of discrete tokens. Despite the common belief that it is hard to train such models, we proposed an efficient learning strategy that relies on the straight-through Gumbel-softmax estimator. We have performed an analysis of the learned language and the corresponding learning dynamics. We have also considered two methods for injecting prior knowledge about natural language. In future work, we would like to extend this approach to modelling practical dialogues. The 'game' can be played between two agents rather than an agent and a human, while human interpretability would be ensured by integrating a supervised loss into the learning objective (as we did in Section 3.4, where we used captions). Hopefully, this will reduce the amount of necessary human supervision.

Acknowledgments

This project is supported by SAP ICN, ERC Starting Grant BroadSem (678254) and NWO Vidi Grant (639.022.518). We would like to thank Jelle Zuidema and the anonymous reviewers for their helpful suggestions and comments.

References

David Barber and Felix V Agakov. The IM algorithm: A variational approach to information maximization. In Advances in Neural Information Processing Systems, 2003.

Andrea Baronchelli, Maddalena Felici, Vittorio Loreto, Emanuele Caglioti, and Luc Steels. Sharp transition towards shared vocabularies in multi-agent systems. Journal of Statistical Mechanics: Theory and Experiment, 2006(06):P06014, 2006.

John Batali. Computational simulations of the emergence of grammar.
Approaches to the Evolution of Language: Social and Cognitive Bases, 405:426, 1998.

Yoshua Bengio, Nicholas Léonard, and Aaron Courville. Estimating or propagating gradients through stochastic neurons for conditional computation. arXiv preprint arXiv:1308.3432, 2013.

Shalabh Bhatnagar, HL Prasad, and LA Prashanth. Stochastic Recursive Algorithms for Optimization: Simultaneous Perturbation Methods, volume 434. Springer, 2012.

Henry Brighton. Compositional syntax from cultural transmission. Artificial Life, 8(1):25–54, 2002.

Donna Byron, Alexander Koller, Kristina Striegnitz, Justine Cassell, Robert Dale, Johanna Moore, and Jon Oberlander. Report on the first NLG challenge on generating instructions in virtual environments (GIVE). In Proceedings of the 12th European Workshop on Natural Language Generation, 2009.

Xinlei Chen, Hao Fang, Tsung-Yi Lin, Ramakrishna Vedantam, Saurabh Gupta, Piotr Dollár, and C Lawrence Zitnick. Microsoft COCO captions: Data collection and evaluation server. arXiv preprint arXiv:1504.00325, 2015.

Grzegorz Chrupała, Ákos Kádár, and Afra Alishahi. Learning language through pictures. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics, 2015.

Abhishek Das, Satwik Kottur, José MF Moura, Stefan Lee, and Dhruv Batra. Learning cooperative visual dialog agents with deep reinforcement learning. In Proceedings of the International Conference on Computer Vision, 2017.

Jakob Foerster, Yannis M Assael, Nando de Freitas, and Shimon Whiteson. Learning to communicate with deep multi-agent reinforcement learning. In Advances in Neural Information Processing Systems, pages 2137–2145, 2016.

Dave Golland, Percy Liang, and Dan Klein. A game-theoretic approach to generating spatial descriptions.
In Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing, pages 410–419. Association for Computational Linguistics, 2010.

Caglar Gulcehre, Sarath Chandar, and Yoshua Bengio. Memory augmented neural networks with wormhole connections. arXiv preprint arXiv:1701.08718, 2017.

James Henderson, Oliver Lemon, and Kallirroi Georgila. Hybrid reinforcement/supervised learning of dialogue policies from fixed data sets. Computational Linguistics, 34(4):487–511, 2008.

Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.

Eric Jang, Shixiang Gu, and Ben Poole. Categorical reparameterization with Gumbel-softmax. In Proceedings of the International Conference on Learning Representations, 2017.

Emilio Jorge, Mikael Kågebäck, and Emil Gustavsson. Learning to play Guess Who? and inventing a grounded language as a consequence. In Neural Information Processing Systems, the 3rd Deep Reinforcement Learning Workshop, 2016.

Diederik Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In Proceedings of the 3rd International Conference for Learning Representations, 2014.

Diederik P Kingma and Max Welling. Auto-encoding variational Bayes. In Proceedings of the 3rd International Conference for Learning Representations, 2014.

Simon Kirby. Natural language from artificial life. Artificial Life, 8:185–215, 2002.

Tomáš Kočiský, Gábor Melis, Edward Grefenstette, Chris Dyer, Wang Ling, Phil Blunsom, and Karl Moritz Hermann. Semantic parsing with semi-supervised sequential autoencoders. arXiv preprint arXiv:1609.09315, 2016.

Angeliki Lazaridou, Alexander Peysakhovich, and Marco Baroni. Multi-agent cooperation and the emergence of (natural) language.
In Proceedings of the International Conference on Learning Representations, 2017.

Oliver Lemon, Alexander Gruenstein, and Stanley Peters. Collaborative activities and multi-tasking in dialogue systems: Towards natural dialogue with robots. TAL. Traitement Automatique des Langues, 43(2):131–154, 2002.

David Lewis. Convention: A Philosophical Study. 1969.

Chris J Maddison, Andriy Mnih, and Yee Whye Teh. The Concrete distribution: A continuous relaxation of discrete random variables. In Proceedings of the International Conference on Learning Representations, 2017.

Yishu Miao and Phil Blunsom. Language as a latent variable: Discrete generative models for sentence compression. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, 2016.

Tomas Mikolov, Armand Joulin, and Marco Baroni. A roadmap towards machine intelligence. In Neural Information Processing Systems, Reasoning, Attention, and Memory Workshop, 2015.

Andriy Mnih and Karol Gregor. Neural variational inference and learning in belief networks. In Proceedings of the 31st International Conference on Machine Learning, 2014.

Igor Mordatch and Pieter Abbeel. Emergence of grounded compositional language in multi-agent populations. arXiv preprint arXiv:1703.04908, 2017.

Stefano Nolfi and Marco Mirolli. Evolution of Communication and Language in Embodied Agents. Springer Science & Business Media, 2009.

M. A. Nowak and D. Krakauer. The evolution of language. PNAS, 96(14):8028–8033, 1999. doi: 10.1073/pnas.96.14.8028. URL http://groups.lis.illinois.edu/amag/langev/paper/nowak99theEvolution.html.

BT Polyak and Ya Z Tsypkin. Pseudogradient adaptation and training algorithms. 1973.

Jost Schatzmann, Karl Weilhammer, Matt Stuttle, and Steve Young. A survey of statistical user simulation techniques for reinforcement-learning of dialogue management strategies.
The Knowledge Engineering Review, 21(2):97–126, 2006.

Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. In Proceedings of the International Conference on Learning Representations, 2015.

Luc Steels. What triggers the emergence of grammar. 2005.

Sainbayar Sukhbaatar, Arthur Szlam, and Rob Fergus. Learning multiagent communication with backpropagation. In Advances in Neural Information Processing Systems, pages 2244–2252, 2016.

Oriol Vinyals, Alexander Toshev, Samy Bengio, and Dumitru Erhan. Show and tell: A neural image caption generator. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3156–3164, 2015.

Kyle Wagner, James A Reggia, Juan Uriagereka, and Gerald S Wilkinson. Progress in the simulation of emergent communication and language. Adaptive Behavior, 11(1):37–69, 2003.

M. Werning, W. Hinzen, and M. Machery. The Oxford Handbook of Compositionality. Oxford, UK, 2011.

Ronald J Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8(3-4):229–256, 1992.