{"title": "Speaker-Follower Models for Vision-and-Language Navigation", "book": "Advances in Neural Information Processing Systems", "page_first": 3314, "page_last": 3325, "abstract": "Navigation guided by natural language instructions presents a challenging reasoning problem for instruction followers. Natural language instructions typically identify only a few high-level decisions and landmarks rather than complete low-level motor behaviors; much of the missing information must be inferred based on perceptual context. In machine learning settings, this is doubly challenging: it is difficult to collect enough annotated data to enable learning of this reasoning process from scratch, and also difficult to implement the reasoning process using generic sequence models. Here we describe an approach to vision-and-language navigation that addresses both these issues with an embedded speaker model. We use this speaker model to (1) synthesize new instructions for data augmentation and to (2) implement pragmatic reasoning, which evaluates how well candidate action sequences explain an instruction. Both steps are supported by a panoramic action space that reflects the granularity of human-generated instructions. Experiments show that all three components of this approach---speaker-driven data augmentation, pragmatic reasoning and panoramic action space---dramatically improve the performance of a baseline instruction follower, more than doubling the success rate over the best existing approach on a standard benchmark.", "full_text": "Speaker-Follower Models for\n\nVision-and-Language Navigation\n\nDaniel Fried\u2217 1, Ronghang Hu\u22171, Volkan Cirik\u22172, Anna Rohrbach1, Jacob Andreas1,\n\nLouis-Philippe Morency2, Taylor Berg-Kirkpatrick2, Kate Saenko3,\n\nDan Klein\u2217\u22171, Trevor Darrell\u2217\u22171\n\n1University of California, Berkeley\n\n2Carnegie Mellon University\n\n3Boston University\n\nAbstract\n\nNavigation guided by natural language instructions presents a challenging rea-\nsoning problem for instruction followers. Natural language instructions typically\nidentify only a few high-level decisions and landmarks rather than complete low-\nlevel motor behaviors; much of the missing information must be inferred based\non perceptual context. In machine learning settings, this is doubly challenging: it\nis dif\ufb01cult to collect enough annotated data to enable learning of this reasoning\nprocess from scratch, and also dif\ufb01cult to implement the reasoning process using\ngeneric sequence models. Here we describe an approach to vision-and-language\nnavigation that addresses both these issues with an embedded speaker model. We\nuse this speaker model to (1) synthesize new instructions for data augmentation and\nto (2) implement pragmatic reasoning, which evaluates how well candidate action\nsequences explain an instruction. Both steps are supported by a panoramic action\nspace that re\ufb02ects the granularity of human-generated instructions. 
Experiments\nshow that all three components of this approach\u2014speaker-driven data augmenta-\ntion, pragmatic reasoning and panoramic action space\u2014dramatically improve the\nperformance of a baseline instruction follower, more than doubling the success rate\nover the best existing approach on a standard benchmark.\n\nIntroduction\n\n1\nIn the vision-and-language navigation task [1], an agent is placed in a realistic environment, and\nprovided a natural language instruction such as \u201cGo down the stairs, go slight left at the bottom\nand go through door, take an immediate left and enter the bathroom, stop just inside in front of the\nsink\u201d. The agent must follow this instruction to navigate from its starting location to a goal location,\nas shown in Figure 1 (left). To accomplish this task the agent must learn to relate the language\ninstructions to the visual environment. Moreover, it should be able to carry out new instructions in\nunseen environments.\nEven simple navigation tasks require nontrivial reasoning: the agent must resolve ambiguous refer-\nences to landmarks, perform a counterfactual evaluation of alternative routes, and identify incom-\npletely speci\ufb01ed destinations. While a number of approaches [33, 34, 55] have been proposed for\nthe various navigation benchmarks, they generally employ a single model that learns to map directly\nfrom instructions to actions from a limited corpus of annotated trajectories.\nIn this paper we treat the vision-and-language navigation task as a trajectory search problem, where\nthe agent needs to \ufb01nd (based on the instruction) the best trajectory in the environment to navigate\nfrom the start location to the goal location. Our model involves an instruction interpretation (follower)\nmodule, mapping instructions to action sequences; and an instruction generation (speaker) module,\nmapping action sequences to instructions (Figure 1), both implemented with standard sequence-to-\nsequence architectures. The speaker learns to give textual instructions for visual routes, while the\n\n\u2217,\u2217\u2217: Authors contributed equally\n\n32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montr\u00e9al, Canada.\n\n\fFigure 1: The task of vision-and-language navigation is to perform a sequence of actions (navigate\nthrough the environment) according to human natural language instructions. Our approach consists\nof an instruction follower model (left) and a speaker model (right).\n\nfollower learns to follow routes (predict navigation actions) for provided textual instructions. Though\nexplicit probabilistic reasoning combining speaker and follower agents is a staple of the literature on\ncomputational pragmatics [14], application of these models has largely been limited to extremely\nsimple decision-making tasks like single forced choices.\nWe incorporate the speaker both at training time and at test time, where it works together with the\nlearned instruction follower model to solve the navigation task (see Figure 2 for an overview of\nour approach). At training time, we perform speaker-driven data augmentation where the speaker\nhelps the follower by synthesizing additional route-instruction pairs to expand the limited training\ndata. At test time, the follower improves its chances of success by looking ahead at possible future\nroutes and pragmatically choosing the best route by scoring them according to the probability that\nthe speaker would generate the correct instruction for each route. 
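Throughout, the follower defines a distribution PF(r | d) over routes given an instruction, and the speaker a distribution PS(d | r) over instructions given a route. As a minimal sketch of the interface this implies (the Protocol names and method signatures below are our own illustration, not the released code):

```python
from typing import Protocol, Sequence

class Follower(Protocol):
    def logprob(self, route: Sequence[int], instruction: Sequence[str]) -> float:
        """log P_F(r | d): score for taking route r given instruction d."""
        ...

class Speaker(Protocol):
    def logprob(self, instruction: Sequence[str], route: Sequence[int]) -> float:
        """log P_S(d | r): score for describing route r with instruction d."""
        ...
```

The later sections (data augmentation in Sec. 3.1, rescoring in Sec. 3.2) only need these two scoring directions plus a decoding procedure for each model.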
This procedure, using the external speaker model, improves upon planning using only the follower model. We construct both the speaker and the follower on top of a panoramic action space that efficiently encodes high-level behavior, moving directly between adjacent locations rather than making low-level visuomotor decisions like the number of degrees to rotate (see Figure 3).

To summarize our contributions: We propose a novel approach to vision-and-language navigation incorporating a visually grounded speaker-follower model, and introduce a panoramic representation to efficiently represent high-level actions. We evaluate this speaker-follower model on the Room-to-Room (R2R) dataset [1], and show that each component in our model improves performance at the instruction following task. Our model obtains a final success rate of 53.5% on the unseen test environment, an absolute improvement of 30% over existing approaches. Our code and data are available at http://ronghanghu.com/speaker_follower.

2 Related Work

Natural language instruction following. Systems that learn to carry out natural language instructions in an interactive environment include approaches based on intermediate structured and executable representations of language [51, 9, 4, 29, 20] and approaches that map directly from language and world state observations to actions [7, 2, 33, 34]. The embodied vision-and-language navigation task studied in this paper [1] differs from past situated instruction following tasks by introducing rich visual contexts. Recent work [55] has applied techniques from model-based and model-free reinforcement learning [56] to the vision-and-language navigation problem. Specifically, an environment model is used to predict a representation of the state resulting from an action, and planning is performed with respect to this environment model. Our work differs from this prior work by reasoning not just about state transitions, but also about the relationship between states and the language that describes them—specifically, using an external speaker model to predict how well a given sequence of states explains an instruction.

Pragmatic language understanding. A long line of work in linguistics, natural language processing, and cognitive science has studied pragmatics: how linguistic meaning is affected by context and communicative goals [18]. Our work here makes use of the Rational Speech Acts framework [14, 17], which models the interaction between speakers and listeners as a process where each agent reasons probabilistically about the other to maximize the chances of successful communicative outcomes. This framework has been applied to model human use of language [15], and to improve the performance of systems that generate [3, 31, 53, 12] and interpret [58, 30, 52] referential language. Similar modeling tools have recently been applied to generation and interpretation of language about sequential decision-making [16]. The present work makes use of a pragmatic instruction follower in the same spirit. Here, however, we integrate this with a more complex visual pipeline and use it not only at inference time but also at training time to improve the quality of a base listener model.

[Figure 1 detail. Route r shown left-to-right, top-to-bottom. Follower: PF(r | d), mapping the human instruction d ("Go down the stairs, go slight left at the bottom and go through door, take an immediate left and enter the bathroom, stop just inside in front of the sink.") to a route. Speaker: PS(d | r), mapping the route to a generated instruction d ("Walk down the stairs. Turn left at the bottom of the stairs. Walk through the doorway and wait in the bathroom.").]

Semi- and self-supervision. The semi-supervised approach we use is related at a high level to model bootstrapping techniques such as self-training [43, 32] and co-training [6]. Recent work has used monolingual corpora to improve the performance of neural machine translation models structurally similar to the sequence-to-sequence models we use [19, 21, 44]. In a grounded navigation context, [22] use a word-prediction task as training-time supervision for a reinforcement learning agent. The approach most relevant to our work is the SEQ4 model [27], which applies semi-supervision to a navigation task by sampling new environments and maps (in synthetic domains without vision), and training an autoencoder to reconstruct routes, using language as a latent variable. The approach used here is much simpler, as it does not require constructing a differentiable surrogate to the decoding objective.

Semi-supervised data augmentation has also been widely used in computer vision tasks. In Data Distillation [40], additional annotation for object and key-point detection is obtained by ensembling and refining a pretrained model's predictions on unannotated images. Self-play in adversarial groups of agents is common in multi-agent reinforcement learning [45, 47]. In actor-critic approaches [49, 50] in reinforcement learning, a critic learns the value of a state and is used to provide supervision to the actor's policy during training. In this work, we use a speaker to synthesize additional navigation instructions on unlabeled new routes, and use this synthetic data from the speaker to train the follower.

Grounding language in vision. Existing work in visual grounding [39, 31, 26, 41, 36] has addressed the problem of passively perceiving a static image and mapping a referential expression to a bounding box [39, 31, 26] or a segmentation mask [25, 28, 57], exploring various techniques including proposal generation [10] and relationship handling [54, 36, 24, 11]. In our work, the vision-and-language navigation task requires the agent to actively interact with the environment to find a path to the goal following the natural language instruction. This can be seen as a grounding problem in linguistics, where the language instruction is grounded into a trajectory in the environment, but one that requires more reasoning and planning skills than referential expression grounding.

3 Instruction Following with Speaker-Follower Models

To address the task of following natural language instructions, we rely on two models: an instruction-follower model of the kind considered in previous work, and a speaker model—a learned instruction generator that models how humans describe routes in navigation tasks.

Specifically, we base our follower model on the sequence-to-sequence model of [1], computing a distribution PF(r | d) over routes r (state and action sequences) given route descriptions d. The follower encodes the sequence of words in the route description with an LSTM [23], and outputs route actions sequentially, using an attention mechanism [5] over the description.
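To make the follower half of this pair concrete, here is a minimal PyTorch sketch of an instruction encoder with per-step attention over the encoding. It is an illustration under assumptions, not the released implementation: all sizes and names are hypothetical, and for brevity the decoder input is the attended language context alone, whereas the actual follower also consumes visual features at each step.

```python
import torch
import torch.nn as nn

class FollowerSketch(nn.Module):
    """Minimal seq2seq follower: instruction -> per-step action logits."""
    def __init__(self, vocab_size=1000, n_actions=6, d_emb=256, d_hid=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_emb)
        self.encoder = nn.LSTM(d_emb, d_hid, batch_first=True)
        self.decoder = nn.LSTMCell(d_hid, d_hid)   # input: attended context
        self.act_out = nn.Linear(d_hid, n_actions)

    def forward(self, instr, n_steps):
        ctx, (h, c) = self.encoder(self.embed(instr))        # ctx: (B, T, H)
        h, c = h.squeeze(0), c.squeeze(0)
        logits = []
        for _ in range(n_steps):
            # Attend over the instruction encoding with the decoder state.
            scores = torch.bmm(ctx, h.unsqueeze(2)).squeeze(2)        # (B, T)
            alpha = torch.softmax(scores, dim=1)
            attended = torch.bmm(alpha.unsqueeze(1), ctx).squeeze(1)  # (B, H)
            h, c = self.decoder(attended, (h, c))
            logits.append(self.act_out(h))         # unnormalized action scores
        return torch.stack(logits, dim=1)          # (B, n_steps, n_actions)

follower = FollowerSketch()
instr = torch.randint(0, 1000, (2, 12))   # a toy batch of two instructions
print(follower(instr, n_steps=5).shape)   # torch.Size([2, 5, 6])
```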
Our speaker model is symmetric, producing a distribution PS(d | r) by encoding the sequence of visual observations and actions in the route using an LSTM, and then outputting an instruction word-by-word with an LSTM decoder using attention over the encoded input route (Figure 1).

We combine these two base models into a speaker-follower system, where the speaker supports the follower both at training time and at test time. An overview of our approach is presented in Figure 2. First, we train a speaker model on the available ground-truth navigation routes and instructions (Figure 2 (a)). Before training the follower, the speaker produces synthetic navigation instructions for novel sampled routes in the training environments, which are then used as additional supervision for the follower, as described in Sec. 3.1 (Figure 2 (b)). At follower test time, the follower generates possible routes as interpretations of a given instruction and starting context, and the speaker pragmatically ranks these, choosing one that provides a good explanation of the instruction in context (Sec. 3.2 and Figure 2 (c)). Both follower and speaker are supported by the panoramic action space in Sec. 3.3 that reflects the high-level granularity of the navigational instructions (Figure 3).

Figure 2: Our approach combines an instruction follower model and a speaker model. (a) The speaker model is trained on the ground-truth routes with human-generated descriptions; (b) it provides the follower with additional synthetic instruction data to bootstrap training; (c) it also helps the follower interpret ambiguous instructions and choose the best route during inference. See Sec. 3 for details. [Figure panels include example instructions such as "Continue forward until you can climb the three steps to your right ..." (human-generated) and "Walk past the dining room table and chairs and ..." (speaker-generated).]

3.1 Speaker-Driven Data Augmentation

The training data only covers a limited number of navigation instruction and route pairs, D = (d1, r1) . . . (dN, rN). To allow the agent to generalize better to new routes, we use the speaker to generate synthetic instructions on sampled new routes in the training environments. To create a synthetic training set, we sample a collection of M routes r̂1, . . . , r̂M through the training environments, using the same shortest-path approach used to generate the routes in the original training set [1]. We then generate a human-like textual instruction d̂k for each route r̂k by performing greedy prediction in the speaker model to approximate d̂k = arg maxd PS(d | r̂k).

These M synthetic navigation routes and instructions S = (d̂1, r̂1), . . . , (d̂M, r̂M) are combined with the original training data D into an augmented training set S ∪ D (Figure 2(b)). During training, the follower is first trained on this augmented training set, and then further fine-tuned on the original training set D. This speaker-driven data augmentation aims to overcome the data scarcity issue, allowing the follower to learn how to navigate on new routes by following the synthetic instructions.
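The augmentation loop itself is short: sample a route, greedily decode an instruction from the speaker, and add the pair to the training set. A minimal runnable sketch follows; `sample_route`, `speaker_step`, and the toy vocabulary are stand-ins for the real shortest-path sampler and speaker model:

```python
import random

def greedy_decode(speaker_step, route, max_len=30):
    """Greedy approximation of d_hat = argmax_d P_S(d | route): repeatedly
    pick the most likely next word until </s> is emitted or max_len is hit."""
    words = ["<s>"]
    while len(words) < max_len and words[-1] != "</s>":
        words.append(speaker_step(route, words))
    return [w for w in words if w not in ("<s>", "</s>")]

def make_augmented_set(train_pairs, sample_route, speaker_step, m):
    """Synthesize m (instruction, route) pairs and join them with the human
    data; training runs on the union, then fine-tunes on train_pairs alone."""
    synthetic = [(greedy_decode(speaker_step, r), r)
                 for r in (sample_route() for _ in range(m))]
    return synthetic + list(train_pairs)

# Toy stand-ins so the sketch runs end-to-end:
toy_vocab = ["walk", "left", "right", "stop", "</s>"]
sample_route = lambda: [random.randrange(4) for _ in range(6)]
speaker_step = lambda route, words: random.choice(toy_vocab)
print(len(make_augmented_set([], sample_route, speaker_step, m=10)))  # 10
```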
3.2 Speaker-Driven Route Selection

We use the base speaker (PS) and follower (PF) models described above to define a pragmatic follower model. Drawing on the Rational Speech Acts framework [14, 17], a pragmatic follower model should choose a route r through the environment that has high probability of having caused the speaker model to produce the given description d: arg maxr PS(d | r) (corresponding to a rational Bayesian follower with a uniform prior over routes). Such a follower chooses a route that provides a good explanation of the observed description, allowing counterfactual reasoning about instructions, or using global context to correct errors in the follower's path; we call this pragmatic inference.

Given the sequence-to-sequence models that we use, exactly solving the maximization problem above is infeasible, and may not even be desirable, as these models are trained discriminatively and may be poorly calibrated for inputs dissimilar to those seen during training. Following previous work on pragmatic language generation and interpretation [46, 3, 35, 16], we use a rescoring procedure: produce candidate route interpretations for a given instruction using the base follower model, and then rescore these routes using the base speaker model (Figure 2(c)).

Our pragmatic follower produces a route for a given instruction by obtaining K candidate paths from the base follower using a search procedure described below, then chooses the highest-scoring path under a combination of the follower and speaker model probabilities:

    arg max_{r ∈ R(d)} PS(d | r)^λ · PF(r | d)^(1−λ)    (1)

where λ is a hyper-parameter in the range [0, 1] which we tune on validation data to maximize the accuracy of the follower.1

1 In practice, we found best performance with values of λ close to 1, relying mostly on the score of the speaker to select routes. Using only the speaker score (which corresponds to the standard RSA pragmatic follower) did not substantially reduce performance compared to using a combination with the follower score, and both improved substantially upon using only the follower score (corresponding to the base follower).

Candidate route generation. To generate candidate routes from the base follower model, we perform a search procedure where candidate routes are produced incrementally, action-by-action, and scored using the probabilities given by PF. Standard beam search in sequence-to-sequence models (e.g. [48]) forces partial routes to compete based on the number of actions taken. We obtain better performance by instead using a state-factored search procedure, where partial output sequences compete at the level of states in the environment (each state consists of the agent's location and discretized heading), keeping only the highest-scoring path found so far to each state. At a high level, this search procedure resembles graph search with a closed list, but since action probabilities are non-stationary (they potentially depend on the entire sequence of actions taken in the route), it is only approximate, and so we allow re-expanding states if a higher-scoring route to that state is found.

At each point in the state-factored search, we store the highest-probability route (as scored by the follower model) found so far to each state. States contain the follower's discrete location and heading (the direction it is facing) in the environment, and whether the route has been completed (has had the STOP action predicted). The highest-scoring route that has not yet been expanded (had successors produced) is selected and expanded using each possible action from its state, producing routes to the neighboring states. For each of these routes r with final state s, if s has not yet been reached by the search, or if r is higher-scoring under the model than the current best path to s, r is stored as the best route to s. We continue the search procedure until K routes ending in distinct states have predicted the STOP action, or there are no remaining unexpanded routes. See Sec. B in the supplementary material for pseudocode.

Since route scores are products of conditional probabilities, route scores are non-increasing, and so this search procedure generates routes that do not pass through the same state twice—which we found to improve accuracy both for the base follower model and for the pragmatic rescoring procedure, since instructions typically describe acyclic routes.

We generate up to K = 40 candidate routes for each instruction using this procedure, and rescore them using Eq. 1. In addition to enabling pragmatic inference, this state-factored search procedure improves the performance of the follower model on its own (taking the candidate route with the highest score under the follower model), when compared to standard greedy search (see Sec. C and Figure C.2 of the supplementary material for details).
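In log space, the rescoring of Eq. 1 amounts to a weighted sum of the two model scores over the K candidates. A minimal sketch (the `speaker_logprob` callable and the candidate format are our own stand-ins):

```python
import math

def pragmatic_rescore(candidates, speaker_logprob, lam=0.95):
    """Eq. 1 in log space: argmax over candidate routes r of
    lam * log P_S(d | r) + (1 - lam) * log P_F(r | d)."""
    def score(cand):
        route, follower_lp = cand
        return lam * speaker_logprob(route) + (1.0 - lam) * follower_lp
    return max(candidates, key=score)[0]

# Toy demo: two candidate routes with their follower log-probabilities.
cands = [(["fwd", "left", "stop"], math.log(0.6)),
         (["fwd", "fwd", "stop"], math.log(0.4))]
speaker_lp = lambda r: math.log(0.9) if r[1] == "fwd" else math.log(0.1)
print(pragmatic_rescore(cands, speaker_lp))   # ['fwd', 'fwd', 'stop']
```

With λ = 0.95 (the value used in our experiments, Sec. 4.1), the choice is dominated by the speaker score, consistent with the observation in footnote 1 that values of λ close to 1 work best.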
3.3 Panoramic Action Space

Figure 3: Compared with the low-level visuomotor space, our panoramic action space (Sec. 3.3) allows the agent to have a complete perception of the scene, and to directly perform high-level actions.

The sequence-to-sequence agent in [1] uses low-level visuomotor control (such as turning left or right by 30 degrees), and only perceives frontal visual sensory input. Such fine-grained visuomotor control and restricted visual signal introduce challenges for instruction following. For example, in Figure 3, to "turn left and go towards the sofa", the agent needs to perform a series of turning actions until it sees a sofa in the center of its view, and then perform a "go forward" action. This requires strong planning skills and memorization of visual inputs. While a possible way to address this challenge is to learn a hierarchical policy such as in [13], in our work we directly allow the agent to reason about high-level actions, using a panoramic action space with a panoramic representation, converted with a built-in mapping from low-level visuomotor control.

As shown in Figure 3, in our panoramic representation, the agent first "looks around" and perceives a 360-degree panoramic view of its surrounding scene from its current location, which is discretized into 36 view angles (12 headings × 3 elevations with 30 degree intervals, in our implementation). Each view angle i is represented by an encoding vector vi. At each location, the agent can only move towards a few navigable directions (provided by the navigation environment), as other directions can be physically obstructed (e.g. blocked by a table). In our action space the agent thus only needs to make high-level decisions as to which navigable direction to go to next, with each navigable direction j represented by an encoding vector uj. The encoding vectors vi and uj of each view angle and navigable direction are obtained by concatenating an appearance feature (a ConvNet feature extracted from the local image patch around that view angle or direction) and a 4-dimensional orientation feature [sin ψ; cos ψ; sin θ; cos θ], where ψ and θ are the heading and elevation angles, respectively. We also introduce a STOP action, encoded by the all-zeros vector u0 = 0. The agent can take this STOP action when it decides it has reached the goal location (to end the episode).

To make a decision on which direction to go, the agent first performs one-hop visual attention over all of the surrounding view angles, based on its memory vector ht−1. The attention weight αt,i of each view angle i is computed as at,i = (W1 ht−1)^T W2 vt,i and αt,i = exp(at,i) / Σi' exp(at,i'). The attended feature representation vt,att = Σi αt,i vt,i of the panoramic scene is then used as the visual-sensory input to the sequence-to-sequence model (replacing the 60-degree frontal appearance vector in [1]) to update the agent's memory. Then, a bilinear dot product is used to obtain the probability pj of each navigable direction j: yj = (W3 ht)^T W4 uj and pj = exp(yj) / Σj' exp(yj'). The agent then chooses a navigable direction uj (with probability pj) to go to the adjacent location along that direction (or u0 to stop and end the episode). We use a built-in mapping that seamlessly translates our panoramic perception and action into visuomotor control such as turning and moving.
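The decision step above can be written in a few lines. The sketch below follows the stated equations; the feature size (2048-dimensional ResNet appearance features plus the 4-dimensional orientation feature), the random toy inputs, and the exact memory wiring are assumptions for illustration:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
H, Dv, K, J = 512, 2052, 256, 5   # hidden, view feature, attention, directions
W1, W2 = torch.randn(K, H), torch.randn(K, Dv)
W3, W4 = torch.randn(K, H), torch.randn(K, Dv)
cell = nn.LSTMCell(Dv, H)

def panoramic_step(h, c, views, dirs):
    """One decision step over the panoramic representation.
    views: (36, Dv) view-angle encodings v_i (12 headings x 3 elevations);
    dirs: (J+1, Dv) navigable-direction encodings u_j, row 0 = STOP (zeros)."""
    a = (W1 @ h[0]) @ (W2 @ views.T)         # a_{t,i} = (W1 h_{t-1})^T W2 v_i
    alpha = torch.softmax(a, dim=0)          # attention over 36 view angles
    v_att = alpha @ views                    # v_{t,att} = sum_i alpha_i v_i
    h, c = cell(v_att.unsqueeze(0), (h, c))  # update the agent's memory
    y = (W3 @ h[0]) @ (W4 @ dirs.T)          # y_j = (W3 h_t)^T W4 u_j
    p = torch.softmax(y, dim=0)              # distribution over directions
    return h, c, p

h, c = torch.zeros(1, H), torch.zeros(1, H)
views = torch.randn(36, Dv)                  # ConvNet feature + orientation
dirs = torch.cat([torch.zeros(1, Dv), torch.randn(J, Dv)])  # u_0 = STOP
h, c, p = panoramic_step(h, c, views, dirs)
print(p.shape)                               # torch.Size([6])
```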
4 Experiments

4.1 Experimental Setup

Dataset. We use the Room-to-Room (R2R) vision-and-language navigation dataset [1] for our experimental evaluation. In this task, the agent starts at a certain location in an environment and is provided with a human-generated navigation instruction that describes a path to a goal location. The agent needs to follow the instruction by taking multiple discrete actions (e.g. turning, moving) to navigate to the goal location, and executing a "stop" action to end the episode. Note that, differently from some robotic navigation settings [37], here the agent is not provided with a goal image, but must identify from the textual description and the environment whether it has reached the goal.

The dataset consists of 7,189 paths sampled from the Matterport3D [8] navigation graphs, where each path consists of 5 to 7 discrete viewpoints and the average physical path length is 10m. Each path has three instructions written by humans, giving 21.5k instructions in total, with an average of 29 words per instruction. The dataset is split into training, validation, and test sets. The validation set is split into two parts: seen, where routes are sampled from environments seen during training, and unseen, with environments that are not seen during training. All the test set routes belong to new environments unseen in the training and validation sets.

Evaluation metrics. Following previous work on the R2R task, our primary evaluation metrics are navigation error (NE), measuring the average distance between the end-location predicted by the follower agent and the true route's end-location, and success rate (SR), the percentage of predicted end-locations within 3m of the true location. As in previous work, we also report the oracle success rate (OSR), measuring the success rate at the closest point to the goal that the follower has visited along the route, allowing the agent to overshoot the goal without being penalized.

Implementation details. Following [1] and [55], we produce visual feature vectors v using the output from the final convolutional layer of a ResNet [21] trained on the ImageNet [42] classification dataset. These visual features are fixed, and the ResNet is not updated during training. To better generalize to novel words in the vocabulary, we also experiment with using GloVe embeddings [38] to initialize the word-embedding vectors in the speaker and follower.

In the baseline without using synthetic instructions, we train follower and speaker models using the human-generated instructions for routes present in the training set. The training procedure for the follower model follows [1] by training with student-forcing (sampling actions from the model during training, and supervising using a shortest-path action to reach the goal state). We use the training split in the R2R dataset to train our speaker model, using standard maximum likelihood training with a cross-entropy loss.

In speaker-driven data augmentation (Sec. 3.1), we augment the data used to train the follower model by sampling 178,000 routes from the training environments. Instructions for these routes are generated using greedy inference in the speaker model (which is trained only on human-produced instructions). The follower model is trained using student-forcing on this augmented data for 50,000 iterations, and then fine-tuned on the original human-produced data for 20,000 iterations. For all experiments using pragmatic inference, we use a speaker weight of λ = 0.95, which we found to produce the best results on both the seen and unseen validation environments.

#  Data          Pragmatic  Panoramic | Validation-Seen       | Validation-Unseen
   Augmentation  Inference  Space     | NE ↓   SR ↑   OSR ↑   | NE ↓   SR ↑   OSR ↑
1                                     | 6.08   40.3   51.6    | 7.90   19.9   26.1
2  ✓                                  | 5.05   46.8   59.9    | 7.30   24.6   33.2
3                ✓                    | 5.23   51.5   60.8    | 6.62   34.5   43.1
4                           ✓         | 4.86   52.1   63.3    | 7.07   31.2   41.3
5  ✓             ✓                    | 4.28   57.2   63.9    | 5.75   39.3   47.0
6  ✓                        ✓         | 3.36   66.4   73.8    | 6.62   35.5   45.0
7                ✓          ✓         | 3.88   63.3   71.0    | 5.24   49.5   63.4
8  ✓             ✓          ✓         | 3.08   70.1   78.3    | 4.83   54.6   65.2

Table 1: Ablations showing the effect of each component in our model. Rows 2-4 show the effects of adding a single component to the baseline system (Row 1); Rows 5-7 show the effects of removing a single component from the full system (Row 8). NE is navigation error (in meters); lower is better. SR and OSR are success rate and oracle success rate (%); higher is better. See Sec. 4.2.1 for details.

4.2 Results and Analysis

We first examine the contribution from each of our model's components on the validation splits.
Then,\nwe compare the performance of our model with previous work on the unseen test split.\n\n4.2.1 Component Contributions\nWe begin with a baseline (Row 1 of Table 1), which uses only a follower model with a non-panoramic\naction space at both training and test time, which is equivalent to the student-forcing model in [1].2\nSpeaker-driven data augmentation. We \ufb01rst introduce the speaker at training time for data aug-\nmentation (Sec. 3.1). Comparing Row 1 (the baseline follower model trained only with the original\ntraining data) against Row 2 (training this model on augmented data) in Table 1, we see that adding\nthe augmented data improves success rate (SR) from 40.3% to 46.8% on validation seen and from\n19.9% to 24.6% on validation unseen, respectively. This higher relative gain on unseen environments\nshows that the follower can learn from the speaker-annotated routes to better generalize to new scenes.\nNote that given the noise in our augmented data, we \ufb01ne-tune our model on the original training data\nat the end of training as mentioned in Sec. 3.1. We \ufb01nd that increasing the amount of augmented data\nis helpful in general. For example, when using 25% of the augmented data, the success rate improves\nto 22.8% on validation unseen, while with all the augmented data the success rate reaches 24.6% on\nvalidation unseen, which is a good balance between performance and computation overhead.\nPragmatic inference. We then incorporate the speaker at test time for pragmatic inference (Sec.\n3.2), using the speaker to rescore the route candidates produced by the follower. Adding this\ntechnique brings a further improvement in success rate on both environments (compare Row 2, the\ndata-augmented follower without pragmatic inference, to Row 5, adding pragmatic inference). This\nshows that when reasoning about navigational directions, large improvements in accuracy can be\nobtained by scoring how well the route explains the direction using a speaker model. Importantly,\nwhen using only the follower model to score candidates produced in search, the success rate is 49.0%\non val-seen and 30.5% on val-unseen, showing the importance of using the speaker model to choose\namong candidates (which increases success rates to 57.2% and 39.3%, respectively).\nPanoramic action space. Finally, we replace the visuomotor control space with the panoramic\nrepresentation (Sec. 3.3). Adding this to the previous system (compare Row 5 and Row 8) shows\nthat the new representation leads to a substantially higher success rate, achieving 70.1% and 54.6%\nsuccess rate on validation seen and validation unseen, respectively. This suggests that directly acting\nin a higher-level representation space makes it easier to accurately carry out instructions. 
Our final model (Row 8) has over twice the success rate of the baseline follower in the unseen environments.

2 Note that our results for this baseline are slightly higher on val-seen and slightly lower on val-unseen than those reported by [1], due to differences in implementation details and hyper-parameter choices.

Method                   | Validation-Seen      | Validation-Unseen    | Test (unseen)
                         | NE ↓  SR ↑   OSR ↑   | NE ↓  SR ↑   OSR ↑   | NE ↓  SR ↑   OSR ↑   TL ↓
Random                   | 9.45  15.9   21.4    | 9.23  16.3   22.0    | 9.77  13.2   18.3    9.89
Student-forcing [1]      | 6.01  38.6   52.9    | 7.81  21.8   28.4    | 7.85  20.4   26.6    8.13
RPA [55]                 | 5.56  42.9   52.6    | 7.65  24.6   31.8    | 7.53  25.3   32.5    9.15
ours                     | 3.08  70.1   78.3    | 4.83  54.6   65.2    | 4.87  53.5   63.9    11.63
ours (challenge part.)*  | –     –      –       | –     –      –       | 4.87  53.5   96.0    1257.38
Human                    | –     –      –       | –     –      –       | 1.61  86.4   90.2    11.90

Table 2: Performance comparison of our method to previous work. NE is navigation error (in meters); lower is better. SR and OSR are success rate and oracle success rate (%), respectively (higher is better). Trajectory length (TL) on the test set is reported for completeness. *: When submitting to the Vision-and-Language Navigation Challenge, we modified our search procedure to maintain physical plausibility and to comply with the challenge guidelines. The resulting trajectories have a higher oracle success rate while being very long. See Sec. E in the supplementary material for details.
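For reference, the SR and OSR columns in Tables 1 and 2 are thresholded distance checks. A minimal sketch follows (the `Episode` container is hypothetical; the benchmark's own evaluation script is authoritative):

```python
from dataclasses import dataclass
from math import dist   # Euclidean distance, Python 3.8+

@dataclass
class Episode:
    path: list   # visited (x, y, z) viewpoints; path[-1] is the end point
    goal: tuple

def success_rates(episodes, threshold_m=3.0):
    """SR: predicted end point within 3 m of the goal. OSR: closest
    visited point within 3 m (overshooting the goal is not penalized)."""
    n = len(episodes)
    sr = sum(dist(e.path[-1], e.goal) <= threshold_m for e in episodes)
    osr = sum(min(dist(p, e.goal) for p in e.path) <= threshold_m
              for e in episodes)
    return 100.0 * sr / n, 100.0 * osr / n

ep = Episode(path=[(0, 0, 0), (2, 0, 0), (6, 0, 0)], goal=(2.5, 0, 0))
print(success_rates([ep]))   # (0.0, 100.0): overshot, but passed within 3 m
```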
Importance of all components. Above we have shown the gain from each component, added incrementally. Moreover, comparing Rows 2-4 (adding each component independently to the base model) to the baseline (Row 1) shows that each component in isolation provides large improvements in success rates and decreases the navigation error. Ablating each component (Rows 5-7) from the full model (Row 8) shows that each of them is important for the final performance.

Qualitative results. Here we provide qualitative examples further explaining how our model improves over the baseline (more qualitative results are in the supplementary material). The intuition behind the speaker model is that it should help the agent more accurately interpret instructions, specifically in ambiguous situations. Figure 4 shows how the introduction of a speaker model helps the follower with pragmatic inference.

4.2.2 Comparison to Prior Work

We compare the performance of our final model to previous approaches on the R2R held-out splits, including the test split, which contains 18 new environments that do not overlap with any training or validation splits and are only seen once at test time.

The results are shown in Table 2. In the table, "Random" is randomly picking a direction and going towards that direction for 5 steps. "Student-forcing" is the best performing method in [1], using exploration during training of the sequence-to-sequence follower model. "RPA" [55] is a combination of model-based and model-free reinforcement learning (see also Sec. 2 for details). "ours" shows our performance using the route selected by our pragmatic inference procedure, while "ours (challenge participation)" uses a modified inference procedure for submission to the Vision-and-Language Navigation Challenge (see Sec. E in the supplementary material for details). Prior work has reported higher performance on the seen rather than unseen environments [1, 55], illustrating the issue of generalizing to new environments. Our method more than doubles the success rate of the state-of-the-art RPA approach, and on the test set achieves a final success rate of 53.5%. This represents a large reduction in the gap between machine and human performance on this task.

5 Conclusions

The language-and-vision navigation task presents a pair of challenging reasoning problems: in language, because agents must interpret instructions in a changing environmental context; and in vision, because of the tight coupling between local perception and long-term decision-making. The comparatively poor performance of the baseline sequence-to-sequence model for instruction following suggests that more powerful modeling tools are needed to meet these challenges. In this work, we have introduced such a tool, showing that a follower model for vision-and-language navigation is substantially improved by carefully structuring the action space and integrating an explicit model of a speaker that predicts how navigation routes are described. We believe that these results point toward further opportunities for improvements in instruction following by modeling the global structure of navigation behaviors and the pragmatic contexts in which they occur.

Figure 4: Navigation examples on unseen environments with and without pragmatic inference from the speaker model (best visualized in color), for the instruction "Go through the door on the right and continue straight. Stop in the next room in front of the bed." (a) The follower without pragmatic inference misinterpreted the instruction and went through a wrong door into a room with no bed; it then stopped at a table (which resembles a bed). (b) With the help of a speaker for pragmatic inference, the follower selected the correct route that enters the right door and stopped at the bed. [Figure panels: four navigation steps per trajectory with a red arrow marking the direction to go next, and a top-down overview of the two trajectories (orange: without pragmatic inference; green: with pragmatic inference).]

Acknowledgements. This work was partially supported by US DoD and DARPA XAI and D3M, NSF awards IIS-1833355, Oculus VR, and the Berkeley Artificial Intelligence Research (BAIR) Lab. DF was supported by a Huawei / Berkeley AI fellowship. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the sponsors, and no official endorsement should be inferred.

References

[1] P. Anderson, Q. Wu, D. Teney, J. Bruce, M. Johnson, N. Sünderhauf, I. Reid, S. Gould, and A. v. d. Hengel. Vision-and-language navigation: Interpreting visually-grounded navigation instructions in real environments. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.

[2] J. Andreas and D. Klein. Alignment-based compositional semantics for instruction following. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), 2015.

[3] J. Andreas and D. Klein. Reasoning about pragmatics with neural listeners and speakers.
In Proceedings\n\nof the Conference on Empirical Methods in Natural Language Processing (EMNLP), 2016.\n\n[4] Y. Artzi and L. Zettlemoyer. Weakly supervised learning of semantic parsers for mapping instructions to\n\nactions. Transactions of the Association for Computational Linguistics, 1(1):49\u201362, 2013.\n\n[5] D. Bahdanau, K. Cho, and Y. Bengio. Neural machine translation by jointly learning to align and translate.\n\nIn Proceedings of the International Conference on Learning Representations (ICLR), 2015.\n\n[6] A. Blum and T. Mitchell. Combining labeled and unlabeled data with co-training. In Proceedings of the\n\neleventh annual conference on Computational learning theory, pages 92\u2013100. ACM, 1998.\n\n[7] S. Branavan, H. Chen, L. S. Zettlemoyer, and R. Barzilay. Reinforcement learning for mapping instructions\nto actions. In Proceedings of the Annual Meeting of the Association for Computational Linguistics (ACL),\npages 82\u201390. Association for Computational Linguistics, 2009.\n\n[8] A. Chang, A. Dai, T. Funkhouser, M. Halber, M. Niessner, M. Savva, S. Song, A. Zeng, and Y. Zhang.\nMatterport3d: Learning from rgb-d data in indoor environments. International Conference on 3D Vision\n(3DV), 2017.\n\n[9] D. L. Chen. Fast online lexicon learning for grounded language acquisition. In Proceedings of the 50th\nAnnual Meeting of the Association for Computational Linguistics: Long Papers - Volume 1, ACL \u201912,\npages 430\u2013439, Stroudsburg, PA, USA, 2012. Association for Computational Linguistics.\n\n[10] K. Chen, R. Kovvuri, and R. Nevatia. Query-guided regression network with context policy for phrase\n\ngrounding. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), 2017.\n\n[11] V. Cirik, T. Berg-Kirkpatrick, and L.-P. Morency. Using syntax to ground referring expressions in natural\n\nimages. In 32nd AAAI Conference on Arti\ufb01cial Intelligence (AAAI-18), 2018.\n\n[12] R. Cohn-Gordon, N. Goodman, and C. Potts. Pragmatically informative image captioning with character-\nlevel reference. In Proceedings of the Conference of the North American Chapter of the Association for\nComputational Linguistics (NAACL), 2018.\n\n[13] A. Das, S. Datta, G. Gkioxari, S. Lee, D. Parikh, and D. Batra. Embodied question answering.\n\nProceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.\n\nIn\n\n[14] M. C. Frank and N. D. Goodman. Predicting pragmatic reasoning in language games. Science,\n\n336(6084):998\u2013998, 2012.\n\n[15] M. C. Frank, N. D. Goodman, P. Lai, and J. B. Tenenbaum. Informative communication in word production\n\nand word learning. In Proceedings of the Annual Conference of the Cognitive Science Society, 2009.\n\n[16] D. Fried, J. Andreas, and D. Klein. Uni\ufb01ed pragmatic models for generating and following instructions.\nIn Proceedings of the Conference of the North American Chapter of the Association for Computational\nLinguistics (NAACL), 2018.\n\n[17] N. D. Goodman and A. Stuhlm\u00fcller. Knowledge and implicature: Modeling language understanding as\n\nsocial cognition. Topics in cognitive science, 5(1):173\u2013184, 2013.\n\n[18] H. P. Grice. Logic and conversation. In P. Cole and J. L. Morgan, editors, Syntax and Semantics: Vol. 3:\n\nSpeech Acts, pages 41\u201358. Academic Press, San Diego, CA, 1975.\n\n[19] C. Gulcehre, O. Firat, K. Xu, K. Cho, L. Barrault, H.-C. Lin, F. Bougares, H. Schwenk, and Y. Bengio. On\n\nusing monolingual corpora in neural machine translation. 
arXiv preprint arXiv:1503.03535, 2015.\n\n[20] K. Guu, P. Pasupat, E. Z. Liu, and P. Liang. From language to programs: Bridging reinforcement\nlearning and maximum marginal likelihood. In Proceedings of the Annual Meeting of the Association for\nComputational Linguistics (ACL), 2017.\n\n[21] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Proceedings of the\n\nIEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 770\u2013778, 2016.\n\n[22] K. M. Hermann, F. Hill, S. Green, F. Wang, R. Faulkner, H. Soyer, D. Szepesvari, W. M. Czarnecki,\nM. Jaderberg, D. Teplyashin, M. Wainwright, C. Apps, D. Hassabis, and P. Blunsom. Grounded language\nlearning in a simulated 3d world. CoRR, abs/1706.06551, 2017.\n\n[23] S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural computation, 9(8):1735\u20131780, 1997.\n\n10\n\n\f[24] R. Hu, M. Rohrbach, J. Andreas, T. Darrell, and K. Saenko. Modeling relationships in referential\nexpressions with compositional modular networks. In Proceedings of the IEEE Conference on Computer\nVision and Pattern Recognition (CVPR), 2017.\n\n[25] R. Hu, M. Rohrbach, and T. Darrell. Segmentation from natural language expressions. In Proceedings of\n\nthe European Conference on Computer Vision (ECCV), 2016.\n\n[26] R. Hu, H. Xu, M. Rohrbach, J. Feng, K. Saenko, and T. Darrell. Natural language object retrieval. In\n\nProceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.\n\n[27] T. Ko\u02c7cisk\u00fd, G. Melis, E. Grefenstette, C. Dyer, W. Ling, P. Blunsom, and K. M. Hermann. Semantic\nparsing with semi-supervised sequential autoencoders. In Proceedings of the 2016 Conference on Empirical\nMethods in Natural Language Processing, pages 1078\u20131087, Austin, Texas, November 2016. Association\nfor Computational Linguistics.\n\n[28] C. Liu, Z. Lin, X. Shen, J. Yang, X. Lu, and A. Yuille. Recurrent multimodal interaction for referring\nimage segmentation. In Proceedings of the IEEE International Conference on Computer Vision (ICCV),\n2017.\n\n[29] R. Long, P. Pasupat, and P. Liang. Simpler context-dependent logical forms via model projections. In\n\nProceedings of the Annual Meeting of the Association for Computational Linguistics (ACL), 2016.\n\n[30] R. Luo and G. Shakhnarovich. Comprehension-guided referring expressions. In Proceedings of the IEEE\n\nConference on Computer Vision and Pattern Recognition (CVPR), 2017.\n\n[31] J. Mao, H. Jonathan, A. Toshev, O. Camburu, A. Yuille, and K. Murphy. Generation and comprehension of\nunambiguous object descriptions. In Proceedings of the IEEE Conference on Computer Vision and Pattern\nRecognition (CVPR), 2016.\n\n[32] D. McClosky, E. Charniak, and M. Johnson. Effective self-training for parsing. In Proceedings of the main\nconference on human language technology conference of the North American Chapter of the Association\nof Computational Linguistics, pages 152\u2013159. Association for Computational Linguistics, 2006.\n\n[33] H. Mei, M. Bansal, and M. Walter. Listen, attend, and walk: Neural mapping of navigational instructions\n\nto action sequences. In Proceedings of the Conference on Arti\ufb01cial Intelligence (AAAI), 2016.\n\n[34] D. Misra, J. Langford, and Y. Artzi. Mapping instructions and visual observations to actions with\nreinforcement learning. In Proceedings of the Conference on Empirical Methods in Natural Language\nProcessing (EMNLP), 2017.\n\n[35] W. Monroe, R. Hawkins, N. Goodman, and C. Potts. 
Colors in context: A pragmatic neural model\nfor grounded language understanding. Transactions of the Association for Computational Linguistics,\n5:325\u2013338, 2017.\n\n[36] V. K. Nagaraja, V. I. Morariu, and L. S. Davis. Modeling context between objects for referring expression\nunderstanding. In Proceedings of the European Conference on Computer Vision (ECCV), pages 792\u2013807.\nSpringer, 2016.\n\n[37] D. Pathak, P. Mahmoudieh, G. Luo, P. Agrawal, D. Chen, Y. Shentu, E. Shelhamer, J. Malik, A. A. Efros,\n\nand T. Darrell. Zero-shot visual imitation. arXiv preprint arXiv:1804.08606, 2018.\n\n[38] J. Pennington, R. Socher, and C. Manning. Glove: Global vectors for word representation. In Proceedings\nof the 2014 conference on empirical methods in natural language processing (EMNLP), pages 1532\u20131543,\n2014.\n\n[39] B. Plummer, L. Wang, C. Cervantes, J. Caicedo, J. Hockenmaier, and S. Lazebnik. Flickr30k entities:\nCollecting region-to-phrase correspondences for richer image-to-sentence models. In Proceedings of the\nIEEE International Conference on Computer Vision (ICCV), 2015.\n\n[40] I. Radosavovic, P. Doll\u00e1r, R. Girshick, G. Gkioxari, and K. He. Data distillation: Towards omni-supervised\n\nlearning. arXiv preprint arXiv:1712.04440, 2017.\n\n[41] A. Rohrbach, M. Rohrbach, R. Hu, T. Darrell, and B. Schiele. Grounding of textual phrases in images by\n\nreconstruction. In Proceedings of the European Conference on Computer Vision (ECCV), 2016.\n\n[42] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla,\nM. Bernstein, et al. Imagenet large scale visual recognition challenge. International Journal of Computer\nVision, 115(3):211\u2013252, 2015.\n\n[43] H. Scudder. Probability of error of some adaptive pattern-recognition machines. IEEE Transactions on\n\nInformation Theory, 11(3):363\u2013371, 1965.\n\n[44] R. Sennrich, B. Haddow, and A. Birch. Improving neural machine translation models with monolingual\ndata. In Proceedings of the Annual Meeting of the Association for Computational Linguistics (ACL), pages\n86\u201396, 2016.\n\n[45] D. Silver, T. Hubert, J. Schrittwieser, I. Antonoglou, M. Lai, A. Guez, M. Lanctot, L. Sifre, D. Kumaran,\nT. Graepel, et al. Mastering chess and shogi by self-play with a general reinforcement learning algorithm.\narXiv preprint arXiv:1712.01815, 2017.\n\n[46] N. J. Smith, N. Goodman, and M. Frank. Learning and using language via recursive pragmatic reasoning\n\nabout other agents. In Advances in neural information processing systems, pages 3039\u20133047, 2013.\n\n[47] S. Sukhbaatar, Z. Lin, I. Kostrikov, G. Synnaeve, A. Szlam, and R. Fergus. Intrinsic motivation and\n\nautomatic curricula via asymmetric self-play. arXiv preprint arXiv:1703.05407, 2017.\n\n11\n\n\f[48] I. Sutskever, O. Vinyals, and Q. V. Le. Sequence to sequence learning with neural networks. In Advances\n\nin neural information processing systems, pages 3104\u20133112, 2014.\n\n[49] R. S. Sutton and A. G. Barto. Reinforcement learning: An introduction, volume 1. MIT press Cambridge,\n\n1998.\n\n[50] R. S. Sutton, D. A. McAllester, S. P. Singh, and Y. Mansour. Policy gradient methods for reinforcement\nlearning with function approximation. In Advances in neural information processing systems, pages\n1057\u20131063, 2000.\n\n[51] S. Tellex, T. Kollar, S. Dickerson, M. R. Walter, A. G. Banerjee, S. J. Teller, and N. Roy. Understanding\nnatural language commands for robotic navigation and mobile manipulation. 
In AAAI, volume 1, page 2,\n2011.\n\n[52] A. B. Vasudevan, D. Dai, and L. V. Gool. Object referring in visual scene with spoken language. In\n\nProc. IEEE Winter Conf. on Applications of Computer Vision (WACV), 2018.\n\n[53] R. Vedantam, S. Bengio, K. Murphy, D. Parikh, and G. Chechik. Context-aware captions from context-\nagnostic supervision. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition\n(CVPR), volume 3, 2017.\n\n[54] M. Wang, M. Azab, N. Kojima, R. Mihalcea, and J. Deng. Structured matching for phrase localization. In\n\nProceedings of the European Conference on Computer Vision (ECCV), pages 696\u2013711. Springer, 2016.\n\n[55] X. Wang, W. Xiong, H. Wang, and W. Y. Wang. Look before you leap: Bridging model-free and model-\nbased reinforcement learning for planned-ahead vision-and-language navigation. arXiv:1803.07729,\n2018.\n\n[56] T. Weber, S. Racani\u00e8re, D. P. Reichert, L. Buesing, A. Guez, D. J. Rezende, A. P. Badia, O. Vinyals,\nN. Heess, Y. Li, et al. Imagination-augmented agents for deep reinforcement learning. arXiv preprint\narXiv:1707.06203, 2017.\n\n[57] L. Yu, Z. Lin, X. Shen, J. Yang, X. Lu, M. Bansal, and T. L. Berg. Mattnet: Modular attention network\nfor referring expression comprehension. In Proceedings of the IEEE Conference on Computer Vision and\nPattern Recognition (CVPR), 2018.\n\n[58] L. Yu, H. Tan, M. Bansal, and T. L. Berg. A joint speaker-listener-reinforcer model for referring expressions.\n\nIn Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.\n\n12\n\n\f", "award": [], "sourceid": 1682, "authors": [{"given_name": "Daniel", "family_name": "Fried", "institution": "UC Berkeley"}, {"given_name": "Ronghang", "family_name": "Hu", "institution": "University of California, Berkeley"}, {"given_name": "Volkan", "family_name": "Cirik", "institution": "Carnegie Mellon University"}, {"given_name": "Anna", "family_name": "Rohrbach", "institution": "UC Berkeley"}, {"given_name": "Jacob", "family_name": "Andreas", "institution": "UC Berkeley"}, {"given_name": "Louis-Philippe", "family_name": "Morency", "institution": "Carnegie Mellon University"}, {"given_name": "Taylor", "family_name": "Berg-Kirkpatrick", "institution": "Carnegie Mellon University"}, {"given_name": "Kate", "family_name": "Saenko", "institution": "Boston University"}, {"given_name": "Dan", "family_name": "Klein", "institution": "UC Berkeley"}, {"given_name": "Trevor", "family_name": "Darrell", "institution": "UC Berkeley"}]}