{"title": "ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks", "book": "Advances in Neural Information Processing Systems", "page_first": 13, "page_last": 23, "abstract": "We present ViLBERT (short for Vision-and-Language BERT), a model for learning task-agnostic joint representations of image content and natural language. We extend the popular BERT architecture to a multi-modal two-stream model, processing both visual and textual inputs in separate streams that interact through co-attentional transformer layers. We pretrain our model through two proxy tasks on the large, automatically collected Conceptual Captions dataset and then transfer it to multiple established vision-and-language tasks -- visual question answering, visual commonsense reasoning, referring expressions, and caption-based image retrieval -- by making only minor additions to the base architecture. We observe significant improvements across tasks compared to existing task-specific models -- achieving state-of-the-art on all four tasks. Our work represents a shift away from learning groundings between vision and language only as part of task training and towards treating visual grounding as a pretrainable and transferable capability.", "full_text": "ViLBERT: Pretraining Task-Agnostic Visiolinguistic\n\nRepresentations for Vision-and-Language Tasks\n\nJiasen Lu1, Dhruv Batra1,3, Devi Parikh1,3, Stefan Lee1,2\n\n1Georgia Institute of Technology, 2Oregon State University, 3Facebook AI Research\n\nAbstract\n\nWe present ViLBERT (short for Vision-and-Language BERT), a model for learning\ntask-agnostic joint representations of image content and natural language. We\nextend the popular BERT architecture to a multi-modal two-stream model, pro-\ncessing both visual and textual inputs in separate streams that interact through\nco-attentional transformer layers. 
We pretrain our model through two proxy tasks\non the large, automatically collected Conceptual Captions dataset and then transfer\nit to multiple established vision-and-language tasks \u2013 visual question answering,\nvisual commonsense reasoning, referring expressions, and caption-based image\nretrieval \u2013 by making only minor additions to the base architecture. We observe\nsigni\ufb01cant improvements across tasks compared to existing task-speci\ufb01c models \u2013\nachieving state-of-the-art on all four tasks. Our work represents a shift away from\nlearning groundings between vision and language only as part of task training and\ntowards treating visual grounding as a pretrainable and transferable capability.\n\n1\n\nIntroduction\n\n\u201c... spend the summer linking a camera to a computer and getting the computer to describe what it saw.\u201d\n\nMarvin Minsky on the goal of a 1966 undergraduate summer research project [1]\n\nSince this now famously ambitious summer project, steady progress has been made towards systems\nthat can demonstrate their visual understanding by generating or responding to natural language in the\ncontext of images, videos, or even full 3D environments [2\u20138]. These approaches and corresponding\ntasks have come to be referred to under the common banner of \u2018vision-and-language\u2019. However,\ndespite the common need to align natural language and visual stimuli \u2013 i.e. 
to perform visual\ngrounding \u2013 approaches for vision-and-language tasks lack a uni\ufb01ed foundation to gain this capability.\nInstead, the dominant strategy is to start with separate language and vision models pretrained for\nother large-scale tasks and then learn grounding as part of task training \u2013 often resulting in myopic\ngroundings that generalize poorly when paired visiolinguistic data is limited or biased [9, 10].\nThis pretrain-then-transfer learning approach to vision-and-language tasks follows naturally from its\nwidespread use in both computer vision and natural language processing where it has become the de\nfacto standard due to the ease-of-use and strong representational power of large, publicly-available\nmodels [11\u201314] trained on large-scale data sources [15\u201319]. In these domains, pretrained models can\nprovide useful information for target tasks, e.g. dog breed-sensitive image features or a well-calibrated\nsemantic distance between words. While visual and linguistic understandings like these are of course\nessential to vision-and-language tasks, equally important is how they relate to one another \u2013 e.g. a\nperfect visual representation of dog breeds is of little use if a downstream vision-and-language model\nfails to associate it with appropriate phrases like \u201cbeagle\u201d or \u201cshepherd\u201d. We are therefore interested\nin developing a common model for visual grounding that can learn these connections and leverage\nthem on a wide array of vision-and-language tasks \u2013 i.e., we seek to pretrain for visual grounding.\nTo learn these joint visual-linguistic representations, we look to recent successes in self-supervised\nlearning which have captured rich semantic and structural information from large, unlabelled data\nsources by training models to perform so-called \u2018proxy\u2019 tasks. 
These proxy tasks leverage structure\n\n33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.\n\n\fFigure 1: Our ViLBERT model consists of two parallel streams for visual (green) and linguistic\n(purple) processing that interact through novel co-attentional transformer layers. This structure allows\nfor variable depths for each modality and enables sparse interaction through co-attention. Dashed\nboxes with multiplier subscripts denote repeated blocks of layers.\n\nwithin the data to generate supervised tasks automatically (e.g. colorizing images [20] or reconstruct-\ning masked words in text [12]). While work within the vision community has shown increasing\npromise [21\u201323], the greatest impact of self-supervised learning so far is through language models\nlike ELMo [13], BERT [12], and GPT [14] which have set new high-water marks on many NLP\ntasks. To learn visual grounding via a similar approach, we must identify a suitable data source\nwhere alignment between vision and language is available. In this work, we consider the recently\nreleased Conceptual Captions [24] dataset consisting of \u223c3.3 million images with weakly-associated\ndescriptive captions automatically collected from alt-text enabled images on the web.\nWe present a joint model for learning task-agnostic visual grounding from paired visiolinguistic data\nwhich we call Vision & Language BERT (ViLBERT for short). Our approach extends the recently\ndeveloped BERT [12] language model to jointly reason about text and images. Our key technical\ninnovation is introducing separate streams for vision and language processing that communicate\nthrough co-attentional transformer layers. 
This structure can accommodate the differing processing\nneeds of each modality and provides interaction between modalities at varying representation depths.\nWe demonstrate that this structure outperforms a single-stream uni\ufb01ed model in our experiments.\nIn analogy to the training tasks in [12], we train our model on Conceptual Captions on two proxy\ntasks: predicting the semantics of masked words and image regions given the unmasked inputs,\nand predicting whether an image and text segment correspond. We apply our pretrained model\nas a base for four established vision-and-language tasks \u2013 visual question answering [3], visual\ncommonsense reasoning [25], referring expressions [2], and caption-based image retrieval [26] \u2013\nsetting state-of-the-art on all four tasks. We \ufb01nd improvements of 2 to 10 percentage points across\nthese tasks when compared to state-of-the-art task-speci\ufb01c baselines using separately pretrained\nvision and language models. Furthermore, our structure is simple to modify for each of these tasks \u2013\nserving as a common foundation for visual grounding across multiple vision-and-language tasks.\n2 Approach\nIn this section, we \ufb01rst brie\ufb02y summarize the BERT language model (Sec. 2.1) and then describe\nhow we extend it to jointly represent vision and language data (Sec. 2.2).\n2.1 Preliminaries: Bidirectional Encoder Representations from Transformers (BERT)\nThe BERT model introduced by [12] is an attention-based bidirectional language model. When\npretrained on a large language corpus, BERT has proven to be very effective for transfer learning to\nmultiple natural language processing tasks.\nThe BERT model operates on sequences of word tokens w0, . . . , wT . These tokens are mapped\nto learned encodings and passed through L \u201cencoder-style\u201d transformer blocks [27] to produce\n\ufb01nal representations h0, . . . , hT . 
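The encoder-style transformer block that produces these representations can be summarized in a minimal numpy sketch. This is a single-head simplification for illustration only: the dimensions, initialization, and omission of multiple heads and dropout are our assumptions, not the paper's implementation.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Normalize each row to zero mean and unit variance.
    return (x - x.mean(-1, keepdims=True)) / (x.std(-1, keepdims=True) + eps)

def attention(Q, K, V):
    # Scaled dot-product attention: softmax(Q K^T / sqrt(d)) V.
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)
    scores -= scores.max(-1, keepdims=True)          # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(-1, keepdims=True)
    return weights @ V

def encoder_block(H, params):
    # H: (T, d) intermediate representations H^(l) after layer l.
    Wq, Wk, Wv, W1, W2 = params
    Q, K, V = H @ Wq, H @ Wk, H @ Wv                 # queries, keys, values
    H = layer_norm(H + attention(Q, K, V))           # attention + residual add
    H = layer_norm(H + np.maximum(H @ W1, 0) @ W2)   # small feed-forward net + residual add
    return H

T, d = 5, 16
rng = np.random.default_rng(0)
params = [rng.normal(scale=0.1, size=(d, d)) for _ in range(3)] + \
         [rng.normal(scale=0.1, size=(d, 4 * d)), rng.normal(scale=0.1, size=(4 * d, d))]
H_next = encoder_block(rng.normal(size=(T, d)), params)   # H^(l+1), shape (T, d)
```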
Let H (l) be a matrix with rows h(l)0, . . . , h(l)T corresponding to the\nintermediate representations after the l-th layer. Abstracting some internal details found in [27],\nwe depict the computation of a single encoder-style transformer block in Fig. 2a consisting of a\nmulti-headed attention block followed by a small fully-connected network, both wrapped in residual\nadds. Note that the intermediate representation H (l) is used to compute three matrices \u2013 Q, K, and V\n\u2013 corresponding to queries, keys, and values that drive the multi-headed attention block. Speci\ufb01cally,\nthe dot-product similarity between queries and keys determines attentional distributions over value\nvectors. The resulting weight-averaged value vector forms the output of the attention block. As\nwe describe later, we modify this query-conditioned key-value attention mechanism to develop a\nmulti-modal co-attentional transformer module for ViLBERT (Fig. 2b).\n\n(a) Standard encoder transformer block\n\n(b) Our co-attention transformer layer\n\nFigure 2: We introduce a novel co-attention mechanism based on the transformer architecture. By\nexchanging key-value pairs in multi-headed attention, this structure enables vision-attended language\nfeatures to be incorporated into visual representations (and vice versa).\n\nText Representation. BERT operates over sequences of discrete tokens comprised of vocabulary\nwords and a small set of special tokens: SEP, CLS, and MASK. For a given token, the input representation is a sum of a token-speci\ufb01c learned embedding [28] and encodings for position (i.e. 
token\u2019s\nindex in the sequence) and segment (i.e. index of the token\u2019s sentence if multiple exist).\nTraining Tasks and Objectives. The BERT model is trained end-to-end on a large language-corpus\nunder two tasks: masked language modelling and next sentence prediction.\nThe masked language modelling task randomly divides input tokens into disjoint sets corresponding\nto masked XM and observed XO tokens (approximately 15% of tokens being masked). Masked\ntokens are replaced with a special MASK token 80% of the time, a random word 10%, and unaltered\n10%. The BERT model is then trained to reconstruct these masked tokens given the observed set.\nSpeci\ufb01cally, a linear layer is learned to map the \ufb01nal representations at each index (e.g. hi) to a\ndistribution over the vocabulary and the model is trained under a cross-entropy loss.\nIn next sentence prediction, the BERT model is passed two text segments A and B following the\nformat {CLS, wA1, . . . , wAT , SEP, wB1, . . . , wBT , SEP} and is trained to predict whether or not B\nfollows A in the source text. Speci\ufb01cally, a linear layer operating on the \ufb01nal representation for the\nCLS token (i.e. hCLS) is trained to minimize a binary cross-entropy loss on this label.\n2.2 ViLBERT: Extending BERT to Jointly Represent Images and Text\nInspired by BERT\u2019s success at language modeling, we would like to develop analogous models\nand training tasks to learn joint representations of language and visual content from paired data.\nSpeci\ufb01cally, we consider jointly representing static images and corresponding descriptive text.\nOne straightforward approach is to make minimal changes to BERT \u2013 simply discretizing the space\nof visual inputs via clustering, treat these visual \u2018tokens\u2019 exactly like text inputs, and start from\na pretrained BERT model1. This architecture suffers from a number of drawbacks. 
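The 80/10/10 corruption rule for masked language modelling can be sketched directly from this description. The toy vocabulary and helper names below are illustrative, not BERT's actual tokenizer.

```python
import random

MASK_TOKEN = "[MASK]"
TOY_VOCAB = ["dog", "cat", "runs", "sits", "the", "a"]  # stand-in vocabulary

def mask_tokens(tokens, mask_prob=0.15, seed=0):
    # Randomly divide indices into masked (X_M) and observed (X_O) sets,
    # then corrupt X_M: 80% -> [MASK], 10% -> random word, 10% unaltered.
    rng = random.Random(seed)
    corrupted, targets = list(tokens), {}
    for i, tok in enumerate(tokens):
        if rng.random() < mask_prob:
            targets[i] = tok                       # reconstruction target for the loss
            r = rng.random()
            if r < 0.8:
                corrupted[i] = MASK_TOKEN          # 80%: special MASK token
            elif r < 0.9:
                corrupted[i] = rng.choice(TOY_VOCAB)  # 10%: random word
            # else: 10% left unaltered (model must still predict it)
    return corrupted, targets

corrupted, targets = mask_tokens(["the", "dog", "runs"] * 20)
```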
First, initial\nclustering may result in discretization error and lose important visual details. Second, it treats inputs\nfrom both modalities identically, ignoring that they may need different levels of processing due to\neither their inherent complexity or the initial level of abstraction of their input representations. For\ninstance, image regions may have weaker relations than words in a sentence and visual features are\nthemselves often already the output of a very deep network. Finally, forcing the pretrained weights to\naccommodate the large set of additional visual \u2018tokens\u2019 may damage the learned BERT language\nmodel. Instead, we develop a two-stream architecture modelling each modality separately and then\nfusing them through a small set of attention-based interactions. This approach allows for variable\nnetwork depth for each modality and enables cross-modal connections at different depths.\nOur model which we call ViLBERT is shown in Fig. 1 and consists of two parallel BERT-style\nmodels operating over image regions and text segments. Each stream is a series of transformer\nblocks (TRM) and novel co-attentional transformer layers (Co-TRM) which we introduce to enable\ninformation exchange between modalities. Given an image I represented as a set of region features\nv1, . . . , vT and a text input w0, . . . , wT , our model outputs \ufb01nal representations hv0, . . . , hvT and\nhw0, . . . , hwT . Notice that exchange between the two streams is restricted to be between speci\ufb01c\n\n1Concurrent work [29] modelling language and video sequences takes this approach. See Sec. 
5.\n\n(a) Masked multi-modal learning\n\n(b) Multi-modal alignment prediction\n\nFigure 3: We train ViLBERT on the Conceptual Captions [24] dataset under two training tasks to\nlearn visual grounding. In masked multi-modal learning, the model must reconstruct image region\ncategories or words for masked inputs given the observed inputs. In multi-modal alignment prediction,\nthe model must predict whether or not the caption describes the image content.\n\nlayers and that the text stream has signi\ufb01cantly more processing before interacting with visual features\n\u2013 matching our intuitions that our chosen visual features are already fairly high-level and require\nlimited context-aggregation compared to words in a sentence.\nCo-Attentional Transformer Layers. We introduce a co-attentional transformer layer shown in\nFig. 2b. Given intermediate visual and linguistic representations H (i)V and H (j)W , the module computes\nquery, key, and value matrices as in a standard transformer block. However, the keys and values\nfrom each modality are passed as input to the other modality\u2019s multi-headed attention block. Consequently, the attention block produces attention-pooled features for each modality conditioned\non the other \u2013 in effect performing image-conditioned language attention in the visual stream and\nlanguage-conditioned image attention in the linguistic stream. The latter mimics common attention\nmechanisms found in vision-and-language models [30]. 
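The key-value exchange at the heart of the co-attentional layer can be sketched in a few lines of numpy. This is a single-head simplification with illustrative shapes and weights; the real module is multi-headed and wrapped in residual adds, layer norms, and feed-forward sublayers as in Fig. 2b.

```python
import numpy as np

def softmax_rows(x):
    e = np.exp(x - x.max(-1, keepdims=True))
    return e / e.sum(-1, keepdims=True)

def co_attend(H_V, H_W, Wq_v, Wk_v, Wv_v, Wq_w, Wk_w, Wv_w):
    # Each stream computes its own queries, keys, and values as usual ...
    Q_v, K_v, V_v = H_V @ Wq_v, H_V @ Wk_v, H_V @ Wv_v
    Q_w, K_w, V_w = H_W @ Wq_w, H_W @ Wk_w, H_W @ Wv_w
    d = Q_v.shape[-1]
    # ... but keys and values are exchanged across modalities: the visual
    # stream attends over linguistic keys/values (and vice versa), producing
    # features for each modality conditioned on the other.
    attended_V = softmax_rows(Q_v @ K_w.T / np.sqrt(d)) @ V_w
    attended_W = softmax_rows(Q_w @ K_v.T / np.sqrt(d)) @ V_v
    return attended_V, attended_W

rng = np.random.default_rng(0)
d = 8
H_V, H_W = rng.normal(size=(4, d)), rng.normal(size=(6, d))  # 4 regions, 6 tokens
Ws = [rng.normal(scale=0.1, size=(d, d)) for _ in range(6)]
A_V, A_W = co_attend(H_V, H_W, *Ws)
```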
The rest of the transformer block proceeds\nas before, including a residual add with the initial representations \u2013 resulting in a multi-modal feature.\nIn general, co-attention for vision-and-language is not a new idea (being \ufb01rst proposed in [31]) and\nconcurrent work [32, 33] has shown the effectiveness of similar co-attentional transformer structures\non the visual question answering [3] task.\nImage Representations. We generate image region features by extracting bounding boxes and their\nvisual features from a pre-trained object detection network (see Sec. 3.1). Unlike words in text, image\nregions lack a natural ordering, so we instead encode spatial location, constructing a 5-d vector from\nregion position (normalized top-left and bottom-right coordinates) and the fraction of image area\ncovered. This is then projected to match the dimension of the visual feature, and the two are summed.\nWe mark the beginning of an image region sequence with a special IMG token representing the entire\nimage (i.e. mean-pooled visual features with a spatial encoding corresponding to the entire image).\nTraining Tasks and Objectives. In analogy to those described in the previous section, we consider\ntwo pretraining tasks: masked multi-modal modelling and multi-modal alignment prediction.\nThe masked multi-modal modelling task (shown in Fig. 3a) follows from the masked language\nmodelling task in standard BERT \u2013 masking approximately 15% of both words and image region\ninputs and tasking the model with reconstructing them given the remaining inputs. Masked image\nregions have their image features zeroed out 90% of the time and are unaltered 10%. Masked text\ninputs are handled as in BERT. Rather than directly regressing the masked feature values, the model\ninstead predicts a distribution over semantic classes for the corresponding image region. 
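The 5-d spatial encoding for image regions follows directly from this description; the function name and argument order below are our own.

```python
import numpy as np

def spatial_encoding(box, img_w, img_h):
    # 5-d vector: normalized top-left and bottom-right coordinates plus the
    # fraction of image area covered by the region.
    x1, y1, x2, y2 = box
    return np.array([x1 / img_w, y1 / img_h, x2 / img_w, y2 / img_h,
                     (x2 - x1) * (y2 - y1) / (img_w * img_h)])

# A region covering the top-left quarter of a 640x480 image.
enc = spatial_encoding((0, 0, 320, 240), img_w=640, img_h=480)
# This vector is then projected to the visual feature dimension and summed
# with the region's mean-pooled convolutional feature.
```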
To supervise\nthis, we take the output distribution for the region from the same pretrained detection model used in\nfeature extraction. We train the model to minimize the KL divergence between these two distributions.\nThis choice re\ufb02ects the notion that language often only identi\ufb01es high-level semantics of visual\ncontent and is unlikely to be able to reconstruct exact image features. Further, applying a regression\nloss could make it dif\ufb01cult to balance losses incurred by masked image and text inputs.\nIn the multi-modal alignment task (shown in Fig. 3b), the model is presented an image-text pair as\n{IMG, v1, . . . , vT , CLS, w1, . . . , wT , SEP} and must predict whether the image and text are aligned, i.e.\nwhether the text describes the image. We take the outputs hIMG and hCLS as holistic representations\nof the visual and linguistic inputs. Borrowing another common structure from vision-and-language\nmodels, we compute the overall representation as an element-wise product between hIMG and hCLS\nand learn a linear layer to make the binary prediction whether the image and text are aligned. However,\nthe Conceptual Captions [24] dataset only includes aligned image-caption pairs. 
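The alignment head just described admits a minimal sketch; the weight shapes and names here are illustrative assumptions.

```python
import numpy as np

def alignment_probability(h_img, h_cls, w, b=0.0):
    # The holistic visual (h_IMG) and linguistic (h_CLS) outputs are fused by
    # an element-wise product; a learned linear layer then scores alignment.
    logit = (h_img * h_cls) @ w + b
    return 1.0 / (1.0 + np.exp(-logit))  # sigmoid -> P(image and text aligned)

rng = np.random.default_rng(0)
d = 16
p = alignment_probability(rng.normal(size=d), rng.normal(size=d),
                          rng.normal(scale=0.1, size=d))
```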
To generate negatives\nfor an image-caption pair, we randomly replace either the image or caption with another.\n\n3 Experimental Settings\nIn this section, we describe how we train our model and provide overviews of the vision-and-language\ntasks to which we transfer the trained model.\n3.1 Training ViLBERT\nTo train our full ViLBERT model, we apply the training tasks presented in Sec. 2.2 to the Conceptual\nCaptions dataset [24]. Conceptual Captions is a collection of 3.3 million image-caption pairs\nautomatically scraped from alt-text enabled web images. The automatic collection and sanitation\nprocess leaves some noise and the \u2018captions\u2019 are sometimes not human-like or short on details (e.g.\n\u201cactors attend the premiere at festival\u201d). However, it presents a huge diversity of visual content and\nserves as an excellent dataset for our purposes. Since some links had become broken by the time we\ndownloaded the data, our model is trained with around 3.1 million image-caption pairs.\nImplementation Details. We initialize the linguistic stream of our ViLBERT model with a BERT\nlanguage model pretrained on the BookCorpus [17] and English Wikipedia. Speci\ufb01cally, we use the\nBERTBASE model [12] which has 12 layers of transformer blocks with each block having a hidden\nstate size of 768 and 12 attention heads. 
We choose to use the BASE model due to concerns over\ntraining time but \ufb01nd it likely that the more powerful BERTLARGE model could further boost performance.\nWe use Faster R-CNN [31] (with ResNet-101 [11] backbone) pretrained on the Visual Genome\ndataset [16] (see [30] for details) to extract region features. We select regions where class detection\nprobability exceeds a con\ufb01dence threshold and keep between 10 and 36 high-scoring boxes. For\neach selected region i, vi is de\ufb01ned as the mean-pooled convolutional feature from that region.\nTransformer and co-attentional transformer blocks in the visual stream have hidden state size of 1024\nand 8 attention heads.\nWe train on 8 TitanX GPUs with a total batch size of 512 for 10 epochs. We use the Adam optimizer\nwith an initial learning rate of 1e-4. We use a linear decay learning rate schedule with warm-up to train\nthe model. Both training task losses are weighted equally.\n3.2 Vision-and-Language Transfer Tasks\nWe transfer our pretrained ViLBERT model to a set of four established vision-and-language tasks and\none diagnostic task. We follow a \ufb01ne-tuning strategy where we modify the pretrained base model\nto perform the new task and then train the entire model end-to-end. In all cases, the modi\ufb01cation\nis trivial \u2013 typically amounting to learning a classi\ufb01cation layer. This is in stark contrast to the\nsigni\ufb01cant efforts made within the community to develop specialized models for each of these tasks.\nWe describe the problem, dataset, model modi\ufb01cations, and training objective for each task below.\nVisual Question Answering (VQA). The VQA task requires answering natural language questions\nabout images. We train and evaluate on the VQA 2.0 dataset [3] consisting of 1.1 million questions\nabout COCO images [5], each with 10 answers. 
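The pretraining learning-rate schedule above (linear warm-up to the initial rate of 1e-4, then linear decay) can be sketched as follows; the warm-up length is an assumed placeholder, as the text does not specify it.

```python
def lr_at_step(step, total_steps, base_lr=1e-4, warmup_steps=1000):
    # Linear warm-up to base_lr, then linear decay to zero.
    # warmup_steps is an illustrative assumption, not a value from the paper.
    if step < warmup_steps:
        return base_lr * step / warmup_steps
    frac = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return base_lr * (1.0 - frac)
```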
To \ufb01ne-tune ViLBERT on VQA, we learn a two\nlayer MLP on top of the element-wise product of the image and text representations hIMG and hCLS,\nmapping this representation to 3,129 possible answers. As in [30], we treat VQA as a multi-label\nclassi\ufb01cation task \u2013 assigning a soft target score to each answer based on its relevancy to the 10\nhuman answer responses. We then train with a binary cross-entropy loss on the soft target scores\nusing a batch size of 256 over a maximum of 20 epochs. We use the Adam optimizer with an initial\nlearning rate of 4e-5. At inference, we simply take a softmax.\nVisual Commonsense Reasoning (VCR). Given an image, the VCR task presents two problems \u2013\nvisual question answering (Q\u2192A) and answer justi\ufb01cation (QA\u2192R) \u2013 both being posed as multiple-\nchoice problems. The holistic setting (Q\u2192AR) requires both the chosen answer and then the\nchosen rationale to be correct. The Visual Commonsense Reasoning (VCR) dataset consists of 290k\nmultiple choice QA problems derived from 110k movie scenes. Different from the VQA dataset,\nVCR integrates object tags into the language providing direct grounding supervision and explicitly\nexcludes referring expressions. To \ufb01netune on this task, we concatenate the question and each\npossible response to form four different text inputs and pass each through ViLBERT along with the\nimage. We learn a linear layer on top of the post-elementwise product representation to predict a\nscore for each pair. The \ufb01nal prediction is a softmax over these four scores and is trained under a\ncross-entropy loss over 20 epochs with a batch size of 64 and initial learning rate of 2e-5.\nGrounding Referring Expressions. The referring expression task is to localize an image region\ngiven a natural language reference. We train and evaluate on the RefCOCO+ dataset [32]. 
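The VQA fine-tuning head described above (a two-layer MLP over the element-wise product, trained with binary cross-entropy against soft target scores) can be sketched as follows; the hidden width and initialization are illustrative assumptions.

```python
import numpy as np

def vqa_logits(h_img, h_cls, W1, W2):
    # Two-layer MLP over the element-wise product of h_IMG and h_CLS,
    # mapping to scores over the 3,129 candidate answers.
    hidden = np.maximum((h_img * h_cls) @ W1, 0.0)   # ReLU hidden layer
    return hidden @ W2

def soft_target_bce(logits, soft_targets, eps=1e-9):
    # Multi-label objective: binary cross-entropy against soft target scores
    # derived from the 10 human answers per question.
    p = 1.0 / (1.0 + np.exp(-logits))
    return float(-np.mean(soft_targets * np.log(p + eps)
                          + (1.0 - soft_targets) * np.log(1.0 - p + eps)))

rng = np.random.default_rng(0)
d, hidden_dim, n_answers = 32, 64, 3129
W1 = rng.normal(scale=0.05, size=(d, hidden_dim))
W2 = rng.normal(scale=0.05, size=(hidden_dim, n_answers))
logits = vqa_logits(rng.normal(size=d), rng.normal(size=d), W1, W2)
soft = np.zeros(n_answers)
soft[42] = 0.9                       # soft relevancy score for one answer
loss = soft_target_bce(logits, soft)
```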
A common\napproach to this task is to rerank a set of image region proposals given the referring expression.\n\n5\n\n\fThus we directly use the bounding box proposals provided by [33], which use a Mask R-CNN [34]\npretrained on the COCO dataset. For \ufb01ne-tuning, we pass the \ufb01nal representation hvi for each\nimage region i into a learned linear layer to predict a matching score. We label each proposal box\nby computing the IoU with the ground truth box and thresholding at 0.5. We train with a binary\ncross-entropy loss for a maximum of 20 epochs with a batch size of 256 and an initial learning rate of\n4e-5. At inference, we use the highest scoring region as the prediction.\nCaption-Based Image Retrieval. Caption-based image retrieval is the task of identifying an image\nfrom a pool given a caption describing its content. We train and evaluate on the Flickr30k dataset\n[26] consisting of 31,000 images from Flickr with \ufb01ve captions each. Following the splits in [35], we\nuse 1,000 images for validation and test each and train on the rest. These captions are well-grounded\nin and descriptive of the visual content and are qualitatively different than the automatically collected\nConceptual Captions. We train in a 4-way multiple-choice setting by randomly sampling three\ndistractors for each image-caption pair \u2013 substituting a random caption, a random image, or a hard\nnegative from among the 100 nearest neighbors of the target image. We compute the alignment score\n(as in alignment prediction pretraining) for each and apply a softmax. We train this model under a\ncross-entropy loss to select the true image-caption pair for 20 epochs with a batch size of 64 and an\ninitial learning rate of 2e-5. At inference, we score each caption-image pair in the test set and then\nsort. 
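The proposal labeling used above for referring expressions (IoU with the ground-truth box, thresholded at 0.5) can be sketched as:

```python
def iou(a, b):
    # Intersection-over-union of two (x1, y1, x2, y2) boxes.
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union > 0 else 0.0

def label_proposals(proposals, gt_box, thresh=0.5):
    # A proposal is a positive example if its IoU with the ground truth
    # meets the 0.5 threshold.
    return [int(iou(p, gt_box) >= thresh) for p in proposals]
```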
For ef\ufb01ciency, we cache the linguistic stream representation before the \ufb01rst Co-TRM layer \u2013\neffectively freezing the linguistic representation before fusion.\n\u2018Zero-shot\u2019 Caption-Based Image Retrieval. The previous tasks are all transfer tasks that include\ndataset-speci\ufb01c \ufb01ne-tuning. In this \u2018zero-shot\u2019 task, we directly apply the pretrained multi-modal\nalignment prediction mechanism to caption-based image retrieval in Flickr30k [26] without \ufb01ne-tuning (thus the description as \u2018zero-shot\u2019). The goal of this task is to demonstrate that the pretraining\nhas developed the ability to ground text and that this can generalize to visual and linguistic variation\nwithout any task-speci\ufb01c \ufb01ne-tuning. We directly use the ViLBERT model trained on the Conceptual\nCaptions dataset described in Sec. 3.1. We use the alignment prediction objective as a scoring function\nand test on the same split as the caption-based image retrieval task described above.\n\n4 Results and Analysis\n\nBaselines. We compare our pretrained ViLBERT model against two ablative baselines:\n\u2013 Single-Stream, consisting of a single BERT architecture that processes both modality inputs\nthrough the same set of transformer blocks \u2013 sharing parameters and processing stacks for\nboth visual and linguistic inputs. Like [29], this model avoids making changes to the BERT\narchitecture, resulting in signi\ufb01cantly deeper visual processing and earlier interaction between\nmodalities than in our model. The model is initialized with BERTBASE and trained identically to\nour full model. We compare to this baseline to establish the impact of our two-stream architecture.\nAs both streams interact throughout, we cannot cache any representations for ef\ufb01ciency. 
As such,\nwe do not evaluate this baseline on image retrieval and zero-shot image retrieval due to high\ncomputational cost.\n\u2013 ViLBERT\u2020, which is a ViLBERT architecture that has not undergone our pretraining tasks.\nNotably, it does still have BERT initialization for the linguistic stream and represents image regions\nwith the same Faster R-CNN model as the full ViLBERT model. We compare to this baseline to\nisolate gains over task-speci\ufb01c baseline models that might be due to our architecture, language\ninitialization, or visual features as opposed to our pretraining process on Conceptual Captions.\nFor both baselines and our model, we \ufb01netune on the transfer tasks as described in the previous section.\nTask-Speci\ufb01c Baselines. To put our results in context, we present published results of problem-speci\ufb01c methods that are to our knowledge state-of-the-art in each task: DFAF [36] for VQA, R2C\n[25] for VCR, MAttNet [33] for RefCOCO+, and SCAN [35] for caption-based image retrieval.\nResults. Tab. 1 shows results across all transfer tasks and we highlight key \ufb01ndings below:\n\u2013 Our architecture improves performance over a single-stream model. We observe improvements across tasks for ViLBERT over the single-stream baseline for both pretrained (Single-Stream\nvs. ViLBERT) and non-pretrained (Single-Stream\u2020 vs. ViLBERT\u2020). Most signi\ufb01cant gains are\nobserved for VQA and RefCOCO+.\n\n\u2013 Our pretraining tasks result in improved visiolinguistic representations. Our models further\nimprove by between 2% and 13% across tasks when using a ViLBERT model that has been\n\nTable 1: Transfer task results for our ViLBERT model compared with existing state-of-the-art and\nsensible architectural ablations. \u2020 indicates models without pretraining on Conceptual Captions. For\nVCR and VQA which have private test sets, we report test results (in parentheses) only for our full\nmodel. 
Our full ViLBERT model outperforms task-speci\ufb01c state-of-the-art models across all tasks.\n\nMethod | VQA test-dev (test-std) | VCR Q\u2192A | VCR QA\u2192R | VCR Q\u2192AR | RefCOCO+ val | RefCOCO+ testA | RefCOCO+ testB | IR R1 | IR R5 | IR R10 | ZS-IR R1 | ZS-IR R5 | ZS-IR R10\nSOTA: DFAF [36] | 70.22 (70.34) | - | - | - | - | - | - | - | - | - | - | - | -\nSOTA: R2C [25] | - | 63.8 (65.1) | 67.2 (67.3) | 43.1 (44.0) | - | - | - | - | - | - | - | - | -\nSOTA: MAttNet [33] | - | - | - | - | 65.33 | 71.62 | 56.02 | - | - | - | - | - | -\nSOTA: SCAN [35] | - | - | - | - | - | - | - | 48.60 | 77.70 | 85.20 | - | - | -\nOurs: Single-Stream\u2020 | 65.90 | 68.15 | 68.89 | 47.27 | 65.64 | 72.02 | 56.04 | - | - | - | - | - | -\nOurs: Single-Stream | 68.85 | 71.09 | 73.93 | 52.73 | 69.21 | 75.32 | 61.02 | - | - | - | - | - | -\nOurs: ViLBERT\u2020 | 68.93 | 69.26 | 71.01 | 49.48 | 68.61 | 75.97 | 58.44 | 45.50 | 76.78 | 85.02 | 0.00 | 0.00 | 0.00\nOurs: ViLBERT | 70.55 (70.92) | 72.42 (73.3) | 74.47 (74.6) | 54.04 (54.8) | 72.34 | 78.52 | 62.61 | 58.20 | 84.90 | 91.52 | 31.86 | 61.12 | 72.80\n\npretrained under our proxy tasks (ViLBERT vs ViLBERT\u2020). We also observe improvements on\nSingle-Stream which veri\ufb01es that our proxy tasks can generalize to different model architectures.\n\n\u2013 Finetuning from ViLBERT is a powerful strategy for vision-and-language tasks. With a\nsingle base architecture, our transfer task performance exceeds state-of-the-art task-speci\ufb01c\nmodels for all four established tasks. We set state-of-the-art for VCR, RefCOCO+ and image\nretrieval by signi\ufb01cant margins (improvements of 7\u201310 percentage points). Further, extending to these\ntasks was simple \u2013 requiring the addition of a single classi\ufb01er for each task.\n\nOverall, these results demonstrate that our ViLBERT model is able to learn important visual-linguistic\nrelationships that can be exploited by downstream tasks.\nEffect of Visual Stream Depth. In Tab. 
2 we compare the results of transferring from ViLBERT models of varying depths. We consider depth with respect to the number of repeated CO-TRM→TRM blocks (shown in a dashed box in Fig. 1) in our model. We find that the VQA and image retrieval tasks benefit from greater depth – performance increases monotonically up to a depth of 6 layers. Likewise, zero-shot image retrieval continues making significant gains as depth increases. In contrast, VCR and RefCOCO+ seem to benefit from shallower models.

Table 2: Ablation study of the depth of our model with respect to the number of Co-TRM→TRM blocks (shown in a dashed box in Fig. 1). We find that different tasks perform better at different network depths – implying they may need more or less context aggregation.

| Method | VQA [3] test-dev | VCR [25] Q→A | VCR QA→R | VCR Q→AR | RefCOCO+ [32] val | testA | testB | Image Retrieval [26] R1 | R5 | R10 | ZS Image Retrieval [26] R1 | R5 | R10 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| ViLBERT (2-layer) | 69.92 | 72.44 | 74.80 | 54.40 | 71.74 | 78.61 | 62.28 | 55.68 | 84.26 | 90.56 | 26.14 | 56.04 | 68.80 |
| ViLBERT (4-layer) | 70.22 | 72.45 | 74.00 | 53.82 | 72.07 | 78.53 | 63.14 | 55.38 | 84.10 | 90.62 | 26.28 | 54.34 | 66.08 |
| ViLBERT (6-layer) | 70.55 | 72.42 | 74.47 | 54.04 | 72.34 | 78.52 | 62.61 | 58.20 | 84.90 | 91.52 | 31.86 | 61.12 | 72.80 |
| ViLBERT (8-layer) | 70.47 | 72.33 | 74.15 | 53.79 | 71.66 | 78.29 | 62.43 | 58.78 | 85.60 | 91.42 | 32.80 | 63.38 | 74.62 |

Benefits of Large Training Sets. We also studied the impact of the size of the pretraining dataset. For this experiment, we take random subsets of 25% and 50% of the Conceptual Captions dataset, and pretrain and finetune ViLBERT using the same setup as above. As Tab. 3 shows, accuracy grows monotonically as the amount of data increases, which suggests that ViLBERT may benefit from even more pretraining data.

Table 3: Transfer task results for ViLBERT as a function of the percentage of the Conceptual Captions dataset used during pretraining. We see monotonic gains as the pretraining dataset size grows.

| Method | VQA [3] test-dev | VCR [25] Q→A | VCR QA→R | VCR Q→AR | RefCOCO+ [32] val | testA | testB | Image Retrieval [26] R1 | R5 | R10 | ZS Image Retrieval [26] R1 | R5 | R10 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| ViLBERT (0%) | 68.93 | 69.26 | 71.01 | 49.48 | 68.61 | 75.97 | 58.44 | 45.50 | 76.78 | 85.02 | 0.00 | 0.00 | 0.00 |
| ViLBERT (25%) | 69.82 | 71.61 | 73.00 | 52.66 | 69.90 | 76.83 | 60.99 | 53.08 | 80.80 | 88.52 | 20.40 | 48.54 | 62.06 |
| ViLBERT (50%) | 70.30 | 71.88 | 73.60 | 53.03 | 71.16 | 77.35 | 61.57 | 54.84 | 83.62 | 90.10 | 26.76 | 56.26 | 68.80 |
| ViLBERT (100%) | 70.55 | 72.42 | 74.47 | 54.04 | 72.34 | 78.52 | 62.61 | 58.20 | 84.90 | 91.52 | 31.86 | 61.12 | 72.80 |

What does ViLBERT learn during pretraining? To get a sense for what ViLBERT learns during Conceptual Captions pretraining, we look at zero-shot caption-based image retrieval and some qualitative examples. While zero-shot performance (Tab. 1, right) is significantly lower than that of the fine-tuned model (31.86 vs. 58.20 R1), it performs reasonably without having seen a Flickr30k image or caption (31.86 vs. 48.60 R1 for prior SOTA) – indicating that ViLBERT has learned a semantically meaningful alignment between vision and language during pretraining.

5 Related Work

Self-Supervised Learning. There has been substantial recent interest in both vision [37–42] and language around self-supervised representation learning. In this paradigm, deep models are trained for tasks where regularities in existing data can be turned into supervision automatically. While there has been progress on the vision side, self-supervised image representations still lag behind those from models trained under image classification tasks.
Self-supervised language models, on the other hand, have resulted in significant improvements over prior work [12–14, 43]. In this work, we develop a model and proxy tasks for learning joint visual-linguistic representations – extending the popular BERT [12] model.

Vision-and-Language. While we address many vision-and-language tasks in Sec. 3.2, we do miss some families of tasks, including visually grounded dialog [4, 44], embodied tasks like question answering [7] and instruction following [8], and text generation tasks like image and video captioning [5]. These tasks may also benefit from a self-supervised approach similar to the one we have presented. There are open questions on how to incorporate the long sequences of images and text found in dialog, embodied tasks, and video processing. Further, it is unclear how to effectively decode output text from our bidirectional model, as existing left-to-right decoding schemes like beam search do not directly apply.

Self-Supervised Learning for Vision-and-Language. Most related to our approach is concurrent work on learning joint representations between video and language [29]. In that work, self-supervised tasks paralleling our own are derived from cooking videos paired with text-to-speech-transcribed audio. They present a unified BERT architecture for both the visual and linguistic inputs, similar to the Single-Stream baseline we consider here, and apply the learned model to two tasks on cooking videos: zero-shot activity recognition and blank-filling on audio transcripts. In contrast, we learn representations of images and descriptive text on a wide range of images from the web and focus extensively on transfer learning from this model for well-established vision-and-language tasks.

Recent work on vision-and-language pretraining. Since our paper was released on arXiv, several other preprints have appeared on similar vision-and-language cross-modality pretraining directions.
LXMERT [45] uses a more task-specific design for the cross-modality model. Instead of using the web-supervised Conceptual Captions [24] dataset, LXMERT uses in-domain datasets (i.e., COCO [5] and Visual Genome [16]) for pretraining. VisualBERT [46] directly extends BERT [12] to the vision-and-language domain; it uses both out-of-domain and in-domain datasets for pretraining and applies the masked language modeling objective only on the language side. Unicoder [47] focuses exclusively on image-caption retrieval tasks with online hardest negative mining. More recent preprints, including VL-BERT [48], Unified VLP [49], and UNITER, also show promising improvements in this research direction of joint visio-linguistic pretraining.

6 Conclusion

We develop a joint model for image content and text and pretrain it on a large, automatically collected dataset to learn visual grounding. Our ViLBERT model introduces a novel two-stream architecture with co-attentional transformer blocks that outperforms sensible ablations and exceeds state-of-the-art when transferred to multiple established vision-and-language tasks. Furthermore, transferring our model to these tasks is simple and easy to implement – requiring only the addition of a classifier for each task we examined here. We consider extensions of our model to other vision-and-language tasks (including those requiring generation) as well as multi-task learning as exciting future work.

Acknowledgement. The Georgia Tech effort was supported in part by NSF, AFRL, DARPA, ONR YIPs, ARO PECASE. The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of the U.S. Government, or any sponsor.

References

[1] Margaret A. Boden. Mind as Machine: A History of Cognitive Science. Oxford University Press, 2008.

[2] Sahar Kazemzadeh, Vicente Ordonez, Mark Matten, and Tamara L. Berg.
ReferItGame: Referring to objects in photographs of natural scenes. In EMNLP, 2014.

[3] Stanislaw Antol, Aishwarya Agrawal, Jiasen Lu, Margaret Mitchell, Dhruv Batra, C. Lawrence Zitnick, and Devi Parikh. VQA: Visual question answering. In ICCV, 2015.

[4] Abhishek Das, Satwik Kottur, Khushi Gupta, Avi Singh, Deshraj Yadav, Jose M. F. Moura, Devi Parikh, and Dhruv Batra. Visual dialog. In CVPR, 2017.

[5] Xinlei Chen, Hao Fang, Tsung-Yi Lin, Ramakrishna Vedantam, Saurabh Gupta, Piotr Dollár, and C. Lawrence Zitnick. Microsoft COCO captions: Data collection and evaluation server. arXiv preprint arXiv:1504.00325, 2015.

[6] Ravi Shekhar, Sandro Pezzelle, Yauhen Klimovich, Aurelie Herbelot, Moin Nabi, Enver Sangineto, and Raffaella Bernardi. FOIL it! Find one mismatch between image and language caption. In ACL, 2017.

[7] Abhishek Das, Samyak Datta, Georgia Gkioxari, Stefan Lee, Devi Parikh, and Dhruv Batra. Embodied question answering. In CVPR, 2018.

[8] Peter Anderson, Qi Wu, Damien Teney, Jake Bruce, Mark Johnson, Niko Sünderhauf, Ian Reid, Stephen Gould, and Anton van den Hengel. Vision-and-language navigation: Interpreting visually-grounded navigation instructions in real environments. In CVPR, 2018.

[9] Aishwarya Agrawal, Dhruv Batra, Devi Parikh, and Aniruddha Kembhavi. Don't just assume; look and answer: Overcoming priors for visual question answering. In CVPR, 2018.

[10] Harsh Agrawal, Karan Desai, Xinlei Chen, Rishabh Jain, Dhruv Batra, Devi Parikh, Stefan Lee, and Peter Anderson. nocaps: novel object captioning at scale. arXiv preprint arXiv:1812.08658, 2018.

[11] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR, 2016.

[12] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018.

[13] Matthew E.
Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. Deep contextualized word representations. In NAACL, 2018.

[14] Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. Improving language understanding with unsupervised learning. Technical report, OpenAI, 2018.

[15] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, Alexander C. Berg, and Li Fei-Fei. ImageNet large scale visual recognition challenge. IJCV, 2015.

[16] Ranjay Krishna, Yuke Zhu, Oliver Groth, Justin Johnson, Kenji Hata, Joshua Kravitz, Stephanie Chen, Yannis Kalantidis, Li-Jia Li, David A. Shamma, Michael Bernstein, and Li Fei-Fei. Visual Genome: Connecting language and vision using crowdsourced dense image annotations. arXiv preprint arXiv:1602.07332, 2016.

[17] Yukun Zhu, Ryan Kiros, Rich Zemel, Ruslan Salakhutdinov, Raquel Urtasun, Antonio Torralba, and Sanja Fidler. Aligning books and movies: Towards story-like visual explanations by watching movies and reading books. In ICCV, 2015.

[18] English Wikipedia, 2019. URL https://en.wikipedia.org/.

[19] Ciprian Chelba, Tomas Mikolov, Mike Schuster, Qi Ge, Thorsten Brants, Phillipp Koehn, and Tony Robinson. One billion word benchmark for measuring progress in statistical language modeling. arXiv preprint, 2014.

[20] Gustav Larsson, Michael Maire, and Gregory Shakhnarovich. Colorization as a proxy task for visual understanding. In CVPR, pages 6874–6883, 2017.

[21] Dinesh Jayaraman, Ruohan Gao, and Kristen Grauman. ShapeCodes: self-supervised feature learning by lifting views to viewgrids. In ECCV, pages 120–136, 2018.

[22] Relja Arandjelovic and Andrew Zisserman. Look, listen and learn.
In ICCV, pages 609–617, 2017.

[23] Deepak Pathak, Ross Girshick, Piotr Dollár, Trevor Darrell, and Bharath Hariharan. Learning features by watching objects move. In CVPR, pages 2701–2710, 2017.

[24] Piyush Sharma, Nan Ding, Sebastian Goodman, and Radu Soricut. Conceptual Captions: A cleaned, hypernymed, image alt-text dataset for automatic image captioning. In ACL, 2018.

[25] Rowan Zellers, Yonatan Bisk, Ali Farhadi, and Yejin Choi. From recognition to cognition: Visual commonsense reasoning. In CVPR, 2019.

[26] Peter Young, Alice Lai, Micah Hodosh, and Julia Hockenmaier. From image descriptions to visual denotations: New similarity metrics for semantic inference over event descriptions. TACL, 2014.

[27] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In NeurIPS, 2017.

[28] Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V. Le, Mohammad Norouzi, Wolfgang Macherey, Maxim Krikun, Yuan Cao, Qin Gao, Klaus Macherey, et al. Google's neural machine translation system: Bridging the gap between human and machine translation. arXiv preprint arXiv:1609.08144, 2016.

[29] Chen Sun, Austin Myers, Carl Vondrick, Kevin Murphy, and Cordelia Schmid. VideoBERT: A joint model for video and language representation learning. arXiv preprint arXiv:1904.01766, 2019.

[30] Peter Anderson, Xiaodong He, Chris Buehler, Damien Teney, Mark Johnson, Stephen Gould, and Lei Zhang. Bottom-up and top-down attention for image captioning and visual question answering. In CVPR, pages 6077–6086, 2018.

[31] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. In NeurIPS, pages 91–99, 2015.

[32] Sahar Kazemzadeh, Vicente Ordonez, Mark Matten, and Tamara Berg. ReferItGame: Referring to objects in photographs of natural scenes.
In EMNLP, 2014.

[33] Licheng Yu, Zhe Lin, Xiaohui Shen, Jimei Yang, Xin Lu, Mohit Bansal, and Tamara L. Berg. MAttNet: Modular attention network for referring expression comprehension. In CVPR, 2018.

[34] Kaiming He, Georgia Gkioxari, Piotr Dollár, and Ross Girshick. Mask R-CNN. In ICCV, pages 2961–2969, 2017.

[35] Kuang-Huei Lee, Xi Chen, Gang Hua, Houdong Hu, and Xiaodong He. Stacked cross attention for image-text matching. In ECCV, pages 201–216, 2018.

[36] Gao Peng, Hongsheng Li, Haoxuan You, Zhengkai Jiang, Pan Lu, Steven Hoi, and Xiaogang Wang. Dynamic fusion with intra- and inter-modality attention flow for visual question answering. arXiv preprint arXiv:1812.05252, 2018.

[37] Carl Doersch, Abhinav Gupta, and Alexei A. Efros. Unsupervised visual representation learning by context prediction. In ICCV, pages 1422–1430, 2015.

[38] Richard Zhang, Phillip Isola, and Alexei A. Efros. Colorful image colorization. In ECCV, pages 649–666. Springer, 2016.

[39] Alexey Dosovitskiy, Philipp Fischer, Jost Tobias Springenberg, Martin Riedmiller, and Thomas Brox. Discriminative unsupervised feature learning with exemplar convolutional neural networks. IEEE PAMI, 38(9):1734–1747, 2015.

[40] Deepak Pathak, Philipp Krahenbuhl, Jeff Donahue, Trevor Darrell, and Alexei A. Efros. Context encoders: Feature learning by inpainting. In CVPR, pages 2536–2544, 2016.

[41] Dinesh Jayaraman and Kristen Grauman. Learning image representations tied to ego-motion. In CVPR, pages 1413–1421, 2015.

[42] Ishan Misra, C. Lawrence Zitnick, and Martial Hebert. Shuffle and learn: unsupervised learning using temporal order verification. In ECCV, pages 527–544. Springer, 2016.

[43] Guillaume Lample and Alexis Conneau. Cross-lingual language model pretraining.
arXiv preprint arXiv:1901.07291, 2019.

[44] Harm de Vries, Florian Strub, Jeremie Mary, Hugo Larochelle, Olivier Pietquin, and Aaron C. Courville. Modulating early visual processing by language. In NeurIPS, 2017.

[45] Hao Tan and Mohit Bansal. LXMERT: Learning cross-modality encoder representations from transformers. arXiv preprint arXiv:1908.07490, 2019.

[46] Liunian Harold Li, Mark Yatskar, Da Yin, Cho-Jui Hsieh, and Kai-Wei Chang. VisualBERT: A simple and performant baseline for vision and language. arXiv preprint arXiv:1908.03557, 2019.

[47] Gen Li, Nan Duan, Yuejian Fang, Daxin Jiang, and Ming Zhou. Unicoder-VL: A universal encoder for vision and language by cross-modal pre-training. arXiv preprint arXiv:1908.06066, 2019.

[48] Weijie Su, Xizhou Zhu, Yue Cao, Bin Li, Lewei Lu, Furu Wei, and Jifeng Dai. VL-BERT: Pre-training of generic visual-linguistic representations. arXiv preprint arXiv:1908.08530, 2019.

[49] Luowei Zhou, Hamid Palangi, Lei Zhang, Houdong Hu, Jason J. Corso, and Jianfeng Gao. Unified vision-language pre-training for image captioning and VQA. arXiv preprint arXiv:1909.11059, 2019.