{"title": "Inducing brain-relevant bias in natural language processing models", "book": "Advances in Neural Information Processing Systems", "page_first": 14123, "page_last": 14133, "abstract": "Progress in natural language processing (NLP) models that estimate representations of word sequences has recently been leveraged to improve the understanding of language processing in the brain.  However, these models have not been specifically designed to capture the way the brain represents language meaning. We hypothesize that fine-tuning these models to predict recordings of brain activity of people reading text will lead to representations that encode more brain-activity-relevant language information. We demonstrate that a version of BERT, a recently introduced and powerful language model, can improve the prediction of brain activity after fine-tuning. We show that the relationship between language and brain activity learned by BERT during this fine-tuning transfers across multiple participants. We also show that, for some participants, the fine-tuned representations learned from both magnetoencephalography (MEG) and functional magnetic resonance imaging (fMRI) are better for predicting fMRI than the representations learned from fMRI alone, indicating that the learned representations capture brain-activity-relevant information that is not simply an artifact of the modality. While changes to language representations help the model predict brain activity, they also do not harm the model's ability to perform downstream NLP tasks. 
Our findings are notable for research on language understanding in the brain.", "full_text": "Inducing brain-relevant bias\n\nin natural language processing models\n\nDan Schwartz\n\nCarnegie Mellon University\ndrschwar@cs.cmu.edu\n\nMariya Toneva\n\nLeila Wehbe\n\nCarnegie Mellon University\n\nCarnegie Mellon University\n\nmariya@cmu.edu\n\nlwehbe@cmu.edu\n\nAbstract\n\nProgress in natural language processing (NLP) models that estimate representations\nof word sequences has recently been leveraged to improve the understanding of\nlanguage processing in the brain. However, these models have not been speci\ufb01cally\ndesigned to capture the way the brain represents language meaning. We hypothe-\nsize that \ufb01ne-tuning these models to predict recordings of brain activity of people\nreading text will lead to representations that encode more brain-activity-relevant\nlanguage information. We demonstrate that a version of BERT, a recently intro-\nduced and powerful language model, can improve the prediction of brain activity\nafter \ufb01ne-tuning. We show that the relationship between language and brain activity\nlearned by BERT during this \ufb01ne-tuning transfers across multiple participants. We\nalso show that, for some participants, the \ufb01ne-tuned representations learned from\nboth magnetoencephalography (MEG) and functional magnetic resonance imaging\n(fMRI) are better for predicting fMRI than the representations learned from fMRI\nalone, indicating that the learned representations capture brain-activity-relevant\ninformation that is not simply an artifact of the modality. While changes to lan-\nguage representations help the model predict brain activity, they also do not harm\nthe model\u2019s ability to perform downstream NLP tasks. 
Our \ufb01ndings are notable for\nresearch on language understanding in the brain.\n\n1\n\nIntroduction\n\nThe recent successes of self-supervised natural language processing (NLP) models have inspired\nresearchers who study how people process and understand language to look to these NLP models for\nrich representations of language meaning. In these works, researchers present language stimuli to\nparticipants (e.g. reading a chapter of a book word-by-word or listening to a story) while recording\ntheir brain activity with neuroimaging devices (fMRI, MEG, or EEG), and model the recorded brain\nactivity using representations extracted from NLP models for the corresponding text. While this\napproach has opened exciting avenues in understanding the processing of longer word sequences and\ncontext, having NLP models that are speci\ufb01cally designed to capture the way the brain represents\nlanguage meaning may lead to even more insight. We posit that we can introduce a brain-relevant\nlanguage bias in an NLP model by explicitly training the NLP model to predict language-induced\nbrain recordings.\nIn this study we propose that a pretrained language model \u2014 BERT by Devlin et al. (2018) \u2014 which\nis then \ufb01ne-tuned to predict brain activity will modify its language representations to better encode\nthe information that is relevant for the prediction of brain activity. We further propose \ufb01ne-tuning\n\nCode available at https://github.com/danrsc/bert_brain_neurips_2019\n\n33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.\n\n\fFigure 1: General approach\nfor \ufb01ne-tuning BERT using\nfMRI and/or MEG data. A lin-\near layer maps the output to-\nken embeddings from the base\narchitecture to brain activity\nrecordings. Only MEG record-\nings that correspond to content\nwords in the input sequence\nare considered. 
We include\nthe word length and context-\nindependent log-probability of\neach word when predicting\nMEG. fMRI data are predicted\nfrom the pooled embedding of\nthe sequence, i.e.\nthe [CLS]\ntoken embedding. For more\ndetails of the procedure, see\nsection 3.2.\n\nsimultaneously from multiple experiment participants and multiple brain activity recording modalities\nto bias towards representations that generalize across people and recording types. We suggest that\nthis \ufb01ne-tuning can leverage advances in the NLP community while also considering data from brain\nactivity recordings, and thus can lead to advances in our understanding of language processing in the\nbrain.\n\n2 Related Work\n\nThe relationship between language-related brain activity and computational models of natural lan-\nguage (NLP models) has long been a topic of interest to researchers. Multiple researchers have used\nvector-space representations of words, sentences, and stories taken from off-the-shelf NLP models\nand investigated how these vectors correspond to fMRI or MEG recordings of brain activity (Mitchell\net al., 2008; Murphy et al., 2012; Wehbe et al., 2014b,a; Huth et al., 2016; Jain and Huth, 2018;\nPereira et al., 2018). However, few examples of researchers using brain activity to modify language\nrepresentations exist. Fyshe et al. (2014) builds a non-negative sparse embedding for individual\nwords by constraining the embedding to also predict brain activity well, and Schwartz and Mitchell\n(2019) very recently have published an approach similar to ours for predicting EEG data, but most\napproaches combining NLP models and brain activity do not modify language embeddings to predict\nbrain data. 
In Schwartz and Mitchell (2019), the authors predict multiple EEG signals on a dataset\nusing a deep network, but they do not investigate whether the model can transfer its representations\nto new experiment participants or other modalities of brain activity recordings.\nBecause fMRI and MEG/EEG have complementary strengths (high spatial resolution vs. high\ntemporal resolution) there exists a lot of interest in devising learning algorithms that combine both\ntypes of data. One way that fMRI and MEG/EEG have been used together is by using fMRI for better\nsource localization of the MEG/EEG signal (He et al., 2018) (source localization refers to inferring\nthe sources in the brain of the MEG/EEG recorded on the head). Palatucci (2011) uses CCA to map\nbetween MEG and fMRI recordings for the same word. Mapping the MEG data to the common\nspace allows the authors to better decode the word identity than with MEG alone. Cichy et al. (2016)\npropose a way of combining fMRI and MEG data of the same stimuli by computing stimuli similarity\nmatrices for different fMRI regions and MEG time points and \ufb01nding corresponding regions and time\npoints. Fu et al. (2017) proposes a way to estimate a latent space that is high-dimensional both in\ntime and space from simulated fMRI and MEG activity. However, effectively combining fMRI and\nMEG/EEG remains an open research problem.\n\n2\n\n\f3 Methods\n\n3.1 MEG and fMRI data\n\nIn this analysis, we use magnetoencephalography (MEG) and functional magnetic resonance imaging\n(fMRI) data recorded from people as they read a chapter from Harry Potter and the Sorcerer\u2019s Stone\nRowling (1999). The MEG and fMRI experiments were shared respectively by the authors of Wehbe\net al. (2014a) at our request and Wehbe et al. (2014b) online1. In both experiments the chapter was\npresented one word at a time, with each word appearing on a screen for 0.5 seconds. 
The chapter\nincluded 5176 words.\nMEG was recorded from nine experiment participants using an Elekta Neuromag device (data for\none participant had too many artifacts and was excluded, leaving 8 participants). This machine has\n306 sensors distributed into 102 locations on the surface of the participant\u2019s head. The sampling\nfrequency was 1kHz. The Signal Space Separation method (SSS) (Taulu et al., 2004) was used to\nreduce noise, and it was followed by its temporal extension (tSSS) (Taulu and Simola, 2006). The\nsignal in every sensor was downsampled into 25ms non-overlapping time bins, meaning that each\nword in our data is associated with a 306 sensor \u00d7 20 time points image.\nThe fMRI data of nine experiment participants were comprised of 3 \u00d7 3 \u00d7 3mm voxels. Data were\nslice-time and motion corrected using SPM8 (Kay et al., 2008). The data were then detrended in time\nand spatially smoothed with a 3mm full-width-half-max kernel. The brain surface of each subject\nwas reconstructed using Freesurfer (Fischl, 2012), and a thick grey matter mask was obtained to\nselect the voxels with neuronal tissue. For each subject, 50000-60000 voxels were kept after this\nmasking. We use Pycortex (Gao et al., 2015) to handle and plot the fMRI data.\n\n3.2 Model architecture\n\nIn our experiments, we build on the BERT architecture (Devlin et al., 2018), a specialization of a\ntransformer network (Vaswani et al., 2017). Each block of layers in the network applies a transfor-\nmation to its input embeddings by \ufb01rst applying self-attention (combining together the embeddings\nwhich are most similar to each other in several latent aspects). These combined embeddings are then\nfurther transformed to produce new features for the next block of layers. We use the PyTorch version\nof the BERT code provided by Hugging Face2 with the pretrained weights provided by Devlin et al.\n(2018). 
This model includes 12 blocks of layers, and has been trained on the BooksCorpus (Zhu et al.,\n2015) as well as Wikipedia to predict masked words in text and to classify whether two sequences\nof words are consecutive in text or not. Two special tokens are attached to each input sequence in\nthe BERT architecture. The [SEP] token is used to signal the end of a sequence, and the [CLS]\ntoken is trained to be a sequence-level representation of the input using the consecutive-sequence\nclassi\ufb01cation task. Fine-tuned versions of this pretrained BERT model have achieved state of the art\nperformance in several downstream NLP tasks, including the GLUE benchmark tasks (Wang et al.,\n2018). The recommended procedure for \ufb01ne-tuning BERT is to add a simple linear layer that maps\nthe output embeddings from the base architecture to a prediction task of interest. With this linear\nlayer included, the model is \ufb01ne-tuned end-to-end, i.e. all of the parameters of the model change\nduring \ufb01ne-tuning. For the most part, we follow this recommended procedure in our experiments.\nOne slight modi\ufb01cation we make is that in addition to using the output layer of the base model, we\nalso concatenate to this output layer the word length and context-independent log-probability of\neach word (see Figure 1). Both of these word properties are known to modulate behavioral data\nand brain activity (Rayner, 1998; Van Petten and Kutas, 1990). When a single word is broken into\nmultiple word-pieces by the BERT tokenizer, we attach this information to the \ufb01rst token and use\ndummy values (0 for word length and -20 for the log probability) for the other tokens. We use these\nsame dummy values for the special [CLS] and [SEP] tokens. 
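The way these two word properties are attached to tokens can be sketched as follows. This is a minimal illustration under stated assumptions, not the code from the authors' repository: the function name token_word_features and its arguments are hypothetical, and word length is assumed to be measured in characters.

```python
import numpy as np

DUMMY_LENGTH = 0.0       # dummy word length given in the text
DUMMY_LOG_PROB = -20.0   # dummy log-probability given in the text


def token_word_features(words, pieces_per_word, word_log_probs):
    '''Build one (word length, log-probability) row per input token.

    words: the words of the input sequence (hypothetical input format).
    pieces_per_word: number of word-pieces the tokenizer produced per word.
    word_log_probs: context-independent log-probability of each word.
    Assumption: word length is the number of characters in the word.
    '''
    dummy = [DUMMY_LENGTH, DUMMY_LOG_PROB]
    rows = [dummy]  # [CLS] token gets the dummy values
    for word, num_pieces, log_prob in zip(words, pieces_per_word, word_log_probs):
        rows.append([float(len(word)), log_prob])  # first word-piece carries the values
        rows.extend([dummy] * (num_pieces - 1))    # remaining word-pieces get dummies
    rows.append(dummy)  # [SEP] token gets the dummy values
    return np.array(rows)


features = token_word_features(['the', 'quidditch'], [1, 3], [-3.0, -12.5])
print(features.shape)
```

Each row would then be concatenated to the output embedding of the corresponding token before the linear layer is applied.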
Because the time resolution of fMRI\nimages is too low to resolve single words, we use the pooled output of BERT to predict fMRI data.\nIn the pretrained model, the pooled representation of a sequence is a transformed version of the\nembedding of the [CLS] token, which is passed through a hidden layer and then a tanh function.\nWe find empirically that using the [CLS] output embedding directly works better than using this\ntransformation, so we use the [CLS] output embedding as our pooled embedding.\n\n1http://www.cs.cmu.edu/~fmri/plosone/\n2https://github.com/huggingface/pytorch-pretrained-BERT/\n\n3\n\n\f3.3 Procedure\n\nInput to the model. We are interested in modifying the pretrained BERT model to better capture\nbrain-relevant language information. We approach this by training the model to predict both fMRI\ndata and MEG data, each recorded (at different times from different participants) while experiment\nparticipants read a chapter of the same novel. fMRI records the blood-oxygenation-level dependent\n(BOLD) response, i.e. the relative amount of oxygenated blood in a given area of the brain, which is\na function of how active the neurons are in that area of the brain. However, the BOLD response peaks\n5 to 8 seconds after the activation of neurons in a region (Nishimoto et al., 2011; Wehbe et al., 2014b;\nHuth et al., 2016). Because of this delay, we want a model which predicts brain activity to have access\nto the words that precede the timepoint at which the fMRI image is captured. Therefore, we use the\n20 words (which cover 10 seconds) leading up to each fMRI image as input to our model,\nirrespective of sentence boundaries. In contrast to the fMRI recordings, MEG recordings have much\nhigher time resolution. For each word, we have 20 timepoints from 306 sensors. In our experiments\nwhere MEG data are used, the model makes a prediction for all of these 6120 = 306 \u00d7 20 values for\neach word. 
However, we only train and evaluate the model on content words. We de\ufb01ne a content\nword as any word which is an adjective, adverb, auxiliary verb, noun, pronoun, proper noun, or verb\n(including to-be verbs). If the BERT tokenizer breaks a word into multiple tokens, we attach the\nMEG data to the \ufb01rst token for that word. We align the MEG data with all content words in the fMRI\nexamples (i.e. the content words of the 20 words which precede each fMRI image).\n\nCross-validation. The fMRI data were recorded in four separate runs in the scanner for each\nparticipant. The MEG data were also recorded in four separate runs using the same division of the\nchapter as fMRI. We cross-validate over the fMRI runs. For each fMRI run, we train the model using\nthe examples from the other three runs and use the fourth run to evaluate the model.\n\nPreprocessing. To preprocess the fMRI data, we exclude the \ufb01rst 20 and \ufb01nal 15 fMRI images\nfrom each run to avoid warm-up and boundary effects. Words associated with these excluded images\nare also not used for MEG predictions. We linearly detrend the fMRI data within run, and standardize\nthe data within run such that the variance of each voxel is 1 and the mean value of each voxel is 0\nover the examples in the run. The MEG data is also detrended and standardized within fMRI run (i.e.\nwithin cross-validation fold) such that each time-sensor component has mean 0 and variance 1 over\nall of the content words in the run.\n\n3.4 Models and experiments\n\nIn this study, we are interested in demonstrating that by \ufb01ne-tuning a language model to predict\nbrain activity, we can bias the model to encode brain-relevant language information. We also wish to\nshow that the information the model encodes generalizes across multiple experiment participants,\nand multiple modalities of brain activity recording. 
For the current work, we compare the models we\ntrain to each other only in terms of how well they predict the fMRI data of the nine fMRI experiment\nparticipants, but in some cases we use MEG data to bias the model in our experiments. In all of\nour models, we use a base learning rate of 5 \u00d7 10\u22125. The learning rate increases linearly from 0 to\n5 \u00d7 10\u22125 during the \ufb01rst 10% of the training epochs and then decreases linearly back to 0 during the\nremaining epochs. We use mean squared error as our loss function in all models. We vary the number\nof epochs we use for training our models, based primarily on observations of when the models seem\nto begin to converge or over\ufb01t, but we match all of the hyperparameters between two models we are\ncomparing. We also seed random initializations and allocate the same model parameters across our\nvariations so that the initializations are consistent between each pair of models we compare.\n\nVanilla model. As a baseline, for each experiment participant, we add a linear layer to the pretrained\nBERT model and train this linear layer to map from the [CLS] token embedding to the fMRI data of\nthat participant. The pretrained model parameters are frozen during this training, so the embeddings\ndo not change. We refer to this model as the vanilla model. This model is trained for either 10, 20, or\n30 epochs depending on which model we are comparing this to.\n\nParticipant-transfer model. To investigate whether the relationship between text and brain activity\nlearned by a \ufb01ne-tuned model transfers across experiment participants, we \ufb01rst \ufb01ne-tune the model\non the participant who had the most predictable brain activity. During this \ufb01ne-tuning, we train only\n\n4\n\n\fthe linear layer for 2 epochs, followed by 18 epochs of training the entire model. 
Then, for each other\nexperiment participant, we \ufb01x the model parameters, and train a linear layer on top of the model\ntuned towards the \ufb01rst participant. These linear-only models are trained for 10 epochs, and compared\nto the vanilla 10 epoch model.\n\nFine-tuned model. To investigate whether a model \ufb01ne-tuned to predict each participant\u2019s data\nlearns something beyond the linear mapping in the vanilla model, we \ufb01ne-tune a model for each\nparticipant. We train only the linear layer of these models for 10 epochs, followed by 20 epochs of\ntraining the entire model.\n\nMEG-transfer model. We use this model to investigate whether the relationship between text and\nbrain activity learned by a model \ufb01ne-tuned on MEG data transfers to fMRI data. We \ufb01rst \ufb01ne-tune\nthis model by training it to predict all eight MEG experiment participants\u2019 data (jointly). The MEG\ntraining is done by training only the linear output layer for 10 epochs, followed by 20 epochs of\ntraining the full model. We then take the MEG \ufb01ne-tuned model and train it to predict each fMRI\nexperiment participant\u2019s data. This training also uses 10 epochs of only training the linear output\nlayer followed by 20 epochs of full \ufb01ne-tuning.\n\nFully joint model. Finally, we train a model to simultaneously predict all of the MEG experiment\nparticipants\u2019 data and the fMRI experiment participants\u2019 data. We train only the linear output layer of\nthis model for 10 epochs, followed by 50 epochs of training the full model.\n\nEvaluating model performance for brain prediction using the 20 vs. 20 test. We evaluate the\nquality of brain predictions made by a particular model by using the brain prediction in a classi\ufb01cation\ntask on held-out data, in a four-fold cross-validation setting. 
The classi\ufb01cation task is to predict which\nof two sets of words was being read by the participant (Mitchell et al., 2008; Wehbe et al., 2014b,a).\nWe begin by randomly sampling 20 examples from one of the fMRI runs. For each voxel, we take\nthe true voxel values for these 20 examples and concatenate them together \u2013 this will be the target\nfor that voxel. Next, we randomly sample a different set of 20 examples from the same fMRI run.\nWe take the true voxel values for these 20 examples and concatenate them together \u2013 this will be our\ndistractor. Next we compute the Euclidean distance between the voxel values predicted by a model\non the target examples and the true voxel values on the target, and we compute the Euclidean distance\nbetween these same predicted voxel values and the true voxel values on the distractor examples. If the\ndistance from the prediction to the target is less than the distance from the prediction to the distractor,\nthen the sample has been accurately classi\ufb01ed. We repeat this sampling procedure 1000 times to get\nan accuracy value for each voxel in the data. We observe that evaluating model performance using\nproportion of variance explained leads to qualitatively similar results (see Supplementary Figure 4),\nbut we \ufb01nd the classi\ufb01cation metric more intuitive and use it throughout the remainder of the paper.\n\n4 Results\n\nFine-tuned models predict fMRI data better than vanilla BERT. The \ufb01rst issue we were inter-\nested in resolving is whether \ufb01ne-tuning a language model is any better for predicting brain activity\nthan using regression from the pretrained BERT model. To show that it is, we train the \ufb01ne-tuned\nmodel and compare it to the vanilla model by computing the accuracies of each model on the 20 vs.\n20 classi\ufb01cation task described in section 3.4. 
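The 20 vs. 20 test described in section 3.4 can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the code from the authors' repository: the function name twenty_vs_twenty and its arguments are hypothetical, and the distractor set is assumed to be disjoint from the target set.

```python
import numpy as np


def twenty_vs_twenty(predicted, actual, num_samples=1000, set_size=20, seed=0):
    '''Per-voxel accuracy on the 20 vs. 20 test for one held-out run.

    predicted, actual: arrays of shape (num_examples, num_voxels).
    Returns an array of shape (num_voxels,) with values in [0, 1].
    '''
    rng = np.random.default_rng(seed)
    num_examples, num_voxels = actual.shape
    correct = np.zeros(num_voxels)
    for _ in range(num_samples):
        # Assumption: target and distractor examples do not overlap.
        picks = rng.choice(num_examples, size=2 * set_size, replace=False)
        target, distractor = picks[:set_size], picks[set_size:]
        # Euclidean distance over the 20 concatenated values, computed per voxel.
        dist_target = np.linalg.norm(predicted[target] - actual[target], axis=0)
        dist_distractor = np.linalg.norm(predicted[target] - actual[distractor], axis=0)
        # A sample is classified correctly when the prediction is closer to the target.
        correct += dist_target < dist_distractor
    return correct / num_samples
```

With one call per held-out run, the returned per-voxel accuracies could then be compared between two models.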
Figure 2 shows the difference in accuracy between the\ntwo models, with the difference computed at a varying number of voxels, starting with those that are\npredicted well by one of the two models and adding in voxels that are less and less well predicted\nby either. Figure 3 shows where on the brain the predictions differ between the two models, giving\nstrong evidence that areas in the brain associated with language processing are predicted better by the\nfine-tuned models (Fedorenko and Thompson-Schill, 2014).\n\nRelationships between text and brain activity generalize across experiment participants. The\nnext issue we are interested in understanding is whether a model that is fine-tuned on one participant\ncan fit a second participant\u2019s brain activity if the model parameters are frozen (so we only do a linear\nregression from the output embeddings of the fine-tuned model to the brain activity of the second\nparticipant). We call this the participant-transfer model. We fine-tune BERT on the experiment\nparticipant with the most predictable brain activity, and then compare that model to vanilla BERT.\n\n5\n\n\fFigure 2: Comparison of accuracies of various models. In each quadrant of the figure above, we\ncompare two models. Voxels are sorted on the x-axis in descending order of the maximum of the\ntwo models\u2019 accuracies in the 20 vs. 20 test (described in section 3.4). The colored lines (one per\nparticipant) show differences between the two models\u2019 mean accuracies, where the mean is taken\nover all voxels to the left of each x-coordinate. In (a)-(c), shaded regions show the standard deviation\nover 100 model initializations \u2013 that computation was not tractable in our framework for (d). The\nblack line is the mean over all participants. In (a), (c), and (d), it is clear that the fine-tuned models\nare more accurate in predicting voxel activity than the vanilla model for a large number of voxels. 
In\n(b), the MEG-transfer model seems to have roughly the same accuracy as a model fine-tuned only on\nfMRI data, but in Figure 3 we see that in language regions the MEG-transfer model appears to be\nmore accurate.\n\nVoxels are predicted more accurately by the participant-transfer model than by the vanilla model (see\nFigure 2, lower left), indicating that we do get a transfer learning benefit.\n\nUsing MEG data can improve fMRI predictions.\nIn a third comparison, we investigate whether\na model can benefit from both MEG and fMRI data. We begin with the vanilla BERT model, fine-tune\nit to predict MEG data (we jointly train on eight MEG experiment participants), and then fine-tune the\nresulting model on fMRI data (separate models for each fMRI experiment participant). We see mixed\nresults from this experiment. For some participants, there is a marginal improvement in prediction\naccuracy when MEG data is included compared to when it is not, while for others training first on\nMEG data is worse or makes no difference (see Figure 2, upper right). Figure 3 shows, however, that\nfor many of the participants, we see improvements in language areas despite the mean difference in\naccuracy being small.\n\nA single model can be used to predict fMRI activity across multiple experiment participants.\nWe compare the performance of a model trained jointly on all fMRI experiment participants and all\nMEG experiment participants to vanilla BERT (see Figure 2, lower right). We don\u2019t find that this\nmodel yet outperforms models trained individually for each participant, but it nonetheless outperforms\n\n6\n\n[Figure 2 panel titles: (a) Fine-tuned vs. vanilla; (b) MEG-transfer vs. fine-tuned; (c) Participant-transfer vs. vanilla; (d) Joint vs. vanilla]\n\n\fFigure 3: Comparison of accuracies on the 20 vs. 20 classification task (described in section 3.4) at a\nvoxel level for all 9 participants we analyzed. 
Each column shows the in\ufb02ated lateral view of the left\nhemisphere for one experiment participant. Moving from the top to third row, models 1 and 2 are\nrespectively, the vanilla model and the \ufb01ne-tuned model, the vanilla model and the participant-transfer\nmodel, and the \ufb01ne-tuned model and MEG-transfer model. The leftmost column is the participant on\nwhom the participant-transfer model is trained. Columns with a grey background indicate participants\nwho are common between the fMRI and MEG experiments. Only voxels which were signi\ufb01cantly\ndifferent between the two models according to a related-sample t-test and corrected for false discovery\nrate at a .01 level using the Benjamini\u2013Hochberg procedure (Benjamini and Hochberg, 1995) are\nshown. The color-map is set independently for each of the participants and comparisons shown such\nthat the reddest value is at the 95th percentile of the absolute value of the signi\ufb01cant differences\nand the bluest value is at the negative of this reddest value. We observe that both the \ufb01ne-tuned\nand participant-transfer models outperform the vanilla model, especially in most regions that are\nconsidered to be part of the language network. As a reference, we show an approximation of the\nlanguage network for each participant in the fourth row. These were approximated using an updated\nversion of the Fedorenko et al. (2010) language functional parcels3, corresponding to areas of high\noverlap of the brain activations of 220 subjects for a \u201csentences>non-word\" contrast. The parcels\nwere transformed using Pycortex (Gao et al., 2015) to each participant\u2019s native space. The set of\nlanguage parcels therefore serve as a strong prior for the location of the language system in each\nparticipant. Though the differences are much smaller in the third row than in the \ufb01rst two, we also see\nbetter performance in language regions when MEG data is included in the training procedure. 
Even\nin participants where performance is worse overall (e.g. the fifth and sixth columns of the third row),\nvoxels where performance improves appear to be systematically distributed according to language\nprocessing function. Right hemisphere and medial views are available in the supplementary material.\n\nvanilla BERT. This demonstrates the feasibility of fully joint training and we think that with the right\nhyperparameters, this model can perform as well as or better than individually trained models.\n\nNLP tasks are not harmed by fine-tuning. We run two of our models (the MEG-transfer model\nand the fully joint model) on the GLUE benchmark (Wang et al., 2018), and compare the results to\nstandard BERT (Devlin et al., 2018) (see Table 1). These models were chosen because we thought\nthey had the best chance of giving us interesting GLUE results, and they were the only two models\nwe ran GLUE on. Apart from the semantic textual similarity (STS-B) task, all tasks\nare very slightly improved on the development sets after the model has been fine-tuned on brain\nactivity data. The STS-B task results are very slightly worse than the results for standard BERT. The\nfine-tuning may or may not be helping the model to perform these NLP tasks, but it clearly does not\nharm performance in these tasks.\n\nFine-tuning reduces [CLS] token attention to [SEP] token. We evaluate how the attention in\nthe model changes after fine-tuning on the brain recordings by contrasting the model attention in\nthe fine-tuned and vanilla models described in section 3.4. 
We focus on the attention from the\n[CLS] token to other tokens in the sequence because we use the [CLS] token as the pooled output\n\n3https://evlab.mit.edu/funcloc/download-parcels\n\n7\n\n[Figure 3 labels: Model 2; Model 1; Anatomical language regions]\n\n\fMetric\nCoLA\nSST-2\nMRPC (Acc.)\nMRPC (F1)\nSTS-B (Pears.)\nSTS-B (Spear.)\nQQP (Acc.)\nQQP (F1)\nMNLI-m\nMNLI-mm\nQNLI\nRTE\nWNLI\n\nVanilla MEG Joint\n57.97\n57.29\n91.62\n93.00\n84.04\n83.82\n88.85\n88.91\n89.70\n88.60\n89.37\n88.23\n90.87\n90.72\n87.69\n87.41\n83.95\n84.08\n85.15\n84.39\n91.49\n89.04\n62.02\n61.01\n53.52\n51.97\n\n57.63\n93.23\n83.97\n88.93\n89.32\n88.87\n91.06\n87.91\n84.26\n84.65\n91.73\n65.42\n53.80\n\nTable 1: GLUE benchmark results for the\nGLUE development sets. We compare the results\nof two of our models to the results published\nby https://github.com/huggingface/pytorch-pretrained-BERT/ for the pretrained\nBERT model. The model labeled \u2018MEG\u2019 is the\nMEG transfer model described in section 3.4. The\nmodel labeled \u2018Joint\u2019 is the fully joint model also\ndescribed in section 3.4. For all but one task, at\nleast one of our two models is marginally better\nthan the pretrained model. These results suggest\nthat fine-tuning does not diminish \u2013 and possibly\neven enhances \u2013 the model\u2019s ability to perform\nNLP tasks.\n\nFigure 4: Comparison of attention from the\n[CLS] token to the [CLS] and [SEP] tokens between\nvanilla BERT and the fine-tuned BERT\n(mean and standard error over example presentations,\nattention heads, and different initialization\nruns). The attention from the [CLS] token noticeably\nshifts away from the [SEP] token in layers\n8 and 9.\n\nrepresentation of the input sequence. We observe that the [CLS] token from the fine-tuned model\nputs less attention on the [SEP] token in layers 8 and 9, when compared to the [CLS] token from the\nvanilla model (see Figure 4). Clark et al. 
(2019) suggest that attention to the [SEP] token in BERT is\nused as a no-op, when the function of the head is not currently applicable. Our observations that the\n\ufb01ne-tuning reduces [CLS] attention to the [SEP] token can be interpreted in these terms. However,\nfurther analysis is needed to understand whether this reduction in attention is speci\ufb01cally due to the\ntask of predicting fMRI recordings or generally arises during \ufb01ne-tuning on any task.\n\nFine-tuning may change motion-related representations\nIn an effort to understand how the\nrepresentations in BERT change when it is \ufb01ne-tuned to predict brain activity, we examine the\nprevalence of various features in the examples where prediction accuracy changes the most after\n\ufb01ne-tuning compared to the prevalence of those features in other examples. We score how much the\nprediction accuracy of each example changes after \ufb01ne-tuning by looking at the percent change in\nEuclidean distance between the prediction and the target for our best participant on a set of voxels\nthat we manually select which are likely to be language-related based on spatial location. We average\nthese percent changes over all runs of the model, which gives us 25 samples per example. We take\nall examples where the absolute value of this average percent change is at least 10% as our set of\nchanged examples, giving us 146 changed examples and leaving 1022 unchanged examples. We\nthen compute the probability that each feature of interest appears on a word in a changed example\nand compare this to the probability that the feature appears on a word in an unchanged example,\nusing bootstrap resampling on the examples with 100 bootstrap-samples to estimate a standard\nerror on these probabilities. The features we evaluate come from judgments done by Wehbe et al.\n(2014b) and are available online4. 
The sample sizes in this analysis are relatively small, so the results should be\nviewed as preliminary. However, we see that verbs describing movement and\nimperative language are more prevalent in examples where accuracies change during fine-tuning. See\nthe supplementary material for further discussion and plots of the analysis.\n\n4 http://www.cs.cmu.edu/~fmri/plosone/\n\n[Figure 4 plot. x-axis: layer; y-axis: average attention over heads; series: vanilla CLS->CLS, vanilla CLS->SEP, fine-tuned CLS->CLS, fine-tuned CLS->SEP]\n\n5 Discussion\n\nThis study aimed to show that it is possible to learn generalizable relationships between text and brain\nactivity by fine-tuning a language model to predict brain activity. We believe that our results provide\nseveral lines of evidence that this hypothesis holds.\nFirst, because a model which is fine-tuned to predict brain activity tends to have higher accuracy than\na model which just computes a regression between standard contextualized-word embeddings and\nbrain activity, the fine-tuning must be changing something about how the model encodes language to\nimprove this prediction accuracy.\nSecond, because the embeddings produced by a model fine-tuned on one experiment participant better\nfit a second participant\u2019s brain activity than the embeddings from the vanilla model (as evidenced by\nour participant-transfer experiment), the changes the model makes to how it encodes language during\nfine-tuning at least partially generalize to new participants.\nThird, for some participants, when a model is fine-tuned on MEG data, the resulting changes to the\nlanguage-encoding that the model uses benefit subsequent training on fMRI data compared to starting\nwith a vanilla language model. 
This suggests that the changes to the language representations induced\nby the MEG data are not entirely imaging modality-specific, and that indeed the model is learning the\nrelationship between language and brain activity as opposed to the relationship between language\nand a brain activity recording modality.\nModels which have been fine-tuned to predict brain activity are no worse at NLP tasks than the\nvanilla BERT model, which suggests that the changes made to how language is represented improve\na model\u2019s ability to predict brain activity without doing harm to how well the representations work\nfor language processing itself. We suggest that this is evidence that the model is learning to encode\nbrain-activity-relevant language information, i.e., that fine-tuning biases the model to learn representations\nwhich are better correlated with the representations used by people. It is non-trivial to understand exactly\nhow the representations the model uses are modified, but we investigate this by examining how the\nmodel\u2019s attention mechanism changes, and by looking at which language features are more likely\nto appear on examples that are better predicted after fine-tuning. We believe that a more thorough\ninvestigation into how model representations change when biased by brain activity is a very exciting\ndirection for future work.\nFinally, we show that a model which is jointly trained to predict MEG data from multiple experiment\nparticipants and fMRI data from multiple experiment participants can more accurately predict fMRI\ndata for those participants than a linear regression from a vanilla language model. This demonstrates\nthat a single model can make predictions for all experiment participants \u2013 further evidence that the\nchanges to the language representations learned by the fine-tuned model are generalizable. 
There are\noptimization issues that remain unsolved in jointly training a model, but we believe that ultimately\na jointly trained model will be better for predicting brain activity than models trained on a single experiment\nparticipant or trained in sequence on multiple participants.\n\n6 Conclusion\n\nFine-tuning language models to predict brain activity is a new paradigm in learning about human\nlanguage processing. The technique is very adaptable. Because it relies on encoding information from\ntargets of a prediction task into the model parameters, the same model can be applied to prediction\ntasks with different sizes and with varying temporal and spatial resolution. Additionally, it provides\nan elegant way to leverage massive data sets in the study of human language processing. To be sure,\nmore research needs to be done on how best to optimize these models to take advantage of multiple\nsources of information about language processing in the brain and on improving training methods for\nthe low signal-to-noise-ratio setting of brain activity recordings. Nonetheless, this study demonstrates\nthe feasibility of biasing language models to learn relationships between text and brain activity. We\nbelieve that this presents an exciting opportunity for researchers who are interested in understanding\nmore about human language processing, and that the methodology opens new and interesting avenues\nof exploration.\n\nAcknowledgments\n\nThis work is supported in part by National Institutes of Health grant no. U01NS098969 and in part\nby the National Science Foundation Graduate Research Fellowship under Grant No. DGE1745016.\n\nReferences\n\nBenjamini, Y. and Hochberg, Y. (1995). Controlling the false discovery rate: a practical and powerful\napproach to multiple testing. Journal of the Royal Statistical Society: Series B (Methodological),\n57(1), 289\u2013300.\n\nCichy, R. M., Pantazis, D., and Oliva, A. (2016). 
Similarity-based fusion of MEG and fMRI reveals\nspatio-temporal dynamics in human cortex during visual object recognition. Cerebral Cortex,\n26(8), 3563\u20133579.\n\nClark, K., Khandelwal, U., Levy, O., and Manning, C. D. (2019). What does BERT look at? An analysis\nof BERT\u2019s attention. arXiv preprint arXiv:1906.04341.\n\nDevlin, J., Chang, M.-W., Lee, K., and Toutanova, K. (2018). BERT: Pre-training of deep bidirectional\ntransformers for language understanding. arXiv preprint arXiv:1810.04805.\n\nFedorenko, E. and Thompson-Schill, S. L. (2014). Reworking the language network. Trends in\ncognitive sciences, 18(3), 120\u2013126.\n\nFedorenko, E., Hsieh, P.-J., Nieto-Casta\u00f1\u00f3n, A., Whitfield-Gabrieli, S., and Kanwisher, N. (2010).\nNew method for fMRI investigations of language: defining ROIs functionally in individual subjects.\nJournal of neurophysiology, 104(2), 1177\u20131194.\n\nFischl, B. (2012). FreeSurfer. Neuroimage, 62(2), 774\u2013781.\n\nFu, X., Huang, K., Stretcu, O., Song, H. A., Papalexakis, E., Talukdar, P., Mitchell, T., Sidiropoulos, N.,\nFaloutsos, C., and Poczos, B. (2017). Brainzoom: High resolution reconstruction from multi-modal\nbrain signals. In Proceedings of the 2017 SIAM International Conference on Data Mining, pages\n216\u2013227. SIAM.\n\nFyshe, A., Talukdar, P. P., Murphy, B., and Mitchell, T. M. (2014). Interpretable semantic vectors\nfrom a joint model of brain- and text-based meaning. In Proceedings of the 52nd Annual Meeting\nof the Association for Computational Linguistics, volume 1, pages 489\u2013499.\n\nGao, J. S., Huth, A. G., Lescroart, M. D., and Gallant, J. L. (2015). Pycortex: an interactive surface\nvisualizer for fMRI. Frontiers in neuroinformatics, 9, 23.\n\nHe, B., Sohrabpour, A., Brown, E., and Liu, Z. (2018). Electrophysiological source imaging: a\nnoninvasive window to brain dynamics. Annual review of biomedical engineering, 20, 171\u2013196.\n\nHuth, A. G., de Heer, W. 
A., Griffiths, T. L., Theunissen, F. E., and Gallant, J. L. (2016). Natural\nspeech reveals the semantic maps that tile human cerebral cortex. Nature, 532(7600), 453\u2013458.\n\nJain, S. and Huth, A. (2018). Incorporating context into language encoding models for fMRI. bioRxiv,\npage 327601.\n\nKay, K. N., Naselaris, T., Prenger, R. J., and Gallant, J. L. (2008). Identifying natural images from\nhuman brain activity. Nature, 452(7185), 352.\n\nMitchell, T. M., Shinkareva, S. V., Carlson, A., Chang, K.-M., Malave, V. L., Mason, R. A., and Just,\nM. A. (2008). Predicting human brain activity associated with the meanings of nouns. Science,\n320(5880), 1191\u20131195.\n\nMurphy, B., Talukdar, P., and Mitchell, T. (2012). Selecting corpus-semantic models for neurolinguistic\ndecoding. In Proceedings of the First Joint Conference on Lexical and Computational\nSemantics-Volume 1: Proceedings of the main conference and the shared task, and Volume 2: Proceedings\nof the Sixth International Workshop on Semantic Evaluation, pages 114\u2013123. Association\nfor Computational Linguistics.\n\nNishimoto, S., Vu, A., Naselaris, T., Benjamini, Y., Yu, B., and Gallant, J. (2011). Reconstructing\nvisual experiences from brain activity evoked by natural movies. Current Biology.\n\nPalatucci, M. M. (2011). Thought recognition: predicting and decoding brain activity using the\nzero-shot learning model.\n\nPereira, F., Lou, B., Pritchett, B., Ritter, S., Gershman, S. J., Kanwisher, N., Botvinick, M., and\nFedorenko, E. (2018). Toward a universal decoder of linguistic meaning from brain activation.\nNature communications, 9(1), 963.\n\nRayner, K. (1998). Eye movements in reading and information processing: 20 years of research.\nPsychological bulletin, 124(3), 372.\n\nRowling, J. K. (1999). Harry Potter and the Sorcerer\u2019s Stone, volume 1. Scholastic, New York, 1\nedition.\n\nSchwartz, D. and Mitchell, T. (2019). 
Understanding language-elicited EEG data by predicting it\nfrom a fine-tuned language model. In Proceedings of the 2019 Conference of the North American\nChapter of the Association for Computational Linguistics: Human Language Technologies, Volume\n1 (Long and Short Papers), pages 43\u201357.\n\nTaulu, S. and Simola, J. (2006). Spatiotemporal signal space separation method for rejecting nearby\ninterference in MEG measurements. Physics in Medicine & Biology, 51(7), 1759.\n\nTaulu, S., Kajola, M., and Simola, J. (2004). Suppression of interference and artifacts by the signal\nspace separation method. Brain topography, 16(4), 269\u2013275.\n\nVan Petten, C. and Kutas, M. (1990). Interactions between sentence context and word\nfrequency in event-related brain potentials. Memory & cognition, 18(4), 380\u2013393.\n\nVaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, \u0141., and\nPolosukhin, I. (2017). Attention is all you need. In Advances in neural information processing\nsystems, pages 5998\u20136008.\n\nWang, A., Singh, A., Michael, J., Hill, F., Levy, O., and Bowman, S. R. (2018). GLUE: A\nmulti-task benchmark and analysis platform for natural language understanding. arXiv preprint\narXiv:1804.07461.\n\nWehbe, L., Vaswani, A., Knight, K., and Mitchell, T. M. (2014a). Aligning context-based statistical\nmodels of language with brain activity during reading. In Proceedings of the 2014 Conference\non Empirical Methods in Natural Language Processing (EMNLP), pages 233\u2013243, Doha, Qatar.\nAssociation for Computational Linguistics.\n\nWehbe, L., Murphy, B., Talukdar, P., Fyshe, A., Ramdas, A., and Mitchell, T. M. (2014b). Simultaneously\nuncovering the patterns of brain regions involved in different story reading subprocesses.\nPLOS ONE, 9(11): e112575.\n\nZhu, Y., Kiros, R., Zemel, R., Salakhutdinov, R., Urtasun, R., Torralba, A., and Fidler, S. 
(2015).\nAligning books and movies: Towards story-like visual explanations by watching movies and\nreading books. In Proceedings of the IEEE international conference on computer vision, pages\n19\u201327.", "award": [], "sourceid": 7868, "authors": [{"given_name": "Dan", "family_name": "Schwartz", "institution": "Carnegie Mellon University"}, {"given_name": "Mariya", "family_name": "Toneva", "institution": "Carnegie Mellon University"}, {"given_name": "Leila", "family_name": "Wehbe", "institution": "Carnegie Mellon University"}]}