Alex Tamkin, Dan Jurafsky, Noah Goodman
Language exhibits structure at a wide range of scales, from subwords to words, sentences, paragraphs, and documents. We propose building models that isolate scale-specific information in deep representations, and develop methods for encouraging models to learn more about particular scales of interest during training. Our method for creating scale-specific neurons in deep NLP models constrains how a neuron's activation can change across the tokens of an input: we interpret the sequence of activations as a digital signal and filter out parts of its frequency spectrum. This technique enables us to extract scale-specific information from BERT representations: by filtering out different frequencies, we can produce new representations that perform well on part-of-speech tagging (word-level), dialog speech act classification (utterance-level), or topic classification (document-level), while performing poorly on the other tasks. We also present a prism layer for use during training, which constrains different neurons of a BERT model to different parts of the frequency spectrum. Our proposed BERT + Prism model is better able to predict masked tokens using long-range context, and produces individual multiscale representations that achieve comparable or improved performance across all three tasks. Our methods are general and readily applicable to domains beyond language, such as images, audio, and video.
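The core idea of treating each neuron's activations across tokens as a signal and keeping only part of its spectrum can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: it assumes NumPy, uses a real FFT because the abstract does not name a specific transform, and the function name `band_filter` and its `low`/`high` bin indices are hypothetical.

```python
import numpy as np

def band_filter(activations, low, high):
    """Keep only frequency bins in [low, high) along the token axis.

    activations: array of shape (num_tokens, num_neurons), e.g. one
    layer of contextual embeddings for a single input (assumed layout).
    """
    num_tokens = activations.shape[0]
    # Treat each neuron's activations over tokens as a 1-D signal.
    spectrum = np.fft.rfft(activations, axis=0)
    # Zero every frequency bin outside the requested band.
    keep = np.zeros(spectrum.shape[0], dtype=bool)
    keep[low:high] = True
    spectrum[~keep] = 0
    # Transform back to a per-token representation of the same shape.
    return np.fft.irfft(spectrum, n=num_tokens, axis=0)
```

Keeping only the lowest bins leaves components that change slowly across tokens, which correspond to longer scales. Keeping only the highest bins leaves components that change from token to token, which correspond to shorter scales.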