{"title": "Temporal FiLM: Capturing Long-Range Sequence Dependencies with Feature-Wise Modulations.", "book": "Advances in Neural Information Processing Systems", "page_first": 10287, "page_last": 10298, "abstract": "Learning representations that accurately capture long-range dependencies in sequential inputs --- including text, audio, and genomic data --- is a key problem in deep learning. Feed-forward convolutional models capture only feature interactions within finite receptive fields while recurrent architectures can be slow and difficult to train due to vanishing gradients. Here, we propose Temporal Feature-Wise Linear Modulation (TFiLM) --- a novel architectural component inspired by adaptive batch normalization and its extensions --- that uses a recurrent neural network to alter the activations of a convolutional model. This approach expands the receptive field of convolutional sequence models with minimal computational overhead. Empirically, we find that TFiLM significantly improves the learning speed and accuracy of feed-forward neural networks on a range of generative and discriminative learning tasks, including text classification and audio super-resolution.", "full_text": "Temporal FiLM: Capturing Long-Range Sequence\n\nDependencies with Feature-Wise Modulation\n\nSawyer Birnbaum\u221712, Volodymyr Kuleshov\u221712,\n\nS. Zayd Enam1, Pang Wei Koh1, and Stefano Ermon1\n\n{sawyerb,kuleshov,zayd,pangwei,ermon}@cs.stanford.edu\n\n1Stanford University, Stanford, CA\n\n2Afresh Technologies, San Francisco, CA\n\nAbstract\n\nLearning representations that accurately capture long-range dependencies in se-\nquential inputs \u2014 including text, audio, and genomic data \u2014 is a key problem in\ndeep learning. Feed-forward convolutional models capture only feature interactions\nwithin \ufb01nite receptive \ufb01elds while recurrent architectures can be slow and dif\ufb01cult\nto train due to vanishing gradients. 
Here, we propose Temporal Feature-Wise Lin-\near Modulation (TFiLM) \u2014 a novel architectural component inspired by adaptive\nbatch normalization and its extensions \u2014 that uses a recurrent neural network to\nalter the activations of a convolutional model. This approach expands the receptive\n\ufb01eld of convolutional sequence models with minimal computational overhead.\nEmpirically, we \ufb01nd that TFiLM signi\ufb01cantly improves the learning speed and ac-\ncuracy of feed-forward neural networks on a range of generative and discriminative\nlearning tasks, including text classi\ufb01cation and audio super-resolution.\n\n1\n\nIntroduction\n\nIn many application domains of deep learning \u2014 including speech [27, 19], genomics [2], and natural\nlanguage [49] \u2014 data takes the form of long, high-dimensional sequences. The prevalence and\nimportance of sequential inputs has motivated a range of deep architectures speci\ufb01cally designed for\nthis data [20, 31, 50].\nOne of the challenges in processing sequential data is accurately capturing long-range input depen-\ndencies \u2014 interactions between symbols that are far apart in the sequence. For example, in speech\nrecognition, data occurring at the beginning of a recording may in\ufb02uence the conversion of words\nenunciated much later.\nSequential inputs in deep learning are often processed using recurrent neural networks (RNNs)\n[14, 20]. However, training RNNs is often dif\ufb01cult, mainly due to the vanishing gradient problem\n[3]. Feed-forward convolutional approaches are highly effective at processing both images [33] and\nsequential data [31, 51, 16] and are easier to train. 
However, convolutional models only account for feature interactions within finite receptive fields and are not ideally suited to capture long-term dependencies.
In this paper, we introduce Temporal Feature-Wise Linear Modulation (TFiLM), a neural network component that captures long-range input dependencies in sequential inputs by combining elements of convolutional and recurrent approaches. Our component modulates the activations of a convolutional model based on long-range information captured by a recurrent neural network. More specifically, TFiLM parametrizes the rescaling parameters of a batch normalization layer as in earlier work on image stylization [12, 22] and visual question answering [44]. (Table 1 outlines recent work applying feature-wise linear modulation.)

∗These authors contributed equally to this work.

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

We demonstrate real-world applications of TFiLM in both discriminative and generative tasks involving sequential data. We define and study a family of signal processing problems called time series super-resolution, which consists of reconstructing a high-resolution signal from low-resolution measurements. For example, in audio super-resolution, we reconstruct high-quality audio from a low-quality input containing a fraction (15-50%) of the original time-domain samples. Likewise, in genomic super-resolution, we recover high-quality measurements from experimental assays using a limited number of lower-quality measurements.
We observe that TFiLM significantly improves the performance of deep neural networks on a wide range of discriminative classification tasks as well as on complex high-dimensional time series super-resolution problems. 
Interestingly, our model is domain-agnostic, yet outperforms\nmore specialized approaches that use domain-speci\ufb01c features.\n\nFigure 1: The TFiLM layer combines the strengths of convolutional\nand recurrent neural networks. Above: operation of the TFiLM\nlayer with T = 8, C = 2, B = 2, and a pooling factor of 2.\n\nContributions. This work introduces a new architectural component for deep neural networks\nthat combines elements of convolutional and recurrent approaches to better account for long-range\ndependencies in sequence data. We demonstrate the component\u2019s effectiveness on the discriminative\ntask of text classi\ufb01cation and on the generative task of time series super-resolution, which we de\ufb01ne.\nOur architecture outperforms strong baselines in multiple domains and could, inter alia, improve\nspeech compression, reduce the cost of functional genomics experiments, and improve the accuracy\nof text classi\ufb01cation systems.\n\n2 Background\n\nBatch Normalization and Feature-Wise Linear Modulation. Batch normalization (batch\nnorm; [23]) is a widely used technique for stabilizing the training of deep neural networks. In\nthis setting, batch norm takes as input a tensor of activations F \u2208 RT\u00d7C from a 1D convolution layer\n\u2014 where T and C are, respectively, the 1D spatial dimension and the number of channels \u2014 and\nperforms two operations: rescaling F and applying an af\ufb01ne transformation. 
Formally, this produces tensors F̂, F′ ∈ R^{T×C} whose (t, c)-th elements are the following:

F̂_{t,c} = (F_{t,c} − µ_c) / (σ_c + ε),    F′_{t,c} = γ_c F̂_{t,c} + β_c.    (1)

Here, µ_c, σ_c are estimates of the mean and standard deviation for the c-th channel, and γ, β ∈ R^C are trainable parameters that define an affine transformation.
Feature-Wise Linear Modulation (FiLM) [12] extends this idea by allowing γ, β to be functions γ, β : Z → R^C of an auxiliary input z ∈ Z. For example, in feed-forward image style transfer [12], z is an image defining a new style; by using different γ(z), β(z) for each z, the same feed-forward network (using the same weights) can apply different styles to a target image. [11] provides a summary of applications of FiLM layers.

Table 1: Recent work applying feature-wise linear modulation.

Paper                                     | Problem Area         | Base Modality    | Conditioning Modality | Conditioning Architecture
Dhingra et al. [9]                        | QA                   | Text (document)  | Text (query)          | CNN
Perez et al. [44] and de Vries et al. [7] | Visual QA            | Images           | Text                  | RNN/MLP
Dumoulin et al. [12]                      | Style Transfer       | Images           | Images                | CNN
Kim et al. [30]                           | Speech               | Audio            | Self (Sequence)       | Feedforward
Hu et al. [21]                            | Image Classification | Images           | Self                  | Feedforward
This Paper                                | Sequence Analysis    | Audio, Text, DNA | Self (Sequence)       | RNN

Algorithm 1 Temporal Feature-Wise Linear Modulation.
Input: Tensor of 1D convolutional activations F ∈ R^{T×C}, where T, C are, respectively, the temporal dimension and the number of channels, and a block length B. Output: Adaptively normalized tensor of activations F′ ∈ R^{T×C}.

1. Reshape F into a block tensor F^blk ∈ R^{(T/B)×B×C}, defined as F^blk_{b,t,c} = F_{b×B+t,c}.
2. Obtain a representation F^pool ∈ R^{(T/B)×C} of the block tensor by pooling the activations within each block: F^pool_{b,c} = Pool(F^blk_{b,:,c}).
3. Compute a sequence of normalizers γ_b, β_b ∈ R^C for b = 1, 2, ..., T/B using an RNN applied to the pooled blocks: (γ_b, β_b), h_b = RNN(F^pool_{b,:}; h_{b−1}), starting with h_0 = 0.
4. Compute the normalized block tensor F^norm ∈ R^{(T/B)×B×C} as F^norm_{b,t,c} = γ_{b,c} · F^blk_{b,t,c} + β_{b,c}.
5. Reshape F^norm into the output F′ ∈ R^{T×C} as F′_{t,c} = F^norm_{⌊t/B⌋, t mod B, c}.

Recurrent and Convolutional Sequence Models. Sequential data is often modeled using RNNs [20, 38] in a sequence-to-sequence (seq2seq) framework [49]. RNNs are effective on language processing tasks over medium-sized sequences; however, on longer time series RNNs may become difficult to train and computationally impractical [51].
An alternative approach to modeling sequences is to use one-dimensional (1D) convolutions. While convolutional networks are faster and easier to train than RNNs, convolutions have a limited receptive field, and a subsequence of the output depends on only a finite subsequence of the input. This paper introduces a new layer that addresses these limitations.

3 Temporal Feature-Wise Linear Modulation

In this section, we describe a new neural network component called Temporal Feature-Wise Linear Modulation (TFiLM) that effectively captures long-range input dependencies in sequential inputs by combining elements of convolutional and recurrent approaches. At a high level, TFiLM modulates the activations of a convolutional model using long-range information captured by a recurrent neural network.
Specifically, let F ∈ R^{T×C} be a tensor of activations from a 1D convolutional layer (at one datapoint, i.e. 
N = 1 and we drop the batch size dimension N for simplicity); a TFiLM layer takes as input F and applies the following series of transformations. First, F is split along the time axis into blocks of length B to produce F^blk ∈ R^{(T/B)×B×C}. Intuitively, blocks correspond to regions along the spatial dimension in which the activations are closely correlated; for example, when processing audio, blocks could be chosen to correspond to audio samples that define the same phoneme. Next, we compute for each block b affine transformation parameters γ_b, β_b ∈ R^C using an RNN:

F^pool_{b,c} = Pool(F^blk_{b,:,c}),    (γ_b, β_b), h_b = RNN(F^pool_{b,:}; h_{b−1}) for b = 1, 2, ..., T/B,

starting with h_0 = 0, where h_b denotes the hidden state, F^pool ∈ R^{(T/B)×C} is a tensor obtained by pooling along the second dimension of F^blk, and the notation F^blk_{b,:,c} refers to a slice of F^blk along the second dimension. In all our experiments, we use an LSTM and max pooling.
Finally, activations in each block b are normalized by γ_b, β_b to produce a tensor F^norm defined as F^norm_{b,t,c} = γ_{b,c} · F^blk_{b,t,c} + β_{b,c}. Notice that each γ_b, β_b is a function of both the current and all the past blocks. Hence, activations can be modulated using long-range signals captured by the RNN. In the audio example, the super-resolution of a phoneme could depend on previous phonemes beyond the receptive field of the convolution; the RNN enables us to use this long-range information.
Although TFiLM relies on an RNN, it remains computationally tractable because each RNN is small (the dimensionality of its output is only O(C)) and because the RNN is invoked only a small number of times. 
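As a concrete illustration, the block-wise modulation just described can be sketched in NumPy, mirroring the toy sizes of Figure 1 (T = 8, C = 2, block length 2). This is a minimal sketch, not the implementation used in the experiments: a plain tanh RNN stands in for the LSTM, and the weights Wh, Wx, Wo are hypothetical stand-ins for a trained recurrent cell.

```python
import numpy as np

def tfilm(F, block_len, rnn_params):
    """Minimal TFiLM sketch: max-pool each block, run a small RNN over the
    pooled blocks, and use its state to scale and shift each block."""
    T, C = F.shape
    n_blocks = T // block_len
    F_blk = F.reshape(n_blocks, block_len, C)   # split into blocks along time
    F_pool = F_blk.max(axis=1)                  # max-pool within each block
    Wh, Wx, Wo = rnn_params                     # hypothetical RNN weights
    h = np.zeros(Wh.shape[0])
    out = np.empty_like(F_blk)
    for b in range(n_blocks):                   # one RNN step per block
        h = np.tanh(Wh @ h + Wx @ F_pool[b])
        gamma, beta = np.split(Wo @ h, 2)       # (gamma_b, beta_b), each in R^C
        out[b] = gamma * F_blk[b] + beta        # feature-wise modulation
    return out.reshape(T, C)                    # back to (T, C)

# toy example: T = 8, C = 2, block length 2 -> 4 RNN invocations
rng = np.random.default_rng(0)
H = 4  # hidden size of the small RNN
params = (rng.normal(size=(H, H)), rng.normal(size=(H, 2)), rng.normal(size=(2 * 2, H)))
out = tfilm(rng.normal(size=(8, 2)), block_len=2, rnn_params=params)
assert out.shape == (8, 2)
```

Because gamma and beta are read off the recurrent state, the modulation of a block can depend on all earlier blocks, which is the source of the layer's long-range receptive field.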
Consider again the speech example, where blocks are chosen to match phonemes: a 5-second recording contains ≈ 50 phonemes of 0.1 seconds each, yielding only about 50 RNN invocations for 80,000 audio samples at 16 kHz. Despite being small, this RNN also carries useful long-range information, as our experiments demonstrate.

4 Time Series Super-Resolution

In order to demonstrate the real-world applications of TFiLM, we define and study a new generative modeling task called time series super-resolution, which consists of reconstructing a high-resolution signal y = (y_1, y_2, ..., y_T) from low-resolution measurements x = (x_1, x_2, ..., x_T); x, y denote the source and target time series, respectively. For example, y may be a high-quality speech signal while x is a noisy phone recording.
This task is closely inspired by image super-resolution [10, 35], which involves reconstructing a high-resolution image from a low-resolution version. As in image super-resolution, we assume that low- and high-resolution time series x, y have a natural alignment, which can arise, for example, from applying a low-pass filter to y to obtain x. Below, we provide two examples of time series super-resolution problems.

Figure 2: Top: A deep neural network architecture for time series super-resolution that consists of K downsampling blocks followed by a bottleneck layer and K upsampling blocks; features are reused via symmetric residual skip connections. Bottom: Internal structure of downsampling and upsampling convolutional blocks.

Audio Super-Resolution. 
Audio super-resolution (also known as bandwidth extension; [13]) involves predicting a high-quality audio signal from a fraction (15-50%) of its time-domain samples. Note that this is equivalent to predicting the signal's high frequencies from its low frequencies. Formally, given a low-resolution signal x = (x_{1/R_1}, ..., x_{R_1T/R_1}) sampled at a rate R_1 (e.g., a low-quality telephone call), our goal is to reconstruct a high-resolution version y = (y_{1/R_2}, ..., y_{R_2T/R_2}) of x that has a sampling rate R_2 > R_1. We use r = R_2/R_1 to denote the upsampling ratio of the two signals. Thus, we expect that y_{rt/R_2} ≈ x_{t/R_1} for t = 1, 2, ..., R_1T.

Super-Resolution of Genomics Experiments. Many genomics experiments can be seen as taking a real-valued measurement at every position of the genome; experimental results can therefore be represented by a time series. Measurements are generally obtained using a large set of probes (e.g., sequencing reads) that each randomly examine a different position in the genome; the genomic time series is an aggregate of the measurements taken by these probes. In this setting, super-resolution corresponds to reconstructing high-quality experimental measurements taken using a large set of probes from noisy measurements taken using a small set of probes. This process can significantly reduce the cost of scientific experiments. This paper focuses on a particular genomic experiment called chromatin immunoprecipitation sequencing (ChIP-seq) [46].

4.1 A Deep Neural Network Architecture for Time Series Super-Resolution

The TFiLM layer is a key part of our deep neural network architecture shown in Figure 2. 
Other notable features of the architecture include the following: (1) a sequence of downsampling blocks that halve the spatial dimension and double the feature size and of upsampling blocks that do the reverse; (2) max pooling to reduce the size of LSTM inputs; (3) skip connections between corresponding downsampling and upsampling blocks; and (4) subpixel shuffling [47] to increase the time dimension during upscaling and avoid artifacts [41]. For more details, see the Appendix.
We train the model on a dataset D = {(x_i, y_i)}_{i=1}^n of source/target time series pairs. As in image super-resolution, we take the x_i, y_i to be small patches sampled from the full time series. We train the model using a mean squared error objective.

5 Experiments

In this section, we show that TFiLM layers improve the performance of convolutional models on both discriminative and generative tasks. We analyze TFiLM on sentiment analysis, a text classification task, as well as on several generative time series super-resolution problems.

5.1 Text Classification

Datasets. We use the Yelp-2 and Yelp-5 datasets [1], which are standard datasets for sentiment analysis. In the Yelp-2 dataset, reviews are classified as positive or negative based on the number of stars given by the reviewer. Reviews with 1 or 2 stars are classified as negative, and reviews with 4 or 5 stars are classified as positive. In the Yelp-5 dataset, reviews are labeled with their star rating (i.e., 1-5). The Yelp-2 dataset includes about 600,000 reviews and the Yelp-5 dataset includes 700,000 reviews. For both tasks, there are an equal number of examples with each possible label. With zero-padding and truncation, we set the length of Yelp reviews to 256 tokens.

Methods. 
We tokenize the reviews using Keras's built-in tokenizer and use 100-dimensional GloVe word embeddings [43] to encode the tokens.
We compare our method against a variety of recent work on the Yelp-2 and Yelp-5 datasets, and to a CNN model based on the architecture of Johnson and Zhang's Deep Pyramid Convolutional Neural Network [25]. This “SmallCNN” model copies the DPCNN architecture, but reduces the number of convolutional layers to 3 and does not include region embeddings. Our full model inserts TFiLM layers after the convolutional layers in this architecture. We normalize the number of parameters between the basic SmallCNN model and the full model that includes TFiLM layers. To do so, we adjust the number of filters in the convolutional layers.
We train for 20 epochs using the ADAM optimizer with a learning rate of 10⁻³ and a batch size of 128.

Evaluation. Table 2 presents the results of our experiments. On both datasets, TFiLM layers improve the performance of the SmallCNN architecture; the resulting TFiLM model performs at or near the level of state-of-the-art methods from the literature. The models that outperform the TFiLM architecture use several times as many parameters (VDCNN and DenseCNN) or run unsupervised pre-training on external data (DPCNN, BERT, and XLNet). Additionally, the TFiLM model trains several times faster than the SmallCNN model or a comparably-sized LSTM. (See the Appendix for details.) Overall, these results indicate that TFiLM layers can provide performance and efficiency benefits on discriminative sequence classification problems.

5.2 Audio Super-Resolution

Datasets. We use the VCTK dataset [53] — which contains 44 hours of data from 108 speakers — and a Piano dataset — 10 hours of Beethoven sonatas [40]. 
We generate low-resolution audio signals from the 16 kHz originals by applying an order-8 Chebyshev type I low-pass filter before subsampling the signal by the desired scaling ratio. The SINGLESPEAKER task trains the model on the first 223 recordings of VCTK Speaker 1 (about 30 minutes) and tests on the last 8 recordings. In the MULTISPEAKER task, we train on the first 99 VCTK speakers and test on the 8 remaining ones. Lastly, the PIANO task extends audio super-resolution to non-vocal data; we use the standard 88%-6%-6% data split.

Table 2: Text classification on Yelp-2 and Yelp-5 datasets. Methods with * use unsupervised pre-training (unsupervised region embeddings or transformers) on external data and are not directly comparable. Parameter counts exclude models with lower performance. Embedding parameters are not counted.

Methods. We compare our method against three baselines: a cubic B-spline — which corresponds to the bicubic upsampling baseline used in image super-resolution; a dense neural network (DNN) based on the technique of Li et al., 2015 [36]; and a version of our CNN architecture without TFiLM layers.
We instantiate our model with K = 4 blocks and train it for 50 epochs on patches of length 8192 (in the high-resolution space) using the ADAM optimizer with a learning rate of 3 × 10⁻⁴. To ensure source/target series are of the same length, the source input is pre-processed with cubic upscaling. We adjust the TFiLM block length B so that T/B (the number of blocks) is always 32. We use a pooling stride and spatial extent of 8. To increase the receptive field of our convolutional layers, we use dilated convolutions with a dilation factor of 2 [56].
Including TFiLM layers significantly increases the number of parameters per layer compared with the DNN baseline and the basic CNN architecture. 
Accordingly, we adjust the number of filters per layer to normalize the parameter counts between the models.

Method                  | Yelp-2  | Yelp-5  | Param
FastText [26]           | 95.7%   | 63.9%   | Linear
LSTM [55]               | 92.6%   | 59.6%   | -
Self-Attention [37]     | 93.5%   | 63.4%   | -
CNN [31]                | 93.5%   | 61.0%   | -
CharCNN [58]            | 94.6%   | 62.0%   | -
VDCNN [6]               | 95.4%   | 64.7%   | >5M
DenseCNN [52]           | 96.0%   | 64.5%   | >4M
DPCNN* [25]             | 97.36%  | 69.4%   | >3M
BERT* [48][8]           | 98.11%  | 70.68%  | -
XLNet* [54]             | 98.45%  | 72.2%   | -
SmallCNN (ours)         | 78.1%   | 61.5%   | <1.5M
SmallCNN+TFiLM (full)   | 95.6%   | 62.3%   | <1.5M

Metrics. Given a reference signal y and an approximation x, the signal-to-noise ratio (SNR) is defined as

SNR(x, y) = 10 log( ||y||₂² / ||x − y||₂² ).

The SNR is a standard metric used in the signal processing literature. The log-spectral distance (LSD) [18] measures the reconstruction quality of individual frequencies as follows:

LSD(x, y) = (1/T) Σ_{t=1}^{T} sqrt( (1/K) Σ_{k=1}^{K} ( X(t, k) − X̂(t, k) )² ),

where X and X̂ are the log-spectral power magnitudes of y and x, respectively. These are defined as X = log |S|², where S is the short-time Fourier transform (STFT) of the signal. We use t and k to index frames and frequencies, respectively; we used frames of length 8092.

Evaluation. The results of our experiments are summarized in Table 3. According to our SNR metric, our basic convolutional architecture shows an average improvement of 0.3 dB over the DNN and Spline baselines, with the strongest improvements on the SINGLESPEAKER task. Based on the LSD metric, the convolutional architecture also shows an average improvement of 0.5 dB over the DNN baseline and 1.6 dB over the Spline baseline. 
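Both metrics can be computed in a few lines; the sketch below is illustrative rather than the evaluation code used here: it substitutes non-overlapping FFT frames for a full STFT and adds a small floor inside the logarithm (a detail the definition leaves open) to avoid −∞ on silent frames.

```python
import numpy as np

def snr(x, y):
    """Signal-to-noise ratio in dB of approximation x against reference y."""
    return 10 * np.log10(np.sum(y ** 2) / np.sum((x - y) ** 2))

def log_spectral_distance(x, y, frame_len=256):
    """LSD: RMS difference of log-power spectra, averaged over frames.
    Uses non-overlapping FFT frames as a stand-in for a full STFT."""
    n_frames = len(x) // frame_len
    def log_power(sig):
        frames = sig[: n_frames * frame_len].reshape(n_frames, frame_len)
        S = np.fft.rfft(frames, axis=1)        # one spectrum per frame
        return np.log(np.abs(S) ** 2 + 1e-10)  # X = log |S|^2, floored
    X, X_hat = log_power(y), log_power(x)
    return np.mean(np.sqrt(np.mean((X - X_hat) ** 2, axis=1)))

# a clean sine versus a mildly corrupted copy
t = np.linspace(0, 1, 4096, endpoint=False)
clean = np.sin(2 * np.pi * 440 * t)
noisy = clean + 0.01 * np.random.default_rng(0).normal(size=t.shape)
print(f"SNR: {snr(noisy, clean):.1f} dB, LSD: {log_spectral_distance(noisy, clean):.3f}")
```

Note the directionality: higher SNR is better, while lower LSD is better, which is why the improvements quoted above move in opposite numerical directions for the two metrics.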
The convolutional architecture appears to use our modeling capacity more efficiently than a dense neural network; we expect such architectures will soon be more widely used in audio generation tasks.²
Including the TFiLM layers improves performance by an additional 1.0 dB on average in terms of SNR and 0.2 dB on average in terms of LSD. The TFiLM layers prove particularly beneficial on the MULTISPEAKER task, perhaps because this is the most complex task and therefore the one which benefits most from additional long-term temporal connections in the model.
Finally, to confirm our results, we ran a study in which human raters assessed the quality of the interpolated audio samples. Our method ranked the best among the upscaling techniques; details are in the appendix.

²We also experimented with a sequence-to-sequence architecture. This model performed very poorly, achieving SNR of about 0 dB for all upscaling ratios. As discussed above, sequence-to-sequence models generally struggle to solve problems involving extremely long time-series signals, as is the case here.

Table 3: Accuracy evaluation of audio super-resolution methods (in dB) on each of the three super-resolution tasks at upscaling ratios r = 2, 4, and 8. DNN and CNN are baselines from the literature. [KEE17] denotes the convolutional method of Kuleshov et al. (2017).

SINGLESPEAKER
Ratio  Obj. | Spline | DNN [Li et al.] | Conv [KEE17] | Full (Us)
r = 2  SNR  |  19.0  |      19.0       |     19.4     |   19.5
       LSD  |   3.5  |       3.0       |      2.6     |    2.5
r = 4  SNR  |  15.6  |      15.6       |     16.4     |   16.8
       LSD  |   5.6  |       4.0       |      3.7     |    3.5
r = 8  SNR  |  12.2  |      12.3       |     12.7     |   12.9
       LSD  |   7.2  |       4.7       |      4.2     |    4.3

MULTISPEAKER
Ratio  Obj. | Spline | DNN [Li et al.] | Conv [KEE17] | Full (Us)
r = 2  SNR  |  18.0  |      17.9       |     18.1     |   19.8
       LSD  |   2.9  |       2.5       |      1.9     |    1.8
r = 4  SNR  |  13.2  |      13.3       |     13.1     |   15.0
       LSD  |   5.2  |       3.9       |      3.1     |    2.7
r = 8  SNR  |   9.8  |       9.8       |      9.9     |   12.0
       LSD  |   6.8  |       4.6       |      4.3     |    2.9

PIANO
Ratio  Obj. | Spline | DNN [Li et al.] | Conv [KEE17] | Full (Us)
r = 2  SNR  |  24.8  |      24.7       |     25.3     |   25.4
       LSD  |   1.8  |       2.5       |      2.0     |    2.0
r = 4  SNR  |  18.6  |      18.6       |     18.8     |   19.3
       LSD  |   2.8  |       3.2       |      2.3     |    2.2
r = 8  SNR  |  10.7  |      10.7       |     11.1     |   13.3
       LSD  |   4.0  |       3.5       |      2.7     |    2.6

# Params.   |  N/A   |     6.72e7      |    7.09e7    |  6.82e7

5.3 Chromatin Immunoprecipitation Sequencing

Histone

Pearson Correlation

H3K4me1
H3K4me3
H3K27ac
H3K27me3

Input Linear
[K17]
[K17]
0.41
0.37
0.67
0.63
0.55
0.61
0.18
0.14

CNN CNN Full
Us
[K17]
0.81
0.59
0.72
0.90
0.77
0.89
0.64
0.30

Table 4: Pearson correlation of the model output and the high-quality ChIP-seq signal derived from an experiment with high sequencing depth. [K17] indicates results from Koh et al. (2017); linear method performance is estimated.

We use histone ChIP-seq data from lymphoblastoid cell lines derived from several individuals of diverse ancestry [29] on the following common histone marks: H3K4me1, H3K4me3, H3K27ac, H3K27me3, and H3K36me3. This dataset contains high-quality ChIP-seq data with a high sequencing depth; to obtain low-quality versions, we artificially subsample 1M reads for each histone mark (out of the full dataset of 100+M reads per mark). This mirrors the setup of Koh et al., 2016 [32], which introduced a simpler convolutional neural network architecture. The Koh results are the state of the art for this task; we use them as a baseline in this section.
Formally, given an input noisy ChIP-seq signal X ∈ R^{k×T}, where k is the number of distinct histone marks, and T is the length of the genome, we aim to reconstruct a high-quality ChIP-seq signal Y ∈ R^T. We use the k low-quality signals as input and train a separate model for each high-quality target mark. 
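Concretely, this training setup amounts to cutting the genome-long signals into aligned windows (length 1000 in our ChIP-seq experiments), one multi-mark input per window and one single-mark target. The sketch below is illustrative; the array names and the helper make_windows are not from our codebase.

```python
import numpy as np

def make_windows(X, Y, window=1000):
    """Cut the k x T low-quality signals and the length-T high-quality target
    for one histone mark into aligned, non-overlapping windows."""
    k, T = X.shape
    n = T // window
    inputs = X[:, : n * window].reshape(k, n, window).transpose(1, 0, 2)  # (n, k, window)
    targets = Y[: n * window].reshape(n, window)                          # (n, window)
    return inputs, targets

# toy genome: k = 5 marks, T = 3500 positions -> 3 full windows of length 1000
rng = np.random.default_rng(0)
X = rng.poisson(2.0, size=(5, 3500)).astype(float)   # noisy 1M-read signals
Y = rng.poisson(20.0, size=3500).astype(float)       # deep-sequenced target
inp, tgt = make_windows(X, Y)
assert inp.shape == (3, 5, 1000) and tgt.shape == (3, 1000)
```

Each window keeps the k input tracks aligned with the target track, so the model sees all marks jointly while predicting a single high-quality mark.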
We use B = 2 and training windows of length 1000; all other hyper-parameters are as in the audio super-resolution task.
To evaluate our results, we measure the Pearson correlation between our model output and the true, high-quality ChIP-seq signal; this is a standard comparison metric in the field (e.g., [15]). As shown in Table 4, our method significantly improves the quality of the input signal over the Koh results, and on all but one histone mark outperforms the specialized CNN baseline. Across all of the histone marks, the model output from an input of 1M sequencing reads is equivalent in quality to signal derived from 10M to 20M reads, constituting a significant efficiency gain.

Us
0.79
0.66
0.85
0.65

5.4 Additional Analyses

Model Visualization. We examined the internals of the TFiLM layer by visualizing the adaptive normalizer parameters in the audio super-resolution and sentiment analysis experiments. On the former, we observed that the parameters tend to cluster by gender, suggesting that the layer learns useful features. Figures are in the Appendix.

Ablation Analysis. The ablation analysis in Figure 5 indicates that temporal adaptive normalization significantly improves model performance. In addition, we performed an ablation analysis for the skip connections and found that they also significantly improve reconstruction accuracy. Our results are in the Appendix.

Model Generalization. We examined the extent to which the model generalizes across datasets. On the audio task, we observed a loss in performance when a model trained on speech was evaluated on piano music (and vice versa). This highlights the need for diverse training data. Details are in the Appendix.

Missing Value Imputation. We experimented with imputing missing values from a sequence of daily grocery retail sales using various zero-out rates. TFiLM layers consistently provided performance benefits. 
Our full methodology and results are in the Appendix.\n\n6 Previous Work and Discussion\n\nFeature-Wise Linear Modulation. Previous work has applied feature-wise linear modulation to\ntasks including question answering, style transfer, and speech recognition (see Table 1). Our approach\nis most similar to that of Kim et al. [31], which modulates layer normalization parameters using a\nfeed-forward model conditioned on an input audio sequence. Conversely, our method adjusts the\nbatch normalization parameters of a feed-forward CNN using an RNN conditioned on the entire\nsequence. This signi\ufb01cantly improves the CNN\u2019s performance.\n\nTime Series Modeling.\nIn the machine learning literature, time series signals have most often\nbeen modeled with auto-regressive models, of which variants of recurrent networks are a special\ncase [17, 38, 40]. Our approach generalizes conditional modeling ideas used in computer vision for\ntasks such as image super-resolution [10, 35] or colorization [57].\n\nApplications to Audio and Genomics. Existing learning-based approaches include Gaussian mix-\nture models [5, 42, 45], linear predictive coding [4], and neural networks [36]. Other recent work\non audio super-resolution includes Wang et al.\u2019s WaveNet model [51] and Macartney and Weyde\u2019s\nWave-U-Net model [39]. Our work proposes a new architecture, which scales better with data size\nand outperforms recent methods. Moreover, existing techniques involve hand-crafted features [45];\nour approach is fully domain-agnostic. Statistical modeling of genomic data has been explored\nin population and functional genomics [34, 15]; our approach has the potential to make scienti\ufb01c\nexperiments signi\ufb01cantly more affordable.\n\nComputational Performance. 
Our model is computationally efficient and can be run in real time. Unlike sequence-to-sequence architectures, our model does not require the complete input sequence to begin generating an output sequence.

7 Conclusion

In summary, our work introduces a temporal adaptive normalization neural network layer that integrates convolutional and recurrent layers to efficiently incorporate long-term information when processing sequential data. We demonstrate the layer's effectiveness on three diverse domains. Our results have applications in areas including text-to-speech generation and sentiment analysis and could reduce the cost of genomics experiments.

Acknowledgments

This research was supported by NSF (#1651565, #1522054, #1733686), ONR (N00014-19-1-2145), and AFOSR (FA9550-19-1-0024).

References
[1] Yelp dataset. https://www.yelp.com/dataset.

[2] Babak Alipanahi, Andrew Delong, Matthew T Weirauch, and Brendan J Frey. Predicting the sequence specificities of DNA- and RNA-binding proteins by deep learning. Nature Biotechnology, 33(8):831, 2015.

[3] Yoshua Bengio, Patrice Simard, Paolo Frasconi, et al. Learning long-term dependencies with gradient descent is difficult. IEEE Transactions on Neural Networks, 5(2):157–166, 1994.

[4] Jeremy Bradbury. Linear predictive coding. Mc G. Hill, 2000.

[5] Yan Ming Cheng, Douglas O'Shaughnessy, and Paul Mermelstein. Statistical recovery of wideband speech from narrowband speech. IEEE Transactions on Speech and Audio Processing, 2(4):544–548, 1994.

[6] Alexis Conneau, Holger Schwenk, Loïc Barrault, and Yann LeCun. Very deep convolutional networks for natural language processing. CoRR, abs/1606.01781, 2016.

[7] Harm de Vries, Florian Strub, Jérémie Mary, Hugo Larochelle, Olivier Pietquin, and Aaron C. Courville. Modulating early visual processing by language. 
CoRR, abs/1707.00683, 2017.

[8] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. CoRR, abs/1810.04805, 2018.

[9] Bhuwan Dhingra, Hanxiao Liu, William W. Cohen, and Ruslan Salakhutdinov. Gated-attention readers for text comprehension. CoRR, abs/1606.01549, 2016.

[10] Chao Dong, Chen Change Loy, Kaiming He, and Xiaoou Tang. Image super-resolution using deep convolutional networks. IEEE Trans. Pattern Anal. Mach. Intell., 38(2):295–307, February 2016.

[11] Vincent Dumoulin, Ethan Perez, Nathan Schucher, Florian Strub, Harm de Vries, Aaron Courville, and Yoshua Bengio. Feature-wise transformations. Distill, 3(7):e11, 2018.

[12] Vincent Dumoulin, Jonathon Shlens, and Manjunath Kudlur. A learned representation for artistic style. CoRR, abs/1610.07629, 2016.

[13] Per Ekstrand. Bandwidth extension of audio signals by spectral band replication. In Proceedings of the 1st IEEE Benelux Workshop on Model Based Processing and Coding of Audio (MPCA '02). Citeseer, 2002.

[14] Jeffrey L Elman. Finding structure in time. Cognitive Science, 14(2):179–211, 1990.

[15] Jason Ernst and Manolis Kellis. Large-scale imputation of epigenomic datasets for systematic annotation of diverse human tissues. Nature Biotechnology, 33(4):364–376, February 2015.

[16] Jonas Gehring, Michael Auli, David Grangier, Denis Yarats, and Yann N. Dauphin. Convolutional sequence to sequence learning. CoRR, abs/1705.03122, 2017.

[17] Felix A Gers, Douglas Eck, and Jürgen Schmidhuber. Applying LSTM to time series predictable through time-window approaches. In International Conference on Artificial Neural Networks, pages 669–676. Springer, 2001.

[18] Augustine Gray and John Markel. Distance measures for speech processing.
IEEE Transactions on Acoustics, Speech, and Signal Processing, 24(5):380–391, 1976.

[19] Geoffrey Hinton, Li Deng, Dong Yu, George E Dahl, Abdel-rahman Mohamed, Navdeep Jaitly, Andrew Senior, Vincent Vanhoucke, Patrick Nguyen, Tara N Sainath, et al. Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups. IEEE Signal Processing Magazine, 29(6):82–97, 2012.

[20] Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.

[21] Jie Hu, Li Shen, and Gang Sun. Squeeze-and-excitation networks. CoRR, abs/1709.01507, 2017.

[22] Xun Huang and Serge J. Belongie. Arbitrary style transfer in real-time with adaptive instance normalization. CoRR, abs/1703.06868, 2017.

[23] Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. CoRR, abs/1502.03167, 2015.

[24] Phillip Isola, Jun-Yan Zhu, Tinghui Zhou, and Alexei A Efros. Image-to-image translation with conditional adversarial networks. arXiv, 2016.

[25] Rie Johnson and Tong Zhang. Deep pyramid convolutional neural networks for text categorization. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 562–570, 2017.

[26] Armand Joulin, Edouard Grave, Piotr Bojanowski, and Tomas Mikolov. Bag of tricks for efficient text classification. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers, pages 427–431. Association for Computational Linguistics, April 2017.

[27] D. Jurafsky and J. H. Martin. Speech and Language Processing.
Pearson Education, 2014.

[28] Kaggle. Corporación Favorita grocery sales forecasting. https://www.kaggle.com/c/favorita-grocery-sales-forecasting.

[29] Maya Kasowski, Sofia Kyriazopoulou-Panagiotopoulou, Fabian Grubert, Judith B Zaugg, Anshul Kundaje, Yuling Liu, Alan P Boyle, Qiangfeng Cliff Zhang, Fouad Zakharia, Damek V Spacek, Jingjing Li, Dan Xie, Anthony Olarerin-George, Lars M Steinmetz, John B Hogenesch, Manolis Kellis, Serafim Batzoglou, and Michael Snyder. Extensive variation in chromatin states across humans. Science (New York, N.Y.), 342(6159):750–752, November 2013.

[30] Taesup Kim, Inchul Song, and Yoshua Bengio. Dynamic layer normalization for adaptive neural acoustic modeling in speech recognition. CoRR, abs/1707.06065, 2017.

[31] Yoon Kim. Convolutional neural networks for sentence classification. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1746–1751, Doha, Qatar, October 2014. Association for Computational Linguistics.

[32] Pang Wei Koh, Emma Pierson, and Anshul Kundaje. Denoising genome-wide histone ChIP-seq with convolutional neural networks. bioRxiv, page 052118, 2016.

[33] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pages 1097–1105, 2012.

[34] Volodymyr Kuleshov, Dan Xie, Rui Chen, Dmitry Pushkarev, Zhihai Ma, Tim Blauwkamp, Michael Kertesz, and Michael Snyder. Whole-genome haplotyping using long reads and statistical methods. Nature Biotechnology, 32(3):261–266, 2014.

[35] Christian Ledig, Lucas Theis, Ferenc Huszar, Jose Caballero, Andrew P. Aitken, Alykhan Tejani, Johannes Totz, Zehan Wang, and Wenzhe Shi. Photo-realistic single image super-resolution using a generative adversarial network.
CoRR, abs/1609.04802, 2016.

[36] Kehuang Li, Zhen Huang, Yong Xu, and Chin-Hui Lee. DNN-based speech bandwidth expansion and its application to adding high-frequency missing features for automatic speech recognition of narrowband speech. In Sixteenth Annual Conference of the International Speech Communication Association, 2015.

[37] Zhouhan Lin, Minwei Feng, Cícero Nogueira dos Santos, Mo Yu, Bing Xiang, Bowen Zhou, and Yoshua Bengio. A structured self-attentive sentence embedding. CoRR, abs/1703.03130, 2017.

[38] Andrew Maas, Quoc V. Le, Tyler M. O'Neil, Oriol Vinyals, Patrick Nguyen, and Andrew Y. Ng. Recurrent neural networks for noise reduction in robust ASR. In INTERSPEECH, 2012.

[39] Craig Macartney and Tillman Weyde. Improved speech enhancement with the Wave-U-Net. CoRR, abs/1811.11307, 2018.

[40] Soroush Mehri, Kundan Kumar, Ishaan Gulrajani, Rithesh Kumar, Shubham Jain, Jose Sotelo, Aaron Courville, and Yoshua Bengio. SampleRNN: An unconditional end-to-end neural audio generation model. arXiv:1612.07837, 2016.

[41] Augustus Odena, Vincent Dumoulin, and Chris Olah. Deconvolution and checkerboard artifacts. Distill, 2016.

[42] Kun-Youl Park and Hyung Soon Kim. Narrowband to wideband conversion of speech using GMM based transformation. In Acoustics, Speech, and Signal Processing, 2000. ICASSP '00. Proceedings. 2000 IEEE International Conference on, volume 3, pages 1843–1846. IEEE, 2000.

[43] Jeffrey Pennington, Richard Socher, and Christopher D. Manning. GloVe: Global vectors for word representation. In Empirical Methods in Natural Language Processing (EMNLP), pages 1532–1543, 2014.

[44] Ethan Perez, Florian Strub, Harm de Vries, Vincent Dumoulin, and Aaron C. Courville. FiLM: Visual reasoning with a general conditioning layer. CoRR, abs/1709.07871, 2017.

[45] Hannu Pulakka, Ulpu Remes, Kalle Palomäki, Mikko Kurimo, and Paavo Alku.
Speech bandwidth extension using Gaussian mixture model-based estimation of the highband mel spectrum. In Acoustics, Speech and Signal Processing (ICASSP), 2011 IEEE International Conference on, pages 5100–5103. IEEE, 2011.

[46] Roadmap Epigenomics Consortium. Integrative analysis of 111 reference human epigenomes. Nature, 518(7539):317–330, February 2015.

[47] Wenzhe Shi, Jose Caballero, Ferenc Huszar, Johannes Totz, Andrew P. Aitken, Rob Bishop, Daniel Rueckert, and Zehan Wang. Real-time single image and video super-resolution using an efficient sub-pixel convolutional neural network. In CVPR, pages 1874–1883, 2016.

[48] Chi Sun, Xipeng Qiu, Yige Xu, and Xuanjing Huang. How to fine-tune BERT for text classification? CoRR, abs/1905.05583, 2019.

[49] Ilya Sutskever, Oriol Vinyals, and Quoc V Le. Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems, pages 3104–3112, 2014.

[50] Aäron van den Oord, Sander Dieleman, Heiga Zen, Karen Simonyan, Oriol Vinyals, Alex Graves, Nal Kalchbrenner, Andrew W. Senior, and Koray Kavukcuoglu. WaveNet: A generative model for raw audio. CoRR, abs/1609.03499, 2016.

[51] M. Wang, Z. Wu, S. Kang, X. Wu, J. Jia, D. Su, D. Yu, and H. Meng. Speech super-resolution using parallel WaveNet. In 2018 11th International Symposium on Chinese Spoken Language Processing (ISCSLP), pages 260–264, November 2018.

[52] Shiyao Wang, Minlie Huang, and Zhidong Deng. Densely connected CNN with multi-scale feature attention for text classification. In Proceedings of the 27th International Joint Conference on Artificial Intelligence, IJCAI '18, pages 4468–4474. AAAI Press, 2018.

[53] Junichi Yamagishi. English multi-speaker corpus for CSTR voice cloning toolkit, 2012. URL http://homepages.inf.ed.ac.uk/jyamagis/page3/page58/page58.html.

[54] Zhilin Yang, Zihang Dai, Yiming Yang, Jaime G.
Carbonell, Ruslan Salakhutdinov, and Quoc V. Le. XLNet: Generalized autoregressive pretraining for language understanding. CoRR, abs/1906.08237, 2019.

[55] Dani Yogatama, Chris Dyer, Wang Ling, and Phil Blunsom. Generative and discriminative text classification with recurrent neural networks, 2017.

[56] Fisher Yu and Vladlen Koltun. Multi-scale context aggregation by dilated convolutions. CoRR, abs/1511.07122, 2015.

[57] Richard Zhang, Phillip Isola, and Alexei A Efros. Colorful image colorization. ECCV, 2016.

[58] Xiang Zhang, Junbo Jake Zhao, and Yann LeCun. Character-level convolutional networks for text classification. CoRR, abs/1509.01626, 2015.
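Illustration. As a purely illustrative sketch of the temporal feature-wise modulation discussed in Section 6, the snippet below splits a (time × channels) activation tensor into blocks, max-pools each block, summarizes the pooled block sequence with a recurrent network, and applies the recurrence's predicted per-block scale and shift to the activations. This is a minimal NumPy sketch under stated assumptions, not the paper's implementation: the weight names (Wh, Wx, Wo, bo) and the plain tanh recurrence are illustrative choices, whereas the actual TFiLM layer uses an LSTM inside a convolutional network.

```python
import numpy as np

def tfilm(x, block_size, Wh, Wx, Wo, bo):
    """Sketch of temporal feature-wise linear modulation (illustrative).

    x:  activations of shape (T, C), T divisible by block_size.
    Wh: (H, H) recurrent weights; Wx: (H, C) input weights (assumed names).
    Wo: (2C, H) and bo: (2C,) map the hidden state to per-block (gamma, beta).
    """
    T, C = x.shape
    assert T % block_size == 0
    B = T // block_size
    blocks = x.reshape(B, block_size, C)
    pooled = blocks.max(axis=1)                 # (B, C) per-block summary
    h = np.zeros(Wh.shape[0])
    out = np.empty_like(blocks)
    for b in range(B):
        h = np.tanh(Wh @ h + Wx @ pooled[b])    # recurrence over block summaries
        gamma_beta = Wo @ h + bo                # modulation parameters for block b
        gamma, beta = gamma_beta[:C], gamma_beta[C:]
        out[b] = gamma * blocks[b] + beta       # feature-wise linear modulation
    return out.reshape(T, C)
```

As a sanity check on the modulation itself, setting Wo to zeros and bo to ones followed by zeros yields gamma = 1 and beta = 0, so the layer reduces to the identity.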