{"title": "Analyzing Hidden Representations in End-to-End Automatic Speech Recognition Systems", "book": "Advances in Neural Information Processing Systems", "page_first": 2441, "page_last": 2451, "abstract": "Neural networks have become ubiquitous in automatic speech recognition systems. While neural networks are typically used as acoustic models in more complex systems, recent studies have explored end-to-end speech recognition systems based on neural networks, which can be trained to directly predict text from input acoustic features. Although such systems are conceptually elegant and simpler than traditional systems, it is less obvious how to interpret the trained models. In this work, we analyze the speech representations learned by a deep end-to-end model that is based on convolutional and recurrent layers, and trained with a connectionist temporal classification (CTC) loss. We use a pre-trained model to generate frame-level features which are given to a classifier that is trained on frame classification into phones. We evaluate representations from different layers of the deep model and compare their quality for predicting phone labels. Our experiments shed light on important aspects of the end-to-end model such as layer depth, model complexity, and other design choices.", "full_text": "Analyzing Hidden Representations in End-to-End\n\nAutomatic Speech Recognition Systems\n\nComputer Science and Arti\ufb01cial Intelligence Laboratory\n\nMassachusetts Institute of Technology\n\nYonatan Belinkov and James Glass\n\nCambridge, MA 02139\n\n{belinkov, glass}@mit.edu\n\nAbstract\n\nNeural networks have become ubiquitous in automatic speech recognition systems.\nWhile neural networks are typically used as acoustic models in more complex\nsystems, recent studies have explored end-to-end speech recognition systems\nbased on neural networks, which can be trained to directly predict text from input\nacoustic features. Although such systems are conceptually elegant and simpler\nthan traditional systems, it is less obvious how to interpret the trained models.\nIn this work, we analyze the speech representations learned by a deep end-to-end\nmodel that is based on convolutional and recurrent layers, and trained with a\nconnectionist temporal classi\ufb01cation (CTC) loss. We use a pre-trained model to\ngenerate frame-level features which are given to a classi\ufb01er that is trained on frame\nclassi\ufb01cation into phones. We evaluate representations from different layers of the\ndeep model and compare their quality for predicting phone labels. Our experiments\nshed light on important aspects of the end-to-end model such as layer depth, model\ncomplexity, and other design choices.\n\n1\n\nIntroduction\n\nTraditional automatic speech recognition (ASR) systems are composed of multiple components,\nincluding an acoustic model, a language model, a lexicon, and possibly other components. Each of\nthese is trained independently and combined during decoding. As such, the system is not directly\ntrained on the speech recognition task from start to end. In contrast, end-to-end ASR systems aim\nto map acoustic features directly to text (words or characters). Such models have recently become\npopular in the ASR community thanks to their simple and elegant architecture [1, 2, 3, 4]. Given\nsuf\ufb01cient training data, they also perform fairly well. 
Importantly, such models do not receive explicit\nphonetic supervision, in contrast to traditional systems that typically rely on an acoustic model trained\nto predict phonetic units (e.g. HMM phone states). Intuitively, though, end-to-end models have\nto generate some internal representation that allows them to abstract over phonological units. For\ninstance, a model that needs to generate the word \u201cbought\u201d should learn that in this case \u201cg\u201d is not\npronounced as the phoneme /g/.\nIn this work, we investigate if and to what extent end-to-end models implicitly learn phonetic\nrepresentations. The hypothesis is that such models need to create and exploit internal representations\nthat correspond to phonetic units in order to perform well on the speech recognition task. Given a\npre-trained end-to-end ASR system, we use it to extract frame-level features from an acoustic signal.\nFor example, these may be the hidden representations of a recurrent neural network (RNN) in the\nend-to-end system. We then feed these features to a classi\ufb01er that is trained to predict a phonetic\nproperty of interest such as phone recognition. Finally, we evaluate the performance of the classi\ufb01er\nas a measure of the quality of the input features, and by proxy the quality of the original ASR system.\n\n31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.\n\n\fWe aim to provide quantitative answers to the following questions:\n\n1. To what extent do end-to-end ASR systems learn phonetic information?\n2. Which components of the system capture more phonetic information?\n3. Do more complicated models learn better representations for phonology? And is ASR\n\nperformance correlated with the quality of the learned representations?\n\nTwo main types of end-to-end models for speech recognition have been proposed in the literature:\nconnectionist temporal classi\ufb01cation (CTC) [1, 2] and sequence-to-sequence learning (seq2seq) [3, 4].\nWe focus here on CTC and leave exploration of the seq2seq model for future work.\nWe use a phoneme-segmented dataset for the phoneme recognition task, as it comes with time\nsegmentation, which allows for accurate mapping between speech frames and phone labels. We\nde\ufb01ne a frame classi\ufb01cation task, where given representations from the CTC model, we need to\nclassify each frame into a corresponding phone label. More complicated tasks can be conceived\nof\u2014for example predicting a single phone given all of its aligned frames\u2014but classifying frames is a\nbasic and important task to start with.\nOur experiments reveal that the lowest layers in a deep end-to-end model are best suited for represent-\ning phonetic information. Applying one convolution on input features improves the representation,\nbut a second convolution greatly degrades phone classi\ufb01cation accuracy. Subsequent recurrent layers\ninitially improve the quality of the representations. However, after a certain recurrent layer perfor-\nmance again drops, indicating that the top layers do not preserve all the phonetic information coming\nfrom the bottom layers. Finally, we cluster frame representations from different layers in the deep\nmodel and visualize them in 2D, observing different quality of grouping in different layers.\nWe hope that our results would promote the development of better ASR systems. 
For example, understanding representation learning at different layers of the end-to-end model can guide joint learning of phoneme recognition and ASR, as recently proposed in a multi-task learning framework [5].\n\n2 Related Work\n\n2.1 End-to-end ASR\n\nEnd-to-end models for ASR have become increasingly popular in recent years. Important studies include models based on connectionist temporal classification (CTC) [1, 2, 6, 7] and attention-based sequence-to-sequence models [3, 4, 8]. The CTC model is based on a recurrent neural network that takes acoustic features as input and is trained to predict a symbol per frame. Symbols are typically characters, in addition to a special blank symbol. The CTC loss then marginalizes over all possible sequences of symbols given a transcription. The sequence-to-sequence approach, on the other hand, first encodes the sequence of acoustic features into a single vector and then decodes that vector into the sequence of symbols (characters). The attention mechanism improves upon this method by conditioning on a different summary of the input sequence at each decoding step.\nBoth of these approaches to end-to-end ASR usually predict a sequence of characters, although there have also been initial attempts at directly predicting words [9, 10].\n\n2.2 Analysis of neural representations\n\nWhile end-to-end neural network models offer an elegant and relatively simple architecture, they are often thought to be opaque and uninterpretable. Thus researchers have started investigating what such models learn during the training process. For instance, previous work evaluated neural network acoustic models on phoneme recognition using different acoustic features [11] or investigated how such models learn invariant representations [12] and encode linguistic features [13, 14]. Others have correlated activations of gated recurrent networks with phoneme boundaries in autoencoders [15] and in a text-to-speech system [16]. Recent work analyzed different speaker representations [17]. A joint audio-visual model of speech and lip movements was developed in [18], where phoneme embeddings were shown to be closer to certain linguistic features than embeddings based on audio alone. Other joint audio-visual models have also analyzed the learned representations in different ways [19, 20, 21]. Finally, we note that analyzing neural representations has also attracted attention in other domains like vision and natural language processing, including word and sentence representations [22, 23, 24], machine translation [25, 26], and joint vision-language models [27]. To our knowledge, hidden representations in end-to-end ASR systems have not been thoroughly analyzed before.\n\nTable 1: The ASR models used in this work.\n\n(a) DeepSpeech2.\n#   Layer  Input Size  Output Size\n1   cnn1   161         1952\n2   cnn2   1952        1312\n3   rnn1   1312        1760\n4   rnn2   1760        1760\n5   rnn3   1760        1760\n6   rnn4   1760        1760\n7   rnn5   1760        1760\n8   rnn6   1760        1760\n9   rnn7   1760        1760\n10  fc     1760        29\n\n(b) DeepSpeech2-light.\n#   Layer  Input Size  Output Size\n1   cnn1   161         1952\n2   cnn2   1952        1312\n3   lstm1  1312        600\n4   lstm2  600         600\n5   lstm3  600         600\n6   lstm4  600         600\n7   lstm5  600         600\n8   fc     600         29\n\n3 Methodology\n\nWe use the following procedure for evaluating representations in end-to-end ASR models. First, we train an ASR system on a corpus of transcribed speech and freeze its parameters. 
Then, we use the pre-trained ASR model to extract frame-level feature representations on a phonemically transcribed corpus. Finally, we train a supervised classifier using the features coming from the ASR system, and evaluate classification performance on a held-out set. In this manner, we obtain a quantitative measure of the quality of the representations that were learned by the end-to-end ASR model. A similar procedure has been previously applied to analyze a DNN-HMM phoneme recognition system [14] as well as text representations in neural machine translation models [25, 26].\nMore formally, let x denote a sequence of acoustic features such as a spectrogram of frequency magnitudes. Let ASR_t(x) denote the output of the ASR model at the t-th input frame. Given a corresponding label sequence, l, we feed ASR_t(x) to a supervised classifier that is trained to predict a corresponding label, l_t. In the simplest case, we have a label at each frame and perform frame classification. As we are interested in analyzing different components of the ASR model, we also extract features from different layers k, such that ASR^k_t(x) denotes the output of the k-th layer at the t-th input frame.\nWe next describe the ASR model and the supervised classifier in more detail.\n\n3.1 ASR model\n\nThe end-to-end model we use in this work is DeepSpeech2 [7], an acoustics-to-characters system based on a deep neural network. The input to the model is a sequence of audio spectrograms (frequency magnitudes), obtained with a 20ms Hamming window and a stride of 10ms. With a sampling rate of 16kHz, we have 161-dimensional input features. Table 1a details the different layers in this model. The first two layers are convolutions where the number of output feature maps is 32 at each layer. The kernel sizes of the first and second convolutional layers are 41x11 and 21x11 respectively, where a convolution of TxF has a size T in the time domain and F in the frequency domain. Both convolutional layers have a stride of 2 in the time domain while the first layer also has a stride of 2 in the frequency domain. This setting results in 1952/1312 features per time frame after the first/second convolutional layers.\nThe convolutional layers are followed by 7 bidirectional recurrent layers, each with a hidden state size of 1760 dimensions. Notably, these are simple RNNs and not gated units such as long short-term memory networks (LSTM) [28], as this was found to produce better performance. We also consider a simpler version of the model, called DeepSpeech2-light, which has 5 layers of bidirectional LSTMs, each with 600 dimensions (Table 1b). This model runs faster but leads to worse recognition results.\nEach convolutional or recurrent layer is followed by batch normalization [29, 30] and a ReLU non-linearity. 
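As an illustration of the feature-extraction step described above (reading out ASR^k_t(x) from a frozen model), the following is a minimal sketch of how per-frame activations could be collected from intermediate layers of a pre-trained PyTorch model using forward hooks. The model object, the layer names, and the assumed (time, batch, features) output shape are hypothetical stand-ins for illustration; this is not the deepspeech.torch code actually used in this work.

import torch

def extract_layer_features(model, layer_names, spectrogram):
    # Run one utterance through a frozen ASR model and collect the
    # per-frame outputs of the requested layers via forward hooks.
    features, handles = {}, []

    def make_hook(name):
        def hook(module, inputs, output):
            # Assumption: output is shaped (time, batch, features); keep frames on CPU.
            features[name] = output.detach().squeeze(1).cpu()
        return hook

    for name, module in model.named_modules():
        if name in layer_names:
            handles.append(module.register_forward_hook(make_hook(name)))

    model.eval()
    with torch.no_grad():
        model(spectrogram)  # spectrogram assumed shaped (time, 1, 161)

    for h in handles:
        h.remove()
    return features  # dict: layer name -> (time, dim) tensor of frame features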
The final layer is a fully-connected layer that maps onto the number of symbols (29 symbols: 26 English letters plus space, apostrophe, and a blank symbol).\nThe network is trained with a CTC loss [31]:\n\nL = -\log p(l|x)\n\nwhere the probability of a label sequence l given an input sequence x is defined as:\n\np(l|x) = \sum_{\pi \in B^{-1}(l)} p(\pi|x) = \sum_{\pi \in B^{-1}(l)} \prod_{t=1}^{T} ASR^K_t(x)[\pi_t]\n\nwhere B removes blanks and repeated symbols, B^{-1} is its inverse image, T is the length of the frame sequence (i.e., the length of the path \pi), and ASR^K_t(x)[j] is unit j of the model output after the top softmax layer at time t, interpreted as the probability of observing label j at time t. This formulation allows mapping long frame sequences to short character sequences by marginalizing over all possible sequences containing blanks and duplicates.\n\n3.2 Supervised Classifier\n\nThe frame classifier takes features from different layers of the DeepSpeech2 model as input and predicts a phone label. The size of the input to the classifier thus depends on which layer in DeepSpeech2 is used to generate features. We model the classifier as a feed-forward neural network with one hidden layer, where the size of the hidden layer is set to 500 (footnote 1). This is followed by dropout (rate of 0.5) and a ReLU non-linearity, then a softmax layer mapping onto the label set size (the number of unique phones). We chose this simple formulation as we are interested in evaluating the quality of the representations learned by the ASR model, rather than improving the state-of-the-art on the supervised task.\nWe train the classifier with Adam [32] with the recommended parameters (\alpha = 0.001, \beta_1 = 0.9, \beta_2 = 0.999, \epsilon = 10^{-8}) to minimize the cross-entropy loss. We use a batch size of 16, train the model for 30 epochs, and choose the model with the best development loss for evaluation.\n\n4 Tools and Data\n\nWe use the deepspeech.torch [33] implementation of Baidu\u2019s DeepSpeech2 model [7], which comes with pre-trained models of both DeepSpeech2 and the simpler variant DeepSpeech2-light. The end-to-end models are trained on LibriSpeech [34], a publicly available corpus of English read speech, containing 1,000 hours sampled at 16kHz. The word error rates (WER) of the DeepSpeech2 and DeepSpeech2-light models on the LibriSpeech test-clean dataset are 12 and 15, respectively [33].\nFor the phoneme recognition task, we use TIMIT, which comes with time segmentation of phones. We use the official train/development/test split and extract frames for the frame classification task. Table 2 summarizes statistics of the frame classification dataset. Note that due to sub-sampling at the DeepSpeech2 convolutional layers, the number of frames decreases by a factor of two after each convolutional layer. The possible labels are the 60 phone symbols included in TIMIT (excluding the begin/end silence symbol h#). 
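Returning to the supervised classifier of Section 3.2, the following is a minimal sketch in PyTorch that follows the hyperparameters reported above (one 500-unit hidden layer, dropout 0.5, ReLU, cross-entropy loss, Adam with the recommended parameters, batch size 16, 30 epochs). The data loader and the training loop details (model selection on development loss is omitted here) are illustrative assumptions, not the authors' implementation.

import torch
import torch.nn as nn

class FrameClassifier(nn.Module):
    # Feed-forward probe: frame-level ASR features -> phone label.
    def __init__(self, input_dim, num_phones, hidden_dim=500, dropout=0.5):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(input_dim, hidden_dim),
            nn.Dropout(dropout),
            nn.ReLU(),
            nn.Linear(hidden_dim, num_phones),  # softmax is folded into the loss below
        )

    def forward(self, x):
        return self.net(x)

def train_probe(train_loader, input_dim, num_phones, epochs=30):
    # train_loader is assumed to yield (features, labels) batches of 16 frames.
    model = FrameClassifier(input_dim, num_phones)
    optimizer = torch.optim.Adam(model.parameters(), lr=0.001)  # default betas and eps
    criterion = nn.CrossEntropyLoss()
    for _ in range(epochs):
        for frames, labels in train_loader:
            optimizer.zero_grad()
            loss = criterion(model(frames), labels)
            loss.backward()
            optimizer.step()
    return model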
We also experimented with the reduced set of 48 phones used by [35]. The code for all of our experiments is publicly available (footnote 2).\n\nTable 2: Frame classification data extracted from TIMIT.\n                      Train     Development   Test\nUtterances            3,696     400           192\nFrames (input)        988,012   107,620       50,380\nFrames (after cnn1)   493,983   53,821        25,205\nFrames (after cnn2)   233,916   25,469        11,894\n\nFootnote 1: We also experimented with a linear classifier and found that it produces lower results overall but leads to similar trends when comparing features from different layers.\nFootnote 2: http://github.com/boknilev/asr-repr-analysis\n\nFigure 1: Frame classification accuracy using representations from different layers of DeepSpeech2 (DS2) and DeepSpeech2-light (DS2-light), with or without strides in the convolutional layers. Panels: (a) DS2, w/ strides; (b) DS2, w/o strides; (c) DS2-light, w/ strides; (d) DS2-light, w/o strides.\n\n5 Results\n\nFigure 1a shows frame classification accuracy using features from different layers of the DeepSpeech2 model. The results are all above a majority baseline of 7.25% (the phone \u201cs\u201d). Input features (spectrograms) lead to fairly good performance, considering the 60-way classification task. The first convolution further improves the results, in line with previous findings about convolutions as feature extractors before recurrent layers [36]. However, applying a second convolution significantly degrades accuracy. This can be attributed to the filter width and stride, which may extend across phone boundaries. Nevertheless, we find the large drop quite surprising.\nThe first few recurrent layers improve the results, but after the 5th recurrent layer accuracy goes down again. One possible explanation for this may be that higher layers in the model are more sensitive to long-distance information that is needed for the speech recognition task, whereas the local information that is needed for classifying phones is better captured in lower layers. For instance, to predict a word like \u201cbought\u201d, the model would need to model relations between different characters, which would be better captured at the top layers. In contrast, feed-forward neural networks trained on phoneme recognition were shown to learn increasingly better representations at higher layers [13, 14]; such networks, unlike end-to-end models, do not need to model the full speech recognition task.\nIn the following sections, we first investigate three aspects of the model: model complexity, effect of strides in the convolutional layers, and effect of blanks. Then we visualize frame representations in 2D and consider classification into abstract sound classes. Finally, Appendix A provides additional experiments with windows of input features and a reduced phone set, all exhibiting similar trends.\n\n5.1 Model complexity\n\nFigure 1c shows the results of using features from the DeepSpeech2-light model. This model has fewer recurrent layers (5 vs. 7) and smaller hidden states (600 vs. 1760), but it uses LSTMs instead of simple RNNs. 
A \ufb01rst observation is that the overall trend is the same as in DeepSpeech2: signi\ufb01cant\ndrop after the \ufb01rst convolutional layer, then initial increase followed by a drop in the \ufb01nal layers.\nComparing the two models (\ufb01gures 1a and 1c), a number of additional observations can be made.\nFirst, the convolutional layers of DeepSpeech2 contain more phonetic information than those of\n\n5\n\n\fDeepSpeech2-light (+1% and +4% for cnn1 and cnn2, respectively). In contrast, the recurrent layers\nin DeepSpeech2-light are better, with the best result of 37.77% in DeepSpeech2-light (by lstm3)\ncompared to 33.67% in DeepSpeech2 (by rnn5). This suggests again that higher layers do not model\nphonology very well; when there are more recurrent layers, the convolutional layers compensate and\ngenerate better representations for phonology than when there are fewer recurrent layers. Interestingly,\nthe deeper model performs better on the speech recognition task while its deep representations are\nnot as good at capturing phonology, suggesting that its top layers focus more on modeling character\nsequences, while its lower layers focus on representing phonetic information.\n\n5.2 Effect of strides\n\nThe original DeepSpeech2 models have convolutions with strides (steps) in the time dimension [7].\nThis leads to subsampling by a factor of 2 at each convolutional layer, resulting in reduced dataset\nsize (Table 2). Consequently, the comparison between layers before and after convolutions is not\nentirely fair. To investigate this effect, we ran the trained convolutions without strides during feature\ngeneration for the classi\ufb01er.\nFigure 1b shows the results at different layers without using strides in the convolutions. The general\ntrend is similar to the strided case: large drop at the 2nd convolutional layer, then steady increase in\nthe recurrent layers with a drop at the \ufb01nal layers. However, the overall shape of the accuracy in the\nrecurrent layers is less spiky; the initial drop is milder and performance does not degrade as much at\nthe top layers. A similar pattern is observed in the non-strided case of DeepSpeech2-light (Figure 1d).\nThese results can be attributed to two factors. First, running convolutions without strides maintains the\nnumber of examples available to the classi\ufb01er, which means a larger training set. More importantly,\nhowever, the time resolution remains high which can be important for frame classi\ufb01cation.\n\n5.3 Effect of blank symbols\n\nRecall that the CTC model predicts either a letter in the alphabet, a space, or a blank symbol. This\nallows the model to concentrate probability mass on a few frames that are aligned to the output\nsymbols in a series of spikes, separated by blank predictions [31]. To investigate the effect of blank\nsymbols on phonetic representation, we generate predictions of all symbols using the CTC model,\nincluding blanks and repetitions. Then we break down the classi\ufb01er\u2019s performance into cases where\nthe model predicted a blank, a space, or another letter.\nFigure 2 shows the results using representations from the best recurrent layers in DeepSpeech2 and\nDeepSpeech2-light, run with and without strides in the convolutional layers. In the strided case, the\nhidden representations are of highest quality for phone classi\ufb01cation when the model predicts a blank.\nThis appears counterintuitive, considering the spiky behavior of CTC models, which should be more\ncon\ufb01dent when predicting non-blank. 
However, we found that only 5% of the frames are predicted as blanks, due to downsampling in the strided convolutions. When the model is run without strides, we observe a somewhat different behavior. Note that in this case the model predicts many more blanks (more than 50%, compared to 5% in the strided case), and representations of frames predicted as blanks are not as good, which is more in line with the common spiky behavior of CTC models [31].\n\nFigure 2: Frame classification accuracy at frames predicted as blank, space, or another letter by DeepSpeech2 and DeepSpeech2-light, with and without strides in the convolutional layers.\n\n5.4 Clustering and visualizing representations\n\nIn this section, we visualize frame representations from different layers of DeepSpeech2. We first ran the DeepSpeech2 model on the entire development set of TIMIT and extracted feature representations for every frame from all layers. This results in more than 100K vectors of different sizes (we use the model without strides in convolutional layers to allow for comparable analysis across layers). We followed a similar procedure to that of [20]: We clustered the vectors in each layer with k-means (k = 500) and plotted the cluster centroids using t-SNE [37]. We assigned to each cluster the phone label that had the largest number of examples in the cluster. As some clusters are quite noisy, we also consider pruning clusters where the majority label does not cover enough of the cluster members.\nFigure 3 shows t-SNE plots of cluster centroids from selected layers, with color and shape coding for the phone labels (see Figure 9 in Appendix B for other layers). The input layer produces clusters which show a fairly clean separation into groups of centroids with the same assigned phone. After the input layer it is less easy to detect groups, and lower layers do not show a clear structure. In layers rnn4 and rnn5 we again see some meaningful groupings (e.g. \u201cz\u201d on the right side of the rnn5 plot), after which rnn6 and rnn7 again show less structure.\n\nFigure 3: Centroids of frame representation clusters using features from different layers.\n\nFigure 10 (in Appendix B) shows clusters that have a majority label of at least 10-20% of the examples (depending on the number of examples left in each cluster after pruning). In this case groupings are more observable in all layers, and especially in layer rnn5.\nWe note that these observations are mostly in line with our previous findings regarding the quality of representations from different layers. When frame representations are better separated in vector space, the classifier does a better job at classifying frames into their phone labels; see also [14] for a similar observation.\n\n5.5 Sound classes\n\nSpeech sounds are often organized into coarse categories like consonants and vowels. In this section, we investigate whether the ASR model learns such categories. The primary question we ask is: which parts of the model capture most information about coarse categories? Are higher-layer representations more informative for this kind of abstraction above phones? To answer this, we map phones to their corresponding classes: affricates, fricatives, nasals, semivowels/glides, stops, and vowels. Then we train classifiers to predict sound classes given representations from different layers of the ASR model. Figure 4 shows the results. 
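As a rough illustration of this sound-class setup, the sketch below collapses TIMIT phone labels into the six coarse classes and reuses the same frame-level probe on the coarser targets. The mapping shows only a few representative phones per class; the full mapping covers all 60 TIMIT phones and follows standard TIMIT phone groupings, and is an assumption for illustration rather than a table taken from the paper.

# Illustrative (partial) mapping from TIMIT phones to coarse sound classes.
PHONE_TO_CLASS = {
    "jh": "affricate", "ch": "affricate",
    "s": "fricative", "sh": "fricative", "z": "fricative", "f": "fricative",
    "m": "nasal", "n": "nasal", "ng": "nasal",
    "l": "semivowel/glide", "r": "semivowel/glide", "w": "semivowel/glide", "y": "semivowel/glide",
    "b": "stop", "d": "stop", "g": "stop", "p": "stop", "t": "stop", "k": "stop",
    "iy": "vowel", "ih": "vowel", "eh": "vowel", "ae": "vowel", "aa": "vowel", "ah": "vowel",
    # ...remaining TIMIT phones omitted here for brevity
}

def to_sound_classes(frame_phone_labels):
    # Collapse per-frame phone labels into coarse sound-class labels; the
    # resulting labels can be fed to the same frame classifier as before,
    # with the output layer sized to six classes instead of 60 phones.
    return [PHONE_TO_CLASS[p] for p in frame_phone_labels]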
All layers produce representations that contain a non-trivial amount\nof information about sound classes (above the vowel majority baseline). As expected, predicting\nsound classes is easier than predicting phones, as evidenced by a much higher accuracy compared to\nour previous results. As in previous experiments, the lower layers of the network (input and cnn1)\nproduce the best representations for predicting sound classes. Performance then \ufb01rst drops at cnn2\nand increases steadily with each recurrent layer, \ufb01nally decreasing at the last recurrent layer. It\nappears that higher layers do not generate better representations for abstract sound classes.\nNext we analyze the difference between the input layer and the best recurrent layer (rnn5), broken\ndown to speci\ufb01c sound classes. We calculate the change in F1 score (harmonic mean of precision and\nrecall) when moving from input representations to rnn5 representations, where F1 is calculated in two\n\n7\n\n\fFigure 4: Accuracy of classi\ufb01cation into\nsound classes using representations from dif-\nferent layers of DeepSpeech2.\n\nFigure 5: Difference in F1 score using repre-\nsentations from layer rnn5 compared to the\ninput layer.\n\n(a) input\n\n(b) cnn2\n\n(c) rnn5\n\nFigure 6: Confusion matrices of sound class classi\ufb01cation using representations from different layers.\n\nways. The inter-class F1 is calculated by directly predicting coarse sound classes, thus measuring how\noften the model confuses two separate sound classes. The intra-class F1 is obtained by predicting\n\ufb01ne-grained phones and micro-averaging F1 inside each coarse sound class (not counting confusion\noutside the class). It indicates how often the model confuses different phones in the same sound class.\nAs Figure 5 shows, in most cases representations from rnn5 degrade the performance, both within\nand across classes. There are two notable exceptions. Affricates are better predicted at the higher\nlayer, both compared to other sound classes and when predicting individual affricates. It may be that\nmore contextual information is needed in order to detect a complex sound like an affricate. Second,\nthe intra-class F1 for nasals improves with representations from rnn5, whereas the inter-class F1 goes\ndown, suggesting that rnn5 is better at distinguishing between different nasals.\nFinally, Figure 6 shows confusion matrices of predicting sound classes using representations from the\ninput, cnn2, and rnn5 layers. Much of the confusion arises from confusing relatively similar classes:\nsemivowels/vowels, affricates/stops, affricates/fricatives. Interestingly, affricates are less confused at\nlayer rnn5 than in lower layers, which is consistent with our previous observation.\n\n6 Conclusion\n\nIn this work, we analyzed representations in a deep end-to-end ASR model that is trained with a CTC\nloss. We empirically evaluated the quality of the representations on a frame classi\ufb01cation task, where\neach frame is classi\ufb01ed into its corresponding phone label. We compared feature representations from\ndifferent layers of the ASR model and observed striking differences in their quality. We also found\nthat these differences are partly correlated with the separability of the representations in vector space.\nIn future work, we would like to extend this analysis to other speech features, such as speaker and\ndialect ID, and to larger speech recognition datasets. 
We are also interested in experimenting with\nother end-to-end systems, such as sequence-to-sequence models and acoustics-to-words systems.\nAnother venue for future work is to improve the end-to-end model based on our insights, for example\nby improving the representation capacity of certain layers in the deep neural network.\n\n8\n\n\fAcknowledgements\n\nWe would like to thank members of the MIT spoken language systems group for helpful discussions.\nThis work was supported by the Qatar Computing Research Institute (QCRI).\n\nReferences\n[1] A. Graves and N. Jaitly, \u201cTowards End-To-End Speech Recognition with Recurrent Neural\nNetworks,\u201d in Proceedings of the 31st International Conference on Machine Learning (ICML-\n14), T. Jebara and E. P. Xing, Eds.\nJMLR Workshop and Conference Proceedings, 2014, pp.\n1764\u20131772.\n\n[2] Y. Miao, M. Gowayyed, and F. Metze, \u201cEESEN: End-to-end speech recognition using deep RNN\nmodels and WFST-based decoding,\u201d in 2015 IEEE Workshop on Automatic Speech Recognition\nand Understanding (ASRU).\n\nIEEE, 2015, pp. 167\u2013174.\n\n[3] J. Chorowski, D. Bahdanau, K. Cho, and Y. Bengio, \u201cEnd-to-end Continuous Speech Recog-\nnition using Attention-based Recurrent NN: First Results,\u201d arXiv preprint arXiv:1412.1602,\n2014.\n\n[4] W. Chan, N. Jaitly, Q. V. Le, and O. Vinyals, \u201cListen, Attend and Spell: A Neural Network for\nLarge Vocabulary Conversational Speech Recognition,\u201d in 2016 IEEE International Conference\non Acoustics, Speech and Signal Processing (ICASSP).\n\nIEEE, 2016, pp. 4960\u20134964.\n\n[5] S. Toshniwal, H. Tang, L. Lu, and K. Livescu, \u201cMultitask Learning with Low-Level Auxiliary\nTasks for Encoder-Decoder Based Speech Recognition,\u201d arXiv preprint arXiv:1704.01631,\n2017.\n\n[6] F. Eyben, M. W\u00f6llmer, B. Schuller, and A. Graves, \u201cFrom Speech to Letters - Using a Novel\nNeural Network Architecture for Grapheme Based ASR,\u201d in 2009 IEEE Workshop on Automatic\nSpeech Recognition and Understanding (ASRU), Nov 2009, pp. 376\u2013380.\n\n[7] D. Amodei, R. Anubhai, E. Battenberg, C. Case, J. Casper, B. Catanzaro, J. Chen,\nM. Chrzanowski, A. Coates, G. Diamos et al., \u201cDeep Speech 2: End-to-End Speech Recognition\nin English and Mandarin,\u201d in Proceedings of The 33rd International Conference on Machine\nLearning, 2016, pp. 173\u2013182.\n\n[8] D. Bahdanau, J. Chorowski, D. Serdyuk, P. Brakel, and Y. Bengio, \u201cEnd-to-End Attention-based\nLarge Vocabulary Speech Recognition,\u201d in 2016 IEEE International Conference on Acoustics,\nSpeech and Signal Processing (ICASSP).\n\nIEEE, 2016, pp. 4945\u20134949.\n\n[9] H. Soltau, H. Liao, and H. Sak, \u201cNeural Speech Recognizer: Acoustic-to-Word LSTM Model\n\nfor Large Vocabulary Speech Recognition,\u201d in Interspeech 2017, 2017.\n\n[10] K. Audhkhasi, B. Ramabhadran, G. Saon, M. Picheny, and D. Nahamoo, \u201cDirect Acoustics-to-\n\nWord Models for English Conversational Speech Recognition,\u201d in Interspeech 2017, 2017.\n\n[11] A.-r. Mohamed, G. Hinton, and G. Penn, \u201cUnderstanding how deep belief networks perform\nacoustic modelling,\u201d in 2012 IEEE International Conference on Acoustics, Speech and Signal\nProcessing (ICASSP).\n\nIEEE, 2012, pp. 4273\u20134276.\n\n[12] D. Yu, M. L. Seltzer, J. Li, J.-T. Huang, and F. Seide, \u201cFeature Learning in Deep Neural\nNetworks - Studies on Speech Recognition Tasks,\u201d in International Conference on Learning\nRepresentations (ICLR), 2013.\n\n[13] T. Nagamine, M. L. 
Seltzer, and N. Mesgarani, \u201cExploring How Deep Neural Networks Form\n\nPhonemic Categories,\u201d in Interspeech 2015, 2015.\n\n[14] \u2014\u2014, \u201cOn the Role of Nonlinear Transformations in Deep Neural Network Acoustic Models,\u201d\n\nin Interspeech 2016, 2016, pp. 803\u2013807.\n\n[15] Y.-H. Wang, C.-T. Chung, and H.-y. Lee, \u201cGate Activation Signal Analysis for Gated Recurrent\n\nNeural Networks and Its Correlation with Phoneme Boundaries,\u201d in Interspeech 2017, 2017.\n\n[16] Z. Wu and S. King, \u201cInvestigating gated recurrent networks for speech synthesis,\u201d in 2016 IEEE\nIEEE, 2016,\n\nInternational Conference on Acoustics, Speech and Signal Processing (ICASSP).\npp. 5140\u20135144.\n\n9\n\n\f[17] S. Wang, Y. Qian, and K. Yu, \u201cWhat Does the Speaker Embedding Encode?\u201d in Interspeech\n2017, 2017, pp. 1497\u20131501. [Online]. Available: http://dx.doi.org/10.21437/Interspeech.\n2017-1125\n\n[18] R. Chaabouni, E. Dunbar, N. Zeghidour, and E. Dupoux, \u201cLearning weakly supervised multi-\n\nmodal phoneme embeddings,\u201d in Interspeech 2017, 2017.\n\n[19] G. Chrupa\u0142a, L. Gelderloos, and A. Alishahi, \u201cRepresentations of language in a model of\nvisually grounded speech signal,\u201d in Proceedings of the 55th Annual Meeting of the Association\nfor Computational Linguistics (Volume 1: Long Papers). Association for Computational\nLinguistics, 2017, pp. 613\u2013622.\n\n[20] D. Harwath and J. Glass, \u201cLearning Word-Like Units from Joint Audio-Visual Analysis,\u201d in\nProceedings of the 55th Annual Meeting of the Association for Computational Linguistics\n(Volume 1: Long Papers). Association for Computational Linguistics, 2017, pp. 506\u2013517.\n\n[21] A. Alishahi, M. Barking, and G. Chrupa\u0142a, \u201cEncoding of phonology in a recurrent neural\nmodel of grounded speech,\u201d in Proceedings of the 21st Conference on Computational Natural\nLanguage Learning (CoNLL 2017). Association for Computational Linguistics, 2017, pp.\n368\u2013378.\n\n[22] A. K\u00f6hn, \u201cWhat\u2019s in an Embedding? Analyzing Word Embeddings through Multilingual\nEvaluation,\u201d in Proceedings of the 2015 Conference on Empirical Methods in Natural Language\nProcessing. Lisbon, Portugal: Association for Computational Linguistics, September 2015,\npp. 2067\u20132073. [Online]. Available: http://aclweb.org/anthology/D15-1246\n\n[23] P. Qian, X. Qiu, and X. Huang, \u201cInvestigating Language Universal and Speci\ufb01c\nProperties in Word Embeddings,\u201d in Proceedings of the 54th Annual Meeting of the\nAssociation for Computational Linguistics (Volume 1: Long Papers). Berlin, Germany:\nAssociation for Computational Linguistics, August 2016, pp. 1478\u20131488. [Online]. Available:\nhttp://www.aclweb.org/anthology/P16-1140\n\n[24] Y. Adi, E. Kermany, Y. Belinkov, O. Lavi, and Y. Goldberg, \u201cFine-grained Analysis of Sentence\nEmbeddings Using Auxiliary Prediction Tasks,\u201d in International Conference on Learning\nRepresentations (ICLR), April 2017.\n\n[25] X. Shi, I. Padhi, and K. Knight, \u201cDoes String-Based Neural MT Learn Source Syntax?\u201d in\nProceedings of the 2016 Conference on Empirical Methods in Natural Language Processing.\nAustin, Texas: Association for Computational Linguistics, November 2016, pp. 1526\u20131534.\n\n[26] Y. Belinkov, N. Durrani, F. Dalvi, H. Sajjad, and J. 
Glass, \u201cWhat do Neural Machine Translation\nModels Learn about Morphology?\u201d in Proceedings of the 55th Annual Meeting of the Associa-\ntion for Computational Linguistics (Volume 1: Long Papers). Association for Computational\nLinguistics, 2017, pp. 861\u2013872.\n\n[27] L. Gelderloos and G. Chrupa\u0142a, \u201cFrom phonemes to images: levels of representation in a\nrecurrent neural model of visually-grounded language learning,\u201d in Proceedings of COLING\n2016, the 26th International Conference on Computational Linguistics: Technical Papers.\nOsaka, Japan: The COLING 2016 Organizing Committee, December 2016, pp. 1309\u20131319.\n\n[28] S. Hochreiter and J. Schmidhuber, \u201cLong short-term memory,\u201d Neural Computation, vol. 9,\n\nno. 8, pp. 1735\u20131780, 1997.\n\n[29] S. Ioffe and C. Szegedy, \u201cBatch Normalization: Accelerating Deep Network Training by\nReducing Internal Covariate Shift,\u201d in Proceedings of the 32Nd International Conference on\nInternational Conference on Machine Learning (ICML), vol. 37, 2015, pp. 448\u2013456.\n\n[30] C. Laurent, G. Pereyra, P. Brakel, Y. Zhang, and Y. Bengio, \u201cBatch Normalized Recurrent\nNeural Networks,\u201d in 2016 IEEE International Conference on Acoustics, Speech and Signal\nProcessing (ICASSP).\n\nIEEE, 2016, pp. 2657\u20132661.\n\n[31] A. Graves, S. Fern\u00e1ndez, F. Gomez, and J. Schmidhuber, \u201cConnectionist Temporal Classi\ufb01cation:\nLabelling Unsegmented Sequence Data with Recurrent Neural Networks,\u201d in Proceedings of\nthe 23rd International Conference on Machine Learning (ICML), 2006, pp. 369\u2013376.\n\n[32] D. Kingma and J. Ba, \u201cAdam: A Method for Stochastic Optimization,\u201d arXiv preprint\n\narXiv:1412.6980, 2014.\n\n[33] S. Naren, \u201cdeepspeech.torch,\u201d https://github.com/SeanNaren/deepspeech.torch, 2016.\n\n10\n\n\f[34] V. Panayotov, G. Chen, D. Povey, and S. Khudanpur, \u201cLibrispeech: an ASR corpus based on\npublic domain audio books,\u201d in 2015 IEEE International Conference on Acoustics, Speech and\nSignal Processing (ICASSP).\n\nIEEE, 2015, pp. 5206\u20135210.\n\n[35] K.-F. Lee and H.-W. Hon, \u201cSpeaker-independent phone recognition using hidden markov\nmodels,\u201d IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 37, no. 11, pp.\n1641\u20131648, 1989.\n\n[36] T. N. Sainath, O. Vinyals, A. Senior, and H. Sak, \u201cConvolutional, Long Short-Term Memory,\nfully connected Deep Neural Networks,\u201d in 2015 IEEE International Conference on Acoustics,\nSpeech and Signal Processing (ICASSP).\n\nIEEE, 2015, pp. 4580\u20134584.\n\n[37] L. v. d. Maaten and G. Hinton, \u201cVisualizing data using t-SNE,\u201d Journal of Machine Learning\n\nResearch, vol. 9, pp. 2579\u20132605, 2008.\n\n[38] H. Sak, F. de Chaumont Quitry, T. Sainath, K. Rao et al., \u201cAcoustic Modelling with CD-\nCTC-SMBR LSTM RNNs,\u201d in 2015 IEEE Workshop on Automatic Speech Recognition and\nUnderstanding (ASRU).\n\nIEEE, 2015, pp. 604\u2013609.\n\n11\n\n\f", "award": [], "sourceid": 1433, "authors": [{"given_name": "Yonatan", "family_name": "Belinkov", "institution": "MIT"}, {"given_name": "James", "family_name": "Glass", "institution": "Massachusetts Institute of Technology"}]}