{"title": "Untangling in Invariant Speech Recognition", "book": "Advances in Neural Information Processing Systems", "page_first": 14391, "page_last": 14401, "abstract": "Encouraged by the success of deep convolutional neural networks on a variety of visual tasks, much theoretical and experimental work has been aimed at understanding and interpreting how vision networks operate. At the same time, deep neural networks have also achieved impressive performance in audio processing applications, both as sub-components of larger systems and as complete end-to-end systems by themselves. Despite their empirical successes, comparatively little is understood about how these audio models accomplish these tasks.In this work, we employ a recently developed statistical mechanical theory that connects geometric properties of network representations and the separability of classes to probe how information is untangled within neural networks trained to recognize speech. We observe that speaker-specific nuisance variations are discarded by the network's hierarchy, whereas task-relevant properties such as words and phonemes are untangled in later layers. Higher level concepts such as parts-of-speech and context dependence also emerge in the later layers of the network. Finally, we find that the deep representations carry out significant temporal untangling by efficiently extracting task-relevant features at each time step of the computation. 
Taken together, these findings shed light on how deep auditory models process their time dependent input signals to carry out invariant speech recognition, and show how different concepts emerge through the layers of the network.", "full_text": "Untangling in Invariant Speech Recognition

Cory Stephenson (Intel AI Lab) cory.stephenson@intel.com
Jenelle Feather (MIT) jfeather@mit.edu
Suchismita Padhy (Intel AI Lab) suchismita.padhy@intel.com
Oguz Elibol (Intel AI Lab) oguz.h.elibol@intel.com
Hanlin Tang (Intel AI Lab) hanlin.tang@intel.com
Josh McDermott (MIT / Center for Brains, Minds, and Machines) jhm@mit.edu
SueYeon Chung (Columbia University / MIT) sueyeon@mit.edu

Abstract

Encouraged by the success of deep neural networks on a variety of visual tasks, much theoretical and experimental work has been aimed at understanding and interpreting how vision networks operate. Meanwhile, deep neural networks have also achieved impressive performance in audio processing applications, both as sub-components of larger systems and as complete end-to-end systems by themselves. Despite their empirical successes, comparatively little is understood about how these audio models accomplish these tasks. In this work, we employ a recently developed statistical mechanical theory that connects geometric properties of network representations and the separability of classes to probe how information is untangled within neural networks trained to recognize speech. We observe that speaker-specific nuisance variations are discarded by the network's hierarchy, whereas task-relevant properties such as words and phonemes are untangled in later layers. Higher level concepts such as parts-of-speech and context dependence also emerge in the later layers of the network.
Finally, we find that the deep representations carry out significant temporal untangling by efficiently extracting task-relevant features at each time step of the computation. Taken together, these findings shed light on how deep auditory models process time dependent input signals to achieve invariant speech recognition, and show how different concepts emerge through the layers of the network.

1 Introduction

Understanding invariant object recognition is one of the key challenges in cognitive neuroscience and artificial intelligence [1]. An accurate recognition system will predict the same class regardless of stimulus variations, such as changes in the viewing angle of an object or differences in the pronunciation of a spoken word. Although the class predicted by such a system is unchanged, the internal representations of individual objects within the class may differ. The set of representations corresponding to the same object class can then be thought of as an object manifold. In vision systems, it has been hypothesized that these "object manifolds", which are hopelessly entangled in the input, become "untangled" across the visual hierarchy, enabling the separation of different categories both in the brain [2] and in deep artificial neural networks [3]. Auditory recognition also requires the separation of highly variable inputs according to class, and could involve the untangling of 'auditory class manifolds'.

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

In contrast to vision, auditory signals unfold over time, and the impact of this difference on underlying representations is poorly understood.

Speech recognition is a natural domain for analyzing auditory class manifolds not only with word and speaker classes, but also at the phonetic and semantic level.
In recent years, hierarchical neural network models have achieved state-of-the-art performance in automatic speech recognition (ASR) and speaker identification [4, 5]. Understanding how these end-to-end models represent language and speech information remains a major challenge and is an active area of research [6, 7]. Several studies on speech recognition systems have analyzed how phonetic information is encoded in acoustic models [8, 9, 10], and how it is embedded across layers by making use of classifiers [11, 12, 13, 6]. It has also been shown that deep neural networks trained on tasks such as speech and music recognition resemble human behavior and auditory cortex activity [14]. Ultimately, understanding speech processing in deep networks may shed light on how the brain processes auditory information.

Much of the prior work on characterizing how information is represented and processed in deep networks and the brain has focused on linear separability, representational similarity, and geometric measures. For instance, the representations of different objects in vision models and the macaque ventral stream become more linearly separable at deeper stages, as measured by applying a linear classifier to each intermediate layer [2, 3]. Representations have been compared across different networks, layers, and training epochs using Canonical Correlation Analysis (CCA) [15, 16, 17]. Representational similarity analysis (RSA), which evaluates the similarity of representations derived for different inputs, has been used to compare networks [18, 19]. Others have used explicit geometric measures to understand deep networks, such as curvature [3, 20], geodesics [21], and Gaussian mean width [22].
However, none of these measures makes a concrete connection between the separability of object representations and their geometric properties.

In this work, we make use of a recently developed theoretical framework [23, 24, 25] based on the replica method [26, 27, 28] that links the geometry of object manifolds to the capacity of a linear classifier, as a measure of the amount of information stored about object categories per feature dimension. This method has been used in visual convolutional neural networks (CNNs) to characterize object-related information content across layers, and to relate it to the emergent representational geometry to understand how object manifolds are untangled across layers [25]. Here we apply manifold analyses1 to auditory models for the first time, and show that neural network speech recognition systems also untangle speech objects relevant for the task. This untangling can also be an emergent property, meaning the model learns to untangle some types of object manifolds without being trained to do so explicitly.

We present several key findings:

1. We find significant untangling of word manifolds in different model architectures trained on speech tasks. We also see emergent untangling of higher-level concepts such as words, phonemes, and parts of speech in an end-to-end ASR model (Deep Speech 2).

2. Both a CNN architecture and the end-to-end ASR model converge on remarkably similar behavior despite being trained for different tasks and built from different computational blocks. They both learn to discard nuisance acoustic variations, and exhibit untangling of task-relevant information.

3.
Temporal dynamics in recurrent layers reveal untangling over recurrent time steps, in the form of a smaller manifold radius and lower manifold dimensionality.

In addition, we show the generality of auditory untangling with speaker manifolds in a network trained on a speaker recognition task; these are not evident in either the end-to-end ASR model or the model trained explicitly to recognize words. These results provide the first geometric evidence for untangling of manifolds, from phonemes to parts-of-speech, in deep neural networks for speech recognition.

1Our implementation of the analysis methods: https://github.com/schung039/neural_manifolds_replicaMFT

Figure 1: Illustration of word manifolds. (a) Highly tangled manifolds, in the low capacity regime. (b) Untangled manifolds, in the high capacity regime. (c) Manifold Dimension captures the projection of a Gaussian vector onto the direction of an anchor point, and Manifold Radius captures the norm of an anchor point in the manifold subspace. (d) Illustration of the untanglement of words over time.

2 Methods

To understand representations in speech models, we first train a neural network on a corpus of transcribed speech. Then, we use the trained models to extract per-layer representations at every time step for each corpus stimulus. Finally, we apply the mean-field theoretic manifold analysis technique [24, 25] (hereafter, the MFTMA technique) to measure manifold capacity and other manifold geometric properties (radius, dimension, correlation) on a subsample of the test dataset.

Formally, if we have P objects (e.g. words), we construct a dataset D with pairs (x_i, y_i), where x_i is the auditory input, and y_i ∈ {1, 2, ..., P} is the object class. Given a neural network N(x), we extract N^l_t(x), which is the output of the network at time t in layer l, for all inputs x whose corresponding label is p, for each p ∈ {1, 2, ..., P}.
The object manifold at layer l for class p is then defined as the point cloud of activations obtained from the different examples x_i of the p-th class. We then apply the MFTMA technique to this set of activations to compute the manifold capacity, manifold dimension, radius, and correlations for that manifold.

The manifold capacity obtained by the MFTMA technique captures the linear separability of object manifolds. Furthermore, as shown in [24, 25] and outlined in SM Sect. 1.2, the mean-field theory calculation of manifold capacity also gives a concrete connection between the measure of separability and the size and dimensionality of the manifolds. This analysis therefore gives additional insight into both the separability of object manifolds, and how this separability is achieved geometrically. We measure these properties for different manifold types, including categories such as phonemes and words, and linguistic feature categories such as part-of-speech tags. This allows us to quantify the amount of invariant object information and the characteristics of the emergent geometry in the representations learned by the speech models.

2.1 Object Manifold Capacity and the mean-field theoretic manifold analysis (MFTMA)

In a system where P object manifolds are represented by N features, the 'load' on the system is defined by α = P/N. When α is small, i.e. a small number of object manifolds are embedded in a high dimensional feature space, it is easy to find a separating hyperplane for a random dichotomy2 of the manifolds. When α is large, too many categories are squeezed into a low dimensional feature space, rendering the manifolds highly inseparable.
Manifold capacity refers to the critical load, α_C = P/N, defined by the critical number of object manifolds, P, that can be linearly separated given N features. Above α_C, most dichotomies are inseparable, and below α_C, most are separable [24, 25]. This framework generalizes the notion of the perceptron storage capacity [26] from points to manifolds, re-defining the unit of counting to be object manifolds rather than individual points. The manifold capacity thus serves as a measure of the linearly decodable information about object identity per feature, and it can be measured from data in two ways:

1. Empirical Manifold Capacity, α_SIM: the manifold capacity can be measured empirically with a bisection search to find the critical number of features N such that the fraction of linearly separable random dichotomies is close to 1/2.

2. Mean Field Theoretic Manifold Capacity, α_MFT: can be estimated using the replica mean field formalism with the framework introduced by [24, 25]. α_MFT is estimated from the statistics of anchor points (shown in Fig. 1(c)), s̃, representative points for the linear classification3.

2Here, we define a random dichotomy as an assignment of random ±1 labels to each manifold.

The manifold capacity for point-cloud manifolds is lower bounded by the case where there is no manifold structure. This lower bound is given by [29, 24].
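The bisection procedure behind α_SIM can be sketched in a few lines. The sketch below is our own illustration, not the authors' implementation (their code is linked in footnote 1), and all function names are hypothetical: separability of a random ±1 dichotomy is posed as a linear-program feasibility problem, and a bisection over the number of random-projection features locates the point where roughly half of random dichotomies are separable.

```python
import numpy as np
from scipy.optimize import linprog

def is_separable(X, y):
    """Check if labeled points (y in {-1, +1}) are linearly separable by
    solving the LP feasibility problem: find w, b with y_i (w.x_i + b) >= 1."""
    n, d = X.shape
    # Variables are [w (d entries), b]; constraints: -y_i (w.x_i + b) <= -1.
    A_ub = -(y[:, None] * np.hstack([X, np.ones((n, 1))]))
    res = linprog(c=np.zeros(d + 1), A_ub=A_ub, b_ub=-np.ones(n),
                  bounds=[(None, None)] * (d + 1), method="highs")
    return res.status == 0  # feasible => separable

def fraction_separable(manifolds, n_features, n_trials=20, rng=None):
    """Project P point-cloud manifolds (each an M x D array) onto n_features
    random directions and measure how often a random +/-1 dichotomy of the
    manifolds is linearly separable."""
    rng = np.random.default_rng(rng)
    D = manifolds[0].shape[1]
    hits = 0
    for _ in range(n_trials):
        proj = rng.standard_normal((D, n_features)) / np.sqrt(n_features)
        labels = rng.choice([-1.0, 1.0], size=len(manifolds))
        X = np.vstack([m @ proj for m in manifolds])
        y = np.concatenate([np.full(len(m), s)
                            for m, s in zip(manifolds, labels)])
        hits += is_separable(X, y)
    return hits / n_trials

def empirical_capacity(manifolds, n_lo=2, n_hi=512, tol=2, rng=0):
    """Bisection search for the critical feature count N_c at which the
    fraction of separable random dichotomies crosses 1/2; alpha_SIM = P/N_c."""
    P = len(manifolds)
    while n_hi - n_lo > tol:
        n_mid = (n_lo + n_hi) // 2
        if fraction_separable(manifolds, n_mid, rng=rng) >= 0.5:
            n_hi = n_mid  # still separable: fewer features may suffice
        else:
            n_lo = n_mid
    return P / n_hi
```

The LP feasibility check stands in for whatever separability oracle (e.g. an SVM) a real implementation would use; for finite point sets, strict separation with margin 1 is feasible exactly when the dichotomy is linearly separable.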
In this work, we show α_MFT/α_LB to allow comparison between datasets of different sizes and therefore different lower bounds.

Manifold capacity is closely related to the underlying geometric properties of the object manifolds. Recent work demonstrates that the manifold classification capacity can be predicted by an object manifold's Manifold Dimension, D_M, Manifold Radius, R_M, and the correlations between the centroids of the manifolds [23, 24, 25]. These geometric properties capture the statistical properties of the anchor points, the representative support vectors of each manifold relevant for the linear classification, which change as the choice of other manifolds varies [24]. The MFTMA technique also measures these quantities, along with the manifold capacity:

Manifold Dimension, D_M: D_M captures the dimensions realized by the anchor point from the guiding Gaussian vectors shown in Fig. 1(c), and estimates the average embedding dimension of the manifold contributing to the classification. This is upper bounded by min(M, N), where M is the number of points per manifold and N is the feature dimension. In this work, M < N, and we present D_M/M for fair comparison between different datasets.

Manifold Radius, R_M: R_M is the average distance between the manifold center and the anchor points, as shown in Fig. 1(c). Note that R_M is the size relative to the norm of the manifold center, reflecting the fact that the relative scale of the manifold compared to the overall distribution is what matters for linear separability, rather than the absolute scale.

Center Correlations, ρ_center: ρ_center measures how correlated the locations of the object manifolds are, and is calculated as the average of pairwise correlations between manifold centers [25]. It has been suggested that the capacity is inversely correlated with D_M, R_M, and the center correlation [24, 25].
Details for computing the anchor points s̃ can be found in the description of the mean-field theoretic manifold capacity algorithm, and a summary of the method is provided in the SM.

In addition to the mean-field theoretic manifold properties, we measure the dimensionality of the data across different categories with two popular measures for the dimensionality of complex datasets:

Participation Ratio, D_PR: D_PR is defined as (Σ_i λ_i)² / Σ_i λ_i², where λ_i is the i-th eigenvalue of the covariance of the data, measuring how many dimensions of the eigen-spectrum are active [17].

Explained Variance Dimension, D_EV: D_EV is defined as the number of principal components required to explain a fixed percentage (90% in this paper) of the total variance [30].

Note that all of the dimensional measures, i.e. D_M, D_PR and D_EV, are upper bounded by the number of eigenvalues, which is min(#samples, #features) (these values are used in Fig. 2-6).

2.2 Models and datasets

We examined two speech recognition models. The first model is a CNN model based on [14], with a small modification to use batch normalization layers instead of local response normalization (the full architecture can be found in Table SM1). We trained the model on two tasks: word recognition and speaker recognition. For word recognition, we trained on two-second segments from a combination of the WSJ Corpus [31] and the Spoken Wikipedia Corpora [32], with noise augmentation from AudioSet backgrounds [33]. For more training details, please see the SM.

The second is an end-to-end ASR model, Deep Speech 2 (DS2) [5], based on an open source implementation4. DS2 is trained to produce accurate character-level output with the Connectionist Temporal Classification (CTC) loss function [34].
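The two dimensionality measures defined in Sec. 2.1, D_PR and D_EV, can both be computed from the eigenvalues λ_i of the feature covariance. Below is a minimal sketch of our own (not the authors' code) for reference:

```python
import numpy as np

def participation_ratio(X):
    """D_PR = (sum_i lambda_i)^2 / sum_i lambda_i^2, where lambda_i are the
    eigenvalues of the feature covariance of X (rows = samples)."""
    lam = np.linalg.eigvalsh(np.cov(X, rowvar=False))
    lam = np.clip(lam, 0.0, None)  # guard against tiny negative eigenvalues
    return lam.sum() ** 2 / (lam ** 2).sum()

def explained_variance_dim(X, threshold=0.90):
    """D_EV: smallest number of principal components explaining `threshold`
    of the total variance (90% in the paper)."""
    lam = np.sort(np.linalg.eigvalsh(np.cov(X, rowvar=False)))[::-1]
    lam = np.clip(lam, 0.0, None)
    frac = np.cumsum(lam) / lam.sum()
    return int(np.searchsorted(frac, threshold) + 1)
```

For isotropic data, D_PR approaches the ambient dimension; when one direction dominates the variance, D_EV collapses to 1, matching the intuition that both quantities count "active" directions of the eigen-spectrum.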
The full architecture can be found in Table SM2. Our model was trained on the 960 hour training portion of the LibriSpeech dataset [35], achieving a word error rate (WER) of 12% and 22.7% respectively on the clean and other partitions of the test set, without the use of a language model. The model trained on LibriSpeech also performs reasonably well on the TIMIT dataset, with a WER of 29.9% without using a language model.

3See the SM for the exact relationship between s̃ and the capacity, an outline of the code, and a demonstration that the MFT manifold capacity matches the empirical capacity (given in Fig. SM1).

4https://github.com/SeanNaren/deepspeech.pytorch

Figure 2: Word manifold capacity emerges in both the CNN word classification model and the end-to-end ASR model (DS2). Top: As expected, the CNN model trained with explicit word supervision (blue lines) exhibits strong capacity in later layers, compared to the initial weights (black lines). This increase is due to reduced radius and dimension, as well as decorrelation. Bottom: A similar trend emerges in DS2 without training with explicit word supervision. In both, capacity is normalized against the theoretical lower bound (see Methods). The shaded area represents the 95% confidence interval hereafter.

We followed the procedure described in Sec. 2 to construct manifold datasets for several different types of object categories, using holdout data not seen by the models during training. The datasets used in the analysis of each model were as follows:

CNN manifolds datasets: Word manifolds for the CNN were measured using data from the WSJ corpus. Each of the P = 50 word manifolds consists of M = 50 speakers saying the word, and each of the P = 50 speaker manifolds consists of one speaker saying M = 50 different words.

DS2 manifolds datasets:5 Word and speaker manifolds were taken from the test portion of the LibriSpeech dataset.
For word manifolds, P = 50 words with M = 20 examples each were selected, ensuring each example came from a different speaker. For speaker manifolds, P = 50 speakers were selected with M = 20 utterances per speaker.

For the comparison between character, phoneme, word, and parts-of-speech manifolds, similar manifold datasets were also constructed from TIMIT, which includes phoneme and word alignments. P = 50 and M = 50 were used for character and phoneme manifolds, but owing to the smaller size of TIMIT, P = 23 and M = 20 were used for word manifolds. Likewise, we used a set of P = 22 tags, with M = 50, for the parts-of-speech manifolds.

Feature Extraction: For each layer of the CNN and DS2 models, the activations were measured for each exemplar, and 5000 random projections with unit length were computed on which to measure the geometric properties. For the temporal analysis in the recurrent DS2 model, full features were extracted for each time step.

3 Results

3.1 Untangling of words

We first investigated the CNN model, which was trained to identify words from a fixed vocabulary using the dataset described in Sec. 2.2. Since this model had explicit word-level supervision, we observed, as expected, that the word classes were more separable (higher capacity) in later network layers (Figure 2, Top). This word manifold capacity was not observed with the initial weights of the model (Figure 2, black lines). As a negative control, we trained the same CNN architecture

5See Sect.
SM3.2 for more details on the construction and composition of the datasets used for experiments on this model.

Figure 3: Character, phoneme, word, and part-of-speech manifolds emerge in the DS2 model.

to identify speakers instead, and did not observe increased word manifold capacity after training (see Fig. SM3). These results demonstrate that word separability is task dependent and not a trivial consequence of the input, acoustic features, or model architecture. Furthermore, the MFTMA metrics reveal that this increased word capacity in later layers is due to both a reduction in the manifold radius and a reduction in the manifold dimension.

Most end-to-end ASR systems are not trained to explicitly classify words in the input, owing to the difficulty of collecting and annotating large datasets with word-level alignment, and the large vocabulary size of natural speech. Instead, models such as Deep Speech 2 are trained to output character-level sequences. Despite not being trained to explicitly classify words, untangling of words was also emergent on the LibriSpeech dataset (Figure 2, bottom).
In both the CNN and DS2 models, we present capacity values normalized by the lower bound, and the manifold dimension normalized by the upper bound (Sec. 2.1), for easy comparison.

Surprisingly, across the CNN and recurrent DS2 architectures, the trends in the other geometric metrics were similar. The manifold capacities improve in downstream layers, and the reduction in manifold dimension and radius similarly occurs in downstream layers. Interestingly, word manifold capacity increases dramatically in the last layer of the CNN, but only modestly in the last layers of DS2, perhaps owing to the fact that the CNN model here is explicitly trained on word labels, while in DS2 the word manifolds are emergent properties. Notably, the random weights of the initial model increase correlation across the layers in both networks, but training significantly decreases the center correlation in both models. More analysis over training epochs is given in Section 3.4 and in Sect. SM4.5.

3.2 Untangling of other speech objects

In addition to words, it is possible that auditory objects at different levels of abstraction are also untangled in the ASR task. To investigate this, we look for evidence of untangling at four levels of abstraction: characters, phonemes, words (each word class contains multiple phonemes and multiple characters), and part-of-speech tags (each part-of-speech class contains multiple words). These experiments were done on the end-to-end ASR model (DS2). Results for these four object types are shown in Fig. 3.

On the surface, we see all four of these quantities becoming untangled to some degree as the inputs are propagated through the network. However, in the final layers of the network, the untangling of words is far more prominent than that of phonemes (see the numerical values of capacity in Fig. 3). This suggests that the speech models may need to abstract over the phonetic information and character information (i.e.
silent letters in words). While the higher capacity is due to a lower manifold radius and dimension, we note that the manifold dimension is significantly lower for words than it is for phonemes and parts of speech. The phoneme capacity starts to increase only after the second convolution layer, consistent with prior findings [6].

3.3 Loss of speaker information

In addition to learning to increase the separability of information relevant to the ASR task, robust ASR models should also learn to ignore unimportant differences in their inputs. For example, different instances of the same word spoken by different speakers must be recognized as the same word in spite of variations in pronunciation, rate of speech, etc. If this is true, we expect that the separability of different utterances from the same speaker is decreased by the network, or at least not increased, as this information is not relevant to the ASR task.

Figure 4: Speaker manifolds disappear in different speech models. Top: CNN ((black) initial weights; trained on (blue) words and (orange) speakers). Bottom: DS2 ((black) initial weights, (blue) trained on the ASR task).

Figure 5: Evolution of word manifolds over epochs of training, CNN. Manifold capacity improves over epochs of training, while manifold dimension, radius, and correlations decrease over training. Total data dimension (D_PR, D_EV) is inversely correlated with center correlation and increases over training epochs.
Similar trends are observed in the DS2 model (SM).

Indeed, Figure 4 (Top) shows that for the CNN model trained to classify words, the separability of speaker manifolds (defined by the dataset described in Sec. 2.2) decreases deeper in the network, and is even lower than in the untrained network. In contrast, when training a CNN explicitly to do speaker recognition, the separability of speaker manifolds increases in later layers (while the separability of word manifolds does not; see Sect. SM4.2), demonstrating that the lack of speaker separation and the presence of word information are due to the specific task being optimized.

A similar trend also appears in the DS2 model, as shown in Fig. 4 (Bottom). In both the CNN trained to recognize words and the DS2 model, speaker manifolds become more tangled after training, and in both cases we see that this happens due to an increase in the dimensionality of the speaker manifolds, as the manifold radius remains unchanged after training and the center correlation decreases. In some sense, this mirrors the results in Sec. 3.1 and Sec. 3.2: there the model untangles word-level information by decreasing the manifold dimension, while here it discards information by increasing the manifold dimension instead.

For the CNN model, the decrease in speaker separability occurs mostly uniformly across the layers, whereas in DS2, the separability only drops in the last half of the network, in the recurrent layers. Surprisingly, the early convolution layers of the DS2 model show either unchanged or slightly increased separability of speaker manifolds. We note that the decrease in speaker separability coincides with both the increase in total dimension, as measured by participation ratio and explained variance, as seen in Fig. 5 and Fig.
SM6, as well as a decrease in center correlation.

3.4 Trends over training epochs

In addition to evaluating the MFTMA quantities and effective dimension on fully trained networks, the analysis can also be done as training progresses. Figure 5 shows the early stages of training

Figure 6: Untangling of word manifolds at input timesteps. (Top) Evolution of LibriSpeech word manifolds over timesteps, DS2 model (hypothesized in Fig. 1). (a) Epoch 0 model, capacity; (b-d) fully trained model: (b) capacity, (c) manifold radius, (d) manifold dimension. Vertical lines show the average word boundaries.
(Bottom) Untanglement of two words over timesteps (T = 40 to 70) in the GRU 5 layer of DS2, projected onto 2 PCs.

of the word recognition CNN model (see Fig. SM6 for the early stages of training for DS2, which shows similar trends). The capacity, manifold dimension, manifold radius, and center correlations quickly converge to those measured at the final epochs. Interestingly, the total data dimension (measured by D_PR and D_EV) increases with training epochs, unlike the manifold dimension, D_M, which decreases over training in both models. Intuitively, the training procedure tries to embed different categories in different directions while compressing them, resulting in a lower D_M and a lower center correlation. The increase in total D_PR could be related to the lowered manifold center correlation. The total dimension D_PR could play the role of a larger 'effective' ambient dimension, in turn improving linear separability [36].

3.5 Untanglement of words over time

The above experiments were performed without considering the sequential nature of the inputs. Here, we compute these measures of untangling on each time step separately. This approach can interrogate the role of time in the computation, especially in recurrent models processing arbitrary length inputs.

Figure 6 shows the behavior of capacity, manifold radius, and manifold dimension over the different time steps in the recurrent layers of the end-to-end ASR model (DS2) for the word inputs used in Sec. 3.1. As is perhaps expected, the separability is at the theoretical lower bound for times far away from the word of interest, and peaks near the location of the word. This behavior arises due to the decrease in radius and dimension. However, the peak does not occur at the center of the word, owing to the arbitrary time alignment in the CTC cost function, as noted in [37]. Do the inputs far in time from the word window play a significant role in the untangling of words?
While we omit further investigation of this here due to space constraints, an experiment on varying-length inputs can be found in the SM.
Interestingly, the capacity measured at each input time step has a peak relative capacity of 3.6 (Fig. 6), much larger than the capacity of 1.4 measured from a random projection across all features (Fig. 2). This is despite the lower feature dimension in the time-step analysis (due to considering the features at only one time step, rather than the full activations). This implies that sequential processing also massages the representation in a meaningful way, such that a snapshot at a peak time step has a well-separated, compressed representation, captured by the small values of D_M and R_M. Analogous to Fig. 1(d), the efficiency of temporal separation is illustrated in Fig. 6, bottom.

4 Conclusion

In this paper we studied the emergent geometric properties of speech objects and their linear separability, measured by manifold capacity. Across different networks and datasets, we find that linear separability of auditory class objects improves across the deep network layers, consistent with the untangling hypothesis in vision [2]. Word manifold capacity increases across the deep layers due to emergent geometric properties: decreasing manifold dimension, radius, and center correlations. Characterization of manifolds across training epochs suggests that word untangling is a result of training, as random weights do not untangle word information in the CNN or DS2.
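For reference, the total-dimension measures D_PR and D_EV that recur throughout this analysis follow standard definitions (participation ratio of the covariance eigenvalues, and the number of principal components needed to reach a variance threshold). A minimal numerical sketch (our own illustration, not the authors' released analysis code; the 90% threshold for D_EV is an assumption):

```python
import numpy as np

def participation_ratio(activations):
    """D_PR = (sum_i l_i)^2 / (sum_i l_i^2), where l_i are the
    eigenvalues of the feature covariance of `activations`
    (shape: n_samples x n_features)."""
    centered = activations - activations.mean(axis=0)
    cov = centered.T @ centered / (len(centered) - 1)
    eigvals = np.clip(np.linalg.eigvalsh(cov), 0, None)  # guard tiny negatives
    return eigvals.sum() ** 2 / (eigvals ** 2).sum()

def explained_variance_dim(activations, threshold=0.90):
    """D_EV: number of principal components explaining `threshold` variance."""
    centered = activations - activations.mean(axis=0)
    svals = np.linalg.svd(centered, compute_uv=False)
    var = svals ** 2
    cum = np.cumsum(var) / var.sum()
    return int(np.searchsorted(cum, threshold) + 1)
```

For isotropic data both measures approach the ambient dimension, while for data concentrated along a few directions they are small; the paper's observation is that training raises the total dimension D_PR even as the per-manifold dimension D_M shrinks.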
By measuring capacity and geometric properties at every frame, we find that as representations in ASR systems evolve on the timescale of input sequences, separation between different words emerges temporally.
Speech data naturally contains auditory objects of different scales embedded in the same sequence of sound, and we observed here the emergence and untangling of other speech objects such as phoneme and part-of-speech manifolds. Interestingly, speaker information (measured by speaker manifolds) dissipates across layers. This effect is due to the network being trained for word classification, since CNNs trained for speaker classification are observed to untangle speaker manifolds. However, transfer between speaker ID and word recognition was poor, and a network trained for both speaker ID and words showed emergence of both manifolds, but these tasks were not synergistic (see Fig. SM4). These results suggest that task transfer performance is closely related to task structure, and can be captured by representation geometry.
Our methodology and results suggest many interesting future directions. Among other things, we hope that our work will motivate: (1) theory-driven geometric analysis of representation untangling in tasks with temporal structure; (2) the search for mechanistic relations between network architecture, learned parameters, and the structure of the stimuli through the lens of geometry; and (3) future study of competing vs. synergistic tasks enabled by this geometric analysis tool.

Acknowledgments

We thank Yonatan Belinkov, Haim Sompolinsky, Larry Abbott, Tyler Lee, Anthony Ndirango, Gokce Keskin, and Ting Gong for helpful discussions. This work was funded by NSF grant BCS-1634050 to J.H.M. and a DOE CSGF Fellowship to J.J.F.
S.C. acknowledges support from an Intel Corporate Research Grant, NSF NeuroNex Award DBI-1707398, and The Gatsby Charitable Foundation.

References

[1] Tatyana O Sharpee, Craig A Atencio, and Christoph E Schreiner. Hierarchical representations in the auditory cortex. Current Opinion in Neurobiology, 21(5):761–767, 2011.

[2] James J DiCarlo and David D Cox. Untangling invariant object recognition. Trends in Cognitive Sciences, 11(8):333–341, 2007.

[3] Ben Poole, Subhaneil Lahiri, Maithra Raghu, Jascha Sohl-Dickstein, and Surya Ganguli. Exponential expressivity in deep neural networks through transient chaos. In Advances in Neural Information Processing Systems, pages 3360–3368, 2016.

[4] Chao Li, Xiaokong Ma, Bing Jiang, Xiangang Li, Xuewei Zhang, Xiao Liu, Ying Cao, Ajay Kannan, and Zhenyao Zhu. Deep Speaker: an end-to-end neural speaker embedding system. arXiv preprint arXiv:1705.02304, 2017.

[5] Dario Amodei, Sundaram Ananthanarayanan, Rishita Anubhai, Jingliang Bai, Eric Battenberg, Carl Case, Jared Casper, Bryan Catanzaro, Qiang Cheng, Guoliang Chen, et al. Deep Speech 2: End-to-end speech recognition in English and Mandarin. In International Conference on Machine Learning, pages 173–182, 2016.

[6] Yonatan Belinkov and James Glass. Analyzing hidden representations in end-to-end automatic speech recognition systems. In Advances in Neural Information Processing Systems, pages 2441–2451, 2017.

[7] Yonatan Belinkov. On internal language representations in deep learning: An analysis of machine translation and speech recognition. PhD thesis, Massachusetts Institute of Technology, 2018.

[8] Tasha Nagamine, Michael L Seltzer, and Nima Mesgarani. Exploring how deep neural networks form phonemic categories. In Sixteenth Annual Conference of the International Speech Communication Association, 2015.

[9] Tasha Nagamine, Michael L Seltzer, and Nima Mesgarani.
On the role of nonlinear transformations in deep neural network acoustic models. In Interspeech, pages 803–807, 2016.

[10] Yu-Hsuan Wang, Cheng-Tao Chung, and Hung-yi Lee. Gate activation signal analysis for gated recurrent neural networks and its correlation with phoneme boundaries. In Interspeech, 2017.

[11] Shuai Wang, Yanmin Qian, and Kai Yu. What does the speaker embedding encode? In Interspeech, pages 1497–1501, 2017.

[12] Zied Elloumi, Laurent Besacier, Olivier Galibert, and Benjamin Lecouteux. Analyzing learned representations of a deep ASR performance prediction model. arXiv preprint arXiv:1808.08573, 2018.

[13] Andreas Krug, René Knaebel, and Sebastian Stober. Neuron activation profiles for interpreting convolutional speech recognition models. 2018.

[14] Alexander JE Kell, Daniel LK Yamins, Erica N Shook, Sam V Norman-Haignere, and Josh H McDermott. A task-optimized neural network replicates human auditory behavior, predicts brain responses, and reveals a cortical processing hierarchy. Neuron, 98(3):630–644, 2018.

[15] Maithra Raghu, Justin Gilmer, Jason Yosinski, and Jascha Sohl-Dickstein. SVCCA: Singular vector canonical correlation analysis for deep learning dynamics and interpretability. In Advances in Neural Information Processing Systems, pages 6076–6085, 2017.

[16] Yixuan Li, Jason Yosinski, Jeff Clune, Hod Lipson, and John E Hopcroft. Convergent learning: Do different neural networks learn the same representations? In International Conference on Learning Representations, 2016.

[17] Peiran Gao, Eric Trautmann, Byron M Yu, Gopal Santhanam, Stephen Ryu, Krishna Shenoy, and Surya Ganguli. A theory of multineuronal dimensionality, dynamics and measurement. bioRxiv, page 214262, 2017.

[18] Seyed-Mahdi Khaligh-Razavi and Nikolaus Kriegeskorte. Deep supervised, but not unsupervised, models may explain IT cortical representation.
PLoS Computational Biology, 10(11):e1003915, 2014.

[19] David GT Barrett, Ari S Morcos, and Jakob H Macke. Analyzing biological and artificial neural networks: challenges with opportunities for synergy? Current Opinion in Neurobiology, 55:55–64, 2019.

[20] Olivier J Hénaff, Robbe LT Goris, and Eero P Simoncelli. Perceptual straightening of natural videos. Nature Neuroscience, 2019.

[21] Olivier J Hénaff and Eero P Simoncelli. Geodesics of learned representations. In International Conference on Learning Representations, 2016.

[22] Raja Giryes, Guillermo Sapiro, and Alexander M Bronstein. Deep neural networks with random Gaussian weights: A universal classification strategy? IEEE Transactions on Signal Processing, 64(13):3444–3457, 2016.

[23] SueYeon Chung, Daniel D Lee, and Haim Sompolinsky. Linear readout of object manifolds. Physical Review E, 93(6):060301, 2016.

[24] SueYeon Chung, Daniel D Lee, and Haim Sompolinsky. Classification and geometry of general perceptual manifolds. Physical Review X, 8(3):031003, 2018.

[25] Uri Cohen, SueYeon Chung, Daniel D Lee, and Haim Sompolinsky. Separability and geometry of object manifolds in deep neural networks. Nature Communications, 2020 (in press). doi: https://doi.org/10.1101/644658.

[26] Elizabeth Gardner. The space of interactions in neural network models. Journal of Physics A: Mathematical and General, 21(1):257, 1988.

[27] HS Seung, Haim Sompolinsky, and Naftali Tishby. Statistical mechanics of learning from examples. Physical Review A, 45(8):6056, 1992.

[28] Madhu Advani, Subhaneil Lahiri, and Surya Ganguli. Statistical mechanics of complex neural systems and high dimensional data. Journal of Statistical Mechanics: Theory and Experiment, 2013(03):P03014, 2013.

[29] Thomas M Cover. Geometrical and statistical properties of systems of linear inequalities with applications in pattern recognition.
IEEE Transactions on Electronic Computers, (3):326–334, 1965.

[30] Peiran Gao and Surya Ganguli. On simplicity and complexity in the brave new world of large-scale neuroscience. Current Opinion in Neurobiology, 32:148–155, 2015.

[31] Douglas B Paul and Janet M Baker. The design for the Wall Street Journal-based CSR corpus. In Proceedings of the Workshop on Speech and Natural Language, pages 357–362. Association for Computational Linguistics, 1992.

[32] Arne Köhn, Florian Stegen, and Timo Baumann. Mining the Spoken Wikipedia for speech data and beyond. In Nicoletta Calzolari (Conference Chair), Khalid Choukri, Thierry Declerck, Marko Grobelnik, Bente Maegaard, Joseph Mariani, Asuncion Moreno, Jan Odijk, and Stelios Piperidis, editors, Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC 2016), Paris, France, May 2016. European Language Resources Association (ELRA).

[33] Jort F. Gemmeke, Daniel P. W. Ellis, Dylan Freedman, Aren Jansen, Wade Lawrence, R. Channing Moore, Manoj Plakal, and Marvin Ritter. Audio Set: An ontology and human-labeled dataset for audio events. In Proc. IEEE ICASSP 2017, New Orleans, LA, 2017.

[34] Alex Graves, Santiago Fernández, Faustino Gomez, and Jürgen Schmidhuber. Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks. In Proceedings of the 23rd International Conference on Machine Learning, pages 369–376. ACM, 2006.

[35] Vassil Panayotov, Guoguo Chen, Daniel Povey, and Sanjeev Khudanpur. Librispeech: an ASR corpus based on public domain audio books. In 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 5206–5210. IEEE, 2015.

[36] Ashok Litwin-Kumar, Kameron Decker Harris, Richard Axel, Haim Sompolinsky, and LF Abbott. Optimal degrees of synaptic connectivity.
Neuron, 93(5):1153–1164, 2017.

[37] Haşim Sak, Félix de Chaumont Quitry, Tara Sainath, Kanishka Rao, et al. Acoustic modelling with CD-CTC-sMBR LSTM RNNs. In 2015 IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU), pages 604–609. IEEE, 2015.