{"title": "Unsupervised Cross-Modal Alignment of Speech and Text Embedding Spaces", "book": "Advances in Neural Information Processing Systems", "page_first": 7354, "page_last": 7364, "abstract": "Recent research has shown that word embedding spaces learned from text corpora of different languages can be aligned without any parallel data supervision. Inspired by the success in unsupervised cross-lingual word embeddings, in this paper we target learning a cross-modal alignment between the embedding spaces of speech and text learned from corpora of their respective modalities in an unsupervised fashion. The proposed framework learns the individual speech and text embedding spaces, and attempts to align the two spaces via adversarial training, followed by a refinement procedure. We show how our framework could be used to perform the tasks of spoken word classification and translation, and the experimental results on these two tasks demonstrate that the performance of our unsupervised alignment approach is comparable to its supervised counterpart. Our framework is especially useful for developing automatic speech recognition (ASR) and speech-to-text translation systems for low- or zero-resource languages, which have little parallel audio-text data for training modern supervised ASR and speech-to-text translation models, but account for the majority of the languages spoken across the world.", "full_text": "Unsupervised Cross-Modal Alignment of Speech and\n\nText Embedding Spaces\n\nYu-An Chung, Wei-Hung Weng, Schrasing Tong, and James Glass\n\nComputer Science and Arti\ufb01cial Intelligence Laboratory\n\nMassachusetts Institute of Technology\n\nCambridge, MA 02139, USA\n\n{andyyuan,ckbjimmy,st9,glass}@mit.edu\n\nAbstract\n\nRecent research has shown that word embedding spaces learned from text corpora\nof different languages can be aligned without any parallel data supervision. 
Inspired\nby the success in unsupervised cross-lingual word embeddings, in this paper we\ntarget learning a cross-modal alignment between the embedding spaces of speech\nand text learned from corpora of their respective modalities in an unsupervised\nfashion. The proposed framework learns the individual speech and text embedding\nspaces, and attempts to align the two spaces via adversarial training, followed by\na re\ufb01nement procedure. We show how our framework could be used to perform\nspoken word classi\ufb01cation and translation, and the experimental results on these two\ntasks demonstrate that the performance of our unsupervised alignment approach\nis comparable to its supervised counterpart. Our framework is especially useful\nfor developing automatic speech recognition (ASR) and speech-to-text translation\nsystems for low- or zero-resource languages, which have little parallel audio-text\ndata for training modern supervised ASR and speech-to-text translation models,\nbut account for the majority of the languages spoken across the world.\n\n1\n\nIntroduction\n\nWord embeddings\u2014continuous-valued vector representations of words\u2014are almost ubiquitous in\nrecent natural language processing research. Most successful methods for learning word embed-\ndings [1, 2, 3] rely on the distributional hypothesis [4], i.e., words occurring in similar contexts tend\nto have similar meanings. Exploiting word co-occurrence statistics in a text corpus leads to word\nvectors that re\ufb02ect semantic similarities and dissimilarities: similar words are geometrically close in\nthe embedding space, and conversely, dissimilar words are far apart.\nContinuous word embedding spaces have been shown to exhibit similar structures across languages [5].\nThe intuition is that most languages share similar expressive power and are used to describe similar\nhuman experiences across cultures; hence, they should share similar statistical properties. 
Inspired by this notion, several studies have focused on designing algorithms that exploit this similarity to learn a cross-lingual alignment between the embedding spaces of two languages, where the two embedding spaces are trained from independent text corpora [6, 7, 8, 9, 10, 11, 12]. In particular, recent research has shown that such cross-lingual alignments can be learned without relying on any form of bilingual supervision [13, 14, 15], and has been applied to training neural machine translation (NMT) systems in a completely unsupervised fashion [16, 17]. This eliminates the need for a large parallel training corpus to train NMT systems.

Speech, as another form of language, is rarely considered as a source for learning semantics, compared to text. Although there is prior work that explores learning vector representations from speech [18, 19, 20, 21, 22, 23], these approaches are primarily based on acoustic-phonetic similarity, and aim to represent the way a word sounds rather than its meaning.

32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montréal, Canada.

Recently, the Speech2Vec [24] model was developed to be capable of representing audio segments excised from a speech corpus as fixed-dimensional vectors that contain semantic information of the underlying spoken words. The design of Speech2Vec is based on a Recurrent Neural Network (RNN) Encoder-Decoder framework [25, 26], and borrows the methodology of Skip-grams or continuous bag-of-words (CBOW) from Word2Vec [1] for training. 
Since Speech2Vec and Word2Vec share the same training methodology, and speech and text are similar media for communication, the two embedding spaces learned respectively by Speech2Vec from speech and Word2Vec from text are expected to exhibit similar structure.

Motivated by the recent success in unsupervised cross-lingual alignment [13, 14, 15] and the assumption that the embedding spaces of the two modalities (speech and text) share similar structure, we are interested in learning an unsupervised cross-modal alignment between the two spaces. Such an alignment would be useful for developing automatic speech recognition (ASR) and speech-to-text translation systems for low- or zero-resource languages that lack parallel corpora of speech and text for training. In this paper, we propose a framework for unsupervised cross-modal alignment, borrowing the methodology from the unsupervised cross-lingual alignment presented in [14]. The framework consists of two steps. First, it uses Speech2Vec [24] and Word2Vec [1] to learn the individual embedding spaces of speech and text. Next, it leverages adversarial training to learn a linear mapping from the speech embedding space to the text embedding space, followed by a refinement procedure.

The paper is organized as follows. Section 2 describes how we obtain the speech embedding space in a completely unsupervised manner using Speech2Vec. Next, we present our unsupervised cross-modal alignment approach in Section 3. In Section 4, we describe the tasks of spoken word classification and translation, which are similar to ASR and speech-to-text translation, respectively, except that the inputs are now audio segments corresponding to words. We then evaluate the performance of our unsupervised alignment on the two tasks and analyze our results in Section 5. Finally, we conclude and point out some interesting future work possibilities in Section 6. 
To the best of our knowledge, this is the first work that achieves fully unsupervised spoken word classification and translation.

2 Unsupervised Learning of the Speech Embedding Space

Recently, there has been increasing interest in learning the semantics of a language directly, and only, from raw speech [24, 27, 28]. Assuming utterances in a speech corpus are already pre-segmented into audio segments corresponding to words, using word boundaries obtained by forced alignment, existing approaches aim to represent each audio segment as a fixed-dimensional embedding vector, with the hope that the embedding is able to capture the semantic information of the underlying spoken word. However, some supervision leaks into the learning process through the use of forced alignment, rendering the approaches not fully unsupervised.

In this paper, we use Speech2Vec [24], a recently proposed deep neural network architecture that has been shown capable of capturing the semantics of spoken words from raw speech, for learning the speech embedding space. To eliminate the need for forced alignment, we propose a simple pipeline for training Speech2Vec in a totally unsupervised manner. We briefly review Speech2Vec in Section 2.1, and introduce the unsupervised pipeline in Section 2.2.

2.1 Speech2Vec

In text, a Word2Vec [1] model is a shallow, two-layer fully-connected neural network that is trained to reconstruct the contexts of words. There are two methodologies for training Word2Vec: Skip-grams and CBOW. With Skip-grams, for each word w(n) in a text corpus the model is trained to maximize the probability of the words {w(n−k), . . . , w(n−1), w(n+1), . . . , w(n+k)} within a window of size k, given w(n). The objective of CBOW, on the other hand, is to infer the current word w(n) from its nearby words {w(n−k), . . . , w(n−1), w(n+1), . . .
, w(n+k)}.

Speech2Vec [24], inspired by Word2Vec, borrows the methodology of Skip-grams or CBOW for training. Unlike in text, where words are represented by one-hot vectors as the input and output for training Word2Vec, an audio segment is represented by a variable-length sequence of acoustic features, x = (x1, x2, . . . , xT), where xt is the acoustic feature (such as Mel-frequency cepstral coefficients) at time t, and T is the length of the sequence. In order to handle variable-length input and output sequences of acoustic features, Speech2Vec replaces the two fully-connected layers in the Word2Vec model with a pair of RNNs, one as an Encoder and the other as a Decoder [25, 26]. When training Speech2Vec with Skip-grams, the Encoder RNN takes the audio segment (corresponding to the current word) as input and encodes it into a fixed-dimensional embedding z(n) that represents the entire input sequence x(n). Subsequently, the Decoder RNN aims to reconstruct the audio segments {x(n−k), . . . , x(n−1), x(n+1), . . . , x(n+k)} (corresponding to nearby words) within a window of size k from z(n). Similar to the concept of training Word2Vec with Skip-grams, the intuition behind this methodology is that, in order to successfully decode nearby audio segments, the encoded embedding z(n) should contain sufficient semantic information about the current audio segment x(n). In contrast to training Speech2Vec with Skip-grams, which aims to predict nearby audio segments from z(n), training Speech2Vec with CBOW sets x(n) as the target and aims to infer it from nearby audio segments. 
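Both training methodologies reduce to the same pairing of a center item with the neighbors inside its window; only the token representation differs (one-hot word vectors for Word2Vec, acoustic feature sequences for Speech2Vec). A minimal sketch of this shared pair construction, using a hypothetical word sequence:

```python
def skipgram_pairs(tokens, k):
    """For each position n, pair the current token with its neighbors within
    a window of size k.  Skip-grams predicts the neighbors from the center;
    CBOW simply swaps input and target in each pair."""
    pairs = []
    for n, center in enumerate(tokens):
        for m in range(max(0, n - k), min(len(tokens), n + k + 1)):
            if m != n:
                pairs.append((center, tokens[m]))
    return pairs

# Hypothetical word sequence; for Speech2Vec each item would instead be a
# variable-length sequence of acoustic feature vectors.
print(skipgram_pairs(["the", "dog", "barks"], k=1))
# → [('the', 'dog'), ('dog', 'the'), ('dog', 'barks'), ('barks', 'dog')]
```

The same loop applies unchanged to a list of audio segments, which is why the two embedding spaces can be expected to mirror each other's co-occurrence structure.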
By using the same training methodology (Skip-grams or CBOW) as Word2Vec, it is\nreasonable to assume that the embedding space learned by Speech2Vec from speech exhibits similar\nstructure to that learned by Word2Vec from text.\nAfter training the Speech2Vec model, each audio segment is transformed into an embedding vector\nthat contains the semantic information of the underlying word. In a Word2Vec model, the embedding\nfor a particular word is deterministic, which means that every instance of the same word will be\nrepresented by one, and only one, embedding vector. In contrast, for audio segments every instance\nof a spoken word is different (due to speaker, channel, and other contextual differences, etc.), so\nevery instance of the same underlying word is represented by a different (though hopefully similar)\nembedding vector. Embedding vectors of the same spoken words can be averaged to obtain a single\nword embedding based on the identity of each audio segment, as is done in [24].\n\n2.2 Unsupervised Speech2Vec\n\nSpeech2Vec and Word2Vec learn the semantics of words by making use of the co-occurrence\ninformation in their respective modalities, and are both intrinsically unsupervised. However, unlike\ntext where the content can be easily segmented into word-like units, speech has a continuous form\nby nature, making the word boundaries challenging to locate. All utterances in the speech corpus\nare assumed to be perfectly segmented into audio segments based on the word boundaries obtained\nby forced alignment with respect to the reference transcriptions [24]. Such an assumption, however,\nmakes the process of learning word embeddings from speech not truly unsupervised.\nUnsupervised speech segmentation is a core problem in zero-resource speech processing in the\nabsence of transcriptions, lexicons, or language modeling text. 
Early work mainly focused on unsupervised term discovery, where the aim is to find word- or phrase-like patterns in a collection of speech [29, 30]. While useful, the discovered patterns are typically isolated segments spread out over the data, leaving much of the speech as background. This has prompted several studies on full-coverage approaches, where the entire speech input is segmented into word-like units [31, 32, 33, 34].

In this paper, we use an off-the-shelf, full-coverage, unsupervised segmentation system for segmenting our data into word-like units. Three representative systems are explored. The first one, referred to as the Bayesian embedded segmental Gaussian mixture model (BES-GMM) [35], is a probabilistic model that represents potential word segments as fixed-dimensional acoustic word embeddings [23], and builds a whole-word acoustic model in this embedding space while jointly doing segmentation. The second one, called the embedded segmental K-means model (ES-KMeans) [36], is an approximation to BES-GMM that uses hard clustering and segmentation rather than full Bayesian inference. The third one is the recurring syllable-unit segmenter called SylSeg [37], a fast and heuristic method that applies unsupervised syllable segmentation and clustering to predict recurring syllable sequences as words.

After training the Speech2Vec model using the audio segments obtained by an unsupervised segmentation method, each audio segment is transformed into an embedding that contains the semantic information about the segment. Since we do not know the identity of the embeddings, we use the k-means algorithm to cluster them into K clusters, potentially corresponding to K different word types. We then average all embeddings that belong to the same cluster (potentially the instances of the same underlying word) to obtain a single embedding. 
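This cluster-then-average step can be sketched with a tiny hand-rolled k-means on hypothetical segment embeddings (in practice K would be in the tens of thousands and an off-the-shelf implementation would be used):

```python
import numpy as np

def cluster_and_average(emb, K, iters=10, seed=0):
    """Cluster segment embeddings into K groups with k-means, then return one
    averaged embedding per cluster as a proxy for per-word-type vectors."""
    rng = np.random.default_rng(seed)
    centers = emb[rng.choice(len(emb), K, replace=False)]
    for _ in range(iters):
        # Assign each embedding to its nearest center.
        assign = np.argmin(((emb[:, None] - centers[None]) ** 2).sum(-1), axis=1)
        # Recompute each center as the mean of its members.
        centers = np.array([emb[assign == k].mean(0) if (assign == k).any()
                            else centers[k] for k in range(K)])
    return centers, assign

# Two hypothetical "word types" as tight point clouds around 0 and 3.
emb = np.vstack([np.random.default_rng(1).normal(0, 0.05, (5, 2)),
                 np.random.default_rng(2).normal(3, 0.05, (5, 2))])
centers, assign = cluster_and_average(emb, K=2)
print(len(set(assign[:5])), len(set(assign[5:])))  # → 1 1 (one cluster per cloud)
```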
Note that by doing so, it is possible that we group the embeddings corresponding to different words that are semantically similar into one cluster.

3 Unsupervised Alignment of Speech and Text Embedding Spaces

Suppose we have speech and text embedding spaces trained on independent speech and text corpora. Our goal is to learn a mapping between them, without using any form of cross-modal supervision, such that the two spaces are aligned.

Let S = {s_1, s_2, . . . , s_m} ⊆ R^{d1} and T = {t_1, t_2, . . . , t_n} ⊆ R^{d2} be two sets of m and n word embeddings of dimensionality d1 and d2 from the speech and text embedding spaces, respectively. Ideally, if we have a known dictionary that specifies which s_i ∈ S corresponds to which t_j ∈ T, we can learn a linear mapping W between the two embedding spaces such that

    W* = argmin_{W ∈ R^{d2×d1}} ‖WX − Y‖_2,    (1)

where X and Y are two aligned matrices of size d1 × k and d2 × k formed by k word embeddings selected from S and T, respectively. At test time, the transformation result of any audio segment a in the speech domain can be defined as argmax_{t_j ∈ T} cos(W s_a, t_j). In this paper, we show how to learn this mapping W without using any cross-modal supervision. The proposed framework, inspired by [14], consists of two steps: domain-adversarial training for learning an initial proxy of W, followed by a refinement procedure which uses the words that match the best to create a synthetic parallel dictionary for applying Equation 1.

3.1 Domain-Adversarial Training

The intuition behind this step is to make the mapped S and T indistinguishable. We define a discriminator, whose goal is to discriminate between elements randomly sampled from WS = {W s_1, W s_2, . . . , W s_m} and T. The mapping W, which can be viewed as the generator, is trained to prevent the discriminator from making accurate predictions. 
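When a seed dictionary is available, the objective of Equation 1 can be solved directly; the refinement of [14] additionally constrains W to be orthogonal, which yields the closed-form Procrustes solution. A minimal NumPy sketch on toy data (the dimensions and ground-truth rotation below are hypothetical):

```python
import numpy as np

def fit_mapping(X, Y, orthogonal=True):
    """Solve Equation 1: find W minimizing ||WX - Y|| for paired columns
    X (d1 x k, speech) and Y (d2 x k, text).  With orthogonal=True (and
    d1 == d2) this is the Procrustes solution W = U V^T from SVD(Y X^T)."""
    if orthogonal:
        U, _, Vt = np.linalg.svd(Y @ X.T)
        return U @ Vt
    # Unconstrained least squares: W^T solves X^T W^T = Y^T.
    return np.linalg.lstsq(X.T, Y.T, rcond=None)[0].T

# Toy check: recover a known rotation from paired embeddings.
rng = np.random.default_rng(0)
Q, _ = np.linalg.qr(rng.normal(size=(5, 5)))   # ground-truth rotation
X = rng.normal(size=(5, 20))
W = fit_mapping(X, Q @ X)
print(np.allclose(W, Q))  # → True
```

The adversarial step below has to recover an initial proxy for this W without any such paired columns.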
This is a two-player game, where the discriminator aims to maximize its ability to identify the origin of an embedding, and W aims to prevent the discriminator from doing so by making WS and T as similar as possible. Given the mapping W, the discriminator, parameterized by θ_D, is optimized by minimizing the following objective function:

    L_D(θ_D | W) = −(1/m) Σ_{i=1}^{m} log P_{θ_D}(speech = 1 | W s_i) − (1/n) Σ_{j=1}^{n} log P_{θ_D}(speech = 0 | t_j),    (2)

where P_{θ_D}(speech = 1 | v) is the probability that vector v originates from the speech embedding space (as opposed to an embedding from the text embedding space). Given the discriminator, the mapping W aims to fool the discriminator into mispredicting the original domain of the embeddings by minimizing the following objective function:

    L_W(W | θ_D) = −(1/m) Σ_{i=1}^{m} log P_{θ_D}(speech = 0 | W s_i) − (1/n) Σ_{j=1}^{n} log P_{θ_D}(speech = 1 | t_j).    (3)

The discriminator θ_D and the mapping W are optimized iteratively to respectively minimize L_D and L_W, following the standard training procedure of adversarial networks [38].

3.2 Refinement Procedure

The domain-adversarial training step learns a rotation matrix W that aligns the speech and text embedding spaces. To further improve the alignment, we use the W learned in the domain-adversarial training step as an initial proxy and build a synthetic parallel dictionary that specifies which s_i ∈ S corresponds to which t_j ∈ T.

To ensure a high-quality dictionary, we consider only the most frequent words from S and T, since more frequent words are expected to have better-quality embedding vectors, and only retain their mutual nearest neighbors. 
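Given any discriminator that outputs P_{θ_D}(speech = 1 | v), the two objectives in Equations 2 and 3 can be sketched as follows (the toy linear-logistic discriminator below is hypothetical):

```python
import numpy as np

def adversarial_losses(WS, T, disc):
    """Compute L_D (Eq. 2) and L_W (Eq. 3) for mapped speech embeddings
    WS = {W s_i} and text embeddings T, given disc(v) = P(speech=1 | v)."""
    p_s = np.array([disc(v) for v in WS])  # P(speech=1 | W s_i)
    p_t = np.array([disc(v) for v in T])   # P(speech=1 | t_j)
    # Eq. 2: the discriminator labels mapped speech as 1 and text as 0.
    L_D = -np.mean(np.log(p_s)) - np.mean(np.log(1 - p_t))
    # Eq. 3: the mapping W pushes the discriminator toward the opposite
    # origin for every embedding.
    L_W = -np.mean(np.log(1 - p_s)) - np.mean(np.log(p_t))
    return L_D, L_W

# Toy check with a hypothetical discriminator that separates the two sets.
disc = lambda v: 1.0 / (1.0 + np.exp(-v.sum()))
WS = np.ones((4, 2))
T = -np.ones((3, 2))
L_D, L_W = adversarial_losses(WS, T, disc)
print(L_D < L_W)  # → True: a confident discriminator has low L_D, high L_W
```

In training, gradient steps on θ_D (minimizing L_D) alternate with gradient steps on W (minimizing L_W), as in standard adversarial training.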
For deciding mutual nearest neighbors, we use the Cross-Domain Similarity Local Scaling (CSLS) proposed in [14] to mitigate the so-called hubness problem [39] (points tending to be nearest neighbors of many points in high-dimensional spaces). Subsequently, we apply Equation 1 on this generated dictionary to refine W.

4 Spoken Word Classification and Translation

Conventional hybrid ASR systems [40] and recent end-to-end ASR models [41, 42, 43, 44] rely on a large amount of parallel audio-text data for training. However, most languages spoken across the world lack parallel data, so it is no surprise that only very few languages support ASR. The same is true for speech-to-text translation [45], which typically pipelines ASR and machine translation, and can be even more challenging to develop as it requires both components to be well trained. Compared to parallel audio-text data, the cost of accumulating independent corpora of speech and text is significantly lower. With our unsupervised cross-modal alignment approach, it becomes feasible to build ASR and speech-to-text translation systems using only independent corpora of speech and text, a setting suitable for low- or zero-resource languages.

Since a cross-modal alignment is learned to link the word embedding spaces of speech and text, we perform the tasks of spoken word classification and translation to directly evaluate the effectiveness of the alignment. The two tasks are similar to standard ASR and speech-to-text translation, respectively, except that the input is now an audio segment corresponding to a word.

4.1 Spoken Word Classification

The goal of this task is to recognize the underlying spoken word of an input audio segment. Suppose we have two independent corpora of speech and text that belong to the same language. 
The speech and text embedding spaces, denoted by S and T, can be obtained by training Speech2Vec and Word2Vec on the respective corpora. The alignment W between S and T can be learned in either a supervised or an unsupervised way. At test time, an input audio segment is first transformed into an embedding vector s in the speech embedding space S by Speech2Vec. The vector s is then mapped to the text embedding space as t_s = W s ∈ T. In T, the word with embedding vector t* = argmax_{t ∈ T} cos(t, t_s), i.e., the one closest to t_s, is taken as the classification result. The performance is measured by accuracy.

4.2 Spoken Word Translation

This task is similar to the one in the text domain that considers the problem of retrieving the translation of given source words, except that the source words are in the form of audio segments. Spoken word translation can be performed in exactly the same way as spoken word classification, except that the speech and text corpora belong to different languages. At test time, we follow the standard practice of word translation and measure how many times one of the correct translations (in text) of the input audio segment is retrieved, and report precision@k for k = 1 and 5. We use the bilingual dictionaries provided by [14] to obtain the correct translations of a given source word.

5 Experiments

In this section, we empirically demonstrate the effectiveness of our unsupervised cross-modal alignment approach on the spoken word classification and translation tasks introduced in Section 4.

5.1 Datasets

For our experiments, we used English and French LibriSpeech [46, 47], and the English and German Spoken Wikipedia Corpora (SWC) [48]. All corpora are read speech, and come with a collection of utterances and the corresponding transcriptions. For convenience, we denote the speech and text data of a corpus in uppercase and lowercase, respectively. 
For example, ENswc and enswc represent the speech and text data, respectively, of English SWC. In Table 1, column Train is the size of the speech data used for training the speech embeddings; column Test is the size of the speech data used for testing, where the corresponding number of audio segments (i.e., spoken word tokens) is specified in column Segments; column Words provides the number of distinct words in that corpus. Train and test sets are split so that there are no overlapping speakers.

Table 1: The detailed statistics of the corpora.

Corpus              | Train  | Test  | Words | Segments
English LibriSpeech | 420 hr | 50 hr | 37K   | 468K
French LibriSpeech  | 200 hr | 30 hr | 26K   | 260K
English SWC         | 355 hr | 40 hr | 25K   | 284K
German SWC          | 346 hr | 40 hr | 31K   | 223K

5.2 Details of Training and Model Architectures

The speech embeddings were trained using Speech2Vec with Skip-grams, setting the window size k to three. The Encoder is a single-layer bidirectional LSTM, and the Decoder is a single-layer unidirectional LSTM. The model was trained by stochastic gradient descent (SGD) with a fixed learning rate of 10^−3. The text embeddings were obtained by training Word2Vec on the transcriptions using the fastText implementation without subword information [3]. The dimension of both speech and text embeddings is 50.¹

For the adversarial training, the discriminator was a two-layer neural network of size 512 with ReLU as the activation function. Both the discriminator and W were trained by SGD with a fixed learning rate of 10^−3. For the refinement procedure, we used the default setting specified in [14].²

5.3 Comparing Methods

Table 2: Different configurations for training Speech2Vec to obtain the speech embeddings with decreasing level of supervision. 
The last column specifies whether the configuration is unsupervised.

Configuration | How word segments were obtained | How embeddings were grouped together | Unsupervised
A & A*        | Forced alignment                | Use word identity                    | ✗
B             | Forced alignment                | k-means                              | ✗
C             | BES-GMM [35]                    | k-means                              | ✓
D             | ES-KMeans [36]                  | k-means                              | ✓
E             | SylSeg [37]                     | k-means                              | ✓
F             | Equally sized chunks            | k-means                              | ✓

Alignment-Based Approaches   Given the speech and text embeddings, alignment-based approaches learn the alignment between them in either a supervised or an unsupervised way; for an input audio segment, they perform spoken word classification and translation as described in Section 4. By varying how word segments were obtained before being fed to Speech2Vec and how the embeddings were grouped together, the level of supervision is gradually decreased towards a fully unsupervised configuration. In configuration A, the speech training data was segmented into words using forced alignment with respect to the reference transcription, and the embeddings of the same word were grouped together using their word identities. In configuration B, the word segments were also obtained by forced alignment, but the embeddings were grouped together by performing k-means clustering. In configurations C, D, and E, the speech training data was segmented into word-like units using the different unsupervised segmentation algorithms described in Section 2.2. Configuration F serves as a baseline by naively segmenting the speech training data into equally sized chunks. Unlike configurations A and B, configurations C, D, E, and F did not require reference transcriptions for forced alignment and grouped the embeddings by performing k-means clustering; they are thus unsupervised. 
Configurations A to F all used our unsupervised alignment approach to align the speech and text embedding spaces. We also implemented configuration A*, which trained Speech2Vec in the same way as configuration A, but learned the alignment using a parallel dictionary as cross-modal data supervision. The different configurations are summarized in Table 2.

Word Classifier   We established an upper bound by using the fully-supervised Word Classifier that was trained to map audio segments directly to their corresponding word identities. The Word Classifier was composed of a single-layer bidirectional LSTM with a softmax layer appended at the output of its last time step. This approach is specific to spoken word classification.

¹ We tried window sizes k ∈ {1, 2, 3, 4, 5} and embedding dimensions d ∈ {50, 100, 200, 300}, and found that the reported k and d yield the best performance.
² We also tried a multi-layer neural network to model W. However, we did not observe any improvement on our evaluation tasks compared to a linear W. This finding aligns with [5].

Majority Word Baseline   For both the spoken word classification and translation tasks, we implemented a straightforward baseline dubbed Major-Word: for classification, it always predicts the most frequent word; for translation, it always predicts the most commonly paired word. Results of Major-Word offer insight into the word distribution of the test set.

5.4 Results and Discussion

Table 3: Accuracy on spoken word classification. ENls − enswc means that the speech and text embeddings were learned from the speech training data of English LibriSpeech and the text training data of English SWC, respectively, and the testing audio segments came from English LibriSpeech. The same rule applies to Table 5 and Table 6. 
For the Word Classifier, ENls − enswc and ENswc − enls could not be obtained, since it requires parallel audio-text data for training.

Corpora          | ENls−enls | FRls−frls | ENswc−enswc | DEswc−deswc | ENls−enswc | ENswc−enls

Nonalignment-based approach
Word Classifier  | 89.3 | 83.6 | 86.9 | 80.4 | –    | –

Alignment-based approach with cross-modal supervision (parallel dictionary)
A*               | 25.4 | 27.1 | 29.1 | 26.9 | 21.8 | 23.9

Alignment-based approaches without cross-modal supervision (our approach)
A                | 23.7 | 24.9 | 25.3 | 25.8 | 18.3 | 21.6
B                | 19.4 | 20.7 | 22.6 | 21.5 | 15.9 | 17.4
C                | 10.9 | 12.6 | 14.4 | 13.1 |  6.9 |  8.0
D                | 11.5 | 12.3 | 14.2 | 12.4 |  7.5 |  8.3
E                |  6.5 |  7.2 |  8.9 |  7.4 |  4.5 |  5.9
F                |  0.8 |  1.4 |  2.8 |  1.2 |  0.2 |  0.5

Majority Word Baseline
Major-Word       |  0.3 |  0.2 |  0.3 |  0.4 |  0.3 |  0.3

Spoken Word Classification   Table 3 presents our results on spoken word classification. We observe that the accuracy decreases as the level of supervision decreases, as expected. We also note that although the Word Classifier significantly outperforms all the other approaches under all corpora settings, the prerequisite for training such a fully-supervised approach is unrealistic: it requires the utterances to be perfectly segmented into audio segments corresponding to words, with the word identity of each segment known. We emphasize that the Word Classifier is just used to establish an upper-bound performance that gives us an idea of how good the classification results could be.

For the alignment-based approaches, configuration A* achieves the highest accuracies under all corpora settings by using a parallel dictionary as cross-modal supervision for learning the alignment. 
However, we see that configuration A, using our unsupervised alignment approach, suffers only a slight decrease in performance, which demonstrates that our unsupervised alignment approach is almost as effective as its supervised counterpart A*. As we move towards unsupervised methods (k-means clustering) for grouping embeddings in configuration B, a decrease in performance is observed.

The performance of using unsupervised segmentation algorithms lags behind that of using exact word segments for training Speech2Vec, as shown by configurations C, D, and E versus B. We hypothesize that word segmentation is a critical step, since incorrectly separated words lack a coherent embedding, which in turn hinders the clustering process. The importance of proper segmentation is evident in configuration F, which performs the worst.

The aforementioned analysis applies across corpora settings. We also observe that the performance of embeddings learned from different corpora is inferior to that of embeddings learned from the same corpus (refer to columns 1 and 3 versus 5 and 6 in Table 3). We think this is because embedding spaces learned from the same corpus (e.g., both learned from LibriSpeech) exhibit higher similarity than those learned from different corpora, making the alignment more accurate.

Spoken Word Synonyms Retrieval   Word classification does not display the full potential of our alignment approach. In Table 4 we show the retrieved results for several example input audio segments. The words were ranked according to the cosine similarity between their embeddings and that of the audio segment mapped from the speech embedding space. We observe that the lists actually contain both synonyms and different lexical forms of the audio segment. 
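This cosine-similarity ranking over the text vocabulary (the same operation as the classification step in Section 4.1, extended to top-k) can be sketched as follows, with hypothetical toy embeddings:

```python
import numpy as np

def retrieve(s, W, T_emb, vocab, topk=5):
    """Map a speech embedding s into the text space as W s, then rank the
    vocabulary by cosine similarity to the mapped vector."""
    ts = W @ s
    sims = T_emb @ ts / (np.linalg.norm(T_emb, axis=1) * np.linalg.norm(ts))
    return [vocab[i] for i in np.argsort(-sims)[:topk]]

# Hypothetical 2-D text embeddings for four words.
vocab = ["lovely", "pretty", "beautiful", "suitcase"]
T_emb = np.array([[1.0, 0.1], [0.9, 0.2], [0.8, 0.3], [-1.0, 0.5]])
W = np.eye(2)  # identity mapping, for illustration only
print(retrieve(np.array([1.0, 0.1]), W, T_emb, vocab, topk=2))
# → ['lovely', 'pretty']
```

Because semantically related words cluster together, the top of such a ranking naturally mixes the target word with its synonyms and inflected forms.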
This provides an explanation of why the performance of alignment-based approaches on word classification is poor: the top-ranked word may not match the underlying word of the input audio segment and is then counted as incorrect for word classification, even though it has a high chance of being semantically similar to the underlying word.

Table 4: Retrieved results of example audio segments that are considered incorrect in word classification. The match for each audio segment is marked with an asterisk.

         Input audio segments
Rank | beautiful   | destroy    | clever   | suitcase
1    | lovely      | destroyed  | cunning  | bags
2    | pretty      | *destroy*  | smart    | suitcases
3    | gorgeous    | annihilate | *clever* | luggage
4    | *beautiful* | destroying | crafty   | briefcase
5    | nice        | destruct   | wisely   | *suitcase*

In contrast to word classification, we define word synonyms retrieval to also consider synonyms as valid results. The synonyms were derived using another language as a pivot: using the cross-lingual dictionaries provided by [14], we looked up the acceptable word translations, and for each of those translations we took the union of their translations back to the original language. In English, for example, each word has 3.3 synonyms (excluding itself) on average. Table 5 shows the results of word synonyms retrieval. We see that our approach performs better at retrieving synonyms than at classifying words, evidence that the system is learning the semantics rather than the identities of words. This showcases the strength of our semantics-focused approach.

Table 5: Results on spoken word synonyms retrieval. 
We measure how many times one of the synonyms of the input audio segment is retrieved, and report precision@k for k = 1, 5.

Corpora        ENls−enls    FRls−frls    ENswc−enswc   DEswc−deswc   ENls−enswc   ENswc−enls
Average P@k    P@1   P@5    P@1   P@5    P@1   P@5     P@1   P@5     P@1   P@5    P@1   P@5

Alignment-based approach with cross-modal supervision (parallel dictionary)
A∗             52.6  66.9   46.6  69.4   47.4  62.5    49.2  63.7    41.3  54.2   39.0  49.4

Alignment-based approaches without cross-modal supervision (our approach)
A              43.2  57.0   42.4  58.0   36.3  50.4    32.6  48.8    33.9  47.5   33.4  45.7
B              35.0  48.2   35.4  50.4   33.8  44.6    29.3  45.4    30.0  42.9   31.1  40.7
C              27.7  37.3   26.4  35.7   21.1  30.3    26.2  34.5    22.4  28.9   17.1  26.3
D              26.7  35.2   27.2  36.3   21.1  28.2    25.3  33.2    21.2  29.3   18.7  25.1
E              17.7  24.2   20.8  28.4   17.3  21.8    18.3  23.0    15.2  21.1   11.2  17.8
F               3.5   5.7    5.2   6.9    3.8   5.8     2.7   4.9     3.2   5.7    2.9   4.4

Spoken Word Translation   Table 6 presents the results on spoken word translation. As with spoken word classification, configurations with more supervision yield better performance than those with less. Furthermore, we observe that translating within the same corpus outperforms translating across corpora (compare ENswc − deswc with ENls − deswc). We attribute this to the higher structural similarity between embedding spaces learned from the same corpus.

6 Conclusions

In this paper, we propose a framework capable of aligning speech and text embedding spaces in an unsupervised manner. The method learns the alignment from independent corpora of speech and text, without requiring any cross-modal supervision, which is especially important for low- or zero-resource languages that lack parallel data with both audio and text.
Table 6: Results on spoken word translation. We measure how many times one of the correct translations of the input audio segment is retrieved, and report precision@k for k = 1, 5.

Corpora        ENls−frls    FRls−enls    ENswc−deswc   DEswc−enswc   ENls−deswc   FRls−deswc
Average P@k    P@1   P@5    P@1   P@5    P@1   P@5     P@1   P@5     P@1   P@5    P@1   P@5

Alignment-based approach with cross-modal supervision (parallel dictionary)
A∗             47.9  56.4   49.1  60.1   40.2  51.9    43.3  55.8    34.9  46.3   33.8  44.9

Alignment-based approaches without cross-modal supervision (our approach)
A              40.5  50.3   39.9  50.9   32.8  43.8    33.1  43.4    31.9  42.2   30.1  42.1
B              36.0  44.9   35.5  44.5   27.9  38.3    30.9  40.9    26.6  35.3   25.4  38.2
C              24.7  35.4   23.9  37.3   22.0  30.3    20.5  29.1    19.2  26.1   14.8  23.1
D              25.4  33.1   24.4  34.6   23.5  29.1    20.7  31.3    20.8  25.9   14.5  22.4
E              15.4  20.6   16.7  19.9   14.1  15.9    16.6  17.0    14.8  16.7    9.7  11.8
F               4.3   5.6    6.9   7.5    4.9   6.5     5.3   6.6     4.2   5.9    1.8   2.6

Majority word baseline
Major-Word      1.1   1.5    1.6   2.2    1.2   2.0     1.5   2.7     1.1   1.5    1.6   2.2

We demonstrate the effectiveness of our unsupervised alignment by showing results comparable to those of its supervised counterpart that uses full cross-modal supervision (A vs. A∗) on the tasks of spoken word classification and translation. Future work includes devising unsupervised speech segmentation approaches that produce more accurate word segments, an essential step towards obtaining high-quality speech embeddings.
We also plan to extend the current spoken word classification and translation systems to perform standard ASR and speech-to-text translation, respectively.

Acknowledgments

The authors thank Hao Tang, Mandy Korpusik, and the MIT Spoken Language Systems Group for their helpful feedback and discussions.

References

[1] T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean, “Distributed representations of words and phrases and their compositionality,” in NIPS, 2013.
[2] J. Pennington, R. Socher, and C. D. Manning, “GloVe: Global vectors for word representation,” in EMNLP, 2014.
[3] P. Bojanowski, E. Grave, A. Joulin, and T. Mikolov, “Enriching word vectors with subword information,” Transactions of the Association for Computational Linguistics, vol. 5, pp. 135–146, 2017.
[4] Z. S. Harris, “Distributional structure,” Word, vol. 10, no. 2-3, pp. 146–162, 1954.
[5] T. Mikolov, Q. Le, and I. Sutskever, “Exploiting similarities among languages for machine translation,” arXiv preprint arXiv:1309.4168, 2013.
[6] M. Faruqui and C. Dyer, “Improving vector space word representations using multilingual correlation,” in EACL, 2014.
[7] C. Xing, D. Wang, C. Liu, and Y. Lin, “Normalized word embedding and orthogonal transform for bilingual word translation,” in NAACL HLT, 2015.
[8] M. Artetxe, G. Labaka, and E. Agirre, “Learning principled bilingual mappings of word embeddings while preserving monolingual invariance,” in EMNLP, 2016.
[9] S. L. Smith, D. Turban, S. Hamblin, and N. Hammerla, “Offline bilingual word vectors, orthogonal transformations and the inverted softmax,” in ICLR, 2017.
[10] M. Artetxe, G. Labaka, and E. Agirre, “Learning bilingual word embeddings with (almost) no bilingual data,” in ACL, 2017.
[11] H. Cao, T. Zhao, S. Zhang, and Y.
Meng, \u201cA distribution-based model to learn bilingual word\n\nembeddings,\u201d in COLING, 2016.\n\n[12] L. Duong, H. Kanayama, T. Ma, S. Bird, and T. Cohn, \u201cLearning crosslingual word embeddings\n\nwithout bilingual corpora,\u201d in EMNLP, 2016.\n\n9\n\n\f[13] M. Zhang, Y. Liu, H. Luan, and M. Sun, \u201cAdversarial training for unsupervised bilingual lexicon\n\ninduction,\u201d in ACL, 2017.\n\n[14] A. Conneau, G. Lample, M. Ranzato, L. Denoyer, and H. J\u00e9gou, \u201cWord translation without\n\nparallel data,\u201d in ICLR, 2018.\n\n[15] M. Zhang, Y. Liu, H. Luan, and M. Sun, \u201cEarth mover\u2019s distance minimization for unsupervised\n\nbilingual lexicon induction,\u201d in EMNLP, 2017.\n\n[16] M. Artetxe, G. Labaka, E. Agirre, and K. Cho, \u201cUnsupervised neural machine translation,\u201d in\n\nICLR, 2018.\n\n[17] G. Lample, L. Denoyer, and M. Ranzato, \u201cUnsupervised machine translation using monolingual\n\ncorpora only,\u201d in ICLR, 2018.\n\n[18] W. He, W. Wang, and K. Livescu, \u201cMulti-view recurrent neural acoustic word embeddings,\u201d in\n\nICLR, 2017.\n\n[19] S. Settle and K. Livescu, \u201cDiscriminative acoustic word embeddings: Recurrent neural network-\n\nbased approaches,\u201d in SLT, 2016.\n\n[20] Y.-A. Chung, C.-C. Wu, C.-H. Shen, H.-Y. Lee, and L.-S. Lee, \u201cAudio word2vec: Unsuper-\nvised learning of audio segment representations using sequence-to-sequence autoencoder,\u201d in\nINTERSPEECH, 2016.\n\n[21] H. Kamper, W. Wang, and K. Livescu, \u201cDeep convolutional acoustic word embeddings using\n\nword-pair side information,\u201d in ICASSP, 2016.\n\n[22] S. Bengio and G. Heigold, \u201cWord embeddings for speech recognition,\u201d in INTERSPEECH,\n\n2014.\n\n[23] K. Levin, K. Henry, A. Jansen, and K. Livescu, \u201cFixed-dimensional acoustic embeddings of\n\nvariable-length segments in low-resource settings,\u201d in ASRU, 2013.\n\n[24] Y.-A. Chung and J. 
Glass, \u201cSpeech2vec: A sequence-to-sequence framework for learning word\n\nembeddings from speech,\u201d in INTERSPEECH, 2018.\n\n[25] I. Sutskever, O. Vinyals, and Q. Le, \u201cSequence to sequence learning with neural networks,\u201d in\n\nNIPS, 2014.\n\n[26] K. Cho, B. van Merri\u00ebnboer, \u00c7. G\u00fcl\u00e7ehre, D. Bahdanau, F. Bougares, H. Schwenk, and\nY. Bengio, \u201cLearning phrase representations using RNN encoder-decoder for statistical machine\ntranslation,\u201d in EMNLP, 2014.\n\n[27] Y.-C. Chen, C.-H. Shen, S.-F. Huang, and H.-Y. Lee, \u201cTowards unsupervised automatic speech\nrecognition trained by unaligned speech and text only,\u201d arXiv preprint arXiv:1803.10952, 2018.\n[28] Y.-A. Chung and J. Glass, \u201cLearning word embeddings from speech,\u201d in NIPS ML4Audio\n\nWorkshop, 2017.\n\n[29] A. Park and J. Glass, \u201cUnsupervised pattern discovery in speech,\u201d IEEE Transactions on Audio,\n\nSpeech, and Language Processing, vol. 16, no. 1, pp. 186\u2013197, 2008.\n\n[30] A. Jansen and B. Van Durme, \u201cEf\ufb01cient spoken term discovery using randomized algorithms,\u201d\n\nin ASRU, 2011.\n\n[31] H. Kamper, A. Jansen, and S. Goldwater, \u201cUnsupervised word segmentation and lexicon dis-\ncovery using acoustic word embeddings,\u201d IEEE Transactions on Audio, Speech, and Language\nProcessing, vol. 24, no. 4, pp. 669\u2013679, 2016.\n\n[32] C.-Y. Lee, T. J. O\u2019Donnell, and J. Glass, \u201cUnsupervised lexicon discovery from acoustic input,\u201d\n\nTransactions of the Association for Computational Linguistics, vol. 3, pp. 389\u2013403, 2015.\n\n[33] M. Sun and H. Van hamme, \u201cJoint training of non-negative tucker decomposition and discrete\ndensity hidden markov models,\u201d Computer Speech and Language, vol. 27, no. 4, pp. 969\u2013988,\n2013.\n\n[34] O. Walter, T. Korthals, R. Haeb-Umbach, and B. 
Raj, \u201cA hierarchical system for word discovery\n\nexploiting dtw-based initialization,\u201d in ASRU, 2013.\n\n[35] H. Kamper, A. Jansen, and S. Goldwater, \u201cA segmental framework for fully-unsupervised\nlarge-vocabulary speech recognition,\u201d Computer Speech and Language, vol. 46, pp. 154\u2013174,\n2017.\n\n10\n\n\f[36] H. Kamper, K. Livescu, and S. Goldwater, \u201cAn embedded segmental k-means model for\n\nunsupervised segmentation and clustering of speech,\u201d in ASRU, 2017.\n\n[37] O. R\u00e4s\u00e4nen, G. Doyle, and M. C. Frank, \u201cUnsupervised word discovery from speech using\n\nautomatic segmentation into syllable-like units,\u201d in INTERSPEECH, 2015.\n\n[38] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and\n\nY. Bengio, \u201cGenerative adversarial nets,\u201d in NIPS, 2014.\n\n[39] G. Dinu, A. Lazaridou, and M. Baroni, \u201cImproving zero-shot learning by mitigating the hubness\n\nproblem,\u201d in ICLR Workshop Track, 2015.\n\n[40] A. Graves, A.-r. Mohamed, and G. Hinton, \u201cSpeech recognition with deep recurrent neural\n\nnetworks,\u201d in ICASSP, 2013.\n\n[41] A. Graves and N. Jaitly, \u201cTowards end-to-end speech recognition with recurrent neural networks,\u201d\n\nin ICML, 2014.\n\n[42] J. K. Chorowski, D. Bahdanau, D. Serdyuk, K. Cho, and Y. Bengio, \u201cAttention-based models\n\nfor speech recognition,\u201d in NIPS, 2015.\n\n[43] W. Chan, N. Jaitly, Q. Le, and O. Vinyals, \u201cListen, attend and spell: A neural network for large\n\nvocabulary conversational speech recognition,\u201d in ICASSP, 2016.\n\n[44] D. Amodei, S. Ananthanarayanan, R. Anubhai, J. Bai, E. Battenberg, C. Case, J. Casper,\nB. Catanzaro, Q. Cheng, G. Chen et al., \u201cDeep speech 2: End-to-end speech recognition in\nenglish and mandarin,\u201d in ICML, 2016.\n\n[45] A. Waibel and C. Fugen, \u201cSpoken language translation,\u201d IEEE Signal Processing Magazine,\n\nvol. 3, no. 25, pp. 
70\u201379, 2008.\n\n[46] V. Panayotov, G. Chen, D. Povey, and S. Khudanpur, \u201cLibriSpeech: An ASR corpus based on\n\npublic domain audio books,\u201d in ICASSP, 2015.\n\n[47] A. Kocabiyikoglu, L. Besacier, and O. Kraif, \u201cAugmenting Librispeech with French translations:\n\nA multimodal corpus for direct speech translation evaluation,\u201d in LREC, 2018.\n\n[48] A. K\u00f6hn, F. Stegen, and T. Baumann, \u201cMining the spoken wikipedia for speech data and beyond,\u201d\n\nin LREC, 2016.\n\n11\n\n\f", "award": [], "sourceid": 3666, "authors": [{"given_name": "Yu-An", "family_name": "Chung", "institution": "Massachusetts Institute of Technology"}, {"given_name": "Wei-Hung", "family_name": "Weng", "institution": "Massachusetts Institute of Technology"}, {"given_name": "Schrasing", "family_name": "Tong", "institution": "MIT CSAIL"}, {"given_name": "James", "family_name": "Glass", "institution": "Massachusetts Institute of Technology"}]}