{"title": "Deep Voice 2: Multi-Speaker Neural Text-to-Speech", "book": "Advances in Neural Information Processing Systems", "page_first": 2962, "page_last": 2970, "abstract": "We introduce a technique for augmenting neural text-to-speech (TTS) with low-dimensional trainable speaker embeddings to generate different voices from a single model. As a starting point, we show improvements over the two state-of-the-art approaches for single-speaker neural TTS: Deep Voice 1 and Tacotron. We introduce Deep Voice 2, which is based on a similar pipeline with Deep Voice 1, but constructed with higher performance building blocks and demonstrates a significant audio quality improvement over Deep Voice 1. We improve Tacotron by introducing a post-processing neural vocoder, and demonstrate a significant audio quality improvement. We then demonstrate our technique for multi-speaker speech synthesis for both Deep Voice 2 and Tacotron on two multi-speaker TTS datasets. We show that a single neural TTS system can learn hundreds of unique voices from less than half an hour of data per speaker, while achieving high audio quality synthesis and preserving the speaker identities almost perfectly.", "full_text": "Deep Voice 2: Multi-Speaker Neural Text-to-Speech\n\nSercan \u00d6. Ar\u0131k\u21e4\n\nsercanarik@baidu.com\n\nGregory Diamos\u21e4\n\ngregdiamos@baidu.com\n\nAndrew Gibiansky\u21e4\n\ngibianskyandrew@baidu.com\n\nJohn Miller\u21e4\n\nmillerjohn@baidu.com\n\nKainan Peng\u21e4\n\npengkainan@baidu.com\n\nWei Ping\u21e4\n\npingwei01@baidu.com\n\nJonathan Raiman\u21e4\n\njonathanraiman@baidu.com\n\nYanqi Zhou\u21e4\n\nzhouyanqi@baidu.com\n\nBaidu Silicon Valley Arti\ufb01cial Intelligence Lab\n\n1195 Bordeaux Dr. Sunnyvale, CA 94089\n\nAbstract\n\nWe introduce a technique for augmenting neural text-to-speech (TTS) with low-\ndimensional trainable speaker embeddings to generate different voices from a\nsingle model. 
As a starting point, we show improvements over the two state-of-\nthe-art approaches for single-speaker neural TTS: Deep Voice 1 and Tacotron.\nWe introduce Deep Voice 2, which is based on a similar pipeline with Deep\nVoice 1, but constructed with higher performance building blocks and demonstrates\na signi\ufb01cant audio quality improvement over Deep Voice 1. We improve Tacotron\nby introducing a post-processing neural vocoder, and demonstrate a signi\ufb01cant\naudio quality improvement. We then demonstrate our technique for multi-speaker\nspeech synthesis for both Deep Voice 2 and Tacotron on two multi-speaker TTS\ndatasets. We show that a single neural TTS system can learn hundreds of unique\nvoices from less than half an hour of data per speaker, while achieving high audio\nquality synthesis and preserving the speaker identities almost perfectly.\n\n1\n\nIntroduction\n\nArti\ufb01cial speech synthesis, commonly known as text-to-speech (TTS), has a variety of applications in\ntechnology interfaces, accessibility, media, and entertainment. Most TTS systems are built with a\nsingle speaker voice, and multiple speaker voices are provided by having distinct speech databases or\nmodel parameters. As a result, developing a TTS system with support for multiple voices requires\nmuch more data and development effort than a system which only supports a single voice.\nIn this work, we demonstrate that we can build all-neural multi-speaker TTS systems which share the\nvast majority of parameters between different speakers. We show that not only can a single model\ngenerate speech from multiple different voices, but also that signi\ufb01cantly less data is required per\nspeaker than when training single-speaker systems.\nConcretely, we make the following contributions:\n\n1. We present Deep Voice 2, an improved architecture based on Deep Voice 1 (Arik et al., 2017).\n2. 
We introduce a WaveNet-based (Oord et al., 2016) spectrogram-to-audio neural vocoder, and\nuse it with Tacotron (Wang et al., 2017) as a replacement for Griffin-Lim audio generation.\n3. Using these two single-speaker models as a baseline, we demonstrate multi-speaker neural\nspeech synthesis by introducing trainable speaker embeddings into Deep Voice 2 and Tacotron.\n\n\u21e4Listed alphabetically.\n\n31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.\n\nWe organize the rest of this paper as follows. Section 2 discusses related work and what makes the\ncontributions of this paper distinct from prior work. Section 3 presents Deep Voice 2 and highlights\nthe differences from Deep Voice 1. Section 4 explains our speaker embedding technique for neural\nTTS models and shows multi-speaker variants of the Deep Voice 2 and Tacotron architectures.\nSection 5.1 quantifies the improvement for single speaker TTS through a mean opinion score (MOS)\nevaluation and Section 5.2 presents the synthesized audio quality of multi-speaker Deep Voice 2 and\nTacotron via both MOS evaluation and a multi-speaker discriminator accuracy metric. Section 6\nconcludes with a discussion of the results and potential future work.\n\n2 Related Work\nWe discuss the related work relevant to each of our claims in Section 1 in order, starting from\nsingle-speaker neural speech synthesis and moving on to multi-speaker speech synthesis and metrics\nfor generative model quality.\nWith regards to single-speaker speech synthesis, deep learning has been used for a variety of subcomponents,\nincluding duration prediction (Zen et al., 2016), fundamental frequency prediction (Ronanki\net al., 2016), acoustic modeling (Zen and Sak, 2015), and more recently autoregressive sample-by-sample\naudio waveform generation (e.g., Oord et al., 2016; Mehri et al., 2016). 
Our contributions\nbuild upon recent work in entirely neural TTS systems, including Deep Voice 1 (Arik et al., 2017),\nTacotron (Wang et al., 2017), and Char2Wav (Sotelo et al., 2017). While these works focus on\nbuilding single-speaker TTS systems, our paper focuses on extending neural TTS systems to handle\nmultiple speakers with less data per speaker.\nOur work is not the \ufb01rst to attempt a multi-speaker TTS system. For instance, in traditional HMM-\nbased TTS synthesis (e.g., Yamagishi et al., 2009), an average voice model is trained using multiple\nspeakers\u2019 data, which is then adapted to different speakers. DNN-based systems (e.g., Yang et al.,\n2016) have also been used to build average voice models, with i-vectors representing speakers as\nadditional inputs and separate output layers for each target speaker. Similarly, Fan et al. (2015)\nuses a shared hidden representation among different speakers with speaker-dependent output layers\npredicting vocoder parameters (e.g., line spectral pairs, aperiodicity parameters etc.). For further\ncontext, Wu et al. (2015) empirically studies DNN-based multi-speaker modeling. More recently,\nspeaker adaptation has been tackled with generative adversarial networks (GANs) (Hsu et al., 2017).\nWe instead use trainable speaker embeddings for multi-speaker TTS. The approach was investigated\nin speech recognition (Abdel-Hamid and Jiang, 2013), but is a novel technique in speech synthesis.\nUnlike prior work which depends on \ufb01xed embeddings (e.g. i-vectors), the speaker embeddings used\nin this work are trained jointly with the rest of the model from scratch, and thus can directly learn\nthe features relevant to the speech synthesis task. 
In addition, this work does not rely on per-speaker\noutput layers or average voice modeling, which leads to higher-quality synthesized samples and lower\ndata requirements (as there are fewer unique parameters per speaker to learn).\nIn order to evaluate the distinctiveness of the generated voices in an automated way, we propose using\nthe classi\ufb01cation accuracy of a speaker discriminator. Similar metrics such as an \u201cInception score\u201d\nhave been used for quantitative quality evaluations of GANs for image synthesis (e.g., Salimans\net al., 2016). Speaker classi\ufb01cation has been studied with both traditional GMM-based methods (e.g.,\nReynolds et al., 2000) and more recently with deep learning approaches (e.g., Li et al., 2017).\n\n3 Single-Speaker Deep Voice 2\n\nIn this section, we present Deep Voice 2, a neural TTS system based on Deep Voice 1 (Arik et al.,\n2017). We keep the general structure of the Deep Voice 1 (Arik et al., 2017), as depicted in Fig. 1 (the\ncorresponding training pipeline is depicted in Appendix A). Our primary motivation for presenting\nan improved single-speaker model is to use it as the starting point for a high-quality multi-speaker\nmodel.\nOne major difference between Deep Voice 2 and Deep Voice 1 is the separation of the phoneme\nduration and frequency models. 
Deep Voice 1 has a single model to jointly predict phoneme duration\nand frequency profile (voicedness and time-dependent fundamental frequency, F0). In Deep Voice 2,\nthe phoneme durations are predicted first and then are used as inputs to the frequency model.\n\nFigure 1: Inference system diagram: first text-phonemes dictionary conversion, second predict\nphoneme durations, third upsample and generate F0, finally feed F0 and phonemes to vocal model.\n\nIn the subsequent subsections, we present the models used in Deep Voice 2. All models are trained\nseparately using the hyperparameters specified in Appendix B. We will provide a quantitative\ncomparison of Deep Voice 1 and Deep Voice 2 in Section 5.1.\n\n3.1 Segmentation model\nEstimation of phoneme locations is treated as an unsupervised learning problem in Deep Voice\n2, similar to Deep Voice 1. The segmentation model is a convolutional-recurrent architecture with\nconnectionist temporal classification (CTC) loss (Graves et al., 2006) applied to classify phoneme\npairs, which are then used to extract the boundaries between them. The major architecture changes in\nDeep Voice 2 are the addition of batch normalization and residual connections in the convolutional\nlayers. 
Specifically, Deep Voice 1\u2019s segmentation model computes the output of each layer as\n\n$$h^{(l)} = \\mathrm{relu}\\left(W^{(l)} * h^{(l-1)} + b^{(l)}\\right), \\tag{1}$$\n\nwhere $h^{(l)}$ is the output of the $l$-th layer, $W^{(l)}$ is the convolution filterbank, $b^{(l)}$ is the bias vector, and\n$*$ is the convolution operator. In contrast, Deep Voice 2\u2019s segmentation model layers instead compute\n\n$$h^{(l)} = \\mathrm{relu}\\left(h^{(l-1)} + \\mathrm{BN}\\left(W^{(l)} * h^{(l-1)}\\right)\\right), \\tag{2}$$\n\nwhere BN is batch normalization (Ioffe and Szegedy, 2015). In addition, we find that the segmentation\nmodel often makes mistakes for boundaries between silence phonemes and other phonemes, which can\nsignificantly reduce segmentation accuracy on some datasets. We introduce a small post-processing\nstep to correct these mistakes: whenever the segmentation model decodes a silence boundary, we\nadjust the location of the boundary with a silence detection heuristic.2\n\n3.2 Duration Model\nIn Deep Voice 2, instead of predicting a continuous-valued duration, we formulate duration prediction\nas a sequence labeling problem. We discretize the phoneme duration into log-scaled buckets, and\nassign each input phoneme to the bucket label corresponding to its duration. We model the sequence\nby a conditional random field (CRF) with pairwise potentials at output layer (Lample et al., 2016).\nDuring inference, we decode discretized durations from the CRF using the Viterbi forward-backward\nalgorithm. We find that quantizing the duration prediction and introducing the pairwise dependence\nimplied by the CRF improves synthesis quality.\n\n3.3 Frequency Model\nAfter decoding from the duration model, the predicted phoneme durations are upsampled from a\nper-phoneme input features to a per-frame input for the frequency model. 
3 Deep Voice 2 frequency model consists of multiple layers: firstly, bidirectional gated recurrent unit (GRU) layers (Cho et al.,\n2014) generate hidden states from the input features. From these hidden states, an affine projection\nfollowed by a sigmoid nonlinearity produces the probability that each frame is voiced. Hidden states\nare also used to make two separate normalized F0 predictions. The first prediction, $f_{\\mathrm{GRU}}$, is made\nwith a single-layer bidirectional GRU followed by an affine projection. The second prediction, $f_{\\mathrm{conv}}$,\nis made by adding up the contributions of multiple convolutions with varying convolution widths\nand a single output channel. Finally, the hidden state is used with an affine projection and a sigmoid\nnonlinearity to predict a mixture ratio $\\omega$, which is used to weigh the two normalized frequency\npredictions and combine them into\n\n$$f = \\omega \\cdot f_{\\mathrm{GRU}} + (1 - \\omega) \\cdot f_{\\mathrm{conv}}. \\tag{3}$$\n\nThe normalized prediction $f$ is then converted to the true frequency F0 prediction via\n\n$$F_0 = \\mu_{F_0} + \\sigma_{F_0} \\cdot f, \\tag{4}$$\n\nwhere $\\mu_{F_0}$ and $\\sigma_{F_0}$ are, respectively, the mean and standard deviation of F0 for the speaker the\nmodel is trained on. We find that predicting F0 with a mixture of convolutions and a recurrent layer\nperforms better than predicting with either one individually. We attribute this to the hypothesis that\nincluding the wide convolutions reduces the burden for the recurrent layers to maintain state over a\nlarge number of input frames, while processing the entire context information efficiently.\n\n2We compute the smoothed normalized audio power as $p[n] = (x[n]^2 / x_{\\max}^2) * g[n]$, where $x[n]$ is the\naudio signal, $g[n]$ is the impulse response of a Gaussian filter, $x_{\\max}$ is the maximum value of $x[n]$ and $*$ is\nthe one-dimensional convolution operation. We assign the silence phoneme boundaries when $p[n]$ exceeds a fixed\nthreshold. The optimal parameter values for the Gaussian filter and the threshold depend on the dataset and\naudio sampling rate.\n\n3Each frame is ensured to be 10 milliseconds. For example, if a phoneme lasts 20 milliseconds, the input\nfeatures corresponding to that phoneme will be repeated in 2 frames. If it lasts less than 10 milliseconds, it is\nextended to a single frame.\n\n3.4 Vocal Model\nThe Deep Voice 2 vocal model is based on a WaveNet architecture (Oord et al., 2016) with a two-layer\nbidirectional QRNN (Bradbury et al., 2017) conditioning network, similar to Deep Voice 1. However,\nwe remove the 1 \u00d7 1 convolution between the gated tanh nonlinearity and the residual connection. In\naddition, we use the same conditioner bias for every layer of the WaveNet, instead of generating a\nseparate bias for every layer as was done in Deep Voice 1. 4\n\n4 Multi-Speaker Models with Trainable Speaker Embeddings\nIn order to synthesize speech from multiple speakers, we augment each of our models with a single\nlow-dimensional speaker embedding vector per speaker. Unlike previous work, our approach does\nnot rely on per-speaker weight matrices or layers. Speaker-dependent parameters are stored in a very\nlow-dimensional vector and thus there is near-complete weight sharing between speakers. We use\nspeaker embeddings to produce recurrent neural network (RNN) initial states, nonlinearity biases,\nand multiplicative gating factors, used throughout the networks. Speaker embeddings are initialized\nrandomly with a uniform distribution over [-0.1, 0.1] and trained jointly via backpropagation; each\nmodel has its own set of speaker embeddings.\nTo encourage each speaker\u2019s unique voice signature to influence the model, we incorporate the speaker\nembeddings into multiple portions of the model. 
Empirically, we find that simply providing the\nspeaker embeddings to the input layers does not work as well for any of the presented models besides\nthe vocal model, possibly due to the high degree of residual connections present in the WaveNet and\ndue to the difficulty of learning high-quality speaker embeddings. We observed that several patterns\ntend to yield high performance:\n\n\u2022 Site-Specific Speaker Embeddings: For every use site in the model architecture, transform the\nshared speaker embedding to the appropriate dimension and form through an affine projection\nand a nonlinearity.\n\u2022 Recurrent Initialization: Initialize recurrent layer hidden states with site-specific speaker\nembeddings.\n\u2022 Input Augmentation: Concatenate a site-specific speaker embedding to the input at every\ntimestep of a recurrent layer.\n\u2022 Feature Gating: Multiply layer activations elementwise with a site-specific speaker embedding\nto render adaptable information flow. 5\n\n4We find that these changes reduce model size by a factor of ~7 and speed up inference by ~25%, while\nyielding no perceptual change in quality. However, we do not focus on demonstrating these claims in this paper.\n5We hypothesize that feature gating lets the model learn the union of all necessary features while allowing\nspeaker embeddings to determine what features are used for each speaker and how much influence they will\nhave on the activations.\n\nFigure 2: Architecture for the multi-speaker (a) segmentation, (b) duration, and (c) frequency model.\n\nNext, we describe how speaker embeddings are used in each architecture.\n\n4.1 Multi-Speaker Deep Voice 2\nThe Deep Voice 2 models have separate speaker embeddings for each model. 
Yet, they can be viewed\nas chunks of a larger speaker embedding, which are trained independently.\n\n4.1.1 Segmentation Model\nIn the multi-speaker segmentation model, we use feature gating in the residual connections of the\nconvolution layers. Instead of Eq. (2), we multiply the batch-normalized activations by a site-specific\nspeaker embedding:\n\n$$h^{(l)} = \\mathrm{relu}\\left(h^{(l-1)} + \\mathrm{BN}\\left(W * h^{(l-1)}\\right) \\cdot g_s\\right), \\tag{5}$$\n\nwhere $g_s$ is a site-specific speaker embedding. The same site-specific embedding is shared for all\nthe convolutional layers. In addition, we initialize each of the recurrent layers with a second site-specific\nembedding. Similarly, each layer shares the same site-specific embedding, rather than having\na separate embedding per layer.\n\n4.1.2 Duration Model\nThe multi-speaker duration model uses speaker-dependent recurrent initialization and input augmentation.\nA site-specific embedding is used to initialize RNN hidden states, and another site-specific\nembedding is provided as input to the first RNN layer by concatenating it to the feature vectors.\n\n4.1.3 Frequency Model\nThe multi-speaker frequency model uses recurrent initialization, which initializes the recurrent\nlayers (except for the recurrent output layer) with a single site-specific speaker embedding. As\ndescribed in Section 3.3, the recurrent and convolutional output layers in the single-speaker frequency\nmodel predict a normalized frequency, which is then converted into the true F0 by a fixed linear\ntransformation. The linear transformation depends on the mean and standard deviation of observed\nF0 for the speaker. These values vary greatly between speakers: male speakers, for instance, tend to\nhave a much lower mean F0. 
To better adapt to these variations, we make the mean and standard\ndeviation trainable model parameters and multiply them by scaling terms which depend on the speaker\nembeddings. Specifically, instead of Eq. (4), we compute the F0 prediction as\n\n$$F_0 = \\mu_{F_0} \\cdot \\left(1 + \\mathrm{softsign}\\left(V_\\mu^T g_f\\right)\\right) + \\sigma_{F_0} \\cdot \\left(1 + \\mathrm{softsign}\\left(V_\\sigma^T g_f\\right)\\right) \\cdot f, \\tag{6}$$\n\nwhere $g_f$ is a site-specific speaker embedding, $\\mu_{F_0}$ and $\\sigma_{F_0}$ are trainable scalar parameters initialized\nto the F0 mean and standard deviation on the dataset, and $V_\\mu$ and $V_\\sigma$ are trainable parameter vectors.\n\nFigure 3: Tacotron with speaker conditioning in the Encoder CBHG module and decoder with two\nways to convert spectrogram to audio: Griffin-Lim or our speaker-conditioned Vocal model.\n\n4.1.4 Vocal Model\nThe multi-speaker vocal model uses only input 
augmentation, with the site-specific speaker embedding\nconcatenated onto each input frame of the conditioner. This differs from the global conditioning\nsuggested in Oord et al. (2016) and allows the speaker embedding to influence the local conditioning\nnetwork as well.\nWithout speaker embeddings, the vocal model is still able to generate somewhat distinct-sounding\nvoices because of the distinctive features provided by the frequency and duration models. Yet,\nhaving speaker embeddings in the vocal model increases the audio quality. We indeed observe that\nthe embeddings converge to a meaningful latent space.\n\n4.2 Multi-Speaker Tacotron\nIn addition to extending Deep Voice 2 with speaker embeddings, we also extend Tacotron (Wang\net al., 2017), a sequence-to-sequence character-to-waveform model. When training multi-speaker\nTacotron variants, we find that model performance is highly dependent on model hyperparameters,\nand that some models often fail to learn attention mechanisms for a small subset of speakers. We also\nfind that if the speech in each audio clip does not start at the same timestep, the models are much less\nlikely to converge to a meaningful attention curve and recognizable speech; thus, we trim all initial\nand final silence in each audio clip. Due to the sensitivity of the model to hyperparameters and data\npreprocessing, we believe that additional tuning may be necessary to obtain maximal quality. Thus,\nour work focuses on demonstrating that Tacotron, like Deep Voice 2, is capable of handling multiple\nspeakers through speaker embeddings, rather than comparing the quality of the two architectures.\n\n4.2.1 Character-to-Spectrogram Model\nThe Tacotron character-to-spectrogram architecture consists of a convolution-bank-highway-GRU\n(CBHG) encoder, an attentional decoder, and a CBHG post-processing network. 
Due to the complexity\nof the architecture, we leave out a complete description and instead focus on our modifications.\nWe find that incorporating speaker embeddings into the CBHG post-processing network degrades\noutput quality, whereas incorporating speaker embeddings into the character encoder is necessary.\nWithout a speaker-dependent CBHG encoder, the model is incapable of learning its attention mechanism\nand cannot generate meaningful output (see Appendix D.2 for speaker-dependent attention\nvisualizations). In order to condition the encoder on the speaker, we use one site-specific embedding\nas an extra input to each highway layer at each timestep and initialize the CBHG RNN state with a\nsecond site-specific embedding.\nWe also find that augmenting the decoder with speaker embeddings is helpful. We use one site-specific\nembedding as an extra input to the decoder pre-net, one extra site-specific embedding as the initial\nattention context vector for the attentional RNN, one site-specific embedding as the initial decoder\nGRU hidden state, and one site-specific embedding as a bias to the tanh in the content-based attention\nmechanism.\n\nModel | Samp. Freq. | MOS\nDeep Voice 1 | 16 KHz | 2.05 \u00b1 0.24\nDeep Voice 2 | 16 KHz | 2.96 \u00b1 0.38\nTacotron (Griffin-Lim) | 24 KHz | 2.57 \u00b1 0.28\nTacotron (WaveNet) | 24 KHz | 4.17 \u00b1 0.18\n\nTable 1: Mean Opinion Score (MOS) evaluations with 95% confidence intervals of Deep Voice 1,\nDeep Voice 2, and Tacotron. Using the crowdMOS toolkit, batches of samples from these models\nwere presented to raters on Mechanical Turk. 
Since batches contained samples from all models, the\nexperiment naturally induces a comparison between the models.\n\n4.2.2 Spectrogram-to-Waveform Model\nThe original Tacotron implementation in (Wang et al., 2017) uses the Grif\ufb01n-Lim algorithm to convert\nspectrograms to time-domain audio waveforms by iteratively estimating the unknown phases.6 We\nobserve that minor noise in the input spectrogram causes noticeable estimation errors in the Grif\ufb01n-\nLim algorithm and the generated audio quality is degraded. To produce higher quality audio using\nTacotron, instead of using Grif\ufb01n-Lim, we train a WaveNet-based neural vocoder to convert from\nlinear spectrograms to audio waveforms. The model used is equivalent to the Deep Voice 2 vocal\nmodel, but takes linear-scaled log-magnitude spectrograms instead of phoneme identity and F0 as\ninput. The combined Tacotron-WaveNet model is shown in Fig. 3. As we will show in Section 5.1,\nWaveNet-based neural vocoder indeed signi\ufb01cantly improves single-speaker Tacotron as well.\n\n5 Results\nIn this section, we will present the results on both single-speaker and multi-speaker speech synthesis\nusing the described architectures. All model hyperparameters are presented in Appendix B.\n\n5.1 Single-Speaker Speech Synthesis\nWe train Deep Voice 1, Deep Voice 2, and Tacotron on an internal English speech database containing\napproximately 20 hours of single-speaker data. The intermediate evaluations of models in Deep Voice\n1 and Deep Voice 2 can be found in Table 3 within Appendix A. We run an MOS evaluation using the\ncrowdMOS framework (Ribeiro et al., 2011) to compare the quality of samples (Table 1). The results\nshow conclusively that the architecture improvements in Deep Voice 2 yield signi\ufb01cant gains in\nquality over Deep Voice 1. 
They also demonstrate that converting Tacotron-generated spectrograms\nto audio using WaveNet is preferable to using the iterative Griffin-Lim algorithm.\n\n5.2 Multi-Speaker Speech Synthesis\nWe train all the aforementioned models on the VCTK dataset with 44 hours of speech, which contains\n108 speakers with approximately 400 utterances each. We also train all models on an internal dataset\nof audiobooks, which contains 477 speakers with 30 minutes of audio each (for a total of ~238\nhours). The consistent sample quality observed from our models indicates that our architectures can\neasily learn hundreds of distinct voices with a variety of different accents and cadences. We also\nobserve that the learned embeddings lie in a meaningful latent space (see Fig. 4 as an example and\nAppendix D for more details).\nIn order to evaluate the quality of the synthesized audio, we run MOS evaluations using the crowdMOS\nframework, and present the results in Table 2. We purposefully include ground truth samples in the\nset being evaluated, because the accents in datasets are likely to be unfamiliar to our North American\ncrowdsourced raters and will thus be rated poorly due to the accent rather than due to the model\nquality. By including ground truth samples, we are able to compare the MOS of the models with\nthe ground truth MOS and thus evaluate the model quality rather than the data quality; however, the\nresulting MOS may be lower, due to the implicit comparison with the ground truth samples. 
Overall,\nwe observe that the Deep Voice 2 model can approach an MOS value that is close to the ground truth,\nwhen the low sampling rate and companding/expanding are taken into account.\n\n6Estimation of the unknown phases is done by repeatedly converting between frequency and time domain\nrepresentations of the signal using the short-time Fourier transform and its inverse, replacing the magnitude of\neach frequency component with the predicted magnitude at each step.\n\nDataset | Multi-Speaker Model | Samp. Freq. | MOS | Acc.\nVCTK | Deep Voice 2 (20-layer WaveNet) | 16 KHz | 2.87\u00b10.13 | 99.9%\nVCTK | Deep Voice 2 (40-layer WaveNet) | 16 KHz | 3.21\u00b10.13 | 100%\nVCTK | Deep Voice 2 (60-layer WaveNet) | 16 KHz | 3.42\u00b10.12 | 99.7%\nVCTK | Deep Voice 2 (80-layer WaveNet) | 16 KHz | 3.53\u00b10.12 | 99.9%\nVCTK | Tacotron (Griffin-Lim) | 24 KHz | 1.68\u00b10.12 | 99.4%\nVCTK | Tacotron (20-layer WaveNet) | 24 KHz | 2.51\u00b10.13 | 60.9%\nVCTK | Ground Truth Data | 48 KHz | 4.65\u00b10.06 | 99.7%\nAudiobooks | Deep Voice 2 (80-layer WaveNet) | 16 KHz | 2.97\u00b10.17 | 97.4%\nAudiobooks | Tacotron (Griffin-Lim) | 24 KHz | 1.73\u00b10.22 | 93.9%\nAudiobooks | Tacotron (20-layer WaveNet) | 24 KHz | 2.11\u00b10.20 | 66.5%\nAudiobooks | Ground Truth Data | 44.1 KHz | 4.63\u00b10.04 | 98.8%\n\nTable 2: MOS and classification accuracy for all multi-speaker models. To obtain MOS, we use\nthe crowdMOS toolkit as detailed in Table 1. We also present classification accuracies of the speaker\ndiscriminative models (see Appendix E for details) on the samples, showing that the synthesized\nvoices are as distinguishable as ground truth audio.\n\nFigure 4: Principal components of the learned speaker embeddings for the (a) 80-layer vocal model\nand (b) character-to-spectrogram model for VCTK dataset. 
See Appendix D.3 for details.\n\nA multi-speaker TTS system with high sample quality but indistinguishable voices would result in\nhigh MOS, but fail to meet the desired objective of reproducing the input voices accurately. To show\nthat our models not only generate high quality samples, but also generate distinguishable voices, we\nalso measure the classification accuracy of a speaker discriminative model on our generated samples.\nThe speaker discriminative model is a convolutional network trained to classify utterances to their speakers,\ntrained on the same dataset as the TTS systems themselves. If the voices were indistinguishable\n(or the audio quality was low), the classification accuracy would be much lower for synthesized\nsamples than it is for the ground truth samples. As shown in Table 2, the classification accuracy\nindicates that samples generated from our models are as distinguishable as the ground truth\nsamples (see Appendix E for more details). The classification accuracy is only significantly lower for\nTacotron with WaveNet, and we suspect that generation errors in the spectrogram are exacerbated by\nthe WaveNet, as it is trained with ground truth spectrograms.\n\n6 Conclusion\nIn this work, we explore how entirely-neural speech synthesis pipelines may be extended to multi-speaker\ntext-to-speech via low-dimensional trainable speaker embeddings. We start by presenting\nDeep Voice 2, an improved single-speaker model. Next, we demonstrate the applicability of our\ntechnique by training both multi-speaker Deep Voice 2 and multi-speaker Tacotron models, and\nevaluate their quality through MOS. 
Overall, we use our speaker embedding technique to create\nhigh quality text-to-speech systems and conclusively show that neural speech synthesis models can\nlearn effectively from small amounts of data spread among hundreds of different speakers.\nThe results presented in this work suggest many directions for future research. Future work may test\nthe limits of this technique and explore how many speakers these models can generalize to, how little\ndata is truly required per speaker for high quality synthesis, whether new speakers can be added to a\nsystem by fixing model parameters and solely training new speaker embeddings, and whether the\nspeaker embeddings can be used as a meaningful vector space, as is possible with word embeddings.\n\nReferences\nO. Abdel-Hamid and H. Jiang. Fast speaker adaptation of hybrid NN/HMM model for speech recognition based on discriminative learning of speaker code. In ICASSP, 2013.\n\nS. O. Arik, M. Chrzanowski, A. Coates, G. Diamos, A. Gibiansky, Y. Kang, X. Li, J. Miller, J. Raiman, S. Sengupta, and M. Shoeybi. Deep voice: Real-time neural text-to-speech. In ICML, 2017.\n\nJ. Bradbury, S. Merity, C. Xiong, and R. Socher. Quasi-recurrent neural networks. In ICLR, 2017.\n\nK. Cho, B. Van Merri\u00ebnboer, C. Gulcehre, D. Bahdanau, F. Bougares, H. Schwenk, and Y. Bengio. Learning phrase representations using RNN encoder-decoder for statistical machine translation. arXiv:1406.1078, 2014.\n\nY. Fan, Y. Qian, F. K. Soong, and L. He. Multi-speaker modeling and speaker adaptation for DNN-based TTS synthesis. In IEEE ICASSP, 2015.\n\nA. Graves, S. Fern\u00e1ndez, F. Gomez, and J. Schmidhuber. Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks. In ICML, 2006.\n\nC.-C. Hsu, H.-T. Hwang, Y.-C. Wu, Y. Tsao, and H.-M. Wang. Voice conversion from unaligned corpora using variational autoencoding Wasserstein generative adversarial networks. 
arXiv:1704.00849, 2017.\n\nS. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv:1502.03167, 2015.\n\nD. Kingma and J. Ba. Adam: A method for stochastic optimization. arXiv:1412.6980, 2014.\n\nG. Lample, M. Ballesteros, K. Kawakami, S. Subramanian, and C. Dyer. Neural architectures for named entity recognition. In Proc. NAACL-HLT, 2016.\n\nC. Li, X. Ma, B. Jiang, X. Li, X. Zhang, X. Liu, Y. Cao, A. Kannan, and Z. Zhu. Deep speaker: an end-to-end neural speaker embedding system. arXiv:1705.02304, 2017.\n\nS. Mehri, K. Kumar, I. Gulrajani, R. Kumar, S. Jain, J. Sotelo, A. Courville, and Y. Bengio. SampleRNN: An unconditional end-to-end neural audio generation model. arXiv:1612.07837, 2016.\n\nA. v. d. Oord, S. Dieleman, H. Zen, K. Simonyan, O. Vinyals, A. Graves, N. Kalchbrenner, A. Senior, and K. Kavukcuoglu. WaveNet: A generative model for raw audio. arXiv:1609.03499, 2016.\n\nD. A. Reynolds, T. F. Quatieri, and R. B. Dunn. Speaker verification using adapted Gaussian mixture models. Digital Signal Processing, 10(1-3):19\u201341, 2000.\n\nF. Ribeiro, D. Flor\u00eancio, C. Zhang, and M. Seltzer. CrowdMOS: An approach for crowdsourcing mean opinion score studies. In IEEE ICASSP, 2011.\n\nS. Ronanki, O. Watts, S. King, and G. E. Henter. Median-based generation of synthetic speech durations using a non-parametric approach. arXiv:1608.06134, 2016.\n\nT. Salimans, I. Goodfellow, W. Zaremba, V. Cheung, A. Radford, and X. Chen. Improved techniques for training GANs. In NIPS, 2016.\n\nJ. Sotelo, S. Mehri, K. Kumar, J. F. Santos, K. Kastner, A. Courville, and Y. Bengio. Char2wav: End-to-end speech synthesis. In ICLR 2017 workshop submission, 2017.\n\nY. Wang, R. Skerry-Ryan, D. Stanton, Y. Wu, R. J. Weiss, N. Jaitly, Z. Yang, Y. Xiao, Z. Chen, S. Bengio, et al. Tacotron: Towards end-to-end speech synthesis. In Interspeech, 2017.\n\nZ. Wu, P. 
Swietojanski, C. Veaux, S. Renals, and S. King. A study of speaker adaptation for DNN-based speech synthesis. In Interspeech, 2015.\n\nJ. Yamagishi, T. Nose, H. Zen, Z.-H. Ling, T. Toda, K. Tokuda, S. King, and S. Renals. Robust speaker-adaptive HMM-based text-to-speech synthesis. IEEE Transactions on Audio, Speech, and Language Processing, 2009.\n\nS. Yang, Z. Wu, and L. Xie. On the training of DNN-based average voice model for speech synthesis. In Signal and Information Processing Association Annual Summit and Conference (APSIPA), Asia-Pacific, 2016.\n\nH. Zen and H. Sak. Unidirectional long short-term memory recurrent neural network with recurrent output layer for low-latency speech synthesis. In IEEE ICASSP, 2015.\n\nH. Zen, Y. Agiomyrgiannakis, N. Egberts, F. Henderson, and P. Szczepaniak. Fast, compact, and high quality LSTM-RNN based statistical parametric speech synthesizers for mobile devices. arXiv:1606.06061, 2016.\n", "award": [], "sourceid": 1708, "authors": [{"given_name": "Andrew", "family_name": "Gibiansky", "institution": "Baidu Research"}, {"given_name": "Sercan", "family_name": "Arik", "institution": "Baidu Research"}, {"given_name": "Gregory", "family_name": "Diamos", "institution": "Baidu SVAIL"}, {"given_name": "John", "family_name": "Miller", "institution": "Baidu Research"}, {"given_name": "Kainan", "family_name": "Peng", "institution": "Baidu Research"}, {"given_name": "Wei", "family_name": "Ping", "institution": "Baidu Research"}, {"given_name": "Jonathan", "family_name": "Raiman", "institution": "Baidu Research"}, {"given_name": "Yanqi", "family_name": "Zhou", "institution": "Baidu Research"}]}