{"title": "SING: Symbol-to-Instrument Neural Generator", "book": "Advances in Neural Information Processing Systems", "page_first": 9041, "page_last": 9051, "abstract": "Recent progress in deep learning for audio synthesis opens\nthe way to models that directly produce the waveform, shifting away\nfrom the traditional paradigm of relying on vocoders or MIDI synthesizers for speech or music generation. Despite\ntheir successes, current state-of-the-art neural audio synthesizers such\nas WaveNet and SampleRNN suffer from prohibitive training and inference times because they are based on\nautoregressive models that generate audio samples one at a time at a rate of 16kHz. In\nthis work, we study the more computationally efficient alternative of generating the waveform frame-by-frame with large strides.\nWe present a lightweight neural audio synthesizer for the original task of generating musical notes given desired instrument, pitch and velocity. Our model is trained end-to-end to generate notes from nearly 1000 instruments with a single decoder, thanks to a new loss function that minimizes the distances between the log spectrograms of the generated and target waveforms.\nOn the generalization task of synthesizing notes for pairs of pitch and instrument not seen during training, SING produces audio with significantly improved perceptual quality compared to a state-of-the-art autoencoder based on WaveNet as measured by a Mean Opinion Score (MOS), and is about 32 times faster for training and 2, 500 times faster for inference.", "full_text": "SING: Symbol-to-Instrument Neural Generator\n\nAlexandre D\u00e9fossez\nFacebook AI Research\n\nINRIA / ENS\n\nPSL Research University\n\nParis, France\n\ndefossez@fb.com\n\nNeil Zeghidour\n\nFacebook AI Research\n\nLSCP / ENS / EHESS / CNRS\n\nINRIA / PSL Research University\n\nParis, France\n\nneilz@fb.com\n\nNicolas Usunier\n\nFacebook AI Research\n\nParis, France\n\nusunier@fb.com\n\nL\u00e9on Bottou\n\nFacebook AI 
Research\n\nNew York, USA\nleonb@fb.com\n\nFrancis Bach\n\nINRIA\n\n\u00c9cole Normale Sup\u00e9rieure\nPSL Research University\n\nfrancis.bach@ens.fr\n\nAbstract\n\nRecent progress in deep learning for audio synthesis opens the way to models that\ndirectly produce the waveform, shifting away from the traditional paradigm of\nrelying on vocoders or MIDI synthesizers for speech or music generation. Despite\ntheir successes, current state-of-the-art neural audio synthesizers such as WaveNet\nand SampleRNN [24, 17] suffer from prohibitive training and inference times\nbecause they are based on autoregressive models that generate audio samples one\nat a time at a rate of 16kHz. In this work, we study the more computationally\nef\ufb01cient alternative of generating the waveform frame-by-frame with large strides.\nWe present SING, a lightweight neural audio synthesizer for the original task of\ngenerating musical notes given desired instrument, pitch and velocity. Our model\nis trained end-to-end to generate notes from nearly 1000 instruments with a single\ndecoder, thanks to a new loss function that minimizes the distances between the\nlog spectrograms of the generated and target waveforms. On the generalization\ntask of synthesizing notes for pairs of pitch and instrument not seen during training,\nSING produces audio with signi\ufb01cantly improved perceptual quality compared to a\nstate-of-the-art autoencoder based on WaveNet [4] as measured by a Mean Opinion\nScore (MOS), and is about 32 times faster for training and 2, 500 times faster for\ninference.\n\n1\n\nIntroduction\n\nThe recent progress in deep learning for sequence generation has led to the emergence of audio\nsynthesis systems that directly generate the waveform, reaching state-of-the-art perceptual quality in\nspeech synthesis, and promising results for music generation. 
This represents a paradigm shift with respect to approaches that generate sequences of parameters for vocoders in text-to-speech systems [21, 23, 19], or MIDI scores in music generation [8, 3, 10]. A commonality between the state-of-the-art neural audio synthesis models is the use of discretized sample values, so that an audio sample is predicted by a categorical distribution trained with a classification loss [24, 17, 18, 14]. Another significant commonality is the use of autoregressive models that generate samples one-by-one, which leads to prohibitive training and inference times [24, 17], or requires specialized implementations and low-level code optimizations to run in real time [14]. An exception is parallel WaveNet [18], which generates a sequence with a fully convolutional network for faster inference. However, the parallel approach is trained to reproduce the output of a standard WaveNet, which means that faster inference comes at the cost of increased training time.

32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montréal, Canada.

In this paper, we study an alternative to both the modeling of audio samples as a categorical distribution and the autoregressive approach. We propose to generate the waveform for entire audio frames of 1024 samples at a time with a large stride, and model audio samples as continuous values. We develop and evaluate this method on the challenging task of generating musical notes based on the desired instrument, pitch, and velocity, using the large-scale NSynth dataset [4]. 
We obtain a lightweight synthesizer of musical notes composed of a 3-layer RNN with LSTM cells [12] that produces embeddings of audio frames given the desired instrument, pitch, velocity1 and time index. These embeddings are decoded by a single four-layer convolutional network to generate notes from nearly 1000 instruments, with 65 pitches per instrument on average and 5 velocities.

The successful end-to-end training of the synthesizer relies on two ingredients:

• A new loss function which we call the spectral loss, which computes the 1-norm between the log power spectrograms of the waveform generated by the model and the target waveform, where the power spectrograms are obtained by the short-time Fourier transform (STFT). Log power spectrograms are interesting not only because they are related to human perception [6], but more importantly because the entire loss is invariant to the original phase of the signal, which can be arbitrary without audible differences.

• Initialization with a pre-trained autoencoder: a purely convolutional autoencoder architecture on raw waveforms is first trained with the spectral loss. The LSTM is then initialized to reproduce the embeddings given by the encoder, using mean squared error. After initialization, the LSTM and the decoder are fine-tuned together, backpropagating through the spectral loss.

We evaluate our synthesizer on a new task of pitch completion: generating notes for pitches not seen during training. We perform perceptual experiments with human evaluators to aggregate a Mean Opinion Score (MOS) that characterizes the naturalness and appeal of the generated sounds. We also perform ABX tests to measure how close the generated samples are to the ground truth, i.e., the synthesizer's ability to effectively produce a new pitch for a given instrument (see Section 5.3.2). We use a state-of-the-art autoencoder of musical notes based on WaveNet [4] as a baseline neural audio synthesis system. 
Our synthesizer achieves higher perceptual quality than the WaveNet-based autoencoder in terms of MOS and similarity to the ground truth, while being about 32 times faster during training and 2,500 times faster for generation.

2 Related Work

A large body of work in machine learning for audio synthesis focuses on generating parameters for vocoders in speech processing [21, 23, 19] or musical instrument synthesizers in automatic music composition [8, 3, 10]. Our goal is to learn the synthesizers for musical instruments, so we focus here on methods that generate sound without calling such synthesizers.

A first type of approach models power spectrograms given by the STFT [4, 9, 25], and generates the waveform through a post-processing step that is not part of the training, using a phase reconstruction algorithm such as the Griffin-Lim algorithm [7]. The advantage is to focus on a distance between high-level representations that is more relevant perceptually than a regression on the waveform. However, using Griffin-Lim means that the training is not end to end. Indeed, the predicted spectrograms may not come from a real signal. In that case, Griffin-Lim performs an orthogonal projection onto the set of valid spectrograms that is not accounted for during training. Notice that our approach with the spectral loss is different: our models directly predict waveforms rather than spectrograms, and the spectral loss computes log power spectrograms of these predicted waveforms.

The current state of the art in neural audio synthesis is to generate the waveform directly [24, 17, 19]. Individual audio samples are modeled with a categorical distribution trained with a multiclass cross-entropy loss. Quantization of the 16-bit audio is performed (either linear [17] or with a µ-law companding [24]) to map to a few hundred bins to improve scalability. 
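As a concrete illustration of this companding step, here is a minimal NumPy sketch of µ-law encoding and decoding; the choice of µ = 255 and 256 bins is illustrative (the cited works use their own settings), not the exact code of [24] or [17]:

```python
import numpy as np

def mu_law_encode(x, mu=255, bins=256):
    """Map a waveform in [-1, 1] to integer bins via mu-law companding."""
    # Non-linear compression: small amplitudes get finer resolution.
    y = np.sign(x) * np.log1p(mu * np.abs(x)) / np.log1p(mu)  # still in [-1, 1]
    # Linear quantization of the companded signal into `bins` classes.
    return np.clip(np.round((y + 1) / 2 * (bins - 1)).astype(int), 0, bins - 1)

def mu_law_decode(q, mu=255, bins=256):
    """Approximate inverse of mu_law_encode."""
    y = 2 * q.astype(float) / (bins - 1) - 1
    return np.sign(y) * ((1 + mu) ** np.abs(y) - 1) / mu

x = np.sin(2 * np.pi * 440 * np.arange(16000) / 16000)  # one second at 16 kHz
q = mu_law_encode(x)        # integer classes, usable as classification targets
x_rec = mu_law_decode(q)    # reconstruction with small quantization error
```

The round trip is lossy (that is the resolution limit discussed in Section 3), but the companding keeps the error small even for low-amplitude samples.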
The generation is still extremely costly; distillation [11] to a faster model has been proposed to reduce inference time at the expense of an even larger training time [18]. The recent proposal of [14] partly solves the issue with a small loss in accuracy, but it requires heavy low-level code optimization. In contrast, our approach trains and generates waveforms comparably fast with a PyTorch2 implementation. Our approach is different since we model the waveform as a continuous signal, use the spectral loss between generated and target waveforms, and model audio frames of 1024 samples, rather than performing classification on individual samples. The spectral loss we introduce is also different from the power loss regularization of [18], even though both are based on the STFT of the generated and target waveforms. In [18], the primary loss is the classification of individual samples, and their power loss is used to equalize the average amplitude of frequencies over time. Thus the power loss cannot be used alone to learn to reconstruct the waveform.

Works on neural audio synthesis conditioned on symbolic inputs were developed mostly for text-to-speech synthesis [24, 17, 25]. Experiments on generation of musical tracks based on desired properties were described in [24], but no systematic evaluation has been published. 

1 Quoting [4]: "MIDI velocity is similar to volume control and they have a direct relationship. For physical intuition, higher velocity corresponds to pressing a piano key harder."
The model of [4], which we use as baseline in our experiments on perceptual quality, is an autoencoder of musical notes based on WaveNet [24] that compresses the signal to generate high-level representations that transfer to music classification tasks. However, contrary to our synthesizer, it cannot be used to generate waveforms from desired properties of the instrument, pitch and velocity without some input signal.

The minimization by gradient descent of an objective function based on the power spectrogram has already been applied to the transformation of a white noise waveform into a specific sound texture [2]. However, to the best of our knowledge, such objective functions have not been used in the context of neural audio synthesis.

3 The spectral loss for waveform synthesis

Previous work in audio synthesis on the waveform focused on classification losses [17, 24, 4]. However, their computational cost needs to be mitigated by quantization, which inherently limits the resolution of the predictions, and ultimately increasing the number of classes is likely necessary to achieve the optimal accuracy. Our approach directly predicts a single continuous value for each audio sample, and computes distances between waveforms in the domain of power spectra to be invariant to the original phase of the signal. As a baseline, we also consider computing distances between waveforms using plain mean square error (MSE).

3.1 Mean square regression on the waveform

The simplest way of measuring the distance between a reconstructed signal x̂ and the reference x is to compute the MSE on the waveform directly, that is, taking the squared Euclidean norm between x and x̂:

    L_wav(x, x̂) := ‖x − x̂‖² .    (3.1)

The MSE is most likely not suited as a perceptual distance between waveforms because it is extremely sensitive to a small shift in the signal. 
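This sensitivity is easy to check numerically (an illustrative sketch, not from the paper's code): shifting a 440 Hz tone by less than a millisecond already makes the waveform MSE larger than the signal's own power.

```python
import numpy as np

sr = 16000
t = np.arange(sr) / sr
x = np.sin(2 * np.pi * 440 * t)                     # reference signal
x_shift = np.sin(2 * np.pi * 440 * (t - 10 / sr))   # same tone, 10 samples late

mse_identical = np.mean((x - x) ** 2)    # 0: identical waveforms
mse_shifted = np.mean((x - x_shift) ** 2)
signal_power = np.mean(x ** 2)           # 0.5 for a unit-amplitude sine
```

Here `mse_shifted` exceeds `signal_power`, even though the two waveforms are perceptually the same tone.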
Yet, we observed that it was sufficient to learn an autoencoder and use it as a baseline.

3.2 Spectral loss

As an alternative to the MSE on the waveform, we suggest taking the short-time Fourier transform (STFT) of both x and x̂ and comparing their absolute values in log scale. We first compute the log spectrogram

    l(x) := log(ε + |STFT[x]|²) .    (3.2)

The STFT decomposes the original signal x in successive frames of 1024 time steps with a stride of 256, so that a frame overlaps at 75% with the next one. The output for a single frame is given by 513 complex numbers, each representing a specific frequency range. Taking the point-wise absolute values of those numbers represents how much energy is present in a specific frequency range. We observed that our models generated higher quality sounds when trained using a log scale of those coefficients. Previous work has come to the same conclusion [4]. We observed that many entries of the spectrograms are close to zero and that small errors on those parts can add up to form noisy artifacts. In order to favor sparsity in the spectrogram, we use the ‖·‖₁ norm instead of the MSE:

    L_stft,1(x, x̂) := ‖l(x) − l(x̂)‖₁ .    (3.3)

The value of ε controls the trade-off between accurately representing low energy and high energy coefficients in the spectrogram. We found that ε = 1 gave the best subjective reconstruction quality.

The STFT is a (complex) convolution operator on the waveform, and the squared absolute value of the Fourier coefficients makes the power spectrum differentiable with respect to the generated waveform. 

2 https://pytorch.org/
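The loss (3.3) is straightforward to express directly. The sketch below uses a plain rectangular-window STFT in NumPy with the frame size (1024), stride (256) and ε = 1 given above; the windowing and the PyTorch implementation used in the paper may differ.

```python
import numpy as np

def stft(x, frame=1024, stride=256):
    """Stack overlapping frames and take the real FFT: one row per frame,
    513 complex coefficients per row for a 1024-sample frame."""
    n_frames = 1 + (len(x) - frame) // stride
    frames = np.stack([x[i * stride:i * stride + frame] for i in range(n_frames)])
    return np.fft.rfft(frames, axis=1)

def log_spectrogram(x, eps=1.0):
    # l(x) = log(eps + |STFT[x]|^2), eq. (3.2)
    return np.log(eps + np.abs(stft(x)) ** 2)

def spectral_loss(x, x_hat):
    # L_stft,1(x, x_hat) = ||l(x) - l(x_hat)||_1, eq. (3.3)
    return np.abs(log_spectrogram(x) - log_spectrogram(x_hat)).sum()

t = np.arange(16000)
x = np.sin(2 * np.pi * 625 * t / 16000)        # 625 Hz: 40 exact cycles per frame
x_phase = np.sin(2 * np.pi * 625 * t / 16000 + 1.3)  # same tone, arbitrary phase
x_other = np.sin(2 * np.pi * 880 * t / 16000)  # a different pitch
```

On these signals the loss is zero for identical inputs, essentially zero for a pure phase shift (the invariance discussed below; 625 Hz is chosen so each frame holds an integer number of cycles), and large for a different pitch.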
Since the generated waveform is itself a differentiable function of the parameters (up to the non-differentiability points of activation functions such as ReLU), the spectral loss (3.3) can be minimized by standard backpropagation. Even though we only consider this spectral loss in our experiments, alternatives to the STFT such as the wavelet transform also define differentiable losses for suitable wavelets.

3.2.1 Non-unicity of the waveform representation

To illustrate the importance of the spectral loss instead of a waveform loss, let us now consider a problem that arises when generating notes in the test set. Let us assume one of the instruments is a pure sinusoid. For a given pitch at a frequency f, the audio signal is x_i = sin(2πfi/16000 + φ). Our perception of the signal is not affected by the choice of φ ∈ [0, 2π[, and the power spectrogram of x is also unaltered. When recording an acoustic instrument, the value of φ depends on any number of variables characterizing the physical system that generated the sound, and there is no guarantee that φ stays constant when playing the same note again. For a synthetic sound, φ also depends on implementation details of the software generating the sound.

For a sound that is not in the training set, and as far as the model is concerned, φ is a random variable that can take any value in the range [0, 2π[. As a result, x_0 is unpredictable in the range [−1, 1], and the mean square error between the generated signal and the ground truth is uninformative. Even on the training dataset, the model has to use extra resources to remember the value of φ for each pitch. We believe that this phenomenon is the reason why training the synthesizer using the MSE on the waveform leads to worse reconstruction performance, even though this loss is sufficient in the context of auto-encoding (see Section 5.2). 
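This phase ambiguity can be verified numerically (an illustrative sketch; the 625 Hz frequency is an assumption chosen so that one analysis window holds an integer number of cycles, making the comparison exact):

```python
import numpy as np

n = 1024                       # one STFT frame
k = 40                         # 40 cycles per frame -> 625 Hz at 16 kHz
t = np.arange(n)
phases = [0.0, 1.0, 2.5, 5.0]  # arbitrary values of phi in [0, 2*pi[
signals = [np.sin(2 * np.pi * k * t / n + phi) for phi in phases]

# The power spectrum is identical for every phi ...
specs = [np.abs(np.fft.rfft(s)) ** 2 for s in signals]
max_spec_gap = max(np.max(np.abs(sp - specs[0])) for sp in specs)

# ... while the waveform MSE between two phases is of the order of the
# signal power itself, so it carries no useful training signal.
wav_mse = np.mean((signals[0] - signals[1]) ** 2)
```

The power spectra agree to floating-point precision while the waveform MSE stays large, exactly the situation described above.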
The spectral loss solves this issue since the model is free to choose a single canonical value for φ.

However, one should note that the spectral loss is permissive, in the sense that it does not penalize inconsistencies of the complex phase across the different frames of the STFT, which leads to potential artifacts. In practice, we obtain state-of-the-art results (see Section 5), and we conjecture that, thanks to the frame overlap in the STFT, the solution that minimizes the spectral loss will often be phase consistent, which is why Griffin-Lim works reasonably well despite sharing the same limitation.

4 Model

In this section we introduce the SING architecture. It is composed of two parts: an LSTM-based sequence generator whose output is fed to a decoder that transforms it into a waveform. The model is trained to recover a waveform x sampled at 16,000 Hz from the training set based on the one-hot encoded instrument I, pitch P and velocity V. The whole architecture is summarized in Figure 1.

4.1 LSTM sequence generator

The sequence generator is composed of a 3-layer recurrent neural network with LSTM cells and 1024 hidden units each. Given an example with velocity V, instrument I and pitch P, we obtain 3 embeddings (u_V, v_I, w_P) ∈ R² × R¹⁶ × R⁸ from look-up tables that are trained along with the model. Furthermore, the model is provided at each time step with an extra embedding z_T ∈ R⁴, where T is the current time step [22, 5], also obtained from a look-up table that is trained jointly. The input of the LSTM is the concatenation of those four vectors (u_V, v_I, w_P, z_T). Although we first experimented with an autoregressive model where the previous output was concatenated with those embeddings, we achieved much better performance and faster training by feeding the LSTM with only the 4 vectors (u_V, v_I, w_P, z_T) at each time step. 
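The input construction can be sketched as follows; this is a minimal NumPy mock-up with randomly initialized look-up tables standing in for the embeddings that are trained jointly with the model (the table sizes follow the dataset description in Section 5.1):

```python
import numpy as np

rng = np.random.default_rng(0)
# Look-up tables: 5 velocities -> R^2, 1006 instruments -> R^16,
# 121 pitches -> R^8, 265 time steps -> R^4 (all trained in practice).
velocity_table = rng.standard_normal((5, 2))
instrument_table = rng.standard_normal((1006, 16))
pitch_table = rng.standard_normal((121, 8))
time_table = rng.standard_normal((265, 4))

def lstm_inputs(V, I, P, N=265):
    """Concatenate (u_V, v_I, w_P, z_T) for every time step T."""
    u, v, w = velocity_table[V], instrument_table[I], pitch_table[P]
    return np.stack([np.concatenate([u, v, w, time_table[T]]) for T in range(N)])

seq = lstm_inputs(V=2, I=42, P=60)  # shape (265, 30): one 30-dim vector per step
```

Only the last 4 coordinates (the time embedding) change across the 265 steps; the conditioning part of the input is constant for a given note.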
Given those inputs, the recurrent network generates a sequence s(V, I, P)_T ∈ R^D for all 1 ≤ T ≤ N, with a linear layer on top of the last hidden state. In our experiments, we have D = 128 and N = 265.

[Figure 1 shows the full pipeline: at each time step T the LSTM receives (u_V, v_I, w_P, z_T) and updates its hidden state h_T; its outputs s_1(V, I, P), . . . , s_265(V, I, P) go through a convolution (K = 9, S = 1, C = 4096, ReLU), two convolutions (K = 1, S = 1, C = 4096, ReLU) and a transposed convolution (K = 1024, S = 256, C = 1) producing the output waveform, on which the STFT and log-power are computed for the spectral loss.]

Figure 1: Summary of the entire architecture of SING. u_V, v_I, w_P, z_∗ represent the look-up tables respectively for the velocity, instrument, pitch and time. h_∗ represent the hidden state of the LSTM and s_∗ its output. For convolutional layers, K represents the kernel size, S the stride and C the number of channels.

4.2 Convolutional decoder

The sequence s(V, I, P) is decoded into a waveform by a convolutional network. The first layer is a convolution with a kernel size of 9 and a stride of 1 over the sequence s, with 4096 channels, followed by a ReLU. The second and third layers are both convolutions with a kernel size of 1 (a.k.a. 1×1 convolution [4]), also followed by a ReLU. The number of channels is kept at 4096. Finally, the last layer is a transposed convolution with a stride of 256 and a kernel size of 1024 that directly outputs the final waveform, each position corresponding to an audio frame of size 1024. In order to reduce artifacts generated by the high stride value, we smooth the deconvolution filters by multiplying them with a squared Hann window. As the stride is one fourth of the kernel size, the squared Hann window has the property that the sum of its values for a given output position is always equal to 1 [7]. 
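The constant-overlap-add property of the squared Hann window can be checked numerically. The sketch below uses an explicit periodic Hann window at 75% overlap; note that for this unnormalized window the per-sample sum is flat but equal to 1.5 (the value 1 stated above presumably assumes a normalization convention):

```python
import numpy as np

K, S = 1024, 256                                  # kernel size and stride (K = 4*S)
n = np.arange(K)
hann = 0.5 * (1.0 - np.cos(2 * np.pi * n / K))    # periodic Hann window
w2 = hann ** 2                                    # squared Hann, used to smooth the filters

# Overlap-add the squared window at every stride position.
n_frames = 64
total = np.zeros(S * (n_frames - 1) + K)
for i in range(n_frames):
    total[i * S:i * S + K] += w2

# Away from the edges, every sample is covered by exactly 4 windows
# and the summed weight is constant.
interior = total[K:-K]
```

A flat sum means the transposed convolution introduces no periodic amplitude modulation, whatever the stride-aligned frame contents are.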
Thus the\n\ufb01nal deconvolution can also be seen as an overlap-add method. We pad the examples so that the \ufb01nal\ngenerated audio signal has the right length. Given our parameters, we need s(V, I, P ) to be of length\nN = 265 to recover a 4 seconds signal d(s(V, I, P )) \u2208 R64,000.\n\n4.3 Training details\n\nAll the models are trained on 4 P100 GPUs using Adam [15] with a learning rate of 0.0003 and a\nbatch size of 256.\n\nInitialization with an autoencoder. We introduce an encoder turning a waveform x into a sequence\ne(x) \u2208 RN\u00d7D. This encoder is almost the mirror of the decoder. It starts with a convolution layer\nwith a kernel size of 1024, a stride of 256 and 4096 channels followed by a ReLU. Similarly to the\ndecoder, we smooth its \ufb01lters using a squared Hann window. Next are two 1x1 convolutions with\n4096 channels and ReLU as an activation function. A \ufb01nal 1x1 convolution with no non linearity\nturns those 4096 channels into the desired sequence with D channels. We \ufb01rst train the encoder\nand decoder together as an auto-encoder on a reconstruction task. We train the auto-encoder for 50\nepochs which takes about 12 hours on 4 GPUs.\n\n5\n\n\fLSTM training. Once the auto-encoder has converged, we use the encoder to generate a target\nsequence for the LSTM. We use the MSE between the output s(V, I, P ) of the LSTM and the output\ne(x) of the encoder, only optimizing the LSTM while keeping the encoder constant. The LSTM is\ntrained for 50 epochs using truncated backpropagation through time [26] using a sequence length\nof 32. This takes about 10 hours on 4 GPUs.\n\nEnd-to-end \ufb01ne tuning. We then plug the decoder on top of the LSTM and \ufb01ne tune them together\nin an end-to-end fashion, directly optimizing for the loss on the waveform, either using the MSE\non the waveform or computing the MSE on the log-amplitude spectrograms and back propagating\nthrough the STFT. 
At that point, we stop using truncated backpropagation through time and directly compute the gradient on the entire sequence. We do so for 20 epochs, which takes about 8 hours on 4 GPUs. From start to finish, SING takes about 30 hours on 4 GPUs to train.

Although we could have initialized our LSTM and decoder randomly and trained end-to-end, we did not achieve convergence until we implemented our initialization strategy.

5 Experiments

The source code for SING and a pretrained model are available on our github3. Audio samples are available on the article webpage4.

5.1 NSynth dataset

The train set from the NSynth dataset [4] is composed of 289,205 audio recordings of instruments, some synthetic and some acoustic. Each recording is 4 seconds long at 16,000 Hz and is represented by a vector x_{V,I,P} ∈ [−1, 1]^64,000, indexed by V ∈ {0, . . . , 4} representing the velocity of the note, I ∈ {0, . . . , 1005} representing the instrument, and P ∈ {0, . . . , 120} representing the pitch. The range of pitches available can vary depending on the instrument, but for any combination of V, I, P, there is at most a single recording.

We did not make use of the validation or test set from the original NSynth dataset because their instruments have no overlap with the training set. Because we use a look-up table for the instrument embedding, we cannot generate audio for unseen instruments. Instead, we randomly selected 10% of the pitches for each instrument and moved them to a separate test set. Because the held-out pitches are different for each instrument, our model trains on all pitches but not on all combinations of a pitch and an instrument. We can then evaluate the ability of our model to generalize to unseen combinations of instrument and pitch. In the rest of the paper, we refer to this new split of the original train set as the train and test set.

5.2 Generalization through pitch completion

We report our results in Table 1. 
We provide both the performance of the complete model and that of the autoencoder used for the initial training of SING. This autoencoder serves as a reference for the maximum quality the model could achieve if the LSTM were to reconstruct the sequence e(x) perfectly.

Although using the MSE on the waveform works well as far as the autoencoder is concerned, this loss is hard to optimize for the LSTM. Indeed, the autoencoder has access to the signal it must reconstruct, so that it can easily choose which representation of the signal to output, as explained in Section 3.2.1. SING must recover that information solely from the embeddings given to it as input. It manages to learn some of it, but there is an important drop in quality. Besides, when switching to the test set, one can see that the MSE on the waveform increases significantly. As the model has never seen those examples, it has no way of picking the right representation. When using a spectral loss, SING is free to choose a canonical representation for the signal it has to reconstruct, and it does not have to remember the one that was in the training set. We observe that although we have a drop in quality between the train and test set, our model is still able to generalize to unseen combinations of pitch and instrument.

3 https://github.com/facebookresearch/SING
4 https://research.fb.com/publications/sing-symbol-to-instrument-neural-generator

Model                    | Training loss | Spectral loss (train / test) | Wav MSE (train / test)
Autoencoder              | waveform      | 0.026 / 0.028                | 0.0002 / 0.0003
SING                     | waveform      | 0.075 / 0.084                | 0.006 / 0.039
Autoencoder              | spectral      | 0.028 / 0.032                | N/A
SING                     | spectral      | 0.039 / 0.051                | N/A
SING (no time embedding) | spectral      | 0.050 / 0.063                | N/A

Table 1: Results on the train and test sets of the pitch completion task for different models. The first column specifies the model, either the autoencoder used for the initial training of the LSTM or the complete SING model with the LSTM and the convolutional decoder. We compare models either trained with a loss on the waveform (see (3.1)) or on the spectrograms (see (3.3)). Finally, we also trained a model with no temporal embedding.

Figure 2: Example of rainbowgrams from the NSynth dataset and the reconstructions by different models. Rainbowgrams are defined in [4] as "a CQT spectrogram with intensity of lines proportional to the log magnitude of the power spectrum and color given by the derivative of the phase". Time is represented on the horizontal axis while frequencies are on the vertical one. From left to right: ground truth, Wavenet-based autoencoder, SING with spectral loss, SING with waveform loss and SING without the time embedding.

Finally, we tried training a model without the time embedding z_T. Theoretically, the LSTM could do without it by learning to count the number of time steps since the beginning of the sequence. However, we do observe a significant drop in performance when removing this embedding, thus motivating our choice.

In Figure 2, we represent the rainbowgrams for a particular example from the test set, as well as its reconstructions by the Wavenet-autoencoder, SING trained with the spectral and waveform losses, and SING without the time embedding. A different derivative of the phase will lead to audible deformations of the target signal. 
Such modifications are not penalized by our spectral loss, as explained in Section 3.2.1. Nevertheless, we observe a mostly correct reconstruction of the derivative of the phase using SING. More examples from the test set, including the rainbowgrams and audio files, are available on the article webpage5.

5.3 Human evaluations

During training, we use several automatic criteria to evaluate and select our models. These criteria include the MSE on spectrograms, magnitude spectra, or waveform, and other perceptually-motivated metrics such as the Itakura-Saito divergence [13]. However, the correlation of these metrics with human perception remains imperfect, which is why we use human judgments as a metric of comparison between SING and the Wavenet baseline from [4].

5 https://research.fb.com/publications/sing-symbol-to-instrument-neural-generator

Model        | MOS         | Training time (hrs * GPU) | Generation speed | Compression factor | Model size
Ground Truth | 3.86 ± 0.24 | -     | -           | -    | -
Wavenet      | 2.85 ± 0.24 | 3840* | 0.2 sec/sec | 32   | 948 MB
SING         | 3.55 ± 0.23 | 120   | 512 sec/sec | 2133 | 243 MB

Table 2: Mean Opinion Score (MOS) and computational load of the different models. The training time is expressed in hours * GPU units, the generation time is expressed as the number of seconds of audio that can be generated per second of processing time. The compression factor represents the ratio between the dimensionality of the audio sequences (64,000 values) and either the latent state of Wavenet or the input vectors to SING. We also report the size of the models, in MB. (*) Time corrected to account for the difference in FLOPs of the GPUs used.

5.3.1 Evaluation of perceptual quality: Mean Opinion Score

The first characteristic that we want to measure from our generated samples is their naturalness: how good they sound to the human ear. 
To do so, we perform experiments on Amazon Mechanical Turk [1] to get a Mean Opinion Score for the ground truth samples and for the waveforms generated by SING and the Wavenet baseline. We did not include a Griffin-Lim-based baseline, as the authors of [4] concluded that their Wavenet autoencoder was superior.

We randomly select 100 examples from our test set. For the Wavenet-autoencoder, we pass these 100 examples through the network and retrieve the output; the network is a pre-trained model provided by the authors of [4]6. Notice that all of the 100 samples were used for training of the Wavenet-autoencoder, while they were not seen during the training of our models. For SING, we feed it the instrument, pitch and velocity information of each of the 100 samples. Workers are asked to rate the quality of the samples on a scale from 1 ("Very annoying and objectionable distortion. Totally silent audio") to 5 ("Imperceptible distortion"). Each of the 300 samples (100 samples per model) is evaluated by 60 Workers. The quality of the hardware used by Workers being variable, this could impede the interpretability of the results. Thus, we use the crowdMOS toolkit [20], which detects and discards inaccurate scores. This toolkit also allows keeping only the evaluations made with headphones (rather than laptop speakers, for example), and we choose to do so as good listening conditions are necessary to ensure the validity of our measures. We report the Mean Opinion Score for the ground-truth audio and each of the 2 models in Table 2, along with the 95% confidence interval.

We observe that SING shows a significantly better MOS than the Wavenet-autoencoder baseline, despite a compression factor which is 66 times higher. Moreover, to spotlight the benefits of our approach compared to the Wavenet baseline, we also report three metrics to quantify the computational load of the different models. 
The first metric is the training time, expressed in hours multiplied by the number of GPUs. The authors of [4] mention that their model trains for 10 days on 32 GPUs, which amounts to 7680 hours*GPUs. However, the GPUs used are capable of about half the FLOPs compared to our P100. Therefore, we corrected this value to 3840 hours*GPUs. On the other hand, SING is trained in 30 hours on four P100, which is 32 times faster than Wavenet. A major drawback of autoregressive models such as Wavenet is that the generation process is inherently sequential: generating the sample at time t + 1 takes as input the sample at time t. We timed the generation using the implementation of the Wavenet-autoencoder provided by the authors, in its fastgen version7, which is significantly faster than the original model. This yields about 22 minutes to generate a 4-second sample. On a single P100 GPU, Wavenet can generate up to 64 sequences at the same time before reaching the memory limit, which amounts to 0.2 seconds of audio generated per second. On the other hand, SING can generate 512 seconds of audio per second of processing time, and is thus 2,500 times faster than Wavenet. Finally, SING is also memory-efficient compared to Wavenet, as its model size in MB is more than 4 times smaller than the baseline's.

6 https://github.com/tensorflow/magenta/tree/master/magenta/models/nsynth
7 https://magenta.tensorflow.org/nsynth-fastgen

5.3.2 ABX similarity measure

Besides the absolute audio quality of the samples, we also want to ensure that when we condition SING on a chosen combination of instrument, pitch and velocity, we generate a relevant audio sample. To do so, we measure how close samples generated by SING are to the ground truth, relative to the Wavenet baseline. This measure is made by performing ABX [16] experiments: the Worker is given a ground-truth sample as a reference. 
Then, they are presented with the corresponding samples from SING and Wavenet, in random order to avoid bias and with the possibility of listening to the samples as many times as necessary. They are asked to pick the sample that is closest to the reference according to their judgment. We run this experiment on 100 ABX triplets built from the same data as for the MOS, each triplet being evaluated by 10 Workers. Over the 1000 ABX tests, 69.7% favor SING over Wavenet, indicating that our generated samples are closer to the target musical notes than those of Wavenet.

Conclusion

We introduced a simple model architecture, SING, based on LSTM and convolutional layers, to generate waveforms. We achieve state-of-the-art results as measured by human evaluation on the NSynth dataset, at a fraction of the training and generation cost of existing methods. We introduced a spectral loss on the generated waveform as a way of using time-frequency metrics without requiring a post-processing step to recover the phase of a power spectrogram. We experimentally validated that, when trained with this spectral loss, SING embeds musical notes into a low-dimensional vector space in which pitch, instrument and velocity are disentangled, and that it can synthesize pairs of instrument and pitch that were not present in the training set. We believe SING opens up new opportunities for lightweight, high-quality audio synthesis, with potential applications to speech synthesis and music generation.

Acknowledgments

The authors thank Adam Polyak for his help in setting up the human evaluations and Axel Roebel for the insightful discussions. We also thank the Magenta team for their inspiring work on NSynth.

References
[1] Michael Buhrmester, Tracy Kwang, and Samuel D Gosling. Amazon's mechanical turk: A new source of inexpensive, yet high-quality, data?
Perspectives on Psychological Science, 6(1):3–5, 2011.

[2] Hugo Caracalla and Axel Roebel. Gradient conversion between time and frequency domains using Wirtinger calculus. In DAFx 2017, 2017.

[3] Kemal Ebcioğlu. An expert system for harmonizing four-part chorales. Computer Music Journal, 12(3):43–51, 1988.

[4] Jesse Engel, Cinjon Resnick, Adam Roberts, Sander Dieleman, Douglas Eck, Karen Simonyan, and Mohammad Norouzi. Neural audio synthesis of musical notes with wavenet autoencoders. arXiv preprint arXiv:1704.01279, 2017.

[5] Jonas Gehring, Michael Auli, David Grangier, Denis Yarats, and Yann N Dauphin. Convolutional sequence to sequence learning. arXiv preprint arXiv:1705.03122, 2017.

[6] JL Goldstein. Auditory nonlinearity. The Journal of the Acoustical Society of America, 41(3):676–699, 1967.

[7] Daniel Griffin and Jae Lim. Signal estimation from modified short-time Fourier transform. IEEE Transactions on Acoustics, Speech, and Signal Processing, 32(2):236–243, 1984.

[8] Gaëtan Hadjeres, François Pachet, and Frank Nielsen. DeepBach: a steerable model for Bach chorales generation. arXiv preprint arXiv:1612.01010, 2016.

[9] Albert Haque, Michelle Guo, and Prateek Verma. Conditional end-to-end audio transforms. arXiv preprint arXiv:1804.00047, 2018.

[10] Dorien Herremans. Morpheus: automatic music generation with recurrent pattern constraints and tension profiles. 2016.

[11] Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531, 2015.

[12] Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.

[13] Fumitada Itakura. Analysis synthesis telephony based on the maximum likelihood method. In The 6th International Congress on Acoustics, pages 280–292, 1968.

[14] Nal Kalchbrenner, Erich Elsen, Karen Simonyan, Seb Noury, Norman Casagrande, Edward Lockhart, Florian Stimberg, Aaron van den Oord, Sander Dieleman, and Koray Kavukcuoglu. Efficient neural audio synthesis. arXiv preprint arXiv:1802.08435, 2018.

[15] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In International Conference on Learning Representations, 2015.

[16] Neil A Macmillan and C Douglas Creelman. Detection Theory: A User's Guide. Psychology Press, 2004.

[17] Soroush Mehri, Kundan Kumar, Ishaan Gulrajani, Rithesh Kumar, Shubham Jain, Jose Sotelo, Aaron Courville, and Yoshua Bengio. SampleRNN: An unconditional end-to-end neural audio generation model. arXiv preprint arXiv:1612.07837, 2016.

[18] Aaron van den Oord, Yazhe Li, Igor Babuschkin, Karen Simonyan, Oriol Vinyals, Koray Kavukcuoglu, George van den Driessche, Edward Lockhart, Luis C Cobo, Florian Stimberg, et al. Parallel WaveNet: Fast high-fidelity speech synthesis. arXiv preprint arXiv:1711.10433, 2017.

[19] Wei Ping, Kainan Peng, Andrew Gibiansky, S Arik, Ajay Kannan, Sharan Narang, Jonathan Raiman, and John Miller. Deep Voice 3: Scaling text-to-speech with convolutional sequence learning. In Proc. 6th International Conference on Learning Representations, 2018.

[20] Flávio Ribeiro, Dinei Florêncio, Cha Zhang, and Michael Seltzer. CrowdMOS: An approach for crowdsourcing mean opinion score studies. In Acoustics, Speech and Signal Processing (ICASSP), 2011 IEEE International Conference on, pages 2416–2419. IEEE, 2011.

[21] Jose Sotelo, Soroush Mehri, Kundan Kumar, Joao Felipe Santos, Kyle Kastner, Aaron Courville, and Yoshua Bengio. Char2Wav: End-to-end speech synthesis. 2017.

[22] Sainbayar Sukhbaatar, Jason Weston, Rob Fergus, et al. End-to-end memory networks. In Advances in Neural Information Processing Systems, pages 2440–2448, 2015.

[23] Yaniv Taigman, Lior Wolf, Adam Polyak, and Eliya Nachmani. Voice synthesis for in-the-wild speakers via a phonological loop. arXiv preprint arXiv:1707.06588, 2017.

[24] Aaron van den Oord, Sander Dieleman, Heiga Zen, Karen Simonyan, Oriol Vinyals, Alex Graves, Nal Kalchbrenner, Andrew Senior, and Koray Kavukcuoglu. WaveNet: A generative model for raw audio. arXiv preprint arXiv:1609.03499, 2016.

[25] Yuxuan Wang, RJ Skerry-Ryan, Daisy Stanton, Yonghui Wu, Ron J Weiss, Navdeep Jaitly, Zongheng Yang, Ying Xiao, Zhifeng Chen, Samy Bengio, et al. Tacotron: Towards end-to-end speech synthesis. arXiv preprint arXiv:1703.10135, 2017.

[26] Ronald J Williams and David Zipser. Gradient-based learning algorithms for recurrent networks and their computational complexity. Backpropagation: Theory, Architectures, and Applications, 1:433–486, 1995.