{"title": "Dancing to Music", "book": "Advances in Neural Information Processing Systems", "page_first": 3586, "page_last": 3596, "abstract": "Dancing to music is an instinctive move by humans. Learning to model the music-to-dance generation process is, however, a challenging problem. It requires significant efforts to measure the correlation between music and dance as one needs to simultaneously consider multiple aspects, such as style and beat of both music and dance. Additionally, dance is inherently multimodal and various following movements of a pose at any moment are equally likely. In this paper, we propose a synthesis-by-analysis learning framework to generate dance from music. In the top-down analysis phase, we decompose a dance into a series of basic dance units, through which the model learns how to move. In the bottom-up synthesis phase, the model learns how to compose a dance by combining multiple basic dancing movements seamlessly according to input music. Experimental qualitative and quantitative results demonstrate that the proposed method can synthesize realistic, diverse, style-consistent, and beat-matching dances from music.", "full_text": "Dancing to Music\n\nHsin-Ying Lee1 Xiaodong Yang2 Ming-Yu Liu2 Ting-Chun Wang2\n\nYu-Ding Lu1 Ming-Hsuan Yang1\n1University of California, Merced\n\nJan Kautz2\n2NVIDIA\n\nAbstract\n\nDancing to music is an instinctive move by humans. Learning to model the\nmusic-to-dance generation process is, however, a challenging problem. It requires\nsigni\ufb01cant efforts to measure the correlation between music and dance as one needs\nto simultaneously consider multiple aspects, such as style and beat of both music\nand dance. Additionally, dance is inherently multimodal and various following\nmovements of a pose at any moment are equally likely. In this paper, we propose\na synthesis-by-analysis learning framework to generate dance from music. In\nthe analysis phase, we decompose a dance into a series of basic dance units,\nthrough which the model learns how to move. In the synthesis phase, the model\nlearns how to compose a dance by organizing multiple basic dancing movements\nseamlessly according to the input music. Experimental qualitative and quantitative\nresults demonstrate that the proposed method can synthesize realistic, diverse,\nstyle-consistent, and beat-matching dances from music.\n\nIntroduction\n\n1\nDoes this sound familiar? Upon hearing certain genres of music, you cannot help but clap your\nhands, tap your feet, or swing you hip accordingly. Indeed, music inspires dances in daily life. Via\nspontaneous and elementary movements, people compose body movements into dances [24, 31].\nHowever, it is only through proper training and constant practice, professional choreographers learn to\ncompose the dance moves in a way that is both artistically elegant and rhythmic. Therefore, dance to\nmusic is a creative process that is both innate and acquired. In this paper, we propose a computational\nmodel for the music-to-dance creation process. Inspired by the above observations, we use prior\nknowledge to design the music-to-dance framework and train it with a large amount of paired music\nand dance data. This is a challenging but interesting generative task with the potential to assist and\nexpand content creations in arts and sports, such as theatrical performance, rhythmic gymnastics, and\n\ufb01gure skating. 
Furthermore, modeling how we human beings match our body movements to music can lead to a better understanding of cross-modal synthesis.\n\nExisting methods [13, 22, 26] convert the task into a similarity-based retrieval problem, which limits creativity. In contrast, we formulate the task from a generative perspective. Learning to synthesize dances from music is a highly challenging generative problem for several reasons. First, to synchronize dance and music, the generated dance movements, beyond being realistic, need to align well with the given musical style and beats. Second, dance is inherently multimodal, i.e., a dancing pose at any moment can be followed by various possible movements. Third, the long-term spatio-temporal structures of body movements in dancing result in high kinematic complexity.\n\nIn this paper, we propose to synthesize dance from music through a decomposition-to-composition framework. It first learns how to move (i.e., produce basic movements) in the decomposition/analysis phase, and then how to compose (i.e., organize basic movements into a sequence) in the composition/synthesis phase. In the top-down decomposition phase, analogous to audio beat tracking of music [11], we develop a kinematic beat detector to extract movement beats from a dancing sequence. We then leverage the extracted movement beats to temporally normalize each dancing sequence into a series of dance units. Each dance unit is further disentangled into an initial pose space and a movement space by the proposed dance unit VAE (DU-VAE). In the bottom-up composition phase, we propose a music-to-movement GAN (MM-GAN) to generate a sequence of movements conditioned on the input music. At run time, given an input music clip, we first extract the style and beat information, then sequentially generate a series of dance units based on the music style, and finally warp the dance units by the extracted audio beats, as illustrated in Figure 1.\n\nFigure 1: A schematic overview of the decomposition-to-composition framework. In the top-down decomposition phase (Section 3.1), we normalize the dance units that are segmented from a real dancing sequence using a kinematic beat detector. We then train the DU-VAE to model the dance units. In the bottom-up composition phase (Section 3.2), given a pair of music and dance, we leverage the MM-GAN to learn how to organize the dance units conditioned on the given music. In the testing phase (Section 3.3), we extract style and beats from the input music, then synthesize a sequence of dance units in a recurrent manner, and in the end apply the beat warper to the generated dance unit sequence to render the output dance.\n\nTo facilitate this cross-modal audio-to-visual generation task, we collect over 360K video clips totaling 71 hours. There are three representative dancing categories in the data: \u201cBallet\u201d, \u201cZumba\u201d, and \u201cHip-Hop\u201d. For performance evaluation, we compare with strong baselines using various metrics to analyze realism, diversity, style consistency, and beat matching. In addition to the raw pose representation, we also visualize our results with the vid2vid model [41] to translate the synthesized pose sequences into photo-realistic videos. See our supplementary material for more details.\n\nThe contributions of this work are summarized as follows.
First, we introduce a new cross-modality generative task from music to dance. Second, we propose a novel decomposition-to-composition framework to dismantle and assemble complex dances from basic movements conditioned on music. Third, our model renders realistic and diverse dances that match well with musical styles and beats. Finally, we provide a large-scale paired music and dance dataset, which is available along with the source code and models at our website.\n\n2 Related Work\n\nCross-Modality Generation. This task explores the association among different sensory modes and leads to a better understanding of human perception [17, 18, 21, 28, 30, 38, 44]. Generation between texts and images has been extensively studied, including image captioning [17, 38] and text-to-image synthesis [30, 44]. In contrast, audio data is much less structured, which makes it more difficult to model its correlation with visual data. Several approaches have been developed to map vision to audio by taking visual cues to provide sound effects for videos or to predict what sounds target objects can produce [8, 28, 46]. However, the generation problem from audio to vision is much less explored. Several methods focus on speech lip synchronization to predict movements of mouth landmarks from audio [18, 35]. Recent work employs LSTM-based autoencoders to learn the music-to-dance mapping [36], and uses LSTMs to animate instrument-playing avatars given an audio input of violin or piano [33].\n\nAudio and Vision. Recent years have seen growing interest in cross-modal learning between audio and vision. Although hearing and sight are two distinct sensory systems, the information perceived from the two modalities is highly correlated. The correspondence between audio and vision serves as a natural supervisory signal for self-supervised learning, which aims to learn feature representations by solving surrogate tasks defined from the structure of raw data [2, 4, 10, 20, 29]. Aside from representation learning, audio and visual information can be jointly used to localize sound sources in images [3, 15, 32], predict spatial audio from videos [23], and separate different audio-visual sources [12, 14, 27]. In addition, an audio-visual synchronization model is developed in [7] by aligning the visual rhythm with its musical counterpart to manipulate videos.\n\nHuman Motion Modeling. It is challenging to model human motion dynamics due to their stochastic nature and spatio-temporal complexity. A large family of existing work [6, 40, 42, 43] formulates motion dynamics as a sequence of 2D or 3D body keypoints, thanks to the success of human pose estimation [5]. Most of these approaches use recurrent neural networks to generate a motion sequence from a static image or a short video snippet. Some other methods consider this problem as a video generation task. Early work applies a mean square loss [34] or a perceptual loss [25] on raw image sequences for training. Recent methods disentangle motion and content [9, 37, 39] to alleviate the issues with holistic video generation. Another active research line is motion retargeting, which performs motion transfer between source and target subjects [1].\n\n3 Music-to-Dance Generation\n\nOur goal is to generate a sequence of dancing poses conditioned on the input music. As illustrated in Figure 1, the training process is realized by the decomposition-to-composition framework.
In the top-down decomposition phase, we aim to learn how to perform basic dancing movements. For this purpose, we define and extract dance units, and introduce the DU-VAE for encoding and decoding dance units. In the bottom-up composition phase, we aim to learn how to compose multiple basic movements into a dance that conveys high-level motion semantics according to different music; to this end, we propose the MM-GAN for music-conditioned dancing movement generation. Finally, in the testing phase, we use the components of DU-VAE and MM-GAN to recurrently synthesize a long-term dance in accordance with the given music.\n\n3.1 Learning How to Move\n\nIn music theory, beat tracking is usually derived from the onset [11], which can be defined as the start of a music note or, more formally, the beginning of an acoustic event. Current audio beat detection algorithms are mostly based on detecting onsets using a spectrogram S to capture the frequency domain information. We can measure the change in different frequencies by S_diff(t, k) = |S(t, k)| \u2212 |S(t \u2212 1, k)|, where t and k indicate the time step and quantized frequency, respectively. More details on music beat tracking can be found in [11]. Unlike music, the kinematic beat of human movement is not well defined. We usually perceive a sudden motion deceleration or offset as a kinematic beat. A similar observation is also noted recently in [7].\n\nWe develop a kinematic beat detector to detect when a movement drastically slows down. In practice, we compute the motion magnitude and angle of each keypoint between neighboring poses, and track the magnitude and angle trajectories to spot when a dramatic decrease in the motion magnitude or a substantial change in the motion angle happens. Analogous to the spectrogram S, we can construct a matrix D to capture the motion changes in different angles. For a pose p_t at frame t, the difference in a motion angle bin \u03b8 is summed over all joints:\n\nD(t, \u03b8) = \u2211_i |p_t^i \u2212 p_{t\u22121}^i| Q(p_t^i, p_{t\u22121}^i, \u03b8),   (1)\n\nwhere Q is an indicator function that quantizes the motion angles. Then, the changes in different motion angles can be computed by:\n\nD_diff(t, \u03b8) = |D(t, \u03b8)| \u2212 |D(t \u2212 1, \u03b8)|.   (2)\n\nThis measurement captures an abrupt magnitude decrease in the same direction, as well as a dramatic change of motion direction. Finally, the kinematic beats can be detected by thresholding D_diff.
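For concreteness, a minimal NumPy sketch of this detector is given below. It is an illustrative paraphrase of Eqs. (1)-(2) rather than the released implementation; the number of angle bins and the detection threshold are assumed values.

    import numpy as np

    def kinematic_beats(poses, n_bins=8, thresh=1.0):
        # poses: float array of shape (T, J, 2) holding J 2D joints over T frames
        vel = poses[1:] - poses[:-1]                    # per-joint motion vectors
        mag = np.linalg.norm(vel, axis=-1)              # motion magnitudes, (T-1, J)
        ang = np.arctan2(vel[..., 1], vel[..., 0])      # motion angles in [-pi, pi]
        bins = ((ang + np.pi) / (2 * np.pi) * n_bins).astype(int) % n_bins
        # Eq. (1): D(t, theta) accumulates the magnitudes of all joints whose
        # motion falls into angle bin theta at time t
        D = np.zeros((mag.shape[0], n_bins))
        for t in range(mag.shape[0]):
            np.add.at(D[t], bins[t], mag[t])
        # Eq. (2): frame-to-frame change per angle bin; a strong drop signals a
        # sudden deceleration or an abrupt change of motion direction
        D_diff = np.abs(D[1:]) - np.abs(D[:-1])
        strongest_drop = -D_diff.min(axis=1)
        return np.flatnonzero(strongest_drop > thresh) + 2  # approximate beat frames

In practice the threshold would be tuned per dancing style, and the musical beats against which the kinematic beats are compared can be obtained from a standard onset-based tracker such as [11].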
Figure 2: (a) Extraction of beats from music and dance. For music, periodic beats are extracted by the onset strength. For dance, we compute the offset strength and extract kinematic beats. We illustrate three example frames corresponding to the aligned music and kinematic beats: lateral arm raising (red), hand raising (yellow), and elbow pushing out (purple). (b) Examples of dance units. Every dance unit is of the same length and has its kinematic beats assigned to the specified beat times.\n\nHowever, in reality, people do not dance to every musical beat. Namely, each kinematic beat needs to align with a musical beat, yet it is unnecessary to fit every musical beat while dancing. Figure 2(a) shows the correspondence between the musical beats extracted by a standard audio beat tracking algorithm [11] and the kinematic beats extracted by our kinematic beat detector. Most of our detected kinematic beats match the musical beats accurately.\n\nLeveraging the extracted kinematic beats, we define the dance unit used in this work. As illustrated in Figure 2(b), a dance unit is a temporally standardized short snippet, consisting of a fixed number of poses, whose kinematic beats are normalized to several specified beat times with a constant beat interval. A dance unit captures basic motion patterns and serves as an atomic movement, which can be used to constitute a complete dancing sequence. Another benefit of introducing the dance unit is that, with the temporal normalization of beats, we can factor out the beat and simplify the generation to focus on musical style. In the testing phase, we incorporate the music beats to warp or stretch the synthesized sequence of dance units.\n\nAfter normalizing a dance into a series of dance units, the model learns how to perform basic movements. As shown in the decomposition phase of Figure 1, we propose to disentangle a dance unit into two latent spaces: an initial pose space Z_ini capturing the single initial pose, and a movement space Z_mov encoding the motion that is agnostic of the initial pose. This disentanglement is designed to facilitate long-term sequential generation, i.e., the last pose of the current dance unit can be used as the initial pose of the next one, so that we can continuously synthesize a full long-term dance. We adopt the proposed DU-VAE to perform the disentangling. It consists of an initial-pose encoder E_ini, a movement encoder E_mov, and a dance unit decoder G_uni. Given a dance unit u \u2208 U, we exploit E_ini and E_mov to encode it into the two latent codes z_ini \u2208 Z_ini and z_mov \u2208 Z_mov: {z_ini, z_mov} = {E_ini(u), E_mov(u)}. As G_uni should be able to map the two latent codes back to a reconstruction \u02c6u, we enforce a reconstruction loss on u and a KL loss on the initial pose space and the movement space to enable reconstruction after encoding and decoding:\n\nL_recon^u = E[\u2016G_uni(z_ini, z_mov) \u2212 u\u2016_1],   L_KL^u = E[KL(Z_ini \u2016 N(0, I))] + E[KL(Z_mov \u2016 N(0, I))],   (3)\n\nwhere KL(p \u2016 q) = \u222b p(z) log (p(z)/q(z)) dz. We apply the KL loss on Z_ini to allow random sampling of the initial pose at test time, and the KL loss on Z_mov to stabilize the composition training in the next section. To encourage E_mov to disregard the initial pose and focus on the movement only, we design a shift-reconstruction loss:\n\nL_recon^shift = E[\u2016G_uni(z_ini, E_mov(u')) \u2212 u\u2016_1],   (4)\n\nwhere u' is a spatially shifted u. Overall, we jointly train the two encoders E_ini, E_mov and the decoder G_uni of DU-VAE to optimize the total objective in the decomposition:\n\nL_decomp = L_recon^u + \u03bb_KL^u L_KL^u + \u03bb_recon^shift L_recon^shift,   (5)\n\nwhere \u03bb_KL^u and \u03bb_recon^shift are the weights that control the importance of the KL and shift-reconstruction terms.
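For illustration, the decomposition objective in Eqs. (3)-(5) can be sketched in PyTorch as follows. The GRU encoder stub, the Gaussian reparameterization, and the random spatial shift are simplifying assumptions rather than the exact released architecture, and G_uni stands for any decoder that maps the two latent codes back to a dance unit of shape (B, T, J, 2).

    import torch
    import torch.nn as nn

    class GaussianEncoder(nn.Module):
        # stub encoder: GRU over a dance unit -> posterior mean and log-variance
        def __init__(self, in_dim, z_dim, hidden=64):
            super().__init__()
            self.gru = nn.GRU(in_dim, hidden, batch_first=True)
            self.mu = nn.Linear(hidden, z_dim)
            self.logvar = nn.Linear(hidden, z_dim)

        def forward(self, u):                    # u: (B, T, J, 2)
            h = self.gru(u.flatten(2))[1][-1]    # last GRU hidden state, (B, hidden)
            return self.mu(h), self.logvar(h)

    def kl_to_standard_normal(mu, logvar):
        # closed-form KL(N(mu, sigma^2) || N(0, I))
        return -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())

    def du_vae_loss(E_ini, E_mov, G_uni, u, lam_kl=0.01, lam_shift=1.0):
        mu_i, lv_i = E_ini(u)
        mu_m, lv_m = E_mov(u)
        z_ini = mu_i + torch.randn_like(mu_i) * (0.5 * lv_i).exp()  # reparameterize
        z_mov = mu_m + torch.randn_like(mu_m) * (0.5 * lv_m).exp()
        recon = (G_uni(z_ini, z_mov) - u).abs().mean()              # L1 term of Eq. (3)
        kl = kl_to_standard_normal(mu_i, lv_i) + kl_to_standard_normal(mu_m, lv_m)
        # Eq. (4): the movement code of a spatially shifted copy should still
        # reconstruct the unshifted unit, so E_mov must ignore absolute position
        u_shift = u + torch.randn(u.size(0), 1, 1, 2, device=u.device)
        z_mov_shift, _ = E_mov(u_shift)          # use the posterior mean directly
        shift = (G_uni(z_ini, z_mov_shift) - u).abs().mean()
        return recon + lam_kl * kl + lam_shift * shift              # Eq. (5)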
3.2 Learning How to Compose\n\nSince a dance consists of a sequence of movement units in a particular arrangement, different combinations can represent different expressive semantics. Based on the movement space Z_mov disentangled in the aforementioned decomposition, the composition model learns how to meaningfully compose a sequence of basic movements into a dance conditioned on the input music.\n\nAs demonstrated in the composition phase of Figure 1, the proposed MM-GAN is utilized to bridge the semantic gap between low-level movements and high-level music semantics. Given a dance, we first normalize it into a sequence of n dance units {u^i}_{i=1}^n, and then encode them into the latent movement codes {z_mov^i}_{i=1}^n, as described in the decomposition phase. In this context, {\u00b7} denotes a temporally ordered sequence; for notational simplicity, we omit the sequence length n in the following. We encode {z_mov^i} into a dancing space Z_dan with a movement-to-dance encoder E_mtd: {z_mov^i} \u2192 z_dan, and reconstruct z_dan back to {\u02c6z_mov^i} with a recurrent dance decoder G_dan. For the corresponding music, we employ a music style extractor to extract the style feature s from the audio feature a. Since there exists no robust style feature extractor that fits our particular needs, we train a music style classifier on the collected music for this task. We encode s along with a noise vector \u03f5 into a latent dance code \u02dcz_dan \u2208 Z_dan using a style-to-dance encoder E_std: (s, \u03f5) \u2192 \u02dcz_dan, and then make use of G_dan to decode \u02dcz_dan into a latent movement sequence {\u02dcz_mov^i}.\n\nIt is of great importance to ensure the alignment between the movement distributions, and between the dance distributions, that are respectively produced from real dance and from the corresponding music. To this end, we use adversarial training to match the distributions between {\u02c6z_mov^i}, encoded and reconstructed from the real dance units, and {\u02dcz_mov^i}, generated from the associated music. As the audio feature a contains low-level musical properties, we condition the discriminator decision on a to further encourage the correspondence between music and dance:\n\nL_adv^m = E[log D_mov({\u02c6z_mov^i}, a) + log (1 \u2212 D_mov({\u02dcz_mov^i}, a))],   (6)\n\nwhere D_mov is the discriminator that tries to distinguish between the movement sequences generated from real dance and those generated from music. Compared to the distribution of raw data, such as poses, the distribution of latent code sequences, i.e., {z_mov^i} in our case, is more difficult to model. We thus adopt an auxiliary reconstruction task on the latent movement sequences to facilitate training:\n\nL_recon^m = E[\u2016{\u02c6z_mov^i} \u2212 {z_mov^i}\u2016_1].   (7)\n\nFor the alignment between the latent dance codes, we apply a discriminator D_dan to differentiate the dance codes encoded from real dance and from music, and enforce a KL loss on the latent dance space:\n\nL_adv^d = E[log D_dan(z_dan) + log (1 \u2212 D_dan(\u02dcz_dan))],   L_KL^d = E[KL(Z_dan \u2016 N(0, I))].   (8)\n\nAs the style feature s embodies high-level musical semantics that should be reflected in the dance code z_dan, we further use a style regressor E_sty on the latent dance codes to reconstruct s, encouraging the alignment between the styles of music and dance:\n\nL_recon^s = E[\u2016E_sty(z_dan) \u2212 s\u2016_1 + \u2016E_sty(\u02dcz_dan) \u2212 s\u2016_1].   (9)\n\nOverall, we jointly train the three encoders E_mtd, E_std, E_sty, the decoder G_dan, and the two discriminators D_mov, D_dan of MM-GAN to optimize the full objective in the composition:\n\nL_comp = L_recon^m + \u03bb_recon^s L_recon^s + \u03bb_adv^m L_adv^m + \u03bb_adv^d L_adv^d + \u03bb_KL^d L_KL^d,   (10)\n\nwhere \u03bb_recon^s, \u03bb_adv^m, \u03bb_adv^d, and \u03bb_KL^d are the weights that control the importance of the related loss terms.
3.3 Testing Phase\n\nAs shown in the testing phase of Figure 1, the final network at run time consists of E_ini and G_uni learned in the decomposition, and E_std and G_dan trained in the composition. Given a music clip, we first track the beats and extract the style feature s. We encode s with a noise vector \u03f5 into a latent dance code \u02dcz_dan by E_std, and then decode \u02dcz_dan into a movement sequence {\u02dcz_mov^i} by G_dan. To compose a complete dance, we randomly sample an initial pose code z_ini^0 from the prior distribution, and then recurrently generate a full sequence of dance units using z_ini^0 and {\u02dcz_mov^i}. The initial pose code z_ini^i of the next dance unit is encoded from the last frame of the current dance unit:\n\nu^i = G_uni(z_ini^{i\u22121}, \u02dcz_mov^i),   z_ini^i = E_ini(u^i(\u22121)),   (11)\n\nwhere u^i(\u22121) denotes the last frame of the i-th dance unit. With these steps, we can continuously and seamlessly generate a long-term dancing sequence that fits the input music. Since the beat times are normalized in each dance unit, we finally warp the generated sequence of dance units by aligning their kinematic beats with the extracted music beats to produce the final full dance.
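The recurrent roll-out of Eq. (11) can be sketched as follows; the module interfaces (e.g., G_dan returning a list of movement codes) and the noise dimension are assumptions made for illustration.

    import torch

    @torch.no_grad()
    def synthesize_dance(E_std, G_dan, E_ini, G_uni, s, n_units, z_ini_dim=10):
        # s: style feature of the input music clip, shape (1, s_dim)
        eps = torch.randn(1, 64)                     # noise vector (dim assumed)
        z_dan = E_std(torch.cat([s, eps], dim=1))    # style + noise -> dance code
        z_mov_seq = G_dan(z_dan, n_units)            # n latent movement codes
        z_ini = torch.randn(1, z_ini_dim)            # sample initial pose from prior
        units = []
        for z_mov in z_mov_seq:
            u = G_uni(z_ini, z_mov)                  # decode one dance unit, Eq. (11)
            z_ini = E_ini(u[:, -1])                  # last frame seeds the next unit
            units.append(u)
        # the concatenated units are finally warped to the extracted music beats
        return torch.cat(units, dim=1)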
4 Experimental Results\n\nWe conduct extensive experiments to evaluate the proposed decomposition-to-composition framework. We qualitatively and quantitatively compare our method with several baselines on various metrics, including motion realism, style consistency, diversity, multimodality, and beat coverage and hit rate. Experimental results reveal that our method can produce more realistic, diverse, and music-synchronized dances. More comparisons are provided in the supplementary material. Note that we could not include music in the embedded animations of this PDF, but the complete results with music can be found in the supplementary video.\n\n4.1 Data Collection and Processing\n\nSince there exists no large-scale music-dance dataset, we collect videos of three representative dancing categories from the Internet with the keywords \u201cBallet\u201d, \u201cZumba\u201d, and \u201cHip-Hop\u201d. We prune videos with low quality and little motion, and extract clips of 5 to 10 seconds with full pose estimation results. In the end, we acquire around 68K clips for \u201cBallet\u201d, 220K clips for \u201cZumba\u201d, and 73K clips for \u201cHip-Hop\u201d. The total length of all the clips is approximately 71 hours. We extract frames at 15 fps and audio at 22 kHz. We randomly select 300 music clips for testing and use the rest for training.\n\nPose Processing. OpenPose [5] is applied to extract 2D body keypoints. We observe that in practice some keypoints are difficult to extract consistently from wild web videos and some are less related to dancing movements. We therefore choose the 14 most relevant keypoints to represent the dancing poses, i.e., nose, neck, and the left and right shoulders, elbows, wrists, hips, knees, and ankles. We interpolate missing keypoint detections from the neighboring frames so that there are no missing keypoints in any extracted clip.\n\nAudio Processing. We use the standard MFCC as the music feature representation. The audio volume is normalized using root mean square with FFMPEG. We then extract the 13-dimensional MFCC feature, and concatenate it with its first temporal derivatives and the log mean energy of the volume into the final 28-dimensional audio feature.
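For illustration, such a feature pipeline could be assembled with librosa as sketched below. The exact composition of the 28 dimensions is not fully specified above, so the split into 13 MFCCs, their 13 deltas, the log energy, and its delta is our assumption.

    import numpy as np
    import librosa

    def audio_features(path, sr=22050):
        y, _ = librosa.load(path, sr=sr)
        mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)   # (13, T)
        d_mfcc = librosa.feature.delta(mfcc)                 # first temporal derivatives
        log_e = np.log(librosa.feature.rms(y=y) + 1e-6)      # (1, T) log volume envelope
        d_log_e = librosa.feature.delta(log_e)               # assumed extra dimension
        feat = np.concatenate([mfcc, d_mfcc, log_e, d_log_e], axis=0)
        return feat.T                                        # (T, 28)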
4.2 Implementation Details\n\nOur model is implemented in PyTorch. We use gated recurrent units (GRUs) to build the encoders E_mov, E_mtd and the decoders G_uni, G_dan. Each of them is a single-layer GRU with 1024 hidden units. E_ini, E_std, and E_sty are encoders consisting of 3 fully-connected layers. D_dan and D_mov are discriminators containing 5 fully-connected layers with layer normalization. We set the latent code dimensions to z_ini \u2208 R^10, z_mov \u2208 R^512, and z_dan \u2208 R^512. In the decomposition phase, we set the length of a dance unit to 32 frames and the number of beat times within a dance unit to 4. In the composition phase, each input sequence contains 3 to 5 dance units. For training, we use the Adam optimizer [19] with a batch size of 512, a learning rate of 0.0001, and exponential decay rates (\u03b2_1, \u03b2_2) = (0.5, 0.999). In all experiments, we set the hyper-parameters as follows: \u03bb_KL^u = \u03bb_KL^d = 0.01, \u03bb_recon^shift = 1, \u03bb_adv^d = \u03bb_adv^m = 0.1, and \u03bb_recon^s = 1. Our data, code, and models are publicly available at our website.\n\n4.3 Baselines\n\nGenerating dance from music is a relatively new task from the generative perspective, and thus few methods have been developed for it. In the following, we compare the proposed algorithm to several strong baseline methods. As our comparisons mainly target generative models, we present the results of a traditional retrieval-based method in the supplementary material.\n\nLSTM. We use an LSTM as our deterministic baseline. Similar to the recent work on mapping audio to arm and hand dynamics [33], the model takes audio features as inputs and produces pose sequences.\n\nAud-MoCoGAN. MoCoGAN [37] is a video generation model, which maps a sequence of random vectors containing the factors of fixed content and stochastic motion to a sequence of video frames. We modify this model to take extracted audio features on style and beat as inputs in addition to noise vectors. To improve the quality, we use multi-scale discriminators and apply curriculum learning to gradually increase the dance sequence length.\n\nOurs w/o L_comp. This model ablates the composition phase and relies on the decomposition phase alone. In addition to the original DU-VAE for decomposition, we enforce the paired music and dance unit to stay close when mapped into the latent movement space. At test time, we map a music clip into the movement space, and then recurrently generate a sequence of dance units by using the last pose of one dance unit as the first pose of the next one.\n\n4.4 Qualitative Comparisons\n\nWe first compare the quality of the dances synthesized by different methods. Figure 3(a) shows the dances generated from different input music. We observe that the dances generated by the LSTM tend to collapse to certain poses regardless of the input music or initial pose. The deterministic nature of the LSTM hinders it from learning the desired mapping to the highly unconstrained dancing movements. For Aud-MoCoGAN, the generated dances contain apparent artifacts, such as twitching or jerking in an unnatural way. Furthermore, the synthesized dances tend to be repetitive, i.e., the same movement is performed throughout a whole sequence. This may be explained by the fact that Aud-MoCoGAN takes all audio information, including style and beat, as input, whose correlation with dancing movements is difficult to learn via a single model. Ours w/o L_comp can generate smoother dances than the above two methods. However, since the dance is simply formed by a series of independent dance units, it is easy to observe incoherent movements. For instance, the third column in Figure 3(a) demonstrates incoherent examples, such as mixing dances with different styles (top), an abrupt transition between movements (middle), and an unnatural combination of movements (bottom). In contrast, the dances generated by our full model are more realistic and coherent. As demonstrated in the fourth column of Figure 3(a), the synthesized dances consist of smooth movements (top), consecutive similar movements (middle), and a natural composition of raising the left hand, raising the right hand, and raising both hands (bottom).\n\nFigure 3: (a) Comparison of the generated dances. LSTM produces dances that tend to collapse to static poses. Aud-MoCoGAN generates jerking dances that are prone to repeating the same movements. Ours w/o L_comp produces realistic movements, yet the combinations of movements are often unnatural. Compared to the baselines, our results are realistic and coherent. (b) Examples of multimodal generation. Dances in each column are generated by our method using the same music clip and initial pose. This figure is best viewed via Acrobat Reader; click each column to play.\n\nWe also analyze two other important properties of music-to-dance generation: multimodality and beat matching. For multimodality, our approach is able to generate diverse dances given the same music. As shown in Figure 3(b), each column shows various dances that are synthesized from the same music and the same initial pose. For beat matching, we compare the kinematic beats extracted from the generated dances against their corresponding input music beats. Most kinematic beats of our generated dances occur at musical beat times. Figure 4 visualizes two short dancing snippets that align with their musical beats, including clapping hands to the left and right alternately, and squatting down repetitively.\n\nFigure 4: Examples of beat matching between music and generated dances. We show two generated dances with the extracted kinematic offsets as well as the music onsets from the input music. The red dashes on the onset and offset graphs indicate the extracted musical beats and kinematic beats. The consecutive matched beats correspond to clapping hands to the left and right alternately (left), and squatting down repetitively (right).\n\nMore demonstrations with music, such as long-term generation, mixing styles, and photo-realistic translation, are available in the supplementary video.\n\n4.5 Quantitative Comparisons\n\nMotion Realism and Style Consistency. Here we perform a quantitative evaluation of the realism of the generated movements and the style consistency of the synthesized dances with respect to the input music. We conduct a user study using a pairwise comparison scheme. Specifically, we evaluate the dances generated by the four methods as well as real dances on 60 randomly selected testing music clips. Given a pair of dances with the same music clip, each user is asked to answer two questions: \u201cWhich dance is more realistic regardless of music?\u201d and \u201cWhich dance matches the music better?\u201d We ask each user to compare 20 pairs and collect results from a total of 50 subjects.\n\nFigure 5 shows the user study results, where our approach outperforms the baselines on both motion realism and style consistency. It is consistently found that LSTM and Aud-MoCoGAN generate dances with obvious artifacts, resulting in low preferences. Although ours w/o L_comp can produce high-quality dance units, the simple concatenation of independent dance units usually makes the synthesized dance look unnatural. This is also reflected in the user study, where 61.2% of users prefer the full solution in terms of motion realism, and 68.3% in terms of style consistency. Compared to real dances, 35.7% of users prefer our approach in terms of motion realism and 28.6% in terms of style consistency. Note that the upper bound is 50.0% when comparing against real dances. The performance of our method can be further improved with more training data.\n\nIn addition to the subjective test, we evaluate the visual quality following the Fr\u00e9chet Inception Distance (FID) [16] by measuring how close the distribution of generated dances is to that of real ones. As there exists no standard feature extractor for pose sequences, we train an action classifier on the collected data of three categories and use it as the feature extractor. Table 1 shows the average results of 10 trials. Overall, the FID of our generated dances is much closer to the real ones than that of the other evaluated methods.\n\nBeat Coverage and Hit Rate. In addition to realism and consistency, we evaluate how well the kinematic beats of the generated dances match the input music beats. Given all input music and generated dances, we gather the number of total musical beats B_m, the number of total kinematic beats B_k, and the number of kinematic beats that are aligned with musical beats B_a. We use two metrics for evaluation: (i) beat coverage B_k/B_m measures the ratio of kinematic beats to musical beats; (ii) beat hit rate B_a/B_k is the ratio of aligned kinematic beats to total kinematic beats.
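In code, the two metrics can be computed from the detected beat times as in the following sketch; the alignment tolerance is an assumed parameter, as the exact matching window is not stated above.

    import numpy as np

    def beat_metrics(music_beats, kinematic_beats, tol=0.1):
        # music_beats, kinematic_beats: 1-D NumPy arrays of beat times in seconds
        B_m, B_k = len(music_beats), len(kinematic_beats)
        # a kinematic beat is aligned if a musical beat lies within tol seconds
        B_a = sum(np.min(np.abs(music_beats - b)) <= tol for b in kinematic_beats)
        coverage = B_k / B_m      # beat coverage B_k / B_m
        hit_rate = B_a / B_k      # beat hit rate B_a / B_k
        return coverage, hit_rate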
As shown in Table 1, our approach generates beat coverage very similar to that of real dances, indicating that our synthesized dances naturally align with the musical rhythm. Note that a higher beat coverage is not necessarily better; the appropriate coverage depends on the dancing style. Ours w/o L_comp has a higher beat hit rate than our full model, as the latter takes the coherence between movements into account, which may sacrifice the beat hit rate of individual movements. There are two main reasons for the relatively low beat hit rate of real dances. First, the data is noisy due to the automatic collection process and imperfect pose extraction. Second, our kinematic beat detector is an approximation, which may not capture all the subtle motions that can be regarded as beat points by human beings.\n\nDiversity and Multimodality. We evaluate the diversity among dances generated by various music and the multimodality among dances generated from the same music. We use the average feature distance, similar to [45], as the measurement, with the same feature extractor as used in measuring FID. For diversity, we generate 50 dances from different music on each trial, and then compute the average feature distance between 200 random pairs of them. For multimodality, we compare the ability to generate diverse dances conditioned on the same music, measuring the average distance between all pairs of 5 dances generated from the same music.
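Both measurements can be sketched as follows, assuming each dance has already been mapped to a feature vector by the action classifier mentioned above; the pair counts follow the protocol stated in the text.

    import itertools
    import numpy as np

    def avg_feature_distance(feats, pairs):
        # feats: (N, d) array of extracted dance features
        return float(np.mean([np.linalg.norm(feats[i] - feats[j]) for i, j in pairs]))

    def diversity(feats_50, n_pairs=200, seed=0):
        # 50 dances generated from different music; average over 200 random pairs
        rng = np.random.default_rng(seed)
        pairs = [rng.choice(len(feats_50), size=2, replace=False) for _ in range(n_pairs)]
        return avg_feature_distance(feats_50, pairs)

    def multimodality(feats_5):
        # 5 dances generated from the same music; average over all pairs
        return avg_feature_distance(feats_5, list(itertools.combinations(range(len(feats_5)), 2)))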
Figure 5: Preference results on motion realism and style consistency. We conduct a user study asking subjects to select, through pairwise comparisons, the dances that are more realistic regardless of music and that better match the style of the music. Each number denotes the percentage of preference on the corresponding comparison pair.\n\nMethod | FID | Beat Coverage | Beat Hit Rate | Diversity | Multimodality\nReal Dances | 5.9 \u00b1 0.4 | 39.3% | 51.6% | 53.5 \u00b1 1.9 | -\nLSTM | 73.8 \u00b1 4.1 | 1.4% | 0.8% | 24.5 \u00b1 1.4 | -\nAud-MoCoGAN | 21.7 \u00b1 0.8 | 23.9% | 54.8% | 45.8 \u00b1 1.3 | 27.3 \u00b1 1.3\nOurs w/o L_comp | 14.8 \u00b1 1.1 | 37.8% | 72.4% | 49.7 \u00b1 2.0 | 51.4 \u00b1 0.8\nOurs | 12.8 \u00b1 0.8 | 39.4% | 65.1% | 53.2 \u00b1 2.5 | 47.8 \u00b1 0.9\n\nTable 1: Comparison of realism, beat coverage and hit rate, and diversity and multimodality. FID evaluates the visual quality by measuring the distance between the distributions of real and synthesized dances. Beat coverage measures the ratio of total kinematic beats to total musical beats, and beat hit rate measures the ratio of kinematic beats that are aligned with musical beats to total kinematic beats. Diversity and multimodality are evaluated using average feature distances: we use diversity to refer to the variations among a set of dances generated from various music, and multimodality to reflect the variations of generated dances given the same input music.\n\nTable 1 shows the average results of 10 trials for diversity and 500 trials for multimodality. The multimodality score of LSTM is not reported since LSTM is a deterministic model and incapable of multimodal generation. Our generated dances achieve a diversity score comparable to real dances and outperform Aud-MoCoGAN on both diversity and multimodality scores. Ours w/o L_comp obtains a higher score on multimodality since it disregards the correlation between consecutive movements and is free to combine them, at the cost of motion realism and style consistency. However, the proposed full model performs better in diversity, suggesting that the composition phase in training enforces movement coherence at no cost of diversity.\n\n5 Conclusions\n\nIn this work, we have proposed to synthesize dances from music through a decomposition-to-composition learning framework. In the top-down decomposition phase, we teach the model how to generate and disentangle the elementary dance units. In the bottom-up composition phase, we direct the model to meaningfully compose the basic dancing movements conditioned on the input music. We make use of the kinematic and musical beats to temporally align generated dances with the accompanying music. Extensive qualitative and quantitative evaluations demonstrate that the dances synthesized by the proposed method are not only realistic and diverse, but also style-consistent and beat-matching. In future work, we will continue to collect and incorporate more dancing styles, such as pop dance and partner dance.\n\nReferences\n\n[1] K. Aberman, R. Wu, D. Lischinski, B. Chen, and D. Cohen-Or. Learning character-agnostic motion for motion retargeting in 2D. ACM Transactions on Graphics, 2019. 3\n\n[2] R. Arandjelovic and A. Zisserman. Look, listen and learn. In IEEE International Conference on Computer Vision, 2017. 3\n\n[3] R. Arandjelovi\u0107 and A. Zisserman. Objects that sound. In European Conference on Computer Vision, 2018. 3\n\n[4] Y. Aytar, C. Vondrick, and A. Torralba. SoundNet: Learning sound representations from unlabeled video. In Neural Information Processing Systems, 2016. 3\n\n[5] Z. Cao, T. Simon, S.-E. Wei, and Y. Sheikh. Realtime multi-person 2D pose estimation using part affinity fields. In IEEE Conference on Computer Vision and Pattern Recognition, 2017. 3, 6\n\n[6] Y.-W. Chao, J. Yang, B. L. Price, S. Cohen, and J. Deng. Forecasting human dynamics from static images. In IEEE Conference on Computer Vision and Pattern Recognition, 2017. 3\n\n[7] A. Davis and M. Agrawala. Visual rhythm and beat. ACM Transactions on Graphics, 2018. 3\n\n[8] A. Davis, M. Rubinstein, N. Wadhwa, G. Mysore, F. Durand, and W. Freeman. The visual microphone: Passive recovery of sound from video. ACM Transactions on Graphics, 2014. 2\n\n[9] E. Denton and V. Birodkar. Unsupervised learning of disentangled representations from video. In Neural Information Processing Systems, 2017. 3\n\n[10] C. Doersch, A. Gupta, and A. A. Efros. Unsupervised visual representation learning by context prediction. In IEEE International Conference on Computer Vision, 2015. 3\n\n[11] D. Ellis. Beat tracking by dynamic programming. Journal of New Music Research, 2007. 1, 3, 4\n\n[12] A. Ephrat, I. Mosseri, O. Lang, T. Dekel, K. Wilson, A. Hassidim, W. Freeman, and M. Rubinstein. Looking to listen at the cocktail party: A speaker-independent audio-visual model for speech separation. ACM Transactions on Graphics, 2018. 3
[13] R. Fan, S. Xu, and W. Geng. Example-based automatic music-driven conventional dance motion synthesis. IEEE Transactions on Visualization and Computer Graphics, 2012. 1\n\n[14] R. Gao, R. Feris, and K. Grauman. Learning to separate object sounds by watching unlabeled video. In European Conference on Computer Vision, 2018. 3\n\n[15] D. Harwath, A. Recasens, D. Sur\u00eds, G. Chuang, A. Torralba, and J. Glass. Jointly discovering visual objects and spoken words from raw sensory input. In European Conference on Computer Vision, 2018. 3\n\n[16] M. Heusel, H. Ramsauer, T. Unterthiner, B. Nessler, and S. Hochreiter. GANs trained by a two time-scale update rule converge to a local Nash equilibrium. In Neural Information Processing Systems, 2017. 8\n\n[17] A. Karpathy and L. Fei-Fei. Deep visual-semantic alignments for generating image descriptions. In IEEE Conference on Computer Vision and Pattern Recognition, 2015. 2\n\n[18] T. Karras, T. Aila, S. Laine, A. Herva, and J. Lehtinen. Audio-driven facial animation by joint end-to-end learning of pose and emotion. ACM Transactions on Graphics, 2017. 2\n\n[19] D. Kingma and J. Ba. Adam: A method for stochastic optimization. In International Conference on Learning Representations, 2015. 6\n\n[20] H.-Y. Lee, J.-B. Huang, M. Singh, and M.-H. Yang. Unsupervised representation learning by sorting sequences. In IEEE International Conference on Computer Vision, 2017. 3\n\n[21] H.-Y. Lee, H.-Y. Tseng, J.-B. Huang, M. K. Singh, and M.-H. Yang. Diverse image-to-image translation via disentangled representations. In European Conference on Computer Vision, 2018. 2\n\n[22] M. Lee, K. Lee, and J. Park. Music similarity-based approach to generating dance motion sequence. Multimedia Tools and Applications, 2013. 1\n\n[23] Y.-D. Lu, H.-Y. Lee, H.-Y. Tseng, and M.-H. Yang. Self-supervised audio spatialization with correspondence classifier. In IEEE International Conference on Image Processing, 2019. 3\n\n[24] G. Madison, F. Gouyon, F. Ullen, and K. Hornstrom. Modeling the tendency for music to induce movement in humans: First correlations with low-level audio descriptors across music genres. Journal of Experimental Psychology: Human Perception and Performance, 2011. 1\n\n[25] M. Mathieu, C. Couprie, and Y. LeCun. Deep multi-scale video prediction beyond mean square error. In International Conference on Learning Representations, 2016. 3\n\n[26] F. Ofli, E. Erzin, Y. Yemez, and M. Tekalp. Learn2Dance: Learning statistical music-to-dance mappings for choreography synthesis. IEEE Transactions on Multimedia, 2012. 1\n\n[27] A. Owens and A. Efros. Audio-visual scene analysis with self-supervised multisensory features. In European Conference on Computer Vision, 2018. 3\n\n[28] A. Owens, P. Isola, J. McDermott, A. Torralba, E. Adelson, and W. Freeman. Visually indicated sounds. In IEEE Conference on Computer Vision and Pattern Recognition, 2016. 2\n\n[29] A. Owens, J. Wu, J. McDermott, W. Freeman, and A. Torralba. Ambient sound provides supervision for visual learning. In European Conference on Computer Vision, 2016. 3\n\n[30] S. Reed, Z. Akata, X. Yan, L. Logeswaran, B. Schiele, and H. Lee. Generative adversarial text to image synthesis. In International Conference on Machine Learning, 2016. 2\n\n[31] G. Schellenberg, A. Krysciak, and J. Campbell. Perceiving emotion in melody: Interactive effects of pitch and rhythm. Music Perception, 2000. 1
[32] A. Senocak, T.-H. Oh, J. Kim, M.-H. Yang, and I.-S. Kweon. Learning to localize sound source in visual scenes. In IEEE Conference on Computer Vision and Pattern Recognition, 2018. 3\n\n[33] E. Shlizerman, L. Dery, H. Schoen, and I. Kemelmacher-Shlizerman. Audio to body dynamics. In IEEE Conference on Computer Vision and Pattern Recognition, 2018. 3, 6\n\n[34] N. Srivastava, E. Mansimov, and R. Salakhudinov. Unsupervised learning of video representations using LSTMs. In International Conference on Machine Learning, 2015. 3\n\n[35] S. Suwajanakorn, S. M. Seitz, and I. Kemelmacher-Shlizerman. Synthesizing Obama: Learning lip sync from audio. ACM Transactions on Graphics, 2017. 2\n\n[36] T. Tang, J. Jia, and H. Mao. Dance with melody: An LSTM-autoencoder approach to music-oriented dance synthesis. In ACM Multimedia, 2018. 3\n\n[37] S. Tulyakov, M.-Y. Liu, X. Yang, and J. Kautz. MoCoGAN: Decomposing motion and content for video generation. In IEEE Conference on Computer Vision and Pattern Recognition, 2018. 3, 7\n\n[38] O. Vinyals, A. Toshev, S. Bengio, and D. Erhan. Show and tell: A neural image caption generator. In IEEE Conference on Computer Vision and Pattern Recognition, 2015. 2\n\n[39] C. Vondrick, H. Pirsiavash, and A. Torralba. Generating videos with scene dynamics. In Neural Information Processing Systems, 2016. 3\n\n[40] J. Walker, K. Marino, A. Gupta, and M. Hebert. The pose knows: Video forecasting by generating pose futures. In IEEE International Conference on Computer Vision, 2017. 3\n\n[41] T.-C. Wang, M.-Y. Liu, J.-Y. Zhu, G. Liu, A. Tao, J. Kautz, and B. Catanzaro. Video-to-video synthesis. In Neural Information Processing Systems, 2018. 2\n\n[42] X. Yan, A. Rastogi, R. Villegas, K. Sunkavalli, E. Shechtman, S. Hadap, E. Yumer, and H. Lee. MT-VAE: Learning motion transformations to generate multimodal human dynamics. In European Conference on Computer Vision, 2018. 3\n\n[43] X. Yang and Y. Tian. Effective 3D action recognition using eigenjoints. Journal of Visual Communication and Image Representation, 2014. 3\n\n[44] H. Zhang, T. Xu, H. Li, S. Zhang, X. Huang, X. Wang, and D. Metaxas. StackGAN: Text to photo-realistic image synthesis with stacked generative adversarial networks. In IEEE International Conference on Computer Vision, 2017. 2\n\n[45] R. Zhang, P. Isola, A. Efros, E. Shechtman, and O. Wang. The unreasonable effectiveness of deep networks as a perceptual metric. In IEEE Conference on Computer Vision and Pattern Recognition, 2018. 8\n\n[46] Y. Zhou, Z. Wang, C. Fang, T. Bui, and T. Berg. Visual to sound: Generating natural sound for videos in the wild. In IEEE Conference on Computer Vision and Pattern Recognition, 2018. 2\n", "award": [], "sourceid": 1948, "authors": [{"given_name": "Hsin-Ying", "family_name": "Lee", "institution": "University of California, Merced"}, {"given_name": "Xiaodong", "family_name": "Yang", "institution": "QCraft"}, {"given_name": "Ming-Yu", "family_name": "Liu", "institution": "Nvidia Research"}, {"given_name": "Ting-Chun", "family_name": "Wang", "institution": "NVIDIA"}, {"given_name": "Yu-Ding", "family_name": "Lu", "institution": "UC Merced"}, {"given_name": "Ming-Hsuan", "family_name": "Yang", "institution": "Google / UC Merced"}, {"given_name": "Jan", "family_name": "Kautz", "institution": "NVIDIA"}]}