<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
<title>Hierarchical Patch VAE-GAN: Generating Diverse Videos from a Single Sample</title>
<link href="css/style.css" rel="stylesheet" type="text/css" />
</head>

<body>
<div class="container">
  <h1 align="center">Hierarchical Patch VAE-GAN: Generating Diverse Videos from a Single Sample</h1>
  <h2 align="center">Supplementary Material</h2>
  <p>&nbsp;</p>
  <ul>
    <li><a href="index.html#random">Randomly Generated Videos (Figure 1 and Section 4.2)</a></li>
    <li><a href="index.html#number_levels">Effect of Number of VAE Levels (Figure 6 and Section 4.2)</a></li>
    <li><a href="index.html#multiple_sample">Multiple Sample Video Generation Baselines (Section 4.1)</a></li>
    <li><a href="index.html#image_generation">Single Image Generation (Figure 7 and Section 4.2)</a></li>
    <li><a href="index.html#network_freezing">Network Freezing (Section 4.2)</a></li>
  </ul>
  <p>&nbsp;</p>
  <hr>
  <h2 align="left"><a name="random" id="teaser"></a>Randomly Generated Videos</h2>
  <p align="left">Samples randomly generated by our method, as described in Figure 1 and Section 4.2. Videos are shown as GIFs and therefore loop continuously. Training and generated videos each consist of 13 frames. <br>
  </p>
  <p align="left">&nbsp;</p>
  <table width="200" border="0" align="center">
    <tbody>
      <tr>
        <td align="center">Training Video</td>
        <td>&nbsp;</td>
        <td align="center">Randomly Generated Samples</td>
      </tr>
      <tr>
        <td align="left"><img src="./html_files/random/wingsuit_real.gif" width="195" alt=""></td>
        <td>&nbsp;</td>
        <td align="right"><img src="./html_files/random/wingsuit_fake.gif" width="1000" alt=""></td>
      </tr>
      <tr>
        <td align="left"><img src="./html_files/random/lion_king_1_real.gif" width="195" alt=""></td>
        <td>&nbsp;</td>
        <td align="right"><img src="./html_files/random/lion_king_1_fake.gif" width="1000" alt=""></td>
      </tr>
      <tr>
        <td align="left"><img src="./html_files/random/lion_king_2_real.gif" width="195" alt=""></td>
        <td>&nbsp;</td>
        <td align="right"><img src="./html_files/random/lion_king_2_fake.gif" width="1000" alt=""></td>
      </tr>
      <tr>
        <td align="left"><img src="./html_files/random/real_1.gif" width="195" alt=""></td>
        <td>&nbsp;</td>
        <td align="right"><img src="./html_files/random/fakes_1.gif" width="1000" alt=""></td>
      </tr>
      <tr>
        <td align="left"><img src="./html_files/random/real_2.gif" width="195" alt=""></td>
        <td>&nbsp;</td>
        <td align="right"><img src="./html_files/random/fakes_2.gif" width="1000" alt=""></td>
      </tr>
      <tr>
        <td align="left"><img src="./html_files/random/real_3.gif" width="195" alt=""></td>
        <td>&nbsp;</td>
        <td align="right"><img src="./html_files/random/fakes_3.gif" width="1000" alt=""></td>
      </tr>
      <tr>
        <td align="left"><img src="./html_files/random/real_4.gif" width="195" alt=""></td>
        <td>&nbsp;</td>
        <td align="right"><img src="./html_files/random/fakes_4.gif" width="1000" alt=""></td>
      </tr>
      <tr>
        <td align="left"><img src="./html_files/random/real_5.gif" width="195" alt=""></td>
        <td>&nbsp;</td>
        <td align="right"><img src="./html_files/random/fakes_5.gif" width="1000" alt=""></td>
      </tr>
      <tr>
        <td align="left"><img src="./html_files/random/real_6.gif" width="195" alt=""></td>
        <td>&nbsp;</td>
        <td align="right"><img src="./html_files/random/fakes_6.gif" width="1000" alt=""></td>
      </tr>
      <tr>
        <td align="left"><img src="./html_files/random/real_7.gif" width="195" alt=""></td>
        <td>&nbsp;</td>
        <td align="right"><img src="./html_files/random/fakes_7.gif" width="1000" alt=""></td>
      </tr>
      <tr>
        <td align="left"><img src="./html_files/random/real_8.gif" width="195" alt=""></td>
        <td>&nbsp;</td>
        <td align="right"><img src="./html_files/random/fakes_8.gif" width="1000" alt=""></td>
      </tr>
      <tr>
        <td align="left"><img src="./html_files/random/real_9.gif" width="195" alt=""></td>
        <td>&nbsp;</td>
        <td align="right"><img src="./html_files/random/fakes_9.gif" width="1000" alt=""></td>
      </tr>
      <tr>
        <td align="left"><img src="./html_files/random/real_10.gif" width="195" alt=""></td>
        <td>&nbsp;</td>
        <td align="right"><img src="./html_files/random/fakes_10.gif" width="1000" alt=""></td>
      </tr>
      <tr>
        <td align="left"><img src="./html_files/random/real_11.gif" width="195" alt=""></td>
        <td>&nbsp;</td>
        <td align="right"><img src="./html_files/random/fakes_11.gif" width="1000" alt=""></td>
      </tr>
      <tr>
        <td align="left"><img src="./html_files/random/real_12.gif" width="195" alt=""></td>
        <td>&nbsp;</td>
        <td align="right"><img src="./html_files/random/fakes_12.gif" width="1000" alt=""></td>
      </tr>
    </tbody>
  </table>
  <h3 align="left"><a name="longer_vids" id="longer_vids"></a>Longer Training Videos</h3>
  <p align="left">As mentioned in Section 3.2, our method can also be trained on longer videos, thus generating further variability in the outputs. We show a number of longer training videos (more than 13 frames) and the associated randomly generated samples of 13 frames. <br>
  </p>
  <p align="left">&nbsp;</p>
  <table width="200" border="0" align="center">
    <tbody>
      <tr>
        <td align="center">Training Video</td>
        <td>&nbsp;</td>
        <td align="center">Randomly Generated Samples</td>
      </tr>
      <tr>
        <td align="left"><img src="./html_files/longer/real_1.gif" width="195" alt=""></td>
        <td>&nbsp;</td>
        <td align="right"><img src="./html_files/longer/fakes_1.gif" width="1000" alt=""></td>
      </tr>
      <tr>
        <td align="left"><img src="./html_files/longer/real_2.gif" width="195" alt=""></td>
        <td>&nbsp;</td>
        <td align="right"><img src="./html_files/longer/fakes_2.gif" width="1000" alt=""></td>
      </tr>
      <tr>
        <td align="left"><img src="./html_files/longer/real_3.gif" width="195" alt=""></td>
        <td>&nbsp;</td>
        <td align="right"><img src="./html_files/longer/fakes_3.gif" width="1000" alt=""></td>
      </tr>
    </tbody>
  </table>

  <h3 align="left"><a name="baselines" id="baselines"></a>Baseline Videos</h3>
  <p align="left">Shown here are several video outputs of the SinGAN and ConSinGAN baseline methods (with 2D convolutions replaced by 3D ones), as presented in the user study of Section 4.2. As can be seen, the generated output mostly collapses to the input training video.  <br>
  </p>
  <p align="left">&nbsp;</p>
  <table width="200" border="0" align="center">
    <tbody>
      <tr>
        <td align="center">Training Video</td>
        <td>&nbsp;</td>
        <td align="center">SinGAN (3D) [24]</td>
      </tr>
      <tr>
        <td align="left"><img src="./html_files/random/lion_king_2_real.gif" width="195" alt=""></td>
        <td>&nbsp;</td>
        <td align="right"><img src="./html_files/baselines/fakes_lion_2_singan.gif" width="1000" alt=""></td>
      </tr>
      <tr>
        <td align="left"><img src="./html_files/random/real_3.gif" width="195" alt=""></td>
        <td>&nbsp;</td>
        <td align="right"><img src="./html_files/baselines/fakes_fish_singan.gif" width="1000" alt=""></td>
      </tr>
      <tr>
        <td align="left"><img src="./html_files/random/real_12.gif" width="195" alt=""></td>
        <td>&nbsp;</td>
        <td align="right"><img src="./html_files/baselines/fakes_ski_singan.gif" width="1000" alt=""></td>
      </tr>
      <tr>
        <td align="center">Training Video</td>
        <td>&nbsp;</td>
        <td align="center">ConSinGAN (3D) [28]</td>
      </tr>
      <tr>
        <td align="left"><img src="./html_files/random/lion_king_2_real.gif" width="195" alt=""></td>
        <td>&nbsp;</td>
        <td align="right"><img src="./html_files/baselines/fakes_lion_2_consingan.gif" width="1000" alt=""></td>
      </tr>
      <tr>
        <td align="left"><img src="./html_files/random/real_3.gif" width="195" alt=""></td>
        <td>&nbsp;</td>
        <td align="right"><img src="./html_files/baselines/fakes_fish_consingan.gif" width="1000" alt=""></td>
      </tr>
      <tr>
        <td align="left"><img src="./html_files/random/real_12.gif" width="195" alt=""></td>
        <td>&nbsp;</td>
        <td align="right"><img src="./html_files/baselines/fakes_ski_consingan.gif" width="1000" alt=""></td>
      </tr>
    </tbody>
  </table>

  <p align="left">&nbsp;</p>
  <p align="left">&nbsp;</p>
  <hr>
  <h2 align="left"><a name="number_levels" id="number_levels"></a>Effect of Number of VAE Levels </h2>
  <p align="left">Effect of the number of VAE levels M on the generated samples, as described in Figure 6 and Section 4.2. N is set to 9, so a total of 10 levels are trained.
  In addition, a comparison to SinGAN and ConSinGAN (with 2D convolutions replaced by 3D ones) is given. </p>

  <table width="200" border="0" align="center">
    <tbody>
      <tr>
        <td align="center">Training Video</td>
        <td>&nbsp;</td>
        <td align="center">SinGAN (3D) [24]</td>
      </tr>
      <tr>
        <td align="left"><img src="./html_files/random/real_1.gif" width="195" alt=""></td>
        <td>&nbsp;</td>
        <td align="right"><img src="./html_files/vae_levels/fakes_singan.gif" width="1000" alt=""></td>
      </tr>
      <tr>
        <td align="center">Training Video</td>
        <td>&nbsp;</td>
        <td align="center">ConSinGAN (3D) [28]</td>
      </tr>
      <tr>
        <td align="left"><img src="./html_files/random/real_1.gif" width="195" alt=""></td>
        <td>&nbsp;</td>
        <td align="right"><img src="./html_files/vae_levels/fakes_consingan.gif" width="1000" alt=""></td>
      </tr>
      <tr>
        <td align="center">Training Video</td>
        <td>&nbsp;</td>
        <td align="center">Single VAE level (M=1) </td>
      </tr>
      <tr>
        <td align="left"><img src="./html_files/random/real_1.gif" width="195" alt=""></td>
        <td>&nbsp;</td>
        <td align="right"><img src="./html_files/vae_levels/fakes_level_1.gif" width="1000" alt=""></td>
      </tr>
      <tr>
        <td align="center">Training Video</td>
        <td>&nbsp;</td>
        <td align="center">Single GAN level (M=9) </td>
      </tr>
      <tr>
        <td align="left"><img src="./html_files/random/real_1.gif" width="195" alt=""></td>
        <td>&nbsp;</td>
        <td align="right"><img src="./html_files/vae_levels/fakes_vae_all.gif" width="1000" alt=""></td>
      </tr>
      <tr>
        <td align="center">Training Video</td>
        <td>&nbsp;</td>
        <td align="center">Our Method (M=3)</td>
      </tr>
      <tr>
        <td align="left"><img src="./html_files/random/real_1.gif" width="195" alt=""></td>
        <td>&nbsp;</td>
        <td align="right"><img src="./html_files/random/fakes_1.gif" width="1000" alt=""></td>
      </tr>
    </tbody>
  </table>

  <p align="left">&nbsp;</p>
  <p align="left">&nbsp;</p>
  <hr>
  <h2 align="left"><a name="multiple_sample" id="multiple_sample"></a>Multiple Sample Video Generation Baselines</h2>
  <p align="left">As described in Section 4.1, we randomly draw a sample s from each baseline method. nn1 and nn2 are the first and second nearest neighbors (NN) of s in the UCF-101 training set. We show here our randomly generated samples s' when our model is trained on nn1. </p>
  <table width="200" border="0" align="center">
    <tbody>
      <tr>
        <td align="center"> MoCoGAN's [30] sample s</td>
        <td>&nbsp;</td>
        <td align="center">nn1 Video</td>
        <td>&nbsp;</td>
        <td align="center">nn2 Video</td>
        <td>&nbsp;</td>
        <td align="center"> Our sample s' (trained on nn1 video) </td>
      </tr>
      <tr>
        <td align="left"><img src="./html_files/multi/mocogan_fakes_mocogan.gif" width="179" alt=""></td>
        <td>&nbsp;</td>
        <td align="left"><img src="./html_files/multi/mocogan_real.gif" width="135" alt=""></td>
        <td>&nbsp;</td>
        <td align="left"><img src="./html_files/multi/mocogan_real_2nd.gif" width="134" alt=""></td>
        <td>&nbsp;</td>
        <td align="right"><img src="./html_files/multi/mocogan_fakes_ours.gif" width="717" alt=""></td>
      </tr>
      <tr>
        <td align="center"> TGAN's [2] sample s</td>
        <td>&nbsp;</td>
        <td align="center">nn1 Video</td>
        <td>&nbsp;</td>
        <td align="center">nn2 Video</td>
        <td>&nbsp;</td>
        <td align="center"> Our sample s' (trained on nn1 video) </td>
      </tr>
      <tr>
        <td align="left"><img src="./html_files/multi/tgan_fakes_tgan.gif" width="179" alt=""></td>
        <td>&nbsp;</td>
        <td align="left"><img src="./html_files/multi/tgan_real.gif" width="135" alt=""></td>
        <td>&nbsp;</td>
        <td align="left"><img src="./html_files/multi/tgan_real_2nd.gif" width="134" alt=""></td>
        <td>&nbsp;</td>
        <td align="right"><img src="./html_files/multi/tgan_fakes_ours.gif" width="717" alt=""></td>
      </tr>
      <tr>
        <td align="center"> TGAN-v2's [3] sample s </td>
        <td>&nbsp;</td>
        <td align="center">nn1 Video</td>
        <td>&nbsp;</td>
        <td align="center">nn2 Video</td>
        <td>&nbsp;</td>
        <td align="center"> Our sample s' (trained on nn1 video) </td>
      </tr>
      <tr>
        <td align="left"><img src="./html_files/multi/tgan_v2_fakes_tgan_v2.gif" width="179" alt=""></td>
        <td>&nbsp;</td>
        <td align="left"><img src="./html_files/multi/tgan_v2_real.gif" width="135" alt=""></td>
        <td>&nbsp;</td>
        <td align="left"><img src="./html_files/multi/tgan_v2_real_2nd.gif" width="134" alt=""></td>
        <td>&nbsp;</td>
        <td align="right"><img src="./html_files/multi/tgan_v2_fakes_ours.gif" width="717" alt=""></td>
      </tr>
    </tbody>
  </table>


  <p>&nbsp;</p>
  <p align="left">&nbsp;</p>
  <hr>

  <h2 align="left"><a name="image_generation" id="image_generation"></a>Single Image Generation</h2>
  <p align="left">Additional image generation results and comparisons to baselines, as described in Figure 7 and Section 4.2. </p>
  <table width="200" border="0" align="center">
    <tbody>
      <tr>
        <td colspan="5" align="center">SinGAN [24] </td>
        <td colspan="3">&nbsp;</td>
        <td colspan="5" align="center">ConSinGAN [28] </td>
        <td colspan="3">&nbsp;</td>
        <td colspan="5" align="center">Our Method (2D) </td>
      </tr>
      <tr>
        <td align="left"><img src="./html_files/images/singan/1/0.png" width="80" alt=""></td>
        <td align="left"><img src="./html_files/images/singan/1/1.png" width="80" alt=""></td>
        <td align="left"><img src="./html_files/images/singan/1/2.png" width="80" alt=""></td>
        <td align="left"><img src="./html_files/images/singan/1/3.png" width="80" alt=""></td>
        <td align="left"><img src="./html_files/images/singan/1/4.png" width="80" alt=""></td>
        <td>&nbsp;</td>
        <td>&nbsp;</td>
        <td>&nbsp;</td>
        <td align="left"><img src="./html_files/images/consingan/1/gen_sample_0.jpg" width="80" alt=""></td>
        <td align="left"><img src="./html_files/images/consingan/1/gen_sample_1.jpg" width="80" alt=""></td>
        <td align="left"><img src="./html_files/images/consingan/1/gen_sample_2.jpg" width="80" alt=""></td>
        <td align="left"><img src="./html_files/images/consingan/1/gen_sample_3.jpg" width="80" alt=""></td>
        <td align="left"><img src="./html_files/images/consingan/1/gen_sample_4.jpg" width="80" alt=""></td>
        <td>&nbsp;</td>
        <td>&nbsp;</td>
        <td>&nbsp;</td>
        <td align="left"><img src="./html_files/images/ours/1/fake_0.png" width="80" alt=""></td>
        <td align="left"><img src="./html_files/images/ours/1/fake_1.png" width="80" alt=""></td>
        <td align="left"><img src="./html_files/images/ours/1/fake_2.png" width="80" alt=""></td>
        <td align="left"><img src="./html_files/images/ours/1/fake_3.png" width="80" alt=""></td>
        <td align="left"><img src="./html_files/images/ours/1/fake_4.png" width="80" alt=""></td>
      </tr>
      <tr><td colspan="21">&nbsp;</td></tr>
      <tr><td colspan="21">&nbsp;</td></tr>
      <tr><td colspan="21">&nbsp;</td></tr>
      <tr>
        <td align="left"><img src="./html_files/images/singan/2/0.png" width="80" alt=""></td>
        <td align="left"><img src="./html_files/images/singan/2/1.png" width="80" alt=""></td>
        <td align="left"><img src="./html_files/images/singan/2/2.png" width="80" alt=""></td>
        <td align="left"><img src="./html_files/images/singan/2/3.png" width="80" alt=""></td>
        <td align="left"><img src="./html_files/images/singan/2/4.png" width="80" alt=""></td>
        <td>&nbsp;</td>
        <td>&nbsp;</td>
        <td>&nbsp;</td>
        <td align="left"><img src="./html_files/images/consingan/2/gen_sample_0.jpg" width="80" alt=""></td>
        <td align="left"><img src="./html_files/images/consingan/2/gen_sample_1.jpg" width="80" alt=""></td>
        <td align="left"><img src="./html_files/images/consingan/2/gen_sample_2.jpg" width="80" alt=""></td>
        <td align="left"><img src="./html_files/images/consingan/2/gen_sample_3.jpg" width="80" alt=""></td>
        <td align="left"><img src="./html_files/images/consingan/2/gen_sample_4.jpg" width="80" alt=""></td>
        <td>&nbsp;</td>
        <td>&nbsp;</td>
        <td>&nbsp;</td>
        <td align="left"><img src="./html_files/images/ours/2/fake_0.png" width="80" alt=""></td>
        <td align="left"><img src="./html_files/images/ours/2/fake_1.png" width="80" alt=""></td>
        <td align="left"><img src="./html_files/images/ours/2/fake_2.png" width="80" alt=""></td>
        <td align="left"><img src="./html_files/images/ours/2/fake_3.png" width="80" alt=""></td>
        <td align="left"><img src="./html_files/images/ours/2/fake_4.png" width="80" alt=""></td>
      </tr>
      <tr><td colspan="21">&nbsp;</td></tr>
      <tr><td colspan="21">&nbsp;</td></tr>
      <tr><td colspan="21">&nbsp;</td></tr>
      <tr>
        <td align="left"><img src="./html_files/images/singan/3/0.png" width="80" alt=""></td>
        <td align="left"><img src="./html_files/images/singan/3/1.png" width="80" alt=""></td>
        <td align="left"><img src="./html_files/images/singan/3/2.png" width="80" alt=""></td>
        <td align="left"><img src="./html_files/images/singan/3/3.png" width="80" alt=""></td>
        <td align="left"><img src="./html_files/images/singan/3/4.png" width="80" alt=""></td>
        <td>&nbsp;</td>
        <td>&nbsp;</td>
        <td>&nbsp;</td>
        <td align="left"><img src="./html_files/images/consingan/3/gen_sample_0.jpg" width="80" alt=""></td>
        <td align="left"><img src="./html_files/images/consingan/3/gen_sample_1.jpg" width="80" alt=""></td>
        <td align="left"><img src="./html_files/images/consingan/3/gen_sample_2.jpg" width="80" alt=""></td>
        <td align="left"><img src="./html_files/images/consingan/3/gen_sample_3.jpg" width="80" alt=""></td>
        <td align="left"><img src="./html_files/images/consingan/3/gen_sample_4.jpg" width="80" alt=""></td>
        <td>&nbsp;</td>
        <td>&nbsp;</td>
        <td>&nbsp;</td>
        <td align="left"><img src="./html_files/images/ours/3/fake_0.png" width="80" alt=""></td>
        <td align="left"><img src="./html_files/images/ours/3/fake_1.png" width="80" alt=""></td>
        <td align="left"><img src="./html_files/images/ours/3/fake_2.png" width="80" alt=""></td>
        <td align="left"><img src="./html_files/images/ours/3/fake_3.png" width="80" alt=""></td>
        <td align="left"><img src="./html_files/images/ours/3/fake_4.png" width="80" alt=""></td>
      </tr>
    </tbody>
  </table>


  <p>&nbsp;</p>
  <p align="left">&nbsp;</p>
  <hr>

  <h2 align="left"><a name="network_freezing" id="network_freezing"></a>Network Freezing</h2>
  <p align="left">As mentioned in Section 4.2, when training all levels (i.e., without freezing), we observe substantial memorization of the training video, as shown here. </p>
  <table width="200" border="0" align="center">
    <tbody>
      <tr>
        <td align="center">Training Video</td>
        <td>&nbsp;</td>
        <td align="center">Training All Levels</td>
      </tr>
      <tr>
        <td align="left"><img src="./html_files/random/real_1.gif" width="195" alt=""></td>
        <td>&nbsp;</td>
        <td align="right"><img src="./html_files/freezing/fakes.gif" width="1000" alt=""></td>
      </tr>
    </tbody>
  </table>
  <p align="left">&nbsp;</p>
</div>
</body>
</html>

