
Submitted by Assigned_Reviewer_1
Q1: Comments to author(s). First provide a summary of the paper, and then address the following criteria: Quality, clarity, originality and significance. (For detailed reviewing guidelines, see http://nips.cc/PaperInformation/ReviewerInstructions)
Summary:
The paper addresses two main issues with the current autoencoder representations for the video frame prediction task:
- Using L2 loss to minimize the difference between the predicted and reference frame leads to blurry predictions, as the network averages under uncertainty. A latent variable formulation is introduced to address this, where separate variables that are not coupled deterministically with the frame inputs are allowed to affect the output.
- It is generally difficult to learn expressive representations, as the identity function from the most recent frame is a strong predictor of the result. The paper proposes decoupling the representation into a locally stable "what" component (the weighted average of the final feature map) and a locally linear "where" component (the phase, or soft argmax, of the final feature map activations). A loss function is introduced enforcing that the phase of the predicted frame can be linearly extrapolated from the phases of the previous two frames.
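The "what"/"where" decomposition via soft argmax described above can be sketched in a few lines. This is a minimal 1-D NumPy illustration reconstructed from the review's description, not the paper's code; the function name, the single pooling region, and the default beta value are assumptions:

```python
import numpy as np

def phase_pool(region, beta=10.0):
    """Soft "what"/"where" decomposition of one pooling region.

    "what"  -- softmax-weighted average of the activations (a soft max),
    "where" -- soft argmax: the softmax-weighted expected position
               (the "phase") of the activation peak.
    """
    pos = np.arange(len(region), dtype=float)   # spatial positions
    w = np.exp(beta * (region - region.max()))  # stabilized softmax weights
    w /= w.sum()
    what = float(np.dot(w, region))             # magnitude component
    where = float(np.dot(w, pos))               # phase component
    return what, where
```

Because `where` is a differentiable, approximately linear function of the peak's position, a loss can penalize deviation of the predicted phase from a linear extrapolation (e.g. 2*p_t - p_{t-1}) of the two previous phases, as the review describes.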
The experimental results demonstrate, mostly anecdotally, that representations with the desired properties ("what" and "where" components, and latent variables) are indeed learned and positively impact video frame reconstruction quality.
Quality:
I did not see notable issues in the intuitions or technical implementation. The arguments are substantiated with intuitive results on a toy experimental dataset (NORB). Some questions:
- Is the beta parameter from Eqns. 2 and 3 learned? If not (as it is not present in the Algorithm 1 pseudocode), then how was it set and what was its effect?
- The paper mentions that "at test time can be selected via sampling, assuming its distribution on the training set has been previously estimated". It would be helpful to visualize how sampling delta affects the results. It would not only help us visualize the uncertainty that is being learned, but also verify whether sampling is a viable option for getting believable-looking frames.
- Deep architecture 3 has a bit of an unusual design. What is the intuition behind computing a fully-connected layer only to reshape it back into a convolutional layer? Does a fully-convolutional design work, and if not, why not?
- The exposition mentions the L2 reconstruction errors of different deep architectures, but I did not see a table with the values anywhere (or how they compare to the other frame-reconstruction methods). It would be helpful to have numeric estimates, not just visual/anecdotal ones.
Clarity:
The paper is clearly written, except for a couple of spots. I had trouble parsing Table 1: while succinct, without knowing the filter sizes and phase pooling sizes it is hard to intuit what happens in the architectures. I also had trouble following some of the exact details of Sec. 4.1 for the same reason, but the deep network experiment explanations were fine.
Originality:
The paper has several original ideas: 1) phase pooling, including its soft-argmax computation and the associated phase linear extrapolation loss, and 2) the introduction of the latent variable delta. To the best of my knowledge, these are original ideas and a creative way of using autoencoder models for video frame prediction.
Significance:
The paper expands the kinds of autoencoder representations that can be learned from video and provides results intriguing enough that I believe they may spur interesting follow-up work.
Q2: Please summarize your review in 1-2 sentences
The paper introduces several different ideas for learning video autoencoder-based representations. It decouples the representation into 3 separate parts (the "what", the "where", and the "prediction uncertainty") and introduces a novel loss and a setup to train it. The paper has sufficient novelty and significance for publication. The experimental section succeeds in illustrating the expected representation properties, although numerical evaluation and more comparisons to related work would be of additional help.
Submitted by Assigned_Reviewer_2
Q1: Comments to author(s). First provide a summary of the paper, and then address the following criteria: Quality, clarity, originality and significance. (For detailed reviewing guidelines, see http://nips.cc/PaperInformation/ReviewerInstructions)
This paper proposes a new loss and architecture for learning linearized features from unlabeled video frames. It also introduces latent variables, which are nondeterministic functions of the input, into the architecture to solve the problems caused by inherent uncertainty in video.
The paper aims to train deep feature hierarchies in a weakly-supervised manner, which is an emerging research direction. One novel idea is to learn linearized features. An explicit code representation for every single frame is given, which makes it easy to linearly interpolate or extrapolate new codes for new frames. The idea of coping with the inherent uncertainty in video is also interesting and worth trying.
The presentation is rather technical, requiring the readers to be familiar with details in this area. There are several problems with this paper. Firstly, the title might be too broad, as the proposed method is designed for video data and cannot deal with, for instance, text data. Secondly, the experimental part seems insufficient for several reasons. The frames are all of small size (32x32), and the dataset for the second experiment is also a small toy dataset. Although the authors claim to learn features from natural video, the frame triplets for the experiments are mostly from simulated videos. Natural movie patches are used in one experimental setting in the paper, but it only shows us the learned filters as results, which is not enough to see the effectiveness of the method on natural videos. The exact L2 prediction errors for the several experimental settings in 4.2 are also missing; we can only compare the results by the generated frames, without quantitative analysis. Moreover, it would be good if the authors could provide results for translation and scaling appearing in video frames, and spend more effort on explaining the architectures in Table 1.
Questions:
1) What is the exact setting for the value of a in equations (1) and (6) in the experiments?
2) For (6), is there still a linear relationship between z_{t-1}, z_t and z_{t+1}? If not, why is it said to be learning linearized features?
Q2: Please summarize your review in 1-2 sentences
The paper proposes novel ideas to cope with the inherent uncertainty in video and learn linearized features from video frames. However, the experimental section seems insufficient for a NIPS paper.
Submitted by Assigned_Reviewer_3
Q1: Comments to author(s). First provide a summary of the paper, and then address the following criteria: Quality, clarity, originality and significance. (For detailed reviewing guidelines, see http://nips.cc/PaperInformation/ReviewerInstructions)
The authors present a method to linearize video features in latent space while accounting for uncertainty in the features as a function of frame progression. There are some good ideas in this paper, but they feel unready and are not supported by convincing experiments demonstrating the linearization and the precise role of the latent variable delta at test time.
Q2: Please summarize your review in 1-2 sentences
The authors present a deep learning architecture tailored towards video and aim to linearize the complex transformations involved in video processing. While the ideas are good and preliminary experiments are shown, the model seems unready and is not convincing as a linearization of video content.
Submitted by Assigned_Reviewer_4
Q1: Comments to author(s). First provide a summary of the paper, and then address the following criteria: Quality, clarity, originality and significance. (For detailed reviewing guidelines, see http://nips.cc/PaperInformation/ReviewerInstructions)
This paper introduces a new loss function and a novel deep learning architecture to learn locally linearized features from video. This paper is well written and organized. The problem studied in this work, i.e., learning locally linearized features from video in order to predict video frames, is an important and interesting problem to study. The proposed architecture and objective in Eq. (7) seem novel and very interesting.
The main concern is about the empirical evaluation.
Unlike the other works, e.g., [8][12][13][14], the authors only train and evaluate the proposed approach on simple toy video sequences. For real-world applications, natural video sequences should be used to demonstrate the power of the proposed approach.
The current results can only be evaluated subjectively, i.e., different judges may have different opinions on whether a prediction result is correct or incorrect (correct to what degree or incorrect to what degree). Therefore, it would be nice to provide a quantitative measure for all the outputs.
Several related works were mentioned (e.g., [8][12][13][14]), but none of them was compared to demonstrate the effectiveness of the proposed approach. This makes the justification of the proposed approach unconvincing.
Q2: Please summarize your review in 1-2 sentences
This paper introduces a new loss and architecture to learn locally linearized features from video.
The main idea is to train a generative model to predict video frames. The experiments show the effectiveness of the proposed approach.
Submitted by Assigned_Reviewer_5
Q1: Comments to author(s). First provide a summary of the paper, and then address the following criteria: Quality, clarity, originality and significance. (For detailed reviewing guidelines, see http://nips.cc/PaperInformation/ReviewerInstructions)
Section 3.1 seems unnecessarily hard to understand. The models could be better described by fixing Figure 2: 1) making the notation consistent (x_1 vs. x_t), 2) putting z_t, z_{t-1}, z_{t+1}, F_W, G_W explicitly in the picture, 3) in the text, matching the cosine label in Figure 2 to the math of (1) via z.
Sections 3.2 and 4.3 didn't make sense to me. Indexing examples in this way does not conform to any notion of uncertainty I am familiar with (but I would be happy to be enlightened with references!)
Q2: Please summarize your review in 1-2 sentences
The authors propose an autoencoder architecture on triples that attempts to linearly interpolate in some latent space to find smooth trajectories. I am rather confused by the "uncertainty" in the paper, as there does not appear to be any (beyond any notion of uncertainty one might argue for in any autoencoder). Delta appears to be an interpolation parameter. Other than that, the paper is interesting, and the results are neat.
Q1: Author rebuttal: Please respond to any concerns raised in the reviews. There are no constraints on how you want to argue your case, except for the fact that your text should be limited to a maximum of 5000 characters. Note, however, that reviewers and area chairs are busy and may not read long vague rebuttals. It is in your own interest to be concise and to the point.
We sincerely thank all the reviewers for their thoughtful reviews.
Reviewer 1:
1) Thank you for pointing out this relevant related work by Desjardins et al. Similar to this work, our experiments on NORB show that linearization can implicitly disentangle factors of variation (rotation angles).
2) We agree that applying our method to action recognition datasets would be a good test in future work.
Reviewer 2:
1) Our work is presented in the most elementary way we found possible.
2) The title is intentionally broad because our method can be applied to any temporally coherent data.
3) Indeed, it is difficult to quantify the performance of learned features in the unsupervised setting. However, our experiments demonstrate that the learned features linearize the latent factors of variation in NORB. Similar evaluations were carried out recently in [3]. See comment 1 to Reviewer 6.
4) We did not report the L2 prediction error because it is not always commensurate with the visual quality of the predicted image, even more so if the temporal sequence is not fully deterministic. Nevertheless, we can report a normalized L2 prediction error in the final version.
5) For an explanation of the architecture, please see comment 4 to Reviewer 4.
6) We will elaborate on the experimental details with natural and simulated video in the final version.
7) (Q1) a = [2, -1] in Equations 1 & 6.
8) (Q2) The \delta variable can be seen as a slack variable in code space, allowing the predicted code to be shifted slightly in order to generate the ground truth. However, we neglected to say that not only the dimensionality, but also the norm of \delta is constrained, enforcing a nearly linearized code.
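The slack-variable reading of \delta described above can be sketched as follows. This is an illustrative NumPy reconstruction from the rebuttal text, not the paper's implementation; the function name, the projection used to constrain the norm, and the eps value are assumptions:

```python
import numpy as np

def predict_code(z_prev, z_cur, delta, eps=0.1):
    """Linearly extrapolate the next code from the two previous codes,
    then add a norm-constrained slack term delta, so the predicted
    code can shift slightly while staying nearly linear."""
    norm = np.linalg.norm(delta)
    if norm > eps:                       # project delta onto the eps-ball
        delta = delta * (eps / norm)
    return 2.0 * z_cur - z_prev + delta  # linear extrapolation + slack
```

Minimizing the reconstruction loss over such a constrained \delta lets the decoder match the ground-truth frame while keeping the code trajectory close to linear.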
Reviewer 3:
1) See comment 1 to Reviewer 6.
2) The mentioned works often rely on much more supervision than our method. For example, [8] trains in a semi-supervised setting on the COIL dataset, which, like NORB, also consists of sequences of rotating objects. Although [13] and [12] are trained on natural video sequences, they are both trained with supervision; for example, [13] is only trained to mimic the activations of a network trained with supervision. [14] is the only generative model in this list that is trained in an unsupervised manner; however, it also relies on mainly qualitative evaluation.
3) Please see comments 3 & 4 to Reviewer 2.
Reviewer 4:
1) See comments 3 & 4 to Reviewer 2.
2) We did not try learning \beta in our implementation. We chose it manually based on the visual quality of the predicted images. We found that it was necessary to choose \beta large enough (>5) to produce approximately unimodal distributions of activations within each pooling region. We also found that once \beta was large enough, our method was quite robust to its exact value.
3) We can visualize predictions corresponding to different \delta samples in the camera-ready. Although these predictions do not necessarily correspond to the lowest L2 prediction error, they are more visually plausible. Therefore, the \delta formulation implicitly modifies the L2 metric to account for uncertainty.
4) Our intuition for including a fully-connected layer in our architecture follows from the fact that the latent variables in the NORB dataset have no spatial structure. That is, the rotation angles are global parameters that correspond to the entire image. Fully convolutional architectures performed worse in our experiments.
5) We can include the L2 prediction errors; however, their relevance is questionable. See comment 4 to Reviewer 2.
6) We will rephrase some parts of Section 4.1 to make them clearer.
Reviewer 5:
1) In Sec. 4.3 we manually introduce uncertainty in temporal speed by skipping frames randomly with equal probability. The interpretation is that \delta switches between these two possibilities, which implicitly modifies the L2 loss to account for uncertainty. See comment 8 to Reviewer 2.
2) The notation in Figure 2 will be made consistent with the paper.
Reviewer 6:
1) The objective of this paper is to introduce temporal linearization as a principled prior for unsupervised feature learning. It also proposes novel operators for achieving this goal, and addresses the problem of uncertainty in prediction. Promising results are presented on a controlled dataset where the true latent factors of variation are known. This would be more difficult to demonstrate in natural unlabeled video because the true factors are unknown. Similar evaluations were used in [3], Hadsell et al. CVPR 2006, and many other related works. This paper presents the building blocks towards achieving the ultimate goal of learning useful features from natural unlabeled videos. We believe that a complete solution to this broad and difficult problem will require a much larger effort that warrants many other follow-up publications.
2) See comment 3 to Reviewer 4.
