Export Reviews, Discussions, Author Feedback and Meta-Reviews

Paper ID:	763
Title:	Learning to Linearize Under Uncertainty

Current Reviews

Submitted by Assigned_Reviewer_1

Q1: Comments to author(s). First provide a summary of the paper, and then address the following criteria: Quality, clarity, originality and significance. (For detailed reviewing guidelines, see http://nips.cc/PaperInformation/ReviewerInstructions)

Summary:

The paper addresses two main issues with the current autoencoder representations for the video frame prediction task:

-- Using L2 loss to minimize the difference between the predicted and reference frame leads to blurry representations, as the network averages under uncertainty. A latent variable formulation is introduced to address this, where separate variables that are not coupled deterministically with the frame inputs are allowed to affect the output.

-- It is generally difficult to learn expressive representations, as the identify function from the most recent frame is a strong predictor of the result. The paper proposes decoupling the representation into a locally stable "what" component (the weighted average of the final feature map) and a locally linear "where" component (the phase, or soft argmax of the final feature map activations). A loss function enforcing that the phase of the predicted frame can be linearly extrapolated from the phases of the previous two frames.

The experimental results demonstrate mostly anecdotally that representations with the desired properties (what are where components, and latent variables) are indeed learned and positively impact video frame reconstruction quality.

Quality:

I did not see notable issues in the intuitions or technical implementation. The arguments are substantiated with intuitive results on a toy experimental dataset (NORB). Some questions:

-- Is the beta parameter from Eqns 2 and 3 learned. If not (as it is not present in Algorithm 1 pseudocode), then how was it set and what was its effect?

-- The paper mentions that "at test time can be selected via sampling, assuming its distribution on the training set has been previously estimated". It would be helpful to visualize how sampling delta affects the results. It would not only help us visualize the uncertainty that is being learned, but also verify the hypothesis above whether sampling is a viable options of getting believable-looking frames.

-- Deep architecture 3 has a bit of an unusual design. What is the intuition behind computing a fully-connected layer only to reshape it back into a convolutional layer? Does a fully-convolutional design work and if not, why not?

-- The exposition mentions the L2 reconstruction errors of different deep architectures, but I did not see a table with the values anywhere (or how they compare to the other frame-reconstruction methods). It would be helpful to have numeric estimates, not just visual-anecdotal ones.

Clarity:

The paper is clearly written, except a couple of spots. I had trouble parsing Table1 -- while succinct, without knowing the filter sizes and phase pooling sizes it is hard to intuit what happens in the architectures. I also had trouble following some of the exact details of Sec 4.1 for the same reason -- but the deep network experiment explanations were fine.

Originality:

The paper has several original ideas: 1) phase pooling, including its soft-argmax computation and the associated phase linear extrapolation loss 2) the introduction of the latent variables delta. To the best of my knowledge, these are original ideas and a creative way of using autoencoder models for video frame prediction.

Significance:

The paper expands the kinds of auto-encoder representations that can be learned from video and provide intriguing enough results that I believe may spur interesting follow-up work.