Paper ID: 528

Title: Proximal Deep Structured Models

The paper suggests reformulating proximal methods as recurrent neural networks (RNNs), exploiting a natural structural similarity between the two. The result is an algorithm for learning structured output tasks on images, which is used in experiments on image denoising, depth refinement and optical flow inference. The results show some (minor) improvement over previous results.
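To make the structural similarity concrete, here is a minimal sketch (an illustration for this review, not the authors' code) of unrolling ISTA for a LASSO objective min_z 0.5||Az - y||^2 + lam*||z||_1: each iteration applies a linear map followed by a proximal (soft-thresholding) non-linearity, which is exactly the shape of a recurrent layer with shared weights.

```python
import numpy as np

def soft_threshold(v, lam):
    # Proximal operator of lam * ||.||_1 (elementwise soft-thresholding).
    return np.sign(v) * np.maximum(np.abs(v) - lam, 0.0)

def unrolled_ista(A, y, lam, n_layers=50):
    # Each "layer" is one ISTA step: z <- prox(z - t * A^T (A z - y)),
    # unrolled for a fixed number of layers, like an RNN with tied weights.
    t = 1.0 / np.linalg.norm(A, 2) ** 2  # step size 1/L, L = Lipschitz const.
    z = np.zeros(A.shape[1])
    for _ in range(n_layers):
        z = soft_threshold(z - t * A.T @ (A @ z - y), t * lam)
    return z
```

In the paper's setting (as described above), the per-layer quantities become learnable parameters rather than being fixed by the optimization algorithm.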

The paper makes a nice conceptual step of embedding proximal optimization algorithms in a deep neural network framework as an RNN. This conceptual move is new as far as I know. It is illuminating for me and I believe it will be found useful in further work. The immediate benefits are nice but not too significant: the new view enables a natural GPU implementation of proximal optimization methods, and brings small (close to insignificant) empirical improvements. The weak side of the paper is its presentation quality: some notation is used without definition, and central concepts (like the dual function) are used without definition, which makes the paper hard to read for people not very familiar with proximal methods. The algorithm actually used is not presented in enough detail, and some of the experimental sections (specifically Section 3.3) contain formalism and concepts whose definitions are deferred to a reference, so they cannot be understood at all without reading a further paper. The paper would clearly benefit from being rewritten in a transparent and self-contained way.

More detailed comments:

Page 3: the thing called here a distance function is actually a metric (according to the comment), hence the term metric should be used. "Distance function" usually refers to something with fewer demands.

Page 3, bottom:
- I believe the third term in eq. 4 should include h_\alpha(x;w) inside the sum over \alpha, as a coefficient multiplying the inner product appearing there. This is because we get to eq. 4 by replacing g_\alpha(w^T_\alpha y_\alpha) with its expression as the conjugate function of g^*_\alpha. When I do this, h_\alpha(x;w) appears also in the third term.
- Giving the definition of the convex conjugate of a function g^*_\alpha would make the text much more readable. It would make it possible to see and justify the move from equation 3 to equation 4.

Eq. 5: sigma_p, sigma_ex, sigma_tau appear but are not defined. They are only explained half a page later. This should be changed.
W^*_{\dot,i} is not defined: no matrix format was defined for w, it is not clear what w^* is, and the notation {\dot,i} was not defined (assuming it is indeed Matlab's (:,i) notation). Hence I could not understand this step. In a similar manner, it is not clear to me why this step is termed 'deconvolution' in Figure 1.

Page 5, Figure 2: the algorithm box is not informative enough. Specifically:
- It is not clear how one loops over the examples in this algorithm. Is there an external loop over examples, so that each example is run through multiple inference iterations until convergence? Or are a single/few inference iterations done for each example (without convergence) while looping over examples? Are examples batched for gradient computation?
- What is the stopping criterion?
- Multi-loss: what exactly is the format of this multi-loss?

Page 6: w_{ho,\alpha} is not defined.

Page 6: prox_{T,x}(y,\lambda) was not defined before (prox_f(x_0) was defined earlier with two indices only) and it is not clear what the four sub-indices mean. I can understand that \lambda is the \lambda from eq. 6, but what is \tau? Is it related to \sigma_tau? How?

Page 7, eq. 7:
- The 'FlowNet model' seems to be the central term, but it is not explained here at all, making it very hard to understand Section 3.3 without reading [9]. I would expect this model to be explained here for self-containment.
- The notation 'ho' in w_{ho,\alpha} is not defined and not clear.
- The whole motivation for this formulation of optical flow is unclear: neither of the two terms is clear. The FlowNet term is not explained, and the l_1 regularization term is not justified; its index 'ho' is not clear, and I could not understand what it contributes to getting good optical flow.
- 'End point error': again, this error measure is not explained, so it is hard to understand the results without reading further papers. This should not be the case.
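For reference, the standard definition of the convex conjugate that this review asks to have stated (textbook material, not taken from the paper) is:

```latex
g^*_\alpha(z) \;=\; \sup_{y}\;\langle y, z\rangle - g_\alpha(y),
\qquad\text{and, for closed convex } g_\alpha,\qquad
g_\alpha(y) \;=\; \sup_{z}\;\langle y, z\rangle - g^*_\alpha(z).
```

The second (biconjugate) identity is what would justify replacing g_\alpha(w^T_\alpha y_\alpha) by a maximization over dual variables in the move from eq. 3 to eq. 4.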
General notes regarding the experimental results:
- It is not clear what H(x,W) is for each task.
- How does the result depend on the number of iterations of the network?

I did not read the supplementary material.

2-Confident (read it all; understood it all reasonably well)

The authors present a learning framework for continuous structured prediction tasks, based on using proximal methods to solve the inference problem associated with these tasks and expressing this as a recurrent neural network (or at least an unrolled RNN, where each iteration has different parameters).

The approach is novel, and the paper reads OK, apart from some issues commented on below.

I have found Fig. 1 very misleading, for the following reasons:
- It does not seem to correspond exactly to eq. (5). For example, according to the figure, the output z depends only on the input y.
- There is no need to put h as an output of the block, given that it is the same as the input, and each iteration receives it as an input anyway.
- In the top figure, n is the number of iterations, while in the figure below it refers to an arbitrary iteration. I suggest using t in the second case, as is done in eq. (5).

The experimental section shows convincing results, although it is difficult to assess how significant the improvements are with respect to the competitors. Some other questions remain unanswered: how many iteration blocks are considered (I think this is only specified in the last experiment)? Is the performance greatly affected by this number?

Minor comments:
- The sigma variables (gradient steps) in eq. (5), line 125, are only described much later in the paper (line 151). Please describe them before.
- Table 2: please specify in the caption which hyperparameters are represented.
- There is no need to repeat the l.h.s. of eq. (1) in eq. (2); it can be replaced by the r.h.s. of (1).

It is in general an interesting paper that deserves publication provided that the previous comments are addressed.
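The "unrolled RNN where each iteration has different parameters" reading can be made concrete with a small sketch (illustrative only, not the authors' implementation): each unrolled proximal step carries its own step size and threshold, and it is precisely these per-iteration quantities that become trainable.

```python
import numpy as np

def soft_threshold(v, lam):
    # Proximal operator of lam * ||.||_1.
    return np.sign(v) * np.maximum(np.abs(v) - lam, 0.0)

def untied_unrolled_ista(A, y, steps):
    # `steps` holds one (step_size, threshold) pair per unrolled iteration,
    # i.e. each "layer" has its own parameters rather than weights shared
    # across iterations as in a classical RNN.
    z = np.zeros(A.shape[1])
    for t, lam in steps:
        z = soft_threshold(z - t * A.T @ (A @ z - y), lam)
    return z
```

With all (step_size, threshold) pairs equal, this reduces to the weight-tied recurrent view; untying them is what the "different parameters per iteration" phrasing refers to.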

2-Confident (read it all; understood it all reasonably well)

The paper presents a way to train a certain class of energy-based models commonly used in computer vision tasks with potential functions based on deep learning outputs. The proposed method is a backpropagation-through-inference (BPTI) approach, where an iterative algorithm for optimizing the energy is unrolled for a finite number of steps, and each step is treated as a regular computation layer in a deep neural net. Although BPTI has been applied in many previous works, the present work is novel principally in that it applies to a general class of energy-based models over continuous-valued states, whereas previous works either applied only to discrete problems, or to more restricted classes of continuous problems. Experiments show state-of-the-art results on several applications.
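A minimal sketch of the BPTI idea on a toy chain-smoothing energy (an illustration for this review; the paper's actual energies and primal-dual updates differ): inference is a fixed number of gradient steps on the energy, and the training loss is then differentiable with respect to the energy parameter w through those unrolled steps. A finite difference is used here purely to exhibit that differentiability; in practice autodiff backpropagates through the steps exactly.

```python
import numpy as np

def chain_laplacian(z):
    # Gradient of the pairwise smoothness term 0.5 * sum_i (z[i+1] - z[i])^2.
    g = np.zeros_like(z)
    d = np.diff(z)
    g[1:] += d
    g[:-1] -= d
    return g

def energy(z, x, w):
    # E(z) = 0.5||z - x||^2 + 0.5 * w * sum of squared neighbor differences.
    return 0.5 * np.sum((z - x) ** 2) + 0.5 * w * np.sum(np.diff(z) ** 2)

def unrolled_inference(x, w, n_steps=30, lr=0.2):
    # BPTI forward pass: a fixed number of energy-minimization steps,
    # each treated as one layer of a deep network.
    z = x.copy()
    for _ in range(n_steps):
        z = z - lr * ((z - x) + w * chain_laplacian(z))
    return z

def loss_wrt_w(x, target, w):
    return 0.5 * np.sum((unrolled_inference(x, w) - target) ** 2)

def grad_w(x, target, w, eps=1e-5):
    # Central finite difference through the whole unrolled inference.
    return (loss_wrt_w(x, target, w + eps)
            - loss_wrt_w(x, target, w - eps)) / (2 * eps)
```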

The work addresses a problem of significant current interest, since it addresses the issue of how to combine state-of-the-art variational methods for computer vision tasks with deep learning methods, which have rapidly become the de-facto choice for nearly all applications. The proposed solution is practical, elegant, computationally efficient, and applicable to a wide variety of problems. The experimental results are also state-of-the-art for several tasks. Therefore, I believe that this work could have a significant practical impact on the field.

My only, relatively minor, reservation with the work is its novelty. If you are a specialist in state-of-the-art convex optimization methods for computer vision tasks, and you know about structured prediction models for deep learning, then you could argue that this is a fairly obvious thing to do, since it is basically backpropagation-through-inference (BPTI) applied to primal-dual methods for convex variational problems. On the other hand, this would be novel to anyone who is not a specialist in both those fields. So, one could argue that the novelty of this method is mainly of the type of cross-pollination between two fields that should have more interaction, but there are no fundamentally new methods here.

Delving a little deeper into this issue, one might consider what the potential alternatives to this method may be, and how obvious each of those alternatives may be. Prior work has mostly approached similar problems from a graphical models perspective, in which applying BPTI to belief propagation is the obvious solution. Relative to that, the main innovation of this work is in defining a BP-like inference algorithm that works for an important class of graphical models over continuous variables. Although graphical models people tend to shy away from continuous variables, BP (in this case, the max-product variant) still works conceptually in the continuous case.
The main difficulty is that the messages become functions, which are generally hard to represent. However, if one assumes that the local minimizations have stable closed-form solutions, then max-product BP may still be applicable. It would be interesting to see whether this is feasible, and if so, what relationship this method would have to the proposed method.

Another potential baseline would be block coordinate descent: i.e., instead of using the primal-dual method, alternately fix blocks of variables and optimize them jointly. For example, given "4-connected" cliques with variables on a regular grid, one could perform a method analogous to red-black Gauss-Seidel iteration, alternately fixing 'red' and 'black' pixels. This would reduce to independent, pointwise optimizations over the grid. If one again assumes that these pointwise optimizations have closed-form solutions, then BPTI should be applicable to this method as well.

Experimentally, the work seems to be strong, showing state-of-the-art results compared to other methods, although I am unfortunately not that familiar with these benchmarks and many of the other methods. One small caveat is that the image denoising dataset seems to be fairly small, and the Gaussian noise assumption a bit too simple, although it has apparently been used as a benchmark for other methods as well. It would also be nice to have an idea of the computational performance for the depth refinement and optical flow tasks, although I assume it would be similar to the denoising case.

A summary of my review along each of the evaluation metrics is provided below.

Technical quality: The method is sound, and the experimental results are thorough and state-of-the-art.

Novelty: Although no fundamentally new methods are developed here, this work constitutes a clever way to combine state-of-the-art techniques from different fields (high-dimensional convex optimization and deep learning).
Potential impact: Due to the relative simplicity of the method, its computational efficiency, and its state-of-the-art results, I think the method could (or at least should) be adopted widely for practical applications.

Clarity: The paper is written in a straightforward way and is easy to read.

============ REVISION POST-REBUTTAL:

After the review period, I happened to find some uncited work (see [A] below) that, at least superficially, seems similar. That work also involves doing backprop through a proximal method. I also mentioned in my review the possibility of doing backprop through coordinate descent, and it seems that this idea was also tried in [A]. So, I am now a little more worried about the technical novelty of this method. A future revision of the paper should definitely cite [A] and discuss the differences between this work and that.

[A] K. Gregor and Y. LeCun. Learning fast approximations of sparse coding. ICML 2010.
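The red-black Gauss-Seidel baseline suggested in this review can be sketched for a toy quadratic smoothing energy, where the pointwise optimizations do have closed-form solutions (a hypothetical example constructed for this review, not from the paper):

```python
import numpy as np

def energy(z, x, w):
    # E(z) = 0.5||z - x||^2 + 0.5 * w * sum of squared 4-neighbor differences.
    return (0.5 * np.sum((z - x) ** 2)
            + 0.5 * w * (np.sum(np.diff(z, axis=0) ** 2)
                         + np.sum(np.diff(z, axis=1) ** 2)))

def redblack_sweep(z, x, w):
    # One red-black Gauss-Seidel sweep. With one color held fixed, each pixel
    # of the other color has the closed-form minimizer
    #   z_ij = (x_ij + w * sum of neighbor values) / (1 + w * neighbor count),
    # and all same-color updates are independent (pointwise).
    z = z.copy()
    H, W = z.shape
    ii, jj = np.indices((H, W))
    for color in (0, 1):
        nb_sum = np.zeros_like(z)
        nb_cnt = np.zeros_like(z)
        nb_sum[1:, :] += z[:-1, :]
        nb_cnt[1:, :] += 1
        nb_sum[:-1, :] += z[1:, :]
        nb_cnt[:-1, :] += 1
        nb_sum[:, 1:] += z[:, :-1]
        nb_cnt[:, 1:] += 1
        nb_sum[:, :-1] += z[:, 1:]
        nb_cnt[:, :-1] += 1
        mask = (ii + jj) % 2 == color
        z[mask] = ((x + w * nb_sum) / (1 + w * nb_cnt))[mask]
    return z
```

Since each pointwise update exactly minimizes the energy in that coordinate, every sweep is guaranteed not to increase the energy; unrolling a fixed number of sweeps would give the BPTI variant contemplated above.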

2-Confident (read it all; understood it all reasonably well)

This paper presents a deep structured model that is able to learn non-linear functions to encode the dependencies between continuous output variables. Experiments on image denoising, depth refinement and optical flow estimation are very encouraging.

Learning the dependencies between continuous output variables is an important and challenging problem. This paper presents a proximal method that allows training deep structured models with continuous output variables end-to-end.

There are several positive points about this paper:
1. Continuous structured prediction is a challenging problem. This submission presents a very straightforward inference algorithm for it. Similar to [39], it is implemented in a recurrent neural network architecture. The implementation in the MXNet library seems to be decent and produces high accuracy at real-time speed.
2. Experiments on three different tasks show clear improvements.
3. The paper also has a very good literature review and covers the most relevant works.

Several aspects of this submission could be further improved:
1. More experiments are required. It is interesting to see that proximal methods are a special case of recurrent neural networks. However, it is not clear whether sharing weights across different layers helps in the proposed inference algorithm. It would be better to have an extra experimental comparison, e.g. learning parameters for each iteration separately vs. sharing parameters across iterations. Also, it would be interesting to see how much the end-to-end joint training (unary + proposed inference layer) helps.
2. It is not clear how the parameters of the proposed layer are initialized during training. It would be better to know what exactly is learned and what is still set manually.

3-Expert (read the paper in detail, know the area, quite certain of my opinion)

This paper exploits the procedure of the proximal method to guide the design of an RNN for solving continuous-valued structured models. One promising advantage of this particular type of network architecture is that it generally obtains better results together with improved efficiency. The effectiveness is demonstrated by experiments on image denoising, depth refinement and optical flow estimation.

Strong points of this work: (1) a unified solution for shallow and deep unary terms; (2) encoding the dependencies between continuous output variables into a deep RNN; (3) a better tradeoff between effectiveness and efficiency.

Good work. It is interesting to use the proximal method to guide the design of an RNN, which can provide a unified solution to many vision tasks. The results also validate its feasibility. For the rebuttal, I'd like to know more about the following:
(1) What is the importance of using non-smooth and non-differentiable functions? [4] adopted differentiable functions and also obtained satisfactory results.
(2) Some experiments are needed to illustrate the roles of non-shared weights and the multi-loss.
(3) Please provide the detailed form of the unary term for optical flow estimation.
(4) How many layers are used in the experiments?

3-Expert (read the paper in detail, know the area, quite certain of my opinion)

This paper proposes a novel method for deep structured prediction. The MRF-style structured prediction problem is formulated as a deep recurrent neural network. The proposed method is very interesting and novel compared to current methods: both traditional MRF-based methods (which normally cannot be trained end-to-end, and focus more on modeling the structure via higher-order potentials) and current deep network methods (e.g. FCN-based approaches, where the attention is on learning unary potentials, e.g. FlowNet [9]). Experiments on three different problems, image denoising, depth refinement, and optical flow estimation, showed good results compared with current methods.

I believe this is a strong paper with good technical contributions and strong experiments to validate the proposed method. The technical novelty includes the formulation of the MRF-style structured prediction problem as a deep recurrent neural network. It shows that efficient inference and end-to-end training can be done with proximal methods. Compared to traditional MRF-based methods (which normally cannot be trained end-to-end, and focus more on modeling the structure via higher-order potentials), this method can simultaneously model the unary and higher-order potentials. On the other hand, compared with current deep network methods (e.g. FCN-based approaches, where the focus is to learn good unary potentials, e.g. FlowNet [9]), the proposed method has the advantage of also modeling higher-order potentials, and thus does better at structured prediction. Although there are some restrictions on the choices of potential functions (e.g. f_i, f_{\alpha}) and proximal operators, the experiments have shown that the proposed method can be applied well to different problems with quite similar architectures (for image denoising and depth refinement).

Some minor comments:
- How many iteration blocks were used in experiment 1 (image denoising) and experiment 2 (depth refinement)?
- As mentioned in experiment 3, the model is initialized with FlowNet. Have the author(s) tried training this model from scratch instead? If so, how good is the performance compared to FlowNet? Also reporting training from scratch and comparing with the current numbers in Table 4 would give readers a much better understanding of the proposed method. If training from scratch is worse than FlowNet, it may indicate that joint training of the unary and structured terms is too difficult, and one needs to learn a unary potential first (e.g. FlowNet), then learn the structure later to improve the predictions.

2-Confident (read it all; understood it all reasonably well)