Paper ID: 497
Title: Supervised Sparse Analysis and Synthesis Operators

Submitted by Assigned_Reviewer_6

Q1: Comments to author(s). First provide a summary of the paper, and then address the following criteria: Quality, clarity, originality and significance.
Summary of paper: This is a technical paper on supervised sparse coding. It defines a generalized form of the Lasso problem that relates some input variables x (e.g. the input image) to some hidden variables y (e.g. the underlying, analyzed or denoised form of the image) in least squares via two arbitrary weight matrices -- ||M1*x-M2*y||^2 -- adding an L1 penalty on a linear mapping Omega*y and an optional L2 penalty on y itself. This formulation is general enough to cover both analysis and synthesis (generative) problems. The paper solves this system for y(x) via a proximal iteration on auxiliary variables z = Omega*y -- an ADMM-style method -- and it also offers a fast approximation of y(x) via a network that contains an unwound, truncated ADMM with re-learned parameters, along the lines of Gregor & LeCun [11]. Finally, it proposes supervised parameter (mapping matrix) learning using stochastic gradient descent over an arbitrary problem-specific loss function on y(x), and to this end it derives some explicit formulae for cost gradients in terms of sign(Omega*y) and the matrices and auxiliary vectors involved. The method is illustrated on an image super-resolution problem and a polyphonic music transcription one. It seems to give quite good results on both, although I found this aspect difficult to judge and would bow to more expert opinions on this.
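For concreteness, the model and solver described in this summary can be sketched as follows. This is a minimal illustration of the reviewer's reading of the formulation, not the authors' implementation; the function name and the parameter names lam, mu, rho (the L1 weight, L2 weight, and ADMM penalty) are assumptions of this sketch.

```python
# Generalized Lasso of the summary, solved with ADMM on the split z = Omega @ y:
#   min_y 0.5*||M1 @ x - M2 @ y||^2 + lam*||Omega @ y||_1 + 0.5*mu*||y||^2
import numpy as np

def soft_threshold(v, t):
    """Proximal operator of t*||.||_1 (elementwise shrinkage)."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def generalized_lasso_admm(x, M1, M2, Omega, lam=0.1, mu=1e-3,
                           rho=1.0, n_iter=100):
    n = M2.shape[1]
    p = Omega.shape[0]
    z = np.zeros(p)          # auxiliary variable z ~ Omega @ y
    u = np.zeros(p)          # scaled dual variable
    # The y-update solves a fixed ridge-type system, so prefactor it once.
    A = M2.T @ M2 + mu * np.eye(n) + rho * Omega.T @ Omega
    b0 = M2.T @ (M1 @ x)
    A_inv = np.linalg.inv(A)  # fine for a toy problem; see conditioning remark
    for _ in range(n_iter):
        y = A_inv @ (b0 + rho * Omega.T @ (z - u))   # quadratic y-update
        z = soft_threshold(Omega @ y + u, lam / rho) # L1 proximal step
        u = u + Omega @ y - z                        # dual update
    return y
```

At realistic image sizes the explicit inverse would of course be replaced by conjugate gradients or a cached factorization, which connects to the conditioning concern raised later in this review.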

Review: This is a reasonable but somewhat pedestrian technical paper. AFAIK the precise model studied is new, and I agree that it is useful to have very general models of this kind as references and tutorial examples, especially as there are now so many variant models in L1 learning that it is difficult to keep track of all of them. However the techniques used -- proximal projection, ADMM, the Gregor-LeCun approximation network, SGD, L2 smoothing or limiting arguments for the local gradient/subgradient estimates -- are all straightforward and fairly standard for recent methods, and the derivations themselves contain no surprises. Also, for many of the most interesting special cases, more specialized algorithms are likely to be preferred: the iterations for a model this general are necessarily rather expensive (linear systems whose size equals the number of image pixels must be solved, at least in principle) and, depending on the exact M1, M2, Omega, potentially also very ill-conditioned. Still, on balance it will be useful to have this study recorded, so I am favourable to acceptance if space permits.

Further points:

- In the first equation of Proposition 1, "A" should presumably be "P_{Lambda^c}". The derivation of the gradient could perhaps be relegated to an appendix. Fig. 1 defines different A, U, B matrices; consider changing the notation there to avoid confusion.

- More information on the effect of K in SI-NN, on the (especially ambiguous?) cases where SI-NN fails but full SI-ADNN doesn't, and on the degree to which the specific loss influences the learned encoders would be useful.

- Just writing "matrix inverse" for Q, B, etc., is not very satisfying. These very large systems of equations are presumably solved by conjugate gradients or similar, and this will fail if the matrices become too ill-conditioned. Please comment on the methods actually used.
Q2: Please summarize your review in 1-2 sentences
A competent and useful but ultimately fairly straightforward (given the current state of technology) generalization of Lasso to cover both analysis and synthesis problems. There are no major surprises or simplifications, but the paper is still worthwhile and probably worth accepting if space permits.

Submitted by Assigned_Reviewer_7

Q1: Comments to author(s). First provide a summary of the paper, and then address the following criteria: Quality, clarity, originality and significance.
The paper proposes to learn a general analysis/synthesis sparse model for inverse problems.

In Section 2, the proposed formulation (1) is very general, even if it cannot be considered a new model. However, the purpose of the paper is to learn the parameters of this general model. A short discussion shows how to recover more specific models from (1). Some parts of this discussion are, from my point of view, out of the scope of the paper: they do not bring anything to the main subject (learning the parameters) and are well known to most potentially interested readers. I am mainly thinking of the discussion on shift-invariant analysis operators.

Section 3 is the heart of the paper. It is well written, especially the derivation of the gradient.

Section 4 discusses how to speed up the practical implementation. A lot is written only to conclude in the end that the ADMM algorithm is stopped after K iterations, while the paragraph on the neural networks is hard to follow and deserves more explanation.

Section 5 presents the experimental results. This section is a bit disappointing for the following reasons:
- In the first experiment, the authors compare their general model to an interpolation model and two analysis models. I expected at least one synthesis model with, for example, an undecimated wavelet dictionary.
- As there exist algorithms to learn analysis and synthesis priors, I expected some comparison with this state of the art. Indeed, this is the only way to know whether learning such a general model is worthwhile!
- The conclusions of the second experiment do not reflect Fig. 2. Indeed, Fig. 2 tells us that the nonnegative synthesis model behaves very well. I am therefore not sure that one must spend time learning model (1).

Q2: Please summarize your review in 1-2 sentences
A well-written paper studying a supervised method for learning a general synthesis/analysis model. The experiments and the discussion of the practical results are very disappointing compared with the rest of the paper.

Submitted by Assigned_Reviewer_8

Q1: Comments to author(s). First provide a summary of the paper, and then address the following criteria: Quality, clarity, originality and significance.
Paper summary.
The authors propose a supervised learning procedure for a general sparse representation model. In their model, given an observation, a prediction is made by solving a strictly convex optimization problem parametrized by three linear operators. Their learning procedure performs empirical risk minimization (ERM); to this end, the gradients with respect to the linear operators are derived. An approximate model is given in which a particular solution method (ADMM) is unrolled for only a few iterations and gradient-based ERM is again possible. The two models are evaluated on two tasks, image super-resolution and music transcription, and shown to perform well.

This is a very well written paper with an interesting if somewhat incremental contribution.

The model (1) is interesting in the sense that it is very general; the authors
propose a gradient-based learning procedure to perform empirical risk
minimization for differentiable loss functions. This indeed seems novel and
is technically sound. The derivation in Section 3 is very clear, but somewhat
tedious and technically straightforward.

The approximation in Section 4 is again interesting but a straightforward
extension of [11]. An interesting generalization of the model shown in Figure
1 would be to use different parameters (A,B) for each layer. Learning would
still be feasible but the model would have larger capacity for the same
computational expense.

The experiments are carefully done and demonstrate the usefulness of the
authors' contributions on two relevant tasks.

Overall a nice paper that is relevant to the NIPS community.
Q2: Please summarize your review in 1-2 sentences
A very well-written paper with clear contribution and good experiments.
Author Feedback

Q1:Author rebuttal: Please respond to any concerns raised in the reviews. There are no constraints on how you want to argue your case, except for the fact that your text should be limited to a maximum of 6000 characters. Note however that reviewers and area chairs are very busy and may not read long vague rebuttals. It is in your own interest to be concise and to the point.
We thank the reviewers and the chair for the time and effort dedicated to reviewing our work and for such constructive feedback. We are happy with their support of this work.

All their suggestions will be carefully addressed in the camera-ready/revision of the paper. Below we address the most important comments, answering each reviewer separately.

Comments for Assigned_Reviewer_6 :

We want to clarify that the proposed algorithm for learning the model parameters is not restricted to work with a particular pursuit algorithm. One could choose problem-specific solvers for different variants of the model (1). We used ADMM to motivate the architecture of the fast encoders presented in Section 4. These encoders are significantly faster than the pursuit algorithms (for solving problems restricted to data following the same distribution as the training samples). We will clarify this in the text.

We thank the reviewer for pointing out the typos. We will correct the text accordingly.

We will include a discussion on the effect of the number of layers, K. In terms of the quality of the approximation, around 10 layers, a number similar to that used in deep-learning works, has proven to be enough for a good representation, as depicted in Figure 2. Also, the improvement obtained by adding layers starts to saturate around that value. It is worth mentioning that in real-time applications, the selection of K would probably also be constrained by the available computational budget or the required input-output latency. Contrary to standard optimization, this possibility of adapting K to the computational resources is a key advantage of this type of optimization.

Comments for Assigned_Reviewer_7 :

In the fast implementation discussed in Section 4, we truncate the iterations of the ADMM algorithm only to motivate the architecture of the neural network. After learning the parameters, the neural network has a computational load identical to that of the truncated algorithm, but its performance is significantly superior (similar to the one obtained by running the ADMM algorithm for an order of magnitude more iterations).
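The truncated encoder can be sketched as follows. This is a schematic illustration rather than the exact code used in the paper; W, S, and the threshold t are stand-in names for parameters that would be initialized from the ADMM matrices and subsequently re-learned by backpropagation.

```python
# K-layer encoder obtained by unrolling the ADMM iteration: one ADMM step
# only needs Omega @ y, which is affine in x and (z - u), so the matrices
# W and S absorb the corresponding linear maps and each layer reduces to a
# linear map, an elementwise shrinkage, and a dual update.
import numpy as np

def soft_threshold(v, t):
    """Proximal operator of t*||.||_1 (elementwise shrinkage)."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def unrolled_admm_encoder(x, W, S, t, K=10):
    """Forward pass with K layers sharing the same parameters (W, S, t)."""
    z = np.zeros(S.shape[0])   # plays the role of the split z ~ Omega @ y
    u = np.zeros_like(z)       # scaled dual variable
    for _ in range(K):
        v = W @ x + S @ (z - u)        # affine layer, stands in for Omega @ y
        z = soft_threshold(v + u, t)   # shrinkage nonlinearity
        u = u + v - z                  # dual (residual) update
    return z
```

Training would then backpropagate the task loss through these K layers to update W, S, and t, in the spirit of [11].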

We dedicated considerable space to the convolutional setting, since we believe that presenting a principled method for designing and training convolutional networks is an important contribution of our work. This work provides a new angle and perspective on convolutional networks.

If space permits, we will include comparisons against synthesis models, with both analytical and learned operators, in the super-resolution example.

We want to point out that the non-negative synthesis model is a particular instance of the model presented in (1).

Comments for Assigned_Reviewer_8 :

We believe that extending the work of [11] to the analysis (and in particular convolutional) setting is an important contribution of our work. The idea of extending the model to allow different parameters per layer is indeed very interesting and worth exploring.