NeurIPS 2019
Sun Dec 8th through Sat Dec 14th, 2019, at the Vancouver Convention Center
Paper ID: 5085
Title: Differentiable Convex Optimization Layers

Reviewer 1

*** update after author feedback *** The author feedback proposes a clearer exposition of the ASA form and more clarity about what happens when the solution map is nondifferentiable, both of which would be welcome additions to the submission. With these modifications, and further reflection on the possible significance of this work as a user-friendly tool, I have upgraded my assessment of the paper. *** end of update ***

Originality: The paper is of modest originality. It augments the existing idea of differentiating through conic programs by implicitly differentiating the optimality conditions, so that the user can specify a problem in DCP form and still take advantage of this technology.

Clarity: The paper is, for the most part, very clearly written, including nice devices like running examples to illustrate what is going on. Still, a major issue for me was that some of the key ideas introduced in Section 4 were not explained clearly and systematically. The example I found most frustrating was the term "jointly DCP", in the sense of "jointly DCP in x (the variable) and theta (the parameter)". This appears as an essential component of the definition of ASA form, but is not itself defined. It is clearly not the case (judging from the examples) that this requires joint convexity in x and theta, so what exactly does it mean? This really must be explained more clearly. Similarly, I found statements like "which ensures that the maps C and R are affine; in practice, this means that their derivatives are easy to compute" quite confusing. Does this "ensure that the maps C and R are affine" (which would then imply their derivatives are easy to compute)? Or does it mean that C and R could be taken to be non-affine functions, as long as their derivatives are easy to compute? Or does it mean something else entirely?

Quality: The paper appears, overall, to be technically sound.
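The idea summarized above can be illustrated with a minimal sketch (my own, not the paper's implementation): for an unconstrained quadratic program, implicitly differentiating the optimality condition gives the Jacobian of the solution map in closed form, which a finite-difference check confirms.

```python
import numpy as np

# Minimal sketch (not the paper's implementation) of differentiating
# through an optimization problem by implicitly differentiating its
# optimality conditions.
#
# For the unconstrained QP  minimize_x 0.5 x'Px + theta'x  (P positive
# definite), the optimality condition is  P x* + theta = 0,  so the
# implicit function theorem gives  dx*/dtheta = -P^{-1}.

rng = np.random.default_rng(0)
A = rng.normal(size=(3, 3))
P = A @ A.T + 3 * np.eye(3)  # symmetric positive definite

def solve(theta):
    """Solution map theta -> argmin_x 0.5 x'Px + theta'x."""
    return np.linalg.solve(P, -theta)

theta = rng.normal(size=3)
implicit_jac = -np.linalg.inv(P)

# Check the implicit Jacobian against finite differences.
h = 1e-6
fd_jac = np.column_stack([
    (solve(theta + h * e) - solve(theta - h * e)) / (2 * h)
    for e in np.eye(3)
])
assert np.allclose(implicit_jac, fd_jac, atol=1e-5)
print("implicit Jacobian matches finite differences")
```

The general conic case replaces the linear optimality condition with the KKT/residual conditions of the cone program, but the differentiation principle is the same.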
I found the work a little too dismissive of the situation in which the solution mapping is not differentiable. Dealing with nondifferentiability is far beyond the scope of this work, but the authors would be better served by admitting that there are many examples of convex optimization problems (e.g., linear programs) whose solution mappings are nondifferentiable functions of the problem data. Having said that, such mappings are differentiable almost everywhere, so whether this is an issue or not depends on how the points at which the derivatives are evaluated are chosen. The citation [4, Section 14] is a bit too broad, and does not sufficiently explain how the work under review deals with the nondifferentiable case. Does it issue a warning? Does it fail silently?

A much more minor comment is that in a couple of places in the paper the authors assert that the solution of convex optimization problems can be obtained "exactly" (e.g., implicitly in line 82, "we compute it exactly", and in line 382). In these cases it would be reasonable to say that these things can be computed "to high accuracy", or something a little more vague, since "exact computation" is a much stronger statement: for many convex optimization problems we do not know how to compute the solution exactly, and so would not know how to exactly compute the gradient of the solution mapping.

Significance: From a technical point of view the contribution is not that significant---the ideas behind it are relatively simple, or are extensions of non-trivial ideas that have become "normalized" as they have matured. Having said that, I think this paper is of quite high significance in the sense that if the implementation is done well and is easy to use, then (possibly very many) researchers are likely to make direct use of this work.
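The nondifferentiability point above can be made concrete with a one-line linear program (my own example, not from the paper): the solution map jumps at a single parameter value but is locally constant, hence differentiable, everywhere else.

```python
# Minimal illustration (not from the paper) of a nondifferentiable
# LP solution map that is nonetheless differentiable almost everywhere.
#
# LP: minimize theta * x subject to 0 <= x <= 1.
# Closed-form solution map: x*(theta) = 0 if theta > 0, 1 if theta < 0,
# and any point of [0, 1] at the tie theta = 0.

def lp_solution(theta: float) -> float:
    """Closed-form minimizer of theta * x over x in [0, 1]."""
    if theta > 0:
        return 0.0
    if theta < 0:
        return 1.0
    return 0.5  # arbitrary tie-break at the nondifferentiable point

def finite_diff(f, t, h=1e-6):
    """Central finite difference, a crude derivative estimate."""
    return (f(t + h) - f(t - h)) / (2 * h)

# Away from theta = 0 the solution map is locally constant: derivative 0.
print(finite_diff(lp_solution, 1.0))   # 0.0
print(finite_diff(lp_solution, -1.0))  # 0.0
# At theta = 0 the map jumps, so the finite difference blows up
# and no derivative exists.
print(finite_diff(lp_solution, 0.0))   # -500000.0
```

Whether a differentiation routine warns, fails, or silently returns an element of some generalized derivative at such a point is exactly the behavior the review asks the authors to document.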

Reviewer 2

--- after author response --- I thank the authors for the clarifications, and for the promise to improve the organization. --- original review ---

The paper paints a clear high-level picture of differentiating through (disciplined) convex programs. DCP is a very useful tool in prototyping but also in practical applications, so it is in my opinion a good building block for generic differentiable convex optimization layers.

Originality: 5/5 The paper relies on already-studied building blocks: the DCP formalism, and the residual approach to differentiation through a cone solver. However, combining these approaches to build a general-purpose differentiable convex optimization method is original and a valuable innovation. The ASA formalism, an extension of DCP which represents parameters via affine mappings, is novel and helpful for efficiency reasons.

Quality: 5/5 The paper is well supported. The two application examples illustrate very well the wide range of powerful things enabled by differentiable convex optimization. The code samples are a nice touch, showcasing how simple such tasks become using the proposed framework. Minor mistakes/typos:
- At line 316, some (all?) of the cited works unroll *gradient descent*, not *stochastic* gradient descent.
- The variational form of softmax in the supplementary material should also have a non-negativity constraint (it is there in the equations but not in the code).

Clarity: 4.5/5 At a high level the paper is very well written and smooth to read, and very clear about its contributions. Some of the deeper, more technical constructions should be explained a bit more clearly:
- The ASA form (Section 4.1) should be given constructively, rather than via the vague phrasing "our rule has one important exception...".
- Related to the above, I would suggest expanding the rigorous definition of DCP and the theorem at line 100 to give them more emphasis.
- The sparse canonicalization map: the example at lines 182-189 is pretty confusing. What are S, T, and \psi, and how do they relate to the desired Q and R? As it is, the example confused me more than it clarified my understanding.
- Figure 1 lacks some explanation: what is the color of the markers? Consider using x/o instead of colors.

Significance: 5/5 I believe overall this work is very significant. It is rather technical and engineering-based, but I believe it can have a high impact. DCP libraries like CVXPY are very useful in prototyping. A lot of recent work involves differentiating through optimization problems, and while for specific applications it is likely that specialized algorithms perform better, the proposed general approach will be very useful, at least to kickstart research projects. The neural architectures one can build with convex optimization hidden layers are powerful and promising.

Question: At line 209 you claim batching is done using a parallel loop. Are there settings where faster batching might be possible?
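The variational form of softmax that the review refers to can be checked numerically; the following sketch (my own, not the paper's code) verifies that softmax(z) maximizes z'x plus entropy over the probability simplex, attaining the log-sum-exp of z.

```python
import numpy as np

# Sketch (not from the paper's supplementary code) of the variational
# form of softmax mentioned in the review: softmax(z) maximizes
#     z @ x + H(x),   H(x) = -sum_i x_i log x_i,
# over the simplex {x : x >= 0, sum(x) == 1}; the non-negativity
# constraint the review flags is implicit in the domain of H.

rng = np.random.default_rng(0)
z = rng.normal(size=5)

softmax = np.exp(z - z.max())   # shift for numerical stability
softmax /= softmax.sum()

def objective(x):
    return z @ x - np.sum(x * np.log(x))

# The maximum value is log-sum-exp(z), attained at softmax(z)...
best = objective(softmax)
assert abs(best - np.log(np.exp(z).sum())) < 1e-8
# ...and it dominates the objective at random simplex points.
for _ in range(1000):
    x = rng.dirichlet(np.ones(5))
    assert best >= objective(x) - 1e-9
print("softmax attains the maximum:", best)
```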

Reviewer 3

If I understand correctly, the layer is somewhat like an auto-differentiation variant: it represents a complex but disciplined convex problem as a tree composed of bilinear subprograms, then applies a cone solver to generate the solution and differentiates with respect to the parameters to obtain the whole solution path, which is similar to LARS. I think the implementation is friendly enough for users like me, and I appreciate the effort to make it a library. The authors did not mention the concept of the condition number of the convex problem; I am therefore curious whether the innermost cone program is better conditioned than the original program, theoretically or empirically. Besides, I am curious whether this applies to non-smooth subprograms as well, e.g. with $\ell_1$ norms or ReLU functions inside, where subgradients have to be applied.
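On the nonsmoothness question, a scalar lasso-type example (my own illustration, not from the paper) shows the typical situation: the objective contains a nonsmooth $\ell_1$ term, yet the solution map itself is piecewise linear and differentiable at almost every parameter value, with kinks only at the threshold.

```python
import numpy as np

# Illustration (not from the paper) for the reviewer's question about
# nonsmooth subprograms. The scalar lasso-type problem
#     minimize_x  0.5 * (x - a)**2 + lam * |x|
# has the closed-form solution (soft-thresholding)
#     x*(a) = sign(a) * max(|a| - lam, 0).
# The |x| term is nonsmooth in x, yet the solution map a -> x*(a) is
# piecewise linear: differentiable everywhere except |a| = lam, i.e.
# differentiable almost everywhere in the problem data.

def soft_threshold(a, lam=1.0):
    """Closed-form solution map of the scalar lasso problem."""
    return np.sign(a) * np.maximum(np.abs(a) - lam, 0.0)

def solution_derivative(a, lam=1.0, h=1e-6):
    """Finite-difference derivative of the solution map at a."""
    return (soft_threshold(a + h, lam) - soft_threshold(a - h, lam)) / (2 * h)

print(solution_derivative(0.5))   # 0.0: solution pinned at zero
print(solution_derivative(2.0))   # 1.0: solution moves with the data
```

Note also that during canonicalization a term like $|x|$ is typically replaced by an auxiliary variable $t$ with linear constraints $-t \le x \le t$, so the cone program handed to the solver has no nonsmooth functions in it; the nonsmoothness resurfaces only in the solution map, as above.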