Sun Dec 8th through Sat the 14th, 2019 at Vancouver Convention Center
Originiality and Quality: 1. The work is novel and the algorithm proposed for empirical minimization is novel. However, I have some issues with the motivation and experiments, listed under weaknesses below. 2. Theorem 3.1 is interesting in itself, because it applies to vectors which are not in the range of the generative model. Clarity: The paper is well written and ideas clearly expressed. I believe that others can reproduce the algorithm described. I only have a problem with the way the set $S(x,theta, tau)$ is defined in line 177, since the authors do not require the signs to strictly differ on this set. Significance: I think other researchers can build on Theorem 3.1. The conditions proposed for Theorem 3.3 are novel and could be used for future results. Weaknesses: 1. I am not completely convinced by the experimental strengths of this approach. To run the proposed algorithm, the authors need to run a descent procedure for 40 different networks from the training phase. In contrast, you could simply run vanilla Adam on the final network with 40 random initial points, and one of these restarts would reach the global minimum. It is not important that EACH initialization reach the global minimum, as long as AT LEAST one initialization reaches the global minimum. 2. While Theorem 3.3 is interesting, it does not directly influence the experiments because the authors never perform the search operation in line 3 of algorithm 2. Because of this, it provides a proof of correctness for an algorithm that is quite different from the algorithm used in practice. Although Algorithm 2 and the empirical algorithm are similar in spirit, lines 1 and 3 in algorithm 2 are crucial for proof of correctness. Clarifications: 1. For the case where $y= G(z) + noise$, where noise has sufficiently low energy, you would expect a local minimum close to $z$. Would this not contradict the result of Theorem 3.1? ---Edit after author response--- Thank you for your response. After reading your rebuttal and other reviews, I have updated my score to a 8. I think Table in the rebuttal and Theorem 3.1 are solid contributions. Regarding my criticism of the definition of S(x,tau,theta)- I only meant that defining the complement of this set may make things clearer, since you only seem to work with its complement later on (this did not influence my score).
In this work, the authors study the idea of surfing, the optimization of a non-smooth model by sequentially optimizing a smoothly varying sequence of models that interpolate between something very smooth and the desired non-smooth model. This sequence of models can be obtained, for example, by snapshots of the model during the training process. The authors provide empirical studies that show that some models which can not directly be optimized, can be successfully optimized by this process. This has significant ramifications for solving inverse problems using generative models, as optimization over GANs has been shows to succeed with quite few measurements. The authors present a clearly defined probability model for networks at initialization, for which they present analysis. The authors then show that if a sequence of networks is given by a discretization of a continuous flow (in which the global optimizer moves in a Lipschitz way), then a projected gradient descent procedure successfully optimizes the models. Limitations of the theoretical analysis are that the complexity of the algorithm depends on the number of pieces of the piecewise linear model G that are optimized within each step. A heuristic is provided for this, but ideally a full proof would be established. Despite this limitation, the paper does offer an insightful observation and analysis to the field. The paper is clearly written, with well chosen figures and easy to follow text. I think the paper could be improved by having a discussion about the tradeoffs involved. For example, the sequential optimization procedure sounds like it could be quite expensive. One would need to store a full sequence of models, and reconstruction time may be quite slow. I would enjoy hearing informed thoughts on the ramifications for all this additional computation involved.
The paper is in general well written. The results are built upon the result in (Hand and Voroninski, 2017), and (Bora et al. 2017). From my understanding, the first theorem is mainly built on (Hand and Voroninski, 2017), and the second theorem is mainly built on (Bora et al. 2017). For the second theorem, the result implies the deeper the network is, the smaller the delta should be. It would be better to discuss how tight is the analysis, and whether this dependency is necessary in practice. ==== After rebuttal I have carefully read the author's response and other reviewer's feedback. I have less concern about the usefulness of the theoretic results and the gap between the theory and the algorithm. I have changed my score from 6 to 7. I would expect a higher score with much stronger experiments, although proving the idea of the algorithm works itself is quite interesting.