Sun Dec 8th through Sat the 14th, 2019 at Vancouver Convention Center
Post-rebuttal: After reading the authors' response and further consideration, I am downgrading my score to 7 from 9. While I am still very excited about the new perspective this work brings, I now realize that there is still a lot of work remaining in order to tie the theoretical results to real-world phenomena. Regardless of whether the paper gets accepted, I'd ask the authors to make the gap clearer and to lay out more clearly an agenda for future work that address the various issues discussed in the rebuttal, e.g.: approximation, empirical notions of filling, etc. =============== ORIGINALITY =========== The paper considers the functional space of polynomial networks as an algebraic object. They use tools from algebraic geometry to analyze the dimension of the Zariski closure of this space. This perspective greatly generalizes previous work on two-layer networks with quadratic activations. The paper is highly original in relating recent results from algebra to basic issues about neural networks. QUALITY & CLARITY ================= This work tackles head-on the problem of analyzing the functional space of polynomial varieties. It gives a clear overview of the model and an understandable interpretation of the technical results. The proofs deferred to the appendix use some high-powered recent results from algebra, but the main text does well in giving a high-level overview and extracts meaningful results such as Theorem 19 about bottlenecks. I did not have the background to verify all the math but to the extent that the quoted results in the cited papers apply, the main bounds claimed in the theorems are correct. The work is clearly of high technical quality. SIGNIFICANCE ============ This paper offers new insights how the architecture of a network corresponds to its expressive power. It substantiates common wisdom about the widths of the network being unimodal for expressive networks. It also shows that if the width becomes too small at any level, it can choke the expressivity of the network. It is highly significant to see rigorous justifications of these intuitions. The only drawback to the work is that it only considers exact expression and not approximation. Is there a way to talk about approximation using this language, perhaps using something like Proposition 5 to relate it to the exact case?
==== Summary ==== This paper studies deep neural networks with polynomial activations by mapping the parameters of a fixed network's architecture to its polynomial coefficients, whose image corresponds to an algebraic variety. This connection motivates studying the fundamental properties of these varieties, namely, the dimension of the variety and whether it is filling the functional space. The paper relates the dimension to the expressivity of the network, and the filling property is shown to be helpful for optimization. The article follows by bounding the dimension for various settings and giving sufficient conditions for the variety to be filling. ==== Detailed Review ==== The connection this paper makes between neural networks and algebraic varieties seems like a fascinating and promising direction for studying neural networks. However, it is a bit difficult to see how the current results lead to a better understanding of the expressiveness and optimization neural networks. 1. While the dimension of the variety is clearly linked in some sense to the expressiveness of the architecture, it is difficult to see how this measure translates to actual measures of interest to expressiveness analysis. The fact that the analysis depends on exact equality means that we cannot ask important questions such as how many hidden units are needed to approximate the functions computed by deeper networks, or ones with a higher degree activation functions. The only clue provided by the authors is that the dimension is an upper bound on the number of examples a network can interpolate, i.e., memorize. However, the naive dimension bound provided matches exactly with other works on memorization, i.e., it is proportional to the number of parameters in a network. 2. In terms of the filling widths, this seems more readily applicable. However, it appears that even for very small input spaces (e.g., 28x28 MNIST images) and squared activations the required minimal widths are already infeasible (e.g. for d0 = 768, d_h = 1, h = 3, d_2 will have to be at least 1e10). So, while the definition seems like it could be relevant, the sufficient conditions will not be met for most of the common architectures. On a more minor note/question, regarding the computational estimation of the ranks: are the ranks computed to give exact answers (e.g., by an exact computation of the determinant of an integer matrix), or are you using the standard floating-point numerical methods that are only estimates? I suggest the authors emphasize this aspect in the paper (in a footnote or appendix), because it is unclear if table 1 and 2 are just for intuition, or actual proved values (using an exact rank method). All of the above is not to detract from the foundations laid down by the authors to what seems like a fascinating direction for analyzing neural networks. This seems like an excellent start at what could in the future yield actual results that are relevant. The paper is also clearly written, and though I was not very familiar with the underlying math of algebraic varieties, it was still easy to follow and enjoyable to read. It is for this reason that I am marginally in favor of accepting this paper as-is because it could spark further ideas and discussion that might progress our understanding.
It would have been nicer to see more investigation of whether such networks can be trained to learn polynomials of a certain degree -- something more than just what is stated in section 2.3 Another point is that there may be advantages to use activations like sigmoid that have an infinite taylor series as they can be used to approximate polynomial of any degree with one hidden layer. So its well known that a polynomial of degree d in n variables can be *learnt* (not just represented ) using about n^O(d) hidden nodes in a 2 layer network with certain activations that have infinite taylor series. The writing style keeps the reader interested but it would be better to list the main results upfront somewhere near the introduction. Its harder to get the main results now as many Theorem statements are embedded in later sections.