Paper ID: | 3466 |
---|---|

Title: | Is Deeper Better only when Shallow is Good? |

This is a good paper that suggests excellent directions for new work. The key point is captured in this statement: "we conjecture that a distribution which cannot be approximated by a shallow network cannot be learned using a gradient-based algorithm, even when using a deep architecture." The authors provide first steps towards investigating this claim. There has been a small amount of work on the typical expressivity of neural networks, in addition to the "worst-case approach." See the papers "Complexity of linear regions in deep networks" and "Deep ReLU Networks Have Surprisingly Few Activation Patterns" by Hanin and Rolnick, which prove that while the number of linear regions can be made to grow exponentially with the depth, the typical number of linear regions is much smaller. See also "Do deep nets really need to be deep?" by Ba and Caruana, which indicates that once deep networks have learned a function, shallow networks can often be trained to distill the deep networks without appreciable performance loss. On this note, the authors write: "Many of these works consider various measures of “complexity” that grow exponentially fast with the depth of the network, but not with the width." As stated, this is not quite right - these works describe measures of complexity that *can* grow exponentially fast with the depth. (Note also that for this reason, line 519 in the appendix should read "the number of linear regions...is *at most* r^{st}" - though this does not seem to affect the proof.) Figure 2 doesn't actually show any approximation curves, though it claims to do so. Rather, it shows two different distributions associated with the same fractal. It would be instructive to see plots of these approximation curves. In Figure 5, the main plot can be made clearer, indicating exactly what the curves are that are shown. As a minor point, "Cantor" should always be capitalized. Line 30: "analyzing" should be "analyze."

This paper is about expressive abilities of deep neural networks and the related gradient-based optimization processes for distributions generated by iterated function systems. The first result shows that such a distribution can be expressed efficiently by deep neural networks, but not by shallow ones. The second result shows that gradient-based algorithms are likely fail in processing such a distribution. A key assumption is Assumption 3 requiring that the positive class is contained in the set of points that are far away from the boundary of the n-th iterated set K_n by at least gamma. Note that the limiting set of K_n is a fractal set having no interior point in most cases. So the required margin gamma depends on n and would be extremely small when n is large. This observation leads to the reviewer's concerns over the main results: the nice depth separation argument is too special, and the evidence for the failure of gradient-based algorithms is not strong enough.

Originality: This work shows that a family of distributions (fractal distribution) can be used for having a separation argument between shallow and deep networks. Moreover, based on a notion of the approximation curve, we can study the dynamics of training deep neural networks. Quality: It seems that the submission is technically sound but I did not go through the proofs. Clarity: The submission is well written. Significance: The authors proposed an interesting depth separation argument based on fractal distributions. More importantly, they related this argument to the success of training deep neural networks by gradient-based algorithms. =================== Post-rebuttal comments: =================== Thanks for providing new experimental result. It might be good to add this result to the paper to support your conjecture.