Sun Dec 8th through Sat the 14th, 2019 at Vancouver Convention Center
In general I like the paper. The definition of activation regions is sound and it makes intuitive sense. The empirical evaluations are also well thought out. Specific comments: The authors state that usually there is no distinction between activation and linear regions. The way activation regions are defined in the paper, there is no reason to believe the two to be identical. Along the same lines, the fact that one is a union of the other (Lemma 3) is expected. While I understand and buy into the definition of an activation region, I think more intuition would be helpful. For example, why it suffices that A is arbitrary, why -1 to the power of a_z, etc. In (3), activation regions is not defined. I believe it is the union of individual activation regions. Line 39: it's unclear what is T at this point. It should be added: for a constant T. Theorem 5: conditions 1 and 2 should be simply restated as continuous RVs. Line 177: two heuristics for the second (5) - there is just one (5) In 3.2, the intuition is a little bit too shaky. Between "many times" and O(1) there are many options in between. In other words, if something is not 'many times' I would not conclude that it is O(1). F_learn is vaguely introduced, but no further discussion is provided. I find this confusing and suggest to remove the references to this set. There was work in the last ICML (Long Beach) on linear regions. If it is not cited, it should be added and discussed.
General comment: It is a nice addition to the previous paper  in that it actually counts the number of activation regions other than just the n-k volumes of the discontinuity points of the gradient (which these results use in the proof). Compared to previous works of counting linear regions, this paper looks at the number when randomly drawing the weights and biases (not necessarily independent). It is well written However it could benefit from some more intuitive explanations at times. For example, I have some comments that would be nice to see addressed in a revised version of the paper: - It is not very clearly stated in the paper (terms like "strongly suggests" are used without being alluded to after the rigorous theorem statement) how much training can you actually do so that the bound still holds and under which circumstances would it break down? By definition of a density C_bias seems to be <= 1, so that it wouldn't matter in the grand scheme of things but potentially get a tighter bound? Perhaps the authors are already working on this, but I'm curious: What are the challenges to actually show how the number of regions evolves over time with gradient descent? - Related to this point: In the experiments, the actual number of regions is higher than predicted but you don't offer an explanation. d/n < 1, thus C_grad is probably < 1 given the result in , and as C_bias should be <= 1, what does make the number grow so much during training? I find it nice however that at initialization, the number of regions pretty much matches the theoretical bound (could draw a horizontal line for theoretical prediction in the plot perhaps). - I am not sure why Lemma 6 i.e. invariance of the count to uniform bias scaling would actually imply invariance to non-uniform bias scaling which is in turn equivalent to uniform scaling of weights? You do mention that you don't prove it, but why should that be true intuitively? In general, scaling of the randomized weights seems to have rather big impacts of generalization properties of the final neural networks which is what makes me feel uncertain about this claim. - I think it is important to reveal the dependence on d/n through C_grad, C_bias - i.e. that they may grow with d/n / correlation but are "constant" in terms of number of neurons. You later state in 3.3 that "depth does not increase the local density of activation regions" - this sentence is misleading as this is perhaps approximately true for wide networks but strictly speaking not in general Minor comments for improvement: - It would be useful to explicitly state the results of previous papers on the max number of linear regions so that the reader can see the difference directly. - A figure illustrating bent hyperplanes would be very useful - Maybe it would be nice to define "fan-in" somewhere? - in 3.2. you write I and sometimes [a,b]
Definitions 1-2 and Lemmas 1-4 seem clear. The paper studies upper bounds on the number of linear regions of the functions represented by ReLU neural networks over a box of the input space, given certain conditions on the gradients of the neurons and the biases. The idea is to draw a distinction between the theoretically possible number and the number that might be observed in practice. The main result discusses how under certain conditions on parameters and gradients, the number of linear regions of the function represented by the network is usually small. Although the paper interprets the result as `depth independent', it is presented in a way that includes constants that may well be influenced by depth. The theorem could be interpreted as saying that there are regions of parameter space where the number of regions of the represented functions is small. This is not surprising, as claimed in the title of the paper. Nonetheless, the result is interesting in that it gives some details on the regions of the parameter space where this happens, even if this description includes somewhat less explicit conditions on gradients of neurons. The paper also seeks to explain how the requirements of the theorem might hold true at initialization and during training. It presents some experiments evaluating the linear regions observed along one dimensional paths on input space.