Paper ID: 9262

Title: The Geometry of Deep Networks: Power Diagram Subdivision

This paper studies the geometry of deep networks with piecewise affine and convex nonlinearities using the max affine spline operator (MASO) framework (BB18a, BB18b). The authors establish a connection between the input space partitioning of a neural network and power diagrams, and present analytical formulas for the power diagrams and decision regions. The paper is well written and presents interesting theoretical results. It would be great if the authors could provide more details on the computational aspects. Please see below for detailed comments/questions.

(0) It is obvious that deep neural networks with ReLU activations (and more generally MASOs) are piecewise linear functions whose input partitions are convex polytopes (according to which max is active). Can you please clarify how the power diagram notion advances this knowledge?
(1) In Eq. (1), the '.' (dot) in the notation [A^(l)]_{k,r,.} does not appear to be defined. Could you please clarify this?
(2) On page 2, is the notation z^{l}(x) used to explicitly state its dependence on the input x? Is there any difference between x and boldface x?
(3) Can you please clarify what you mean by 'orthogonal filters' on line 158, page 5?
(4) It would be great if the authors could provide details on computing the centroids and radii for all partitions. What is the largest size that can be tackled? Can they be approximated, e.g., by Monte Carlo?
(5) Can you explain how the centroids can be used in semi-supervised learning? (line 216, page 7)
(6) The proof of Theorem 4 does not appear to be included in the supplementary file.
(7) Can you please summarize how this work differs from the earlier works [BB18a, BB18b]?
(8) Notational issues in the supplementary material: in the proof of Theorem 1, what does the calligraphic V in V([t]_1) stand for? Is the lower-case b the same as the upper-case B?
(9) The proof of Lemma 6 needs clarification in most of the steps. In particular, how does the second inequality follow?
(10) Can you provide the precise statement of the result from [Joh60] and its details (page number, theorem number, etc.)?
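For context on the central object of the paper: a power diagram assigns each point to the site minimizing the "power distance" ||x - mu_k||^2 - r_k^2, reducing to an ordinary Voronoi tessellation when all radii are equal. The sketch below is a minimal illustration of this assignment rule with toy sites; it is not code from the paper.

```python
import numpy as np

def power_cell(x, centroids, radii):
    """Index of the power-diagram cell containing x.

    Each site mu_k carries a radius r_k; x is assigned to the cell
    minimizing ||x - mu_k||^2 - r_k^2. Equal radii -> plain Voronoi.
    """
    power = np.sum((centroids - x) ** 2, axis=1) - radii ** 2
    return int(np.argmin(power))

# Two sites on a line: with equal radii, the boundary is the midpoint.
mu = np.array([[0.0], [4.0]])
r = np.array([0.0, 0.0])
assert power_cell(np.array([1.9]), mu, r) == 0
assert power_cell(np.array([2.1]), mu, r) == 1

# Enlarging the second site's radius shifts the boundary toward site 0,
# which is what distinguishes a power diagram from a Voronoi diagram.
r = np.array([0.0, 2.0])
assert power_cell(np.array([1.9]), mu, r) == 1
```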

The paper shows that layered feedforward neural networks composed of affine transforms and convex nonlinearities (e.g., ReLU) can be seen as implementing a series of subdivisions of the input space, where each subdivision takes the form of a power diagram (a generalization of Voronoi tessellations). This is proven by analysis of single units, single layers, and layer-to-layer transitions. The work hinges on the description of layers as implementing a max operation over affine transforms (max affine spline operators, MASOs), which have apparently been studied in this context previously. The use of power diagrams to describe MASO output in this context seems new.

The paper goes on to show some analysis of the log-distances of training points in the CIFAR dataset from their power diagram centroids at different network layers. Finally, there is some analysis of decision boundary curvature. Figure 3 seems to show that the centroids at deeper and deeper layers which match a specific input point become less descriptive of that point, apparently due to larger and larger radii (regions are described by centroids and radii). This result seems somewhat counterintuitive and hints that the centroids do not behave in a simple, comprehensible manner. Perhaps the figure should be better explained in the paper.

There is no analysis of the neural network training process or of how the power diagram representation might help in understanding a network training session. The main theoretical result provides a comprehensive description of what the most popular basic network architectures can do. It would be interesting to see if that result could somehow be used to achieve better performance or training efficiency for such basic networks. There is no summary / concluding remarks section, and no code is provided (although there is evidence that code was used).
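The subdivision claim discussed above rests on a standard fact that a small sketch can make concrete: within a region of the input space where the ReLU on/off pattern is fixed, the whole network is a single affine map. The weights below are arbitrary placeholders for illustration, not from the paper.

```python
import numpy as np

# Toy one-hidden-layer ReLU net with fixed illustrative weights.
W1 = np.array([[1.0, 0.0],
               [0.0, 1.0],
               [1.0, 1.0],
               [-1.0, 1.0]])
b1 = np.zeros(4)
W2 = np.array([[1.0, 2.0, -1.0, 0.5]])
b2 = np.zeros(1)

def region_code(x):
    """ReLU on/off pattern: identifies the convex region containing x."""
    return tuple((W1 @ x + b1 > 0).astype(int))

def forward(x):
    return W2 @ np.maximum(W1 @ x + b1, 0.0) + b2

# Two nearby points with the same activation pattern lie in the same
# region, where the network acts as one affine map x -> J x + c.
x  = np.array([0.3, -0.1])
xd = np.array([0.3001, -0.0999])
assert region_code(x) == region_code(xd) == (1, 0, 1, 0)

D = np.diag(np.array(region_code(x), dtype=float))
J = W2 @ D @ W1   # local Jacobian of the piecewise-affine network
assert np.allclose(forward(xd) - forward(x), J @ (xd - x))
```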

This paper uses a computational geometry approach to analyze the region decomposition performed by neural networks, which leads to power diagrams. The propositions and theorems in the paper, however, are not fundamentally new, following straightforwardly from the geometric approach. The authors nicely observe that their centroids can be efficiently computed via backpropagation and apply the power diagram analysis to understand neural networks. Visualizing the region centroids and the average distance to boundaries reveals interesting information, which is, however, somewhat expected, e.g., the distance decreases as more layers subdivide the input regions. Results and their interpretation in the paper are still very preliminary. Despite this, the paper advances in the promising direction of geometrically interpreting neural networks; it is to my surprise that the content of the paper has not been thoroughly studied in the past. I think the conclusions of the paper will serve as a starting point from which much future work in this direction will benefit. Ideas in the paper such as considering the curvature at the decision boundary look promising and may be worth further study.

Exposition: While the idea of the paper is clear at a high level, the exact details of the notation and equations are hard to follow in the current exposition. The notation is sometimes too complicated. E.g., f^{(1->l)}_{k} in Eq. 18 is not explicitly explained, and I have to guess the meaning of the superscripts. epsilon_{k,2}^{(l-1)} in Eq. 21 is not explained; should the superscript be (l) or (l-1)? It should also be explained where Eq. 21 comes from. Eq. 20: why is the left-hand side a squared distance? This equation is not explained. Typo: Line 333: hve -> have