Title: The Geometry of Deep Networks: Power Diagram Subdivision
Published in: Advances in Neural Information Processing Systems, pages 15832-15841

The Geometry of Deep Networks: Power Diagram Subdivision

Randall Balestriero, Romain Cosentino, Behnaam Aazhang, Richard G. Baraniuk
Rice University
Houston, Texas, USA

Abstract

We study the geometry of deep (neural) networks (DNs) with piecewise affine and convex nonlinearities. The layers of such DNs have been shown to be max-affine spline operators (MASOs) that partition their input space and apply a region-dependent affine mapping to their input to produce their output.
We demonstrate that each MASO layer's input space partition corresponds to a power diagram (an extension of the classical Voronoi tiling) with a number of regions that grows exponentially with respect to the number of units (neurons). We further show that a composition of MASO layers (e.g., the entire DN) produces a progressively subdivided power diagram and provide its analytical form. The subdivision process constrains the affine maps on the potentially exponentially many power diagram regions with respect to the number of neurons to greatly reduce their complexity. For classification problems, we obtain a formula for the DN's decision boundary in the input space plus a measure of its curvature that depends on the DN's architecture, nonlinearities, and weights. Numerous numerical experiments support and extend our theoretical results.

1 Introduction

Today's machine learning landscape is dominated by deep (neural) networks (DNs), which are compositions of a large number of simple parameterized linear and nonlinear transformations. Deep networks perform surprisingly well in a host of applications; however, surprisingly little is known about why they work so well.

Recently, [BB18a, BB18b] connected a large class of DNs to a special kind of spline, which enables one to view and analyze the inner workings of a DN using tools from approximation theory and functional analysis. In particular, when the DN is constructed using convex and piecewise affine nonlinearities (such as ReLU, leaky-ReLU, max-pooling, etc.), then its layers can be written as max-affine spline operators (MASOs). An important consequence for DNs is that each layer partitions its input space into a set of regions and then processes inputs via a simple affine transformation that changes continuously from region to region.
Understanding the geometry of the layer partition regions – and how the layer partition regions combine into the DN input partition – is thus key to understanding the operation of DNs.

There has been only limited work on the geometry of deep networks. The originating MASO work of [BB18a, BB18b] focused on the analytical form of the region-dependent affine maps and on empirical statistics of the partition, without studying the structure of the partition or its construction through depth. The work of [WBB19] empirically studied the partition, highlighting the fact that knowledge of the region in which each input lies is sufficient to reach high performance. Other works have focused on the properties of the partition, such as upper bounding the number of regions [MPCB14, RPK+17, HR19]. An explicit characterization of the input space partition of one-hidden-layer DNs with ReLU activation has been developed in [ZBH+16] by means of tropical geometry.

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

In this paper, we adopt a computational and combinatorial geometry [PA11, PS12] perspective of MASO-based DNs to derive the analytical form of the input-space partition of a DN unit, a DN layer, and an entire end-to-end DN. Our results apply to any DN employing affine transformations plus piecewise affine and convex nonlinearities.

We summarize our contributions as follows: [C1] We demonstrate that each MASO DN layer partitions its input (feature map) space according to a power diagram (PD) (also known as a Laguerre–Voronoi diagram) [AI] and derive the analytical formula of the PD (Section 3.2). [C2] We demonstrate that the composition of the several MASO layers comprising a DN effects a subdivision process that creates the overall DN input-space partition and provide the analytical form of the partition (Section 4).
[C3] We demonstrate how the centroids of the layer PDs can be efficiently computed via backpropagation (Section 4.2), which permits ready visualization of a PD. [C4] In the classification setting, we derive an analytical formula for a DN's decision boundary in terms of its input space partition (Section 5). The analytical formula enables us to characterize some key geometrical properties of the boundary.

Our complete, analytical characterization of the input-space and feature map partition of MASO DNs opens up new avenues to study the geometrical mechanisms behind their operation. Additional background information, results, and proofs of the main results are provided in several appendices.

2 Background

Deep Networks. A deep (neural) network (DN) is an operator $f_\Theta$ with parameters $\Theta$ that maps an input signal $x \in \mathbb{R}^D$ to the output prediction $\widehat{y} \in \mathbb{R}^C$. Current DNs can be written as a composition of $L$ intermediate layer mappings $f^{(\ell)} : X^{(\ell-1)} \to X^{(\ell)}$ ($\ell = 1, \dots, L$) with $X^{(\ell)} \subset \mathbb{R}^{D^{(\ell)}}$ that transform an input feature map $z^{(\ell-1)}$ into the output feature map $z^{(\ell)}$, with the initializations $z^{(0)}(x) := x$ and $D^{(0)} = D$. The feature maps $z^{(\ell)}$ can be viewed equivalently as signals, tensors, or flattened vectors; we will use boldface to denote flattened vectors (e.g., $\mathbf{z}^{(\ell)}$, $\mathbf{x}$).

DNs can be constructed from a range of different linear and nonlinear operators. One important linear operator is the fully connected operator that performs an arbitrary affine transformation by multiplying its input by the dense matrix $W^{(\ell)} \in \mathbb{R}^{D^{(\ell)} \times D^{(\ell-1)}}$ and adding the arbitrary bias vector $b_W^{(\ell)} \in \mathbb{R}^{D^{(\ell)}}$, as in $f_W^{(\ell)}\big(z^{(\ell-1)}(x)\big) := W^{(\ell)} z^{(\ell-1)}(x) + b_W^{(\ell)}$. Another linear operator is the convolution operator, in which the matrix $W^{(\ell)}$ is replaced with a circulant block circulant matrix denoted as $C^{(\ell)}$.
One important nonlinear operator is the activation operator that applies a nonlinearity $\sigma$ elementwise, such as the ReLU $\sigma_{\mathrm{ReLU}}(u) = \max(u, 0)$. Further examples are provided in [GBC16]. We define a DN layer $f^{(\ell)}$ as a single nonlinear DN operator composed with any (if any) preceding linear operators that lie between it and the preceding nonlinear operator.

Max-Affine Spline Operators (MASOs). Work from [BB18a, BB18b] connects DN layers with max-affine spline operators (MASOs). A MASO is an operator $S[A, B] : \mathbb{R}^D \to \mathbb{R}^K$, continuous and convex with respect to each output dimension, that concatenates $K$ independent max-affine splines [MB09, HD13], with each spline formed from $R$ affine mappings. The MASO parameters consist of the "slopes" $A \in \mathbb{R}^{K \times R \times D}$ and the "offsets/biases" $B \in \mathbb{R}^{K \times R}$.1 Given the layer input $z^{(\ell-1)}$, a MASO layer produces its output via

$[z^{(\ell)}(x)]_k = \big[ S[A^{(\ell)}, B^{(\ell)}](z^{(\ell-1)}(x)) \big]_k = \max_{r=1,\dots,R} \Big( \big\langle [A^{(\ell)}]_{k,r,\cdot},\, z^{(\ell-1)}(x) \big\rangle + [B^{(\ell)}]_{k,r} \Big), \quad (1)$

where $A^{(\ell)}, B^{(\ell)}$ are the per-layer parameters, $[A^{(\ell)}]_{k,r,\cdot}$ represents the vector formed from all of the values of the last dimension of $A^{(\ell)}$, and $[\cdot]_k$ denotes the value of a vector's $k$th entry.

The key background result for this paper is that any DN layer $f^{(\ell)}$ constructed from operators that are piecewise-affine and convex can be written as a MASO with parameters $A^{(\ell)}, B^{(\ell)}$ and output dimension $K = D^{(\ell)}$. Hence, a DN is a composition of $L$ MASOs [BB18a, BB18b].
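To make (1) concrete, here is a minimal NumPy sketch of a single MASO layer, checked against the fully connected + leaky-ReLU layer discussed next. All dimensions and weights are hypothetical; the two-region slope/offset parameterization follows the paper's leaky-ReLU example.

```python
import numpy as np

rng = np.random.default_rng(0)
D_in, K, eta = 3, 4, 0.1            # hypothetical input dim, units, leakiness

W = rng.standard_normal((K, D_in))  # fully connected weights
b = rng.standard_normal(K)          # biases

# MASO parameters (R = 2) reproducing fully connected + leaky-ReLU:
# slopes [A]_{k,1,.} = W_k, [A]_{k,2,.} = eta * W_k; offsets likewise.
A = np.stack([W, eta * W], axis=1)  # shape (K, R, D_in)
B = np.stack([b, eta * b], axis=1)  # shape (K, R)

def maso(x, A, B):
    """Eq. (1): [z]_k = max_r (<[A]_{k,r,.}, x> + [B]_{k,r})."""
    return (A @ x + B).max(axis=1)

x = rng.standard_normal(D_in)
pre = W @ x + b
leaky_relu = np.where(pre > 0, pre, eta * pre)
assert np.allclose(maso(x, A, B), leaky_relu)
```

Since $\max(u, \eta u)$ equals $u$ for $u > 0$ and $\eta u$ otherwise (for $0 < \eta < 1$), the two-region MASO reproduces the layer exactly.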
For example, a layer made of a fully connected operator followed by a leaky-ReLU with leakiness $\eta$ has slope parameters $[A^{(\ell)}]_{k,1,\cdot} = [W^{(\ell)}]_{k,\cdot}$, $[A^{(\ell)}]_{k,2,\cdot} = \eta [W^{(\ell)}]_{k,\cdot}$ and bias parameters $[B^{(\ell)}]_{k,1} = [b^{(\ell)}]_k$, $[B^{(\ell)}]_{k,2} = \eta [b^{(\ell)}]_k$.

1The three subscripts of the slopes tensor $[A]_{k,r,d}$ correspond to output $k$, partition region $r$, and input signal index $d$. The two subscripts of the offsets/biases tensor $[B]_{k,r}$ correspond to output $k$ and partition region $r$.

Figure 1: Two equivalent representations of a power diagram (PD). Top: The grey circles have centroids $[\mu]_{k,\cdot}$ and radii $[\mathrm{rad}]_k$; each point $x$ is assigned to a specific region/cell according to the Laguerre distance from the centroid, which is defined as the length of the segment tangent to and starting on the circle and reaching $x$. Bottom: A PD in $\mathbb{R}^D$ (here $D = 2$) is constructed by lifting the centroids $[\mu]_{k,\cdot}$ up into an additional dimension in $\mathbb{R}^{D+1}$ by the distance $[\mathrm{rad}]_k$ and then finding the Voronoi diagram (VD) of the augmented centroids $([\mu]_{k,\cdot}, [\mathrm{rad}]_k)$ in $\mathbb{R}^{D+1}$. The intersection of this higher-dimensional VD with the originating space $\mathbb{R}^D$ yields the PD.

A DN comprising $L$ MASO layers is a non-convex but continuous affine spline operator with an input space partition and a partition-region-dependent affine mapping. However, little is known analytically about the input-space partition. The goal of this paper is to characterize the geometry of the MASO partitions of the input space and the feature map spaces $X^{(\ell)}$.

Voronoi and Power Diagrams. A power diagram (PD), also known as a Laguerre–Voronoi diagram [AI], is a generalization of the classical Voronoi diagram (VD).

Definition 1. A PD partitions a space $X$ into $R$ disjoint regions/cells $\Omega = \{\omega_1, \dots, \omega_R\}$ such that $\cup_{r=1}^R \omega_r = X$, where each cell is obtained via $\omega_r = \{x \in X : r(x) = r\}$, $r = 1, \dots, R$, with

$r(x) = \arg\min_{k=1,\dots,R} \|x - [\mu]_{k,\cdot}\|^2 - [\mathrm{rad}]_k. \quad (2)$

The parameter $[\mu]_{k,\cdot}$ is called the centroid, while $[\mathrm{rad}]_k$ is called the radius. The distance minimized in (2) is called the Laguerre distance [IIM85].

When the radii are equal for all $k$, a PD collapses to a VD. See Fig. 1 for two equivalent geometric interpretations of a PD. For additional insights, see Appendix A and [PS12]. We will have occasion to use negative radii in our development below. Since $\arg\min_k \|x - [\mu]_{k,\cdot}\|^2 - [\mathrm{rad}]_k = \arg\min_k \|x - [\mu]_{k,\cdot}\|^2 - ([\mathrm{rad}]_k + \rho)$, we can always apply a constant shift $\rho$ to all of the radii to make them positive.

3 Input Space Power Diagram of a MASO Layer

Like any spline, it is the interplay between the (affine) spline mappings and the input space partition that works the magic in a MASO DN. Indeed, the partition opens up new geometric avenues to study how a MASO-based DN clusters and organizes signals in a hierarchical fashion.

We now embark on a programme to fully characterize the geometry of the input space partition of a MASO-based DN. We will proceed in three steps by studying the partition induced by i) one unit of a single DN layer (Section 3.1), ii) the combination of all units in a single layer (Section 3.2), and iii) the composition of the $L$ layers that forms the complete DN (Section 4).

3.1 MAS Unit Power Diagram

A MASO layer combines $K$ max-affine spline (MAS) units $z_k(x)$ to produce the layer output $z(x) = [z_1(x), \dots, z_K(x)]^T$ given an input $x \in X$. To streamline our argument, we omit the $\ell$ superscript and denote the layer input by $x$.
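As a quick illustration of Definition 1 (and, anticipating Section 3.1, of Theorem 1), the following sketch assigns points to PD cells via the Laguerre distance (2) and checks numerically that a MAS unit's face selection coincides with the Laguerre assignment under the identification $\mu_r = [A]_{k,r,\cdot}$, $\mathrm{rad}_r = 2[B]_{k,r} + \|[A]_{k,r,\cdot}\|^2$. Dimensions and weights are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
R, D = 4, 3
A = rng.standard_normal((R, D))   # slopes of one MAS unit, [A]_{k,r,.}
B = rng.standard_normal(R)        # offsets [B]_{k,r}

def laguerre_cell(x, mu, rad):
    """Eq. (2): r(x) = argmin_k ||x - mu_k||^2 - rad_k."""
    return np.argmin(((x - mu) ** 2).sum(axis=1) - rad)

# Theorem 1's parameters: mu_r = A_r, rad_r = 2 B_r + ||A_r||^2.
mu, rad = A, 2 * B + (A ** 2).sum(axis=1)

for _ in range(100):
    x = rng.standard_normal(D)
    face = np.argmax(A @ x + B)               # face selection of the unit
    assert face == laguerre_cell(x, mu, rad)  # same region via Laguerre distance

# A constant shift of all radii leaves the partition unchanged,
# and equal radii collapse the PD to a Voronoi diagram.
x = rng.standard_normal(D)
assert laguerre_cell(x, mu, rad) == laguerre_cell(x, mu, rad + 7.0)
```

The equivalence holds because $\|x - A_r\|^2 - 2B_r - \|A_r\|^2 = \|x\|^2 - 2(\langle A_r, x\rangle + B_r)$, so minimizing the Laguerre distance maximizes the affine maps.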
Denote each MAS computation from (1) as

$z_k(x) = \max_{r=1,\dots,R} \langle [A]_{k,r,\cdot}, x \rangle + [B]_{k,r} = \max_{r=1,\dots,R} E_{k,r}(x), \quad (3)$

where $E_{k,r}(x)$ is the affine projection of $x$ parameterized by the slope $[A]_{k,r,\cdot}$ and offset $[B]_{k,r}$. By defining the following half-space, consisting of the set of points above the hyperplane,

$E^+_{k,r} = \{(x, y) \in X \times \mathbb{R} : y \ge E_{k,r}(x)\}, \quad (4)$

we obtain the following geometric interpretation of the unit output.

Proposition 1. The $k$th MAS unit maps its input space onto the boundary of the convex polytope $P_k = \cap_{r=1}^R E^+_{k,r}$, leading to

$\{(x, z_k(x)) : x \in X\} = \partial P_k, \quad (5)$

where $\partial P_k$ denotes the boundary of the polytope.

The MAS computation can be decomposed geometrically as follows. The slope $[A]_{k,r,\cdot}$ and offset $[B]_{k,r}$ parameters describe the shape of the half-space $E^+_{k,r}$. The max over the regions $r$ in (3) defines the polytope $P_k$ as the intersection of the $R$ half-spaces. The following property shows how the unit projection, the polytope faces, and the unit input space partition naturally tie together.

Lemma 1. The vertical projection onto the input space $X$ of the faces of the polytope $P_k$ from (5) defines the cells of a PD.

Furthermore, we can highlight the maximization process of the unit computation (3) with the operator $r_k : X \to \{1, \dots, R\}$ defined as

$r_k(x) = \arg\max_{r=1,\dots,R} E_{k,r}(x). \quad (6)$

This operator keeps track of the index of the affine mapping used to produce the unit output or, equivalently, the index of the polytope face used to produce the unit output. The collection of inputs having the same face allocation, defined as $\omega_r = \{x \in X : r_k(x) = r\}$, $\forall r \in \{1, \dots, R\}$, constitutes the $r$th partition cell of the unit-$k$ PD (recall (2) and Lemma 1).

The polytope formulation of a DN's PD provides an avenue to study the interplay between the slope and offset of the MAS unit and this specific partition by providing the analytical form of the PD.

Theorem 1. The $k$th MAS unit partitions its input space according to a PD with $R$ centroids and radii given by $[\mu]_{k,r} = [A]_{k,r,\cdot}$ and $[\mathrm{rad}]_{k,r} = 2[B]_{k,r} + \|[A]_{k,r,\cdot}\|^2$, $\forall r \in \{1, \dots, R\}$ (recall (2)).

Corollary 1. The input space partition of a DN unit is composed of convex polytopes.

For a single MAS unit, the slope corresponds to the centroid, and its $\ell_2$ norm combines with the bias to produce the radius. The PD simplifies to a VD when $[B]_{k,r} = -\frac{1}{2}\|[A]_{k,r,\cdot}\|^2 + c$, $\forall r$, $\forall c \in \mathbb{R}$.

3.2 MASO Layer Power Diagram

We study the input space partition of an entire DN layer by studying the joint behavior of all its constituent units. A MASO layer is a continuous, piecewise affine operator made by the concatenation of $K$ MAS units (recall (1)); we extend (3) to

$z(x) = \Big[ \max_{r=1,\dots,R} E_{1,r}(x), \dots, \max_{r=1,\dots,R} E_{K,r}(x) \Big]^T, \quad \forall x \in X, \quad (7)$

and the per-unit face index function $r_k$ (6) into the operator $r : X \to \{1, \dots, R\}^K$ defined as

$r(x) = [r_1(x), \dots, r_K(x)]^T. \quad (8)$

Following the geometric interpretation of the unit output from Proposition 1, we extend (4) to

$E^+_r = \big\{ (x, y) \in X \times \mathbb{R}^K : [y]_1 \ge E_{1,[r]_1}(x), \dots, [y]_K \ge E_{K,[r]_K}(x) \big\}, \quad \forall r \in \{1, \dots, R\}^K, \quad (9)$

in order to provide the following geometrical interpretation of the layer output.

Proposition 2.
The layer operator $z$ maps its input space into the boundary of the $\dim(X) + K$ dimensional convex polytope $P = \bigcap_{r \in \{1,\dots,R\}^K} E^+_r$ via

$\partial P = \{(x, z(x)) : x \in X\}. \quad (10)$

Similarly to Proposition 1, the polytope $P$ imprints the layer's input space with a partition that is the intersection of the $K$ per-unit input space partitions.

Lemma 2. The vertical projection onto the input space $X$ of the faces of the polytope $P$ from Proposition 2 defines the cells of a PD.

The MASO layer projects an input $x$ onto the polytope face indexed by $r(x)$, corresponding to

$r(x) = \Big[ \arg\max_{r=1,\dots,R} E_{1,r}(x), \dots, \arg\max_{r=1,\dots,R} E_{K,r}(x) \Big]^T. \quad (11)$

The collection of inputs having the same face allocation jointly across the $K$ units constitutes the $r$th partition cell (region) of the layer PD.

Figure 2: Power diagram subdivision in a toy deep network (DN) with a $D = 2$ dimensional input space, with layer 1 mapping $X^{(0)} \subset \mathbb{R}^2$ to $X^{(1)} \subset \mathbb{R}^6$, layer 2 mapping $X^{(1)} \subset \mathbb{R}^6$ to $X^{(2)} \subset \mathbb{R}^6$, and layer 3 mapping $X^{(2)} \subset \mathbb{R}^6$ to $X^{(3)} \subset \mathbb{R}^1$. Top: The partition polynomial (22), whose roots define the partition boundaries in the input space. Bottom: Evolution of the input space partition (15) displayed layer by layer, with the newly introduced boundaries in darker color. Below each partition, one of the newly introduced cuts $\mathrm{edge}_{X^{(0)}}(k, \ell)$ from (21) is highlighted; in the final layer (right), this cut corresponds to the decision boundary (in red).

Theorem 2. A DN layer partitions its input space according to a PD containing up to $R^K$ cells with centroids $\mu_r = \sum_{k=1}^K [A]_{k,[r]_k,\cdot}$ and radii $\mathrm{rad}_r = 2 \sum_{k=1}^K [B]_{k,[r]_k} + \|\mu_r\|^2$ (recall (2)).

Corollary 2.
The input space partition of a DN layer is composed of convex polytopes.

Extending Theorem 1, we observe in the layer case that the centroid of each PD cell corresponds to the sum of the rows of the slope matrices producing the layer output. The radii involve the bias units and the $\ell_2$ norm of the slopes as well as their correlation. This highlights how, even when a change of weight occurs for a single unit, it will impact multiple centroids and hence multiple cells. Note also that orthogonal DN filters2 and $[B]_{k,r} = -\frac{1}{2}\|[A]_{k,r,\cdot}\|^2$ reduce the PD to a VD. Appendix A.2 explores how the shapes and orientations of a layer's PD cells can be designed by appropriately constraining the values of the DN's weights and biases.

4 Input Space Power Diagram of a MASO Deep Network

We are now armed to characterize and study the input space partition of an entire DN by studying the joint behavior of its constituent layers.

4.1 The Power Diagram Subdivision Recursion

We provide the formula for the input space partition of an $L$-layer DN by means of a recursion. Recall that each layer partitions its input space $X^{(\ell-1)}$ in terms of the polytopes $P^{(\ell)}$ according to Proposition 2. The DN partition corresponds to a recursive subdivision where each per-layer polytope subdivides the previously obtained partition.

Initialization ($\ell = 0$): Define the region of interest in the input space $X^{(0)} \subset \mathbb{R}^D$.

First step ($\ell = 1$): The first layer subdivides $X^{(0)}$ into a PD via Theorem 2 with parameters $A^{(1)}, B^{(1)}$ to obtain the layer-1 partition $\Omega^{(1)}$.

Recursion step ($\ell = 2$): For concreteness, we focus here on how the second layer subdivides the first layer's input space partition. In particular, we highlight how a single cell $\omega^{(1)}_{r^{(1)}}$ of $\Omega^{(1)}$ is subdivided; the same applies to all the cells. On this cell, the first layer mapping is affine with parameters $A^{(1)}_{r^{(1)}}, B^{(1)}_{r^{(1)}}$. This convex cell thus remains a convex cell at the output of the first layer mapping; it lives in $X^{(1)}$ and is defined as

$\mathrm{aff}_{r^{(1)}} = \big\{ A^{(1)}_{r^{(1)}} x + B^{(1)}_{r^{(1)}} : x \in \omega^{(1)}_{r^{(1)}} \big\} \subset X^{(1)}. \quad (12)$

The second layer partitions its input space $X^{(1)}$ and thus also potentially subdivides $\mathrm{aff}_{r^{(1)}}$. In particular, this mapped cell will be subdivided by the edges of the polytope $P^{(2)}$ (recall (10)) having for domain $\mathrm{aff}_{r^{(1)}}$; this domain-restricted polytope is defined as

$P^{(2)}_{r^{(1)}} = P^{(2)} \cap \big( \mathrm{aff}_{r^{(1)}} \times \mathbb{R}^{D^{(2)}} \big). \quad (13)$

2Orthogonal DN filters have the property that $\langle [A]_{k,r,\cdot}, [A]_{k',r',\cdot} \rangle = 0$, $\forall r, r', k \neq k'$.

Since the layer-1 mapping is affine in this region, the domain-restricted polytope $P^{(2)}_{r^{(1)}}$ can be expressed as part of $X^{(0)}$ as opposed to $X^{(1)}$.

Definition 2. The domain-restricted polytope $P^{(2)}_{r^{(1)}} \subset X^{(1)} \times \mathbb{R}^{D^{(2)}}$ can be expressed in $X^{(0)} \times \mathbb{R}^{D^{(2)}}$ as

$P^{(1\leftarrow 2)}_{r^{(1)}} = \cap_{r^{(2)}} \big\{ (x, y) \in \omega^{(1)}_{r^{(1)}} \times \mathbb{R}^{D^{(2)}} : [y]_1 \ge E^{(1\leftarrow 2)}_{1,[r^{(2)}]_1}(x), \dots, [y]_{D^{(2)}} \ge E^{(1\leftarrow 2)}_{D^{(2)},[r^{(2)}]_{D^{(2)}}}(x) \big\}, \quad (14)$

with $E^{(1\leftarrow 2)}_{k,r}$ the hyperplane with slope $A^{(1)\,T}_{r^{(1)}} [A^{(2)}]_{k,r,\cdot}$ and bias $\langle [A^{(2)}]_{k,r,\cdot}, B^{(1)}_{r^{(1)}} \rangle + [B^{(2)}]_{k,r}$, $k \in \{1, \dots, D^{(2)}\}$.

The above result demonstrates how the cell $\omega^{(1)}_{r^{(1)}}$, seen as $\mathrm{aff}_{r^{(1)}}$ by the second layer, is subdivided by the domain-restricted polytope $P^{(2)}_{r^{(1)}}$; and conversely, how this subdivision of $\omega^{(1)}_{r^{(1)}}$ is done by the domain-restricted second-layer polytope expressed in the DN input space, $P^{(1\leftarrow 2)}_{r^{(1)}}$. Now, combining the latter interpretation and applying Lemma 2, we obtain that this cell is subdivided according to a PD induced by the faces of $P^{(1\leftarrow 2)}_{r^{(1)}}$.
This PD is characterized by the centroids $\mu^{(1\leftarrow 2)}_{r^{(1)},r^{(2)}} = A^{(1)\,T}_{r^{(1)}} \mu^{(2)}_{r^{(2)}}$ and radii $\mathrm{rad}^{(1\leftarrow 2)}_{r^{(1)},r^{(2)}} = \|\mu^{(1\leftarrow 2)}_{r^{(1)},r^{(2)}}\|^2 + 2\langle \mu^{(2)}_{r^{(2)}}, B^{(1)}_{r^{(1)}} \rangle + 2\langle 1, B^{(2)}_{r^{(2)}} \rangle$, $\forall r^{(2)} \in \{1, \dots, R\}^{D^{(2)}}$; we denote this PD as $\mathrm{PD}^{(1\leftarrow 2)}_{r^{(1)}}$. The PD parameters thus combine the affine parameters $A^{(1)}_{r^{(1)}}, B^{(1)}_{r^{(1)}}$ of the considered cell with the second layer parameters $A^{(2)}, B^{(2)}$. Repeating this subdivision process for all cells $\omega^{(1)}_{r^{(1)}}$ from $\Omega^{(1)}$ forms the subdivided input space partition $\Omega^{(1,2)} = \cup_{r^{(1)}} \mathrm{PD}^{(1\leftarrow 2)}_{r^{(1)}}$.

Recursion step ($\ell$): Consider the situation at layer $\ell$, knowing $\Omega^{(1,\dots,\ell-1)}$ from the previous subdivision steps. Similarly to the $\ell = 2$ step, layer $\ell$ subdivides each cell in $\Omega^{(1,\dots,\ell-1)}$ to produce $\Omega^{(1,\dots,\ell)}$, leading to the up-to-layer-$\ell$ DN partition

$\Omega^{(1,\dots,\ell)} = \cup_{r^{(1)},\dots,r^{(\ell-1)}} \mathrm{PD}^{(1\leftarrow \ell)}_{r^{(1)},\dots,r^{(\ell-1)}}. \quad (15)$

See Fig. 2 for a numerical example with a 3-layer DN and $D = 2$ dimensional input space. (See also Figures 7 and 9 in Appendix B.)

Theorem 3. Each cell $\omega^{(1,\dots,\ell-1)}_{r^{(1)},\dots,r^{(\ell-1)}} \in \Omega^{(1,\dots,\ell-1)}$ is subdivided into $\mathrm{PD}^{(1\leftarrow \ell)}_{r^{(1)},\dots,r^{(\ell-1)}}$, a PD with domain $\omega^{(1,\dots,\ell-1)}_{r^{(1)},\dots,r^{(\ell-1)}}$ and parameters

$\mu^{(1\leftarrow \ell)}_{r^{(1)},\dots,r^{(\ell)}} = \big( A^{(1\to \ell-1)}_{r^{(1)},\dots,r^{(\ell-1)}} \big)^T \mu^{(\ell)}_{r^{(\ell)}} \quad \text{(centroids)}, \quad (16)$

$\mathrm{rad}^{(1\leftarrow \ell)}_{r^{(1)},\dots,r^{(\ell)}} = \|\mu^{(1\leftarrow \ell)}_{r^{(1)},\dots,r^{(\ell)}}\|^2 + 2\big\langle \mu^{(\ell)}_{r^{(\ell)}}, B^{(1\to \ell-1)}_{r^{(1)},\dots,r^{(\ell-1)}} \big\rangle + 2\big\langle 1, B^{(\ell)}_{r^{(\ell)}} \big\rangle \quad \text{(radii)}, \quad (17)$

$\forall r^{(i)} \in \{1, \dots, R\}^{D^{(i)}}$, with $B^{(1\to \ell-1)} = \sum_{\ell'=1}^{\ell-1} \big( \prod_{i=\ell-1}^{\ell'} A^{(i)}_{r^{(i)}} \big) B^{(\ell')}_{r^{(\ell')}}$, forming $\Omega^{(1,\dots,\ell)}$.

The subdivision recursion provides a direct result on the shape of the DN input space partition regions.

Corollary 3. For any number of MASO layers $L \ge 1$, the PD cells of the DN input space partition are convex polytopes.

4.2 Centroid and Radius Computation

While in general a DN has a tremendous number of PD cells, the DN's forward inference calculation locates the cell containing an input signal $x$ with a computational complexity that is only logarithmic in the number of regions. (See Appendix A.3 for a proof and additional discussion.) We now produce a closed-form formula for the radius and centroid of that cell.

Consider the cell of the PD induced by layers 1 through $\ell$ of a DN that contains a data point $x$ of interest. This cell is described by the code $r^{(1)}(x), \dots, r^{(\ell)}(x)$, which we abbreviate here, in an abuse of notation, to simply $x$. Denoting the Jacobian operator as $J$ and the vector of ones by $1$, the centroid of the cell is given by

$\mu^{(1\leftarrow \ell)}_x = (J_x f^{(1\to \ell)})^T 1. \quad (18)$

Figure 3: Centroids of the PD regions containing an input horse image $x$ computed via (18) for a LargeConv network (top, initial and trained) and a ResNet (bottom, initial and trained), showing the input $x$ and the successive centroids $\mu^{(1)}_x, \mu^{(1,2)}_x, \dots, \mu^{(1,\dots,9)}_x$. (See Fig. 11 for results with a SmallConv network.)
The input belongs to the PD cell $\omega^{(1,\dots,\ell)}_x$ for each successively refined PD subdivision of each layer $\Omega^{(1,\dots,\ell)}$. At each layer of the subdivision, the region has an associated centroid $\mu^{(1,\dots,\ell)}_x$ (depicted here) and radius (not depicted). As the depth $\ell$ increases, the centroids diverge from horse-like images. This is because the radii begin to dominate the centroids, pushing the centroids outside the PD cell containing $x$. Training accelerates this domineering effect.

The radius of the cell is given by

$\mathrm{rad}^{(1\leftarrow \ell)}_x = \|\mu^{(1\leftarrow \ell)}_x\|^2 + 2\big\langle \mu^{(\ell)}_x, B^{(1\to \ell-1)}_x \big\rangle + 2\big\langle 1, B^{(\ell)}_x \big\rangle, \quad (19)$

with $A^{(1\to \ell-1)}_x = \big( \nabla_x f^{(1\to \ell-1)}_1, \dots, \nabla_x f^{(1\to \ell-1)}_{D^{(\ell-1)}} \big)^T$, $\mu^{(\ell)}_x = \sum_{k=1}^{D^{(\ell)}} [A^{(\ell)}_x]_{k,\cdot}$, and $B^{(1\to \ell-1)}_x = f^{(1\to \ell-1)}(x) - A^{(1\to \ell-1)}_x x$, where we recall $\mu^{(\ell)}_x$ and $B^{(\ell)}_x$ from Theorem 3, and where $f^{(1\to \ell)}_k$ is the $k$th unit of the layer-1-to-$\ell$ mapping. Note how the centroids and biases of the current layer are mapped back to the input space $X^{(0)}$ via a projection onto the tangent hyperplane defined by the basis $A^{(1\to \ell-1)}_x$. Conveniently, the centroids (18) can be computed via an efficient backpropagation pass through the DN, which is typically available because it is integral to DN learning.
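A small sketch of (18) with a hypothetical two-layer leaky-ReLU network: since the mapping is piecewise affine, its Jacobian at $x$ is the region's slope matrix, and summing the Jacobian's rows, which is what backpropagating the all-ones vector computes, yields the cell centroid.

```python
import numpy as np

rng = np.random.default_rng(5)
D, K1, K2, eta = 3, 4, 2, 0.1
W1, b1 = rng.standard_normal((K1, D)), rng.standard_normal(K1)
W2, b2 = rng.standard_normal((K2, K1)), rng.standard_normal(K2)

def f(x):
    u = W1 @ x + b1
    return W2 @ np.where(u > 0, u, eta * u) + b2

x = rng.standard_normal(D)
# Jacobian of the piecewise-affine map at x: W2 diag(mask) W1,
# where mask holds the per-unit slopes selected by x's region.
mask = np.where(W1 @ x + b1 > 0, 1.0, eta)
J = W2 @ (mask[:, None] * W1)

# Eq. (18): the input-space centroid of x's cell is J^T 1 (sum of Jacobian rows).
mu = J.T @ np.ones(K2)

# Sanity check: f is affine on the cell, so J matches finite differences.
eps = 1e-6
J_fd = np.stack([(f(x + eps * e) - f(x - eps * e)) / (2 * eps)
                 for e in np.eye(D)], axis=1)
assert np.allclose(J, J_fd, atol=1e-5)
assert np.allclose(mu, J_fd.sum(axis=0), atol=1e-5)
```

In an autodiff framework the same quantity is one backward pass of the all-ones vector through the network at $x$.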
Moreover, (18) corresponds to the element-wise summation of the saliency maps [SVZ13, ZF14] from all of the layer units.3 Figure 3 visualizes the centroids of the cell containing a particular input signal for a LargeConv and a ResNet DN trained on the CIFAR10 dataset (see Appendix C for details on the models plus additional figures).

4.3 Distance to the Nearest PD Cell Boundary

In Appendix D we derive the Euclidean distance from a data point $x$ to the nearest boundary of its PD cell (a point from $\partial\Omega$):

$\min_{u \in \partial\Omega} \|x - u\| = \min_{\ell=1,\dots,L} \; \min_{k=1,\dots,D^{(\ell)}} \; \frac{\big| (z^{(\ell)}_k \circ \cdots \circ z^{(1)})(x) \big|}{\big\| \nabla_x (z^{(\ell)}_k \circ \cdots \circ z^{(1)})(x) \big\|}. \quad (20)$

Fig. 4 (and Fig. 6 in the Appendix) plots the distributions of the log distances from the points in the CIFAR10 training set to their nearest region boundary of the input space partition as a function of layer $\ell$ and at different stages of learning. We see that training increases the number of data points that lie close to their nearest boundary.
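Eq. (20) can be sketched in code with a hypothetical two-layer leaky-ReLU network: the distance from $x$ to each unit's zero set is |value| / ||gradient||, and stepping the minimal such distance along the corresponding signed gradient direction lands on the nearest cell boundary.

```python
import numpy as np

rng = np.random.default_rng(6)
D, K1, K2, eta = 2, 4, 3, 0.1
W1, b1 = rng.standard_normal((K1, D)), rng.standard_normal(K1)
W2, b2 = rng.standard_normal((K2, K1)), rng.standard_normal(K2)

x = rng.standard_normal(D)
u1 = W1 @ x + b1                      # layer-1 unit values z^(1)(x)
mask = np.where(u1 > 0, 1.0, eta)
u2 = W2 @ (mask * u1) + b2            # composed layer-2 unit values

# Eq. (20): per-unit distance = |unit value| / ||input-space gradient||.
G1 = W1                               # gradients of layer-1 units w.r.t. x
G2 = W2 @ (mask[:, None] * W1)        # gradients of layer-2 units w.r.t. x
vals = np.concatenate([u1, u2])
grads = np.vstack([G1, G2])
dists = np.abs(vals) / np.linalg.norm(grads, axis=1)
i = np.argmin(dists)                  # nearest boundary over all layers/units

# Stepping that distance along the signed gradient lands on the boundary,
# where the selected unit's value vanishes.
g = grads[i] / np.linalg.norm(grads[i])
x_b = x - dists[i] * np.sign(vals[i]) * g
u1_b = W1 @ x_b + b1
mask_b = np.where(u1_b > 0, 1.0, eta)
vals_b = np.concatenate([u1_b, W2 @ (mask_b * u1_b) + b2])
assert abs(vals_b[i]) < 1e-8
```

Because no other boundary is strictly closer, the straight path stays inside the cell, where the selected unit is affine, so the step of length $|v|/\|g\|$ cancels its value exactly.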
We see from these figures that while a network with fully connected layers (MLP) refines its partition by introducing cuts close to the training points at each layer, the SmallCNN does not reduce the shortest distance at deeper layers. A further exploration is carried out in Appendix A.4, where Table 1 summarizes the performance of the centroids, when used as centroids of a VD, at recovering, inside their region, the same input as the one that originally produced the centroid.

3The saliency maps were linked to the filters in a matched filterbank in [BB18a, BB18b].

Figure 4: Empirical distributions of the log distances from the training points of the CIFAR10 dataset to the nearest PD cell boundary as calculated by (20) for the various layers of a SmallCNN (left) and an MLP (right). Blue: Training set. Red: Test set. The top row shows the evolution through the layers at the end of training; the bottom row shows the evolution of the last layer through the epochs. The distances decrease with $\ell$ due to the PD subdivision, which reduces the volume of the cells as the subdivision process occurs. The distances are also much smaller for the CNN, despite the MLP having the same number of units as the convolutional layers have filters and translations. This demonstrates that the subdivision process of the convolutional layer is much more effective at refining the DN input space partitioning around the data for image data.

5 Geometry of the Deep Network Decision Boundary

We now study the edges of the polytopes that define the PD cells' boundaries.
We demonstrate how a single unit at layer $\ell$ defines multiple cell boundaries in the input space and use this finding to derive an analytical formula for the DN decision boundary that would be used in a classification task. Without loss of generality, we focus in this section on piecewise nonlinearities with $R = 2$, such as ReLU, leaky-ReLU, and absolute value.

5.1 Partition Boundaries and Edges

In the case of $R = 2$ nonlinearities, the polytope $P^{(\ell)}_k$ of unit $z^{(\ell)}_k$ contains a single edge; we consider here nonlinearities that can be expressed as a leaky-ReLU with leakiness $\eta \neq 0$. We define this edge as the intersection of the faces of the polytope. For instance, in the case of leaky-ReLU, the polytope contains two faces that characterize the two regions produced by a single leaky-ReLU unit. We formally define the edge of a polytope as follows.

Definition 3. The edges of the polytope $P^{(\ell)}_k$ can be expressed in any space $X^{(\ell')}$, $\ell' < \ell$ (and in particular the input space $X^{(0)}$), as

$\mathrm{edge}_{X^{(\ell')}}(k, \ell) = \big\{ x \in X^{(\ell')} : E^{(\ell)}_{k,2}\big( z^{(\ell' \to \ell-1)}(x) \big) = 0 \big\}, \quad (21)$

with $z^{(\ell' \to \ell-1)} = z^{(\ell-1)} \circ \cdots \circ z^{(\ell')}$, $E^{(\ell)}_{k,2}$ from (3), and where $\circ$ denotes the composition operator.

In the same way that the polytopes $P^{(1\leftarrow \ell)}_{r^{(1)},\dots,r^{(\ell-1)}}$ could be expressed in $X^{(0)} \times \mathbb{R}^{D^{(\ell)}}$ and then mapped to the DN input space (recall Section 4.1), these edges defined in $X^{(\ell-1)}$ can be expressed in the DN input space $X^{(0)}$. The projection of the edges into the DN input space will constitute the partition boundaries.
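The next step characterizes these boundaries as roots of a product of composed unit outputs (the partition polynomial of eq. (22)). A toy check of that characterization, assuming leaky-ReLU units and hypothetical random weights:

```python
import numpy as np

rng = np.random.default_rng(7)
D, K1, K2, eta = 2, 3, 2, 0.2
W1, b1 = rng.standard_normal((K1, D)), rng.standard_normal(K1)
W2, b2 = rng.standard_normal((K2, K1)), rng.standard_normal(K2)

def leaky(u):
    return np.where(u > 0, u, eta * u)

def pol(x):
    """Product of all composed unit outputs across layers; its roots are
    the partition boundaries (a leaky-ReLU output vanishes exactly where
    its pre-activation does)."""
    z1 = leaky(W1 @ x + b1)
    z2 = leaky(W2 @ z1 + b2)
    return np.prod(z1) * np.prod(z2)

# Project a generic point onto the boundary of unit 0 of layer 1
# (the hyperplane where its pre-activation vanishes):
x = rng.standard_normal(D)
w, c = W1[0], b1[0]
x_b = x - (w @ x + c) / (w @ w) * w
assert abs(w @ x_b + c) < 1e-10
assert abs(pol(x_b)) < 1e-8     # the polynomial vanishes on the boundary
assert abs(pol(x)) > 0          # and is generically nonzero off it
```

The same zero set contains every unit's boundary from every layer, which is exactly the union of edges the theorem below expresses.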
Defining the polynomial
$$\mathrm{Pol}(x) = \prod_{\ell=1}^{L} \prod_{k=1}^{D^{(\ell)}} \left(z^{(\ell)}_k \circ z^{(\ell-1)} \circ \cdots \circ z^{(1)}\right)(x), \qquad (22)$$
we obtain the following result, where the boundaries of $\Omega^{(1,\dots,\ell)}$ from Theorem 3 can be expressed in terms of the polytope edges and the roots of the polynomial.

Theorem 4. The polynomial (22) is of order $\prod_{\ell=1}^{L} D^{(\ell)}$, and its roots correspond to the partition boundaries:
$$\partial \Omega^{(1,\dots,\ell)} = \{x \in X^{(0)} : \mathrm{Pol}(x) = 0\} = \cup_{\ell'=1}^{\ell} \cup_{k=1}^{D^{(\ell')}} \mathrm{edge}_{X^{(0)}}(k, \ell'). \qquad (23)$$
The root order defines the dimensionality of the root (boundary, corner, etc.).

5.2 Decision Boundary Curvature
The final DN layer introduces a last subdivision of the partition. For brevity, we focus on a binary classification problem; in this case, $D^{(L)} = 1$ and a single last subdivision occurs, leading to the class prediction $y = 1_{z^{(L)}_1(x) > \tau}$ for some threshold $\tau$. This last layer can thus be cast as a MASO with a leaky-ReLU type nonlinearity with the proper bias and $\tau = 0$. That is, the DN prediction is unchanged by this last nonlinearity, and the change of sign, i.e., the change of class, defines the decision boundary.

Proposition 3.
The decision boundary of a DN with $L$ layers is the edge of the last layer's polytope $P^{(L)}$ expressed in the input space $X^{(0)}$ from Definition 3 as
$$\mathrm{DecisionBoundary} = \{x \in X^{(0)} : f(x) = 0\} = \mathrm{edge}_{X^{(0)}}(1, L), \qquad (24)$$
where $\mathrm{edge}_{X^{(0)}}(1, L)$ denotes the edge of unit 1 of layer $L$ expressed in the input space $X^{(0)}$.

To provide insight into this result, consider a 3-layer DN denoted $f$ and a binary classification task; we have
$$\mathrm{DecisionBoundary} = \cup_{r^{(2)}} \cup_{r^{(1)}} \{x \in X^{(0)} : \langle \alpha_{r^{(1)},r^{(2)}}, x \rangle + \beta_{r^{(1)},r^{(2)}} = 0\} \cap \omega^{(1,2)}_{r^{(1)},r^{(2)}}, \qquad (25)$$
with $\alpha_{r^{(1)},r^{(2)}} = (A^{(2)}_{r^{(2)}} A^{(1)}_{r^{(1)}})^T [A^{(3)}]_{1,1,\cdot}$ and $\beta_{r^{(1)},r^{(2)}} = [A^{(3)}]^T_{1,1,\cdot} (A^{(2)}_{r^{(2)}} B^{(1)}_{r^{(1)}} + B^{(2)}_{r^{(2)}}) + [B^{(3)}]_{1,1}$.4 The distribution of the $\alpha_{r^{(1)},r^{(2)}}$ characterizes the structure of the decision boundary and thus highlights the interplay between the layer parameters, the layer topology, and the decision boundary. For example, in Fig. 2 the red line demonstrates how the weights determine the curvature and cut positions of the decision boundary. We provide examples highlighting the impact of changes in the DN architecture on these angles in Appendix A.5.

We provide a direct application of the above finding by characterizing the curvature of the decision boundary. First, we propose the following result, which states that moving from a region to a neighbouring one alters the form of $\alpha$ and $\beta$ from (25) in only a single unit code at some layer.

Lemma 3. Upon reaching a region boundary, any edge as defined in Definition 3 must continue into a neighbouring region.

This follows directly from the continuity of the involved operators and enables us to study the decision boundary's curvature by comparing the edges of adjacent regions.
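To make the role of the $\alpha_{r^{(1)},r^{(2)}}$ concrete, they can be assembled region by region from the activation pattern at a point: each layer contributes a slope-selected affine map, and composing these maps yields the affine map of (25) on that region. A minimal numpy sketch under toy assumptions (illustrative weights; `region_affine` is a hypothetical helper, not the paper's code):

```python
import numpy as np

# Toy 3-layer leaky-ReLU network f: R^2 -> R (weights illustrative, not from
# the paper). On each region the network is exactly affine: f(x) = <alpha, x> + beta.
W1, b1 = np.array([[1.0, 0.5], [-0.5, 1.0]]), np.array([0.1, -0.2])
W2, b2 = np.array([[1.0, -1.0], [0.5, 1.0]]), np.array([0.0, 0.1])
w3, b3 = np.array([1.0, -0.5]), 0.05
eta = 0.2  # leakiness

def lrelu(u):
    return np.where(u > 0, u, eta * u)

def f(x):
    return w3 @ lrelu(W2 @ lrelu(W1 @ x + b1) + b2) + b3

def region_affine(x):
    """Region-dependent (alpha, beta) at x: compose the slope-selected
    per-layer affine maps A^(l)_{r(l)}, B^(l)_{r(l)} determined by x's
    activation codes, as in (25)."""
    q1 = np.where(W1 @ x + b1 > 0, 1.0, eta)            # layer-1 activation code
    A1, B1 = q1[:, None] * W1, q1 * b1
    q2 = np.where(W2 @ (A1 @ x + B1) + b2 > 0, 1.0, eta)  # layer-2 code
    A2, B2 = q2[:, None] * W2, q2 * b2
    alpha = (A2 @ A1).T @ w3
    beta = w3 @ (A2 @ B1 + B2) + b3
    return alpha, beta

# Two inputs (generically in different regions, since their activation codes
# differ); the per-region affine map must reproduce the network output.
x, xp = np.array([0.8, 0.3]), np.array([-0.6, 0.9])
a, b = region_affine(x)
ap, bp = region_affine(xp)
print(np.isclose(a @ x + b, f(x)))
cos_theta = abs(a @ ap) / (np.linalg.norm(a) * np.linalg.norm(ap))
print(0.0 <= cos_theta <= 1.0)
```

The angle between the $\alpha$ vectors of two adjacent regions is then exactly what quantifies the curvature of the decision boundary across their shared boundary.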
In fact, adjacent region edges connect at the region boundary by continuity; however, their angles might differ. This angle defines the curviness of the decision boundary, which is the collection of all the edges introduced by the last layer.

Theorem 5. The decision boundary curvature/angle between two adjacent regions5 $r$ and $r'$ is given by the following dihedral angle [KB38] between the neighbouring $\alpha$ parameters:
$$\cos(\theta(r, r')) = \frac{|\langle \alpha_r, \alpha_{r'} \rangle|}{\|\alpha_r\| \, \|\alpha_{r'}\|}. \qquad (26)$$

Acknowledgements
RB and RGB were supported by NSF grants CCF-1911094, IIS-1838177, and IIS-1730574; ONR grants N00014-18-12571 and N00014-17-1-2551; AFOSR grant FA9550-18-1-0478; DARPA grant G001534-7500; and a Vannevar Bush Faculty Fellowship, ONR grant N00014-18-1-2047. RC and BA were supported by NSF grant SCH-1838873 and NIH grant R01HL144683-CFDA.

4The last layer is a linear transform with one unit, since we perform binary classification.
5For clarity, we omit the subscripts.

References

[AI] Franz Aurenhammer and Hiroshi Imai. Geometric relations among Voronoi diagrams. Geometriae Dedicata.

[AMN+98] Sunil Arya, David M. Mount, Nathan S. Netanyahu, Ruth Silverman, and Angela Y. Wu. An optimal algorithm for approximate nearest neighbor searching in fixed dimensions. Journal of the ACM (JACM), 45(6):891–923, 1998.

[Aur87] Franz Aurenhammer. Power diagrams: Properties, algorithms and applications. SIAM Journal on Computing, 16(1):78–96, 1987.

[BB18a] R. Balestriero and R. Baraniuk. Mad Max: Affine spline insights into deep learning. arXiv:1805.06576, 2018.

[BB18b] R. Balestriero and R. G. Baraniuk. A spline theory of deep networks. In International Conference on Machine Learning (ICML), pages 374–383, 2018.

[GBC16] I. Goodfellow, Y. Bengio, and A. Courville. Deep Learning. MIT Press, 2016.
http://www.deeplearningbook.org.

[GSM03] Bogdan Georgescu, Ilan Shimshoni, and Peter Meer. Mean shift based clustering in high dimensions: A texture classification example. In International Conference on Computer Vision (ICCV), pages 456–464, 2003.

[HD13] L. A. Hannah and D. B. Dunson. Multivariate convex regression with adaptive partitioning. Journal of Machine Learning Research (JMLR), 14(1):3261–3294, 2013.

[HR19] Boris Hanin and David Rolnick. Complexity of linear regions in deep networks. arXiv:1901.09021, 2019.

[IIM85] Hiroshi Imai, Masao Iri, and Kazuo Murota. Voronoi diagram in the Laguerre geometry and its applications. SIAM Journal on Computing, 14(1):93–105, 1985.

[Joh60] Roger A. Johnson. Advanced Euclidean Geometry: An Elementary Treatise on the Geometry of the Triangle and the Circle. Dover Publications, 1960.

[KB38] Willis Frederick Kern and James R. Bland. Solid Mensuration: With Proofs. J. Wiley & Sons, 1938.

[MB09] A. Magnani and S. P. Boyd. Convex piecewise-linear fitting. Optimization and Engineering, 10(1):1–17, 2009.

[ML09] Marius Muja and David G. Lowe. Fast approximate nearest neighbors with automatic algorithm configuration. International Conference on Computer Vision Theory and Applications (VISAPP), 2(331-340):2, 2009.

[MPCB14] Guido F. Montufar, Razvan Pascanu, Kyunghyun Cho, and Yoshua Bengio. On the number of linear regions of deep neural networks. In Advances in Neural Information Processing Systems (NIPS), pages 2924–2932, 2014.

[PA11] János Pach and Pankaj K. Agarwal. Combinatorial Geometry, volume 37. John Wiley & Sons, 2011.

[PS12] Franco P. Preparata and Michael I. Shamos. Computational Geometry: An Introduction. Springer, 2012.

[RPK+17] Maithra Raghu, Ben Poole, Jon Kleinberg, Surya Ganguli, and Jascha Sohl-Dickstein. On the expressive power of deep neural networks.
In International Conference on Machine Learning (ICML), pages 2847–2854, 2017.

[SVZ13] Karen Simonyan, Andrea Vedaldi, and Andrew Zisserman. Deep inside convolutional networks: Visualising image classification models and saliency maps. arXiv:1312.6034, 2013.

[WBB19] Zichao Wang, Randall Balestriero, and Richard Baraniuk. A max-affine spline perspective of recurrent neural networks. In International Conference on Learning Representations (ICLR), 2019.

[ZBH+16] Chiyuan Zhang, Samy Bengio, Moritz Hardt, Benjamin Recht, and Oriol Vinyals. Understanding deep learning requires rethinking generalization. arXiv:1611.03530, 2016.

[ZF14] Matthew D. Zeiler and Rob Fergus. Visualizing and understanding convolutional networks. In European Conference on Computer Vision (ECCV), pages 818–833. Springer, 2014.