{"title": "On Neuronal Capacity", "book": "Advances in Neural Information Processing Systems", "page_first": 7729, "page_last": 7738, "abstract": "We define the capacity of a learning machine to be the logarithm of the number (or volume) of the functions it can implement. We review known results, and derive new results, estimating the capacity of several neuronal models: linear and polynomial threshold gates, linear and polynomial threshold gates with constrained weights (binary weights, positive weights), and ReLU neurons. We also derive capacity estimates and bounds for fully recurrent networks and layered feedforward networks.", "full_text": "Neuronal Capacity\n\nPierre Baldi\n\nDepartment of Computer Science\nUniversity of California, Irvine\n\nIrvine, CA 92697\npfbaldi@uci.edu\n\nRoman Vershynin\n\nDepartment of Mathematics\n\nUniversity of California, Irvine\n\nIrvine, CA 92697\n\nrvershyn@uci.edu\n\nAbstract\n\nWe de\ufb01ne the capacity of a learning machine to be the logarithm of the number\n(or volume) of the functions it can implement. We review known results, and\nderive new results, estimating the capacity of several neuronal models: linear and\npolynomial threshold gates, linear and polynomial threshold gates with constrained\nweights (binary weights, positive weights), and ReLU neurons. We also derive some\ncapacity estimates and bounds for fully recurrent networks, as well as feedforward\nnetworks.\n\n1\n\nIntroduction\n\nA basic framework for the study of learning (Figure 1) consists in having a target function h that one\nwishes to learn and a class of functions or hypothesis A that is available to the learner to implement\nor approximate h. The class A, for instance, could be all the functions that can be implemented by a\ngiven neural network architecture as the synaptic weights are varied. 
Obviously how well h can be\nlearnt critically depends on the class A and thus it is natural to seek to define a notion of \u201ccapacity\u201d\nfor any class A. The goal of this paper is to define a notion of capacity and show how it can be\ncomputed, or approximated, in the case of several neural models. As a first step, in this paper we\ndefine the capacity of the class A to be the logarithm base two of the size or volume of A:\n\n$$C(A) = \\log_2 |A| \\tag{1}$$\n\nThis is also the number of bits that can be communicated, or stored, by selecting an element of A.\nNeedless to say this notion of capacity is only a first step towards characterizing the capabilities of\nA and which kinds of function it is capable of learning, a problem that has remained largely out of\nreach for neural architectures.\n\nFigure 1: Framework.\n\n32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montr\u00e9al, Canada.\n\nTo measure capacity in a continuous setting, one must define |A| in some measure theoretic sense.\nHere we will simplify this problem by using Boolean neurons only, so that the class A is finite and\ntherefore we can simply define |A| as the number of functions contained in A, as described in the\nnext section. The capacity is also the number of bits required to specify an element of A.\n\n2 Linear/Polynomial Threshold Functions and Notations\n\nFor theoretical and practical purposes, a neuron is often viewed as a computational unit which\noperates by applying a non-linear function to the dot product between the input vector and the vector\nof synaptic weights. Although several different non-linear operations can be considered, equating the\ndot product to zero defines a fundamental hyperplane that partitions the neuron\u2019s input space into two\nhalves and provides the direction of affine hyper-planes where the dot product remains constant. 
Thus\nthe most basic non-linear function that can be used is a threshold function which simply retains the\nsign of the dot product. When restricted to binary entries this yields the classical notion of a linear\nthreshold gate. Thus we consider the N dimensional hypercube $H = \\{-1, 1\\}^N$. A homogeneous\nlinear threshold gate f of N variables is a Boolean function over H of the form:\n\n$$f(x_1, \\dots, x_N) = \\operatorname{sign}\\Big(\\sum_{i=1}^{N} w_i x_i\\Big) \\tag{2}$$\n\nwhere $w = (w_i)$ is a vector of weights or parameters. Unless otherwise specified, we assume that\nthe weights are real numbers. A non-homogeneous linear threshold gate has an additional bias $w_0$\nand is given by:\n\n$$f(x_1, \\dots, x_N) = \\operatorname{sign}\\Big(w_0 + \\sum_{i=1}^{N} w_i x_i\\Big) = \\operatorname{sign}\\Big(\\sum_{i=0}^{N} w_i x_i\\Big) \\tag{3}$$\n\nassuming that $x_0 = 1$. Throughout this paper, we exclude the cases where the activation is exactly\nequal to 0, as they are not relevant for the problems considered here. Linear threshold gates represent\nan important but very small class of Boolean functions and understanding how small, i.e. counting\nthe number of linear threshold functions of N variables, is one of the most fundamental problems in\nthe theory of neural networks.\nIn search for greater biological realism or more powerful computing units, it is natural to introduce\npolynomial, or algebraic, threshold functions by assuming a polynomial, rather than linear, integration\nof information in the neuron\u2019s dendritic tree. Again equating the polynomial to zero provides an\nalgebraic variety that partitions the neuron\u2019s input space and leads to the notion of polynomial\nthreshold gates. Thus, a homogeneous polynomial threshold gate of degree d is a Boolean function\nover H given by:\n\n$$f(x_1, \\dots, x_N) = \\operatorname{sign}\\Big(\\sum_{I \\in \\mathcal{I}_d} w_I x^I\\Big) \\tag{4}$$\n\nwhere $\\mathcal{I}_d$ denotes all the subsets of size d of $\\{1, 2, \\dots, N\\}$ and if $I = (i_1, i_2, \\dots, i_d)$ then $x^I = x_{i_1} x_{i_2} \\cdots x_{i_d}$, and $w = (w_I)$ is the vector of weights. 
Note that on H, for any index i, $x_i^2 = +1$ and\ntherefore integer exponents greater than 1 can be ignored. Similarly, a (non-homogeneous) polynomial\nthreshold gate of degree d is given by the same expression:\n\n$$f(x_1, \\dots, x_N) = \\operatorname{sign}\\Big(\\sum_{I \\in \\mathcal{I}_{\\le d}} w_I x^I\\Big) \\tag{5}$$\n\nthe difference being that this time $\\mathcal{I}_{\\le d}$ represents all possible subsets of $\\{1, 2, \\dots, N\\}$ of size d\nor less, including possibly the empty set associated with a bias term. Note that for most practical\npurposes, including developing more complex models of synaptic integration, one is interested in\nfixed, relatively small values of d. Again, this gives rise to the fundamental problem of estimating\nthe number of polynomial threshold gates of N variables of degree d, with the special case above\nof linear threshold functions corresponding to the case d = 1, which will be addressed in the next\nsection.\n\nWe use A to denote an architecture, i.e. a circuit of interconnected polynomial threshold functions,\nT[A] to denote the number of Boolean functions that can be implemented by A as the weights of the\nneurons are varied, and $C[A] = \\log_2 T[A]$ to denote its capacity. We will write A = [N] to denote\nN gates fully interconnected to each other (fully recurrent network), or $A = [N_0, N_1, \\dots, N_L]$ to\ndenote a layered feedforward architecture with $N_0$ inputs, $N_1$ gates in the first layer, $N_2$ gates in the\nsecond layer, and so forth. Unless otherwise specified, we assume full connectivity between each\nlayer and the next, although most of the theory to be presented can be applied to other connectivity\nschemes. Much of our focus here is going to be on the special case A[N, 1], denoting a single neuron\nwith N inputs, since it is essential to understand first the capacity of individual building blocks. 
The\ndegree of the polynomial threshold gates is either clear from the context, or specified as a subscript.\nWe use a \u201c*\u201d superscript when the threshold functions are homogeneous. Thus, for instance, $C_d[N, 1]$\nis the logarithm base two of the number $T_d[N, 1]$ of polynomial threshold gates of degree d in N\nvariables, and $C^*_d[N, 1]$ is the same number when the gates are forced to be homogeneous.\nThe following relationships are straightforward to prove.\nProposition 1: The numbers $C_d[N, 1]$ and $C^*_d[N, 1]$ (d = 1, . . . , N) satisfy the following relationships for any $N \\ge 1$:\n\n$$C_1[N, 1] = C^*_1[N + 1, 1] \\tag{6}$$\n\n$$C^*_{d-1}[N, 1] < C^*_d[N, 1] < C^*_1\\Big[\\binom{N}{d}, 1\\Big] \\quad \\text{true for any } d \\ge 2, \\tag{7}$$\n\n$$C_{d-1}[N, 1] < C_d[N, 1] < C_1\\Big[\\binom{N}{0} + \\binom{N}{1} + \\dots + \\binom{N}{d}, 1\\Big] \\quad \\text{true for any } d \\ge 2, \\tag{8}$$\n\n$$C^*_d[N, 1] \\approx C_d[N, 1] \\quad \\text{for } d \\ge 1 \\text{ and } N \\text{ large.} \\tag{9}$$\n\n3 The Capacity of Single Threshold Neurons\n\nThe first fundamental question is to estimate how many polynomial threshold functions in N variables\nof degree d exist. We first review the known results for d = 1 and then state our more general results\nfor any fixed $d \\ge 1$.\n\n3.1 Linear Threshold Functions\n\nWe begin with linear threshold gates (d = 1). A number of well known Boolean functions are linearly\nseparable and computable by a single linear threshold gate. For instance AND, OR, NOT are linearly\nseparable. However, many other Boolean functions (e.g. PARITY) are not. In fact, there are $2^{2^N}$\nBoolean functions of N variables and the majority of them are not linearly separable. Estimating\n$T_1[N, 1]$ or $C_1[N, 1]$ is a fundamental problem in the theory of neural networks and it has a relatively\nlong history [1]. 
The upper bound:\n\n$$C_1[N, 1] \\le N^2 \\tag{10}$$\n\nfor N > 1, has been known since the 1960s (e.g. [7]; see also [6]). Likewise lower bounds of the\nform:\n\n$$\\alpha N^2 \\le C_1[N, 1] \\tag{11}$$\n\nwith $\\alpha < 1$ were also derived in the 1960s. For instance, Muroga proved a lower bound of\n$N(N - 1)/2$ (e.g. [8]), leaving open the question of convergence and the correct value of $\\alpha$. The\nproblem of determining the convergence and the right order was finally settled by Zuev [13, 14] who\nproved that:\n\n$$C_1[N, 1] = N^2 (1 + o(1)) \\tag{12}$$\n\nThus in short the capacity is $N^2$, as opposed to $2^N$ for the set of all Boolean functions. Intuitively,\nZuev\u2019s result is easy to understand from an information theoretic point of view as it says that a linear\nthreshold function is fully specified by providing $N^2$ bits, corresponding to N examples of size N.\nThese are the N support vectors, i.e. the N points mapped to +1 that are closest to the separating\nhyperplane. Conversely, it can also be interpreted as stating that $N^2$ bits can be stored in a linear\nthreshold function as its weights are varied.\n\n3.2 Polynomial Threshold Functions\n\nFor fixed d > 1, as well as slowly increasing values of d, the problem is considerably more difficult.\nLet us introduce the notation:\n\n$$\\binom{n}{\\le k} = \\binom{n}{0} + \\binom{n}{1} + \\dots + \\binom{n}{k} \\tag{13}$$\n\nThe upper bound:\n\n$$T_d[N, 1] \\le 2 \\binom{2^N - 1}{\\le D} \\quad \\text{where } D = \\binom{N}{\\le d} \\tag{14}$$\n\nwas shown in [3] (see also [1]), for any $1 \\le d \\le N$. For any N > 1 and $1 \\le d \\le N$, we can show\nthat this leads to the following simple upper bound [4]:\n\n$$C_d[N, 1] \\le \\frac{N^{d+1}}{d!} \\tag{15}$$\n\nThe lower bound:\n\n$$\\binom{N}{d+1} \\le C_d[N, 1] \\tag{16}$$\n\nwas derived in [11]. 
This lower bound is approximately $N^{d+1}/(d + 1)!$, which leaves a multiplicative\ngap O(d) between the upper and lower bounds. Here we introduce the following theorem, which\nfinally settles this gap, and contains Zuev\u2019s result as a special case:\n\nTheorem 3.1 For any fixed d, the capacity of a polynomial threshold function of degree d satisfies\n\n$$C_d[N, 1] = \\frac{N^{d+1}}{d!} (1 + o(1)) \\tag{17}$$\n\nas $N \\to \\infty$.\n\nThe proof of this result is fairly involved [4], as it requires generalizing the theory of random\nmatrices to a theory of random tensors. Although we stated Theorem 3.1 for a fixed degree d, which\nis the main case of interest here, we can allow d to grow mildly with N, and the result still holds\nif $d = o(\\sqrt{\\log N / \\log \\log N})$. Theorem 3.1 states that in order to specify a polynomial threshold\nfunction in N variables and with degree d, one needs approximately $N^{d+1}/d!$ bits. This corresponds\nto providing the $N^d/d!$ support vectors on the hypercube that belong to the +1 class and are closest to\nthe separating polynomial surface of degree d. Equivalently, Theorem 3.1 determines the complexity\nof a polynomial classification problem: there are approximately $2^{N^{d+1}/d!}$ different ways to separate\nthe points of the Boolean cube $\\{-1, 1\\}^N$ into two classes by a polynomial surface of degree d (the\nzero set of a polynomial).\nAs an aside, note that any Boolean function of N variables can be written in conjunctive normal\nform and thus represented by a polynomial threshold function of degree N. A conjecture of Aspnes\net al. [2] and Wang-Williams [12] states that for most Boolean functions f(x), the lowest degree\nof p(x) such that f(x) = sign(p(x)) is either $\\lfloor N/2 \\rfloor$ or $\\lceil N/2 \\rceil$. This conjecture was proved up to\nadditive logarithmic terms in [9]; see also related results in [10]. 
This result and Theorem 3.1 both\nshow, each in its own precise way, that low-degree polynomial threshold functions form a very small,\nbut very important, class of all possible Boolean functions.\n\n4 The Capacity of Other Neuronal Models\n\n4.1 Polynomial Threshold Functions with Binary Weights.\n\nIn some situations (e.g. discrete synapses), it is useful to use models where the weights are\nbounded or even restricted to a discrete set. For instance, we can consider the binary case with\nweights in $\\{-1, 1\\}$ leading to the set of binary-weight polynomial threshold functions of degree d,\nalso known as \u201csigned majorities\u201d in the d = 1 case. We are interested in estimating the number\nBT(N, d) of such functions, or BT*(N, d) in the homogeneous case.\nTheorem 4.1 For d = 1 and any N, the number of binary-weight linear threshold functions\nBT[N, 1] satisfies\n\n$$\\log_2 BT^*[N, 1] = N \\quad \\text{if } N \\text{ is odd}$$\n\n$$\\log_2 BT[N, 1] = N + 1 \\quad \\text{if } N \\text{ is even}$$\n\nIn short, the capacity of binary-weight linear threshold functions is linear, rather than quadratic.\n\nProof: A binary-weight homogeneous linear threshold function in N variables has N coefficients and\nthus there are at most $2^N$ such functions. If N is odd, consider two such functions $f_1$ and $f_2$ and assume\nthat they differ in at least one coefficient. We want to prove that $f_1 \\ne f_2$. Let A be the set of indices\nwhere $f_1$ and $f_2$ have the same coefficients, and B the set of indices where $f_1$ and $f_2$ have different\ncoefficients. Obviously |A| + |B| = N and $|B| \\ge 1$. For any vector x on the hypercube, we can write:\n\n$$f_1(x) = \\operatorname{sign}\\Big(\\sum_{i \\in A} w_i x_i + \\sum_{i \\in B} w_i x_i\\Big) \\quad \\text{and} \\quad f_2(x) = \\operatorname{sign}\\Big(\\sum_{i \\in A} w_i x_i - \\sum_{i \\in B} w_i x_i\\Big)$$\n\nSo we need to construct a vector x such that $\\sum_{i \\in A} w_i x_i$ is as close as possible to 0, and $|\\sum_{i \\in B} w_i x_i|$ is as\nlarge as possible. If |A| is even, then it is easy to select a vector x such that $\\sum_{i \\in A} w_i x_i = 0$ (using\nalternating signs) and $|\\sum_{i \\in B} w_i x_i| = |B| \\ge 1$ (using consistent signs) so that $f_1(x) \\ne f_2(x)$. If |A|\nis odd, then |B| must be even and thus $|B| \\ge 2$. It is then easy to select x in a similar way such that\n$\\sum_{i \\in A} w_i x_i = 1$ and again $|\\sum_{i \\in B} w_i x_i| = |B| \\ge 2$ so that $f_1(x) \\ne f_2(x)$. The reasoning is similar\nin the non-homogeneous case with N even. 
These results are exact and hold for finite N. \u25a1\nNote that in the homogeneous case, if N is even then for any binary-weight homogeneous linear\nthreshold function f there are $\\binom{N}{N/2}$ points on the hypercube where f(x) = 0 and thus f is not\nwell defined, and similarly in the non-homogeneous case when N is odd. However if we extend the\ndefinition of threshold functions for instance by arbitrarily deciding that sign(0) = +1, then it is easy\nto see that for every N:\n\n$$\\log_2 BT^*[N, 1] = N \\quad \\text{and} \\quad \\log_2 BT[N, 1] = N + 1$$\n\nWhen d > 1, we still have the obvious upper bounds $\\log_2 BT_d[N, 1] \\le \\sum_{k=1}^{d} \\binom{N}{k}$ and\n$\\log_2 BT^*_d[N, 1] \\le \\binom{N+d-1}{d}$ which are true for every N. However it is unclear how often two\ndifferent assignments of binary weights result in the same threshold function. Thus estimating\n$\\log_2 BT_d[N, 1]$, or $\\log_2 BT^*_d[N, 1]$, remains an open problem for d > 1.\n\n4.2 Polynomial Threshold Functions with Positive Weights.\n\nIn some other situations (e.g. purely excitatory neurons), it is useful to use models where the signs of\nthe weights are constrained. For instance, if we constrain all the weights to be positive this leads to\nthe set of positive-weight polynomial threshold functions of degree d. When d = 1, this is a subset of\nthe set of monotone Boolean functions. 
We are interested in estimating the number $PT_d[N, 1]$, or\n$PT^*_d[N, d]$ in the homogeneous case, of such functions.\nTheorem 4.2 For d = 1 and every N, the number of positive-weight linear threshold functions\nPT(N, 1) satisfies\n\n$$PT^*[N, 1] = T^*[N, 1]/2^N$$\n\nand\n\n$$PT^*[N, 1] \\le PT[N, 1] \\le PT^*[N + 1, 1]$$\n\nAs a result\n\n$$\\log_2 PT(N, 1) = N^2 (1 + o(1))$$\n\nIn short, for d = 1, when the synaptic weights are forced to be positive the capacity is reduced but\nstill quadratic.\n\nProof: The first statement results immediately from the symmetry of the problem and the fact that\nthere are $2^N$ orthants, each corresponding to a different sign assignment to each component. The\nsecond statement is obvious. Note that the first two statements are true for any value of N. Finally,\nthe last asymptotic statement is obtained by applying Zuev\u2019s result, noting that the reduction in\ncapacity is absorbed in the o(1) factor. \u25a1\nFor d > 1, the symmetry argument breaks down. We can still write the upper bound\n$\\log_2 PT^*[N, d] \\le \\log_2 PT^*[\\binom{N+d-1}{d}, d]$ but it may not be tight. Thus, in short, determining\nthe behavior of $\\log_2 PT[N, d]$ or $\\log_2 PT^*[N, d]$ for d > 1 is an open problem.\nA summary of the asymptotic results on the capacity of single neurons, stratified by degree and\nsynaptic weight restrictions, is provided by Figure 2.\n\nFigure 2: Stratified capacity of different classes of Boolean functions of N variables. Linear threshold\nfunctions with binary weights have capacity N. Linear threshold functions with positive weights\nhave capacity $N^2 - N$. Linear threshold functions have capacity $N^2$. Polynomial threshold functions\nof degree 2 have capacity $N^3/2$. More generally, polynomial threshold functions of degree d have\ncapacity $N^{d+1}/d!$ (fixed or slowly growing d). 
All these results are up to a multiplicative factor of\n(1 + o(1)). The capacity of linear ReLU functions scales like $N^2$. The set of all Boolean functions\nhas capacity exactly equal to $2^N$.\n\n4.3 ReLU Functions\n\nThe ReLU transfer function, often used in neural networks, is defined by f(x) = max(0, x) and\none can naturally define homogeneous or non-homogeneous polynomial ReLU functions by letting\nx be a homogeneous or non-homogeneous polynomial of N inputs. ReLU is convenient because\nit is differentiable almost everywhere, its derivative is either 0 or 1, and its large dynamic range\nis attractive. While ReLU units are more powerful than simple threshold gates, our intuition is that they do not\nfundamentally alter the capacity of a neuron. To see this we must compute the capacity of ReLU units.\nReLU functions, however, are not binary and therefore cannot be compared directly with polynomial\nthreshold gates in terms of capacity. To enable comparisons, we can put a linear threshold gate on top of a ReLU\ngate to produce a binary output (note: the same binarization approach\ncan be applied to networks of ReLU or other real-valued gates).\nTo fix the ideas, let us consider the case of greatest interest corresponding to d = 1 (but the same\nideas apply to d > 1). We can then consider two A[N, 1, 1] architectures, one comprised of two\nlinear threshold gates and one comprised of a ReLU gate followed by a linear threshold gate. Let\nC[N, 1, 1] be the capacity of the first one, and $C_{ReLU}[N, 1, 1]$ the capacity of the second architecture\nwith the ReLU function. To limit the contribution of the top gate we force its main weight to be\nequal to +1 but its bias to be arbitrary (this is necessary also to avoid cases where the input to the\ntop threshold gate is equal to 0). 
In other words, if the lower gate has weights $w_i$ and activation\n$S = \\sum_{i=0}^{N} w_i x_i$, then the final output is given by O = sign(sign(S) + b) in the pure threshold gate\ncase, and by O = sign(ReLU(S) + b) in the ReLU case. Under these conditions, we have the result:\n\nTheorem 4.3\n\n$$C[N, 1, 1] = C[N, 1] = N^2 (1 + o(1)) \\tag{18}$$\n\n$$C_{ReLU}[N, 1, 1] = C[N, 1] + N - 1 = N^2 (1 + o(1)) \\tag{19}$$\n\nProof (Sketch): For the architecture containing only threshold gates, the output gate can only\nimplement one of three functions: Identity, TRUE (always +1), FALSE (always -1). All these\nfunctions can be incorporated directly into the threshold gate of the hidden layer, and thus they do\nnot increase (or decrease) the number of functions that can be computed by the threshold gate of the\nhidden layer, which has capacity $N^2 (1 + o(1))$ by Zuev\u2019s result, and thus we have Equation 18. In\nthe case of the ReLU gate, if b > 0 then the final output is always equal to +1, which corresponds to\nonly one function. The only interesting case is for values of b < 0. For each b < 0 the overall function\nis a new linear threshold gate associated with a translated hyperplane $S + b = \\sum_i w_i x_i + b = 0$.\nNote that letting b vary essentially utilizes the additional power that ReLU functions have in their\nlinear regime. Thus as b is varied, the hyperplane S = 0 is translated by different amounts and every\ntime it crosses a corner of the hypercube a new Boolean function is created. This happens at\nmost $2^N$ times, and typically $2^{N-1}$ times, since b must be negative and translations occur only in one\ndirection. Thus the lower layer implements on the order of $2^{N^2}$ different functions, or hyperplanes,\nand on average each one of them gives rise to $2^{N-1}$ functions, for a total of approximately $2^{N^2+N-1}$\nfunctions, which leads to Equation 19. In short, for d = 1, when a ReLU transfer function is used the\ncapacity is increased but remains quadratic. 
\u25a1\n\n5 General Bounds for Networks\n\nWhile interesting, the previous results apply to single neurons and of course we are interested in\nnetworks containing many interconnected neurons. For these cases, the general strategy is to first get\nupper bounds and lower bounds, and then check whether any gap between the lower and upper bound\ncan be reduced. In general, one always has the simple upper bound:\n\n$$C[\\text{network}] \\le \\sum_i C[\\text{neuron}_i] \\tag{20}$$\n\nIn other words, the total capacity of a network is always upper bounded by the sum of the capacities\nof all the individual neurons (this remains true even when the circuit contains threshold gates of\ndifferent degrees and different fan-ins).\n\n5.1 Fully Connected Recurrent Networks\n\nIn the case of a fully interconnected recurrent network of threshold functions of degree d, the number\n$T_d[N]$ of functions that can be implemented is obviously bounded by:\n\n$$T_d[N] \\le (T_d[N, 1])^N \\le 2^{N^{d+2}/d!} \\tag{21}$$\n\nsince we can choose a different threshold gate for each node of the network. In the case of a fully\nconnected network, in principle one must further define how the gates are updated (e.g. stochastically,\nsynchronously) and what defines the function computed by the network for each set of initial\nconditions (e.g. sequence of states versus limit when it exists). Regardless of the mode of update, we\nwill use the definition that two $A_d[N]$ architectures with different weights compute the same function\nif and only if for any set of initial conditions they produce the same sequence of states under the\nsame update scheme (note: this requires the units to be numbered 1 to N in each network). 
Under this\ndefinition, it is easy to see that the upper bound above also becomes a lower bound and thus one has\nthe theorem:\n\nTheorem 5.1 For N large enough, the capacity of a fully connected network of N polynomial\nthreshold gates of degree d is given by:\n\n$$C[N] = \\frac{N^{d+2}}{d!} (1 + o(1)) \\tag{22}$$\n\nIn short, in the main case where d = 1, this states that the capacity is a cubic function of the number\nof neurons. While this result is useful, it corresponds to an architecture that is amorphous. The\ncases of greatest interest for deep learning applications are cases where there are constraints on the\nconnectivity, for instance in the form of a layered feedforward architecture.\n\n5.2 Layered Feedforward Architecture\n\nTo illustrate how the techniques developed so far can be applied to feedforward layered architectures,\nconsider an $A_d[N, M, 1]$ architecture. We have the theorem:\nTheorem 5.2\n\n$$C_d[N, 1] \\le C_d[N, M, 1] \\le M C_d[N, 1] + C_d[M, 1] \\le M \\frac{N^{d+1}}{d!} + \\frac{M^{d+1}}{d!} \\tag{23}$$\n\nProof: The lower bound is provided by the capacity of a single unit, and the upper bound is again the\nsum of all the capacities. \u25a1\nIf we take d = 1, this gives a weak lower bound that scales like $N^2$ and an upper bound that\nscales like $M N^2 + M^2$. When M is small with respect to N, the upper bound scales like $M N^2$.\nWhen M is large with respect to N, the output gate is still limited by the fact that it can have at\nmost $2^N$ distinct inputs, as opposed to $2^M$. Thus in fact one can prove that the upper bound scales\nlike $M N^2$ in all cases. Furthermore, through a constructive proof, this is also true for the best\nlower bound. More precisely, we show in [5] that in general there exist two constants $c_1$ and $c_2$\nsuch that $c_1 M N^2 \\le C_1[N, M, 1] \\le c_2 M N^2$, and $C[N, M, 1] = M N^2 (1 + o(1))$ if $N \\to \\infty$\nand $\\log_2 M = o(N)$. 
In addition, we also show how to generalize this result and derive tight bounds\non the capacity of general $A[N_0, \\dots, N_L = 1]$ feedforward architectures in terms of the quantity\n$\\sum_{k=0}^{L-1} \\min(N_0, N_1, \\dots, N_k) N_k N_{k+1}$. Here we sketch the proof for the single hidden-layer case.\n\nTheorem 5.3 The capacity of an A[N, M, 1] architecture of threshold gates satisfies:\n\n$$C[N, M, 1] = M N^2 (1 + o(1)) \\tag{24}$$\n\nfor $N \\to \\infty$ and for any choice of $M \\in [1, 2^{o(N)}]$.\n\nProof (Sketch): Let us denote by f the map between the input layer and the hidden layer and by $\\phi$\nthe map from the hidden layer to the output layer. For the upper bound, we first note that the total\nnumber of possible maps f is bounded by $2^{M N^2 (1+o(1))}$, since f consists of M threshold gates, and\neach threshold gate corresponds to $2^{N^2 (1+o(1))}$ possibilities by Zuev\u2019s theorem. Any fixed map f\nproduces at most $2^N$ distinct vectors in the hidden layer. It is known [1] that the number of threshold\nfunctions $\\phi$ of M variables defined on at most $2^N$ points is bounded by:\n\n$$2 \\binom{2^N - 1}{\\le M} = 2^{NM(1+o(1))} \\tag{25}$$\n\nusing the assumption $M \\le 2^{o(N)}$. Thus, under our assumptions, the total number of functions of the\nform $\\phi \\circ f$ is bounded by the product of the bounds above which yields immediately:\n\n$$C[N, M, 1] \\le M N^2 (1 + o(1)) \\tag{26}$$\n\nTo prove the lower bound, we use a procedure we call filtering. For this, we decompose N as: $N = N^- + N^+$ where $N^- = \\lceil \\log_2 M \\rceil$. Likewise, we decompose each input vector $I = (I_1, \\dots, I_N) \\in \\{-1, +1\\}^N$ as: $I = (I^-, I^+)$, where:\n\n$$I^- = (I_1, \\dots, I_{N^-}) \\in \\{-1, +1\\}^{N^-} \\quad \\text{and} \\quad I^+ = (I_{N^- + 1}, \\dots, I_N) \\in \\{-1, +1\\}^{N^+} \\tag{27}$$\n\nFor any Boolean linear threshold map $f^+$ from $\\{-1, +1\\}^{N^+}$ to $\\{-1, +1\\}^M$, we can uniquely derive\na map $f = (f_1, \\dots, f_M)$ from $\\{-1, +1\\}^N$ to $\\{-1, +1\\}^M$ defined by:\n\n$$f_i(I^-, I^+) = [I^- = i] \\text{ AND } [f^+_i(I^+)] \\tag{28}$$\n\nHere $I^- = i$ signifies that the binary vector $I^-$ represents the digit i. 
In other words $I^- = i$ is used\nto select the i-th unit in the hidden layer, and filter $f^+$ by retaining only the value of $f^+_i$. It can be\nchecked that this selection procedure can be expressed using a single threshold function of the input\nI. We say that f is obtained from $f^+$ by filtering and f is a threshold map. It is easy to see that\nthe filtering of two distinct maps $f^+$ and $g^+$ results in two distinct maps f and g. Now let us use\n$\\phi$ = OR in the top layer (note that OR can be expressed as a linear threshold function). Then it is also\neasy to see that $\\phi \\circ f \\ne \\phi \\circ g$. Thus the total number of Boolean functions that can be implemented\nusing linear threshold gates in the A[N, M, 1] architecture is lower bounded by the number of all\nthreshold maps $f^+$. This yields:\n\n$$C[N, M, 1] \\ge M (N^+)^2 (1 + o(1)) = M N^2 (1 + o(1)) \\tag{29}$$\n\nusing the fact that $N^+ = N - \\lceil \\log_2 M \\rceil$, and $\\lceil \\log_2 M \\rceil = o(N)$ by assumption. \u25a1\n\n6 Conclusion\n\nThe capacity of a finite class of functions can be defined as the logarithm of the number of functions\nin the class. For neuronal models, we have shown that the capacity is typically a polynomial in the\nrelevant variables. We have computed this polynomial for individual units and for some networks.\nFor individual units, we have computed the capacity of polynomial threshold functions of degree d,\nas well as for models with constrained weights or ReLU transfer functions. For networks, we have\nestimated the capacity of fully recurrent networks of polynomial threshold units of any fixed degree\nd, and have derived bounds for layered feedforward networks of polynomial threshold units of any\nfixed degree d. 
The notion of capacity is also connected to other notions of complexity including the\nVC dimension, the growth function, the Rademacher and Gaussian complexity, the metric entropy,\nand the minimum description length (MDL). For example, if the function h to be learnt has MDL\nD, and the neural architecture being used has capacity C < D then it is easy to see that: (1) h\ncannot be learnt without errors; and (2) the number E of errors made by the best approximating\nfunction implementable by the architecture must satisfy E > (D \u2212 C)/N. These connections will\nbe described elsewhere.\n\nAcknowledgments\n\nWork in part supported by grants NSF 1839429 and DARPA D17AP00002 to PB, and AFOSR\nFA9550-18-1-0031 to RV.\n\nReferences\n[1] M. Anthony. Discrete mathematics of neural networks: selected topics, volume 8. SIAM, 2001.\n\n[2] J. Aspnes, R. Beigel, M. Furst, and S. Rudich. The expressive power of voting polynomials. Combinatorica, 14(2):135\u2013148, 1994.\n\n[3] P. Baldi. Neural networks, orientations of the hypercube and algebraic threshold functions. IEEE Transactions on Information Theory, 34(3):523\u2013530, 1988.\n\n[4] P. Baldi and R. Vershynin. Boolean polynomial threshold functions and random tensors. arXiv preprint arXiv:1803.10868, 2018.\n\n[5] P. Baldi and R. Vershynin. The capacity of neural networks. 2018. Preprint.\n\n[6] C.-K. Chow. On the characterization of threshold functions. In Switching Circuit Theory and\nLogical Design, 1961. SWCT 1961. Proceedings of the Second Annual Symposium on, pages\n34\u201338. IEEE, 1961.\n\n[7] T. M. Cover. Geometrical and statistical properties of systems of linear inequalities with\napplications in pattern recognition. IEEE Transactions on Electronic Computers, (3):326\u2013334,\n1965.\n\n[8] S. Muroga. Lower bounds of the number of threshold functions and a maximum weight. IEEE Transactions on Electronic Computers, (2):136\u2013148, 1965.\n\n[9] R. O\u2019Donnell and R. A. Servedio. 
Extremal properties of polynomial threshold functions. Journal of Computer and System Sciences, 74(3):298\u2013312, 2008.\n\n[10] R. O\u2019Donnell and R. A. Servedio. New degree bounds for polynomial threshold functions. Combinatorica, 30(3):327\u2013358, 2010.\n\n[11] M. Saks. Slicing the hypercube. Surveys in combinatorics, 1993:211\u2013255, 1993.\n\n[12] C. Wang and A. Williams. The threshold order of a boolean function. Discrete Applied Mathematics, 31(1):51\u201369, 1991.\n\n[13] Y. A. Zuev. Asymptotics of the logarithm of the number of threshold functions of the algebra of logic. Soviet Mathematics Doklady, 39(3):512\u2013513, 1989.\n\n[14] Y. A. Zuev. Combinatorial-probability and geometric methods in threshold logic. Diskretnaya Matematika, 3(2):47\u201357, 1991.", "award": [], "sourceid": 3826, "authors": [{"given_name": "Pierre", "family_name": "Baldi", "institution": "UC Irvine"}, {"given_name": "Roman", "family_name": "Vershynin", "institution": "UCI"}]}