{"title": "Almost Linear VC Dimension Bounds for Piecewise Polynomial Networks", "book": "Advances in Neural Information Processing Systems", "page_first": 190, "page_last": 196, "abstract": null, "full_text": "Almost Linear VC Dimension Bounds for Piecewise Polynomial Networks

Peter L. Bartlett
Department of System Engineering
Australian National University
Canberra, ACT 0200, Australia
Peter.Bartlett@anu.edu.au

Vitaly Maiorov
Department of Mathematics
Technion, Haifa 32000, Israel

Ron Meir
Department of Electrical Engineering
Technion, Haifa 32000, Israel
rmeir@dumbo.technion.ac.il

Abstract

We compute upper and lower bounds on the VC dimension of feedforward networks of units with piecewise polynomial activation functions. We show that if the number of layers is fixed, then the VC dimension grows as W log W, where W is the number of parameters in the network. This result stands in opposition to the case where the number of layers is unbounded, in which case the VC dimension grows as W^2.

1 MOTIVATION

The VC dimension is an important measure of the complexity of a class of binary-valued functions, since it characterizes the amount of data required for learning in the PAC setting (see [BEHW89, Vap82]). In this paper, we establish upper and lower bounds on the VC dimension of a specific class of multi-layered feedforward neural networks. Let F be the class of binary-valued functions computed by a feedforward neural network with W weights and k computational (non-input) units, each with a piecewise polynomial activation function. Goldberg and Jerrum [GJ95] have shown that VCdim(F) <= c_1 (W^2 + Wk) = O(W^2), where c_1 is a constant.
Moreover, Koiran and Sontag [KS97] have demonstrated such a network with VCdim(F) >= c_2 W^2 = Omega(W^2), which would lead one to conclude that the bounds are in fact tight up to a constant. However, the proof used in [KS97] to establish the lower bound made use of the fact that the number of layers can grow with W. In practical applications, this number is often a small constant. Thus, the question remains as to whether it is possible to obtain a better bound in the realistic scenario where the number of layers is fixed.

The contribution of this work is the proof of upper and lower bounds on the VC dimension of piecewise polynomial nets. The upper bound behaves as O(W L^2 + W L log WL), where L is the number of layers. If L is fixed, this is O(W log W), which is superior to the previous best result, which behaves as O(W^2). Moreover, using ideas from [KS97] and [GJ95] we are able to derive a lower bound on the VC dimension which is Omega(WL) for L = O(W). Maass [Maa94] shows that three-layer networks with threshold activation functions and binary inputs have VC dimension Omega(W log W), and Sakurai [Sak93] shows that this is also true for two-layer networks with threshold activation functions and real inputs. It is easy to show that these results imply similar lower bounds if the threshold activation function is replaced by any piecewise polynomial activation function f that has bounded and distinct limits lim_{x -> -infinity} f(x) and lim_{x -> infinity} f(x). We thus conclude that if the number of layers L is fixed, the VC dimension of piecewise polynomial networks with L >= 2 layers and real inputs, and of piecewise polynomial networks with L >= 3 layers and binary inputs, grows as W log W. We note that for the piecewise polynomial networks considered in this work, it is easy to show that the VC dimension and pseudo-dimension are closely related (see e.g. [Vid96]), so that similar bounds (with different constants) hold for the pseudo-dimension. Independently, Sakurai has obtained similar upper bounds and improved lower bounds on the VC dimension of piecewise polynomial networks (see [Sak99]).

2 UPPER BOUNDS

We begin the technical discussion with precise definitions of the VC dimension and the class of networks considered in this work.

Definition 1 Let X be a set, and A a system of subsets of X. A set S = {x_1, ..., x_n} is shattered by A if, for every subset B of S, there exists a set A in A such that S intersect A = B. The VC dimension of A, denoted VCdim(A), is the largest integer n such that there exists a set of cardinality n that is shattered by A.

Intuitively, the VC dimension measures the size, n, of the largest set of points for which all possible 2^n labelings may be achieved by sets A in A. It is often convenient to talk about the VC dimension of classes of indicator functions F. In this case we simply identify the set of points x in X for which f(x) = 1 with a subset of X, and use the notation VCdim(F).

A feedforward multi-layer network is a directed acyclic graph that represents a parametrized real-valued function of d real inputs. Each node is called either an input unit or a computation unit. The computation units are arranged in L layers. Edges are allowed from input units to computation units. There can also be an edge from a computation unit to another computation unit, but only if the first unit is in a lower layer than the second. There is a single unit in the final layer, called the output unit. Each input unit has an associated real value, which is one of the components of the input vector x in R^d. Each computation unit has an associated real value, called the unit's output value. Each edge has an associated real parameter, as does each computation unit.
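Definition 1 can be verified mechanically for small finite classes. As a purely illustrative aside (the example classes and function names below are ours, not the paper's), the following sketch brute-forces the VC dimension of one-dimensional thresholds and intervals:

```python
from itertools import combinations

def shatters(points, hypotheses):
    """True if every binary labeling of `points` is realized by some hypothesis."""
    labelings = {tuple(h(x) for x in points) for h in hypotheses}
    return len(labelings) == 2 ** len(points)

def vc_dimension(domain, hypotheses):
    """Largest n such that some n-point subset of `domain` is shattered."""
    best = 0
    for n in range(1, len(domain) + 1):
        if any(shatters(s, hypotheses) for s in combinations(domain, n)):
            best = n
    return best

domain = [0.0, 1.0, 2.0, 3.0]
# Half-lines x -> 1[x >= a]: a pair x < y can never be labeled (1, 0),
# so the VC dimension is 1.
thresholds = [lambda x, a=a: int(x >= a) for a in [-0.5, 0.5, 1.5, 2.5, 3.5]]
assert vc_dimension(domain, thresholds) == 1
# Interval indicators 1[a <= x <= b]: pairs are shattered, but no triple is
# (the labeling (1, 0, 1) fails), so the VC dimension is 2.
intervals = [lambda x, a=a, b=b: int(a <= x <= b)
             for a in [-0.5, 0.5, 1.5, 2.5] for b in [0.5, 1.5, 2.5, 3.5]]
assert vc_dimension(domain, intervals) == 2
```

The exhaustive search is exponential in the set size, which is exactly why the combinatorial bounds developed below are needed for network classes.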
The output of a computation unit is given by sigma(sum_e w_e z_e + w_0), where the sum ranges over the set of edges leading to the unit, w_e is the parameter (weight) associated with edge e, z_e is the output value of the unit from which edge e emerges, w_0 is the parameter (bias) associated with the unit, and sigma : R -> R is called the activation function of the unit. The argument of sigma is called the net input of the unit. We suppose that in each unit except the output unit, the activation function is a fixed piecewise polynomial function of the form

sigma(u) = phi_i(u) for t_{i-1} <= u < t_i,

for i = 1, ..., p+1 (where we set t_0 = -infinity and t_{p+1} = infinity), and each phi_i is a polynomial of degree no more than l. We say that sigma has p break-points, and degree l. The activation function in the output unit is the identity function. Let k_i denote the number of computational units in layer i, and suppose there is a total of W parameters (weights and biases) and k computational units (k = k_1 + k_2 + ... + k_{L-1} + 1). For input x and parameter vector a in A = R^W, let f(x, a) denote the output of this network, and let F = {x |-> f(x, a) : a in R^W} denote the class of functions computed by such an architecture as we vary the W parameters. We first discuss the computation of the VC dimension, and thus consider the class of functions sgn(F) = {x |-> sgn(f(x, a)) : a in R^W}.

Before giving the main theorem of this section, we present the following result, which is a slight improvement of a result due to Warren (see [ABar], Chapter 8).

Lemma 2.1 Suppose f_1(.), f_2(.), ..., f_m(.) are fixed polynomials of degree at most l in n <= m variables. Then the number of distinct sign vectors (sgn(f_1(a)), ..., sgn(f_m(a))) that can be generated by varying a in R^n is at most 2(2eml/n)^n.

We then have our main result:

Theorem 2.1 For any positive integers W, k <= W, L <= W, l, and p, consider a network with real inputs, up to W parameters, up to k computational units arranged in L layers, a single output unit with the identity activation function, and all other computation units with piecewise polynomial activation functions of degree l and with p break-points. Let F be the class of real-valued functions computed by this network. Then

VCdim(sgn(F)) <= 2WL log(2eWLpk) + 2WL^2 log(l+1) + 2L.

Since L and k are O(W), for fixed l and p this implies that

VCdim(sgn(F)) = O(WL log W + WL^2).

Before presenting the proof, we outline the main idea of the construction. For any fixed input x, the output of the network f(x, a) corresponds to a piecewise polynomial function in the parameters a, of degree no larger than (l+1)^{L-1} (recall that the last layer is linear). Thus, the parameter domain A = R^W can be split into regions, in each of which the function f(x, .) is polynomial. From Lemma 2.1, it is possible to obtain an upper bound on the number of sign assignments that can be attained by varying the parameters of a set of polynomials. The theorem will be established by combining this bound with a bound on the number of regions.

PROOF OF THEOREM 2.1 For an arbitrary choice of m points x_1, x_2, ..., x_m, we wish to bound

K = |{(sgn(f(x_1, a)), ..., sgn(f(x_m, a))) : a in A}|.

Fix these m points, and consider a partition {S_1, S_2, ..., S_N} of the parameter domain A. Clearly

K <= sum_{i=1}^{N} |{(sgn(f(x_1, a)), ..., sgn(f(x_m, a))) : a in S_i}|.

We choose the partition so that within each region S_i, f(x_1, .), ..., f(x_m, .) are all fixed polynomials of degree no more than (l+1)^{L-1}.
Then, by Lemma 2.1, each term in the sum above is no more than

2 (2em(l+1)^{L-1} / W)^W.    (1)

The only remaining point is to construct the partition and determine an upper bound on its size. The partition is constructed recursively, using the following procedure. Let S_1 be a partition of A such that, for all S in S_1, there are constants b_{h,i,j} in {0,1} for which

sgn(p_{h,x_j}(a) - t_i) = b_{h,i,j} for all a in S,

where j in {1, ..., m}, h in {1, ..., k_1} and i in {1, ..., p}. Here the t_i are the break-points of the piecewise polynomial activation functions, and p_{h,x_j} is the affine function describing the net input of the h-th unit in the first layer, in response to x_j. That is,

p_{h,x_j}(a) = a_h^T x_j + a_{h,0} for all a in S,

where a_h in R^d, a_{h,0} in R are the weights of the h-th unit in the first layer. Note that the partition S_1 is determined solely by the parameters corresponding to the first hidden layer, as the input to this layer is unaffected by the other parameters. Clearly, for a in S, the output of any first-layer unit in response to an x_j is a fixed polynomial in a.

Now, let W_1, ..., W_L be the number of variables used in computing the unit outputs up to layer 1, ..., L respectively (so W_L = W), and let k_1, ..., k_L be the number of computation units in layer 1, ..., L respectively (recall that k_L = 1). Then we can choose S_1 so that |S_1| is no more than the number of sign assignments possible with mk_1p affine functions in W_1 variables. Lemma 2.1 shows that |S_1| <= 2(2emk_1p/W_1)^{W_1}.

Now, we define S_n (for n > 1) as follows. Assume that for all S in S_{n-1} and all x_j, the net input of every unit in layer n in response to x_j is a fixed polynomial function of a in S, of degree no more than (l+1)^{n-1}.
Let S_n be a partition of A that is a refinement of S_{n-1} (that is, for all S in S_n, there is an S' in S_{n-1} with S a subset of S'), such that for all S in S_n there are constants b_{h,i,j} in {0,1} such that

sgn(p_{h,x_j}(a) - t_i) = b_{h,i,j} for all a in S,    (2)

where p_{h,x_j} is the polynomial function describing the net input of the h-th unit in the n-th layer, in response to x_j, when a in S. Since S is contained in S' for some S' in S_{n-1}, (2) implies that the output of each n-th layer unit in response to an x_j is a fixed polynomial in a of degree no more than l(l+1)^{n-1}, for all a in S.

Finally, we can choose S_n such that, for all S' in S_{n-1}, |{S in S_n : S subset of S'}| is no more than the number of sign assignments of mk_np polynomials in W_n variables of degree no more than (l+1)^{n-1}, and by Lemma 2.1 this is no more than 2(2emk_np(l+1)^{n-1}/W_n)^{W_n}. Notice also that the net input of every unit in layer n+1 in response to x_j is a fixed polynomial function of a in S in S_n, of degree no more than (l+1)^n.

Proceeding in this way, we get a partition S_{L-1} of A such that for S in S_{L-1} the network output in response to any x_j is a fixed polynomial in a in S of degree no more than l(l+1)^{L-2}. Furthermore,

|S_{L-1}| <= 2(2emk_1p/W_1)^{W_1} prod_{i=2}^{L-1} 2(2emk_ip(l+1)^{i-1}/W_i)^{W_i} <= prod_{i=1}^{L-1} 2(2emk_ip(l+1)^{i-1}/W_i)^{W_i}.

Multiplying by the bound (1) gives

K <= prod_{i=1}^{L} 2(2emk_ip(l+1)^{i-1}/W_i)^{W_i}.

Since the points x_1, ..., x_m were chosen arbitrarily, this gives a bound on the maximal number of dichotomies induced by a in A on m points. An upper bound on the VC dimension is then obtained by computing the largest value of m for which this number is at least 2^m, yielding

m < L + sum_{i=1}^{L} W_i log(2empk_i(l+1)^{i-1}) <= L [1 + (L-1)W log(l+1) + W log(2empk)],

where all logarithms are to base 2.
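As a quick numerical illustration of the resulting growth rate (our own sketch, not part of the paper), the final bound of Theorem 2.1 can be tabulated for piecewise linear units (degree l = 1, with p = 1 break-point). For fixed depth L the bound scales as W log W, so its ratio to W log2 W stays bounded as W grows:

```python
import math

def vc_upper_bound(W, L, k, l, p):
    """Theorem 2.1: VCdim(sgn(F)) <= 2WL*log2(2eWLpk) + 2W*L^2*log2(l+1) + 2L."""
    return (2 * W * L * math.log2(2 * math.e * W * L * p * k)
            + 2 * W * L ** 2 * math.log2(l + 1)
            + 2 * L)

# Fixed depth L = 3, worst case k = W allowed by the theorem: the ratio of the
# bound to W*log2(W) stays bounded, reflecting the W log W growth rate.
for W in [10 ** 2, 10 ** 3, 10 ** 4, 10 ** 5]:
    bound = vc_upper_bound(W, L=3, k=W, l=1, p=1)
    print(W, round(bound), round(bound / (W * math.log2(W)), 2))
```

By contrast, letting L grow linearly with W makes the 2WL^2 log(l+1) term dominate, recovering the cubic-in-depth, quadratic-in-W behavior that the lower bound of Section 3 addresses.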
We conclude (see for example [Vid96], Lemma 4.4) that

VCdim(sgn(F)) <= 2L [(L-1)W log(l+1) + W log(2eWLpk) + 1].

We briefly mention the application of this result to the problem of learning a regression function E[Y|X = x] from n input/output pairs {(X_i, Y_i)}_{i=1}^{n}, drawn independently at random from an unknown distribution P(X, Y). In the case of quadratic loss, L(f) = E(Y - f(X))^2, one can show that there exist constants c_1 >= 1 and c_2 such that

E L(f_n) <= sigma^2 + c_1 inf_{f in F} A(f) + c_2 Pdim(F) log n / n,

where sigma^2 = E[Y - E[Y|X]]^2 is the noise variance, A(f) = E[(E[Y|X] - f(X))^2] is the approximation error of f, and f_n is a function from the class F that approximately minimizes the sample average of the quadratic loss. Making use of recently derived bounds [MM97] on the approximation error inf_{f in F} A(f), which are equal, up to logarithmic factors, to those obtained for networks of units with the standard sigmoidal function sigma(u) = (1 + e^{-u})^{-1}, and combining with the considerably lower pseudo-dimension bounds for piecewise polynomial networks, we obtain much better error rates than are currently available for sigmoid networks.

3 LOWER BOUND

We now compute a lower bound on the VC dimension of neural networks with continuous activation functions. This result generalizes the lower bound in [KS97], since it holds for any number of layers.

Theorem 3.1 Suppose f : R -> R has the following properties:

1. lim_{a -> infinity} f(a) = 1 and lim_{a -> -infinity} f(a) = 0, and
2. f is differentiable at some point x_0 with derivative f'(x_0) != 0.
\n\nThen for any L ~ 1 and W ~ 10L - 14, there is a feedforward network with the \nfollowing properties: The network has L layers and W parameters, the output unit \nis a linear unit, all other computation units have activation function f, and the set \nsgn(F) of functions computed by the network has \n\nVCdim(sgn(F\u00bb ~ l ~ J l ~ J ' \n\nwhere l u J is the largest integer less than or equal to u. \n\nPROOF As in [KS97], the proof follows that of Theorem 2.5 in [GJ95], but we \nshow how the functions described in [GJ95] can be computed by a network, and \nkeep track of the number of parameters and layers required. We first prove the \nlower bound for a network containing linear threshold units and linear units (with \nthe identity activation function), and then show that all except the output unit \ncan be replaced by units with activation function f, and the resulting network still \nshatters the same set. For further details of the proof, see the full paper [BMM98]. \n\nFix positive integers M, N E N. We now construct a set of M N points, which \nmay be shattered by a network with O(N) weights and O(M) layers. Let {ad, \ni = 1,2, ... ,N denote a set of N parameters, where each ai E [0,1) has an M -bit \nbinary representation ai = E~l 2-jai,j, ai,j E {O, I}, i.e. \nthe M-bit base two \nrepresentation of ai is ai = O.ai,l ai,2 ... ai,M. We will consider inputs in B N X B M, \nwhere BN = {ei : 1 ~ i ~ N}, ei E {O, I}N has i-th bit 1 and all other bits 0, and \nBM is defined similarly. We show how to extract the bits of the ai, so that for \ninput x = (el' ern) the network outputs al,rn. Since there are N M inputs of the \nform (el,ern ), and al,rn can take on all possible 2MN values, the result will follow. \nThere are three stages to the computation of al,rn: (1) computing ai, (2) extracting \nal,k from ai, for every k, and (3) selecting al,rn among the al,ks. \n,UN),(Vt, ... ,VM\u00bb = (el,e rn ). 
Using \nSuppose the network input is x = ((Ul,'\" \none linear unit we can compute E~l Uiai = al. This involves N + 1 parameters \nand one computation unit in one layer. In fact, we only need N parameters, but we \nneed the extra parameter when we show that this linear unit can be replaced by a \nunit with activation function f. \nConsider the parameter Ck = O.al,k ... al,M, that is, Ck = E~k 2k-1-jal,j for k = \n1, ... ,M. Since Ck ~ 1/2 iff al,k = 1, clearly sgn(ck - 1/2) = al,k for all k. Also, \nCl = al and Ck = 2Ck-l - al ,k-l' Thus, consider the recursion \n\nCk = 2Ck-l - al,k-l \nal,k = sgn(ck - 1/2)' \n\nwith initial conditions CI = al and au = sgn(al - 1/2). Clearly, we can compute \nal,l, ... ,al,M-l and C2,' .. ,CM-l in another 2(M - 2) + 1 layers, using 5(M - 2) + 2 \nparameters in 2(M - 2) + 1 computational units. \nWe could compute al,M in the same way, but the following approach gives fewer \nlayers. Set b = sgn (2C M - 1 - al,M - l - E~~I Vi)' If m =1= M then b = O. If m = M \nthen the input vector (VI, ... ,VM) = eM, and thus E~~lvi = 0, implying that \nb = sgn(cM) = sgn(O.al,M) = al,M. \n\n\f196 \n\nP L. Bartlett, V. Maiorov and R. Meir \n\nIn order to conclude the proof, we need to show how the variables al,m may be \nrecovered, depending on the inputs (VI, V2, ... ,VM). We then have al,m = b V \nV';~I(al,i/\\vi). Since for boolean x and y, x/\\y = sgn(x+y-3/2), and V';I Xi = \nsgn(2:,;1 Xi - 1/2), we see that the computation of al,m involves an additional 5M \nparameters in M + 1 computational units, and adds another 2 layers. \n\nIn total, there are 2M layers and 10M + N -7 parameters, and the network shatters \na set of size N M. Clearly, we can add parameters and layers without affecting \nthe function of the network. So for any L, WEN, we can set M = lL/2J and \nN = W + 7 - 10M, which is at least lW/2J provided W :2: 10L - 14. In that case, \nthe VC-dimension is at least l L /2 J l W /2 J . 
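The three stages of this construction can be simulated directly with linear and threshold operations. The sketch below is our own illustration (indices are 0-based, and the final threshold carries an extra bias of -1/4 simply to sidestep the sgn(0) ambiguity; this is harmless, since threshold units have free biases):

```python
def sgn(x):
    """Linear threshold unit: 1 if x >= 0, else 0."""
    return 1 if x >= 0 else 0

def network_output(bits, l, m):
    """Simulate the shattering network on input (e_l, e_m).

    bits[i] lists the M binary digits of the parameter a_i = 0.bits[i][0]...;
    the network should output bits[l][m] (indices 0-based here)."""
    N, M = len(bits), len(bits[0])
    u = [int(i == l) for i in range(N)]          # first input block, e_l
    v = [int(j == m) for j in range(M)]          # second input block, e_m
    a = [sum(b[j] * 2.0 ** -(j + 1) for j in range(M)) for b in bits]
    # Stage 1: one linear unit computes a_l = sum_i u_i a_i.
    c = sum(ui * ai for ui, ai in zip(u, a))
    # Stage 2: recursion c_k = 2 c_{k-1} - a_{l,k-1}, a_{l,k} = sgn(c_k - 1/2),
    # extracting the first M-1 bits.
    out = []
    for _ in range(M - 1):
        bit = sgn(c - 0.5)
        out.append(bit)
        c = 2 * c - bit                          # advance to c_{k+1}
    # Last bit: b = sgn(c_M - sum_{i<M} v_i - 1/4); b = 0 unless m is last.
    b = sgn(c - sum(v[:M - 1]) - 0.25)
    # Stage 3: a_{l,m} = b OR OR_i (out[i] AND v[i]), via threshold units.
    ands = [sgn(out[i] + v[i] - 1.5) for i in range(M - 1)]
    return sgn(sum(ands) + b - 0.5)

# Every bit of every parameter is recovered, so the M*N inputs are shattered.
bits = [[1, 0, 1, 1], [0, 1, 1, 0], [1, 1, 0, 0]]
assert all(network_output(bits, l, m) == bits[l][m]
           for l in range(3) for m in range(4))
```

All intermediate values are dyadic rationals, so floating point simulates the arithmetic exactly; in the actual network, each sgn and each linear map above is one threshold or linear unit, giving the parameter and layer counts tallied in the proof.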
The network just constructed uses linear threshold units and linear units. However, it is easy to show (see [KS97], Theorem 5) that each unit except the output unit can be replaced by a unit with activation function f so that the network still shatters the set of size MN. For linear units, the input and output weights are scaled so that the linear function can be approximated to sufficient accuracy by f in the neighborhood of the point x_0. For linear threshold units, the input weights are scaled so that the behavior of f at infinity accurately approximates a linear threshold function.

References

[ABar] M. Anthony and P. L. Bartlett. Neural Network Learning: Theoretical Foundations. Cambridge University Press, 1999 (to appear).

[BEHW89] A. Blumer, A. Ehrenfeucht, D. Haussler, and M. K. Warmuth. Learnability and the Vapnik-Chervonenkis dimension. J. ACM, 36(4):929-965, 1989.

[BMM98] P. L. Bartlett, V. Maiorov, and R. Meir. Almost linear VC-dimension bounds for piecewise polynomial networks. Neural Computation, 10:2159-2173, 1998.

[GJ95] P. W. Goldberg and M. R. Jerrum. Bounding the VC dimension of concept classes parameterized by real numbers. Machine Learning, 18:131-148, 1995.

[KS97] P. Koiran and E. D. Sontag. Neural networks with quadratic VC dimension. Journal of Computer and System Sciences, 54:190-198, 1997.

[Maa94] W. Maass. Neural nets with superlinear VC-dimension. Neural Computation, 6(5):877-884, 1994.

[MM97] V. Maiorov and R. Meir. On the near optimality of the stochastic approximation of smooth functions by neural networks. Submitted for publication, 1997.

[Sak93] A. Sakurai. Tighter bounds on the VC-dimension of three-layer networks. In World Congress on Neural Networks, volume 3, pages 540-543, Hillsdale, NJ, 1993. Erlbaum.

[Sak99] A. Sakurai. Tight bounds for the VC-dimension of piecewise polynomial networks. In Advances in Neural Information Processing Systems, volume 11. MIT Press, 1999.

[Vap82] V. N. Vapnik. Estimation of Dependences Based on Empirical Data. Springer-Verlag, New York, 1982.

[Vid96] M. Vidyasagar. A Theory of Learning and Generalization. Springer-Verlag, New York, 1996.
", "award": [], "sourceid": 1515, "authors": [{"given_name": "Peter", "family_name": "Bartlett", "institution": null}, {"given_name": "Vitaly", "family_name": "Maiorov", "institution": null}, {"given_name": "Ron", "family_name": "Meir", "institution": null}]}