{"title": "Generalization Error and the Expected Network Complexity", "book": "Advances in Neural Information Processing Systems", "page_first": 367, "page_last": 374, "abstract": null, "full_text": "Generalization Error and the Expected \n\nNetwork Complexity \n\nChuanyi Ji \n\nDept. of Elec., Compt. and Syst. Engr. \n\nRensselaer Polytechnic Institute \n\nTroy, NY 12180-3590 \nchuanyi@ecse.rpi.edu \n\nAbstract \n\nFor two layer networks with n sigmoidal hidden units, the generalization error is shown to be bounded by \n\nO(E(1/K)) + O((E(K) d / N) log N), \n\nwhere d and N are the input dimension and the number of training samples, respectively. E represents the expectation over the random number K of hidden units (1 <= K <= n). The probability Pr(K = k) (1 <= k <= n) is determined by a prior distribution of weights, which corresponds to a Gibbs distribution of a regularizer. This relationship makes it possible to characterize explicitly how a regularization term affects the bias/variance of networks. The bound can be obtained analytically for a large class of commonly used priors. It can also be applied to estimate the expected network complexity E(K) in practice. The result provides a quantitative explanation of how large networks can generalize well. \n\n1 Introduction \n\nRegularization (or weight-decay) methods are widely used in supervised learning by adding a regularization term to an energy function. Although it is well known that such a regularization term effectively reduces network complexity by introducing more bias and less variance[4] to the networks, it is not clear whether and how the information given by a regularization term can be used alone to characterize the effective network complexity, and how the estimated effective network complexity relates to the generalization error. 
This research attempts to provide answers to these questions for two layer feedforward networks with sigmoidal hidden units. \n\nSpecifically, the effective network complexity is characterized by the expected number of hidden units determined by a Gibbs distribution corresponding to a regularization term. The generalization error can then be bounded by the expected network complexity, and is thus tighter than the original bound given by Barron[2]. The new bound shows explicitly, through a bigger approximation error and a smaller estimation error, how a regularization term introduces more bias and less variance to the networks. It therefore provides a quantitative explanation of how a network larger than necessary can also generalize well under certain conditions, which cannot be explained by the existing learning theory[9]. \n\nFor a class of commonly-used regularizers, the expected network complexity can be obtained in a closed form. It is then used to estimate the expected network complexity for a Gaussian mixture model[6]. \n\n2 Background and Previous Results \n\nA relationship has been developed by Barron[2] between generalization error and network complexity for two layer networks used for function approximation. We will briefly describe this result in this section and give our extension subsequently. \n\nConsider a class of two layer networks of fixed architecture with n sigmoidal hidden units and one (linear) output unit. Let f_n(x; w) = sum_{l=1}^{n} w_l^(2) g_l(w_l^(1)T x) be a network function, where w in Theta_n is the network weight vector comprising both w_l^(2) and w_l^(1) for 1 <= l <= n. w_l^(1) and w_l^(2) are the incoming weights to the l-th hidden unit and the weight from the l-th hidden unit to the output, respectively. Theta_n ⊆ R^{n(d+1)} is the weight space for n hidden units (and input dimension d). 
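As a concrete illustration of the network class above, the function f_n(x; w) can be written down in a few lines. This is a minimal sketch, not code from the paper; the array shapes and the helper name two_layer_net are our own choices.

```python
import numpy as np

def two_layer_net(x, W1, w2):
    """Network function f_n(x; w) = sum_l w2[l] * g(W1[l] . x),
    with tanh-type sigmoidal hidden units and a linear output unit.

    x:  input vector of dimension d (components in [-1, 1])
    W1: (n, d) array of incoming weights w_l^(1)
    w2: (n,) array of hidden-to-output weights w_l^(2)
    """
    hidden = np.tanh(W1 @ x)   # g_l(w_l^(1)T x) for each of the n hidden units
    return float(w2 @ hidden)  # linear output unit
```

With n = 2 hidden units and identity incoming weights, the output at x = (1, 0) is simply tanh(1) + tanh(0).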
Each sigmoid unit g_l(z) is assumed to be of tanh type: g_l(z) -> +-1 as z -> +-infinity for 1 <= l <= n.^1 The input is x in D ⊆ R^d. Without loss of generality, D is assumed to be a unit hypercube in R^d, i.e., all the components of x are in [-1, 1]. \n\nLet f(x) be a target function defined in the same domain D and satisfying some smoothness conditions[2]. Consider N training samples independently drawn from some distribution p(x): (x_1, f(x_1)), ..., (x_N, f(x_N)). Define an energy function e, where e = e_1 + lambda L_{n,N}(w). L_{n,N}(w) is a regularization term as a function of w for a fixed n. lambda is a constant. e_1 is a quadratic error function on the N training samples: e_1 = (1/N) sum_{i=1}^{N} (f_n(x_i; w) - f(x_i))^2. Let f_{n,N}(x; w_hat) be the (optimal) network function such that w_hat minimizes the energy function e: w_hat = arg min_{w in Theta_n} e. The generalization error E_g is defined to be the squared L_2 norm E_g = E ||f - f_{n,N}||^2 = E integral_D (f(x) - f_{n,N}(x; w_hat))^2 dp(x), where E is the expectation over all training sets of size N drawn from the same distribution. Thus, the generalization error measures the mean squared distance between the unknown function and the best network function that can be obtained for training sets of size N. \n\n^1 In the previous work by Barron, the sigmoidal hidden units are (g_l(z)+1)/2. It is easy to show that his results are applicable to the class of g_l(z)'s we consider here. \n\nThe generalization error E_g is shown[2] to be bounded as \n\nE_g <= O(R_{n,N}), (1) \n\nwhere R_{n,N}, called the index of resolvability[2], can be expressed as \n\nR_{n,N} = min_{w in Theta_n} { ||f - f_bar_n||^2 + L_{n,N}(w)/N }, (2) \n\nwhere f_bar_n is the clipped f_n(x; w) (see [2]). The index of resolvability can be further bounded as R_{n,N} <= O(1/n) + O((nd/N) log N). 
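The energy function e = e_1 + lambda L_{n,N}(w) just defined can likewise be sketched directly. This is our own illustration: the paper leaves L_{n,N} generic, so the sketch below makes the concrete (assumed) choice of a weight-decay regularizer, L_{n,N}(w) = ||w||^2.

```python
import numpy as np

def energy(W1, w2, X, y, lam):
    """Energy e = e_1 + lambda * L_{n,N}(w), with e_1 the mean squared
    training error over N samples and, as one concrete choice of
    regularizer (an assumption here), L_{n,N}(w) = ||w||^2 (weight decay).

    X: (N, d) training inputs; y: (N,) targets f(x_i); lam: the constant lambda
    """
    preds = np.tanh(X @ W1.T) @ w2            # f_n(x_i; w) for all N samples
    e1 = np.mean((preds - y) ** 2)            # quadratic error e_1
    reg = np.sum(W1 ** 2) + np.sum(w2 ** 2)   # example regularization term
    return float(e1 + lam * reg)
```

Minimizing e over (W1, w2), e.g. by gradient descent, yields the weight vector w_hat that appears in the definition of E_g above.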
Therefore, the generalization error is bounded as \n\nE_g <= O(1/n) + O((nd/N) log N), (3) \n\nwhere O(1/n) and O((nd/N) log N) are the bounds for the approximation error (bias) and the estimation error (variance), respectively. \n\nIn addition, the bound for E_g can be minimized if an additional regularization term L_N(n) is used in the energy function to minimize the number of hidden units, i.e., \n\nE_g <= O(sqrt((d log N)/N)). \n\n3 Open Questions and Motivations \n\nTwo open questions, which cannot be answered by the previous result, are of primary interest in this work. \n\n1) How do large networks generalize? \n\nLarge networks refer to those with a somewhat big ratio W/N, where W and N are the total number of independently modifiable weights (W ~ nd, for n large) and the number of training samples, respectively. Networks trained with regul