{"title": "Generalization Abilities of Cascade Network Architecture", "book": "Advances in Neural Information Processing Systems", "page_first": 188, "page_last": 195, "abstract": null, "full_text": "Generalization Abilities of \n\nCascade Network Architectures \n\nE. Littmann* \n\nH. Ritter \n\nDepartment of Information Science \n\nDepartment of Information Science \n\nBielefeld University \n\nD-4800 Bielefeld, FRG \n\nBielefeld University \n\nD-4800 Bielefeld, FRG \n\nlittmann@techfak.uni-bielefeld.de \n\nhelge@techfak.uni-bielefeld.de \n\nAbstract \n\nIn [5], a new incremental cascade network architecture has been \npresented. This paper discusses the properties of such cascade \nnetworks and investigates their generalization abilities under the \nparticular constraint of small data sets. The evaluation is done for \ncascade networks consisting of local linear maps using the Mackey(cid:173)\nGlass time series prediction task as a benchmark. Our results in(cid:173)\ndicate that to bring the potential of large networks to bear on the \nproblem of extracting information from small data sets without run(cid:173)\nning the risk of overjitting, deeply cascaded network architectures \nare more favorable than shallow broad architectures that contain \nthe same number of nodes. \n\n1 \n\nIntroduction \n\nFor many real-world applications, a major constraint for the successful learning \nfrom examples is the limited number of examples available. Thus, methods are \nrequired, that can learn from small data sets. This constraint makes the problem \nof generalization particularly hard. If the number of adjustable parameters in a \n\n* to whom correspondence should be sent \n\n188 \n\n\fGeneralization Abilities of Cascade Network Architectures \n\n189 \n\nnetwork approaches the number of training examples, the problem of overfitting oc(cid:173)\ncurs and generalization becomes very poor. 
This severely limits the size of networks applicable to a learning task with a small data set. To achieve good generalization in these cases as well, particular attention must be paid to choosing a proper architecture for the network. The better the architecture matches the structure of the problem at hand, the better the chance of achieving good results even with small data sets and small numbers of units.

In the present paper, we address this issue for the class of so-called cascade network architectures [5, 6] on the basis of an empirical approach, using the Mackey-Glass time series prediction as a benchmark problem. In our experiments we want to bring the potential of large networks to bear on the problem of extracting information from small data sets without running the risk of overfitting. Our results indicate that it is more favorable to use deeply cascaded network architectures than shallow broad architectures, provided the same number of nodes is used in both cases. The width of each individual layer is essentially determined by the size of the training data set. The cascade depth is then matched to the total number of nodes available.

2 Cascade Architecture

So far, mainly architectures with few layers containing many units have been considered, while there has been very little research on narrow but deeply cascaded networks. One of the few exceptions is the work of Fahlman [1], who proposed networks trained by the cascade-correlation algorithm. In his original approach, training is strictly feed-forward and the nonlinearity is achieved by incrementally adding perceptron units trained to maximize the covariance with the residual error.

2.1 Construction Algorithm

In [5] we presented a new incremental cascade network architecture based on error minimization instead of covariance maximization.
This leads to an architecture that differs significantly from Fahlman's proposal and allows an inversion of the construction process of the network. Thus, at each stage of the construction of the network, all cascaded modules provide an approximation of the target function t(ξ), albeit corresponding to different states of convergence (Fig. 1).

The algorithm starts with the training of a neural module with output y^(0) to approximate a target function t(ξ), yielding

    y^(0)(ξ) = y^(0)(x^(0)(ξ); w^(0)) ≈ t(ξ),    (1)

the superscript (0) indicating the cascade level. After an arbitrary number of training epochs, the weight vector w^(0) becomes "frozen". Now we add the output y^(0) of this module as a virtual input unit and train another neural module as new output unit y^(1) with

    y^(1)(ξ) = y^(1)(x^(1)(ξ); w^(1)) ≈ t(ξ),    (2)

where x^(1)(ξ) = {x^(0)(ξ), y^(0)(ξ)} denotes the extended input. This procedure can be iterated arbitrarily and generates a network structure as shown in Fig. 1.

[Figure 1: Cascade Network Architecture. The input layer (bias, x_1, ..., x_n) feeds a first neural module; each further cascade layer is a neural module that receives the original inputs plus the outputs of all preceding modules, and each cascade layer provides an output.]

2.2 Cascade Modules

The details and advantages of this approach are discussed in [5, 6]. In particular, this architecture can be applied to any arbitrary nonlinear module. It does not rely on the availability of a procedure for error backpropagation. Therefore, it is also applicable to (and has been extensively tested with) pure feed-forward approaches like simple perceptrons [5] and vector quantization or "local linear maps" ("LLM networks") [6, 7].

2.3 Local Linear Maps

LLM networks have been introduced earlier ((Fig. 2); for details, cf. [11, 12]) and are related to the GRBF approach [10] and the self-organizing maps [2, 3, 11]. They consist of N units r = 1, ..., N, with an input weight vector w_r^(in) ∈ R^L, an output weight vector w_r^(out) ∈ R^M, and an M×L matrix A_r for each unit r.

[Figure 2: LLM Network Architecture. Each unit maps a region of the input space to the output space by a local linear mapping.]

The output y^(net) of a single LLM network for an input feature vector x ∈ R^L is

    y^(net)(x) = w_s^(out) + A_s (x − w_s^(in)),    (3)

with the "winner" node s determined by the minimality condition

    ‖x − w_s^(in)‖ = min_r ‖x − w_r^(in)‖.    (4)

This leads to the learning steps for a training sample (x(α), y(α)):

    Δw_s^(in)  = ε_1 (x(α) − w_s^(in)),    (5)
    Δw_s^(out) = ε_2 (y(α) − y^(net)) + A_s Δw_s^(in),    (6)
    ΔA_s = ε_3 ‖x(α) − w_s^(in)‖^(−2) (y(α) − y^(net)) (x(α) − w_s^(in))^T,    (7)

applied for T samples (x(α), y(α)), α = 1, 2, ..., T, where 0 < ε_i ≪ 1, i = 1, 2, 3, denote learning step sizes. The additional term in (6), not given in [11, 12], leads to a better decoupling of the effects of (5) and (6, 7).

3 Experiments

In order to evaluate the generalization performance of this architecture, we consider the problem of time series prediction based on the Mackey-Glass differential equation, for which results of other networks have already been reported in the literature.

3.1 Time Series Prediction

Lapedes and Farber [4] introduced the prediction of chaotic time series as a benchmark problem. The data is based on the Mackey-Glass differential equation [8]:

    ẋ(t) = −b x(t) + a x(t − τ) / (1 + x^10(t − τ)).    (8)

With the parameters a = 0.2, b = 0.1, and τ = 17, this equation produces a chaotic time series with a strange attractor of fractal dimension d ≈ 2.1 (Fig. 3).

[Figure 3: Mackey-Glass function x(t) for t = 0, ..., 400; the values oscillate roughly between 0.5 and 1.25.]

The input data is a vector x(t) = {x(t), x(t − Δ), x(t − 2Δ), x(t − 3Δ)}^T. The learning task is defined as predicting the value x(t + P). To facilitate comparison, we adopt the standard choice Δ = 6 and P = 85. Results with these parameters have been reported in [4, 9, 13].
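As an illustration of this benchmark setup (not the authors' original code), the delay equation (8) can be integrated numerically and turned into input/target pairs with Δ = 6 and P = 85. This is a minimal sketch: the function names `mackey_glass` and `make_dataset`, the simple Euler scheme, and the constant initial history are our assumptions, not part of the paper.

```python
import numpy as np

def mackey_glass(n_units, a=0.2, b=0.1, tau=17, steps_per_unit=30, x0=1.2):
    """Integrate the Mackey-Glass delay equation (8) with simple Euler steps.

    Assumption: a constant history x(t) = x0 on [-tau, 0] is used to start
    the delayed integration. Returns one sample per time unit.
    """
    dt = 1.0 / steps_per_unit
    delay = tau * steps_per_unit            # delay expressed in Euler steps
    n = n_units * steps_per_unit
    x = np.empty(n + delay)
    x[:delay + 1] = x0                      # constant initial history
    for i in range(delay, n + delay - 1):
        x_tau = x[i - delay]                # x(t - tau)
        x[i + 1] = x[i] + dt * (-b * x[i] + a * x_tau / (1.0 + x_tau ** 10))
    return x[delay::steps_per_unit]         # subsample to one value per unit

def make_dataset(series, delta=6, horizon=85):
    """Build inputs {x(t), x(t-d), x(t-2d), x(t-3d)} and targets x(t + P)."""
    t0 = 3 * delta                          # first t with a full input vector
    X = np.stack([series[t0 - k * delta: len(series) - horizon - k * delta]
                  for k in range(4)], axis=1)
    y = series[t0 + horizon:]
    return X, y
```

With 30 integration steps per time unit (the resolution reported below), a series of a few hundred time units suffices to draw the 500-sample training sets used in the experiments.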
\n\nThe data was generated by integration with 30 steps per time unit. We performed \ndifferent numbers of training epochs with samples randomly chosen from training \nsets consisting of 500 (5000 resp.) samples. The performance was measured on an \nindependent test set of 5000 samples. All results are averages over ten runs. The \nerror measure is the normalized root mean squ.are error (NRMSE), i.e. predicting \nthe average value yields an error value of 1. \n\n4 Results and Discussion \n\nThe training of the single LLM networks was performed without extensive param(cid:173)\neter tuning. If fine tuning for each cascade unit would be necessary, the training \nwould be unattractively expensive. \n\nThe first results were achieved with cascade networks consisting of LLM units after \n30 training epochs per layer on a learning set of 500 samples. Figs. 4 and 5 represent \nthe performance of such LLM cascade networks on the independent test set for \ndifferent numbers of cascaded layers as a function of the number of nodes per layer \n\n\fGeneralization Abilities of Cascade Network Architectures \n\n193 \n\nNRMSE \n0.5 \n\n0.4 \u00b7\u00b7.. ., .... \u00b7\u00b7\u00b7\u00b7... .\u00b7.,\u00b7.\n\nE3 1 layer B 2 layers B 3 layers \nB 4 layers E3 5 layers \n'~----\"'---\"---\"-'--'--~\"--'---r~\"-- I -_\u00b7\u00b7\\\u00b7_\u00b7\u00b7\u00b7\u00b7--1 \n\u00b7\u00b7\u00b7\u00b7r .. 1\u00b7\u00b7\u00b7\u00b7\u00b7 \nI \n...... L ....\u2022 ~~-~~j \n\u2022\u2022\u2022\u2022\u2022 ~.- _.~\u00b7\u00b7\u00b7\u00b7\u00b7:i \n.. ,J....\n\u2022\u2022 --. \n. \n! \n: \n' : \n! \nO~------------------------------~\u00b7 \n70 \n100 \n10 \n# nodes per layer \n\n0.1 \n\n.................................... . \n\n...... ... \n\n~ \n\u2022 \u2022 \u2022\u2022 \u2022\u2022\u2022\u2022\u2022\u2022\u2022\u2022 ~ \u2022\u2022 \u2022\u2022\u2022\u2022\u2022\u2022\u2022\u2022\u2022\u2022