{"title": "Generalization Abilities of Cascade Network Architecture", "book": "Advances in Neural Information Processing Systems", "page_first": 188, "page_last": 195, "abstract": null, "full_text": "Generalization Abilities of Cascade Network Architectures \n\nE. Littmann*  H. Ritter \n\nDepartment of Information Science \nBielefeld University \nD-4800 Bielefeld, FRG \n\nlittmann@techfak.uni-bielefeld.de  helge@techfak.uni-bielefeld.de \n\n* to whom correspondence should be sent \n\nAbstract \n\nIn [5], a new incremental cascade network architecture has been presented. This paper discusses the properties of such cascade networks and investigates their generalization abilities under the particular constraint of small data sets. The evaluation is done for cascade networks consisting of local linear maps, using the Mackey-Glass time series prediction task as a benchmark. Our results indicate that, to bring the potential of large networks to bear on the problem of extracting information from small data sets without running the risk of overfitting, deeply cascaded network architectures are more favorable than shallow broad architectures that contain the same number of nodes. \n\n1 Introduction \n\nFor many real-world applications, a major constraint on successful learning from examples is the limited number of examples available. Thus, methods are required that can learn from small data sets. This constraint makes the problem of generalization particularly hard. If the number of adjustable parameters in a network approaches the number of training examples, overfitting occurs and generalization becomes very poor. 
This severely limits the size of networks applicable to a learning task with a small data set. To achieve good generalization in these cases as well, particular attention must be paid to choosing a proper architecture for the network. The better the architecture matches the structure of the problem at hand, the better the chance of achieving good results even with small data sets and small numbers of units. \n\nIn the present paper, we address this issue for the class of so-called Cascade Network Architectures [5, 6] on the basis of an empirical approach, using the Mackey-Glass time series prediction as a benchmark problem. In our experiments we want to exploit the potential of large networks for the problem of extracting information from small data sets without running the risk of overfitting. Our results indicate that it is more favorable to use deeply cascaded network architectures than shallow broad architectures, provided the same number of nodes is used in both cases. The width of each individual layer is essentially determined by the size of the training data set; the cascade depth is then matched to the total number of nodes available. \n\n2 Cascade Architecture \n\nSo far, mainly architectures with few layers containing many units have been considered, while there has been very little research on narrow but deeply cascaded networks. One of the few exceptions is the work of Fahlman [1], who proposed networks trained by the cascade-correlation algorithm. In his original approach, training is strictly feed-forward and the nonlinearity is achieved by incrementally adding perceptron units trained to maximize the covariance with the residual error. \n\n2.1 Construction Algorithm \n\nIn [5] we presented a new incremental cascade network architecture based on error minimization instead of covariance maximization. 
This leads to an architecture that differs significantly from Fahlman's proposal and allows an inversion of the construction process of the network. Thus, at each stage of the construction of the network, all cascaded modules provide an approximation of the target function t(ξ), albeit corresponding to different states of convergence (Fig. 1). \n\nThe algorithm starts with the training of a neural module with output y^(0) to approximate a target function t(ξ), yielding \n\ny^(0)(ξ) = y(x^(0)(ξ); w^(0)) ≈ t(ξ),   (1) \n\nthe superscript (0) indicating the cascade level. After an arbitrary number of training epochs, the weight vector w^(0) becomes \"frozen\". Now we add the output y^(0) of this module as a virtual input unit and train another neural module as new output \n\n[Figure 1: Cascade Network Architecture] \n\nunit y^(1) with \n\ny^(1)(ξ) = y(x^(1)(ξ); w^(1)) ≈ t(ξ),   (2) \n\nwhere x^(1)(ξ) = {x^(0)(ξ), y^(0)(ξ)} denotes the extended input. This procedure can be iterated arbitrarily and generates a network structure as shown in Fig. 1. \n\n2.2 Cascade Modules \n\nThe details and advantages of this approach are discussed in [5, 6]. In particular, this architecture can be applied to any arbitrary nonlinear module. It does not rely on the availability of a procedure for error backpropagation. Therefore, it is also applicable to (and has been extensively tested with) pure feed-forward approaches like simple perceptrons [5] and vector quantization or \"local linear maps\" (\"LLM networks\") [6, 7]. \n\n2.3 Local Linear Maps \n\nLLM networks have been introduced earlier ((Fig. 2); for details, cf. [11, 12]) and are related to the GRBF approach [10] and the self-organizing maps [2, 3, 11]. They consist of N units r = 1, ...
, N, with an input weight vector w_r^(in) ∈ R^L, an output weight vector w_r^(out) ∈ R^M, and an M×L matrix A_r for each unit r. \n\n[Figure 2: LLM Network Architecture] \n\nThe output y^(net) of a single LLM network for an input feature vector x ∈ R^L is \n\ny^(net)(x) = w_s^(out) + A_s (x - w_s^(in)),   (3) \n\nwith the \"winner\" node s determined by the minimality condition \n\n||x - w_s^(in)|| = min_r ||x - w_r^(in)||.   (4) \n\nThis leads to the learning steps for a training sample (x^(α), y^(α)): \n\nΔw_s^(in) = ε_1 (x^(α) - w_s^(in)),   (5) \nΔw_s^(out) = ε_2 (y^(α) - y^(net)(x^(α))) + A_s Δw_s^(in),   (6) \nΔA_s = ε_3 (y^(α) - y^(net)(x^(α))) (x^(α) - w_s^(in))^T / ||x^(α) - w_s^(in)||^2,   (7) \n\napplied for T samples (x^(α), y^(α)), α = 1, 2, ..., T, where 0 < ε_i << 1, i = 1, 2, 3, denote learning step sizes. The additional term in (6), not given in [11, 12], leads to a better decoupling of the effects of (5) and (6, 7). \n\n3 Experiments \n\nIn order to evaluate the generalization performance of this architecture, we consider the problem of time series prediction based on the Mackey-Glass differential equation, for which results of other networks have already been reported in the literature. \n\n[Figure 3: Mackey-Glass function] \n\n3.1 Time Series Prediction \n\nLapedes and Farber [4] introduced the prediction of chaotic time series as a benchmark problem. The data is based on the Mackey-Glass differential equation [8]: \n\ndx(t)/dt = -b x(t) + a x(t - τ) / (1 + x^10(t - τ)).   (8) \n\nWith the parameters a = 0.2, b = 0.1, and τ = 17, this equation produces a chaotic time series with a strange attractor of fractal dimension d ≈ 2.1 (Fig. 3). The input data is a vector x(t) = {x(t), x(t - Δ), x(t - 2Δ), x(t - 3Δ)}^T. The learning task is to predict the value x(t + P). To facilitate comparison, we adopt the standard choice Δ = 6 and P = 85. Results with these parameters have been reported in [4, 9, 13]. 
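The benchmark series defined by eq. (8) is straightforward to generate numerically. The sketch below uses simple Euler integration with a = 0.2, b = 0.1, tau = 17 and builds input/target pairs with Delta = 6 and P = 85 as above; the function names and the integration scheme are our own illustrative choices, not taken from the paper.

```python
# Sketch: generate the Mackey-Glass series by Euler integration of eq. (8)
# with a = 0.2, b = 0.1, tau = 17, sampling one value per time unit.
# Names and the integration scheme are illustrative only.

def mackey_glass(n_samples, a=0.2, b=0.1, tau=17, dt=1.0 / 30, x0=1.2):
    steps_per_unit = int(round(1.0 / dt))
    delay = int(round(tau / dt))        # buffer length for x(t - tau)
    x = [x0] * (delay + 1)              # constant initial history
    series = []
    for step in range(n_samples * steps_per_unit):
        x_t, x_lag = x[-1], x[-1 - delay]
        dx = -b * x_t + a * x_lag / (1.0 + x_lag ** 10)
        x.append(x_t + dt * dx)
        if (step + 1) % steps_per_unit == 0:
            series.append(x[-1])        # keep one sample per time unit
    return series

def make_samples(series, delta=6, horizon=85):
    # input vector {x(t), x(t-delta), x(t-2*delta), x(t-3*delta)},
    # target x(t + horizon), as in Sec. 3.1
    pairs = []
    for t in range(3 * delta, len(series) - horizon):
        inp = [series[t - k * delta] for k in range(4)]
        pairs.append((inp, series[t + horizon]))
    return pairs
```

With Delta = 6 and P = 85, a series of 603 time units yields exactly 500 input/target pairs, i.e. a training set of the size used in the experiments below.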
\n\nThe data was generated by integration with 30 steps per time unit. We performed different numbers of training epochs with samples randomly chosen from training sets consisting of 500 (or 5000, respectively) samples. The performance was measured on an independent test set of 5000 samples. All results are averages over ten runs. The error measure is the normalized root mean square error (NRMSE), i.e., predicting the average value yields an error value of 1. \n\n4 Results and Discussion \n\nThe training of the single LLM networks was performed without extensive parameter tuning. If fine tuning were necessary for each cascade unit, the training would be unattractively expensive. \n\nThe first results were achieved with cascade networks consisting of LLM units after 30 training epochs per layer on a learning set of 500 samples. Figs. 4 and 5 represent the performance of such LLM cascade networks on the independent test set for different numbers of cascaded layers as a function of the number of nodes per layer (\"iso-layer curves\"). \n\n[Figure 4: Iso-Layer-Dependence (NRMSE vs. nodes per layer, for 1 to 5 cascaded layers)]  [Figure 5: Error Landscape] \n\nThe graphs indicate that there is an optimal number N_opt^(1) of nodes for which the performance of the single-layer network has a best value P_opt^(1). Within the single-layer architecture, additional nodes lead to a decrease in performance due to overfitting. This can only be avoided if the training set is enlarged, since N_opt^(1) grows with the number of available training examples. \n\nHowever, Figs. 4 and 5 show that adding more units in the form of an additional, cascaded layer allows performance to be increased significantly beyond P_opt^(1). Similarly, the optimal performance of the resulting two-layer network cannot be improved beyond an optimal value P_opt^(2) by arbitrarily increasing the number of nodes in the two-layer system. However, adding a third cascaded layer again makes it possible to use more nodes to improve performance further, although this time the relative gain is smaller than for the first cascade step. The same situation repeats for larger numbers of cascaded layers. This suggests that the cascade architecture is very suitable for exploiting the computational capabilities of large numbers of nodes for the task of building networks that generalize well from small data sets, without running into the problem of overfitting when many nodes are used. 
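The experiments above rest on the construction algorithm of Sec. 2.1: train a module, freeze its weights, append its output as a virtual input, and repeat. The following sketch illustrates this loop; to keep it short, each module is a degenerate LLM unit with A_r = 0 (pure vector quantization with one constant output per unit), so it shows the cascade mechanics rather than the full LLM modules used in the paper. All names are our own.

```python
import random

# Sketch of the incremental cascade construction (Sec. 2.1), with each
# module a degenerate LLM unit (A_r = 0). Illustrative only.

def dist2(a, b):
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

def train_module(samples, n_units=8, epochs=20, eps_in=0.1, eps_out=0.1):
    # samples: list of (x, y) pairs, x a list of floats, y a float
    rng = random.Random(0)
    units = [[list(x), y] for x, y in rng.sample(samples, n_units)]
    for _ in range(epochs):
        for x, y in samples:
            u = min(units, key=lambda u: dist2(u[0], x))    # winner, eq. (4)
            for i in range(len(u[0])):
                u[0][i] += eps_in * (x[i] - u[0][i])        # eq. (5)
            u[1] += eps_out * (y - u[1])                    # eq. (6), A_s = 0
    def predict(x):
        return min(units, key=lambda u: dist2(u[0], x))[1]  # eq. (3), A_s = 0
    return predict

def train_cascade(samples, n_layers=3, **kw):
    targets = [y for _, y in samples]
    inputs = [list(x) for x, _ in samples]
    frozen = []
    for _ in range(n_layers):
        module = train_module(list(zip(inputs, targets)), **kw)
        frozen.append(module)                       # weights are now frozen
        inputs = [x + [module(x)] for x in inputs]  # output -> virtual input
    def predict(x):
        x = list(x)
        for module in frozen:
            x = x + [module(x)]
        return x[-1]   # output of the topmost cascade layer
    return predict
```

On the Mackey-Glass task, the samples would be the (x(t), x(t + P)) pairs; each additional layer sees the original inputs plus the frozen approximations y^(0), y^(1), ... of all earlier layers.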
\n\nA second way of comparing the benefits of shallow and broad versus narrow and deep architectures is to compare the performance achievable by distributing a fixed number N of nodes over different numbers L of cascaded layers. Fig. 6 shows the result for the same benchmark problem as in Fig. 4, each graph belonging to one of the values N = 40, 60, 120, 240 nodes and representing the NRMSE for distributing the N nodes among L layers of N/L nodes each (rounding to the nearest integer whenever N/L is nonintegral), with L ranging from 1 to 10 layers (\"iso-nodes curves\"). \n\n[Figure 6: Iso-Nodes-Dependence (NRMSE vs. L for N = 40, 60, 120, 240)]  [Figure 7: Nodes-Layer-Dependence] \n\nThe results show that \n\n(i) the optimal number of layers increases monotonically with, and is roughly proportional to, the number of nodes to be used; \n\n(ii) if for each number of nodes the optimal number of layers is used, performance increases monotonically with the number of available nodes, and thus, as a consequence of (i), with the number of cascaded layers. 
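All curves in Figs. 4-7 report the NRMSE defined in Sec. 3.1: the root mean square prediction error normalized by the standard deviation of the targets, so that always predicting the target mean yields an error of exactly 1. A minimal helper (our own code, not from the paper):

```python
import math

# NRMSE: root mean square error divided by the standard deviation of
# the targets; predicting the target mean gives exactly 1.
def nrmse(targets, predictions):
    n = len(targets)
    mean = sum(targets) / n
    var = sum((t - mean) ** 2 for t in targets) / n
    mse = sum((t - p) ** 2 for t, p in zip(targets, predictions)) / n
    return math.sqrt(mse / var)
```

Calling nrmse with predictions that all equal the mean of the targets returns 1.0, and perfect predictions return 0.0, matching the normalization used throughout this section.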
\n\nThese results are not restricted to small data sets only. The application of the cascade algorithm is also useful if larger training sets are available. Fig. 7 represents the performance of LLM cascade networks on the test set after 300 training epochs overall on a learning set consisting of 5000 samples. As could be expected, there is still no sign of overfitting, even using LLM networks with 100 nodes per layer. But regardless of the size of the single LLM unit, network performance is improved by the cascade process, at least in a zone involving a total of some 300 nodes in the whole cascade. \n\n5 Conclusions \n\nSummarizing, we find that Cascade Network Architectures allow the benefits of large numbers of nodes to be used even for small training data sets, while still bypassing the problem of overfitting. To achieve this, the \"width\" of each layer must be matched to the size of the training set. The \"depth\" of the cascade is then determined by the total number of nodes available. \n\nAcknowledgements \n\nThis work was supported by the German Ministry of Research and Technology (BMFT), Grant No. ITN9104AO. Any responsibility for the contents of this publication is with the authors. \n\nReferences \n\n[1] Fahlman, S.E., and Lebiere, C. (1989), \"The Cascade-Correlation Learning Architecture\", in Advances in Neural Information Processing Systems II, ed. D.S. Touretzky, pp. 524-532. \n\n[2] Kohonen, T. (1984), Self-Organization and Associative Memory, Springer Series in Information Sciences 8, Springer, Heidelberg. \n\n[3] Kohonen, T. (1990), \"The Self-Organizing Map\", in Proc. IEEE 78, pp. 1464-1480. \n\n[4] Lapedes, A., and Farber, R. (1987), \"Nonlinear signal processing using neural networks: Prediction and system modeling\", TR LA-UR-87-2662. \n\n[5] Littmann, E., Ritter, H. (1992), \"Cascade Network Architectures\", in Proc. 
\nIntern. Joint Conference on Neural Networks, pp. II/398-404, Baltimore. \n\n[6] Littmann, E., Ritter, H. (1992), \"Cascade LLM Networks\", in Artificial Neural Networks II, eds. I. Aleksander, J. Taylor, pp. 253-257, Elsevier Science Publishers (North-Holland). \n\n[7] Littmann, E., Meyering, A., Ritter, H. (1992), \"Cascaded and Parallel Neural Network Architectures for Machine Vision - A Case Study\", in Proc. 14. DAGM-Symposium 1992, Dresden, ed. S. Fuchs, pp. 81-87, Springer, Heidelberg. \n\n[8] Mackey, M., and Glass, L. (1977), \"Oscillations and chaos in physiological control systems\", in Science, pp. 287-289. \n\n[9] Moody, J., Darken, C. (1988), \"Learning with Localized Receptive Fields\", in Proc. of the 1988 Connectionist Models Summer School, Pittsburgh, pp. 133-143, Morgan Kaufmann Publishers, San Mateo, CA. \n\n[10] Poggio, T., Edelman, S. (1990), \"A network that learns to recognize three-dimensional objects\", in Nature 343, pp. 263-266. \n\n[11] Ritter, H. (1991), \"Learning with the Self-organizing Map\", in Artificial Neural Networks 1, eds. T. Kohonen, K. Makisara, O. Simula, J. Kangas, pp. 357-364, Elsevier Science Publishers (North-Holland). \n\n[12] Ritter, H., Martinetz, T., Schulten, K. (1992), Neural Computation and Self-organizing Maps, Addison-Wesley, Reading, MA. \n\n[13] Walter, J., Ritter, H., Schulten, K. (1990), \"Non-linear prediction with self-organizing maps\", in Proc. Intern. Joint Conference on Neural Networks, San Diego, Vol. 1, pp. 587-592. \n", "award": [], "sourceid": 665, "authors": [{"given_name": "E.", "family_name": "Littmann", "institution": null}, {"given_name": "H.", "family_name": "Ritter", "institution": null}]}